Big Data Integration


Access Once, Process, Combine and Consume Anywhere

From ingesting and manipulating data to modeling, Pentaho decreases the time and complexity involved in preparing data for analytics. Pentaho weaves big data technologies like Hadoop and NoSQL with relational data stores, warehouses, data marts, and enterprise applications to deliver integrated, analysis-ready data.

Broad, Deep Big Data Ecosystem Support

Pentaho provides native integration with the major vendors and distributions for Big Data, NoSQL, specialized, and analytic data stores through its Adaptive Big Data Layer. You can choose any supported store, and later change that choice, without changing how you access and use your data. This gives you freedom of choice, flexibility, and insulation from risk, along with immediate access to new features and functionality for rapid time to value.

Simple visual tools to improve developer productivity

Pentaho includes a visual extract-transform-load (ETL) tool that loads and processes big data sources in the same familiar way as traditional relational and file-based data sources. Instead of writing Java programs or Pig scripts, less technical developers can design and develop big data jobs using visual tools, which improves team productivity and efficiency. Pentaho works with semi-structured and unstructured data types. For example, it can parse web log and application log files to extract data that reveals customer behavior. In addition, Pentaho’s visual interface can call custom code, for example, to analyze image and video files and extract metadata that identifies people and places. Pentaho also provides visual data modeling capabilities, making it quick and easy to deliver an end-user friendly view of the data source.
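For comparison, here is a minimal sketch of the kind of hand-written log parsing that a visual ETL step replaces. It is not Pentaho code. It assumes the web logs use the Common Log Format, and the names LOG_PATTERN, parse_line, page_hits, and SAMPLE are hypothetical, chosen only for illustration.

```python
import re
from collections import Counter

# Assumption: logs follow the Common Log Format. Real logs may differ.
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)


def parse_line(line):
    """Turn one log line into a dict of fields, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    record["status"] = int(record["status"])
    # A size of "-" means no response body was sent.
    record["size"] = 0 if record["size"] == "-" else int(record["size"])
    return record


def page_hits(lines):
    """Count successful (HTTP 200) requests per path, skipping unparseable lines."""
    hits = Counter()
    for line in lines:
        record = parse_line(line)
        if record is not None and record["status"] == 200:
            hits[record["path"]] += 1
    return hits


# Hypothetical sample input, including one failed request and one junk line.
SAMPLE = [
    '203.0.113.5 - alice [10/Oct/2014:13:55:36 -0700] "GET /products HTTP/1.1" 200 2326',
    '203.0.113.9 - - [10/Oct/2014:13:56:01 -0700] "GET /products HTTP/1.1" 200 2326',
    '203.0.113.9 - - [10/Oct/2014:13:56:09 -0700] "GET /missing HTTP/1.1" 404 -',
    'not a log line',
]
```

Calling page_hits(SAMPLE) counts two successful requests for /products and ignores both the 404 response and the malformed line.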

Visual job orchestration

Pentaho provides a graphical design tool for orchestrating the execution of jobs in Hadoop, NoSQL, and high-performance analytic databases, as well as traditional data stores. Orchestration capabilities include conditional checking steps, event waiting steps, execution steps, and notification steps. These steps can be combined visually to assemble powerful job flow logic that spans multiple jobs and data sources.
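The four step types can be pictured with a small sketch of the control flow they express. This is plain Python, not a Pentaho job definition. The functions wait_for_file and run_job, the file-arrival trigger, and the notify callback are all assumptions made for illustration.

```python
import os
import time


def wait_for_file(path, timeout_s=1.0, poll_s=0.05):
    """Event waiting step: poll until the file exists or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if os.path.exists(path):
            return True
        time.sleep(poll_s)
    return os.path.exists(path)


def run_job(input_path, notify, timeout_s=1.0):
    """Chain the four step types and return the final job status."""
    # Event waiting: do nothing until the input file arrives.
    if not wait_for_file(input_path, timeout_s=timeout_s):
        notify("input never arrived")
        return "timed_out"

    # Conditional check: skip the work when the file is empty.
    if os.path.getsize(input_path) == 0:
        notify("input is empty; skipped")
        return "skipped"

    # Execution: a stand-in for the real processing, here a row count.
    with open(input_path) as handle:
        row_count = sum(1 for _ in handle)

    # Notification: report the outcome to whoever is listening.
    notify(f"processed {row_count} rows")
    return "succeeded"
```

Passing a list's append method as notify, for example run_job(path, messages.append), collects the notifications so the flow can be inspected afterwards.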

Pentaho also integrates with Hadoop-native utilities such as Oozie, an open source workflow and coordination service for managing data processing jobs on Apache Hadoop. This integration matters for companies that have already defined Oozie jobs but want to migrate to a visual, no-programming environment like Pentaho.

Processing data volumes and varieties with speed

Pentaho offers several capabilities for processing massive data volumes within constrained time windows:

  • High performance data flow engine – With a multi-threaded parallel processing architecture and in-memory data caching, Pentaho Data Integration (PDI) provides an enterprise-scalable data integration platform suited to large big data workloads.
  • Cluster support – PDI can be deployed in a cluster, distributing the processing of jobs across multiple nodes.
  • Run as Hadoop MapReduce – Pentaho's small-footprint, Java-based data integration engine can execute as a Hadoop MapReduce job, running on every node of a Hadoop cluster, including clusters with thousands of nodes. Pentaho's support for Hadoop's distributed cache makes deployment of Pentaho across the cluster automatic.
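
To make the MapReduce item concrete, the sketch below shows the map, shuffle, and reduce phases in a single Python process. It does not use Hadoop or PDI and says nothing about how either is implemented. The function names and the space-separated record layout (status code as the last field) are assumptions for illustration.

```python
from collections import defaultdict


def map_phase(record):
    """Map: emit one (key, value) pair per record, keyed by status code."""
    status = record.split()[-1]
    yield status, 1


def shuffle(pairs):
    """Shuffle: group all values that share a key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single total."""
    return key, sum(values)


def run_map_reduce(records):
    """Run the three phases in order and return the totals per key."""
    pairs = (pair for record in records for pair in map_phase(record))
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())
```

On a real cluster the map and reduce phases run on many nodes at once and the shuffle moves data between them; here everything runs sequentially so the data flow is easy to follow.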

Instant and interactive analytics

Pentaho provides immediate access to data in Hadoop, NoSQL, and other big and traditional data stores, with interactive analysis, rich visualization, and data discovery.

Copyright © 2005 - 2014 Pentaho Corporation. All Rights Reserved