Pentaho and Hadoop
Visual Development, Data Integration, Immediate Insight
Pentaho Business Analytics provides easy-to-use visual development tools and big data analytics that let users prepare, model, visualize and explore structured and unstructured data sets in Hadoop. Pentaho simplifies the end-to-end Hadoop data life cycle by providing a complete platform, from data preparation to predictive analytics. Pentaho is unique in providing in-Hadoop execution for extremely fast performance.

Visual development for Hadoop data preparation and modeling
Pentaho’s visual development tools drastically reduce the time to design, develop and deploy Hadoop analytics solutions by as much as 15x, compared to traditional custom coding and ETL approaches.
Pentaho provides a powerful visual user interface for ingesting and manipulating data within Hadoop, and makes it easy to enrich Hadoop data with reference data from other sources. Pentaho gives the option of accessing Hadoop data either directly, or through rapid visual extraction into data marts/warehouses optimized for fast response times. A visual tool for defining business metadata models helps developers prepare their data for analytics.
With a simple, point-and-click alternative to writing Hadoop MapReduce programs in Java or scripts in Pig, Pentaho exposes a familiar ETL-style user interface. Hadoop becomes usable by IT staff and data scientists, not just developers with specialized MapReduce and Pig coding skills. Pentaho can also coexist with existing MapReduce and Pig jobs by providing drag & drop graphical components for executing and orchestrating these jobs.
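To make concrete what hand-coding a MapReduce job involves, here is a minimal sketch of the classic word-count pattern. It is a pure Python simulation of the map, shuffle and reduce phases running in a single process, not Hadoop's Java API and not anything Pentaho generates; the function names and sample lines are illustrative assumptions.

```python
from collections import defaultdict


def map_phase(line):
    """Mapper: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group emitted values by key, as the framework would do."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reducer: collapse all values for one key into a single total."""
    return key, sum(values)


def word_count(lines):
    """Run the three phases in sequence over an in-memory list of lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


# Hypothetical input; a real job would read its lines from HDFS.
sample_lines = ["big data in Hadoop", "data in motion"]
counts = word_count(sample_lines)
```

Even in this toy form, the developer has to write the mapper, the grouping and the reducer separately, which is the kind of work a visual, ETL-style interface is meant to replace.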

Figure: "Would you rather do this ... or this?"
Interactive visualization and exploration for Hadoop
Pentaho enables IT to rapidly deploy interactive Hadoop data visualization and exploration capabilities and make them self-service for business users. Business users can build and run enterprise and interactive reports and dashboards, and can interactively visualize and explore Hadoop data across multiple dimensions and measures.
Pentaho’s complete Hadoop data integration and business analytics platform enables IT and business users to easily analyze Hadoop data through:
- Rich visualization – Interactive web-based interfaces for ad hoc reporting, charting and dashboards
- Flexible exploration – Views of data across dimensions such as time, product, and geography, and across measures such as revenue and quantity
- Predictive analytics – Powerful predictive analytics capabilities using advanced statistical algorithms such as classification, regression, clustering and association rules
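As a rough illustration of what exploring data "across dimensions and measures" means, the sketch below sums measures for each combination of dimension values. It is plain Python over a handful of made-up records, not Pentaho's analysis engine; the field names and figures are hypothetical.

```python
from collections import defaultdict

# Hypothetical sales records; field names and values are illustrative only.
records = [
    {"year": 2012, "product": "A", "region": "EMEA", "revenue": 100.0, "quantity": 4},
    {"year": 2012, "product": "B", "region": "EMEA", "revenue": 50.0, "quantity": 1},
    {"year": 2013, "product": "A", "region": "APAC", "revenue": 200.0, "quantity": 8},
    {"year": 2013, "product": "A", "region": "EMEA", "revenue": 25.0, "quantity": 1},
]


def aggregate(rows, dimensions, measures):
    """Sum each measure for every distinct combination of dimension values."""
    totals = defaultdict(lambda: {m: 0 for m in measures})
    for row in rows:
        key = tuple(row[d] for d in dimensions)
        for m in measures:
            totals[key][m] += row[m]
    return {key: dict(value) for key, value in totals.items()}


# Slice the same data along different dimensions.
by_year = aggregate(records, ["year"], ["revenue", "quantity"])
by_product_region = aggregate(records, ["product", "region"], ["revenue"])
```

Changing the `dimensions` list is the programmatic equivalent of dragging a different field onto a report axis: the measures stay the same while the grouping changes.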
Instant and interactive Hadoop analytics for data analysts

Pentaho Instaview takes data analysts from data to visualization in minutes with interactive self-service access and analytics for Hadoop. Preparation of Hadoop data for analysis is greatly simplified and automated, enabling users to accelerate the big data analytics cycle from days and weeks to minutes and hours.
Pentaho Visual MapReduce - scalable in-Hadoop execution
Pentaho’s Java-based data integration engine integrates with the Hadoop distributed cache for automatic deployment as a MapReduce task across every data node in a Hadoop cluster, making use of the massively parallel processing and high availability of Hadoop.

Pentaho can natively connect to Hadoop in the following ways:
- HDFS – input and output directly to the Hadoop Distributed File System
- MapReduce – input and output directly to MapReduce programs
- HBase – input and output directly to HBase, a NoSQL database optimized for use with Hadoop that provides real-time response times
- Hive – a JDBC driver that enables interaction with Hadoop via the Hive Query Language (HQL), a SQL-like query and data definition language (DDL)
Multi-threaded engine for faster execution
The Pentaho Data Integration engine is multi-threaded, with each step in a job executing on one or multiple threads. The multi-core processors on each data node of the cluster are fully used, with no need for specialized multi-threaded programming techniques.
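The idea of steps running concurrently on their own threads can be sketched as a small pipeline in which each step reads rows from an input queue and writes to an output queue. This is a simplified Python analogy, not the Pentaho Data Integration engine itself; it assumes exactly one thread per step, and the names `run_step` and `run_pipeline` are hypothetical.

```python
import queue
import threading

_DONE = object()  # sentinel marking the end of the row stream


def run_step(func, inbox, outbox):
    """Run one step on its own thread: read rows, transform, pass them on."""
    while True:
        row = inbox.get()
        if row is _DONE:
            outbox.put(_DONE)  # tell the next step that the stream has ended
            return
        result = func(row)
        if result is not None:  # None means the step filtered the row out
            outbox.put(result)


def run_pipeline(rows, steps):
    """Connect the steps with queues and stream the rows through them."""
    queues = [queue.Queue() for _ in range(len(steps) + 1)]
    threads = [
        threading.Thread(target=run_step, args=(func, queues[i], queues[i + 1]))
        for i, func in enumerate(steps)
    ]
    for thread in threads:
        thread.start()
    for row in rows:
        queues[0].put(row)
    queues[0].put(_DONE)

    output = []
    while True:
        row = queues[-1].get()
        if row is _DONE:
            break
        output.append(row)
    for thread in threads:
        thread.join()
    return output


# Three hypothetical steps: trim, upper-case (dropping empty rows), decorate.
steps = [
    lambda row: row.strip(),
    lambda row: row.upper() if row else None,
    lambda row: f"{row}!",
]
result = run_pipeline(["  alpha ", " ", "beta"], steps)
```

Because every step works as soon as a row arrives, later steps can process early rows while earlier steps are still reading input. With one thread per step and FIFO queues, row order is preserved; running several threads per step would need extra care about ordering.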
In addition, the Pentaho Data Integration engine executes as a single MapReduce task, instead of the typical multiple tasks resulting from machine-generated or hand-coded MapReduce programs or Pig scripts.
As a result, Pentaho MapReduce jobs typically execute many times faster than machine-generated or custom-coded Hadoop MapReduce jobs or Pig scripts.
The table below compares performing common Hadoop tasks with traditional MapReduce programming skills against using Pentaho’s visual interface for loading, processing and extracting data.

Library of graphical job flow components for easy data orchestration
Data orchestration is made easy via Pentaho’s library of graphical job flow components for execution of jobs across Hadoop and traditional data stores. Key components include conditional checking, event waiting, execution and notification job flow components.

Together these components can be combined to enable visual assembly of powerful job flow logic across multiple jobs and data sources.
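The four kinds of job flow component named above (conditional checking, event waiting, execution and notification) can be pictured as callables wired together in a fixed order. The sketch below is a hypothetical Python analogy of that control flow, not Pentaho's job engine; every name in it is an assumption made for illustration.

```python
def run_job_flow(check_condition, wait_for_event, execute, notify, max_polls=3):
    """Chain a condition check, an event wait, an execution and a notification."""
    # Conditional checking: skip the whole flow if the precondition fails.
    if not check_condition():
        notify("skipped: precondition not met")
        return "skipped"

    # Event waiting: poll a bounded number of times for the trigger.
    # A real flow would sleep between polls; that is omitted here.
    for _ in range(max_polls):
        if wait_for_event():
            break
    else:
        notify("failed: event never arrived")
        return "failed"

    # Execution, followed by a notification of the outcome.
    try:
        result = execute()
    except Exception as exc:
        notify(f"failed: {exc}")
        return "failed"
    notify(f"succeeded: {result}")
    return "succeeded"


# Hypothetical usage: the event arrives on the second poll.
messages = []
polls = iter([False, True])
status = run_job_flow(
    check_condition=lambda: True,
    wait_for_event=lambda: next(polls),
    execute=lambda: "42 rows loaded",
    notify=messages.append,
)
```

In a visual tool each of these callables would be a component on the canvas, and the branching in `run_job_flow` would be drawn as the connections between them.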
Pentaho provides graphical drag & drop components for Hadoop ecosystem projects such as Sqoop and Oozie, drastically reducing the amount of time needed to use these powerful bulk data load and workflow utilities:
- Sqoop – a tool designed for efficiently transferring bulk data between Hadoop and structured datastores such as relational databases
- Oozie – an open source workflow/coordination service that manages data processing jobs for Hadoop
Complete and deep support for Hadoop
Pentaho provides support for all the capabilities of Hadoop:

Pentaho fully supports the leading Hadoop-based distributions and their native capabilities, such as MapR’s high-performance, NFS-mountable file system. Several distributions of Hadoop are available as open source projects and from commercial providers.

