Pentaho

Learn how to combine the power of Pentaho Data Integration and R.

 

This combination can supercharge your machine learning projects.

 

Please visit the link below to learn how to install and enable the R Script plugin in PDI.

 

https://support.pentaho.com/hc/en-us/articles/115001741806-How-to-Install-Enable-R-Script-in-PDI-7-0-EE-and-Windows-10-64-

 

Demo:

 

In 2006, a group of business intelligence professionals became smitten with the technology of a small startup with a big dream. That company, Pentaho, first assembled over a decade ago, developed a business analytics platform that is today unparalleled in the market.

 

That team started a company that became Inquidia Consulting. From inception, our goal has been to help organizations become data-driven. A key focus of our consulting business revolved around Pentaho's commercial open source technologies.

 

As we've joined forces with Pentaho, we're now aligned with an organization that values our technology expertise and will provide opportunities to access broader analytics capabilities and vision. Collaboration with Pentaho will turbocharge efforts to deliver sophisticated data and analytics, especially around IoT and machine learning. Inquidia customers and consultants will benefit from our expanded capabilities and expertise.

 

Inquidia's team will remain true to our principles that managing and analyzing core data assets is our clients’ surest path to creating value. And by joining Pentaho’s team, we know that the whole will be greater than the sum of the parts. With Pentaho, the Inquidia team is poised to deliver on the data vision, helping our staff and customers reach their potential. We’re honored to be part of Pentaho and excited for the future.

 

We're bringing over some of our favorite (and useful) blog posts from the past several years. As we migrate these over, feel free to peruse them, and don't hesitate to reach out for more information!

 

The Future of ETL and ELT

Pentaho Data Integration Best Practices: Lookup Optimization

Harvesting Raw Data into Custom Data API’s with Pentaho Data Integration

Swallowing a mouthful of JSON via the Elephant

Metadata Injection with Pentaho Data Integration

Map Analytics with Javascript and Pentaho

Pentaho Custom Dashboard Development with Javascript

This post was written by Kevin Haas and originally published on Tuesday, July 14, 2015

 

When working with our clients, we find a growing number who regard their customer or transaction data as not just an internally leveraged asset, but one that can enrich their relationships with customers and supporting partners. They need to systematically share data and analytics outside their firewall.

 

One of the ways we help them achieve this goal is with Custom Data APIs. In this article, we'll show you how to build Data APIs using visual programming within the Pentaho Data Integration interface. Along the way, we'll expose capabilities of the platform that are not always seen by the average user.

 

Inspecting The Field: Inside Pentaho Data Integration

 

For those who use Pentaho's Data Integration (PDI aka Kettle) platform regularly, you're probably aware of the amazing things you can do. For the uninitiated, PDI is a very powerful Extract-Transform-Load (ETL) technology that lets you access data from virtually any source, perform complex transformations, and push/load that data to virtually any destination. It's commonly used for integrating and shaping data into new forms optimized for analytical purposes (data warehouses, data marts, etc.).

 

However, Pentaho experts know that Kettle is more than an ETL platform. It is powered by an amazing, open, and flexible data management platform that can not only read and write data from relational databases, Hadoop, NoSQL databases, and Web APIs, but it can also serve data to other applications.

 

Enter one of the most powerful (but maybe underutilized) components of the Pentaho Data Integration Platform: the Data Integration Server. It can power your APIs.

 

Sowing The Seeds: The Data Integration Server

 

If you're an Enterprise Edition Pentaho customer, you're probably aware of the capabilities of the Data Integration Server. It's one of the foundational reasons why customers choose the Enterprise Edition of PDI. The Data Integration Server is a server application that allows you to schedule and run jobs and transformations created with PDI's "Spoon" visual development tool.

 

Community Edition users have a similar (but not as sophisticated) version of the Data Integration Server with the included Carte server. If you're a community edition user, it's easy enough to start up your own local Carte server on a port of your choosing.  You could start Carte on port 8081 like this: 

 

    carte.sh 127.0.0.1 8081 

or

    carte.bat 127.0.0.1 8081 

 

There are many configuration options for the Data Integration Server and Carte server that allow them to operate in many different modes. See here for the current list of options.

 

Bringing It To Life: Calling The Transformation on the Server with ExecuteTrans 

 

The first step to create a Data API is to develop a Pentaho Data Integration transformation that gathers data from virtually any source: RDBMS, Hadoop, NoSQL databases, Web APIs, etc. Since transformations can be parameterized, you can apply filtering, sorting, or any other customizable logic. Again, this can all be developed using PDI's graphical development tool: Spoon.

 

Once you have installed and launched your Data Integration or Carte server, you can execute your transformations with a simple HTTP call using the ExecuteTrans method. For example, if you have Carte running on port 8081 and want to execute a transformation that is stored in /srv/pentaho/sample.ktr and accepts a parameter named "parm", you can simply call: 

    http://127.0.0.1:8081/kettle/executeTrans/?trans=/srv/pentaho/sample.ktr&parm=parmvalue

When you launch this URL within a browser, you won't see a lot of information in return. In fact, you'll get nothing back! But, if you look at the PDI logs on the server, you'll see that the sample.ktr transformation did run. This is all well and good -- you were able to launch a transformation on a server via an HTTP request. But now you want it to return some data!

 

Growing a Transformation into an API: The Output To Servlet Option 

 

To provide an output, the transformation must produce one or more rows of data through a configured output step.  To send rows of data out as a response from the ExecuteTrans method, your transformation must pass this data to a Text File Output (or JSON Output) Step, configuring the step to "Pass output to servlet."

 

[Screenshot: the Text File Output step with "Pass output to servlet" checked]

 

By checking the "Pass output to servlet" box, the transform will take the stream of data entering this step and push it out as your response. Note that the exact data contents of the stream are output as specified by the step's formatting options (e.g. fields, headers, separators). You'll want to configure these options depending on your needs.

 

Bringing it to Market: A Taste of Farmers Market Data In A Custom Data API 

 

Let's look at an example. With harvest season upon us, more and more farmers markets are filling up with their bounty. But where can you find local farmers markets? Thankfully, the USDA provides an API that lets you search for nearby markets.

 

      http://catalog.data.gov/dataset/farmers-markets-search 

 

This is a powerful set of data, but it's not always easy to use the API as provided (or any API, for that matter). In this case, to get a market, you first provide a zip code. The API returns a list of markets with their associated IDs, which you then must use to query the API again in order to get detailed information on each market. Also, we'd like to handle the result data as JSON, but some elements of the API return comma-separated values. Finally, we need to format the returned address a bit better, attaching latitude and longitude for mapping.

 

To begin, we'll create a transform that accepts a parameter called "zip", and does the following: 

1. Generate Rows: Initialize with a row for Pentaho to process input parameters.
2. HTTP Client: Call the USDA API to get the list of markets with the input parameters.
3. JSON Input: Convert the JSON that comes back into a list of markets.
4. JSON Input: Extract the Market IDs from that list of markets.
5. Calculator: Format a URL string to re-query the API to get detailed market info for each market.
6. HTTP Client: Call the USDA API again for each of the markets to get the details.
7. JSON Input: Extract the detailed address info on the market.
8. HTTP Client: Call the Google Geocoding API to get the latitude and longitude.
9. JSON Input: Extract the latitude and longitude from Google's response.
10. Javascript: Clean up and format the Address and Market info from the source APIs into a nice JSON format.
11. Group By: Group all of the detailed Markets from the API into a single row.
12. Javascript: Format the final JSON.
13. Text File Output: Output the single data element as a response via the servlet.

  
 

A lot of steps, for sure, but their configuration is relatively simple. This is what the transformation looks like in the Spoon IDE: 

 

[Screenshot: the farmers market transformation in Spoon]

 

Once we have developed and tested our transformation using Spoon, we are ready to use our Carte server and the ExecuteTrans method.   We open a browser and enter the following URL: 

 

      http://127.0.0.1:8081/kettle/executeTrans/?trans=/mnt/pentaho/pdi-ee/repository/api/farmers-market.ktr&zip=60634 

 

Note that we can even front the Carte or DI server with Apache and do a little URL rewriting to make the URL friendlier to call. We did this. So instead of the executeTrans URL above, we can call this:

 

    http://127.0.0.1:8081/api/farmers-market?zip=60634 

 

When you execute the API, it does all the work for you. It calls all the various child APIs, does all the cleanup and formatting, and returns delicious-looking JSON with all of the added items from the transformation. For example:

 

    {
       market:[
          {
             ami_id:"1010508",
             marketname:"Eli's Cheesecake and Wright College Farmer's Market",
             location:"6701 W. Forest Preserve Drive, Chicago, Illinois, 60634",
             address1:"6701 West Forest Preserve Drive",
             address2:"",
             city:"Chicago",
             state:"IL",
             postal_code:"60634",
             latitude:"41.9588993",
             longitude:"-87.7944428",
             schedule:[
                "06/12/2015 to 09/25/2015 Thu: 7:00 AM-1:00 PM"
             ],
             products:[
                "Baked goods",
                "Crafts and/or woodworking items",
                "Cut flowers",
                "Fresh fruit and vegetables",
                "Honey",
                "Canned or preserved fruits, vegetables, jams, jellies, preserves, dried fruit, etc.",
                "Maple syrup and/or maple products",
                "Poultry",
                "Soap and/or body care products"
             ]
          }
       ]
    }

 

Taking It Home: Building Your Own Data APIs 

 

This is a fun illustration of how to pull data from various sources and craft it together into your very own API. All of this was done with out-of-the-box Pentaho Data Integration capabilities, no extra parts required.

 

Clearly, you're not limited in what you can do with this capability. You can easily get data from virtually any source and wrap a custom data API around it, giving you new capabilities to leverage your existing infrastructure and share data in new ways. Pentaho has a lot of capabilities inside. We know that once you take PDI home from the market and look inside the box, you'll be inspired.

 

If you're interested in seeing how the transformation works, including a look at the transformation that powers the example, just let us know, and we'll be happy to share with you.

This post was written by David Reinke and originally published on Thursday, June 16, 2016

 

This blog is the first in a series featuring some of the practices we use with Pentaho Data Integration.

 

Pentaho Consulting is often asked to performance tune and/or stabilize ETL processes created by developers who are inexperienced with the product. One of the first areas we investigate is the use of data lookups. Looking up values from a database or file for each record in a stream of data is one of the most frequently executed activities in any data program. If improperly tuned, these lookups can make ETL jobs suffer. Examples of lookups abound: looking up a technical product dimension key for a product name, obtaining the most recent order record for a given customer, or gathering all pageview history for a given website visitor.

 

A frequent issue we find is either the misuse or inefficient configuration of these data lookup steps. It's amazing that problems with lookups would be such an issue -- data lookups seem, at first, like a very straightforward exercise. Yet, their optimization arguably provides the biggest “bang for the buck” in performance tuning and code stabilization.

 

We've found the root causes of these issues to be a lack of understanding of the various data lookup options provided by PDI and minimal recognition of regularly occurring lookup patterns.   In this blog, I'll compare the most commonly used lookup steps and their core functionality.  In future blogs, I'll review design patterns applying these characteristics to what we implement as part of our consulting projects.

 

Most Common Pentaho Data Integration Lookup Steps

There are many steps categorized as “Lookup” within the PDI environment. Let's focus on the three most frequently used steps: Database Lookup, Stream Lookup and Database Join. All of these steps assume an “inbound stream” of records that are then compared record by record with some “lookup source” to produce 0 or more “outbound stream” records with one or more “looked up” fields appended.

 

The following table summarizes the functionality and behavior of these steps.

 

| | Database Lookup | Stream Lookup | Database Join |
| --- | --- | --- | --- |
| Lookup Source | Single Database Table | PDI Stream | Database Query |
| # of Records Looked Up | 1 | 1 | 0 or more |
| Lookup Operators | Typical SQL Operators | Equality Only | Any valid SQL |
| Caching of Lookup Data | 0 to All Rows, Many Config Options | Always All Rows | None |
| No-Match Lookup Behavior | Can filter the inbound record or insert a null or default value in the lookup fields | Can insert a null or default value in the lookup fields | Can return null lookup fields or act as a filter on the inbound stream |
| Multiple-Match Lookup Behavior | Can generate a failure or match on the first record returned from the generated lookup query (the lookup query can have an ORDER BY clause applied) | Looks up the last matching record in the lookup stream | Returns all matched records, potentially causing the inbound stream record to be replicated for each matched lookup record |

 

As you can see, there is quite a bit of overlapping functionality. You can often use any one of the three to accomplish the same lookup task. However, there are both striking and nuanced differences which provide significant opportunity for advanced functionality and impactful performance tuning. This is where the use of patterns can aid the insightful developer -- applying rules of thumb and best practice algorithms.
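To make the contrast concrete, here is a rough sketch of the flavor of SQL the two database-backed steps end up issuing per inbound record (illustrative only; the table and column names are hypothetical, and the generated SQL will differ by step configuration):

    -- Database Lookup: a single-table lookup using simple SQL operators,
    -- returning one matching row per inbound record (subject to caching).
    SELECT product_key
    FROM dim_product
    WHERE product_name = ?;

    -- Database Join: any valid SQL, parameterized per inbound record, and
    -- free to return zero or more rows (e.g. a customer's order history).
    SELECT order_id, order_date, order_total
    FROM orders
    WHERE customer_id = ?
    ORDER BY order_date DESC;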

 

In future blogs, I’ll bring to bear common patterns that we have uncovered through our years of PDI implementation work. I’ll start by examining the most frequently occurring pattern: Key-Based, Single Record Lookup. Visit our website in the next week or so as we publish the rest of this series.

This post was written by Bryan Senseman and originally published on Tuesday, November 29, 2016

 

In today's diverse data environments, it seems like JSON data is everywhere. Often, working with JSON data means using one of the popular NoSQL databases like MongoDB or Cassandra. These are easy enough to install, but most organizations use SQL databases for analytics. This raises the question: Is it possible to use SQL-based tools to do analytics on a JSON dataset? Well, the short answer is YES!...with a smattering of technical gymnastics and a recent version of PostgreSQL.

 

The PostgreSQL community has been steadily adding JSON functionality over the last couple of major releases. While a simple web search provides plenty of how-to instructions for storing JSON data in PostgreSQL, few articles actually exist that tell you how to use it. This blog will help you understand how to not only capture, but to analyze JSON data within PostgreSQL.

 

Finding Some JSON Data

The first thing I had to do was find a firehose of JSON that would work. While I was forming the idea for this blog, I was 100% focused on a riveting World Series between the Cleveland Indians and Chicago Cubs. It dawned on me that I could use a Twitter search (stealing liberally from a previous Pentaho Consulting Labs exercise) of World Series tweets to gather my JSON data.

 

I designed a basic Postgres table to store the tweets. It contains 2 columns: an id and the tweet. Note that the tweet is a JSON datum.
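A minimal sketch of such a table (the table and column names here are assumptions; the original screenshot with the exact DDL is not reproduced):

    -- Two columns: a surrogate id and the raw tweet stored as JSONB.
    CREATE TABLE tweets (
        id    BIGINT PRIMARY KEY,
        tweet JSONB NOT NULL
    );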

 

 

I used Pentaho Data Integration (PDI) to consume the Twitter "data firehose" and persist it into my new Postgres tweets table. Since my company, Pentaho Consulting, is one of the premier Pentaho consultancies on the planet, and you are reading this on our website, I'll assume the majority of you know at least a little about Pentaho. (If not, please explore our myriad blog posts to learn more about this wonderful platform.)

 

Trick #1:  Writing JSON Data to Postgres using Views and Rules

Pentaho has a multitude of output capabilities, some specifically for JSON stores; Mongo and Cassandra, just to name two. But, if you have ever tried to write to a JSON(B) datatype via PDI, you know it doesn't work…or does it?

Trying to write directly to this table using the traditional Table Output step doesn't work, as PDI does not understand the JSONB data type.

 

 

Remember those technical gymnastics I mentioned earlier? It’s time to get limber now.

 

There are a couple of approaches to solve this. I chose to create a view which does an explicit type cast to text, which PDI can read with ease.

 

 

But, how then can I write data through a view? Well, the trick is to use the PostgreSQL rule system!  You can present a view to PDI that has plain vanilla text fields. When PDI inserts to this view, Postgres handles the conversion and writes directly to the underlying table.

 

So, now we can use a simple Table Output step in PDI to insert the JSON configured per the diagram below.
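Here is a minimal sketch of the view-plus-rule trick, continuing the hypothetical tweets table above (the object names are assumptions, not the exact code from the original post):

    -- Expose the JSONB column as plain text so PDI can read and write it.
    CREATE VIEW tweets_text AS
    SELECT id, tweet::text AS tweet
    FROM tweets;

    -- Rewrite inserts against the view into inserts on the base table,
    -- casting the text back to JSONB on the way in.
    CREATE RULE tweets_text_insert AS
    ON INSERT TO tweets_text
    DO INSTEAD
    INSERT INTO tweets (id, tweet)
    VALUES (NEW.id, NEW.tweet::jsonb);

The PDI Table Output step then points at tweets_text rather than the base table.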

 

Trick #2:  Using PDI with a Little Java Magic to Obtain Historical Tweets

 

I would be remiss if I didn't point out how, in the past, we've used a PDI extension point, the User Defined Java Class step, to consume tweets via the Twitter API. Since I was working my little experiment after the World Series actually began, I needed to use the Search API to look backward in time. So, a little Java magic and we have a "step" that calls the Twitter Search API, outputting rows to its result stream.

 

 

Since this API is documented to be non-exhaustive, we took a brute force approach and called it repeatedly, using the Transform Executor step, and ignored primary key violations at the Table Output step using error handling, since the search will return duplicates from its prior invocations.

This resulted in 150,000+ tweets across games 6 & 7. Admittedly, I turned it off after the Cubs victory as I am from Cleveland and have been an avid Indians fan since birth. So, I didn’t get nearly the volume of tweets I expected, and the general vitriol and “spam” made the dataset NSFW. But, the concept is sound and I got enough data to do my experiment.

Trick #3:  Using Functions and Views to Unpack JSON for Ubiquitous SQL access

 

So, we now have a PostgreSQL repository of tweets that we’ve consumed via the Twitter API using Pentaho Data Integration. How do we interrogate/mine that information for meaning?  I did not want to write an analytics tool from scratch and most of the packaged OLAP tools on the market are SQL based.  I had to do some more technical gymnastics, but I found a way to make the open source Mondrian OLAP engine work.

 

My first challenge was figuring out how to expose elements within the JSON field such that they were queryable via the SQL an OLAP tool generates. The answer: PostgreSQL functions. There are multitudes of functions available for use on JSON(B) fields. After some trial and error, I found a combination of operators and functions that produced a dataset that I could work with. This is by no means the best combination, but it does work for my quick POC analyses.

 

Having already used Pentaho to populate the data, my first choice was to see if I could create a Mondrian schema, using custom SQL code to expose JSON elements as columns, for use in dimensions and measures. I could and it was relatively easy.

 

Using the Expression capabilities within Mondrian schemas, I created SQL snippets that extract various elements from the JSON-encoded tweet and present them as separate items. I created the schema using Pentaho's Schema Workbench tool. The [Tweet Country] field is created by retrieving the place object from within the tweet and then retrieving the country element as text.
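The SQL snippet behind such an expression looks roughly like this (a sketch only; it assumes the tweet is exposed to Mondrian as a text column named tweet):

    -- Pull the place object out of the tweet, then its country element as text.
    (tweet::jsonb -> 'place') ->> 'country'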

 

 

When presented via Pentaho Analyzer, this allows for simple slicing and dicing of JSON data!

 

 

Here is the log, just to prove we are analyzing the information encapsulated within the tweet JSON.

 

 

Extracting single elements from JSON proved to be a simple task. However, JSON allows for much more complex data structures, including arrays, specifically the Hashtags and User Mentions. Using the same technique above proved to be problematic. I was able to create a dimension that exploded the array contents into individual values.

 

 

But, it only worked under ideal conditions, and could not be used within a SQL WHERE clause -- basically invalidating its use within Mondrian, as eventually a user will drill on a value and generate SQL that includes a WHERE component.

This led me down the path of creating a view in PostgreSQL that would bundle all the necessary JSON functions together and present a “ROLAPy” SQL representation to Mondrian. Although not as “sexy” as using the Mondrian schema to parse out the JSON elements, the view approach did circumvent all the issues encountered with the JSON arrays.
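A condensed sketch of what such a view might look like (the actual view from the post is not shown here; the column list, names, and function choices are assumptions):

    -- Flatten each tweet into one row per hashtag, exposing commonly used
    -- elements as ordinary columns for Mondrian or any other SQL tool.
    CREATE VIEW v_tweet_facts AS
    SELECT
        t.id,
        t.tweet ->> 'text'                   AS tweet_text,
        t.tweet -> 'user'  ->> 'screen_name' AS screen_name,
        t.tweet -> 'place' ->> 'country'     AS tweet_country,
        h.hashtag ->> 'text'                 AS hashtag
    FROM tweets t
    LEFT JOIN LATERAL
        jsonb_array_elements(t.tweet -> 'entities' -> 'hashtags') AS h(hashtag) ON TRUE;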

 

 

With minor modifications to our existing cube, we now reference the view and can act upon each field as if it were a table column.

 

 

The upside to doing the conversion of the JSON within the database (instead of the Mondrian schema) is that it can be used by any SQL-based analytics tool. I attached Qlik Sense to the view and was able to analyze JSON data with Qlik!

 

 

And it doesn't stop there! Tableau 10.1 recently announced support for JSON files, so you can assume JSON within a database is on the horizon. Not surprisingly, we can use the same techniques, either using the view or Custom SQL in the connector, to make data available within Tableau. I really like how simple it is to do geographic analysis, and the following image definitely shows that baseball is an American game but has avid followers in Latin America and Asia.

 

But wait...there is a small problem with our approach

 

Those of you familiar with RDBMS optimizers will realize there is a flaw in the view approach. The view is fully resolved for every request, even when the Hashtags or User Mentions are not needed for the analytic query.

But, there are tweaks we can make to our view definition and physical structures to improve the situation. We’ll investigate these and the potential performance benefit of using MonetDB in a future post.

This post was written by Matt Casper and originally published on Monday, January 23, 2017

 

In November of 2010, Pentaho Data Integration (PDI) v4.1 introduced the concept of Metadata Injection.  The promise of creating generalized ETL transformations whose behavior can be changed at run-time became a reality.  However, this initial breakthrough was limited to just 4 steps.   Today, with the recent release of PDI 7.0, there are 74 supported steps, making Metadata Injection more powerful (and applicable) than ever before!

 

With Metadata Injection, developer agility and productivity accelerates.  Instead of creating and maintaining dozens of transformations built with a common pattern, developers define a single transformation template and change its runtime behavior by gathering and injecting metadata from property files or database tables.  At Pentaho Consulting, we’ve seen code lines shrink by as much as 50:1 through the use of Metadata Injection.

 

When is Metadata Injection Useful?

There is often a lot of confusion about the proper and best use of new technologies.  Metadata Injection is no different. Over the past couple years, we have observed many situations which benefited from Metadata Injection and others where its use added unnecessary complexity.  Invariably, our assessments generated questions as to why and when Metadata Injection should be used. So, we decided to update our best practices to codify our thinking.

 

We took stock of our development experiences and identified the following traits to indicate Metadata Injection’s use:

  1. There is a repeatable processing pattern,
  2. Whose variations are finite and declarative,
  3. And is required for a large number of transformations,
  4. That often have dynamic source(s) and/or target(s).

 

As we continue to innovate, we’d expect this list to evolve, but we think it is a great starting point.   Let’s elaborate on these traits through the use of two examples.

 

Imagine you are tasked with creating an export process. Each export will be sourced using a SQL SELECT statement. The queried records will be placed into a delimited file with a specific name and directory. It's clear that the process has a pattern. (#1) It simply issues a select statement and writes the rows out to an export file. The variations between each export are well-defined. They consist of a distinct SELECT statement, list of fields with formatting and other metadata, record delimiter, field delimiter, field enclosure, and filename/directory location. None of the variations involve optional or new processing steps; rather, they include declarative values used to configure existing steps. (#2) If you only need to do this for one or two exports, you might as well create one distinct transformation per export, but you have requirements for 20 exports and you know your users will be demanding more over time. In fact, you expect some of the exports to be retired while others are added. In short, you have volume. (#3) Finally, since each export has a distinct source query and target file, you have dynamism. (#4)

 

Another common scenario involves loading a set of staging tables with delta data from source tables. Staging tables are often loaded without much transformation, with each load process following the same pattern. A query is issued to get changed data from the source table, with the results inserted into the staging table. The process is patterned and repetitive. (#1) Variation is limited to metadata about the source and target tables. (#2) Most staging processes deal with dozens if not scores of tables, so there is usually volume. (#3) The source and target tables vary. (#4)

 

Admittedly, these are straightforward transformations that are easy to describe in a blog.  Rest assured that there are more complex patterns to be found.   For example, we’ve implemented big data ingestion processes that consume varying input feeds into standardized, yet customizable JSON or Avro files as well as complex change data capture patterns that require determination of logical deletes and target record versioning.   Regardless of how simple or complex, all of our experiences (to date) share the four traits outlined above.

 

A Metadata Injection Example

Enough of the generalized concepts, let's look at some code! We'll expand on the staging load example outlined above. Here is our repetitive processing pattern for each staging table:

  1. A SQL SELECT is used to extract source data,
  2. Which is written to a delimited text file,
  3. That is bulk loaded into a target staging table.

 

Below is a screenshot of the PDI job for the overall process:  j_run_staging_jobs.

 

[Screenshot: the j_run_staging_jobs job]

It starts by truncating all of the target staging tables via the t_truncate_tables transformation. (label 1; details not shown) This transformation obtains a list of staging table names, uses a Concat Fields step to construct a “TRUNCATE TABLE table_name;” SQL statement, then passes this statement to an Execute SQL Statements step.  Although this transform does not technically use Metadata Injection, it is utilizing metadata and PDI’s ability to dynamically construct and execute SQL statements.
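A rough SQL equivalent of what that transformation assembles (illustrative only; the post builds the statements with PDI steps, and the staging schema name is an assumption):

    -- One TRUNCATE statement per staging table, each passed to Execute SQL Statements.
    SELECT 'TRUNCATE TABLE staging.' + table_name + ';' AS truncate_stmt
    FROM information_schema.tables
    WHERE table_schema = 'staging'
      AND table_type = 'BASE TABLE';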

 

After truncating, we use Metadata Injection within a processing loop to load each staging table. (label 2) The t_get_table_names transform generates metadata for each staging table to load. The j_load_staging_table subjob executes for each staging table, using Metadata Injection to extract source data, generate a file, and bulk load it into the target staging table.

 

A couple of assumptions for this example:

  • The source and target databases are both SQL Server. (The Metadata Injection technology is not restricted to sourcing and targeting SQL Server tables; this same pattern could be adapted to source and target just about any JDBC-compliant database that has a bulk load step.)
  • The source and target table structures are identical. (We did this for simplicity's sake. We can still use this same pattern if there is a simple mapping between fields or extra audit fields to be added.)

 

The following diagram depicts the transform (t_get_table_names) which acquires initial metadata for each staging table.

 

[Screenshot: the t_get_table_names transformation]

The transformation is summarized as numbered:

  1. Query information_schema.columns to produce the metadata required for each table.
  2. Remove excess fields from the stream.
  3. Use a Group By step to generate a concatenated, comma delimited list of column names for each table.
  4. Copy each row to the result buffer for processing by the subsequent j_load_staging_table job.  (We’ll be using Metadata Injection within this job!)

 

The output rows for the “crtr: tables_columns” step (#4) will look something like:

 

| table_name | field_list |
| --- | --- |
| table_1 | column_1, column_2, column_3, column_4, column_5 |
| table_2 | column_1, column_2, column_3, column_4 |
| table_3 | column_1, column_2, column_3 |
| table_... | column_... |

 

You can see the metadata forming as we will use these fields to generate a SQL extract statement like SELECT ${FIELD_LIST} FROM ${TABLE_NAME}.
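For reference, here is a rough SQL equivalent of what the first three steps above produce (a sketch only; the post builds this with PDI steps, and STRING_AGG assumes SQL Server 2017 or later):

    -- One row per staging table with its comma-delimited column list.
    SELECT
        c.table_name,
        STRING_AGG(c.column_name, ', ') WITHIN GROUP (ORDER BY c.ordinal_position) AS field_list
    FROM information_schema.columns c
    WHERE c.table_schema = 'staging'
    GROUP BY c.table_name
    ORDER BY c.table_name;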

 

Next, let’s examine the j_load_staging_table job which executes once per row returned from the t_get_table_names result set.  (i.e. one execution per table.)   The following diagram depicts and numbers the steps within this job.

 

[Screenshot: the j_load_staging_table job]

 

  1. Write an entry to the PDI file log denoting the parameters being used, i.e. the table and its column list.
  2. Use Metadata Injection to extract data from our source and generate a delimited file for bulk loading.
  3. Double-check that the data file was produced.
  4. Invoke a SQL Server bulk load process to populate the target table.
  5. Delete the generated data file, freeing up space for the next iteration.

 

The t_staging_metainject transformation (pictured below) obtains metadata for the table being processed and injects it into the t_staging_table_load transformation template.  

 

[Screenshot: the t_staging_metainject transformation]

 

The metadata to be injected is:

  • Source SELECT statement
  • Target (staging) table name
  • Target (staging) column metadata

 

The source SELECT statement can be constructed from the table_name and field_list input parameters. In our example, the target table will be named the same as the source. The target column metadata is obtained via the "ti: table_col_data_types" Table Input step, which uses the table_name input parameter to pull metadata from the staging database catalog.
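A sketch of the kind of catalog query that Table Input step might run (the exact query is not shown in the post; the column list and parameter syntax are assumptions):

    -- Column-level metadata for the staging table currently being processed.
    SELECT
        table_schema     AS target_table_schema,
        table_name       AS target_table_name,
        column_name,
        data_type,
        character_maximum_length,
        numeric_precision,
        numeric_scale,
        ordinal_position
    FROM information_schema.columns
    WHERE table_schema = 'staging'
      AND table_name   = '${TABLE_NAME}'   -- PDI variable substitution (assumed parameter name)
    ORDER BY ordinal_position;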

 

Once obtained, the metadata is passed into the “emi: t_staging_table_load” Metadata Injection step.   The metadata will look something like:

 

| target_table_schema | target_table_name | column_name | data_type | format | length | precision | ordinal_position | sql_stmnt |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| staging | customer | customer_id | Integer | 0 | <null> | <null> | 1 | SELECT customer_id, city, state, number_field1 FROM dbo.customer |
| staging | customer | city | String | <null> | <null> | <null> | 2 | SELECT customer_id, city, state, number_field1 FROM dbo.customer |
| staging | customer | state | String | <null> | <null> | <null> | 3 | SELECT customer_id, city, state, number_field1 FROM dbo.customer |
| staging | customer | number_fld1 | Number | 0 | 18 | 2 | 4 | SELECT customer_id, city, state, number_field1 FROM dbo.customer |
| staging | customer | create_date | Date | yyyy-mm-dd | <null> | <null> | 5 | SELECT customer_id, city, state, number_field1 FROM dbo.customer |

 

Before explaining how the Metadata Injection step is configured, let’s review the t_staging_table_load template transformation.   This is called a template because its steps are not completely configured, i.e. the steps will have their metadata injected at runtime.    The template transformation is pictured below.

 

[Screenshot: the t_staging_table_load template transformation]

 

This template is straightforward.  We will inject the source SELECT into the Table Input step (table_name_select) and the column metadata into the Text File Output step (bulk_load_file).   Note that we obtain the target record create_date using a Get System Info step.  There is no metadata injected into this step, but we do include metadata about the create_date column in the Text File Output step.

 

In the Inject Metadata tab of the “emi: t_staging_table_load” step, we’ll configure the Table Input step to use the SQL SELECT statement (sql_stmnt).  (Note that while sql_stmnt is repeated for every record in the input metadata, it is used only once and will be executed only once.)   We’ll also configure the Text File Output step with metadata using the table_name as the filename and the column metadata for the fields. The column metadata includes column name, format, length, precision, and data type.

 

Voila!  We’re done.  All we need to do next is test and debug.   Alas, debugging metadata injection jobs with their dynamic runtime content is not easy.    Fortunately, PDI provides a helpful feature in the Metadata Injection step’s Option tab:

 

Setting this option will save the runtime transformation file (ktr file) that is generated after the metadata has been injected. If your tests don't produce the desired results, you can open the generated ktr(s) in Spoon and run them individually to find the bug -- a huge timesaver!

 

Stepping back, imagine that you have 100 staging tables to load.   Before metadata injection, you’d build the process for one table and then pay an army of programmers to copy and paste the remaining 99.   (Not to mention the maintenance costs!)   With Metadata Injection, you are loading 100 tables using just 2 jobs and 4 transformations. BOOM!

 

When Not to Use Metadata Injection

There’s always a flipside.   Let’s next consider when Metadata Injection’s use would not be appropriate.    We have found four indicators that deem a set of processes incompatible with Metadata Injection:

  1. Business rules vary between process instances
  2. Substantial and complex transformation of the data
  3. Slowly changing dimension table loads
  4. Fact table loads

 

We suspect that there may be other indicators, but as with our traits for adoption, we need to start with what we have found and adjust as we learn more.

 

The first indicator implies that we cannot define a template transformation that uses the same set of processing steps for each process instance.   We need the variation to be declarative, meaning we can inject a finite set of attributes into well-defined, templated steps.  If the processing requires different steps or step sequences, it will be hard, if not impossible, to implement using Metadata Injection.

 

Determining whether a transformation pattern is too complex for Metadata Injection is often driven by the availability of injectable steps.   One can imagine a repeatable pattern with declarative variations that is complex.   That said, there are steps in PDI that may not be compatible with Metadata Injection.  Before embarking on a Metadata Injection based design, confirm that all of the steps in your process template are supported.   The complete list of steps that work with Metadata Injection can be found on the Pentaho help site here.

 

As of the current PDI release, the Dimension Lookup/Update step (used for maintaining slowly changing dimensions) does not support Metadata Injection. But even if it did, it's unlikely that you will find repeatable patterns between dimensions. For example, a transformation for loading dim_customer will likely have vast differences from one loading dim_product.

 

Considering fact tables, if a transformation template can accommodate the loading of more than one fact table, then there is a likely argument that the fact table design is flawed. Fact tables typically have varying dimensional grain and metric columns that are distinctly computed by the dimensional grain. These features cause uniqueness of processing that is not suitable for Metadata Injection.

 

The Future of Metadata Injection

As PDI continues to grow, we can be certain that more steps will support Metadata Injection and the opportunity to apply its magic will grow.  But, heed the warning given to Peter Parker as he became Spiderman: “with great power comes great responsibility”.  Knowing when to apply this capability is just as important as knowing how to use it.

 

With our experience working with PDI across many different industries and domains, you can be assured that Pentaho Consulting knows when and how to apply the dynamism and might of Metadata Injection. If you'd like more info on Metadata Injection and the rest of our data engineering services, and how we can help you find success, drop us a note. We'd be happy to share our expertise to ensure your success.

This post was written by Greg Graham and originally published on Friday, May 19, 2017

 

We've been seeing an increase in customers looking for cost-effective ways to include analytical mapping solutions in their analytical applications. To do this within the Pentaho Business Analysis ecosystem, we can develop custom dashboards with the latest Javascript charting libraries, including LeafletJS (also used by Tableau). These custom dashboards leverage Pentaho's robust data services using Community Data Access (CDA) to provide fully interactive and integrated mapping and charting solutions right in your web browser or device.

 

The following demonstration walks through the use of LeafletJS and JQuery DataTables to create interactive displays of financial and test score data from School Districts in Illinois.  You can also see the demo on Inquidia's site here.

[Screenshot: School Districts in Illinois 2014 map and data table]

 

The data were gathered from the Illinois State Board of Education and the US Census Bureau and joined using Pentaho Data Integration. GeoJSON files containing map boundaries of Illinois elementary, secondary, and unified school districts were obtained from the US Census Bureau TIGER/Line system. Selections in the data table are highlighted in the map, and vice versa.

 

 

 

Additionally, we’ve produced a user-driven charting laboratory in another view that allows the user to make scatterplots of the same data selected in the Data Table on axes chosen by the user.
[Screenshot: the user-driven charting laboratory scatterplot view]

 

This demo presents just a taste of the integrated mapping, analytics, and visualization techniques that we've developed through the years. Integrating them with a mature web BI platform such as Pentaho enables us to deliver cost-effective, geographical analytical applications to our clients that provide a highly tailored user experience with a robust and scalable data platform. Drop us a line today to see how we can help you meet your mapping analytics challenges!

 

See the demo here!

 

This post was written by Greg Graham and originally published on Thursday, April 13, 2017

Many of our customers are looking for a sound, cost-effective platform to provide rich, embedded visualizations in their web applications. To do this, there are a number of options that we commonly turn to. The Pentaho Business Analysis Server platform comes with many excellent tools for custom dashboard development (Analysis Reports/CDE/CDF). Pentaho allows developers to create highly customized dashboards using a relatively easy to use framework. But sometimes, Pentaho’s “out of the box” visualizations aren’t exactly what you’re looking for...


In these cases, where more visual flexibility is required, we develop custom dashboards with the latest Javascript charting libraries.  These dashboards leverage Pentaho’s robust data services using Community Data Access (CDA) to provide mobile friendly interactive charting solutions right to your web browser or device.


The following demonstration walks through the use of ChartJS and DimpleJS to create interactive displays of Pentaho's "Steel Wheels" sample data. The user can navigate and drill down by clicking on chart elements. We utilize Bootstrap for mobile-friendly resizing and the dynamic appearance of ergonomic controls on smaller screen sizes.

 

 

Additionally, we’ve produced map based displays with colorful, interactive Choropleth maps of Steel Wheels data like these examples using Plotly.js and Highcharts.

 


These demos present just a few of the many free and "paid-for" Javascript charting, mapping, and visualization libraries that we've worked with through the years. Integrating them with a mature web BI platform such as Pentaho enables us to deliver analytical applications to our clients that provide a highly tailored user experience with a robust and scalable data platform. Drop us a line today to see how we can help you meet your web and mobile analytics challenges!

This post was written by Steve Miller and originally published on Wednesday, August 2, 2017

 

A few weeks ago, I participated in a conference call with a customer to discuss their new initiative in predictive analytics. We’ve already been helping them build a web-based analytic app for about 9 months. The partnership has been quite successful, and with analytics, once the stakeholders start to see their data and statistics, they often say “aha, this is great, now I’d like to see….” Or in this case, “now that I’ve seen some of the drilldowns, I’d like to produce go-forward predictive models…” The discussion was wide-ranging, spirited, and productive.

 

The analytics in the app today are primarily exploratory, with slice and dice/drilldowns from important business dimensions such as product, geography, and vendor into financial targets like billings, revenue, and profit. The aim is to identify those dimension categories that are business leaders and laggards. In analytics parlance, the drilldown dimensions are independent variables or features, while the financial targets are the outputs or dependent variables.

 

Going forward, the mandate is to be able to predict the performance with regression or machine learning models.

I had my statistical hat on early in the call, translating “dimensions” into categorical features, and “billings/profit” into numeric regression dependent variables. In concept, the work would be pretty straightforward, producing models that looked something like: profit = catvar1 + catvar2 + …..

 

After the call I flashed back to grad school, where we spent a lot of time working with precisely the same regression model challenge of continuous target variables against categorical or qualitative features. The tricky part was transforming the categorical variables so they made sense in the regression setting. What we ended up doing was coding “dummy variables” of 0’s and 1’s to represent qualitative variable categories. We then regressed the output variables of interest against those dummy variables. Back then, it was quite tedious to set up; today, thankfully, such “model matrices” are generated automatically by statistical software regression functions.

 

I also remembered a class assignment we were given to “prove” that the predictions derived from a regression of a numeric dependent on a qualitative feature were in fact equivalent to the means of the dependent summarized across levels of the categorical. In other words, in this simple case, the results of the predictive analytics would be identical to the exploratory.
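One way to see why this holds, using full dummy coding without an intercept (a quick sketch, not the original assignment's notation): regressing $y$ on indicator variables for a single categorical feature with levels $j = 1, \dots, k$ gives

\[
y_i = \sum_{j=1}^{k} \beta_j \,\mathbf{1}\{z_i = j\} + \varepsilon_i ,
\qquad
\sum_i \bigl(y_i - \beta_{z_i}\bigr)^2 = \sum_{j=1}^{k} \sum_{i :\, z_i = j} \bigl(y_i - \beta_j\bigr)^2 .
\]

The least-squares objective separates by level, so each $\hat{\beta}_j = \bar{y}_j$, the mean of $y$ within level $j$, and the regression's prediction for any observation in level $j$ is exactly that group mean.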

 

So I just had to dig up a data set to test the conjecture. Lo and behold, I resurrected a blog from a few months back that used the Ames, Iowa housing data set, which perfectly captured the need: a continuous dependent variable (house sale price) with scores of mostly categorical feature variables. So I loaded the data into R and ran a few models to "test" the exploratory = prediction conjecture.

 

What follows are the analyses. I’ll first explore the relationship between selected categorical predictors and house sale price visually. I’ll then compare house sale price means by those features with the unique predictions derived from regression models to determine if memory serves.

 

First, set a few options, load some packages, and read the data into an R data.table named housing. Compute several new variables, changing character vars to factors and dates as appropriate. Also, create binned variables for grlivearea (ground living area size).

 

 

Define a frequencies function and examine bathrcde, quantgrlivearea, and the 2-way combination. Note that while the distributions of bathrcde and quantgrlivearea are individually reasonable, the 2-way “crosstab” is quite sparse, since living area size and the number of bathrooms are correlated. This is a common problem with survey data, reflecting collinearity among features.

First look at the density violin plot of log10saleprice by bathrcde, ordering by median log10saleprice left to right. Not surprisingly, median sale prices increase with the number of bathrooms. From the x-axis, note that there are 955 1-bathroom homes with a median sale price of $127,500, compared to 48 homes with 4 or more bathrooms and a median sale price of $294,859. bathrcde seems to be highly related to log10saleprice.

Next look at log10saleprice by quantgrlivarea. Not surprisingly again, median sale price increases pretty dramatically with the percentile-binned living area size.

Finally, consider the two-way, faceted violin plot of log10saleprice by bathrcde and quantgrlivarea, which affirms both the positive relationships between # bathrooms, floor space and sale price, and cross-feature sparsity.

Now contrast the log10saleprice means broken down by bathrcde in the top table with the regression predictions of log10saleprice on bathrcde in table 2, cell below. With the regression table, only the distinct predictions are cited. Confirming the math, the means by group are identical to the regression predictions.

Ditto for log10saleprice by quantgrlivarea.

And finally, the exploratory means summarized by both bathrcde and quantgrlivarea are identical to the predictions derived from the two-way regression of log10saleprice on bathrcde and quantgrlivarea plus the interaction term.

The takeaways from this simple exercise? First, I haven’t as yet completely lost my memory. Also, sometimes it may be prudent not to obsess on the distinction between exploratory and predictive analytics, which in certain situations produce similar results. Rather, a focus on the design and execution of asking/answering business analytic questions with pertinent visual and predictive tools might be optimal.

Kevin Haas

The Future of ETL and ELT

Posted by Kevin Haas Sep 18, 2017

This post was written by Chris Deptula and originally published on Tuesday, August 8, 2017

 

The data engineering industry has been through heavy disruption over the last five years. The increasing adoption of Hadoop, NoSQL databases, columnar analytic databases, and Cloud platforms has changed data integration strategy. One change resulting from these new platforms has been a move from a more traditional ETL (Extract, Transform, Load) design to an ELT (Extract, Load, Transform) design.

 

What is ELT?

ELT and ETL are both similar in that they are high level strategies to move data from one system to another. The difference is where data transformation processing occurs. In ETL the transformation is done in an external application such as Pentaho, Talend, or Informatica with the results loaded into the target. ELT does the transformation in the target system. The data is first loaded into the target system exactly as it looks in the source and then scripts are run inside the target platform to perform the necessary transformations.

 

ELT is not a new concept. It has been around for years and is a common approach. However, in many situations, it introduces difficult challenges that make ETL a more attractive option. Some of these challenges include:

  • ELT increases load on the target system during data loads. ETL isolates the processing power required to transform the data outside the target database. ELT puts all of this load on the target database, perhaps causing performance conflicts with active analytic queries.
  • ELT processes usually require scripting. For databases this means writing SQL queries. For Hadoop this means writing Spark, Pig, Hive, or MapReduce. These scripts can be very complex and require significant expertise to maintain. Think 100+ line SQL queries.
  • Although the Transform is done by scripts inside the target platform, a solution is still required for Extract, Load, and overall job orchestration. This generally requires custom scripts or the use of traditional ETL tools such as Pentaho, Talend, or Informatica. (And if you’re using a traditional ETL tool for Extract and Load, the temptation to leverage its Transformation capabilities grows!)

 

Why ELT?

Performance. Data engineers are always under pressure to load the data into the target as quickly as possible. The faster the data can be loaded, the sooner the data can be acted upon --  improving the speed of business.

 

Performance differences between ELT and ETL when using classic relational databases such as Oracle, SQL Server, MySQL, and Postgres have been negligible. However, the introduction of Hadoop and analytic databases has tipped the scale, making ELT orders of magnitude faster than ETL when using these platforms.

 

Hadoop solved the system performance problem by massively scaling out processing. The performance benefits of MapReduce and Spark moving the processing to the data and the volume of data processed in Hadoop make ELT the obvious choice. Analytic databases, on the other hand, have significant performance degradation when you use ETL processing. These databases are very slow when processing a multitude of record-level updates and inserts. In some cases, a single record insert can take seconds. On the other hand, analytic databases are very good at set-based operations requiring users to bulk load data into the database, and then execute SQL scripts to manipulate (i.e. Transform) the required records in batch.
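As a concrete (and hedged) illustration of the ELT pattern on an analytic database, the load and transform phases look roughly like this; the table names and staging layout are assumptions, and the bulk-load command varies by platform:

    -- 1. Load: bulk copy the raw extract into a staging table as-is
    --    (PostgreSQL syntax shown; Redshift, Snowflake, and BigQuery each
    --    have their own bulk-load commands).
    COPY stage_orders FROM '/data/extracts/orders_20170801.csv' WITH (FORMAT csv, HEADER true);

    -- 2. Transform: set-based SQL inside the target database, instead of
    --    row-by-row processing in an external ETL engine.
    INSERT INTO fact_orders (order_id, customer_key, order_date, order_total)
    SELECT s.order_id,
           d.customer_key,
           CAST(s.order_date  AS DATE),
           CAST(s.order_total AS NUMERIC(18, 2))
    FROM stage_orders s
    JOIN dim_customer d ON d.customer_id = s.customer_id;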

 

ELT is faster for these types of new systems, but as discussed, it still requires manual writing of scripts, limiting the audience that is able to maintain it. Pentaho recognizes this problem and sees a couple of new products on the market that help solve it.

 

ELT Technology Options

For the Cloud-based analytic databases Amazon Redshift, Snowflake, and Google BigQuery, Pentaho has experimented with Matillion, a Cloud-based visual ELT environment. There is no software license fee for Matillion; you simply pay a small upcharge on an EC2 instance and pay for Matillion by the hour. Matillion uses the same concepts as traditional ETL tools, with orchestration jobs and transformation steps that read data from the database and transform the data in a visual development environment. The difference between traditional ETL and Matillion is that Matillion compiles the transformations into SQL that runs in the database rather than externally. Users familiar with traditional ETL tools will find the Matillion development interface familiar.

 

For ELT on Hadoop, we at Pentaho find our adaptive execution layer to be a powerful solution. This layer allows users to run ordinary Pentaho transformations as Spark jobs on a Cloudera Hadoop cluster. Existing Pentaho users can take advantage of parallel processing on Hadoop using Spark with no additional training. The transformations are written exactly the same way Pentaho developers are used to writing them, maintaining the same level of maintainability as traditional ETL transformations. We expect the adaptive execution layer to continue to evolve into the preferred way of developing ETL. Write once, run anywhere.

 

The Future of ETL and ELT

It’s clear that new data management solutions are upsetting the traditional ways of transporting and organizing data. With every release, traditional ETL providers are adding capabilities to support evolving data management environments. And new vendors are also rapidly entering the market with purpose built tools to solve these new challenges. As these new data management solutions mature, the purpose built tools will continue to alter the traditional ETL and ELT landscape. With all of these changes, expect to see best practices for data management evolve along with them.  Pentaho can help you stay up to date with the latest data management solutions and designs.

Ivy Tech Community College is a sprawling network of nearly 120 campuses and satellite locations spanning the state of Indiana. It’s the largest singly accredited community college in the country, with nearly 175,000 students each year. A system that big generates a lot of data – about 100 million new records per day. Ivy Tech is using self-service analytics to bust bottlenecks and get information into the hands of the people who need it.

Ivy Tech defies the common wisdom that educational institutions are technology laggards. In 2015 it became only the ninth higher education institution in the country to appoint a chief data officer and the first community college ever to do so. They focused on creating a ‘data democracy’ that would empower data-informed decision-making, accelerate institutional processes and speed innovation.


The results have been powerful. In one example, they uncovered 2,800 certificates and degrees that students had earned but didn’t know about. Because the college receives funding from the state for each student who graduates, those unclaimed credentials added up to $5.6 million in funding, which more than paid for the entire analytics project.

Learn more here.

Blogger Bio

Jonathan Shafer is Senior Product Customer Marketing Manager at Pentaho, focused on customer marketing and advocacy. He enjoys collaborating with customers, providing them an opportunity to share their story and supporting their success. Prior to joining Pentaho, Jonathan held various roles at organizations such as Informatica and Oracle.