Ken Wood

New Plugin! Working Without a Data Source in PDI, and other Cool Tricks

Blog Post created by Ken Wood on Jul 24, 2017



Pentaho Labs Presents







Pentaho Data Integration “Dataset” plugin


By Matt Casters & Ken Wood





Pentaho Labs would like to introduce a new PDI plugin to make your data integration projects easier to develop and maintain. Typically, developers find it easiest to start a PDI transformation from an existing data source, such as database tables or files, because the fields do not have to be mapped by hand from one operation (step) to the next. Without a data source, field mapping is tedious, time consuming, and difficult to maintain. Even Spoon (the PDI graphical IDE) does not remove this problem, which increases development time and adds maintenance complexity. This new PDI plugin can speed up the development of PDI transformations and make them simpler to maintain by introducing the Test Dataset as the test module for development.


The use cases addressed by the PDI Dataset plugin include, but are not limited to:


  • PDI transformation development without access to input data
  • Validation of PDI transformation results
  • Protection of prior PDI transformation investments
  • Troubleshooting and debugging – temporarily disable steps


This new plugin, created by Matt Casters and supported by Labs and the Community, is downloadable through the PDI Marketplace, and is released as an experimental plugin.


The following sections discuss these four use cases. Other use cases could be derived from them or from combinations of them, but we will focus on these four for now.


Developing PDI Transformations without Input Data


In certain scenarios, the PDI transformation developer has to develop a transformation without access to input data, which makes the task difficult at best. Such scenarios might include:


  • The transformation is part of a larger application, like a reusable mapping transformation, a Mapper, Reducer or Combiner in a Map/Reduce transformation on Hadoop, a Single Threading transformation, etc.
  • The data that will eventually be used hasn’t been developed yet and only the specifications exist (fields/columns and some sample data), but the transformation needs to be developed either in advance or simultaneously.
  • The data that needs to be used is hidden behind security measures like firewalls, encryption, authentication, or is just too sensitive to be accessed for development.
  • The data that needs to be used is very slow or impractical to access, for example because it is the result of a very long running process, or because of geographical distance or slow networking.


These are just some of the situations where the input data is not available when development of a PDI transformation starts. In all these scenarios, the PDI transformation developer could benefit from a simple method of defining an input dataset for development purposes, so that the transformation can be developed without access to the eventual data source.
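
To make the idea concrete, here is a minimal sketch in plain Python of what "defining an input dataset" means conceptually: a list of field names plus a few sample rows that stand in for the real source. This is not the plugin's API; the `SampleDataset` class, the `add_full_name` function and the sample data are all hypothetical names made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class SampleDataset:
    """Hypothetical stand-in for a data source: field names plus sample rows."""
    name: str
    field_names: list
    rows: list = field(default_factory=list)

    def read(self):
        # Yield each row as a dict keyed by field name, much like rows
        # arriving from an input step.
        for row in self.rows:
            if len(row) != len(self.field_names):
                raise ValueError(f"row {row!r} does not match {self.field_names!r}")
            yield dict(zip(self.field_names, row))


def add_full_name(rows):
    # The logic under development; it only needs the field layout,
    # not the eventual data source.
    for row in rows:
        yield {**row, "full_name": f"{row['first_name']} {row['last_name']}"}


# Sample data taken from a specification rather than from a live system.
customers = SampleDataset(
    name="customers_sample",
    field_names=["id", "first_name", "last_name"],
    rows=[(1, "Ada", "Lovelace"), (2, "Alan", "Turing")],
)

result = list(add_full_name(customers.read()))
```

Because the transformation logic only depends on the field layout, the sample dataset can later be swapped for the real source without touching the logic itself.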


Validation of PDI Transformation Results


There are many PDI transformation development scenarios where having access to the desired results is more valuable than combing through reams of specification documents. This is easy to see in online forums or emails where developers describe the problems they are trying to overcome: “I have this input data and I need to get to this result…”. Developers can accomplish a lot with just a few examples of input data and the desired result set(s). Thus, it makes sense to give developers a mechanism to validate the output of their PDI transformation and steps against a “Golden Dataset”.


When development starts with a desired output behavior in mind, we call this “Test Driven Development” or TDD. The test in our case is the desired Golden Dataset. The combination of coded modules (PDI transformation in the case of Pentaho) and the validation against the specification (in this case, the Golden Dataset) is called a Unit Test. In our more general use case, we combine a PDI transformation with zero or more datasets for input, and zero or more Golden Datasets to validate against into one Unit Test.
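
As a rough illustration of that definition, the sketch below pairs a piece of transformation logic with one input dataset and one Golden Dataset and reports any differences. It is plain Python, not the plugin's implementation; `uppercase_country`, `diff_against_golden` and the sample rows are hypothetical.

```python
def uppercase_country(rows):
    # The logic under test (stands in for a PDI transformation).
    return [{**row, "country": row["country"].upper()} for row in rows]


def diff_against_golden(actual, golden):
    """Return human-readable differences; an empty list means the test passes."""
    problems = []
    if len(actual) != len(golden):
        problems.append(f"row count: expected {len(golden)}, got {len(actual)}")
    for i, (got, want) in enumerate(zip(actual, golden)):
        if got != want:
            problems.append(f"row {i}: expected {want}, got {got}")
    return problems


# One input dataset and one Golden Dataset make up the unit test.
input_dataset = [{"id": 1, "country": "be"}, {"id": 2, "country": "us"}]
golden_dataset = [{"id": 1, "country": "BE"}, {"id": 2, "country": "US"}]

differences = diff_against_golden(uppercase_country(input_dataset), golden_dataset)
```

Writing the Golden Dataset first and the logic second is what makes this test driven: the expected rows are the specification.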


Protection of PDI Transformation Investments


If you have developed a lot of PDI solutions in an organization, it can be destructive, frustrating and costly to lose sight of the original requirements while fixing new issues that come up. Any change to an existing solution or application can introduce errors, occasionally very subtle ones, that are difficult to track down. Because of this, organizations use unit tests to pin down old requirements over time by running “tests” against them. These organizations run all defined unit tests periodically (every night or after every change to their code) to detect any potential errors as soon as possible, preferably before a new version of a solution or application is released into production.


The solution, in the case of PDI, is to allow all defined unit tests to be executed through a step that outputs which tests failed, which PDI transformation was affected, and other information about the test and the nature of the failure. Doing this allows organizations to protect their sometimes massive data integration investment.
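
The kind of report described above can be sketched as follows. This is a conceptual example in plain Python and makes no claim about how the plugin's step works; the `UnitTest` class, the `run_all` function and the `.ktr` file names are hypothetical, and the second test is made to fail on purpose.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class UnitTest:
    name: str
    transformation: str            # hypothetical file name of the transformation under test
    run: Callable[[list], list]    # stands in for executing the transformation
    input_rows: list
    golden_rows: list


def run_all(tests):
    """Run every test and return one report row per test."""
    report = []
    for t in tests:
        try:
            actual = t.run(t.input_rows)
            passed = actual == t.golden_rows
            detail = "" if passed else f"expected {t.golden_rows}, got {actual}"
        except Exception as exc:
            # A crash counts as a failure and is reported, not raised.
            passed, detail = False, f"error: {exc}"
        report.append({
            "test": t.name,
            "transformation": t.transformation,
            "passed": passed,
            "detail": detail,
        })
    return report


tests = [
    UnitTest("doubles amounts", "double_amount.ktr",
             lambda rows: [r * 2 for r in rows], [1, 2], [2, 4]),
    # Deliberately failing: the logic does not trim, but the golden rows are trimmed.
    UnitTest("trims names", "trim_names.ktr",
             lambda rows: rows, [" a "], ["a"]),
]

report = run_all(tests)
failed = [row for row in report if not row["passed"]]
```

Running such a loop nightly, or after every change, is what turns a set of Golden Datasets into a regression safety net.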


Temporarily Disable Steps


Just like a traditional code developer, sometimes a PDI transformation developer wants to comment out blocks of code, or disable steps in a transformation to observe the effect this change has (or doesn’t have) on a PDI transformation.


The Dataset plugin can address all these use cases while you develop or test your PDI transformations. Once you’ve downloaded and installed this plugin and restarted PDI, you will notice a special symbol on your development canvas in Spoon. This indicates that the Dataset plugin is ready for you to get started. You can reference Matt’s excellent document on how to use this new Dataset plugin here.


Note: This plugin is not supported by Pentaho Support. It is a feature supported by the community and Pentaho Labs.