1 2 3 4 Previous Next

53 posts

# Internationalization and localization of Pentaho Report Designer

Posted by Joana Carvalho Mar 13, 2018

Introduction

Pentaho Report Designer can be internationalized and localized. To accomplish that, translation files used by the report need to have a specific format and their names need to obey certain rules.

How to configure translation files?

By default, each report that you create contains an empty fall-back/default translation file: translations.properties, editable. You can access it by clicking on File Tab, then Resources option and finally selecting translations.properties and selecting Edit option.

The other translation files can be created or imported by accessing File Tab/Resources option. These files need to contain key-value pairs, separated by an equal symbol (=):

• dashboardTitle = Internationalization dashboard

Additionally, translation files need to have the following format name:

• translations_en.properties
• translations_ja.properties

• translations_en_US.properties

Note: Special characters such as Japanese characters or "Ç" for instance or even accents need to be Ascii encoded (Native2ascii Online), otherwise the characters are not displayed correctly in the report.

Resource-labels are the only elements that perform translations in PRD. You need to set its value property with a key, defined in the translations files. For instance:

• value property = dashboardTitle

value property is under resource-label Atributes Panel.

How is the hierarchical structure in the translations.properties ?

Translations or messages can be created in multiple translation files. Whenever that happens, keep in mind the following hierarchy:

+ translations.properties

++ translations_en.properties

++++ translations_en_US.properties

++++ translations_en_UK.properties

In conclusion, translations_<language>_<COUNTRY>.properties overrides  translations_<language>.properties file, which overrides translations.properties file.

Pentaho Report Designer Example

Imagine that you want to create a PRD for English and Japanese languages. You need to configure/create

translations.properties, translations_en.properties and translations_ja.properties with the keys and values that you want to translate and add resource-labels with the keys that you just defined as values properties.  Then, to test the translation functionality, you need to go to File Tab, click on Configuration option and click and set ~.environment.designtime.Locale with the language code that you want to see: ja or en to see a Japanese or English translation, respectively.

In case you want to publish this PRD in the repository, PUC, and test the translation functionality,  you need to click on View Tab, choose Languages option and then select English or Japanese language. Afterwards, open the PRD report or reload it if it was already opened and see that the report shows the translation for the language chosen in PUC.

Note: In the event of having to translate two reports that share some translation, you only want to maintain one file for each language instead of several. So, you should import the translation file from your file system in File Tab/Resources option.  You can edit the imported translation file by clicking on edit option from File Tab/Resources option, or by editing the translation file in your file system. In these situations there are some considerations your should take into consideration. Whenever you edit the translation file inside the File Tab/Resources option you will need to export it. Otherwise, the translation file in your file system won't have the changes that you have performed. The same way, if you change the translation file in your file system, you will need to remove the file that you have imported to File Tab/Resources option and imported it again.

This example was tested in Pentaho 8.0.

# Internationalization and localization of CDE dashboards

Posted by Joana Carvalho Mar 13, 2018

Introduction

Dashboards can be internationalized and localized. Several files need to be created in order to perform the dashboard translation, all with .properties extension and their location needs to be the same as the dashboard.

How to configure?

As mentioned before, some files need to be created and configured properly:

• en_US=English
• ja=\u65E5\u672C\u8A9E

en_US and ja correspond to the language and country codes and English and \u65E5\u672C\u8A9E are the languages names.

Note: Special characters such as Japanese characters or "Ç" for instance or even accents need to be Ascii encoded (Native2ascii Online), otherwise the characters are not displayed correctly in the dashboard.

• The translation file name should have this format
• messages_<language>.properties - language corresponds to the language code, specified in the  messages_supported_languages.properties. File name examples:

• messages_en.properties
• messages_ja.properties

• messages_<language>_<COUNTRY>.properties - COUNTRY corresponds to the country code, uppercase code ISO ALPHA-2. Whenever, you add this type of file, do not forget to add <language>_<COUNTRY> to messages_supported_languages.properties file. Otherwise, the  configured translation file will never be used.

• messages_en_US.properties

The latter files should contain their translation information organised as key value pairs, separated by an equal symbol (=):

• dashboardTitle = Internationalization dashboard

• messages.properties - is the translation fallback file, the default translation, in case of something not being configured well in translation functionality. The file should have the same structure as described for messages_<language>_<COUNTRY>.properties and messages_<language>.properties.

How is messages.properties structure hierarchical?

Translations or messages can be created in multiple translation files. Whenever that happens, keep in mind the following hierarchy:

+ messages.properties

++ messages_en.properties

++++ messages_en_US.properties

++++ messages_en_UK.properties

In conclusion, messages_<language>_<COUNTRY>.properties overrides messages_<language>.properties file, which overrides messages.properties file.

CDE dashboard Example

Imagine that you want to build a CDE dashboard with only a title for English and Japanese languages. You construct the CDE layout and then you add the Text Component element. In the Expression of the latter, place the following code that calls a CDF API- prop function from i18nSupport, that uses the translation file according to the PUC language:

function f(){

return this.dashboard.i18nSupport.prop('dashboardTitle');

}

Additionally, message.properties and message_en.properties should have:

• dashboardTitle = Internationalization dashboard

and  message_ja.properties should have:

• dashboardTitle = \u56fd\u969b\u5316\u30c0\u30c3\u30b7\u30e5\u30dc\u30fc\u30c9

When you want to test the translation configuration, you need to go to View Tab, choose Languages option and then select English or Japanese language. After that, you need to reload the dashboard, if it was already opened, or open the dashboard and you will see that the translation showed respects the PUC language chosen.

Note: This example was tested in Pentaho 8.0.

Using a language that it is not available in PUC

To add new languages, you need to instal them from the marketplace. Please follow these instructions: pentahoLanguagePacks/README.md at master · webdetails/pentahoLanguagePacks · GitHub

After your language pack is installed, please add the language code to messages_supported_languages.properties file and create the translations file using the format name previously specified.

# Highlights from Pentaho User Meeting 2018, Frankfurt

Posted by Ruth H Mar 7, 2018

Hi everyone,

with 100 participants from Austria, Germany and Switzerland, Pentaho User Meeting has been a great success. I´m happy to share with you the live blog covering all presentations and speakers:

• Migrating from Business Objects to Pentaho (CERN, Gabriele Thiede)
• Pentaho 8 (Pedro Alves)
• Best Practices for Data Integration Architectures (Matt Casters)
• Operating Pentaho at Scale (Jens Bleuel)
• Running Pentaho in Kubernetes (Nis Christian Carstensen, Netfonds)
• Data handling with Pentaho (Marco Menzel, Hansainvest)
• IoT and Predictive Analytics (Jonathan Doering, Hitachi Vantara)
• Adding Pentaho Dashboards to Angular 5 applications (Francesco Corti, Alfresco)
• Predictive Analytics with PDI and R (Dr. David James, it-novum)
• Integrating and analyzing SAP data with SAP/Pentaho Connector (Stefan Müller, it-novum)
• Analyzing IT service management data with openLighthouse (Dirk Rönsch, it-novum)

See the full agenda at

Thanks to all who have contributed to this event!

# PMI Installation, Developer Guide and Sample/Demos

Posted by Ken Wood Mar 6, 2018

There are currently 3 Installation Guides to accompany the Plug-In Machine Intelligence (PMI) plug-in and one Developers Guide. Also, the demonstration transformations and sample datasets are available. These sample transformations and sample datasets are for demonstration and educational purposes. They are downloadable at the following,

Description
PMI_Installation_Linux.pdfInstallation guide for the Linux OS platform.
PMI_Installation_Windows.pdfInstallation guide for the Windows OS platform.
PMI_Installation_Mac_OSX.pdfInstallation guide for Mac OS X platform.
PMI_Developer_Docs.pdfA developer's guide to extending and contributing to the PMI framework.
PMI_MLChampionChallengeSamples.zipThis zip file contains all of the sample transformations, sample folder layouts and datasets for running the Machine Learning demonstrations and the Machine Learning Model Management samples. This is for demonstration and educational purposes.

Posted by Ken Wood Mar 6, 2018

Introducing Plug-in Machine Intelligence

by Mark Hall and Ken Wood

Today, the need to swiftly operationalize machine learning based solutions to meet the challenges of businesses is more pressing than ever. The ability to create, deploy and scale a company’s business logic to quickly take advantage of opportunities or react to changes is exceeding the capabilities of people and legacy thinking. Better and more machine learning is vital going forward but, more importantly, easier machine learning is essential. Leveraging an organization’s existing staff levels, business domain knowledge, and skillsets by lowering the entry into the realm of data science can dramatically expand business opportunities.

Everytime I am in PMI, I am seeing more and more of its value!!! Great stuff!!!”

Carl Speshock - Hitachi Vantara Product Manager, Hitachi Vantara Analytics Group

The world of Machine Learning is empowering an ever-increasing breadth of applications and services from IoT to Healthcare to Manufacturing to Energy to Telecom, and everything in between. Yet the skills gap between business domain knowledge and the analytic tools used to solve these challenges needs to be bridged. People are doing their part through education, training and experimentation in order to become data scientists, but that’s only half of the equation. Making the analytic tools easier to use can help bridge this gap quickly. Throw in the ability to access and blend different data sources, cleanse, format and engineer features into these datasets, and you have a unique and powerful tool. In fact, the combination of PDI and PMI is an evolution of the PDI tool suite for deeper analytics and data integration capabilities

"While exploring solutions with a major healthcare provider that was using predictive
analytics to reduce the costs and negative patient care incurred from
complications
from surgery
and using
the Scikit-Learn library required
2 weeks of coding and prototyping to perform
just the
machine
learning model selection and training. With PDI and PMI, I was able to
prep
the data, engineer in the features and train the models
in about 3 hours. And, I could
include other machine learning engines from R and Weka and evaluate the results. The
combination of PDI and PMI makes machine learning solutions easier to use and maintain."

Dave Huh - Data Scientist - Hitachi Vantara Analytics Services

Hitachi Vantara Labs is excited to introduce a new PDI capability, Plug-in Machine Intelligence (PMI) to the PDI Marketplace. PMI is a series of steps for Pentaho Data Integration (PDI) that provides direct access to various supervised machine learning algorithms as full PDI steps that can be designed directly into your PDI data flow transformations. Users can download the PMI plugin from the Hitachi Vantara Marketplace or directly from the Marketplace feature in PDI (automatic download and install). Installation Guides for your platform, the Developer's Document, and the sample transformation, and datasets are available here. The motivation for PMI is:

1. To make machine learning easier to use by combining it with our data integration tool as a suite of easy to consume steps, and ensuring these steps guide the developer through its usage. These supervised machine learning steps work “out-of-the-box” by applying a number of “under-the-cover” pre-processing operations and algorithm specific "last-mile data prep" to the incoming dataset. Default settings work well for many applications, and advanced settings are still available for the power user and data scientist.
2. To combine machine learning and data integration together in one tool/platform. This powerful coupling between machine learning and data integration allows the PMI steps to receive row data as seamlessly as any other step in PDI. AND! No more jumping between multiple tools with inconsistent data passing methods or, complex and tedious performance evaluation manipulation.
3. To be extensible. PMI provides access to 12 supervised Classifiers and Regressors “out-of-the-box”. The majority of these 12 algorithms are available in each of the four underlying execution engines that PMI currently implements: WEKA, python scikit-learn, R MLR and Spark MLlib. New algorithms and execution engines can be easily added to the PMI framework with its dynamic PDI step generation feature.

PMI also incorporates revamped versions of the Weka steps that have been originally part of the Pentaho Data Science pack. Essentially, this could be looked at as version 2 of the Data Science Pack. These include:

• PMI Forecasting for deploying time-series models learned in WEKA’s time series forecasting environment.
• PMI Scoring for deploying trained supervised and unsupervised ML models. This includes new features to support evaluation/monitoring of existing supervised models on fresh data (when class labels are available).
• PMI Flow Executor for executing arbitrary WEKA Knowledge Flow processes. This revamped step supports WEKA’s new Knowledge Flow execution engine and UI.

PMI tightly integrates, into the PDI “Data Mining” category, four popular machine learning “engines” via their machine learning libraries. These four engines are, Weka, Python, R and Spark. This first phase of PMI incorporates the supervised machine learning algorithms from these four engines from their associates machine learning libraries - Weka, Scikit-Learn, MLR and MLlib, respectively. Not all of the engines support all of the same algorithms evenly. Essentially, there are 12 new PMI algorithms added to the Data Mining category that executes across the four different engines;

1. Decision Tree Classifier – Weka, Python, Spark & R
2. Decision Tree Regressor – Weka, Python, Spark & R
3. Gradient Boosted Trees – Weka, Python, Spark & R
4. Linear Regression – Weka, Python, Spark & R
5. Logistic Regression – Weka, Python, Spark & R
6. Naive Bayes – Weka, Python, Spark & R
7. Naive Bayes Multinomial – Weka, Python & Spark
8. Random Forest Classifier – Weka, Python, Spark & R
9. Random Forest Regressor – Weka, Python & Spark
10. 1Support Vector Classifier – Weka, Python, Spark & R
11. Support Vector Regressor – Weka, Python, & R
12. Naive Bayes Incremental – Weka

As such, eventually the existing “Weka Scoring” step will be deprecated and replaced with the new “PMI Scoring” step. This step can consume (and evaluate for model management monitoring processes) any model produced by PMI, regardless of which underlying engine is employed.

I know what you’re thinking, “why implement machine learning across four engines?”. Good question. Believe it or not, data scientists are picky and set in their ways, and… not all engines and algorithms perform (think accuracy and speed) the same or yield the same accuracies for any given dataset. Many analysts, data scientists, data engineers and others that look to these tools to solve their challenges, tend to use their favorite tool/engine. With PMI, you can compare up to four different engines and up to 12 different algorithms against each other to determine the best fit for your requirement.

What Happens When Data Patterns Change?

An important benefit to PMI is the evaluation metrics used to measure accuracy is uniform and unified. Since all steps are now built into the same PMI framework - unified, the resulting metrics are all calculated uniformly and can be used to easily compare performance even across the different engine's algorithms. This unique characteristic has resulted in a whole new use case in the form of model management. Concepts around model management with the PMI framework has enabled the ability to Auto-Retrain models, Auto-Re-evaluate, Dynamic-Deploy models and so on. A concept that we have recently proven with demonstration is the Champion / Challenger model management strategy. This Champion / Challenger strategy easily allows currently active model(s) to be re-evaluated and compared with other candidate models' performance and "hot-swap' deploy the new Champion model. A more detailed discussion on Machine Learning Model Management can be found with this accompanying blog called "4-Steps to Machine Learning Model Management".

The Fail Fast Approach

Thomas Edison is quoted as saying "I have not failed. I've just found 10,000 ways that won't work.". And back in the days leading up to the invention of the light bulb, this 10,000 ways that won't work took years to iterate through. What if you could eliminate candidates in days or hours? PMI allows a “Fail Fast” approach to achieving results. With the ease of using PMI on datasets, many combinations of algorithms and configurations can be tried and testing very fast, weeding out the approaches that won’t work and narrowing down to promising candidates quickly. The days of churning on code until it finally works, then finding out the results aren’t good enough and a new approach is needed, are coming to an end.

Over the next few month, Hitachi Vantara Labs will continue to provide blogs and videos to demonstrate how to use PMI, how to extend the PMI framework and how to add additional algorithms to PMI.

It is important to point out that this initiative is not formally supported by Hitachi Vantara, and there are no current plans on the Enterprise Edition roadmap to support PMI at this time.  It is recommended that this experimental feature be used for testing only and not used in production environments. PMI is supported by Hitachi Vantara Labs and the community. Hitachi Vantara Labs was created to formally test out new ideas, explore emerging technologies and as much as possible, share our prototypes with the community and users through the Hitachi Vantara Marketplace. We like to refer to this as "providing early access to advanced capabilities". Our hope is that the community and users of these advanced capabilities will help us improve and recommend additional use cases. Hitachi Vantara has forward thinking customers and users, so we hope you will download, install and test this plugin. We would appreciate any and all of your comments, ideas and opinions.

# 4-Steps to Machine Learning Model Management

Posted by Ken Wood Mar 6, 2018

Eliminating Machine Learning Model Management Complexity

By Mark Hall and Ken Wood

Last year in 4-Steps to Machine Learning we looked at how the Pentaho Data Integration (PDI) product provides the ideal platform for operationalizing machine learning pipelines – i.e. processes that, typically, ingest raw source data and take it all the way through a series of transformations that culminate in actionable predictions from predictive machine learning models. The enterprise-grade features in PDI provide a robust and maintainable way to encode tedious data preparation and feature engineering tasks that data scientists often write (and re-write) code for, accelerating the process of deploying machine learning processes and models.

“According to our research, two-thirds of organizations do not have an automated
process to update their predictive analytics models seamlessly. As a result, less than
one-quarter of machine learning models are updated daily, approximately one third
are updated weekly, and just over half are updated monthly. Out of
date models can
create a significant risk to organizations.”

- David Menninger, SVP  & Research Director, Ventana Research

It is well known that, once operationalized, machine learning models need to be updated periodically in order to take into account changes in the underlying distribution of the data for which they are being used to predict. That is, model predictions can become less accurate over time as the nature of the data changes. The frequency that models get updated is application dependent, and itself can be dynamic. This necessitates an ability to automatically monitor the performance of models and, if necessary, swap the current best model for a better performing alternative one. There should be facilities for the application of business rules that can trigger re-building of all models or manual intervention if performance drops dramatically across the board. These sorts of activities fall under the umbrella of what is referred to as model management. In the original diagram for the 4-Steps to Machine Learning blog, the last step was entitled “Update Models.” We could expand the original "Update Models" step and detail the underlying steps that are necessary to automatically manage the models. Then relabel this step to "Machine Learning Model Management" (MLMM). The MLMM step includes the 4-Steps to Machine Learning Model Management, “Monitor, Evaluate, Compare, and Rebuild all Models” in order to cover what we are describing here. This concept now looks like this diagram.

The 4-Steps to Machine Learning Model Management, as highlighted, include Monitor, Evaluate, Compare and Rebuild. Each of these steps implements a phase of a concept called a "Champion / Challenger" strategy. In a Champion / Challenger strategy applied to machine learning, the idea is to compare two or more models against each other in order to promote the one model that performs the best. There can be only one Champion model, in our case the model that is currently deployed, and there can be one or more Challengers, in our case other models that are trained differently, use different algorithms and so forth, but all running against the same dataset. The implementation of the Champion / Challenger strategy for MLMM goes like this,

1. Monitor - constant monitoring of all of the models is needed to determine the performance accuracy of the models in the Champion / Challenger strategy. Detecting a degraded model's performance should be viewed as a positive result to your business strategy in that the characteristic of the underlying data has changed. This can be viewed as the behaviors you are striving for are being achieved, resulting in different external behaviors to overcome your current model strategy. In the case of our retail fraud prediction scenario, the degradation of our current Champion model's performance is due to a change in the nature of the initial data. The predictions worked and is preventing further fraudulent transactions, therefore new fraud techniques are being leveraged which the current Champion model wasn't trained to predict.
2. Evaluate - an evaluation of the current Champion model needs to be performed to provide evaluation metrics of the model's current accuracy. This evaluation results in performance metrics on the current situation and can provide both a detailed set visual and programmatic data to use to determine what is happening. Based on business rules, if the accuracy level has dropped to a determined threshold level, then this event can trigger notifications of the degraded performance or initiate automated mechanisms. In our retail fraud prediction scenario, since the characteristic of the data has changed, the Champion model's accuracy has degraded. Evaluation metrics from the evaluation can be used to determine that model retraining, tuning and/or a new algorithm is needed. Simultaneously, all models in the Champion / Challenger strategy could be evaluated against the data to ensure an even evaluation on the same data.
3. Compare - by comparing the performance accuracy of all the models against each other from the evaluation step, the Champion and the Challenger models can be compared against each other to determine which model performs best, at this time. Since the most likely case is that the current Champion and all the Challenger models were built and trained against the initial state of the data, these models will need to be rebuilt.
4. Rebuild - by rebuilding (retraining) all the models against the current state of the data, the best performing model on the current state of the data, is promoted to Champion. The new Champion can be hot-swapped and deployed or redeployed into the environment by using a PDI transformation to orchestrate this action.

This 4-Steps to Machine Learning Model Management is a continuous process, usually scheduled to run on a periodic basis. This blogs describes how to implement a Champion / Challenger strategy using PDI as both the machine learning and the model management orchestration.

The new functionality that provides a new set of supervised machine learning capabilities and the model management enablers to PDI is called Plug-in Machine Intelligence (PMI). PMI provides a suite of steps to PDI that gives direct access to various supervised machine learning algorithms as full PDI steps that can be designed directly into your PDI data flow transformations with no coding. Users can download the PMI plugin from the Hitachi Vantara Marketplace or directly from the Marketplace feature in PDI (automatic download and install). The motivation for PMI is:

• To make machine learning easier to use by combining it with our data integration tool as a suite of easy toconsume steps that do not require writing code, and ensuring these steps guide the developer through its usage. These supervised machine learning steps work “out-of-the-box” by applying a number of “under-the-cover” pre-processing operations and algorithm specific "last-mile data prep" to the incoming dataset. Default settings work well for many applications, and advanced settings are still available for the power user and data scientist.
• To combine machine learning and data integration together in one tool/platform. This powerful coupling between machine learning and data integration allows the PMI steps to receive row data as seamlessly as any other step in PDI. No more jumping between multiple tools with inconsistent data passing methods or, complex and tedious performance evaluation manipulation.
• To be extensible. PMI provides access to 12 supervised Classifiers and Regressors “out-of-the-box”. The majority of these 12 algorithms are available in each of the four underlying execution engines that PMI currently supports: WEKA, python scikit-learn, R MLR and Spark MLlib. New algorithms and execution engines can be easily added to the PMI framework with its dynamic step generation feature.

A more detailed introduction of the Plug-in Machine Intelligence plug-in can be found in this accompanying blog.

PMI also provides a unified evaluation framework. That is, the ability to output a comprehensive set of performance evaluation metrics that can be used to facilitate model management. We call this unified because data shuffling, splitting and the computation of evaluation metrics is performed in the same way regardless of which of the underlying execution engines is used. Again, no coding is required which, in turn, translates into significant savings in time and effort for the practitioner. Evaluation metrics computed by PMI include (for supervised learning): percent correct, root mean squared error (RMSE) and mean absolute error (MAE) of the class probability estimates in the case of classification problems, F-measure, and area under the ROC (AUC) and precision-recall curves (AUPRC). Such metrics provide the input to model management mechanisms that can decide whether a given “challenger” model (maintained in parallel to the current “champion”) should be deployed, or whether champion and all challengers should be re-built on current historical data, or whether something fundamental has been altered in the system and manual intervention is needed to determine data processing problems or to investigate new models/parameter settings. It is this unified evaluation framework that enables PDI to do model management.

Implementing MLMM in PDI

The PDI transformations below are also included in the PMI plugin download complete with the sample datasets.

The following figure shows a PDI transformation for (re)building models and evaluating their performance on the retail fraud application introduced in the 4-Steps to Machine Learning blog. It also shows some of the evaluation metrics produced under a 2/3rd training - 1/3rd test split of the data. These stats can be easily visualized within PDI via DET (Data Exploration Tool), or the transformation can be used as a data service for driving reports and dashboards in the Business Analytics (BA) server.

The following figure shows a PDI transformation that implements champion/challenger monitoring of model performance. In this example, an evaluation metric of interest (area under the ROC curve) is computed for three static models: the current champion, and two challengers. Models are arranged on the file system such that the current champion always resides in one directory and challenger models in a separate directory. If the best challenger achieves a higher AUC score than the current champion, then it is copied to the champion directory. In this way, hot-swapping of models can be made on-the-fly in the environment.

PMI provides the ability to build processes for model management very easily. This, along with its no-coding access to heterogeneous algorithms, automation of “last mile” algorithm-specific data transformations, and when combined with enterprise-grade features in PDI – such as data blending, governance, lineage and versioning – results in a robust platform for addressing the needs of citizen data scientists and modern MI deployments.

Installation documentation for your specific platform and a developer's guide, as well as, the sample transformations and datasets used in this blog can be found at here. The sample transformations and sample datasets are for demonstration and educational purposes.

It is important to point out that this initiative is not formally supported by Hitachi Vantara, and there are no current plans on the Enterprise Edition roadmap to support PMI at this time.  It is recommended that this experimental feature be used for testing only and not used in production environments. PMI is supported by Hitachi Vantara Labs and the community. Hitachi Vantara Labs was created to formally test out new ideas, explore emerging technologies and as much as possible, share our prototypes with the community and users through the Hitachi Vantara Marketplace. We like to refer to this as "providing early access to advanced capabilities". Our hope is that the community and users of these advanced capabilities will help us improve and recommend additional use cases. Hitachi Vantara has forward thinking customers and users, so we hope you will download, install and test this plugin. We would appreciate any and all comments, ideas and opinions.

# Pentaho Report Designer Tricks

Posted by Joana Carvalho Mar 6, 2018

# Introduction

PRD is a powerful report tool. However, from my personal experience, some of the most cool stuff is hidden under all the available settings and the combinations that PRD allows. Due to this, only through your experience and others, you will start to understand what and how you can do certain implementations.

I would like to share with you the knowledge gathered during a project using PRD tool.

This blog post is divided in sections, where features from particular PRD components are described.

The sections are below, and if you click in one of them you will navigate to the correspondent section:

Tricks only to emphasise the details that certain components features have.

Bear in mind that all the examples in this post use dummy data.

# General Tricks

## · How do you format a column retrieved by the query?

• You need to have that field selected and then click on attributes, format

## · How do you define a name for your sub-report?

• Go the attributes and then name

## · There are a lot of functions that can be used in the reports:

• Page - adds page number, you need to put this function in Page Footer
• Row Banding – defines the color for the even and odd rows in a table

# Query Tricks

## · How many queries can be selected in your report?

• You can only have one query selected per report/sub-report. This have an impact on how you structure your report.

## · What happens when you change the query name?

• Whenever you change the query name, the query that was previously selected, will become deselected, forcing the report/sub-report to produce empty reports because it doesn’t know where to fetch data from. The selected query layout should be something like:

# Parameter tricks

## · How do you create a date parameter?

• You need to keep in mind the formatting that you use. Taking as example this date: 2017-01-01 00:00, you should
• Date Format - yyyy-MM-dd HH:mm. So, both year, day and minutes are with no capital letters. Month needs to be with capital letters, otherwise it will be seen as minutes instead of a month.
• Default Value - choose a default value to prevent you to type the same value every time you want to generate a report.
• Value Type - in this example it is a timestamp, but you can choose a date type as well

## · What are the formats used to display the parameter value?

• On Message Field component - $(nameOfTheParameter) • On Query script -${nameOfTheParameter)
• On Functions[nameOfTheParameter]

## · How do you get the current date and format it in a Message Field Component?

• Report.date – current date
• $(Report.date, date, yyyy-MM-dd) ## · How do you do a cascading filtering in your report? See the following example: • Customer prompt will influence the arrayTypeParam prompt and the later will influence the arraySerialParam prompt • arrayTypeParam prompt query needs to be filtered by customer value and arraySerialParam prompt query needs to be filtered by arrayTypeParam value • The parameters associated with each prompt needs to be in the following order Having the parameters with the order that you want to perform the cascading enables you to activate the cascading filtering that PRD gives you OOB. ## · How are the parameters passed through the reports? • To have access to the parameters values between the Master Report and the sub-reports, the Parameters option under Data in the sub-reports, needs to have the following options selected # Table tricks ## · In which sections, should a table be placed? • The columns/data retrieved by the query on the Details Body • The labels, header of those columns, in the Details Header • If you want to repeat the header in each page, you need to set, in the Details Header, Repeat-header property to true. The Details Body run for each row of your query. ## · Table Example Imagine that you have a query with the following result set:  Port Host Storage Avg Read Max Read … CL1-A, CL2-A TOTWSCPPRDHDS03 5 20 … CL1-A, CL2-A TOTWSCTDEVRPT04 12 16 … CL1-B, CL2-B TOTVH10001 6 10 … CL1-B, CL2-B TOTVH10002 12 15 … and you want to group the Host storages information by Port, see the below image: We created a sub-report, where we defined • in the Details Header, the port field • in the Details Body, the remaining fields: Host Storage Domain,LDEV • in the Page Header, the table headers The Page Header section is repeated in every page. As this table is big, occupies more than one page, the table header will be repeated at the top of every page. To iterate the ports, we added in the Group section, the Port column. This way, a new Details Header will be shown every time a different port appears. # Generate Reports in excel tricks ## · How do you set the sheet name? • To add a sheet name in an excel file you need to select the Report Header of your Master or sub-report and then set the sheetname property ## · How do you create different sheet names? • To create different sheets in your PRD report, you need to set, in each sub-report, in the Report Header, the pagebreak-before to true ## · How do you set dynamically the sheet name? • Imagine that you have information for different sites in one single query. But you want to display in each sheet the information for each site • You need to define a sub-report that • has a query that retrieves all the sites, query layout - siteQuery  site EMEA Korea India • has in Group section the column site selected – Group. It will repeat the sections that it contains (Details section) for each iteration that it makes, in this case, for each site. • in the Details section, you need to add another sub-report (child) that will contain the information for each site • As reports inherit the queries from their report father, the site iterated is passed to the report child • The query from the report child needs to be filtered by site • The Report Header needs to have • the pagebreak-before to true • the sheetname property needs contain something like # Charts tricks ## · Which java chart library PRD charts use? ## · In which section, should a chart be added? • In contrary from what happens with tables, you need to add a chart, not in the Details section, but in the Report Header. Otherwise, you will have as many charts as the number of rows retrieved by your /report/subreport query. ## · How to configure a time series chart with two plots (Area and Line charts)? In this case, the Used and Provisioned are in one( primary) plot and the warning in another one (secondary) • Choose XY Area Line Chart • Choose Time Series Collector both in Primary and Secondary DataSources • Your result set needs to be something like the table below, otherwise, it will only show the series with greater value, in this case Provisioned.  Date_time Column (Date Column) Label Column Value Column Threshold Label Threshold Value 2017/01/01 Used 5 warning 45 2017/01/01 Provisioned 12 warning 45 2017/01/02 Used 6 warning 45 2017/01/02 Provisioned 12 warning 45 2017/01/03 Used 7 warning 45 2017/01/03 Provisioned 12 warning 45 • How to format x tick axis labels? The date_time column is in day granularity. But our chart is only showing months. So, if you look at the images above, we need to set • time-period-type - Month • x-tick-period - Month • x-tick-fmt-str (corresponds to the tick label format) – MMM YYYY • x-auto-range - True, otherwise we need to specify the x-min and x-max properties • x-tick-interval (corresponds to the interval that you want between ticks) - 1 (we wanted to see one tick per month) It is essential to set this property, otherwise it will not apply the format that you specified. • x-vtick-label True (ticks will be displayed vertically instead of horizontally) ## · How to hide the x and y gridlines and the x or y axis? • There is a scripting section that enables you to customise the chart furthermore. You can use several languages: javascript or java. The code is below, using javascript: • chart - points to the chart itself • chart.getPlot() – will grab some chart properties. • chart.getDomainAxis() - picks the x axis properties • domain.setVisible(false) – hides the x axis • chart.setRangeGridlinesVisible(false) and chart.setDomainGridlinesVisible(false) – will hide gridlines for both axis  var chart = chart.getPlot();var domain = chart.getDomainAxis(); domain.setVisible(false);chart.setRangeGridlinesVisible(false);chart.setDomainGridlinesVisible(false); ## · How to set dynamically the x axis ticks of a time series chart depending on a start and end date parameters? • Using the scripting section, choosing the java language • Include the necessary libraries • int difInDays = (int) ((endDate.getTime() - startDate.getTime())/(1000*60*60*24)) - from the startDate and endDate parameters difference, I get the number of days that I want to show. Once that number is converted to ms, when I call DateTickUnit unit = new DateTickUnit(DateTickUnit.HOUR, difInDays, new SimpleDateFormat("dd-MM-yyyy HH:mm")), the java function already knows how many points will show for hour. • xAxis.setTickUnit(unit) - will set the tick unit that we set previously  import org.jfree.chart.axis.ValueAxis;import org.jfree.chart.axis.DateTickUnit;import org.jfree.chart.axis.DateAxis;import org.jfree.data.Range;import java.text.SimpleDateFormat;import java.util.*;import java.math.*;import java.util.Date; import org.jfree.chart.plot.Plot; // Get the chartPlot chartPlot = chart.getPlot(); ValueAxis xAxis= chartPlot.getDomainAxis(); Date startDate = dataRow.get("startDate");Date endDate = dataRow.get("endDate");int difInDays = (int) ((endDate.getTime() - startDate.getTime())/(1000*60*60*24)); DateTickUnit unit = new DateTickUnit(DateTickUnit.HOUR, difInDays, new SimpleDateFormat("dd-MM-yyyy HH:mm")); xAxis.setTickUnit(unit); ## · How to define dashed lines and specific colours in certain series in a line chart? • Using the scripting section, choosing the java language and include the following code: • In the query, the series that I wanted to paint and add dashed lines were at the top • chartPlot.getSeriesCount() – access the result set of the query • renderer.setSeriesPaint(i, Color.green) – set the color, for a given row i • renderer.setSeriesStroke(i, new BasicStroke(1.0f,BasicStroke.CAP_BUTT, BasicStroke.JOIN_MITER, 10.0f, new float[] {2.0f}, 0.0f)) - set the shape style (BasicStroke.CAP_BUTT) and new float[] {2.0f} argument corresponds to the width of the dashed line • renderer.setSeriesShapesVisible(i,false) – will disable the line shapes on the remaining series • chartPlot.setRenderer ( renderer ) – updates the chart configurations with the ones specified in the scripting section  import java.awt.Color;import java.awt.BasicStroke;import java.awt.Stroke; import org.jfree.chart.plot.XYPlot;import org.jfree.chart.renderer.xy.XYLineAndShapeRenderer; import java.util.*;import java.math.*;import java.util.Date; // Get the chartXYPlot chartPlot = chart.getXYPlot(); XYLineAndShapeRenderer renderer = new XYLineAndShapeRenderer( );for (int i = 0; i < chartPlot.getSeriesCount(); i++) { if(i <3){ if(i <1){ renderer.setSeriesPaint(i, Color.orange); }else if(i < 2){ renderer.setSeriesPaint(i, Color.red); }else{ renderer.setSeriesPaint(i, Color.green); } renderer.setSeriesStroke(i, new BasicStroke(1.0f,BasicStroke.CAP_BUTT, BasicStroke.JOIN_MITER, 10.0f, new float[] {2.0f}, 0.0f)); } renderer.setSeriesShapesVisible(i,false); } chartPlot.setRenderer( renderer ); # Css tricks ## · How can you stylise items, without using the style properties from the left panel? • It is possible to use an external stylesheet or the internal stylesheet - consult here how you specify the css style. However, there are some css properties that take precedence in relation to others. Imagine that you have the following example: • In the report-header, under a band element, there are two elements: text-field and a message. • The following Rules were defined: • report-header • font-size:16 • background:blue • report-header text-field • font-size:16 • background:pink • band text-field • font-size:10 • background:purple • It was expected that • text-field should have the font-size:10 and background:purple, because band text-field Rule is more precise than report-header text-field Rule, as the band is inside the report-header • message should have font-size:16 and background:blue, because the other Rules refer to text-field • The output was • text-field had font-size:16 and background:pink, applied report-header text-field Rule • message had the expected Rule applied • Even, adding the report-header to band text-fieldRule does not make any difference. The solution is to not mixture bands elements with other elements such as report-header,group-header, text-field. Use only bands with id and style-class properties under attributes panel. For instance, if you add a style-class, to band: reportHeader and a style-class to text-field: textField: • band.reportHeader • font-size:16 • background:blue • band.reportHeader .textField • font-size:10 • background:purple • The expected and output are the same • text-field have the font-size:10 and background:purple • message have font-size:16 and background:blue. ## · When should not you use bands? • Imagine two sub-reports, one at the bottom of the other, and each one • is inside a band • contains a table • As bands do not have the “overlap notion”, in other words, if you place one on top of the other, you won’t get the red alarm rectangle, meaning they are overlapping. So, if the first sub-report has a lot of rows to display, they will get overlapped with the rows returned by the second sub-report. • Due to this, do not use sub-reports inside bands. # A Data Architecture for Collaborative Enterprise Analytics Posted by Kevin Haas Jan 27, 2018 This was originally published by Dave Reinke & Kevin Haas on Monday, February 22, 2016 Our previous blog on the Rise of Enterprise Analytics (EA) created quite a stir. Many readers had strong reactions both for and against our perspective. Several pointed comments about the continued importance of centralized, enterprise data repositories (lakes, warehouses, marts) gave us pause. To summarize: “ How dare you even consider throwing away years of best practice data design and engineering and return us to an age of inconsistent spreadmarts and siloed Access databases. You should be ashamed!” The critics will be heartened to learn we’re not advocating giving up entirely on IT-managed enterprise data. On the contrary, we believe the adoption of Enterprise Analytics mandates even more attention to and extension of best practice enterprise data management and engineering. ## The Power of Analytic Producers Is Reforming How IT Manages Data The subtle differentiation we’re making is between the data itself, and the analytics on that data. EA is about shifting analytic production toward those in the organization who drive innovation from data, i.e. the Analytic Producers. Analytic Producers are typically business analysts, data scientists and others responsible for measuring and forecasting performance and identifying new, data-driven products and opportunities. Most of our recent projects have revolved on the enablement of Enterprise Analytics through a modern, extensible data architecture. One that relies on a foundation of governed dimensional data warehousing and modern big data lakes, while simultaneously enabling analysts to create and blend their own datasets. As Analytics Producers find value in adjunct datasets, that data is then integrated into what we call the “enterprise data ecosystem” (EDE). In contrast to the traditional EBI ecosystem, the vitality of the EDE is driven by business priorities and analytic innovations -- not the other way around. The picture above depicts how the old EBI and new EA worlds integrate. The blue elements should look very familiar to EBI colleagues. This is the domain of stewarded enterprise data and standardized access mechanisms for “Analytics Consumers”. Most are probably familiar with the classic reporting, OLAP and dashboards provided by legacy BI vendors. ## New Analytics Technologies Have Also Upset Traditional Data Semantic Governance In addition to core EBI, we’ve added boxes for the increasingly prevalent statistical tools and libraries such as R, Python and Spark used by data scientists. Further, analytical apps built with R/Shiny, Qlik, Tableau, etc. provide tailored, managed access to data via highly visual and ubiquitous web and mobile interfaces. Indeed, Inquidia’s business is now more focused on analytical app dev than it is on dashboards and reports enabled via legacy commercial EBI tools. (More on this in an upcoming blog…) The orange elements of the diagram depict new architectural elements driven by Enterprise Analytics clients. Ad-hoc data discovery and the ability to experiment with new data sources drives the need. Depending on the Analytics Producer, the new data sources range from simple spreadsheets to data scraped from the web -- and curated using agile programming languages like Python, R, Alteryx and even freely-available ETL software such as Pentaho Data Integration. Additionally, for some of our clients, we help create “data sandboxes” where Analytics Producers combine extracts (we often are asked to build) of enterprise data with their new, embellishing datasets for ease of blending. ## A Modern Approach to Collaborative Enterprise Analytics Yields Benefits for Analysts and IT Central to EA is the ability for Analytic Producers to share discoveries and collaborate. The Shared Analytics Results Repository provides this functionality. Many of our clients enable this sharing using Tableau server, though the same results could be attained through other low cost approaches including Tableau desktop with file sharing, internal wikis, Google Drive & Docs, etc. There’s certainly no need to reinvent collaboration technology. Inevitably, a new “hot” analytic will emerge from EA initiatives -- one that is in demand by traditional Analytics Consumers. This is where expert enterprise data architecture and engineering is critical -- and often where data integration expertise plays a helping role. The gray boxes depict the escalation process with outputs detailing new data integration and semantic requirements. The orange “New Sources” box represents the extensibility of the data ecosystem. Depending on the nature of the data, it may land in the classic data warehouse or become part of the big data lake (e.g. Hadoop). The orange “Integrated User Models” box shows the extension of the enterprise semantic driven by the newly integrated data. These data may manifest in cubes, ad-hoc report metadata, or new analytical app requirements. We hope this deeper dive into the nature of emerging Enterprise Analytics will allay fears of our colleagues that data architecture and engineering are no longer critical. The revolutionary concept of EA is not rampant decentralization of enterprise data, but rather an acknowledgement that for many business organizations (and perhaps yours), significant analytic expertise resides outside of IT. These analytics constituents must be serviced with more flexibility and agility for an enterprise that wishes to drive innovation through analytics. # The Rise of Enterprise Analytics Posted by Kevin Haas Jan 27, 2018 This post was originally published by Dave Reinke & Kevin Haas on Wednesday, January 27, 2016 Is Enterprise BI dying? That’s the question our colleagues have been debating the past few months. We’re heavily wired into the business intelligence marketplace and have seen the nature of our projects change recently. Fewer clients are asking for classic ad-hoc query, reporting and analysis provided by enterprise BI platforms such as BusinessObjects, Cognos, Microstrategy and Pentaho. Rather, clients are obsessed with providing data services to a growing potpourri of visual analytic tools and custom built analytic apps. More organizations expect data-driven, tangible evidence to support their decisions. The fundamental shift is from an IT mandated common data semantic via a monolithic BI platform to an assortment of “BYO” analytics technologies that depend on the creativity and self-reliance of business analysts and data scientists to meet enterprise analytical needs. Perhaps we are seeing the rise of a new analytics philosophy. Are we witnessing the Rise of Enterprise Analytics? ## A Brief History of Enterprise BI Enterprise BI platforms began life in the 1990’s when upstart disrupters like BusinessObjects and Cognos promised to “democratize data access” and introduce “BI to the masses.” For the most part, they delivered on these promises. The core concept was a centralized, shared data semantic, enabling users to interact with data without requiring an understanding of the underlying database structure or writing their own SQL. Yes, SQL. All data for these platforms had to be queried from relational databases, preferably dimensionalized data warehouses that were designed and populated by IT. The Enterprise BI platforms provided tremendous value to organizations that were starved for consistent data access. Once the underlying data was organized and a semantic defined, users were guaranteed conformed data access via ad-hoc and canned query, reporting and analysis modules. Additionally, complex reports and dashboards could be stitched together from common components. Nirvana...unless you were paying the license fees or wanted to switch to a new platform. Excessive licensing fees and lock-in began the undoing of the monolithic BI platforms as open source technologies like Pentaho and Jaspersoft aggressively commoditized. However, even the open source options were still bottlenecked by a dependence on centralized IT to organize data and define a common semantic. Time for the next disruption… ## The Trend is not Enterprise BI’s Friend: Five Trends Sparking the Rise of Enterprise Analytics For context, consider how radically user’s expectations of technology have changed since the Enterprise BI platforms took shape in the 1990’s. We’ve identified five “megatrends” that are particularly relevant for analytics. First, technology has become high touch and amazingly intuitive. High touch as in actually touching via tablets and phones. Apps and websites don’t come with binders or user manuals. You download and use, figuring it out along the way. Second, technology is perpetually connected, enabling interaction with people and things anywhere. We expect to be able to do things at any time, any place and on any device. We change our home thermostat from across the country and speak with colleagues on the other side of the globe for little or no cost. Simply amazing if you stop to think about it. Third, technology answers questions now. We’ve become impatient, no longer willing to wait even for the simple latency of an email exchange. Ubiquitous connectivity and Google are now taken for granted by a new generation of perpetually informed consumers. Fourth, the increasing compute power in the hands of every business analyst is changing their problem solving approach. Data scientists can solve business problems by processing even more data with vastly more sophisticated algorithms than ever before. This has yielded technologies that visually depict these advanced analytics, resulting in greater experimentation, and an embrace of the scientific method. Finally, technological sharing and collaboration is the new norm. Social networks have taught us that if we participate, then we will get more than we give. The open source software development model has spilled into just about every domain, harvesting the benefits of collaboration and improvement via derivative works. The trends empower and embolden the individual and stand in stark contrast to the command and control deployment inherent in classic Enterprise BI platforms. ## Enter Enterprise Analytics The legacy, centralized approach of Enterprise BI simply hasn’t recognized and responded to these trends. Imagine an enterprise that leverages IT and engineering resources to provide a shared, secure data asset, but also fosters an ecosystem where analytics evolve through creativity, exploration, collaboration and sharing. An ecosystem where analytics take on a life of their own; where the “best-fit” analytics thrive and pass their “DNA” on to new generations built from an expanding data asset. Markets are won by data-driven organizations that learn faster and execute better than their competitors. This is the vision for Enterprise Analytics. As long time BI veterans, we were taught to root out siloed analytics, spreadmarts and the like. One of the commonly argued benefits for Enterprise BI platforms is that reported metrics are guaranteed to be consistent no matter how many users ask for them. There would be no more arguing over whose numbers are right. Rather, energy was to be spent interpreting and acting. We found this to be true with users rapidly absorbing the IT-managed metrics. However, just as quickly as we delivered the standardized, IT-governed metrics, users demanded new metrics requiring the rapid integration of increasingly varied and voluminous data. Few IT organizations could respond with the necessary agility, and innovation was stifled. Adopting Enterprise Analytics changes the dynamic between user and IT. IT becomes an enabler, providing a shared and secure data infrastructure while users are empowered to create, share and improve analytics. For sure, the path is not straight. There are bumps along the way with arguments over whose metrics are more apt, etc. But the benefits of rapid innovation overpower the stagnation that comes from lack of analytical agility. With a platform that enables collaboration, users are more apt to reuse and then extend metrics as they produce new analytics -- experimenting to improve an organization’s understanding. The costs of a little less governance are far outweighed by the benefits of rapidly improving actionable insight. ## What's Next in Enterprise Analytics Although the opportunity of Enterprise Analytics is staggering, Enterprise BI is not going to disappear overnight. We’ll still need pixel perfect and banded reports to satisfy regulations, official documents, operational norms and tailored communication. Privacy requirements still demand secured and managed access for wide swathes of enterprise data -- access that likely requires a stable, common semantic for which Enterprise BI platforms excel. And, we’ll increasingly see analytics delivered via “apps” with tightly scoped, but well-directed functionality to address a specific business process and targeted audience. Not everything will be 100% ad-hoc. But, in our view, the reality of how business analysts and data scientists work, the tools they use, and the information they have access to is inciting a real change in the way that individuals are using information. Enterprise Analytics is at hand, and organizations that do not respond to this reality will find themselves increasingly irrelevant. In future blogs, we’ll expand on the concepts introduced here, elaborating on the benefits of maintaining an Enterprise Analytics portfolio that consists of business-led Data Exploration, Data Science, Analytical Apps and governed data access and reporting. We’ll also discuss how to start and grow your own Enterprise Analytics ecosystem discussing technologies and techniques that work and the changed but still critical role of central IT. Along the way we’ll share insights and experiences as we enter this unquestionably exciting new age. # Movie Sentiment on Twitter using Naïve Bayes Posted by Jesse Zuckerman Jan 24, 2018 One of the most heavily discussed topics in machine learning and data mining today is sentiment analysis. For the uninitiated, sentiment analysis is a goal to classify text as positive or negative based only on previously classified text. In this article, I will attempt to classify the sentiment of Twitter comments about a certain movie, based only on a dataset of 10,662 movie reviews, released in 2005. This solution will be demonstrated using 2 methods—once using only Pentaho Data Integration (with some R), and a more sophisticated solution will be built using Weka. # Understanding the Naïve Bayes Classifier Although many machine learning algorithms become very complex and difficult to understand very quickly, the Naïve Bayes classifier relies on one of the most fundamental rules in statistics, allowing its results to be highly interpretable, while also maintaining a high degree of predictive power. It is based upon Bayes’ Rule, which can be used to predict conditional probability. The equation reads: Applying Bayes’ Rule to sentiment analysis to classify a movie as bad given a specific review of “I hated it” would be: The classifier is called “naïve” because we will assume that each word in the review is independent. This is probably an incorrect assumption, but it allows the equation to be simplified and solvable, while the results tend to hold their predictive power. Applying Bayes’ Rule has allowed us to dramatically simplify our solution. To solve the above equation, the probability of each event will be calculated. • P("I"|negative) can be described as the total number of times “I” appears in negative reviews, divided by the total number of words in negative reviews • P(negative) is the total number of words that are in negative reviews divided by the total number of words in the training data • P("I") is the total number of times “I” occurs in all reviews divided by the total number of words in the training data We can then do the same above equation and replace the occurrences of negative with positive. Whichever probability is higher allows us to predict a movie review’s sentiment as negative or positive. The expectation would be that hated occurs significantly more often in the negative reviews, with the other terms being similar in both classes, thus allowing us to correctly classify this review as negative. # Build a Naïve Bayes Model using Pentaho Data Integration To build the model in Pentaho, there are a few steps involved. First, we must prepare the data by cleaning the data. Once this is done, we then build the terms for each word in the classifier. Lastly, we test the performance of the model using cross-validation. ## Step 1: Cleaning and Exploring the Source Data To perform the sentiment analysis, we’ll begin the process with 2 input files—1 for negative reviews and 1 for positive reviews. Here is a sample of the negative reviews: To clean the data for aggregation, punctuation is removed and words are made lowercase to allow for a table aggregated by class and word. Using the data explorer, we can start to see the word count differences for some descriptive words. These numbers intuitively make sense and help to build a strong classifier. ## Step 2: Building Terms for the Classifier Next, we build the various terms for the classifier. Using the calculator steps, we need the probabilities and conditional probabilities for each word that occurs either in a negative review or positive review (or both) in the training data. The output from these steps then creates the parameters for the model. These need to be saved, so eventually they can be used against testing data. Here is a sample: It can be noted that some of these word counts are null (zero). In the training data, this only occurs if a word count is zero for one of the two classes. But in the test data, this can occur for both classes of a give word. You will notice that the conditional probabilities for these null words are nonzero. This is because Add-1 Smoothing is implemented. We “pretend” that this count is 1 when we calculate the classifier, preventing a zero-out of the calculation. To calculate the classifier for a given instance of a review, like the formula previously explained, we must apply the training parameters to the review—that is match each word in the review being classified with its probability parameters and apply the formula. It can be noted that when we solve the equation, we take the log of both sides because the terms being multiplied are very small. ## Step 3: Model Accuracy using Cross-Validation You will notice there is a Section #3 on the transformation to see how well our classifier did. It turns out, that this is not the best way to check the accuracy. Instead, we will check the results using cross-validation. When building a model, it important not to test a model against the training data alone. This will cause overfitting, as the model is biased towards the instances it was built upon. Instead, using cross-validation we can re-build the model exactly as before, except only with a randomly sampled subset of the data (say, 75%). We then test the model against the remaining instances to see how well the model did. A subset of the predictions, with 4 correct predictions and 1 incorrect prediction, from cross-validation can be seen here: Ultimately, using cross-validation, the model made the correct prediction 88% of the time. # Test the Naïve Bayes Model on Tweets using Pentaho Data Integration and R To read Tweets using R, we make use of two R libraries, twitteR and ROAuth. A more detailed explanation to create a Twitter application can be found here. This allows for stream of tweets using the R Script Executor in PDI. We will test the model using Jumanji: Welcome to the Jungle, the movie leading the box office on MLK Jr. Day Weekend. Using the following code, we can search for recent tweets on a given subject. The twitteR package allows us to specify features, like ignoring retweets and using only tweets in English. tweetStream = searchTwitter(‘Jumanji’ ,lang='en' ,n=100) dat = do.call("rbind", lapply(tweetStream, as.data.frame)) dat = dat[dat$isRetweet==FALSE,]
review = dat$text Encoding(review) = "UTF-8" review = iconv(review, "UTF-8", "UTF-8",sub='') ## remove any non UTF char review = gsub("[\r\n;]", "", review)  Here is sample of the incoming tweets: Clearly, these tweets are not of the same format as the training data of old movie reviews. To overcome this, we can remove all @ mentions. Most of these are unlikely to affect sentiment and are not present in the training data. We can also remove all special characters—this will treat hashtags as regular words. Additionally, we remove all http links within a tweet. To keep only tweets that are likely to reveal sentiment, we will only test tweets with 5+ words. To get predictions, we now follow the same process as before, joining the individual words of a tweet to the training parameters and solve the classifier. Here is a sample of the results, along with my own subjective classifier:  Predicted Class Subjective Class Tweet negative positive I have to admit this was a fun movie to watch jumanji jumanjiwelcometothejungle action httpstcopXCGbOgNGf negative negative Jumanji 2 was trash Im warning you before you spend your money to go see it If you remember the first you wont httpstcoV4TfNPHGpC negative positive @TheRock @ColinHanks Well the people who have not seen JUMANJI are just wrong so positive positive Finally managed to watch Jumanji today Melampau tak kalau aku cakap it was the best movie I have ever watched in my life negative negative Is Jumanji Welcome to the Jungle just another nostalgia ploy for money Probably httpstcoDrfOEyeEW2 httpstcoRsfv7Q5mnH positive positive Saw Jumanji today with my bro such an amazing movie I really loved it cant wait to see more of your work @TheRock positive positive Jumanji Welcome to the Jungle reigns over MLK weekend httpstcoOL3l6YyMmt httpstcoLjOzIa4rhD One of the major issues with grabbing tweets based on a simple keyword is that many tweets do not reveal sentiment. Of the 51 tweets that were tested (the other 49 were either retweets or did not contain 5 words), I subjectively determined only 22 of them contained sentiment. The successful classification rate of these tweets is 68%. This is significantly less than the success rate in the cross-validation, but can be explained by the different use of language between the training set and the tweets. The slang, acronyms and pop culture phrasing used on Twitter is not prevalent in the movie review training data from 2005. # Enhancing the Model with Weka: The Naïve Bayes model can be greatly enhanced using Weka. Weka provides powerful features that can be applied within a simple interface and fewer steps. Using their pre-built classifiers, the parameters can be easily tuned. Here, Multinomial Naïve Bayes is used. First, the reviews are split by word, as required by Naïve Bayes, by using the StringToWordVector filter. Additionally, 10-fold cross validation is used. Instead of building the model once as we did before the data is randomly partitioned into 10 sets. The model is run 10 times, leaving 1 set out each time and then the ten models are averaged out to build the classifier. This model will reduce overfitting, making it more robust to the tweets. Here is the output from the model: When the tweets are scored using the PDI Weka scoring step, the subjective successful prediction rate increased slightly to 73%. # Ingesting UDP Packets in PDI with a Custom Transformation Step - Part I Posted by Greg Graham Jan 4, 2018 Pentaho Data Integration comprises a powerful high throughput framework for moving data through a network of transformation steps, turning data from one form into another. PDI is an excellent choice for data engineers because it provides a intuitive visual UI so that the engineer gets to focus on the interesting business details. PDI is also easily extendable. PDI traditionally focuses on large scale re-aggregation of data for data warehousing, big data, and business intelligence, but there are lots of other interesting things you can contemplate doing with embedded PDI. Coming from a high frequency trading background, when I looked at the default list of transformation steps I was surprised to not see a generalized UDP packet processor. Data transmitted over UDP is everywhere in the trading world - market data often comes in UDP and so do custom prompt trading signals. So I ended up creating a toy implementation of transformation steps UDPPacketSender and a UDPPacketReceiver to send/receive data over UDP. Part I of this blog post will introduce some concepts describing what UDP is and is not, an introduction to packet handling in Java, and how to write transformation steps for PDI. It is intended to be an introduction, and is not exhaustive. Where applicable, links are provided to further reading. Code is provided separately in a GitHub repository for educational purposes only, no warranty is expressed or implied. Part II of this blog post will include some kettle transformations and demonstrations of the UDPSender and UDPReceiver in action. What is UDP? So what is UDP anyway and why should we care about it? Information is generally sent over the internet in one of two protocols: UDP and TCP. UDP stands for User Datagram Protocol, and it is a lightweight, connectionless protocol that was introduced in 1980 by David Reed (RFC 768 - User Datagram Protocol ) as an alternative to the comparatively high overhead, connection-oriented network Transmission Control Protocol, or TCP. Among the properties of UDP: • Connectionless • Does not guarantee reliable transmission of packets • Does not guarantee packet arrival in order • Multicast/Broadcast By contrast, TCP is connection oriented and has protocol mechanisms to guarantee packet transmission and ordered arrival of packets. But it cannot do Multicast/Broadcast, and TCP has higher overhead associated with flow control and with maintaining connection state. So UDP is a good candidate when latency is more important than TCP protocol guarantees, or when packet delivery guarantees can be ignored or engineered into an enterprise network or into the application level. Much of the time, UDP packet loss occurs simply because intervening network devices are free to drop UDP packets in periods of high activity when switch and router buffers fill up. (It's rarely due to an errant backhoe, although that does happen, and when it does it's a problem for TCP as well :-) For many traditional UDP uses like VoiP, packet loss manifests as hiss and pop and is mitigated by the filtering and natural language processing capabilities of the human brain. For many other uses, like streaming measurement data, missing measurements can be similarly imputed. But, it is important to note, UDP packet loss may not be a problem at all on enterprise networks provided that the hardware is deliberately selected and controlled to always have enough buffer. Also it should be noted that socket communication is an excellent mode of inter-process communication (IPC) for endpoints in different platforms on the same machine. Finally, if packet loss is a problem, a lightweight mitigation protocol may be implemented in the application on top of UDP. More information on the differences between UDP and TCP can be found here: User Datagram Protocol or Transmission Control Protocol. Creating PDI Transformation Steps There is excellent documentation on how to extend PDI by creating your own transformation steps The most relevant information for creating additional PDI plugins is located at the bottom of the page under the heading "Extend Pentaho Data Integration". In a nutshell, you'll need a working knowledge of Java and common Java tools like eclipse and Maven. First we'll explore sending and receiving packets in Java, and move on to a discussion of threading models. Then we'll move on to a discussion of implementation of the four required PDI interfaces. Sending and Receiving UDP Packets in Java Since buffer overflow is usually the main culprit behind lost UDP packets, it is important to grab packets from the network stack as quickly as possible. If you leave them hanging around long enough, they will get bumped by some other packet. So a common pattern for handling UDP packets is to have a dedicated thread listening for packets on a port, read the packets immediately and queue the packets up for further processing on another thread. Depending on how fast packets are coming in and how burst-y the incoming rate is, you may also need to adjust buffer size parameters in the software or even on the network device directly. This implementation uses classes in the java.nio package. These classes are suited for high performance network programming. Two classes of interest are ByteBuffer and DatagramChannel. You can think of ByteBuffer as a byte array on steroids: it has get and set methods for each basic data type and an internal pointer so that data can be read and written very conveniently. Also, ByteBuffer has the possibility to use native memory outside of the Java heap, using the allocateDirect() method. When passed to a DatagramChannel, then received data can be written to the native memory in a ByteBuffer without the extra step of copying the data initially into the Java heap. For low latency applications this is very desirable, but it comes at the cost of allocation time and extra memory management. (For both of these reasons, it's a good idea to pre-allocate all of the ByteBuffers anticipated and use an ObjectPool.) Threading Model A word of caution is in order here. Multi-threaded application programming is challenging enough on its own, but when you are contemplating adding threads inside of another multi-threaded application that you didn't write, you should be extra careful! For example, PDI transformations are usually runnable within Spoon, the UI designer for PDI transformations. Spoon has start and stop buttons, and it would be extra bad to cause a transformation to not stop when the user hits the stop button because of a threading issue, or to leak resources on multiple starts and stops. Above all, don't do anything to affect the user experience. If at all possible, when designing a transformation step that involves some threading, look at a similar example or two. When creating this implementation, I looked at the pentaho-mqtt implementation on GitHub located at pentaho-labs/pentaho-mqtt-plugin. This was very useful because I could look at how the mqtt-plugin developers married a network implementation capable of sending/receiving packets to PDI. The model I ultimately settled upon has the following characteristics: • A startable/stoppable DatagramChannel listener with a single-use, flagged thread loop, and an "emergency" hard stop in case a clean join timeout limit was exceeded. • Assumes that the putRow() method is thread safe. • Uses an object pool of pre-allocated ByteBuffers to avoid inline object construction costs. • Uses a Java thread pool (ExecutorService) to both process packets and to serve as a packet queue. Looking at the pentaho-mqtt-plugin and understanding the flow of data and the initialization sequences greatly speeded up my own development efforts. Implementation Each PDI transformation step has to implement all of the following four interfaces. • StepMetaInterface: Responsible for holding step configuration information and implementing persistence to XML and PDI repository. • StepDataInterface: Responsible for keeping processing state information. • StepInterface: Implements the processing logic of the transformation. • StepDialogInterface: Implements the configuration UI dialog for the step. I'm not going to go into a lot of detail on the coding here, but the source code is provided for reference at the GitHub repositories • ggraham-412/ggutils v1.0.0 contains utility classes that implement logging that can be injected with alternative logging implementations at runtime, an object pool implementation, and UDP packet sender/receiver objects that wrap DatagramChannel and ByteBuffer objects together with a simple packet encoder/decoder. • ggraham-412/kettle-udp v1.0.0 contains the implementations of the UDPReceiver and UDPSender steps (Again, the code is provided for educational purposes only and no support or warranty is expressed or implied.) The basic flow for UDPSender is simple. We're expecting data to come in from PDI into the UDPSenderStep. So in the processRow() method (which is invoked by the framework on every row) we basically get a row of data, find the fields we're interested to send, and send them on a UDPSender from the ggutils library. If the row we get is the first one, we open the UDPSender ahead of processing. If the row we get is null (end of data) then we close the UDPSender and signal that the step is finished by returning false. The flow in UDPReceiver is a little more complicated because it is ingesting externally generated data from the point of view of PDI. Going from the pentaho-mqtt-plugin model, the processRow() method is used only to determine a stop condition to signal to the framework. Since there are no rows coming in, we use the init() method to start up a UDPReceiver from ggutils listening on the configured port. The UDPReceiver has a handler method so that whenever a packet is received, it is passed to the handler. The handler will unpack the packet into a row and call putRow(). The packets are handled on a thread pool, so we do potentially process fast arriving packets out of order. When the number of packets received hits a maximum, or when a time limit expires, processRow() will return false to the framework stopping the transformation. The UDPReceiver is stopped with a boolean flag, but just in case the thread does not stop for unknown reasons, it is hard-stopped after a join timeout to avoid leaking resources. The settings UI dialogs are created using the Standard Widget Toolkit (SWT), which is a desktop UI library available for Java. Note that there are many excellent custom widgets that already exist for use with Spoon, such as the table input widget. To maintain the look and feel of PDI, it is best to work from an existing example dialog class from GitHub using these widgets. Build and Deploy In order to build the PDI transformation steps, import both of the above ggutils and kettle-udp projects into eclipse. kettle-udp uses Maven, so make sure that the m2e eclipse extension is installed. Finally, if you're not familiar with building Pentaho code using Maven, you should visit the GitHub repository Pentaho OSS Parent Poms for important configuration help. (Big hint: you should use the settings.xml file given at the above repo in the maven-support-files in your ~username/.m2 folder to make sure your build tools can find the appropriate Pentaho repositories.) Once everything is building, export a jar file called UDPReceiver.jar. In order to see the new steps in Spoon, copy the jar file to the folder data-integration/plugins/steps/UDPReceiver. The folder should have the same name as the jar file it contains. You should be able to see the UDPReceiver step in the Input category and the UDPSender step in the output category when you start up Spoon the next time. i18n and l10n Pentaho contains facilities for internationalization and localization of strings and date formats. The repository above contains en_US strings, but alternates could easily be provided depending on the language/locale of your choice. In order to provide for localized strings in the application, simply include a messages package in your project structure (it should match the name of the target package plus ".messages") containing a messages_en_US.properties file. The file should contain properties like "UDPSenderDialog.GeneralTab.Label=General", and the dialog code retrieves the string with a call to BaseMessages.getText(class, key), where class is a Java class in the target package, and key is a string in the properties file. So for example, BaseMessages.getString( UDPSenderMeta.class, "UDPSenderDialog.GeneralTab.Label") will return the string "General" if the locale in Spoon is set to en_US. Note: to troubleshoot this during development, make sure that the messages properties file is actually deployed in the jar file by inspection with an archiver like 7Zip, and that the properties files are included in the build CLASSPATH. Also you may have to change the locale back and forth away from and back to the desired locale in Spoon to flush cached strings. Conclusions Data sent via the internet is generally sent using either TCP or UDP protocols. In this blog post, we looked at the characteristics of UDP, and highlighted the basic techniques for handling UDP packets in Java. Then we showed how those techniques could be incorporated into a PDI transformation step and provided a reference implementation (for educational purposes only.) In part II of this post, we will put these ideas into action with some demonstrations using Kettle transformations. # Viz-SVG Posted by Joao Gameiro Dec 22, 2017 # Introduction Charts are a good way to visualize our data, but in some cases they are not enough. Imagine you need to show a vehicle current status, based on his sensor data, don't you think the image below is more suitable than a table or chart? The intention is to expose in the easiest way the information needed to support future actions. Despite all possibilities to do this using the available methods, the idea of using an SVG image, which could be changed according to the data being received, sounded really cool. As you might know, SVG images are basically XML files with four interesting characteristics (at least the ones that matter for this component): • are scalable, • normally end up with file sizes shorter than similar image formats, • can have identifiable elements on their structure, • modifiable once injected into the HTML DOM, via javascript. Merging the approach used on SVG component and VIZ-APIVIZ-SVG was born. # Requirements • Javascript knowledge • Pentaho 8.0 • npm If you don’t have the node.js (npm) already installed follow this instruction: https://docs.npmjs.com/getting-started/installing-node # How to build your own visualization using Viz-SVG project Download the git folder example. Extract the zip file, the necessary files are already there: • SVG image – circle.svg • The Model – yourmodel.js • Dummy data – data.json circle.svg <g> <title>SVG Example</title> <rect fill="#fff" id="canvas_background" height="402" width="582" y="-1" x="-1"/> <g display="none" overflow="visible" y="0" x="0" height="100%" width="100%" id="canvasGrid"> <rect fill="url(#gridpattern)" stroke-width="0" y="0" x="0" height="100%" width="100%"/> </g> </g> <g> <title>Circle</title> <ellipse id="Sales" fill="lightgray" stroke="black" ry="70" rx="70" cy="195" cx="281" stroke-width="5" /> </g> </svg>  ##### data.json • svgElementID - the element name on circle.svg • value - bussines data { "model": [ {"name": "svgElementID", "type": "string", "label": "svg Element ID"}, {"name": "value", "type": "number", "label": "Value"} ], "rows": [ {"c": [ "Sales", 11000000]} ] }  ##### yourmodel.js var SvgModel = BaseModel.extend({$type: {
// SVG Label and Class
styleClass: "pentaho-visual-samples-svg",
label: "SVG Circle"
},

getSvgPath: function() {
return "./circle.svg";
},

getSvgPartforDataID: function(dataId){
return ["Stroke", "Fill"];
},

toSvgId: function(dataId, prop){
return dataId ;
},

toSvgAttribute: function(dataId, prop, dataValue) {
switch(prop){
case "Stroke": return "stroke";
case "Fill": return "fill";
}
return prop;
},

toSvgValue: function(dataId, prop, dataValue) {
switch(prop){
case "Stroke": return dataValue > 11000000 ? "red" : "green";
case "Fill": return dataValue > 11000000? "green" : "red";
}
}
});



Pay attention to:

• Label, this is the name that will appear in the list of available visualizations in Analyzer
label: "SVG Circle"


getSvgPath: function() {
return "./circle.svg";
},


• Properties that will be changed for each SVG element, (Stroke and Fill)
getSvgPartforDataID: function(dataId){
return ["Stroke", "Fill"];
},


• SVG id - if you need to change the SVG element id add some logic to this function
• Take a look at the other example (modelProductLine.js)
toSvgId: function(dataId, prop){
return dataId ;
},


• Mapping between properties defined previous and SVG style properties
toSvgAttribute: function(dataId, prop, dataValue) {
switch(prop){
case "Stroke": return "stroke";
case "Fill": return "fill";
}
return prop;
},


toSvgValue: function(dataId, prop, dataValue) {
switch(prop){
case "Stroke": return dataValue > 11000000 ? "red" : "green";
case "Fill": return dataValue > 11000000? "green" : "red";
}
}


##### Now it's time to test our visualization

Register our files on index.html (Index.html purpose is to render our code on the browser.)

• data.json is our dummy data
• yourModel is our configuration

require([
"pentaho/type/Context",
"pentaho/data/Table",
"json!./sales-by-product-family.json"
], function (Context, Table, dataSpec) {

// Setup up a VizAPI context.
Context.createAsync({ application: "viz-api-sandbox" })
.then(function (context) {
// Get the model and base view types
return context.getDependencyAsync({
BarModel: "pentaho/visual/samples/bar/modelProductLine",
BaseView: "pentaho/visual/base/view"
});
})



Change:

line 04 - "json!./sales-by-product-family.json"   to    "json!./data.json"   and

line 13 - BarModel: "pentaho/visual/samples/bar/modelProductLine",  to   BarModel: "pentaho/visual/samples/bar/yourModel",

Let’s open index.html on browser.

Attention: directly opening the file through the filesystem will not work when using Google Chrome (and possibly other browsers), because of security restrictions that disallow the loading of local resources using XHR — a functionality that is required by the VIZ-SVG to load localization bundles and other resources.

To overcome this limitation we will use Node.js to serve the project files through an HTTP server.

Node:

npm install -g node-static

static -p 8001

Now you can open http://127.0.0.1:8001/index.html in the browser and test.

Change the values of data.json to modify colors

Edit data.json and change 11000000 to -> 11000001.

Colors should change (if not, clear browser cache).

# Create the Pentaho Web Package

Register your model on config.js and package.json

• Config.js
•      add the following line: module.id.replace(/(\w+)$/, "yourModel")(\w+)$/, "yourModel")
select: {
type: [
module.id.replace(/(\w+)$/, "modelGeneric"), module.id.replace(/(\w+)$/, "modelProductLine"),
module.id.replace(/(\w+)$/, "yourModel") ],  • package.json • add the following line: "pentaho-visual-samples-svg/yourModel": {"base": "pentaho/visual/base/model"}  "pentaho/typeInfo": { "pentaho-visual-samples-svg/modelProductLine": {"base": "pentaho/visual/base/model"}, "pentaho-visual-samples-svg/yourModel": {"base": "pentaho/visual/base/model"} },  ##### Let's create our .tgz file Note: By using the npm pack command you ensure only your files and bundled dependencies are compressed in the resulting tgz file. Run npm pack on viz-svg-sandbox folder. # Deploy on Pentaho server and PDI Copy the pentaho-visual-samples-svg-0.0.1.tgz to • pentaho-solutions/system/karaf/deploy/ (Pentaho Server) • data-integration/system/karaf/deploy (PDI) # See the visualization on Pentaho BA server and PDI Create a new Analysis Report using SampleData • Choose • Measure - Actual • Category – Department Pick on the list, right side, our SVG Circle visualization. Do the same on PDI (DET) • Choose • Measure - Actual • Category – Department Our .tgz has another extra visualization: SVG Product Line. Marketplace You can find more examples in Pentaho Marketplace, search by VIZ-SVG, and install both. # Why use Viz-SVG Viz-SVG can be a good way to visualize your IoT data during all pipeline process, PDI, Analyzer and CTools dashboards. Example: you may need to check the occupancy on your office floor, what best than an SVG office layout. Enjoy and share your visualizations. # SVG Component Posted by Joao Barbosa Dec 22, 2017 # Introduction Born from the need to simplify data consumption on dashboards, this component's goal is to provide more flexibility by leveraging SVG rendering capabilities. ## The Spark and the Idea The design of a particular dashboard to show a vehicle current status, based on his sensor data, required the use of the next image. The intention was to expose in the easiest way the information needed to support future actions. Despite all possibilities to do this using the available CTools components and methods, the idea of using an SVG image on the dashboard, which could be changed according to the data being received, sounded really cool. As you might know, SVG images are basically XML files with four interesting characteristics (at least the ones that matter for this component): they are scalable, they normally end up with file sizes shorter than similar image formats, they can have identifiable elements on their structure, and they are modifiable once injected into the HTML DOM, via javascript. So, to show something similar to the designed data visualization for our dashboard we only needed to produce the SVG image (badly represented on the people's icon below), with ids on the elements that we wanted to change, gather the information that each sensor was sending, and modify the SVG layout (colors and/or text), using the required business rules. ## How to Use With this in mind and a couple more requirements, we ended up by creating the SVG component's first version that is available through Pentaho Marketplace.The component is quite simple, you can add it to your dashboard layout just like any other component, and its use can be controlled mostly be these 4 parameters: • URL - the relative path to the SVG image that will be used to show your data • Keep Image - the indication if an SVG reload should be performed (or not) when a dashboard refresh is required by the user • CallBackDraw - a courtesy function that will be executed after all the SVG changes are performed • Data source - The information that will be used to manage all the modifications necessary to show your data current status ### Datasource Format and other requirements In fact, everything is going to be managed by the information on your component's data source. Each record will represent a new SVG modification and is characterized by the following 4 attributes (the order is not significant but the names are): • svgElementID - the SVG element id where we want the component to perform changes • changeType - the kind of change the component should execute on the current element. This first version allows style (any style attribute common on CSS styling can be applied to SVG images) and text updates • styleProperty - the style property that will be added/changed. On text type operations this attribute is not mandatory • value - the value that will be used on the style property or the text update. CDA example - data source record to perform the people icon's change illustrated earlier Summing up, the process to use this component should be something like this: • Install it from Pentaho Marketplace (server restart will be needed). The component's use was only tested on Pentaho 7.0 versions or higher. • Review the component's samples that will be available after installation on plugin-samples / sample-svg-component • On a requireJS dashboard, add your layout placeholder, like for any other component, add the SVG component and start to fill the component's parameters • create and join your data source with the component specification • test it on browsers that support HTML 5.0 and SVG rendering (and pay attention that your audience should also have this type of browsers) ## Where and when to use it Whenever you sense that data readability is improved by using a specific image, and you are about to show it on a CTools dashboard, consider the SVG component. Remember, you can always show sensor data on a table or a particular chart, but the data coming from car sensors will always be more easy to read if we can show it on a related image. ## Related work To spread the use of these visualizations in the Pentaho product family, a merge between Viz-API and SVG capabilities is also available. This new artifact, called Viz-SVG, will allow you to use this type of visualizations on Pentaho Analyzer, Pentaho Data Integration DET and also CTools. # JDBC SQL logging Posted by Kleyson Rios Dec 19, 2017 # Introduction A very typical use case for SQL logging is to troubleshoot various database related issues during development. For this purpose, the P6Spy is a useful library to log all SQL statement and parameter values before send it to database. The P6Spy is free, it’s use to intercepts and logs all your database SQL statements into a log file, and it works for any application that uses JDBC driver. If your application uses a DataSource, simply wrap your current DataSource object with P6DataSource. P6DataSource has a constructor method that accepts the DataSource to wrap. This is by far the simplest method especially if you use a dependency injection framework such as Spring or Guice. If your application obtains connections from DriverManager, simply modify your JDBC connection URL to include p6spy: . For example, if your URL is jdbc:mysql://host/db then just change it to jdbc:p6spy:mysql://host/db . P6Spy implements the JDBC 4.0 API allowing automatic registration of our JDBC driver with DriverManager. As long as your application obtains connections through DriverManager, you only need to modify your database connection URL to activate P6Spy. # Using it Pentaho can be configured to leverage on the P6Spy logging capability as well. To demonstrate its use, let's configure the P6Spy to log all the Sql queries passing thru the SampleData datasource. 1. Download the P6Spy .zip file and unzip it. 2. Copy into the pentaho-server/tomcat/webapps/pentaho/WEB-INF/lib folder the p6spy-x.x.x-log4j-nodep.jar and p6spy-x.x.x.jar files. 3. Edit the pentaho-server/tomcat/webapps/pentaho/WEB-INF/classes/log4j.xml file and append at the end of the file, before the tag </log4j:configuration>, the following configuration:  <!-- ========================================================= --> <!-- Special Log File specifically for p6spy --> <!-- ========================================================= --> <appender name="SPYLOG" class="org.apache.log4j.RollingFileAppender"> <param name="File" value="../logs/spy.log"/> <param name="Append" value="false"/> <param name="MaxFileSize" value="500KB"/> <param name="MaxBackupIndex" value="1"/> <layout class="org.apache.log4j.PatternLayout"> <param name="ConversionPattern" value="%d %-5p [%c] %m%n"/> </layout> </appender> <category name="p6spy"> <priority value="DEBUG"/> <appender-ref ref="SPYLOG"/> </category>  NOTE: If the spy.log file is not created after the start of the Pentaho server, update the File param with the absolute path. 4. Create the pentaho-server/tomcat/lib/spy.properties file and insert the following configuration as its content: driverlist=org.hsqldb.jdbcDriver appender=com.p6spy.engine.spy.appender.Slf4JLogger  5. Edit the pentaho-server/tomcat/webapps/pentaho/META-INF/context.xml and add the following configuration for the new P6Spy datasource: <Resource name="jdbc/DSspy" auth="Container" type="javax.sql.DataSource" factory="org.apache.tomcat.jdbc.pool.DataSourceFactory" maxActive="20" minIdle="0" maxIdle="5" initialSize="0" maxWait="10000" username="pentaho_admin" password="password" driverClassName="com.p6spy.engine.spy.P6SpyDriver" url="jdbc:p6spy:hsqldb:hsql://localhost:9001/SampleData" validationQuery="select count(*) from INFORMATION_SCHEMA.SYSTEM_SEQUENCES"/>  6. Restart the Pentaho server. 7. Logs in with Admin in the PUC and click on the Manage Data Sources button. 8. Edit the datasource SampleData/JDBC. 9. Change the Access type from Native(JDBC) to JNDI, and set the JNDI Name to DSspy and Test the connection. 10. Test it. Open the CTools Dashboard under the Public/Steel Wheels folder and/or create a new Analysis Report to use the SteelWheelsSales cube. 11. Verify all the SQL queries being logged in the spy.log file. Where: • current time - the current time is obtained through System.getCurrentTimeMillis() and represents the number of milliseconds that have passed since January 1, 1970 00:00:00.000 GMT. (Refer to the J2SE documentation for further details on System.getCurrentTimeMillis().) To change the format, use the dateformat property described in Common Property File Settings. • execution time - the time it takes in milliseconds for a particular method to execute. (This is not the total cost for the SQL statement.) For example, a statementSELECT * FROM MYTABLE WHERE THISCOL = ? might be executed as a prepared statement, in which the .execute() function will be measured. This is recorded as the statement category. Further, as you call .next() on the ResultSet, each .next() call is recorded in the result category. • category - You can manage your log by including and excluding categories, which is described in Common Property File Settings. • connection id - Indicates the connection on which the activity was logged. The connection id is a sequentially generated identifier. • statement SQL string - This is the SQL string passed to the statement object. If it is a prepared statement, it is the prepared statement that existed prior to the parameters being set. To see the complete statement, refer to effective SQL string. • effective SQL string - If you are not using a prepared statement, this contains no value. Otherwise, it fills in the values of the Prepared Statement so you can see the effective SQL statement that is passed to the database. Of course, the database still sees the prepared statement, but this string is a convenient way to see the actual values being sent to the database. # Caveats • For a production environment, check out the impact on the performance. • SQL queries will not be logged if the data was already loaded in the cache. • No tests was done for scheduled tasks. # Multi-tenant Environment For multi-tenant environment you might have the need to have the logged user on the log file as well. For that, you can also supply your own log message formatter to customize the format, including new data on it. Simply create a class which implements the com.p6spy.engine.spy.appender.MessageFormattingStrategy interface and place it on the classpath. # Customising the Log Message 1. Copy into the pentaho-server/tomcat/webapps/pentaho/WEB-INF/lib folder the attached P6SpyCustomMessage-1.0.0.jar file. 2. Append the to pentaho-server/tomcat/lib/spy.properties file following configuration: logMessageFormat=com.pentaho.krios.p6spy.message.formatting.PentahoCustomMessage 3. Restart the Pentaho Server. 4. Execute again the Reports/Dashboards and check out the new log message. Enjoy ! # Plotting ESRI Shapefiles on a Map in Pentaho Posted by Jesse Zuckerman Dec 18, 2017 This article was co-authored with Benjamin Webb Foundational to any map—whether it be a globe, GPS or any online map— is the functionality to understand data on specific locations. The ability to plot geospatial data is powerful as it allows one to distinguish, aggregate and display information in a very familiar manner. Using Pentaho, one can use shape files to plot areas on a map within a dashboard and then explore data geospatially. In this example, we can use C*Tools and Pentaho Data Integration to examine geographic spread of crime occurring in the city of Chicago. # Getting and Preparing Shapefiles There are many popular formats of shapefiles. The most popular is the format developed and regulated by ESRI. Many geographic analytics packages produce this format, so it is relatively easy to find shapefiles for common geographic boundaries, including Country, State/Province, County, Postal Code, Political boundaries and more. For this analysis, we’ll use data provided by the Chicago Data Portal. First, to get the shapefiles, we will download the Community Area Boundaries datafile in ESRI format. To use in Pentaho, we will prepare the shapefile by converting ESRI to GeoJSON. We will use a command line tool provided by GDAL titled ogr2ogr. More information on this suite of tools can be found on their website and on GitHub. To execute this tool we can use the following PDI transformation that will call ogr2ogr.exe with parameters including GeoJSON for the destination filetype as well as the source and destination files. From this process, important information is collected on the 76 community areas in Chicago. As seen below in a small sample of the GeoJSON file created, information is contained for the community of Douglas including a series of latitudes and longitudes representing the points that form a polygon. { "type": "FeatureCollection", "name": "geo_export_80095076-0a6b-4028-a365-64ec9f0350d7", "crs": { "type": "name", "properties": { "name": "urn:ogc:def:crs:OGC:1.3:CRS84" } }, "features": [ { "type": "Feature", "properties": { "perimeter": 0.0, "community": "DOUGLAS", "shape_len": 31027.0545098, "shape_area": 46004621.158100002, "area": 0.0, "comarea": 0.0, "area_numbe": "35", "area_num_1": "35", "comarea_id": 0.0 }, "geometry": { "type": "Polygon", "coordinates": [ [ [ -87.609140876178913, 41.84469250265397 ], [ -87.609148747578061, 41.844661598424025 ], [ -87.609161120412566, 41.844589611939533 ]… Next, we can get crime data from the same site. This very large file contains diverse information on every crime occurring in Chicago since 2001 including the location, type of crime, date, community ID etc. To reduce the size of this file with over 6 million rows, we first run a quick PDI transformation that will group the crimes by year and community. We need to join the crime dataset to the GeoJSON dataset we created as only that dataset contains the names of the community areas. Both datasets do share a common geographic code, allowing us to blend the data by community areas ID. For this C*Tools dashboard, we will use another PDI transformation as the source of the data. As will be seen later, when the user selects a year on the dashboard, this transformation will be triggered to get the total number of crimes for every Chicago community area in a given year. We will also get the max number of crimes for all neighborhoods, to be used later in the color gradient. With this transformation, we now have a set of data that includes key metrics along with their corresponding geographic areas represented as polygons. # Create the Dashboard with NewMapComponent Now to create the actual map in Pentaho, we will use the C*Tools Community Dashboard Editor (CDE). This will be done using the NewMapComponent under the Components View. This powerful component utilizes a mapping engine that renders the world for free (choosing between OpenLayers or Google engines). Then, GeoJSON or KML is used to plot polygons above the map. In our case, the GeoJSON file outline the communities throughout Chicago will be mapped. Much of the functionality will be used through JavaScript snippets in the Component Lifecycle steps, such as Pre-Execution and Post-Execution. As a brief example here, we can load our GeoJSON file with the following JavaScript in our Pre-Execution stage: // locate our GeoJSON file under resources. Note, the BA server does not // recognize the .geojson extension, so we rename to .js var getResource = this.dashboard.getWebAppPath() + '/plugin/pentaho-cdf-dd/api/resources'; var mapDef = '${solution:resources/geojson/community-areas-current-geojson.js}';

// here we pass in our GeoJSON file, and also specify the GeoJSON property
// to use as a polygon ID
this.shapeResolver = 'geoJSON';
url: getResource + mapDef,
idPropertyName: 'area_num_1'
});


This will plot the polygons, with 1 caveat. The NewMapComponent also takes a data source data source as input (see the properties tab). This data source must contain data that matches the IDs specified above in the GeoJSON file, and only those polygons for which data points with matching IDs exist will be rendered.

We can specify which columns from our data source to use as the ID in a snippet also in the Pre-Execution phase like so:

this.visualRoles = {
id: 0,                            // communityarea
fill: 1                           // crimes
};


Note, here we defined the column for id as the first column (index 0), and use the 2nd column (index 1) as the fill value (more below).

To load our pre-aggregated data and render it on our map, the Kettle transformation described above is used which takes a year parameter and then reads & filters the Aggregated-Crime-Counts.csv file.

This transformation ensures that the 1st column is the Community Area ID and the 2nd column is the # of crimes, to match our JavaScript above.

Finally, more JavaScript can be added to add additional dashboard features. For our heat map example, we want to vary the fill color based on # of crimes.

We've already linked the data with the code snippet above. The NewMapComponent has some defaults, but to ensure it works smoothly we can implement the fill function ourselves as follows, which is also implemented in the Pre Execution step:

// define the polygon's fill function manually based on the crimes/fill
// value incoming from the datasource
this.attributeMapping.fill = function(context, seriesRoot, mapping, row) {
var value = row[mapping.fill];
var maxValue = row[mapping.max];
if (_.isNumber(value)) {
return this.mapColor(value,
0,                                        // min crimes
maxValue,                // max crimes from dataset in yr
this.getColorMap()     // a default color map of green->red
);
}
};


The above function validates the incoming fill/crimes column, and then tweaks the default color map (green to red) and maps all values on a scale of 0 to the max number of crimes in a year (in 2015, this number was 17,336; coming from the western community of Austin, seen below). All values between will be somewhere on the gradient.

Another very useful function that can be implemented within the NewMapComponent is the tool tip, that will highlight information about a community area, displaying the community name and the total number of crimes, when hovered by a mouse. This is implemented in the Post Execution, again utilizing JavaScript.

function tooltipOnHover() {
/* TOOLTIP ON MOUSE HOVER 2 */
var me = this;
/*
** Define events for mouse move
**/
this.mapEngine.map.events.register('mousemove', this.mapEngine.map, function (e) {
if (!_.isEmpty(me.currentFeatureOver)) {
var modelItem = me.mapEngine.model.findWhere({
id: me.currentFeatureOver.id
});
$('#popupObj') .css('top', e.pageY -50) .css('left', e.pageX + 5) // html contained in popup .html( modelItem.attributes.data.area_name + '<br>Total Crimes: ' + modelItem.attributes.data.fill + '' ); } }); this.mapEngine.map.events.register('movestart', this.mapEngine.map, function (e) {$('#popupObj').fadeIn(500);
});
this.mapEngine.map.events.register('moveend', this.mapEngine.map, function (e) {
});
}


Effectively analyzing geospatial data, especially when used with enhancement tools like NewMapComponent can be a very powerful tool. From this basic example, we can better understand how crime exists across a very large city and how that spread has changed over time. Using the polygons allows us to better group the data in order to gain valuable insight.

This approach is heavily indebted to Kleyson Rios's NMC-samples repository, which has similar examples and can also be zipped & uploaded for exploring the NewMapComponent.

The code for this example can be found on GitHub.

By date: By tag: