
Drill through data at REST

Blog Post created by Biju Krishnan on Apr 19, 2017

It is predicted that the Internet of Things (IoT) will change our perception of enormous data volumes once again. If data in petabytes seems challenging for enterprises today, the era of IoT is set to usher in data volumes measured in exabytes and zettabytes in the very near future. Those who are familiar with the Hitachi Content Platform (HCP) will agree that it could be an ideal platform for hosting such data: it is highly scalable, cost effective and provides S3-like access to an industrial-grade object store. Most of this data is likely to be generated by sensors (things) and device gateways, and it is likely that the sensor data is streamed in a semi-structured form, for example as JSON, XML or in some cases CSV. Usually this data is then aggregated and stored in more efficient data formats like Parquet.

 

Data engineers see huge value in storing all of this data without exception, provided the costs are reasonable, because one can never know in advance which element of the data might prove useful in the future. For example, if your application uses sensor values such as temperature and pressure to perform predictive maintenance today, the business might wish to add humidity to the analysis tomorrow. If you had decided to purge the humidity values to save capacity, your business might lose an edge over its competitors for lack of data points to make decisions.

 

At the same time, not deriving value from the vast amounts of data stored on HCP might lead to the business questioning the return on investment from your project.

 

Now what if you could let your data scientists drill through a vast data repository hosted on HCP? What if they could run SQL queries on semi-structured data sets without having to worry about the schema of the data stored in those files? What if the files never needed to be moved from one platform to another for such complex tasks?

 

Recently I decided to take Apache Drill for a test ride to verify these possibilities. There are tons of blogs and tutorials on Apache Drill, so I will use just a few illustrations to make my point.

 

Illustration-1 shows how a data engineer is likely to design a data pipeline to prepare data for visualization or analysis by a data scientist. Designs may vary, but the central idea is that quite a few ETL steps and additional processing are needed before analytical tools are able to query and visualize semi-structured data. The process can be simplified with graphical tools like Pentaho Data Integration (PDI), but it still entails at least some delay in gaining insights from data stored at rest.

 

priortodrill.jpg

illustration-1

 

Illustration-2 shows how Apache Drill can significantly alter this process and probably save 50-90% of the time taken to derive insights from data at rest.

afterdrill.png

illustration-2

 

As you can see in the illustration above, Apache Drill eliminates the need for ETL or multi-stage processing before data scientists can query the data. The beauty is that it detects the schema on the fly using some clever techniques and provides low-latency query speeds with in-memory processing. All you need is a cluster of nodes known as drillbits to do the work, plus a JDBC or ODBC driver to query the data. The illustration shows Pentaho Business Analytics connecting via JDBC to the cluster coordinator (ZooKeeper), which is useful if you are like our customer Caterpillar Marine, with hundreds of data scientists exploring data. But even if you are a single data scientist with a desktop analytical tool (Tableau Desktop, Qlik Sense Desktop), you will be able to query data on HCP directly with Apache Drill.
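To make the schema-on-the-fly point concrete, here is the kind of query a data scientist could fire directly at raw JSON files. This is a minimal sketch: the path and field names are hypothetical, and it assumes a Drill storage plugin named hcp has been registered against the namespace (more on that below). The pattern itself, querying a file path as a table and drilling into nested fields with no DDL at all, is standard Drill SQL.

-- Hypothetical sensor files; no CREATE TABLE or schema registration is
-- needed, as Drill infers the structure of each file at query time.
SELECT t.device_id,
       t.reading.temperature AS temperature,
       t.reading.pressure    AS pressure
FROM hcp.`sensors/2017-04/*.json` t
WHERE t.reading.temperature > 80
LIMIT 10;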

 

As an example, here is what I did with some weather data in JSON stored inside an HCP namespace (bucket); see screenshot-1 below.

 

weatherdata.jpg

screenshot-1

 

To speed up the process of getting an Apache Drill installation running, I used the industry's leading distribution of Apache Drill, from MapR. MapR engineers are also leading contributors to the Apache Drill project.

 

Connecting the MapR-packaged Apache Drill to an HCP namespace was as simple as creating a new storage plugin using the Drill web interface: copy an existing storage plugin config and add the config elements highlighted in screenshot-2.

 

drillweb.png

screenshot-2

 

If you have a valid SSL certificate, SSL can be enabled as well.
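For reference, here is roughly what such a storage plugin definition looks like when the namespace is exposed through HCP's S3-compatible (HS3) API. This is a sketch, not a drop-in config: the bucket name, endpoint and credentials are placeholders (HCP typically derives the S3 access key from the base64-encoded username and the secret key from the MD5-hashed password, but verify this against your HCP version), and the authoritative values are the ones highlighted in screenshot-2. Registering the plugin under the name hcp is what lets the queries below reference hcp.`...`.

{
  "type": "file",
  "enabled": true,
  "connection": "s3a://weather",
  "config": {
    "fs.s3a.endpoint": "tenant.hcp.example.com",
    "fs.s3a.access.key": "<base64-encoded HCP username>",
    "fs.s3a.secret.key": "<MD5-hashed HCP password>",
    "fs.s3a.path.style.access": "true",
    "fs.s3a.connection.ssl.enabled": "false"
  },
  "workspaces": {
    "root": { "location": "/", "writable": false, "defaultInputFormat": null }
  },
  "formats": {
    "json": { "type": "json", "extensions": ["json"] },
    "csv": { "type": "text", "extensions": ["csv"], "delimiter": "," },
    "parquet": { "type": "parquet" }
  }
}

Flip fs.s3a.connection.ssl.enabled to "true" once a valid certificate is in place.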

 

Then all you need to do is set up JDBC or ODBC for MapR's Apache Drill by downloading the relevant libraries from the MapR website. From there you can either run SQL queries as shown below, or use something like Tableau Desktop to create visualizations (screenshot-3) from the weather data set, or any other data set stored in an HCP namespace, without any additional processing or movement of data.

 

0: jdbc:drill:zk=maprdemo:5181> SELECT * FROM hcp.`stocks.json` WHERE Country='Finland' LIMIT 20;

| _id                                  | Ticker | Profit Margin | Institutional Ownership | ... | Country  | ... |
| {"$oid":"52853808bb1177ca391c287e"}  | NOK    | -0.054        | 0.119                   | ... | Finland  | ... |
(output truncated: the stocks data set is too wide to reproduce in full here)

 

tableau.jpg

screenshot-3

 

I hope this post helps all HDS customers who are on the cusp of an impending data explosion, as well as those struggling to select the right platform for their data lakes. I also hope it helps data scientists and data analysts, who are likely to spend 50-90% of their time preparing data for analysis, cut that figure down significantly, in turn helping their organizations realize their return on investment sooner than expected. Meanwhile, I'll leave you with an illustration portraying the overall architecture of the solution.

 

Drill HCP.jpg

illustration-3

 

 

If you wish to know more or see a demo of the described solution, feel free to leave a comment below or catch me on Twitter.

 

And if you think Apache Hadoop is vital to your data pipeline, it's worth reading my colleague Thorsten Simons' post on how to connect Hadoop to HCP for long-term data retention.
