Paul Lewis

The Brain of a Grumbling Droid … Adding Data Lakes to Enterprise Information Management

Blog Post created by Paul Lewis on Jun 26, 2015



One time I went on a jungle cruise where the boat captain got us lost and we nearly got killed by head hunters…

 

…was the start of my comedic story until someone in the audience yelled “Can you pretend to be normal today?”  And by “audience” I mean the driver’s side backseat passenger in the carpool I was ferrying to work.

 

“Fine, then I will not regale you with a tale of the time I was in the middle of an 18th-century seaport town being pillaged by pirates while, coincidentally, also slowly floating in a boat,” I snapped back.  “If you insist on having a predictable and humourless tour of Endor, you’ve boarded the wrong Star Speeder!”

 

“So speaking of droids, if you want to talk about work, how many TBs of data do you think C3PO carries on board versus accesses over some form of wireless connection to whichever ship he happens to find himself trapped in today?” I inquired of the carload of information management professionals.

 

“Considering how much of a long, long time ago it was, even if it was far, far away, I would have to assume most of the few TBs C3PO needed were onboard, in some sort of electronic storage.  Spinning disk simply wouldn’t be able to handle the variety of extreme temperatures, altitude changes and garbage crushing he encounters on a daily basis.  My guess is he only connects to the ship situationally, maybe for occasional updates to firmware or blueprints to spheres of death,” opined the front seat passenger.

 

The final member of our foursome wonders less about his storage requirements and more about his attitude problem.  “Sure he needs to keep and access some relevant information to do his job of sidekick to the one true hero of the galaxy R2D2, but what kind of technologies are necessary for the constant whining?”.

 

Excellent question.  I take a deep breath to start my rant….

 

“Let’s focus on the whining, as the most technologically relevant feature of C3PO.  That one emotional characteristic is no doubt, in large part, providing comic relief to the force as a whole, so let’s assume complaining, disapproving, and bellyaching can largely be described as “business value”.  If we compare it to corporate Information Management terms (as we are not Jedi masters), we know that the conventional technologies used to deliver business value would include:

 

  • Source databases, typically relational databases that hold data collected and saved from applications built and bought for the purpose of running the business.  The source databases may or may not be a single source of truth and may or may not contain duplicate information, but they store the various operational transactions in progress and completed.  Examples would include billing, clients, applications, inventory, etc.

 

  • Enterprise Data Warehouse (EDW), the collection of specially designed and optimized databases commonly used for MIS (reporting), BI and Analytics.

 

  • Data marts, a simpler version of an EDW, usually derived from the EDW and scoped to a single functional area

 

  • ETL tools, that provide an ability to “select from” source databases, transform and cleanse data, and “insert into” destination databases

 

  • BI Tools, a suite of tools that read from marts and warehouses in order to create reports and/or dashboards
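
To make the “select from”, transform and cleanse, “insert into” pattern concrete, here is a minimal sketch in Python using the standard library's sqlite3 module.  Everything named in it is an assumption made for illustration: the billing source table, the fact_billing destination table, the run_etl helper and the sample rows are all hypothetical, and two in-memory SQLite databases stand in for a real source system and a real EDW.

```python
import sqlite3


def run_etl(source, warehouse):
    """Copy billing rows from a source database into a warehouse table."""
    # Extract: "select from" the operational source table (hypothetical schema).
    rows = source.execute("SELECT client, amount_cents FROM billing").fetchall()

    # Transform and cleanse: tidy client names, convert cents to currency
    # units, and drop rows that have no amount at all.
    cleaned = [
        (client.strip().title(), amount_cents / 100.0)
        for client, amount_cents in rows
        if amount_cents is not None
    ]

    # Load: "insert into" the destination table (hypothetical schema).
    warehouse.execute(
        "CREATE TABLE IF NOT EXISTS fact_billing (client TEXT, amount REAL)"
    )
    warehouse.executemany("INSERT INTO fact_billing VALUES (?, ?)", cleaned)
    warehouse.commit()
    return len(cleaned)


# In-memory databases stand in for the real source system and the EDW.
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE billing (client TEXT, amount_cents INTEGER)")
source.executemany(
    "INSERT INTO billing VALUES (?, ?)",
    [("  luke skywalker ", 1250), ("leia organa", None), ("han solo", 990)],
)

warehouse = sqlite3.connect(":memory:")
loaded = run_etl(source, warehouse)

# A BI tool would now read from the warehouse rather than from the source.
report = warehouse.execute(
    "SELECT client, amount FROM fact_billing ORDER BY client"
).fetchall()
print(loaded, report)
```

A real ETL tool adds scheduling, lineage and error handling on top, but the shape of the job is the same three steps.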

 

This model is crucial when you create the hundreds or thousands of operational reports necessary to run the business, especially if stability and structure are unavoidable for contractual or legislative reasons. 

Even where the classic model is required, it has its limitations in creating alternative or additional business value.  Some additional considerations when attempting to fully operate a gold-finished protocol droid include:

 

  • Batch vs real-time: For overnight reports, batch works perfectly fine.  Determining whether I need to bank left or right to avoid the meteor feels like a real-time assessment of the situation, one that has to produce some sort of prescriptive action.

 

  • Relational databases vs any data sources (sensor, network, visual, audio):  When you are producing multi-record reports and you can easily query a known application for the information, relational databases are the perfect source.  However, when you have to internalize your current location relative to the speed and distance of a hurtling rock, a variety of data sources becomes necessary.

 

  • MIS vs experimental analytics: when the business doesn’t change, concerns over the impact and cost of change don’t matter.  When real-time decisions require access to new information, potentially in new formats, the cost and time allocated to experimentation need to be eliminated.  If the blueprints turn out to be wrong, you can just go back to your home planet disappointed in yourself.

 

So what new technologies can complement the conventional IM model and create additional business value?

 

  • Data Lake: a place to store all the data you need, in its native format, until you need it.  A place to query the real-time stream of geo locations, network data, sensor information, complaint key-value pairs, and documents associated with the Frommer’s planet-by-planet vacation/survival guides at the point you need them, or never at all.

 

  • Integration and orchestration: a means to connect to real time streams of external sources, batch collect documents, and orchestrate analytical jobs.

 

  • Visualization tools: an ability to create various dashboards, map overlays, diagrammatic visuals and embed them in a variety of applications.  An interface to create situational machine learning algorithms, especially ones that would determine the most appropriate grievance to send to the Jedi council after landing upside down in the sand….not forgetting the inherent complexity in actually making a humanoid function in any way.

 

  • Bridging EDW and Data Lake: the EDW may be more valuable with access to Twitter sentiment, and credit card application documents may be more valuable joined with the customer records.  The Data Lake and EDW should be integrated and/or blended through the visualization tools.  A data lake alone, without the business oversight and governance applied within the EDW, could turn swampy: http://blogs.sap.com/innovation/analytics/from-data-lakes-to-data-swamps-01867475
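
The bridging idea can be sketched in a few lines of Python, again as an illustration under loud assumptions rather than a recipe.  The RAW_EVENTS list is a hypothetical stand-in for a data lake that keeps events in their native JSON format, the customer table is a hypothetical stand-in for the EDW, and the average_sentiment and blend helpers are made up for this example.

```python
import json
import sqlite3

# Data lake stand-in: raw events kept in their native format (JSON lines),
# with no schema imposed at write time. All records are made up.
RAW_EVENTS = [
    '{"customer_id": 1, "source": "twitter", "sentiment": -0.75}',
    '{"customer_id": 2, "source": "twitter", "sentiment": 0.5}',
    '{"customer_id": 1, "source": "sensor", "temperature_c": 41}',
    '{"customer_id": 1, "source": "twitter", "sentiment": -0.25}',
]


def average_sentiment(raw_events):
    """Parse events only at query time and average sentiment per customer."""
    totals = {}
    for line in raw_events:
        event = json.loads(line)
        if event.get("source") != "twitter":
            continue  # other event types stay in the lake, untouched
        total, count = totals.get(event["customer_id"], (0.0, 0))
        totals[event["customer_id"]] = (total + event["sentiment"], count + 1)
    return {cid: total / count for cid, (total, count) in totals.items()}


def blend(edw, raw_events):
    """Join governed customer records from the EDW with lake-derived sentiment."""
    sentiment = average_sentiment(raw_events)
    customers = edw.execute("SELECT id, name FROM customer ORDER BY id").fetchall()
    # Customers with no sentiment in the lake get None instead of being dropped.
    return [(name, sentiment.get(cid)) for cid, name in customers]


# EDW stand-in: a small, structured, governed customer table.
edw = sqlite3.connect(":memory:")
edw.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT)")
edw.executemany(
    "INSERT INTO customer VALUES (?, ?)",
    [(1, "C-3PO"), (2, "R2-D2"), (3, "Yoda")],
)

blended = blend(edw, RAW_EVENTS)
print(blended)
```

The design choice worth noticing is that the sensor reading is stored but never parsed into a schema: the lake keeps it until some future question needs it, while the EDW side stays small and governed.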

 

Therefore…C3PO presumably contains a data lake with a horizontally scalable amount of fast storage and hyper-converged compute, plus an ability to collect a variety of data and process it both in real time and in batch, applying thousands of machine learning algorithms.  However, his scheduled monthly reports likely blend the data lake with the EDW before he emails them to Yoda.

 

If you want to see how one would do that…see this: http://www.hds.com/solutions/big-data/

 

I exhale.

 

The passengers are silent for the rest of the ride.  They know I am right but are unwilling to admit it.

 

After we park, and as we are letting the fifth passenger out of the trunk, he comments:  “All of your joke premises are contrived and hard to believe….do you have a plan B?”

 

No…my plans are numbered.
