Practically speaking, impacts of Big Data on the Data Warehouse

Blog Post created by Paul Lewis on Sep 29, 2014

The literary equivalent of “breaking the fourth wall” is simply called narration. 

 

Even the first sentence of this blog would be considered narration, a means to speak (literarily) directly to the reader, who in this case happens to be you.

 

For a few minutes my goal is to both entertain you and potentially provide a modicum of enlightenment on an interesting subject.  Presumably if you didn't find the subject interesting, you would not have double-clicked the link (or right-click open tab for those of us who care to do so).  In fairness, your boss may have requested you read the entry, and therefore you are forced to read the content in case she somehow asks about it in an upcoming meeting.  I’m fine with it either way, as long as you are still with me.  Hello? Hello?  Just checking.

 

For those returning readers (thank you), you are fully aware of what I REALLY THINK about Cloud, Converged and especially Big Data, in terms of level setting on practical definitions. For new readers, I will thank you in future blogs…(I’m not sure if you are a keeper yet). For Big Data specifically, here is a reminder of what we’re talking about:

 

Big Data, like Cloud, does not refer to a particular technology or set of technologies.  Big Data defines a class of information management problems that are difficult or impossible to efficiently solve using conventional data management tools and techniques.  Big Data is often characterized by the 5 V’s: Volume, Velocity, Variety, Veracity, and Value.  The first three V’s are certainly common, while the latter two are increasingly evident in the Big Data vernacular.  There’s a good chance that the following points characterize the types of challenges the five V’s may be causing in your organization:

 

  • The VOLUME of data you are collecting overwhelms the people, processes and technical capabilities in your Enterprise Information Management group. You will likely need to find a distinctually (not a word, but it should be) different set of tools and techniques to solve this problem.
  • The ever-increasing VELOCITY with which you are expected to analyze and process data already exceeds the abilities of your people and infrastructure.
  • You wish the scope of analytics were a small set of data warehouses, but reality suggests that the scope includes a VARIETY of data types: call recordings, Word documents, sensor logs, tweets, likes, retail surveillance video footage, etc.
  • Some parts of your organization have doubts concerning the VERACITY of the facial recognition in the video surveillance data. Storing the video files is easy; precisely identifying a face in a single frame of video is anything but.
  • Statements like “data is our greatest source of untapped VALUE” are prolific within your organization. Yet those who make such utterances are conspicuously silent when pressed to quantify or justify that VALUE.

 

Still reading? Good.

 

It would not shock me to know that MIS, Business Intelligence and Business Analytics are front of mind within your organization. It would also not shock me if you have spent large sums of money on BI and related systems, both on the solutions themselves and on integrations to source systems. My guess is that, at a very high level, your data warehouse architecture could be loosely approximated as follows:


[Figure: EDW.png — high-level enterprise data warehouse architecture]


Your implementation “likely” contains numerous source systems (that of course need to be backed up/archived/protected), staging areas for a variety of transformations and quality-enhancing operations, an EDW to serve as the container of structured information assets, and many data marts to be used for specific analysis tasks (a toy sketch of that flow follows below). Of course, all of this complexity is cleanly automated and fully documented. Are there still IT operations that aren’t fully automated and documented?
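If the boxes and arrows feel abstract, here is a deliberately tiny sketch of that source-to-staging-to-EDW-to-mart flow in Python, using SQLite as a stand-in warehouse. Every table, column and value below is an invented example to show the shape of the flow, not a prescription:

    import sqlite3

    # Toy pipeline: source -> staging -> EDW -> data mart, all in one in-memory DB.
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE src_orders (order_id INT, cust TEXT, amount TEXT);          -- raw source extract
        CREATE TABLE stg_orders (order_id INT, cust TEXT, amount REAL);          -- staging: cleansed types
        CREATE TABLE edw_orders (order_id INT PRIMARY KEY, cust TEXT, amount REAL);
        CREATE TABLE mart_cust_revenue (cust TEXT PRIMARY KEY, revenue REAL);    -- one analysis task
    """)
    db.executemany("INSERT INTO src_orders VALUES (?, ?, ?)",
                   [(1, "acme", "100.50"), (2, "acme", "  75"), (3, "globex", "bad-data")])

    # Staging: a quality-enhancing transformation (keep only rows that look numeric).
    db.execute("""
        INSERT INTO stg_orders
        SELECT order_id, cust, CAST(TRIM(amount) AS REAL)
        FROM src_orders
        WHERE TRIM(amount) GLOB '[0-9]*'
    """)

    # EDW load, then a derived mart for one specific question: revenue per customer.
    db.execute("INSERT INTO edw_orders SELECT * FROM stg_orders")
    db.execute("INSERT INTO mart_cust_revenue SELECT cust, SUM(amount) FROM edw_orders GROUP BY cust")
    print(db.execute("SELECT * FROM mart_cust_revenue").fetchall())   # [('acme', 175.5)]

Now multiply that by hundreds of sources, thousands of transformations and a few dozen marts, and you have the diagram above.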

 

All is well and good if you DON’T have a Big Data problem.  The architecture above works quite nicely if you have relatively small, homogeneous source data sets.  However, let’s say the business is attempting to introduce new value into the organization that “tweaks” the kind of information you are now being asked to “steward”.  Perhaps a new mobile application is deployed to millions of clients whose interactions need to have REAL TIME impacts on decisions.  Maybe the sentiment of social conversations needs to be analysed with/against the sentiment of thousands of recorded contact centre interactions.  Well then, you have a few more concerns with the design:

 

  • Volume: 
    • Each additional data repository, from source system through data mart, in this architecture may need to store petabytes of data versus terabytes
    • It is impractical to maintain operational and offsite backups for multiple, massive data sets
    • Streaming data from storage to the processing engine (e.g. an RDBMS) leads to unacceptable query response times at big data scale

 

  • Velocity:
    • ETL can entail massive lag between data-generating events and delivery to business consumers
    • A “schema on write” orientation requires extensive up-front design and analysis, further delaying the derivation of value from data (a small sketch of the contrast follows this list)
    • Traditional SAN-based architecture struggles to grow fast enough to meet demands
    • Batch-oriented architecture cannot deliver real-time insight
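To make the “schema on write” point concrete, here is a minimal, hedged sketch of the contrast in plain Python. The event fields are invented for illustration; the point is only where the structure gets imposed:

    import json

    # Schema on write: every record must fit columns designed up front;
    # a new field means a schema change and reload before anyone sees it.
    DESIGNED_COLUMNS = ("event_id", "user", "amount")

    def write_row(record):
        # Anything outside the designed schema is rejected (KeyError) or dropped.
        return tuple(record[col] for col in DESIGNED_COLUMNS)

    # Schema on read: land the raw events as-is; impose structure per query.
    raw_events = [
        '{"event_id": 1, "user": "ana", "amount": 9.99}',
        '{"event_id": 2, "user": "bob", "amount": 5.00, "geo": "YYZ"}',  # new field, no reload
    ]

    def read_with_schema(lines, fields):
        for line in lines:
            rec = json.loads(line)
            yield {f: rec.get(f) for f in fields}   # structure applied at read time

    for row in read_with_schema(raw_events, ["user", "geo"]):
        print(row)   # {'user': 'ana', 'geo': None} then {'user': 'bob', 'geo': 'YYZ'}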

 

  • Variety:
    • EDW architecture is optimized for relational data generated by business applications
    • Unstructured and semi-structured data is becoming as important as structured data for analytics
    • The cost of up-front design and analysis is exacerbated by the variety of source data

 

  • Veracity:
    • A lack of implicit trust in data sources requires an environment friendlier to agile discovery and exploration.  Yet the high cost of design and infrastructure inhibits agility, creating an “analyze to invest, invest to analyze” paradox.

 

  • Value:
    • Organizations often struggle to find value in their data.  Without careful consideration, big data will only make value harder to find.

 

Hopefully, you are not concluding that your investment has been a huge waste of time, money and effort because you perceive a Big Data tsunami coming your way.  Clearly you are solving REAL business problems now, and you will need to CONTINUE to solve those problems in the future.  Your current architecture is not going away, but new tools and techniques will need to be explored and exploited to efficiently derive business value from this new onslaught of requirements.

 

The data lake architecture has emerged as an answer to the challenges posed by big data to conventional data management architecture.  Note that the data lake complements the conventional BI and analytics architecture.  To suggest wholesale replacement given the substantial investments made by organizations throughout the world in BI would be folly.


[Figure: Data lake.png — data lake architecture complementing the conventional BI stack]


A data lake (or, as I call it, the seven seas lagoon) helps by: storing data as-is, processing it in place, eliminating operational backups, optimizing the placement and storage of data, scaling simply and cheaply, being amenable to exploration and analysis, and working at petabyte scale. (A toy “process it in place” sketch follows.)
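“Processing in place” is exactly what the Hadoop MapReduce pattern does: ship the computation to the raw files rather than streaming the files to an engine. Here is a hedged, toy version of that pattern in Python; on a real cluster the two stages would run as separate mapper and reducer programs (e.g. via Hadoop Streaming), and the log layout below is an invented assumption:

    from itertools import groupby

    def mapper(lines):
        # Emit (user, 1) per raw log line; tolerate malformed lines, as as-is data demands.
        for line in lines:
            parts = line.split()
            if len(parts) >= 3:
                yield parts[2], 1          # assumption: user id is the third field

    def reducer(pairs):
        # Sum counts per key; Hadoop sorts mapper output by key before this stage.
        for key, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
            yield key, sum(n for _, n in group)

    raw = ["2014-09-29T10:01 GET u123 /cart",
           "2014-09-29T10:02 GET u456 /home",
           "2014-09-29T10:02 GET u123 /pay"]
    for user, hits in reducer(mapper(raw)):
        print(user, hits)                  # u123 2, then u456 1

No up-front schema, no load step: the files sit where they landed, and the code goes to them.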

 

Your implementation “may” require: federating your source information rather than centralizing it (you ultimately can’t control or forecast all of it); streaming or complex event processing, using an in-memory store to manage millions of new transactions per second (sketched below); a means to wrap unstructured information with metadata and store it in object form for use in analytics; and scale-out analytical processing in Hadoop clusters to create combined business value from all sources of information.
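The complex event processing piece deserves a sketch of its own. This is a minimal in-memory sliding window in Python; the rule (three payment failures from one client within 60 seconds) and the field names are invented assumptions, chosen only to show how a decision can happen in-stream rather than after a batch load:

    from collections import defaultdict, deque

    WINDOW_SECONDS = 60                    # assumption: a 60-second pattern window
    THRESHOLD = 3                          # assumption: alert on the third failure

    windows = defaultdict(deque)           # client_id -> timestamps of recent failures

    def on_event(client_id, event_type, ts):
        # Called for every incoming event; the decision happens in-stream.
        if event_type != "payment_failed":
            return
        w = windows[client_id]
        w.append(ts)
        while w and ts - w[0] > WINDOW_SECONDS:    # evict events outside the window
            w.popleft()
        if len(w) >= THRESHOLD:
            print("ALERT: %s had %d failures in %ds" % (client_id, len(w), WINDOW_SECONDS))
            w.clear()                              # alert once per burst

    # Simulated stream: the third failure inside the window fires in real time.
    for ts, client, event in [(0, "c1", "payment_failed"), (20, "c1", "payment_failed"),
                              (45, "c2", "login"), (50, "c1", "payment_failed")]:
        on_event(client, event, ts)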

 

If you have fainted from this level of enlightenment, I’ll wait.  Good now?

 

I know you are an intelligent and well-read (well…now at least) technology professional, but if you would like more information on how to implement these types of solutions, look here...  http://www.hds.com/solutions/big-data/

 

And if you don’t like that flattery, what kind of sycophant do you want me to be?
