
Data Science vs. (IT) Science Projects

Blog Post created by Bob Madaio on Feb 3, 2016

Of all the IT trends that are top of mind these days, (Big) Data Analytics is among the easiest to tie directly to business initiatives. Customers across every industry are looking to monetize their data, manage business risk, streamline operations and increase competitive advantage through improved customer and business intelligence. It’s easy to see how fast access to a well-rounded supply of blended data will support better decisions and business outcomes. Add data from operational technology (OT) sources like the Internet of Things (IoT) to the mix, and the potential for actionable insights and real business value increases exponentially.

 

So why isn’t everyone capitalizing on “big data” right now? Because the tricky part is “how.”


How do I know which data sources to collect, blend and analyze? How much data should I prepare for? How can I do this on the infrastructure I have in place? How should I blend data from my systems of record with social media and other external data sources?


If you’re asking these questions, you’re not alone. Most companies we talk to start with more questions than answers. They’ve got some very smart Data Scientists and a clear expectation that there’s gold at the end of this big data rainbow, somewhere. These initiatives often start ‘small’ – and, more often than not, they start outside of IT.


I generally refer to these as Data Science Projects, or Science Projects for short.


In this scenario, some smart folks set up a Hadoop cluster on some quickly bought or borrowed servers, and off they go. Soon it becomes clear that tools are needed to simplify the process of ingesting and blending new data sources. This is where Pentaho, a Hitachi Group Company, has excelled.

A leader in business analytics and a powerhouse in open-source-based data integration, extraction and blending, Pentaho offers an Enterprise Platform that dramatically simplifies blending data sources and running business-relevant analysis. It makes sense that Pentaho would see growth here. As Gartner states about Pentaho: “Customers report achieving above-average business benefits and the low license cost is the top reason why customers select the platform.”
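To make “blending” a bit more concrete, here is a minimal sketch of the kind of join-and-summarize step that Pentaho’s visual tools automate. It’s written in plain Python with pandas purely for illustration; the file names and fields are hypothetical, and Pentaho itself builds these flows graphically rather than in hand-written code.

```python
import pandas as pd

# Hypothetical extract from a system of record (e.g., CRM sales history)
sales = pd.read_csv("sales_history.csv")        # columns: customer_id, order_total, order_date

# Hypothetical external/social feed, already landed as JSON
mentions = pd.read_json("social_mentions.json")  # columns: customer_id, sentiment_score

# "Blending": join the two sources on a shared key so analysts can
# relate sentiment to actual purchasing behavior
blended = sales.merge(mentions, on="customer_id", how="left")

# A business-relevant rollup: average order value by sentiment band
blended["sentiment_band"] = pd.cut(
    blended["sentiment_score"],
    bins=[-1.0, -0.33, 0.33, 1.0],
    labels=["negative", "neutral", "positive"],
)
print(blended.groupby("sentiment_band")["order_total"].mean())
```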


With a tool like Pentaho fueling their science project, some Data Scientists strike gold: Business Value. (Maybe they should have been called Data Explorers? Anyway…) Better decisions can now be made, and suddenly they are dealing with the challenges of success: growth, and the newfound criticality of their project environment to actionable insight and business decision-making.


Much of the software they have leveraged to this point can scale to meet the growth challenges. The Pentaho Platform, for example, can certainly expand to meet the need, as Gartner also noted: “Customers recognize its ability to collect and transform many data sources, including social media, with 20% of reference customers claiming to use Pentaho to collect and analyze this type of data, placing it among the top three in this category.”


However, the makeshift cluster on which the science project was built often starts to show signs of stress. Some of the technologies that rely on the cluster require much more management than first anticipated. Performance suffers; cost, complexity and risk to the business begin to creep in.


This is generally the point when Data Scientists call IT for help, because, as it turns out, they are interested in the Data Science, not the (IT) Science Project.


It could have gone differently. Despite the prognostications that infrastructure doesn’t matter for Big Data, simplicity, scale and performance are, in fact, essential. Had IT been involved from the start, and had that forward-thinking IT department had the assistance of a vendor partner like Hitachi, things could have gone much more smoothly.

 

No, we wouldn’t necessarily have dragged in fully flashed-out VSP G1000s. (Although we are more than happy to do that when needed.) Rather, a conversation about a hyper-converged, scale-out platform that could grow with your data could have taken place. A system built with open source virtualization technology, specialized file-system IP, simple management and seamless scalability would have been considered. If that conversation happened today, the system (a Hitachi Hyper Scale-out Platform) could have been pre-configured with the Pentaho Enterprise Platform to speed the project’s time to value.


Data silos could be broken down, with multiple data types streaming quickly into an enterprise data lake. The headaches of NAS filers and external analysis servers could be collapsed into a single platform, where a virtual instance of your analytics tools can be spun up right on top of that data. Hyper-convergence – especially a platform that allows direct, high-speed ingestion of data that’s accessible to every virtualized app that needs it – is very, very cool. Add in scalability to 100+ nodes, and the future doesn’t look so scary.
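For a rough sense of what analyzing data “right on top” of a shared lake can look like, here is a hypothetical PySpark sketch. The paths, schema and join key are assumptions for illustration, not a description of HSP internals; the point is that the analytics engine reads data where it already lives instead of copying it out to dedicated filers and servers.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# One shared lake, many data types, analyzed in place
spark = SparkSession.builder.appName("lake-analytics").getOrCreate()

# Hypothetical paths on the shared platform
orders = spark.read.csv("/datalake/erp/orders/", header=True, inferSchema=True)
clicks = spark.read.json("/datalake/web/clickstream/")

# Blend structured ERP data with semi-structured clickstream, in place
joined = orders.join(clicks, on="customer_id", how="inner")

# Example rollup: revenue by referral channel
(joined.groupBy("referral_channel")
       .agg(F.sum("order_total").alias("revenue"))
       .orderBy(F.desc("revenue"))
       .show())
```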


That’s what is available to Hitachi customers today with our HSP + Pentaho big data appliance.


Simple, scalable, silo-free Data Science.
