Geek Out With These Books
by Ken Wood on Dec 28, 2011
Amy Hodler’s post a few weeks ago on the Cloud Blog inspired me to share some of my own geek-related book buys from 2011. They are as follows (in my preferred ranking).
- The Grand Design By: Stephen Hawking (@Prof_S_Hawking)
- I’m a huge Stephen Hawking fan and have read (more than once) every book he has published—which will explain the next book pick.
- The Illustrated – A Brief History of Time & The Universe in a Nutshell (double book release) By: Stephen Hawking (@Prof_S_Hawking)
- Having read the original versions of these books, I found this superbly illustrated release packed with high quality, glossy pictures compared to the original books. This is more of a collector’s edition and of course, when reading a Hawking book, a quality picture is worth a billion-billion (Carl Sagan reference) words. The best part of this book is that I bought it at an “everything must go” blowout sale as the Borders in my neighborhood was shutting down.
- Holographic Data Storage – from Theory to Practical Systems By: Kevin Curtis; Lisa Dhar; Adrian Hill; William Wilson; Mark Ayres
- Since I was researching some optical storage technologies for Hitachi, this book came highly recommended and from an interesting angle. Customers were asking about the Hitachi references within the book, so I bought it. It has been extremely helpful for me to understand this evolving area of technology which I believe will be game changing in the future.
- CUDA by Example – An introduction to General-Purpose GPU Programming By: Jason Sanders; Edward Kandrot (@ekandrot)
- Another part of my research for hardware accelerated applications and their uses in enterprise applications.
- HTML5 – Step by Step By: Faithe Wempen M.A.
- Mainly purchased this as an HTML5 reference book for an internal project I am working on.
- Adobe Dreamweaver CS5 with PHP: Training Source Code By: David Powers
- Same project support as above.
- HTML5 – 24–Hour Trainer By: Joseph W. Lowery; Mark Fletcher
- Again, same project support as the above.
Also, here are some miscellaneous books I picked up at a clearance table. If you’re like me, you can’t pass up one of those 70% off clearance deals to fortify your technical library. And since I do a lot of video and audio editing, I also needed these for some personal projects.
- Producing Great Sounds for Digital Video By: Jay Rose
- Audio/Video Protocol Handbook By: Jerry Whitaker
- Gigahertz and Terahertz – Technologies for Broadcast Communications By: Terry Edwards
What are your top book recommendations from 2011?
Is it COTS or Commodity?
by Michael Hay on Dec 21, 2011
I find the IT community seems to be in a state of confusion between the two—now mind you I think that some people get it and can easily discriminate between the two. Commercial off the Shelf (COTS) offerings are just that. A more formal definition of COTS from Wikipedia follows:
In the United States, Commercially available Off-The-Shelf (COTS) is a Federal Acquisition Regulation (FAR) term defining a non-developmental item (NDI) of supply that is both commercial and sold in substantial quantities in the commercial marketplace, and that can be procured or utilized under government contract in the same precise form as available to the general public. For example, technology related items, such as computer software, hardware systems or free software with commercial support, and construction materials qualify, but bulk cargo, such as agricultural or petroleum products, do not. (source: Commercial off-the-shelf – Wikipedia, the free encyclopedia)
My colleague Ken Wood talked about commodity in a post several months ago, Soybean Is A Commodity, where he muses on what is and is not a commodity. His summary is that, basically, the output and the resulting measures of many of the devices and systems produced in the ICT field are commodities. However, the systems and devices themselves aren’t.
He says, “It is my opinion that there is a misunderstanding and confusion in the IT industry between ‘commodity goods’ and ‘consumer products’ when it comes to technology. I can’t pinpoint the exact origin of why or how these two concepts seem to have become synonyms for each other, especially in the IT industry, but there is a difference between commodity goods and consumer products.”
Personally, I think that this stems from confusion about COTS and commodity, and may have crept into the ICT vocabulary just like “NIC Card” and “Transparent to the Application” (see my previous post, where I ranted on the topic of language misuse in ICT). While occasional misuse is relatively harmless, I believe that the misapplication of commodity has resulted in inappropriate thinking about the costs of technology. Let’s explore this last point for a bit.
Let’s assume that hard disk drives and the resulting capacity were a commodity. If so, they are a strange beast, I would have to say. In fact, they may even represent a unique kind of commodity that follows the law of supply and demand but has planned cost erosion. The cost erosion is associated with predictions from industry patterns like Moore’s law, and is somewhere between 20% and 30% per year. Imagine that: a commodity with a predictable annual per-unit price decline. I’m sure that the mathematical wizards on Wall Street would love to have something with the level of predictability experienced in both consumer and enterprise capacity production/purchases. Sure, tragedies like those in Thailand and restrictions on rare earth metals can cause disruptions in supply that have the potential to increase costs if demand is not damped or constrained, but through innovative human potential and given enough time, even these unfortunate events have little impact. So is storage capacity a commodity? I would say that unless the definition of a commodity has changed, neither the capacity measures nor the actual devices are a commodity. They are rather COTS devices with commodity properties.
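To make the idea of planned cost erosion concrete, here is a minimal Python sketch of a per-unit price that declines by a fixed percentage every year. The starting price of 100 and the function name `projected_price` are assumptions for illustration only; the 20% and 30% annual rates are the two ends of the range mentioned above.

```python
def projected_price(price_today: float, annual_decline: float, years: int) -> float:
    """Return the per-unit price after `years` of compounding annual decline.

    annual_decline is a fraction, e.g. 0.20 for a 20% drop per year.
    """
    if not 0.0 <= annual_decline < 1.0:
        raise ValueError("annual_decline must be in the range [0, 1)")
    return price_today * (1.0 - annual_decline) ** years


# Hypothetical starting price per unit of capacity (not a real market figure).
start_price = 100.0

for rate in (0.20, 0.30):
    # Price at year 0 through year 5 for each erosion rate.
    prices = [round(projected_price(start_price, rate, year), 2) for year in range(6)]
    print(f"{rate:.0%} decline per year: {prices}")
```

Under these assumptions, the price falls to roughly a third of its starting value after five years at 20% per year, and to roughly a sixth at 30% per year. That kind of predictable decline is not how a typical commodity price behaves.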
With that in mind, what about CPUs, memories, and more importantly advanced systems that aggregate and combine COTS in unique ways to release innovation? In my opinion the clear answer is no. Storage, servers, networking, etc. are not commodity, but surely can be COTS. Obvious questions are: Why is this important? Why does it matter? I see this as important because of the potential for COTS to contain innovations unique to a particular technology supplier. This matters to any consumer of ICT because these innovations may actually be a better match to your business, and potentially even the entire market as a whole.
This last statement “entire market as a whole” is interesting because I see that the fundamental tectonic plates of the technology industry are shifting to favor more OPEX-friendly technologies. It also means that as a consumer, you may be willing to pay more for innovation in the short term especially if the technology delivers innovation you can leverage and amplify, or it reduces your OPEX such that you can reinvest money elsewhere. So where do we see this occurring in the industry now? Well, I would say that the trend to deliver complete IT stacks as I’ve discussed in The Rack is the New Server, the Data Center is the New Rack, is an example where CAPEX may be a bit higher but the potential for savings on the backend through staff reallocation, reduced maintenance costs, and assured configurations may make the slightly higher CAPEX worth it.
So the next time you hear an IT professional say something like, “That’s just a commodity technology…” stop them and replace commodity with COTS. Keeping this terminology misuse alive in the ICT industry serves to devalue the innovations that vendors add, that users can take advantage of, and that creative companies can leverage to engender new innovation on top. I personally fear that without a grass roots effort to make a change here, as an industry we are going to be increasingly satisfied with mediocre offerings and products.
With All The Talk Around Cloud
by Dave Wilson on Dec 20, 2011
With all of the talk around the cloud and healthcare’s increasing movement toward adopting cloud technology, there are some issues unique to healthcare that any organization must ensure have been addressed. These issues are why some healthcare providers lag behind other industries in moving to cloud technologies. Both cloud service providers and healthcare organizations should heed these areas when looking at cloud adoption.
Data Movement Across Borders
While a cloud service provider may be located within the country of origin, some of the cost saving benefits that can be realized by customers are due to the economies of scale that the service provider attains by sharing the infrastructure between multiple customers. This may mean that a cloud provider backs up or replicates data at a secondary site that does not reside within the original country–think Belgian hospitals’ local cloud provider who backs up data into their German data center. In many countries this would violate their privacy regulations and can be quite a complex and expensive problem to address, particularly if there is a breach of patient information. Healthcare organizations need to ensure that their data does not move across borders that it is not allowed to.
It would be naive to think that a facility would stay with one cloud provider forever. Cloud providers are free to manage their infrastructure as they see fit—a benefit for facilities who don’t want to worry about this infrastructure component. But some caution is advised. Customers need to know that their cloud provider is using accepted standards to store data. Proprietary mechanisms of storage will make migration very difficult in the future. An understanding of the cloud provider’s infrastructure and contractual agreements that ensure not only the ability to remove data but also assistance in migrating this data should be considered a high priority for any organization looking to adopt the cloud.
Ownership of the Data
This has been highlighted as a concern, but it should be clearly defined. Patient data belongs to the customer and the patient. The cloud provider is providing a service – network, storage, application, infrastructure, resources – but they have no claim to the data. The regulatory constraints should support this, in that patient data is subject to privacy and security laws such that a cloud provider could not, for example, sell access to the data to a marketing company. The customer is entitled to move, manipulate, change and otherwise remove data from the provider as desired. It is worth having this written into the contract so that all parties are clear.
Privacy and Security Compliance
Many organizations are reluctant to give up control of their patient information as there are certain risks that may suddenly become beyond their control. A breach of privacy falls to the healthcare organization to manage, and a cloud provider becomes an entity that threatens that control. There are several ways to mitigate these risks:
A. Contractual compliance with stiff financial penalties for any breach of privacy such that the healthcare provider has a course of action to rectify the breach without undue cost burden;
B. Requirement of the cloud provider to meet regulatory compliance, regular audits of this compliance by third parties and immediate actions to rectify any gaps;
C. Private cloud models that ensure the data is stored on the premises of the healthcare organization while still getting the benefits of the cloud;
D. Use of the cloud for non-critical applications such as email, clinical collaboration, analytic tools, etc.
There is as much talk about cloud security as there is about privacy concerns, and they are somewhat related. Interestingly, HHS claims that of all the HIPAA breaches, only 6% can be attributed to hacking a system. The majority of cases involve theft of a computer, likely for the value of the computer and not the information within. A cloud provider will have top-notch security protocols and processes that any healthcare organization should understand prior to signing a contract. How does the data center handle phishing or denial of service attacks? Do they have virus protection? What are the physical security aspects to prevent unwanted access? In many cases the cloud provider will have better systems in place than the organization itself – but these should be investigated.
Healthcare deals with mission critical and life or death information. A cloud provider needs to understand that the architecture needed for healthcare is typically more robust than in other industries. Down time can’t be tolerated and service level agreements need to clearly define the expected response times. In Canada, Canada Health Infoway specifies that medical images must be stored in and retrieved from the Diagnostic Repository within 15 minutes of acquisition or request. These types of requirements must be written into the contracts prior to agreement.
The cloud can bring many benefits to healthcare organizations, but as with any new technology, due diligence needs to be done to ensure that better patient outcomes can be achieved at the same level of confidence as they are without cloud technologies.
Capacity Efficiencies, Again and Again
by Claus Mikkelsen on Dec 19, 2011
Why again? Well, it was about a year and a half ago when I last blogged, and much has changed since then, although the subject is still front and center. David Merrill has certainly discussed this numerous times as well, including this always-amusing post from March.
So why bring Capacity Efficiency (CE) up again? Well, two recent events bring it back to center stage in my mind.
The first event was a meeting with a prospective customer a number of weeks ago who was looking to secure a fairly large amount of storage capacity, and kept hammering away at our sales guy for the “bottom line”.
“What’s your dollars per terabyte?” he kept asking.
A befuddled sales team (and frustrated yours truly) hung in there, and we were finally able to turn it into a constructive conversation. But I’m still baffled at how many people have not let go of the cost/TB mentality.
The second was a meeting I had with Dave Russell of Gartner when we were discussing data protection issues. David Merrill was on the phone during this discussion. I like Dave Russell, and he’s a great analyst, but when he said that we keep 12-15 copies of all data, I guess I was a bit surprised that it was that high. Then David Merrill chimed in that his assessment was 11-13 copies.
So now I’m thinking that two of the smartest guys in this biz are agreeing that we’re keeping way too much data. I’ll come back to these numbers in a few weeks when Ros Schulman and I blog on this in our data protection series (with Merrill, just to confuse you even further). But keep that number of 11-15 in mind.
So now let’s mix all the various ingredients: my original blog from March 2010, David’s blog from March of this year, the “let’s all keep a dozen copies” discussion with Russell, and the fact that many folks are still in the per-TB world when it comes to storage purchases. Well, mix that together and you get a pretty bad-tasting stew.
But with CE, getting the most out of every TB you purchase is becoming a much larger issue as I peel back the layers. Thin provisioning (including write same and page zeroing), single instance store, deduplication, compression, dynamic tiering, archiving, etc., when multiplied by a factor of 15, makes a huge difference in data center and storage economics.
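As a rough illustration of why the copy count matters so much, the Python sketch below multiplies a primary data set by the number of copies kept and then applies a single combined reduction ratio. The 100 TB primary data set, the 3:1 reduction ratio, and the function name `stored_capacity_tb` are assumptions made up for this example; only the factor of 15 comes from the discussion above. Treating thin provisioning, deduplication, compression and the rest as one ratio applied evenly to every copy is a deliberate simplification.

```python
def stored_capacity_tb(primary_tb: float, copies: float, reduction_ratio: float = 1.0) -> float:
    """Capacity actually consumed when `copies` of the data are kept.

    reduction_ratio is a combined capacity-efficiency factor,
    e.g. 3.0 means 3 TB of data fits in 1 TB of purchased capacity.
    """
    if reduction_ratio <= 0:
        raise ValueError("reduction_ratio must be positive")
    return primary_tb * copies / reduction_ratio


primary_tb = 100.0      # hypothetical primary data set
copies = 15             # upper end of the 11-15 copies discussed above
reduction_ratio = 3.0   # hypothetical combined efficiency of 3:1

without_ce = stored_capacity_tb(primary_tb, copies)
with_ce = stored_capacity_tb(primary_tb, copies, reduction_ratio)

print(f"Without capacity efficiency: {without_ce:.0f} TB")
print(f"With a {reduction_ratio:.0f}:1 reduction:   {with_ce:.0f} TB")
print(f"Capacity avoided:            {without_ce - with_ce:.0f} TB")
```

With these made-up numbers, a 3:1 efficiency gain avoids 1,000 TB of purchased capacity for a 100 TB data set, which is why a simple cost/TB comparison misses most of the picture.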
For other posts on maximizing storage and capacity efficiencies, check these out: http://blogs.hds.com/capacity-efficiency.php
Great Books for Geek Wannabes
by Amy Hodler on Dec 16, 2011
It’s that time of year. We’re busy finishing up all those loose ends that were to be done “before the end of the year” as well as juggling family and holiday plans. Despite the usual frenetic start, I really enjoy this time of year because once we slow down just a bit, most people are in the mood for thoughtful conversations about what we want out of the next 12 months.
For me, those conversations usually include discussing and recommending favorite books as a way to share what’s been useful for us. Since recommendations from people with similar interests are usually more helpful, below are my book recommendations from 2011, not all of them new, for those who want to be a geek but really aren’t. (You know who you are…or you know who those people in your life are. We might love the idea of fractals and quantum mechanics…but we can’t do math in our heads.)
Author: James Gleick (@JamesGleick)
About: A historical perspective of information.
Why read it?
It’s a beautifully written study of information as its own historical topic, which I haven’t seen anyone else do.
Gleick does a wonderful job explaining some very difficult concepts and I love the history of great discoveries. Even more impressive are the implications when his historical evaluation is taken in its entirety. I believe it reveals future trends that will impact how we relate to information in the near and long term. Although I found the first few chapters to be a little slow, I’ve actually read it twice and may read it a third time…it’s that good.
Author: Nassim Nicholas Taleb
About: How the unpredictable is really unpredictable and how to deal with that.
Why read it?
We should all be skeptical of models and people that “predict” what will happen, but we also need to plan for success and different possibilities. Taleb does a great job of explaining why the improbable usually has a lot more impact on our lives and businesses than what we planned for, and gives some advice on how to deal with that. (Also, the summary of fractals and self-similar replication at the end was really helpful for this geek wannabe.)
Title: World 3.0
Author: Pankaj Ghemawat (@PankajGhemawat)
About: How distances and borders still heavily influence our lives, businesses and economics in general.
Why read it?
This should have been called “the world is NOT flat” as it’s really a counterpoint to the book The World is Flat and for that reason alone I think it’s a must read. So many people either blindly favor both globalization and deregulation—or oppose both of them. Ghemawat offers what I think is a more balanced option where these are not linked, binary choices. It’s worth picking this up just to understand an alternative way of looking at globalization.
Author: Tim Wu (@Superwuster)
About: Ebb and flow of decentralization and monolithic centralization of power in the information industry.
Why read it?
It’s a fascinating and entertaining review of the rise (and stumbling and rising) of major 20th century information powerhouses from the telegraph and telephone to Apple and the Internet. Regardless of whether you agree with Wu’s recommendations at the end, it’s worthwhile for those of us in IT to understand this history and evaluate how the cycle of decentralization and centralization might apply to our industry.
Author: Ray Kurzweil (@raykurzweil2035)
About: Enhanced human cognition and existence taken to a logical extreme.
Why read it?
I almost didn’t include this one because Kurzweil gets pretty far out there on his ideas. However, it’s extremely interesting to consider technology as another phase in evolution and it might be valuable to ponder what that implies. I recommend it for those particularly interested in sci-fi and anyone that wants to get out of the box of their own thinking. This book was published around 2005, so you’ll notice some predictions that haven’t come true yet but if you can get past that it’s quite thought provoking.
So these are a few of my favorite reads that I managed to squeeze into 2011. What are your book recommendations?
Enhancements to Hitachi Data Ingestor
by Miki Sandorfi on Dec 14, 2011
A couple of months ago, we announced the broader HDS vision of Infrastructure, Content, and Information Cloud (see the post here and our press release). Today we announced the newest version of the Hitachi Data Ingestor (HDI) which will help organizations begin bridging between simple Infrastructure Clouds towards the Content Cloud.
With this newest release of HDI (see the press release), coupled with the power of the Hitachi Content Platform, we are arming customers with the necessary technology to free their information and take a step into the Content Cloud. As we outlined before, the key capabilities of the Content Cloud include information mobility and intelligence – putting the right data in the right place, at the right time, whilst empowering user control. This new version of HDI supports this vision in several ways.
First, HDI v3.0 includes technology that permits dynamic dispersion and sharing of data. Based on chosen policies, information written into one HDI (via standard NFS or CIFS) can automatically and transparently become available at multiple remote HDI instances. Imagine, for instance, wanting to distribute the new 20MB corporate presentation to each of many regional offices. Instead of emailing it (propagating hundreds if not thousands of copies – yuck) or putting it on SharePoint (slow downloads), you can instead drop it onto the “corporate drive”. This action will cause the other inter-connected HDIs to “see” that new content is available, and based on access it will be cached close to the users who want to get the new presentation (much faster, simple and seamless).
Next, HDI places control at the fingertips of users. Because a cloud built with HCP and HDI is backup free by design and construction, placing tools in the hands of users to manage their own data is imperative. HCP already affords many controls for managing where data is stored, replicated, versioned, retained, and disposed. Now HDI passes this richness directly on to users via self-directed recovery of previously stored versions or recovery of deleted content. Unlike other NAS technologies, HDI natively couples with the power of object-based management, affording unparalleled granular access and control.
Finally, HDI includes some clever technology that helps customers adopt cloud in a very seamless fashion. Because HDI directly manages the non-disruptive transition of data from legacy NAS devices into the cloud-attached HDI, moving to cloud-based storage has never been easier. During the transition, all data remains available, and once the transition completes, the richness of the Hitachi solution becomes fully available – bottomless, backup-free file sharing that looks like “legacy” NFS or CIFS to users and applications, but with the power of cloud underneath.
Google Health Dies – What Next?
by Dave Wilson on Dec 12, 2011
Back in 2008, Google launched its health platform – Google Health. It was an attempt to allow patients to control their own health record by uploading records to a Google site, and then granting privileges to their physician—thus making their health record completely portable. They even piloted this at Cleveland Clinic.
The intent was that, by “…using Google Health, physicians will be able to more efficiently share important diagnostic data with their patients. As patients become better informed and proactive in managing their healthcare, they may be more likely to practice preventive care, adopt healthful behaviors and practice other measures that promote improved medical outcomes.”
Well, as it turns out, Google wasn’t so successful. What!?! Google failed to make a go of something? How can a company that brought us Google, Chrome, Google Earth and the like not be successful in healthcare? Aaron Brown, senior product manager of Google Health, said the initial aim of the service was to offer users a way to organize and access their personal health and wellness information, and thereby “translate our successful consumer-centered approach from other domains to healthcare, and have a real impact on the day-to-day health experiences of millions of our users.”
And so here lies the problem that inundates healthcare. What Google didn’t realize was that people aren’t as willing to put their personal health records out in cyberspace as they are to post their drunken party pictures.
Funny how that works.
Google then relaunched Google Health two and a half years later with a new UI and some more interactive tools. But alas that failed to catch on. Since then Microsoft HealthVault and Intel have offered to convert any Google Health files over to their format. The vultures are circling.
A personal health record has a lot of value if properly implemented. Ensuring that the content is accurate, that you can access this data from anywhere in the world, and that you can let whomever you want see your records is of immense value. Think about being on vacation and needing to have some healthcare treatment. If you have a cardiac problem, you can share your records with the local physician and they can see all of your medications (that you can’t spell or remember). They can see recent tests and the results and avoid repeating certain tests, reducing your exposure to radiation and the like.
So what’s the problem?
The first issue: A personal health record that is maintained by a patient can’t be trusted by the physician treating the patient.
Patients may tend to put only what they want in the record. They may omit or even edit certain results, thinking that no harm can come to them. Who wants to share their positive HIV test or their mental health issues? How relevant is that to the chest pains they present with? In some cases the patient may even disagree with the results and omit them altogether. A personal health record that is not maintained by the parties providing the service is somewhat questionable when it comes to using it as a reliable source of information.
Second: Can we trust the Internet, the cloud and Google to maintain a level of security and privacy?
Most people do not trust companies to maintain their privacy when it comes to health records. Too many newspapers have run front-page stories where some hospital has leaked patient information. And recent stories about Google provide more proof that maybe Google has a conflict of interest in wanting to provide a personal health record. After privacy concerns were raised, Google’s CEO, Eric Schmidt, declared in December 2009: “If you have something that you don’t want anyone to know, maybe you shouldn’t be doing it in the first place. If you really need that kind of privacy, the reality is that search engines, including Google, do retain this information for some time and it’s important, for example, that we are all subject in the United States to the Patriot Act and it is possible that all that information could be made available to the authorities.”
Who would want this? Perhaps Eric Schmidt was the downfall of Google Health and didn’t even know it.
Personal health records have their time and place if properly administered, accurately maintained and controlled in a non-biased, healthcare managed way. But getting to this stage will be difficult with all of the issues that surround our need for privacy, not to mention the sheer task of trying to coordinate the massive amounts of data. Some facilities are doing this. Governments are investing in electronic health records, which may serve a similar purpose.
But today, personal health records are still something of a nice to have.
RSNA 2011 – Meaningful Use and Cloud Took a Back Burner to ‘Imaging’
by Dave Wilson on Dec 5, 2011
Renee Stacey, Senior Solutions Marketing Manager of Health and Life Sciences at HDS, accompanied me to RSNA 2011 last week. It was a great show, and Renee asked if she could contribute a recap for the blog. Take it away, Renee…
Earlier this year, leaders in the radiology space were pushing the industry to be better engaged with the Meaningful Use (MU) incentive program. MU is a government incentive program that financially rewards healthcare professionals when they adopt certified EHR technology and use it to achieve specified objectives. Initially, radiologists were hesitant to participate, which raised fear that adoption delays could impact the ability to meet new clinical and technology demands. For an industry that has typically led the pack on clinical innovation, there seemed to be real risk of radiology being left behind.
Since that time, radiology’s participation in the MU program has been clarified; however, radiology groups are still slower than expected in adopting these IT innovations – innovations that are essential both for improving patient interactions and for their promised financial reward. KLAS recently teamed up with the Radiological Society of North America (RSNA) to conduct a survey on this very topic. Among the results were two very interesting outcomes: 60% of those surveyed either have a plan or are considering qualifying for the MU incentives, and, more interesting, only 6% considered themselves educated on the MU incentive program. To me, that says there is a deep disconnect between the needs of the radiology consumer and how the technology players in this market are delivering their message.
I bring this up because, after following this story and having just returned from RSNA, I would have expected to see a plethora of MU messaging – and while RSNA provided a number of professional sessions on the topic – the MU message did not seem to make it to the show floor. Does it mean that IT health vendors are not coming to the table with Meaningful Use Certified solutions? Probably not. I think it means that perhaps, there was a miscalculation in what vendors believed the radiology community wanted or needed to hear. When only 6% of those surveyed have a comfortable understanding of MU opportunities, it means 94% need more information to make better educated decisions about their MU plans.
The same must be said for Cloud, which was surprising when private and public cloud were the message du jour this time last year. I expected attendees of RSNA11 to be able to see and hear a more mature and better defined cloud message with a lot of industry examples and success stories. Instead, with the exception of a very small handful, the big message appeared to be imaging. Relevant? Yes. Forward thinking? I am not so sure.
And with 40,000+ radiology professionals in attendance, imaging is a given. I would have expected RSNA to be THE place to learn more about cloud and MU offerings – because it is clear that the radiology market is working hard to learn more about it.
What were your thoughts on how RSNA promoted innovation in radiology?
Answering Ilja’s Request for Server Capacity – HDS Style
by Ken Wood on Nov 30, 2011
For everyone that celebrates the holiday, Happy [late!] Thanksgiving. The week of Thanksgiving in the US is a great time to catch up on many of those little work chores that pile up or slip through the cracks while traveling and prioritizing big tasks ahead of fun work stuff.
This morning I was catching up on some HDS Industry Influencer Summit bloggers’ and analysts’ write-ups and opinions from the days of Nov 10th and 11th. As I wrote previously on my blog, I was in a concurrent breakout event with several industry bloggers on the “other stage” during the afternoon’s main session. I was following links to various write-ups and ran across Ilja Coolen’s blog. Funny thing is, it was a write-up from a separate event back in March in the UK that I didn’t participate in. So, when I finished writing this blog, I realized what happened: too much link-clicking, and I ultimately ended up at an article that was six months old. While it is several months old, Ilja’s post did ask a very good question on server packaging and capacity that I would still like to respond to. In his blog, he states:
“Hitachi is able to deliver a completely filled 42u rack with 320 high density micro servers. The total rack would consume less than 12 Kilowatt. Whether or not this is a great accomplishment, actually depends on the total processing capacity this rack would have. I need to dig deeper into this to make a comparison.”
I have, in the past, performed paper exercises of sizing computing horsepower for initial comparisons. When everyone uses the same processor chips from either Intel or AMD, do they all perform the same? Where are the differentiators? To cut through the fat, one area is packaging.
The whole discussion of rack mount servers versus blade server systems has been in debate for a number of years, and it is somewhat falling into a religious discussion. The argument for commodity-like advantages of pizza box servers over the easier to manage enterprise blade server architectures is not going to be resolved here. But what I will offer is some of my insight into performance and the advantage of packaging using the same processor chipsets.
I use a standard formula to calculate floating-point operations per second (FLOPS) and then apply it to a server system:
(Number of FLOPS per Cycle) * (Clock Cycle) * (Number of Cores)
Adding additional system level information will yield the system’s or blade’s overall calculated FLOPS performance (of course this is a brute force approach, but it is a good starting point). For instance, using the Compute Blade 320 (CB320) blade server option as stated in Ilja’s blog and as mentioned by Lynn McLean in her presentation:
- Using the X5670 XEON based blade
- (there is a reason I’m using this one and not the fastest processor option for now)
- 4 FLOPs per clock cycle
- times 2.93GHz
- times 6 cores per processor
- times 2 processors per blade
- Single precision FLOPS – 4 times 2.93GHz times 6 times 2 = 140 GFLOPS
- times 10 blades in a system = 1.4 TFLOPS
- times 7 CB320 systems in a rack = ~10 TFLOPS
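The arithmetic in the list above can be sketched in a few lines of Python. This is only a brute-force peak calculation using the figures quoted in this post; the function and constant names are made up for illustration.

```python
# Calculated peak FLOPS = (FLOPs per cycle) * (clock rate) * (cores),
# then scaled up by processors per blade, blades per system, systems per rack.
# All input values are the ones quoted in the post; the names are illustrative.
FLOPS_PER_CYCLE = 4          # per core, as stated in the post
CLOCK_GHZ = 2.93             # Xeon X5670
CORES_PER_PROCESSOR = 6
PROCESSORS_PER_BLADE = 2
BLADES_PER_SYSTEM = 10       # blades in one CB320 system
SYSTEMS_PER_RACK = 7         # CB320 systems in one rack


def blade_gflops(flops_per_cycle, clock_ghz, cores, processors):
    """Calculated peak GFLOPS for one blade (clock in GHz gives GFLOPS)."""
    return flops_per_cycle * clock_ghz * cores * processors


blade = blade_gflops(FLOPS_PER_CYCLE, CLOCK_GHZ,
                     CORES_PER_PROCESSOR, PROCESSORS_PER_BLADE)
system_tflops = blade * BLADES_PER_SYSTEM / 1000
rack_tflops = system_tflops * SYSTEMS_PER_RACK

print(f"Blade:  {blade:.1f} GFLOPS")
print(f"System: {system_tflops:.2f} TFLOPS")
print(f"Rack:   {rack_tflops:.2f} TFLOPS")
```

The unrounded results come out slightly under the rounded figures in the list (about 9.8 TFLOPS per rack rather than 10).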
I should note here also, these are general purpose computing FLOPS as compared to GPGPU FLOPS, which require additional coding and compiling steps to take advantage of this technology. This means almost everything running on these systems can take advantage of the performance assuming internode awareness (application is scale-out aware).
If you followed this same formula for the standard 1U rack mount server using the same chipset and core count, it would result in about 6 TFLOPS (140 GFLOPS per server times 42 servers in a fully populated rack) compared to almost 10 TFLOPS per rack using the CB320 (70 blade servers in a rack). So, the net result of this exercise is that the packaging and density of 70 blade servers in a rack, compared to 42 standard 1U rack mount servers, yields a 40% improvement in floor space requirements for the same computational capability. Stated in a more measurable metric, the CB320 yields 235 GFLOPS per rack U compared to 140 GFLOPS for a standard 1U server in the same 42U rack. On the higher end of the CB320 product line, the X5690 blade, the calculated floating-point performance for this blade is 166 GFLOPS, which would put the rack’s total calculated performance at 11.6 TFLOPS or 277 GFLOPS per rack U.
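For anyone who wants to replay the density comparison, here is a rough sketch in Python. The node counts and the 42U rack come from this post; the 3.46 GHz clock for the X5690 is not stated above and is assumed here because it is the value implied by the 166 GFLOPS figure. The helper names are hypothetical.

```python
RACK_UNITS = 42  # standard 42U rack


def node_gflops(clock_ghz, flops_per_cycle=4, cores=6, processors=2):
    """Calculated peak GFLOPS of one two-socket, six-core server or blade."""
    return flops_per_cycle * clock_ghz * cores * processors


def rack_summary(clock_ghz, nodes_per_rack):
    """Return (TFLOPS per rack, GFLOPS per rack U) for a fully populated rack."""
    total_gflops = node_gflops(clock_ghz) * nodes_per_rack
    return total_gflops / 1000, total_gflops / RACK_UNITS


rack_1u = rack_summary(2.93, 42)        # 42 standard 1U servers
cb320_x5670 = rack_summary(2.93, 70)    # 70 CB320 blades, X5670
cb320_x5690 = rack_summary(3.46, 70)    # 70 CB320 blades, X5690 (assumed clock)

# Floor space needed for the same compute: 42 nodes per rack versus 70.
floor_space_saving = 1 - 42 / 70

for label, (tflops, per_u) in [("1U servers ", rack_1u),
                               ("CB320 X5670", cb320_x5670),
                               ("CB320 X5690", cb320_x5690)]:
    print(f"{label}: {tflops:.2f} TFLOPS per rack, {per_u:.1f} GFLOPS per rack U")
print(f"Floor space saving: {floor_space_saving:.0%}")
```

The unrounded values land close to the rounded figures quoted in the text; small differences come from rounding at intermediate steps.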
Hopefully this helps answer Ilja Coolen’s question about server density and capacity. The other point to finish off this article is the notion of data intensive and computational intensive architectures. I’ve just shown some data that suggests the CB320 has the capacity to be very computationally capable. On the other hand, the CB2000 has the capability to be very data intensive. The specifications for the CB2000 state 16 GB/s total bandwidth in a single blade system and 64 GB/s of total bandwidth from a fully configured rack. Combined, these two systems form a formidable platform for solving Big Data challenges. Not that Big Data problems are floating-point intensive workloads, but you never know.
More on this thought in future blogs.
Unintended Consequences of Cloud – From Influencers to Super Computing
by Amy Hodler on Nov 22, 2011
The week before last we had our first Influencer Summit in San Jose, CA that brought together analysts, bloggers and trusted advisors. I really enjoyed Frank Wilkinson’s blog from the event regarding something we heard a lot that day, You Guys Do That?! This got me to thinking about the unexpected and another comment that piqued my interest that day.
During a panel discussion, two of our customers commented how the use of cloud solutions was influencing their organizational structure—in short, that the higher level segmentations of server, storage and network groups were merging. I know many folks had postulated that this would happen, but this is the first time I’ve heard customers from different industries talking about how it impacts business structure. Imagine what that might mean in the long term for business processes. This alone is an interesting topic, and I’d love to hear more real-life examples.
With the idea of unintended consequences of cloud still stuck in my head, I attended Super Computing 2011 last week; Ken Wood summarized our participation in A Brief Visit to SC11. This is a fascinating conference if you’re interested in the amazing things being done to turn data into meaningful information, and seeing the impressive projects from the likes of NASA, NOAA and educational institutions. For a non-promotional report on why this conference and supercomputing is important to our industry and society, check out this video summary from EE Times.
Amongst all the super charged brainpower, I heard one of the providers of High Performance Computing (HPC) mention that the concepts of cloud were changing what their end users wanted. I started asking the same question to others and it turns out that because this type of computing and analytics is extremely dependent on node to node fidelity and intolerant of failures, HPC providers had not anticipated a strong interest in cloud services. However that’s exactly what they are starting to see.
The providers that I spoke to weren’t precisely sure how they would meet these growing requests for cloud-like hosting and delivery, but they are working on it. Super computing as a service (would that be SC-a-a-S?) would require some unique implementations of cloud solutions that would vary greatly from big data solutions due to dissimilar data and analytics models. Is there enough of a market for SCaaS? Hmm. Maybe, or maybe we’ll call it something else?
These last two weeks have been ones of idea exploration for me, and I’m left with many more questions than I can answer. However, if you’ve read my other posts, you know I love this process. (Something good usually comes out of this exploration; I just can’t predict what it will be.)
So, please send me a quick note or write a comment about any unintended consequence of cloud that you’ve experienced or heard about. I’ll collect them, post a summary, and maybe we’ll collectively come to a few “ah-ha!” moments.
For more content from HDS Analyst Day, visit our bit.ly bundle: http://bitly.com/u0mh27.
Unstructured Data in Life Sciences
by Dave Wilson on Nov 21, 2011
Unstructured data is a major challenge in the life sciences market. Unstructured data, by its very definition, is difficult to analyze as it doesn’t fit into a relational database. Pharmaceutical and biotechnology organizations live and die by their ability to analyze this unstructured data, and studies show that the average company makes decisions based on data that is 14 months old. Companies that can make faster decisions will win the race.
Gaining access to unstructured data opens opportunities for organizations, but it is only a start. It is even more important that organizations know what data to access because the advantage will go to the company that can mine the most relevant value out of the data.
Consider a pharmaceutical company. They conduct clinical trials in the drug development phase. Multiple departments generate massive amounts of data that all relate to the drug’s interactions: blood tests, biopsy samples, pathology images, nursing and patient notes, not to mention chemical analysis and more. Combined with race, gender and geography factors, there is too much data to make sense of. Aggregating this data into meaningful information is the key to driving better decisions, like, for instance, identifying trends that can uncover major discoveries.
It’s a little known fact that Viagra was discovered by researchers when they found that patients didn’t want to give up their medication. The “benefit” (or side effect, depending on your gender) was an accidental discovery. Being able to correlate data that is seemingly unrelated can lead to major finds, and a way to show a relationship between your data will drive data mining and analytics to higher levels.
So there are two main challenges facing pharmaceutical companies when it comes to big data.
- How does a company manage to store big data?
- How can they make sense of this big data?
As you have seen recently with our cloud announcements, HDS has cloud technology that can address both challenges. Cloud computing for pharma companies comes with its own challenges, like:
- Ownership of data
And these are important factors to consider.
Also, object-based storage is a way to store unstructured data and mine the associated meta data. Both Hitachi Content Platform (HCP) and Hitachi NAS powered by BlueArc® provide a means to manage unstructured data. HCP also forms the core of HDS cloud technology.
The key to managing big data is to enable reduced cycle times for computing massive queries. This drives pharma and biotech to gain an advantage over their competitors. High performance computing has a big role to play here – but that is a blog for another day.
A Brief Visit to SC11
by Ken Wood on Nov 18, 2011
Initially, I wasn’t planning to attend SC11, especially since this week I had several other meetings to participate in. However, as is common in this industry, I ended up heading to Seattle at the last minute to spend the day meeting with several people and companies at SC11. I was able to get into the exhibit hall early to explore the behind-the-scenes activities of many of the booths. Hitachi Ltd.’s HPC Group was present with a very impressive booth again this year.
A VSP was on display in the booth, next to what I call the world’s largest server blade. I don’t actually know if this is a fact or not, but it is very impressive to see this device used in this specialized field of computing. Also, there was the new HA8000-tc rack mount server for technical computing (I want/need some of these in the Innovation Lab).
I also hung out at the BlueArc booth, which now displayed new panels with “BlueArc – Part of Hitachi Data Systems” in large, vivid lettering. Nice! Sorry, I didn’t get a picture of this for some reason, but I’ll grab one from someone or someplace. I did hang out at the booth meeting with new BlueArc colleagues and old HDS colleagues, as well as customers of other vendors interested in knowing more about everything.
Probably one of the more interesting activities at the conference for me was the attention given to data intensive workloads specifically around “Big Data”. There were several events going on surrounding Big Data that, unfortunately, I was not able to attend. However, since the majority of my time and my team’s time is spent solving Big Data problems in the enterprise, this is an area and community we will continue to monitor closely. I have been using scale-out and HPC architectures to explain and solve the Big Data challenges in the enterprise and this is evidence of that approach. Stay tuned for more on this subject. Unfortunately, I was unable to attend any of the sessions, tutorials or BoFs this year. I didn’t even get a t-shirt or conference goodie bag. Hopefully, next year my schedule will allow for more time to participate and explore like I usually do.
HDS Industry Influencer Summit – The Other Stage
by Ken Wood on Nov 18, 2011
Last week was the inaugural HDS Influencer Summit, convened in downtown San Jose. This event included financial analysts, industry analysts and key industry bloggers. It is interesting that the majority (maybe all) of these attendees are related to the storage industry in some fashion. There are several blogs detailing the event and explaining the resounding “…they do that?” in these posts from my colleagues Frank and Miki. What I would like to describe here is the blogger breakout sessions, and the tour of the new Innovation Lab, which is an extension of the Hitachi Central Research Laboratory, the day after the main event.
There was a special breakout session during this event specifically for our invited industry bloggers, Greg Knieriemen (@Knieriemen), Nigel Poulton (@nigelpoulton), Chris Evans (@chrismevans) (not the Captain America Chris Evans), Devang Panchigar (@storagenerve) and Elias Khnaser (@ekhnaser), which overlapped some of the main event. By comparison, this portion of the event was more exciting than what was missed in the main event (in my biased opinion). I kicked off this breakout session with an overview of our “R&D and Futures” and an introduction of the new Innovation Lab at our headquarters. I also did a brief one slide description of three active projects we are working on in the lab and noted that these projects will be demonstrated the following day. Sorry, these projects are under NDA.
After that tour, and while walking Greg out of the rest of the day’s activities, he stated to me “…you probably have the greatest job in the world!” I replied back “trust me, this isn’t the only thing I do, and the rest of my job isn’t so great” (sorry Michael). However, I took this to mean that my team is instrumental to changing the industry’s perception of the “New HDS”, or at least that’s how I interpreted his comment.
I didn’t think much more about his comment until this week when he followed up with an email to Michael Hay and myself basically stating the same. Unfortunately, he didn’t say that I WAS ‘doing a great job’ and he included Michael so I couldn’t edit his email before forwarding it ;^) Obviously, there’s a sense of pride when someone recognizes the work being done, especially since being so close to the work can take your focus off the larger vision.
It is rewarding for me and my team to know that we are helping to transform HDS from being viewed as a storage company to something more while keeping to our roots. I normally describe the difference between HDS and other technology companies in this market space as – companies that are primarily seen as a server company see storage as a place to keep data, but a storage company would treat data as the digital assets of an enterprise and use servers as a way of making that data useful to the business. To this, I also like to describe storage (at least the way HDS does it) as maintaining the “state” of the company, while servers can become “stateless” interchangeable components that essentially are data processing offload engines to the storage infrastructure.
I am definitely looking forward to following up with several participants of this event as I received many requests and questions. Also, I am looking forward to next year’s event, and what we will be sharing.
Hey! You Guys Do That?!
by Frank Wilkinson on Nov 15, 2011
Last week HDS held its inaugural Influencer Summit in San Jose, California. It was a very big deal for our company, not to mention our invited guests, who by all accounts were about to get the one-two punch! (In a good way.)
The creation and preparations for this historic event were driven by our marketing group and trusted advisors, not to mention half the company (well it seemed that way, I may be exaggerating a little bit). We had our executives on hand to deliver the core messaging with some great insights, as well as folks from The Office of Technology and Planning. This was our first of many events which will enable HDS to share its strategy and technologies through an exclusive (for now anyway) invitation to the industry’s most prestigious analysts and bloggers.
There were many internal meetings to discuss our individual participation and also to cover each presenter’s topics, strategy materials, presentations, NDA clarification, blood type, first born, and dare I forget that I had to sign my name in secret ink (I am kidding, there was no secret ink).
All kidding aside, it was a great event. Jack Domme (pictured above) was first to present, and he did a great job as always. (I will save the remaining details for my fellow colleagues and bloggers, who I am sure will do it better justice than I can).
One of the initiatives at the event was to decide who would be monitoring Twitter and responding. I of course volunteered, as did many of my peers and colleagues. The job was easy enough, as I am on Twitter (@FTWilkinson) as much as I can, and also I like to see instant feedback from our customers and peers in real-time.
As the event started off, the Twitter chatter was quiet, but started to pick up rather quickly with some tweets pointing out that HDS is the best kept secret…
- @seepij: If you thought #HDS were just into storage – like I did – hearing impressive insights on new technology coming #hdsday
- @CIOmatters: #HDS much more innovative and strategic than I realised -not content buying IP they build it and use/re-use it, ahead of the market #hdsday
- @nigelpoulton: Randy Demont saying customers are really pleased but telling HDS that they dont market well enough <– only for the last 10+ years! #hdsday
- @ekhnaser: Great message at #hdsday but y limited to 70 people? This should be heard by more partners customers influencers….
Ahem!…The presentations were great and overall were well received. At some level I think we shocked some folks with our candor, as well as by laying out our strategy and some possible future areas of focus and technologies (all covered under NDA, so no sharing in this blog forum). Nonetheless, the Summit was a great inaugural event that proved to those in attendance that we, HDS, do have a complete strategy, and a plan and vision for carrying it out.
So, YES, we do that!
Cloud Strategy and the Influencer Summit 2011
by Miki Sandorfi on Nov 11, 2011
The inaugural Influencer Summit 2011 was a tremendous success! Industry and financial analysts as well as some key bloggers traveled from all over to meet with HDS and spend a day, plus some, talking strategy, industry and futures. Although the agenda was packed, I was able to do a short recap of the HDS cloud announcement. Check it out:
Excited About Our Inaugural HDS Influencer Summit 2011
by Claus Mikkelsen on Nov 9, 2011
Fall is in the air, but HDS will be turning up the heat in San Jose this week with our first analyst day. Kicking off on Thursday, November 10th, HDS will host two days of executive presentations, financial updates, and updates on our product strategy from CEO Jack Domme and other key HDS leaders. Throughout the event, we will cover our strategy for infrastructure, content, and information cloud with help from a few gracious customers who have also decided to participate.
Our goal is simple: let the storage world understand where HDS is going and how we will get there from a technology roadmap perspective.
I will be tweeting from the summit (@yoclaus), as will a deep bench of HDSers also in attendance, including:
- Miki Sandorfi – @MikiSandorfi
- Amy Hodler – @AmyeHodler
- Hu Yoshida – @HuYoshida
- Ken Wood – @KenWoodonTech
- Shmuel Shottan – @ShmuelShottan
- Frank T. Wilkinson – @FTWilkinson
You can follow the event by using the #HDSday hashtag on Google+ and Twitter.
Speaking of HDS on Google+, our company page is now live, so please add us to your Circles so we can share information from the event with you. We will be hosting in-depth conversations with storage analysts, bloggers and HDSers during the summit. We have shared our “HDSday 2011 Circle” so you can connect with our participants.
You can also follow the #HDSday Twitter List to connect with those who will be participating in the summit and posting tidbits on Twitter throughout the event.
We hope you’ll follow the live stream of tweets and blog posts that emerge this week!
There will also be some industry bloggers in attendance, so make sure to check in with their Twitter handles for updates:
- Chris M Evans – @chrismevans – www.thestoragearchitect.com
- Devang Panchigar – @storagenerve – www.storagenerve.com
- Greg Knieriemen – @knieriemen – www.nekkidtech.com
- Nigel Poulton – @nigelpoulton – www.nigelpoulton.com
- Elias Khnaser – @ekhnaser – www.eliaskhnaser.com
Looking forward to engaging with you throughout the show!
Big Data in Healthcare
by Dave Wilson on Nov 7, 2011
Being in the healthcare space my entire career, I had no idea what all the fuss was about when Big Data started to be the topic of the day. Sounded like a large file to me – mammography images can be 60MB each – and aside from the potential joke about large mammo images, what was the big deal?
So I did some research. “Big Data” refers to a volume of data too large to be harnessed and used in meaningful ways. In other words, Big Data is an accumulation of data that is going to waste and has no immediate meaning, mostly because no one can do anything with it due to its size.
Healthcare providers are quickly becoming inundated with Big Data. Governments are unintentionally driving big data warehouses through health information exchanges, diagnostic imaging repositories and electronic health records. Eighty percent of the data is unstructured, and it is accumulating to the point where it can be called Big Data. Now, on an individual basis, each patient record has value and meaning to the patient–obviously. But as we look at the growing accumulation of data, the opportunities are endless to drive meaningful analytics out of the volumes of data available.
Data warehouses are simply that–a warehouse for data. Data has no meaning on its own, and as providers create these warehouses, there needs to be a shift in how we think about the potential use of data. Data warehouses need to become information or content repositories. Information is a useful tool that results from the analysis of data driving decision making. Sounds like marketing fluff right? Let me demonstrate.
A diabetic patient monitors their blood sugar multiple times per day. This value gets stored electronically–this value is data. On its own it has little meaning (aside from the obvious immediate value to the patient) in the big picture of things. Now take that patient and all the patients in the region and their blood sugar values for the last 5 years. Analysis of this data could lead to trends—important information that can drive preventative health measures. This leads to better patient care, improved quality of life, and lower healthcare costs.
Much of this data is available today in separate repositories, isolated applications and local data warehouses. The potential to combine the blood sugar data with nursing notes keywords, weather forecasts, and other related and unrelated data can help drive this analysis. The challenge becomes getting access to this data and then overcoming the interoperability aspects. Problem is, we don’t know what we don’t know. Questions we would never ask today can be asked when the restrictions are lifted – Is there a correlation between diabetic hospital admissions and the weather pattern?
A content cloud could answer some of these challenges. Consolidate and aggregate data from multiple sources, and at the same time capture the relevant meta data associated with the data against which analytics can be run. Meta data can help manage the massive amounts of data being generated (the Big Data) and provide a way to correlate this data into meaningful information. This content then can be accessed by researchers and scientists to analyze.
Big Data is, and will continue to be, a major problem for healthcare providers. One estimate has healthcare Big Data sized at 150 exabytes and growing at a phenomenal rate of 1.2 exabytes per year. The possibilities of tapping into that information are endless. It has been a challenge for pharmaceutical and biotech companies for years – but that’s another discussion.
Measuring Up A Supersized McBlu-ray With BDXL
by Ken Wood on Nov 1, 2011
In March of last year, I wrote a blog about disruptive technologies, specifically how the Blu-ray technology could change the storage landscape (or not). The new Blu-ray disc format specification enhancement was defined in June 2010 for the BDXL standard. This new specification increases the recording capacities for Blu-ray discs in two ways.
First, per-layer formatted capacity has increased from the initial 25GB to about 33GB per layer. Second, a disc can now have three or four layers, referred to as a triple or quad layered disc. The conventional capacity description is now 100GB (99+GB) for the triple layered disc and 128GB for the quad layered disc. There is some discussion with Blu-ray media suppliers looking for requirements for dual sided media, which could increase the per disc capacities to 200GB for a six layered disc (triple layers on both sides) and 256GB for an eight layered disc (quad layers on both sides).
Now, when I discuss optical storage capacities and Blu-ray, I’m typically referring to the storage capacities and read/write speeds for these devices, and how they would apply to enterprise uses. I’m less interested in the video formatting and supported formats—at least until I’m working on my own home videos. For my job at HDS, I’m always watching this space for the “kicker” that sends this technology to the forefront of an enterprise’s alternative storage technology strategy. More specifically, when could optical storage technologies, like Blu-ray, replace tape (albeit, for certain types of use cases)?
Even with an inconveniently laid out 256GB of capacity (flipping a disc over is a pain at best), Blu-ray would have a tough time replacing the current multi-terabyte LTO tape formats today and planned improvements for the future.
But wait! What’s actually being compared when discussing LTO tapes and Blu-ray discs, head-to-head on capacity, or head-to-head on capacity and footprint? The current specification for an LTO5 tape is 1,500GB of uncompressed capacity with a cartridge dimension of 4.1 x 4.0 x 0.8 inches (105 x 102 x 21 mm). Using some creative math, this is 13.12 cubic inches, which works out to about 114.3GB per cubic inch for that cartridge.
The specification for a quad layered disc (single sided) is 128GB of uncompressed capacity with a disc dimension of 4.7 inches in diameter x 0.047 inches thick (120 mm diameter x 1.2 mm thick). Again, doing some creative math, this is 0.815 cubic inches, which in this case works out to 157.1GB per cubic inch. This is an amazing 314.1GB per cubic inch if we test the waters with a dual sided disc. Stated another way, in roughly the same amount of space that an LTO5 cartridge occupies for 1,500GB of uncompressed storage capacity, a stack of Blu-ray discs would contain a whopping 2,060GB using a single sided 128GB quad layered disc and over 4,100GB for a stack of double-sided quad layered 256GB discs.
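The “creative math” can be reproduced with a short Python sketch. It uses only the dimensions and capacities quoted in this post, treats the disc as a solid cylinder (ignoring the center hole), and uses made-up function names.

```python
import math


def box_volume_in3(length, width, height):
    """Volume of a rectangular cartridge in cubic inches."""
    return length * width * height


def disc_volume_in3(diameter, thickness):
    """Volume of a disc treated as a solid cylinder, in cubic inches."""
    return math.pi * (diameter / 2) ** 2 * thickness


lto5_volume = box_volume_in3(4.1, 4.0, 0.8)    # LTO5 cartridge
disc_volume = disc_volume_in3(4.7, 0.047)      # one Blu-ray disc

lto5_density = 1500 / lto5_volume              # GB per cubic inch
disc_density = 128 / disc_volume               # single sided quad layer
dual_sided_density = 256 / disc_volume         # dual sided quad layer

# Capacity of a stack of discs filling the same volume as one LTO5 cartridge.
stack_capacity_gb = disc_density * lto5_volume
dual_stack_capacity_gb = dual_sided_density * lto5_volume

print(f"LTO5:     {lto5_volume:.2f} in^3, {lto5_density:.1f} GB/in^3")
print(f"Blu-ray:  {disc_volume:.3f} in^3, {disc_density:.1f} GB/in^3")
print(f"Stack in LTO5 footprint: {stack_capacity_gb:.0f} GB "
      f"(dual sided: {dual_stack_capacity_gb:.0f} GB)")
```

Because the sketch does not round the disc volume before dividing, the per-cubic-inch results differ from the figures in the text by a fraction of a GB.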
Granted, the Blu-ray disc is a bit wider than the LTO cartridge and a straight stacking of discs or a forced fitting of cubic inches from one form factor into the other is not a precise or practical method of comparing these two storage media. Also, in a well designed apparatus for managing many optical discs, there would be zero surface contact required, so a measurable gap between discs would be needed to manage them properly. This would drop the GB per cubic inch capacity somewhat. However, the numbers are disparate enough to look closer at this from a different perspective.
So what about performance? The current LTO5 specification states a 140MB/s rating for the uncompressed 1,500GB format. This means for a read operation, once the tape sequentially seeks to the requested location, at top speed (tape moving across read/write head) the data can be read at 140MB/s. Impressive and roughly as fast as most magnetic hard discs sequentially streaming today. A Blu-ray disc reads and writes at about 18MB/s using the slim form factor optical disc drives and about twice this speed for the full and half height form factor optical disc drives. So for comparison’s sake, I’ll use 30MB/s for read/write as a conservative estimate. This is one of Blu-ray’s main deficiencies as a serious enterprise storage alternative: poor performance, as well as limited per-disc capacity.
However, let’s look at this from a different perspective again. That same “stack” of discs that yields up to 2,060GB of capacity when compared to LTO5’s footprint, would roughly need 17 optical disc drives to simultaneously load all of these discs. Two things happen when this is done. First, 17 drives x 30MB/s is a total of 510MB/s of aggregate performance from this “stack” of discs. This assumes reading ALL of the discs for some sort of read-everything or write-everything operation. Second, even if there aren’t this many optical disc drives available, all 2,060GB of data is divided into 128GB chunks and 33GB layers. When seeking a single file from a Blu-ray disc, the operation would only load the disc with the requested file on it, focus the laser to the required layer, then directly seek to the location on that layer to read the file and satisfy the request.
Of course both storage mediums have excellent power consumption characteristics when not in use – ZERO watts. However, I think optical storage, and Blu-ray specifically, is in many cases compared unfairly (I did this myself initially). The standard comparison is individual medium to individual medium, an LTO5 cartridge against a single disc, but both will be used and managed in similar ways. That is, nobody uses just one, and there’s always some device in place to manage them as a larger body of media. This means when the aggregation of capacity and performance is measured, the two technologies compare fairly well, at least now that the new BDXL format is being used for Blu-ray.
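The aggregate numbers above can be checked with a small Python sketch. The capacities and speeds are the ones quoted in this post; the full-read times at the end are not from the post and are an extra back-of-the-envelope figure that assumes 1GB = 1,000MB and sustained streaming at the rated speed.

```python
import math

STACK_CAPACITY_GB = 2060     # discs filling an LTO5 cartridge's volume
DISC_CAPACITY_GB = 128       # single sided quad layered BDXL disc
DISC_DRIVE_MB_PER_S = 30     # conservative per-drive estimate from the post
LTO5_CAPACITY_GB = 1500
LTO5_MB_PER_S = 140

# One drive per disc, rounding up to cover the partial disc.
drives_needed = math.ceil(STACK_CAPACITY_GB / DISC_CAPACITY_GB)
aggregate_mb_per_s = drives_needed * DISC_DRIVE_MB_PER_S

# Extra illustration (assumption: 1GB = 1,000MB, no seek or load time).
lto5_full_read_hours = LTO5_CAPACITY_GB * 1000 / LTO5_MB_PER_S / 3600
disc_full_read_hours = DISC_CAPACITY_GB * 1000 / DISC_DRIVE_MB_PER_S / 3600

print(f"Drives needed: {drives_needed}")
print(f"Aggregate throughput: {aggregate_mb_per_s} MB/s")
print(f"Full read, one LTO5 cartridge: {lto5_full_read_hours:.1f} hours")
print(f"Full read, all discs in parallel: {disc_full_read_hours:.1f} hours")
```

With every disc loaded in its own drive, the whole stack streams in the time it takes to read a single disc, which is why the aggregate view looks so different from a one-to-one comparison.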
I would like to hear about your experiences with Blu-ray storage and what your opinions are concerning this technology–especially those professionals using Blu-ray in their companies to solve problems.
Trick Or Treat
by Claus Mikkelsen on Oct 31, 2011
Ah, yes, Halloween. Time to pick up that empty wine glass and start knocking on neighbors’ doors. I’ll leave the candy for the kids. But there is a “trick or treat” to this blog.
The “trick” was to get you to read this far, and the “treat” will be unfolding over the next few weeks and months.
I’ve blogged, written and spoken a lot recently, not so much about the data explosion (we all talk about that), but about the changes in technology that are allowing it, and in many cases almost encouraging it, to occur. Most recently, Ian Vogelesang wrote on this page about disk drive futures, and this past April I blogged about how Moore’s Law might be tilting a little faster these days. Recently, I read that we might be seeing up to 100TB disk drives by the year 2020.
All of these trends are great, but also tend to create more problems to solve, or processes that need to be changed. For example, how long will it take to do a RAID rebuild of a 100TB drive (don’t ask!), or what does data protection look like in the year 2020, or more specifically, what should it look like today?
That’s the subject at hand, and one I’ll be focusing on in the weeks ahead, along with my good friend David Merrill, whom many of you already follow. Helping us along will be Ros Schulman (our “diva of data protection”) who “guested” on Hu’s blog this past April about business resilience following the Japan earthquake and tsunami. There will be other guest bloggers as well, but between David’s blog and mine, we’ll be unraveling the intricacies of data protection as well as the storage economics surrounding it. For anyone interested in data protection and backup challenges, this should prove to be a “treat”.
It’s not your father’s simple tape backup any longer. Stay tuned.
HDS Information Cloud Vision and ParaScale
by Cameron Bahar on Oct 31, 2011
Last week, HDS unveiled its roadmap for the Information Cloud and stated that it is based on technology obtained through the acquisition of ParaScale in August 2010. In this blog post, I will explain how the ParaScale platform will serve as a foundation and enabler for the Information Cloud.
In my last blog post, I wrote about the impact and requirements of a new class of applications on storage and computing infrastructures. As this massive wave of unstructured data is created, we need platforms that are specifically architected to efficiently ingest, store and analyze this data.
In the early 90s I had the privilege of working on the second version of the Teradata data warehouse appliance, which was used by leading companies around the world to mine their structured data sets. The interfaces were SQL and we implemented an ACID-compliant RDBMS while leveraging a scale-out shared nothing architecture to achieve scale and performance for TPC-D (decision support) type queries. Teradata allowed companies to achieve record “time to answer” results for their most pressing business intelligence problems. In the process, Teradata gave these select companies a significant competitive advantage, which in many cases led to these companies dominating their various industry segments.
An example of this is Walmart, which invested heavily in this technology in the 1990s and even sued Amazon for hiring their data warehouse experts as it considered this knowledge a competitive weapon. Walmart was able to analyze supplier and customer patterns and figure out what product to put on which shelf, in which city, in which month in order to maximize its revenue. It is interesting to compare the relative growth rates of Walmart to Kmart, which I believe did not invest aggressively in this area in the 1990s!
What we’re witnessing is the second wave of this paradigm. The fundamental difference is that the data sets are a few orders of magnitude larger and they’re all unstructured or semi-structured.
Many organizations are realizing that there’s tremendous hidden value in this data and it needs to be harnessed to enable businesses to get new insight about their businesses and customers. Proof points are leading web companies such as Google, Amazon, Facebook, Yahoo!, eBay and others that are mining user activity and patterns to sell ads or products or both.
Because these data sets are relatively large (TBs to PBs), it is impractical and expensive to copy the data over a network into a data warehouse in order to process it. It is much faster and cheaper to move the processing to where the data is stored, which brings me to why ParaScale is an important bridge to the Information Cloud.
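The idea of moving the processing to the data can be sketched in a few lines. This is a toy illustration only, not the ParaScale API: the node names, the records and both function names are hypothetical. Each node summarizes its own records locally, and only the small summaries travel to a coordinator.

```python
from collections import Counter

# Hypothetical shards: each list stands in for records stored on one node.
NODES = {
    "node-a": ["search", "click", "search", "purchase"],
    "node-b": ["click", "click", "search"],
    "node-c": ["purchase", "search"],
}


def local_count(records):
    """Runs where the data lives; returns a small summary, not the records."""
    return Counter(records)


def merge(partials):
    """Runs on the coordinator; combines the per-node summaries."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


partials = [local_count(records) for records in NODES.values()]
totals = merge(partials)
print(dict(totals))
```

What crosses the network here is one small counter per node rather than every record, and that gap only widens as the data sets grow from terabytes to petabytes.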
The ParaScale platform will allow organizations to efficiently ingest, store and mine unstructured and semi-structured big data sets at scale and all in one place using a scale-out shared nothing architecture. This technology provides an elegant and flexible method to combine storage and compute in the same stack and achieve performance, efficiencies and capabilities required to solve this class of problems. The journey from the Content Cloud to the Information Cloud requires an array of technologies and we think the ParaScale technology is a key element required to bring this vision to reality.
Want to read more about our Cloud Roadmap? Visit our bit.ly bundle here: http://bitly.com/pCt5Gk
Cloud Maturity Model In Healthcare
by Dave Wilson on Oct 28, 2011
Now I don’t like to steal titles or concepts from other companies but at my last company we had a concept of a Multi-site Maturity Model and I think the same “maturity model” concept works for healthcare and our cloud offering. It took a little while for this to kick in (thanks to my colleague Linda Xu for helping me see the light) but I think that our vision for where the cloud will take us is most appropriate for healthcare. Let me explain.
I see four stages of evolution in the healthcare space for customers that want or need to go to the cloud for various reasons. Healthcare tends to be behind other industries when it comes to adopting technology so this evolution needs to happen at the speed at which our customers will be comfortable. This evolution starts with “Cloud Ready” Technology. By buying the HDS “box”, the customer starts down the road of enabling the cloud by being ready from an investment perspective. Take for example Hitachi Content Platform (HCP). A customer can utilize HCP to manage their patient information from the radiology department to start. This would be locally hosted and managed and be like all other departments – a silo of storage.
Stage two involves the customer moving this equipment to a hosted environment, adding additional applications to HCP and expanding the capabilities. It may also include virtualization of an existing data center such that multiple applications are now sharing the storage virtually and Hitachi and its partners take on more of a managed service role. Storage on demand becomes an option. At this point our customers start to build out the Infrastructure Cloud and are managing a number of applications and their accompanying storage as a cloud model. Think of MS Exchange, PACS and Hospital Information Systems data being managed through the cloud.
But as we have talked about before, just storing data in the cloud is of limited use unless the physicians are able to utilize this data for clinical decision making, and so the Content Cloud evolves as the customers add Hitachi Clinical Repository to the mix. Now the customers are ingesting data through the cloud, indexing that data as it arrives so physicians can see the patient longitudinal record, and storing this information. Through various applications, facilities can now provide access to their physicians via cloud based applications. Electronic medical records, physician portals and patient portals all fit into this arena. The Content Cloud offers healthcare providers the opportunity to expand their services and to provide new services and methods of communication to their populations much more easily and quickly. All the benefits of the cloud apply to healthcare: reliability, lower costs with higher utilization rates and a simplified IT management environment. Remember – healthcare providers’ main function is to care for patients and so IT is not always their strength.
Now that we have created content and made it available to those in need, the final stage of our maturity model starts to appear. The Information Cloud brings hospitals and caregivers to the highest level of maturity. Utilizing the content that has been collected from across multiple facilities, from many patients and physicians, we can start to apply “smarts” to this data. By adding analytics to the healthcare information we can start to develop best practices for disease treatment and determine which medications have the best impact on certain diseases. We can start to track epidemics before they happen and track back to Patient Zero when they do happen. We can make better clinical decisions for the population and identify trends in the early stages. With the right applications we can manage chronic diseases like diabetes, cancer and heart disease much better – thus lowering the cost of healthcare for all and improving the quality of life for many.
In the end, a Cloud Maturity Model is probably what is most needed for our healthcare facilities to improve patient outcomes and we can show our customers how to get there at their own speed. Without it, we will continue to struggle with access to data, we will continue to miss the big picture and ultimately we will continue to see patient healthcare costs increase to unmanageable levels and that will hurt us all – ouch!
What do you think of this Cloud Maturity Model? Are there other steps that need to be taken?
What Can You See In Our Cloud Vision?
by Amy Hodler on Oct 26, 2011
Yesterday, HDS announced our roadmap for the information cloud to help customers transform data so it can be better used as a strategic asset with the goal of fostering more business insight and innovation. Miki Sandorfi, our Chief Strategy Officer for Cloud, also explained yesterday in his post how our 3-tier strategy builds—starting from infrastructure cloud to provide more dynamic infrastructure, then layering content cloud to enable more fluid content, and then finally building to information cloud to facilitate more sophisticated insight. Miki also reviewed some of the new Hitachi Cloud Solutions and Services that were announced to help customers achieve this goal.
Having seen the cloud strategy develop from within HDS, and the significant debate over even minor implications, I can attest to how serious we are about using this vision as a lens to organize, prioritize and drive our business. To help you understand how we are using this for our “bigger picture” I wanted to give you a little insight into some of our brainstorming as to what this vision might make possible today and in the future. And if you’re like me, there’s nothing better than an example to really illustrate the potential of an idea.
So let’s imagine what we could do with better insight using the concepts of infrastructure cloud, content cloud and information cloud. Regardless of the industry, there are amazing accomplishments that can be made, but I’d like to use a healthcare example since we all can relate to having received care (or at least knowing someone that has received care).
In considering how the infrastructure cloud might be used in a healthcare example: Let’s say that a doctor notices a suspicious illness pattern on rounds. In this case, the doctor could immediately spin up an internal SharePoint for hospital collaboration and an external message board requesting input. No, this is not rocket science and yes, we can do this today. However, without the dynamic resources of an infrastructure cloud, the doctor would have to submit his request and then wait a considerable amount of time for the approval, acquisition and implementation of resources to support the service he needs. This would delay his ability to collaborate and reach out for assistance, and consequently cost valuable time in detecting a possible outbreak.
If we now look at the next stage of activities for this use case, we can see how a content cloud, built on top of the infrastructure cloud, would help the doctor obtain the data he needs quickly and securely. For example, the doctor would likely want to search and share information among hospitals with appropriate rights and privacy protection in place while collecting content independent of the application it was created in. In this case, each patient report has a meta-tag for the mystery illness with pertinent information and privacy protection. The doctor can quickly search across multiple IT systems and find the appropriate information he can then use for other study. Without the fluid content that the content cloud enables, the doctor might be delayed in aggregating data, or even worse, completely miss information essential to understanding this developing situation.
In this example, we know that the doctor is ultimately trying to save lives and possibly contain an outbreak. The information cloud can best enable this when layered on top of an infrastructure cloud and content cloud. For instance, the doctor may identify a possible outbreak in a nearby city using a report that recombines medical analytic data with Google search trends. With this information, the doctor would notify city officials, who would then take preventative action to limit risk. At the same time, based on preset variables, the information might also self-direct further search and analysis. For illustration, let’s say the results of this self-direction alert the doctor to a potential historical match to these trends, and this leads him to consider applying an old inoculation solution in a new way. Applying this insight, he would be able to quickly head off the outbreak. Without the sophisticated insight the information cloud facilitates, the doctor might spend countless hours poring over data and information: unable to sift out trends and patterns and unable to even consume other sources of information such as machine generated information…let alone be able to blend and analyze everything. This could mean serious delay in not only identifying but also resolving the health risk.
(Is it me or does all this talk of “the doctor” and future capabilities have you thinking about the “Doctor Who” BBC series? If you’re among us oddball science fiction fans, I’d love to hear who your favorite “Doctor” is. One of my favorites is number 9, who you can see in this great promotional picture.)
It’s entertaining to consider vision concepts and build on the possibilities, but it also brings to light novel ideas, uses and even unforeseen challenges.
So what other possibilities and use cases can you dream up for our strategy for infrastructure cloud, content cloud and information cloud?
Data is Our Middle Name
by Michael Hay on Oct 26, 2011
…and it is our sincere desire to Inspire Your Next Insight. The reveal of our infrastructure, content and information clouds exposes a framework to help our customers and the market get to that next “ah ha” moment.
To paraphrase and add to my colleague Miki Sandorfi’s post on our Cloud Blog, we aren’t just delivering the framework, we’ve included “batteries” as well. (By “batteries included”, I mean we have real solutions, not just products, adhering to our framework that help our customers live by the Hitachi gospel. I view this as not unlike how the Python computer language is positioned: it has lots of helper packages, scaffolding and frameworks making it easy for software engineers to be productive quickly. In the same way, solutions like private file tiering and SharePoint archiving help our customers deploy big parts of a private cloud infrastructure with little worry, helping them be more productive and attentive to the business.)
However, this is not the real reason why I’m excited about this announcement. My excitement stems from the fact that HDS is unleashing what we stand for publicly. If you’ve listened to our CEO Jack Domme you’ll get a hint, and if you live within our “four walls” you’ll also understand. What we are doing is beginning to talk externally about what we stand for, inviting you to participate in our journey.
That is because we live in exciting times and we are now within the midst of what HDS calls the Big Bang of Data. Precursor factors contributing to the Big Bang of Data include increased volumes of data, a hyper pace of data growth and newer methods of accessing data.
The figure I’ve constructed and included in this post above shows what I believe to be three important factors: a shift in the character of our society with the emerging Millennials, the on-demand anytime data access from mobile devices, and the resulting accumulation of unstructured data/content. The post-Big Bang of Data era is the “quiet” yet equally exciting time where new fundamental structures are developed, new treasures are uncovered and created, and information intermingles with other information engendering the next insight!
While the message may sound similar to that of other players in the industry, as I have pointed out in this previous post, Hitachi’s interactions with the world are vastly different from those of our competitors. Here is a relevant quote from my previous post:
“In a very real sense, and key to Hitachi’s differentiation, we are learning to think more like a customer because we are actively working to fuse IT to social infrastructure. This makes us think a lot about how to cope with our own deluge of data so that we can improve our own offerings directly and indirectly.”
As to how long this journey we are on will take, I’m not sure, but likely we will be doing this kind of thing for years into the future. I think our journey is not unlike how Apple exposed their vision to be the digital hub of the living room not as a strategy but as what they stood for. This approach has allowed Apple to have a guiding direction with sufficient room to explore, correct, modify and expand as they participated in the market.
I believe that today we are doing the same thing by exposing what HDS stands for, not just a strategic vision. I suppose that this was inevitable as Data is our middle name.
The Road To HDS Cloud
by Miki Sandorfi on Oct 25, 2011
Today HDS made an important announcement supporting our commitment to cloud, and it came in two parts.
First, we announced and provided details behind our vision of where cloud is headed: from infrastructure towards content and then information cloud.
Second, we announced the availability of new Hitachi Cloud Services that customers can leverage to reduce the TCO of unstructured data in their environment while putting themselves on the path to content cloud.
Let me explain a little about these two parts of our announcement.
The first part is giving further insight into what is driving our development, acquisitions, and delivery around cloud services. The graphic below shows at a high level what we are describing:
We view the infrastructure cloud as a basic set of capabilities that allow us to solve bigger problems—the infrastructure cloud is a set of tools in a toolbox. In the content cloud, we focus on leveraging the dynamic infrastructure enabled by the infrastructure cloud and add capabilities to liberate data so that it can be freely used and re-used: it enables fluid content. We have a great example of this in practice with our customer Klinikum Wels (you can see the case study here). Finally, this leads to the information cloud, where we can leverage the dynamic infrastructure cloud and the liberated data in all forms (structured, unstructured, semi-structured) from the content cloud to drive towards sophisticated analysis and insight.
The second part of the announcement focuses on new Hitachi Cloud Services that are now available and help customers build towards the content cloud while realizing immediate benefits in the infrastructure cloud.
These solutions are focused on TCO reduction for unstructured data, and are delivered with self-service and pay-per-use capabilities. Here you can continue with already-deployed traditional NAS and complement it with a cloud solution for 30% or more TCO reduction (file tiering); augment or replace NAS filers with a backup-free, bottom-less cloud implementation that still “looks and feels” like traditional NAS but is deployed with next-generation cloud technology; and complement SharePoint environments by offloading a bulk of content into the private cloud.
Because customer choice is extremely important, we have designed all of these new solutions to be modularly delivered: customers can purchase these offerings as cloud packages and build their own cloud around them. They can optionally enable self-service and billing/chargeback by electing to deploy the management portal. Or we can provide fully managed solutions including a true OpEx pay-per-use consumption model with no upfront capital expense to the customer.
Regardless of deployment choice, these solutions put the customer on the path towards content and information cloud. Customers get common, policy-driven data lifecycle management; search and retrieval of their data any time, from anywhere; and a foundation to provide data abstraction from the application that generated the data. Information is the lifeblood of any organization, and we are helping deliver that value to our customers.
All Data Has Value
by Amy Hodler on Oct 24, 2011
The trends are clear and spectacular.
We create, collect and store an immense amount of data and information at an exponential rate.
IDC research has shown that in 2009 there were 800 exabytes of data (that’s 800 million terabytes); by 2020, IDC forecasts this number to be over 35,000. It’s hard for me to wrap my head around such a large number, but “Data, data everywhere”, an article published last year by The Economist, put the scale and magnitude of this increase in perspective. The piece noted that Wal-Mart “handles more than one million customer transactions every hour, feeding databases more than an estimated 2.5 petabytes – the equivalent of 167 times the books in America’s Library of Congress.”
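For anyone else who has trouble picturing those numbers, the arithmetic behind them can be sketched as follows. This assumes the 35,000 figure is also in exabytes, uses decimal units, and treats the growth as steady compounding between 2009 and 2020, which is a simplification of mine rather than anything IDC states.

```python
TB_PER_EB = 1_000_000  # decimal units: 1 exabyte = 1,000,000 terabytes

start_year, start_eb = 2009, 800
end_year, end_eb = 2020, 35_000  # "over 35,000", so treat this as a floor

growth_factor = end_eb / start_eb
years = end_year - start_year
# Steady compounding is an assumption made for illustration.
annual_rate = growth_factor ** (1 / years) - 1

print(f"{start_eb} EB = {start_eb * TB_PER_EB:,} TB")
print(f"Growth factor: {growth_factor:.2f}x over {years} years")
print(f"Implied compound annual growth: {annual_rate:.1%}")
```

Under those assumptions the forecast works out to growth of a little over 40 percent every year for eleven years running.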
Do we really need all this data?
Well, I think we do. (A separate post on Information Overload is sure to follow if we can think of something new to add and thereby add to the noise…er… I mean information.)
First off, I’m in awe of the totally cool and amazing things people are doing with information that was once not even a consideration. In a recent article on Feedback Loops, Wired magazine reported some fascinating work done by Shwetak Patel, who translated “the cacophony of electromagnetic interference into the symphony of signals given off by specific appliances and devices and lights.” Through a single device in a single outlet and a stack of algorithms, he could tell if someone left the blender on, and thereby possibly get assistance where needed. Who would have thought that your house has such unique and useful electronic fingerprints?
And then even the most seemingly useless information, such as Twitter status updates on anything and everything and nothing, is proving to be valuable. Last year researchers at Indiana University and the University of Manchester submitted a paper that claimed and illustrated at a high level that the “moods expressed in Twitter feeds can accurately predict some changes in the Dow Jones Industrial Average three or four days before they occur”. If this seems far-fetched, well, I’ll have you know that the Library of Congress is now saving your tweets. Yes, our tweets are now part of the data stream that is considered a historical timeline, as reported by the New York Times last year. Even though this information previously had questionable relevance, there is obviously an expectation of current importance and future potential.
(Am I a total geek for hoping that someone will dub in “Every Bit is Sacred” over a somewhat well known Monty Python tune?)
When I start thinking about the idea of saving all possible data, the image of my garage suddenly pops into my mind.
No matter how many new and valuable things I want to pack into it, I only have so much space. And that leads me back to that The Economist article on “Data, data everywhere”, because they also reported on IDC studies indicating the amount of digital information being created already exceeds the amount of available storage, as illustrated in this chart.
Now this really starts to look like a classic hockey-stick problem.
A Washington Post article earlier this year cited a University of Southern California study on digital data massively overtaking analog data, which this chart really illustrates. The piece quotes one of the authors of the study, Martin Hilbert, who states that “Humans generate enough data – from TV and radio broadcasts, telephone conversations and, of course, Internet traffic – to fill our 276 exabyte storage capacity every eight weeks.” And that’s JUST the human generated data!
So what does this mean now? What happens when this starts to look like 0/1? Better technology and quantum mechanics to the rescue?
I don’t know. But I do know the only way I fit more stuff in my garage is to organize, repack and prioritize what stays. Knowing there will come a time when you, your organization or your clients have more digital information than space available, how are you planning to prioritize? Are you taking any steps now?
Big Data? Big Deal!
by Frank Wilkinson on Oct 19, 2011
There have been many articles, opinions and positions written about the big data phenomenon in the past year, and if you were not confused before, you may now be trying to decipher its impact on your organization and your business. There is no doubt about what big data implies, and its effect on business and beyond the data center. It is an important milestone as we enter the next phase of the IT era, as we look not only to the data created by business applications, databases, machines, email and files, but also to data generated by social media and business productivity tools (such as business social media applications).
They all have their impact on how we run business, and that is the easy part to discern from big data issues. What about insight? How will we better utilize our data objects, object stores and their associated metadata to really help drive real opportunities and insight? This was part of the premise of business intelligence solutions, and what they were going to deliver, right?
Not so much!
Seriously Big Data
There are many reasons to take big data issues seriously, not to mention how to best manage it and integrate it, leverage it, and back it all up. Those are challenges in themselves. For me the greater issue is: how will we interact with the data and make decisions based upon factors that can have an immediate impact on our business? Sure, integration, mobility, backup and exponential data growth are important and they do relate to how we can gain better insight for making more accurate business decisions, but the fact is, if we have not figured out how to manage the data growth by now, we are in serious trouble. Data consolidation, converged infrastructures and scalable architectures are the beginning of solving how best to approach big data impacts, but what do you do with the data, how will you leverage its insights, and what will you use to extract the insight from the data?
Data mashup dashboards and business dashboards are not necessarily new concepts, and while there are some that have made impacts to the market, none of them have breached beyond basic information without leaving you feeling like you forgot something (like that feeling you get when you leave the house in the morning).
Analytics are becoming more relevant in unstructured data, just as they did for structured data. This is such a big market for enterprise vendors such as IBM, EMC, NetApp and Hitachi, that we are all hard at work building better solutions to help this effort. But that does not mean all solutions will be created equal, nor does it mean that they will offer identical or similar approaches. The end result will be end user adoption, usability and customization.
The question I often ask myself is: “what is really needed in an architecture that will lend itself to provide greater clarity and insight of the data?” This is a complex problem, and one with many different variants for the answer. I believe that we are facing a dynamic shift in how solutions will be designed in the future, with a heavier emphasis on truly understanding how businesses run and keying in on the requirements to help business make more intelligent decisions. Throughout the past twenty years we have focused on helping to improve processes, IO, smarter applications and the underlying hardware technologies. I feel, however, the next step is determining the best way to marry hardware and software functions more tightly with more internal hand-off. As an example of this, Michael Hay has written a post describing the essence of autopoiesis, the marriage and tighter integration between hardware and software functions. This is a start. But how?
There is no doubt that our systems and solutions have become much more intelligent, as well as more complex, but at the same time they have become less dynamically integrated, with no common connector to share information in a truly usable way. Such a connector could be leveraged for greater insight into what is happening within the systems and applications, reporting data into a business dashboard (which can be used as the single-pane-of-glass view of the data and its relevance to the business).
What Are We Searching For? What Will We Discover?
If I want to understand a particular event, such as a sales engagement, what would I like to know about it? Is there information already residing within my data stores? What type of data? Could I also have instant access to social media data and what kind of insight can I gain to assist my decisions going forward?
What I am talking about is how we can have a true 360 degree view of the data that we have access to, as well as data that we need to have access to. While we are in the midst of a big data revolution, the issue is how we can find what we are searching for when the results are collected from various, and perhaps disparate, data silos and presented in a correlated view. This would enable data to be displayed with relevancy across data streams and data types. As an example: if I perform a search for John Dowe, the results returned should show all of John’s activity related to the constructed query. In this case, the results may show some relevant emails, documents, John’s network or server activity, log file data (he was trying to gain access to a SharePoint site which he does not have access to), some voice mails and perhaps some security surveillance video of him while on the company campus. Let’s also add in his social network activities, to see if he has breached corporate security by discussing corporate secrets (yes, big brother is watching). By themselves, the data parts are not very interesting, but if we can tie the data together, it can show that John was up to something very nefarious: email correspondence with a competitor, voice mails that captured a conversation between John and a corporate spy, and security footage with audio capturing him in the parking lot exchanging a sealed envelope.
Is that interesting? Absolutely!
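A correlated view of this kind can be sketched in a few lines. Everything below is invented for illustration: the silo names, the records, the field names and the `correlated_view` function are hypothetical, and a real system would query live stores rather than in-memory lists.

```python
from datetime import datetime

# Hypothetical silos; every record and field name here is made up.
EMAIL = [
    {"who": "jdowe", "when": "2011-10-03T09:15", "what": "email to competitor address"},
    {"who": "asmith", "when": "2011-10-03T10:00", "what": "email to payroll"},
]
ACCESS_LOG = [
    {"who": "jdowe", "when": "2011-10-04T22:41", "what": "denied access to SharePoint site"},
]
VOICEMAIL = [
    {"who": "jdowe", "when": "2011-10-05T17:30", "what": "voice mail from unknown caller"},
]

SILOS = {"email": EMAIL, "access_log": ACCESS_LOG, "voicemail": VOICEMAIL}


def correlated_view(person):
    """Collect one person's records from every silo and order them in time."""
    hits = [
        {"source": name, **record}
        for name, records in SILOS.items()
        for record in records
        if record["who"] == person
    ]
    return sorted(hits, key=lambda hit: datetime.fromisoformat(hit["when"]))


timeline = correlated_view("jdowe")
for hit in timeline:
    print(hit["when"], hit["source"], hit["what"])
```

No single silo tells the story; the picture only emerges once the records are joined on the person and laid out along one timeline.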
How Do We Get There?
The Search is NOT over!
Search is a term which is too widely used and not fully understood for its potential abilities. Whether you are searching data as part of an eDiscovery process, for corporate governance or simply to find relevance around a certain subject, search is the tool we want to leverage, and its results are what we want to gain insight from and base decisions on. Search is just the beginning, but to be fair, it’s just search, and what we need is more data and data types to help us see a true picture of what we are looking at.
Starry Night by Van Gogh is a beautiful painting depicting a wondrous night sky. But what if you only could see the Rhone river? What would we know about the rest of the painting? Would we know that the artist wanted us to know that it is evening? Would we know that there are other aspects and objects? Of course not, we would only see what we have access to. Much like search and discovery of data, we only know what we know. Nothing more.
There are often discussions around what it would take to get there, and the answer I hear quite often is that it is too hard and too costly. To get to where business needs us to be, we need to think about how we can develop THE NEXT smarter architectures and infrastructures leveraging open source solutions and common connectors, while leveraging PCIe, smarter controllers, FPGAs and SSDs, and embedding software functions more closely to the data streams.
Sounds hard, right? Yes, but not impossible. We have spent the last two decades adding more and more complexity and infrastructure to handle the large amounts of data we generate, but we fell short of true integration across the data center and business applications. I do agree that there are technical obstacles to overcome in order to have better integration, including how we can push search functions and their associated IO down to the hardware layer to minimize the IO across the infrastructure. This is where HDS shines, as we have the best scientists, engineers and resources working on and solving these problems. Let’s not forget that Hitachi develops some of the world’s best technology solutions and core IP that is utilized in almost every facet of life and is leveraged in some of the technology you have in your data center. We know a thing or two about taking the overtly complex and making it simple.
While I cannot go into great detail about our current endeavors around our research and development efforts, rest assured that we have been working to decipher the necessary technologies to bring THE NEXT—whatever it may be.
Check back soon for my next blog, which will discuss next generation discovery needs and data correlation.
HDS At The Gartner Symposium 2011: Come Hear Our Plan For Harnessing The Power Of Cloud
by Miki Sandorfi on Oct 13, 2011
HDS is excited to be a Premier Sponsor at this year’s Gartner Symposium/ITxpo 2011 held October 16 – 20, 2011 in Orlando, FL. Attendees should expect the conference to be packed full of classes and labs, and to have access to tons of information on emerging technologies. A few key examples of how the agenda has expanded from last year:
- 450+ educational sessions, workshops, how-to clinics, Mavericks, Market overviews, Net IT Outs, Magic Quadrants/MarketScopes and roundtables from Gartner Analysts (up from 350+ sessions in 2010)
- More Gartner analysts onsite –170 (150 in 2010)
- Four+ hours of keynote presentations covering the latest trends
- 20+ hours of networking with peers from other companies
- CIO Program (pending approval) with an increase in workshops – 80 (58 in 2010)
- Role-Based Tracks: Increase in capacity and volume of interactive workshops and Net-IT Out short-form sessions
- Industry Program expansion on Sunday and throughout the week (gartner.com)
I am excited to present Infrastructure Cloud and Beyond on Monday, October 17th. 12:45-1:45PM (Room: Southern IV-V)
Through my presentation, you will learn about our strategy to harness the power of Cloud with best practices and examples. The presentation will provide information for organizations that are moving to a cloud service delivery model in order to remain agile, competitive and to improve operations and cost efficiency. As companies embark on their journey to the cloud, it’s critical to understand the path ahead. The foundational elements must include integrated components, virtualized infrastructure and be capable of dynamic scale. In my session, hear from CSC Corporation on how they are harnessing the power of cloud to provide federal cloud computing service delivery.
We also have fun giveaways and Q and A’s planned. Please join me on Monday, October 17th. 12:45-1:45PM Room: Southern IV-V.
During the conference expo hours please stop by HDS’ booth located in the Pacific Hall of the Dolphin Hotel (booth #PS9). The booth will be jam packed with new HDS product information and strategy, HDS leaders and other fun things!
We want to help you start putting the power of Cloud in context to your goals by having better and more sophisticated insight for decisions and advancements. In order to do this, we need more fluid content, which of course requires more dynamic infrastructure. Find out how HDS is driving solutions to support infrastructure cloud, content cloud and information cloud.
For more information on the conference, visit http://blogs.gartner.com/symposium-live-orlando/
Silence Is A Part Of The Discussion
by Michael Hay on Oct 11, 2011
Have you ever watched a completely uncut interview or viewed footage of Steve Jobs when he doesn’t have a script? From the moment he announced his official retirement this summer I have done exactly that. When the moment called for it, Jobs would pause, sometimes uncomfortably, collect his thoughts and then respond. Most recently, I watched Steve’s extensive interview at D8. There, he pauses on several occasions, once about the Gizmodo incident, once in response to an audience question about content business/distribution, and a mild pause when reflecting on his 2005 Stanford speech. In every instance the silence spoke loudly, making me, as the listener, pay even more attention.
In the same way, the white space in a document, software application or even hardware device redoubles a user’s focus on what is actually there. Steve and the company he gave birth to accomplish exactly the same. You will find this principle, one of the 10 usability commandments of aesthetic and minimalist design, applied in Apple’s products and product portfolio. Space is given to sharpen the focus on what is important.
With this in mind, I want to look at last week’s activities and pose a question. Almost since the beginning of this year the press, the blog-o-sphere, etc. have been saying that a new iPhone would be just around the corner. Creative hobbyists and design firms pitched in with crazy designs (see video), many speculatively produced cases, etc.
In short, everyone who could offered predictions and set expectations for what Apple was going to do (except, well, Apple). Then Apple sent out invitations to an event on October 4th for what we all now know as the release of the iPhone 4S. Apple’s event dashed many expectations, mostly around giving birth to a bold, newly designed iPhone 5. However, even more importantly, the event itself was not widely broadcast and the tone was low key — I watched the event in its entirety via AppleTV. The following day, on October 5th, the iPhone 4S adorned Apple’s website in its glistening glory. Later that evening when news broke that Mr. Steve P. Jobs passed away, Apple adorned their site with a portrait of their founder, linking to a profoundly simple epitaph.
I can see clearly now that the subdued somber tone of the iPhone 4S announcement was creating space for something vastly more important; Apple was practicing culturally what their founder encoded in their corporate DNA, creating space to focus on celebrating Steve’s life.
All of this gets me to my parting question.
To celebrate the life and achievement of Steve Jobs, what space are you creating to focus on something more important?
Putting HDD Product Trends Into Perspective: A Subsystem View
by Claus Mikkelsen on Sep 30, 2011
On a few occasions I’ve blogged about the drive industry, drive performance, and the effects it has on the storage array business in general, but below is a guest blog from my friend and colleague Ian Vogelesang on disk drive trends. He originally had posted this on our internal HDS website, but it’s too good to keep wrapped up, so I’m sharing it here for general consumption. Make sure you have some time on your hands, as the post is quite lengthy (but worth it!)
Ian is one of the smartest guys in this area (well, smart in general), and this offers amazing insight into what many of us think of as “just a disk drive”. It’s long, it won’t be for everyone, but it’s definitely worth reading by anyone in the storage “biz”.
Ian’s quick bio: Assistant GM of HDD development at Hitachi in Odawara, Japan, then VP Operations of Hitachi Data Systems in Santa Clara, then VP Marketing and then VP Strategy & Product Planning at Hitachi GST (during assimilation of IBM Storage Technology Division), before returning to Hitachi Data Systems.
So let’s take it away…!!!
By: Ian Vogelesang
This detailed blog posting targets a highly technical audience, exploring how HDD product trends will impact subsystem performance and economics over the next year or two and beyond.
It’s hard to anticipate what the reader will know and what the reader will not know, so I’ll leave the Reader’s Digest version for others.
Trends by category
- 7200 RPM LFF
- Today’s capacity point is 2 TB. This is available in both SATA and SAS versions on the AMS2000 family, and in a SATA version on the USP V and VSP.
- This platform is actively under development: 3 TB SATA models are already available in retail stores from multiple vendors, and SAS 3 TB models are expected soon.
- 7200 RPM SFF
- Both the VSP and AMS2000 product lines support 7200 LFF drives which have twice the capacity, so for now we are offering only the LFF 7200 RPM models.
- 10K LFF
- Long dead.
- 10K SFF SAS
- Today’s capacity points are 300 GB and 600 GB.
- This form factor is currently the mainstream platform in enterprise HDDs, so we expect higher capacity 10K SFF models over time.
- 15K LFF
- The 15K 600 LFF drive was the end of the line, and no higher capacity 15K LFF drives are expected.
- 15K SFF SAS
- Today’s capacity point is 146 GB
- Seagate has a 300 GB drive but we are not carrying this product because it is twice as expensive as a 10K RPM 300 GB SFF drive, but has much less than twice the performance.
- Hitachi GST says they plan no further 15K SFF drives at this time.
3 TB 7200 RPM LFF
Let’s start with the 3 TB SATA drive which is expected any time now, as it has been available in retail for a while. It’s safe to assume 3 TB models must currently be in qualification for subsystem applications.
This 3 TB 7200 RPM platform will also be available in a SAS-interface version at a higher price. Even larger models are expected over time. Generally speaking the SAS version of new models based on the 7200 RPM LFF platform will be available a few months after the corresponding SATA version ships.
Seagate had implemented a SAS-interface version of their 2 TB 7200 RPM LFF platform, and this is the SAS 7.2K 2 TB drive that is currently available on the AMS2000 series.
Both Seagate and Hitachi GST have announced SAS versions of their 3 TB 7200 RPM LFF drives, and thus going forward we will have multiple suppliers for SAS 7200 RPM LFF models.
Tests of the SAS-interface version of the existing 2 TB 7200 RPM Seagate HDD on the AMS2000 series showed that the SAS version of the drive, when configured in the AMS2000, offers in most cases over 2x the throughput of the SATA version of the same drive.
In other words, in a subsystem application spending extra money to use the electronics from a SAS drive instead of the SATA electronics on the same basic drive with the same platters, heads, spindle motor, and actuator gives you twice the performance.
Some people say that performance isn’t the point on SATA drives, which are all about capacity. I ask those people whether they think that poor people don’t care about money.
Prediction – SAS will largely displace SATA for 7200 RPM subsystem applications, even as SATA will remain the interface in PCs
The first part of this blog posting is going to explain why I expect the SAS 7.2K drive to largely displace its SATA-interface twin within 2 years.
The gigantic capacity of a 7200 RPM LFF drive is achieved as a tradeoff against other factors. In order to have the highest recording capacity, we need to use the largest diameter platters.
The platters in a 7200 RPM LFF drive are actually a bit bigger in diameter than 3.5 inches. When the dimensions of the 3.5-inch external form factor were originally fixed, drives actually had 3.5-inch diameter platters that fit between the sides of the base casting that naturally had to have clearance inside the casting walls for those 3.5-inch diameter platters. Nowadays if you take a drive apart you will see that the base casting in the area where the edge of the platter approaches the walls is slightly machined out, enabling the diameter of the platter to be slightly bigger than 3.5-inches. (I forget right now what the actual diameter is, but it’s a value in millimeters, not inches.)
OK, so given that you are going to use the largest diameter platters there are, it turns out (sic) that you can only rotate such big platters at 7200 RPM. Because of the wide diameter of the platters, if you try to run the platters faster you would astronomically increase the power consumed creating air turbulence, and you wouldn’t be able to cool the drive effectively.
So the first strike against SATA drives is that the HDD rotates relatively slowly.
The second strike against SATA drives is the slow seek speed.
Part of the slow seek speed comes from the simple fact that seek speed is inversely proportional mechanically to the length of the access arm, and a drive with 3.5-inch platters inside will have the longest arm and thus proportionately slower seek speed for how hard you push.
Then the other part of the slower seek speed comes from the budget, both financially and in space, for the actuator, and more specifically, the rare earth metal permanent magnet inside that the actuator pushes against. If you are making the cheapest drive, you can’t afford a bigger rare earth magnet to let the actuator push harder, and in any case, with those huge 3.5-inch platters inside, there’s no room for a bigger actuator motor.
So strike one was the slow RPM, strike two was the slow seek speed, and now strike three is that the ATA specification that has progressed into SATA was purely conceived for the purpose of direct attach to a host (a PC). The original authors of SATA never considered subsystem applications as evidenced by the fact that they fixed the sector size in SATA at 512 bytes.
Using SATA drives in subsystem applications
If you want to provide the usual assurance that you can detect any subsystem virtualization mechanism failures and any data corruption “end to end” within the subsystem, you are going to have to further compromise the performance of a SATA drive.
The problem with a fixed 512-byte sector size at the HDD level happens with the mechanisms that you need to employ in the architecture of a subsystem.
The same kinds of challenges face subsystem designers as face the architects of disk drives themselves – how can you really be sure every time that you are presenting the right data at the right time?
In a disk drive, you are emulating a logical disk drive with LBA addresses that are statically mapped (except during defect assignment) from the LBA to a physical location on disk.
So you have a virtualization mechanism that translates a logical address into a physical address on hardware. How can you know if your virtualization mechanism is working as designed, that is, that every time you read from a logical address, you get the data that was most recently written to that logical address? In other words, how can you detect if you accidentally wrote it in the wrong place or accidentally tried to retrieve it from the wrong place?
How do disk drives do it?
In disk drives, to provide these assurances, we used to use an ID field as a physical field that was stored in addition to the host sector, and where the ID field that is written with the data is the (logical) ID used by the host to specify where to put the data.
Suppose the drive originally wrote some data in the wrong place. If at any time later the host tried to read the LBA that really did belong in that physical location, the drive would retrieve the erroneously written data, but the contents of the ID field on disk wouldn’t match the requested ID field, and thus this ID field mechanism detects and fails I/O operations that would otherwise give the wrong data to the host.
Similarly, if you were to read from the wrong location, the ID field at that wrong location wouldn’t match the ID you were requesting and the I/O would fail.
So storing a logical address field with the data (or to make the field smaller, storing a cryptographic hash of the logical address with the data as is done in subsystems) can assure that virtualization mechanisms are working as designed.
Then IBM invented no-ID formatting in the late 1990s. The idea here has to do with how you compute the ECC bytes that are used to detect and even repair minor corruptions sprinkled here and there within a sector.
The idea is to compute these ECC bytes logically as if the sector were longer than 512 bytes by the size of the LBA address, and then to compute the ECC as if the LBA were logically prepended to the data. This meant that we no longer needed to have a separate ID field, but without increasing the size of the ECC data, we could still derive a fingerprint that detected if the data that had been stored under that LBA really originally came from that LBA. Thus we can ensure that every time we read data from disk, we know that after all the complex logical-to-physical things there are in a disk drive (zoned recording, serpentine LBA layout, skip-slip defect assignment, grown-defect assignment, etc.), we still will detect if there’s any corruption of the virtualization mechanism.
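As a rough illustration of the no-ID idea, here is a minimal Python sketch. It is not how a drive's ECC actually works: CRC-32 from the standard library stands in for the real ECC, it can only detect mismatches rather than repair them, and the dictionary called media and the function names are hypothetical. What it does show is that check bytes computed over "LBA + data" expose a misplaced write even though the LBA itself is never stored.

```python
import struct
import zlib

def check_bytes(lba: int, data: bytes) -> int:
    """CRC-32 over the LBA logically prepended to the sector data.
    (Stand-in for the drive's ECC; an assumption for illustration.)"""
    return zlib.crc32(struct.pack(">Q", lba) + data)

def write_sector(media: dict, location: int, lba: int, data: bytes) -> None:
    # Only the data and its check bytes are stored, never the LBA itself.
    media[location] = (data, check_bytes(lba, data))

def read_sector(media: dict, location: int, lba: int) -> bytes:
    data, stored = media[location]
    # Recompute the check bytes using the LBA the host asked for.
    if check_bytes(lba, data) != stored:
        raise IOError(f"check bytes do not match LBA {lba}")
    return data

if __name__ == "__main__":
    media = {}
    payload = bytes(512)
    # Toy mapping: LBA n lives at location n.
    write_sector(media, location=7, lba=7, data=payload)   # correct placement
    write_sector(media, location=9, lba=7, data=payload)   # misplaced write
    print(len(read_sector(media, location=7, lba=7)))      # reads back fine
    try:
        read_sector(media, location=9, lba=9)
    except IOError as err:
        print("detected:", err)
```

The same check catches a read from the wrong location, because the LBA used to recompute the check bytes no longer matches the one used when the data was written.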
Now let’s talk about subsystem applications of HDDs.
Subsystems also have virtualization mechanisms, and you need to provide the same two assurances that you provide as a disk drive designer, namely that the data is intact, and that it physically came from the same place originally used to put the data when it came from the host.
Therefore you need to have a checksum of the data, and you need to have a fingerprint of the LUN & LBA, and these need to be captured at the point of entry into the subsystem at the host port, because we need to offer “end to end” protection, and these check bytes need to accompany the host sector on its journey through the data paths in the subsystem, in and out of cache and onto disk and back from disk.
These requirements to store a few check bytes with each 512-byte host sector in subsystem applications were well understood at the time of the first SCSI drives, and even back before then.
The SCSI spec as originally conceived provided for the user to perform what is called a low-level format which (re)creates all the sectors on the drive. This low-level format can be performed with a range of sector sizes from 512 bytes (always used for direct host attach) in increments of 4 or 8 bytes up to as much as 528 bytes in some models. These larger sector sizes allowed subsystem architects to store a check byte field along with every 512-byte host sector. Hitachi subsystems use HDDs low-level formatted with 520-byte sectors in drives that use the SCSI command set, namely SAS and Fibre Channel HDDs.
In this way, the subsystem maker can provide a checking mechanism to assure the accuracy of the virtualization layer, as well as to ensure that the sector is not corrupted along its journeys through subsystem data paths and cache memory.
The problem with SATA is that there is no provision in the spec for sector sizes other than 512 bytes. And the reality is that for every possible bit pattern in a sector, the host has to be able to write that 512-byte sector and then read it back again. So we need to use all 512 bytes to store the information contained in the host sector, because if we were able to condense 512-bytes worth of data into less than 512 bytes of space to make some room to store some additional information, then there would have to be at least two host bit patterns that would result in the same smaller-than-512-byte encoding. So there’s no room to encode more information in 512 bytes of space on top of what the host is storing in that bit pattern.
So you can’t provide virtualization mechanism protection and subsystem data path corruption detection assurance and still map each host 512-byte sector on a SATA drive to one 512-byte sector on disk.
Thus you can either decide to fly blind, trusting that there are no design flaws and that the hardware works perfectly, storing each host sector as one 512-byte SATA sector on disk, or you can decide to provide the usual assurance mechanisms, albeit with a performance penalty.
Hitachi protects against subsystem virtualization errors even with SATA
What Hitachi does is to expand each sector as it is written by the host as usual into a 520-byte sector that contains the 8 check bytes guarding against virtualization and corruption errors, and then at the point of writing these 520-byte sectors on disk, we write 64 of them in a “clump” (my term) of 65 physical 512-byte sectors on disk. Doing it this way means that the customer is protected as usual against any subsystem architectural or algorithmic flaws.
For read operations, it doesn’t matter much, because you just read the whole clump (it’s only 32K out of a track size of about 1 MB anyway) even if all you want is a 4K bit within the clump.
But for writing there will be extra I/O operations required with SATA drives that are not necessary with SAS or FC drives.
This is because a 4K write from a host will be for a set of sectors that, once they are mapped to 520-byte logical sectors, will always require updating part but not all of at least one physical sector on disk. (Every 520-byte logical sector gets mapped to a range that is bigger than one physical sector but shorter than two physical sectors.) And any write smaller than 32K will always need a “pre-read” of the old contents of the clump so that you can update the bit newly written from the host before writing it back. If you think about it, this only applies to RAID-1, because in RAID-5 and RAID-6 random writes, you read the old data before you write the new data, and while you are at it reading the old data you just read the whole clump.
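The clump arithmetic can be checked with a few lines of Python. This is only a sketch of the byte layout described above, assuming the 64 logical sectors are packed back to back inside the clump; the function name physical_span is made up for the example.

```python
HOST_SECTOR = 520        # 512 data bytes + 8 check bytes
PHYS_SECTOR = 512        # fixed SATA sector size
LOGICAL_PER_CLUMP = 64
PHYS_PER_CLUMP = 65

# 64 expanded sectors fit exactly into 65 physical sectors.
assert LOGICAL_PER_CLUMP * HOST_SECTOR == PHYS_PER_CLUMP * PHYS_SECTOR

def physical_span(first_logical: int, count: int):
    """Return (first physical sector, last physical sector, needs_pre_read)
    for a write of `count` logical sectors starting at `first_logical`,
    both counted from the start of the clump (assumed packed layout)."""
    start = first_logical * HOST_SECTOR
    end = (first_logical + count) * HOST_SECTOR       # exclusive byte offset
    first_phys = start // PHYS_SECTOR
    last_phys = (end - 1) // PHYS_SECTOR
    # A pre-read is needed if either edge lands inside a physical sector.
    partial = (start % PHYS_SECTOR != 0) or (end % PHYS_SECTOR != 0)
    return first_phys, last_phys, partial

if __name__ == "__main__":
    print(physical_span(0, 1))    # one logical sector straddles two physical ones
    print(physical_span(16, 8))   # a 4K host write in the middle of a clump
    print(physical_span(0, 64))   # a full-clump write needs no pre-read
```

Under this layout every 4K write (8 logical sectors) has at least one edge inside a physical sector, which is why the old contents have to be read first.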
SATA-W/V vs. SATA-E on enterprise subsystems
But this isn’t the only performance issue with SATA. Our design engineering department is very concerned about the potential for lower reliability with SATA drives. Therefore on the Enterprise subsystem, we offer the user two choices. The first is “SATA-W/V”, or “write and verify”, where every write to disk, once destaged to the disk surface, is followed by a read from disk to assure the write didn’t encounter any “silent write failure” condition. There is a known failure mechanism which is common to all disk drives, regardless of host interface type, that can have silent write failures for a short period of time before the host (or the drive) figures out that nothing changes as a result of writes any more. Doing a read verify operation after every write allows us to detect if this rare but nevertheless possible failure mode is occurring.
The second option offered to Enterprise customers is the “SATA-E” mechanism. With SATA-E, we also protect against silent write failures, but in such a way that we don’t need a read-verify operation after every write. Instead, what we do is randomize the mapping from the 64 host 520-byte sectors to the 65 physical sectors within a clump on every write. That way if you write something and there was a silent write failure, on the I/O that failed, the physical location of that LBA on disk in the clump would have changed, and if you try to read the data back after a silent write failure, you will read from the new location, not the old location, and therefore the data at the physical place you are looking would not match the LBA and the I/O would fail.
So SATA-E actually is faster for pure random writes than SATA-W/V on Hitachi enterprise subsystems, and it detects silent write failures.
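To make the SATA-E idea concrete, here is a toy Python model. It is heavily simplified and should not be read as the actual Hitachi implementation: the clump is modeled as 64 slots rather than 65 physical sectors, each slot stores the host sector index next to the data as a stand-in for the check bytes, and the "randomization" is a random non-zero rotation (my assumption, chosen so that every sector is guaranteed to move). The class and method names are hypothetical.

```python
import random

SLOTS = 64  # simplification: one slot per host sector in the clump

class ToyClump:
    def __init__(self):
        self.media = [None] * SLOTS          # slot -> (sector index, data)
        self.mapping = list(range(SLOTS))    # toy stand-in for the VMA entry

    def write(self, contents, rng, silent_failure=False):
        """Rewrite the whole clump under a fresh mapping."""
        shift = rng.randrange(1, SLOTS)      # non-zero, so every sector moves
        new_mapping = [(slot + shift) % SLOTS for slot in self.mapping]
        if not silent_failure:
            for index, slot in enumerate(new_mapping):
                self.media[slot] = (index, contents[index])
        # The mapping is updated even if the media silently was not.
        self.mapping = new_mapping

    def read(self, index):
        stored = self.media[self.mapping[index]]
        if stored is None or stored[0] != index:
            raise IOError(f"sector {index}: stored identity does not match")
        return stored[1]

if __name__ == "__main__":
    rng = random.Random(0)
    clump = ToyClump()
    clump.write([f"old-{i}" for i in range(SLOTS)], rng)
    print(clump.read(5))                     # old-5
    clump.write([f"new-{i}" for i in range(SLOTS)], rng, silent_failure=True)
    try:
        clump.read(5)
    except IOError as err:
        print("silent write failure detected:", err)
```

Because the stale data still sits where the previous mapping put it, the slot the new mapping points to holds a different sector's identity, and the read fails instead of returning old data.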
But the problems with SATA-E are twofold. Firstly, there are so many clumps on a SATA drive that the mapping information that records, for each clump, the permutation of the logical 520-byte sectors for that clump, which is called the Volume Management Area or VMA, is too big to entirely fit into Shared Memory. This means that when you read from a SATA-E parity group, there is a chance that you will need to do a pre-read, an additional I/O operation, to fetch the section of VMA from disk before you can satisfy the host read request. Thus random reads are slower on SATA-E.
There is some chance you will have to write to the VMA with every host data destage operation as well.
The second problem with SATA-E is that the extra computation required to essentially add another virtualization layer substantially increases microprocessor busy. In the USP V, the guidance from engineering was that SATA-E would increase BED utilization by 70%. For this reason alone, we generally don’t recommend SATA-E because it disproportionately consumes MP resource, and this MP resource is what limits the overall IOPS throughput of a subsystem.
OK, so if you have borne with me so far, there is light approaching at the end of the tunnel. We don’t know whether it’s the end or whether Ian’s just going to move to the next phase of the explanation.
How big is too big?
We’ve seen that performance on a SATA drive is much worse than performance with SAS or FC drives, because SATA drives rotate slower and seek slower. On top of that, the mechanisms that assure the integrity of subsystem virtualization layers and detect data corruption “end to end” require extra I/O operations to be issued to SATA drives that don’t have to be issued to SAS and FC drives, so the performance of SATA drives in subsystem applications is further impaired.
The good thing about SATA drives is the huge capacity. The bad thing about SATA drives is 1/3 the IOPS capability to handle not only host I/O, but so called “SATA supplemental” I/O operations that aren’t needed with SAS or FC.
Is this a big problem? Let’s run a couple of numbers to see where we are in the ballpark.
A very large dot com customer with an instantly recognizable name has told us that the overall average access density to their data over their entire shop is about 600 host I/O operations for every TB of host data. This corresponds exactly with some data that IBM published a while back, so this is a reasonable average amount of I/O activity per TB of data.
Let’s compute very roughly what a 10K 300 GB SFF drive and a 2 TB HDD look like in terms of host IOPS per TB of data.
First let’s look at the drive itself. A 2 TB drive can do about 100 IOPS and the capacity of the drive is 2 TB, so the SATA drive can do about 50 IOPS per TB. If the AVERAGE activity of the customer’s data is 600 IOPS per TB, that would mean that if the drive were directly attached to the host (without RAID), on average you could only use less than 1/10th of the capacity of the HDD before you ran out of IOPS. Oh. And the 3 TB drive is coming, and then even bigger drives after that, all with the same IOPS capability!
What about our 10K 300 GB SFF drive? It can do about 300 IOPS, so the direct attach access density capability if you fill the drive is 300 IOPS per 0.3 TB, or about 1,000 IOPS per TB of data. So for direct attach, you should be able to fill 300 GB 10K drives with data of average activity.
You could even almost fill 600 GB drives which have the same IOPS but twice the capacity, so they can do 500 IOPS per TB of data.
But that’s only for direct host attach.
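The direct-attach numbers above are easy to reproduce. The short Python sketch below just restates the arithmetic using the round figures quoted in the text (100 IOPS for the 2 TB drive, 300 IOPS for the 10K drives, 600 host IOPS per TB as the average access density); the function and variable names are only for illustration.

```python
AVERAGE_DENSITY = 600.0  # host IOPS per TB of data, from the text

def iops_per_tb(drive_iops: float, capacity_tb: float) -> float:
    """Access density a full drive can sustain when directly attached."""
    return drive_iops / capacity_tb

def usable_fraction(drive_iops: float, capacity_tb: float) -> float:
    """Fraction of the drive you can fill with average-activity data
    before it runs out of IOPS (capped at a completely full drive)."""
    return min(1.0, iops_per_tb(drive_iops, capacity_tb) / AVERAGE_DENSITY)

DRIVES = {
    "7200 RPM 2 TB": (100.0, 2.0),
    "10K 300 GB SFF": (300.0, 0.3),
    "10K 600 GB SFF": (300.0, 0.6),
}

if __name__ == "__main__":
    for name, (iops, tb) in DRIVES.items():
        print(f"{name}: {iops_per_tb(iops, tb):.0f} IOPS/TB, "
              f"usable {usable_fraction(iops, tb):.0%}")
```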
The picture gets worse in a subsystem
In a subsystem you have some sort of “RAID penalty” for writes that depends on the RAID level. Just for the purposes of illustration let’s look at the number of HDD-level I/O operations needed to support one host read and one host write. For the host read, you need to do one HDD I/O. For the host write, you would need to do 4 HDD I/Os (read old data, read old parity, write new data, and write new parity) and more than that if you need SATA supplemental I/Os as well.
So for this workload, you would need 1+4 = 5 HDD I/Os, or 5/2 = 2.5 HDD I/Os for each host I/O. This ratio gets as bad as 1:4 for pure random writes in RAID-5, and worse still for SATA drives. So let’s use 2.5 HDD I/Os per host I/O for our ballpark estimation.
Our SATA drive that we thought could do 50 IOPS / TB now looks like it really can only do 20 host IOPS per HDD TB, before we even TALK about the SATA supplemental I/Os. So for ordinary data with an average access density of 600 IOPS per TB of data, we can only fill the drive to less than 20/600 = 3% of its capacity. (Yes, I’m sure the sharp-eyed reader will have noticed that this doesn’t account for the fraction of the potential drive’s capacity that is used for parity, but with the SATA supplemental I/O, thinking that you will hit the drive IOPS limit at about 2% or 3% full from a capacity point of view is about right.)
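The same ballpark estimate can be written down as a small Python sketch. It assumes, as in the text, a 50/50 mix of host reads and writes, one HDD I/O per host read, four HDD I/Os per host write, and it ignores both SATA supplemental I/Os and the capacity used for parity. The function names are made up for the example.

```python
def backend_ios_per_host_io(read_fraction: float, write_penalty: int = 4) -> float:
    """Average HDD I/Os needed per host I/O: 1 per read,
    `write_penalty` per write (4 for a RAID-5 random write)."""
    return read_fraction * 1 + (1.0 - read_fraction) * write_penalty

def host_iops_per_tb(drive_iops: float, capacity_tb: float, read_fraction: float) -> float:
    """Host access density the drive can sustain behind the RAID penalty."""
    return drive_iops / capacity_tb / backend_ios_per_host_io(read_fraction)

if __name__ == "__main__":
    ratio = backend_ios_per_host_io(0.5)           # one read + one write
    density = host_iops_per_tb(100.0, 2.0, 0.5)    # 2 TB 7200 RPM drive
    print(f"{ratio} HDD I/Os per host I/O")
    print(f"{density:.0f} host IOPS per TB, "
          f"fill limit {density / 600.0:.1%} at 600 IOPS/TB")
```

Setting read_fraction to 0 gives the pure random write case, where the ratio is 4 HDD I/Os per host I/O.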
If you buy SATA drives and then try to use them for average activity data, they will be hideously expensive if you buy enough drives to handle the host IOPS, since you can only fill them a couple of percent full with average activity data, and therefore you will need a LOT of HDDs. If you do fill the drives more than a couple of percent full of average data, the drives won’t be able to keep up with the IOPS, and you will have horrible response times, and Write Pending will instantly fill to the limit with data waiting to be destaged to disk. (That’s why many people put SATA parity groups in their own CLPR, addressing the symptom rather than the root cause.) Performance problems with SATA are sad in my opinion, since for normal applications, you either spend much more money buying a lot of SATA drives than the money you would have spent on SAS enterprise drives, or else you buy an insufficient number of SATA drives, trying to use the capacity of the drives, and then are not able to cope with host workloads and have serious problems.
So ANYTHING you can do to improve the IOPS capability of SATA drives will proportionately increase the amount of data that can fit on the drive before the drive reaches its max physical throughput.
Putting a SAS interface on a 7200 RPM LFF platform
What does the SAS 7.2K 2 TB drive offer us compared to the SATA 7.2K 2 TB drive? There are two main differences from a performance point of view.
The first is that SAS offers native 520-byte sector formatting, and thus there are no SATA supplemental I/Os on a SAS 7.2K drive. (The same sharp-eyed reader will have raised an eyebrow as to why the SAS version of the same drive can be trusted not to have silent write failures, but it’s always a judgment call in the end when it comes to what is “sufficient” protection, and Hitachi engineering is very conservative on protecting customer data.)
The second performance advantage of putting the SAS electronics on the 2 TB 7200 RPM LFF drive is that the Tagged Command Queuing or TCQ acceleration capability is much higher on the SAS electronics card.
SATA drives are optimized for the absolute lowest cost, and there just aren’t the microprocessor cycles nor the number of logic gates in an ASIC that you get with the more expensive electronics in the SAS electronics card.
The TCQ feature is what allows the host to independently issue a bunch of different write operations to the drive (called queuing I/Os in the drive) where each I/O operation is identified by a “tag” number, a small integer between say 0 and 31 typically for a SAS drive. The drive has the luxury of browsing the queued I/O requests from the host, and deciding what order to perform the I/O operations in regardless of the time sequence received from the host. Of course no I/O operation can be indefinitely delayed, but within such a constraint the drive has the freedom to re-order the I/O operations so as to be able to visit the associated physical locations in a sequence that minimizes seek time and rotation time.
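To give a feel for why re-ordering helps, here is a small Python sketch that uses a simple "nearest track first" rule. This is a textbook simplification and an assumption on my part, not what drive firmware actually does: real drives also account for rotational position and make sure no request waits indefinitely, both of which this sketch ignores. Track numbers and function names are hypothetical.

```python
def reorder_nearest_first(head: int, queue: dict) -> list:
    """Return tags in the order a greedy nearest-track-first policy
    would service them. `queue` maps tag number -> target track."""
    pending = dict(queue)
    order = []
    while pending:
        # Pick the closest request; break ties by the lower tag number.
        tag = min(pending, key=lambda t: (abs(pending[t] - head), t))
        head = pending.pop(tag)
        order.append(tag)
    return order

def total_travel(head: int, queue: dict, order: list) -> int:
    """Total head movement, in tracks, for a given service order."""
    travel = 0
    for tag in order:
        travel += abs(queue[tag] - head)
        head = queue[tag]
    return travel

if __name__ == "__main__":
    queue = {0: 900, 1: 100, 2: 850, 3: 150}   # tag -> track
    arrival_order = sorted(queue)               # tags in order received
    reordered = reorder_nearest_first(500, queue)
    print("arrival order travel:", total_travel(500, queue, arrival_order))
    print("reordered", reordered, "travel:", total_travel(500, queue, reordered))
```

With a deeper queue the drive has more candidates to choose from at each step, which is one way to think about why the gain grows with the number of outstanding tagged commands.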
TCQ can accelerate SATA I/O by about 30% (of course when the workload is multi-threaded) compared to the drive’s single threaded throughput. SAS drives can accelerate the IOPS to about 60% higher throughput compared to single threaded throughput.
So where we computed the throughput capability of SATA and SAS drives above, I’ll leave it as an exercise to the reader to calculate the access density capability of SAS and SATA drives in subsystem applications.
The bottom line is that SATA drives are desperately slow. They are so slow that you can only fill them to a couple of percent of their capacity before they run out of IOPS. The SAS version of the same drive, without the burden of having to perform SATA Supplemental I/O operations, and being able to perform the IOPS faster (1.6 / 1.3 or 23% faster), results in a combined doubling of the access density capability of the drive.
This is a BIG DEAL, and it’s why I expect all customers that learn this stuff to switch pretty much from SATA to SAS 7.2K except for those rare cases where the data truly is below cryogenically cold in terms of its activity level, or where the customer is fixated on SATA being cheaper per inaccessible GB.
For everybody that gets this, even a modest bump in drive cost with the SAS card will pay enormous returns in the form of more than doubled throughput.
This trend will only be driven harder by a shift to 3 TB and then even larger drives over time.
Other trends in disk drives:
The 10K SFF platform is the mainstream product for Enterprise HDDs. This means that in addition to the current 300 GB and 600 GB models, we should expect even higher capacity 10K SFF SAS models over time.
I’ll do the homework for you. If we have a drive with 600 GB that can do 300 physical IOPS further accelerated by 60% using Tagged Command Queuing, and where the RAID mechanism causes backend IOPS to be 2.5x the host IOPS, that drive can accommodate ((300 IOPS * 1.6) / ( (7/8)*600 GB) ) / (2.5 backend IOPS per host IOPS ) or 365 host IOPS per TB. Given that a global average host access density is 600 IOPS per TB, this means that the 600 GB 10K SFF drive is a “fat” drive that can only be completely filled with data that has about half the activity of average data. In other words, the majority of your data is going to be too active to store on a 600 GB 10K SFF drive, if you plan on using all 600 GB and filling the drive with data. Today the “sweet spot” for average data is somewhere between a 10K 300 GB and a 10K 600 GB drive.
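For anyone who wants to redo this arithmetic with their own numbers, here is a small Python sketch of the same calculation. The function name and parameter names are made up for the example; the inputs are the ones quoted above, with 7/8 treated as the usable fraction of the raw capacity.

```python
def host_iops_per_tb(physical_iops, tcq_gain, raw_gb, usable_fraction, backend_per_host):
    """Host access density (host IOPS per usable TB) a drive can sustain."""
    drive_iops = physical_iops * tcq_gain            # IOPS with command queuing
    usable_tb = raw_gb * usable_fraction / 1000.0    # usable capacity in TB
    backend_density = drive_iops / usable_tb         # backend IOPS per TB
    return backend_density / backend_per_host        # convert to host IOPS per TB


density = host_iops_per_tb(
    physical_iops=300,      # physical random IOPS of the drive
    tcq_gain=1.6,           # 60% acceleration from Tagged Command Queuing
    raw_gb=600,             # drive capacity in GB
    usable_fraction=7 / 8,  # usable share of the raw capacity, as in the text
    backend_per_host=2.5,   # backend IOPS generated per host IOPS by RAID
)
print(f"{density:.1f} host IOPS per TB")
```

Changing raw_gb while leaving the IOPS inputs alone shows directly how a bigger drive of the same form factor ends up with a lower access density.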
Future 10K SFF drives that have a capacity of even higher than 600 GB will still be capable of essentially the same IOPS as current drives, and thus these future 10K SFF drives with higher capacity will be even “fatter”, meaning that you can only use the entire capacity of the drive for low activity data.
This is a great lead-in to talk about what happened to 15K RPM.
The end of 15K, even in SFF?
Why are the HDD vendors saying that with SFF they can’t make any money selling 15K RPM drives and they are going to drop them? Hitachi certainly has been saying that, and although Seagate did launch a 15K RPM 300 GB SFF drive, it may be that this might not be viable for Seagate going forward at least short term. This is one of those things that could go either way, and one of those things where if you look at it in the short term (next year or two) there is one trend, and when you look at it longer term (e.g. 3-5 years) there could be quite a different trend.
To understand what happened, we should observe that disk drive platter diameters (and their associated external form factors) have been getting smaller and smaller when you look at them over a 50-year time span. The original IBM RAMAC drive had 24-inch diameter platters. The very next generation of drives used 14-inch platters, and at least for enterprise drives, that’s where it stayed for a long time. But over time we saw 14-inch platters give way to 8-inch platters, to 5 ¼-inch platters, then 3.5-inch platters, and now we are transitioning to the 2.5-inch SFF. (OK, Hitachi buffs will note that Hitachi enterprise drives introduced 9.5-inch platters with the Hitachi “Q2” drive that was compatible with the IBM 3380-K 14-inch platters, and then went to 6.5-inch platters that had this cool “reactionless” linear actuator design before finally switching over to standard OEM type 3.5-inch drives.)
The factor that drives an industry shift to a smaller form factor has to do with drive access density capability.
We talked about this earlier. Basically once you have fixed the platter diameter and drive RPM, and these are the factors that characterize a “form factor”, then all drives of that form factor basically are capable of the same random IOPS regardless of drive capacity, because the random IOPS is determined by the mechanical rotation time and the mechanical seek time. And you can’t increase the RPM or improve seek time in any big way without moving to smaller platters.
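A rough way to see this is the usual back-of-the-envelope model where the service time of a random I/O is the average seek time plus half a rotation. The Python sketch below uses that model. The 3.0 ms seek time is only a placeholder chosen for illustration, not a figure for any actual drive, and the model ignores command queuing and transfer time, so the absolute IOPS it prints should not be compared with the drive figures quoted elsewhere in this post.

```python
def half_rotation_ms(rpm):
    """Average rotational latency: the time for half a revolution, in milliseconds."""
    return 0.5 * 60000.0 / rpm


def random_iops(rpm, avg_seek_ms):
    """Simple estimate of random IOPS from seek time and rotational latency."""
    service_time_ms = avg_seek_ms + half_rotation_ms(rpm)
    return 1000.0 / service_time_ms


ASSUMED_SEEK_MS = 3.0  # placeholder value for illustration only

for rpm in (7200, 10000, 15000):
    print(rpm, "RPM:",
          f"{half_rotation_ms(rpm):.2f} ms rotational latency,",
          f"{random_iops(rpm, ASSUMED_SEEK_MS):.0f} IOPS at the assumed seek time")

ratio = random_iops(15000, ASSUMED_SEEK_MS) / random_iops(10000, ASSUMED_SEEK_MS)
print(f"15K vs 10K at the same seek time: {ratio:.2f}x")
```

Note that drive capacity does not appear anywhere in the function, and that going from 10K to 15K improves the estimate by less than the 1.5x ratio of the spindle speeds, because the seek time does not shrink with RPM.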
Some of you may remember 9 GB drives going to 18 GB going to 36 GB going to 73 GB going to 146 GB etc. Imagine, if you will, a 10K RPM drive with only 9 GB of storage capacity. (Actually, 10K RPM probably didn’t come along until later, but bear with me for the sake of argument. I just don’t remember off the top of my head what the capacity was of the first 10K 3.5-inch model.) You can immediately grasp that the ratio of IOPS to GB would be very high, so high in fact that you would almost always run out of GB before you would run out of IOPS.
So if you run out of GB before you run out of IOPS, then higher RPM drives look horribly expensive.
The reason for this comes from the relationship between drive RPM, platter diameter, and drive capacity.
If you double the RPM of a drive, you will need a little more than 4x the power to turn the platters, because the power required to create air turbulence goes up as a little more than the square of the speed. For example, to make a car go twice as fast, you need more than 4x the horsepower.
So we can’t double the RPM of a disk drive and keep the same platter diameter, because the drive would burn too much power and it would be difficult to cool.
But if we keep the RPM of the drive the same, and double the diameter of the platters, you increase the power required to spin the platters by over 16x. How can this be? Well, each unit of surface area at the edge of the platters is going twice as fast if the platter diameter is doubled. Therefore each unit of surface area needs a bit more than 4x the power. At the same time, if you double the diameter of the platter, it has 4x the surface area, and we just noted that each unit of surface area needs 4x the power, and thus the total power required goes up by over 16x.
In other words, we can increase the RPM and keep the power consumption the same if we decrease the size of the platters only moderately. All within a 3.5-inch LFF form factor, 7200 RPM drives have (roughly) 3.5-inch platters, 10,000 RPM LFF drives have 3.0-inch platters, and 15K LFF drives have 2.5-inch platters.
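As a sanity check on those platter sizes, here is a short Python sketch that takes the scaling described above at face value: power proportional to the square of the RPM and to the fourth power of the platter diameter. Those exponents are the “a little more than” lower bounds from the text, used here as an assumption rather than a precise aerodynamic model, and the function name is made up for the example.

```python
RPM_EXPONENT = 2.0       # power ~ RPM^2 (lower bound from the text)
DIAMETER_EXPONENT = 4.0  # power ~ diameter^4 (lower bound from the text)


def diameter_for_same_power(base_rpm, base_diameter_in, new_rpm):
    """Platter diameter at new_rpm that needs the same spin power as the baseline."""
    # Solve base_rpm^a * base_d^b = new_rpm^a * new_d^b for new_d.
    return base_diameter_in * (base_rpm / new_rpm) ** (RPM_EXPONENT / DIAMETER_EXPONENT)


d10k = diameter_for_same_power(7200, 3.5, 10000)
d15k = diameter_for_same_power(7200, 3.5, 15000)
print(f"10K RPM: {d10k:.2f} inch platters for the same power")
print(f"15K RPM: {d15k:.2f} inch platters for the same power")
```

Under these assumptions the constant-power diameters land close to the 3.0-inch and 2.5-inch platters mentioned above.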
The problem early on in the life of a form factor is that you run out of GB before you run out of IOPS.
It turns out that a 15K RPM drive has platters that have a bit more than ½ the area of the platters in a 10K RPM drive. Thus most of the capacity difference comes from the area of these platters that have a smaller outer diameter. There is also a smaller factor associated with the higher linear velocity of the heads, which in an LFF drive at the outer edge is about 100 mph or 160 km/h for 15K LFF vs. 60 mph or 100 km/h for 10K LFF. With the higher flying speed in 15K comes more flow-induced vibration of the head due to turbulence and more difficulty flying close to the surface without hitting it. This means that 15K RPM drives run a bit lower recording density than 10K RPM drives.
So the problem is that if you are going to the biggest capacity point that you can achieve in the form factor, using all the platters that fit within the form factor, in other words, if the GB are the limiting factor, not the IOPS, then 15K RPM drives are twice as expensive per GB as 10K RPM drives, because the 15K RPM drive with the same number of platters and heads has only ½ of the storage capacity, and it actually costs a bit more to make with the bigger actuator and its magnet. (The cost of a platter is about the same regardless of diameter – cost is by number of platters.)
OK so at the beginning of the life of a form factor, where GB are more important and you are trying to make the biggest drive that will fit within the form factor, then 15K RPM drives cost twice as much per GB as 10K RPM drives.
But later on as we keep doubling the capacity of the drives over and over as new drive generations come out, each with twice the recording capacity (for enterprise drives) of the previous generation, there comes a point where there’s no point to increasing the capacity any more, because you run out of IOPS before you run out of GB.
This is the point where the mainstream of the market switches to the higher RPM: the recording density gets so high that you might as well use advances in recording density to make the platters smaller, since you can’t use more capacity at the original platter diameter anyway.
Making the platters smaller in diameter means you can rotate the drive faster, and voila, this is the point where the 15K RPM drive displaces the 10K RPM drive as the mainstream product. This happened above the 300 GB point on LFF. In fact, Hitachi decided not to build a 10K 600 GB drive in LFF when it became possible to do so. Instead we made new generation drives that also had 300 GB but had fewer heads and media. Well, actually, we did do a 400 GB 10K LFF drive as a kind of last-kick at 10K LFF, but if you look at HDD vendors’ web sites today you will see that there are no more 10K LFF drives for sale at all.
From an economic point of view, we have found that customers are willing to pay up to 25% ~ 30% more for a 15K RPM drive than a 10K RPM drive, but they generally will buy very few 15K RPM drives if they are twice the price of 10K RPM drives.
And that’s where we are right now with SFF. We are making the absolutely biggest drive that we can with the most platters and heads that fit within the form factor. And even in SFF, 15K platters are smaller than 10K SFF platters. The capacity of a 15K RPM SFF drive is one half of the capacity of a 10K SFF drive of the same generation.
And that’s the economic factor that’s squeezing out the 15K RPM drive right now. You don’t get a 100% improvement in IOPS for a 100% increase in cost going from a 10K SFF drive to a 15K SFF drive, and this makes 15K RPM drives look very expensive in SFF.
Hitachi did build a 15K 146 GB SFF drive in the same generation as the 300 GB 10K SFF drive, but decided against making a 300 GB 15K SFF drive in the generation of the 10K 600 GB SFF drive. Seagate did come up with a 300 GB 15K SFF drive, but again, it has ½ the capacity of a 10K RPM SFF drive at a bit higher cost, and thus doesn’t look very attractive financially.
That’s where we are at right now. 15K SFF is just not very attractive financially. 15K $/IOPS is actually worse than 10K $/IOPS in SFF because 15K doesn’t yield 2x the IOPS of 10K, but it’s 2x the price. So 15K SFF never really makes sense in terms of cost effectiveness to achieve the necessary IOPS.
The 15K sale only makes sense where the customer’s business can increase revenue or profit with faster HDD response time. This means that 15K is very much a niche product in SFF at this time. (With SSDs, the situation is the same, that you can only justify them where there is business value to having faster response time, because SSDs not only have worse cost per GB, but they also have worse cost per IOPS.)
Where we are at the moment is that we are still early in the life of the SFF form factor. Seagate has decided to go for it and make a 15K 300 GB SFF drive, but Hitachi GST couldn’t see how it could sell in enough volume to make any money building one.
But if you think about it, in the disk drive business sometimes it’s like “déjà vu” all over again.
The point at which we basically started to transition from 10K to 15K as a mainstream product in 3.5-inch LFF was “above 300 GB”.
Yes, SFF drives have smaller platters and thus shorter arms that seek faster and thus we can make 10K SFF drives bigger in GB than 10K LFF drives because the SFF drives are capable of higher IOPS.
Ian thinks 15K SFF will come back in the next few years
My own personal view is that since 10K SFF is mainstream and we expect 10K SFF drives with more than 600 GB in future, then we can’t be all that far off from the point where 15K starts to look more attractive again. This will happen when we start using increases in areal density to decrease the numbers of platters and heads instead of increasing the capacity.
Wait, there’s another light appearing at the end of the tunnel …
This brings us to the topic of increasing areal density. A transition to 15K as the main product in SFF would be driven by increases in areal density.
Can the researchers keep performing magic?
The big news here is that although over 50 years of HDD evolution we have come to expect that our brilliant scientists will keep solving problems and inventing new technologies to keep doubling the capacity of the drives over and over and over at an average rate over those 50 years of something like 40% compound annual growth rate, we may actually be reaching the “areal density end point” where we reach fundamental physical limits.
Just for your amusement, I remember when Dr. Jun Naruse, who later became HDS’ CEO and was then head of HDD R&D at Hitachi, told me that the theoretical maximum recording density that could ever be achieved was about 65 megabits per square inch. At the time, we were shipping about 5 megabits per square inch using particulate oxide media (basically rust particles in epoxy resin) and inductive read/write heads. Today’s products are shipping at about 500 gigabits per square inch, or nearly 10,000 times higher density than we previously believed possible.
But we appear now to really be getting close to the ultimate physical limit.
There are some technologies that are being worked on, most notably Bit Patterned Media and Thermally Assisted Recording (called Heat Assisted Magnetic Recording by Seagate), but they haven’t quite hit the market as fast as originally targeted. In fact, at least at last year’s Diskcon HDD industry convention, no vendor would publicly speculate on what year either BPM or TAR/HAMR will appear in production products.
So what can we do to keep increasing disk drive recording capacities? Well, one thing that is now very publicly being talked about by multiple HDD vendors is the possible introduction of Shingled Magnetic Recording or SMR HDDs. The basic idea here is that without any advances in read/write technology, but just reconfiguring the write head so that you give up on random writes and write relatively wide tracks overlapping like shingles on a roof, you can still easily read back each track from the bit that’s exposed. Using this technique with the same head technology you can generate much stronger magnetic fields for writing and thus you can use higher coercivity media that are harder to write on, but let you make the bits smaller. Higher recording density means higher capacity drives with the same read/write head technology.
What about 7200 RPM SFF?
Why don’t we offer a 7200 RPM SFF drive? These drives are available from some HDD vendors. The issue here is that because 7200 RPM SFF drives use 2.5-inch platters, and 7200 RPM LFF drives use 3.5-inch platters, these SFF and LFF drives would cost about the same to make, but the LFF drive would have twice the capacity. Since both the VSP and the AMS2000 family support both SFF and LFF drives, we are offering the LFF 7200 RPM drive because it has twice the capacity at about the same price.
The last prediction, anticipating humans will become rational
If people were ever to really think about the economics, realizing that if you try to put normal computer data on a SATA drive you could only fill it to a couple of percent of its storage capacity, then who cares if you can get 2 TB if you couldn’t possibly even use 1 TB? To me the 7200 RPM SFF drive looks like a solid price performer that hasn’t been given sufficient consideration. So I think that if people ever figure this out we’ll see the majority of the requirement for “fat” drives to be on 7200 RPM SFF drives, and only the truly cryogenically cold data going on 7200 RPM LFF drives. For a set-top box that records and plays video, 100 IOPS is plenty for handling a few HD video streams, so keep on using 7200 RPM LFF and bring on all the TB you can! And for archival applications with almost no read activity, 7200 RPM LFF will always offer the best price per GB.
But for anything that resembles normal computer data, 7200 RPM LFF is “too big to make sense” and 7200 RPM SFF would be plenty big enough to get into trouble running out of IOPS before running out of GB. Of course, we would want the SAS interface 7200 RPM drive, not the SATA version.
And then there’s the fact that SFF drives use a fraction of the power that LFF drives use.
I hope this gives some perspective on how the HDD roadmap will impact subsystem performance and economics over the next year or two and beyond.
Big Data – Coming Down The Pipe!
by Cameron Bahar on Sep 29, 2011
These are exciting times.
I joined HDS through the acquisition of ParaScale, a startup I founded to focus on solving what the industry is now calling the Big Data problem. When I visit customers, I notice a growing percentage are faced with a very challenging data problem. The data in its original form is unstructured as expected, but it’s not your typical “human generated data”. Human generated data is what I refer to as file based data such as documents, spreadsheets, presentations, medical records, or block based data such as transactions, customer records, sales and financial records that are usually stored in relational databases. These traditional data sets are adequately served with high performance SAN/NAS systems, which have excellent random I/O characteristics and can handle massive amounts of structured and unstructured content.
I noticed that this new workload and the corresponding data being produced was fundamentally different from traditional enterprise workloads. This data was being generated by hundreds to thousands of machines, was rarely updated, often appended, and was fundamentally streaming in nature. I called this category “Machine Generated Data” and people have started to embrace this term over the last few years. We started looking at this problem as early as 2004.
Imagine a web company that stores and processes log files from one hundred thousand web servers to generate optimized advertisements to monetize its freemium business model. Prime examples of such business models are Google, Facebook or Yahoo. Imagine a bio-informatics company sequencing millions of genomes in hopes of finding patterns in disease, or a security company scanning lots of high-def video and analyzing it looking for a specific person or object, or millions of sensors generating data that needs to be processed and analyzed.
Besides the workload being different for Machine Generated Data (MGD), what else is different? Notice that with MGD, companies store this data in order to be able to analyze it shortly after storing it. The faster they can analyze the data, the better the payoff usually. Think of ad placement by analyzing log files or video analytics to find the right guy in the video just in time, or analyzing genome data for a cancer patient who doesn’t have a lot of time, or finding inefficiencies or trends in financial markets or looking for oil in all the wrong places.
This problem as it turns out is really about large scale data storage and mining of unstructured and semi-structured data to gather information and insight from a company’s vast datasets. It’s nothing short of information processing, repurposing, and data transformations to uncover hidden patterns in data that will ultimately lead to better decisions.
So what are the high level requirements to build a system to solve this problem?
1. Scale – the system has to scale. But scale to what, and in which dimensions? I suppose it has to scale in capacity to be able to store petabytes or exabytes of data. But it also has to ingest this data pretty fast, since there are lots of concurrent streams of machine generated data that are coming through the pipe. So the pipe can’t be the bottleneck either.
What role does system management play in this equation? If you have lots and lots of data, you can’t afford to hire lots of storage admins to sit around and manage this data; it costs too much! So what is one to do? The system should be self healing and self managing, right? It should handle most of the mundane data management tasks automatically instead of relying on people. Ideally all people do is add new hardware or replace failed disks or power supplies.
2. Processing – once you store petabytes of data into a storage system, how do you analyze it? You can’t very well load a few petabytes over the network into a compute farm, can you? How long would that take? Isn’t the network the bottleneck then? What if we were able to move the processing or the program to the data and run it there instead. The program is pretty small compared to the data set size and so you at least take the network out of the equation to a large degree. Therefore, the platform should allow in-place data processing or analytics.
3. Cost – when you store lots and lots of data, your TCO should be reasonable, so using commodity components as much as possible, virtualization, and automation will go a long way.
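To put a rough number on the “how long would that take?” question from the Processing requirement, here is a back-of-the-envelope Python sketch. The 10 Gb/s link speed, the efficiency factor and the function name are assumptions picked for illustration, and a petabyte is taken as 10^15 bytes.

```python
def transfer_days(petabytes, link_gbps, efficiency=1.0):
    """Days needed to move a data set over one network link at a sustained rate."""
    bits = petabytes * 1e15 * 8                      # 1 PB taken as 10^15 bytes
    seconds = bits / (link_gbps * 1e9 * efficiency)  # sustained throughput in bits/s
    return seconds / 86400.0


# Hypothetical case: 1 PB over a single, fully utilized 10 Gb/s link.
print(f"1 PB at 10 Gb/s: {transfer_days(1, 10):.1f} days")

# Same link at an assumed 70% sustained efficiency, for a 5 PB data set.
print(f"5 PB at 10 Gb/s, 70% efficient: {transfer_days(5, 10, efficiency=0.7):.1f} days")
```

By comparison, the analysis program itself is typically tiny next to the data set, which is the argument for shipping the program to the data instead.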
So when you look around and you read and hear about people talking about or working on GFS, Bigtable, Hadoop, HDFS, Cassandra, Riak, Couchbase, MongoDB, HBase, ZooKeeper, Flume, Mahout, Pig, Hive, etc., you know what I’m thinking about.
The information revolution is upon us and its pace is accelerating.
HDS @ #OOW2011
by Gary Pilafas on Sep 29, 2011
The first week of October marks the largest Oracle global event in San Francisco: Oracle OpenWorld 2011.
HDS will once again be present at the show (booth #2101) where we will be previewing several new technologies and integration around:
- Oracle Enterprise Manager (OEM)
- Oracle Recovery Manager (RMAN); and
- Oracle Virtual Machine (OVM)
HDS will demonstrate the best in class functionality and integration around tools that Oracle database administrators use on a daily basis, allowing an enterprise DBA to manage enterprise database environments in one single pane of glass.
In addition to a set of rich tool sets, HDS will also preview the Hitachi Converged Platform for Oracle Database Reference Architecture, which allows for three key solutions:
- Hitachi Compute Blade 2000 Model Servers with Brocade Fibre Channel Switches and Hitachi VSP and/or AMS Storage, which leverages our HDS differentiators.
- HDS high performance computing platform, which includes the following differentiators in Hitachi servers over all competitors:
- I/O Acceleration – Hitachi Servers use two dedicated PCIe connections per blade, which allows for industry-best performance with our partner Fusion-IO’s PCIe SSD Cards
- Symmetrical Multi Processing (SMP) – Hitachi Servers use 2-blade and 4-blade SMP connections, allowing multiple blades to act as a single computer with (64) cores and 1TB of memory
- Hitachi Virtualization Manager (HvM), which allows our customers the opportunity to utilize logical partitions in both Oracle Single Instance and Real Application Cluster (RAC) environments
- I/O Expansion Units – Hitachi 4-blade SMP, allowing for the connectivity of (4) I/O Expansion Units (IOUE) that can each hold (16) PCIe SSD Cards from Fusion-IO for a total of (64) Fusion-IO cards, which would allow for a server-based SSD pool of 76TB
- The third solution is our Real Application Cluster (RAC) Platform, which includes the HDS High Performance Computing Platform, consisting of Hitachi Compute Blade 2000 Model Servers with Brocade Fibre Channel switches and Hitachi VSP and/or AMS Storage.
For our existing storage customers, this development and integration is good news given that HDS integrations have been enhanced by strong partners like Fusion-IO and Oracle.
Make sure you stop by HDS booth (#2101) at the show for a preview of all of these great tools!
Is That Information…and do I care? (Part 2)
by Amy Hodler on Sep 27, 2011
Last week I posted a review of the definitions and differences of data, information, knowledge and wisdom (DIKW) on this blog. I wanted to continue that thread to look at why it’s important to understand these sometimes-subtle distinctions.
Although most of us in IT stop at considering data and information as more tangible and actionable elements, it’s really in the later areas of knowledge and wisdom where we find things to be most useful—though also more ephemeral. Understanding how we advance from one to the other will help us more readily progress into developing greater insight. Misunderstanding means more mistakes and false starts.
Anjana Bala (Stanford University and former HDS intern) and I found looking at the DIKW concepts in the framework of a process to be very meaningful. If you consider the move from data to wisdom as a progression of understanding and connectedness, a clearer picture starts to emerge. (Note that I’ve pulled explanations heavily from Jonathan Hey’s work on “The DIKW Chain” as well as the DIKW Wikipedia entry.)
1. We research and collect lots of factual elements to provide us data.
2. We add context to data to understand the relation of elements and gain information.
3. As we connect information in sets, we gain understanding of patterns and acquire knowledge.
4. When we start joining whole sets of knowledge we can understand how they relate to bigger principles to ultimately achieve wisdom.
Another interesting outcome of looking at this as a process is that it clearly shows that data, information and knowledge are attributable to better understanding the past. We all understand that’s a critical basis, but our goals usually depend on insight into what we should do in the future, which requires wisdom.
At HDS, we’re passionate about data and information as our way to help you do great things. Assisting in getting more value from data and information is core to what we do, and it’s why you’ll see us continue to invest in solutions related to data, content management and cloud. And it’s also why we’re working on additional innovations to help organizations move forward from data to information and knowledge.
Greasing this ‘DIKW’ process will significantly accelerate the rate of innovations, and I believe we’re very close to a knowledge tipping point in this area. We should all do what we can to lubricate this process by considering ways to better connect the elements and more readily move understanding forward.
What might you help to improve? Can you add to this high level list?
- Resource and data/information connections (consider cloud implications)
- Data standards and data virtualization
- Correlation tools and analytics
- Human-understandable results (e.g. data visualization)
- Processes that help internalize knowledge
- Next generation of hardware and software to support the above
Kicking Off a Blog Series on Object Stores
by Robert Primmer on Sep 27, 2011
I’ve worked on three commercial object stores: Centera, Atmos, and now HCP (Hitachi Content Platform). In that time, I’ve seen numerous misunderstandings about this particular brand of storage technology–not just in how they function, but even more fundamentally, on where, when and why such a system would be employed. In a new blog series starting today, I hope to answer these questions.
To that end, my co-authors and I will provide a series of tutorials on distributed object stores (DOS), essentially providing a short course on the subject with the occasional odd topic interspersed periodically.
Where it makes sense, articles will be split along business and technical lines, as the two topics often will have completely different audiences. In some cases this division will result in wholly separate articles, but generally both should fit in a single article.
It’s always challenging to get the right level of technical detail with a diverse audience. Generally we’ll bias toward simplicity, as there’s an inverse relationship between how technical an article is and the number of possible readers. So, the typical article will strive to hit a moderate technical level, hopefully avoiding the arcane. However, if there’s sufficient interest in a particular topic, we’ll go back later and add greater technical and mathematical rigor to those specific areas of interest.
I’ll try to construct the series in such a way that each topic builds successively upon the previous. However, as this will be my first attempt at creating such a course I’m sure to get some topics out of sequence. Fortunately, web pages – with their ability to readily point to other content outside the page displayed – allow for non-linear reading in a manner far superior to what is possible in print.
Here is the Topic Index as I see it at this juncture. A single topic might span several articles connected together. As with source code, it’s generally better to write several small modules that connect together rather than a single large source file that tries to do everything.
This method of relatively short articles also allows me to later go back and insert new articles within a given topic as either new things occur to me, in response to feedback about a given topic, or additions as the state of technology changes.
1. What is Structured and Unstructured data, and why do we care about the difference?
2. What is an Object?
3. What is an Object Store?
4. What is a Distributed Object Store?
5. When would I use an Object Store versus other forms of Storage?
6. Industry Implications of Object Stores
   a. Traditional Storage Vendors vs. Cloud Vendors Approach to Object Stores
7. Basic elements of an Object Store ecosystem
8. Distributed Object Store Blueprint
9. Architectural Considerations of an Object Store
10. A Comparison of Object Store Implementations
11. The Life and Times of an Object
   a. The Birth of an Object
   b. Data Ingest
   c. Life inside the Object Store
   d. Where does an Object live?
   e. How is Data Protection achieved?
   f. Object Mobility
      i. Duplication and Replication
   g. Why is Tape Backup not required?
   h. Basics of Data Unavailability and Data Loss (DUDL)
   i. Fundamentals of Self Healing
   j. Read / Write: What makes it fast, or slow?
12. De-duplication and Object Stores
13. The Road Ahead – The Evolution of Object Stores going forward
Please let me know your thoughts along the way.
by Frank Wilkinson on Sep 26, 2011
This is my first official blog with Hitachi Data Systems. I started with HDS on, of all days, Valentine’s Day 2011! This reminded me of what I love: of course my wife and children, but also what I have chosen as a career. I mean, who can’t get excited every day when you have the unique opportunity to try and make the world (or at least technology) a better place?
Some background on me: I have been in the technology industry for over 15 years. I am a Master Architect, both hardware and software, and have worked in sales, pre-sales and strategy roles for most of my career.
For the past 10+ years my main focus has been around search and eDiscovery, and I have had the pleasure to work with some of the industry’s top thought leaders and developers, who actually created this space for archiving and search in 1999. I have helped create this market, and along with some others, we have watched our “child” grow into what it is today, which is far bigger than most thought it would be: hungry for better, faster, richer search capabilities and faster insight to data. I have worked for some of the largest providers of search and eDiscovery solutions and it has been a joy ride all the way.
Like I said, I started on Valentine’s Day of this year, and I think it is apropos that I have the good fortune to work for the best IT Solutions company in the world! I get to do what I love, and that is to create new concepts and technology, and bring new ideas for Content Cloud solutions for HDS. Who wouldn’t be excited?!
Speaking of exciting…
Big Data Problems
It is no mystery that at some point we were going to run into issues of having too much data, which led to overly complex architectures, network latency issues and backup issues, just to name a few. The reality is this is not a new problem, but one that has always existed; just look back at the first disk pack that you bought: 10 MB, and we thought that was enough capacity. When that space ran out we bought a larger one, and thought that was going to be enough capacity! We simply added capacity as needed, and along with that we grew our complexities for backup, recovery, replication, networking, etc. You get the picture.
As the decades flew by we had grown to the point of an information overload and grew data beyond our wildest imaginations. While archiving content gave us some relief for email and file data, it also added an extra capability that allowed the ingested data to be indexed, making the data searchable. This was the beginning of the next era for search and eDiscovery.
Fast forward a few years and now we see that there are hundreds of companies who offer search and archiving solutions—perhaps you have one of those solutions in your enterprise today.
I am not telling you something that you don’t already know, but I will tell you that Big Data impacts your ability for greater insight to your data.
Why do we keep data?
Why don’t we expunge that data when its usefulness expires?
Well, we know that data comes in all types and forms, and while some of it has high relevance to your business (such as a financial record or contract), we also keep data for historical reference.
So What’s The Issue Here?
The reality is that we have tried to address the big data issue with options like archiving, data de-duplication, application retirement, consolidation, file planning, etc. Please don’t get me wrong, we need these technologies in order to hold the Big Data beast at bay, and these technologies also offer the baseline for building the next generation architectures. Since data has grown exponentially over the past several years, we have tried our best to contain data in its place while trying to adjust our architectures for the next great thing.
So Where Are We Today?
We have complicated and rigid architectures which may not lend themselves to take advantage of Cloud solutions. We have complicated applications and integration issues. We have a ton of meta-data, objects and even some archiving solutions in place—perhaps too many and not easily integrated together. We’ll even throw in some business intelligence solutions.
So What Do We Have?
A big mess!
Maybe it’s just me, and perhaps I see things from too simplistic a viewpoint, but the data we create should serve the business and allow us to make better business decisions that react to the markets with more agility. As Michael Hay and my friends over at the Techno-Musings blog preach, data should not hold us hostage and cause such pains. To me, that is what is important: how can we make businesses run better and more efficiently while reducing risk and reducing the size and complexity of IT, while maximizing new technology to gain better insight into the data? This is what I love, and it is my passion (just ask anyone who knows me).
I do have a point to all of this, I promise!
The challenge, as I see it, is not giving you more solutions and adding more complexities, as some companies are trying to sell you, but rather the opposite. How can we look to technology to help solve some basic issues with regards to Big Data?
We have all been promised at one time or another that technology will unlock the value in IT. Well, I am here to tell you that we are almost there, but we still have work to do.
While there are many challenges to address, there are a few which make it to the top of the list:
- Objects and Meta-data: In order to gain greater insight, we know that we need to do a much better job at exploring and expanding meta-data capabilities. And while we are at it, we need a better way to standardize meta-data abstraction and find relationships between dissociated meta-data.
- Search: We need to think bigger than just search. Analytics need to be more tightly integrated with the data and meta-data. I call it “Content Analytics,” since that is really what we need. If all the data and its associated meta-data can be made more intelligent, then we will be able to provide a much richer set of data and meta-data. (Yes, we are talking way beyond meta-data tagging here, and yes, I am leaving out all the other parts to this, like data dictionaries, indices and BI, because they are understood, at least for now.) This could lend itself to a more refined result, giving us greater insight. But how do we get there?
- Open Frameworks and APIs: If we really want to think about the impact that big data has today and its stronghold on our data centers and architecture, then we need to think of a better way for data to communicate with businesses. Part of this can be addressed with the adoption and implementation of open source solutions like OpenSearch, which allows data to be shared through a common, structured format and even extended with formats like RSS and Atom, just to name a few.
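To make the OpenSearch point a little more concrete, here is a minimal Python sketch of reading search results that arrive as an Atom feed carrying OpenSearch response elements. The two namespace URIs are the published Atom and OpenSearch 1.1 ones; the feed content, titles and identifiers are made up for illustration, and the function name `summarize_results` is my own assumption rather than part of any product.

```python
import xml.etree.ElementTree as ET

# Namespaces for Atom and for the OpenSearch 1.1 response elements.
NS = {
    "atom": "http://www.w3.org/2005/Atom",
    "os": "http://a9.com/-/spec/opensearch/1.1/",
}

# Hypothetical search response; titles, ids and counts are illustrative only.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom"
      xmlns:opensearch="http://a9.com/-/spec/opensearch/1.1/">
  <title>Example search: contract</title>
  <opensearch:totalResults>2</opensearch:totalResults>
  <entry><title>Contract A</title><id>urn:example:1</id></entry>
  <entry><title>Contract B</title><id>urn:example:2</id></entry>
</feed>"""


def summarize_results(feed_xml):
    """Return (total result count, list of entry titles) from an Atom feed."""
    root = ET.fromstring(feed_xml)
    # totalResults is an OpenSearch element added to the standard feed.
    total = int(root.findtext("os:totalResults", namespaces=NS))
    titles = [
        entry.findtext("atom:title", namespaces=NS)
        for entry in root.findall("atom:entry", NS)
    ]
    return total, titles


total, titles = summarize_results(SAMPLE_FEED)
print(total, titles)
```

The point of the sketch is that any consumer that understands Atom can read the results, while the extra OpenSearch elements carry the search-specific meta-data.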
I Told You I Had A Point…..Here It Is!
When I look at the different variants and the many ways in which meta-data is utilized today, and how we are just learning how to maximize its potential, I see that we are still tied to the old architectures, and thus tied to a less spectacular way to gain insight into our data. Being able to leverage that analysis will help us make better business decisions. I am not talking about Business Intelligence solutions, although I do ponder that perhaps if data and its meta-data were able to leverage a common open source set of APIs, then we may have something here. Perhaps we could then be the well for all intelligent meta-data and be the providers and/or helpers that other solutions can leverage for meta-data analytics. In order to shrink our architectures and squeeze out more benefits from our data, we need to think about a new approach to meta-data management and what we could expect from next generation solutions.
In Michael Hay’s recent blog post Interacting with Cloud Stores, Michael articulates precisely what is needed around a common set of open sourced APIs, which can be leveraged throughout the enterprise across all applications, storage, data objects and meta-data in order to get us to be able to link and share meta-data across not only the enterprise, but with external meta-data from social media networks (and even from public cloud solutions). Meta-data is the Holy Grail, and the more we can leverage, embed unique markers and add intelligence into our meta-data, the more we can begin to reap the benefits of clearer insight to that data.
As you can probably tell, my blog will take on the challenge of looking at Big Data in many different ways, but most importantly around Content Cloud, Search and eDiscovery.
I look forward to your thoughts and comments!
Soybean Is A Commodity
by Ken Wood on Sep 23, 2011
While having dinner in Waltham, Massachusetts the other night with a table of smart people, the topic of commodities came up.
What is a commodity?
Soybean is a commodity. Steel is a commodity.
Why does everyone consider storage a commodity? I said, “…actually, capacity is a commodity, not storage.” Everyone’s eyes lit up. That’s true. CPU cycles are a commodity, but processors and servers are not. Networking bandwidth is a commodity, but network switches are not. Storage capacity is a commodity, but the storage equipment behind it is not.
So is there a distinction? My old colleague Michael Hay posted a blog on a similar subject last year, and my new colleague, Shmuel Shottan from BlueArc, also blogged about this subject on Nov. 8th 2010. So I grabbed this definition from Wikipedia:
A commodity is a good for which there is demand, but which is supplied without qualitative differentiation across a market. A commodity has full or partial fungibility; that is, the market treats it as equivalent or nearly so no matter who produces it. Examples are petroleum and copper. The price of copper is universal, and fluctuates daily based on global supply and demand. Stereo systems, on the other hand, have many aspects of product differentiation, such as the brand, the user interface, the perceived quality etc. And, the more valuable a stereo is perceived to be, the more it will cost.
The case of “stereo systems” is a very good example of a misunderstood situation in our industry today: product differentiation matters. While the output of these products (such as CPU cycles, capacity and bandwidth) may be perceived as a commodity, the quality of the product producing these “outputs” is worth paying for.
So here’s my distinction between CPU cycles and CPUs/servers, or storage capacity and storage systems, or bandwidth and networking:
Cycles can be bought and sold at varying prices based on several factors, such as time of day or week, current loads or some other seasonal impact (supply and demand), similar to electricity. Service providers can even compete on the cost per CPU-cycle-hour. I’ve even seen CPU-cycle-hours donated or given away as a promo or part of a grant. The same could be said for bandwidth or Internet access, but the equipment that creates these “goods” provides a distinguishable quality that is noticeable, and in most cases, sought after.
The difference is that you can buy and sell commodities in bulk (there are extreme corner cases where some companies buy equipment in huge quantities, but this does not fit the mainstream model nor does it make the gear a commodity). Hence, you can buy CPU cycles by the hour or month. You can buy bandwidth by the gigabyte per month. You can buy capacity by the gigabyte per month. You can buy soybeans by the ton.
It is my opinion that there is a misunderstanding and confusion in the IT industry between “commodity goods” and “consumer products” when it comes to technology. I can’t pinpoint the exact origin of why or how these two concepts seem to have become synonyms for each other, especially in the IT industry, but there is a difference between commodity goods and consumer products. The technologies used and the blending of these technologies, be it the user interface, the way it is managed, its performance and/or reliability of a product, contributes to the overall perceived and measured quality of a product.
There is also the “secret-sauce factor”. The hybrid combination of commodity components with special, purpose built components to complement the overall invention of a product can elevate that product to something altogether different, a new standard. In fact, one could argue (wrongfully) that once a technology becomes a “standard” it is treated as a commodity. Of course we all know there is a difference between being an industry standard or following a standard, and setting the standard.
The other argument that can be made (wrongfully again) is that if a product is constructed with some measurable amount of commodity parts (resistors and capacitors are bought in bulk), then the product itself must be a commodity. Of course, with many quality products, there is also a measurable amount of differentiated know-how blended together, plus the inclusion of secret sauce, to create a unique and desirable product. Only companies in the business of inventing innovative technology and products can provide the quality and reliability that customers desire.
So, while soybeans are a commodity, a fancy restaurant, a little sea salt, a culturally focused serving bowl and an exotic name can get you edamame for 15 bucks. For just a few ounces.
Announcing the Data Content Blog
by Robert Primmer on Sep 23, 2011
Hello, and welcome to the Data Content Blog.
On this blog we will focus on the area of data storage commonly referred to as “unstructured data.” We have all seen the industry charts that show the growth of unstructured data to be dramatic, growing to exabytes by 2015. The storage systems designed for this class of data are evolving to meet the challenges associated with trying to manage and keep track of such massive amounts of data.
Perhaps the most important distinction is that these systems are really software applications that use storage, but aren’t intrinsic to the storage subsystem. As a software solution, the storage application itself can perform functions particular to the data content as there’s a greater knowledge of the nature of the customer data than is possible in a pure disk storage subsystem. This increased knowledge allows for a greater set of actions to be taken based upon the data content, rather than on more generic qualities such as disk segment boundaries. A simple example is the ability for an application or storage administrator to select which specific objects or files it would like to have replicated, when and where. A more complex example involves performing arbitrarily complex analytics on the data that effectively transforms data to information.
A second important function for this class of storage software is the ability to simply keep track of where all the data resides. It’s a lot easier to store a petabyte than it is to store, catalog, and keep track of a billion objects over their lifetime. As the time horizon for stored data approaches decades, it’s a given that the associated storage and server hardware will change generations multiple times. The software application that fronts these systems needs to be built to withstand these changes without requiring forklift upgrades. Therefore, it’s important that all hardware elements are sufficiently abstracted in the storage software to accommodate change. Likewise, the ability to test the veracity of stored data is needed, because backup for data at this size is impractical.
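One common way to test the veracity of stored data is to record a cryptographic digest when an object is written and recompute it later. The Python sketch below illustrates that general idea only; it does not describe how any HDS product does it. The in-memory `store` and `catalog` dictionaries and the function names are assumptions made for the example.

```python
import hashlib


def digest(data: bytes) -> str:
    """Return the SHA-256 digest of an object's bytes as a hex string."""
    return hashlib.sha256(data).hexdigest()


def verify_objects(store, catalog):
    """Return the sorted names of objects that are missing or altered.

    store:   hypothetical mapping of object name -> current bytes
    catalog: hypothetical mapping of object name -> digest recorded at ingest
    """
    damaged = []
    for name, expected in catalog.items():
        data = store.get(name)
        # An object fails the check if it is gone or its digest has changed.
        if data is None or digest(data) != expected:
            damaged.append(name)
    return sorted(damaged)


# Usage: record digests at ingest time, then re-check later.
store = {"report.pdf": b"quarterly numbers", "scan.tif": b"image bytes"}
catalog = {name: digest(data) for name, data in store.items()}
store["scan.tif"] = b"image bytez"  # simulate silent corruption
print(verify_objects(store, catalog))
```

Because the check needs only the object bytes and the recorded digest, it can keep working as the underlying hardware changes generations.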
On this blog we’ll talk about these and other issues particular to unstructured data and how it relates to what’s happening in the industry at large, as well as specific customer segments.
This blog will have four contributors: Cameron Bahar, Frank Wilkinson, Shmuel Shottan, and myself, Robert Primmer. Below are biographies for Cameron, Frank and Shmuel.
Although he is no longer blogging for HDS, his bio is included below for reference alongside his previous blog contributions.
Chief Product Strategist, Scale Out Storage Platforms & Big Data Analytics
Cameron Bahar leads the technology direction and strategy on scale-out storage platforms, distributed file systems and Big Data analytics at Hitachi Data Systems, bringing over 20 years of systems software development and deep expertise in distributed operating systems, parallel databases and data center management.
Cameron joined HDS through the acquisition of ParaScale where he was the founder, CTO and VP of engineering. At ParaScale, he developed and released a private cloud storage and computing software platform for the enterprise to address Big Data workloads.
Earlier, Cameron led design, deployment, and operation of Scale8′s distributed Internet storage service that provided storage for digital content owners including MSN and Viacom/MTV. At the HP Enterprise Systems Technology Lab, he developed system software for disk volume management, data security, and utility data center management. At Teradata, Cameron developed extensions to UNIX to allow the massive parallel processing required by the Teradata database, the world’s largest and fastest distributed RDBMS. Cameron started his career at Locus Computing, a pioneer in distributed operating systems, single-system image clustering, and distributed file systems.
Cameron holds a BS, summa cum laude, and an MS with honors, in Electrical Engineering from UC Irvine. Cameron holds 4 patents in scalable distributed storage systems, virtual file systems, and high availability systems.
Frank T. Wilkinson
Content Services Strategist, Office of Technology and Planning
As the content strategist in the Office of Technology and Planning, Frank brings over 15 years’ experience in the field of IT and software development and expertise in HDDS and HNAS. He is an industry expert in the areas of knowledge and content management and information management within the financial services, healthcare, media & entertainment, retail and SLED sectors.
He has held many IT certifications, including being a Master Architect. Prior to joining HDS, Frank was the Business Strategist for HP Software, Information Management Division and spent five years developing and deploying information management and content management solutions. Prior to HP, he enjoyed stints at the EMC TSG group’s information management practice and K-Vault Software (Symantec Enterprise Vault). Frank has been an active speaker for industry led panels and discussion on the topics of eDiscovery, information management, content management and next generation solutions facing the enterprise.
SVP, CTO BlueArc, Part of Hitachi Data Systems
As CTO of BlueArc technology, Shmuel is responsible for developing and advancing BlueArc product innovation. Previously Shmuel served as senior vice president of Product Development of BlueArc. He joined BlueArc in 2001.
Shmuel has over 30 years’ experience in the research and development of hardware and software, and in engineering management for firms ranging from start-ups to Fortune 500 companies. Prior to BlueArc, he was senior vice president of Engineering and chief strategy officer for Snap Appliance. Earlier, Shmuel held executive positions at Quantum Technology Ventures and Parallan Computer. Previously he held senior engineering positions at AST Computers and ICL North America.
Shmuel holds B.S. degrees from the Technion – Israel Institute of Technology in electrical engineering and computer science.
As for me, I am a senior technologist and senior director of product management in the file content and services division at HDS, where I am responsible for devising technology solutions that will be incorporated into future enterprise and cloud product/service offerings, with in-depth knowledge and expertise with Hitachi Content Platform (HCP).
I have 25 years’ experience in technology, working in R&D and Product Management organizations with Cisco Systems, EMC, HDS and several start-ups. I am a member of ACM and IEEE and belong to the IEEE Computer Society.
We all look forward to bringing you insights on this dynamic topic of data content. If you have particular issues you’d like us to address, please let us know in the comments.
Is That Information…And Do I Care? (Part 1)
by Amy Hodler on Sep 20, 2011
It’s no wonder we have so much buzz around Big Data: we’ve reached a tipping point, which Michael Hay discussed in a previous blog post on how ‘Big Data Is Turning Content Into Appreciating Assets.’ I find news of discoveries and innovation based on massive amounts of data pretty inspiring, and, as you can imagine, we at Hitachi Data Systems talk about these topics among our colleagues and friends…and sometimes even to the point of annoyance among our families. (I can personally attest to a glazed stare this morning as I chatted again about this very blog post.)
With this kind of enthusiasm, you can understand how we might get into debates over what we actually mean by ‘Data’, ‘Information’ and ‘Knowledge.’ This usually leads to fatigued, sometimes discussion-ending questions of, “does it matter?” and “do I care?” After some debate, Anjana Bala (Stanford University and former HDS intern) and I decided to research this a bit more, and we’ve come to the conclusion that 1) we do care, 2) it is important and 3) we’d bet you’d agree.
Ok – So this is by no means a new topic, it’s just really hot right now. We’ve been thinking about this for ages, from all different angles. I believe this topic has captivated us for so long because the journey from data to knowledge is transformational and provides the basis for innovation. And, it continues to elude us because this journey, particularly when we go beyond information, becomes more personal and implicit.
Among the mass of different views on these concepts (which sometimes felt like hair splitting), Anjana and I found Jonathan Hey’s paper on “The DIKW Chain” to be a good, consumable starting place; its discussion of how data and information relate continues on to knowledge and wisdom as well. We’ve summarized the distinction between these concepts below based on Hey’s paper and Dr. Russell Ackoff’s work, along with one of the most common representations of the relationships: the pyramid diagram.
- Data are discrete symbols that represent facts. You might think of them as recordings or statistics.
- There is no meaning or significance beyond the data’s existence.
- Data may be clean or noisy; structured or unstructured; relevant or irrelevant.
- Information is data that has been processed to be useful. I like to think of it as adding the first bit of context to data relating to “who”, “what”, “where”, and “when”.
- Information captures data at a single point in time and from a particular context; it can be wrong.
- Knowledge is the mental application of data and information. Most consider this as addressing questions around “how”.
- Some consider knowledge a deterministic process, which is to say the appropriation of information with the intent of use.
- Wisdom is the evaluation and internalization of knowledge. It applies insight and understanding to answer “why” and “should” questions.
- Wisdom has been characterized as “integrated knowledge — information made super-useful.” (Cleveland, Harlan. December 1982. “Information as a Resource”. The Futurist: 34-39.)
Many of these definitions are based on Dr. Russell Ackoff’s work; however, I added the graphic, examples and further explanation from various sources. (Ackoff, R. L., “From Data to Wisdom”, Journal of Applied Systems Analysis, Volume 16, 1989.)
So this is all very interesting, with implications not only for the terms we use in a good debate, but also for how we might progress from one to another. This is probably enough to chew on for now, so next week I’ll follow up on these implications and we’ll look at why it’s important to consider. In the meantime, let me know if you have any good examples to illustrate DIKW differences.
What a Difference a Year Makes
by Miki Sandorfi on Sep 20, 2011
Last year, VMworld was all about virtual machines. All the partner plug-ins focused on helping to do just about anything to virtual machines. From backing up to deploying virtual machines, almost every eye was on the virtual machine ball. What a difference a year makes!
At this year’s VMworld, most players came out swinging at “cloud,” trying to establish their place in the market. Though there was lots of schwag (I have never seen so many people carrying prizes of iPads and AppleTVs, and a few with both), I was very impressed by the number of in-depth conversations, well attended knowledge sessions and sincere desire to gain cloud information. Right off the bat the show’s vibe was clear: Cloud. Cloud. Cloud.
Every exhibitor had their own take on the game, and some were just trying to figure out how to play, but no doubt cloud was the game. Much of the focus was on integrating into VMware or supplementing it to add value. Application distribution, service and cloud monitoring, application service performance, security, extended cloud management orchestration, and hardware management were most commonly seen. Beyond the common cloud themes, some key players (HDS included) are executing on a much more integrated approach, extending the management more deeply to enhance the end-to-end cloud workflow. This year, all players were aiming at a common goal, from cloud appliances to cloud enablers, cloud managers and every cloud item in between.
Regardless of how they precisely define cloud, all were focused on driving to it.
The attendees had evolved as well, which is a direct reflection of how IT is evolving. A year ago, most were strongly virtualizing and dipping their toes in the cloud. Most had a goal for the cloud or had been considering it, but not many were invested or even optimistic about the potential of achieving a cloud environment. This year, if not swimming in it, all were wading in the cloud.
Conversations definitely changed: knowledge about how to get to cloud has greatly increased, with an increased focus on, and most importantly the reality of, getting to cloud in customers’ own environments. Optimism and belief were definitely the theme, and even more exciting, many discussions changed from being IT and application focused to focusing on the end customer and how to operate efficiently while meeting business needs in a timely manner. This is incredibly important to actually achieving the dream: making the ability to execute start to become a reality.
What a great direction we are all moving in.
The cloud focus shows not only how predominant cloud is in the industry, but also the evolution technology and IT have gone through over the last year. We’ve moved from focusing on server virtualization to the entire data center landscape, integrating all IT into a consolidated, flowing environment. Though we are not there yet, the evolution was demonstrated not only at the booths and among the attendees, but also in the roadmaps. Discussions clearly point to cloud as the target, which is something we have waited a long time for: the ability to increase flexibility, reduce OpEx, and simplify complicated architectures, all leading to better meeting needs and adding value to the business.
Freakonomics to Flying Circus
by Amy Hodler on Sep 19, 2011
I’m an idea junkie: deconstruct an old one or give me a new one and I’m happy.
That might be why the team asked me to introduce our new blog. We all have good and great ideas and the only way to know the difference and capitalize on them is to share.
With that spirit in mind we are launching our new Cloud Blog. There are many exciting initiatives and discussions going on within HDS and the cloud teams. And we thought it was about time to share more of these with you – our customers, partners, analysts, friends and anyone who finds cloud interesting. Our goal is to germinate and “Inspire the Next” great set of innovations together.
Any topic with a cloud angle will be fair game but we’ll try to tie things back to IT and data center cloud interests. (Although quite honestly, I know I’m already going to break this guideline. Because rules are…well, you know. ) Our goal is to spark great conversations and ideas by being thought provoking, interesting or entertaining. If we get boring…promise to tell us and we promise to do something about it!
Following are our regular contributors. I’ll let you work out where we all lie on that spectrum.
Miki Sandorfi – Chief Strategy Officer, File, Content & Cloud
As Chief Strategy Officer for File, Content and Cloud, Miki is responsible for driving forward the Hitachi File, Content and Cloud product and solution portfolio. Miki has over 19 years of experience. He holds 16 issued and 20 pending patents and pioneered advances in enterprise data protection and deduplication at Sepaton as the Chief Technology Officer. Miki is enthusiastic about all new things, especially shiny new gadgets! [Read more about Miki..]
Linda Xu – Sr. Director of Product Marketing, File, Content & Cloud
As Senior Director of File, Content and Cloud product marketing team, Linda is responsible for developing go-to-market strategy for the portfolio, and end-to-end execution of the marketing strategies. Prior to joining Hitachi Data Systems, Linda held various positions at Dell Inc. including strategic planning and business development and was previously a public relations manager at AT&T China. Linda has recently been learning to stretch and relax with yoga. [Read more about Linda...]
Amy Hodler – Sr. Product Marketing Manager, Corporate & Product Marketing
As Senior Product Marketing Manager for File, Content and Cloud, I’m responsible for marketing of the Hitachi Unified Compute Platform as well as other solutions. Throughout my career, I’ve been advancing emerging solutions at both small and large companies including EDS, Microsoft and Hewlett-Packard (HP). I thoroughly enjoy the beautiful outdoors of eastern Washington as well as most activities on two wheels. [Read more about Amy...]
You’ll get to know our guest and regular writers more as we move forward and we’re hoping this will be a dialogue. Your ideas are important and there’s a lot that could be accomplished if we had more in the mixing pot and stirred. So remember that crazy ideas and discussions are always welcome and sometimes we just get lonely for a chat. Please let us know what you’re thinking whether you’re internal or external to HDS.
You never know where the rabbit hole might lead but it will be interesting . . . let’s jump in together and see where this goes!
Interacting with Cloud Stores
by Michael Hay on Sep 15, 2011
In a July post on the Techno-Musings blog, we made the case for keeping content/data in its original, unaltered form. More specifically, when an application stores data/content in a non-obfuscated mode, it is possible to unleash the true power of the content. This is a key tenet by which Hitachi lives and breathes: ensuring that application owners, users and customers are the winners in the end. At Hitachi, we view ourselves as custodians or trustees of the content and metadata applications store on our platforms, and not the “content wardens” holding data and metadata prisoner.
While there is the potential for a market sea change, whereby users of business applications demand complete access to their data in an unaltered form, many application vendors still practice a belief where they imprison their customers’ data and metadata.
The ability to have unfettered access to the content and its metadata is even more important today than before, mostly due to the rise of cloud/object storage platforms. That is because offerings like Amazon S3 for the public cloud and premier cloud/object stores such as the Hitachi Content Platform (HCP) for the private cloud are paving the way for ubiquitous and open REST based access to objects stored on them. However, the middle leg in this chain is the source application, which produces or shelters the content and then stores or persists it into the cloud store. If the source application fails to persist the complete unaltered content and metadata, or obfuscates it in any way when placed on the object or cloud store, then the principle being articulated in this post fails.
A key question is, why is this unfettered access important now and into the future?
The best way to begin to answer this question is by giving an example. In another blog post, we evaluated the usage of Microsoft Remote BLOB Storage (RBS) versus the usage of Hitachi’s SharePoint plugin (Hitachi Data Discovery for Microsoft SharePoint, or HDD-MS for short) to transport SharePoint content to HCP. The post states that there is a machine-generated way to move content from SharePoint or a Microsoft SQL Server application to HCP through an RBS provider. It is true that the RBS approach persists the unaltered actual data for the object, but that is about all it does. RBS fails to persist an object’s core metadata, such as file name and containing directory structure, and doesn’t even give a thought to custom or other SharePoint-specific metadata. So RBS really persists an incomplete representation of an object on the object or cloud store. The net effect is that RBS expects the authoring application or machine, and not the storage, to contain all of the object’s metadata and context. Hitachi’s plugin, by contrast, is specifically designed for SharePoint archival: it moves the object and also replicates the metadata to HCP. With this approach, HDD-MS is effectively trying to strike a balance between machine and human accessibility.
We have intentionally designed HDD-MS so that if, in the future, SharePoint were not available and a human needed to do a manual investigation, they could look at the original object, all of its versions and its metadata. If we relied purely on RBS, then the user would need to access the version of SharePoint the data originated from to get everything they need. This practice would be repeated per SharePoint site, and further, if Microsoft ever decided to halt the development of SharePoint (yes, unlikely in today’s thinking, but what about 30 or 50 years from now?), then the only way to maintain access to the content and its metadata would be to archive the application along with the data. This would create what one of our customers called a “museum of applications.”
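The idea of persisting an object together with its metadata, so that both stay readable without the authoring application, can be sketched in a few lines of Python. This is not how HDD-MS or HCP is implemented; it is an illustration under my own assumptions, using a local directory as a stand-in for the store and a hypothetical `.metadata.json` sidecar naming convention.

```python
import json
from pathlib import Path

SIDECAR_SUFFIX = ".metadata.json"  # hypothetical naming convention


def persist_with_metadata(root, name, content, metadata):
    """Write the unaltered object bytes plus a human-readable metadata file."""
    root = Path(root)
    root.mkdir(parents=True, exist_ok=True)
    obj_path = root / name
    obj_path.write_bytes(content)  # content stored as-is, not obfuscated
    sidecar_path = root / (name + SIDECAR_SUFFIX)
    sidecar_path.write_text(
        json.dumps(metadata, indent=2, sort_keys=True), encoding="utf-8"
    )
    return obj_path, sidecar_path


def read_without_application(root, name):
    """Recover content and metadata using nothing but the stored files."""
    root = Path(root)
    content = (root / name).read_bytes()
    sidecar_text = (root / (name + SIDECAR_SUFFIX)).read_text(encoding="utf-8")
    return content, json.loads(sidecar_text)
```

A caller might store `b"contract text"` with metadata such as `{"file_name": "contract.docx", "site": "legal"}` and, decades later, read both back with any tool that understands files and JSON, with no museum of applications required.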
Best Practices for Today and the Future
With the basic premise articulated, it is now time to talk about what best practices we have developed in our adventures with platforms like HCP and our partners. First, a bit of preparation work about cloud stores and how to persist and access objects on them if it is warranted. The “lingua franca” of public/private cloud stores is generally REST, or REpresentational State Transfer, an architectural style usually implemented over HTTP, but there are some exceptions. The fundamental technology which REST and related technologies, like WebDAV or SOAP, share is HTTP. HTTP has basically grown up with the Internet, making it “well oiled” to move data and metadata over high latency Internet and intranet pipes. This contrasts with protocols like NFS and CIFS, which are very chatty and usually designed for LANs or in some cases MANs. Basically, when you get to public and private cloud infrastructures, protocol efficiency matters.
Since RESTful over HTTP is the lingua franca of public/private cloud stores, you might ask: is there a standard? And the answer is that, as of today, there is no internationally ratified standard that covers public/private cloud stores. There is an emerging cloud storage standard developed by SNIA called Cloud Data Management Interface (CDMI), which is RESTful. There is also Amazon’s de facto S3 RESTful interface. With all of this in mind, it is now time to talk about some observations and specific best practices we’ve learned as we have evolved our private cloud store HCP.
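To make the REST-over-HTTP idea concrete, here is a minimal Python sketch that builds (but does not send) a single HTTP PUT carrying both an object and its metadata. The endpoint URL, the `X-Example-Meta-` header prefix and the function name are all hypothetical; real cloud stores each define their own conventions for custom metadata, so treat this as an illustration of the shape of the request rather than any particular store's API.

```python
import urllib.request

# Hypothetical header prefix for custom metadata; real stores define their own.
META_PREFIX = "X-Example-Meta-"


def build_put_request(base_url, object_path, content, metadata):
    """Build an HTTP PUT that carries the object body and its metadata."""
    url = base_url.rstrip("/") + "/" + object_path.lstrip("/")
    headers = {"Content-Type": "application/octet-stream"}
    for key, value in metadata.items():
        # Each metadata item travels as a header alongside the object body.
        headers[META_PREFIX + key] = str(value)
    return urllib.request.Request(url, data=content, headers=headers, method="PUT")


# Usage with a made-up endpoint; nothing is sent over the network here.
request = build_put_request(
    "https://store.example.com/rest",
    "/docs/contract.pdf",
    b"%PDF-1.4 example bytes",
    {"author": "kwood", "retention": "7y"},
)
print(request.get_method(), request.full_url)
```

One request moves the object and its context together, which is part of why a single-round-trip protocol like this suits high latency links better than a chatty file protocol.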
Observations from Integrations to a Cloud Store
1. Newer applications are being built to make use of RESTful HTTP implementations natively, while older applications are frequently tied to NFS and CIFS.
2. It takes a long time to get an ISV partner who has invested deeply in a CIFS/NFS implementation or had a bad experience with a proprietary API to integrate a company specific RESTful interface, unless they see business value.
3. Some ISVs, like Symantec and Microsoft, are implementing plugin architectures, which allow the cloud store provider to self integrate to an application specific set of primitives. This has proved to be a faster way to have an optimal integration with a cloud store.
4. Some application vendors will never change their code to interface to anything other than NFS/CIFS.
5. Many application vendors practice vendor lock-in, holding their customers’ data hostage. In the long term we see this as decreasing customer satisfaction, and our principles dictate that we push our partners not to imprison our mutual customers’ data.
Best Practices for Integrating to a Cloud Store
1. Where possible, use your cloud store’s RESTful implementation to store both content and its metadata on the cloud store, especially for new application starts, with the following notable exceptions:
a. Legacy applications typically make use of NFS/CIFS and aren’t well optimized for cloud stores. For a legacy application which uses NFS/CIFS we recommend the use of a cloud on-ramp that translates from NFS/CIFS to a RESTful interface.
b. When an application has native ingestion techniques, use them. Native ingestion typically unleashes the content and its metadata from the application and decreases the solution cost, mostly because middleware is not needed. Examples of applications with native ingestion we’ve integrated to HCP are Microsoft Exchange Journaling via SMTP and SAP’s Netweaver ILM via WebDAV.
2. If a legacy application’s developers have other priorities, start with an on-ramp implementation via NFS/CIFS and deliberately plan a migration to RESTful over time.
3. Cloud store providers should jump on the plugin model, utilizing a vendor’s API primitives to create an optimally integrated offering.
4. When possible, store an object in its unaltered form, and always store the object’s implicit and explicit metadata with it in an open format (like XML). Ideally, this should be done during the initial commit or ingest to the cloud store.
5. If a machine-only readable format is required, the authoring application should also store a map or recipe to re-instantiate the data and metadata independent of the application. This is an excellent compromise between machine and human accessible objects.
6. Ideally, don’t containerize many objects into a single mega-object like a ZIP package. Containerization makes it difficult to retrieve and access an individual object at a time. Usually specific cloud store implementations allow for connections to be kept open, improving transport efficiency. Long running sessions provide a compromise to containerization in that a single session can support the movement of many smaller objects, much like a container can support the transmission of many objects in a single session.
7. Data breaches of sensitive and/or business critical data are a major concern for many organizations, and these concerns are amplified in the cloud. Encryption (both in-flight and at-rest) is often a mechanism used to counter the real and perceived dangers, but there are numerous complexities and inter-dependencies that have to be addressed (i.e., there is no one best approach). Hitachi believes that encryption is an important tool to guard against unauthorized data disclosure and encourages users to take full advantage of the guidance offered by organizations like the Storage Networking Industry Association, with its “Encryption of Data at Rest – a Step by Step Checklist”, as well as the Cloud Security Alliance (CSA), with the “Security Guidance for Critical Areas of Focus in Cloud Computing” and “Cloud Controls Matrix (CCM) v1.2”.
8. Always use the authorization and access controls available for the particular cloud store implementation to ensure improved security. Further, if the source application’s access control mechanisms are not completely compatible with the target cloud store, capture this information and add it to the metadata stored alongside the object. Like other metadata, all access control list or authorization-related information should be encoded in a standard format like XML.
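To make practices 4 and 8 a little more concrete, here is a minimal Python sketch of an open-format metadata record built at ingest time. It is an illustration under stated assumptions, not HCP's or HDD-MS's actual schema: the element names, the `build_metadata_xml` helper and the sample values are all hypothetical. The object itself is left unaltered; only a checksum of it goes into the XML.

```python
import hashlib
import xml.etree.ElementTree as ET


def build_metadata_xml(name, path, content, custom, acl):
    """Return an XML document (as bytes) describing one stored object.

    Hypothetical schema for illustration only: core metadata, custom
    application metadata, and access control entries in one open format.
    """
    root = ET.Element("object-metadata")

    # Core metadata that a human would need without the source application.
    core = ET.SubElement(root, "core")
    ET.SubElement(core, "file-name").text = name
    ET.SubElement(core, "directory").text = path
    ET.SubElement(core, "size-bytes").text = str(len(content))
    ET.SubElement(core, "sha256").text = hashlib.sha256(content).hexdigest()

    # Application-specific (custom) metadata, sorted for stable output.
    custom_el = ET.SubElement(root, "custom")
    for key, value in sorted(custom.items()):
        ET.SubElement(custom_el, "property", name=key).text = value

    # Access control information captured from the source application.
    acl_el = ET.SubElement(root, "access-control")
    for principal, permission in acl:
        ET.SubElement(acl_el, "entry", principal=principal, permission=permission)

    return ET.tostring(root, encoding="utf-8")


# Sample values are made up for the example.
content = b"Quarterly report, final version."
metadata_xml = build_metadata_xml(
    name="report.docx",
    path="/sites/finance/2011",
    content=content,
    custom={"author": "jdoe", "retention-class": "7-years"},
    acl=[("FINANCE\\analysts", "read"), ("FINANCE\\managers", "write")],
)

# Anyone with a generic XML parser can read the record back later.
parsed = ET.fromstring(metadata_xml)
print(parsed.find("core/file-name").text)
```

The point of the design is that both the unaltered object and this record would be written during the initial commit, so neither depends on the authoring application still being around.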
In this post we’ve detailed the case for making content and metadata independent from source applications for future usage. In some sense, content independence assures us that, when there are future use cases for the data we hadn’t considered, it is at least possible. With the application and data intimately intertwined, creating future data mash-ups is nearly impossible.
Since we believe that mashing-up data is potentially a fundamental part of the Big Data phenomenon, not unleashing the data from the application puts everyone in a Big Data prison. With the basic theory articulated, we’ve explained some lessons learned and best practices for integrating with a cloud store. Finally, while many of these lessons were formulated as a result of the HCP program, HCP specifics have been mostly excluded.
For more HCP, HNAS, HDI and HDDS specifics, please stay tuned to the HDS blog-o-sphere.
Beyond Vendor Neutral Archiving
by Bill Burns on Sep 13, 2011
In today’s hypercompetitive healthcare environments, driving cost control is paramount and healthcare providers, payers and vendors must reduce cost while increasing the quality of care. Patient-centric technologies are at the core of the solution. These technologies necessitate clinical integration using health information exchanges and associated components. Hitachi Clinical Repository (HCR) was created to address this enterprise-centric view for all data types, not just medical imaging. HCR is a real time “active” repository that consolidates data from a variety of clinical sources to present a unified view of a single patient. It is optimized to allow clinicians to retrieve data for a single patient rather than identify a population of patients with common characteristics or facilitate the management of a specific clinical department.
Unfortunately, the care environment is littered with application-centric technologies like PACS and provider-centric technologies like EMRs, neither of which delivers a patient-centric, portable view of the patient. The concept of the holistic electronic health record has been extended in recent moves toward personal health records. The use of clinical data mining in genetic records will change the outcomes of various disease states. Once healthcare providers gain access to comprehensive electronic patient records, the drive toward predictive, personalized medicine will be possible.
Here is an interesting article by Brian T. Horowitz of eWEEK, on the challenges cloud computing presents to the healthcare industry.
Will the role of VNAs in the healthcare of tomorrow be central or peripheral? The continued evolution of IT solutions is likely to be the catalyst propelling the VNA approach either into the limelight or the shadows.
What do you think?
Textbook Acquisition Strategy
by Ken Wood on Sep 8, 2011
What a week this was! The HDS acquisition of BlueArc announcement was a great event. Members of the blog team (Michael Hay, Claus Mikkelsen, and myself) and executive team held several Q&As, as well as multiple briefings with media, analysts and bloggers. We even held a tweetchat, which was a great opportunity to connect with our followers about the news. It seemed like only positive comments (unless you are a competing vendor) about the acquisition were made, even from our toughest critics. In fact, the only remotely negative comments were similar to “…what took you so long?”, which really isn’t entirely negative, but more a validation of our decision.
This brings me to my point. I think our industry is a little calloused about acquisitions. There are too many companies that seem to acquire first, then try to figure out what they bought and how to use it. Too many times these acquisitions disappear or are tossed away. I’m sure you’ve seen or read about other companies acquiring a company, then about a year later buying another that seems to do the exact same thing. Also, the strategy for the acquired company doesn’t seem to match the current strategy of the company. This is not always true with acquisitions, but if you watch the industry like many of us do, you notice these “…why did they do that?” transactions. I am purposefully not naming names, but I know we’ve all seen this behavior in our industry.
I would classify the acquisition of BlueArc by HDS as a textbook technology acquisition strategy. As Chris Evans at The Storage Architect noted in his recap, partnering with a company first to see if cultures and ideas mesh together properly is a solid approach. The level of partnership varies in many ways, but a partnership gets you access to potentially new customers and markets, technology, talent, business models, etc. Many would call this due diligence when shooting for a direct acquisition, but a partnership is due diligence in practice: the old “try before you buy” mentality. It’s one thing to do interviews and a paper evaluation; it’s a whole other thing to work side-by-side with a company, share wins and losses, figure out ways to make the technology better to the benefit of both parties, make some money together and get to know each other. Many times these partnerships are started with one of the goals being acquisition, but sometimes, for various reasons, things don’t work out: bad chemistry, partnership fallout, disputes and conflicts, or someone else buys them. Good to know before you buy a company. However, there are risks, as a good partnership doesn’t go unnoticed in this industry.
In fact, there are many who would say that HDS is the due diligence for many other companies’ acquisitions. HDS has a reputation for putting together successful partnerships and for being a good partner. Many times this relationship building can be misinterpreted by our competition as “…this must be a good company to acquire,” with the actual due diligence being a little too light. This also means that there is risk to partnerships, especially where integration work and investments are involved.
Then, when the timing is right, an acquisition is considered and executed. If there is a good partnership, and everyone is in agreement, the industry says “…it’s about time!” while customers say “FINALLY!” I personally feel that if more plans followed this strategy, there’d be a lot less “…why did they buy them?” head scratching.
There were several questions asked publicly, like: “does this open HDS to a flood of acquisitions?” To which I can confidently say no. We will not fall into an “acquire first, ask questions later” tactic.
Again, I want to welcome the entire BlueArc team to HDS and the Hitachi family. It’s our great partnership and person-to-person relationships, along with our relationships with our customers, that resulted in this textbook union.
All In The Family
by Claus Mikkelsen on Sep 7, 2011
No, not the Archie Bunker kind, but the “let’s build the HDS family” kind.
So, as many of you have already heard, today HDS took a great next step in our relationship with BlueArc (check out the announcement here).
It’s all about growing the family with a quality company, with industry-leading products, and some really great people. It’s also a great way to demonstrate our commitment to NAS.
We’ve been reselling the BlueArc line for a few years now, and although that partnership has been very successful for both companies, it was taking on the appearance of a very long engagement with no date for a wedding. Well, that has changed today and the commitment has been made. It’s official and vows have been exchanged.
Any doubts about our commitment to, and support of, BlueArc have been removed. I’ve met many of the BlueArc talent and have seen the company execute over the past few years, and personally, I’m soooo happy to see this transaction complete and I’m sure many others are too. No more ambiguity about this relationship.
As HDS moves into the future, I can tell you that this acquisition is a critical part of our overall strategy, and I look forward to seeing it unfold.
BlueArc: The Jumbo Carrot
by Ken Wood on Sep 7, 2011
“Project Carrot” is officially complete! As Michael Hay recounted, our relationship with BlueArc has been an exciting journey over the past 5 years; a journey that we dubbed internally as “Project Carrot”. As you know by now, Project Carrot set out to evaluate and choose the NAS technology that would become HNAS. A small team of HDSers conducted the technical and performance evaluation for our “carrot patch”. After dozens of paper evaluations and interviews, it eventually came down to 3 finalists – Littlefinger, Resistafly, and Jumbo, aka BlueArc (If you’re not a vegetable aficionado, all of these names are different types of carrots.) Without doubt, with today’s acquisition of BlueArc’s talent and technology, HDS has enhanced our file and content services strategy and vision. It’s been a great journey thus far working with the BlueArc teams and we are all looking forward to a great future together.
Let me back up and add some insight and historical background to how we got here…
As part of the HDS technical evaluation that started several years back, we each had an opportunity to submit the vendor or technology we thought would be a great complement to our existing technology. My carrot in the patch was Resistafly. I brought them into the carrot patch to blow away the other carrots and to show off my “vision” for the company’s future. I wanted us to embrace a software-based solution, go for the scale-out NAS approach, and select a platform that we could “easily” develop with our strategy. From my perspective at that time, Resistafly was a shared-everything, clustered file system that ran on Windows and/or Linux, could scale-out to 64 nodes (for the right use case and workload, which at the time was fairly large), was not as complex as an asymmetrical clustered file system such as Lustre or IBRIX, and was software-based so it required no special hardware. We could integrate our search technology and develop other Hitachi IP into this platform. At that time, I truly thought the possibilities were endless.
However, after the evaluation, I found Jumbo was the only real option. Despite my blind allegiance to Resistafly, the technology only outperformed Jumbo in a couple of corner cases, with 8 nodes versus Jumbo’s 2-node configuration. Just imagine if Jumbo had 8 nodes! Throughout the evaluation process I designed over 150 different tests. (Truth be known, my performance tests were designed using a high performance computing cluster and leaned towards giving Resistafly an edge by using scale-out workloads that “ramped up” over a period of time.) In my mind, there was no way the other vendors could keep up with this type of loading for very long, and my preference for a clustered file system would leave them all in the dust, or so I thought. However, in most cases Resistafly only had an edge in some of the streaming reads with low overhead and metadata processing requirements. Any multi-read, non-shared solution in this environment could have performed like this, like the technology in Littlefinger (most of my detailed results from that evaluation are still held in confidence). It was the “mainstream” workloads where the results were undeniable. Random access and heavy metadata manipulation were the sweet spot where Jumbo shined, and it held its own everywhere else. This is also where the target market and primary workload environment we were focusing on was going to be: in the enterprise. Jumbo delivered on management and performance, which rank high on a system administrator’s task load, and these would not be a problem for enterprise customers.
As I attempted to polish Resistafly, Jumbo outshined it across all workloads (in some cases blew everyone away) and was easy to manage (which, as we all know, is extremely important!). Setup time was 6 minutes or less to get a share or export online and usable, and everything fit and worked together seamlessly from a management point of view. Plus, it was on special hardware (FPGAs) for accelerating specific tasks. This “fit” the then HDS mode of operation and culture of “hardware, hardware, hardware”. I was trying to break that mode by thinking expandability and multipurpose. There’s no way we can expand Jumbo’s capability. “…we have a NAS device and that’s all we have”, I would say. In the end, Resistafly felt like a cluster. I had to manage it on several occasions as individual nodes instead of like a single system. I wanted this technology to break away from the old mindset that a scale-out clustered file system couldn’t be a general purpose NAS in its use. But alas, it fell into the typical role. There was no comparison between the two: Jumbo unequivocally demonstrated its superiority as the king carrot.
After choosing the path of BlueArc, I was assigned to manage a small group of HDS employees who were assigned to collaborate with BlueArc developers in the UK. Their task was to start developing features into HNAS to incorporate Hitachi IP and strategy, and integrate Hitachi products into this “hardware NAS device”. In 3 years, they developed in full collaboration with Jumbo developers:
- Change file notification for accelerated content indexing for Hitachi Data Discovery Suite (HDDS) search
- Data Migrator; CVL for internal and XVL for external
- Internal file migration
- External NFSv3 based off platform data migration for any NFS based storage target
- External HTTP/REST based off-platform data migration for Hitachi Content Platform (HCP) storage targets as object (data plus metadata) ingestion
- Migration policies for Data Migrator for data migration of files based on simple file primitives like date, name, size, etc.
- HDDS integrated with Hitachi NAS Platform (HNAS) for policies management of more complex migration decisions
- CIFS audit logging
- Real-time file blocking and filtering
- VSS support for Windows systems
- … and more!!
All of these advanced features for content and content management were developed on a hardware platform. I ate it up. I was completely converted. The possibilities were endless and the hardware accelerated, AND it was still easy to manage.
Over the years I continued to monitor Resistafly’s progress in the industry. In 2007 they were acquired, and from what I can gather this technology has been limited to scale-out NAS solutions, which apparently wasn’t enough, as their parent company acquired yet another company in 2009, presumably for more scale-out capabilities. The updates included the same software capabilities on new hardware servers or storage systems. There were no significant software features announced in the area of content or content management, search or tight integration with other software solutions. It was just medium scale-out NAS. Siloed. Sad.
About 18 months ago, I finally admitted to Michael that I needed to eat crow. He was right and I was wrong. The “hardware” platform that won the bake-off in our carrot patch 5 years ago was no fluke, and all of the advanced software features that I wanted to integrate into a software platform indeed run better on HNAS. I expected these features to be in the other carrots by now, but apparently others treat software as a hardware enabler, while we (HDS) treated hardware as a feature enabler. I’d mentioned this story a few times internally during meetings with others and now I’m blogging about it for all of you: barbequed crow, when washed down with jumbo carrot juice, definitely tastes triumphant and successful.
The most exciting part about this journey is that all of this development was done with a great partnership between HDS and BlueArc. Just imagine what we’ll be able to accomplish now that BlueArc’s unyielding talent and technology are part of the Hitachi family! I’d like to take this opportunity to welcome the entire BlueArc team to Hitachi. I’ve thoroughly enjoyed working with you in the past on several projects and I look forward to working and furthering our innovation, strategy and vision together in the future.
BlueArc: A Bountiful Garden of BIG Data
by Michael Hay on Sep 7, 2011
These are exciting times we live in, indeed! Steve Jobs has officially resigned his post. HP has opened their kimono to enterprise focused strategy leaving behind WebOS and the PC business. And today we celebrate combining the innovative talent of BlueArc with the Hitachi family.
Our relationship with BlueArc began when we were tasked to find the perfect technology partner offering file systems or file storage appliances. The pursuit spawned some healthy competition internally, as we each wanted to find the perfect match. I named this endeavor Project Carrot. The field of companies and products we looked into we called the “Carrot Patch” and each company had their own code name based upon a type of carrot. Ultimately, Jumbo – BlueArc – surpassed all others in the Carrot Patch. (I’ll leave Ken to recount that series of events in more detail, but suffice to say the quest that we had anticipated turned out to be a no brainer when BlueArc’s technologies and team came on the scene.)
In 2008, Shmuel (BlueArc’s CTO, my personal friend and mentor) and I celebrated the beginning of super deep inter-product integration. This realized the joint vision the two companies had been working on for the better part of 18 months, as my previous blog post recounts in some detail. Our shared vision and spirit of collaboration have led to many customer wins for each company. To date each company has brought complementary specialties to bear on our mutual markets. For example, BlueArc is well known in verticals like Oil & Gas and Media & Entertainment, while Hitachi Data Systems serves other verticals such as Telecommunications and Financial Services. Our complementary market specializations have allowed us to address the overall market, gaining confidence and momentum along the way. Well, from today we can address all customers, all markets and all verticals as one Hitachi.
For HDS, we have many reference grade implementations of our joint vision with BlueArc across a variety of verticals. Here are some quotes from a bank and a telecommunication services company:
- Yes Bank – “Hitachi’s proven mature product, clear roadmap, budgets for research, and their having served the enterprise segment of storage for more than a decade, prompted us to approach them. They completely understood market requirements.” Mitesh Toila, Senior Vice President
- Ziggo – “The storage platform from Hitachi Data Systems offers us maximum flexibility in supporting the organization, which is essential in the dynamic telecom market.” Peter de Boer, Manager of Core Systems
These praises would not have been possible without BlueArc’s innovative technologies and collective team expertise. Moving forward I know we will see many more implementations at a quickened pace and of a heightened caliber as BlueArc culture infuses the Hitachi team with their pioneering spirit. Further I look forward to being able to meld our mutual companies’ IP together intermingling our unique attributes and benefits into a superior aggregated portfolio. Stay tuned and watch our progress!
Finally, I want to welcome all of my new colleagues to Hitachi and state that we all look forward to working with you to better address the market and our customers’ needs. I truly believe that you’ll find working with the family I know as Hitachi a special experience in your career. That is because “Hitachi’s honest intentions are to improve society, which we talk about through our concept of the Social Innovation Business.” (Hay, HDS as a Member of the Hitachi Family: The View From Japan).
The Emergence of the Storage Computer
by Claus Mikkelsen on Aug 31, 2011
I said that my next piece would be a guest blog from storage performance practitioner Ian Vogelesang on where the disk drive market is going. Ian has been on vacation, so I thought I would slip this one in.
Disk drives, Winchester drives, then storage subsystems, then arrays.
These have all been storage names over the years, but I’m thinking more and more that we should be calling them “Storage Computers” to keep up with the times. I cite 1992 as a pivotal year in storage, when the industry changed dramatically, and the march towards the Storage Computer began. Let me explain.
Prior to 1992, storage was just a dumb box and a commodity in every sense of the word. You could write to it, and read from it, but that was about it. As a storage vendor, the only things we could ever compete on were performance, reliability, and price. And by the time this reached our customers’ ears, all that was heard was price, price, and price. There even used to be a company called “Reliability Plus” that rated the various vendors on reliability, just in case anyone actually cared.
What changed in 1992 was the emergence of intelligent storage function. The first was called Concurrent Copy (the first copy-on-write technology), and before long we vendors were piling various replication functions onto the storage subsystems. Then came RAID. Reliability Plus was gone, and the storage industry was plotting a new course.
Recently, we’ve been on a roll with things like dynamic provisioning, dynamic tiering and storage virtualization. Add this to all of the Capacity Efficiency functions such as compression, deduplication, single instance store, thin provisioning, “spaceless” snapshots, etc. Anyone out there want to define what a terabyte is these days?
One interesting combination of functions is HDP (Dynamic Provisioning) and HDT (Dynamic Tiering) from HDS. The combination of these two functions:
- Dramatically removes the task of “provisioning for performance”. That is, the Storage Computer can manage performance better than you can. Time and money saved.
- Manages the tiering of data at a small granularity (42 MB). That is, the Storage Computer will move 42 MB pages to their proper tier of disk. We’ve found that well over 80% of data residing on tier 1 disk does not need to be there. Again, time and money saved.
- Removes almost all of the physicality associated with a LUN, meaning a LUN should now just be viewed as a logical container for data. Grab what you think you need and don’t worry about grabbing too much since we thin provision underneath.
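For readers who like to see the idea in code, here is a toy Python sketch of page-granular tiering. It is only an illustration of the concept described above, not HDT's actual algorithm: the ranking rule, the `plan_tiers` helper, the two-tier layout and the sample I/O trace are all assumptions, and the 42 MB page is treated as 42 MiB for simplicity.

```python
PAGE_BYTES = 42 * 1024 * 1024  # page size quoted above, assumed here to mean 42 MiB


def page_of(offset_bytes):
    """Map a byte offset within a LUN to its page number."""
    return offset_bytes // PAGE_BYTES


def plan_tiers(access_counts, tier1_pages):
    """Put the most frequently accessed pages on tier 1, the rest on tier 2.

    Hypothetical policy: rank by access count (ties broken by page number)
    and give tier 1 to the top `tier1_pages` pages.
    """
    ranked = sorted(access_counts, key=lambda p: (-access_counts[p], p))
    hot = set(ranked[:tier1_pages])
    return {page: (1 if page in hot else 2) for page in access_counts}


# Made-up I/O trace: byte offsets touched within one LUN.
trace = [0, 10, PAGE_BYTES * 3, PAGE_BYTES * 3 + 5, PAGE_BYTES * 3 + 9, PAGE_BYTES * 7]

# Count accesses per page rather than per LUN.
counts = {}
for offset in trace:
    page = page_of(offset)
    counts[page] = counts.get(page, 0) + 1

placement = plan_tiers(counts, tier1_pages=1)
print(placement)  # {0: 2, 3: 1, 7: 2}
```

With room for only one page on tier 1, the busiest page (page 3) gets it and the other two land on tier 2, which is the sense in which placement is decided per page instead of per LUN.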
With these functions and more, the Storage Computer is really being designed to take over many of the tasks that we all used to do manually. I think the biggest challenge, however, is getting storage admins and DBAs to “let go,” and let the automation take over. So far, we’re seeing many of our customers letting this happen, and ultimately, appreciating the benefits.
The Storage Computer, after a long march, has arrived.
Lies and Virtualization – Capacity Optimization is an Altered Reality
by Ken Wood on Aug 26, 2011
OK, so I lied about my last blog being the final of a three part series (1, 2 & 3). But aren’t we used to being lied to these days? I classify virtualization into three categories of prose: one-to-many, many-to-one, and this-to-that. (Turns out, I thought I blogged about this some time ago, but it was a paper I started and never finished. I’ll extract this and post it as a future blog soon.)
Virtualization, as we use it in our IT lives, is a lie. Virtualization technologies, whether in storage (which includes RAID, capacity optimization, Dynamic Provisioning, Dynamic Tiering, etc.), servers, CPUs, file systems, or desktops, hide the real truth, and sometimes administrative tasks, from the rest of the world. Virtualization is an abstraction of a complex physical reality, or the aggregation of many simple elements, to a more understandable, usable, acceptable and simplistic view, on behalf of an application or user.
Virtualization lies to us for our own good. When used with capacity optimization technologies, as I’ve described in several of my previous posts, you will be presented with a false view, but you will like what you see. The amount of available space for storing your data will appear vast and voluminous (and in some cases “bottomless”) but under the covers lies a complex system of algorithms and coding used to squeeze, chop-up and delete your data into a smaller and smaller footprint. These layers, in fact, can continue down to another layer of abstraction, such as a dynamically provisioned volume of storage. This, in turn, records data on an array of disk drives protected by another layer of abstraction called RAID-6. Even the disk drives themselves are accessed via a logical block address, which hides and simplifies the realities of addressing data as cylinders, tracks and sectors, or steering you from the bad block replacement area.
The abstraction layer above capacity optimization technologies hides the truth from the user and application in order to focus on doing what they do best, and not manage capacity, or so it seems. More times than not, this just moves the administrative action line further out in time. Storage is not “bottomless,” but it can seem very deep. This layer presents, for example, a 5TB file system to the user or application, which can write data to this space. Depending on the architecture and the type of data being written, the file system’s utilization percentage will act strange to people who are used to chasing the 90% threshold for most modern file systems. It may seem like no data is being written, or like data was written, then disappeared, but when checking the file system, data (or at least the file names) exists.
What’s happening is storage is being optimized against the data that already exists. The more data that is stored, the more data there is to compare to, the more chances of duplicate data detected, the more duplicate data is removed. The task at hand is the management and metadata tracking of keeping references. The mapping of the chunks that are used and shared across many files can tax a management system by creating a large database with heavy comparison tables.
This mapping of chunks (you can substitute blocks, pages, sectors, sub-files, etc. for chunks) IS the lie. That is, virtualization is hiding this from you.
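A small Python sketch may help show what that mapping looks like. This is a deliberately simplified illustration, not any product's implementation: the `ChunkStore` class, the tiny fixed 4-byte chunk size and the use of SHA-256 digests as chunk references are all assumptions made for the example.

```python
import hashlib

CHUNK_BYTES = 4  # tiny fixed-size chunks so the example is easy to follow


class ChunkStore:
    """Toy deduplicating store: each unique chunk is kept once."""

    def __init__(self):
        self.chunks = {}  # digest -> chunk bytes, stored a single time
        self.files = {}   # file name -> ordered list of digests (the "map")

    def write(self, name, data):
        digests = []
        for i in range(0, len(data), CHUNK_BYTES):
            chunk = data[i:i + CHUNK_BYTES]
            digest = hashlib.sha256(chunk).hexdigest()
            # Only store the chunk if it has not been seen before.
            self.chunks.setdefault(digest, chunk)
            digests.append(digest)
        self.files[name] = digests

    def read(self, name):
        # Rebuild the file by following its chunk references.
        return b"".join(self.chunks[d] for d in self.files[name])

    def logical_bytes(self):
        # What the user believes has been stored.
        return sum(len(self.read(n)) for n in self.files)

    def physical_bytes(self):
        # What is actually occupying space.
        return sum(len(c) for c in self.chunks.values())


store = ChunkStore()
store.write("a.txt", b"AAAABBBBCCCC")
store.write("b.txt", b"AAAABBBBDDDD")
print(store.logical_bytes(), store.physical_bytes())  # 24 16
```

The user sees two complete 12-byte files, while only four unique chunks are physically kept. The `files` dictionary is the reference tracking described above, and it is what grows into a large database as more data is stored.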
Or, as David Black of EMC used to be fond of saying during our time together in SNIA Technical Council meetings, “…most problems can be solved in computer science with yet another layer of indirection…”
I’m Entitled to Title My Own Titles
by Ken Wood on Aug 25, 2011
I’ve been laying it on a little thick technically of late, so I thought I would throw a change-up in the mix before everyone thinks I’m losing my sense of fun. And, I’d rather have fun while working than working to have fun. Plus, my last post was rather long and voluminous, so here’s a short blog on, well, blogging.
Believe it or not, one of the more challenging parts of blogging is coming up with a catchy title. HDS bloggers are responsible for all of our content, including the title and graphics. I try to be catchy, vague and a little profound, while playing on words, especially with the multiplicity of meanings between the rest of the world and how the techno-world communicates. According to my count, this is my 27th post since I started guest blogging with Michael Hay back in July of 2009, and a lot has evolved since then. I could have waited for a more significant milestone, like number 30, but I want to get back into more technical blogging next. Of these 27 blogs, my favorite titles are listed in no particular order below with some comments on the meanings.
- Chips or Pits! How do you like your content?
- Chips, meaning integrated memory circuits, or a collection of pits, each pit being the “bit” of information burned into an optical medium.
- Storage Fusion – StoraFgUeSION – SfTuOsRiAoGnE
- An evolving intermixing of storage technologies.
- Does HDP Make Gas, or Just Removes Gas?
- Percentage improvements could lead to overstated claims, like adding up gas additives and engine improvements, and HDP could reduce your stomach anxieties as a system or storage administrator.
- I mainly like this one for the graphic in the post that illustrates the title.
- The USS Starship Enterprise Storage Array
- The obvious tie-in to Star Trek and storage controllers.
- YAAA! – Yet Another Automobile Analogy
- This is a play on the Unix program/parser called “yacc” – yet another compiler compiler.
Which one is your favorite?
As I stated, these are in no real particular order, but my favorite is “YAAA! – Yet Another Automobile Analogy”. Obviously, there’s a range of titling that could run from “blog entry number 28” to outright lying about the content of the blog topic itself. If the latter were to happen, I would feel like I was spamming or scamming everyone. So there is always a tie-in no matter how outrageous; the title is always loosely related.
This also leads me to product labeling. It amazes me how some product names over the course of history can become forever linked with the function of the product or company’s name, like Crescent wrench, Kleenex tissue, Xerox copy, or Google. The name or label is actually part of the world’s new lexicon, regardless of language.
I am currently working with some technology from Hitachi’s Central Research Lab, dealing with speech and language. It’s interesting how many words mean the same thing, and are spoken the same way in any language. However, these words tend to be basically modern, stemming from the technology fields. I can’t fully blog about this technology at this time, but at some point I will describe it as it relates to other projects. When referring to “Google” the company or “google” the synonym for “search,” you’ll be surprised how different cultures are similar.
Grout Expectations, and a Disk Story
by Claus Mikkelsen on Aug 24, 2011
It’s been two weeks since returning from Thailand on my Habitat for Humanity build. After the 17-hour flight home from Thailand to the US, I had a 48-hour turnaround before heading off to Mexico City for the week to participate in another one of our remarkably successful executive briefing centers with Hu Yoshida (see his post on this trip).
No rest, wicked or otherwise.
As I mentioned in my previous blog, my “vacation” included building a home outside of Udon Thani, Thailand, courtesy of Habitat for Humanity. These guys are awesome, and do some very good work. We started with nothing, poured a foundation, built walls of cinder block, poured a concrete floor, set in doors and windows, and dug a septic tank. In only a week and a half, we (there were 16 of us) turned dirt into a home for a great family with two boys (one aged four and the other less than a year old).
Elegant housing? Hardly, but clearly a step up from where they were living previously. And, since my previous blog was titled “Pounding Nails,” I decided to call this one (with all due respect to Charles Dickens) “Grout Expectations,” since there were no nails involved (just a bunch of cinder blocks and concrete).
Here are some pictures from my time in Thailand.
Habitat has one of the most bizarre business models I can think of, where they (through their Global Village program) offer folks the opportunity to pay them money for the privilege of working FOR them. I hope HDS does not adopt this model.
All joking aside, it really is a great opportunity to give some sweat equity for a good cause, and at the same time see parts of the world not otherwise generally on the tourist (or business) track. I thought I could sell a couple of VSPs whilst there, but to no avail.
I’m not endorsing Habitat exclusively (although their “builds” are a lot of fun), but I do endorse the concept of getting out of the mainstream box and seeing the real world. Over the weekend in Thailand, I suggested to the group that we traipse over to Laos (since the border was only about 40 miles away), so we did. We started with a nice lunch on the Mekong River, then spent the rest of the day touring the capital city of Vientiane. On Sunday we visited an HIV/AIDS orphanage, which was both sad and uplifting at the same time. Playing with the kids was fun, and they certainly seemed to enjoy the attention.
Next year, body willing, the plan is to move the operation to Ghana in West Africa.
I did want to talk about the disk drive market before I close this. I came across this link on Twitter before I left on my vacation. It’s a year-old article, but still has some interesting perspectives on the HDD industry, and quotes a Seagate representative who was predicting 100TB-300TB drives by the year 2020 (only nine years away!). I also remember an article from Gartner or IDC in the year 2000 that stated that the average array size being shipped in the industry was 1.2TB. So now that we’re about midway between 2000 and 2020, it’s interesting to look back a decade and also into the future, and imagine what the larger storage industry will be like.
I bring this up as a teaser to my next blog, which will actually be a guest post from storage performance practitioner Ian Vogelesang, who knows much more about the HDD industry than I ever will. I’ve read it, and it’s a pretty interesting take from a very smart guy. You’ll enjoy it.
May I Please Have Some More Capacity Optimization, Sir?
by Ken Wood on Aug 16, 2011
So, this is the third and final installment of my blog series on capacity optimization techniques. The first article was on file level single instancing and file level compression, which also included a combination of the two. The second article described how data de-duplication works, which I demonstrated by using Linux commands.
In this post, I’ll show you how file level de-duplication and compression can save even more capacity. Plus, data de-duplication implies that sub-file single instancing is in use under the covers of the capacity optimization front. Since I enjoyed doing the demonstration style article showing some of the ways that this technology works under the covers, I’m going to employ this technique again.
In my previous post, “To De-dupe, or Not to De-dupe, That is De-data”, I demonstrated that the “split”, “md5sum”, “rm” and “cat” native Linux commands (at least in my CentOS 5.4 system) can be combined to de-duplicate a large file into a smaller file footprint (several smaller sub-files) at almost a 3:1 capacity savings ratio. What I plan to show and demonstrate here is combining a form of file-level single-instancing and file-level compression, AND file-level data de-duplication to reduce the capacity footprint further. In this case, I’ll show how a sub-file is a sub-file is a sub-file. Just because a sub-file was “extracted” from another original file, doesn’t mean it can’t be used somewhere else. In fact, these “multiple references” to a sub-file are the most powerful feature in reducing the storage capacity footprint in de-duplication systems.
Note: This demonstration is to illustrate the core functions of file-level de-duplication and capacity optimization techniques. It should not be used to de-duplicate your production data as a capacity savings technique!!!
In the screenshot below, I have created two files, “blogtest.dog” and “kentest.bup”. They are approximately 4.4 MB in size each for a combined total of about 8.9 MB. The first thing I do is find out what the MD5 hash fingerprint is for both files using the “md5sum” command. The two files are different, therefore I can’t file-level single instance these two files.
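As a rough Python equivalent of this “md5sum” check (a sketch only; the helper names `md5_of_file` and `can_single_instance` and the throwaway temporary files are assumptions for illustration, not part of the original demonstration), two files can be single-instanced at the file level only when their whole-file fingerprints match:

```python
import hashlib
import os
import tempfile


def md5_of_file(path, block_size=1 << 20):
    """Return the MD5 hex digest of a file, read in blocks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for block in iter(lambda: fh.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def can_single_instance(path_a, path_b):
    """File-level single instancing applies only when the fingerprints match."""
    return md5_of_file(path_a) == md5_of_file(path_b)


# Two small stand-in files with different content (hypothetical data).
workdir = tempfile.mkdtemp()
first = os.path.join(workdir, "blogtest.dog")
second = os.path.join(workdir, "kentest.bup")
with open(first, "wb") as fh:
    fh.write(b"shared block" * 100 + b"first tail")
with open(second, "wb") as fh:
    fh.write(b"shared block" * 100 + b"second tail")

print(md5_of_file(first))
print(md5_of_file(second))
print("single-instance candidates:", can_single_instance(first, second))
```

Even though most of the content is shared, the whole-file hashes differ, which is why the next step drops down to sub-files.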
Next, I’ll use the “split” command to fixed-block split the files up into 256KB sub-files, then I “ls -l” the directory to show the resulting sub-files and the sizes of these sub-files. I prefixed the output file with “1st-” and “2nd-” (“blogtest.dog” and “kentest.bup”, respectively). You should be able to see that filenames “1st-aa” through “1st-aq” correspond to file “blogtest.dog” and that filenames “2nd-aa” through “2nd-aq” correspond to file “kentest.bup.” Both sub-file sequences end with sub-filename “*aq”, however, the file sizes reflect the size differences of the originals. Counting the number of sub-files generated (using the “wc” command) shows that 17 sub-files were created for each original file or 34 total sub-files.
Similar to before, I will now calculate the MD5 hash fingerprint of each sub-file using the “md5sum” command on both sub-file sequences. You should be able to see that the hash fingerprint “ec87a838931d4d5d2e94a04644788a55” is present in both sets of sub-files from both original files. This means that each of the two original files contains a set of 256KB sub-file patterns that are identical to one another. This also means that both original files can share this sub-file between them, so I can delete all sub-files that calculate to this fingerprint except for the first one. So, I keep the first sub-file “1st-ab” with the hash signature “ec87a838931d4d5d2e94a04644788a55” and execute the “rm” command on the remaining sub-files with the same fingerprint.
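The fixed-block split can be mimicked in Python as a minimal sketch. The function name `split_bytes` is hypothetical, and the 4,400,000-byte buffer of zeros is just an assumed stand-in for one of the roughly 4.4 MB files; the two-letter suffixes imitate the default naming that “split” uses.

```python
import itertools
import string

CHUNK_SIZE = 256 * 1024  # 256KB, the block size used in the demonstration


def split_bytes(data, prefix, chunk_size=CHUNK_SIZE):
    """Cut data into fixed-size sub-files named prefix + aa, ab, ac, ..."""
    suffixes = ("".join(pair)
                for pair in itertools.product(string.ascii_lowercase, repeat=2))
    chunks = {}
    # zip stops at the shorter sequence, so this sketch handles at most 676 chunks.
    for offset, suffix in zip(range(0, len(data), chunk_size), suffixes):
        chunks[prefix + suffix] = data[offset:offset + chunk_size]
    return chunks


pieces = split_bytes(bytes(4_400_000), "1st-")
for name, chunk in pieces.items():
    print(name, len(chunk))
print("sub-files:", len(pieces))
```

Every sub-file is exactly 256KB except the last one, which holds whatever remainder is left over.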
As you can see, I have “de-duplicated” the total number of sub-files from 34 down to 10 sub-files, 4 for the original file “blogtest.dog” and 6 for the original file “kentest.bup”, and there are no sub-file instances containing the “ec87a838931d4d5d2e94a04644788a55” hash fingerprint for the original file “kentest.bup”. Basically, each sub-file is now a unique piece of data.
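The “keep the first, remove the rest” step can be sketched as follows. This is an illustration under assumptions: the `dedupe` helper and the tiny byte strings standing in for 256KB sub-files are made up for the example, and a real system would track references rather than simply discard names.

```python
import hashlib


def dedupe(chunks):
    """Keep the first sub-file seen for each fingerprint; list the rest for removal."""
    kept = {}      # fingerprint -> name of the first sub-file with that content
    removed = []   # names of duplicate sub-files that can be deleted
    for name, data in chunks.items():
        fingerprint = hashlib.md5(data).hexdigest()
        if fingerprint in kept:
            removed.append(name)
        else:
            kept[fingerprint] = name
    return kept, removed


# Hypothetical sub-files: "same" plays the role of the shared 256KB pattern.
chunks = {
    "1st-aa": b"head-1",
    "1st-ab": b"same",
    "1st-ac": b"same",
    "2nd-aa": b"head-2",
    "2nd-ac": b"same",
}
kept, removed = dedupe(chunks)
print("kept:", sorted(kept.values()))
print("removed:", removed)
```

Note that the duplicate from the second file is removed too, so after this step the second file owns no copy of the shared pattern at all.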
The amount of capacity occupied by these two original files has now been reduced from approximately 8.9 MB to 2.5 MB, assuming we actually deleted the original two files. This is approximately a 3.5:1 reduction ratio.
But wait! There’s more.
Now let’s compress the remaining sub-files to see how much additional capacity savings we can achieve. By using the “gzip” command, I compress the 10 sub-files individually and replace the original sub-file with the compressed sub-file and append the “.gz” label after the filename. There are still 10 sub-files, but now the amount of capacity occupied has been dramatically reduced further. The combined total capacity of the two original files is now approximately 196 KB! So from 8.9 MB to 196 KB, this is about a 45:1 reduction ratio.
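A rough Python counterpart to this per-sub-file “gzip” step is shown below, together with the ratio arithmetic. The helper names and sample data are assumptions for illustration, and the last line treats 8.9 MB as 8,900 KB, which is also an assumption about how the sizes were rounded.

```python
import gzip


def compress_chunks(chunks):
    """Compress each sub-file on its own and append '.gz' to its name."""
    return {name + ".gz": gzip.compress(data) for name, data in chunks.items()}


def reduction_ratio(original_size, stored_size):
    """Return N for an N:1 reduction (both sizes in the same unit)."""
    return original_size / stored_size


# Hypothetical, highly compressible sub-files.
chunks = {"1st-aa": b"abc" * 1000, "1st-ab": bytes(3000)}
packed = compress_chunks(chunks)

before = sum(len(data) for data in chunks.values())
after = sum(len(data) for data in packed.values())
print(f"{before} -> {after} bytes, {reduction_ratio(before, after):.1f}:1")

# The figures quoted in this post, in MB and then in KB:
print(f"after de-duplication: {reduction_ratio(8.9, 2.5):.2f}:1")
print(f"after compression:    {reduction_ratio(8900, 196):.1f}:1")
```

How well the sub-files compress depends entirely on their content, which is why the second ratio is so sensitive to the type of data.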
Of course, this is a dramatization and a demonstration. Your actual de-duplication ratios will vary considerably, or as they say, “your mileage may vary”. It really depends on the type of data you have to store.
So, now we have to “rehydrate” the two original files to their fully bloated original state by reversing this process. As you recall from the previous post, the high-level order in which the data de-duplication functions happen is:
- Chunk it
- Hash it
- Toss it or keep it
For this extra level of capacity optimization, there are a couple of additional steps.
- Chunk it
- Hash it
- Toss it or keep it
- Compress it
- Reference it
Technically speaking, the Reference it part is done even without the compression step, so even in the three step functions, there is a Reference it step. However, I’m highlighting this in this blog post because we did two extra steps to achieve the extraordinary capacity optimization results: sub-file Single Instancing and sub-file Compression. The sub-file Single Instancing comes from the one common sub-file between two completely separate original files. This can be illustrated in the diagram below. In fact, this is going to serve as the mapping we will use to rehydrate these compressed sub-files back to the original files.
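The five steps, including the reference map, can be put together in one small Python sketch. Everything here is illustrative: the `optimize` function, the 4-byte chunk size and the sample byte strings are assumptions, and the file names are borrowed from the demonstration only as labels.

```python
import gzip
import hashlib


def optimize(files, chunk_size):
    """Chunk, hash, keep or toss, compress, and record references for each file."""
    store = {}     # fingerprint -> compressed chunk, kept once for all files
    recipes = {}   # file name -> ordered fingerprints (the reference map)
    for name, data in files.items():
        recipe = []
        for offset in range(0, len(data), chunk_size):   # chunk it
            chunk = data[offset:offset + chunk_size]
            fp = hashlib.md5(chunk).hexdigest()           # hash it
            if fp not in store:                           # toss it or keep it
                store[fp] = gzip.compress(chunk)          # compress it
            recipe.append(fp)                             # reference it
        recipes[name] = recipe
    return store, recipes


files = {
    "blogtest.dog": b"HEAD" + b"ZERO" * 3 + b"TAIL",
    "kentest.bup": b"KENS" + b"ZERO" * 2 + b"DONE",
}
store, recipes = optimize(files, chunk_size=4)
shared = set(recipes["blogtest.dog"]) & set(recipes["kentest.bup"])
print("unique chunks stored:", len(store))
print("fingerprints shared by both files:", len(shared))
```

The `recipes` dictionary plays the role of the mapping: without it, the stored chunks cannot be put back in order.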
Again, instead of deleting the original files, I’ve renamed them so that I can do a binary comparison of everything in the end. Then, using the “gunzip” command, I uncompress the sub-files back to their original 256KB chunk size, except of course for the last sub-files, which are the remainder of the original files during the chunking process. Now we need to assemble the files back together. I use the “cat” command to concatenate the sub-files together in the proper order. I use the sub-file “1st-ab” as a replacement for sub-files “1st-ac”, “1st-ad”, “1st-ae”, “1st-af”, “1st-ag”, “1st-ah”, “1st-ai”, “1st-aj”, “1st-al”, “1st-am”, “1st-an”, “1st-ao”, “1st-ap”, “2nd-ac”, “2nd-ad”, “2nd-ae”, “2nd-ag”, “2nd-ah”, “2nd-ai”, “2nd-al”, “2nd-am”, “2nd-an”, “2nd-ao” and “2nd-ap”, which were all deleted earlier. This is to re-create the original files “blogtest.dog” and “kentest.bup”.
Initially, you can see that the files rehydrate back to their original sizes. To ensure that everything went back together properly, I run the “md5sum” command against the newly rehydrated files and compare them to the original renamed files, then I perform a full binary comparison with the “cmp” command to make sure everything is 100% perfect.
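Rehydration and verification can be sketched in Python like this. It is a minimal illustration, assuming a `store` of compressed chunks keyed by fingerprint and a `recipe` listing the fingerprints in order; those names, the helpers and the sample data are hypothetical.

```python
import gzip
import hashlib


def rehydrate(recipe, store):
    """Uncompress each referenced chunk and concatenate them in recipe order."""
    return b"".join(gzip.decompress(store[fp]) for fp in recipe)


def verify(original, rebuilt):
    """Compare fingerprints first, then do a full byte-for-byte comparison."""
    same_hash = hashlib.md5(original).hexdigest() == hashlib.md5(rebuilt).hexdigest()
    return same_hash and original == rebuilt


# Build a tiny de-duplicated, compressed store for one stand-in file.
original = b"HEAD" + b"ZERO" * 3 + b"TAIL"
chunks = [original[i:i + 4] for i in range(0, len(original), 4)]
store = {hashlib.md5(c).hexdigest(): gzip.compress(c) for c in chunks}
recipe = [hashlib.md5(c).hexdigest() for c in chunks]

rebuilt = rehydrate(recipe, store)
print("rehydrated correctly:", verify(original, rebuilt))

# Assembling the same pieces in the wrong order fails the check.
scrambled = rehydrate(recipe[::-1], store)
print("wrong order passes:", verify(original, scrambled))
```

The shared chunk is stored once but referenced three times in the recipe, which is the same trick as reusing “1st-ab” in place of the deleted sub-files.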
Trust me when I say that if any of these pieces don’t go back together in the right order, then the hashes will not match up correctly. Then you know you have a problem. As I have shown, this could get to be a laborious task by hand. Scripting could be an option to automate several aspects of this process. However, the best way is to let an appliance with reliable code and a hardened database do this for you; it makes all of these steps invisible. I’ve gone through these steps for you to show a little bit of what’s under the covers to this technology—or maybe what’s not under the covers. The combination of several of these techniques also has the potential of saving large amounts of capacity beyond any one method alone.
A True Holographic System Would be Disruptive
by Ken Wood on Aug 9, 2011
A recent article by The Register’s Chris Mellor was passed my way by a colleague, at a perfect time. I was planning to post a blog this week on Hitachi’s contribution to the world of optical and holographic storage, but I didn’t know how I was going to introduce the subject.