Geek Out With These Books
by Ken Wood on Dec 28, 2011
Amy Hodler’s post a few weeks ago on the Cloud Blog inspired me to share some of my own geek related book buys from 2011. They are as follows (in my preferred ranking).
- The Grand Design By: Stephen Hawking (@Prof_S_Hawking)
- I’m a huge Stephen Hawking fan and have read (more than once) every book he has published—which will explain the next book pick.
- The Illustrated – A Brief History of Time & The Universe in a Nutshell (double book release) By: Stephen Hawking (@Prof_S_Hawking)
- Having read the original versions of these books, I can say this superbly illustrated release is packed with high quality, glossy pictures compared to the original books. This is more of a collector’s edition and of course, when reading a Hawking book, a quality picture is worth a billion-billion (Carl Sagan reference) words. The best part of this book is I bought it at an “everything must go” blowout sale as the Borders in my neighborhood was shutting down.
- Holographic Data Storage – from Theory to Practical Systems By: Kevin Curtis; Lisa Dhar; Adrian Hill; William Wilson; Mark Ayres
- Since I was researching some optical storage technologies for Hitachi, this book came highly recommended and from an interesting angle. Customers were asking about the Hitachi references within the book, so I bought it. It has been extremely helpful for me to understand this evolving area of technology which I believe will be game changing in the future.
- CUDA by Example – An introduction to General-Purpose GPU Programming By: Jason Sanders; Edward Kandrot (@ekandrot)
- Another part of my research for hardware accelerated applications and their uses in enterprise applications.
- HTML5 – Step by Step By: Faithe Wempen M.A.
- Mainly purchased this as an HTML5 reference book for an internal project I am working on.
- Adobe Dreamweaver CS5 with PHP: Training Source Code By: David Powers
- Same project support as above.
- HTML5 – 24–Hour Trainer By: Joseph W. Lowery; Mark Fletcher
- Again, same project support as the above.
Also, here are some miscellaneous books I picked up at a clearance table. If you’re like me, you can’t pass up one of those 70% off clearance deals to fortify your technical library. And since I do a lot of video and audio editing, I also needed these for some personal projects.
- Producing Great Sounds for Digital Video By: Jay Rose
- Audio/Video Protocol Handbook By: Jerry Whitaker
- Gigahertz and Terahertz – Technologies for Broadcast Communications By: Terry Edwards
What are your top book recommendations from 2011?
Is it COTS or Commodity?
by Michael Hay on Dec 21, 2011
I find the IT community seems to be in a state of confusion between COTS and commodity. Mind you, I think that some people get it and can easily discriminate between the two. Commercial off the Shelf (COTS) offerings are just that. A more formal definition of COTS from Wikipedia follows:
In the United States, Commercially available Off-The-Shelf (COTS) is a Federal Acquisition Regulation (FAR) term defining a non-developmental item (NDI) of supply that is both commercial and sold in substantial quantities in the commercial marketplace, and that can be procured or utilized under government contract in the same precise form as available to the general public. For example, technology related items, such as computer software, hardware systems or free software with commercial support, and construction materials qualify, but bulk cargo, such as agricultural or petroleum products, do not. (source: Commercial off-the-shelf – Wikipedia, the free encyclopedia)
My colleague Ken Wood talks about commodity in a post several months ago, Soybean Is A Commodity, where he muses on what is and is not a commodity. His summary is that basically the output and the resulting measures of many of the devices and systems that are produced in the ICT field are a commodity. However, the systems and devices themselves aren’t.
He says, “It is my opinion that there is a misunderstanding and confusion in the IT industry between ‘commodity goods’ and ‘consumer products’ when it comes to technology. I can’t pinpoint the exact origin of why or how these two concepts seem to have become synonyms for each other, especially in the IT industry, but there is a difference between commodity goods and consumer products.”
Personally, I think that this stems from confusion about COTS and commodity, and may have crept into the ICT vocabulary just like “NIC Card” and “Transparent to the Application” (see my previous post, where I ranted on the topic of language misuse in ICT). While occasional misuse is relatively harmless, I believe that the misapplication of commodity has resulted in inappropriate thinking about the costs of technology. Let’s explore this last point for a bit.
Let’s assume that hard disk drives and the resulting capacity were a commodity. If so, they are a strange beast, I would have to say. In fact they may even represent a unique kind of commodity that follows the law of supply and demand but has planned cost erosion. The cost erosion is associated with industry patterns like Moore’s law, and is somewhere between 20% and 30% per year. Imagine that: a commodity with a predictable annual per-unit price decline. I’m sure that the mathematical wizards on Wall Street would love to have something with the level of predictability experienced in both consumer and enterprise capacity production/purchases. Sure, tragedies like those in Thailand and restrictions on rare earth metals can cause disruptions in supply that have the potential to increase costs if demand is not damped or constrained, but through innovative human potential and given enough time even these unfortunate events have little impact. So is storage capacity a commodity? I would say that unless the definition of a commodity has changed, neither the capacity measures nor the actual devices are a commodity. They are rather COTS devices with commodity properties.
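To make the cost-erosion point concrete, here is a minimal Python sketch of what a steady 20% to 30% annual decline does to a per-unit price. The starting price of 100 and the function name are hypothetical, chosen only for illustration; the only figures taken from the discussion above are the decline rates, and real prices will not follow a perfectly smooth curve.

```python
def projected_price(start_price, annual_decline, years):
    """Price after `years` of compounding decline (annual_decline=0.20 means 20%)."""
    return start_price * (1.0 - annual_decline) ** years


START_PRICE = 100.0  # hypothetical per-unit price, for illustration only

for rate in (0.20, 0.30):  # the 20%-30% range quoted in the post
    prices = [projected_price(START_PRICE, rate, year) for year in range(6)]
    print(f"{rate:.0%} decline:", ", ".join(f"{p:.1f}" for p in prices))
```

Under these assumptions the price roughly halves in about three years at 20% per year and in about two years at 30% per year, which is not how a traditional commodity such as soybeans behaves.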
With that in mind, are CPUs, memories, and more importantly advanced systems that aggregate and combine COTS in unique ways to release innovation commodities? In my opinion the clear answer is no. Storage, servers, networking, etc. are not commodity, but surely can be COTS. Obvious questions are: Why is this important? Why does it matter? I see this as important because of the potential for COTS to contain innovations unique to a particular technology supplier. This matters to any consumer of ICT because these innovations may actually be a better match to your business, and potentially even the entire market as a whole.
This last statement “entire market as a whole” is interesting because I see that the fundamental tectonic plates of the technology industry are shifting to favor more OPEX-friendly technologies. It also means that as a consumer, you may be willing to pay more for innovation in the short term especially if the technology delivers innovation you can leverage and amplify, or it reduces your OPEX such that you can reinvest money elsewhere. So where do we see this occurring in the industry now? Well, I would say that the trend to deliver complete IT stacks as I’ve discussed in The Rack is the New Server, the Data Center is the New Rack, is an example where CAPEX may be a bit higher but the potential for savings on the backend through staff reallocation, reduced maintenance costs, and assured configurations may make the slightly higher CAPEX worth it.
So the next time you hear an IT professional say something like, “That’s just a commodity technology…” stop them and correct the usage of commodity with COTS. Keeping this terminology misuse alive in the ICT industry serves to devalue innovations that vendors add, that users can take advantage of, and that creative companies can leverage to engender new innovation on top. I personally fear that without a grass roots effort to make a change here, as an industry we are going to be increasingly satisfied with mediocre offerings and products.
With All The Talk Around Cloud
by Dave Wilson on Dec 20, 2011
With all of the talk around the cloud and healthcare’s increasing movement toward adopting cloud technology, there are some issues unique to healthcare that any organization must ensure have been addressed. It is because of these issues that some healthcare providers lag behind other industries in moving to cloud technologies. Both cloud service providers and healthcare organizations should heed these areas when looking at cloud adoption.
Data Movement Across Borders
While a cloud service provider may be located within the country of origin, some of the cost saving benefits that can be realized by customers are due to the economies of scale that the service provider attains by sharing the infrastructure between multiple customers. This may mean that a cloud provider backs up or replicates data at a secondary site that does not reside within the original country–think Belgian hospitals’ local cloud provider who backs up data into their German data center. In many countries this would violate their privacy regulations and can be quite a complex and expensive problem to address, particularly if there is a breach of patient information. Healthcare organizations need to ensure that their data does not move across borders that it is not allowed to.
It would be naive to think that a facility would stay with one cloud provider forever. Cloud providers are free to manage their infrastructure as they see fit—a benefit for facilities who don’t want to worry about this infrastructure component. But some caution is advised. Customers need to know that their cloud provider is using accepted standards to store data. Proprietary mechanisms of storage will make migration very difficult in the future. An understanding of the cloud provider’s infrastructure and contractual agreements that ensure not only the ability to remove data but also assistance in migrating this data should be considered a high priority for any organization looking to adopt the cloud.
Ownership of the Data
This has been highlighted as a concern, but it should be clearly defined. Patient data belongs to the customer and the patient. The cloud provider is providing a service – network, storage, application, infrastructure, resources – but they have no claim to the data. The regulatory constraints should support this, in that patient data is subject to privacy and security laws such that a cloud provider could not, for example, sell access to the data to a marketing company. The customer is entitled to move, manipulate, change and otherwise remove data from the provider as desired. It is worth having this written into the contract so that all parties are clear.
Privacy and Security Compliance
Many organizations are reluctant to give up control of their patient information as there are certain risks that may suddenly become beyond their control. A breach of privacy falls to the healthcare organization to manage, and a cloud provider becomes an entity that threatens that control. There are several ways to mitigate these risks:
A. Contractual compliance with stiff financial penalties for any breach of privacy such that the healthcare provider has a course of action to rectify the breach without undue cost burden;
B. Requirement of the cloud provider to meet regulatory compliance, regular audits of this compliance by third parties and immediate actions to rectify any gaps;
C. Private cloud models that ensure the data is stored on the premises of the healthcare organization while still getting the benefits of the cloud;
D. Use of the cloud for non-critical applications such as email, clinical collaboration, analytic tools, etc.
There is as much talk about cloud security as there is about privacy concerns, and they are somewhat related. Interestingly, HHS claims that of all the HIPAA breaches, only 6% can be attributed to hacking a system. The majority of cases involve theft of a computer—likely for the value of the computer and not the information within. A cloud provider will have top notch security protocols and processes that any healthcare organization should understand prior to a contract. How does the DC handle phishing or denial of service attacks? Do they have virus protection? What are the physical security aspects to prevent unwanted access? In many cases the cloud provider will have better systems in place than the organization itself – but these should be investigated.
Healthcare deals with mission critical and life or death information. A cloud provider needs to understand that the architecture needed for healthcare is typically more robust than in other industries. Down time can’t be tolerated and service level agreements need to clearly define the expected response times. In Canada, Canada Health Infoway specifies that medical images must be stored in and retrieved from the Diagnostic Repository within 15 minutes of acquisition or request. These types of requirements must be written into the contracts prior to agreement.
The cloud can bring many benefits to healthcare organizations, but as with any new technology, due diligence needs to be done to ensure that better patient outcomes can be achieved at the same level of confidence as they are without cloud technologies.
Capacity Efficiencies, Again and Again
by Claus Mikkelsen on Dec 19, 2011
Why again? Well, it was about a year and a half ago when I last blogged, and much has changed since then, although the subject is still front and center. David Merrill has certainly discussed this numerous times as well, including this always-amusing post from March.
So why bring Capacity Efficiency (CE) up again? Well, two recent events bring it back to center stage in my mind.
The first event was a meeting with a prospective customer a number of weeks ago who was looking to secure a fairly large amount of storage capacity, and kept hammering away at our sales guy for the “bottom line”.
“What’s your dollars per terabyte?” he kept asking.
A befuddled sales team (and frustrated yours truly) hung in there, and we were finally able to turn it into a constructive conversation. But I’m still baffled at how many people have not let go of the cost/TB mentality.
The second was a meeting I had with Dave Russell of Gartner when we were discussing data protection issues. David Merrill was on the phone during this discussion. I like Dave Russell, and he’s a great analyst, but when he said that we keep 12-15 copies of all data, I guess I was a bit surprised that it was that high. Then David Merrill chimed in that his assessment was 11-13 copies.
So now I’m thinking that two of the smartest guys in this biz are agreeing that we’re keeping way too much data. I’ll come back to these numbers in a few weeks when Ros Schulman and I blog on this in our data protection series (with Merrill, just to confuse you even further). But keep that number of 11-15 in mind.
So now let’s mix all the various ingredients: my original blog from March 2010, David’s blog from March of this year, the “let’s all keep a dozen copies” discussion with Russell, and the fact that many folks are still in the per-TB world when it comes to storage purchases. Well, mix that together and you get a pretty bad-tasting stew.
But with CE, getting the most out of every TB you purchase is becoming a much larger issue as I peel back the layers. Thin provisioning (including write same and page zeroing), single instance store, deduplication, compression, dynamic tiering, archiving, etc., when multiplied by a factor of 15, makes a huge difference in data center and storage economics.
For other posts on maximizing storage and capacity efficiencies, check these out: http://blogs.hds.com/capacity-efficiency.php
Great Books for Geek Wannabes
by Amy Hodler on Dec 16, 2011
It’s that time of year. We’re busy finishing up all those loose ends that were to be done “before the end of the year” as well as juggling family and holiday plans. Despite the usual frenetic start, I really enjoy this time of year because once we slow down just a bit, most people are in the mood for thoughtful conversations about what we want out of the next 12 months.
For me, those conversations usually include discussing and recommending favorite books as a way to share what’s been useful for us. Since recommendations from people with similar interests are usually more helpful, below are my book recommendations from 2011—not all are new–for those that want to be a geek but really aren’t. (You know who you are…or you know who those people in your life are. We might love the idea of fractals and quantum mechanics…but we can’t do math in our heads.)
Title: The Information
Author: James Gleick (@JamesGleick)
About: A historical perspective of information.
Why read it?
It’s a beautifully written study of information as its own historical topic, which I haven’t seen anyone else do.
Gleick does a wonderful job explaining some very difficult concepts and I love the history of great discoveries. Even more impressive are the implications when his historical evaluation is taken in its entirety. I believe it reveals future trends that will impact how we relate to information in the near and long term. Although I found the first few chapters to be a little slow, I’ve actually read it twice and may read it a third time…it’s that good.
Title: The Black Swan
Author: Nassim Nicholas Taleb
About: How the unpredictable is really unpredictable and how to deal with that.
Why read it?
We should all be skeptical of models and people that “predict” what will happen, but we also need to plan for success and different possibilities. Taleb does a great job of explaining why the improbable usually has a lot more impact on our lives and businesses than what we planned for, and gives some advice on how to deal with that. (Also, the summary of fractals and self-similar replication at the end was really helpful for this geek wannabe.)
Title: World 3.0
Author: Pankaj Ghemawat (@PankajGhemawat)
About: How distances and borders still heavily influence our lives, businesses and economics in general.
Why read it?
This should have been called “the world is NOT flat” as it’s really a counterpoint to the book The World is Flat and for that reason alone I think it’s a must read. So many people either blindly favor both globalization and deregulation—or oppose both of them. Ghemawat offers what I think is a more balanced option where these are not linked, binary choices. It’s worth picking this up just to understand an alternative way of looking at globalization.
Title: The Master Switch
Author: Tim Wu (@Superwuster)
About: Ebb and flow of decentralization and monolithic centralization of power in the information industry.
Why read it?
It’s a fascinating and entertaining review of the rise (and stumbling and rising) of major 20th century information powerhouses from the telegraph and telephone to Apple and the Internet. Regardless of whether you agree with Wu’s recommendations at the end, it’s worthwhile for those of us in IT to understand this history and evaluate how the cycle of decentralization and centralization might apply to our industry.
Title: The Singularity Is Near
Author: Ray Kurzweil (@raykurzweil2035)
About: Enhanced human cognition and existence taken to a logical extreme.
Why read it?
I almost didn’t include this one because Kurzweil gets pretty far out there on his ideas. However, it’s extremely interesting to consider technology as another phase in evolution and it might be valuable to ponder what that implies. I recommend it for those particularly interested in sci-fi and anyone that wants to get out of the box of their own thinking. This book was published around 2005, so you’ll notice some predictions that haven’t come true yet, but if you can get past that it’s quite thought provoking.
So these are a few of my favorite reads that I managed to squeeze into 2011. What are your book recommendations?
Enhancements to Hitachi Data Ingestor
by Miki Sandorfi on Dec 14, 2011
A couple of months ago, we announced the broader HDS vision of Infrastructure, Content, and Information Cloud (see the post here and our press release). Today we announced the newest version of the Hitachi Data Ingestor (HDI), which will help organizations begin bridging from simple Infrastructure Clouds to the Content Cloud.
With this newest release of HDI (see the press release), coupled with the power of the Hitachi Content Platform, we are arming customers with the necessary technology to free their information and take a step into the Content Cloud. As we outlined before, the key capabilities of the Content Cloud include information mobility and intelligence – putting the right data in the right place, at the right time, whilst empowering user control. This new version of HDI supports this vision in several ways.
First, HDI v3.0 includes technology that permits dynamic dispersion and sharing of data. Based on chosen policies, information written into one HDI (via standard NFS or CIFS) can automatically and transparently become available at multiple remote HDI instances. Imagine, for instance, wanting to distribute the new 20MB corporate presentation to each of many regional offices. Instead of emailing it (propagating hundreds if not thousands of copies – yuck) or putting it on SharePoint (slow downloads), you can instead drop it onto the “corporate drive”. This action will cause the other inter-connected HDIs to “see” that new content is available, and based on access it will be cached close to the users who want to get the new presentation (much faster, simple and seamless).
Next, HDI places control at the fingertips of users. Because a cloud built with HCP and HDI is backup free by design and construction, placing tools in the hands of users to manage their own data is imperative. HCP already affords many controls for managing where data is stored, replicated, versioned, retained, and disposed. Now HDI passes this richness directly onto users via self-directed recovery of prior stored versions or recovery of deleted content. Unlike other NAS technologies, HDI natively couples with the power of object-based management, affording unparalleled granular access and control.
Finally, HDI includes some clever technology that helps customers adopt cloud in a very seamless fashion. Because HDI directly and non-disruptively manages the transition of data from legacy NAS devices into the cloud-attached HDI, making the transition to cloud-based storage has never been easier. During the transition, all data remains available and once the transition completes, the richness of the Hitachi solution becomes fully available – bottomless, backup-free file sharing that looks like “legacy” NFS or CIFS to users and applications, but with the power of cloud underneath.
Google Health Dies – What Next?
by Dave Wilson on Dec 12, 2011
Back in 2008, Google launched its health platform – Google Health. It was an attempt to allow patients to control their own health record by uploading records to a Google site, and then granting privileges to their physician—thus making their health record completely portable. They even piloted this at Cleveland Clinic.
The intent was by “…using Google Health, physicians will be able to more efficiently share important diagnostic data with their patients. As patients become better informed and proactive in managing their healthcare, they may be more likely to practice preventive care, adopt healthful behaviors and practice other measures that promote improved medical outcomes.”
Well, as it turns out, Google wasn’t so successful. What!?! Google failed to make a go of something? How does a company that brought us Google, Chrome, Google Earth and the like not be successful in healthcare? Aaron Brown, senior product manager of Google Health, said the initial aim of the service was to offer users a way to organize and access their personal health and wellness information, and thereby “translate our successful consumer-centered approach from other domains to healthcare, and have a real impact on the day-to-day health experiences of millions of our users.”
And so here lies the problem that inundates healthcare. What Google didn’t realize was that people aren’t so willing to put their personal health records out in cyberspace as readily as they are willing to post their drunken party pictures.
Funny how that works.
Google then relaunched Google Health two and a half years later with a new UI and some more interactive tools. But alas that failed to catch on. Since then Microsoft Health Vault and Intel have offered to convert any Google Health files over to their format. The vultures are circling.
A personal health record has a lot of value if properly implemented. Ensuring that the content is accurate, that you can access this data from anywhere in the world, and that you control who can see your records is of immense value. Think about being on vacation and needing to have some healthcare treatment. If you have a cardiac problem, you can share your records with the local physician and they can see all of your medications (that you can’t spell or remember). They can see recent tests and the results and not repeat certain tests, reducing your exposure to radiation and the like.
So what’s the problem?
The first issue: A personal health record that is maintained by a patient can’t be trusted by the physician treating the patient.
Patients may tend to put only what they want in the record. They may omit or even edit certain results, thinking that no harm can come to them. Who wants to share their positive HIV test or their mental health issues? How relevant is that to the chest pains they present with? In some cases the patient may even disagree with the results and omit them altogether. A personal health record that is not maintained by the parties providing the service is somewhat questionable when it comes to using it as a reliable source of information.
Second: Can we trust the Internet, the cloud and Google to maintain a level of security and privacy?
Most people do not trust companies to maintain their privacy when it comes to health records. Too many newspaper articles have front page stories where some hospital has leaked patient information. And recent stories about Google provide more proof that maybe Google has a conflict of interest in wanting to provide a personal health record. After privacy concerns were raised, Google’s CEO, Eric Schmidt, declared in December 2009: “If you have something that you don’t want anyone to know, maybe you shouldn’t be doing it in the first place. If you really need that kind of privacy, the reality is that search engines—including Google—do retain this information for some time and it’s important, for example, that we are all subject in the United States to the Patriot Act and it is possible that all that information could be made available to the authorities.”
Who would want this? Perhaps Eric Schmidt was the downfall of Google Health and didn’t even know it.
Personal health records have their time and place if properly administered, accurately maintained and controlled in a non-biased, healthcare managed way. But getting to this stage will be difficult with all of the issues that surround our need for privacy, not to mention the sheer task of trying to coordinate the massive amounts of data. Some facilities are doing this. Governments are investing in electronic health records, which may serve a similar purpose.
But today, personal health records are still something of a nice to have.
RSNA 2011 – Meaningful Use and Cloud Took a Back Burner to ‘Imaging’
by Dave Wilson on Dec 5, 2011
Renee Stacey, Senior Solutions Marketing Manager of Health and Life Sciences at HDS, accompanied me to RSNA 2011 last week. It was a great show, and Renee asked if she could contribute a recap for the blog. Take it away, Renee…
Earlier this year, leaders in the radiology space were pushing the industry to be better engaged with the Meaningful Use (MU) incentive program. MU is a government incentive program that financially rewards healthcare professionals when they adopt certified EHR technology and use it to achieve specified objectives. Initially, radiologists were hesitant to participate, which raised fear that adoption delays could impact the ability to meet new clinical and technology demands. For an industry that has typically led the pack on clinical innovation, there seemed to be real risk of radiology being left behind.
Since that time, radiology’s participation in the MU program has been clarified; however, radiology groups are still slower than expected in adopting these IT innovations – innovations that are essential both for improving patient interactions and for their promised financial reward. KLAS recently teamed up with the Radiological Society of North America (RSNA) to conduct a survey on this very topic. Among the results were two very interesting outcomes: the first showed that 60% of those surveyed either have a plan or are considering qualifying for the MU incentives, and more interesting, only 6% considered themselves educated on the MU incentive program. To me, that says there is a deep disconnect between the needs of the radiology consumer and how the technology players in this market are delivering their message.
I bring this up because, after following this story and having just returned from RSNA, I would have expected to see a plethora of MU messaging – and while RSNA provided a number of professional sessions on the topic – the MU message did not seem to make it to the show floor. Does it mean that IT health vendors are not coming to the table with Meaningful Use Certified solutions? Probably not. I think it means that perhaps, there was a miscalculation in what vendors believed the radiology community wanted or needed to hear. When only 6% of those surveyed have a comfortable understanding of MU opportunities, it means 94% need more information to make better educated decisions about their MU plans.
The same must be said for Cloud, which was surprising when private and public cloud was the message du jour this time last year. I expected attendees of RSNA11 to be able to see and hear a more mature and better defined cloud message with a lot of industry examples and success stories. Instead, with the exception of a very small handful, the big message appeared to be imaging. Relevant? Yes. Forward thinking? I am not so sure.
And with 40,000+ radiology professionals in attendance, imaging is a given. I would have expected RSNA to be THE place to learn more about cloud and MU offerings – because it is clear that the radiology market is working hard to learn more about it.
What were your thoughts on how RSNA promoted innovation in radiology?
Answering Ilja’s Request for Server Capacity – HDS Style
by Ken Wood on Nov 30, 2011
For everyone that celebrates the holiday, Happy [late!] Thanksgiving. The week of Thanksgiving in the US is a great time to catch up on many of those little work chores that pile up or slip through the cracks while traveling and prioritizing big tasks ahead of fun work stuff.
This morning I was catching up on some HDS Industry Influencer Summit bloggers’ and analysts’ write-ups and opinions from the days of Nov 10th and 11th. As I wrote previously in my blog, I was in a concurrent breakout event with several industry bloggers on the “other stage” during the afternoon’s main session. I was following links to various write-ups and ran across Ilja Coolen’s blog. Funny thing is, it was a write-up from a separate event back in March in the UK that I didn’t participate in. So, when I finished writing this blog, I realized what happened: too much link-clicking, and I ultimately ended up at an article that was six months old. While it is several months old, Ilja’s post did ask a very good question on server packaging and capacity that I would still like to respond to. In his blog, he states:
“Hitachi is able to deliver a completely filled 42u rack with 320 high density micro servers. The total rack would consume less than 12 Kilowatt. Whether or not this is a great accomplishment, actually depends on the total processing capacity this rack would have. I need to dig deeper into this to make a comparison.”
I have, in the past, performed paper exercises of sizing computing horsepower for initial comparisons. When everyone uses the same processor chips from either Intel or AMD, do they all perform the same? Where are the differentiators? To cut through the fat, one area is packaging.
The whole discussion of rack mount servers versus blade server systems has been in debate for a number of years, and it is somewhat falling into a religious discussion. The argument for commodity-like advantages of pizza box servers over the easier to manage enterprise blade server architectures is not going to be resolved here. But what I will offer is some of my insight into performance and the advantage of packaging using the same processor chipsets.
A standard formula for calculating peak floating-point operations per second (FLOPS) can be applied to a server system such as the one I use here:
(Number of FLOPS per Cycle) * (Clock Cycle) * (Number of Cores)
Adding additional system level information will yield the system’s or blade’s overall calculated FLOPS performance (of course this is a brute force approach, but it is a good starting point). For instance, using the Compute Blade 320 (CB320) blade server option as stated in Ilja’s blog and as mentioned by Lynn McLean in her presentation:
- Using the X5670 XEON based blade
- (there is a reason I’m using this one and not the fastest processor option for now)
- 4 FLOPs per clock cycle
- times 2.93GHz
- times 6 cores per processor
- times 2 processors per blade
- Single precision FLOPS – 4 times 2.93GHz times 6 times 2 = 140 GFLOPS
- times 10 blades in a system= 1.4 TFLOPS
- times 7 CB320 systems in a rack = ~10 TFLOPS
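The arithmetic in the list above can be sketched in a few lines of Python. This is only a paper exercise in code form: the function name and variable names are my own, and the inputs are the X5670 figures listed above (4 FLOPs per cycle, 2.93GHz, 6 cores, 2 processors per blade, 10 blades per system, 7 systems per rack).

```python
def peak_gflops(flops_per_cycle, clock_ghz, cores_per_processor, processors):
    """Brute-force peak: (FLOPs per cycle) * (clock) * (total cores).

    With the clock in GHz the result is in GFLOPS. This is a calculated
    ceiling, not a measured benchmark result.
    """
    return flops_per_cycle * clock_ghz * cores_per_processor * processors


# X5670-based blade, using the figures from the list above.
blade_gflops = peak_gflops(4, 2.93, 6, 2)

# Scale up: 10 blades per CB320 system, 7 systems per rack.
system_gflops = blade_gflops * 10
rack_gflops = system_gflops * 7

print(f"blade:  {blade_gflops:.0f} GFLOPS")
print(f"system: {system_gflops / 1000:.1f} TFLOPS")
print(f"rack:   {rack_gflops / 1000:.1f} TFLOPS")
```

Swapping in a different clock speed or core count is a one-line change, which is the main reason to write the formula down this way.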
I should note here also, these are general purpose computing FLOPS as compared to GPGPU FLOPS, which require additional coding and compiling steps to take advantage of this technology. This means almost everything running on these systems can take advantage of the performance assuming internode awareness (application is scale-out aware).
If you followed this same formula for the standard 1U rack mount server using the same chipset and core count, it would result in about 6 TFLOPS (140 GFLOPS per server times 42 servers in a fully populated rack) compared to almost 10 TFLOPS per rack using the CB320 (70 blade servers in a rack). So, the net result of this exercise is that packing 70 blade servers into a rack, compared to 42 standard 1U rack mount servers, yields a 40% reduction in floor space requirements for the same computational capability. Stated in a more measurable metric, the CB320 yields 235 GFLOPS per rack U compared to 140 GFLOPS for a standard 1U server in the same 42U rack. On the higher end of the CB320 product line, the X5690 blade, the calculated floating-point performance for this blade is 166 GFLOPS, which would put the rack’s total calculated performance at 11.6 TFLOPS or 277 GFLOPS per rack U.
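For anyone who wants to rerun the rack comparison, here is a minimal Python sketch of it. The helper names are hypothetical, and the 3.46GHz clock for the X5690 is my assumption (it is not stated above, but it is consistent with the 166 GFLOPS per blade figure). The unrounded results land within rounding of the numbers quoted in the paragraph above.

```python
RACK_UNITS = 42  # height of the rack used for the per-U metric


def node_gflops(clock_ghz, flops_per_cycle=4, cores=6, processors=2):
    """Calculated peak GFLOPS for one two-processor, six-core node."""
    return flops_per_cycle * clock_ghz * cores * processors


def rack_summary(clock_ghz, nodes_per_rack):
    """Return (TFLOPS per rack, GFLOPS per rack U) for a populated rack."""
    total_gflops = node_gflops(clock_ghz) * nodes_per_rack
    return total_gflops / 1000, total_gflops / RACK_UNITS


rack_1u = rack_summary(2.93, 42)      # 42 standard 1U servers, X5670
rack_cb320 = rack_summary(2.93, 70)   # 70 CB320 blades, X5670
rack_x5690 = rack_summary(3.46, 70)   # 70 CB320 blades, X5690 (assumed clock)

# Floor space needed by the blade rack relative to 1U servers for the
# same number of nodes: 42/70 of the racks, i.e. a 40% reduction.
floor_space_saving = 1 - 42 / 70

for name, (tflops, gflops_per_u) in [
    ("1U rack mount", rack_1u),
    ("CB320 X5670", rack_cb320),
    ("CB320 X5690", rack_x5690),
]:
    print(f"{name:14s} {tflops:5.1f} TFLOPS/rack  {gflops_per_u:4.0f} GFLOPS/U")
print(f"floor space saving: {floor_space_saving:.0%}")
```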
Hopefully this helps answer Ilja Coolen’s question about server density and capacity. The other point to finish off this article is the notion of data intensive and computational intensive architectures. I’ve just shown some data that suggests the CB320 has the capacity to be very computationally capable. On the other hand, the CB2000 has the capability to be very data intensive. The specifications for the CB2000 state 16 GB/s total bandwidth in a single blade system and 64 GB/s of total bandwidth from a fully configured rack. Combined, these two systems form a formidable platform for solving Big Data challenges. Not that Big Data problems are floating-point intensive workloads, but you never know.
More on this thought in future blogs.
Unintended Consequences of Cloud – From Influencers to Super Computing
by Amy Hodler on Nov 22, 2011
The week before last we had our first Influencer Summit in San Jose, CA, which brought together analysts, bloggers and trusted advisors. I really enjoyed Frank Wilkinson’s blog from the event regarding something we heard a lot that day, You Guys Do That?! This got me thinking about the unexpected and another comment that piqued my interest that day.
During a panel discussion, two of our customers commented how the use of cloud solutions was influencing their organizational structure—in short, that the higher level segmentations of server, storage and network groups were merging. I know many folks had postulated that this would happen, but this is the first time I’ve heard customers from different industries talking about how it impacts business structure. Imagine what that might mean in the long term for business processes. This alone is an interesting topic, and I’d love to hear more real-life examples.
With the idea of unintended consequences of cloud still stuck in my head, I attended Super Computing 2011 last week. Ken Wood summarized our participation in A Brief Visit to SC11. This is a fascinating conference if you’re interested in the amazing things being done to turn data into meaningful information, and in seeing the impressive projects from the likes of NASA, NOAA and educational institutions. For a non-promotional report on why this conference and supercomputing are important to our industry and society, check out this video summary from EE Times.
Amongst all the super charged brainpower, I heard one of the providers of High Performance Computing (HPC) mention that the concepts of cloud were changing what their end users wanted. I started asking the same question to others and it turns out that because this type of computing and analytics is extremely dependent on node to node fidelity and intolerant of failures, HPC providers had not anticipated a strong interest in cloud services. However that’s exactly what they are starting to see.
The providers that I spoke to weren’t precisely sure how they would meet these growing requests for cloud-like hosting and delivery but they are working on it. Super computing as a service (would that be SC-a-a-S?) would require some unique implementations of cloud solutions that would vary greatly from big data solutions due to dissimilar data and analytics models. Is there enough of a market for SCaaS? Hmm. Maybe or maybe we’ll call it something else?
These last two weeks have been ones of idea exploration for me, and I’m left with many more questions than I can answer. However, if you’ve read my other posts, you know I love this process. (Something good usually comes out of this exploration; I just can’t predict what it will be.)
So, please send me a quick note or write a comment about any unintended consequence of cloud that you’ve experienced or heard about. I’ll collect them, post a summary, and maybe we’ll collectively come to a few “ah-ha!” moments.
For more content from HDS Analyst Day, visit our bit.ly bundle: http://bitly.com/u0mh27.
Unstructured Data in Life Sciences
by Dave Wilson on Nov 21, 2011
Unstructured data is a major challenge in the life sciences market. Unstructured data, by its very definition, is difficult to analyze as it doesn’t fit into a relational database. Pharmaceutical and biotechnology organizations live and die by their ability to analyze this unstructured data, and studies show that the average company makes decisions based on data that is 14 months old. Companies that can make faster decisions will win the race.
Gaining access to unstructured data opens opportunities for organizations, but it is only a start. It is even more important that organizations know what data to access, because the advantage will go to the company that can mine the most relevant value out of the data.
Consider a pharmaceutical company. They conduct clinical trials in the drug development phase. Multiple departments generate massive amounts of data that all relate to the drug’s interactions: blood tests, biopsy samples, pathology images, nursing and patient notes, not to mention chemical analysis and more. Combined with race, gender and geography factors, there is too much data to make sense of. Aggregating this data into meaningful information is the key to driving better decisions, like, for instance, identifying trends that can uncover major discoveries.
It’s a little known fact that Viagra was discovered by researchers when they found that patients didn’t want to give up their medication. The “benefit” (or side effect, depending on your gender) was an accidental discovery. Being able to correlate data that is seemingly unrelated can lead to major finds, and a way to show a relationship between your data will drive data mining and analytics to higher levels.
So there are two main challenges facing pharmaceutical companies when it comes to big data.
- How does a company manage to store big data?
- How can they make sense of this big data?
As you have seen recently with our cloud announcements, HDS has cloud technology that can address both challenges. Cloud computing for pharma companies comes with its own challenges, like:
- Ownership of data
And these are important factors to consider.
Also, object-based storage is a way to store unstructured data and mine the associated meta data. Both Hitachi Content Platform (HCP) and Hitachi NAS powered by BlueArc® provide a means to manage unstructured data. HCP also forms the core of HDS cloud technology.
The key to managing big data is to enable reduced cycle times for computing massive queries. This drives pharma and biotech to gain an advantage over their competitors. High performance computing has a big role to play here – but that is a blog for another day.
A Brief Visit to SC11
by Ken Wood on Nov 18, 2011
Initially, I wasn’t planning to attend SC11, especially since this week I had several other meetings to participate in. However, as is common in this industry, I ended up heading to Seattle at the last minute for the day to meet with several people and companies at SC11. I was able to get into the exhibit hall early to explore the behind-the-scenes activities of many of the booths. Hitachi Ltd.’s HPC Group was present with a very impressive booth again this year.
A VSP was on display in the booth, next to what I call the world’s largest server blade. I don’t actually know if this is a fact or not, but it is very impressive to see this device used in this specialized field of computing. Also, there was the new HA8000-tc rack mount server for technical computing (I want/need some of these in the Innovation Lab).
I also hung out at the BlueArc booth, which now displayed new panels with “BlueArc – Part of Hitachi Data Systems” in large, vivid lettering. Nice! Sorry, I didn’t get a picture of this for some reason, but I’ll grab one from someone or someplace. I did hang out at the booth meeting with new BlueArc colleagues and old HDS colleagues, as well as customers of other vendors interested in knowing more about everything.
Probably one of the more interesting activities at the conference for me was the attention given to data intensive workloads specifically around “Big Data”. There were several events going on surrounding Big Data that, unfortunately, I was not able to attend. However, since the majority of my time and my team’s time is spent solving Big Data problems in the enterprise, this is an area and community we will continue to monitor closely. I have been using scale-out and HPC architectures to explain and solve the Big Data challenges in the enterprise and this is evidence of that approach. Stay tuned for more on this subject. Unfortunately, I was unable to attend any of the sessions, tutorials or BoFs this year. I didn’t even get a t-shirt or conference goodie bag. Hopefully, next year my schedule will allow for more time to participate and explore like I usually do.
HDS Industry Influencer Summit – The Other Stage
by Ken Wood on Nov 18, 2011
Last week was the inaugural HDS Influencer Summit, convened in downtown San Jose. This event included financial analysts, industry analysts and key industry bloggers. It is interesting that the majority (maybe all) of these attendees are related to the storage industry in some fashion. There are several blogs detailing the event and explaining the resounding “…they do that?” in these posts from my colleagues Frank and Miki. What I would like to describe here is the blogger breakout sessions, and the tour of the new Innovation Lab, which is an extension of the Hitachi Central Research Laboratory, the day after the main event.
There was a special breakout session during this event specifically for our invited industry bloggers, Greg Knieriemen (@Knieriemen), Nigel Poulton (@nigelpoulton), Chris Evans (@chrismevans) (not the Captain America Chris Evans), Devang Panchigar (@storagenerve) and Elias Khnaser (@ekhnaser), which overlapped some of the main event. By comparison, this portion of the event was more exciting than what was missed in the main event (in my biased opinion). I kicked off this breakout session with an overview of our “R&D and Futures” and an introduction of the new Innovation Lab at our headquarters. I also did a brief one slide description of three active projects we are working on in the lab and noted that these projects will be demonstrated the following day. Sorry, these projects are under NDA.
After that tour, and while walking Greg out of the rest of the day’s activities, he stated to me “…you probably have the greatest job in the world!” I replied back “trust me, this isn’t the only thing I do, and the rest of my job isn’t so great” (sorry Michael). However, I took this to mean that my team is instrumental to changing the industry’s perception of the “New HDS”, or at least that’s how I interpreted his comment.
I didn’t think much more about his comment until this week when he followed up with an email to Michael Hay and myself basically stating the same. Unfortunately, he didn’t say that I WAS ‘doing a great job’ and he included Michael so I couldn’t edit his email before forwarding it ;^) Obviously, there’s a sense of pride when someone recognizes the work being done, especially since being so close to the work can take your focus off the larger vision.
It is rewarding for me and my team to know that we are helping to transform HDS from being viewed as a storage company to something more while keeping to our roots. I normally describe the difference between HDS and other technology companies in this market space as – companies that are primarily seen as a server company see storage as a place to keep data, but a storage company would treat data as the digital assets of an enterprise and use servers as a way of making that data useful to the business. To this, I also like to describe storage (at least the way HDS does it) as maintaining the “state” of the company, while servers can become “stateless” interchangeable components that essentially are data processing offload engines to the storage infrastructure.
I am definitely looking forward to following up with several participants of this event as I received many requests and questions. Also, I am looking forward to next year’s event, and what we will be sharing.
For more content from HDS Analyst Day, visit our bit.ly bundle: http://bitly.com/u0mh27
Hey! You Guys Do That?!
by Frank Wilkinson on Nov 15, 2011
Last week HDS held its inaugural Influencer Summit in San Jose, California. It was a very big deal for our company, not to mention our invited guests, who by all accounts were about to get the one-two punch! (In a good way).
The creation and preparations for this historic event were driven by our marketing group and trusted advisors, not to mention half the company (well, it seemed that way; I may be exaggerating a little bit). We had our executives on hand to deliver the core messaging with some great insights, as well as folks from The Office of Technology and Planning. This was our first of many events which will enable HDS to share its strategy and technologies within an exclusive (for now anyway) invitation to the industry’s most prestigious analysts and bloggers.
There were many internal meetings to discuss our individual participation and also to cover each presenter’s topics, strategy materials, presentations, NDA clarification, blood type, first born, and dare I forget that I had to sign my name in secret ink (I am kidding, there was no secret ink).
All kidding aside, it was a great event. Jack Domme (pictured above) was first to present, and he did a great job as always. (I will save the remaining details for my fellow colleagues and bloggers, who I am sure will do it better justice than I can).
One of the initiatives at the event was to decide who would be monitoring Twitter and responding. I of course volunteered, as did many of my peers and colleagues. The job was easy enough, as I am on Twitter (@FTWilkinson) as much as I can, and also I like to see instant feedback from our customers and peers in real-time.
As the event started off, the Twitter chatter was quiet, but started to pick up rather quickly with some tweets pointing out that HDS is the best kept secret…
- @seepij: If you thought #HDS were just into storage – like I did – hearing impressive insights on new technology coming #hdsday
- @CIOmatters: #HDS much more innovative and strategic than I realised -not content buying IP they build it and use/re-use it, ahead of the market #hdsday
- @nigelpoulton: Randy Demont saying customers are really pleased but telling HDS that they dont market well enough <– only for the last 10+ years! #hdsday
- @ekhnaser: Great message at #hdsday but y limited to 70 people? This should be heard by more partners customers influencers….
Ahem!…The presentations were great and overall were well received—at some level I think we shocked some folks by our candor as well as laying out our strategy and some possible future areas of focus and technologies (all covered under NDA, so no sharing in this blog forum). Nonetheless, the Summit was a great inaugural event that proved to those in attendance that we, HDS, do have a complete strategy, and a plan and vision for carrying them out.
So, YES, we do that!
For more content from HDS Analyst Day, visit our bit.ly bundle: http://bitly.com/u0mh27
Cloud Strategy and the Influencer Summit 2011
by Miki Sandorfi on Nov 11, 2011
The inaugural Influencer Summit 2011 was a tremendous success! Industry and financial analysts as well as some key bloggers traveled from all over to meet with HDS and spend a day, plus some, talking strategy, industry and futures. Although the agenda was packed, I was able to do a short recap on the HDS cloud announcement. Check it out:
Excited About Our Inaugural HDS Influencer Summit 2011
by Claus Mikkelsen on Nov 9, 2011
Fall is in the air, but HDS will be turning up the heat in San Jose this week with our first analyst day. Kicking off on Thursday, November 10th, HDS will host two days of executive presentations, financial updates, and updates on our product strategy from CEO Jack Domme and other key HDS leaders. Throughout the event, we will cover our strategy for infrastructure, content, and information cloud with help from a few gracious customers who have also decided to participate.
Our goal is simple: let the storage world understand where HDS is going and how we will get there from a technology roadmap perspective.
I will be tweeting from the summit (@yoclaus), as will a deep bench of HDSers also in attendance, including:
- Miki Sandorfi – @MikiSandorfi
- Amy Hodler – @AmyeHodler
- Hu Yoshida – @HuYoshida
- Ken Wood – @KenWoodonTech
- Shmuel Shottan – @ShmuelShottan
- Frank T. Wilkinson – @FTWilkinson
You can follow the event by using the #HDSday hashtag on Google+ and Twitter.
Speaking of HDS on Google+, our company page is now live, so please add us to your Circles so we can share information from the event with you. We will be hosting in-depth conversations with storage analysts, bloggers and HDSers during the summit. We have shared our “HDSday 2011 Circle” so you can connect with our participants.
You can also follow the #HDSday Twitter List to connect with those who will be participating in the summit and posting tidbits on Twitter throughout the event.
We hope you’ll follow the live stream of tweets and blog posts that emerge this week!
There will also be some industry bloggers in attendance, so make sure to check in with their Twitter handles for updates:
- Chris M Evans – @chrismevans – www.thestoragearchitect.com
- Devang Panchigar – @storagenerve – www.storagenerve.com
- Greg Knieriemen – @knieriemen – www.nekkidtech.com
- Nigel Poulton – @nigelpoulton – www.nigelpoulton.com
- Elias Khnaser – @ekhnaser – www.eliaskhnaser.com
Looking forward to engaging with you throughout the show!
Big Data in Healthcare
by Dave Wilson on Nov 7, 2011
Having been in the healthcare space my entire career, I had no idea what all the fuss was about when Big Data started to be the topic of the day. Sounded like a large file to me – mammography images can be 60MB each – and aside from the potential joke about large mammo images, what was the big deal?
So I did some research. “Big Data” refers to a volume of data too large to be harnessed and used in meaningful ways. In other words, Big Data is an accumulation of data that is going to waste and has no immediate meaning, mostly because no one can do anything with it due to its size.
Healthcare providers are quickly becoming inundated with Big Data. Governments are unintentionally driving big data warehouses through health information exchanges, diagnostic imaging repositories and electronic health records. Eighty percent of the data is unstructured, and it is accumulating to the point where it can be called Big Data. Now, on an individual basis, each patient record has value and meaning to the patient–obviously. But as we look at the growing accumulation of data, the opportunities are endless to drive meaningful analytics out of the volumes of data available.
Data warehouses are simply that–a warehouse for data. Data has no meaning on its own, and as providers create these warehouses, there needs to be a shift in how we think about the potential use of data. Data warehouses need to become information or content repositories. Information is a useful tool that results from the analysis of data driving decision making. Sounds like marketing fluff right? Let me demonstrate.
A diabetic patient monitors their blood sugar multiple times per day. This value gets stored electronically–this value is data. On its own it has little meaning (aside from the obvious immediate value to the patient) in the big picture of things. Now take that patient and all the patients in the region and their blood sugar values for the last 5 years. Analysis of this data could lead to trends—important information that can drive preventative health measures. This leads to better patient care, improved quality of life, and lower healthcare costs.
Much of this data is available today in separate repositories, isolated applications and local data warehouses. The potential to combine the blood sugar data with nursing notes keywords, weather forecasts, and other related and unrelated data can help drive this analysis. The challenge becomes getting access to this data and then overcoming the interoperability aspects. Problem is, we don’t know what we don’t know. Questions we would never ask today can be asked when the restrictions are lifted – Is there a correlation between diabetic hospital admissions and the weather pattern?
A content cloud could answer some of these challenges. Consolidate and aggregate data from multiple sources, and at the same time capture the relevant meta data associated with the data against which analytics can be run. Meta data can help manage the massive amounts of data being generated (the Big Data) and provide a way to correlate this data into meaningful information. This content then can be accessed by researchers and scientists to analyze.
Big Data is, and will continue to be, a major problem for healthcare providers. One estimate has healthcare Big Data sized at 150 exabytes and growing at a phenomenal rate of 1.2 exabytes per year. The possibilities of tapping into that information are endless. It has been a challenge for pharmaceutical and biotech companies for years – but that’s another discussion.
Measuring Up A Supersized McBlu-ray With BDXL
by Ken Wood on Nov 1, 2011
In March of last year, I wrote a blog about disruptive technologies, specifically how the Blu-ray technology could change the storage landscape (or not). The new Blu-ray disc format specification enhancement was defined in June 2010 for the BDXL standard. This new specification increases the recording capacities for Blu-ray discs in two ways.
First, per-layer formatted capacity has increased from the initial 25GB to about 33GB per layer. Second, the number of layers now supported on a disc is three or four, described as triple and quad layered discs. The conventional capacity description is now 100GB (99+GB) for the triple layered disc and 128GB for the quad layered disc. There is some discussion with Blu-ray media suppliers looking for requirements for dual sided media, which could increase the per disc capacities to 200GB for a six layered disc (triple layers on both sides) and 256GB for an eight layered disc (quad layers on both sides).
Now, when I discuss optical storage capacities and Blu-ray, I’m typically referring to the storage capacities and read/write speeds for these devices, and how they would apply to enterprise uses. I’m less interested in the video formatting and supported formats—at least until I’m working on my own home videos. For my job at HDS, I’m always watching this space for the “kicker” that sends this technology to the forefront of an enterprise’s alternative storage technology strategy. More specifically, when could optical storage technologies, like Blu-ray, replace tape (albeit, for certain types of use cases)?
Even with an inconveniently laid out 256GB of capacity (flipping a disc over is a pain at best), Blu-ray would have a tough time replacing the current multi-terabyte LTO tape formats today and planned improvements for the future.
But wait! What’s actually being compared when discussing LTO tapes and Blu-ray discs: head-to-head on capacity, or head-to-head on capacity and footprint? The current specification for an LTO5 tape is 1,500GB of uncompressed capacity with a cartridge dimension of 4.1 x 4.0 x 0.8 inches (105 x 102 x 21 mm). Using some creative math, this is 13.12 cubic inches, which works out to about 114.3GB per cubic inch for that cartridge.
The specification for a quad layered disc (single sided) is 128GB of uncompressed capacity with a disc dimension of 4.7 inches in diameter x 0.047 inches thick (120 mm diameter x 1.2 mm thick). Again, doing some creative math, this is 0.815 cubic inches, which in this case works out to 157.1GB per cubic inch. This is an amazing 314.1GB per cubic inch if we test the waters with a dual sided disc. Stated another way, in roughly the same amount of space that an LTO5 cartridge occupies for 1,500GB of uncompressed storage capacity, a stack of Blu-ray discs would contain a whopping 2,060GB using a single sided 128GB quad layered disc and over 4,100GB for a stack of double-sided quad layered 256GB discs.
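The “creative math” above is easy to reproduce. The sketch below treats the LTO5 cartridge as a box and the disc as a solid cylinder (ignoring the center hole, which is my simplification), using only the dimensions and capacities quoted above. The variable names are mine. Because it does not round the disc volume to 0.815 before dividing, the per cubic inch figures come out a hair below the ones quoted.

```python
import math

# LTO5 cartridge: 4.1 x 4.0 x 0.8 inches, 1,500GB uncompressed.
lto5_volume = 4.1 * 4.0 * 0.8              # cubic inches
lto5_density = 1500 / lto5_volume          # GB per cubic inch

# BDXL quad layer disc: 4.7 inch diameter, 0.047 inch thick, 128GB.
# Modeled as a solid cylinder; the center hole is ignored.
disc_volume = math.pi * (4.7 / 2) ** 2 * 0.047
disc_density = 128 / disc_volume           # single sided
dual_sided_density = 256 / disc_volume     # hypothetical dual sided disc

# Capacity of a stack of discs filling the same volume as one cartridge.
stack_gb = lto5_volume * disc_density
dual_stack_gb = lto5_volume * dual_sided_density

print(f"LTO5: {lto5_volume:.2f} cu in, {lto5_density:.1f} GB/cu in")
print(f"BDXL: {disc_volume:.3f} cu in, {disc_density:.1f} GB/cu in")
print(f"stack in one cartridge volume: {stack_gb:.0f} GB "
      f"(dual sided: {dual_stack_gb:.0f} GB)")
```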
Granted, the Blu-ray disc is a bit wider than the LTO cartridge, and a straight stacking of discs or a forced fitting of cubic inches from one form factor into the other is not a precise or practical method of comparing these two storage media. Also, in a well designed apparatus for managing many optical discs, there would be zero surface contact, so a measurable gap between discs would be needed to manage them properly. This would drop the GB per cubic inch capacity somewhat. However, the numbers are disparate enough to look closer at this from a different perspective.
So what about performance? The current LTO5 specification states a 140MB/s rating for the uncompressed 1,500GB format. This means for a read operation, once the tape sequentially seeks to the requested location, at top speed (tape moving across read/write head) the data can be read at 140MB/s. Impressive, and roughly as fast as most magnetic hard disks sequentially streaming today. A Blu-ray disc reads and writes at about 18MB/s using the slim form factor optical disc drives and about twice this speed for the full and half height form factor optical disc drives. So for comparison’s sake, I’ll use 30MB/s for read/write as a conservative estimate. Poor performance, along with per disc capacity, is one of Blu-ray’s main deficiencies when it is considered as a serious storage alternative in the enterprise.
However, let’s look at this from a different perspective again. That same “stack” of discs that yields up to 2,060GB of capacity when compared to LTO5’s footprint would need roughly 17 optical disc drives to simultaneously load all of these discs. Two things happen when this is done. First, 17 drives x 30MB/s is a total of 510MB/s of aggregate performance from this “stack” of discs. This assumes reading ALL of the discs for some sort of read everything or write everything operation. Second, even if there aren’t this many optical disc drives available, all 2,060GB of data is divided into 128GB chunks and 33GB layers. When seeking a single file from a Blu-ray disc, the operation would only load the disc with the requested file on it, focus the laser on the required layer, then directly seek to the location on that layer to read the file and satisfy the request.
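Here is the same aggregate throughput reasoning as a short Python sketch. The 2,060GB stack, 128GB discs, 30MB/s per optical drive and 140MB/s LTO5 rating all come from the text above. The full-read times at the end are my own illustrative extension, assuming 1GB = 1,000MB and ignoring load, seek and disc swap time.

```python
import math

STACK_GB = 2060        # disc stack occupying one LTO5 cartridge volume
DISC_GB = 128          # single sided quad layer BDXL disc
DRIVE_MB_S = 30        # conservative per-drive Blu-ray read/write rate
LTO5_GB = 1500         # uncompressed LTO5 capacity
LTO5_MB_S = 140        # LTO5 streaming rate

# Discs (and therefore drives) needed to mount the whole stack at once.
drives = math.ceil(STACK_GB / DISC_GB)

# Aggregate streaming rate with every drive busy.
aggregate_mb_s = drives * DRIVE_MB_S

# Illustrative full-read times in hours (assumes 1GB = 1,000MB and no
# load, seek or swap overhead). All discs are read in parallel, so the
# stack finishes when one full disc has been read.
lto5_hours = LTO5_GB * 1000 / LTO5_MB_S / 3600
stack_hours = DISC_GB * 1000 / DRIVE_MB_S / 3600

print(f"{drives} drives x {DRIVE_MB_S}MB/s = {aggregate_mb_s}MB/s aggregate")
print(f"full read: LTO5 {lto5_hours:.1f} h, disc stack {stack_hours:.1f} h")
```

The point of the sketch is only that the comparison changes once the media are treated as a managed pool and not as individual pieces.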
Of course both storage mediums have excellent power consumption characteristics when not in use – ZERO watts. However, I think optical storage, specifically Blu-ray, is in many cases compared unfairly (I did so myself initially), and that is one of the many knocks against it. Individual medium – LTO5 cartridge – to individual medium – disc – is the standard comparison, but both will be used and managed in similar ways. That is, nobody uses just one, and there’s always some device in place to manage them as a larger body of media. This means when the aggregation of capacity and performance is measured, the two technologies compare fairly well, at least now that the new BDXL format is being used for Blu-ray.
I would like to hear about your experiences with Blu-ray storage and what your opinions are concerning this technology–especially those professionals using Blu-ray in their companies to solve problems.
Trick Or Treat
by Claus Mikkelsen on Oct 31, 2011
Ah, yes, Halloween. Time to pick up that empty wine glass and start knocking on neighbors’ doors. I’ll leave the candy for the kids. But there is a “trick or treat” to this blog.
The “trick” was to get you to read this far, and the “treat” will be unfolding over the next few weeks and months.
I’ve blogged, written and spoken a lot recently, not so much about the data explosion (we all talk about that), but about the changes in technology that are allowing it, and in many cases almost encouraging it, to occur. Most recently, Ian Vogelesang blogged on this page about disk drive futures, and this past April I blogged about how Moore’s Law might be tilting a little faster these days. Recently, I read that we might be seeing up to 100TB disk drives by the year 2020.
All of these trends are great, but also tend to create more problems to solve, or processes that need to be changed. For example, how long will it take to do a RAID rebuild of a 100TB drive (don’t ask!), or what does data protection look like in the year 2020, or more specifically, what should it look like today?
That’s the subject at hand, and one I’ll be focusing on in the weeks ahead, along with my good friend David Merrill, whom many of you already follow. Helping us along will be Ros Schulman (our “diva of data protection”) who “guested” on Hu’s blog this past April about business resilience following the Japan earthquake and tsunami. There will be other guest bloggers as well, but between David’s blog and mine, we’ll be unraveling the intricacies of data protection as well as the storage economics surrounding it. For anyone interested in data protection and backup challenges, this should prove to be a “treat”.
It’s not your father’s simple tape backup any longer. Stay tuned.
HDS Information Cloud Vision and ParaScale
by Cameron Bahar on Oct 31, 2011
Last week, HDS unveiled its roadmap for the Information Cloud and stated that it is based on technology obtained through the acquisition of ParaScale in August 2010. In this blog post, I will explain how the ParaScale platform will serve as a foundation and enabler for the Information Cloud.
In my last blog post, I wrote about the impact and requirements of a new class of applications on storage and computing infrastructures. As this massive wave of unstructured data is created, we need platforms that are specifically architected to efficiently ingest, store and analyze this data.
In the early 90s I had the privilege of working on the second version of the Teradata data warehouse appliance, which was used by leading companies around the world to mine their structured data sets. The interfaces were SQL, and we implemented an ACID-compliant RDBMS while leveraging a scale-out, shared-nothing architecture to achieve scale and performance for TPC-D (decision support) type queries. Teradata allowed companies to achieve record “time to answer” results on their most pressing business intelligence problems. In the process, Teradata gave these select companies a significant competitive advantage, which in many cases led to these companies dominating their various industry segments.
An example of this is Walmart, which invested heavily in this technology in the 1990s and even sued Amazon for hiring their data warehouse experts as it considered this knowledge a competitive weapon. Walmart was able to analyze supplier and customer patterns and figure out what product to put on which shelf, in which city, in which month in order to maximize its revenue. It is interesting to compare the relative growth rates of Walmart to Kmart, which I believe did not invest aggressively in this area in the 1990s!
What we’re witnessing is the second wave of this paradigm. The fundamental difference is that the data sets are a few orders of magnitude larger and they’re all unstructured or semi-structured.
Many organizations are realizing that there’s tremendous hidden value in this data and it needs to be harnessed to enable businesses to get new insight about their businesses and customers. Proof points are leading web companies such as Google, Amazon, Facebook, Yahoo!, eBay and others that are mining user activity and patterns to sell ads or products or both.
Because these data sets are relatively large (TBs to PBs), it is impractical and expensive to copy the data over a network into a data warehouse in order to process it. It is much faster and cheaper to move the processing to where the data is stored, which brings me to why ParaScale is an important bridge to the Information Cloud.
The ParaScale platform will allow organizations to efficiently ingest, store and mine unstructured and semi-structured big data sets at scale and all in one place using a scale-out shared nothing architecture. This technology provides an elegant and flexible method to combine storage and compute in the same stack and achieve performance, efficiencies and capabilities required to solve this class of problems. The journey from the Content Cloud to the Information Cloud requires an array of technologies and we think the ParaScale technology is a key element required to bring this vision to reality.
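The idea of moving the processing to where the data is stored can be illustrated with a minimal Python sketch. This is not ParaScale’s interface; the node names, the sample log lines and both function names are hypothetical. Each node summarizes its own shard locally, and only the small summaries travel to a coordinator.

```python
from collections import Counter

# Hypothetical cluster: each node holds its own shard of log lines locally.
SHARDS = {
    "node-a": ["login alice", "search shoes", "login bob"],
    "node-b": ["search shoes", "purchase shoes", "login alice"],
    "node-c": ["search hats", "login carol"],
}


def local_summary(lines):
    """Runs on the node that stores the data: count events by type."""
    return Counter(line.split()[0] for line in lines)


def merge_summaries(summaries):
    """Runs on the coordinator: combine the small per-node results."""
    total = Counter()
    for summary in summaries:
        total.update(summary)
    return total


if __name__ == "__main__":
    # Only the per-node Counters cross the "network", never the raw lines.
    totals = merge_summaries(local_summary(lines) for lines in SHARDS.values())
    print(dict(totals))
```

The raw shards never leave their nodes, so the data moved is proportional to the number of distinct event types rather than to the size of the data set.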
Want to read more about our Cloud Roadmap? Visit our bit.ly bundle here: http://bitly.com/pCt5Gk
Cloud Maturity Model In Healthcare
by Dave Wilson on Oct 28, 2011
Now I don’t like to steal titles or concepts from other companies but at my last company we had a concept of a Multi-site Maturity Model and I think the same “maturity model” concept works for healthcare and our cloud offering. It took a little while for this to kick in (thanks to my colleague Linda Xu for helping me see the light) but I think that our vision for where the cloud will take us is most appropriate for healthcare. Let me explain.
I see four stages of evolution in the healthcare space for customers that want or need to go to the cloud for various reasons. Healthcare tends to be behind other industries when it comes to adopting technology so this evolution needs to happen at the speed at which our customers will be comfortable. This evolution starts with “Cloud Ready” Technology. By buying the HDS “box”, the customer starts down the road of enabling the cloud by being ready from an investment perspective. Take for example Hitachi Content Platform (HCP). A customer can utilize HCP to manage their patient information from the radiology department to start. This would be locally hosted and managed and be like all other departments – a silo of storage.
Stage two involves the customer moving this equipment to a hosted environment, adding additional applications to HCP and expanding the capabilities. It may also include virtualization of an existing data center such that multiple applications are now sharing the storage virtually and Hitachi and its partners take on more of a managed service role. Storage on demand becomes an option. At this point our customers start to build out the Infrastructure Cloud and are managing a number of applications and their accompanying storage as a cloud model. Think of MS Exchange, PACS and Hospital Information Systems data being managed through the cloud.
But as we have talked about before, just storing data in the cloud is of limited use unless the physicians are able to utilize this data for clinical decision making, and so the Content Cloud evolves as the customers add Hitachi Clinical Repository to the mix. Now the customers are ingesting data through the cloud, indexing that data as it arrives so physicians can see the patient’s longitudinal record, and storing this information. Facilities can now give their physicians access through cloud-based applications. Electronic medical records, physician portals and patient portals all fit into this arena. The Content Cloud offers healthcare providers the opportunity to expand their services and to provide new services and methods of communication to their populations much more easily and quickly. All the benefits of the cloud apply to healthcare: reliability, lower costs with higher utilization rates and a simplified IT management environment. Remember: healthcare providers’ main function is to care for patients, and so IT is not always their strength.
Now that we have created content and made it available to those in need, the final stage of our maturity model starts to appear. The Information Cloud brings hospitals and caregivers to the highest level of maturity. Utilizing the content that has been collected from across multiple facilities, from many patients and physicians, we can start to apply “smarts” to this data. By adding analytics to the healthcare information we can start to develop best practices for disease treatment and determine which medications have the best impact on certain diseases. We can start to track epidemics before they happen and track back to Patient Zero when they do happen. We can make better clinical decisions for the population and identify trends in the early stages. With the right applications we can manage chronic diseases like diabetes, cancer and heart disease much better – thus lowering the cost of healthcare for all and improving the quality of life for many.
In the end, a Cloud Maturity Model is probably what is most needed for our healthcare facilities to improve patient outcomes and we can show our customers how to get there at their own speed. Without it, we will continue to struggle with access to data, we will continue to miss the big picture and ultimately we will continue to see patient healthcare costs increase to unmanageable levels and that will hurt us all – ouch!
What do you think of this Cloud Maturity Model? Are there other steps that need to be taken?
What Can You See In Our Cloud Vision?
by Amy Hodler on Oct 26, 2011
Yesterday, HDS announced our roadmap for the information cloud to help customers transform data so it can be better used as a strategic asset with the goal of fostering more business insight and innovation. Miki Sandorfi, our Chief Strategy Officer for Cloud, also explained yesterday in his post how our 3-tier strategy builds—starting from infrastructure cloud to provide more dynamic infrastructure, then layering content cloud to enable more fluid content, and then finally building to information cloud to facilitate more sophisticated insight. Miki also reviewed some of the new Hitachi Cloud Solutions and Services that were announced to help customers achieve this goal.
Having seen the cloud strategy develop from within HDS, and the significant debate over even minor implications, I can attest to how serious we are about using this vision as a lens to organize, prioritize and drive our business. To help you understand how we are using this for our “bigger picture” I wanted to give you a little insight into some of our brainstorming as to what this vision might make possible today and in the future. And if you’re like me, there’s nothing better than an example to really illustrate the potential of an idea.
So let’s imagine what we could do with better insight using the concepts of infrastructure cloud, content cloud and information cloud. Regardless of the industry, there are amazing accomplishments that can be made, but I’d like to use a healthcare example since we all can relate to having received care (or at least knowing someone that has received care).
Consider how the infrastructure cloud might be used in a healthcare example. Let’s say that a doctor notices a suspicious illness pattern on rounds. In this case, the doctor could immediately spin up an internal SharePoint for hospital collaboration and an external message board requesting input. No, this is not rocket science and yes, we can do this today. However, without the dynamic resources of an infrastructure cloud, the doctor would have to submit his request and then wait a considerable amount of time for the approval, acquisition and implementation of resources to support the service he needs. This would delay his ability to collaborate and reach out for assistance, and consequently cost valuable time in detecting a possible outbreak.
If we now look at the next stage of activities for this use case, we can see how a content cloud, built on top of the infrastructure cloud, would help the doctor obtain the data he needs quickly and securely. For example, the doctor would likely want to search and share information among hospitals with appropriate rights and privacy protection in place while collecting content independent of the application it was created in. In this case, each patient report has a meta-tag for the mystery illness with pertinent information and privacy protection. The doctor can quickly search across multiple IT systems and find the appropriate information he can then use for other study. Without the fluid content that the content cloud enables, the doctor might be delayed in aggregating data, or even worse, completely miss information essential to understanding this developing situation.
In this example, we know that the doctor is ultimately trying to save lives and possibly contain an outbreak. The information cloud can best enable this when layered on top of an infrastructure cloud and content cloud. For instance, the doctor may identify a possible outbreak in a nearby city using a report that recombines medical analytic data with Google search trends. With this information, the doctor would notify city officials, who would then take preventative action to limit risk. At the same time, based on preset variables the information might also self-direct further search and analysis. For illustration, let’s say the results of this self-direction alert the doctor to a potential historical match to these trends and this leads him to consider applying an old inoculation solution in a new way. Applying this insight, he would be able to quickly head off the outbreak. Without the sophisticated insight the information cloud facilitates, the doctor might spend countless hours poring over data and information: unable to sift out trends and patterns and unable to even consume other sources of information such as machine generated information…let alone be able to blend and analyze everything. This could mean serious delay in not only identifying but also resolving the health risk.
(Is it me or does all this talk of “the doctor” and future capabilities have you thinking about the “Doctor Who” BBC series? If you’re among us oddball science fiction fans, I’d love to hear who your favorite “Doctor” is. One of my favorites is number 9, who you can see in this great promotional picture.)
It’s entertaining to consider vision concepts and build on the possibilities, but it also brings to light novel ideas, uses and even unforeseen challenges.
So what other possibilities and use cases can you dream up for our strategy for infrastructure cloud, content cloud and information cloud?
Data is Our Middle Name
by Michael Hay on Oct 26, 2011
…and it is our sincere desire to Inspire Your Next Insight. The reveal of our infrastructure, content and information clouds exposes a framework to help our customers and the market get to that next “ah ha” moment.
To paraphrase and add to my colleague Miki Sandorfi’s post on our Cloud Blog, we aren’t just delivering the framework, we’ve included “batteries” as well. (By “batteries included”, I mean we have real solutions, not just products, adhering to our framework that help our customers live by the Hitachi gospel. I view this as not unlike how the Python computer language is positioned: it has lots of helper packages, scaffolding and frameworks making it easy for software engineers to be productive quickly. In the same way, solutions like private file tiering and SharePoint archiving help our customers deploy big parts of a private cloud infrastructure with little worry, helping them be more productive and attentive to the business.)
However, this is not the real reason why I’m excited about this announcement. My excitement stems from the fact that HDS is unleashing what we stand for publicly. If you’ve listened to our CEO Jack Domme you’ll get a hint, and if you live within our “four walls” you’ll also understand. What we are doing is beginning to talk externally about what we stand for, inviting you to participate in our journey.
That is because we live in exciting times and we are now in the midst of what HDS calls the Big Bang of Data. Precursor factors contributing to the Big Bang of Data include increased volumes of data, a hyper pace of data growth and newer methods of accessing data.
The figure I’ve constructed and included in this post above shows what I believe to be three important factors: a shift in the character of our society with the emerging Millennials, the on-demand anytime data access from mobile devices, and the resulting accumulation of unstructured data/content. The post-Big Bang of Data era is the “quiet” yet equally exciting time where new fundamental structures are developed, new treasures are uncovered and created, and information intermingles with other information engendering the next insight!
While the message may sound similar to that of other players in the industry, as I have pointed out in this previous post, Hitachi’s interactions with the world are vastly different from those of our competitors. Here is a relevant quote from my previous post:
“In a very real sense, and key to Hitachi’s differentiation, we are learning to think more like a customer because we are actively working to fuse IT to social infrastructure. This makes us think a lot about how to cope with our own deluge of data so that we can improve our own offerings directly and indirectly.”
As to how long this journey we are on will take, I’m not sure, but likely we will be doing this kind of thing for years into the future. I think our journey is not unlike how Apple exposed their vision to be the digital hub of the living room not as a strategy but as what they stood for. This approach has allowed Apple to have a guiding direction with sufficient room to explore, correct, modify and expand as they participated in the market.
I believe that today we are doing the same thing by exposing what HDS stands for, not just a strategic vision. I suppose that this was inevitable as Data is our middle name.
The Road To HDS Cloud
by Miki Sandorfi on Oct 25, 2011
Today HDS made an important announcement supporting our commitment to cloud, and it came in two parts.
First, we announced and provided details behind our vision of where cloud is headed: from infrastructure towards content and then information cloud.
Second, we announced the availability of new Hitachi Cloud Services that customers can leverage to reduce the TCO of unstructured data in their environment while putting themselves on the path to content cloud.
Let me explain a little about these two parts of our announcement.
The first part is giving further insight into what is driving our development, acquisitions, and delivery around cloud services. The graphic below shows at a high level what we are describing:
We view the infrastructure cloud as a basic set of capabilities that allow us to solve bigger problems—the infrastructure cloud is a set of tools in a toolbox. In the content cloud, we focus on leveraging the dynamic infrastructure enabled by the infrastructure cloud and add capabilities to liberate data so that it can be freely used and re-used: it enables fluid content. We have a great example of this in practice with our customer Klinikum Wels (you can see the case study here). Finally, this leads to the information cloud, where we can leverage the dynamic infrastructure cloud and the liberated data in all forms (structured, unstructured, semi-structured) from the content cloud to drive towards sophisticated analysis and insight.
The second part of the announcement focuses on new Hitachi Cloud Services that are now available and help customers build towards the content cloud while realizing immediate benefits in the infrastructure cloud.
These solutions are focused on TCO reduction for unstructured data, and are delivered with self-service and pay-per-use capabilities. Here you can continue with already-deployed traditional NAS and complement it with a cloud solution for 30% or more TCO reduction (file tiering); augment or replace NAS filers with a backup-free, bottom-less cloud implementation that still “looks and feels” like traditional NAS but is deployed with next-generation cloud technology; and complement SharePoint environments by offloading a bulk of content into the private cloud.
Because customer choice is extremely important, we have designed all of these new solutions to be modularly delivered: customers can purchase these offerings as cloud packages and build their own cloud around them. They can optionally enable self-service and billing/chargeback by electing to deploy the management portal. Or we can provide fully managed solutions including a true OpEx pay-per-use consumption model with no upfront capital expense to the customer.
Regardless of deployment choice, these solutions put the customer on the path towards content and information cloud. Customers get common, policy-driven data lifecycle management; search and retrieval of their data any time, from anywhere; and a foundation to provide data abstraction from the application that generated the data. Information is the lifeblood of any organization, and we are helping deliver that value to our customers.
All Data Has Value
by Amy Hodler on Oct 24, 2011
The trends are clear and spectacular.
We create, collect and store an immense amount of data and information at an exponential rate.
IDC research has shown that in 2009 there were 800 exabytes of data (that’s 800 million terabytes), and they forecast this number to be over 35,000 exabytes by 2020. It’s hard for me to wrap my head around such a large number, but “Data, data everywhere”, an article published last year by The Economist, put the scale and magnitude of this increase in perspective. The piece noted that Wal-Mart “handles more than one million customer transactions every hour, feeding databases more than an estimated 2.5 petabytes – the equivalent of 167 times the books in America’s Library of Congress.”
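To put that forecast in yearly terms, here is a small Python sketch of the implied growth rate. It reads the 2020 figure as 35,000 exabytes (an assumption about the unit) and assumes steady compounding between 2009 and 2020, which is a simplification of my own and not part of the IDC forecast; the function name is hypothetical.

```python
def annual_growth_rate(start: float, end: float, years: int) -> float:
    """Compound annual growth rate that takes `start` to `end` in `years`."""
    return (end / start) ** (1 / years) - 1


if __name__ == "__main__":
    # Figures quoted above: 800 exabytes in 2009, 35,000 exabytes forecast for 2020.
    rate = annual_growth_rate(800, 35_000, 2020 - 2009)
    print(f"Growth factor: {35_000 / 800:.2f}x over 11 years")
    print(f"Implied annual growth: {rate:.1%}")
```

Under those assumptions the total grows by roughly 41% every year, which means it doubles about every two years.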
Do we really need all this data?
Well, I think we do. (A separate post on Information Overload is sure to follow if we can think of something new to add and thereby add to the noise…er… I mean information.)
First off, I’m in awe of the totally cool and amazing things people are doing with information that was once not even a consideration. In a recent article on Feedback Loops, Wired magazine reported some fascinating work done by Shwetak Patel, who translated “the cacophony of electromagnetic interference into the symphony of signals given off by specific appliances and devices and lights.” Through a single device in a single outlet and a stack of algorithms, he could tell if someone left the blender on, and thereby possibly get assistance where needed. Who would have thought that your house has such unique and useful electronic fingerprints?
And then even the most seemingly useless information, such as Twitter status updates on anything and everything and nothing, is proving to be valuable. Last year researchers at Indiana University and the University of Manchester submitted a paper that claimed and illustrated at a high level that the “moods expressed in Twitter feeds can accurately predict some changes in the Dow Jones Industrial Average three or four days before they occur”. If this seems far-fetched, well, I’ll have you know that the Library of Congress is now saving your tweets. Yes, our tweets are now part of the data stream that is considered a historical timeline, as reported by the New York Times last year. Even though this information previously had questionable relevance, there is obviously an expectation of current importance and future potential.
(Am I a total geek for hoping that someone will dub in “Every Bit is Sacred” over a somewhat well known Monty Python tune?)
When I start thinking about the idea of saving all possible data, the image of my garage suddenly pops into my mind.
No matter how many new and valuable things I want to pack into it, I only have so much space. And that leads me back to that The Economist article on “Data, data everywhere”, because they also reported on IDC studies indicating the amount of digital information being created already exceeds the amount of available storage, as illustrated in this chart.
Now this really starts to look like a classic hockey-stick problem.
A Washington Post article earlier this year cited a University of Southern California study on digital data massively overtaking analog data, which this chart really illustrates. The piece quotes one of the authors of the study, Martin Hilbert, who states that “Humans generate enough data – from TV and radio broadcasts, telephone conversations and, of course, Internet traffic – to fill our 276 exabyte storage capacity every eight weeks.” And that’s JUST the human generated data!
So what does this mean now? What happens when this starts to look like 0/1? Better technology and quantum mechanics to the rescue?
I don’t know. But I do know the only way I fit more stuff in my garage is to organize, repack and prioritize what stays. Knowing there will come a time when you, your organization or your clients have more digital information than space available, how are you planning to prioritize? Are you taking any steps now?
Big Data? Big Deal!
by Frank Wilkinson on Oct 19, 2011
There have been many articles, opinions and positions written about the big data phenomenon in the past year, and even if you were not confused before, you may still be trying to decipher its impact on your organization and your business. There is no doubt about what big data implies, or about its effect on business and beyond the data center. It is an important milestone as we enter the next phase of the IT era, as we look not only to the data created by business applications, databases, machines, email and files, but also to data generated by social media and business productivity tools (such as business social media applications).
They all have their impact on how we run our businesses, and that is the easy part to discern from big data issues. What about insight? How will we better utilize our data objects, object stores and their associated metadata to drive real opportunities and insight? This was part of the premise of business intelligence solutions, and what they were going to deliver, right?
Not so much!
Seriously Big Data
There are many reasons to take big data issues seriously, not to mention how best to manage it, integrate it, leverage it and back it all up. Those are challenges in themselves. For me the greater issue is: how will we interact with the data and make decisions based upon factors that can have an immediate impact on our business? Sure, integration, mobility, backup and exponential data growth are important, and they do relate to how we can gain better insight for making more accurate business decisions, but the fact is, if we have not figured out how to manage the data growth by now, we are in serious trouble. Data consolidation, converged infrastructures and scalable architectures are the beginning of solving how best to approach big data impacts, but what do you do with the data, how will you leverage its insights, and what will you use to extract insight from the data?
Data mashup dashboards and business dashboards are not necessarily new concepts, and while some have made an impact on the market, none of them has gone beyond basic information without leaving you feeling like you forgot something (like that feeling you get when you leave the house in the morning).
Analytics are becoming more relevant in unstructured data, just as they did for structured data. This is such a big market for enterprise vendors such as IBM, EMC, NetApp and Hitachi that we are all hard at work building better solutions to help this effort. But that does not mean all solutions will be created equal, nor does it mean that they will offer identical or similar approaches. What will matter in the end is end-user adoption, usability and customization.
The question I often ask myself is: “what is really needed in an architecture that will lend itself to provide greater clarity and insight of the data?” This is a complex problem, and one with many possible answers. I believe that we are facing a dynamic shift in how solutions will be designed in the future, with a heavier emphasis on truly understanding how businesses run and keying in on the requirements to help businesses make more intelligent decisions. Throughout the past twenty years we have focused on helping to improve processes, IO, smarter applications and the underlying hardware technologies. I feel, however, the next step is determining the best way to marry hardware and software functions more tightly with more internal hand-off. As an example of this, Michael Hay has written a post describing the essence of autopoiesis, the marriage and tighter integration between hardware and software functions. This is a start. But how?
There is no doubt that our systems and solutions have become much more intelligent, as well as more complex, but at the same time they have become less dynamically integrated, with no common connector to share information in a truly usable way. Such a connector could be leveraged for greater insight into what is happening within the systems and applications, reporting data into a business dashboard (which can be used as the single-pane-of-glass view of the data and its relevance to the business).
What Are We Searching For? What Will We Discover?
If I want to understand a particular event, such as a sales engagement, what would I like to know about it? Is there information already residing within my data stores? What type of data? Could I also have instant access to social media data and what kind of insight can I gain to assist my decisions going forward?
What I am talking about is how we can have a true 360-degree view of the data that we have access to, as well as data that we need to have access to. While we are in the midst of a big data revolution, the issue is how we can find what we are searching for when the results are collected from various, and perhaps disparate, data silos and presented in a correlated view. This would enable data to be displayed with relevancy across data streams and data types. As an example: if I perform a search for John Dowe, the results returned should show all of John’s activity related to the constructed query. In this case, the results may show some relevant emails, documents, John’s network or server activity, log file data (he was trying to gain access to a SharePoint site which he does not have access to), some voice mails and perhaps some security surveillance video of him while on the company campus. Let’s add in his social network activities, to see if he has breached corporate security by discussing corporate secrets (yes, Big Brother is watching). By themselves, the data parts are not very interesting, but if we can tie the data together, it can show that John was up to something very nefarious: email correspondence with a competitor, voice mails that captured a conversation between John and a corporate spy, and security footage with audio capturing him in the parking lot exchanging a sealed envelope.
Is that interesting? Absolutely!
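The correlated view described above can be sketched in a few lines of Python. This is only an illustration of the idea, not a description of any HDS product: the record fields, silo names, user IDs and sample entries are all hypothetical, and a real system would query each silo’s own index rather than an in-memory list.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    silo: str        # where the record came from (email, logs, ...)
    subject: str     # person the record is about
    timestamp: str   # ISO 8601, so string order matches time order
    summary: str


# Hypothetical records; a real system would query each silo's own index.
RECORDS = [
    Record("email", "jdowe", "2011-10-03T09:15", "Message to competitor domain"),
    Record("server-log", "jdowe", "2011-10-03T08:50", "Denied access to SharePoint site"),
    Record("voicemail", "jdowe", "2011-10-04T17:20", "Call from unknown number"),
    Record("email", "asmith", "2011-10-03T10:00", "Weekly status report"),
]


def correlate(records, subject):
    """Return one subject's records from every silo as a single timeline."""
    hits = [r for r in records if r.subject == subject]
    return sorted(hits, key=lambda r: r.timestamp)


def silo_coverage(records):
    """Map each subject to the set of silos that mention them."""
    coverage = defaultdict(set)
    for r in records:
        coverage[r.subject].add(r.silo)
    return dict(coverage)


if __name__ == "__main__":
    for r in correlate(RECORDS, "jdowe"):
        print(r.timestamp, r.silo, r.summary)
```

No single record says much, but the merged timeline puts the denied access, the email and the voice mail side by side, which is the point of correlating across silos.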
How Do We Get There?
The Search is NOT over!
Search is a term which is too widely used and not fully understood for its potential abilities. Whether you are searching data as part of an eDiscovery process, for corporate governance or simply to find relevance around a certain subject, search is the tool we want to leverage, and its results are what we want to draw insight from and base decisions on. Search is just the beginning, but to be fair, it’s just search, and what we need is more data and data types to help us see a true picture of what we are looking at.
Starry Night by Van Gogh is a beautiful painting depicting a wondrous night sky. But what if you only could see the Rhone river? What would we know about the rest of the painting? Would we know that the artist wanted us to know that it is evening? Would we know that there are other aspects and objects? Of course not, we would only see what we have access to. Much like search and discovery of data, we only know what we know. Nothing more.
There are often discussions around what it would take to get there, and the answer I hear quite often is that it is too hard and too costly. To get to where business needs us to be, we need to think about how we can develop THE NEXT smarter architectures and infrastructures, leveraging open source solutions and common connectors as well as PCIe, smarter controllers, FPGAs and SSDs, and embedding software functions more closely to the data streams.
Sounds hard, right? Yes, but not impossible. We have spent the last two decades adding more and more complexity and infrastructure to handle the large amounts of data we generate, but we fell short of true integration across the data center and business applications. I do agree that there are technical obstacles to overcome in order to achieve better integration, including how we can push search functions and their associated IO down to the hardware layer to minimize the IO across the infrastructure. This is where HDS shines, as we have the best scientists, engineers and resources working on and solving these problems. Let’s not forget that Hitachi develops some of the world’s best technology solutions and core IP, which is utilized in almost every facet of life and is leveraged in some of the technology you have in your data center. We know a thing or two about taking the overly complex and making it simple.
While I cannot go into great detail about our current endeavors around our research and development efforts, rest assured that we have been working to decipher the necessary technologies to bring THE NEXT—whatever it may be.
Check back soon for my next blog, which will discuss next generation discovery needs and data correlation.
HDS At The Gartner Symposium 2011: Come Hear Our Plan For Harnessing The Power Of Cloud
by Miki Sandorfi on Oct 13, 2011
HDS is excited to be a Premier Sponsor at this year’s Gartner Symposium/ITxpo 2011 held October 16 – 20, 2011 in Orlando, FL. Attendees should expect the conference to be packed full of classes and labs, and to have access to tons of information on emerging technologies. A few key examples of how the agenda has expanded from last year:
- 450+ educational sessions, workshops, how-to clinics, Mavericks, Market overviews, Net IT Outs, Magic Quadrants/MarketScopes and roundtables from Gartner Analysts (up from 350+ sessions in 2010)
- More Gartner analysts onsite – 170 (150 in 2010)
- Four+ hours of keynote presentations covering the latest trends
- 20+ hours of networking with peers from other companies
- CIO Program (pending approval) with an increase in workshops – 80 (58 in 2010)
- Role-Based Tracks: Increase in capacity and volume of interactive workshops and Net-IT Out short-form sessions
- Industry Program expansion on Sunday and throughout the week (gartner.com)
I am excited to present Infrastructure Cloud and Beyond on Monday, October 17th. 12:45-1:45PM (Room: Southern IV-V)
Through my presentation, you will learn about our strategy to harness the power of Cloud with best practices and examples. The presentation will provide information for organizations that are moving to a cloud service delivery model in order to remain agile, competitive and to improve operations and cost efficiency. As companies embark on their journey to the cloud, it’s critical to understand the path ahead. The foundational elements must include integrated components, virtualized infrastructure and be capable of dynamic scale. In my session, hear from CSC Corporation on how they are harnessing the power of cloud to provide federal cloud computing service delivery.
We also have fun giveaways and Q and A’s planned. Please join me on Monday, October 17th. 12:45-1:45PM Room: Southern IV-V.
During the conference expo hours please stop by HDS’ booth located in the Pacific Hall of the Dolphin Hotel (booth #PS9). The booth will be jam packed with new HDS product information and strategy, HDS leaders and other fun things!
We want to help you start putting the power of Cloud in context to your goals by having better and more sophisticated insight for decisions and advancements. In order to do this, we need more fluid content, which of course requires more dynamic infrastructure. Find out how HDS is driving solutions to support infrastructure cloud, content cloud and information cloud.
For more information on the conference, visit http://blogs.gartner.com/symposium-live-orlando/
Silence Is A Part Of The Discussion
by Michael Hay on Oct 11, 2011
Have you ever watched a completely uncut interview or viewed footage of Steve Jobs when he doesn’t have a script? From the moment he announced his official retirement this summer I have done exactly that. When the moment called for it, Jobs would pause, sometimes uncomfortably, collect his thoughts and then respond. Most recently, I watched Steve’s extensive interview at D8. There, he pauses on several occasions, once about the Gizmodo incident, once in response to an audience question about content business/distribution, and a mild pause when reflecting on his 2005 Stanford speech. In every instance the silence spoke loudly, making me, as the listener, pay even more attention.
In the same way, the white space in a document, software application or even hardware device redoubles a user’s focus on what is actually there. Steve and the company he gave birth to accomplish exactly the same. You will find this principle, one of the 10 usability commandments of aesthetic and minimalist design, applied in Apple’s products and product portfolio. Space is given to sharpen the focus on what is important.
With this in mind, I want to look at last week’s activities and pose a question. Almost since the beginning of this year the press, the blog-o-sphere, etc. have been saying that a new iPhone would be just around the corner. Creative hobbyists and design firms pitched in with crazy designs (see video), many speculatively produced cases, etc.
In short, everyone who could offered predictions and set expectations for what Apple was going to do (except, well, Apple). Then Apple sent out invitations to an event on October 4th for what we all now know as the release of the iPhone 4S. Apple’s event dashed many expectations, mostly around giving birth to a bold, newly designed iPhone 5. However, even more importantly, the event itself was not widely broadcast and the tone was low key — I watched the event in its entirety via AppleTV. The following day, on October 5th, the iPhone 4S adorned Apple’s website in its glistening glory. Later that evening when news broke that Mr. Steve P. Jobs passed away, Apple adorned their site with a portrait of their founder, linking to a profoundly simple epitaph.
I can see clearly now that the subdued somber tone of the iPhone 4S announcement was creating space for something vastly more important; Apple was practicing culturally what their founder encoded in their corporate DNA, creating space to focus on celebrating Steve’s life.
All of this gets me to my parting question.
To celebrate the life and achievement of Steve Jobs, what space are you creating to focus on something more important?
Putting HDD Product Trends Into Perspective: A Subsystem View
by Claus Mikkelsen on Sep 30, 2011
On a few occasions I’ve blogged about the drive industry, drive performance, and the effects it has on the storage array business in general, but below is a guest blog from my friend and colleague Ian Vogelesang on disk drive trends. He originally had posted this on our internal HDS website, but it’s too good to keep wrapped up, so I’m sharing it here for general consumption. Make sure you have some time on your hands, as the post is quite lengthy (but worth it!)
Ian is one of the smartest guys in this area (well, smart in general), and this offers amazing insight into what many of us think of as “just a disk drive”. It’s long, it won’t be for everyone, but it’s definitely worth reading by anyone in the storage “biz”.
Ian’s quick bio: Assistant GM of HDD development at Hitachi in Odawara, Japan, then VP Operations of Hitachi Data Systems in Santa Clara, then VP Marketing and then VP Strategy & Product Planning at Hitachi GST (during assimilation of IBM Storage Technology Division), before returning to Hitachi Data Systems.
So let’s take it away…!!!
By: Ian Vogelesang
This detailed blog posting targets a highly technical audience, exploring how HDD product trends will impact subsystem performance and economics over the next year or two and beyond.
It’s hard to anticipate what the reader will know and what the reader will not know, so I’ll leave the Reader’s Digest version for others.
Trends by category
- 7200 RPM LFF
- Today’s capacity point is 2 TB. This is available in both SATA and SAS versions on the AMS2000 family, and in a SATA version on the USP V and VSP.
- This platform is actively under development: 3 TB SATA models are already available in retail stores from multiple vendors, and SAS 3 TB models are expected soon.
- 7200 RPM SFF
- Both the VSP and AMS2000 product lines support 7200 LFF drives which have twice the capacity, so for now we are offering only the LFF 7200 RPM models.
- 10K LFF
- Long dead.
- 10K SFF SAS
- Today’s capacity points are 300 GB and 600 GB.
- This form factor is currently the mainstream platform in enterprise HDDs, so we expect higher capacity 10K SFF models over time.
- 15K LFF
- The 15K 600 LFF drive was the end of the line, and no higher capacity 15K LFF drives are expected.
- 15K SFF SAS
- Today’s capacity point is 146 GB
- Seagate has a 300 GB drive but we are not carrying this product because it is twice as expensive as a 10K RPM 300 GB SFF drive, but has much less than twice the performance.
- Hitachi GST says they plan no further 15K SFF drives at this time.
3 TB 7200 RPM LFF
Let’s start with the 3 TB SATA drive which is expected any time now, as it has been available in retail for a while. It’s safe to assume 3 TB models must currently be in qualification for subsystem applications.
This 3 TB 7200 RPM platform will also be available in a SAS-interface version at a higher price. Even larger models are expected over time. Generally speaking the SAS version of new models based on the 7200 RPM LFF platform will be available a few months after the corresponding SATA version ships.
Seagate had implemented a SAS-interface version of their 2 TB 7200 RPM LFF platform, and this is the SAS 7.2K 2 TB drive that is currently available on the AMS2000 series.
Both Seagate and Hitachi GST have announced SAS versions of their 3 TB 7200 RPM LFF drives, and thus going forward we will have multiple suppliers for SAS 7200 RPM LFF models.
Tests of the SAS-interface version of the existing 2 TB 7200 RPM Seagate HDD on the AMS2000 series showed that, in most cases, it offers over 2x the throughput of the SATA version of the same drive in the same enclosure.
In other words, in a subsystem application spending extra money to use the electronics from a SAS drive instead of the SATA electronics on the same basic drive with the same platters, heads, spindle motor, and actuator gives you twice the performance.
Some people say that performance isn’t the point on SATA drives, which are all about capacity. I ask those people whether they think that poor people don’t care about money.
Prediction – SAS will largely displace SATA for 7200 RPM subsystem applications, even as SATA will remain the interface in PCs
The first part of this blog posting is going to explain why I expect the SAS 7.2K drive to largely displace its SATA-interface twin within 2 years.
The gigantic capacity of a 7200 RPM LFF drive is achieved as a tradeoff against other factors. In order to have the highest recording capacity, we need to use the largest diameter platters.
The platters in a 7200 RPM LFF drive are actually a bit bigger in diameter than 3.5 inches. When the dimensions of the 3.5-inch external form factor were originally fixed, drives actually had 3.5-inch diameter platters that fit between the sides of the base casting that naturally had to have clearance inside the casting walls for those 3.5-inch diameter platters. Nowadays if you take a drive apart you will see that the base casting in the area where the edge of the platter approaches the walls is slightly machined out, enabling the diameter of the platter to be slightly bigger than 3.5-inches. (I forget right now what the actual diameter is, but it’s a value in millimeters, not inches.)
OK, so given that you are going to use the largest diameter platters there are, it turns out (sic) that you can only rotate such big platters at 7200 RPM. Because of the wide diameter of the platters, if you try to run the platters faster you would astronomically increase the power consumed creating air turbulence, and you wouldn’t be able to cool the drive effectively.
So the first strike against SATA drives is that the HDD rotates relatively slowly.
The second strike against SATA drives is the slow seek speed.
Part of the slow seek speed comes from the simple fact that seek speed is inversely proportional mechanically to the length of the access arm, and a drive with 3.5-inch platters inside will have the longest arm and thus proportionately slower seek speed for how hard you push.
Then the other part of the slower seek speed comes from the budget, both financially and in space, for the actuator, and more specifically, the rare earth metal permanent magnet inside that the actuator pushes against. If you are making the cheapest drive, you can’t afford a bigger rare earth magnet to let the actuator push harder, and with those huge 3.5-inch platters inside, there’s no room for a bigger actuator motor anyway.
So strike one was the slow RPM, strike two was the slow seek speed, and now strike three is that the ATA specification that has progressed into SATA was purely conceived for the purpose of direct attach to a host (a PC). The original authors of SATA never considered subsystem applications as evidenced by the fact that they fixed the sector size in SATA at 512 bytes.
Using SATA drives in subsystem applications
If you want to provide the usual assurance that you can detect any subsystem virtualization mechanism failures and any data corruption “end to end” within the subsystem, you are going to have to further compromise the performance of a SATA drive.
The problem with a fixed 512-byte sector size at the HDD level happens with the mechanisms that you need to employ in the architecture of a subsystem.
The same kinds of challenges face subsystem designers as face the architects of disk drives themselves – how can you really be sure every time that you are presenting the right data at the right time?
In a disk drive, you are emulating a logical disk drive with LBA addresses that are statically mapped (except during defect assignment) from the LBA to a physical location on disk.
So you have a virtualization mechanism that translates a logical address into a physical address on hardware. How can you know if your virtualization mechanism is working as designed, that is, that every time you read from a logical address, you get the data that was most recently written to that logical address? In other words, how can you detect if you accidentally wrote it in the wrong place or accidentally tried to retrieve it from the wrong place?
How do disk drives do it?
In disk drives, to provide these assurances, we used to use an ID field: a physical field that was stored on disk in addition to the host sector. The ID field that is written with the data is the (logical) ID used by the host to specify where to put the data.
Suppose the drive originally wrote some data in the wrong place. If at any time later the host tried to read the LBA that really did belong in that physical location, the drive would retrieve the erroneously written data, but the contents of the ID field on disk wouldn’t match the requested ID field, and thus this ID field mechanism detects and fails I/O operations that would otherwise give the wrong data to the host.
Similarly, if you were to read from the wrong location, the ID field at that wrong location wouldn’t match the ID you were requesting and the I/O would fail.
So storing a logical address field with the data (or to make the field smaller, storing a cryptographic hash of the logical address with the data as is done in subsystems) can assure that virtualization mechanisms are working as designed.
Then IBM invented no-ID formatting in the late 1990s. The idea here has to do with how you compute the ECC bytes that are used to detect and even repair minor corruptions sprinkled here and there within a sector.
The idea is to compute these ECC bytes logically as if the sector were longer than 512 bytes by the size of the LBA address, and then to compute the ECC as if the LBA were logically prepended to the data. This meant that we no longer needed to have a separate ID field, but without increasing the size of the ECC data, we could still derive a fingerprint that detected if the data that had been stored under that LBA really originally came from that LBA. Thus we can ensure that every time we read data from disk, we know that after all the complex logical-to-physical things there are in a disk drive (zoned recording, serpentine LBA layout, skip-slip defect assignment, grown-defect assignment, etc.), we still will detect if there’s any corruption of the virtualization mechanism.
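To make the no-ID idea concrete, here is a minimal sketch in Python. It is only an illustration under loud assumptions: a CRC-32 from the standard zlib module stands in for the drive's real ECC, the 8-byte LBA width is arbitrary, and the function names are hypothetical. The point it shows is that the LBA is folded into the check value without ever being stored on disk.

```python
import zlib
from typing import Tuple

SECTOR_BYTES = 512

def check_value(lba: int, data: bytes) -> int:
    """Check value computed as if the LBA were prepended to the data.

    CRC-32 is a stand-in for the drive's ECC (an assumption for illustration).
    """
    return zlib.crc32(lba.to_bytes(8, "big") + data)

def write_sector(lba: int, data: bytes) -> Tuple[bytes, int]:
    """Return what lands on disk: the data and its check value, but no ID field."""
    if len(data) != SECTOR_BYTES:
        raise ValueError("host sector must be 512 bytes")
    return data, check_value(lba, data)

def read_sector(requested_lba: int, stored: Tuple[bytes, int]) -> bytes:
    """Recompute the check value with the LBA the host asked for; fail on mismatch."""
    data, stored_check = stored
    if check_value(requested_lba, data) != stored_check:
        raise IOError("check mismatch: corrupt data or wrong physical location")
    return data

on_disk = write_sector(100, bytes(SECTOR_BYTES))
read_sector(100, on_disk)          # succeeds
try:
    read_sector(101, on_disk)      # same bytes, wrong LBA
except IOError as err:
    print("detected:", err)
```

Reading the right data under the wrong LBA fails exactly as it would with a separate ID field, which is the assurance the text describes.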
Now let’s talk about subsystem applications of HDDs.
Subsystems also have virtualization mechanisms, and you need to provide the same two assurances that you provide as a disk drive designer, namely that the data is intact, and that it physically came from the same place originally used to put the data when it came from the host.
Therefore you need to have a checksum of the data, and you need to have a fingerprint of the LUN & LBA, and these need to be captured at the point of entry into the subsystem at the host port, because we need to offer “end to end” protection, and these check bytes need to accompany the host sector on its journey through the data paths in the subsystem, in and out of cache and onto disk and back from disk.
These requirements to store a few check bytes with each 512-byte host sector in subsystem applications were well understood at the time of the first SCSI drives, and even back before then.
The SCSI spec as originally conceived provided for the user to perform what is called a low-level format which (re)creates all the sectors on the drive. This low-level format can be performed with a range of sector sizes from 512 bytes (always used for direct host attach) in increments of 4 or 8 bytes up to as much as 528 bytes in some models. These larger sector sizes allowed subsystem architects to store a check byte field along with every 512-byte host sector. Hitachi subsystems use HDDs low-level formatted with 520-byte sectors in drives that use the SCSI command set, namely SAS and Fibre Channel HDDs.
In this way, the subsystem maker can provide a checking mechanism to assure the accuracy of the virtualization layer, as well as to ensure that the sector is not corrupted along its journeys through subsystem data paths and cache memory.
The problem with SATA is that there is no provision in the spec for sector sizes other than 512 bytes. And the reality is that for every possible bit pattern in a sector, the host has to be able to write that 512-byte sector and then read it back again. So we need to use all 512 bytes to store the information contained in the host sector, because if we were able to condense 512-bytes worth of data into less than 512 bytes of space to make some room to store some additional information, then there would have to be at least two host bit patterns that would result in the same smaller-than-512-byte encoding. So there’s no room to encode more information in 512 bytes of space on top of what the host is storing in that bit pattern.
So you can’t provide virtualization mechanism protection and subsystem data path corruption detection assurance and still map each host 512-byte sector on a SATA drive to one 512-byte sector on disk.
Thus you can either decide to fly blind, trusting that there are no design flaws and that the hardware works perfectly, storing each host sector as one 512-byte SATA sector on disk, or you can decide to provide the usual assurance mechanisms, albeit with a performance penalty.
Hitachi protects against subsystem virtualization errors even with SATA
What Hitachi does is to expand each sector as it is written by the host as usual into a 520-byte sector that contains the 8 check bytes guarding against virtualization and corruption errors, and then at the point of writing these 520-byte sectors on disk, we write 64 of them in a “clump” (my term) of 65 physical 512-byte sectors on disk. Doing it this way means that the customer is protected as usual against any subsystem architectural or algorithmic flaws.
For read operations, it doesn’t matter much, because you just read the whole clump (it’s only 32K out of a track size of about 1 MB anyway) even if all you want is a 4K bit within the clump.
But for writing there will be extra I/O operations required with SATA drives that are not necessary with SAS or FC drives.
This is because a 4K write from a host will be for a set of sectors that, once they are mapped to 520-byte logical sectors, will always require updating part but not all of at least one physical sector on disk. (Every 520-byte logical sector gets mapped to a range that is bigger than one physical sector but shorter than two physical sectors.) And any write smaller than 32K will always need a “pre-read” of the old contents of the clump so that you can update the bit newly written from the host before writing it back. If you think about it, this only applies to RAID-1, because in RAID-5 and RAID-6 random writes, you read the old data before you write the new data, and while you are at it reading the old data you just read the whole clump.
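The clump arithmetic is easy to check with a few lines of Python. This is just a sketch of the byte layout described above (64 logical 520-byte sectors packed back to back into 65 physical 512-byte sectors); the function names are hypothetical and it says nothing about how the subsystem microcode actually does it.

```python
LOGICAL_SECTOR = 520       # 512 host bytes + 8 check bytes
PHYSICAL_SECTOR = 512      # fixed SATA sector size
LOGICAL_PER_CLUMP = 64
PHYSICAL_PER_CLUMP = 65

# Both views of a clump cover the same 33,280 bytes.
assert LOGICAL_PER_CLUMP * LOGICAL_SECTOR == PHYSICAL_PER_CLUMP * PHYSICAL_SECTOR

def physical_span(logical_index: int) -> tuple:
    """First and last physical sector (within the clump) touched by one logical sector."""
    start = logical_index * LOGICAL_SECTOR
    end = start + LOGICAL_SECTOR            # exclusive
    return start // PHYSICAL_SECTOR, (end - 1) // PHYSICAL_SECTOR

def needs_pre_read(first_logical: int, count: int) -> bool:
    """True if the write only partly covers some physical sector at either edge."""
    start = first_logical * LOGICAL_SECTOR
    end = (first_logical + count) * LOGICAL_SECTOR
    return start % PHYSICAL_SECTOR != 0 or end % PHYSICAL_SECTOR != 0

print(physical_span(0), physical_span(63))   # every logical sector straddles two physical ones
print(needs_pre_read(0, 8))                  # a 4K host write (8 sectors)
print(needs_pre_read(0, 64))                 # a write of the whole clump
```

Only a write covering the whole clump lines up with physical sector boundaries at both ends, which is why anything smaller needs the old contents first.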
SATA-W/V vs. SATA-E on enterprise subsystems
But this isn’t the only performance issue with SATA. Our design engineering department is very concerned about the potential for lower reliability with SATA drives. Therefore on the Enterprise subsystem, we offer the user two choices. The first is “SATA-W/V”, or “write and verify”, where every write to disk, once destaged to the disk surface, is followed by a read from disk to assure the write didn’t encounter any “silent write failure” condition. There is a known failure mechanism, common to all disk drives regardless of host interface type, that can cause silent write failures for a short period of time before the host (or the drive) figures out that nothing changes as a result of writes any more. Doing a read verify operation after every write allows us to detect if this rare but nevertheless possible failure mode is occurring.
The second option offered to Enterprise customers is the “SATA-E” mechanism. With SATA-E, we also protect against silent write failures, but in such a way that we don’t need a read-verify operation after every write. Instead, what we do is randomize the mapping from the 64 host 520-byte sectors to the 65 physical sectors within a clump on every write. That way if you write something and there was a silent write failure, on the I/O that failed, the physical location of that LBA on disk in the clump would have changed, and if you try to read the data back after a silent write failure, you will read from the new location, not the old location, and therefore the data at the physical place you are looking would not match the LBA and the I/O would fail.
So SATA-E actually is faster for pure random writes than SATA-W/V on Hitachi enterprise subsystems, and it detects silent write failures.
But the problems with SATA-E are twofold. Firstly, there are so many clumps on a SATA drive that the mapping information that records, for each clump, the permutation of the logical 520-byte sectors for that clump, which is called the Volume Management Area or VMA, is too big to entirely fit into Shared Memory. This means that when you read from a SATA-E parity group, there is a chance that you will need to do a pre-read, an additional I/O operation, to fetch the section of VMA from disk before you can satisfy the host read request. Thus random reads are slower on SATA-E.
There is some chance you will have to write to the VMA with every host data destage operation as well.
The second problem with SATA-E is that the extra computation required to essentially add another virtualization layer substantially increases microprocessor busy. In the USP V, the guidance from engineering was that SATA-E would increase BED utilization by 70%. For this reason alone, we generally don’t recommend SATA-E because it disproportionately consumes MP resource, and this MP resource is what limits the overall IOPS throughput of a subsystem.
OK, so if you have borne with me so far, there is light approaching at the end of the tunnel. We don’t know whether it’s the end or whether Ian’s just going to move to the next phase of the explanation.
How big is too big?
We’ve seen that performance on a SATA drive is much worse than performance with SAS or FC drives, because SATA drives rotate slower and seek slower. Performance in subsystem applications is further impaired because the mechanisms that assure the integrity of subsystem virtualization layers and detect data corruption “end to end” require extra I/O operations to be issued to SATA drives that don’t have to be issued to SAS and FC drives.
The good thing about SATA drives is the huge capacity. The bad thing about SATA drives is 1/3 the IOPS capability to handle not only host I/O, but so called “SATA supplemental” I/O operations that aren’t needed with SAS or FC.
Is this a big problem? Let’s run a couple of numbers to see where we are in the ballpark.
A very large dot com customer with an instantly recognizable name has told us that the overall average access density to their data over their entire shop is about 600 host I/O operations for every TB of host data. This corresponds exactly with some data that IBM published a while back, so this is a reasonable average amount of I/O activity per TB of data.
Let’s compute very roughly what a 10K 300 GB SFF drive and a 2 TB HDD look like in terms of host IOPS per TB of data.
First let’s look at the drive itself. A 2 TB drive can do about 100 IOPS and the capacity of the drive is 2 TB, so the SATA drive can do about 50 IOPS per TB. If the AVERAGE activity of the customer’s data is 600 IOPS per TB, that would mean that if the drive were directly attached to the host (without RAID), on average you could only use less than 1/10th of the capacity of the HDD before you ran out of IOPS. Oh. And then the 3 TB drive is coming and then even bigger drives after that, all with the same IOPS capability!
What about our 10K 300 GB SFF drive? It can do about 300 IOPS, so the direct attach access density capability if you fill the drive is 300 IOPS per 0.3 TB, or about 1,000 IOPS per TB of data. So for direct attach, you should be able to fill 300 GB 10K drives with data of average activity.
You could even almost fill 600 GB drives which have the same IOPS but twice the capacity, so they can do 500 IOPS per TB of data.
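The direct-attach numbers above can be reproduced with a short Python sketch. The drive IOPS figures and the 600 IOPS per TB workload are the ballpark values from the text; the function names are mine, chosen only for illustration.

```python
AVERAGE_ACCESS_DENSITY = 600.0   # host IOPS per TB of data, from the text

def iops_per_tb(drive_iops: float, capacity_gb: float) -> float:
    """Access density the drive can sustain if it is completely full."""
    return drive_iops * 1000 / capacity_gb

def usable_fraction(drive_iops: float, capacity_gb: float,
                    workload_iops_per_tb: float = AVERAGE_ACCESS_DENSITY) -> float:
    """Fraction of the drive you can fill before it runs out of IOPS."""
    return min(1.0, iops_per_tb(drive_iops, capacity_gb) / workload_iops_per_tb)

DRIVES = {
    "7200 RPM 2 TB": (100, 2000),
    "10K 300 GB SFF": (300, 300),
    "10K 600 GB SFF": (300, 600),
}

for name, (iops, gb) in DRIVES.items():
    print(f"{name}: {iops_per_tb(iops, gb):.0f} IOPS/TB, "
          f"fill to {usable_fraction(iops, gb):.0%} with average data")
```

Swapping in a 3 TB capacity at the same 100 IOPS shows how the usable fraction shrinks as 7200 RPM drives grow.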
But that’s only for direct host attach.
The picture gets worse in a subsystem
In a subsystem you have some sort of “RAID penalty” for writes that depends on the RAID level. Just for the purposes of illustration let’s look at the number of HDD-level I/O operations needed to support one host read and one host write. For the host read, you need to do one HDD I/O. For the host write, you would need to do 4 HDD I/Os (read old data, read old parity, write new data, and write new parity) and more than that if you need SATA supplemental I/Os as well.
So for this workload, you would need 1+4 = 5 HDD I/Os, or 5/2 = 2.5 HDD I/Os for each host I/O. This ratio gets as bad as 1:4 for pure random writes in RAID-5, and worse still for SATA drives. So let’s use 2.5 HDD I/Os per host I/O for our ballpark estimation.
Our SATA drive that we thought could do 50 IOPS / TB now looks like it really can only do 20 host IOPS per HDD TB, before we even TALK about the SATA supplemental I/Os. So for ordinary data with an average access density of 600 IOPS per TB of data, we can only fill the drive to less than 20/600 = 3% of its capacity. (Yes, I’m sure the sharp-eyed reader will have noticed that this doesn’t account for the fraction of the potential drive’s capacity that is used for parity, but with the SATA supplemental I/O, thinking that you will hit the drive IOPS limit at about 2% or 3% full from a capacity point of view is about right.)
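Here is the same ballpark estimate as a small Python sketch. It assumes, as in the illustration above, one HDD I/O per host read and four per host write, and it ignores both SATA supplemental I/Os and the capacity used for parity; the function names are hypothetical.

```python
def backend_ios_per_host_io(read_fraction: float, write_penalty: int = 4) -> float:
    """Average HDD I/Os per host I/O: 1 per read, write_penalty per write."""
    return read_fraction * 1 + (1 - read_fraction) * write_penalty

def host_iops_per_tb(drive_iops: float, capacity_tb: float,
                     read_fraction: float, write_penalty: int = 4) -> float:
    """Host-visible access density of a full drive behind the RAID penalty."""
    raw_density = drive_iops / capacity_tb
    return raw_density / backend_ios_per_host_io(read_fraction, write_penalty)

ratio = backend_ios_per_host_io(read_fraction=0.5)        # one read, one write
density = host_iops_per_tb(100, 2.0, read_fraction=0.5)   # 2 TB drive at 100 IOPS
print(f"{ratio} HDD I/Os per host I/O")
print(f"{density:.0f} host IOPS per TB, fill to {density / 600:.1%} with average data")
```

Setting `read_fraction=0.0` gives the pure random write case, where every host I/O costs four HDD I/Os.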
If you buy SATA drives and then try to use them for average activity data, they will be hideously expensive if you buy enough drives to handle the host IOPS, since you can only fill them a couple of percent full with average activity data, and therefore you will need a LOT of HDDs. If you do fill the drives more than a couple of percent full of average data, the drives won’t be able to keep up with the IOPS, and you will have horrible response times, and Write Pending will instantly fill to the limit with data waiting to be destaged to disk. (That’s why many people put SATA parity groups in their own CLPR, addressing the symptom rather than the root cause.) Performance problems with SATA are sad in my opinion, since for normal applications, you either spend much more money buying a lot of SATA drives than the money you would have spent on SAS enterprise drives, or else you buy an insufficient number of SATA drives, trying to use the capacity of the drives, and then cannot cope with host workloads and have serious problems.
So ANYTHING you can do to improve the IOPS capability of SATA drives will proportionately increase the amount of data that can fit on the drive before the drive reaches its max physical throughput.
Putting a SAS interface on a 7200 RPM LFF platform
What does the SAS 7.2K 2 TB drive offer us compared to the SATA 7.2K 2 TB drive? There are two main differences from a performance point of view.
The first is that SAS offers native 520-byte sector formatting, and thus there are no SATA supplemental I/Os on a SAS 7.2K drive. (The same sharp-eyed reader will have raised an eyebrow as to why the SAS version of the same drive can be trusted not to have silent write failures, but it’s always a judgment call in the end when it comes to what is “sufficient” protection, and Hitachi engineering is very conservative on protecting customer data.)
The second performance advantage of putting the SAS electronics on the 2 TB 7200 RPM LFF drive is that the Tagged Command Queuing or TCQ acceleration capability is much higher on the SAS electronics card.
SATA drives are optimized for the absolute lowest cost, and there just aren’t the microprocessor cycles nor the number of logic gates in an ASIC that you get with the more expensive electronics in the SAS electronics card.
The TCQ feature is what allows the host to independently issue a bunch of different write operations to the drive (called queuing I/Os in the drive) where each I/O operation is identified by a “tag” number, a small integer between say 0 and 31 typically for a SAS drive. The drive has the luxury of browsing the queued I/O requests from the host, and deciding what order to perform the I/O operations in regardless of the time sequence received from the host. Of course no I/O operation can be indefinitely delayed, but within such a constraint the drive has the freedom to re-order the I/O operations so as to be able to visit the associated physical locations in a sequence that minimizes seek time and rotation time.
TCQ can accelerate SATA I/O by about 30% (of course when the workload is multi-threaded) compared to the drive’s single threaded throughput. SAS drives can accelerate the IOPS to about 60% higher throughput compared to single threaded throughput.
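To give a feel for why re-ordering helps, here is a toy Python sketch. It is emphatically not how drive firmware works: it uses a simple greedy "nearest cylinder first" rule as a stand-in, looks only at seek distance (no rotational position, no limit on how long a command may wait), and the tags and cylinder numbers are made up.

```python
def total_travel(start: int, cylinders: list) -> int:
    """Total head travel, in cylinders, when visiting targets in the given order."""
    position, travel = start, 0
    for cylinder in cylinders:
        travel += abs(cylinder - position)
        position = cylinder
    return travel

def nearest_first(start: int, queue: list) -> list:
    """Greedy re-ordering of (tag, cylinder) commands: always serve the closest next."""
    pending = list(queue)
    position = start
    order = []
    while pending:
        nxt = min(pending, key=lambda cmd: abs(cmd[1] - position))
        pending.remove(nxt)
        order.append(nxt)
        position = nxt[1]
    return order

# Hypothetical queue of tagged commands in arrival order: (tag, target cylinder).
queue = [(0, 900), (1, 100), (2, 850), (3, 150)]

fifo = total_travel(0, [cyl for _, cyl in queue])
reordered = nearest_first(0, queue)
tcq = total_travel(0, [cyl for _, cyl in reordered])
print("arrival order travel:", fifo)
print("re-ordered tags:", [tag for tag, _ in reordered], "travel:", tcq)
```

The deeper the queue, the more choices the drive has at each step, which is why the acceleration only shows up when the workload is multi-threaded.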
So where we computed the throughput capability of SATA and SAS drives above, I’ll leave it as an exercise to the reader to calculate the access density capability of SAS and SATA drives in subsystem applications.
The bottom line is that SATA drives are desperately slow. They are so slow that you can only fill them to a couple of percent of their capacity before they run out of IOPS. The SAS version of the same drive, without the burden of having to perform SATA Supplemental I/O operations, and being able to perform the IOPS faster (1.6 / 1.3 or 23% faster) results in a combined doubling of the access density capability of the drive.
This is a BIG DEAL, and it’s why I expect all customers that learn this stuff to switch pretty much from SATA to SAS 7.2K except for those rare cases where the data truly is below cryogenically cold in terms of its activity level, or where the customer is fixated on SATA being cheaper per inaccessible GB.
For everybody that gets this, even a modest bump in drive cost with the SAS card will pay enormous returns in the form of more than doubled throughput.
This trend will only be driven harder by a shift to 3 TB and then even larger drives over time.
Other trends in disk drives:
The 10K SFF platform is the mainstream product for Enterprise HDDs. This means that in addition to the current 300 GB and 600 GB models, we should expect even higher capacity 10K SFF SAS models over time.
I’ll do the homework for you. If we have a drive with 600 GB that can do 300 physical IOPS further accelerated by 60% using Tagged Command Queuing, and where the RAID mechanism causes backend IOPS to be 2.5x the host IOPS, that drive can accommodate ((300 IOPS * 1.6) / ( (7/8)*600 GB) ) / (2.5 backend IOPS per host IOPS ) or 365 host IOPS per TB. Given that a global average host access density is 600 IOPS per TB, this means that the 600 GB 10K SFF drive is a “fat” drive that can only be completely filled with data that has about half the activity of average data. In other words, the majority of your data is going to be too active to store on a 600 GB 10K SFF drive, if you plan on using all 600 GB and filling the drive with data. Today the “sweet spot” for average data is somewhere between a 10K 300 GB and a 10K 600 GB drive.
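For anyone who wants to redo this arithmetic with their own numbers, here is a small Python sketch of the same calculation. The function and parameter names are my own, and reading the 7/8 factor as the usable fraction of the raw capacity is an assumption; the formula itself is just the one written out above.

```python
def host_access_density(physical_iops, tcq_gain, capacity_gb,
                        usable_fraction, backend_per_host_io):
    """Host IOPS per usable TB that a drive can sustain.

    usable_fraction is assumed to be the share of raw capacity left for data
    (7/8 in the example above); backend_per_host_io is the RAID multiplier.
    """
    backend_iops = physical_iops * tcq_gain
    usable_tb = capacity_gb * usable_fraction / 1000.0
    return backend_iops / usable_tb / backend_per_host_io


density = host_access_density(
    physical_iops=300,
    tcq_gain=1.6,
    capacity_gb=600,
    usable_fraction=7 / 8,
    backend_per_host_io=2.5,
)

AVERAGE_HOST_DENSITY = 600  # host IOPS per TB, the global average quoted above

print(f"{density:.1f} host IOPS per TB")
print(f"{density / AVERAGE_HOST_DENSITY:.2f} of average access density")
```

Changing `capacity_gb` to 300 while leaving the IOPS alone doubles the result, which is why the sweet spot for average data sits between the 300 GB and 600 GB models.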
Future 10K SFF drives that have a capacity of even higher than 600 GB will still be capable of essentially the same IOPS as current drives, and thus these future 10K SFF drives with higher capacity will be even “fatter”, meaning that you can only use the entire capacity of the drive for low activity data.
This is a great lead-in to talk about what happened to 15K RPM.
The end of 15K, even in SFF?
Why are the HDD vendors saying that with SFF they can’t make any money selling 15K RPM drives and they are going to drop them? Hitachi certainly has been saying that, and although Seagate did launch a 15K RPM 300 GB SFF drive, it may be that this might not be viable for Seagate going forward at least short term. This is one of those things that could go either way, and one of those things where if you look at it in the short term (next year or two) there is one trend, and when you look at it longer term (e.g. 3-5 years) there could be quite a different trend.
To understand what happened, we should observe that disk drive platter diameters (and their associated external form factors) have been getting smaller and smaller when you look at them over a 50-year time span. The original IBM RAMAC drive had 24-inch diameter platters. The very next generation of drives used 14-inch platters, and at least for enterprise drives, that’s where it stayed for a long time. But over time we saw 14-inch platters give way to 8-inch platters, to 5¼-inch platters, then 3.5-inch platters, and now we are transitioning to the 2.5-inch SFF. (OK, Hitachi buffs will note that Hitachi enterprise drives introduced 9.5-inch platters with the Hitachi “Q2” drive that was compatible with the IBM 3380-K 14-inch platters, and then went to 6.5-inch platters that had this cool “reactionless” linear actuator design before finally switching over to standard OEM type 3.5-inch drives.)
The factor that drives an industry shift to a smaller form factor has to do with drive access density capability.
We talked about this earlier. Basically once you have fixed the platter diameter and drive RPM, and these are the factors that characterize a “form factor”, then all drives of that form factor basically are capable of the same random IOPS regardless of drive capacity, because the random IOPS is determined by the mechanical rotation time and the mechanical seek time. And you can’t increase the RPM or improve seek time in any big way without moving to smaller platters.
Some of you may remember 9 GB drives going to 18 GB going to 36 GB going to 73 GB going to 146 GB etc. Imagine, if you will, a 10K RPM drive with only 9 GB of storage capacity. (Actually, 10K RPM probably didn’t come along until later, but bear with me for the sake of argument; I just don’t remember off the top of my head what the capacity of the first 10K 3.5-inch model was.) You can immediately grasp that the ratio of IOPS to GB would be very high, so high in fact that you would almost always run out of GB before you would run out of IOPS.
So if you run out of GB before you run out of IOPS, then higher RPM drives look horribly expensive.
The reason for this comes from the relationship between drive RPM, platter diameter, and drive capacity.
If you double the RPM of a drive, you will need a little more than 4x the power to turn the platters, because the power required to create air turbulence goes up as a little more than the square of the speed. For example, to make a car go twice as fast, you need more than 4x the horsepower.
So we can’t double the RPM of a disk drive and keep the same platter diameter, because the drive would burn too much power and it would be difficult to cool.
But if we keep the RPM of the drive the same, and double the diameter of the platters, you increase the power required to spin the platters by over 16x. How can this be? Well, each unit of surface area at the edge of the platters is going twice as fast if the platter diameter is doubled. Therefore each unit of surface area needs a bit more than 4x the power. At the same time, if you double the diameter of the platter, it has 4x the surface area, and we just noted that each unit of surface area needs 4x the power, and thus the total power required goes up by over 16x.
In other words, we can increase the RPM and keep the power consumption the same if we decrease the size of the platters only moderately. All within a 3.5-inch LFF form factor, 7200 RPM drives have (roughly) 3.5-inch platters, 10,000 RPM LFF drives have 3.0-inch platters, and 15K LFF drives have 2.5-inch platters.
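The scaling argument above can be checked with a few lines of Python. This is only a rough model built from the reasoning in the text: power goes as the square of RPM and as the fourth power of platter diameter (speed squared per unit area, times area). The text says "a little more than" the square, so treating the exponents as exactly 2 and 4 is a simplifying assumption, and `relative_spin_power` is a made-up helper that gives a unitless figure, not watts.

```python
def relative_spin_power(rpm, platter_diameter_in):
    """Unitless windage power estimate: RPM^2 * diameter^4 (simplified model)."""
    return (rpm ** 2) * (platter_diameter_in ** 4)


# The two thought experiments from the text.
double_rpm = relative_spin_power(20000, 3.0) / relative_spin_power(10000, 3.0)
double_diameter = relative_spin_power(10000, 6.0) / relative_spin_power(10000, 3.0)

# The three LFF configurations quoted above, relative to 7200 RPM / 3.5 inch.
baseline = relative_spin_power(7200, 3.5)
ratio_10k = relative_spin_power(10000, 3.0) / baseline
ratio_15k = relative_spin_power(15000, 2.5) / baseline

print(f"double the RPM:      {double_rpm:.0f}x power")
print(f"double the diameter: {double_diameter:.0f}x power")
print(f"10K / 3.0 inch vs 7200 / 3.5 inch: {ratio_10k:.2f}x")
print(f"15K / 2.5 inch vs 7200 / 3.5 inch: {ratio_15k:.2f}x")
```

Under this model the three RPM and platter size combinations land within roughly 15% of each other, which fits the point that a moderate reduction in platter diameter pays for a large increase in RPM.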
The problem early on in the life of a form factor is that you run out of GB before you run out of IOPS.
It turns out that a 15K RPM drive has platters with a bit more than ½ the area of the platters in a 10K RPM drive. Thus most of the capacity difference comes from the area of these platters that have a smaller outer diameter. There is also a smaller factor associated with the higher linear velocity of the heads, which in an LFF drive at the outer edge is about 100 mph or 160 km/h for 15K LFF vs. 60 mph or 100 km/h for 10K LFF. With the higher flying speed in 15K comes more flow-induced vibration of the head due to turbulence, and more difficulty flying close to the surface without hitting it. This means that 15K RPM drives run a bit lower recording density than 10K RPM drives.
So the problem is that if you are going to the biggest capacity point that you can achieve in the form factor, using all the platters that fit within the form factor, in other words, if the GB are the limiting factor, not the IOPS, then 15K RPM drives are twice as expensive per GB as 10K RPM drives because the 15K RPM drive with the same number of platters and heads has only ½ of the storage capacity, and it actually costs a bit more to make with the bigger actuator and its magnet. (The cost of a platter is about the same regardless of diameter – cost is by number of platters.)
OK so at the beginning of the life of a form factor, where GB are more important and you are trying to make the biggest drive that will fit within the form factor, then 15K RPM drives cost twice as much per GB as 10K RPM drives.
But later on as we keep doubling the capacity of the drives over and over as new drive generations come out, each with twice the recording capacity (for enterprise drives) of the previous generation, there comes a point where there’s no point to increasing the capacity any more, because you run out of IOPS before you run out of GB.
This is the point where the mainstream of the market switches to the higher RPM: when the recording density gets so high that you might as well use advances in recording density to let you make the platters smaller, since you can’t use more capacity at the original platter diameter anyway.
Making the platters smaller in diameter means you can rotate the drive faster, and voila, this is the point where the 15K RPM drive displaces the 10K RPM drive as the mainstream product. This happened above the 300 GB point on LFF. In fact, Hitachi decided not to build a 10K 600 GB drive in LFF when it became possible to do so. Instead we made new generation drives that also had 300 GB but had fewer heads and media. Well, actually, we did do a 400 GB 10K LFF drive as a kind of last-kick at 10K LFF, but if you look at HDD vendors’ web sites today you will see that there are no more 10K LFF drives for sale at all.
From an economic point of view, we have found that customers are willing to pay up to 25% ~ 30% more for a 15K RPM drive than a 10K RPM drive, but they generally will buy very few 15K RPM drives if they are twice the price of 10K RPM drives.
And that’s where we are right now with SFF. We are making the absolutely biggest drive that we can, with the most platters and heads that fit within the form factor. And even in SFF, 15K platters are smaller than 10K SFF platters. The capacity of a 15K RPM SFF drive is one half of the capacity of a 10K SFF drive of the same generation.
And that’s the economic factor that’s squeezing out the 15K RPM drive right now. You don’t get a 100% improvement in IOPS for a 100% increase in cost going from a 10K SFF drive to a 15K SFF drive, and this makes 15K RPM drives look very expensive in SFF.
Hitachi did build a 15K 146 GB SFF drive in the same generation as the 300 GB 10K SFF drive, but decided against making a 300 GB 15K SFF drive in the generation of the 10K 600 GB SFF drive. Seagate did come up with a 300 GB 15K SFF drive, but again, it has ½ the capacity of a 10K RPM SFF drive at a bit higher cost, and thus doesn’t look very attractive financially.
That’s where we are at right now. 15K SFF is just not very attractive financially. 15K $/IOPS is actually worse than 10K $/IOPS in SFF because 15K doesn’t yield 2x the IOPS of 10K, but it’s 2x the price. So 15K SFF never really makes sense in terms of cost effectiveness to achieve the necessary IOPS.
The 15K sale only makes sense where the customer’s business can increase revenue or profit with faster HDD response time. This means that 15K is very much a niche product in SFF at this time. (With SSDs, the situation is the same, that you can only justify them where there is business value to having faster response time, because SSDs not only have worse cost per GB, but they also have worse cost per IOPS.)
Where we are at the moment is that we are still early in the life of the SFF form factor. Seagate has decided to go for it and make a 15K 300 GB SFF drive, but Hitachi GST couldn’t see how it could sell in enough volume to make any money building one.
But if you think about it, in the disk drive business sometimes it’s like “déjà vu” all over again.
The point at which we basically started to transition from 10K to 15K as a mainstream product in 3.5-inch LFF was “above 300 GB”.
Yes, SFF drives have smaller platters and thus shorter arms that seek faster and thus we can make 10K SFF drives bigger in GB than 10K LFF drives because the SFF drives are capable of higher IOPS.
Ian thinks 15K SFF will come back in the next few years
My own personal view is that since 10K SFF is mainstream and we expect 10K SFF drives with more than 600 GB in future, then we can’t be all that far off from the point where 15K starts to look more attractive again. This will happen when we start using increases in areal density to decrease the numbers of platters and heads instead of increasing the capacity.
Wait, there’s another light appearing at the end of the tunnel …
This brings us to the topic of increasing areal density. A transition to 15K as the main product in SFF would be driven by increases in areal density.
Can the researchers keep performing magic?
The big news here is that although over 50 years of HDD evolution we have come to expect that our brilliant scientists will keep solving problems and inventing new technologies to keep doubling the capacity of the drives over and over and over at an average rate over those 50 years of something like 40% compound annual growth rate, we may actually be reaching the “areal density end point” where we reach fundamental physical limits.
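To get a feel for what that growth rate means, here is a quick compound-growth check in Python. The 40% rate and the 50-year span are the approximate figures quoted above ("something like 40%"), so the output is an order-of-magnitude illustration rather than a measured result.

```python
import math

CAGR = 0.40   # approximate compound annual growth rate quoted in the text
YEARS = 50    # approximate span of HDD evolution quoted in the text

total_growth = (1 + CAGR) ** YEARS
doubling_time_years = math.log(2) / math.log(1 + CAGR)

print(f"Cumulative growth over {YEARS} years: about {total_growth:.1e}x")
print(f"Capacity doubles roughly every {doubling_time_years:.2f} years")
```

At that rate capacity doubles about every two years, and the cumulative factor over five decades runs into the tens of millions.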
Just for your amusement, I remember when Dr. Jun Naruse, who later became HDS’ CEO and was then head of HDD R&D at Hitachi, told me that the theoretical maximum recording density that ever could be achieved was about 65 megabits per square inch. At the time, we were shipping about 5 megabits per square inch using particulate oxide media (basically rust particles in epoxy resin) and inductive read/write heads. Today’s products are shipping at about 500 gigabits per square inch, or about 10,000 times higher capacity than we previously believed possible.
But we appear now to really be getting close to the ultimate physical limit.
There are some technologies that are being worked on, most notably Bit Patterned Media and Thermally Assisted Recording (called Heat Assisted Magnetic Recording by Seagate), but they haven’t quite hit the market as fast as originally targeted. In fact, at least at last year’s Diskcon HDD industry convention, no vendor would publicly speculate on what year either BPM or TAR/HAMR will appear in production products.
So what can we do to keep increasing disk drive recording capacities? Well, one thing that is now very publicly being talked about by multiple HDD vendors is the possible introduction of Shingled Magnetic Recording or SMR HDDs. The basic idea here is that without any advances in read/write technology, but just reconfiguring the write head so that you give up on random writes and write relatively wide tracks overlapping like shingles on a roof, you can still easily read back each track from the bit that’s exposed. Using this technique with the same head technology you can generate much stronger magnetic fields for writing and thus you can use higher coercivity media that are harder to write on, but let you make the bits smaller. Higher recording density means higher capacity drives with the same read/write head technology.
What about 7200 RPM SFF?
Why don’t we offer a 7200 RPM SFF drive? These drives are available from some HDD vendors. The issue here is that because 7200 RPM SFF drives use 2.5-inch platters, and 7200 RPM LFF drives use 3.5-inch platters, these SFF and LFF drives would cost about the same to make, but the LFF drive would have twice the capacity. Since both the VSP and the AMS2000 family support both SFF and LFF drives, we are offering the LFF 7200 RPM drive because it has twice the capacity at about the same price.
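The "twice the capacity" figure follows from simple geometry, as this short Python check shows. It assumes the same number of platters and the same recording density in both drives, and it treats the whole disc as recordable (ignoring the hub area), so it is a rough estimate only; `platter_area` is a made-up helper.

```python
import math


def platter_area(diameter_in):
    """Area of one platter surface in square inches (hub area ignored)."""
    return math.pi * (diameter_in / 2) ** 2


lff_to_sff_area_ratio = platter_area(3.5) / platter_area(2.5)

print(f"A 3.5-inch platter has {lff_to_sff_area_ratio:.2f}x the area of a 2.5-inch platter")
```

With the platter cost roughly independent of diameter, nearly double the area at about the same build cost is what makes the LFF drive the cheaper option per GB.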
The last prediction, anticipating humans will become rational
If people were ever to really think about the economics, realizing that if you try to put normal computer data on a SATA drive you could only fill it to a couple of percent of its storage capacity, then who cares if you can get 2 TB if you couldn’t possibly even use 1 TB? To me the 7200 RPM SFF drive looks like a solid price performer that hasn’t been given sufficient consideration. So I think that if people ever figure this out we’ll see the majority of the requirement for “fat” drives to be on 7200 RPM SFF drives, and only the truly cryogenically cold data going on 7200 RPM LFF drives. For a set-top box that records and plays video, 100 IOPS is plenty for handling a few HD video streams, so keep on using 7200 RPM LFF and bring on all the TB you can! And for archival applications with almost no read activity, 7200 RPM LFF will always offer the best price per GB.
But for anything that resembles normal computer data, 7200 RPM LFF is “too big to make sense” and 7200 RPM SFF would be plenty big enough to get into trouble running out of IOPS before running out of GB. Of course, we would want the SAS interface 7200 RPM drive, not the SATA version.
And then there’s the fact that SFF drives use a fraction of the power that LFF drives use.
I hope this gives some perspective on how the HDD roadmap will impact subsystem performance and economics over the next year or two and beyond.
Big Data – Coming Down The Pipe!
by Cameron Bahar on Sep 29, 2011
These are exciting times.
I joined HDS through the acquisition of ParaScale, a startup I founded to focus on solving what the industry is now calling the Big Data problem. When I visit customers, I notice a growing percentage are faced with a very challenging data problem. The data in its original form is unstructured as expected, but it’s not your typical “human generated data”. Human generated data is what I refer to as file based data such as documents, spreadsheets, presentations, medical records, or block based data such as transactions, customer records, sales and financials records that are usually stored in relational databases. These traditional data sets are adequately served with high performance SAN/NAS systems, which have excellent random I/O characteristics and can handle massive amounts of structured and unstructured content.
I noticed that this new workload and the corresponding data being produced was fundamentally different from traditional enterprise workloads. This data was being generated by hundreds to thousands of machines, was rarely updated, often appended, and was fundamentally streaming in nature. I called this category “Machine Generated Data” and people have started to embrace this term over the last few years. We started looking at this problem as early as 2004.
Imagine a web company that stores and processes log files from one hundred thousand web servers to generate optimized advertisements to monetize its freemium business model. Prime examples of such business models are Google, Facebook or Yahoo. Imagine a bio-informatics company sequencing millions of genomes in hopes of finding patterns in disease, or a security company scanning lots of high-def video and analyzing it looking for a specific person or object, or millions of sensors generating data that needs to be processed and analyzed.
Besides the workload being different for Machine Generated Data (MGD), what else is different? Notice that with MGD, companies store this data in order to be able to analyze it shortly after storing it. The faster they can analyze the data, the better the payoff usually. Think of ad placement by analyzing log files or video analytics to find the right guy in the video just in time, or analyzing genome data for a cancer patient who doesn’t have a lot of time, or finding inefficiencies or trends in financial markets or looking for oil in all the wrong places.
This problem as it turns out is really about large scale data storage and mining of unstructured and semi-structured data to gather information and insight from a company’s vast datasets. It’s nothing short of information processing, repurposing, and data transformations to uncover hidden patterns in data that will ultimately lead to better decisions.
So what are the high level requirements to build a system to solve this problem?
1. Scale – the system has to scale. But scale to what, and in which dimensions? I suppose it has to scale in capacity to be able to store petabytes or exabytes of data. But it also has to ingest this data pretty fast, since there are lots of concurrent streams of machine generated data that are coming through the pipe. So the pipe can’t be the bottleneck either.
What role does system management play in this equation? If you have lots and lots of data, you can’t afford to hire lots of storage admins to sit around and manage this data; it costs too much! So what is one to do? The system should be self healing and self managing, right? It should handle most of the mundane data management tasks automatically instead of relying on people. Ideally all people do is add new hardware or replace failed disks or power supplies.
2. Processing – once you store petabytes of data into a storage system, how do you analyze it? You can’t very well load a few petabytes over the network into a compute farm, can you? How long would that take? Isn’t the network the bottleneck then? What if we were able to move the processing or the program to the data and run it there instead. The program is pretty small compared to the data set size and so you at least take the network out of the equation to a large degree. Therefore, the platform should allow in-place data processing or analytics.
3. Cost – when you store lots and lots of data, your TCO should be reasonable, so using commodity components as much as possible, virtualization, and automation will go a long way.
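To put a number on the "how long would that take?" question in the Processing requirement, here is a back-of-the-envelope Python sketch. The 10 Gb/s link speed is an assumption chosen for illustration, a petabyte is taken as 10^15 bytes, and the link is assumed to run at full line rate with no protocol overhead, so real transfers would be slower.

```python
SECONDS_PER_DAY = 86_400
PETABYTE = 10 ** 15  # bytes, decimal definition


def transfer_days(data_bytes, link_bits_per_second, efficiency=1.0):
    """Days needed to move data_bytes over one link.

    efficiency is the assumed fraction of line rate actually achieved (1.0 = ideal).
    """
    seconds = data_bytes * 8 / (link_bits_per_second * efficiency)
    return seconds / SECONDS_PER_DAY


days_per_pb_at_10g = transfer_days(PETABYTE, 10e9)

print(f"1 PB over an ideal 10 Gb/s link: {days_per_pb_at_10g:.1f} days")
print(f"5 PB over the same link:         {5 * days_per_pb_at_10g:.1f} days")
```

A program that analyzes the data is typically megabytes in size, so shipping it to the storage nodes takes a negligible fraction of that time. That is the case for in-place processing.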
So when you look around and you read and hear about people talking about or working on gfs, bigtable, hadoop, hdfs, cassandra, riak, couchbase, mongodb, hbase, zookeeper, flume, mahout, pig, hive, etc, etc, … you know what I’m thinking about.
The information revolution is upon us and its pace is accelerating.
HDS @ #OOW2011
by Gary Pilafas on Sep 29, 2011
The first week of October marks the largest Oracle global event in San Francisco: Oracle Open World 2011.
HDS will once again be present at the show (booth #2101) where we will be previewing several new technologies and integration around:
- Oracle Enterprise Manager (OEM)
- Oracle Recovery Manager (RMAN); and
- Oracle Virtual Machine (OVM)
HDS will demonstrate the best in class functionality and integration around tools that Oracle database administrators use on a daily basis, allowing an enterprise DBA to manage enterprise database environments in one single pane of glass.
In addition to a set of rich tool sets, HDS will also preview the Hitachi Converged Platform for Oracle Database Reference Architecture, which allows for three key solutions:
- Hitachi Compute Blade 2000 Model Servers with Brocade Fibre Channel Switches and Hitachi VSP and/or AMS Storage, which leverages our HDS differentiators.
- HDS high performance computing platform, which includes the following differentiators in Hitachi servers over all competitors:
- I/O Acceleration – Hitachi Servers use two dedicated PCIe connections per blade, which allows for an industry-best performance with our Partner Fusion-IO’s PCIe SSD Cards
- Symmetrical Multi Processing (SMP) – Hitachi Servers use 2-blade and 4-blade SMP connections, allowing for the synergy of multiple blades to act as a single computer with (64) cores and 1TB of memory
- Hitachi Virtualization Manager (HvM), which allows our customers the opportunity to utilize logical partitions in both Oracle Single Instance and Real Application Cluster (RAC) environments
- I/O Expansion Units – Hitachi 4-blade SMP, allowing for the connectivity of (4) I/O Expansion Units (IOUE) that can each hold (16) PCIe SSD Cards from Fusion-IO for a total of (64) Fusion-IO cards, which would allow for a server based SSD pool of 76TB
- The third solution is our Real Application Cluster (RAC) Platform, which includes the HDS High Performance Computing Platform, consisting of Hitachi Compute Blade 2000 Model Servers with Brocade Fibre Channel switches and Hitachi VSP and/or AMS Storage.
For our existing storage customers, this development and integration is good news given that HDS integrations have been enhanced by strong partners like Fusion-IO and Oracle.
Make sure you stop by HDS booth (#2101) at the show for a preview of all of these great tools!
Is That Information…and do I care? (Part 2)
by Amy Hodler on Sep 27, 2011
Last week I posted a review of the definitions and differences of data, information, knowledge and wisdom (DIKW) on this blog. I wanted to continue that thread to look at why it’s important to understand these sometimes-subtle distinctions.
Although most of us in IT stop at considering data and information as more tangible and actionable elements, it’s really in the later areas of knowledge and wisdom where we find things to be most useful—though also more ephemeral. Understanding how we advance from one to the other will help us more readily progress into developing greater insight. Misunderstanding means more mistakes and false starts.
Anjana Bala (Stanford University and former HDS intern) and I found looking at the DIKW concepts in the framework of a process to be very meaningful. If you consider the move from data to wisdom as a progression of understanding and connectedness, a clearer picture starts to emerge. (Note that I’ve pulled explanations heavily from Jonathan Hey’s work on “The DIKW Chain” as well as the DIKW Wikipedia entry.)
1. We research and collect lots of factual elements to provide us data.
2. We add context to data to understand the relation of elements and gain information.
3. As we connect information in sets, we gain understanding of patterns and acquire knowledge.
4. When we start joining whole sets of knowledge we can understand how they relate to bigger principles to ultimately achieve wisdom.
Another interesting outcome of looking at this as a process is that it clearly shows that data, information and knowledge are attributable to better understanding the past. We all understand that’s a critical basis, but our goals usually depend on insight into what we should do in the future, which requires wisdom.
At HDS, we’re passionate about data and information as our way to help you do great things. Assisting in getting more value from data and information is core to what we do, and it’s why you’ll see us continue to invest in solutions related to data, content management and cloud. And it’s also why we’re working on additional innovations to help organizations move forward from data to information and knowledge.
Greasing this ‘DIKW’ process will significantly accelerate the rate of innovations, and I believe we’re very close to a knowledge tipping point in this area. We should all do what we can to lubricate this process by considering ways to better connect the elements and more readily move understanding forward.
What might you help to improve? Can you add to this high level list?
- Resource and data/information connections (consider cloud implications)
- Data standards and data virtualization
- Correlation tools and analytics
- Human-understandable results (e.g. data visualization)
- Processes that help internalize knowledge
- Next generation of hardware and software to support the above
Kicking Off a Blog Series on Object Stores
by Robert Primmer on Sep 27, 2011
I’ve worked on three commercial object stores: Centera, Atmos, and now HCP (Hitachi Content Platform). In that time, I’ve seen numerous misunderstandings about this particular brand of storage technology–not just in how they function, but even more fundamentally, on where, when and why such a system would be employed. In a new blog series starting today, I hope to answer these questions.
To that end, my co-authors and I will provide a series of tutorials on distributed object stores (DOS), essentially providing a short course on the subject with the occasional odd topic interspersed periodically.
Where it makes sense, articles will be split along business and technical lines, as the two topics often will have completely different audiences. In some cases this division will result in wholly separate articles, but generally both should fit in a single article.
It’s always challenging to get the right level of technical detail with a diverse audience. Generally we’ll bias toward simplicity, as there’s an inverse relationship between how technical an article is and the number of possible readers. So, the typical article will strive to hit a moderate technical level, hopefully avoiding the arcane. However, if there’s sufficient interest in a particular topic, we’ll go back later and add greater technical and mathematical rigor to those specific areas of interest.
I’ll try to construct the series in such a way that each topic builds successively upon the previous. However, as this will be my first attempt at creating such a course I’m sure to get some topics out of sequence. Fortunately, web pages – with their ability to readily point to other content outside the page displayed – allow for non-linear reading in a manner far superior to what is possible in print.
Here is the Topic Index as I see it at this juncture. A single topic might span several articles connected together. As with source code, it’s generally better to write several small modules that connect together rather than a single large source file that tries to do everything.
This method of relatively short articles also allows me to later go back and insert new articles within a given topic as either new things occur to me, in response to feedback about a given topic, or additions as the state of technology changes.
1. What is Structured and Unstructured data, and why do we care about the difference?
2. What is an Object?
3. What is an Object Store?
4. What is a Distributed Object Store?
5. When would I use an Object Store versus other forms of Storage?
6. Industry Implications of Object Stores
a. Traditional Storage Vendors vs. Cloud Vendors Approach to Object Stores
7. Basic elements of an Object Store ecosystem
8. Distributed Object Store Blueprint
9. Architectural Considerations of an Object Store
10. A Comparison of Object Store Implementations
11. The Life and Times of an Object
a. The Birth of an Object
b. Data Ingest
c. Life inside the Object Store
d. Where does an Object live?
e. How is Data Protection achieved?
f. Object Mobility
i. Duplication and Replication
g. Why is Tape Backup not required?
h. Basics of Data Unavailability and Data Loss (DUDL)
i. Fundamentals of Self Healing
j. Read / Write: What makes it fast, or slow?
12. De-duplication and Object Stores
13. The Road Ahead – The Evolution of Object Stores going forward
Please let me know your thoughts along the way.
by Frank Wilkinson on Sep 26, 2011
This is my first official blog with Hitachi Data Systems. I started with HDS on, of all days, Valentine’s Day 2011! This reminded me of what I love: of course my wife and children, but also what I have chosen as a career. I mean, who can’t get excited every day when you have the unique opportunity to try and make the world (or at least technology) a better place?
Some background on me: I have been in the technology industry for over 15 years. I am a Master Architect, both hardware and software, and have worked in sales, pre-sales and strategy roles for most of my career.
For the past 10+ years my main focus has been around search and eDiscovery, and I have had the pleasure to work with some of the industry’s top thought leaders and developers, who actually created this space for archiving and search in 1999. I have helped create this market, and along with some others, we have watched our “child” grow into what it is today, which is far bigger than most thought it would be: hungry for better, faster, richer search capabilities and faster insight to data. I have worked for some of the largest providers of search and eDiscovery solutions and it has been a joy ride all the way.
Like I said, I started on Valentine’s Day of this year and I think it is apropos that I have the fortunate opportunity to work for the best IT Solutions company in the world! I get to do what I love and that is to create new concepts and technology, and bring new ideas for Content Cloud solutions for HDS. Who wouldn’t be excited?!
Speaking of exciting…
Big Data Problems
It is no mystery that at some point we were going to run into issues of having too much data, which led to overly complex architectures, networking latency issues and backup issues, just to name a few. The reality is this is not a new problem, but one that has always existed; just look back at the first Disk Pak that you bought: 10 MB, and we thought that was enough capacity. When that space ran out we bought a larger one, and thought that was going to be enough capacity! We simply added capacity as needed and along with that we grew our complexities for backup, recovery, replication, networking, etc. You get the picture.
As the decades flew by we had grown to the point of an information overload and grew data beyond our wildest imaginations. While archiving content gave us some relief for email and file data, it also added an extra capability that allowed the ingested data to be indexed, making the data searchable. This was the beginning of the next era for search and eDiscovery.
Fast forward a few years and now we see that there are hundreds of companies who offer search and archiving solutions—perhaps you have one of those solutions in your enterprise today.
I am not telling you something that you don’t already know, but I will tell you that Big Data impacts your ability for greater insight to your data.
Why do we keep data?
Why don’t we expunge that data when its usefulness expires?
Well, we know that data comes in all types and forms, and while some have a high relevance to your business, such as a financial record or contract, we also keep data for historical reference.
So What’s The Issue Here?
The reality is that we have tried to address the big data issue with options like archiving, data de-duplication, application retirement, consolidation, file planning, etc. Please don’t get me wrong, we need these technologies in order to hold the Big Data beast at bay, and these technologies also offer the baseline for building the next generation architectures. Since data has grown exponentially over the past several years, we have tried our best to contain data in its place while trying to adjust our architectures for the next great thing.
So Where Are We Today?
We have complicated and rigid architectures which may not lend themselves to take advantage of Cloud solutions. We have complicated applications and integration issues. We have a ton of meta-data, objects and even some archiving solutions in place—perhaps too many and not easily integrated together. We’ll even throw in some business intelligence solutions.
So What Do We Have?
A big mess!
Maybe it’s just me, and perhaps I see things from too simplistic a viewpoint, but the data we create should serve the business and allow us to make better business decisions that react to the markets with more agility. As Michael Hay and my friends over at the Techno-Musings blog preach, data should not hold us hostage and cause such pains. To me, that is what is important: how can we make businesses run better and more efficiently while reducing risk and reducing the size and complexity of IT, while maximizing new technology to gain better insight into the data? This is what I love, and it is my passion (just ask anyone who knows me).
I do have a point to all of this, I promise!
The challenge, as I see it, is not giving you more solutions and adding more complexities, as some companies are trying to sell you, but rather the opposite. How can we look to technology to help solve some basic issues with regards to Big Data?
We have all been promised at one time or another that technology will unlock the value in IT. Well, I am here to tell you that we are almost there, but we still have work to do.
While there are many challenges to address, there are a few which make it to the top of the list:
- Objects and Meta-data: To gain greater insight, we know that we need to do a much better job of exploring and expanding meta-data capabilities. And while we are at it, we need a better way to standardize meta-data abstraction and find relationships between otherwise unassociated meta-data.
- Search: We need to think bigger than just search. Analytics need to be more tightly integrated with the data and meta-data. I call it “Content Analytics,” since that is really what we need. If all the data and its associated meta-data can be made more intelligent (and yes, we are talking way beyond meta-data tagging here, and yes, I am leaving out the other parts to this, like data dictionaries, indices, BI, etc., because they are understood, at least for now), then we will be able to provide a much richer set of data and meta-data. This could lend itself to a more refined result, giving us greater insight. But how do we get there?
- Open Frameworks and APIs: If we really want to think about the impact that big data has today and its strong hold on our data centers and architecture, then we need to think of a better way for data to communicate with businesses. Part of this can be addressed with the adoption and implementation of open source solutions such as OpenSearch, which provides a common, structured format for sharing data and extends existing formats like RSS and Atom, just to name a few.
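To make the OpenSearch point a little more concrete, here is a minimal sketch of reading search results that arrive as an Atom feed carrying the OpenSearch 1.1 response elements (totalResults, startIndex, itemsPerPage). The sample feed, its titles and the archive.example.com links are made up for illustration, and parse_results is a hypothetical helper name, not part of any product.

```python
import xml.etree.ElementTree as ET

# Namespace URIs for Atom 1.0 and the OpenSearch 1.1 response elements.
NS = {
    "atom": "http://www.w3.org/2005/Atom",
    "os": "http://a9.com/-/spec/opensearch/1.1/",
}

# Hypothetical search response; the titles and links are invented.
SAMPLE_FEED = """
<feed xmlns="http://www.w3.org/2005/Atom"
      xmlns:opensearch="http://a9.com/-/spec/opensearch/1.1/">
  <title>Results for: contract</title>
  <opensearch:totalResults>2</opensearch:totalResults>
  <opensearch:startIndex>1</opensearch:startIndex>
  <opensearch:itemsPerPage>10</opensearch:itemsPerPage>
  <entry>
    <title>Supplier contract 2011</title>
    <link href="http://archive.example.com/objects/1001"/>
  </entry>
  <entry>
    <title>Contract retention policy</title>
    <link href="http://archive.example.com/objects/1002"/>
  </entry>
</feed>
""".strip()


def parse_results(feed_xml):
    """Extract the paging information and result entries from the feed."""
    root = ET.fromstring(feed_xml)
    entries = []
    for entry in root.findall("atom:entry", NS):
        entries.append({
            "title": entry.findtext("atom:title", namespaces=NS),
            "link": entry.find("atom:link", NS).get("href"),
        })
    return {
        "total": int(root.findtext("os:totalResults", namespaces=NS)),
        "start": int(root.findtext("os:startIndex", namespaces=NS)),
        "per_page": int(root.findtext("os:itemsPerPage", namespaces=NS)),
        "entries": entries,
    }


results = parse_results(SAMPLE_FEED)
print(results["total"], [e["title"] for e in results["entries"]])
```

Because the results are plain Atom plus a few well-known elements, any application that can read a feed can consume them without knowing anything about the search engine behind it.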
I Told You I Had A Point… Here It Is!
When I look at the many ways in which meta-data is utilized today, and how we are just learning to maximize its potential, I see that we are still tied to the old architectures, and thus tied to a less spectacular way to gain insight into our data. Being able to leverage that analysis will help us make better business decisions. I am not talking about Business Intelligence solutions, although I do ponder that if data and its meta-data were able to leverage a common open source set of APIs, then we may have something here. Perhaps we could then be the well for all intelligent meta-data and be the providers and/or helpers that other solutions can leverage for meta-data analytics. In order to shrink our architectures and squeeze out more benefits from our data, we need to think about a new approach to meta-data management and what we could expect from next generation solutions.
In Michael Hay’s recent blog post Interacting with Cloud Stores, Michael articulates precisely what is needed around a common set of open sourced APIs, which can be leveraged throughout the enterprise across all applications, storage, data objects and meta-data in order to get us to be able to link and share meta-data across not only the enterprise, but with external meta-data from social media networks (and even from public cloud solutions). Meta-data is the Holy Grail, and the more we can leverage, embed unique markers and add intelligence into our meta-data, the more we can begin to reap the benefits of clearer insight to that data.
As you can probably tell, my blog will take on the challenge of looking at Big Data in many different ways, but most importantly around Content Cloud, Search and eDiscovery.
I look forward to your thoughts and comments!
Soybean Is A Commodity
by Ken Wood on Sep 23, 2011
While having dinner in Waltham, Massachusetts the other night with a table of smart people, the topic of commodities came up.
What is a commodity?
Soybean is a commodity. Steel is a commodity.
Why does everyone consider storage a commodity? I said, “…actually, capacity is a commodity, not storage.” Everyone’s eyes lit up. That’s true. CPU cycles are a commodity, but processors and servers are not. Networking bandwidth is a commodity, but network switches are not. Storage capacity is a commodity, but the storage equipment behind it is not.
So is there a distinction? My old colleague Michael Hay posted a blog on a similar subject last year, and my new colleague, Shmuel Shottan from BlueArc, also blogged about this subject on Nov. 8th 2010. So I grabbed this definition from Wikipedia:
A commodity is a good for which there is demand, but which is supplied without qualitative differentiation across a market. A commodity has full or partial fungibility; that is, the market treats it as equivalent or nearly so no matter who produces it. Examples are petroleum and copper. The price of copper is universal, and fluctuates daily based on global supply and demand. Stereo systems, on the other hand, have many aspects of product differentiation, such as the brand, the user interface, the perceived quality etc. And, the more valuable a stereo is perceived to be, the more it will cost.
The case of “stereo systems” is a very good example of a misunderstood situation in our industry today: product differentiation matters. While the output of these products (CPU cycles, capacity and bandwidth, for example) may be perceived as a commodity, the quality of the product producing these “outputs” is worth paying for.
So here’s my distinction between CPU cycles and CPUs/servers, or storage capacity and storage systems, or bandwidth and networking:
Cycles can be bought and sold at varying prices based on several factors, such as time of day or week, current loads or some other seasonal impact (supply and demand), similar to electricity. Service providers can even compete on the cost per CPU-cycle-hour. I’ve even seen CPU-cycle-hours donated or given away as a promo or part of a grant. The same could be said for bandwidth or Internet access, but the equipment that creates these “goods” provides a distinguishable quality that is noticeable, and in most cases, sought after.
The difference is that you can buy and sell commodities in bulk (there are extreme corner cases where some companies buy equipment in huge quantities, but this does not fit the mainstream model nor does it make the gear a commodity). Hence, you can buy CPU cycles by the hour or month. You can buy bandwidth by the gigabyte per month. You can buy capacity by the gigabyte per month. You can buy soybeans by the ton.
It is my opinion that there is a misunderstanding and confusion in the IT industry between “commodity goods” and “consumer products” when it comes to technology. I can’t pinpoint the exact origin of why or how these two concepts seem to have become synonyms for each other, especially in the IT industry, but there is a difference between commodity goods and consumer products. The technologies used and the blending of these technologies, be it in the user interface, the way the product is managed, or its performance and/or reliability, contribute to the overall perceived and measured quality of a product.
There is also the “secret-sauce factor”. The hybrid combination of commodity components with special, purpose-built components to complement the overall invention of a product can elevate that product to something altogether different, a new standard. In fact, one could argue (wrongfully) that once a technology becomes a “standard” it is treated as a commodity. Of course we all know there is a difference between being an industry standard or following a standard, and setting the standard.
The other argument that can be made (wrongfully again) is that if a product is constructed with some measurable amount of commodity parts (resistors and capacitors are bought in bulk), then the product itself must be a commodity. Of course, with many quality products, there is also a measurable amount of differentiated know-how blended together, plus the inclusion of secret-sauce to create a unique and desirable product. Only companies in the business of inventing innovative technology and products can provide the quality and reliability that customers desire.
So, while soybeans are a commodity, a fancy restaurant, a little sea salt, a culturally focused serving bowl and an exotic name can buy you Edamame for 15 bucks. For just a few ounces.
Announcing the Data Content Blog
by Robert Primmer on Sep 23, 2011
Hello, and welcome to the Data Content Blog.
On this blog we will focus on the area of data storage commonly referred to as “unstructured data.” We have all seen the industry charts that show the growth of unstructured data to be dramatic, growing to exabytes by 2015. The storage systems designed for this class of data are evolving to meet the challenges associated with trying to manage and keep track of such massive amounts of data.
Perhaps the most important distinction is that these systems are really software applications that use storage, but aren’t intrinsic to the storage subsystem. As a software solution, the storage application itself can perform functions particular to the data content as there’s a greater knowledge of the nature of the customer data than is possible in a pure disk storage subsystem. This increased knowledge allows for a greater set of actions to be taken based upon the data content, rather than on more generic qualities such as disk segment boundaries. A simple example is the ability for an application or storage administrator to select which specific objects or files it would like to have replicated, when and where. A more complex example involves performing arbitrarily complex analytics on the data that effectively transforms data to information.
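As a rough illustration of acting on data content rather than on disk boundaries, the sketch below picks objects for replication based on their metadata. Everything in it is hypothetical: the StoredObject and ReplicationRule types, the metadata keys and the site names are invented for this example and do not describe any particular product.

```python
from dataclasses import dataclass, field


@dataclass
class StoredObject:
    name: str
    size_bytes: int
    metadata: dict = field(default_factory=dict)


@dataclass
class ReplicationRule:
    # Replicate every object whose metadata[metadata_key] equals metadata_value.
    metadata_key: str
    metadata_value: str
    target_site: str


def plan_replication(objects, rules):
    """Return (object name, target site) pairs for objects matching a rule."""
    plan = []
    for obj in objects:
        for rule in rules:
            if obj.metadata.get(rule.metadata_key) == rule.metadata_value:
                plan.append((obj.name, rule.target_site))
    return plan


objects = [
    StoredObject("xray-0001.dcm", 52_000_000, {"department": "radiology"}),
    StoredObject("lunch-menu.pdf", 120_000, {"department": "facilities"}),
]
rules = [ReplicationRule("department", "radiology", "dr-site-east")]

print(plan_replication(objects, rules))
```

A block-level storage subsystem could not make this choice, because it sees only segments of disk and never the object or its metadata.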
A second important function for this class of storage software is the ability to simply keep track of where all the data resides. It’s a lot easier to store a petabyte than it is to store, catalog, and keep track of a billion objects over their lifetime. As the time horizon for stored data approaches decades, it’s a given that the associated storage and server hardware will change generations multiple times. The software application that fronts these systems needs to be built to withstand these changes without requiring forklift upgrades. Therefore, it’s important that all hardware elements are sufficiently abstracted in the storage software to accommodate change. Likewise, the ability to test the veracity of stored data is needed, as backup for data at this size is impractical.
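One common way to test the veracity of stored data is to record a cryptographic digest when an object is ingested and recompute it later. The sketch below assumes SHA-256 and a simple in-memory catalog; the Catalog class and its method names are hypothetical, and a real system would persist the digests and schedule the checks.

```python
import hashlib


def digest(data: bytes) -> str:
    """Return the SHA-256 digest of the data as a hex string."""
    return hashlib.sha256(data).hexdigest()


class Catalog:
    """Hypothetical in-memory catalog mapping object IDs to ingest-time digests."""

    def __init__(self):
        self._digests = {}

    def ingest(self, object_id: str, data: bytes) -> None:
        # Record the digest at the moment the object is first stored.
        self._digests[object_id] = digest(data)

    def verify(self, object_id: str, data: bytes) -> bool:
        # Recompute the digest and compare it with the one recorded at ingest.
        return self._digests[object_id] == digest(data)


catalog = Catalog()
original = b"contract text, version 1"
catalog.ingest("obj-42", original)

print(catalog.verify("obj-42", original))
print(catalog.verify("obj-42", b"contract text, version 1 (corrupted)"))
```

Because the check depends only on the bytes and the recorded digest, it keeps working as the underlying storage and server hardware changes generations.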
On this blog we’ll talk about these and other issues particular to unstructured data and how it relates to what’s happening in the industry at large, as well as specific customer segments.
This blog will have four contributors: Cameron Bahar, Frank Wilkinson, Shmuel Shottan, and myself, Robert Primmer. Below are biographies for Cameron, Frank and Shmuel.
Although no longer blogging for HDS, for reference from his previous blog contributions, see his bio below.
Cameron Bahar
Chief Product Strategist, Scale Out Storage Platforms & Big Data Analytics
Cameron Bahar leads the technology direction and strategy on scale-out storage platforms, distributed file systems and Big Data analytics at Hitachi Data Systems, bringing over 20 years of systems software development and deep expertise in distributed operating systems, parallel databases and data center management.
Cameron joined HDS through the acquisition of ParaScale where he was the founder, CTO and VP of engineering. At ParaScale, he developed and released a private cloud storage and computing software platform for the enterprise to address Big Data workloads.
Earlier, Cameron led design, deployment, and operation of Scale8’s distributed Internet storage service that provided storage for digital content owners including MSN and Viacom/MTV. At the HP Enterprise Systems Technology Lab, he developed system software for disk volume management, data security, and utility data center management. At Teradata, Cameron developed extensions to UNIX to allow the massive parallel processing required by the Teradata database, the world’s largest and fastest distributed RDBMS. Cameron started his career at Locus Computing, a pioneer in distributed operating systems, single-system image clustering, and distributed file systems.
Cameron holds a BS, summa cum laude, and an MS with honors, in Electrical Engineering from UC Irvine. Cameron holds 4 patents in scalable distributed storage systems, virtual file systems, and high availability systems.
Frank T. Wilkinson
Content Services Strategist, Office of Technology and Planning
As the content strategist in the Office of Technology and Planning, Frank brings over 15 years’ experience in the field of IT and software development and expertise in HDDS and HNAS. He is an industry expert in the areas of knowledge and content management, and in information management within the financial services, healthcare, media & entertainment, retail and SLED sectors.
He has held many IT certifications, including being a Master Architect. Prior to joining HDS, Frank was the Business Strategist for HP Software, Information Management Division and spent five years developing and deploying information management and content management solutions. Prior to HP, he enjoyed stints at the EMC TSG group’s information management practice and K-Vault Software (Symantec Enterprise Vault). Frank has been an active speaker for industry-led panels and discussions on the topics of eDiscovery, information management, content management and next generation solutions facing the enterprise.
Shmuel Shottan
SVP, CTO BlueArc, Part of Hitachi Data Systems
As CTO of BlueArc technology, Shmuel is responsible for developing and advancing BlueArc product innovation. Previously Shmuel served as senior vice president of Product Development of BlueArc. He joined BlueArc in 2001.
Shmuel has over 30 years’ experience in the research and development of hardware and software, and in engineering management for firms ranging from start-ups to Fortune 500 companies. Prior to BlueArc, he was senior vice president of Engineering and chief strategy officer for Snap Appliance. Earlier, Shmuel held executive positions at Quantum Technology Ventures and Parallan Computer. Previously he held senior engineering positions at AST Computers and ICL North America.
Shmuel holds B.S. degrees from the Technion – Israel Institute of Technology in electrical engineering and computer science.
As for me, I am a senior technologist and senior director of product management in the file content and services division at HDS, where I am responsible for devising technology solutions that will be incorporated into future enterprise and cloud product/service offerings, with in-depth knowledge and expertise with Hitachi Content Platform (HCP).
I have 25 years’ experience in technology, working in R&D and Product Management organizations with Cisco Systems, EMC, HDS and several start-ups. I am a member of ACM and IEEE and belong to the IEEE Computer Society.
We all look forward to bringing you insights on this dynamic topic of data content. If you have particular issues you’d like us to address, please let us know in the comments.
Is That Information…And Do I Care? (Part 1)
by Amy Hodler on Sep 20, 2011
It’s no wonder we have so much buzz around Big Data: we’ve reached a tipping point, which Michael Hay discussed in a previous blog post on how ‘Big Data Is Turning Content Into Appreciating Assets.’ I find news of discoveries and innovation based on massive amounts of data pretty inspiring, and, as you can imagine, we at Hitachi Data Systems talk about these topics among our colleagues and friends…and sometimes even to the point of annoyance among our families. (I can personally attest to a glazed stare this morning as I chatted again about this very blog post.)
With this kind of enthusiasm, you can understand how we might get into debates over what we actually mean by ‘Data’, ‘Information’ and ‘Knowledge.’ This usually leads to fatigued, sometimes discussion-ending questions of, “does it matter?” and “do I care?” After some debate, Anjana Bala (Stanford University and former HDS intern) and I decided to research this a bit more, and we’ve come to the conclusion that 1) we do care, 2) it is important and 3) we’d bet you’d agree.
Ok – So this is by no means a new topic, it’s just really hot right now. We’ve been thinking about this for ages, from all different angles. I believe this topic has captivated us for so long because the journey from data to knowledge is transformational and provides the basis for innovation. And, it continues to elude us because this journey, particularly when we go beyond information, becomes more personal and implicit.
Among the mass of different views on these concepts (which sometimes felt like hair splitting), Anjana and I found Jonathan Hey’s paper on “The DIKW Chain” to be a good, consumable starting place; its discussion of how data and information relate continues on to knowledge and wisdom as well. We’ve summarized the distinction between these concepts below, based on Hey’s paper and Dr. Russell Ackoff’s work, along with one of the most common representations of the relationships: the pyramid diagram.
- Data are discrete symbols that represent facts. You might think of them as recordings or statistics.
- There is no meaning or significance beyond the data’s existence.
- Data may be clean or noisy; structured or unstructured; relevant or irrelevant.
- Information is data that has been processed to be useful. I like to think of it as adding the first bit of context to data relating to “who”, “what”, “where”, and “when”.
- Information captures data at a single point in time and from a particular context; it can be wrong.
- Knowledge is the mental application of data and information. Most consider this as addressing questions around “how”.
- Some consider knowledge a deterministic process, which is to say the appropriation of information with the intent of use.
- Wisdom is the evaluation and internalization of knowledge. It applies insight and understanding to answer “why” and “should” questions.
- Wisdom has been characterized as “integrated knowledge — information made super-useful.” (Cleveland, Harlan. December 1982. “Information as a Resource”. The Futurist: 34-39.)
Many of these definitions are based on Dr. Russell Ackoff’s work; however, I added the graphic, examples and further explanation from various sources. (Ackoff, R. L., “From Data to Wisdom”, Journal of Applied Systems Analysis, Volume 16, 1989.)
So this is all very interesting, with implications not only for the terms we use for a good debate, but also for how we might progress from one to another. This is probably enough to chew on for now, so next week I’ll follow up on these implications and we’ll look at why it’s important to consider. In the meantime, let me know if you have any good examples to illustrate DIKW differences.
What a Difference a Year Makes
by Miki Sandorfi on Sep 20, 2011
Last year, VMworld was all about virtual machines. All the partner plug-ins focused on helping to do just about anything to virtual machines. From backing up to deploying virtual machines, almost every eye was on the virtual machine ball. What a difference a year makes!
At this year’s VMworld, most players came out swinging at “cloud,” trying to establish their place in the market. Though there was lots of schwag (I have never seen so many people carrying prizes of iPads and AppleTVs, and a few with both), I was very impressed by the number of in-depth conversations, well attended knowledge sessions and sincere desire to gain cloud information. Right off the bat the show’s vibe was clear: Cloud. Cloud. Cloud.
Every exhibitor had their own take on the game, and some were just trying to figure out how to play, but no doubt cloud was the game. Much of the focus was on integrating into VMware or supplementing it to add value. Application distribution, service and cloud monitoring, application service performance, security, extended cloud management orchestration, and hardware management were most commonly seen. Beyond the common cloud themes, some key players (HDS included) are executing on a much more integrated approach, extending the management more deeply to enhance the end-to-end cloud workflow. This year, all players were aiming at a common goal, from cloud appliances to cloud enablers to cloud managers, and every cloud item in between.
Regardless of how they precisely define cloud, all were focused on driving to it.
The attendees had evolved as well, which is a direct reflection of how IT is evolving. A year ago, most were strongly virtualizing and dipping their toes in the cloud. Most had a goal for the cloud or had been considering it, but not many were invested or even optimistic about the potential of achieving a cloud environment. This year, if not swimming in it, all were wading in the cloud.
Conversations definitely changed: knowledge about how to get to cloud has greatly increased, along with the focus on, and most importantly the reality of, getting to cloud in customers’ own environments. Optimism and belief were definitely the theme, and even more exciting, many discussions shifted from an IT and application focus to the end customer and how to operate efficiently while meeting business needs in a timely manner. This is incredibly important to actually achieving the dream – the ability to execute is starting to become a reality.
What a great direction we are all moving.
The cloud focus shows not only how predominant cloud is in the industry, but also the evolution technology and IT have gone through over the last year. We’ve moved from focusing on server virtualization to the entire data center landscape, integrating all IT into a consolidated flowing environment. Though we are not there yet, the evolution was evident not only at the booths and among the attendees, but in the roadmaps. Discussions clearly point to cloud as the target, which is something we have waited a long time for: the ability to increase flexibility, reduce OpEx, and simplify complicated architectures, all leading to better meeting needs and adding value to the business.
Freakonomics to Flying Circus
by Amy Hodler on Sep 19, 2011
I’m an idea junkie: deconstruct an old one or give me a new one and I’m happy.
That might be why the team asked me to introduce our new blog. We all have good and great ideas and the only way to know the difference and capitalize on them is to share.
With that spirit in mind we are launching our new Cloud Blog. There are many exciting initiatives and discussions going on within HDS and the cloud teams. And we thought it was about time to share more of these with you – our customers, partners, analysts, friends and anyone who finds cloud interesting. Our goal is to germinate and “Inspire the Next” great set of innovations together.
Any topic with a cloud angle will be fair game but we’ll try to tie things back to IT and data center cloud interests. (Although quite honestly, I know I’m already going to break this guideline. Because rules are…well, you know.) Our goal is to spark great conversations and ideas by being thought provoking, interesting or entertaining. If we get boring…promise to tell us and we promise to do something about it!
Following are our regular contributors. I’ll let you work out where we all lie on that spectrum.
Miki Sandorfi – Chief Strategy Officer, File, Content & Cloud
As Chief Strategy Officer for File, Content and Cloud, Miki is responsible for driving forward the Hitachi File, Content and Cloud product and solution portfolio. Miki has over 19 years of experience. He holds 16 issued and 20 pending patents and pioneered advances in enterprise data protection and deduplication at Sepaton as the Chief Technology Officer. Miki is enthusiastic about all new things, especially shiny new gadgets!
Linda Xu – Sr. Director of Product Marketing, File, Content & Cloud
As Senior Director of the File, Content and Cloud product marketing team, Linda is responsible for developing go-to-market strategy for the portfolio, and end-to-end execution of the marketing strategies. Prior to joining Hitachi Data Systems, Linda held various positions at Dell Inc. including strategic planning and business development and was previously a public relations manager at AT&T China. Linda has recently been learning to stretch and relax with yoga.
Amy Hodler – Sr. Product Marketing Manager, Corporate & Product Marketing
As Senior Product Marketing Manager for File, Content and Cloud, I’m responsible for marketing of the Hitachi Unified Compute Platform as well as other solutions. Throughout my career, I’ve been advancing emerging solutions at both small and large companies including EDS, Microsoft and Hewlett-Packard (HP). I thoroughly enjoy the beautiful outdoors of eastern Washington as well as most activities on two wheels.
You’ll get to know our guest and regular writers more as we move forward and we’re hoping this will be a dialogue. Your ideas are important and there’s a lot that could be accomplished if we had more in the mixing pot and stirred. So remember that crazy ideas and discussions are always welcome and sometimes we just get lonely for a chat. Please let us know what you’re thinking whether you’re internal or external to HDS.
You never know where the rabbit hole might lead but it will be interesting . . . let’s jump in together and see where this goes!
Interacting with Cloud Stores
by Michael Hay on Sep 15, 2011
In a July post on the Techno-Musings blog, we made the case for keeping content/data in its original, unaltered form. More specifically, when an application stores data/content in a non-obfuscated mode, it is possible to unleash the true power of the content. This is a key tenet by which Hitachi lives and breathes: ensuring that application owners, users and customers are the winners in the end. At Hitachi, we view ourselves as custodians or trustees of the content and metadata that applications store on our platforms, and not the “content wardens” holding data and metadata prisoner.
While there is the potential for a market sea change, whereby users of business applications demand complete access to their data in an unaltered form, many application vendors still imprison their customers’ data and metadata.
The ability to have unfettered access to the content and its metadata is even more important today than before, mostly due to the rise of cloud/object storage platforms. That is because offerings like Amazon S3 for the public cloud and premier cloud/object stores such as the Hitachi Content Platform (HCP) for the private cloud are paving the way for ubiquitous and open REST based access to objects stored on them. However, the middle leg in this chain is the source application, which produces or shelters the content and then stores or persists it into the cloud store. If the source application fails to persist the complete unaltered content and metadata, or obfuscates it in any way when placed on the object or cloud store, then the principle being articulated in this post fails.
A key question is, why is this unfettered access important now and into the future?
The best way to begin to answer this question is by giving an example. In another blog post, we compared Microsoft Remote BLOB Storage (RBS) with Hitachi’s SharePoint plugin (Hitachi Data Discovery for Microsoft SharePoint, or HDD-MS for short) as ways to transport SharePoint content to HCP. The post states that there is a machine generated way to move content from SharePoint or a Microsoft SQL Server application to HCP through an RBS provider. It is true that the RBS approach persists the unaltered actual data for the object, but that is about all it does. RBS fails to persist an object’s core metadata, such as file name and containing directory structure, and doesn’t even give a thought to custom or other SharePoint specific metadata. So RBS really persists an incomplete representation of an object to the object or cloud store. The net effect is that RBS expects the authoring application or machine, and not the storage, to contain all of the object’s metadata and context. Hitachi’s plugin, by contrast, is specifically designed for SharePoint archival: it moves the object and replicates its metadata to HCP. With this approach, HDD-MS is effectively trying to strike a balance between machine and human accessibility.
We have intentionally designed HDD-MS so that if, in the future, SharePoint were not available and a human needed to do a manual investigation, they could look at the original object, all of its versions and its metadata. If we relied purely on RBS, then the user would need to access the version of SharePoint the data originated from to get everything they need. This practice would be repeated per SharePoint site, and further, if Microsoft ever decided to halt the development of SharePoint – yes, unlikely in today’s thinking, but what about 30 or 50 years from now – then the only way to maintain access to the content and its metadata would be to archive the application along with the data. This would create what one of our customers called a “museum of applications.”
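The idea of keeping an object and its metadata readable without the authoring application can be sketched in a few lines. This is only an illustration of the principle, not how HDD-MS is implemented: the package_for_archive function, the “.metadata.json” sidecar naming and the metadata fields are all assumptions made for the example.

```python
import json


def package_for_archive(file_name, content, metadata):
    """Return {archive path: bytes} holding the untouched object plus a
    human-readable JSON sidecar that carries its metadata."""
    sidecar = json.dumps(metadata, indent=2, sort_keys=True).encode("utf-8")
    return {
        file_name: content,                    # original bytes, unaltered
        file_name + ".metadata.json": sidecar  # plain text, readable in any editor
    }


content = b"Quarterly results ..."
metadata = {
    "file_name": "q3-report.docx",
    "directory": "/sites/finance/reports",
    "version": "3",
    "author": "jdoe",
}

archive = package_for_archive("q3-report.docx", content, metadata)
for path in sorted(archive):
    print(path, len(archive[path]), "bytes")
```

Someone investigating decades later would need only a text viewer to recover the file name, location and version, rather than a working copy of the application that wrote the object.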
Best Practices for Today and the Future
With the basic premise articulated, it is now time to talk about the best practices we have developed in our adventures with platforms like HCP and our partners. First, a bit of preparation work about cloud stores and how to persist and access objects on them. The “lingua franca” of public/private cloud stores is generally REST (REpresentational State Transfer), an architectural style typically implemented over HTTP, but there are some exceptions. The fundamental technology which REST and related technologies, like WebDAV or SOAP, share is HTTP. HTTP has basically grown up with the Internet, making it “well oiled” to move data and metadata over high latency Internet and Intranet pipes. This contrasts with protocols like NFS and CIFS, which are very chatty and usually designed for LANs or in some cases MANs. Basically, when you get to public and private cloud infrastructures, protocol efficiency matters.
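As a small sketch of what RESTful access over HTTP looks like, the code below builds (but does not send) a single PUT request that carries an object together with its metadata. It borrows the `x-amz-meta-` header prefix that Amazon S3 uses for user-defined metadata; other cloud stores, HCP included, have their own conventions that are not shown here. The store.example.com endpoint, the object path and the build_put_request helper are hypothetical, and authentication is omitted entirely.

```python
import urllib.request


def build_put_request(base_url, object_path, content, metadata):
    """Build an HTTP PUT that stores the object and its metadata in one request."""
    headers = {"Content-Type": "application/octet-stream"}
    for key, value in metadata.items():
        # S3-style user-defined metadata travels as prefixed request headers.
        headers["x-amz-meta-" + key] = value
    url = base_url.rstrip("/") + "/" + object_path.lstrip("/")
    return urllib.request.Request(url=url, data=content, headers=headers, method="PUT")


request = build_put_request(
    "https://store.example.com/finance-bucket",
    "/reports/q3-report.docx",
    b"Quarterly results ...",
    {"author": "jdoe", "source-app": "sharepoint"},
)

print(request.get_method(), request.full_url)
for name, value in request.header_items():
    print(name + ":", value)
```

One request moves both the data and its context, which is part of why this style copes with high latency links better than a chatty file protocol that needs many round trips per file.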
Since RESTful over HTTP is the lingua franca of public/private cloud stores, you might ask: is there a standard? And the answer is that, as of today, there is no International ratified standard that covers public/private cloud stores. There is an emerging cloud storage standard developed by SNIA called Cloud Data Management Interface (CDMI), which is RESTful.There is also Amazon’s defacto S3 RESTful interface. With all of this in mind, it is now time to talk about some observations and specific best practices we’ve learned as we have evolved our private cloud store HCP.
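To make the RESTful idea a little more concrete, here is a minimal Python sketch of what "store an object over HTTP" can look like from the application side. The host name, URL layout and the `build_put_request` helper are all hypothetical and only for illustration; they are not the HCP, CDMI or S3 interface, each of which defines its own paths, headers and authentication. The request is built but deliberately not sent.

```python
import urllib.request


def build_put_request(base_url, object_path, data,
                      content_type="application/octet-stream"):
    """Build (but do not send) an HTTP PUT request that would store one object.

    base_url and object_path are hypothetical; a real cloud store defines
    its own URL layout and authentication headers.
    """
    url = base_url.rstrip("/") + "/" + object_path.lstrip("/")
    return urllib.request.Request(
        url,
        data=data,
        method="PUT",
        headers={"Content-Type": content_type},
    )


# Hypothetical endpoint, used only to show the shape of the request.
req = build_put_request(
    "https://cloudstore.example.com/rest",
    "/reports/q3.pdf",
    b"example object bytes",
)
print(req.get_method(), req.full_url)
```

The point of the sketch is simply that one object maps to one URL and one HTTP verb, which is a large part of why this style travels well over high-latency links.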
Observations from Integrations to a Cloud Store
1. Newer applications are being built to make use of RESTful HTTP implementations natively, while older applications are frequently tied to NFS and CIFS.
2. It takes a long time to get an ISV partner who has invested deeply in a CIFS/NFS implementation or had a bad experience with a proprietary API to integrate a company specific RESTful interface, unless they see business value.
3. Some ISVs, like Symantec and Microsoft, are implementing plugin architectures, which allow the cloud store provider to self integrate to an application specific set of primitives. This has proved to be a faster way to have an optimal integration with a cloud store.
4. Some application vendors will never change their code to interface to anything other than NFS/CIFS.
5. Many application vendors practice vendor lock-in, holding their customers’ data hostage. In the long term we see this as decreasing customer satisfaction, and our principles dictate that we push our partners to not imprison our mutual customers’ data.
Best Practices for Integrating to a Cloud Store
1. Where possible, use your cloud store’s RESTful implementation to store both content and its metadata on the cloud store, especially for new application starts, with the following notable exceptions:
a. Legacy applications typically make use of NFS/CIFS and aren’t well optimized for cloud stores. For a legacy application which uses NFS/CIFS we recommend the use of a cloud on-ramp that translates from NFS/CIFS to a RESTful interface.
b. When an application has native ingestion techniques, use them. Native ingestion typically unleashes the content and its metadata from the application and decreases the solution cost, mostly because middleware is not needed. Examples of applications with native ingestion we’ve integrated to HCP are Microsoft Exchange Journaling via SMTP and SAP’s Netweaver ILM via WebDAV.
2. If a legacy application’s developers have other priorities, start with an on-ramp implementation via NFS/CIFS and deliberately plan a migration to RESTful over time.
3. Cloud store providers should jump on the plugin model, utilizing a vendor’s API primitives to create an optimally integrated offering.
4. When possible, store an object in its unaltered form, and always store the object’s implicit and explicit metadata with it in an open format (like XML). Ideally, this should be done during the initial commit or ingest to the cloud store.
5. If a machine-only readable format is required, the authoring application should also store a map or recipe to re-instantiate the data and metadata independent of the application. This is an excellent compromise between machine and human accessible objects.
6. Ideally, don’t containerize many objects into a single mega-object like a ZIP package. Containerization makes it difficult to retrieve and access an individual object at a time. Usually specific cloud store implementations allow for connections to be kept open, improving transport efficiency. Long running sessions provide a compromise to containerization in that a single session can support the movement of many smaller objects, much like a container can support the transmission of many objects in a single session.
7. Data breaches of sensitive and/or business critical data are a major concern for many organizations, and these concerns are amplified in the cloud. Encryption (both in-flight and at-rest) is often a mechanism used to counter the real and perceived dangers, but there are numerous complexities and inter-dependencies that have to be addressed (i.e., there is no one best approach). Hitachi believes that encryption is an important tool to guard against unauthorized data disclosure and encourages users to take full advantage of the guidance offered by organizations like the Storage Networking Industry Association with its “Encryption of Data at Rest – a Step by Step Checklist” as well as the Cloud Security Alliance (CSA) with the “Security Guidance for Critical Areas of Focus in Cloud Computing” and “Cloud Controls Matrix (CCM) v1.2.”
8. Always use the authorization and access controls available for the particular cloud store implementation to ensure improved security. Further, if the source application’s access controls mechanisms are not completely compatible with the target cloud store, capture this information and add it to the metadata stored alongside the object. Like other metadata, all access control list or authorization related information should be encoded in a standard format like XML.
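Practices 4 and 8 above can be sketched in a few lines of Python: capture the object's core metadata, any custom fields, and the source application's access control entries in a plain XML document that is stored alongside the unaltered object. The element names (`object-metadata`, `file-name`, `acl` and so on) and the `build_metadata_xml` helper are made up for this illustration; they are not an HCP or SharePoint schema.

```python
import xml.etree.ElementTree as ET


def build_metadata_xml(file_name, directory, custom, acl):
    """Return an XML string describing one object's metadata.

    The schema here is hypothetical. custom is a dict of extra fields and
    acl is a list of (principal, permission) pairs from the source application.
    """
    root = ET.Element("object-metadata")
    ET.SubElement(root, "file-name").text = file_name
    ET.SubElement(root, "directory").text = directory

    custom_el = ET.SubElement(root, "custom")
    for key, value in sorted(custom.items()):
        ET.SubElement(custom_el, "field", name=key).text = value

    # Access controls that the target store cannot express natively
    # travel with the object as ordinary metadata.
    acl_el = ET.SubElement(root, "acl")
    for principal, permission in acl:
        ET.SubElement(acl_el, "entry", principal=principal, permission=permission)

    return ET.tostring(root, encoding="unicode")


xml_text = build_metadata_xml(
    "budget.xlsx",
    "/sites/finance/2011",
    {"department": "finance", "retention": "7y"},
    [("alice", "read"), ("finance-admins", "write")],
)
print(xml_text)
```

Because the result is plain XML, a person with a text editor can still read the file name, location and permissions decades later without the authoring application, which is the whole point of keeping metadata in an open format.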
In this post we’ve detailed the case for making content and metadata independent from source applications for future usage. In some sense, content independence assures us that, when there are future use cases for the data we hadn’t considered, it is at least possible. With the application and data intimately intertwined, creating future data mash-ups is nearly impossible.
Since we believe that mashing-up data is potentially a fundamental part of the Big Data phenomena, not unleashing the data from the application puts everyone in a Big Data prison. With the basic theory articulated, we’ve explained some lessons learned and best practices for integrating with a cloud store. Finally, while many of these lessons were formulated as a result of the HCP program, HCP specifics have been mostly excluded.
For more HCP, HNAS, HDI and HDDS specifics, please stay tuned to the HDS blog-o-sphere.
Beyond Vendor Neutral Archiving
by Bill Burns on Sep 13, 2011
In today’s hypercompetitive healthcare environments, driving cost control is paramount and healthcare providers, payers and vendors must reduce cost while increasing the quality of care. Patient-centric technologies are at the core of the solution. These technologies necessitate clinical integration using health information exchanges and associated components. Hitachi Clinical Repository (HCR) was created to address this enterprise-centric view for all data types, not just medical imaging. HCR is a real time “active” repository that consolidates data from a variety of clinical sources to present a unified view of a single patient. It is optimized to allow clinicians to retrieve data for a single patient rather than identify a population of patients with common characteristics or facilitate the management of a specific clinical department.
Unfortunately, the care environment is littered with application-centric technologies like PACS and provider-centric technologies like EMRs, neither of which delivers a patient-centric, portable view of the patient. The concept of the holistic electronic health record has been extended in recent moves toward personal health records. The use of clinical data mining in genetic records will change the outcomes of various disease states. Once healthcare providers gain access to comprehensive electronic patient records, the drive toward predictive, personalized medicine will be possible.
Here is an interesting article by Brian T. Horowitz of eWEEK, on the challenges cloud computing presents to the healthcare industry.
Will the role of VNAs in the healthcare of tomorrow be central or peripheral? The continued evolution of IT solutions is likely to be the catalyst propelling the VNA approach either into the limelight or the shadows.
What do you think?
Textbook Acquisition Strategy
by Ken Wood on Sep 8, 2011
What a week this was! The HDS acquisition of BlueArc announcement was a great event. Members of the blog team (Michael Hay, Claus Mikkelsen, and myself) and executive team held several Q&As, as well as multiple briefings with media, analysts and bloggers. We even held a tweetchat, which was a great opportunity to connect with our followers about the news. It seemed like only positive comments (unless you are a competing vendor) about the acquisition were made, even from our toughest critics. In fact, the only remotely negative comments were along the lines of “…what took you so long?”, which really isn’t entirely negative, but more a validation of our decision.
This brings me to my point. I think our industry is a little calloused about acquisitions. There are too many companies that seem to acquire first, then try to figure out what they bought and how to use it. Too many times these acquisitions disappear or are tossed away. I’m sure you’ve seen or read about other companies acquiring a company, then about a year later buying another that seems to do the exact same thing. Also, the strategy for the acquired company doesn’t seem to match the current strategy of the company. This is not always true with acquisitions, but if you watch the industry like many of us do, you notice these “…why did they do that?” transactions. I am purposefully not naming names, but I know we’ve all seen this behavior in our industry.
I would classify the acquisition of BlueArc by HDS as a textbook technology acquisition strategy. As Chris Evans at The Storage Architect noted in his recap, partnering with a company first to see if cultures and ideas mesh together properly is a solid approach. The level of partnership varies in many ways, but a partnership gets you access to potentially new customers and markets, technology, talent, business models, etc. Many would call this due diligence when shooting for a direct acquisition, but a partnership is due diligence in practice: the old “try before you buy” mentality. It’s one thing to do interviews and a paper evaluation; it’s a whole other thing to work side-by-side with a company, share wins and losses, figure out ways to make the technology better to the benefit of both parties, make some money together and get to know each other. Many times these partnerships are started with one of the goals being acquisition, but sometimes, for various reasons, things don’t work out: bad chemistry, partnership fallout, disputes and conflicts, or someone else buys them. Good to know before you buy a company. However, there are risks, as a good partnership doesn’t go unnoticed in this industry.
In fact, there are many who would say that HDS is the due diligence for many other companies’ acquisitions. HDS has a reputation for putting together successful partnerships and for being a good partner. Many times this relationship building can be misinterpreted by our competition as “…this must be a good company to acquire,” with the actual due diligence being a little too light. This also means that there is risk to partnerships, especially where integration work and investments are involved.
Then, when the timing is right, an acquisition is considered and executed. If there is a good partnership, and everyone is in agreement, the industry says “…it’s about time!” while customers say “FINALLY!” I personally feel that if more plans followed this strategy, there’d be a lot less “…why did they buy them?” head scratching.
There were several questions asked publicly, like: “does this open HDS to a flood of acquisitions?” To which I can confidently say no. We will not fall into an “acquire first, ask questions later” tactic.
Again, I want to welcome the entire BlueArc team to HDS and the Hitachi family. It’s our great partnership and person-to-person relationships, and our relationships with our customers, that resulted in this textbook union.
All In The Family
by Claus Mikkelsen on Sep 7, 2011
No, not the Archie Bunker kind, but the “let’s build the HDS family” kind.
So, as many of you have already heard, today HDS took a great next step in our relationship with BlueArc (check out the announcement here).
It’s all about growing the family with a quality company, with industry-leading products, and some really great people. It’s also a great way to demonstrate our commitment to NAS.
We’ve been reselling the BlueArc line for a few years now, and although that partnership has been very successful for both companies, it was taking on the appearance of a very long engagement with no date for a wedding. Well, that has changed today and the commitment has been made. It’s official and vows have been exchanged.
Any doubts about our commitment to, and support of, BlueArc have been removed. I’ve met many of the BlueArc talent and have seen the company execute over the past few years, and personally, I’m soooo happy to see this transaction complete and I’m sure many others are too. No more ambiguity about this relationship.
As HDS moves into the future, I can tell you that this acquisition is a critical part of our overall strategy, and I look forward to seeing it unfold.
BlueArc: The Jumbo Carrot
by Ken Wood on Sep 7, 2011
“Project Carrot” is officially complete! As Michael Hay recounted, our relationship with BlueArc has been an exciting journey over the past 5 years; a journey that we dubbed internally as “Project Carrot”. As you know by now, Project Carrot set out to evaluate and choose the NAS technology that would become HNAS. A small team of HDSers conducted the technical and performance evaluation for our “carrot patch”. After dozens of paper evaluations and interviews, it eventually came down to 3 finalists – Littlefinger, Resistafly, and Jumbo, aka BlueArc. (If you’re not a vegetable aficionado, all of these names are different types of carrots.) Without doubt, with today’s acquisition of BlueArc’s talent and technology, HDS has enhanced our file and content services strategy and vision. It’s been a great journey thus far working with the BlueArc teams and we are all looking forward to a great future together.
Let me back up and add some insight and historical background to how we got here…
As part of the HDS technical evaluation that started several years back, we each had an opportunity to submit the vendor or technology we thought would be a great complement to our existing technology. My carrot in the patch was Resistafly. I brought them into the carrot patch to blow away the other carrots and to show off my “vision” for the company’s future. I wanted us to embrace a software-based solution, go for the scale-out NAS approach, and select a platform that we could “easily” develop with our strategy. From my perspective at that time, Resistafly was a shared-everything, clustered file system that ran on Windows and/or Linux, could scale-out to 64 nodes (for the right use case and workload, which at the time was fairly large), was not as complex as an asymmetrical clustered file system such as Lustre or IBRIX, and was software-based so it required no special hardware. We could integrate our search technology and develop other Hitachi IP into this platform. At that time, I truly thought the possibilities were endless.
However, after the evaluation, I found Jumbo was the only real option – despite my blind allegiance to Resistafly, the technology only outperformed Jumbo in a couple of corner cases with 8 nodes versus Jumbo’s 2 node configuration. Just imagine if Jumbo had 8 nodes! Throughout the evaluation process I designed over 150 different tests (truth be known, my performance tests were designed using a high performance computing cluster and leaned towards giving Resistafly an edge by using scale-out workloads that “ramped-up” over a period of time). In my mind, there was no way the other vendors could keep up with this type of loading for very long and my preference of a clustered file system would leave them all in the dust, or so I thought. However, in most cases Resistafly only had an edge in some of the streaming reads with low overhead and metadata processing requirements. Any multi-read, non-shared solution in this environment could have performed like this, like the technology in Littlefinger (most of my detailed results from that evaluation are still held in confidence). It was the “mainstream” workloads where the results were undeniable. Random access and high metadata manipulation is the sweet spot where Jumbo shined and held its own everywhere else. This is also where the target market and primary workload environment we were focusing on was going to be: in the enterprise. Jumbo delivered on management and performance, which are tasks that are high on a system administrator’s task load, and these would not be a problem for enterprise customers.
As I attempted to polish Resistafly, Jumbo outshone it across all workloads (in some cases blew everyone away) and was easy to manage (which, as we all know, is extremely important!). Setup time was 6 minutes or less to get a share or export online and usable, and everything fit and worked together seamlessly from a management point of view. Plus, it was on special hardware (FPGAs) for accelerating specific tasks. This “fit” the then HDS mode of operation and culture of “hardware, hardware, hardware”. I was trying to break that mode by thinking expandability and multipurpose. There’s no way we can expand Jumbo’s capability. “…we have a NAS device and that’s all we have”, I would say. In the end, Resistafly felt like a cluster. I had to manage it on several occasions as individual nodes instead of like a single system. I wanted this technology to break away from the old mindset that a scale-out clustered file system couldn’t be a general purpose NAS in its use. But alas, it fell into the typical role. There was no comparison between the two; Jumbo unequivocally demonstrated its superiority as the king carrot.
After choosing the path of BlueArc, I was assigned to manage a small group of HDS employees who were assigned to collaborate with BlueArc developers in the UK. Their task was to start developing features into HNAS to incorporate Hitachi IP and strategy, and integrate Hitachi products into this “hardware NAS device”. In 3 years, they developed in full collaboration with Jumbo developers:
- Change file notification for accelerated content indexing for Hitachi Data Discovery Suite (HDDS) search
- Data Migrator; CVL for internal and XVL for external
- Internal file migration
- External NFSv3 based off platform data migration for any NFS based storage target
- External HTTP/REST based off-platform data migration for Hitachi Content Platform (HCP) storage targets as object (data plus metadata) ingestion
- Migration policies for Data Migrator for data migration of files based on simple file primitives like date, name, size, etc.
- HDDS integrated with Hitachi NAS Platform (HNAS) for policies management of more complex migration decisions
- CIFS audit logging
- Real-time file blocking and filtering
- VSS support for Windows systems
- … and more!!
All of these advanced features for content and content management were developed on a hardware platform. I ate it up. I was completely converted. The possibilities were endless and the hardware accelerated, AND it was still easy to manage.
Over the years I continued to monitor Resistafly’s progress in the industry – in 2007 they were acquired, and from what I can gather this technology has been limited to scale-out NAS solutions and apparently wasn’t enough, as their parent company acquired yet another company in 2009, presumably for more scale-out capabilities. The updates included the same software capabilities on new hardware servers or storage systems. There were no significant software features announced in the area of content or content management, search or tight integration with other software solutions. It was just medium scale-out NAS. Siloed. Sad.
About 18 months ago, I finally admitted to Michael that I needed to eat crow. He was right and I was wrong. The “hardware” platform that won the bake-off in our carrot patch 5 years ago was no fluke, and all of the advanced software features that I wanted to integrate into a software platform indeed run better on HNAS. I expected these features to be in the other carrots by now, but apparently to others software is treated as a hardware enabler, while we (HDS) treated hardware as a feature enabler. I’d mentioned this story a few times internally during meetings with others and now I’m blogging about it for all of you: barbequed crow, when washed down with jumbo carrot juice, definitely tastes triumphant and successful.
The most exciting part about this journey is that all of this development was done with a great partnership between HDS and BlueArc. Just imagine what we’ll be able to accomplish now that BlueArc’s unyielding talent and technology are part of the Hitachi family! I’d like to take this opportunity to welcome the entire BlueArc team to Hitachi. I’ve thoroughly enjoyed working with you in the past on several projects and I look forward to working and furthering our innovation, strategy and vision together in the future.
BlueArc: A Bountiful Garden of BIG Data
by Michael Hay on Sep 7, 2011
These are exciting times we live in, indeed! Steve Jobs has officially resigned his post. HP has opened their kimono to enterprise focused strategy leaving behind WebOS and the PC business. And today we celebrate combining the innovative talent of BlueArc with the Hitachi family.
Our relationship with BlueArc began when we were tasked to find the perfect technology partner offering file systems or file storage appliances. The pursuit spawned some healthy competition internally, as we each wanted to find the perfect match. I named this endeavor Project Carrot. The field of companies and products we looked into we called the “Carrot Patch” and each company had their own code name based upon a type of carrot. Ultimately, Jumbo – BlueArc – surpassed all others in the Carrot Patch. (I’ll leave Ken to recount that series of events in more detail, but suffice to say the quest that we had anticipated turned out to be a no brainer when BlueArc’s technologies and team came on the scene.)
In 2008, Shmuel – BlueArc’s CTO, my personal friend and mentor – and I celebrated the beginning of super deep inter-product integration. This realized the joint vision the two companies had been working on for the better part of 18 months, as my previous blog post recounts in some detail. Our shared vision and spirit of collaboration have led to many customer wins for each company. To date each company has brought complementary specialties to bear on our mutual markets. For example, BlueArc is well known in verticals like Oil & Gas and Media & Entertainment, while Hitachi Data Systems serves other verticals such as Telecommunications and Financial Services. Our complementary market specializations have allowed us to address the overall market, gaining confidence and momentum along the way. Well, from today we can address all customers, all markets and all verticals as one Hitachi.
For HDS, we have many reference grade implementations of our joint vision with BlueArc across a variety of verticals. Here are some quotes from a bank and a telecommunication services company:
- Yes Bank – “Hitachi’s proven mature product, clear roadmap, budgets for research, and their having served the enterprise segment of storage for more than a decade, prompted us to approach them. They completely understood market requirements.” Mitesh Toila, Senior Vice President
- Ziggo – “The storage platform from Hitachi Data Systems offers us maximum flexibility in supporting the organization, which is essential in the dynamic telecom market.” Peter de Boer, Manager of Core Systems
These praises would not have been possible without BlueArc’s innovative technologies and collective team expertise. Moving forward I know we will see many more implementations at a quickened pace and of a heightened caliber as BlueArc culture infuses the Hitachi team with their pioneering spirit. Further I look forward to being able to meld our mutual companies’ IP together intermingling our unique attributes and benefits into a superior aggregated portfolio. Stay tuned and watch our progress!
Finally, I want to welcome all of my new colleagues to Hitachi and state that we all look forward to working with you to better address the market and our customers’ needs. I truly believe that you’ll find working with the family I know as Hitachi a special experience in your career. That is because “Hitachi’s honest intentions are to improve society, which we talk about through our concept of the Social Innovation Business.” (Hay, HDS as a Member of the Hitachi Family: The View From Japan).
The Emergence of the Storage Computer
by Claus Mikkelsen on Aug 31, 2011
I said that my next piece would be a guest blog from storage performance practitioner Ian Vogelesang on where the disk drive market is going. Ian has been on vacation, so I thought I would slip this one in.
Disk drives, Winchester drives, then storage subsystems, then arrays.
These have all been storage names over the years, but I’m thinking more and more that we should be calling them “Storage Computers” to keep up with the times. I cite 1992 as a pivotal year in storage, when the industry changed dramatically, and the march towards the Storage Computer began. Let me explain.
Prior to 1992, storage was just a dumb box and a commodity in every sense of the word. You could write to it, and read from it, but that was about it. As a storage vendor, the only thing we could ever compete on was performance, reliability, and price. And by the time this reached our customers’ ears, all that was heard was price, price, and price. There even used to be a company called “Reliability Plus” that rated the various vendors on reliability, just in case anyone actually cared.
What changed in 1992 was the emergence of intelligent storage function. The first was called Concurrent Copy (the first copy-on-write technology), and before long we vendors were piling various replication functions onto the storage subsystems. Then came RAID. Reliability Plus was gone, and the storage industry was plotting a new course.
Recently, we’ve been on a roll with things like dynamic provisioning, dynamic tiering and storage virtualization. Add this to all of the Capacity Efficiency functions such as compression, deduplication, single instance store, thin provisioning, “spaceless” snapshots, etc. Anyone out there want to define what a terabyte is these days?
One interesting combination of functions is HDP (Dynamic Provisioning) and HDT (Dynamic Tiering) from HDS. The combination of these two functions:
- Dramatically removes the task of “provisioning for performance”. That is, the Storage Computer can manage performance better than you can. Time and money saved.
- Manages the tiering of data at a small granularity (42 MB). That is, the Storage Computer will move 42 MB pages to their proper tier of disk. We’ve found that well over 80% of data residing on tier 1 disk does not need to be there. Again, time and money saved.
- Removes almost all of the physicality associated with a LUN, meaning a LUN should now just be viewed as a logical container for data. Grab what you think you need and don’t worry about grabbing too much since we thin provision underneath.
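As a rough illustration of page-granular tiering, here is a small Python sketch. It is emphatically not the HDT algorithm, which is not described in this post; it only shows the general idea of counting accesses per 42 MB page and keeping the hottest pages on the top tier. The function names, the simple access-count heuristic, and the reading of "42 MB" as 42 × 1024 × 1024 bytes are all assumptions made for the example.

```python
from collections import Counter

# Assumption: "42 MB" is read here as binary megabytes.
PAGE_SIZE = 42 * 1024 * 1024


def page_of(byte_offset):
    """Map a byte offset within a volume to its page number."""
    return byte_offset // PAGE_SIZE


def plan_tiers(access_offsets, tier1_pages):
    """Split accessed pages into (tier1, lower) sets by access count.

    A toy heuristic: the tier1_pages most frequently accessed pages go to
    tier 1, everything else goes to a lower tier. Ties are broken by page
    number so the result is deterministic.
    """
    heat = Counter(page_of(offset) for offset in access_offsets)
    ranked = sorted(heat, key=lambda page: (-heat[page], page))
    return set(ranked[:tier1_pages]), set(ranked[tier1_pages:])


MB = 1024 * 1024
accesses = [0, 10, 50 * MB, 50 * MB, 50 * MB, 100 * MB]
tier1, lower = plan_tiers(accesses, tier1_pages=1)
print("tier 1 pages:", sorted(tier1))
print("lower tier pages:", sorted(lower))
```

In this toy run, page 1 is touched three times and wins the single tier 1 slot, while pages 0 and 2 drop to a lower tier. The small page size is what lets a mostly cold volume keep only its few hot pages on expensive disk.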
With these functions and more, the Storage Computer is really being designed to take over many of the tasks that we all used to do manually. I think the biggest challenge, however, is getting storage admins and DBAs to “let go,” and let the automation take over. So far, we’re seeing many of our customers letting this happen, and ultimately, appreciating the benefits.
The Storage Computer, after a long march, has arrived.
Lies and Virtualization – Capacity Optimization is an Altered Reality
by Ken Wood on Aug 26, 2011
OK, so I lied about my last blog being the final of a three part series (1, 2 & 3). But aren’t we used to being lied to these days? I classify virtualization into three categories of prose: one-to-many, many-to-one, and this-to-that. (Turns out, I thought I blogged about this some time ago, but it was a paper I started and never finished. I’ll extract this and post it as a future blog soon.)
Virtualization, as we use it in our IT lives, is a lie. Virtualization technologies, whether in storage (which includes RAID, capacity optimization, Dynamic Provisioning, Dynamic Tiering, etc.), servers, CPU, file systems, or desktops, hide the real truth, and sometimes administrative tasks, from the rest of the world. Virtualization is an abstraction of a complex physical reality, or the aggregation of many simple elements, to a more understandable, usable, acceptable and simplistic view, on behalf of an application or user.
Virtualization lies to us for our own good. When used with capacity optimization technologies, as I’ve described in several of my previous posts, you will be presented with a false view, but you will like what you see. The amount of available space for storing your data will appear vast and voluminous (and in some cases “bottomless”) but under the covers lies a complex system of algorithms and coding used to squeeze, chop-up and delete your data into a smaller and smaller footprint. These layers, in fact, can continue down to another layer of abstraction, such as a dynamically provisioned volume of storage. This, in turn, records data on an array of disk drives protected by another layer of abstraction called RAID-6. Even the disk drives themselves are accessed via a logical block address, which hides and simplifies the realities of addressing data as cylinders, tracks and sectors, or steering you from the bad block replacement area.
The abstraction layer above capacity optimization technologies hides the truth from the user and application in order to focus on doing what they do best, and not manage capacity, or so it seems. More times than not, this just moves the administrative action line further out in time. Storage is not “bottomless,” but it can seem very deep. This layer presents, for example, a 5TB file system to the user or application, which can write data to this space. Depending on the architecture and the type of data being written, the file system’s utilization percentage will act strange to people who are used to chasing the 90% threshold for most modern file systems. It may seem like no data is being written, or like data was written, then disappeared, but when checking the file system, data (or at least the file names) exists.
What’s happening is that storage is being optimized against the data that already exists. The more data that is stored, the more data there is to compare against, the greater the chance that duplicate data is detected, and the more duplicate data is removed. The task at hand is managing and tracking the metadata that keeps these references. The mapping of the chunks that are used and shared across many files can tax a management system by creating a large database with heavy comparison tables.
This mapping of chunks (you can substitute blocks, pages, sectors, sub-files, etc. for chunks) IS the lie. That is, virtualization is hiding this from you.
Or, as David Black of EMC used to be fond of saying during our time together in SNIA Technical Council meetings, “…most problems can be solved in computer science with yet another layer of indirection…”
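For the curious, the chunk mapping described in this post can be sketched in a few lines of Python. This is a toy, assuming fixed-size chunks and SHA-256 digests as chunk identities; the `DedupStore` class and its method names are invented for the example and do not describe any particular product, which would typically use much larger or variable-size chunks and a persistent index.

```python
import hashlib


class DedupStore:
    """Toy deduplicating store: each unique chunk is kept exactly once."""

    def __init__(self, chunk_size=4):
        self.chunk_size = chunk_size
        self.chunks = {}  # digest -> chunk bytes, stored once
        self.files = {}   # file name -> ordered list of chunk digests

    def write(self, name, data):
        digests = []
        for start in range(0, len(data), self.chunk_size):
            chunk = data[start:start + self.chunk_size]
            digest = hashlib.sha256(chunk).hexdigest()
            # Only the first copy of a chunk consumes physical space.
            self.chunks.setdefault(digest, chunk)
            digests.append(digest)
        self.files[name] = digests

    def read(self, name):
        # The "lie": the caller sees a whole file, not the shared chunks.
        return b"".join(self.chunks[d] for d in self.files[name])

    def logical_bytes(self):
        return sum(len(self.chunks[d]) for ds in self.files.values() for d in ds)

    def physical_bytes(self):
        return sum(len(chunk) for chunk in self.chunks.values())


store = DedupStore(chunk_size=4)
store.write("a.txt", b"AAAABBBBAAAA")
store.write("b.txt", b"BBBBCCCC")
print(store.logical_bytes(), "logical bytes in", store.physical_bytes(), "physical bytes")
```

Here the two files present 20 bytes to their users while only three unique 4-byte chunks, 12 bytes, are actually stored. The `files` dictionary is the layer of indirection: lose it and the chunks are just an unordered pile of fragments.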
I’m Entitled to Title My Own Titles
by Ken Wood on Aug 25, 2011
I’ve been laying it on a little thick technically of late, so I thought I would throw a change-up in the mix before everyone thinks I’m losing my sense of fun. And, I’d rather have fun while working than working to have fun. Plus, my last post was rather long and voluminous, so here’s a short blog on, well, blogging.
Believe it or not, one of the more challenging parts of blogging is coming up with a catchy title. HDS bloggers are responsible for all of our content, including the title and graphics. I try to be catchy, vague and a little profound, while playing on words, especially with the multiplicity of meanings between the rest of the world and how the techno-world communicates. According to my count, this is my 27th post since I started guest blogging with Michael Hay back in July of 2009, and a lot has evolved since then. I could have waited for a more significant milestone, like number 30, but I want to get back into more technical blogging next. Of these 27 blogs, my favorite titles are listed in no particular order below with some comments on the meanings.
- Chips or Pits! How do you like your content?
- Chips, meaning integrated memory circuits, or pits, the “bits” of information burned into an optical medium.
- Storage Fusion – StoraFgUeSION – SfTuOsRiAoGnE
- An evolving intermixing of storage technologies.
- Does HDP Make Gas, or Just Removes Gas?
- Percentage improvements can lead to overstated claims, like adding up the gains promised by gas additives and engine improvements, and HDP could reduce your stomach anxieties as a system or storage administrator.
- I mainly like this one for the graphic in the post that illustrates the title.
- The USS Starship Enterprise Storage Array
- The obvious tie in to Star Trek and storage controllers.
- YAAA! – Yet Another Automobile Analogy
- This is a play on the Unix parser generator called “yacc” – Yet Another Compiler-Compiler.
Which one is your favorite?
As I stated, these are in no real particular order, but my favorite is “YAAA! – Yet Another Automobile Analogy”. Obviously, titling could range anywhere from “blog entry number 28” to outright lying about the content of the blog topic itself. If the latter were to happen, I would feel like I was spamming or scamming everyone. So there is always a tie-in no matter how outrageous; the title is always at least loosely related.
This also leads me to product labeling. It amazes me how some product names over the course of history can become forever linked with the function of the product or company’s name, like Crescent wrench, Kleenex tissue, Xerox copy, or Google. The name or label is actually part of the world’s new lexicon, regardless of language.
I am currently working with some technology from Hitachi’s Central Research Lab, dealing with speech and language. It’s interesting how many words mean the same thing, and are spoken the same way in any language. However, these words tend to be basically modern, stemming from the technology fields. I can’t fully blog about this technology at this time, but at some point I will describe it as it relates to other projects. When referring to “Google” the company or “google” the synonym for “search,” you’ll be surprised how different cultures are similar.
Grout Expectations, and a Disk Story
by Claus Mikkelsen on Aug 24, 2011
It’s been two weeks since returning from Thailand on my Habitat for Humanity build. After the 17-hour flight home from Thailand to the US, I had a 48-hour turnaround before heading off to Mexico City for the week to participate in another one of our remarkably successful executive briefing centers with Hu Yoshida (see his post on this trip).
No rest, wicked or otherwise.
As I mentioned in my previous blog, my “vacation” included building a home outside of Udon Thani, Thailand, courtesy of Habitat for Humanity. These guys are awesome, and do some very good work. We started with nothing, poured a foundation, built walls of cinder block, poured a concrete floor, set in doors and windows, and dug a septic tank. In only a week and a half, we (there were 16 of us) turned dirt into a home for a great family with two boys (one aged four, the other less than a year old).
Elegant housing? Hardly, but clearly a step up from where they were living previously. And, since my previous blog was titled “Pounding Nails,” I decided to call this one (with all due respect to Charles Dickens) “Grout Expectations,” since there were no nails involved (just a bunch of cinder blocks and concrete).
Here are some pictures from my time in Thailand.
Habitat has one of the most bizarre business models I can think of, where they (through their Global Village program) offer folks the opportunity to pay them money for the privilege of working FOR them. I hope HDS does not adopt this model.
All joking aside, it really is a great opportunity to give some sweat equity for a good cause, and at the same time see parts of the world not generally on the tourist (or business) track. I thought I could sell a couple of VSPs whilst there, but to no avail.
I’m not endorsing Habitat exclusively (although their “builds” are a lot of fun), but I do endorse the concept of getting out of the mainstream box and seeing the real world. Over the weekend in Thailand, I suggested to the group that we traipse over to Laos (since the border was only about 40 miles away), so we did. We started with a nice lunch on the Mekong River, then spent the rest of the day touring the capital city of Vientiane. On Sunday we visited an HIV/AIDS orphanage, which was both sad and uplifting at the same time. Playing with the kids was fun, and they certainly seemed to enjoy the attention.
Next year, body willing, the plan is to move the operation to Ghana in West Africa.
I did want to talk about the disk drive market before I close this. I came across this link on Twitter before I left on my vacation. It’s a year-old article, but it still has some interesting perspectives on the HDD industry, and it quotes a Seagate representative who was predicting 100TB-300TB drives by the year 2020 (only nine years away!). I also remember an article from Gartner or IDC in the year 2000 stating that the average array size being shipped in the industry was 1.2TB. So now that we’re about midway between 2000 and 2020, it’s interesting to look back a decade and also into the future, and imagine what the larger storage industry will be like.
I bring this up as a teaser to my next blog, which will actually be a guest post from storage performance practitioner Ian Vogelesang, who knows much more about the HDD industry than I ever will. I’ve read it, and it’s a pretty interesting take from a very smart guy. You’ll enjoy it.
May I Please Have Some More Capacity Optimization, Sir?
by Ken Wood on Aug 16, 2011
So, this is the third and final installment of my blog series on capacity optimization techniques. The first article was on file level single instancing and file level compression, which also included a combination of the two. The second article described how data de-duplication works, which I demonstrated by using Linux commands.
In this post, I’ll show you how combining file-level de-duplication with compression can save even more capacity. Data de-duplication also implies that sub-file single instancing is in use under the covers. Since I enjoyed writing a demonstration-style article that showed some of the ways this technology works under the covers, I’m going to use that approach again.
In my previous post, “To De-dupe, or Not to De-dupe, That is De-data”, I demonstrated that the “split”, “md5sum”, “rm” and “cat” native Linux commands (at least in my CentOS 5.4 system) can be combined to de-duplicate a large file into a smaller file footprint (several smaller sub-files) at almost a 3:1 capacity savings ratio. What I plan to demonstrate here is combining a form of file-level single-instancing and file-level compression, AND file-level data de-duplication to reduce the capacity footprint further. In this case, I’ll show how a sub-file is a sub-file is a sub-file. Just because a sub-file was “extracted” from another original file doesn’t mean it can’t be used somewhere else. In fact, these “multiple references” to a sub-file are the most powerful feature in reducing the storage capacity footprint in de-duplication systems.
Note: This demonstration is to illustrate the core functions of file-level de-duplication and capacity optimization techniques. It should not be used to de-duplicate your production data as a capacity savings technique!!!
In the screenshot below, I have created two files, “blogtest.dog” and “kentest.bup”. They are approximately 4.4 MB in size each for a combined total of about 8.9 MB. The first thing I do is find out what the MD5 hash fingerprint is for both files using the “md5sum” command. The two files are different, therefore I can’t file-level single instance these two files.
Next, I’ll use the “split” command to split the files into fixed-size 256KB sub-files, then I “ls -l” the directory to show the resulting sub-files and their sizes. I prefixed the output files with “1st-” for “blogtest.dog” and “2nd-” for “kentest.bup”. You should be able to see that filenames “1st-aa” through “1st-aq” correspond to file “blogtest.dog” and that filenames “2nd-aa” through “2nd-aq” correspond to file “kentest.bup.” Both sub-file sequences end with sub-filename “*aq”; however, the sizes of these last sub-files reflect the size differences of the originals. Counting the number of sub-files generated (using the “wc” command) shows that 17 sub-files were created for each original file, or 34 sub-files in total.
Similar to before, I will now calculate the MD5 hash fingerprint of each sub-file using the “md5sum” command on both sub-file sequences. You should be able to see that the hash fingerprint “ec87a838931d4d5d2e94a04644788a55” is present in both sets of sub-files from both original files. This means that each of the 2 original files contains a set of 256KB sub-file patterns that are identical to one another. This also means that both original files can share this sub-file between them, thus I can delete all sub-files that calculate to this fingerprint except for the first one. So, I keep the first sub-file “1st-ab” with the hash signature “ec87a838931d4d5d2e94a04644788a55” and execute the “rm” command on the remaining sub-files with the same fingerprint.
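The fixed-block split can be sketched in a few lines of Python. This is only an illustration of the idea, not a replacement for the “split” command; the all-zero 4,400,000-byte buffer is an assumed stand-in for a file of roughly 4.4 MB, since the real file contents are not available.

```python
CHUNK_SIZE = 256 * 1024  # 256KB, as in the walkthrough


def split_fixed(data, chunk_size=CHUNK_SIZE):
    """Return consecutive fixed-size chunks; the last one holds the remainder."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


# Assumed stand-in for a ~4.4 MB file.
sample = bytes(4_400_000)
chunks = split_fixed(sample)
print(len(chunks), len(chunks[-1]))
```

With these assumed sizes the split yields 16 full chunks plus one shorter remainder, which lines up with the 17 sub-files per original file counted above.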
As you can see, I have “de-duplicated” the total number of sub-files from 34 down to 10 sub-files, 4 for the original file “blogtest.dog” and 6 for the original file “kentest.bup”, and there are no sub-file instances containing the “ec87a838931d4d5d2e94a04644788a55” hash fingerprint for the original file “kentest.bup”. Basically, each sub-file is now a unique piece of data.
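The hash-then-delete step can be sketched as follows. The function name `dedupe` and the chunk contents are made up for illustration; only the “1st-”/“2nd-” naming mimics the walkthrough. Note that the sketch trusts the MD5 fingerprint alone, as the manual demonstration does.

```python
import hashlib


def dedupe(named_chunks):
    """Keep the first chunk seen for each MD5 fingerprint; drop later duplicates.

    named_chunks: list of (name, bytes) pairs in the order they were produced.
    Returns (kept, dropped): kept maps fingerprint -> name of the survivor,
    dropped maps each removed name -> name of the survivor it duplicates.
    """
    kept, dropped = {}, {}
    for name, chunk in named_chunks:
        fp = hashlib.md5(chunk).hexdigest()
        if fp in kept:
            dropped[name] = kept[fp]   # the "rm" step, remembered as a reference
        else:
            kept[fp] = name
    return kept, dropped


# Made-up chunk contents for illustration only.
chunks = [
    ("1st-aa", b"header-1"),
    ("1st-ab", b"\x00" * 16),
    ("1st-ac", b"\x00" * 16),
    ("2nd-aa", b"header-2"),
    ("2nd-ac", b"\x00" * 16),
]
kept, dropped = dedupe(chunks)
print(sorted(kept.values()))
print(dropped)
```

The `dropped` mapping is the part the manual process leaves in your head: it records which surviving sub-file stands in for each deleted one, across both original files.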
The amount of capacity occupied by these two original files has now been reduced from approximately 8.9 MB to 2.5 MB, assuming we actually deleted the original two files. This is approximately a 3.5:1 reduction ratio.
But wait! There’s more.
Now let’s compress the remaining sub-files to see how much additional capacity savings we can achieve. By using the “gzip” command, I compress the 10 sub-files individually and replace the original sub-file with the compressed sub-file and append the “.gz” label after the filename. There are still 10 sub-files, but now the amount of capacity occupied has been dramatically reduced further. The combined total capacity of the two original files is now approximately 196 KB! So from 8.9 MB to 196 KB, this is about a 45:1 reduction ratio.
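For anyone who wants to check the arithmetic, here is a small Python sketch that recomputes the ratios from the rounded figures quoted above and compresses one stand-in chunk. It assumes MB and KB are decimal units, and the all-zero chunk is purely illustrative, so real data will compress very differently.

```python
import gzip


def reduction_ratio(original_bytes, stored_bytes):
    """Capacity before optimization divided by capacity after."""
    return original_bytes / stored_bytes


# Rounded figures quoted in the post (decimal units assumed).
original = 8.9e6
after_dedupe = 2.5e6
after_gzip = 196e3
print(f"de-dupe only: {reduction_ratio(original, after_dedupe):.2f}:1")
print(f"de-dupe + gzip: {reduction_ratio(original, after_gzip):.1f}:1")

# One highly repetitive 256KB chunk as a stand-in for a real sub-file.
chunk = b"\x00" * (256 * 1024)
compressed = gzip.compress(chunk)
print(len(chunk), "->", len(compressed))
```

Because the inputs are rounded, the computed ratios land close to, rather than exactly on, the approximate 3.5:1 and 45:1 figures.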
Of course, this is a dramatization and a demonstration. Your actual de-duplication ratios will vary considerably, or as they say, “your mileage may vary”. It really depends on the type of data you have to store.
So, now we have to “rehydrate” the two original files to their fully bloated original state by reversing this process. As you recall from the previous post, the high-level order in which the data de-duplication functions happen is:
- Chunk it
- Hash it
- Toss it or keep it
For this extra level of capacity optimization, there are a couple of additional steps.
- Chunk it
- Hash it
- Toss it or keep it
- Compress it
- Reference it
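The five steps can be put together in one short Python sketch. The function name `optimize`, the 8-byte chunk size and the sample data are assumptions chosen to keep the example readable; a real system would use far larger chunks and a hardened index.

```python
import gzip
import hashlib


def optimize(files, chunk_size=8):
    """Chunk, hash, toss-or-keep, compress, and reference a set of files.

    files: dict of file name -> bytes.
    Returns (store, recipes): store maps fingerprint -> gzip-compressed chunk,
    recipes maps file name -> ordered fingerprints (the references).
    """
    store, recipes = {}, {}
    for name, data in files.items():
        recipe = []
        for i in range(0, len(data), chunk_size):      # 1. chunk it
            chunk = data[i:i + chunk_size]
            fp = hashlib.md5(chunk).hexdigest()        # 2. hash it
            if fp not in store:                        # 3. toss it or keep it
                store[fp] = gzip.compress(chunk)       # 4. compress it
            recipe.append(fp)                          # 5. reference it
        recipes[name] = recipe
    return store, recipes


files = {
    "one.bin": b"AAAAAAAA" * 3 + b"tail-one",
    "two.bin": b"head-two" + b"AAAAAAAA" * 2,
}
store, recipes = optimize(files)
print(len(store), [len(r) for r in recipes.values()])
```

Seven chunks are referenced but only three are stored, and the repeated chunk is shared between the two files.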
Technically speaking, the Reference it part is done even without the compression step, so even in the three-step version, there is a Reference it step. However, I’m highlighting this in this blog post because we did two extra steps to achieve the extraordinary capacity optimization results: sub-file Single Instancing and sub-file Compression. The sub-file Single Instancing comes from the one common sub-file between two completely separate original files. This can be illustrated in the diagram below. In fact, this is going to serve as the mapping we will use to rehydrate these compressed sub-files back to the original files.
Again, instead of deleting the original files, I’ve renamed them so that I can do a binary comparison of everything in the end. Then, using the “gunzip” command, I uncompress the sub-files back to their original 256KB chunk size, except of course for the last sub-files, which are the remainder of the original files during the chunking process. Now we need to assemble the files back together. I use the “cat” command to concatenate the sub-files together in the proper order. I use the sub-file “1st-ab” as a replacement for sub-files “1st-ac”, “1st-ad”, “1st-ae”, “1st-af”, “1st-ag”, “1st-ah”, “1st-ai”, “1st-aj”, “1st-al”, “1st-am”, “1st-an”, “1st-ao”, “1st-ap”, “2nd-ac”, “2nd-ad”, “2nd-ae”, “2nd-ag”, “2nd-ah”, “2nd-ai”, “2nd-al”, “2nd-am”, “2nd-an”, “2nd-ao” and “2nd-ap”, which were all deleted earlier. This is to re-create the original files “blogtest.dog” and “kentest.bup”.
Initially, you can see that the files rehydrate back to their original sizes. To ensure that everything went back together properly, I run the “md5sum” command against the newly rehydrated files and compare them to the original renamed files, then I perform a full binary comparison with the “cmp” command to make sure everything is 100% perfect.
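Rehydration and the two checks can be sketched in Python as well. The `store` and `recipe` shown here are hypothetical: one shared chunk referenced three times plus a short tail, loosely echoing how “1st-ab” stands in for the deleted sub-files.

```python
import gzip
import hashlib


def rehydrate(recipe, store):
    """Rebuild a file by decompressing and concatenating chunks in recipe order."""
    return b"".join(gzip.decompress(store[fp]) for fp in recipe)


def verify(original, rebuilt):
    """Mirror the md5sum check followed by a byte-for-byte cmp."""
    same_hash = hashlib.md5(original).hexdigest() == hashlib.md5(rebuilt).hexdigest()
    return same_hash and original == rebuilt


# Hypothetical store and recipe.
shared, tail = b"\x00" * 32, b"end"
fp_shared = hashlib.md5(shared).hexdigest()
fp_tail = hashlib.md5(tail).hexdigest()
store = {fp_shared: gzip.compress(shared), fp_tail: gzip.compress(tail)}
recipe = [fp_shared, fp_shared, fp_shared, fp_tail]

original = shared * 3 + tail
print(verify(original, rehydrate(recipe, store)))        # correct order
print(verify(original, rehydrate(recipe[::-1], store)))  # wrong order
```

The second call feeds the chunks back in reverse order, which is enough to make the comparison fail.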
Trust me when I say that if any of these pieces don’t go back together in the right order, then the hashes will not match up correctly. Then you know you have a problem. As I have shown, this could get to be a laborious task by hand. Scripting could be an option to automate several aspects of this process. However, the best way is to let an appliance with reliable code and a hardened database do this for you; it makes all of these steps invisible. I’ve gone through these steps for you to show a little bit of what’s under the covers to this technology—or maybe what’s not under the covers. The combination of several of these techniques also has the potential of saving large amounts of capacity beyond any one method alone.
A True Holographic System Would be Disruptive
by Ken Wood on Aug 9, 2011
A recent article by The Register’s Chris Mellor was passed my way by a colleague, at a perfect time. I was planning to post a blog this week on Hitachi’s contribution to the world of optical and holographic storage, but I didn’t know how I was going to introduce the subject.
Chris has just given me the angle I need. I have written a couple of pieces here and there about optical storage and somewhat knocked the holographic commercialization efforts by InPhase, mainly due to the events in February 2010.
In these blog posts, I also mentioned some of the R&D breakthroughs Hitachi contributes to the optical storage industry and the advancement of Blu-ray™ technology through Hitachi’s HLDS division. However, through Hitachi’s Central and Yokohama Research Labs, Hitachi is a major contributor to the advancement of holographic storage – true holographic storage.
What I mean by “true” holographic storage (versus the implied “not true” holographic storage) is the method of how a unit of information is actually recorded and read from the media. In the true holographic storage architecture, the smallest unit imprinted on the media is a page (essentially, a hologram). Today that page is approximately 1 mega-pixel in size and is used in Hitachi’s Angular-Multiplexing architecture, whereas the Micro-Holographic architecture used in GE’s approach is still based on a bitwise recording method. Translation: one unit of data written or read with the Angular-Multiplexing architecture results in a 1 megapixel transfer, whereas in the Micro-Holographic architecture the smallest unit written or read is still a bit.
The primary difference between these two approaches in my opinion (beyond speed and capacity) is compatibility. GE’s Micro-Holographic architecture maintains the highest level of compatibility with the current Blu-ray recording format, meaning there is a transitional, yet painless, stage in going from a Blu-ray based system to a page-based system. Again, this to me is not a true holographic system, though there are some holographic techniques employed to achieve a higher density than current Blu-ray formats. Also, when thinking about high capacities, you should always start thinking about the speed at which the medium can transfer this data. This is one of the primary knocks on optical storage devices today and magnetic disks for that matter. Hence, the rise of the SSD.
The Angular-Multiplexing architecture, developed by Hitachi, InPhase and Mitsubishi’s MCRC division (both drive systems and media development), is NOT compatible with existing media or disc formats that are commercially available. This would be a disruptive technology compared to existing optical storage systems. While this may initially seem to be a challenge, the adoption of optical storage systems in production datacenters or the enterprise has had bouts of fits and sputters to gain any traction aside from software, content distribution mechanisms and niche segments. So, starting up a system based on this technology would be a green-field effort anyway.
However, many believe that the Angular-Multiplexing architecture is the most promising next generation of optically based storage medium for the future. Twelve terabytes of capacity at 6 Gbps transfer rates on a single-sided 12 cm disk are within the realm of eager discussion when looking to develop commercially available products and design solutions of the future. This means technology stages and levels prior to this goal are even closer to reality than one might think. In fact, the first 500 GB holographic storage system (I use Holographic Storage System versus Holographic Data Systems due to the HDS confusable reference) was demonstrated by Hitachi at ISOM 2009, and Hitachi has also shared papers and presentations on several of the manufacturing attributes required for mass productization.
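To put the capacity and transfer rate side by side, here is a back-of-the-envelope Python calculation. It assumes decimal units (12 TB as 12 × 10^12 bytes, 6 Gbps as 6 × 10^9 bits per second) and a sustained rate with no overhead, neither of which is stated in the figures above.

```python
def full_read_seconds(capacity_bytes, rate_bits_per_second):
    """Time to stream an entire medium at a sustained transfer rate."""
    return capacity_bytes * 8 / rate_bits_per_second


seconds = full_read_seconds(12e12, 6e9)
print(f"{seconds:.0f} s, about {seconds / 3600:.1f} hours")
```

Under those assumptions, reading a full disc end to end would take a little under four and a half hours.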
Just to complete the Holographic Storage System scene, there is another approach in the mix of competing architectures called Collinear architecture. This architecture is similar to GE’s Micro-Holographic architecture in that it maintains some compatibility with the Blu-ray formats, but is also a page recording based approach. This approach essentially sits between the Angular-Multiplexing architecture and the Micro-Holographic architecture in density and capacity, speed and compatibility.
There are experts and bleeding-edge visionaries looking to design exabyte-scale persistent long-term archives who believe magnetic disk-based storage solutions won’t be viable for this purpose for another 50 years, and don’t even try to squeeze or force magnetic tape into solving this problem.
Now, don’t misunderstand my statement here. When looking at the overall big picture in the way of “solution”, I’m talking about the “data” portion of a long-term archive. Magnetic disk storage, plus some acceleration from SSD storage, will still be the medium of choice for the rest of the archive. In fact, there are discussions taking place that suggest the metadata, indexes (enhanced indexing, a topic for another blog sometime in the future), and other non-data relative information associated with the long-term data archive quite easily could range from half to equal the size of the data archive. This means, for a 1 EB data archive, the total amount of raw data stored, metadata and data, could be as high as 2 EB. From a TCO point of view, being able to practically cut the power usage and other facility costs nearly in half from this number would be a huge benefit, especially when the majority of the workload will be on the metadata mining, and only a small percentage of the data is being accessed.
Expect me to write up several follow-up articles in this space as this is an exciting subject to discuss. It is also quite futuristic in nature even though the concept and early prototypes for holographic storage have their roots dating back to the 1960s. It is only recently that great strides in the advancement of this technology have started to come to the forefront of people’s minds. Coincidentally, Hitachi’s full involvement and joint development efforts date back to 2006, around the time the last holographic buzz peaked.
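The sizing range is easy to work through. This small Python sketch takes only the 1 EB data archive and the half-to-equal metadata range from the paragraph above; the function name `archive_totals` is made up for illustration.

```python
def archive_totals(data_eb, metadata_fraction):
    """Return (metadata size, total size) in EB for a given metadata fraction."""
    metadata_eb = data_eb * metadata_fraction
    return metadata_eb, data_eb + metadata_eb


for fraction in (0.5, 1.0):
    metadata_eb, total_eb = archive_totals(1.0, fraction)
    print(f"metadata {metadata_eb:.1f} EB, total {total_eb:.1f} EB")
```

So the total footprint for a 1 EB data archive falls somewhere between 1.5 EB and 2 EB, with the data portion making up between two thirds and one half of it.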
Since Hitachi’s contributions and efforts here tend to be subdued and hidden, it will be my job to brag about these efforts publicly.
A great honor for a great leader
by Michael Hay on Jul 26, 2011
Today I want to draw attention to the 2012 IEEE(*1) Reynold B. Johnson Information Storage Systems Award, which was granted to Dr. Naoya Takahashi. The Hitachi press release details several of Dr. Takahashi’s achievements, and I want to key off this statement in the release:
“The achievements which this award recognizes could not have been realized without the long-term cooperation between users worldwide who shared invaluable comments and feedback, and the engineers and service representatives who listened and worked to incorporate those ideas into new products. The Hitachi Group is committed to continue developing technologies and products in order to contribute to the expansion of the world’s storage industry.” — Naoya Takahashi
Dr. Takahashi’s nurturing style helped to develop what was many years ago a nascent business into a very successful global storage and IT company, HDS. Further, his leadership was and continues to be very real and very personal to many of the colleagues I work with on a daily basis. In fact there are many stories illustrating his leadership and courage as a global innovator.
There are some I cannot share, but there are aspects of one that epitomize his leadership. This past March, Dr. Takahashi came to RSD just before his transition to a new role within the Hitachi Group, President & CEO of HES. During his speech he encouraged RSD to continue aggressively pursuing the global storage business, and at the end of the discussion we all walked with him out the door, signaling the closure of one chapter of his career at Hitachi and the beginning of a new one. This scene really tugged at my heart as it showed how deeply respected and well thought of Dr. Takahashi is within Hitachi, and especially at RSD and HDS.
So congratulations Dr. Takahashi on the IEEE award and such a distinguished career (so far, and I’m sure there’s more to come). And rest assured that the culture and spirit of global innovation that you fostered so long ago at RSD and HDS is accelerating, preparing us for the years and decades to come.
Pounding Nails and the Reach of Technology
by Claus Mikkelsen on Jul 23, 2011
So it’s time for that all-too-rare vacation to somewhere. I’ll undoubtedly trigger an out-of-office message that says I’ll not be checking email, voicemail, or any other kind of “mail”, but probably will sneak in an occasional digital-check, if I can. That said, I’ll also be traveling in places where electricity, much less wireless, should never be assumed.
A number of years ago I came to the conclusion that all business cities in the world are pretty much the same. Airports, hotels, office buildings, and taxis make business travel pretty homogeneous.
Not that I’m complaining, mind you. The flight upgrades, top notch hotels, and mind-blowing meals are to be savored, and certainly contribute to the enjoyment of what would otherwise be a very dreary experience, but more and more I sensed that when I traveled, especially abroad, I never really experienced the city, much less the country, I was visiting. And the people I would interact with were definitely not representative of the populations as a whole. Just like me, but with different languages and customs.
We speak a common language: Enterprise Storage. Pretty exciting, really!
That began to change a few years ago when I took a 2-week photography excursion into rural China with an (incredibly awesome) professional photographer as a guide. It was a photography class, essentially. But seeing the REAL rural China was quite a sight and an eye opener. (I’ll skip over the part where I almost got arrested for being in the “wrong place”).
That trip was followed by a Habitat for Humanity build in rural Comanesti, Romania last year. And then in January of this year yet another visit to parts, villages, and city slums of India. Realizing that 80% of the world’s population lives in conditions like this or worse not only makes one appreciate what we have, but how much more we need to accomplish.
One struggle I’ve always had as a hi-tech guy is its relevance to the betterment of the overall population of the world. Who cares if I can shave 2 milliseconds from every Oracle transaction? Who cares that my backup has been de-duped? Who cares that I can thin provision a LUN and dynamically move small granularity data to cheaper storage? And who cares that we can virtualize storage and servers, thereby reducing CapEx, OpEx, and environmentals?
So who cares? About 7 billion people care. When you see a family in a small village in India – two hours away via 4WD from the nearest town – enjoy TV and have cell phone coverage to dispatch medical care when needed with recently-installed electrical and 3G infrastructure, that’s huge. To see that small Chinese village that received its first electricity two years earlier, install refrigerators and other modern appliances and coordinate the distribution of their rice crops with modern technology, is impressive.
Seeing laptops with spreadsheets in the rice paddies is just too cool. And to have a frail old man buy a beer for me and my friend in a small village in Romania, realizing that my hourly income exceeds his yearly income, well, with all due respect to MasterCard, that is priceless.
Technology makes this all possible (well, except for that beer; I just had to throw that in). The reach of technology into the most remote areas of the world is amazing to me. It’s not just the visible signs of villagers with laptops and cell phones, but the less visible indications of governments coordinating the enforcement of child labor laws, distribution of food, weather warnings, transportation, education and healthcare. I’m not expecting to sell a 200GB AMS2500 to that small Indian village, but I do see that their living conditions have improved dramatically because of the wide reach of technology.
So I’m off again, this time to a small village outside of Udon Thani, Thailand, for my second Habitat build. Rather than pounding keyboards and ticket counters, I’ll be pounding nails for 2 weeks.
I’ll see y’all when I get back.
EMC/HDS: Fundamental Strategy Difference
by Claus Mikkelsen on Jul 20, 2011
I’m sure many of you have seen EMC’s latest product announcement, the VMAXe. It’s basically a “mini” VMAX which makes the name rather oxymoronic (VMAXmini?), I think. But I’d like to use that announcement to point out a basic difference in the storage strategies of our two companies.
And to do this, I’d like to start with Brian Gallagher’s quote in the announcement, that the VMAXe is “purpose-built”, meaning, obviously, that it is built for a specific purpose. That’s the line that jumped out at me, since it really demonstrates our strategic difference. Why does one have to build a separate product to address a specific need, when one should be capable of filling that “need” in an existing product?
I wrote a magazine article in late 2003 called “Buckets, Pipes, and Pools” giving my vision of where storage was heading, and among other things, I decried the proliferation of all of the “buckets” of storage the industry was cranking out. It seemed to be heading in the wrong direction. The article was translated into French so I’m not sure you can search on it (unless you speak French, of course), but if anyone wants a copy (in English) I can send it to you. But I think EMC has just invented yet another bucket.
Our strategy, on the other hand, can be explained in a nutshell: “One platform for all data”. Yes, you can point to our different products and claim otherwise, but we’re definitely on the road to consolidate different platforms, not crank out more buckets (‘scuse me, purpose-built buckets) of storage. One platform for everything, where data can fluidly and dynamically move up and down tiers of storage until it’s been archived. All data. That’s the difference.
It’s clear to me that the VMAXe is targeted at our own VSP. Our VSP can range from a fully popped and smokin’ 2048-drive 2-chassis model with up to 1TB of cache all the way down to a 14U “controller only” version sitting in a 19” rack.
That little controller-only version can virtualize literally petabytes of external storage, providing all that storage with all forms of replication, Dynamic Provisioning and Dynamic Tiering, etc. We decided NOT to call that the VSPe.
But the point is that the VMAXe is trying to fill this gap, but it doesn’t come close and here’s why:
- The entire configuration range of our VSP runs the same microcode so has all the same functionality. The VMAXe runs a different version of Enginuity.
- The upgrade path on the VSP from the smallest to the largest, is seamless. The VMAXe cannot be upgraded to the VMAX which is why it is yet another “purpose-built” bucket.
- The VMAXe does not support z/OS, encryption, or SRDF, and it certainly still has no support for external storage virtualization. VSP supports all of that and more.
So although the VMAXe attempts to fill a big gap in the EMC lineup, it really does not.
I expect to get comments that “it’s cheaper”, but price is settled on the street, and as David Merrill continually points out in his blog, “cost does not equal price”.
C’mon, guys, let’s stop with the proliferation. You’re making storage harder to manage, not easier.
Why Holding Data Prisoner is Not a Good Idea
by Michael Hay on Jul 18, 2011
Have you watched Apple’s Lion and iOS 5 keynotes? Being the fanboy that I am, I snuck a peek on my AppleTV and was greeted with several new concepts including making the cloud invisible. I was, however, struck by a comment from Steve Jobs that I’d like to address:
“A lot of us have been working for 10 years to get rid of the file system so the user doesn’t have to learn about it.” – Steve Jobs (06 June 2011 WWDC).
I know that there is a lot of debate over this statement and I want to weigh in on the topic – especially in the context of Big Data.
As I discussed in my previous post, I think about Big Data as a way to transform existing data/content into appreciating assets. To do that we will need applications that are separate or independent from the content they produce; the combination of seemingly disparate content may yield new super valuable information we haven’t thought of before. One might call this “ah ha” data or data mashups, and I believe that the mashup idea is actually a great way to visualize what is possible. We can visualize GUI platforms that combine many different data sources into one, and help the user reach new conclusions.
The image above is a screenshot of the iGeoPix application on the iPad/iPhone, which links photos to your current location with a bit of a twist: the larger photo in the center is close to your currently selected location, whereas the smaller ones are farther away. This kind of information can help you as the user better plan a vacation day so that you know what places are nearby on any given day. We also could imagine other data or content complementing the application, like which venues are the most popular, etc.
This mashup application requires various content types to be brought together, which allows the user to see the content in a new way to create new information or establish a new perspective. If the content was all self contained and could not escape from the containing application, then iGeoPix would be impossible to build. So if we take Mr. Jobs’ statement literally, and make the file system or another mechanism of free interchange go away, then this class of mashup application would disappear. (Note that I’m pretty sure Steve was more focused on the user experience, but during his keynote speech he didn’t really make the distinction.)
Storing data in a non-obfuscated form, so that the data can be unleashed from the application, is a key theme by which Hitachi lives and breathes. We view ourselves as guardians or trustees of the data and metadata you store with us, not as the owners, and not as a warden keeping you from your data.
Single Point Compression is Not a Black Hole
by Ken Wood on Jul 10, 2011
This blog is the first in a series of short articles I plan on writing over the next several weeks describing capacity optimization techniques and designs. This series will describe various standard data manipulation algorithms for reducing the amount of storage required to store data such as single instancing, compression and de-duplication.
The amount of data being generated today is staggering. The amount of data being stored is just as amazing. Those of us who have been in this industry for a while (my career started in the late ‘70s) can attest to the time when storage was measured in kilobytes on real rusty platters. This is not a reminiscence piece, but more of a look at how far we’ve come and what we’re doing about it. Back then there wasn’t as much digital data being generated or kept, and the majority of that data was human generated versus today’s machine generated data.
The technique of storing more data in less space, or of giving the perception that actual data is being stored, is not new. Data compression utilities built into operating systems and available as add-on applications have been around for decades, and single instancing has been available in Unix OSes for just as long through symbolic and hard links. In fact, I remember examining scripts and programs that were actually the exact same file with a different file name. The file name you used to invoke it would change the behavior of the program. In reality, the different file names of the program or scripts were hard linked to the same file in the file system. Examining the argv[0] argument (the name the program was invoked as) would cause a different set of instructions to be executed within the program. This was a way of saving time maintaining code, but also a very early form of data de-duplication – Single Instancing. This technique was also used with hard links and casing off the $0 argument for scripts.
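To make the argv trick concrete, here is a minimal sketch in Python rather than the C and shell of the day. The command names `squeeze` and `unsqueeze` and both handler functions are hypothetical, invented purely for illustration; the idea is only that one file on disk, reached through several hard-linked names, picks its behavior from the name it was invoked as.

```python
import os

# Hypothetical behaviors; in a real tool these would do actual work.
def compress(args):
    return "compressing " + " ".join(args)

def expand(args):
    return "expanding " + " ".join(args)

# Hypothetical invocation names, each a hard link to this same file.
COMMANDS = {"squeeze": compress, "unsqueeze": expand}

def dispatch(argv):
    """Choose the behavior from the name the program was invoked as (argv[0])."""
    name = os.path.basename(argv[0])
    handler = COMMANDS.get(name)
    if handler is None:
        raise SystemExit(f"unknown invocation name: {name}")
    return handler(argv[1:])
```

A script built this way would end with something like `print(dispatch(sys.argv))`, and a command such as `ln squeeze unsqueeze` would create the second name without storing a second copy of the file.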
So today, there’s file-based single instancing in appliances and software products. Basically, this is the orchestrated pointing of file names or pathnames to a single stored copy of files that are exactly the same, usually using symbolic links and/or stub files, along with the management of those links and stub files. This assumes that a significant number of the files stored are the same file under different file names or possibly different pathnames. In many common environments, there are high percentages of files that are the same, but with different file names or pathnames, so file-based single instancing can provide a very good level of capacity optimization for network file servers.
Hitachi Content Platform (HCP) uses this technique to optimize its storage capacity. The removal of a file is extremely sensitive, so a two-step integrity check is performed to ensure that the file being “single instanced” is an exact copy. First, a quick match of the hash signatures associated with each file creates a candidate list of potential file copies. Then a binary compare of the files in the candidate list is performed. Only when an exact match is made does the file get single instanced, that is, replaced with a pointer to the already stored file. HCP also leverages file compression to further optimize its storage capacity.
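The two-step check can be sketched in a few lines of Python. This is not HCP's implementation: the SHA-256 hash, the use of a hard link as the "pointer", and the function names `file_digest` and `single_instance` are all assumptions made for illustration.

```python
import filecmp
import hashlib
import os
from collections import defaultdict

def file_digest(path, chunk_size=1 << 20):
    """Step 1 helper: hash a file's contents (SHA-256 is an assumed choice)."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def single_instance(paths):
    """Replace byte-identical files with hard links to one stored copy.

    Returns a list of (replaced_path, kept_path) pairs.
    """
    # Step 1: matching hash signatures produce candidate groups.
    candidates = defaultdict(list)
    for path in paths:
        candidates[file_digest(path)].append(path)

    replaced = []
    for group in candidates.values():
        kept = group[0]
        for dup in group[1:]:
            if os.path.samefile(kept, dup):
                continue  # already the same stored file
            # Step 2: full binary compare before anything is removed.
            if filecmp.cmp(kept, dup, shallow=False):
                os.remove(dup)
                os.link(kept, dup)  # hard link stands in for the "pointer"
                replaced.append((dup, kept))
    return replaced
```

The hash pass is cheap and narrows the field; the byte-for-byte compare is what makes it safe to delete the duplicate, since two different files can in principle share a hash.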
File level compression can work on every file. Granted, some files are already in a compressed format, such as MP3s, JPEGs and MPEGs. In these cases, compression doesn’t create a significant capacity savings and in some cases can increase the size of the original file if not managed correctly. However, for the majority of the other file types that exist in an enterprise, many files can be reduced significantly through compression. In higher end systems, the files appear to be stored as is; however, the underlying bits, bytes and blocks of the file are compressed through the file system. The telltale sign is the slight performance impact at read time as the file(s) are uncompressed for use.
Of course, the combination of both file-based single instancing and compression can yield even more capacity optimization. This technique is used to take the file that is referenced by many file links via single instancing and to also compress the original file. This compounds the effect of just single instancing or just compression.
So this is a quick article on two capacity optimization techniques that have been available in the industry for quite some time. In my next blog, I will discuss some of the more modern and advanced techniques for capacity optimization called data de-duplication, which you will find out is a modern twist on single instancing.
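A quick way to see both effects is to compress two buffers with Python's standard `zlib` module. The sample data here is made up: a repetitive text buffer stands in for typical enterprise files, and pseudo-random bytes stand in for content that is already compressed.

```python
import random
import zlib

def compression_ratio(data):
    """Compressed size divided by original size (below 1.0 means savings)."""
    return len(zlib.compress(data)) / len(data)

# Stand-in for a typical, highly redundant enterprise file.
text_like = b"quarterly sales report, region north, status ok\n" * 2000

# Stand-in for already-compressed content: bytes with no redundancy left.
rng = random.Random(0)
noise = bytes(rng.getrandbits(8) for _ in range(len(text_like)))

print(f"repetitive text: {compression_ratio(text_like):.3f}")
print(f"random bytes:    {compression_ratio(noise):.3f}")
```

The ratio for the repetitive buffer comes out far below 1.0, while the ratio for the random buffer lands slightly above it, which is the "can increase the size" case: the compressor's own framing overhead is added to data it cannot shrink.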
So what’s your X, Y, Z year plan?
by Michael Hay on Jun 17, 2011
Recently there has been some media attention about a roadmap and our X, Y, and Z year plans. Along with this, it is also expected that there would be some speculation on the health of certain product lines. Let me assure you, the Hitachi VSP and AMS lines are still alive, kicking and quite healthy (see our FY10 numbers) with long lives and many innovations yet to come. And while the information published is outdated and speculative, it does highlight our innovative thinking and shows we have real, tangible technologies rather than marketing hype seen elsewhere in the industry. As usual, Hitachi continues our 100+ year history of innovation through cross-disciplinary collaborations, extending our state of the art technologies, and diligent listening to customer challenges. At the end of the day, the Hitachi Group is here to do two things: 1. make society better and, 2. meet our customers’ needs.
Given the recent speculation and outdated materials that were published, I should note that the outlined disciplines aren’t the only ones we maintain long term plans on, and, while storage is critical to the HDS business, there are other facets and long term plans we are working. In fact, while I won’t expose the other various disciplines, I can confirm that Hitachi maintains plans out 10-20 years. As previously alluded to, these plans aren’t static, as they need to adjust to changes in the market, changes in fundamental technologies, changes in society, etc. Further, Hitachi is uniquely positioned because we still do fundamental R&D, such as materials science, allowing us to both dream up and consume fundamental technologies driving innovations of the future.
If closely inspected, the speculative materials hint at a core tenet of our strategy: inter-product integration. This is a focus we have been pushing forward for a number of years and said draft plan shows it becoming more pronounced. Further, I would argue that HDS, unlike most of our competitors, is uniquely positioned to make this a reality. Example: Hitachi Command Suite – for many years we have had a single management suite for AMS, VSP and their predecessors. What we aim to do is move the outward eye of the user experience inward making our block microcode pervasive and consistent across all platforms. Further, we are already well into our transition to Intel based platforms across storage and compute which started with the AMS1000 generation some years ago. There are more examples of this, if you look closely, but I’m not here to give away the barn.
But to give a hint, there are three major themes in our strategy: Infrastructure Cloud, Content Cloud, and Information Cloud. None were arrived at without methodical thought and planning and always done in partnership with our customer and partner base. Of course, as a company we’d be more than happy to review more detailed plans with our customers and partners under NDA. Until then you’ll just have to trust me that what was “revealed” is merely the tip of the iceberg.
With the attention this week, I’ll be interested to see what FUD our competition tries to bring. As most organizations have mixed storage environments (all of which we can virtualize), feel free to ask them an important question: What’s your X, Y, Z year plan?
HDS Blog Comment Policy
by Michael Hay on Jun 16, 2011
We recently issued our Blog Comment Policy and I thought it worthwhile to mention it here and also share a few thoughts and highlights from the policy.
Why, you ask? Well, we want to make sure the HDS blogs remain a place where industry appropriate discussions happen. In order to do this, it is important that the conversation stay focused and on-topic. As you can imagine, we receive every bit as much spam as the next blogger, so to combat this we moderate all comments as they come through to make sure that the conversation remains relevant to you, our readers.
The full comment policy can be found here (and on every blog page), but here are some key aspects I feel are worth reiterating (and in my own words):
Do:
- Be respectful
- Be relevant
- Be transparent
- Play nice
- Share with your network
Don’t:
- Post off-topic
- Engage in personal attacks
- Impersonate others
- Be incoherent or belligerent in your comments
- Post inappropriate comments or content
Feel free to let us know if this isn’t clear. You can always contact our social media lead via Twitter (@HDSCorp). On behalf of the bloggers, we look forward to continued engaging conversation.
Hitachi and Quality: An Overview
by Michael Hay on Jun 15, 2011
The Hitachi companies are passionately committed to quality, which is an important goal based on the company philosophy of long term excellence. Products that Hitachi Data Systems sells go through rigorous design, testing, and manufacturing processes to ensure quality, reliability, interoperability, and performance. In fact, engineers that participate throughout the entire product lifecycle are very proud of the positive impact our products have on our customers. Ultimately, it is our customers who are the judge of product quality: a recent report from TheInfoPro (TM) this year cited Hitachi as earning top ratings in Product Quality, Delivery as Promised, Interoperability, Reliability and Performance.
However, we never rest and aim to bring our product quality to new heights. For example, our processes are continuously monitored and improved so that our users have the best possible experience. In order to better communicate more detail on Hitachi quality, expect to see more on this topic over the coming weeks and months.
This is what converged AND open looks like
by Miki Sandorfi on Jun 14, 2011
Today we announced a new set of converged data center solutions. Convergence is one of the latest hot words given the industry movement recently. What’s unique about HDS converged solutions, you might ask? They’re combinations of enterprise class storage and compute from Hitachi, combined with industry standard networks, and management software that are optimized for one or more tasks. In particular, we view these solutions as building blocks upon which you can construct your private cloud – without fear of infrastructure silos. All of this falls under the broader Unified Compute Platform strategy we’ve been talking about for some time now.
First, it’s worthwhile noting that these converged solutions and platforms include Hitachi Compute Blades – not just because “we build servers too”, but because we believe a virtualized, cloud-ready data center requires a solid platform on which to build. Our compute blades include a range of unique capabilities, such as multi-blade Symmetric Multi Processing (SMP) across up to 4 blades and, borrowing from our mainframe heritage, Logical Partitions (LPARs). The latter is particularly useful as we start thinking about “converging”: it allows us to subdivide multi-CPU, multi-core server blades into secure, fenced, sub-server instances that include CPU core, memory and I/O channel resources (see picture below). While this sounds much like what a hypervisor might do, the LPARs are implemented at the hardware layer. As such, they support native applications as well as act as separation zones for multiple vendor hypervisors.
So, leveraging this compute technology, we have released two new converged solutions: Hitachi Solutions Built on Microsoft Hyper-V Cloud Fast Track, and Hitachi Converged Platform for Microsoft Exchange 2010, which is specifically focused on making Exchange simple to deploy. The Hyper-V Cloud Fast Track solution includes AMS 2500 storage, industry-standard networks, and Hitachi Compute Blade 2000. This solution has been pre-tested and certified with Microsoft across various workloads, so you know what you’re getting. It’s a great way to step along the path towards private cloud, and is delivered as a reference architecture with supporting deployment guides and best practices.
Hitachi Converged Platform for Microsoft Exchange includes a choice of AMS 2100, 2300 or 2500 storage, networks and the Hitachi Compute Blade 2000 system. This is delivered as a pre-configured package from Hitachi, and is engineered to deliver multiples of 8,000 mailboxes. This makes deploying Exchange – which can sometimes be tedious, time-consuming and error-prone – a simple task of determining the number of required mailboxes. You can also add on in multiples of 8,000 mailboxes as your environment grows.
Most importantly, these two solutions fall under our broader vision of converged data center. Hitachi Unified Compute Platform, also part of the portfolio, provides a higher-level orchestration layer that ties together storage, networks, compute, and virtual machine services for integration and automation leading to private cloud deployments. UCP is separated from other similar solutions owing to an open, multi-vendor strategy – not just supporting Hitachi gear. UCP permits third-party storage, Fibre Channel and Ethernet gear, and multi-vendor x86 servers to all be managed as a single logical cloud.
In fact, we recognize that many IT environments may not be quite ready for cloud deployment – hence, converged platforms announced today that focus on particular business needs. However, these same platforms can be orchestrated by UCP, and therefore permit stepping into the cloud at your own pace.
Let us know what you think!
What the Hitachi/VMware Partnership Means
by Michael Hay on Jun 9, 2011
Today’s announcement between Hitachi and VMware shows an expansion of our mutual collaboration in the Asian marketplace. This approach was taken for several reasons:
- Providing VMware the ability to leverage a larger portion of the overall Hitachi portfolio, including technologies that currently aren’t available globally through HDS. An example of this is the uCosminexus Stream Data Processing platform (uCSDP), which is a stream processing engine currently available only in Japan. This product is designed to consume various data streams (e.g. logs, sensor inputs, stock information, alerts/events) and is in scope for further collaborations between VMware and Hitachi.
- Further, for financial services firms in the Japanese market, Hitachi will make use of vSphere and vFabric, including GemFire, to help deliver private cloud infrastructures. Living in Japan, I am aware of several customers that are looking to almost hollow out their data centers and move to a pure hosted approach. There were and are a lot of reasons for this, but one was to support CO2 reduction.
Reviewing the external announcement and internal collaborations, I see these and other engagements with VMware as continuing the acceleration of the overall VMware and Hitachi relationship. Further, in this specific case, we are using our Asian operations to perform a joint proof of concept with both companies to scope what, if and when we should release something in all of the markets that Hitachi serves.
Big Data Is About Turning Content Into Appreciating Assets
by Michael Hay on Jun 7, 2011
I was inspired by Doug Henschen’s article in InformationWeek on Big Data in which he hypothesizes that Big Data is bigger than data warehousing. More specifically, he explores whether the data warehousing concept of ETL is also an important facet of Big Data.
I would agree with his statement and I’d like to further it. In order to do so, we first need to understand why Big Data now. For me the big “ah ha” moment started from the idea that Big Data is really about turning data/content from being a depreciated asset or liability into an appreciating asset. My colleague Bill Burns describes this for healthcare content in the following way.
Today data/content in healthcare is the only form that actually becomes more valuable over time. The fact is the interactions you get today when combined with previous interactions and future interactions compounds to establish both qualitative and quantitative trends. It is this compounding effect of data that makes healthcare data unique and more than the sum of its parts. (Bill Burns, 2011)
So this got me wondering, if this is possible for healthcare data then why not for other classes of data? Perhaps Big Data is really about applying this concept to other disciplines. In hindsight this, like many other things, is obvious, but I would argue it is not “well understood” in the industry. After all, we have been trying to attack unstructured content as a kind of enemy with technologies like data shredding, capacity reduction, cold media, HSM, etc. for a really long time. I’ll be the first to say that we need these tools to make better and more efficient use of resources, but I would argue that the tone of applying these technologies is more about treating unstructured content as a liability and not an asset.
Big Data and Retirement Portfolios
I want to be cognizant of discussions asserting that the Big Data-verse means nothing. In one sense I agree, as projects like SAMBA, Perl, C/C++ and *NIX OSes have long deployed primitive “Big Data” technologies like key value stores. However, barring the striking similarities in the technologies, there is a difference in the market: content savings to the point where there are terabytes in the home and soon exabytes in the enterprise.
In fact, in the past I have blogged about the results of content savings resulting in me lusting for easy to use storage resource management and disaster tools for the home. To answer my question, why Big Data now, I’m going to relate Big Data and content accumulation to a retirement portfolio.
When you start saving for retirement, your portfolio is small and you don’t much think about what will happen to it in the future. You also don’t think about how quickly it may grow, but at some point you acquire enough wealth that you realize you need to seek advice, tools, or a mix of the two to accelerate the appreciation of your retirement portfolio, as well as make critical decisions about how to use your retirement savings.
Now, if we think about this concept of savings for content instead of money, I believe we can approach an understanding of why Big Data now. Namely, as an industry we are transitioning from the petabyte to the exabyte and in parallel we’ve accumulated enough time and performed sufficient introspection to desire more from our content. We want tools and expertise that unleash the hidden potential in our content savings to assist us in making critical business decisions, such as the derivation of new lines of business, or in the case of healthcare, improved well being.
While this is extraordinarily obvious (as are patents after you invent them), what I think is not obvious is that getting more out of the content is not a localized phenomenon. Instead, there are a sufficient number of companies, groups and individuals now contemplating this issue, so we move from it being merely a whisper to being a real sustainable trend.
Of course, we still have the usual cycle to process through as an industry and this will result in a movement from the vague to the specific. However, at least for me this somewhat simple metaphoric example helps me put things in perspective. Now that I have proposed my definition of Big Data — transforming content into an appreciating asset — in future posts, I’ll begin exposing why I think that Big Data is more than just Data Warehousing.
Where have all the DBAs Gone?
by Claus Mikkelsen on May 27, 2011
First, let’s get rid of some old business.
In some interesting timing, my last blog post questioned the future of Moore’s Law. The date of my post was April 26th. On May 5th, Intel announced the 22nm 3D tri-gate transistor, which will be available in their upcoming Ivy Bridge chip in the second half of this year.
Was I right? Yes.
Am I gloating? Perhaps a bit.
Was I lucky? Totally, but it’s a feeling I’ve had for a while. Thank you, Intel!! Read about this Intel technology and judge for yourself, but the days of the traditional planar transistors are over and it’s time to move on. It’s really quite a breakthrough, but it remains to be seen what the long term impact will be.
But I want to talk about another entirely different topic, and that’s the database administrator (DBA). Full disclosure: one of my many previous “jobs” was a DBA. I learned a few things, like how hard a job it is, and I certainly learned to appreciate their work. I also learned that “getting it wrong” is what fuels DBAs’ nightmares.
I know. I got it wrong a couple of times, which is why I moved on to my next job: sweeping the offices of the remaining and competent DBAs. Incompetency can actually build a great and long resume!
Anyway, the obvious best approach was to “follow the rules”, something I was obviously not good at. The “rules” are what we now call best practices. That way, if you did get it wrong, at least you could claim it wasn’t your fault!
Many years later, when developing synchronous and asynchronous remote replication, I again got to work closely with some of the best database guys on the planet on the subject of data integrity with databases. I did enjoy it and learned a ton about storage performance (and data integrity). But I’m about to take issue with the profession. Let me explain.
When I was a DBA, we worked entirely on the application side and not with any storage folks. But this was still in the era of JBOD and “dumb” storage, so that made sense. But it recently got me thinking, and since my current job has me traveling more often than not, and spending a lot of time with customers, I began to wonder what has changed.
The answer, unfortunately, is not much. For about a year now, as I walk into a room, I quickly ask if anyone in the room is a DBA, is married to a DBA, has ever met a DBA, or even knows where the DBAs are hiding within their company. The answer generally is consistent with what I saw decades ago, that DBAs hide somewhere else, and are not part of the normal storage team.
Is this right? No, I think things need to change.
In the past month alone I have spent 2 full weeks on the road participating in what we call the “traveling EBC” or “EBC on the road”. That is, we fly some of our top techie folks and executives to where the customers live as opposed to asking them to fly to Santa Clara.
Three weeks ago, I spent a week in our offices in Hanover, MD outside of Baltimore and last week I was in Atlanta. In Baltimore I met with 19 customers; in Atlanta I met with 13. That means I asked my DBA question 32 times mostly with a resounding “no way”. But in Baltimore, one CIO said they recently reorganized and the DBAs and storage guys are now in the same group. Hmmm. In Atlanta, I met with a customer and there were 2 DBAs IN THE ROOM! I was speechless (for a bit), then got into a great discussion on storage and databases and performance. These two guys got me deciding to write this post (and I know you’re reading this, you know who you are, so thank you, and comment if you wish).
But here is my point. Storage is no longer JBOD and is no longer dumb. We’re living in a world of storage computers that automate many tasks, including many done by DBAs. I’ve blogged on this in the past, and will challenge any DBA to provision a database that can outperform what we can do with Hitachi Dynamic Provisioning (HDP). You may get close (at great expense), but you will not exceed. Once I explain the magic behind HDP to a DBA, they immediately “get it” and agree.
Throw Hitachi Dynamic Tiering (HDT) into the mix and not only do you get the performance and throughput boost, you get the benefits of moving less accessed data to lower tiers of storage. Anyone disagree that (generally) large parts of databases haven’t been accessed in months or longer? Demote those parts in 42MB pages to less expensive storage (or leave it all on Tier 1, if you wish).
When I speak to the pure storage guys, they get this well, which is why HDP and HDT adoption is very high. Now it’s time for the DBAs to get educated on this new (well, not that new!) technology and jump on board. Seriously, it’ll eliminate some of those nightmares.
And what better way to transfer the knowledge than to start working more closely together. All will benefit.
So here’s my plea: DBAs, wake up to the new storage technology available and Storage Admins, how about helping your fellows out. Like I said, all will benefit.
Announcing the Hitachi RBS Provider
by Michael Hay on May 18, 2011
I implied in my previous post that Hitachi was soon to release our own RBS provider. Today I’m pleased to announce that our first version of RBS is available with a healthy roadmap and many rock solid capabilities in the initial release.
Among them is comprehensive support for block storage, file storage, and object storage platforms for MS-SQL Server 2008 R2 and beyond. Our RBS provider can be downloaded from the HDS Portal (see below) and is available to HDS customers at no cost. Here is the set of supported features in the initial release.
As someone who helped the team that instigated the RBS efforts, I’m proud to see it released and with such comprehensive support. More generally, RBS joins a growing stable of data ingestion technologies designed to unleash the potential power of data and information from applications:
- Hitachi Data Discovery for Microsoft Sharepoint (Protocol: REST, CIFS)
- Hitachi RBS Provider (Protocol: many)
- Hitachi Content Platform ingestion for MS Exchange (Protocol: SMTP)
- Hitachi Content Platform ingestion for SAP (Protocol: WebDAV)
- Hitachi Data Ingestor for HCP (Protocol: REST)
- Hitachi NAS for HCP (Protocol: REST)
Hitachi’s intention is to help our customers squeeze the data and information out of applications with lightweight plug-ins and protocol enhancements that reduce the cost to archive. We want to inspire our customers, partners and the market to see that the needed first step towards the Big Data revolution is to free your data. So with that in mind, RBS represents another step in a freer Big Dataverse!
To download RBS and other adapters, visit the Hitachi Data Systems web site at https://portal.hds.com:
- Log in or register if you are a first time visitor to the Support Portal.
- Click on the link for Software Download & Entitlement System.
- Search for the adapter using the Find Software and System Downloads search tool.
In the results list, you can click on the name of the adapter to download the software package.
Healthcare in the Cloud: Benefits and Precautions
by Dave Wilson on May 6, 2011
These days it seems you can’t go anywhere without seeing “Cloud” mentioned. Microsoft has their “To the Cloud” commercials on television. The last healthcare tradeshow I attended had cloud messaging everywhere and one vendor even hung fluffy cotton balls from their stand.
But what is the cloud to healthcare providers really? How does the cloud fit into healthcare?
What is the cloud in a healthcare context? Cloud means many things to many people and questions abound. If you have access to the Internet, do you have the cloud? Is paying for access on an as-needed basis the cloud? What about having a 3rd party manage your IT – is this a cloud model?
Complex Yet Simple
The answer is yes – and no. While these are factors that contribute to having a cloud operation, the cloud in healthcare is really more complex and at the same time simplified for providers. Here are four benefits:
1. Economies of Scale
The cloud gives healthcare providers a means of saving costs by taking advantage of economies of scale. By sharing resources with other facilities, the initial investment required to get started is greatly diminished and the ability to transform the clinical setting can be accelerated.
Take for example the use case of having disaster recovery capabilities for your imaging department. Investing in an offsite DR facility can be costly, as buying equipment, providing a site, and all the trappings that come with this can be quite onerous. Outsourcing this to a cloud provider means that a facility could get started today without requiring vast amounts of capital.
2. OPEX vs. CAPEX
The cloud also allows a facility to operationalize these expenses. Rather than a capital purchase, the facility funds this through operational expenditures, which are often easier to manage than going to the board for capital.
Healthcare data grows at unusual rates. The growth of digital information is hard to predict because as more systems become digital, the volume of data they produce increases demand. Using the cloud to handle the growth of information takes away all the guesswork.
No longer does a facility have to plan its infrastructure for growth, because the cloud provider is ready to handle any increase or decrease in demand on the fly. This also means that expenses follow usage – don’t use a service, don’t pay for it. Get 100% optimization out of a system all the time. It’s hard to beat that kind of return on investment.
4. Service Innovation
The cloud also offers healthcare providers a unique opportunity to provide new services that would otherwise be cost prohibitive. Digital pathology is a good example. The move to digital pathology poses a huge problem for many healthcare providers, as the costs associated with large file sizes and the required infrastructure are challenging to say the least. But a cloud based solution opens new doors. Not only could a facility manage the infrastructure requirements, but it could also provide access to specialists and pathologists that it may not have had access to before. This enables smaller facilities or remote caregivers to provide new services to their patient population in a cost effective manner.
However, given the suspicious nature of healthcare information being “out there,” the cloud is not the panacea it may seem. There are a number of issues that providers need to be aware of before jumping into a cloud infrastructure. Legislation and regulatory concerns such as HIPAA impact not only the healthcare facility, but as patient information moves to the cloud, there are issues that need to be addressed. The cloud provider must be “security and privacy aware” as it pertains to patient information. In addition to this, the actual location of where data will be stored can also be a problem. Patient data that crosses borders, for example, can present a whole myriad of problems.
Pundits often raise the issue of who owns the data once it gets to the cloud but I think the issue is more around how you get the data out of one provider and move it to another, should the desire to change providers arise. The use of standards and ensuring that there are no proprietary features that will tie in a facility are of utmost importance.
Finally, Service Level Agreements will be important. Healthcare providers need to have 24/7/365 access to patient data. It needs to be fast and reliable and downtime is not tolerated when it comes to patient care. As cloud providers move into healthcare, I’m sure we will see demanding healthcare requirements push the technology to its limits very quickly. That said, I believe that the technology will keep up and providers will be able to take advantage of all the benefits the cloud can offer in terms of improving patient care.
So how do healthcare providers get to the cloud? Stay tuned for my Cloud Maturity Model in Healthcare post, coming soon.
HDS Joins the Cloud Security Alliance
by Eric Hibbard on Apr 29, 2011
I’m pleased to announce that Hitachi Data Systems, like many other organizations, has joined the Cloud Security Alliance (CSA) as a full corporate member. We are quite excited about the CSA scope of work, and in particular, we anticipate getting involved in Version 3.0 of the “Security Guidance for Critical Areas of Focus in Cloud Computing”, further work on the “Cloud Controls Matrix (CCM)”, the “Trusted Cloud Initiative”, the “Top Threats to Cloud Computing”, and “CloudAudit” (see https://cloudsecurityalliance.org/research/projects/ for CSA research projects).
As some of you may already be aware, the CSA has recently become active in the standards development arena, and most recently established a formal liaison with ISO/IEC JTC 1/SC 27. As an attendee of the April-2011 SC 27 meeting in Singapore, I can attest to the fact that the CSA was playing more than a passive role—an uncommon engagement for most liaison organizations—at its first SC 27 meeting. This is a good move for the CSA because it will elevate the importance of the standardized cloud security guidance; some nations confer a law-like status on ISO standards.
Hitachi has been an active participant in SC 27 for quite some time (as participants from multiple National Bodies as well as editors on standards), and we’ve already been able to help the CSA engage the international security community. HDS is also involved with ISO/IEC JTC 1/SC 38 (Distributed application platforms and services, or DAPS for short) and we see the potential for a “little fun” in the future. If and when the SC 38 Study Group on Cloud Computing (SGCC) transitions into a standards-producing working group and further realizes that it’s not reasonable to produce cloud computing standards without addressing security, there could be some conflict or overlap.
As I’ve said before, I’m a huge fan of the CSA and routinely recommend the CSA guidance and whitepapers whenever given a chance. I’m definitely looking forward to helping the CSA with its many endeavors as well as working with them in creating cloud computing security and privacy standards. It will be interesting to see if the cloud buzz can be turned into useful standards. What do you think?
Moore’s Law? What’s Up With That?
by Claus Mikkelsen on Apr 26, 2011
My last blog, “Binary”, covered our recent Geek Day (Blogger Day) outside of London and some of our observations surrounding it. This time I’ll be covering another event: the recent Spring SNW in Santa Clara, down the road from HDS. The event occurred two weeks ago, but I want to bring up a subject that surfaced in my presentation, and I’d like to pose this to the wider audience here.
As I travel around, meet with customers, travel, present, talk, fly, present and fly, I like to ask questions. One of the questions I’ve been asking recently relates to Moore’s Law. Let me explain.
Since the beginning of time (as defined by when storage was invented), the “size” of the processor (MIPS, GIPS, etc.) has determined the I/O load expected by storage arrays. That is, if you double the processor speed, you also double the I/O load – pretty basic arithmetic. So, when Gordon Moore predicted that processor speeds would double every 24 months, us storage guys took note and assumed “The Law” applied to us as well. More accurately, he said that the number of transistors that could be placed on an IC would double every 2 years, but you get the idea. The plumbers out there also took note, which is why FC speeds double at about the same rate. Got it? Moore’s Law applies to everything!
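The doubling arithmetic behind “The Law” can be sketched in a few lines. This is only an illustration of compounding under the assumption stated above (one doubling every 24 months, with I/O load tracking processor capability one-for-one); the function name `growth_factor` is made up for this example and is not taken from any vendor tool.

```python
def growth_factor(years: float, doubling_period_years: float = 2.0) -> float:
    """Multiplicative growth after `years`, assuming one doubling per period."""
    return 2.0 ** (years / doubling_period_years)


if __name__ == "__main__":
    # If I/O load tracks processor capability one-for-one (the post's premise),
    # the same factor applies to the load a storage array has to absorb.
    for years in (2, 4, 10):
        print(f"{years:>2} years -> {growth_factor(years):.0f}x")
```

The point of the sketch is that a fixed doubling period compounds quickly: five doublings in a decade is already a 32-fold increase, which is why a change in the doubling period matters so much to anyone sizing storage.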
Now before I get into the question about “The Law,” I should reveal that I have never in my career been NDA’d by Intel, AMD, or any other chip manufacturer. My closest encounter was all those old annoying “Intel Inside” stickers that seemed to have been slapped on just about everything I purchased. So what I’m posing here is pure speculation and curiosity on my part.
The subject of my session at SNW (which was shared with HDS’ esteemed Michael “Heff” Heffernan and a great customer of ours, Jerry Sutton, from TSYS) was server virtualization and its impact on storage.
I started it off and described a couple of server trends that are impacting storage. One such trend was that server utilization has gone from 10% (generally accepted industry average, or GAIA, I made that acronym up) to 40%-50% with server virtualization (another GAIA, and great argument for scale-up storage, incidentally). That doesn’t mean we’re (us storage folk) seeing a 50% spike in IOPS, but we’re definitely seeing increases above the norm. I doubt anyone would argue that!
This brings me to the question I asked of the SNW audience. Watching various announcements, speculations, and rumors on pending 48-core, 64-core, and 128-core processors, is “The Law” still relevant? Are we exceeding it? Does it still apply? When asked, no one raised their hand that we were slowing down. Two raised their hand that it still applies, and about 10 indicated that Moore’s Law does not apply (for servers) and that we are, in fact, accelerating. The others had obviously fallen asleep by then, or otherwise declined to vote.
Personally, for me, on Mondays, Wednesdays, and Fridays I believe we are accelerating. Tuesdays, Thursdays, and Saturdays, I think not. Sundays I don’t think at all. So, I’m really curious as to what others think.
Are we actually accelerating our compute capabilities? I think this makes for an interesting discussion, so let’s discuss. History will be the ultimate arbiter, but let’s at least ask the question.
Stepping Further into the Cloud
by Miki Sandorfi on Apr 14, 2011
Last October, I told you about our new Edge-to-Core solution. Today, I’d like to give you an update on that and to show how we can help you step further into the cloud.
For those of you who didn’t already know, HDS recently released a newer and very cool version of the “Edge” – Hitachi Data Ingestor (HDI) v2.5. With this launch, we are now offering our customers THREE different types of HDIs, which have been designed for different customers with different requirements and different budgets.
Here’s a quick recap of the recent announcement.
Three Choices of HDI
Again, the whole idea of our integrated Edge-to-Core design (HDI with Hitachi Content Platform) is to provide distributed consumers of IT (private cloud) and cloud providers (public cloud) with a seamlessly scalable, backup-free storage solution. Now with the latest VMware (software-only virtual appliance) and Single-node HDIs, not only have we greatly improved the cost line, we are also enabling customers to manage their distributed/cloud environments with much more flexibility.
As a minimal-footprint or virtual appliance with “read-write NFS/CIFS” support that ties directly to the Hitachi Content Platform’s REST interface, HDI continues to act as a “cloud on-ramp” to the distributed object store at the heart of your data centers.
HDI makes it possible for dynamic, collaborative data to stay in the local cache with versions copied to the central object store on a regular basis. When the files are no longer being used often, they remain accessible via stub files stored locally. This dramatically broadens the applicability of object storage by adapting current environments to the advanced data and storage management capabilities provided by the Hitachi Content Platform. As I mentioned before, there have been a lot of discussions about deploying clouds, but what’s been mostly overlooked is providing a comprehensive path for customers to transition from their traditional IT environments to more “cloudy” ones, without leaving their current environments in the “legacy” trash bin. Our HDI solution addresses just that.
There’s actually one more thing I’d like to write down here: we always think ahead and do our best to protect your investment. In fact, for every new release of our offerings, we’ve been focused on providing you more automation, simplification, flexibility, and control. These things together all help improve the adaptability of our systems to newer technology and make sure it’s a long-term solution that outlives the data stored, as well as helping you better manage your storage over time, with regard to tech-refresh, upgrades, and scaling.
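To make the cache-and-stub idea concrete, here is a toy, in-memory sketch. It is not how HDI is implemented; the class `EdgeCache`, its fields, and the tick-based idle rule are all assumptions invented for illustration. It only shows the general pattern described above: active files live locally, copies go to a central store, and idle files shrink to stubs that are recalled on access.

```python
from dataclasses import dataclass, field


@dataclass
class EdgeCache:
    """Toy edge cache: hot files keep content locally, cold files become stubs."""

    idle_limit: int                                  # ticks without access before stubbing
    core: dict = field(default_factory=dict)         # stand-in for the central object store
    local: dict = field(default_factory=dict)        # path -> bytes, or None for a stub
    last_used: dict = field(default_factory=dict)    # path -> tick of last access
    clock: int = 0                                   # logical time, advanced on each access

    def write(self, path: str, data: bytes) -> None:
        self.clock += 1
        self.local[path] = data
        self.last_used[path] = self.clock

    def replicate(self) -> None:
        """Copy every locally held file to the core store (the periodic copy)."""
        for path, data in self.local.items():
            if data is not None:
                self.core[path] = data

    def evict_idle(self) -> None:
        """Turn files that are idle and already replicated into stubs."""
        for path, data in self.local.items():
            idle = self.clock - self.last_used[path]
            replicated = data is not None and self.core.get(path) == data
            if replicated and idle >= self.idle_limit:
                self.local[path] = None              # stub: name stays, content is in core

    def read(self, path: str) -> bytes:
        self.clock += 1
        if self.local[path] is None:                 # stub hit: recall content from core
            self.local[path] = self.core[path]
        self.last_used[path] = self.clock
        return self.local[path]
```

A file is only stubbed once an identical copy exists in the core store, so a read through a stub can always be satisfied; a real product would also have to deal with versions, failures, and concurrency, all of which this sketch ignores.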
So, what did I mean by “we can help you step further into the Cloud”? To sum up, here’s what you can get out of our “Edge-to-Core” solution:
- Complete control of distributed data in your entire IT world
- Simplified cloud adoption with standard file serving access (i.e., NFS or CIFS) to stored content
- Improved manageability, efficiency, flexibility, and data protection
- And of course, significant cost reduction by simplifying both IT infrastructure and processes
That being said, if there’s anything else you want to know about our cloud offerings, just let me (or your other HDS contacts) know. All the best!
by Claus Mikkelsen on Apr 4, 2011
I just returned from a great couple of days outside Windsor in the UK. Our HDS offices are at Sefton Park just a few miles away, in Stoke Poges, which in turn is outside Windsor. Got it? If you haven’t been there, it’s a must see in a beautiful setting. I’m actually a little late getting this post out since the UK trip was followed by a short vacation, which in turn was followed by a few mountains of email and other such obligations.
I was in the UK for our second annual HDS “Geek Day” where we invited some influential bloggers in the storage industry to give them a deep dive of our products, roadmaps, and the “why” behind the “what” in our product lines. We discussed the Hitachi Command Suite (HCS, and it is sweet), our VMware and hypervisor integration, enhancements to our HDP (dynamic/thin provisioning), HDT (dynamic tiering), our server line and storage economics. Also included were multiple demos and giving the bloggers some hands-on time with our products.
In addition to the two days of sessions, we were still able to fit in two #Storagebeers and two podcasts with the illustrious Greg Knieriemen. I, unfortunately, missed the second of both, but it was certainly a busy, jam-packed two days. But that’s not the focus of this post, and I won’t repeat the excellent coverage Hu posted on this event.
One of the great things about Blogger Day (or HDS Geek Day), is that these independent bloggers tend to approach us (the vendor) with a great deal of caution, skepticism, and cynicism, as they should, as should all of our customers. They were certainly willing to ask the tough questions and none were bashful.
What was great to see, as we were going through the various topics, were the “ah ha” moments when they began to see the differentiations we had.
This brings me to the “binary” topic. We often tend to see the world in a binary state (a brief look at current politics will attest to that), but things are seldom that clean cut. So many storage bids I’ve seen require us vendors to support certain “checkbox” items (binary), but what sits behind each vendor’s checkbox mostly does not compare.
For example, we all claim to support the VAAI APIs of VMware, but do we all have the same integration points and support external storage virtualization with these features? The answer is no, but one would have to dig deeper than the obvious to learn that. Do all the storage vendors support thin provisioning? Yes, but dig deeper and you’ll find that not all support sub-features such as “write same” (block zeroing), performance enhancements, and zero page reclaim. We all claim to have the best management software, but until you actually sit down and use Hitachi Command Suite (HCS), you’ll never know just how sweet it really is. On Storage Economics, are we even close to equal in CAPEX, OPEX, environmentals, capacity efficiencies, and other metrics? Hardly!!
The point of this is that one really needs to dig deeper to find the subtle (or not so subtle) differences in the products available. And that’s what I loved about the “Geek Days”. One by one, product by product, we could see them “get it”. Many that started the event skeptical had certainly changed their perceptions, and that’s the purpose of the event: education.
Are we (HDS) perfect? No. Do we always have the best solution? No. But do we, overall, have the best and most complete product set? Yes. And it was rewarding to see the positive responses to that last week.
After our launch of VSP last year, we’ve been handing out 3-D glasses to go with the 3-D theme we introduced. Maybe “binary truth” glasses would be more appropriate.
open -a TextEdit 000000cc-00000080–000b
by Michael Hay on Apr 4, 2011
I cannot resist starting out the post this way; the title of this post is how one would open a file stored in a Remote BLOB Store (RBS), a part of MS-SQL 2008 and later, stored on a Mac — that is if we assume that the file is a *.txt file. Said BLOB/file could be retrieved from an “easily comprehensible” directory like ‘/blobstore/9de21b5-e4c3-4ef1-ad71-9b123021f2ba/f6c43087-816f-4b08-a68d-a77a04452c81.’ (Note this bit is a joke and was inspired from a blog post on Todd Klindt’s SharePoint Admin Blog.)
To be very clear, BLOBs in RBS are not managed by humans and are assumed to be managed 100% by MS-SQL Server; that is, they are machine-managed data. Further, the stated design target of RBS is to improve the performance for both the DBMS layer and the storage layer with respect to BLOBs that were formerly stored directly in the DBMS. So this reinforces the notion that data stored in a BLOB store managed by MS-SQL Server is not meant for human consumption.
When Microsoft (MS) first debuted RBS, Hitachi Data Systems made a commitment to write an RBS provider (stay tuned on this front). We totally understood the intention of clearing out the DBMS of BLOBs to increase DB performance and scale while easing backups. From discussions with MS at the time, we also understood that the primary motivation in making the innovative RBS layer was ensuring that customers who had stored piles of BLOBs in MS-SQL had a path to increase the scale of their applications. More specifically, during the informal discussion, geospatial applications were cited in which many satellite images were persisted in MS-SQL as BLOBs with machine-generated metadata (they were from a satellite, after all). Some way was needed to clear these beasties out of the DBMS, allowing MS’s customers to do more on each MS-SQL server, so along came RBS. So with all of that in mind we signed up, and then…
Microsoft’s SharePoint team produced a product that has been the fastest to achieve more than $1 billion in annual revenues (actually Microsoft stated that SharePoint made $1.3 billion as of 2009). We can link SharePoint’s rapid success to its ease of use resulting in an almost democratic IT experience, creating headaches for IT administrators who exhibit a love-hate relationship with SharePoint.
They love the fact that it is so easy to use, modify and set up, but they hate the fact that it creates problems like storage consumption spiraling out of control and “fuzzy governance.” As a result, the SharePoint development team realized they had exactly the same problem as the custom application developers: DBMS instances with BLOBs growing out of control. So they decided to turn off their old implementation, External BLOB Storage (EBS), and move to RBS for SharePoint 2010.
Almost immediately when the SharePoint team adopted RBS, there was a pack of companies who started updating their SharePoint archiving products to include RBS, yet at Hitachi we realized that the competition was playing our game so we could take our time. Our game was in fact the product we call Hitachi Data Discovery for Microsoft SharePoint (HDD-MS) and because we already had the implementation which was independent of EBS and RBS using documented SharePoint APIs an immediate move was not warranted. (Note I have blogged a lot about HDD-MS. Here is a good gateway link to many such posts: Managing SharePoint Growth.) Let me provide you with a bit of the analysis that we did when thinking about when and if we should transition HDD-MS to RBS, or accelerate our RBS development.
All of these attributes were important to Hitachi, but because I started off the discussion illustrating that RBS is machine optimized, I want to focus on that. We recognized long ago that to unlock the true potential of data and generate new information (see my previous post on Big Data) we must start storing data independent of the application in an open and readable format. That is because we cannot predict the future, where we might be seeking the union of disparate data types to find hidden gems from the noise, remastering the content in a new format, etc. It also means that we need to think not only about the data but also about the data’s metadata. HDD-MS extracts both the object and its metadata from SharePoint, persisting both in the storage media in a human (and machine) readable format. RBS, by contrast, is 100% machine optimized, using hints from MS-SQL Server to construct the directory and object naming schemas and ensuring that MS-SQL performance and scale are optimized.
Both design choices are perfectly fine and while there is some overlap depending upon a customer’s problem, one may be more applicable than the other. For instance, if the customer is really after optimizing their MS-SQL database attached to their SharePoint infrastructure, then RBS may be the best fit. On the other hand, if a “future-proof” archive is key, then HDD-MS is the best choice (note this is a Hitachi internal best practice). Here are a few examples beyond HDD-MS illustrating this point:
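The contrast between the two layouts can be sketched with a few lines of file handling. This is an illustration only: it does not reproduce the internals of RBS or of HDD-MS, and the function names, the `.metadata.json` sidecar convention, and the sample fields are all assumptions made up for the example.

```python
import json
import uuid
from pathlib import Path


def store_opaque(root: Path, data: bytes) -> Path:
    """Machine-oriented layout: the file name is an opaque identifier."""
    path = root / str(uuid.uuid4())
    path.write_bytes(data)
    return path


def store_readable(root: Path, name: str, data: bytes, metadata: dict) -> Path:
    """Application-independent layout: original name plus a JSON metadata sidecar."""
    path = root / name
    path.write_bytes(data)
    sidecar = root / (name + ".metadata.json")
    sidecar.write_text(json.dumps(metadata, indent=2, sort_keys=True), encoding="utf-8")
    return path
```

With the first function, only the system that issued the identifier can say what the file is. With the second, someone browsing the directory years later can still tell what the object was and read its metadata without the original application.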
- Hitachi Data Ingestor (HDI) – stores the file system object + all metadata onto HCP
- Hitachi NAS Platform (HNAS) – stores the file system object + all metadata onto HCP
- Universal Volume Manager (UVM) – when implemented, users can select a mode to merely virtualize but not change the LU; this allows the user to back out of the virtualization process quickly
- HCP SMTP Interface – stores emails from Exchange in a user and machine comprehensible format including persisting individual emails as *.eml or *.mbox format
Now I’ll be the first to admit that my primary argument and others I hinted at make sense for SharePoint. But what about the applications that we haven’t optimized our products for or implemented specific plugins to that make use of MS-SQL server? What about a user who doesn’t care about perfectly archiving metadata and the content itself? That is where RBS shines! If a customer has written their own application making use of MS-SQL 2008 and they want to store BLOBs outside of the DBMS: RBS is the right choice. If an application like Microsoft Dynamics wants to store BLOBs outside of the MS-SQL DBMS: RBS is the right choice.
In short, RBS is the right tool for a set of use cases and HDD-MS is the right tool for another set. So when our competitors were jumping on the RBS bandwagon and racing to the finish line to solve SharePoint problems there, we were standing on the finish line waiting to cheer the company who makes it to second place.
I Don’t Glow in the Dark, but the Stars Surely Do!
by Michael Hay on Mar 31, 2011
A View from Japan
As of late, I have been watching here from Japan the world gripped in fascination about the goings on at the Fukushima nuclear power plants. There are a lot of discussions about banning imports and outright fear that a cloud of radiation is going to spread from Japan to other areas and magically contaminate everything it touches.
As I’m writing this post from Japan, I believe it very helpful for people first to find out what various doses of radiation mean. So let’s start with the meaning of a sievert.
The sievert (symbol: Sv) is the SI derived unit of dose equivalent radiation. It attempts to quantitatively evaluate the biological effects of ionizing radiation as opposed to the physical aspects, which are characterised by the absorbed dose, measured in gray. It is named after Rolf Sievert, a Swedish medical physicist renowned for work on radiation dosage measurement and research into the biological effects of radiation. (source: http://en.wikipedia.org/wiki/Sievert)
With that in mind, the next thing to do is to find out what exposures of various levels actually mean and how to put them in perspective. Fortunately the folks at xkcd have done just that; they have assembled an easy-to-understand chart which makes exposures to various levels of radiation relevant to everyday people. I have included an image of the public domain chart in this post and the link is above so you are free to explore on your own.
A reference alone is not good enough to understand the current situation, so what is needed is a source of data for the number of sieverts per hour in a given location. Fortunately through some crowdsourcing efforts the site RDTN.org provides a visualization of various data sources independent and inclusive of Japanese government sources.
As a result, if we combine the current information from nearby Kawasaki (0.1325 micro-sieverts) and Yokosuka (0.095 micro-sieverts), the amount of radiation produced in an hour in Yokohama, where I live, is roughly somewhere between eating a banana and living within 50 miles of a coal-burning power plant. Further, while there has been contamination of spinach and milk from nearby Fukushima, it also needs to be put into perspective. I think that a blog post at NPR does this quite nicely, which says that you have to eat a little over 2 pounds of spinach every day for a year to reach a level that is potentially harmful. You’d also have to consume about 58,000 eight-ounce glasses of milk, or one glass every day for 160 years.
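For anyone who wants to check the arithmetic, here is a small sketch. It assumes the two readings quoted above are hourly rates in micro-sieverts and, purely for illustration, that they stay constant for a full year; the function names are invented for this example.

```python
HOURS_PER_YEAR = 24 * 365


def annual_dose_msv(rate_usv_per_hour: float) -> float:
    """Yearly dose in millisieverts if an hourly rate (in microsieverts) held all year."""
    return rate_usv_per_hour * HOURS_PER_YEAR / 1000.0


def years_to_drink(glasses: int, glasses_per_day: float = 1.0) -> float:
    """How many years it takes to drink `glasses` at a steady daily pace."""
    return glasses / (glasses_per_day * 365)


if __name__ == "__main__":
    for city, rate in (("Kawasaki", 0.1325), ("Yokosuka", 0.095)):
        print(f"{city}: {annual_dose_msv(rate):.2f} mSv per year at a constant rate")
    print(f"58,000 glasses at one per day: {years_to_drink(58_000):.0f} years")
```

The milk figure works out to just under 159 years, which matches the “160 years” quoted from the NPR post once rounded.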
To me, this suggests that much of the information floating around may be coloring the situation as more serious than it needs to be. I’ll be the first to say that we need to continue to monitor the situation, but to me as someone living here in Japan, this information puts my mind at ease.
Finally, I do want to recognize that even under such terrible circumstances the people of Japan have reacted with considerable grace. Let me illustrate by example. I am sure that many of you know there are rolling blackouts to conserve energy during this time of crisis. One of my colleagues said that when the power is out in his area his family goes for a walk outside and does some stargazing. He said that it was the first time in a long time that he had witnessed firsthand the brightness and beauty of the stars. I know that this may not be as power packed with excitement as what some areas of the world are experiencing now, but I think it does illustrate grace.
Big Data, Big Challenge, Big Payoff
by Ken Wood on Mar 28, 2011
Getting from “Where Information Lives” to “Where Information Works for You”
The buzz around “Big Data” is turning many heads. The idea of mining the vast repositories of unstructured data and blending this with the structured data collections within an enterprise seems daunting, but everyone seems to think that there are new opportunities hidden within their data.
So what are you looking for? What should you be looking for? What needs to be found?
The real question should be, “What can I do with all this data to benefit my business and HOW?”
In my opinion, eDiscovery as part of a regulatory compliance solution is one promising implementation of a corporate wide Big Data architecture.
eDiscovery eases the burden of having to find all relevant documents related to a legal case or audit. So while many businesses and industries need to implement this kind of compliant repository to store and retain documents for legal discovery purposes, eDiscovery addresses many of the prerequisites of Big Data requirements; that is, access to all of your data in an unstructured searchable scheme.
But what about industries that don’t have regulations demanding this type of compliance, or businesses that may have their own requirements to retain files, or the desire to keep everything because it seems like the right thing to do? In essence, companies that have already implemented eDiscovery solutions may already have a leg up on their way to Big Data systems.
eDiscovery architectures are a good starting place for collecting, storing and indexing data, and essentially lead to Big Data systems. The challenge of Big Data is to apply this approach to ALL data related, tightly and loosely, to your company, which may even apply to data beyond the corporate firewall such as partner systems, distribution channels and/or supplier systems.
Michael Hay did a good job covering the definition of “Big Data” in his recent post. To summarize, it is defined as datasets that are too big and awkward to work with using standard database management tools or other data management tools. The ability to apply indexing and search, capture and store, and analytics to these datasets exceeds the current envelope of today’s mainstream tools and techniques. Applying analytics to this data goes far beyond just search and complex queries, but being able to search all of the data is a prerequisite to analyzing the data.
Patterns, Trends, Competitive Advantage
What other benefits can be gained from being able to mine all of your data? The idea of being able to combine different data types to form sub-datasets in the context of a business query or process can begin to show trends in the data, which can also transform “just data” into “information”. When thinking about Big Data this way, it starts to sound like chaos theory: initial, seemingly small random samples appear chaotic, but at large scales, patterns can emerge. These patterns in business could be interpreted as trends. Trends lead to predictions, and the better the trends, the more accurate the predictions, which leads to competitive advantage and first-to-market opportunities.
This may mean mining data and the trends of partner and distributor systems related to your own data. Another outside-the-firewall reach could include social triggers: the monitoring of the Internet and tracking the buzz around many activities by linking these triggers to your own internal systems. In fact, using a social trigger to initiate queries on your data conjures up a kind of self-aware infrastructure. This is where your infrastructure becomes connected and contextually attuned to what your business is about, where your data becomes information and where your information begins to work for you. Essentially, this is where the repurposing of your data begins to take form.
While the idea that the tealeaves in your data can lead your business to future unforeseen opportunities is believed by many in the industry to be the Holy Grail of data repurposing, getting to that point will require fortifying the datacenter architecture.
In my opinion, High Performance Computing architectures, emerging from both research venues and the fringes of many mainstream corporations, will increasingly be applied to Big Data problems. The computing power for analytics and the massive IO performance to keep analytic engines busy are a scale-out problem, and HPC is a scale-out infrastructure. Also, answers from Big Data will have to be timely to be useful for business decisions. Many would say that timeliness is a real-time or near real-time requirement. This is true for many time-to-market sensitive businesses like advertising, while other industries can afford slower responses to more detailed queries that have company-wide strategic impact.
So some will say that we are still years away from true Big Data architectures. While I agree that TRUE Big Data architectures are years out, it may depend on the definition of “TRUE”. There will be a build up to some Big Data architectures, like eDiscovery solutions, so partial implementations will continue to come online addressing critical portions from different industries and market segments. It will depend on business priorities and market success. The sudden success of a competitor can motivate innovation implementation and investment in an industry especially if there is a competitive advantage based on reexamining already existing data. eDiscovery solutions will be an early implementation and an advantage to getting there from here.
What is the Big Data Pot of Gold at the End of the Rainbow?
by Michael Hay on Mar 16, 2011
First off, I want to acknowledge all of my family, friends, and colleagues here in Japan after our recent shocking event. In the Northern territory, where Japan was hardest hit, there are several nuclear power plants in danger as a result of multiple explosions. Fortunately, as I write this post my family is okay, but shaken — literally and figuratively. As you can, please provide your personal level of support to the people here in Japan.
Now back to the subject at hand – Big Data. To begin the discussion, I want to cite the definition of Big Data from Wikipedia:
The term Big Data from software engineering and computer science describes datasets that grow so large that they become awkward to work with using on-hand database management tools. Difficulties include capture, storage, search, sharing, analytics, and visualizing. This trend continues because of the benefits of working with larger and larger datasets allowing analysts to “spot business trends, prevent diseases, combat crime.” Though a moving target, current limits are on the order of terabytes, exabytes and zettabytes of data. Scientists regularly encounter this problem in meteorology, genomics, biological research, Internet search, finance and business informatics. Data sets also grow in size because they are increasingly being gathered by ubiquitous information-sensing mobile devices, “software logs, cameras, microphones, RFID readers, wireless sensor networks and so on.”
In bold, I highlighted a set of text that I think summarizes Big Data well. In fact, Big Data has not just magically appeared, nor is it something that will have its 15 minutes of fame and then disappear (see references to this at a high level on Pete’s blog). Big Data encompasses serious computer disciplines that arguably can be traced back nearly a decade. Pharmaceutical companies, eDiscovery events, web scale architectures and other industries/activities have long deployed HPC-like infrastructures affording users the ability to answer big questions from Big Data.
What is different now? Simply, what is possible now comes from infrastructure compression (e.g. smaller cluster systems, storage, and analytics) that is less costly, meaning more companies can take advantage of Big Data architectures and tools. Therefore I believe Big Data is not about technologists salivating at a new gold rush, but about the promise of everyday people interacting with confidence in technologies to answer questions that may require analyzing enormous quantities of data, improving their work and, ultimately, making society a better place.
Hitachi’s Social Innovation business is a realization of the last sentence above and what I really believe sets us apart from our competitors. It is Hitachi’s view that Information Technology must be fused to things in the world like trains, microscopes, people movers, escalators, physical security systems, excavation equipment, cameras, etc. in an effort to make a difference. All of these systems produce many of the data streams and attributes as described within the Wikipedia definition.
For example, earth moving equipment has a variety of sensors that produce data about location, fuel consumption, component wear, etc. All of these data and more can then be “crunched on” allowing the manufacturer, Hitachi, and the owner to improve the system or its operational efficiency. Since Hitachi makes earth moving equipment and we acquire telemetry for both ourselves and our customers we must also produce tools necessary to analyze these Big Data. In a very real sense, and key to Hitachi’s differentiation, we are learning to think more like a customer because we are actively working to fuse IT to social infrastructure. This makes us think a lot about how to cope with our own deluge of data so that we can improve our own offerings directly and indirectly.
As for how this shows up through HDS, today it is largely as components built into our storage, systems management, data management and server systems. For example, modeling and cooling facets of Virtual Storage Platform (VSP) are heavily influenced by the tools and technologies coming from our power systems group. We can utilize our own modeling tools, knowledge and processes to make sure our system designs are denser and more efficient than the competition.
Moving forward, however, we will transition to a more overt exposure of “Big Data” sorts of technologies through HDS. It is one of the reasons my team, the Office of Technology and Planning, spends a lot of one-on-one time with the various R&D teams. We are “on the hunt,” so stay tuned.
But be assured that we will be thinking about how we can apply a variety of resources to the task including our most important point of differentiation: being a user of Big Data. In a very real sense, we want you to find your own gold at the end of the rainbow, but you could always buy our cloud equipped mining equipment to find and build your own pot of gold from an actual gold mine.
All-in-One Meals: Converged Solutions Aligned With Customer Needs
by Gary Pilafas on Mar 4, 2011
Today’s discussion is a continuation of my last blog post, in which I introduced the concept of pervasive computing.
My role within Hitachi Data Systems is to create converged solutions, which I like to refer to as “Hitachi all-in-one meals”; however, unlike the traditional McDonald’s Happy Meal, which consists of a burger, fries and a drink served in a bag, a “Hitachi all-in-one meal” consists of computing, integrated networking and storage wrapped in and delivered with quality management software.
Like a Happy Meal, there is a special sauce that consumers really enjoy. Hitachi Data Systems has access to the Hitachi special sauce, which is Hitachi’s intellectual property that ranges across a spectrum of technology from electric motors to nuclear energy plants.
My objective is to help educate the global market outside of Japan that HDS is more than the industry leader in storage IP.
Since April 1, 2010, HDS has been selling Hitachi servers globally and has been growing its customer base; however, we have decided to keep raising the bar and keep giving our customers more.
Just as we’ve listened to the customer and delivered reference architectures, or converged data center solutions such as Hyper-V Fast Track or, most recently, Hitachi Clinical Repository, we’ll continue to deliver converged solutions (or a Hitachi all-in-one meal) that include Hitachi servers, integrated networking, storage and management software, plus some additional Hitachi special sauce. Solutions will range across high performance SMP, security, OEM software integration and vertical segment integration, all of which will be aligned to our customers’ needs.
The good news for our customers is they won’t have to cobble together options from three or four different vendors with the misperception that the solution is “unified” or “flexible”.
After all, who would actually expect someone to buy a burger from one store, fries from another and a drink from a third with little to no integration across the solution?
What a Difference a Decade Makes
by Claus Mikkelsen on Feb 28, 2011
There was an interesting phenomenon occurring a decade or more ago.
SANs and Open Systems were in their heyday (still are, but much changed) and everyone was rushing to update their skills and get certified for this and that, including myself. Those were fun times (and still fun, in my mind). This was also the time that the sage analysts, consultants, and media were proclaiming the mainframe was dead and was totally un-cool. End of debate.
Those of us that grew up on mainframes retreated into the shadows. Many folks even went so far as to remove this skill from their resumes (I know that to be true) and dye their hair (OK, so I’m exaggerating a bit on this one). But the future turned to a NEW set of challenges, problems, issues, and concerns as mainframes were relegated to the dark ages. But were they new? Read on…except for you non-mainframers, you can stop reading now….
One thing that always amused me was that everything had to have a new term. DASD became disk, datasets became files, VTOCs became file systems, and pretty much every accepted term was changed. But as time has passed, it strikes me we’ve come full circle. We’ve finally seen long-standing mainframe functions appear in the open space. Decades later, we’re seeing concepts like thin provisioning (RLSE), archiving (HSM migration), disk tiering (SMS Storage Class), LUN expansion (secondary allocation), security (RACF), and the like.
But I think 2011 will be an exciting year for the mainframe again, at least on the storage side. A few weeks ago, we had a “mainframe summit” here at HDS to discuss where the mainframe market is going, and you know what? You just might see some interesting announcements from HDS in this space. Actually, you might even see some exciting developments. While we’ve ported many mainframe concepts to open systems in the last decade, we’re now poised to take what we’ve learned and reapply it to the “zSpace”. Stay tuned for more information on these developments. I know I’m pretty excited.
Also, this is good timing: as I write this post, I’m heading to the SHARE conference in Anaheim. I have two gigs there. The first is on Tuesday, March 1st at 9:30 a.m., presenting:
Storage Virtualization with the HDS Virtual Storage Platform: Saving time, Reducing costs, and Transforming the Data Center to the Information Center
And the second is on Wednesday at 4:30 p.m.:
9018: Replication Vendor Panel
We started this a few years ago, and it involves violating a bit of SHARE protocol by putting six folks (two each from HDS, IBM, and EMC) on a panel debate. And debate it is! We’ve had some memorable moments – not exactly of the Lincoln-Douglas variety – but we’re not afraid to take a few shots at each other, and the audience is not shy at trying to stump us. It’s all good fun and the feedback we’ve received has been great. In spite of the jousting and poking, we’ll generally retire over a glass or two and continue the debate. This debate is never over!
So if you’re in Anaheim, please stop by the sessions. There’s a lot of fun to be had.
And while you’re at it, try to imagine what HDS has up its sleeves this year…
Virtualization is the way of the future? It is now.
by Dave Wilson on Feb 18, 2011
When you spend as much time in an airport as I do, you will appreciate it when I say that you meet all kinds of interesting people when you are playing hurry-up-and-wait. More and more, I spend my time chatting to those around me. Sometimes I learn something and sometimes I teach something – like recently while sitting on the tarmac, I struck up a conversation with my seatmate, who like me traveled in a suit with a single carry-on suitcase and a laptop bag easily accessible once the seat belt light is turned off.
After dispensing with the pleasantries, I was pleased to discover we were in remarkably similar fields – and with a wide range of topics to chat about, we settled into an interesting conversation about virtualization. The chat began to derail a bit when he said, “I read an article recently that said virtualization is the way of the future”. I had to admit (and I told him so), I was a bit surprised – the future? Do you mean….it’s not here and now?
I was recently involved in a CIO Summit where we surveyed the attendees and found that all of them had a virtualization strategy; granted, it was predominantly servers and desktops, but still, we have moved beyond the early adoption stage and into the mainstream. So either my seatmate was quoting an article from the year 2001 or maybe he was talking about the second round of virtualization: storage, to be exact.
A hospital typically buys storage at a departmental level, driven by the departmental application needs. In a growing number of scenarios, there exists an enterprise storage strategy from the perspective of virtualization where all of these different departments can be virtualized to reduce the amount of administration required, save money from the storage acquisition perspective and optimize the existing systems. Today, Hitachi is working with health provider organizations to show the ROI possible through an innovative Storage Economics offering that can estimate the savings available!
We then started talking about how virtualizing storage was NOT the last frontier for virtualization. This got him completely confused. Virtualizing storage is great and has significant benefits to the IT department and C-Suite, but what about virtualizing the information that all of these silos generate? This degenerated into discussions around vendor neutral archives and, as this is my background, I may have caused his eyes to glaze over (although it could have been the drink). We soon got beyond radiology and discussed how to create an Electronic Health Record infrastructure that would transcend the silos we find today and enable physicians to see the complete medical record of their patient.
This is the future aspect of virtualization. In reading a report on Canadian healthcare, the author identified that 33% of facilities had trouble sharing data between different departments. This has a significant clinical impact on physicians and patients. Getting access to data is critical to good decision making and it can contribute to cost savings in the long run. My seatmate started to see where I was going with this. Just because a doctor has a computer doesn’t mean he has access to your information. Scary thought!
It’s on flights like this that time passes quickly; I only hope my new friend feels the same way.
Don’t forget – Hitachi Data Systems will be at HIMSS11 in Orlando, Florida February 20-24 in Booth #6742, where you can see – first hand – Hitachi virtualization solutions and Hitachi’s capabilities around managing metadata from disparate information silos across a healthcare facility.
It should be an exciting show and we are looking forward to it – see you in Orlando!
Top 5 Myths About Vendor Neutral Archives
by Dave Wilson on Feb 8, 2011
A Vendor Neutral Archive (VNA), according to Mike Cannavo of Image Management Consultants, is defined as a standards-based archive that works independent of the Picture Archiving and Communications System (PACS) provider storing all data in non-proprietary, interchange formats. A VNA also provides context management so that information can be transferred seamlessly between disparate PACS. A VNA provides the following requisite functionality:
- A Digital Imaging and Communications in Medicine (DICOM) storage platform that stores DICOM as it is received, unchanged and with all proprietary and optional DICOM tags present
- Consolidation of multi-departmental imaging centers
- The ability to morph DICOM header information so that it is “standardized” between facilities
- A shareable database that is easily queried by different PACS vendors
- Stores all DICOM SOP classes, and
- The ability to import and export data in DICOM format
Listening to the hype, a VNA promises to eliminate current and future data migrations, consolidate imaging data throughout the enterprise and simplify access for physicians. But it’s not so cut and dried, and herein lies the grey fuzzy space between truth and marketing spin.
Myth #1: Vendor Neutral Archives are actually vendor neutral
The name is misleading. How does anything procured from a single vendor become vendor neutral? Every vendor has some proprietary mechanism or confidential code that makes the VNA not a VNA. The VNA uses a database to store the data and while it may be a commercial database, the schema and structure would be proprietary.
Maybe a better name would be PACS neutral archive, as the intention of the VNA is to enable sharing of medical images between disparate PACS products. The only true VNA would be one acquired through Open Source channels like SourceForge or Tigris. The key to a VNA is the ability to accept the data in a non-proprietary format and to ensure that the data can be shared between applications and users. Providers should look into the technical aspects of how their data is being stored and the effort required to share it.
Myth #2: Purchasing a Vendor Neutral Archive means never having to migrate data
I don’t understand this claim. Data is stored on disk and information is recorded in a database. When a DICOM image has some form of structural change made to it, the typical workflow involves HL7 messages updating the database. DICOM headers are not changed – the change is recorded in the database.
What happens when the provider gets tired of working with their VNA partner (it does happen, you know)? If the hospital wants to move to another vendor they’ll need access to their data. If they just migrate the DICOM data from disk to disk all of the structural changes (merges, splits…) would be lost. Databases need to be updated. When a new VNA is installed, then the PACS that feed it need to know where the new data is; in fact, they need to know what data IS available. So how does a VNA eliminate the need for migration?
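To make the migration problem concrete, here is a deliberately simplified Python sketch. It is not real DICOM or HL7 handling: the image IDs, the `PatientID` values and the two dictionaries are hypothetical stand-ins for files on disk and the archive's database.

```python
# Hypothetical, simplified model: not real DICOM or HL7 handling.

# Image files on disk: headers stay exactly as originally received.
files_on_disk = {
    "img-001": {"PatientID": "A123"},
    "img-002": {"PatientID": "A132"},   # wrong ID entered at acquisition
}

# Archive database: later corrections (e.g. a patient merge driven by
# an HL7 message) are recorded here, not written back to the headers.
db_corrections = {
    "img-002": {"PatientID": "A123"},
}


def effective_record(image_id, files, corrections):
    """Header as stored on disk, overlaid with any database corrections."""
    record = dict(files[image_id])
    record.update(corrections.get(image_id, {}))
    return record


def images_for_patient(patient_id, files, corrections):
    """IDs of all images that belong to the patient after corrections."""
    return sorted(
        image_id
        for image_id in files
        if effective_record(image_id, files, corrections)["PatientID"] == patient_id
    )


# Old archive: files plus database.
with_db = images_for_patient("A123", files_on_disk, db_corrections)

# Naive disk-to-disk migration: files copied, database left behind.
files_only = images_for_patient("A123", files_on_disk, {})
```

In this toy model the old archive finds both images for patient A123, while the files-only copy finds just one, because the merge lived only in the database.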
Myth #3: Infrastructure vendors can’t provide VNA functionality
If PACS vendors can provide Vendor Neutral Archives, why can’t infrastructure vendors like Hitachi provide the same functionality? While a traditional PACS vendor understands DICOM and medical imaging, Hitachi understands the nuances of sharing data and of using metadata to get more out of the information generated.
Hitachi’s capabilities expand beyond the medical imaging departments to other aspects of the hospital to include applications like SharePoint and Exchange, not to mention non-DICOM formats generated from laboratory, pharmacy and hospital information systems. What PACS vendor can make that claim – without DICOMizing the data? As we talked about in the previous blog post – Hitachi’s command of metadata is driving change throughout the provider space to maximize the use of information.
Myth #4: To implement Electronic Health Records (EHR) effectively, there needs to be an enterprise imaging archive that can consolidate all the different departmental systems.
OK – this is partially true. To enable access to the medical imaging component of the EHR, an enterprise imaging archive would make things easier. However, the imaging component only accounts for 20% of the patient information (it represents 60% of the storage requirements). Almost 80% of the patient’s medical record information is from sources other than medical imaging. Being able to consolidate information from all departments is critical to developing the EHR.
Providers need to look beyond the medical imaging component and ensure that the infrastructure decisions they make will support an Electronic Health Record incorporating all aspects of the hospital, not just medical imaging.
Myth #5: Hitachi Data Systems is a storage vendor, not a VNA provider
Hitachi has been around for over 100 years and has been involved in many aspects of healthcare. This experience has led to a deep understanding of how today’s challenges need to be addressed. Providers need solutions that help them solve the challenge of interoperable data between applications. They need to be able to build an infrastructure that will support their EHR strategies while delivering cost effective solutions. Hitachi has developed solutions that will enable the virtualization of information between these silos providing access to the EHR, portals and the longitudinal patient record.
Hitachi Data Systems will be at HIMSS 2011 in Orlando, Florida February 20-24 in Booth #6742, where we will make an exciting announcement. Hitachi’s capabilities around managing metadata from disparate information silos across a healthcare facility will be highlighted in Hall A (EHR/EMR, Booth # 263) on Monday, Feb 21st, 1:30 – 2:15 p.m. Come out and see how Hitachi is addressing the challenges faced by providers.
HDS as a Member of the Hitachi Family: The View From Japan
by Michael Hay on Feb 2, 2011
There are a lot of opinions in the industry as to the relationship between HDS and Hitachi Ltd. in Japan. When last I checked Hitachi is included in our name, Hitachi Data Systems, and we are most certainly a member of the Hitachi group. To dispel any lingering doubts I recommend looking at this presentation from the President of Hitachi, and in particular slide 17, as it is extremely interesting. I won’t summarize this slide, but instead have included it here for your direct review to illustrate that we are one company with one vision.
From my position here in Japan, I know that the spirit of collaboration memorialized in Mr. Nakanishi’s diagram is real. On a given week, I personally spend time at various business divisions and research labs to engage in debate about, co-design, and jointly frame the tone for our future. However, I am not alone in working in partnership with various teams throughout the Hitachi Group — including engineering at HDS. Folks on my team, colleagues within my parent organization at HDS, and sister organizations within HDS all engage with a variety of teams throughout the world.
The feeling is mutual. Engineers and researchers throughout the Hitachi group all recognize that they need to partner with HDS so that we can mutually understand the requirements and future intentions of Hitachi’s users and customers.
More generally, Hitachi’s honest intentions are to improve society, which we talk about through our concept of the Social Innovation Business. Already we are seeing real changes including the announced updates to the research labs within Hitachi — astute readers will notice that within Mr. Azuhata’s presentation is a picture of HDS headquarters in Santa Clara — in support of our aspirational corporate theme. Overall, the market can expect a more visible partnership between Hitachi group companies and more engagement with our customers across a variety of expanding segments, and HDS is a key component of Hitachi’s plan to achieve that vision.
Would you buy a Lego set with no picture on the box?
by Gary Pilafas on Feb 1, 2011
A colleague and I were recently visiting a customer to solicit some feedback about our future roadmaps. During the meeting, the customer made a comment related to Legos. He mentioned that all of his vendors come in and show him the individual Legos that they sell (server, integrated networking, storage, software, etc.) but nobody can sell him the picture of what the solution looks like when the pieces are snapped together, much like what is on the box in which Lego pieces come.
Being the strategist that I am, I left this meeting with a lot of ideas in my head. Having been a customer myself not long ago at the largest airline in the world, I used to have the same thoughts, and the question came to mind, “Why can’t someone find a way to put all of these different parts together and deliver them to me as an integrated platform?”
Based upon this experience on the customer side, I began thinking about Hitachi and all of its intellectual property that is sold around the globe as individual parts. I began to think, “Why can’t we take all of these different pieces of IP and bundle them together?” In fact, this is exactly what our competitors are doing.
One company sells storage and virtualization software but doesn’t sell servers, so they fill the hole through acquisition. Then, they form a partnership and give it an “ABC” type of name or create another offering with the word “Flexible” in the name. In reality, there is not a company that can offer the complete integrated platform that Hitachi can.
Integrated Platforms & Pervasive Computing
What you are going to read from me on Techno-Musings going forward is how integrated platforms are leading the way to what I define as “Pervasive Computing.” What is pervasive computing? I consider pervasive computing to be what comes next after cloud computing. Pervasive computing allows an integrated circuit (IC) to be installed anywhere: in your fork, where it measures calorie intake; in your car, where it enables more intelligent safety and communication protocols; or in a soccer ball that may or may not have crossed the goal line in World Cup soccer.
Pervasive computing is our tomorrow, and we need to begin planning for this technology. The way we begin doing so is to combine all of Hitachi’s intellectual property into converged, packaged solutions that are easy, fast and cheap for our customers to deploy. We believe that now is the time to help address customer problems with the cornerstones of Hitachi technologies, which are reliability and performance.
Because, if you think about it, why would customers want to buy a solution from multiple companies that have no integrated abilities to manage the entire stack as a “Certified Solution”?
The Role of Standards When Stacking to the Sky
by Michael Hay on Jan 24, 2011
I have blogged before on vertical stacks in “An observation on vertical integration” and again in response to George Crump in “The Rack is the New Server, the Data Center is the New Rack,” both of which talk about the trend that continues to evolve in the industry. In “The Rack…” I hinted at the need for systems management within vertically integrated stacks. In fact, I think as an industry we need to make progress on management standards for the verticalized stacks.
Back in the day, the well oiled API for management was SNMP. Everything adhered to MIB-2 and then there were enterprise specific MIBs per company and device. As the world started to change and flaws in SNMP were uncovered, the need grew for new standardization efforts, which resulted in SMI-S, WMI (not truly a standard), IPMI and SMASH, many of which used CIM at their core. (Note: I have had reasonable person-to-person debates around the idea that, had SNMPv3 arrived sooner, we may not have spawned many of the additional standardization efforts.) However, these newer standards and the venerable SNMP are not yet primed for the world of the vertically integrated stack.
To get to a standards based aggregation API, it is necessary to potentially go through many steps, such as:
- Define an aggregate object model
- Consider a variety of use cases
- Consider security aspects
- Define a transport protocol
- Publish reference implementations potentially as open source
- “Knight a standards body” as the go-to location for the standard, etc.
All the while the industry (customers, systems integrators, and vendors) has to wait until the standard emerges, which presents the risk that by the time it arrives, it is no longer relevant due to market dynamics. When it comes to picking stacks, I do believe that many CxOs want to enact a two-vendor strategy, and this brings with it an implicit mandate to adopt an aggregated “rack level” management interface, whether this is based on de facto or formal standards. This interface is not about an OS vendor, a virtual machine vendor, a pure storage vendor (which Hitachi is no longer), a networking vendor, etc. building the standard, but about a coalition of companies who can field complete stacks coming together to construct a prototype management interface.
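To give a feel for the first step in the list above, here is a minimal Python sketch of what an aggregate, rack-level object model might look like. Every name in it (`Rack`, `Component`, `Health`, the component kinds) is hypothetical; it is not drawn from SNMP, SMI-S, CIM or any other existing standard.

```python
from dataclasses import dataclass, field
from enum import Enum


class Health(Enum):
    """Hypothetical health states, ordered from best to worst."""
    OK = 0
    DEGRADED = 1
    FAILED = 2


@dataclass
class Component:
    name: str
    kind: str                 # e.g. "server", "network", "storage"
    health: Health = Health.OK


@dataclass
class Rack:
    """Aggregate view: one object that answers for the whole stack."""
    name: str
    components: list = field(default_factory=list)

    def health(self):
        """Worst health of any component; an empty rack reports OK."""
        return max((c.health for c in self.components),
                   key=lambda h: h.value, default=Health.OK)

    def by_kind(self, kind):
        """Names of all components of the given kind."""
        return [c.name for c in self.components if c.kind == kind]


rack = Rack("rack-01", [
    Component("blade-1", "server"),
    Component("switch-1", "network", Health.DEGRADED),
    Component("array-1", "storage"),
])
```

The point of the aggregate is that a management tool can ask the rack one question, such as `rack.health()`, instead of polling each server, switch and array through a different interface.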
Are there any takers as we stack our building blocks to the sky?
Storage Visions Conference 2011: Looking into the Clouds
by Miki Sandorfi on Jan 21, 2011
Guest post by Tracey Doyle
Storage Visions is an event focused on digital media storage that occurs every year just before CES. I am lucky enough to have presented at Storage Visions the past four years, and am pleased to share this year’s presentation with you here (see the bottom of the post).
During this time, the conference has increased in attendees and in number of exhibitors but continues to stay focused on the forefront of what’s next in storage technology. It only makes sense that this year the cloud was a major topic of presentations, panel discussions and the buzz around the exhibition hall.
Why is cloud so hot in 2011? As the definition of a cloud is solidifying, people are starting to seriously consider it for their storage needs. The consensus at Storage Visions is right in line with Hitachi Data Systems’ view on the cloud.
The 451 Group defines cloud as a way of using technology, not a technology in itself; it’s a self-service, on-demand pay-per-use model. Consolidation, virtualization and automation strategies will be the catalysts behind cloud adoption. Given this definition, businesses are able to see the opportunity to avoid capital outlay and reduce operating expenses. The chatter at Storage Visions was that the on-demand model is very enticing in this “do more with less” IT environment.
I saw the building excitement for the cloud in many of my conversations last week. People are starting to see the light through the clouds. Puns aside, many of the cloud solutions that were being discussed were right in line with HDS’s cloud roadmap. Scaling up and down on an as-needed basis is key for customers moving forward. This is great news, as this means our HDS offering hits the mark for what customers need.
There is a lot of talk of predictions for cloud in 2011. I see two major themes for the year. First, the focus will be on cost, and on being transparent with customers about areas of cost avoidance and savings. Second, cloud will continue to be an area of confusion for customers. They know they probably want or need the cloud, but they need help figuring out how to get there. This will take some influence and education from cloud providers.
We have been excited about cloud since early 2010. We had some innovative Cloud releases during the second half of 2010 including our cloud private file tiering offering. Kicking off 2011 with a conference that had so much focus on moving towards the cloud was energizing!
Hearing customers’ needs and knowing that HDS’s model for cloud fits perfectly with where customers want to go was a perfect way to enter the New Year. Our low-risk, go-at-your-own-pace, innovative pay-per-use consumption model enables customers to scale up or scale down on a monthly basis using remote managed services.
I am looking forward to this exciting year in the clouds. I hope you are too!
How the Latest FRCP Changes Should Put Experts on Notice
by Eric Hibbard on Jan 19, 2011
The Federal Rules of Civil Procedure (FRCP) is a set of regulations that specify procedures for civil legal suits within United States District Courts. Federal district courts in all fifty states are required to follow these rules, and many state courts’ civil procedural rules closely follow or adopt similarly worded rules. Currently there are 86 rules in the FRCP, which are grouped into 11 chapters.
These rules are periodically amended, and in December 2006 these rules underwent significant changes that addressed the following issues (over simplified):
- Formally recognizes Electronically Stored Information (ESI).
- Requires the parties to address ESI early in the discovery process (e.g., during the “meet and confer” phase).
- Addresses the format of production of ESI, and permits the requesting party to designate the form or forms (e.g., native file formats) in which it wants ESI produced.
- Addresses discovery of ESI from sources that are not reasonably accessible.
- Establishes the procedure for asserting claim of privilege or work product protection after production (i.e., “claw back” provisions).
- Incorporates a “safe harbor” limit on sanctions for the loss of ESI as a result of the routine operation of computer systems.
- Includes sanctions that a court can impose (e.g., for spoliation of ESI).
- Allows discovery from non-parties.
The 2006 amendments moved the U.S. courts into the digital age rather abruptly, but not without some growing pains. Many lawyers and judges still struggle with issues associated with ESI as well as electronic discovery. Case law has helped address some of the initial issues and confusion, but further adjustments to the Rules may be necessary.
Speaking of amendments to the FRCP, the U.S. Supreme Court ratified proposed changes to the FRCP in July 2010; since Congress did not intervene, the changes went into effect on December 1, 2010. These latest amendments were not nearly as earth-shattering as the 2006 amendments, but for security professionals and other experts, they are worth noting.
In a nutshell, the Committee on Rules of Practice and Procedure, which proposed the amendments, expressed that the existing Rule 26 (Duty to Disclose; General Provisions Governing Discovery) “inhibits robust communications between attorney and expert trial witness, jeopardizing the quality of the expert’s opinion.” The amendments restore protections to certain aspects of the communications between experts and retaining counsel.
To digress for a moment, the form of the Rule 26 was established in 1993, which was amended at that time to eliminate the prior regime that had provided work product protection to testifying experts. Under the Rule 26(a)(2) and prior to December 2010 amendments, a testifying expert witness’ report was required to contain data “or other information” considered by the witness in forming his opinions.
Many courts interpreted this language broadly, holding that conversations between counsel and testifying expert witnesses were discoverable, and requiring the production of attorney work product (even opinion work product) if given to a testifying expert witness. Some courts also held that disclosure of documents protected by the attorney-client privilege to a testifying expert witness waives the privilege.
In response to these rulings, many counsel and experts developed practices to shield information from discovery such as using two sets of experts (one testifying, one not) and refraining from creating draft reports or other written work product (i.e., highly inefficient and potentially costly approaches).
Under the new amended Rules, there are two kinds of testifying expert witnesses: 1) witnesses who are “retained or specially employed to provide expert testimony in the case or one whose duties as the party’s employee regularly involve giving expert testimony,” and 2) witnesses who are not retained for the purpose of providing testimony but are otherwise qualified to offer expert opinions, such as treating physicians. The former are required to provide a written report, while under the new rules counsel must provide a more limited statement regarding the subject of the latter’s testimony. Communications between counsel and expert witnesses who are required to provide a report will be generally protected. Note that this protection applies to communications regardless of form.
For experts who are not required to provide a written report, under the new FRCP Rule 26(a)(2)(C), counsel must state the subject matter and summarize the facts and opinions to which an expert is expected to testify. This revision is designed to prevent a party being ambushed by an unknown expert opinion, as well as reduce the likelihood that courts will require a full written report from such experts.
Under the new FRCP Rule 26(b)(4)(B), the work product protection applies to drafts of written reports as well as drafts of the disclosure for other expert witnesses required under FRCP Rule 26(a)(2)(C).
All experts will be required to disclose information with respect to:
- Compensation received (including communications associated with the expert’s compensation).
- Communications identifying facts or data considered by the expert are not protected. However, other conversations about the potential relevance of those facts are protected.
- Communications identifying assumptions provided by an attorney are not protected if the expert actually relied upon them. Communications about general hypotheticals or other assumptions that the expert does not rely upon are protected.
The ABA’s website contains a red-lined version of the prospective Rule 26 for those seeking more information.
On the surface, these amendments should be good for experts, but it will be interesting to see how the legal community uses them. Do you think this will cut out some of the current inefficiencies?
NIST Delivers a New Batch of Security Publications
by Eric Hibbard on Jan 13, 2011
For security professionals, especially those of us based in the U.S., the guidance from the National Institute of Standards and Technology (NIST) provides a wealth of information. Of particular interest are the documents from the 800 Series Special Publications and the NIST Interagency or Internal Reports (NISTIRs) from the Computer Security Resource Center (CSRC) of the Computer Security Division.
Normally, things get kind of quiet within the Government during the holidays (i.e., between Thanksgiving and the Birthday of Martin Luther King, Jr.). NIST clearly bucked that trend and issued a whole bunch of interesting documents during this period. The remainder of this blog post provides a summary of the key security publications that NIST made publicly available.
- NIST Special Publication 800-39 (Final Public Draft), Integrated Enterprise-Wide Risk Management: Organization, Mission, and Information System View
The final public draft of Special Publication 800-39 introduces a three-tiered risk management approach that allows organizations to focus, initially, on establishing an enterprise-wide risk management strategy as part of a mature governance structure involving senior leaders/executives and a robust risk executive (function). The risk management strategy addresses some of the fundamental issues that organizations face in how risk is assessed, responded to, and monitored over time in the context of critical missions and business functions.
The strategic focus of the risk management strategy allows organizations to influence the design of key mission and business processes—making these processes risk aware. Risk-aware mission/business processes drive enterprise architecture decisions and facilitate the development and implementation of effective information security architectures that provide roadmaps for allocating safeguards and countermeasures to information systems and the environments in which those systems operate.
- Draft NIST Special Publication 800-51 Revision 1, Guide to Using Vulnerability Naming Schemes
The purpose of this document is to provide recommendations for using vulnerability naming schemes. The document covers two schemes: Common Vulnerabilities and Exposures (CVE) and Common Configuration Enumeration (CCE). The document gives an introduction to both schemes and makes recommendations for end-user organizations on using the names produced by these schemes. The document also presents recommendations for software and service vendors on how they should use vulnerability names and naming schemes in their product and service offerings.
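As a rough illustration of why consistent names matter, the sketch below pulls CVE names out of free-form text so that different reports can be correlated on the same identifier. The pattern (the `CVE-` prefix, a four-digit year, and a sequence of four or more digits) and the function name `extract_cve_ids` are assumptions made for this example; they are not taken from the NIST draft.

```python
import re

# Assumed pattern for illustration: "CVE-", a 4-digit year, then 4+ digits.
CVE_PATTERN = re.compile(r"\bCVE-\d{4}-\d{4,}\b", re.IGNORECASE)


def extract_cve_ids(text):
    """Return the distinct CVE names found in text, upper-cased, in order of first appearance."""
    found = []
    for match in CVE_PATTERN.findall(text):
        name = match.upper()
        if name not in found:
            found.append(name)
    return found


if __name__ == "__main__":
    advisory = "Patched CVE-2010-3333 and cve-2009-0001; CVE-2010-3333 was rated critical."
    print(extract_cve_ids(advisory))
```

Normalizing case and removing duplicates keeps the output usable as a lookup key when the same vulnerability is mentioned by several tools or vendors.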
- NIST Special Publication 800-119, Guidelines for the Secure Deployment of IPv6
The purpose of this document is to provide information security guidance to organizations that are planning to deploy IPv6 technologies or are simply seeking a better understanding of IPv6. The scope of this document encompasses the IPv6 protocol and related protocol specifications. IPv6-related security considerations are discussed with emphasis on deployment-related security concerns. The document also includes general guidance on secure IPv6 deployment and integration planning.
- NIST Special Publication 800-127, Guide to Securing WiMAX Wireless Communications
The purpose of this document is to provide information to organizations regarding the security capabilities of wireless communications using WiMAX networks and to provide recommendations on using these capabilities.
WiMAX technology is a wireless metropolitan area network (WMAN) technology based upon the IEEE 802.16 standard. It is used for a variety of purposes, including, but not limited to, fixed last-mile broadband access, long-range wireless backhaul, and access layer technology for mobile wireless subscribers operating on telecommunications networks. The scope of this document is limited to the security of the WiMAX air interface and user subscriber devices, to include: security services for device and user authentication; data confidentiality; data integrity; and replay protection.
- NIST Special Publication 800-132, Recommendation for Password Based Key Derivation Part 1: Storage Applications
This Recommendation specifies a family of password-based key derivation functions (PBKDFs) for deriving cryptographic keys from passwords or passphrases for the protection of electronically-stored data or for the protection of data protection keys.
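For readers who have not worked with password-based key derivation, here is a minimal sketch using PBKDF2 from Python's standard library. The hash choice (SHA-256), the iteration count, the salt length, and the helper name `derive_key` are illustrative assumptions for this example, not values quoted from the publication; consult the Recommendation itself for its actual requirements.

```python
import hashlib
import os


def derive_key(password, salt, iterations=100_000, length=32):
    """Derive a fixed-length key from a password with PBKDF2-HMAC-SHA256.

    The iteration count and key length are illustrative defaults only.
    """
    return hashlib.pbkdf2_hmac(
        "sha256", password.encode("utf-8"), salt, iterations, dklen=length
    )


if __name__ == "__main__":
    salt = os.urandom(16)  # random salt, stored alongside the protected data
    key = derive_key("correct horse battery staple", salt)
    print(len(key), "byte key derived")
```

The same password and salt always yield the same key, so only the salt and iteration count need to be stored with the protected data; a different salt produces an unrelated key.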
- NIST Special Publication 800-135, Recommendation for Application-Specific Key Derivation Functions
This document specifies security requirements for existing application-specific key derivation functions in: IKEv1 and IKEv2, SSH, TLS, SRTP, the User-based Security Model for version 3 of SNMP, the Trusted Platform Module (TPM), American National Standard (ANS) X9.42 (Agreement of Symmetric Keys Using Discrete Logarithm Cryptography) and ANS X9.63 (Key Agreement and Key Transport Using Elliptic Curve Cryptography).
- Draft NIST Special Publication 800-137 (initial public draft), Information Security Continuous Monitoring for Federal Information Systems and Organizations
The initial public draft of Special Publication 800-137 provides guidelines to assist organizations in the development of a continuous monitoring strategy and the implementation of a continuous monitoring program providing visibility into organizational assets, awareness of threats and vulnerabilities, and visibility into the effectiveness of deployed security controls. It provides ongoing assurance that planned and implemented security controls are aligned with organizational risk tolerance as well as the information needed to respond to risk in a timely manner should observations indicate that the security controls are inadequate.
- Draft NIST Interagency Report (NISTIR) 7693 Specification for Asset Identification 1.1
Asset identification plays an important role in an organization’s ability to quickly correlate different sets of information about assets. Draft NISTIR 7693 provides the necessary constructs to uniquely identify assets based on known identifiers and/or known information about the asset. The Asset Identification specification includes a data model, methods for identifying assets, and guidelines on how to use asset identification.
- Draft NIST Interagency Report (NISTIR) 7694 Specification for the Asset Reporting Format 1.1
The draft NIST Interagency Report (NISTIR) 7694 proposes the Asset Reporting Format (ARF), a data model for expressing the transport format of information about assets and the relationships between assets and reports. The intent of ARF is to provide a uniform foundation for the expression of reporting results. ARF is vendor and technology neutral, flexible, and suited for a wide variety of reporting applications. Draft NISTIR 7694 builds upon the asset identification concepts in draft NISTIR 7693, Specification for Asset Identification 1.1.
Enjoy all the good reading and let me know what you think about the NIST security publications.
Vendor Neutral Archive Baloney
by Bill Burns on Jan 11, 2011
Having cleared the holidays and now firmly back at work, I finally have found some time to reflect on this year’s Radiological Society of North America (RSNA) meeting in Chicago.
For those in regular attendance, we have weathered the “Client-Server”, “Enterprise Computing”, and “Web Based” marketing storms to continually advance the tools and technologies used in world-class Radiological care and informatics. However, we have entered (or fallen) into an entirely different sphere of “Hype’ology” with the Vendor Neutral Archiving (VNA) craze and its associated marketing flimflam that is grasping the Radiology Informatics space.
To be clear, a number of firms are providing DICOM migration, conversion, and toolkit services and products that improve workflow and increase the sharing of medical imaging assets. What is clearly misleading, however, is the current crop of messaging from a small but growing set of Picture Archive & Communications Systems (PACS) vendors touting their products as a VNA solution.
Almost all PACS imaging archives in today’s classical configuration are closed, proprietary systems that thwart, if not block, the wholesale migration and sharing of image data unless you are deeply committed to the software licensing and maintenance revenue stream of your current vendor. Claiming these systems as now magically VNA compliant is nothing more than a way to extend the proprietary grasp on your organization’s budget and extend the feigned reverence of your current PACS archive.
As healthcare institutions grapple with an ever-shrinking informatics budget, they are being forced to knock down these proprietary silos of data. Demanding systems that manage entire classes of imaging assets, and that do so in an open-standards, transparent manner, is just the “Table Stakes” in this new paradigm.
Let’s dial down the VNA hype. Healthcare informatics executives deserve real choices and a real conversation regarding content repositories.
2010 meant great changes for Techno-Musings. What will 2011 bring?
by Michael Hay on Jan 5, 2011
The past year was all about expansion — expansion of my family with the birth of my son, expansion of my role at HDS and expansion of the Techno-Musings blogger roll. These were all exciting and very positive changes in their own ways and make me very excited for what 2011 will bring.
Following Hu’s lead, as we start the year, I want to share with you three of the most popular posts on Techno-Musings during 2010 (based upon a combination of traffic and comments).
One of the biggest stories of 2010 was the Toyota recalls. Being in Japan gave me an interesting vantage point as the whole crisis unfolded, so I decided to share some of this by discussing the great track record and commitment to quality that Japanese companies, including Hitachi, have shown over time.
Proving the power of a good headline, I used the fact that InfiniBand was put into several new products last year as an opportunity to open up the discussion on this topic. Here was my introduction:
Specifically, I’ve been watching as IB has been put into several appliances of late: Oracle Exadata, Clusterix KVS, etc. There are a series of well oiled uses too such as RDMA for HPC, NetApp’s usage for internode communications, SGI’s mapping of NUMAflex on top of IB, XSigo’s use for I/O aggregation, etc. However, these are all niche plays and IB was meant to be the I/O messiah in the datacenter — or solve world hunger, I forget — but alas IB hasn’t really lived up to the hype. With all of that in mind is there a replacement on the horizon? (Mind you by replacement I’m thinking 3+ years out…)
This was one of Ken Wood’s first posts on Techno-Musings, and it generated healthy conversation in the comments. Here was his opening to the post:
So I’ve been looking into the benefits of Solid State Storage Devices over Hard Disk Storage Drives. Personally, I’m more infatuated by the performance advantage than the power advantage, but both play in this discussion. In fact, one solves the problem of the other as I’ll try to show. If you do a search on “ssd hoax”, the first hit that should pop up is an article from Tom’s Hardware, here, dating back to June 2008. The initial experiment was to find out how much an SSD device extended a laptop’s battery life over a standard hard disk.
Thank you all for reading Techno-Musings over the past year. Happy New Year to all and here’s to a great 2011!