Michael Hay

The Many Core Phenomena

Blog Post created by Michael Hay on Dec 13, 2016

Introduction

Have you ever started an effort to push something, figured out it was like pushing a rope, paused, and then realized someone else had picked it up and was pulling you instead?  Well, that is what F1 (not the car racing) did for the market’s use of many-core computing.  For Hitachi, our use of FPGAs, custom chips and standard CPUs, also known as many-core computing, is something we do because it elegantly and efficiently solves customer problems.  Even if it isn’t fashionable, it provides benefits, and we believe in benefiting our customers.  So in a very real sense, the self-initiated mission that Shmuel and I embarked on more than 10 years ago has just come full circle.

 

About a decade ago we (Shmuel Shottan and Michael Hay) began an open dialogue about, and began to emphasize, our efforts using FPGA technologies in combination with “traditional” multi-core processors.  The product was of course HNAS, and the result was a level of performance unachievable using only general-purpose CPUs.  Another benefit is that the cost performance of such an offering is naturally superior (from both a CAPEX and OPEX perspective) to that of an architecture utilizing only one type of computing.

 

Our dialogue depicted an architecture whose implementation was defined by the following attributes:

  • High degree of parallelism - Parallelism is key to performance, and many systems attempt to achieve it. While processor-based implementations can provide some parallelism (provided the data itself has parallelism), as demonstrated by traditional MIMD architectures (cache-coherent or message-passing), such implementations require synchronization that limits scalability. We chose fine-grain parallelism by implementing state machines in FPGAs.
  • Off-loading - Off-loading allows the core file system to independently process metadata and move data while the multi-core processor module is dedicated to data management. This is similar to traditional coprocessors (DSPs, systolic arrays, graphics engines). This architecture provides yet another degree of parallelism.
  • Pipelining - Pipelining is achieved when multiple instructions are simultaneously overlapped in execution. For a NAS system it means multiple file requests overlapping in execution (a minimal software sketch of this overlap follows the quote below).
…”So, why offload network file system functions?  The key reason was the need to achieve massive fine-grain parallelism. Some applications indeed lend themselves well to achieving parallelism with a multiplicity of cores; most do not. Since a NAS system will “park” on network and storage resources, any implementation that requires a multiplicity of processors will create synchronization chatter larger than the advantage of adding processing elements beyond a very small number. Thus, the idea of offloading to a “co-processor” required the design, from the ground up, of an inherently parallelized and pipelined processing element. Choosing a state machine approach and leveraging this design methodology by implementing in FPGAs provided the massive parallelism for the file system, as the synchronization was 'free'…” (Shmuel Shottan)
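To make the off-loading and pipelining ideas above a bit more concrete, here is a minimal software sketch, purely illustrative and not the HNAS implementation: three stages modeled as independent workers connected by queues, so several file requests are in flight at once, much as successive requests overlap across hardware pipeline stages. The stage names, timings and queue structure are assumptions made for illustration only.

```python
import queue
import threading
import time

def stage(name, inbox, outbox, work_seconds):
    """One pipeline stage: pull a request, simulate work, hand it to the next stage."""
    while True:
        req = inbox.get()
        if req is None:               # shutdown marker propagates down the pipeline
            if outbox is not None:
                outbox.put(None)
            return
        time.sleep(work_seconds)      # stand-in for parse / metadata / data-movement work
        if outbox is not None:
            outbox.put(req)
        else:
            print(f"{name}: completed {req}")

# Queues connecting the three illustrative stages: parse -> metadata -> move
q_parse, q_meta, q_move = queue.Queue(), queue.Queue(), queue.Queue()

threads = [
    threading.Thread(target=stage, args=("parse",    q_parse, q_meta, 0.01)),
    threading.Thread(target=stage, args=("metadata", q_meta,  q_move, 0.01)),
    threading.Thread(target=stage, args=("move",     q_move,  None,   0.01)),
]
for t in threads:
    t.start()

# Ten "file requests" enter the pipeline; each stage works on a different
# request at the same time, so the requests overlap in execution.
for i in range(10):
    q_parse.put(f"request-{i}")
q_parse.put(None)                     # signal shutdown once all requests are queued

for t in threads:
    t.join()
```

In hardware, of course, the stages are state machines operating every clock cycle rather than threads, which is what makes the synchronization effectively "free" in the quote above.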

[Image: shmuel award 2.png]

The implementation examples given in the above quote were about network file system functions, dedupe, and the like.  At the time the reference was published it seemed like a depiction of an esoteric technology, and it did not resonate well within the technical community. Basically, it was rather like pushing on a rope, meaning it was something of an exercise in futility for us. People were on the general-purpose CPU train and weren’t willing to think differently.  Maybe a better way to say it: the perceived return on the investment in writing for FPGAs, GPUs and other seemingly exotic computing devices was low.  If you did the math it wasn’t so, but at the time that was the perception.  Another attribute of the time: Moore’s Law still had plenty of gas.  So again, our message was like pushing on a rope, lots of effort and very little progress.  Our arguments weren’t lost on everyone, though.  The work was tracked by HiPC, and Shmuel was invited to deliver a talk at the HiPC conference. Additionally, people within the Reconfigurable Computing track of IEEE paid attention.

 

Mr. Moore, strike one for the general-purpose CPU: Intel acquired Altera, an FPGA company.

In a world where the trend of commoditization in storage, servers and networking has been discussed many times and “accepted” by many, the addition of the FPGA as an accepted “canvas” for developers to paint on is natural and welcome. Intel, the world’s largest chip company, is copying a page from its playbook: identify a mature and growing market and then embed it into its chipsets.  Does anyone remember that motherboards used to include only compute and core logic elements? Intel identified graphics, networking, SAS connectivity (now NVMe) and core RAID engines as ubiquitous over the previous decades, and indeed most NIC, HBA, RAID and graphics chip providers vanished.

 

Companies that leverage the functions provided by Intel’s chipsets for supply chain efficiency, while continuing to innovate and focus on the needs of users not satisfied with mediocrity, will excel (look at NVIDIA, with its focus on GPUs and CUDA). Others, which elected to compete with Intel on cost, perished. The lesson is that when a segment matures and becomes a substantial market, expect commoditization, and innovate on top of and in addition to the commodity building blocks, since one size does not fit all.

 

Some carry the commoditization argument further. Not only are all networking, storage and compute systems predicted to run on the same Intel platform from one’s favorite OEM, but even software will become just “general-purpose software”; thus, the argument goes, effort put into software development is a waste, since Linux will eventually do just as well.  This hypothesis is fundamentally flawed. It correctly recognizes the value of leveraging a supply chain for hardware, and OSS for software where applicable, but it fails to acknowledge that innovation is not, and never will be, dead.

 

IoT, AI and other modern applications will only place more demands on networking, compute and storage systems. Innovation should, and will, now include enabling relevant applications to be implemented with FPGAs and, where appropriate, custom chips. Hitachi, as the leading provider of enterprise scalable systems, is best positioned to lead in this new world.  Honestly, we know of no other vendor better positioned to benefit from such a trend. There are several reasons for this, but the most important is that we recognize that innovation isn’t about OR but AND: for example, general-purpose CPUs and FPGAs, scale-up and scale-out, and so on.  Therefore, we don’t extinguish skills because of fashion; we invest over the long term, returning benefits to our customers.

 

Mr. Moore, second swing and strike two: Microsoft Catapults Azure into the future

Microsoft announced that it has deployed hundreds of thousands of FPGAs (field-programmable gate arrays) across servers in 15 countries and five different continents. The chips have been put to use in a variety of first-party Microsoft services, and they're now starting to accelerate networking on the company's Azure cloud platform.

 

In addition to improving networking speeds, the FPGAs, which sit on custom, Microsoft-designed boards connected to Azure servers, can also be used to improve the speed of machine-learning tasks and other key cloud functionality. Microsoft hasn't said exactly what the boards contain, other than revealing that they hold an FPGA, static RAM chips and hardened digital signal processors.  Microsoft's deployment of the programmable hardware is important as the previously reliable increase in CPU speeds continues to slow down. FPGAs can provide an additional speed boost in processing power for the particular tasks that they've been configured to work on, cutting down on the time it takes to do things like manage the flow of network traffic or translate text.

 

Azure CTO Mark Russinovich said using the FPGAs was key to helping Azure take advantage of the networking hardware that it put into its data centers. While the hardware could support 40Gbps speeds, actually moving all that network traffic with the different software-defined networking rules that are attached to it took a massive amount of CPU power.

"That's just not economically viable," he said in an interview. "Why take those CPUs away from what we can sell to customers in virtual machines, when we could potentially have that off-loaded into FPGA? They could serve that purpose as well as future purposes, and get us familiar with FPGAs in our data center. It became a pretty clear win for us."

"If we want to allocate 1,000 FPGAs to a single [deep neural net] we can," said Mike Burger, a distinguished engineer in Microsoft Research. "We get that kind of scale." (Microsoft Azure Networking..., PcWorld)

That scale can provide massive amounts of computing power. If Microsoft used Azure's entire FPGA deployment to translate the English-language Wikipedia, it would take only a tenth of a second, Burger said on stage at Ignite. (Programmable chips turning Azure into a supercomputing powerhouse | Ars Technica )

Third strike, you’re out: Amazon announces FPGA-enabled EC2 instances, the new F1
The paragon of the commodity movement, Amazon Web Services (not Intel), has just placed bets on Moore’s Law, but not where you might think.  You see, if we change the game and say that Moore was right about the combined power of a team of processing elements, then Mr. Moore’s hypothesis still holds true for FPGAs, GPUs and other special-purpose elements.  Both of us believe that if software is kind of like a digital organism, then hardware is the digital environment those organisms live in.  And just as in the natural world, both the environment and the organisms evolve dynamically, independently and together.  Let’s tune into Amazon’s blog post announcing F1 to get a snapshot of their perspective.

"Have you ever had to decide between a general purpose tool and one built for a very specific purpose? The general purpose tools can be used to solve many different problems, but may not be the best choice for any particular one. Purpose-built tools excel at one task, but you may need to do that particular task infrequently.

 

"…Computer engineers face this problem when designing architectures and instruction sets, almost always pursuing an approach that delivers good performance across a very wide range of workloads. From time to time, new types of workloads and working conditions emerge that are best addressed by custom hardware. This requires another balancing act: trading off the potential for incredible performance vs. a development life cycle often measured in quarters or years….

 

"...One of the more interesting routes to a custom, hardware-based solution is known as a Field Programmable Gate Array, or FPGA. In contrast to a purpose-built chip which is designed with a single function in mind and then hard-wired to implement it, an FPGA is more flexible. It can be programmed in the field, after it has been plugged in to a socket on a PC board….

 

"…This highly parallelized model is ideal for building custom accelerators to process compute-intensive problems. Properly programmed, an FPGA has the potential to provide a 30x speedup to many types of genomics, seismic analysis, financial risk analysis, big data search, and encryption algorithms and applications…. (Developer Preview..., Amazon Blogs)

Hitachi and its ecosystem of partners have led the way, and continue to lead, in providing FPGA-based and, where relevant, custom-chip-based innovations in many areas, such as genomics, seismic sensing, seismic analysis, financial risk analysis, big data search, combinatorics, and more.

 

The real game: innovation is about multiple degrees of freedom

To end our discussion, let’s start by reviewing Hitachi’s credo.

[Our aim, as members of Hitachi,] is to further elevate [our] founding concepts of harmony, sincerity and pioneering spirit, to instill a resolute pride in being a member of Hitachi, and thereby to contribute to society through the development of superior, original technology and products.

 

Deeply aware that a business enterprise is itself a member of society, Hitachi [members are] also resolved to strive as good citizens of the community towards the realization of a truly prosperous society and, to this end, to conduct [our] corporate activities in a fair and open manner, promote harmony with the natural environment, and engage vigorously in activities that contribute to social progress.

This credo has guided Hitachi employees for more than 100 years and, in our opinion, is a key to our success.  What it inspires is innovation across many degrees of freedom to improve society.  This freedom could be in the adoption of a clever commercial model, novel technologies, commodity technologies, new experiences, co-creation and so on.  In other words, we are Hitachi Unlimited!

 

[Image: VSP FPGA Blade - sketch]

During the time of pushing on the rope (with respect to FPGA and custom chip usage), if we had listened to the pundits and only chased general-purpose CPUs, we would not be in a leadership position now that the market has picked up and pulled. Given this innovation rope, here are four examples:

 

Beyond these examples, there are many active and innovative FPGA and custom chip development tracks in our research pipeline.  So we are continuing to live up to the intention of our credo by picking up other ropes and pushing abundantly to better society!

Outcomes