Today many different interconnection topologies are used for multicore chips. For as few as eight cores direct bus connections can be made — cores taking turns using the same bus. MIT’s 36-core processors, on the other hand, are connected by an on-chip mesh network reminiscent of Intel’s 2007 Teraflop Research Chip — code-named Polaris — where direct connections were made to adjacent cores, with data intended for remote cores passed from core-to-core until reaching its destination. For its 50-core Xeon Phi, however, Intel settled instead on using multiple high-speed rings for data, address, and acknowledgement instead of a mesh.
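To make the mesh-versus-ring contrast in the quote a little more concrete, here is a minimal sketch that just counts hops. It assumes a 6x6 mesh with simple XY (dimension-order) routing and a single bidirectional ring of 36 nodes. Neither is claimed to be the actual MIT or Intel design (the Xeon Phi uses several rings and has more cores), and hop count is only one ingredient of real latency.

```python
from itertools import product


def mesh_hops(src, dst):
    """Hops between two (x, y) cores on a 2D mesh using XY routing.

    Data travels along X first, then along Y, so the hop count is the
    Manhattan distance between the two cores.
    """
    (sx, sy), (dx, dy) = src, dst
    return abs(sx - dx) + abs(sy - dy)


def ring_hops(src, dst, n):
    """Hops between two nodes on a bidirectional ring of n nodes."""
    d = abs(src - dst)
    return min(d, n - d)  # go around whichever way is shorter


def average_mesh_hops(side):
    """Average hop count over all ordered pairs of distinct cores."""
    nodes = list(product(range(side), repeat=2))
    pairs = [(a, b) for a in nodes for b in nodes if a != b]
    return sum(mesh_hops(a, b) for a, b in pairs) / len(pairs)


def average_ring_hops(n):
    """Average hop count over all ordered pairs of distinct ring nodes."""
    pairs = [(a, b) for a in range(n) for b in range(n) if a != b]
    return sum(ring_hops(a, b, n) for a, b in pairs) / len(pairs)


if __name__ == "__main__":
    print(f"6x6 mesh, average hops:     {average_mesh_hops(6):.2f}")
    print(f"36-node ring, average hops: {average_ring_hops(36):.2f}")
```

Under those assumptions the mesh averages 4.0 hops between cores and the single ring about 9.26, which is one reason a mesh starts to look attractive as the core count climbs.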
I commented some time back on a similar article on the same topic. It now appears the MIT research group has working silicon of the design. As mentioned in the pull-quote, the Xeon Phi (which has made some news in the Top 500 supercomputer stories recently) is a massively multicore architecture, but it uses a different interconnect that Intel designed on its own. Stories like these get filed into the category of massively multicore or low-power CPU developments. Most times these CPUs add cores without drawing significantly more power, and thus provide a net increase in compute ability. Tilera, Calxeda and yes, even SeaMicro were all working towards those ends. Whether through mergers or cuts in funding, each one seems to have trailed off without succeeding at its original goal (massively multicore, low-power designs). Along the way Intel has also done everything it can to dull and dent the novelty of the new designs by revising Atom-based or Celeron-based CPUs to provide much lower power at the scale of maybe 2 cores per CPU.
Like the chip MIT just announced, Tilera too was originally an MIT research product spun off from the university campus. Its principals were the PI and a research associate, if I remember correctly. Now that MIT has working silicon, they’re going to benchmark, test and verify their design. The researchers will release the Verilog hardware description of the chip for anyone to use, research or verify for themselves once they’ve completed their own study. It will be interesting to see how much of an incremental improvement this design provides, and it could possibly be the launch of another Tilera-style product out of MIT.
SUMMARY: Microsoft has been experimenting with its own custom chip effort in order to make its data centers more efficient, and these chips aren’t centered around ARM-based cores, but rather FPGAs from Altera.
FPGAs for the win, at least for eliminating unnecessary Xeon CPUs doing online analytic processing for the Bing search service. MS is saying it can process the same amount of data with half the number of CPUs by offloading some of the heavy lifting from general-purpose CPUs to specially programmed FPGAs tuned to the MS algorithms to deliver up the best search results. For MS the cost of the data center will out, and if you can drop half of the Xeons in a data center you just cut your per-transaction costs roughly in half. That is quite an accomplishment in these days of radical incrementalism when it comes to data center ops and DevOps. The Field Programmable Gate Array is known as a niche, discipline-specific kind of hardware solution. But when flashed and programmed properly, and reconfigured as workloads and needs change, it can do some magical heavy lifting from a computing standpoint.
Specifically, I’m thinking that really repetitive loops or recursive algorithms that take forever to unwind and deliver a final result are the things best done in hardware rather than software. For search engines that might be the process used to determine the authority of a page in the rankings (like Google’s PageRank). And knowing you can further tune the hardware to fit the algorithm means you’ll spend less time attempting to do the heavy lifting on the general-purpose CPU using really fast C/C++ code instead. In Microsoft’s plan that means fewer CPUs are needed to do the same amount of work. Better yet, if you come up with a better algorithm for your daily batch processes, you can spin up a new hardware/circuit design and apply that to the compute cluster over time (and not have to do a pull and replace of large sections of the cluster). It will be interesting to see if Microsoft reports any efficiencies in a final report. As of now this seems somewhat theoretical, though it may have been tested at least in a production test bed of some sort using real data.
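For a feel of what one of those “really repetitive loops” looks like, here is a minimal sketch of the textbook power-iteration form of PageRank. It is not Bing’s ranking code or Google’s production algorithm. The tiny three-page link graph, the damping factor of 0.85 and the iteration count are all assumptions picked for illustration, and every linked-to page is assumed to appear as a key in the graph.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Textbook PageRank by power iteration.

    links maps each page to the list of pages it links to.
    Returns a dict of page -> score; the scores sum to 1.
    """
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}  # start from a uniform guess

    for _ in range(iterations):
        # Every page gets a small baseline share regardless of links.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, outlinks in links.items():
            if outlinks:
                # Split this page's score evenly among the pages it links to.
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
            else:
                # A page with no outlinks spreads its score over everyone.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


if __name__ == "__main__":
    toy_web = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}  # made-up graph
    for page, score in sorted(pagerank(toy_web).items()):
        print(page, round(score, 4))
```

The same multiply-and-add inner loop runs over every link on every pass, which is exactly the sort of fixed, repetitive arithmetic that maps well onto dedicated hardware.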
It’s not often that you see something that makes you think “this is a game changer.” The introduction of logic synthesis circa 1990 was one such event; today’s introduction of SDNet from Xilinx may well be another.
Cisco has used different RISC chips over the years as its network processors, both in its network closet switches and in the core router chassis. The first generation was based on the venerable MIPS processor; subsequently they migrated to PowerPC, both for lower-power processing and for network-optimized CPUs. Cisco’s engineers would accommodate changes in function by releasing new versions of IOS, or they would release new line cards for the big multi-slot router chassis. Between software and hardware releases they would cover the whole spectrum of wired, wireless and optical networking. It was a rich mix of what could be done.
Enter now the possibility of not just Software Defined Networking (kind of like using virtual machines instead of physical switches), but software-defined firmware/hardware. FPGAs (field programmable gate arrays) are the computing world’s reconfigurable processor. So instead of provisioning a fixed network processor, and virtualizing on top of that to gain the software-defined network, what if you could work the problem from both ends? Reconfigure the software AND the network processor. That’s what Xilinx is proposing with this announcement of SDNet. The prime example given in this announcement is the line card that would slot into a large router chassis (some Cisco gear comes with 13 slots). If you had just a bunch of ports, let’s say RJ-45, facing outward, what then happens on the inside via the software/hardware reconfigurability would astound you. You want Fibre Channel over Ethernet? You want 10Gbit? You want SIP traffic only? You no longer buy a separate line card per application, each with its function set in stone. You tell the SDNet compiler these are the inputs, these are the outputs, please optimize the functions and reconfigure the firmware as needed.
Once programmed, that line card does what you tell it to do. It can inspect packets, act as a firewall, prioritize traffic, shape bandwidth or simply route things as fast as it possibly can. It doesn’t matter what signals are running over what pins; as long as it knows they’re RJ-45 connectors, it will do the rest. Amazing when you think about it that way.
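The idea of one line card taking on different jobs can be sketched in software as a rule table that gets swapped out. To be clear, this is not SDNet’s specification language or anything Xilinx has published. The `Packet`, `Rule` and `LineCard` names, the addresses and the two rule sets are all hypothetical, made up purely to illustrate “same ports, different function”.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Packet:
    src: str
    dst: str
    protocol: str  # e.g. "SIP" or "HTTP"
    size: int      # bytes


@dataclass
class Rule:
    name: str
    matches: Callable[[Packet], bool]  # predicate over a packet
    action: str                        # "drop", "priority" or "forward"


class LineCard:
    """Toy stand-in for a reconfigurable line card.

    Its whole personality is the rule table, so loading a new table
    changes what the card does without touching the ports.
    """

    def __init__(self, rules: List[Rule]):
        self.rules = list(rules)

    def reconfigure(self, rules: List[Rule]) -> None:
        self.rules = list(rules)

    def process(self, packet: Packet) -> str:
        for rule in self.rules:  # first matching rule wins
            if rule.matches(packet):
                return rule.action
        return "forward"         # default when nothing matches


# Two made-up personalities for the same card.
FIREWALL = [
    Rule("block-bad-host", lambda p: p.src == "10.0.0.66", "drop"),
    Rule("voice-first", lambda p: p.protocol == "SIP", "priority"),
]
SIP_ONLY = [
    Rule("sip", lambda p: p.protocol == "SIP", "forward"),
    Rule("everything-else", lambda p: True, "drop"),
]

if __name__ == "__main__":
    card = LineCard(FIREWALL)
    web = Packet("10.0.0.5", "10.0.1.9", "HTTP", 1500)
    print(card.process(web))   # handled by the firewall rules
    card.reconfigure(SIP_ONLY)
    print(card.process(web))   # same packet, new personality
```

On a real FPGA-based card the “rule table” would be compiled down into the hardware itself, which is where the speed comes from. The sketch only shows the shape of the idea.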
OpenCL is a breakthrough precisely because it enables developers to accelerate the real-time execution of their algorithms quickly and easily — particularly those that lend themselves to the considerable parallel processing capabilities of FPGAs (which yield superior compute densities and far better performance/Watt than CPU- and GPU-based solutions)
There’s still a lot of untapped energy available with the OpenCL programming tools. Apple is still the single largest manufacturer to have adopted OpenCL across a large number of its products (OS and app software). And I know from reading about supercomputing on GPUs that some large-scale hybrid CPU/GPU computers have been ranked worldwide (the Chinese Tianhe being the first and biggest example). This article from EETimes encourages anyone with a background in C programming to give it a shot and see what algorithms could stand to be accelerated using the resources on the motherboard alone. But being EETimes, they are also touting the benefits of using FPGAs in the mix as well.
To date the low-hanging fruit for desktop PC makers and their peripheral designers and manufacturers has been to reuse the GPU as a massively parallel co-processor where it makes sense. But as the EETimes writer emphasizes, FPGAs can be equal citizens too and might provide some more flexible acceleration. Interest in the FPGA as a co-processor for desktop to higher-end enterprise data center motherboards was brought to the fore by AMD back in 2006 with its Torrenza initiative for co-processors in the CPU socket. The hope back then was that giving a secondary specialty processor (at the time an FPGA) a place on the motherboard might open a market no one had addressed up to that point. So depending on your needs and what extra processors you might have available, OpenCL might be generic enough going forward to get a boost from ALL the available co-processors on your motherboard.
Whether or not we see benefits on the consumer-level desktop is very dependent on OS-level support for OpenCL. To date the biggest adopter of OpenCL has been Apple, as they needed an OS-level acceleration API for video-intensive apps like video editing suites. Eventually Adobe recompiled some of its Creative Suite apps to take advantage of OpenCL on MacOS. On the PC side Microsoft has always had DirectX as its API for accelerating any number of different multimedia apps (for playback, editing) and is less motivated to incorporate OpenCL at the OS level. But that’s not to say a third-party developer who saw a benefit to OpenCL over DirectX couldn’t create their own plumbing and libraries and ship a runtime package that used OpenCL to support their own apps, or license it to anyone who wanted it as part of a larger package installer (say for a game or for a multimedia authoring suite).
For the data center this makes way more sense than for the desktop, as DirectX isn’t seen as a scientific computing API or as a means of allowing a GPU to be used as a numeric accelerator for scientific calculations. In this context, OpenCL might be a nice, open and easy-to-adopt library for people working on compute farms with massive numbers of both general-purpose CPUs and GPUs handing off parts of a calculation to one another over the PCI bus or across CPU sockets on a motherboard. Everyone’s needs are going to vary, and vary widely in some cases. But OpenCL might make that variation easier to address by providing a common library that lets you touch all the co-processors available when a computation needs to be sped up. So keep an eye on OpenCL as a competitor to any GPGPU-style API and library put out by Nvidia, AMD or Intel. OpenCL might help people bridge the differences between these manufacturers too.
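The programming model behind all this is simple enough to mimic in a few lines: you write one small kernel that handles a single index, and a runtime hands ranges of indices to whatever devices are on hand. The sketch below is a plain-Python emulation of that idea only. It does not call OpenCL at all, and the device names, the weights and the `split_range` and `run` helpers are hypothetical.

```python
def saxpy_kernel(gid, a, x, y, out):
    """Work for one index, the way an OpenCL kernel handles one work-item."""
    out[gid] = a * x[gid] + y[gid]


def split_range(n, weights):
    """Split indices 0..n-1 into contiguous chunks, one per device.

    weights maps a device name to its relative share of the work.
    """
    total = sum(weights.values())
    names = list(weights)
    chunks, start, acc = {}, 0, 0
    for i, name in enumerate(names):
        acc += weights[name]
        # The last device takes whatever is left so no index is dropped.
        end = n if i == len(names) - 1 else round(n * acc / total)
        chunks[name] = range(start, end)
        start = end
    return chunks


def run(a, x, y, weights):
    """Run the kernel over every index, chunk by chunk."""
    out = [0.0] * len(x)
    chunks = split_range(len(x), weights)
    for device, indices in chunks.items():
        for gid in indices:  # a real device would run these in parallel
            saxpy_kernel(gid, a, x, y, out)
    return out, chunks


if __name__ == "__main__":
    x = list(range(10))
    y = [1] * 10
    result, chunks = run(2.0, x, y, {"cpu": 1, "gpu": 3, "fpga": 1})
    print(result)
    print({name: len(r) for name, r in chunks.items()})
```

The kernel never needs to know which device ran it, which is the portability argument in a nutshell. Deciding how to weight the split well is the hard part in practice.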
My supercomputer can beat your supercomputer, and money is no object. FPGAs (Field Programmable Gate Arrays) are used most often in prototyping new computer processors. You can design a chip, then ‘program’ the FPGA to match the circuit design so that it can be verified. Verification is the process by which you do exhaustive tests on the logic and circuits to see if you’ve left anything out or gotten the timing wrong for circuits that may run at different speeds within the chip itself. FPGAs are expensive niche products that chip design outfits and occasionally product manufacturers use to solve problems. Less often they might be used in data network gear to help classify and reroute packets in a data center and optimize performance over time.
This by itself would be a pretty good roster of applications, but something near and dear to my heart is the use of FPGAs as a kind of reconfigurable processor. I am certain one day we will see FPGAs applied in desktop computers. Until then, we’ll have to settle for using FPGAs as special-purpose application accelerators in high-volume trading and Wall Street type data centers. This article in WSJ is going to change a few opinions about the application of FPGAs for real computing tasks. The figures quoted for different analyses and reports derived from the transactions show speedups of multiple orders of magnitude. In extreme examples, a fully optimized FPGA ran as much as 1,000 times faster than a general-purpose CPU.
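A quick sanity check on numbers like that: Amdahl’s law says the end-to-end gain depends on how much of the total run time actually gets moved onto the accelerator. The sketch below just evaluates the formula. The fractions are made-up examples, not figures from the WSJ article.

```python
def overall_speedup(accelerated_fraction, kernel_speedup):
    """Amdahl's law: end-to-end speedup when only part of a job is accelerated.

    accelerated_fraction is the share of the original run time (0.0 to 1.0)
    that moves to the accelerator; kernel_speedup is how much faster that
    share runs there.
    """
    if not 0.0 <= accelerated_fraction <= 1.0:
        raise ValueError("accelerated_fraction must be between 0 and 1")
    if kernel_speedup <= 0:
        raise ValueError("kernel_speedup must be positive")
    remaining = 1.0 - accelerated_fraction  # still runs at the old speed
    return 1.0 / (remaining + accelerated_fraction / kernel_speedup)


if __name__ == "__main__":
    for fraction in (0.5, 0.9, 0.99, 1.0):  # illustrative fractions only
        gain = overall_speedup(fraction, 1000)
        print(f"{fraction:.0%} of the job on a 1,000x FPGA -> {gain:.1f}x overall")
```

So a 1,000X headline figure only shows up end to end when nearly the whole job lives on the FPGA. Offload half the work and you get a bit under 2X, which is still nothing to sneeze at.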
When someone can tout 1,000X speedups, everyone is going to take notice. And hopefully it won’t simply be a bunch of copycats trying to speed up their reports and management dashboards. There’s a renaissance out there waiting to happen with FPGAs, and I still have hope I’ll see it in my lifetime.