Tilera’s roadmap calls for its next generation of processors, code-named Stratton, to be released in 2013. The product line will expand the number of processors in both directions, down to as few as four and up to as many as 200 cores. The company is going from a 40-nm to a 28-nm process, meaning they’re able to cram more circuits in a given area. The chip will have improvements to interfaces, memory, I/O and instruction set, and will have more cache memory.
I’m enjoying the survey of companies doing massively parallel, low power computing products. Wired.com|Enterprise started last week with a look at SeaMicro and how the two principal founders got their start observing Google’s initial stabs at a warehouse sized computer. Since that time things have fractured somewhat instead of coalescing and now three big attempts are competing to fulfil the low power, massively parallel computer in a box. Tilera is a longer term project startup from MIT going back further than Calxeda or SeaMicro.
However application of this technology has been completely dependent on the software. Whether it be OSes or Applications, they all have to be constructed carefully to take full advantage of the Tile processor architecture. To their credit Tilera has attempted to insulate application developers from some of the vagaries of the underlying chip by creating an OS that does the heavy lifting of queuing and scheduling. But still, there’s got to be a learning curve there even if it isn’t quite as daunting as say folks who develop applications for the super computers at National Labs here in the U.S. Suffice it to say it’s a non-trivial choice to adopt a Tilera cpu for a product/project you are working on. And the people who need a Tilera GX cpu for their app, already know all they need to know about the the chip in advance. It’s that kind of choice they are making.
I’m also relieved to know they are continuing development to shrink down the design rules. Intel being the biggest leader in silicon semi-conductor manufacturing, continues to shrink its design, development and manufacturing design rules. We’re fast approaching a 20nm-18nm production line in both Oregon and Arizona. Both are Intel design fabrication plants and there not about to stop and take a breath. Companies like Tilera, Calxeda and SeaMicro need to do continuous development on their products to keep from being blind sided by Intel’s continuous product development juggernaut. So Tilera is very wise to shrink its design rule from 40nm down to 28nm as fast as it can and then get good yields on the production lines once they start sampling chips at this size.
*UPDATE: Just saw this run through my blogroll last week. Tilera has announced a new chip coming in March. Glad to see Tilera is still duking it out, battling for the design wins with manufacturers selling into the Data Center as it were. Larger Memory addressing will help make the Tilera chips more competitive with Commodity Intel Hardware shops, and maybe we’ll see full 64bit memory extensions at some point as a follow on to the current 40bit address space extenstions currently being touted in this article from The Register.
By itself Calxeda has made some big plans attempting to create computers like the SeaMicro SM10000. But the ability to manufacture on any scale and then sell that product is a bit limited. But as of today HP has partnered with Calxeda to sell product and help design a server using the reference design for a compute node. So the ball is rolling, and now there’s a third leg in this race between the Compute Cloud in a Box manufacturers (Calxeda, SeaMicro and Tilera). Read On:
Calxeda is producing 4-core, 32-bit, ARM-based system-on-chip SOC designs, developed from ARMs Cortex A9. It says it can deliver a server node with a thermal envelope of less than 5 watts. In the summer it was designing an interconnect to link thousands of these things together. A 2U rack enclosure could hold 120 server nodes: thats 480 cores.
HP signing on as a OEM for Calxeda designed equipment is going to push ARM based massively parallel server designs into a lot more data centers. Add to this the announcement of the new ARM-15 cpu and it’s timeline for addressing 64-bit memory and you have a battle royale going up against Intel. Currently the Intel Xeon is the preferred choice for applications requiring large amounts of DRAM to hold whole databases and Memcached webpages for lightning quick fetches. On the other end of the scale is the low per watt 4 core ARM chips dissipating a mere 5 watts. Intel is trying to drive down the Thermal Design Point for their chips even resorting to 64bit Atom chips to keep the Memory Addressing advantage. But the timeline for decreasing the Thermal Design Point doesn’t quite match up to the ARM x64 timeline. So I suspect ARM will have the advantage as will Calxeda for quite some time to come.
While I had hoped the recen ARM-15 announcement was also going to usher in a fully 64-bit capable cpu, it will at least be able to fake larger size memory access. The datapath I remember being quoted was 40-bits wide and that can be further extended using software. And it doesn’t seem to have discouraged HP at all who are testing the Calxeda designed prototype EnergyCore evaluation board. This is all new territory for both Calxeda and HP so a fully engineered and designed prototype is absolutely necessary to get this project off the ground. My hope is HP can do a large scale test and figure out some of the software configuration optimization that needs to occur to gain an advantage in power savings, density and speed over an Intel Atom server (like SeaMicro).
Exotic chip architectures are always interesting because they tend to be designed to fix a very particular problem. The transputer was designed to be general purpose enough that you could get it to solve a variety of problems as long as you had the chops and the development tools to do it. It was in its day a massively parallel computer in a single chip.
The key idea was to create a component that could be scaled from use as a single embedded chip in dedicated devices like a TV set-top box, all the way up to a vast supercomputer built from a huge array of interconnected Transputers.
Connect them up and you had, what was, for its era, a hugely powerful system, able to render Mandelbrot Set images and even do ray tracing in real time – a complex computing task only now coming into the reach of the latest GPUs, but solved by British boffins 30-odd years ago.
I remember the Transputer. I remember seeing ISA-based add-on cards for desktop computers back in the early 1980s. They would advertise in the back of the popular computer technology magazines of the day. And while it seemed really mysterious what you could do with a Transputer, the price premium to buy those boards made you realize it must have been pretty magical.
Most recently while I was attending workshop in Open Source software I met a couple form employees of a famous manufacturer of camera film. In their research labs these guys used to build custom machines using arrays of Transputers to speed up image processing tasks inside the products they were developing. So knowing that there’s even denser architectures using chips like Tilera, Intel Atom and ARM chips absolutely blows them away. The price/performance ratio doesn’t come close.
Software was probably the biggest point off friction in that the tools to integrate the Transputer into the overall design required another level of expertise. That is true to of the General Purpose Graphics Processing Unit (GPGU) that nVidia championed and now markets with its Tesla product line. And the Chinese have created a hybrid supercomputer mating Tesla boards up with commodity cpus. It’s too bad that the economics of designing and producing the Transputer didn’t scale with the time (the way it has for Intel as a comparison). Clock speeds also fell behind too, which allowed general purpose micro-processors to spend the extra clock cycles performing the same calculations only faster. This is also the advantage that RISC chips had until they couldn’t overcome the performance increases designed in by Intel.
I still have great hopes for Tilera in the data center cloud market place. But the only real competition out there now is Seamicro’s own SM-10000×64 which is tearing up the charts with Intel’s Atom N570. Once Tilera is able to ship its chips in volume and get manufacturers to start building servers with Tilera CPUs inside, it will be a true horse race.
Upstart mega-multicore chip maker Tilera has not yet started sampling its future Tile-Gx 3000 series of server processors, and companies have already locked in orders for the chips.
Proof that sometimes a shipping product doesn’t always make all the difference. Although it might be nice to tout performance of actual shipping product. What’s becoming more real is the power efficiency of the Tilera architcture core for core versus the Intel IA-64 architecture. Tilera can provide a much lower Thermal Design Point (TDM) per core than typical Intel chips running the same workloads. So Tilera for the win on paper anyways.
A further update on Tilera’s product launches as the old Interop tradeshow for network switch and infrastructure vendors is held in Las Vegas. They have tweaked the chip packaging of their cpus and now are going to market different cpus to different industries. This family of Tilera chips is called the 8000 series and will be followed by a next generation of 3000 and 5000 series chips. Projections are by the time the Tilera 3000 series is released the density of the chips will be sufficient to pack upwards of 20,000 cpu cores of Tilera chips in a single 42 unit tall, 19 inch wide server rack. with a future revision possibly doubling that number of cores to 40,000. That road map is very agressive but promising and shows that there is lots of scaling possible with the Tilera product over time. Hopefully these plans will lead to some big customers signing up to use Tilera in shipping product in the immediate and near future.
What I’m most interested in knowing is how does the Qanta server currently shipping that uses the Tilera cpu benchmark compared to an Intel Atom based or ARM based server on a generic webserver benchmark. While white papers and press releases have made regular appearances on the technolog weblogs, very few have attempted to get sample product and run it through the paces. I suspect, and cannot confirm that anyone who is a potential customer are given Non-disclosure Agreements and shipping samples to test in their data centers before making any big purchases. I also suspect that as is often the case the applications for these low power massively parallel dense servers is very narrow. Not unlike that for a super computer. IBM‘s Cell Processor that powers the Blue Gene super computers is essentially a PowerPC architecture with some extra optimizations and streamlining to make it run very specific workloads and algorithms faster. In a super computing environment you really need to tune your software to get the most out of the huge up front investment in the ‘iron’ that you got from the manufacturer. There’s not a lot of value add available in that scientific and super computing environment. You more or less roll your own solution, or beg, borrow or steal it from a colleague at another institution using the same architecture as you. So the Quanta S2Q server using the Tilera chip is similarly likely to be a one off or niche product, but a very valuable one to those who purchase it. Tilera will need a software partner to really pump up the volumes of shipping product if they expect a wider market for their chips.
But using a Tilera processor in a network switch or a ‘security’ device or some other inspection engine might prove very lucrative. I’m thinking of your typical warrantless wire-tapping application like the NSA‘s attempt to scoop up and analyze all the internet traffic at large carriers around the U.S. Analyzing data traffic in real time prevents folks like NSA from capturing and having to move around large volumes of useless data in order to have it analyzed at a central location. Instead localized computing nodes can do the initial inspection in realtime keying on phrases, words, numbers, etc. which then trigger the capturing process and send the tagged data back to NSA for further analysis. Doing that in parallel with a 100 core CPU would be very advantageous in that a much smaller footprint would be required in the secret closets NSA maintains at those big data carriers operations centers. Smaller racks, less power makes for a much less obvious presence in the data center.