Posts Tagged ‘gpu’
OpenCL is a breakthrough precisely because it enables developers to accelerate the real-time execution of their algorithms quickly and easily — particularly those that lend themselves to the considerable parallel processing capabilities of FPGAs (which yield superior compute densities and far better performance/Watt than CPU- and GPU-based solutions)
There’s still a lot of untapped potential in the OpenCL programming tools. Apple is still the largest single manufacturer to have adopted OpenCL across a large number of its products (OS and application software). And I know from reading about supercomputing on GPUs that some large-scale hybrid CPU/GPU computers have ranked among the fastest in the world (the Chinese Tianhe being the first and biggest example). This article from EETimes encourages anyone with a background in C programming to give it a shot and see which algorithms could stand to be accelerated using the resources on the motherboard alone. But being EETimes, they are also touting the benefits of using FPGAs in the mix as well.
To date the low-hanging fruit for desktop PC makers and their peripheral designers and manufacturers has been to reuse the GPU as a massively parallel co-processor where it makes sense. But as the EETimes writer emphasizes, FPGAs can be equal citizens too and might provide even more flexible acceleration. Interest in the FPGA as a co-processor for everything from desktop to higher-end enterprise data center motherboards was brought to the fore by AMD back in 2006 with the Torrenza CPU socket. The hope back then was that a secondary specialty processor (at the time an FPGA) sitting alongside the CPU might open up a market no one had addressed up to that point. So depending on your needs and what extra processors you have available, OpenCL might be generic enough going forward to get a boost from ALL the co-processors on your motherboard.
Whether or not we see benefits on the consumer desktop depends heavily on OS-level support for OpenCL. To date the biggest adopter of OpenCL has been Apple, as they needed an OS-level acceleration API for video-intensive apps like video editing suites. Eventually Adobe recompiled some of its Creative Suite apps to take advantage of OpenCL on Mac OS. On the PC side Microsoft has always had DirectX as its API for accelerating any number of multimedia apps (for playback and editing) and is less motivated to incorporate OpenCL at the OS level. But that’s not to say a third-party developer who saw a benefit to OpenCL over DirectX couldn’t create their own plumbing and libraries and ship a runtime package that used OpenCL to support their own apps, or license it to anyone who wanted to include it in a larger installer (say for a game or a multimedia authoring suite).
For the data center this makes way more sense than for the desktop, as DirectX isn’t seen as a scientific computing API or as a means of using a GPU as a numeric accelerator for scientific calculations. In this context, OpenCL might be a nice, open and easy-to-adopt library for people working on compute farms with massive numbers of both general purpose CPUs and GPUs handing off parts of a calculation to one another over the PCI bus or across CPU sockets on a motherboard. Everyone’s needs are going to vary, and vary widely in some cases. But OpenCL might make that variation easier to address by providing a common library that lets one touch all the co-processors available when a computation needs to be sped up. So keep an eye on OpenCL as a competitor to any GPGPU-style API and library put out by nVidia, AMD or Intel. OpenCL might help people bridge the differences between these manufacturers too.
AMD, and NVIDIA before it, has been trying to convince us of the usefulness of its GPUs for general purpose applications for years now. For a while it seemed as if video transcoding would be the killer application for GPUs, that was until Intel’s Quick Sync showed up last year.
There’s a lot to talk about when it comes to accelerated video transcoding. Not the least of which is HandBrake’s general dominance among anyone doing small-scale size reductions of their DVD collections for transport on mobile devices. We owe it all to the open source x264 codec and all the programmers who have contributed to it over the years, standing on one another’s shoulders and allowing us to effortlessly encode or transcode gigabytes of video down to manageable sizes. But Intel has attempted to rock the boat by tooling its QuickSync technology to accelerate the compression and decompression of video frames. However, it is a proprietary path pursued so far by only a few small-scale software vendors. And it prompts the question: when is open source going to benefit from the proprietary Intel QuickSync technology? Maybe it’s going to take a long time. Maybe it won’t happen at all. Luckily for the HandBrake users in the audience, some attempt is being made now to re-engineer the x264 codec to take advantage of any OpenCL-compliant hardware on a given computer.
- What We’ve Been Waiting For: Testing OpenCL Accelerated Handbrake with AMD’s Trinity (anandtech.com)
- Photoshop CS6 Gives ‘Fangs’ to your GPU (barefeats.com)
Similarly disappointing for everyone who isn’t Intel, it’s been more than a year after Sandy Bridge’s launch and none of the GPU vendors have been able to put forth a better solution than Quick Sync. If you’re constantly transcoding movies to get them onto your smartphone or tablet, you need Ivy Bridge. In less than 7 minutes, and with no impact to CPU usage, I was able to transcode a complete 130 minute 1080p video to an iPad friendly format; that’s over 15x real time.
QuickSync, for anyone who doesn’t follow Intel’s own technology white papers and CPU releases, is a special feature of Sandy Bridge era Intel CPUs. Its origins at Intel go back as far as the Clarkdale series with embedded graphics (the first round of the 32nm design rule). It can do things like simply speeding up the decoding of a video stream saved in a number of popular video formats (VC-1, H.264, MP4, etc.). Now it’s marketed to anyone trying to speed up the transcoding of video from one format to another. The first Sandy Bridge CPUs using the hardware encoding portion of QuickSync showed incredible speeds compared to the GPU-accelerated encoders of that era. However, things have been kicked up a further notch in the embedded graphics of the Intel Ivy Bridge series CPUs.
In the quote at the beginning of this article, I included a summary from the Anandtech review of the Intel Core i7 3770 which gives a better sense of the magnitude of the improvement. The full 130 minute Blu-ray movie was converted at a rate of 15 times real time, meaning that for every minute of video coming off the disc, QuickSync is able to transcode it in 4 seconds! That is major progress for anyone who has followed this niche of desktop computing. Having spent time capturing, editing and exporting video, I will admit transcoding between formats is a lengthy process that uses up a lot of CPU resources. Offloading all that burden to the embedded graphics controller totally changes the traditional pattern of slowing the computer to a crawl and having to walk away and let it work.
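The arithmetic behind that figure is easy to check. Here is a minimal sketch in Python, assuming the speedup stays constant over the whole movie (the function names are mine, purely for illustration):

```python
def seconds_per_source_minute(speedup):
    """Wall-clock seconds needed to transcode one minute of video."""
    return 60.0 / speedup


def transcode_minutes(runtime_minutes, speedup):
    """Wall-clock minutes to transcode a whole video at a given speedup."""
    return runtime_minutes / speedup


def speedup_from_times(runtime_minutes, elapsed_minutes):
    """Speedup over real time implied by a measured transcode time."""
    return runtime_minutes / elapsed_minutes


if __name__ == "__main__":
    print(seconds_per_source_minute(15))          # 4.0
    print(round(transcode_minutes(130, 15), 1))   # 8.7
    print(round(speedup_from_times(130, 7), 1))   # 18.6
```

At exactly 15x, the 130 minute movie would take a little under 9 minutes; the "less than 7 minutes" in the quote works out to better than 18x, which squares with "over 15x real time".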
Now transcoding is trivial; it costs nothing in terms of CPU load. Any time a transcode runs faster than real time, you don’t have to walk away from your computer (or at least not for very long), and 10X faster than real time makes that doubly true. Now we are fully at 15X real time for a full-length movie. The time spent is so short you wouldn’t ever have a second thought about “Will this transcode slow down the computer?” It won’t. In fact you can continue doing all your other work, be productive, have fun and continue on your way just as if you hadn’t asked your computer to do the most complicated, time-consuming chore that (up until now) you could possibly ask of it.
Knowing this application of the embedded graphics is so useful for desktop computers makes me wonder about scientific computing. What could Intel provide in terms of performance increases for simulations and computation in a supercomputer cluster? Seeing how hybrid supercomputers using nVidia Tesla GPU co-processors mixed with Intel CPUs have slowly marched up the Top 500 Supercomputers list makes me think Intel could leverage QuickSync further... much further. Unfortunately this performance boost is available only through a few vendors of proprietary transcoding software. Open source developers do not have an opening into the QuickSync tech in order to write a library that will redirect a video stream into the QuickSync acceleration pipeline. When somebody does accomplish this feat, it may not be long before you see some Linux compute clusters attempt to use QuickSync as an embedded algorithm accelerator too.
- Intel Core i7-3770K review: Ivy Bridge brings lower power, better performance (alltech360.wordpress.com)
- Image Quality: Intel Ivy Bridge vs. Radeon Gallium3D (phoronix.com)
- Intel Ivy Bridge CPUs now available to order (slashgear.com)
And with clock speeds topped out and electricity use and cooling being the big limiting issue, Scott says that an exaflops machine running at a very modest 1GHz will require one billion-way parallelism, and parallelism in all subsystems to keep those threads humming.
Interesting write-up of a blog entry from nVidia’s chief of supercomputing, including his thoughts on scaling up to an exascale supercomputer. I’m surprised at how power efficient a GPU is for floating point operations. I’m amazed at these companies’ ability to measure power consumption down to the level of a single operation. Microjoules and picojoules are worlds apart from one another, and here’s the illustration:
1 microjoule is one millionth of a joule, or 1×10^-6 (six decimal places), whereas 1 picojoule is 1×10^-12, twice as many decimal places for a total of twelve. So that is a HUGE difference: 6 orders of magnitude in efficiency from an electrical consumption standpoint. The nVidia author, Steve Scott, estimates that to get to exascale supercomputers any hybrid CPU/GPU machine would need GPUs with one order of magnitude higher efficiency in joules per floating point operation (FLOP), or 1×10^-13, one whole decimal place better. To borrow a cliche, supercomputer manufacturers have their work cut out for them. The way forward is efficiency, and the GPU has the edge per operation; all they need to do is improve the efficiency by that one decimal place to get closer to the exascale league of supercomputing.
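To put the numbers from the quote and the paragraph above side by side, here is a rough sketch in Python. It assumes, for simplicity, that each parallel thread retires one floating point operation per clock cycle; that assumption and the names are mine, not from the article. Exact fractions are used so the powers of ten don't pick up rounding error.

```python
from fractions import Fraction

EXAFLOPS = 10**18            # floating point operations per second
CLOCK_HZ = 10**9             # the "very modest 1GHz" from the quote

MICROJOULE = Fraction(1, 10**6)
PICOJOULE = Fraction(1, 10**12)


def required_parallelism(ops_per_second, clock_hz):
    """Operations that must be in flight every cycle.

    Assumes one operation per thread per cycle (an illustrative assumption).
    """
    return ops_per_second // clock_hz


def orders_of_magnitude(larger, smaller):
    """Count the factors of ten separating two quantities."""
    ratio = larger / smaller
    count = 0
    while ratio >= 10:
        ratio /= 10
        count += 1
    return count


if __name__ == "__main__":
    print(required_parallelism(EXAFLOPS, CLOCK_HZ))      # 1000000000
    print(orders_of_magnitude(MICROJOULE, PICOJOULE))    # 6
```

The first result is the "one billion-way parallelism" from the quote; the second is the six orders of magnitude between a microjoule and a picojoule.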
Why is exascale important to the scientific community at large? In one segment there are never enough cycles per second to satisfy the scale of the computations being done. Models of systems can be created, but the simulations they provide may not have enough fine-grained ‘detail’. A weather model simulating a period of time in the future, say, needs to know the current conditions before it can start the calculation. But the ‘resolution’ or fine-grained detail of those ‘conditions’ is what limits the accuracy over time, especially when small errors get amplified by each successive cycle of calculation. One way to help limit the damage from these small errors is to increase the resolution, that is, to shrink the land area over which you assign a single ‘current condition’. So instead of 10 miles of resolution (meaning each block on the face of the planet is 10 miles square), you switch to 1 mile resolution. Any error in a one mile square patch is less likely to cause huge errors in the future weather prediction. But now you have to calculate 100x the number of squares compared to the previous best model at 10 miles of resolution, since each 10 mile block splits into 10 × 10 one mile squares. That’s probably the easiest way to see how demands on the computer increase as people increase the resolution of their weather prediction models. But it’s not limited just to weather. An exascale machine could be used to simulate a nuclear weapon aging over time. Or it could be used to decrypt foreign messages intercepted by NSA satellites. The speed of the computer would allow more brute force attempts at decrypting any message they capture.
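The growth in work is easy to see by counting grid cells: refining the grid by a factor of ten in each direction multiplies the number of squares by a hundred. A small sketch in Python, assuming a hypothetical square region 1,000 miles on a side (the region size and function name are made up for illustration; a real model also has vertical layers and time steps, which this ignores):

```python
def grid_cells(region_side_miles, resolution_miles):
    """Number of square cells needed to tile a square region."""
    cells_per_side = region_side_miles // resolution_miles
    return cells_per_side * cells_per_side


if __name__ == "__main__":
    coarse = grid_cells(1000, 10)   # 10 mile squares
    fine = grid_cells(1000, 1)      # 1 mile squares
    print(coarse)                   # 10000
    print(fine)                     # 1000000
    print(fine // coarse)           # 100
```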
In spite of all the gains to be had with an exascale computer, you still have to program the bloody thing to work with your simulation. And that’s really the gist of this article, no free lunch in High Performance Computing. The level of knowledge of the hardware required to get anything like the maximum theoretical speed is a lot higher than one would think. There’s no magic bullet or ‘re-compile’ button that’s going to get your old software running smoothly on the exascale computer. More likely you and a team of the smartest scientists are going to work for years to tailor your simulation to the hardware you want to run it on. And therein lies the rub, the hardware alone isn’t going to get you the extra performance.
- ExaFLOP computers: Faster than 50 million laptops – the race to go exascale (talesfromthelou.wordpress.com)
- Exascale: The Faraway Frontier of Computing? (lcitnetworks.wordpress.com)
- Nvidia: No magic compilers for HPC coprocessors (go.theregister.com)
Harkening back to when he joined ARM, Segars said: “2G, back in the early 90s, was a hard problem. It was solved with a general-purpose processor, DSP, and a bit of control logic, but essentially it was a programmable thing. It was hard then – but by today’s standards that was a complete walk in the park.”
He wasn’t merely indulging in “Hey you kids, get off my lawn!” old-guy nostalgia. He had a point to make about increasing silicon complexity – and he had figures to back it up: “A 4G modem,” he said, “which is going to deliver about 100X the bandwidth … is going to be about 500 times more complex than a 2G solution.”
A very interesting look at the state of the art in microprocessor manufacturing: The Register talks with one of the principals at ARM, the folks who license their processor designs to almost every cell phone manufacturer worldwide. Looking at the trends in manufacturing, Simon Segars predicts that sustained performance gains will be harder to come by in the near future. Most advancement, he feels, will come from integrating more kinds of processing, and coordinating the I/O between those processors, on the same die. Which is kind of what Intel is attempting to do by integrating graphics cores, memory controllers and CPU all on one slice of silicon. But the software integration is the trickiest part, and Intel still sees fit to just add more general purpose CPU cores to continue making new sales. Processor clocks stay pretty rigidly near the 3GHz boundary and have not shifted significantly since the end of the Pentium 4 era.
Note, too, the difficulty of scaling up manufacturing as well as designing the next-gen chips. Referring back to my article from Dec. 21, 2010, 450mm wafers (commentary on Electronista article): Intel is one of the few companies rich enough to scale up to the next size of wafer. Every step in the manufacturing process has become so specialized that the motivation to create new devices for manufacture and test just isn’t there, because the total number of manufacturers who can scale up to the next largest size of silicon wafer is probably 4 companies worldwide. That’s a measure of how exorbitantly expensive large-scale chip manufacturing has become. It seems more and more that a plateau is being reached in terms of clock speeds and the size of wafers finished in manufacturing. With these limits, Simon Segars’s thesis becomes even stronger.
“Could Apple be opening up the platform more?” he asked. “What happens to NVIDIA? Why support for cards that aren’t in Macs yet? Will the 2011 Sandy Bridge iMacs contain one or more of these new 6xxx cards?”
This is an interesting tidbit of news. A Macintosh hacker has discovered within the most recent update of Mac OS X 10.6 a number of hardware drivers for ATI graphics cards that do not ship in, and are currently ‘unsupported’ on, the Mac. Anyone who has attempted to buy aftermarket, third-party OEM graphics cards for Macs knows this is a treacherous minefield to navigate. The principal problem is that Apple absolutely, positively does not want people sticking any old graphics card in the Mac Pro towers. Or even in old legacy towers going back to the first PowerPC/PCI-based Macs. No, you must buy the bona fide supported hardware direct from Apple, with the drivers they supply. In a pinch you might be able to fake it with a PC graphics card that has had its BIOS flashed to make it appear to be a genuine Apple part.
But now if Apple is just bundling up a bunch of drivers for various and sundry graphics cards (albeit from one supplier: ATI), is it possible you could finally buy any card you wanted and it would work? That would be big news indeed for any owner of an end-user upgradeable Mac Pro, and welcome news at that. I’m hoping that this story continues to develop and Apple comes out with a policy or strategy statement heralding a change in its past policy towards peripheral manufacturers. More devices being supported would be a great thing.
- 10.6.7 has new AMD video card support, perhaps for new iMacs (9to5mac.com)
- Mac OS X may natively support “PC” Radeon graphics cards (arstechnica.com)
Intel’s executives were quite brash when talking about Larrabee even though most of its public appearances were made on PowerPoint slides. They said that Larrabee would roar onto the scene and outperform competing products.
And so now finally the NY Times nails the coffin shut on Intel’s Larrabee saga. To refresh your memory, this is the second attempt by Intel to create a graphics processor. The first failed attempt was in the late 1990s, when 3dfx (later bought by nVidia) was tearing up the charts with their Voodoo 1 and Voodoo 2 PCI-based 3D accelerator cards. The age of Quake and Quake 2 was upon us and everyone wanted smoother frame rates. Intel wanted to show its prowess in the design of a low cost graphics card running on the brand new AGP slot which Intel had just invented (remember AGP?). What resulted was a similar set of delays and poor performance as engineering samples came out of the development labs. Given the torrid pace of products released by nVidia and eventually ATI, Intel couldn’t keep up. Their benchmark was surpassed by the time their graphics card saw the light of day, and they couldn’t give them away. (see Wikipedia: Intel i740)
The Intel740, or i740, is a graphics processing unit using an AGP interface released by Intel in 1998. Intel was hoping to use the i740 to popularize the AGP port, while most graphics vendors were still using PCI. Released with enormous fanfare, the i740 proved to have disappointing real-world performance, and sank from view after only a few months on the market
Enter Larrabee, a whole new ball game at Intel, right?! The trend toward larger numbers of parallel processors on GPUs from nVidia and ATI/AMD led Intel to believe they might leverage some of their production lines to make a graphics card again. But this time it was different: nVidia had moved from single-purpose GPUs to General Purpose GPUs in order to create a secondary market for their cards as compute-intensive co-processors. They called it CUDA and provided a few development tools in the early stages. Intel latched onto this idea of the General Purpose GPU and decided they could do better. What’s more general purpose than an Intel x86 processor, right? And what if you could provide the libraries and Hardware Abstraction Layer that could turn a large number of processor cores into something that looked and smelled like a GPU?
For Intel it seemed like a win/win/win: everybody wins. The manufacturing lines using older design rules at the 45nm size could be utilized for production, making the graphics card pure profit. They could put 32 processors on a card and program them to do multiple duties for the OS (graphics for games, co-processor for transcoding videos to MP4). But each time they put out a product white paper or did a demo at a trade show, it became more obvious the timeline and schedule were slipping. They had benchmarks to show, great claims to make, future projections of performance to declare. Roadmaps were the order of the day. But just last week rumors started to set in.
Similar to the graphics card foray of the past, Intel couldn’t beat its time-to-market demons. The Larrabee project was going to be very late and was still using 45nm manufacturing design rules. Given that Intel’s top-of-the-line production lines moved to 32nm this year, and nVidia and AMD are doing design process shrinks on their current products, Intel was at a disadvantage. Rather than scrap the thing and lose face again, they decided to recover somewhat by putting Larrabee out there as a free software/hardware development kit to see if that was enough to get people to bite. I don’t know what benefit, if any, development on this platform would bring. It would rank right up there with the Itanium and i740 as hugely promoted dead-end products with zero to negative market share. Big Fail – Do Not Want.
And for you armchair Monday morning technology quarterbacks, here are some links to enjoy leading up to the NYTimes article today:
Tim Sweeney Laments Intel Larrabee Demise (Tom’s Hardware Dec. 7)
Intel Kills Consumer Larrabee Plans (Slashdot Dec. 4)
Intel delays Larrabee GPU, aims for developer “kit” in 2010 (MacNN Dec. 4)
Intel condemns tardy Larrabee to dev purgatory (The Register Dec. 4)