Archive for the ‘gpu’ Category
OpenCL is a breakthrough precisely because it enables developers to accelerate the real-time execution of their algorithms quickly and easily — particularly those that lend themselves to the considerable parallel processing capabilities of FPGAs (which yield superior compute densities and far better performance/Watt than CPU- and GPU-based solutions)
There’s still a lot of untapped potential in the OpenCL programming tools. Apple is still the single largest vendor to have adopted OpenCL across a large number of its products (OS and application software). And I know from reading about supercomputing on GPUs that some large-scale hybrid CPU/GPU machines have ranked highly worldwide (the Chinese Tianhe being the first and biggest example). This article from EETimes encourages anyone with a background in C programming to give it a shot and see which algorithms could stand to be accelerated using the resources on the motherboard alone. But being EETimes, they are also touting the benefits of adding FPGAs to the mix.
To date the low-hanging fruit for desktop PC makers and their peripheral designers and manufacturers has been to reuse the GPU as a massively parallel co-processor where it makes sense. But as the EETimes writer emphasizes, FPGAs can be equal citizens too and might provide even more flexible acceleration. Interest in the FPGA as a co-processor for desktop through higher-end enterprise data center motherboards was brought to the fore by AMD back in 2006 with the Torrenza CPU socket. The hope back then was that giving a secondary specialty processor (at the time an FPGA) a socket of its own might open up a market no one had addressed up to that point. So depending on your needs and what extra processors you have available on your motherboard, OpenCL might be generic enough going forward to get a boost from ALL the available co-processors.
Whether or not we see benefits at the consumer desktop level depends heavily on OS-level support for OpenCL. To date the biggest adopter of OpenCL has been Apple, as they needed an OS-level acceleration API for video-intensive apps like video editing suites. Eventually Adobe recompiled some of its Creative Suite apps to take advantage of OpenCL on MacOS. On the PC side, Microsoft has always had DirectX as its API for accelerating any number of multimedia apps (for playback and editing) and is less motivated to incorporate OpenCL at the OS level. But that’s not to say a third-party developer who saw a benefit to OpenCL over DirectX couldn’t create their own plumbing and libraries, ship a runtime package that used OpenCL to support their apps, and license it to anyone who wanted to bundle it in a larger package installer (say, for a game or a multimedia authoring suite).
For the data center this makes far more sense than for the desktop, as DirectX isn’t seen as a scientific computing API or a means of allowing a GPU to be used as a numeric accelerator for scientific calculations. In this context, OpenCL might be a nice, open, easy-to-adopt library for people working on compute farms with massive numbers of both general-purpose CPUs and GPUs handing off parts of a calculation to one another over the PCI Express bus or across CPU sockets on a motherboard. Everyone’s needs are going to vary, widely in some cases, but OpenCL might make that variation easier to address by providing a common library that lets one touch all the co-processors available when a computation needs to be sped up. So keep an eye on OpenCL as a competitor to any GPGPU-style API and library put out by nVidia, AMD, or Intel. OpenCL might help people bridge the differences between these manufacturers too.
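To make that “one library for every co-processor” point concrete, here’s a minimal device-enumeration sketch (my own illustration using the pyopencl bindings, not anything from the EETimes piece). It asks the OpenCL runtime to list every platform and device it can see, whether that’s a CPU, a GPU, or, with the right vendor SDK installed, an FPGA:

```python
# Minimal OpenCL device enumeration via the pyopencl bindings.
# Each vendor's driver registers a "platform"; each platform exposes
# one or more devices (CPU, GPU, accelerator) that kernels can target.
import pyopencl as cl

for platform in cl.get_platforms():
    print(f"Platform: {platform.name} ({platform.vendor})")
    for device in platform.get_devices():
        print(f"  Device: {device.name}")
        print(f"    Type:          {cl.device_type.to_string(device.type)}")
        print(f"    Compute units: {device.max_compute_units}")
```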
In almost every kind of electronic equipment we buy today, there is memory in the form of SRAM and/or flash memory. Following Moore’s law, memories have doubled in size every second year. When Intel introduced the 1103 1Kbit dynamic RAM in 1971, it cost $20. Today, we can buy a 4Gbit SDRAM for the same price.
Quoted above is a look back from an Ericsson engineer surveying the use of solid-state, chip-based memory in electronic devices. It is always interesting to know how these things started and evolved over time. Advances in RAM design and manufacture are the quintessential example of Moore’s Law, even more so than the advances in processors over the same time period. Yes, CPUs are cool and very much a foundation upon which everything else rests. But remember: Intel didn’t start out making microprocessors; it started out as a dynamic RAM chip company at a time when DRAM was just entering the market. That business is where Gordon Moore saw first-hand the rate at which change was possible in silicon-based semiconductor manufacturing.
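As a quick sanity check on that doubling claim, compare the 1103’s 1 Kbit in 1971 against a 4 Gbit part today (taking “today” as roughly when this was written):

```python
# 1 Kbit (1971) to 4 Gbit (~2013) at the same price point:
# how many doublings, and how often did they have to happen?
import math

bits_1971 = 1 * 1024            # Intel 1103: 1 Kbit
bits_now = 4 * 1024**3          # 4 Gbit SDRAM
years = 2013 - 1971

doublings = math.log2(bits_now / bits_1971)
print(f"{doublings:.0f} doublings in {years} years, "
      f"one every {years / doublings:.1f} years")
# -> 22 doublings in 42 years, one every 1.9 years: right on
#    the "doubled every second year" schedule.
```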
Now we’re looking at mobile smartphone processors and System on Chip (SoC) designs advancing the state of the art. Desktop and server CPUs are making incremental gains, but the smartphone is really trailblazing in showing what’s possible. We went from combining the CPU with the memory (so-called 3D memory) to now adding graphics accelerators (GPUs) to the mix. Multiple cores and fully 64-bit-clean CPU designs are entering the market (in the form of the latest-model iPhones). It’s not just a memory revolution, but memory was definitely a driver in the market when we migrated from magnetic core memory (state of the art in 1951-52, as developed at MIT) to the dynamic RAM chip (state of the art in 1968-69). The drive to develop the DRAM brought all other silicon-based processes along with it, and all the boats were raised. So here’s to the DRAM chip that helped spur the revolution. Without those shoulders, the giants of today wouldn’t have anywhere to stand.
If there is any single number that people point to for resolution, it is the 1 arcminute value that Apple uses to indicate a “Retina Display”.
Earlier in my career at my current job, I had to recommend the resolution people needed to get a good picture using a scanner or a digital camera. As we know, the resolution arms race knows no bounds: first in scanners, then in digital cameras. The same is true now for displays. How fine is fine enough? Is it noticeable? Is it beneficial? The technical limits that enforce lower resolution are usually tied to cost. A consumer-level product has to fit into a narrow price range, and the perceived benefit of “higher quality” or sharpness is rarely enough to get someone to spend more. But as phones can be upgraded for free, and printers and scanners are now commodity items, you just keep slowly migrating up to the next model for little to no entry cost. And everything is just ‘better’: higher rez, and therefore by association higher quality, sharper, etc.
I used to quote, or try to pin down, a rule of thumb I found once regarding the acuity of the human eye. Some of this was just gained by noticing things when I started out using Photoshop and trying to print to imagesetters and laser printers. At some point in the past someone decided 300 dpi is what a laser printer needed in order to reproduce text on letter-size paper. As for displays, I bumped into a quote from an IBM study on visual acuity indicating the human eye can discern display pixels up to around 225 ppi. I tried many times to find the actual publication where that appears so I could cite it, but no luck; I only found it as a footnote on a webpage from another manufacturer. Now in this article we get many more stats on human vision, much more extensive than that vague footnote from all those years ago.
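The 1-arcminute figure and those ppi numbers can be tied together with a little trigonometry. Here’s a back-of-the-envelope calculation (the viewing distances are my own assumptions for illustration):

```python
# At what pixel density does one pixel subtend 1 arcminute of visual
# angle? Depends entirely on viewing distance: ppi = 1 / (d * tan(1')).
import math

ONE_ARCMIN = math.radians(1 / 60)   # 1 arcminute in radians

def retina_ppi(distance_inches):
    pixel_pitch = distance_inches * math.tan(ONE_ARCMIN)  # inches
    return 1 / pixel_pitch

for d in (10, 12, 15, 24):          # assumed viewing distances
    print(f"{d:2d} inches -> {retina_ppi(d):3.0f} ppi")
# -> 10 in -> 344 ppi, 12 in -> 286 ppi, 15 in -> 229 ppi, 24 in -> 143 ppi
```

Interestingly, the 15-inch row lands right around that 225 ppi IBM figure, and a phone held at 10 to 12 inches needs roughly 300 ppi, which is about where Apple drew its “Retina” line.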
What can one conclude from all the data in this article? Just the same thing: resolution arms races are still being waged by manufacturers. This time, however, it’s in mobile phones, not printers, not scanners, not digital cameras. Those battles were fought, and now there’s damned little product differentiation. Mobile phones will fall into that pattern, and people will be less and less Apple fanbois or Samsung fanbois. We’ll all just upgrade to a newer version of whatever phone is cheap and expect to always have the increased-spec hardware, the higher resolution, better quality, all that jazz. It is one more case where everything old is new again. My suspicion is we’ll see this happen again when a true VR goggle hits the market, with real competitors attempting to gain advantage through technical superiority or more research and development. Bring on the VR Wars, I say.
nVidia is making a new bit of electronics hardware to be added to LCD displays made by third-party manufacturers. The idea is to send syncing data to the display to let it know when a frame has been rendered by the 3D hardware on the video card, so the display refreshes exactly when a new frame is ready rather than on a fixed clock. Having this bit of extra electronics will smooth out the high-rez, high-frame-rate games played by elite desktop gamers.
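A toy model makes the benefit easier to see (the frame render times here are invented for illustration): with a fixed 60 Hz refresh, a frame that misses a vsync tick waits for the next one; with the display synced to the GPU, the frame is shown the moment it’s done.

```python
# Fixed 60 Hz refresh vs. GPU-driven (variable) refresh.
import math

REFRESH_MS = 1000 / 60          # one fixed refresh interval, ~16.7 ms
render_times_ms = [14, 15, 19, 22, 15, 25, 16]   # invented workload

for r in render_times_ms:
    ticks = math.ceil(r / REFRESH_MS)    # vsync ticks until displayed
    fixed = ticks * REFRESH_MS
    print(f"rendered in {r:2d} ms -> shown at {fixed:4.1f} ms (fixed) "
          f"vs {r:2d} ms (synced)")
# Frames at 19 and 22 ms slip a whole tick to 33.3 ms on the fixed
# display: that alternation between 16.7 and 33.3 ms is the stutter
# this hardware is meant to eliminate.
```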
It would be cool to see this adopted for the game console market as well, meaning TV manufacturers could use the same idea and make your PS4 and Xbox One play smoother too. It’s a chicken-and-egg situation, though: unless someone like Steam or another manufacturer pushes this out to a wider audience, it will get stuck as a niche product for the highest end of high-end PC desktop gamers. But it is definitely a step in the right direction and helps push us further away from the old VGA standard of years past. Video cards AND displays should both be smart; there’s no reason, no excuse, not to have them both be somewhat more aware of their surroundings and coordinate things. And if AMD decides it too needs this capability, how soon after that will both AMD and nVidia have to come to the table and get a standard going? I hope that happens sooner rather than later, and that too would help drive this technology to a wider audience.
id Software has formally announced that Carmack has left the building. Prior to this week he was on sabbatical from id, doing consulting/advisory work for the folks putting the Oculus Rift together. Work being done now aims to improve the refresh speed of the video screens. That’s really the last big hurdle to clear before this set of VR goggles and motion sensors goes out on the open market. The beta units are still out there, and people are experimenting with the Oculus versions of some first-person shooters, but the revolution is not here… yet.
Oculus will need to pull off some optimizations for the headset. Among the outstanding issues are not just the refresh rate but also which display technology will be chosen. OLED is still up for consideration over backlit LCDs, and it may be the last resort for solving the refresh problem. Latency in the displays’ frame updates is causing motion sickness in the current crop of beta testers of the Oculus Rift VR headset. The speed-up they’re attempting amounts to roughly half the fastest frame refresh time they can achieve now. Hopefully this problem can be engineered out of the next revision of the beta units.
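For a sense of the numbers involved (the refresh rates below are generic figures, not Oculus specs), each step up in refresh rate shaves only a few milliseconds off one term of the total motion-to-photon latency:

```python
# Frame time at common refresh rates; display refresh is just one
# term in total motion-to-photon latency (sensor reads, simulation
# and rendering all add their own delays on top).
for hz in (60, 75, 90, 120):
    print(f"{hz:3d} Hz -> {1000 / hz:4.1f} ms per frame")
# -> 60 Hz = 16.7 ms, 75 Hz = 13.3 ms, 90 Hz = 11.1 ms, 120 Hz = 8.3 ms
```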
- Oculus VR Confirms Android Version Coming Next Year (tomshardware.com)
- Doom god John Carmack teleports from id Software to VR upstart Oculus Rift – Register (newestgadgetsinfo.com)
For now, use Handbrake for simple, effective encodes. Arcsoft or Xilisoft might be worth a look if you know you’ll be using CUDA or Quick Sync and have no plans for any demanding work. Avoid MediaEspresso entirely.
via Joel Hruska @ ExtremeTech: The wretched state of GPU transcoding – Slideshow | ExtremeTech.
Joel Hruska does a great survey of GPU-enabled video encoders. He even goes back to the original Avivo and Badaboom encoders put out by AMD and nVidia when they were first promoting GPU-accelerated video encoding. Sadly, the results don’t live up to the hype. Even Intel’s most recent entrant in the race, QuickSync, is left wanting. HandBrake appears to be the best option for most people, and the most reliable and repeatable in the results it gives.
Ideally the maintainers of the HandBrake project might get a boost from a fork of the source code that adds Intel QuickSync support. As expressed in this article from AnandTech, though, there’s no indication that anyone in the project is interested in proprietary Intel technology like QuickSync. OpenCL seems like a more attractive option for the open source community at large, so the OpenCL/HandBrake development is at least a little encouraging. Still, as Joel Hruska points out, the CPU remains the best option for encoding high quality at smaller frame sizes; it just beats the pants off all the GPU-accelerated options available to date.
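For anyone sticking with the reliable CPU path today, HandBrake also ships a command-line build that’s easy to script. A minimal sketch follows (the “Normal” preset name is an assumption; preset names vary between HandBrake versions, and `HandBrakeCLI --preset-list` shows what your build offers):

```python
# Batch-driving HandBrakeCLI (the CPU/x264 path praised above) from
# Python. The "Normal" preset name is an assumption; preset names
# vary between HandBrake versions.
import subprocess

def transcode(src, dst, preset="Normal"):
    subprocess.run(
        ["HandBrakeCLI",
         "-i", src,            # input file
         "-o", dst,            # output file
         "--preset", preset],  # bundle of encode settings
        check=True)            # raise if the encode fails

transcode("rip/movie.mkv", "out/movie.mp4")
```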
- AnandTech – Testing OpenCL Accelerated Handbrake with AMD’s Trinity (carpetbomberz.com)
- The Wretched State of GPU Transcoding (tech.slashdot.org)
- Lucid Demonstrates XLR8 Frame Rate Boosting Technology (tomshardware.com)
AMD, and NVIDIA before it, has been trying to convince us of the usefulness of its GPUs for general purpose applications for years now. For a while it seemed as if video transcoding would be the killer application for GPUs, that was until Intel’s Quick Sync showed up last year.
There’s a lot to talk about when it comes to accelerated video transcoding. Not the least of it is HandBrake’s general dominance for anyone doing small-scale size reductions of their DVD collections for transport on mobile devices. We owe it all to the open source x264 codec and all the programmers who have contributed to it over the years, standing on one another’s shoulders and allowing us to effortlessly encode or transcode gigabytes of video down to manageable sizes. But Intel has attempted to rock the boat, inserting itself into the fray by tooling its QuickSync technology for accelerating the compression and decompression of video frames. However, it is a proprietary path pursued by only a few small-scale software vendors. And it prompts the question: when is open source going to benefit from the proprietary Intel QuickSync technology? Maybe it’s going to take a long time. Maybe it won’t happen at all. Luckily for the HandBrake users in the audience, some attempt is now being made to re-engineer the x264 codec to take advantage of any OpenCL-compliant hardware on a given computer.
- What We’ve Been Waiting For: Testing OpenCL Accelerated Handbrake with AMD’s Trinity (anandtech.com)
- Photoshop CS6 Gives ‘Fangs’ to your GPU (barefeats.com)
Similarly disappointing for everyone who isn’t Intel, it’s been more than a year after Sandy Bridge’s launch and none of the GPU vendors have been able to put forth a better solution than Quick Sync. If you’re constantly transcoding movies to get them onto your smartphone or tablet, you need Ivy Bridge. In less than 7 minutes, and with no impact to CPU usage, I was able to transcode a complete 130 minute 1080p video to an iPad friendly format—that’s over 15x real time.
QuickSync, for anyone who doesn’t follow Intel’s technology white papers and CPU releases, is a special feature of Sandy Bridge-era Intel CPUs. Its lineage goes back to the Clarkdale series with embedded graphics (the first round of the 32nm design rule). It can do things like speed up the decoding of video streams saved in a number of popular formats (VC-1, H.264, MP4, etc.). Now it’s marketed to anyone trying to speed up the transcoding of video from one format to another. The first Sandy Bridge CPUs using the hardware encoding portion of QuickSync showed incredible speeds compared to the GPU-accelerated encoders of that era. And things have been kicked up a further notch in the embedded graphics of the Intel Ivy Bridge series CPUs.
In the quote at the beginning of this article, I included a summary from the AnandTech review of the Intel Core i7 3770 which gives a better sense of the magnitude of the improvement. The full 130-minute Blu-ray video was converted at 15 times real time, meaning for every minute of video coming off the disk, QuickSync is able to transcode it in 4 seconds! That is major progress for anyone who has followed this niche of desktop computing. Having spent time capturing, editing and exporting video, I will admit transcoding between formats is a lengthy process that uses up a lot of CPU resources. Offloading all that burden to the embedded graphics controller completely changes the traditional experience of the computer slowing to a crawl and having to walk away and let it work.
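It’s worth checking that arithmetic against the quote at the top of the post:

```python
# The "15x real time" claim, checked against the quoted result.
video_minutes = 130
speedup = 15

print(f"{60 / speedup:.0f} s per minute of source video")   # 4 s
print(f"{video_minutes / speedup:.1f} min wall clock")       # 8.7 min
# AnandTech actually reported under 7 minutes, which works out to
# better than 18x; "over 15x" is the conservative way to say it.
```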
Now transcoding is trivial; it costs nothing in terms of CPU load. Anything faster than real time means you don’t have to walk away from your computer (or at least not for very long), and at 15X real time for a full-length movie, the time spent is so short you would never have a second thought about “Will this transcode slow down the computer?” It won’t. In fact you can continue doing all your other work, be productive, have fun, and carry on just as if you hadn’t asked your computer to do the most complicated, time-consuming chore that (up until now) you could possibly ask of it.
Knowing this application of the embedded graphics is so useful for desktop computers makes me wonder about scientific computing. What could Intel provide in terms of performance increases for simulations and computation in a supercomputer cluster? Seeing how hybrid supercomputers mixing nVidia Tesla GPU co-processors with Intel CPUs have slowly marched up the Top 500 supercomputer list makes me think Intel could leverage QuickSync further… much further. Unfortunately this performance boost is solely dependent on a few vendors of proprietary transcoding software. Open source developers do not have an opening into the QuickSync tech that would let them write a library to redirect a video stream into the QuickSync acceleration pipeline. When somebody does accomplish that feat, it may not be long before you see some Linux compute clusters attempt to use QuickSync as an embedded algorithm accelerator too.
- Intel Core i7-3770K review: Ivy Bridge brings lower power, better performance (alltech360.wordpress.com)
- Image Quality: Intel Ivy Bridge vs. Radeon Gallium3D (phoronix.com)
- Intel Ivy Bridge CPUs now available to order (slashgear.com)
And with clock speeds topped out and electricity use and cooling being the big limiting issue, Scott says that an exaflops machine running at a very modest 1GHz will require one billion-way parallelism, and parallelism in all subsystems to keep those threads humming.
This is an interesting write-up of a blog entry from nVidia‘s chief of super-computing, including his thoughts on scaling up to an exascale supercomputer. I’m surprised at how power-efficient a GPU is for floating point operations, and amazed at these companies’ ability to measure power consumption down to the single-operation level. Microjoules and picojoules are worlds apart from one another, and here’s the illustration:
A microjoule is one millionth of a joule, or 1×10⁻⁶ (6 decimal places), whereas a picojoule is 1×10⁻¹², twice as many decimal places for a total of 12. That is a HUGE difference: 6 orders of magnitude in efficiency from an electrical consumption standpoint. The nVidia author, Steve Scott, estimates that to get to exascale, any hybrid CPU/GPU machine would need GPUs with one order of magnitude higher efficiency in joules per floating point operation (FLOP), or 1×10⁻¹³, one whole decimal place better. To borrow a cliche, supercomputer manufacturers have their work cut out for them. The way forward is efficiency, the GPU has the edge per operation, and all they need do is improve that efficiency by one more decimal place to get into the exascale league of super-computing.
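Putting those figures side by side (the 20 MW whole-system power budget below is my own addition, a commonly cited exascale design target, not a number from Scott’s post):

```python
# Unit comparison, plus what the cited per-FLOP target implies at
# exascale. The 20 MW system budget is an assumed design target.
EXAFLOP = 1e18                   # FLOP/s

microjoule = 1e-6                # J
picojoule = 1e-12                # J
target_per_flop = 1e-13          # J/FLOP, the figure cited above

print(f"microjoule / picojoule = {microjoule / picojoule:.0e}")   # 1e+06
print(f"arithmetic power at target: {EXAFLOP * target_per_flop / 1e3:.0f} kW")
# -> 100 kW for the floating-point math alone; the rest of a ~20 MW
#    (20,000 kW) budget has to cover memory and interconnect, which
#    is why data movement dominates the exascale problem.
```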
Why is exascale important to the scientific community at large? In some fields there are never enough cycles per second to satisfy the scale of the computations being done. Models of systems can be created, but the simulations they provide may not have enough fine-grained detail. A weather model simulating some period of time in the future, say, needs to know the current conditions before it can start calculating. But the resolution, the fine-grained detail of those conditions, is what limits the accuracy over time, especially when small errors get amplified by each successive cycle of calculation. One way to limit the damage from these small errors is to increase the resolution, shrinking the land area over which you assign a single ‘current condition’. So instead of 10-mile resolution (meaning each block on the face of the planet is 10 miles square), you switch to 1-mile resolution. Any error in a one-mile-square patch is less likely to cause huge errors in the future weather prediction. But now you have to calculate 100 times the number of squares (10 times finer in each direction) compared to the previous best model at 10-mile resolution. That’s probably the easiest way to see how demands on the computer increase as people increase the resolution of their weather prediction models. And it’s not limited to weather. The same machine could simulate a nuclear weapon aging over time, or decrypt foreign messages intercepted by NSA satellites, where the speed of the computer allows more brute-force attempts at decrypting any message they capture.
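The cell-count arithmetic in that example works out like this (the region size is hypothetical):

```python
# Refining a weather grid from 10-mile to 1-mile cells over the same
# (hypothetical) region multiplies the work by 100x in 2D alone.
region_miles = 1000                      # a 1000 x 1000 mile region

for cell in (10, 1):
    cells = (region_miles // cell) ** 2
    print(f"{cell:2d}-mile cells -> {cells:10,d} cells")
# -> 10-mile cells ->     10,000 cells
#     1-mile cells ->  1,000,000 cells (100x), before counting the
#    smaller timesteps that finer grids typically require.
```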
In spite of all the gains to be had with an exascale computer, you still have to program the bloody thing to work with your simulation. And that’s really the gist of this article: there is no free lunch in High Performance Computing. The level of hardware knowledge required to get anything like the maximum theoretical speed is far higher than one would think. There’s no magic bullet or ‘re-compile’ button that’s going to get your old software running smoothly on an exascale computer. More likely, you and a team of the smartest scientists are going to work for years to tailor your simulation to the hardware you want it to run on. And therein lies the rub: the hardware alone isn’t going to get you the extra performance.
- ExaFLOP computers: Faster than 50 million laptops – the race to go exascale (talesfromthelou.wordpress.com)
- Exascale: The Faraway Frontier of Computing? (lcitnetworks.wordpress.com)
- Nvidia: No magic compilers for HPC coprocessors (go.theregister.com)