Netlist owns patents on key parts of SanDisk's UltraDIMM technology (originally licensed from Diablo Technologies, I believe). While lawsuits over its intellectual property went back and forth, Netlist continued to develop products. Here now is the EXPRESSvault™ EV3 announcement. It's a PCIe RAM disk of sorts that backs the RAM with an ultracapacitor/battery combo. If power is lost, an automated process copies the RAM to onboard flash memory for safekeeping until power is restored. This design is intended to sidestep the disadvantage of using flash memory as a disk: the wear that occurs when flash is written to frequently. Less expensive flash degrades the more you write to it, until eventually memory cells fail altogether. By using the backing flash only as a failsafe, you write to it only in an emergency, keeping the flash out of the grindstone of heavy I/O. Note this is a very specific niche application of the technology, but it is very much the market Netlist has served in the past. This is their target market.
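The failover flow described above can be sketched in a few lines. This is a hypothetical simulation of the design idea, not Netlist's firmware: the class name and structure are my own invention, chosen only to show why the flash sees almost no wear.

```python
# Hypothetical sketch of an EV3-style failover flow: DRAM serves all
# normal I/O; flash is written only on power loss, so it barely wears.

class RamDiskWithFlashVault:
    def __init__(self, size):
        self.ram = bytearray(size)    # fast working store (DRAM)
        self.flash = bytearray(size)  # failsafe store, written only on power loss
        self.flash_writes = 0         # wear counter for the flash

    def write(self, offset, data):
        # Normal operation: every write lands in DRAM, never in flash.
        self.ram[offset:offset + len(data)] = data

    def on_power_loss(self):
        # The ultracapacitor/battery keeps the card alive just long
        # enough to dump DRAM into flash in one sequential pass.
        self.flash[:] = self.ram
        self.flash_writes += 1

    def on_power_restore(self):
        # Restore the saved image back into DRAM and resume.
        self.ram[:] = self.flash
```

However many millions of writes the host issues, the flash's wear counter only ticks up once per power event, which is the whole point of the design.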
The Future Is Lower Latencies: Enter UltraDIMM
Turn now to a recent announcement from Lenovo and its X6 server line, announcing further adoption of the UltraDIMM technology. Lenovo, at least, is carrying on selling this technology of flash-based memory interspersed with DRAM. The idea of having "tiers" of storage, with SSDs, UltraDIMMs and DRAM all acting in concert, is the high-speed future for the data center architect. Luckily for the people purchasing these things, the legal wrangling between Netlist and Diablo began to sort itself out in spring 2015: http://www.storagereview.com/diablo_technologies_gains_ground_against_netlist_in_ulltradimm_lawsuit
With a final decision being made fairly recently: http://www.diablo-technologies.com/federal-court-completely-dissolves-injunction/
Now Diablo, SanDisk and UltraDIMM can compete in the marketplace once more, and provide a competitive advantage to the people willing to spend the money for the UltraDIMM product. By itself, UltraDIMM makes for some very interesting future uses. More broadly, adoption of an UltraDIMM-like technology in laptops, desktops and tablets could bring speed improvements across the board. Whether that happens is based more on the economics of BIOS and motherboard manufacturers than on the merit of the UltraDIMM's design engineering. More specifically, Lenovo (and IBM before it) had to do a lot of work on the X6 servers to support the new memory technology. Which points to another article from the person I trust to collect all the news and information on storage worldwide, The Register's Chris Mellor. I've followed his writing since about 2005 and really enjoyed his take on the burgeoning SSD market as new products were announced with faster I/O every month in the heady days of 2007 and beyond. Things have slowed down a bit now, and PCIe SSDs are still the reference standard by which I/O benchmarks are measured. Fusion-io is now owned by SanDisk, and everyone's still enjoying the speed increases they get when buying these high-end PCIe products. But it's important to note that for further increases to occur, just as with SanDisk's use of UltraDIMM, you have to keep pushing the boundaries. And that's where Chris's most recent article comes in.
Chris discusses how the Non-Volatile Memory Host Controller Interface (NVMHCI) came about as a response to the legacy carry-over from spinning hard drives in the AHCI (Advanced Host Controller Interface) standard developed by Intel. AHCI and SATA (Serial ATA, the follow-on to parallel ATA) both assumed spinning magnetic hard drives, and the speeds at which they push I/O, would be the technology a CPU uses to interact with its largest data store, the hard drive. Once that data store became flash memory, a new standard was needed to drive faster I/O and lower latencies. Enter the NVMe (Non-Volatile Memory Express) interface, now being marketed and sold by some manufacturers. A native data channel from the PCIe bus to your SSD, however it may be designed, is the next big thing in SSD hardware. With the promise of better speeds, it's worth migrating once the manufacturers get on board. But Chris's article looks beyond the immediate concerns of migrating from SATA to NVMe, since even flash memory may eventually be usurped by some as-yet-unheard-of technology. Given that, NVMe abstracts enough of the "media" of the non-volatile memory that it should allow future adoption of any number of technologies that could take the crown from NAND memory chips. And that is potentially a greater benefit than just squeezing out a few more megabytes per second of read and write speed. Even more tantalizing, in Chris's view, is the mixing of DRAM and flash in a "mesh," let's say, of higher- and lower-speed memories, as Fusion-io's software does, to make the sharp distinction between DRAM and flash less visible. In a sense, the speed would just come with the purchase of the technology; how it actually works would be the proverbial magic to sysadmins and the residents of userland.
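The queueing gap between the two standards is worth spelling out, because it is the core of the AHCI legacy problem. The limits below come from the published AHCI and NVMe specifications (one queue of 32 commands vs. up to 64K queues of 64K commands); the tiny Python model around them is just my illustration.

```python
# Why AHCI bottlenecks flash: one shared queue, depth 32, designed for
# a single spinning disk. NVMe lets every core have its own deep queue.
# Figures are from the respective specs; the model is illustrative.

AHCI = {"queues": 1, "depth": 32}
NVME = {"queues": 65535, "depth": 65536}

def max_outstanding(interface):
    # Upper bound on commands in flight at once across all queues.
    return interface["queues"] * interface["depth"]
```

With at most 32 commands in flight, a many-core host simply cannot keep a modern SSD's parallel flash channels busy; that, more than raw link speed, is what NVMe fixes.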
The ever-increasing density of virtual infrastructures, and the need to scale databases larger than ever, is creating an ongoing need for faster storage. And while flash has become the "go-to" performance option, there are environments that still need more. Nonvolatile DRAM is the heir apparent, but it often requires customized motherboards to implement, for which widespread availability could be years away. Netlist, pioneer of NVRAM, has introduced a product that is viable for most data centers right now: the EXPRESSvault™ EV3.
The Flash Problem
While flash has solved many performance problems, it also creates a few. First, there is a legitimate concern over flash wear, especially if the environment is write-heavy. There is also a concern about performance: while flash is fast compared to hard disk drives, it's slow compared to RAM, especially, again, on writes.
But flash does have two compelling advantages over DRAM. First it is…
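The wear concern in that pull-quote lends itself to a back-of-envelope calculation. The function below is a common rough endurance model, not any vendor's rating formula, and every number in the example is an illustrative assumption.

```python
# Rough flash endurance estimate: cells tolerate a fixed number of
# program/erase (P/E) cycles, and the controller's write amplification
# multiplies every host write. All figures here are assumptions.

def years_of_life(capacity_gb, pe_cycles, writes_gb_per_day,
                  write_amplification=2.0):
    # Total host-visible gigabytes the flash can absorb before cells
    # start failing, discounted by internal write amplification.
    total_writes_gb = capacity_gb * pe_cycles / write_amplification
    return total_writes_gb / writes_gb_per_day / 365

# e.g. a 400GB drive, 3,000 P/E cycles, hammered with 500GB/day of
# writes, lasts only a few years -- hence the write-heavy worry.
lifetime = years_of_life(400, 3000, 500)
```

Halve the P/E rating (cheaper flash) or double the daily write load and the lifetime halves with it, which is exactly why a write-heavy shop cares about keeping flash out of the hot path.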
Previous generations of multi-core, massively parallel, ARM-based servers came from one-off manufacturers with their own toolsets and Linux distros. HP's attempt to really market to this segment will hopefully be substantial enough to produce an Ubuntu distro with enough libraries and packages to make it function right out of the box. The article says companies are using the ProLiant ARM-based system as a memcached server. I would speculate that if that's what people want, the easier you can make it happen from an OS and app-server standpoint, the better. There's a reason folks like to buy Synology and BuffaloTech NAS products: the ease with which you spin them up and get a lot of storage attached in a short amount of time. If ProLiant can do that for people needing quicker and more predictable page loads on their web apps, then optimize for memcached performance and make it easy to configure and put into production.
Now what, you may ask, is memcached? If you're running a web server or a web application that needs a lot of speed, so that purchases or other transactions complete and show some visual cue of success, the easiest way to get there is caching. The page contents are kept in a high-speed storage location separate from the actual web page, and when required, the server redirects, or points, to the content sitting in that high-speed location. By swapping in the high-speed copy for the slower one, you get a really good experience, with the web page refreshing automagically to show your purchases in a shopping cart, or that your tax refund is on its way. The web is built on caching so we don't see spinning watches or other indications that processing is going on in the background.
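The pattern just described is usually called cache-aside, and it can be sketched in a few lines. A plain dictionary stands in for the memcached server here, and `slow_render` is a made-up stand-in for the database hit and template render; a real deployment would use an actual memcached client instead.

```python
# Minimal cache-aside sketch of what memcached does for a web app:
# check the fast in-memory store first, and only do the slow work
# (database query, page render) on a miss.

cache = {}  # stands in for the memcached server's in-memory store

def slow_render(page_id):
    # Hypothetical stand-in for the slow path: hit the database,
    # render the HTML.
    return f"<html>page {page_id}</html>"

def get_page(page_id):
    if page_id in cache:          # cache hit: served straight from RAM
        return cache[page_id]
    html = slow_render(page_id)   # cache miss: do the slow work once
    cache[page_id] = html         # keep it for every later visitor
    return html
```

The first visitor to a page pays the slow-path cost; everyone after them is served from RAM, which is why a farm of cheap memcached boxes translates directly into faster page loads.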
To date, this type of caching has been done by different software packages, first for Apache web servers, but now, in the world of social media, for any type of web server. Whether it's Amazon, Google or Facebook, memcached or a similar caching server is sending you that actual web page as you click, submit and wait for the page to refresh. And if a data center owner like Amazon, Google or Facebook can lower the cost of each of its memcached servers, it can lower the operating cost of each of those cached web pages and keep everyone happy with the speed of its websites. Whether ARM-based servers see wider application depends on apps being written specifically for that chip architecture. But at least now people can point to memcached and web page acceleration as a big first win that might see wider adoption longer term.
Since last year, Apple’s been hard at work building out their own CDN and now those efforts are paying off. Recently, Apple’s CDN has gone live in the U.S. and Europe and the company is now delivering some of their own content, directly to consumers. In addition, Apple has interconnect deals in place with multiple ISPs, including Comcast and others, and has paid to get direct access to their networks.
Given some of my experiences attempting to watch the live stream from Apple's combined iPhone/Watch event, I wanted to address CDNs. Content Delivery Networks are designed to speed the flow of many types of files from data centers, or from video head ends for live events. I should note I started this article back on August 1st, when the original announcement went out. It's doubly poignant now, as the video stream difficulties at the start of the show (1 PM EDT) kind of ruined it for me and a few others. They lost me in those scant first 10 minutes and never recovered. I did connect later, but by then the Apple Watch presentation was half done. Oh well, you get what you pay for. I paid nothing for Apple's live event stream and got nothing in return.
Back during the Steve Jobs era, one of the biggest supporters of Akamai and its content delivery network was Apple Inc. And this was not just for streaming the keynote speeches and Macworld (before Apple withdrew from that event) but also the Worldwide Developers Conference (WWDC). At the time we enjoyed great access to free streams and great performance levels, for free. But Apple cut way back on those simulcasts, and rivals like Eventbrite began to eat into Akamai's lower end. Since then, the huge data center providers began to build out their own data centers worldwide, and in so doing, a kind of internal monopoly on content distribution went into effect. Google was the first to really scale up in a massive way, then scale out, to make sure all those Gmail accounts ran faster and better in spite of each account's huge mail spool. Eventually the second wave of social media outlets joined in (with Facebook leading a revolution in open-stack and open-hardware specs) and created their own versions of content delivery as well.
Now Apple has attempted to scale up and scale out to keep people tightly bound to the brand. iCloud really is a thing, but more than that, the real heavy lifting is now being done once and for all time. Peering arrangements (anathema to the open Internet) would be signed and deals made to scratch each other's backs by sharing the burden of carrying not just your own internal traffic but that of others too. And depending on the ISP, you could really get gouged in those negotiations. But no matter; Apple soldiered on, and now they're ready to put all the prep work to good use. Hopefully the marketing will be sufficient to express the satisfaction and end-user experience at all levels: iTunes, apps, iCloud data storage and everything else should see the boost in speed. If Apple can hold its own against both Facebook and Gmail in this regard, the future's so bright they're gonna need shades.
Today many different interconnection topologies are used for multicore chips. For as few as eight cores direct bus connections can be made — cores taking turns using the same bus. MIT’s 36-core processors, on the other hand, are connected by an on-chip mesh network reminiscent of Intel’s 2007 Teraflop Research Chip — code-named Polaris — where direct connections were made to adjacent cores, with data intended for remote cores passed from core-to-core until reaching its destination. For its 50-core Xeon Phi, however, Intel settled instead on using multiple high-speed rings for data, address, and acknowledgement instead of a mesh.
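The mesh idea in that pull-quote has a simple cost model: with data passed neighbor to neighbor and routed first along one axis, then the other (so-called XY routing, a common scheme for such meshes, though the quote doesn't name MIT's exact algorithm), the hop count between two cores is just the Manhattan distance on the grid.

```python
# Hop counts on an on-chip mesh like a 36-core (6x6) design: data for
# a remote core hops neighbor to neighbor, so under simple XY routing
# the cost is the Manhattan distance between the two cores.

def mesh_hops(src, dst):
    # src and dst are (row, col) core coordinates on the grid.
    return abs(src[0] - dst[0]) + abs(src[1] - dst[1])

# Worst case on a 6x6 mesh: opposite corners of the grid.
worst = mesh_hops((0, 0), (5, 5))  # 10 hops
```

That worst case grows only as the grid's perimeter while the core count grows as its area, which is why a mesh scales to 36 cores where a shared bus taps out around eight.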
I commented some time back on a similar article on the same topic. It appears the MIT research group now has working silicon of the design. As mentioned in the pull-quote, the Xeon Phi (which has made some news in the Top 500 supercomputer stories recently) is a massively multicore architecture, but it uses a different interconnect that Intel designed on its own. Stories like these get filed into the category of massively multicore or low-power CPU developments. Most times the same CPUs add cores without drawing significantly more power, and thus provide a net increase in compute ability. Tilera, Calxeda and yes, even SeaMicro were all working toward those ends. Through mergers or cuts in funding, each one seems to have trailed off without succeeding at its original goal (massively multicore, low-power designs). Along the way, Intel has also done everything it can to dull and dent the novelty of the new designs by revising an Atom- or Celeron-based CPU to provide much lower power at the scale of maybe two cores per CPU.
Like the chip MIT just announced, Tilera too was originally an MIT research project spun off the university campus; its principals were the PI and a research associate, if I remember correctly. Now that MIT has working silicon, they're going to benchmark, test and verify their design. Once they've completed their own study, the researchers will release the Verilog hardware description of the chip for anyone to use, research or verify for themselves. It will be interesting to see how much of an incremental improvement this design provides; it could possibly be the launch of another Tilera-style product out of MIT.
SUMMARY: Microsoft has been experimenting with its own custom chip effort in order to make its data centers more efficient, and these chips aren’t centered around ARM-based cores, but rather FPGAs from Altera.
FPGAs for the win, at least for eliminating unnecessary Xeon CPUs doing online analytic processing for the Bing search service. Microsoft says it can process the same amount of data with half the number of CPUs by offloading some of the heavy lifting from general-purpose CPUs to specially programmed FPGAs tuned to Microsoft's algorithms for delivering the best search results. For Microsoft the cost of the data center will out, and if you can drop half of the Xeons in a data center, you just cut your per-transaction costs in half. That is quite an accomplishment in these days of radical incrementalism in data center ops and DevOps. The Field-Programmable Gate Array is known as a niche, discipline-specific kind of hardware solution. But when flashed and programmed properly, and re-configured as workloads and needs change, it can do some magical heavy lifting from a computing standpoint.
Specifically, I'm thinking of really repetitive loops, or recursive algorithms that take forever to unwind and deliver a final result; those are best done in hardware rather than software. For search engines that might be the process used to determine the authority of a page in the rankings (like Google's PageRank). Knowing you can tune the hardware to fit the algorithm means you'll spend less time attempting the heavy lifting on a general-purpose CPU with really fast C/C++ code. In Microsoft's plan, that means fewer CPUs are needed to do the same amount of work. Better yet, if you determine a better algorithm for your daily batch processes, you can spin up a new hardware circuit design and apply it to the compute cluster over time (without having to pull and replace large sections of the cluster). It will be interesting to see whether Microsoft reports any efficiencies in a final report; as of now this seems somewhat theoretical, though it may have been tested at least in a production test bed of some sort using real data.
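To make the "repetitive loop" point concrete, here is a tiny power-iteration version of PageRank, the example named above. This is an illustrative textbook sketch, not Bing's or Google's code; the toy link graph in the usage is made up. The inner loops do the same small multiply-accumulate millions of times over at web scale, which is exactly the shape of work an FPGA pipeline eats for breakfast.

```python
# Toy PageRank by power iteration: repeat a cheap multiply-accumulate
# sweep over the whole link graph until the ranks settle. Textbook
# sketch for illustration, not any search engine's implementation.

def pagerank(links, damping=0.85, iterations=50):
    # links maps each page to the list of pages it links to;
    # every page is assumed to have at least one outbound link.
    n = len(links)
    rank = {page: 1.0 / n for page in links}
    for _ in range(iterations):
        # Base rank every page gets regardless of inbound links.
        new_rank = {page: (1 - damping) / n for page in links}
        for page, outlinks in links.items():
            # Each page shares its damped rank among its outlinks.
            share = damping * rank[page] / len(outlinks)
            for target in outlinks:
                new_rank[target] += share
        rank = new_rank
    return rank
```

The whole algorithm is one fixed arithmetic kernel repeated over and over, with no branching surprises, so casting it into gates and streaming the graph through is a natural fit.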
If Hewlett-Packard (HPQ) founders Bill Hewlett and Dave Packard are spinning in their graves, they may be due for a break. Their namesake company is cooking up some awfully ambitious industrial-strength computing technology that, if and when it’s released, could replace a data center’s worth of equipment with a single refrigerator-size machine.
The memristor makes an appearance again as a potential memory technology for future computers. To date, flash memory has shown it can scale far into the future, so what benefit could there possibly be in adopting the memristor? For starters, you might be able to put a good deal of it on the same die as the CPU. Which means that, like Intel's most recent i-series CPUs with embedded graphics DRAM, you could instead fit an even larger amount of memristor memory. Memristor memory is denser than DRAM and stays resident even after power is removed from the circuit. Intel's eDRAM scales up to 128MB on die; imagine how much memristor memory might fit in the same space. The article states the memristor is 64-128 times denser than DRAM. I wonder if that also holds true against Intel's embedded DRAM. Even if it's only 10x denser than eDRAM, you could still fit 10x 128MB of memristor memory within a 4-core CPU socket. With that much on-package memory, the speed of memory access would be determined solely by on-chip bus speeds. No PCIe or DRAM memory-controller bus needed. Keep it all on die as much as possible and your speeds would scream along.
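Spelling out that density arithmetic: the 64-128x figure (vs. DRAM) is from the article, the 10x factor (vs. eDRAM) is my deliberately conservative guess from the paragraph above, and the rest is simple multiplication.

```python
# Density arithmetic from the text: today's on-die eDRAM capacity,
# scaled by a conservative assumed density advantage for memristor.

edram_on_die_mb = 128             # Intel's current on-die eDRAM
assumed_density_factor = 10       # hedged guess vs. eDRAM (article
                                  # claims 64-128x vs. plain DRAM)
memristor_on_die_mb = edram_on_die_mb * assumed_density_factor  # 1280 MB
```

Even with the conservative factor, that's over a gigabyte of non-volatile memory sitting in the space Intel currently spends on 128MB of eDRAM.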
There are big downsides to adopting the memristor, however. One drawback is how a CPU resets memory on power-down when all of the memory is non-volatile: the CPU now has to explicitly erase things on reset/shutdown before it reboots. That will take some architecture changes on both the hardware and software sides. The article further states that even how programming languages use memory would be affected. Long term, the promise of the memristor is great, but the heavy lifting needed to accommodate the new technology hasn't been done yet. In an effort to speed the plow on this evolution in hardware and software, HP is enlisting the open source community. It's hoped that standards and best practices can slowly be hashed out for how memristor memory is accessed, written to and flushed by the OS, schedulers and apps. One possible early adopter, and potential big win, would be the large data center owners and cloud operators.
In-memory caches and databases are the bread and butter of the big hitters in cloud computing. Memristor memory might be adapted to this end as a virtual disk made up of memory cells, with a transaction log written to it. Or it could be pointed to by the OS and treated as a raw disk of sorts, only much faster. By the time a cloud provider's architects have really optimized their infrastructure for the memristor, there's no telling how flat the memory hierarchy could become. Today it's a long chain of higher- and higher-speed caches attached to spinning drives at the base of the pyramid. Given higher density like the memristor's, and a physical location closer to the CPU core, one might eliminate a storage tier altogether for online analytical systems. Spinning drives might be relegated to serving as tape replacements for less-accessed, less-hot data. HP hopes to deliver a computer optimized for the memristor (called "The Machine" in this article) by 2019, in which cache, memory and storage are no longer so tightly defined and compartmentalized. With any luck it will be a shipping product and will perform at the level they're predicting.
Although Intel’s SSD DC P3700 is clearly targeted at the enterprise, the drive will be priced quite aggressively at $3/GB. Furthermore, Intel will be using the same controller and firmware architecture in two other, lower cost derivatives (P3500/P3600). In light of Intel’s positioning of the P3xxx family, a number of you asked for us to run the drive through our standard client SSD workload. We didn’t have the time to do that before Computex, but it was the first thing I did upon my return. If you aren’t familiar with the P3700 I’d recommend reading the initial review, but otherwise let’s look at how it performs as a client drive.
This is part 2 of Anandtech's full review of the Intel P3700 PCIe/NVMe card. It's reassuring to learn from Anandtech that Intel has more than just the top-end P3700 coming to market; other price points will compete for the non-enterprise workloads too. At $3/GB, the P3700 sits at the top of the desktop peripheral price range, even for a fanboy gamer. But for data center workloads, and the prices that crowd pays, this is going to be an easy choice. Intel's P3700, as Anandtech concludes, is built not just for speed (peak I/O) but for consistency across queue depths, file sizes and block sizes. If you're trying to budget a capital improvement in your data center and want to quote the increases you'll see, these benchmarks are proof enough that you'll get back every penny you spend. No need to throw an evaluation unit into your test rig and benchmark it yourself.
As for the lower-end models, you might be able to dip your toe in, though not at the same performance level, at the $600 price point. That gets you an average-to-smallish 400GB PCIe card, the Intel SSD DC P3500. Still, the overall design and engineering is derived in part from the move from a straight PCIe interface to one that harnesses more data lanes on the PCIe bus and connects via the NVMHCI drive interface. That's what you're getting for that price. If you're very price-sensitive, do not purchase this product line; Samsung has you more than adequately covered under the old-regime SATA SSD technology, and even then the performance is nothing to sneeze at. But do know things are in flux, with new higher-performance drive interfaces that manufacturers will be marketing and selling to you soon. Roughly, this is the order in which things improve toward higher I/O: SATA SSD, then SATA Express, then PCIe AHCI, then PCIe NVMe.
And the incremental differences in the middle are small enough that you'll only really see benefits if the price is lower for a slightly faster interface (say, SATA SSD vs. SATA Express: choose based on the price being dead equal, not on performance alone). Knowing what all these things do, or even just what they mean, and how they equate to your computer's I/O performance will help you choose wisely over the next year or two.