Posts Tagged ‘Hadoop’
SeaMicro has been peddling its SM10000-64 micro server, based on Intel's dual-core, 64-bit Atom N570 processor and cramming 256 of these chips into a 10U chassis. . .
. . . The SM10000-64 is not so much a micro server as a complete data center in a box, designed for low power consumption and loosely coupled parallel processing, such as Hadoop or Memcached, or small monolithic workloads, like Web servers.
While it is not always easy to illustrate the cost/benefit and return on investment of a lower-power box like the SeaMicro, running it head to head against a bunch of off-the-shelf Xeon boxes on a similar workload really shows the difference. How you calculate the benefit is critical too. What do you measure? Is it speed? Is it speed per transaction? Is it total volume allowed through? Or is it cost per transaction within a set number of transactions? You’re getting closer with that last one. The test setup used a set number of transactions that had to be completed in a set period of time. The benchmark then measured the total power drawn to complete that number of transactions in that time. SeaMicro came away the winner on cost per transaction in power terms. While the Xeon-based servers had huge excess speed and capacity, their power dissipation put them pretty far into the higher cost-per-transaction category.
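To make the "cost per transaction in power terms" idea concrete, here is a minimal sketch of the arithmetic in Python. The wattages, the one-hour window, the transaction count and the electricity price are all made-up numbers for illustration, not figures from the SeaMicro benchmark, and the function names are my own.

```python
def energy_per_transaction(avg_power_watts, window_seconds, transactions):
    """Joules consumed per transaction over a fixed test window."""
    if transactions <= 0:
        raise ValueError("transactions must be positive")
    # Watts are joules per second, so power * time is total energy in joules.
    return avg_power_watts * window_seconds / transactions


def energy_cost_per_transaction(avg_power_watts, window_seconds,
                                transactions, price_per_kwh):
    """Electricity cost per transaction, in the currency of price_per_kwh."""
    joules = energy_per_transaction(avg_power_watts, window_seconds, transactions)
    kwh = joules / 3_600_000  # 1 kWh = 3.6 million joules
    return kwh * price_per_kwh


# Hypothetical inputs: both systems finish the same fixed workload
# (1,000,000 transactions) inside the same one-hour window.
WINDOW_SECONDS = 3600
TRANSACTIONS = 1_000_000

micro_joules = energy_per_transaction(2000, WINDOW_SECONDS, TRANSACTIONS)
xeon_joules = energy_per_transaction(6000, WINDOW_SECONDS, TRANSACTIONS)

print(f"micro server rack: {micro_joules:.1f} J per transaction")  # 7.2
print(f"Xeon rack:         {xeon_joules:.1f} J per transaction")   # 21.6
```

The point of fixing both the transaction count and the time window is that any spare speed the Xeon rack has earns it nothing here. Only the energy spent to hit the target counts.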
However, it is very difficult to communicate this advantage that SeaMicro has over Intel's Xeon-based servers. Future tests and benchmarks need to be constructed with clearly stated goals and criteria. It would help most if the comparison could be communicated as a Case History of a particular problem that could be solved by either a SeaMicro server or a bunch of Intel boxes running Xeon CPUs with big caches. Once that Case History is well described, the two architectures can be put to work showing the end goal in clear terms (cost per transaction). Then and only then will SeaMicro communicate effectively how it does things differently and how that can save money. Otherwise it’s too different to measure effectively against an Intel Xeon-based rack of servers.
- Big data on micro servers? You bet. (gigaom.com)
- A visit to Silicon Valley’s hot new hardware company: Microserver maker SeaMicro (scobleizer.com)
- eHarmony Switches from Cloud to Atom Servers (datacenterknowledge.com)
In short, big data simply means data sets that are large enough to be difficult to work with. Exactly how big is big is a matter of debate. Data sets that are multiple petabytes in size are generally considered big data (a petabyte is 1,024 terabytes). But the debate over the term doesn’t stop there.
There’s big doin’s inside and outside the data center these days. You cannot go a day without seeing a cool new article about some new project that’s just been open sourced from one of the departments inside the social networking giants, Hadoop being the biggest example. What, you ask, is Hadoop? It is a project Yahoo got behind after Google started spilling the beans on its two huge technological leaps in massively parallel databases and large-scale data processing. The first one was called BigTable. It is a huge distributed database that could be brought up on an inordinately large number of commodity servers and then ingest all the indexing data sent by Google’s web bots as they found new websites. That’s the database and ingestion point. The second is the way in which the rankings and ‘pertinence’ of the indexed websites would be calculated through PageRank. The invention for processing all of this collected data is called MapReduce. It was a way of pulling in, processing and quickly sorting out the important, highly ranked websites by spreading the work across many machines. Yahoo read the white papers put out by Google and subsequently built its own version of those technologies, which today powers the Yahoo! search engine. Having put this into production and realized the benefits, Yahoo turned it into an open source project to lower the threshold for people wanting to get into the Big Data industry. It also wanted to get many programmers’ eyes on the source code, adding features, packaging it and, all importantly, debugging what was already there. Hadoop was the name given to the Yahoo bag of software, and this is what a lot of people initially adopt if they are trying to do large-scale collection and analysis of Big Data.
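For anyone who has not seen the MapReduce idea in code, here is a toy, single-process sketch in Python. It is not Hadoop's API and it is not PageRank. It only counts inbound links per site, which is about the simplest ranking signal you could pull out of crawl data. The function names and the tiny `crawl` dictionary are invented for the example.

```python
from collections import defaultdict


def map_links(page, outbound_links):
    """Map step: emit a (target, 1) pair for every link found on a page."""
    for target in outbound_links:
        yield target, 1


def shuffle(pairs):
    """Shuffle step: group every emitted value under its key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_counts(key, values):
    """Reduce step: collapse all values for one key into a single total."""
    return key, sum(values)


def run_job(crawl):
    """Run map, shuffle and reduce in one process over a {page: links} dict."""
    pairs = (pair
             for page, links in crawl.items()
             for pair in map_links(page, links))
    grouped = shuffle(pairs)
    return dict(reduce_counts(key, values) for key, values in grouped.items())


# Hypothetical crawl output: each page and the pages it links to.
crawl = {
    "a.example": ["b.example", "c.example"],
    "b.example": ["c.example"],
    "c.example": ["a.example"],
}

inbound = run_job(crawl)
print(inbound)  # {'b.example': 1, 'c.example': 2, 'a.example': 1}
```

On a real cluster the map calls run on many machines at once, the shuffle moves data across the network, and the reduce calls run in parallel per key. The shape of the program stays the same, which is a large part of the appeal.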
Another discovery along the way towards the Big Data movement was a parallel attempt to overcome the limitations of extending the schema of a typical database holding all the incoming indexed websites. Tables and rows and Structured Query Language (SQL) have ruled the day since about 1977 or so, and for many kinds of tabular data there is no substitute. However, the kinds of data being stored now fall into the big amorphous mass of binary large objects (BLOBs) that can slow down a traditional database. So a non-SQL approach was adopted, and there are parts of the BigTable database and Hadoop that dump the unique key values and relational tables of SQL to just get the data in and characterize it as quickly as possible, or better yet to re-characterize it by adding elements to the schema after the fact. Whatever you are doing, what you collect might not be structured or easily structured, so you’re going to need to play fast and loose with it, and you need a database of some sort equal to that task. Enter the NoSQL movement to collect and analyze Big Data in its least structured form. So my recommendation to anyone trying to get the square peg of relational databases to fit the round hole of their unstructured data is to give up. Go NoSQL and get to work.
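Here is a rough Python sketch of what "adding elements to the schema after the fact" looks like in practice. It uses a plain in-memory list of dictionaries as a stand-in for a document store, so it is an illustration of the idea and not the API of any particular NoSQL database. The class name, field names and sample values are all invented.

```python
_MISSING = object()  # sentinel so a missing field never matches a query value


class DocumentStore:
    """Toy schema-less store: every record is a dict with whatever fields it has."""

    def __init__(self):
        self.records = []

    def ingest(self, raw, **known_fields):
        """Get the data in first, with only the fields known right now."""
        doc = {"raw": raw, **known_fields}
        self.records.append(doc)
        return doc

    def enrich(self, doc, **new_fields):
        """Re-characterize a record later by bolting on new fields."""
        doc.update(new_fields)

    def find(self, **criteria):
        """Return records whose fields match; records lacking a field are skipped."""
        return [doc for doc in self.records
                if all(doc.get(key, _MISSING) == value
                       for key, value in criteria.items())]


store = DocumentStore()
first = store.ingest(b"<html>...</html>", url="a.example")
second = store.ingest(b"\x89PNG...", url="b.example/logo.png")

# Later analysis decides a "kind" field would be useful; no table migration needed.
store.enrich(first, kind="page", language="en")
store.enrich(second, kind="image")

pages = store.find(kind="page")
print([doc["url"] for doc in pages])  # ['a.example']
```

The trade-off is that nothing stops two records from spelling the same field differently, so the discipline a relational schema enforces up front has to come from your own code.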
This first article from Read Write Web is good in that it lays the foundation for what a relational database universe looks like and how you can manipulate it. Having established what IS, future articles will look at the quick, dirty workarounds and one-off projects people have come up with to fit their needs, and subsequently at which ‘Works for Me’ type solutions have been turned into bigger open source projects that will ‘Work for Others’, as that is where each of these technologies will really differentiate itself. Ease of use and a lower threshold will be deciding factors in many people’s adoption of a NoSQL database, I’m sure.