Archive for May 24th, 2011
In short, big data simply means data sets that are large enough to be difficult to work with. Exactly how big is big is a matter of debate. Data sets that are multiple petabytes in size are generally considered big data (a petabye is 1,024 terabytes). But the debate over the term doesn’t stop there.
There’s big doin’s inside and outside the data center theses days. You cannot spend a day without a cool new article about some new project that’s just been open sourced from one of the departments inside the social networking giants. Hadoop being the biggest example. What you ask is Hadoop? It is a project Yahoo started after Google started spilling the beans on it’s two huge technological leaps in massively parallel databases and processing real time data streams. The first one was called BigTable. It is a huge distributed database that could be brought up on an inordinately large number of commodity servers and then ingest all the indexing data sent by Google’s web bots as they found new websites. That’s the database and ingestion point. The second point is the way in which the rankings and ‘pertinence’ of the indexed websites would be calculated through PageRank. The invention for the realtime processing of this data being collected is called MapReduce. It was a way of pulling in, processing and quickly sorting out the important highly ranked websites. Yahoo read the white papers put out by Google and subsequently created a version of those technologies which today power the Yahoo! search engine. Having put this into production and realizing the benefits of it, Yahoo turned it into an open source project to lower the threshold of people wanting to get into the Big Data industry. Similarly, they wanted to get many eyes of programmers looking at the source code and adding features, packaging it, and all importantly debugging what was already there. Hadoop was the name given to the Yahoo bag of software and this is what a lot of people initially adopt if they are trying to do large scale collection and real-time analysis of Big Data.
Another discovery along the way towards the Big Data movement was a parallel attempt to overcome the limitations of extending the schema of a typical database holding all the incoming indexed websites. Tables and Rows and Structured Query Language (SQL) have ruled the day since about 1977 or so, and for many kinds of tabbed data there is no substitute. However, the kinds of data being stored now fall into the big amorphous mass of binary large objects (BLOBs) that can slow down a traditional database. So a non-SQL approach was adopted and there are parts of the BigTable database and Hadoop that dump the unique key values and relational tables of SQL to just get the data in and characterize it as quickly as possible, or better yet to re-characterize it by adding elements to the schema after the fact. Whatever you are doing, what you collect might not be structured or easily structured so you’re going to need to play fast and loose with it and you need a database of some sort equal to that task. Enter the NoSQL movement to collect and analyze Big Data in its least structured form. So my recommendation to anyone trying to get the square peg of Relational Databases to fit the round hole of their unstructured data is to give up. Go NoSQL and get to work.
This first article from Read Write Web is good in that it lays the foundation for what a relational database universe looks like and how you can manipulate it. Having established what IS, future articles will be looking at what quick, dirty workarounds and one off projects people have come up with to fit their needs. And subsequently which ‘Works for Me’ type solutions have been turned into bigger open source projects that will ‘Work for Others’, as that is where each of these technologies will really differentiate themselves. Ease of use and lowering the threshold will be deciding factors for many people’s adoption of a NoSQL database I’m sure.