While I am not a DB admin, I do appreciate the wealth of new database projects spawned by the likes of Google’s MapReduce/BigTable architecture. Similarly, the non-traditional, non-relational DBs are also very interesting and prove that there’s always a right tool for the job. Though some programmers and developers will continually try to hammer every nail with MySQL, the options available to them are increasing. Whether the concern is scale, load or malleability, there’s a NoSQL/NewSQL product that will do the job.
In Part One we covered data, big data, databases, relational databases and other foundational issues. In Part Two we talked about data warehouses, ACID compliance, distributed databases and more. Now we’ll cover non-relational databases, NoSQL and related concepts.
I really give a lot of credit to ReadWriteWeb for packaging up this 3-part series (started May 24th, I think). This at least narrows down what is meant by all the fast and loose terms that white papers and admen are throwing around to get people to consider their products in RFPs. Just know this though: in many cases the NoSQL databases that keep coming onto the market tend to be one-off solutions created by big social networking companies who couldn’t get MySQL/Oracle/MSQL to scale sufficiently in size or speed during their early build-outs. Just think of Facebook hitting the 500 million user mark and you will know that there’s got to be a better way than relational algebra and tables with columns and rows.
In Part Three we finally get to what we have all been waiting for: non-relational databases, so-called NoSQL. Google’s MapReduce technology is quickly shown as one of the most widely known examples of the NoSQL style of distributed data handling, one that, while not adhering to absolute or immediate consistency, gets there with ‘eventual consistency’ (Consistency being the big C in the acronym ACID). The coolest thing about MapReduce is the similarity (at least in my mind) it bears to the SETI@home project, where ‘work units’ were split out of large data tapes, distributed piecemeal over the Internet and analyzed on a person’s desktop computer. The completed units were then gathered up and brought together into a final result. This is similar to how Google does its big data analysis to get work done in its data centers. And it carries on in Hadoop, an open source implementation of MapReduce championed by Yahoo and now part of the Apache organization.
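To make the ‘work unit’ idea a bit more concrete, here is a minimal single-machine sketch of the split, map, gather and reduce steps as a word count. It is only an illustration of the pattern under my own assumptions, not how Google or Hadoop actually implement it; the function names (split_into_work_units, map_unit, shuffle, reduce_counts) are made up for this example, and a real system would hand each unit to a different machine.

```python
from collections import defaultdict


def split_into_work_units(text, unit_size):
    """Cut the input into small batches of lines, one batch per 'work unit'."""
    lines = text.splitlines()
    return [lines[i:i + unit_size] for i in range(0, len(lines), unit_size)]


def map_unit(unit):
    """Map step: emit a (word, 1) pair for every word in one work unit."""
    return [(word.lower(), 1) for line in unit for word in line.split()]


def shuffle(pairs):
    """Gather step: group the emitted values by word."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_counts(grouped):
    """Reduce step: collapse each word's list of ones into a total."""
    return {word: sum(counts) for word, counts in grouped.items()}


def word_count(text, unit_size=2):
    """Run the whole pipeline in one process (hypothetical helper)."""
    units = split_into_work_units(text, unit_size)
    # In a real cluster each unit would be mapped on a separate worker.
    mapped = [pair for unit in units for pair in map_unit(unit)]
    return reduce_counts(shuffle(mapped))


if __name__ == "__main__":
    sample = "big data\nbig tables\nmap reduce\nreduce data"
    print(word_count(sample))
```

The point of the shape is that map_unit never needs to see any unit but its own, which is what lets the work be farmed out piecemeal and only brought back together at the end.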
Document databases are cool too, and very much like an object-oriented database where you have a core item with attributes appended. I think also of LDAP directories, which have similarities to object-oriented databases as well. A person has a ‘Common Name’ or CN attribute. The CN is as close to a unique identifier as you can get, with all the other attributes strung along, appended on the end as they need to be added, in no particular order. The ability to add attributes as needed is like ‘tagging’ in the way social networking sites, such as picture-sharing and bookmarking websites, do it. You just add an arbitrary tag to help search engines index the site and help relevant web searches find your content.
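Here is a rough sketch of that ‘core item with attributes appended’ idea, using a plain in-memory dictionary keyed by something like a CN. The DocumentStore class and its method names are hypothetical, invented for this illustration; no real document database or LDAP server exposes this exact API.

```python
class DocumentStore:
    """Toy document store: each key maps to a free-form bag of attributes."""

    def __init__(self):
        self._docs = {}

    def put(self, key, **attributes):
        """Create the document if needed and append or overwrite attributes."""
        self._docs.setdefault(key, {}).update(attributes)

    def tag(self, key, tag):
        """Attach an arbitrary tag to a document, skipping duplicates."""
        tags = self._docs.setdefault(key, {}).setdefault("tags", [])
        if tag not in tags:
            tags.append(tag)

    def get(self, key):
        """Return the document for a key, or an empty dict if unknown."""
        return self._docs.get(key, {})

    def find_by_tag(self, tag):
        """Return the keys of all documents carrying the given tag."""
        return sorted(k for k, doc in self._docs.items()
                      if tag in doc.get("tags", []))


if __name__ == "__main__":
    store = DocumentStore()
    store.put("Ada Lovelace", mail="ada@example.org")
    store.put("Ada Lovelace", title="Analyst")  # added later, no schema change
    store.tag("Ada Lovelace", "math")
    print(store.get("Ada Lovelace"))
    print(store.find_by_tag("math"))
```

Notice that nothing had to be declared up front: the second put simply bolts a new attribute onto the existing record, which is the flexibility the document model is trading rigid columns for.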
The relationship between Graph Databases and Mind-Mapping is also very interesting. There’s a good graphic illustrating a Graph database of blog content to show how relation lines are drawn and labeled. So now I have a much better understanding of Graph databases as I have used mind-mapping products before. Nice parallel there I think.
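For anyone who wants to see the ‘labeled relation lines’ in code rather than in a diagram, here is a small sketch of a graph of blog content. The LabeledGraph class and the labels (WROTE, TAGGED, COMMENTED_ON) are my own inventions for illustration and are not taken from the article’s graphic or from any particular graph database.

```python
from collections import defaultdict


class LabeledGraph:
    """Toy graph: nodes are strings, edges carry a relationship label."""

    def __init__(self):
        self._edges = defaultdict(list)

    def relate(self, source, label, target):
        """Draw a labeled line from source to target."""
        self._edges[source].append((label, target))

    def neighbors(self, source, label=None):
        """Follow the lines out of a node, optionally only those with one label."""
        return [target for (edge_label, target) in self._edges.get(source, [])
                if label is None or edge_label == label]


if __name__ == "__main__":
    graph = LabeledGraph()
    graph.relate("alice", "WROTE", "post-1")
    graph.relate("bob", "COMMENTED_ON", "post-1")
    graph.relate("post-1", "TAGGED", "nosql")
    graph.relate("post-1", "TAGGED", "databases")
    print(graph.neighbors("post-1", "TAGGED"))
```

Just like a mind map, the interesting queries are about walking the lines (who wrote what, what is tagged with what) rather than joining tables.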
At the very end of the article there’s mention of NewSQL, of which Drizzle is an interesting offshoot. Looking up more about it, I found it interesting as a fork of the MySQL project. Specifically, Drizzle factors out tons of functions that some folks absolutely need but most don’t (like, say, 32-bit legacy support). There have been a lot of attempts to get the code smaller, so the overall lines of code went from over 1 million for MySQL to just under 300,000 for the Drizzle project. Speed and simplicity are the order of the day with Drizzle. Add missing functions by simply adding the plug-in to the main app and you get back some of the MySQL features that might have been missing.
I don’t know if you have ever heard of Relational Databases or Structured Query Language. They became de rigueur after 1977 in most corporate data centers, pushing more power into the hands of users instead of programmers. But that type of structured data can only carry you so far until you bump against its limits. In this age of social networking and data gathering on users, we are severely testing the limits of the last big thing in databases.
In short, big data simply means data sets that are large enough to be difficult to work with. Exactly how big is big is a matter of debate. Data sets that are multiple petabytes in size are generally considered big data (a petabyte is 1,024 terabytes). But the debate over the term doesn’t stop there.
There’s big doin’s inside and outside the data center these days. You cannot go a day without a cool new article about some new project that’s just been open sourced by one of the departments inside the social networking giants, Hadoop being the biggest example. What, you ask, is Hadoop? It is a project Yahoo got behind after Google started spilling the beans on its two huge technological leaps in massively parallel databases and large-scale data processing. The first one was called BigTable. It is a huge distributed database that could be brought up on an inordinately large number of commodity servers and then ingest all the indexing data sent by Google’s web bots as they found new websites. That’s the database and ingestion point. The second is the way in which the rankings and ‘pertinence’ of the indexed websites would be calculated through PageRank. The invention for processing all of this collected data is called MapReduce. It was a way of pulling in, processing and quickly sorting out the important, highly ranked websites. Yahoo read the white papers put out by Google and subsequently created a version of those technologies which today powers the Yahoo! search engine. Having put this into production and realized the benefits of it, Yahoo pushed it as an open source project to lower the threshold for people wanting to get into the Big Data industry. Similarly, they wanted to get many programmers’ eyes looking at the source code, adding features, packaging it and, all importantly, debugging what was already there. Hadoop was the name given to the Yahoo bag of software, and this is what a lot of people initially adopt if they are trying to do large-scale collection and analysis of Big Data.
Another discovery along the way towards the Big Data movement was a parallel attempt to overcome the limitations of extending the schema of a typical database holding all the incoming indexed websites. Tables, rows and Structured Query Language (SQL) have ruled the day since about 1977 or so, and for many kinds of tabular data there is no substitute. However, the kinds of data being stored now fall into the big amorphous mass of binary large objects (BLOBs) that can slow down a traditional database. So a non-SQL approach was adopted, and there are parts of the BigTable database and Hadoop that dump the unique key values and relational tables of SQL to just get the data in and characterize it as quickly as possible, or better yet to re-characterize it by adding elements to the schema after the fact. Whatever you are doing, what you collect might not be structured or easily structured, so you’re going to need to play fast and loose with it, and you need a database of some sort equal to that task. Enter the NoSQL movement to collect and analyze Big Data in its least structured form. So my recommendation to anyone trying to get the square peg of Relational Databases to fit the round hole of their unstructured data is to give up. Go NoSQL and get to work.
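As a small illustration of ‘adding elements to the schema after the fact’, here is a sketch where records are just dictionaries with whatever fields happened to arrive, and a new field is bolted on later. The helper names (observed_schema, add_field) and the sample fields are assumptions of mine for the example; BigTable and Hadoop do not work through this interface.

```python
def observed_schema(records):
    """Return every field name seen so far; the 'schema' is whatever showed up."""
    fields = set()
    for record in records:
        fields.update(record)
    return sorted(fields)


def add_field(records, name, compute):
    """Re-characterize existing records by deriving a new field in place."""
    for record in records:
        record.setdefault(name, compute(record))


if __name__ == "__main__":
    # Two crawled pages that do not share the same set of fields.
    records = [
        {"url": "http://example.com/a", "title": "A"},
        {"url": "http://example.com/b", "bytes": 2048},
    ]
    print(observed_schema(records))

    # Later we decide we care about page size; no table has to be altered.
    add_field(records, "is_large", lambda r: r.get("bytes", 0) > 1024)
    print(observed_schema(records))
```

In a relational database that second step would mean altering a table and backfilling a column; here the data was never forced into a fixed shape to begin with.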
This first article from ReadWriteWeb is good in that it lays the foundation for what a relational database universe looks like and how you can manipulate it. Having established what IS, future articles will look at the quick, dirty workarounds and one-off projects people have come up with to fit their needs, and subsequently at which ‘Works for Me’ type solutions have been turned into bigger open source projects that will ‘Work for Others’, as that is where each of these technologies will really differentiate itself. Ease of use and lowering the threshold will be deciding factors in many people’s adoption of a NoSQL database, I’m sure.