March 2010 Archives

Analysis of the NoSQL Landscape


This is an overview of the current state of the NoSQL landscape. It's getting large and somewhat unwieldy, and some projects may have landed in the wrong category here. I have included object databases in the mix too. Seriously folks, some of you need to pick more Google-friendly project names. Here are the types and the players in each category. Background data is available in this Google Docs spreadsheet.

Key-value stores

Column-oriented stores

Google BigTable, HBase, Cassandra, HyperTable, OpenNeptune, KDI, QBase

Document Databases

CouchDB, MongoDB, Apache JackRabbit, ThruDB, CloudKit, Persevere, Lotus Domino, Riak, Terrastore

Object Databases

ZODB, db4o, Versant, GemStone/S, Progress ObjectStore

Graph Databases

Neo4j, VertexDB, Infogrid, Sones, Filament, Allegrograph, HyperGraphDB

Projects by Type

If we graph all the projects by type we get this view:

[Chart: projects by type]

There are more key-value stores than the other types combined. Why is this? Are key-value stores that much easier to implement? My guess is that this is the first area where we will see projects being abandoned and projects converging. What matters is the features users want, not the projects themselves. There must be a lot of overlap here, with many projects that are almost identical and differ only slightly. On the other hand, a lot of knowledge about these kinds of systems is being spread around, and there is a good chance of innovation. The combination of the best technical features and API features will hopefully bubble to the top and stay there.

License Breakdown

If we graph the projects in the list above by license chosen we get the following:

[Chart: projects by license]

This shows a clear dominance of open source licenses over commercial ones. Some products have chosen a dual licensing model (Neo4j and BerkeleyDB). Quite a few are unknown, which really means either that they fail to communicate their license in an understandable manner or that the project couldn't really be found on the web at all (see the point about Google-friendly names).

Language Breakdown

Graphing the projects by implementation language we get the following:

[Chart: projects by language]

Java takes the lead, with C and C++ following close behind. But is the prevalence of Java a result of the amount of Java knowledge spread around and the heavy use of Java in open source, or is Java better suited than other languages to implementing these kinds of systems? It is interesting to note the number of Erlang implementations, and also the fact that quite a few of the projects have implementations in more than one language. The ones with more than one implementation are mostly commercial ones.

Some closing questions:

* Have we reached the maximum number of sustainable projects now, or will the ecosystem continue to grow even more?
* Will more of them go commercial? Or will more choose the model with support as the income, like 10Gen has with MongoDB?
* How does one choose the right one to use for a given project? This is an increasingly hard problem, at least for key-value stores.


A Brief History of NoSQL


NoSQL is getting a lot of traction and hype these days, but in reality it's not that new. I thought I'd trace the roots of NoSQL and see what I'd find. The name "NoSQL" was in fact first used by Carlo Strozzi in 1998 as the name of a file-based database he was developing. Ironically, it's a relational database, just one without a SQL interface. As such it is not actually a part of the whole NoSQL movement we see today. The term re-surfaced in 2009 when Eric Evans used it to name the current surge in non-relational databases. It seems like the name has stuck, for better or for worse. Note that not all projects are included in this post. See the post on analyzing the NoSQL landscape for a more complete listing.

1960s

  • MultiValue (aka PICK) databases are developed at TRW in 1965.
  • According to a comment from Scott Jones, M[umps] is developed at Mass General Hospital in 1966. It is a programming language that incorporates a hierarchical database with B+ tree storage.
  • IBM IMS, a hierarchical database, is developed with Rockwell and Caterpillar for the Apollo space program in 1966.

1970s

  • InterSystems develops the ISM product family, succeeded by the Open M product, all M[umps] implementations. See the comment from Scott Jones below.
  • M[umps] is approved as an ANSI standard language in 1977.
  • In 1979 Ken Thompson creates DBM, which is released by AT&T. At its core it is a file-based hash.
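
To give a feel for what a file-based key-value store in the DBM tradition looks like to a programmer, here is a minimal sketch. It assumes Python and its standard-library `dbm.dumb` module, which is a simple pure-Python stand-in and not Thompson's DBM (it does not share its on-disk format or its hashing scheme). The `demo` function, the file name and the stored keys are made up for illustration.

```python
import dbm.dumb
import os
import tempfile


def demo(path):
    """Store two pairs in a file-backed database, then read them back."""
    # Flag "c" creates the database files if they do not exist yet.
    with dbm.dumb.open(path, "c") as db:
        db["neo4j"] = "graph"
        db["couchdb"] = "document"

    # Reopen read-only to show that the pairs were persisted to disk.
    with dbm.dumb.open(path, "r") as db:
        # Keys and values come back as bytes, not str.
        return db["neo4j"], sorted(db.keys())


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        value, keys = demo(os.path.join(tmp, "projects"))
        print(value, keys)
```

The whole interface is "put a value under a key, get it back by key", with no schema and no query language. That simplicity is the common thread running from DBM to the key-value stores listed in the later decades.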

1980s

Several successors to DBM spring into life.

  • TDBM adds support for atomic transactions.
  • NDBM is the Berkeley version of DBM, which supports having multiple databases open at the same time.
  • SDBM is another clone of DBM, created mainly for licensing reasons.
  • GT.M is the first version of a key-value store with a focus on high-performance transaction processing. It is open sourced in 2000.
  • BerkeleyDB is created at Berkeley in the transition from 4.3BSD to 4.4BSD. Sleepycat Software is started as a company in 1996 when Netscape needs new features for BerkeleyDB. It is later acquired by Oracle, which still sells and maintains BerkeleyDB.
  • Lotus Notes, or rather its server part, Lotus Domino, which really is a document database, has its initial release in 1989 and is now sold by IBM. It has evolved a lot from the early versions and is now a full office and collaboration suite.

1990s

  • GDBM is the GNU project's clone of DBM.
  • Mnesia is developed by Ericsson as a soft real-time database to be used in telecom. It is relational in nature, but its query language is Erlang itself rather than SQL.
  • InterSystems Caché is launched in 1997 and is a hybrid, so-called post-relational database. It has object interfaces, SQL, PICK/MultiValue and direct manipulation of data structures. It is an M[umps] implementation. See Scott Jones's comment below for more on the history of InterSystems.
  • Metakit is started in 1997 and is probably the first document-oriented database. It supports smaller datasets than the ones in vogue nowadays.

2000-2005

This is where the NoSQL train really picks up momentum and a lot starts to happen.

  • Graph database Neo4j is started in 2000.
  • db4o, an object database for Java and .NET, is started in 2000.
  • QDBM is a re-implementation of DBM with better performance by Mikio Hirabayashi.
  • Memcached is started in 2003 by Danga to power LiveJournal. Memcached isn't really a database since it's memory-only, but there is soon a version with file storage called memcachedb.
  • The Infogrid graph database is started as closed source in 2005 and open sourced in 2008.
  • CouchDB is started in 2005 and provides a document database inspired by Lotus Notes. The project moves to the Apache Foundation in 2008.
  • Google BigTable is started in 2004 and the research paper is released in 2006.

2006-2010

  • JackRabbit is started in 2006 as an implementation of JSR 170 and 283.
  • Tokyo Cabinet, a successor to QDBM by Mikio Hirabayashi, is started in 2006.
  • The research paper on Amazon Dynamo is released in 2007.
  • The document database MongoDB is started in 2007 as part of an open source cloud computing stack and has its first standalone release in 2009.
  • Facebook open sources the Cassandra project in 2008.
  • Project Voldemort, a replicated database with no single point of failure, is started in 2008.
  • Dynomite is a Dynamo clone written in Erlang.
  • Terrastore, a scalable elastic document store, is started in 2009.
  • Redis, a persistent key-value store, is started in 2009.
  • Riak, another Dynamo-inspired database, is started in 2009.
  • HBase is a BigTable clone for the Hadoop project, while Hypertable is another BigTable-type database, also from 2009.
  • VertexDB, another graph database, is started in 2009.
  • Eric Evans of Rackspace, a committer on the Cassandra project, introduces the term "NoSQL", often used in the sense of "Not only SQL", to describe the surge of new projects and products.

(Some of these dates need to be taken with a small pinch of salt, as finding out exactly when the projects started can be a bit difficult. Also, not all projects started in the last few years have been included.)

In 2009 and 2010 we also saw the arrival of NoSQL conferences: NoSQL east in Atlanta last year, NoSQL live in Boston in 2010, and the upcoming NoSQL eu in London in April 2010.

mini bio

Knut Haugen [Knu:t Hæugen], Norwegian software developer with a penchant for dynamic languages and anything to do with developer testing. Agile methodology geek with a bias towards Lean and Kanban.

meta

This page is an archive of entries from March 2010 listed from newest to oldest.
