Comparing MongoDB Java frameworks


A while ago I gave a talk at an Oslo JavaBin meeting on NoSQL, and while researching it I wound up writing some sample code for three different frameworks for using MongoDB from Java. The code is available on GitHub. The three are the 10Gen mongo driver (which is not really a framework), morphia and mungbean.

The Whole Dynamic Thing

MongoDB is a schema-free document database and does not impose any restriction on what you can store in a collection. There is nothing stopping you from putting entirely different objects into the same collection, and thus nothing stopping you from omitting fields in two different objects of the same kind either. So how does this play with Java's strong static typing compared to a dynamic language like Ruby?

Dude, Where's My ORM?

Developers used to relational databases have been using ORMs for a long time to abstract away the dirty details of dealing with the database and to separate the domain from the persistence. So, are any of the three frameworks an ORM for MongoDB? Well, no. But a more interesting question is: do I need one? MongoDB is a very different beast from Oracle/MySQL/MS SQL/PostgreSQL, and different beasts need different handling. You need an abstraction model for MongoDB, not necessarily the ORM abstraction model.

MongoDB Java Driver

The 10Gen Java driver basically gives you two options when storing data in MongoDB: either subclass BasicDBObject, which is the general database object, or implement the DBObject interface. Both approaches give you a Map-like interface, familiar from the standard Java library, that lets you put and get keys and values. The drawback is that there are a lot of methods in this interface, and it soon becomes tedious to implement them all for different domain objects. These are the DBObject interface methods:

    public Object put(String s, Object o) { }

    public void putAll(DBObject dbObject) { }

    public void putAll(Map map) { }

    public Object get(String s) { }

    public Map toMap() { }

    public Object removeField(String s) { }

    public boolean containsKey(String s) { }

    public boolean containsField(String s) { }

    public Set keySet() { }

    public void markAsPartialObject() { }

    public boolean isPartialObject() {}

So I'd prefer to subclass BasicDBObject, or wrap it, instead.

So what does code using it look like? Assume the very minimal domain objects Person and Address:

package no.kh.mongo.direct;
import com.mongodb.BasicDBObject;

public class Person extends BasicDBObject {

    public Person () { }

    public Person(String fullName, Address newAddress) {
        put("name", fullName);
        put("address", newAddress);

    }

}

package no.kh.mongo.direct;
import com.mongodb.BasicDBObject;

public class Address extends BasicDBObject {

   public Address() {
  }

  public Address(String streetName, String postalCode, String place, String country) {
    put("street", streetName);
    put("postalcode", postalCode);
    put("place", place);
    put("country", country);
  }

  public String place(){
    return (String) get("place");
  }

}

Using these classes looks like this. The examples are written as JUnit tests, but they are really functional/integration tests since they touch the database:

package no.kh.mongo.direct;

import com.mongodb.BasicDBObject;
import com.mongodb.DB;
import com.mongodb.DBCollection;
import com.mongodb.Mongo;
import org.junit.After;
import org.junit.Before;
import org.junit.Test;

import java.net.UnknownHostException;
import static junit.framework.Assert.assertEquals;
import static junit.framework.Assert.assertNotNull;
import static org.junit.Assert.assertNull;

public class PersonPersistance {

    DB testDb;
    DBCollection persons;

    @Before
    public void setUp() throws UnknownHostException {
       Mongo m = new Mongo( "127.0.0.1" , 27017 );

       testDb = m.getDB( "test" );
       persons = testDb.getCollection("persons");
       persons.setObjectClass(Person.class);
    }

    @Test
    public void insertPersonSavesPersonToDatabase () {

        Person test = new Person("Knut Haugen",
                                 new Address("Josefines gate", "0401",
                                             "Oslo", "Norge"));
        persons.insert(test);
        assertNotNull(test.get("_id"));

    }


    @Test
    public void personWithDocumentNotMatchingObject(){
      BasicDBObject tmp = new BasicDBObject();
      tmp.append("foo", "value");
      persons.insert(tmp);

      Person test2 = (Person) persons.findOne();
      assertEquals(test2.get("foo"), "value");
      assertNull(test2.get("name"));
    }

    @Test
    public void retrievePersonFromDatabase(){
        Person test = new Person("Knut Haugen",
                                  new Address("Josefines gate", "0401",
                                              "oslo", "Norge"));
        persons.insert(test);

        Person test2 = (Person) persons.findOne();
        assertEquals(test.get("name"), test2.get("name"));
        assertEquals( ((Address) test.get("address")).place(), "oslo");
    }

    @After
    public void tearDown(){
        persons.drop();
    }

}

Notice the call to setObjectClass() on the collection object to get "type-safe" operations, even though you still have to cast the return value from findOne() to get your precious object back. Other than that it is pretty straightforward: call insert() on the collection to insert an object, and findOne() or any other query method to retrieve it. But the bottom line is that this driver really begs for some more abstraction once you're beyond toy samples. The upside is more direct access to the data, which is often the way it's done in dynamic languages. But does that suit Java? I'm not sure, but I tend to think not. And how about those nulls? Well, if an object doesn't store a value in a field, the field is null when returned from the database, as you can see from the test personWithDocumentNotMatchingObject.
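One way to get a bit more abstraction is to wrap the raw document instead of subclassing it. What follows is only a minimal sketch: a plain java.util.Map stands in for BasicDBObject so that it runs without the driver, and PersonDocument with its name() and raw() methods are hypothetical names invented for the illustration, not part of any library.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical wrapper; a plain Map stands in for the driver's BasicDBObject.
class PersonDocument {
    private final Map<String, Object> fields;

    PersonDocument(Map<String, Object> fields) {
        this.fields = fields;
    }

    // A missing or non-string field becomes an empty Optional
    // instead of a null or a ClassCastException at the call site.
    Optional<String> name() {
        Object value = fields.get("name");
        return value instanceof String ? Optional.of((String) value) : Optional.empty();
    }

    // Keys the wrapper does not know about are still reachable.
    Object raw(String key) {
        return fields.get(key);
    }

    public static void main(String[] args) {
        Map<String, Object> stored = new HashMap<>();
        stored.put("foo", "value"); // a document that is not really a person
        PersonDocument person = new PersonDocument(stored);
        System.out.println(person.name().orElse("<no name>"));
        System.out.println(person.raw("foo"));
    }
}
```

The cast and the null check now live in one place, which is roughly the kind of help the bare driver leaves to you.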

Morphia

Morphia is (according to the blurb) a lightweight, type-safe mapper for MongoDB providing DAO and Datastore abstractions. It takes the annotation approach. Here are our Person and Address classes for Morphia:

package no.kh.mongo.morphia;
import com.google.code.morphia.annotations.Embedded;
import com.google.code.morphia.annotations.Entity;
import com.google.code.morphia.annotations.Id;

@Entity
public class Person {

    @Id
    private String id;
    private String name;

    @Embedded
    private Address address;

  public Person() {
        address = new Address("", "", "", "");
    }

    public Person(Address newAddress){
        address = newAddress;
    }

    public String getId() {
        return id;
    }


    public String getName() {
        return name;
    }

    public void setId(String newId) {
        id = newId;
    }

    public void setName(String newName) {
        name = newName;
    }

    public void setAddress(Address newAddress) {
        address = newAddress;
    }

}

package no.kh.mongo.morphia;
import com.google.code.morphia.annotations.Embedded;

@Embedded
public class Address {

  private String streetName;
  private String postalCode;
  private String place;
  private String country;

  public Address() {
  }

  public Address(String streetName, String postalCode, String place, String country) {
    this.streetName = streetName;
    this.postalCode = postalCode;
    this.place = place;
    this.country = country;
  }

}

Here Address is annotated as an embedded object. This is the same approach taken by the Ruby MongoMapper with its MongoMapper::Document and MongoMapper::EmbeddedDocument classes. The tests look like this:

package no.kh.mongo.morphia;

import com.google.code.morphia.Morphia;
import com.mongodb.DB;
import com.mongodb.DBCollection;
import com.mongodb.Mongo;
import org.junit.After;
import org.junit.Before;
import org.junit.Test;

import java.net.UnknownHostException;
import static junit.framework.Assert.assertNotNull;
import static junit.framework.Assert.assertNull;

public class PersistanceThroughMorphia {

    Morphia morph;
    Mongo mongo;
    DBCollection persons;

    @Before
    public void setUp() throws UnknownHostException {
       morph = new Morphia();
       mongo = new Mongo("127.0.0.1", 27017);
       // This is where we map Persons and addresses
       // But shouldn't the annotation be able to handle that?
       morph.map(Person.class).map(Address.class);
       DB testDb = mongo.getDB( "test" );
       persons = testDb.getCollection("persons");
    }

    @Test
    public void storePersonThroughMorphiaMapping () {

        Person test = new Person(new Address("Josefines gate", "0401", "Oslo", "Norge") );
        test.setName("Knut Haugen");

        persons.save(morph.toDBObject(test));
        Person test2 = morph.fromDBObject(Person.class, persons.findOne());
        assertNotNull(test2.getId());

    }


    @Test
    public void personMissingField () {

        Person test = new Person(new Address("Josefines gate", "0401", "Oslo", "Norge"));

        persons.save(morph.toDBObject(test));
        Person test2 = morph.fromDBObject(Person.class, persons.findOne());
        assertNull(test2.getName());

    }


    @After
    public void tearDown(){
        persons.drop();
    }

}

It seems like something out of the department of redundancy department that you annotate the document and the embedded document and then have to specify the relationship between them in a method call. My first reaction was that it would have been cleaner if the relationship could be specified in the annotation too. The calls to morph.toDBObject() and morph.fromDBObject() break up an otherwise elegant solution. They also introduce some more code and basically wrap up a cast. That could have been a lot cleaner.

Mungbean

Mungbean is our last contestant and represents a third way of doing the mapping. It wraps up everything you need for accessing MongoDB with generic collection classes and introduces a DSL for queries and the like. There's also a Clojure version if that is more your poison. The domain classes with mungbean:

package no.kh.mongo.mungbean;

public class Address {

  private String streetName;
  private String postalCode;
  private String place;
  private String country;

  public Address() {
  }

  public Address(String streetName, String postalCode, String place, String country) {
    this.streetName = streetName;
    this.postalCode = postalCode;
    this.place = place;
    this.country = country;
  }

  public String place(){
    return place;
  }

}

Nothing special here, no imports and no annotations. It's (almost) the same with Person, except for the import and the field of type ObjectId, which handles the object ids generated by Mongo on insert:

package no.kh.mongo.mungbean;
import mungbean.ObjectId;

public class Person {
  private String name;
  private ObjectId _id = new ObjectId();
  private String address;


  public Person(String name, String address){
    this.name = name;
    this.address = address;
  }

  public String getAddress(){
    return address;
  }

  public ObjectId getId(){
    return _id;
  }
}

Using it, on the other hand, produces very different-looking code from the other two, thanks to the wrapper classes for MongoDB connections, the generic collections and the query DSL:

package no.kh.mongo.mungbean;

import mungbean.DBCollection;
import mungbean.Database;
import mungbean.Mungbean;
import mungbean.Settings;
import mungbean.query.Query;
import org.junit.After;
import org.junit.Before;
import org.junit.Test;

import static junit.framework.Assert.assertEquals;
import static junit.framework.Assert.assertNotNull;

public class mungbeanPersistence {

  private Database test;
  private Person testPerson;
  private DBCollection persons;


  @Before
  public void create() {
    test = new Mungbean(new Settings(), "localhost", 27017).openDatabase("test");
    testPerson = new Person("Knut Haugen", "josefines gate");
    persons = test.openCollection("persons", Person.class);

  }

  @Test
  public void storePerson(){
    persons.save(testPerson);
    assertEquals(persons.query(new Query().field("name").is("Knut Haugen")).size(), 1);

  }


  @Test
  public void personGetGeneratedIdAndAddress(){
    persons.save(testPerson);
    Person found = persons.query(new Query().field("name").is("Knut Haugen")).get(0);
    assertNotNull(found.getId());
    assertEquals(found.getAddress(), "josefines gate");

  }


  @After
  public void destroy() {
    test.dbAdmin().dropDatabase();
  }

}

The syntax is nice but perhaps a tad verbose for my taste. I find the abstraction quite good, at least better than the other two. I also like the fact that there is almost no trace of the library in the domain classes, and as such it is by far the best of the three.
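To show how little machinery a fluent query of that shape needs, here is a toy version that filters an in-memory list of maps. It is only a sketch that mimics the field(...).is(...) call shape seen in the tests above; ToyQuery, FieldCondition and apply() are made-up names and say nothing about how mungbean itself is implemented.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Objects;

// Toy fluent query over in-memory maps; not mungbean's code.
class ToyQuery {
    private final Map<String, Object> criteria = new LinkedHashMap<>();

    // field(...) returns a small builder so that is(...) can hand the query back.
    FieldCondition field(String name) {
        return new FieldCondition(name);
    }

    class FieldCondition {
        private final String name;

        FieldCondition(String name) {
            this.name = name;
        }

        ToyQuery is(Object expected) {
            criteria.put(name, expected);
            return ToyQuery.this;
        }
    }

    // Keeps the documents whose fields equal every recorded criterion.
    List<Map<String, Object>> apply(List<Map<String, Object>> documents) {
        List<Map<String, Object>> hits = new ArrayList<>();
        for (Map<String, Object> doc : documents) {
            boolean matches = true;
            for (Map.Entry<String, Object> criterion : criteria.entrySet()) {
                if (!Objects.equals(doc.get(criterion.getKey()), criterion.getValue())) {
                    matches = false;
                }
            }
            if (matches) {
                hits.add(doc);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        Map<String, Object> knut = new LinkedHashMap<>();
        knut.put("name", "Knut Haugen");
        knut.put("address", "josefines gate");
        List<Map<String, Object>> persons = new ArrayList<>();
        persons.add(knut);

        ToyQuery query = new ToyQuery().field("name").is("Knut Haugen");
        System.out.println(query.apply(persons).size());
    }
}
```

The builder returning the enclosing query is what makes the chain read like a sentence; the price is the extra inner class, which is the verbosity trade-off mentioned above.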

The Verdict

It's Mungbean, by a nose! Mainly because of the cleaner domain objects and the DSL. There is more code involved, but I found it to be more elegant than the other approaches. I want to note that neither morphia nor mungbean is particularly mature or finished by any definition of the word, and that has to be taken into consideration when using them. And it may be that a wordy, statically typed language like Java has a bit of friction with a very dynamic database backend like MongoDB. I don't know, but I'll be looking into Ruby drivers in the future and we'll see.

I haven't looked into the different code generation frameworks for mongo, namely Sculptor and GuicyData which take a different approach to accessing MongoDB. That's for another time and another post.

A Book Overflow


I am a habitual book buyer and I kinda like the paper kind which is evident if you see my office shelf or my living room. So once again I bought more books than I can read in the time I have left from living my normal life.

Growing software

Growing Object-Oriented Software, Guided by Tests by Steve Freeman and Nat Pryce. This is a book I've only heard positive things about, and the TOC seems to deliver on that. It offers a lot of detail on the how and, more importantly, the why of TDD/BDD and testing in relation to development. I am so looking forward to this one.

Being Productive

The Productive Programmer by Neal Ford. I'm a productivity junkie just like Neal Ford and some of the tricks (and the first part of the book is a long list of tools and tips) are already in my tool belt. But there's always more to learn, all in the spirit of kaizen. He throws in some good advice on software design also, all in the spirit of being productive.

The Art of Agile

The Art of Agile Software Development by James Shore and Chromatic. Another one I have heard a lot about and all positive. By the first look it seems to cover the state of agile software development today and touches on some lean and TPS-inspired concepts too. And you can't really navigate around that in today's agile landscape.

Effective Java

Effective Java, second edition, by Joshua Bloch. I'll be writing more Java in the coming months as I'll be changing jobs, so this is in order to brush up on the old Java skills. Also a book that is widely regarded as a must.

The Legacy You Inherit

Working Effectively with Legacy Code by Michael Feathers. How do you test legacy apps? Highly recommended by a colleague who recently inherited a largish legacy codebase with only a tauntingly small amount of tests. What's the definition of legacy code? Code without tests :-) Even if you write it today...

Cooking with Mongodb and Solr


I've recently changed storage backend and search backend for a small web project and it has been a real blast. What follows is an overview of the reasons for the change, what the change actually was and the relative amount of joy involved.

The Old System

The system was built using PHP/Apache2 and MySQL. It covers a very simple domain with only a single object (Person, sort of) and simple data records for several years.

  • No writing through the web interface, only search and lookup.
  • Batch updates with between 4 and 5 million records in each update, 4-5 years history, so total 16-20 million records
  • "Search" through very simple text indexes on the relevant columns in MySQL.
  • InnoDB backend
  • File is transformed to LOAD DATA INFILE format and fed into MySQL with manual delete of the set for that year beforehand.

The Pain Points

  • A batch update with 4 million rows takes 4-5 hours on average (on the prod machine: dual core 3GHz, 4GB RAM, roughly 1GB set aside for MySQL), with index updates being the main culprit. This could be done as a check-the-record-and-update-if-changed, but that would also require a lot of queries and updates to the database.
  • Queries with wildcards are dead slow when hitting outside the query cache.
  • Not really advanced search as such.

The Plan

  • Replacing MySQL with MongoDB, as there are no actual relations needed and everything fits in one collection of documents
  • Replacing MySQL indexes with Apache Solr for consolidating search across several other systems. And speed.
  • Use the PECL extensions for both MongoDB and Solr.

MongoDB is a document database, written in C++, that stores documents in a binary JSON form (BSON).

Solr is built on top of the Java version of Lucene, does indexing over HTTP and runs happily in Tomcat, Jetty or most other servlet engines.

The Implementation

Names of domain objects are changed to protect the guilty - and the domain.

Both Solr and MongoDB are fast and easy to work with. There is very little in your way when it comes to just doing what you want and solving the problems in a straight-forward manner. Some examples:

  • The MongoDB "upsert" feature saves you some round-trips to the database. Normally, if you want to update an existing record if it's there or insert it if it's not, you need to query first and then insert a new one when nothing is found. If you just want to update/add data to part of the object, it complicates the matter further. With MongoDB you can call update with a special parameter in the data array and the rest is handled server-side.
  • Solr has the default behavior of updating instead of complaining when you send a document with a primary key field that already exists in the index.
  • Solr does everything over HTTP and you get easy-to-read XML messages back as responses. This is also handy when you need to debug what data is sent over the wire.

I created a very thin layer between MongoDB and the domain, with an insert() method (which, as we will see, also handles updates) that takes a DataRecord (read from the file) as an argument.

    
public function insert(DataRecord $record) {
       $this->collection->update(array('id' => $record->id() ), 
           array('$set' => array(
               'list.' . $record->year() => $record->getDetails()->toArray())), 
           array('upsert' => true));
}

This will insert a document in the collection if it's not there. When it is there, it will add an element to the (nested) 'list' element with the value of $record->year() as key. The value will be the value of $record->getDetails(). The toArray() call is there because the mongo driver expects arrays to store. The super cool part is that if the key exists, it will just be updated with the data from the details object. Read more on the details of the MongoDB update options.
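The upsert behaviour described above can be modelled without a database. The following Java sketch is an in-memory simulation only, not driver code: UpsertSketch, its methods and the "income" field are invented for the illustration. It shows the three cases in turn: the document is created when missing, a new year is added to the nested list, and an existing year is overwritten.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

// In-memory model of "update with upsert and $set on list.<year>"; no database involved.
class UpsertSketch {
    // Documents keyed by id; each holds a nested "list" map keyed by year.
    private final Map<String, Map<String, Map<String, Object>>> collection = new HashMap<>();

    // Creates the document if it is missing, then sets (or overwrites) the entry for the year.
    void upsert(String id, String year, Map<String, Object> details) {
        collection
            .computeIfAbsent(id, key -> new LinkedHashMap<>())
            .put(year, new LinkedHashMap<>(details)); // copy, so later changes to details do not leak in
    }

    Map<String, Map<String, Object>> listFor(String id) {
        return collection.get(id);
    }

    int documentCount() {
        return collection.size();
    }

    public static void main(String[] args) {
        UpsertSketch persons = new UpsertSketch();
        Map<String, Object> details = new LinkedHashMap<>();
        details.put("income", 100);
        persons.upsert("42", "2008", details); // document missing: inserted
        persons.upsert("42", "2009", details); // document present: a year is added
        details.put("income", 120);
        persons.upsert("42", "2009", details); // year present: overwritten
        System.out.println(persons.listFor("42"));
    }
}
```

In the real system all of this happens server-side in a single call, which is where the saved round-trips come from.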

For indexing the document in Solr, I added a similarly thin wrapper for the SolrClient object with an index() method. This method takes a SolrInputDocument as an argument. I chose to delegate to the domain object the decision of what should be indexed, and thus the creation of the index document object, but the responsibilities could easily have been switched around. The finer point is that when indexing you have to read the complete object from the database in order to get all the data. The DataRecord that was read from file and stored with upsert may just have been part of the picture. Reading back the updated object incurs a performance penalty that wasn't present in the old system. It is also a consequence of structuring the data as a collection of person objects in MongoDB, rather than a long list of records as in the old version. This maps better to the domain.

    
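public function index(SolrInputDocument $document) {
    $response = $this->solr->addDocument($document);

    // Commit once per $commitInterval documents instead of on every add.
    // Pre-increment so the commit happens on exactly the Nth document.
    if (++$this->pendingDocuments >= $this->commitInterval) {
        $this->commit();
        $this->pendingDocuments = 0;
    }

    return $response->getResponse();
}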

Committing on every Solr document makes indexing very slow. Small tests indicated 3 minutes for indexing 5000 documents with a commit on every submit, and 15 seconds with one commit every 2000 documents (and at the end, of course). The code above commits every $commitInterval documents (default 10000) to speed things up a bit. Note also that the commit() and optimize() calls for Solr may time out, as they can take a long time to finish. It is not Solr itself that times out, but rather the Java application server you're running it in. When this happens an exception is thrown in the PHP driver, and it has to be caught.
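The batching idea, including the final commit "at the end", can be sketched on its own. This is a minimal Java illustration with the actual Solr calls left out; BatchingIndexer and its methods are hypothetical names, and the commit is represented by a counter.

```java
// Counts documents and commits once per interval; the actual Solr calls are left out.
class BatchingIndexer {
    private final int commitInterval;
    private int pendingDocuments = 0;
    private int commits = 0;

    BatchingIndexer(int commitInterval) {
        if (commitInterval < 1) {
            throw new IllegalArgumentException("commitInterval must be at least 1");
        }
        this.commitInterval = commitInterval;
    }

    // Stand-in for sending one document to the index.
    void index(String document) {
        pendingDocuments++;
        if (pendingDocuments >= commitInterval) {
            commit();
        }
    }

    // Call once after the last document so a partial batch is not left uncommitted.
    void finish() {
        if (pendingDocuments > 0) {
            commit();
        }
    }

    private void commit() {
        commits++; // a real implementation would issue the commit here
        pendingDocuments = 0;
    }

    int commits() {
        return commits;
    }

    public static void main(String[] args) {
        BatchingIndexer indexer = new BatchingIndexer(2000);
        for (int i = 0; i < 5000; i++) {
            indexer.index("doc-" + i);
        }
        indexer.finish();
        System.out.println(indexer.commits()); // two full batches plus one final partial batch: 3
    }
}
```

With 5000 documents and an interval of 2000, that is three commits instead of 5000, which is in line with the difference seen in the small tests.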

The Results

The platform is Ubuntu 9.10 server edition, 64 bit, and all timings from the shell are done with time on Linux. MySQL times are the times reported by MySQL itself.

Time for batch insert/update

  • Commit interval for solr: 10000
  • update-logging for Solr turned off (default is very verbose)
  • nssize=1024 for MongoDB

| System | Operation | Time |
|---|---|---|
| MySQL | initial import | 16m 5.7s |
| MySQL | update (delete+insert) | 3h 24m 56s (delete) + 40m 56s (insert) |
| MongoDB (no indexing) | initial import | 12m 16s real, 10m 41s user |
| MongoDB+Solr | initial import | 78m 39s real, 25m 28s user |
| MongoDB (no indexing) | update | 13m 22s real, 11m 8s user |
| MongoDB+Solr | update | 69m 8s real, 19m 25s user |


Space usage

No pre-allocation was done for MongoDB, so it created the data files as needed. This means that the last file was created at 2GB and may very well be almost empty. Mongo creates files in a doubling fashion from 64MB to 2GB, like this: 64MB, 128MB, 256MB, 512MB, 1GB, 2GB.
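The doubling scheme makes it easy to estimate how much disk a given amount of data claims. The sketch below is a rough Java calculation under stated assumptions: files double from 64MB up to the 2GB cap, files after the cap are assumed to stay at 2GB, and it ignores the extra file MongoDB may create ahead of need as well as any per-record overhead, so it will not reproduce the measured numbers exactly. DataFileSizes and its methods are names made up for the example.

```java
import java.util.ArrayList;
import java.util.List;

// Rough model of data file allocation: sizes double from 64 MB and are capped at 2 GB.
class DataFileSizes {
    static final int FIRST_FILE_MB = 64;
    static final int MAX_FILE_MB = 2048; // assumption: every file after the cap is also 2 GB

    // Sizes (in MB) of the files needed before their combined capacity covers dataMb.
    static List<Integer> filesFor(long dataMb) {
        List<Integer> files = new ArrayList<>();
        long capacity = 0;
        int next = FIRST_FILE_MB;
        while (capacity < dataMb) {
            files.add(next);
            capacity += next;
            next = Math.min(next * 2, MAX_FILE_MB);
        }
        return files;
    }

    static long totalMb(List<Integer> files) {
        long total = 0;
        for (int size : files) {
            total += size;
        }
        return total;
    }

    public static void main(String[] args) {
        List<Integer> files = filesFor(1294);
        System.out.println(files + " = " + totalMb(files) + " MB on disk");
    }
}
```

The point of the exercise is that the gap between data stored and disk claimed grows with each new file, which is why file sizes alone overstate real usage.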

| System | Index | Data |
|---|---|---|
| MySQL initial import | 717 MB | 516 MB |
| MySQL initial import + 1 additional dataset | 1.4 GB | 1 GB |
| Mongo+Solr initial import | 744 MB | 1294 MB (3GB of datafiles) |
| Mongo+Solr initial + 1 additional dataset | 1538 MB | 4348 MB |


CPU usage

When importing to MySQL it more or less maxes out the CPU for the entire import. When doing the import with a PHP script feeding data to MongoDB and Solr, the component using the most CPU is the PHP script splitting the file, creating objects and calling the MongoDB and Solr APIs. This takes up 35-40% CPU and around 10 MB of RAM. MongoDB is using around 10% CPU (with 1.1GB of RAM) and Solr (Tomcat, that is) is spending 30% and around 300MB of RAM. Disks on these machines are virtualized through VMware: 15K SAS disks on an IBM S3200 storage array with dedicated GB LAN between blade center and storage. Disk IO seems to be the bottleneck here.

Space usage

Solr and MongoDB seem to be a bit more sloppy with their space usage than MySQL, but I guess this is the price to pay for some of the other benefits you get. See the Mongo FAQ on data files for info on how to see real space usage for databases and not just file sizes. In return for the extra storage space, you get much better (and faster) search capabilities and faster (although only slightly) query times against the database.

Naming Classes and Interfaces


The naming of interfaces and corresponding implementation classes in languages that use them (of which Java and C# are the most used, I guess) is a subject of sometimes heated debate. A growing distaste for the "best practice" (at least on Microsoft's part) of using a capital "I" on interfaces prompted this rundown and a round of pro et contra. My ultimate goal is readability of code, but a slight predicament emerges when considering a team: if the majority of team members are used to one particular convention, is that convention, in that particular context, more readable for that team than another approach, all other things being equal? Even though the name of the interface reads more like natural English? The jury's still out on this one.

The different patterns I've found:
  1. Interfaces are prefixed with a capital I, where IReportGenerator is the interface and e.g. ReportGenerator is the implementation. Some argue for making the interface name read "I generate reports" and keep the I prefix, thus: IGenerateReports instead.
  2. Ditching the capital I and appending "imp" or "impl" to the implementation. Like so: UserManager and UserManagerImpl. Some use uppercase or lowercase "c" as prefix for concrete classes too but I find that horribly ugly, to be frank.
  3. Naming an interface for its role and naming the implementing class(es) for what distinguishes them from (possibly) other implementations. E.g. UserManager for the interface and DatabaseUserManager or LDAPUserManager for the concrete implementations.
  4. Naming abstract classes with the word "Abstract" as a prefix to make a distinction between abstract and concrete classes. Possibly combined with other conventions for interfaces as well. I would argue the context often reveals an abstract class to the reader.

C# vs Java

In C#, extending an abstract base class and implementing an interface use the same syntax, and which one it is isn't always easy to discern. A naming convention makes this a bit easier:

public class DatabaseUserManager : UserManager {

}

//versus

public class DatabaseUserManager : IUserManager {

}

In Java the syntax is different:

public class LDAPUserManager implements UserManager {

}

//versus

public class LDAPUserManager extends UserManager {

}

A prefix is therefore less needed for readability in Java than in C#.

Good names trump conventions?

Clean Code by Robert C. Martin argues for good names: good names for variables, good names for methods and good names for interfaces and classes. The name should reveal intention. Choosing good names takes time but saves more than it takes. (Clean Code, page 18). The book also mentions not using the "I" prefix and prefers to encode the implementation, if at all.

The question is: if the names are as good as they can be, do we need a prefix or suffix or other kinds of encoding to indicate the type? In my mind we don't. I find that the I hurts readability and the "impl" certainly doesn't look good to me. So how to choose good names?

  1. Classes should have noun names or noun phrases, e.g. Customer, Policy or StreetAddress.
  2. Use names from the problem domain. Say you're dealing with a customer or client object: in a medical system, should it be called Patient? Or, if it's dealing with social benefits, maybe BenefitsReceiver is a better name? The domain influences the choice of names. In stock-broking, perhaps FuturesHolder is a name candidate for a special kind of customer? (Note: I'm not very familiar with either of these domains, so the examples may be somewhat off. You get the point: know your domain and find good names from it.)
  3. Use names from the solution domain. If you use e.g. the Visitor pattern, make it a part of the name(s) so other programmers see it: InsuranceCustomerVisitor is better than CustomerTraverser.

So unless some team coding standard absolutely makes me (and that is, I'm afraid, likely), I will not in the future prefix my interfaces with I, nor adorn my implementations with c, impl or any other stuff. Just meaningful names from the domain. Easier to write and easier to read.
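As a small illustration of the role-based convention, here is a sketch using the UserManager, DatabaseUserManager and LDAPUserManager names from the list of patterns earlier in the post. The exists() method and the in-memory sets are assumptions made purely so the example runs; no real database or directory is involved, and the case-insensitive lookup is only there to give the two implementations a visible difference.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// The interface is named for its role: no "I" prefix.
interface UserManager {
    boolean exists(String username);
}

// Implementations are named for what sets them apart: no "Impl" suffix.
class DatabaseUserManager implements UserManager {
    private final Set<String> rows = new HashSet<>(); // stand-in for a users table

    DatabaseUserManager(String... usernames) {
        for (String username : usernames) {
            rows.add(username);
        }
    }

    @Override
    public boolean exists(String username) {
        return rows.contains(username);
    }
}

class LDAPUserManager implements UserManager {
    private final Set<String> entries = new HashSet<>(); // stand-in for directory entries

    LDAPUserManager(String... usernames) {
        for (String username : usernames) {
            entries.add(username.toLowerCase(Locale.ROOT));
        }
    }

    // Illustrative difference only: this lookup ignores case.
    @Override
    public boolean exists(String username) {
        return entries.contains(username.toLowerCase(Locale.ROOT));
    }
}

class NamingSketch {
    public static void main(String[] args) {
        UserManager users = new LDAPUserManager("Knut");
        System.out.println(users.exists("knut"));
    }
}
```

The calling code only ever mentions UserManager, and the class names alone tell the reader which backing store is in play.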

Lastly, I posted this question on twitter and @borud made the comment: «the only useful information conveyed by an I-prefix to interface types is information about the author of the code.» That's a suitable end note I think :-)

Analysis of the NoSQL Landscape


This is an overview of the current state of the NoSQL landscape. It's getting large and somewhat unwieldy and there may be projects which have landed in the wrong category here. I have included object databases in the mix too. Seriously folks, some of you need to pick more google friendly project names. Here are the types and the players in each category. Background data is available in this Google docs spreadsheet.

Key-value stores

Column-oriented stores

Google BigTable, HBase, Cassandra, Hypertable, OpenNeptune, KDI, QBase

Document Databases

CouchDB, MongoDB, Apache Jackrabbit, ThruDB, CloudKit, Persevere, Lotus Domino, Riak, Terrastore

Object Databases

ZODB, db4o, Versant, GemStone/S, Progress ObjectStore

Graph Databases

Neo4j, VertexDB, InfoGrid, Sones, Filament, AllegroGraph, HyperGraphDB

Projects by Type

If we graph all the projects by type we get this view:

projects_by_type(2).png

There are more key-value stores than the other types combined. Why is this? Are key-value stores that much easier to implement? I would at least guess that this is the first area where we will see projects being abandoned and projects converging. The important thing is the features users want, not the projects themselves. There must be a lot of overlap here, with many projects that are only slightly different or almost identical. On the other hand, a lot of knowledge of these kinds of systems is spread around and there is a good chance of innovation. The combination of the best technical features and API features will hopefully bubble to the top and stay on.

License Breakdown

If we graph the projects in the list above by license chosen we get the following:

projects_by_license(2).png

This shows a clear dominance of open source licenses over commercial ones. Some products have chosen a dual licensing model (Neo4j and BerkeleyDB). Quite a few are unknown, which really means they are unable to communicate their license in an understandable manner, or the project wasn't really found on the web at all (see the point about Google-friendly names).

Language Breakdown

Graphing the projects by implementation language we get the following:

projects_by_language(4).png

Java takes the lead, with C and C++ following close behind. But is the prevalence of Java a result of the amount of Java knowledge spread around and the heavy use of Java in open source, or is Java more suited than other languages to implementing these kinds of systems? It is interesting to note the number of Erlang implementations, and also the fact that quite a few of the projects have implementations in more than one language. The ones with more than one implementation are mostly commercial ones.

Some ending questions:

  • Have we reached the maximum number of projects that are sustainable, or will the ecosystem continue to grow even more?
  • Will more of them go commercial? Or will more choose the model with support as the income, like 10Gen has with MongoDB?
  • How does one choose the right one to use for a given project? This is an increasingly hard problem, at least for key-value stores.


A Brief History of NoSQL


NoSQL is getting a lot of traction and hype these days, but in reality it's not that new a thing. I thought I'd trace the roots of NoSQL and see what I'd find. The name "NoSQL" was in fact first used by Carlo Strozzi in 1998 as the name of a file-based database he was developing. Ironically it's a relational database, just one without a SQL interface. As such it is not actually a part of the whole NoSQL movement we see today. The term resurfaced in 2009 when Eric Evans used it to name the current surge in non-relational databases. It seems like the name has stuck, for better or for worse. Note that not all projects are included in this post. See the post on analyzing the NoSQL landscape for a more complete listing.

1960s

  • MultiValue (aka PICK) databases are developed at TRW in 1965.
  • According to a comment from Scott Jones, M[umps] is developed at Mass General Hospital in 1966. It is a programming language that incorporates a hierarchical database with B+ tree storage.
  • IBM IMS, a hierarchical database, is developed with Rockwell and Caterpillar for the Apollo space program in 1966.

1970s

  • InterSystems develops the ISM product family, succeeded by the Open M product, all M[umps] implementations. See the comment from Scott Jones below.
  • M[umps] is approved as an ANSI standard language in 1977.
  • In 1979 Ken Thompson creates DBM, which is released by AT&T. At its core it is a file-based hash.

1980s

Several successors to DBM spring into life.

  • TDBM supports atomic transactions.
  • NDBM is the Berkeley version of DBM and supports having multiple databases open at the same time.
  • SDBM is another clone of DBM, created mainly for licensing reasons.
  • GT.M is the first version of a key-value store with a focus on high-performance transaction processing. It is open sourced in 2000.
  • BerkeleyDB is created at Berkeley in the transition from 4.3BSD to 4.4BSD. Sleepycat Software is started as a company in 1996 when Netscape needs new features for BerkeleyDB. It is later acquired by Oracle, which still sells and maintains BerkeleyDB.
  • Lotus Notes, or rather the server part, Lotus Domino, which really is a document database, has its initial release in 1989 and is now sold by IBM. It has evolved a lot from the early versions and is now a full office and collaboration suite.

1990s

  • GDBM is the GNU project clone of DBM.
  • Mnesia is developed by Ericsson as a soft real-time database to be used in telecom. It is relational in nature but uses Erlang itself, rather than SQL, as its query language.
  • InterSystems Caché is launched in 1997 and is a hybrid, so-called post-relational database. It has object interfaces, SQL, PICK/MultiValue and direct manipulation of data structures. It is an M[umps] implementation. See Scott Jones's comment below for more on the history of InterSystems.
  • Metakit is started in 1997 and is probably the first document-oriented database. It supports smaller datasets than the ones in vogue nowadays.

2000-2005

This is where the NoSQL train really picks up some momentum and a lot starts to happen.

  • Graph database Neo4j is started in 2000.
  • db4o, an object database for Java and .NET, is started in 2000.
  • QDBM is a re-implementation of DBM with better performance, by Mikio Hirabayashi.
  • Memcached is started in 2003 by Danga to power LiveJournal. Memcached isn't really a database since it is memory-only, but there is soon a version with file storage called memcachedb.
  • The Infogrid graph database is started as closed source in 2005 and open sourced in 2008.
  • CouchDB is started in 2005 and provides a document database inspired by Lotus Notes. The project moves to the Apache Foundation in 2008.
  • Google BigTable is started in 2004 and the research paper is released in 2006.

2006-2010

  • JackRabbit is started in 2006 as an implementation of JSR 170 and 283.
  • Tokyo Cabinet, a successor to QDBM by Mikio Hirabayashi, is started in 2006.
  • The research paper on Amazon Dynamo is released in 2007.
  • The document database MongoDB is started in 2007 as part of an open source cloud computing stack and has its first standalone release in 2009.
  • Facebook open sources the Cassandra project in 2008.
  • Project Voldemort, a replicated database with no single point of failure, is started in 2008.
  • Dynomite is a Dynamo clone written in Erlang.
  • Terrastore, a scalable elastic document store, is started in 2009.
  • Redis, a persistent key-value store, is started in 2009.
  • Riak, another Dynamo-inspired database, is started in 2009.
  • HBase is a BigTable clone for the Hadoop project, while Hypertable is another BigTable-type database, also from 2009.
  • Vertexdb, another graph database, is started in 2009.
  • Eric Evans of Rackspace, a committer on the Cassandra project, introduces the term "NoSQL", often used in the sense of "Not only SQL", to describe the surge of new projects and products.

(Some of these dates need to be taken with a small pinch of salt, as finding out exactly when the projects started can be a bit difficult. Also, not all projects started in the last few years have been included.)

In 2009 and 2010 we also saw the arrival of NoSQL conferences, like NoSQL Live in Boston in 2010 and the upcoming NoSQL EU in London in April 2010. Last year we also saw the NoSQL East conference in Atlanta.

The Employee View of Agile Interviews

| 5 Comments | No TrackBacks

I found two interesting blog posts via Twitter, http://blog.thirstybear.co.uk/2010/02/don-just-interview-new-developers_03.html and http://www.davenicolette.net/agile/index.blog/1947137/the-new-interview/, both covering a new and perhaps more agile (hey! buzzword!) way of doing interviews. The key point is to audition the candidate rather than just ask questions. Doing actual pair programming and seeing how people actually work to solve a concrete programming task is much more valuable than giving them a random task, asking if they remember some obscure part of a spec everybody looks up anyway or, not least, seeing what certifications they have. Would you like to work for a company that puts more faith in your certifications than in your current abilities?

Both posts cover this from the viewpoint of the potential employer, but I can see just as much value for the employee. When interviewing for a new job you rarely get an accurate view of what the company is really like through the interview. The reality only dawns on you a couple of months into the job. My (somewhat limited) experience with interviews is that they are very rarely used to dig deep enough, from the standpoint of either the employer or the employee. But we should! Hiring the wrong people is expensive for the company, and choosing to work for the wrong company is just a waste of time for an employee. Or worse.

The Questions to Ask

If I were to look for a new job I would milk the interview for all it's worth and use it to see what kind of company I'm dealing with.

  • If they're not doing pair programming in the interview, will they do it in real life?
  • How serious are they about their agility?
  • Do they check what I remember or do they make an effort to find out how I work?
  • Are they interested in the buzzwords I know or the process I use to develop software and solve problems?
  • Are they interested in finding out about the real me?
  • If at all possible: can I pair program with the people I'm going to be working with?

I would perhaps even go as far as asking them why they are hiring the way they are, regardless of how, and use the answer to decipher their way of thinking about people and the craft of software development. I mean: would you choose to work for a company that isn't really interested in finding out the real you but rather just hires the résumé you?

Some other questions I would consider asking to find out more about the company:

  • What kind of computer hardware and software are developers using? (This can seem like a useless geek obsession, but it tells me something about how they value the productivity of their developers. Best answer: you get to choose :-))
  • How does the company (top to bottom) ensure continuous improvement?
  • How does the company measure the productivity of its developers? (Trick question.)
  • How far up the management chain do the agile principles really go? Think of a Toyota Production System-like environment.
  • For a consulting firm: how much and how often do you spend time inside the mother ship as opposed to on-site at clients?

The Time it Takes

Doing interviews in this fashion takes time. I have heard stories (at Hashrocket, I think) of week-long pair programming sessions with the entire team, where a single "no" from one team member is all it takes to turn down the candidate. This probably works better in the US with their system of two weeks' notice, but not so well in Norway with (normally) three months' mutual notice. But for sure: one hour of questions doesn't tell you enough. Practice trumps talking, talking trumps résumés - to paraphrase the kanban saying.

Learning Ruby With Ruby Koans

| No Comments | No TrackBacks

I'm learning Ruby these days, and after reading the excellent book The Ruby Programming Language I found myself in need of actually writing and reading some code. I tried starting a pet project in Ruby, but the, uhm, distractions of two young children made it difficult to get that off the ground, and I also quickly realized that I needed some more training before I could be productive on that project.

So enter Ruby Koans, written by Jim Weirich and Joe O'Brien of EdgeCase. This little git repo guides (and forces) you through a series of test files with (at first) failing unit tests, where you read code and make the failing tests pass one after another. It starts out easy with variables, true and false and the like, and ends up with open classes, message passing and modules.

If you can tolerate some zen puns and need a quick intro to Ruby and testing in Ruby, it's worth a look. The beauty for me was the ease of doing one pomodoro of Ruby Koans each morning; the mental context needed to do that was very small.

Systems Thinking applied to software development

| 2 Comments | No TrackBacks

I have watched some talks by John Seddon on Systems Thinking and on where he feels Lean has failed. Seddon's focus is service organizations in the public and private sectors, like health care, customer support and sales for various products. But how would Systems Thinking apply to software development? Have we got it all wrong? I'm going to try to map some of the ideas of Systems Thinking over to existing software development methodologies and practices and see where we end up.

Systems Thinking in a nutshell

Systems Thinking is about seeing the organization as a system and studying it as a system from the view of the customer. Some core ideas summarized:

  1. The only plan for changing an organization is to get knowledge
  2. Study demand going into the system
  3. Distinguish between failure demand (bug reports, wrong features and other sources of rework) and value demand (new features/contracts/projects).
  4. Study the variability and predictability of the failure demand. Only what is predictable is preventable. Take steps to prevent failure demand.
  5. People's behavior is a product of the system in which they work. If you want to change the people, change the system.
  6. Train people to handle the incoming requests and let them pull help on the things they don't know how to solve.
  7. Don't standardize the work
  8. Give the workers the means to control the work and the power to change it.
  9. Measure the actual value delivered to customers.
  10. The best way to learn counter-intuitive truths is to see and experience them for yourself
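
To make items 2 to 4 a little more concrete, here is a minimal Java sketch of how a team might start studying demand: tag each incoming request as value or failure demand and compute the share of failure demand over a period. The names DemandKind, DemandStudy and failureShare are hypothetical and made up for this illustration; they do not come from Seddon's material or from any library.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical classification of incoming work (an assumption for this sketch). */
enum DemandKind {
    VALUE,   // new features, contracts, projects
    FAILURE  // bug reports, wrong features and other rework
}

final class DemandStudy {
    private DemandStudy() {}

    /** Returns the fraction of requests that are failure demand, or 0.0 for an empty list. */
    static double failureShare(List<DemandKind> requests) {
        if (requests.isEmpty()) {
            return 0.0;
        }
        long failures = requests.stream()
                .filter(kind -> kind == DemandKind.FAILURE)
                .count();
        return (double) failures / requests.size();
    }

    public static void main(String[] args) {
        // One invented week of incoming requests, purely for illustration.
        List<DemandKind> week = Arrays.asList(
                DemandKind.VALUE, DemandKind.FAILURE, DemandKind.VALUE,
                DemandKind.FAILURE, DemandKind.VALUE);
        System.out.printf("Failure demand: %.0f%%%n", failureShare(week) * 100);
    }
}
```

Tracking this number week by week is one simple way to see whether failure demand is stable enough to be predictable, which item 4 treats as the precondition for preventing it.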

The goal is of course to deliver maximum value (the right features at the right time to solve the right problems) to customers in the shortest time frame possible. Sound familiar? Read on for a more in-depth discussion of each of the items in the list.

Linkdropping on NoSQL, Lean and Systems Thinking

| 4 Comments | No TrackBacks

I thought I would round up a collection of links that interested me the past few weeks on various topics. First off: NoSQL databases:

Then over to Lean and Kanban. Henrik Kniberg and Mattias Skarin just published a new book on InfoQ titled "Kanban and Scrum - making the most of both". If you're not familiar with Henrik Kniberg's work, I also suggest "Scrum and XP from the trenches". Erling Wegger Linde's "A Kanban brown bag recipe" is also worth a read.

And in the spirit of Lean: there is a video of a talk by John Seddon of Vanguard titled "Cultural Change is Free". It is mainly about systems thinking in the public sector, but the private sector isn't infallible either. Seddon often criticizes Lean for being wrong in many places, but I often feel he is criticizing a wrongful implementation of lean ideas, much the same as Scrum is often criticized for the failings of wrongful implementations. Or rather, he is criticizing the tool focus of a lot of lean consultants, not the lean principles themselves. And he stresses the differences between the Toyota Production System and other kinds of organizations. You could also check out his talk "Re-thinking Lean service" on InfoQ, which deals with the same topic in slightly different packaging.