April 2010 Archives

A Book Overflow

I am a habitual book buyer and I kinda like the paper kind, which is evident if you see my office shelf or my living room. So once again I bought more books than I can read in the time I have left over from living my normal life.

Growing software

Growing Object-Oriented Software, Guided by Tests by Steve Freeman and Nat Pryce. This is a book I've only heard positive things about and the TOC seems to deliver on that. It offers a lot of detail on the how and, more importantly, the why of TDD/BDD and testing in relation to development. I am so looking forward to this one.

Being Productive

The Productive Programmer by Neal Ford. I'm a productivity junkie just like Neal Ford and some of the tricks (and the first part of the book is a long list of tools and tips) are already in my tool belt. But there's always more to learn, all in the spirit of kaizen. He throws in some good advice on software design also, all in the spirit of being productive.

The Art of Agile

The Art of Agile Development by James Shore and Chromatic. Another one I have heard a lot about, all of it positive. At first look it seems to cover the state of agile software development today and touches on some lean and TPS-inspired concepts too. And you can't really navigate around those in today's agile landscape.

Effective Java

Effective Java, second edition, by Joshua Bloch. I'll be writing more Java in the coming months as I'll be changing jobs, so this is in order to brush up on the old Java skills. It is also a book that is widely regarded as a must.

The Legacy You Inherit

Working Effectively with Legacy Code by Michael Feathers. How do you test legacy apps? Highly recommended by a colleague who recently inherited a largish legacy codebase with only a tauntingly small number of tests. What's the definition of legacy code? Code without tests :-) Even if you write it today...

Cooking with Mongodb and Solr

I've recently changed storage backend and search backend for a small web project and it has been a real blast. What follows is an overview of the reasons for the change, what the change actually was and the relative amount of joy involved.

The Old System

The system was built using PHP/Apache2 and MySQL and covers a very simple domain with only a single object (Person, sort of) and simple data records spanning several years.

  • No writing through the web interface, only search and lookup.
  • Batch updates with between 4 and 5 million records in each update, 4-5 years history, so total 16-20 million records
  • "Search" through very simple text indexes on the relevant columns in MySQL.
  • InnoDB backend
  • File is transformed to LOAD DATA INFILE format and fed into MySQL with manual delete of the set for that year beforehand.

The Pain Points

  • A batch update with 4 million rows takes 4-5 hours on average (on the prod machine: dual core 3GHz, 4GB RAM, roughly 1GB set aside for MySQL), with index updates being the main culprit. This could be done as a check-the-record-and-update-if-changed, but that would also require a lot of queries and updates to the database.
  • Queries with wildcards are dead slow when hitting outside the query cache.
  • Not really advanced search as such.

The Plan

  • Replacing MySQL with MongoDB, as there are no actual relations needed and everything fits in one collection of documents
  • Replacing MySQL indexes with Apache Solr for consolidating search across several other systems. And speed.
  • Use the PECL extensions for both MongoDB and Solr.

MongoDB is a document database, written in C++, that stores documents in a binary JSON form (BSON).

Solr is built on top of the Java version of Lucene, does indexing over HTTP and runs happily in Tomcat, Jetty or most other servlet engines.

The Implementation

Names of domain objects are changed to protect the guilty - and the domain.

Both Solr and MongoDB are fast and easy to work with. There is very little in your way when it comes to just doing what you want and solving the problems in a straight-forward manner. Some examples:

  • The MongoDB "upsert" feature saves you some round-trips to the database. Normally, if you want to update a record when it exists and insert it when it doesn't, you need to query first and then insert a new one when nothing is found. If you just want to update/add data to part of the object, it complicates the matter further. With MongoDB you can call update with a special option ('upsert' => true) and the rest is handled server-side.
  • Solr has the default behavior of updating instead of complaining when you send a document with a primary key field that already exists in the index.
  • Solr does everything over HTTP and you get easy-to-read XML messages back as responses. This is also handy when you need to debug what data is sent over the wire.

I created a very thin layer between MongoDB and the domain, with an insert() method (which, as we will see, also handles updates) that takes a DataRecord (read from the file) as an argument.

    
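public function insert(DataRecord $record) {
    // Match on the id; with 'upsert' the document is created when no match exists.
    $this->collection->update(
        array('id' => $record->id()),
        // $set only touches list.<year>, so data for other years is left in place.
        array('$set' => array(
            'list.' . $record->year() => $record->getDetails()->toArray())),
        array('upsert' => true));
}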

This will insert a document in the collection if it's not there. When it is there, it will add an element to the (nested) 'list' element with the value of $record->year() as key. The value will be the value of $record->getDetails(). The toArray() call is there because the mongo driver expects arrays to store. The super cool part is that if the key exists, it will just be updated with the data from the details object. Read more on the details of the MongoDB update options.
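To make the upsert-plus-$set behaviour described above concrete, here is a minimal in-memory sketch in Java. It does not use the MongoDB driver at all: the class UpsertSketch, its methods and the "city" field are hypothetical names made up for illustration, and nested maps stand in for the collection and the nested 'list' element. It only models the three cases: document missing, year missing and year already present.

```java
import java.util.HashMap;
import java.util.Map;

// In-memory stand-in for a collection keyed by id (hypothetical, not the MongoDB driver).
class UpsertSketch {
    // id -> (year -> details); the middle map plays the role of the nested 'list' element.
    private final Map<String, Map<String, Map<String, String>>> people = new HashMap<>();

    // Models update({id}, {$set: {"list.<year>": details}}, {upsert: true}).
    void upsert(String id, String year, Map<String, String> details) {
        people.computeIfAbsent(id, k -> new HashMap<>()) // create the document when missing
              .put(year, new HashMap<>(details));        // add or overwrite that year only
    }

    Map<String, Map<String, String>> find(String id) {
        return people.get(id);
    }

    int count() {
        return people.size();
    }

    public static void main(String[] args) {
        UpsertSketch store = new UpsertSketch();
        store.upsert("42", "2008", Map.of("city", "Oslo"));      // inserts the document
        store.upsert("42", "2009", Map.of("city", "Bergen"));    // adds a year
        store.upsert("42", "2009", Map.of("city", "Trondheim")); // overwrites that year
        System.out.println(store.find("42"));
    }
}
```

After the three calls there is still only one document for id 42, holding one entry per year, which is the shape the insert() method above relies on.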

For indexing the document in Solr, I added a similarly thin wrapper for the SolrClient object with an index() method. This method takes a SolrInputDocument as an argument. I chose to let the domain object decide what should be indexed and thus create the index document object, but the responsibilities could easily have been switched around. The finer point is that when indexing you have to read the complete object from the database in order to get all data. The DataRecord that was read from file and stored with upsert may just have been part of the picture. Reading back the updated object incurs a performance penalty that wasn't present in the old system. It is also a consequence of structuring the data as a collection of person objects in MongoDB, rather than as a long list of records as in the old version. This maps better to the domain.

    
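public function index(SolrInputDocument $document) {
    $response = $this->solr->addDocument($document);

    // Commit once every $commitInterval documents instead of on every add.
    // Pre-increment so the commit happens on exactly the Nth document.
    if (++$this->pendingDocuments >= $this->commitInterval) {
        $this->commit();
        $this->pendingDocuments = 0;
    }

    return $response->getResponse();
}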

Committing on every Solr document makes indexing very slow. Small tests indicated 3 minutes for indexing 5000 documents with a commit on every submit, and 15 seconds with one commit every 2000 documents (and one at the end, of course). The code above commits every $commitInterval documents (default 10000) to speed things up a bit. Note also that the commit() and optimize() calls for Solr may time out, as they can take a long time to finish. It is not Solr itself that times out but rather the Java application server you're running it in. When this happens an exception is thrown in the PHP driver, which has to be caught.
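The batching idea, including the final commit for a partial batch, can be sketched in isolation. The following Java is a minimal illustration under stated assumptions: BatchingIndexer, index(), finish() and commits() are hypothetical names, and no Solr client is involved. The commit is just a counter so the number of expensive calls is easy to see.

```java
// Counts added documents and triggers one commit per interval
// (hypothetical helper; the commit is a stand-in, no search server is contacted).
class BatchingIndexer {
    private final int commitInterval;
    private int pending = 0;
    private int commits = 0;

    BatchingIndexer(int commitInterval) {
        if (commitInterval < 1) {
            throw new IllegalArgumentException("commitInterval must be >= 1");
        }
        this.commitInterval = commitInterval;
    }

    void index(String document) {
        // A real implementation would send the document to the search server here.
        pending++;
        if (pending >= commitInterval) {
            commit();
        }
    }

    // Call once after the last document so a partial batch is not left uncommitted.
    void finish() {
        if (pending > 0) {
            commit();
        }
    }

    private void commit() {
        commits++; // stand-in for the expensive commit call
        pending = 0;
    }

    int commits() {
        return commits;
    }

    public static void main(String[] args) {
        BatchingIndexer indexer = new BatchingIndexer(2000);
        for (int i = 0; i < 5000; i++) {
            indexer.index("doc-" + i);
        }
        indexer.finish();
        System.out.println(indexer.commits() + " commits for 5000 documents");
    }
}
```

With an interval of 2000 and 5000 documents, this does three commits (after document 2000, after document 4000 and in finish()) instead of 5000.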

The Results

Platform is Ubuntu 9.10 server edition 64 bit and all timings from the shell are done with time on linux. MySQL times are the times reported from MySQL itself.

Time for batch insert/update

  • Commit interval for solr: 10000
  • update-logging for Solr turned off (default is very verbose)
  • nssize=1024 for Mongodb

| System | Operation | Time |
| --- | --- | --- |
| MySQL | initial import | 16m 5.7s |
| MySQL | update (delete + insert) | 3h 24m 56s (delete) + 40m 56s (insert) |
| MongoDB (no indexing) | initial import | 12m 16s real, 10m 41s user |
| MongoDB + Solr | initial import | 78m 39s real, 25m 28s user |
| MongoDB (no indexing) | update | 13m 22s real, 11m 8s user |
| MongoDB + Solr | update | 69m 8s real, 19m 25s user |


Space usage

No pre-allocation was done for MongoDB, so it created the data files as needed. This means that the last file was created at 2 GB and may very well be almost empty. Mongo creates files in a doubling fashion from 64 MB to 2 GB, like this: 64 MB, 128 MB, 256 MB, 512 MB, 1 GB, 2 GB.
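As a rough worked example of the doubling scheme, the Java sketch below lists the file sizes needed to hold a given amount of data. DataFileSketch and its methods are hypothetical names. That files stay at 2 GB once that size is reached is an assumption, and the sketch ignores anything else the server may put on disk, so it will not reproduce measured directory sizes exactly.

```java
import java.util.ArrayList;
import java.util.List;

// Back-of-the-envelope model of data files that double in size from 64 MB up to 2 GB.
class DataFileSketch {
    static final int FIRST_FILE_MB = 64;
    static final int MAX_FILE_MB = 2048;

    // Sizes (in MB) of the files needed before their combined capacity covers dataMb.
    static List<Integer> filesFor(long dataMb) {
        List<Integer> sizes = new ArrayList<>();
        long capacity = 0;
        int next = FIRST_FILE_MB;
        while (capacity < dataMb) {
            sizes.add(next);
            capacity += next;
            next = Math.min(next * 2, MAX_FILE_MB); // assumption: stay at 2 GB once reached
        }
        return sizes;
    }

    static long totalMb(List<Integer> sizes) {
        long total = 0;
        for (int size : sizes) {
            total += size;
        }
        return total;
    }

    public static void main(String[] args) {
        List<Integer> sizes = filesFor(1294);
        System.out.println(sizes + " -> " + totalMb(sizes) + " MB allocated");
    }
}
```

The point is only that allocated space grows in large steps, so file sizes on disk can be well above the size of the data itself.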

| System | Index | Data |
| --- | --- | --- |
| MySQL, initial import | 717 MB | 516 MB |
| MySQL, initial import + 1 additional dataset | 1.4 GB | 1 GB |
| MongoDB + Solr, initial import | 744 MB | 1294 MB (3 GB of data files) |
| MongoDB + Solr, initial import + 1 additional dataset | 1538 MB | 4348 MB |


CPU usage

When importing to MySQL, it more or less maxes out the CPU for the entire import. When doing the import with a PHP script feeding data to MongoDB and Solr, the component using the most CPU is the PHP script splitting the file, creating objects and calling the MongoDB and Solr APIs. This takes up 35-40% CPU and around 10 MB of RAM. MongoDB uses around 10% CPU (with 1.1GB of RAM) and Solr (Tomcat, that is) uses 30% CPU and around 300MB of RAM. Disks on these machines are virtualized through VMware: 15K SAS disks on an IBM S3200 storage array with dedicated GB LAN between blade center and storage. Disk IO seems to be the bottleneck here.

Space usage

Solr and MongoDB seem to be a bit more sloppy with their space usage than MySQL, but I guess this is the price to pay for some of the other benefits you get. See the Mongo FAQ on data files for info on how to see real space usage for databases and not just file sizes. In return for more storage space spent, you get much better (and faster) search capabilities and faster query times against the database, although that last improvement is small.

Naming Classes and Interfaces

The naming of interfaces and corresponding implementation classes in languages that use them (of which Java and C# are the most used, I guess) is a subject of sometimes heated debate. A growing distaste for the "best practice" (at least on Microsoft's part) of using a capital "I" on interfaces prompted this rundown and a round of pro et contra. My ultimate goal is readability of code, but a slight predicament emerges when considering a team: if the majority of team members are used to one particular convention, is that convention, in that particular context, more readable for that team than another approach, all other things being equal? Even if the name of the interface in the other approach reads more like natural English? The jury's still out on this one.

The different patterns I've found:
  1. Interfaces are prefixed with a capital I, where IReportGenerator is the interface and e.g. ReportGenerator is the implementation. Some argue for making the interface name read "I generate reports" and keep the I prefix, thus: IGenerateReports instead.
  2. Ditching the capital I and appending "imp" or "impl" to the implementation. Like so: UserManager and UserManagerImpl. Some use uppercase or lowercase "c" as prefix for concrete classes too but I find that horribly ugly, to be frank.
  3. Naming an interface for its role and naming the implementing class(es) for what distinguishes them from (possibly) other implementations. E.g. UserManager for the interface and DatabaseUserManager or LDAPUserManager for the concrete implementations.
  4. Naming abstract classes with the word "Abstract" as a prefix to distinguish between abstract and concrete classes. Possibly combined with other conventions for interfaces as well. I would argue the context often reveals an abstract class to the reader.
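Pattern 3 is perhaps easiest to judge when seen from the calling side. The Java sketch below is a minimal illustration using the UserManager, DatabaseUserManager and LDAPUserManager names from the list. The exists() method, the NamingSketch class and the in-memory Set and Map that stand in for a database table and a directory are assumptions made up for the example.

```java
import java.util.Map;
import java.util.Set;

// Client code only ever mentions the role, never a prefix or suffix.
class NamingSketch {
    static String describe(UserManager users, String username) {
        return username + (users.exists(username) ? " exists" : " is unknown");
    }

    public static void main(String[] args) {
        UserManager users = new DatabaseUserManager(Set.of("alice"));
        System.out.println(describe(users, "alice"));
    }
}

// The interface is named for its role.
interface UserManager {
    boolean exists(String username);
}

// Each implementation is named for what distinguishes it.
class DatabaseUserManager implements UserManager {
    private final Set<String> rows; // stand-in for a users table

    DatabaseUserManager(Set<String> rows) {
        this.rows = rows;
    }

    @Override
    public boolean exists(String username) {
        return rows.contains(username);
    }
}

class LDAPUserManager implements UserManager {
    private final Map<String, String> directory; // stand-in for uid -> DN entries

    LDAPUserManager(Map<String, String> directory) {
        this.directory = directory;
    }

    @Override
    public boolean exists(String username) {
        return directory.containsKey(username);
    }
}
```

The describe() signature reads as plain domain language, and the only place the kind of implementation shows up is where one is constructed.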

C# vs Java

In C#, extending an abstract base class and implementing an interface use the same syntax, and it isn't always easy to discern which one it is. A naming convention makes this a bit easier:

public class DatabaseUserManager : UserManager {

}

//versus

public class DatabaseUserManager : IUserManager {

}

In java the syntax is different:

public class LDAPUserManager implements UserManager {

}

//versus

public class LDAPUserManager extends UserManager {

}

A prefix is therefore less needed for readability in Java than in C#.

Good names trump conventions?

Clean Code by Robert C. Martin argues for good names: good names for variables, good names for methods and good names for interfaces and classes. The name should reveal intention. Choosing good names takes time but saves more than it takes. (Clean Code, page 18). The book also mentions not using the "I" prefix and prefers to encode the implementation, if at all.

The question is: if the names are as good as they can be, do we need a prefix or suffix or other kinds of encoding to indicate the type? In my mind we don't. I find that the I hurts readability and the "impl" certainly doesn't look good to me. So how to choose good names?

  1. Classes should have noun names or noun phrases, e.g. Customer, Policy or StreetAddress.
  2. Use names from the problem domain. E.g. if you're dealing with a customer or client object in a medical system, should it be called Patient? Or if it's dealing with social benefits, maybe BenefitsReceiver is a better name? The domain influences the choice of names. In stock-broking, perhaps the name FuturesHolder is a candidate name for a special kind of customer? (Note: I'm not very familiar with either of these domains, so the examples may be somewhat off. You get the point: know your domain and find good names from it.)
  3. Use names from the solution domain. If you use e.g. the Visitor pattern, make it a part of the name(s) so other programmers see it: InsuranceCustomerVisitor is better than CustomerTraverser.
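As an illustration of the third point, here is a minimal Java sketch of a Visitor where the pattern name is part of the type name. Only InsuranceCustomerVisitor comes from the text above. The customer types, the DiscountVisitor and the discount numbers are hypothetical and invented purely for the example.

```java
// Entry point for the sketch: asks a customer to accept a visitor.
class VisitorNamingSketch {
    static int discountFor(InsuranceCustomer customer) {
        return customer.accept(new DiscountVisitor());
    }

    public static void main(String[] args) {
        System.out.println(discountFor(new CorporateCustomer(120)));
    }
}

// The pattern is visible in the name, so a reader knows what shape to expect.
interface InsuranceCustomerVisitor<R> {
    R visitPrivate(PrivateCustomer customer);
    R visitCorporate(CorporateCustomer customer);
}

interface InsuranceCustomer {
    <R> R accept(InsuranceCustomerVisitor<R> visitor);
}

class PrivateCustomer implements InsuranceCustomer {
    @Override
    public <R> R accept(InsuranceCustomerVisitor<R> visitor) {
        return visitor.visitPrivate(this);
    }
}

class CorporateCustomer implements InsuranceCustomer {
    final int employees;

    CorporateCustomer(int employees) {
        this.employees = employees;
    }

    @Override
    public <R> R accept(InsuranceCustomerVisitor<R> visitor) {
        return visitor.visitCorporate(this);
    }
}

// One concrete visitor, named for what it computes (made-up discount rules).
class DiscountVisitor implements InsuranceCustomerVisitor<Integer> {
    @Override
    public Integer visitPrivate(PrivateCustomer customer) {
        return 0;
    }

    @Override
    public Integer visitCorporate(CorporateCustomer customer) {
        return customer.employees >= 100 ? 10 : 5;
    }
}
```

Anyone who knows the pattern can guess the accept/visit structure from the interface name alone, which a name like CustomerTraverser would not give away.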

So unless some team coding standard absolutely makes me (and that is, I'm afraid, likely), I will not in the future prefix my interfaces with I or adorn my implementations with c, impl or any other stuff. Just meaningful names from the domain. Easier to write and easier to read.

Lastly, I posted this question on twitter and @borud made the comment: «the only useful information conveyed by an I-prefix to interface types is information about the author of the code.» That's a suitable end note I think :-)

mini bio

Knut Haugen [Knu:t Hæugen], Norwegian software developer with a penchant for dynamic languages and anything to do with developer testing. Agile methodology geek with a bias towards Lean and Kanban.

meta

This page is an archive of entries from April 2010 listed from newest to oldest.
