
Comparing NoSQL Data Stores

Disclaimer: Any views or opinions presented in this article are solely those of the author and do not necessarily represent those of the company that the author is employed by or its subsidiaries.



Given the recent brouhaha over NoSQL data stores, and the hype and myth that surround them, I decided to dip my toes into comparing how they fare at the basic task of storing and retrieving objects.

I was hoping that the inevitable wrinkles that I would encounter would give me an insight into the strengths and weaknesses of each.

Let me come clean right at the start. I come from an Oracle Coherence background and I firmly believe that large-scale map reduction problems coupled with combinatorially explosive dimensionality require the clever use of near-caching technology to return performant results.

A lot of folks would disagree and might (rightly so) have a different point of view. This view of mine stems from my own experience and is true for my domain of problems. Your mileage might vary. Please do leave me comments as I too would like to learn from your experiences.

For this class of problems, given the current state of technology, I would pitch my personal tent with either Oracle Coherence or (now) Hazelcast. Hazelcast was one of the more pleasant surprises of this exercise; more on that later.

The problem definition for this comparison is a straight store and retrieval of data, not a map reduction use-case.

Problem definition

The problem is that of pricing trades (Equity Option trades in particular). I decided to use JQuantlib to do the pricing; none of its Java classes implement Serializable or even provide a default constructor, so they cannot simply be wrapped up and stored in serialized form, which meant defining my own serializable model objects.

The idea was to create a Market data object that defined things like interest rates, volatility and Trade objects that distinguished between call and put options for a given strike and maturity date.

The process then involved passing this market data and the collection of trades through a pricing engine (based on Black Scholes) to generate NPV and other Greeks.

If none of that made any sense, it does not matter. Let me restate the problem in different words.

In simpler terms: take two classes of objects, pass them through a third, and generate some metrics. The generated metrics are then usually passed into an elaborate map reduction framework; that second part is not being addressed here.

I thought I would develop a sophisticated model structure, but then changed my mind and kept it simple(r). So if you feel it appears half baked, that's probably true. The intention was to model just enough to meet the demands of the use case.

The models are somewhat like this.

  • MarketData implemented an interface called IMarketData, which extended Serializable.
  • Similarly, ITradeData extended Serializable. I started down the noble route of defining AssetClass as an enum (and Serializable). The overhead of defining enum serializations soon put paid to that idea, and I fell back on using the String name of the enum.
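
To make that concrete, here is a minimal sketch of what the model might look like. The field names are my own illustrative guesses (the real code lives in the attached projects); what matters is the Serializable hierarchy described above.

    import java.io.Serializable;
    import java.util.Date;

    // Illustrative sketch only: field names are my guesses, not the actual project code.
    // In the real project each type would live in its own source file.
    interface IMarketData extends Serializable { }

    interface ITradeData extends Serializable { }

    class MarketData implements IMarketData {
        private double riskFreeRate;   // e.g. 0.05
        private double volatility;     // e.g. 0.20
        private Date effectiveDate;
        // no-arg constructor and getters/setters omitted for brevity
    }

    class EquityOptionTrade implements ITradeData {
        private String tradeId;
        private String assetClass;     // String name of the enum, e.g. "EQUITY_OPTION"
        private String optionType;     // "CALL" or "PUT"
        private double strike;
        private Date maturity;
        // no-arg constructor and getters/setters omitted for brevity
    }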

I stored one market data object and the idea was to generate and store one million Equity Option objects into the data store. These would randomly be either call or put options.
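
As a rough sketch, and assuming the illustrative model above (with its omitted setters), the population step looks something like this regardless of which store the objects end up in:

    import java.util.ArrayList;
    import java.util.Calendar;
    import java.util.List;
    import java.util.Random;

    // Generates randomly mixed call/put equity option trades.
    // Helper and field names are illustrative, not the article's actual code.
    class TradeGenerator {
        static List<EquityOptionTrade> generate(int count) {
            Random random = new Random();
            List<EquityOptionTrade> trades = new ArrayList<EquityOptionTrade>(count);
            for (int i = 0; i < count; i++) {
                EquityOptionTrade trade = new EquityOptionTrade();
                trade.setTradeId("TRADE-" + i);
                trade.setAssetClass("EQUITY_OPTION");
                trade.setOptionType(random.nextBoolean() ? "CALL" : "PUT");
                trade.setStrike(50 + random.nextInt(100));            // arbitrary strike range
                Calendar maturity = Calendar.getInstance();
                maturity.add(Calendar.MONTH, 1 + random.nextInt(24)); // 1 to 24 months out
                trade.setMaturity(maturity.getTime());
                trades.add(trade);
            }
            return trades;
        }
    }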

Then, running a query based on multiple criteria would provide further insight into each store's searching and filtering mechanisms.

That (kind of) sets the scene for my comparative exercise.

As I said, the objectives were to learn the following (amongst other things):

  • What does it take to define the schema for the data store
  • How easy is the plain vanilla system to setup
  • What additional configuration is required to add Serializable objects in and extract from, the data store
  • Are there any limitations to sizing and partitioning
  • How easy is it to setup and run the data store server
  • How easy is it to connect to the server via (Java) drivers
  • How intuitive is the query language
  • Does the system scale (or at least give the appearance of scaling)
  • Is the system performant
The systems involved in the comparison were:

  • MongoDB
  • Cassandra
  • Hazelcast
  • Terracotta (Ehcache)
  • Oracle Coherence

I recently fitted an SSD drive to my laptop (would recommend this for a real hardware boost) and as a result have a 32-bit version of Windows 7 running on my PC. This is where the current metrics stem from. I do intend to run my comparison on my 8GB MacOS based Mac mini when time is not such a constraint.

Prerequisites

I used the IntelliJ community edition (despite being a long term Eclipse user). It is nimble and intuitive.

Also required are:

  • Apache Maven
  • A recent JDK

If you are unfamiliar with Maven, I strongly recommend that you spend some time picking it up. It is perhaps the most important prerequisite.

All my projects were Maven based and all evaluations were done via unit tests (a skeleton of such a test is sketched at the end of this section). I have included my zipped-up projects in the setup for each evaluation. The pom.xml describes the Maven dependencies required to run the tests. Some of these dependencies come from public web repositories and so will work out of the box.

For example, most of my pom.xml have this entry:
 
       <dependency>
            <groupId>org.testng</groupId>
            <artifactId>testng</artifactId>
            <version>6.1.1</version>
            <scope>test</scope>
        </dependency>

If your IDE is set to auto-import, then testng will download to your local repo (provided you are internet enabled). If not, use the manual method below.

A lot of the dependency jars, however, were added manually into my local repo. For example the jquantlib jar entry in my pom is:

 
       <dependency>
            <groupId>org.jquantlib</groupId>
            <artifactId>jquantlib</artifactId>
            <version>0.2.4</version>
        </dependency>

This was added to my local repository using the following Maven command (assuming that your jar is stored under C:\JQuantLib\jquantlib-0.2.4.jar):

mvn install:install-file -DgroupId=org.jquantlib -DartifactId=jquantlib -Dpackaging=jar -Dversion=0.2.4 -Dfile=C:/JQuantLib/jquantlib-0.2.4.jar

Please note that you need to have the Maven bin directory on your system path. I set my MAVEN_HOME to C:/Progra~1/apache-maven-3.1.1 and added %MAVEN_HOME%/bin to my system path (or user path if you lack admin access). On Unix you would run export PATH=$PATH:$MAVEN_HOME/bin

Similarly, you will need the latest JDK installed with JAVA_HOME setup and the JAVA_HOME bin directory added to your path.
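
Since all the evaluations were driven by TestNG unit tests (the testng dependency above), each one boiled down to roughly the following shape. The class and method names here are hypothetical placeholders, not the actual code from the attached projects:

    import org.testng.Assert;
    import org.testng.annotations.Test;

    import java.util.List;

    // Hypothetical skeleton of an evaluation test; the real tests live in the attached projects.
    public class TradeStoreTest {

        @Test
        public void insertAndRetrieveTrades() {
            // build the test population (see the generator sketch earlier)
            List<EquityOptionTrade> trades = TradeGenerator.generate(1000000);

            // time the insertion into whichever store is under test
            long start = System.currentTimeMillis();
            // store.put(...) / collection.insert(...) would go here
            long insertMillis = System.currentTimeMillis() - start;
            System.out.println("Insert took " + insertMillis + " ms");

            // retrieval and query assertions would follow
            Assert.assertEquals(trades.size(), 1000000);
        }
    }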

Project Setup

Each system has its own nuances when setting up the server and running the queries; each one is described in detail below.

Results

The Excel spreadsheet with the full data results is attached below.

Please note that in these results Cassandra is shown in red because it repeatedly failed when executing the query against one million trades.

I did various searches on this and the suggested remedies were confusing. One posting recommended increasing the JVM heap size, but increasing it from 1GB to even 1.25GB meant that the server would not start (most likely a limitation of the 32-bit JVM address space on my machine).

Another post suggested reducing the JVM size to keep GCs low.

A whole host of other posts explained why people had moved away from Cassandra, and others listed exhaustive JVM tuning parameters to get it going (life's too short for that!).

On the whole I would like to discount Cassandra, as I had to reduce the population size to 500,000 in order for it to work. But having made the effort, I decided to include it and mark it in red.

Please draw your own conclusions from this (I have!).

Data insertions

Single market data (for one effective date)

Trade data insertion for one million trades (except for Cassandra)


Data retrieval

Single market data (for one effective date)

Trade data retrieval for one million trades (except for Cassandra)

Pricing metrics (strictly speaking this has little to do with the data store, but the running servers could bias the results)


Conclusions (see disclaimer at top)

Before I started on this exercise, I must admit that I had some preconceived bias. I thought MongoDB would be the most suitable system for this particular use case. I also had high hopes for Cassandra. I sneered at the prospects of Terracotta (because of the underlying Ehcache). I did not have much of a view on Hazelcast (version 1 was dire).

After the evaluation I am disappointed with MongoDB, but not as much as I am with Cassandra (the words "barge pole" spring to mind).



Update on 20th November 2013 - there has been this very useful perspective from a knowledgeable source:

Michael Kremliovsky, Software Systems at Hospira:

Raj, I apologize for being negligent to the details of your setup, but I just want to notice (again) couple things: Cassandra under Windows is the same as an elephant in a bird cage, Cassandra is designed for horizontal scalability with fault-tolerance (no single point of failure), Cassandra is designed to run on multiple nodes (three is a recommended minimum), Cassandra is optimized for writes, reads are OK, and it does relatively poor job at updates. If the goal is simplicity and single machine performance, then MongoDB is a better choice for persistent storage (depending on use case, it can be up to 30 times faster than MySQL, for example). So, if you are building a backend system for massive eventually consistent highly available datastore running on Linux, Cassandra is a good choice. If you are trying a PC-format application, Cassandra is not good fit. Pretty simple. On a different note, you may also try SciDB (especially, if your target development is on Java).



With respect to MongoDB, I am not a big fan of JSON (or BSON for that matter); it feels like XML in another form. In the world of Google protocol buffers and Coherence PortableObjects, this seems a bit archaic.
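
For context, this is roughly what storing and querying a trade looks like with the MongoDB Java driver of that era; the database, collection and field names are illustrative, and the point is that the object has to be flattened into a BSON document by hand:

    import com.mongodb.BasicDBObject;
    import com.mongodb.DB;
    import com.mongodb.DBCollection;
    import com.mongodb.DBObject;
    import com.mongodb.MongoClient;

    // Minimal sketch using the 2013-era MongoDB Java driver.
    // Database, collection and field names are illustrative, not the article's code.
    public class MongoTradeStore {
        public static void main(String[] args) throws Exception {
            MongoClient client = new MongoClient("localhost", 27017);
            DB db = client.getDB("pricing");
            DBCollection trades = db.getCollection("trades");

            // the trade object is flattened into a BSON document by hand
            DBObject doc = new BasicDBObject("tradeId", "TRADE-1")
                    .append("assetClass", "EQUITY_OPTION")
                    .append("optionType", "CALL")
                    .append("strike", 100.0);
            trades.insert(doc);

            // multi-criteria query against the flattened fields
            DBObject query = new BasicDBObject("optionType", "CALL")
                    .append("strike", new BasicDBObject("$gt", 90.0));
            System.out.println(trades.count(query));

            client.close();
        }
    }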

Hazelcast pleasantly surprised me. I think the product has matured well. Whether it is of the same industrial strength as Coherence, I do not know at this point in time. But it is certainly one of those products I am going to keep a close watch on.

Terracotta was also a pleasant surprise and seemed to deliver much more than I expected (maybe my expectations were too low). However, I hate the XML based config and the need to define searchable attributes a priori.

I shall not opine on Coherence because of my inherent bias towards it. For use cases where the cache is used for simple puts and gets and data is stored not as objects but in some flattened text-based structure, using Coherence is like buying an F12 Berlinetta to do the supermarket shopping. It will do the job, but you will be perceived as a poser or an idiot (or both).
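
For reference, the "simple puts and gets" case mentioned above looks roughly like this against a Coherence cache (the cache name and the stored type follow the illustrative model from earlier):

    import com.tangosol.net.CacheFactory;
    import com.tangosol.net.NamedCache;

    // Minimal sketch of plain put/get usage against a Coherence named cache.
    // The cache name "trades" is illustrative.
    public class CoherencePutGet {
        public static void main(String[] args) {
            NamedCache trades = CacheFactory.getCache("trades");
            trades.put("TRADE-1", new EquityOptionTrade());   // stored in serialized form
            EquityOptionTrade back = (EquityOptionTrade) trades.get("TRADE-1");
            System.out.println(back);
            CacheFactory.shutdown();
        }
    }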

Ease of use

MongoDB scores high in the ease of use stakes, as does Hazelcast. Coherence would be easy for someone familiar with its nuances, but from an out-of-the-box perspective, perhaps not.

Querying

Coherence (particularly using CQL) and Hazelcast scored quite high for me on ease of querying and retrieval. The semantics seemed natural, no pre-configuration was required and, most importantly, the queries worked off the serialized objects. For simpler use cases Hazelcast seems a logical option.
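
To illustrate why the querying felt natural: with Hazelcast a multi-criteria query can be run directly against the serialized objects in a distributed map, roughly like this (the map name and attributes mirror the illustrative model above):

    import com.hazelcast.core.Hazelcast;
    import com.hazelcast.core.HazelcastInstance;
    import com.hazelcast.core.IMap;
    import com.hazelcast.query.SqlPredicate;

    import java.util.Collection;

    // Minimal sketch of a multi-criteria Hazelcast query; attribute names resolve
    // against the getters of the stored objects (illustrative model from earlier).
    public class HazelcastQuery {
        public static void main(String[] args) {
            HazelcastInstance hz = Hazelcast.newHazelcastInstance();
            IMap<String, EquityOptionTrade> trades = hz.getMap("trades");

            trades.put("TRADE-1", new EquityOptionTrade());

            Collection<EquityOptionTrade> calls =
                    trades.values(new SqlPredicate("optionType = 'CALL' and strike > 90"));
            System.out.println(calls.size());

            Hazelcast.shutdownAll();
        }
    }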

Performance and Scale

Again, I think Coherence and Hazelcast scored well in my book on this. MongoDB is a decent contender, but the fact that attributes to be searched upon have to be registered within the document is a limitation I do not like (I have not explored it deeply enough to know otherwise, but then I have not done so for the others either). I can see a plethora of applications for which MongoDB would be suitable, but I would like to reserve my judgement for this particular one at the moment.


Finally, before you decide that my opinions do not match your experience of a particular product and launch a big flame in my direction: this exercise is not an in-depth study that intimately explores all the intricacies of each product (I just do not have the time for that). It is just a "first glance" opinion. If you have something to share, please feel free to leave me your thoughts and experiences, positive or negative.

This is part comparison and part tutorial and I hope it has helped in some small way.

