I stored one market data object; the idea was then to generate one million Equity Option objects and store them in the data store. Each would randomly be either a call or a put option.
Disclaimer: Any views or opinions presented in this article are solely those of the author and do not necessarily represent those of the company that the author is employed by or its subsidiaries.
Let me come clean right at the start. I come from an Oracle Coherence background and I firmly believe that large scale map reduction problems coupled with a combinatorially explosive dimensionality require the clever use of near caching technology to return performant results.
Given the recent brouhaha on NoSQL data stores, the hype and the myth that surround them, I decided to dip my toes into comparing how they fare with respect to the basic task of storing objects and retrieving them.
I was hoping that the inevitable wrinkles that I would encounter would give me an insight into the strengths and weaknesses of each.
A lot of folks would disagree and might (rightly so) have a different point of view. This view of mine stems from my own experience and is true for my domain of problems. Your mileage might vary. Please do leave me comments as I too would like to learn from your experiences.
For this class of problems, given the current state of technology, I would pitch my personal tent with either Oracle Coherence or (now) Hazelcast. Hazelcast was one of the more pleasant surprises of this exercise; more on that later.
The problem definition for this comparison is a straight store and retrieval of data, not a map reduction use-case.
The problem is that of pricing trades (Equity Option trades in particular). I decided to use JQuantlib for this because none of its Java classes implement Serializable or even provide a default constructor (so they cannot simply be wrapped in a serializable wrapper).
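To make the serialization point concrete, here is a minimal sketch in plain Java. `LibraryQuote` is a made-up stand-in for a third-party class that is not Serializable and has no default constructor; it is not an actual JQuantLib class, and the two wrapper classes are likewise hypothetical. The sketch tries the two obvious wrapping strategies with standard Java serialization and reports which exception, if any, each one hits.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Hypothetical stand-in for a library class: not Serializable, no default constructor.
class LibraryQuote {
    final double value;
    LibraryQuote(double value) { this.value = value; }
}

// Attempt 1: keep the library object as a field of a Serializable wrapper.
class HoldingWrapper implements Serializable {
    private static final long serialVersionUID = 1L;
    final LibraryQuote quote;
    HoldingWrapper(double value) { this.quote = new LibraryQuote(value); }
}

// Attempt 2: extend the library class and mark the subclass Serializable.
class ExtendingWrapper extends LibraryQuote implements Serializable {
    private static final long serialVersionUID = 1L;
    ExtendingWrapper(double value) { super(value); }
}

class SerializationDemo {
    // Writes the object to a byte array and reads it back.
    // Returns "OK", or the simple class name of the exception that was thrown.
    static String roundTrip(Object o) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(o);
            }
            try (ObjectInputStream in =
                     new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                in.readObject();
            }
            return "OK";
        } catch (IOException | ClassNotFoundException e) {
            return e.getClass().getSimpleName();
        }
    }

    public static void main(String[] args) {
        // Fails while writing: the field's class is not Serializable.
        System.out.println("holding:   " + roundTrip(new HoldingWrapper(1.0)));
        // Writes fine, but fails while reading: the non-serializable
        // superclass has no accessible no-arg constructor.
        System.out.println("extending: " + roundTrip(new ExtendingWrapper(1.0)));
    }
}
```

With default Java serialization, the first attempt should end in a NotSerializableException and the second in an InvalidClassException. Each data store's own serialization mechanism may of course behave differently.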
The idea was to create a Market data object that defined things like interest rates, volatility and Trade objects that distinguished between call and put options for a given strike and maturity date.
If none of that made any sense, it does not matter. Let me restate the problem in different words.
In simpler terms, take two classes of objects, pass them through a third and generate some metrics. The generated metrics are then usually passed into an elaborate map reduction framework. This second part is not being addressed here.
I thought I would develop a sophisticated model structure, but then changed my mind and kept it simple(r). So if you feel it appears half baked, that's probably true. The intention was to model as much as possible to meet the demands of the use case.
The models are somewhat like this.
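As a rough illustration only, a pair of models along these lines might look like the sketch below. Every class and field name here is an assumption made for the example (including the `spot` field and the fixed maturity date); they are not the classes from the zipped projects. `IntrinsicValuePricer` is a deliberately trivial stand-in for the "third class" and does not reproduce JQuantLib pricing.

```java
import java.io.Serializable;
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

enum OptionType { CALL, PUT }

// Market inputs shared by all trades (one object per effective date).
class MarketData implements Serializable {
    private static final long serialVersionUID = 1L;
    final String effectiveDate;  // ISO date string (assumed representation)
    final double spot;           // underlying price (assumed field)
    final double interestRate;
    final double volatility;

    MarketData(String effectiveDate, double spot, double interestRate, double volatility) {
        this.effectiveDate = effectiveDate;
        this.spot = spot;
        this.interestRate = interestRate;
        this.volatility = volatility;
    }
}

// One trade: a call or a put for a given strike and maturity date.
class EquityOption implements Serializable {
    private static final long serialVersionUID = 1L;
    final long tradeId;
    final OptionType type;
    final double strike;
    final String maturityDate;   // ISO date string (assumed representation)

    EquityOption(long tradeId, OptionType type, double strike, String maturityDate) {
        this.tradeId = tradeId;
        this.type = type;
        this.strike = strike;
        this.maturityDate = maturityDate;
    }
}

class TradeGenerator {
    // Builds `count` options, each randomly a call or a put,
    // with strikes drawn roughly uniformly between 80 and 120.
    static List<EquityOption> generate(int count, long seed) {
        Random random = new Random(seed);
        List<EquityOption> trades = new ArrayList<>(count);
        for (int i = 0; i < count; i++) {
            OptionType type = random.nextBoolean() ? OptionType.CALL : OptionType.PUT;
            double strike = 80.0 + 40.0 * random.nextDouble();
            trades.add(new EquityOption(i, type, strike, "2014-12-19"));
        }
        return trades;
    }
}

// Stand-in metric: intrinsic value only, ignoring rates and volatility.
class IntrinsicValuePricer {
    static double price(MarketData market, EquityOption option) {
        double payoff = option.type == OptionType.CALL
                ? market.spot - option.strike
                : option.strike - market.spot;
        return Math.max(payoff, 0.0);
    }
}

class ModelDemo {
    public static void main(String[] args) {
        MarketData market = new MarketData("2013-11-01", 100.0, 0.02, 0.25);
        List<EquityOption> trades = TradeGenerator.generate(1_000_000, 42L);
        double total = 0.0;
        for (EquityOption trade : trades) {
            total += IntrinsicValuePricer.price(market, trade);
        }
        System.out.println(trades.size() + " trades, total intrinsic value " + total);
    }
}
```

Making both models Serializable keeps them usable with stores that rely on Java serialization. A fixed seed makes the generated population repeatable across runs, which helps when the same data has to be loaded into several stores.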
Then, the complexity involved in extracting a query based on multiple criteria would provide further insights into the searching and filtering mechanisms.
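For reference, this is what a multi-criteria query boils down to when written as a plain in-memory loop, with no data store involved. It is only a baseline sketch: the `Trade` class, its fields and the chosen criteria (type, strike range, latest maturity) are assumptions made for the example. Each store expresses the same selection through its own query API.

```java
import java.util.ArrayList;
import java.util.List;

// Minimal hypothetical trade record used only for this example.
class Trade {
    final String type;      // "CALL" or "PUT"
    final double strike;
    final String maturity;  // ISO date, so string order matches date order

    Trade(String type, double strike, String maturity) {
        this.type = type;
        this.strike = strike;
        this.maturity = maturity;
    }
}

class TradeFilter {
    // Selects trades of the given type whose strike lies in
    // [minStrike, maxStrike] and which mature on or before latestMaturity.
    static List<Trade> select(List<Trade> trades, String type,
                              double minStrike, double maxStrike,
                              String latestMaturity) {
        List<Trade> result = new ArrayList<>();
        for (Trade t : trades) {
            boolean matches = t.type.equals(type)
                    && t.strike >= minStrike
                    && t.strike <= maxStrike
                    && t.maturity.compareTo(latestMaturity) <= 0;
            if (matches) {
                result.add(t);
            }
        }
        return result;
    }
}

class TradeFilterDemo {
    public static void main(String[] args) {
        List<Trade> trades = new ArrayList<>();
        trades.add(new Trade("CALL", 95.0, "2014-06-20"));
        trades.add(new Trade("CALL", 130.0, "2014-06-20"));
        trades.add(new Trade("PUT", 95.0, "2014-06-20"));
        trades.add(new Trade("CALL", 100.0, "2015-01-16"));

        List<Trade> hits = TradeFilter.select(trades, "CALL", 90.0, 110.0, "2014-12-31");
        System.out.println("matching trades: " + hits.size());
    }
}
```

A linear scan like this is a useful yardstick: a store's query mechanism has to return the same set, and ideally faster than scanning every deserialized object.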
That (kind of) sets the scene for my comparative exercise.
As I said, the objectives were to learn the following (amongst other things):
The systems involved in the comparison were:
I recently fitted an SSD drive to my laptop (would recommend this for a real hardware boost) and as a result have a 32-bit version of Windows 7 running on my PC. This is where the current metrics stem from. I do intend to run my comparison on my 8GB MacOS based Mac mini when time is not such a constraint.
For the IDE I used IntelliJ Community Edition (despite being a long-term Eclipse user). It is nimble and intuitive.
Also required are:
If you are unfamiliar with Maven, I strongly recommend that you spend some time picking it up. It is perhaps one of the strongest prerequisites.
All my projects were Maven-based and all evaluations were done via unit tests. I have included my zipped up projects in the setup for each evaluation. The pom.xml describes the Maven dependencies required to run the tests. Some of these dependencies come from the public web repositories, so they will work out of the box.
For example, most of my pom.xml files have this entry:
If your IDE is set to auto-import, then TestNG will download to your local repo (provided you have an internet connection). If not, use the manual method below.
A lot of the dependency jars, however, were added manually into my local repo. For example the jquantlib jar entry in my pom is:
This was added to my local repository using the following maven command (assume that your jar is stored under C:\JQuantLib\jquantlib-0.2.4.jar)
Please note that you need to have the Maven bin directory in your system path. I set my MAVEN_HOME to C:/Progra~1/apache-maven-3.1.1 and added %MAVEN_HOME%/bin to my system path (or user path if you lack admin access). On Unix, in a Bourne-style shell, you would use export PATH=$PATH:$MAVEN_HOME/bin
Similarly, you will need the latest JDK installed, with JAVA_HOME set up and the JAVA_HOME bin directory added to your path.
Each system has its own nuances when setting up the server and running the queries. Each one is described in detail below.
The Excel spreadsheet with the full data results is attached below.
Please note that in these results, Cassandra is shown in red because it repeatedly failed when executing the query against one million trades.
I did various searches for this and the suggested remedies were confusing. One posting recommended an increase in JVM size. For some reason that was beyond me, increasing the JVM size from 1GB to even 1.25GB meant that the server would not start.
Another post suggested reducing the JVM size to keep GC activity low.
A whole host of other posts were about why people had moved away from Cassandra, and others were about exhaustive JVM tuning parameters to get it going (life's too short for that!).
On the whole I would like to discount Cassandra, as I had to reduce the population size to 500,000 in order for it to work. But having made the effort, I decided to include it and mark it in red.
Please draw your own conclusions from this (I have!).
Single market data (for one effective date)
Trade data insertion for one million trades (except for Cassandra)
Single market data (for one effective date)
Trade data retrieval for one million trades (except for Cassandra)
Pricing metrics (strictly speaking, these have little to do with the data store, but the running servers may have biased the timings)
Before I started on this exercise, I must admit that I had some preconceived bias. I thought MongoDB would be the most suitable system for this particular use case. I also had high hopes for Cassandra. I sneered at the prospects of Terracotta (because of the underlying Ehcache). I did not have much of a view on Hazelcast (version 1 was dire).
After the evaluation I am disappointed with MongoDB, but not as much as I am with Cassandra (the words "barge pole" spring to mind).
Update on 20th November 2013 - there has been this very useful perspective from a knowledgeable source:
Software Systems at Hospira
Raj, I apologize for being negligent to the
details of your setup, but I just want to notice (again) couple things:
Cassandra under Windows is the same as an elephant in a bird cage,
Cassandra is designed for horizontal scalability with fault-tolerance
(no single point of failure), Cassandra is designed to run on multiple
nodes (three is a recommended minimum), Cassandra is optimized for
writes, reads are OK, and it does relatively poor job at updates. If the
goal is simplicity and single machine performance, then MongoDB is a
better choice for persistent storage (depending on use case, it can be
up to 30 times faster than MySQL, for example). So, if you are building a
backend system for massive eventually consistent highly available
datastore running on Linux, Cassandra is a good choice. If you are
trying a PC-format application, Cassandra is not good fit. Pretty
simple. On a different note, you may also try SciDB (especially, if your
target development is on Java).
With respect to MongoDB, I am not a big fan of JSON (or BSON for that matter) and it feels like XML in another form. In the world of Google protocol buffers and Coherence PortableObjects, this seems a bit archaic.
Hazelcast pleasantly surprised me. I think the product has matured well. Whether it has the industrial strength to compete with Coherence, I do not know at this point in time. But it's certainly one of those that I am going to keep a close watch on.
Terracotta was also a pleasant surprise and seemed to deliver much more than I expected (maybe my expectations were too low). However, I hate the XML-based config and the need to define searchable attributes 'a priori'.
I shall not opine on Coherence because of my inherent bias towards it. For use cases where the cache is used for simple puts and gets and data is stored not as objects but in some flattened text-based structure, using Coherence is like buying an F12berlinetta to do the supermarket shopping. It will do the job but you will be perceived as a poser or an idiot (or both).
MongoDB scores high in the ease of use stakes, as does Hazelcast. Coherence would be easy for someone familiar with its nuances, but from an out-of-the-box perspective, perhaps not.
Coherence (particularly using CQL) and Hazelcast scored quite high for me in the ease of querying and retrieval. The semantics seemed natural, no pre-configurations were required, but most importantly the queries worked off the serialized object. For simpler use cases Hazelcast seems a logical option.
Again, I think Coherence and Hazelcast scored well in my book on this. MongoDB is a decent contender, but the fact that attributes that need to be searched upon have to be registered within the document is a limitation I do not like (I haven't explored it deeply enough to know otherwise, but then I haven't done so for the others either). I can see a plethora of applications for which MongoDB would be suitable, but I would like to reserve my judgement for this particular one at the moment.
Finally, before you decide that my opinions do not match your experience of a particular product and launch a big flame in my direction: this exercise is not an in-depth study that intimately explores all the intricacies of each product (I just do not have the time for that). It's just a "first glance" opinion. If you have something to share, please feel free to leave me your thoughts and experience, positive or negative.
This is part comparison and part tutorial and I hope it has helped in some small way.