In the last several years I’ve seen several projects suffering from performance problems caused by one bad decision: inadequate choice of storage.
As Greg Young noted developers often stick to RDBMS without even considering the alternatives. I’m not going to discuss SQL vs. NoSQL, but rather share some ideas on what might affect the decision.
Data life time
In one of the projects I’ve been involved in, there was an incoming stream of real time geo location data from hundred of thousands of users. While the whole system generally worked well the performance of geo-location data storage was very poor.
As one might guess the dev team decided to store the data in RDBMS, I believe the vendor doesn’t really matter.
No ORM layer above, easy structure with several tables.
When some load testing was performed it revealed that the system can’t serve even a fraction of the load it was assumed to.
The problem with real-time geo location info is that there is no reason to store it durably, other than for historical reasons. If we are talking about latest coordinates, those change often and the best storage for them is… in-memory cache.
After some rework we have moved the writes to a queue in the backend, which dumped the coordinates to the database and the system was able to pass the load tests.
Model Responsibility
Much responsibility put on the model narrows the storage options. For years applications employed same model for both reporting and transaction processing.
In a recent talk on CQRS I gave, there was a slide with the following comparison table:
| | Commands | Queries |
| Processing | Synch/Asynch | Synch |
| Data | Mostly Normalized | Highly De-normalized |
| Form | Objects | Data Sets/Tables |
| Freshness | Very High | Very Low-High |
| Target | Isolated object | Any subset |
| Security | Reject command | Pre-filter results |
| Logic | Highly involved | Little to no |
As one might see, the ideal storage for underlying data from queries and commands point of view highly diverges. But because commands contain most of the logic of the system which developers should implement, they usually concentrate on the command-side and the query side becomes second-class citizen developed on top of commands model.
And here is where RDBMS really shines. Joins and aggregation are the magic wands that turn normalized model into de-normalized one. That’s what makes RDBMS cool – you mostly concentrate on one side of the system and get the other one very cheaply in terms of development effort. After all that’s what most of the systems that serve several hundreds users need.
When the load increases (like in the example above) the cost of joins, aggregation and full ACID cannot be denied anymore. Even systems built on top of monolithic database, usually have a second storage for reports generation (be it a copy of the OLTP database, de-normalized store, OLAP or whatever).
When applying CQRS, you logically decouple the command and query model, opening a wide set of options for stores on both sides (RDBMS, key-value storage, in-memory cache, event storages, etc).
Model Complexity
Model complexity may affect your storage options as well. Imagine the model behind Twitter-like service, which is very simple to express in terms of objects. Most of the objects are very small and can be stored in isolation, therefore, can reside in a variety of stores.
If the model requires a wide set of transactional operations and the cost of failure is high, you will likely need a more sophisticated storage than Simple DB to guard yourself from nasty errors.
In Twitter the cost of error is rather low, usually the outcome is undelivered tweet. Imagine the same thing in medical billing or accountant software.
Individual Objects Value
Sometimes individual objects are useless for being observed on their own. Imagine a paragraph of a text document stored in a separate table in a relation database. Sounds ridiculous isn’t it?
Yet, the idea of “document” is it too many times overlooked with developers stick to ORM + SQL DB.
I’ve seen once the developers implemented the whole thing relational and then figured out that they need to implement their own versioning, and conversion to portable format, which they could achieve for free using XML.
The following signs can indicate a place for “Document” model:
-
Individual objects are of low value (e.g. they would never be queried on their own outside of the “Document” scope)
-
User usually given a single “Save” button for the whole “Document” (with auto-save functionality implemented)
-
The “Document”-like model will be transmitted for usage elsewhere
Conclusion
Finally you can get the best of both worlds by breaking the system into different services. The storage decision is made then per-service, so you can evaluate storage requirements one at a time.