How to Choose the Right Database using real-world use cases?

How to Choose the Right Database using real-world use cases?

In this blog, we'll discuss the considerations you should consider while deciding the right database for your application with real-world use cases.

Contents:

  1. Integration Consideration
  2. Scaling Requirements
  3. Support
  4. Budget Consideration
  5. CAP Theorem
  6. Examples
    • Build a phone directory application for your organization.
    • Mint logs and show interesting patterns for your organization.
    • Build applications like Google Analytics for millions of users.
    • A movie or show recommendation for Netflix or Prime Video.
    • A stockbroker application like Zerodha or Upstox.

How do you make this decision when you're architecting a given system?

It's not an easy decision that's why people who make these decisions make the big bucks but I can at least guide you through some of the considerations that you might make when making this decision as to what database if any you might need for your system.

There's no real formula you can follow but there are things you should think about or at least things you should consider.

1. Integration Consideration

One thing to think about is what systems you need to integrate so different technologies have different connectors for different other technologies.

For example, if you have a big analytics job that's currently running an Apache Spark then you probably want to limit yourself to external databases that can connect easily to Apache Spark. Most of the ones we've talked about today can but that's just one example.

Maybe you have some front-end system that depends on having a SQL interface to a back-end database and you're thinking about moving from a monolithic relational database to a distributed non-relational database. In that case, it might make life a lot easier if the non-relational database you're moving to offer some sort of SQL-like interface that can be easily migrated to from your front-end application.

So think about the pieces that need to talk together in your system and see if they can talk together or not with existing off-the-shelf components and whether those components are well maintained and up-to-date that might limit your choices right then and there so it's the first thing to think about.

2. Scaling Requirements:

How much data are you talking about? Is it going to grow unbounded over time? If so then you need some sort of database technology that is not limited to the data that you can store on one PC right so you're going to have to look at something like a Cassandra or MongoDB or HBase where you can actually distribute the storage of your data across an entire cluster and scale horizontally instead of vertically.

Think too about your transaction rates. How many requests do you intend to get per second? If we're talking about thousands then again a single database service server is not gonna cut it, you need something that's distributed where you can spread out a load of those transactions more evenly.

Typically in these sorts of situations, we're talking about a big website where we have a lot of web servers that are serving a lot of people at the same time. In those situations, you need to be looking at distributed NoSQL databases more so than monolithic relational databases.

3. Support:

The third thing you want to look into is the support you might need. Do you have the in-house expertise to spin up this new technology and configure it properly? It's gonna be harder than you think especially if you're using this in the real world or in any sort of situation where you have personally identifiable information in the mix from your end-users in that case you need to make sure you're thinking very deeply about the security of your system.

If you are in a big organization that has these experts in-house that's great don't even think about it but if you're in a smaller organization you might want to consider does this technology I'm choosing offer professional paid support that will help guide me through these setup decisions in the initial administration of my server over time or are their administrators that I can outsource the ongoing administration to overtime.

So, do your homework and try to figure out, can I do this on my own and if not what resources are out there in the marketplace to help me.

4. Budget Considerations:

Well! Apart from the cost of support you're probably not going to be thinking about money very much. Because everything we've talked about so far is open-source you know it is free to use you're going to be probably installing it on Linux servers where the operating system is also free really.

We're just talking about the cost of the servers themselves here. So, think about the cost of those servers they can be significant in a large application or a large distribution but apart from the cost of the servers budget usually isn't a concern when you're designing these systems.

5. Cap Theorem:

Cap theorem our old buddy where we need to pick two out of three consistency, availability, or partition tolerance.

When you're thinking about the scale of your requirements - Do you need to have partition tolerance? Do you have a sufficient scale where you know you're going to eventually need more than one server serving up this data just for handling the transactions? If so partition tolerance is non-negotiable you need that one and your only real choice, in that case, is consistency or availability.

Now, The type of application will determine what you want.

If it is not okay if your system goes down for a few seconds or few minutes then availability is gonna be your prime concern. Is it okay if you have eventual consistency where if you write something people might get the old value back on subsequent reads for a few seconds if so who cares about consistency right again I would take availability instead?

But if you're dealing with something that's dealing with real transactional information like you know stock transactions or some sort of financial transactions you might value consistency above all else. In that case, you want to focus on Partition Tolerance & Consistency.

2.png

Let's understand how to read this CAP diagram here. Cassandra for example lies on the availability and partition tolerance side of the triangle where we're favoring these two over consistency. And when we talk about HBase & MongoDB they are favoring consistency and partition tolerance above availability.

Now I should point out that the Cap Theorem isn't a hard and fast rule.

In all honesty, you know the lines are getting blurred between these different trade-offs over time.

To be honest, the truth is any of these technologies can be made to work in pretty much any situation if you try hard enough. It's a question of you choosing the technology that's best suited to the trade-offs.

One suggestion I'd like to give is to keep it simple stupid. Simplicity is the guiding principle that all of your architecture decisions should be based around.

Examples

In this section, we will see few application examples and try to answer few questions first & then decide which database can be suitable for our use case.

1. Build an internal phone directory application

Explanation: I need to build an intranet page for my company where I can look up phone numbers, names & email addresses of my co-workers.

Scale: Limited, because I'm never gonna have soo many employees in my organization which I can't handle on a single machine.

Consistency: Eventual consistency is fine, It's not a big deal if someone changes their phone number and people get the old number for few seconds after the update has been made. So consistency is not a big deal, eventual consistency for few seconds or few minutes is fine.

Availability: Not mission-critical, even if the page is down for few seconds it does not impact any mission at the end of the day.

Which DB should we use?: Any relational DB (MySQL, PostgreSQL, etc.) will be good for this use case.

2. Mine Server logs for interesting patterns like Google Analytics for your internal platform.

Explanation: In the application, you might want to answer few questions like what's the most popular time of the day, average session length, geographic location of the users, etc.

Scale: Do I even have enough scale here to warrant a NoSQL database at all why am I thinking about this question right if all I'm doing is analytics. That's what Hadoop or Apache Spark is for. All we can do is we can import this data into the HDFS cluster & analyze it offline.

If you're just using this internally for analytic use there's no need to even talk about things like NoSQL or Cassandra or something like that so the answer, in that case, would be no DB required just use your Hadoop cluster and the capabilities that the Hadoop ecosystem gives you without resorting to outside database technologies.

Which DB should we use?: No need for any external DB just use Hadoop or Apache Spark cluster for this kind of application.

3. Build something like Google Analytics for millions of users.

Explanation: It is similar to the above example, but in this application, we are going to expose these analytics to millions of people wanted to hit it and get answers from it.

Scale: Very High Scaled, as it will be used by millions of users and transactions per second (TPS) is also high because it will be used by a lot of users concurrently.

Consistency: The logs are generated very frequently but it does not have any impact if the users see old data for few seconds after the new logs are created. So we can move ahead with eventual consistency for this application.

Availability: As the application are used by millions of people and we do not want to throw 404 at any time when they look for data so the application has to be highly available.

Which DB should we use?: As the data are huge and it can have variant schema so a NoSQL is a good option. Availability is far more important than consistency so we can go for DBs like DynamoDB, CouchDB, or Cassandra.

4. Movie Recommendation for Netflix or Prime Video.

Explanation: We need to send a new movie or show a recommendation to users by analyzing the users' behavior data, rating, or logs.

Scale: These applications are used by millions of users simultaneously. So the scale is massive.

Consistency: Let's suppose we have a Spark job or something else that produces movie recommendations for my end-users once a night you know the supposition here is that people's recommendations won't change very quickly so computing them offline periodically is acceptable and in this case, we can give up consistency. It's okay to render old recommendations for a few seconds after new recommendations have been computed for a user.

Availability: But downtime is not tolerated. So, the application has to be made as highly available as possible.

Which database can we use?: Cassandra can be a good option as it makes sure to prefers availability over consistency.

5. Build a stockbroker platform like Zerodha, Upstox?

Let me know in the comment section which database should you use for this application.