When to use Cassandra?

Priya Panthi
5 min read · Jun 29, 2021
Cassandra: a Trojan priestess of Apollo. You know how cool it is to pick technology names from Greek mythology.

What does this even mean? When should you use Cassandra?

Well, before you decide which database you need, you must have a thorough understanding of the use cases it will serve. You need to religiously write down every single use case, identify the access patterns, and make a fair estimate of the storage and throughput requirements (for all the queries).

Before I jump onto the question at hand, let’s take a stab at a more general problem: what characteristics of a storage system do you look at (taking Cassandra as an example)?

Following is a characteristics map of Cassandra with respect to the areas that drive the decision, such as consistency, replication, partitioning, transactions, etc.

Database Type: NoSQL

Data Model: As general advice, when deciding on the data model, always choose the modelling that best represents your application and cleanly serves your access patterns. Look at the types of relationships that exist between the entities.
Cassandra is a wide-column store: structurally a table with rows and columns, though rows need not populate the same set of columns. This is often described as schemaless or schema-on-read, but note that tables created through CQL do declare their columns and types up front. It can also be interpreted as a key-value store. Please note that the database is not columnar.

Storage Engine Type: LSM Tree. As a rule of thumb, LSM Trees are faster for writes than B-Trees (so if your application is write heavy you could consider this option). For reads, though, it’s always recommended to benchmark with your particular workload; old benchmarks have been inconclusive.

Storage Engine: Apache Cassandra 3.x Storage Engine

Query Language: CQL (Cassandra Query Language)
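
To make the LSM Tree point above a little more concrete, here is a toy sketch in Python of why writes are cheap in this kind of engine: a write only touches an in-memory table, which is flushed to an immutable sorted run once it fills up, while a read may have to look through several runs. The class name TinyLSM and the memtable_limit parameter are made up for illustration; a real engine also has a commit log, compaction and bloom filters, all of which are left out here.

```python
import bisect


class TinyLSM:
    """Toy LSM tree: a mutable memtable plus immutable sorted runs (hypothetical)."""

    def __init__(self, memtable_limit=2):
        self.memtable = {}      # recent writes, held in memory
        self.sstables = []      # flushed runs as (sorted_keys, values), newest last
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        # A write only touches memory; nothing on "disk" is updated in place.
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self._flush()

    def _flush(self):
        # Freeze the memtable into a sorted, immutable run.
        items = sorted(self.memtable.items())
        keys = [k for k, _ in items]
        values = [v for _, v in items]
        self.sstables.append((keys, values))
        self.memtable = {}

    def get(self, key):
        # A read checks the memtable first, then the runs from newest to oldest.
        if key in self.memtable:
            return self.memtable[key]
        for keys, values in reversed(self.sstables):
            i = bisect.bisect_left(keys, key)
            if i < len(keys) and keys[i] == key:
                return values[i]
        return None
```

Notice that get may have to probe every run before giving up, which is one reason read performance depends so much on the workload.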

Now before I go ahead and suddenly make our database distributed, take a look at the access patterns of all the use cases and the storage and throughput requirements that you worked out.

Would you require replication? Check whether you need any of the following:

1. Keeping data geographically close to the users.

2. Fault tolerance

3. Scale out (serve a lot of read queries)

Would you require partitioning? This also requires a quick glance at the storage and throughput parameters. We will discuss this in the section below.

Now let’s weigh what is important for you: consistency or availability (this is almost always considered in the context of replicating the same data on multiple nodes). According to CAP, when a network partition occurs you have to choose one of the two, but this is a bit more complex with leaderless replication. Read on to the next section.

Replication:

Replication Type: Leaderless Replication.

Considering you have n replicas, every write will eventually go to all the nodes, but you can pick a number w, the number of replicas that must be updated synchronously for every write request. If your application is write heavy, you could set w = 1; then, to offset the inconsistency this creates until eventual consistency is reached, every read request has to read from a larger number of nodes (let’s call this number r). It’s generally suggested to keep w + r > n (and this in no way ensures strong consistency), but you can go for w + r ≤ n if availability is more important for your application. The point is that you can move across a spectrum, from very high availability/very low consistency to very low availability/high consistency, by tweaking w and r.
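
The w + r > n rule is easier to trust once you see it checked by brute force. The sketch below, in Python, enumerates every possible set of w written replicas and r read replicas and asks whether a read could miss the latest write entirely. The function names are made up for illustration, and the model is deliberately naive: it ignores concurrent writes, partially failed writes and clock issues, which is exactly why overlapping quorums still don't give you strong consistency.

```python
from itertools import combinations


def quorums_overlap(n, w, r):
    """The rule of thumb: read and write sets must share at least one replica."""
    return w + r > n


def can_miss_latest_write(n, w, r):
    """Brute force: is there a read set that touches none of the written replicas?"""
    nodes = range(n)
    for write_set in combinations(nodes, w):
        for read_set in combinations(nodes, r):
            if not set(write_set) & set(read_set):
                return True
    return False


if __name__ == "__main__":
    for w, r in [(1, 3), (2, 2), (1, 1)]:
        print(f"n=3 w={w} r={r}: overlap={quorums_overlap(3, w, r)}, "
              f"can miss latest write={can_miss_latest_write(3, w, r)}")
```

With n = 3, the write-heavy choice w = 1 needs r = 3 to keep the overlap, while w = 1, r = 1 trades that overlap away for availability and latency.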

Consistency: Eventual consistency. It’s imperative to note here that, irrespective of the database configuration parameters you give, strong consistency is just not guaranteed in Cassandra. This is another characteristic that can rule out this database.

Conflict Resolution: A leaderless model calls for conflict resolution. The supported strategy is LWW (Last Write Wins): when replicas disagree, the value with the latest timestamp is kept and the others are discarded. Will this do?
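
Whether LWW "will do" depends on whether you can live with silently dropped writes. Here is a minimal Python sketch of the idea, assuming each replica hands back a (timestamp, value) pair; the function names and the tie-breaking rule (keep the first argument on equal timestamps) are choices made for this illustration, not Cassandra's actual behaviour.

```python
from functools import reduce


def lww_merge(a, b):
    """Keep whichever (timestamp, value) pair has the larger timestamp."""
    # Tie-breaking on equal timestamps is arbitrary in this sketch.
    return a if a[0] >= b[0] else b


def resolve(versions):
    """Collapse the versions returned by several replicas into one winner."""
    return reduce(lww_merge, versions)


if __name__ == "__main__":
    # Two clients updated the same cart concurrently on different replicas.
    replica_versions = [
        (100, "cart=[book]"),
        (105, "cart=[pen]"),
        (103, "cart=[book, lamp]"),
    ]
    print(resolve(replica_versions))
```

The write that added the lamp is simply gone after resolution, even though the client that made it was told the write succeeded. If that is unacceptable for your use case, LWW will not do.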

Partitioning:

Support: Natively Supported

Again, this requires a glance at the access patterns. How you partition your data can have a huge impact on the performance of your database. What you need to decide is the partition key. A good key will uniformly distribute your data as well as the read-write throughput across multiple partitions. Generally the former is not a problem.

The two most common approaches are range-based and hash-based partitioning. As the names suggest, in the former you sort everything by key and partition the data by drawing boundaries, while the latter simply uses a hash function. By the looks of it, it’s clear that range queries work well with the first one, though it might lead to hotspots in certain cases, while with the latter all hell breaks loose for range queries! Luckily, Cassandra uses a compromise between the two, so you can have the best of both worlds. This doesn’t rule out the due diligence required to decide on this key, though.

Type of Partitioning: Cassandra uses a compromise between range-based and hash-based partitioning. You can have a compound primary key consisting of several columns. Only the first part is hashed to determine the partition; the other columns are used as a concatenated index for sorting the data in SSTables. If a fixed value is specified for the first column, a range scan over the remaining columns can be done efficiently.
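
The compound key idea can be sketched in a few lines of Python. In the toy table below, the first part of the key is hashed to pick a node, and the second part keeps the rows sorted inside that partition so a range scan stays on one node. Everything here is an assumption made for illustration: the names Table, Partition and node_for, the choice of four nodes, and SHA-256 as a stand-in for whatever partitioner the cluster actually uses.

```python
import bisect
import hashlib

NUM_NODES = 4  # hypothetical cluster size


def node_for(partition_key):
    """Hash only the first part of the key to choose a node."""
    digest = hashlib.sha256(partition_key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % NUM_NODES


class Partition:
    """Rows of one partition key, kept sorted by the second key part."""

    def __init__(self):
        self.keys = []     # second key parts, kept sorted
        self.values = {}   # second key part -> value

    def upsert(self, clustering_key, value):
        if clustering_key not in self.values:
            bisect.insort(self.keys, clustering_key)
        self.values[clustering_key] = value

    def range_scan(self, lo, hi):
        start = bisect.bisect_left(self.keys, lo)
        end = bisect.bisect_right(self.keys, hi)
        return [(k, self.values[k]) for k in self.keys[start:end]]


class Table:
    def __init__(self):
        # One dict per node: partition key -> Partition
        self.nodes = [dict() for _ in range(NUM_NODES)]

    def insert(self, partition_key, clustering_key, value):
        node = self.nodes[node_for(partition_key)]
        node.setdefault(partition_key, Partition()).upsert(clustering_key, value)

    def range_scan(self, partition_key, lo, hi):
        # With the first key part fixed, only one node is consulted.
        node = self.nodes[node_for(partition_key)]
        part = node.get(partition_key)
        return part.range_scan(lo, hi) if part else []
```

For example, with a key of (sensor_id, timestamp), all readings of one sensor land on one node in timestamp order, so "readings of sensor-1 between t1 and t2" is cheap, while "all readings between t1 and t2 across sensors" still has to visit every node.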

Secondary Indexes: Secondary indexes are partitioned by document, also called local indexes: each partition maintains its own secondary index covering only the rows it holds. A write only needs to touch the partition that holds the row being written. A read on the index, however, requires scatter-gather: the query must be sent to the secondary index of every partition and the results combined. This makes such reads quite expensive, and even when the partitions are queried in parallel they are prone to tail latency amplification.
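
A small Python sketch shows why local indexes make writes cheap and reads expensive. Each partition below indexes only its own rows, so a write touches one partition, but a lookup by the indexed column has to ask all of them. The class and function names, and the sequential loop standing in for parallel requests, are assumptions made for this illustration.

```python
class PartitionWithLocalIndex:
    """One partition holding its rows and a secondary index over those rows only."""

    def __init__(self, indexed_column):
        self.indexed_column = indexed_column
        self.rows = {}    # primary key -> row (a dict)
        self.index = {}   # indexed value -> set of primary keys in THIS partition

    def write(self, pk, row):
        # Drop the stale index entry if the row is being overwritten.
        old = self.rows.get(pk)
        if old is not None:
            self.index[old[self.indexed_column]].discard(pk)
        self.rows[pk] = row
        self.index.setdefault(row[self.indexed_column], set()).add(pk)

    def lookup(self, value):
        return [self.rows[pk] for pk in sorted(self.index.get(value, ()))]


def scatter_gather(partitions, value):
    """Ask every partition's local index and combine the answers."""
    results = []
    for partition in partitions:   # no partition can be skipped
        results.extend(partition.lookup(value))
    return results
```

In a real cluster each of those lookups is a network call, and the whole query is only as fast as the slowest partition, which is where the tail latency amplification comes from.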

Rebalancing: Cassandra keeps the number of partitions proportional to the number of nodes, i.e. a fixed number of partitions per node. When a new node is added, some existing partitions are chosen and split in half, and one half of each is transferred to the new node. This closely resembles ‘consistent hashing’.
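
Here is a rough Python sketch of that rebalancing idea, treating the key space as a ring of hash values cut into half-open ranges. A joining node picks a fixed number of existing ranges, splits each in half and takes over one half. The ring size, the PARTS_PER_NODE constant and the add_node function are all invented for this illustration and are not Cassandra's actual token allocation logic.

```python
import random

RING = 2 ** 16        # hypothetical size of the hash space
PARTS_PER_NODE = 4    # hypothetical fixed number of partitions per node


def add_node(ranges, node, rng):
    """ranges: list of (start, end, owner) half-open intervals covering the ring."""
    if not ranges:
        # The first node simply owns the whole ring, cut into equal pieces.
        step = RING // PARTS_PER_NODE
        return [(i * step, (i + 1) * step, node) for i in range(PARTS_PER_NODE)]

    ranges = list(ranges)
    for _ in range(PARTS_PER_NODE):
        # Pick a range owned by another node that is still large enough to split.
        candidates = [i for i, (s, e, owner) in enumerate(ranges)
                      if owner != node and e - s >= 2]
        start, end, owner = ranges.pop(rng.choice(candidates))
        mid = (start + end) // 2
        ranges.append((start, mid, owner))   # the old owner keeps one half
        ranges.append((mid, end, node))      # the new node takes the other half
    return sorted(ranges)
```

Only the data in the split halves moves to the new node; every other range stays where it was, which is the property this scheme shares with consistent hashing.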

Request Routing: Nodes use a gossip protocol to share cluster state and make routing decisions.

Transactions (ACID considerations):

Let me make this simple: ACID is the only selling point of transactions. Now there are systems that just can’t compromise on ACID, such as certain financial or mission-critical systems.

ACID is a cakewalk on a single node, though, and is generally achievable in single-leader replication systems. Now think about each of these guarantees and how difficult they become when you are dealing with multiple leaders, where writes can happen simultaneously on multiple replicas and conflicts have to be resolved. How will you ensure that all the replicas receive all the operations (CRUD) in the same order when writes are happening simultaneously on multiple replicas? Or do you think this guarantee isn’t even important? What could it break?

Cassandra provides weak transactions.

Atomicity: Provided within a single node. Not provided when several statements execute across multiple nodes. This means ‘all or none’ doesn’t really exist if a transaction spans multiple nodes.

Consistency: Cassandra implements Paxos, a consensus algorithm that can be used to build total order broadcast (TOB). TOB guarantees the same order of operations across replicas but doesn’t guarantee when the messages will be delivered, so you can still see stale values on some replicas.

Isolation: Paxos provides isolation in compare-and-set operations.

Durability: Provides durability using multiple replicas.
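
If the compare-and-set operation mentioned under Isolation is unfamiliar, the Python sketch below shows its contract: the write goes through only if the current value is what the caller expected, and the check and the write happen as one indivisible step. The Register class is made up for illustration, and the lock is a single-process stand-in; in Cassandra the agreement between replicas is reached through Paxos rounds, which this sketch does not attempt to model.

```python
import threading


class Register:
    """A single value supporting compare-and-set (hypothetical, single process)."""

    def __init__(self, value=None):
        self._value = value
        self._lock = threading.Lock()   # stand-in for agreement between replicas

    def read(self):
        with self._lock:
            return self._value

    def compare_and_set(self, expected, new):
        # The check and the write are one step; no other writer can slip in between.
        with self._lock:
            if self._value != expected:
                return False
            self._value = new
            return True


if __name__ == "__main__":
    username_owner = Register(None)
    print(username_owner.compare_and_set(None, "alice"))  # first claim
    print(username_owner.compare_and_set(None, "bob"))    # second claim of the same name
    print(username_owner.read())
```

Two users racing to claim the same username is the classic case: exactly one compare_and_set succeeds, and the loser is told so instead of silently overwriting the winner.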

Now jumping onto the question at hand!

It’s evident that Cassandra should be avoided if the use case requires strong consistency or strong atomicity properties.
