Cassandra is a distributed database management system, originally developed at Facebook to power its inbox search.
New database management system
It is expensive to rely on a single machine for various applications. A better idea is to use a set of machines to store the data, and so a more powerful distributed system was designed that lets us manage data without a single point of failure.
What is Cassandra?
Cassandra is a database technology that runs on a set of machines which interact with each other using predefined protocols. These machines store data on their own hard drives. The key is that no single machine stores all the data, and each piece of data is typically stored on at least 2 machines (the number of copies is configurable). By replicating data across machines, we ensure that the system can tolerate the failure of any one of them.
The Big Picture
Consider Facebook with its billions of users. Each user has associated data: timeline data, comments, likes, friends, photos, posts, and so on. When you open Facebook, the servers construct a news feed for you based on the data stored on them. Because all of this data is valuable, Facebook cannot afford to lose user data, and its service cannot be slow at any time. Hence it needs two things: reliability and performance. This is where Cassandra comes into play. It is designed to keep serving requests quickly while keeping the data safe.
Some Technical Jargon
Node – A single machine.
Rack – A set of a few tens of nodes.
Cluster – A collection of nodes.
Data center – A large collection of racks.
Facebook stores your data in their data centers which contain thousands and thousands of nodes across various racks.
Working of Cassandra
Cassandra constructs a cluster by placing the nodes in a ring. This is just a logical ring and is not physically present. Each node is given a random position in the ring. The ring wraps around.
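The data is stored in the form of records. Each record has a key that uniquely identifies it. Each node in the ring is allocated a range of universally unique identifiers (UUIDs), depending on its position in the ring, and stores the records whose keys fall in that range. (In Cassandra's own terminology these ring positions are called tokens, and a record's token is computed by hashing its key.)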
The red arc indicates the range of keys stored by node 1. Similarly, the blue and green arcs indicate the ranges stored by the blue and green nodes.
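The ring lookup described above can be sketched in a few lines of Python. This is only an illustration, not Cassandra's actual code: the ring size, the node names, and the node positions are assumed values, and each node is assumed to own the keys from just after the previous node's position up to and including its own position.

```python
import bisect

# Assumed values for illustration only.
RING_SIZE = 100
NODE_POSITIONS = {"node1": 20, "node2": 40, "node3": 60}


def owner_of(key, positions=NODE_POSITIONS, ring_size=RING_SIZE):
    """Return the name of the node whose range on the ring contains key."""
    ordered = sorted(positions.items(), key=lambda item: item[1])
    tokens = [position for _, position in ordered]
    # Find the first node position that is greater than or equal to the key.
    index = bisect.bisect_left(tokens, key % ring_size)
    # Past the last node, the ring wraps around to the first node.
    return ordered[index % len(ordered)][0]
```

With these assumed positions, owner_of(45) returns "node3", and owner_of(75) wraps around the ring and returns "node1".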
Write request. Say a write request comes in for which the key is 45. The client can send the request to any of the servers. Let us assume it goes to node 2. Node 2 looks at the UUID and determines that node 3 holds the data corresponding to UUID = 45, so it simply forwards the request to node 3 and waits for its reply.
Read request. Say a read request comes in with key 45. Assume it goes to node 1. Node 1 determines that the data corresponding to UUID = 45 is stored at node 3, so it forwards the request to node 3.
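The write and read paths above can be simulated with a small Python sketch. Everything here is hypothetical: the Node class, the make_cluster helper, and the key ranges (0 to 60 split across three nodes) are made up for illustration, and plain method calls stand in for network requests between machines.

```python
class Node:
    """Toy node that owns an inclusive range of keys (assumed layout)."""

    def __init__(self, name, low, high):
        self.name = name
        self.low, self.high = low, high
        self.store = {}   # records held locally by this node
        self.peers = []   # every node in the cluster, including this one

    def owns(self, key):
        return self.low <= key <= self.high

    def _owner(self, key):
        # Any node can work out which node is responsible for a key.
        return next(node for node in self.peers if node.owns(key))

    def write(self, key, value):
        """Forward the write to the owning node and report who stored it."""
        owner = self._owner(key)
        owner.store[key] = value
        return owner.name

    def read(self, key):
        """Forward the read to the owning node and return its answer."""
        return self._owner(key).store.get(key)


def make_cluster():
    nodes = [Node("node1", 0, 20), Node("node2", 21, 40), Node("node3", 41, 60)]
    for node in nodes:
        node.peers = nodes
    return nodes
```

For example, after node1, node2, node3 = make_cluster(), calling node2.write(45, "hello") stores the record on node 3, and node1.read(45) then returns "hello" even though node 1 holds no data itself.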
Reliability and Performance
Cassandra provides reliability by replicating data: instead of putting a piece of data on a single node, it stores copies on several nodes. For instance, in a cluster of 100 nodes with keys from 0 – 1000, keys 60 – 70 may be allocated to node 6 as well as node 7. This way, if node 6 fails, we still have node 7 to service requests for keys in the range 60 – 70.
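The replication example can be sketched as follows. This is a simplified model, not Cassandra's real placement logic: it assumes 100 nodes that each own 10 consecutive keys (so node 6 owns keys 60 to 69), and that each key is also copied to the next node on the ring. The function names and constants are invented for illustration.

```python
# Assumed layout for illustration only.
NUM_NODES = 100
KEYS_PER_NODE = 10
REPLICATION_FACTOR = 2


def replicas_for(key, replication_factor=REPLICATION_FACTOR):
    """Return the nodes holding a copy of key: the owner, then its successors."""
    primary = (key // KEYS_PER_NODE) % NUM_NODES
    return [(primary + i) % NUM_NODES for i in range(replication_factor)]


def live_replica(key, failed_nodes):
    """Return the first replica of key that has not failed, or None."""
    for node in replicas_for(key):
        if node not in failed_nodes:
            return node
    return None
```

Here replicas_for(65) returns [6, 7], and live_replica(65, {6}) returns 7, which mirrors node 7 taking over when node 6 fails. If both nodes are down, live_replica returns None, which is why a higher replication factor tolerates more failures.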
Cassandra provides excellent write performance and good read performance as well. It tries to keep recently written data in RAM to provide fast access, and it periodically flushes that data from RAM to the hard disk. So that a crash between flushes does not lose data, each write is also appended to a commit log on disk.
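The idea of keeping data in RAM and flushing it to disk can be illustrated with a toy store. This is a minimal sketch and not Cassandra's storage engine: the TinyStore class, the JSON file format, and the flush threshold are all assumptions made for the example, and it leaves out the commit log that protects writes which have not been flushed yet.

```python
import json
import os


class TinyStore:
    """Toy key-value store: writes go to RAM and are flushed to disk in batches.

    Keys must be strings, because JSON object keys are strings.
    """

    def __init__(self, directory, flush_threshold=3):
        self.directory = directory
        self.flush_threshold = flush_threshold
        self.memory = {}          # recent writes held in RAM
        self.flushed_files = []   # paths of flushed files, oldest first

    def write(self, key, value):
        self.memory[key] = value
        if len(self.memory) >= self.flush_threshold:
            self.flush()

    def flush(self):
        """Write everything held in RAM to a new file, then clear RAM."""
        if not self.memory:
            return
        name = f"flush-{len(self.flushed_files)}.json"
        path = os.path.join(self.directory, name)
        with open(path, "w") as handle:
            json.dump(self.memory, handle)
        self.flushed_files.append(path)
        self.memory = {}

    def read(self, key):
        # RAM holds the newest values, so check it first.
        if key in self.memory:
            return self.memory[key]
        # Then check files from newest to oldest so the latest value wins.
        for path in reversed(self.flushed_files):
            with open(path) as handle:
                data = json.load(handle)
            if key in data:
                return data[key]
        return None
```

Writes are fast because they only touch a dictionary in RAM, while reads may have to look through several files on disk, which matches the observation that writes are the faster of the two.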