Discord has around 500 million registered users.
How and where does Discord store so many messages, and how are they retrieved?
1. Discord initially stored messages in MongoDB. As the scale grew, they realized that MongoDB would not suffice for their use case because they found MongoDB sharding complicated and not very stable.
2. Discord's data is also quite skewed. Some voice channels see at most 10 messages a day, while other channels receive thousands of messages every hour.
3. Their database of choice was Cassandra. Cassandra is a KKV store, and the two Ks together make up the primary key. The first K is the partition key; it determines which node the data lives on and where it is found on disk. The second K is the clustering key; it acts as a sort key within that partition and uniquely identifies a row inside it.
4. In MongoDB, Discord indexed messages on channel_id and created_at. In Cassandra, they used channel_id as the partition key so that all messages in a channel live in the same partition. That raises the next challenge: how do you uniquely identify a message within a partition?
5. Can we use created_at as the clustering key? No, because two messages can share the same timestamp. So each message needs its own ID. How do we generate it? A random UUID would be unique, but it carries no usable ordering, and we need range queries such as fetching the last 10 messages. To solve this, Discord made sure the generated IDs sort chronologically. They used the same approach as Snowflake.
6. Snowflake is an internal service built by Twitter to generate chronologically increasing IDs. Why was this required, and why couldn't they just sort by creation time? I will talk about that in another post.
7. Migrating data to Cassandra was not easy either. They ran into oversized partitions, which led them to bucket messages by creation time, so that a busy channel is spread across several time-bounded partitions instead of one ever-growing partition.
8. Cassandra also has limitations. It is an AP database, so it favors availability over strong consistency. This caused problems when the same message was edited and deleted concurrently by two different users. Because Cassandra is eventually consistent, it cannot delete data immediately; a delete has to be replicated to the other nodes. They handled this case by tweaking how values were stored in Cassandra.
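
To make the ID and bucketing ideas above concrete, here is a minimal Python sketch of a Snowflake-style generator and a time bucket derived from the ID. This is not Discord's or Twitter's actual code. The bit layout (timestamp, then 10 worker bits, then 12 sequence bits), the 2015 epoch, the 10-day bucket width, and the names SnowflakeGenerator, bucket_for and partition_key are all assumptions chosen for illustration.

```python
import threading
import time

# All constants below are illustrative assumptions, not values from the post.
EPOCH_MS = 1_420_070_400_000           # custom epoch: 2015-01-01T00:00:00Z
WORKER_BITS = 10                       # bits reserved for the worker id
SEQUENCE_BITS = 12                     # bits reserved for the per-ms counter
TIMESTAMP_SHIFT = WORKER_BITS + SEQUENCE_BITS
BUCKET_MS = 10 * 24 * 60 * 60 * 1000   # assumed bucket width: 10 days


def _now_ms():
    return int(time.time() * 1000)


class SnowflakeGenerator:
    """Hypothetical generator of roughly time-ordered 64-bit ids."""

    def __init__(self, worker_id, clock=_now_ms):
        if not 0 <= worker_id < (1 << WORKER_BITS):
            raise ValueError("worker_id out of range")
        self.worker_id = worker_id
        self.clock = clock
        self.last_ms = -1
        self.sequence = 0
        self.lock = threading.Lock()

    def next_id(self):
        with self.lock:
            # Never move backwards, even if the wall clock does.
            now = max(self.clock(), self.last_ms)
            if now == self.last_ms:
                # Same millisecond: bump the counter to keep ids unique.
                self.sequence = (self.sequence + 1) & ((1 << SEQUENCE_BITS) - 1)
                if self.sequence == 0:
                    # Counter wrapped: wait for the next millisecond.
                    while now <= self.last_ms:
                        now = self.clock()
            else:
                self.sequence = 0
            self.last_ms = now
            return (
                ((now - EPOCH_MS) << TIMESTAMP_SHIFT)
                | (self.worker_id << SEQUENCE_BITS)
                | self.sequence
            )


def timestamp_ms(snowflake):
    """Recover the creation time (Unix ms) embedded in an id."""
    return (snowflake >> TIMESTAMP_SHIFT) + EPOCH_MS


def bucket_for(snowflake):
    """Map an id to its time bucket (number of BUCKET_MS windows since epoch)."""
    return (snowflake >> TIMESTAMP_SHIFT) // BUCKET_MS


def partition_key(channel_id, message_id):
    """Partition on (channel, bucket); the message id sorts rows inside it."""
    return (channel_id, bucket_for(message_id))
```

Because the timestamp sits in the highest bits, sorting by ID is the same as sorting by creation time, and two messages created in the same millisecond still get distinct IDs through the sequence counter. Fetching the last 10 messages of a channel then becomes a descending scan over the newest bucket, stepping back to older buckets only when the newest one holds too few rows.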