How Discord Indexes Billions of Messages
They depend heavily on open source.
Discord has millions of users sending billions of messages every single day. Now, these users want to search these messages too. How do we index these to make them searchable by different keywords in the message?
Let’s find out.
1. The simple answer is that Discord uses Elasticsearch. What is Elasticsearch? Elasticsearch is a distributed, free and open search and analytics engine for all types of data, including textual, numerical, geospatial, structured, and unstructured. It is built on Apache Lucene.
2. How does Elasticsearch store data? It stores data as JSON documents, each of which is a set of key-value pairs.
3. How does Elasticsearch enable indexing? Internally, it builds an inverted index. What is an inverted index? An inverted index lists every unique word that appears in any document and identifies all of the documents each word occurs in.
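To make the idea of an inverted index concrete, here is a minimal sketch in plain Python. It is only an illustration of the data structure, not how Elasticsearch or Lucene implement it; the sample messages and the function names `build_inverted_index` and `search` are made up for this example.

```python
import re
from collections import defaultdict


def build_inverted_index(documents):
    """Map each unique lowercase word to the sorted IDs of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        # Very naive tokenizer: lowercase, then keep runs of letters and digits.
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(doc_id)
    return {word: sorted(ids) for word, ids in index.items()}


def search(index, word):
    """Return the IDs of all documents that contain the given word."""
    return index.get(word.lower(), [])


# Hypothetical messages, keyed by a made-up message ID.
messages = {
    1: "Deploy finished on staging",
    2: "Staging is down again",
    3: "Lunch at noon?",
}

inverted = build_inverted_index(messages)
print(search(inverted, "staging"))  # [1, 2]
```

A lookup is a single dictionary access on the word, which is why searching by keyword stays fast even when there are many documents.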
4. During the indexing process, Elasticsearch stores documents and builds an inverted index to make the document data searchable in near real-time. Indexing is initiated with the index API, through which you can add or update a JSON document in a specific index.
5. Now let’s see how Discord uses Elasticsearch. Elasticsearch likes it when documents are indexed in bulk. This meant that Discord couldn’t index messages as they were being posted in real time. But do you even need real-time indexing? No, because you rarely search for messages that were just posted. You want to search old messages.
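As a rough illustration of what "indexing in bulk" looks like, the sketch below builds the request body for Elasticsearch's `_bulk` endpoint, which expects newline-delimited JSON: one action line followed by one document line per message. The index name `messages-example` and the message fields are assumptions made up for this example, and the code only builds the payload; it does not contact a cluster.

```python
import json


def build_bulk_body(index_name, messages):
    """Build a newline-delimited JSON body suitable for POST /_bulk."""
    lines = []
    for message in messages:
        # Action line: tells Elasticsearch to index the next line as a document.
        action = {"index": {"_index": index_name, "_id": str(message["id"])}}
        lines.append(json.dumps(action))
        # Source line: the document itself.
        lines.append(json.dumps(message))
    # The bulk body must end with a newline.
    return "\n".join(lines) + "\n" if lines else ""


# Hypothetical batch of messages; the field names are illustrative only.
batch = [
    {"id": 1, "author_id": 42, "content": "hello world"},
    {"id": 2, "author_id": 7, "content": "search me later"},
]

body = build_bulk_body("messages-example", batch)
print(body)
```

Sending one such body for many messages costs one round trip instead of one request per message, which is the reason batching pays off.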
6. Now, let’s see what we need for bulk indexing:
1. A distributed queue for transiently storing incoming messages.
2. A bunch of indexer workers, each of which indexes a batch of messages into Elasticsearch.
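The two pieces above can be sketched with Python's standard library. This is a single-process toy under stated assumptions: `queue.Queue` stands in for the distributed queue, `index_batch` is a hypothetical callback standing in for the bulk request to Elasticsearch, and the batch size of 3 is arbitrary.

```python
import queue


def drain_batch(pending, batch_size):
    """Take up to batch_size messages off the queue without blocking."""
    batch = []
    while len(batch) < batch_size:
        try:
            batch.append(pending.get_nowait())
        except queue.Empty:
            break
    return batch


def run_indexer(pending, batch_size, index_batch):
    """Index queued messages batch by batch until the queue is empty."""
    batches = 0
    while True:
        batch = drain_batch(pending, batch_size)
        if not batch:
            return batches
        index_batch(batch)  # In a real worker this would be one bulk request.
        batches += 1


# Producer side: new messages are queued instead of being indexed immediately.
pending = queue.Queue()
for message_id in range(1, 8):
    pending.put({"id": message_id, "content": f"message {message_id}"})

# Worker side: here index_batch just records each batch for inspection.
indexed_batches = []
run_indexer(pending, batch_size=3, index_batch=indexed_batches.append)
print([len(batch) for batch in indexed_batches])  # [3, 3, 1]
```

The queue decouples the rate at which messages arrive from the rate at which they are indexed, so a burst of messages only makes the queue longer for a while.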
7. For the queue, Discord uses Celery, an open-source distributed task queue. Now, Elasticsearch won’t be running on a single server. It will be running as a set of clusters. So the question is: where should a message go? On which cluster?
8. This is decided by a shard allocator, which picks the shard a message goes to. But wait a minute. What is a shard? Here, a shard is the combination of an Elasticsearch cluster and an index on that cluster. Together these two form a shard, which Discord uses as a unit. Elasticsearch itself also has its own shards, but those are a different concept, so don’t get confused.
9. Now, the final part is service discovery: finding the Elasticsearch clusters and the hosts within each cluster. Discord does this with etcd, another open-source tool.
A great thing to notice here is that Discord relies heavily on open-source systems and their base implementations, which is very different from a lot of other products.
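Here is one way a shard allocator could look. This is a hypothetical sketch, not Discord's implementation: it assumes that every message from the same server (identified by a made-up `guild_id`) goes to the same shard, and that a new server is placed on the shard that currently has the fewest servers. The class and field names are invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Shard:
    """A shard in the sense used above: a cluster plus an index on it."""
    cluster: str
    index: str


class ShardAllocator:
    """Assign each server to one shard and remember the choice."""

    def __init__(self, shards):
        # Assumption: "load" is simply the number of servers on a shard.
        self._load = {shard: 0 for shard in shards}
        self._assignment = {}

    def shard_for(self, guild_id):
        if guild_id not in self._assignment:
            # Pick the least-loaded shard; ties go to the earliest shard listed.
            shard = min(self._load, key=self._load.get)
            self._assignment[guild_id] = shard
            self._load[shard] += 1
        return self._assignment[guild_id]


shard_a = Shard(cluster="cluster-a", index="messages-0")
shard_b = Shard(cluster="cluster-b", index="messages-0")
allocator = ShardAllocator([shard_a, shard_b])

print(allocator.shard_for(guild_id=100))  # first server lands on shard_a
print(allocator.shard_for(guild_id=200))  # second server lands on shard_b
print(allocator.shard_for(guild_id=100))  # same server, same shard as before
```

Keeping the assignment sticky means a search within one server only has to query one cluster and one index.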
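Service discovery boils down to looking up keys under a common prefix in a key-value store. The sketch below uses a plain dictionary as a stand-in for etcd so that it runs on its own; the key layout `/search/clusters/<cluster>/hosts/<host>` and the addresses are assumptions invented for this example, not Discord's actual scheme.

```python
def discover_hosts(store, cluster):
    """Return the addresses of all hosts registered under the given cluster."""
    prefix = f"/search/clusters/{cluster}/hosts/"
    return sorted(addr for key, addr in store.items() if key.startswith(prefix))


# In-memory stand-in for the key-value store; keys and values are made up.
registry = {
    "/search/clusters/cluster-a/hosts/node-1": "10.0.0.1:9200",
    "/search/clusters/cluster-a/hosts/node-2": "10.0.0.2:9200",
    "/search/clusters/cluster-b/hosts/node-1": "10.0.1.1:9200",
}

print(discover_hosts(registry, "cluster-a"))  # ['10.0.0.1:9200', '10.0.0.2:9200']
```

With a real etcd deployment, the same prefix lookup would be a range read against the store, and hosts would add or remove their own keys as they come and go.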