One billion is a really big number: one billion people is more than one in every eight people living on Earth right now. Yet several tech giants have more than one billion active users. Achieving and managing such a user base has many aspects; it requires a top-notch idea, management, investment, marketing and great products. In this post we shall not focus on all of these aspects; we shall only discuss the system design of a product that can serve a really, really huge user base.
We shall begin at a very small scale, deploying the first server for a handful of users. Then we shall gradually walk through the challenges that appear as the number of users grows and discuss ways to solve them. Keep in mind that the nature and magnitude of the problems discussed here can differ greatly with the type and purpose of the system.
For the sake of this discussion, let’s build a crowd-sourced news site, “Our News”. On this system, users will be able to publish news articles written by them. A news article is mostly a text document, but it may contain photos. Most users will visit Our News to read articles, and users can follow the writers they like. So in our system there are two main entities: users and articles. There is other important data as well: who follows whom, comments on the articles, false-news flags etc. To keep things simple, we shall mainly focus on the users and articles.
Our target is to maintain high reliability and availability for Our News. By reliability we mean that no news article posted by a user will ever be lost. By high availability we mean that users will always be able to access Our News from their devices, and that the server will serve them within an acceptable amount of time (almost instantaneously) no matter how many other active users are using the system.
The First Step
At the beginning you can launch your system on a single machine acting as the server: application server, database and files all hosted from one box. It can be set up much like your development environment, with some refinement for production. For the first few users you can keep this setup without any problem.
At this point, almost all of your focus and effort should go into optimizing the source code and database queries. This is the time when you have the luxury to change the design of a database table, or a few classes in your source code, without much trouble. You can do hundreds of things to optimize your system and increase the performance of your lone server machine, but I shall point out two things which pose no big threat at the beginning of a system but become major headaches once the number of users starts to accelerate.
- If any calculation, aggregation or decision making can be done either on the server or on the client side, always do it on the client side. Divert as much of the trivial computation to the client as possible to keep your server operations lean. The client side (browser, mobile app etc.) will always serve one user at a time, but your server will be bombarded with requests as the number of users increases.
- Aim to reduce both the number of database queries per request and unnecessary joins. This may be more difficult than it sounds, because reducing the number of queries often means adding new joins. Joins are not always bad, but if you want to horizontally scale your database in the future, or perform sharding, joins will become a major headache. At this stage of the system, keep this in mind: one moderately slower query beats a flood of tiny queries, but a few small queries beat one monstrously slow query (see the sketch after this list).
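To make the query-count point concrete, here is a minimal sketch using Python’s built-in sqlite3 module and a made-up two-table schema (the table and column names are illustrative, not part of Our News’ real schema). It contrasts the classic N+1 pattern with a single joined query:

```python
import sqlite3

# Hypothetical schema for illustration: users(id, name) and articles(id, user_id, title).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users(id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE articles(id INTEGER PRIMARY KEY, user_id INTEGER, title TEXT);
""")

def latest_articles_n_plus_1(limit=10):
    # Anti-pattern: one query for the list, then one extra query per row.
    rows = conn.execute("SELECT id, user_id, title FROM articles LIMIT ?", (limit,)).fetchall()
    result = []
    for art_id, user_id, title in rows:
        author = conn.execute("SELECT name FROM users WHERE id = ?", (user_id,)).fetchone()
        result.append((title, author[0] if author else None))
    return result

def latest_articles_single_query(limit=10):
    # Better: one joined query keeps the round-trip count constant.
    return conn.execute("""
        SELECT a.title, u.name
        FROM articles a JOIN users u ON u.id = a.user_id
        LIMIT ?
    """, (limit,)).fetchall()
```

The joined version keeps the number of round trips constant regardless of the page size, which is exactly the property that matters once requests start piling up.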
With these in mind, try to squeeze as much performance from your single server machine as possible before moving to the next stage. I believe most of my fellow system-designer colleagues will raise an eyebrow (or even laugh) at me for considering this single-machine first step, but in my experience the effort put into squeezing the best performance out of a single machine by optimizing the source code plays an important role in good performance at large scale.
Thinking Bigger
Okay, business is good and Our News is gaining popularity. The number of daily active users has passed ten thousand. On average there are 100 new articles per day, and each of the 10,000 users reads about 10 articles every day. That gives roughly 100,000 read operations per day, which is just over one read request per second. As the read-to-write ratio is 1000:1, we are mostly concerned with read operations right now.
At this stage we start to see spikes in resource usage on the server machine. To ensure that our system can take further load, we need to start taking steps. The first change is to separate the database server from the application server. As Our News articles contain photos, we can also separate out a file server. We now have three separate machines: one acting as the application server, one hosting the database and a third storing all the files and serving them (with a separate file server, the standard practice is to keep the files’ metadata in the database and the file content on the file server). On closer inspection, we can see that not all news articles are equally popular: a small portion of the articles cause most of the traffic. If we cache these popular articles in memory, we can reduce the database queries to a very high degree. Therefore we can put a cache server behind the application server. The application server will first look up an article in the cache server; if it is not there, the article will be fetched from the database server (and once it is returned, it can be cached).
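The lookup order described above is usually called the cache-aside pattern. A minimal sketch, assuming an in-process dict standing in for a dedicated cache service such as Redis or Memcached, and a placeholder database call:

```python
# Cache-aside sketch: check the cache first, fall back to the database, then populate the cache.
article_cache = {}

def fetch_article_from_db(article_id):
    # Placeholder for a real query against the database server.
    return {"id": article_id, "title": "..."}

def get_article(article_id):
    article = article_cache.get(article_id)
    if article is not None:
        return article                           # cache hit: no database round trip
    article = fetch_article_from_db(article_id)  # cache miss: go to the DB
    article_cache[article_id] = article          # populate the cache for later readers
    return article
```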
Serving One Million Users
With the architecture discussed above we shall be able to serve some thousands of users, but we shall need to scale the system again as the popularity of Our News grows. Let’s consider the scenario when we reach the one-million-user milestone. If we assume that user behavior remains the same as before, we should now expect about 10,000 new articles and 10 million article reads per day on average. That is just over one write operation per second and nearly 116 read operations per second. If our application server serves an article read request in 50 milliseconds, it can only handle 20 read operations per second. Therefore, to reach one million users, we shall have to upgrade our architecture long before that point.
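The arithmetic behind that claim, as a small back-of-envelope script (the 50 ms figure and the assumption of one request handled at a time are carried over from the text):

```python
import math

# 1,000,000 users, 10 reads per user per day, 50 ms of server time per read.
reads_per_day = 1_000_000 * 10
reads_per_second = reads_per_day / 86_400      # ~116 requests/s on average
requests_per_server = 1 / 0.050                # 20 requests/s if handled serially

servers_needed = math.ceil(reads_per_second / requests_per_server)
print(servers_needed)   # 6 servers at average load; peak traffic needs headroom on top
```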
As it is obvious that a single application server will not withstand the required volume of requests, we shall need multiple copies of the application server. They will be similar to each other, running on the same code base; there can be hundreds or thousands of them. In front of them there will be a load balancer (or a set of load balancers). For each new request, the load balancer will choose an application server to forward the request to. There are several ways to choose an application server: the load balancer can use round-robin distribution, random distribution, weighted random distribution etc. We may add some sophistication to this distribution by considering the existing load on the application servers, server health status and the network distance between the server and the client.
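As a sketch, two of the simpler selection strategies might look like this (the server names and weights are placeholders):

```python
import itertools
import random

servers = ["app-1", "app-2", "app-3"]
weights = [1, 1, 2]          # e.g. app-3 has twice the capacity of the others

round_robin = itertools.cycle(servers)

def pick_round_robin():
    # Hand requests to servers in a fixed rotating order.
    return next(round_robin)

def pick_weighted_random():
    # Pick a server at random, biased towards higher-capacity machines.
    return random.choices(servers, weights=weights, k=1)[0]
```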
After we fix the load on the application server, we find another bottleneck: the database server. Until this point we were scaling the DB server by adding horsepower: more CPU, more RAM and more storage. But now our DB server is struggling to perform all the queries, and failures are becoming more frequent. To improve things, we prepare some read-replica (slave) databases of the master database, running on separate servers. We divert all read operations to the read replicas, with a load balancer deciding which DB instance to use; only the write operations are passed to the master database. At this moment the master database is a single point of failure, meaning that if the master database goes down, the whole system stops working. To ensure availability, we can keep real-time copies of the master database (for example by using services like multi-AZ deployments on AWS); when the master database goes down, one of its standbys takes over. Alternatively, we can promote a read-replica database to master as soon as the master database fails.
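A sketch of the read/write split, with strings standing in for real database connections:

```python
import random

primary_db = "primary"                    # accepts writes
replica_dbs = ["replica-1", "replica-2"]  # read-only copies of the primary

def route(query, is_write):
    # Writes go to the primary; reads are spread across the replicas.
    target = primary_db if is_write else random.choice(replica_dbs)
    return target  # a real router would now execute `query` against `target`
```

One consequence of this design is replication lag: a read issued immediately after a write may hit a replica that has not yet applied that write, which is a trade-off worth keeping in mind.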
Similarly, we can prepare multiple copies of the cache servers and file servers. Each application server may have its own cache server, or there can be a set of cache servers behind a load balancer.
Let’s Go Global Scale
Now we are targeting an even bigger user base. Almost everyone on the planet with internet access is going to use Our News, and we want our system to be capable of serving them. When the user count reaches one billion, there will be on average 10 million new articles and 10 billion article reads per day. If each article is on average 100 kB in size, then the daily outgoing traffic is 10 billion times 100 kB = 1 PB, and the daily incoming traffic is about 1 TB, spread over 24 hours (86,400 seconds).
So we need an average outgoing bandwidth of roughly 11.6 GB/s and an average incoming bandwidth of roughly 11.6 MB/s, with peaks well above that.
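For reference, here is the same back-of-envelope estimate as a tiny script (100 kB per article is the assumption from the text, and these are daily averages, not peaks):

```python
reads_per_day = 10_000_000_000
writes_per_day = 10_000_000
article_size_bytes = 100 * 1000                       # 100 kB per article

out_per_day = reads_per_day * article_size_bytes      # 1e15 B = 1 PB/day
in_per_day = writes_per_day * article_size_bytes      # 1e12 B = 1 TB/day

print(out_per_day / 86_400)   # ~11.6 GB/s outgoing on average
print(in_per_day / 86_400)    # ~11.6 MB/s incoming on average
```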
At this scale we are handling roughly 116 write operations per second (and over 100,000 reads per second). There will be billions of rows in both the users and articles tables, so join operations will no longer be feasible: it becomes significantly difficult and time consuming to retrieve who follows whom and to prepare a dashboard of a user’s followed writers’ latest articles. Therefore we shall not be able to rely on the features of an RDBMS, and it is a good idea to switch to a NoSQL database system before reaching this stage. We can choose from multiple NoSQL options, but for our case a wide-column database system (like Cassandra) works well, since it lets us store a user’s entire following list in one entry. We could also choose a different design.
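As an illustration of the wide-column idea, a following list could be stored as one wide row keyed by the follower’s id. The dict below is only a stand-in for a real wide-column store such as Cassandra, and the column layout is an assumption:

```python
# One wide row per user: the partition key is the follower, and each followed
# writer becomes a column in that row (values here are placeholder timestamps).
following = {
    "user:42": {
        "follows:7":  "<followed-at timestamp>",
        "follows:19": "<followed-at timestamp>",
    },
}

def writers_followed_by(user_id):
    row = following.get(f"user:{user_id}", {})
    return [col.split(":", 1)[1] for col in row]   # -> ["7", "19"]
```

The point of the layout is that answering “whom does this user follow?” touches a single row instead of joining two large tables.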
Now that we are not going to perform join operations, we shall separate the large tables (like the user table, article table and file-metadata table) onto different database servers. To distribute the load further, we shall perform sharding on the tables. We can keep multiple user databases and, based on user ids, map which user lives in which database. For sharding the article table, we can separate the articles either by writer (user) id or by article id. Separating them by writer id ensures that all articles written by one user are in the same database (which can be helpful in some cases). But some writers are much more popular than others, and the databases storing their articles would receive much more traffic than the rest. To keep things balanced, we can instead distribute articles by article id. For example, if we have 10 databases for articles, we can map each article id to a number from 0 to 9 and assign the article to that database. We can use a consistent hashing algorithm for this mapping so that the load stays balanced even when databases are added or removed. Similarly, the file-metadata database can be sharded by user id, article id or file object id.
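A minimal consistent-hash ring, as a sketch of how article ids could be mapped to shards (real deployments add virtual nodes and replication, and the shard names here are placeholders):

```python
import bisect
import hashlib

def _hash(key: str) -> int:
    # Deterministic hash so every application server maps keys the same way.
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class HashRing:
    def __init__(self, shards):
        # Place each shard on the ring at the position of its hash.
        self._ring = sorted((_hash(s), s) for s in shards)
        self._keys = [h for h, _ in self._ring]

    def shard_for(self, article_id: str) -> str:
        # Walk clockwise to the first shard at or after the key's position.
        idx = bisect.bisect(self._keys, _hash(article_id)) % len(self._ring)
        return self._ring[idx][1]

ring = HashRing([f"articles-db-{i}" for i in range(10)])
print(ring.shard_for("article:123456"))
```

Compared with a plain `article_id % 10` mapping, the ring only remaps a small fraction of keys when a database is added or removed.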
Normally we only get the id of a user or an article after it has been saved to the database; with sharding, however, we need the id first in order to decide which database to save it to. To solve this, we can have an “id generating” service which consistently generates unique ids. To prevent this id-generating service from becoming a single point of failure, we must have backups for it so that another instance is activated if the current one fails.
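One common way to build such a service is to embed a timestamp, a node id and a per-millisecond sequence number into each id, in the spirit of Twitter’s Snowflake. The sketch below is an assumption about how Our News might do it, not a production-ready generator:

```python
import threading
import time

class IdGenerator:
    def __init__(self, node_id: int):
        self.node_id = node_id & 0x3FF      # 10 bits identify the generator instance
        self.sequence = 0                   # 12-bit counter within one millisecond
        self.last_ms = -1
        self.lock = threading.Lock()

    def next_id(self) -> int:
        with self.lock:
            now_ms = int(time.time() * 1000)
            if now_ms == self.last_ms:
                # A real implementation would wait for the next millisecond if this wraps.
                self.sequence = (self.sequence + 1) & 0xFFF
            else:
                self.sequence = 0
                self.last_ms = now_ms
            # timestamp | node id | sequence -> ids are unique and roughly time-ordered
            return (now_ms << 22) | (self.node_id << 12) | self.sequence
```

Because each instance has its own node id, several generators can run in parallel, which removes the single point of failure mentioned above.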
We can have multiple instances of the cache servers to reduce the traffic on each of them; each cache can work independently. The cache size can be decided from the number of DB hits: if we encounter an excessive number of DB hits, we shall increase the size of each cache server. To free up memory in the cache servers we shall need to evict cache entries regularly. We can use the least recently used (LRU) policy to free up cache memory: the entries not used recently by any user are removed from the cache first.
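A minimal LRU eviction sketch built on Python’s OrderedDict:

```python
from collections import OrderedDict

class LRUCache:
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.entries = OrderedDict()

    def get(self, key):
        if key not in self.entries:
            return None
        self.entries.move_to_end(key)        # mark as most recently used
        return self.entries[key]

    def put(self, key, value):
        self.entries[key] = value
        self.entries.move_to_end(key)
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False) # evict the least recently used entry
```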
To distribute the photos (and possibly videos) we shall use CDN servers as file servers. Files will be copied to multiple locations and delivered to users from the most easily accessible one. Frequently requested files will be cached in memory on the file servers and served from there, saving disk read time during the request.
With this design, Our News will be able to serve its basic functionality (posting news articles and reading articles posted by other users) to billions of users. To add other features, like preparing a user’s dashboard or showing statistics and reports, we shall need to add other components to the system. In those cases we might have to perform the scaling steps much earlier than discussed here.
In this post, we have discussed the general structure of a large-scale system. Depending on the functionality and features of a system, its design can differ from the one discussed here. Services like YouTube and Netflix need to put much more priority on the file servers, while services like Twitter may put more priority on the database design. The architecture of the servers will therefore be different.