Dropbox is a cloud storage service that lets users store their data on remote servers. These servers store files durably and securely, and the files are accessible from anywhere with an internet connection.
Let’s do some back-of-the-envelope calculations.
Assumptions
Storage Estimations
The Dropbox system needs to handle a huge volume of both reads and writes, and we expect the read-to-write ratio to be roughly equal. So while designing the system, we should focus on optimizing the data exchange between client and server.
At a high level, we need to store files along with their metadata, such as file name, file size, directory, and the users each file is shared with. So we need some servers to help clients upload and download files to and from cloud storage, and other servers to handle updates to the metadata about files and users. We also need a mechanism that notifies all clients whenever an update happens so that they can synchronize their files.
As shown in the diagram below, block servers will work with the clients to upload/download files from cloud storage, and metadata servers will keep metadata of files updated in an SQL or NoSQL database. Synchronization servers will handle the workflow of notifying all clients about different changes for synchronization.
The client application is responsible for monitoring changes in the workspace. It interacts with the synchronization service to process metadata updates, such as changes to a file's name or contents. It is also responsible for indexing files, sending updated chunks to cloud storage, and retrieving chunks that other clients have updated.
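To make the idea of sending only updated chunks concrete, here is a minimal sketch of how a client might detect which chunks of a file changed. The function names, the fixed-size chunking scheme, and the 4 MB chunk size are assumptions made for illustration; the text above does not specify them.

```python
from __future__ import annotations

import hashlib

# Assumed chunk size (4 MB); the article does not state a value.
CHUNK_SIZE = 4 * 1024 * 1024


def chunk_hashes(data: bytes, chunk_size: int = CHUNK_SIZE) -> list[str]:
    """Split data into fixed-size chunks and return the SHA-256 digest of each."""
    return [
        hashlib.sha256(data[i:i + chunk_size]).hexdigest()
        for i in range(0, len(data), chunk_size)
    ]


def changed_chunks(old: list[str], new: list[str]) -> list[int]:
    """Return indices of chunks in `new` that differ from, or are missing in, `old`."""
    return [i for i, h in enumerate(new) if i >= len(old) or old[i] != h]


if __name__ == "__main__":
    before = b"a" * 8 + b"b" * 8 + b"c" * 8
    after = b"a" * 8 + b"X" * 8 + b"c" * 8 + b"d" * 4
    # Tiny chunk size so the example is easy to follow.
    print(changed_chunks(chunk_hashes(before, 8), chunk_hashes(after, 8)))
```

Only the chunks at the returned indices would need to be uploaded. One known limitation of fixed-size chunking is that inserting bytes near the start of a file shifts every later chunk, so this sketch works best for in-place edits and appends.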
Major components of the client application:
The metadata database is responsible for maintaining the version and metadata information about files/chunks, users, and workspaces. It can be a relational database such as MySQL or a NoSQL database service such as DynamoDB.
Regardless of the type of database, the synchronization service should use it to provide a consistent view of the files, especially when more than one user is working with the same file simultaneously.
Many NoSQL data stores relax ACID guarantees in favour of scalability and performance, so if we opt for such a database, we need to build the required transactional guarantees into the synchronization service's logic. A relational database, by contrast, supports ACID transactions natively and can simplify the synchronization service implementation.
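As an illustration of how a relational database can carry the consistency burden, the sketch below applies a metadata update inside a transaction and only if the client edited the latest version. It uses SQLite from the Python standard library as a stand-in for MySQL; the `files` table, its columns, and the function names are hypothetical.

```python
import sqlite3


def open_metadata_db() -> sqlite3.Connection:
    """Create an in-memory database with a hypothetical `files` metadata table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE files ("
        "path TEXT PRIMARY KEY, "
        "version INTEGER NOT NULL, "
        "chunk_list TEXT NOT NULL)"
    )
    return conn


def commit_update(conn: sqlite3.Connection, path: str,
                  base_version: int, chunk_list: str) -> bool:
    """Apply the update only if `base_version` is still the current version.

    Returns True if the row was updated, False if another client got there first.
    """
    with conn:  # commits on success, rolls back if an exception is raised
        cur = conn.execute(
            "UPDATE files SET version = version + 1, chunk_list = ? "
            "WHERE path = ? AND version = ?",
            (chunk_list, path, base_version),
        )
        return cur.rowcount == 1


if __name__ == "__main__":
    conn = open_metadata_db()
    with conn:
        conn.execute("INSERT INTO files VALUES ('/a.txt', 1, 'h1,h2')")
    print(commit_update(conn, "/a.txt", 1, "h1,h3"))  # first writer
    print(commit_update(conn, "/a.txt", 1, "h1,h4"))  # stale second writer
```

The version check in the `WHERE` clause is a form of optimistic concurrency control: the second writer, whose base version is stale, updates zero rows and can be told to fetch the latest state first.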
The synchronization service is one of the critical components of the cloud storage design. It processes every client's updates to a file and synchronizes those updates across all devices. It also updates each client's local database to keep it in sync with the metadata stored on the server. All Dropbox clients, including desktop, mobile, and web clients, talk to the synchronization service to get updates from the server or push updates to the server.
This way, all clients stay in sync with the master copy stored in the Dropbox cloud. When a client is offline, all updates are stored locally, and when the client comes back online, the synchronization service syncs the data to metadata storage. These updates are then pushed to other clients or shared workspace users.
It is also possible that two clients have made changes to the same file offline. Dropbox handles such scenarios by creating a conflicted copy and saving it with the editor’s username and the save date. Users will be required to resolve that conflict manually.
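A small sketch of how a conflicted copy name could be built from the editor's username and the save date is shown below. The function name and the exact label format are assumptions for illustration; the text above only says that the username and date are included.

```python
from datetime import date
from pathlib import PurePosixPath


def conflicted_copy_name(path: str, username: str, saved_on: date) -> str:
    """Build a sibling path that tags the file with the editor and save date.

    The label format here is illustrative, not a documented specification.
    """
    p = PurePosixPath(path)
    label = f"{p.stem} ({username}'s conflicted copy {saved_on.isoformat()}){p.suffix}"
    return str(p.with_name(label))


if __name__ == "__main__":
    print(conflicted_copy_name("/docs/report.docx", "Alice", date(2024, 3, 5)))
```

Keeping the original extension at the end means the conflicted copy still opens in the same application, which makes manual resolution easier for the user.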
An important part of our architecture is a messaging middleware that will handle substantial reads and writes. So a scalable message queuing service that supports asynchronous message-based communication between clients and the synchronization service instances best fits our application's requirements.
The figure below illustrates two types of queues that are used in our message queuing service. The Request Queue is a global queue that is shared among all clients. Clients’ requests to update the metadata database through the synchronization service will be sent to the request queue.
Response Queues correspond to individual subscribed clients and deliver update messages to them. Because a message is deleted from a queue once a client receives it, a single shared queue could deliver each update to only one client. We therefore create a separate Response Queue for each client, so that an update intended for multiple subscribed clients reaches all of them.
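The two queue types can be sketched in memory as follows. This is a toy model under stated assumptions, not a real message broker: the `MessageBroker` class, its method names, and the rule that the sender does not receive its own update are all hypothetical.

```python
from collections import deque


class MessageBroker:
    """Toy model of one global request queue and one response queue per client."""

    def __init__(self):
        self.request_queue = deque()   # shared by all clients
        self.response_queues = {}      # client_id -> that client's own queue

    def subscribe(self, client_id):
        """Create a response queue for the client if it does not have one yet."""
        self.response_queues.setdefault(client_id, deque())

    def submit(self, client_id, update):
        """A client sends an update request to the global request queue."""
        self.request_queue.append((client_id, update))

    def process_next(self):
        """Stand-in for the synchronization service: handle one request and
        fan the update out to every other subscribed client."""
        if not self.request_queue:
            return False
        sender, update = self.request_queue.popleft()
        for client_id, queue in self.response_queues.items():
            if client_id != sender:
                queue.append(update)
        return True

    def receive(self, client_id):
        """Pop the next update for this client; the message is gone once read."""
        queue = self.response_queues[client_id]
        return queue.popleft() if queue else None


if __name__ == "__main__":
    broker = MessageBroker()
    for cid in ("laptop", "phone", "tablet"):
        broker.subscribe(cid)
    broker.submit("laptop", "rename a.txt -> b.txt")
    broker.process_next()
    print(broker.receive("phone"), "|", broker.receive("tablet"))
```

Because `receive` removes the message it returns, each client needs its own queue; with a single shared response queue, the phone reading the update would leave nothing for the tablet.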
Cloud storage stores the chunks of the files uploaded by the users. Clients directly interact with cloud storage to send and receive objects using the cloud provider's API. The separation of the metadata from the object storage enables our reference architecture to use any cloud storage as the back-end data store.
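One way to picture the separation of metadata from object storage is a narrow storage interface that any back end can satisfy. The sketch below is an assumption-laden illustration: `ChunkStore`, `InMemoryStore`, `upload_file`, and `download_file` are hypothetical names, and keying chunks by their SHA-256 digest is a design choice not stated in the text.

```python
from __future__ import annotations

import hashlib
from typing import Protocol


class ChunkStore(Protocol):
    """Minimal interface a storage back end would need to offer."""

    def put(self, key: str, data: bytes) -> None: ...
    def get(self, key: str) -> bytes: ...


class InMemoryStore:
    """Stand-in back end; a real one would wrap a cloud provider's API."""

    def __init__(self) -> None:
        self._objects: dict[str, bytes] = {}

    def put(self, key: str, data: bytes) -> None:
        self._objects[key] = data

    def get(self, key: str) -> bytes:
        return self._objects[key]


def upload_file(store: ChunkStore, data: bytes, chunk_size: int) -> list[str]:
    """Store each chunk under its SHA-256 digest and return the ordered key list.

    The key list is the part that would live in the metadata database.
    """
    keys = []
    for i in range(0, len(data), chunk_size):
        chunk = data[i:i + chunk_size]
        key = hashlib.sha256(chunk).hexdigest()
        store.put(key, chunk)
        keys.append(key)
    return keys


def download_file(store: ChunkStore, keys: list[str]) -> bytes:
    """Reassemble a file from its ordered chunk keys."""
    return b"".join(store.get(k) for k in keys)
```

Since the rest of the system only sees the ordered list of chunk keys, swapping `InMemoryStore` for another back end would not require changes to the metadata side. Keying by content digest also means identical chunks are stored once.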
Thanks to Navtosh for his contribution in creating the first version of this content.