Design Twitter

We’ll follow the 5-step procedure:

1. Scenario

The CAP theorem:

Consistency: eventual consistency is enough; strong consistency is not needed.
Availability:
1. always available
2. no guarantee of returning the most up-to-date data
3. scalable, low-latency
Partition tolerance
1. continue to operate despite an arbitrary number of messages being lost.

User:

DAU = 200M
New tweets = 100M per day
Each timeline page = 20 tweets
Tweet size = 280 bytes, with another 30 bytes of metadata
1. photo size = 500KB; 20% of tweets have a photo
2. video size = 5MB; 10% of tweets have a video, and 30% of those videos will be watched
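The figures above can be turned into a rough daily storage estimate. The sketch below is only a back-of-envelope calculation: it assumes decimal units (1KB = 1,000 bytes), one photo or one video per media tweet, and no replication or compression. The function name `daily_storage_bytes` is made up for illustration.

```python
# Inputs copied from the numbers listed above.
NEW_TWEETS_PER_DAY = 100_000_000
TWEET_BYTES = 280 + 30          # tweet text plus metadata
PHOTO_BYTES = 500 * 1000        # 500KB, decimal units assumed
VIDEO_BYTES = 5 * 1000 * 1000   # 5MB, decimal units assumed
PHOTO_PERCENT = 20              # share of tweets with a photo
VIDEO_PERCENT = 10              # share of tweets with a video


def daily_storage_bytes():
    """Return the new bytes written per day, split by content type."""
    text = NEW_TWEETS_PER_DAY * TWEET_BYTES
    photos = NEW_TWEETS_PER_DAY * PHOTO_PERCENT // 100 * PHOTO_BYTES
    videos = NEW_TWEETS_PER_DAY * VIDEO_PERCENT // 100 * VIDEO_BYTES
    return {"text": text, "photos": photos, "videos": videos}


estimate = daily_storage_bytes()
for kind, size in estimate.items():
    print(f"{kind}: {size / 1e12:.3f} TB/day")
# text:    31 GB/day
# photos:  10 TB/day
# videos:  50 TB/day
```

Under these assumptions media dominates storage by three orders of magnitude. The "30% will be watched" figure affects read bandwidth rather than storage, so it is left out here.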

Data:

why do we need this?

Also called “push mode”.

For 99% of users, when they post a new tweet, write it to the cached timelines of all their followers.

Also called “pull”.

This is only efficient for KOLs (key opinion leaders) with >10,000 followers.

e.g. Taylor Swift, Elon Musk, etc. Only fetch their tweets when a user reads the timeline.
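The push and pull modes can be combined in one service. The following is a minimal in-memory sketch, not a production design: plain dictionaries stand in for the timeline cache and the tweet store, and the class and method names (`TimelineService`, `post`, `read_timeline`) are hypothetical. The 10,000-follower threshold comes from the text; the sketch does not handle an author crossing the threshold after some tweets were already pushed.

```python
from collections import defaultdict


class TimelineService:
    def __init__(self, pull_threshold=10_000):
        self.pull_threshold = pull_threshold
        self.followers = defaultdict(set)         # author -> follower ids
        self.following = defaultdict(set)         # user -> followed author ids
        self.tweets_by_author = defaultdict(list)  # author -> [(ts, text)]
        self.timeline_cache = defaultdict(list)    # user -> [(ts, author, text)]

    def follow(self, user, author):
        self.followers[author].add(user)
        self.following[user].add(author)

    def is_pull_author(self, author):
        # Authors above the threshold are read on demand instead of fanned out.
        return len(self.followers[author]) > self.pull_threshold

    def post(self, author, ts, text):
        self.tweets_by_author[author].append((ts, text))
        if not self.is_pull_author(author):
            # Push mode: write the tweet into every follower's cached timeline.
            for follower in self.followers[author]:
                self.timeline_cache[follower].append((ts, author, text))

    def read_timeline(self, user, page_size=20):
        merged = list(self.timeline_cache[user])
        # Pull mode: fetch tweets of high-follower authors only at read time.
        for author in self.following[user]:
            if self.is_pull_author(author):
                merged.extend(
                    (ts, author, text)
                    for ts, text in self.tweets_by_author[author]
                )
        merged.sort(reverse=True)  # newest first
        return merged[:page_size]
```

A post by a regular user costs one cache write per follower, while a post by a high-follower account costs a single write and shifts the work to each reader's timeline request.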

General Steps:

by creation time
1. Pros: N/A.
2. Cons: hot/cold shards
by user ID
1. Pros: simple
2. Cons:
  1. need to query many shards for a timeline request (follower and followee may be on different shards)
  2. Some KOLs (key opinion leaders) may not fit into 1 shard (non-uniform distribution of storage).
  3. hot users still create hot shards
by hash (tweetId)
1. Pros: uniform distribution, high availability.
2. Cons: a timeline request needs to query all shards.
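The trade-off between sharding by user ID and by hash of tweet ID can be seen in a small experiment. This is a sketch under assumptions: the shard count of 8, the MD5-based `shard_for` helper, and the dictionary-shaped tweets are all illustrative choices, not part of the original design.

```python
import hashlib

NUM_SHARDS = 8  # hypothetical cluster size


def shard_for(key, num_shards=NUM_SHARDS):
    """Map a key to a shard index using a hash that is stable across processes."""
    digest = hashlib.md5(str(key).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def shards_touched(tweets, shard_key):
    """Return the set of shards that hold the given tweets under one scheme."""
    return {shard_for(t[shard_key]) for t in tweets}


# 1,000 tweets, all written by the same hypothetical user 42.
tweets = [{"tweet_id": i, "user_id": 42} for i in range(1000)]

by_user = shards_touched(tweets, "user_id")    # everything lands on one shard
by_tweet = shards_touched(tweets, "tweet_id")  # tweets spread over the shards
print(len(by_user), len(by_tweet))
```

Sharding by user ID keeps one author's tweets together, which is simple to read but concentrates a heavy user on a single shard. Hashing the tweet ID spreads the same load evenly, at the price of a scatter-gather read across shards.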

Follower-followee: try using a graph database, which stores user relations as an adjacency list.
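As an illustration of the adjacency-list idea, the sketch below keeps the follow relation in memory as two maps, one per edge direction. It is not a graph database and the `FollowGraph` class is a hypothetical name; it only shows the shape of the data.

```python
from collections import defaultdict


class FollowGraph:
    def __init__(self):
        self.followees = defaultdict(set)  # user -> accounts the user follows
        self.followers = defaultdict(set)  # user -> accounts following the user

    def follow(self, follower, followee):
        # Store the edge in both directions so either lookup is one access.
        self.followees[follower].add(followee)
        self.followers[followee].add(follower)

    def unfollow(self, follower, followee):
        self.followees[follower].discard(followee)
        self.followers[followee].discard(follower)

    def is_mutual(self, a, b):
        # True when a follows b and b follows a.
        return b in self.followees.get(a, ()) and a in self.followees.get(b, ())
```

Keeping both directions doubles the writes per follow, but lets "who do I follow" (needed for pull) and "who follows me" (needed for push) each be answered with a single lookup.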