Design Twitter
We’ll follow the 5-step procedure:
- Scenario
- Numbers
- API and Database
- Performance
- Evolve
1. Scenario
Functional requirement
- tweets
- create
- delete
- timeline/feeds
- home timeline
- user self timeline
- follow
- like
- search
Non-Functional requirement
The C-A-P theory:
Consistency: eventual, no need strong consistency.
Availability:
- always available
- no gurantee of most updated data
- scaclable, low-latency
Partition tolerance
- continue to operate despite an arbitrary number of messaged being lost.
2. Numbers
User:
- DAU = 200M
- New tweets = 100M per day
- Each timeline page = 20 tweets
- Tweet size = 280 bytes, with another 30 bytes of metadata
- photo size: 500KB; 20% of tweets have photo
- video size = 5MB; 10% of tweets have video, among them 30% will be watched
Data:
Write size:
- Text
- 100M * 300Bytes = 30GB
- Images
- 20M * 500KB = 10,000GB
- Video
- 10M * 5MB = 50,000,GB
- Total = 60TB of new data every day.
- That is 2PB per month.
- Text
Bandwidth:
- Average user: 5 home visit + 3 profile visit, with each page having 20 tweets
- That is 200M * 8visits * 20 = 32 billion tweets/day
- Text: 100MB/s
- Images: 30GB/s
- Video: 50GB/s
why do we need this?
3. API and Database
API design
- postTweet(token, string)
- deleteTweet(token, id)
- likeOrDislike(token, id, bool likeOrDislike)
- readHomeTimeline(token, int pageSize, string paginationToken)
- readUserTimeline(token, int pageSize, string paginationToken)
Application server
Fan-out on write
Also called “push mode”.
For 99% users, when they post a new tweet, write to all their follower’s timeline on cache.
Fan-in on read
Also called “pull”.
This is only efficient for IOLs with >10,000 followers.
eg. Tylor Swift, Elon Must etc. Only fetch their tweets when user reads.
Database
- User table
- id
- name
- isHot: boolean
- Tweet table
- id
- creationTime
- content: varchar(140)
- Follower table
- userID1: integer
- userID2: integer
4. Performance
Scalability
General Steps:
- Identify bottleneck
- Find solutions
- data sharding
- load balancing
- data caching
Sharding
- by creation time
- Pros: na.
- Cons: hot/cold shards
- by user ID
- Pros: simple
- Cons:
- need to query many shards for a timeline request (follwer and followee on different shards)
- Some IOLs may not fit into 1 shard (non-uniform distribution of storage).
- also have hot user issue
- by hash (tweetId)
- Pros: uniform distribution, high availability.
- Cons: Need to query all shards.
Caching
More read than write.
Use cache to avoid frequent DB hits.
Timeline service - is just a pre-computed list of tweet IDs, stored in No-sql DB as a (userId, array
5. Evolve
Follower-followee: try use GraphDB, which is a adjacency list of user relations.