Design Twitter

We’ll follow the 5-step procedure:

  • Scenario
  • Numbers
  • API and Database
  • Performance
  • Evolve

1. Scenario

Functional requirement

  1. tweets
    1. create
    2. delete
  2. timeline/feeds
    1. home timeline
    2. user self timeline
  3. follow
  4. like
  5. search

Non-Functional requirement

The C-A-P theory:

  1. Consistency: eventual, no need strong consistency.

  2. Availability:

    1. always available
    2. no gurantee of most updated data
    3. scaclable, low-latency
  3. Partition tolerance

    1. continue to operate despite an arbitrary number of messaged being lost.

2. Numbers

User:

  1. DAU = 200M
  2. New tweets = 100M per day
  3. Each timeline page = 20 tweets
  4. Tweet size = 280 bytes, with another 30 bytes of metadata
    1. photo size: 500KB; 20% of tweets have photo
    2. video size = 5MB; 10% of tweets have video, among them 30% will be watched

Data:

  1. Write size:

    1. Text
      • 100M * 300Bytes = 30GB
    2. Images
      • 20M * 500KB = 10,000GB
    3. Video
      • 10M * 5MB = 50,000,GB
    4. Total = 60TB of new data every day.
    5. That is 2PB per month.
  2. Bandwidth:

    1. Average user: 5 home visit + 3 profile visit, with each page having 20 tweets
    2. That is 200M * 8visits * 20 = 32 billion tweets/day
    3. Text: 100MB/s
    4. Images: 30GB/s
    5. Video: 50GB/s

why do we need this?

3. API and Database

API design

  1. postTweet(token, string)
  2. deleteTweet(token, id)
  3. likeOrDislike(token, id, bool likeOrDislike)
  4. readHomeTimeline(token, int pageSize, string paginationToken)
  5. readUserTimeline(token, int pageSize, string paginationToken)

Application server

Fan-out on write

Also called “push mode”.

For 99% users, when they post a new tweet, write to all their follower’s timeline on cache.

Fan-in on read

Also called “pull”.

This is only efficient for IOLs with >10,000 followers.

eg. Tylor Swift, Elon Must etc. Only fetch their tweets when user reads.

Database

  1. User table
    1. id
    2. name
    3. email
    4. isHot: boolean
  2. Tweet table
    1. id
    2. creationTime
    3. content: varchar(140)
  3. Follower table
    1. userID1: integer
    2. userID2: integer

4. Performance

Scalability

General Steps:

  1. Identify bottleneck
  2. Find solutions
    1. data sharding
    2. load balancing
    3. data caching

Sharding

  1. by creation time
    1. Pros: na.
    2. Cons: hot/cold shards
  2. by user ID
    1. Pros: simple
    2. Cons:
      1. need to query many shards for a timeline request (follwer and followee on different shards)
      2. Some IOLs may not fit into 1 shard (non-uniform distribution of storage).
      3. also have hot user issue
  3. by hash (tweetId)
    1. Pros: uniform distribution, high availability.
    2. Cons: Need to query all shards.

Caching

More read than write.

Use cache to avoid frequent DB hits.

Timeline service - is just a pre-computed list of tweet IDs, stored in No-sql DB as a (userId, array) K-V pair. This K-V pais is constently updated.

5. Evolve

Follower-followee: try use GraphDB, which is a adjacency list of user relations.