Day 1 : Social Media Data Modeling

Lesson 1 15 min

Building the Foundation of Your Twitter Clone

What We're Building Today

You're about to design the data backbone that powers every social media platform. Today we'll create user profiles, model tweet relationships, and build the follower graph that makes social connections possible. By the end, you'll have a PostgreSQL database that can handle 1,000 concurrent users posting and reading tweets seamlessly.

Today's Agenda:

  • Design user profile schema with privacy controls

  • Model tweets with engagement tracking

  • Build bidirectional follower relationships

  • Create performance-optimized database indexes

  • Test timeline query performance

Why This Matters in Real Systems

Instagram handles 500 million daily active users with similar data structures. LinkedIn's professional network relies on follower graphs we'll build today. The indexing strategies you'll learn prevent the "fail whale" that plagued early Twitter during traffic spikes.

Core Concept: Bidirectional Graph Storage

Social media is fundamentally about relationships. When User A follows User B, we need to efficiently answer two questions:

  • Who does A follow? (following list)

  • Who follows B? (followers list)

Most beginners store this as a single "follows" table, creating performance nightmares. We'll use a dual-storage approach that makes both queries lightning fast.

The User Profile Foundation

sql
-- User profiles with privacy and performance in mind
CREATE TYPE privacy_level AS ENUM ('public', 'protected', 'private');

CREATE TABLE users (
    id BIGSERIAL PRIMARY KEY,
    username VARCHAR(50) UNIQUE NOT NULL,
    email VARCHAR(255) UNIQUE NOT NULL,
    display_name VARCHAR(100),
    bio TEXT,
    profile_image_url TEXT,
    privacy privacy_level DEFAULT 'public',
    verified BOOLEAN DEFAULT false,
    created_at TIMESTAMP WITH TIME ZONE DEFAULT NOW(),
    updated_at TIMESTAMP WITH TIME ZONE DEFAULT NOW()
);

Why BIGSERIAL? Twitter has 400+ million users. Regular INTEGER maxes at 2 billion - too close for comfort.

Tweet Data Modeling

Tweets aren't just text. They're engagement magnets with replies, likes, and retweets forming complex interaction webs.

sql
CREATE TABLE tweets (
    id BIGSERIAL PRIMARY KEY,
    user_id BIGINT REFERENCES users(id) ON DELETE CASCADE,
    content TEXT NOT NULL CHECK (LENGTH(content) <= 280),
    reply_to_tweet_id BIGINT REFERENCES tweets(id),
    original_tweet_id BIGINT REFERENCES tweets(id), -- For retweets
    media_urls TEXT[],
    hashtags TEXT[],
    mentions BIGINT[],
    like_count INTEGER DEFAULT 0,
    retweet_count INTEGER DEFAULT 0,
    reply_count INTEGER DEFAULT 0,
    created_at TIMESTAMP WITH TIME ZONE DEFAULT NOW()
);

The Genius Detail: Separate reply_to_tweet_id and original_tweet_id columns. This lets us build conversation threads AND track retweet chains efficiently.

Bidirectional Follower Graph Magic

Here's where most tutorials fail. They create one followers table and wonder why timeline queries crawl.

sql
-- Primary follower relationship
CREATE TABLE user_follows (
    follower_id BIGINT REFERENCES users(id) ON DELETE CASCADE,
    following_id BIGINT REFERENCES users(id) ON DELETE CASCADE,
    created_at TIMESTAMP WITH TIME ZONE DEFAULT NOW(),
    PRIMARY KEY (follower_id, following_id)
);

-- Reverse lookup optimization
CREATE TABLE user_followers (
    user_id BIGINT REFERENCES users(id) ON DELETE CASCADE,
    follower_id BIGINT REFERENCES users(id) ON DELETE CASCADE,
    created_at TIMESTAMP WITH TIME ZONE DEFAULT NOW(),
    PRIMARY KEY (user_id, follower_id)
);

Performance-Critical Indexes

Indexes are the secret sauce. Without them, your database performs table scans on millions of rows.

sql
-- Timeline generation (most critical)
CREATE INDEX idx_tweets_user_created ON tweets(user_id, created_at DESC);
CREATE INDEX idx_follows_follower ON user_follows(follower_id);

-- Search and discovery
CREATE INDEX idx_tweets_hashtags ON tweets USING GIN(hashtags);
CREATE INDEX idx_users_username ON users(username);

-- Engagement queries
CREATE INDEX idx_tweets_reply_to ON tweets(reply_to_tweet_id) WHERE reply_to_tweet_id IS NOT NULL;

The Timeline Index Trick: (user_id, created_at DESC) lets PostgreSQL quickly grab a user&#039;s latest tweets without sorting.

Context in Twitter Overall Architecture

Your data model sits at the foundation layer, supporting all user-facing features. The follower graph enables timeline generation. Tweet storage powers the posting API. User profiles drive authentication and personalization.

Next week, we'll build the REST API that transforms this data into a Twitter-like experience, handling 100 tweets per second with sub-100ms response times.

Architecture Workflow

  1. User Registration: Profile creation with privacy settings

  2. Social Connections: Follow/unfollow operations update both relationship tables

  3. Content Creation: Tweet posting with engagement tracking

  4. Timeline Generation: Query optimization through strategic indexing

Success Criteria

Your implementation passes when:

  • User registration completes in < 50ms

  • Follow operations update both relationship tables atomically

  • Timeline queries return in < 100ms for users with 1000+ follows

  • Database handles 1000 concurrent connections without performance degradation

Real-World Impact

Facebook's early scaling challenges stemmed from inefficient relationship queries. Twitter' 'fail whale' appeared when timeline generation couldn&#039;t keep up with user growth. Your dual-table approach and strategic indexing solve both problems at the architectural level.

Assignment Challenge

Create a user with 1000 followers and measure timeline generation performance. Document query execution plans and identify optimization opportunities. This mirrors real-world database tuning that prevents production outages.

The foundation you're building today scales to millions of users. Master these patterns, and you'll understand how every major social platform handles data at scale.

Need help?