Day 1: The Social Media Graph – Data Modeling for Social Connectivity

Lesson 1 60 min

Day 1: The Social Media Graph – Data Modeling for Social Connectivity

Welcome, engineers, to the first lesson in our journey towards building hyperscale distributed systems in Java. You’re not just learning theory here; you’re stepping into the shoes of engineers at the forefront, grappling with real-world architectural challenges. Today, we're laying the cornerstone of any social platform: the Social Media Graph. This isn't just about drawing circles and lines; it's about understanding how to model billions of relationships efficiently to sustain 100 million requests per second.

The Problem: Social Connectivity at Hyperscale

Flowchart

Start User Action (Follow) App calls Service.follow() Atomic Update: Following & Followers End

Imagine a social media platform with billions of users. Each user can follow thousands, and be followed by millions. Every "follow," "unfollow," "like," or "share" fundamentally alters the network. At 100 million requests per second, you're looking at an astronomical rate of changes and queries against this intricate web of relationships.

The core challenge isn't merely storing who follows whom. It's:

  1. Efficient lookups: Can "User A" see "User B's" private posts? Who are "User C's" followers? Who does "User D" follow? These need to be near-instantaneous.

  2. Fast updates: When "User E" follows "User F," that relationship must be established quickly and consistently (or eventually consistently, a topic for another day!).

  3. Scalability: The system must grow to accommodate an ever-increasing number of users and connections without crumbling.

A naive approach might be to simply store a List<String> followers inside a User object. But think about it:

  • Adding a follower requires checking if they're already in the list (an O(N) operation).

  • Removing a follower is also O(N).

  • Retrieving all followers means iterating a potentially huge list.

  • What if a user follows millions? That list becomes a memory hog, and operations become painfully slow, rendering your system unusable at scale.

This is where intelligent data modeling becomes your first line of defense.

Core Concept: The Adjacency List for Graph Representation

Component Architecture

Social Media Graph Architecture SocialGraphApp • main(String[] args) • Invokes Service SocialGraphService Methods: + follow(u1, u2) + unfollow(u1, u2) ASSIGNMENT LOGIC getMutualFollows() suggestFriends() following Map ConcurrentHashMap<String, Set> "Who do I follow?" followers Map ConcurrentHashMap<String, Set> "Who follows me?" Scale & Performance Notes: • Space Complexity: 2x Edges (O(2E)) to optimize read speeds (O(1)). • Thread Safety: ConcurrentHashMap prevents data corruption during 100M req/sec spikes.

At its heart, a social graph is a collection of nodes (users) and edges (relationships like "follows"). The most intuitive and efficient way to represent this in memory for many graph operations is an Adjacency List.

Instead of embedding a list of followers within each user object, we use a central structure. For our hands-on today, we'll use a Map where:

  • The key is a String representing a userId.

  • The value is a Set<String> containing the userIds of all users that the key user is following.

Why a Set and not a List? This is a critical insight for high-performance systems:

  • Set provides O(1) average-case performance for adding, removing, and checking for the existence of an element. This is because Set implementations (like HashSet or ConcurrentHashMap.newKeySet()) use hash tables.

  • A List, by contrast, requires O(N) for these operations (you might have to iterate through the entire list).

So, for a user with millions of followers, a Set operation is effectively instantaneous, while a List operation would grind your system to a halt. This choice directly impacts the latency of "follow" and "unfollow" actions, which are core to a social platform.

For efficient bi-directional lookups (i.e., "who does Alice follow?" AND "who follows Alice?"), we'll maintain two adjacency lists:

  1. following: Map<String, Set<String>> where key is followerId, value is Set<followeeId>.

  2. followers: Map<String, Set<String>> where key is followeeId, value is Set<followerId>.

This approach duplicates some data, yes, but it dramatically optimizes read performance for both directions, which is a common trade-off in hyperscale systems where reads often vastly outnumber writes. The challenge of keeping these two maps consistent during updates is real, and it's a taste of the distributed consistency problems we'll tackle later.

State Machine

User-to-User Relationship State Machine NOT_FOLLOWING (ID not in Map) FOLLOWING (ID present in Map) follow() unfollow() In-memory state is synchronized across 'following' and 'followers' maps via SocialGraphService.

Hands-On Implementation: Building Our First Graph Component

Let's build a SocialGraphService in Java. For now, we'll keep it in-memory. This simple model will be the brain of our social network's connectivity.

java
// src/main/java/com/hyperscale/socialgraph/SocialGraphService.java
package com.hyperscale.socialgraph;

import java.util.Collections;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class SocialGraphService {
    // Map: userId -> Set of userIds this user is FOLLOWING
    private final Map<String, Set<String>> following;
    // Map: userId -> Set of userIds that are FOLLOWING this user (i.e., this user's followers)
    private final Map<String, Set<String>> followers;

    public SocialGraphService() {
        // Using ConcurrentHashMap for thread safety, crucial even for in-memory services
        this.following = new ConcurrentHashMap<>();
        this.followers = new ConcurrentHashMap<>();
    }

    /**
     * Establishes a follow relationship.
     * @param followerId The ID of the user initiating the follow.
     * @param followeeId The ID of the user being followed.
     */
    public void follow(String followerId, String followeeId) {
        if (followerId.equals(followeeId)) {
            System.out.println("Error: User " + followerId + " cannot follow themselves.");
            return;
        }

        // Add to followerId's following set
        following.computeIfAbsent(followerId, k -> ConcurrentHashMap.newKeySet()).add(followeeId);
        // Add to followeeId's followers set
        followers.computeIfAbsent(followeeId, k -> ConcurrentHashMap.newKeySet()).add(followerId);
        System.out.println("SUCCESS: " + followerId + " is now following " + followeeId);
    }

    /**
     * Removes a follow relationship.
     * @param followerId The ID of the user initiating the unfollow.
     * @param followeeId The ID of the user being unfollowed.
     */
    public void unfollow(String followerId, String followeeId) {
        if (following.containsKey(followerId) && following.get(followerId).remove(followeeId)) {
            System.out.println("SUCCESS: " + followerId + " unfollowed " + followeeId);
        } else {
            System.out.println("INFO: " + followerId + " was not following " + followeeId + ".");
        }
        // Also remove from the followee's followers set
        if (followers.containsKey(followeeId)) {
            followers.get(followeeId).remove(followerId);
        }
    }

    /**
     * Gets the set of user IDs that a given user is following.
     * @param userId The ID of the user.
     * @return A set of user IDs being followed by the given user. Returns an empty set if no one is followed.
     */
    public Set<String> getFollowing(String userId) {
        return following.getOrDefault(userId, Collections.emptySet());
    }

    /**
     * Gets the set of user IDs that are following a given user.
     * @param userId The ID of the user.
     * @return A set of user IDs following the given user. Returns an empty set if no one is following.
     */
    public Set<String> getFollowers(String userId) {
        return followers.getOrDefault(userId, Collections.emptySet());
    }

    /**
     * Prints the current state of the social graph for demonstration.
     */
    public void printGraphState() {
        System.out.println("n--- Current Social Graph State ---");
        System.out.println("Following relationships:");
        following.forEach((user, follows) -> {
            if (!follows.isEmpty()) {
                System.out.println("  " + user + " follows: " + String.join(", ", follows));
            }
        });
        System.out.println("Follower relationships:");
        followers.forEach((user, byUsers) -> {
            if (!byUsers.isEmpty()) {
                System.out.println("  " + user + " is followed by: " + String.join(", ", byUsers));
            }
        });
        System.out.println("----------------------------------");
    }
}

This SocialGraphService uses ConcurrentHashMap.newKeySet() which is a factory method that returns a Set backed by a ConcurrentHashMap. This ensures thread-safety if multiple threads try to modify the graph concurrently – a common scenario in any real-world application.

Fitting into the Overall Hyperscale System

What we've built today is the conceptual core of a social graph. In a real hyperscale system, this SocialGraphService would be an API layer over a highly optimized, distributed persistence layer. This could be:

  • A specialized Graph Database (like Neo4j, JanusGraph) designed for traversing relationships.

  • A highly sharded Key-Value Store (like Apache Cassandra, Amazon DynamoDB), where each user's following and followers sets are stored as individual entries.

  • A custom-built, in-memory distributed cache layer (think Memcached or Redis at scale) backed by a durable storage.

The in-memory Map we're using today showcases the fundamental data structures and operations. The challenges of distributing this Map across hundreds or thousands of machines, ensuring consistency, and handling network partitions are what distributed systems engineering is all about, and we'll tackle them in future lessons.

Real-World Production System Application – Sizing & Challenges

Consider the scale:

  • Users: 2 billion

  • Average follows per user: 100

  • Total edges: 2 billion users * 100 follows/user = 200 billion edges.

  • Each edge is (followerId, followeeId). If each ID is an 8-byte long, that's 16 bytes per edge, plus overhead.

  • 200 billion edges * 16 bytes/edge = 3.2 Terabytes of raw relationship data.

  • Add in the overhead of Sets (hash table structure, pointers, etc.), and this number grows substantially.

  • Now imagine this data needs to be replicated for fault tolerance and sharded across thousands of servers for performance.

This is why optimizing data structures from Day 1, even in an in-memory prototype, is crucial. It informs your database choices, your sharding strategy, and your caching layers down the line.

Assignment: Expanding Our Social Graph

Your homework is to extend our SocialGraphService with two common social graph functionalities. This will solidify your understanding of graph traversal and set operations.

  1. getMutualFollows(userId1, userId2): Implement a method that returns a Set<String> of user IDs that are followed by both userId1 and userId2.

  2. suggestFriends(userId): Implement a simple friend suggestion algorithm. For a given userId, return a Set<String> of users that userId's friends are following, but userId itself is not already following. This is a basic "friends of friends" or "people you might know" algorithm.

Success Criteria:

  • All methods should be implemented within SocialGraphService.

  • The methods should correctly return the expected sets of user IDs.

  • Your SocialGraphApp (which you'll create in the start.sh script) should demonstrate these new functionalities.

Solution Hints

  • getMutualFollows: This is a classic set intersection problem. You'll need to get the following sets for both userId1 and userId2 and then find their common elements. Java's Set interface has methods like retainAll() that can be very useful (though be careful with modifying original sets; often it's better to create a copy).

  • suggestFriends:

    1.  Get all users that userId is currently following.
    2.  For each of those users (let's call them friendOfUser), get the users *they* are following.
    3.  Collect all these "friends of friends" into a temporary set.
    4.  From this temporary set, remove userId itself and any users that userId is *already* following. The remaining users are your suggestions.
    

This exercise will give you a taste of graph traversal logic, which is fundamental to many advanced features in social media (e.g., personalized feeds, trending topics, spam detection). You're already thinking like a distributed systems engineer!

Need help?