The Blueprint of Data – Designing SQL Schemas for Scale
Welcome back, architects of the digital realm! Yesterday, we laid the groundwork with a robust monorepo setup, establishing a shared foundation for our distributed system. Today, we dive into the very heart of our application's intelligence: its data. Specifically, we're going to design the SQL schema for our competitive programming platform, focusing on problems, users, and submissions.
This isn't just about drawing boxes and lines; it's about crafting the DNA of a system destined to handle millions of interactions per second. Every choice we make here—from data types to indexing strategies—has profound implications for performance, scalability, and maintainability. Let's build a schema that doesn't just work, but thrives under pressure.
Agenda for Today
- Core Concepts: SQL Schema Design Principles, Data Types, Indexing, Constraints.
- Component Architecture: How our database layer fits into the broader system.
- Data Flow & State Changes: Understanding how entities evolve and interact.
- Hands-on Schema Definition: Crafting our `users`, `problems`, and `submissions` tables.
- Scalability Insights: Why these choices matter at 100M RPS.
The Unseen Foundation: Why Schema Design is System Design
When we talk about ultra-high-scale systems, the database often becomes the bottleneck. A poorly designed schema can lead to cascading performance issues, requiring heroic (and expensive) efforts to mitigate later. Conversely, a well-thought-out schema simplifies application logic, optimizes query performance, and makes horizontal scaling strategies like sharding far more manageable.
Think of it this way: your database schema is the blueprint for your data's physical storage. Just as a skyscraper needs a solid foundation and a clear structural plan, your data needs a schema that anticipates growth and access patterns.
Core Concepts & Architecture Fit
Our database, likely a PostgreSQL instance for its robustness and rich feature set, will serve as the single source of truth for our platform. It's the persistent storage layer that our Backend API services (which we'll build in future lessons) will interact with.
- Relational Model: We're sticking with a relational model for its strong consistency guarantees and well-understood ACID properties, crucial for user accounts, problem definitions, and submission results.
- Normalization: We'll aim for 3rd Normal Form (3NF) to minimize data redundancy and improve data integrity. While denormalization can sometimes boost read performance, it often complicates writes and data consistency. For our core entities, normalization is the safer, more robust starting point.
- Data Types: Choosing the right data type isn't trivial. `BIGINT` for IDs (if auto-incrementing) or `UUID` for distributed primary keys are critical choices. `TEXT` vs. `VARCHAR(N)`, `TIMESTAMP WITH TIME ZONE` for global systems, and `JSONB` for flexible, semi-structured data (like problem test cases or configuration) are nuanced decisions we'll explore.
- Indexing: This is where performance lives or dies. We'll use B-tree indexes for primary keys, foreign keys, and frequently queried columns. Understanding which columns to index, and why, is paramount. Over-indexing can hurt write performance, while under-indexing cripples reads.
- Constraints: `NOT NULL`, `UNIQUE`, `FOREIGN KEY`, and `CHECK` constraints are our guardians of data integrity. They ensure our data remains valid and consistent, preventing application-level bugs from corrupting our database.
Designing Our Tables
Let's get concrete. We'll define three core tables: `users`, `problems`, and `submissions`.
1. users Table
This table stores information about our platform's users.
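A concrete definition might look like the sketch below. It follows the columns discussed in the insights; the `email` column and the exact `status` values are assumptions, and `gen_random_uuid()` is built in from PostgreSQL 13 (enable the `pgcrypto` extension on older versions).

```sql
-- Sketch of the users table (PostgreSQL).
CREATE TABLE users (
    id            UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    username      VARCHAR(50) UNIQUE NOT NULL,
    email         VARCHAR(255) UNIQUE NOT NULL,      -- assumed column
    password_hash TEXT NOT NULL,                     -- never store plain passwords
    status        VARCHAR(20) NOT NULL DEFAULT 'active'
                  CHECK (status IN ('active', 'suspended', 'deleted')),
    created_at    TIMESTAMP WITH TIME ZONE NOT NULL DEFAULT now(),
    updated_at    TIMESTAMP WITH TIME ZONE NOT NULL DEFAULT now()
);

-- Speeds up lookups filtered by account status.
CREATE INDEX idx_users_status ON users (status);
```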
Insights:
- `id UUID PRIMARY KEY DEFAULT gen_random_uuid()`: For ultra-high scale, `UUID`s are superior to auto-incrementing `BIGINT`s. They prevent bottlenecks at a central sequence generator, simplify sharding (no coordination needed to generate unique IDs across shards), and are globally unique. `gen_random_uuid()` is a PostgreSQL function for this.
- `username VARCHAR(50) UNIQUE NOT NULL`: Enforces uniqueness and a reasonable maximum length. `NOT NULL` prevents orphaned data.
- `password_hash TEXT NOT NULL`: `TEXT` is used because password hashes are variable-length and often long (e.g., bcrypt hashes). Never store plain passwords!
- `status VARCHAR(20) DEFAULT 'active' CHECK (...)`: Using an `ENUM` type or a `CHECK` constraint with `VARCHAR` ensures data integrity by limiting possible values. `CHECK` constraints are a powerful, often underutilized feature for enforcing business logic at the DB level.
- `TIMESTAMP WITH TIME ZONE`: Absolutely critical for any global application. It stores timestamps normalized to UTC, avoiding time zone conversion headaches and ensuring consistency.
- `idx_users_status`: If we frequently query for active users or suspended users, this index will speed up those lookups.
2. problems Table
This table defines the programming challenges users will solve.
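One possible definition, consistent with the insights below, is sketched here; the `difficulty` column and the timestamp column are assumptions beyond what the lesson specifies.

```sql
-- Sketch of the problems table (PostgreSQL).
CREATE TABLE problems (
    id           UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    title        VARCHAR(255) UNIQUE NOT NULL,
    description  TEXT NOT NULL,
    difficulty   VARCHAR(20)                          -- assumed column
                 CHECK (difficulty IN ('easy', 'medium', 'hard')),
    tags         TEXT[] NOT NULL DEFAULT '{}',        -- native array type for tags
    test_cases   JSONB NOT NULL,                      -- hidden judging test cases
    starter_code JSONB,                               -- per-language starter code
    created_at   TIMESTAMP WITH TIME ZONE NOT NULL DEFAULT now()
);

-- GIN index enables efficient searches inside the tags array.
CREATE INDEX idx_problems_tags ON problems USING GIN (tags);
```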
Insights:
- `title VARCHAR(255) UNIQUE NOT NULL`: Problem titles should be unique and concise.
- `description TEXT NOT NULL`: Problems can have lengthy descriptions.
- `tags TEXT[]`: PostgreSQL's native array type is incredibly useful for simple tag management. For more complex tag queries or many-to-many relationships, a separate `tags` table and a `problem_tags` join table would be better, but `TEXT[]` is a practical starting point.
- `test_cases JSONB NOT NULL`: This is a game-changer. `JSONB` (binary JSON) allows us to store arbitrary, evolving structures for test cases (e.g., `[{ "input": "...", "output": "..." }]`) directly in the database. It can be indexed efficiently and queried, offering schema flexibility without sacrificing performance.
- `starter_code JSONB`: Similar to `test_cases`; allows flexible storage of starter code for different languages.
- `GIN (tags)` index: A Generalized Inverted Index (GIN) is perfect for indexing array types (`TEXT[]`) or `JSONB` columns, allowing efficient searching within them. This is a powerful, non-obvious technique for flexible schemas.
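With a GIN index on `tags` in place, array containment queries can use the index instead of scanning every row. A small illustration (the tag values are made up for the example):

```sql
-- Find problems carrying BOTH tags; the @> containment operator
-- is served by a GIN index on the tags column.
SELECT id, title
FROM problems
WHERE tags @> ARRAY['graphs', 'dynamic-programming'];
```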
3. submissions Table
This table records every attempt a user makes at a problem.
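A sketch of the DDL follows. The terminal ENUM values beyond `accepted`, `wrong_answer`, and `time_limit_exceeded` are assumptions (the lesson says "etc."), as is the `language` column.

```sql
-- Sketch of the submissions table (PostgreSQL).
-- A custom ENUM models the submission state machine.
CREATE TYPE submission_status AS ENUM (
    'pending', 'running', 'accepted', 'wrong_answer',
    'time_limit_exceeded', 'memory_limit_exceeded', 'runtime_error'
);

CREATE TABLE submissions (
    id              UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    user_id         UUID NOT NULL REFERENCES users(id) ON DELETE CASCADE,
    problem_id      UUID NOT NULL REFERENCES problems(id) ON DELETE CASCADE,
    code            TEXT NOT NULL,
    language        VARCHAR(30) NOT NULL,             -- assumed column
    status          submission_status NOT NULL DEFAULT 'pending',
    runtime_ms      INT,                              -- populated after judging
    memory_usage_kb INT,                              -- populated after judging
    created_at      TIMESTAMP WITH TIME ZONE NOT NULL DEFAULT now(),
    judged_at       TIMESTAMP WITH TIME ZONE
);

-- Foreign keys should almost always be indexed.
CREATE INDEX idx_submissions_user_id ON submissions (user_id);
CREATE INDEX idx_submissions_problem_id ON submissions (problem_id);
-- Composite index for "this user's submissions on this problem, newest first".
CREATE INDEX idx_submissions_user_problem
    ON submissions (user_id, problem_id, created_at DESC);
```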
Insights:
- `CREATE TYPE submission_status AS ENUM (...)`: Using a custom `ENUM` type is ideal for fixed sets of values. It's more type-safe and often more performant than `VARCHAR` with a `CHECK` constraint for frequently used status fields.
- `user_id UUID NOT NULL REFERENCES users(id) ON DELETE CASCADE`: A foreign key constraint links submissions to users. `ON DELETE CASCADE` means that if a user is deleted, all their submissions are automatically deleted. This is a strong choice for referential integrity, but make sure your application logic is okay with this cascading deletion. For sensitive data, `ON DELETE RESTRICT` or `ON DELETE SET NULL` might be preferred, often accompanied by "soft deletes" (marking records as deleted rather than truly removing them).
- `problem_id UUID NOT NULL REFERENCES problems(id) ON DELETE CASCADE`: A similar foreign key for problems.
- `code TEXT NOT NULL`: The actual code submitted by the user.
- `runtime_ms INT, memory_usage_kb INT`: Metrics from the judging process. These are nullable initially, as they're populated after judging.
- `idx_submissions_user_id`, `idx_submissions_problem_id`: Essential indexes for quickly fetching all submissions by a user or all submissions for a specific problem. Foreign keys should almost always be indexed.
- `idx_submissions_user_problem`: A composite index (on multiple columns) is crucial for queries like "show me all submissions for a specific user on a specific problem, ordered by most recent." This can significantly boost performance for common UI patterns.
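That composite index directly serves the "submission history" query a problem page would issue. A sketch, using positional parameters as a backend service would:

```sql
-- One user's recent attempts at one problem, newest first.
-- A composite index on (user_id, problem_id, created_at DESC)
-- lets this run as an index scan with no separate sort step.
SELECT id, status, runtime_ms, created_at
FROM submissions
WHERE user_id = $1
  AND problem_id = $2
ORDER BY created_at DESC
LIMIT 20;
```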
Control Flow, Data Flow, and State Changes
Control Flow: Our backend services will issue SQL commands (DML: `INSERT`, `UPDATE`, `DELETE`; DQL: `SELECT`) against this database.

Data Flow:
- User Registration: Frontend -> Backend API -> `INSERT` into `users`.
- Problem Creation: Admin UI -> Backend API -> `INSERT` into `problems`.
- Code Submission: Frontend -> Backend API -> `INSERT` into `submissions` (with `status = 'pending'`).
- Judging Result: Judging Service -> Backend API -> `UPDATE` `submissions` (set `status`, `runtime_ms`, `memory_usage_kb`, `judged_at`).
- Viewing Submissions: Frontend -> Backend API -> `SELECT` from `submissions` (often joining with `users` and `problems`).

State Changes (e.g., for a Submission): A submission starts in the `pending` state, transitions to `running` when picked up by a judge, and finally reaches one of the terminal states (`accepted`, `wrong_answer`, `time_limit_exceeded`, etc.). This state machine is modeled by the `submission_status` ENUM and updated via `UPDATE` statements.
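The judging-result transition above boils down to a single `UPDATE`. A sketch (the guard on the current `status` is an assumption added to keep the state machine one-directional, so a late-arriving verdict can't overwrite a terminal state):

```sql
-- Record a judge's verdict for one submission.
UPDATE submissions
SET status          = 'accepted',
    runtime_ms      = 42,
    memory_usage_kb = 10240,
    judged_at       = now()
WHERE id = $1               -- the submission being judged
  AND status = 'running';   -- only transition out of the running state
```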
Real-Time Production System Application (100M RPS)
At 100 million requests per second, every schema choice is amplified.
- UUIDs for Primary Keys: As discussed, essential for distributed ID generation and sharding.
- Appropriate Data Types: Minimize storage footprint and improve cache locality, crucial for databases operating at scale. `TEXT` for large blobs, `VARCHAR` for bounded strings.
- Strategic Indexing: This is non-negotiable. Queries must be lightning-fast. We're not just indexing primary and foreign keys; we're analyzing common query patterns (e.g., "get top submissions for a problem," "user's submission history") and creating composite indexes where needed.
- JSONB for Flexibility: Reduces schema migrations for evolving data structures, which can be extremely disruptive at scale.
- ENUMs/CHECK Constraints: Ensure data quality at the source, preventing bad data from polluting the system and causing downstream errors. This consistency is vital when many services interact with the same data.
- `ON DELETE CASCADE` considerations: While convenient, at extreme scale cascading deletes can lock tables or cause performance spikes. Often a "soft delete" (an `is_deleted BOOLEAN` flag) is preferred, with background processes eventually cleaning up, or with application logic handling explicit deletion of related entities. For this lesson we stick with CASCADE for simplicity, but know the trade-off.
Assignment: Expanding the Schema
Now it's your turn to extend this foundation!
1. Add a `comments` table: Users should be able to comment on `problems`.
   - Consider who can comment (`user_id`).
   - What information does a comment need (text, timestamp)?
   - How does it relate to a problem?
   - Think about potential moderation (e.g., an `is_deleted` flag).
   - What indexes would be useful for displaying comments on a problem page?
2. Enhance `users` with profile information: Add fields like `display_name`, `bio`, `profile_picture_url`, and `country`.
3. Explore advanced `problems` features: How would you add "example test cases" that are visible to the user, separate from the hidden `test_cases` used for judging? Where would this data live?
Solution Hints for Assignment
`comments` table:
- `id UUID PRIMARY KEY`, `user_id UUID REFERENCES users(id)`, `problem_id UUID REFERENCES problems(id)`, `content TEXT NOT NULL`, `created_at TIMESTAMP WITH TIME ZONE`.
- An `is_deleted BOOLEAN DEFAULT FALSE` column for soft deletes is a good idea.
- Indexes on `problem_id` (for fetching all comments on a problem) and `user_id` (for fetching all comments by a user).
- A composite index on `(problem_id, created_at DESC)` would be excellent for displaying comments in chronological order on a problem page.
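Putting those hints together, one possible `comments` definition looks like this (a sketch, not the only valid answer; the `ON DELETE` choices are assumptions you should weigh yourself):

```sql
-- One possible answer to the comments assignment.
CREATE TABLE comments (
    id         UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    user_id    UUID NOT NULL REFERENCES users(id) ON DELETE CASCADE,
    problem_id UUID NOT NULL REFERENCES problems(id) ON DELETE CASCADE,
    content    TEXT NOT NULL,
    is_deleted BOOLEAN NOT NULL DEFAULT FALSE,       -- soft-delete flag for moderation
    created_at TIMESTAMP WITH TIME ZONE NOT NULL DEFAULT now()
);

CREATE INDEX idx_comments_user_id ON comments (user_id);
-- Serves "comments for this problem, in order" on the problem page.
CREATE INDEX idx_comments_problem_created ON comments (problem_id, created_at DESC);
```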
`users` profile: Simply add `display_name VARCHAR(100)`, `bio TEXT`, `profile_picture_url VARCHAR(255)`, and `country VARCHAR(50)` to the `users` table. `display_name` need not be unique, but consider indexing it if it's frequently searched.

`problems` example test cases:
- You could add another `JSONB` column to the `problems` table, e.g., `example_test_cases JSONB`. This keeps all problem-related data together.
- Alternatively, if example test cases become very complex or need their own access patterns, a separate `example_test_cases` table with a `problem_id` foreign key could be considered. For now, `JSONB` in the `problems` table is simpler and efficient.
This foundational schema is not just theoretical; it's what powers real-world systems. Understanding these design choices now will save you countless headaches and unlock true scalability down the line. Next, we'll bring this schema to life with some basic data interaction!