Day 6: Query Planner Improvements: Propagating Column Stats from CTEs – Seeing Beyond the Fence

Lesson 6 · 60 min

Welcome back, fellow architects and engineers! Today, we're diving into a subtle yet profoundly impactful improvement in PostgreSQL's query planner that can make or break the performance of your complex geospatial queries: the ability to propagate column statistics from Common Table Expressions (CTEs).

You've likely used CTEs (those elegant WITH clauses) to break down intricate SQL into readable, logical steps. They're fantastic for modularity, especially in a geospatial context where queries can involve multiple spatial operations, joins, and aggregations. But here's a secret: for a long time, the query planner often treated CTEs like an opaque box, an "optimization fence." It would execute the CTE, get its result, and then try to plan the rest of the query, often without fully understanding the underlying data distribution, selectivity, or cardinality within that CTE.

This is like a general planning a battle without knowing the exact strength or terrain of the enemy forces hiding behind a hill. They'd make a best guess, which sometimes works, but often leads to a disastrous strategy.

The Old Way: Blind Spots and Suboptimal Plans

Component Architecture

[Diagram: client → PostgreSQL/PostGIS (query planner with CTE stats propagation, executor, spatial functions, spatial indexes) over the regions and sensors tables — improved visibility into CTEs]

Imagine you're building a system for a logistics company. You have millions of delivery points and thousands of service regions. A common query might be: "Find all delivery points that fall within a specific high-priority region, and then analyze their average delivery time."

You might write it like this:

```sql
WITH high_priority_deliveries AS (
    SELECT
        dp.id,
        dp.geom,
        dp.delivery_time
    FROM
        delivery_points dp
        JOIN regions r ON ST_Intersects(dp.geom, r.geom)
    WHERE
        r.name = 'High-Priority Zone A'
)
SELECT
    AVG(delivery_time)
FROM
    high_priority_deliveries;
```

In PostgreSQL versions prior to 12, every CTE was materialized and acted as an optimization fence (version 12 changed this, and later releases through 17 have refined it further). Back then, the planner would look at high_priority_deliveries and think, "Okay, this CTE could return many rows, or few. I'll make a conservative estimate." If High-Priority Zone A actually contains only a handful of delivery points, but the planner estimates it'll return hundreds of thousands, it might choose a very expensive join algorithm (like a Hash Join where a Nested Loop would be faster with few rows) or skip an index scan that would have been perfect. This misestimation, born of a lack of visibility into the CTE's internals, can balloon query times from milliseconds to seconds or even minutes on large geospatial datasets.

The New Way: Peeking Beyond the Fence

PostgreSQL 12 introduced significant enhancements (refined further in releases through 17) to how the query planner handles non-materialized CTEs. Instead of treating them as black boxes, the planner now inlines them and propagates statistics from the CTE's internal queries to the outer query. It essentially "looks inside" the CTE to gather more accurate cardinality estimates, understand data distribution, and infer the selectivity of predicates defined within the CTE.

What does this mean for you?

  • Smarter Join Decisions: If a CTE filters down a massive table to a tiny subset using a spatial predicate (like ST_Intersects), the planner will now more accurately estimate that small row count. This allows it to choose efficient join algorithms (e.g., a Nested Loop if the CTE result is small, or a Hash Join if it's still sizable but accurately estimated).

  • Better Index Utilization: The planner can better understand the selectivity of WHERE clauses within the CTE, leading to more intelligent decisions about which indexes (especially PostGIS's GIST spatial indexes) to use, and how to combine them.

  • Reduced Overheads: Less guesswork means fewer suboptimal plans, reducing CPU and I/O overheads.

  • Empowered Readability: You can continue to use CTEs for code organization without fearing a performance penalty due to planner blindness.
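One practical corollary: the planner can only propagate statistics that actually exist. A minimal sketch, assuming the sensors table used in the assignment below, of refreshing and inspecting the per-column statistics the planner draws on:

```sql
-- Refresh table statistics so CTE propagation has accurate inputs
ANALYZE sensors;

-- Inspect what the planner knows about the filtered columns
SELECT attname, n_distinct, null_frac
FROM pg_stats
WHERE tablename = 'sensors'
  AND attname IN ('status', 'geom');
```

If n_distinct for status is badly off, estimates inside any CTE filtering on it will be too.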

When does this apply? Non-MATERIALIZED CTEs

Flowchart

[Flowchart: SQL query submitted → parse & rewrite → query planner → "CTE MATERIALIZED?" — Yes: force evaluation; No: inline, propagate column stats, and optimize → generate physical plan]

It's crucial to distinguish between regular CTEs and MATERIALIZED CTEs.

  • Non-MATERIALIZED CTEs: These are typically inlined into the main query, allowing the planner to optimize them together with the rest of the query. This is where statistics propagation shines.

  • MATERIALIZED CTEs: These force the database to execute the CTE first, store its result set (in a tuplestore, in memory or spilled to disk), and then use that materialized result for the outer query. This acts as an explicit optimization fence. While it bypasses stats propagation, it's useful when:

      • The CTE is complex and its result is used multiple times in the outer query (avoiding re-computation).

      • You want to force a specific execution order or break a complex query into stages.

      • The CTE's output is small enough to be efficiently materialized.
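As a sketch of the two forms (using the sensors table from the assignment below; NOT MATERIALIZED is simply the explicit spelling of the default inlining behavior for single-reference CTEs):

```sql
-- Inlined: the planner folds the CTE into the outer query and propagates stats
WITH active AS NOT MATERIALIZED (
    SELECT id, geom FROM sensors WHERE status = 'active'
)
SELECT COUNT(*) FROM active;

-- Fenced: the CTE is evaluated once, its result stored, and reused as-is
WITH active AS MATERIALIZED (
    SELECT id, geom FROM sensors WHERE status = 'active'
)
SELECT COUNT(*) FROM active;
```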

For most geospatial data warehousing scenarios, especially when breaking down a single logical query, you want the planner to see through the CTE.

Real-World Impact: Sub-200ms Geospatial Queries

State Machine

[State diagram: initial query → (parse & rewrite) → logical plan with opaque CTEs → (stats propagation — the modern PostgreSQL improvement) → optimized logical plan with CTEs integrated → (physical plan generation) → executable plan]

Consider a high-traffic mapping service that needs to identify points of interest (POIs) within a user's viewport, then filter them by category, and finally rank them by proximity to the user. Each step could be a CTE. Without accurate stats propagation, a slight misestimation at the "POIs in viewport" stage could lead to fetching and processing millions of irrelevant POIs, instead of just the few thousands actually visible. The difference is milliseconds versus seconds, which is the difference between a fluid user experience and a frustrating one.

This mechanism is critical for any system that relies on complex, multi-stage queries over large datasets, ensuring that the database engine can make the most intelligent decisions about how to execute your geospatial logic.


Hands-On Assignment: Unveiling the Planner's Vision

Let's get our hands dirty and observe this in action. We'll set up a PostGIS database, populate it with some simulated geospatial data, and then run a query with a CTE to see how the planner behaves.

Goal: Understand how PostgreSQL's query planner leverages statistics from non-materialized CTEs to make better execution decisions.

Setup: We'll use Docker for a clean, reproducible environment.

Task 1: Set up the Geospatial Database

  1. Start a PostgreSQL container with PostGIS.

  2. Create a database and enable the PostGIS extension.

  3. Create two tables: regions (polygons representing administrative areas, e.g., counties or districts) and sensors (points representing IoT sensor locations).

  4. Populate regions with a few distinct polygons.

  5. Populate sensors with a large number of points (e.g., 50,000 to 100,000) scattered across these regions. Ensure sensors has a status column (e.g., 'active', 'inactive').

  6. Create a GIST spatial index on the geom columns of both tables.

Task 2: Craft a CTE Query and Analyze its Plan

  1. Write a SQL query using a non-materialized CTE. The CTE should:

  • Select sensor_id, geom, and status from the sensors table.

  • Filter sensors that ST_Intersects with a specific, moderately sized region from your regions table (e.g., WHERE r.name = 'Region_B').

  • Further filter by status = 'active'.

  2. The outer query should then:

  • Select the count of active sensors found within that specific region.

  3. Execute this query with EXPLAIN (ANALYZE, VERBOSE, BUFFERS).

  4. Carefully examine the EXPLAIN ANALYZE output. Look for:

  • The nodes corresponding to the CTE. (An inlined CTE produces no CTE Scan node; its scans and joins appear directly in the plan.) What are the estimated rows (rows=...) versus actual rows (actual rows=...)?

  • How does the planner estimate the selectivity of the ST_Intersects and status predicates within the CTE?

  • What kind of join or scan is performed for the CTE's contents? Does it use the spatial index?
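For reference, one way the Task 2 query might be assembled (table, column, and region names follow the assignment; treat this as a sketch rather than the only correct answer):

```sql
EXPLAIN (ANALYZE, VERBOSE, BUFFERS)
WITH active_sensors_in_region AS (
    -- Spatial filter plus attribute filter, all inside the CTE
    SELECT s.id, s.geom, s.status
    FROM sensors s
    JOIN regions r ON ST_Intersects(s.geom, r.geom)
    WHERE r.name = 'Region_B'
      AND s.status = 'active'
)
SELECT COUNT(*) FROM active_sensors_in_region;
```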

Task 3: Compare with a MATERIALIZED CTE (Optional but Recommended)

  1. Modify the query from Task 2 to make the CTE explicitly materialized: active_sensors_in_region AS MATERIALIZED (...).

  2. Run EXPLAIN (ANALYZE, VERBOSE, BUFFERS) again.

  3. Compare the execution plan with the non-materialized version.

  • You should see a distinct CTE Scan step that completes before the outer query proceeds.

  • Observe how the estimated vs. actual rows might differ more significantly if the planner cannot push down predicates into the materialized CTE.

  • Note the "buffers" used – materialization might incur temporary storage costs.

Insightful Takeaway: The key is to observe how accurately the planner estimates the row count coming out of your non-materialized CTE. A close match between rows= and actual rows= indicates the planner "saw through" your CTE, leading to a potentially much more efficient overall plan.


Solution Hints for Assignment

Task 1: Setup

  • Use docker run --name pg_geospatial -e POSTGRES_PASSWORD=mysecretpassword -p 5432:5432 -d postgis/postgis

  • Connect with psql -h localhost -U postgres -d postgres

  • CREATE DATABASE geospatial_dw;

  • \c geospatial_dw

  • CREATE EXTENSION postgis;

  • Table creation:

```sql
CREATE TABLE regions (
    id   SERIAL PRIMARY KEY,
    name VARCHAR(255) NOT NULL,
    geom GEOMETRY(Polygon, 4326)
);

CREATE TABLE sensors (
    id     SERIAL PRIMARY KEY,
    name   VARCHAR(255) NOT NULL,
    status VARCHAR(50) NOT NULL,
    geom   GEOMETRY(Point, 4326)
);
```
  • Population:

  • For regions, use ST_MakeEnvelope or ST_Buffer around ST_MakePoint to create a few distinct polygons. Ensure SRID 4326.

  • For sensors, use generate_series and ST_SetSRID(ST_MakePoint(random() * 360 - 180, random() * 180 - 90), 4326) to create many random points. Assign 'active' or 'inactive' randomly.

  • Indexes:

```sql
CREATE INDEX idx_regions_geom ON regions USING GIST (geom);
CREATE INDEX idx_sensors_geom ON sensors USING GIST (geom);
CREATE INDEX idx_sensors_status ON sensors (status); -- for the status filter
```
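The population step described above might be sketched like this (the region envelopes are arbitrary placeholders; adjust extents and row counts to taste):

```sql
-- Three non-overlapping rectangular regions in SRID 4326
INSERT INTO regions (name, geom) VALUES
    ('Region_A', ST_MakeEnvelope(-30, -20, -10, 0, 4326)),
    ('Region_B', ST_MakeEnvelope(-10, 0, 10, 20, 4326)),
    ('Region_C', ST_MakeEnvelope(10, 20, 30, 40, 4326));

-- 100,000 random points worldwide, ~50% marked 'active'
INSERT INTO sensors (name, status, geom)
SELECT
    'sensor_' || i,
    CASE WHEN random() < 0.5 THEN 'active' ELSE 'inactive' END,
    ST_SetSRID(ST_MakePoint(random() * 360 - 180, random() * 180 - 90), 4326)
FROM generate_series(1, 100000) AS i;

-- Give the planner fresh statistics to work with
ANALYZE regions;
ANALYZE sensors;
```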

Task 2 & 3: CTE Query and Analysis

  • Example query structure:

```sql
WITH target_region AS (
    SELECT geom FROM regions WHERE name = 'Region_B'
),
active_sensors_in_region AS (
    SELECT s.id, s.geom
    FROM sensors s, target_region tr
    WHERE ST_Intersects(s.geom, tr.geom)
      AND s.status = 'active'
)
SELECT COUNT(id) FROM active_sensors_in_region;
```
  • For the MATERIALIZED version, change active_sensors_in_region AS ( to active_sensors_in_region AS MATERIALIZED (.

  • When analyzing EXPLAIN ANALYZE, pay close attention to the rows= (estimated) and actual rows= values. In the MATERIALIZED version these appear on the CTE Scan node; in the inlined version, look at the scan and join nodes that replaced the CTE. If the estimate is close to the actual count, the planner did a good job propagating stats. Also, observe whether Index Scan using idx_sensors_geom is used within the CTE's execution plan.

This hands-on exercise will solidify your understanding of how PostgreSQL's planner intelligently processes your queries, especially when dealing with the complex, performance-critical world of geospatial data. Master this, and you'll be writing queries that not only look clean but also fly on your production systems.
