SQL Playbook

SQL Best Practices & Playbook

A practical, code-first guide to query design habits that make SQL easier to write, debug, and scale — drawn from real data engineering and analytics workflows.

Core PrinciplesAnti-PatternsDebugging WorkflowAdvanced Patterns

Core Principles

1. Know the grain before you join

Always identify what one row represents in your base table before adding joins or aggregates. A row in `orders` is one order. A row in `order_items` is one line item. Joining them without aggregating first multiplies rows.

Avoid
-- Danger: joining before knowing row shape
SELECT o.customer_id, SUM(o.amount) AS total
FROM orders o
JOIN payments p ON p.order_id = o.id   -- payments has N rows per order
GROUP BY o.customer_id;
Prefer
-- Safe: pre-aggregate payments, then join
WITH paid AS (
  SELECT order_id, SUM(amount) AS total_paid
  FROM payments
  GROUP BY order_id
)
SELECT o.customer_id, SUM(p.total_paid) AS total
FROM orders o
JOIN paid p ON p.order_id = o.id
GROUP BY o.customer_id;

2. Build in layers with CTEs

Break complex logic into named CTE stages so each layer can be inspected, trusted, and debugged independently. This also makes queries self-documenting.

Avoid
SELECT customer_id, SUM(qty * price) FROM orders
JOIN order_items USING (id) JOIN products USING (product_id)
WHERE order_date > NOW() - INTERVAL '30 days'
GROUP BY customer_id HAVING SUM(qty * price) > 500;
Prefer
WITH recent_orders AS (
  SELECT id, customer_id
  FROM orders
  WHERE order_date > NOW() - INTERVAL '30 days'
),
order_revenue AS (
  SELECT o.customer_id, SUM(oi.qty * p.price) AS revenue
  FROM recent_orders o
  JOIN order_items oi ON oi.order_id = o.id
  JOIN products     p  ON p.id = oi.product_id
  GROUP BY o.customer_id
)
SELECT customer_id, revenue
FROM order_revenue
WHERE revenue > 500;

3. Always use explicit ORDER BY

SQL result sets have no guaranteed order without ORDER BY. Never assume rows come back in insert order, primary-key order, or any other order.

Avoid
SELECT name, salary FROM employees;
-- Row order is non-deterministic — varies by planner, vacuum, parallelism
Prefer
SELECT name, salary
FROM employees
ORDER BY salary DESC, employee_id;  -- tie-break with a unique column

4. Name things clearly — aliases matter

Cryptic aliases like `t`, `x`, or `a` make queries unreadable within hours. Use names that describe what the result set represents.

Avoid
SELECT a.id, b.name, c.amount
FROM orders a JOIN customers b ON b.id = a.cid
JOIN payments c ON c.oid = a.id;
Prefer
SELECT o.id        AS order_id,
       cust.name    AS customer_name,
       pay.amount   AS payment_amount
FROM orders    o
JOIN customers cust ON cust.id = o.customer_id
JOIN payments  pay  ON pay.order_id = o.id;

5. Guard against row multiplication

After every JOIN, validate that row counts are what you expect. A one-to-many join without pre-aggregation is the #1 source of silent data bugs.

Prefer
-- After each JOIN, add a quick count check during development
SELECT COUNT(*) FROM employees e
JOIN departments d ON d.id = e.department_id;
-- Should equal COUNT(*) FROM employees if dept_id is many-to-one

6. Handle NULL deliberately

NULL is not zero and not empty string. NULL propagates through arithmetic and comparisons. Use COALESCE, NULLIF, and IS NULL intentionally.

Avoid
-- Wrong: NULL + anything = NULL
SELECT salary + bonus AS total_comp FROM employees;

-- Wrong: WHERE col != 'X' silently excludes NULLs
SELECT * FROM employees WHERE department != 'HR';
Prefer
-- Correct: treat NULL as 0 for bonus
SELECT salary + COALESCE(bonus, 0) AS total_comp FROM employees;

-- Correct: include rows where department is NULL or not 'HR'
SELECT * FROM employees
WHERE department IS DISTINCT FROM 'HR';

7. Prefer EXISTS over IN for large subqueries

IN with a large subquery materialises the entire list. EXISTS short-circuits at the first match and is typically faster when the subquery returns many rows.

Avoid
SELECT * FROM employees
WHERE department_id IN (
  SELECT id FROM departments WHERE region = 'EMEA'
);
Prefer
SELECT e.*
FROM employees e
WHERE EXISTS (
  SELECT 1 FROM departments d
  WHERE d.id = e.department_id AND d.region = 'EMEA'
);

8. Avoid functions on indexed columns in WHERE

Wrapping a column in a function prevents the query planner from using an index on that column. Rewrite to keep the column bare.

Avoid
-- No index on order_date is usable here
SELECT * FROM orders WHERE YEAR(order_date) = 2024;
Prefer
-- Index on order_date CAN be used
SELECT * FROM orders
WHERE order_date >= '2024-01-01' AND order_date < '2025-01-01';

9. Use window functions instead of self-joins for ranking

Self-joins for ranking are verbose and often slower. Window functions are cleaner and execute in a single pass over the data.

Avoid
-- Self-join to compute rank — verbose and slow
SELECT a.employee_id, a.salary,
  COUNT(*) AS rank
FROM employees a
JOIN employees b ON b.salary >= a.salary
GROUP BY a.employee_id, a.salary;
Prefer
SELECT employee_id, salary,
  RANK() OVER (ORDER BY salary DESC) AS salary_rank
FROM employees;

10. Filter early, aggregate late

Apply WHERE filters as early as possible — ideally before joins — to reduce the number of rows the planner processes. Don't filter on aggregated results in HAVING unless you actually need post-aggregation filtering.

Avoid
-- Filtering after aggregation when it could be done before
SELECT department_id, AVG(salary)
FROM employees
GROUP BY department_id
HAVING department_id IN (1, 2, 3);
Prefer
-- Filter rows before the GROUP BY — fewer rows to aggregate
SELECT department_id, AVG(salary)
FROM employees
WHERE department_id IN (1, 2, 3)
GROUP BY department_id;

11. Validate output shape before shipping

Before finalising a query, check: (a) the column names are correct, (b) the row count is expected, (c) there are no unexpected NULLs, (d) numeric aggregates are in the right ballpark.

Prefer
-- Quick sanity checks during development
SELECT COUNT(*), COUNT(DISTINCT customer_id),
       MIN(total), MAX(total), AVG(total)
FROM order_summary;
-- Spot check: max shouldn't be wildly higher than avg (data quality)

12. Document non-obvious decisions inline

Only add comments when the WHY is non-obvious: a hidden constraint, a known data quality issue, or a deliberate workaround. Don't narrate what the SQL already says clearly.

Prefer
-- payments.amount can be negative (refunds) — COALESCE would hide refunds
SELECT order_id, SUM(amount) AS net_amount
FROM payments
GROUP BY order_id;

Anti-Patterns to Avoid

SELECT * in production outputs

Always select only the columns you need. SELECT * couples your query to the table schema, breaks if columns are added/reordered, and sends unnecessary data over the wire.

Joining before validating base filter logic

If your WHERE clause is wrong, the JOIN will amplify the mistake. Always run the base table query alone first.

Skipping ORDER BY in deterministic results

SQL has no guaranteed row order. Omitting ORDER BY means results can silently differ between runs, environments, or query plan versions.

Nesting logic without CTE checkpoints

Deep subquery nesting makes queries impossible to debug. Each logical step should be a named CTE that you can query independently.

Using DISTINCT as a band-aid for wrong joins

DISTINCT hides the symptom (duplicate rows) but not the cause (a many-to-many join or missing pre-aggregation). Fix the join.

Ambiguous column names without aliases

In multi-table queries, always alias columns to their intended meaning. `SELECT id` is meaningless. `SELECT o.id AS order_id` is self-documenting.

Hardcoding magic numbers without context

WHERE status = 2 is unreadable. Use a comment, a CTE with a named constant, or a lookup to make the intent clear.

Using NOT IN with a subquery that can return NULLs

NOT IN returns 0 rows if the subquery contains any NULL. Use NOT EXISTS or add WHERE col IS NOT NULL to the subquery.

Mixing aggregation levels in one query

Don't mix grain — e.g., row-level salary with department-level average — in a single SELECT without a window function or CTE to separate the levels.

Forgetting NULLS LAST/FIRST in ORDER BY

NULLs sort before or after non-NULLs depending on the database and sort direction. Be explicit: ORDER BY col DESC NULLS LAST.

SQL Debugging Workflow

When a query returns wrong results, don't stare at the whole thing. Follow this incremental process to isolate where the logic breaks down.

1

Run the base table alone

Before any join or filter, run SELECT * FROM your_table LIMIT 20. Understand the shape and grain of the data.

2

Add filters incrementally

Add one WHERE condition at a time and check that the row count moves in the expected direction.

3

Join one table at a time

After each JOIN, check COUNT(*). If it jumps unexpectedly, you have a one-to-many relationship you weren't accounting for.

4

Inspect intermediate CTEs

Run each CTE SELECT as a standalone query to verify its output before composing them.

5

Add aggregation last

Confirm the ungrouped rows look correct before wrapping with GROUP BY / HAVING.

6

Compare to a known answer

For a sanity check, manually compute the expected result for one customer or product and verify your query returns the same value.

PostgreSQL bonus: Use EXPLAIN ANALYZE prepended to any SELECT to see exactly how Postgres executes it — sequential scans, index usage, join strategy, and actual row counts at each step.

Advanced SQL Patterns

These patterns appear constantly in real analytics and data engineering work. Learn them once and you'll recognise when to reach for them.

Top-N per group

Find the top N rows within each group — e.g., the 3 best-selling products per category.

Pattern
WITH ranked AS (
  SELECT category_id, product_id, SUM(qty) AS total_sold,
    RANK() OVER (PARTITION BY category_id ORDER BY SUM(qty) DESC) AS rnk
  FROM order_items
  GROUP BY category_id, product_id
)
SELECT category_id, product_id, total_sold
FROM ranked
WHERE rnk <= 3;

Running total (cumulative sum)

Compute a running total ordered by date — common in revenue dashboards.

Pattern
SELECT order_date,
       daily_revenue,
       SUM(daily_revenue) OVER (ORDER BY order_date
         ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
       ) AS running_total
FROM (
  SELECT DATE_TRUNC('day', order_date) AS order_date,
         SUM(amount) AS daily_revenue
  FROM orders
  GROUP BY 1
) daily;

Period-over-period comparison with LAG

Compare a metric to the previous period — monthly revenue YoY, week-over-week growth, etc.

Pattern
WITH monthly AS (
  SELECT DATE_TRUNC('month', order_date) AS month,
         SUM(amount) AS revenue
  FROM orders
  GROUP BY 1
)
SELECT month, revenue,
       LAG(revenue) OVER (ORDER BY month) AS prev_revenue,
       ROUND(100.0 * (revenue - LAG(revenue) OVER (ORDER BY month))
             / NULLIF(LAG(revenue) OVER (ORDER BY month), 0), 2) AS growth_pct
FROM monthly
ORDER BY month;

Pivot with conditional aggregation

Turn rows into columns — e.g., count of orders per status as separate columns.

Pattern
SELECT customer_id,
  COUNT(*) FILTER (WHERE status = 'pending')   AS pending_orders,
  COUNT(*) FILTER (WHERE status = 'completed') AS completed_orders,
  COUNT(*) FILTER (WHERE status = 'cancelled') AS cancelled_orders
FROM orders
GROUP BY customer_id;

Deduplication with ROW_NUMBER

Keep only the latest record per entity when a table has duplicates or history rows.

Pattern
WITH deduped AS (
  SELECT *,
    ROW_NUMBER() OVER (
      PARTITION BY employee_id
      ORDER BY updated_at DESC
    ) AS rn
  FROM employee_snapshots
)
SELECT * FROM deduped WHERE rn = 1;

Recursive CTE for hierarchies

Walk a self-referential parent–child structure — org charts, file systems, bill-of-materials.

Pattern
WITH RECURSIVE org_chart AS (
  -- Base case: top-level managers (no manager)
  SELECT id, name, manager_id, 0 AS depth
  FROM employees
  WHERE manager_id IS NULL

  UNION ALL

  -- Recursive case: direct reports
  SELECT e.id, e.name, e.manager_id, oc.depth + 1
  FROM employees e
  JOIN org_chart oc ON oc.id = e.manager_id
)
SELECT depth, name FROM org_chart ORDER BY depth, name;

Performance Quick Reference

Index your JOIN and WHERE columns

The most impactful single change. If you join on `orders.customer_id`, make sure there is an index on that column.

Pre-aggregate before joining

Joining a table with N rows per key multiplies your base table by N. Aggregate first, then join the summary.

Use LIMIT in exploratory subqueries

During development, add LIMIT to intermediate CTEs to get fast feedback without scanning the full table.

Avoid SELECT * in production

Selecting unused columns wastes I/O, memory, and network — especially on wide tables with JSONB or TEXT columns.

Use EXPLAIN ANALYZE

Run EXPLAIN (ANALYZE, BUFFERS) before committing a query to production. Look for sequential scans on large tables.

Materialise hot CTEs

PostgreSQL 12+ inlines CTEs by default. Add MATERIALIZED to force the CTE to run once: WITH cte AS MATERIALIZED (…).

Apply it right now

The fastest way to internalise these patterns is to use them in real exercises. Open the playground and try rewriting a query using CTEs, or take on a Hard challenge that forces you to use window functions.