The Art of Compound Indexes - Why Column Order Matters
Takashi Yamamoto
Infrastructure Engineer · Leapcell

Databases are the backbone of most applications, and their performance significantly impacts user experience. When queries slow down, the entire system suffers. Among the many optimization techniques available to database administrators and developers, indexing stands out as a powerful tool. However, the true art of indexing lies not just in creating indexes, but in understanding their nuances, especially when it comes to compound indexes. A well-designed compound index can dramatically accelerate data retrieval, while a poorly designed one might be ignored entirely by the query optimizer, or even degrade performance. This article explores the critical role of column order within compound indexes and shows why this seemingly small detail has such practical significance for database efficiency.
At its core, a database index is akin to the index at the back of a book. Instead of sequentially scanning every page to find a particular term, you can quickly jump to the relevant pages listed in the index. This principle applies to databases, allowing them to locate data rows without examining every single record. When we talk about indexes, two key terms come to mind:
- Single-Column Index: An index created on a single column of a table. It helps in quickly finding rows based on the values in that specific column. For example, an index on users.email allows for fast lookups by email address.
- Compound (Multi-Column) Index: An index created on two or more columns of a table. This type of index can be particularly powerful for queries involving multiple columns in their WHERE, ORDER BY, or GROUP BY clauses. The critical aspect of compound indexes is the order in which the columns are defined.
Let's consider a practical scenario to illustrate the importance of column order. Imagine an orders table with millions of records, containing columns like customer_id, order_date, and status.
Suppose we frequently run queries like these:
1. SELECT * FROM orders WHERE customer_id = 123;
2. SELECT * FROM orders WHERE customer_id = 123 AND order_date > '2023-01-01';
3. SELECT * FROM orders WHERE customer_id = 123 AND order_date > '2023-01-01' ORDER BY order_date DESC;
4. SELECT * FROM orders WHERE order_date > '2023-01-01';
5. SELECT * FROM orders WHERE status = 'shipped';
Now, let's explore how a compound index (customer_id, order_date) would perform compared to (order_date, customer_id).
Understanding the Leftmost Prefix Rule
The fundamental principle governing compound indexes is the "leftmost prefix rule." An index (col_a, col_b, col_c) can be used to efficiently search on:
- col_a
- col_a, col_b
- col_a, col_b, col_c
However, it cannot efficiently search on col_b alone, on col_c alone, or on col_b and col_c together, because none of these starts with the leftmost column of the index. A search on col_a and col_c (without col_b) can use only the col_a part of the index; the col_c condition must then be checked against every entry that matches col_a. Think of it like a phone book sorted by (Last Name, First Name). You can easily find all people with a certain last name, or all people with a certain last name AND first name. But you can't easily find all people with a certain first name without knowing their last name first.
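The phone book analogy can be sketched in a few lines of Python. This is only an illustration of the idea, not how any particular database implements its indexes: the sorted list stands in for the index, and the names and the two helper functions are hypothetical.

```python
from bisect import bisect_left

# Hypothetical phone book, kept sorted by (last_name, first_name) the way a
# compound index keeps its entries sorted by (col_a, col_b).
phone_book = sorted([
    ("Ito", "Aiko"),
    ("Ito", "Ken"),
    ("Sato", "Aiko"),
    ("Sato", "Yuki"),
    ("Tanaka", "Ken"),
])


def find_by_last_name(entries, last_name):
    """Binary-search to the first match, then read the contiguous run."""
    # (last_name,) sorts just before every (last_name, first_name) pair.
    start = bisect_left(entries, (last_name,))
    matches = []
    for entry in entries[start:]:
        if entry[0] != last_name:
            break  # the run of matching last names has ended
        matches.append(entry)
    return matches


def find_by_first_name(entries, first_name):
    """The sort order does not help here: every entry has to be checked."""
    return [entry for entry in entries if entry[1] == first_name]


print(find_by_last_name(phone_book, "Sato"))
print(find_by_first_name(phone_book, "Ken"))
```

The lookup by last name touches only the matching run of entries, while the lookup by first name walks the whole list. That is the same difference as between an index seek and a full scan.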
Scenario 1: Index (customer_id, order_date)
Let's create this index:
CREATE INDEX idx_customer_date ON orders (customer_id, order_date);
- Query 1 (WHERE customer_id = 123): This query can fully utilize idx_customer_date. The database can quickly narrow down to records for customer_id = 123 by traversing the customer_id part of the index. This works because customer_id is the leftmost column.
- Query 2 (WHERE customer_id = 123 AND order_date > '2023-01-01'): This query can also fully utilize idx_customer_date. The database will first filter by customer_id and then efficiently find records within that subset where order_date meets the condition. This provides a double benefit, as both WHERE conditions are handled by the index.
- Query 3 (WHERE customer_id = 123 AND order_date > '2023-01-01' ORDER BY order_date DESC): Here, the index not only helps filter the data but also assists with sorting. Because the entries for each customer_id are already sorted by order_date, the database can satisfy the ORDER BY clause by reading the index in reverse, without an additional sort operation (a "filesort" in MySQL terminology). This is a huge performance gain.
- Query 4 (WHERE order_date > '2023-01-01'): This query cannot effectively use idx_customer_date. Since order_date is not the leftmost column, the database would likely resort to a full table scan or use a different single-column index on order_date, if one is available.
- Query 5 (WHERE status = 'shipped'): This query definitely cannot use idx_customer_date, as status is not part of the index.
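One way to check this behavior is to ask a database for its query plans. The sketch below uses SQLite through Python's built-in sqlite3 module purely as a convenient stand-in; that choice, the minimal table definition, and the plan helper are assumptions, not part of the original scenario. Plan wording differs between engines and versions, so treat the output as indicative.

```python
import sqlite3

# Assumption: an in-memory SQLite database with a hypothetical, minimal
# version of the orders table. No rows are needed to inspect query plans.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders ("
    "id INTEGER PRIMARY KEY, customer_id INTEGER, order_date TEXT, status TEXT)"
)
conn.execute("CREATE INDEX idx_customer_date ON orders (customer_id, order_date)")

queries = {
    "Query 1": "SELECT * FROM orders WHERE customer_id = 123",
    "Query 2": "SELECT * FROM orders WHERE customer_id = 123 "
               "AND order_date > '2023-01-01'",
    "Query 3": "SELECT * FROM orders WHERE customer_id = 123 "
               "AND order_date > '2023-01-01' ORDER BY order_date DESC",
    "Query 4": "SELECT * FROM orders WHERE order_date > '2023-01-01'",
    "Query 5": "SELECT * FROM orders WHERE status = 'shipped'",
}


def plan(sql):
    """Return the plan steps SQLite reports for a query (hypothetical helper)."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return [row[3] for row in rows]  # the fourth column is the description


plans = {name: plan(sql) for name, sql in queries.items()}
for name, steps in plans.items():
    print(name, "->", "; ".join(steps))
```

In SQLite, a step that starts with SEARCH and names idx_customer_date indicates an index seek, SCAN indicates that the whole table is read, and a "USE TEMP B-TREE FOR ORDER BY" step would indicate a separate sort.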
Scenario 2: Index (order_date, customer_id)
Now let's consider the reversed index:
CREATE INDEX idx_date_customer ON orders (order_date, customer_id);
- Query 1 (WHERE customer_id = 123): This query cannot effectively use idx_date_customer because customer_id is not the leftmost column. The database would likely perform a full table scan or use a single-column index on customer_id if it exists.
- Query 2 (WHERE customer_id = 123 AND order_date > '2023-01-01'): This query can partially use idx_date_customer. It can efficiently filter by order_date > '2023-01-01', but then it would still need to scan the selected rows to filter by customer_id. While better than a full table scan, it's not as efficient as idx_customer_date for this specific query pattern.
- Query 3 (WHERE customer_id = 123 AND order_date > '2023-01-01' ORDER BY order_date DESC): Similar to Query 2, the order_date filter and ORDER BY clause would be efficient, but customer_id would still require a subsequent filter operation.
- Query 4 (WHERE order_date > '2023-01-01'): This query can fully utilize idx_date_customer because order_date is the leftmost column. This would be very fast.
- Query 5 (WHERE status = 'shipped'): Still no use for this index.
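The reversed index can be inspected in the same way. The following sketch again assumes SQLite via Python's sqlite3 module as a stand-in engine, with a hypothetical minimal table and helper. The part of each plan in parentheses lists the conditions the index seek actually uses, which makes the "partial use" in Query 2 visible.

```python
import sqlite3

# Assumption: in-memory SQLite with a hypothetical minimal orders table and
# only the reversed index, so the planner has no alternative index to pick.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders ("
    "id INTEGER PRIMARY KEY, customer_id INTEGER, order_date TEXT, status TEXT)"
)
conn.execute("CREATE INDEX idx_date_customer ON orders (order_date, customer_id)")

queries = {
    "Query 1": "SELECT * FROM orders WHERE customer_id = 123",
    "Query 2": "SELECT * FROM orders WHERE customer_id = 123 "
               "AND order_date > '2023-01-01'",
    "Query 4": "SELECT * FROM orders WHERE order_date > '2023-01-01'",
    "Query 5": "SELECT * FROM orders WHERE status = 'shipped'",
}


def plan(sql):
    """Return the plan steps SQLite reports for a query (hypothetical helper)."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return [row[3] for row in rows]  # the fourth column is the description


plans = {name: plan(sql) for name, sql in queries.items()}
for name, steps in plans.items():
    print(name, "->", "; ".join(steps))
```

For Query 2, the plan should mention only the order_date range in the index conditions. The customer_id = 123 test is applied afterward to every row in that date range, which is the extra filtering work described above.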
Key Takeaways for Column Order
- Cardinality: Generally, put the column with the highest cardinality (most distinct values) first, especially if it's frequently used in WHERE clauses for equality. This helps the index narrow down the search space most effectively early on. However, this is a guideline, not a strict rule.
- Usage Pattern: The most important factor is your query patterns.
  - If you frequently query only by col_a, or by col_a and col_b, then (col_a, col_b) is appropriate.
  - If you frequently query only by col_b, then (col_b, col_a) (or a separate index on col_b) is better.
  - Consider any ORDER BY or GROUP BY clauses. If ORDER BY col_b is common after filtering by col_a, then (col_a, col_b) can satisfy both the filter and the sort, avoiding an expensive "filesort" operation.
- Equality vs. Range: Columns used for equality conditions (=) should generally come before columns used for range conditions (<, >, BETWEEN, LIKE 'prefix%'). If you have WHERE col_a = 'X' AND col_b > 'Y', then (col_a, col_b) will work very well. The index can jump to col_a = 'X' and then efficiently scan along col_b from 'Y'. If the order were (col_b, col_a), the index would scan a much larger range of col_b values before filtering for col_a.
- Covering Indexes: A compound index can become a "covering index" if all columns required by the query (in SELECT, WHERE, ORDER BY, GROUP BY) are part of the index. This means the database doesn't need to access the actual table rows at all, further speeding up the query. For example, SELECT customer_id, order_date FROM orders WHERE customer_id = 123 could be covered by (customer_id, order_date).
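The covering-index point can be observed directly in a query plan. The sketch below assumes SQLite (through Python's sqlite3 module) as a stand-in engine and a hypothetical minimal orders table; the plan helper is likewise illustrative.

```python
import sqlite3

# Assumption: in-memory SQLite with a hypothetical minimal orders table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders ("
    "id INTEGER PRIMARY KEY, customer_id INTEGER, order_date TEXT, status TEXT)"
)
conn.execute("CREATE INDEX idx_customer_date ON orders (customer_id, order_date)")


def plan(sql):
    """Return the plan steps SQLite reports for a query (hypothetical helper)."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return [row[3] for row in rows]  # the fourth column is the description


# Every selected column is in the index, so the table itself is never read.
covered = plan(
    "SELECT customer_id, order_date FROM orders WHERE customer_id = 123"
)

# status is not in the index, so each match needs an extra table lookup.
not_covered = plan(
    "SELECT customer_id, order_date, status FROM orders WHERE customer_id = 123"
)

print("covered:    ", covered)
print("not covered:", not_covered)
```

SQLite labels the first plan with "COVERING INDEX". Other engines report the same situation differently, for example "Using index" in the Extra column of MySQL's EXPLAIN or an "Index Only Scan" node in PostgreSQL.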
-- Example demonstrating the importance of order for an ORDER BY clause
-- Assume 'orders' table with 'customer_id' and 'order_date'

-- Index 1: customer_id first, then order_date
CREATE INDEX idx_customer_date_order ON orders (customer_id, order_date);

-- Query 1: Filter by customer_id, order by order_date
EXPLAIN SELECT customer_id, order_date, status
FROM orders
WHERE customer_id = 123
ORDER BY order_date DESC;
-- This will likely use idx_customer_date_order efficiently for both WHERE and ORDER BY.
-- The index is ordered by customer_id, then order_date within each customer_id,
-- allowing for efficient scanning in the desired order.

-- Index 2: order_date first, then customer_id
CREATE INDEX idx_date_customer_order ON orders (order_date, customer_id);

-- Query 2: Same query as above, but with a different index structure
-- (to observe this plan, Index 1 must be absent or ignored by the optimizer)
EXPLAIN SELECT customer_id, order_date, status
FROM orders
WHERE customer_id = 123
ORDER BY order_date DESC;
-- While idx_date_customer_order might be *used* to some extent,
-- it won't be as efficient for the WHERE clause on customer_id, as it's not the leading column.
-- It might still help with the ORDER BY if the optimizer deems it beneficial after filtering.
-- However, if there was no equality predicate on customer_id, and only a range on order_date,
-- then idx_date_customer_order would shine.

-- Consider a query where order_date is a range, and customer_id is equality
EXPLAIN SELECT customer_id, order_date, status
FROM orders
WHERE order_date BETWEEN '2023-01-01' AND '2023-01-31'
  AND customer_id = 456;
-- For this query, idx_customer_date_order is usually the better fit (equality before range):
-- it can seek to customer_id = 456 and read only that customer's January entries.
-- idx_date_customer_order can use the range on its leftmost column, but it must then
-- read every index entry in January and filter each one on customer_id.
The art of designing compound indexes lies in a deep understanding of your application's query workload. It's not about blindly creating indexes on every column or every combination of columns, as excessive indexing comes with its own performance overhead for write operations. Instead, it's about strategically placing columns in an order that maximizes the utility of the index for the most critical or frequent queries while minimizing the overall index footprint. By carefully considering the leftmost prefix rule, column cardinality, and the nature of your queries (equality vs. range, WHERE vs. ORDER BY), you can unlock significant performance gains and ensure your database operates efficiently under heavy loads. The order of columns in a compound index is not a mere detail; it is the cornerstone of its effectiveness.

