Welcome to Aussie SQL Server Bloggers Sign in | Join | Help

SELECT * & query / index matching

I spent a few minutes in Microsoft's public discussion forums today & noticed a post which highlighted a very common misconception amongst developers about indexes. The poster presented a question along these lines:

I have a table with 4 columns:

create table table1 (c1 int, c2, int, c3, int, c4 int)

Then I create an index on the first three columns:

create index ix1 on table1 (c1, c2, c3)

Will the following query use the index?

select * from table1 where c1 = 1 and c2 = 2

At first glance, it might be tempting to think that the index will be used because it contains both filter columns (c1 & c2) but there are two other important factors which SQL Server will take into account when determining whether the index will be used or not:

The first is that ALL FOUR of table1's columns were requested to be returned by the query due to the use of SELECT *, yet the index only contains 3 of the 4 requested columns.

The second is the degree of "selectivity" (uniqueness) of the values for c1 & c2 over the rows in the table. Whilst it's possible that the values of c1 & c2 columns might be unique accross the rows in this table, it's also perfectly possible that every row in the table meets the filter criteria, ie - c1 = 1 and c2 = 2. The table definition doesn't explicitly declare that these columns form a unique constraint, so either extreme case is possible & there might also be any lesser degree of uniqueness accross the rows.

Let's explore the case where a unique constraint does exist on (c1, c2), eg a primary key exists on those columns. In this case SQL Server should seek into the index, find the row with (c1 = 1 and c2 = 2) (if one exists), returning c1, c2 & c3 from index seek (as c3 is also in the ncix) and then perform a bookmark lookup into the table to retrieve c4. This bookmark lookup is necessary because ALL columns were required to be return due to the use of SELECT * in the query.

In the case where all rows meet the filter predicate (c1=1 and c2=2), a full scan of the table would definitely be the most efficient method of executing the query, as all columns of all rows need to be returned in the resultset. If SQL Server chose to scan the index, it would only retrieve 3 of the four columns required for the resultset (as c1, c2 & c3 are in the index definition) but it would still need to retrieve c4 to complete the resultset. To do this, it would have to perform a bookmark or page lookup (depending on whether the table had a clustered index or not) FOR EVERY ROW just to retrieve c4. This is a good example of how bookmark lookups can be very expensive.

Of course, SQL Server should choose the first option if the values are unique and the second option if all rows have the same value. But what if some smaller percentage of rows actually meet the filter criteria (c1 = 1 and c2 = 2)?

Whether seeking or scanning into the index is more efficient really depends on how many rows actually meet the filter criteria. Generally speaking, this threshold  is relatively small (eg, < 10%) because bookmark lookups are so expensive that the cost / benefit equation tips at quite a low percentage. Hence, it's fairly easy to have queries full-scanning tables when indexes aren't adequately "covering" the columns requested by queries or where queries are retrieving more columns than they actually need - a very common scenario when developers us SELECT *.

Another interesting point is that SQL Server needs to make the decision about whether to seek or scan in advance of actually running the query. To do this, it needs statistical information about the "selectivity" of columns so that it can intelligently "guess" about how many rows it expects to be returned & choose to seek or scan wisely. This is a critical reason why database statistics need to be kept up to date, otherwise SQL Server will make these decisions based on insufficient information & can easily make bad choices about seeking or scanning..

If you're new to indexing databases for query performance, here are some points you might take away from this little discussion:

(a) Always avoid use of SELECT * in production queries, especially against larger tables as suh queries can never be indexed effectively. Whilst it might be convenient to type out the ubiquitous "SELECT * " as you're starting to componse your SQL statement, always remember to finish the job properly & only include the columns you actually need returned rather than all columns.

(b) When you're considering whether an index will help your query, you need to take into account what percentage of rows in the base tables will be returned by your query, especially if the index doesn't include all of the columns required by your query.

(c) You need to construct well-designed indexes for queries which need to provide high levels of performance. These indexes should ideally include all columns accessed in the base table by the query. Order of inclusion of those columns into the index is (left to right) columns referred to in the WHERE clause & INNER JOINS first, then columns referred to in OUTER JOINS next & finally, columns referenced by the SELECT list.

(d) Keep your database statistics up to date, otherwise you're causing SQL Server to make less-informed "guesses" about how it should use indexes

Published Friday, 25 August 2006 8:14 PM by Greg_Linwood
Filed Under:

Comments

# re: SELECT * & query / index matching

Greg,

Thanks for the blog, that was a good bit of info.  It's one of those things that you need to read about a number of times till it sinks in properly.

Question though about the SELECT *, if you use it in an exists statement does it make different descisions on index selection or is it still a good idea to specify a column?

Wednesday, 6 September 2006 6:01 PM by neildavidyoung

# re: SELECT * & query / index matching

Thx Neil. EXISTS statements are specially optimised & not affected by SELECT *.

However, I usually recommend that there's still an argument for not using SELECT * with EXISTS. If you don't use SELECT * in your EXISTS statements, you can then easily search your stored procs to find intances of where you HAVE used SELECT *.

Eg, select * from syscomments where text like '%SELECT *%'.

Cheers,
Greg
Wednesday, 6 September 2006 7:19 PM by Greg_Linwood
Anonymous comments are disabled