Welcome to Aussie SQL Server Bloggers Sign in | Join | Help

Debunking myths about clustered indexes - part 2 (CIXs are NOT better for range scans)

In my previous post on this topic, I discussed the ineffiency associated with performing bookmark lookups between Non-Clustered Indexes (NCIXs) and Clustered Indexes (CIXs), as opposed to when those same lookups are made from NCIXs to HEAP storage. Although this can be a pretty severe performance inhibitor on its own, there are a few other problems yet to discuss so I'll move right along to the next myth - that "Clustered Indexes are perfect for range scans".

The premise for this argument is generally that CIXs are "physically" sorted & therefore are good for scanning "ranges" of rows. Firstly, I'm not going to get embroiled in the discussion about what "physically" sorted means, other than to point out that databases have logical and physical models and for the purposes of this discussion, I am simply referring to the fact that CIXs are "physically" sorted in the physical model (as opposed to being physically sorted on disk storage, which is a whole other discussion).

What astounds me about the claim that CIXs are better for range scans than other indexes because they're physically sorted is that ALL indexes are physically sorted! (I'm gonna let academics refer to them as indices). Given that all indexes are physically sorted, I just don't understand why anyone would claim that CIXs are better for range scans, simply based on the fact that they're physically sorted!!

The most significant issue to take into account when comparing the scan efficiency between CIXs & NCIXs, however, has no relationship to physical sorting - by far the biggest issue is the fact that CIXs contain ALL of a table's columns in their leaf pages, whilst other index types (NCIXs) only contain a subset of the table's columns.

During scan operations, SQL Server reads entire 8kb pages from the leaf level of indexes which, if scanning a CIX, contains ALL of the columns in a table for the rows on those pages, whether all columns are being used by a query or not. Scans over NCIXs still read entire 8kb leaf pages during scan operations, but because NCIXs can be tailored to contain only the columns actually being used by a query, many more index rows are stored per page, allowing scan operations to be completed with significantly fewer page reads than are required with CIX scans

For example, consider a table which contains 100  4 byte integer columns - each row will contain ~ 400 bytes of data, allowing storage of ~ 20 rows per 8kb CIX leaf page. To store 1 million of these rows, we'll need ~ 50,000 pages. If a select statement were run against this table to retrieve only 3 of the 100 columns for all rows in the table and we only had a CIX on one of the columns (or other columns) on the table, to retrieve the 3 columns, SQL Server would have to read 50,000 pages.

If a NCIX were created for the three columns, the index row size would be ~ 12 bytes per row, allowing us to store ~ 666 rows per 8kb NCIX leaf page (sorry, no secret messages intended for heavy metal heads!). To store the 1 million entries on our 3 column NCIX, we'd only need ~ 1500 pages. So, a scan of the table to retrieve these three columns would only require SQL Server to read 1500 pages. Quite a few less than the 50,000 required for the CIX!

Just a note here to anyone who has noticed that my sizing calculations aren't taking into account row storage overheads such as column offsets, null bitmasks etc - please don't bother pointing this out as it doesn't materially alter the argument substantially.

Ok, so we've established that NCIXs are more efficient than CIXs for scan operations & you might be thinking "whoopiee, nothing new here". Fair enough - I guess you've already accepted that you need NCIXs to achieve query efficiency (c:

There's a subtler point here though - it's not that NCIXs are more efficient than CIXs, it's that there's actually a real INEFFICIENCY in the way that scans (range or full) operate over CIXs. This is that CIX scans inherently involve reading ALL COLUMNS IN THE TABLE, whether the query actually needs them or not. Because index scan operators read full 8kb leaf pages and the CIX contains all of the table's columns in the leaf pages, there is no way to avoid reading columns you didn't intend to, unless you actually WANT all columns.. The negative performance effects from such an inefficiencey in a database system mounts up extremely quickly, especially when those systems are performing transactions at high speeed.

Another issue is that tables can only have one CIX, yet multiple NCIXs can be created on any table. Keep in mind that tables are usually queried in multiple ways, so the powerful capability of providing multiple NCIXs, each tailored to the specific requirements of different performance critical queries is an important concept to grasp. There's no way CIXs can do this as (a) you can only have one per table and (b) they're less efficient for range scans anyway. The most important issue to get right in index performance tuning is the NCIXs, not the CIXs.

Earlier in this post, I deliberately used a full table scan scenario even though the discussion was really about range scanning. Keep in mind that a full scan is really an unrestricted full scan & the issues I discussed here apply relatively to internal range scans within sub-sets of a table's rows. Using the above example, if only half the rows were being scanned, due to a where clause filtering half of them out, the scan inefficiency where only a CIX exists would be half of the full scan.

Given the importance of getting NCIXs right, I'm thinking about exploring how the existance of CIXs can negatively affect performance-critical NCIX scan efficiency, especially when the CIX is implemented over a multiple columns. But I'll need a few winks rest before embarking on that little endeavour, so I might tackle it in a day or two..

Published Monday, 11 September 2006 8:34 PM by Greg_Linwood
Filed Under:

Comments

# re: Debunking myths about clustered indexes - part 2 (CIXs are NOT better for range scans)

Hi Greg,

If there is a covering nonclustered index available, then seeking that index will beat a clustered index scan anytime. But as you yourself have written in part 1 of this excellent series, it is usually not possible to index every column in a database.

If you compare a non-covering nonclustered index to a clustered index for use in a range scan, then the clustered index wins heads down.

Best, Hugo
Saturday, 4 November 2006 8:24 AM by Hugo Kornelis

# Debunking myths about clustered indexes - part 5 - a real world example


This post discusses a problem experienced last year on one of the highest profile websites in...
Friday, 8 June 2007 12:52 AM by Transaction blog
Anonymous comments are disabled