Welcome to Aussie SQL Server Bloggers Sign in | Join | Help

Debunking myths about clustered indexes - part 5b - a real world solution

In my last post, I discussed a real world clustered index design problem which had far-reaching consequences on a high volume & profile website. In this post I'll discuss the problem in further detail & post the solution which solved the problem

In this scenario, a Clustered Index (CIX) was placed on an identity column but the main query which accesses the table at very high frequency did not use the identity column for filtering or joining in any way. The only reason the CIX was defined on this column is that SQL Server places CIXs on Primary Keys by default (afaik, no more thought went into the decision than this).

The fact that the identity column was not used by the query for filtering (in either WHERE or JOIN) means that it's b-tree lookup structure was totally wasted on this query (& this query is by far the most performance sensitive query that accesses the table). This is a common problem when placing a CIX on IDENTITY columns.

This might not be so bad if another index could have been created to "cover" the query, but the query accesses more columns (& bytes) than are allowed in an index (this system was using SQL Server 2000) so another index cannot be designed to totally suit the needs of the query.

The MS support engineers who worked on this case tried to design Non-Clustered Indexes (NCIXs) with as many columns as possible but ultimately, the other columns not included in the index also have to be retrieved & forcing SQL Server to perform extremely expensive Clustered Index Bookmark Lookups. Even though these occur in memory, they are still enormously expensive & totally brought this high profile system down week after week last year during peak workloads.

The solution:

Because CIXs become the storage container for the tables they are built on (the table's rows are stored in the leaf pages of a CIX), they naturally contain all columns. A limited number of columns can be used for the index keys, but all columns exist on the leaf pages. Hence, CIXs always cover queries, although not necessarily in their index keys.

The trick to solving this particular problem was to recognise that placing the CIX keys on a few of the filtered columns would allow SQL Server to perform a partial range scan over however many rows those columns return & then evaluate the wildcard search in the affids column within the leaf pages as it scans over that range. This is far more efficient then evaluating those extra colums over an extremely expensive Bookmark Lookup, whether in memory or not, as no additional IOs need to be performed (where at least three extra reads are required PER ROW with the CIX bookmark lookup, massively increasing total reads performed by the query).

It appears that the engineers who worked on this problem either didn't consider altering the existing CIX or maybe they simply believed blindly in the "best practise" of placing CIXs on identity columns. Either way, they didn't try changing the CIX design & consequently failed to solve the problem.

So finally, the solution was to replace the identity based CIX with the following definition:

create clustered index cix_article on article (clid, grptype, chksmcurr, chksmver)

With this CIX, the query can simply traverse its b-tree, locating the range where clid = 2 & grptype = 'news', then scan that range whilst testing the other predicates - chksmcurr = chksmver, the wildcard search in affids etc. As it identifies each row, it can extract all other column values, sort & produce the top 5 rows. Crucially, no Bookmark Lookups are required.

Immediately after installing this index, the site dropped from extremtly high CPU utilisation (> 90% consistently at peak load) to less than 10% even at peak load. 

The reason why CPU load was reduced by such a substantial margin is that Bookmark Lookups are extremely expensive, even though the CIX b-tree pages are typically in memory. They often happen as such high frequency that they can easily overload even the most powerful systems. Note that in this case, the DB size was only ~2Gb, the server had 16Gb RAM, 8 high end CPUs & high end SAN storage. All that hardware couldn't handle the massive logical query inefficiency.

It's also crucial to note that the identity column was not involved in the query's filter predicates (although it was returned in the SELECT list). This is a significant problem with placing CIXs on identity columns if performance sensitive queries do not actually use that column for filtering - it not only wastes a useful b-tree structure (one that covers any query) it also introduces a costly layer of indirection for Bookmark Lookups if the query accesses many columns and cannot be easily covered with a NCIX. Many people place CIXs on identity columns to achieve fragmentation objectives without considering the opportunity cost with query processing (which is nearly always far more important than avoiding fragmentation, which is another very poorly understood topic in general)

Note also that I am demonstrating the usefulness of a CIX in this particular discussion (yes, they're often useful - just not always) but I am also showing why using identity columns blindly is a bad idea.

Ultimately, you need to think through how a query accesses its underlying data structures to get the best out of index design & resist the temptation to simply roll out "best practises" generically..

Published Tuesday, 12 June 2007 11:42 PM by Greg_Linwood

Comments

# Identites are not good aggregators

Having identified the major contributing query plans hitting a table then optimising performance is optimising the search order of the column vector predicates in the clustered index scan.

So if clid is a predicate in the majority query plan and is the finest grained first aggregation of the data in the table it should be first column in the primary clustered index.

In architectures where the $/MIP is higher people have developed tools to aid this and have been doing it for a long time (read decades).
Tuesday, 17 July 2007 10:42 AM by Dave

# re: Debunking myths about clustered indexes - part 5b - a real world solution

Not sure exactly what you mean by $/MIP Dave?

Solving this problem certainly shouldn't have required any expensive tools though - some simple human analysis should have sufficed but unfortunately, as is very common in our industry, SQL professionals tend to want to hang onto "generic practises" rather than analysing specific query indexing requirements, even with the most performance critical queries & sometimes to the complete detriment of system performance (& the customer's business)

Regards,
Greg Linwood
Tuesday, 17 July 2007 10:43 PM by Greg_Linwood

# re: Debunking myths about clustered indexes - part 5b - a real world solution

$/MIP - Dollar cost of execution per million of instructions processed (operational cost in Data Base layer).

I meant that the practice of designing the index (and schema really) considering the vectors of the data and the predicates of the queries is very common in mainframe where the cost of the DBA effort is low compared to the operational cost of the database.

Given cheap CPU, low concurrent transaction volumes and with a high development cost (relative to the operational cost)
an identity is fine and it isn't cost effective to spend time designing anything more efficient. (at least initially)

Still I am surprised that it would be considered the 'only' 'one size fits all' best practice.
Friday, 20 July 2007 2:47 PM by Dave
Anonymous comments are disabled