
A performance analysis example

January 15, 2006

Last week I was dragged kicking and screaming away from my allocated SQL 2005 days to spend two days diagnosing performance problems on one of the firm's systems and writing a report.
Due to the sensitive nature of the data on the system, it has been managed by another team and this was the first time I had looked at it. I thought I'd share my thought process for identifying the problem on the server and reference the tools I used. It's the first iteration of the review; once my recommendations are implemented there will be another review to see what else needs dealing with. Please note, however, that I have not included every single step I took, otherwise I'd be writing a novel!

Scenario

Users of the system have been reporting intermittent timeouts in the front-end application. The current system administrators have noted periods of high CPU usage. The server is Windows 2000 SP4 running SQL Server 2000 SP3a with two physical processors, hyper-threading enabled, 2 GB of memory and SAN-attached storage (EVA 5000).

Analysis

The first port of call was to review the event logs and SQL Server logs for any errors that might indicate what was happening. These logs revealed nothing. The next port of call was to connect via Query Analyser and see what was going on. A quick look at current activity showed nothing much going on, so I ran the standard SELECT * FROM sysprocesses. This revealed that 80% of the connected SPIDs' last waits had been for a latch, and specifically that the waitresource was 2:3:69 or 2:1:24.
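For reference, this is a minimal sketch of that kind of query against the SQL Server 2000 sysprocesses table (the SPID threshold is just a common convention for skipping system sessions):

-- Pull the current and last wait information for user SPIDs.
SELECT  spid,
        status,
        blocked,
        lastwaittype,
        waitresource,
        waittime,
        dbid,
        cmd
FROM    master..sysprocesses
WHERE   spid > 50                -- system SPIDs typically sit below 50
ORDER BY waittime DESC;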

The waitresource translates as follows: the first number is the database ID (in this case tempdb), the second is the file ID and the third is the page number.
So, this made me suspicious and I moved on. Next thing: we have a timeout at the front end, so what is the server waiting for? To find this out I created the track_waitstats and get_waitstats procedures on the server and ran track_waitstats for 8 hours with 5-minute samples. At the same time I kicked off a server-side trace and Performance Monitor with all my standard counters and left them running for 24 hours.
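The track_waitstats and get_waitstats procedures come from Microsoft's published wait-statistics scripts. If you don't have them to hand, a rough home-grown equivalent of the sampling half can be put together around DBCC SQLPERF(WAITSTATS); the table name, sample count and interval below are illustrative:

-- Snapshot the cumulative instance-wide wait statistics on a schedule
-- so the deltas between samples can be examined later.
CREATE TABLE waitstats_samples (
    sample_time      datetime       NOT NULL DEFAULT (GETDATE()),
    wait_type        varchar(80)    NOT NULL,
    requests         numeric(20, 1) NULL,
    wait_time        numeric(20, 1) NULL,
    signal_wait_time numeric(20, 1) NULL
);
GO

DECLARE @i int;
SET @i = 0;
WHILE @i < 96                                 -- 96 samples x 5 minutes = 8 hours
BEGIN
    INSERT INTO waitstats_samples (wait_type, requests, wait_time, signal_wait_time)
        EXEC ('DBCC SQLPERF(WAITSTATS)');
    SET @i = @i + 1;
    WAITFOR DELAY '00:05:00';
END;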

One last thing to do was to take a look at how SQL Server was using its memory. For this I used DBCC MEMORYSTATUS and DBCC MEMUSAGE. This did not reveal anything alarming, but DBCC MEMUSAGE did show that a log table was the top buffer cache user, so I noted this and would later suggest that the table be reviewed and archived as appropriate.

The next day it was time to start reviewing the information I had gathered. I loaded up the trace with my trusty trace analysis tool, which takes all the pain out of working with traces. Within 30 minutes the tool had analysed the trace, loaded it into a normalised database and presented me with a summary report of the most interesting information. A lot of the time I don't need to go any deeper than the summary report, although in a small number of cases you do have to, and then it's a case of querying the normalised database to extract more information and, in very rare cases, going back to the raw trace (painful!).
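For anyone without a dedicated trace analysis tool, a rough first pass at the same idea can be done with the built-in fn_trace_gettable function. This sketch does not normalise the statements the way a proper tool does (parameter values are not stripped out), and the file path and table name are illustrative:

-- Load the server-side trace file into a working table.
SELECT  *
INTO    trace_workload
FROM    ::fn_trace_gettable('D:\Traces\perf_trace.trc', DEFAULT);

-- First-pass aggregation: heaviest statements by total reads.
SELECT TOP 20
        CAST(TextData AS nvarchar(400)) AS statement_text,
        COUNT(*)                        AS executions,
        SUM(Reads)                      AS total_reads,
        SUM(CPU)                        AS total_cpu,
        SUM(Writes)                     AS total_writes
FROM    trace_workload
WHERE   TextData IS NOT NULL
GROUP BY CAST(TextData AS nvarchar(400))
ORDER BY SUM(Reads) DESC;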

The first thing I looked at was the wait statistics captured using track_waitstats. This gave me the key to the problem: it indicated that the server spent 45% of its time waiting for latches! The other most significant wait, at 32%, was CXPACKET, which relates to parallelism. With latching in my head I started thinking I/O bottleneck, but a little voice said, "Unlikely, you are attached to a high-performance SAN." So I went to review my performance counters, specifically the average and current disk queue lengths alongside disk reads and writes. Nothing! Queue lengths were negligible and the reads and writes were nothing to boast about. So what next? Well, I needed to know where the latching was occurring, so I set up a loop to capture the output of sysprocesses for 45 minutes. Now this was interesting: most of the waiting was occurring in tempdb! Aha! Everything now pointed to tempdb contention. For a moment I thought it was the classic contention and I could simply follow the Microsoft article about it. Never that easy though! I was not seeing waitresources of 2:1:1 or 2:1:3 (the PFS and SGAM pages usually implicated in classic tempdb allocation contention).
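A minimal sketch of that capture loop, assuming a 30-second sampling interval (the table name, interval and wait-type filter are illustrative):

-- Sample latch waits from sysprocesses periodically so the waitresource
-- values can be aggregated afterwards.
CREATE TABLE latch_wait_samples (
    sample_time  datetime   NOT NULL DEFAULT (GETDATE()),
    spid         smallint   NOT NULL,
    lastwaittype nchar(32)  NULL,
    waitresource nchar(256) NULL,
    waittime     int        NULL,
    dbid         smallint   NULL
);
GO

DECLARE @i int;
SET @i = 0;
WHILE @i < 90                                 -- 90 samples x 30 seconds = 45 minutes
BEGIN
    INSERT INTO latch_wait_samples (spid, lastwaittype, waitresource, waittime, dbid)
    SELECT  spid, lastwaittype, waitresource, waittime, dbid
    FROM    master..sysprocesses
    WHERE   lastwaittype LIKE 'PAGELATCH%'
         OR lastwaittype LIKE 'PAGEIOLATCH%'
         OR lastwaittype LIKE 'LATCH%';

    SET @i = @i + 1;
    WAITFOR DELAY '00:00:30';
END;

-- Afterwards: which databases and pages were waited on the most?
SELECT  dbid, waitresource, COUNT(*) AS samples
FROM    latch_wait_samples
GROUP BY dbid, waitresource
ORDER BY COUNT(*) DESC;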

So my wait resources were primarily 2:3:69. What is this? I delved into the page to take a look, using DBCC PAGE to find out what is on the page and which object it belongs to. Both of the pages being flagged belonged to sysindexes in tempdb. The alarm bells were now ringing: the contention was in tempdb, around sysindexes. Probably some dodgy code, so it was back to the summary results of my trace to take a look at the top queries by CPU, reads and writes. Voilà! There it was: two queries consumed 40% of all the resources requested from SQL Server on the box.
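For anyone who hasn't used it, this is roughly what that check looks like (DBCC PAGE is undocumented, and the final lookup uses the m_objId value reported in the page header output):

-- DBCC PAGE sends its output to the error log unless trace flag 3604
-- is on, which redirects it to the client connection.
DBCC TRACEON (3604);

-- Arguments: database ID, file ID, page number, print option.
-- The waitresource 2:3:69 therefore decodes as tempdb (dbid 2), file 3, page 69.
DBCC PAGE (2, 3, 69, 0);

-- Look up the object that owns the page via m_objId from the header;
-- object ID 2 is sysindexes.
SELECT  name
FROM    tempdb..sysobjects
WHERE   id = 2;   -- substitute the m_objId reported by DBCC PAGE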

So, tuning time. The primary problem was that one of the pieces of code was doing a SELECT * INTO #tbl FROM badly_written_view. To the best of my understanding it was actually this that was causing the latch contention in tempdb. I believe that when you do SELECT * INTO, latches are held on the appropriate system tables, including sysindexes, for the duration of the transaction, which was up to 3.5 minutes in this case. So I removed the SELECT * INTO and changed it to a CREATE TABLE followed by an INSERT INTO, as sketched below. I then spent the next five hours optimising the code so I could send it back to the developers. I did not have to optimise the code, but I was so horrified by what I saw that a little part of me wanted to say to them, "And that's how you do it." I managed to halve the execution times on both bits of code. There was still room for improvement, but that was going to take too much time, as I was not familiar enough with the dataset and my optimising was rusty.
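A sketch of that change; the view is the one named above, but the column names and types are made up for illustration:

-- Before: SELECT ... INTO creates and populates the temporary table in
-- one statement, holding tempdb metadata latches for the life of the
-- statement (and, inside a transaction, until it completes).
SELECT *
INTO   #tbl
FROM   badly_written_view;

-- After: create the temporary table up front with an explicit schema,
-- then populate it, so the metadata work is separated from the
-- long-running population of the table.
CREATE TABLE #tbl (
    id          int          NOT NULL,
    description varchar(100) NULL,
    created_on  datetime     NULL
);

INSERT INTO #tbl (id, description, created_on)
SELECT  id, description, created_on
FROM    badly_written_view;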

The developers now have my recommendations and new code for review; hopefully the changes will be implemented in the next week and I can confirm my belief regarding SELECT * INTO. Either way, I'll be doing another round of reviewing.

I've decided to stop here for the blog, but a number of other recommendations were made, which I've listed below. Every iteration of a performance review is different and other tools I have not mentioned get used, but I hope this article gives you an insight into tuning and that its links are useful.

1) Increase the number of tempdb data files to match the number of processors (best practice; see the configuration sketch after this list).

2) Reduce the maximum degree of parallelism to n - 1, where n is the number of physical processors.

3) Review and archive the log table identified earlier.
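Items 1 and 2 boil down to a couple of configuration statements. A hedged sketch, assuming illustrative file names, paths and sizes, and the two physical processors described above:

-- Item 1: add tempdb data files so the file count matches the number of
-- processors (names, paths and sizes are illustrative).
ALTER DATABASE tempdb
ADD FILE (NAME = tempdev2, FILENAME = 'T:\tempdb\tempdev2.ndf', SIZE = 512MB);

ALTER DATABASE tempdb
ADD FILE (NAME = tempdev3, FILENAME = 'T:\tempdb\tempdev3.ndf', SIZE = 512MB);

-- Item 2: with two physical processors, n - 1 gives a max degree of
-- parallelism of 1.
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE WITH OVERRIDE;
EXEC sp_configure 'max degree of parallelism', 1;
RECONFIGURE WITH OVERRIDE;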

Who knows, the next review may turn up a whole new world of pain or confirm that the server is now running in a satisfactory state! One thing I can say for sure: the way I approach the problem and the tools I use next time will be similar, but definitely not the same.

Categories: Performance, SQL Server
