SQL Server – Finding the Source of Excessive I/O
April 5, 2013
I recently had to troubleshoot a poorly-performing instance of SQL Server. The cause turned out to be a single stored procedure that was producing excessive disk I/O.
Following is a description of the process I used to identify the problem and find its source.
This is a summary of the Windows server and SQL Server instance that was experiencing the problem:
SQL Server 2005 Enterprise edition
Windows 2008 R2 Standard edition
23 separate user databases, 7 in use 24/7
The first sign that something was wrong was slowness in multiple web sites and applications that accessed databases hosted on the server. Because these sites and applications all used different databases, it appeared that the problem was affecting the entire server.
First I opened the SQL Server Management Studio Activity Monitor, and examined the four graphs at the top of the monitor. The graphs of “% Processor Time”, “Waiting Tasks”, and “Batch Requests/sec” were all within the expected operating ranges. The “Database I/O” graph, on the other hand, showed that I/O was running much higher than normal, and that the high level of I/O was nearly constant.
Note: It is important that you are familiar with the normal operating ranges for these metrics on your server. Without that knowledge, it is much harder to identify problems. Every server is different, so you need to know your server.
- Next I used this script to examine the SQL Server wait conditions. PAGEIOLATCH_SH was higher than normal. I now had pretty clear indications that excessive I/O was the reason the server was performing poorly.
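The original script isn't reproduced here, but a minimal query along the same lines can be run against the `sys.dm_os_wait_stats` DMV, which accumulates wait statistics since the last instance restart (this is a sketch, not the exact script I used):

```sql
-- Top waits since the last restart; high PAGEIOLATCH_* waits
-- suggest the server is waiting on disk I/O.
SELECT TOP (10)
    wait_type,
    waiting_tasks_count,
    wait_time_ms,
    signal_wait_time_ms
FROM sys.dm_os_wait_stats
ORDER BY wait_time_ms DESC;
```

Because the counters are cumulative, comparing two snapshots taken a few minutes apart gives a better picture of what the server is waiting on right now.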
- The first step I took to find the cause of the excessive I/O was to check for abnormal behavior from the applications that were accessing the server. Primary suspects were two web sites backed by the SQL Server instance, but neither were experiencing any unusual loads. No other applications were showing signs of unusual database activity.
- Since the applications accessing the server checked out, the next step was to look at the server itself. I opened the Task Manager on the server and confirmed that no unexpected processes (such as backup or antivirus applications) were running.
- Still on the server, I clicked on the “Performance” tab of the Task Manager and then clicked the “Resource Monitor…” button to open the Resource Monitor. The Resource Monitor provides more detailed statistics for CPU, Disk, Network, and Memory resources. By examining the Disk section, I was able to identify one particular database that was showing unusually high I/O numbers. For that database, both “Total Bytes/sec” and “Read Bytes/sec” were high.
The Windows Resource Monitor gives in-depth details about server resource usage.
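If you prefer to stay inside SQL Server, a similar per-database breakdown is available from `sys.dm_io_virtual_file_stats`. A sketch (column totals are cumulative since instance startup):

```sql
-- Cumulative I/O per database file since the instance started.
-- The database with outsized num_of_bytes_read is the one to investigate.
SELECT
    DB_NAME(vfs.database_id) AS database_name,
    mf.physical_name,
    vfs.num_of_reads,
    vfs.num_of_bytes_read,
    vfs.num_of_writes,
    vfs.num_of_bytes_written
FROM sys.dm_io_virtual_file_stats(NULL, NULL) AS vfs
JOIN sys.master_files AS mf
    ON vfs.database_id = mf.database_id
   AND vfs.file_id = mf.file_id
ORDER BY vfs.num_of_bytes_read DESC;
```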
I had now narrowed my search for the problem down to one particular database on the server. In the past, I have seen out-of-date statistics cause unusual query engine behavior… such as simple queries taking much longer to run than they should and consuming far more resources than necessary. That knowledge led me to the next step.
- I executed UPDATE STATISTICS for the major tables in the database that was showing excessive I/O. However, this had no effect.
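For reference, the statistics refresh looked something like this (the table names here are hypothetical placeholders):

```sql
-- Hypothetical table names; FULLSCAN is thorough but slower than sampling.
UPDATE STATISTICS dbo.Orders WITH FULLSCAN;
UPDATE STATISTICS dbo.OrderDetails WITH FULLSCAN;

-- Alternatively, refresh statistics for every table in the database:
EXEC sp_updatestats;
```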
At this point, I suspected that the problem was a bad query plan for one stored procedure. Why? Simply a hunch based on past experience… I’ve encountered the problem before. One bad procedure query plan can have the same effect as bad statistics. The tricky part is finding the procedure that is the problem, because procedures and queries across all databases on the server run slowly when the machine is starved for resources.
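Once a suspect procedure is identified, one cheap way to test the bad-plan hypothesis is to force a fresh compilation and see whether behavior changes (the procedure name below is a hypothetical placeholder):

```sql
-- Marks the procedure for recompilation; its cached plan is discarded
-- and the next execution compiles a new one.
EXEC sp_recompile N'dbo.usp_SuspectProcedure';
```

This is far less disruptive than clearing the entire plan cache, since it touches only the one object.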
Note: There are commands like SET STATISTICS IO and SET STATISTICS TIME that help you to tune your SQL procedures and queries, but they are only useful if you know where the problem is. They are great if you need to tune an individual query/procedure, but they don’t help you find the one procedure out of hundreds that is slowing everything down.
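For a single suspect procedure, the usage pattern is simple (the procedure name is a hypothetical placeholder); the I/O and timing figures appear on the Messages tab in Management Studio:

```sql
SET STATISTICS IO ON;
SET STATISTICS TIME ON;

EXEC dbo.usp_SuspectProcedure;  -- hypothetical procedure under test

SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;
```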
- The next step was to start the SQL Server Profiler and do the following…
a) Set up a trace that watched "RPC:Completed" events.
b) Set a filter for “Duration” greater than or equal to 5000. Ideally I would have set filters on “Reads” and “Writes”, but I had no idea what read/write thresholds I was looking for… so instead I chose to look at just the particularly long-running queries.
c) Set a filter on “Database Name” to view only events in the problem database.
d) Collect 10 minutes of data.
Paying close attention to the I/O statistics (Reads/Writes), I was able to identify one particular procedure that was producing around 1,500,000 reads each time it was executed. This was more than five times the number of reads produced by other known "expensive" queries. In addition, this procedure was being run over and over… as soon as one execution completed, another was started. This was OK (that was the nature of the application that was invoking the procedure), but it was a further indication that this was the culprit… the procedure was always running. (Recall that the initial investigation showed high I/O at a nearly constant level.)
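As an alternative to a Profiler trace, the plan cache can be queried directly for the most read-intensive statements. A sketch using `sys.dm_exec_query_stats` (figures are cumulative for each cached plan):

```sql
-- Top cached statements by total logical reads; a runaway procedure
-- tends to dominate this list.
SELECT TOP (10)
    qs.total_logical_reads,
    qs.execution_count,
    qs.total_logical_reads / qs.execution_count AS avg_reads_per_exec,
    SUBSTRING(st.text, (qs.statement_start_offset / 2) + 1,
        ((CASE qs.statement_end_offset
              WHEN -1 THEN DATALENGTH(st.text)
              ELSE qs.statement_end_offset
          END - qs.statement_start_offset) / 2) + 1) AS statement_text
FROM sys.dm_exec_query_stats AS qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st
ORDER BY qs.total_logical_reads DESC;
```

One caveat: entries disappear when plans are evicted from the cache, so a trace is still the more reliable tool when you need to observe activity over a specific window.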
- Now that I had identified one procedure as the cause of the excessive I/O, the next task was to tune the procedure. Query tuning is a topic about which many articles (and books!) have been written, so I won’t cover those details here. Ultimately I was able to reduce the number of reads and CPU usage of the procedure by a factor of 10. Success!
- The last step was to confirm that I/O usage dropped to normal levels after the modifications to the stored procedure, and to continue monitoring the server to be certain that the problem had been corrected.
With this process, I was able to successfully identify and correct the cause of excessive I/O and poor performance on SQL Server. I hope you find this helpful in troubleshooting your own performance problems.