Troubleshooting SQL Server performance issues using wait statistics – Part 2

In the previous article we described native SQL Server capabilities for troubleshooting SQL Server performance issues using wait statistics and especially the query wait statistic.

In this part, we’ll focus on a 3rd party solution – ApexSQL Monitor. ApexSQL Monitor is a SQL Server and system performance monitoring tool designed for monitoring of operating system, SQL Server, and database performance metrics in real time including wait statistics as well as query wait statistic on cumulative and individual query level. Besides being able to track wait statistics, ApexSQL Monitor allows a database administrator to configure and receive alert notifications about wait statistics as well.

To demonstrate ApexSQL Monitor’s ability to point out at the root cause of a reported SQL Server performance issue, we will use a real world situation encountered by DBAs quite regularly, Disk latency

Usually SQL Server troubleshooting starts after alerts are raised due to some irregular activity. In the following example we will consider a frequent situation, a disk latency alert is raised. Disk latency is a measure of the time delay from the time a disk IO request is created, until the time the disk IO request is completed. This usually indicate that workload is performing out of the acceptable bounds and alerts like these have to be resolved

The best way to start troubleshooting is visiting the ApexSQL Monitor Alerts window where a list of raised alerts is displayed. To do that, click on the Alerts button in the Configure menu group

The Alerts window in ApexSQL Monitor displays information about potential performance issues for the operating system, SQL Server instances, and databases. Alerts can be configured for each monitored machine, SQL Server instance, and databases. Selecting the Average read time metric in the Alerts window will display the metric’s chart for the period when alerts were detected. Next to selected Average read time title 3 icons are shown for different alert severity and but the 50 next to High (red shield) means that a high alert is triggered 50 time for the period of time displayed in the Average read time chart.

From the graphs it is clearly visible that there were some large spikes between 6:30 PM and 7:05PM (highlighted area in the chart). Disk latency issues doesn’t always mean that SQL Server is involved, but for the purpose of this article, we’ll assume it is the culprit

Therefore, the next step in our SQL Server performance troubleshooting would be to check whether and what wait stats were contributing to this performance issue and next what queries were running at these times. In this particular situation, for disk IO specific investigations, attention should be focused on wait stats related to I/O i.e. Wait stats that include “IO” within the name – PAGEIOLATCH.

For a better understanding of what is going on, it is helpful to understand the meaning of PAGEIOLATCH wait types and how they affects SQL Server. A description of the most common types are included in the Appendix of this article

To inspect the wait stats chart, go to the Overview tab of the affected instance. By using the Show next series and Show previous series buttons in the Wait stats chart, quickly review all the waits stats and then isolate the particular wait stats of interest. The wait stats names in the chart are in descending order based on their Wait stats time, which mean that Wait stats with the longest wait time will be listed first. In our case, we should focus on the PAGEIOLATCH_SH and PAGEIOLATCH_EX

Closer inspection of the chart reveals that PAGEIOLATCH_SH and PAGEIOLATCH_EX are at the period of time when alerts are raised, way above the usual values. In this case there was a spike for both, PAGEIOLATCH_SH and PAGEIOLATCH_EX at the time period when Average disc seconds per read alerts were raised

Next, check the Query performance tab with attention to the Avg I/O per sec chart to see which queries are using the most I/O

The graph indicates that there are two queries with high average I/O per second. Therefore, in this particular case it is obvious that queries are causing such high wait times that affects the SQL Server performance. ApexSQL Monitor allows you to inspect the problem deeper, up to the very root cause by providing query wait statistics for the specific alert period via a double click on a marked point at the chart as indicated in the trackball message (see above) for the selected measurement value on the Average read time chart

The application focus will be immediately switched to the query wait tab. The query waits tab displays the 1 hour period that includes the selected alert value from the alerts tab. Since there is a known period that have to be investigated, selecting the appropriate date and time should be done to cover the complete period affected by alerts. Click on the calendar icon next to the displayed date in the Query waits tab, will open the calendar where desired date should be selected

By clicking on “View specific hour” box, the time picker will be opened where time can be selected

If a specific time period is required, the time can be entered manually in the “View specific hour” box, and chart will display the 1 hour period starting from the time entered

The Query waits allows inspecting of the individual queries and their wait statistics, by selecting the Single queries radio button

When looking at the chart above, two queries with high wait times can be seen – Query1 (blue chunk) and Query2 (green chunk). Both with wait types related to I/O issues that we are troubleshooting. Queries are affected by high wait stats values for the PAGEIOLATCH_SH and PAGEIOLATCH_EX. It is that wait times for this wait types higher than 20 ms that should be look for as indicators of disk bottlenecks. High wait times detected in this particular case for mentioned I/O wait types mean that these queries need to be investigated and potentially, query tuning performed.

A 3rd party solution such as ApexSQL Monitor when used for SQL Server performance troubleshooting help resolve SQL Server performance problems faster, speeding development cycles and maintain stable SQL Server performance

Appendix – PAGEIOLATCH examples

The most common PAGEIOLATCH examples are:

PAGEIOLATCH_SH – (SH = share) occurs when query is waiting on a latch to be loaded from disk into the buffer pool for reading its content. The latch request is in shared mode and long waits often indicate problems where inefficient SQL statements read more data than necessary and as a result will try to acquire more latches than necessarily, but also it can suggests issues with the disk I/O subsystem.

PAGEIOLATCH_EX – (EX = exclusive) occurs when query is waiting for a on a latch to be brought from disk into the buffer pool for modifying its content. When a page that is not in the buffer cache is requested, SQL Server has to allocate a buffer page first, and then it place an exclusive PAGEIOLATCH_EX latch while the page is transferred from disk to cache. After the write to cache finishes, the PAGEIOLATCH_EX latch is released. This wait type will prevent access to the page while being written and only one per page can be held. Presence of excessive waits of this type usually indicate problems with the disk subsystem

PAGEIOLATCH_UP – (UP = update) occurs when query wait for an update latch on a page, but the page is not available in memory and has to be loaded from disk.  Update latches are common when system allocation pages are being updated or often during row versioning in tempdb. Unlike exclusive latches, they do not update the page data itself. The update latch remains in place until pages are successfully written to disk. It cannot be held concurrently with existing destroy or exclusive latches

In practice, resolving these wait types require Inspecting and tuning of the queries execution plans for expensive I/O operations, such as scanning large tables and indexes but also taking care of missing indexes, or of indexes containing an incorrect sort order

May 21, 2015