How to baseline SQL Server wait statistics

A quick intro into SQL Server wait statistics

Starting with SQL Server 2005, Microsoft introduced the ability of SQL Server to track executed queries executed by measuring and logging the time that executed query has to wait for various resources such as other queries, CPU, memory, I/O, etc. This performance metric is referred to as a SQL Server wait statistic. Wait statistics are most precise and the most reliable way for tracking and identifying SQL Server performance problems as it allows measuring each step the query has taken during execution.

Comparing to the standard set of CPU, Memory, I/O and other metrics, SQL Server wait statistics has, in most cases, better precision required for SQL Server performance troubleshooting and/or optimization.

Why baseline wait statistics?

The nature of some wait types is that having either high or low values is not always the wrong or right, respectively. As a matter of fact in many cases, high values might be the indication of the better SQL Server performance.

Some typical representatives of wait types that cannot be understood and interpreted are CXPACKET and PAGEIOLATCH_SH among others.

  • CXPACKET, for example, is the wait type which indicates parallelism in SQL Server and having highly utilized parallelism mean that SQL Server processes the data faster. Therefore, this means that high values of the CXPACKET wait type indicates that the SQL Server is well optimized
  • PAGEIOLATCH_SH indicates physical I/O readings from the storage subsystem into a buffer pool, which is the normal operation and should not be considered as a problem as long as the storage subsystem can handle those operations without affecting the performance of SQL Server

So how can we determine what wait time value for some specific wait type should be considered “bad”? Moreover, how can we determine the thresholds for wait types that would be a real indicator that the waits time value is a potential problem? The only reliable way of establishing the thresholds that can be considered, in most cases, as correct, is to baseline the collected wait types values

It is clear from those articles that collecting Wait statistics manually, and then adjusting and measuring baseline by hand with every change introduced to the system is very challenging.

Baselining wait statistic with ApexSQL Monitor

ApexSQL Monitor is a third-party tool designed with wait statistics in mind. It is capable of monitoring and collecting wait stats data for all wait types. At the same time, it is highly configurable allowing the user to optimize the monitoring by choosing what wait types to monitor (what wait types to exclude from monitoring), as well as establishing the predefined alert thresholds and finally the ability to baseline collected data and use the calculated baselines as thresholds for alerting

To calculate a wait statistic baseline:

  1. Select the SQL Server in the server explorer pane and click the Configuration button in the main menu bar
  2. Select the Baselines tab in the Configuration page
  3. Chose the time range for which the baseline should be calculated. A minimum 7 days of collected data must be present in the repository database for baseline calculation
  4. Check the Wait stats checkbox
  5. After pressing Calculate, ApexSQL Monitor will calculate baselines for the defined time range

Technically, when baselining wait statistics, it is important to distinguish three different scenarios where each requires the different approach to baselining and requires the different interpretation of the calculated data:

  1. Calculate the wait statistic baseline for a well-optimized SQL Server that doesn’t normally experience issues. This is the standard approach in creating the baseline when the server is working optimally and, in such scenario, the calculated baseline is used to monitor and notify the user whenever the wait time value of wait types exceed normal values in the monitored period
  2. Calculating wait statistic baselines when SQL Server experiences performance problems. In such scenarios, the baselining should not be used for alerting as the calculated values couldn’t be treated as reference values. When used in this manner, the baseline should serve only for measuring the troubleshooting progress and whether the troubleshooting action improved or degraded the SQL Server performance
  3. The third method is not frequently used and mainly by the database and application developers so it won’t be elaborate in details in this article. The calculated baseline is used solely for providing the information to developers in what way the application upgrade affects the SQL Server performance, or how the SQL Server upgrade affects the application performance, or to allow the developers to have the results control for various developmental stages or bug fixing. So the baseline value, in this case, doesn’t represent either the good or the bad performance, but rather just the reference point that can be used in development cycles for controlling the development progress

Calculating the baseline for a well-optimized SQL Server

The following above is a typical situation that could be encountered when the wait statistic baseline is calculated for an optimized SQL Server. The arrows in the image mark the PAGEIOLATCH_SH wait type values that are breaching the high thresholds and will trigger the high alerts. In such a scenario, we have some breaches of the threshold that should be investigated, but considering that in all cases thresholds are breached by wait type values that are just slightly above the threshold, it doesn’t have to trigger a knee-jerk reaction that each breach automatically means that some performance issue is the cause

The high baseline threshold breaches are marked with red arrows while medium breaches are marked with yellow arrows.

Now let’s take a closer look at the period marked with the white square in the image below. Twelve consecutive PAGEIOLATCH_SH values are breaching either the high or medium thresholds. Therefore, it is evident that in that prolonged period something unexpected occurred that caused the larger PAGEIOLATCH_SH wait type values. While it doesn’t necessarily mean that SQL Server performance is compromised, it would be a good practice to investigate that period for potential or existing problems .

Here, we have a situation where the PAGEIOLATCH_SH wait type values were constantly (though not significantly) above the threshold for a prolonged period. The fact that the values are consistently above the threshold even the PAGEIOLATCH_SH values are not much higher than the baseline threshold is the clear indicator that the affected period should be investigated.

The result of the investigation could be that the performance of SQL Server was affected in that period, in which case the DBA should preform the analysis and troubleshooting. Alternatively, the results of the investigation might show that the SQL Server performance was not affected anyhow in that period. In such a case, the DBA should pay attention to that period and whether such behavior is repeating. In such a case where repeating occurs without affecting the SQL Server performance, editing and correcting the calculated baseline should be considered. The corrected wait statistic baseline should accommodate the new behavior of PAGEIPOLATCH_SH for that period to be treated as the normal.

Another scenario that could be encountered is presented in the image below. It’s again a typical scenario where values of the wait type are significantly above the threshold for a prolonged period. This is typically an indicator of performance problems that require serious investigation and troubleshooting. While it is highly unlikely that such behavior would not affect SQL Server performance, that option still has to be taken into account during the investigation. It should always be correlated to the level of performance degradation in that period, and whether the end users are suffering as a consequence of such high values; or performance degradation is still within acceptable margins.

Calculating baselines when SQL Server has performance problems

As already explained, this method should be used only to track the progress of the troubleshooting.

This is characteristicly charting data that would appear after the baseline was calculated for wait statistics while a SQL Server experiences performance problems. For this article, let’s consider that PAGEIOLATCH_SH wait type is the cause of the performance problem. Even the values are quite high when the baseline calculates for such values those are treated as the normal so all of the PAGEIOLATCH_SH values will be displayed well within the normal baseline zone. Now it is clear that once the baseline is calculated for sub-optimal values, the application will use those values as normal values, with a consequence that alerts will not be triggered.

For this reason, this method should be reserved exclusively for wait statistic troubleshooting and is not recommended for less experienced DBAs. With this baseline, though, the DBA can track the progress of the troubleshooting.

The wrong solution could cause even worse values of the PAGEIOLATCH_SH wait type in our example

As can be seen in the next image, all PAGEIOLATCH_SH values are increased breaching already highly set thresholds due to an unsuccessful wait statistic troubleshooting. That means that the implemented solution worsened the situation and should be reverted and another approach tried.

Of course, when the applied solution is effective in improving the existing issue, that reflects in the wait statistic charts as well.

Here, applying appropriate changes, at a specific moment, decreases the PAGEIOLATCH_SH values significantly, making them significantly lower compared to the “bad” baseline. That indicates significant improvement. In a situation where additional wait statistic troubleshooting and performance improvement is targeted, the appropriate approach would be to edit the baseline to accommodate the new values achieved for a more comfortable and precise tracking of the troubleshooting improvements.

 

January 9, 2019