How to master SQL Server performance baselining to avoid false positives and/or missing alerts

Baselining SQL Server metrics have significant advantages over traditional predefined alert thresholds when monitoring SQL Server performance. What’s more, when tracking some wait types statistics, it is a must-have as there is no other way to interpret collected performance data correctly otherwise. However, to be able to learn how to use performance baselining and to understand its full potential some advanced knowledge will be presented first

SQL Server baselining challenges and required knowledge

Creating an adequately established performance baseline is one of the toughest jobs, and therefore baselining is mainly reserved for skilled and experienced SQL Server DBAs. While essential and irreplaceable in many situations, performance baselining must be handled with care and knowledgeable. Before jumping into any SQL Server performance baselining, the DBA must develop certain skills and gather some in-depth knowledge:

  • Good knowledge and understanding of SQL Server’s underlying hardware, in particular, CPU, physical memory, and related I/O subsystems (especially relations between the physical memory and the I/O subsystems)
  • More in-depth knowledge of underlying Windows operating system (in most cases Server versions)
  • In the case of a system based on virtualization, knowledge of hypervisor software, host machine and its virtual machines operating system, including the knowledge on relations between those three virtualization components
  • Profound knowledge of SQL Server internals like locking, spinlocks, latches, blocking and wait statistics
  • Knowledge of how to configure and change SQL Server configuration parameters in accordance with performance requirements
  • Knowledge of how to configure and implement changes in underlying Windows OS depending on SQL Server performance requirements
  • Thorough knowledge of database design
  • Good understanding of database applications, and workloads that database applications impose to database

Any system change imposed in any of the above categories could lead to an entirely different set of data, potentially making the created baselines useless. Using the various improvised solutions for collecting SQL Server data and calculating baselines are often inherently difficult to fine tune and if not handled carefully, often can create problems instead of solutions.

The goal of SQL Server baselining

SQL Server can be looked at as a live organism that reacts on external and internal variables in sometimes different and not always predictable ways. This can wind up making hard to understand the performance limits of SQL Server and what is the “normal” SQL Server state. A proper baseline should be the set of calculated threshold values that reflect the “normal” SQL Server workload, and it serves to prevent or reduce false alerts in SQL Server that is exposed to workload fluctuations

However, establishing a proper SQL Server performance baseline is not the goal in and of itself. The goal of establishing the limits of “normal” SQL Server workload is determining more precisely what can be considered as an “abnormal” SQL Server workload, as only by knowing well what is normal for SQL Server it is possible to understand what is abnormal

Having that in mind, it is important to know the following SQL Server performance hallmarks:

  • The normal workload – the range of SQL Server workload that allows optimal SQL Server performance during the everyday work. The normal workload information is something that is available during the regular SQL Server work and can be determined as a period where SQL Server regular workload does not cause any SQL Server performance issues. It is something that is used as a basis for baseline calculation
  • The high-water mark – the high-water mark presents the workload peaks of SQL Server that still allows SQL Server to work without degradation or within the acceptable/expected degradation scope. The high-water mark is not something that can be established from reviewing everyday SQL Server workload. For establishing high-watermarks in SQL Server performance, the specific load test on SQL Server should be executed in a way that allows determining performance peaks that are still acceptable
  • The maximum workload – the maximum SQL Server workload imposed over an unlimited time that still allows SQL Server to work without degradation or within the acceptable/expected degradation scope. The same as the high-water mark, the maximum workload can be determined only using the specific load tests that help to determine what the maximum continuous workload that can be imposed on SQL Server is

The latter two, high-water mark and maximum workload, are often disregarded categories as establishing either of those two is time and resources intensive work that requires executing a set of carefully created tests. Moreover, while creating a baseline based on the normal workload can satisfy DBAs needs in many cases, the maintenance of serious enterprise graded systems would require all three aspects to be taken in calculation when creating the performance baselines. The close to “ideal” baselines can be created only by having the data from all three mentioned performance parameters

Creating the proper SQL Server performance baseline

The preferred scenario to create a baseline is to have a historical period with data for calculation, where collected SQL Server metric data are on the verge of what is to be considered as “good” values. That means that collected data are from the period when SQL Server performance is not affected in any case. Theoretically, that allows threshold calculations based on standard deviations to meet the sweet points of the boundaries between the normal and abnormal SQL Server performance.

However, having and collecting SQL Server metric data with “ideal” values is rarely possible in real world. That is why having the insight in the normal workload, high-water mark, and the maximum workload is important. While baseline calculation based on the normal SQL Server workload metric data could help us to understand and to interpret SQL Server behavior when workload without significant ups and downs is imposed on SQL Server, it is also important to be aware that such a baseline is not something that guarantees the best results.

Sometimes, the business or end user requirements imposed could be stricter in setting the expectations on SQL Server that some operations must be executed faster or SQL Server responsiveness must be higher in some cases. Some examples are:

  • A SQL Server maintenance period cannot be longer than 30 minutes
  • SQL Server must return data to then end user in less than 5 seconds with up to 100 users producing simultaneous workload
  • Transaction log backup must be created every 2 hours and must complete within 5 minutes
  • No SQL Server blocking when continuously handling 700 batch request/second
  • A data retention policy of one year

Having such precise requirements imposed on SQL Server allows DBA to understand the performance expectations and end-user needs correctly and to built-in those into the resulting baseline calculations

Interpreting the baseline data

When working with baselines, there is a constant pressure imposed to interpret the data and the calculated baseline correctly or to perform a recalculation of the baseline from time to time when “better” values are collected. Therefore it is important to have insight and awareness of the baseline calculation methodology, statistical data behind the calculations such as the min and max values, baseline average (mean), standard deviations and other necessary parameters

It is true that the longer the time span used for the baseline calculation is, the more accurate the resulting baseline could be, but even then there is no guarantee that calculated baseline can hit the target and provide acceptable results. As a consequence, the DBA may face a situation to either deal with excessive alerting, or even worse the situation with no alerts even after SQL Server performance suddenly degraded. In the first case, due to the high number of alerts, it is impossible to distinguish real from false-positives in the context of performance problems. In the second scenario, the DBA is not even aware of any performance problems until they occur due to missing notifications about underlying issues issues.

So, mastering SQL Server baselining, on its own, requires specific skills and in-depth knowledge possessed only by more advanced/experienced DBAs. Moreover, even for them, surviving everyday urgent problems, additional requirements imposed and ad-hock requests by end users could leave them with a very little or no time for this critical skill.

Useful resources:
Establishing a Performance Baseline
Creating Simple Performance Baselines with SQL Server Profiler
How to detect SQL Server performance issues using baselines – Part 1 – Introduction
How to detect SQL Server performance issues using baselines – Part 2 – Collecting metrics and reporting

 

May 11, 2018