An XEvent a Day (13 of 31) – The system_health Session

Today’s post was originally planned for this coming weekend, but seems I’ve caught whatever bug my kids had over the weekend so I am changing up today’s blog post with one that is easier to cover and shorter.  If you’ve been running some of the queries from the posts in this series, you have no doubt come across an Event Session running on your server with the name of system_health.  In today’s post I’ll go over this session and provide links to references related to it.

When Extended Events was introduced in SQL Server 2008, the Produce Support Services group worked with the Extended Events developers to create the definition for an Event Session that could be shipped with SQL Server 2008, would startup automatically when SQL Server starts up, and contained Events of interest in troubleshooting common problems seen by the PSS Engineers.  Bob Ward (Blog|Twitter) blogged about the details of the system_health session that shipped with SQL Server 2008 in his blog post Supporting SQL Server 2008: The system_health session.  The script for this Event Session is inside of the utables.sql script file that is in the instance Install folder (for example c:\Program Files\Microsoft SQL Server\<InstanceDesignator>\MSSQL\Install) and can be used to recreate the Event Session if you inadvertently change it.

-- The predicates in this session have been carefully crafted to minimize impact of event collection
-- Changing the predicate definition may impact system performance
--
CREATE EVENT SESSION system_health ON SERVER
  • The sql_text and session_id for any sessions that encounter an error that has a severity >=20.
  • The sql_text and session_id for any sessions that encounter a memory-related error. The errors include 17803, 701, 802, 8645, 8651, 8657 and 8902.
ADD EVENT sqlserver.error_reported
(
    ACTION (package0.callstack, sqlserver.session_id, sqlserver.sql_text, sqlserver.tsql_stack)
    -- Get callstack, SPID, and query for all high severity errors ( above sev 20 )
    WHERE severity >= 20
    -- Get callstack, SPID, and query for OOM errors ( 17803 , 701 , 802 , 8645 , 8651 , 8657 , 8902 )
    OR (ERROR = 17803 OR ERROR = 701 OR ERROR = 802 OR ERROR = 8645 OR ERROR = 8651 OR ERROR = 8657 OR ERROR = 8902)
),
  • A record of any non-yielding scheduler problems. (These appear in the SQL Server error log as error 17883.)
ADD EVENT sqlos.scheduler_monitor_non_yielding_ring_buffer_recorded,
  • Any deadlocks that are detected.
ADD EVENT sqlserver.xml_deadlock_report,
  • The callstack, sql_text, and session_id for any sessions that have waited on latches (or other interesting resources) for > 15 seconds.
ADD EVENT sqlos.wait_info
(
    ACTION (package0.callstack, sqlserver.session_id, sqlserver.sql_text)
    WHERE 
    (duration > 15000 AND 
        (    
            (wait_type > 31    -- Waits for latches and important wait resources (not locks ) that have exceeded 15 seconds. 
                AND
                (
                    (wait_type > 47 AND wait_type < 54)
                    OR wait_type < 38
                    OR (wait_type > 63 AND wait_type < 70)
                    OR (wait_type > 96 AND wait_type < 100)
                    OR (wait_type = 107)
                    OR (wait_type = 113)
                    OR (wait_type > 174 AND wait_type < 179)
                    OR (wait_type = 186)
                    OR (wait_type = 207)
                    OR (wait_type = 269)
                    OR (wait_type = 283)
                    OR (wait_type = 284)
                )
            )
  • The callstack, sql_text, and session_id for any sessions that have waited on locks for > 30 seconds.
            OR 
            (duration > 30000        -- Waits for locks that have exceeded 30 secs.
                AND wait_type < 22
            ) 
        )
    )
),
  • The callstack, sql_text, and session_id for any sessions that have waited for a long time for preemptive waits. The duration varies by wait type. A preemptive wait is where SQL Server is waiting for external API calls.
ADD EVENT sqlos.wait_info_external
(
    ACTION (package0.callstack, sqlserver.session_id, sqlserver.sql_text)
    WHERE 
    (duration > 5000 AND
        (   
            (    -- Login related preemptive waits that have exceeded 5 seconds.
                (wait_type > 365 AND wait_type < 372)
                OR    (wait_type > 372 AND wait_type < 377)
                OR    (wait_type > 377 AND wait_type < 383)
                OR    (wait_type > 420 AND wait_type < 424)
                OR    (wait_type > 426 AND wait_type < 432)
                OR    (wait_type > 432 AND wait_type < 435)
            )
            OR 
            (duration > 45000     -- Preemptive OS waits that have exceeded 45 seconds. 
                AND 
                (    
                    (wait_type > 382 AND wait_type < 386)
                    OR    (wait_type > 423 AND wait_type < 427)
                    OR    (wait_type > 434 AND wait_type < 437)
                    OR    (wait_type > 442 AND wait_type < 451)
                    OR    (wait_type > 451 AND wait_type < 473)
                    OR    (wait_type > 484 AND wait_type < 499)
                    OR wait_type = 365
                    OR wait_type = 372
                    OR wait_type = 377
                    OR wait_type = 387
                    OR wait_type = 432
                    OR wait_type = 502
                )
            )
        )
    )
)
  • Capture Event information using the ring_buffer target.
ADD target package0.ring_buffer        -- Store events in the ring buffer target
    (SET max_memory = 4096)
  • Set the session to start automatically with SQL Server
WITH (startup_state = ON)
GO

In SQL Server Denali CTP1, two new Events have been added to the system_health session specific to SQLCLR.

  • A SQLCLR memory allocation failed.
ADD EVENT sqlclr.allocation_failure,
  • A SQLCLR virtual memory allocation failed.
ADD EVENT sqlclr.virtual_alloc_failure,

While the system_health session captures very useful information, it uses the ring_buffer Target to store the Event data.  In a scenario where the database engine fails completely the information that may have been captured by the system_health session will be lost when the process terminates.  Also since the Event Session uses the ring_buffer Target, it is possible that you may not receive back all of the Event data contained in the target, or the Events that you might have expected to exist.  Bob Ward talked about the limitation of the DMV’s to return 4MB of XML data and how this impacts the in memory Targets in Extended Events in his blog post You may not see the data you expect in Extended Event Ring Buffer Targets…. 

One of my favorite aspects of the system_health session is that it includes deadlock tracing through Extended Events by default.  However, in order to make use of the deadlock graph captured by Extended Events, you have to be on CU6 for SQL Server 2008 SP1 (http://support.microsoft.com/kb/978629), or you could try to hack your way around the bug as I showed in my article Retrieving Deadlock Graphs with SQL Server 2008 Extended Events, and Michael Zilberstein’s update to correct a problem with the code in that article, Parsing Extended Events xml_deadlock_report.  The deadlock graph in Extended Events will not open graphically in SSMS like a SQL Trace XML Deadlock Graph will due to changes in its output to support multi-victim deadlocks, which I covered in my blog post Changes to the Deadlock Monitor for the Extended Events xml_deadlock_report and Multi-Victim Deadlocks.

Leave a Reply

Your email address will not be published. Required fields are marked *

Other articles

Imagine feeling confident enough to handle whatever your database throws at you.

With training and consulting from SQLskills, you’ll be able to solve big problems, elevate your team’s capacity, and take control of your data career.