Debugging recurring web farm outages: SQL Server heartbeat pings and lock contention analysis with build-in tools

23 Jul 2017

Part 1: Establishing context with HTTP heartbeat pings
Part 2: Analyzing Windows Event Log, IIS log, and SharePoint log with PowerShell
Part 3: Collecting and analyzing a memory dump with DebugDiag
Part 4: Analyzing the memory dump with WinDbg
Part 5: SQL Server heartbeat pings and lock contention analysis with build-in tools

In the previous post, we used WinDbg to debug OLE DB used by native code SharePoint to communicate with SQL Server. From OLE DB we managed to extract the SQL statement on which worker threads block. Turns out they all block on a call to proc_AddAuditEntry, a stored procedure responsible for writing audit log entries the content database. As such, when audit logging is enabled for a site collection, this procedure records the request before actual content is fetched.

In this post we develop a SQL Server heartbeat tool to measure general database availability. Then we use build-in SQL Server monitoring tools to determine what causes proc_AddAuditEntry to block.

Collecting SQL Server heartbeats

Much akin to the HTTP heartbeat ping tool, in order to measure availability, we want to issue periodic requests to SQL Server. That way we can tell if the issue is local to proc_AddAuditEntry and if the database (or server) is experiencing symptoms of instability.

The HTTP heartbeat and SQL Server heartbeat tools are structurally equivalent. The latter connects to SQL server and executes a fixed query. As native code SharePoint connects using OLE DB, the tool executes the same query using both ADO.NET and OLE DB, just in case there's an issue only facing OLE DB:

using System;
using System.Data.OleDb;
using System.Data.SqlClient;
using System.Threading;

// .\SqlServerHeartbeat.exe 5

namespace SqlServerHeartbeat {
    class Program {
        static void Main(string[] args) {
            var run = 0;
            while (true) {
                var ole = "";
                var ado = "";
                Console.Write($"{++run} {DateTime.Now} ");

                using (var con = new OleDbConnection("Provider=sqloledb;Data Source=sql.acme.dk,1433;Initial Catalog=SP_Intranet_Content_01; Integrated Security=SSPI;"))
                using (var cmd = new OleDbCommand("select count(*) from AllDocs (nolock)", con)) {
                    try {
                        con.Open();
                        var res = (int)cmd.ExecuteScalar();
                        ole = res == 0 ? "Empty" : "OK";
                    } catch (Exception e) {
                        Console.WriteLine(e.Message);
                    }
                }

                using (var con = new SqlConnection("Server=tcp:sql.acme.dk,1433;Database=SP_Intranet_Content_01; Integrated Security=SSPI;Encrypt=True;TrustServerCertificate=True;Connection Timeout=30;"))
                using (var cmd = new SqlCommand("select count(*) from AllDocs (nolock)", con)) {
                    try {
                        con.Open();
                        var res = (int)cmd.ExecuteScalar();
                        ado = res == 0 ? "Empty" : "OK";
                    } catch (Exception e) {
                        Console.WriteLine(e.Message);
                    }
                }

                Console.WriteLine($"{ole} {ado}");
                Thread.Sleep(int.Parse(args[0]) * 1000);
            }
        }
    }
}

Once setup to run on the reference laptop, the SQL Server, and Web front-end Server 2, we awaited the HTTP heartbeat tool to report SharePoint as down. That happened in between July 20, 22:47:04 and July 21, 00:18:06. SQL Server heartbeat kept going for entire duration, though. This confirms an earlier analysis of SharePoint 12 hive log files recording database activity when HTTP-based SharePoint is unavailable.

The finding of SharePoint down and SQL Server up points in the direction of contention around writes to the audit log. What proc_AddAuditEntry does is technically straightforward. With the exception of DocLocation, the procedure inserts its arguments into the AuditData table:

INSERT INTO AuditData (SiteId, ItemId, ItemType, UserId, MachineName, MachineIp, 
                       DocLocation, LocationType, Occurred, Event, 
                       EventName, EventSource, SourceName, EventData)
       VALUES (@SiteId, @ItemId, @ItemType, @UserId, @MachineName, @MachineIp, 
               @Location, @LocationType, @Occurred, @Event, 
               @EventName, @EventSource, @SourceName, @EventData)

The audit log has the potential for high-frequency inserts and rapid growth. Especially on an intranet which has had full auditing enabled since 2010 and where the audit log has never been trimmed. Here the AuditData table itself holds a very large amount of data. Add to that the storage required for its two indices: AuditData_OnSiteOccurredEvent and AuditData_OnSiteItem:

sp_spaceused AuditData

name       rows         reserved        data             index_size     unused
AuditData  500,797,831  179,610,816 KB  144,268,664 KB   34,772,344 KB  569,808 KB

To better get a feeling for the normal activity pattern around the AuditData table, we summarize hourly inserts across two days. Keep in mind that the Occurred field used for filtering is in UTC, i.e., two hours behind local time:

select count(*) 
from AuditData (nolock)
where Occurred >= '2017-07-21 00:00' and 
      Occurred < '2017-07-21 01:00'
  
July 20                       July 21  
00-01:    914  12-13: 26.225  00-01:  1.968  12-13: 14.739
01-02:  1.158  13-14: 22.483  01-02:  2.004  13-14:  4.642
02-03:  1.338  14-15:  8.773  02-03:  2.376  14-15:  2.731
03-04: 11.849  15-16:  2.422  03-04: 14.064  15-16:  1.485
04-05:  4.925  16-17:  2.629  04-05: 12.382  16-17:  1.473
05-06: 16.067  17-18:  3.299  05-06: 21.108  17-18:  1.124
06-07: 17.375  18-19:  2.676  06-07: 28.543  18-19:  1.108
07-08: 16.610  19-20:  3.035  07-08: 12.564  19-20:  1.289
08-09: 19.646  20-21:  2.813  08-09: 20.205  20-21:  1.367
09-10: 16.179  21-22:     90  09-10: 14.243  21-22:     36
10-11: 15.621  22-23:  5.605  10-11: 13.648  22-23:  4.413
11-12: 19.401  23-00:  2.100  11-12: 11.324  23-00:  1.349 

For the whole hour of 23-00 local time of each day, we know that SharePoint is down, translating to between 21 and 22 UTC. Clearly, a correlation exists between inserts on AuditData and downtimes.

Investigating lock contention on AuditData

The most straightforward way to view locks in MS SQL Server is during application downtime. Here the Activity Monitor, the Log File View, the sp_who2 stored procedure, and the dm_tran_locks dynamic management view are valuable tools for inspecting lock behavior. The table below shows what we can learn from sp_who2 by executing it shortly after SharePoint goes offline. Here we're looking at SharePoint downtime between July 21, 22:47:38 and July 22, 00:20:43:

sp_who2 'active'

Cleaned up output:
SPID  Status     HostName   BlkBy   DBName                  Command      LastBatch       ProgramName
 906  RUNNABLE   WFE4               SP_Intranet_Content_01  ALTER INDEX  07/21 22:45:38  .Net SqlClient Data Provider
1011  SUSPENDED  ADMIN        906   SP_Intranet_Content_01  INSERT       07/21 22:47:33  Internet Information Services
 348  SUSPENDED  WFE1        1011   SP_Intranet_Content_01  INSERT       07/21 22:47:38  Internet Information Services
 353  SUSPENDED  WFE1        1011   SP_Intranet_Content_01  INSERT       07/21 22:47:43  Internet Information Services
 197  SUSPENDED  WFE2        1011   SP_Intranet_Content_01  INSERT       07/21 22:47:43  Internet Information Services

At 22:45:38 an alter index command comes in from Web front-end server 4. It causes the following insert to blocks, indicated by SPID 906 in the BlkBy column. Later inserts block on the first insert, and inserts keep queuing up. Starting 22:56:22 and ending 23:35:02, on nine occasions, according to the SQL Server Log File Viewer, SPID 906 was killed by the management process. When it first happened, sp_who2 changed to report SPID 906 as follows:

SPID  Status    HostName   BlkBy  DBName                  Command          LastBatch       ProgramName
906   ROLLBACK  WFE4              SP_Intranet_Content_01  KILLED/ROLLBACK  07/21 22:45:38  .Net SqlClient Data Provider

While rollback is in progress, inserts remain blocked on SPID 906. Once the rollback is complete, and the lock on alter index is released, queued inserts are retroactively written to AuditData. That explains why even though SharePoint is down, (some) audit log still show up. Under these circumstances, SharePoint would add false audit log entries; the HTTP client originally asking for the page is no longer waiting.

From the output of sp_who2 alone we cannot be certain that what we're looking at are operations against AuditData. To confirm, we need to query dm_tran_locks, holding details about locks being held and requested:

select * from sys.dm_tran_locks

Cleaned up output:
request_session_id  resource_subtype  resource_associated_entity_id  request_mode  request_type  request_status
 906                INDEX_OPERATION   1826105546                     Sch-M         LOCK          GRANT
1011                                  1826105546                     IX            LOCK          WAIT
 348                                  1826105546                     IX            LOCK          WAIT
 353                                  1826105546                     IX            LOCK          WAIT
 197                                  1826105546                     IX            LOCK          WAIT

select * from sys.objects where object_id = 1826105546

Cleaned up output:
name       object_id   type  type_desc   create_date
AuditData  1826105546  U     USER_TABLE  2010-05-12 15:45:05.160

The dm_tran_locks output confirms that SPID 906 is granted a lock because of an index_operation and that other requests are waiting for the lock to be released. The request_mode (request lock mode in full) of Sch-M is short for schema modification. A lock of type Sch-M prevents concurrent access to the table, meaning that Sch-M blocks inserts.

Furthermore, a few times during the downtime and later at 00:24:34 and 00:52:46, the Log File Viewer displays messages indicating that the I/O subsystem is overloaded, likely due to the large sizes of the table and its indices:

SQL Server has encountered 1699 occurrence(s) of I/O requests taking
longer than 15 seconds to complete on file
[D:\MSSQL\DATA\SP_Intranet_Content_01.mdf:MSSQL_DBCC116] in database
[SP_Intranet_Content_01] (116).  The OS file handle is
0x0000000000000A24.  The offset of the latest long I/O is:
0x00003781ada000

Investigating SharePoint timer jobs on WFE4

Let's have a look at WFE4, the originating host. With the client listed as ".Net SqlClient Data Provider", it's probably one of SharePoint's timer jobs. In Central Administration, under Timer Job Status, one job stands out. Below is what its status looks like during the downtime. It's a bit odd that on the following day of July 22, the Timer Job status page lists that very same job as succeeded and progress being 100%. Perhaps the timer job attempts a number of re-runs:

Job Title            Server        Status  Progress  Started
Database Statistics  WFE4     Initialized  0%        7/21/2017 10:45 PM

The purpose of the Database Statistics job is to update query statistics for main tables inside the Intranet content database. On SQL Server 2008 Enterprise Edition, which the farm is running, the job rebuilds most (for some unknown definition of most) indices online. The Database statistics timer job works by calling the proc_UpdateStatistics stored procedure which, according to a support article is unreliable when run in parallel with a database backup. The stored procedure contains lines such as the following for each main table:

update statistics AuditData with resample

The support article mentions that operational status codes are logged by the Database Statistics job to SharePoint's 12 hive log. Yet the log on WFE4 contains no matching entries.

Conclusion

Perhaps unsurprising by now, the AuditData table needs trimming, and stsadm command comes with the Trimauditlog operation for that purpose. It must be used judiciously and incrementally as the very operation of trimming acquires locks which for longer runs can take down SharePoint. Another side effect of trimming is transaction log bloat.

Only with SharePoint 2010 and up does Central Administration allow for the specification of a number of days to retain audit log entries and a location to store entries into prior to trimming.

Have a comment or question? Please drop me an email or tweet to @ronnieholm.