Debugging recurring web farm outages: SQL Server heartbeat pings and lock contention analysis with build-in tools

23 Jul 2017

Part 1: Establishing context with HTTP heartbeat pings
Part 2: Analyzing Windows Event Log, IIS log, and SharePoint log with PowerShell
Part 3: Collecting and analyzing a memory dump with DebugDiag
Part 4: Analyzing the memory dump with WinDbg
Part 5: SQL Server heartbeat pings and lock contention analysis with build-in tools

In the previous post, we debugged the OLE DB driver used by native code SharePoint to communicate with MS SQL Server. It turns out that worker thread pool threads block on the call to proc_AddAuditEntry, the stored procedure responsible for writing audit log entries to the content database. This procedure records the request before actual content is fetched.

In this post we develop a SQL Server heartbeat tool to measure general database availability. Then we use build-in SQL Server commands to determine what causes proc_AddAuditEntry to block.

Collecting SQL Server heartbeat pings

Much akin to the HTTP heartbeat ping tool, we wish to issue periodic requests to MS SQL Server to gauge its uptime. That way we can tell if the issue is local to proc_AddAuditEntry or if the database (or server) is experiencing symptoms of instability.

The HTTP heartbeat and SQL Server heartbeat tools are structurally identical. The latter periodically connects to SQL server and executes a fixed query, but not on the AuditData table. As native code SharePoint connects using OLE DB, the tool executes the same query using both ADO.NET and OLE DB, just in case there's an issue with OLE DB only:

using System;
using System.Data.OleDb;
using System.Data.SqlClient;
using System.Threading;

// .\SqlServerHeartbeat.exe 5

namespace SqlServerHeartbeat {
    class Program {
        static void Main(string[] args) {
            var run = 0;
            while (true) {
                var ole = "";
                var ado = "";
                Console.Write($"{++run} {DateTime.Now} ");

                using (var con = new OleDbConnection("Provider=sqloledb;Data Source=sql.acme.dk,1433;Initial Catalog=SP_Intranet_Content_01; Integrated Security=SSPI;"))
                using (var cmd = new OleDbCommand("select count(*) from AllDocs (nolock)", con)) {
                    try {
                        con.Open();
                        var res = (int)cmd.ExecuteScalar();
                        ole = res == 0 ? "Empty" : "OK";
                    } catch (Exception e) {
                        Console.WriteLine(e.Message);
                    }
                }

                using (var con = new SqlConnection("Server=tcp:sql.acme.dk,1433;Database=SP_Intranet_Content_01; Integrated Security=SSPI;Encrypt=True;TrustServerCertificate=True;Connection Timeout=30;"))
                using (var cmd = new SqlCommand("select count(*) from AllDocs (nolock)", con)) {
                    try {
                        con.Open();
                        var res = (int)cmd.ExecuteScalar();
                        ado = res == 0 ? "Empty" : "OK";
                    } catch (Exception e) {
                        Console.WriteLine(e.Message);
                    }
                }

                Console.WriteLine($"{ole} {ado}");
                Thread.Sleep(int.Parse(args[0]) * 1000);
            }
        }
    }
}

Setting up the SQL Server heartbeat tool to run on the reference laptop, the SQL Server, and Web front-end Server 2, we waited for the HTTP heartbeat tool to report SharePoint down. That happened between July 20, 22:47:04 and July 21, 00:18:06. SQL Server heartbeats were unaffected, supporting an earlier observation of SharePoint 12 hive log files showing database activity when HTTP-based SharePoint is down.

The findings of SharePoint down and SQL Server up points in the direction of contention around the audit log. What proc_AddAuditEntry does is actually straightforward. With the exception of DocLocation, the procedure inserts its arguments as-is into the AuditData table:

INSERT INTO AuditData (SiteId, ItemId, ItemType, UserId, MachineName, MachineIp, 
                       DocLocation, LocationType, Occurred, Event, 
                       EventName, EventSource, SourceName, EventData)
       VALUES (@SiteId, @ItemId, @ItemType, @UserId, @MachineName, @MachineIp, 
               @Location, @LocationType, @Occurred, @Event, 
               @EventName, @EventSource, @SourceName, @EventData)

The audit log is known for high-frequency inserts and rapid growth. Especially on an intranet where full audit logging has been enabled since 2010 and where the audit log has never been trimmed. Not only does the AuditData table itself hold a very large amount of data, but the storage required for its two indices, AuditData_OnSiteOccurredEvent and AuditData_OnSiteItem, is also significant:

sp_spaceused AuditData

name       rows         reserved        data             index_size     unused
AuditData  500,797,831  179,610,816 KB  144,268,664 KB   34,772,344 KB  569,808 KB

To measure the typical activity pattern around the AuditData table, we summarize hourly inserts across two days. Keep in mind that the Occurred field is in UTC, two hours behind local time:

select count(*) 
from AuditData (nolock)
where Occurred >= '2017-07-20 00:00' and 
      Occurred < '2017-07-20 01:00'
  
July 20                       July 21  
00-01:    914  12-13: 26.225  00-01:  1.968  12-13: 14.739
01-02:  1.158  13-14: 22.483  01-02:  2.004  13-14:  4.642
02-03:  1.338  14-15:  8.773  02-03:  2.376  14-15:  2.731
03-04: 11.849  15-16:  2.422  03-04: 14.064  15-16:  1.485
04-05:  4.925  16-17:  2.629  04-05: 12.382  16-17:  1.473
05-06: 16.067  17-18:  3.299  05-06: 21.108  17-18:  1.124
06-07: 17.375  18-19:  2.676  06-07: 28.543  18-19:  1.108
07-08: 16.610  19-20:  3.035  07-08: 12.564  19-20:  1.289
08-09: 19.646  20-21:  2.813  08-09: 20.205  20-21:  1.367
09-10: 16.179  21-22:     90  09-10: 14.243  21-22:     36
10-11: 15.621  22-23:  5.605  10-11: 13.648  22-23:  4.413
11-12: 19.401  23-00:  2.100  11-12: 11.324  23-00:  1.349 

For the whole hour of 23-00 local time of each day (21-22 UTC), we know SharePoint is down. Clearly, a correlation exists between inserts on AuditData and downtimes.

Investigating lock contention on AuditData

Short of scripting period monitor calls, the most straightforward way to view locks in MS SQL Server is the Activity Monitor, the Log File View, the sp_who2 stored procedure, and the dm_tran_locks dynamic management view. Below we see the output of sp_who2 shortly after SharePoint goes offline between July 21, 22:47:38 and July 22, 00:20:43:

sp_who2 'active'

Cleaned up output:
SPID  Status     HostName   BlkBy   DBName                  Command      LastBatch       ProgramName
 906  RUNNABLE   WFE4               SP_Intranet_Content_01  ALTER INDEX  07/21 22:45:38  .Net SqlClient Data Provider
1011  SUSPENDED  ADMIN        906   SP_Intranet_Content_01  INSERT       07/21 22:47:33  Internet Information Services
 348  SUSPENDED  WFE1        1011   SP_Intranet_Content_01  INSERT       07/21 22:47:38  Internet Information Services
 353  SUSPENDED  WFE1        1011   SP_Intranet_Content_01  INSERT       07/21 22:47:43  Internet Information Services
 197  SUSPENDED  WFE2        1011   SP_Intranet_Content_01  INSERT       07/21 22:47:43  Internet Information Services

At 22:45:38 an alter index command comes in from Web front-end server 4. It causes the next insert to blocks as indicated by SPID 906 in the BlkBy column. Later inserts block on the first insert, and keep queuing up. Starting 22:56:22 and ending 23:35:02, on nine occasions, according to the SQL Server Log File Viewer, SPID 906 was killed by the management process. When this happens, sp_who2 change to report SPID 906 as follows:

SPID  Status    HostName   BlkBy  DBName                  Command          LastBatch       ProgramName
906   ROLLBACK  WFE4              SP_Intranet_Content_01  KILLED/ROLLBACK  07/21 22:45:38  .Net SqlClient Data Provider

While rollback is in progress, inserts remain blocked on SPID 906. Once rollback is complete, and the lock on alter index is released, queued inserts are executed against the AuditData table. That explains why (some) audit log entries eventually show up in the table. Under these circumstances, SharePoint records false audit log entries. Because once the insert finally completes, the HTTP client originally asking for the page has likely disconnected.

From the output of sp_who2 alone we cannot be certain that what we're looking at are operations against AuditData. To confirm, we must query dm_tran_locks, holding details about locks being held and requested:

select * from sys.dm_tran_locks

Cleaned up output:
request_session_id  resource_subtype  resource_associated_entity_id  request_mode  request_type  request_status
 906                INDEX_OPERATION   1826105546                     Sch-M         LOCK          GRANT
1011                                  1826105546                     IX            LOCK          WAIT
 348                                  1826105546                     IX            LOCK          WAIT
 353                                  1826105546                     IX            LOCK          WAIT
 197                                  1826105546                     IX            LOCK          WAIT

select * from sys.objects where object_id = 1826105546

Cleaned up output:
name       object_id   type  type_desc   create_date
AuditData  1826105546  U     USER_TABLE  2010-05-12 15:45:05.160

The dm_tran_locks output confirms that SPID 906 is granted a lock because of an index_operation and that other requests are waiting for the lock to be released. The request_mode (request lock mode in full) of Sch-M is short for schema modification. A lock of type Sch-M prevents concurrent access to the table, meaning that Sch-M blocks inserts.

In addition, a few times during the downtime and later at 00:24:34 and 00:52:46, the Log File Viewer displays messages indicating that the I/O subsystem is overloaded. This is likely due to the large sizes of the table and its indices:

SQL Server has encountered 1699 occurrence(s) of I/O requests taking
longer than 15 seconds to complete on file
[D:\MSSQL\DATA\SP_Intranet_Content_01.mdf:MSSQL_DBCC116] in database
[SP_Intranet_Content_01] (116).  The OS file handle is
0x0000000000000A24.  The offset of the latest long I/O is:
0x00003781ada000

Investigating SharePoint timer jobs on WFE4

Let's have a look at WFE4, the host from which the alter index request originates. With the client listed as ".Net SqlClient Data Provider", it's probably one of SharePoint's timer jobs. In Central Administration, under Timer Job Status, one job stands out. Below is what its status looks like during the downtime. On the following day of July 22, however, the Timer Job status page lists that very same job as succeeded with progress at 100%. Perhaps the timer job attempts a number of re-runs and one succeeds:

Job Title            Server        Status  Progress  Started
Database Statistics  WFE4     Initialized  0%        7/21/2017 10:45 PM

The purpose of the Database Statistics job is to update query statistics for main tables inside the content database. On SQL Server 2008 Enterprise Edition, which the farm is running, the job rebuilds most (for some unknown definition of most) indices online, i.e., without taking an exclusive lock on the table. The Database statistics timer job works by calling the proc_UpdateStatistics stored procedure. The stored procedure contains lines such as the following for each main table:

update statistics AuditData with resample

According to a support article, proc_UpdateStatistics is unreliable when run in parallel with a database backup. No database appears to be taking place at the same time, though. Also, the support article mentions that the Database Statistics job logs operational status codes to SharePoint's 12 hive log. Yet the log on WFE4 contains no matching entries.

Conclusion

Perhaps unsurprising by now, the AuditData table requires trimming. For that purpose, the stsadm command supports the Trimauditlog operation. It must be used judiciously and incrementally as the very act of trimming acquires a lock which for longer runs can take down SharePoint. Another side effect of trimming is transaction log bloat.

Only with SharePoint 2010 and later does Central Administration allow for the specification of a number of days to retain audit log entries and a file location to store entries into prior to trimming.