Part 1: Establishing context with HTTP heartbeat pings
Part 2: Analyzing Windows Event Log, IIS log, and SharePoint log with PowerShell
Part 3: Collecting and analyzing a memory dump with DebugDiag
Part 4: Analyzing the memory dump with WinDbg
Part 5: SQL Server heartbeat pings and lock contention analysis with built-in tools
In the previous post, we debugged the OLE DB driver used by native code SharePoint to communicate with MS SQL Server. It turns out that worker thread pool threads block on the call to proc_AddAuditEntry, the stored procedure responsible for writing audit log entries to the content database. This procedure records the request before actual content is fetched.
In this post we develop a SQL Server heartbeat tool to measure general database availability. Then we use built-in SQL Server commands to determine what causes proc_AddAuditEntry to block.
Much akin to the HTTP heartbeat ping tool, we wish to issue periodic requests to MS SQL Server to gauge its uptime. That way we can tell if the issue is local to proc_AddAuditEntry or if the database (or server) is experiencing symptoms of instability.
The HTTP heartbeat and SQL Server heartbeat tools are structurally identical. The latter periodically connects to SQL Server and executes a fixed query against a table other than AuditData. As native code SharePoint connects using OLE DB, the tool executes the same query using both ADO.NET and OLE DB, just in case there's an issue with OLE DB only:
using System;
using System.Data.OleDb;
using System.Data.SqlClient;
using System.Threading;

// Usage: .\SqlServerHeartbeat.exe 5
namespace SqlServerHeartbeat
{
    class Program
    {
        static void Main(string[] args)
        {
            var run = 0;
            while (true)
            {
                var ole = "";
                var ado = "";
                Console.Write($"{++run} {DateTime.Now} ");

                // Query through OLE DB, the provider used by native code SharePoint.
                using (var con = new OleDbConnection("Provider=sqloledb;Data Source=sql.acme.dk,1433;Initial Catalog=SP_Intranet_Content_01;Integrated Security=SSPI;"))
                using (var cmd = new OleDbCommand("select count(*) from AllDocs (nolock)", con))
                {
                    try
                    {
                        con.Open();
                        var res = (int)cmd.ExecuteScalar();
                        ole = res == 0 ? "Empty" : "OK";
                    }
                    catch (Exception e)
                    {
                        Console.WriteLine(e.Message);
                    }
                }

                // Same query through ADO.NET in case the issue is specific to OLE DB.
                using (var con = new SqlConnection("Server=tcp:sql.acme.dk,1433;Database=SP_Intranet_Content_01;Integrated Security=SSPI;Encrypt=True;TrustServerCertificate=True;Connection Timeout=30;"))
                using (var cmd = new SqlCommand("select count(*) from AllDocs (nolock)", con))
                {
                    try
                    {
                        con.Open();
                        var res = (int)cmd.ExecuteScalar();
                        ado = res == 0 ? "Empty" : "OK";
                    }
                    catch (Exception e)
                    {
                        Console.WriteLine(e.Message);
                    }
                }

                Console.WriteLine($"{ole} {ado}");
                Thread.Sleep(int.Parse(args[0]) * 1000);
            }
        }
    }
}
Setting up the SQL Server heartbeat tool to run on the reference laptop, the SQL Server, and Web front-end server 2, we waited for the HTTP heartbeat tool to report SharePoint down. That happened between July 20, 22:47:04 and July 21, 00:18:06. SQL Server heartbeats were unaffected, supporting the earlier observation that SharePoint's 12 hive log files show database activity while HTTP-based SharePoint is down.
The findings of SharePoint down and SQL Server up point in the direction of contention around the audit log. What proc_AddAuditEntry does is actually straightforward. With the exception of DocLocation, the procedure inserts its arguments as-is into the AuditData table:
INSERT INTO AuditData
    (SiteId, ItemId, ItemType, UserId, MachineName, MachineIp, DocLocation,
     LocationType, Occurred, Event, EventName, EventSource, SourceName, EventData)
VALUES
    (@SiteId, @ItemId, @ItemType, @UserId, @MachineName, @MachineIp, @Location,
     @LocationType, @Occurred, @Event, @EventName, @EventSource, @SourceName, @EventData)
The audit log is known for high-frequency inserts and rapid growth, especially on an intranet where full audit logging has been enabled since 2010 and the audit log has never been trimmed. Not only does the AuditData table itself hold a very large amount of data, but the storage required for its two indices, AuditData_OnSiteOccurredEvent and AuditData_OnSiteItem, is also significant:
sp_spaceused AuditData

name       rows         reserved        data            index_size     unused
AuditData  500,797,831  179,610,816 KB  144,268,664 KB  34,772,344 KB  569,808 KB
To measure the typical activity pattern around the AuditData table, we summarize hourly inserts across two days. Keep in mind that the Occurred field is in UTC, two hours behind local time:
select count(*) from AuditData (nolock)
where Occurred >= '2017-07-20 00:00' and Occurred < '2017-07-20 01:00'

July 20                          July 21
00-01:    914    12-13: 26.225   00-01:  1.968   12-13: 14.739
01-02:  1.158    13-14: 22.483   01-02:  2.004   13-14:  4.642
02-03:  1.338    14-15:  8.773   02-03:  2.376   14-15:  2.731
03-04: 11.849    15-16:  2.422   03-04: 14.064   15-16:  1.485
04-05:  4.925    16-17:  2.629   04-05: 12.382   16-17:  1.473
05-06: 16.067    17-18:  3.299   05-06: 21.108   17-18:  1.124
06-07: 17.375    18-19:  2.676   06-07: 28.543   18-19:  1.108
07-08: 16.610    19-20:  3.035   07-08: 12.564   19-20:  1.289
08-09: 19.646    20-21:  2.813   08-09: 20.205   20-21:  1.367
09-10: 16.179    21-22:     90   09-10: 14.243   21-22:     36
10-11: 15.621    22-23:  5.605   10-11: 13.648   22-23:  4.413
11-12: 19.401    23-00:  2.100   11-12: 11.324   23-00:  1.349
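Producing that summary by editing and re-running the count query 48 times is tedious. A minimal sketch of the same summary as a single grouped query (day and hour come out in UTC, as stored in the Occurred column):

select convert(date, Occurred) as [Day], datepart(hour, Occurred) as [Hour], count(*) as Inserts
from AuditData (nolock)
where Occurred >= '2017-07-20' and Occurred < '2017-07-22'
group by convert(date, Occurred), datepart(hour, Occurred)
order by [Day], [Hour]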
For the whole hour of 23-00 local time on each day (21-22 UTC), we know SharePoint is down, and the insert counts collapse to 90 and 36. Clearly, a correlation exists between inserts on AuditData and the downtimes.
Short of scripting periodic monitoring calls, the most straightforward ways to view locks in MS SQL Server are the Activity Monitor, the Log File Viewer, the sp_who2 stored procedure, and the dm_tran_locks dynamic management view. Below we see the output of sp_who2 shortly after SharePoint goes offline between July 21, 22:47:38 and July 22, 00:20:43:
sp_who2 'active'

Cleaned up output:

SPID  Status     HostName  BlkBy  DBName                  Command      LastBatch       ProgramName
906   RUNNABLE   WFE4             SP_Intranet_Content_01  ALTER INDEX  07/21 22:45:38  .Net SqlClient Data Provider
1011  SUSPENDED  ADMIN     906    SP_Intranet_Content_01  INSERT       07/21 22:47:33  Internet Information Services
348   SUSPENDED  WFE1      1011   SP_Intranet_Content_01  INSERT       07/21 22:47:38  Internet Information Services
353   SUSPENDED  WFE1      1011   SP_Intranet_Content_01  INSERT       07/21 22:47:43  Internet Information Services
197   SUSPENDED  WFE2      1011   SP_Intranet_Content_01  INSERT       07/21 22:47:43  Internet Information Services
At 22:45:38 an alter index command comes in from Web front-end server 4. It causes the next insert to block, as indicated by 906 in the BlkBy column of SPID 1011. Later inserts block on the first insert and keep queuing up. According to the SQL Server Log File Viewer, between 22:56:22 and 23:35:02 SPID 906 was killed by the management process on nine occasions. When this happens, sp_who2 changes to report SPID 906 as follows:
SPID  Status    HostName  BlkBy  DBName                  Command          LastBatch       ProgramName
906   ROLLBACK  WFE4             SP_Intranet_Content_01  KILLED/ROLLBACK  07/21 22:45:38  .Net SqlClient Data Provider
While the rollback is in progress, inserts remain blocked on SPID 906. Once the rollback completes and the lock held by alter index is released, queued inserts are executed against the AuditData table. That explains why (some) audit log entries eventually show up in the table. Under these circumstances, SharePoint records misleading audit log entries: by the time an insert finally completes, the HTTP client that originally requested the page has likely disconnected.
From the output of sp_who2 alone, we cannot be certain that what we're looking at are operations against AuditData. To confirm, we must query dm_tran_locks, which holds details about locks held and requested:
select * from sys.dm_tran_locks

Cleaned up output:

request_session_id  resource_subtype  resource_associated_entity_id  request_mode  request_type  request_status
906                 INDEX_OPERATION   1826105546                     Sch-M         LOCK          GRANT
1011                                  1826105546                     IX            LOCK          WAIT
348                                   1826105546                     IX            LOCK          WAIT
353                                   1826105546                     IX            LOCK          WAIT
197                                   1826105546                     IX            LOCK          WAIT

select * from sys.objects where object_id = 1826105546

Cleaned up output:

name       object_id   type  type_desc   create_date
AuditData  1826105546  U     USER_TABLE  2010-05-12 15:45:05.160
The dm_tran_locks output confirms that SPID 906 holds a granted lock because of an index operation and that the other sessions are waiting for the lock to be released. The request_mode (request lock mode in full) of Sch-M is short for schema modification. A Sch-M lock prevents concurrent access to the table, meaning that it blocks the inserts.
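For convenience, the two queries can be combined, resolving each lock to a table name in a single step. A minimal sketch, limited to object-level locks, where resource_associated_entity_id holds an object_id (for page- and row-level locks it holds a hobt_id, which this join would misinterpret):

-- Run in the content database: sys.objects is per-database.
select l.request_session_id, o.name, l.request_mode, l.request_status
from sys.dm_tran_locks l
join sys.objects o on o.object_id = l.resource_associated_entity_id
where l.resource_type = 'OBJECT' and l.resource_database_id = db_id()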
In addition, a few times during the downtime and later at 00:24:34 and 00:52:46, the Log File Viewer displays messages indicating that the I/O subsystem is overloaded. This is likely due to the large sizes of the table and its indices:
SQL Server has encountered 1699 occurrence(s) of I/O requests taking longer than 15 seconds to complete on file [D:\MSSQL\DATA\SP_Intranet_Content_01.mdf:MSSQL_DBCC116] in database [SP_Intranet_Content_01] (116). The OS file handle is 0x0000000000000A24. The offset of the latest long I/O is: 0x00003781ada000
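These messages alone don't tell whether the slow I/O is chronic or specific to the downtime. One way to gauge the general state of the I/O subsystem is to sample SQL Server's per-file statistics through the sys.dm_io_virtual_file_stats dynamic management function. A minimal sketch (the counters are cumulative since the last restart, so two samples taken some minutes apart give the current rate):

select db_name(vfs.database_id) as database_name,
       mf.physical_name,
       -- Average stall per operation in milliseconds.
       vfs.io_stall_read_ms / nullif(vfs.num_of_reads, 0) as avg_read_ms,
       vfs.io_stall_write_ms / nullif(vfs.num_of_writes, 0) as avg_write_ms
from sys.dm_io_virtual_file_stats(null, null) vfs
join sys.master_files mf on mf.database_id = vfs.database_id and mf.file_id = vfs.file_id
where vfs.database_id = db_id('SP_Intranet_Content_01')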
Let's have a look at WFE4, the host from which the alter index request originates. With the client listed as ".Net SqlClient Data Provider", it's probably one of SharePoint's timer jobs. In Central Administration, under Timer Job Status, one job stands out. Below is its status during the downtime. On the following day of July 22, however, the Timer Job Status page lists that very same job as succeeded with progress at 100%. Perhaps the timer job retries a number of times and one run eventually succeeds:
Job Title            Server  Status       Progress  Started
Database Statistics  WFE4    Initialized  0%        7/21/2017 10:45 PM
The purpose of the Database Statistics job is to update query statistics for the main tables inside the content database. On SQL Server 2008 Enterprise Edition, which the farm is running, the job rebuilds most (for some unknown definition of most) indices online, i.e., without taking an exclusive lock on the table. The Database Statistics timer job works by calling the proc_UpdateStatistics stored procedure, which contains lines such as the following for each main table:
update statistics AuditData with resample
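The ALTER INDEX command seen in the sp_who2 output presumably originates from the same procedure. The exact statement isn't visible to us, but an online rebuild of one of the AuditData indices would look roughly like this sketch. Note that even an online rebuild takes a short Sch-M lock when it finishes, so it isn't entirely lock-free:

alter index AuditData_OnSiteOccurredEvent on AuditData
rebuild with (online = on)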
According to a support article, proc_UpdateStatistics is unreliable when run in parallel with a database backup. The article also mentions that the Database Statistics job logs operational status codes to SharePoint's 12 hive log, yet the log on WFE4 contains no matching entries. As for the backup theory, no backup appears to be taking place at the same time; that can be double-checked against msdb's backup history, as sketched below.
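A minimal sketch of such a check, assuming the standard msdb.dbo.backupset history table:

select database_name, type, backup_start_date, backup_finish_date
from msdb.dbo.backupset
where database_name = 'SP_Intranet_Content_01'
  and backup_finish_date >= '2017-07-21'
order by backup_start_date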
Perhaps unsurprisingly by now, the AuditData table requires trimming. For that purpose, the stsadm command supports the Trimauditlog operation. It must be used judiciously and incrementally, as the very act of trimming acquires a lock which, for longer runs, can itself take down SharePoint. Another side effect of trimming is transaction log bloat.
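As a hypothetical example of incremental trimming, the following invocation deletes audit entries that occurred before January 1, 2017 (the date argument uses the yyyymmdd format):

stsadm -o trimauditlog -date 20170101 -databasename SP_Intranet_Content_01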
Only with SharePoint 2010 and later does Central Administration allow specifying the number of days to retain audit log entries and a file location in which to store entries prior to trimming.