Debugging recurring web farm outages: Analyzing the memory dump with WinDbg

19 Jul 2017

Part 1: Establishing context with HTTP heartbeat pings
Part 2: Analyzing Windows Event Log, IIS log, and SharePoint log with PowerShell
Part 3: Collecting and analyzing a memory dump with DebugDiag
Part 4: Analyzing the memory dump with WinDbg
Part 5: SQL Server heartbeat pings and lock contention analysis with built-in tools

In the previous post we used DebugDiag to narrow down the problem. As expected, the ASP.NET runtime is unable to finish processing requests because worker thread pool threads become blocked. As more requests come in, more threads become blocked, until finally the runtime notices that something is wrong. The w3wp host process then signals the Windows Process Activation Service to restart w3wp. This flow repeats itself a number of times.

In this post we look into why the worker threads become blocked. We do so by loading the memory dump into WinDbg and analyzing the call stacks of blocked threads, looking for a pattern.

Loading the dump into WinDbg

We don't want to analyze the memory dump directly on a production server. Instead we move it to another machine for postmortem analysis. For WinDbg to work within this setup, we must follow the steps outlined in postmortem debugging across installed .NET CLR versions. These steps are necessary because the dump was captured on Windows Server 2008, with w3wp running a version of .NET 2.0 not necessarily installed on the machine used for debugging. Core files matching that CLR version (chiefly mscordacwks.dll and sos.dll) must therefore be copied across machines.

In the WinDbg session below, commands issued are in green. The parts of the output worth highlighting are in red.

Microsoft (R) Windows Debugger Version 10.0.15063.137 AMD64
Copyright (c) Microsoft Corporation. All rights reserved.


Loading Dump File [C:\dumps\w3wp.exe_170713_230009.dmp]
User Mini Dump File with Full Memory: Only application data is available

Comment: '
*** C:\procdump64  -accepteula -64 -ma 6844 e:\
*** Manual dump'
Symbol search path is: srv*
Executable search path is: 
Windows Server 2008/Windows Vista Version 6002 (Service Pack 2) MP (4 procs) Free x64
Product: Server, suite: TerminalServer SingleUserTS
Machine Name:
Debug session time: Thu Jul 13 23:00:51.000 2017 (UTC + 2:00)
System Uptime: 4 days 10:02:51.079
Process Uptime: 0 days 2:07:05.000
................................................................
................................................................
................................................................
................................................................
............................................
Loading unloaded module list
..............
ntdll!NtWaitForSingleObject+0xa:
00000000`778d68da c3              ret
0:000> .sympath+ C:\dumps
Symbol search path is: srv*;C:\dumps
Expanded Symbol search path is: cache*;SRV*https://msdl.microsoft.com/download/symbols;c:\dumps

************* Symbol Path validation summary **************
Response                         Time (ms)     Location
Deferred                                       srv*
OK                                             C:\dumps
0:000> .cordll -ve -u -l
CLRDLL: C:\Windows\Microsoft.NET\Framework64\v2.0.50727\mscordacwks.dll:2.0.50727.8745 f:0
doesn't match desired version 2.0.50727.8762 f:0
Automatically loaded SOS Extension
CLRDLL: Loaded DLL C:\Program Files (x86)\Windows Kits\10\Debuggers\x64\sym\mscordacwks_AMD64_AMD64_2.0.50727.8762.dll\58E462F09a0000\mscordacwks_AMD64_AMD64_2.0.50727.8762.dll
CLR DLL status: Loaded DLL C:\Program Files (x86)\Windows Kits\10\Debuggers\x64\sym\mscordacwks_AMD64_AMD64_2.0.50727.8762.dll\58E462F09a0000\mscordacwks_AMD64_AMD64_2.0.50727.8762.dll

On this debugging machine, matching core files are already available to the debugger. Otherwise, mscordacwks would've been loaded from inside C:\dumps, where the core files extracted from the Windows Server 2008 machine reside.
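
Had a matching mscordacwks not been available, it would've had to be copied from the server, renamed to follow the mscordacwks_<architecture>_<architecture>_<version>.dll convention visible in the load path above, and placed somewhere on the symbol path such as C:\dumps. Alternatively, the debugger can be pointed straight at a folder of copied core files; a sketch using .cordll's load path option:

0:000> .cordll -lp C:\dumps
0:000> .cordll -ve -u -l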

Because the Event Log entries mention a (potential) deadlock, we load the SOSEX debugger extension in addition to SOS. While the native !locks and SOS !SyncBlk commands require manual interpretation, the SOSEX !dlk command aims to automatically detect and report deadlocks involving managed and native synchronization primitives:

0:000> .load C:\dumps\sosex_64\sosex
0:000> !dlk
Examining SyncBlocks...
Scanning for ReaderWriterLock(Slim) instances...
Scanning for holders of ReaderWriterLock locks...
Scanning for holders of ReaderWriterLockSlim locks...
Examining CriticalSections...
Scanning for threads waiting on SyncBlocks...
Scanning for threads waiting on ReaderWriterLock locks...
Scanning for threads waiting on ReaderWriterLocksSlim locks...
Scanning for threads waiting on CriticalSections...
No deadlocks detected.

Either the deadlock isn't there, it doesn't involve the kinds of resources that !dlk tracks, or the Event Log message is a false positive.
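
For reference, the manual lock inspection commands mentioned earlier would be issued like so (output omitted):

0:000> !locks
0:000> !SyncBlk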

Analyzing blocked threads

Let's start by looking at high-level thread pool statistics. From the DebugDiag report we know that eight worker thread pool threads are blocked at the time of the dump. More threads are running inside w3wp; in fact the ~ command lists 58 threads. Those just aren't managed thread pool threads:

0:000> !ThreadPool
CPU utilization 4%
Worker Thread: Total: 9 Running: 8 Idle: 1 MaxLimit: 400 MinLimit: 4
Work Request in Queue: 1
--------------------------------------
Number of Timers: 17
--------------------------------------
Completion Port Thread:Total: 1 Free: 1 MaxFree: 8 CurrentLimit: 0 MaxLimit: 400 MinLimit: 4

In order to print the managed call stacks of those eight threads, we could issue the ~*e !CLRStack command: for all threads (~*), execute (e) the !CLRStack command. But rather than include the lengthy output here, we focus on one thread and generalize from it. Thread 27, the first thread which DebugDiag reports as blocked, is servicing our HTTP ping request. It's also the thread that has been blocked the longest (661 seconds).
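
For reference, dumping every managed call stack in one go would look like this (output omitted for brevity):

0:000> ~*e !CLRStack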

Instead of first listing the native call stack with the k command and then the managed call stack with the !CLRStack command, we combine the two using the !DumpStack command. SharePoint consists of both native and managed code, and the combined stack may reveal a clue as to why a thread is blocked. Since !DumpStack doesn't include stack frame numbers, we add them manually for later reference:

0:000> ~27s
0:027> !DumpStack
OS Thread Id: 0x380 (27)
   Child-SP         RetAddr          Call Site
00 000000000b409838 000007fefd3af5f1 ntdll!NtWaitForSingleObject+0xa
01 000000000b409840 000007fefd3b1a68 mswsock!_GSHandlerCheck_SEH+0x2e41
02 000000000b4098c0 000007fefea5295c mswsock!_GSHandlerCheck_SEH+0x62a6
03 000000000b4099b0 000007fee6a0120d ws2_32!recv+0x13c
04 000000000b409a50 000007fee2e16f28 dbnetlib!ConnectionRead+0x491
05 000000000b409b30 000007fee2e15b11 sqloledb!CDataSource::ConnectionTransact+0xde
06 000000000b409b90 000007fee2e15866 sqloledb!CDBConnection::SendPacket+0x1e1
07 000000000b409c40 000007fee2e15617 sqloledb!CStmt::ExecDirect+0xa5b
08 000000000b409d80 000007fee2e1531d sqloledb!CStmt::SQLExecDirect+0x44
09 000000000b409db0 000007fee2e140b1 sqloledb!CCommand::ExecuteHelper+0x209
10 000000000b409e40 000007fee2e13b4c sqloledb!CCommand::Execute+0x1458
11 000000000b409f50 000007fee3f9d28b sqloledb!CImpICommandText::Execute+0x181
12 000000000b409fb0 000007fee3f5d674 oledb32!CCommandText::DoExecute+0x5ab
13 000000000b40a1b0 0000000072fc59fa oledb32!CCommandText::Execute+0x92f
14 000000000b40a400 0000000072e2443c STSWEL!Voledb::ExecQuery+0x37e
15 000000000b40a560 0000000072e24837 STSWEL!VauditEntry::createInternalEntry+0x418
16 000000000b40acc0 0000000072e24fe7 STSWEL!VauditEntry::createInternalEntry+0x813
17 000000000b40ad50 0000000072e25335 STSWEL!VglobalAuditStore::addAuditEntry+0x97
18 000000000b40ad90 0000000072efe6de STSWEL!VglobalAuditStore::checkAuditQueue+0x241
19 000000000b40ae70 0000000072f35055 STSWEL!VdocumentStore::httpGetDocument+0x2ed2
20 000000000b40be60 0000000072f367e0 STSWEL!VhttpManager::loadFileCore+0x5c5
21 000000000b40c700 000000006d9a0543 STSWEL!VhttpManager::loadFileAndMetaInfo+0xc4
22 000000000b40c7c0 000000006d9ab514 OWSSVR!GetExtensionVersion+0x89b57
23 000000000b40cb80 000007fef60170a7 OWSSVR!GetExtensionVersion+0x94b28
24 000000000b40cf00 000007fe96c64bba mscorwks!DoCLRToCOMCall+0x177
25 000000000b40d090 000007fe96c643a1 Microsoft_SharePoint_Library!DomainNeutralILStubClass.IL_STUB(...)+0x96c64bbb
26 000000000b40d490 000007fe96c63d52 Microsoft_SharePoint!Microsoft.SharePoint.Library.SPRequest.GetFileAndMetaInfo(System.String, ...)+0x271
27 000000000b40d620 000007fe96c578c5 Microsoft_SharePoint!Microsoft.SharePoint.SPWeb.GetWebPartPageContent(System.Uri, ...)+0xa12
28 000000000b40d980 000007fe96c56e87 Microsoft_SharePoint!Microsoft.SharePoint.ApplicationRuntime.SPRequestModuleData.FetchWebPartPageInformationForInit(System.Web.HttpContext, ...)+0x495
29 000000000b40dc10 000007fe96c563b4 Microsoft_SharePoint!Microsoft.SharePoint.ApplicationRuntime.SPRequestModuleData.GetFileForRequest(System.Web.HttpContext, ...)+0x2d7
30 000000000b40dee0 000007fe9698fa39 Microsoft_SharePoint!Microsoft.SharePoint.ApplicationRuntime.SPRequestModule.InitContextWeb(System.Web.HttpContext, Microsoft.SharePoint.SPWeb)+0xa4
31 000000000b40df60 000007fe9698f769 Microsoft_SharePoint!Microsoft.SharePoint.WebControls.SPControl.SPWebEnsureSPControl(System.Web.HttpContext)+0x299
32 000000000b40e050 000007fe9698f2e3 Microsoft_SharePoint!Microsoft.SharePoint.WebControls.SPControl.GetContextWeb(System.Web.HttpContext)+0x29
33 000000000b40e090 000007fee4bc78b0 Microsoft_SharePoint!Microsoft.SharePoint.ApplicationRuntime.SPRequestModule.PostResolveRequestCacheHandler(System.Object, System.EventArgs)+0x253
34 000000000b40e190 000007fee4bb86bb System_Web_ni!System.Web.HttpApplication+SyncEventExecutionStep.System.Web.HttpApplication.IExecutionStep.Execute()+0x50
35 000000000b40e1f0 000007fee4bc6c15 System_Web_ni!System.Web.HttpApplication.ExecuteStep(IExecutionStep, Boolean ByRef)+0xab
36 000000000b40e290 000007fee4bb7773 System_Web_ni!System.Web.HttpApplication+ApplicationStepManager.ResumeSteps(System.Exception)+0x1a5
37 000000000b40e340 000007fee4bbba94 System_Web_ni!System.Web.HttpApplication.System.Web.IHttpAsyncHandler.BeginProcessRequest(System.Web.HttpContext, System.AsyncCallback, System.Object)+0xd3
38 000000000b40e3a0 000007fee4bbb67c System_Web_ni!System.Web.HttpRuntime.ProcessRequestInternal(System.Web.HttpWorkerRequest)+0x1c4
39 000000000b40e430 000007fee4bba2ac System_Web_ni!System.Web.HttpRuntime.ProcessRequestNoDemand(System.Web.HttpWorkerRequest)+0x7c
40 000000000b40e470 000007fef6018e92 System_Web_ni!System.Web.Hosting.ISAPIRuntime.ProcessRequest(IntPtr, Int32)+0x18c
41 000000000b40e5a0 000007fef5ea6b63 mscorwks!CallDescrWorker+0x82
42 000000000b40e600 000007fef5e9c4be mscorwks!CallDescrWorkerWithHandler+0xd3
43 000000000b40e6a0 000007fef6507b96 mscorwks!ForwardCallToManagedMethod+0x172
44 000000000b40e740 000007fef5e66054 mscorwks!COMToCLRWorkerBody+0x2b6
45 000000000b40e990 000007fef5eb5662 mscorwks!COMToCLRWorkerDebuggerWrapper+0x50
46 000000000b40ea00 000007fef601900e mscorwks!COMToCLRWorker+0x366
47 000000000b40ecf0 000007fee2d782b3 mscorwks!GenericComCallStub+0x5e
48 000000000b40eda0 000007fee2d78693 webengine!HttpCompletion::ProcessRequestInManagedCode+0x2a3
49 000000000b40f250 000007fee2dbad04 webengine!HttpCompletion::ProcessCompletion+0x63
50 000000000b40f290 000007fef5f28787 webengine!CorThreadPoolWorkitemCallback+0x24
51 000000000b40f2c0 000007fef5f16d6a mscorwks!UnManagedPerAppDomainTPCount::DispatchWorkItem+0x157
52 000000000b40f360 000007fef5e4c3dc mscorwks!ThreadpoolMgr::WorkerThreadStart+0x1ba
53 000000000b40f400 000000007777a4bd mscorwks!Thread::intermediateThreadProc+0x78
54 000000000b40f7d0 00000000778b6461 kernel32!BaseThreadInitThunk+0xd
55 000000000b40f800 0000000000000000 ntdll!RtlUserThreadStart+0x1d

That's quite a mouthful. We'll touch on the relevant parts next.

Interpreting the native and managed call stack

Reading the call stack bottom up, it shows how (and where) one function calls the next, leading to the blocked path. Beginning at frame 55, the user-mode side of Windows kicks off thread execution by passing control to ntdll!RtlUserThreadStart. Notice how its RetAddr is all zeros, indicating that it wasn't called from anywhere and should never return. The operating system passes RtlUserThreadStart the address of a function to run -- the function passed via CreateThread's lpStartAddress argument -- in this case a C++ method inside the CLR at frame 52. From there and until frame 41, the thread is set up to eventually run managed code. That happens at frame 40, where the thread is assigned the processing of an incoming request and starts executing managed code.

Say we want to know the URL being processed by the thread, the one shown by DebugDiag. It turns out to be stored (among other places) inside the HttpWorkerRequest object passed as an argument in frame 38. That object has both a _path field and a _startTime field, which presumably DebugDiag uses to compute the value of RunningSince. The same inspection commands could be applied to the HttpContext argument at frame 32, stored at memory location 0x000000014da07450, which we should recognize from the DebugDiag report.

We can inspect the HttpWorkerRequest as follows:

0:027> !CLRStack -a
...
000000000b40e3a0 000007fee4bbb67c System.Web.HttpRuntime.ProcessRequestInternal(System.Web.HttpWorkerRequest)
    PARAMETERS:
        this = 0x000000013f917d80
        wr = 0x000000014da068e8
...

0:027> !DumpObj 0x000000014da068e8
Name: System.Web.Hosting.ISAPIWorkerRequestInProcForIIS7
MethodTable: 000007fee4cdd158
EEClass: 000007fee496d800
Size: 408(0x198) bytes
 (C:\Windows\assembly\GAC_64\System.Web\2.0.0.0__b03f5f7f11d50a3a\System.Web.dll)
Fields:
              MT    Field   Offset                 Type VT     Attr            Value Name
000007fef4038400  400137c       10      System.DateTime  1 instance 000000014da068f8 _startTime
...
000007fef3ff7ec0  4001baf       30        System.String  0 instance 000000014da06ca8 _path
...

0:027> !DumpObj 000000014da06ca8
Name: System.String
MethodTable: 000007fef3ff7ec0
EEClass: 000007fef3bfe560
Size: 96(0x60) bytes
 (C:\Windows\assembly\GAC_64\mscorlib\2.0.0.0__b77a5c561934e089\mscorlib.dll)
String: /org/it/pages/default.aspx
...
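
The same goes for the HttpContext at frame 32; a sketch only, since the HttpWorkerRequest already gave us the URL and start time (output omitted):

0:027> !DumpObj 0x000000014da07450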

Continuing our journey up the call stack, at frame 30 SharePoint initializes the SPWeb based on the URL, and at frame 26 prepares a request for the document library item storing the page inside the content database.

At frames 25 to 23, something notable happens. Instead of connecting to MS SQL Server and fetching the file -- a row in a database table -- using ADO.NET, SharePoint transfers control to a native-code COM component inside OWSSVR.dll. The calls that follow appear to resolve the file path inside the virtual file system of the content database, followed by the insertion of an audit log entry at frame 17. If auditing is enabled, audit log entries are stored in a single table called AuditData inside the content database.
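
As an aside, and assuming direct read access to the content database (a sketch, not part of this debugging session), the growth of that table is easy to gauge with a query like the following:

-- Count the audit entries accumulated in this content database.
SELECT COUNT(*) AS AuditEntries FROM dbo.AuditData;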

Because we're inside native code, limited runtime metadata is available compared to managed code. But from the method names it appears that at frame 7, Microsoft OLE DB -- a COM-based legacy equivalent to ADO.NET -- composes and sends a SQL statement to MS SQL Server. Data appears to have been sent over a socket at frame 6, and at frame 4 the driver calls the socket receive function, causing the thread to block until a response arrives. What should've happened is that once SQL Server responds, Windows would wake up the thread. The thread would then fetch and return the actual page from the content database.

It's possible to extract the SQL statement sent by OLE DB to MS SQL Server. While !CLRStack -a is no good with native code, we can dump a native stack frame using the .frame command. With the /r option it also displays the CPU register values at that frame:

0:027> .frame /r 7
07 00000000`0b409c40 000007fe`e2e15617 sqloledb!CStmt::ExecDirect+0xa5b
rax=000000014dde5fe0 rbx=00000000182edf30 rcx=000000014dde7020
rdx=0000000000000000 rsi=0000000000000001 rdi=00000000ffffffff
rip=000007fee2e15866 rsp=000000000b409c40 rbp=0000000000000000
 r8=0000000000000000  r9=0000000000000040 r10=000000014dde5fd0
r11=0000000000000007 r12=0000000000000001 r13=000000001322a1b0
r14=000000001a7921c0 r15=ffffffffffffffff
iopl=0         nv up ei ng nz na pe cy
cs=0033  ss=002b  ds=002b  es=002b  fs=0053  gs=002b             efl=00000283
sqloledb!CStmt::ExecDirect+0xa5b:
000007fe`e2e15866 4983cfff        or      r15,0FFFFFFFFFFFFFFFFh

On 64-bit Windows, the calling convention passes the first four integer or pointer arguments in the rcx, rdx, r8, and r9 registers, with the rest spilled to the stack (on 32-bit systems, arguments are typically passed on the stack). Because we're stopped deep inside the function rather than at the call site, the argument registers may already have been reused, so some guesswork is required to determine which register, if any, still points to the SQL statement.
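
One low-tech approach, shown here as a sketch rather than as part of the original session, is to probe each candidate register with the du command (short for dump memory as Unicode string) and see which of them point to readable text:

0:027> du @rcx
0:027> du @rdx
0:027> du @r8
0:027> du @r9
0:027> du @r10
0:027> du @r13

Probing this way, we find a pointer to the SQL statement in CPU register r13: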

0:027> du 000000001322a1b0 L1000

Cleaned up output:    
EXEC proc_AddAuditEntry
  'F98E5B5D-2147-4534-9902-266AEA45A817', -- SiteId
  '95B07887-99E9-4B98-AFDC-2269C9747571', -- ItemId
  4,                                      -- ItemType
  7245,                                   -- UserId
  NULL,                                   -- MachineName
  NULL,                                   -- MachineIp
  N'org/it/pages/default.aspx',           -- Location
  0,                                      -- LocationType
  '20170713 20:49:50',                    -- Occurred
  3,                                      -- Event
  NULL,                                   -- EventName
  0,                                      -- EventSource
  NULL,                                   -- SourceName
  N'<Version>...</Version>'               -- EventData

OLE DB is calling a stored procedure which adds an audit log entry to the database. The arguments to the stored procedure identify the object requested and the user requesting it. The Occurred timestamp is in UTC, and it's consistent with the debug session time of 23:00:51 (UTC+2) and the 661 seconds the request has been blocked, marking the moment SharePoint attempted to add the audit entry.
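
Spelling out the arithmetic:

  23:00:51 local time at UTC+2       = 21:00:51 UTC
  661 seconds                        = 11 minutes 1 second
  21:00:51 UTC - 11 minutes 1 second = 20:49:50 UTC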

Conclusion

Inspecting the call stacks of the eight blocked threads shows that they're all waiting for MS SQL Server to respond to a call to the proc_AddAuditEntry stored procedure. As the stored procedure executes without issues the rest of the time, it's unlikely that anything is wrong with the procedure itself.

The culprit behind the recurring platform outages appears to be MS SQL Server not responding to requests. It's a bit odd that the OLE DB driver doesn't time out, but that might be entirely by design if Microsoft decided that audit logging must succeed for the request to continue.

In the next post we dive into MS SQL Server and investigate the cause of hangs from a database perspective.