Debugging recurring web farm outages: Analyzing the memory dump with WinDbg

19 Jul 2017

Part 1: Establishing context with HTTP heartbeat pings
Part 2: Analyzing Windows Event Log, IIS log, and SharePoint log with PowerShell
Part 3: Collecting and analyzing a memory dump with DebugDiag
Part 4: Analyzing the memory dump with WinDbg
Part 5: SQL Server heartbeat pings and lock contention analysis with built-in tools

In the previous post we used DebugDiag to narrow down the problem. As expected, the ASP.NET runtime stops processing requests because worker thread pool threads are getting stuck. As more requests come in, more threads get stuck, until the runtime eventually notices that something is wrong. It then signals its w3wp host process, which the Windows Process Activation Service detects and responds to by restarting w3wp. This cycle repeats itself a number of times.

This post investigates what the worker threads are blocked on. We do so by loading the memory dump captured earlier into WinDbg.

Loading the dump into WinDbg

We don't want to analyze the memory dump directly on a production server. Instead we move it to another machine and do postmortem analysis there. For WinDbg to work reliably within this setup, we must follow the steps outlined in postmortem debugging across installed .NET CLR versions. These steps are required because the dump was captured on Windows Server 2008 with w3wp running a version of .NET 2.0 not necessarily present on the analysis machine. Key files related to that CLR version must therefore be copied across machines.
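
In rough terms, and assuming the default .NET 2.0 x64 install location, that boils down to copying the CLR data access and SOS extension DLLs from the server to a folder next to the dump, renaming mscordacwks to the architecture and version specific name the debugger looks for. The commands below are only a sketch; the version numbers are those of this particular server:

copy C:\Windows\Microsoft.NET\Framework64\v2.0.50727\sos.dll C:\dumps
copy C:\Windows\Microsoft.NET\Framework64\v2.0.50727\mscordacwks.dll C:\dumps\mscordacwks_AMD64_AMD64_2.0.50727.8762.dll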

In the WinDbg session below, commands issued are marked in green. The parts of the output worth highlighting are in red.

Microsoft (R) Windows Debugger Version 10.0.15063.137 AMD64
Copyright (c) Microsoft Corporation. All rights reserved.


Loading Dump File [C:\dumps\w3wp.exe_170713_230009.dmp]
User Mini Dump File with Full Memory: Only application data is available

Comment: '
*** C:\procdump64  -accepteula -64 -ma 6844 e:\
*** Manual dump'
Symbol search path is: srv*
Executable search path is: 
Windows Server 2008/Windows Vista Version 6002 (Service Pack 2) MP (4 procs) Free x64
Product: Server, suite: TerminalServer SingleUserTS
Machine Name:
Debug session time: Thu Jul 13 23:00:51.000 2017 (UTC + 2:00)
System Uptime: 4 days 10:02:51.079
Process Uptime: 0 days 2:07:05.000
................................................................
................................................................
................................................................
................................................................
............................................
Loading unloaded module list
..............
ntdll!NtWaitForSingleObject+0xa:
00000000`778d68da c3              ret
0:000> .sympath+ C:\dumps
Symbol search path is: srv*;C:\dumps
Expanded Symbol search path is: cache*;SRV*https://msdl.microsoft.com/download/symbols;c:\dumps

************* Symbol Path validation summary **************
Response                         Time (ms)     Location
Deferred                                       srv*
OK                                             C:\dumps
0:000> .cordll -ve -u -l
CLRDLL: C:\Windows\Microsoft.NET\Framework64\v2.0.50727\mscordacwks.dll:2.0.50727.8745 f:0
doesn't match desired version 2.0.50727.8762 f:0
Automatically loaded SOS Extension
CLRDLL: Loaded DLL C:\Program Files (x86)\Windows Kits\10\Debuggers\x64\sym\mscordacwks_AMD64_AMD64_2.0.50727.8762.dll\58E462F09a0000\mscordacwks_AMD64_AMD64_2.0.50727.8762.dll
CLR DLL status: Loaded DLL C:\Program Files (x86)\Windows Kits\10\Debuggers\x64\sym\mscordacwks_AMD64_AMD64_2.0.50727.8762.dll\58E462F09a0000\mscordacwks_AMD64_AMD64_2.0.50727.8762.dll

(On this analysis machine, the proper CLR files were already present; otherwise the loader output above would've pointed to c:\dumps.)

Because the Event Log entry mentions a (potential) deadlock, we load the SOSEX debugger extension in addition to SOS. While the SOS !SyncBlk command requires manual interpretation, the SOSEX !dlk command aims to detect and report deadlocks involving managed resources:

0:000> .load C:\dumps\sosex_64\sosex
0:000> !dlk
Examining SyncBlocks...
Scanning for ReaderWriterLock(Slim) instances...
Scanning for holders of ReaderWriterLock locks...
Scanning for holders of ReaderWriterLockSlim locks...
Examining CriticalSections...
Scanning for threads waiting on SyncBlocks...
Scanning for threads waiting on ReaderWriterLock locks...
Scanning for threads waiting on ReaderWriterLocksSlim locks...
Scanning for threads waiting on CriticalSections...
No deadlocks detected.

Either the deadlock hasn't manifested itself yet, the deadlock doesn't involve managed resources, or the Event Log message is a false positive.
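
For completeness, plain SOS can also dump the SyncBlock table to show which threads, if any, hold managed monitor locks. Its output requires the kind of manual interpretation mentioned above, so only the command is sketched here:

0:000> !SyncBlk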

Analyzing blocked threads

Let's start by looking at high-level thread pool statistics. From the DebugDiag report we know that eight worker thread pool threads were blocked at the time of the dump. More threads are running inside w3wp; in fact the ~ command lists 58 threads. Most of those just aren't managed thread pool threads:

0:000> !ThreadPool
CPU utilization 4%
Worker Thread: Total: 9 Running: 8 Idle: 1 MaxLimit: 400 MinLimit: 4
Work Request in Queue: 1
--------------------------------------
Number of Timers: 17
--------------------------------------
Completion Port Thread:Total: 1 Free: 1 MaxFree: 8 CurrentLimit: 0 MaxLimit: 400 MinLimit: 4

If we wanted to see the managed call stacks of those eight threads, we could issue the ~*e!CLRStack command: for all threads (~*), execute (e) the !CLRStack command. But rather than include the stacks here (it's a long listing), we focus on one thread and generalize the analysis. Let's use thread 27 as an example. It's the first thread that DebugDiag reported blocked, it's servicing our HTTP ping request, and it's the thread that's been blocked the longest (661 seconds).
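
For reference, here's what those commands look like (output omitted). The second command switches the debugger's current thread to 27, which is why the listings below show a 0:027> prompt:

0:000> ~*e !CLRStack
0:000> ~27s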

Instead of listing the native call stack with the k command and then the managed call stack with the !CLRStack command, we combine the two using the !DumpStack command. We want both stacks because SharePoint consists of both native and managed code, and the native stack may reveal clues as to why a thread is blocked. Since the !DumpStack output doesn't include stack frame numbers, we've added those manually for easier reference:

0:027> !DumpStack
OS Thread Id: 0x380 (27)
   Child-SP         RetAddr          Call Site
00 000000000b409838 000007fefd3af5f1 ntdll!NtWaitForSingleObject+0xa
01 000000000b409840 000007fefd3b1a68 mswsock!_GSHandlerCheck_SEH+0x2e41
02 000000000b4098c0 000007fefea5295c mswsock!_GSHandlerCheck_SEH+0x62a6
03 000000000b4099b0 000007fee6a0120d ws2_32!recv+0x13c
04 000000000b409a50 000007fee2e16f28 dbnetlib!ConnectionRead+0x491
05 000000000b409b30 000007fee2e15b11 sqloledb!CDataSource::ConnectionTransact+0xde
06 000000000b409b90 000007fee2e15866 sqloledb!CDBConnection::SendPacket+0x1e1
07 000000000b409c40 000007fee2e15617 sqloledb!CStmt::ExecDirect+0xa5b
08 000000000b409d80 000007fee2e1531d sqloledb!CStmt::SQLExecDirect+0x44
09 000000000b409db0 000007fee2e140b1 sqloledb!CCommand::ExecuteHelper+0x209
10 000000000b409e40 000007fee2e13b4c sqloledb!CCommand::Execute+0x1458
11 000000000b409f50 000007fee3f9d28b sqloledb!CImpICommandText::Execute+0x181
12 000000000b409fb0 000007fee3f5d674 oledb32!CCommandText::DoExecute+0x5ab
13 000000000b40a1b0 0000000072fc59fa oledb32!CCommandText::Execute+0x92f
14 000000000b40a400 0000000072e2443c STSWEL!Voledb::ExecQuery+0x37e
15 000000000b40a560 0000000072e24837 STSWEL!VauditEntry::createInternalEntry+0x418
16 000000000b40acc0 0000000072e24fe7 STSWEL!VauditEntry::createInternalEntry+0x813
17 000000000b40ad50 0000000072e25335 STSWEL!VglobalAuditStore::addAuditEntry+0x97
18 000000000b40ad90 0000000072efe6de STSWEL!VglobalAuditStore::checkAuditQueue+0x241
19 000000000b40ae70 0000000072f35055 STSWEL!VdocumentStore::httpGetDocument+0x2ed2
20 000000000b40be60 0000000072f367e0 STSWEL!VhttpManager::loadFileCore+0x5c5
21 000000000b40c700 000000006d9a0543 STSWEL!VhttpManager::loadFileAndMetaInfo+0xc4
22 000000000b40c7c0 000000006d9ab514 OWSSVR!GetExtensionVersion+0x89b57
23 000000000b40cb80 000007fef60170a7 OWSSVR!GetExtensionVersion+0x94b28
24 000000000b40cf00 000007fe96c64bba mscorwks!DoCLRToCOMCall+0x177
25 000000000b40d090 000007fe96c643a1 Microsoft_SharePoint_Library!DomainNeutralILStubClass.IL_STUB(...)+0x96c64bbb
26 000000000b40d490 000007fe96c63d52 Microsoft_SharePoint!Microsoft.SharePoint.Library.SPRequest.GetFileAndMetaInfo(System.String, ...)+0x271
27 000000000b40d620 000007fe96c578c5 Microsoft_SharePoint!Microsoft.SharePoint.SPWeb.GetWebPartPageContent(System.Uri, ...)+0xa12
28 000000000b40d980 000007fe96c56e87 Microsoft_SharePoint!Microsoft.SharePoint.ApplicationRuntime.SPRequestModuleData.FetchWebPartPageInformationForInit(System.Web.HttpContext, ...)+0x495
29 000000000b40dc10 000007fe96c563b4 Microsoft_SharePoint!Microsoft.SharePoint.ApplicationRuntime.SPRequestModuleData.GetFileForRequest(System.Web.HttpContext, ...)+0x2d7
30 000000000b40dee0 000007fe9698fa39 Microsoft_SharePoint!Microsoft.SharePoint.ApplicationRuntime.SPRequestModule.InitContextWeb(System.Web.HttpContext, Microsoft.SharePoint.SPWeb)+0xa4
31 000000000b40df60 000007fe9698f769 Microsoft_SharePoint!Microsoft.SharePoint.WebControls.SPControl.SPWebEnsureSPControl(System.Web.HttpContext)+0x299
32 000000000b40e050 000007fe9698f2e3 Microsoft_SharePoint!Microsoft.SharePoint.WebControls.SPControl.GetContextWeb(System.Web.HttpContext)+0x29
33 000000000b40e090 000007fee4bc78b0 Microsoft_SharePoint!Microsoft.SharePoint.ApplicationRuntime.SPRequestModule.PostResolveRequestCacheHandler(System.Object, System.EventArgs)+0x253
34 000000000b40e190 000007fee4bb86bb System_Web_ni!System.Web.HttpApplication+SyncEventExecutionStep.System.Web.HttpApplication.IExecutionStep.Execute()+0x50
35 000000000b40e1f0 000007fee4bc6c15 System_Web_ni!System.Web.HttpApplication.ExecuteStep(IExecutionStep, Boolean ByRef)+0xab
36 000000000b40e290 000007fee4bb7773 System_Web_ni!System.Web.HttpApplication+ApplicationStepManager.ResumeSteps(System.Exception)+0x1a5
37 000000000b40e340 000007fee4bbba94 System_Web_ni!System.Web.HttpApplication.System.Web.IHttpAsyncHandler.BeginProcessRequest(System.Web.HttpContext, System.AsyncCallback, System.Object)+0xd3
38 000000000b40e3a0 000007fee4bbb67c System_Web_ni!System.Web.HttpRuntime.ProcessRequestInternal(System.Web.HttpWorkerRequest)+0x1c4
39 000000000b40e430 000007fee4bba2ac System_Web_ni!System.Web.HttpRuntime.ProcessRequestNoDemand(System.Web.HttpWorkerRequest)+0x7c
40 000000000b40e470 000007fef6018e92 System_Web_ni!System.Web.Hosting.ISAPIRuntime.ProcessRequest(IntPtr, Int32)+0x18c
41 000000000b40e5a0 000007fef5ea6b63 mscorwks!CallDescrWorker+0x82
42 000000000b40e600 000007fef5e9c4be mscorwks!CallDescrWorkerWithHandler+0xd3
43 000000000b40e6a0 000007fef6507b96 mscorwks!ForwardCallToManagedMethod+0x172
44 000000000b40e740 000007fef5e66054 mscorwks!COMToCLRWorkerBody+0x2b6
45 000000000b40e990 000007fef5eb5662 mscorwks!COMToCLRWorkerDebuggerWrapper+0x50
46 000000000b40ea00 000007fef601900e mscorwks!COMToCLRWorker+0x366
47 000000000b40ecf0 000007fee2d782b3 mscorwks!GenericComCallStub+0x5e
48 000000000b40eda0 000007fee2d78693 webengine!HttpCompletion::ProcessRequestInManagedCode+0x2a3
49 000000000b40f250 000007fee2dbad04 webengine!HttpCompletion::ProcessCompletion+0x63
50 000000000b40f290 000007fef5f28787 webengine!CorThreadPoolWorkitemCallback+0x24
51 000000000b40f2c0 000007fef5f16d6a mscorwks!UnManagedPerAppDomainTPCount::DispatchWorkItem+0x157
52 000000000b40f360 000007fef5e4c3dc mscorwks!ThreadpoolMgr::WorkerThreadStart+0x1ba
53 000000000b40f400 000000007777a4bd mscorwks!Thread::intermediateThreadProc+0x78
54 000000000b40f7d0 00000000778b6461 kernel32!BaseThreadInitThunk+0xd
55 000000000b40f800 0000000000000000 ntdll!RtlUserThreadStart+0x1d

That's quite a mouthful. We'll go through the relevant parts of the call stack next.

Interpreting the native and managed call stack

When the call stack is read bottom up, it shows how (and where) one function calls into the next. At frame 55, the user space part of the Windows kernel kicks off thread execution by passing control to ntdll!RtlUserThreadStart. Notice how this frame has all zeroes as its RetAddr, indicating it wasn't called from anywhere and should never return. The operating system passes RtlUserThreadStart the address of a function to run (the one passed to the Windows create thread function), in this case the CLR's C++ method at frame 52. From there and until frame 41, the thread prepares itself to run managed code. Then at frame 40, the thread is assigned the processing of an incoming request and starts executing managed code.

Say we're interested in the URL being processed by the thread. We can dump the HttpWorkerRequest object passed as an argument in frame 38 and write out the value of its _path field (it also has a _startTime field which presumably DebugDiag uses to compute the "RunningSince" value in its report). The same drill-down technique could be applied to the HttpContext at frame 32, whose argument points to 0x000000014da07450, an address we should recognize from the DebugDiag report:

0:027> !CLRStack -a
...
000000000b40e3a0 000007fee4bbb67c System.Web.HttpRuntime.ProcessRequestInternal(System.Web.HttpWorkerRequest)
    PARAMETERS:
        this = 0x000000013f917d80
        wr = 0x000000014da068e8
...

0:027> !DumpObj 0x000000014da068e8
Name: System.Web.Hosting.ISAPIWorkerRequestInProcForIIS7
MethodTable: 000007fee4cdd158
EEClass: 000007fee496d800
Size: 408(0x198) bytes
 (C:\Windows\assembly\GAC_64\System.Web\2.0.0.0__b03f5f7f11d50a3a\System.Web.dll)
Fields:
              MT    Field   Offset                 Type VT     Attr            Value Name
000007fef4038400  400137c       10      System.DateTime  1 instance 000000014da068f8 _startTime
...
000007fef3ff7ec0  4001baf       30        System.String  0 instance 000000014da06ca8 _path
...

0:027> !DumpObj 000000014da06ca8
Name: System.String
MethodTable: 000007fef3ff7ec0
EEClass: 000007fef3bfe560
Size: 96(0x60) bytes
 (C:\Windows\assembly\GAC_64\mscorlib\2.0.0.0__b77a5c561934e089\mscorlib.dll)
String: /org/it/pages/default.aspx
...
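
The HttpContext at frame 32 could be drilled into the same way; only the command is sketched here since its output isn't needed for the analysis:

0:027> !DumpObj 0x000000014da07450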

Moving up the call stack, at frame 30 we see SharePoint initialize the SPWeb based on the URL and prepare a request for the document library item storing the page (frame 26) inside the content database.

At this point something interesting happens. It turns out that instead of using ADO.NET to connect to MS SQL Server and fetch the file -- a row in the database -- SharePoint transfers control to a native code COM component (frame 23) inside OWSSVR.dll. The following calls appear to resolve the file path inside the virtual filesystem of the database and then, before loading the actual file, construct an audit log entry (frame 17). If auditing is enabled, such entries are stored in a single table inside the content database.

Because we're in native code, we've lost the managed runtime metadata. But from the method names we see that Microsoft OLE DB, a COM-based legacy equivalent to ADO.NET, composes and sends a SQL statement to MS SQL Server (frame 7). Data appears to have been sent on the socket (frame 6), and the OLE DB driver then calls the socket receive method to wait for a response (frame 4). Because no data is available, the receive call blocks the thread (frame 3). Only once socket data becomes available will Windows wake up the thread.

We can extract the SQL statement which OLE DB sends to SQL Server. !CLRStack isn't applicable in native code. What we can do, however, is dump a native stack frame using the .frame command, which with the /r option also shows the values of the CPU registers at that point in time:

0:027> .frame /r 7
07 00000000`0b409c40 000007fe`e2e15617 sqloledb!CStmt::ExecDirect+0xa5b
rax=000000014dde5fe0 rbx=00000000182edf30 rcx=000000014dde7020
rdx=0000000000000000 rsi=0000000000000001 rdi=00000000ffffffff
rip=000007fee2e15866 rsp=000000000b409c40 rbp=0000000000000000
 r8=0000000000000000  r9=0000000000000040 r10=000000014dde5fd0
r11=0000000000000007 r12=0000000000000001 r13=000000001322a1b0
r14=000000001a7921c0 r15=ffffffffffffffff
iopl=0         nv up ei ng nz na pe cy
cs=0033  ss=002b  ds=002b  es=002b  fs=0053  gs=002b             efl=00000283
sqloledb!CStmt::ExecDirect+0xa5b:
000007fe`e2e15866 4983cfff        or      r15,0FFFFFFFFFFFFFFFFh

Using the du command (display memory region as a Unicode string), we can get at the SQL statement. Some amount of guesswork is required as to which register holds the pointer to the statement. It turns out to be CPU register r13, which we can dereference:

0:027> du 000000001322a1b0 L1000

Cleaned up output:    
EXEC proc_AddAuditEntry
  'F98E5B5D-2147-4534-9902-266AEA45A817', /* SiteId */
  '95B07887-99E9-4B98-AFDC-2269C9747571', /* ItemId */
  4,                                      /* ItemType */ 
  7245,                                   /* UserId */ 
  NULL,                                   /* MachineName */
  NULL,                                   /* MachineIp */
  N'org/it/pages/default.aspx',           /* Location */ 
  0,                                      /* LocationType */
  '20170713 20:49:50',                    /* Occurred */
  3,                                      /* Event */
  NULL,                                   /* EventName */
  0,                                      /* EventSource */
  NULL,                                   /* SourceName */ 
  N'<Version>...</Version>'               /* EventData */
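
As an aside, the register guesswork can be shortened by dumping candidate registers directly using the @ prefix instead of copying their values by hand; registers that don't hold a pointer to readable Unicode text simply print garbage or a memory access error. A sketch:

0:027> du @rbx
0:027> du @r10
0:027> du @r13 L1000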

OLE DB is calling a stored procedure which adds a row to the audit log table (if auditing is enabled). The arguments identify the object requested and the user. The Occurred timestamp is in UTC: the debug session time of 23:00:51 is local time (UTC+2), i.e., 21:00:51 UTC, and subtracting the 661 seconds (about 11 minutes) the request has been blocked lands us at 20:49:50.

Conclusion

Inspecting the call stacks of the eight blocked threads shows that they're all waiting for MS SQL Server to respond to the proc_AddAuditEntry stored procedure. It's unlikely that anything is wrong with that procedure as such; it's just the first database operation that happens before the actual content is fetched.

The culprit behind the recurring platform outages appears to be MS SQL Server not responding to requests. It's a bit odd that the OLE DB driver never times out, though that might be entirely by design: audit logging must succeed for the request to proceed.

In the next post we dive into MS SQL Server and investigate the cause of hangs observed within the application layer.

Have a comment or question? Please drop me an email or tweet to @ronnieholm.