Debugging recurring web farm outages: Collecting and analyzing the memory dump with DebugDiag

18 Jul 2017

Part 1: Establishing context with HTTP heartbeat pings
Part 2: Analyzing Windows Event Log, IIS log, and SharePoint log with PowerShell
Part 3: Collecting and analyzing the memory dump with DebugDiag
Part 4: Analyzing the memory dump with WinDbg
Part 5: SQL Server heartbeat pings and lock contention analysis with built-in tools

From the previous post, it's clear that IIS is receiving requests and that the ASP.NET runtime is blocked processing those requests. To gain insight into ASP.NET's request processing pipeline, this post outlines how to dump and analyze the w3wp process' memory. Dumping at just the right moment, after ASP.NET flags itself as unhealthy and before w3wp is recycled, is crucial. We'll illustrate how to use DebugDiag or a fixed-time approach to generate a dump, and we'll use DebugDiag's reporting capabilities to analyze it.

DebugDiag dump collection

DebugDiag (watch the four-part introduction on Channel 9: 1, 2, 3, 4) is both a capture and reporting utility. The reporting part automates manual digging around in WinDbg by analyzing a dump using predefined rules and heuristics. For dump collection purposes, we're interested in DebugDiag's ability to capture a dump when a breakpoint is hit.

Once a dump is generated (when the ASP.NET runtime calls isapi!SSFReportUnhealthy), it can be moved to another machine for DebugDiag report generation and further WinDbg analysis. On the downside, DebugDiag isn't xcopy deployable onto a production server. One has to either run the installer or xcopy deploy the installation folder from another machine and run the included Register.bat and Unregister.bat files.
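As a rough sketch of the xcopy approach, the commands below copy the installation folder from a workstation with DebugDiag installed and register its COM components. The source and destination paths are assumptions for illustration and must be adjusted to the environment:

# Assumed paths: copy DebugDiag from a machine with it installed, then register its COM components.
Copy-Item -Recurse '\\workstation\c$\Program Files\DebugDiag' 'E:\Tools\DebugDiag'
& 'E:\Tools\DebugDiag\Register.bat'

Running the included Unregister.bat reverses the registration once dump collection is done.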

Scheduled dump collection

With a small PowerShell script shelling out to ProcDump, we can schedule a dump and avoid installing DebugDiag in production. Something similar to the script could be achieved using Windows Scheduled Tasks, but the script approach is about as non-intrusive as it gets. While it isn't breakpoint based (ProcDump doesn't support it), for our purpose, dumping at a fixed time is good enough.

# Wait until the scheduled dump time.
$dumpAt = (Get-Date -Year 2017 -Month 7 -Day 17 -Hour 23 -Minute 15 -Second 00)
Start-Sleep -Seconds (New-TimeSpan -End $dumpAt).TotalSeconds

# Resolve the w3wp process IDs for the application pool and take a full memory dump of each process.
$pids = C:\Windows\system32\inetsrv\appcmd.exe list wps /apppool.name:"SP_Intranet_AppPool" /text:WP.NAME
$pids | foreach {
  .\procdump64.exe -accepteula -64 -ma $_ e:\
}

One hurdle the script has to overcome is extracting w3wp process identifiers, given the name of an application pool, for use by ProcDump. With the server hosting 16 IIS applications, a similar number of application pools and w3wp processes exist (more if some web applications run in a web garden setup). Another hurdle is that while the dump is ongoing, the w3wp process is suspended and not processing requests. A dump must happen fast enough for IIS not to notice, or the w3wp process will get recycled while it's being dumped. ProcDump provides the -r switch to clone a process, resume execution of the original, and dump the clone, but that switch is unsupported on Windows Server 2008. Fortunately, our dump times are short, in the 10-50 second range, generating dump files of 1-4 GB in size.
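For reference, on a newer OS the clone-based dump would differ from the script above only by the added -r switch. A sketch under that assumption, reusing the same pool name and output path:

# Sketch only: -r dumps a clone of w3wp so the original keeps serving requests.
# Requires a newer OS than our Windows Server 2008 front-ends.
$pids = C:\Windows\system32\inetsrv\appcmd.exe list wps /apppool.name:"SP_Intranet_AppPool" /text:WP.NAME
$pids | foreach {
  .\procdump64.exe -accepteula -64 -r -ma $_ e:\
}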

Analyzing the dump DebugDiag reports

Next we have DebugDiag aggregate all its reports into a single file. Inside it we find the HttpContext Report, listing properties of each request/response (304 in total). We're especially interested in whether a request has completed, and if not, for how long it has been running and which worker thread pool thread is assigned to it.

The following lists a couple of completed requests and every not yet completed one. The heartbeat pings are the GET requests to /org/it/pages/default.aspx. From the actual report, we've removed the ReturnCode column (containing 200) and the Verb column (containing GET):

HttpContext   Timeout   Completed   RunningSince   ThreadId   RequestPath+QueryString 
1020c2550     110 Sec   Yes         ---                       /favicon.ico  
105cc7988     360 Sec   Yes         ---                       /_layouts/1030/styles/core.css rev=...
112e66ee8     110 Sec   No          501 Sec        26         /pages/frontpage.aspx
14da07450     110 Sec   No          661 Sec        27         /org/it/pages/default.aspx
18f72bce8     110 Sec   No          246 Sec        52         /pages/frontpage.aspx
18f7ab300     110 Sec   No           52 Sec        56         /pages/frontpage.aspx 
1cb74f958     110 Sec   No          401 Sec        21         /org/it/pages/default.aspx
1cb7e31b8     110 Sec   No          348 Sec        28         /pages/frontpage.aspx   
1cb8588a8     110 Sec   No          204 Sec        54         /pages/frontpage.aspx
1cb86b8d8     110 Sec   No          109 Sec        55         /org/it/pages/default.aspx

The DebugDiag report summary states that the above thread IDs are blocked while calling into an ISAPI extension named OWSSVR. For .NET code, DebugDiag would be able to tell that, for instance, these threads are waiting on data to be returned from the database. But OWSSVR is an opaque COM component, and we'll need to dig out the information ourselves.

The above listing is from a dump collected at 23:00. Notice how the report lists 14% of threads as blocked (8 blocked threads in total). A later dump collected at 23:15, but on a different day, reports 30% of threads as blocked (22 blocked threads). While the heartbeat script may cause the number of blocked threads to increase faster than if only regular users accessed the site, the trend is clear.

At neither 23:00 nor 23:15 has the runtime raised the unhealthy flag. Only heartbeat requests report timeouts and Internet Explorer shows a blank page. Also, in the 23:15 dump, the maximum value in the RunningSince column has increased from 661 to 1,747 seconds. Subtracting 1,747 seconds from 23:15 indicates the request was made around 22:46, just about when the heartbeat script started detecting downtime.
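As a quick sanity check of that arithmetic, subtracting the RunningSince value from the dump time in PowerShell (dump time taken from the collection script above):

# 23:15 dump time minus the longest RunningSince value of 1,747 seconds.
(Get-Date -Year 2017 -Month 7 -Day 17 -Hour 23 -Minute 15 -Second 0).AddSeconds(-1747)
# => 17 July 2017 22:45:53, i.e., around 22:46.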

Matching the eight not completed requests, the SharePointAnalysis section of the report lists eight undisposed SPWeb objects (of 681 in total). Similarly, the SPRequest Statistics section lists eight undisposed SPRequest objects (of 741 in total). SPWeb and SPRequest are SharePoint .NET classes involved in processing requests. The thread IDs of both the SPWeb and SPRequest objects match the thread IDs of the not completed requests. It's therefore safe to assume that these undisposed resources are in use and aren't a memory leak. The memory locations provided come in handy once we load the dump into WinDbg. They allow us to easily access additional properties of each request:

Address            SPRequest   Auto-Dispose   Auth     Thread   Url [Title]
0000000112e6d400   Own         Thread         (null)   26       (null) []
000000014ddda9a8   Own         Thread         (null)   27       (null) []
000000018f735380   Own         Thread         (null)   52       (null) []
000000018f7b4690   Own         Thread         (null)   56       (null) []
00000001cb793a90   Own         Thread         (null)   21       (null) []
00000001cb7e89c0   Own         Thread         (null)   28       (null) []
00000001cb85f020   Own         Thread         (null)   54       (null) []
00000001cb8b02e0   Own         Thread         (null)   55       (null) []

SPContext          Thread   Linked Objects
0000000112e6d7a0   26	    SPSWeb:0000000112e6d400 
000000014dddad48   27	    SPSWeb:000000014ddda9a8 
000000018f735720   52	    SPSWeb:000000018f735380 
000000018f7b4a30   56	    SPSWeb:000000018f7b4690 
00000001cb793e30   21	    SPSWeb:00000001cb793a90 
00000001cb7e8d60   28	    SPSWeb:00000001cb7e89c0 
00000001cb85f3c0   54	    SPSWeb:00000001cb85f020 
00000001cb8b0680   55	    SPSWeb:00000001cb8b02e0 

Lastly, the Client Connections section details currently executing requests. Below is one of the heartbeat requests, including its NTLM header (see previous post for details on the authentication handshake). The 172.27.18.210 IP is my reference laptop running the heartbeat tool and 172.27.18.8 is Web front-end server 2:

Client connection from 172.27.18.210:58666 to 172.27.18.8:80 
Host Header            intranet.acme.dk:80 
GET request for        /org/it/pages/default.aspx
HTTP Version           HTTP/1.1 
SSL Request            False 
Time alive             0-2:0-57:0-56 
QueryString    
Request mapped to    
HTTP Request State     HTR_READING_CLIENT_REQUEST 
Native Request State   NREQ_STATE_PROCESS 
HTTP Headers           Authorization: NTLM TlRMTVNTUA...AAAAAAAAAAAAAAAAAA==
                       Host: intranet.acme.dk

Conclusion

The DebugDiag report confirms our suspicion that ASP.NET worker threads are stuck processing requests. The longer we wait, the more threads get blocked, eventually causing w3wp to signal that it's unhealthy and in a deadlocked state. The Windows Process Activation Service then ensures the process is recycled, and the flow repeats a number of times until the issue resolves itself.

We still don't know specifically what the worker threads are blocked on. DebugDiag's Thread Report section provides stack traces listing functions only, indicating that threads are blocked waiting for a database operation. We've refrained from including a stack trace in this post because, given the thread IDs, WinDbg is better suited for digging into that side of things.

In the next post, we'll load the dump into WinDbg and dig into why the worker thread pool threads are blocked.