Debugging recurring web farm outages: Collecting and analyzing the memory dump with DebugDiag

18 Jul 2017

Part 1: Establishing context with HTTP heartbeat pings
Part 2: Analyzing Windows Event Log, IIS log, and SharePoint log with PowerShell
Part 3: Collecting and analyzing the memory dump with DebugDiag
Part 4: Analyzing the memory dump with WinDbg
Part 5: SQL Server heartbeat pings and lock contention analysis with built-in tools

From the previous post, it's clear that IIS is receiving requests and that the ASP.NET runtime is blocked processing those requests. To gain insight into ASP.NET's request processing pipeline, this post outlines how to dump and analyze the w3wp process' memory. Dumping at just the right moment, after the ASP.NET runtime flags itself as unhealthy and before w3wp is recycled, is crucial. We'll illustrate how to use DebugDiag or a timed approach to generate a dump, and we'll use DebugDiag's reporting capabilities to analyze it.

DebugDiag dump collection

DebugDiag (watch the four-part introduction on Channel 9: 1, 2, 3, 4) is both a capture and a reporting utility. The reporting part automates manual WinDbg digging by analyzing a dump using predefined rules and heuristics. For dump purposes, we're interested in its ability to capture-on-breakpoint-hit. Other tools, such as ProcDump, don't support this feature. WinDbg does, but in a more crude and manual fashion.

Once a dump has been generated (triggered when isapi!SSFReportUnhealthy is called by the ASP.NET runtime), it can be moved to another machine for DebugDiag report generation and further WinDbg analysis. Unfortunately, DebugDiag isn't fully xcopy deployable: one has to xcopy deploy the installation folder and then run the included Register.bat and Unregister.bat files.
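
To illustrate the more crude and manual WinDbg route, here's a minimal sketch using cdb.exe, the console version of WinDbg. It assumes the Debugging Tools for Windows are installed on the server and that the relevant w3wp process ID has been looked up beforehand (the PID and dump path below are made up). The breakpoint halts w3wp when the runtime reports itself unhealthy; at that point, entering .dump /ma e:\w3wp.dmp followed by qd writes a full dump and detaches:

# Attach to w3wp (replace 1234 with the actual process id), break when the
# ASP.NET runtime calls isapi!SSFReportUnhealthy, and resume execution.
$w3wpPid = 1234
& cdb.exe -p $w3wpPid -c "bp isapi!SSFReportUnhealthy; g"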

Scheduled dump collection

With a small PowerShell script shelling out to ProcDump, we can construct a dump scheduler. Something similar could be achieved using Windows Scheduled Tasks, but this approach is about as non-intrusive as it gets. While it isn't breakpoint based, for our purpose, dumping on a schedule is good enough.

# Sleep until the time we expect the outage to be underway.
$dumpAt = (Get-Date -Year 2017 -Month 7 -Day 17 -Hour 23 -Minute 15 -Second 00)
Start-Sleep -Seconds (New-TimeSpan -End $dumpAt).TotalSeconds

# Resolve the w3wp process ids serving the application pool and dump each one.
$pids = C:\Windows\system32\inetsrv\appcmd.exe list wps /apppool.name:"SP_Intranet_AppPool" /text:WP.NAME
$pids | foreach {
  .\procdump64 -accepteula -64 -ma $_ e:\
}

One hurdle that the script has to overcome is extracting w3wp process identifiers given the name of an application pool. With the server hosting 16 IIS applications, a similar number of w3wp processes may exist (more if some apps run in a web garden setup). Another hurdle is that while the dump is ongoing, the w3wp process is suspended. A dump must happen fast enough for IIS not to notice that w3wp has become unresponsive and recycle it. ProcDump provides an -r switch which clones a process, resumes the execution of the original, and dumps the clone, but that switch is unsupported on Windows Server 2008. Fortunately, our dump times are short enough, in the 10-50 second range, generating dump files of 1-4 GB.
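
For reference, on newer Windows versions where the -r switch is supported, the dump loop above would change along these lines (a sketch only; the rest of the script stays the same), so that the original w3wp is suspended only for as long as it takes to create the clone:

# Dump a clone of each w3wp process instead of the process itself. Requires a
# Windows version where ProcDump's -r (reflect/clone) switch is supported.
$pids | foreach {
  .\procdump64 -accepteula -64 -r -ma $_ e:\
}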

Analyzing the dump with DebugDiag reports

Next we have DebugDiag generate all its reports to a single file. Inside the aggregated report, we find the HttpContext Report. Based on HttpContext instances in memory, the report lists properties about each request/response (304 in total). For our purpose, the most interesting pieces of information are whether a request has completed and, if not, how long it's been running and which worker thread pool thread is assigned to the request.

The following lists a couple of completed requests and every request that hasn't completed. The heartbeat pings are the GET requests to /org/it/pages/default.aspx. From the original report, we removed the ReturnCode column (all 200) and the Verb column (all GET):

HttpContext   Timeout   Completed   RunningSince   ThreadId   RequestPath+QueryString 
1020c2550     110 Sec   Yes         ---                       /favicon.ico  
105cc7988     360 Sec   Yes         ---                       /_layouts/1030/styles/core.css rev=...
112e66ee8     110 Sec   No          501 Sec        26         /pages/frontpage.aspx
14da07450     110 Sec   No          661 Sec        27         /org/it/pages/default.aspx
18f72bce8     110 Sec   No          246 Sec        52         /pages/frontpage.aspx
18f7ab300     110 Sec   No           52 Sec        56         /pages/frontpage.aspx 
1cb74f958     110 Sec   No          401 Sec        21         /org/it/pages/default.aspx
1cb7e31b8     110 Sec   No          348 Sec        28         /pages/frontpage.aspx   
1cb8588a8     110 Sec   No          204 Sec        54         /pages/frontpage.aspx
1cb86b8d8     110 Sec   No          109 Sec        55         /org/it/pages/default.aspx

The report summary states that these threads are blocked, calling an ISAPI Extension named OWSSVR. For .NET code, DebugDiag would be able to tell that, for instance, these threads are waiting on data to be returned from the database server. But here, OWSSVR is an opaque COM component. We must dig out the information ourselves.

The above listing is from a dump collected at 23:00. Notice how the report lists 14% of threads blocked (8 threads in total). A later dump collected at 23:15, but on a different day, reports 30% of threads blocked (22 threads in total). While the heartbeat script may cause the number of blocked threads to increase faster than if only regular users accessed the site, the trend is clear.

At neither 23:00 nor 23:15 has the runtime raised the unhealthy flag. Only heartbeat requests report timeouts and Internet Explorer shows a blank page. Also, in the 23:15 dump, the maximum RunningSince across all threads has increased from 661 to 1,747 seconds. Subtracting 1,747 seconds from 23:15 means the request was made around 22:46, just about when the heartbeat script started detecting downtime.
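
As a quick sanity check of that arithmetic, here's the subtraction in PowerShell (using today's date as a stand-in for the collection date):

# 1,747 seconds before the 23:15 collection time.
(Get-Date -Hour 23 -Minute 15 -Second 0) - (New-TimeSpan -Seconds 1747)
# The resulting time of day is roughly 22:45:53.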

Matching the eight not completed requests, the report contains a SharePointAnalysis report with an SPWeb Object List (681 SPWeb objects found, eight undisposed). Similarly, the SPRequest Statistics lists eight undisposed SPRequest objects (of 741 in total). SPWeb and SPRequest are SharePoint .NET classes involved in processing requests. The thread IDs of both the SPWeb and SPRequest objects match up with the thread IDs of the not completed requests. It's therefore safe to assume that these undisposed resources are currently in use and don't represent a memory leak. The memory locations provided come in handy once we load the dump into WinDbg. They allow us to easily access additional properties of each request:

Address            SPRequest   Auto-Dispose   Auth     Thread   Url [Title]
0000000112e6d400   Own         Thread         (null)   26       (null) []
000000014ddda9a8   Own         Thread         (null)   27       (null) []
000000018f735380   Own         Thread         (null)   52       (null) []
000000018f7b4690   Own         Thread         (null)   56       (null) []
00000001cb793a90   Own         Thread         (null)   21       (null) []
00000001cb7e89c0   Own         Thread         (null)   28       (null) []
00000001cb85f020   Own         Thread         (null)   54       (null) []
00000001cb8b02e0   Own         Thread         (null)   55       (null) []

SPContext          Thread   Linked Objects
0000000112e6d7a0   26	    SPSWeb:0000000112e6d400 
000000014dddad48   27	    SPSWeb:000000014ddda9a8 
000000018f735720   52	    SPSWeb:000000018f735380 
000000018f7b4a30   56	    SPSWeb:000000018f7b4690 
00000001cb793e30   21	    SPSWeb:00000001cb793a90 
00000001cb7e8d60   28	    SPSWeb:00000001cb7e89c0 
00000001cb85f3c0   54	    SPSWeb:00000001cb85f020 
00000001cb8b0680   55	    SPSWeb:00000001cb8b02e0 
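
To give an idea of how those addresses become useful (the actual WinDbg analysis is the topic of the next post), here's a minimal sketch that opens the dump in cdb with the SOS extension and dumps the fields of one of the objects listed above. It assumes the dump lives at e:\w3wp.dmp and that w3wp runs .NET 2.0/3.5; on .NET 4 and later, load SOS with .loadby sos clr instead:

# Open the dump, load SOS from the runtime inside the dump, dump the managed
# object at one of the addresses from the report, then quit.
& cdb.exe -z e:\w3wp.dmp -c ".loadby sos mscorwks; !do 0000000112e6d400; q"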

Lastly, the Client Connections section details currently executing requests. Below is one for a heartbeat request, including the NTLM authentication headers mentioned in the previous post. The 172.27.18.210 IP is my reference laptop running the heartbeat tool and 172.27.18.8 is Web front-end server 2:

Client connection from 172.27.18.210:58666 to 172.27.18.8:80 
Host Header            intranet.acme.dk:80 
GET request for        /org/it/pages/default.aspx
HTTP Version           HTTP/1.1 
SSL Request            False 
Time alive             0-2:0-57:0-56 
QueryString    
Request mapped to    
HTTP Request State     HTR_READING_CLIENT_REQUEST 
Native Request State   NREQ_STATE_PROCESS 
HTTP Headers           Authorization: NTLM TlRMTVNTUA...AAAAAAAAAAAAAAAAAA==
                       Host: intranet.acme.dk
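
As a small aside, we can confirm that the Authorization header does carry an NTLM message. The report truncates the token, but base64 decoding its visible prefix (padded for the example) reveals the NTLMSSP signature that starts every NTLM message:

# Decode the visible prefix of the truncated NTLM token.
[Text.Encoding]::ASCII.GetString([Convert]::FromBase64String("TlRMTVNTUA=="))
# Outputs: NTLMSSP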

Conclusion

The DebugDiag report confirms our suspicion that ASP.NET worker threads are stuck processing requests, and that the longer we wait, the more threads are blocked. Eventually w3wp signals that it's unhealthy and in a deadlocked state, and WAS ensures the process is recycled. This process repeats itself a number of times until the issue resolves itself.

We still don't know specifically what the worker threads are blocked on. DebugDiag's Thread Report provides stack traces listing function names only, indicating that the threads are awaiting the database. We didn't include those stack traces above because, given a thread ID, WinDbg is better suited for digging into that side of things.

In the next post, we'll load the dump into WinDbg and dig into the reasons for why worker thread pool threads are blocked.

Have a comment or question? Please drop me an email or tweet to @ronnieholm.