Debugging recurring web farm outages: Analyzing Windows Event Log, IIS log, and SharePoint log with PowerShell

17 Jul 2017

Part 1: Establishing context with HTTP heartbeat
Part 2: Analyzing Windows Event Log, IIS log, and SharePoint log with PowerShell
Part 3: Collecting and analyzing the memory dump with DebugDiag
Part 4: Analyzing the memory dump with WinDbg
Part 5: SQL Server heartbeat pings and lock contention analysis with built-in tools

In the previous post we discovered a pattern to the outages: servers go offline at the same time each day and come back up within a few hours, give or take. In this post, we analyze the Windows Event logs, Internet Information Services (IIS) logs, and SharePoint logs for the cause. Manually browsing the logs is a good start, but once a relevant piece of information is uncovered, we switch to PowerShell to weed through the logs for similar patterns.

Windows Event Logs

For issues with Windows and IIS, the Application and System event logs are a good starting point. While they contain a lot of noise, for each server in the farm we browse the logs between July 7, 22:45 and July 8, 07:00. The following entries originate from Web front-end server 1, with similar entries appearing across the other servers:

07-07-2017 23:21:57 - Application log
ISAPI 'C:\Windows\Microsoft.NET\Framework64\v2.0.50727\aspnet_isapi.dll'
reported itself as unhealthy for the following reason: 'Deadlock detected'.

07-07-2017 23:25:24 - System log
A process serving application pool 'SP_Intranet_AppPool' exceeded time
limits during shut down. The process id was '11004'.

A couple of minutes after a deadlock is detected, the w3wp process is restarted. With PowerShell, we can search the event logs across servers and across time for this pattern using the script below:

# Deadlock detections land in the Application log, slow worker process
# shutdowns in the System log. List the timestamps of both within the
# outage window. Add -ComputerName to query a remote server.
Get-EventLog -LogName Application -Message '*unhealthy*' `
  -After $(Get-Date -Year 2017 -Month 7 -Day 7 -Hour 22 -Minute 45) `
  -Before $(Get-Date -Year 2017 -Month 7 -Day 8 -Hour 07 -Minute 00) `
  | foreach { $_.TimeGenerated }

Get-EventLog -LogName System -Message '*exceeded time limits*' `
  -After $(Get-Date -Year 2017 -Month 7 -Day 7 -Hour 22 -Minute 45) `
  -Before $(Get-Date -Year 2017 -Month 7 -Day 8 -Hour 07 -Minute 00) `
  | foreach { $_.TimeGenerated }

Running this PowerShell script against the event logs on Web front-end server 1, we get the following list of matches:

Deadlock -> Restart  
23:21:57 -> 23:25:24
00:10:25 -> 00:13:55
01:06:04 -> 01:09:28
01:59:09 -> 02:02:34
02:21:02 -> 02:24:00
03:08:09 -> 03:11:32
03:48:04 -> 03:51:33
04:45:41 -> 04:49:06
05:29:02 -> 05:32:08
06:14:07 -> 06:17:11
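
The Deadlock -> Restart pairing above was done by eye. As a minimal sketch, assuming deadlock and restart events strictly alternate within the window, we can let PowerShell do the pairing:

# Pair each deadlock detection with the restart that follows it. Assumes
# a strict deadlock/restart alternation within the outage window.
$after  = Get-Date -Year 2017 -Month 7 -Day 7 -Hour 22 -Minute 45
$before = Get-Date -Year 2017 -Month 7 -Day 8 -Hour 07 -Minute 00
$deadlocks = @(Get-EventLog -LogName Application -Message '*unhealthy*' `
  -After $after -Before $before | Sort-Object TimeGenerated)
$restarts = @(Get-EventLog -LogName System -Message '*exceeded time limits*' `
  -After $after -Before $before | Sort-Object TimeGenerated)

'Deadlock -> Restart'
for ($i = 0; $i -lt $deadlocks.Count; $i++) {
  '{0:HH:mm:ss} -> {1:HH:mm:ss}' -f $deadlocks[$i].TimeGenerated, $restarts[$i].TimeGenerated
}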

Running the search across the servers in the farm, it turns out that all servers show a similar ordering of messages, but with different frequencies and timestamps.

An MSDN post explains how the deadlock detection and restart mechanisms work:

When a web application is busy, there may be instances when the maximum number of threads has been spawned and some of the threads are unresponsive, resulting in degraded performance. IIS has the ability to solve the problem of unresponsive threads by recycling the worker process that hosts that particular instance of the ISAPI. When threads are unresponsive, the appropriate ISAPI, i.e., ASP.dll [classic ASP] or aspnet_isapi.dll [ASP.NET on IIS 6 or [ASP.NET on] IIS 7 in classic mode] calls the ISAPI server support function HSE_REQ_REPORT_UNHEALTHY, and the WAS [Windows process Activation Service] (in IIS 7) or WWW service (in IIS 6) will recycle the worker process and make an entry in the event log.

It's actually the ServerSupportFunction which is passed the HSE_REQ_REPORT_UNHEALTHY status flag. On IIS 7, the function is isapi!SSFReportUnhealthy, i.e., a function named SSFReportUnhealthy inside the isapi module. Later on, when we collect a memory dump, one approach is to trigger the dump when isapi!SSFReportUnhealthy is called.
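
As a rough preview of that approach (the coming posts use DebugDiag and WinDbg), a breakpoint-triggered dump could look like the hypothetical sketch below, which attaches cdb.exe from the Debugging Tools for Windows to a w3wp process. The process id, cdb path, and dump path are placeholders:

# Hypothetical sketch: attach cdb.exe to a w3wp process and write a full
# memory dump the first time isapi!SSFReportUnhealthy is called. Process
# id, cdb path, and dump path are placeholders; cdb must match the
# bitness of w3wp. The \" sequences escape quotes for the native parser.
$cdb = 'C:\Program Files\Debugging Tools for Windows (x64)\cdb.exe'
$breakpoint = 'bp isapi!SSFReportUnhealthy \".dump /ma C:\dumps\w3wp.dmp; qd\"; g'
Start-Process -FilePath $cdb -ArgumentList '-p', '11004', '-c', "`"$breakpoint`""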

Recycling the w3wp process takes longer than the application pool's default shutdown time limit of 90 seconds, which is why we see the second message in the event log.
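
On IIS 7 and later, assuming the WebAdministration module is available, we can verify the configured limit by reading the application pool's process model:

# Read the application pool's shutdown time limit (defaults to 00:01:30).
# Assumes IIS 7+ with the WebAdministration PowerShell module installed.
Import-Module WebAdministration
(Get-ItemProperty 'IIS:\AppPools\SP_Intranet_AppPool' -Name processModel).shutdownTimeLimit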

SharePoint 12 hive log files

The 12 hive log files contain nothing out of the ordinary. Only a few HTTP requests show up, indicating that ASP.NET has stopped processing requests. Once in a while, the logs indicate database activity originating from SharePoint's timer job service, possibly initiated through the Server Side Object Model.
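
To back up that observation, we can search the ULS logs for timer service activity inside the outage window. A rough sketch, assuming a default SharePoint 2007 log location and that timer service entries name the OWSTIMER process:

# Rough sketch: search the 12 hive logs for timer service entries written
# during the outage window. Path assumes a default SharePoint 2007 install.
$logDir = 'C:\Program Files\Common Files\Microsoft Shared\Web Server Extensions\12\LOGS'
$after = Get-Date -Year 2017 -Month 7 -Day 7 -Hour 22 -Minute 45
Get-ChildItem $logDir -Filter *.log |
  Where-Object { $_.LastWriteTime -gt $after } |
  Select-String -SimpleMatch 'OWSTIMER' |
  Select-Object Filename, LineNumber, Line -First 25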

IIS log files

Let's correlate the HTTP heartbeat ping requests with IIS log entries. Remember that IIS receives each request first and, based on the URL, passes it along to ASP.NET. Requests for ASPX pages are processed by ASP.NET and SharePoint, while requests for static resources such as images, CSS, and JavaScript files may be served directly from the file system.

We start by tracing a few successful heartbeats. Observe that because the IIS log timestamps are in UTC, they're two hours behind the heartbeat times:

Heartbeats log
5170 09-07-2017 18:30:01 OK
5171 09-07-2017 18:30:31 OK
5172 09-07-2017 18:31:01 OK

IIS log
2017-07-09 16:30:01 127.0.0.1 GET /org/it/pages/default.aspx - 80 - 127.0.0.1 - 401 2 5 0
2017-07-09 16:30:01 127.0.0.1 GET /org/it/pages/default.aspx - 80 acme\admin 127.0.0.1 - 200 0 0 6973
2017-07-09 16:30:31 127.0.0.1 GET /org/it/pages/default.aspx - 80 - 127.0.0.1 - 401 2 5 0
2017-07-09 16:30:31 127.0.0.1 GET /org/it/pages/default.aspx - 80 acme\admin 127.0.0.1 - 200 0 0 6973
2017-07-09 16:31:01 127.0.0.1 GET /org/it/pages/default.aspx - 80 - 127.0.0.1 - 401 2 5 0
2017-07-09 16:31:01 127.0.0.1 GET /org/it/pages/default.aspx - 80 acme\admin 127.0.0.1 - 200 0 0 6973

A single request by the heartbeat tool results in two entries in the IIS log. That's because the client, the .NET WebClient class, initially requests the URL without an authentication header. IIS responds with HTTP 401: Unauthorized and a list of authentication providers configured for the site. The client picks a provider and re-sends the request, now including the NTLM authentication header, which is why the subsequent HTTP 200: OK entry includes domain and user name. This handshake is hidden away inside the .NET WebClient class, and the HTTP 401: Unauthorized is by design, not an error.
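
The heartbeat tool's request boils down to something like the sketch below; the URL is taken from the IIS log excerpt above:

# Issue a heartbeat-style request. With default credentials set, WebClient
# transparently performs the 401/200 NTLM handshake seen in the IIS log.
$client = New-Object System.Net.WebClient
$client.UseDefaultCredentials = $true
$client.DownloadString('http://127.0.0.1/org/it/pages/default.aspx') | Out-Null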

Next we trace a couple of failing requests. In the lead-up to the first timeout, the server becomes noticeably slower at processing requests. Notice the seven-second delay between the HTTP 401: Unauthorized and HTTP 200: OK responses:

Heartbeats log
5675 09-07-2017 22:46:03 OK
5676 09-07-2017 22:46:40 The operation has timed out
5677 09-07-2017 22:48:51 The operation has timed out

IIS log
2017-07-09 20:46:03 127.0.0.1 GET /org/it/pages/default.aspx - 80 - 127.0.0.1 - 401 2 5 0
2017-07-09 20:46:10 127.0.0.1 GET /org/it/pages/default.aspx - 80 acme\admin 127.0.0.1 - 200 0 0 6973
2017-07-09 20:46:40 127.0.0.1 GET /org/it/pages/default.aspx - 80 - 127.0.0.1 - 401 2 5 0
2017-07-09 20:48:51 127.0.0.1 GET /org/it/pages/default.aspx - 80 - 127.0.0.1 - 401 2 5 0

Even as heartbeat requests start to time out, IIS receives, responds to, and logs the initial HTTP 401: Unauthorized. That's because authentication for a site is an IIS setting, so the response is generated by IIS without consulting ASP.NET. The HTTP 200: OK entry, however, never shows up because ASP.NET never finishes processing the re-sent request.
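
That observation gives us one more pattern to search for: 401 entries never followed by a 200. A rough sketch, assuming the field order from the excerpts above (date and time in the first two fields, status in the eleventh) and a placeholder log path:

# Rough sketch: flag heartbeat 401 entries with no 200 within 30 seconds.
# Field positions match the excerpts above; the log path is a placeholder.
$lines = Get-Content 'C:\inetpub\logs\LogFiles\W3SVC1\u_ex170709.log' |
  Where-Object { $_ -notmatch '^#' -and $_ -match 'default\.aspx' }
$entries = $lines | foreach {
  $f = $_ -split ' '
  New-Object PSObject -Property @{
    Time   = [datetime]($f[0] + ' ' + $f[1])
    Status = $f[10]
  }
}
$oks = @($entries | Where-Object { $_.Status -eq '200' })
$entries | Where-Object { $_.Status -eq '401' } | foreach {
  $t = $_.Time
  $ok = $oks | Where-Object { $_.Time -ge $t -and $_.Time -le $t.AddSeconds(30) }
  if (-not $ok) { 'Orphaned 401 at {0:HH:mm:ss}' -f $t }
}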

Conclusion

Entries from the Windows and IIS logs indicate that ASP.NET is experiencing issues: the w3wp worker thread pool isn't processing ASP.NET requests as it ought to. From the log messages alone we can't conclude whether we're experiencing worker thread pool exhaustion (caused by COM interop or some other issue), but signs of exhaustion are present. This would explain why the Windows Process Activation Service has to restart w3wp so frequently.

In the next post, we'll take a memory dump of the w3wp process and inspect the state of its worker thread pool.