Debugging recurring web farm outages: Establishing context with HTTP heartbeat pings

16 Jul 2017

Part 1: Establishing context with HTTP heartbeat pings
Part 2: Analyzing Windows Event Log, IIS log, and SharePoint log with PowerShell
Part 3: Collecting and analyzing the memory dump with DebugDiag
Part 4: Analyzing the memory dump with WinDbg
Part 5: SQL Server heartbeat pings and lock contention analysis with build-in tools

A client has reported that their web farm running SharePoint 2007 (effectively ASP.NET, MS SQL Server, and associated services) has been going offline each night for the past couple of months. In this post we'll document the steps taken to troubleshoot the issue. As it's a fairly long process, it'll take a few posts to arrive at the root cause. We'll start by collecting precise timing data on the farm's uptime and downtime and correlate these measurements with server logs.

Collecting uptime/downtime measurements

First order of business is talking to the affected internal users. They consistently report that when browsing the intranet, Internet Explorer shows a blank page with the tab loader wheel spinning for minutes (forever). The page remains blank, indicating that the web server never responded to the client. Users also report that the farm is down every night at around 23.00, lasting for 90 to 120 minutes.

We want to collect more accurate timing to see if a pattern emerges: how long is the farm actually down? How many servers in farm are affected? How many Internet Information Services (IIS) web applications are affected? Since no monitoring system is in place, we roll our own. While not enterprise grade, is plenty sufficient for the case at hand.

Our monitoring tool employs HTTP heartbeat monitoring as a proxy for farm uptime and downtime:

using System;
using System.Net;
using System.Threading;

// Usage examples:
// .\HttpHeartbeat interval url
// .\HttpHeartbeat 5 http://dtu.dk/images/modules/socialmedia/twitter_small.png | tee dtu.txt
// .\HttpHeartbeat 30 http://intranet.acme.dk/org/it/pages/default.aspx | tee intranet.txt
// .\HttpHeartbeat 30 http://projects.acme.dk/pages/default.aspx | tee project.txt

namespace HttpHeartbeat {
    class Program {
        static void Main(string[] args) {
            // Present custom NTLM credential to the web server. Otherwise, the credential
            // with which the tool is running is used. It may not provide access to URL.
            var credential = new NetworkCredential("admin", "secret", "acme.dk");
            var cache = new CredentialCache();
            cache.Add(new Uri("http://intranet.acme.dk"), "NTLM", credential);
            cache.Add(new Uri("http://projects.acme.dk"), "NTLM", credential);

            var run = 0;
            while (true) {
                using (var wc = new WebClient()) {
                    wc.Credentials = cache;
                    wc.Proxy = WebRequest.DefaultWebProxy;
                    Console.Write($"{++run} {DateTime.Now} ");
                    try {
                        var s = wc.DownloadString(args[1]);
                        Console.WriteLine(string.IsNullOrEmpty(s) ? "Empty" : "OK");
                    } catch (Exception e) {
                        Console.WriteLine(e.Message);
                    }
                }
                Thread.Sleep(int.Parse(args[0]) * 1000);
            }
            Console.WriteLine();
        }
    }
}

Every nth second, HttpHeartbeat attempts to establish a connection with an endpoint and fetch its content. The tool is well-suited for execution in a PowerShell terminal using the tee cmdlet. It simultaneously outputs to the console and a file for later analysis.

We setup HttpHeartbeat to run across a host of machines: on a reference laptop on the local network and on key servers in the farm. These servers typically resolve intranet.acme.dk to 127.0.0.1 (by way of their hosts file). Thus, running HttpHeartbeat against intranet.acme.dk on these servers, the request is processed by the local server's web stack. That way we can determine if most or all servers are affected and rule out issues with the network between the client and farm. As for the reference laptop, its requests pass through a load balancer.

Here's the aggregated output from running the tool, showing downtime per server:

Reference laptop:
    07-07-2017 22:47:26 to 08-07-2017 06:47:47
    08-07-2017 22:47:59 to 09-07-2017 00:04:22
    09-07-2017 22:46:25 to 10-07-2017 00:19:01
    10-07-2017 22:46:43 to 11-07-2017 01:01:05
    11-07-2017 22:46:53 to 12-07-2017 00:14:06
    12-07-2017 22:46:39 to 13-07-2017 00:09:31
    13-07-2017 22:47:39 to 14-07-2017 00:31:07

Web front-end server 1: 
    07-07-2017 22:48:03 to 08-07-2017 06:47:04
    08-07-2017 22:47:25 to 09-07-2017 00:03:47
    09-07-2017 22:46:15 to 10-07-2017 00:17:48
    10-07-2017 22:46:30 to 11-07-2017 01:01:28
    11-07-2017 22:46:48 to 12-07-2017 00:13:29
    12-07-2017 22:46:32 to 13-07-2017 00:09:55
    No further heartbeats collected as user was logged out

Web front-end server 2: 
    07-07-2017 22:47:24 to 08-07-2017 06:48:35
    08-07-2017 22:47:25 to 09-07-2017 00:03:47
    No further heartbeats collected as user was logged out

Search application server (intra-farm communication: resolves intranet.acme.dk to Web front-end server 2):
    07-07-2017 22:47:55 to 08-07-2017 06:47:25
    08-07-2017 22:47:02 to 09-07-2017 00:04:56
    No further heartbeats collected as user was logged out

Management application server: 
    07-07-2017 22:47:24 to 08-07-2017 06:47:41
    08-07-2017 22:47:25 to 09-07-2017 00:03:17
    09-07-2017 22:46:40 to 10-07-2017 00:17:43
    No further heartbeats collected as user was logged out

Running HttpHeartbeat reveals an interesting pattern: the reference laptop and every server detect roughly identical uptimes and downtimes. The slight difference in time is likely due to the 30 second heartbeat interval and the time it takes to detect a failure. Across the log files, "The operation has timed out" shows up for almost every heartbeat request within the downtime period.

Next, we setup an additional heartbeat target on the reference laptop, fetching a page within another web application. We ran the heartbeat pings on the reference laptop only, assuming the results generalize to servers as well. Pinging another web application, of which the farm has 16, we want to see if the issue is isolated to a single web application or multiple/every IIS web application.

Reference laptop: 
    11-07-2017 00:55:26 to 11-07-2017 00:59:46
    12-07-2017 00:55:16 to 12-07-2017 00:59:36
    13-07-2017 00:55:36 to 13-07-2017 01:02:06

On the first night, because downtime is so brief, it's hard to tell if what we see is actual downtime. Subsequent nights, however, reveal a pattern. The projects.acme.dk web application is offline a little later than intranet.acme.dk web application, and is offline for shorter time as well.

Conclusion

When all servers in a farm go offline at about the same time, and stays down for roughly the same duration, it indicates that the issue isn't with a particular server. More likely, the downtime is due to some shared component that each server depends on in order to operate: Active Directory, MS SQL Server, SAN, and the like.

In the next post we'll dig into the Event Log and other server logs for hints as to the underlying cause.