WinDbg: Recreating .NET objects from an Azure App Service memory dump

06 Aug 2017

Part 1: WinDbg: Recreating .NET objects from an Azure App Service memory dump
Part 2: ClrMD: Recreating .NET objects from an Azure App Service memory dump

This post outlines how to use WinDbg to identify and extract .NET objects from a memory dump of an Azure App Service, that is, a web application deployed to Azure.

The memory dump is of an application which, akin to Google Analytics, records page visits. It does so by exposing endpoints called by JavaScript on each visit to a web application. Due to a bug, the analytics application has been running for three weeks without the ability to persist visits to the database. Visits have accumulated in-process and would be lost on restart. The goal is to conduct a post-mortem analysis of the memory dump and extract the visit objects for replay.

Memory dumping an Azure App Service

Through Kudu, an Azure App Service supports taking a memory dump without terminating the process. Kudu also enables navigating the App Service's file system to download sos.dll and mscordacwks.dll. At this point the dump is ready to be loaded into WinDbg, like any other dump of a .NET process.

Architectural overview of the dumped process

To know what to look for and expect inside the dump, a brief overview of the application is in order. The dump is of the Bugfree.Spo.Analytics application whose endpoints receive a JSON payload containing visit metadata on each page visit to another application.

On the server side, multiple producers and a single consumer operate on a single, shared, in-process message queue. Once a visit arrives on a worker thread, it's validated, enriched, and turned into a .NET Visit object, which is then enqueued. When the queue reaches a certain length, the consumer dequeues the Visit objects and, after a bit of processing, writes the visits to an MS SQL Azure instance:

  Visit 1   \
  Visit 2   -\
  ..         -> Thread pool producers
  ...       -/              |
  ...      -/       Process and post
  ...     -/                |
  Visit N /                 v
                Mailbox processor with queue and consumer
                            |
                    Process and save
                            |
                            v
                MS SQL Azure instance

In our case, some 423k Visit objects are stuck, waiting to be consumed.
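
The producer/consumer flow in the diagram can be sketched in a few lines. The application itself is F# built around a MailboxProcessor; the Python below is only an illustrative analogue, and the names run, batch_size, and consumer_alive are hypothetical. Batching by message count is an assumption standing in for "when the queue reaches a certain length".

```python
import queue
import threading

def run(producer_count, visits_per_producer, batch_size, consumer_alive):
    """Simulate many producers posting to one queue drained by one consumer.

    Returns (number of visits saved, number of visits still queued).
    """
    mailbox = queue.Queue()   # the single, shared, in-process queue
    saved = []                # stands in for the database

    def producer(pid):
        for seq in range(visits_per_producer):
            mailbox.put({"producer": pid, "seq": seq})

    def consumer():
        batch = []
        while True:
            message = mailbox.get()
            if message is None:           # sentinel: flush what is left and stop
                saved.extend(batch)
                return
            batch.append(message)
            if len(batch) >= batch_size:  # "process and save" one batch
                saved.extend(batch)
                batch = []

    producers = [threading.Thread(target=producer, args=(i,))
                 for i in range(producer_count)]
    for thread in producers:
        thread.start()
    for thread in producers:
        thread.join()

    if consumer_alive:
        worker = threading.Thread(target=consumer)
        worker.start()
        mailbox.put(None)
        worker.join()

    return len(saved), mailbox.qsize()

healthy = run(4, 25, 10, consumer_alive=True)
stuck = run(4, 25, 10, consumer_alive=False)
print(healthy, stuck)
```

With the consumer out of action, nothing is saved and every visit stays in the queue, which is the situation the dump captures.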

Locating Visit objects inside the mailbox processor

At this point, we'll assume that the dump is ready to be loaded into WinDbg. From the output below, we see that WinDbg locates appropriate versions of sos.dll and mscordacwks.dll on its own, rather than loading the ones we supplied. Process uptime is reported at close to 25 days. From later analysis of visits, we know that the consumer hasn't been operational for about 21 of those days:

Microsoft (R) Windows Debugger Version 10.0.15063.137 X86
Copyright (c) Microsoft Corporation. All rights reserved.


Loading Dump File [C:\AzureDump\Bugfree.Spo.Analytics.Cli-d3c510-07-25-13-08-00.dmp]
User Mini Dump File with Full Memory: Only application data is available

Symbol search path is: srv*
Executable search path is: 
Windows 8 Version 9200 UP Free x86 compatible
Product: Server, suite: TerminalServer DataCenter SingleUserTS
6.2.9200.16384 (win8_rtm.120725-1247)
Machine Name:
Debug session time: Tue Jul 25 15:07:22.000 2017 (UTC + 2:00)
System Uptime: 28 days 13:46:50.588
Process Uptime: 24 days 20:40:30.000
................................................................
...
Loading unloaded module list
..
eax=00000000 ebx=0116e610 ecx=00000000 edx=00000000 esi=0116e3a0 edi=00000001
eip=7781081c esp=0116e278 ebp=0116e3f8 iopl=0         nv up ei pl nz na pe nc
cs=0023  ss=002b  ds=002b  es=002b  fs=0053  gs=002b             efl=00000206
ntdll!NtWaitForMultipleObjects+0xc:
7781081c c21400          ret     14h

0:000> .sympath+ C:\AzureDump
Symbol search path is: srv*;C:\AzureDump
Expanded Symbol search path is: cache*;SRV*https://msdl.microsoft.com/download/symbols;c:\azuredump

************* Symbol Path validation summary **************
Response                         Time (ms)     Location
Deferred                                       srv*
OK                                             C:\AzureDump

0:000> .cordll -ve -u -l
CLRDLL: Unable to get version info for 'D:\Windows\Microsoft.NET\Framework\v4.0.30319\mscordacwks.dll', Win32 error 0n87
Automatically loaded SOS Extension
CLRDLL: Loaded DLL C:\Program Files (x86)\Windows Kits\10\Debuggers\x86\sym\mscordacwks_x86_x86_4.7.2053.00.dll\58FA6BB36e6000\mscordacwks_x86_x86_4.7.2053.00.dll
CLR DLL status: Loaded DLL C:\Program Files (x86)\Windows Kits\10\Debuggers\x86\sym\mscordacwks_x86_x86_4.7.2053.00.dll\58FA6BB36e6000\mscordacwks_x86_x86_4.7.2053.00.dll

First order of business is to locate Visit objects inside the dump. One way is filtering the heap for objects of the Visit type. The problem with this approach is that the heap may contain objects eligible for garbage collection, i.e., visits already written to the database. Instead we opt for browsing the application's source, looking for a unique object which, directly or indirectly, stores Visit objects.

Browsing the source, we observe that the producers/consumer mechanism is nicely encapsulated within the MailboxProcessor type. Searching the heap for instances of this type, a single instance shows up:

0:000> !DumpHeap -stat -type MailboxProcessor
Statistics:
      MT    Count    TotalSize Class Name
08a2db5c        1           32 Microsoft.FSharp.Control.FSharpMailboxProcessor`1[[Bugfree.Spo.Analytics.Cli.Agents+LoggerMessage, Bugfree.Spo.Analytics.Cli]]
08a2d584        1           32 Microsoft.FSharp.Control.FSharpMailboxProcessor`1[[Bugfree.Spo.Analytics.Cli.Agents+VisitorMessage, Bugfree.Spo.Analytics.Cli]]
Total 2 objects

From documentation and F# library source, it's evident that MailboxProcessor is aliased as FSharpMailboxProcessor. The "`1" is CLR notation for the arity of the generic type, i.e., its number of type arguments. Here the single type argument is the type of item queued. Inside the brackets is the concrete type argument and its assembly, with "+" being CLR notation for a nested type.

From a C# perspective, having a type called Agents with a nested type of VisitorMessage may seem like an odd design. In fact, it's a consequence of how the F# compiler maps language constructs to IL.
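
To make the notation concrete, here is a minimal sketch that splits such a name into its parts. The function name parse_clr_type_name is hypothetical, and the parser only handles type arguments that aren't themselves generic, which is all the names in this dump require.

```python
import re

def parse_clr_type_name(name):
    """Split a CLR type name into base type, generic arity, and type arguments."""
    match = re.fullmatch(r"([^`\[]+)(?:`(\d+))?(?:\[(.*)\])?", name)
    if match is None:
        raise ValueError(f"unrecognized type name: {name}")
    base, arity, args = match.groups()
    type_args = []
    # Each argument looks like "[Namespace.Outer+Nested, AssemblyName]".
    for arg_type, assembly in re.findall(r"\[([^\[\],]+),\s*([^\[\]]+)\]", args or ""):
        outer, _, nested = arg_type.partition("+")
        type_args.append({"outer": outer,
                          "nested": nested or None,
                          "assembly": assembly})
    return {"type": base, "arity": int(arity or 0), "type_args": type_args}

parsed = parse_clr_type_name(
    "Microsoft.FSharp.Control.FSharpMailboxProcessor`1"
    "[[Bugfree.Spo.Analytics.Cli.Agents+VisitorMessage, Bugfree.Spo.Analytics.Cli]]"
)
print(parsed)
```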

Using the value in the MT (Method Table) column, we can list only the objects of that exact type on the managed heap. Because the queueing mechanism is a singleton held in a static field, only one instance shows up:

0:000> !DumpHeap /d -mt 08a2d584
 Address       MT     Size
0252c384 08a2d584       32     

The address column points to the location of the object inside the process' 32-bit virtual address space. We could inspect it by dumping raw memory content, but SOS comes with a command to print an object:

0:000> !DumpObj /d 0252c384
Name:        Microsoft.FSharp.Control.FSharpMailboxProcessor`1[[Bugfree.Spo.Analytics.Cli.Agents+VisitorMessage, Bugfree.Spo.Analytics.Cli]]
MethodTable: 08a2d584
EEClass:     08a0ff30
Size:        32(0x20) bytes
File:        D:\home\site\wwwroot\FSharp.Core.dll
Fields:
      MT    Field   Offset                 Type VT     Attr    Value Name
00000000  40001ff        4                       0 instance 0252c378 initial
048a5570  4000200       18 ...CancellationToken  1 instance 0252c39c cancellationToken@2521
0855a350  4000201        8 ...Canon, mscorlib]]  0 instance 0252c3a4 mailbox
012dc8f0  4000202       10         System.Int32  1 instance       -1 defaultTimeout
012da988  4000203       14       System.Boolean  1 instance        1 started
0855a0b8  4000204        c ...ption, mscorlib]]  0 instance 0252c3f4 errorEvent
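
Since every step from here on means reading a Fields table like the one above, a small parser helps when the process is to be automated. The sketch below is an assumption about how one might do it: parse_fields is a hypothetical helper, it only handles instance field rows laid out as shown, and it relies on the Type column being the only one that can be empty or contain spaces.

```python
import re

# Field rows copied from the !DumpObj output above.
SAMPLE = """\
      MT    Field   Offset                 Type VT     Attr    Value Name
00000000  40001ff        4                       0 instance 0252c378 initial
048a5570  4000200       18 ...CancellationToken  1 instance 0252c39c cancellationToken@2521
0855a350  4000201        8 ...Canon, mscorlib]]  0 instance 0252c3a4 mailbox
012dc8f0  4000202       10         System.Int32  1 instance       -1 defaultTimeout
012da988  4000203       14       System.Boolean  1 instance        1 started
0855a0b8  4000204        c ...ption, mscorlib]]  0 instance 0252c3f4 errorEvent
"""

# A field row starts with an 8-digit hex MT followed by two more hex columns.
ROW = re.compile(r"^[0-9a-f]{8}\s+[0-9a-f]+\s+[0-9a-f]+\s")

def parse_fields(text):
    """Map field name -> column values for each row of a Fields table."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not ROW.match(line):
            continue
        tokens = line.split()
        mt, _field_token, offset = tokens[:3]
        vt, attr, value, name = tokens[-4:]
        fields[name] = {
            "mt": mt,
            "offset": int(offset, 16),          # offsets are printed in hex
            "type": " ".join(tokens[3:-4]),     # may be empty or contain spaces
            "is_value_type": vt == "1",
            "attr": attr,
            "value": value,                     # address for references, literal otherwise
        }
    return fields

fields = parse_fields(SAMPLE)
print(fields["mailbox"]["value"])
```

The value of a reference field is the address to pass to the next !DumpObj, which is exactly the pointer chasing done by hand in the rest of this post.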

The mailbox field looks promising. It's a generic reference type whose Type is oddly listed as Microsoft.FSharp.Control.Mailbox`1[[System.__Canon, mscorlib]]. System.__Canon is a CLR placeholder type related to how .NET generics are implemented under the hood, and it's replaced with an actual type at runtime as shown next:

0:000> !DumpObj /d 0252c3a4
Name:        Microsoft.FSharp.Control.Mailbox`1[[Bugfree.Spo.Analytics.Cli.Agents+VisitorMessage, Bugfree.Spo.Analytics.Cli]]
MethodTable: 0855bec8
EEClass:     08a3073c
Size:        32(0x20) bytes
File:        D:\home\site\wwwroot\FSharp.Core.dll
Fields:
      MT    Field   Offset                 Type VT     Attr    Value Name
05491110  40001f8        4 ...Canon, mscorlib]]  0 instance 00000000 inboxStore
0855b748  40001f9        8 ...Canon, mscorlib]]  0 instance 0252c3c4 arrivals
0855b748  40001fa        c ...Canon, mscorlib]]  0 instance 0252c3c4 syncRoot
0855af04  40001fb       10 ...ore]], mscorlib]]  0 instance 00000000 savedCont
082472fc  40001fc       14 ...ng.AutoResetEvent  0 instance 00000000 pulse
0855b290  40001fd       18 ...olean, mscorlib]]  0 instance 0252c3e8 waitOneNoTimeout

From these fields, it's clear that Mailbox is where shared access to the queue is synchronized. Continuing our search for Visit objects, the arrivals field looks promising. Following the pointer once again, we end up at a Queue type defined in the F# standard library -- a thin wrapper around an array:

0:000> !DumpObj /d 0252c3c4
Name:        Microsoft.FSharp.Control.Queue`1[[Bugfree.Spo.Analytics.Cli.Agents+VisitorMessage, Bugfree.Spo.Analytics.Cli]]
MethodTable: 0855bf60
EEClass:     08a310d4
Size:        24(0x18) bytes
File:        D:\home\site\wwwroot\FSharp.Core.dll
Fields:
      MT    Field   Offset                 Type VT     Attr    Value Name
048a17a8  40001db        4     System.__Canon[]  0 instance 037371e0 array
012dc8f0  40001dc        8         System.Int32  1 instance        0 head
012dc8f0  40001dd        c         System.Int32  1 instance   422813 size
012dc8f0  40001de       10         System.Int32  1 instance   422813 tail

Besides an array of objects, the Queue appears to keep track of the index of the first and last element as well as its size. The array has a capacity of 524,288, but only contains 422,813 visits:

0:000> !DumpObj /d 037371e0
Name:        Bugfree.Spo.Analytics.Cli.Agents+VisitorMessage[]
MethodTable: 0855bfcc
EEClass:     012da164
Size:        2097164(0x20000c) bytes
Array:       Rank 1, Number of elements 524288, Type CLASS (Print Array)
Fields:
None
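
The numbers reported for the queue and its array are consistent with each other, as a bit of arithmetic shows. The sketch below uses only values from the output above. Two things are assumptions rather than facts read from the dump: that the 12 leftover bytes are the array's header, method table pointer, and length, and that the Queue treats the array as a ring buffer where slot i of the queue lives at (head + i) modulo the capacity.

```python
POINTER_SIZE = 4          # bytes per reference in a 32-bit process
ARRAY_SIZE = 2097164      # "Size:" reported by !DumpObj for the array
CAPACITY = 524288         # "Number of elements"
HEAD, SIZE, TAIL = 0, 422813, 422813   # fields of the Queue object

overhead = ARRAY_SIZE - CAPACITY * POINTER_SIZE   # bytes not used by element slots
unused_slots = CAPACITY - SIZE                    # slots not holding a message
is_power_of_two = CAPACITY & (CAPACITY - 1) == 0  # hints at a doubling growth strategy

def live_slots(head, size, capacity):
    """Yield array indices holding queued messages, oldest first (ring-buffer assumption)."""
    return ((head + i) % capacity for i in range(size))

# Under the ring-buffer assumption, tail is the next free slot.
tail_matches = (HEAD + SIZE) % CAPACITY == TAIL

print(overhead, unused_slots, is_power_of_two, tail_matches)
# prints: 12 101475 True True
```

With head at 0, the live messages occupy indices 0 through 422,812, so `!DumpArray -start 0 -length 422813` would cover exactly the queued items.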

At first glance, we might have expected the array to store Visit objects directly, but that's not how a MailboxProcessor works. It supports switching on the type of each message. In C# terms, think of messages as an inheritance hierarchy with VisitorMessage as the abstract base type and each actual message type as a concrete subtype. Each type of message may carry additional state, such as the actual visit stored in a field inside the subtype.

To see this hierarchy at work, we dump the first element of the array. Its item field holds the Visit object:

0:000> !DumpArray -start 0 -length 1 /d 037371e0
Name:        Bugfree.Spo.Analytics.Cli.Agents+VisitorMessage[]
MethodTable: 0855bfcc
EEClass:     012da164
Size:        2097164(0x20000c) bytes
Array:       Rank 1, Number of elements 524288, Type CLASS
Element Methodtable: 08a2d2ac
[0] 02663808

0:000> !DumpObj /d 02663808
Name:        Bugfree.Spo.Analytics.Cli.Agents+VisitorMessage
MethodTable: 08a2d2ac
EEClass:     08a0fec8
Size:        12(0xc) bytes
File:        D:\home\site\wwwroot\Bugfree.Spo.Analytics.Cli.exe
Fields:
      MT    Field   Offset                 Type VT     Attr    Value Name
08a2d938  40000b2        4 ....Cli.Domain+Visit  0 instance 026637d0 item

0:000> !DumpObj /d 026637d0
Name:        Bugfree.Spo.Analytics.Cli.Domain+Visit
MethodTable: 08a2d938
EEClass:     0852014c
Size:        56(0x38) bytes
File:        D:\home\site\wwwroot\Bugfree.Spo.Analytics.Cli.exe
Fields:
      MT    Field   Offset                 Type VT     Attr    Value Name
048aa9a4  40000fc       1c          System.Guid  1 instance 026637ec CorrelationId@
05547ab4  40000fd       2c      System.DateTime  1 instance 026637fc Timestamp@
012dfccc  40000fe        4        System.String  0 instance 02663530 LoginName@
012dfccc  40000ff        8        System.String  0 instance 02663700 SiteCollectionUrl@
012dfccc  4000100        c        System.String  0 instance 026635b4 VisitUrl@
054b6a94  4000101       10 ...Int32, mscorlib]]  0 instance 02663774 PageLoadTime@
0641c510  4000102       14 System.Net.IPAddress  0 instance 02663780 IP@
054b5248  4000103       18 ...tring, mscorlib]]  0 instance 026637c4 UserAgent@

The "@" sign appended to each name denotes a property backing field. Thus, for every Visit object in the array, to recreate it from memory, we must dump the values of each backing field. And for any non-simple type of backing field, we must recursively dump it until we arrive at simple types.
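
The recursion described above can be sketched as follows. Everything here is illustrative: recreate is a hypothetical function, the dump callback stands in for running !DumpObj and parsing its output, and FAKE_HEAP reuses a few addresses from the output above with placeholder values, not data read from the dump.

```python
NULL = "00000000"

def recreate(address, dump, ancestors=frozenset()):
    """Rebuild a nested dict from an object graph, one dumped object at a time.

    dump(address) returns ("leaf", value) for simple types, or
    ("object", {field: ("ref", address) | ("inline", value)}) otherwise.
    """
    if address == NULL:
        return None                      # null reference
    if address in ancestors:
        raise ValueError(f"cycle detected at {address}")
    kind, payload = dump(address)
    if kind == "leaf":
        return payload
    result = {}
    for name, (how, value) in payload.items():
        key = name.rstrip("@")           # drop the backing-field marker
        if how == "inline":
            result[key] = value
        else:
            result[key] = recreate(value, dump, ancestors | {address})
    return result

# Placeholder heap: addresses borrowed from the output above, values made up.
FAKE_HEAP = {
    "026637d0": ("object", {
        "LoginName@": ("ref", "02663530"),
        "PageLoadTime@": ("ref", "02663774"),
        "UserAgent@": ("ref", NULL),
    }),
    "02663530": ("leaf", "<login name placeholder>"),
    "02663774": ("object", {"value": ("inline", 1200)}),
}

visit = recreate("026637d0", FAKE_HEAP.__getitem__)
print(visit)
```

Tracking only the ancestors of the current object, rather than every object seen, lets two fields share a referenced object while still stopping on a true cycle.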

Tracking pointers and dumping objects with WinDbg should make it clear that we're traversing a (potentially cyclic) graph of objects. In this case the objects form a tree, rooted in the singleton MailboxProcessor instance:

Bugfree.Spo.Analytics.Cli.Agents+visitor (Microsoft.FSharp.Control.FSharpMailboxProcessor)
  mailbox (Microsoft.FSharp.Control.Mailbox)
    arrivals (Microsoft.FSharp.Control.Queue)
      array (Bugfree.Spo.Analytics.Cli.Agents+VisitorMessage[])
        visit1 (Bugfree.Spo.Analytics.Cli.Agents+VisitorMessage)
          item (Bugfree.Spo.Analytics.Cli.Domain+Visit)
            CorrelationId (System.Guid)
              _a (System.Int32)
              _b (System.Int16)
              _c (System.Int16)
              _d (System.Byte)
              ...
              _k (System.Byte)
            Timestamp (System.DateTime)
              dateData (System.UInt64)
              ...
            LoginName (System.String)
            SiteCollectionUrl (System.String)
            VisitUrl (System.String)
            PageLoadTime (Microsoft.FSharp.Core.FSharpOption)
              value (System.Int32)
            IP (System.Net.IPAddress)
              m_Address (System.Int64)
              ...
            UserAgent (Microsoft.FSharp.Core.FSharpOption)
              value (System.String)
        visit2
        ...
        visitN
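
The leaves of this tree are raw fields, so turning them back into values takes a little decoding. The sketch below shows one way to do it for DateTime and Guid. It assumes the .NET Framework layouts: dateData packs the DateTime kind into the top two bits and a tick count (100 ns units since 0001-01-01) into the remaining 62, and a Guid consists of _a (Int32), _b and _c (Int16), and eight bytes _d through _k. The function names are hypothetical, and the inputs are well-known constants rather than values from the dump.

```python
import uuid
from datetime import datetime, timedelta

TICKS_MASK = 0x3FFFFFFFFFFFFFFF
KINDS = {0: "Unspecified", 1: "Utc", 2: "Local", 3: "Local"}

def decode_datetime(date_data):
    """Decode a DateTime's dateData field into (datetime, kind)."""
    date_data &= 0xFFFFFFFFFFFFFFFF            # tolerate a value printed as signed
    ticks = date_data & TICKS_MASK
    kind = KINDS[date_data >> 62]
    # Python's datetime has microsecond resolution, so the last digit of
    # the tick count (100 ns) is dropped.
    return datetime(1, 1, 1) + timedelta(microseconds=ticks // 10), kind

def decode_guid(a, b, c, rest):
    """Build a UUID from Guid fields; rest holds the eight bytes _d.._k."""
    a &= 0xFFFFFFFF                            # the debugger may print these signed
    b &= 0xFFFF
    c &= 0xFFFF
    d, e, *node = rest
    return uuid.UUID(fields=(a, b, c, d, e, int.from_bytes(bytes(node), "big")))

# 621355968000000000 ticks is the Unix epoch; bit 62 set marks the kind as UTC.
when, kind = decode_datetime(621355968000000000 | (1 << 62))
guid = decode_guid(0x01020304, 0x0506, 0x0708, [9, 10, 11, 12, 13, 14, 15, 16])
print(when, kind, guid)
```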

Each Visit object inside the queue is prevented from being garbage collected because it's indirectly rooted by the static MailboxProcessor field. Incidentally, the number of Visit objects in the array matches the number of Visit objects on the heap:

0:000> !DumpHeap -stat -type Bugfree.Spo.Analytics.Cli.Domain+Visit
Statistics:
      MT    Count    TotalSize Class Name
08a62f2c        1           16 Microsoft.FSharp.Collections.FSharpList`1[[Bugfree.Spo.Analytics.Cli.Domain+Visit, Bugfree.Spo.Analytics.Cli]]
08a2d938   422813     23677528 Bugfree.Spo.Analytics.Cli.Domain+Visit
Total 422814 objects

This implies that rather than traversing the tree, dumping Visit objects directly is simpler and yields the same result.
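
The comparison is easy to automate by parsing the statistics table. The sketch below is a hypothetical helper, parse_heap_stat, run against the output above. As a sanity check, the total size divided by the count comes out at the 56 bytes that !DumpObj reported for a single Visit.

```python
import re

# Output of "!DumpHeap -stat -type ..." copied from above.
STAT_SAMPLE = """\
Statistics:
      MT    Count    TotalSize Class Name
08a62f2c        1           16 Microsoft.FSharp.Collections.FSharpList`1[[Bugfree.Spo.Analytics.Cli.Domain+Visit, Bugfree.Spo.Analytics.Cli]]
08a2d938   422813     23677528 Bugfree.Spo.Analytics.Cli.Domain+Visit
Total 422814 objects
"""

def parse_heap_stat(text):
    """Map class name -> method table, instance count, and total size."""
    rows = {}
    for line in text.splitlines():
        match = re.match(r"^([0-9a-f]{8})\s+(\d+)\s+(\d+)\s+(.+)$", line.strip())
        if match:
            mt, count, total, name = match.groups()
            rows[name] = {"mt": mt, "count": int(count), "total_size": int(total)}
    return rows

QUEUE_SIZE = 422813   # the size field of the Queue object

stats = parse_heap_stat(STAT_SAMPLE)
visits = stats["Bugfree.Spo.Analytics.Cli.Domain+Visit"]
print(visits["count"] == QUEUE_SIZE, visits["total_size"] // visits["count"])
# prints: True 56
```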

Conclusion

While WinDbg makes graph exploration easy, it only prints simple .NET types such as String, Int32, and Double directly. Compound types, such as Guid, FSharpOption, IPAddress, and DateTime, must be dumped field by field and the text output parsed. We'd have to recursively traverse each compound type inside each one of the 422,813 Visit objects, parsing command output along the way.

Generating the batch of WinDbg commands to run against each visit and parsing all that command output is a lot of work. As an alternative, next we'll explore the Microsoft Diagnostics Runtime, or ClrMD for short, a .NET dump file API, alleviating the need for WinDbg scripting and parsing.