Application Insights Snapshot Debugger and a strange production problem

I want to share the story of a problem we had in production last week, because after a lot of searching on the internet no article led us to the solution, so I hope this post can help you if you experience the very same problem.

Situation: we deployed a new version of a web application based on ASP.NET Web API. It was a small increment (we deploy frequently), but suddenly the customer started experiencing slowness and intermittent errors.

Logging into the machine we found a really strange situation, shown in Figure 1: there are two worker processes for the application, one is running and the other one is in Suspended state, yet it is using lots of RAM.

image

Figure 1: Two worker processes on the same machine, but one is suspended

In Figure 1 the running process shows only 4 minutes of CPU time because the screenshot was taken right after a process recycle, but we had situations where the running process had been alive for more than 10 hours while the suspended process had not consumed a single second of CPU.

On the production server, out of nowhere, another worker process (w3wp.exe) starts in suspended state, begins consuming RAM and then disappears.

If you search the internet you can find lots of articles about process recycling, but that was not our situation, because the process was not being recycled. This is the usual situation on that production server:

– Our process usually consumes 1.5 GB / 2.0 GB of RAM
– The production machine has 8 GB of RAM
– We have other services on that machine, but total memory consumption is usually around 6 GB / 7 GB

Performance of the application is good; we use almost all the memory of the server, but we had never experienced problems. Now, after the last deploy, we periodically see this exact series of events in Task Manager:

– Another w3wp.exe is launched; it has the very same IIS user (jarvis) and the very same command line as the existing instance of w3wp.exe
– The process is in Suspended state, but it starts allocating RAM
– The process is a child of the original w3wp.exe process
– The web application works normally: all requests are handled by the original w3wp.exe process and the site responds quickly
– The suspended process keeps growing its RAM usage until it reaches approximately the working set of the original w3wp.exe process, then it is closed
– The original w3wp.exe process is not recycled and continues to work perfectly

The real problem is memory usage: when this sequence of events starts, the machine comes under memory pressure and everything becomes really slow. Sometimes a third suspended w3wp.exe process starts, and that brings the performance of the machine down even further.

image

Figure 2: Windows detects low memory, and the w3wp.exe process is the culprit

This is the downside of sizing a machine tightly: when something out of your control happens and some process starts using more resources, performance goes down.

If you try to kill the new suspended w3wp.exe process, its working set drops to 4 KB, but it remains suspended and cannot be killed. This is a nasty situation: performance on the production server was poor, and the only workaround was adding more RAM to the virtual machine, but we really hate wasting RAM on unnecessary tasks.

After a couple of hours of searching we were frustrated because:

1) This has nothing to do with recycling: the original process was not recycled, while every article on the internet refers to some recycling problem
2) This happens only after the last deploy, but we really did not change much code in the last sprint
3) Whenever a suspended w3wp.exe process appears, it reaches the same amount of memory as the official w3wp.exe process before disappearing; this is important because it means the new process must be related to the old one.

The only plausible explanation for this behavior is that something is creating another process in suspended state in order to take a dump of the production w3wp.exe process.

When you take a memory dump of a process, the process is frozen for the duration of the dump. Alternatively, you can spawn a new process that is a copy of the original one and dump that copy instead: with this technique the original process continues to run almost undisturbed.
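
Windows exposes this clone-and-inspect technique through the process snapshotting API (PssCaptureSnapshot). I cannot say for sure this is exactly what the tooling involved here uses internally, but the following P/Invoke sketch reproduces the behavior we observed: capturing a snapshot with the VA_CLONE flag spawns a suspended copy of the target process, and since its pages are copy-on-write the clone's memory usage grows while the original keeps working.

using System;
using System.Diagnostics;
using System.Runtime.InteropServices;

class ProcessSnapshotSketch
{
    //PSS_CAPTURE_VA_CLONE: create a suspended clone of the target process
    //whose address space is shared copy-on-write.
    const uint PSS_CAPTURE_VA_CLONE = 0x00000001;

    [DllImport("kernel32.dll", SetLastError = true)]
    static extern uint PssCaptureSnapshot(
        IntPtr processHandle,
        uint captureFlags,
        uint threadContextFlags,
        out IntPtr snapshotHandle);

    [DllImport("kernel32.dll", SetLastError = true)]
    static extern uint PssFreeSnapshot(IntPtr processHandle, IntPtr snapshotHandle);

    static void Main()
    {
        //Grab the first w3wp.exe instance (requires sufficient privileges).
        Process target = Process.GetProcessesByName("w3wp")[0];

        uint result = PssCaptureSnapshot(target.Handle, PSS_CAPTURE_VA_CLONE, 0, out IntPtr snapshot);
        if (result != 0)
        {
            Console.WriteLine($"PssCaptureSnapshot failed with Win32 error {result}");
            return;
        }

        //Task Manager now shows a second, suspended w3wp.exe child process: the clone.
        //A dumper would read or dump the clone while the original keeps serving requests.
        Console.WriteLine("Suspended clone created, press ENTER to release it");
        Console.ReadLine();

        PssFreeSnapshot(Process.GetCurrentProcess().Handle, snapshot);
    }
}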

Armed with this intuition we started searching for *.dmp files on all disks, and we immediately found a temp folder with some dumps of about 1.7 GB each. Bingo: this confirmed our hypothesis. The only explanation of the issue was that “something is creating dumps of the w3wp.exe process”, and those dumps were causing the memory pressure.

We then started searching for the possible culprit, and after some unsuccessful searches on the internet (we found tons of articles about lots of different causes, but none was applicable to our situation) we were a little bit frustrated.

This is the problem when your production server is not working well: you feel in a hurry and you make mistakes. Since we had located the .dmp files, we could simply use Sysinternals Process Monitor to see which process was reading or writing in the folder containing the dumps.

BINGO: a process called SnapshotUploader64 was using that file, and we finally understood what was happening: for some reason the web application had the Application Insights Snapshot Debugger enabled.

image

Figure 3: Sysinternals Process Monitor shows you which process is using a file and is an invaluable tool to understand what is happening on your machine.

Looking at the source code, we found that in this sprint we had updated the reference to the Application Insights library, and the NuGet package changed the Application Insights configuration file, enabling the Snapshot Debugger for the site. No one in the team realized this, because it was nothing we had discussed.

This is a general rule: whenever you update NuGet packages, always look at the config files, because sometimes package post-update actions manipulate them in a way that can cause problems.

The Snapshot Debugger is a really nice feature (you can read more here): when a snappoint is hit, the original process is forked into a new process that is immediately suspended. Reading the documentation we found that:

The snapshot is not a copy of the full heap of the app – it’s only a copy of the page table with pages set to copy-on-write. The Snapshot Debugger only makes copies of pages in your app if the page gets modified, minimizing the memory impact on your server. In total, your app will only slow down by 10-30 milliseconds when creating snapshots. As snapshots are held in-memory on your server, they do cost ~100s of kilobytes while they are active, as well as an additional commit charge. The overhead of capturing snapshots is fixed and should therefore not affect the throughput of your app regardless of the scale of your app.

But this was absolutely not true in our situation, because the forked process consumed lots of RAM. We chose not to investigate why; instead we disabled the Snapshot Debugger from the configuration and the problem went away.

It was a long morning, but our server is up and running at full speed again :)

Gian Maria.

Application Insights in real life: avoid sending garbage to your AI instance

Application Insights is a really nice Azure service that allows you to instrument your application to get a complete set of telemetry and problem diagnostics. Usually when you start using Application Insights you already have some logging in place (log4net, Serilog) and you want to integrate it with AI so that some kinds of logs (e.g. errors) are automatically sent to AI.

All these log libraries allow you to write a sink / appender to send logs to the destination of your choice, so sending all your Serilog / log4net logs to Application Insights is usually a matter of minutes.

Integrating an existing log library with Application Insights takes a few minutes of code, or just referencing some existing packages.
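
As an example, wiring Serilog to Application Insights looks roughly like the following. This is only a sketch based on the Serilog.Sinks.ApplicationInsights package: the exact overloads and converter names differ between versions of the sink, so check the documentation of the version you actually reference.

using Serilog;
using Microsoft.ApplicationInsights.Extensibility;

//Send every Serilog event to Application Insights as trace telemetry.
//TelemetryConfiguration.Active picks up the instrumentation key configured
//for the application; TelemetryConverter comes from the sink package.
Log.Logger = new LoggerConfiguration()
    .MinimumLevel.Information()
    .WriteTo.ApplicationInsights(TelemetryConfiguration.Active, TelemetryConverter.Traces)
    .CreateLogger();

Log.Information("Serilog events are now flowing to Application Insights");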

Even if this seems a perfect solution, it is not without problems; for example, the log library can end up sending too many logs to AI. In a piece of software I’m developing there is a poller with a 50 ms interval on a collection in a MongoDB database. For various reasons a bad record got into the collection, some of the clients started throwing deserialization errors and, boom, we had an exception logged every 50 ms on 4 machines, which is 288k log entries per hour. The net result was that the Application Insights instance was full of unnecessary and identical logs.

Another potential problem is logging every Mongo query (polling will flood AI with lots of identical queries with less than 1 ms execution time); this simply pollutes your AI instance with rubbish data.

Luckily, AI gives you the ability to intercept every telemetry item and filter it out before it is sent to Azure; the secret is implementing the ITelemetryProcessor interface, because it allows you to discard unwanted telemetry.

public class JarvisApplicationInsightFilter : ITelemetryProcessor
{
    //Next processor in the chain, injected by the chain builder.
    private readonly ITelemetryProcessor _next;

    //Threshold used to filter out fast Mongo dependencies.
    private readonly Int32 _mongoDependencyMinDurationInMilliseconds;

    public JarvisApplicationInsightFilter(ITelemetryProcessor next, Int32 mongoDependencyMinDurationInMilliseconds)
    {
        _next = next;
        _mongoDependencyMinDurationInMilliseconds = mongoDependencyMinDurationInMilliseconds;
    }

    public void Process(ITelemetry item)
    {
        if (item is DependencyTelemetry dependencyTelemetry)
        {
            if (!OkToSend(dependencyTelemetry))
            {
                return;
            }
        }
        else if (item is ExceptionTelemetry exceptionTelemetry)
        {
            if (!OkToSend(exceptionTelemetry))
            {
                return;
            }
        }
        else if (item is EventTelemetry eventTelemetry)
        {
            if (!OkToSend(eventTelemetry))
            {
                return;
            }
        }
        //The telemetry passed every filter: forward it to the next processor in the chain.
        _next.Process(item);
    }

As you can see, implementing the Process method leaves you complete freedom in filtering telemetry: if you do not want an item to be sent to AI, simply do not call _next.Process.

Application Insights is really extensible and has tons of extensibility points to customize the information it sends. In this article I’m dealing with ITelemetryProcessor, which allows you to manipulate and optionally completely discard a telemetry item before it is sent to the server.

Once you have created your telemetry processor you simply need to add it to the default chain. For this example I have a simple component that does a couple of things: it avoids flooding AI with identical errors and it avoids tracking any Mongo dependency that lasts less than a certain number of milliseconds.

var builder = TelemetryConfiguration.Active.TelemetryProcessorChainBuilder;

//MongoDependencyMinDurationInMilliseconds is a configuration value (string) read elsewhere;
//fall back to 50 ms if it is missing or not a valid integer.
Int32 limitQueryLogInMs = 50;
if (Int32.TryParse(MongoDependencyMinDurationInMilliseconds ?? String.Empty, out Int32 limitDuration))
{
    limitQueryLogInMs = limitDuration;
}

//Register the custom filter and rebuild the processor chain.
builder.Use((next) => new JarvisApplicationInsightFilter(next, limitQueryLogInMs));
builder.Build();

This solution was developed because in this project the customer is starting to use AI mainly to monitor for problems, so we are interested only in slow Mongo dependencies.

The logic inside OkToSend is 100% custom for your needs. In this example we want to filter out any dependency that contains certain strings in its data, and also any dependency whose Url custom property contains certain strings. The first two lines ensure that any failed dependency is sent to AI without any further filtering (we do not want to miss a failed dependency just because it falls under the duration filter).

private bool OkToSend(DependencyTelemetry dependencyTelemetry)
{
    //always log dependencies that failed
    if (dependencyTelemetry.Success != true)
        return true;

    //filter out dependencies whose data contains forbidden strings
    if (_forbiddenTelemetryStrings.Any(u => dependencyTelemetry.Data.IndexOf(u, StringComparison.OrdinalIgnoreCase) >= 0))
    {
        return false;
    }

    //filter out dependencies whose Url custom property contains forbidden strings
    if (dependencyTelemetry.Properties.TryGetValue("Url", out var url))
    {
        if (_forbiddenUrl.Any(u => url.IndexOf(u, StringComparison.OrdinalIgnoreCase) >= 0))
        {
            return false;
        }
    }

    //everything else is ok to send
    return true;
}
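
The duration-based filter for Mongo dependencies mentioned earlier is not shown in the snippet above; a minimal sketch of how it could be added to OkToSend, assuming the threshold was saved in a field by the constructor and that Mongo dependencies are tracked with a "MongoDB" dependency type (both assumptions of mine, not the original code), is:

//Hypothetical addition to OkToSend(DependencyTelemetry): discard successful
//Mongo dependencies that completed faster than the configured threshold.
if ("MongoDB".Equals(dependencyTelemetry.Type, StringComparison.OrdinalIgnoreCase)
    && dependencyTelemetry.Duration < TimeSpan.FromMilliseconds(_mongoDependencyMinDurationInMilliseconds))
{
    return false;
}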

The second feature of this processor is to avoid flooding AI when exception traces go south. The goal is to avoid too many identical exceptions in a short time range; if some part of the application is experiencing errors, we want at most one exception trace in a given interval. An error is a symptom that something is wrong: if the DB is not reachable and I’m polling every 50 ms, an exception every 5 minutes is perfectly fine and far better than an exception every 50 ms.

It is always good practice to limit the number of logs you send to Application Insights, to avoid using too much bandwidth and cluttering the server with garbage.

For this reason we have a simple antiflood component that monitors how often we are sending telemetry identified by a given string that uniquely identifies the call. As an example, if I configure my antiflood with a 5 minute interval, whenever I call the antiflood.ShouldPass method with a given string key it returns true no more than once every 5 minutes for the same string.
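
The Antiflooder is custom code from the project and its implementation is not part of this post; just to make the idea concrete, a minimal (hypothetical) sketch of ShouldPass based on a dictionary of timestamps could look like this:

using System;
using System.Collections.Concurrent;

public class SimpleAntiflooder
{
    private readonly TimeSpan _interval;

    //Last time each key was allowed to pass.
    private readonly ConcurrentDictionary<String, DateTime> _lastPassed =
        new ConcurrentDictionary<String, DateTime>();

    public SimpleAntiflooder(TimeSpan interval)
    {
        _interval = interval;
    }

    //Returns true at most once per interval for every distinct key.
    public bool ShouldPass(String key)
    {
        var now = DateTime.UtcNow;
        bool pass = false;
        _lastPassed.AddOrUpdate(
            key,
            _ => { pass = true; return now; },
            (_, last) =>
            {
                if (now - last >= _interval)
                {
                    pass = true;
                    return now;
                }
                return last;
            });
        return pass;
    }
}

The real component does a little more than this: it also counts the calls it discards for each key and raises an event when an entry is evicted, which is what the Antiflooder_EventEvicted handler shown at the end of the post relies on.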

To build a unique string for an exception I use three properties: Loggername (the name of the class that is logging), ServiceName (the name of the logical microservice that is logging) and finally User (if present, the name of the user that requested a command, called an API, etc.). The goal is to avoid flooding the server while also avoiding missing some important exception.

private bool OkToSend(ExceptionTelemetry exceptionTelemetry)
{
    //Some exceptions can be rogue (e.g. pollers): they tend to flood the log with useless exceptions.
    String exceptionKey = "EXCEPTION/";
    var propertyNames = new[] { "Loggername", "ServiceName", "User" };
    foreach (var propertyName in propertyNames)
    {
        if (exceptionTelemetry.Properties.TryGetValue(propertyName, out var propertyValue))
        {
            //a single logger can go rogue.
            exceptionKey += propertyValue + "/";
        }
    }

    exceptionKey += exceptionTelemetry.Message.Substring(0, Math.Min(exceptionTelemetry.Message.Length, 50));
    return _antiflooder.ShouldPass(exceptionKey);
}

If a single poller goes wild, I get no more than one exception every 5 minutes, but if the entire DB is down I get more exceptions, because every 5 minutes I can have an exception for each logger, service and unique user. Usually this leads to hundreds of exceptions per minute, but that is a clear indication that something catastrophic is happening (the DB is down and every single component that depends on it is failing).

Finally, the antiflooder keeps a count of the number of discarded exceptions, and a timer evicts each string after the threshold time has passed. This allows me to log how many exceptions were really generated during the antiflood interval.

private void Antiflooder_EventEvicted(object sender, Antiflooder.AntiFlooderEntryEvictedEventArgs e)
{
    if (e.Count > 1)
    {
        TraceTelemetry traceTelemetry = new TraceTelemetry();
        traceTelemetry.SeverityLevel = e.Entry.StartsWith("EXCEPTION") ? SeverityLevel.Error : SeverityLevel.Information;
        traceTelemetry.Message = $"Antiflood got N°{e.Count} occurrences of {e.Entry} in an interval of {e.GetElapsedSeconds()} seconds";
        ApplicationInsightConfiguration.Instance.GetTelemetryClient().TrackTrace(traceTelemetry);
    }
}

If the count is greater than one, it means that the antiflood discarded some logs, but I still want to trace how many error logs were really generated.

With a few lines of code I was able to stop my application from sending tons of unnecessary traces (dependencies and errors) to the AI instance, keeping only the interesting ones.

I was really amazed by the flexibility of the Application Insights library, because it gives me full control of the telemetry pipeline with a few lines of code. This example was made with C#, but you can obtain the same result with the SDKs for other technologies.

Gian Maria

Application Insights installation

I’ve already covered the installation of the Monitoring Agent for Visual Studio Online Application Insights. The service is currently in preview and the setup experience has changed from the first version. As of today, during the installation phase setup asks you the type of agent you need, but it does not ask whether you want to automatically monitor all of your web applications. After the installation you can configure Microsoft Monitoring Agent from the Control Panel.

image

Figure 1: Configuration of MMA

In this configuration panel you should insert the Account ID and Instrumentation Key that you find in your Visual Studio Online account. To find these values, simply log in to VSO, go to your Application Insights hub, then press the little “gear” icon in the upper right to configure Application Insights. In there you will find a dedicated section called Keys & Downloads.

image

Figure 2: Configuration page with all needed data for Microsoft Monitoring Agent.

Once MMA is configured correctly, you can start monitoring a web site with this simple command.

 Start-WebApplicationMonitoring -Name WebSiteName -Cloud

This simple command creates a file called ApplicationInsights.config in the web site folder with default values that enable monitoring of the application. Actually this is not the preferred way to do it: you should instead install the appropriate Visual Studio add-in, which allows you to add the telemetry configuration to your site directly from Visual Studio.

image

Figure 3: Add Application Insights telemetry to your projects.

Whether you use the Visual Studio add-in or you add the application via the Start-WebApplicationMonitoring cmdlet, I suggest changing the name of the application immediately. This can be done in the ApplicationInsights.config file that is located in the site root.

image

Figure 4: Change the component Name of the application

This is useful because I do not like to see my application in Application Insights with the same name as the site in IIS. I like to monitor not only the production application: it can be useful to also enable monitoring of Test and Pre-Production deployments, and in that scenario the ability to change the name I see in Application Insights is key. As you can see in the above picture, I changed the name to AzureVM – TailspinToys (in IIS the web site is simply named TailspinToys). After a couple of minutes Application Insights starts receiving data.

image

Figure 5: Data starts flowing to Application Insights.

Gian Maria.

Application Insights on Visual Studio Online

Continuous deployment to Test and Pre-Production servers has always been a topic that fascinates me. The ability to have the latest version of the code up and running in a Test environment is a must for every development team. I’ve written some articles in the past on how to obtain such a result for Web Sites and TFS.

Now that you are able to automatically deploy a Web Site, you need a way to monitor it and verify what is happening during the testing and beta phases. Thanks to Visual Studio Online Application Insights you can get a lot of information about your application with very little effort. This kind of information is most useful for production servers, but it can be really useful during the testing phase as well.

image

Figure 1: Application Insights in Visual Studio Online

You can have various levels of instrumentation; the simplest one is what is called Unobtrusive Monitoring, which gives you data without the need to modify your project. All you need to do is open the application hub and add a new application.

image

Figure 2: Adding a new application to Application Insights

You only need to download Microsoft Monitoring Agent and install it on the machines where you deployed your Web Application. The only information the installer asks for is the account id and instrumentation key (this information is found in the VSO settings page). During the installation phase you can choose to instrument all of the applications that reside in the local IIS, but I strongly suggest not using that option and not starting to monitor all applications.

image

Figure 3: Avoid automatically starting to monitor all web applications.

The reason is that I prefer to control which applications to monitor, and I also want to give a friendly name to monitored applications, because once configured you cannot change the name. After the Monitoring Agent is installed you can simply open the Monitoring Agent PowerShell prompt to configure the sites you want to monitor.

image

Figure 4: Open the MMA Powershell Prompt to manually configure monitoring for your applications

If you want to monitor a site called TailspinToys and you have several copies of it deployed on various Test web servers, the ability to choose the name of the application is a key point. If you choose to automatically monitor all web applications in the local IIS, you end up with a single TailspinToys application with multiple machines monitored. If instead you manually start monitoring applications, you can configure the name of each one. Suppose you are on a machine called WebTest1; you can start monitoring the TailspinToys application with this command.

Start-WebApplicationMonitoring "TailspinToys" -Mode Monitor -OutputChannel Cloud -DisplayName "Tailspin Toys on Web Test 1"

DisplayName is the parameter that lets me distinguish between applications and constitutes the identity of the application in the Visual Studio Online Application Insights hub. Once monitoring has started, you can navigate the site to generate some data, wait 5-10 minutes for the server to start the analysis, and you should be able to see your new application in the application hub.

image

Figure 5: New application is now available in the list of monitored applications

The really good aspect of Application Insights is that you just install an agent, configure applications with a simple PowerShell command, and you start collecting data. The effort is really minimal. In future posts I’ll show you what kind of data you can expect from this type of unobtrusive monitoring.

Gian Maria.