Application Insights Snapshot Debugger and a strange production problem

I want to share the story of a problem we had in production last week, because after lots of internet searching no article led us to the solution, so I hope this post can help you if you experience the very same problem.

Situation: we deployed a new version of a web application based on ASP.NET Web API. It was a small increment (we deploy frequently), but suddenly the customer started experiencing slowness and intermittent errors.

Logging into the machine we found a really strange situation: as shown in Figure 1, there are two worker processes for the application, one running and the other in Suspended state, yet using lots of RAM.

image

Figure 1: Two worker processes on the same machine, but one is suspended

In Figure 1 the running process shows only 4 minutes of CPU time because the screenshot was taken after a process recycle, but we had situations where the running process had been alive for more than 10 hours while the suspended process had not consumed a single second of CPU.

On the production server, suddenly, another worker process (w3wp.exe) would start in suspended state, begin consuming RAM, and then disappear.

If you search the internet you can find lots of articles about process recycling, but that is not our situation, because the process does not get recycled. This is the usual situation on that production server:

– Our process usually consumes 1.5 GB / 2.0 GB of RAM
– The production machine has 8 GB of RAM
– We have other services on that machine, but total memory consumption is usually around 6 GB / 7 GB

Performance of the application is good; we use almost all the memory of the server, but we never experienced problems. Now, after the last deploy, we periodically see this exact series of events in Task Manager:

– Another w3wp.exe is launched, with the very same IIS user (jarvis) and the very same command line as the existing instance of w3wp.exe
– That process is in Suspended state, but it starts allocating RAM
– That process is a child of the original w3wp.exe process
– The web application works normally: all requests are handled by the original w3wp.exe process and the site responds quickly
– The suspended process keeps growing in RAM usage until it reaches approximately the working set of the original w3wp.exe process, then it is closed
– The original w3wp.exe process is not recycled and continues to work perfectly

The real problem is memory usage: when this sequence of events starts, the machine experiences memory pressure and everything becomes really slow; sometimes a third suspended w3wp.exe process starts and this brings down the performance of the machine.

image

Figure 2: Windows detects low memory, and a w3wp.exe process is the problem

This is the problem when you have sized the machine carefully: when something outside your control happens and some process starts using more resources, performance goes down.

If you try to kill the new suspended w3wp.exe process, its working set drops to 4 KB, but it remains suspended and cannot be killed. This is a nasty situation: performance of the production server was poor, and the only workaround was adding more RAM to the virtual machine, but we really hate wasting RAM on an unnecessary task.

After a couple of hours of searching we were frustrated because:

1) This has nothing to do with recycling: the original process was not recycled, while every article on the internet refers to some recycling problem
2) This happens only after the last deploy, but we really did not change much code in the last sprint
3) Whenever a suspended w3wp.exe process appears, it reaches the same amount of memory as the official w3wp.exe process before disappearing; this is important because it means the new process must be related to the old one.

The only possible explanation for this behavior is that something is creating another process in a suspended state in order to take a dump of the production w3wp.exe process.

When you take a memory dump of a process, it freezes for the duration of the dump; but you can also spawn a new process, copy the memory into it, and then dump that new process. With this technique the original process continues to run.

Armed with this intuition we started searching for *.dmp files on all disks, and we immediately found a temp folder with some dumps of about 1.7 GB each. Bingo: this confirmed our hypothesis, something was creating dumps, and this was creating too much memory pressure.

Carefully analyzing what happened, the only explanation of the issue is "something is creating dumps of the w3wp.exe process".

We then started searching for the possible culprit and, after some unsuccessful searches on the internet (we found tons of articles covering lots of causes, but none applicable to our situation), we were a little bit frustrated.

This is the problem when your production server does not work well: you feel in a hurry and you make mistakes. Since we had located the .dmp files, we could simply use Sysinternals Process Monitor to verify which process was reading or writing in the folder containing the dumps.

BINGO: a process called SnapshotUploader64 was using that file, and we finally understood what had happened; for some reason the web application had the Application Insights Snapshot Debugger enabled.

image

Figure 3: Sysinternals Process Monitor shows you which process is using a file and is an invaluable tool to understand what is happening on your machine.

Looking at the source code we found that in this sprint we updated the reference to the Application Insights library, and the NuGet package changed the Application Insights configuration file, enabling the Snapshot Debugger for the site; no one on the team realized this, because it was nothing we had discussed.

This is a general rule: whenever you update NuGet packages, always look at config files, because sometimes package post-update actions manipulate them in a way that can cause problems.

The Snapshot Debugger is a really nice feature (you can read more here): it uses snappoints, where the original process is forked into a new process that is immediately suspended. Reading the documentation we found that:

The snapshot is not a copy of the full heap of the app – it’s only a copy of the page table with pages set to copy-on-write. The Snapshot Debugger only makes copies of pages in your app if the page gets modified, minimizing the memory impact on your server. In total, your app will only slow down by 10-30 milliseconds when creating snapshots. As snapshots are held in-memory on your server, they do cost ~100s of kilobytes while they are active, as well as an additional commit charge. The overhead of capturing snapshots is fixed and should therefore not affect the throughput of your app regardless of the scale of your app.

But this was absolutely not true in our situation, because the forked process consumed lots of RAM. We chose not to investigate why; instead we disabled the Snapshot Debugger from the web config and the problem went away.
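For reference, the Snapshot Collector is registered in the Application Insights configuration as a telemetry processor. The snippet below is a generic sketch based on the Microsoft.ApplicationInsights.SnapshotCollector package documentation, not our exact file; setting IsEnabled to false, or removing the whole entry, is enough to turn the feature off.

<TelemetryProcessors>
  <Add Type="Microsoft.ApplicationInsights.SnapshotCollector.SnapshotCollectorTelemetryProcessor, Microsoft.ApplicationInsights.SnapshotCollector">
    <!-- set to false (or remove the whole Add element) to disable the Snapshot Debugger -->
    <IsEnabled>false</IsEnabled>
  </Add>
</TelemetryProcessors>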

It was a long morning, but our server is up and running at full speed again :)

Gian Maria.

Load Security Alerts in Azure Application Insights

Security is one of the big problems of the modern web: business has moved to web applications, and security has become an important part of application development. One side of the problem is adopting standard and proven procedures to avoid the risk of basic attacks like SQL or NoSQL injection, but a big part of security is implemented at the application level.

If you are using a SPA with some client framework like Angular and have business logic exposed through an API (e.g. ASP.NET Web API), you cannot trust the client, thus you need to validate every server call for authorization and authentication.

When an API layer is exposed to a web application it is one of the preferred attack surfaces of your application. Never, ever trust your UI logic to protect you from malicious calls.

One of the most typical attacks is forging calls to your API layer, a task that can be automated with tools like Burp Suite or a WAPT, and it is a major source of problems. As an example, if you have some administrative page where you grant or deny claims to users, it is imperative that every call is accepted only if the authenticated user is an administrator of the system. Application logic can become complicated; as an example, you might have access to a document because it is linked to an offer made in France and you are the Area Manager for France. The more complex the security logic, the more you should test it, but you probably cannot cover 100% of the situations.

In such a scenario it is imperative that every unauthorized access is logged and visible to administrators, because if you see a spike in forged requests you should investigate immediately: your system is probably under attack.

Proactive security is a better approach: you should be warned immediately if something suspicious is happening in your application.

Once you determine that there was a problem in your app you can raise a SecurityAlertLog, but then the question is: where should this log be stored? How can I visualize all the SecurityAlerts generated by the system? One possible solution is using Azure Application Insights to collect all alerts and use it to monitor the trend and status of security logs.

As an example, in our web application, when the application logic determines that a request has a security problem (e.g. a user trying to access a resource he has no access to), a typical HTTP response code 403 (Forbidden) is returned. The Angular application usually prevents such invalid calls, so whenever we find a 403 it could be a UI bug (the Angular application incorrectly requested a resource the current user has no access to) or it could be some malicious attempt to access a resource. Both situations are quite problematic from a security perspective, so we want to be able to log them separately from the application log.

Security problems should not simply be logged with the standard log infrastructure; they should generate a more visible and centralized alert.

With Web API a possible solution is creating an ActionFilterAttribute capable of intercepting every Web API call; it depends on a list of ISecurityAlertLogService, a custom interface that represents a component capable of logging a security alert. With such a configuration we can let our Inversion of Control mechanism scan all implementations of ISecurityAlertLogService available in the system. As an example, we log security exceptions in a specialized collection in a Mongo database and on Azure Application Insights.

image

Figure 1: Declaration of a Web API filter capable of intercepting every call and sending security logs to a list of providers.

An ActionFilterAttribute is a simple class capable of inspecting every Web API call. In this scenario I'm interested in inspecting what happened when the request ends.

image

Figure 2: The filter inspects each call upon completion and can verify whether the return value is 403.

Thanks to the filter we can simply verify if the return code is 403 or if the call generated an exception and that exception is a SecurityException; if the check is positive, someone is tampering with requests or the UI has a security bug, so a SecurityLogAlert is generated and passed to every ISecurityAlertLogService.
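The real filter is only shown as an image above, so here is a minimal sketch of the idea; every name (interface, DTO, filter) is hypothetical and this is not the actual production code, just the shape of it.

using System.Collections.Generic;
using System.Net;
using System.Security;
using System.Web.Http.Filters;

//Hypothetical interface: a component capable of logging a security alert.
public interface ISecurityAlertLogService
{
    void LogAlert(SecurityAlertMessage alert);
}

//Hypothetical DTO describing the alert.
public class SecurityAlertMessage
{
    public string Type { get; set; }        //e.g. ApiCall, CommandCall...
    public string User { get; set; }
    public string IpAddress { get; set; }
    public string Description { get; set; }
}

//Filter registered globally so it inspects every Web API call.
public class SecurityAlertFilterAttribute : ActionFilterAttribute
{
    private readonly IEnumerable<ISecurityAlertLogService> _alertLogServices;

    //the IoC container resolves every ISecurityAlertLogService implementation
    public SecurityAlertFilterAttribute(IEnumerable<ISecurityAlertLogService> alertLogServices)
    {
        _alertLogServices = alertLogServices;
    }

    public override void OnActionExecuted(HttpActionExecutedContext context)
    {
        bool forbidden = context.Response != null &&
                         context.Response.StatusCode == HttpStatusCode.Forbidden;
        bool securityException = context.Exception is SecurityException;

        if (forbidden || securityException)
        {
            var alert = new SecurityAlertMessage
            {
                Type = "ApiCall",
                User = context.ActionContext.RequestContext.Principal?.Identity?.Name,
                Description = context.Request.RequestUri.ToString()
            };

            foreach (var service in _alertLogServices)
            {
                service.LogAlert(alert);
            }
        }

        base.OnActionExecuted(context);
    }
}

The filter instance is then registered globally (e.g. with config.Filters.Add(...) at startup) so that every controller action goes through it.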

Thanks to Azure Application Insights we can track the alert with very few lines of code.

image

Figure 3: Tracking to Application Insights really takes only a few lines of code.

We are actually issuing two distinct telemetry events. The first one is a custom event and contains very little data: the most important part is the name, equal to the string SecurityProblem concatenated with the Type property of our AlertMessage. That property contains the area of the problem, like ApiCall or CommandCall, etc. We also add three custom properties. Of the three, the "Type" property is important because it helps us separate SecurityAlertMessage events from all the other events generated by the system; user and IP address can be used to further filter security alerts. The reason why we use so little information is that an Event is a telemetry object that cannot contain much data; it is meant to simply track that something happened in the system.

Then we add a Trace telemetry event, because a trace can contain much more information; basically we trace the same information as the Event, but the message is the serialization of the entire AlertMessage object, which can contain lots of information and thus is not suited for a custom event (if the custom event is too big it will be discarded).

With this technique we are sure that no SecurityAlert custom event will be discarded due to length (we are tracking minimal information), but we also have the full information with TrackTrace. If a trace is too big to be logged we will miss it, but we will never miss an event.
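Again, the real code is only visible in Figure 3, but the Application Insights side essentially boils down to one TrackEvent and one TrackTrace call. Here is a hedged sketch that reuses the hypothetical types of the previous snippet; Newtonsoft.Json is assumed for the serialization.

using System.Collections.Generic;
using Microsoft.ApplicationInsights;
using Microsoft.ApplicationInsights.DataContracts;
using Newtonsoft.Json;

//Hypothetical Application Insights implementation of ISecurityAlertLogService.
public class ApplicationInsightsSecurityAlertLogService : ISecurityAlertLogService
{
    private readonly TelemetryClient _client;

    public ApplicationInsightsSecurityAlertLogService(TelemetryClient client)
    {
        _client = client;
    }

    public void LogAlert(SecurityAlertMessage alert)
    {
        var properties = new Dictionary<string, string>
        {
            //"Type" lets us isolate SecurityAlertMessage telemetry from everything else
            ["Type"] = "SecurityAlertMessage",
            ["User"] = alert.User,
            ["IpAddress"] = alert.IpAddress,
        };

        //small custom event: name only, plus the three custom properties
        _client.TrackEvent("SecurityProblem" + alert.Type, properties);

        //richer trace: the full serialized alert, with the same properties
        _client.TrackTrace(
            JsonConvert.SerializeObject(alert),
            SeverityLevel.Warning,
            properties);
    }
}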

image

Figure 4: Telemetry information as seen from the Visual Studio Application Insights query window.

As you can see, I can filter for Type : SecurityAlertMessage to inspect only events and traces related to security; I have my events and I can immediately see the user that generated them.

Centralizing the logs of multiple installations is a key point in security: the more log sources you need to monitor, the higher the probability that you miss some important information.

An interesting aspect is that when the software initializes the TelemetryClient it adds the name of the customer/installation, so we can have a single point where the logs of every customer are saved. From the telemetry we can filter by customer or immediately understand where the log was generated.
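A simple way to obtain this is a custom ITelemetryInitializer that stamps every telemetry item with the installation name; the sketch below is only an illustration, and the "Customer" property name is just an example.

using Microsoft.ApplicationInsights.Channel;
using Microsoft.ApplicationInsights.Extensibility;

//Hypothetical initializer: adds the customer/installation name to every telemetry item.
public class CustomerTelemetryInitializer : ITelemetryInitializer
{
    private readonly string _customerName;

    public CustomerTelemetryInitializer(string customerName)
    {
        _customerName = customerName;
    }

    public void Initialize(ITelemetry telemetry)
    {
        if (!telemetry.Context.Properties.ContainsKey("Customer"))
        {
            telemetry.Context.Properties["Customer"] = _customerName;
        }
    }
}

//registered once at startup, e.g.:
//TelemetryConfiguration.Active.TelemetryInitializers.Add(new CustomerTelemetryInitializer("Contoso"));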

image

Figure 5: A customer property allows us to send all telemetry events from multiple installations to a single Application Insights instance.

With a few lines of code we now have a centralized place where we can check the security alerts of all our installations.

Gian Maria Ricci

Exclude Folders from SonarQube analysis

Creating a build that performs a SonarQube analysis on VSTS / TFS is a really simple task, thanks to the two build tasks that are available out of the box.

image

Figure 1: Build that uses the SonarQube tasks to perform the analysis

The problem, in a project that has been alive for more than a couple of years, is that you usually get a really bad report from your first analysis. This happens because, without constant analysis, the code surely has some smells.

Sometimes you get really discouraged because the number of issues is really high, but before losing any hope, check whether the errors are really part of your code. In a project where I'm working we got really bad numbers, and I was 100% sure it was not a problem with our code.

When you analyze your project for the first time, sometimes the number of issues is so high that you really get discouraged. Before giving up, check if the errors are really part of your code.

To diagnose the problem, simply open the project, then go to the Code view (something like http://build:9000/code/?id=projectName); you will see a summary of all the bugs, but unfortunately you cannot sort by number of bugs, so just scroll down to find the parts of the code with the most errors.

image

Figure 2: 185 bugs are located in the scripts folder

In this situation we have 185 bugs flagged by the JavaScript analyzer under the scripts folder, but that folder contains our code as well as third-party code. This is not the very first analysis: the first analysis showed a folder with more than 3k errors, and it was the folder storing all the third-party JavaScript libraries.

If you do not use npm it is quite normal to have third-party JavaScript code in your source code, and if you are using SonarQube you absolutely need to configure it to exclude those directories. Just go to the administration page of the project: in the Analysis Scope section you find the Source File Exclusions setting, which allows you to exclude some directories from the analysis.

image

Figure 3: Exclude folders from the analysis

In our situation the vast majority of errors came from the Angular library, from all the scripts of the skin we use, and from third-party libraries stored under the /app/scripts/lib folder. After the exclusion, the number of bugs dropped from almost 7k to 500.
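For reference, the same exclusions can also be expressed with the sonar.exclusions property; the pattern below is just an example matching the folder mentioned above, adapt it to wherever your third-party code actually lives.

# sonar-project.properties (or the Source File Exclusions field in the UI)
sonar.exclusions=app/scripts/lib/**/*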

If you add SonarQube analysis to an existing project that has third-party JavaScript code in the source code repository, always check where the errors are and exclude folders accordingly.

Gian Maria.

Sonar Upgrade from 5.1 to 5.3

I really like SonarQube, but I have to admit that the upgrade procedure is not the best experience I've had. Upgrading from 5.1 to 5.3 should be a simple task, but I had a couple of problems.

I followed the official upgrade instructions (http://docs.sonarqube.org/display/SONAR/Upgrading), but the first problem was my connection string to SQL Server, which needed to be changed. The original connection string was:

sonar.jdbc.url=jdbc:jtds:sqlserver://localhost/Sonar;SelectMethod=Cursor;instance=sqlexpress

But it seems that with the new version of SonarQube it needs to be changed, because jTDS is no longer used. The string was changed to:

sonar.jdbc.url=jdbc:sqlserver://localhost;databaseName=Sonar;instance=sqlexpress

With that change the service was able to connect to my database. Then I started the upgrade procedure, but I got an error; here is what I found in the log:

Fail to execute DROP INDEX unique_schema_migrations ON schema_migrations

Luckily, another person had hit this error and I was able to find the solution in a StackOverflow post that explained how to create the missing index. I then restarted the service, re-executed the migration procedure, and everything now runs fine.
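Essentially the idea is to recreate the unique index so that the migration can then drop it; something along these lines (the version column name is an assumption based on how the schema_migrations table is usually shaped, so double check it against your database):

CREATE UNIQUE INDEX unique_schema_migrations ON schema_migrations (version);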

I cannot avoid comparing this experience to an upgrade of TFS on-premises, which is a fully automated procedure, little more than next, next, next, and it is a far more complex product than SonarQube. :)

Gian Maria

BufferingAppenderSkeleton performance problem in log4net

The problem

Today I was working on the Jarvis project and I got a warning from another developer telling me that the MongoDbAppender for log4net slows down the application during a full rebuild of all projections. In that specific situation we have the log level set to DEBUG and the rebuild generates 800k log entries in the Mongo database, so I was expecting the Mongo appender to slow things down a little. The problem is: using a standard FileAppender to verify the difference in speed, it turned out that the FileAppender was much faster than the MongoDbAppender.

Setting up a simple load test with an NUnit test confirmed this: logging 10k messages on my machine requires 2000 ms with the MongoDbAppender but only 97 ms with a FileAppender. After some profiling with ANTS Profiler I was puzzled, because the time was not spent in my code. The next step was creating a test appender that inherits from BufferingAppenderSkeleton but does nothing (it does not log anything); to my surprise, it took almost the same time as my MongoDbAppender. Where is the bottleneck?
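The load test is nothing fancy; the sketch below gives the idea (it is not the original test code, and it assumes BufferingAppenderSkeleton.Flush() to force pending events to be written).

using System;
using System.Diagnostics;
using System.Linq;
using log4net;
using log4net.Appender;
using NUnit.Framework;

[TestFixture]
public class AppenderTimingTests
{
    private static readonly ILog Log = LogManager.GetLogger(typeof(AppenderTimingTests));

    [Test]
    public void Log_10k_messages_and_measure()
    {
        var watch = Stopwatch.StartNew();

        for (int i = 0; i < 10000; i++)
        {
            Log.Debug("Test message " + i);
        }
        Console.WriteLine("Before flush {0} ms", watch.ElapsedMilliseconds);

        //flush every buffering appender so all pending events hit the store
        foreach (var appender in LogManager.GetRepository()
                     .GetAppenders()
                     .OfType<BufferingAppenderSkeleton>())
        {
            appender.Flush();
        }
        Console.WriteLine("Iteration took {0} ms", watch.ElapsedMilliseconds);
    }
}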

Investigating how Log4Net works internally

This post on Stack Overflow explains the problem. To summarize, here is what happens: the LoggingEvent object of log4net has lots of volatile properties. These properties have the correct value when the event is passed to the appender, but the value may not be valid afterwards. If you store LoggingEvent objects for later use (as the BufferingAppenderSkeleton does), you need to call a method called FixVolatileData on the LoggingEvent to avoid reading wrong data.

To verify that this is the real reason, I simply changed the PatternLayout of the FileAppender to use one of these volatile properties, the username (I knew this is an expensive property). Here is the new layout.

“%date %username [%thread] %-5level %logger [%property{NDC}] – %message%newline”

Running the test again confirmed my supposition: the FileAppender execution time went from 90 ms to 1100 ms, just because I asked it to include the %username property. This means that simply accessing the UserName property slows down performance by a factor of 10.

This test shows an important fact: the performance of the various log4net appenders depends on which information you are logging. If you are not interested in the %username property, avoid using it, because it slows down your appender a lot. The same is true for the Location property of the LoggingEvent object.

Why BufferingAppenderSkeleton is so slow even if you do not log anything

The previous explanation of how FixVolatileData works explains why simply inheriting from BufferingAppenderSkeleton slows down the appender a lot, even if you do nothing in the SendBuffer function.

Since BufferingAppenderSkeleton cannot know in advance which fields of the LoggingEvent you are going to access, it takes the most conservative approach: it fixes all the properties of the LoggingEvent for every log entry before storing it in its internal buffer. Fixing every volatile property has the same penalty as accessing them, and this explains everything.

How can you solve it

The solution is super simple: BufferingAppenderSkeleton has a Fix property with the same meaning as the Fix property on the LoggingEvent object; it is a flags enumeration specifying which fields should be fixed. It turns out that if you avoid fixing the UserName and Location properties (the two most expensive ones), you gain a real boost in performance. If you are not interested in those two but you care about speed, this can be a solution.

Here is how my BufferedMongoDBAppender initializes itself.

public override void ActivateOptions()
{
    try
    {
        if (Settings.LooseFix) 
        {
            //default Fix value is ALL, we need to avoid fixing some values that are
            //too heavy. We skip FixFlags.UserName and FixFlags.LocationInfo
            Fix = FixFlags.Domain | FixFlags.Exception | FixFlags.Identity |
                FixFlags.Mdc | FixFlags.Message | FixFlags.Ndc |
                FixFlags.Properties | FixFlags.ThreadName;
        }

        //...rest of the initialization omitted...

        base.ActivateOptions();
    }
    catch (Exception ex)
    {
        //error handling added only to close the snippet, which is truncated here
        ErrorHandler.Error("Error during ActivateOptions", ex);
    }
}

I've added a simple setting called LooseFix to tell the base class that I do not want to fix UserName or LocationInfo. This simple trick gives the appender a boost in performance, at the cost of giving up a couple of properties in the logs. The name LooseFix comes from an obsolete method called FixVolatileData on LoggingEvent, which has a boolean parameter called fastButLoose to avoid fixing the properties that are slow to fix.

Here is what the documentation says about this parameter

The fastButLoose param controls the data that is fixed. Some of the data that can be fixed takes a long time to generate, therefore if you do not require those settings to be fixed they can be ignored by setting the fastButLoose param to true. This setting will ignore the LocationInformation and UserName settings.

This piece of documentation confirms that the two most expensive properties to fix are UserName and LocationInformation.

Conclusion

I also added another little improvement to maximize performance: saving data on a dedicated thread to avoid blocking the caller when the buffer is full. When the BufferingAppenderSkeleton buffer is full it calls a virtual function synchronously with the logger call; this means that the call that triggers the buffer flush has to wait for all buffered objects to be saved into Mongo. Using a different thread avoids slowing down the caller, but you need to be really careful because you can waste lots of memory with LoggingEvent objects waiting to be saved to Mongo.

Finally, here are the results of logging 10k messages with some simple timing.

With MongoAppender - Before flush 2152 ms
With MongoAppender - Iteration took 2158 ms
With MongoAppender Loose Fix - Before flush 337 ms
With MongoAppender Loose Fix - Iteration took 343 ms
With MongoAppender save on different thread - Before flush 1901 ms
With MongoAppender save on different thread - Iteration took 2001 ms
With MongoAppender Loose fix and save on different thread - Before flush 37 ms
With MongoAppender Loose fix and  save on different thread - Iteration took 379 ms

The test is really simple: I log 10k times and take a timing, then call flush to wait for all the logs to be flushed and take another timing. The first two lines show that the standard buffered Mongo appender is quite slow: it takes 2 seconds to handle all the logs. With LooseFix set to true we give up the UserName and LocationInfo properties, but the boost in performance is impressive, and again there is almost no difference before and after the flush.

The third couple of lines shows that saving on a different thread does not save much time if you fix all the properties, but the last two lines show that saving on a different thread with LooseFix set to true needs only 37 ms to log (the appender stores in memory all the logs to be saved by the other thread), though you need a little bit more time when it is time to flush.

To avoid hammering memory when you are logging an enormous quantity of logs, it is probably a good idea to set a limit on how many events to keep in memory before triggering an internal flush and waiting for the saving thread to reduce the buffer. Currently, if I have more than 10k LoggingEvent objects in memory I block the caller until the dedicated thread has saved at least an entire buffer.
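To give an idea of the dedicated-thread approach, here is a simplified sketch; it is not the real appender, the names are hypothetical, and the bounded BlockingCollection plays the role of the in-memory limit described above, blocking the caller when the consumer falls too far behind.

using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;
using log4net.Appender;
using log4net.Core;

//Sketch: a buffering appender that hands buffers to a dedicated consumer thread
//so that the logging caller is not blocked by the actual write to Mongo.
public abstract class AsyncBufferedAppenderSketch : BufferingAppenderSkeleton
{
    //bounded queue of pending buffers: Add() blocks the caller when it is full
    private readonly BlockingCollection<LoggingEvent[]> _pending =
        new BlockingCollection<LoggingEvent[]>(boundedCapacity: 10);

    private Task _consumer;

    public override void ActivateOptions()
    {
        base.ActivateOptions();
        _consumer = Task.Factory.StartNew(ConsumeLoop, TaskCreationOptions.LongRunning);
    }

    //Called by log4net when the internal buffer is full: just enqueue and return.
    protected override void SendBuffer(LoggingEvent[] events)
    {
        _pending.Add(events);
    }

    //Implemented by the concrete appender (e.g. an insert into MongoDB).
    protected abstract void PersistBuffer(LoggingEvent[] events);

    private void ConsumeLoop()
    {
        foreach (var buffer in _pending.GetConsumingEnumerable())
        {
            try
            {
                PersistBuffer(buffer);
            }
            catch (Exception ex)
            {
                ErrorHandler.Error("Unable to persist log buffer", ex);
            }
        }
    }

    protected override void OnClose()
    {
        //drain the queue before shutting down
        _pending.CompleteAdding();
        _consumer?.Wait(TimeSpan.FromSeconds(30));
        base.OnClose();
    }
}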

Gian Maria.