BufferingAppenderSkeleton performance problem in log4net

The problem

Today I was working on the Jarvis project when another developer warned me that the MongoDbAppender for log4net slows down the application during a full rebuild of all projections. In that specific situation the log level is set to DEBUG and the rebuild generates 800k log entries in the Mongo database, so I expected the Mongo appender to slow things down a little. The real problem is this: using a standard FileAppender to compare speed, it turned out that the FileAppender was much faster than the MongoDbAppender.

Setting up a simple load test with NUnit confirmed the observation: logging 10k messages on my machine requires 2000 ms with the MongoDbAppender but only 97 ms with a FileAppender. After some profiling with ANTS Profiler I was puzzled, because the time was not spent in my code. The next step was creating a test appender that inherits from BufferingAppenderSkeleton but does nothing (it does not log anything); to my surprise it took almost the same time as my MongoDbAppender. Where is the bottleneck?

Investigating how Log4Net works internally

This post on Stack Overflow explains the problem. To summarize what happens: the LoggingEvent object from log4net has lots of volatile properties. These properties have the correct value when the event is passed to the appender, but their value may not be valid afterwards. If you store LoggingEvent objects for later use (as the BufferingAppenderSkeleton does), you need to call a method named FixVolatileData on the LoggingEvent to avoid reading wrong data.

To verify that this is the real reason, I simply changed the PatternLayout of the FileAppender to use one of these volatile properties, the username (which I know is an expensive one). Here is the new layout.

“%date %username [%thread] %-5level %logger [%property{NDC}] – %message%newline”

Running the test again confirmed my supposition: the FileAppender execution time rose from 90 ms to 1100 ms, just because I asked it to include the %username property. This means that simply accessing the UserName property slows down performance by a factor of 10.

This test shows an important fact: the performance of the various log4net appenders depends on which information you are logging. If you are not interested in the %username property, avoid using it, because it will slow down your appender a lot. The same is true for the Location property on the LoggingEvent object.

Why BufferingAppenderSkeleton is so slow even if you do not log anything

The previous explanation of how FixVolatileData works also explains why simply inheriting from BufferingAppenderSkeleton slows down the appender a lot, even if you do nothing in the SendBuffer function.

Since BufferingAppenderSkeleton cannot know in advance which fields of the LoggingEvent you are going to access, it takes the most conservative approach: it fixes all the properties on the LoggingEvent for every log entry before storing it in its internal buffer. Fixing every volatile property has the same cost as accessing them all, and this explains everything.
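
Conceptually, this is what happens for every single logging call. The following is only a simplified sketch to illustrate the idea, not the real log4net source (the List<T> buffer and the class name are illustrative):

using System.Collections.Generic;
using log4net.Appender;
using log4net.Core;

public class ConceptualBufferingAppender : AppenderSkeleton
{
    private readonly List<LoggingEvent> _buffer = new List<LoggingEvent>();

    protected override void Append(LoggingEvent loggingEvent)
    {
        //the base BufferingAppenderSkeleton does something equivalent to this:
        //setting Fix forces the event to cache all the requested volatile data now
        loggingEvent.Fix = FixFlags.All; //this is where the cost is paid
        _buffer.Add(loggingEvent);       //the event is stored for later delivery
    }
}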

How can you solve it

The solution is super simple: BufferingAppenderSkeleton has a Fix property with the same meaning as the Fix property on the LoggingEvent object; it is a Flags enumeration specifying which fields should be fixed. It turns out that if you avoid fixing the UserName and Location properties (the two most expensive ones), you get a real performance boost. If you are not interested in these two but you care about speed, this can be a solution.

Here is how my BufferedMongoDBAppender initializes itself.

public override void ActivateOptions()
{
    try
    {
        if (Settings.LooseFix) 
        {
            //default Fix value is ALL, we need to avoid fixing some values that are
            //too heavy. We skip FixFlags.UserName and FixFlags.LocationInfo
            Fix = FixFlags.Domain | FixFlags.Exception | FixFlags.Identity |
                FixFlags.Mdc | FixFlags.Message | FixFlags.Ndc |
                FixFlags.Properties | FixFlags.ThreadName;
        }
        //... rest of the method omitted

I’ve added a simple setting called LooseFix to tell the base class that I do not want to fix either UserName or LocationInfo. This simple trick gives the appender a performance boost, at the cost of giving up a couple of properties in the logs. The name LooseFix comes from an obsolete method called FixVolatileData on LoggingEvent, which has a boolean parameter called fastButLoose to avoid fixing the properties that are slow to fix.

Here is what the documentation says about this parameter

The fastButLoose param controls the data that is fixed. Some of the data that can be fixed takes a long time to generate, therefore if you do not require those settings to be fixed they can be ignored by setting the fastButLoose param to true. This setting will ignore the LocationInformation and UserName settings.

This piece of documentation confirms that the two most expensive properties to fix are UserName and LocationInformation.

Conclusion

I’ve also added another little improvement to maximize performance: saving data on a dedicated thread to avoid blocking the caller when the buffer is full. When the BufferingAppenderSkeleton buffer is full, it calls a virtual function synchronously with the logging call; this means that the call that triggers the buffer flush has to wait for all the buffered objects to be saved into Mongo. Using a different thread avoids slowing down the caller, but you need to be really careful because you can waste lots of memory with LoggingEvent objects waiting to be saved to Mongo (a sketch of this approach is shown at the end of this section).

Finally, here are the results of logging 10k messages with some simple timing.

With MongoAppender - Before flush 2152 ms
With MongoAppender - Iteration took 2158 ms
With MongoAppender Loose Fix - Before flush 337 ms
With MongoAppender Loose Fix - Iteration took 343 ms
With MongoAppender save on different thread - Before flush 1901 ms
With MongoAppender save on different thread - Iteration took 2001 ms
With MongoAppender Loose Fix and save on different thread - Before flush 37 ms
With MongoAppender Loose Fix and save on different thread - Iteration took 379 ms

The test is really simple: I log 10k messages and take a timing, then call Flush to wait for all the logs to be flushed and take another timing. The first two lines show that the standard buffered Mongo appender is quite slow: it takes about 2 seconds to handle all the logs. With LooseFix set to true we give up the UserName and LocationInfo properties, but the performance boost is impressive, and, as with the first two lines, there is almost no difference between the timings before and after the flush.

The third couple of lines shows that saving on a different thread does not save much time if you fix all the properties, but the last two lines show that saving on a different thread with LooseFix set to true needs only 37 ms to log (the appender keeps in memory all the logs to be saved by the other thread), at the price of a little more time when it is time to flush.

To avoid hammering memory when you log an enormous quantity of messages, it is probably a good idea to set a limit on how many events to keep in memory before triggering an internal flush and waiting for the saving thread to drain the buffer. Currently, if I have more than 10k LoggingEvent objects in memory, I block the caller until the dedicated thread has saved at least an entire buffer.
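
This is not the actual Jarvis appender code, but a minimal sketch of the idea, assuming a hypothetical SaveToMongo method that writes a whole batch of events to the database; the bounded queue is what blocks the caller when the saving thread falls behind:

using System.Collections.Concurrent;
using System.Threading.Tasks;
using log4net.Appender;
using log4net.Core;

public class BackgroundBufferingAppender : BufferingAppenderSkeleton
{
    //bounded queue: Add() blocks the caller when too many buffers are pending
    private readonly BlockingCollection<LoggingEvent[]> _pendingBuffers =
        new BlockingCollection<LoggingEvent[]>(boundedCapacity: 20);
    private Task _saveTask;

    public override void ActivateOptions()
    {
        base.ActivateOptions();
        //dedicated thread that drains the queue and performs the slow save
        _saveTask = Task.Factory.StartNew(() =>
        {
            foreach (LoggingEvent[] buffer in _pendingBuffers.GetConsumingEnumerable())
            {
                SaveToMongo(buffer); //hypothetical: writes the whole batch to MongoDB
            }
        }, TaskCreationOptions.LongRunning);
    }

    protected override void SendBuffer(LoggingEvent[] events)
    {
        //only enqueue: the caller is blocked only when the queue is full
        _pendingBuffers.Add(events);
    }

    protected override void OnClose()
    {
        base.OnClose();                   //flushes any remaining buffered events (calls SendBuffer)
        _pendingBuffers.CompleteAdding(); //no more buffers will arrive
        _saveTask.Wait();                 //make sure everything is persisted before closing
    }

    private void SaveToMongo(LoggingEvent[] events)
    {
        //placeholder for the real MongoDB insert logic
    }
}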

Gian Maria.

Wcf over HTTPS, compression, binary binding and some measurement

I have an application composed of WPF clients deployed over the internet and a central WCF service exposed over HTTPS for security reasons. Everything works fine, but as the number of clients grows, performance starts to decrease a little. Since I have a great number of clients that operate behind a single standard 8 Mbit ADSL line (used for everything: mail, browsing, etc.), I measured and found that bandwidth is becoming a bottleneck, so I need to reduce the amount of traffic between clients and server.

It turns out that the easiest way to gain bandwidth is enabling IIS dynamic compression, as you can read in this interesting article by Scott Hanselman that explains in great detail how to enable compression in IIS. Since my server has plenty of CPU available, compression is a really interesting way to reduce bandwidth.

This is not the only option: you can also enable binary serialization over HTTP, because binary serialization is usually smaller than the standard serialization you obtain with the DataContractSerializer. The only drawback is that such an endpoint is accessible only from WCF clients; if this is an issue, it is really simple to expose the same service with multiple endpoints, each one with a different binding. I decided to add another endpoint with a customBinding that uses BinaryMessageEncoding, keeping the original endpoint as well (because I want to remain interoperable with non-.NET clients), so I added this binding configuration.

<customBinding>
  <binding name="BinaryHttpBindingWithNoAuth">
    <binaryMessageEncoding>
      <readerQuotas maxDepth="32" maxStringContentLength="2100000000" maxArrayLength="16384" maxBytesPerRead="2147483647" maxNameTableCharCount="16384" />
    </binaryMessageEncoding>
    <httpsTransport />
  </binding>

  <binding name="BinaryHttpBinding">
    <binaryMessageEncoding>
      <readerQuotas maxDepth="32" maxStringContentLength="2100000000" maxArrayLength="16384" maxBytesPerRead="2147483647" maxNameTableCharCount="16384" />
    </binaryMessageEncoding>
    <security authenticationMode="UserNameOverTransport" />
    <httpsTransport />
  </binding>
</customBinding>

Then I added another endpoint to the service that uses this binding (actually I have more than a single WCF service, some of them without authentication and some that need to be authenticated; this is the reason why I need two binding configurations).
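
The endpoint itself is added in configuration, but as an illustration of what the custom binding is made of, here is a rough programmatic equivalent (a sketch only: the service contract, class names, and address below are hypothetical placeholders, not the real ones of my application):

using System;
using System.ServiceModel;
using System.ServiceModel.Channels;

//hypothetical contract and implementation, only to make the sketch self-contained
[ServiceContract]
public interface IMyService
{
    [OperationContract]
    string Ping();
}

public class MyService : IMyService
{
    public string Ping() { return "pong"; }
}

public static class BinaryEndpointSetup
{
    //programmatic equivalent of the "BinaryHttpBindingWithNoAuth" configuration above
    public static ServiceHost OpenHost()
    {
        var encoding = new BinaryMessageEncodingBindingElement();
        encoding.ReaderQuotas.MaxStringContentLength = 2100000000;

        var binding = new CustomBinding(encoding, new HttpsTransportBindingElement());

        var host = new ServiceHost(typeof(MyService), new Uri("https://myserver/MyService"));
        host.AddServiceEndpoint(typeof(IMyService), binding, "binary"); //relative address of the binary endpoint
        host.Open();
        return host;
    }
}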

Then I modified my client program to be able to use the standard or the binary endpoint through a user setting, and I started measuring bandwidth usage in various situations. I started with a real scenario: the startup of the application with real production data. During startup the application issues seventeen requests to the service to initialize everything: it starts requesting customer information, then tries to log in the user, and finally loads all the data the user needs to start working. I could improve performance by reducing the number of requests, but for now I am interested only in download bandwidth.

Here is the sum of bytes downloaded across all the requests: the first column is SOAP uncompressed, the second SOAP compressed, the third BinaryMessageEncoding uncompressed, and the fourth BinaryMessageEncoding compressed.


Figure 1: Measurement of the amount of data downloaded at application startup

With the SOAP binding and no compression the startup downloads ~262 KB of data; since the data is mostly textual, simply enabling HTTP dynamic compression (second column) saves ~80% of the bandwidth, because the total amount of data drops to ~52 KB. It is interesting to compare these numbers with BinaryMessageEncoding (third column): the total size is ~155 KB, a ~40% saving with respect to uncompressed SOAP. This matters because if for some reason you do not have enough CPU power to enable compression, with binary encoding you can still save a lot of bandwidth without resorting to compression. The fourth column is binary encoding with dynamic compression: since the content is mostly textual, I gain ~76% of the bandwidth (36 KB instead of 155 KB) by enabling compression even for the binary encoding. That is a ~30% saving with respect to compressed SOAP and an ~86% saving with respect to uncompressed SOAP.

These numbers can vary a lot in your situation, because they are highly dependent on the format of your data (I need to send large portions of textual data, so my payload is highly compressible), but it is worth taking some measurements to understand whether you can gain a similar amount of bandwidth with your data. The important point is that everything is done simply by changing the WCF configuration, without touching a single line of code.

Gian Maria.

Use Sql Server Query Hints with NHibernate HQL and ICriteria

When you work with big databases, with many records and a non-uniform distribution of data in the columns used for joins or where conditions, you can have really bad performance problems due to query plan caching. I do not want to give a deep explanation of this problem, you can find information here, but I want to outline the general issue to make clear what is happening.

We can argue a lot about why this happens in a database engine, but basically I have a really bad situation where the db stores data of multiple customers with a really non-uniform distribution (some of them have ~1,000 rows in each table, others have ~100,000 rows in some of the tables).

The problem is the following: the software issues a query (two tables in join) filtered by the id of a customer that has few rows in these tables; SQL Server creates a plan where the tables are joined with nested loops and the query executes in 2 milliseconds.


Figure 1: Query with a parameter value that filters for a customer that has ~100 rows in the two tables used in the join (elapsed time = 2 ms).

As you can see, the execution time is almost zero. Now I execute the very same query, changing the parameter value to filter records for a customer that has nearly 100,000 records; since the query is the same, SQL Server reuses the same execution plan.


Figure 2: The query gets executed filtering for a customer that has ~100,000 rows in the two joined tables (elapsed time = 2824 ms).

The query is three orders of magnitude slower (2,824 ms vs 2 ms), and for queries that are more complex (three tables in join) the execution time is sometimes more than 5 minutes and the software throws a TimeoutException. As you can see from Figure 2, the gray arrow from the bottom table is much wider than the arrows in Figure 1, because this customer has many more rows in the database. If you issue a DBCC FREEPROCCACHE command to clear all cached query plans and execute the query of Figure 2 again, you get an execution time of 28 ms (100 times faster than before XD).


Figure 3: The join is done with a hash match, a technique that is more efficient for joining two tables with a lot of rows (elapsed time = 28 ms).

In Figure 3 you can see that if you clear the query plan cache and execute the query again, SQL Server checks the table and, since this customer has a lot of rows, it uses a HASH MATCH join and not a nested loop. Now if you execute the query for the customer used in Figure 1, the one with few rows, the execution time is 28 ms, higher than 2 ms because a hash match is less efficient for few rows.

This leads to a nasty problem that depends on the order of execution of the queries: if I issue the same query first for the customer with a lot of rows and then for the customer with few rows, I get an execution time of 28 ms for each query; if I reverse the order I get 2 ms and 2,800 ms, and the system is much slower overall. This happens because the hash match is not ideal for a small number of rows (28 ms against 2 ms), but the loss of performance on a small set of records is not so bad; on the contrary, the nested loop is a killer for big result sets and can even cause an execution timeout. Having such a problem in production is like sitting on a bomb ready to explode (a customer calls telling us that the software is completely unresponsive, you verify that it is true, but for other customers everything is fine).


The only solution to this problem is using OPTION (RECOMPILE) for all the queries that suffer from it, or instructing the query optimizer to prefer a specific join strategy (using OPTION (HASH JOIN), for example) if you already know that it is the right execution plan for every data distribution. All these techniques are called "query hints" and they are the only way to solve the bad performance of parameterized queries when you have a non-uniform data distribution.

Now the problem seems solved, but wait: most of the queries are issued by NHibernate, and I need to find a way to add query hints to NHibernate queries, something that is not supported natively by NH. Solving this problem is a two-phase process: first of all you need a way to insert text into the SQL that NHibernate generates, a task that can easily be solved with an interceptor.

[Serializable]
public class QueryHintInterceptor : EmptyInterceptor
{
    internal const string QueryHintHashJoinCommentString = "queryhint-option-hash";
    internal const string QueryHintRecompileCommentString = "queryhint-recompile";

    public override NHibernate.SqlCommand.SqlString OnPrepareStatement(NHibernate.SqlCommand.SqlString sql)
    {
        if (sql.ToString().Contains(QueryHintHashJoinCommentString))
        {
            return sql.Insert(sql.Length, " option(HASH JOIN)");
        }
        if (sql.ToString().Contains(QueryHintRecompileCommentString))
        {
            return sql.Insert(sql.Length, " option(RECOMPILE)");
        }
        return base.OnPrepareStatement(sql);
    }
}

As you can see the code is really simple: the interceptor inherits from EmptyInterceptor and overrides OnPrepareStatement(), adding the right query hint to the end of the query. This is probably not 100% production-ready code, because I am not completely sure that appending the hint at the end of the statement is the right choice for complex queries, but for my specific problem it is enough and it is a good starting point.

If you look at the code you can see that I check for certain strings in the query text to add the appropriate hint, but how can you add these strings to the query so that the hint gets generated? The answer is "comments". First of all I add this interceptor to the NH configuration, so it gets attached to every session.

cfg.SetInterceptor(new QueryHintInterceptor());

But I also enable SQL comments in the NH configuration by setting the property use_sql_comments to true.
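
If you configure NHibernate in code, enabling comments is just one more call on the same Configuration instance (a minimal sketch, assuming cfg is the Configuration object used above; the same property can also be set in the XML configuration file):

cfg.SetProperty("use_sql_comments", "true");

With comments enabled, I only need a couple of extension methods like the following one.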

public static IQuery QueryHintRecompile(this IQuery query)
{
    query.SetComment(QueryHintInterceptor.QueryHintRecompileCommentString);
    return query;
}

That enables me to write this HQL query.

session.CreateQuery("Select y from LinkResult y where ClieId = :param")
    .QueryHintRecompile()
    .SetInt32("param", 11)
    .List();

As you can see, I now have a QueryHintRecompile() method that lets me specify the query hint to use for each single HQL query (you can add a similar extension method for ICriteria; a sketch follows Figure 4). Running this query produces the following SQL.


Figure 4: The query with the query hint specified by QueryHintRecompile()
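
For completeness, here is a minimal sketch of the equivalent extension method for ICriteria; it relies on ICriteria.SetComment, which works like IQuery.SetComment, and assumes the same QueryHintInterceptor constants shown above (so it must live in the same assembly):

using NHibernate;

public static class CriteriaQueryHintExtensions
{
    public static ICriteria QueryHintRecompile(this ICriteria criteria)
    {
        criteria.SetComment(QueryHintInterceptor.QueryHintRecompileCommentString);
        return criteria;
    }
}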

Thanks to the incredible number of extension points offered by NHibernate, adding query hints to queries is really a breeze.

Alk.


Nhibernate BatchSize and massive insert

I have an application that analyzes some data and inserts about 1,000 objects per cycle into the database. The insertion needs to be transactional, and I need to insert 500 instances of object A and 500 instances of object B, where each B has a reference to an A.

When I launch the application I see this insert pattern, even though I enabled batching (batch size) in NHibernate.


Figure 1: Each object has its own insert statement: A, then B, then A, then B again.

This is a typical bulk insert problem with an ORM, but it is strange that it happens even with batching enabled. Thanks to NHProf I verified that NH inserted the objects one by one, alternating objects of type A and B. The only flush is at the very end of the insertion, and I verified with NHProf that all the inserts are issued when I call Flush(). Performance is unacceptable because, due to network latency (the database is accessed over an internet-based VPN), each insert requires ~100 ms, so the whole cycle takes about 1000 * 100 ms = 100 seconds. Since everything is transactional, I keep a transaction open for more than a minute.

I’m using NH 2.1 and I do not know whether bulk insertion of linked objects was fixed in version 3, but the solution is simple: instead of creating each A/B pair and immediately passing it to Session.Save, I used this different strategy (a code sketch follows the list).

1) create all the objects in memory: 500 times I create an A, create a B and set B.refa = A
2) now that all 1000 objects are in memory, call Session.Save for each object of type A
3) call Session.Flush()
4) call Session.Save for each B object
5) commit the transaction.
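
Here is a minimal sketch of this strategy; the entities A and B, the RefA property (mirroring the B.refa reference above), and the method are placeholders for illustration, and batching is assumed to be enabled with adonet.batch_size set to 50 and a non-identity id generator:

using System;
using System.Collections.Generic;
using NHibernate;

//hypothetical entities mirroring the description above
public class A { public virtual Guid Id { get; set; } }
public class B { public virtual Guid Id { get; set; } public virtual A RefA { get; set; } }

public class BatchedInserter
{
    public void InsertInBatches(ISession session)
    {
        using (ITransaction tx = session.BeginTransaction())
        {
            //1) create all the objects in memory first
            var listA = new List<A>();
            var listB = new List<B>();
            for (int i = 0; i < 500; i++)
            {
                var a = new A();
                listA.Add(a);
                listB.Add(new B { RefA = a });
            }

            //2) and 3) save and flush all A objects: the session contains only As, so they batch together
            foreach (A a in listA) session.Save(a);
            session.Flush();

            //4) and 5) save all B objects and commit: another homogeneous batch
            foreach (B b in listB) session.Save(b);
            tx.Commit();
        }
    }
}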

Since objects of type A do not depend on objects of type B, when I call Flush() at point 3 the session contains only objects of type A, so it can use batching. Then, when I save all the B objects and commit the transaction, the session only has to insert objects of type B and it can use batching again. The result is:


Figure 2: Batching in action: since the batch size is 50, inserting 500 objects requires only 10 insert queries.

The improvement in performance is huge: each batch requires about ~500 ms to complete, so I am able to insert everything in 20 * 500 ms = ~10 seconds, and the insertion time is reduced by one order of magnitude.

This simple example shows that for some specific operations, like massive inserts, you should always double-check what your ORM does behind the curtain, to avoid unnecessary performance loss.

alk.

Performance of Lists

In a previous post I explained how to create a simple LINQ extension method to use the Distinct() method with a lambda instead of an IEqualityComparer. That solution uses a simple List<T> as temporary storage to implement the concept of "distinct elements": to verify whether an object was already returned, I simply searched the temporary List<T> for an element that satisfies the predicate, which lets me return only distinct elements.
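
For context, the extension method from that post looked roughly like this (reconstructed from the description above, so it is only a sketch, not the exact original code):

using System;
using System.Collections.Generic;

public static class DistinctExtensions
{
    //Distinct() overload that takes a lambda instead of an IEqualityComparer<T>;
    //already-returned elements are kept in a List<T> and looked up with Exists()
    public static IEnumerable<T> Distinct<T>(
        this IEnumerable<T> source,
        Func<T, T, bool> equalityComparer)
    {
        var alreadyReturned = new List<T>();
        foreach (T element in source)
        {
            //Exists() scans the list linearly: O(n) per element, O(n^2) overall
            if (!alreadyReturned.Exists(e => equalityComparer(e, element)))
            {
                alreadyReturned.Add(element);
                yield return element;
            }
        }
    }
}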

The question is: is it slow? The answer is yes, because List<T> performs really badly when you search for elements in it: it needs to scan the elements one by one until it finds one that satisfies the predicate, or until it reaches the end of the list.

I created a list with 320 elements and benchmarked the time needed to execute 2000 iterations with the Distinct() method I wrote.

for (int i = 0; i < 2000; i++)
{
    foreach (Test test in list.Distinct(
        (t1, t2) => t1.BlaBla.Equals(t2.BlaBla)))
    {
        //do nothing, only iterate
    }
}

It executes in about 3800 milliseconds, but is that slow? Is it fast? Can we optimize it?

The solution is to run this simple test under a profiler, like ANTS Profiler, just to see if there is a hot path.


The profiler output makes it clear that 98% of the time is spent inside the Exists() method of List<T>. This is something we already knew, but the profiler confirms our suspicion: searching for elements inside a simple IList<T> is really slow. A simple solution is writing this helper to create an IEqualityComparer with a lambda.

public class Comparer<T> : IComparer<T>, IEqualityComparer<T>
{
    private readonly Func<T, T, Int32> compareFunc;
    private readonly Func<T, Int32> hashFunc;

    public Comparer(Func<T, T, Int32> compareFunc, Func<T, Int32> hashFunc)
    {
        this.compareFunc = compareFunc;
        this.hashFunc = hashFunc;
    }

    #region IComparer<T> Members

    public int Compare(T x, T y)
    {
        return compareFunc(x, y);
    }

    #endregion

    #region IEqualityComparer<T> Members

    public bool Equals(T x, T y)
    {
        return compareFunc(x, y) == 0;
    }

    public int GetHashCode(T obj)
    {
        return hashFunc(obj);
    }

    #endregion
}

public static class ComparerFactory
{
    public static Comparer<T> Create<T>(
        Func<T, T, Int32> compareFunc,
        Func<T, Int32> hashFunc)
    {
        return new Comparer<T>(compareFunc, hashFunc);
    }
}

This is a class that implements IComparer<T> and IEqualityComparer<T> based on lambdas. IEqualityComparer<T> is a special interface because it has the GetHashCode function, which allows hash-based containers to store objects and use fast search algorithms. Thanks to a simple helper factory, you can create an IEqualityComparer<T> based on a lambda with a single instruction.

Utils.Comparer<Test> comparer = ComparerFactory.Create<Test>(
        (t1, t2) => t1.BlaBla.CompareTo(t2.BlaBla),
        t => t.BlaBla.GetHashCode());

Thanks to the factory class I can create a Comparer<T> based on two functions: one implements the concept of equality, the other returns a hash code for an object. The general rule is that two objects that are equal must return the same hash code. Since I want to compare instances only by the BlaBla property, I implement the hash function simply using BlaBla.GetHashCode(). Now I can use the standard Distinct() LINQ method.

for (int i = 0; i < 2000; i++)
{
    foreach (Test test in list.Distinct(comparer))
    {
        //do nothing, only iterate
    }
}

If you compare this code with the one that uses the lambda, it is really similar. In the end, all you want is to be able to use the Distinct operator with a lambda that implements the concept of equality: in the first sample you pass the lambda directly to the Distinct() operator, while in the second you need to write some extra code to create the comparer from the lambda and a hash code function, but the result is the same: being able to use the Distinct() method with a lambda.

If you run this second snippet, the execution time drops from 3800 ms to 90 ms, because the base implementation of Distinct (thanks to IEqualityComparer) can use a hash-based container to store objects and perform much faster searches.

alk.