Use SQL Server Query Hints with NHibernate HQL and ICriteria

When you work with big databases that have many records and a non-uniform distribution of data in the columns used for joins or WHERE conditions, you can run into really bad performance problems due to query plan caching. I do not want to give a deep explanation of this problem (plenty of information is available if you search for parameter sniffing), but I want to outline it to make clear what is happening.

We can argue a lot about why this happens in a database engine, but basically I have a really bad situation where the db stores data for multiple customers with a really non-uniform distribution (some of them have ~1,000 rows in each table, others have ~100,000 rows in some of the tables).

The problem is the following: the software issues a query (two tables in a join) filtered by the id of a customer that has few rows in those tables; SQL Server creates a plan where the tables are joined with nested loops and the query executes in 2 milliseconds.


Figure 1: Query with a parameter value that filters for a customer with ~100 rows in the two tables used in the join (elapsed time = 2 ms).

As you can see the execution time is almost zero. Now I execute the very same query, changing the parameter value to filter records for a customer that has nearly 100,000 records; since the query text is the same, SQL Server reuses the same execution plan.


Figure 2: The same query executed filtering for a customer that has ~100,000 rows in the two joined tables (elapsed time = 2824 ms).

The query is three orders of magnitude slower (2824 ms vs 2 ms), and for queries that are more complex (three tables in a join) the execution time is sometimes more than 5 minutes and the software throws a TimeoutException. As you can see from Figure 2, the gray arrow from the bottom table is much wider than the arrows in Figure 1, because this customer has many more rows in the database. If you issue a DBCC FREEPROCCACHE command to clear all cached query plans and execute the query of Figure 2 again, you get an execution time of 28 ms (100 times faster than before XD).
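For reference, this is the command mentioned above; note that it flushes every cached plan on the whole instance, so it is more of a testing aid than something to run routinely in production.

    -- Clears all cached execution plans for the instance,
    -- forcing every subsequent query to be recompiled.
    DBCC FREEPROCCACHE;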


Figure 3: The join is done with a hash match, a technique that is more efficient for joining two tables with a lot of rows (elapsed time = 28 ms).

In Figure 3 you can see that if you clear the query plan cache and execute the query again, SQL Server looks at the data again and, since this customer has a lot of rows, it uses a HASH MATCH join instead of a nested loop. Now if you execute the query for the customer used in Figure 1, the one with few rows, the execution time is 28 ms, higher than 2 ms, because a hash match is less efficient for few rows.

This leads to a nasty problem that depends on the order of execution of the queries: if I issue the same query first for the customer with a lot of rows and then for the customer with few rows, I get an execution time of 28 ms for each query; if I reverse the order I get 2 ms and 2800 ms, so the system is much slower overall. This happens because the hash match is not ideal for a small number of rows (28 ms against 2 ms), but the loss of performance on a small set of records is not so bad; on the contrary, the nested loop is a killer for big result sets and can even cause an execution timeout. Having such a problem in production is like sitting on a bomb ready to explode: a customer calls telling us that the software is completely unresponsive, you verify that it is true, but for other customers everything is quite good :(


The only solution to this problem is using OPTION (RECOMPILE) for all queries that show it, or instructing the query optimizer to prefer a specific join strategy (using OPTION (HASH JOIN), for example) if you already know that it is the right execution plan for every data distribution. These techniques are called “query hints” and they are the main way to solve bad performance of parameterized queries when you have a non-uniform data distribution.
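For reference, this is roughly what the two hints look like in plain T-SQL (table, column, and parameter names here are made up for illustration, they are not the real schema):

    -- Recompile the plan on every execution, so the optimizer can choose
    -- nested loops or hash match based on the actual parameter value.
    SELECT o.Id, o.Total
    FROM Orders o
        INNER JOIN OrderRows r ON r.OrderId = o.Id
    WHERE o.CustomerId = @customerId
    OPTION (RECOMPILE);

    -- Alternatively, force a join strategy that is acceptable for every customer.
    SELECT o.Id, o.Total
    FROM Orders o
        INNER JOIN OrderRows r ON r.OrderId = o.Id
    WHERE o.CustomerId = @customerId
    OPTION (HASH JOIN);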

Now the problem seems solved but, wait: most of the queries are issued by NHibernate, and I need to find a way to add query hints to NHibernate queries, a task that is not supported natively by NH. Solving this problem is a two-phase process: first of all you need a way to insert text into the NHibernate-generated SQL, a task that can easily be solved with an interceptor.

    [Serializable]
    public class QueryHintInterceptor : EmptyInterceptor
    {
        internal const string QueryHintHashJoinCommentString = "queryhint-option-hash";
        internal const string QueryHintRecompileCommentString = "queryhint-recompile";

        public override NHibernate.SqlCommand.SqlString OnPrepareStatement(NHibernate.SqlCommand.SqlString sql)
        {
            if (sql.ToString().Contains(QueryHintHashJoinCommentString))
            {
                return sql.Insert(sql.Length, " option(HASH JOIN)");
            }
            if (sql.ToString().Contains(QueryHintRecompileCommentString))
            {
                return sql.Insert(sql.Length, " option(RECOMPILE)");
            }
            return base.OnPrepareStatement(sql);
        }
    }

As you can see the code is really simple: the interceptor inherits from EmptyInterceptor and overrides OnPrepareStatement(), adding the right query hint to the end of the query. This is probably not 100% production-ready code, because I'm not completely sure that appending the hint at the end of the query is the right choice for complex queries, but for my specific problem it is enough and it is a good starting point.

If you look at the code you can see that I'm checking for certain strings in the query text to decide which hint to add, but how do you get those strings into the query in the first place? The answer is “comments”. First of all I add the interceptor to the NHibernate configuration, so it gets attached to every session.

    cfg.SetInterceptor(new QueryHintInterceptor());

But I also enable comments in the generated SQL by setting the NHibernate configuration property use_sql_comments to true.
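A minimal sketch of that setting, assuming the same programmatic Configuration object used above (the property can equally be set in hibernate.cfg.xml):

    // Emit the comments set via SetComment() into the generated SQL,
    // so the interceptor above can detect the marker strings.
    cfg.SetProperty("use_sql_comments", "true");

With comments enabled, I only need a couple of extension methods like this one.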

    public static IQuery QueryHintRecompile(this IQuery query)
    {
        query.SetComment(QueryHintInterceptor.QueryHintRecompileCommentString);
        return query;
    }

That enables me to write this HQL query.

    session.CreateQuery("Select y from LinkResult y where ClieId = :param")
        .QueryHintRecompile()
        .SetInt32("param", 11)
        .List();

As you can see I now have a QueryHintRecompile() method that lets me specify, for each single HQL query, the query hint to use (you can add a similar extension method for ICriteria; a sketch follows Figure 4). Running this query produces the following SQL.


Figure 4: The generated query with the hint added by QueryHintRecompile().
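Since the criteria API exposes the same SetComment() method, an equivalent extension for ICriteria could look like this (a minimal sketch following the same pattern, not code taken from the original project):

    public static ICriteria QueryHintRecompile(this ICriteria criteria)
    {
        // The marker comment is detected by QueryHintInterceptor, which appends
        // " option(RECOMPILE)" to the generated SQL.
        criteria.SetComment(QueryHintInterceptor.QueryHintRecompileCommentString);
        return criteria;
    }

Usage is the same as with IQuery, e.g. session.CreateCriteria<LinkResult>() ... .QueryHintRecompile().List().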

Thanks to the incredible number of extension points offered by NHibernate, adding query hints to queries is really a breeze.

Alk.


Using GUID ids in NHibernate, index fragmentation

This is a quick consideration about using ids of type GUID in NHibernate. If in SQL Server the clustered index is on the id (the default choice) and you use a simple guid generator, you will end up with high index fragmentation. This happens because the physical ordering of the records (the clustered index) is on the Id field, so inserting a lot of objects with random ids places them randomly across the physical space of the database. Remember that the clustered index is a tree kept ordered by the clustering key.

Figure 1: A simple picture that shows the index organization in a database.

A simple solution is using the guid.comb generator.

    <id name="Id" column="Id" type="Guid" unsaved-value="00000000-0000-0000-0000-000000000000">
        <generator class="guid.comb" />
    </id>

Now look at the guids generated when I insert a bunch of elements in a unit test.


Figure 2: The id sequence generated by the guid.comb generator.

As you can see, the guids are clearly generated sequentially, but their very first part differs a lot, so those guids are really not that similar to each other. Another problem is that I already find guids hard to read, and inserting a sequence of elements results in wildly different values. The solution is using another guid generation strategy.

    using System;
    using System.Runtime.InteropServices;

    public static class SequentialGuidGenerator
    {
        // P/Invoke declaration for the Win32 function that creates sequential UUIDs.
        [DllImport("rpcrt4.dll", SetLastError = true)]
        internal static extern int UuidCreateSequential(out Guid guid);

        public static Guid CreateSequentialGuid()
        {
            Guid guid;
            UuidCreateSequential(out guid);
            return guid;
        }
    }

This class lets me call the UuidCreateSequential operating system function to generate sequential guids. Now you can write another NHibernate generator.

    class SequentialGuidComboIdGenerator : IIdentifierGenerator
    {
        public object Generate(NHibernate.Engine.ISessionImplementor session, object obj)
        {
            return SequentialGuidGenerator.CreateSequentialGuid();
        }
    }

Now I change the id generator in the mapping and run the test again.
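As a sketch, the mapping points the generator element at the custom class by its assembly-qualified name (the namespace and assembly names below are placeholders, not the real ones):

    <id name="Id" column="Id" type="Guid" unsaved-value="00000000-0000-0000-0000-000000000000">
        <!-- Fully qualified type name of the IIdentifierGenerator implementation;
             adjust namespace and assembly to your project. -->
        <generator class="MyApp.Domain.SequentialGuidComboIdGenerator, MyApp.Domain" />
    </id>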


Figure 3: Guids created by the sequential generator are much more similar to each other.

As you can verify, the guids are now really similar to each other, and I have highlighted only the part that differs. This generator can reduce index fragmentation, and sequentially inserted elements are easy to compare because they differ only in a few digits, which in my opinion also leads to better human readability.

Clearly these performance considerations are valid only when the clustered key is on the id; if the clustered index is on other fields, the considerations about guid performance do not apply.

Alk.

Red Gate Tools: SQL Dependency Tracker

I must admit that I really love tools from Red Gate, and recently I discovered SQL Dependency Tracker, an exceptional tool for managing dependencies between objects in a database.

This tool is really great because with a few clicks you can get different graphs that represent the dependency relations between objects in the database. I'm working on a legacy database, and I have two very old views that were replaced by an indexed view that performs better and has better column names. We ported almost everything, but I need to know whether some stored procedure or other object still uses the old views. I simply opened SQL Dependency Tracker and immediately found this graph.

[Screenshot: dependency graph of the objects that reference the two old views]

It shows me that seven stored procedures still reference one of these two old views. If you zoom in you can easily find details about each procedure, and you can also look at its code.

[Screenshot: zoomed-in view with the details of one referencing stored procedure]

Another useful option is the ability to organize the result in different layouts; this is the hierarchic layout on the Northwind database.

[Screenshot: hierarchic layout of the Northwind database objects]

The circular layout immediately gives a good idea of how objects are related to each other.

[Screenshot: circular layout of object dependencies]

As an example, you can look at how the tables, views, and stored procedures of a legacy database are organized.

[Screenshot: circular layout of a legacy database]

This is a really interesting image, because it shows that there is an inner circle of core objects, surrounded by some other areas of the project.

You can also colorize different objects with different colors, and with the balloon layout I can immediately spot the central objects of a database.

[Screenshot: balloon layout with one highly connected object standing out]

Look at the above picture: there is clearly an object so related to the others that it is immediately visible :)

This tool is also really useful for familiarizing yourself with a new, complex database, because it gives you a quick look at how the objects are organized.

alk.

SSIS: use parameters with an ADO NET Source in a Data Flow

I have a database where reports are produced by stored procedures over the OLTP database, and we are beginning to suffer poor performance because we have a lot of data and sometimes the reports are blocked by transactions issued from components that do bulk data insertion.

A viable solution is moving all the data to a data warehouse server where I can copy the denormalized data. This lets me run a report query as a simple select over a single table, and the queries are issued against a database with no locks. I decided to move the data using SSIS, with incremental population partitioned by customer.

My algorithm is the following:

  1. Select all the customer ids to process.
  2. For each id, take the date of the last population and the maximum date present in the original data.
  3. Delete from the destination table all data in that range (this lets me recalculate everything if needed).
  4. Move all the new data into the destination table (steps 3 and 4 are sketched in SQL right after this list).
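A minimal, hypothetical T-SQL sketch of steps 3 and 4; table and column names are invented for illustration, and in the real package the range deletion runs in an Execute SQL Task while the copy is done by a Data Flow:

    -- Step 3: remove rows already copied for this customer in the recalculation window.
    DELETE FROM dw.DenormalizedReport
    WHERE CustomerId = @clieId
      AND LastUpdateDate > @lastRunDate
      AND LastUpdateDate <= @highestDateForCustomer;

    -- Step 4: copy the new and updated rows from the OLTP source into the destination table.
    INSERT INTO dw.DenormalizedReport (CustomerId, LastUpdateDate /* , other columns */)
    SELECT s.CustomerId, s.LastUpdateDate /* , other columns */
    FROM oltp.SourceData s
    WHERE s.CustomerId = @clieId
      AND s.LastUpdateDate > @lastRunDate
      AND s.LastUpdateDate <= @highestDateForCustomer;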

To simplify the process I created an SSIS package, represented in this figure.

[Screenshot: control flow of the SSIS package]

In this approach I can take advantage of the fact that the Execute SQL Task supports SQL syntax with parameters, so I'm able to use parameters in the query and use SSIS variables to populate them.

[Screenshots: the Execute SQL Task query using the @clieId parameter, and the mapping of SSIS variables to that parameter]

As you can see I simply create a T-SQL query with the parameter @clieId and then assign the value of a variable to the parameter; but in a Data Flow you cannot use parameters in an ADO NET Source, and this is a problem, because I need to filter the source data by customer and date range before doing the transformations.

The solution is using expressions. You can see that the Move Data for ExcelAnalysis block has a little pink triangle in the upper left corner, indicating that some of its properties are set with expressions. If you click the whole Data Flow in the “Control Flow” tab you can set up expressions for the Data Flow block.

[Screenshot: expression editor on the Data Flow task properties]

As you can see, the Data Flow block exposes an expression for each ADO NET data source inside the flow, so you can set their properties with expressions.

The problem is that expressions are not parameter-aware, so we need to use the expression syntax to build the SQL query by string concatenation, old-school style :). The expression is something like this.

   1: "SELECT    xxx,yyy,

   2: from xxxx

   3: WHERE     LastUpdateDate >  '" + (DT_WSTR, 30) @[User::lastRunDate] +

   4:  "'  and LastUpdateDate <= ' " + (DT_WSTR, 30)  @[User::HighestDateForCustomer]  +

   5:  "' and CustomerId = " +  (DT_WSTR, 12)  @[User::CustomerId]

As you can see the query is composed using string manipulation and expression syntax; it is surely sub-optimal, but as far as I know this is the best way to get a parameterized-like query into a Data Flow ADO NET Source.

Alk.