Entity Framework, a super dataset?

Entity Framework is quite a good product, but in my opinion it still misses some points to be called an ORM. I must admit that I never used the 4.1 Code First in a real project, but there is still some stuff that does not convince me when using EF.

When I decide to use EF (and not NH)

Sometimes I need to access small legacy databases to do a small number of operations, and I must admit that adding an EF model from Visual Studio is a matter of seconds, and the LINQ provider of EF is quite good, so it is a valuable alternative for creating a quick DAL. I have a simple database (four tables), I created an EF model, and then issued this query.


Figure 1: A very simple piece of code that queries the database with EF
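The original screenshot is not reproduced here; this is a minimal reconstruction of the query it showed, based on the description below (the context and the set names Terms and Verbs are my assumptions):

var term = context.Terms.First();
// This is the query that fails: it compares a whole entity, not a primitive value.
var verbs = context.Verbs.Where(v => v.Term == term).ToList();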

This is a super simple piece of code: in the first line I grab an instance of Term from the database (actually I grab the first one; in real code I have a complex query that returns a specific term). Then I want to grab all instances of Verb that are related to this term, so I issue the second query. What will be the result?

The error shown in Figure 2 perfectly captures my reaction the first time I ran this query.


Figure 2: Strange error in the query.

I’m really disappointed by this exception for many reasons. First of all, I did not know why it does not work, because this is a perfectly valid and reasonable query to issue.

Fix the query to make it work

To verify that the query is indeed OK, let’s modify it in this way.


Figure 3: Modified version of the query, now it works.
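Again reconstructed from the description (same assumed names), the only change is that ToList now runs before Where:

// Works, but materializes the whole Verbs table before filtering in memory.
var verbs = context.Verbs.ToList().Where(v => v.Term == term).ToList();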

This version works simply by changing the order of the Where and ToList operators. It now works because it loads the entire Verbs table into memory and then filters it with LINQ to Objects. This clearly makes me sick, because it creates big memory pressure. The second issue with the above error is the error message, which is really obscure: Unable to create a constant value of type xxxxxxxx. If you read it carefully you can eventually understand that you cannot express a condition based on equality of objects, but only on primitive types.

This error could have a good reason to exist: if the Term entity overrode the Equals method, it could be impossible for the EF LINQ provider to create an equivalent SQL query that mimics the same behavior. Suppose Term overrode the equals operator, comparing two entities by checking the Value and Type properties for equality: how could the EF provider understand this behavior and issue an equivalent query to the database? I can agree with this reason, but only if the error were something like “Could not issue a comparison between two entities in an EF query because it is impossible to mimic the semantics of the equals operator... bla bla bla”.

A real solution, but you need to get your hands dirty

But when working with objects in DDD, or a Domain Model, or even with the Active Record pattern, in 99.99% of situations the programmer translates the condition V.Term == term as: “give me all verbs that are connected with term”. This is because we are used to the fact that a reference from the Verb entity to the Term entity is part of the object graph, and this can easily be translated into a comparison between identities. Given that, you can rewrite the above query in this way.


Figure 4: The right way to issue the EF query
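Reconstructed like the previous snippets, this version compares keys instead of entities (assuming Id is the key property):

// Comparing primitive key values can be translated to SQL by the EF provider.
var verbs = context.Verbs.Where(v => v.Term.Id == term.Id).ToList();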

Now the query runs fine, but you can have two completely different opinions about it.

The first opinion is negative, because I need to know the name of the property/properties used as id by the ORM. I know that by convention you can use a property named Id for every entity, but sometimes, especially with legacy databases, you can have entities with composite keys or identity properties named differently. LINQ is great for expressing queries over the object model, but in this situation I need to be aware of details of the ORM mapping to issue the right query, and this is bad.

The second opinion is positive, because this query does not suffer from the mismatch with its LINQ to Entities counterpart when someone redefines the equals operator for the entity.

Conclusion

I tend to agree with the negative opinion, because I want my ORM to permit me to express queries without being aware of such details. So I tend to consider EF not a real ORM but something like a SuperDataset: it is quick to use with the database-first approach and it permits you to create a DAL quickly, but it forces you to know the details of the database, like a dataset. Clearly EF has a lot of interesting features compared to a dataset, but it still misses something to be considered an ORM.

Alk.


EntityLock, implementing a logical locking strategy in my domain

I have a domain where one of the business operations consists of the analysis of some entities. For each entity we need to do a complex analysis involving external servers, and until now we could live with a single server that sequentially analyzes those entities one after another.


All worked well, but we reached a point where the amount of work hit the capacity of the Analyzer server, so we needed to change the analysis algorithm to parallelize the analysis.


Simply launching the analyzer application on multiple servers produces errors, because analyzing a single entity is a job of several seconds, and there is a high chance that both servers start to analyze the very same entity. The Analyze operation is idempotent, so there is no problem with the data if two servers analyze the same data at the same moment, but we have timeout problems due to locks in the database (the result of the analysis is kept in RAM, and when the analysis finishes it simply opens a transaction, clears the old results, and inserts all the new data). I need some way to make the two analyses independent from each other.

I introduced into the domain the concepts of EntityLock and LockedEntity.


Figure 1: Schema of EntityLock and LockedEntity

I simply need some mechanism to say: “I want to modify this entity and I want to be sure that no other machine or process is actually using it”. The class has three properties: EntityCode is the logical id of the entity being modified, the Locker field contains a code that identifies the process that set the lock (actually machine name + full path of the executable), and finally the Timestamp identifies the date and time when the lock was taken.
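The class diagram is not reproduced here; this is a minimal sketch of how EntityLock could look, based on the three properties described above (the constructor details and the NHibernate mapping are my assumptions):

public class EntityLock
{
    // Logical id of the locked entity; also the primary key of the lock table.
    public virtual String EntityCode { get; protected set; }

    // Code identifying the process that set the lock.
    public virtual String Locker { get; protected set; }

    // Date and time when the lock was taken.
    public virtual DateTime Timestamp { get; protected set; }

    protected EntityLock() { } // required by NHibernate

    public EntityLock(String entityCode)
    {
        EntityCode = entityCode;
        // Machine name plus full path of the executable, as described in the text.
        Locker = Environment.MachineName + "|" +
                 System.Reflection.Assembly.GetEntryAssembly().Location;
        Timestamp = DateTime.UtcNow;
    }
}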

Since I need to share the lock between multiple machines, I can simplify the design by using the database as the backing storage for the locks.

public static LockedEntity<T> Lock<T>(String entityId, T entity)
{
    EntityLock @lock;
    using (var session = SessionFactory.CreateNewSession())
    using (session.BeginTransaction(IsolationLevel.ReadCommitted))
    {
        // The logical id is the primary key: inserting the record is the lock attempt.
        @lock = new EntityLock(entityId);
        session.Save(@lock);
        try
        {
            session.Transaction.Commit();
        }
        catch (HibernateException)
        {
            // Primary key violation: another process already locked this entity.
            session.Transaction.Rollback();
            return null;
        }
    }
    return new LockedEntity<T>(@lock, entity);
}

The Lock function accepts a logical id and the entity that should be locked. I could pass only the entity if I had some base entity class, but my domain is on a legacy database: some entities have GUID ids, others have Int32 IDENTITY ids, so I need to specify the id as a string. When the entity has a GUID id I can use the id directly; if it has an IDENTITY id I use id + entity name, and so on. Since the logical id is the primary key (assigned generator) of the EntityLock, I simply try to insert a new EntityLock into the database, and if some other component has already locked that same entity, a primary key violation exception occurs (I could even obtain better performance by doing a lookup before the insertion, but since I need to acquire a lock only every several minutes I do not care about performance). When an exception is raised I simply return null, meaning that the entity could not be locked; if the insert operation was successful I return the entity wrapped in a LockedEntity<T> class, which implements IDisposable because it releases the lock when it gets disposed.
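The post does not show LockedEntity<T>; this is a plausible sketch, under the assumption that disposing it deletes the EntityLock row (the session handling mirrors the Lock method above):

public class LockedEntity<T> : IDisposable
{
    private readonly EntityLock _lock;
    public T Entity { get; private set; }

    public LockedEntity(EntityLock @lock, T entity)
    {
        _lock = @lock;
        Entity = entity;
    }

    public void Dispose()
    {
        // Release the logical lock by removing the EntityLock record.
        using (var session = SessionFactory.CreateNewSession())
        using (var tx = session.BeginTransaction())
        {
            session.Delete(_lock);
            tx.Commit();
        }
    }
}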

With this class, each analyzer server asks for the next data item to analyze and tries to lock it; if the lock fails, it asks for another item, until it finds a data object it can lock or there is no more data to analyze. All the analysis operations are done inside a using block, so the lock on the entity is disposed as soon as the analysis finishes.
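In sketch form, each analyzer’s main loop could look like this (EntityLocker, GetNextEntityToAnalyze, and Analyze are hypothetical names):

Entity entity;
while ((entity = GetNextEntityToAnalyze()) != null)
{
    // Try to take the logical lock; null means another process owns it.
    using (var locked = EntityLocker.Lock(entity.Id.ToString(), entity))
    {
        if (locked == null) continue; // skip, try the next entity
        Analyze(locked.Entity);
    } // Dispose releases the lock as soon as the analysis finishes
}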

alk.


Sequential GUIDs are Unique

This post refers to the previous post, where liviu warned against the “non-uniqueness” of T-SQL sequential guids. The technique I used is based on the UuidCreateSequential API function of the operating system. This function generates unique guids unless there is no network card in the system, and it warns you when the guid can be considered unique only on the local machine by returning the value RPC_S_UUID_LOCAL_ONLY.


This happens when the machine has no network card; this is not a problem in my scenario, but if it is something you fear, you can simply change the function in this way.

   1: public static class SequentialGuidGenerator
   2: {
   3:     private const Int32 RPC_S_OK = 0x0;
   4: 
   5:     /// <summary>
   6:     /// This system function from rpcrt4.dll generates sequential Guids, which are
   7:     /// better suited for massive inserts of large amounts of data into databases
   8:     /// where the clustered index is on the id.
   9:     /// </summary>
  10:     /// <param name="guid"></param>
  11:     /// <returns></returns>
  12:     [DllImport("rpcrt4.dll", SetLastError = true)]
  13:     internal static extern int UuidCreateSequential(out Guid guid);
  14: 
  15:     public static Guid CreateSequentialGuid()
  16:     {
  17:         Guid guid;
  18:         Int32 retvalue = UuidCreateSequential(out guid);
  19:         if (retvalue != RPC_S_OK)
  20:         {
  21:             //the sequential guid would be unique only on this machine,
  22:             //so fall back to the standard guid generator
  23:             return Guid.NewGuid();
  24:         }
  25:         return guid;
  26:     }
  27: 
  28: }

The change is at line 19: I check retvalue, and if it is different from RPC_S_OK I fall back to the standard Guid generator. This little change makes sure that, if the machine cannot generate unique sequential guids, standard guid generation is used instead, avoiding the risk of having two identical GUIDs.

alk.

Using guid ids in NHibernate, index fragmentation

This is a quick consideration about using ids of type GUID in NHibernate. If in SQL Server the clustered index is on the id (the default choice) and you use a simple guid generator, you will end up with high index fragmentation. This happens because the physical ordering of the records (the clustered index) is on the Id field, so inserting a sequence of objects with random ids inserts those objects randomly into the physical space of the DB. Remember that the index is a tree that is kept ordered by the clustered index.

Figure 1: A simple picture that shows the index organization in a database.

A simple solution is using the guid.comb generator.

   1: <id name="Id" column="Id" type="Guid" unsaved-value="00000000-0000-0000-0000-000000000000">

   2:    <generator class="guid.comb" />

   3:  </id>

Now look at the guids generated when I insert a bunch of elements in a unit test.


Figure 2: The id sequence generated by guid.comb generator

As you can see, the guids are clearly generated sequentially, but their very first part is highly different, so these guids do not really look similar. Another problem is that I hate guids for human readability, and inserting a sequence of elements results in highly different guids. The solution is using another guid generation strategy.

using System;
using System.Runtime.InteropServices;

public static class SequentialGuidGenerator
{
    // Operating system function that generates sequential guids.
    [DllImport("rpcrt4.dll", SetLastError = true)]
    internal static extern int UuidCreateSequential(out Guid guid);

    public static Guid CreateSequentialGuid()
    {
        Guid guid;
        UuidCreateSequential(out guid);
        return guid;
    }
}

This class permits me to use the UuidCreateSequential operating system function to generate a sequential guid. Now you can write a custom NHibernate generator.

using NHibernate.Engine;
using NHibernate.Id;

class SequentialGuidComboIdGenerator : IIdentifierGenerator
{
    public object Generate(ISessionImplementor session, object obj)
    {
        return SequentialGuidGenerator.CreateSequentialGuid();
    }
}

Now I change the id generator in the mapping and run the test again.
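The mapping just needs to reference the custom generator class; something like the following, where MyApp is a placeholder for the real namespace and assembly:

<id name="Id" column="Id" type="Guid" unsaved-value="00000000-0000-0000-0000-000000000000">
  <generator class="MyApp.SequentialGuidComboIdGenerator, MyApp" />
</id>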


Figure 3: Guids from the sequential generator are much more similar.

As you can verify, now the guids are really similar to one another (I highlighted only the part that differs). This generator can reduce index fragmentation, and sequentially inserted elements differ only by a few digits, which in my opinion leads to better human readability.

Clearly these performance considerations are valid only when the clustered key is on the id; if the clustered index is on other fields, all the considerations about guid performance do not apply.

Alk.

NHibernate BatchSize and massive insert

I have an application that analyzes some data and inserts about 1000 objects into the database in each cycle. The insertion needs to be transactional: I need to insert 500 instances of object A and 500 instances of object B, where each B has a reference to an A.

When I launch the application I see this insert pattern, even though I enabled batching in NHibernate.
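For reference, batching in NHibernate is enabled with the adonet.batch_size configuration property; with the batch size of 50 used later in this post, the configuration would contain something like:

<property name="adonet.batch_size">50</property>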


Figure 1: Each object has its own insert instruction: A, then B, then A, then B again.

This is a typical bulk insert problem with an ORM, but it is strange that it happens even with batch size enabled. Thanks to NHProf I verified that NH inserted the objects one by one, alternating objects of type A and B. The only flush is at the very end of the insertion, and I verified with NHProf that all the inserts are issued when I call Flush(). Performance is unacceptable because, due to network latency (the database is accessed over an internet-based VPN), each insert requires ~100 ms, so I have about 1000 * 100 ms = 100 seconds. Since everything is transactional, I have a transaction open for more than a minute.

I’m using NH 2.1 and I do not know whether bulk inserting of linked objects was fixed in version 3, but the solution is simple: instead of creating couples of A and B objects and immediately passing them to Session.Save, I took this different strategy (a code sketch follows the list):

1) create all objects in memory: 500 times, create an A, create a B, and set B.RefA = A
2) now that all 1000 objects are in memory, call Session.Save for each object of type A
3) call Session.Flush()
4) call Session.Save for each B object
5) commit the transaction.
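Here is a minimal sketch of that strategy, assuming an open ISession named session and placeholder classes A and B where B.RefA references an A (usings omitted):

var aList = new List<A>();
var bList = new List<B>();
for (int i = 0; i < 500; i++)
{
    var a = new A();
    var b = new B { RefA = a };
    aList.Add(a);
    bList.Add(b);
}

using (var tx = session.BeginTransaction())
{
    foreach (var a in aList) session.Save(a); // session contains only A objects
    session.Flush();                          // all A rows go out in batches
    foreach (var b in bList) session.Save(b); // now only B objects are pending
    tx.Commit();                              // all B rows go out in batches
}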

Since objects of type A do not depend on objects of type B, when I call Flush() at point 3 the session contains only objects of type A, so it can use batching. Then, when I insert all the B objects and commit the transaction, the session only has to insert the B objects, so it can use batching again. The result is:


Figure 2: Batching in action: since I have 50 as batch size, inserting 500 objects requires only 10 insert queries.

The improvement in performance is really high: each batch requires about ~500 ms to complete, so I’m able to insert everything in 20 batches * 500 ms = ~10 seconds, and the insertion time is reduced by one order of magnitude.

This simple example shows that, for some specific operations like massive inserts, you should always double-check what your ORM does behind the curtain, to avoid unnecessary performance loss.

alk.