alkampfer on June 11th, 2009

I wrote a stupid util that does this operation: it scans a folder with 500.000 + files (in various subfolder), all files have name that contains an integer that is an id of a row in a database. I need to find orphan files, so I simply take file name, es: myfile_1002.txt, extracts the number 1002 and then verify if in a specific table of the database there is a row with index 1002.

To avoid poor performance I eager load all valid ids in memory (each id is an In32 (4 bytes) and I have 100k of them, so I’m keeping in memory 100k * 4 bytes 400kbytes of RAM). Now I load all these id in a List<Int32>, then iterate into all files of a directory, for each file parse the name and verify that the List contains the id. To analyze 200k files the whole routine took 10.200 milliseconds. It is quite speedy, but if you profile it you can verify that most of the time was spent into checking that an id is contained in the list.

This is because search operations in List<T> are slow, a search is basically a scan of the whole list until you find the element or you find the end of the list. We are so used with List<T> that I saw a lot of developers completely forget the old Array class. Let’s see how we can speed up the lookup with the old Array class.

//Load all id in ascending order into ValidLink List<Int32>
Int32[] links;
links = ValidLink.ToArray();

I’ve simply change the query to the database, asking for id in ascending order, then I populate the list as usual, and finally creates a simple Array of Int32 with the function ToArray(). now you can use BinarySearch

if (Array.BinarySearch<Int32>(links, id) > 0)

A lot of people forget that the Array class natively supports Binary search for ordered array, when I ran the sample again execution time was dropped to 2.392 seconds, and now most of the time is spent reading Disk structure.

Pay attention to the data structure you use ;)

Alk.

6 Responses to “Do not forget Array”

  1. You don’t need to use an array to do a binary search, because List has BinarySearch methods as well. http://msdn.microsoft.com/en-u.....7fxsh.aspx

  2. You are right, I’m so used to Array.BinarySearch because quite often I work with NHibernate that returns me an IList that have not BinarySearch. Now Even when I use the List I forget that the List have binarySearch implemented.

    Thanks for pointing it out. :)

    Alk.

  3. Hi, I bet that if you use a HashSet it will be a little better…
    Don’t be stuck in .net 1.1 structures :)

  4. Generics are not .net 1.1

  5. In your first example, you quote “10.200 milliseconds”, and in your second example, you quote “2.392 seconds” — I think your units are incorrect :-)

  6. @liviu: hashSet are great, but they are avaliable only in .net 3.5, array are avaliable since 1.0 ;) . If you need Set semantics and much more, even in .NET 2.0), wintellect powercollection are really a good choiche.

    @Amar: You are right :) both of the measure are done in millisecond.

    alk.

Leave a Reply