I wrote a stupid util that does this operation: it scans a folder with 500.000 + files (in various subfolder), all files have name that contains an integer that is an id of a row in a database. I need to find orphan files, so I simply take file name, es: myfile_1002.txt, extracts the number 1002 and then verify if in a specific table of the database there is a row with index 1002.
To avoid poor performance I eager load all valid ids in memory (each id is an In32 (4 bytes) and I have 100k of them, so I’m keeping in memory 100k * 4 bytes 400kbytes of RAM). Now I load all these id in a List<Int32>, then iterate into all files of a directory, for each file parse the name and verify that the List contains the id. To analyze 200k files the whole routine took 10.200 milliseconds. It is quite speedy, but if you profile it you can verify that most of the time was spent into checking that an id is contained in the list.
This is because search operations in List<T> are slow, a search is basically a scan of the whole list until you find the element or you find the end of the list. We are so used with List<T> that I saw a lot of developers completely forget the old Array class. Let’s see how we can speed up the lookup with the old Array class.
//Load all id in ascending order into ValidLink List<Int32> Int32[] links; links = ValidLink.ToArray();
I’ve simply change the query to the database, asking for id in ascending order, then I populate the list as usual, and finally creates a simple Array of Int32 with the function ToArray(). now you can use BinarySearch
if (Array.BinarySearch<Int32>(links, id) > 0)
A lot of people forget that the Array class natively supports Binary search for ordered array, when I ran the sample again execution time was dropped to 2.392 seconds, and now most of the time is spent reading Disk structure.
Pay attention to the data structure you use
Alk.






June 12th, 2009 at 10:31 am
You don’t need to use an array to do a binary search, because List has BinarySearch methods as well. http://msdn.microsoft.com/en-u.....7fxsh.aspx
June 12th, 2009 at 11:44 am
You are right, I’m so used to Array.BinarySearch because quite often I work with NHibernate that returns me an IList that have not BinarySearch. Now Even when I use the List I forget that the List have binarySearch implemented.
Thanks for pointing it out.
Alk.
June 12th, 2009 at 1:28 pm
Hi, I bet that if you use a HashSet it will be a little better…
Don’t be stuck in .net 1.1 structures
June 13th, 2009 at 6:10 am
Generics are not .net 1.1
June 13th, 2009 at 12:40 pm
In your first example, you quote “10.200 milliseconds”, and in your second example, you quote “2.392 seconds” — I think your units are incorrect
June 14th, 2009 at 12:56 pm
@liviu: hashSet are great, but they are avaliable only in .net 3.5, array are avaliable since 1.0
. If you need Set semantics and much more, even in .NET 2.0), wintellect powercollection are really a good choiche.
@Amar: You are right
both of the measure are done in millisecond.
alk.