Case sensitivity in Lucene search

We ended the last post with a good knowledge of how to do complex searches in Lucene.NET indexes, but a question remains unresolved: is Lucene.NET search case sensitive or case insensitive? Suppose you search for

+mime +format

Here is the first result returned from the above query.

Figure 1: Searching for +mime +format returns a document that contains MIME in uppercase.

As you can see this document satisfies the query with the word MIME, suggesting that the query is case insensitive (you searched for mime but MIME satisfied the search), but this is not true. If you look at the Lucene documentation you can find that all searches are always case sensitive, and there is no way to do a case insensitive search. This fact usually puzzles users, because searches seem to be case insensitive, so where is the trick? The answer is: StandardAnalyzer transforms all tokens to lowercase before storing them into the index, and QueryParser lowercases all the terms during query parsing.

Figure 2: The query resulting from parsing +MIME +FORMAT

As you can see in Figure 2, I’ve shown the result of parsing the query +MIME +FORMAT; the result correctly has the default field content before each search term, but it also presents every term in lowercase. If you look closely at the constructor of QueryParser you can verify that it needs a reference to the analyzer used to create the index: QueryParser uses that analyzer to apply to search terms the very same manipulation that was applied to tokens before they were stored in the index. If you manually create a query to search for the term MIME (all uppercase) with this simple line of code, you may be surprised by the result.

Query query = new TermQuery(new Term("content", "MIME"));

The index does not return a single result, even though the previous query showed in Figure 1 that the word MIME is present in the original text. The reason is: StandardAnalyzer converted every term to lowercase, so the index contains the term mime, not MIME, and the above query has no results. This example shows how important it is to initialize QueryParser with the very same Analyzer class used to build the index, or you can run into really bad surprises.
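
To see the difference side by side, here is a minimal sketch; it assumes the directory and analyzerStandard objects of the search code shown in the previous post:

QueryParser parser = new QueryParser("content", analyzerStandard);

// The parser runs terms through the analyzer, so MIME becomes mime
// and the query matches documents containing MIME in any casing.
Query parsed = parser.Parse("+MIME +FORMAT");
Console.WriteLine(parsed); // prints something like: +content:mime +content:format

// The hand-built TermQuery bypasses the analyzer: the index only contains
// the lowercased token mime, so searching for MIME returns zero hits.
Query manual = new TermQuery(new Term("content", "MIME"));

using (IndexSearcher indexSearcher = new IndexSearcher(directory, true))
{
    Console.WriteLine(indexSearcher.Search(parsed, 10).TotalHits); // > 0
    Console.WriteLine(indexSearcher.Search(manual, 10).TotalHits); // 0
}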

To build a Case Sensitive index you can simply create your own analyzer, inheriting from StandardAnalyzer and overriding the TokenStream method

    public class CaseSensitiveStandardAnalyzer : StandardAnalyzer
    {
        public CaseSensitiveStandardAnalyzer(Lucene.Net.Util.Version version) : base(version) { }

        public override Lucene.Net.Analysis.TokenStream TokenStream(string fieldName, System.IO.TextReader reader)
        {
            // Same chain used by StandardAnalyzer, minus the LowerCaseFilter.
            StandardTokenizer tokenizer = new StandardTokenizer(reader);
            // StandardFilter handles standard punctuation (apostrophes, dots in acronyms).
            Lucene.Net.Analysis.TokenStream filterStream = new StandardFilter(tokenizer);
            // This is the filter that makes StandardAnalyzer case insensitive: leave it commented out.
            //filterStream = new LowerCaseFilter(filterStream);
            // StopFilter removes English stop words; ignoreCase is true because tokens keep their casing.
            Lucene.Net.Analysis.TokenStream stream = new StopFilter(true, filterStream, StopAnalyzer.ENGLISH_STOP_WORDS_SET, true);
            return stream;
        }
    }

The code inside the TokenStream() method is similar to the one contained in StandardAnalyzer and clearly shows its default configuration. A TokenStream is a specific stream that “tokenizes” the content, splitting the original text into words. For a StandardAnalyzer the base tokenizer is of type StandardTokenizer, a grammar-enabled tokenizer capable of recognizing specific patterns of the English language. In the second line the StandardTokenizer is used as the base stream for a StandardFilter, which handles standard punctuation like apostrophes or dots. As you can see, each new stream is created passing the previous one as a source, building a chain of filters that get applied to the tokens one after another.

The third line is commented out in my analyzer because it is the one that adds the LowerCaseFilter, used to transform each token to lowercase; finally the StopFilter removes stop words that should not be indexed because they are known to be noise in the English language. If you comment out the line that adds the LowerCaseFilter, all the tokens will be stored inside the index with their original casing, obtaining a Case Sensitive index. If you need to support case sensitive and case insensitive search in the same application, you need to build two different indexes in two distinct folders and use the right one during searches.
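
A minimal sketch of that setup: luceneCiDirectory and luceneCsDirectory are hypothetical variables pointing to two distinct folders (the same kind of object as the luceneDirectory variable used in the other snippets), and the AddDocument() loop is omitted.

// Case insensitive index: StandardAnalyzer lowercases every token.
using (FSDirectory ciDirectory = FSDirectory.Open(luceneCiDirectory))
using (Analyzer ciAnalyzer = new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_29))
using (IndexWriter ciWriter = new IndexWriter(ciDirectory, ciAnalyzer, IndexWriter.MaxFieldLength.UNLIMITED))
{
    // ... AddDocument() calls, as in the indexing code of the previous posts ...
}

// Case sensitive index: the custom analyzer keeps the original casing.
using (FSDirectory csDirectory = FSDirectory.Open(luceneCsDirectory))
using (Analyzer csAnalyzer = new CaseSensitiveStandardAnalyzer(Lucene.Net.Util.Version.LUCENE_29))
using (IndexWriter csWriter = new IndexWriter(csDirectory, csAnalyzer, IndexWriter.MaxFieldLength.UNLIMITED))
{
    // ... same AddDocument() calls, different folder ...
}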

With this analyzer I created another index from the first five million posts of the Stack Overflow dump, then I issued the query +Mime +Format, obtaining this result:

Insert search string:+Mime +Format

N°4 found in  9 ms. Press a key to look at top results

As you can see, searching for +Mime +Format returns only 4 results, because now the search is case sensitive, so a word like MIME does not satisfy the query. The cool part is that QueryParser correctly detects that our new analyzer does not apply the LowerCaseFilter, so it does not lowercase terms during query parsing.
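
A minimal sketch of that behaviour, using the CaseSensitiveStandardAnalyzer defined above:

// Built with the same analyzer used to create the case sensitive index,
// the parser leaves the casing of the search terms untouched.
QueryParser parser = new QueryParser("content",
    new CaseSensitiveStandardAnalyzer(Lucene.Net.Util.Version.LUCENE_29));
Query query = parser.Parse("+Mime +Format");
Console.WriteLine(query); // prints something like: +content:Mime +content:Format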

This example shows how you can manage search casing with Lucene and gives you some golden rules:

Rule #1: all Lucene searches are case sensitive; if you build a query directly with the TermQuery class you need to be aware of the casing used by the analyzer.

Rule #2: StandardAnalyzer applies a lowercase filter to tokens to make searches case insensitive.

Rule #3: if you need a Case Sensitive search, you need to create an index with an Analyzer that does not lowercase tokens, such as the above CaseSensitiveStandardAnalyzer.

Rule #4: always pass to the QueryParser constructor the very same type of Analyzer used to create the index, because QueryParser needs it to determine whether search terms should be lowercased when building the query.

Happy searching :)

Gian Maria.

Advanced queries with Lucene.NET

Previous part of the series

In the previous parts I created a Lucene.NET index that contains the content of Stack Overflow posts and I showed some basic searches. Suppose I create a new index where documents are created with these options:

luceneDocument.Add(new Field("Id", reader.GetAttribute("Id"), Field.Store.YES, Field.Index.NOT_ANALYZED));
luceneDocument.Add(new Field("content", reader.GetAttribute("Body"), Field.Store.COMPRESS, Field.Index.ANALYZED));

With this document I’m able to retrieve the original content, because I’ve told Lucene.NET to store the original content inside the index in compressed format (Field.Store.COMPRESS), so I can write a simple search routine for a console application.

using (FSDirectory directory = FSDirectory.Open(luceneDirectory))
using (Analyzer analyzerStandard = new Lucene.Net.Analysis.Standard.StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_29))
{
    QueryParser parser = new QueryParser("content", analyzerStandard);
    String searchString;
    while (!String.IsNullOrWhiteSpace(searchString = RequestGetStringToUser()))
    {
        using (IndexSearcher indexSearcher = new IndexSearcher(directory, true))
        {
            Query query = null;
            try
            {
                query = parser.Parse(searchString);
            }
            catch (ParseException pex)
            {
                Console.WriteLine("search string is not correct");
                continue;
            }

            Stopwatch stopWatch = new Stopwatch();
            stopWatch.Start();
            TopDocs result = indexSearcher.Search(query, 3);
            stopWatch.Stop();
            Console.WriteLine("N°" + result.TotalHits + " found in  " + stopWatch.ElapsedMilliseconds + " ms. Press a key to look at top results");
            Console.Read();
            if (result.ScoreDocs.Length > 0)
            {
                for (int i = 0; i < result.ScoreDocs.Length; i++)
                {
                    var score = result.ScoreDocs[i].Score;
                    Document doc = indexSearcher.Doc(result.ScoreDocs[i].Doc);
                    var Id = doc.Get("Id");
                    var content = doc.Get("content");
                    Console.Clear();
                    Console.WriteLine("Result N°" + i);
                    Console.WriteLine("Post id " + Id + " score " + score);
                    Console.WriteLine(content);
                    Console.ReadKey();
                }
            }
            else
            {
                Console.WriteLine("No result found!");
            }
        }
    }
}

The big advantage of Lucene indexing is the opportunity for the user to specify complex search queries with a natural syntax, letting Lucene take care of everything thanks to the QueryParser object. The only task for the programmer is handling the ParseException that can be raised if the user uses a bad syntax for the query (e.g. +-format). Now you can fire up the program and try some searches, for example Format*, and here is the output.

Insert search string:Format*
N°141649 found in  39 ms. Press a key to look at top results

The amazing thing is the speed of the response: it took 39 milliseconds to find that there are 141649 documents in the index that satisfy our query and to return information about the top results. The secret of this speed is in how the index is constructed internally, and in the TopDocs object returned, which does not contain any document data but only information about how to retrieve the matching documents. In fact you need to retrieve each document inside a for loop to show results to the user.

Figure 1: The results can now contain the original body of the post because it is included in the index.

With such a simple console application you can explore all the different queries you can issue. Remember that this example uses a QueryParser that was constructed specifying content as the default field to search, so the user can use a simpler syntax, for example:

  • Format AND MIME: contains the word Format and the word MIME in the document.
  • Format* AND MIME: contains the word MIME and a word that starts with Format.
  • Forma?: contains a word that starts with Forma and has exactly one more character at the end.
  • Forma?????: contains a word that starts with Forma followed by exactly five more characters (e.g. FormaTTING).
  • Format OR MIME: contains the word MIME or the word Format.
  • Format AND NOT MIME: contains the word Format but not the word MIME.
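
If you are curious about what the parser actually builds from these strings, here is a minimal sketch; parser is the QueryParser of the console application above, and the comments show roughly what Query.ToString() prints:

Console.WriteLine(parser.Parse("Format AND MIME"));     // +content:format +content:mime
Console.WriteLine(parser.Parse("Format OR MIME"));      // content:format content:mime
Console.WriteLine(parser.Parse("Format* AND MIME"));    // +content:format* +content:mime
Console.WriteLine(parser.Parse("Format AND NOT MIME")); // +content:format -content:mime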

Boolean conditions can also be specified with plus and minus symbols, e.g.:

  • Format MIME: if you omit any boolean operator the default is OR, so this search string is equivalent to Format OR MIME.
  • +Format +MIME: the plus sign is the same as AND, so this search is equivalent to Format AND MIME.
  • +Format -MIME: contains the word Format but not the word MIME (prefixed with the minus sign). Since the plus sign is not mandatory, it is also equivalent to Format -MIME.

Clearly you can use parentheses to alter the precedence, e.g.:

  • Format AND (MIME OR Email): contains the word Format and either the word MIME or the word Email. With such a query, if the index contains many documents, the first results returned will probably contain both MIME and Email, because their match score is higher.
  • +Format* -MIME +(email OR message): contains a word that starts with Format, does not contain the word MIME, but contains email or message.

Then you have some advanced syntax to satisfy more specific searches

  • “MIME Format”: with the double quotes you are asking for an exact phrase; it matches a document only if it contains the exact text MIME Format.
  • disambiguation~: the tilde character asks for a fuzzy search, so it can match words that are similar to disambiguation. This kind of query takes more time to execute, because the index is not optimized to satisfy it.
  • “Format Mime”~3: the document should contain the words Format and Mime, and the relative position between the two should not exceed three terms.
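
The last query can also be built in code instead of being parsed; a minimal sketch, where indexSearcher is the searcher of the console application above and the terms are written in lowercase because that is how the standard analyzer stored them:

// Programmatic equivalent of "Format Mime"~3: a PhraseQuery with a slop of 3.
PhraseQuery phrase = new PhraseQuery();
phrase.Add(new Term("content", "format"));
phrase.Add(new Term("content", "mime"));
phrase.SetSlop(3);
TopDocs hits = indexSearcher.Search(phrase, 10);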

As you can see you can express quite complex queries, but you may be surprised that a query like *ormat does not work and throws an exception when parsed. The exact exception is:

Cannot parse ‘*ormat’: ‘*’ or ‘?’ not allowed as first character in WildcardQuery

To make this query work you need to activate an option on the QueryParser by calling the SetAllowLeadingWildcard() method, passing true as the single argument. Once you have set this option you can search with a leading wildcard, but the response time is much slower, because for a standard index it requires a full traversal of the index itself. As an example, searching for *ormat takes 6957 milliseconds on my system while Forma* takes only 22 milliseconds (the index is on a fast solid state disk).
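
A minimal sketch of the option, using the same parser and searcher of the console application above:

// Allow queries that start with a wildcard, accepting the slower response time.
parser.SetAllowLeadingWildcard(true);
Query query = parser.Parse("*ormat");
TopDocs result = indexSearcher.Search(query, 10);
Console.WriteLine(result.TotalHits);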

Gian Maria.

Getting started with Lucene.NET–Searching

Previous part of the series

In the previous part I showed how easy it is to create an index with Lucene.NET; in this post I’ll start to explain how to search into it. First of all I need a more interesting example, so I decided to download a dump of Stack Overflow and extract the Posts.xml file (a 10 GB XML file), then I wrote this simple piece of code to create the Lucene index.

using (FSDirectory directory = FSDirectory.Open(luceneDirectory))
using (Analyzer analyzerStandard =
    new Lucene.Net.Analysis.Standard.StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_29))
using (IndexWriter indexWriter = new IndexWriter(directory, analyzerStandard, IndexWriter.MaxFieldLength.UNLIMITED))
{
    Int32 i = 0;
    using (XmlReader reader = XmlReader.Create(@"D:\posts.xml"))
    {
        while (reader.Read())
        {
            if (reader.NodeType == XmlNodeType.Element &&
                reader.Name == "row")
            {
                Document luceneDocument = new Document();
                luceneDocument.Add(new Field("Id", reader.GetAttribute("Id"), Field.Store.YES, Field.Index.NO));
                luceneDocument.Add(new Field("content", reader.GetAttribute("Body"), Field.Store.NO, Field.Index.ANALYZED));
                indexWriter.AddDocument(luceneDocument);
                Console.CursorLeft = 0;
                Console.Write("Indexed Documents:" + ++i);
            }
        }
    }

    indexWriter.Optimize();
    indexWriter.Commit();
}

This code is really similar to the previous post; the only difference is that the Directory used to store the index is an FSDirectory (File System Directory), because I want to create a permanent index on disk. Then I simply use an XmlReader to scan the file (a 10 GB XML file needs to be read with an XmlReader, because other methods will give you performance trouble), and I decided to analyze the attribute “Body” of the <row> node, storing the Id of the post.

Indexing such a huge amount of data takes time, but the most important thing I want to point out is the call to Optimize() before the call to Commit(). Basically Lucene.NET indexes are composed of segments of various lengths: the more segments an index has, the worse search performance gets, but if you call the Optimize() method Lucene will collapse the index into a single segment, maximizing search performance. Remember that optimization is a long and time consuming process, because Lucene needs to read the whole index, merge the segments and rewrite a single file, so it is worth calling it during low system usage (e.g. during the night when the system is idle), or after a big change to the index. You can also pass an integer to Optimize, specifying the maximum number of segments you want in the index; as an example you can specify 5 if you want your index to contain at most 5 segments (a good tradeoff, because it saves time and still gives a well performing index).
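
A minimal sketch of the two variants:

indexWriter.Optimize();  // collapse the whole index into a single segment
// or, as a compromise between optimization time and search speed:
indexWriter.Optimize(5); // merge segments until at most 5 remain
indexWriter.Commit();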

In the above example the call to Optimize could be avoided entirely, because if you keep adding documents to the very same index, Lucene tries to keep the index optimized during writing; if you run this program you will see lots of files of about 7 MB created in the FSDirectory, and after a little while files get merged together so you see fewer and larger files. The call to Optimize is really necessary if you modify the index many times, closing and reopening the IndexWriter. Remember also that until you call Commit, an IndexSearcher opened on the index directory will not see any of the newly indexed documents.

After the index is created you can search on it simply by using an IndexSearcher and a QueryParser.

using (FSDirectory directory = FSDirectory.Open(luceneDirectory))
using (Analyzer analyzerStandard = new Lucene.Net.Analysis.Standard.StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_29))
{
    QueryParser parser = new QueryParser("", analyzerStandard);
    using (IndexSearcher indexSearcher = new IndexSearcher(directory, true))
    {
        var query = parser.Parse("content:child*");
        TopDocs result = indexSearcher.Search(query, 20);
        Console.WriteLine("N° results in index:" + result.TotalHits);
        for (int i = 0; i < result.ScoreDocs.Length; i++)
        {
            var score = result.ScoreDocs[i].Score;
            Document doc = indexSearcher.Doc(result.ScoreDocs[i].Doc);
            var Id = doc.Get("Id");
            Console.WriteLine("Match id " + Id + " score " + score);
        }
    }
}

The code is really simple: I open the directory and create an analyzer, then I need a QueryParser, an object capable of parsing queries from strings, which is really useful to parse user-entered search strings. In my example I search all the documents where the field content contains the word child*, because the * (asterisk) character matches any number of characters. This query will match Child, Children, and so on. The query is in the form fieldname:searchstring, but the first parameter of the QueryParser constructor is the default field; this means that if you create the QueryParser in this way

QueryParser parser = new QueryParser("content", analyzerStandard);

You can simply parse the query “child*” instead of specifying “content:child*”, because the QueryParser automatically issues the query against the field content. The Lucene query syntax permits you to specify more complex queries like “+child* +position*”, which matches all documents that contain both child* and position*. You can use AND or OR and other advanced techniques, for example the query “child* position*”~10, which searches for child* and position* but matches only if the distance between them is less than or equal to 10 words. You can also search for similarity: if you search for Children~ you are searching for terms similar to Children, so you can match terms like Chidren, a misspelled version of the word you are searching for.
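
The same kinds of query can also be built in code; a minimal sketch, where indexSearcher is the one of the snippet above and the term text is lowercase because StandardAnalyzer lowercases tokens at index time:

// Equivalent of parsing "content:child*".
Query wildcard = new WildcardQuery(new Term("content", "child*"));

// Equivalent of parsing "content:Children~": a fuzzy match on similar terms.
Query fuzzy = new FuzzyQuery(new Term("content", "children"));

TopDocs wildcardHits = indexSearcher.Search(wildcard, 20);
TopDocs fuzzyHits = indexSearcher.Search(fuzzy, 20);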

The result of a search is a simple object of type TopDocs that contains all the matching documents in the ScoreDocs array; it also contains the total number of matches in the TotalHits field. To show results you can simply cycle through ScoreDocs to get information about the documents that matched the query. In this example, since I did not include the body in the index (Field.Store.NO), I can only retrieve the field “Id” from the documents returned by the query, and I need to reopen the original XML file if I want to know the Body of the matching post. If you do not want to reopen the original XML file to get the Body of the post, you can change the storage of the content field to Field.Store.COMPRESS.

luceneDocument.Add(new Field("Id", reader.GetAttribute("Id"), Field.Store.YES, Field.Index.NOT_ANALYZED));
luceneDocument.Add(new Field("content", reader.GetAttribute("Body"), Field.Store.COMPRESS, Field.Index.ANALYZED));

In this example I changed the Id field from Field.Index.NO to Field.Index.NOT_ANALYZED; this means that the Id field is not analyzed, but you can search it for exact matches. If you leave the value as Field.Index.NO, as in the previous snippet, a query like “Id:100” to find the document with Id = 100 will return no results. The content field changed from Field.Store.NO to Field.Store.COMPRESS; this means that the entire unchanged value of the field is included in the index in compressed format and can be retrieved from the result of a query. Now you can get the original unchanged content by calling doc.Get(“content”). The reason you need to include the content in the index with Field.Store.COMPRESS is that an index completely loses the original structure of a field marked Field.Index.ANALYZED: the index only contains terms, so the original text is lost. Clearly such an index occupies more space, but with compression it is a good tradeoff, because you are immediately able to retrieve the original text without going back to the original store (our 10 GB XML file in this example).
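
With the Id field marked NOT_ANALYZED you can now look a post up by its exact id; a minimal sketch, where indexSearcher is the searcher of the snippet above and 100 is just an example id:

// NOT_ANALYZED fields are stored as a single term, so an exact TermQuery is enough.
Query byId = new TermQuery(new Term("Id", "100"));
TopDocs hit = indexSearcher.Search(byId, 1);
if (hit.TotalHits > 0)
{
    Document doc = indexSearcher.Doc(hit.ScoreDocs[0].Doc);
    Console.WriteLine(doc.Get("content"));
}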

Just to conclude this second part I want to summarize the various usages of the Field.Store and Field.Index values for document fields.

A combination of Field.Store.YES and Field.Index.NO is usually used to store data inside the document that will not be searched; it is useful for database primary keys or other metadata that you need to retrieve from the result of a search, but that you do not need to use in a search query.

A combination of Field.Store.YES and Field.Index.NOT_ANALYZED or Field.Index.NOT_ANALYZED_NO_NORMS is used for fields that you want to use in a search, but that should be treated as a single value, for example URLs, single words, database keys that you want to use in queries, and identifiers in general. You should use NOT_ANALYZED_NO_NORMS if you want to save index space and will not use index boosting (an advanced feature of Lucene).

A combination of Field.Store.YES (or Field.Store.COMPRESS) and Field.Index.ANALYZED is used to store text you want to search into and that you also want to retrieve from query results. This is useful if the original text is part of a document or of a large file (as in this example) and retrieving it is a time consuming operation, so it can be better to store it in the index.

A combination of Field.Store.NO and Field.Index.ANALYZED is used to store text you want to search into but are not interested in retrieving from query results. This is useful if you have the original text in a database and you can retrieve it with a single fast query if needed.
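
The four combinations side by side, as a minimal sketch (field names and values are only illustrative):

Document doc = new Document();
String id = "42";
String url = "http://www.example.com";
String body = "some long text to index";

// Store only, never searched: database primary keys and other metadata.
doc.Add(new Field("Id", id, Field.Store.YES, Field.Index.NO));

// Searchable as a single exact value: URLs, single words, identifiers.
doc.Add(new Field("Url", url, Field.Store.YES, Field.Index.NOT_ANALYZED));

// Full text search, original text retrievable from the index.
doc.Add(new Field("Body", body, Field.Store.COMPRESS, Field.Index.ANALYZED));

// Full text search only, original text retrieved from somewhere else (e.g. a database).
doc.Add(new Field("BodyFromDb", body, Field.Store.NO, Field.Index.ANALYZED));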

Gian Maria

Getting Started with Lucene.net

I started working with Lucene.Net and I should admit that it is a really powerful library, but it is also really huge and needs a bit of time to be mastered completely. Probably one of the best resources to keep in mind is the FAQ, because it contains most of the common questions you can have on Lucene and it is a good place to start. Another good place is the Wiki, which contains other useful information and many links to relevant resources.

Getting started with Lucene.net is really simple: after you grab the bits and add a reference to your project you are ready to search in your “documents”. Lucene has a set of basic concepts that you need to grasp before starting to use it: Analyzers process documents to create indexes, indexes are stored in a Directory and permit fast searches, and searches are done with an IndexSearcher, which is capable of searching data inside a directory previously populated through an IndexWriter and an analyzer. Now let’s see how you can index two long strings of text:

using (RAMDirectory directory = new RAMDirectory())
using (Analyzer analyzer = new Lucene.Net.Analysis.Standard.StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_29))
{
    String test = "la marianna la va in campagna......";
    String test2 = "Lorem Ipsum è un testo segnaposto .....";
    using (IndexWriter ixw = new IndexWriter(directory, analyzer, IndexWriter.MaxFieldLength.UNLIMITED))
    {

        Document document = new Document();

        document.Add(new Field("Id", test.GetHashCode().ToString(), Field.Store.YES, Field.Index.NOT_ANALYZED, Field.TermVector.NO));
        document.Add(new Field("content", test, Field.Store.YES, Field.Index.ANALYZED, Field.TermVector.WITH_POSITIONS_OFFSETS));
        ixw.AddDocument(document);

        document = new Document();
        document.Add(new Field("Id", test2.GetHashCode().ToString(), Field.Store.YES, Field.Index.NOT_ANALYZED, Field.TermVector.NO));
        document.Add(new Field("content", test2, Field.Store.YES, Field.Index.ANALYZED, Field.TermVector.WITH_POSITIONS_OFFSETS));
        ixw.AddDocument(document);
        ixw.Commit();
    }
}

The code can seem complex, but it is simpler than you might think if you look at it carefully. First of all we need to create a Directory where we want to store the index; for this sample I use a RAMDirectory, which simply stores everything in RAM and is really fast and useful when you need to do quick searches into text and do not want to keep the index for future searches. After the Directory you need to create an Analyzer, the component capable of analyzing the text. Notice how both the Directory and the Analyzer are in a using clause, because you need to dispose them when you do not need them anymore.

Then I have two long strings to index, so I create an IndexWriter that uses the directory and the analyzer previously created, and finally I call AddDocument() to add documents to the index. In Lucene a Document is nothing more than a bunch of key and value pairs containing the data you want to go into the index. The complexity of creating a document is deciding what to do with each pair, because you need to tell Lucene exactly what you want to be indexed and/or included in the index. If you look at the Field constructor, the first two parameters are the name and the value of the field, but they are followed by some specific Lucene enum values.

The first one is storage information; it can be YES, COMPRESS or NO, and it basically tells Lucene if the content of the field should be stored in the index (YES), stored compressed (COMPRESS) or not stored at all (NO). You need to store content in a document only if you are interested in retrieving it during a search. Suppose you are writing an indexing system for data stored in an external relational database, where you have a table with two columns called Id and Content: if you want to index that table with Lucene to find the ids of documents that contain specific text, you want to store in the index the original Id of the database row, so you can retrieve it with a search. When you issue a search you will be able to retrieve that value from the documents returned by the query.

The second parameter is an enum that tells Lucene if the field should be analyzed; in my example the Id field (I used the hash code of the string to create a unique id for this quick example, but in the previous scenario it would be the id of a table row in the database) is not analyzed, because I do not need to search inside its content. The final constant tells Lucene if we want to store in the index the positions of the various words contained in the text; again, for the Id field I do not need to analyze anything. For the content field I decided to store it in the index (Field.Store.YES) because I want the original text to be included in the index, I want it to be analyzed (Field.Index.ANALYZED) because I want to search text inside it, and finally I want to store term counts with positions and offsets (Field.TermVector.WITH_POSITIONS_OFFSETS).

Finally I call the Commit() method of the IndexWriter to make it flush everything to the Directory; if you forget to call Commit and open an IndexSearcher that points to the same RAMDirectory, you will probably not find all the indexed documents, because the IndexWriter caches results in memory and does not write to the directory every time AddDocument is called.
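
As a quick check, here is a minimal sketch of what you can do right after Commit(), while directory is still open (searching is covered in detail in the next post):

// A searcher opened after Commit() sees the two documents just added.
using (IndexSearcher searcher = new IndexSearcher(directory, true))
{
    TopDocs hits = searcher.Search(new TermQuery(new Term("content", "lorem")), 10);
    Console.WriteLine(hits.TotalHits); // 1: the Lorem Ipsum document
}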

In the next post I’ll show how easy it is to search inside a Lucene index directory.

Gian Maria.