Highlight matched text inside documents indexed with Solr plus Tika

I've already covered how to index documents with Solr and Tika, and in this article I'll explain how you can not only search for documents that match your query, but also return some text extracts that show where the document matches it. To achieve this you should store the full content of the document inside your index. I usually create a couple of fields: one called content that holds the content of the file, and a copyField directive ( <copyField source="content" dest="text"/> ) that automatically copies that value into the catch-all field called text.

   <field name="content" type="text_general" indexed="false" stored="true" multiValued="false"/>
   <field name="text" type="text_general" indexed="true" stored="false" multiValued="true"/>

The text field is multivalued and not stored; it is only indexed to permit searching across the various fields of the document. The content field stores the text extracted by Tika and it is useful both for highlighting and for troubleshooting extraction problems, because it contains the exact text Tika extracted.

Now suppose you want to search for the term Branch and also highlight the parts of the text where that term is found: you can simply issue a query that asks for highlighting.

http://localhost:8080/TestInstance/tikacrawl/select?q=text%3Abranch&fl=id%2Ctitle%2Cauthor&wt=xml&hl=true&hl.snippets=20&hl.fl=content&hl.usePhraseHighlighter=true

This simple query asks for documents whose text contains the word branch; with fl= I extract only the id, title and author fields, wt=xml asks for XML output, and hl=true asks for snippets of matching text. hl.snippets=20 instructs Solr to return a maximum of 20 snippets and hl.usePhraseHighlighter=true uses a specific highlighter that tries to extract a whole phrase from the text. The most important parameter is hl.fl=content, which specifies the field of the document that contains the text used for highlighting. In the results, after all the matching documents, there is a new section that contains the highlights for each document.
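As a side note, if you prefer to query the index from code rather than from the browser, here is a quick sketch of mine (not part of the original setup) that issues the same query with WebClient and LINQ to XML and prints the highlighting section of the XML response; host, port and core name are the ones used in the URL above.

using System;
using System.Linq;
using System.Net;
using System.Xml.Linq;

class HighlightQuerySample
{
    static void Main()
    {
        //Same highlight query shown above.
        string url = "http://localhost:8080/TestInstance/tikacrawl/select" +
                     "?q=text%3Abranch&fl=id%2Ctitle%2Cauthor&wt=xml" +
                     "&hl=true&hl.snippets=20&hl.fl=content&hl.usePhraseHighlighter=true";

        using (var client = new WebClient())
        {
            XDocument response = XDocument.Parse(client.DownloadString(url));

            //The highlighting section is a <lst name="highlighting"> element containing
            //one <lst> per matching document, keyed by the document id.
            var highlighting = response.Root
                .Elements("lst")
                .First(e => (string)e.Attribute("name") == "highlighting");

            foreach (var doc in highlighting.Elements("lst"))
            {
                Console.WriteLine("Document: " + (string)doc.Attribute("name"));
                foreach (var snippet in doc.Descendants("str"))
                {
                    Console.WriteLine("  " + snippet.Value);
                }
            }
        }
    }
}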

Figure 1: Highlight for the TFS Branching Guid – Scenarios 2.0.pdf file

The name of each element matches the id of the document (in my configuration the full path of the file), and a list of highlights follows. But the true power of Solr comes out when you start to use language-specific fields.

   <field name="content" type="text_en" indexed="false" stored="true" multiValued="false"/>
   <field name="text" type="text_en" indexed="true" stored="false" multiValued="true"/>

I've just changed in schema.xml the type of the content and text fields from text_general to text_en; this simple modification enables a more specific, English-aware analyzer (including stemming), which gives you real full-text search. Suppose you want to find all documents that deal with branching strategies; here is a possible query.

http://localhost:8080/TestInstance/tikacrawl/select?q=text%3A%22branch+strategy%22~3&fl=id%2Ctitle&wt=xml&hl=true&hl.snippets=5&hl.fl=content&hl.usePhraseHighlighter=true&hl.fragsize=300

The key is the search expression text:"branch strategy"~3, which states that I'm interested in documents containing both the branch and strategy terms within a relative distance of no more than three words. Since the text was indexed with the text_en field type I get full-text (stemmed) search, and the highlights confirm it.

Figure 2: Highlights for a proximity query with full-text search; as you can see the word branching matches even though I searched for branch

And voilà! You have full-text search inside file content with a minimal amount of work and a simple REST interface for querying the index.

Gian Maria.

Import a folder of documents with Apache Solr 4.0 and Tika

In a previous article I showed how simple it is to import data from a SQL database into Solr with a Data Import Handler; in this article I'll use a similar technique to import documents stored inside a folder.

This feature is provided by the integration with Tika, an open source document analyzer capable of extracting text from many file formats. Thanks to this library Solr can crawl an entire directory, indexing every document inside it with really minimal configuration. Apache Tika is a standalone project (you can find the list of supported formats here) and you can use it directly from your Java or .NET code, but thanks to the Solr integration setting everything up is a real breeze.

First of all you need to copy all the required jars from the Solr distribution into the lib subdirectory of your core. I strongly suggest grabbing all the files inside the contrib\extraction\lib subdirectory of the Solr distribution and copying them into your core; this way you can use every Data Import Handler you want without running into errors caused by missing libraries.

To import all the files you can simply configure an import handler as I described in the previous article; here is the full configuration.

<dataConfig>
	<dataSource type="BinFileDataSource" />
	<document>
		<entity name="files" dataSource="null" rootEntity="false"
				processor="FileListEntityProcessor"
				baseDir="c:/temp/docs" fileName=".*\.(doc)|(pdf)|(docx)"
				onError="skip"
				recursive="true">
			<field column="fileAbsolutePath" name="id" />
			<field column="fileSize" name="size" />
			<field column="fileLastModified" name="lastModified" />

			<entity
				name="documentImport"
				processor="TikaEntityProcessor"
				url="${files.fileAbsolutePath}"
				format="text">
				<field column="file" name="fileName"/>
				<field column="Author" name="author" meta="true"/>
				<field column="title" name="title" meta="true"/>
				<field column="text" name="text"/>
			</entity>
		</entity>
	</document>
</dataConfig>


This is a really simple import configuration, but there are some key points you should be aware of.

All of my schema.xml files have a unique key field called id, which identifies the document and is the key used by Solr to decide whether a document should be inserted or updated. The files entity uses a FileListEntityProcessor, which crawls a directory looking for files and extracts standard values such as the name of the file, the last modified date and so on, but it does not read the text inside the file; the BinFileDataSource is what later provides the binary content of each file. The entity has rootEntity="false" because it is not the real root entity that will be indexed. The other attributes simply state the folder to crawl, the document extensions to index and so on.

That entity exposes columns related to file attributes, and I've decided to include three of them in the document:

  1. fileAbsolutePath, which will be used as the unique id of the document
  2. fileSize, which contains the size of the file in bytes
  3. fileLastModified, which contains the last modification date of the document

After these three fields there is another entity, based on the TikaEntityProcessor, that extracts both text and metadata from the document. This is the real entity that gets indexed and it is the one that uses the Tika library to extract all the information from the documents, not only the file-related values: basically all of the text plus the file metadata attributes, if present. Here is the list of fields I want to store inside my index:

  1. file: contains the file name (as opposed to the id field, which contains the full path of the file)
  2. Author and title: both are document metadata, extracted by Tika if present in the file
  3. text: contains the text of the document

Clearly you should define all of these fields accordingly inside your schema.xml file.

   <field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false" /> 
   <field name="fileName" type="string" indexed="true" stored="true" />
   <field name="author" type="string" indexed="true" stored="true" />
   <field name="title" type="string" indexed="true" stored="true" />
   
   <field name="size" type="plong" indexed="true" stored="true" />
   <field name="lastModified" type="pdate" indexed="true" stored="true" />
   
   <field name="text" type="text_general" indexed="true" stored="false" multiValued="true"/>

This is all you need to do, really! You can now toss some documents inside the specified folder, then go to the Solr admin console and execute the import.
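If you want to script the import instead of clicking in the UI, you can call the DataImportHandler endpoint directly. This is just a quick sketch of mine: it assumes the handler is registered under /dataimport in solrconfig.xml (the usual name, adjust it to your configuration) and uses the same host, port and core as above.

using System;
using System.Net;

class TriggerImport
{
    static void Main()
    {
        using (var client = new WebClient())
        {
            //Start a full import; clean=true wipes the index first, commit=true commits at the end.
            string importUrl = "http://localhost:8080/TestInstance/tikacrawl/dataimport" +
                               "?command=full-import&clean=true&commit=true";
            Console.WriteLine(client.DownloadString(importUrl));

            //The import runs asynchronously: poll the status command to see when it is done.
            string statusUrl = "http://localhost:8080/TestInstance/tikacrawl/dataimport?command=status";
            Console.WriteLine(client.DownloadString(statusUrl));
        }
    }
}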

Figure 1: Importing documents from Solr Web UI

If you get errors during the import process, refer to the Solr logs to understand what went wrong. The most common problems are missing jar files in the lib directory and documents missing some mandatory field specified in schema.xml (such as a missing id).

Now you can run a query and look at what is contained inside your index.

Figure 2: A standard catch-all query to verify what is inside your index

One of the cool features of Tika is extracting metadata and text from your files; as an example you can search for text that contains the word "rebase" with the query text:rebase.

Figure 3: Result of a search inside the text of the document

As you can see, it finds the progit.en.pdf book, and if you look at the properties you can see that this book contains neither author nor title metadata, because they are missing from the original document. Since those fields are not mandatory nothing bad happens when they are absent, and you are still able to search inside the original text of the PDF file.

Gian Maria

Index your blog using tags and Lucene.NET

In the last part of my series on Lucene I showed how simple it is to add tags to documents to get a simple tag-based categorization; now it is time to explain how you can automate this process using some advanced characteristics of Lucene. First of all I wrote a specialized analyzer called TagSnowBallAnalyzer, based on the standard SnowballAnalyzer plus a series of keywords associated to various tags; here is how I construct it.

TagSnowBallAnalyzer tagSnowball = new TagSnowBallAnalyzer("English");
tagSnowball.AddTag("Nhibernate", "orm", 5);
tagSnowball.AddTag("HQL", "orm", 3);
...

With the above code I'm telling my analyzer that the word NHibernate is related to the tag "orm" with a weight of 5, that the keyword HQL is related to the same tag with a weight of 3, and so on. Now that the analyzer has a series of words associated to tags, it can use this information to automatically add tags to documents during the analysis process. All the work is done inside a specialized TagFilter that adds tags as synonyms during tokenization.

public class TagFilter : TokenFilter
{
    public IDictionary<String, Tag> tags;

    public TagFilter(TokenStream inStream, IDictionary<String, Tag> tags)
        : base(inStream)
    {
        currentTagStack = new Stack<Tag>();
        termAttr = (TermAttribute)AddAttribute(typeof(TermAttribute));
        posIncrAttr = (PositionIncrementAttribute)AddAttribute(typeof(PositionIncrementAttribute));
        payloadAttr = (PayloadAttribute)AddAttribute(typeof(PayloadAttribute));
        this.tags = tags;
    }

    private Stack<Tag> currentTagStack;
    private State currentState;
    private TermAttribute termAttr;
    private PositionIncrementAttribute posIncrAttr;
    private PayloadAttribute payloadAttr;

    /// <summary>
    /// This is the function that I need to increment the token and return other token
    /// </summary>
    /// <returns></returns>
    public override bool IncrementToken()
    {
        if (currentTagStack.Count > 0)
        {
            //Still have synonym to return
            Tag tag = currentTagStack.Pop();
            RestoreState(currentState);
            termAttr.SetTermBuffer(tag.ConvertToToken());
            posIncrAttr.SetPositionIncrement(0);
            payloadAttr.SetPayload(new Payload(PayloadHelper.EncodeInt(tag.Weight)));
            return true;
        }
        //verify if the base stream still have token 
        if (!input.IncrementToken())
            return false;

        //token was incremented
        String currentTerm = termAttr.Term();
        if (tags.ContainsKey(currentTerm))
        {
            var tag = tags[currentTerm];
            currentTagStack.Push(tag);
        }
        currentState = CaptureState();
        return true;
    }
}

There is various code around the net on how to add synonyms with weights, like the one described in this Stack Overflow question, and standard Java Lucene has a SynonymTokenFilter in its codebase, but this example shows how simple it is to write a filter that adds tags as synonyms of related words. First of all the filter is initialized with a dictionary of keywords and Tags, where Tag is a simple helper class that stores the tag string and its relative weight; it also has a ConvertToToken() method that returns the tag enclosed by the | (pipe) character. The pipe character is used to explicitly mark tags in the token stream: any word enclosed by pipes is, by convention, a tag.
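Just to keep the post self-contained, here is a minimal sketch of what the Tag helper class looks like, based on the description above and on how it is used later in the post; the real implementation is in the sample code.

public class Tag
{
    public String TagName { get; set; }
    public Int32 Weight { get; set; }

    //Encloses the tag in pipe characters, e.g. "orm" becomes "|orm|",
    //so that tags are easily recognizable inside the token stream.
    public String ConvertToToken()
    {
        return "|" + TagName + "|";
    }
}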

The TagFilter shown above is a standard customization of a TokenFilter, the component that filters tokens during the analysis process. All the work is done inside the IncrementToken method. Let's start examining this method from the middle: a call to the IncrementToken() method of the base stream advances to the next term, and if there is another term in the stream I grab its value with a call to the termAttr.Term() method. The base token has no properties; it only contains attributes managed by external objects, which permits adding whatever attribute you want without changing base code or adding properties to a base token class. Once I have a reference to the string representing the next token it is time to handle tag management.

If the current word is associated to a tag I push the related tag onto the stack and return the current token without any modification. When the method returns, another call is made to IncrementToken() to grab the next token; now there is a Tag on the stack, meaning that the previous word was associated to that tag. I pop the tag from the stack, call RestoreState() on the base class and finally call SetTermBuffer() passing the converted token (the tag surrounded by pipe characters) to set the tag as the value of the next token. The call to SetTermBuffer actually sets the value of the current term, and the following call to SetPositionIncrement() tells the analyzer that this new term has the same position in the stream as the previous one. What I've actually done with this code is insert a synonym of the previous token; this is a standard technique you can use even if you have an English dictionary and want to do synonym-based analysis.

Finally I call SetPayload() to attach the weight of the tag as a payload to the current term. A payload is basically a stream of bytes associated to a term; it is stored inside the index and can be used during the query process. To ease the handling of this stream of bytes the PayloadHelper class should be used to save and retrieve values, avoiding raw manipulation of the payload bytes (if you run the sample with a debugger and step through the code everything will be clear). Now you can use this new filter inside the custom analyzer.

public override TokenStream TokenStream(string fieldName, TextReader reader)
{
    TokenStream stream = new StandardTokenizer(Lucene.Net.Util.Version.LUCENE_29, reader);
    stream = new StandardFilter(stream);
    stream = new LowerCaseFilter(stream);
    if (this.stopSet != null)
    {
        stream = new StopFilter(true, stream, this.stopSet);
    }
    //Stemming
    stream = new SnowballFilter(stream, this.name);
    //Tag injection: add tags as synonyms of the words they are associated with
    return new TagFilter(stream, tags);
}

This code is inside my custom analyzer: it is essentially the standard token stream construction of the SnowballAnalyzer, but the last line adds my new TagFilter to the chain of filters to perform tag analysis. Now suppose you feed the string "I usually use nhibernate in all of my projects" to this analyzer; here is the resulting token stream.

1:[i] 2:[usual] 3:[use] 4:[nhibern] 4:[|orm|] 5:[in] 6:[all] 7:[of] 8:[my] 9:[project]

As you can see the word nhibernate was stemmed by the Snowball filter, but the key fact is that position 4 of the resulting stream holds two terms: the first is the (stemmed) original word nhibernate, the second is the tag I've inserted as a synonym. Since both terms occupy position 4 of the stream, they are equivalent in the index and are to all extents managed like synonyms.

Now I need another class to manage my payloads at query time; this is necessary because I want to use the payload to alter the match score of documents during the query process.

public class TagWeightSimilarity : DefaultSimilarity
{
    /// <summary>
    /// Decode the integer weight stored in the term payload and use it as the payload score;
    /// terms without a payload keep the default score of 1.
    /// </summary>
    public override float ScorePayload(int docId, string fieldName, int start, int end, byte[] payload, int offset, int length)
    {
        if (payload != null)
            return PayloadHelper.DecodeInt(payload, offset);
        return 1;
    }
}

TagWeightSimilarity is a class that uses the payload of a term to return a value that will be used during score calculation. If you look at the sample code you will find this code for a typical tag query.

indexSearcher.SetSimilarity(new TagWeightSimilarity());

Query query = new PayloadTermQuery(
    new Term("content", "|orm|"),
    new SumPayloadFunction());

As you can see I set the similarity on the index searcher and use a specific query of type PayloadTermQuery, a specialized version of the standard TermQuery that accepts not only a term but also a class capable of combining payload values. The SumPayloadFunction class is really simple: it combines the payload score of each match by summing them, so each document's score is affected by how many tag-associated words it contains and by their relative weights.

Finally I also want to assign tag fields to the document as explained in the last article. This is really simple because I've used the same logic to create another filter, called NoTagRemover, that removes all tokens except the ones that have an associated tag. Each tag is returned as a token with the format tag|weight, so it can be used during document creation to add each tag as a separate field.

//GetTagFinder returns a token stream filtered by NoTagRemover: only tag tokens (format tag|weight) survive.
var tokens = analyzer.GetTagFinder(new StringReader(content));
TermAttribute termAttribute = (TermAttribute)tokens.AddAttribute(typeof(TermAttribute));
Dictionary<String, List<Tag>> tags = new Dictionary<String, List<Tag>>();
while (tokens.IncrementToken())
{
    String term = termAttribute.Term();
    String[] termPart = term.Split('|');
    String tagstring = termPart[0];
    Int32 weight = Int32.Parse(termPart[1]);
    if (!tags.ContainsKey(tagstring))
    {
        tags.Add(tagstring, new List<Tag>());
    }
    Tag tag = new Tag() { TagName = tagstring, Weight = weight };
    tags[tagstring].Add(tag);
}

//now I have all the tags, so I can add tag to the document.
foreach (var tagEntry in tags)
{
    //I store the sum of all the weights of the tag.
    var allTags = tagEntry.Value;
    Int32 totalWeight = allTags.Sum(t => t.Weight);
    document.Add(new NumericField(tagEntry.Key, Field.Store.YES, true).SetIntValue(totalWeight));
}

This code is really simple: it tokenizes the input stream with the NoTagRemover filter, splits each tag on the | character and sums all the weights together for each tag. At the end of the process you have a list of tags with their associated weights, so you can add them as NumericFields to the document (as explained in the last article). Finally it is time to issue a query, like this one.

Query query = new PayloadTermQuery(
    new Term("content", "|orm|"),
    new SumPayloadFunction());

I'm basically searching for all documents that contain terms associated with the orm tag, and I can search for "|orm|" directly inside the content of the document because tags are stored inside the index as synonyms. This produces the following result.

Found Match for id=6 score=3,989792 tags=[orm:13]

Found Match for id=1 score=1,476607 tags=[orm:5]

Found Match for id=2 score=1,033625 tags=[nosql:5] [orm:5]

Found Match for id=6 score=0,7087715 tags=[orm:3]

As you can see the scores are greater than 1 because they are altered by the payload: the first result has a score of 3,9 and an associated orm tag with a final weight of 13, because it contains several words associated with the orm tag. The third document has two tags, orm and nosql, because it also contains words associated to the nosql tag; everything is done automatically, you only need to associate specific words to tags. You can also use a standard numeric query to query for the tags associated to the whole document.

query = NumericRangeQuery.NewIntRange("orm", 10, Int32.MaxValue, true, true);

This is the same technique shown in the old post. As for every customization done in Lucene, it is good practice to create a specific QueryParser capable of parsing the query syntax, to make it simpler to issue queries. If I want to find documents that contain words associated to the tag orm I need to issue a PayloadTermQuery searching for "|orm|" in the content field. The pipe characters are used to distinguish regular terms of the field from the tags injected by the custom analyzer, but it is really bad practice to expose this fact to the user, because pipes, synonyms and payloads are all internal details that should be hidden from the final user. Here is the GetFieldQuery method of a customized query parser that does the trick.

public override Lucene.Net.Search.Query GetFieldQuery(string field, string queryText)
{
    //if field is tag issue a standard query tag.
    if (field.Equals("tag", StringComparison.OrdinalIgnoreCase)) {
            
        return new PayloadTermQuery(
                new Term("content", "|" + queryText + "|"),
                new SumPayloadFunction());
    }

    //if field is a base field, use standard query parser routine.
    if (field.Equals("content", StringComparison.OrdinalIgnoreCase) ||
        field.Equals("id", StringComparison.OrdinalIgnoreCase))
    {
        return base.GetFieldQuery(field, queryText);
    }

    //all other fields are tags.
    Int32 value = Int32.Parse(queryText);
    return NumericRangeQuery.NewIntRange(field, value, value, true, true);
}

The PayloadTermQuery is generated only if the user asks to search in a field called tag. The actual document does not contain such a field, because tags are stored as synonyms inside the content field, but thanks to this customized query parser the user can issue a query like "tag:orm" and it gets translated into the right query.
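A quick usage sketch (the TagQueryParser name and constructor here are just placeholders of mine, the real customized parser is in the sample code; analyzer is the TagSnowBallAnalyzer built earlier):

//Hypothetical parser name and constructor: adjust to the customized query parser in the sample code.
var parser = new TagQueryParser(Lucene.Net.Util.Version.LUCENE_29, "content", analyzer);
Query query = parser.Parse("tag:orm");
//PayloadTermQuery derives from SpanTermQuery, so ToString() prints the underlying term.
Console.WriteLine("Query [tag:orm] is translated to: " + query);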

Query [tag:orm] is translated to: content:|orm|

Found Match for id=6 score=3,989792 tags=[orm:13]

Found Match for id=1 score=1,476607 tags=[orm:5]

Found Match for id=2 score=1,033625 tags=[nosql:5] [orm:5]

Found Match for id=6 score=0,7087715 tags=[orm:3]

As you can see, searching for tag:orm produces the right content:|orm| query that uses my customized tag system.

Sample code can be found Here.

Gian Maria

Assign "tags" to Lucene documents

One of the good aspects of working with Lucene.NET is that it is really similar to a NoSQL database, because it permits you to store "documents", where a document is a generic collection of fields. Lucene can store not only textual fields but also numeric fields, which opens up interesting scenarios because you are not limited to storing and searching only text. Suppose you want to categorize all the posts of a blog, where each post can have one or more tags and a pertinence value associated to each tag. The technique used to determine the tags to associate to a blog post is not the subject of this discussion; all I need is a technical way in Lucene.NET to add tags with an integer value to a document and issue queries on them. For the sake of this discussion let's say the blog user decides one or more tag words to associate to the post and gives each a value from 1 to 10 to express how pertinent the tag is to the post. We can add tags to the document that represents a post with this simple code.

document.Add(new NumericField("orm", Field.Store.YES, true).SetIntValue(5));
document.Add(new NumericField("cqrs", Field.Store.YES, true).SetIntValue(7));

The above snippet states that this blog post is quite related to ORM and CQRS. The important aspect is that each document can have different fields, because a Lucene document is schemaless, just like in NoSQL databases. You can now query this index in this way.

Query query = NumericRangeQuery.NewIntRange("orm", 5, 10, true, true);

This query retrieves all the documents that have an associated tag named "orm" with a pertinence value between 5 and 10. You can clearly compose queries to express more complex criteria, e.g. all posts that are pertinent to orm with a value from 5 to 10 and pertinent to cqrs with a value from 1 to 10, and so on.

BooleanQuery bquery = new BooleanQuery();
bquery.Add(NumericRangeQuery.NewIntRange("orm", 5, 10, true, true), BooleanClause.Occur.MUST);
bquery.Add(NumericRangeQuery.NewIntRange("cqrs", 1, 10, true, true), BooleanClause.Occur.MUST);

As you can see it is really simple to build a BooleanQuery, using BooleanClause.Occur.MUST to create an AND composition or BooleanClause.Occur.SHOULD if you want to compose with a logical OR. To make everything simpler you can inherit from QueryParser to build a specialized parser for the tags of your blog.

class QueryParserForTagged : QueryParser
{
    public QueryParserForTagged(Lucene.Net.Util.Version version, String field, Analyzer analyzer)
        : base(version, field, analyzer) { 
        
    }

    public override Lucene.Net.Search.Query GetFieldQuery(string field, string queryText)
    {
        //if field is a base field, use standard query parser routine.
        if (field.Equals("title", StringComparison.OrdinalIgnoreCase) ||
            field.Equals("content", StringComparison.OrdinalIgnoreCase)
        {
            return base.GetFieldQuery(field, queryText);
        }
            
        //all other fields are tags.
        Int32 value = Int32.Parse(queryText);
        return NumericRangeQuery.NewIntRange(field, value, value, true, true);
    }

    protected override Query GetRangeQuery(string field, string part1, string part2, bool inclusive)
    {
        //if field is a base field, use standard query parser routine.
        if (field.Equals("title", StringComparison.OrdinalIgnoreCase) ||
            field.Equals("content", StringComparison.OrdinalIgnoreCase)
        {
            return base.GetRangeQuery(field, part1, part2, inclusive);
        }

        //all other fields are tags.
        Int32 valueMin = Int32.Parse(part1);
        Int32 valueMax = Int32.Parse(part2);
        return NumericRangeQuery.NewIntRange(field, valueMin, valueMax, inclusive, inclusive);
    }
}

The logic is really simple: if the field is one of the standard fields (title or content in this example) it simply uses the basic QueryParser behavior; every field that is not a standard field of the document is by convention a tag and generates a NumericRangeQuery, so you can issue a query like "NHibernate cqrs:[5 TO 10]" to find all the posts that contain the word nhibernate and also have an associated cqrs tag whose value is from 5 to 10.
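Here is a minimal usage sketch of mine (the StandardAnalyzer choice and the field names are just assumptions consistent with the example above; use whatever analyzer you index your content with):

//Build the specialized parser with content as the default field.
var analyzer = new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_29);
var parser = new QueryParserForTagged(Lucene.Net.Util.Version.LUCENE_29, "content", analyzer);

//"NHibernate" is parsed against the default content field, while cqrs:[5 TO 10]
//is turned into a NumericRangeQuery by the GetRangeQuery override.
Query query = parser.Parse("NHibernate cqrs:[5 TO 10]");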

Alk.

Faceted searches with Lucene.NET

One of the coolest features of Lucene.NET is the ability to do faceted searches with really few lines of code. A faceted search runs a query on an index and calculates the distribution of the results based on a property of the documents. Let me show a sample result first, so you will have a better understanding of the concept. Suppose I'm indexing products from the Microsoft sample database AdventureWorks, where each product has an id, a category and a description field, and I offer this simple UI for searching.

Figure 1: Result of a faceted search.

As you can see I show the first X results of the query, followed by the distribution count over all the categories that contain results. This kind of result is really useful, because you give users the opportunity to drill down into categories so they can narrow or navigate the results to find exactly what they need.

The latest versions of Java Lucene already have this capability built into the main product; unfortunately Lucene.NET still misses this functionality, but it contains all the primitives that permit you to implement faceted searches with few lines of code. First of all I wrote a simple helper class that simplifies facet management.

private class FacetValueCache
{
    public String FacetValue { get; set; }

    public CachingWrapperFilter Filter { get; set; }

    public FacetValueCache(String facetValue, CachingWrapperFilter filter)
    {
        FacetValue = facetValue;
        Filter = filter;
    }

    private OpenBitSetDISI GetValueBitset(IndexReader reader)
    {
        //The second argument is the bitset size: it just needs to be at least as large as the number of documents in the index.
        return new OpenBitSetDISI(Filter.GetDocIdSet(reader).Iterator(), 10000000);
    }

    public Int32 GetFacetCount(IndexReader reader, Filter queryFilterOnMainQuery)
    {
        var bs = GetValueBitset(reader);
        bs.InPlaceAnd(queryFilterOnMainQuery.GetDocIdSet(reader).Iterator());
        return (Int32) bs.Cardinality();
    }
}

In my example the AdventureWorks database has a set of categories; for each category there is a list of Lucene documents that belong to it, and to do faceted searches I need to store this list for each distinct value of the category field. FacetValueCache is then used to keep track of two distinct values:

  • The category name (e.g. bikes / road bikes)
  • The list of documents that belong to that category

The second value is not stored directly: I use a CachingWrapperFilter, which is basically a cached version of a QueryFilter capable of returning the list of documents matching a given query.

The magic is done in GetFacetCount thanks to the OpenBitSetDISI class, which is basically nothing more than a stream of bits used by Lucene to store a list of document indexes, and which contains operators specifically dedicated to massive operations on such lists. If you have an OpenBitSetDISI you can use the InPlaceAnd operator, passing another list of document indexes, and then call the Cardinality() method to know how many documents the two lists have in common.

You have probably already understood that the OpenBitSetDISI class does all the magic of calculating facets: if you store inside FacetValueCache the list of document indexes that represents the documents belonging to a given category, you can use InPlaceAnd to intersect it with the result of the user query and know how many documents they have in common; et voilà. Since an OpenBitSetDISI can be created from a QueryFilter thanks to the document iterator returned by the GetDocIdSet method, I store a CachingWrapperFilter inside my FacetValueCache class to recreate the bitset when needed.

As you can see, in GetFacetCount I simply recreate the bitset for the documents that belong to that specific category, then intersect it with InPlaceAnd, taking the list of document indexes from the QueryFilter passed as argument, and finally return the number of documents in common.

Now you can create a GetFacets method that uses this class to do the faceted search.

Dictionary<String, List<FacetValueCache>> facetsCacheContainer = new Dictionary<string, List<FacetValueCache>>();

public IEnumerable<KeyValuePair<String, Int32>> GetFacets(Query query, String facetField)
{
    if (!facetsCacheContainer.ContainsKey(facetField))
    {
        List<FacetValueCache> cache = new List<FacetValueCache>();
        var allDistinctField = FieldCache_Fields.DEFAULT.GetStrings(reader, facetField).Distinct();

        foreach (var fieldValue in allDistinctField)
        {
            var facetQuery = new TermQuery(new Term(facetField, fieldValue));
            var facetQueryFilter = new CachingWrapperFilter(new QueryWrapperFilter(facetQuery));
            cache.Add(new FacetValueCache(fieldValue, facetQueryFilter));
        }
        facetsCacheContainer.Add(facetField, cache);
    }

I use a Dictionary to cache the list of FacetValueCache objects related to the field I'm faceting on. Creating the list of FacetValueCache objects is really trivial: thanks to FieldCache_Fields I can get all the values of a given field in the index, then I use the LINQ Distinct() operator to obtain the distinct list of all the possible values of that field. The call to FieldCache_Fields is quite fast because Lucene.NET caches the result in memory. For each distinct value I create the corresponding FacetValueCache object, using a QueryWrapperFilter based on a simple TermQuery that retrieves all the documents belonging to that facet value. Now I can calculate the facets.

//now calculate facets.
var mainQueryFilter = new CachingWrapperFilter(new QueryWrapperFilter(query));
var facetDefinition = facetsCacheContainer[facetField];
return facetDefinition.Select(fd =>
    new KeyValuePair<String, Int32>(fd.FacetValue, fd.GetFacetCount(reader, mainQueryFilter)));

The first step is creating the CachingWrapperFilter for the query issued by the user, then I retrieve the list of FacetValueCache objects associated to that field (e.g. category), and with a simple LINQ projection I return a list of KeyValuePair<String, Int32> and the game is done. As you can see, faceted search with Lucene.NET is a matter of few lines of code and can really give fantastic results to the user.
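A small usage sketch of mine (the query and the category/description field names are just assumptions based on the AdventureWorks example above; reader and searcher setup come from the sample code):

//Count how many products of each category match a simple term query on the description field.
var query = new TermQuery(new Term("description", "bike"));
foreach (var facet in GetFacets(query, "category"))
{
    Console.WriteLine("{0}: {1} products", facet.Key, facet.Value);
}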

You can find the code here.

Gian Maria.