Index document content with Solr and Tika

I’ve blogged in the past about indexing entire folders of documents with Solr and Tika using the Data Import Handler. This approach has pros and cons. On the good side, once you understand the basics, setting everything up and running is a matter of a couple of hours at most; on the downside, using a DIH gives you little control over the entire process.

As an example, I’ve had problems with folders containing jpg images, because the extractor crashed due to a missing library. If you do not configure the import handler correctly, every error stops the entire import process. Another problem is that document content is not subdivided into pages, even though Tika can provide this kind of information. Finally, you need to have all of your documents inside a folder to be indexed. In real situations it is often preferable to have more control over the indexing process. Let’s examine how you can use Tika from your C# code.

The easiest way is to directly invoke the tika.jar file with Java; it is quick and does not require any other external library, just install Java and uncompress Tika in a local folder.

public TikaDocument ExtractDataFromDocument(string pathToFile)
{
    // -h tells tika.jar to emit the extracted content as structured HTML
    var arguments = String.Format("-jar \"{0}\" -h \"{1}\"", Configuration.TikaJarLocation, pathToFile);

    using (Process process = new Process())
    {
        process.StartInfo.FileName = Configuration.JavaExecutable;
        process.StartInfo.Arguments = arguments;
        process.StartInfo.WorkingDirectory = Path.GetDirectoryName(pathToFile);
        process.StartInfo.WindowStyle = ProcessWindowStyle.Minimized;
        process.StartInfo.UseShellExecute = false;
        process.StartInfo.ErrorDialog = false;
        process.StartInfo.CreateNoWindow = true;
        process.StartInfo.RedirectStandardOutput = true;
        var result = process.Start();
        if (!result) return TikaDocument.Error;

        // Grab everything Tika writes to standard output and let TikaDocument parse it
        var fullContent = process.StandardOutput.ReadToEnd();
        process.WaitForExit();
        return new TikaDocument(fullContent);
    }
}

This snippet of code simply invokes Tika, passing the file you want to analyze as an argument; it uses the standard System.Diagnostics.Process .NET object and intercepts standard output to grab Tika’s output. This output is parsed by a helper object called TikaDocument that takes care of understanding how the document is structured. If you are interested in the code you can find everything in the included sample, but it is just a matter of HTML parsing with HtmlAgilityPack. For example:

// Excerpt from the TikaDocument constructor: fullContent is the HTML emitted by Tika
Meta = new MetaHelper(meta);
var pagesList = new List<TikaPage>();
Pages = pagesList;
Success = true;
FullHtmlContent = fullContent;
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(fullContent);
FullTextContent = HttpUtility.HtmlDecode(doc.DocumentNode.InnerText);

var titleNode = doc.DocumentNode.SelectSingleNode("//title");
if (titleNode != null) 
{
    Title = HttpUtility.HtmlDecode(titleNode.InnerText);
}


var pages = doc.DocumentNode.SelectNodes(@"//div[@class='page']");
if (pages != null)
{
    foreach (var page in pages)
    {
        pagesList.Add(new TikaPage(page));
    }
}
var metaNodes = doc.DocumentNode.SelectNodes("//meta");
if (metaNodes != null)
{
    foreach (var metaNode in metaNodes)
    {
        // each <meta> tag's name/content pair is read here and collected
        // for the MetaHelper; see the included sample for the full code.
    }
}

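The TikaPage objects created above simply wrap a single page div; a minimal sketch of what that class can look like (the full implementation ships with the sample, property names here are illustrative):

public class TikaPage
{
    public TikaPage(HtmlNode pageNode)
    {
        // Keep both the raw HTML of the page and its decoded plain text
        HtmlContent = pageNode.InnerHtml;
        TextContent = HttpUtility.HtmlDecode(pageNode.InnerText);
    }

    public string HtmlContent { get; private set; }
    public string TextContent { get; private set; }
}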
Thanks to the TikaDocument class you can index the content of single pages; in my example I simply send Solr the entire content of the document (I do not care about subdividing it into pages). This is the XML message for a standard document update:

public System.Xml.Linq.XDocument SolarizeTikaDocument(String fullPath, TikaDocument document)
{
    XElement elementNode;
    XDocument doc = new XDocument(
        new XElement("add", elementNode = new XElement("doc")));

    elementNode.Add(new XElement("field", new XAttribute("name", "id"), fullPath));
    elementNode.Add(new XElement("field", new XAttribute("name", "fileName"), Path.GetFileName(fullPath)));
    elementNode.Add(new XElement("field", new XAttribute("name", "title"), document.Title));
    elementNode.Add(new XElement("field", new XAttribute("name", "content"), document.FullTextContent));
    return doc;
}
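For a hypothetical file c:\temp\docs\sample.pdf, the update message produced by this method looks like this (values are illustrative):

<add>
  <doc>
    <field name="id">c:\temp\docs\sample.pdf</field>
    <field name="fileName">sample.pdf</field>
    <field name="title">Sample document</field>
    <field name="content">Full text extracted by Tika…</field>
  </doc>
</add>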

To mimic how the DIH works, you can use a FileSystemWatcher to monitor a folder and index documents as soon as they get updated or added. In my sample I only care about files being added to the directory:

static void watcher_Created(object sender, FileSystemEventArgs e)
{
    var document = _tikaHandler.ExtractDataFromDocument(e.FullPath);
    var solrDocument = _solarizer.SolarizeTikaDocument(e.FullPath, document);
    _solr.Post(solrDocument);
}
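For completeness, this is roughly how the watcher itself can be wired up; the folder path is illustrative, and in the real sample the Created event is bound to the watcher_Created handler above instead of the placeholder lambda:

using System;
using System.IO;

class Program
{
    static void Main()
    {
        // Monitor the folder for new documents; replace the lambda with the
        // watcher_Created handler shown above to index files as they arrive.
        using (var watcher = new FileSystemWatcher(@"c:\temp\docs"))
        {
            watcher.IncludeSubdirectories = true;
            watcher.Created += (sender, e) => Console.WriteLine("New file: " + e.FullPath);
            watcher.EnableRaisingEvents = true;

            Console.WriteLine("Watching for new documents, press ENTER to quit.");
            Console.ReadLine();
        }
    }
}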

This approach is more complex than using a plain DIH but gives you more control over the entire process and it is also suitable if documents are stored inside databases or in other locations.

Code is available here: http://sdrv.ms/17zKJdL

Gian Maria.

Index a folder of multilanguage documents in Solr with Tika

Previous posts in the series

Everything is up and running, but now requirements change: documents can be in multiple languages (Italian and English in my scenario) and we want to do the simplest thing that could possibly work. First of all I change the schema of the Solr core to support language-specific fields with wildcards.

Figure 1: Configuration of the Solr core to support multiple language fields.
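As a sketch, the language-specific fields are declared with dynamicField wildcards along these lines (the text_en and text_it types come from the stock example schema; adapt them to your own):

<dynamicField name="*_en" type="text_en" indexed="true" stored="true" multiValued="true"/>
<dynamicField name="*_it" type="text_it" indexed="true" stored="true" multiValued="true"/>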

This is a simple modification: all fields are indexed and stored (for highlighting) and multivalued. Now we can leverage another interesting functionality of Solr+Tika, an update processor that identifies the language of every document that gets indexed. This time we need to modify the solrconfig.xml file, locating the /update handler section and changing it in this way:

<requestHandler name="/update" class="solr.UpdateRequestHandler">
  <lst name="defaults">
    <str name="update.chain">langid</str>
  </lst>
</requestHandler>

<updateRequestProcessorChain name="langid">
  <processor class="org.apache.solr.update.processor.TikaLanguageIdentifierUpdateProcessorFactory">
    <lst name="defaults">
      <bool name="langid">true</bool>
      <str name="langid.fl">title,content</str>
      <str name="langid.langField">lang</str>
      <str name="langid.fallback">en</str>
      <bool name="langid.map">true</bool>
      <bool name="langid.map.keepOrig">true</bool>
    </lst>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>

I use a TikaLanguageIdentifierUpdateProcessorFactory to identify the language of documents; it runs for every document that gets indexed, because it is injected into the update request chain. The configuration is simple and you can find full details in the Solr wiki. Basically I want it to analyze both the title and content fields of the document and enable mapping of fields. This means that if a document is detected as Italian it will contain content_it and title_it fields, not only the content field. Thanks to the previous modification of schema.xml to match dynamicFields with the correct language, all content_xx fields are indexed with the correct language analyzer.
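To make the mapping concrete, a document detected as Italian ends up stored with roughly these fields (values are illustrative and the exact layout depends on your schema):

<doc>
  <str name="id">c:\temp\docs\tipografia.pdf</str>
  <str name="lang">it</str>
  <arr name="title"><str>Storia della tipografia</str></arr>
  <arr name="title_it"><str>Storia della tipografia</str></arr>
  <arr name="content"><str>…full extracted text…</str></arr>
  <arr name="content_it"><str>…full extracted text…</str></arr>
</doc>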

This approach consumes more memory and disk space, because for each field I store the original content as well as the localized copy, but it is needed for highlighting and keeps my core simple to use.

Now I want to be able to search in this multilanguage core; basically I have two choices:

  • Identify the language of the terms in the query and query only the corresponding field
  • Query all the localized fields with an OR.

Since detecting the language of the terms used in a query gives lots of false positives, the second technique sounds better. Suppose you want to find the Italian term “tipografia”; you can issue the query content_it:tipografia OR content_en:tipografia. Everything works as expected, as you can see from the following picture.

Figure 2: Sample search in all content fields.

Now if you want highlights in the results, you must specify all localized fields; you cannot simply use the content field. As an example, if I ask to highlight the results of the previous query using the original content field, I get no highlight.

Figure 3: No highlighting found if you use the original Content field.

This happens because the match in the document was not an exact match: I asked for the word tipografia, but in my document the match is on the term tipografo; thanks to language-specific indexing, Solr is able to match via stemming, as in a typical full-text search. The problem is that when it is time to highlight, if you specify the content field, Solr cannot find any match for the word tipografia in it, so you get no highlight.

To avoid the problem, you should specify all localized fields in the hl.fl parameter; this has no drawback because a single document has only one non-empty localized field, and the result is the expected one:
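A query along these lines produces the result shown in Figure 4 (host and core name are illustrative; adapt them to your installation):

http://localhost:8080/TestInstance/tikacrawl/select?q=content_it%3Atipografia+OR+content_en%3Atipografia&wt=xml&hl=true&hl.fl=content_it,content_en&hl.usePhraseHighlighter=true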

Figure 4: If you specify localized content fields you can have highlighting even with a full-text match.

In this example, when it is time to highlight, Solr will use both content_it and content_en. In my document content_en is empty, but Solr finds a match in content_it and is able to highlight the original content, because content_it has stored="true" in the configuration.

Clearly, using a single core with multiple fields can slow down performance a little, but it is probably the easiest way to index multilanguage files automatically with Tika and Solr.

Gian Maria.

Highlight matched text inside documents indexed with Solr plus Tika

I’ve already dealt with how to index documents with Solr and Tika, and in this article I’ll explain how you can not only search for documents that match your query, but also return text extracts that show where the document matches the query. To achieve this, you should store the full content of the document inside your index. Usually I create a couple of fields: one called content that contains the content of the file, and with a copyField directive (<copyField source="content" dest="text"/>) I automatically copy that value into the catch-all field called text.

   <field name="content" type="text_general" indexed="false" stored="true" multiValued="false"/>
   <field name="text" type="text_general" indexed="true" stored="false" multiValued="true"/>

The text field is multivalued and not stored; it is only indexed to permit searching across the various fields of the document. The content field stores the text extracted by Tika and is useful both for highlighting and for troubleshooting extraction problems, because it contains the exact text extracted by Tika.

Now suppose you want to search for the term branch and also highlight the parts of the text where that term is found; you can simply issue a query that asks for highlighting, it is really simple.

http://localhost:8080/TestInstance/tikacrawl/select?q=text%3Abranch&fl=id%2Ctitle%2Cauthor&wt=xml&hl=true&hl.snippets=20&hl.fl=content&hl.usePhraseHighlighter=true

This simple query asks for documents whose text contains the word branch; I want to extract (fl=) only the id, title and author fields, I want XML format, and with hl=true I’m asking for snippets of matching text. hl.snippets=20 instructs Solr to return a maximum of 20 snippets, and hl.usePhraseHighlighter=true uses a specific highlighter that tries to extract a single phrase from the text. The most important parameter is hl.fl=content, which specifies the field of the document that contains the text used for highlighting. In the results, after all matching documents, there is a new section that contains all the highlights for each document.
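That section has roughly the following shape (document id and snippet text are illustrative; matched words are wrapped in em tags, which appear escaped in the raw XML):

<lst name="highlighting">
  <lst name="c:\temp\docs\branching.pdf">
    <arr name="content">
      <str>... create a new &lt;em&gt;branch&lt;/em&gt; for every release ...</str>
      <str>... merge the &lt;em&gt;branch&lt;/em&gt; back into main ...</str>
    </arr>
  </lst>
</lst>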

Figure 1: Highlights for the TFS Branching Guid – Scenarios 2.0.pdf file

The name of the element matches the id of the document (in my configuration, the full path of the file), and a list of highlights follows. But the true power of Solr comes out if you start to use language-specific fields.

   <field name="content" type="text_en" indexed="false" stored="true" multiValued="false"/>
   <field name="text" type="text_en" indexed="true" stored="false" multiValued="true"/>

I’ve just changed in schema.xml the type of content and text from text_general to text_en, and this simple modification enables a more specific analyzer, capable of real full-text matching. Suppose you want to find all documents that deal with branching strategies; here is a possible query:

http://localhost:8080/TestInstance/tikacrawl/select?q=text%3A%22branch+strategy%22~3&fl=id%2Ctitle&wt=xml&hl=true&hl.snippets=5&hl.fl=content&hl.usePhraseHighlighter=true&hl.fragsize=300

The key is in the search query text:"branch strategy"~3, which states that I’m interested in documents containing both the branch and strategy terms within a relative distance of no more than three words. Since text was indexed with the text_en field type I get full-text search, and I have confirmation looking at the highlights.

Figure 2: Highlights for a proximity query with full-text search; as you can see, the word branching matches even though I searched for branch.
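The stemmed match comes from the text_en field type; in the stock example schema its analysis chain is roughly the following (simplified, check your schema.xml for the exact definition):

<fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_en.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPossessiveFilterFactory"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldType>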

And voilà! You have full-text search inside file content with a minimal amount of work and a simple REST interface for querying the index.

Gian Maria.

Import a folder of documents with Apache Solr 4.0 and Tika

In a previous article I showed how simple it is to import data from a SQL database into Solr with a Data Import Handler; in this article I’ll use a similar technique to import documents stored inside a folder.

This feature is exposed by the integration with Tika, an open source document analyzer capable of extracting text from various file formats. Thanks to this library, Solr is capable of crawling an entire directory, indexing every document inside it with really minimal configuration. Apache Tika is a standalone project (you can find the list of all supported formats on its site), and you can use it directly from your Java (or .NET) code, but thanks to the Solr integration setting everything up is a real breeze.

First of all you need to copy all the required jars from the Solr distribution into the lib subdirectory of your core. I strongly suggest grabbing all the files inside the contrib\extraction\lib subdirectory of the Solr distribution and copying them into your core; this way you can use every Data Import Handler you want without incurring errors because a library is not available.

To import all the files you can simply configure an import handler as I described in the previous article; here is the full configuration:

<dataConfig>
  <dataSource type="BinFileDataSource" />
  <document>
    <entity name="files" dataSource="null" rootEntity="false"
            processor="FileListEntityProcessor"
            baseDir="c:/temp/docs" fileName=".*\.(doc|pdf|docx)"
            onError="skip"
            recursive="true">
      <field column="fileAbsolutePath" name="id" />
      <field column="fileSize" name="size" />
      <field column="fileLastModified" name="lastModified" />

      <entity
          name="documentImport"
          processor="TikaEntityProcessor"
          url="${files.fileAbsolutePath}"
          format="text">
        <field column="file" name="fileName"/>
        <field column="Author" name="author" meta="true"/>
        <field column="title" name="title" meta="true"/>
        <field column="text" name="text"/>
      </entity>
    </entity>
  </document>
</dataConfig>


This is a really simple import configuration, but there are some key points you should be aware of.

All of my schema.xml files have a unique field called id that identifies the document and is the key used by Solr to understand whether a document should be inserted or updated. This importer uses a BinFileDataSource that simply crawls a directory looking for files, extracting standard values such as the name of the file, the last modified date and so on, but it does not know anything about the text inside the file. The configuration has an entity called files, with rootEntity="false" because this is not the real root entity that will be used to index the document. The other attributes simply state the folder to crawl, the document extensions to index and so on.

In that entity you find columns related to file attributes, and I’ve decided to include three of them in the document:

  1. fileAbsolutePath, which will be used as the unique id of the document
  2. fileSize, which contains the size of the file in bytes
  3. fileLastModified, which contains the date of last modification of the document

After these three fields there is another entity, based on the TikaEntityProcessor, that will extract both text and metadata from the document. This is the real entity that will be indexed, and it is the one that uses the Tika library to extract all the information from documents, not only file-related values. It basically extracts all the text plus all the file metadata attributes, if present. Here is the list of the fields I want to store inside my index:

  1. file: contains the file name (as opposed to the id field, which contains the full path of the file)
  2. Author and Title: both are metadata of the document, and they are extracted by Tika if present in the files
  3. text: contains the text of the document.

Clearly you should define all of these fields accordingly inside your schema.xml file.

   <field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false" /> 
   <field name="fileName" type="string" indexed="true" stored="true" />
   <field name="author" type="string" indexed="true" stored="true" />
   <field name="title" type="string" indexed="true" stored="true" />
   
   <field name="size" type="plong" indexed="true" stored="true" />
   <field name="lastModified" type="pdate" indexed="true" stored="true" />
   
   <field name="text" type="text_general" indexed="true" stored="false" multiValued="true"/>

This is all you need to do, really! You can now toss some documents inside the specified folder, then go to the Solr console and try to execute the import.

Figure 1: Importing documents from Solr Web UI

If you have errors during the import process, refer to the Solr logs to understand what went wrong. The most common problems are a missing jar file in the lib directory, or a document missing some mandatory field specified in schema.xml (such as a missing id).

Now you can query and look at what is contained inside your index.
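A catch-all query is just the standard select handler with q=*:* (host and core name are illustrative; adapt them to your installation):

http://localhost:8080/TestInstance/tikacrawl/select?q=*%3A*&wt=xml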

Figure 2: A standard catch-all query to verify what is inside your index

One of the cool features of Tika is extracting metadata and text from your files; as an example, you can search for text that contains the word “rebase” with the query text:rebase.
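In URL form (again, host and core name are illustrative) that search is simply:

http://localhost:8080/TestInstance/tikacrawl/select?q=text%3Arebase&wt=xml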

Figure 3: Results of a search inside the text of the document

As you can see, it finds the progit.en.pdf book, and if you look at the properties you can see that this book contains neither author nor title metadata, because they are missing from the original document. Since those fields are not mandatory, nothing bad happens if they are not present in the file, and you are still able to search inside the original text of the PDF file.

Gian Maria