Highlight matched text inside documents indexed with Solr plus Tika

I’ve already covered how to index documents with Solr and Tika. In this article I’ll explain how you can not only search for documents that match your query, but also return text extracts that show where each document matches. To achieve this, you need to store the full content of the document inside your index. I usually create a couple of fields: one called content, which holds the content of the file, and a catch-all field called text, which automatically receives a copy of that value through a copyField directive ( <copyField source="content" dest="text"/> ).

   <field name="content" type="text_general" indexed="false" stored="true" multiValued="false"/>
   <field name="text" type="text_general" indexed="true" stored="false" multiValued="true"/>

The text field is multivalued and not stored; it is only indexed, to allow searching across the various fields of the document. The content field stores the text extracted by Tika, and it is useful both for highlighting and for troubleshooting extraction problems, because it contains the exact text that Tika extracted.
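Because highlighting relies on the stored value, a quick sanity check is to ask Solr to return the content field and confirm that it actually comes back. The Python sketch below runs that check on an already-parsed JSON response (wt=json). The sample response, the document ids and the helper name docs_missing_field are illustrative assumptions, not output from a real index.

```python
import json


def docs_missing_field(response, field):
    """Return the ids of documents whose result does not include `field`."""
    docs = response.get("response", {}).get("docs", [])
    return [doc.get("id") for doc in docs if not doc.get(field)]


# Hypothetical response to a query such as q=text:branch&fl=id,content&wt=json
SAMPLE_RESPONSE = json.loads("""
{
  "responseHeader": {"status": 0, "QTime": 1},
  "response": {
    "numFound": 2,
    "start": 0,
    "docs": [
      {"id": "docs/a.pdf", "content": "Extracted text about branch plans"},
      {"id": "docs/b.pdf"}
    ]
  }
}
""")

missing = docs_missing_field(SAMPLE_RESPONSE, "content")
print(missing)  # ['docs/b.pdf']
```

A document listed here has no retrievable content value, so there is nothing for the highlighter to build snippets from.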

Now suppose you want to search for the term branch and also highlight the parts of the text where that term is found. You simply issue a query that asks for highlighting.

http://localhost:8080/TestInstance/tikacrawl/select?q=text%3Abranch&fl=id%2Ctitle%2Cauthor&wt=xml&hl=true&hl.snippets=20&hl.fl=content&hl.usePhraseHighlighter=true

This query asks for documents whose text field contains the word branch. With fl= I return only the id, title and author fields, and wt=xml requests XML output. With hl=true I’m asking for snippets of matching text; hl.snippets=20 instructs Solr to return a maximum of 20 snippets per field; and hl.usePhraseHighlighter=true makes Solr highlight the terms of a phrase query only where they actually form the phrase. The most important parameter is hl.fl=content, which specifies the field that contains the text used for highlighting. In the results, after all the matching documents, there is a new section that contains the highlights for each document.
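The parameters described above can also be assembled in code, which avoids hand-encoding characters such as : and , in the URL. This is a minimal Python sketch using only the standard library; the base URL is the one from this article, and the function name highlight_url is a hypothetical helper.

```python
from urllib.parse import urlencode

# Base URL taken from the article; adjust host, port and core for your setup.
BASE_URL = "http://localhost:8080/TestInstance/tikacrawl/select"


def highlight_url(query, fields, hl_field, snippets=20, base_url=BASE_URL):
    """Build a select URL that asks Solr for highlighted snippets."""
    params = [
        ("q", query),                         # what to search for
        ("fl", ",".join(fields)),             # fields returned for each document
        ("wt", "xml"),                        # response format
        ("hl", "true"),                       # turn highlighting on
        ("hl.snippets", str(snippets)),       # maximum snippets per field
        ("hl.fl", hl_field),                  # stored field used for snippets
        ("hl.usePhraseHighlighter", "true"),
    ]
    return base_url + "?" + urlencode(params)


url = highlight_url("text:branch", ["id", "title", "author"], "content")
print(url)
```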

Figure 1: Highlights for the TFS Branching Guid – Scenarios 2.0.pdf file
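The highlighting section of an XML response can be read programmatically as well. The sketch below assumes the usual Solr XML layout: a lst element named highlighting, holding one child lst per document and, inside it, an arr of str snippets for each highlighted field. The sample document and the function name extract_highlights are made up for illustration.

```python
import xml.etree.ElementTree as ET


def extract_highlights(xml_text, field="content"):
    """Map each document id to its list of highlighted snippets for `field`."""
    root = ET.fromstring(xml_text)
    highlights = {}
    for doc in root.findall("./lst[@name='highlighting']/lst"):
        snippets = doc.findall(f"./arr[@name='{field}']/str")
        highlights[doc.get("name")] = [s.text or "" for s in snippets]
    return highlights


# Hypothetical, heavily trimmed response; a real one also contains the
# responseHeader and the list of matching documents.
SAMPLE = """<response>
  <lst name="highlighting">
    <lst name="docs/guide.pdf">
      <arr name="content">
        <str>create a &lt;em&gt;branch&lt;/em&gt; for each release</str>
      </arr>
    </lst>
  </lst>
</response>"""

for doc_id, snippets in extract_highlights(SAMPLE).items():
    print(doc_id, snippets)
```

A document that matched the query but produced no snippets maps to an empty list, which is a useful hint that the highlighted field is not stored.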

The name of each element matches the id of the document (in my configuration, the full path of the file), followed by the list of highlights. But the true power of Solr comes out when you start to use language-specific field types.

   <field name="content" type="text_en" indexed="false" stored="true" multiValued="false"/>
   <field name="text" type="text_en" indexed="true" stored="false" multiValued="true"/>

I’ve just changed the type of content and text in schema.xml from text_general to text_en. This simple modification enables an analyzer specific to English text (in the stock Solr schema it includes stemming), which gives a more capable full-text search. Suppose you want to find all documents that deal with branching strategies; here is a possible query:

http://localhost:8080/TestInstance/tikacrawl/select?q=text%3A%22branch+strategy%22~3&fl=id%2Ctitle&wt=xml&hl=true&hl.snippets=5&hl.fl=content&hl.usePhraseHighlighter=true&hl.fragsize=300

The key is the search query text:"branch strategy"~3, which states that I’m interested in documents containing both the terms branch and strategy at a distance of no more than three words from each other. Since the text was indexed with the text_en field type, I get full-text search, and I can confirm it by looking at the highlights.

Figure 2: Highlights for a proximity query with full-text search; as you can see, the word branching matches even though I searched for branch
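The proximity query used above is easy to get wrong when typed by hand, because the quotes and the colon have to be URL-encoded. Here is a small Python sketch that composes the q value and encodes it; proximity_query is a hypothetical helper, and the encoded form shown in the comment assumes Python 3.7 or later, where ~ is left unescaped.

```python
from urllib.parse import urlencode


def proximity_query(field, phrase, slop):
    """Build a Lucene proximity query such as text:"branch strategy"~3."""
    if '"' in phrase:
        raise ValueError("phrase must not contain double quotes")
    return f'{field}:"{phrase}"~{slop}'


q = proximity_query("text", "branch strategy", 3)
print(q)                    # text:"branch strategy"~3
print(urlencode({"q": q}))  # q=text%3A%22branch+strategy%22~3
```

The encoded value is the same one that appears in the query URL of this article.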

And voilà! You have full-text search inside file content with a minimal amount of work, and a simple REST interface for querying the index.

Gian Maria.

Published by

Ricci Gian Maria

.Net programmer, User group and community enthusiast, programmer - aspiring architect - and guitar player :). Visual Studio ALM MVP

8 thoughts on “Highlight matched text inside documents indexed with Solr plus Tika”

  1. Dear Ricci,

    This is the best example I’ve found of query highlighting for paragraphs in Solr.

    I’m writing to you because I installed Solr 6.5.1 on my Debian server today, but I’m having trouble getting the PDF text content. The search works, because the document name appears correctly in the highlighting section. However, the expected element does not appear with each str result:

    Additionally, the log shows all the PDF text, so I assume it was correctly indexed with the bin/post command.

    This is the query:

    http://x.x.x.x:8983/solr/ex/select?q=juan&fl=title&wt=xml&hl=true&hl.snippets=20&hl.fl=content&hl.usePhraseHighlighter=true

    And this is the result:

    status: 0
    QTime: 1

    hl.snippets: 20
    q: juan
    hl: true
    fl: title
    hl.usePhraseHighlighter: true
    hl.fl: content
    wt: xml

    CV_Juan_Jara_ultimo

    Do you know what could be wrong with it?

    Best Regards,

    Juan Jara.

  2. Probably the field where you are storing the text is configured with stored="false". Remember that for each field in Solr you can decide whether the engine keeps the original text for retrieval.

    In my example I explicitly set stored="false" for the text field; you can try changing that setting to true.

  3. Hi Ricci,
    I’m new to Solr. I tried the same, but highlighting does not appear. Do I need to add some special code block or something? Please advise. I’ve indexed the files using Tika.
    Below are the changes I’ve made.

    Managed-schema.xml

    Solrconfig.xml

    I added this code block under the “browse” request handler.

    on
    content features title name
    html
    <b>
    </b>
    0
    title
    0
    name
    3
    200
    content
    750

    I’m still not able to get highlights. It doesn’t even show the highlight tags in the response.

  4. It is really difficult to troubleshoot with so little information, but the most common problem I’ve encountered is that the text extracted with Tika is put in a field that has stored="false", which prevents it from being used for highlighting. I switched to Elasticsearch a long time ago, so I’m not really familiar with release 6 of Solr, but I suggest you check whether the text extracted with Tika can be returned from a query. If the text cannot be returned, it is a clear sign that it is not stored, and thus highlighting will not work.

  5. Quick heads up:

    I’m a newbie with Solr myself, but the guide has a small error in the copy directive. You should use:

    The data importer (Tika) copies the file contents to the field “text”, which is not indexed.
    Then the copyField directive copies the data from the “text” field to the “contents” field, which is indexed.

    Also, just a warning for beginners like me who stumbled upon this blog: I was using Solr 6.6.1 and I could not get the DIH to work with Tika. I switched to 6.5.x and it works fine.
