Highlight matched text inside documents indexed with Solr plus Tika

I’ve already covered how to index documents with Solr and Tika. In this article I’ll explain how you can not only search for documents that match your query, but also return text extracts that show where each document matches the query. To achieve this, you must store the full content of the document inside your index. I usually create a couple of fields: one called content, which holds the content of the file, and a copyField directive ( <copyField source="content" dest="text"/> ) that automatically copies that value into the catch-all field called text.

   <field name="content" type="text_general" indexed="false" stored="true" multiValued="false"/>
   <field name="text" type="text_general" indexed="true" stored="false" multiValued="true"/>

The text field is multivalued and not stored; it is only indexed, to permit searching across the various fields of the document. The content field stores the text extracted by Tika, and it is useful both for highlighting and for troubleshooting extraction problems, because it contains the exact text that Tika extracted.

Now suppose you want to search for the term branch and also highlight the parts of the text where that term is found. You can simply issue a query that asks for highlighting; it is really simple.


This simple query asks for documents whose text field contains the word branch. I want to retrieve (fl=) only the title and author fields, in XML format, and with hl=true I’m asking for snippets of matching text. hl.snippets=20 instructs Solr to return a maximum of 20 snippets per field, and hl.usePhraseHighlighter=true makes Solr highlight phrase queries only where the terms actually form the phrase. The most important parameter is hl.fl=content, which specifies the field of the document that contains the text used for highlighting. In the results, after all matching documents, there is a new section that contains the highlights for each document.
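As a rough sketch, a request with the parameters described above could be assembled like this. The host, port and core name in the base URL are assumptions (a local Solr with a core called collection1), and wt=xml is the standard way to ask for XML output; adapt them to your own installation.

```python
from urllib.parse import urlencode

# Assumption: local Solr instance with a core named "collection1".
SOLR_SELECT_URL = "http://localhost:8983/solr/collection1/select"


def build_highlight_url(query, base_url=SOLR_SELECT_URL):
    """Build a select URL that asks Solr for highlights on the content field."""
    params = {
        "q": query,                         # e.g. text:branch
        "fl": "title,author",               # fields returned for each document
        "wt": "xml",                        # response format
        "hl": "true",                       # turn highlighting on
        "hl.fl": "content",                 # stored field used to build snippets
        "hl.snippets": "20",                # maximum snippets per field
        "hl.usePhraseHighlighter": "true",
    }
    return base_url + "?" + urlencode(params)


if __name__ == "__main__":
    print(build_highlight_url("text:branch"))
```

The printed URL can be pasted into a browser or passed to any HTTP client.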


Figure 1: Highlights for the TFS Branching Guid – Scenarios 2.0.pdf file

The name of each element matches the id of the document (in my configuration, the full path of the file), and a list of highlights follows. But the true power of Solr comes out when you start to use language-specific fields.
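Before moving on to language-specific fields, here is a minimal sketch of how the highlighting section just described could be read from an XML response. The sample response is hand-written for illustration (the document id and the snippets are hypothetical), but it follows the layout Solr uses: one lst per document id, containing one arr per highlighted field.

```python
import xml.etree.ElementTree as ET

# Hypothetical, hand-written response used only to illustrate the layout.
SAMPLE_RESPONSE = """<response>
  <result name="response" numFound="1" start="0">
    <doc><str name="title">Sample title</str></doc>
  </result>
  <lst name="highlighting">
    <lst name="/docs/sample.pdf">
      <arr name="content">
        <str>create a &lt;em&gt;branch&lt;/em&gt; for each release</str>
        <str>merge the &lt;em&gt;branch&lt;/em&gt; back</str>
      </arr>
    </lst>
  </lst>
</response>"""


def extract_highlights(xml_text, field="content"):
    """Return {document id: [snippet, ...]} for the given highlighted field."""
    root = ET.fromstring(xml_text)
    highlighting = root.find("lst[@name='highlighting']")
    if highlighting is None:
        return {}  # the request did not ask for highlighting
    result = {}
    for doc in highlighting.findall("lst"):
        snippets = [s.text for s in doc.findall(f"arr[@name='{field}']/str")]
        result[doc.get("name")] = snippets
    return result


if __name__ == "__main__":
    for doc_id, snippets in extract_highlights(SAMPLE_RESPONSE).items():
        print(doc_id)
        for snippet in snippets:
            print("   ", snippet)
```

A document whose lst element has no arr for the requested field simply gets an empty list, which is the symptom you see when the highlighted field is not stored.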

   <field name="content" type="text_en" indexed="false" stored="true" multiValued="false"/>
   <field name="text" type="text_en" indexed="true" stored="false" multiValued="true"/>

I’ve just changed the type of content and text in schema.xml from text_general to text_en. This simple modification enables a more specific, English-aware analysis chain (stemming, for example), capable of real full-text searches in which different forms of a word match each other. Suppose you want to find all documents that deal with branching strategies; here is a possible query.


The key is the search query text:"branch strategy"~3, which states that I’m interested in documents containing both the terms branch and strategy, no more than three words apart. Since text was indexed with the text_en field type I get full-text search, and the highlights confirm it.
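A proximity query contains quotes and a tilde, so it is worth letting a library handle the URL encoding. The sketch below shows one way to do it; the proximity_query helper, the base URL and the core name are hypothetical names introduced only for illustration.

```python
from urllib.parse import urlencode

# Assumption: local Solr instance with a core named "collection1".
SOLR_SELECT_URL = "http://localhost:8983/solr/collection1/select"


def proximity_query(field, phrase, slop):
    """Build a Lucene proximity query such as text:"branch strategy"~3."""
    return f'{field}:"{phrase}"~{slop}'


def build_proximity_url(field, phrase, slop, base_url=SOLR_SELECT_URL):
    """Build a select URL for a proximity search with highlighting enabled."""
    params = {
        "q": proximity_query(field, phrase, slop),
        "hl": "true",
        "hl.fl": "content",  # stored field used to build snippets
    }
    # urlencode takes care of escaping the quotes, colon and spaces.
    return base_url + "?" + urlencode(params)


if __name__ == "__main__":
    print(build_proximity_url("text", "branch strategy", 3))
```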


Figure 2: Highlights for a proximity query with full-text search. As you can see, the word branching matches even though I searched for branch.

And voilà! You have full-text search inside file content with a minimal amount of work, and a simple REST interface for querying the index.

Gian Maria.

Published by

Ricci Gian Maria

.Net programmer, User group and community enthusiast, programmer - aspiring architect - and guitar player :). Visual Studio ALM MVP

14 thoughts on “Highlight matched text inside documents indexed with Solr plus Tika”

  1. Dear Ricci,

    This is the best example I’ve found about query highlighting paragraph on solr.

    I write you because I have implemented solr 6.5.1 today in my debian server but I have trouble getting the pdf text content. The search is ok, because the document name appears ok in highlighting. However, the does not appear with each str result:

    Additionally, the log is showing all the pdf text, so I assume it was correctly indexed with bin/post command.

    This is the query:


    And this is the result:




    Do you know what could be wrong with it?.

    Best Regards,

    Juan Jara.

  2. Probably the field where you are storing the text is configured with stored="false". Remember that for each field in Solr you can decide whether the engine keeps the original text for retrieval.

    In my example I explicitly set stored="false" for the text field; you can try changing that setting to true.

  3. Hi Ricci,
    I’m new to Solr. I tried the same but highlighting is not coming. Do I need to add some special code block or something. please suggest. I’ve indexed files using Tika.
    below are the changes I’ve made.



    added this code block under “browse” requesthandler.

    content features title name

    Still not able to get highlights. It’s not even show the Highlight tags in response.

  4. It is really difficult to troubleshoot with so little information, but the most common problem I’ve encountered is that the text extracted with Tika is put in a field that has stored="false", which prevents it from being used for highlighting. I switched to Elasticsearch a long time ago, so I’m not really familiar with release 6 of Solr, but I suggest you check whether the text extracted with Tika can be returned from a query. If the text cannot be returned, it is a clear sign that it is not stored, and thus highlighting will not work.

  5. Quick heads up:

    I’m a newbie on solr myself, but the guide has small error on the copy directive. You should use:

    The data-importer (tika) copies the file contents to the field “text”, that is not indexed.
    Then the copy-field directive copies the data from the “text” field to the “contents” field, that is indexed.

    Also, just a warning for beginners like me who stumbled upon this blog: I was using Solr 6.6.1 and I could not get the DIH to work with Tika. Switched to 6.5.x and it works fine.

  6. your information is incomplete. i want detailed information about how to highlight content in pdf file index. i am eagerly waiting for your response.
    Thanks in Advance

  7. What do you mean by “pdf file index”? Tika is able to extract text from binary files (PDF included); the text is then simply indexed in Solr or Elasticsearch. Finally, the highlight function is provided by Solr or ES and simply returns a piece of text where the match is found.

    With this process, extracting and searching file content works the same way for pdf, docx, xls and every other file type supported by Tika.

  8. kindly tell me where to configure the above lines.
    We have tried indexing binary files using Tika & successfully able to search those indexed files.
    But while highlighting we are not getting matched or searched words in highlighting result….we get only document id in result.
    Should we do any highlighting configuration in solr?
    please guide about implementation of highlighting in detail.

