Captain Codeman

NHibernate.Search using Lucene.NET Full Text Index (2)

Introduction

In NHibernate.Search using Lucene.NET Full Text Index (Part 1) we looked at setting up the NHibernate.Search extension to add full-text searching of NHibernate-persisted objects.

Next, we’ll look at how to perform Google-like searches using the Lucene.NET index, along with some tips on displaying the results, including highlighting the search terms.

Our Book class has the Title, Summary, Authors and Publisher fields indexed, so we’ll allow searching in any of them. However, a search term found in the title is probably more relevant than one found only in the summary, so we want to give some fields more priority than others. Likewise, we probably want to be able to specify which fields to search. Otherwise a search for “Martin Fowler” would return books that merely mention him in the summary, when we may only want books that have “Martin Fowler” as an author.

Also worth mentioning is the Summary field. The Book class has a SummaryHtml field which (you’ll never guess) contains the Html summary retrieved from Amazon, and also a Summary field which is the one that is actually indexed. In the full app this text field is generated from the Html content using the Html Agility Pack library. We want a plain-text version of the Summary for two reasons. It makes indexing easier and more accurate (no HTML tags), and it allows result fragments to be created safely: if a section of the SummaryHtml was output, it could be cut in the middle of an Html element or attribute (producing invalid markup), or include an opening tag but not the matching closing one (producing runaway bold text, for instance).
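To make the plain-text step concrete, here is a minimal sketch using only the .NET base class library. It is a crude stand-in for a proper HTML parser, not the code from the full app: the `SummaryText` class and `ToPlainText` method are hypothetical names, and the regular expression will not cope with comments, scripts or badly broken markup.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

public static class SummaryText
{
    // Hypothetical helper: turns an HTML summary into indexable plain text.
    // Assumption: a simple regex is good enough for illustration only.
    public static string ToPlainText(string html)
    {
        if (string.IsNullOrEmpty(html))
        {
            return string.Empty;
        }

        // Replace anything that looks like a tag with a space so that
        // words on either side of a tag do not run together.
        string noTags = Regex.Replace(html, "<[^>]*>", " ");

        // Turn entities such as &amp; back into their characters.
        string decoded = WebUtility.HtmlDecode(noTags);

        // Collapse the runs of whitespace left behind by removed tags.
        return Regex.Replace(decoded, @"\s+", " ").Trim();
    }
}
```

Given `<p>Learn <b>XML</b> &amp; XSLT</p>`, this returns `Learn XML & XSLT`. Cutting the raw HTML at 12 characters instead would leave `<p>Learn <b>`, which is exactly the unclosed bold tag problem described above.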

Back to our example though. To be able to show the highlighted search terms in the results I found it easier to create a special BookSearchResult class that I can return from the data provider - the highlighting is something Lucene.NET can do for us and avoids us having to write our own presentation code to handle it. Here is the class:

  /// <summary>
  /// A wrapper for a book object returned from a full text index query
  /// with additional properties for highlighted segments
  /// </summary>
  public class BookSearchResult : IBookSearchResult
  {
    private readonly IBook _book;
    private string _highlightedTitle;
    private string _highlightedSummary;
    private string _highlightedAuthors;
    private string _highlightedPublisher;

    /// <summary>
    /// Initializes a new instance of the <see cref="BookSearchResult"/> class.
    /// </summary>
    /// <param name="book">The book.</param>
    public BookSearchResult(IBook book)
    {
      _book = book;
    }

    /// <summary>
    /// Gets the book.
    /// </summary>
    /// <value>The book.</value>
    public IBook Book
    {
      get { return _book; }
    }

    /// <summary>
    /// Gets or sets the highlighted title.
    /// </summary>
    /// <value>The highlighted title.</value>
    public string HighlightedTitle
    {
      get
      {
        if (_highlightedTitle == null || _highlightedTitle.Length == 0)
        {
          return _book.Title;
        }
        return _highlightedTitle;
      }
      set { _highlightedTitle = value; }
    }

    /// <summary>
    /// Gets or sets the highlighted summary.
    /// </summary>
    /// <value>The highlighted summary.</value>
    public string HighlightedSummary
    {
      get
      {
        if (_highlightedSummary == null || _highlightedSummary.Length == 0)
        {
          if (_book.Summary == null || _book.Summary.Length <= 300)
          {
            return _book.Summary;
          }
          else
          {
            return _book.Summary.Substring(0,300) + " ...";
          }
        }
        return _highlightedSummary;
      }
      set { _highlightedSummary = value; }
    }

    /// <summary>
    /// Gets or sets the highlighted authors.
    /// </summary>
    /// <value>The highlighted authors.</value>
    public string HighlightedAuthors
    {
      get
      {
        if (_highlightedAuthors == null || _highlightedAuthors.Length == 0)
        {
          return _book.Authors;
        }
        return _highlightedAuthors;
      }
      set { _highlightedAuthors = value; }
    }

    /// <summary>
    /// Gets or sets the highlighted publisher.
    /// </summary>
    /// <value>The highlighted publisher.</value>
    public string HighlightedPublisher
    {
      get
      {
        if (_highlightedPublisher == null || _highlightedPublisher.Length == 0)
        {
          return _book.Publisher;
        }
        return _highlightedPublisher;
      }
      set { _highlightedPublisher = value; }
    }
  }

You’ll notice that each Highlighted… property returns the equivalent book property when no highlighted value has been set (the summary is cut down to its first 300 characters). This saves us having to check whether there is a highlighted term in each field when we’re building the search result list.

Our data provider will accept a single string consisting of the entered search-terms and return a list of BookSearchResult objects that match. Here is the code and I’ll then try and explain what it’s doing:

  /// <summary>
  /// Finds the books.
  /// </summary>
  /// <param name="query">The query.</param>
  /// <returns></returns>
  public override IList<IBookSearchResult> FindBooks(string query)
  {
    IList<IBookSearchResult> results = new List<IBookSearchResult>();

    Analyzer analyzer = new SimpleAnalyzer();
    MultiFieldQueryParser parser = new MultiFieldQueryParser(
                     new string[] { "Title", "Summary", "Authors", "Publisher"},
                     analyzer);
    Query queryObj;

    try
    {
      queryObj = parser.Parse(query);
    }
    catch (ParseException)
    {
      // TODO: provide feedback to user on failed search expressions
      return results;
    }

    IFullTextSession session = (IFullTextSession) NHibernateHelper.GetCurrentSession();
    IQuery nhQuery = session.CreateFullTextQuery(queryObj, new Type[] {typeof (Book) } );

    IList<IBook> books = nhQuery.List<IBook>();

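    // Rewrite the query against the index so that wildcard and prefix
    // expressions are expanded into the concrete terms the highlighter needs.
    IndexReader indexReader = IndexReader.Open(SearchFactory.GetSearchFactory(session)
                       .GetDirectoryProvider(typeof (Book)).Directory);
    Query simplifiedQuery;
    try
    {
      simplifiedQuery = queryObj.Rewrite(indexReader);
    }
    finally
    {
      // the reader is only needed for the rewrite, so release it straight away
      indexReader.Close();
    }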

    SimpleHTMLFormatter formatter = new SimpleHTMLFormatter("<b class='term'>", "</b>");

    Highlighter hTitle = GetHighlighter(simplifiedQuery, formatter, "Title", 100);
    Highlighter hSummary = GetHighlighter(simplifiedQuery, formatter, "Summary", 200);
    Highlighter hAuthors = GetHighlighter(simplifiedQuery, formatter, "Authors", 100);
    Highlighter hPublisher = GetHighlighter(simplifiedQuery, formatter, "Publisher", 100);

    foreach(IBook book in books)
    {
      IBookSearchResult result = new BookSearchResult(book);

      // Use an empty string for missing values so the highlighter is never
      // handed a null text argument.
      string title = book.Title ?? string.Empty;
      TokenStream tsTitle = analyzer.TokenStream("Title", new System.IO.StringReader(title));
      result.HighlightedTitle = hTitle.GetBestFragment(tsTitle, title);

      string authors = book.Authors ?? string.Empty;
      TokenStream tsAuthors = analyzer.TokenStream("Authors", new System.IO.StringReader(authors));
      result.HighlightedAuthors = hAuthors.GetBestFragment(tsAuthors, authors);

      string publisher = book.Publisher ?? string.Empty;
      TokenStream tsPublisher = analyzer.TokenStream("Publisher", new System.IO.StringReader(publisher));
      result.HighlightedPublisher = hPublisher.GetBestFragment(tsPublisher, publisher);

      string summary = book.Summary ?? string.Empty;
      TokenStream tsSummary = analyzer.TokenStream("Summary", new System.IO.StringReader(summary));
      result.HighlightedSummary = hSummary.GetBestFragments(tsSummary,
                    summary, 3, " ... <br /><br /> ... ");

      results.Add(result);
    }

    return results;
  }

  /// <summary>
  /// Gets the highlighter for the given field.
  /// </summary>
  /// <param name="query">The query.</param>
  /// <param name="formatter">The formatter.</param>
  /// <param name="field">The field.</param>
  /// <param name="fragmentSize">Size of the fragment.</param>
  /// <returns></returns>
  private static Highlighter GetHighlighter(Query query, Formatter formatter,
                        string field, int fragmentSize)
  {
    // create a new query to contain the terms
    BooleanQuery termsQuery = new BooleanQuery();

    // extract terms for this field only
    WeightedTerm[] terms = QueryTermExtractor.GetTerms(query, true, field);
    foreach (WeightedTerm term in terms)
    {
      // create new term query and add to list
      TermQuery termQuery = new TermQuery(new Term(field, term.GetTerm()));
      termsQuery.Add(termQuery, BooleanClause.Occur.SHOULD);
    }

    // create query scorer based on term queries (field specific)
    QueryScorer scorer = new QueryScorer(termsQuery);

    Highlighter highlighter = new Highlighter(formatter, scorer);
    highlighter.SetTextFragmenter(new SimpleFragmenter(fragmentSize));

    return highlighter;
  }

First, we parse the user-entered query string using the MultiFieldQueryParser, indicating that we want to match on the Title, Summary, Authors and Publisher fields. This turns the search expression into Lucene-specific instructions. Most users will enter a simple expression containing the words or phrase that they want to find. If the search term ‘XML’ is entered, for example, Lucene will convert this into the expression “Title:XML Summary:XML Authors:XML Publisher:XML” which effectively means “find any record where ‘XML’ exists in any of the fields”.

The user can also enter specific instructions directly, such as “Title:Architecture Authors:Fowler” which means “find any books that have ‘Architecture’ in the Title field or ‘Fowler’ in the Authors field”. Boolean expressions can be used to control this further: “(Title:Architecture) AND (Authors:Fowler)” finds only books that have ‘Architecture’ in the Title and ‘Fowler’ in the Authors. When field-specific terms like these are entered, the MultiFieldQueryParser doesn’t expand them to include all fields; only words and phrases without a field prefix are expanded.
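The expansion rule can be pictured with a toy model. This is only an illustration and not how Lucene's parser is written: `QueryExpansionSketch` and `Expand` are hypothetical names, the input is split on spaces, and phrases, boolean operators and analysis are all ignored (the real parser also runs each term through the analyzer, so SimpleAnalyzer would lowercase them).

```csharp
using System;
using System.Collections.Generic;

public static class QueryExpansionSketch
{
    // Toy model of the rule described above: words with a field prefix are
    // left alone, words without one are searched in every default field.
    public static string Expand(string query, string[] fields)
    {
        var clauses = new List<string>();
        string[] words = query.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);

        foreach (string word in words)
        {
            if (word.Contains(":"))
            {
                // already targets a specific field, keep it as typed
                clauses.Add(word);
            }
            else
            {
                // no field prefix, so repeat the word for each field
                var perField = new List<string>();
                foreach (string field in fields)
                {
                    perField.Add(field + ":" + word);
                }
                clauses.Add(string.Join(" ", perField));
            }
        }

        return string.Join(" ", clauses);
    }
}
```

With the four fields used in this post, `Expand("XML", fields)` gives `Title:XML Summary:XML Authors:XML Publisher:XML`, while in `Title:Architecture Fowler` only the un-prefixed `Fowler` is repeated across the fields.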

Incidentally, in the original Book class we included attributes to control the indexing such as [Boost(10)] for the Title. This boosts the relevance of searches on certain fields so a search for ‘XML’ in the Title and Summary of a document will rank books with ‘XML’ in the Title higher than books that have ‘XML’ in the summary - they are more likely to be what the user is searching for in this case.

Lucene does provide many other ways to define a query but this is simple and easy for this example.

Once we have our Lucene query object we use this to create an NHibernate.Search full-text query to return Book objects. This is where NHibernate and Lucene meet (from a querying point of view). It is possible to combine full-text-queries of Lucene with NHibernate queries of the database - NHibernate.Search handles the searching and returns the relevant objects.

So, we now have a list of Book objects just the same as if it had come directly from NHibernate except that the results are in order based on the rank provided by the Lucene search.

Now, we’ll use another part of Lucene to highlight the matches. This is done using the SimpleHTMLFormatter, QueryScorer and Highlighter objects which combined allow us to get a fragment for each field with the search terms highlighted.

Note that the SimpleHTMLFormatter class is not in the main Lucene.Net.dll assembly but in a separate contrib assembly called Highlighter.Net.dll. There are also some other interesting utilities worth exploring in the contrib folder of the Lucene.NET distribution. Remember that in Part 1 I mentioned problems with assembly references and different versions of Lucene.Net.dll being used by NHibernate.Search. If you have problems building the solution after adding references to these contrib assemblies, consider building NHibernate.Search yourself, making sure that it references the same Lucene.Net.dll that the Lucene contrib assemblies were built against.

The Highlighter object for each field has to be based on the query terms for that field only, so the original query is rewritten and split up so that only the terms searched for in that field are used. This isn’t strictly necessary, but I think it makes more sense: if you search for ‘Microsoft’ in the Title of a book only, then occurrences of ‘Microsoft’ in the Summary or Publisher fields should not be highlighted. The highlighted results then show clearly which found terms influenced the results. I have split this functionality into a separate GetHighlighter() method.

For example, without doing this a search for ‘Title:Microsoft’ incorrectly highlights the occurrences of ‘Microsoft’ found within the Authors, Publisher and Summary fields, even though they did not really contribute to the Book being included in the results or its rank within them:

highlight_wrong

By creating the proper Highlighter for each field based on the terms used to search it the search results can be shown correctly without highlighting the un-searched fields / terms:

highlight_correct

Also, note that the fragments produced for the Summary are different. If separate terms are used for the Title and Summary, then having the Title terms highlighted in the Summary could produce incorrect or sub-standard fragments.
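The difference between the two approaches can be shown without Lucene at all. What follows is a toy sketch and not the contrib Highlighter: `FieldHighlightSketch` and `Highlight` are hypothetical names, and matching is a plain case-insensitive whole-word regular expression rather than analyzer tokens.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class FieldHighlightSketch
{
    // Wraps whole-word matches of the given terms in the same
    // <b class='term'> markup used by the formatter in this post.
    // Each field is passed only the terms that were searched for in it.
    public static string Highlight(string text, ICollection<string> terms)
    {
        if (string.IsNullOrEmpty(text) || terms.Count == 0)
        {
            return text;
        }

        // Build one alternation so that markup added for one term
        // can never be matched again by another term.
        var escaped = new List<string>();
        foreach (string term in terms)
        {
            escaped.Add(Regex.Escape(term));
        }
        string pattern = @"\b(?:" + string.Join("|", escaped) + @")\b";

        return Regex.Replace(text, pattern, "<b class='term'>$0</b>",
                             RegexOptions.IgnoreCase);
    }
}
```

For a search on ‘Title:Microsoft’ the Title gets the term list `{ "microsoft" }` and the Publisher gets an empty list, so `Highlight("Inside Microsoft Windows", new[] { "microsoft" })` returns `Inside <b class='term'>Microsoft</b> Windows`, while `Highlight("Microsoft Press", new string[0])` returns the text unchanged.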

Having built our Highlighters we can then iterate over the results creating a BookSearchResult to wrap each book in the result set. The same analyzer used in the initial query is then used to get a TokenStream for each field which the Highlighter instance needs to create the highlighted fragment from.

For the Title, Authors and Publisher fields we return a single fragment, which will normally be the field itself with the highlighted search terms wrapped in Html tags (courtesy of the SimpleHTMLFormatter class). The highlighted Summary is set to the best 3 fragments, separated by an ellipsis, a blank line and another ellipsis (the " ... <br /><br /> ... " separator in the code). However big the summary is, this ensures that the results contain a similar-sized chunk of text with the best fragments shown (those containing the most highlighted terms).
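The idea of picking the best fragments can be sketched in a few lines of plain C#. This is a deliberately naive model, not the Highlighter's algorithm: `FragmentSketch` and `BestFragments` are hypothetical names, the text is cut at fixed offsets (so a term straddling a boundary is missed), only a single term is scored, and returning the fragments best-first is a choice made for this sketch.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class FragmentSketch
{
    // Cuts the text into fixed-size chunks, scores each chunk by the number
    // of times the term occurs in it, and joins the highest scoring chunks.
    public static string BestFragments(string text, string term, int size,
                                       int max, string separator)
    {
        var chunks = new List<string>();
        for (int i = 0; i < text.Length; i += size)
        {
            chunks.Add(text.Substring(i, Math.Min(size, text.Length - i)));
        }

        return string.Join(separator,
            chunks.Select(c => new { Text = c, Score = CountHits(c, term) })
                  .Where(c => c.Score > 0)          // drop chunks with no match
                  .OrderByDescending(c => c.Score)  // best chunks first
                  .Take(max)
                  .Select(c => c.Text));
    }

    // Counts case-insensitive occurrences of the term in one chunk.
    private static int CountHits(string chunk, string term)
    {
        if (string.IsNullOrEmpty(term))
        {
            return 0;
        }

        int count = 0;
        int index = 0;
        while ((index = chunk.IndexOf(term, index, StringComparison.OrdinalIgnoreCase)) >= 0)
        {
            count++;
            index += term.Length;
        }
        return count;
    }
}
```

Whatever the length of the input, the output is limited to at most `max` chunks of `size` characters, which is why the results stay a similar size for short and long summaries alike.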

Here is an example of the results for ‘Title:Software Summary:Requirements Authors:Steve’ after formatting and CSS applied to show the highlighted terms in yellow:

search_results

Lucene.NET can do a lot more than I’ve shown here. I found the best resource for learning about how to use it is the ‘Lucene in Action’ book:

**Lucene in Action (In Action series)** by Otis Gospodnetic, Erik Hatcher

Note that this covers the Java version but applies equally well to the .NET port which is practically identical.

I hope this has been useful. In Part 3 I’ll try and demonstrate using the Lucene.NET index to find similar items based on the frequency of shared terms. This can be used to provide ‘other books you may like’ or ‘blog posts like this one’ type functionality.