Lucene made my app embarrassingly fast

Inverted index,
Dictionary compression:
Lucene’s swift hand strikes.

God damn Lucene is fast; no, like, really fast. Faster than should be possible. Even too fast. Embarrassingly fast.

“We can’t implement that feature, it will be too slow”

I’ve been putting the finishing touches to our new product, getting it ready for its first public demo (ob-plug: If you use CVS, our new FishEye product will rule your world).

A cow orker and I were discussing the changelog page (here is an example). The conversation kinda went like this:

Spud: What do you think of the calendar?
Pete: Sweet! You think you can hyperlink only those days/months that have changes in them?
Spud: I can compute it, but that is 43 calculations per page (12 months plus 31 days). It will be too slow.
Pete: Give it a go?
(tap tap tap code code code)
Spud: Holy crap, each of those calculations takes under 10ms to compute. That’s under the threshold of the timer, even!
Pete: Cool, lets add more info then. How about we also compute the number of changes in each day/month/year, and display that in parenthesis?
Spud: No way, determining “does exist” is way different than counting. We can’t implement that, it will be too slow.
Pete: Give it a go?
(tap tap tap code code code)
Spud: Um, okay. We are now counting revisions for each time period (including the years). There are over 23 thousand documents, and we are still computing that in under 10ms!
Pete: :D
Spud: *sheepish grin*

What is an inverted index, and why does it make Lucene fast?

Lucene gets its god-like power by implementing an inverted index. You normally think about objects and values like this: I have an object, an object has fields, each field has a value. But an inverted index thinks like this: I have this value, and these objects have fields with that value.

Hmm, I’m not sure that really made sense. The best way I can think of explaining this is to show some code (I think of everything in terms of code).

This SlowCustomerIndex implements a collection of Customers and has two (slow) methods for finding customers by their first/last name. These methods are “slow” because they are O(N), where N is the number of customers in the collection.

import java.util.List;
import java.util.LinkedList;

public class SlowCustomerIndex {

    private final List<Customer> customers = new LinkedList<Customer>();

    public void addCustomer(Customer c) {
        customers.add(c);
    }

    public List<Customer> findByFirstName(String name) {
        List<Customer> result = new LinkedList<Customer>();
        for (Customer customer : customers) {
            if (customer.getFirstName().equals(name)) {
                result.add(customer);
            }
        }
        return result;
    }
    public List<Customer> findByLastName(String name) {
        List<Customer> result = new LinkedList<Customer>();
        for (Customer customer : customers) {
            if (customer.getLastName().equals(name)) {
                result.add(customer);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        SlowCustomerIndex idx = new SlowCustomerIndex();
        idx.addCustomer(new Customer("Joe", "Smith"));
        idx.addCustomer(new Customer("Joe", "Barnes"));
        idx.addCustomer(new Customer("Jane", "Alexander"));
        idx.addCustomer(new Customer("Rex", "Alexander"));
        idx.addCustomer(new Customer("Franc", "Jones"));

        System.out.println("Joe: " + idx.findByFirstName("Joe"));
        System.out.println("Alexander: " + idx.findByLastName("Alexander"));
    }
}

Okay, we can do way better than O(N). Can you work out how fast the following version of CustomerIndex is? Don’t be too scared of the index variable; just think “it maps ‘first’ to ‘Joe’ to the list of Customers that have a first name of ‘Joe’. (I used Generics in this example, because I felt it made the index variable more obvious… but now I’m not so sure.)

import java.util.List;
import java.util.LinkedList;
import java.util.Map;
import java.util.HashMap;
import java.util.Collections;

public class FastCustomerIndex {
    /**
     * A map of field name ("first" or "last") to a Map that maps
     * values (first names or last names) to Customers that match that name/value
     * (ie; that has that first/last name).
     */
    private final Map<String, Map<String, List<Customer>>> index =
        new HashMap<String, Map<String, List<Customer>>>();

    private List<Customer> getFieldList(String fieldName, String fieldValue) {
        Map<String, List<Customer>> fieldMap = index.get(fieldName);
        if (fieldMap == null) {
            fieldMap = new HashMap<String, List<Customer>>();
            index.put(fieldName, fieldMap);
        }

        List<Customer> l = fieldMap.get(fieldValue);
        if (l == null) {
            l = new LinkedList<Customer>();
            fieldMap.put(fieldValue, l);
        }
        return l;
    }

    public void addCustomer(Customer c) {
        getFieldList("first", c.getFirstName()).add(c);
        getFieldList("last", c.getLastName()).add(c);
    }

    public List<Customer> findByFirstName(String name) {
        return Collections.unmodifiableList(getFieldList("first", name));
    }
    public List<Customer> findByLastName(String name) {
        return Collections.unmodifiableList(getFieldList("last", name));
    }
}

The findByFirstName() method just calls getFieldList(), which then calls HashMap.get() twice. Which is (more or less) O(1). Which makes our getFieldList() method (more or less) O(1). That’s much better than O(N). Got 1 million customers in the index? Finding them by first name can be done in constant time!

Lucene’s inverted index is asymptotically O(1) as well (as far as I can tell). Or at least, it’s worse-case performance is something like O(k) or O(log k), where k is the number of unique terms for the given field (in our example, there are 4 unique first-name terms, ‘Joe’, ‘Jane’, ‘Rex’ and ‘Frank’). This is still much better than O(N), especially when a field has many re-occurring terms. And in practice, Lucene’s implementation of that O(k) search is ridiculously fast.

(In Lucene parlance, Customer is a “document”, a customer’s first name is a “field”, and a particular first name (like ‘Joe’) is a “term”.)

Using Lucene

There are plenty of articles written on using Lucene (starting points here and here), so I won’t repeat them here. My one tip is: “everything is a document”; be prepared to de-normalize your data to get the most out of Lucene.

More on FishEye

Just to push the FishEye plug a little further, one of its cool features is EyeQL (an SQL-like language for searching your repository, more info here). I’ve been able to extract a lot of mileage out of this mini domain-language in FishEye. Lucene was one of the key enabling technologies that made implementing EyeQL easy, the other was ANTLR. When I get some more time, I hope to write a bit more on implementing your own mini-languages using ANTLR.

8 Comments

  1. Gary
    Posted March 24, 2004 at 6:13 pm | Permalink

    Matt: Your custom query language is "SQL-like", and from the looks of it would be very easy to learn for someone used to writing SQL. But do you think people will be willing to learn a new language, even one has familiar as this one, just to make full use of your product? It almost seems a bit like overkill to me.

  2. Posted March 24, 2004 at 9:27 pm | Permalink

    Gary,

    Good point. No, no one has to learn EyeQL to make full use of FishEye. The final version will have all sorts of GUI goodness to help you get the data and reports you need (wizards, canned reports, etc.).

  3. Gary
    Posted March 24, 2004 at 10:37 pm | Permalink

    Will there be any way to look at the EyeQL that gets generated by using a wizard or GUI form to create a query? That would be a nice way to learn the language by example.

  4. Posted March 24, 2004 at 10:41 pm | Permalink

    Lucene comments — :) FishEye — :))
    Me like: http://jroller.com/page/otis/20040324#fisheye_candy

  5. Posted March 24, 2004 at 10:50 pm | Permalink

    Gray,

    you bet! Most things in FishEye are powered by EyeQL on the inside (which is in turn powered by Lucene; go Lucene!). Some of these pages will have a "see associated EyeQL" kind of link.

  6. Posted April 21, 2004 at 4:17 pm | Permalink

    Matt, nice descriptions – I really need to get a round tuit for Lucene…

    You should get to know Python/Jython, especially if you think in & explain things in code. It is much clearer and more concise than "Java as pseudocode".

    I translated the FastCustomerIndex example into Python.

    A few things about Python illustrated below:

    • init defines a class constructor

    • [a,b,c] defines a list

    • {key: value, key1:value1 } defines a dictionary (like a map)

    • you must explicitly pass the class instance into methods ("self", corrseponding to Java’s "this"), and must use it explicitly to access fields/methods

    • triple-quoted strings can span lines:
      """line 1
      line 2"""
      A string as the first part of a method body becomes the method’s
      documentation string – an attribute of the method that is available
      at runtime for inspection (Reflection)

    • I didn’t return immutable lists – I found the code clearer without that. The easiest way in Python
      is just to copy the list, but if size is a concern, you could wrap it in an immutable sequence class as well…

      class FastCustomerIndex:
      def init( self ):
      """
      Construct FastCustomerIndex.
      index is a dict that maps field name ("first" or "last") to a dict
      that maps values (first names or last names) to a list of Customers with
      that name/value (ie; that has that first/last name).

      Example: Given j = Customer( "George Bush" )
      {
      "first" : { "George": [ Customer( "George Bush" ) ] },
      "last" : { "Bush": [ Customer( "George Bush" ) ] }
      }
      """
      index = {}

      def getFieldList( self, fieldName, fieldValue ):
      """Returns list of customers"""
      fieldMap = self.index.setdefault( fieldName, {} )
      l = fieldMap.setdefault( fieldValue, [] )
      return l

      def addCustomer( self, customer ):
      """Adds customer to first and last name indices"""
      self.getFieldList( "first", customer.firstName ).append( c )
      self.getFieldList( "last", customer.lastName ).append( c )

      def findByFirstName( self, firstName ):
      """Returns a copy of the list of customers matching firstName"""
      return self.getFieldList( "first", firstName )

      def findByLastName( self, lastName ):
      """Returns a copy of the list of customers matching lastName"""
      return self.getFieldList( "last", lastName )

  7. Posted April 22, 2004 at 12:11 am | Permalink

    Kevin,

    I do indeed know python (infact, this site is generated by a Jython script I hacked together). I can even point out a bug in your code; the constructor should contain the line "self.index={}" not "index={}" :P

    I do have a preference for doing code examples in Java since that is the main theme of my blog, and I can assume that the reader will be more familiar with Java than something else. Having said that, I’m a big fan of Python/Jython from way back.

    =Matt

  8. Posted July 25, 2004 at 4:08 pm | Permalink

    I learned that Lucene is fast the hard way. I implemented my own file-based B+Tree Indexing, dictionary, and inverted index. I paid attention to every disk seek. I thought, "I can get more performance on the indexing if I broke it up into memory sized chunks, flushed it to disk, then started a new file that I would merge after so many files accumulate." Yeah, a bright one I was, beginning to catch on to this whole indexing thing. Then on some site I found a link to a Java base indexer hosted on Apache. I was suprised, because I looked everywhere for one before, and only found C-coded ones that had false positives because it used that weird table method instead of an inverted index. Lucene is indeed the fastest indexing engine I’ve found to date. I’ve had Gigabyte Lucene indexes, and the network is still the bottleneck.

Post a Comment

Your email is never shared. Required fields are marked *

*
*