Inverted index,

Dictionary compression:

Lucene’s swift hand strikes.

God damn Lucene is fast; no, like, really fast. Faster than should be possible. Even too fast. Embarrassingly fast.

“We can’t implement that feature, it will be too slow”

I’ve been putting the finishing touches to our new product, getting it ready for its first public demo (ob-plug: If you use CVS, our new FishEye product will rule your world).

A cow orker and I were discussing the changelog page (here is an example). The conversation kinda went like this:

Spud: What do you think of the calendar?

Pete: Sweet! You think you can hyperlink only those days/months that have changes in them?

Spud: I can compute it, but that is 43 calculations per page (12 months plus 31 days). It will be too slow.

Pete: Give it a go?

(tap tap tap code code code)

Spud: Holy crap, each of those calculations takes under 10ms to compute. That’s under the threshold of the timer, even!

Pete: Cool, lets add more info then. How about we also compute the number of changes in each day/month/year, and display that in parenthesis?

Spud: No way, determining “does exist” is way different than counting. We can’t implement that, it will be too slow.

Pete: Give it a go?

(tap tap tap code code code)

Spud: Um, okay. We are now counting revisions for each time period (including the years). There are over 23 thousand documents, and we are still computing that in under 10ms!

Pete: :D

Spud: *sheepish grin*

What is an inverted index, and why does it make Lucene fast?

Lucene gets its god-like power by implementing an inverted index. You normally think about objects and values like this: I have an object, an object has fields, each field has a value. But an inverted index thinks like this: I have this value, and these objects have fields with that value.

Hmm, I’m not sure that really made sense. The best way I can think of explaining this is to show some code (I think of everything in terms of code).

This SlowCustomerIndex implements a collection of Customers and has two (slow) methods for finding customers by their first/last name. These methods are “slow” because they are O(N), where N is the number of customers in the collection.

import java.util.List;
import java.util.LinkedList;

public class SlowCustomerIndex {

    private final List<Customer> customers = new LinkedList<Customer>();

    public void addCustomer(Customer c) {
        customers.add(c);
    }

    public List<Customer> findByFirstName(String name) {
        List<Customer> result = new LinkedList<Customer>();
        for (Customer customer : customers) {
            if (customer.getFirstName().equals(name)) {
                result.add(customer);
            }
        }
        return result;
    }
    public List<Customer> findByLastName(String name) {
        List<Customer> result = new LinkedList<Customer>();
        for (Customer customer : customers) {
            if (customer.getLastName().equals(name)) {
                result.add(customer);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        SlowCustomerIndex idx = new SlowCustomerIndex();
        idx.addCustomer(new Customer("Joe", "Smith"));
        idx.addCustomer(new Customer("Joe", "Barnes"));
        idx.addCustomer(new Customer("Jane", "Alexander"));
        idx.addCustomer(new Customer("Rex", "Alexander"));
        idx.addCustomer(new Customer("Franc", "Jones"));

        System.out.println("Joe: " + idx.findByFirstName("Joe"));
        System.out.println("Alexander: " + idx.findByLastName("Alexander"));
    }
}

Okay, we can do way better than O(N). Can you work out how fast the following version of CustomerIndex is? Don’t be too scared of the index variable; just think “it maps ‘first’ to ‘Joe’ to the list of Customers that have a first name of ‘Joe’. (I used Generics in this example, because I felt it made the index variable more obvious… but now I’m not so sure.)

import java.util.List;
import java.util.LinkedList;
import java.util.Map;
import java.util.HashMap;
import java.util.Collections;

public class FastCustomerIndex {
    /**
     * A map of field name ("first" or "last") to a Map that maps
     * values (first names or last names) to Customers that match that name/value
     * (ie; that has that first/last name).
     */
    private final Map<String, Map<String, List<Customer>>> index =
        new HashMap<String, Map<String, List<Customer>>>();

    private List<Customer> getFieldList(String fieldName, String fieldValue) {
        Map<String, List<Customer>> fieldMap = index.get(fieldName);
        if (fieldMap == null) {
            fieldMap = new HashMap<String, List<Customer>>();
            index.put(fieldName, fieldMap);
        }

        List<Customer> l = fieldMap.get(fieldValue);
        if (l == null) {
            l = new LinkedList<Customer>();
            fieldMap.put(fieldValue, l);
        }
        return l;
    }

    public void addCustomer(Customer c) {
        getFieldList("first", c.getFirstName()).add(c);
        getFieldList("last", c.getLastName()).add(c);
    }

    public List<Customer> findByFirstName(String name) {
        return Collections.unmodifiableList(getFieldList("first", name));
    }
    public List<Customer> findByLastName(String name) {
        return Collections.unmodifiableList(getFieldList("last", name));
    }
}

The findByFirstName() method just calls getFieldList(), which then calls HashMap.get() twice. Which is (more or less) O(1). Which makes our getFieldList() method (more or less) O(1). That’s much better than O(N). Got 1 million customers in the index? Finding them by first name can be done in constant time!

Lucene’s inverted index is asymptotically O(1) as well (as far as I can tell). Or at least, it’s worse-case performance is something like O(k) or O(log k), where k is the number of unique terms for the given field (in our example, there are 4 unique first-name terms, ‘Joe’, ‘Jane’, ‘Rex’ and ‘Frank’). This is still much better than O(N), especially when a field has many re-occurring terms. And in practice, Lucene’s implementation of that O(k) search is ridiculously fast.

(In Lucene parlance, Customer is a “document”, a customer’s first name is a “field”, and a particular first name (like ‘Joe’) is a “term”.)

Using Lucene

There are plenty of articles written on using Lucene (starting points here and here), so I won’t repeat them here. My one tip is: “everything is a document”; be prepared to de-normalize your data to get the most out of Lucene.

More on FishEye

Just to push the FishEye plug a little further, one of its cool features is EyeQL (an SQL-like language for searching your repository, more info here). I’ve been able to extract a lot of mileage out of this mini domain-language in FishEye. Lucene was one of the key enabling technologies that made implementing EyeQL easy, the other was ANTLR. When I get some more time, I hope to write a bit more on implementing your own mini-languages using ANTLR.