RogerBW's Blog: New Search Engine at Work

We recently switched to a new search engine at work, largely written by me with a standard back-end library.

We were previously using an engine written by the founders of the project, and it had problems: when C code stops working after an incremental libc upgrade, you can be reasonably confident there's something a bit dodgy going on. We got a new version from the authors, but it didn't stem search terms, only indexed terms – so the word "rules" would be correctly indexed under "rule", but when you searched for "rules" it wouldn't have the S stripped off, so the site would fail to find anything. (This is more embarrassing when the search word in question is "Energis".) Also, UTF-8 support was always… interesting.
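The stemming mismatch is easy to reproduce in miniature. This is only a sketch in Python, not the old engine's code: `toy_stem` is a made-up plural stripper standing in for a real stemmer, and the documents are invented. It shows why the same stemming has to be applied to the query as to the indexed text.

```python
def toy_stem(word):
    """Crude stand-in for a real stemmer (assumption): lowercase, strip one trailing 's'."""
    w = word.lower()
    return w[:-1] if w.endswith("s") and len(w) > 3 else w


def build_index(docs):
    """Map each stemmed word to the set of document ids containing it."""
    index = {}
    for doc_id, text in docs.items():
        for word in text.split():
            index.setdefault(toy_stem(word), set()).add(doc_id)
    return index


def search(index, term, stem_query=True):
    """Look a term up; stem_query=False mimics the broken behaviour."""
    key = toy_stem(term) if stem_query else term.lower()
    return index.get(key, set())


# Invented sample documents.
docs = {"a": "house rules", "b": "Energis contract"}
index = build_index(docs)

unstemmed_hits = search(index, "rules", stem_query=False)  # finds nothing
stemmed_hits = search(index, "rules")                      # finds document "a"
```

With `stem_query=False` the lookup key is "rules", which was never stored, because the index only ever saw "rule".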

So I decided to build something based on Lucy, and started by writing a couple of small test cases – like this blog's search engine, and something similar for finding phrases anywhere in my ebook collection.

The old system offered only a monolithic interactive binary (I had to use Expect to get search results out of it) or C library headers at the same high level of abstraction, both with minimal documentation; Lucy offers a rather more comprehensive Perl interface. It's a pretty low-level one, with no calls like "index this entire directory", but that's a good thing: there are plenty of opportunities to get into the pathway between one component and the next, and to fine-tune behaviour so as to build the specific search engine I want for the documents we're storing. A title is stemmed and sortable, body text is stemmed and not sortable, a URL path or a date is sortable and not stemmed, and so on.
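The per-field choices can be written down as a small table. This is a Python sketch of the idea rather than Lucy's Perl interface; `FieldSpec`, `SCHEMA` and the helper functions are hypothetical names invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FieldSpec:
    """How one field is treated (hypothetical type, not a Lucy class)."""
    stemmed: bool
    sortable: bool


# The field treatments described in the text.
SCHEMA = {
    "title": FieldSpec(stemmed=True, sortable=True),
    "body": FieldSpec(stemmed=True, sortable=False),
    "path": FieldSpec(stemmed=False, sortable=True),
    "date": FieldSpec(stemmed=False, sortable=True),
}


def sortable_fields(schema):
    """Names of fields that results may be sorted on."""
    return sorted(name for name, spec in schema.items() if spec.sortable)


def stemmed_fields(schema):
    """Names of fields whose text goes through the stemmer."""
    return sorted(name for name, spec in schema.items() if spec.stemmed)
```

Keeping this in one place means the indexer and the search front end can both consult the same description of the fields.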

Indexing is a little faster – taking 10-20 minutes to reread and reindex everything rather than 20-30 minutes. But it's also become possible to do incremental indexing: remove this deleted file from the index, add that new one, rather than having to rescan and rebuild everything every time something's changed, which happens 8-10 times a day. An incremental update takes only a few seconds. (The old engine could in theory do incremental addition, but in practice it didn't work reliably, and deletion wasn't supported at all.)
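For anyone who hasn't met the idea, incremental indexing boils down to something like the following. It's a toy in-memory Python sketch, not how Lucy stores anything; `TinyIndex` and the file names are invented. The point is that adding or deleting one document touches only that document's terms.

```python
class TinyIndex:
    """Toy inverted index supporting per-document add and delete."""

    def __init__(self):
        self.postings = {}   # term -> set of paths containing it
        self.doc_terms = {}  # path -> set of terms in that document

    def add(self, path, text):
        """Index one document, replacing any earlier version of it."""
        self.delete(path)
        terms = set(text.lower().split())
        self.doc_terms[path] = terms
        for term in terms:
            self.postings.setdefault(term, set()).add(path)

    def delete(self, path):
        """Remove one document; a no-op if it was never indexed."""
        for term in self.doc_terms.pop(path, set()):
            paths = self.postings[term]
            paths.discard(path)
            if not paths:
                del self.postings[term]

    def search(self, term):
        """Return the paths of documents containing the term."""
        return set(self.postings.get(term.lower(), set()))


idx = TinyIndex()
idx.add("a.txt", "old news")
idx.add("b.txt", "fresh news")
idx.delete("a.txt")  # no rescan of b.txt needed
```

Keeping the per-document term list around is what makes deletion cheap: there's no need to walk the whole index looking for the departed file.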

The old engine didn't do date ranges, or dates at all really. Lucy lets me take the user's basic query and add to it from other inputs: a search for "something" within a date range and under a particular path might, in terms of the internal data objects, end up looking something like:

something AND (date >= earliest) AND (date <= latest) AND path starts-with foo/bar/

In other words, all this stuff can move out of fixup code that sits as a shim between the search engine and the rest of the world, massaging both incoming queries and outgoing search results, and into code that just adds to the query before it goes into the engine. That ends up being rather faster, because the engine can do all the filtering (and sorting, which was similarly problematic given the lack of date support) internally.
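Here's roughly what "adding to the query" means, as a Python sketch with invented classes (`Term`, `DateRange`, `PathPrefix`, `And`) rather than Lucy's own query objects. The user's word and the extra constraints are combined into one query object, and the toy engine does the filtering and the date sort itself.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Term:
    word: str

    def matches(self, doc):
        return self.word in doc["body"].lower().split()


@dataclass
class DateRange:
    earliest: date
    latest: date

    def matches(self, doc):
        return self.earliest <= doc["date"] <= self.latest


@dataclass
class PathPrefix:
    prefix: str

    def matches(self, doc):
        return doc["path"].startswith(self.prefix)


@dataclass
class And:
    children: list

    def matches(self, doc):
        return all(child.matches(doc) for child in self.children)


def build_query(user_word, earliest=date.min, latest=date.max, path_prefix=""):
    """Wrap the user's term with the date and path constraints."""
    return And([Term(user_word.lower()),
                DateRange(earliest, latest),
                PathPrefix(path_prefix)])


def run_query(query, docs):
    """Filter and sort by date in one place, as the engine would."""
    hits = [d for d in docs if query.matches(d)]
    return [d["path"] for d in sorted(hits, key=lambda d: d["date"])]


# Invented sample documents.
DOCS = [
    {"path": "foo/bar/a.html", "date": date(2011, 4, 13), "body": "something new"},
    {"path": "foo/bar/b.html", "date": date(2011, 2, 1), "body": "something old"},
    {"path": "foo/baz/c.html", "date": date(2011, 3, 1), "body": "something else"},
    {"path": "foo/bar/d.html", "date": date(2010, 6, 1), "body": "something ancient"},
    {"path": "foo/bar/e.html", "date": date(2011, 5, 1), "body": "nothing here"},
]

query = build_query("something", date(2011, 1, 1), date(2011, 12, 31), "foo/bar/")
```

Nothing outside `run_query` has to post-process the results, which is the whole attraction.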

One of the things our users want is a copy of the source document with the query highlighted – not just a Google-style abstract of the relevant bit of text, but every occurrence of query terms in the original document. (The old engine had a program to do this, which was withdrawn without comment in more recent versions, so I'd had to cobble together my own query parser, which was never ideal.) Lucy will give a highlighted abstract of the first match it's found, but that doesn't help here. What it can usefully do is feed me the tree-structured output of the query parser, so when a user searches for

something AND "the other thing" AND NOT irrelevance

the same code that parses that for the search engine itself can give me something I can break down into a list of words and phrases (well, stemmed words and phrases) to look for and highlight in the document.
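The breakdown step looks something like this. It's a Python sketch with several assumptions of mine: the parse tree is represented as nested tuples (not Lucy's real query objects), negated terms are skipped since a matching document shouldn't contain them, stemming is left out for brevity, and square brackets stand in for whatever highlight markup is wanted.

```python
import re


def collect_highlights(node):
    """Walk a parsed query and return the words and phrases to highlight."""
    kind = node[0]
    if kind in ("TERM", "PHRASE"):
        return [node[1]]
    if kind == "NOT":
        return []  # assumption: negated terms are not highlighted
    if kind in ("AND", "OR"):
        found = []
        for child in node[1]:
            found.extend(collect_highlights(child))
        return found
    raise ValueError("unknown node type: %r" % (kind,))


def highlight(text, targets):
    """Wrap every occurrence of each target in square brackets."""
    if not targets:
        return text
    # Longest first, so a phrase wins over a word it contains.
    ordered = sorted(targets, key=len, reverse=True)
    pattern = "|".join(r"\b" + re.escape(t) + r"\b" for t in ordered)
    return re.sub(pattern, lambda m: "[" + m.group(0) + "]", text,
                  flags=re.IGNORECASE)


# Hand-built tree for: something AND "the other thing" AND NOT irrelevance
tree = ("AND", [
    ("TERM", "something"),
    ("PHRASE", "the other thing"),
    ("NOT", ("TERM", "irrelevance")),
])

targets = collect_highlights(tree)
marked = highlight("Something about the other thing.", targets)
```

Because the list comes from the same parse the engine uses, the highlighter and the search can't disagree about what the user asked for.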

We cut over on Wednesday 13 April. We've had a total of four user queries about the new system. I think this has to count as a success.
