We recently switched to a new search engine at work, largely written
by me on top of a standard back-end library.
We were previously using an engine written by the founders of the
project, and it had problems: when C code stops working after an
incremental libc upgrade, you can be reasonably confident there's
something a bit dodgy going on. We got a new version from the authors,
but it stemmed only indexed terms, not search terms – so the word
"rules" would be correctly indexed under "rule", but a search for
"rules" wouldn't have the S stripped off, and the site would fail to
find anything.
(This is more embarrassing when the search word in question is
"Energis".) Also, UTF-8 support was always… interesting.
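To make that failure concrete, here is a minimal sketch in Python. It is not the old engine's code or Lucy's: the toy stemmer (strip one trailing "s"), the function names and the sample documents are all assumptions for illustration.

```python
def naive_stem(word: str) -> str:
    """Toy stemmer (an assumption): lower-case, then strip one trailing 's'."""
    word = word.lower()
    return word[:-1] if word.endswith("s") and len(word) > 1 else word


def build_index(docs: dict[str, str]) -> dict[str, set[str]]:
    """Map each *stemmed* term to the ids of the documents containing it."""
    index: dict[str, set[str]] = {}
    for doc_id, text in docs.items():
        for word in text.split():
            index.setdefault(naive_stem(word), set()).add(doc_id)
    return index


def search(index: dict[str, set[str]], term: str, stem_query: bool) -> set[str]:
    """Look a term up, optionally stemming it the same way the index was."""
    key = naive_stem(term) if stem_query else term.lower()
    return index.get(key, set())


# Hypothetical sample documents.
docs = {"a": "house rules", "b": "Energis contract"}
index = build_index(docs)

broken = search(index, "rules", stem_query=False)  # looks up "rules", indexed as "rule"
fixed = search(index, "rules", stem_query=True)    # looks up "rule"
```

With the query left unstemmed the lookup misses, because the term only exists in the index in its stemmed form; applying the same stemmer on both sides makes the two agree.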
So I decided to build something based on Apache Lucy, and started by
writing a couple of
small test cases – like this blog's search engine, and something
similar for finding phrases anywhere in my ebook collection.
The old system offered only a monolithic interactive binary (I had to
use Expect to get search results out of it) or C library headers at
the same high level of abstraction, both with minimal documentation;
Lucy offers a rather more comprehensive Perl interface. It's a pretty
low-level one, with no calls like "index this entire directory", but
that's a good thing: there are plenty of opportunities to get into the
pathway between one component and the next and to fine-tune behaviour
to build the specific search engine I want for the documents we're
storing: a title is stemmed and sortable, body text is stemmed and not
sortable, a URL path or a date is sortable and not stemmed, and so
on.
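The per-field decisions above amount to a small table. A rough sketch of it in Python follows; the FieldSpec class and the field names are hypothetical, and this is not how Lucy itself declares a schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FieldSpec:
    """Hypothetical per-field settings: is the text stemmed, can we sort on it?"""
    stemmed: bool
    sortable: bool


# Field names are assumptions; the stemmed/sortable choices are the ones described above.
SCHEMA = {
    "title": FieldSpec(stemmed=True, sortable=True),
    "body": FieldSpec(stemmed=True, sortable=False),
    "path": FieldSpec(stemmed=False, sortable=True),
    "date": FieldSpec(stemmed=False, sortable=True),
}


def fields_where(schema: dict[str, FieldSpec], attribute: str) -> list[str]:
    """Return the names of fields for which the given boolean attribute is set."""
    return sorted(name for name, spec in schema.items() if getattr(spec, attribute))
```

Calling fields_where(SCHEMA, "sortable") then lists the fields that could be offered as sort keys in a search form.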
Indexing is a little faster – taking 10-20 minutes to reread and
reindex everything rather than 20-30 minutes. But it's also become
possible to do incremental indexing: remove this deleted file from
the index, add that new one, rather than having to rescan and
rebuild everything every time something's changed, which happens 8-10
times a day. (The old engine could in theory do incremental
addition, but in practice it didn't work reliably, and deletion wasn't
supported at all.) An incremental update takes only a few seconds.
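The idea behind incremental indexing can be sketched with a toy in-memory inverted index. This is an illustration only: the TinyIndex class and its methods are hypothetical, and a real engine does this on disk with far more care.

```python
class TinyIndex:
    """Toy inverted index supporting per-document add and remove."""

    def __init__(self) -> None:
        self.postings: dict[str, set[str]] = {}   # term -> document ids
        self.doc_terms: dict[str, set[str]] = {}  # document id -> its terms

    def add(self, doc_id: str, text: str) -> None:
        """Index one document; re-adding an id replaces the old version."""
        self.remove(doc_id)
        terms = set(text.lower().split())
        self.doc_terms[doc_id] = terms
        for term in terms:
            self.postings.setdefault(term, set()).add(doc_id)

    def remove(self, doc_id: str) -> None:
        """Drop one document, touching only the terms it contained."""
        for term in self.doc_terms.pop(doc_id, set()):
            ids = self.postings[term]
            ids.discard(doc_id)
            if not ids:
                del self.postings[term]

    def search(self, term: str) -> set[str]:
        """Return a copy of the ids of documents containing the term."""
        return set(self.postings.get(term.lower(), set()))


idx = TinyIndex()
idx.add("old.txt", "annual report draft")
idx.add("keep.txt", "annual summary")
idx.remove("old.txt")                  # a file was deleted
idx.add("new.txt", "quarterly report")  # a file was added
```

Remembering which terms each document contributed is what makes deletion cheap here: only those postings are touched, and nothing else is rescanned.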
The old engine didn't do date ranges, or dates at all really. Lucy
lets me take the user's basic query and add to it from other inputs: a
search for "something" within a date range and under a given path
might end up, in terms of the internal data objects, looking something
like:
something AND (date >= earliest) AND (date <= latest) AND path starts-with foo/bar/
In other words, all this stuff can move from fixup code that sits as a
shim between the search engine and the rest of the world, massaging
both incoming queries and outgoing search results, into code that
just adds to the query before it goes into the engine, which ends up
being rather faster – because the engine can do all the filtering (and
sorting, which was similarly problematic given the lack of date
support) internally.
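As a rough sketch of "just adding to the query", here is the same shape in Python. The node classes, field names and sample documents are all hypothetical stand-ins, not Lucy's query objects, and the matches function is only there to show that the combined tree filters as expected.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Term:
    field: str
    value: str


@dataclass(frozen=True)
class DateRange:
    field: str
    earliest: date
    latest: date


@dataclass(frozen=True)
class PathPrefix:
    field: str
    prefix: str


@dataclass(frozen=True)
class And:
    children: tuple


def build_query(user_term, earliest=None, latest=None, path_prefix=None):
    """Start from the user's term and AND on whatever extra inputs were given."""
    clauses = [Term("body", user_term.lower())]
    if earliest is not None or latest is not None:
        clauses.append(DateRange("date", earliest or date.min, latest or date.max))
    if path_prefix:
        clauses.append(PathPrefix("path", path_prefix))
    return And(tuple(clauses))


def matches(query, doc) -> bool:
    """Evaluate a query tree against one document (a plain dict)."""
    if isinstance(query, And):
        return all(matches(child, doc) for child in query.children)
    if isinstance(query, Term):
        return query.value in doc[query.field].lower().split()
    if isinstance(query, DateRange):
        return query.earliest <= doc[query.field] <= query.latest
    if isinstance(query, PathPrefix):
        return doc[query.field].startswith(query.prefix)
    raise TypeError(f"unknown query node: {query!r}")


# Hypothetical documents.
DOCS = [
    {"path": "foo/bar/a.txt", "date": date(2011, 3, 1), "body": "Something happened"},
    {"path": "foo/bar/b.txt", "date": date(2009, 3, 1), "body": "something old"},
    {"path": "baz/c.txt", "date": date(2011, 3, 1), "body": "something elsewhere"},
]

query = build_query("something", date(2011, 1, 1), date(2011, 12, 31), "foo/bar/")
hits = [doc["path"] for doc in DOCS if matches(query, doc)]
```

The point of the design is that nothing is filtered after the fact: the date and path restrictions are part of the one query the engine evaluates.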
One of the things our users want is a copy of the source document with
the query highlighted – not just a Google-style abstract of the
relevant bit of text, but every occurrence of query terms in the
original document. (The old engine had a program to do this, which was
withdrawn without comment in more recent versions, so I'd had to
cobble together my own query parser, which was never ideal.) Lucy will
give a highlighted abstract of the first match it's found, but that
doesn't help here. What it can usefully do is feed me the
tree-structured output of the query parser, so when a user searches
for
something AND "the other thing" AND NOT irrelevance
the same code that parses that for the search engine itself can give
me something I can break down into a list of words and phrases (well,
stemmed words and phrases) to look for and highlight in the document.
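A minimal sketch of that breakdown in Python, assuming a parse tree made of hypothetical Word, Phrase, And, Or and Not nodes (these are not Lucy's classes). Stemming is left out, and skipping negated terms is my assumption about what a highlighter would want.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Word:
    text: str


@dataclass(frozen=True)
class Phrase:
    words: tuple


@dataclass(frozen=True)
class And:
    children: tuple


@dataclass(frozen=True)
class Or:
    children: tuple


@dataclass(frozen=True)
class Not:
    child: object


def highlight_targets(node) -> list:
    """Flatten a query tree into word tuples: 1-tuples for words, n-tuples for phrases."""
    if isinstance(node, Word):
        return [(node.text,)]
    if isinstance(node, Phrase):
        return [node.words]
    if isinstance(node, Not):
        return []  # assumption: terms the user excluded are not highlighted
    if isinstance(node, (And, Or)):
        targets = []
        for child in node.children:
            targets.extend(highlight_targets(child))
        return targets
    raise TypeError(f"unknown query node: {node!r}")


# Tree for: something AND "the other thing" AND NOT irrelevance
tree = And((
    Word("something"),
    Phrase(("the", "other", "thing")),
    Not(Word("irrelevance")),
))
targets = highlight_targets(tree)
```

Each tuple can then be searched for in the source document as a run of consecutive words and wrapped in whatever highlight markup the page uses.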
We cut over on Wednesday 13 April. We've had a total of four user
queries about the new system. I think this has to count as a success.