RogerBW's Blog

New Search Engine at Work
26 April 2016

We recently switched to a new search engine at work, largely written by me on top of a standard back-end library.

We were previously using an engine written by the founders of the project, and it had problems: when C code stops working after an incremental libc upgrade, you can be reasonably confident there's something a bit dodgy going on. We got a new version from the authors, but it stemmed only the terms it indexed, not the terms you searched for. So the word "rules" would be correctly indexed under "rule", but when you searched for "rules" the S wouldn't be stripped off, and the search would find nothing. (This is more embarrassing when the search word in question is "Energis".) Also, UTF-8 support was always… interesting.
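To make the failure concrete, here is a minimal sketch in Python. It is a toy model, not the old engine's code: `toy_stem`, `build_index` and `search` are hypothetical names, and the "stemmer" only strips one trailing S. It shows why stemming at index time but not at query time makes a search for "rules" come back empty.

```python
def toy_stem(word: str) -> str:
    """Crude stand-in for a real stemmer: lower-case, strip one trailing 's'."""
    word = word.lower()
    return word[:-1] if word.endswith("s") and len(word) > 3 else word


def build_index(docs: dict) -> dict:
    """Map each stemmed token to the set of document ids containing it."""
    index = {}
    for doc_id, text in docs.items():
        for token in text.split():
            index.setdefault(toy_stem(token), set()).add(doc_id)
    return index


def search(index: dict, term: str, stem_query: bool) -> set:
    """Look a single term up, optionally stemming it the way the indexer did."""
    key = toy_stem(term) if stem_query else term.lower()
    return index.get(key, set())


docs = {"a": "house rules", "b": "Energis contract"}
index = build_index(docs)

# The index holds "rule", so the unstemmed query "rules" misses it.
broken = search(index, "rules", stem_query=False)
# Stemming the query with the same function finds the document again.
fixed = search(index, "rules", stem_query=True)
```

The only point is that the indexer and the query path have to apply the same normalisation; which stemmer is used matters much less than using it on both sides.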

So I decided to build something based on Lucy, and started by writing a couple of small test cases – like this blog's search engine, and something similar for finding phrases anywhere in my ebook collection.

The old system offered only a monolithic interactive binary (I had to use Expect to get search results out of it) or C library headers at the same high level of abstraction, both with minimal documentation; Lucy offers a rather more comprehensive Perl interface. It's a pretty low-level one, with no calls like "index this entire directory", but that's a good thing: there are plenty of opportunities to get into the pathway between one component and the next, and to fine-tune behaviour to build the specific search engine I want for the documents we're storing. A title is stemmed and sortable, body text is stemmed and not sortable, a URL path or a date is sortable and not stemmed, and so on.
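The per-field choices described above can be written down as a small table. This is only an illustrative Python sketch: `FieldSpec` and `SCHEMA` are hypothetical names, and this is not how Lucy itself declares a schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FieldSpec:
    stemmed: bool   # run the text through the stemmer before indexing
    sortable: bool  # allow results to be ordered by this field


# Field names are assumptions based on the description in the text.
SCHEMA = {
    "title": FieldSpec(stemmed=True, sortable=True),
    "body": FieldSpec(stemmed=True, sortable=False),
    "path": FieldSpec(stemmed=False, sortable=True),
    "date": FieldSpec(stemmed=False, sortable=True),
}

# Fields a caller may sort results by.
sortable_fields = sorted(name for name, spec in SCHEMA.items() if spec.sortable)
```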

Indexing is a little faster: rereading and reindexing everything takes 10-20 minutes rather than 20-30. But it's also become possible to do incremental indexing: remove this deleted file from the index, add that new one. An incremental update takes only a few seconds, and it replaces having to rescan and rebuild everything every time something's changed, which happens 8-10 times a day. (The old engine could in theory do incremental addition, but in practice it didn't work reliably, and deletion wasn't supported at all.)
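The idea of incremental indexing can be sketched with a toy in-memory inverted index. This is an assumption-laden illustration in Python, not Lucy's implementation: `TinyIndex` and its methods are hypothetical, and a real engine persists its index to disk rather than holding dictionaries in memory.

```python
class TinyIndex:
    """Toy inverted index supporting per-document add and remove."""

    def __init__(self):
        self.postings = {}   # term -> set of doc ids
        self.doc_terms = {}  # doc id -> set of terms, needed for removal

    def add(self, doc_id: str, text: str) -> None:
        # Re-adding a document replaces its previous contents.
        self.remove(doc_id)
        terms = set(text.lower().split())
        self.doc_terms[doc_id] = terms
        for term in terms:
            self.postings.setdefault(term, set()).add(doc_id)

    def remove(self, doc_id: str) -> None:
        # Only this document's postings are touched; nothing is rescanned.
        for term in self.doc_terms.pop(doc_id, set()):
            ids = self.postings[term]
            ids.discard(doc_id)
            if not ids:
                del self.postings[term]

    def search(self, term: str) -> set:
        return set(self.postings.get(term.lower(), set()))


idx = TinyIndex()
idx.add("old.txt", "quarterly report")
idx.add("new.txt", "annual report")
idx.remove("old.txt")  # a deleted file leaves the index without a rebuild
```

The cost of each update is proportional to the size of the one document that changed, which is why it finishes in seconds where a full rebuild takes minutes.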

The old engine didn't do date ranges, or dates at all really. Lucy lets me take the user's basic query and add to it from other inputs. Expressed in terms of the internal data objects, a search for "something" within a date range and under a particular path might end up looking something like:

something AND (date >= earliest) AND (date <= latest) AND path starts-with foo/bar/

In other words, all this stuff can move out of fixup code that sits as a shim between the search engine and the rest of the world, massaging both incoming queries and outgoing search results, and into code that just adds to the query before it goes into the engine. That ends up being rather faster, because the engine can do all the filtering internally (and the sorting, which was similarly problematic given the lack of date support).
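As a rough illustration of "adding to the query" rather than filtering results afterwards, here is a small Python sketch. It models queries as plain predicates over toy documents; `Doc`, `term` and `build_query` are hypothetical names, and none of this is Lucy's query API.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Doc:
    path: str
    doc_date: date
    words: frozenset


def term(word):
    """The user's basic query: a single word that must appear."""
    return lambda d: word in d.words


def build_query(user_query, earliest=None, latest=None, prefix=None):
    """AND the user's query with whatever extra constraints were supplied."""
    clauses = [user_query]
    if earliest is not None:
        clauses.append(lambda d: d.doc_date >= earliest)
    if latest is not None:
        clauses.append(lambda d: d.doc_date <= latest)
    if prefix is not None:
        clauses.append(lambda d: d.path.startswith(prefix))
    return lambda d: all(clause(d) for clause in clauses)


DOCS = [
    Doc("foo/bar/a.txt", date(2016, 1, 5), frozenset({"something"})),
    Doc("foo/bar/b.txt", date(2016, 3, 10), frozenset({"something", "else"})),
    Doc("foo/baz/c.txt", date(2016, 3, 11), frozenset({"something"})),
    Doc("foo/bar/d.txt", date(2016, 3, 12), frozenset({"nothing"})),
]

query = build_query(
    term("something"),
    earliest=date(2016, 3, 1),
    latest=date(2016, 3, 31),
    prefix="foo/bar/",
)

# Filtering and sorting happen in one place, on one composed query.
hits = [d.path for d in sorted(filter(query, DOCS), key=lambda d: d.doc_date)]
```

In a real engine the composed query is evaluated against the index rather than by scanning every document, which is where the speed-up over a post-processing shim comes from.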

One of the things our users want is a copy of the source document with the query highlighted – not just a Google-style abstract of the relevant bit of text, but every occurrence of query terms in the original document. (The old engine had a program to do this, which was withdrawn without comment in more recent versions, so I'd had to cobble together my own query parser, which was never ideal.) Lucy will give a highlighted abstract of the first match it's found, but that doesn't help here. What it can usefully do is feed me the tree-structured output of the query parser, so when a user searches for

something AND "the other thing" AND NOT irrelevance

the same code that parses that for the search engine itself can give me something I can break down into a list of words and phrases (well, stemmed words and phrases) to look for and highlight in the document.
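A sketch of that breakdown, in Python, under stated assumptions: the parse tree is modelled as nested tuples (the real parser's output will look different), `positive_terms` and `highlight` are hypothetical names, and stemming is left out for brevity, so only exact words and phrases are marked. Terms under a NOT are skipped, since they are things the user does not want to see.

```python
import re


def positive_terms(node):
    """Collect the words and phrases a match should contain, skipping NOT branches."""
    kind = node[0]
    if kind in ("TERM", "PHRASE"):
        return [node[1]]
    if kind == "NOT":
        return []
    # AND / OR nodes carry a list of child nodes.
    found = []
    for child in node[1]:
        found.extend(positive_terms(child))
    return found


def highlight(text, terms):
    """Wrap every occurrence of every term in square brackets."""
    if not terms:
        return text
    # Longest first, so a phrase wins over any single word inside it.
    ordered = sorted(terms, key=len, reverse=True)
    pattern = "|".join(r"\b" + re.escape(t) + r"\b" for t in ordered)
    return re.sub(pattern, lambda m: "[" + m.group(0) + "]", text, flags=re.IGNORECASE)


# Hand-built tree for: something AND "the other thing" AND NOT irrelevance
tree = ("AND", [
    ("TERM", "something"),
    ("PHRASE", "the other thing"),
    ("NOT", ("TERM", "irrelevance")),
])

terms = positive_terms(tree)
marked = highlight(
    "Something about the other thing, not irrelevance; something again.",
    terms,
)
```

A version that matches stemmed forms would stem both the collected terms and each word of the document before comparing, rather than using a literal regular expression.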

We cut over on Wednesday 13 April. We've had a total of four user queries about the new system. I think this has to count as a success.

Tags: computing
