RogerBW's Blog

Blog Search Engine 18 March 2016

The blog now has a search engine, powered by Lucy.

I've been looking at Lucy for work anyway, and this seemed like a good exercise for it and for me. It's remarkably easy to get running, though one does have some hard decisions to make about stemming (is it desirable that a search for "mechanic" should also match "mechanics" and "mechanism", or does the exact word matter more), and it's definitely a per-document engine: it'll only return each post once, even if that post contains multiple instances of whatever it was you searched for.
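To make the stemming trade-off concrete, here is a minimal sketch in Python. It is not Lucy's analyzer: the suffix list and the names `crude_stem` and `matches` are invented for illustration, and a real stemmer applies far more careful rules. It only shows how switching stemming on decides whether "mechanic" also finds "mechanics" and "mechanism".

```python
# Toy suffix-stripping stemmer: an illustration only, not Lucy's analyzer.
# The suffix list is a hypothetical rule set, ordered longest first.
SUFFIXES = ("isms", "ism", "ics", "ic", "s")


def crude_stem(word: str) -> str:
    """Lower-case the word and strip the first matching suffix."""
    word = word.lower()
    for suffix in SUFFIXES:
        # Keep at least three characters so short words survive intact.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def matches(query: str, term: str, stemming: bool) -> bool:
    """Compare a query word with an indexed term, with or without stemming.

    Both modes ignore case; only the suffix handling differs.
    """
    if stemming:
        return crude_stem(query) == crude_stem(term)
    return query.lower() == term.lower()


if __name__ == "__main__":
    for term in ("mechanic", "mechanics", "mechanism"):
        print(term, matches("mechanic", term, stemming=True),
              matches("mechanic", term, stemming=False))
```

With stemming on, all three words reduce to the same stem and match each other. With it off, only the exact word matches, which is the decision that has to be made once when the index is built.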

I've previously used ht://Dig (sic) and it's something of a pain, having very limited configurability. Lucy is less of an out-of-the-box solution (there's no pre-built command for "make an index out of this mess of HTML files"), but it's vastly more pleasant to work with. If you have a search-related problem and you don't want to let Google into your life, I'd definitely recommend taking a look at Lucy.
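As a rough idea of what "make an index out of this mess of HTML files" involves, the following sketch uses only the Python standard library. It is not Lucy and not the code behind this blog: `TextExtractor`, `build_index` and `search` are hypothetical names, and the index is a plain dictionary with no ranking. It does show the per-document behaviour: a page is listed once however many times the word occurs in it.

```python
import re
from collections import defaultdict
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, ignoring <script> and <style> contents."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.chunks = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.chunks.append(data)


def html_to_text(html: str) -> str:
    """Return the text content of an HTML string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def build_index(pages: dict) -> dict:
    """Map each lower-cased word to the set of page names containing it."""
    index = defaultdict(set)
    for name, html in pages.items():
        for word in re.findall(r"[a-z0-9]+", html_to_text(html).lower()):
            index[word].add(name)
    return index


def search(index: dict, word: str) -> list:
    """Return the matching page names, each at most once, in sorted order."""
    return sorted(index.get(word.lower(), set()))
```

Because each word maps to a set of page names, repeated occurrences within one page collapse into a single hit. A real engine also stores positions and term frequencies so that results can be ranked.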

Code will be released as part of Aikakirja, when it's a bit prettier.

  1. Posted by chris at 11:45am on 18 March 2016

    Speaking as a user I am glad for anything which makes my life easier. I think it is desirable that a search for "mechanic" should also match "mechanics" and "mechanism", or at least that a search for mechanic should also match mechanics; why should it match something which is spelt differently as opposed to having an extra letter at the end? "Mechani" ought to find both -- sometimes I cannot remember whether "math" or "maths" has been used in an article, though not by you because you are English, so part-words are useful. And also one should have the option of searching only for the right case, as it might be "roger" instead of "Roger", if only to correct an error. Good search engine Good.

    Now, about the fact that this blog falls off the edge of my window, so that the tags list has a line "3d printing action aeronautics aikakirja a" and another "redeeming virtues cold war comedy com", in different sized fonts. Is it possible for it to do what websites used to do, and conform to the browser, instead of remaining at a width which suits itself but means altering all the websites in my browser to suit its particular choice of width? "Boardgamin" is rather fine in its own way; it is spoken, I feel, by Gussie Fink-Nottle after an evenin' doin' it, what? when he isn't accustomed. But not bein I mean sorry being able to see for instance "cycling" at all unless I think it worth scrollin around is a little sad.

  2. Posted by RogerBW at 11:51am on 18 March 2016

    One restriction on this search engine is that it's relatively hard work to have options for things like case-sensitivity and stemming; it's better if possible to pick one and stick with it. So for now I have.

  3. Posted by John Dallman at 02:29pm on 18 March 2016

    It would be good to define an order for the display of hits. At present it seems to be a bit random. Chronological or reverse chronological seem like the sensible ones.

  4. Posted by RogerBW at 02:34pm on 18 March 2016

    It's theoretically by descending relevance (number of times the terms occur, proximity of the terms to each other and the beginning of the document, etc.) but I'm not convinced.

  5. Posted by RogerBW at 06:07pm on 18 March 2016

    There is now sorting by approximate date (URL, actually). Real date may follow some day.

  6. Posted by John Dallman at 10:14pm on 18 March 2016

    The words in the tag cloud are being indexed for each page. So if you mistakenly search for a word in the tag cloud, you get everything. This is a little confusing.

  7. Posted by RogerBW at 10:47pm on 18 March 2016

    It seemed easier to fix it rather than to write a comment saying "I'll fix it". So I have.

  8. Posted by John Dallman at 09:28am on 19 March 2016


Comments on this post are now closed. If you have particular grounds for adding a late comment, comment on a more recent post quoting the URL of this one.

Produced by aikakirja v0.1