RogerBW's Blog

Perl Weekly Challenge 24
03 September 2019

I've been doing the Perl Weekly Challenges. This one dealt with minimal programs and inverted-index text search.

The minimal valid program is an empty file, for both perl5 and perl6. However, if you want to run the program directly from the command line, it needs to invoke the interpreter, so:

#! /usr/bin/perl

and indeed

#! /usr/bin/perl6

That wasn't terribly interesting, but the other challenge was more fun. I like search systems, and some years ago I wrote an indexer and search engine for mailing list postings in 24 hours for a small bet.

However, it's worth noting that these days I'd just use Lucy (and I know it's abandonware now, but it still works). So I didn't take this very far, because I'd just be duplicating functionality that's much more conveniently available elsewhere.

package Local::InvIndex;
use strict;
use warnings;
use Lingua::Stem;

Initialisation. Set up the index and the stemmer, which I didn't write from scratch.

sub new {
  my $class=shift;
  my $self={};
  bless $self,$class;
  $self->{stemmer}=Lingua::Stem->new();
  $self->{stemmer}->stem_caching({-level => 2});
  $self->{index}={};
  return $self;
}

Add a document (as one long string, so we have to give it a document name too). I break the string down to lines and words, stem the words, then for each word store its line number in the string and word number within the line.

sub add_doc_string {
  my $self=shift;
  my $docname=shift;
  my @words;
  my @indices;
  my $line=0;
  foreach my $str (@_) {
    my @l=split /\n/,$str;
    foreach my $l (@l) {
      my @w=$self->splitline($l);
      push @indices,map {[$docname,$line,$_]} (0..$#w);
      push @words,@w;
      $line++;
    }
  }
  $self->{stemmer}->stem_in_place(@words);
  foreach my $i (0..$#words) {
    push @{$self->{index}{$words[$i]}},$indices[$i];
  }
}
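
To make the stored structure concrete, here is a minimal stand-alone sketch of the same bookkeeping with the stemmer left out, so that it runs on core Perl alone. That omission is an assumption for illustration, and `build_index` is a hypothetical helper, not part of the module. Each word maps to a list of [document, line, word] triples, with both numbers counted from zero.

```perl
use strict;
use warnings;

# Hypothetical helper: the same position bookkeeping as add_doc_string,
# but with no stemming, only lower-casing and a whitespace split.
sub build_index {
  my ($docname,$text)=@_;
  my %index;
  my $line=0;
  foreach my $l (split /\n/,$text) {
    my @w=split ' ',lc($l);
    foreach my $n (0..$#w) {
      push @{$index{$w[$n]}},[$docname,$line,$n];
    }
    $line++;
  }
  return \%index;
}

my $index=build_index('demo',"the cat sat\non the mat");
# "the" is word 0 of line 0 and word 1 of line 1
foreach my $hit (@{$index->{the}}) {
  print join(',',@$hit),"\n";   # prints demo,0,0 then demo,1,1
}
```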

Add a document as a file, which just wraps round the previous function.

sub add_doc_file {
  my $self=shift;
  my $filename=shift;
  my $buf='';
  # Lexical filehandle; say which file failed and why
  open my $fh,'<',$filename or die "Can't open $filename: $!\n";
  while (<$fh>) {
    $buf.=$_;
  }
  close $fh;
  $self->add_doc_string($filename,$buf);
}

Do the search. This is pretty trivial, because all the work was done at indexing time; since searching should happen more often than indexing, that's probably the right way round. (This just supports single-word search terms, nothing fiddly.) The return is an arrayref of (file, linenumber, wordnumber) arrayrefs, or an empty arrayref if the word isn't indexed.

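sub search {
  my $self=shift;
  my $term=shift;
  return [] unless defined $term;
  # Normalise the term the same way as indexed text; only the first word is used
  my @search=$self->splitline($term);
  return [] unless @search;
  $self->{stemmer}->stem_in_place(@search);
  return $self->{index}{$search[0]} || [];
}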

A utility method: splitting a line into words (and removing punctuation and dropping to lower case) needs to be done consistently.

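sub splitline {
  my $self=shift;
  my $line=shift;
  # Replace every run of characters other than ASCII letters and spaces
  # (A-Za-z, not A-z, which would also let [ \ ] ^ _ ` through)
  $line =~ s/[^A-Za-z ]+/ /g;
  return grep /./,split ' ',lc($line);
}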

If I were working on this seriously, I'd add a context method, so that when someone searches for "chimney" it can return some surrounding words; e.g. "… foot up the chimney, and said to …". That effectively requires one to store the whole document (unstemmed).
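
A context method along those lines might look like the sketch below. It assumes the unstemmed words of each line have been kept in the order splitline produced them, in a hypothetical `%lines` structure that the code above does not store. It takes one [file, line, word] hit as returned by search. The name `context`, its `$width` parameter and the sample words (padded with a few invented ones) are all made up for illustration. Punctuation is lost because the words have already been cleaned, and the window doesn't cross line boundaries.

```perl
use strict;
use warnings;

# Hypothetical store: the unstemmed words of each line of each document.
my %lines=(
  'tale.txt' => [
    [qw(he put one foot up the chimney and said to himself)],
  ],
);

# Return the hit word with up to $width words either side, from the same line.
sub context {
  my ($hit,$width)=@_;
  my ($doc,$line,$word)=@$hit;
  my $w=$lines{$doc}[$line];
  my $from=$word-$width;
  $from=0 if $from<0;              # clamp at start of line
  my $to=$word+$width;
  $to=$#{$w} if $to>$#{$w};        # clamp at end of line
  return join ' ',@{$w}[$from..$to];
}

print context(['tale.txt',0,6],3),"\n";   # prints: foot up the chimney and said to
```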

I'd also write the index to a file (probably YAML) rather than assuming it's in a persistent variable.
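
Writing the index out could be as small as the sketch below, assuming the CPAN `YAML` module is installed; its `DumpFile` and `LoadFile` functions write and read a whole data structure. The file name and the sample index contents are made up.

```perl
use strict;
use warnings;
use YAML qw(DumpFile LoadFile);

# Made-up index contents in the shape used above:
# stem => list of [file, line, word]
my $index={
  chimney => [['tale.txt',12,6]],
  foot    => [['tale.txt',12,3],['tale.txt',40,1]],
};

my $path='invindex.yaml';     # hypothetical file name
DumpFile($path,$index);       # save after indexing
my $loaded=LoadFile($path);   # restore in a later run

print scalar(@{$loaded->{foot}}),"\n";   # number of hits for "foot"
```

Inside the module, the same two calls would presumably sit in a pair of save and load methods working on `$self->{index}`.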

But Lucy, which is great, can already do all these things, so I won't be developing this system further.
