I've been doing the
Perl Weekly Challenges. This one
dealt with minimal programs and inverted-index text search.
The minimal valid program is an empty file, for both perl5 and
perl6. However, if you want to run the program directly from the
command line, it needs to invoke the interpreter, so:
#! /usr/bin/perl
and indeed
#! /usr/bin/perl6
That wasn't terribly interesting, but the other challenge was more
fun. I like search systems, and some years ago I wrote an indexer
and search engine for mailing list postings in 24 hours for a small
bet.
However, it's worth noting that these days I'd just use Lucy (and I
know it's abandonware now, but it still works). So I didn't take this
very far, because I'd just be duplicating functionality that's
much more conveniently available elsewhere.
package Local::InvIndex;
use strict;
use warnings;
use Lingua::Stem;
Initialisation. Set up the index and the stemmer, which I didn't write
from scratch.
sub new {
  my $class=shift;
  my $self={};
  bless $self,$class;
  $self->{stemmer}=Lingua::Stem->new();
  $self->{stemmer}->stem_caching({-level => 2});
  $self->{index}={};
  return $self;
}
Add a document (as one long string, so we have to give it a document
name too). I break the string down to lines and words, stem the words,
then for each word store its line number in the string and word number
within the line.
sub add_doc_string {
  my $self=shift;
  my $docname=shift;
  my @words;
  my @indices;
  my $line=0;
  foreach my $str (@_) {
    my @l=split /\n/,$str;
    foreach my $l (@l) {
      my @w=$self->splitline($l);
      push @indices,map {[$docname,$line,$_]} (0..$#w);
      push @words,@w;
      $line++;
    }
  }
  $self->{stemmer}->stem_in_place(@words);
  foreach my $i (0..$#words) {
    push @{$self->{index}{$words[$i]}},$indices[$i];
  }
}
Add a document as a file, which just wraps round the previous function.
sub add_doc_file {
  my $self=shift;
  my $filename=shift;
  my $buf;
  open my $fh,'<',$filename or die "Can't open $filename: $!\n";
  while (<$fh>) {
    $buf.=$_;
  }
  close $fh;
  $self->add_doc_string($filename,$buf);
}
Do the search. This is pretty trivial, and since searching should
happen much more often than indexing, putting the heavy lifting at
indexing time is probably the right trade-off. (This just supports
single-word search terms, nothing fiddly.) The return is an arrayref
of (file, line number, word number) arrayrefs.
sub search {
  my $self=shift;
  my @search=shift;
  $self->{stemmer}->stem_in_place(@search);
  return $self->{index}{$search[0]} || [];
}
A utility method: splitting a line into words (and removing
punctuation and dropping to lower case) needs to be done consistently.
sub splitline {
  my $self=shift;
  my $line=shift;
  $line =~ s/[^A-Za-z ]+/ /g;
  return grep /./,split ' ',lc($line);
}

1;
If I were working on this seriously, I'd add a context method, so that
when someone searches for "chimney" it can return some surrounding
words; e.g. "… foot up the chimney, and said to …". That
effectively requires one to store the whole document (unstemmed).
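A rough sketch of what that might look like, assuming add_doc_string
were also extended to push each raw line onto $self->{raw}{$docname};
neither that storage nor this method exists in the code above:

# Hypothetical: return a few words either side of a hit, given the raw
# (unstemmed) lines stored as $self->{raw}{$docname}[$line].
sub context {
  my ($self,$docname,$line,$word,$radius)=@_;
  $radius=3 unless defined $radius;
  # Naive whitespace split; offsets only match splitline's exactly when
  # the line has no punctuation-only or hyphenated tokens.
  my @w=split ' ',$self->{raw}{$docname}[$line];
  my $from=$word-$radius; $from=0 if $from<0;
  my $to=$word+$radius; $to=$#w if $to>$#w;
  return join ' ',@w[$from..$to];
}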
I'd also write the index to a file (probably YAML) rather than
assuming it's in a persistent variable.
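Something along these lines would probably do, using the YAML module's
DumpFile and LoadFile; save_index and load_index are names I've
invented for illustration:

use YAML qw(DumpFile LoadFile);

# Hypothetical persistence methods: write the index hash out as YAML
# and read it back in. Method and file names are illustrative only.
sub save_index {
  my ($self,$path)=@_;
  DumpFile($path,$self->{index});
}

sub load_index {
  my ($self,$path)=@_;
  $self->{index}=LoadFile($path);
}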
But Lucy, which is great, can already do all these things, so I won't
be developing this system further.