RogerBW's Blog

Perl Weekly Challenge 81: Base Frequency 07 October 2020

I’ve been doing the Perl Weekly Challenges. The latest involved substrings and word frequencies. (Note that this is open until 11 October 2020.)

Task #1 › Common Base String

You are given 2 strings, $A and $B.

Write a script to find out common base strings in $A and $B.

A substring of a string $S is called base string if repeated concatenation of the substring results in the string.

In other words, every string is its own base string, and if it consists of the same character group repeated there may be more base strings. "hellohello" has as base strings "hello" and itself.

Given how often we've been told to "return 0 if there are no results" in cases where there can never be no results, I'm surprised that there's no instruction on what to do here, where an empty result is actually quite likely. I choose to return an empty list of strings.

I split this into two parts: (1) determine all the base strings of an individual string; (2) intersect those lists to determine the common base strings. Part 1 got its own test cases, to make sure that function was working before I built on it.
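Roughly this sort of thing (a sketch using Test::More, assuming bs returns an array reference as the later code expects, and sorting the result so the comparison doesn't depend on output order):

use Test::More;

is_deeply([sort @{bs("hellohello")}],["hello","hellohello"],"two base strings");
is_deeply([sort @{bs("abc")}],["abc"],"only itself");
is_deeply([sort @{bs("aaaa")}],["a","aa","aaaa"],"every divisor length");

done_testing;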

In Perl:

sub bs {
  my $str=shift;
  my $sl=length($str);
  my %f;

Take all integers from 1 to the square root of the string length, inclusive. If that number is an exact factor of the string length, store it and the quotient as candidate substring lengths. (For example, with an 8-character string, I'll store 1, 2, 4 and 8, mapping respectively to 8, 4, 2 and 1.)

  foreach my $n (1..int(sqrt($sl))) {
    my $p=$sl/$n;
    if ($p==int($p)) {
      $f{$n}=$p;
      $f{$p}=$n;
    }
  }
  my @out;

For each candidate substring length, take that many characters from the start of the string and copy it a suitable number of times – which is the quotient I calculated earlier. (So the 2-character string is multiplied by 4 to make an 8-character string.) If the result matches the original string, this is a base string of it, so put it in the output list.
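In code that step looks something like this (a sketch; I'm assuming bs returns an array reference, which is how cbs below consumes it):

  foreach my $n (sort {$a <=> $b} keys %f) {
    my $q=substr($str,0,$n);
    if (($q x $f{$n}) eq $str) {
      push @out,$q;
    }
  }
  return \@out;
}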

Now for the common-strings part.

sub cbs {

Take the list of input strings (any number of them), and get a list of base strings for each.

  my @bss=map {bs($_)} @_;
  my %r;
  my $f=0;
  foreach my $bs (@bss) {

For each list of base strings, if we already have a working list, remove anything that isn't also in the new one.

    if ($f) {
      my %s=map {$_ => 1} @{$bs};
      foreach my $k (keys %r) {
        unless (exists $s{$k}) {
          delete $r{$k};
        }
      }

If we don't, build it from the list of base strings we have.

    } else {
      %r=map {$_ => 1} @{$bs};
      $f=1;
    }
  }

Then return what's left.

  return [sort keys %r];
}

Yes, that hash construct for the working list is a bit pointless; I don't care that the value is 1, only that it has a value. If only I had a sort of stubby hash that only cared about keys' existence, but not their values.

Turns out I do in the other languages. My bs function in Raku is basically the same as in Perl, but I can use its SetHash type to maintain a list of the valid base strings and intersect it with the (&) operator.

sub cbs(**@strs) {
  my @bss=map {bs($_)},@strs;
  my $r=SetHash.new;
  my $f=0;
  for @bss -> @bs {
    if ($f) {
      my $s=@bs.SetHash;
      $r = $r (&) $s;
    } else {

(Something wrong here surely? Doesn't this just make $r a list? But it works…)

      $r=@bs;
      $f=1;
    }
  }
  return sort $r.keys;
}

Same general principle for Python:

def cbs(*strs):
    bss=map(bs,strs)
    r=set()
    f=False
    for bsa in bss:
        if (f):
            s=set(bsa)
            r=r & s
        else:
            r=set(bsa)
            f=True
    return sorted(r)

Ruby gets a bit odder. My standard idiom is to loop over the sorted keys of a hash, but it would rather sort the whole thing. So in bs:

  out=Array.new
  for l in f.sort_by {|n, p| n}
    q=str[0,l[0]]
    if q * l[1] == str
      out.push(q)
    end
  end
  return out

but cbs looks basically like the Raku version, complete with map.

  bss=strs.map {|x| bs(x)}

Task #2 › Frequency Sort

You are given a file named input.

Write a script to find the frequency of all the words.

It should print the result such that the first column of each line is the frequency of the word, followed by all the words of that frequency arranged in lexicographical order. Also sort the lines in ascending order of frequency.

For the sake of this task, please ignore the following in the input file:

. " ( ) , 's --

This is one of those Very Standard Problems… but the input processing makes it a bit more fiddly. I break this into three stages: accumulate each word's frequency into a hash keyed on that word, invert the hash to get one keyed on frequency with each entry containing a list of words, and output.

Perl:

my %c;
while (<>) {
  chomp;

Clean up the punctuation.

  s/(--|'s)/ /g;
  s/[."(),]+/ /g;

Split into words and accumulate totals.

  map {$c{$_}++} split ' ',$_;
}

Invert the hash:

my %f;
map {push @{$f{$c{$_}}},$_} sort keys %c;

And output:

foreach my $n (sort {$a <=> $b} keys %f) {
  print join(' ',$n,@{$f{$n}}),"\n\n";
}

Raku works similarly, though its regexp syntax is a bit different

  $s ~~ s :g /(\-\-|\'s)/ /;
  $s ~~ s :g /<[.\"\(\),]>+/ /;

and I unrolled the map for ease of debugging.

for sort keys %c -> $w {
  push %f{%c{$w}},$w;
}

Python and Ruby don't have hash element autovivification, which makes life more verbose. (And yes, I know, one can override this, but I'm trying to stick with code I understand rather than copying other people's recipes.)

Python prefers basic string find-and-replace to regexps. (Also it turns out that it shares a Perlish feature I like, the ability to read lines from standard input or files named on the command line without needing any particular cleverness in the code.)

import fileinput

c={}
for line in fileinput.input():
    line=line.rstrip("\r\n")
    for rep in ('--',"'s",'.','"','(',')',','):
        line=line.replace(rep,' ')
    for word in line.split():
        c.setdefault(word,0)
        c[word] += 1

f={}
for w in sorted(c.keys()):
    f.setdefault(c[w],list())
    f[c[w]].append(w)

The output got fiddly, because join takes only one argument, and that must be an iterable of strings. So I fake one. (I could have shoved the frequency into the start of the list, but I got stubborn.)

for n in sorted(f.keys()):
    print(' '.join([str(n)] + f[n]))
    print('')

In Ruby we're back to regexps for processing, and the way of getting all the lines is even simpler:

while line = gets

but I can't easily set a default list during the inversion; also, see above regarding the preference for sorting an entire hash rather than just its keys.

f=Hash.new
for l in c.sort_by {|w, c| w}
  if !f.include?(l[1])
    f[l[1]]=[l[1]]
  end
  f[l[1]].push(l[0])
end

This time I did stick the frequency at the start of each list, which made the output much simpler.

for l in f.sort_by {|f, wl| f}
  print l[1].join(' '),"\n\n"
end

The test file is a synopsis of the plot of West Side Story, and the output includes the line

3 Maria Tony a can of stop

which seems quite appropriate really.

Full code on github.


  1. Posted by RogerBW at 10:59am on 13 October 2020

    For part 1, most people don't seem to have bothered with the factorisation. Which is fair enough; this would be a thing to profile rather than guess about which approach will be fastest. And of course programmer time is one of the considerations that goes into this calculation in the real world.
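
    The unfactorised version is just a straight scan over every length that divides the string, something like this (a sketch, not anyone's actual submission):

    sub bs_simple {
      my $str=shift;
      my @out;
      foreach my $n (1..length($str)) {
        next if length($str) % $n;
        my $q=substr($str,0,$n);
        push @out,$q if ($q x (length($str)/$n)) eq $str;
      }
      return \@out;
    }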

    For part 2, in Raku I could have used a Bag type; it's pretty good for situations like this. Then the inversion would just have been a matter of calling the .invert method to get a series of pairs and combining them.

