# RogerBW's Blog

Perl Weekly Challenge 81: Base Frequency

07 October 2020

I’ve been doing the Perl Weekly Challenges. The latest involved substrings and word frequencies. (Note that this is open until 11 October 2020.)

# Task #1 › Common Base String

You are given 2 strings, `$A` and `$B`.

Write a script to find out common base strings in `$A` and `$B`.

A substring of a string `$S` is called a base string if repeated concatenation of the substring results in the string.

In other words, every string is its own base string, and if it consists of the same group of characters repeated, there may be more base strings: "hellohello" has as base strings "hello" and itself.

Given how often we've been told "return 0 if there are no results", for cases where there will never be no results, I'm surprised to see that there's no instruction on what to do in that situation for a problem where it's actually quite likely. I choose to return an empty list of strings.

I split this into two parts: (1) determine all the base strings of an individual string; (2) intersect those lists to determine the common base strings. Part 1 got its own test cases, to make sure that function was working before I built on it.

In Perl:

``````
sub bs {
  my $str=shift;
  my $sl=length($str);
  my %f;
``````

Take all integers from 1 to the square root of the string length, inclusive. If such a number divides the string length exactly, store it and the quotient as candidate substring lengths. (For example, with an 8-character string, I'll store 1, 2, 4 and 8, mapping respectively to 8, 4, 2 and 1.)

``````
  foreach my $n (1..int(sqrt($sl))) {
    my $p=$sl/$n;
    if ($p==int($p)) {
      $f{$n}=$p;
      $f{$p}=$n;
    }
  }
  my @out;
``````

For each candidate substring length, take that many characters from the start of the string and copy it a suitable number of times – which is the quotient I calculated earlier. (So the 2-character string is multiplied by 4 to make an 8-character string.) If the result matches the original string, this is a base string of it, so put it in the output list.
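The article doesn't show the Perl for this final loop, so here's my own sketch of the whole `bs` logic in Python terms (the function name follows the article; the rest is mine, not the author's code):

```python
# A Python rendering of the bs() logic described above: enumerate divisor
# pairs up to the square root, then test each candidate prefix.
def bs(s):
    n = len(s)
    lengths = set()
    for d in range(1, int(n ** 0.5) + 1):
        if n % d == 0:
            lengths.add(d)       # the divisor...
            lengths.add(n // d)  # ...and its quotient
    out = []
    for l in sorted(lengths):
        q = s[:l]
        if q * (n // l) == s:    # repeat the prefix; does it rebuild s?
            out.append(q)
    return out
```

`bs("hellohello")` gives `["hello", "hellohello"]`, matching the example above.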

Now for the common-strings part.

``````
sub cbs {
``````

Take the list of input strings (any number of them), and get a list of base strings for each.

``````
  my @bss=map {bs($_)} @_;
  my %r;
  my $f=0;
  foreach my $bs (@bss) {
``````

For each list of base strings, if we already have a working list, remove anything that isn't also in the new one.

``````
    if ($f) {
      my %s=map {$_ => 1} @{$bs};
      foreach my $k (keys %r) {
        unless (exists $s{$k}) {
          delete $r{$k};
        }
      }
``````

If we don't, build it from the list of base strings we have.

``````
    } else {
      %r=map {$_ => 1} @{$bs};
      $f=1;
    }
  }
``````

Then return what's left.

``````
  return [sort keys %r];
}
``````

Yes, that hash construct for the working list is a bit pointless; I don't care that the value is 1, only that it has a value. If only I had a sort of stubby hash that only cared about keys' existence, but not their values.

Turns out I do in the other languages. My `bs` function in Raku is basically the same as in Perl, but I can use its `SetHash` type to maintain a list of the valid base strings and intersect it with the `(&)` operator.

``````
sub cbs(**@strs) {
  my @bss=map {bs($_)},@strs;
  my $r=SetHash.new;
  my $f=0;
  for @bss -> @bs {
    if ($f) {
      my $s=@bs.SetHash;
      $r = $r (&) $s;
    } else {
``````

(Something wrong here surely? Doesn't this just make `$r` a list? But it works…)

``````
      $r=@bs;
      $f=1;
    }
  }
  return sort $r.keys;
}
``````

Same general principle for Python:

``````
def cbs(*strs):
    bss=map(bs,strs)
    r=set()
    f=False
    for bsa in bss:
        if (f):
            s=set(bsa)
            r=r & s
        else:
            r=set(bsa)
            f=True
    return sorted(r)
``````

Ruby gets a bit odder. My standard idiom is to loop over the sorted keys of a hash, but it would rather sort the whole thing. So in `bs`:

``````
  out=Array.new
  for l, p in f.sort_by {|n, p| n}
    q=str[0,l]
    if q * p == str
      out.push(q)
    end
  end
  return out
``````

but `cbs` looks basically like the Raku version, complete with `map`.

``````
  bss=strs.map {|x| bs(x)}
``````

# Task #2 › Frequency Sort

You are given a file named `input`.

Write a script to find the frequency of all the words.

It should print the result with the first column of each line being the frequency of the word, followed by all the words of that frequency arranged in lexicographical order. Also sort the lines in ascending order of frequency.

For the sake of this task, please ignore the following in the input file:

. " ( ) , 's --

This is one of those Very Standard Problems… but the input processing makes it a bit more fiddly. I break this into three stages: accumulate each word's frequency into a hash keyed on that word, invert the hash to get one keyed on frequency with each entry containing a list of words, and output.

Perl:

``````
my %c;
while (<>) {
  chomp;
``````

Clean up the punctuation.

``````
  s/(--|'s)/ /g;
  s/[."(),]+/ /g;
``````

Split into words and accumulate totals.

``````
  map {$c{$_}++} split ' ',$_;
}
``````

Invert the hash:

``````
my %f;
map {push @{$f{$c{$_}}},$_} sort keys %c;
``````

And output:

``````
foreach my $n (sort keys %f) {
  print join(' ',$n,@{$f{$n}}),"\n\n";
}
``````

Raku works similarly, though its regexp syntax is a bit different

``````
  $s ~~ s :g /(\-\-|\'s)/ /;
  $s ~~ s :g /<[.\"\(\),]>+/ /;
``````

and I unrolled the map for ease of debugging.

``````
for sort keys %c -> $w {
  push %f{%c{$w}},$w;
}
``````

Python and Ruby don't have hash element autovivification, which makes life more verbose. (And yes, I know, one can override this, but I'm trying to stick with code I understand rather than copying other people's recipes.)
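For reference, Python's standard `collections.defaultdict` provides much the same effect as autovivification, if one did want it. A minimal sketch (the example data is mine):

```python
from collections import defaultdict

c = defaultdict(int)   # missing keys spring into existence as 0
f = defaultdict(list)  # missing keys spring into existence as []

for word in "the cat sat on the mat".split():
    c[word] += 1       # no setdefault needed

for w in sorted(c):
    f[c[w]].append(w)  # invert: frequency -> words of that frequency
```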

Python prefers basic string find-and-replace to regexps. (Also it turns out that it shares a Perlish feature I like, the ability to read lines from standard input or files named on the command line without needing any particular cleverness in the code.)

``````
import fileinput

c={}
for line in fileinput.input():
    line=line.rstrip("\r\n")
    for rep in ('--',"'s",'.','"','(',')',','):
        line=line.replace(rep,' ')
    for word in line.split():
        c.setdefault(word,0)
        c[word] += 1

f={}
for w in sorted(c.keys()):
    f.setdefault(c[w],list())
    f[c[w]].append(w)
``````

The output got fiddly, because `join` takes just one argument, which must be an iterable of strings. So I fake one. (I could have shoved the frequency into the start of the list, but I got stubborn.)

``````
for n in sorted(f.keys()):
    print(' '.join([str(n)] + f[n]))
    print('')
``````

In Ruby we're back to regexps for processing, and the way of getting all the lines is even simpler:

``````
while line = gets
``````

but I can't easily set a default list during the inversion; also, see above regarding the preference for sorting an entire hash rather than just its keys.

``````
f=Hash.new
for w, n in c.sort_by {|w, n| w}
  if !f.include?(n)
    f[n]=[n]
  end
  f[n].push(w)
end
``````

This time I did stick the frequency at the start of each list, which made the output much simpler.

``````
for n, wl in f.sort_by {|n, wl| n}
  print wl.join(' '),"\n\n"
end
``````

The test file is a synopsis of the plot of West Side Story, and the output includes the line

`3 Maria Tony a can of stop`

which seems quite appropriate really.

Full code on github.

1. Posted by RogerBW at 10:59am on 13 October 2020

For part 1, most people don't seem to have bothered with the factorisation. Which is fair enough; this would be a thing to profile rather than guess about which approach will be fastest. And of course programmer time is one of the considerations that goes into this calculation in the real world.

For part 2, in Raku I could have used a `Bag` type; it's pretty good for situations like this. Then the inversion would just have been a matter of calling the `.invert` method to get a series of pairs and combining them.
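The Python analogue of a `Bag` is `collections.Counter`; a quick sketch of counting plus the inversion, on made-up data:

```python
from collections import Counter

text = 'Maria and Tony and Tony'
c = Counter(text.split())          # word -> frequency

f = {}
for w, n in sorted(c.items()):     # lexicographic word order
    f.setdefault(n, []).append(w)  # frequency -> sorted words

for n in sorted(f):
    print(n, *f[n])
```

This prints `1 Maria` and then `2 Tony and` (capitals sort before lower case, as in the article's output).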