# RogerBW's Blog

Perl Weekly Challenge 81: Base Frequency

07 October 2020

I’ve been doing the Perl Weekly Challenges. The latest involved substrings and word frequencies. (Note that this is open until 11 October 2020.)

# Task #1 › Common Base String

You are given 2 strings, `$A` and `$B`.

Write a script to find out common base strings in `$A` and `$B`.

A substring of a string `$S` is called a base string if repeated concatenation of the substring results in the string.

In other words, every string is its own base string, and if it consists of the same group of characters repeated, there may be more base strings: "hellohello" has as base strings "hello" and itself.

Given how often we've been told "return 0 if there are no results", for cases where there will never be no results, I'm surprised to see that there's no instruction on what to do in that situation for a problem where it's actually quite likely. I choose to return an empty list of strings.

I split this into two parts: (1) determine all the base strings of an individual string; (2) intersect those lists to determine the common base strings. Part 1 got its own test cases, to make sure that function was working before I built on it.

In Perl:

``````
sub bs {
  my $str=shift;
  my $sl=length($str);
  my %f;
``````

Take all integers from 1 to the square root of the string length, inclusive. If such a number divides the string length exactly, store it and the quotient as candidate substring lengths. (For example, with an 8-character string, I'll store 1, 2, 4 and 8, mapping respectively to 8, 4, 2 and 1.)

``````
  foreach my $n (1..int(sqrt($sl))) {
    my $p=$sl/$n;
    if ($p==int($p)) {
      $f{$n}=$p;
      $f{$p}=$n;
    }
  }
  my @out;
``````

For each candidate substring length, take that many characters from the start of the string and copy it a suitable number of times – which is the quotient I calculated earlier. (So the 2-character string is multiplied by 4 to make an 8-character string.) If the result matches the original string, this is a base string of it, so put it in the output list.
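The article doesn't show the Perl for this final loop, so here's my own sketch of the whole `bs` logic in Python terms (the function name follows the article; the rest is mine, not the author's code):

```python
# A Python rendering of the bs() logic described above: enumerate divisor
# pairs up to the square root, then test each candidate prefix.
def bs(s):
    n = len(s)
    lengths = set()
    for d in range(1, int(n ** 0.5) + 1):
        if n % d == 0:
            lengths.add(d)       # the divisor...
            lengths.add(n // d)  # ...and its quotient
    out = []
    for l in sorted(lengths):
        q = s[:l]
        if q * (n // l) == s:    # repeat the prefix; does it rebuild s?
            out.append(q)
    return out
```

`bs("hellohello")` gives `["hello", "hellohello"]`, matching the example above.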

Now for the common-strings part.

``````
sub cbs {
``````

Take the list of input strings (any number of them), and get a list of base strings for each.

``````
  my @bss=map {bs($_)} @_;
  my %r;
  my $f=0;
  foreach my $bs (@bss) {
``````

For each list of base strings, if we already have a working list, remove anything that isn't also in the new one.

``````
    if ($f) {
      my %s=map {$_ => 1} @{$bs};
      foreach my $k (keys %r) {
        unless (exists $s{$k}) {
          delete $r{$k};
        }
      }
``````

If we don't, build it from the list of base strings we have.

``````
    } else {
      %r=map {$_ => 1} @{$bs};
      $f=1;
    }
  }
``````

Then return what's left.

``````
  return [sort keys %r];
}
``````

Yes, that hash construct for the working list is a bit pointless; I don't care that the value is 1, only that it has a value. If only I had a sort of stubby hash that only cared about keys' existence, but not their values.

Turns out I do in the other languages. My `bs` function in Raku is basically the same as in Perl, but I can use its `SetHash` type to maintain a list of the valid base strings and intersect it with the `(&)` operator.

``````
sub cbs(**@strs) {
  my @bss=map {bs($_)},@strs;
  my $r=SetHash.new;
  my $f=0;
  for @bss -> @bs {
    if ($f) {
      my $s=@bs.SetHash;
      $r = $r (&) $s;
    } else {
``````

(Something wrong here surely? Doesn't this just make `$r` a list? But it works…)

``````
      $r=@bs;
      $f=1;
    }
  }
  return sort $r.keys;
}
``````

Same general principle for Python:

``````
def cbs(*strs):
    bss=map(bs,strs)
    r=set()
    f=False
    for bsa in bss:
        if (f):
            s=set(bsa)
            r=r & s
        else:
            r=set(bsa)
            f=True
    return sorted(r)
``````

Ruby gets a bit odder. My standard idiom is to loop over the sorted keys of a hash, but it would rather sort the whole thing. So in `bs`:

``````
  out=Array.new
  for l, p in f.sort_by {|n, p| n}
    q=str[0,l]
    if q * p == str
      out.push(q)
    end
  end
  return out
``````

but `cbs` looks basically like the Raku version, complete with `map`.

``````
  bss=strs.map {|x| bs(x)}
``````

# Task #2 › Frequency Sort

You are given a file named `input`.

Write a script to find the frequency of all the words.

It should print the result with the first column of each line being the frequency of the word, followed by all the words of that frequency arranged in lexicographical order. Also sort the lines in ascending order of frequency.

For the sake of this task, please ignore the following in the input file:

. " ( ) , 's --

This is one of those Very Standard Problems… but the input processing makes it a bit more fiddly. I break this into three stages: accumulate each word's frequency into a hash keyed on that word, invert the hash to get one keyed on frequency with each entry containing a list of words, and output.

Perl:

``````
my %c;
while (<>) {
  chomp;
``````

Clean up the punctuation.

``````
  s/(--|'s)/ /g;
  s/[."(),]+/ /g;
``````

Split into words and accumulate totals.

``````
  map {$c{$_}++} split ' ',$_;
}
``````

Invert the hash:

``````
my %f;
map {push @{$f{$c{$_}}},$_} sort keys %c;
``````

And output:

``````
foreach my $n (sort keys %f) {
  print join(' ',$n,@{$f{$n}}),"\n\n";
}
``````

Raku works similarly, though its regexp syntax is a bit different

``````
  $s ~~ s :g /(\-\-|\'s)/ /;
  $s ~~ s :g /<[.\"\(\),]>+/ /;
``````

and I unrolled the map for ease of debugging.

``````
for sort keys %c -> $w {
  push %f{%c{$w}},$w;
}
``````

Python and Ruby don't have hash element autovivification, which makes life more verbose. (And yes, I know, one can override this, but I'm trying to stick with code I understand rather than copying other people's recipes.)
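For reference, Python's standard `collections.defaultdict` provides much the same effect as autovivification, if one did want it. A minimal sketch (the example data is mine):

```python
from collections import defaultdict

c = defaultdict(int)   # missing keys spring into existence as 0
f = defaultdict(list)  # missing keys spring into existence as []

for word in "the cat sat on the mat".split():
    c[word] += 1       # no setdefault needed

for w in sorted(c):
    f[c[w]].append(w)  # invert: frequency -> words of that frequency
```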

Python prefers basic string find-and-replace to regexps. (Also it turns out that it shares a Perlish feature I like, the ability to read lines from standard input or files named on the command line without needing any particular cleverness in the code.)

``````
import fileinput

c={}
for line in fileinput.input():
    line=line.rstrip("\r\n")
    for rep in ('--',"'s",'.','"','(',')',','):
        line=line.replace(rep,' ')
    for word in line.split():
        c.setdefault(word,0)
        c[word] += 1

f={}
for w in sorted(c.keys()):
    f.setdefault(c[w],list())
    f[c[w]].append(w)
``````

The output got fiddly, because `join` takes just one argument, which must be an iterable of strings. So I fake one. (I could have shoved the frequency into the start of the list, but I got stubborn.)

``````
for n in sorted(f.keys()):
    print(' '.join([str(n)] + f[n]))
    print('')
``````

In Ruby we're back to regexps for processing, and the way of getting all the lines is even simpler:

``````
while line = gets
``````

but I can't easily set a default list during the inversion; also, see above regarding the preference for sorting an entire hash rather than just its keys.

``````
f=Hash.new
for w, n in c.sort_by {|w, n| w}
  if !f.include?(n)
    f[n]=[n]
  end
  f[n].push(w)
end
``````

This time I did stick the frequency at the start of each list, which made the output much simpler.

``````
for n, wl in f.sort_by {|n, wl| n}
  print wl.join(' '),"\n\n"
end
``````

The test file is a synopsis of the plot of West Side Story, and the output includes the line

`3 Maria Tony a can of stop`

which seems quite appropriate really.

Full code on github.

1. Posted by RogerBW at 10:59am on 13 October 2020

For part 1, most people don't seem to have bothered with the factorisation. Which is fair enough; this would be a thing to profile rather than guess about which approach will be fastest. And of course programmer time is one of the considerations that goes into this calculation in the real world.

For part 2, in Raku I could have used a `Bag` type; it's pretty good for situations like this. Then the inversion would just have been a matter of calling the `.invert` method to get a series of pairs and combining them.
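The Python analogue of a `Bag` is `collections.Counter`; a quick sketch of counting plus the inversion, on made-up data:

```python
from collections import Counter

text = 'Maria and Tony and Tony'
c = Counter(text.split())          # word -> frequency

f = {}
for w, n in sorted(c.items()):     # lexicographic word order
    f.setdefault(n, []).append(w)  # frequency -> sorted words

for n in sorted(f):
    print(n, *f[n])
```

This prints `1 Maria` and then `2 Tony and` (capitals sort before lower case, as in the article's output).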