I’ve been doing the Perl Weekly
Challenges. The
latest
involved substrings and word frequencies. (Note that this is open
until 11 October 2020.)
Task #1 › Common Base String
You are given 2 strings, $A
and $B
.
Write a script to find out common base strings in $A
and $B
.
A substring of a string $S
is called base string if repeated
concatenation of the substring results in the string.
In other words, every string is its own base string, and if it
consists of the same character group repeated there may be more base
strings. "hellohello" has as base strings "hello" and itself.
Given how often we've been told "return 0 if there are no results",
for cases where there will never be no results, I'm surprised to see
that there's no instruction on what to do in that situation for a
problem where it's actually quite likely. I choose to return an empty
list of strings.
I split this into two parts: (1) determine all the base strings of an
individual string; (2) intersect those lists to determine the common
base strings. Part 1 got its own test cases, to make sure that
function was working before I built on it.
In Perl:
sub bs {
my $str=shift;
my $sl=length($str);
my %f;
Take all integers between 1 to the square root of the string length,
inclusive. If that number is an exact factor of the string length,
store it and the quotient as candidate substring lengths. (For
example, with an 8-character string, I'll store 1, 2, 4 and 8, mapping
respectively to 8, 4, 2 and 1.)
foreach my $n (1..int(sqrt($sl))) {
my $p=$sl/$n;
if ($p==int($p)) {
$f{$n}=$p;
$f{$p}=$n;
}
}
my @out;
For each candidate substring length, take that many characters from
the start of the string and copy it a suitable number of times – which
is the quotient I calculated earlier. (So the 2-character string is
multiplied by 4 to make an 8-character string.) If the result matches
the original string, this is a base string of it, so put it in the
output list.
Now for the common-strings part.
sub cbs {
Take the list of input strings (any number of them), and get a list of
base strings for each.
my @bss=map {bs($_)} @_;
my %r;
my $f=0;
foreach my $bs (@bss) {
For each list of base strings, if we already have a working list,
remove anything that isn't also in the new one.
if ($f) {
my %s=map {$_ => 1} @{$bs};
foreach my $k (keys %r) {
unless (exists $s{$k}) {
delete $r{$k};
}
}
If we don't, build it from the list of base strings we have.
} else {
%r=map {$_ => 1} @{$bs};
$f=1;
}
}
Then return what's left.
return [sort keys %r];
}
Yes, that hash construct for the working list is a bit pointless; I
don't care that the value is 1, only that it has a value. If only I
had a sort of stubby hash that only cared about keys' existence, but
not their values.
Turns out I do in the other languages. My bs
function in Raku is
basically the same as in Perl, but I can use its SetHash
type to
maintain a list of the valid base strings and intersect it with the
(&)
operator.
sub cbs(**@strs) {
my @bss=map {bs($_)},@strs;
my $r=SetHash.new;
my $f=0;
for @bss -> @bs {
if ($f) {
my $s=@bs.SetHash;
$r = $r (&) $s;
} else {
(Something wrong here surely? Doesn't this just make $r
a list? But
it works…)
$r=@bs;
$f=1;
}
}
return sort $r.keys;
}
Same general principle for Python:
def cbs(*strs):
bss=map(bs,strs)
r=set()
f=False
for bsa in bss:
if (f):
s=set(bsa)
r=r & s
else:
r=set(bsa)
f=True
return sorted(r)
Ruby gets a bit odder. My standard idiom is to loop over the sorted
keys of a hash, but it would rather sort the whole thing. So in bs
:
out=Array.new
for l in f.sort_by {|n, p| n}
q=str[0,l[0]]
if q * l[1] == str
out.push(q)
end
end
return out
but cbs
looks basically like the Raku version, complete with map
.
bss=strs.map {|x| bs(x)}
TASK #2 › Frequency Sort
You are given file named input
.
Write a script to find the frequency of all the words.
It should print the result as first column of each line should be
the frequency of the the word followed by all the words of that
frequency arranged in lexicographical order. Also sort the words in
the ascending order of frequency.
For the sake of this task, please ignore the following in the input
file:
. " ( ) , 's --
This is one of those Very Standard Problems… but the input processing
makes it a bit more fiddly. I break this into three stages: accumulate
each word's frequency into a hash keyed on that word, invert the hash
to get one keyed on frequency with each entry containing a list of
words, and output.
Perl:
my %c;
while (<>) {
chomp;
Clean up the punctuation.
s/(--|'s)/ /g;
s/[."(),]+/ /g;
Split into words and accumulate totals.
map {$c{$_}++} split ' ',$_;
}
Invert the hash:
my %f;
map {push @{$f{$c{$_}}},$_} sort keys %c;
And output:
foreach my $n (sort keys %f) {
print join(' ',$n,@{%f{$n}}),"\n\n";
}
Raku works similarly, though its regexp syntax is a bit different
$s ~~ s :g /(\-\-|\'s)/ /;
$s ~~ s :g /<[.\"\(\),]>+/ /;
and I unrolled the map for ease of debugging.
for sort keys %c -> $w {
push %f{%c{$w}},$w;
}
Python and Ruby don't have hash element
autovivification,
which makes life more verbose. (And yes, I know, one can override
this, but I'm trying to stick with code I understand rather than
copying other people's recipes.)
Python prefers basic string find-and-replace to regexps. (Also it
turns out that it shares a Perlish feature I like, the ability to read
lines from standard input or files named on the command line without
needing any particular cleverness in the code.)
for line in fileinput.input():
line=line.rstrip("\r\n")
for rep in ('--',"'s",'.','"','(',')',','):
line=line.replace(rep,' ')
for word in line.split():
c.setdefault(word,0)
c[word] += 1
for w in sorted(c.keys()):
f.setdefault(c[w],list())
f[c[w]].append(w)
The output got fiddly, because join
specifically takes only one
parameter, and that must be an iterator. So I fake one. (I could have
shoved the frequency into the start of the list, but I got stubborn.)
for n in sorted(f.keys()):
print(' '.join((list(str(n)) + f[n])))
print('')
In Ruby we're back to regexps for processing, and the way of getting
all the lines is even simpler:
while line = gets
but I can't easily set a default list during the inversion; also, see
above regarding the preference for sorting an entire hash rather than
just its keys.
f=Hash.new
for l in c.sort_by {|w, c| w}
if !f.include?(l[1])
f[l[1]]=[l[1]]
end
f[l[1]].push(l[0])
end
This time I did stick the frequency at the start of each list, which
made the output much simpler.
for l in f.sort_by {|f, wl| f}
print l[1].join(' '),"\n\n"
end
The test file is a synopsis of the plot of West Side Story, and the
output includes the line
3 Maria Tony a can of stop
which seems quite appropriate really.
Full code on github.
Comments on this post are now closed. If you have particular grounds for adding a late comment, comment on a more recent post quoting the URL of this one.