Sebastian Nozzi

Posted on Oct 26, 2022

Most common word length

#miniscript #minimicro #challenge

In the last MiniScript challenge we had fun processing MiniMicro's list of English words.

The next challenge continues in that vein. This time we are asked to measure the amount of words of lengths 1 to 12. That is: how many words are 1 letter long, how many 2 letters long ... and so on until 12?

We are also asked, and this is actually the main question and name of the challenge: what is the most common length? In other words: from the resulting list of lengths, which one is the maximum?

Reading the lines

Again, we need to access the file of words over at:

/sys/data/englishWords.txt

As in the previous task, we need to iterate over all lines. But this time we will take a different approach - the one suggested by Joe Strout (creator of MiniScript) on the comments.

By calling file.readLines over a file-name we get a list of lines over which we can iterate directly with a "for-in" loop (what is that? See below).

(Note: this approach should be avoided for very big files - probably hundreds of megabytes or more ... our words file is under 1 Mb - so it's fine to call readLines).

It would look something like this:

words = file.readLines("/sys/data/englishWords.txt")

At this points all lines have been read and words consists of the lists of lines, or words, on the file (since it has one word per line).

Iterating over elements

We mentioned a "for-in" loop before. What is that? In MiniScript for-loops iterate over a list, or more generally speaking, a "collection". A "collection" could be a list, but also a string, or a map.

While in other programming languages you do things like this:

for (i = 1; i =< 10; i++) {
    // Do something with i
}

... in MiniScript you always iterate like "for variable" in "collection". Even when iterating over a range of numbers you would do that like this:

for i in range(1, 10)
    // Do something with i
end for

In fact, range is a function that returns a list of numbers in sequence.

Going back to our example, we can iterate over all words with:

for word in words
    // Do something with word
end for

Counting lengths - naive approach

Now, to find out the length of word we already know to use len. It is an "intrinsic" function which returns the length of a "collection" (which can be a list, map or string).

But how do we store the amount of words of different lengths?

We are asked to consider words up to a length of 12.

One (naive) approach could be to use 12 different counters. Depending on the length of the word, we would increase the corresponding counter.

It would look like this:

counter_for_words_of_length_1 = 0
counter_for_words_of_length_2 = 0
counter_for_words_of_length_3 = 0
// ... and so on ...
counter_for_words_of_length_10 = 0
counter_for_words_of_length_11 = 0
counter_for_words_of_length_12 = 0

for word in words
    if len(word) == 1 then
        counter_for_words_of_length_1 = counter_for_words_of_length_1 + 1
    else if len(word) == 2 then 
        counter_for_words_of_length_2 = counter_for_words_of_length_2 + 1
    // ... and so on ...
    else if len(word) == 12 then 
        counter_for_words_of_length_12 = counter_for_words_of_length_12 + 1
    end if
end for

Our intuition should tell us that this is far from optimal, and that there is room for improvement here. Specifically, whenever we see things repeating a lot it is a sign that we could abstract a little and find a more generic and elegant solution.

Note how we have practically identical variables that are different only by the last number. Note also, how the whole "if / else-if" dispatch serves only the purpose of finding the right variable. Wouldn't it be great if we had a list of counter, which we could address by the word-length?

There are mainly two approaches on how to do that. The first one would be using a "list" (array). The second using a "map". To keep an already long post short, we will only consider the array-based approach here. The (more uncommon) "map" based approach will be covered in another post - or left as an exercise to the reader.

Array-based approach

Instead of having 12 different "counter" variables we could have an array of 12 numbers, which we increase as needed. Like:

// List of 12 zeroes
word_length_counters = [0,0,0,0,0,0,0,0,0,0,0,0]

While it's perfectly fine to write the 12 zeroes like this, another possibility in MiniScript is to do:

// List of 12 zeroes
word_length_counters = [0] * 12

(Try it out in the console! Try out things like [1,2,3] * 3, ["ABC"] * 3 and "ABC" * 3 ...)

Having that, we need to think how to increase each value ... how would we do that? Well, we would:

1) Take the length of the current word
2) Take the current count
3) Increase the count

Our solution so far looks like this:

words = file.readLines("/sys/data/englishWords.txt")

word_length_counters = [0] * 12

for word in words
    word_length = len(word)
    current_count = word_length_counters[word_length]
    word_length_counters[word_length] = current_count + 1
end for

Try it out.

Does it work? No?

It does not work. In fact, you should get this error message:

Runtime Error: Index Error: list index (12) out of range (-12 to 11)
 [line 7]

Fixing element access

Problem is, arrays are 0-indexed (we start counting at 0 when accessing their elements) and we are using word-lengths, which start at 1.

How do we solve this? Here we have two possibilities:

1) We subtract 1 to the word length whenever accessing the array
2) We use an array of length 13, effectively wasting the first position, so that we can access it by word-length (indexes 1 to 12).

The challenge "hints" suggest approach 2, which you are free to try out. We will then go for approach 1. For me, wasting one position just to achieve the effect feels "wrong", and I also want to illustrate that in programs it is often necessary to convert between human-intended and computer-intended number-representations.

Let's fix our code:

for word in words
    word_length = len(word)
    counter_index = word_length - 1
    current_count = word_length_counters[counter_index]
    word_length_counters[counter_index] = current_count + 1
end for

Filtering out lines

Now it should work, right? But running it gives us the same error message again:

Runtime Error: Index Error: list index (12) out of range (-12 to 11)
 [line 8]

Let's see what's going on there ... over at the console I typed:

> word_length_counters
[1, 0, 1, 0, 6, 8, 5, 6, 5, 2, 3, 2]
> word
abbreviations
> len(word)
13

Ah! The problem is clear. We have a word which is 13-letters long!

The thing is, the file we are processing has words of different lengths ... actually words as long as 22 letters!

When processing that file we need to "ignore" anything longer than 12 letters. Again, here we could either

1) Enclose the whole processing block with an if, like:

for word in words
    word_length = len(word)
    if word_length =< 12 then
        counter_index = word_length - 1
        current_count = word_length_counters[counter_index]
        word_length_counters[counter_index] = current_count + 1
    end if
end for

... or tell the program flow to "skip" or continue to the next iteration if a word is too long. Like:

for word in words
    word_length = len(word)

    if word_length > 12 then continue

    counter_index = word_length - 1
    current_count = word_length_counters[counter_index]
    word_length_counters[counter_index] = current_count + 1
end for

Either approach is fine. For the sake of brevity we will go with the "continue" approach.

Calculating results

Running this program does ... nothing. Well, nothing visible (for now). Evaluating word_length_counters in the console should yield results:

> word_length_counters
[4, 61, 649, 2446, 4680, 7352, 9879, 10467, 9412, 7569, 5250, 3383]

Interesting. This means that:

There are 4 words of length 1
There are 61 words of length 2
There are 649 words of length 3
...
There are 3383 words of length 12

So, are we able to answer the main question now? Which is the most common? What is the maximum number? To which word length does it correspond?

Having only 12 numbers to consider, we can answer that by sight. But that would not solve the challenge, would it? We need the computer to figure it out.

Let's do that. That means, looking at the last paragraph, that we need:

1) To find out the maximum number
2) To figure out to which word-length it corresponds

We could quickly solve this by importing the "listUtil" module in Mini Micro since it extends lists to have a max method.

Then, by using indexOf we would know the index of that number. And adding 1 to that index would give us the (most common) word-length.

Can you figure out how this would look like ...?

I first tried it out on the console:

> word_length_counters
[4, 61, 649, 2446, 4680, 7352, 9879, 10467, 9412, 7569, 5250, 3383]
> import "listUtil"
> word_length_counters.max
10467
> word_length_counters.indexOf 10467
7
> 7 + 1
8

It seems to work! We can then modify our program accordingly.

Put this at the beginning:

import "listUtil"

And put this at the end:

max_count = word_length_counters.max
max_count_index = word_length_counters.indexOf(max_count)
max_word_length = max_count_index + 1

print "The most common word length is " + max_word_length

Final solution (without bonus)

The whole program should look like:

import "listUtil"

words = file.readLines("/sys/data/englishWords.txt")

word_length_counters = [0] * 12

for word in words
    word_length = len(word)
    if word_length > 12 then continue
    counter_index = word_length - 1
    current_count = word_length_counters[counter_index]
    word_length_counters[counter_index] = current_count + 1
end for

max_count = word_length_counters.max
max_count_index = word_length_counters.indexOf(max_count)
max_word_length = max_count_index + 1

print "The most common word length is " + max_word_length

Run it. Does it work for you? I get:

The most common word length is 8

This completes the challenge for now, without the "bonus".

In a future post we could consider the bonus part, but it is enough for now.

Hopefully you followed along and learnt some or two things about MiniScript and had fun with Mini Micro!

DEV Community