Words, Words, Words: the Weekly Challenge 331

Both this weekʼs tasks are not confined to ASCII, making them more real-world applicable. That brings with it real-world problems.

Task 1: Last Word:

"There are more things in heaven and earth, Horatio, than are dreamt of in your philosophy." — Hamlet

You are given a string.

 
Write a script to find the length of last word in the given string.

What is a word, and what is its length? For the latter, Iʼm going to go with count of graphemes, as I did last week.

For the former, there are no easy answers. Is 123 a word? Or 10,000? What about $%@! (when used to obscure a word), or l33t, or d*mn? What is the last word in this string:

She said, “the last word here is ‘isnʼt’, though he says, ‘it isnʼt.’”

A good general purpose language-agnostic definition of a word in a Unicode world includes letters, numerics, and mark characters. To accommodate the tradition of including underscore, add combining punctuation characters (underscore and nine other similar characters).

And then there are apostrophes, which, when possessive or contractitive[1], should be part of a word. U+2BC is an apostrophe categorized as a letter modifier, so included with letters, but ASCIIʼs U+27 or the ending single quote U+2019 also commonly appear, and some sources unfortunately recommend the latter. Itʼs tempting to include these when between word characters, but that fails when a possessive apostrophe is at the end of a word or a contraction begins or ends (or both) with an apostrophe and is not readily distinguishable from the beginning or ending of a quotation. So, for better consistency, if not better results, I will treat it as a GIGO problem and only include the letter modifier apostrophe.

Comma or full stop as a thousands separator could be special-cased, but Iʼm not going to bother.

Sequences of four or more "other punctuation" category characters, or fewer if between other word characters, will account for most Bowdlerizing[2] punctuation, but again, I wonʼt bother.

So our regex for a word is just letters, numbers, marks, and combining punctuation:

[\p{L}\p{N}\p{M}\p{Pc}]+

There are a number of different approaches to finding the final match of a regex in a string; Iʼll demonstrate some in different languages.

In Python, weʼll substitute away everything before or after the final word, and count the remaining length (in graphemes), using the regex engine for lookahead/behind for word boundaries:

def last_word_length(string: str) -> str:
    return grapheme.length(regex.sub(r'^.*[^\pL\pN\pM\p{pC}](?=[\pL\pN\pM\p{pC}])|(?<=[\pL\pN\pM\p{pC}])[^\pL\pN\pM\p{pC}].*', '', string))

In Go, grapheme length is:

func grapheme_length(word string) int {
    var grapheme_length int = 0
    var seg = graphemes.NewSegmenter([]byte(word))
    for seg.Next() {
        grapheme_length++
    }
    return grapheme_length
}

and instead of using the regex library to find all matches and keeping the last, weʼll iterate through the matches:

func last_match(r *regexp.Regexp, s string) string {
    var out string
    for offset, loc := 0, []int{0,0}; loc != nil; loc = r.FindStringIndex(s[offset:]) {
        out = s[offset+loc[0]:offset+loc[1]]
        offset += loc[1]
    }
    return out
}

(Note the first iteration of the loop sets out to the empty string, without having attempted a match yet).

Putting those together:

var word = regexp.MustCompile(`[\p{L}\p{N}\p{M}\p{Pc}]+`)

func last_word_length(s string) int {
    return grapheme_length(last_match(word, s))
}

In Perl, weʼll find the last match first, using whatʼs been coined a sexeger technique: reversing the string, reversing the regex (nothing to do in this case), and reversing the match:

sub last_word_length($string) {
    grapheme_length reverse($string) =~ /([\pL\pN\pM\p{pC}]+)/ ? scalar reverse $1 : ''
}

Weʼll get grapheme length by counting grapheme matches:

sub grapheme_length($string) {
    scalar( () = $string =~ /\X/g )
}

(scalar( ()= ... ) is a Perl idiom; a list assignment in scalar context gives the count of elements on the right.)

full script, Python
full script, Go
full script, Perl

Task 2: Buddy Strings:

"Go keep us company, and we shall make full satisfaction." — Comedy of Errors

You are given two strings, source and target.

 
Write a script to find out if the given strings are "Buddy Strings".

 
If swapping of a letter in one string make them same as the other then they are "Buddy Strings".

The strings have arbitrary content, but only swapping of a letter makes them Buddy Strings? This doesnʼt make a lot of sense to me, so I will allow swapping of any single character. Arguably, swapping a grapheme should be allowed, and "same as the other" meaning even with different Unicode normalization, but I will save that for some other weekʼs task.

On to the task: determining if there is exactly one substitution difference between the strings. The method will vary by language.

In Python, the regexp library provides fuzzy matching, inspired by the TRE regex engine. The syntax is to provide, in curly braces, upper or upper and lower bounds for substitutions, insertions, deletions, or any of these errors, comma separated if more than one type is given, after the component of the regex the fuzzy match applies to:

def buddy_strings(string1: str, string2: str) -> bool:
    return True if regex.fullmatch('(?:'+regex.escape(string1)+'){1<=s<=1}', string2) else False

In Go, a compiled language, thereʼs no benefit to using a library for fuzzy matching; we simply verify the strings arenʼt equal, have the same number of characters, and loop over the characters to check for only one difference:

func buddy_strings(string1 string, string2 string) bool {
    if (string1 == string2) { return false }
    var runes1 = []rune(string1)
    var runes2 = []rune(string2)
    if len(runes1) != len(runes2) { return false }
    var differences int = 0
    for i := range runes1 {
        if runes1[i] != runes2[i] {
            differences++
            if differences > 1 { return false }
        }
    }
    return true
}

In Perl, on the other hand, we want to do as much work as possible in C. Thereʼs a TRE regex engine plugin that should work for this but I didnʼt actually try. The venerable String::Approx is a possibility in theory, but I wasnʼt able to find the correct incantation for this case, and looking at the code, it isnʼt clear to me if it has proper Unicode support. Text::Fuzzy, on the other hand, does the job (with the hard work in C), though we must check the lengths are equal to ensure the one allowed difference is a substitution, not insertion or deletion:

sub buddy_strings($string1, $string2) {
    length $string1 == length $string2
        and Text::Fuzzy->new($string1, max => 1, no_exact => 1) == 1
}

full script, Python
full script, Go
full script, Perl

Comments greatly appreciated. See you next week.


  1. Not a word, but it should be. ↩︎

  2. Sticking to the Shakespeare theme. ↩︎





Read more