Specifications, Ambiguity, Contradiction: the Weekly Challenge 330
Software engineering often involves listening to statements of problems, deriving one or more underlying actual problems, and coming up with solutions that address (partially or completely) those problems (and ideally have other advantages or advance other goals as well).[1] But sometimes the starting point you get is a specification for the desired solution. And that takes a lot of the burden of your shoulders; you can proceed right to thinking about implementation, right?
If only it were that easy. This week's tasks provide great counterexamples.
Task 1: Clear Digits:
You are given a string containing only lower case English letters and digits.
Write a script to remove all digits by removing the first digit and the closest non-digit character to its left.
Right away, I see an apparent self-contradiction in the description. We are to remove all digits, and also remove a non-digit for each. What if there is no such non-digit? The contradiction is removed in one of two interpretations:
- The "all" in "remove all digits" is figurative, and the remainder clarifies which digits were actually meant.
- The "all" is literal, and the remainder is interpreted as it would be in free logic, not classical logic — that is, "closest non-digit" does not necessitate the existence of such a digit, and a digit should be removed whether or not a suitable non-digit can be removed.
Under assumption 1, we'll use a recursive regex (necessary whenever two or more letters are followed by two or more digits, so equal numbers of each are removed):
(?<pair>[a-z](?&pair)?[0-9])
Here, (?&pair)
says try to match, at this point, what the regex for the named capture group pair
matches. So this expands to:
(?<pair>[a-z] (?:[a-z] (?:[a-z] ... [0-9])? [0-9])? [0-9])
matching a number of letters followed by an equal number of digits, and the task is solved by a global substitution of this regex with an empty string.
Under assumption 2, we modify this; a global substitution of the above regex will find all pairs — we could then separately remove any remaining digits, but it's trivial to instead modify the regex to also find unpaired digits (and while we're at it, match as many sequential sets of pairs as possible):
[0-9]*(?<pair>[a-z](?&pair)?[0-9])*
In Python, using regex
instead of re
for support of recursion (and more standard named group syntax):
regex.sub(r'[0-9]*(?<pair>[a-z](?&pair)?[0-9])*', '', string)
In Go, using https://pkg.go.dev/github.com/htfy96/go-pcre2/v2 (one of many pcre2 modules):
var subst = pcre2.MustCompile("[0-9]+|(?<pair>[a-z](?&pair)?[0-9])+", 0)
subst.ReplaceAllString(in, "", 0)
(This module seemed buggy; without the alternation and +
instead of *
, it entered an endless loop when there were unpaired digits.)
In Perl:
s/[0-9]*(?<pair>[a-z](?&pair)?[0-9])*//g
full script, Python
full script, Go
full script, Perl
Task 2: Title Capital:
You are given a string made up of one or more words separated by a single space.
Write a script to capitalise the given title. If the word length is 1 or 2 then convert the word to lowercase otherwise make the first character uppercase and remaining lowercase.
This task, unlike most, didn't restrict itself to "English" (ASCII) letters, and so introduced another contradiction: "capitalise"[2] should mean title case, not upper case. The two differ for letters such as dz: uppercase: DZ titlecase: Dz lowercase: dz. I'm going to assume titlecase was meant.
Beyond that, there's a major ambiguity: what does "word length" mean? In today's world, it isn't bytes, but if you assume it is unicode characters, 🤷♂ would have a word length of 3. A better assumption is graphemes, multiple characters that display as a single graphical unit. Even then, in a monospaced font, some graphemes will be wider than others. Dealing with that width is more than I want to take on in multiple languages, so I'm going to go with just the count of graphemes.
(And I will ignore that changing case can change length, e.g. "ßo" which could equally well be left unchanged or become "Sso".)
In Python:
def grapheme_length_more_than_2(string: str) -> bool:
return grapheme.length(string, 3) > 2
def title_capital(string: str) -> str:
return ' '.join( word.capitalize() if grapheme_length_more_than_2(word) else word.lower() for word in string.split(' ') )
(Because getting the overall length of a string in graphemes is O(N) in the length, grapheme.length provides a parameter for when to stop counting that is ideal for our use case.)
In Go, using https://pkg.go.dev/github.com/clipperhouse/uax29/graphemes and the somewhat lengthy current way to transform a slice (list/array) — slices.Collect
takes an iter.seq
, a passed function that is itself passed a callback for returning each result:
func grapheme_length_more_than_2(word string) bool {
var seg = graphemes.NewSegmenter([]byte(word))
return seg.Next() && seg.Next() && seg.Next() && true
}
func title_capital(in string) string {
return strings.Join(
slices.Collect(func(yield func(string) bool) {
for _, word := range strings.Split(in, " ") {
var cased_word string
if grapheme_length_more_than_2(word) {
runes := []rune(word)
cased_word = strings.ToTitle(string(runes[0])) + strings.ToLower(string(runes[1:]))
} else {
cased_word = strings.ToLower(word)
}
if !yield(cased_word) {
return
}
}
}), " ")
}
(In Go, a string is a sequence of bytes, not necessarily in any particular encoding or even character data at all; runes are unicode characters.)
In Perl, which has builtin though little known grapheme support via \X
:
sub title_capital($string) {
join ' ', map { /^\X{3}/ ? ucfirst lc : lc } split / /, $string
}
(Despite the name, ucfirst
is titlecase, not uppercase.)
In all of these, while I am detecting word length using graphemes, I am titlecasing only the first character, not the first grapheme. I don't know of graphemes for which this would be incorrect, but it wouldn't surprise me to know they exist.
full script, Python
full script, Go
full script, Perl
Comments greatly appreciated. See you next week.