Property is Theft, Theft is Good: the weekly challenge 328 task 2

Task 2: Good String:

You are given a string made up of lower and upper case English letters only.

 
Write a script to return the good string of the given string. A string is called good string if it doesn’t have two adjacent same characters, one in upper case and other is lower case.

 
UPDATE [2025-07-01]: Just to be explicit, you can only remove pair if they are same characters, one in lower case and other in upper case, order is not important.

 
Example 1

Input: $str = "WeEeekly"
Output: "Weekly"

 
We can remove either, "eE" or "Ee" to make it good.

 
Example 2

Input: $str = "abBAdD"
Output: ""

 
We remove "bB" first: "aAdD"
Then we remove "aA": "dD"
Finally remove "dD".

 
Example 3

Input: $str = "abc"
Output: "abc"

This is a pretty simple task, we just need a regex to capture an ASCII letter and verify the following character is the same letter but of the opposite case. We do that with a backreference that does match case insensitively but not case sensitively:

(?i:([a-z])\1)(?<!\1)

Since the inputs are specified as "upper and lower case English letters only", we could just as well do:

(?i:(.)\1)(?<!\1)

Because removing pairs of characters may reveal other pairs, as in the abBA case, we do need to loop as long as something is substituted, not just do a single global substitution. In Python, this is a little cumbersome[1]:

bad_pair = re.compile(r'(?i:(.)\1)(?<!\1)')
string_out = string_in
while True:
    string_out, substitutions_made = bad_pair.subn('', string_out)
    if substitutions_made==0:
        break

In Perl, because substitution by default mutates and just returns a count, the loop is simpler[2]. However, the regex engine isn't smart enough to track the possible length of captures, and so complains about variable length lookbehind[3]. So the regex is slightly modified to hint that it is in fact just one character:

my $bad_pair = qr/(?i:(.)\1)(?<!(?=\1).)/;
my $string_out = $string_in;
1 while $string_out =~ s/$bad_pair//g;

That does it for the problem as stated, but not if we allow any letter, not just ASCII letters. Then several problems rear their ugly heads.

First of all, there are titlecase letters such as U+1F2 (LATIN CAPITAL LETTER D WITH SMALL LETTER Z) Dz, which will match when next to U+1F1 (LATIN CAPITAL LETTER DZ) DZ or U+1F3 (LATIN SMALL LETTER DZ) dz, even though Dz is not upper or lower case.

Also, there are symbols that case insensitvely match with other code points that are their canonical equivalent, such as U+2126 (OHM SIGN) Ω that matches with U+3A9 (GREEK CAPITAL LETTER OMEGA) Ω, though both are upper case.

To avoid improperly removing these, we need to explicitly only match upper/lower or lower/upper pairs:

(?i:(.)\1)(?<!\1)(?<=\p{Ll}\p{Lu}|\p{Lu}\p{Ll})

using a regex engine that understands unicode properties (regex or pcre2 in python, but not re).

But unicode also allows what is the same character to appear as differing numbers of codepoints. For example, U+E9 (LATIN SMALL LETTER E WITH ACUTE) é can appear as that single codepoint or as the pair U+65 (LATIN SMALL LETTER E) e and U+301 (COMBINING ACUTE ACCENT) ́. Normally[4] if you have to cope with these differences, you'd put a string in either Normalization Form D where combined bits are decomposed to different codepoints or Normalization Form C where after decomposition they are composed again into codepoints that combine them to the extent such codepoints exist. Either is not very regex-friendly, and here where the goal is to leave parts of the string unchanged, those parts would have to be mapped back to their pre-normalized form, which is messy.

And even beyond that, NFC and NFD forms only deal with canonical differences, where the meaning and appearance are the same. There are also NFKC and NFKD normalizations where series of characters that have compatibility equivalence rather than canonical equivalence become the same. This includes things such as superscripts and ligatures; a canonical[5] example is U+FB01 (LATIN SMALL LIGATURE FI) fi which under these normalizations becomes U+66 (LATIN SMALL LETTER F) f and U+69 (LATIN SMALL LETTER I) i. But now it is very unclear what a good string would even be: given U+46 U+FB01 U+49 FfiI, is its good string the empty string? The string unchanged?

So I'm going to leave any attempt at meaningfully dealing with any normalization to the reader.

full script, Python
full script, Perl

See you next week.


  1. Sadly, the assignment operator doesn't support tuple assignment, so you can't do:

    while (string_out, substitutions_made := bad_pair.subn('', string_out))[1]:
        pass
    
    ↩︎
  2. 1 while ... is a Perl idiom; the 1 does nothing and in fact could be 0 or undef, equivalent to Python's pass. But it shouldn't be other values, since only those three are exempt from the "Useless use of a constant in void context" warning you would get if you said e.g. 42 while.... That warning is very helpful for catching precedence errors, for instance with $x, $y = 12, 7 (= is higher precedence than ,). ↩︎

  3. Current versions of Perl allow variable length lookbehind, but only up to 255 characters, and Perl assumes \1 may be longer than that. ↩︎

  4. Pun intended ↩︎

  5. Pun intended ↩︎