Idiomatic Perl solutions to the weekly challenge 325 task 1

Task 1: Consecutive One:

You are given a binary array containing only 0 or/and 1.

Write a script to find out the maximum consecutive 1 in the given array.

A regex might be fun?

Regexes can do so many tricks, maybe here too. Let's start with something very naive, just looking for the earliest sequence of 1's that isn't followed by another sequence that is as long or longer:

sub naive_regex ($binary_array) {
    join('', @$binary_array) =~ /(1+)(?!.*?\1)/
        ? length $1
        : 0
}

This works, but can be easily greatly improved in speed. We'll use a possessive quantifier in the capture, so we don't backtrack and look for a shorter sequence when the longer one has already failed, and assert that the 1's are not preceded by a 1 to prevent trying again at a later point in the same sequence of 1's, again giving a shorter sequence when the longer one has already failed:

sub better_regex ($binary_array) {
    join('', @$binary_array) =~ /(?<!1)(1++)(?!.*?\1)/
        ? length $1
        : 0
}

On my test data of a million random 0's and 1's, this gives more than a 3x speed increase.

But any regex solution is going to have the extra overhead of assembling the data into a string.

Back to the basics

So, let's try just a straightforward loop through the array, counting sequences of 1's and keeping track of the longest one found:

sub loop ($binary_array) {
    my $current_sequence = 0;
    my $longest_sequence = 0;
    for my $value (@$binary_array) {
        if ($value) {
            ++$current_sequence;
        }
        else {
            if ($longest_sequence < $current_sequence) {
                $longest_sequence = $current_sequence;
            }
            $current_sequence = 0;
        }
    }
    if ($longest_sequence < $current_sequence) {
        $longest_sequence = $current_sequence;
    }
    return $longest_sequence;
}

The extra check of current sequence after the loop is needed when the longest sequence appears at the very end of the array. Repeating the same code inside and outside a loop can be avoided by adding a special sentinel value to the end of the data; here, just adding a 0 suffices:

    for my $value (@$binary_array, 0) {

but this causes a very slight slowdown, as perl has to load the array onto the stack and add a 0, rather than having the for alias directly to the array. In production code, I'd usually favor doing it anyway, to avoid the extra complexity.

This provided about a 3x speed increase over the faster regex.

What about map?

A completely different approach is to transform the array into a list of counts of each sequence of 1's so far, so that:

0,1,0,1,0,1,1,1,1,1,0

becomes:

0,1,0,1,0,1,2,3,4,5,0

and then simply take the max of those. This ended up being just as fast as the loop, but only when the 0's were left out of the output so taking the max was faster (1,1,1,2,3,4,5):

sub mapping ($binary_array) {
    my $count = 0;
    # suppress "Found = in conditional, should be =="
    no warnings 'syntax';
    max( map { $_ ? ++$count : ($count = 0) || () } @$binary_array )
}

Unfortunately leaving out the 0's caused a warning that needed to be suppressed. Most perl syntax warnings point you to something you meant instead; this breaks that pattern by just telling you not to do it. I would greatly have appreciated some alternate syntax (an extra set of parentheses around the assignment, for instance) to let perl know I meant it. Note that Python added a := operator explicitly for this purpose.

Some alternatives that don't trigger the warning but incur a performance penalty:

   max( map { $_ ? ++$count : ($count = 0) } @$binary_array )
   max( map { no warnings 'syntax'; $_ ? ++$count : ($count = 0) || () } @$binary_array )
   max( map { $_ ? ++$count : do { no warnings 'syntax'; ($count = 0) || () } } @$binary_array )
   max( map { $_ ? ++$count : ($count = 0)[0] || () } @$binary_array )
   max( map { ($_ ? ++$count : ($count = 0)) || () } @$binary_array )
   max( map { $_ ? ++$count : ($count = 0)[1..0] } @$binary_array )
   my @empty; max( map { $_ ? ++$count : ($count = 0)[@empty] } @$binary_array )

That said, now we have a reasonably readable one-liner, instead of the long loop code, and have achieved equivalent or better speed. But I don't like suppressing an entire category of warnings for essentially the entire sub's body. And like the loop with sentinel, with a large array, a lot of it ends up in memory more than once.

Regex reprise

The regex approach involved moving around a lot in the string; for each sequence of 1's it would scan farther into the string to find a repetition, then jump back to find the next sequence. Ideally, it would just make one pass, beginning to end, with the following pseudocode:

  • find a sequence of 1's
  • advance through the string looking for a sequence as long or longer
    • if not found, we have our sequence
    • if it is found, repeat, starting at this new position

To do this, we need to manipulate where the regex engine starts searching once it rejects a match at one position. The (*SKIP) regex verb exists for exactly this purpose:

sub one_pass_regex ($binary_array) {
    join('', @$binary_array) =~ /(1++)(?:.*?(*MARK:go)\1(*SKIP:go)(*FAIL)|)/
        ? length $1
        : 0
}

A bare (*SKIP) restarts the regex match at a new position when the regex fails later on. But here we want to move the next attempt's start to before the backreference, but only if the backreference matches. Using (*MARK:label) to label a position and (*SKIP:label) to actually use that position is designed for this. (Thanks to ikegami for helping me understand MARK and SKIP.) Once the search for the backreference fails, we've found the answer.

This offers a significant improvement over loop or map.

Looping back around

The loop was really the most straightforward approach, but for a loop like this an interpreted language is going to be slower than, for instance, C. It's a shame we can't just rewrite it in C and use it in our perl script...oh wait, we can:

use Inline 'C';
__END__
__C__
int inlinec_loop(SV *array) {
    int longest_sequence = 0;
    if (SvROK(array) && SvTYPE(SvRV(array)) == SVt_PVAV) {
        AV *av = (AV *)SvRV(array);
        int max_element = av_len(av);
        int current_sequence = 0;
        for (int i = 0; i <= max_element; ++i) {
            int value = SvIV(*av_fetch(av, i, 0));
            if (value) {
                ++current_sequence;
            }
            else {
                if (longest_sequence < current_sequence) {
                    longest_sequence = current_sequence;
                }
                current_sequence = 0;
            }
        }
        if (longest_sequence < current_sequence) {
            longest_sequence = current_sequence;
        }
    }
    return longest_sequence;
}

Now we are more than twice as fast as the complex regex, and three times as fast as the fastest others.

Benchmarks

                 Rate naive_regex better_regex loop_sentinel loop mapping one_pass_regex inlinec_loop
naive_regex    3.23/s          --         -69%          -90% -90%    -90%           -91%         -97%
better_regex   10.5/s        225%           --          -66% -67%    -67%           -72%         -90%
loop_sentinel  31.1/s        864%         197%            --  -2%     -3%           -17%         -69%
loop           31.6/s        879%         201%            2%   --     -1%           -15%         -69%
mapping        32.1/s        894%         206%            3%   2%      --           -14%         -68%
one_pass_regex 37.3/s       1055%         256%           20%  18%     16%             --         -63%
inlinec_loop    101/s       3028%         863%          224% 220%    215%           171%           --

full script
some python, for comparison

See you next week.