Regex Fu: Not Followed By

Everyone has their own regular expression implementation; especially text editors. One is good, while the rest range from absurd to almost usable.

Sometimes, despite your best efforts, you’ll find yourself in a strange environment needing to do some funky regex fu.

So for the sake of argument, let’s imagine we have a standard no-frills regex engine:

  • ( ) Capture groups
  • ?, *, + Repetition
  • [a-z] [^a-z] Character classes including negation
  • ^ $ Anchors
  • a|b Alternation

For the sake of clarity, we’ll assume it’s set to ignore whitespace too.

Simple Case

We’re looking for all instances of “fish”, so long as they aren’t followed immediately by “box”. With Perl-compliant regexes, we’d be able to say something as simple as:

fish(?!box)

We won’t be able to replicate that exactly with our limited engine, but we can get fairly close. (Remember that negative lookahead should be zero-width and non-capturing; that’s something we’ll have to fudge). Let’s start by figuring out what we want to accept, and what we want to fail.

Accept:

  • fish
  • fishsticks
  • fishbig
  • fishborn

Fail:

  • fishbox

Sounds simple enough. So we can get 90% there with just “fish” followed by either not “b” or the end of the stream:

Replace fish( [^b] | $ ) with salmon$1

Simple. That makes “fishbig” and “fishborn” fail, but we have successfully converted “fish(?!b)”, which is at least a start. Let’s try to convert fish(?!bo) next. So in addition to what we already have, we also want to say “if we do find a ‘b’, just make sure it’s not followed by an ‘o’”.

fish (b([^o] | $)
      | [^b] | $ )

And after that, adding the “x” is just the same thing all over again.

fish (b
        (o([^x] | $)
         | [^o] | $)
      | [^b] | $ )

Condensed, that’s:

fish(b(o([^x]|$)|[^o]|$)|[^b]|$)

Can you believe it, it’s only twenty-one characters longer than the Perl version! (Or 9n – 6, where n is the number of characters in the negative lookahead)

It’s very limited

Initially I thought that it would be possible to translate something more interesting, such as:

fish(?!.*box)

Into something like:

fish(b(o(x^$FAIL$^|[^x]|$)|[^o]|$)|[^b]|$)*$

And this works for the most part, except in cases like “fishbbox” or “fishbobox” – it lets them straight through. So, only the absolute simplest case of “something not followed by some literal” is possible, or I think at most the not-followed-by could have character classes or alternations.

2 Comments:

  1. Hi Amos,

    Negation in traditional (non-Perl) regular expressions has never been easy. However, that doesn’t mean that it’s impossible. It took me a while but I think I got it. Try this out:

    fish([^b]|(b(b|(ob))*([^bo]|(o[^bx]))))*((b(b|(ob))*o)|(b(b|(ob))*)|$)$

    Let me know if you find any failing cases – regexp are hard to test for correctness.

  2. [...] As demonstrated, although negation of pattern can be hard in plain regular expressions, it is not impossible as some might think. [...]

Leave a Reply