Regex Fu: Not Followed By
Everyone has their own regular expression implementation; especially text editors. One is good, while the rest range from absurd to almost usable.
Sometimes, despite your best efforts, you’ll find yourself in a strange environment needing to do some funky regex fu.
So for the sake of argument, let’s imagine we have a standard no-frills regex engine:
- ( ) Capture groups
- ?, *, + Repetition
- [a-z] [^a-z] Character classes including negation
- ^ $ Anchors
- a|b Alternation
For the sake of clarity, we’ll assume it’s set to ignore whitespace too.
Simple Case
We’re looking for all instances of “fish”, so long as they aren’t followed immediately by “box”. With Perl-compliant regexes, we’d be able to say something as simple as:
fish(?!box)
We won’t be able to replicate that exactly with our limited engine, but we can get fairly close. (Remember that negative lookahead should be zero-width and non-capturing; that’s something we’ll have to fudge). Let’s start by figuring out what we want to accept, and what we want to fail.
Accept:
- fish
- fishsticks
- fishbig
- fishborn
Fail:
- fishbox
Sounds simple enough. So we can get 90% there with just “fish” followed by either not “b” or the end of the stream:
Replace fish( [^b] | $ ) with salmon$1
Simple. That makes “fishbig” and “fishborn” fail, but we have successfully converted “fish(?!b)”, which is at least a start. Let’s try to convert fish(?!bo) next. So in addition to what we already have, we also want to say “if we do find a ‘b’, just make sure it’s not followed by an ‘o’”.
fish (b([^o] | $)
| [^b] | $ )
And after that, adding the “x” is just the same thing all over again.
fish (b
(o([^x] | $)
| [^o] | $)
| [^b] | $ )
Condensed, that’s:
fish(b(o([^x]|$)|[^o]|$)|[^b]|$)
Can you believe it, it’s only twenty-one characters longer than the Perl version! (Or 9n - 6, where n is the number of characters in the negative lookahead)
It’s very limited
Initially I thought that it would be possible to translate something more interesting, such as:
fish(?!.*box)
Into something like:
fish(b(o(x^$FAIL$^|[^x]|$)|[^o]|$)|[^b]|$)*$
And this works for the most part, except in cases like “fishbbox” or “fishbobox” - it lets them straight through. So, only the absolute simplest case of “something not followed by some literal” is possible, or I think at most the not-followed-by could have character classes or alternations.


Leave a Reply