How Regular expression Works

Estimated reading time: 7 min

Introduction

Regular expressions (regex or regexp) are perhaps the most widely-adopted programming language feature. Almost every major language – from Java to Golang – to relative newbies like Deno have regular expressions implemented as an in-built feature.

Even outside the realm of conventional programming, regex is still quite prevalent. Almost all web servers – both Apache and Nginx, which form somewhere about 90% of the market – rely on regex, for example. Essentially, it’s impossible for any developer to escape learning regex wherever they go.

Despite their widespread usage, regular expressions have earned bad rep over the years for being infamous ‘hard’ to master. You might find diving into how regular expressions work interesting and a bit beautiful at times, even, especially as compared to traditional ways of learning regex. If you’ve already grasped the fundamentals, learning how regex works internally may help you craft better regexes and unleash its full power.

That said, let’s get the basics out of the way first.

Prerequisites

The first thing to understand before getting too carried away is what regex actually is. A regular expression is a string that describes certain patterns that can be used to match and locate text.

Regex internally depends on a ‘regular expression engine’ to run. There are many implementations of regex engines in the world, sometimes differing slightly in complexity and quite significantly at other times. This article is going to explain the inner workings of regex using the PCRE (Perl Compatible Regex Engine). Javascript and PHP are examples of prominent programming languages that rely on a PCRE.

For the sake of making regexes easier to read, they are normally placed between forward-slashes eg. /ABC/. This convention is respected throughout the article.

Step 1: Understanding regex engines

Regex only defines certain patterns for matching text. For the actual pattern matching process to run, the pattern has to be passed interpreted through the engine. The engine is what’s actually responsible for matching the regex with the input string.

There are two main classes of regex engines – those based on backtracking (also called regex-directed engines) and those based on finite-state machines (also referred to as text-derived regex engines).

Finite-state regex engines: These work by building a finite-state machine and feeding characters into it. They are generally the fastest, but due to design constraints, it’s impossible to implement certain features into them. POSIX (used in UNIX-based systems) is a regex implementation that uses finite-based automata regex.

Backtracking regex engines: A backtracking regex engine walks through the regex, attempting to match the next token in the regex to the next character in the text. If a match is found, the engine advances through both the regex and the subject string. If a token doesn’t match, the engine backtracks to a previous position in the regex and tries a different path.

Backtracking engines are the most common way of implementing regexes because of advanced features they allow (eg. backreferences and lazy quantifiers). PCRE, Java, PHP, R, .NET, and many other programming languages use this flavor of regex.

Step 2: Understanding how regex works

Now that we have the higher-level details out of the way, we’re going to dive deeper into the inner working of regex engines. In particular, our goal in this step is to understand the rules that govern how PRCE matches text.

In order to understand how regex works, let’s imagine we have a machine that takes any kind of string as the input. The machine has predetermined instructions within it that tell it which strings it can match.

These are the rules governing the inner working of our machine:

  1. It reads the whole input and attempts to match the characters against the instructions one at a time.
  2. It reads the characters from left to right.
  3. If a character matches the desired pattern, it returns true. If it does this, the machine can proceed to the next character and run rules II) and III) again. If a character doesn’t match the pattern, it returns false. If this happens, reject the current character and start matching again from the next one. Start from II)
  4. If all instructions inside the machine are matched, a green bulb lights up indicating the regex matched successfully.

This theoretical machine we’ve ‘invented’ is actually how regex engines like PCRE work internally.

Let’s start with something simple.

The regex /cat/ will match any string with the word ‘cat’ in it and nothing else. Does it match up against the arbitrary string ‘catsareadorable’?

First, the machine has to be fired up. This process is built into most programming languages, so you shouldn’t worry about it.

Once we are sure our machine is up and the regex engine is running, we feed the string into the machine. It reads the first character in the string and compares it against its in-built schema. ‘c’ from the regex matches ‘c’ from the string, and subsequently matches the ‘a’ then the ‘t’. Since all the conditions have been satisfied, the green bulb lights up. The regex matched.

How about trying the same regex for the string “i’m a cat”?

The machine takes in the whole input and attempts to match the letter ‘i’ to ‘c’. This fails so it ditches that effort and attempts to match the apostrophe next. This will fail to the point it reaches the word ‘cat’ at the end and eventually succeeds.

Basically, literals always match literals. That means if you have ‘a’ in a string, it will always match ‘a’ in the regex.

To get into more contrived examples, we have to first introduce special characters.

Step 3: Introducing meta characters (special characters)

Special characters look like the normal characters you’re used to but have a different logical meaning within regex. Let’s introduce a few of them:

  • ^ : the hat (or caret) matches the beginning of the line. It can also be used for negation.
  • $ : the dollar sign matches the end of the line
  • \d : matches digits only. Notice the backslash at the beginning? That tells the regex engine that it’s not a string literal.

Using the above three characters alone, consider the following regex: ^\d\d:\d\d$

It will only match strings in the format: ‘digitdigit:digitdigit,’ eg. ‘12:45’ and ‘11:30’ and not ‘112:20’ or even ‘ 12:45’ (notice the space at the beginning).

Some more important in regex include:

  • . : the dot character matches any character.
  • \w : matches any one-word character.
  • \s : matches any white-space character
  • * : matches 0 or more characters.
  • + : matches 1 or more characters
  • \ : escapes a character if it equals a metacharacter.
  • ? : makes the preceding character optional.

With this knowledge, we can finally create some regexes. The first one will be something you’ve probably run into a thousand times already: matching an arbitrary email address.

Emails typically look like: ‘[email protected]’, or in more regex-like terms ‘[email protected]’, which can be translated into: /\w@\w\.\w/.

Most metacharacters can be negated. So while ‘\w’ matches words, ‘\W’ matches tokens that aren’t words. The same applies for ‘\D’ (non-digits), ‘\S’ (non-space), and so on.

With that out of the way, there are still two more fundamental concepts that must be understood before we can complete regex school.

Step 4: Defining ranges and optional characters

Let’s say you wanted to match only characters in a certain range. For example, you don’t want kids less than 13 on your app. The input form on your website will only match ages greater than 13.

A simplistic way of going about this problem would be listing all the characters that aren’t supposed to be within your range (1 to 13) and negating them out somehow. Since regex supports ranges, we don’t have to go to such lengths.

Regex allows you to define a range using square brackets: ‘[‘ and ‘]’.

A range of all (lowercase) letters from a to z is /[a-z]/, a range of numbers from 0 to 9 is /[0-9]/ and so forth.

This unlocks new doors.

With the knowledge of the metacharacters we know so far and our square brackets, we can finally write a slightly more complex regex. See if you can understand how it works without reading the explanation.

/^[a-zA-Z0-9]*$/

The above regex only matches alphanumeric strings. Why?

The hat (^) matches the beginning of the regex. The square brackets represent a range of characters that can be accepted. In this case, we will accept all lowercase letters ([a-z]), uppercase letters ([A-Z)] and numbers from 0-9.

This will return true for strings like ‘abcde’ and ‘teslamodel5’ but not ‘mars*colony’ or ‘muzak/’.

Additionally, notice the use of the asterisk (*). This causes the engine to match one or more of any of the characters within the range we described. If we removed it, the pattern would only match true for a single character. Incidentally, this regex also matches an empty string. Can you guess why?

If we add the hat character to the regex: /^[^a-zA-Z0-9]*$/, we negate the regex. Now, it only accepts non-alphanumeric strings.

Lastly, now that we understand how to define ranges, it’s worth mentioning that the metacharacter ‘\w’ is simply shorthand for /[a-zA-Z0-9_]/

Step 5: Matching repetitive text

If we want to match repeating characters, we use curly brackets: ‘{‘ and ‘}’. Technically speaking, both the dot and asterisk metacharacters also deal with repetitions, but curly brackets give us more fine-grained control.

Let’s say we wanted strings matching the pattern ‘zzz’. As long as there are three or more ‘zzz’s, the string should match. The relevant regex will look like this: /z{3}/.

The repetition character can be used in three ways:

  • /{x}/ : matches exactly ‘x’ repetitions.
  • /{x,}/ : matches ‘x’ amount of repetitions or more
  • /{x,y}/ : matches between ‘x’ and ‘y’ number of repetitions.

The second way to use square brackets is to match characters you’re not very sure about.

For instance, taking ‘catsareadorable’ as an example again, what if we also wanted to be inclusive and match ‘dogsareadorable’?

Here’s how square brackets come to the rescue

[cats]

Step 6: Capturing a group

The final important regex feature you ought to understand is group capturing. This allows us to do two things: matching characters as a group and extracting information for further processing.

Take the regex /([0-9]+)/ for example. This will extract the longest sequence of numbers and store them in a variable.

In Perl, for example:

my $value = "Bolton's number is 801-555-1234.";
     if ($value =~ /([0-9]+)/ {
       print "Found area code: $1"; // Found area code: 801
     }

And in Javascript:

const value = “Nathan's number is 801-555-1234.”;
   const areaCode = value.match(/([0-9]+)/ )[0];
   console.log("Found area code: ", areaCode) // Found area code: 801

To complete our previous example, remember our string ‘catsareadorable’? If we wanted to be more inclusive and also match ‘dogsareadorable’, you could write

/(cats|dogs)areadorable/

The second way to use groups is to match repetitions.

For instance: /(abc){2}/ – matches a string in which ‘abc’ is repeated at least twice. ie. matches ‘abcabc’ or ‘abc000abc’ but not ‘abvbbc’

Conclusion

That’s the end of regex school. Now that you’ve mastered the fundamentals, don’t be afraid to go out there and challenge yourself. Being good at regex, like so many other things, requires a lot of time and practice so don’t expect to dream in regex starting tomorrow.

Was this article helpful?
Dislike 5
Views: 10049

Reader Interactions

Leave a Reply

Your email address will not be published. Required fields are marked *