A good[ish] website
Web development blog, loads of UI and JavaScript topics
An easily approachable but complete look into Regex for beginners and intermediates.
"Let’s learn Regex," I said deterministically, and smashed my head through a dry wall. After wiping the dust off from my shoulders I spat on my palms and sat in front of the computer.
Regular Expressions come in different flavors, the most adopted one might be the Regex syntax that came bundled along Perl, that was later inherited by Java, Python, Ruby, and JavaScript (plus many more). You may read more about from WikiPedia.
This post looks into using regex with JavaScript, mainly, but due to the universal nature of Regular Expressions, the patterns might be usable in other languages, too.
There’s a Wikipedia entry comparing different regex engines if you’d like to know more about it.
At first, Regex looks like what your grandmother might think programming languages look like: unfathomable garble. The syntax is dense and completely unexpressive, each character is packed full of meaning.
In JavaScript, all regex starts and ends with forward slash: /pattern/
, (expect when using the RegExp()
constructor, more on that later). Also it's very often coupled with the global g
flag, like so: /pattern/g
, the g
means it matches all occurrences in a given string, without it, only first occurrence is matched.
At its simplest, a pattern
can be just a word:
/hello/
That will match the word "hello" once in the given string. The following will match all "hello" words, because it uses the global flag:
/hello/g
Then you would use it with JavaScript like so:
const greeting = 'Well, hello'.replace(/hello/g, 'hi')
console.log(greeting)
// Well, hi
On the above case the replace
method was used, but there are many others, more on the JavaScript regex methods later on in the article.
To do more advanced regexs you need to know about the metacharacters.
Metacharacters are the juice of Regex, they make all the complex matching patterns possible. Below I have a quick reference of them:
ℹ️ Editors note: whenever you see
|
that means the pipe|
. MDX has trouble rendering pipes inside tables even if they’re marked as code, and this blog uses MDX.
Char name | Char | Name | Explanation |
---|---|---|---|
Pipe or vertical bar | | | Alternation | /hello | hey/ will match "hello" or "hey". |
The question mark | ? | Greedy Quantifier | /https?/ makes the preceding char optional, the "s" in this case. |
Dot or period | . | Any character | /hello./ matches any one character after the "hello". |
Caret | ^ | Anchor, begins with | /^hello/ matches a string if it begins with "hello". |
Dollar symbol | $ | Anchor, ends with | /hello$/ matches a string if it ends with "hello". |
Star or asterisk | * | Greedy quantifier | Repeat the preceding character zero or more times, /hello.*\.jpg/ matches "hello.jpg" or "hello-anything.jpg". |
The plus sign | + | Greedy quantifier | Repeat the preceding character one or more times. |
Parenthesis | () | Grouping / capturing | /Pop(Tarts)?/ matches "Pop" and "PopTarts". |
Square brackets | [] | Character class | /[0-9a-f]/ matches all characters used to formulate a hex number. Basically place the wanted characters inside the brackets, can use ranges. |
Curly braces | {} | Quantifiers | /x{6}/ matches "xxxxxx" |
Let’s look escaping first and then what all the meta characters do.
Incidentally, if you want to target any of the meta characters, they need to be escaped with a backslash, in this example I want to target the question mark (Greedy Quantifier):
/\?/
All character in regex patterns that need to be escaped are:
. / \ ^ $ * + ? ( ) [ ] { } |
Inside character classes ([]
) the requirement is looser, only the following characters require escaping:
^ - ]
When you’re using the RegExp
constructor, the shorthand character classes need to be doubly escaped (more on those later):
const regex = new RegExp('\\s', 'g')
Because when using the constructor, you’re passing in a string (not a regex), and it’s parsed by using the string literal parser, and the backslash \
is an escape character in the string universe, so you’re just passing in a plain s
. This why the backlash needs also to be escaped.
The forward slash /
should also be escaped, so that it's not accidentally interpret as the regex closing character.
Now we got that out of the way, let's dig into the metacharacters.
|
The following would match 'hello' and 'hey':
/hello|hey/g
Very suitable use-case for the alternation character is to match single and double quotes:
/'|"/g
?
The question mark makes the preceding character optional.
One good example use-case for this is to match http
and https
:
/https?/g
In this case the 's' is optional.
Another classic example: colou?r
matches both the British colour
and the simplified American spelling color
.
$
and caret ^
The anchors match the beginning or ending of a string: beginning with caret ^
ending with $
.
Anchors match a position, not character, so on it's own, the only finds zero length matches. Let's look the caret first:
/^http://clubmate.fi/g
That'll match only all URLs pointing to this site.
The dollar $
is pretty much the same as the caret, but targets the ending of a string, and it's semantically placed at the end:
/something$/
.
Dot, or the period is really powerful, it targets any character once (except line break), take the below example:
/hello./
In the above any single character after the hello is selected.
On it's own the the dot is kind of boring, but when coupled with the repeater characters it comes super handy. Let's look the repeaters next.
*
and plus +
The star or the asterisk character will repeat the preceding character zero or more times.
/hello*/
The above would match "hello", "helloo", "helloooo", and so on. Not very useful eh? Couple that with the upper mentioned dot and it becomes super useful!
The below will target all sort of different favicon names:
/favicon.*\.png/
Will match, for example:
The plus character is almost same as the star, only difference is that it will repeat the preceding character one or more times.
/favicon.+\.png/
Would not match match:
In the case of /favicon.*\.png/
you might notice that the match is wildly too optimistic, which is the problem and the virtue of the dot. Greedy regular expressions in wrong places can be security hazards. In the favicon matching case, we probably only need to match numbers, letters, and the hyphen (the letters appearing in the Latin alphabet, that is). We can make the match less greedy with character classes, let’s look those next.
[ ]
Between the angle brackets you can define the characters you want to match. For example:
/[- /.]/
Would conveniently target all common separators in different date formats:
Like mentioned earlier, the dot is not needed to be escaped inside a character class.
All numbers could be matched like so:
/[0123456789]/
But that’s kind of silly, luckily the smart folks at the Regular Expressions head quarters — in a seedy business park somewhere in suburban America – know this, and they have given us the ranges:
/[0-9]/
Within the square brackets [ ]
The hyphen sets the range. Same range can be applied to upper or lowercase letters, like so:
/[a-zA-Z]/
They can also be combined:
/[0-9a-zA-Z]/
a-z
includes all letters from the Roman alphabet (the English language alphabet). But for example, the Finnish alphabet include more characters, and the Germans have four special letters Ä, Ö, Ü, and ß.
When needed, special characters can be simply added into the character class:
/[a-zäöüß]/
Character classes can be negated by placing a caret after the opening square bracket:
/[^0-9]/
The above makes sure the match has no numbers.
At times, the character classes might become too long and complex, that’s when the shorthand character classes come handy.
The character classes might get too long in some cases, so there are (a ton of) convenient shorthands that do the same thing.
Here’s a table of all the character classes in JavaScript:
\b
Word boundary/the/
would match "there was the boat", but /\bthe\b/
would only match the full word: "there was the boat"\B
Non-word boundary\b
. \Be
would match the second "e" in "erection", but not the first.\cX
Control character\d
Digit[0-9]
.\D
Non-digit[^0-9]
.\f
Feed character\n
Linefeed\r
carriage return\s
Single whitespace character[ \f\n\r\t\v\u00a0\u1680\u180e\u2000-\u200a\u2028\u2029\u202f\u205f\u3000\ufeff]
. The pattern /\.\s/g
would match all the dots with a following whitespace "Lorem. Ipsum. Dolor.yms.sit.".\S
Single non-whitespace character[^ \f\n\r\t\v\u00a0\u1680\u180e\u2000-\u200a\u2028\u2029\u202f\u205f\u3000\ufeff]
. The pattern /\.\S/g
would match: "Lorem. Ipsum. Dolor.yms.sit."\t
Tab\v
Vertical tab\w
Word character[A-Za-z0-9_]
. Note the underscore.\W
Non-word character[^A-Za-z0-9_]
. Note the underscore.\0
Null character{ }
We've touched quantifiers already in this article, in the shorthand form of the question mark ?
, the plus +
, and the asterisk *
. Below is a list of how you would write these quantifiers using curly braces:
/x?/
match x zero or one times, same as: /x{0,1}/
/x+/
match x zero or more times, same as: /x{0,}/
/x*/
match x one or more times, same as: /x{1,}/
Matching an MD5 hash is a great example case, the MD5 hash is a 16 byte string consisting of 32 hexadecimal characters, in simplified terms, numbers and letters:
/^[a-f0-9]{32}$/
Or with a shorthand:
/^w{32}$/
That would work for most cases, but since MD5 hash consists of hexadecimal numbers, which is to say, all numbers, and letters from A to F, then our optimized pattern would look something like:
/^[\da-f]{32}$/
( )
Syntax:
(x)
Also called as capturing groups, the regex patterns wrapped inside parentheses can be subjected to quantifiers or alternation, or the match can be remembered and returned for later use. In JavaScript the return value can be accessed with '$1...'
.
Same way as the question mark makes the preceding character optional, you can make a string of characters optional:
/pop(tarts)?/
Would match "pop" and "poptarts".
Alteration can be used in normal fashion inside the group, which make it super useful:
/pop(farts|tarts)?/
And here's how to later on use the remembered captured group, note the '$1'
:
var foo = 'PopTarts'.replace(/Pop(Tarts)?/, 'Toaster$1')
console.log(foo) // ToasterTarts
The captured group can be also used within the same regex pattern, referring to them with escaped numbers 1
... If we hypothetically needed to target an HTML element <span>hello</span>
, we could use the capturing group to match the closing tag:
/<([a-z]+)>hello<\/\1>/
See the examples section on how to match an HTML tag.
Syntax:
(?:x)
Capturing groups come with performance penalty, so use them only when you need to access the returned value. Otherwise use a non-capturing group. This is done by placing a question mark and colon ?:
in the very beginning of the group:
/color=(?:red|green|blue)/
Syntax:
(?<Name>x)
JavaScript has the concept of named capturing groups, they’re what you’d think they are, and you'd used them like this:
/<(?<openingTag>[a-z]+)>hello<\/(?<closingTag>[a-z]+)>/
Put that into a match:
const foo = '<span>hello</span>'.match(
/<(?<openingTag>[a-z]+)>hello<\/(?<closingTag>[a-z]+)>/
)
console.log(foo)
The groups are now in the output as key/value pairs:
{
0: '<span>hello</span>',
1: 'span',
2: 'span',
groups: { openingTag: 'span', closingTag: 'span' },
index: 0,
input: '<span>hello</span>',
}
Flags are single letters that are added after the pattern, they change the way that specific regex behaves.
/foo/
matches the first "foo" in a string and then stops, but /foo/g
matches all instances of "foo". When using the case-insensitive flag i
, the regex pattern /foo/gi
matches all instances of "FOO" and "foo".
In JavaScript, the following flags can be used:
Flag | Name | Purpose |
---|---|---|
g | Global | Match all occurrences, default is to match only the first |
i | Ignore case | Make the Regex match case insensitive |
m | Multiline | Changes how the anchors ^ & & behave |
u | Unicode | Treat pattern as a sequence of unicode code points |
y | Sticky | matches only from the index indicated by the lastIndex property of this regular expression in the target string (and does not attempt to match from any later indexes) |
There’s two ways to define regex patterns in JavaScript:
/ab+c/i
,new RegExp('ab+c', 'i')
or as a regex new RegExp(/ab+c/, 'i')
Regex patters in JavaScript are objects:
console.log(typeof /foo/)
// object
The object has one property: lastIndex
, which defines the index to start the match from. I find myself only needing this when working with the exec command, see the exec
section of the article for more.
In general, there are four things you can do with regex:
JavaScript provides few handy functions to deal with these.
Syntax:
str.replace(regexp|substr, newSubStr|function[, flags])
Replace must be one of the most used Regex related methods in JavaScript.
Simple example:
const greeting = 'Good morning'
const modifiedGreeting = greeting.replace(/morning$/, 'evening')
console.log(modifiedGreeting) // Good evening
Replace is common to use with a callback function:
const greeting = 'Good morning'
const greetingModifier = match => console.log('Bad ' + match)
greeting.replace(/morning$/, greetingModifier) // Bad morning
The replace
method’s callback function takes the output from a capturing group as its parameters. To demonstrate this let’s make a simple helper to replace placeholder values like these {{foo}}
.
But first, here’s the anatomy of the callback arguments:
const corpus = 'We have {{fruitCount}} fruits, and {{veggieCount}} of veggies.'
// 3 capturing groups, mostly for show.
corpus.replace(/({{)(\w+)(}})/gi, (...rest) => {
console.log([...rest])
})
The above clog spits out the arguments for each match (this being the latter):
[
"{{veggieCount}}",
"{{",
"veggieCount",
"}}",
38,
"We have {{fruitCount}} fruits, and {{veggieCount}} of veggies."
]
This is the exact anatomical structure of the callback parameters:
Given name | value |
---|---|
match | The full match from the regex. |
g1 , g2 ... | An infinite amount of possible capturing groups. |
offset | The position of the match in the original string. |
string | The whole original string. |
groups | Object where key is the group’s name and value the match. |
And here’s the little helper that replaces the stubs globally:
const corpus = 'We have {{fruitCount}} fruits, and {{veggieCount}} of veggies.'
const replacements = {
fruitCount: 5,
veggieCount: 7
}
const replacePlaceholders = corpus =>
corpus.replace(/{{(\w+)}}/gi, (match, group1) => replacements[group1])
console.log(replacePlaceholders(corpus))
// We have 5 fruits, and 7 of veggies.
Or, like mentioned earlier, the group can also be accessed with the dollar notation '$1'
...:
const tarts = 'PopTarts'.replace(/Pop(Tarts)?/, 'Blob$1')
console.log(tarts) // BlobTarts
Syntax:
regexObj.test(str)
test
is a super simple no thrills Regex matcher method that returns a boolean, and is good for conditions, like so:
const isSecureUrl = /^https:\/\//.match('https://example.com')
if (isSecureUrl) {
console.log('Yup, is secure')
}
Syntax:
str.search(regex)
Search is almost like test
(above), but it returns the position of the match instead of a boolean, and -1
if no match was found (similarly as indexOf
):
console.log('bad unboxing'.search(/box/))
// 6
console.log('bad unboxing'.search(/duck/))
// -1
Syntax:
str.split([separator[, limit]])
The split()
method splits a string into an array on the spot where it finds the defined separator, and the separator can be a regex pattern. If you come from the PHP world, it’s basically the same as explode
.
The following will split the string on any whitespace character (space, tab, form feed, or line feed):
const corpus = 'Lorem\tipsum\ndolor'
const myArray = corpus.split(/\s/)
console.log(myArray) // ["Lorem", "ipsum", "dolor"]
The second, optional, parameter in split
is limit:
const corpus = 'Lorem ipsum dolor'
const myArray = corpus.split(/\s/, 2)
console.log(myArray.length) // 2
console.log(myArray) // ["Lorem", "ipsum"]
Syntax:
str.match(regexp)
What’s cool about match
is that it returns an array, with a lot of useful crap in it.
This example finds placeholder values {{foo}}
, which can be later on filled with actual values coming from a database or whatever:
const corpus = 'We have {{fruitCount}} fruits in the store.'
const pattern = /({{)(\w+)(}})/i
const matches = corpus.match(pattern)
console.log(matches)
That gives us an array looking like this:
['{{fruitCount}}', '{{', 'fruitCount', '}}']
Where matches[0]
is the full match, and the rest are matches from the capturing groups (the stuff inside parens).
Maybe you noticed the absence of the g
flag, that’s because match
won’t really work the same way with the global flag.
const corpus =
'We have {{amountOfFruit}} fruits, and {{amountOfVeggies}} of veggies.'
// Now with g
const pattern = /({{)(\w+)(}})/gi
console.log(corpus.match(pattern))
// ["{{amountOfFruit}}", "{{amountOfVeggies}}"]
That won’t give you the groups, the groups are are ignored.
It’s not the best tool for extracting multiple placeholder values, since we need access to the group that contains the value between the double curly braces. But matchAll
or exec
can handle patterns using the global flag.
Syntax:
str.matchAll(regexp)
The matchAll
regex method is a bit like what match
should’ve been. It’s exclusively meant for global matches, actually, it throws an error if the g
flag is not used.
const corpus = 'We have {{fruitCount}} fruits, and {{veggieCount}} of veggies.'
const regex = RegExp('({{)(w+)(}})', 'gi')
const matches = corpus.matchAll(regex)
// It has to be spread, because matchAll returns and iterator.
console.log([...matches])
// [
// ['{{fruitCount}}', '{{', 'fruitCount', '}}'],
// ['{{veggieCount}}', '{{', 'veggieCount', '}}']
// ]
The return value from matchAll
is an iterator, so "normal" array methods can’t be used on it, but it can be spread: [...matches]
, and for...of
loop works on it, or Array.from(matches)
also turns it into a normal array that can be worked on.
matchAll
is relatively new, so IE11 doesn’t support it.
Syntax:
regexObj.exec(str)
The exec
method executes a search on an input string and returns the results array, or null
if no match was found.
exec
can do pretty much everything matchAll
can, but it has a more a 90s-type-of-archaic-feel to it. Plus, it updates the RegExp object’s lastIndex
property every time it finds a match (see below).
Here’s the above example written with exec
which returns all the placeholders and their capturing groups:
const extractPlaceholders = corpus => {
const pattern = /({{)(\w+)(}})/gi
let array
const placeholders = []
while ((array = pattern.exec(corpus)) !== null) {
placeholders.push(array)
}
return placeholders
}
const placeholders = extractPlaceholders(
'We have {{amountOfFruit}} fruits, and {{amountOfVeggies}} of veggies.'
)
That extractPlaceholders
helper function gives us a nested array:
[
['{{amountOfFruit}}', '{{', 'amountOfFruit', '}}'],
['{{amountOfVeggies}}', '{{', 'amountOfVeggies', '}}']
]
One thing to note, is that the RegExp
object is stateful when using g
or y
flags, which means that every time a match is found, the RegExp.lastIndex
is updated to reflect the end position of the current match.
Here’s how exec
works basically:
lastIndex
in the RegExp
object,Here’s a little function that shows the beginning positions of a word in a corpus of text:
const getWordPositions = (pattern, corpus) => {
const regex = RegExp(pattern, 'gi')
let array
let wordPositionsInCorpus = []
while ((array = regex.exec(corpus)) !== null) {
// The `lastIndex` is now the end of the match.
const matchStart = regex.lastIndex - array[0].length
wordPositionsInCorpus.push(matchStart)
}
return wordPositionsInCorpus
}
// Use it: give it the regex pattern (in this case "the") and the corpus
const wordPositions = getWordPositions(
'\\bthe\\b',
'The quick brown fox jumps over the lazy dog'
)
console.log(wordPositions)
// [0, 31]
No programming related article is complete without examples. Here’s some common things, with explanations, what one can do with Regex while programming in JavaScript. This is obviously just a scrape on the surface, sites like RegexLib offer a stupefyingly large array of ready-made Regex patterns. Also, like anything nowadays, there’s probably a prepackaged solution for you to use, just dig the npm.
Internal URL is easy to match since it has your site domain in it and it and starts with "http" or "https", or with a forward slash /
:
^https?:\/\/clubmate.fi|^\/.*
│ │ └┬─┘└┬┘
│ │ │ └Has nothing or anything after the /
│ └Optional s └Or begins with a /
└─Should begin with http
This would match:
/
✅/foo
✅http://clubmate.fi
✅https://clubmate.fi
✅Here's another way to write the same thing:
^(?:https?://clubmate.fi|/.*)
In this hectic Internet era that we live in, where everything is possible, like really messed up TLDs, it’s very difficult with 100% accuracy to match a domain name.
Here’s a Regex for a traditional TLD (since the new TLDs this is not as effective anymore):
/^(https?:\/\/)?([\da-z.\-]+)\.([a-z.]{2,6})([/w .\-]*)*/?$/
└─────┬─────┘ └────┬─────┘ └─────┬─────┘└────┬────┘ └─┬─┘
│ │ │ │ └─Optional forward slash at the end
│ │ │ └─Any word character, dot, hyphen, 0 or more times
│ │ └─The TLD, letters, 2 to 6 times
│ └─Numbers & letters, 1 or more times
└─http:// or https://, optional
Hex values can be #dddddd
or #ddd
or also without the hash ddd
:
/^#?([\da-f]{6}|[\da-f]{3})$/
││ └────┬────┘└─────┬────┘
││ │ └─Or any number or a-f 3 times
││ └─Any number or a-f 6 times
│└─Optional hash
└─Begins with
MD5 hashes, these things: 7e18a1b53182d6124253453811b67eb0
, 32 digits long hex values. The following will match the hash if it’s completely on its own:
/^[\da-f]{32}$/
If, for example, part of a filename, like so: global.7e18a1b53182d6124253453811b67eb0.js
, it won’t match, because of the ^
and $
. The following matches the hash in that filename:
/[\da-f]{32}/
You really almost never n͠eed̸ ͢t͝o do͢ ̨t͏h̛is, and if you’re doing this, you’re doing s̺̠o̻̭̞̞m̶͕͙ͩͯ̅ͅe̱͖͖̹̭̜͋̅̾̃̓̈́̕ṯ͇̯̤͇͇̤͌̌͛ͯͤ̑̈́ĥ̪̽ͅiņ͇̞̯ǵ̴ͨ͒̾ ͥ̓ṱ̜̬͇̮̱̰̽ͮ͊͋̅̓̿e̘̥̙̝̼͖̬ͪ͐ͧ͗ͥͧ͊͠rr̵̥̬̖̟í̛͔̹̙ͭͤb̙̓ḻ͟y w̻̖̿̄r͎͍̅͑ö̻́n̈͊ͦ͊̽̾g̶̻ and you need to stop. But all the more reason, this is a great example of a complex Regex pattern and of what Regular Expressions does if you cut its head off and let it roam free:
┌─4) Non capturing group to match the closing tag
┌──────────┴─────────┐
^<([a-z\-]+)([^<]+)*(?:>(.*)<\/\1>|\s+\/>)$
||└────┬───┘└──┬──┘| |└─┬┘└──┬─┘└──┬──┘ |
|| | | | | | | | └─End of string
|| | | | | | | └─Or the end of a self-closing tag " />"
|| | | | | | └─The "1" refers to the result of the first capturing group
|| | | | | └─3) Followed by any character 0 or more times
|| | | | └─Closing carrot ">"
|| | | └─Zero or more times
|| | └─2) Anything but "<" one or more times
|| └─1) a to z and "-" one or more times
|└─"<"
└─Beginning of string
Below I’ve numbered the groups above, so you can see what they exactly do:
<span class="hello">foobar</span>
└─┬─┘└──────┬─────┘ └──┬─┘└──┬──┘
| | | |
└─1 └─2 └─3 └─4
This will not work:
const haystack = 'will not work'
const needle = 'work'
haystack.replace(/needle/g, 'twerk')
But the RegExp
constructor can take variables:
const haystack = 'something is wrong.'
const needle = 'something'
const regex = new RegExp(needle, 'g')
haystack.replace(regex, 'everything')
These are priceless! And there’s so many, just search for "regex tester".
I find regexpal.com to be all I need.
To wrap things up I could summarize:
test
or search
when you want a simple boolean match.match
when you want to match a single instance of a string.matchAll
or exec
when you want to match globally.exec
when you want to target successive matches.replace
when you want to replace values and use capturing groups.split
when you want to split strings into arrays.Regex is one of those things you want to use everywhere right after you’ve learned it. But often there’s a better way to do it.
Thanks to regular-expressions.info and MDN Regular Expressions articles for great references!
Comments would go here, but the commenting system isn’t ready yet, sorry.