Smelly Code

Level up your Regex Game 🚀

October 13, 2018 · 11 min 👓

Regular expressions are one of the most powerful tools programmers can have in their arsenal. But learning them is one hell of a painful thing. This week I mustered up some courage and decided to go through that pain. I am still learning about them. In this post, I’ll share what I have learned so far. So let’s get started.

Definition

Regular expressions are patterns used to match character combinations in strings. ~ MDN

When ugly cryptic symbols meet gorgeous characters, they produce patterns that a regex engine (a tool which understands those patterns) can use to match strings/characters in a text. These patterns are called “Regular Expressions”. Let’s have an example.

Suppose you want to check the presence of ‘Sheldon’ in text ‘Bazinga! I am Sheldon Cooper’. You might say, you’ll use the String.indexOf API. But we are not learning about string APIs here. We are learning regex so we’ll do it the regex way.

In JavaScript, a regular expression can be created in one of the following two ways:

  1. Using a regex literal (with the help of a pair of forward slashes /).
  2. Using the RegExp constructor.

We’ll use the literal way throughout this post. You can read about the constructor way on MDN if you wish to.
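For reference, here is a minimal sketch of both forms side by side. The variable names (literalRegex, constructedRegex, characterName, dynamicRegex) are made up for illustration; the constructor is mostly useful when the pattern is only known at runtime.

```javascript
// Two equivalent ways to build the same pattern.
const literalRegex = /Sheldon/;
const constructedRegex = new RegExp('Sheldon');

// Both objects carry the same pattern text.
console.log(literalRegex.source === constructedRegex.source); // true

// The constructor accepts a string, so the pattern can come from a variable.
// `characterName` is a hypothetical value used only for this sketch.
const characterName = 'Penny';
const dynamicRegex = new RegExp(characterName);
console.log(dynamicRegex.test('Penny! Penny! Penny!')); // true
```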

Alright! Let’s create a regex to find ‘Sheldon’ in a text. As we saw earlier we use / to create regex. So whatever we write between a pair of / becomes a pattern for matching. eg /Sheldon/ creates a regex which matches ‘Sheldon’ in a text. Congratulations! We just made our first regex.

Hold on Hitesh! You haven’t explained how I am gonna use it. I can hear you.

Regex Meme

Well, what if I tell you that regexes are also JavaScript objects which have some useful methods to make our life simpler. One such method is RegExp.test, which is used to test a pattern against a text. It returns true if the pattern matches, otherwise false.

const regex = /Sheldon/; // matches 'Sheldon' in text.
regex.test('Bazinga! I am Sheldon Cooper'); // Returns true.
regex.test('Penny! Penny! Penny!'); // Returns false.

Understanding and debugging a regex, especially when you are getting started, is difficult, let’s be honest. There are some helpful tools like Regex Visualiser, Regex101 and the almighty console.dir which can help you throughout your journey. Ok! Enough ado!! Let’s take a plunge.

Mostly, we create regular expressions by juggling with simple characters(mainly alphanumeric) and special characters/symbols. Simple characters don’t hold any specific meaning but special characters do for sure. That is why they are defined in various categories/classes. We’ll take a gander at each category with the help of some examples.

Character Classes

This class holds ., \d, \D, \w, \W, \s, \S and many more characters which are beyond the scope of the article. Here’s the full list.

Dot(.) Character Class

The dot character has a candid nature. It gels well with every single character (except line terminators) that comes its way while matching the text.

const regex = /./;
// Statements below return true.
regex.test('!'); // matches '!'
regex.test('Voila!'); // matches 'V'
regex.test('$'); // matches '$'
regex.test(' '); // matches ' ' (space)
regex.test("'"); // matches '(single quote)
/.Script/.test('JScript JavaScript'); // matches 'JScript'
/.Script/.test('@Script $Script'); // matches '@Script'
// Statements below return false
regex.test(''); // no character present.
regex.test('\n'); // line terminator.

Note: Dot loses its special meaning when it appears in a character set and matches the literal dot. Please refer to Character Sets section for more details.

Digit Character Class(\d and \D)

As the name suggests, digit character class is used for matching numbers. \d matches digits(numbers) while \D matches non-digit characters.

const regex = /\d/;
regex.test('1'); // true matches '1'
regex.test('23'); // true matches '2'
regex.test('1e10'); // true matches '1'
regex.test('e'); // false
/\D/.test('1'); // false
/\D/.test('A'); // true matches 'A'
/\D/.test('A2'); // true matches 'A'

Word Character Class(\w and \W)

\w matches any alphanumeric character, including the underscore (_). \W is the negation of \w, i.e. it matches any character that is neither alphanumeric nor an underscore.

const regex = /\w/;
regex.test('1'); // true matches '1'
regex.test('23 is my roll number.'); // true matches '2'
regex.test('1e10'); // true matches '1'
regex.test('e'); // true
regex.test(' '); // false
regex.test(' _jedi '); // true matches '_'
regex.test('!@#$%^&*()'); // false
/\W/.test('1'); // false
/\W/.test('23 is my roll number.'); // true matches ' '(first space)
/\W/.test('1e10'); // false
/\W/.test('e'); // false
/\W/.test(' '); // true
/\W/.test('_jedi '); // false
/\W/.test('!@#$%^&*()'); // true matches '!'

Space Character Class(\s and \S)

Space character(\s) matches a white space character in a text. White space in text can be any space, tab, new line or any other character which can create white space in text(eg. unicode space characters). Any non-white space character can be matched by \S.

const regex = /\s/;
regex.test('1 '); // true matches ' '
regex.test('1\t'); // true matches tab "\t"
regex.test('\n'); // true matches "↵"
regex.test(' '); // true
regex.test('Name'); // false
/\S/.test(' '); // false
/\S/.test(' 1'); // true matches "1"
/\S/.test('\t'); // false
/\S/.test('\n'); // false

Character Sets([]) and Alternation

Character Sets are pretty useful, especially when we want to match text against a set of characters. We create a character set by enclosing characters in brackets []. A character set matches any one of the characters enclosed within the brackets. eg. pattern [abc] is a character set which matches a, b, or c in a text.

const regex = /[abc]/;
regex.test('a'); // true
regex.test('b'); // true
regex.test('c'); // true
regex.test('abc'); // true
regex.test('This is plain text'); // true matches "a"

We can also define a range in character sets using a hyphen(-) eg. pattern [0-4] will match any digit between 0 and 4 in a text. Hyphen loses its special meaning when it’s on the boundary(appears as the first or the last character in the character set), and is treated as a literal hyphen character. eg. [04-] will match 0, 4 and -.

/[0-4]/.test('0'); // true
/[0-4]/.test('4'); // true
/[0-4]/.test('2'); // true
/[0-4]/.test('54'); // true matches "4"
/[0-4]/.test('a4'); // true matches "4"
/[0-4]/.test('5'); // false
/[0-4]/.test('a'); // false
// Below is equivalent to \w.
/[A-Za-z0-9_]/.test('_jumbo');
// Hyphen loses special meaning when on boundary
/[04-]/.test('0'); // true
/[04-]/.test('4'); // true
/[04-]/.test('2'); // false
/[04-]/.test('-'); // true matches "-"
/[04-]/.test('non-digit'); // true matches "-"
/[0-4-7]/.test('non-digit'); // true matches "-"
/[0-4-7]/.test('0'); // true
/[0-4-7]/.test('4'); // true
/[0-4-7]/.test('2'); // true
/[0-4-7]/.test('5'); // false
/[0-4-7]/.test('7'); // true
/[0-4-7]/.test('-'); // true matches "-"
// Dot(.) and asterisk(*) also lose their power with charset.
/[.*]/.test('*'); // true
/[.*]/.test('.'); // true
/[.*]/.test('.*'); // true matches "."
/[.*]/.test('abc'); // false

A character set also allows negation using ^, which means we can match any character except those present in the charset. The only stipulation is that ^ has to be the first character of the charset. If it’s not the first character then it will have its literal meaning.

const regex = /[^0-4]/; // any character except range 0 to 4.
regex.test('0'); // false
regex.test('4'); // false
regex.test('2'); // false
regex.test('5'); // true
regex.test('a'); // true
// When ^ is not first character
/[a^c]/.test('a'); // true
/[a^c]/.test('c'); // true
/[a^c]/.test('^'); // true
/[a^c]/.test('abc'); // true matches "a"

Alternation is achieved with the help of OR (|) operator. The behavior is akin to the logical OR operator. eg. /green|red/ will match either green or red in a text.
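A quick sketch of alternation in action. The sample sentences and variable names are made up for illustration. The second pattern uses a group (covered later in this post) to limit the alternation to a part of the pattern.

```javascript
// Matches either "green" or "red" anywhere in the text.
const colorRegex = /green|red/;
console.log(colorRegex.test('The light is green')); // true
console.log(colorRegex.test('A red rose')); // true
console.log(colorRegex.test('Blue sky')); // false

// Without grouping, | splits the whole pattern.
// With a group, only the part inside the parentheses alternates.
const grayRegex = /gr(a|e)y/;
console.log(grayRegex.test('gray')); // true
console.log(grayRegex.test('grey')); // true
console.log(grayRegex.test('groy')); // false
```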

Boundaries

Just like string APIs (String.startsWith, String.endsWith, etc), regular expressions also provide ways to test if a string starts or ends with certain characters. They advance the game further with the help of the word boundary(\b), which we will be discussing soon.

Beginning (^)

It matches the beginning of the text.

const regex = /^T/;
regex.test('The Coldplay'); // true matches "T" of "The"
regex.test('the Coldplay'); // false
regex.test('the Mike Tyson'); // false

The beginning character has a different meaning when it appears in a charset. Please refer to the Character Sets section.

Ending ($)

It matches the end of the text.

const regex = /End$/;
regex.test('The End'); // true matches "End"
regex.test('The end'); // false
regex.test('The End.'); // false
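The two anchors can also be combined. As a small sketch (the sample strings are made up for illustration), /^End$/ matches only when the whole text is exactly "End":

```javascript
// ^ pins the match to the start and $ pins it to the end,
// so nothing may come before or after "End".
const exactRegex = /^End$/;
console.log(exactRegex.test('End')); // true
console.log(exactRegex.test('The End')); // false
console.log(exactRegex.test('End.')); // false
```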

Word Boundaries(\b)

A word boundary is a position where a word starts or ends. It is basically an empty string before or after the word which acts as a boundary. Let’s use String.replace method to grok it.

// Below regex matches all word boundaries in text.
// "g" is the global flag which we will discuss in the flags section.
const regex = /\b/g;
const str1 = 'Regular Expressions'.replace(regex, '~');
const str2 = 'Regular_Expressions'.replace(regex, '~');
const str3 = 'Regular-Expressions'.replace(regex, '~');
console.log(str1); // ~Regular~ ~Expressions~
console.log(str2); // ~Regular_Expressions~
console.log(str3); // ~Regular~-~Expressions~

In regular expressions, a word can only contain characters of the word character class \w, which means that ‘Regular_Expressions’ will be treated as a single word while ‘Regular-Expressions’ will be treated as two words. You can read more about word boundaries here: https://stackoverflow.com/questions/1324676/what-is-a-word-boundary-in-regexes

Regex also has a non-word boundary \B, which is the opposite of \b.
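To see \B at work, here is a minimal sketch using the same String.replace trick. The sample strings are made up for illustration.

```javascript
// \B matches every position that is NOT a word boundary,
// i.e. the gaps between two word characters (or two non-word characters).
const marked = 'ab cd'.replace(/\B/g, '~');
console.log(marked); // "a~b c~d"

// \Bcat needs "cat" to start in the middle of a word.
console.log(/\Bcat/.test('concat')); // true
console.log(/\Bcat/.test('cat')); // false

// \bcat needs "cat" to start at a word boundary.
console.log(/\bcat/.test('concat')); // false
```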

Quantifiers

Quantifiers specify how many times the preceding character may occur, eg. matching wow in a text where wow can be present with multiple ‘o’(wow, woow, woooow or woooooooow).

  • * matches zero or more occurrences of the preceding character.
const regex = /wo*/;
regex.test('wooohooo'); // true matches "wooo"
regex.test('wink'); // true matches "w";
regex.test('blink'); // false
  • + matches 1 or more occurrences of the preceding character.
const regex = /wo+w/;
regex.test('wow'); // true
regex.test('woooow'); // true
regex.test('www'); // false
  • ? matches 0 or 1 time.
const regex = /wo?w/;
regex.test('wow'); // true
regex.test('woooow'); // false
regex.test('www'); // true matches "ww"
  • {n,m} matches n to m occurrences where n ≥ 0 and m ≥ n. m is optional: {n} matches exactly n occurrences and {n,} matches n or more occurrences.
/wo{1}w/.test('wow'); // true
/wo{1}w/.test('ww'); // false
/wo{1}w/.test('wo0w'); // false
/wo{1,}w/.test('wow'); // true
/wo{1,}w/.test('woooow'); // true
/wo{1,3}w/.test('ww'); // false
/wo{1,3}w/.test('wow'); // true
/wo{1,3}w/.test('woow'); // true
/wo{1,3}w/.test('wooow'); // true
/wo{1,3}w/.test('woooow'); // false

Flags

Flags are mainly used to provide additional information to the regex engine. We place flags at the end of the regex, after the last forward slash.

  • g global match; we have seen that by default a regex returns the first match but, with the help of global match flag, we can also get all the matches. Regex engine leverages lastIndex property of a regex for global matching. Global match examples with String.match.
// only first match
const regexWithout = /foo/;
'foo foosball football'.match(regexWithout); // ["foo"]
// matches all "foo"s
const regexWith = /foo/g;
'foo foosball football'.match(regexWith); // ["foo", "foo", "foo"]
  • i ignore case;
const regex = /Regex/i;
// All return true
regex.test('regex');
regex.test('REGEX');
regex.test('rEgEx');
  • m multiline; when it is set, the beginning (^) and end ($) characters match at the start and end of each line instead of the whole input.
const text = `
Roses are red
violets are blue
Regex are awesome
so do you`;
/blue$/m.test(text); // true matches the second line end.
/blue$/.test(text); // false, without "m" the text must end with "blue".
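Coming back to the g flag and lastIndex mentioned above, here is a minimal sketch of how a global regex keeps track of its position between calls to RegExp.test. The variable names are made up for illustration.

```javascript
const globalRegex = /foo/g;
const sample = 'foo foosball football';

// Each successful test() moves lastIndex just past the match it found.
console.log(globalRegex.lastIndex); // 0
console.log(globalRegex.test(sample)); // true
console.log(globalRegex.lastIndex); // 3
console.log(globalRegex.test(sample)); // true
console.log(globalRegex.lastIndex); // 7
console.log(globalRegex.test(sample)); // true
console.log(globalRegex.lastIndex); // 16

// No match is left, so test() returns false and lastIndex resets to 0.
console.log(globalRegex.test(sample)); // false
console.log(globalRegex.lastIndex); // 0
```

This is also why reusing the same global regex across different strings can give surprising results: the search resumes from lastIndex rather than from the start.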

There are two more flags; sticky y and unicode u which got introduced in ES6. You can read about them here.

Grouping and Back References

Consider a scenario where we want to check whether a sentence starts and ends with the same word. To achieve it, regex needs some way to remember the matches. Grouping and back references are used for such scenarios. We use parentheses () for grouping. Regex remembers whatever is captured in the group, which can be referenced later in the pattern using \n where n is the ordered position of the group starting from 1. In replacement strings (for example with String.replace), groups are referred to using $1, $2…

// matches "a" and back reference it using \1
const regex = /(a)b\1d/;
regex.test('abad'); // true
regex.test('abd'); // false
// multiple groups
/(a)b(c)\1d\2/.test('abcadc'); // true
// Regex matches a sentence which starts and ends with same word
const sentenceRegex = /^([\w]+\b).*\1\.$/i;
sentenceRegex.test('Apple is red so red is apple.'); // true
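Captured groups are also available outside the pattern. The sketch below reads them from the result of String.match and reuses them in a String.replace replacement string via $1 and $2. The sample strings and variable names are made up for illustration.

```javascript
// The match result holds the full match at index 0 and the groups after it.
const groupMatch = 'abad'.match(/(a)b\1d/);
console.log(groupMatch[0]); // "abad"
console.log(groupMatch[1]); // "a"

// In a replacement string, $1 and $2 refer to the first and second group.
const swapped = 'Sheldon Cooper'.replace(/(\w+) (\w+)/, '$2, $1');
console.log(swapped); // "Cooper, Sheldon"
```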

Sometimes we just use grouping for legibility. For such scenarios, we can avoid the performance penalty of remembering the groups by using a non-capturing group (?:). A non-capturing group will not be available to \n.
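A minimal sketch of the difference, assuming a made-up sample string: both patterns match the same text, but only the capturing version stores the group in the match result.

```javascript
const groupSample = 'foofoobar';

// Capturing group: the last repetition of the group is remembered.
const capturingMatch = groupSample.match(/(foo)+bar/);
console.log(capturingMatch.length); // 2
console.log(capturingMatch[1]); // "foo"

// Non-capturing group: same overall match, nothing is remembered.
const nonCapturingMatch = groupSample.match(/(?:foo)+bar/);
console.log(nonCapturingMatch.length); // 1
console.log(nonCapturingMatch[0]); // "foofoobar"
```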

Assertions (?=) and (?!)

When we want to test whether a character/string is followed by a certain character/string or not we use assertions.

// Matches "Sheldon" followed by "Cooper"
const followedByCooper = /Sheldon(?=Cooper)/;
followedByCooper.test('SheldonCooper'); // true, only "Sheldon" is matched
followedByCooper.test('SheldonLeeCooper'); // false
// Matches "Sheldon" not followed by "Cooper"
const notFollowedByCooper = /Sheldon(?!Cooper)/;
notFollowedByCooper.test('SheldonCooper'); // false
notFollowedByCooper.test('SheldonLeeCooper'); // true, only "Sheldon" is matched
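To make the "only Sheldon is matched" part concrete, here is a small sketch with String.match and String.replace. The text inside the assertion is checked but never becomes part of the match, so it survives the replacement. Variable names and the 'Dr. ' replacement are made up for illustration.

```javascript
// The assertion checks for "Cooper" but does not consume it.
const lookaheadMatch = 'SheldonCooper'.match(/Sheldon(?=Cooper)/);
console.log(lookaheadMatch[0]); // "Sheldon"

// Only "Sheldon" is replaced; "Cooper" stays untouched.
const positiveResult = 'SheldonCooper'.replace(/Sheldon(?=Cooper)/, 'Dr. ');
console.log(positiveResult); // "Dr. Cooper"

// Same idea with the negative assertion.
const negativeResult = 'SheldonLeeCooper'.replace(/Sheldon(?!Cooper)/, 'Dr. ');
console.log(negativeResult); // "Dr. LeeCooper"
```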

Conclusion

XKCD Regex Comic
All Rise! The Regex Man is here.(Source)

Hitesh

Hi, I am Hitesh.
