
Regular expressions (regex) are extremely useful in extracting information from any text by searching for one or more matches of a specific search pattern.

The basic anchors - ^ and $

expression action
^The matches any string that starts with The
end$ matches a string that ends with end
^The end$ exact string match (starts and ends with The end)
pragmatic matches any string that has the text pragmatic in it
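
As a rough sketch, the anchor rows above can be tried out in Python. The re module is an assumption here (the snippets later in this article are PHP), and the sample strings are made up for illustration.

```python
import re

# Made-up sample strings, not taken from the article.
samples = ["The end", "The beginning", "A pragmatic end", "the end"]

# ^The : the string must start with "The" (case sensitive)
starts_with_the = [s for s in samples if re.search(r"^The", s)]

# end$ : the string must end with "end"
ends_with_end = [s for s in samples if re.search(r"end$", s)]

# ^The end$ : the whole string must be exactly "The end"
exact = [s for s in samples if re.search(r"^The end$", s)]

# pragmatic : no anchors, so the text may appear anywhere
contains_pragmatic = [s for s in samples if re.search(r"pragmatic", s)]

print(starts_with_the, ends_with_end, exact, contains_pragmatic)
```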

 

The basic quantifiers - * + ? and {}

expression action
alpha* alph matches the characters alph literally (case sensitive)
a* matches the character a literally (case sensitive)
* Quantifier — Matches between zero and unlimited times
alpha+ alph matches the characters alph literally (case sensitive)
a+ matches the character a literally (case sensitive)
+ Quantifier — Matches between one and unlimited times
alpha? alph matches the characters alph literally (case sensitive)
a? matches the character a literally (case sensitive)
? Quantifier — Matches between zero and one times
alpha{2} alph matches the characters alph literally (case sensitive)
a{2} matches the character a literally (case sensitive)
{2} Quantifier — Matches exactly 2 times
alpha{2,} alph matches the characters alph literally (case sensitive)
a{2,} matches the character a literally (case sensitive)
{2,} Quantifier — Matches between 2 and unlimited
alpha{2,5} alph matches the characters alph literally (case sensitive)
a{2,5} matches the character a literally (case sensitive)
{2,5} Quantifier — Matches between 2 and 5 times
alp(ha)* alp matches the characters alp literally (case sensitive)
1st Capturing Group (ha)*
* Quantifier - Matches between zero and unlimited times
alp(ha){2,5} alp matches the characters alp literally (case sensitive)
1st Capturing Group (ha){2,5}
{2,5} Quantifier — Matches between 2 and 5 times
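
A minimal Python sketch of the quantifier rows above. It assumes Python's re module rather than PHP, and full_matches is a hypothetical helper that keeps only the candidates a pattern matches from start to finish.

```python
import re

def full_matches(pattern, candidates):
    """Return the candidates that the pattern matches in their entirety."""
    return [c for c in candidates if re.fullmatch(pattern, c)]

# "alph" followed by zero, one, two and six "a" characters.
single = ["alph", "alpha", "alphaa", "alphaaaaaa"]

star = full_matches(r"alpha*", single)         # zero or more "a"
plus = full_matches(r"alpha+", single)         # one or more "a"
optional = full_matches(r"alpha?", single)     # zero or one "a"
exactly_two = full_matches(r"alpha{2}", single)
two_or_more = full_matches(r"alpha{2,}", single)
two_to_five = full_matches(r"alpha{2,5}", single)

# With a group, the quantifier repeats the whole group "ha".
grouped = ["alp", "alpha", "alphaha"]
group_star = full_matches(r"alp(ha)*", grouped)
group_range = full_matches(r"alp(ha){2,5}", grouped)
```

Note that a quantifier applies only to the token directly before it, which is why alpha* repeats the final a and not the whole word.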

 

The basic OR operators - | and []

expression action
alp(h|a) alp matches the characters alp literally (case sensitive)
1st Capturing Group (h|a)
1st Alternative h – h matches the character h literally (case sensitive)
2nd Alternative a – a matches the character a literally (case sensitive)
a[bc] a matches the character a literally (case sensitive)
Match a single character present in the list below [bc]
bc matches a single character in the list bc (case sensitive)
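
The two OR forms can be compared with a short Python sketch (Python's re module is assumed; the test words are invented for illustration).

```python
import re

# Alternation inside a group: "alp" followed by either "h" or "a".
alternation = re.compile(r"alp(h|a)")
alt_hits = [w for w in ["alph", "alpa", "alpx"] if alternation.fullmatch(w)]

# The group also captures which alternative was taken.
captured = alternation.fullmatch("alpa").group(1)

# Character class: "a" followed by exactly one of "b" or "c".
char_class = re.compile(r"a[bc]")
class_hits = [w for w in ["ab", "ac", "ad"] if char_class.fullmatch(w)]
```

For single characters the two forms are interchangeable; alternation is the one to reach for when the options are longer than one character.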

 

The basic character classes - \d \w \s . \.

expression action
\d \d matches a digit (equal to [0-9])
\D matches any non digit
\w \w matches any word character (equal to [a-zA-Z0-9_])
\W matches any non word character
\s \s matches any whitespace character (equal to [\r\n\t\f\v ])
. . matches any character (except for line terminators)
\. \. matches the character . literally (case sensitive)
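
A small Python sketch of these character classes, assuming the re module and made-up input strings. One difference worth knowing: for Python text strings \d and \w also match non-ASCII digits and letters unless the re.ASCII flag is passed, so the "equal to" sets above hold exactly only with that flag.

```python
import re

text = "Order 66 shipped on 2024-03-18"

digits = re.findall(r"\d+", text)                      # runs of digits
non_digits = re.findall(r"\D+", "a1b22c")              # runs of non-digits
words = re.findall(r"\w+", "snake_case and kebab-case")  # "_" is a word character, "-" is not
fields = re.split(r"\s+", "a  b\tc\nd")                # split on any whitespace

# An unescaped . matches any character; \. matches only a literal full stop.
any_char = re.findall(r"a.c", "abc a.c a-c ac")
literal_dot = re.findall(r"a\.c", "abc a.c a-c ac")
```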

 

Bracket expressions - []

expression action
[alpha] matches a single character that is an a, l, p or h (the repeated a adds nothing)
[a-zA-Z] matches a single letter from a to z or from A to Z
[a-zA-Z0-9] matches a single letter from a to z or from A to Z, or a digit from 0 to 9
[^a-zA-Z] matches a single character that is not a letter from a to z or from A to Z. Inside the brackets, a leading ^ negates the set
[0-9]% matches a digit from 0 to 9 followed immediately by a % sign
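
A bracket expression always stands for one character of the input. A minimal Python sketch (re module assumed, sample strings invented) makes that visible by listing every single-character match:

```python
import re

# [alpha] is the set {a, l, p, h}; each match is one character long.
from_set = re.findall(r"[alpha]", "help")

# Letters and digits only, so "_" and "!" are skipped.
alnum = re.findall(r"[a-zA-Z0-9]", "x_9!")

# A leading ^ inside the brackets negates the set.
not_letters = re.findall(r"[^a-zA-Z]", "a1-b2")

# Only the digit directly before the % sign is part of the match.
percent = re.findall(r"[0-9]%", "10% of 5 is 0.5, not 50%")
```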

 

Word boundaries

expression action
\balpha\b \b assert position at a word boundary: (^\w|\w$|\W\w|\w\W)
alpha matches the characters alpha literally (case sensitive)
\b assert position at a word boundary: (^\w|\w$|\W\w|\w\W)
\Balpha\B \B assert position where \b does not match
alpha matches the characters alpha literally (case sensitive)
\B assert position where \b does not match
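
To see where \b and \B succeed, here is a hedged Python sketch (re module assumed; the sample sentence is made up) that records the start offset of each match.

```python
import re

# "alpha" appears alone, as a prefix, as a suffix and in the middle of a word.
text = "alpha alphabet betalpha betalphabet"

# \balpha\b : "alpha" as a whole word only.
whole_word = [m.start() for m in re.finditer(r"\balpha\b", text)]

# \Balpha\B : "alpha" with word characters on both sides.
inside_word = [m.start() for m in re.finditer(r"\Balpha\B", text)]

print(whole_word, inside_word)
```

Only the standalone word satisfies \balpha\b, and only the occurrence inside betalphabet satisfies \Balpha\B; the prefix and suffix cases fail one side of each test.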

 

Tokens

expression action
\n newline
\r carriage return
\t tab
\0 null character

 

References

expression action
(...) Parts of the regex enclosed in parentheses may be referred to later in the expression or extracted from the results of a successful match.
(alpha) 1st Capturing Group (alpha)
alpha matches the characters alpha literally (case sensitive)
([alpha]) 1st Capturing Group ([alpha])
Match a single character present in the list below [alpha]
alpha matches a single character in the list alph (case sensitive)
a(?=l) a matches the character a literally (case sensitive)
Positive Lookahead (?=l)
Assert that the Regex below matches
l matches the character l literally (case sensitive)
(?<=d)e Positive Lookbehind (?<=d)
Assert that the Regex below matches
d matches the character d literally (case sensitive)
e matches the character e literally (case sensitive)
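
The following Python sketch (re module assumed, sample strings invented) illustrates the three ideas in this table: extracting a captured group, referring back to it with \1, and using lookarounds that test the neighbouring text without consuming it.

```python
import re

# Extract a captured group from a successful match.
group_text = re.search(r"(alpha)", "an alpha build").group(1)

# Refer to group 1 later in the same expression: "ha" followed by "ha" again.
repeated = re.search(r"(ha)\1", "alphaha").group(0)

# A capturing group around a bracket expression captures one character per match.
single_chars = re.findall(r"([alpha])", "hello")

# Positive lookahead: an "a" that is followed by "l" (the "l" is not part of the match).
lookahead_starts = [m.start() for m in re.finditer(r"a(?=l)", "alpha salad")]

# Positive lookbehind: an "e" that is preceded by "d".
lookbehind_starts = [m.start() for m in re.finditer(r"(?<=d)e", "code deed")]
```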

 

Examples

expression action
/[a-z.\/:=_]{12,}/i {12,} Quantifier — Matches between 12 and unlimited times, as many times as possible, giving back as needed (greedy)
a-z a single character in the range between a (index 97) and z (index 122) (case insensitive)
. matches the character . literally (case insensitive)
\/ matches the character / literally (case insensitive)
:=_ matches a single character in the list :=_ (case insensitive)
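/^[1-2][0-9\.]*$/ ^ asserts position at the start of the string
[1-2] matches a single character that is either 1 or 2
[0-9\.]* matches zero or more characters that are digits (0 - 9) or a full stop (.)
$ asserts position at the end of the string
/^[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}$/ looks for text shaped like an IPv4 address
^ asserts position at the start of the string
[0-9]{1,3}\. matches one to three digits followed by a full stop (.); this part appears three times
[0-9]{1,3} matches the fourth group of one to three digits, with no trailing full stop
$ asserts position at the end of the string
Note that this checks the shape only. It does not limit each group to the range 0 to 255, so 999.999.999.999 also matches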
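
A hedged Python sketch of the last two examples (re module assumed). looks_like_ipv4 uses the pattern from the table unchanged; is_valid_ipv4 is a hypothetical addition that adds the 0 to 255 range check the pattern alone cannot express.

```python
import re

STARTS_WITH_1_OR_2 = re.compile(r"^[1-2][0-9\.]*$")
IPV4_SHAPE = re.compile(r"^[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}$")

def looks_like_ipv4(text):
    """True when text is four dot-separated groups of one to three digits."""
    return IPV4_SHAPE.fullmatch(text) is not None

def is_valid_ipv4(text):
    """Hypothetical stricter check: correct shape and every group at most 255."""
    return looks_like_ipv4(text) and all(int(part) <= 255 for part in text.split("."))

print(looks_like_ipv4("999.999.999.999"), is_valid_ipv4("999.999.999.999"))
```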

 

More examples

1. Removing tags and &nbsp;

Using regex, remove tags such as <p>, <ul>, <li>, <h1> and <h5>, and also remove any non-breaking space entities (&nbsp;).

The text that we will test this on is going to be:

<p>Load testing verifies the system performance under the expected peak load. &nbsp;The peak load needs to set by a series of parameters that you have benchmarked targets. &nbsp;For example, these parameters could include:</p> <h5>Load testing:</h5> <ul> <li>20,000 concurrent users; and</li> <li>response time of under 4 seconds</li> </ul> <h5>Stress testing:</h5> <ul> <li>Verifies the server performance under extreme load. &nbsp;Test this through examining how many users are required to bring your server</li> </ul> <h5>Endurance testing:</h5> <p>Load test over an extended period of time</p> <p>&nbsp;</p> <h4>Check with your hosting provider</h4>

I needed to remove the tags (<p>, <ul>, <li>, etc.) and the &nbsp; entities. I could remove the tags using the PHP function strip_tags(). However, I prefer to do as much as possible through regex.

expression action
/<[a-zA-Z\/][^>]*|&nbsp;|>/gi

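< matches the character < literally (case insensitive)
[a-zA-Z\/] matches a single character present in the list below
a-z a single character in the range between a (index 97) and z (index 122) (case insensitive)
A-Z a single character in the range between A (index 65) and Z (index 90) (case insensitive)
\/ matches the character / literally (case insensitive)
[^>]* matches any character that is not >, as many times as possible (the rest of the tag)
|&nbsp; 2nd Alternative, matches the characters &nbsp; literally (case insensitive)
|> 3rd Alternative, matches the character > literally, which removes the closing bracket left behind by the 1st Alternative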

Through using the above regex, the outcome is as follows:

Load testing verifies the system performance under the expected peak load. The peak load needs to set by a series of parameters that you have benchmarked targets. For example, these parameters could include: Load testing: 20,000 concurrent users; and response time of under 4 seconds Stress testing: Verifies the server performance under extreme load. Test this through examining how many users are required to bring your server Endurance testing: Load test over an extended period of time Check with your hosting provider
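
The same replacement can be sketched in Python. This is an illustration under assumptions: it uses Python's re.sub rather than PHP, strip_markup is a hypothetical helper name, and the input is a shortened version of the test text.

```python
import re

# Same three alternatives as the expression above: an opening "<tag..." part,
# the &nbsp; entity, and any leftover ">".
MARKUP = re.compile(r"<[a-zA-Z/][^>]*|&nbsp;|>", re.IGNORECASE)

def strip_markup(html_text):
    """Remove tags and &nbsp; entities by replacing every match with nothing."""
    return MARKUP.sub("", html_text)

sample = "<p>Load testing. &nbsp;The peak load</p> <h5>Stress:</h5>"
print(strip_markup(sample))
```

One caveat of this pattern: because > is an alternative on its own, a greater-than sign in ordinary text is removed too, so "x > y" becomes "x  y".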

See regex example to remove tags and space

 

2. Adding target, alt and title to a href

How do you add attributes such as target, alt and title to a link (an <a> element)?

Let's begin by setting out the link markup that we will work with:

<a href="https://www.codebales.com/regex-expressions-a-working-sheet">Regex examples sheet</a>

Which regular expression will be used for this?

/(<a\b[^<>]*href=['"]?http[^<>]+)>/gi

expression action

/(<a\b[^<>]*href=['"]?http[^<>]+)>/gi

 

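Applied using preg_replace(). PHP has no g flag, because preg_replace() already replaces every match, and the replacement starts with $1 because the captured group already contains the opening <a:

preg_replace('/(<a\b[^<>]*href=[\'"]?http[^<>]+)>/i', '$1 target="_blank" alt="' . $alt . '" title="' . $alt . '">', $url)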

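(<a\b[^<>]*href=['"]?http[^<>]+) 1st Capturing Group
<a matches the characters <a literally (case insensitive)
\b asserts position at a word boundary, so <a matches but <abbr does not
[^<>]* matches any character other than < or >, as many times as possible (any attributes before href)
href= matches the characters href= literally (case insensitive)
['"]? matches an optional single or double quote (between zero and one times)
http matches the characters http literally (case insensitive)
[^<>]+ matches one or more characters other than < or > (the rest of the tag)
> matches the character > literally, outside the capturing group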

Using the above regex, the outcome is as follows...

Based on the following variable definitions:

  • $alt = "Regex examples sheet"
  • $url = <a href="https://www.codebales.com/regex-expressions-a-working-sheet">Regex examples sheet</a>

<a href="https://www.codebales.com/regex-expressions-a-working-sheet" target="_blank" alt="Regex examples sheet" title="Regex examples sheet">Regex examples sheet</a>
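
For comparison, a minimal Python sketch of the same replacement. It assumes Python's re and html modules instead of PHP, and add_link_attributes is a hypothetical helper. Escaping the text before placing it inside an attribute is an extra precaution that is not part of the approach described above.

```python
import html
import re

# Same pattern as above; the capturing group holds everything before the closing ">".
LINK_OPEN = re.compile(r"""(<a\b[^<>]*href=['"]?http[^<>]+)>""", re.IGNORECASE)

def add_link_attributes(markup, alt):
    """Append target, alt and title to every opening <a> tag with an http(s) href."""
    safe = html.escape(alt, quote=True)
    extra = f' target="_blank" alt="{safe}" title="{safe}"'
    # A function is used as the replacement so that backslashes in the
    # attribute text are never read as group references.
    return LINK_OPEN.sub(lambda m: m.group(1) + extra + ">", markup)

url = '<a href="https://www.codebales.com/regex-expressions-a-working-sheet">Regex examples sheet</a>'
result = add_link_attributes(url, "Regex examples sheet")
print(result)
```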

see regex example add elements to url

 

3. Obfuscating an email

I wanted to partially hide a user's email address. By way of example, changing the email

sarah@example.com

to 

sar**@e******.c**

The regex that can be used to achieve this is:

/(?<![^\w])(?<=...)[\w]/gi

expression action

/(?<![^\w])(?<=...)[\w]/gi

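Negative Lookbehind

(?<![^\w])

Assert that the Regex below does not match
[^\w] matches a single character that is not a word character (\w is equal to [a-zA-Z0-9_])
This requires the previous character to be a word character (or the start of the string), so the first character after @ or . is never matched

Positive Lookbehind

(?<=...)

Assert that the Regex below matches
. matches any character (except for line terminators). Three of them require at least three characters before the current position, so the first three characters of the address are never matched

[\w] matches a single word character. Each match is replaced with *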
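
The masking can be sketched in Python as follows. Assumptions: Python's re module is used instead of PHP, mask is a hypothetical helper, and KEEP_FIRST is a hypothetical variant (not from this article) for the case where only the first character of each part should stay visible.

```python
import re

# The expression from this section: skips the first three characters of the
# string and the first character after any non-word character.
KEEP_THREE = re.compile(r"(?<![^\w])(?<=...)[\w]")

# Hypothetical variant: match every word character that follows another word
# character, leaving only the first character of each part unmasked.
KEEP_FIRST = re.compile(r"(?<=\w)\w")

def mask(email, pattern=KEEP_THREE):
    """Replace every character matched by the pattern with an asterisk."""
    return pattern.sub("*", email)

print(mask("sarah@example.com"))              # sar**@e******.c**
print(mask("sarah@example.com", KEEP_FIRST))  # s****@e******.c**
```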


Resources

Regex 101 (https://regex101.com/) – A fantastic playground for testing and experimenting with your expressions
