Regular expressions (regex) are extremely useful for extracting information from text by searching for one or more matches of a specific pattern.
The basic anchors - ^ and $
expression | action |
---|---|
^The | matches any string that starts with The |
end$ | matches a string that ends with end |
^The end$ | exact string match (starts and ends with The end) |
pragmatic | matches any string that has the text pragmatic in it |
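
As a rough illustration of the anchors above, here is a minimal sketch using Python's `re` module. The sample strings and the `matching` helper are made up for this example and are not part of the original sheet.

```python
import re

# Hypothetical sample strings, chosen only to exercise each anchor.
samples = ["The end", "The beginning", "This is the end", "a pragmatic choice"]

def matching(pattern, texts):
    """Return the texts in which the pattern is found (re.search scans the whole string)."""
    return [text for text in texts if re.search(pattern, text)]

starts_with_the = matching(r"^The", samples)       # anchored at the start
ends_with_end = matching(r"end$", samples)         # anchored at the end
exactly_the_end = matching(r"^The end$", samples)  # anchored at both ends
has_pragmatic = matching(r"pragmatic", samples)    # no anchors, found anywhere

print(starts_with_the)
print(ends_with_end)
print(exactly_the_end)
print(has_pragmatic)
```

Without an anchor the pattern may be found anywhere in the string, which is why only the last search picks up a match in the middle of a sentence.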
The basic quantifiers - * + ? and {}
expression | action |
---|---|
alpha* | alph matches the characters alph literally (case sensitive); a* matches the character a between zero and unlimited times |
alpha+ | alph matches the characters alph literally (case sensitive); a+ matches the character a between one and unlimited times |
alpha? | alph matches the characters alph literally (case sensitive); a? matches the character a zero or one time |
alpha{2} | alph matches the characters alph literally (case sensitive); a{2} matches the character a exactly 2 times |
alpha{2,} | alph matches the characters alph literally (case sensitive); a{2,} matches the character a 2 or more times |
alpha{2,5} | alph matches the characters alph literally (case sensitive); a{2,5} matches the character a between 2 and 5 times |
alp(ha)* | alp matches the characters alp literally (case sensitive); the 1st capturing group (ha) is matched between zero and unlimited times |
alp(ha){2,5} | alp matches the characters alp literally (case sensitive); the 1st capturing group (ha) is matched between 2 and 5 times |
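
The quantifiers above can be checked with a short Python sketch. It uses `re.fullmatch` so that the whole string has to fit the pattern. The `accepts` helper and the test strings are assumptions made for this example.

```python
import re

def accepts(pattern, text):
    """True when the pattern matches the entire text, not just part of it."""
    return re.fullmatch(pattern, text) is not None

# (pattern, text) pairs; the comment says what each pair is probing.
cases = [
    (r"alpha*", "alph"),            # zero a's is allowed
    (r"alpha+", "alph"),            # needs at least one a
    (r"alpha?", "alphaa"),          # allows at most one a
    (r"alpha{2}", "alphaa"),        # exactly two a's
    (r"alpha{2,}", "alphaaaa"),     # two or more a's
    (r"alpha{2,5}", "alphaaaaaa"),  # six a's is one too many
    (r"alp(ha)*", "alphaha"),       # the group ha repeated twice
    (r"alp(ha){2,5}", "alpha"),     # the group ha appears only once
]

results = [accepts(pattern, text) for pattern, text in cases]

for (pattern, text), ok in zip(cases, results):
    print(f"{pattern!r} fully matches {text!r}: {ok}")
```

Note that a quantifier applies only to the item directly before it: a single character in `alpha*`, but the whole group in `alp(ha)*`.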
The basic OR operators - | and []
expression | action |
---|---|
alp(h|a) | alp matches the characters alp literally (case sensitive); the 1st capturing group (h|a) matches either h (1st alternative) or a (2nd alternative) |
a[bc] | a matches the character a literally (case sensitive); [bc] matches a single character in the list bc (case sensitive), so the expression matches ab or ac |
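
To make the difference between the two OR forms concrete, here is a small Python sketch. The input strings are invented for this example.

```python
import re

text = "alph alpa alpx"

# finditer + group(0) gives the whole match rather than only the captured group.
alternation_hits = [m.group(0) for m in re.finditer(r"alp(h|a)", text)]

# findall returns the contents of the capturing group when the pattern has one.
captured = re.findall(r"alp(h|a)", text)

# A bracket expression picks exactly one character from the list.
class_hits = re.findall(r"a[bc]", "ab ac ad abc")

print(alternation_hits)
print(captured)
print(class_hits)
```

For single characters the two forms are interchangeable. The alternation form is needed once the alternatives are longer than one character.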
The basic character classes - \d \w \s . \.
expression | action |
---|---|
\d | matches a digit (equal to [0-9]) |
\D | matches any non-digit character |
\w | matches any word character (equal to [a-zA-Z0-9_]) |
\W | matches any non-word character |
\s | matches any whitespace character (equal to [\r\n\t\f\v ]) |
. | matches any character (except for line terminators) |
\. | matches the character . literally |
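
A brief Python sketch of the character classes above, with made-up input strings. One assumption to be aware of: in Python 3, `\d` and `\w` also match non-ASCII digits and letters on text strings unless the `re.ASCII` flag is passed, so the "equal to" ranges in the table describe the ASCII behaviour.

```python
import re

digits = re.findall(r"\d", "Order 66, v2.0")            # single digits
non_digit_runs = re.findall(r"\D+", "a1b22c")           # runs of non-digits
words = re.findall(r"\w+", "snake_case, kebab-case")    # _ is a word character, - is not
fields = re.split(r"\s+", "one  two\tthree\nfour")      # split on any whitespace run

sample = "abc a.c a\nc"
any_char = re.findall(r"a.c", sample)      # . matches anything except a newline
literal_dot = re.findall(r"a\.c", sample)  # \. matches only a real dot

print(digits, non_digit_runs, words, fields, any_char, literal_dot)
```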
Bracket expressions - []
expression | action |
---|---|
[alpha] | matches a single character that is either a, l, p or h (the repeated a makes no difference) |
[a-zA-Z] | matches a single letter from a to z or from A to Z |
[a-zA-Z0-9] | matches a single letter from a to z or from A to Z, or a digit from 0 to 9 |
[^a-zA-Z] | matches a single character that is not a letter from a to z or from A to Z. In this case the ^ is used as negation of the expression |
[0-9]% | matches a digit from 0 to 9 followed by a % sign |
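
The bracket expressions above each match one character at a time, which the following Python sketch tries to show. The input strings are assumptions picked for this example.

```python
import re

listed = re.findall(r"[alpha]", "help")        # e is not in the list
letters = re.findall(r"[a-zA-Z]", "R2-D2")     # letters only
non_letters = re.findall(r"[^a-zA-Z]", "R2-D2")  # ^ inside [] negates the list
percentages = re.findall(r"[0-9]%", "up 5% or 12%, not %7")

print(listed)
print(letters)
print(non_letters)
print(percentages)
```

Because `[0-9]%` takes only one digit, `12%` yields `2%`. Writing `[0-9]+%` would be one way to capture the whole number.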
Word boundaries
expression | action |
---|---|
\balpha\b | \b asserts position at a word boundary: (^\w|\w$|\W\w|\w\W); alpha matches the characters alpha literally (case sensitive); together they match alpha only as a whole word |
\Balpha\B | \B asserts position where \b does not match; alpha matches the characters alpha literally (case sensitive); together they match alpha only when it has word characters on both sides |
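
As a quick check of the two boundary assertions, this Python sketch upper-cases whatever each pattern matches. The sentences are made up for the example.

```python
import re

# \b...\b: alpha must stand alone as a word.
whole_word = re.sub(r"\balpha\b", "ALPHA", "alpha alphabet (alpha) betaalpha")

# \B...\B: alpha must be embedded inside a longer word on both sides.
embedded = re.sub(r"\Balpha\B", "ALPHA", "alpha xalphax alphabet")

print(whole_word)
print(embedded)
```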
Tokens
expression | action |
---|---|
\n | newline |
\r | carriage return |
\t | tab |
\0 | null character |
References
expression | action |
---|---|
(...) | Parts of the regex enclosed in parentheses may be referred to later in the expression or extracted from the results of a successful match. |
(alpha) | 1st capturing group (alpha): alpha matches the characters alpha literally (case sensitive) |
([alpha]) | 1st capturing group ([alpha]): [alpha] matches a single character in the list alph (case sensitive) |
a(?=l) | a matches the character a literally (case sensitive); the positive lookahead (?=l) asserts that the next character is l, without including it in the match |
(?<=d)e | the positive lookbehind (?<=d) asserts that the previous character is d, without including it in the match; e matches the character e literally (case sensitive) |
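
The sketch below shows, in Python, how a captured group can be extracted or referred back to, and how the lookahead and lookbehind rows behave. The input strings and the `(\w+) \1` repeated-word pattern are assumptions added for this example.

```python
import re

# Extracting a group from a successful match.
group_text = re.search(r"(alpha)", "an alpha build").group(1)

# Referring back to group 1 later in the same expression (\1): finds a repeated word.
repeated = re.search(r"(\w+) \1", "it was was fine").group(0)

# Lookahead: only the a that is followed by l matches; the l is not consumed.
lookahead_positions = [m.start() for m in re.finditer(r"a(?=l)", "alpha")]

# Lookbehind: only the e that is preceded by d matches.
lookbehind_positions = [m.start() for m in re.finditer(r"(?<=d)e", "dedicated")]

print(group_text, repeated, lookahead_positions, lookbehind_positions)
```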
Examples
expression | action |
---|---|
/[a-z.\/:=_]{12,}/i | matches a run of 12 or more characters, as many as possible, giving back as needed (greedy), where each character is one of: a-z, a letter in the range between a (index 97) and z (index 122); . a literal dot; \/ a literal forward slash; :=_ a colon, equals sign or underscore. The i flag makes the match case insensitive |
/^[1-2][0-9\.]*$/ | matches a string that starts with 1 or 2, followed by zero or more digits (0 - 9) or dots through to the end of the string |
/^[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}$/ | looks for a string shaped like an IPv4 address: ^ asserts position at the start of the string; one to three digits followed by a full point (.) are matched three times, then a final group of one to three digits without a full point; $ asserts position at the end of the string. The digits are not range checked, so 999.999.999.999 also matches |
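
The three example expressions can be tried out with the Python sketch below. The constant names, the sample inputs and the `is_valid_ipv4` helper are hypothetical. In particular, the 0 to 255 range check is an extra step that is not part of the regex itself.

```python
import re

URL_LIKE_RUN = re.compile(r"[a-z.\/:=_]{12,}", re.IGNORECASE)
STARTS_WITH_1_OR_2 = re.compile(r"^[1-2][0-9\.]*$")
IPV4_SHAPE = re.compile(r"^[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}$")

def looks_like_ipv4(text):
    """Shape check only: four dot-separated groups of one to three digits."""
    return IPV4_SHAPE.match(text) is not None

def is_valid_ipv4(text):
    """Hypothetical extra step: also require every group to be at most 255."""
    return looks_like_ipv4(text) and all(int(part) <= 255 for part in text.split("."))

run = URL_LIKE_RUN.search("see https://example.com/a_b now").group(0)

print(run)
print(bool(STARTS_WITH_1_OR_2.match("1.2.3")), bool(STARTS_WITH_1_OR_2.match("3.1")))
print(looks_like_ipv4("192.168.0.1"), looks_like_ipv4("999.1.1.1"), is_valid_ipv4("999.1.1.1"))
```

Keeping the shape check in the regex and the range check in ordinary code tends to be easier to read than encoding 0 to 255 inside the pattern.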
More examples
1. Removing tags and extra spaces
Using regex, remove tags such as <p>, <ul>, <li>, <h1> and <h5>, and also remove any extra spaces.
The text that we will test this on is going to be:
<p>Load testing verifies the system performance under the expected peak load. The peak load needs to set by a series of parameters that you have benchmarked targets. For example, these parameters could include:</p> <h5>Load testing:</h5> <ul> <li>20,000 concurrent users; and</li> <li>response time of under 4 seconds</li> </ul> <h5>Stress testing:</h5> <ul> <li>Verifies the server performance under extreme load. Test this through examining how many users are required to bring your server</li> </ul> <h5>Endurance testing:</h5> <p>Load test over an extended period of time</p> <p> </p> <h4>Check with your hosting provider</h4>
I needed to remove the tags (<p>, <ul>, <li>, etc.). I could have removed them using the PHP function strip_tags(); however, I do as much through regex as possible.
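expression | action |
---|---|
/<[a-zA-Z\/][^>]*| |>/gi | < matches the character < literally (case insensitive); [a-zA-Z\/] matches a single letter or a forward slash; [^>]* matches any run of characters other than >; the 2nd alternative matches a single space as printed here, although it was most likely a non-breaking space entity such as &nbsp; since the outcome below still has its spaces between words; the 3rd alternative > matches the closing > literally |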
Through using the above regex, the outcome is as follows:
Load testing verifies the system performance under the expected peak load. The peak load needs to set by a series of parameters that you have benchmarked targets. For example, these parameters could include: Load testing: 20,000 concurrent users; and response time of under 4 seconds Stress testing: Verifies the server performance under extreme load. Test this through examining how many users are required to bring your server Endurance testing: Load test over an extended period of time Check with your hosting provider
See regex example to remove tags and space
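
For comparison, here is a minimal Python sketch of the same idea. It is a variation rather than the exact expression above. It assumes the entity to strip is `&nbsp;`, it folds the closing `>` into the tag pattern, and it adds a second pass that collapses leftover runs of whitespace. The names `TAG_OR_NBSP` and `strip_tags_and_spaces` are made up, and the input is a shortened piece of the test text.

```python
import re

# Assumption: tags look like <name ...> or </name>, and the stray entity is &nbsp;
TAG_OR_NBSP = re.compile(r"<[a-zA-Z/][^>]*>|&nbsp;", re.IGNORECASE)

def strip_tags_and_spaces(text):
    """Remove tags and &nbsp; entities, then squeeze whitespace runs to one space."""
    without_tags = TAG_OR_NBSP.sub("", text)
    return re.sub(r"\s+", " ", without_tags).strip()

html = (
    "<p>Load test over an extended period of time</p> <p>&nbsp;</p> "
    "<h4>Check with your hosting provider</h4>"
)

cleaned = strip_tags_and_spaces(html)
print(cleaned)
```

A regex like this is fine for tidy, known markup. For arbitrary HTML, a real parser is the safer choice.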
2. Adding target, alt and title to a href
How do you add attributes such as target, alt and title to a link (an <a href> tag)?
Let's begin by setting out the link markup that we will work with:
<a href="https://www.codebales.com/regex-expressions-a-working-sheet">Regex examples sheet</a>
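What regex is going to be used for this?
/(<a\b[^<>]*href=['"]?http[^<>]+)>/gi
expression | action |
---|---|
/(<a\b[^<>]*href=['"]?http[^<>]+)>/gi | applied using preg_replace: preg_replace('/(<a\b[^<>]*href=[\'"]?http[^<>]+)>/i', '$1 target="_blank" alt="' . $alt . '" title="' . $alt . '">', $url). PHP's preg functions do not accept the g flag, because preg_replace already replaces every match. The single quote inside the pattern is escaped because the pattern is a single-quoted PHP string. The replacement starts with $1 because the captured group already contains <a and the href |
(<a\b[^<>]*href=['"]?http[^<>]+) | 1st capturing group: <a\b matches <a followed by a word boundary; [^<>]* matches any characters other than < and >; href= matches literally; ['"]? matches an optional quote; http matches literally; [^<>]+ matches the rest of the tag up to, but not including, the closing > |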
With the following variable definitions:
- $alt = "Regex examples sheet"
- $url = '<a href="https://www.codebales.com/regex-expressions-a-working-sheet">Regex examples sheet</a>'
the outcome of the above regex is as follows:
<a href="https://www.codebales.com/regex-expressions-a-working-sheet" target="_blank" alt="Regex examples sheet" title="Regex examples sheet">Regex examples sheet</a>
see regex example add elements to url
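
The same replacement can be sketched in Python, here only as an illustration of how the capturing group is reused. The function name `add_link_attributes` is hypothetical, and escaping the text with `html.escape` is an extra precaution that is not part of the original approach.

```python
import re
from html import escape

# Same expression as above; re.IGNORECASE plays the role of the i flag.
LINK_OPEN_TAG = re.compile(r"""(<a\b[^<>]*href=['"]?http[^<>]+)>""", re.IGNORECASE)

def add_link_attributes(markup, alt):
    """Append target, alt and title to every opening <a href="http..."> tag."""
    safe_alt = escape(alt, quote=True)  # assumption: attribute text should be escaped

    def extend_tag(match):
        # group(1) is the opening tag without its closing >.
        return f'{match.group(1)} target="_blank" alt="{safe_alt}" title="{safe_alt}">'

    return LINK_OPEN_TAG.sub(extend_tag, markup)

url = '<a href="https://www.codebales.com/regex-expressions-a-working-sheet">Regex examples sheet</a>'
result = add_link_attributes(url, "Regex examples sheet")
print(result)
```

Using a function as the replacement avoids any surprises if the attribute text happens to contain backslashes or group references.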
3. Obfuscating an email
I wanted to partially hide a user's email address. By way of example, changing the email
sarah@example.com
to
s****@e******.c**
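What regex can be used to achieve this?
/(?<![^\w])(?<=...)[\w]/gi
expression | action |
---|---|
/(?<![^\w])(?<=...)[\w]/gi | Negative lookbehind (?<![^\w]): asserts that the character before the current position is not matched by [^\w], that is, it is not a non-word character; positive lookbehind (?<=...): asserts that the pattern inside the lookbehind matches immediately before the current position; [\w] matches a single word character (equal to [a-zA-Z0-9_]), which is the character that gets replaced with * |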
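
As a hedged sketch in Python, the pattern below is a simplified stand-in, not the expression from the table. It replaces every word character that directly follows another word character, which keeps the first character of each part of the address. The function name `obfuscate_email` is made up. The last line applies the table's expression with the three dots taken literally as "any three characters", which is an assumption about what was intended.

```python
import re

def obfuscate_email(email):
    """Keep the first character of each run of word characters, star out the rest."""
    # (?<=\w) is a lookbehind: the previous character must be a word character.
    return re.sub(r"(?<=\w)\w", "*", email)

masked = obfuscate_email("sarah@example.com")
print(masked)

# The table's expression, reading ... literally as three arbitrary characters.
literal_reading = re.sub(r"(?<![^\w])(?<=...)[\w]", "*", "sarah@example.com")
print(literal_reading)
```

Read literally, the lookbehind on three characters leaves the first three characters of the address visible, so the simplified pattern is closer to the target output shown above.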
Resources
Regex 101 (https://regex101.com/) – A fantastic playground for testing and experimenting with your expressions