Regular Expressions#
The main task of regular expressions (regex) is to search, match, and manipulate patterns within text. They allow you to efficiently find specific sequences, validate formats like emails or phone numbers, and perform complex text processing tasks with minimal code. Regex is a powerful tool for working with text data in a precise and flexible way.
Regex basics#
Main components of regex#
Three main components of regex:
- Pattern
The pattern is a sequence of characters that defines the search criteria. It could be a specific sequence of characters, a set of characters, or a more complex structure using special symbols to represent various possibilities.
- Text
The text or sequence of characters that you want to search within. The pattern is applied to this text to find matches or perform substitutions.
- Match
The match is the outcome of applying the pattern to the string. It indicates where and how the pattern fits within the string. A match can refer to the exact location, the part of the string that corresponds to the pattern, or simply a boolean value indicating whether a match exists or not.
Example#
We can use regex in much the same way as basic string methods to find a substring in a text:
Basic String Methods#
text = "Don’t explain your philosophy. Embody it."
pattern = "philosophy"
# string methods:
print("find:", text.find(pattern), "\nin:", pattern in text)
find: 19
in: True
Regex approach#
Python has a special module called re that can be used to work with regular expressions.
import re
res = re.search(pattern, text)
if res is not None:
    print("res.start():", res.start())
    print("res.end():", res.end())
    print("res.span():", res.span())
print("bool(res):", bool(res))
res.start(): 19
res.end(): 29
res.span(): (19, 29)
bool(res): True
This was just a basic example of what regex can do, but it’s far from showcasing their true power. Now, let’s dive into the key advantages and capabilities of regular expressions.
re Functions#
The main functions provided by the re module include:
- re.match: Checks for a match only at the beginning of the string.
- re.search: Searches the entire string for a match.
- re.findall: Returns a list of all matches found in the string.
- re.finditer: Returns an iterator yielding match objects for all matches found.
- re.sub: Replaces occurrences of the pattern with a replacement string.
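The five functions listed above can be compared side by side. This is a minimal sketch; the sample string and the variable names are made up for illustration.

```python
import re

text = "cat bat cat"   # illustrative sample string
pattern = r"cat"

# re.match only tries the pattern at position 0
at_start = re.match(pattern, text)
not_at_start = re.match(r"bat", text)   # None: "bat" is not at the start

# re.search scans forward and returns the first match anywhere
first_bat = re.search(r"bat", text)

# re.findall returns the matched substrings as a list of strings
all_cats = re.findall(pattern, text)

# re.finditer yields match objects, so positions are available too
cat_spans = [m.span() for m in re.finditer(pattern, text)]

# re.sub returns a new string with every match replaced
replaced = re.sub(pattern, "dog", text)

print(at_start.group(), not_at_start, first_bat.span())
print(all_cats, cat_spans, replaced)
```

Note that re.match and re.search return None when nothing matches, so the result should be checked before calling methods such as group() on it.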
Metacharacters#
A metacharacter in regular expressions is a special character that has a unique meaning and function, rather than being interpreted as a literal character. These characters are used to create patterns that can match different types of text.
Dot . metacharacter#
Explanation. The dot . is one of the most commonly used metacharacters in regular expressions. It’s simple yet powerful, and it plays a key role in pattern matching.
The Basic Function:
The dot . matches any single character except a newline (\n). This means it can represent any letter, digit, symbol, or even a space.
Examples:
Suppose you have the regex pattern 'a.c'. This pattern matches:
‘abc’ ‘a1c’ ‘a-c’
and in general any string where 'a' is followed by any single character and then 'c'.
Non-matching examples:
‘ac’ ‘abdc’
Implementation
words = ['abc', 'a1c', 'a-c', 'ac', 'abdc']
pattern = re.compile(r'a.c')
for w in words:
    print(f"Pattern found in {w:>4}:", bool(re.search(pattern, w)))
Pattern found in abc: True
Pattern found in a1c: True
Pattern found in a-c: True
Pattern found in ac: False
Pattern found in abdc: False
Raw Strings
In Python, an r string (or raw string) is a string prefixed with the letter r or R. The primary purpose of a raw string is to treat backslashes \ as literal characters and not as escape characters. This is particularly useful when working with regular expressions or file paths, where backslashes are common.
Example:
rline = r"Line 1\nLine 2\tTabbed"
line = "Line 1\nLine 2\tTabbed"
print('Raw line:', rline)
print("Regular line:", line)
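The difference is easiest to see by comparing lengths. The following is a small sketch; the variable names and the sample text are illustrative only.

```python
import re

raw = r"\n"      # two characters: a backslash and the letter n
cooked = "\n"    # one character: a newline
lengths = (len(raw), len(cooked))
print(lengths)

# In a pattern, the raw form hands the backslash to the regex engine unchanged,
# so \d reaches re as the "digit" class
digits = re.findall(r"\d+", "room 42, floor 7")
print(digits)
```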
Dot vs Literal Dot#
If you want to match an actual dot (a period or full stop), you need to escape it with a backslash: \.
Examples:
Pattern 'a.c' matches all of: abc, a.c, a-c, a c
Pattern 'a\.c' matches only: a.c
p1 = r'a.c'
p2 = r'a\.c'
words = ['abc', 'a.c', 'a-c', 'a c']
for w in words:
    print(f"p1 in {w}: {bool(re.search(p1, w))}", end=" | ")
    print(f"p2 in {w}: {bool(re.search(p2, w))}")
p1 in abc: True | p2 in abc: False
p1 in a.c: True | p2 in a.c: True
p1 in a-c: True | p2 in a-c: False
p1 in a c: True | p2 in a c: False
Metacharacters: ^ and $#
- ^: Matches the start of a string. Pattern '^cat' matches “cat dog” but not “dog cat”.
- $: Matches the end of a string. Pattern 'dog$' matches “cat dog” but not “dog cat”.
Implementation:
p1 = r'^cat'
p2 = r'dog$'
lines = ['cat dog', 'dog cat']
for l in lines:
    print(f'p1 in {l}:', bool(re.search(p1, l)))
    print(f'p2 in {l}:', bool(re.search(p2, l)))
p1 in cat dog: True
p2 in cat dog: True
p1 in dog cat: False
p2 in dog cat: False
Character Sets#
Character sets, also known as character classes, are a feature in regular expressions that allow you to match any one of several characters at a specific position in the text. They are defined using square brackets [].
How Character Sets Work
When you place characters inside square brackets, the regex engine will match any single character from that set.
For example, the character set [abc] will match any one of the characters "a", "b", or "c".
Basic Character Set#
Pattern: h[aeiou]t
Matches hat, hot, hit; does not match htt, hbt
Implementation
p = r'h[aeiou]t'
words = ['hat', 'hot', 'hit', 'htt', 'hbt']
for w in words:
    print(f'p in {w}:', bool(re.search(p, w)))
p in hat: True
p in hot: True
p in hit: True
p in htt: False
p in hbt: False
Ranges#
Ranges in regular expression character sets are used to specify a sequence of characters that should be matched.
A range is defined using a hyphen - between two characters inside square brackets [], and it tells the regex engine to match any character between those two characters.
How ranges work
A range is represented as [start-end], where start and end are characters. The regex will match any character that falls within that range, endpoints included.
Common examples
Tested against the characters a b y A Z 3 5:
- Pattern [a-z]: matches a, b, y
- Pattern [A-Z]: matches A, Z
- Pattern [0-9]: matches 3, 5
- Combining multiple ranges, [a-zA-Z0-9]: matches all seven characters
Implementation
p1, p2, p3, p4 = r'[a-z]', r'[A-Z]', r'[0-9]', r'[a-zA-Z0-9]'
words = ['a', 'b', 'y', 'A', 'Z', '3', '5']
for p in (p1, p2, p3, p4):
    print(f"Current pattern is: {p}")
    results = [f"{w}: {bool(re.search(p, w))}" for w in words]
    print(f"\t{' | '.join(results)}")
Current pattern is: [a-z]
a: True | b: True | y: True | A: False | Z: False | 3: False | 5: False
Current pattern is: [A-Z]
a: False | b: False | y: False | A: True | Z: True | 3: False | 5: False
Current pattern is: [0-9]
a: False | b: False | y: False | A: False | Z: False | 3: True | 5: True
Current pattern is: [a-zA-Z0-9]
a: True | b: True | y: True | A: True | Z: True | 3: True | 5: True
Using Hyphen as a Literal
To match a literal hyphen, place it at the start or end of the character set, or escape it with a backslash.
Example: [a-z-] matches any lowercase letter or the hyphen -.
A hyphen between two characters (as in [a-z]) is read as a range, so a literal hyphen in the middle of a set is best written escaped, e.g., [a-z\-0-9].
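A short sketch of the literal hyphen in practice; the sample text and variable names are illustrative assumptions.

```python
import re

text = "well-known, x-ray 42"

# Hyphen at the end of the set: a literal '-'
with_hyphen = re.findall(r"[a-z-]+", text)

# Escaped hyphen: also a literal '-', regardless of its position in the set
escaped = re.findall(r"[a-z\-]+", text)

# Without the hyphen in the set, the words are split at '-'
without = re.findall(r"[a-z]+", text)

print(with_hyphen)
print(escaped)
print(without)
```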
Negative Character Sets#
A negative character set in regular expressions allows you to match any character except the ones specified inside the set. It’s a powerful feature when you want to exclude certain characters from matching.
How to Negate a Character Set
To negate a character set, you place a caret ^ immediately after the opening square bracket [. This means “match any character that is not in the set.”
Examples
- Pattern [^aeiou]: Matches any character that is not a lowercase vowel (i.e., anything other than “a”, “e”, “i”, “o”, or “u”).
- Pattern [^0-9]: Matches any character that is not a digit (anything other than “0” to “9”).
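Implementation (a minimal sketch; the sample string and variable names are chosen only for illustration):

```python
import re

word = "regex 101"

# Everything that is not a lowercase vowel: consonants, the space, and digits
non_vowels = re.findall(r"[^aeiou]", word)

# Everything that is not a digit
non_digits = re.findall(r"[^0-9]", word)

# A common use: strip unwanted characters with re.sub
digits_only = re.sub(r"[^0-9]", "", word)

print(non_vowels)
print(non_digits)
print(digits_only)
```

Note that a negated set still consumes exactly one character, so it matches spaces and punctuation as well unless they are listed in the set.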
Quantifiers#
Quantifiers in regular expressions specify how many times a particular character, group, or character class should appear in the input for a match to occur. They are essential tools that allow you to fine-tune your pattern matching, making your regex as specific or as general as needed.
Common Quantifiers#
Here is the table with the most commonly used quantifiers in regular expressions:
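Symbol | Meaning |
---|---|
* | Matches 0 or more occurrences of the preceding element. |
+ | Matches 1 or more occurrences of the preceding element. |
? | Matches 0 or 1 occurrence of the preceding element. |
{n} | Matches exactly n occurrences of the preceding element. |
{n,} | Matches n or more occurrences of the preceding element. |
{,m} | Matches up to m occurrences of the preceding element (including zero). |
{n,m} | Matches between n and m occurrences of the preceding element. |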
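The quantifiers in the table can be checked in one pass. The sketch below uses re.fullmatch, which requires the whole string to match and so makes the repetition counts easy to see; the test cases themselves are illustrative assumptions.

```python
import re

# (pattern, string, expected result) triples, one per quantifier
cases = [
    (r"a*", "", True),          # * allows zero occurrences
    (r"a+", "", False),         # + needs at least one
    (r"a?", "aa", False),       # ? allows at most one
    (r"a{3}", "aaa", True),     # exactly three
    (r"a{2,}", "a", False),     # two or more
    (r"a{,2}", "aaa", False),   # up to two
    (r"a{1,3}", "aa", True),    # between one and three
]

results = [bool(re.fullmatch(p, s)) for p, s, _ in cases]
for (p, s, expected), got in zip(cases, results):
    print(f"{p!r:10} on {s!r:6} -> {got}")
```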
The Asterisk * Quantifier#
Explanation. The * quantifier tells the regex engine to match the preceding element zero or more times.
It’s like saying, “Match this element if it appears any number of times, including not at all.”
Examples
Pattern: ca*t
- “ct” (zero 'a's): match
- “cat” (one 'a'): match
- “caaat” (three 'a's): match
- “bat” (doesn’t start with 'c'): no match
Implementation
import re
pattern = r'ca*t'
words = ['ct', 'cat', 'caaat', 'bat']
for word in words:
    print(f"Pattern matches '{word}':", bool(re.search(pattern, word)))
Pattern matches 'ct': True
Pattern matches 'cat': True
Pattern matches 'caaat': True
Pattern matches 'bat': False
The + Quantifier#
Explanation. The + quantifier matches the preceding element one or more times. It’s like ordering at least one item: “I want one or more scoops of ice cream.”
Examples
Pattern: ca+t
- “ct” (no 'a'): no match
- “cat” (one 'a'): match
- “caaat” (three 'a's): match
- “bat” (doesn’t start with 'c'): no match
Implementation
pattern = r'ca+t'
words = ['ct', 'cat', 'caaat', 'bat']
for word in words:
    print(f"Pattern matches '{word}':", bool(re.search(pattern, word)))
Pattern matches 'ct': False
Pattern matches 'cat': True
Pattern matches 'caaat': True
Pattern matches 'bat': False
The Question Mark ? Quantifier#
Explanation. The ? quantifier matches the preceding element zero or one time. It’s useful when an element is optional in the pattern.
Examples
Pattern: ca?t
- “ct” (zero 'a's): match
- “cat” (one 'a'): match
- “caaat” (more than one 'a'): no match
Implementation
pattern = r'ca?t'
words = ['ct', 'cat', 'caaat', 'bat']
for word in words:
    print(f"Pattern matches '{word}':", bool(re.search(pattern, word)))
Pattern matches 'ct': True
Pattern matches 'cat': True
Pattern matches 'caaat': False
Pattern matches 'bat': False
Exact Quantifier {n}#
Explanation. The {n} quantifier matches exactly n occurrences of the preceding element.
Examples
Pattern: ca{2}t
- “ct” (zero 'a's): no match
- “cat” (one 'a'): no match
- “caat” (exactly two 'a's): match
- “caaat” (three 'a's): no match
Implementation
pattern = r'ca{2}t'
words = ['ct', 'cat', 'caat', 'caaat']
for word in words:
    print(f"Pattern matches '{word}':", bool(re.search(pattern, word)))
Pattern matches 'ct': False
Pattern matches 'cat': False
Pattern matches 'caat': True
Pattern matches 'caaat': False
Range Quantifier {n,m}#
Explanation. The {n,m} quantifier matches between n and m occurrences of the preceding element, both bounds included.
Examples
Pattern: ca{1,3}t
- “ct” (zero 'a's): no match
- “cat” (one 'a'): match
- “caat” (two 'a's): match
- “caaat” (three 'a's): match
- “caaaat” (four 'a's): no match
Implementation
pattern = r'ca{1,3}t'
words = ['ct', 'cat', 'caat', 'caaat', 'caaaat']
for word in words:
    print(f"Pattern matches '{word}':", bool(re.search(pattern, word)))
Pattern matches 'ct': False
Pattern matches 'cat': True
Pattern matches 'caat': True
Pattern matches 'caaat': True
Pattern matches 'caaaat': False
Open-Ended Quantifiers {n,} and {,m}#
Explanation.
- {n,}: Matches n or more occurrences.
- {,m}: Matches up to m occurrences (including zero).
Examples
Pattern ca{2,}t matches two or more 'a's:
- “cat” (one 'a'): no match
- “caat” (two 'a's): match
- “caaat” (three 'a's): match
Pattern ca{,2}t matches up to two 'a's:
- “ct” (zero 'a's): match
- “cat” (one 'a'): match
- “caat” (two 'a's): match
- “caaat” (three 'a's): no match
Implementation
# Pattern for two or more 'a's
pattern_n_or_more = r'ca{2,}t'
words = ['cat', 'caat', 'caaat']
print("Pattern: 'ca{2,}t'")
for word in words:
    print(f"Pattern matches '{word}':", bool(re.search(pattern_n_or_more, word)))
# Pattern for up to two 'a's
pattern_up_to_m = r'ca{,2}t'
words = ['ct', 'cat', 'caat', 'caaat']
print("\nPattern: 'ca{,2}t'")
for word in words:
    print(f"Pattern matches '{word}':", bool(re.search(pattern_up_to_m, word)))
Pattern: 'ca{2,}t'
Pattern matches 'cat': False
Pattern matches 'caat': True
Pattern matches 'caaat': True
Pattern: 'ca{,2}t'
Pattern matches 'ct': True
Pattern matches 'cat': True
Pattern matches 'caat': True
Pattern matches 'caaat': False
Greedy vs. Non-Greedy Quantifiers#
Explanation. By default, quantifiers are greedy, meaning they match as many occurrences as possible. Sometimes, you might want them to be non-greedy (or lazy), matching as few occurrences as needed.
To make a quantifier non-greedy, you append a ? to it.
Examples
- Greedy pattern "<.*>": Matches the longest possible string starting with < and ending with >.
- Non-greedy pattern "<.*?>": Matches the shortest possible string starting with < and ending with >.
Implementation
text = "<div>Content</div><span>More</span>"
# Greedy match
pattern_greedy = r'<.*>'
match_greedy = re.search(pattern_greedy, text)
print("Greedy match:", match_greedy.group())
# Non-greedy match
pattern_non_greedy = r'<.*?>'
match_non_greedy = re.search(pattern_non_greedy, text)
print("Non-greedy match:", match_non_greedy.group())
Greedy match: <div>Content</div><span>More</span>
Non-greedy match: <div>
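The difference becomes more visible when collecting every match instead of only the first. This is a minimal sketch reusing the same sample text; the negated character set in the last pattern is an alternative shown for comparison, not something the section above relies on.

```python
import re

text = "<div>Content</div><span>More</span>"

# Greedy: a single match that runs to the last '>' in the string
greedy_all = re.findall(r"<.*>", text)

# Non-greedy: each tag is matched on its own
lazy_all = re.findall(r"<.*?>", text)

# Alternative without a lazy quantifier: anything except '>' between the brackets
negated_all = re.findall(r"<[^>]*>", text)

print(greedy_all)
print(lazy_all)
print(negated_all)
```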
Groups#
Groups in regular expressions allow you to capture parts of the matching text and work with them separately. They are like containers that hold specific portions of the text you’re interested in. Groups make complex pattern matching and text manipulation tasks more manageable.
Understanding Groups#
Imagine you’re sorting a collection of colored balls into boxes based on their colors:
Red balls go into the red box.
Blue balls go into the blue box.
Green balls go into the green box.
Each box represents a group that holds items of a particular type. Similarly, in regex, groups let you isolate and handle specific parts of the text that match certain patterns.
Why Use Groups?#
Extraction: Pull out specific pieces of data from a larger text.
Repetition: Apply quantifiers to entire patterns, not just single characters.
Substitution: Replace or rearrange parts of the text.
Referencing: Reuse matched groups elsewhere in the pattern or replacement text.
Capturing Groups#
A capturing group is created by placing the desired pattern inside parentheses ().
Basic Syntax
Pattern: (subpattern)
Examples
- Pattern (cat): Matches and captures the exact string "cat".
- Pattern (\d{3})-(\d{2})-(\d{4}): Captures three groups of digits in a Social Security number format.
Groups Numbering. Groups are automatically numbered based on the order of their opening parentheses ( from left to right.
Group 0: The entire match.
Group 1: The first capturing group.
Group 2: The second capturing group.
And so on…
Implementation
import re
text = "440-27-9201"
pattern = r'(\d{3})-(\d{2})-(\d{4})'
res = re.search(pattern, text)
if res:
    for i in range(len(res.groups()) + 1):
        print(f"Group {i}: {res.group(i)}")
Group 0: 440-27-9201
Group 1: 440
Group 2: 27
Group 3: 9201
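Besides extraction, groups let a quantifier apply to a whole subpattern rather than to a single character (the "Repetition" use listed earlier). A small sketch, using re.fullmatch so the entire string has to match; the sample strings are illustrative.

```python
import re

# Without a group, + applies only to 'b': an 'a' followed by one or more 'b'
plain_on_abab = bool(re.fullmatch(r"ab+", "abab"))
plain_on_abbb = bool(re.fullmatch(r"ab+", "abbb"))

# With a group, + applies to the whole "ab" unit: one or more repetitions of "ab"
grouped_on_abab = bool(re.fullmatch(r"(ab)+", "abab"))
grouped_on_abbb = bool(re.fullmatch(r"(ab)+", "abbb"))

print(plain_on_abab, plain_on_abbb)
print(grouped_on_abab, grouped_on_abbb)
```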
Named Groups#
Named groups assign a name to a group, making your regex more readable and the code easier to maintain.
Syntax
Pattern:
(?P<name>subpattern)
Example
import re
text = "Date: 2023-09-17"
pattern = r'(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})'
match = re.search(pattern, text)
if match:
    print("Year:", match.group('year'))
    print("Month:", match.group('month'))
    print("Day:", match.group('day'))
Year: 2023
Month: 09
Day: 17
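Named groups can also be retrieved all at once. The sketch below uses the match object's groupdict() method on the same date pattern; the variable names are illustrative.

```python
import re

pattern = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})")
match = pattern.search("Date: 2023-09-17")

# groupdict() returns every named group as a dictionary
parts = match.groupdict()

# Named groups are still numbered, so both access styles refer to the same text
same = match.group("year") == match.group(1)

print(parts)
print(same)
```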
Non-Capturing Groups#
Sometimes you need to group parts of a pattern without capturing them. Non-capturing groups help in such cases.
Syntax
Pattern:
(?:subpattern)
Why Use Non-Capturing Groups?
Performance: Avoids unnecessary capturing, which can make the regex slightly cheaper to run.
Clarity: Keeps the numbering of capturing groups consistent.
Example
import re
text = "redapple greenapple yellowapple"
pattern = r'(?:red|green|yellow)(apple)'
matches = re.finditer(pattern, text)
for m in matches:
    print(m, m.groups())
<re.Match object; span=(0, 8), match='redapple'> ('apple',)
<re.Match object; span=(9, 19), match='greenapple'> ('apple',)
<re.Match object; span=(20, 31), match='yellowapple'> ('apple',)
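To see the effect on numbering, the same pattern can be written with and without the non-capturing prefix. A minimal sketch on a single word:

```python
import re

text = "redapple"

capturing = re.search(r"(red|green|yellow)(apple)", text)
non_capturing = re.search(r"(?:red|green|yellow)(apple)", text)

# With a capturing group for the color, "apple" is group 2
capturing_groups = capturing.groups()

# With a non-capturing group, "apple" is group 1 and the color is not stored
non_capturing_groups = non_capturing.groups()

print(capturing_groups)
print(non_capturing_groups)
```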
Nested Groups#
Groups can be nested within other groups. Numbering is assigned by the opening parenthesis from left to right.
import re
text = "abc123"
pattern = r'((a)(b)(c))(\d{3})'
match = re.search(pattern, text)
if match:
    print("Group 0:", match.group(0))
    print("Group 1:", match.group(1))
    print("Group 2:", match.group(2))
    print("Group 3:", match.group(3))
    print("Group 4:", match.group(4))
    print("Group 5:", match.group(5))
Group 0: abc123
Group 1: abc
Group 2: a
Group 3: b
Group 4: c
Group 5: 123
Backreferences#
Backreferences allow you to match the same text as previously matched by a capturing group.
Syntax
Pattern: \number, where number is the group number.
Example: Matching Duplicate Words
import re
text = "This is is a test test."
pattern = r'\b(\w+)\s+\1\b'
# \b assert position at a word boundary
# \w+ matches 1 or more occurrences of any word character
# \s+ matches 1 or more occurrences of any whitespace character
# \1 matches the same text as most recently matched by the 1st capturing group
matches = re.findall(pattern, text)
print("Duplicate words:", matches)
Duplicate words: ['is', 'test']
Using Groups in Substitutions#
Groups are extremely useful in the re.sub
function for performing complex text substitutions.
Example: Swapping First and Last Names
import re
text = "Doe, John"
pattern = r'(\w+),\s+(\w+)'
replacement = r'\2 \1'
result = re.sub(pattern, replacement, text)
print("Reformatted name:", result)
Reformatted name: John Doe
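The same swap can be written with named groups, which the replacement string references as \g<name>. re.sub also accepts a function in place of the replacement string. This is a sketch; the group names last and first and the upper-casing step are illustrative assumptions.

```python
import re

text = "Doe, John"
pattern = r"(?P<last>\w+),\s+(?P<first>\w+)"

# \g<name> refers to a named group inside the replacement string
by_name = re.sub(pattern, r"\g<first> \g<last>", text)

# A function receives each match object and returns the replacement text
by_func = re.sub(
    pattern,
    lambda m: f"{m.group('first')} {m.group('last').upper()}",
    text,
)

print(by_name)
print(by_func)
```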
Summary#
Groups in regular expressions are powerful tools that allow you to:
Capture specific parts of the text.
Refer back to those parts within your pattern or replacement text.
Organize complex patterns into manageable sections.
Understanding groups is like having a well-organized toolbox—you know exactly where each tool is and how to use it effectively.
Note
- Capturing Groups: Use parentheses () to capture.
- Named Groups: Use (?P<name>) for better readability.
- Non-Capturing Groups: Use (?:) when you need grouping without capturing.
- Backreferences: Use \number to refer back to captured groups.
- Substitutions: Use groups in re.sub to manipulate text.
With groups, your regular expressions become more flexible and powerful, enabling you to handle complex text processing tasks with ease.
Lookahead and Lookbehind#
Lookahead and lookbehind are assertions in regular expressions that allow you to match a pattern only if it is (or isn’t) followed or preceded by another pattern, without including that surrounding text in the match.
Lookahead#
Positive Lookahead (?=...)#
Syntax: X(?=Y)
Meaning: Match X only if it is followed by Y.
Example:
Pattern: \w+(?=\s+car)
Text: "red car, blue bike, green car"
Matches: "red", "green"
import re
text = "red car, blue bike, green car"
pattern = r'\w+(?=\s+car)'
matches = re.finditer(pattern, text)
for m in matches:
    print(m, m.group())
<re.Match object; span=(0, 3), match='red'> red
<re.Match object; span=(20, 25), match='green'> green
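Negative Lookahead (?!...)#
Syntax: X(?!Y)
Meaning: Match X only if it is not followed by Y.
Example:
Pattern: \w+(?:\s+)(?!car)
Text: "red car, blue bike, green car"
Matches: "blue " (the trailing space is part of the match, because \s+ is consumed before the lookahead checks that car does not follow)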
text = "red car, blue bike, green car"
pattern = r'\w+(?:\s+)(?!car)'
matches = re.finditer(pattern, text)
for m in matches:
    print(m)
<re.Match object; span=(9, 14), match='blue '>
Lookbehind#
Positive Lookbehind (?<=...)#
Syntax: (?<=Y)X
Meaning: Match X only if it is preceded by Y.
Example:
Pattern: (?<=\$)\d+
Text: "Price: $100, $200"
Matches: "100", "200"
text = "Price: $100, $200"
pattern = r'(?<=\$)\d+'
matches = re.finditer(pattern, text)
for m in matches:
    print(m, m.group())
<re.Match object; span=(8, 11), match='100'> 100
<re.Match object; span=(14, 17), match='200'> 200
Negative Lookbehind (?<!...)#
Syntax: (?<!Y)X
Meaning: Match X only if it is not preceded by Y.
Example:
Pattern: (?<!\$)\b\d+\b
Text: "Items: $5, 10, $15, 20"
Matches: "10", "20"
text = "Items: $5, 10, $15, 20"
pattern = r'(?<!\$)\b\d+\b'
matches = re.finditer(pattern, text)
for m in matches:
    print(m, m.group())
<re.Match object; span=(11, 13), match='10'> 10
<re.Match object; span=(20, 22), match='20'> 20
Summary#
- Positive Lookahead (?=...): Ensures a pattern follows.
- Negative Lookahead (?!...): Ensures a pattern does not follow.
- Positive Lookbehind (?<=...): Ensures a pattern precedes.
- Negative Lookbehind (?<!...): Ensures a pattern does not precede.
Note
You can combine lookaheads and lookbehinds for precise matching.
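As a sketch of such a combination, the pattern below keeps only amounts that are preceded by a dollar sign and followed by the word USD. The sample text and the variable name are made up for illustration.

```python
import re

text = "Total: $100 USD, $200 EUR, 300 USD"

# (?<=\$)     the digits must be preceded by a dollar sign
# \d+         the digits themselves are the only text in the match
# (?=\s+USD)  the digits must be followed by whitespace and "USD"
amounts = re.findall(r"(?<=\$)\d+(?=\s+USD)", text)

print(amounts)
```

Here 200 is rejected by the lookahead (it is followed by EUR) and 300 is rejected by the lookbehind (no dollar sign before it).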
Examples of re usage#
Example 1: Extracting URLs from text#
Objective: Extract all URLs from a block of text, including those starting with http://, https://, or www., and capture the domain and path separately.
Techniques Used:
- Groups: To capture the domain and path.
- Quantifiers: To match variable-length patterns.
- Alternation: To accept http://, https://, or www. as the prefix.
- Character Classes: To match specific sets of characters.
- re.findall Function: To find all occurrences.
import re
text = """
Visit our website at https://www.example.com/path/to/page or our sister site at http://example.org.
Don't forget to check out www.example.net for more information.
"""
pattern = r'(https?://|www\.)' # Protocol or www
pattern += r'([A-Za-z0-9.-]+)' # Domain
pattern += r'(\.[A-Za-z]{2,6})' # Top-level domain
pattern += r'(/[A-Za-z0-9./?=&%-]*)?' # Optional path
Explanation:
- (https?://|www\.): Matches http://, https://, or www. using a group and alternation |.
- ([A-Za-z0-9.-]+): Captures the domain name.
- (\.[A-Za-z]{2,6}): Captures the top-level domain (e.g., .com, .org).
- (/[A-Za-z0-9./?=&%-]*)?: Optionally captures the path and query string.
matches = re.findall(pattern, text)
for match in matches:
    protocol_or_www, domain, tld, path = match
    url = ''.join(match)
    print(f"Full URL: {url}")
    print(f"Protocol or 'www.': {protocol_or_www}")
    print(f"Domain: {domain}{tld}")
    print(f"Path: {path if path else '/'}\n")
Full URL: https://www.example.com/path/to/page
Protocol or 'www.': https://
Domain: www.example.com
Path: /path/to/page
Full URL: http://example.org
Protocol or 'www.': http://
Domain: example.org
Path: /
Full URL: www.example.net
Protocol or 'www.': www.
Domain: example.net
Path: /
Example 2: Validating and Masking Credit Card Numbers#
Objective: Identify credit card numbers in a text, validate their format, and mask all but the last four digits for security.
Techniques Used:
- Groups: To capture parts of the credit card number.
- Quantifiers: To specify exact and variable lengths.
- Lookbehind and Lookahead: To ensure the number is not part of a longer run of digits.
- re.sub Function: For substitution and masking.
- Non-Capturing Groups: To group without capturing.
import re
text = """
Customer data:
Name: John Doe
Credit Card: 1234-5678-9012-3456
SSN: 123-45-6789
Name: Jane Smith
Credit Card: 9876 5432 1098 7654
"""
# Pattern to match credit card numbers
pattern = r'(?<!\d)(?:\d{4}[-\s]?){3}\d{4}(?!\d)'
Explanation:
- (?<!\d): Negative lookbehind to ensure the number is not part of a longer number.
- (?:\d{4}[-\s]?){3}: Non-capturing group matching three sets of four digits, each optionally followed by - or a whitespace character.
- \d{4}: Matches the last four digits.
- (?!\d): Negative lookahead to ensure no digit follows.
# Function to mask credit card numbers
def mask_cc(match):
    full_cc = match.group()
    masked_cc = '****-****-****-' + full_cc[-4:]
    return masked_cc
# Masking credit card numbers
masked_text = re.sub(pattern, mask_cc, text)
print(masked_text)
Customer data:
Name: John Doe
Credit Card: ****-****-****-3456
SSN: 123-45-6789
Name: Jane Smith
Credit Card: ****-****-****-7654
Summary#
Regular expressions, commonly known as regex, are sequences of characters that define search patterns, primarily used for string matching and manipulation. In Python, the re module provides support for working with regular expressions.
Core Components
- Metacharacters: Special characters that have unique meanings (e.g., ., ^, $, *, +, ?, {}, [], |, (), \).
- Quantifiers: Specify the number of occurrences (*, +, ?, {n}, {n,}, {,m}, {n,m}).
- Character Sets: Use square brackets [] to define a set of characters to match.
- Groups: Use parentheses () to capture parts of the matched text for reuse.
- Lookahead and Lookbehind: Allow assertions about what precedes or follows the current position without including it in the match.