Regular Expressions#

The main task of regular expressions (regex) is to search, match, and manipulate patterns within text. They allow you to efficiently find specific sequences, validate formats like emails or phone numbers, and perform complex text processing tasks with minimal code. Regex is a powerful tool for working with text data in a precise and flexible way.

Regex basics#

Main components of regex#

Working with regex involves three main components:

Pattern

The pattern is a sequence of characters that defines the search criteria. It could be a specific sequence of characters, a set of characters, or a more complex structure using special symbols to represent various possibilities.

Text

The text is the string (sequence of characters) that you want to search within. The pattern is applied to this text to find matches or perform substitutions.

Match

The match is the outcome of applying the pattern to the string. It indicates where and how the pattern fits within the string. A match can refer to the exact location, the part of the string that corresponds to the pattern, or simply a boolean value indicating whether a match exists or not.

Example#

We can use regex in a similar way to basic string methods to find a substring in the text:

Basic String Methods#

text = "Don’t explain your philosophy. Embody it."
pattern = "philosophy"

# string methods:
print("find:", text.find(pattern), "\nin:", pattern in text)
find: 19 
in: True

Regex approach#

Python has a special module called re that can be used to work with regular expressions.

import re

res = re.search(pattern, text)
if res is not None:
    print("res.start():", res.start())
    print("res.end():", res.end())
    print("res.span():", res.span())

print("bool(res):", bool(res))
res.start(): 19
res.end(): 29
res.span(): (19, 29)
bool(res): True

This was just a basic example of what regex can do, and it is far from showcasing its full power. Now, let’s dive into the key capabilities of regular expressions.

re Functions#

The main functions provided by the re module include:

  1. re.match: Checks for a match only at the beginning of the string.

  2. re.search: Searches the entire string for a match.

  3. re.findall: Returns a list of all matches found in the string.

  4. re.finditer: Returns an iterator yielding match objects for all matches found.

  5. re.sub: Replaces occurrences of the pattern with a replacement string.
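
As a rough illustration of how these five functions differ, the sketch below applies each of them to the same input. The sample text and the digit pattern are made up for this example and are not part of the original material.

```python
import re

# Hypothetical sample text and pattern, chosen only for illustration
text = "cat 12, dog 7, bird 305"
pattern = r"\d+"

# re.match only looks at the beginning of the string, so it finds nothing here
at_start = re.match(pattern, text)
print("match:   ", at_start)

# re.search scans the whole string and stops at the first match
first = re.search(pattern, text)
print("search:  ", first.group())

# re.findall returns every match as a list of strings
all_numbers = re.findall(pattern, text)
print("findall: ", all_numbers)

# re.finditer yields match objects, which also carry positions
spans = [m.span() for m in re.finditer(pattern, text)]
print("finditer:", spans)

# re.sub replaces every match with the replacement string
replaced = re.sub(pattern, "#", text)
print("sub:     ", replaced)
```

Note that re.findall returns plain strings here because the pattern has no groups, while re.finditer is the better fit when positions are needed.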

Metacharacters#

A metacharacter in regular expressions is a special character that has a unique meaning and function, rather than being interpreted as a literal character. These characters are used to create patterns that can match different types of text.

Dot . metacharacter#

Explanation. The dot . is one of the most commonly used metacharacters in regular expressions. It’s simple yet powerful, and it plays a key role in pattern matching.

The Basic Function:

  • The dot . matches any single character except a newline \n.

  • This means it can represent any letter, digit, symbol, or even a space.

Examples:

Suppose you have the regex pattern 'a.c'. This pattern matches:

‘abc’ ‘a1c’ ‘a-c’

and in general any string where 'a' is followed by any single character and then 'c'.

Non-matching examples:

‘ac’ ‘abdc’

Implementation

words = ['abc', 'a1c', 'a-c', 'ac', 'abdc']
pattern = re.compile(r'a.c')

for w in words:
    print(f"Pattern found in {w:>4}:", bool(re.search(pattern, w)))
Pattern found in  abc: True
Pattern found in  a1c: True
Pattern found in  a-c: True
Pattern found in   ac: False
Pattern found in abdc: False
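
The newline exception mentioned above can be checked in the same way. This is a minimal sketch with made-up test strings; the re.DOTALL flag in the last line goes beyond what this section covers and is shown only as an optional way to lift the exception.

```python
import re

pattern = re.compile(r'a.c')

# By default the dot does not match a newline between 'a' and 'c'
newline_result = pattern.search('a\nc')
print("a\\nc:", bool(newline_result))

# A tab or a space is an ordinary character for the dot
tab_result = pattern.search('a\tc')
space_result = pattern.search('a c')
print("a\\tc:", bool(tab_result))
print("a c:", bool(space_result))

# With the re.DOTALL flag, the dot matches a newline as well
dotall_result = re.search(r'a.c', 'a\nc', re.DOTALL)
print("a\\nc with re.DOTALL:", bool(dotall_result))
```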

Raw Strings

In Python, an r string (or raw string) is a string prefixed with the letter r or R. The primary purpose of a raw string is to treat backslashes \ as literal characters and not as escape characters. This is particularly useful when working with regular expressions or file paths, where backslashes are common.

Example:

rline = r"Line 1\nLine 2\tTabbed"
line = "Line 1\nLine 2\tTabbed"

print('Raw line:', rline)
print("Regular line:", line)
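
For regex patterns the difference matters in practice. The sketch below, using a made-up sentence, compares the same pattern text written as a regular string and as a raw string: in a regular string Python turns \b into a backspace character before the regex engine ever sees it.

```python
import re

text = "a cat sat"

# Regular string: "\b" is the single backspace character,
# so this pattern looks for backspace + "cat" + backspace
plain = "\bcat\b"

# Raw string: the backslash is kept, so the regex engine sees \b
# and reads it as a word boundary
raw = r"\bcat\b"

print("lengths:", len(plain), len(raw))
print("plain pattern found:", bool(re.search(plain, text)))
print("raw pattern found:", bool(re.search(raw, text)))
```

Writing patterns as raw strings by default avoids this kind of silent mismatch.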

Dot vs Literal Dot#

If you want to match an actual dot (a period or full stop), you need to escape it with a backslash: \.

Examples:

  • Pattern: 'a.c'

    • Matches all four: abc, a.c, a-c, a c

  • Pattern: 'a\.c'

    • Matches only a.c (not abc, a-c, or a c)

p1 = r'a.c'
p2 = r'a\.c'

words = ['abc', 'a.c', 'a-c', 'a c']
for w in words:
    print(f"p1 in {w}: {bool(re.search(p1, w))}", end=" | ")
    print(f"p2 in {w}: {bool(re.search(p2, w))}")
p1 in abc: True | p2 in abc: False
p1 in a.c: True | p2 in a.c: True
p1 in a-c: True | p2 in a-c: False
p1 in a c: True | p2 in a c: False

Metacharacters: ^ and $#

  • ^: Matches the start of a string

    • Pattern: '^cat'

      • Matches "cat dog" but not "dog cat"

  • $: Matches the end of a string (or the position just before a trailing newline)

    • Pattern: 'dog$'

      • Matches "cat dog" but not "dog cat"

Implementation:

p1 = r'^cat'
p2 = r'dog$'

lines = ['cat dog', 'dog cat']

for l in lines:
    print(f'p1 in {l}:', bool(re.search(p1, l)))
    print(f'p2 in {l}:', bool(re.search(p2, l)))
p1 in cat dog: True
p2 in cat dog: True
p1 in dog cat: False
p2 in dog cat: False

Character Sets#

Character sets, also known as character classes, are a feature in regular expressions that allow you to match any one of several characters at a specific position in the text. They are defined using square brackets [].

How Character Sets Work

When you place characters inside square brackets, the regex engine will match any single character from that set. For example, the character set [abc] will match any one of the characters "a", "b", or "c".

Basic Character Set#

  • Pattern: h[aeiou]t

    • Matches hat, hot, hit; does not match htt, hbt

Implementation

p = r'h[aeiou]t'
words = ['hat', 'hot', 'hit', 'htt', 'hbt']

for w in words:
    print(f'p in {w}:', bool(re.search(p, w)))
p in hat: True
p in hot: True
p in hit: True
p in htt: False
p in hbt: False

Ranges#

Ranges in regular expression character sets are used to specify a sequence of characters that should be matched. A range is defined using a hyphen - between two characters inside square brackets [], and it tells the regex engine to match any character between those two characters.

How ranges work

  • A range is represented as [start-end], where start and end are characters.

  • The regex will match any character that falls within that range.

Common examples

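  • Pattern: [a-z]

    • Matches a, b, y; does not match A, Z, 3, 5

  • Pattern: [A-Z]

    • Matches A, Z; does not match a, b, y, 3, 5

  • Pattern: [0-9]

    • Matches 3, 5; does not match a, b, y, A, Z

  • Combining Multiple Ranges: [a-zA-Z0-9]

    • Matches all of a, b, y, A, Z, 3, 5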

Implementation

p1, p2, p3, p4 = r'[a-z]', r'[A-Z]', r'[0-9]', r'[a-zA-Z0-9]'

words = ['a', 'b', 'y', 'A', 'Z', '3', '5']

for p in (p1, p2, p3, p4):
    print(f"Current pattern is: {p}")
    results = [f"{w}: {bool(re.search(p, w))}" for w in words]
    print(f"\t{' | '.join(results)}")
Current pattern is: [a-z]
	a: True | b: True | y: True | A: False | Z: False | 3: False | 5: False
Current pattern is: [A-Z]
	a: False | b: False | y: False | A: True | Z: True | 3: False | 5: False
Current pattern is: [0-9]
	a: False | b: False | y: False | A: False | Z: False | 3: True | 5: True
Current pattern is: [a-zA-Z0-9]
	a: True | b: True | y: True | A: True | Z: True | 3: True | 5: True

Using Hyphen as a Literal

To match a literal hyphen, place it at the start or end of the character set. Example: [a-z-] matches any lowercase letter or the hyphen -. A hyphen placed between two characters (like the one in [a-z]) is interpreted as a range, so escaping it as \- is the unambiguous way to include a literal hyphen elsewhere in the set.
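
The hyphen placement described above can be checked with a short sketch. The list of test characters is made up, and the escaped form \- is an extra option added here for comparison.

```python
import re

words = ['a', 'q', '-', '5', '_']

# Hyphen at the end of the set: any lowercase letter or a literal '-'
p_end = r'[a-z-]'
# Escaped hyphen: only the three characters 'a', '-' and 'z'
p_escaped = r'[a\-z]'

matched_end = [w for w in words if re.search(p_end, w)]
matched_escaped = [w for w in words if re.search(p_escaped, w)]

print(p_end, "matches:", matched_end)
print(p_escaped, "matches:", matched_escaped)
```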

Negative Character Sets#

A negative character set allows you to match any character except the ones specified inside the set. It’s a powerful feature when you want to exclude certain characters from matching.

How to Negate a Character Set

To negate a character set, you place a caret ^ immediately after the opening square bracket [. This means “match any character that is not in the set.”

Examples

  • Pattern: [^aeiou]: Match any character that is not a lowercase vowel (i.e., anything other than “a”, “e”, “i”, “o”, or “u”, which includes consonants, digits, and spaces).

  • Pattern: [^0-9]: Match any character that is not a digit (anything other than “0” to “9”).
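
A minimal sketch of both negated sets, applied to a made-up sample string with re.findall so that every matching character is listed:

```python
import re

text = "regex 101"

# Every character that is not a lowercase vowel (consonants, space, digits)
non_vowels = re.findall(r'[^aeiou]', text)
print(non_vowels)

# Every character that is not a digit
non_digits = re.findall(r'[^0-9]', text)
print(non_digits)
```

The space shows up in both results, a reminder that a negated set matches anything outside the set, not only letters.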

Quantifiers#

Quantifiers in regular expressions specify how many times a particular character, group, or character class should appear in the input for a match to occur. They are essential tools that allow you to fine-tune your pattern matching, making your regex as specific or as general as needed.

Common Quantifiers#

Here is the table with the most commonly used quantifiers in regular expressions:

Symbol   Meaning
*        Matches 0 or more occurrences of the preceding element.
+        Matches 1 or more occurrences of the preceding element.
?        Matches 0 or 1 occurrence of the preceding element.
{n}      Matches exactly n occurrences of the preceding element.
{n,}     Matches n or more occurrences of the preceding element.
{,m}     Matches up to m (including zero) occurrences of the preceding element.
{n,m}    Matches between n and m occurrences (inclusive) of the preceding element.

The Asterisk * Quantifier#

Explanation. The * quantifier tells the regex engine to match the preceding element zero or more times. It’s like saying, “Match this element if it appears any number of times, including not at all.”

Examples

  • Pattern: ca*t

    • “ct” (zero 'a')

    • “cat” (one 'a')

    • “caaat” (three 'a's)

    • “bat” (doesn’t start with 'c')

Implementation

import re

pattern = r'ca*t'
words = ['ct', 'cat', 'caaat', 'bat']

for word in words:
    print(f"Pattern matches '{word}':", bool(re.search(pattern, word)))
Pattern matches 'ct': True
Pattern matches 'cat': True
Pattern matches 'caaat': True
Pattern matches 'bat': False

The + Quantifier#

Explanation. The + quantifier matches the preceding element one or more times. It’s like ordering at least one item: “I want one or more scoops of ice cream.”

Examples

  • Pattern: ca+t

    • “ct” (no 'a', so it doesn’t match)

    • “cat” (one 'a')

    • “caaat” (three 'a's)

    • “bat” (doesn’t start with 'c')

Implementation

pattern = r'ca+t'
words = ['ct', 'cat', 'caaat', 'bat']

for word in words:
    print(f"Pattern matches '{word}':", bool(re.search(pattern, word)))
Pattern matches 'ct': False
Pattern matches 'cat': True
Pattern matches 'caaat': True
Pattern matches 'bat': False

The Question Mark ? Quantifier#

Explanation. The ? quantifier matches the preceding element zero or one time. It’s useful when an element is optional in the pattern.

Examples

  • Pattern: ca?t

    • “ct” (zero 'a')

    • “cat” (one 'a')

    • “caaat” (more than one 'a', so it doesn’t match)

Implementation

pattern = r'ca?t'
words = ['ct', 'cat', 'caaat', 'bat']

for word in words:
    print(f"Pattern matches '{word}':", bool(re.search(pattern, word)))
Pattern matches 'ct': True
Pattern matches 'cat': True
Pattern matches 'caaat': False
Pattern matches 'bat': False

Exact Quantifier {n}#

Explanation. The {n} quantifier matches exactly n occurrences of the preceding element.

Examples

  • Pattern: ca{2}t

    • “ct” (zero 'a')

    • “cat” (one 'a')

    • “caat” (exactly two 'a's)

    • “caaat” (three 'a's)

Implementation

pattern = r'ca{2}t'
words = ['ct', 'cat', 'caat', 'caaat']

for word in words:
    print(f"Pattern matches '{word}':", bool(re.search(pattern, word)))
Pattern matches 'ct': False
Pattern matches 'cat': False
Pattern matches 'caat': True
Pattern matches 'caaat': False

Range Quantifier {n,m}#

Explanation. The {n,m} quantifier matches between n and m occurrences of the preceding element.

Examples

  • Pattern: ca{1,3}t

    • “ct” (zero 'a')

    • “cat” (one 'a')

    • “caat” (two 'a's)

    • “caaat” (three 'a's)

    • “caaaat” (four 'a's)

Implementation

pattern = r'ca{1,3}t'
words = ['ct', 'cat', 'caat', 'caaat', 'caaaat']

for word in words:
    print(f"Pattern matches '{word}':", bool(re.search(pattern, word)))
Pattern matches 'ct': False
Pattern matches 'cat': True
Pattern matches 'caat': True
Pattern matches 'caaat': True
Pattern matches 'caaaat': False

Open-Ended Quantifiers {n,} and {,m}#

Explanation.

  • {n,}: Matches n or more occurrences.

  • {,m}: Matches up to m occurrences (including zero).

Examples

  • Pattern ca{2,}t matches two or more 'a's:

    • “cat” (one 'a')

    • “caat” (two 'a's)

    • “caaat” (three 'a's)

  • Pattern ca{,2}t matches up to two 'a's:

    • “ct” (zero 'a')

    • “cat” (one 'a')

    • “caat” (two 'a's)

    • “caaat” (three 'a's)

Implementation

# Pattern for two or more 'a's
pattern_n_or_more = r'ca{2,}t'
words = ['cat', 'caat', 'caaat']


print("Pattern: 'ca{2,}t'")
for word in words:
    print(f"Pattern matches '{word}':", bool(re.search(pattern_n_or_more, word)))

# Pattern for up to two 'a's
pattern_up_to_m = r'ca{,2}t'
words = ['ct', 'cat', 'caat', 'caaat']

print("\nPattern: 'ca{,2}t'")
for word in words:
    print(f"Pattern matches '{word}':", bool(re.search(pattern_up_to_m, word)))
Pattern: 'ca{2,}t'
Pattern matches 'cat': False
Pattern matches 'caat': True
Pattern matches 'caaat': True

Pattern: 'ca{,2}t'
Pattern matches 'ct': True
Pattern matches 'cat': True
Pattern matches 'caat': True
Pattern matches 'caaat': False

Greedy vs. Non-Greedy Quantifiers#

Explanation. By default, quantifiers are greedy, meaning they match as many occurrences as possible. Sometimes, you might want them to be non-greedy (or lazy), matching as few occurrences as needed.

To make a quantifier non-greedy, you append a ? to it.

Examples

  • Greedy Pattern: "<.*>"

    • Matches the longest possible string starting with < and ending with >.

  • Non-Greedy Pattern: "<.*?>"

    • Matches the shortest possible string starting with < and ending with >.

Implementation

text = "<div>Content</div><span>More</span>"

# Greedy match
pattern_greedy = r'<.*>'
match_greedy = re.search(pattern_greedy, text)
print("Greedy match:", match_greedy.group())

# Non-greedy match
pattern_non_greedy = r'<.*?>'
match_non_greedy = re.search(pattern_non_greedy, text)
print("Non-greedy match:", match_non_greedy.group())
Greedy match: <div>Content</div><span>More</span>
Non-greedy match: <div>
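
The difference becomes even more visible when collecting all matches rather than only the first one. The sketch below reuses the same text with re.findall.

```python
import re

text = "<div>Content</div><span>More</span>"

# Greedy: a single match running from the first '<' to the last '>'
greedy_all = re.findall(r'<.*>', text)
print("Greedy:", greedy_all)

# Non-greedy: each tag is matched on its own
lazy_all = re.findall(r'<.*?>', text)
print("Non-greedy:", lazy_all)
```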

Groups#

Groups in regular expressions allow you to capture parts of the matching text and work with them separately. They are like containers that hold specific portions of the text you’re interested in. Groups make complex pattern matching and text manipulation tasks more manageable.

Understanding Groups#

Imagine you’re sorting a collection of colored balls into boxes based on their colors:

  • Red balls go into the red box.

  • Blue balls go into the blue box.

  • Green balls go into the green box.

Each box represents a group that holds items of a particular type. Similarly, in regex, groups let you isolate and handle specific parts of the text that match certain patterns.

Why Use Groups?#

  • Extraction: Pull out specific pieces of data from a larger text.

  • Repetition: Apply quantifiers to entire patterns, not just single characters.

  • Substitution: Replace or rearrange parts of the text.

  • Referencing: Reuse matched groups elsewhere in the pattern or replacement text.
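
The repetition point can be sketched as follows, using parentheses to form a group (the syntax is covered in the next section). The word list is made up, and re.fullmatch is used here, as an addition to the functions listed earlier, so that the whole word has to fit the pattern.

```python
import re

words = ['ab', 'abab', 'abb', 'ababab']

# Without a group, + applies only to the 'b' right before it
p_char = r'ab+'
# With a group, + applies to the whole 'ab' sequence
p_group = r'(ab)+'

results = {}
for w in words:
    results[w] = (bool(re.fullmatch(p_char, w)), bool(re.fullmatch(p_group, w)))
    print(f"{w:>6}: ab+ -> {results[w][0]}, (ab)+ -> {results[w][1]}")
```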

Capturing Groups#

A capturing group is created by placing the desired pattern inside parentheses ().

Basic Syntax

  • Pattern: (subpattern)

Examples

  • Pattern: (cat)

    • Matches and captures the exact string "cat".

  • Pattern: (\d{3})-(\d{2})-(\d{4})

    • Captures three groups of digits in a Social Security number format.

Group Numbering. Groups are automatically numbered based on the order of their opening parentheses ( from left to right.

  • Group 0: The entire match.

  • Group 1: The first capturing group.

  • Group 2: The second capturing group.

  • And so on…

Implementation

import re

text = "440-27-9201"
pattern = r'(\d{3})-(\d{2})-(\d{4})'

res = re.search(pattern, text)

if res:
    for i in range(len(res.groups()) + 1):
        print(f"Group {i}: {res.group(i)}")
Group 0: 440-27-9201
Group 1: 440
Group 2: 27
Group 3: 9201

Named Groups#

Named groups assign a name to a group, making your regex more readable and the code easier to maintain.

Syntax

  • Pattern: (?P<name>subpattern)

Example

import re

text = "Date: 2023-09-17"
pattern = r'(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})'

match = re.search(pattern, text)
if match:
    print("Year:", match.group('year'))
    print("Month:", match.group('month'))
    print("Day:", match.group('day'))
Year: 2023
Month: 09
Day: 17
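
When several named groups are needed at once, the match object can also hand them over as a dictionary. A small sketch reusing the same text and pattern:

```python
import re

text = "Date: 2023-09-17"
pattern = r'(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})'

match = re.search(pattern, text)
if match:
    # groupdict() returns all named groups as a dictionary
    parts = match.groupdict()
    print(parts)
    # Named groups are still numbered, so access by index also works
    print(match.group(1) == match.group('year'))
```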

Non-Capturing Groups#

Sometimes you need to group parts of a pattern without capturing them. Non-capturing groups help in such cases.

Syntax

  • Pattern: (?:subpattern)

Why Use Non-Capturing Groups?

  • Performance: Avoids storing captures you do not need, which can make the regex slightly cheaper to run.

  • Clarity: Keeps the numbering of capturing groups consistent.

Example

import re

text = "redapple greenapple yellowapple"
pattern = r'(?:red|green|yellow)(apple)'

matches = re.finditer(pattern, text)
for m in matches:
    print(m, m.groups())
<re.Match object; span=(0, 8), match='redapple'> ('apple',)
<re.Match object; span=(9, 19), match='greenapple'> ('apple',)
<re.Match object; span=(20, 31), match='yellowapple'> ('apple',)
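
To see the effect on group numbering, the following sketch compares a capturing and a non-capturing version of the colour group on the same text, using re.findall, which returns the captured groups when a pattern contains any.

```python
import re

text = "redapple greenapple yellowapple"

# Capturing colour group: colour is group 1, 'apple' becomes group 2
capturing = re.findall(r'(red|green|yellow)(apple)', text)
print(capturing)

# Non-capturing colour group: 'apple' stays group 1
non_capturing = re.findall(r'(?:red|green|yellow)(apple)', text)
print(non_capturing)
```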

Nested Groups#

Groups can be nested within other groups. Numbering is assigned by the opening parenthesis from left to right.

import re

text = "abc123"
pattern = r'((a)(b)(c))(\d{3})'

match = re.search(pattern, text)
if match:
    print("Group 0:", match.group(0))
    print("Group 1:", match.group(1))
    print("Group 2:", match.group(2))
    print("Group 3:", match.group(3))
    print("Group 4:", match.group(4))
    print("Group 5:", match.group(5))
Group 0: abc123
Group 1: abc
Group 2: a
Group 3: b
Group 4: c
Group 5: 123

Backreferences#

Backreferences allow you to match the same text as previously matched by a capturing group.

Syntax

  • Pattern: \number where number is the group number.

Example: Matching Duplicate Words

import re

text = "This is is a test test."
pattern = r'\b(\w+)\s+\1\b'
# \b asserts a position at a word boundary
# \w+ matches 1 or more occurrences of any word character 
# \s+ matches 1 or more occurrences of any whitespace character
# \1 matches the same text as most recently matched by the 1st capturing group


matches = re.findall(pattern, text)
print("Duplicate words:", matches)
Duplicate words: ['is', 'test']

Using Groups in Substitutions#

Groups are extremely useful in the re.sub function for performing complex text substitutions.

Example: Swapping First and Last Names

import re

text = "Doe, John"
pattern = r'(\w+),\s+(\w+)'
replacement = r'\2 \1'

result = re.sub(pattern, replacement, text)
print("Reformatted name:", result)
Reformatted name: John Doe
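
Named groups can be used in replacements as well, through the \g<name> syntax. The sketch below is an illustration with a made-up sentence; it reuses the date pattern from the named groups section to reorder every date in the text.

```python
import re

text = "Released on 2023-09-17, updated on 2024-01-05"
pattern = r'(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})'

# \g<name> refers to a named group inside the replacement string
result = re.sub(pattern, r'\g<day>.\g<month>.\g<year>', text)
print(result)
```

Because re.sub replaces every occurrence, both dates are rewritten in one call.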

Summary#

Groups in regular expressions are powerful tools that allow you to:

  • Capture specific parts of the text.

  • Refer back to those parts within your pattern or replacement text.

  • Organize complex patterns into manageable sections.

Understanding groups is like having a well-organized toolbox—you know exactly where each tool is and how to use it effectively.

Note

  • Capturing Groups: Use parentheses () to capture.

  • Named Groups: Use (?P<name>) for better readability.

  • Non-Capturing Groups: Use (?:) when you need grouping without capturing.

  • Backreferences: Use \number to refer back to captured groups.

  • Substitutions: Use groups in re.sub to manipulate text.

With groups, your regular expressions become more flexible and powerful, enabling you to handle complex text processing tasks with ease.

Lookahead and Lookbehind#

Lookahead and lookbehind are assertions in regular expressions that allow you to match a pattern only if it is (or isn’t) followed or preceded by another pattern, without including that surrounding text in the match.

Lookahead#

Positive Lookahead (?=...)#

  • Syntax: X(?=Y)

  • Meaning: Match X only if it is followed by Y.

Example:

  • Pattern: \w+(?=\s+car)

  • Text: "red car, blue bike, green car"

  • Matches: "red", "green"

import re

text = "red car, blue bike, green car"
pattern = r'\w+(?=\s+car)'

matches = re.finditer(pattern, text)
for m in matches:
    print(m, m.group())
<re.Match object; span=(0, 3), match='red'> red
<re.Match object; span=(20, 25), match='green'> green

Negative Lookahead (?!...)#

  • Syntax: X(?!Y)

  • Meaning: Match X only if it is not followed by Y.

Example:

  • Pattern: \w+(?:\s+)(?!car)

  • Text: "red car, blue bike, green car"

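  • Matches: "blue " (the trailing space is part of the match, because \s+ is consumed before the lookahead is checked)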

text = "red car, blue bike, green car"

pattern = r'\w+(?:\s+)(?!car)'
matches = re.finditer(pattern, text)
for m in matches:
    print(m)
<re.Match object; span=(9, 14), match='blue '>

Lookbehind#

Positive Lookbehind (?<=...)#

  • Syntax: (?<=Y)X

  • Meaning: Match X only if it is preceded by Y.

Example:

  • Pattern: (?<=\$)\d+

  • Text: "Price: $100, $200"

  • Matches: "100", "200"

text = "Price: $100, $200"
pattern = r'(?<=\$)\d+'
matches = re.finditer(pattern, text)

for m in matches:
    print(m, m.group())
<re.Match object; span=(8, 11), match='100'> 100
<re.Match object; span=(14, 17), match='200'> 200

Negative Lookbehind (?<!...)#

  • Syntax: (?<!Y)X

  • Meaning: Match X only if it is not preceded by Y.

Example:

  • Pattern: (?<!\$)\b\d+\b

  • Text: "Items: $5, 10, $15, 20"

  • Matches: "10", "20"

text = "Items: $5, 10, $15, 20"
pattern = r'(?<!\$)\b\d+\b'
matches = re.finditer(pattern, text)

for m in matches:
    print(m, m.group())
<re.Match object; span=(11, 13), match='10'> 10
<re.Match object; span=(20, 22), match='20'> 20

Summary#

  • Positive Lookahead (?=...): Ensures a pattern follows.

  • Negative Lookahead (?!...): Ensures a pattern does not follow.

  • Positive Lookbehind (?<=...): Ensures a pattern precedes.

  • Negative Lookbehind (?<!...): Ensures a pattern does not precede.

Note

You can combine lookaheads and lookbehinds for precise matching.
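
As a minimal sketch of such a combination, the pattern below extracts words that sit directly between square brackets without including the brackets in the match. The sample text is made up for illustration.

```python
import re

text = "[alpha] beta [gamma] (delta)"

# (?<=\[) requires '[' right before the word,
# (?=\]) requires ']' right after it; neither bracket is consumed
pattern = r'(?<=\[)\w+(?=\])'

bracketed = re.findall(pattern, text)
print(bracketed)
```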

Examples of re usage#

Example 1: Extracting URLs from text#

Objective: Extract all URLs from a block of text, including those starting with http://, https://, or www., and capture the domain and path separately.

Techniques Used:

  • Groups: To capture the domain and path.

  • Quantifiers: To match variable-length patterns.

  • Alternation: To accept either a protocol (http://, https://) or www. as the prefix.

  • Character Classes: To match specific sets of characters.

  • re.findall Function: To find all occurrences.

import re

text = """
Visit our website at https://www.example.com/path/to/page or our sister site at http://example.org.
Don't forget to check out www.example.net for more information.
"""

pattern = r'(https?://|www\.)'            # Protocol or www
pattern += r'([A-Za-z0-9.-]+)'            # Domain
pattern += r'(\.[A-Za-z]{2,6})'           # Top-level domain
pattern += r'(/[A-Za-z0-9./?=&%-]*)?'     # Optional path

Explanation:

  • (https?://|www\.): Matches http://, https://, or www. using a group and alternation |.

  • ([A-Za-z0-9.-]+): Captures the domain name.

  • (\.[A-Za-z]{2,6}): Captures the top-level domain (e.g., .com, .org).

  • (/[A-Za-z0-9./?=&%-]*)?: Optionally captures the path and query string.

matches = re.findall(pattern, text)
for match in matches:
    protocol_or_www, domain, tld, path = match
    url = ''.join(match)
    print(f"Full URL: {url}")
    print(f"Protocol or 'www.': {protocol_or_www}")
    print(f"Domain: {domain}{tld}")
    print(f"Path: {path if path else '/'}\n")
Full URL: https://www.example.com/path/to/page
Protocol or 'www.': https://
Domain: www.example.com
Path: /path/to/page

Full URL: http://example.org
Protocol or 'www.': http://
Domain: example.org
Path: /

Full URL: www.example.net
Protocol or 'www.': www.
Domain: example.net
Path: /

Example 2: Validating and Masking Credit Card Numbers#

Objective: Identify credit card numbers in a text, validate their format, and mask all but the last four digits for security.

Techniques Used:

  • Match Objects: To read the matched number via match.group() inside the replacement function.

  • Quantifiers: To specify exact and optional repetitions.

  • Lookbehind and Lookahead: To ensure the number is not part of a longer run of digits.

  • re.sub Function: For substitution and masking.

  • Non-Capturing Groups: To repeat the four-digit block without capturing it.

import re

text = """
Customer data:
Name: John Doe
Credit Card: 1234-5678-9012-3456
SSN: 123-45-6789

Name: Jane Smith
Credit Card: 9876 5432 1098 7654
"""

# Pattern to match credit card numbers
pattern = r'(?<!\d)(?:\d{4}[-\s]?){3}\d{4}(?!\d)'

Explanation:

  • (?<!\d): Negative lookbehind to ensure the number is not part of a longer number.

  • (?:\d{4}[-\s]?){3}: Non-capturing group matching three sets of four digits, optionally followed by - or space.

  • \d{4}: Matches the last four digits.

  • (?!\d): Negative lookahead to ensure no digit follows.

# Function to mask credit card numbers
def mask_cc(match):
    full_cc = match.group()
    masked_cc = '****-****-****-' + full_cc[-4:]
    return masked_cc

# Masking credit card numbers
masked_text = re.sub(pattern, mask_cc, text)
print(masked_text)
Customer data:
Name: John Doe
Credit Card: ****-****-****-3456
SSN: 123-45-6789

Name: Jane Smith
Credit Card: ****-****-****-7654

Summary#

Regular expressions, commonly known as regex, are sequences of characters that define search patterns, primarily used for string matching and manipulation. In Python, the re module provides support for working with regular expressions.

Core Components

  • Metacharacters: Special characters that have unique meanings (e.g., ., ^, $, *, +, ?, {}, [], |, (), \).

  • Quantifiers: Specify the number of occurrences (*, +, ?, {n}, {n,}, {,m}, {n,m}).

  • Character Sets: Use square brackets [] to define a set of characters to match.

  • Groups: Use parentheses () to capture parts of the matched text for reuse.

  • Lookahead and Lookbehind: Allow assertions about what precedes or follows the current position without including it in the match.