Regular expressions (called REs, regexes, or regex patterns) are a powerful tool for matching text patterns in Python using the re module.

Getting Started

Compiling Patterns

Compile a regular expression pattern into a pattern object for reuse:
import re

p = re.compile('[a-z]+')
result = p.match('tempo')

Basic Matching

Match patterns at the beginning of strings:
import re

p = re.compile('[a-z]+')
m = p.match('tempo')
if m:
    print('Match found:', m.group())
else:
    print('No match')

Pattern Syntax

Character Classes

Match specific sets of characters:
  • [abc] - matches ‘a’, ‘b’, or ‘c’
  • [a-z] - matches any lowercase letter
  • [^5] - matches any character except ‘5’
p = re.compile('[a-z]+')
p.match('abc')  # Matches
p.match('123')  # Doesn't match
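
The negated class from the list above can be sketched the same way. The sample strings here are made up for illustration; `[^5]` accepts any single character other than '5':

```python
import re

# [^5] matches exactly one character that is not '5'
not_five = re.compile(r'[^5]')

first = not_five.match('4')      # match object: '4' is not '5'
second = not_five.match('5')     # None: the only character is '5'
found = not_five.search('55a5')  # search() scans ahead to the 'a'

print(first.group(), second, found.group())
```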

Special Sequences

Pre-defined character sets:
  • \d - any decimal digit [0-9]
  • \D - any non-digit [^0-9]
  • \s - any whitespace character
  • \S - any non-whitespace character
  • \w - any word character: a letter, digit, or underscore ([a-zA-Z0-9_] for ASCII text)
  • \W - any non-word character
# Extract all digits from a string
p = re.compile(r'\d+')
p.findall('12 drummers drumming, 11 pipers piping')
# Returns: ['12', '11']
Always use raw strings (prefix with r) for regex patterns to avoid backslash issues:
# Good
re.compile(r'\bclass\b')

# Bad - Python turns each '\b' into a backspace character before re sees the pattern
re.compile('\bclass\b')
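
To see why the raw-string prefix matters, a small sketch (the sample text is an assumption for illustration) compares the two literals directly:

```python
import re

plain = '\bclass\b'   # Python converts each '\b' to a backspace character
raw = r'\bclass\b'    # backslashes survive, so re sees two word boundaries

print(len(plain), len(raw))  # 7 9

text = 'no class at all'
raw_hit = re.search(raw, text)      # matches the word 'class'
plain_hit = re.search(plain, text)  # None: text contains no backspace characters
print(raw_hit.group(), plain_hit)   # class None
```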

Repetition

Specify how many times to match:
  • * - 0 or more times
  • + - 1 or more times
  • ? - 0 or 1 time
  • {m,n} - at least m, at most n times
# Match 'a' followed by zero or more 'b's
p = re.compile('ab*')
p.match('a')     # Matches
p.match('ab')    # Matches
p.match('abb')   # Matches
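
The other qualifiers in the list behave similarly. A minimal sketch of `?` and `{m,n}`, using made-up patterns and inputs:

```python
import re

optional = re.compile(r'colou?r')  # the 'u' may appear zero or one time
bounded = re.compile(r'a{2,3}')    # two or three consecutive 'a' characters

both_spellings = optional.findall('color colour')
too_few = bounded.match('a')             # None: fewer than two 'a's
capped = bounded.match('aaaa').group()   # greedy, but stops at the upper bound

print(both_spellings)  # ['color', 'colour']
print(too_few)         # None
print(capped)          # aaa
```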

Searching and Finding

Choose the Right Method

Different methods for different needs:
import re

p = re.compile(r'\d+')
text = 'There are 12 drummers and 11 pipers'

# match() - checks beginning only
p.match(text)  # Returns None

# search() - finds first match anywhere
m = p.search(text)
m.group()  # Returns '12'

# findall() - returns all matches as list
p.findall(text)  # Returns ['12', '11']

# finditer() - returns iterator of match objects
for match in p.finditer(text):
    print(match.group(), 'at position', match.start())
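
Extract Match Details

Get information about matches:
# Use a letter pattern here; the \d+ pattern above would find nothing in this string
p = re.compile(r'[a-z]+')
m = p.search('::: message')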
if m:
    print(m.group())   # The matched string
    print(m.start())   # Starting position
    print(m.end())     # Ending position
    print(m.span())    # Tuple of (start, end)

Grouping

Capture parts of the pattern:
# Parse RFC-822 header
p = re.compile(r'(\w+):\s*(.+)')
m = p.match('From: user@example.com')

if m:
    print(m.group(0))  # Entire match
    print(m.group(1))  # First group: 'From'
    print(m.group(2))  # Second group: 'user@example.com'
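
When every group is needed at once, `groups()` returns them as a tuple, and `span()` accepts a group number. A short sketch with a made-up header line:

```python
import re

header = re.compile(r'(\w+):\s*(.+)')
m = header.match('Subject: weekly report')

name, value = m.groups()  # all captured groups, in order
value_span = m.span(2)    # (start, end) of the second group only

print(name)        # Subject
print(value)       # weekly report
print(value_span)  # (9, 22)
```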

Named Groups

Use names instead of numbers:
p = re.compile(r'(?P<word>\b\w+\b)')
m = p.search('(((( Lots of punctuation ))))')

print(m.group('word'))  # 'Lots'
print(m.group(1))       # Also 'Lots'
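
Named groups pay off most when a pattern has several parts. The sketch below uses an illustrative date pattern (not from the text above) and `groupdict()` to collect all named groups into a dictionary:

```python
import re

date = re.compile(r'(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})')
m = date.search('Released on 2024-03-15.')

parts = m.groupdict()  # maps each group name to its matched text
print(parts)           # {'year': '2024', 'month': '03', 'day': '15'}
print(m.group('day'))  # 15
```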

Backreferences

Match repeated patterns:
# Find doubled words
p = re.compile(r'\b(\w+)\s+\1\b')
p.search('Paris in the the spring').group()
# Returns: 'the the'

String Modification

Splitting

Split strings on pattern matches:
p = re.compile(r'\W+')
p.split('This is a test, short and sweet.')
# Returns: ['This', 'is', 'a', 'test', 'short', 'and', 'sweet', '']
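
Two variations are worth knowing, sketched here with the module-level `re.split`: `maxsplit` limits the number of splits, and a capturing group in the pattern keeps the separators in the result:

```python
import re

text = 'This is a test, short and sweet.'

# Stop after two splits; the remainder stays in one piece
limited = re.split(r'\W+', text, maxsplit=2)
print(limited)  # ['This', 'is', 'a test, short and sweet.']

# Parentheses around the pattern return the delimiters too
with_seps = re.split(r'(\W+)', 'a, b')
print(with_seps)  # ['a', ', ', 'b']
```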

Substitution

Replace pattern matches:
p = re.compile(r'section{\s*(\w+)\s*}')
p.sub(r'subsection{\1}', 'section{ Intro }')
# Returns: 'subsection{Intro}'
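
The replacement can also be a function that receives each match object, and `subn()` reports how many replacements were made. A minimal sketch; the `double` helper and the sample strings are hypothetical:

```python
import re

def double(match):
    # Hypothetical helper: replace each number with twice its value
    return str(int(match.group()) * 2)

doubled = re.sub(r'\d+', double, '3 apples and 10 pears')
print(doubled)  # 6 apples and 20 pears

# subn() returns the new string plus the number of substitutions
collapsed, count = re.subn(r'\s+', ' ', 'too   many    spaces')
print(collapsed, count)  # too many spaces 2
```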

Compilation Flags

Modify pattern behavior:
# Case-insensitive matching
p = re.compile('[a-z]+', re.IGNORECASE)
p.match('SPAM')  # Matches

# Multi-line mode
p = re.compile('^From', re.MULTILINE)

# Dot matches newlines
p = re.compile('a.*b', re.DOTALL)

# Verbose mode for readable patterns
charref = re.compile(r"""
 &[#]                # Start of numeric entity
 (
     0[0-7]+         # Octal form
   | [0-9]+          # Decimal form  
   | x[0-9a-fA-F]+   # Hexadecimal form
 )
 ;                   # Trailing semicolon
""", re.VERBOSE)

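The effect of MULTILINE and DOTALL is easiest to see side by side. This sketch uses made-up input and also shows combining flags with `|`:

```python
import re

text = 'To: a\nFrom: b\nFrom: c'

# Without MULTILINE, ^ matches only at the very start of the string
no_flag = re.findall(r'^From', text)
multiline = re.findall(r'^From', text, re.MULTILINE)
print(no_flag, multiline)  # [] ['From', 'From']

# Without DOTALL, '.' does not match a newline
dot_default = re.search(r'a.*b', 'a\nb')
dot_all = re.search(r'a.*b', 'a\nb', re.DOTALL)
print(dot_default, repr(dot_all.group()))  # None 'a\nb'

# Flags combine with the bitwise OR operator
combined = re.findall(r'^from', text, re.IGNORECASE | re.MULTILINE)
print(combined)  # ['From', 'From']
```
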
Common Patterns

Email Validation

email_pattern = re.compile(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b')
email_pattern.match('user@example.com')  # Matches (placeholder address)
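
Note that `match()` only anchors the start, so trailing text after a valid-looking address is still accepted. For validation, `fullmatch()` requires the whole string to fit. The sketch below uses a simplified pattern and a hypothetical `looks_like_email` helper; real addresses are more varied than this pattern allows:

```python
import re

# Simplified pattern for illustration only, not a complete address grammar
email = re.compile(r'[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}')

def looks_like_email(candidate):
    # Hypothetical helper: True only if the entire string fits the pattern
    return email.fullmatch(candidate) is not None

ok = looks_like_email('user@example.com')
trailing = looks_like_email('user@example.com and more')
no_tld = looks_like_email('user@example')

print(ok, trailing, no_tld)  # True False False
```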

Phone Numbers

phone_pattern = re.compile(r'\d{3}[-.]?\d{3}[-.]?\d{4}')
phone_pattern.findall('Call 555-123-4567 or 555.987.6543')
# Returns: ['555-123-4567', '555.987.6543']

URL Extraction

url_pattern = re.compile(r'https?://[^\s]+')
url_pattern.findall('Visit https://python.org and http://docs.python.org')

Best Practices

Performance Tip: Compile patterns that are used multiple times:
# Good - compile once
pattern = re.compile(r'\d+')
for line in large_file:
    pattern.search(line)

# Slower - re caches compiled patterns, but every call still pays for the cache lookup
for line in large_file:
    re.search(r'\d+', line)
When NOT to Use Regex: For simple string operations, use string methods:
# Use this
if 'python' in text.lower():
    ...

# Not this  
if re.search(r'python', text, re.IGNORECASE):
    ...

Common Gotchas

Greedy vs Non-Greedy

By default, repetition is greedy:
# Greedy (matches as much as possible)
re.match(r'<.*>', '<h1>Title</h1>').group()
# Returns: '<h1>Title</h1>'

# Non-greedy (matches as little as possible)
re.match(r'<.*?>', '<h1>Title</h1>').group()
# Returns: '<h1>'
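
A negated character class is a common alternative to the non-greedy qualifier, since it states exactly where the match must stop. A small sketch on the same input:

```python
import re

html = '<h1>Title</h1>'

# [^>]* cannot run past the first '>', so no backtracking over the rest is needed
tags = re.findall(r'<[^>]*>', html)
lazy = re.findall(r'<.*?>', html)

print(tags)          # ['<h1>', '</h1>']
print(tags == lazy)  # True
```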

Anchors Matter

# Match at beginning only
re.match(r'\d+', 'abc 123')  # None

# Search anywhere
re.search(r'\d+', 'abc 123')  # Matches '123'
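
The end of the string is a related trap: `match()` anchors only the start. A short sketch with made-up inputs, contrasting it with `fullmatch()` and explicit anchors:

```python
import re

prefix = re.match(r'\d+', '123abc')        # succeeds: only the start is checked
whole = re.fullmatch(r'\d+', '123abc')     # None: 'abc' is left over
anchored = re.search(r'^\d+$', '123abc')   # None: explicit anchors behave the same here
exact = re.fullmatch(r'\d+', '123')        # succeeds: the entire string is digits

print(prefix.group(), whole, anchored, exact.group())  # 123 None None 123
```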

Reference

Key re module functions:
  • compile(pattern, flags=0) - Compile a pattern
  • match(pattern, string, flags=0) - Match at string start
  • search(pattern, string, flags=0) - Search anywhere
  • findall(pattern, string, flags=0) - Find all matches
  • sub(pattern, repl, string, count=0, flags=0) - Replace matches
  • split(pattern, string, maxsplit=0, flags=0) - Split on pattern
