Overview
The text preprocessing pipeline transforms raw input text into clean, normalized tokens suitable for machine learning models. This involves lowercasing, contraction expansion, special character removal, and lemmatization.
Clean Text Function
The clean_text function is the core preprocessing component:
```python
import re
import nltk
from nltk.stem import WordNetLemmatizer

nltk.download('omw-1.4')
nltk.download('wordnet')

lemmatizer = WordNetLemmatizer()
```
```python
def clean_text(text):
    """
    Normalize raw text and lemmatize it.
    """
    # Implementation details below...
```
Preprocessing Steps
1. Lowercasing
Converts all characters to lowercase for case-insensitive matching.
2. Contraction Expansion
The function expands common English contractions using regex substitution:

```python
text = re.sub(r"what's", "what is ", str(text))
text = re.sub(r"'s", " ", str(text))
text = re.sub(r"'ve", " have ", str(text))
text = re.sub(r"can't", "cannot ", str(text))
text = re.sub(r"ain't", 'is not', str(text))
text = re.sub(r"won't", 'will not', str(text))
text = re.sub(r"n't", " not ", str(text))
text = re.sub(r"i'm", "i am ", str(text))
text = re.sub(r"'re", " are ", str(text))
text = re.sub(r"'d", " would ", str(text))
text = re.sub(r"'ll", " will ", str(text))
text = re.sub(r"'scuse", " excuse ", str(text))
```
Contraction expansion helps normalize text by converting informal speech patterns into their full forms, improving model consistency.
Supported Contractions
| Contraction | Expansion |
|---|---|
| what's | what is |
| 's | (removed) |
| 've | have |
| can't | cannot |
| ain't | is not |
| won't | will not |
| n't | not |
| i'm | i am |
| 're | are |
| 'd | would |
| 'll | will |
| 'scuse | excuse |
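As a quick illustration, the rules can be applied in sequence; note that order matters (e.g., `won't` must be handled before the generic `n't` rule). The helper `expand_contractions` below is illustrative only, not part of the project:

```python
import re

# Ordered contraction rules, mirroring a subset of those in clean_text
RULES = [
    (r"what's", "what is "),
    (r"won't", "will not"),
    (r"can't", "cannot "),
    (r"n't", " not "),
    (r"'re", " are "),
]

def expand_contractions(text):
    # Apply each rule in order; earlier, more specific rules win
    for pattern, replacement in RULES:
        text = re.sub(pattern, replacement, text)
    # Collapse the extra spaces some replacements introduce
    return re.sub(r" +", " ", text).strip()

print(expand_contractions("what's done, we won't undo"))
# -> "what is done, we will not undo"
```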
3. Special Character Removal
```python
text = re.sub('W', ' ', str(text))
text = re.sub(' +', ' ', str(text))
```

The pattern `'W'` removes literal `W` characters, which is almost certainly a typo for `'\W'` (non-word characters). Since the text has already been lowercased at this point, the literal pattern matches nothing.
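Assuming the intended pattern was `\W`, the corrected substitution would behave like this (a sketch, not the shipped code):

```python
import re

text = "hello, world! 42"
# '\W' matches any non-word character (anything other than [A-Za-z0-9_])
cleaned = re.sub(r"\W", " ", text)
cleaned = re.sub(r" +", " ", cleaned)  # collapse runs of spaces
print(cleaned)  # -> "hello world 42"
```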
4. Hyperlink Removal
```python
text = re.sub(r"https?://\S+|www\.\S+", ' ', str(text))
```

Removes URLs starting with `http://`, `https://`, or `www.`. Removing URLs prevents the model from learning spurious correlations with specific domains and reduces noise in text analysis.
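A small demonstration of the same pattern on sample text (the collapse-and-strip step mirrors what the surrounding pipeline does):

```python
import re

URL_PATTERN = r"https?://\S+|www\.\S+"

sample = "see https://example.com/page and www.example.org today"
cleaned = re.sub(URL_PATTERN, " ", sample)      # drop both URL forms
cleaned = re.sub(r" +", " ", cleaned).strip()   # tidy the leftover spaces
print(cleaned)  # -> "see and today"
```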
5. Punctuation and Number Removal
```python
text = re.sub('[^A-Za-z0-9]+', ' ', str(text))
```

Keeps only alphanumeric characters, replacing everything else with spaces.
6. Lemmatization
```python
text = lemmatizer.lemmatize(text)
```

Reduces words to their base/dictionary form using NLTK's WordNetLemmatizer. Note, however, that `lemmatize` expects a single word; called on the whole string as above, it leaves multi-word text effectively unchanged. Lemmatizing token by token would be required for the intended effect.

Lemmatization vs. Stemming: Lemmatization produces actual dictionary words (e.g., "studies" → "study"), while stemming applies suffix-stripping rules that may produce non-words (e.g., "studies" → "studi").
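A per-token loop is the usual fix; sketched here with a toy lemma table standing in for NLTK's WordNetLemmatizer so the example runs without downloaded corpora:

```python
# Toy lemma table standing in for NLTK's WordNetLemmatizer
LEMMAS = {"cats": "cat", "studies": "study", "corpora": "corpus"}

def lemmatize_word(word):
    # Fall back to the word itself when no lemma is known
    return LEMMAS.get(word, word)

def lemmatize_text(text):
    # Lemmatize each token individually, then rejoin
    return " ".join(lemmatize_word(tok) for tok in text.split())

print(lemmatize_text("cats love studies"))  # -> "cat love study"
```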
7. Whitespace Trimming
```python
text = text.strip(' ')
return text
```

Removes leading and trailing spaces from the final result.
Complete Pipeline
The full preprocessing function:
```python
def clean_text(text):
    # Expanding apostrophe contractions
    text = text.lower()
    text = re.sub(r"what's", "what is ", str(text))
    text = re.sub(r"'s", " ", str(text))
    text = re.sub(r"'ve", " have ", str(text))
    text = re.sub(r"can't", "cannot ", str(text))
    text = re.sub(r"ain't", 'is not', str(text))
    text = re.sub(r"won't", 'will not', str(text))
    text = re.sub(r"n't", " not ", str(text))
    text = re.sub(r"i'm", "i am ", str(text))
    text = re.sub(r"'re", " are ", str(text))
    text = re.sub(r"'d", " would ", str(text))
    text = re.sub(r"'ll", " will ", str(text))
    text = re.sub(r"'scuse", " excuse ", str(text))
    text = re.sub('W', ' ', str(text))
    text = re.sub(' +', ' ', str(text))
    # Remove hyperlinks
    text = re.sub(r"https?://\S+|www\.\S+", ' ', str(text))
    # Remove punctuation, numbers and special characters
    text = re.sub('[^A-Za-z0-9]+', ' ', str(text))
    text = lemmatizer.lemmatize(text)
    text = text.strip(' ')
    return text
```
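For comparison, a self-contained variant that addresses the two issues noted above (the literal `'W'` pattern and the whole-string lemmatizer call) might look like this. It is a sketch, not the shipped code, and swaps in a trivial identity lemmatizer so it runs without NLTK data:

```python
import re

def lemmatize_word(word):
    # Placeholder: substitute WordNetLemmatizer().lemmatize in real use
    return word

def clean_text_fixed(text):
    text = str(text).lower()
    # Expand contractions (specific rules before the generic n't rule)
    for pattern, repl in [
        (r"what's", "what is "), (r"won't", "will not"),
        (r"can't", "cannot "), (r"ain't", "is not"),
        (r"n't", " not "), (r"i'm", "i am "),
        (r"'ve", " have "), (r"'re", " are "),
        (r"'d", " would "), (r"'ll", " will "), (r"'s", " "),
    ]:
        text = re.sub(pattern, repl, text)
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # strip URLs
    text = re.sub(r"[^A-Za-z0-9]+", " ", text)          # keep alphanumerics
    # Lemmatize per token rather than on the whole string
    text = " ".join(lemmatize_word(w) for w in text.split())
    return text.strip()

print(clean_text_fixed("I can't believe what's happening! Visit https://example.com now."))
# -> "i cannot believe what is happening visit now"
```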
Vectorization
After cleaning, text is vectorized for the emotion classifier:
```python
text_cleaned = clean_text(text)
text_vector = vectorizer.fit_transform([text_cleaned])
```

Note: calling `fit_transform` on inference data is unconventional. Typically the vectorizer is fitted once on training data, and only `transform` is used at inference so the feature space stays consistent.
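A sketch of the conventional pattern, fitting once on training data and reusing `transform` at inference. The vectorizer type and corpus here are assumptions for illustration; the document does not specify them:

```python
from sklearn.feature_extraction.text import TfidfVectorizer  # assumed vectorizer type

train_corpus = ["i feel happy today", "this is sad news"]  # illustrative data
vectorizer = TfidfVectorizer()
vectorizer.fit(train_corpus)               # fit once, on training data only

new_text = "happy news today"
vector = vectorizer.transform([new_text])  # reuse the fitted vocabulary
print(vector.shape)                        # one row, fixed-size feature space
```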
Word Embedding Conversion
For the neural network, cleaned text undergoes a different transformation:
```python
def converter(words):
    return average_tensor(
        embedding(torch.tensor(coded_words(words.split(), word_dict)))
    )
```
Process:

- Tokenization: `words.split()` splits on whitespace
- Encoding: `coded_words()` maps words to indices using `word_dict`
- Embedding: a PyTorch embedding layer converts indices to 10-dimensional vectors
- Averaging: `average_tensor()` computes the mean embedding
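The same steps can be sketched without PyTorch, using a plain lookup table. Names like `word_dict` and `converter` mirror the originals; the vocabulary and 3-dimensional vectors (rather than the model's 10) are illustrative assumptions:

```python
# Toy vocabulary and embedding table (3-dimensional for brevity)
word_dict = {"i": 0, "feel": 1, "happy": 2}
embedding_table = [
    [0.1, 0.2, 0.3],   # i
    [0.4, 0.5, 0.6],   # feel
    [0.7, 0.8, 0.9],   # happy
]

def coded_words(tokens, word_dict):
    # Map each token to its vocabulary index (unknown words skipped here)
    return [word_dict[t] for t in tokens if t in word_dict]

def converter(words):
    indices = coded_words(words.split(), word_dict)   # tokenize + encode
    vectors = [embedding_table[i] for i in indices]   # embed
    # Average element-wise across all token vectors
    return [sum(col) / len(vectors) for col in zip(*vectors)]

print(converter("i feel happy"))
```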
The embedding conversion uses the raw (uncleaned) text, while the emotion classifier uses cleaned text. This dual approach may be intentional to preserve information for different models.
Example

Input:

```
"I can't believe what's happening! Visit https://example.com for more info."
```

After Preprocessing:

```
"i cannot believe what is happening visit for more info"
```
Changes Applied:
- Lowercased
- `can't` → `cannot`
- `what's` → `what is`
- Exclamation mark removed
- URL removed
- Extra spaces normalized
- Whitespace trimmed
NLTK Dependencies
The preprocessing pipeline requires NLTK data packages:
```python
nltk.download('omw-1.4')  # Open Multilingual WordNet
nltk.download('wordnet')  # WordNet lexical database
```
These downloads occur at runtime when the application starts, ensuring the required NLTK data is available for lemmatization.
Additional Text Processing
Beyond emotion prediction, the system includes specialized extraction functions:
Date Extraction

```python
def get_dates(text):
    # Matches formats like:
    #   dd/mm/yyyy, dd-mm-yyyy
    #   dd MMM yyyy
    #   Mar-20-2009, March 20, 2009
    ...
```
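A hedged sketch of what such a pattern might look like for the numeric formats only; the project's actual regex is not shown in the source:

```python
import re

# Matches dd/mm/yyyy and dd-mm-yyyy (a sketch, not the project's pattern)
DATE_PATTERN = re.compile(r"\b\d{1,2}[/-]\d{1,2}[/-]\d{4}\b")

def get_dates(text):
    # Return every numeric date-like substring found in the text
    return DATE_PATTERN.findall(text)

print(get_dates("Due 20/03/2009, shipped 4-11-2010."))
# -> ['20/03/2009', '4-11-2010']
```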
Country Detection
```python
def find_country(text):
    for country in pycountry.countries:
        if country.name in text:
            # Add to results
            ...
```
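A self-contained sketch of the same substring approach, with a short hardcoded list standing in for `pycountry.countries`:

```python
# Small stand-in for pycountry.countries; the real code iterates the full list
COUNTRIES = ["France", "Germany", "India", "Japan"]

def find_country(text):
    # Naive substring matching, as in the original: "India" would also
    # match inside "Indiana", so real use may want word-boundary checks
    return [c for c in COUNTRIES if c in text]

print(find_country("Flights from India to Japan resume."))  # -> ['India', 'Japan']
```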
Name Extraction

```python
def get_people_names(text):
    tokens = nltk.word_tokenize(text)
    tagged = nltk.pos_tag(tokens)
    # Extract proper nouns (NNP tags)
    ...
```
These extraction functions operate on the original text before preprocessing to preserve proper nouns, capitalization, and formatting needed for entity recognition.
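Since `nltk.pos_tag` needs downloaded tagger models, here is a simplified capitalization-based stand-in that illustrates extracting candidate proper nouns; it is a crude heuristic, not the project's NNP-based method:

```python
import re

def candidate_proper_nouns(text):
    # Split into alphabetic tokens only
    tokens = re.findall(r"[A-Za-z]+", text)
    # Keep capitalized tokens, skipping the first token (which may be
    # capitalized only because it starts the sentence)
    return [tok for i, tok in enumerate(tokens)
            if tok[0].isupper() and i > 0]

print(candidate_proper_nouns("yesterday Alice met Bob in Paris"))
# -> ['Alice', 'Bob', 'Paris']
```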