Exclusion lists allow you to filter out common or generic terms that appear repeatedly across documents. This improves Elasticsearch searchability by ensuring tags are specific and distinctive to each document.
The system uses a two-layer approach: the AI is instructed to avoid excluded terms during generation, and post-processing filters out any that slip through.
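The two layers can be sketched as follows; `build_prompt` and `filter_tags` are illustrative names, not the actual API:

```python
def build_prompt(num_tags, exclusion_words):
    """Layer 1: instruct the model up front to avoid excluded terms."""
    return (f"Generate {num_tags} metadata tags for this document.\n"
            f"**IMPORTANT**: Do NOT use these terms: "
            f"{', '.join(sorted(exclusion_words))}")

def filter_tags(raw_tags, exclusion_words):
    """Layer 2: drop any excluded terms the model produced anyway."""
    exclusion_set = {w.lower() for w in exclusion_words}
    return [t for t in raw_tags if t.lower() not in exclusion_set]

exclusions = {"annual-report", "government-of-india"}
print(filter_tags(["pmkvy", "annual-report", "skill-development"], exclusions))
# ['pmkvy', 'skill-development']
```

Layer 2 is the safety net: even a well-prompted model occasionally emits a forbidden term, so the final filter is what actually guarantees exclusion.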
Upload a .txt or .pdf file containing terms to exclude.
Supported Formats:
.txt: Plain text file (recommended)
.pdf: PDF containing exclusion terms
Text Encoding:
Auto-detected with chardet
Supports UTF-8, Latin-1, CP1252, ISO-8859-1
Falls back to UTF-8 with error replacement
2. Parse Exclusion Terms
The system parses the file and extracts terms:
# Line-by-line format
government-of-india
ministry-of-social-justice
annual-report

# Comma-separated format
scheme, yojana, program, initiative

# Mixed format
pmkvy, skill-development
training-manual
newsletter, circular

# Comments (ignored)
# This is a comment
some-term # Inline comments are not supported; the '#' is parsed as part of the term
Parsing Rules:
Lines starting with # are ignored
Empty lines are ignored
Terms are converted to lowercase
Whitespace is trimmed
Commas split terms on same line
3. Pre-Generation Filtering (Layer 1)
The AI is instructed to avoid excluded terms in its system prompt:
prompt = f"""Generate {num_tags} metadata tags for this document.

**IMPORTANT**: Do NOT use these terms (they are too generic):
{', '.join(exclusion_words)}

Generate unique, specific tags that distinguish this document."""
Example:
**IMPORTANT**: Do NOT use these terms:
government-of-india, ministry-of-social-justice, annual-report, newsletter, policy-document
4. Post-Processing Filtering (Layer 2)
After AI generates tags, any excluded terms that slipped through are removed:
# AI generated these tags
raw_tags = [
    "pmkvy",
    "skill-development",
    "government-of-india",        # Should be excluded!
    "vocational-training",
    "ministry-of-social-justice"  # Should be excluded!
]

# Filter excluded terms
filtered_tags = [
    tag for tag in raw_tags
    if tag.lower() not in exclusion_set
]

# Result
filtered_tags = ["pmkvy", "skill-development", "vocational-training"]
5. Guaranteed Tag Count
If filtering removes tags, the system requests extra tags from the AI to maintain the target count:
if requested_tags == 5 and exclusion_words:
    # Request 10 tags to ensure 5 remain after filtering
    ai_tag_count = requested_tags * 2
Example:
User requests: 5 tags
AI generates: 10 tags (with instruction to avoid excluded terms)
Post-filtering removes: 2 tags
Final result: 8 tags (more than requested minimum)
This ensures you always get at least your requested number of tags, even if some are filtered out.
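Over-generation can be combined with a top-up loop for cases where filtering removes more tags than the surplus covers. A minimal sketch, with `generate_tags` standing in for the actual AI call (the round limit and 2x over-generation factor are assumptions):

```python
def get_tags(generate_tags, requested, exclusion_set, max_rounds=3):
    """Keep requesting tags until at least `requested` survive filtering."""
    tags, seen = [], set()
    for _ in range(max_rounds):
        # Over-generate: ask for twice the current shortfall each round
        raw = generate_tags((requested - len(tags)) * 2)
        for tag in raw:
            t = tag.lower()
            if t not in exclusion_set and t not in seen:
                seen.add(t)
                tags.append(t)
        if len(tags) >= requested:
            break
    return tags

# Fake model call for demonstration: yields tags from a fixed pool
pool = iter(["government-of-india", "pmkvy", "annual-report",
             "skill-development", "vocational-training", "ncvet",
             "apprenticeship", "assessment", "certification", "placement"])
fake_model = lambda n: [next(pool) for _ in range(n)]
result = get_tags(fake_model, 3, {"government-of-india", "annual-report"})
print(result)
# ['pmkvy', 'skill-development', 'vocational-training', 'ncvet']
```

The loop caps the number of AI calls so a pathological exclusion list cannot cause unbounded retries; after `max_rounds` it returns whatever survived.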
# Common government organizations (comments start with #)
government-india
ministry-of-social-justice
social-justice
department-of-empowerment

# Generic document types
annual-report
newsletter
policy-document
circular
notification

# Overly generic terms
empowerment
constitutional-provisions
government-scheme
public-welfare

# Comma-separated (same line)
scheme, yojana, program, initiative, mission
Parsing:
lines = text.split('\n')
for line in lines:
    line = line.strip()
    # Skip comments and empty lines
    if not line or line.startswith('#'):
        continue
    # Split by comma if present
    if ',' in line:
        terms = [t.strip().lower() for t in line.split(',')]
        exclusion_set.update(terms)
    else:
        exclusion_set.add(line.lower())
Use Case: When exclusion terms are in a PDF document
Process:
Extract text from PDF (uses same OCR pipeline as document processing)
Parse extracted text with same rules as .txt files
Filter comments and empty lines
Example PDF Content:
Exclusion List for Government Documents

Common Organizations:
- Government of India
- Ministry of Social Justice
- Department of Empowerment

Generic Terms:
- Annual Report
- Policy Document
- Circular
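Bullet-formatted PDF text like the example above needs light normalization before the standard parsing rules apply. A sketch (the bullet-stripping regex and the hyphenation of multi-word terms are assumptions about the tag format; section headings such as "Common Organizations:" would still need to be skipped separately):

```python
import re

def normalize_pdf_line(line: str) -> str:
    """Strip list bullets and hyphenate multi-word terms to match tag style."""
    line = re.sub(r'^[-*\u2022]\s*', '', line.strip())  # drop a leading bullet
    return '-'.join(line.lower().split())

lines = ["- Government of India", "- Annual Report", "Circular"]
print([normalize_pdf_line(l) for l in lines])
# ['government-of-india', 'annual-report', 'circular']
```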
# Try common encodings in order
text = None
for encoding in ['utf-8', 'latin-1', 'cp1252', 'iso-8859-1']:
    try:
        text = file_bytes.decode(encoding)
        break
    except UnicodeDecodeError:
        continue
# Last resort: UTF-8 with error replacement
if text is None:
    text = file_bytes.decode('utf-8', errors='replace')
formData = {
    'pdf_file': pdf_file,
    'config': JSON.stringify({
        'api_key': '...',
        'model_name': 'google/gemini-flash-1.5',
        'num_tags': 8,
        'num_pages': 3,
        'exclusion_words': []  // Can also be set here
    }),
    'exclusion_file': exclusion_file  // .txt or .pdf
}
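The same request can be assembled from Python. A sketch using in-memory placeholder files; the endpoint URL and file contents are illustrative, and the commented-out call assumes the `requests` library:

```python
import io
import json

# Config mirrors the formData example; values are placeholders
config = {
    "api_key": "...",
    "model_name": "google/gemini-flash-1.5",
    "num_tags": 8,
    "num_pages": 3,
    "exclusion_words": [],
}

data = {"config": json.dumps(config)}
files = {
    "pdf_file": ("document.pdf", io.BytesIO(b"%PDF-1.4 ..."),
                 "application/pdf"),
    "exclusion_file": ("exclusions.txt",
                       io.BytesIO(b"annual-report\nnewsletter\n"),
                       "text/plain"),
}

# With requests installed (endpoint URL is a placeholder):
# resp = requests.post("http://localhost:8000/process", files=files, data=data)
```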
@staticmethod
def parse_from_text(text: str) -> Set[str]:
    """Parse exclusion words from text content"""
    words = set()
    lines = text.strip().split('\n')
    for line in lines:
        line = line.strip()
        # Skip empty lines and comments
        if not line or line.startswith('#'):
            continue
        # Split by comma if present
        if ',' in line:
            parts = [p.strip().lower() for p in line.split(',')]
            words.update(p for p in parts if p and not p.startswith('#'))
        else:
            words.add(line.lower())
    logger.info(f"Parsed {len(words)} exclusion words")
    return words
# Build exclusion list from processed documents
from collections import Counter

# Collect all tags from processed docs
all_tags = []
for doc in processed_documents:
    all_tags.extend(doc['tags'])

# Find most common tags (likely generic)
tag_counts = Counter(all_tags)
most_common = tag_counts.most_common(50)

# Add to exclusion list if appears in >50% of docs
exclusions = [
    tag for tag, count in most_common
    if count > len(processed_documents) * 0.5
]