Overview

The Bibliographic Search module aggregates scientific literature from multiple sources (PubMed Central, SciELO, LILACS, FDA) to support pharmacovigilance signal detection, safety profile analysis, and regulatory submissions.

Supported Databases

LILACS

Latin American & Caribbean Health Sciences
Coverage: 1982-present
Focus: Regional pharmacovigilance data
API: BVS iAHx XML

SciELO

Scientific Electronic Library Online
Coverage: 1997-present
Focus: Open-access journals (PT, BR, CL, PE)
API: OAI-PMH (Dublin Core)

PubMed Central

US National Library of Medicine
Coverage: 1900s-present
Focus: Global biomedical literature
API: PMC E-utilities

FDA

US Food & Drug Administration
Coverage: Regulatory documents
Focus: Drug approvals, safety alerts
API: FDA.gov site search

Architecture

  1. Query Construction: build search terms from the product name and the IFA (active ingredient)
  2. Parallel Execution: query all sources concurrently (asyncio)
  3. Result Normalization: standardize fields (title, authors, year, source, link, database)
  4. Deduplication: remove identical records across sources
  5. Relevance Filtering: apply local regex/keyword filters
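
Steps 3 and 4 can be illustrated with a short sketch. This is not the module's implementation: the helper names `normalize_record` and `deduplicate` are hypothetical, and treating two records as identical when their accent-folded title and year agree is an assumption made for the example.

```python
import unicodedata


def _fold(text: str) -> str:
    """Case-fold, strip accents and collapse whitespace so titles compare equal."""
    decomposed = unicodedata.normalize("NFKD", text)
    no_accents = "".join(c for c in decomposed if not unicodedata.combining(c))
    return " ".join(no_accents.casefold().split())


def normalize_record(raw: dict, database: str) -> dict:
    """Map a source-specific record onto the common field set (step 3)."""
    links = raw.get("links") or []
    return {
        "title": (raw.get("title") or "").strip(),
        "authors": list(raw.get("authors") or []),
        "year": str(raw.get("year") or "")[:4],
        # LILACS records carry "source", SciELO records carry "journal"
        "source": raw.get("source") or raw.get("journal") or "",
        "link": raw.get("link") or (links[0] if links else ""),
        "database": database,
    }


def deduplicate(records: list[dict]) -> list[dict]:
    """Keep the first record seen for each (folded title, year) pair (step 4)."""
    seen = set()
    unique = []
    for rec in records:
        key = (_fold(rec["title"]), rec["year"])
        if key in seen:
            continue
        seen.add(key)
        unique.append(rec)
    return unique
```

Keeping the first occurrence means the order in which sources are merged decides which copy of a duplicate survives.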

API Integration

LILACS uses the BVS iAHx search interface with XML output:
LILACS Search (backend/app/services/biblio_sources.py:48)
from app.services.biblio_sources import search_lilacs

results = search_lilacs(
    query="ibuprofeno AND urticaria",
    lang="es",
    start=0,
    count=20,
    timeout=20.0
)

# Response:
[
    {
        "title": "Reacciones adversas cutáneas por ibuprofeno en población pediátrica",
        "authors": ["García M", "López J", "Rodríguez A"],
        "year": "2024",
        "source": "Rev Peru Med Exp Salud Publica",
        "link": "https://lilacs.bvsalud.org/resource/123456",
        "database": "LILACS"
    }
]
Features:
  • Multi-language support (es, en, pt)
  • Automatic retry with exponential backoff (0s, 1s, 2s)
  • Custom User-Agent header (VIGIA/0.1)
  • XML parsing with namespace handling
  • Timeout protection (default 20s)
LILACS returns results in the requested language. Use lang="es" for Spanish-speaking regions to get properly localized abstracts.

Advanced Query Syntax

LILACS Query Examples
# Simple term
search_lilacs("paracetamol")

# Boolean operators
search_lilacs("ibuprofeno AND (urticaria OR exantema)")

# Phrase search
search_lilacs('"reacción adversa medicamentosa"')

# Field-specific (title, author, subject)
search_lilacs("ti:farmacovigilancia")

# Date range (not directly supported, use filter parameter)
# params = {"filter": "year_cluster:[2020 TO 2025]"}
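
When queries are assembled programmatically, a small builder avoids malformed boolean expressions. The sketch below only composes strings in the syntax shown above; `build_lilacs_query` is a hypothetical helper, not part of `biblio_sources.py`.

```python
def build_lilacs_query(product: str, reactions: list[str]) -> str:
    """Combine one product term with alternative reaction terms."""

    def quote(term: str) -> str:
        term = term.strip()
        # Multi-word terms are wrapped in double quotes for phrase search
        return f'"{term}"' if " " in term else term

    terms = [quote(r) for r in reactions if r.strip()]
    if not terms:
        return quote(product)
    if len(terms) == 1:
        return f"{quote(product)} AND {terms[0]}"
    # Several alternatives are grouped with OR inside parentheses
    return f"{quote(product)} AND ({' OR '.join(terms)})"
```

For example, `build_lilacs_query("ibuprofeno", ["urticaria", "exantema"])` produces the same string as the boolean example above and can be passed straight to `search_lilacs`.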

SciELO OAI-PMH

Harvesting Protocol

SciELO provides OAI-PMH endpoints for bulk harvesting:
SciELO Harvest (backend/app/services/biblio_sources.py:126)
from app.services.biblio_sources import harvest_scielo_oai

# National nodes
BRAZIL = "https://www.scielo.br/oai/scielo-oai.php"
PERU = "https://scielo.org.pe/oai/scielo-oai.php"
CHILE = "https://scielo.conicyt.cl/oai/scielo-oai.php"
GLOBAL = "https://search.scielo.org/oai/scielo-oai.php"  # Fallback

results = harvest_scielo_oai(
    base_url=PERU,
    from_date="2020-01-01",
    until_date="2025-03-31",
    max_records=200,
    timeout=30.0
)

# Response:
[
    {
        "title": "Farmacovigilancia en el Perú: situación actual y perspectivas",
        "authors": ["Ministerio de Salud"],
        "year": "2023",
        "journal": "Rev Peru Med Exp Salud Publica",
        "links": ["https://doi.org/10.17843/rpmesp.2023.401.12345"],
        "subjects": ["Farmacovigilancia", "Perú", "Eventos adversos"],
        "description": "La farmacovigilancia en el Perú ha evolucionado...",
        "database": "SciELO (OAI-PMH)"
    }
]
Metadata Format: Dublin Core (oai_dc)
OAI-PMH Record Structure
<record>
  <metadata>
    <oai_dc:dc>
      <dc:title>Article Title</dc:title>
      <dc:creator>Author Name</dc:creator>
      <dc:date>2023-05-15</dc:date>
      <dc:identifier>https://doi.org/10.1234/journal.v1.123</dc:identifier>
      <dc:source>Journal Name, v.1 n.2</dc:source>
      <dc:subject>Pharmacovigilance</dc:subject>
      <dc:description>Abstract text...</dc:description>
      <dc:language>es</dc:language>
    </oai_dc:dc>
  </metadata>
</record>
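
A record like the one above can be flattened into the response shape shown earlier with the standard library. This is a minimal sketch, assuming the standard OAI-PMH and Dublin Core namespace URIs; `parse_dc_record` is a hypothetical name and the actual parser in `biblio_sources.py` may differ.

```python
import xml.etree.ElementTree as ET

# Standard namespace URIs for OAI-PMH 2.0 and unqualified Dublin Core
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "oai_dc": "http://www.openarchives.org/OAI/2.0/oai_dc/",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def parse_dc_record(record: ET.Element) -> dict:
    """Convert one OAI-PMH <record> element into a flat dictionary."""

    def texts(tag: str) -> list[str]:
        # Dublin Core elements are repeatable, so always collect a list
        return [
            el.text.strip()
            for el in record.findall(f".//dc:{tag}", NS)
            if el.text and el.text.strip()
        ]

    titles = texts("title")
    dates = texts("date")
    sources = texts("source")
    descriptions = texts("description")
    return {
        "title": titles[0] if titles else "",
        "authors": texts("creator"),
        "year": dates[0][:4] if dates else "",  # "2023-05-15" -> "2023"
        "journal": sources[0] if sources else "",
        "links": texts("identifier"),
        "subjects": texts("subject"),
        "description": descriptions[0] if descriptions else "",
        "database": "SciELO (OAI-PMH)",
    }
```

Missing elements yield empty strings or empty lists rather than raising, which matters because not every record carries every Dublin Core element.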

Resumption Tokens

OAI-PMH uses resumption tokens for pagination:
Automatic Pagination (backend/app/services/biblio_sources.py:217)
# Initial request
params = {"verb": "ListRecords", "metadataPrefix": "oai_dc"}

while True:
    # ... send the request with `params` and parse the XML response into `root` ...

    # Subsequent requests use only the token returned by the previous page
    token = root.find(".//oai:resumptionToken", ns)
    if token is not None and token.text:
        params = {"verb": "ListRecords", "resumptionToken": token.text}
    else:
        break  # No more results
SciELO regional nodes (Peru, Chile) sometimes return 404 errors. The system automatically retries with the global endpoint (search.scielo.org) as fallback.
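
The pagination loop can be written as a generator that is independent of the HTTP client. In this sketch `iter_oai_pages` and its `fetch` callback are hypothetical: `fetch` receives the query parameters and returns the response body as XML text or bytes, so it can wrap `httpx` in production and a stub in tests. The `max_pages` guard is an added assumption to bound the harvest.

```python
import xml.etree.ElementTree as ET

OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def iter_oai_pages(fetch, from_date=None, until_date=None, max_pages=50):
    """Yield the parsed root element of each ListRecords page."""
    params = {"verb": "ListRecords", "metadataPrefix": "oai_dc"}
    if from_date:
        params["from"] = from_date
    if until_date:
        params["until"] = until_date

    for _ in range(max_pages):
        root = ET.fromstring(fetch(params))
        yield root

        token = root.find(".//oai:resumptionToken", OAI_NS)
        if token is None or not (token.text or "").strip():
            return  # An absent or empty token marks the last page
        # A resumptionToken is an exclusive argument: send it with the verb only
        params = {"verb": "ListRecords", "resumptionToken": token.text.strip()}
```

A caller could pass `lambda p: client.get(base_url, params=p).content` as `fetch` and feed each yielded page to its record parser.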

Quick Search with Filters

The quick_search() function combines LILACS and SciELO with local filtering:
Quick Search API (backend/app/services/biblio_sources.py:280)
from app.services.biblio_sources import quick_search

results = quick_search(
    query="ibuprofeno urticaria",
    scielo_oai_url="https://www.scielo.br/oai/scielo-oai.php",
    lang="es",
    limit_each=20,
    timeout=30.0,
    skip_lilacs=False,
    years_back=5,  # Last 5 years only
    fields=["title", "subjects", "description"],
    regex=False  # Use OR/AND logic, not regex
)

# Response:
{
    "lilacs": [...],  # up to limit_each (20) results
    "scielo": [...]   # up to limit_each (20) results
}

Filter Logic

Boolean Query Syntax
# OR operator (pipe)
"urticaria | rash | exantema"  # Matches any term

# AND operator (implicit by space)
"ibuprofeno urticaria"  # Matches both terms

# Combined
"paracetamol (hepatotoxicidad | liver)"
# → Matches: "paracetamol" AND ("hepatotoxicidad" OR "liver")
Matching Process:
  1. Split query by | (OR parts)
  2. Split each OR part by space (AND tokens)
  3. Normalize: remove accents, case-fold
  4. Check if all AND tokens appear in any field
  5. Return first matching OR part
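
The five steps can be followed literally in a few lines. This is a sketch under assumptions, not the module's `_match_local`: `match_query` is a hypothetical name, tokens are matched as substrings, and the selected fields are concatenated before matching. As listed, the steps do not treat parentheses specially, so grouped queries are outside what this sketch covers.

```python
import unicodedata


def _norm(text: str) -> str:
    """Step 3: remove accents and case-fold."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c)).casefold()


def match_query(record: dict, query: str, fields: list[str]):
    """Return the first OR part whose tokens all occur in the selected fields, else None."""
    parts = []
    for name in fields:
        value = record.get(name, "")
        if isinstance(value, (list, tuple)):
            value = " ".join(str(v) for v in value)  # e.g. subjects, authors
        parts.append(str(value))
    haystack = _norm(" ".join(parts))

    for or_part in query.split("|"):                  # Step 1: OR parts
        tokens = [_norm(t) for t in or_part.split()]  # Step 2: AND tokens
        if tokens and all(tok in haystack for tok in tokens):  # Step 4
            return or_part.strip()                    # Step 5
    return None
```

Substring matching means "rash" would also match "rashes"; a word-boundary check would be stricter at the cost of missing inflected forms.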
# Available Fields
fields = [
    "title",         # Article title
    "authors",       # Author list
    "year",          # Publication year
    "journal",       # Journal name (SciELO)
    "source",        # Source citation (LILACS)
    "subjects",      # Keywords/MeSH terms
    "description",   # Abstract
    "links"          # DOI, URLs
]

# Example: Search only in titles and subjects
results = quick_search(
    query="farmacovigilancia",
    fields=["title", "subjects"]
)

Aggregated Search API

The main search endpoint queries all sources in parallel:
Unified Search (GET /api/v1/biblio/search)
GET /api/v1/biblio/search?name=Ibuprofeno&ifa=ibuprofeno&range_from=2020-01-01&range_to=2025-12-31

# Response:
{
  "term": "Ibuprofeno ibuprofeno",  # Combined search term
  "rangeFrom": "2020-01-01",
  "rangeTo": "2025-12-31",
  "total": 87,
  "items": [
    {"title": "...", "database": "scielo", ...},
    {"title": "...", "database": "pmc", ...},
    {"title": "...", "database": "lilacs", ...},
    {"title": "...", "database": "fda", ...}
  ],
  "traces": [
    {
      "collection": "scielo",
      "url": "https://search.scielo.org/?q=Ibuprofeno...",
      "note": "SciELO search",
      "count": 23
    },
    {
      "collection": "pmc",
      "url": "https://pmc.ncbi.nlm.nih.gov/search/?term=Ibuprofeno...",
      "note": "PMC search",
      "count": 45
    },
    {
      "collection": "lilacs",
      "url": "https://pesquisa.bvsalud.org/portal/?q=Ibuprofeno...",
      "note": "BVS / LILACS",
      "count": 12
    },
    {
      "collection": "fda",
      "url": "https://www.fda.gov/search?s=Ibuprofeno",
      "note": "FDA site search (trace UI)",
      "count": 7
    }
  ],
  "usedAttempts": [
    {"engine": "scielo", "name": "Ibuprofeno", "ifa": "ibuprofeno", "search_term": "Ibuprofeno ibuprofeno"},
    {"engine": "pmc", "name": "Ibuprofeno", "ifa": "ibuprofeno", "search_term": "Ibuprofeno ibuprofeno"},
    {"engine": "lilacs", "name": "Ibuprofeno", "ifa": "ibuprofeno", "search_term": "Ibuprofeno ibuprofeno"},
    {"engine": "fda", "name": "Ibuprofeno", "ifa": "ibuprofeno", "search_term": "Ibuprofeno ibuprofeno"}
  ]
}
Query Parameters:
  • name (string): Commercial product name
  • ifa (string): Active ingredient (IFA, from the Spanish "ingrediente farmacéutico activo"), typically given as its International Nonproprietary Name
  • term (string): Direct search term (overrides name + ifa)
  • range_from (string): Start date (YYYY-MM-DD)
  • range_to (string): End date (YYYY-MM-DD)
  • lang (string, default: "es"): Language preference (es, en, pt)
  • max_results_fda (integer, default: 10): Max FDA results (1-25)
  • max_pages_lilacs (integer, default: 2): Max LILACS pages (1-20)
  • page_size_lilacs (integer, default: 50): Results per LILACS page (1-200)
  • translate_lilacs (boolean, default: true): Auto-translate LILACS results
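
A client can validate the documented ranges before sending a request. The sketch below builds the request URL with the standard library; `build_search_url` is a hypothetical helper, and the base URL in the usage note is a placeholder for wherever the service is deployed.

```python
from urllib.parse import urlencode


def build_search_url(base_url: str, *, name: str = "", ifa: str = "", term: str = "",
                     range_from: str = "", range_to: str = "", lang: str = "es",
                     max_results_fda: int = 10) -> str:
    """Build a GET /api/v1/biblio/search URL, checking documented limits first."""
    if lang not in ("es", "en", "pt"):
        raise ValueError("lang must be one of: es, en, pt")
    if not 1 <= max_results_fda <= 25:
        raise ValueError("max_results_fda must be between 1 and 25")

    candidates = {
        "name": name,
        "ifa": ifa,
        "term": term,
        "range_from": range_from,
        "range_to": range_to,
        "lang": lang,
        "max_results_fda": max_results_fda,
    }
    # Drop empty optional parameters so the server applies its own defaults
    params = {k: v for k, v in candidates.items() if v != ""}
    return f"{base_url.rstrip('/')}/api/v1/biblio/search?{urlencode(params)}"
```

Calling `build_search_url("http://localhost:8000", name="Ibuprofeno", ifa="ibuprofeno")` yields a URL equivalent to the request shown above, with `lang` and `max_results_fda` made explicit.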

Error Handling

HTTP Status Codes

SciELO Regional Node Fallback
from httpx import HTTPStatusError

# Regional node returns 404
try:
    results = harvest_scielo_oai("https://scielo.org.pe/oai/scielo-oai.php")
except HTTPStatusError as e:
    if e.response.status_code != 404:
        raise  # Only a 404 triggers the fallback; other errors propagate
    # Retry with global endpoint
    results = harvest_scielo_oai("https://search.scielo.org/oai/scielo-oai.php")

Retry Strategy

Exponential Backoff (backend/app/services/biblio_sources.py:72)
backoff = [0, 1.0, 2.0]  # Delay before each attempt, in seconds
last_err = None

for attempt, wait_s in enumerate(backoff, start=1):
    if wait_s:
        time.sleep(wait_s)  # No delay before the first attempt
    try:
        log.debug("LILACS request attempt %d/%d", attempt, len(backoff))
        started = time.perf_counter()
        response = client.get(url, params=params)
        response.raise_for_status()
        elapsed_ms = (time.perf_counter() - started) * 1000
        log.info("LILACS HTTP OK (%.1f ms)", elapsed_ms)
        break
    except Exception as e:
        last_err = e
        log.warning("LILACS attempt %d/%d failed: %s", attempt, len(backoff), e)
        if attempt == len(backoff):
            raise  # Exhausted retries

Performance Optimization

Parallel Execution

Concurrent Queries
import asyncio
import httpx

async def search_all_sources(query: str):
    async with httpx.AsyncClient() as client:
        tasks = [
            client.get(f"{API_BASE}/scielo/search", params={"q": query}),
            client.get(f"{API_BASE}/pmc/search", params={"q": query}),
            client.get(f"{API_BASE}/lilacs/search", params={"q": query}),
            client.get(f"{API_BASE}/fda/search", params={"q": query}),
        ]
        responses = await asyncio.gather(*tasks, return_exceptions=True)
    return responses
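
Because `return_exceptions=True` places exceptions in the result list instead of raising them, the caller has to separate failures from successes. A minimal sketch of that step follows; `gather_by_source` is a hypothetical helper and the source names are only labels.

```python
import asyncio


async def gather_by_source(tasks: dict) -> tuple[dict, dict]:
    """Await all awaitables; return (results, errors), both keyed by source name."""
    names = list(tasks)
    outcomes = await asyncio.gather(*tasks.values(), return_exceptions=True)

    results, errors = {}, {}
    for name, outcome in zip(names, outcomes):
        if isinstance(outcome, BaseException):
            errors[name] = repr(outcome)  # One failing source must not hide the others
        else:
            results[name] = outcome
    return results, errors
```

With this shape, a timeout on one source still lets the remaining sources contribute items, and the `errors` mapping can be logged alongside the search traces.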

Result Caching

Redis Cache
import redis
import json
import hashlib

redis_client = redis.Redis()

def cached_search(query: str, **params):
    # Generate cache key
    key_data = f"{query}:{json.dumps(params, sort_keys=True)}"
    cache_key = f"biblio:{hashlib.md5(key_data.encode()).hexdigest()}"
    
    # Check cache
    cached = redis_client.get(cache_key)
    if cached:
        return json.loads(cached)
    
    # Perform search
    results = search_lilacs(query, **params)
    
    # Store in cache (1 hour TTL)
    redis_client.setex(cache_key, 3600, json.dumps(results))
    return results

Request Batching

Batch Multiple Queries
def search_batch(queries: list[str], source: str):
    """Execute multiple queries with connection pooling"""
    with httpx.Client() as client:
        results = []
        for query in queries:
            try:
                r = client.get(f"{API_BASE}/{source}/search", params={"q": query})
                results.append(r.json())
            except Exception as e:
                results.append({"error": str(e)})
    return results

Local Filtering

Filter Locally After Fetching
# Filter locally instead of server-side for SciELO
# (Server doesn't support complex queries)

# 1. Fetch broader results
raw = harvest_scielo_oai(
    base_url="https://scielo.org.pe/oai/scielo-oai.php",
    from_date="2020-01-01",
    max_records=500  # Fetch more than needed, since filtering discards records
)

# 2. Filter locally
filtered = [
    r for r in raw
    if _match_local(r, "urticaria", ["title", "description"], regex=False)
]

# 3. Keep the top N
top_results = filtered[:20]

Best Practices

Do:
  • Use specific search terms (product + symptom)
  • Specify date ranges to reduce result volume
  • Enable local filtering for precision
  • Cache results for repeated queries
  • Use LILACS for Latin American data
  • Query SciELO regional nodes for local journals
  • Handle 404/403 errors with fallbacks
  • Log search traces for debugging
Don’t:
  • Make excessive requests without rate limiting
  • Ignore timeout settings (can block worker)
  • Use overly broad queries (“drug” returns millions)
  • Skip error handling (external APIs are unreliable)
  • Hard-code URLs (use environment config)
  • Fetch all records without pagination
  • Store raw HTML (parse and normalize)
  • Trust external metadata without validation

Configuration

Environment Variables
# API endpoints (override for testing)
LILACS_BASE="https://pesquisa.bvsalud.org/portal/"
SCIELO_OAI_GLOBAL="https://search.scielo.org/oai/scielo-oai.php"
SCIELO_OAI_PERU="https://scielo.org.pe/oai/scielo-oai.php"

# Timeouts
BIBLIO_TIMEOUT="30"  # seconds
BIBLIO_CONNECT_TIMEOUT="20"  # seconds

# Rate limiting
BIBLIO_RETRY_BACKOFF="0,1,2"  # comma-separated seconds
BIBLIO_MAX_RETRIES="3"

# Cache
REDIS_URL="redis://localhost:6379/2"  # Database 2 for biblio cache
BIBLIO_CACHE_TTL="3600"  # 1 hour
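
These variables might be read into a typed settings object at startup. The sketch below is an assumption about how that could look, not the project's configuration code: `BiblioSettings` and `load_settings` are hypothetical names, and the fallback values mirror the ones listed above.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class BiblioSettings:
    timeout: float
    connect_timeout: float
    retry_backoff: tuple
    max_retries: int
    cache_ttl: int


def load_settings(env=None) -> BiblioSettings:
    """Read BIBLIO_* variables from a mapping (defaults to os.environ)."""
    env = os.environ if env is None else env
    raw_backoff = env.get("BIBLIO_RETRY_BACKOFF", "0,1,2")
    return BiblioSettings(
        timeout=float(env.get("BIBLIO_TIMEOUT", "30")),
        connect_timeout=float(env.get("BIBLIO_CONNECT_TIMEOUT", "20")),
        # "0,1,2" -> (0.0, 1.0, 2.0); blank entries are skipped
        retry_backoff=tuple(float(x) for x in raw_backoff.split(",") if x.strip()),
        max_retries=int(env.get("BIBLIO_MAX_RETRIES", "3")),
        cache_ttl=int(env.get("BIBLIO_CACHE_TTL", "3600")),
    )
```

Passing a plain dictionary as `env` keeps the parsing testable without touching the process environment.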

Signal Detection

Using literature search for signal validation

IPS Reports

Integrating bibliographic references into safety reports
