Bibliographic Search Module

Overview

The Bibliographic Search module aggregates scientific literature from multiple sources (PubMed Central, SciELO, LILACS, FDA) to support pharmacovigilance signal detection, safety profile analysis, and regulatory submissions.

Supported Databases

LILACS

Latin American & Caribbean Health Sciences
Coverage: 1982-present
Focus: Regional pharmacovigilance data
API: BVS iAHx XML

SciELO

Scientific Electronic Library Online
Coverage: 1997-present
Focus: Open-access journals (PT, BR, CL, PE)
API: OAI-PMH (Dublin Core)

PubMed Central

US National Library of Medicine
Coverage: 1900s-present
Focus: Global biomedical literature
API: PMC E-utilities

FDA

US Food & Drug Administration
Coverage: Regulatory documents
Focus: Drug approvals, safety alerts
API: FDA.gov site search

Architecture

Query Construction

Build search terms from product name + IFA (active ingredient)

Parallel Execution

Query all sources concurrently (asyncio)

Result Normalization

Standardize fields: title, authors, year, source, link, database

Deduplication

Remove identical records across sources

Relevance Filtering

Apply local regex/keyword filters
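
The query construction and deduplication steps can be sketched as follows. This is a minimal illustration rather than the module's own code: `build_term` and `dedupe` are hypothetical helpers, and treating two records as identical when their accent-stripped, case-folded title and their year both match is an assumption.

```python
from __future__ import annotations

import unicodedata


def build_term(name: str | None, ifa: str | None) -> str:
    """Join product name and active ingredient into one search term."""
    parts = [p.strip() for p in (name, ifa) if p and p.strip()]
    return " ".join(parts)


def _key(record: dict) -> tuple[str, str]:
    """Comparison key: normalized title plus year (assumed identity rule)."""
    title = unicodedata.normalize("NFKD", record.get("title", ""))
    title = "".join(c for c in title if not unicodedata.combining(c))
    return (" ".join(title.casefold().split()), str(record.get("year", "")))


def dedupe(records: list[dict]) -> list[dict]:
    """Keep the first occurrence of each record, preserving input order."""
    seen: set[tuple[str, str]] = set()
    unique = []
    for record in records:
        key = _key(record)
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique
```

Keeping the first occurrence means the order in which sources are merged decides which copy of a duplicate survives.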

LILACS Search

API Integration

LILACS uses the BVS iAHx search interface with XML output:

LILACS Search (backend/app/services/biblio_sources.py:48)

from app.services.biblio_sources import search_lilacs

results = search_lilacs(
    query="ibuprofeno AND urticaria",
    lang="es",
    start=0,
    count=20,
    timeout=20.0
)

# Response:
[
    {
        "title": "Reacciones adversas cutáneas por ibuprofeno en población pediátrica",
        "authors": ["García M", "López J", "Rodríguez A"],
        "year": "2024",
        "source": "Rev Peru Med Exp Salud Publica",
        "link": "https://lilacs.bvsalud.org/resource/123456",
        "database": "LILACS"
    }
]

Features:

Multi-language support (es, en, pt)
Automatic retry with exponential backoff (0s, 1s, 2s)
Custom User-Agent header (VIGIA/0.1)
XML parsing with namespace handling
Timeout protection (default 20s)

LILACS returns results in the requested language. Use lang="es" for Spanish-speaking regions to get properly localized abstracts.

Advanced Query Syntax

LILACS Query Examples

# Simple term
search_lilacs("paracetamol")

# Boolean operators
search_lilacs("ibuprofeno AND (urticaria OR exantema)")

# Phrase search
search_lilacs('"reacción adversa medicamentosa"')

# Field-specific (title, author, subject)
search_lilacs("ti:farmacovigilancia")

# Date range (not directly supported, use filter parameter)
# params = {"filter": "year_cluster:[2020 TO 2025]"}
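
Query strings like the ones above can also be assembled from plain term lists. The helper below is a hypothetical sketch (`build_lilacs_query` is not part of the module); it only produces the `AND`/`OR`/phrase syntax shown in the examples.

```python
from __future__ import annotations


def quote_term(term: str) -> str:
    """Wrap multi-word terms in double quotes so they are searched as a phrase."""
    term = term.strip()
    return f'"{term}"' if " " in term else term


def build_lilacs_query(drug: str, reactions: list[str]) -> str:
    """Build 'drug AND (reaction1 OR reaction2 ...)' from plain terms."""
    drug_part = quote_term(drug)
    alternatives = [quote_term(r) for r in reactions if r.strip()]
    if not alternatives:
        return drug_part
    if len(alternatives) == 1:
        return f"{drug_part} AND {alternatives[0]}"
    return f"{drug_part} AND ({' OR '.join(alternatives)})"
```

The resulting string can be passed as the `query` argument of `search_lilacs`.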

SciELO OAI-PMH

Harvesting Protocol

SciELO provides OAI-PMH endpoints for bulk harvesting:

SciELO Harvest (backend/app/services/biblio_sources.py:126)

from app.services.biblio_sources import harvest_scielo_oai

# National nodes
BRAZIL = "https://www.scielo.br/oai/scielo-oai.php"
PERU = "https://scielo.org.pe/oai/scielo-oai.php"
CHILE = "https://scielo.conicyt.cl/oai/scielo-oai.php"
GLOBAL = "https://search.scielo.org/oai/scielo-oai.php"  # Fallback

results = harvest_scielo_oai(
    base_url=PERU,
    from_date="2020-01-01",
    until_date="2025-03-31",
    max_records=200,
    timeout=30.0
)

# Response:
[
    {
        "title": "Farmacovigilancia en el Perú: situación actual y perspectivas",
        "authors": ["Ministerio de Salud"],
        "year": "2023",
        "journal": "Rev Peru Med Exp Salud Publica",
        "links": ["https://doi.org/10.17843/rpmesp.2023.401.12345"],
        "subjects": ["Farmacovigilancia", "Perú", "Eventos adversos"],
        "description": "La farmacovigilancia en el Perú ha evolucionado...",
        "database": "SciELO (OAI-PMH)"
    }
]

Metadata Format: Dublin Core (oai_dc)

Standard Fields
Namespace Handling

OAI-PMH Record Structure

<record>
  <metadata>
    <oai_dc:dc>
      <dc:title>Article Title</dc:title>
      <dc:creator>Author Name</dc:creator>
      <dc:date>2023-05-15</dc:date>
      <dc:identifier>https://doi.org/10.1234/journal.v1.123</dc:identifier>
      <dc:source>Journal Name, v.1 n.2</dc:source>
      <dc:subject>Pharmacovigilance</dc:subject>
      <dc:description>Abstract text...</dc:description>
      <dc:language>es</dc:language>
    </oai_dc:dc>
  </metadata>
</record>

Namespace-Aware Parsing (backend/app/services/biblio_sources.py:149)

ns = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "oai_dc": "http://www.openarchives.org/OAI/2.0/oai_dc/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Correct XPath with namespaces
titles = dc_el.findall("./dc:title", ns)
authors = dc_el.findall("./dc:creator", ns)
dates = dc_el.findall("./dc:date", ns)

# Incorrect (silently returns an empty list):
# titles = dc_el.findall(".//title")  # Missing namespace prefix
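
For a complete picture, here is a self-contained sketch that parses one `oai_dc` record with the standard library and maps it onto the normalized result fields. `parse_record` and `texts` are illustrative names, and taking the first four characters of `dc:date` as the year is an assumption.

```python
from __future__ import annotations

import xml.etree.ElementTree as ET

NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "oai_dc": "http://www.openarchives.org/OAI/2.0/oai_dc/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

SAMPLE = """
<record xmlns="http://www.openarchives.org/OAI/2.0/">
  <metadata>
    <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
               xmlns:dc="http://purl.org/dc/elements/1.1/">
      <dc:title>Article Title</dc:title>
      <dc:creator>Author One</dc:creator>
      <dc:creator>Author Two</dc:creator>
      <dc:date>2023-05-15</dc:date>
      <dc:identifier>https://doi.org/10.1234/journal.v1.123</dc:identifier>
      <dc:subject>Pharmacovigilance</dc:subject>
    </oai_dc:dc>
  </metadata>
</record>
""".strip()


def texts(parent: ET.Element, tag: str) -> list[str]:
    """Collect the stripped text of every matching namespaced child."""
    return [el.text.strip() for el in parent.findall(tag, NS) if el.text]


def parse_record(xml_text: str) -> dict:
    """Map one oai_dc record onto the normalized result fields."""
    record = ET.fromstring(xml_text)
    dc_el = record.find("./oai:metadata/oai_dc:dc", NS)
    dates = texts(dc_el, "./dc:date")
    return {
        "title": " ".join(texts(dc_el, "./dc:title")),
        "authors": texts(dc_el, "./dc:creator"),
        "year": dates[0][:4] if dates else "",
        "links": texts(dc_el, "./dc:identifier"),
        "subjects": texts(dc_el, "./dc:subject"),
        "database": "SciELO (OAI-PMH)",
    }


parsed = parse_record(SAMPLE)
# Without the namespace map the same lookup finds nothing and raises no error
unprefixed = ET.fromstring(SAMPLE).findall(".//title")
```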

Resumption Tokens

OAI-PMH uses resumption tokens for pagination:

Automatic Pagination (backend/app/services/biblio_sources.py:217)

# Initial request
params = {"verb": "ListRecords", "metadataPrefix": "oai_dc"}

while True:
    # Fetch and parse one page into `root` (request code omitted)
    token = root.find(".//oai:resumptionToken", ns)
    if token is not None and token.text:
        # Subsequent requests carry only the verb and the token
        params = {"verb": "ListRecords", "resumptionToken": token.text}
    else:
        break  # No more results
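
A runnable version of this loop might look like the sketch below. The `fetch` callable, the `max_pages` cap, and the canned pages are assumptions made so the example works without network access; the real function performs HTTP requests instead.

```python
from __future__ import annotations

import xml.etree.ElementTree as ET
from typing import Callable

NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def harvest_pages(fetch: Callable[[dict], str], max_pages: int = 50) -> list[ET.Element]:
    """Follow resumption tokens until the server stops returning one."""
    params = {"verb": "ListRecords", "metadataPrefix": "oai_dc"}
    records: list[ET.Element] = []
    for _ in range(max_pages):  # hard cap guards against a token loop
        root = ET.fromstring(fetch(params))
        records.extend(root.findall(".//oai:record", NS))
        token = root.find(".//oai:resumptionToken", NS)
        if token is None or not (token.text or "").strip():
            break  # empty or missing token: last page
        # A resumption request carries only the verb and the token
        params = {"verb": "ListRecords", "resumptionToken": token.text.strip()}
    return records


# Two canned pages stand in for the network in this sketch
_PAGE = '<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/"><ListRecords>{body}</ListRecords></OAI-PMH>'
PAGES = {
    None: _PAGE.format(body="<record/><record/><resumptionToken>page2</resumptionToken>"),
    "page2": _PAGE.format(body="<record/><resumptionToken/>"),
}
calls: list[dict] = []


def fake_fetch(params: dict) -> str:
    calls.append(params)
    return PAGES[params.get("resumptionToken")]


harvested = harvest_pages(fake_fetch)
```

An empty `resumptionToken` element on the final page is treated the same as a missing one.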

SciELO regional nodes (Peru, Chile) sometimes return 404 errors. The system automatically retries with the global endpoint (search.scielo.org) as fallback.

Quick Search with Filters

The quick_search() function combines LILACS and SciELO with local filtering:

Quick Search API (backend/app/services/biblio_sources.py:280)

from app.services.biblio_sources import quick_search

results = quick_search(
    query="ibuprofeno urticaria",
    scielo_oai_url="https://www.scielo.br/oai/scielo-oai.php",
    lang="es",
    limit_each=20,
    timeout=30.0,
    skip_lilacs=False,
    years_back=5,  # Last 5 years only
    fields=["title", "subjects", "description"],
    regex=False  # Use OR/AND logic, not regex
)

# Response:
{
    "lilacs": [<20 results>],
    "scielo": [<20 results>]
}
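
One way the `years_back` restriction could be applied to already-fetched records is sketched here. How `quick_search` actually implements it is not shown in this page, so `within_years_back`, the inclusive year window, and the choice to drop records without a usable year are all assumptions.

```python
from __future__ import annotations

from datetime import date


def within_years_back(record: dict, years_back: int, today: date | None = None) -> bool:
    """True when the record's year falls inside the last `years_back` years."""
    current_year = (today or date.today()).year
    try:
        year = int(str(record.get("year", ""))[:4])
    except ValueError:
        return False  # assumption: records without a usable year are dropped
    return current_year - years_back <= year <= current_year
```

Passing `today` explicitly keeps the check deterministic in tests.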

Filter Logic

Boolean Search (default)

Boolean Query Syntax

# OR operator (pipe)
"urticaria | rash | exantema"  # Matches any term

# AND operator (implicit by space)
"ibuprofeno urticaria"  # Matches both terms

# Combined
"paracetamol (hepatotoxicidad | liver)"
# → Matches: "paracetamol" AND ("hepatotoxicidad" OR "liver")

Matching Process:

1. Split the query by | into OR parts
2. Split each OR part by spaces into AND tokens
3. Normalize tokens and field text: remove accents, case-fold
4. Check whether every AND token of a part appears in the selected fields
5. Accept the record as soon as one OR part matches
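
The matching process can be sketched as below. This follows the listed steps literally and is not the module's `_match_local`: `match_boolean` is a hypothetical name, tokens are matched as substrings of the concatenated field text (an assumption), and parenthesized grouping is not handled.

```python
from __future__ import annotations

import unicodedata


def normalize(text: str) -> str:
    """Strip accents and case-fold, so 'Reacción' and 'reaccion' compare equal."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c)).casefold()


def field_text(record: dict, fields: list[str]) -> str:
    """Concatenate the selected fields; list values (authors, subjects) are joined."""
    chunks = []
    for name in fields:
        value = record.get(name, "")
        chunks.append(" ".join(map(str, value)) if isinstance(value, list) else str(value))
    return normalize(" ".join(chunks))


def match_boolean(record: dict, query: str, fields: list[str]) -> bool:
    """'|' separates OR parts; spaces inside a part are an implicit AND."""
    haystack = field_text(record, fields)
    for or_part in query.split("|"):
        tokens = [normalize(t) for t in or_part.split()]
        if tokens and all(t in haystack for t in tokens):
            return True  # the first matching OR part decides
    return False
```

An empty query matches nothing in this sketch, which avoids returning every record by accident.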

Regex Search

Regex Query Mode

results = quick_search(
    query=r"\b(urticaria|exantema)\s+(generalizad[ao]|sever[ao])",
    regex=True,
    fields=["title", "description"]
)

# Matches:
# - "urticaria generalizada"
# - "exantema severo"
# But NOT:
# - "urticaria leve" (second group does not match)

Accent Handling: The query and the searched fields are both normalized before matching (ó → o, ñ → n), and the search is case-insensitive by default (re.IGNORECASE).
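
A minimal sketch of accent-insensitive regex matching, assuming the behaviour described above. `match_regex` is a hypothetical name. Note one design choice: the pattern has its accents stripped but is not case-folded, because lowercasing a pattern would change escapes such as `\S` into `\s`; case-insensitivity comes from `re.IGNORECASE` instead.

```python
from __future__ import annotations

import re
import unicodedata


def strip_accents(text: str) -> str:
    """Remove combining marks: 'reacción' becomes 'reaccion'."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def match_regex(record: dict, pattern: str, fields: list[str]) -> bool:
    """Search each selected field with an accent-stripped, case-insensitive pattern."""
    compiled = re.compile(strip_accents(pattern), re.IGNORECASE)
    for name in fields:
        value = record.get(name, "")
        text = " ".join(map(str, value)) if isinstance(value, list) else str(value)
        if compiled.search(strip_accents(text)):
            return True
    return False
```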

Field Selection

# Available Fields
fields = [
    "title",         # Article title
    "authors",       # Author list
    "year",          # Publication year
    "journal",       # Journal name (SciELO)
    "source",        # Source citation (LILACS)
    "subjects",      # Keywords/MeSH terms
    "description",   # Abstract
    "links"          # DOI, URLs
]

# Example: Search only in titles and subjects
results = quick_search(
    query="farmacovigilancia",
    fields=["title", "subjects"]
)

Aggregated Search API

The main search endpoint queries all sources in parallel:

Unified Search (GET /api/v1/biblio/search)

GET /api/v1/biblio/search?name=Ibuprofeno&ifa=ibuprofeno&range_from=2020-01-01&range_to=2025-12-31

# Response:
{
  "term": "Ibuprofeno ibuprofeno",  # Combined search term
  "rangeFrom": "2020-01-01",
  "rangeTo": "2025-12-31",
  "total": 87,
  "items": [
    {"title": "...", "database": "scielo", ...},
    {"title": "...", "database": "pmc", ...},
    {"title": "...", "database": "lilacs", ...},
    {"title": "...", "database": "fda", ...}
  ],
  "traces": [
    {
      "collection": "scielo",
      "url": "https://search.scielo.org/?q=Ibuprofeno...",
      "note": "SciELO search",
      "count": 23
    },
    {
      "collection": "pmc",
      "url": "https://pmc.ncbi.nlm.nih.gov/search/?term=Ibuprofeno...",
      "note": "PMC search",
      "count": 45
    },
    {
      "collection": "lilacs",
      "url": "https://pesquisa.bvsalud.org/portal/?q=Ibuprofeno...",
      "note": "BVS / LILACS",
      "count": 12
    },
    {
      "collection": "fda",
      "url": "https://www.fda.gov/search?s=Ibuprofeno",
      "note": "FDA site search (trace UI)",
      "count": 7
    }
  ],
  "usedAttempts": [
    {"engine": "scielo", "name": "Ibuprofeno", "ifa": "ibuprofeno", "search_term": "Ibuprofeno ibuprofeno"},
    {"engine": "pmc", "name": "Ibuprofeno", "ifa": "ibuprofeno", "search_term": "Ibuprofeno ibuprofeno"},
    {"engine": "lilacs", "name": "Ibuprofeno", "ifa": "ibuprofeno", "search_term": "Ibuprofeno ibuprofeno"},
    {"engine": "fda", "name": "Ibuprofeno", "ifa": "ibuprofeno", "search_term": "Ibuprofeno ibuprofeno"}
  ]
}

Query Parameters:

name (string): Commercial product name
ifa (string): Active ingredient (IFA, from the Spanish "Ingrediente Farmacéutico Activo"), usually given as its International Nonproprietary Name
term (string): Direct search term (overrides name + ifa)
range_from (string): Start date (YYYY-MM-DD)
range_to (string): End date (YYYY-MM-DD)
lang (string, default "es"): Language preference (es, en, pt)
max_results_fda (integer, default 10): Max FDA results (1-25)
max_pages_lilacs (integer, default 2): Max LILACS pages (1-20)
page_size_lilacs (integer, default 50): Results per LILACS page (1-200)
translate_lilacs (boolean, default true): Auto-translate LILACS results
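
A client-side sketch for assembling the request URL from these parameters. `build_search_url` and the base URL are hypothetical, and sending booleans as lowercase strings is an assumption about what the endpoint accepts; the numeric ranges come from the parameter list.

```python
from urllib.parse import urlencode

# Allowed ranges taken from the parameter list
LIMITS = {
    "max_results_fda": (1, 25),
    "max_pages_lilacs": (1, 20),
    "page_size_lilacs": (1, 200),
}


def build_search_url(base_url: str, **params) -> str:
    """Validate numeric ranges, drop unset values, and encode the query string."""
    for key, (low, high) in LIMITS.items():
        value = params.get(key)
        if value is not None and not low <= value <= high:
            raise ValueError(f"{key} must be between {low} and {high}, got {value}")
    query = {}
    for key, value in params.items():
        if value is None:
            continue
        # Booleans are sent as lowercase strings (an assumption about the API)
        query[key] = str(value).lower() if isinstance(value, bool) else value
    return f"{base_url}/api/v1/biblio/search?{urlencode(query)}"
```

Validating ranges before the request gives a clearer error than a rejected call would.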

Error Handling

HTTP Status Codes

404 Not Found
403 Forbidden
Timeout

SciELO Regional Node Fallback

import httpx

# Regional node returns 404
try:
    results = harvest_scielo_oai("https://scielo.org.pe/oai/scielo-oai.php")
except httpx.HTTPStatusError as e:
    if e.response.status_code != 404:
        raise  # Only a 404 triggers the fallback
    # Retry with global endpoint
    results = harvest_scielo_oai("https://search.scielo.org/oai/scielo-oai.php")

Rate Limiting

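# SciELO/LILACS may block excessive requests
# Implement exponential backoff

import time
import httpx

backoff = [0, 1.0, 2.0, 4.0]  # seconds
for wait in backoff:
    time.sleep(wait)
    try:
        results = search_lilacs(query)
        break
    except httpx.HTTPStatusError as e:
        if e.response.status_code == 403:
            continue  # Retry after the next wait
        else:
            raise  # Other error
else:
    # Every attempt was rejected with 403
    raise RuntimeError("LILACS still returned 403 after all retries")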

Request Timeouts

# All HTTP clients use configurable timeout
HTTP_TIMEOUT = httpx.Timeout(
    45.0,      # Default for each phase (read, write, pool); httpx has no single total timeout
    connect=20.0  # Connection establishment timeout
)

async with httpx.AsyncClient(timeout=HTTP_TIMEOUT) as client:
    response = await client.get(url)

Retry Strategy

Exponential Backoff (backend/app/services/biblio_sources.py:72)

import time

backoff = [0, 1.0, 2.0]  # Wait before each attempt, in seconds
last_err = None

for attempt, wait_s in enumerate(backoff, start=1):
    time.sleep(wait_s)  # 0s before the first attempt, then 1s, then 2s
    try:
        log.debug("LILACS request attempt %d/%d", attempt, len(backoff))
        started = time.perf_counter()
        response = client.get(url, params=params)
        response.raise_for_status()
        elapsed_ms = (time.perf_counter() - started) * 1000
        log.info("LILACS HTTP OK (%.1f ms)", elapsed_ms)
        break
    except Exception as e:
        last_err = e
        log.warning("LILACS attempt %d/%d failed: %s", attempt, len(backoff), e)
        if attempt == len(backoff):
            raise  # Exhausted retries
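
The same strategy can be factored into a reusable helper. The sketch below is an illustration under stated assumptions: `with_retries` is a hypothetical name, and the injectable `sleep` argument exists only so the behaviour can be tested without waiting.

```python
from __future__ import annotations

import time
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def with_retries(
    call: Callable[[], T],
    backoff: Sequence[float] = (0, 1.0, 2.0),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `call`, waiting backoff[i] seconds before attempt i + 1."""
    last_err: Exception | None = None
    for wait_s in backoff:
        sleep(wait_s)
        try:
            return call()
        except Exception as err:  # narrow this to transport errors in real code
            last_err = err
    if last_err is None:
        raise ValueError("backoff must contain at least one entry")
    raise last_err  # retries exhausted: surface the last failure
```

A call site could then read `with_retries(lambda: search_lilacs(query))`.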

Performance Optimization

Parallel Execution

Concurrent Queries

import asyncio
import httpx

async def search_all_sources(query: str):
    async with httpx.AsyncClient() as client:
        tasks = [
            client.get(f"{API_BASE}/scielo/search", params={"q": query}),
            client.get(f"{API_BASE}/pmc/search", params={"q": query}),
            client.get(f"{API_BASE}/lilacs/search", params={"q": query}),
            client.get(f"{API_BASE}/fda/search", params={"q": query}),
        ]
        responses = await asyncio.gather(*tasks, return_exceptions=True)
    return responses
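
With `return_exceptions=True`, failures come back as exception objects mixed in with the successful responses, so the caller has to separate them. A minimal sketch of that step follows; `gather_by_source` and the stand-in coroutines are hypothetical and replace the real HTTP calls.

```python
from __future__ import annotations

import asyncio


async def gather_by_source(tasks: dict) -> tuple[dict, dict]:
    """Await all source coroutines; return (results, errors) keyed by source name."""
    names = list(tasks)
    outcomes = await asyncio.gather(*tasks.values(), return_exceptions=True)
    results, errors = {}, {}
    for name, outcome in zip(names, outcomes):
        if isinstance(outcome, Exception):
            errors[name] = repr(outcome)  # keep the failure for the trace log
        else:
            results[name] = outcome
    return results, errors


# Stand-in coroutines replace the real HTTP calls in this sketch
async def fake_ok(items: list) -> list:
    await asyncio.sleep(0)
    return items


async def fake_timeout() -> list:
    raise TimeoutError("source did not answer")


async def demo() -> tuple[dict, dict]:
    return await gather_by_source({
        "scielo": fake_ok([{"title": "A"}]),
        "lilacs": fake_timeout(),
        "pmc": fake_ok([]),
    })


results, errors = asyncio.run(demo())
```

One failing source then degrades the response instead of failing the whole search.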

Result Caching

Redis Cache

import redis
import json
import hashlib

redis_client = redis.Redis()

def cached_search(query: str, **params):
    # Generate cache key
    key_data = f"{query}:{json.dumps(params, sort_keys=True)}"
    cache_key = f"biblio:{hashlib.md5(key_data.encode()).hexdigest()}"
    
    # Check cache
    cached = redis_client.get(cache_key)
    if cached:
        return json.loads(cached)
    
    # Perform search
    results = search_lilacs(query, **params)
    
    # Store in cache (1 hour TTL)
    redis_client.setex(cache_key, 3600, json.dumps(results))
    return results
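
For unit tests it can help to exercise the caching logic without a Redis server. The sketch below is an assumption-laden stand-in: `MemoryCache` is a hypothetical class that mimics only the `get` and `setex` calls used above, and `cache_key` uses SHA-256 where the example above uses MD5.

```python
import hashlib
import json
import time


def cache_key(query: str, **params) -> str:
    """Same query and params give the same key, whatever the keyword order."""
    payload = json.dumps({"query": query, "params": params}, sort_keys=True)
    return "biblio:" + hashlib.sha256(payload.encode("utf-8")).hexdigest()


class MemoryCache:
    """Dictionary-backed stand-in offering get/setex with expiry."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._data = {}

    def setex(self, key: str, ttl_seconds: int, value: str) -> None:
        self._data[key] = (self._clock() + ttl_seconds, value)

    def get(self, key: str):
        expires_at, value = self._data.get(key, (0.0, None))
        if value is not None and self._clock() >= expires_at:
            del self._data[key]  # expired entry
            return None
        return value
```

The injectable `clock` lets a test advance time instead of sleeping through the TTL.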

Request Batching

Batch Multiple Queries

def search_batch(queries: list[str], source: str):
    """Execute multiple queries with connection pooling"""
    with httpx.Client() as client:
        results = []
        for query in queries:
            try:
                r = client.get(f"{API_BASE}/{source}/search", params={"q": query})
                results.append(r.json())
            except Exception as e:
                results.append({"error": str(e)})
    return results

Local Filtering

Filter Locally After Fetching

# Filter locally instead of server-side for SciELO
# (Server doesn't support complex queries)

# 1. Fetch broader results
raw = harvest_scielo_oai(
    base_url="https://www.scielo.br/oai/scielo-oai.php",
    from_date="2020-01-01",
    max_records=500  # Fetch more
)

# 2. Filter locally
filtered = [
    r for r in raw
    if _match_local(r, "urticaria", ["title", "description"], regex=False)
]

# 3. Keep top N
top_results = filtered[:20]

Best Practices

Do:

Use specific search terms (product + symptom)
Specify date ranges to reduce result volume
Enable local filtering for precision
Cache results for repeated queries
Use LILACS for Latin American data
Query SciELO regional nodes for local journals
Handle 404/403 errors with fallbacks
Log search traces for debugging

Don’t:

Make excessive requests without rate limiting
Ignore timeout settings (can block worker)
Use overly broad queries (“drug” returns millions)
Skip error handling (external APIs are unreliable)
Hard-code URLs (use environment config)
Fetch all records without pagination
Store raw HTML (parse and normalize)
Trust external metadata without validation

Configuration

Environment Variables

# API endpoints (override for testing)
LILACS_BASE="https://pesquisa.bvsalud.org/portal/"
SCIELO_OAI_GLOBAL="https://search.scielo.org/oai/scielo-oai.php"
SCIELO_OAI_PERU="https://scielo.org.pe/oai/scielo-oai.php"

# Timeouts
BIBLIO_TIMEOUT="30"  # seconds
BIBLIO_CONNECT_TIMEOUT="20"  # seconds

# Rate limiting
BIBLIO_RETRY_BACKOFF="0,1,2"  # comma-separated seconds
BIBLIO_MAX_RETRIES="3"

# Cache
REDIS_URL="redis://localhost:6379/2"  # Database 2 for biblio cache
BIBLIO_CACHE_TTL="3600"  # 1 hour
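
These variables could be read into a typed settings object along the following lines. This is a sketch, not the module's loader: `BiblioSettings` and `load_settings` are hypothetical, and using `BIBLIO_MAX_RETRIES` to cap the number of backoff entries is an assumption about how the two settings interact.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class BiblioSettings:
    timeout: float
    connect_timeout: float
    retry_backoff: tuple
    cache_ttl: int


def load_settings(env=None) -> BiblioSettings:
    """Read the BIBLIO_* variables, falling back to the values listed above."""
    env = os.environ if env is None else env
    raw_backoff = env.get("BIBLIO_RETRY_BACKOFF", "0,1,2")
    backoff = tuple(float(part) for part in raw_backoff.split(",") if part.strip())
    max_retries = int(env.get("BIBLIO_MAX_RETRIES", "3"))
    return BiblioSettings(
        timeout=float(env.get("BIBLIO_TIMEOUT", "30")),
        connect_timeout=float(env.get("BIBLIO_CONNECT_TIMEOUT", "20")),
        # Assumption: BIBLIO_MAX_RETRIES caps how many backoff entries are used
        retry_backoff=backoff[:max_retries],
        cache_ttl=int(env.get("BIBLIO_CACHE_TTL", "3600")),
    )
```

Accepting a plain mapping for `env` makes the loader easy to test without touching the process environment.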

Related Documentation

Signal Detection

Using literature search for signal validation

IPS Reports

Integrating bibliographic references into safety reports
