Overview

The Convert Service provides document format conversion capabilities using LibreOffice bindings. It processes conversion requests from SQS queues and supports bidirectional conversion between multiple document formats.
Key Features:
  • Multi-format document conversion
  • DOCX to PDF conversion
  • PDF to DOCX conversion
  • Queue-based async processing
  • S3 integration for file storage
Architecture:
  • Internal-only API (requires internal auth)
  • SQS queue consumer for async jobs
  • LibreOffice bindings for conversion
  • Process isolation via fork for stability

Convert Document

Convert a document from one format to another.
This is an internal-only endpoint requiring internal API authentication.
curl -X POST https://internal.macro.com/internal/convert \
  -H "X-Internal-API-Key: YOUR_INTERNAL_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "fromBucket": "macro-documents",
    "fromKey": "uploads/document.docx",
    "toBucket": "macro-documents",
    "toKey": "converted/document.pdf"
  }'
Request body:
  • fromBucket (string, required): Source S3 bucket containing the input file
  • fromKey (string, required): S3 key path to the source file. File extension determines source format
  • toBucket (string, required): Destination S3 bucket for converted file
  • toKey (string, required): S3 key path for output file. File extension determines target format
Response:
  • success (boolean): Whether conversion completed successfully
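As a rough sketch, the same request could be assembled from JavaScript as shown below. The URL and header name are copied from the curl example; the function names `buildConvertRequest` and `convertDocument` are illustrative rather than part of the service, the JSON response body is assumed from the documented `success` field, and the snippet assumes a runtime with a global `fetch` (Node 18 or later).

```javascript
const CONVERT_URL = 'https://internal.macro.com/internal/convert';

// Hypothetical helper: assembles the arguments for fetch().
function buildConvertRequest({ fromBucket, fromKey, toBucket, toKey }, apiKey) {
  return {
    url: CONVERT_URL,
    init: {
      method: 'POST',
      headers: {
        'X-Internal-API-Key': apiKey,
        'Content-Type': 'application/json',
      },
      // Only the four documented fields are sent.
      body: JSON.stringify({ fromBucket, fromKey, toBucket, toKey }),
    },
  };
}

// Hypothetical wrapper: sends the request and returns the parsed JSON response.
async function convertDocument(params, apiKey) {
  const { url, init } = buildConvertRequest(params, apiKey);
  const response = await fetch(url, init);
  if (!response.ok) {
    throw new Error(`convert request failed with HTTP ${response.status}`);
  }
  return response.json();
}
```

Keeping the request assembly separate from the network call makes the payload easy to inspect or unit test without reaching the internal endpoint.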

Supported Formats

The Convert Service supports conversion between these formats:

Document Formats

| Format | Extension | Notes |
| --- | --- | --- |
| Microsoft Word | .docx | Modern Word format |
| PDF | .pdf | Portable Document Format |
| Rich Text | .rtf | Cross-platform text |
| OpenDocument | .odt | Open standard format |
| Plain Text | .txt | Unformatted text |
| HTML | .html | Web format |
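Because the source and target formats are taken from the file extensions of `fromKey` and `toKey`, a caller may want to check keys before submitting a job. The following is a minimal sketch of such a pre-flight check. The helper names are hypothetical, and treating extensions as case-insensitive is an assumption. The check only confirms that both extensions appear in the table; the service may still reject a particular pair.

```javascript
// Extensions from the table above, without the leading dot.
const SUPPORTED_EXTENSIONS = new Set(['docx', 'pdf', 'rtf', 'odt', 'txt', 'html']);

// Hypothetical helper: returns the lowercase extension of an S3 key,
// or null when the key has no usable extension.
function extensionOf(key) {
  const fileName = key.slice(key.lastIndexOf('/') + 1);
  const dot = fileName.lastIndexOf('.');
  if (dot <= 0 || dot === fileName.length - 1) return null;
  return fileName.slice(dot + 1).toLowerCase();
}

// Hypothetical pre-flight check for a (fromKey, toKey) pair.
function checkConversion(fromKey, toKey) {
  const from = extensionOf(fromKey);
  const to = extensionOf(toKey);
  if (from === null || to === null) {
    return { ok: false, reason: 'missing file extension' };
  }
  if (!SUPPORTED_EXTENSIONS.has(from) || !SUPPORTED_EXTENSIONS.has(to)) {
    return { ok: false, reason: 'unsupported format' };
  }
  return { ok: true, from, to };
}
```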

Common Conversions

DOCX to PDF: High-fidelity conversion preserving formatting, fonts, and layout.
{
  "fromKey": "docs/report.docx",
  "toKey": "pdfs/report.pdf"
}
PDF to DOCX: Converts a PDF to an editable Word document. Quality depends on the PDF's structure.
{
  "fromKey": "pdfs/contract.pdf",
  "toKey": "docs/contract.docx"
}
DOCX to HTML: Exports a Word document as web-ready HTML.
{
  "fromKey": "docs/article.docx",
  "toKey": "web/article.html"
}

Queue-Based Processing

The Convert Service primarily operates as an SQS queue consumer for asynchronous conversions.

Queue Message Format

{
  "fromBucket": "macro-documents",
  "fromKey": "uploads/original.docx",
  "toBucket": "macro-documents",
  "toKey": "converted/original.pdf"
}
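On the consumer side, a message body in this format has to be parsed and checked before any work starts. The sketch below shows one way that might look. `parseConversionMessage` is a hypothetical name, and the rule that all four fields must be non-empty strings is an assumption based on the request parameters documented for the HTTP endpoint.

```javascript
const MESSAGE_FIELDS = ['fromBucket', 'fromKey', 'toBucket', 'toKey'];

// Hypothetical parser: turns an SQS message body into a validated job object.
function parseConversionMessage(body) {
  let parsed;
  try {
    parsed = JSON.parse(body);
  } catch (err) {
    throw new Error(`message body is not valid JSON: ${err.message}`);
  }
  if (parsed === null || typeof parsed !== 'object' || Array.isArray(parsed)) {
    throw new Error('message body must be a JSON object');
  }
  const missing = MESSAGE_FIELDS.filter(
    (field) => typeof parsed[field] !== 'string' || parsed[field].length === 0
  );
  if (missing.length > 0) {
    throw new Error(`missing or invalid fields: ${missing.join(', ')}`);
  }
  // Return only the documented fields so extra properties are dropped.
  const { fromBucket, fromKey, toBucket, toKey } = parsed;
  return { fromBucket, fromKey, toBucket, toKey };
}
```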

Processing Flow

  1. Receive: Consumer polls SQS queue for conversion jobs
  2. Download: Fetch source file from S3 bucket
  3. Convert: Process file using LibreOffice bindings in isolated fork
  4. Upload: Write converted file to destination S3 bucket
  5. Notify: For DOCX conversions, invoke WebSocket lambda with result
  6. Cleanup: Delete temporary files and acknowledge SQS message
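The six steps above can be sketched as a single function. This is an illustration of the control flow only: every function on `deps` is an injected, hypothetical stand-in rather than a real service API, and reading "DOCX conversions" as "the source key ends in .docx" is an assumption.

```javascript
// Hypothetical orchestration of the processing flow.
// `message` is assumed to look like { jobId, body }.
async function processJob(message, deps) {
  const job = JSON.parse(message.body); // 1. Receive (message already polled)
  const workspace = await deps.createWorkspace(message.jobId);
  try {
    // 2. Download the source file from S3.
    const input = await deps.download(job.fromBucket, job.fromKey, workspace);
    // 3. Convert in an isolated process.
    const output = await deps.convert(input, job.toKey, workspace);
    // 4. Upload the result to the destination bucket.
    await deps.upload(output, job.toBucket, job.toKey);
    // 5. Notify, for DOCX sources only (assumption).
    if (job.fromKey.toLowerCase().endsWith('.docx')) {
      await deps.notify(message.jobId, { converted: true });
    }
    // 6. Acknowledge only after every earlier step succeeded.
    await deps.acknowledge(message);
    return true;
  } finally {
    // 6. Cleanup runs on success and on failure.
    await deps.cleanup(workspace);
  }
}
```

If any step throws, the message is never acknowledged, so SQS makes it visible again once its visibility timeout expires.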

DOCX Upload Notifications

When converting DOCX files, the service sends completion notifications:
{
  "jobId": "job_abc123",
  "status": "Completed",
  "jobType": "docx_upload",
  "data": {
    "error": false,
    "data": {
      "converted": true,
      "bomParts": []
    }
  }
}
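A service that receives this notification might reduce it to a simple success flag. The helper below is a hypothetical sketch. Only the "Completed" status and the fields shown above appear in the source, so any other status or a missing field is treated here as "not converted", which is an assumption.

```javascript
// Hypothetical reader-side helper for the notification payload shown above.
function summarizeNotification(notification) {
  const result = notification.data ?? {};
  const details = result.data ?? {};
  return {
    jobId: notification.jobId,
    isDocxUpload: notification.jobType === 'docx_upload',
    // Success requires the completed status, no error flag, and converted: true.
    converted:
      notification.status === 'Completed' &&
      result.error === false &&
      details.converted === true,
  };
}
```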

Conversion Process

Process Isolation

Conversions run in forked child processes for stability:
  • Fork: Parent spawns isolated child process
  • Convert: Child loads LibreOffice and performs conversion
  • Exit: Child exits with status code (0 = success)
  • Monitor: Parent waits for completion with 30-second timeout
This prevents LibreOffice crashes from affecting the main service.
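The fork, wait, and timeout pattern can be illustrated in Node.js as follows. This is a sketch of the idea, not the service's implementation: Node.js has no direct equivalent of a POSIX fork, so `spawn` from `node:child_process` stands in for it, and `runIsolated` is a hypothetical name.

```javascript
// Hypothetical helper: runs a command in a child process and resolves with
// its exit status, killing the child if it outlives `timeoutMs`.
async function runIsolated(command, args, timeoutMs = 30_000) {
  // Dynamic import keeps the snippet usable from ES modules and CommonJS.
  const { spawn } = await import('node:child_process');
  return new Promise((resolve, reject) => {
    const child = spawn(command, args, { stdio: 'ignore' });
    let timedOut = false;
    const timer = setTimeout(() => {
      timedOut = true;
      child.kill('SIGKILL'); // the child cannot ignore SIGKILL
    }, timeoutMs);
    child.once('error', (err) => {
      clearTimeout(timer);
      reject(err); // for example, the command could not be started
    });
    child.once('exit', (code, signal) => {
      clearTimeout(timer);
      // Exit code 0 means success; a killed child reports code === null.
      resolve({ code, signal, timedOut, ok: code === 0 && !timedOut });
    });
  });
}
```

A caller would pass its own worker command, for example `runIsolated(process.execPath, ['convert-worker.js', jobId])`, where `convert-worker.js` is a hypothetical script that performs one conversion and exits. A crash in that script then surfaces as a non-zero result instead of taking down the parent.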

Timeout Handling

Conversions have a 30-second timeout:
  • If conversion exceeds timeout, child process is killed
  • Temporary files are cleaned up
  • SQS message is released back to queue for retry

Temporary Files

Each conversion creates a temporary workspace:
./{job_id}/
├── IN.{source_format}   # Downloaded source file
└── OUT.{target_format}  # Generated output file
Workspaces are cleaned up after conversion (success or failure).
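A minimal sketch of this workspace lifecycle is shown below, assuming a hypothetical `withWorkspace` helper. A `try`/`finally` block guarantees the directory is removed whether the conversion callback returns or throws. The `baseDir` parameter is an addition for illustration; the layout above uses the current directory.

```javascript
// Hypothetical helper: creates `{baseDir}/{jobId}/`, passes the IN/OUT paths
// to `work`, and removes the directory afterwards on success or failure.
async function withWorkspace(jobId, sourceFormat, targetFormat, work, baseDir = '.') {
  const fs = await import('node:fs/promises');
  const path = await import('node:path');
  const dir = path.join(baseDir, jobId);
  await fs.mkdir(dir, { recursive: true });
  try {
    return await work({
      dir,
      input: path.join(dir, `IN.${sourceFormat}`),
      output: path.join(dir, `OUT.${targetFormat}`),
    });
  } finally {
    // `force: true` avoids an error if the directory is already gone.
    await fs.rm(dir, { recursive: true, force: true });
  }
}
```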

Error Handling

Common Errors

  • error (string): Error description

| Error | Cause | Resolution |
| --- | --- | --- |
| unable to get file type | Invalid file extension | Ensure file has valid extension |
| unable to get lok filter | Unsupported conversion | Check supported formats |
| unable to get object from S3 | Source file not found | Verify S3 bucket and key |
| conversion failed | LibreOffice error | Check file format validity |
| fork failed | System resource error | Retry or contact support |

Retry Strategy

Failed conversions follow SQS retry policy:
  1. Message returns to queue after visibility timeout
  2. Up to 3 retry attempts
  3. Dead letter queue after max retries
  4. Manual inspection for persistent failures
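In SQS, this kind of policy is normally expressed through the queue's `RedrivePolicy` attribute, a JSON string with `deadLetterTargetArn` and `maxReceiveCount`. The sketch below builds that string. `buildRedrivePolicy` is a hypothetical helper, and because `maxReceiveCount` counts total receives, whether "3 retry attempts" corresponds to a value of 3 or 4 is an assumption the caller has to settle.

```javascript
// Hypothetical helper: builds the value for the SQS `RedrivePolicy` attribute.
function buildRedrivePolicy(deadLetterQueueArn, maxReceiveCount = 3) {
  if (!Number.isInteger(maxReceiveCount) || maxReceiveCount < 1) {
    throw new RangeError('maxReceiveCount must be a positive integer');
  }
  return JSON.stringify({
    deadLetterTargetArn: deadLetterQueueArn,
    maxReceiveCount,
  });
}

// Placeholder ARN for illustration only.
const redrivePolicy = buildRedrivePolicy('arn:aws:sqs:us-east-1:123456789012:convert-dlq');
```

The resulting string would be supplied as the `RedrivePolicy` attribute when the conversion queue is created or updated.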

Performance Considerations

Conversion Time

Typical conversion times:
  • Small files (< 1 MB): 1-3 seconds
  • Medium files (1-10 MB): 3-10 seconds
  • Large files (10-50 MB): 10-30 seconds
  • Files > 50 MB: May time out (30s limit)

Concurrency

  • Multiple workers process queue in parallel
  • Each worker handles one conversion at a time
  • Fork isolation prevents interference between jobs

Resource Usage

  • CPU-intensive during conversion
  • Temporary disk space: ~3x source file size
  • Memory: ~500MB per LibreOffice instance
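These figures lend themselves to a quick capacity estimate. The sketch below multiplies them out for a worker pool, assuming the worst case in which every worker converts a file of the maximum expected size at the same time. `estimatePeakResources` is a hypothetical helper, and the constants are the approximate values listed above.

```javascript
const MEMORY_PER_INSTANCE_MB = 500; // "~500MB per LibreOffice instance"
const DISK_MULTIPLIER = 3;          // "~3x source file size"

// Hypothetical estimator: peak needs when all workers are busy at once
// (each worker handles one conversion at a time).
function estimatePeakResources(workerCount, maxFileSizeMb) {
  return {
    memoryMb: workerCount * MEMORY_PER_INSTANCE_MB,
    diskMb: workerCount * maxFileSizeMb * DISK_MULTIPLIER,
  };
}

// 4 workers with 50 MB files: about 2000 MB of memory and 600 MB of disk.
const estimate = estimatePeakResources(4, 50);
```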

Backfill Operations

Backfill DOCX

Bulk conversion endpoint for batch processing (internal only).
curl -X POST https://internal.macro.com/internal/backfill/docx \
  -H "X-Internal-API-Key: YOUR_INTERNAL_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "items": [
      {
        "fromBucket": "macro-documents",
        "fromKey": "batch/doc1.docx",
        "toBucket": "macro-converted",
        "toKey": "batch/doc1.pdf"
      }
    ]
  }'
Request body:
  • items (array, required): Array of conversion requests to process
Backfill operations queue multiple conversions for batch processing.
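For larger batches, the `items` array would usually be generated rather than written by hand. The following sketch builds a request body from a list of DOCX keys. `buildBackfillBody` is a hypothetical helper, and reusing the source path with a `.pdf` extension as the destination key is an assumption modeled on the example above.

```javascript
// Hypothetical helper: maps DOCX keys to backfill items that write a PDF
// with the same path to the destination bucket.
function buildBackfillBody(fromBucket, toBucket, docxKeys) {
  const items = docxKeys.map((fromKey) => {
    if (!/\.docx$/i.test(fromKey)) {
      throw new Error(`not a .docx key: ${fromKey}`);
    }
    return {
      fromBucket,
      fromKey,
      toBucket,
      toKey: fromKey.replace(/\.docx$/i, '.pdf'),
    };
  });
  return { items };
}
```

Calling `buildBackfillBody('macro-documents', 'macro-converted', ['batch/doc1.docx'])` produces the same body as the curl example.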

Integration Example

Triggering Conversion from Another Service

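import { randomUUID } from 'node:crypto';
import { SQSClient, SendMessageCommand } from '@aws-sdk/client-sqs';

const sqs = new SQSClient({ region: 'us-east-1' });

// Queue a conversion job and return the ID attached to the message.
async function queueConversion(fromKey, toKey) {
  // The job ID format is not specified here; a random UUID is used as a stand-in.
  const jobId = randomUUID();

  await sqs.send(new SendMessageCommand({
    QueueUrl: process.env.CONVERT_QUEUE_URL,
    // The body follows the queue message format described above.
    MessageBody: JSON.stringify({
      fromBucket: 'macro-documents',
      fromKey,
      toBucket: 'macro-documents',
      toKey
    }),
    // The job ID travels as a message attribute, not in the body.
    MessageAttributes: {
      job_id: {
        StringValue: jobId,
        DataType: 'String'
      }
    }
  }));

  return jobId;
}

// Queue a DOCX to PDF conversion
const jobId = await queueConversion(
  'uploads/report.docx',
  'converted/report.pdf'
);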

Monitoring

Metrics

Key metrics to monitor:
  • Queue depth: Number of pending conversions
  • Processing time: Average conversion duration
  • Error rate: Failed conversion percentage
  • Timeout rate: Conversions exceeding 30s
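Queue depth comes from SQS itself (the `ApproximateNumberOfMessages` queue attribute), while the other three metrics can be derived from per-job records. The sketch below shows that derivation. The record shape `{ durationMs, succeeded }` and the function name are assumptions, and a job is counted as a timeout when it ran for the full 30 seconds.

```javascript
const TIMEOUT_MS = 30_000;

// Hypothetical aggregation over finished jobs. Each record is assumed to
// look like { durationMs: number, succeeded: boolean }.
function summarizeJobs(jobs) {
  if (jobs.length === 0) {
    return { averageDurationMs: 0, errorRate: 0, timeoutRate: 0 };
  }
  const totalMs = jobs.reduce((sum, job) => sum + job.durationMs, 0);
  const failures = jobs.filter((job) => !job.succeeded).length;
  const timeouts = jobs.filter((job) => job.durationMs >= TIMEOUT_MS).length;
  return {
    averageDurationMs: totalMs / jobs.length,
    errorRate: failures / jobs.length,     // fraction of failed conversions
    timeoutRate: timeouts / jobs.length,   // fraction that hit the 30s limit
  };
}
```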

Logging

Conversion logs include:
  • job_id: Unique conversion identifier
  • from_key: Source file path
  • to_key: Destination file path
  • status_code: Child process exit code
  • Processing timestamps and durations
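For illustration, a log entry carrying these fields might be produced as one JSON line per conversion. In the sketch below, the first four field names come from the list above. The timing field names (`started_at`, `finished_at`, `duration_ms`), the epoch-millisecond inputs, and the function name are assumptions.

```javascript
// Hypothetical formatter: one JSON log line per conversion.
// `startedAt` and `finishedAt` are epoch timestamps in milliseconds.
function formatConversionLog({ jobId, fromKey, toKey, statusCode, startedAt, finishedAt }) {
  return JSON.stringify({
    job_id: jobId,
    from_key: fromKey,
    to_key: toKey,
    status_code: statusCode, // child process exit code, 0 = success
    started_at: new Date(startedAt).toISOString(),
    finished_at: new Date(finishedAt).toISOString(),
    duration_ms: finishedAt - startedAt,
  });
}
```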
