pandocHandler

Overview

The pandocHandler converts between document formats using Pandoc compiled to WebAssembly. It covers a wide range of formats, including Markdown variants, office documents, HTML, LaTeX, and many other markup languages.

Supported Formats

pandocHandler queries Pandoc at runtime for its input and output formats rather than hard-coding them. It supports 80+ formats across multiple categories:

Markdown Variants

Markdown - Pandoc’s Markdown
GFM - GitHub-Flavored Markdown
CommonMark - CommonMark Markdown
CommonMark_x - CommonMark with extensions
Markdown_strict - Original unextended Markdown
Markdown_mmd - MultiMarkdown
Markdown_phpextra - PHP Markdown Extra

Office Documents

DOCX - Microsoft Word Document
XLSX - Microsoft Excel Spreadsheet
PPTX - Microsoft PowerPoint Presentation
ODT - OpenDocument Text
RTF - Rich Text Format

Markup Languages

HTML - Hypertext Markup Language
HTML5 - HTML5
LaTeX - LaTeX typesetting
reStructuredText - RST
AsciiDoc - AsciiDoc markup
MediaWiki - MediaWiki markup
Textile - Textile markup
Org - Emacs Org mode

Presentation Formats

Beamer - LaTeX Beamer slides
DZSlides - DZSlides HTML slides
Slidy - Slidy HTML slides
Slideous - Slideous HTML slides
S5 - S5 HTML slides

Other Formats

EPUB - Electronic Publication (v2 and v3)
DocBook - DocBook v4 and v5
JATS - JATS XML
TEI - TEI Simple
Typst - Typst typesetting
Jupyter - Jupyter notebooks (.ipynb)
CSV - Comma-Separated Values
TSV - Tab-Separated Values
JSON - JSON (CSL bibliography)
XML - Various XML formats
MathML - Mathematical Markup Language

Filtered Formats

Two formats are skipped while the format list is built:

// PDF removed - doesn't work in this configuration
if (format === "pdf") continue;

// RevealJS removed - hangs indefinitely
if (format === "revealjs") continue;

Initialization

The handler dynamically loads Pandoc and queries supported formats:

const handler = new pandocHandler();
await handler.init();

Initialization Process

Dynamically imports Pandoc WASM module
Queries input formats: query({ query: "input-formats" })
Queries output formats: query({ query: "output-formats" })
Manually adds MathML (supported but not exposed by query)
Normalizes format metadata
Categorizes formats
Prioritizes common formats

const { query, convert } = await import("./pandoc/pandoc.js");
this.query = query;
this.convert = convert;

const inputFormats: string[] = await query({ query: "input-formats" });
const outputFormats: string[] = await query({ query: "output-formats" });

// Pandoc supports MathML natively but doesn't expose as a format
outputFormats.push("mathml");
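
The merge of the two query results into a single list can be sketched as follows. This is a minimal illustration rather than the handler's actual code: `FormatFlags` and `mergeFormatLists` are hypothetical names. The sketch assumes that the PDF and RevealJS filters and the manual MathML addition are applied in the same pass.

```typescript
// Hypothetical shape: which directions each Pandoc format supports.
interface FormatFlags {
  internal: string; // Pandoc's own identifier
  from: boolean;    // usable as input
  to: boolean;      // usable as output
}

// Formats the handler skips (see "Filtered Formats").
const SKIPPED = new Set<string>(["pdf", "revealjs"]);

// Illustrative helper, not part of the handler's API.
function mergeFormatLists(inputs: string[], outputs: string[]): FormatFlags[] {
  // MathML is added by hand because the output-format query does not list it.
  const allOutputs = [...outputs, "mathml"];
  const seen = new Set<string>();
  const merged: FormatFlags[] = [];

  for (const format of [...inputs, ...allOutputs]) {
    // Drop filtered formats and avoid listing a format twice.
    if (SKIPPED.has(format) || seen.has(format)) continue;
    seen.add(format);
    merged.push({
      internal: format,
      from: inputs.includes(format),
      to: allOutputs.includes(format),
    });
  }
  return merged;
}
```

Calling `mergeFormatLists(["markdown", "docx"], ["markdown", "html", "pdf"])` yields four entries: `markdown` (both directions), `docx` (input only), `html` and `mathml` (output only). `pdf` is dropped.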

Format Naming

The handler uses custom format names for better clarity:

static formatNames: Map<string, string> = new Map([
  ["html", "Hypertext Markup Language"],
  ["docx", "Microsoft Word Document"],
  ["xlsx", "Microsoft Excel Spreadsheet"],
  ["pptx", "Microsoft PowerPoint Presentation"],
  ["markdown", "Pandoc's Markdown"],
  ["gfm", "GitHub-Flavored Markdown"],
  ["latex", "LaTeX"],
  ["epub", "EPUB v3"],
  ["csv", "Comma-Separated Values"],
  ["json", "JavaScript Object Notation"],
  ["xml", "Extensible Markup Language"],
  ["rst", "reStructuredText"],
  ["org", "Emacs Org mode"],
  ["mediawiki", "MediaWiki markup"],
  ["textile", "Textile"],
  ["typst", "Typst"],
  // ... and 70+ more
]);

Format Extensions

Custom extension mappings for formats where extension differs from format name:

static formatExtensions: Map<string, string> = new Map([
  ["html5", "html"],
  ["markdown", "md"],
  ["gfm", "md"],
  ["latex", "tex"],
  ["beamer", "tex"],
  ["typst", "typ"],
  ["djot", "dj"],
  ["rst", "rst"],
  ["asciidoc", "adoc"],
  ["vimdoc", "txt"],
  // ... and more
]);
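
A lookup against this map, falling back to the format name when no mapping exists, might look like the sketch below. `extensionFor` is a hypothetical helper name, and the map contains only a few of the entries listed above.

```typescript
// A small excerpt of the extension map shown above.
const formatExtensions: Map<string, string> = new Map([
  ["markdown", "md"],
  ["latex", "tex"],
  ["typst", "typ"],
]);

// Hypothetical helper: use the mapped extension, else the format name itself.
function extensionFor(format: string): string {
  return formatExtensions.get(format) ?? format;
}
```

With this excerpt, `extensionFor("latex")` returns `"tex"`, while an unmapped format such as `"docx"` is returned unchanged.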

Format Categorization

Formats are categorized for filtering and organization:

Spreadsheets

if (format === "xlsx") categories.push("spreadsheet");

Presentations

else if (format === "pptx") categories.push("presentation");

Text Formats

if (
  name.toLowerCase().includes("text")
  || mimeType === "text/plain"
) {
  categories.push("text");
} else {
  categories.push("document");
}
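
The three checks can be read together as one function. The sketch below assumes they form a single if/else chain, so each format lands in exactly one category. The original source may combine them differently, and `categorize` is a hypothetical name.

```typescript
// Hypothetical helper combining the three checks above.
// Assumption: the checks form one chain, so exactly one category is assigned.
function categorize(format: string, name: string, mimeType: string): string[] {
  const categories: string[] = [];

  if (format === "xlsx") {
    categories.push("spreadsheet");
  } else if (format === "pptx") {
    categories.push("presentation");
  } else if (name.toLowerCase().includes("text") || mimeType === "text/plain") {
    // Matches any display name containing "text", not only plain text.
    categories.push("text");
  } else {
    categories.push("document");
  }
  return categories;
}
```

Because the text check is a substring match on the display name, names such as "Hypertext Markup Language" and "Rich Text Format" also fall into the "text" category under this reading.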

Conversion Process

Basic Conversion

const outputFiles = await handler.doConvert(
  inputFiles,
  inputFormat,
  outputFormat
);

Per-File Processing

Unlike other handlers, pandocHandler processes files individually:

const outputFiles: FileData[] = [];

for (const inputFile of inputFiles) {
  const files = {
    [inputFile.name]: new Blob([inputFile.bytes as BlobPart])
  };
  
  let options = {
    from: inputFormat.internal,
    to: outputFormat.internal,
    "input-files": [inputFile.name],
    "output-file": "output",
    "embed-resources": true,
    "html-math-method": "mathjax",
  }
  
  const { stderr } = await this.convert(options, null, files);
  
  if (stderr) throw stderr;
  
  const outputBlob = files.output;
  const arrayBuffer = await outputBlob.arrayBuffer();
  const bytes = new Uint8Array(arrayBuffer);
  
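  // Swap the input file's extension for the output format's extension
  const name = inputFile.name.split(".").slice(0, -1).join(".") + "." + outputFormat.extension;
  outputFiles.push({ bytes, name });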
}

return outputFiles;

Conversion Options

from

string

required

Input format identifier (e.g., “markdown”, “docx”)

to

string

required

Output format identifier (e.g., “html”, “docx”)

input-files

string[]

required

Array of input filenames in the virtual file system

output-file

string

required

Output filename in the virtual file system

embed-resources

boolean

default:true

Embed all resources (images, CSS, etc.) in the output file

html-math-method

string

default:"mathjax"

Method for rendering math in HTML output: "mathjax" or "mathml"

Special Format Handling

MathML Output

MathML is handled specially since Pandoc doesn’t expose it as a format:

if (outputFormat.internal === "mathml") {
  options.to = "html";
  options["html-math-method"] = "mathml";
}

This outputs HTML with MathML for mathematical expressions.
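
Putting the option defaults and the MathML rewrite together, option construction could be sketched as below. `PandocOptions` and `buildOptions` are hypothetical names introduced for illustration. The option keys and values are the ones shown under "Per-File Processing".

```typescript
// Hypothetical type for the options object passed to convert().
interface PandocOptions {
  from: string;
  to: string;
  "input-files": string[];
  "output-file": string;
  "embed-resources": boolean;
  "html-math-method": string;
}

// Illustrative helper: builds the per-file options and applies the MathML rewrite.
function buildOptions(from: string, to: string, inputName: string): PandocOptions {
  const options: PandocOptions = {
    from,
    to,
    "input-files": [inputName],
    "output-file": "output",
    "embed-resources": true,
    "html-math-method": "mathjax",
  };

  // "mathml" is not a real Pandoc writer: emit HTML and render math as MathML.
  if (to === "mathml") {
    options.to = "html";
    options["html-math-method"] = "mathml";
  }
  return options;
}
```

For example, `buildOptions("latex", "mathml", "paper.tex")` produces options with `to` set to `"html"` and `"html-math-method"` set to `"mathml"`, while any other target keeps the `"mathjax"` default.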

Plain Text Normalization

Pandoc’s “plain” format is normalized to “text” for consistency:

if (format === "plain") format = "text";

Resource Embedding

HTML outputs automatically embed all resources:

"embed-resources": true,

This ensures images and stylesheets are included in the output file.

Format Prioritization

HTML is prioritized as it can embed resources:

const htmlIndex = this.supportedFormats.findIndex(c => c.internal === "html");
const htmlFormat = this.supportedFormats[htmlIndex];
this.supportedFormats.splice(htmlIndex, 1);
this.supportedFormats.unshift(htmlFormat);

JSON/XML formats are deprioritized (moved to end) as Pandoc’s internal formats are rarely what users want:

const jsonXmlFormats = this.supportedFormats.filter(c =>
  c.mime === "application/json"
  || c.mime === "application/xml"
);
this.supportedFormats = this.supportedFormats.filter(c =>
  c.mime !== "application/json"
  && c.mime !== "application/xml"
);
this.supportedFormats.push(...jsonXmlFormats);
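
Both reordering steps can also be expressed as one pure function. The sketch below is an alternative formulation, not the handler's code. `FormatEntry` and `prioritize` are hypothetical names. Unlike the `findIndex`/`splice` version, it leaves the list untouched when no "html" entry exists (`findIndex` returns -1 in that case, and `splice(-1, 1)` would remove the last element instead).

```typescript
// Hypothetical minimal shape of a format entry for this sketch.
interface FormatEntry {
  internal: string;
  mime: string;
}

// Illustrative helper: HTML first, JSON/XML last, everything else in original order.
function prioritize(formats: FormatEntry[]): FormatEntry[] {
  const isJsonOrXml = (f: FormatEntry): boolean =>
    f.mime === "application/json" || f.mime === "application/xml";

  const html = formats.filter(f => f.internal === "html");
  const middle = formats.filter(f => f.internal !== "html" && !isJsonOrXml(f));
  const last = formats.filter(f => f.internal !== "html" && isJsonOrXml(f));

  return [...html, ...middle, ...last];
}
```

Because `filter` preserves order, formats within each group keep their relative positions from the input list.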

Lossless Detection

Office formats are marked as lossy due to conversion limitations:

const isOfficeDocument = format === "docx"
  || format === "xlsx"
  || format === "pptx"
  || format === "odt"
  || format === "ods"
  || format === "odp";
  
lossless: !isOfficeDocument

Output File Naming

Output files preserve the base name with updated extension:

const name = inputFile.name.split(".").slice(0, -1).join(".") + "." + outputFormat.extension;

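Only the final extension is replaced, so a filename with multiple dots keeps everything before the last one (e.g., archive.tar.gz becomes archive.tar plus the new extension). A filename with no dot at all yields an empty base name, leaving just the dot and the new extension.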
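
A more defensive variant of the naming rule might look like the sketch below. It is a suggestion, not the handler's current behavior, and `outputName` is a hypothetical helper. It keeps the whole input name as the base when there is no extension to strip.

```typescript
// Hypothetical helper: replace the last extension, or append one if none exists.
function outputName(inputName: string, extension: string): string {
  const lastDot = inputName.lastIndexOf(".");
  // lastDot <= 0 covers both "README" (no dot) and ".bashrc" (leading dot only).
  const base = lastDot > 0 ? inputName.slice(0, lastDot) : inputName;
  return `${base}.${extension}`;
}
```

Here `outputName("notes.md", "html")` gives `"notes.html"`, `outputName("archive.tar.gz", "html")` gives `"archive.tar.html"`, and `outputName("README", "html")` gives `"README.html"`.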

Error Handling

if (stderr) throw stderr;

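If Pandoc writes anything to stderr, the handler throws that output immediately and the conversion stops at the current file. The thrown value is Pandoc's own diagnostic output.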
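
Callers that prefer a regular Error object with some context could wrap the check as sketched below. Both helpers are hypothetical and not part of the handler. The sketch assumes `stderr` is a string that is empty when Pandoc reported nothing.

```typescript
// Hypothetical helper: wraps Pandoc's stderr text in an Error that names the file.
// Assumption: stderr is a string, empty when nothing was reported.
function throwOnStderr(stderr: string, fileName: string): void {
  if (stderr) {
    throw new Error(`Pandoc failed for ${fileName}: ${stderr.trim()}`);
  }
}

// Returns the error message instead of letting the exception propagate.
function describeFailure(stderr: string, fileName: string): string | null {
  try {
    throwOnStderr(stderr, fileName);
    return null;
  } catch (error) {
    return error instanceof Error ? error.message : String(error);
  }
}
```

Wrapping the text in an Error gives callers a stack trace and a consistent `message` property, at the cost of diverging from the raw value the handler throws today.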

Virtual File System

Pandoc uses a virtual file system for I/O:

const files = {
  [inputFile.name]: new Blob([inputFile.bytes as BlobPart])
};

await this.convert(options, null, files);

// Output is written back to the files object
const outputBlob = files.output;

Format Metadata Structure

name

string

Human-readable format name from formatNames map

format

string

Normalized format identifier (e.g., “text” instead of “plain”)

extension

string

File extension from formatExtensions map or format name

mime

string

Normalized MIME type

from

boolean

Whether format can be used as input

to

boolean

Whether format can be used as output

internal

string

Pandoc’s internal format identifier

Properties

name

string

default:"pandoc"

Handler identifier

supportedFormats

FileFormat[] | undefined

Array of supported formats populated during initialization

ready

boolean

true when initialization is complete and handler is ready for conversions

Performance Considerations

Processes files individually (no batch optimization)
Embeds all resources by default (increases file size)
Suitable for text-based document conversions
May be slower than native handlers for large files

Use Cases

Ideal for:

Markdown to HTML conversion
Document format interchange (DOCX ↔ ODT ↔ HTML)
Creating presentations from Markdown
Converting between markup languages
Academic writing workflows (LaTeX, EPUB, etc.)

Not ideal for:

PDF generation (disabled in this configuration)
RevealJS presentations (hangs indefinitely)
Large binary office documents

Source Reference

Implementation: ~/workspace/source/src/handlers/pandoc.ts
