DOCX Skill Reference

Overview

The DOCX skill enables comprehensive Word document manipulation including creating professional documents from scratch, editing existing files with tracked changes and comments, and extracting content. A .docx file is a ZIP archive containing XML files that can be programmatically manipulated.
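
Because a .docx file is a ZIP archive, a quick sanity check is to look for the ZIP signature at the start of the file. The following is a minimal sketch; `looksLikeZip` is a hypothetical helper name, not part of the skill's scripts or of any library.

```javascript
// Sketch: check whether a buffer starts with the ZIP local-file-header
// signature (bytes 50 4B 03 04, i.e. "PK\x03\x04").
// `looksLikeZip` is a hypothetical helper, not part of any library.
function looksLikeZip(buffer) {
  const signature = [0x50, 0x4b, 0x03, 0x04];
  if (buffer.length < signature.length) return false;
  return signature.every((byte, i) => buffer[i] === byte);
}

// Usage (assumes a local file named document.docx):
// const fs = require("fs");
// console.log(looksLikeZip(fs.readFileSync("document.docx")));
```

A file that fails this check is not a valid .docx container, whatever its extension says.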

Use this skill whenever working with Word documents (.docx files), including creating reports, memos, letters, templates, or any document requiring professional formatting like tables of contents, headings, page numbers, or letterheads.

Quick Reference

Task	Approach
Read/analyze content	`pandoc` or unpack for raw XML
Create new document	Use `docx-js` library
Edit existing document	Unpack → edit XML → repack

Reading and Converting Documents

Converting Legacy .doc Files

Legacy .doc files must be converted before editing:

python scripts/office/soffice.py --headless --convert-to docx document.doc

Reading Content

# Text extraction with tracked changes
pandoc --track-changes=all document.docx -o output.md

# Raw XML access
python scripts/office/unpack.py document.docx unpacked/

Converting to Images

python scripts/office/soffice.py --headless --convert-to pdf document.docx
pdftoppm -jpeg -r 150 document.pdf page

Accepting Tracked Changes

To produce a clean document with all tracked changes accepted:

python scripts/accept_changes.py input.docx output.docx

Creating New Documents

Setup and Installation

Install the docx-js library globally:

npm install -g docx

Basic Document Structure

const fs = require('fs');
const { Document, Packer, Paragraph, TextRun } = require('docx');

const doc = new Document({ 
  sections: [{ 
    children: [
      new Paragraph({
        children: [new TextRun("Hello World")]
      })
    ] 
  }] 
});

Packer.toBuffer(doc).then(buffer => {
  fs.writeFileSync("doc.docx", buffer);
});

After creating a document, always validate it with python scripts/office/validate.py doc.docx. If validation fails, unpack, fix the XML, and repack.

Page Size and Orientation

Page Size Configuration

CRITICAL: docx-js defaults to A4, not US Letter. Always set page size explicitly:

sections: [{
  properties: {
    page: {
      size: {
        width: 12240,   // 8.5 inches in DXA (1440 DXA = 1 inch)
        height: 15840   // 11 inches in DXA
      },
      margin: { 
        top: 1440, 
        right: 1440, 
        bottom: 1440, 
        left: 1440 
      } // 1 inch margins
    }
  },
  children: [/* content */]
}]

Common page sizes (DXA units):

Paper	Width	Height	Content Width (1" margins)
US Letter	12,240	15,840	9,360
A4 (default)	11,906	16,838	9,026
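
The figures in the table follow from 1440 DXA per inch and from subtracting the left and right margins from the page width. A small sketch of that arithmetic is shown here; the helper names `inchesToDxa` and `contentWidthDxa` are illustrative assumptions and are not part of docx-js.

```javascript
// 1440 DXA = 1 inch (as stated in the page size example).
const DXA_PER_INCH = 1440;

// Convert inches to DXA, rounded to a whole unit.
function inchesToDxa(inches) {
  return Math.round(inches * DXA_PER_INCH);
}

// Content width is the page width minus the left and right margins.
function contentWidthDxa(pageWidth, marginLeft, marginRight) {
  return pageWidth - marginLeft - marginRight;
}

console.log(inchesToDxa(8.5));                    // 12240 (US Letter width)
console.log(inchesToDxa(11));                     // 15840 (US Letter height)
console.log(contentWidthDxa(12240, 1440, 1440));  // 9360
console.log(contentWidthDxa(11906, 1440, 1440));  // 9026
```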

Landscape orientation:

size: {
  width: 12240,   // Pass SHORT edge as width
  height: 15840,  // Pass LONG edge as height
  orientation: PageOrientation.LANDSCAPE  // docx-js swaps them in XML
}

Styles and Headings

Use Arial as the default font (universally supported). Override built-in heading styles:

const doc = new Document({
  styles: {
    default: { 
      document: { 
        run: { font: "Arial", size: 24 } 
      } 
    },
    paragraphStyles: [
      {
        id: "Heading1", 
        name: "Heading 1", 
        basedOn: "Normal", 
        next: "Normal", 
        quickFormat: true,
        run: { size: 32, bold: true, font: "Arial" },
        paragraph: { 
          spacing: { before: 240, after: 240 }, 
          outlineLevel: 0 // Required for TOC
        }
      },
      {
        id: "Heading2", 
        name: "Heading 2", 
        basedOn: "Normal", 
        next: "Normal", 
        quickFormat: true,
        run: { size: 28, bold: true, font: "Arial" },
        paragraph: { 
          spacing: { before: 180, after: 180 }, 
          outlineLevel: 1 
        }
      }
    ]
  },
  sections: [{
    children: [
      new Paragraph({ 
        heading: HeadingLevel.HEADING_1, 
        children: [new TextRun("Title")] 
      })
    ]
  }]
});

Lists

NEVER insert Unicode bullet characters manually. Always use a numbering configuration with LevelFormat.BULLET.

// ❌ WRONG - never manually insert bullet characters
new Paragraph({ children: [new TextRun("• Item")] })  // BAD

// ✅ CORRECT - use numbering config
const doc = new Document({
  numbering: {
    config: [
      {
        reference: "bullets",
        levels: [{
          level: 0, 
          format: LevelFormat.BULLET, 
          text: "•", 
          alignment: AlignmentType.LEFT,
          style: { 
            paragraph: { 
              indent: { left: 720, hanging: 360 } 
            } 
          }
        }]
      },
      {
        reference: "numbers",
        levels: [{
          level: 0, 
          format: LevelFormat.DECIMAL, 
          text: "%1.", 
          alignment: AlignmentType.LEFT,
          style: { 
            paragraph: { 
              indent: { left: 720, hanging: 360 } 
            } 
          }
        }]
      }
    ]
  },
  sections: [{
    children: [
      new Paragraph({ 
        numbering: { reference: "bullets", level: 0 },
        children: [new TextRun("Bullet item")] 
      }),
      new Paragraph({ 
        numbering: { reference: "numbers", level: 0 },
        children: [new TextRun("Numbered item")] 
      })
    ]
  }]
});

Each reference creates INDEPENDENT numbering. Paragraphs that share a reference continue one sequence (1, 2, 3 then 4, 5, 6); paragraphs that use a different reference start a new sequence (1, 2, 3 then 1, 2, 3).

Tables

Table Configuration

CRITICAL: Tables need dual widths - set both columnWidths on the table AND width on each cell.

const border = { style: BorderStyle.SINGLE, size: 1, color: "CCCCCC" };
const borders = { top: border, bottom: border, left: border, right: border };

new Table({
  width: { size: 9360, type: WidthType.DXA }, // Always use DXA
  columnWidths: [4680, 4680], // Must sum to table width
  rows: [
    new TableRow({
      children: [
        new TableCell({
          borders,
          width: { size: 4680, type: WidthType.DXA }, // Set on each cell
          shading: { 
            fill: "D5E8F0", 
            type: ShadingType.CLEAR // CLEAR not SOLID
          },
          margins: { 
            top: 80, bottom: 80, 
            left: 120, right: 120 
          },
          children: [
            new Paragraph({ 
              children: [new TextRun("Cell")] 
            })
          ]
        })
      ]
    })
  ]
})

Width rules:

Always use WidthType.DXA — never WidthType.PERCENTAGE (breaks in Google Docs)
Table width must equal sum of columnWidths
Cell width must match corresponding columnWidth
Cell margins are internal padding: they reduce the content area and do not add to the cell width
For full-width tables: use content width (page width minus margins)
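
The width rules above can be checked mechanically before a table is built. This is a sketch under stated assumptions: `evenColumnWidths` and `checkTableWidths` are hypothetical helpers (not docx-js functions), they work on plain numbers in DXA, and they do not account for merged cells.

```javascript
// Split a table width (DXA) into `count` whole-number column widths
// that sum exactly to the table width.
function evenColumnWidths(tableWidth, count) {
  const base = Math.floor(tableWidth / count);
  const remainder = tableWidth - base * count;
  // Hand the leftover DXA units to the first `remainder` columns.
  return Array.from({ length: count }, (_, i) => base + (i < remainder ? 1 : 0));
}

// Return a list of rule violations (an empty list means the widths are consistent).
// `rowsOfCellWidths` holds one array of cell widths per table row.
function checkTableWidths(tableWidth, columnWidths, rowsOfCellWidths) {
  const problems = [];
  const sum = columnWidths.reduce((total, width) => total + width, 0);
  if (sum !== tableWidth) {
    problems.push(`columnWidths sum to ${sum}, expected ${tableWidth}`);
  }
  rowsOfCellWidths.forEach((cells, rowIndex) => {
    cells.forEach((width, colIndex) => {
      if (width !== columnWidths[colIndex]) {
        problems.push(
          `row ${rowIndex}, cell ${colIndex}: width ${width} does not match column width ${columnWidths[colIndex]}`
        );
      }
    });
  });
  return problems;
}

console.log(evenColumnWidths(9360, 2)); // [ 4680, 4680 ]
console.log(checkTableWidths(9360, [4680, 4680], [[4680, 4680]])); // []
```

The resulting array can be passed as columnWidths, with each entry reused as the size of the matching cell's width.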

Images

// CRITICAL: type parameter is REQUIRED
new Paragraph({
  children: [
    new ImageRun({
      type: "png", // Required: png, jpg, jpeg, gif, bmp, svg
      data: fs.readFileSync("image.png"),
      transformation: { width: 200, height: 150 },
      altText: { 
        title: "Title", 
        description: "Desc", 
        name: "Name" 
      } // All three required
    })
  ]
})

Hyperlinks and Bookmarks

Hyperlink Examples

External links:

new Paragraph({
  children: [
    new ExternalHyperlink({
      children: [
        new TextRun({ 
          text: "Click here", 
          style: "Hyperlink" 
        })
      ],
      link: "https://example.com"
    })
  ]
})

Internal links (bookmarks):

// 1. Create bookmark at destination
new Paragraph({ 
  heading: HeadingLevel.HEADING_1, 
  children: [
    new Bookmark({ 
      id: "chapter1", 
      children: [new TextRun("Chapter 1")] 
    })
  ]
})

// 2. Link to it
new Paragraph({ 
  children: [
    new InternalHyperlink({
      children: [
        new TextRun({ 
          text: "See Chapter 1", 
          style: "Hyperlink" 
        })
      ],
      anchor: "chapter1"
    })
  ]
})

Headers, Footers, and Page Numbers

sections: [{
  properties: {
    page: { 
      margin: { 
        top: 1440, right: 1440, 
        bottom: 1440, left: 1440 
      } 
    }
  },
  headers: {
    default: new Header({ 
      children: [
        new Paragraph({ 
          children: [new TextRun("Header")] 
        })
      ] 
    })
  },
  footers: {
    default: new Footer({ 
      children: [
        new Paragraph({
          children: [
            new TextRun("Page "), 
            new TextRun({ 
              children: [PageNumber.CURRENT] 
            })
          ]
        })
      ] 
    })
  },
  children: [/* content */]
}]

Table of Contents

// CRITICAL: Headings must use HeadingLevel ONLY - no custom styles
new TableOfContents("Table of Contents", { 
  hyperlink: true, 
  headingStyleRange: "1-3" 
})

Editing Existing Documents

Follow all 3 steps in order:

Step 1: Unpack

python scripts/office/unpack.py document.docx unpacked/

Extracts XML, pretty-prints, merges adjacent runs, and converts smart quotes to XML entities so they survive editing.

Step 2: Edit XML

Edit files in unpacked/word/. Use “Claude” as the author for tracked changes and comments unless specified otherwise.

Use the Edit tool directly for string replacement. Do not write Python scripts. Scripts introduce unnecessary complexity.

CRITICAL: Use smart quotes for new content:

<w:t>Here&#x2019;s a quote: &#x201C;Hello&#x201D;</w:t>

Entity	Character
`&#x2018;`	‘ (left single)
`&#x2019;`	’ (right single / apostrophe)
`&#x201C;`	“ (left double)
`&#x201D;`	” (right double)
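
To make the mapping concrete, here is a minimal sketch that turns text containing smart quotes into the entity form used in the XML and in comment text. `encodeForXmlText` is a hypothetical helper name; it is only meant to illustrate the encoding, not to replace direct edits to the XML.

```javascript
// Smart-quote characters and the numeric entities listed in the table above.
const SMART_QUOTE_ENTITIES = {
  "\u2018": "&#x2018;", // left single
  "\u2019": "&#x2019;", // right single / apostrophe
  "\u201C": "&#x201C;", // left double
  "\u201D": "&#x201D;", // right double
};

// Escape XML special characters first, then replace smart quotes,
// so the ampersands in the new entities are not escaped a second time.
function encodeForXmlText(text) {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/[\u2018\u2019\u201C\u201D]/g, (ch) => SMART_QUOTE_ENTITIES[ch]);
}

console.log(encodeForXmlText("Here\u2019s a quote: \u201CHello\u201D"));
// Here&#x2019;s a quote: &#x201C;Hello&#x201D;
```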

Adding comments:

python scripts/comment.py unpacked/ 0 "Comment text with &amp; and &#x2019;"
python scripts/comment.py unpacked/ 1 "Reply text" --parent 0

Step 3: Pack

python scripts/office/pack.py unpacked/ output.docx --original document.docx

Validates with auto-repair, condenses XML, and creates DOCX.

Auto-Repair Features

Auto-repair will fix:

durableId >= 0x7FFFFFFF (regenerates valid ID)
Missing xml:space="preserve" on <w:t> with whitespace

Auto-repair won’t fix:

Malformed XML
Invalid element nesting
Missing relationships
Schema violations
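
The durableId rule above amounts to a simple range check. The sketch below illustrates that condition only; the helper names are hypothetical, and the way a replacement ID is generated here is an assumption, not a description of what pack.py does.

```javascript
// IDs at or above this value are the ones the auto-repair list flags.
const DURABLE_ID_LIMIT = 0x7fffffff;

// True when an ID falls in the range that would need repair.
function needsDurableIdRepair(id) {
  return id >= DURABLE_ID_LIMIT;
}

// Assumed regeneration strategy: pick a random ID in [1, DURABLE_ID_LIMIT - 1].
// `random` is injectable so the function can be tested deterministically.
function regenerateDurableId(random = Math.random) {
  return 1 + Math.floor(random() * (DURABLE_ID_LIMIT - 1));
}

console.log(needsDurableIdRepair(0x7fffffff)); // true
console.log(needsDurableIdRepair(0x12345678)); // false
```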

XML Reference

Tracked Changes

Tracked Change Patterns

Insertion:

<w:ins w:id="1" w:author="Claude" w:date="2025-01-01T00:00:00Z">
  <w:r><w:t>inserted text</w:t></w:r>
</w:ins>

Deletion:

<w:del w:id="2" w:author="Claude" w:date="2025-01-01T00:00:00Z">
  <w:r><w:delText>deleted text</w:delText></w:r>
</w:del>

Minimal edits - only mark what changes:

<!-- Change "30 days" to "60 days" -->
<w:r><w:t>The term is </w:t></w:r>
<w:del w:id="1" w:author="Claude" w:date="...">
  <w:r><w:delText>30</w:delText></w:r>
</w:del>
<w:ins w:id="2" w:author="Claude" w:date="...">
  <w:r><w:t>60</w:t></w:r>
</w:ins>
<w:r><w:t> days.</w:t></w:r>
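
One way to find the span that actually changed is to strip the words that the old and new text share at the start and at the end. The following is a sketch of that idea only; `minimalWordChange` is a hypothetical helper, it compares whitespace-separated words, and it does not produce XML.

```javascript
// Find the smallest word-level span that differs between two strings.
// Splitting on /(\s+)/ keeps the whitespace as separate tokens, so the
// pieces can be joined back together without losing spacing.
function minimalWordChange(oldText, newText) {
  const a = oldText.split(/(\s+)/);
  const b = newText.split(/(\s+)/);

  // Skip tokens shared at the start.
  let start = 0;
  while (start < a.length && start < b.length && a[start] === b[start]) start++;

  // Skip tokens shared at the end, without crossing the shared prefix.
  let endA = a.length;
  let endB = b.length;
  while (endA > start && endB > start && a[endA - 1] === b[endB - 1]) {
    endA--;
    endB--;
  }

  return {
    prefix: a.slice(0, start).join(""),     // unchanged text before the edit
    deleted: a.slice(start, endA).join(""), // goes inside <w:del>
    inserted: b.slice(start, endB).join(""), // goes inside <w:ins>
    suffix: a.slice(endA).join(""),         // unchanged text after the edit
  };
}

console.log(minimalWordChange("The term is 30 days.", "The term is 60 days."));
// { prefix: 'The term is ', deleted: '30', inserted: '60', suffix: ' days.' }
```

The four parts line up with the four runs in the example above: an unchanged run, the deletion, the insertion, and a trailing unchanged run.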

Deleting entire paragraphs:

<w:p>
  <w:pPr>
    <w:rPr>
      <w:del w:id="1" w:author="Claude" w:date="2025-01-01T00:00:00Z"/>
    </w:rPr>
  </w:pPr>
  <w:del w:id="2" w:author="Claude" w:date="2025-01-01T00:00:00Z">
    <w:r><w:delText>Entire paragraph content...</w:delText></w:r>
  </w:del>
</w:p>

Comments

CRITICAL: <w:commentRangeStart> and <w:commentRangeEnd> are siblings of <w:r>, never inside <w:r>.

<w:commentRangeStart w:id="0"/>
<w:del w:id="1" w:author="Claude" w:date="2025-01-01T00:00:00Z">
  <w:r><w:delText>deleted</w:delText></w:r>
</w:del>
<w:r><w:t> more text</w:t></w:r>
<w:commentRangeEnd w:id="0"/>
<w:r>
  <w:rPr><w:rStyle w:val="CommentReference"/></w:rPr>
  <w:commentReference w:id="0"/>
</w:r>

Critical Rules

Always follow these rules when using docx-js:

Set page size explicitly (defaults to A4, not US Letter)
Never use \n - use separate Paragraph elements
Never use unicode bullets - use LevelFormat.BULLET
PageBreak must be inside a Paragraph
ImageRun requires type parameter
Always set table width with DXA, never WidthType.PERCENTAGE
Tables need dual widths - columnWidths AND cell width
Use ShadingType.CLEAR, never SOLID
TOC requires HeadingLevel only - no custom styles
Include outlineLevel for headings (required for TOC)
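
As an illustration of the rule about \n, multi-line input can be split into one string per paragraph before any Paragraph objects are created. This is a minimal sketch; `splitIntoParagraphTexts` is a hypothetical helper and is not part of docx-js.

```javascript
// Split text on line breaks (LF or CRLF) so each line can become
// its own Paragraph instead of a TextRun containing "\n".
function splitIntoParagraphTexts(text) {
  return text.split(/\r?\n/);
}

console.log(splitIntoParagraphTexts("First line\nSecond line"));
// [ 'First line', 'Second line' ]

// With docx-js loaded, each entry would then map to a separate Paragraph:
// const paragraphs = splitIntoParagraphTexts(input).map(
//   (line) => new Paragraph({ children: [new TextRun(line)] })
// );
```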

Dependencies

pandoc: Text extraction
docx: npm install -g docx (new documents)
LibreOffice: PDF conversion (auto-configured via scripts/office/soffice.py)
Poppler: pdftoppm for images
