Overview
The DOCX skill enables comprehensive Word document manipulation including creating professional documents from scratch, editing existing files with tracked changes and comments, and extracting content. A .docx file is a ZIP archive containing XML files that can be programmatically manipulated.
Use this skill whenever working with Word documents (.docx files), including creating reports, memos, letters, templates, or any document requiring professional formatting like tables of contents, headings, page numbers, or letterheads.
Quick Reference
Task Approach Read/analyze content pandoc or unpack for raw XMLCreate new document Use docx-js library Edit existing document Unpack → edit XML → repack
Reading and Converting Documents
Converting Legacy .doc Files
Legacy .doc files must be converted before editing:
python scripts/office/soffice.py --headless --convert-to docx document.doc
Reading Content
# Text extraction with tracked changes
pandoc --track-changes=all document.docx -o output.md
# Raw XML access
python scripts/office/unpack.py document.docx unpacked/
Converting to Images
python scripts/office/soffice.py --headless --convert-to pdf document.docx
pdftoppm -jpeg -r 150 document.pdf page
Accepting Tracked Changes
To produce a clean document with all tracked changes accepted:
python scripts/accept_changes.py input.docx output.docx
Creating New Documents
Setup and Installation
Install the docx-js library globally:
Basic Document Structure
const { Document , Packer , Paragraph , TextRun } = require ( 'docx' );
const doc = new Document ({
sections: [{
children: [
new Paragraph ({
children: [ new TextRun ( "Hello World" )]
})
]
}]
});
Packer . toBuffer ( doc ). then ( buffer => {
fs . writeFileSync ( "doc.docx" , buffer );
});
After creating a document, always validate it with python scripts/office/validate.py doc.docx. If validation fails, unpack, fix the XML, and repack.
Page Size and Orientation
CRITICAL : docx-js defaults to A4, not US Letter. Always set page size explicitly:sections : [{
properties: {
page: {
size: {
width: 12240 , // 8.5 inches in DXA (1440 DXA = 1 inch)
height: 15840 // 11 inches in DXA
},
margin: {
top: 1440 ,
right: 1440 ,
bottom: 1440 ,
left: 1440
} // 1 inch margins
}
},
children: [ /* content */ ]
}]
Common page sizes (DXA units): Paper Width Height Content Width (1” margins) US Letter 12,240 15,840 9,360 A4 (default) 11,906 16,838 9,026
Landscape orientation: size : {
width : 12240 , // Pass SHORT edge as width
height : 15840 , // Pass LONG edge as height
orientation : PageOrientation . LANDSCAPE // docx-js swaps them in XML
}
Styles and Headings
Use Arial as the default font (universally supported). Override built-in heading styles:
const doc = new Document ({
styles: {
default: {
document: {
run: { font: "Arial" , size: 24 }
}
},
paragraphStyles: [
{
id: "Heading1" ,
name: "Heading 1" ,
basedOn: "Normal" ,
next: "Normal" ,
quickFormat: true ,
run: { size: 32 , bold: true , font: "Arial" },
paragraph: {
spacing: { before: 240 , after: 240 },
outlineLevel: 0 // Required for TOC
}
},
{
id: "Heading2" ,
name: "Heading 2" ,
basedOn: "Normal" ,
next: "Normal" ,
quickFormat: true ,
run: { size: 28 , bold: true , font: "Arial" },
paragraph: {
spacing: { before: 180 , after: 180 },
outlineLevel: 1
}
}
]
},
sections: [{
children: [
new Paragraph ({
heading: HeadingLevel . HEADING_1 ,
children: [ new TextRun ( "Title" )]
})
]
}]
});
Lists
NEVER use unicode bullets manually. Always use numbering configuration with LevelFormat.BULLET.
// ❌ WRONG - never manually insert bullet characters
new Paragraph ({ children: [ new TextRun ( "• Item" )] }) // BAD
// ✅ CORRECT - use numbering config
const doc = new Document ({
numbering: {
config: [
{
reference: "bullets" ,
levels: [{
level: 0 ,
format: LevelFormat . BULLET ,
text: "•" ,
alignment: AlignmentType . LEFT ,
style: {
paragraph: {
indent: { left: 720 , hanging: 360 }
}
}
}]
},
{
reference: "numbers" ,
levels: [{
level: 0 ,
format: LevelFormat . DECIMAL ,
text: "%1." ,
alignment: AlignmentType . LEFT ,
style: {
paragraph: {
indent: { left: 720 , hanging: 360 }
}
}
}]
}
]
},
sections: [{
children: [
new Paragraph ({
numbering: { reference: "bullets" , level: 0 },
children: [ new TextRun ( "Bullet item" )]
}),
new Paragraph ({
numbering: { reference: "numbers" , level: 0 },
children: [ new TextRun ( "Numbered item" )]
})
]
}]
});
Each reference creates INDEPENDENT numbering. Same reference continues (1,2,3 then 4,5,6), different reference restarts (1,2,3 then 1,2,3).
Tables
CRITICAL : Tables need dual widths - set both columnWidths on the table AND width on each cell.const border = { style: BorderStyle . SINGLE , size: 1 , color: "CCCCCC" };
const borders = { top: border , bottom: border , left: border , right: border };
new Table ({
width: { size: 9360 , type: WidthType . DXA }, // Always use DXA
columnWidths: [ 4680 , 4680 ], // Must sum to table width
rows: [
new TableRow ({
children: [
new TableCell ({
borders ,
width: { size: 4680 , type: WidthType . DXA }, // Set on each cell
shading: {
fill: "D5E8F0" ,
type: ShadingType . CLEAR // CLEAR not SOLID
},
margins: {
top: 80 , bottom: 80 ,
left: 120 , right: 120
},
children: [
new Paragraph ({
children: [ new TextRun ( "Cell" )]
})
]
})
]
})
]
})
Width rules:
Always use WidthType.DXA — never WidthType.PERCENTAGE (breaks in Google Docs)
Table width must equal sum of columnWidths
Cell width must match corresponding columnWidth
Cell margins are internal padding - reduce content area, not add to width
For full-width tables: use content width (page width minus margins)
Images
// CRITICAL: type parameter is REQUIRED
new Paragraph ({
children: [
new ImageRun ({
type: "png" , // Required: png, jpg, jpeg, gif, bmp, svg
data: fs . readFileSync ( "image.png" ),
transformation: { width: 200 , height: 150 },
altText: {
title: "Title" ,
description: "Desc" ,
name: "Name"
} // All three required
})
]
})
Hyperlinks and Bookmarks
External links: new Paragraph ({
children: [
new ExternalHyperlink ({
children: [
new TextRun ({
text: "Click here" ,
style: "Hyperlink"
})
],
link: "https://example.com"
})
]
})
Internal links (bookmarks): // 1. Create bookmark at destination
new Paragraph ({
heading: HeadingLevel . HEADING_1 ,
children: [
new Bookmark ({
id: "chapter1" ,
children: [ new TextRun ( "Chapter 1" )]
})
]
})
// 2. Link to it
new Paragraph ({
children: [
new InternalHyperlink ({
children: [
new TextRun ({
text: "See Chapter 1" ,
style: "Hyperlink"
})
],
anchor: "chapter1"
})
]
})
Headers, Footers, and Page Numbers
sections : [{
properties: {
page: {
margin: {
top: 1440 , right: 1440 ,
bottom: 1440 , left: 1440
}
}
},
headers: {
default: new Header ({
children: [
new Paragraph ({
children: [ new TextRun ( "Header" )]
})
]
})
},
footers: {
default: new Footer ({
children: [
new Paragraph ({
children: [
new TextRun ( "Page " ),
new TextRun ({
children: [ PageNumber . CURRENT ]
})
]
})
]
})
},
children: [ /* content */ ]
}]
Table of Contents
// CRITICAL: Headings must use HeadingLevel ONLY - no custom styles
new TableOfContents ( "Table of Contents" , {
hyperlink: true ,
headingStyleRange: "1-3"
})
Editing Existing Documents
Follow all 3 steps in order:
Step 1: Unpack
python scripts/office/unpack.py document.docx unpacked/
Extracts XML, pretty-prints, merges adjacent runs, and converts smart quotes to XML entities so they survive editing.
Step 2: Edit XML
Edit files in unpacked/word/. Use “Claude” as the author for tracked changes and comments unless specified otherwise.
Use the Edit tool directly for string replacement. Do not write Python scripts. Scripts introduce unnecessary complexity.
CRITICAL: Use smart quotes for new content:
< w:t > Here ’ s a quote: “ Hello ” </ w:t >
Entity Character ‘’ (left single) ’’ (right single / apostrophe) “” (left double) ”” (right double)
Adding comments:
python scripts/comment.py unpacked/ 0 "Comment text with & and ’"
python scripts/comment.py unpacked/ 1 "Reply text" --parent 0
Step 3: Pack
python scripts/office/pack.py unpacked/ output.docx --original document.docx
Validates with auto-repair, condenses XML, and creates DOCX.
Auto-repair will fix:
durableId >= 0x7FFFFFFF (regenerates valid ID)
Missing xml:space="preserve" on <w:t> with whitespace
Auto-repair won’t fix:
Malformed XML
Invalid element nesting
Missing relationships
Schema violations
XML Reference
Tracked Changes
Insertion: < w:ins w:id = "1" w:author = "Claude" w:date = "2025-01-01T00:00:00Z" >
< w:r >< w:t > inserted text </ w:t ></ w:r >
</ w:ins >
Deletion: < w:del w:id = "2" w:author = "Claude" w:date = "2025-01-01T00:00:00Z" >
< w:r >< w:delText > deleted text </ w:delText ></ w:r >
</ w:del >
Minimal edits - only mark what changes: <!-- Change "30 days" to "60 days" -->
< w:r >< w:t > The term is </ w:t ></ w:r >
< w:del w:id = "1" w:author = "Claude" w:date = "..." >
< w:r >< w:delText > 30 </ w:delText ></ w:r >
</ w:del >
< w:ins w:id = "2" w:author = "Claude" w:date = "..." >
< w:r >< w:t > 60 </ w:t ></ w:r >
</ w:ins >
< w:r >< w:t > days. </ w:t ></ w:r >
Deleting entire paragraphs: < w:p >
< w:pPr >
< w:rPr >
< w:del w:id = "1" w:author = "Claude" w:date = "2025-01-01T00:00:00Z" />
</ w:rPr >
</ w:pPr >
< w:del w:id = "2" w:author = "Claude" w:date = "2025-01-01T00:00:00Z" >
< w:r >< w:delText > Entire paragraph content... </ w:delText ></ w:r >
</ w:del >
</ w:p >
CRITICAL : <w:commentRangeStart> and <w:commentRangeEnd> are siblings of <w:r>, never inside <w:r>.
< w:commentRangeStart w:id = "0" />
< w:del w:id = "1" w:author = "Claude" w:date = "2025-01-01T00:00:00Z" >
< w:r >< w:delText > deleted </ w:delText ></ w:r >
</ w:del >
< w:r >< w:t > more text </ w:t ></ w:r >
< w:commentRangeEnd w:id = "0" />
< w:r >
< w:rPr >< w:rStyle w:val = "CommentReference" /></ w:rPr >
< w:commentReference w:id = "0" />
</ w:r >
Critical Rules
Always follow these rules when using docx-js:
Set page size explicitly (defaults to A4, not US Letter)
Never use \n - use separate Paragraph elements
Never use unicode bullets - use LevelFormat.BULLET
PageBreak must be inside a Paragraph
ImageRun requires type parameter
Always set table width with DXA, never WidthType.PERCENTAGE
Tables need dual widths - columnWidths AND cell width
Use ShadingType.CLEAR, never SOLID
TOC requires HeadingLevel only - no custom styles
Include outlineLevel for headings (required for TOC)
Dependencies
pandoc : Text extraction
docx : npm install -g docx (new documents)
LibreOffice : PDF conversion (auto-configured via scripts/office/soffice.py)
Poppler : pdftoppm for images