Markdown functions provide tools for parsing, chunking, and extracting content from Markdown documents. All functions are available via fc.markdown.*.

to_json

Converts a column of Markdown-formatted strings into a hierarchical JSON representation.
fc.markdown.to_json(column: ColumnOrName) -> Column
column
ColumnOrName
required
Input column containing Markdown strings.
return
Column
A column of JSON-formatted strings representing the structured document tree.
This function parses Markdown into a structured JSON format optimized for document chunking, semantic analysis, and jq queries. The output conforms to a custom schema that organizes content into nested sections based on heading levels. The full JSON schema is available at: docs.fenic.ai/topics/markdown-json
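As a rough illustration of how content can be organized into nested sections by heading level, here is a minimal pure-Python sketch. It is not fenic's implementation, and the node keys (`type`, `level`, `heading`, `children`, `text`) are hypothetical placeholders rather than the published schema; consult the schema link above for the actual field names.

```python
import json
import re

# ATX heading: 1-6 '#' characters, whitespace, then the heading text.
HEADING_RE = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


def nest_sections(markdown: str) -> dict:
    """Nest headings into a tree. Node keys are hypothetical, not fenic's schema."""
    root = {"type": "document", "children": []}
    stack = [(0, root)]  # (heading level, node); the root acts as level 0
    for line in markdown.splitlines():
        match = HEADING_RE.match(line)
        if match:
            level = len(match.group(1))
            # Close any open sections at the same or a deeper level.
            while stack[-1][0] >= level:
                stack.pop()
            node = {
                "type": "section",
                "level": level,
                "heading": match.group(2),
                "children": [],
            }
            stack[-1][1]["children"].append(node)
            stack.append((level, node))
        elif line.strip():
            # Non-heading text attaches to the innermost open section.
            stack[-1][1]["children"].append(
                {"type": "paragraph", "text": line.strip()}
            )
    return root


tree = nest_sections("# Guide\nIntro\n## Setup\nInstall it\n## Usage\nRun it")
tree_json = json.dumps(tree, indent=2)
```

The stack keeps track of which sections are still open, so each heading is attached to the nearest shallower heading above it.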

Supported Markdown Features

  • Headings with nested hierarchy (e.g., h2 → h3 → h4)
  • Paragraphs with inline formatting (bold, italics, links, code, etc.)
  • Lists (ordered, unordered, task lists)
  • Tables with header alignment and inline content
  • Code blocks with language info
  • Blockquotes, horizontal rules, and inline/flow HTML

Examples

df.select(fc.markdown.to_json(fc.col("markdown_text")))

get_code_blocks

Extracts all code blocks from a column of Markdown-formatted strings.
fc.markdown.get_code_blocks(
    column: ColumnOrName,
    language_filter: Optional[str] = None
) -> Column
column
ColumnOrName
required
Input column containing Markdown strings.
language_filter
Optional[str]
Optional language filter to extract only code blocks with a specific language. By default, all code blocks are extracted.
return
Column
A column of code blocks. The output column type is ArrayType(StructType([StructField("language", StringType), StructField("code", StringType)])).
  • Code blocks are parsed from fenced Markdown blocks (e.g., triple backticks).
  • Language identifiers are optional and may be null if not provided in the original Markdown.
  • Indented code blocks without fences are not currently supported.
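To make the output shape and the effect of language_filter concrete, the following pure-Python sketch mimics the behavior described above with a regular expression. It is an approximation for illustration only, not how fenic parses Markdown; the helper name and the regex are assumptions, and only triple-backtick fences are handled.

```python
import re
from typing import Optional

FENCE = "`" * 3  # built programmatically so this example nests cleanly

# Opening fence with an optional language tag, a lazily matched body,
# then a closing fence on its own line.
BLOCK_RE = re.compile(
    rf"^{FENCE}[ \t]*(\S*)[^\n]*\n(.*?)^{FENCE}[ \t]*$",
    re.MULTILINE | re.DOTALL,
)


def get_code_blocks(markdown: str, language_filter: Optional[str] = None) -> list:
    """Return a list of {'language', 'code'} dicts, one per fenced block."""
    blocks = []
    for match in BLOCK_RE.finditer(markdown):
        language = match.group(1) or None  # missing tag becomes None (null)
        if language_filter is not None and language != language_filter:
            continue
        blocks.append({"language": language, "code": match.group(2).rstrip("\n")})
    return blocks


sample = "\n".join([
    "Intro text",
    FENCE + "python",
    "print('hi')",
    FENCE,
    FENCE,
    "plain block",
    FENCE,
])

all_blocks = get_code_blocks(sample)
python_only = get_code_blocks(sample, language_filter="python")
```

Here `all_blocks` holds two entries, the second with a `None` language because its fence carries no tag, while `python_only` keeps only the tagged block.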

Examples

df.select(fc.markdown.get_code_blocks(fc.col("markdown_text")))

generate_toc

Generates a table of contents from the headings in a column of Markdown-formatted strings.
fc.markdown.generate_toc(
    column: ColumnOrName,
    max_level: Optional[int] = None
) -> Column
column
ColumnOrName
required
Input column containing Markdown strings.
max_level
Optional[int]
Maximum heading level to include in the TOC (1-6). If omitted (None), all levels are included, which is equivalent to 6.
return
Column
A column of Markdown-formatted table of contents strings.
  • The TOC is generated using Markdown heading syntax (#, ##, ###, etc.).
  • Each heading in the source document becomes one line in the TOC.
  • The heading level is preserved in the output.
  • The result is itself a valid Markdown document that can be rendered or processed further.
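The notes above can be sketched in plain Python as a filter over heading lines. This is a simplified stand-in, not fenic's implementation: the helper name is hypothetical, only ATX-style (#) headings are recognized, and lines starting with # inside fenced code blocks are not excluded.

```python
import re
from typing import Optional

# ATX heading: 1-6 '#' characters, whitespace, then the heading text.
HEADING_RE = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


def generate_toc(markdown: str, max_level: Optional[int] = None) -> str:
    """Keep heading lines up to max_level, preserving their '#' depth."""
    limit = 6 if max_level is None else max_level
    toc_lines = []
    for line in markdown.splitlines():
        match = HEADING_RE.match(line)
        if match and len(match.group(1)) <= limit:
            toc_lines.append(f"{match.group(1)} {match.group(2)}")
    return "\n".join(toc_lines)


doc = "\n".join([
    "# Introduction",
    "Overview text",
    "## Getting Started",
    "Setup instructions",
    "### Prerequisites",
    "Python 3.8+ required",
    "## API Reference",
    "Function documentation",
])

toc_all = generate_toc(doc)                # every heading level
toc_top2 = generate_toc(doc, max_level=2)  # drops the level-3 heading
```

With max_level=2, the level-3 "Prerequisites" heading is left out while the relative depth of the remaining headings is unchanged.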

Examples

df.select(fc.markdown.generate_toc(fc.col("documentation")))

extract_header_chunks

Splits Markdown documents into logical chunks based on heading hierarchy.
fc.markdown.extract_header_chunks(
    column: ColumnOrName,
    header_level: int
) -> Column
column
ColumnOrName
required
Input column containing Markdown strings.
header_level
int
required
Heading level to split on (1-6). Creates a new chunk at every heading of this level, including all nested content and subsections.
return
Column
A column of arrays containing chunk objects with the following structure:
ArrayType(StructType([
    StructField("heading", StringType),        # Heading text (clean, no markdown)
    StructField("level", IntegerType),         # Heading level (1-6)
    StructField("content", StringType),        # All content under this heading (clean text)
    StructField("parent_heading", StringType), # Parent heading text (or null)
    StructField("full_path", StringType),      # Full breadcrumb path
]))

Features

  • Context-preserving: Each chunk contains all content and subsections under the heading
  • Hierarchical awareness: Includes parent heading context for better LLM understanding
  • Clean text output: Strips markdown formatting for direct LLM consumption

Chunking Behavior

With header_level=2, this markdown:
# Introduction
Overview text

## Getting Started
Setup instructions

### Prerequisites
Python 3.8+ required

## API Reference
Function documentation
Produces 2 chunks:
  1. Getting Started chunk (includes Prerequisites subsection)
  2. API Reference chunk
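The chunking rule above can be approximated in plain Python as follows. This is a sketch under stated assumptions, not fenic's implementation: the " > " breadcrumb separator, the rule that a shallower heading also closes the open chunk, and the way nested headings are folded into the content as plain text are all guesses made for illustration.

```python
import re
from typing import Optional

HEADING_RE = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


def extract_header_chunks(markdown: str, header_level: int) -> list:
    """Split on headings of header_level; deeper headings stay in the chunk."""
    chunks = []
    ancestors = []   # (level, text) of open headings shallower than header_level
    current = None   # chunk currently being filled
    body = []        # content lines of the current chunk

    def close_current():
        nonlocal current, body
        if current is not None:
            current["content"] = "\n".join(body)
            chunks.append(current)
        current, body = None, []

    for line in markdown.splitlines():
        match = HEADING_RE.match(line)
        if match and len(match.group(1)) <= header_level:
            level, text = len(match.group(1)), match.group(2)
            close_current()  # assumption: shallower headings also end a chunk
            while ancestors and ancestors[-1][0] >= level:
                ancestors.pop()
            if level == header_level:
                path = [t for _, t in ancestors] + [text]
                current = {
                    "heading": text,
                    "level": level,
                    "content": "",
                    "parent_heading": ancestors[-1][1] if ancestors else None,
                    "full_path": " > ".join(path),  # separator is an assumption
                }
            else:
                ancestors.append((level, text))
        elif current is not None and line.strip():
            # Deeper headings are kept as plain text inside the chunk.
            body.append(match.group(2) if match else line.strip())
    close_current()
    return chunks


doc = "\n".join([
    "# Introduction", "Overview text", "",
    "## Getting Started", "Setup instructions", "",
    "### Prerequisites", "Python 3.8+ required", "",
    "## API Reference", "Function documentation",
])
chunks = extract_header_chunks(doc, header_level=2)
```

Run on the document above, the sketch yields two chunks, both with "Introduction" as parent_heading, and the first chunk's content carries the Prerequisites subsection.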

Examples

df.select(fc.markdown.extract_header_chunks(fc.col("articles"), header_level=1))
