Text Functions

Text functions provide string manipulation, chunking, template rendering, and text extraction capabilities. All functions are available via fc.text.*.

Chunking Functions

recursive_character_chunk

Splits a string column into chunks of a specified size (in characters) with a configurable overlap, preserving text structure where possible.

fc.text.recursive_character_chunk(
    column: ColumnOrName,
    chunk_size: int,
    chunk_overlap_percentage: int,
    chunking_character_set_custom_characters: Optional[list[str]] = None,
) -> Column

column

ColumnOrName

required

The input string column or column name to chunk.

chunk_size

int

required

The size of each chunk in characters.

chunk_overlap_percentage

int

required

The overlap between each chunk as a percentage of the chunk size.

chunking_character_set_custom_characters

Optional[list[str]]

List of alternative characters to split on. Characters should be ordered from coarsest to finest desired granularity. Default is ['\n\n', '\n', '.', ';', ':', ' ', '-', ''].

return

Column

A column containing the chunks as an array of strings.

Example

df.select(
    fc.text.recursive_character_chunk(fc.col("text"), 100, 20).alias("chunks")
)
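To make the "coarsest to finest" ordering concrete, here is a minimal plain-Python sketch of recursive splitting. It is not fenic's implementation: the helper name recursive_split is hypothetical, overlap is left out for brevity, and the way separators are kept attached to the preceding piece is an assumption.

```python
# Hypothetical helper illustrating recursive splitting; not part of fenic.
DEFAULT_SEPARATORS = ("\n\n", "\n", ".", ";", ":", " ", "-", "")


def recursive_split(text, chunk_size, separators=DEFAULT_SEPARATORS):
    """Split text into pieces of at most chunk_size characters,
    trying the coarsest separator first and falling back to finer ones."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Finest granularity: cut at fixed character offsets.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    pieces = text.split(sep)
    if len(pieces) == 1:
        # Separator not present: try the next, finer one.
        return recursive_split(text, chunk_size, finer)

    chunks, current = [], ""
    for i, piece in enumerate(pieces):
        # Assumption: the separator stays attached to the piece before it.
        part = piece + (sep if i < len(pieces) - 1 else "")
        if len(part) > chunk_size:
            # Piece is still too large: flush and split it more finely.
            if current:
                chunks.append(current)
                current = ""
            chunks.extend(recursive_split(part, chunk_size, finer))
        elif len(current) + len(part) <= chunk_size:
            current += part
        else:
            chunks.append(current)
            current = part
    if current:
        chunks.append(current)
    return chunks


text = "First paragraph.\n\nSecond paragraph is a bit longer. It has two sentences.\n\nThird."
for chunk in recursive_split(text, 40):
    print(repr(chunk))
```

Because separators are kept, joining the chunks reproduces the input, and no chunk exceeds chunk_size.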

recursive_word_chunk

Splits a string column into chunks of a specified size (in words) with a configurable overlap.

fc.text.recursive_word_chunk(
    column: ColumnOrName,
    chunk_size: int,
    chunk_overlap_percentage: int,
    chunking_character_set_custom_characters: Optional[list[str]] = None,
) -> Column

column

ColumnOrName

required

The input string column or column name to chunk.

chunk_size

int

required

The size of each chunk in words.

chunk_overlap_percentage

int

required

The overlap between each chunk as a percentage of the chunk size.

chunking_character_set_custom_characters

Optional[list[str]]

List of alternative characters to split on.

return

Column

A column containing the chunks as an array of strings.

Example

df.select(
    fc.text.recursive_word_chunk(fc.col("text"), 100, 20).alias("chunks")
)

recursive_token_chunk

Splits a string column into chunks of a specified size (in tokens) with a configurable overlap.

fc.text.recursive_token_chunk(
    column: ColumnOrName,
    chunk_size: int,
    chunk_overlap_percentage: int,
    chunking_character_set_custom_characters: Optional[list[str]] = None,
) -> Column

column

ColumnOrName

required

The input string column or column name to chunk.

chunk_size

int

required

The size of each chunk in tokens.

chunk_overlap_percentage

int

required

The overlap between each chunk as a percentage of the chunk size.

chunking_character_set_custom_characters

Optional[list[str]]

List of alternative characters to split on.

return

Column

A column containing the chunks as an array of strings.

Example

df.select(
    fc.text.recursive_token_chunk(fc.col("text"), 100, 20).alias("chunks")
)

character_chunk

Splits a string column into chunks of a specified size (in characters) with an optional overlap using a simple sliding window.

fc.text.character_chunk(
    column: ColumnOrName,
    chunk_size: int,
    chunk_overlap_percentage: int = 0
) -> Column

column

ColumnOrName

required

The input string column or column name to chunk.

chunk_size

int

required

The size of each chunk in characters.

chunk_overlap_percentage

int

default:"0"

The overlap between chunks as a percentage of the chunk size.

return

Column

A column containing the chunks as an array of strings.
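The sliding window can be sketched in plain Python as follows. This is an illustration only, not fenic's code: the rounding of the overlap (floor) and the handling of the final short chunk are assumptions.

```python
def character_chunk(text, chunk_size, chunk_overlap_percentage=0):
    """Fixed-size sliding window over characters (illustrative sketch)."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap_percentage < 100:
        raise ValueError("chunk_overlap_percentage must be in [0, 100)")

    # Assumption: the overlap is rounded down to a whole number of characters.
    overlap = chunk_size * chunk_overlap_percentage // 100
    step = chunk_size - overlap  # always >= 1 because the percentage is < 100

    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


print(character_chunk("abcdefghij", 4))      # ['abcd', 'efgh', 'ij']
print(character_chunk("abcdefghij", 4, 50))  # ['abcd', 'cdef', 'efgh', 'ghij']
```

With chunk_size=100 and chunk_overlap_percentage=20, consecutive windows in this sketch share 20 characters and start 80 characters apart.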

word_chunk

Splits a string column into chunks of a specified size (in words) with an optional overlap using a simple sliding window.

fc.text.word_chunk(
    column: ColumnOrName,
    chunk_size: int,
    chunk_overlap_percentage: int = 0
) -> Column

token_chunk

Splits a string column into chunks of a specified size (in tokens) with an optional overlap using a simple sliding window.

fc.text.token_chunk(
    column: ColumnOrName,
    chunk_size: int,
    chunk_overlap_percentage: int = 0
) -> Column

count_tokens

Returns the number of tokens in a string using OpenAI’s cl100k_base encoding (tiktoken).

fc.text.count_tokens(column: ColumnOrName) -> Column

column

ColumnOrName

required

The input string column.

return

Column

A column with the token counts for each input string.

Example

df.select(fc.text.count_tokens(fc.col("text")))

String Manipulation

extract

Extracts structured data from text using template-based pattern matching.

fc.text.extract(column: ColumnOrName, template: str) -> Column

column

ColumnOrName

required

Input text column to extract from.

template

str

required

Template string with placeholders as ${field_name} or ${field_name:format}. Available formats: none, csv, json, quoted.

return

Column

Struct column with fields corresponding to template placeholders.

Examples

fc.text.extract(fc.col("log"), "${date} ${level} ${message}")
# Input: "2024-01-15 ERROR Connection failed"
# Output: {date: "2024-01-15", level: "ERROR", message: "Connection failed"}
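One way to picture template-based extraction is as a translation from the template to a regular expression with named groups. The sketch below is a rough approximation in plain Python, not fenic's matcher: the helper names are hypothetical, the :format suffix is parsed but ignored, and returning None on a mismatch is an assumption.

```python
import re

# Matches ${field_name} or ${field_name:format}; the format is ignored here.
PLACEHOLDER = re.compile(r"\$\{([A-Za-z_]\w*)(?::\w+)?\}")


def template_to_regex(template):
    """Turn literal text into escaped regex and placeholders into named groups."""
    parts, pos = [], 0
    for match in PLACEHOLDER.finditer(template):
        parts.append(re.escape(template[pos:match.start()]))
        parts.append(f"(?P<{match.group(1)}>.*?)")
        pos = match.end()
    parts.append(re.escape(template[pos:]))
    return re.compile("".join(parts), re.DOTALL)


def extract(text, template):
    """Return a dict of captured fields, or None if the text does not fit."""
    match = template_to_regex(template).fullmatch(text)
    return match.groupdict() if match else None


print(extract("2024-01-15 ERROR Connection failed", "${date} ${level} ${message}"))
# {'date': '2024-01-15', 'level': 'ERROR', 'message': 'Connection failed'}
```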

concat

Concatenates multiple columns or strings into a single string.

fc.text.concat(*cols: ColumnOrName) -> Column

*cols

ColumnOrName

required

Columns or strings to concatenate.

return

Column

A column containing the concatenated strings.

Example

df.select(fc.text.concat(fc.col("col1"), fc.lit(" "), fc.col("col2")))

concat_ws

Concatenates multiple columns or strings into a single string with a separator.

fc.text.concat_ws(separator: str, *cols: ColumnOrName) -> Column

separator

str

required

The separator to use.

*cols

ColumnOrName

required

Columns or strings to concatenate.

return

Column

A column containing the concatenated strings.

Example

df.select(fc.text.concat_ws(",", fc.col("col1"), fc.col("col2")))

replace

Replace all occurrences of a pattern with a new string, treating pattern as a literal string.

fc.text.replace(
    src: ColumnOrName,
    search: Union[Column, str],
    replace: Union[Column, str]
) -> Column

src

ColumnOrName

required

The input string column or column name to perform replacements on.

search

Union[Column, str]

required

The pattern to search for (can be a string or column expression).

replace

Union[Column, str]

required

The string to replace with (can be a string or column expression).

return

Column

A column containing the strings with replacements applied.

Example

df.select(fc.text.replace(fc.col("name"), "foo", "bar"))

regexp_replace

Replace all occurrences of a pattern with a new string, treating pattern as a regular expression.

fc.text.regexp_replace(
    src: ColumnOrName,
    pattern: Union[Column, str],
    replacement: Union[Column, str],
) -> Column

src

ColumnOrName

required

The input string column or column name to perform replacements on.

pattern

Union[Column, str]

required

The regular expression pattern to search for.

replacement

Union[Column, str]

required

The string to replace with.

return

Column

A column containing the strings with replacements applied.

Example

df.select(fc.text.regexp_replace(fc.col("text"), r"\d+", "--"))
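The difference between replace (literal) and regexp_replace (pattern) matters whenever the search string contains regex metacharacters. The following plain-Python comparison uses str.replace and re.sub as stand-ins; fenic's regex dialect may differ from Python's re module.

```python
import re

text = "price: 1.50 (was 2.50)"

# Literal replacement: only actual "." characters are replaced.
literal = text.replace(".", "_")

# Regex replacement: an unescaped "." matches every character.
as_regex = re.sub(r".", "_", text)

# Escaping the pattern makes the regex version behave like the literal one.
escaped = re.sub(re.escape("."), "_", text)

print(literal)   # price: 1_50 (was 2_50)
print(as_regex)  # ______________________
print(escaped)   # price: 1_50 (was 2_50)
```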

Regular Expressions

regexp_count

Count the number of times a regex pattern is matched in a string.

fc.text.regexp_count(src: ColumnOrName, pattern: Union[Column, str]) -> Column

regexp_extract

Extract a specific regex group from a string.

fc.text.regexp_extract(
    src: ColumnOrName,
    pattern: Union[Column, str],
    idx: int
) -> Column

idx

int

required

The group index to extract (0 = entire match, 1+ = capture groups).

Example

fc.text.regexp_extract("email", r"([^@]+)@", 1)
# Input: "user@example.com"
# Output: "user"
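The idx argument follows the usual regex group numbering. As a plain-Python illustration (using the re module as a stand-in, with a made-up sample address; returning None when nothing matches is an assumption, not documented behavior):

```python
import re


def regexp_extract(text, pattern, idx):
    """Return group idx of the first match, or None if there is no match."""
    match = re.search(pattern, text)
    return match.group(idx) if match else None


email = "user@example.com"  # sample value chosen for illustration
print(regexp_extract(email, r"([^@]+)@", 0))  # user@  (entire match)
print(regexp_extract(email, r"([^@]+)@", 1))  # user   (first capture group)
```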

regexp_extract_all

Extract all strings matching a regex pattern, optionally from a specific group.

fc.text.regexp_extract_all(
    src: ColumnOrName,
    pattern: Union[Column, str],
    idx: Union[Column, int] = 0
) -> Column

return

Column

An array column containing all matches.

Example

fc.text.regexp_extract_all("text", r"\d+")
# Input: "abc123def456"
# Output: ["123", "456"]

split

Split a string column into an array using a regular expression pattern.

fc.text.split(src: ColumnOrName, pattern: str, limit: int = -1) -> Column

pattern

str

required

The regular expression pattern to split on.

limit

int

default:"-1"

Maximum number of splits to perform. If > 0, returns at most limit+1 elements.

Example

df.select(fc.text.split(fc.col("text"), r"\s+"))
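The effect of limit can be illustrated with Python's re.split, whose maxsplit argument likewise yields at most maxsplit+1 elements. This is a sketch of the documented semantics only; treating any non-positive limit as "no limit" is an assumption.

```python
import re


def split(text, pattern, limit=-1):
    """Split on a regex; a positive limit caps the number of splits."""
    # re.split uses maxsplit=0 to mean "no limit".
    maxsplit = limit if limit > 0 else 0
    return re.split(pattern, text, maxsplit=maxsplit)


print(split("a b  c d", r"\s+"))     # ['a', 'b', 'c', 'd']
print(split("a b  c d", r"\s+", 2))  # ['a', 'b', 'c d']
```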

split_part

Split a string and return a specific part using 1-based indexing.

fc.text.split_part(
    src: ColumnOrName,
    delimiter: Union[Column, str],
    part_number: Union[Column, int]
) -> Column

delimiter

Union[Column, str]

required

The delimiter to split on.

part_number

Union[Column, int]

required

Which part to return (1-based integer index). Negative values count from the end.

Example

df.select(fc.text.split_part(fc.col("text"), ",", 2))
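The 1-based and negative indexing can be sketched in plain Python as below. The behavior for part_number 0 (raising an error) and for out-of-range indices (returning an empty string) is an assumption made for this illustration and is not stated in the reference above.

```python
def split_part(text, delimiter, part_number):
    """Return the part_number-th piece (1-based; negatives count from the end)."""
    if part_number == 0:
        # Assumption: 0 is rejected because indexing is 1-based.
        raise ValueError("part_number is 1-based; 0 is not a valid index")
    parts = text.split(delimiter)
    index = part_number - 1 if part_number > 0 else len(parts) + part_number
    if 0 <= index < len(parts):
        return parts[index]
    return ""  # Assumption: out-of-range requests yield an empty string.


print(split_part("a,b,c", ",", 2))   # b
print(split_part("a,b,c", ",", -1))  # c
```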

String Transformations

upper

Convert all characters in a string column to uppercase.

fc.text.upper(column: ColumnOrName) -> Column

lower

Convert all characters in a string column to lowercase.

fc.text.lower(column: ColumnOrName) -> Column

title_case

Convert the first character of each word in a string column to uppercase.

fc.text.title_case(column: ColumnOrName) -> Column

trim

Remove whitespace from both sides of strings in a column.

fc.text.trim(column: ColumnOrName) -> Column

ltrim

Remove whitespace from the start of strings in a column.

fc.text.ltrim(col: ColumnOrName) -> Column

rtrim

Remove whitespace from the end of strings in a column.

fc.text.rtrim(col: ColumnOrName) -> Column

length

Calculate the character length of each string in the column.

fc.text.length(column: ColumnOrName) -> Column

byte_length

Calculate the byte length of each string in the column.

fc.text.byte_length(column: ColumnOrName) -> Column
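length and byte_length agree for ASCII text and diverge for anything else. A plain-Python illustration, assuming that "character" means a Unicode code point and that bytes are counted in UTF-8 (neither is stated explicitly above):

```python
def char_length(text):
    """Number of Unicode code points."""
    return len(text)


def utf8_byte_length(text):
    """Number of bytes in the UTF-8 encoding."""
    return len(text.encode("utf-8"))


samples = [
    "hello",               # ASCII only
    "h\u00e9llo",          # "hello" with e-acute: one 2-byte character
    "\u65e5\u672c\u8a9e",  # three CJK characters, 3 bytes each
]
for sample in samples:
    print(sample, char_length(sample), utf8_byte_length(sample))
```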

Template Rendering

jinja

Render a Jinja template using values from the specified columns.

fc.text.jinja(
    jinja_template: str,
    /,
    strict: bool = True,
    **columns: Column
) -> Column

jinja_template

str

required

A Jinja2 template string to render for each row. Variables are referenced using double braces: {{ variable_name }}.

strict

bool

default:"True"

If True, when any of the provided columns has a None value for a row, the entire row’s output will be None.

**columns

Column

required

Keyword arguments mapping variable names to columns.

return

Column

A string column containing the rendered template for each row.

Example

LLM prompt formatting

prompt_template = '''
Answer the user's question.

{% if context %}
Context: {{ context }}
{% endif %}

Question: {{ query }}

Please provide a {{ style }} response.'''

result = df.select(
    fc.text.jinja(
        prompt_template,
        query=fc.col("user_question"),
        context=fc.col("retrieved_context"),
        style=fc.when(fc.col("query_type") == "technical", "detailed")
              .otherwise("concise")
    ).alias("llm_prompt")
)

Fuzzy Matching

compute_fuzzy_ratio

Compute the similarity between two strings using a fuzzy string matching algorithm.

fc.text.compute_fuzzy_ratio(
    column: ColumnOrName,
    other: Union[Column, str],
    method: FuzzySimilarityMethod = "indel"
) -> Column

column

ColumnOrName

required

A string column or column name.

other

Union[Column, str]

required

A second string column or literal string.

method

FuzzySimilarityMethod

default:"indel"

Similarity method: "indel", "levenshtein", "damerau_levenshtein", "jaro", "jaro_winkler", or "hamming".

return

Column

A double column with similarity scores in the range [0, 100].

Example

result = df.select(
    fc.text.compute_fuzzy_ratio(fc.col("a"), fc.col("b"), method="levenshtein").alias("sim")
)
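For intuition about the default "indel" method, the sketch below computes a normalized indel similarity in plain Python. It assumes the common definition (insertions and deletions only, so the score is 100 * 2 * LCS / (len(a) + len(b))); whether fenic normalizes in exactly this way is not confirmed by this page.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence, via row-by-row DP."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def indel_ratio(a, b):
    """Normalized indel similarity in [0, 100]."""
    total = len(a) + len(b)
    if total == 0:
        return 100.0  # Assumption: two empty strings count as identical.
    return 100.0 * 2 * lcs_length(a, b) / total


print(indel_ratio("kitten", "sitting"))  # about 61.54 (LCS "ittn" has length 4)
```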

compute_fuzzy_token_sort_ratio

Compute fuzzy similarity after sorting tokens in each string.

fc.text.compute_fuzzy_token_sort_ratio(
    column: ColumnOrName,
    other: Union[Column, str],
    method: FuzzySimilarityMethod = "indel"
) -> Column
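Token sorting makes the comparison insensitive to word order. A minimal sketch of the normalization step, assuming tokens are whitespace-separated and no case folding is applied (both are assumptions); the sorted strings would then be compared with the chosen method:

```python
def token_sort(text):
    """Split on whitespace, sort the tokens, and rejoin with single spaces."""
    return " ".join(sorted(text.split()))


print(token_sort("new york mets"))  # mets new york
print(token_sort("mets new york"))  # mets new york
```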

Transcript Parsing

parse_transcript

Parses a transcript from text to a structured format with unified schema.

fc.text.parse_transcript(
    column: ColumnOrName,
    format: TranscriptFormatType
) -> Column

column

ColumnOrName

required

The input string column or column name containing transcript text.

format

TranscriptFormatType

required

The format of the transcript: "srt", "webvtt", or "generic".

return

Column

A column containing an array of structured transcript entries with fields: index, speaker, start_time, end_time, duration, content, format.

Example

df.select(fc.text.parse_transcript(fc.col("transcript"), "srt"))
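To show what a unified entry might look like, here is a small plain-Python SRT parser that fills the field names listed above. It is a sketch, not fenic's parser: representing times as seconds (float) and leaving speaker as None for SRT input are assumptions.

```python
import re

TIMESTAMP = re.compile(r"(\d+):(\d{2}):(\d{2})[,.](\d{3})")


def to_seconds(stamp):
    """Convert 'HH:MM:SS,mmm' to seconds."""
    hours, minutes, seconds, millis = map(int, TIMESTAMP.fullmatch(stamp.strip()).groups())
    return hours * 3600 + minutes * 60 + seconds + millis / 1000


def parse_srt(text):
    """Parse SRT cues into dicts with the documented field names."""
    entries = []
    for block in re.split(r"\n\s*\n", text.strip()):
        lines = block.strip().splitlines()
        if len(lines) < 2 or "-->" not in lines[1]:
            continue  # skip blocks that are not well-formed cues
        start, end = (to_seconds(part) for part in lines[1].split("-->"))
        entries.append({
            "index": int(lines[0]),
            "speaker": None,  # Assumption: plain SRT carries no speaker label.
            "start_time": start,
            "end_time": end,
            "duration": end - start,
            "content": "\n".join(lines[2:]),
            "format": "srt",
        })
    return entries


sample = (
    "1\n00:00:01,000 --> 00:00:03,500\nHello there.\n\n"
    "2\n00:00:04,000 --> 00:00:06,000\nGeneral Kenobi."
)
for entry in parse_srt(sample):
    print(entry)
```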
