Comprehensive JSON processing using Fenic’s JSON type system and JQ integration. This example processes Whisper transcription data and transforms it into multiple structured DataFrames for analysis.

What This Example Shows

Core JSON Processing Features

JSON Type Casting

Loading string data and converting to Fenic’s JsonType.

JQ Integration

Complex queries for nested data extraction and aggregation.

Array Operations

Processing nested arrays and extracting scalar values.

Type Conversion

Converting JSON data to appropriate DataFrame types.

Hybrid Processing

Combining JSON-native operations with traditional DataFrame analytics.

Struct Casting

Efficient field extraction using schema definitions.

Data Structure

The input JSON contains Whisper transcription data:
{
  "language": "en",
  "segments": [
    {
      "text": "Let me ask you about AI.",
      "start": 2.94,
      "end": 4.48,
      "words": [
        {
          "word": " Let",
          "start": 2.94,
          "end": 3.12,
          "speaker": "SPEAKER_01",
          "probability": 0.69384765625
        }
        // ... more words
      ]
    }
    // ... more segments
  ]
}
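To see the nesting in plain Python before moving to Fenic, the structure can be explored with the standard `json` module (a minimal sketch using a trimmed copy of the sample above):

```python
import json

# Trimmed copy of the sample transcript structure shown above
sample = '''
{
  "language": "en",
  "segments": [
    {
      "text": "Let me ask you about AI.",
      "start": 2.94,
      "end": 4.48,
      "words": [
        {"word": " Let", "start": 2.94, "end": 3.12,
         "speaker": "SPEAKER_01", "probability": 0.69384765625}
      ]
    }
  ]
}
'''

data = json.loads(sample)

# Each segment carries its own words array; word-level rows come
# from flattening segments[].words[]
first_word = data["segments"][0]["words"][0]
print(first_word["word"], first_word["speaker"])  # " Let" SPEAKER_01
```

Note that Whisper word tokens keep their leading whitespace (`" Let"`), which matters if you later join words back into text.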

Output DataFrames

The pipeline creates three complementary DataFrames:

1. Words DataFrame (Granular Analysis)

Purpose: Individual word-level analysis with timing and confidence

Fields:
  • word_text: The spoken word
  • speaker: Speaker identifier
  • start_time, end_time: Word timing in seconds
  • duration: Calculated word duration
  • probability: Speech recognition confidence score

2. Segments DataFrame (Content Analysis)

Purpose: Conversation segments with aggregated metrics

Fields:
  • segment_text: Full text of conversation segment
  • start_time, end_time, duration: Segment timing
  • word_count: Number of words in segment
  • average_confidence: Average recognition confidence

3. Speaker Summary DataFrame (Analytics)

Purpose: High-level speaker analytics and patterns

Fields:
  • speaker: Speaker identifier
  • total_words: Total words spoken
  • total_speaking_time: Total time speaking
  • average_confidence: Overall speech quality
  • word_rate: Words per minute

Implementation

Session Configuration

import fenic as fc
from pathlib import Path

config = fc.SessionConfig(app_name="json_processing")
session = fc.Session.get_or_create(config)

Load and Cast JSON Data

# Read JSON file as string
transcript_path = Path(__file__).parent / "whisper-transcript.json"
with open(transcript_path, "r") as f:
    json_content = f.read()

# Create dataframe with JSON string
df = session.create_dataframe([{"json_string": json_content}])

# Cast the JSON string to JSON type
df_json = df.select(
    fc.col("json_string").cast(fc.JsonType).alias("json_data")
)

print("JSON data cast to JSON type:")
df_json.show(1)
Casting to JsonType enables JSON-specific operations like JQ queries and structured extraction.

Create Words DataFrame

Using JQ for Extraction

# Extract all words from all segments using JQ
words_df = df_json.select(
    fc.json.jq(
        fc.col("json_data"),
        # JQ query with variable binding
        '.segments[] as $seg | $seg.words[] | {word: .word, speaker: .speaker, start: .start, end: .end, probability: .probability, segment_start: $seg.start, segment_end: $seg.end}'
    ).alias("word_data")
).explode("word_data")  # Convert array to rows

print(f"Extracted {words_df.count()} individual words")
# Define schema for word-level data
word_schema = fc.StructType([
    fc.StructField("word", fc.StringType),
    fc.StructField("speaker", fc.StringType),
    fc.StructField("start", fc.FloatType),
    fc.StructField("end", fc.FloatType),
    fc.StructField("probability", fc.FloatType),
    fc.StructField("segment_start", fc.FloatType),
    fc.StructField("segment_end", fc.FloatType)
])

# Cast to struct and unnest - more efficient than JQ + get_item(0)
words_clean_df = words_df.select(
    fc.col("word_data").cast(word_schema).alias("word_struct")
).unnest("word_struct").select(
    fc.col("word").alias("word_text"),
    fc.col("speaker"),
    fc.col("start").alias("start_time"),
    fc.col("end").alias("end_time"),
    fc.col("probability")
)

# Add calculated fields
words_final_df = words_clean_df.select(
    "*",
    (fc.col("end_time") - fc.col("start_time")).alias("duration")
)

print("Words DataFrame:")
words_final_df.show(10)
Struct casting + unnest is more efficient than multiple JQ queries for simple field extraction.

Create Segments DataFrame

Hybrid Approach: Struct Casting + JQ

# Extract segments using JQ
segments_df = df_json.select(
    fc.json.jq(
        fc.col("json_data"),
        '.segments[] | {text: .text, start: .start, end: .end, words: .words}'
    ).alias("segment_data")
).explode("segment_data")

# Define schema for basic segment fields
segment_basic_schema = fc.StructType([
    fc.StructField("text", fc.StringType),
    fc.StructField("start", fc.FloatType),
    fc.StructField("end", fc.FloatType)
])

# Extract basic fields with struct casting, complex aggregations with JQ
segments_clean_df = segments_df.select(
    # Struct casting for basic fields (efficient)
    fc.col("segment_data").cast(segment_basic_schema).alias("segment_struct"),
    # JQ for array aggregations (best tool for this)
    fc.json.jq(fc.col("segment_data"), '.words | length').get_item(0).cast(fc.IntegerType).alias("word_count"),
    fc.json.jq(fc.col("segment_data"), '[.words[].probability] | add / length').get_item(0).cast(fc.FloatType).alias("average_confidence")
).unnest("segment_struct").select(
    fc.col("text").alias("segment_text"),
    fc.col("start").alias("start_time"),
    fc.col("end").alias("end_time"),
    fc.col("word_count"),
    fc.col("average_confidence")
).select(
    "segment_text",
    "start_time",
    "end_time",
    (fc.col("end_time") - fc.col("start_time")).alias("duration"),
    "word_count",
    "average_confidence"
)

print("Segments DataFrame:")
segments_clean_df.show(5)

Create Speaker Summary DataFrame

# Use traditional DataFrame aggregations on JSON-extracted data
speaker_summary_df = words_final_df.group_by("speaker").agg(
    fc.count("*").alias("total_words"),
    fc.avg("probability").alias("average_confidence"),
    fc.min("start_time").alias("first_speaking_time"),
    fc.max("end_time").alias("last_speaking_time"),
    fc.sum("duration").alias("total_speaking_time")
).select(
    "speaker",
    "total_words",
    "total_speaking_time",
    "average_confidence",
    "first_speaking_time",
    "last_speaking_time",
    # Calculate words per minute
    (fc.col("total_words") / (fc.col("total_speaking_time") / 60.0)).alias("word_rate")
)

print("Speaker Summary:")
speaker_summary_df.show()

Data Extraction Approaches

When to Use Struct Casting

Best for: Simple field extraction
# Define schema
schema = fc.StructType([
    fc.StructField("word", fc.StringType),
    fc.StructField("start", fc.FloatType),
])

# Cast and unnest
df.select(
    fc.col("data").cast(schema).alias("struct")
).unnest("struct")
Benefits:
  • More efficient than multiple JQ queries
  • Automatic type handling
  • Single operation extracts all fields

When to Use JQ Queries

Best for: Complex operations
# Nested traversal with variable binding
'.segments[] as $seg | $seg.words[] | {...}'

# Array length calculation
'.words | length'

# Array aggregation for averages
'[.words[].probability] | add / length'
Use JQ for:
  • Array aggregations (length, add / length)
  • Complex transformations
  • Variable binding and nested operations
  • Mathematical operations within queries

JQ Query Complexity Levels

.field                                      # Simple field access
.nested.field                               # Nested field access
.segments[]                                 # Array iteration
.segments[].words[]                         # Nested array iteration
.segments[] as $seg | $seg.words[]          # Variable binding
.words | length                             # Array aggregation
[.words[].probability] | add / length       # Mathematical aggregation
{word: .word, segment_start: $seg.start}    # Object construction
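The two aggregation queries can be sanity-checked outside of Fenic, either with the jq command-line tool or in plain Python (a sketch over a small hypothetical words array):

```python
import json

segment = json.loads('''
{
  "words": [
    {"word": " Let", "probability": 0.69},
    {"word": " me",  "probability": 0.81},
    {"word": " ask", "probability": 0.90}
  ]
}
''')

# Equivalent of:  .words | length
word_count = len(segment["words"])

# Equivalent of:  [.words[].probability] | add / length
probs = [w["probability"] for w in segment["words"]]
average_confidence = sum(probs) / len(probs)

print(word_count, round(average_confidence, 2))  # 3 0.8
```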

Running the Example

python json_processing.py

Expected Output

The script displays:
  1. Raw JSON data loading and casting
  2. Word-level extraction (roughly 4,000 individual words)
  3. Segment-level analysis (conversation segments)
  4. Speaker summary analytics (2 speakers with statistics)
  5. Comprehensive pipeline summary

Learning Outcomes

After studying this example, you’ll understand:
  1. How to load and cast JSON data in Fenic
  2. Struct casting + unnest for efficient field extraction
  3. Complex JQ queries for advanced data manipulation
  4. When to choose struct casting vs JQ
  5. Array processing and aggregation within JSON
  6. Type conversion between JSON and DataFrame types
  7. Hybrid workflows combining multiple extraction approaches
  8. Performance optimization for JSON processing pipelines
  9. Real-world data processing patterns for audio/transcript analysis
Use struct casting for simple field extraction and JQ for complex aggregations and transformations. Combine both approaches for optimal performance.
