Comprehensive JSON processing using Fenic’s JSON type system and JQ integration. This example processes whisper transcription data and transforms it into multiple structured DataFrames for analysis.
## What This Example Shows

### Core JSON Processing Features

- **JSON Type Casting**: Loading string data and converting it to Fenic's `JsonType`.
- **JQ Integration**: Complex queries for nested data extraction and aggregation.
- **Array Operations**: Processing nested arrays and extracting scalar values.
- **Type Conversion**: Converting JSON data to appropriate DataFrame types.
- **Hybrid Processing**: Combining JSON-native operations with traditional DataFrame analytics.
- **Struct Casting**: Efficient field extraction using schema definitions.
## Data Structure
The input JSON contains whisper transcription data:
```json
{
  "language": "en",
  "segments": [
    {
      "text": "Let me ask you about AI.",
      "start": 2.94,
      "end": 4.48,
      "words": [
        {
          "word": " Let",
          "start": 2.94,
          "end": 3.12,
          "speaker": "SPEAKER_01",
          "probability": 0.69384765625
        }
        // ... more words
      ]
    }
    // ... more segments
  ]
}
```
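Before the Fenic pipeline, it can help to see the nesting in plain Python. The sketch below (stdlib `json` only, with a one-word inline sample in the same shape) walks the two levels that the JQ queries later traverse: each word row needs fields from the word object itself plus its parent segment.

```python
import json

# Small inline sample mirroring the whisper transcript structure above
sample = json.loads("""
{
  "language": "en",
  "segments": [
    {
      "text": "Let me ask you about AI.",
      "start": 2.94,
      "end": 4.48,
      "words": [
        {"word": " Let", "start": 2.94, "end": 3.12,
         "speaker": "SPEAKER_01", "probability": 0.69384765625}
      ]
    }
  ]
}
""")

# Flatten words while keeping a handle on the parent segment's timing -
# this is exactly what the JQ variable binding ($seg) does later.
rows = [
    (w["word"], w["speaker"], seg["start"])
    for seg in sample["segments"]
    for w in seg["words"]
]
print(rows)  # [(' Let', 'SPEAKER_01', 2.94)]
```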
## Output DataFrames

The pipeline creates three complementary DataFrames:

### 1. Words DataFrame (Granular Analysis)

**Purpose**: Individual word-level analysis with timing and confidence

Fields:

- `word_text`: The spoken word
- `speaker`: Speaker identifier
- `start_time`, `end_time`: Word timing in seconds
- `duration`: Calculated word duration
- `probability`: Speech recognition confidence score

### 2. Segments DataFrame (Content Analysis)

**Purpose**: Conversation segments with aggregated metrics

Fields:

- `segment_text`: Full text of the conversation segment
- `start_time`, `end_time`, `duration`: Segment timing
- `word_count`: Number of words in the segment
- `average_confidence`: Average recognition confidence

### 3. Speaker Summary DataFrame (Analytics)

**Purpose**: High-level speaker analytics and patterns

Fields:

- `speaker`: Speaker identifier
- `total_words`: Total words spoken
- `total_speaking_time`: Total speaking time in seconds
- `average_confidence`: Overall speech quality
- `word_rate`: Words per minute
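As a sanity check on the `word_rate` definition, a quick stdlib-only computation with hypothetical totals (the numbers below are illustrative, not from the transcript) shows the words-per-minute formula used later in the pipeline:

```python
# Hypothetical per-speaker totals, just to illustrate the word_rate formula
total_words = 150
total_speaking_time = 75.0  # seconds

# Words per minute = words / (seconds / 60)
word_rate = total_words / (total_speaking_time / 60.0)
print(word_rate)  # 120.0
```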
## Implementation

### Session Configuration
```python
import fenic as fc
from pathlib import Path

config = fc.SessionConfig(app_name="json_processing")
session = fc.Session.get_or_create(config)
```
### Load and Cast JSON Data
```python
# Read the JSON file as a string
transcript_path = Path(__file__).parent / "whisper-transcript.json"
with open(transcript_path, "r") as f:
    json_content = f.read()

# Create a DataFrame with the JSON string
df = session.create_dataframe([{"json_string": json_content}])

# Cast the JSON string to the JSON type
df_json = df.select(
    fc.col("json_string").cast(fc.JsonType).alias("json_data")
)

print("JSON data cast to JSON type:")
df_json.show(1)
```
Casting to `JsonType` enables JSON-specific operations like JQ queries and structured extraction.
### Create Words DataFrame
```python
# Extract all words from all segments using JQ
words_df = df_json.select(
    fc.json.jq(
        fc.col("json_data"),
        # JQ query with variable binding
        '.segments[] as $seg | $seg.words[] | {word: .word, speaker: .speaker, start: .start, end: .end, probability: .probability, segment_start: $seg.start}'
    ).alias("word_data")
).explode("word_data")  # Convert the array to rows

print(f"Extracted {words_df.count()} individual words")
```
#### Using Struct Casting (Recommended)
```python
# Define a schema for the word-level data
# (fields match the JQ query's output object above)
word_schema = fc.StructType([
    fc.StructField("word", fc.StringType),
    fc.StructField("speaker", fc.StringType),
    fc.StructField("start", fc.FloatType),
    fc.StructField("end", fc.FloatType),
    fc.StructField("probability", fc.FloatType),
    fc.StructField("segment_start", fc.FloatType),
])

# Cast to a struct and unnest - more efficient than JQ + get_item(0)
words_clean_df = words_df.select(
    fc.col("word_data").cast(word_schema).alias("word_struct")
).unnest("word_struct").select(
    fc.col("word").alias("word_text"),
    fc.col("speaker"),
    fc.col("start").alias("start_time"),
    fc.col("end").alias("end_time"),
    fc.col("probability")
)

# Add calculated fields
words_final_df = words_clean_df.select(
    "*",
    (fc.col("end_time") - fc.col("start_time")).alias("duration")
)

print("Words DataFrame:")
words_final_df.show(10)
```
Struct casting + unnest is more efficient than multiple JQ queries for simple field extraction.
### Create Segments DataFrame

#### Hybrid Approach: Struct Casting + JQ
```python
# Extract segments using JQ
segments_df = df_json.select(
    fc.json.jq(
        fc.col("json_data"),
        '.segments[] | {text: .text, start: .start, end: .end, words: .words}'
    ).alias("segment_data")
).explode("segment_data")

# Define a schema for the basic segment fields
segment_basic_schema = fc.StructType([
    fc.StructField("text", fc.StringType),
    fc.StructField("start", fc.FloatType),
    fc.StructField("end", fc.FloatType)
])

# Extract basic fields with struct casting, complex aggregations with JQ
segments_clean_df = segments_df.select(
    # Struct casting for basic fields (efficient)
    fc.col("segment_data").cast(segment_basic_schema).alias("segment_struct"),
    # JQ for array aggregations (the best tool for this)
    fc.json.jq(fc.col("segment_data"), '.words | length').get_item(0).cast(fc.IntegerType).alias("word_count"),
    fc.json.jq(fc.col("segment_data"), '[.words[].probability] | add / length').get_item(0).cast(fc.FloatType).alias("average_confidence")
).unnest("segment_struct").select(
    fc.col("text").alias("segment_text"),
    fc.col("start").alias("start_time"),
    fc.col("end").alias("end_time"),
    fc.col("word_count"),
    fc.col("average_confidence")
).select(
    "segment_text",
    "start_time",
    "end_time",
    (fc.col("end_time") - fc.col("start_time")).alias("duration"),
    "word_count",
    "average_confidence"
)

print("Segments DataFrame:")
segments_clean_df.show(5)
```
### Create Speaker Summary DataFrame
```python
# Use traditional DataFrame aggregations on the JSON-extracted data
speaker_summary_df = words_final_df.group_by("speaker").agg(
    fc.count("*").alias("total_words"),
    fc.avg("probability").alias("average_confidence"),
    fc.min("start_time").alias("first_speaking_time"),
    fc.max("end_time").alias("last_speaking_time"),
    fc.sum("duration").alias("total_speaking_time")
).select(
    "speaker",
    "total_words",
    "total_speaking_time",
    "average_confidence",
    "first_speaking_time",
    "last_speaking_time",
    # Calculate words per minute
    (fc.col("total_words") / (fc.col("total_speaking_time") / 60.0)).alias("word_rate")
)

print("Speaker Summary:")
speaker_summary_df.show()
```
## When to Use Struct Casting

**Best for**: Simple field extraction
```python
# Define a schema
schema = fc.StructType([
    fc.StructField("word", fc.StringType),
    fc.StructField("start", fc.FloatType),
])

# Cast and unnest
df.select(
    fc.col("data").cast(schema).alias("struct")
).unnest("struct")
```
**Benefits**:

- More efficient than multiple JQ queries
- Automatic type handling
- A single operation extracts all fields
## When to Use JQ Queries

**Best for**: Complex operations
```python
# Nested traversal with variable binding
'.segments[] as $seg | $seg.words[] | {...}'

# Array length calculation
'.words | length'

# Array aggregation for averages
'[.words[].probability] | add / length'
```
Use JQ for:

- Array aggregations (`length`, `add / length`)
- Complex transformations
- Variable binding and nested operations
- Mathematical operations within queries
## JQ Query Complexity Levels

From simplest to most complex:

```jq
.segments[]                                # basic array iteration
.segments[].words[]                        # nested array traversal
.segments[] as $seg | $seg.words[]         # variable binding
.words | length                            # array length
[.words[].probability] | add / length      # array aggregation (average)
{word: .word, segment_start: $seg.start}   # object construction with a bound variable
```
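For readers less familiar with jq, most of these levels map to short stdlib-Python equivalents; the `data` dict below is a minimal hypothetical stand-in for the transcript structure:

```python
from statistics import mean

# Minimal hypothetical stand-in for the transcript structure
data = {"segments": [
    {"start": 1.0, "words": [
        {"word": "hi", "probability": 0.9},
        {"word": "there", "probability": 0.7},
    ]},
]}

# .segments[]  ->  iterate over segments
segments = data["segments"]

# .segments[].words[]  ->  flatten the nested word arrays
words = [w for s in segments for w in s["words"]]

# .segments[] as $seg | $seg.words[]  ->  keep a handle on the parent segment
pairs = [(s["start"], w["word"]) for s in segments for w in s["words"]]

# .words | length  ->  array length (here, for one segment)
word_count = len(segments[0]["words"])

# [.words[].probability] | add / length  ->  average over an array
avg_confidence = mean(w["probability"] for w in segments[0]["words"])
```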
## Running the Example

```bash
python json_processing.py
```
### Expected Output

The script displays:

- Raw JSON data loading and casting
- Word-level extraction (roughly 4,000 individual words)
- Segment-level analysis (conversation segments)
- Speaker summary analytics (statistics for 2 speakers)
- A comprehensive pipeline summary
## Learning Outcomes

After studying this example, you'll understand:

- How to load and cast JSON data in Fenic
- Struct casting + unnest for efficient field extraction
- Complex JQ queries for advanced data manipulation
- When to choose struct casting vs. JQ
- Array processing and aggregation within JSON
- Type conversion between JSON and DataFrame types
- Hybrid workflows combining multiple extraction approaches
- Performance optimization for JSON processing pipelines
- Real-world data processing patterns for audio/transcript analysis
In short: use struct casting for simple field extraction, JQ for complex aggregations and transformations, and combine both approaches for the best performance.