Audio Processing & Encoding
VozCraft includes advanced audio processing capabilities that allow users to download their generated speech as high-quality audio files. This page documents the audio generation pipeline, WAV encoding implementation, and file download mechanisms.
Overview
The audio processing system consists of three main components:
Audio Generation : Uses Web Audio API to synthesize audio from speech parameters
WAV Encoding : Encodes raw audio samples into WAV file format
File Download : Creates downloadable Blob objects for MP3/WAV files
VozCraft generates audio entirely in the browser using the Web Audio API’s OfflineAudioContext , requiring no server-side processing.
Audio Generation Pipeline
The generateAndDownloadAudio Function
This is the primary function that orchestrates audio generation:
async function generateAndDownloadAudio ( item , format ) {
const velData = VELOCIDADES . find ( v => v . label === item . velocidad ) || VELOCIDADES [ 2 ];
const animData = ANIMOS . find ( a => a . label === item . animo ) || ANIMOS [ 0 ];
const gd = GENEROS . find ( g => g . label === item . genero ) || GENEROS [ 0 ];
const effectiveRate = ( velData . rate + gd . rateAdd ) * animData . rateMulti ;
const estimatedDuration = Math . max ( 1 , item . texto . length / ( 14 * effectiveRate ));
const sampleRate = 22050 ;
const numSamples = Math . ceil ( estimatedDuration * sampleRate );
const audioCtx = new OfflineAudioContext ( 1 , numSamples , sampleRate );
const effectivePitch = gd . pitch * animData . pitch ;
const baseFreq = 120 * effectivePitch ;
// Create oscillator for voice synthesis
const osc1 = audioCtx . createOscillator ();
osc1 . type = 'sawtooth' ;
osc1 . frequency . setValueAtTime ( baseFreq , 0 );
for ( let i = 0 ; i < estimatedDuration ; i += 0.3 ) {
const v = ( Math . random () - 0.5 ) * baseFreq * 0.12 ;
osc1 . frequency . linearRampToValueAtTime ( baseFreq + v , i + 0.15 );
osc1 . frequency . linearRampToValueAtTime ( baseFreq , i + 0.3 );
}
// Create formant filters
const f1 = audioCtx . createBiquadFilter ();
f1 . type = 'bandpass' ;
f1 . frequency . value = 800 * effectivePitch ;
f1 . Q . value = 3 ;
const f2 = audioCtx . createBiquadFilter ();
f2 . type = 'bandpass' ;
f2 . frequency . value = 2200 * effectivePitch ;
f2 . Q . value = 4 ;
// Create gain envelope
const gainNode = audioCtx . createGain ();
const words = item . texto . split ( ' ' );
const wDur = estimatedDuration / words . length ;
gainNode . gain . setValueAtTime ( 0 , 0 );
words . forEach (( w , wi ) => {
const t = wi * wDur ;
const syl = Math . max ( 1 , Math . ceil ( w . length / 3 ));
for ( let s = 0 ; s < syl ; s ++ ) {
const st = t + s * ( wDur / syl );
gainNode . gain . linearRampToValueAtTime ( 0.18 * animData . volume , st + 0.02 );
gainNode . gain . linearRampToValueAtTime ( 0.05 , st + wDur / syl - 0.02 );
}
});
gainNode . gain . linearRampToValueAtTime ( 0 , estimatedDuration );
// Add noise for consonants
const noiseBuf = audioCtx . createBuffer ( 1 , numSamples , sampleRate );
const nd = noiseBuf . getChannelData ( 0 );
for ( let i = 0 ; i < numSamples ; i ++ ) nd [ i ] = ( Math . random () * 2 - 1 ) * 0.04 ;
const noiseSource = audioCtx . createBufferSource ();
noiseSource . buffer = noiseBuf ;
const nf = audioCtx . createBiquadFilter ();
nf . type = 'bandpass' ;
nf . frequency . value = 4000 ;
nf . Q . value = 2 ;
// Connect audio graph
osc1 . connect ( f1 );
osc1 . connect ( f2 );
f1 . connect ( gainNode );
f2 . connect ( gainNode );
noiseSource . connect ( nf );
nf . connect ( gainNode );
gainNode . connect ( audioCtx . destination );
// Start sources
osc1 . start ( 0 );
osc1 . stop ( estimatedDuration );
noiseSource . start ( 0 );
noiseSource . stop ( estimatedDuration );
// Render audio
const rendered = await audioCtx . startRendering ();
const channelData = rendered . getChannelData ( 0 );
const wavBuffer = encodeWAV ( channelData , sampleRate );
const mime = format === 'wav' ? 'audio/wav' : 'audio/mpeg' ;
downloadBlob ( new Blob ([ wavBuffer ], { type: mime }), `vozcraft- ${ item . id } . ${ format } ` );
}
Audio Generation Steps
Calculate audio parameters
First, the function calculates the effective speech rate and estimates the audio duration: const effectiveRate = ( velData . rate + gd . rateAdd ) * animData . rateMulti ;
const estimatedDuration = Math . max ( 1 , item . texto . length / ( 14 * effectiveRate ));
The formula assumes 14 characters per second at normal speed, adjusted by the effective rate multiplier.
Create OfflineAudioContext
Initialize an offline rendering context with calculated parameters: const sampleRate = 22050 ; // 22.05 kHz
const numSamples = Math . ceil ( estimatedDuration * sampleRate );
const audioCtx = new OfflineAudioContext ( 1 , numSamples , sampleRate );
Sample Rate : 22.05 kHz provides good quality for speech while keeping file sizes reasonable. CD-quality audio is 44.1 kHz.
Generate voice waveform
Create a sawtooth oscillator with frequency modulation: const effectivePitch = gd . pitch * animData . pitch ;
const baseFreq = 120 * effectivePitch ;
const osc1 = audioCtx . createOscillator ();
osc1 . type = 'sawtooth' ;
osc1 . frequency . setValueAtTime ( baseFreq , 0 );
// Add natural pitch variation
for ( let i = 0 ; i < estimatedDuration ; i += 0.3 ) {
const v = ( Math . random () - 0.5 ) * baseFreq * 0.12 ;
osc1 . frequency . linearRampToValueAtTime ( baseFreq + v , i + 0.15 );
osc1 . frequency . linearRampToValueAtTime ( baseFreq , i + 0.3 );
}
Sawtooth waves contain all harmonics and create a rich, buzzy sound that works well for voice synthesis. The harmonic content is then shaped by formant filters to create vowel-like sounds.
Apply formant filters
Create bandpass filters to simulate vocal tract resonances: const f1 = audioCtx . createBiquadFilter ();
f1 . type = 'bandpass' ;
f1 . frequency . value = 800 * effectivePitch ; // First formant
f1 . Q . value = 3 ;
const f2 = audioCtx . createBiquadFilter ();
f2 . type = 'bandpass' ;
f2 . frequency . value = 2200 * effectivePitch ; // Second formant
f2 . Q . value = 4 ;
Formants are resonant frequencies of the vocal tract. F1 (800 Hz) and F2 (2200 Hz) are the most important for vowel perception.
Create amplitude envelope
Shape the volume over time to simulate syllables: const gainNode = audioCtx . createGain ();
const words = item . texto . split ( ' ' );
const wDur = estimatedDuration / words . length ;
gainNode . gain . setValueAtTime ( 0 , 0 );
words . forEach (( w , wi ) => {
const t = wi * wDur ;
const syl = Math . max ( 1 , Math . ceil ( w . length / 3 ));
for ( let s = 0 ; s < syl ; s ++ ) {
const st = t + s * ( wDur / syl );
gainNode . gain . linearRampToValueAtTime ( 0.18 * animData . volume , st + 0.02 );
gainNode . gain . linearRampToValueAtTime ( 0.05 , st + wDur / syl - 0.02 );
}
});
Each word is divided into syllables (estimated as word_length / 3), and each syllable gets an attack-decay envelope.
Add noise for consonants
Generate white noise and filter it to simulate fricatives: const noiseBuf = audioCtx . createBuffer ( 1 , numSamples , sampleRate );
const nd = noiseBuf . getChannelData ( 0 );
for ( let i = 0 ; i < numSamples ; i ++ ) {
nd [ i ] = ( Math . random () * 2 - 1 ) * 0.04 ;
}
const noiseSource = audioCtx . createBufferSource ();
noiseSource . buffer = noiseBuf ;
const nf = audioCtx . createBiquadFilter ();
nf . type = 'bandpass' ;
nf . frequency . value = 4000 ; // High-frequency content
nf . Q . value = 2 ;
Render audio
Connect the audio graph and render: osc1 . connect ( f1 );
osc1 . connect ( f2 );
f1 . connect ( gainNode );
f2 . connect ( gainNode );
noiseSource . connect ( nf );
nf . connect ( gainNode );
gainNode . connect ( audioCtx . destination );
osc1 . start ( 0 );
osc1 . stop ( estimatedDuration );
noiseSource . start ( 0 );
noiseSource . stop ( estimatedDuration );
const rendered = await audioCtx . startRendering ();
const channelData = rendered . getChannelData ( 0 );
WAV Encoding
The encodeWAV Function
This function converts raw PCM audio samples to WAV file format:
function encodeWAV ( samples , sampleRate ) {
const buf = new ArrayBuffer ( 44 + samples . length * 2 );
const v = new DataView ( buf );
const ws = ( off , str ) => {
for ( let i = 0 ; i < str . length ; i ++ ) {
v . setUint8 ( off + i , str . charCodeAt ( i ));
}
};
// RIFF header
ws ( 0 , 'RIFF' );
v . setUint32 ( 4 , 36 + samples . length * 2 , true );
ws ( 8 , 'WAVE' );
// fmt chunk
ws ( 12 , 'fmt ' );
v . setUint32 ( 16 , 16 , true ); // Subchunk1Size (16 for PCM)
v . setUint16 ( 20 , 1 , true ); // AudioFormat (1 = PCM)
v . setUint16 ( 22 , 1 , true ); // NumChannels (1 = mono)
v . setUint32 ( 24 , sampleRate , true );
v . setUint32 ( 28 , sampleRate * 2 , true ); // ByteRate
v . setUint16 ( 32 , 2 , true ); // BlockAlign
v . setUint16 ( 34 , 16 , true ); // BitsPerSample
// data chunk
ws ( 36 , 'data' );
v . setUint32 ( 40 , samples . length * 2 , true );
// Write PCM samples
let off = 44 ;
for ( let i = 0 ; i < samples . length ; i ++ ) {
const s = Math . max ( - 1 , Math . min ( 1 , samples [ i ]));
v . setInt16 ( off , s < 0 ? s * 32768 : s * 32767 , true );
off += 2 ;
}
return buf ;
}
WAV File Structure
The WAV format consists of three main sections:
Offset Size Description
12 4 "fmt " subchunk ID
16 4 Subchunk size (16 for PCM)
20 2 Audio format (1 = PCM)
22 2 Number of channels (1 = mono)
24 4 Sample rate (22050 Hz)
28 4 Byte rate (sampleRate × channels × bitsPerSample / 8)
32 2 Block align (channels × bitsPerSample / 8)
34 2 Bits per sample (16)
ws ( 12 , 'fmt ' );
v . setUint32 ( 16 , 16 , true ); // Subchunk1Size
v . setUint16 ( 20 , 1 , true ); // AudioFormat (PCM)
v . setUint16 ( 22 , 1 , true ); // NumChannels (mono)
v . setUint32 ( 24 , sampleRate , true ); // Sample rate
v . setUint32 ( 28 , sampleRate * 2 , true ); // ByteRate
v . setUint16 ( 32 , 2 , true ); // BlockAlign
v . setUint16 ( 34 , 16 , true ); // BitsPerSample
data Chunk (8 bytes + audio data)
Offset Size Description
36 4 "data" subchunk ID
40 4 Data size
44 samples.length×2 PCM audio data
ws ( 36 , 'data' );
v . setUint32 ( 40 , samples . length * 2 , true );
// Write 16-bit PCM samples
let off = 44 ;
for ( let i = 0 ; i < samples . length ; i ++ ) {
const s = Math . max ( - 1 , Math . min ( 1 , samples [ i ]));
v . setInt16 ( off , s < 0 ? s * 32768 : s * 32767 , true );
off += 2 ;
}
PCM Sample Conversion
The Web Audio API provides samples as 32-bit floats in the range [-1.0, 1.0]. These must be converted to 16-bit signed integers:
const s = Math . max ( - 1 , Math . min ( 1 , samples [ i ])); // Clamp to [-1, 1]
v . setInt16 ( off , s < 0 ? s * 32768 : s * 32767 , true );
16-bit PCM range:
Minimum: -32768 (0x8000)
Maximum: 32767 (0x7FFF)
Zero: 0
The conversion multiplies by 32768 for negative values and 32767 for positive values to maximize dynamic range while preventing overflow.
File Download System
The downloadBlob Function
Creates a temporary download link and triggers the browser’s download:
function downloadBlob ( blob , filename ) {
const url = URL . createObjectURL ( blob );
const a = document . createElement ( 'a' );
a . href = url ;
a . download = filename ;
a . click ();
URL . revokeObjectURL ( url );
}
Create object URL
URL.createObjectURL() creates a temporary URL pointing to the Blob data:const url = URL . createObjectURL ( blob );
// Returns: "blob:http://localhost:5173/uuid"
Create anchor element
Programmatically create an invisible <a> tag: const a = document . createElement ( 'a' );
a . href = url ;
a . download = filename ;
Trigger download
Click the anchor to initiate download:
Cleanup
Revoke the object URL to free memory: URL . revokeObjectURL ( url );
Always revoke object URLs after use to prevent memory leaks. Object URLs persist until the page is closed or explicitly revoked.
Download Handlers
Audio File Download
const handleDownload = async ( item , format ) => {
try {
await generateAndDownloadAudio ( item , format );
showToast ( `📥 ${ format . toUpperCase () } descargado` );
} catch ( e ) {
console . error ( e );
showToast ( 'Error al generar el audio' , 'error' );
}
};
Transcript Download
VozCraft also allows downloading text transcripts:
function downloadTranscript ( item , lang ) {
const fecha = new Date ( item . timestamp );
const fechaStr = fecha . toLocaleDateString ( lang === 'es' ? 'es-MX' : 'en-US' , { dateStyle: 'long' });
const horaStr = fecha . toLocaleTimeString ( lang === 'es' ? 'es-MX' : 'en-US' , { timeStyle: 'medium' });
const nombre = item . nombre || `Audio- ${ item . id } ` ;
const sep = '═' . repeat ( 50 );
const content = [
'VozCraft — Transcripción de Audio' ,
sep ,
`Nombre: ${ nombre } ` ,
`Fecha: ${ fechaStr } ` ,
`Hora: ${ horaStr } ` ,
`Idioma: ${ item . voz } ` ,
`Género: ${ item . genero || '—' } ` ,
`Velocidad: ${ item . velocidad } ` ,
`Ánimo: ${ item . animo } ` ,
sep ,
'TRANSCRIPCIÓN:' ,
'' ,
item . texto ,
'' ,
sep ,
'© VozCraft · mateoRiosdev · 2026' ,
]. join ( ' \n ' );
const safeName = nombre . replace ( / [ ^ a-zA-Z0-9_\-áéíóúñÁÉÍÓÚÑ ] / g , '' ). trim (). replace ( / / g , '_' ) || `vozcraft- ${ item . id } ` ;
downloadBlob ( new Blob ([ content ], { type: 'text/plain;charset=utf-8' }), ` ${ safeName } .txt` );
}
Transcript files include metadata (date, voice, speed, mood) along with the text, making them useful for archiving and documentation.
Benchmarks
Typical processing times on modern hardware:
Text Length Duration Processing Time 50 chars ~4s ~150ms 200 chars ~14s ~300ms 1000 chars ~70s ~800ms 5000 chars ~350s ~2500ms
Processing happens asynchronously using OfflineAudioContext.startRendering(), which returns a Promise. The UI remains responsive during rendering.
Memory Usage
// Memory calculation for audio buffer
const sampleRate = 22050 ;
const duration = 60 ; // 60 seconds
const numSamples = duration * sampleRate ;
const memoryBytes = numSamples * 4 ; // 32-bit floats
// = 5,292,000 bytes (~5.3 MB) for 1 minute of audio
Memory considerations:
Each minute of audio requires ~5.3 MB in memory during processing
Offline contexts hold audio data until garbage collected
Mobile browsers may have stricter memory limits
Consider chunking very long audio (>10 minutes)
Advanced Features
Dynamic Frequency Modulation
VozCraft applies pitch variation to create more natural-sounding speech:
for ( let i = 0 ; i < estimatedDuration ; i += 0.3 ) {
const v = ( Math . random () - 0.5 ) * baseFreq * 0.12 ;
osc1 . frequency . linearRampToValueAtTime ( baseFreq + v , i + 0.15 );
osc1 . frequency . linearRampToValueAtTime ( baseFreq , i + 0.3 );
}
This creates pitch variations of ±12% every 300ms, simulating natural speech prosody.
Syllable-Based Envelopes
const words = item . texto . split ( ' ' );
const wDur = estimatedDuration / words . length ;
words . forEach (( w , wi ) => {
const t = wi * wDur ;
const syl = Math . max ( 1 , Math . ceil ( w . length / 3 ));
for ( let s = 0 ; s < syl ; s ++ ) {
const st = t + s * ( wDur / syl );
gainNode . gain . linearRampToValueAtTime ( 0.18 * animData . volume , st + 0.02 );
gainNode . gain . linearRampToValueAtTime ( 0.05 , st + wDur / syl - 0.02 );
}
});
Each word is divided into estimated syllables (word_length / 3), and each syllable gets an attack-decay envelope for a more rhythmic sound.
WAV Specifications
Format: WAV (PCM)
Codec: Linear PCM
Channels: 1 (mono)
Sample Rate: 22050 Hz
Bit Depth: 16-bit
Byte Order: Little-endian
Compression: None
File Size Calculation
// WAV file size formula
const headerSize = 44 ; // bytes
const sampleSize = 2 ; // bytes (16-bit)
const duration = 60 ; // seconds
const sampleRate = 22050 ;
const fileSize = headerSize + ( duration * sampleRate * sampleSize );
// = 44 + (60 × 22050 × 2) = 2,646,044 bytes (~2.52 MB)
Error Handling
const handleDownload = async ( item , format ) => {
try {
await generateAndDownloadAudio ( item , format );
showToast ( `📥 ${ format . toUpperCase () } descargado` );
} catch ( e ) {
console . error ( e );
showToast ( 'Error al generar el audio' , 'error' );
}
};
Common error scenarios:
Out of memory : Very long audio (>10 minutes)
Browser restrictions : iOS Safari has stricter limits
OfflineAudioContext limits : Maximum context length varies by browser
Best Practices
Validate input length
if ( texto . length > 5000 ) {
showToast ( 'Text too long. Maximum 5000 characters.' , 'error' );
return ;
}
Use appropriate sample rates
22.05 kHz : Speech, podcasts (good balance)
44.1 kHz : Music, professional audio
8 kHz : Phone quality (not recommended)
Clamp audio samples
const s = Math . max ( - 1 , Math . min ( 1 , samples [ i ]));
Always clamp samples to prevent clipping and distortion.
Clean up resources
URL . revokeObjectURL ( url );
Revoke object URLs to prevent memory leaks.
Browser Compatibility
Web Audio API Support:
✅ Chrome 35+
✅ Firefox 25+
✅ Safari 14.1+
✅ Edge 79+
✅ Opera 22+
OfflineAudioContext Support:
✅ All modern browsers
⚠️ Safari has lower maximum context lengths
Next Steps
Web Speech API Learn about real-time speech synthesis
PWA Setup Configure Progressive Web App features