Audio Processing & Encoding

VozCraft includes advanced audio processing capabilities that allow users to download their generated speech as high-quality audio files. This page documents the audio generation pipeline, WAV encoding implementation, and file download mechanisms.

Overview

The audio processing system consists of three main components:

Audio Generation: Uses Web Audio API to synthesize audio from speech parameters
WAV Encoding: Encodes raw audio samples into WAV file format
File Download: Creates downloadable Blob objects for MP3/WAV files

VozCraft generates audio entirely in the browser using the Web Audio API’s OfflineAudioContext, requiring no server-side processing.

Audio Generation Pipeline

The generateAndDownloadAudio Function

This is the primary function that orchestrates audio generation:

App.jsx (lines 492-554)

async function generateAndDownloadAudio(item, format) {
  const velData = VELOCIDADES.find(v => v.label === item.velocidad) || VELOCIDADES[2];
  const animData = ANIMOS.find(a => a.label === item.animo) || ANIMOS[0];
  const gd = GENEROS.find(g => g.label === item.genero) || GENEROS[0];

  const effectiveRate = (velData.rate + gd.rateAdd) * animData.rateMulti;
  const estimatedDuration = Math.max(1, item.texto.length / (14 * effectiveRate));

  const sampleRate = 22050;
  const numSamples = Math.ceil(estimatedDuration * sampleRate);
  const audioCtx = new OfflineAudioContext(1, numSamples, sampleRate);

  const effectivePitch = gd.pitch * animData.pitch;
  const baseFreq = 120 * effectivePitch;

  // Create oscillator for voice synthesis
  const osc1 = audioCtx.createOscillator();
  osc1.type = 'sawtooth';
  osc1.frequency.setValueAtTime(baseFreq, 0);
  for (let i = 0; i < estimatedDuration; i += 0.3) {
    const v = (Math.random() - 0.5) * baseFreq * 0.12;
    osc1.frequency.linearRampToValueAtTime(baseFreq + v, i + 0.15);
    osc1.frequency.linearRampToValueAtTime(baseFreq, i + 0.3);
  }

  // Create formant filters
  const f1 = audioCtx.createBiquadFilter();
  f1.type = 'bandpass';
  f1.frequency.value = 800 * effectivePitch;
  f1.Q.value = 3;
  
  const f2 = audioCtx.createBiquadFilter();
  f2.type = 'bandpass';
  f2.frequency.value = 2200 * effectivePitch;
  f2.Q.value = 4;

  // Create gain envelope
  const gainNode = audioCtx.createGain();
  const words = item.texto.split(' ');
  const wDur = estimatedDuration / words.length;
  gainNode.gain.setValueAtTime(0, 0);
  words.forEach((w, wi) => {
    const t = wi * wDur;
    const syl = Math.max(1, Math.ceil(w.length / 3));
    for (let s = 0; s < syl; s++) {
      const st = t + s * (wDur / syl);
      gainNode.gain.linearRampToValueAtTime(0.18 * animData.volume, st + 0.02);
      gainNode.gain.linearRampToValueAtTime(0.05, st + wDur / syl - 0.02);
    }
  });
  gainNode.gain.linearRampToValueAtTime(0, estimatedDuration);

  // Add noise for consonants
  const noiseBuf = audioCtx.createBuffer(1, numSamples, sampleRate);
  const nd = noiseBuf.getChannelData(0);
  for (let i = 0; i < numSamples; i++) nd[i] = (Math.random() * 2 - 1) * 0.04;
  const noiseSource = audioCtx.createBufferSource();
  noiseSource.buffer = noiseBuf;
  const nf = audioCtx.createBiquadFilter();
  nf.type = 'bandpass';
  nf.frequency.value = 4000;
  nf.Q.value = 2;

  // Connect audio graph
  osc1.connect(f1);
  osc1.connect(f2);
  f1.connect(gainNode);
  f2.connect(gainNode);
  noiseSource.connect(nf);
  nf.connect(gainNode);
  gainNode.connect(audioCtx.destination);

  // Start sources
  osc1.start(0);
  osc1.stop(estimatedDuration);
  noiseSource.start(0);
  noiseSource.stop(estimatedDuration);

  // Render audio
  const rendered = await audioCtx.startRendering();
  const channelData = rendered.getChannelData(0);
  const wavBuffer = encodeWAV(channelData, sampleRate);
  const mime = format === 'wav' ? 'audio/wav' : 'audio/mpeg';
  downloadBlob(new Blob([wavBuffer], { type: mime }), `vozcraft-${item.id}.${format}`);
}

Audio Generation Steps

Calculate audio parameters

First, the function calculates the effective speech rate and estimates the audio duration:

const effectiveRate = (velData.rate + gd.rateAdd) * animData.rateMulti;
const estimatedDuration = Math.max(1, item.texto.length / (14 * effectiveRate));

The formula assumes 14 characters per second at normal speed, adjusted by the effective rate multiplier.

Create OfflineAudioContext

Initialize an offline rendering context with calculated parameters:

const sampleRate = 22050; // 22.05 kHz
const numSamples = Math.ceil(estimatedDuration * sampleRate);
const audioCtx = new OfflineAudioContext(1, numSamples, sampleRate);

Sample Rate: 22.05 kHz provides good quality for speech while keeping file sizes reasonable. CD-quality audio is 44.1 kHz.

Generate voice waveform

Create a sawtooth oscillator with frequency modulation:

const effectivePitch = gd.pitch * animData.pitch;
const baseFreq = 120 * effectivePitch;

const osc1 = audioCtx.createOscillator();
osc1.type = 'sawtooth';
osc1.frequency.setValueAtTime(baseFreq, 0);

// Add natural pitch variation
for (let i = 0; i < estimatedDuration; i += 0.3) {
  const v = (Math.random() - 0.5) * baseFreq * 0.12;
  osc1.frequency.linearRampToValueAtTime(baseFreq + v, i + 0.15);
  osc1.frequency.linearRampToValueAtTime(baseFreq, i + 0.3);
}

Why sawtooth waveform?

Sawtooth waves contain all harmonics and create a rich, buzzy sound that works well for voice synthesis. The harmonic content is then shaped by formant filters to create vowel-like sounds.

Apply formant filters

Create bandpass filters to simulate vocal tract resonances:

const f1 = audioCtx.createBiquadFilter();
f1.type = 'bandpass';
f1.frequency.value = 800 * effectivePitch;  // First formant
f1.Q.value = 3;

const f2 = audioCtx.createBiquadFilter();
f2.type = 'bandpass';
f2.frequency.value = 2200 * effectivePitch; // Second formant
f2.Q.value = 4;

Formants are resonant frequencies of the vocal tract. F1 (800 Hz) and F2 (2200 Hz) are the most important for vowel perception.

Create amplitude envelope

Shape the volume over time to simulate syllables:

const gainNode = audioCtx.createGain();
const words = item.texto.split(' ');
const wDur = estimatedDuration / words.length;
gainNode.gain.setValueAtTime(0, 0);

words.forEach((w, wi) => {
  const t = wi * wDur;
  const syl = Math.max(1, Math.ceil(w.length / 3));
  for (let s = 0; s < syl; s++) {
    const st = t + s * (wDur / syl);
    gainNode.gain.linearRampToValueAtTime(0.18 * animData.volume, st + 0.02);
    gainNode.gain.linearRampToValueAtTime(0.05, st + wDur / syl - 0.02);
  }
});

Each word is divided into syllables (estimated as word_length / 3), and each syllable gets an attack-decay envelope.

Add noise for consonants

Generate white noise and filter it to simulate fricatives:

const noiseBuf = audioCtx.createBuffer(1, numSamples, sampleRate);
const nd = noiseBuf.getChannelData(0);
for (let i = 0; i < numSamples; i++) {
  nd[i] = (Math.random() * 2 - 1) * 0.04;
}

const noiseSource = audioCtx.createBufferSource();
noiseSource.buffer = noiseBuf;

const nf = audioCtx.createBiquadFilter();
nf.type = 'bandpass';
nf.frequency.value = 4000; // High-frequency content
nf.Q.value = 2;

Render audio

Connect the audio graph and render:

osc1.connect(f1);
osc1.connect(f2);
f1.connect(gainNode);
f2.connect(gainNode);
noiseSource.connect(nf);
nf.connect(gainNode);
gainNode.connect(audioCtx.destination);

osc1.start(0);
osc1.stop(estimatedDuration);
noiseSource.start(0);
noiseSource.stop(estimatedDuration);

const rendered = await audioCtx.startRendering();
const channelData = rendered.getChannelData(0);

WAV Encoding

The encodeWAV Function

This function converts raw PCM audio samples to WAV file format:

App.jsx (lines 556-571)

function encodeWAV(samples, sampleRate) {
  const buf = new ArrayBuffer(44 + samples.length * 2);
  const v = new DataView(buf);
  const ws = (off, str) => {
    for (let i = 0; i < str.length; i++) {
      v.setUint8(off + i, str.charCodeAt(i));
    }
  };
  
  // RIFF header
  ws(0, 'RIFF');
  v.setUint32(4, 36 + samples.length * 2, true);
  ws(8, 'WAVE');
  
  // fmt chunk
  ws(12, 'fmt ');
  v.setUint32(16, 16, true);  // Subchunk1Size (16 for PCM)
  v.setUint16(20, 1, true);   // AudioFormat (1 = PCM)
  v.setUint16(22, 1, true);   // NumChannels (1 = mono)
  v.setUint32(24, sampleRate, true);
  v.setUint32(28, sampleRate * 2, true); // ByteRate
  v.setUint16(32, 2, true);   // BlockAlign
  v.setUint16(34, 16, true);  // BitsPerSample
  
  // data chunk
  ws(36, 'data');
  v.setUint32(40, samples.length * 2, true);
  
  // Write PCM samples
  let off = 44;
  for (let i = 0; i < samples.length; i++) {
    const s = Math.max(-1, Math.min(1, samples[i]));
    v.setInt16(off, s < 0 ? s * 32768 : s * 32767, true);
    off += 2;
  }
  
  return buf;
}

WAV File Structure

The WAV format consists of three main sections:

RIFF Header (12 bytes)

Offset  Size  Description
     4     "RIFF" chunk descriptor
     4     File size - 8 bytes
     4     "WAVE" format

ws(0, 'RIFF');
v.setUint32(4, 36 + samples.length * 2, true);
ws(8, 'WAVE');

fmt Chunk (24 bytes)

Offset  Size  Description
    4     "fmt " subchunk ID
    4     Subchunk size (16 for PCM)
    2     Audio format (1 = PCM)
    2     Number of channels (1 = mono)
    4     Sample rate (22050 Hz)
    4     Byte rate (sampleRate × channels × bitsPerSample / 8)
    2     Block align (channels × bitsPerSample / 8)
    2     Bits per sample (16)

ws(12, 'fmt ');
v.setUint32(16, 16, true);        // Subchunk1Size
v.setUint16(20, 1, true);         // AudioFormat (PCM)
v.setUint16(22, 1, true);         // NumChannels (mono)
v.setUint32(24, sampleRate, true); // Sample rate
v.setUint32(28, sampleRate * 2, true); // ByteRate
v.setUint16(32, 2, true);         // BlockAlign
v.setUint16(34, 16, true);        // BitsPerSample

data Chunk (8 bytes + audio data)

Offset  Size              Description
    4                 "data" subchunk ID
    4                 Data size
    samples.length×2  PCM audio data

ws(36, 'data');
v.setUint32(40, samples.length * 2, true);

// Write 16-bit PCM samples
let off = 44;
for (let i = 0; i < samples.length; i++) {
  const s = Math.max(-1, Math.min(1, samples[i]));
  v.setInt16(off, s < 0 ? s * 32768 : s * 32767, true);
  off += 2;
}

PCM Sample Conversion

The Web Audio API provides samples as 32-bit floats in the range [-1.0, 1.0]. These must be converted to 16-bit signed integers:

const s = Math.max(-1, Math.min(1, samples[i])); // Clamp to [-1, 1]
v.setInt16(off, s < 0 ? s * 32768 : s * 32767, true);

16-bit PCM range:

Minimum: -32768 (0x8000)
Maximum: 32767 (0x7FFF)
Zero: 0

The conversion multiplies by 32768 for negative values and 32767 for positive values to maximize dynamic range while preventing overflow.

File Download System

The downloadBlob Function

Creates a temporary download link and triggers the browser’s download:

App.jsx (lines 573-577)

function downloadBlob(blob, filename) {
  const url = URL.createObjectURL(blob);
  const a = document.createElement('a');
  a.href = url;
  a.download = filename;
  a.click();
  URL.revokeObjectURL(url);
}

Create object URL

URL.createObjectURL() creates a temporary URL pointing to the Blob data:

const url = URL.createObjectURL(blob);
// Returns: "blob:http://localhost:5173/uuid"

Create anchor element

Programmatically create an invisible <a> tag:

const a = document.createElement('a');
a.href = url;
a.download = filename;

Trigger download

Click the anchor to initiate download:

a.click();

Cleanup

Revoke the object URL to free memory:

URL.revokeObjectURL(url);

Always revoke object URLs after use to prevent memory leaks. Object URLs persist until the page is closed or explicitly revoked.

Download Handlers

Audio File Download

App.jsx (lines 726-729)

const handleDownload = async (item, format) => {
  try {
    await generateAndDownloadAudio(item, format);
    showToast(`📥 ${format.toUpperCase()} descargado`);
  } catch (e) {
    console.error(e);
    showToast('Error al generar el audio', 'error');
  }
};

WAV Format
MP3 Format

Uncompressed PCM audio:

File size: ~2.6 MB per minute (16-bit, 22.05 kHz, mono)
Quality: Lossless
Compatibility: Universal
Use case: Maximum quality, audio editing

handleDownload(item, 'wav');
// Downloads: vozcraft-1234567890.wav

VozCraft currently generates WAV files and sets the MIME type to audio/mpeg for MP3. True MP3 encoding would require a library like lame.js or a server-side encoder.

For actual MP3 encoding, consider:

lamejs: Client-side MP3 encoder
ffmpeg.wasm: WebAssembly FFmpeg build
Server-side encoding for production use

// Current implementation (WAV with MP3 MIME)
const mime = format === 'wav' ? 'audio/wav' : 'audio/mpeg';
downloadBlob(new Blob([wavBuffer], { type: mime }), `vozcraft-${item.id}.${format}`);

Transcript Download

VozCraft also allows downloading text transcripts:

App.jsx (lines 579-608)

function downloadTranscript(item, lang) {
  const fecha = new Date(item.timestamp);
  const fechaStr = fecha.toLocaleDateString(lang === 'es' ? 'es-MX' : 'en-US', { dateStyle: 'long' });
  const horaStr  = fecha.toLocaleTimeString(lang === 'es' ? 'es-MX' : 'en-US', { timeStyle: 'medium' });
  const nombre   = item.nombre || `Audio-${item.id}`;
  const sep = '═'.repeat(50);

  const content = [
    'VozCraft — Transcripción de Audio',
    sep,
    `Nombre:     ${nombre}`,
    `Fecha:      ${fechaStr}`,
    `Hora:       ${horaStr}`,
    `Idioma:     ${item.voz}`,
    `Género:     ${item.genero || '—'}`,
    `Velocidad:  ${item.velocidad}`,
    `Ánimo:      ${item.animo}`,
    sep,
    'TRANSCRIPCIÓN:',
    '',
    item.texto,
    '',
    sep,
    '© VozCraft · mateoRiosdev · 2026',
  ].join('\n');

  const safeName = nombre.replace(/[^a-zA-Z0-9_\-áéíóúñÁÉÍÓÚÑ ]/g, '').trim().replace(/ /g, '_') || `vozcraft-${item.id}`;
  downloadBlob(new Blob([content], { type: 'text/plain;charset=utf-8' }), `${safeName}.txt`);
}

Transcript files include metadata (date, voice, speed, mood) along with the text, making them useful for archiving and documentation.

Audio Processing Performance

Benchmarks

Typical processing times on modern hardware:

Text Length	Duration	Processing Time
50 chars	~4s	~150ms
200 chars	~14s	~300ms
1000 chars	~70s	~800ms
5000 chars	~350s	~2500ms

Processing happens asynchronously using OfflineAudioContext.startRendering(), which returns a Promise. The UI remains responsive during rendering.

Memory Usage

// Memory calculation for audio buffer
const sampleRate = 22050;
const duration = 60; // 60 seconds
const numSamples = duration * sampleRate;
const memoryBytes = numSamples * 4; // 32-bit floats
// = 5,292,000 bytes (~5.3 MB) for 1 minute of audio

Memory considerations:

Each minute of audio requires ~5.3 MB in memory during processing
Offline contexts hold audio data until garbage collected
Mobile browsers may have stricter memory limits
Consider chunking very long audio (>10 minutes)

Advanced Features

Dynamic Frequency Modulation

VozCraft applies pitch variation to create more natural-sounding speech:

for (let i = 0; i < estimatedDuration; i += 0.3) {
  const v = (Math.random() - 0.5) * baseFreq * 0.12;
  osc1.frequency.linearRampToValueAtTime(baseFreq + v, i + 0.15);
  osc1.frequency.linearRampToValueAtTime(baseFreq, i + 0.3);
}

This creates pitch variations of ±12% every 300ms, simulating natural speech prosody.

Syllable-Based Envelopes

const words = item.texto.split(' ');
const wDur = estimatedDuration / words.length;

words.forEach((w, wi) => {
  const t = wi * wDur;
  const syl = Math.max(1, Math.ceil(w.length / 3));
  for (let s = 0; s < syl; s++) {
    const st = t + s * (wDur / syl);
    gainNode.gain.linearRampToValueAtTime(0.18 * animData.volume, st + 0.02);
    gainNode.gain.linearRampToValueAtTime(0.05, st + wDur / syl - 0.02);
  }
});

Each word is divided into estimated syllables (word_length / 3), and each syllable gets an attack-decay envelope for a more rhythmic sound.

File Format Details

WAV Specifications

Format:      WAV (PCM)
Codec:       Linear PCM
Channels:    1 (mono)
Sample Rate: 22050 Hz
Bit Depth:   16-bit
Byte Order:  Little-endian
Compression: None

File Size Calculation

// WAV file size formula
const headerSize = 44; // bytes
const sampleSize = 2;  // bytes (16-bit)
const duration = 60;   // seconds
const sampleRate = 22050;

const fileSize = headerSize + (duration * sampleRate * sampleSize);
// = 44 + (60 × 22050 × 2) = 2,646,044 bytes (~2.52 MB)

Error Handling

const handleDownload = async (item, format) => {
  try {
    await generateAndDownloadAudio(item, format);
    showToast(`📥 ${format.toUpperCase()} descargado`);
  } catch (e) {
    console.error(e);
    showToast('Error al generar el audio', 'error');
  }
};

Common error scenarios:

Out of memory: Very long audio (>10 minutes)
Browser restrictions: iOS Safari has stricter limits
OfflineAudioContext limits: Maximum context length varies by browser

Best Practices

Validate input length

if (texto.length > 5000) {
  showToast('Text too long. Maximum 5000 characters.', 'error');
  return;
}

Use appropriate sample rates

22.05 kHz: Speech, podcasts (good balance)
44.1 kHz: Music, professional audio
8 kHz: Phone quality (not recommended)

Clamp audio samples

const s = Math.max(-1, Math.min(1, samples[i]));

Always clamp samples to prevent clipping and distortion.

Clean up resources

URL.revokeObjectURL(url);

Revoke object URLs to prevent memory leaks.

Browser Compatibility

Web Audio API Support:

✅ Chrome 35+
✅ Firefox 25+
✅ Safari 14.1+
✅ Edge 79+
✅ Opera 22+

OfflineAudioContext Support:

✅ All modern browsers
⚠️ Safari has lower maximum context lengths

Next Steps

Web Speech API

Learn about real-time speech synthesis

PWA Setup

Configure Progressive Web App features

Architecture

Development

Audio Processing & Encoding

Audio Processing & Encoding

Overview

Audio Generation Pipeline

The generateAndDownloadAudio Function

Audio Generation Steps

WAV Encoding

The encodeWAV Function

WAV File Structure

PCM Sample Conversion

File Download System

The downloadBlob Function

Download Handlers

Audio File Download

Transcript Download

Audio Processing Performance

Benchmarks

Memory Usage

Advanced Features

Dynamic Frequency Modulation

Syllable-Based Envelopes

File Format Details

WAV Specifications

File Size Calculation

Error Handling

Best Practices

Browser Compatibility

Next Steps

Web Speech API

PWA Setup

Build docs developers (and LLMs) love

Architecture

Development

​Audio Processing & Encoding

​Overview

​Audio Generation Pipeline

​The generateAndDownloadAudio Function

​Audio Generation Steps

​WAV Encoding

​The encodeWAV Function

​WAV File Structure

​PCM Sample Conversion

​File Download System

​The downloadBlob Function

​Download Handlers

​Audio File Download

​Transcript Download

​Audio Processing Performance

​Benchmarks

​Memory Usage

​Advanced Features

​Dynamic Frequency Modulation

​Syllable-Based Envelopes

​File Format Details

​WAV Specifications

​File Size Calculation

​Error Handling

​Best Practices

​Browser Compatibility

​Related Resources

​Next Steps

Web Speech API

PWA Setup

Build docs developers (and LLMs) love

Audio Processing & Encoding

Overview

Audio Generation Pipeline

The generateAndDownloadAudio Function

Audio Generation Steps

WAV Encoding

The encodeWAV Function

WAV File Structure

PCM Sample Conversion

File Download System

The downloadBlob Function

Download Handlers

Audio File Download

Transcript Download

Audio Processing Performance

Benchmarks

Memory Usage

Advanced Features

Dynamic Frequency Modulation

Syllable-Based Envelopes

File Format Details

WAV Specifications

File Size Calculation

Error Handling

Best Practices

Browser Compatibility

Related Resources

Next Steps