
Overview

The StatisticsCalculator class provides comprehensive statistical analysis for tokenized text: token, character, and word counts, cost estimates, context utilization, and model comparison capabilities.
This calculator works with data from 48 AI models and provides accurate cost estimates based on current pricing.

Constructor

Creates a new StatisticsCalculator instance.
const calculator = new StatisticsCalculator();
The calculator is stateless and can be reused for multiple calculations.

Methods

calculateStatistics()

Calculates comprehensive statistics for the given text and model.
calculateStatistics(text, tokenResult, modelId)
Parameters:
  • text (string, required): The original input text
  • tokenResult (Object, required): Result object from TokenizationService.tokenizeText()
  • modelId (string, required): Model identifier (e.g., "gpt-4o", "claude-3.5-sonnet")
Returns (Object): Comprehensive statistics object
Return value structure:
  • tokenCount (number): Total number of tokens
  • charCount (number): Total number of characters
  • wordCount (number): Total number of words
  • costEstimate (number): Estimated cost in USD for input tokens
  • contextUtilization (number): Percentage of context window used (0-100)
  • tokensPerWord (number): Average tokens per word ratio
  • inputCostPer1M (number): Cost per 1M input tokens in USD
  • outputCostPer1M (number): Cost per 1M output tokens in USD
const calculator = new StatisticsCalculator();
const tokenizer = new TokenizationService();

const text = "Hello world! This is a test.";
const tokenResult = await tokenizer.tokenizeText(text, 'gpt-4o');

const stats = calculator.calculateStatistics(text, tokenResult, 'gpt-4o');

console.log(stats);
// {
//   tokenCount: 8,
//   charCount: 28,
//   wordCount: 6,
//   costEstimate: 0.00002,
//   contextUtilization: 0.00625,
//   tokensPerWord: 1.33,
//   inputCostPer1M: 2.50,
//   outputCostPer1M: 10.00
// }

countWords()

Counts words in text using whitespace-based word boundary detection.
countWords(text)
Parameters:
  • text (string, required): Text to analyze
Returns (number): Number of words (0 for empty text)
Algorithm:
  1. Trims whitespace from text
  2. Splits on whitespace characters (\s+)
  3. Filters out empty strings
  4. Returns count
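The steps above can be sketched directly in JavaScript (a minimal illustration of the documented algorithm, not the library source):

```javascript
// Trim, split on runs of whitespace, drop empty strings, count.
function countWords(text) {
  const trimmed = text.trim();
  if (trimmed === '') return 0; // empty or whitespace-only text yields 0
  return trimmed.split(/\s+/).filter(word => word.length > 0).length;
}
```

Because the split is purely whitespace-based, hyphenated compounds count as a single word, matching the examples below.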
const calculator = new StatisticsCalculator();

console.log(calculator.countWords('Hello world'));
// 2

console.log(calculator.countWords('  Multiple   spaces   between  '));
// 3

console.log(calculator.countWords(''));
// 0

console.log(calculator.countWords('One-hyphenated-word'));
// 1

calculateCost()

Calculates estimated cost based on token count and model pricing.
calculateCost(tokenCount, modelInfo)
Parameters:
  • tokenCount (number, required): Number of tokens
  • modelInfo (Object, required): Model information object from MODELS_DATA
Returns (number): Estimated cost in USD
Cost calculation formula:
cost = (tokenCount / 1,000,000) × inputCostPer1M
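The formula translates to a one-line sketch (assuming modelInfo carries an inputCostPer1M field in USD per million input tokens, as the statistics structure above suggests):

```javascript
// Pro-rate the per-million price by the actual token count.
function calculateCost(tokenCount, modelInfo) {
  return (tokenCount / 1_000_000) * modelInfo.inputCostPer1M;
}
```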
const calculator = new StatisticsCalculator();
const modelInfo = MODELS_DATA['gpt-4o'];

// Calculate cost for 1,000 tokens
const cost1k = calculator.calculateCost(1000, modelInfo);
console.log(`1K tokens: $${cost1k.toFixed(6)}`);
// 1K tokens: $0.002500

// Calculate cost for 1,000,000 tokens
const cost1m = calculator.calculateCost(1000000, modelInfo);
console.log(`1M tokens: $${cost1m.toFixed(2)}`);
// 1M tokens: $2.50
Cost estimates are based on input token pricing. Output tokens typically cost more.

calculateContextUtilization()

Calculates the percentage of the model’s context window being used.
calculateContextUtilization(tokenCount, contextLimit)
Parameters:
  • tokenCount (number, required): Number of tokens in the text
  • contextLimit (number, required): Maximum context window size for the model
Returns (number): Percentage from 0 to 100 (capped at 100)
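A minimal sketch of this calculation; the cap at 100 matches the documented return range:

```javascript
// Percentage of the context window used, capped at 100.
function calculateContextUtilization(tokenCount, contextLimit) {
  return Math.min((tokenCount / contextLimit) * 100, 100);
}
```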
const calculator = new StatisticsCalculator();

// GPT-4o has 128K context
console.log(calculator.calculateContextUtilization(1000, 128000));
// 0.78 (less than 1%)

console.log(calculator.calculateContextUtilization(64000, 128000));
// 50.0 (half the context)

console.log(calculator.calculateContextUtilization(128000, 128000));
// 100.0 (full context)

console.log(calculator.calculateContextUtilization(150000, 128000));
// 100.0 (capped at 100%; actual usage exceeds the window)

exceedsContextLimit()

Checks if token count exceeds the model’s context limit.
exceedsContextLimit(tokenCount, modelId)
Parameters:
  • tokenCount (number, required): Number of tokens
  • modelId (string, required): Model identifier
Returns (boolean): true if the count exceeds the limit, false otherwise
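A hedged sketch of the check; the MODELS_DATA stub and its contextWindow field are hypothetical stand-ins for however the library stores per-model context limits:

```javascript
// Hypothetical per-model data; the real MODELS_DATA shape may differ.
const MODELS_DATA = {
  'gpt-4o': { contextWindow: 128000 },
};

function exceedsContextLimit(tokenCount, modelId, modelsData = MODELS_DATA) {
  const model = modelsData[modelId];
  if (!model) return false; // unknown model: no limit information available
  return tokenCount > model.contextWindow;
}
```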
const calculator = new StatisticsCalculator();

// GPT-4o has 128K context limit
console.log(calculator.exceedsContextLimit(100000, 'gpt-4o'));
// false

console.log(calculator.exceedsContextLimit(150000, 'gpt-4o'));
// true

// GPT-3.5 has 16K context limit
console.log(calculator.exceedsContextLimit(20000, 'gpt-3.5-turbo'));
// true

getContextWarning()

Returns a warning message if context usage is high or exceeded.
getContextWarning(tokenCount, modelId)
Parameters:
  • tokenCount (number, required): Number of tokens
  • modelId (string, required): Model identifier
Returns (string|null): Warning message, or null if no warning is needed
Warning messages, from lowest to highest severity (strings are returned in Spanish):
  • "ℹ️ Alto uso del contexto (…% utilizado)": high context usage.
  • "⚠️ Cerca del límite de contexto (…% utilizado)": near the context limit.
  • "⚠️ Texto excede el límite de contexto del modelo (128,000 tokens)": text exceeds the model's maximum context window.
const calculator = new StatisticsCalculator();

// GPT-4o: 128K context
console.log(calculator.getContextWarning(50000, 'gpt-4o'));
// null (39% usage)

console.log(calculator.getContextWarning(100000, 'gpt-4o'));
// "ℹ️ Alto uso del contexto (78.1% utilizado)"

console.log(calculator.getContextWarning(120000, 'gpt-4o'));
// "⚠️ Cerca del límite de contexto (93.8% utilizado)"

console.log(calculator.getContextWarning(150000, 'gpt-4o'));
// "⚠️ Texto excede el límite de contexto del modelo (128,000 tokens)"

formatStatistics()

Formats statistics for display with proper localization and units.
formatStatistics(stats)
Parameters:
  • stats (Object, required): Raw statistics object from calculateStatistics()
Returns (Object): Formatted statistics with string values
const calculator = new StatisticsCalculator();

const rawStats = {
  tokenCount: 15847,
  charCount: 72456,
  wordCount: 11234,
  costEstimate: 0.03961175,
  contextUtilization: 12.380469,
  tokensPerWord: 1.410987,
  inputCostPer1M: 2.50,
  outputCostPer1M: 10.00
};

const formatted = calculator.formatStatistics(rawStats);

console.log(formatted);
// {
//   tokenCount: "15,847",
//   charCount: "72,456",
//   wordCount: "11,234",
//   costEstimate: "$0.039612",
//   contextUtilization: "12.4%",
//   tokensPerWord: "1.41",
//   inputCostPer1M: "$2.50/1M",
//   outputCostPer1M: "$10.00/1M"
// }
Use the formatted statistics when displaying values in the UI: they include thousands separators, currency symbols, and percentage signs.
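A plausible sketch of this formatting, inferred from the example output above; the exact locale and precision the library uses are assumptions:

```javascript
// Format raw numeric statistics into display strings.
function formatStatistics(stats) {
  const n = value => value.toLocaleString('en-US'); // thousands separators
  return {
    tokenCount: n(stats.tokenCount),
    charCount: n(stats.charCount),
    wordCount: n(stats.wordCount),
    costEstimate: `$${stats.costEstimate.toFixed(6)}`,
    contextUtilization: `${stats.contextUtilization.toFixed(1)}%`,
    tokensPerWord: stats.tokensPerWord.toFixed(2),
    inputCostPer1M: `$${stats.inputCostPer1M.toFixed(2)}/1M`,
    outputCostPer1M: `$${stats.outputCostPer1M.toFixed(2)}/1M`,
  };
}
```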

compareModels()

Compares tokenization statistics across multiple models.
async compareModels(text, modelIds, tokenizationService)
Parameters:
  • text (string, required): Text to analyze
  • modelIds (string[], required): Array of model IDs to compare
  • tokenizationService (TokenizationService, required): Tokenization service instance
Returns (Promise<Array>): Array of comparison objects sorted by cost (cheapest first)
Comparison object structure:
  • modelId (string): Model identifier
  • company (string): Model provider (e.g., "OpenAI", "Anthropic")
  • stats (Object): Raw statistics object
  • formatted (Object): Formatted statistics for display
const calculator = new StatisticsCalculator();
const tokenizer = new TokenizationService();
await tokenizer.waitForInitialization();

const text = "Long text for comparison...";

const comparison = await calculator.compareModels(
  text,
  ['gpt-4o', 'claude-3.5-sonnet', 'llama-3.1-70b'],
  tokenizer
);

comparison.forEach(result => {
  console.log(`${result.modelId} (${result.company}):`);
  console.log(`  Tokens: ${result.formatted.tokenCount}`);
  console.log(`  Cost: ${result.formatted.costEstimate}`);
});
Comparison results are automatically sorted by cost estimate, making it easy to find the most economical model for your text.

getEfficiencyMetrics()

Calculates efficiency metrics for tokenization analysis.
getEfficiencyMetrics(stats)
Parameters:
  • stats (Object, required): Statistics object from calculateStatistics()
Returns (Object): Efficiency metrics object
Return value structure:
  • costEfficiency (number): Cost per million tokens in USD (lower is better)
  • compressionRatio (number): Tokens per character (lower = better compression)
  • verbosityIndex (number): Tokens per word (lower = more efficient encoding)
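A sketch consistent with the example values below; note that the cost figure works out to USD per million tokens, which is an inference from that example rather than a documented contract:

```javascript
// Derive efficiency ratios from a raw statistics object.
function getEfficiencyMetrics(stats) {
  return {
    costEfficiency: (stats.costEstimate / stats.tokenCount) * 1_000_000,
    compressionRatio: stats.tokenCount / stats.charCount,
    verbosityIndex: stats.tokenCount / stats.wordCount,
  };
}
```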
const calculator = new StatisticsCalculator();

const stats = {
  tokenCount: 1000,
  charCount: 4500,
  wordCount: 750,
  costEstimate: 0.0025
};

const metrics = calculator.getEfficiencyMetrics(stats);

console.log(metrics);
// {
//   costEfficiency: 2.5,        // $2.50 per 1M tokens
//   compressionRatio: 0.222,    // 0.22 tokens per character
//   verbosityIndex: 1.333       // 1.33 tokens per word
// }

Usage Examples

const calculator = new StatisticsCalculator();
const tokenizer = new TokenizationService();
await tokenizer.waitForInitialization();

const text = `
  This is a sample text for comprehensive tokenization analysis.
  We'll analyze tokens, costs, and efficiency metrics.
`;

const modelId = 'gpt-4o';

// Tokenize
const tokenResult = await tokenizer.tokenizeText(text, modelId);

// Calculate statistics
const stats = calculator.calculateStatistics(text, tokenResult, modelId);

// Get formatted display values
const formatted = calculator.formatStatistics(stats);

// Check for warnings
const warning = calculator.getContextWarning(stats.tokenCount, modelId);

// Get efficiency metrics
const efficiency = calculator.getEfficiencyMetrics(stats);

console.log('Statistics:', formatted);
if (warning) console.log('Warning:', warning);
console.log('Efficiency:', efficiency);

Statistics Interpretation

Token Count

The total number of tokens the text is divided into. This directly impacts:
  • API costs (priced per token)
  • Processing time
  • Context window usage
Typical ranges:
  • Short prompt: 10-100 tokens
  • Medium text: 100-1,000 tokens
  • Long document: 1,000-10,000+ tokens

Character Count

Total number of characters, including spaces and punctuation. Rule of thumb: English text averages ~4 characters per token.

Word Count

Number of words (whitespace-separated). Rule of thumb: English text averages ~1.3 tokens per word.

Cost Estimate

Estimated API cost for processing the text. Note: based on input pricing; output tokens cost more. Cost ranges (GPT-4o):
  • 1K tokens: ~$0.0025
  • 10K tokens: ~$0.025
  • 100K tokens: ~$0.25

Context Utilization

Percentage of the model's context window being used. Guidelines:
  • Less than 50%: Comfortable usage
  • 50-75%: Moderate usage
  • 75-90%: High usage
  • 90-100%: Near limit
  • Greater than 100%: Exceeds limit (requests will fail)

Tokens per Word

Average number of tokens per word. Typical values:
  • English: 1.3-1.5
  • Code: 1.5-2.0
  • Non-English: varies by language
Lower values indicate more efficient tokenization.
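These rules of thumb can be turned into quick back-of-envelope estimators (using common English-text averages of ~4 characters per token and ~1.3 tokens per word; actual counts depend on the tokenizer and content):

```javascript
// Rough heuristics only; real counts come from the tokenizer.
const estimateTokensFromChars = charCount => Math.round(charCount / 4);
const estimateTokensFromWords = wordCount => Math.round(wordCount * 1.33);
```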

Cost Optimization Tips

Choose Efficient Models

Compare models to find the best token-to-cost ratio for your use case

Minimize Prompt Length

Remove unnecessary context and instructions to reduce token count

Use Smaller Models

Consider mini variants (e.g., gpt-4o-mini) for simpler tasks

Batch Requests

Process multiple items in one request to reduce per-request overhead

See Also

TokenAnalyzer

Main application orchestrator

TokenizationService

Tokenization engine

UIController

UI management

Supported Models

View all model pricing
