Status
Draft - This proposal is currently under review and discussion.
Abstract
MIP-004 proposes a standardized approach to data hashing and integrity verification within the Masumi ecosystem. This standard defines recommended hashing algorithms, content addressing schemes, and verification practices to ensure data authenticity, prevent tampering, and enable efficient content deduplication across the network.
Motivation
Data integrity is fundamental to a trustworthy agentic ecosystem:
- Verification: Ensure data hasn’t been tampered with during transmission or storage
- Content Addressing: Enable location-independent data retrieval
- Deduplication: Identify and eliminate redundant data storage
- Provenance: Track data origins and modifications
- Interoperability: Ensure consistent hashing across different services and tools
Standardized hashing enables trustless verification and efficient data management across the Masumi network.
Specification
Primary Hashing Algorithm
BLAKE3
Recommended: BLAKE3 is the primary hashing algorithm for the Masumi ecosystem.
Properties:
- Fast: Significantly faster than SHA-256 and SHA-3
- Secure: Cryptographically secure with no known vulnerabilities
- Parallelizable: Efficient on modern multi-core processors
- Verifiable: Supports incremental verification and tree structure
- Deterministic: Same input always produces same output
Use Cases:
- File integrity verification
- Content addressing
- Merkle tree construction
- Data deduplication
Output Format:
- 256-bit (32-byte) hash
- Hex encoding: 64 characters
- Base58 encoding option for compact representation
Secondary Algorithms
SHA-256
Legacy Support: SHA-256 for compatibility with existing systems.
Use Cases:
- Blockchain integrations requiring SHA-256
- Compatibility with Bitcoin and Ethereum ecosystems
- Legacy system integration
CID (Content Identifier)
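IPFS Compatibility: Support for IPFS CIDv1 format.
Structure: A CIDv1 consists of a multibase prefix followed by the CID version, a multicodec content-type code, and a multihash of the content.
Use Cases: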
- IPFS integration
- Distributed storage systems
- Content-addressed storage
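As a rough illustration of how a CIDv1 string is assembled, the sketch below builds one for raw bytes. It assumes SHA-256 as the hash (because it ships with Python's standard library) and base32 as the text encoding. The function name `cid_v1_raw_sha256` is hypothetical. The byte codes (0x01 for CIDv1, 0x55 for the raw codec, 0x12 for sha2-256) and the `b` multibase prefix follow the public multiformats tables.

```python
import base64
import hashlib

CID_VERSION_1 = 0x01   # CID version byte
CODEC_RAW = 0x55       # multicodec code for raw binary content
MH_SHA2_256 = 0x12     # multihash code for sha2-256


def cid_v1_raw_sha256(data: bytes) -> str:
    """Build a base32 CIDv1 string for raw bytes (illustrative sketch only)."""
    digest = hashlib.sha256(data).digest()
    # multihash = <hash code><digest length><digest>
    multihash = bytes([MH_SHA2_256, len(digest)]) + digest
    # CID bytes = <version><content codec><multihash>
    cid_bytes = bytes([CID_VERSION_1, CODEC_RAW]) + multihash
    # Multibase "b" means lowercase base32 without padding.
    b32 = base64.b32encode(cid_bytes).decode("ascii").lower().rstrip("=")
    return "b" + b32


if __name__ == "__main__":
    print(cid_v1_raw_sha256(b"hello masumi"))
```

All codes used here fit in a single byte, so the sketch skips the varint encoding that a general-purpose CID library would need.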
Hash Representation
Standard Formats
Hexadecimal
Default format for most use cases
- 64 characters (256 bits)
- Case-insensitive (prefer lowercase)
- Easy to debug and display
Base58
Compact representation for user-facing identifiers
- Shorter than hex (~44 characters)
- No ambiguous characters (0, O, I, l)
- Better for URLs and user input
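To make the size difference between the two encodings concrete, here is a minimal sketch that renders the same 32-byte digest as hex and as Base58. It assumes the Bitcoin Base58 alphabet, which the proposal does not name explicitly, and uses SHA-256 only to obtain a 256-bit digest from the standard library. The helper names are hypothetical.

```python
import hashlib

# Bitcoin-style Base58 alphabet (assumed): no 0, O, I, or l.
B58_ALPHABET = "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"


def base58_encode(data: bytes) -> str:
    """Encode bytes as Base58, mapping each leading zero byte to '1'."""
    n = int.from_bytes(data, "big")
    out = ""
    while n > 0:
        n, rem = divmod(n, 58)
        out = B58_ALPHABET[rem] + out
    pad = len(data) - len(data.lstrip(b"\x00"))
    return "1" * pad + out


def base58_decode(text: str) -> bytes:
    """Inverse of base58_encode."""
    n = 0
    for ch in text:
        n = n * 58 + B58_ALPHABET.index(ch)
    pad = len(text) - len(text.lstrip("1"))
    body = n.to_bytes((n.bit_length() + 7) // 8, "big")
    return b"\x00" * pad + body


if __name__ == "__main__":
    digest = hashlib.sha256(b"example payload").digest()
    print(len(digest.hex()), digest.hex())                       # 64 hex characters
    print(len(base58_encode(digest)), base58_encode(digest))     # at most 44 characters
```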
Multihash
Self-describing hash format
- Specifies which hash function was used
- Enables future algorithm upgrades
- Compatible with IPFS ecosystem
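The self-describing property can be sketched as follows: a multihash prefixes the digest with a hash-function code and the digest length, so a reader can tell which algorithm produced it. The example assumes the sha2-256 (0x12) and sha2-512 (0x13) codes from the public multihash table. The function names are hypothetical, and the sketch only handles codes and lengths below 128, where the varint encoding is a single byte.

```python
import hashlib

# Multihash function codes (from the public multicodec table).
MULTIHASH_CODES = {"sha2-256": 0x12, "sha2-512": 0x13}
# Mapping to hashlib algorithm names.
HASHLIB_NAMES = {"sha2-256": "sha256", "sha2-512": "sha512"}


def multihash_encode(name: str, data: bytes) -> bytes:
    """Return <code><length><digest> for the given algorithm name."""
    digest = hashlib.new(HASHLIB_NAMES[name], data).digest()
    code, length = MULTIHASH_CODES[name], len(digest)
    # Single-byte varints only; larger values need real varint encoding.
    assert code < 0x80 and length < 0x80
    return bytes([code, length]) + digest


def multihash_decode(mh: bytes) -> tuple[str, bytes]:
    """Return (algorithm name, digest) or raise ValueError."""
    if len(mh) < 2:
        raise ValueError("multihash too short")
    code, length, digest = mh[0], mh[1], mh[2:]
    if len(digest) != length:
        raise ValueError("digest length does not match header")
    for name, known_code in MULTIHASH_CODES.items():
        if known_code == code:
            return name, digest
    raise ValueError(f"unknown hash code 0x{code:02x}")


if __name__ == "__main__":
    mh = multihash_encode("sha2-256", b"abc")
    print(mh[:2].hex(), multihash_decode(mh)[0])
```

Because the algorithm travels with the digest, adding a new hash function later only requires registering a new code, which is what makes the algorithm upgrades mentioned above possible.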
Content Hashing Practices
File Hashing
Always hash the raw file contents, not metadata like timestamps or permissions.
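A minimal sketch of this rule: the helper below streams only the file's bytes through the hash, so renaming the file or changing its timestamps leaves the digest unchanged. It uses `hashlib` with SHA-256 because BLAKE3 is not in Python's standard library; a BLAKE3 binding would be swapped in for production use. The function name `hash_file` and the chunk size are illustrative choices.

```python
import hashlib


def hash_file(path: str, algorithm: str = "sha256", chunk_size: int = 1 << 20) -> str:
    """Hash a file's raw contents in fixed-size chunks and return the hex digest."""
    hasher = hashlib.new(algorithm)
    with open(path, "rb") as fh:  # binary mode: no newline or encoding translation
        while chunk := fh.read(chunk_size):
            hasher.update(chunk)
    return hasher.hexdigest()
```

Reading in chunks keeps memory use constant regardless of file size.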
Structured Data Hashing
For JSON and other structured data:
- Canonicalization: Use deterministic JSON serialization
- Normalization: Sort object keys alphabetically
- Consistency: Use consistent whitespace (prefer no whitespace)
- Encoding: Always use UTF-8 encoding
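The four rules above can be approximated with Python's standard `json` module, as in the sketch below. It is an assumption-laden simplification, not a full canonicalization scheme: number formatting and key ordering for characters outside the Basic Multilingual Plane may differ from a formal standard such as RFC 8785. SHA-256 stands in for the primary algorithm, and the function names are hypothetical.

```python
import hashlib
import json


def canonical_json(value) -> bytes:
    """Serialize with sorted keys, no insignificant whitespace, UTF-8 output."""
    text = json.dumps(
        value,
        sort_keys=True,          # normalization: deterministic key order
        separators=(",", ":"),   # consistency: no whitespace
        ensure_ascii=False,      # keep non-ASCII characters as-is
        allow_nan=False,         # reject NaN/Infinity, which are not valid JSON
    )
    return text.encode("utf-8")  # encoding: always UTF-8


def hash_json(value) -> str:
    """Hex digest of the canonical serialization."""
    return hashlib.sha256(canonical_json(value)).hexdigest()


if __name__ == "__main__":
    a = {"b": 1, "a": [2, 3]}
    b = {"a": [2, 3], "b": 1}
    print(canonical_json(a))
    print(hash_json(a) == hash_json(b))
```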
Merkle Trees
For large datasets and verification efficiency:
Merkle Tree Structure
Benefits:
- Efficient partial verification
- Incremental updates
- Proof of inclusion/exclusion
- Scalable for large datasets
Implementation:
- Use BLAKE3 for leaf and internal node hashing
- Store tree structure alongside root hash
- Support proof generation and verification
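The following sketch shows one possible way to build a tree, generate an inclusion proof, and verify it. Several choices here are assumptions rather than part of the proposal: SHA-256 stands in for BLAKE3 (which is not in the standard library), leaves and internal nodes are domain-separated with 0x00 and 0x01 prefixes, and an unpaired node is promoted to the next level unchanged. All function names are hypothetical.

```python
import hashlib
import hmac


def _h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()  # stand-in for BLAKE3


def leaf_hash(chunk: bytes) -> bytes:
    return _h(b"\x00" + chunk)            # prefix separates leaves from nodes


def node_hash(left: bytes, right: bytes) -> bytes:
    return _h(b"\x01" + left + right)


def build_levels(chunks: list[bytes]) -> list[list[bytes]]:
    """Return all tree levels, leaves first; levels[-1][0] is the root."""
    if not chunks:
        raise ValueError("at least one chunk is required")
    level = [leaf_hash(c) for c in chunks]
    levels = [level]
    while len(level) > 1:
        nxt = []
        for i in range(0, len(level), 2):
            if i + 1 < len(level):
                nxt.append(node_hash(level[i], level[i + 1]))
            else:
                nxt.append(level[i])      # promote the unpaired node
        level = nxt
        levels.append(level)
    return levels


def inclusion_proof(levels: list[list[bytes]], index: int) -> list[tuple[bytes, bool]]:
    """Collect (sibling hash, sibling_is_left) pairs from leaf to root."""
    proof = []
    for level in levels[:-1]:
        sibling = index ^ 1
        if sibling < len(level):
            proof.append((level[sibling], sibling < index))
        index //= 2
    return proof


def verify_inclusion(chunk: bytes, proof: list[tuple[bytes, bool]], root: bytes) -> bool:
    """Recompute the root from a chunk and its proof, then compare."""
    acc = leaf_hash(chunk)
    for sibling, sibling_is_left in proof:
        acc = node_hash(sibling, acc) if sibling_is_left else node_hash(acc, sibling)
    return hmac.compare_digest(acc, root)


if __name__ == "__main__":
    data = [b"a", b"b", b"c", b"d", b"e"]
    levels = build_levels(data)
    root = levels[-1][0]
    print(verify_inclusion(b"c", inclusion_proof(levels, 2), root))
```

A proof holds one hash per tree level, so its size grows logarithmically with the number of chunks. This sketch covers inclusion proofs only; exclusion proofs generally require a sorted or sparse tree layout.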
Hash Verification Protocol
Basic Verification
Verification Metadata
When transmitting hashed data, include:
Rationale
Design Decisions
Why BLAKE3?
- Performance: Up to roughly 10x faster than SHA-256 in software benchmarks, depending on hardware and input size; crucial for large-scale data processing.
- Security: Modern design with strong security guarantees and no known vulnerabilities.
- Parallelism: Native support for parallel processing, leveraging modern hardware.
- Tree Structure: Built-in Merkle tree support for efficient verification.
- Future-Proof: Designed with lessons learned from previous hash functions.
Why support multiple algorithms?
- Compatibility: Integration with existing systems (Bitcoin, Ethereum, IPFS)
- Migration: Smooth transition from legacy systems
- Flexibility: Different algorithms for different security/performance requirements
- Future-Proofing: Easy to add new algorithms if needed
Why standardize representation formats?
- Interoperability: Consistent hashes across different implementations
- User Experience: Appropriate formats for different contexts (debugging vs display)
- Efficiency: Compact formats reduce storage and bandwidth requirements
Why canonical JSON?
Different JSON serializations of the same data produce different hashes. Canonicalization ensures the same data always produces the same hash, critical for verification and deduplication.
Alternative Approaches Considered
- SHA-3: Secure but slower than BLAKE3, less parallelizable
- BLAKE2: Fast but BLAKE3 offers better performance and features
- Custom hash function: Maximum control but requires extensive security review
- Multiple primary algorithms: Adds complexity without clear benefits
Backwards Compatibility
Services using different hashing algorithms can continue operating:
- Support Period: 12-month transition period for adopting BLAKE3
- Multi-Algorithm Support: Services may support multiple algorithms during transition
- Hash Metadata: Always include algorithm identifier with hashes
- Migration Tools: Provide tools to rehash existing data
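One way the "Hash Metadata" and "Migration Tools" points could fit together is sketched below: each digest is stored with an algorithm identifier, verification dispatches on that identifier, and rehashing verifies the old digest before computing the new one. The record layout, the identifier strings, and the function names are all hypothetical. BLAKE2b-256 from the standard library is used purely as a stand-in for BLAKE3 so the example runs without third-party packages.

```python
import hashlib
import hmac

# Hypothetical algorithm identifiers mapped to hash constructors.
SUPPORTED = {
    "sha256": lambda data: hashlib.sha256(data),
    "blake2b-256": lambda data: hashlib.blake2b(data, digest_size=32),  # BLAKE3 stand-in
}


def make_record(data: bytes, algorithm: str) -> dict:
    """Attach the algorithm identifier to the digest."""
    return {"algorithm": algorithm, "digest": SUPPORTED[algorithm](data).hexdigest()}


def verify_record(data: bytes, record: dict) -> bool:
    """Verify data against a record, whatever supported algorithm it names."""
    factory = SUPPORTED.get(record.get("algorithm"))
    if factory is None:
        return False  # unknown algorithm: refuse rather than guess
    expected = factory(data).hexdigest()
    return hmac.compare_digest(expected, str(record.get("digest", "")).lower())


def rehash(data: bytes, record: dict, new_algorithm: str) -> dict:
    """Migrate a record to a new algorithm, but only if the old digest checks out."""
    if not verify_record(data, record):
        raise ValueError("existing digest does not match data; refusing to rehash")
    return make_record(data, new_algorithm)


if __name__ == "__main__":
    old = make_record(b"agent output", "sha256")
    new = rehash(b"agent output", old, "blake2b-256")
    print(old["algorithm"], "->", new["algorithm"], verify_record(b"agent output", new))
```

Verifying before rehashing avoids silently attaching a fresh digest to data that was already corrupted under the old scheme.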
Migration Strategy
Phase 1: Introduction (Months 1-3)
- Release specification and reference implementations
- Provide migration tools and documentation
- Services begin BLAKE3 implementation
Phase 2: Dual Support (Months 4-9)
- Services support both old and new algorithms
- New data uses BLAKE3
- Gradual rehashing of existing data
Phase 3: BLAKE3 Primary (Months 10-12)
- BLAKE3 becomes default
- Legacy algorithm support maintained for compatibility
- Clear migration path for remaining services
Phase 4: Full Adoption (Month 12+)
- BLAKE3 required for new services
- Legacy support maintained only for specific compatibility needs
Security Considerations
Hash Function Security
Collision Resistance
BLAKE3 provides strong collision resistance. With a 256-bit output, a generic birthday attack needs on the order of 2^128 hash evaluations, so finding collisions is computationally infeasible with current and foreseeable technology.
Preimage Resistance
Given a hash, it should be computationally infeasible to find input data that produces that hash. BLAKE3 provides strong preimage resistance.
Algorithm Agility
The multihash format allows for algorithm upgrades if vulnerabilities are discovered in the future, without breaking existing systems.
Implementation Security
Use well-tested cryptographic libraries. Do not implement hash functions from scratch.
Timing Attacks
Hash comparison should use constant-time comparison to prevent timing-based attacks when verifying sensitive hashes.
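In Python, a constant-time comparison is available as `hmac.compare_digest` in the standard library. The sketch below recomputes a digest and compares it that way instead of using `==`, which can return early at the first differing byte. SHA-256 is used as a stand-in hash, and the function name `verify_digest` is hypothetical.

```python
import hashlib
import hmac


def verify_digest(data: bytes, expected_hex: str) -> bool:
    """Recompute the digest and compare it to the expected value in constant time."""
    actual = hashlib.sha256(data).digest()
    try:
        expected = bytes.fromhex(expected_hex)  # accepts upper- and lowercase hex
    except ValueError:
        return False                            # malformed hex never verifies
    return hmac.compare_digest(actual, expected)
```

Comparing raw bytes rather than hex strings also sidesteps case differences between implementations.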
Random Number Generation
If hashing is used with salts or nonces, use cryptographically secure random number generators.
Key Derivation
For deriving encryption keys from data, use proper key derivation functions (KDFs) like HKDF, not raw hashing.
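To show what HKDF adds over a raw hash, the sketch below follows the extract-then-expand construction from RFC 5869 using only `hmac` and `hashlib`. It is meant for illustration; in practice a vetted library implementation (for example the HKDF class in the `cryptography` package) would be preferable. The function name and the example `info` labels are hypothetical.

```python
import hashlib
import hmac

HASH_LEN = 32  # SHA-256 output size in bytes


def hkdf_sha256(ikm: bytes, salt: bytes, info: bytes, length: int) -> bytes:
    """HKDF (RFC 5869) with HMAC-SHA-256: extract a PRK, then expand it."""
    if not 0 < length <= 255 * HASH_LEN:
        raise ValueError("requested length is out of range")
    # Extract: concentrate the input keying material into a pseudorandom key.
    prk = hmac.new(salt or b"\x00" * HASH_LEN, ikm, hashlib.sha256).digest()
    # Expand: chain HMAC blocks, binding each to the context string `info`.
    okm, block, counter = b"", b"", 1
    while len(okm) < length:
        block = hmac.new(prk, block + info + bytes([counter]), hashlib.sha256).digest()
        okm += block
        counter += 1
    return okm[:length]


if __name__ == "__main__":
    secret = b"shared secret material"
    enc_key = hkdf_sha256(secret, b"per-session salt", b"example/encryption", 32)
    mac_key = hkdf_sha256(secret, b"per-session salt", b"example/authentication", 32)
    print(enc_key != mac_key)
```

The `info` parameter lets several independent keys be derived from one secret, which a single raw hash of that secret cannot do safely.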
Operational Security
- Hash Storage: Store hashes securely; they can reveal data presence
- Transmission: Always transmit hashes over secure channels
- Verification: Always verify hashes before using data
- Updates: Monitor for security updates to hash function implementations
Implementation
Reference Implementations
Official reference implementations are provided for:
- JavaScript/TypeScript: @masumi/hash-utils
- Python: masumi-hash
- Rust: masumi-hash-rs
- Go: masumi-hash-go
Library Requirements
Compliant implementations must:
- Support BLAKE3 as primary algorithm
- Support SHA-256 for compatibility
- Provide hex and base58 encoding
- Support multihash format
- Include verification utilities
- Provide Merkle tree construction and verification
- Use constant-time comparison for hash verification
Testing and Validation
Test Vectors
Standardized test vectors provided for:
- Empty input
- Single-byte inputs
- Common string inputs
- Large binary data
- Structured JSON data
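A test-vector check might look like the sketch below. The proposal's own vectors are not reproduced here; the two entries are the widely published SHA-256 digests of the empty input and of `abc`, used only to show the shape of such a harness. The `TEST_VECTORS` layout and the `run_vectors` function are hypothetical.

```python
import hashlib

# (description, input bytes, expected hex digest): well-known SHA-256 values.
TEST_VECTORS = [
    ("empty input", b"",
     "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855"),
    ("common string", b"abc",
     "ba7816bf8f01cfea414140de5dae2223b00361a396177a9cb410ff61f20015ad"),
]


def run_vectors(hash_fn, vectors) -> list[str]:
    """Return the descriptions of all vectors the implementation gets wrong."""
    failures = []
    for name, data, expected in vectors:
        if hash_fn(data) != expected:
            failures.append(name)
    return failures


if __name__ == "__main__":
    failed = run_vectors(lambda d: hashlib.sha256(d).hexdigest(), TEST_VECTORS)
    print("all vectors passed" if not failed else f"failed: {failed}")
```

Running the same vector list against every implementation gives a simple first check of interoperability.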
Compliance Testing
- Automated test suite for implementation validation
- Performance benchmarks
- Security testing guidelines
- Interoperability tests between implementations
Integration Examples
This MIP is currently in draft status. Join the discussion on the MIP repository.