Generation Overview
You need to write code to generate three CSV files that will be loaded into HDFS for MapReduce analytics.

| Dataset | Records | File Size Est. | Complexity |
|---|---|---|---|
| CircleNetPage | 200,000 | ~20-30 MB | Low |
| Follows | 20,000,000 | ~1.5-2 GB | Medium |
| ActivityLog | 10,000,000 | ~700 MB - 1 GB | High |
Design Goal: Create a scalable solution that can generate large datasets efficiently without running out of memory.
General Requirements
File Format Rules
Data Quality
- Realistic Values: Use creative, plausible data
  - NickNames should sound like actual usernames
  - JobTitles should be real occupations
  - FavoriteHobbies should be genuine activities
  - Relationship descriptions should make sense
  - Action types should reflect actual social media behavior
- Referential Integrity: All foreign keys must be valid
  - Follows.ID1 and ID2 must exist in CircleNetPage (1-200,000)
  - ActivityLog.ByWho and WhatPage must exist in CircleNetPage (1-200,000)
- Constraints: Follow all schema rules
  - See individual dataset pages for detailed constraints
CircleNetPage Generation
Scale: 200,000 users
Key Considerations
- Uniqueness: Each ID from 1 to 200,000 exactly once
- No Commas: Strip or replace commas in generated strings
- Variety: Create diverse hobbies and job titles for interesting analytics
- RegionCode Distribution: Distribute users across all 50 regions
Consider using lists of common hobbies and job titles to ensure realism rather than random character generation.
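As a minimal sketch of the considerations above, the generator below draws hobbies and job titles from pre-built lists and replaces commas before a row is emitted. The value pools, the column order (ID, NickName, JobTitle, RegionCode, FavoriteHobby), the RegionCode range of 1 to 50, and the helper names are assumptions for illustration; check them against the CircleNetPage schema page.

```python
import csv
import random

# Hypothetical value pools; a real run would use much longer lists.
HOBBIES = ["rock climbing", "chess", "baking", "bird watching", "photography"]
JOB_TITLES = ["software engineer", "nurse", "electrician", "teacher", "accountant"]
NAME_PARTS = ["sky", "pixel", "nova", "river", "echo", "maple"]

NUM_REGIONS = 50  # assumption: RegionCode is an integer from 1 to 50


def clean(value: str) -> str:
    """Replace commas so a value cannot break the CSV column layout."""
    return value.replace(",", " ")


def circle_net_pages(num_users: int, seed: int = 42):
    """Yield one row per user, with IDs 1..num_users each used exactly once."""
    rng = random.Random(seed)
    for user_id in range(1, num_users + 1):
        nickname = f"{rng.choice(NAME_PARTS)}_{rng.choice(NAME_PARTS)}{rng.randint(1, 9999)}"
        # Cycling through the regions guarantees every region is used.
        region = (user_id - 1) % NUM_REGIONS + 1
        yield (
            user_id,
            clean(nickname),
            clean(rng.choice(JOB_TITLES)),
            region,
            clean(rng.choice(HOBBIES)),
        )


def write_circle_net_pages(path: str, num_users: int = 200_000) -> None:
    """Stream the rows to disk; no header row is written (an assumption)."""
    with open(path, "w", newline="") as f:
        csv.writer(f, lineterminator="\n").writerows(circle_net_pages(num_users))
```

Because `circle_net_pages` is a generator, `write_circle_net_pages` never holds more than one row in memory at a time.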
Follows Generation
Scale: 20,000,000 relationships
Key Considerations
- Self-Follow Prevention: ID1 must never equal ID2
- One-Directional: (ID1 → ID2) is independent of (ID2 → ID1)
- Scale: 20M records means ~100 average follows per user
- Description Variety: Create diverse relationship types
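One way to satisfy the self-follow rule without a retry loop is sketched below: draw ID2 from the remaining IDs and shift it past ID1. The description pool and the function names are hypothetical, only three columns are produced (other Follows columns from the schema page are omitted), and the sketch does not remove duplicate pairs.

```python
import random

NUM_USERS = 200_000

# Hypothetical relationship descriptions; extend for more variety.
DESCRIPTIONS = ["college friend", "coworker", "family member", "gym buddy"]


def random_follow_pair(rng: random.Random, num_users: int = NUM_USERS):
    """Return (id1, id2) with id1 != id2, both in 1..num_users."""
    id1 = rng.randint(1, num_users)
    # Pick one of the num_users - 1 other IDs, then skip over id1.
    id2 = rng.randint(1, num_users - 1)
    if id2 >= id1:
        id2 += 1
    return id1, id2


def follow_rows(count: int, num_users: int = NUM_USERS, seed: int = 0):
    """Yield (ID1, ID2, description) rows one at a time."""
    rng = random.Random(seed)
    for _ in range(count):
        id1, id2 = random_follow_pair(rng, num_users)
        yield id1, id2, rng.choice(DESCRIPTIONS)
```

Since (ID1 → ID2) is independent of (ID2 → ID1), the sketch makes no attempt to avoid or to create reciprocal pairs.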
Realistic Distribution
Consider creating a power-law distribution where:
- Some users are very popular (thousands of followers)
- Most users have moderate followers (50-200)
- Some users have few or no followers
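A skewed follower distribution like the one described above could be approximated with Zipf-style weights, as in this sketch. The function name `popularity_sampler` and the `alpha` exponent are assumptions: `alpha` is a tuning knob (larger values concentrate more followers on fewer users), and its value should be checked against the follower counts you actually want.

```python
import itertools
import random


def popularity_sampler(num_users: int, alpha: float = 0.5, seed: int = 7):
    """Return a function that samples followed-user IDs with power-law weights."""
    rng = random.Random(seed)
    ids = list(range(1, num_users + 1))
    rng.shuffle(ids)  # so popularity is not tied to ID order
    # The user at popularity rank r gets weight 1 / r**alpha.
    weights = [1.0 / (rank ** alpha) for rank in range(1, num_users + 1)]
    cum_weights = list(itertools.accumulate(weights))

    def sample(k: int):
        """Draw k IDs to be followed (the ID2 side of a relationship)."""
        return rng.choices(ids, cum_weights=cum_weights, k=k)

    return sample
```

Drawing in batches, for example `sample(100_000)` per call, keeps memory bounded while avoiding the overhead of one call per record. The follower side (ID1) can still be drawn uniformly, with a redraw whenever it equals the sampled ID2.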
ActivityLog Generation
Scale: 10,000,000 actions
Key Considerations
- View-First Constraint: Track which (ByWho, WhatPage) pairs have had view actions so that other actions are only generated for pages the user has already viewed
- Action Type Variety: Mix views, comments, likes, pokes, etc.
- Temporal Distribution: Spread actions across time range
- Self-Interaction: Users CAN access their own pages
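The sketch below shows one way to track viewed pairs. It assumes the View-First constraint means a user must have viewed a page before any other action on it is logged; confirm this against the ActivityLog schema page. The action strings come from the examples in this guide, while `follow_up_rate` and the function name are hypothetical.

```python
import random

VIEW_ACTIONS = ["viewed profile page", "viewed recent posts feed"]
OTHER_ACTIONS = [
    "liked profile photo and cover banner",
    "poked user playfully to say hello",
]


def activity_events(num_events: int, num_users: int,
                    follow_up_rate: float = 0.6, seed: int = 1):
    """Yield (ByWho, WhatPage, action) so that every non-view follows a view."""
    rng = random.Random(seed)
    viewed = set()    # O(1) average membership test
    viewed_list = []  # O(1) random pick of an already-viewed pair
    for _ in range(num_events):
        if viewed_list and rng.random() < follow_up_rate:
            # Reuse a pair that already has a view, so the action is allowed.
            by_who, what_page = rng.choice(viewed_list)
            action = rng.choice(OTHER_ACTIONS)
        else:
            by_who = rng.randint(1, num_users)
            what_page = rng.randint(1, num_users)  # may equal by_who
            pair = (by_who, what_page)
            if pair not in viewed:
                viewed.add(pair)
                viewed_list.append(pair)
            action = rng.choice(VIEW_ACTIONS)
        yield by_who, what_page, action
```

Events are emitted in an order that respects the constraint, so assigning non-decreasing timestamps in emission order would keep each view ahead of its follow-up actions in time. At full scale the set of viewed pairs can grow to millions of entries, so its memory use is worth measuring.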
Action Type Examples
Views (20-50 chars):
- “viewed profile page”
- “viewed photos section and albums”
- “viewed recent posts feed”
Other actions:
- “left a comment on recent post about vacation”
- “poked user playfully to say hello”
- “liked profile photo and cover banner”
- “sent friend request with nice message”
- “shared post to own timeline for friends”
Implementation Strategies
Memory-Efficient Generation
Best Practice: Write records incrementally rather than storing all in memory.
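A minimal sketch of incremental writing follows. It consumes any iterable of rows (ideally a generator), writes through an explicitly sized buffer, and prints a progress line at intervals. The function name, the 1 MiB buffer, and the reporting interval are assumptions, not tuned values.

```python
import csv
import sys


def write_rows(path, rows, total=None,
               buffer_size=1024 * 1024, report_every=1_000_000):
    """Stream rows to a CSV file and return the number of rows written."""
    written = 0
    # buffering=buffer_size sets the size of the underlying write buffer.
    with open(path, "w", newline="", buffering=buffer_size) as f:
        writer = csv.writer(f, lineterminator="\n")
        for row in rows:  # only one row is held in memory at a time
            writer.writerow(row)
            written += 1
            if written % report_every == 0:
                suffix = f" of {total:,}" if total else ""
                print(f"{written:,}{suffix} rows written", file=sys.stderr)
    return written
```

For example, `write_rows("Follows.csv", some_row_generator, total=20_000_000)` would report progress every million rows, where `some_row_generator` stands for whichever generator produces the Follows records.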
Buffered Writing
For optimal performance with large files, use buffered writes.
Progress Tracking
For long-running generation jobs, add progress indicators.
Validation
After generation, validate your datasets:
Record Counts
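A record-count check could look like the sketch below. The file names are assumptions, and the count assumes one record per line with no header row.

```python
import os

# Expected counts from the overview table; the file names are hypothetical.
EXPECTED_COUNTS = {
    "CircleNetPage.csv": 200_000,
    "Follows.csv": 20_000_000,
    "ActivityLog.csv": 10_000_000,
}


def count_records(path):
    """Count lines without loading the file into memory."""
    with open(path, "rb") as f:
        return sum(1 for _ in f)


def check_counts(directory, expected=EXPECTED_COUNTS):
    """Return {file_name: (expected, actual)} for every file whose count is off."""
    mismatches = {}
    for name, want in expected.items():
        actual = count_records(os.path.join(directory, name))
        if actual != want:
            mismatches[name] = (want, actual)
    return mismatches
```

An empty dictionary from `check_counts` means every file has the expected number of records.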
Format Validation
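Because values must not contain commas, a plain split on commas is enough to check the column count of each line. In this sketch the expected number of fields is a parameter, since the exact column lists live on the dataset pages; the rule that no field may be empty is an assumption.

```python
def find_format_errors(path, expected_fields, max_errors=10):
    """Return up to max_errors (line_number, message) tuples."""
    errors = []
    with open(path, encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            fields = line.rstrip("\n").split(",")
            if len(fields) != expected_fields:
                errors.append(
                    (line_no, f"expected {expected_fields} fields, found {len(fields)}")
                )
            elif any(field == "" for field in fields):
                # Assumption: the schema does not allow empty values.
                errors.append((line_no, "empty field"))
            if len(errors) >= max_errors:
                break  # stop early instead of collecting millions of errors
    return errors
```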
Constraint Checks
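The two checks sketched here cover rules stated in this guide: every CircleNetPage ID appears exactly once, and every Follows pair is in range with ID1 different from ID2. Column positions are passed as parameters because the exact layouts are defined on the dataset pages; the function names are hypothetical.

```python
def page_ids_are_unique(path, id_col=0, num_users=200_000):
    """True if the IDs 1..num_users each appear exactly once."""
    seen = bytearray(num_users + 1)  # one byte per possible ID
    with open(path, encoding="utf-8") as f:
        for line in f:
            user_id = int(line.rstrip("\n").split(",")[id_col])
            if not 1 <= user_id <= num_users or seen[user_id]:
                return False  # out of range or duplicate
            seen[user_id] = 1
    return sum(seen) == num_users


def count_follow_violations(path, id1_col, id2_col, num_users=200_000):
    """Count rows with an out-of-range ID or a self-follow."""
    violations = 0
    with open(path, encoding="utf-8") as f:
        for line in f:
            fields = line.rstrip("\n").split(",")
            id1, id2 = int(fields[id1_col]), int(fields[id2_col])
            in_range = 1 <= id1 <= num_users and 1 <= id2 <= num_users
            if not in_range or id1 == id2:
                violations += 1
    return violations
```

The same range check applies to ActivityLog.ByWho and WhatPage, except that equal values are allowed there because users can access their own pages.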
Performance Tips
- Use Appropriate Data Structures
  - Sets for tracking viewed pairs (O(1) lookup)
  - Lists for buffering records
  - Generators for memory-efficient iteration
- Optimize Random Generation
  - Pre-generate lists of nicknames, hobbies, job titles
  - Use random.choice() on pre-built lists
  - Avoid complex string operations in tight loops
- Parallel Generation
  - Consider generating Follows and ActivityLog in chunks
  - Use multiprocessing for CPU-bound operations
  - Merge chunks into final file
- File I/O Optimization
  - Use buffered writes (default buffer size: 8192 bytes)
  - Consider using faster I/O libraries
  - Flush periodically to avoid memory buildup
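The parallel-generation tip above can be sketched with `multiprocessing.Pool`: each worker writes its own part file, and the parts are then concatenated. This is an illustration under stated assumptions: the function names, part-file naming, seeds, and worker count are hypothetical, and only the ID1 and ID2 columns are written.

```python
import multiprocessing as mp
import os
import random
import shutil


def write_follows_chunk(args):
    """Worker: write `count` follow pairs to its own part file."""
    part_path, count, num_users, seed = args
    rng = random.Random(seed)  # a distinct seed per chunk avoids repeated data
    with open(part_path, "w", buffering=1024 * 1024) as f:
        for _ in range(count):
            id1 = rng.randint(1, num_users)
            id2 = rng.randint(1, num_users - 1)
            if id2 >= id1:  # skip over id1 so a user never follows themselves
                id2 += 1
            f.write(f"{id1},{id2}\n")
    return part_path


def merge_parts(part_paths, final_path):
    """Concatenate part files in order, deleting each one afterwards."""
    with open(final_path, "wb") as out:
        for part in part_paths:
            with open(part, "rb") as src:
                shutil.copyfileobj(src, out)
            os.remove(part)


def generate_parallel(final_path, total, num_users=200_000, workers=4):
    """Split `total` rows across `workers` processes, then merge the parts."""
    per_chunk, remainder = divmod(total, workers)
    tasks = [
        (f"{final_path}.part{i}", per_chunk + (1 if i < remainder else 0),
         num_users, 1000 + i)
        for i in range(workers)
    ]
    with mp.Pool(workers) as pool:
        parts = pool.map(write_follows_chunk, tasks)
    merge_parts(parts, final_path)
```

Call `generate_parallel` from under an `if __name__ == "__main__":` guard, which `multiprocessing` requires on platforms that start workers by spawning a fresh interpreter. Chunking like this does not suit the ActivityLog view tracking as written, because workers do not share the set of viewed pairs; partitioning users across workers is one possible way around that.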
Next Steps
After generating the datasets:
- Validate all files for correctness
- Load into HDFS for distributed processing
- Run MapReduce jobs for analytics tasks
Dataset pages:
- CircleNetPage: Review schema details
- Follows: Review relationship rules
- ActivityLog: Review activity constraints