Learn how to create the three CircleNet Analytics datasets with the proper scale, constraints, and realistic data for big data processing and analysis.

Generation Overview

You need to write code to generate three CSV files that will be loaded into HDFS for MapReduce analytics.
Dataset        | Records    | File Size Est. | Complexity
CircleNetPage  | 200,000    | ~20-30 MB      | Low
Follows        | 20,000,000 | ~1.5-2 GB      | Medium
ActivityLog    | 10,000,000 | ~700 MB - 1 GB | High
Design Goal: Create a scalable solution that can generate large datasets efficiently without running out of memory.

General Requirements

File Format Rules

Critical Requirements:
  • NO column headers in the files
  • Values separated by commas
  • No commas inside string values
  • One record per line
  • Files must be plain CSV format
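As a concrete illustration of these rules, a small helper can sanitize generated strings and join fields into a header-less, comma-separated record. This is a minimal sketch; the function names are illustrative, not part of any required API:

```python
def sanitize(value: str) -> str:
    """Remove commas and newlines so the value is safe for plain CSV."""
    return value.replace(",", " ").replace("\n", " ").strip()

def csv_line(*fields) -> str:
    """Join fields into one comma-separated record (no header, no quoting)."""
    return ",".join(sanitize(str(f)) for f in fields)

print(csv_line(1, "pixel_wanderer", "Data Analyst", 7, "rock climbing"))
# 1,pixel_wanderer,Data Analyst,7,rock climbing
```

Sanitizing at write time means every generator can share one code path that guarantees the no-commas rule.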

Data Quality

  1. Realistic Values: Use creative, plausible data
    • NickNames should sound like actual usernames
    • JobTitles should be real occupations
    • FavoriteHobbies should be genuine activities
    • Relationship descriptions should make sense
    • Action types should reflect actual social media behavior
  2. Referential Integrity: All foreign keys must be valid
    • Follows.ID1 and ID2 must exist in CircleNetPage (1-200,000)
    • ActivityLog.ByWho and WhatPage must exist in CircleNetPage (1-200,000)
  3. Constraints: Follow all schema rules
    • See individual dataset pages for detailed constraints

CircleNetPage Generation

Scale: 200,000 users

# Pseudocode example
for id in range(1, 200001):
    nickname = generate_nickname(10, 20)  # 10-20 chars, no commas
    job_title = generate_job_title(10, 20)  # 10-20 chars, no commas
    region_code = random_int(1, 50)
    hobby = generate_hobby(5, 30)  # 5-30 chars, no commas
    write_csv_line(id, nickname, job_title, region_code, hobby)

Key Considerations

  • Uniqueness: Each ID from 1 to 200,000 exactly once
  • No Commas: Strip or replace commas in generated strings
  • Variety: Create diverse hobbies and job titles for interesting analytics
  • RegionCode Distribution: Distribute users across all 50 regions
Consider using lists of common hobbies and job titles to ensure realism rather than random character generation.
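Following that advice, a minimal runnable sketch of the CircleNetPage generator might look like this. The word lists, helper names, and seed are illustrative; real pools should be much larger to get good variety:

```python
import random

# Illustrative pools; a real generator would use far larger lists.
NICK_PARTS = ["pixel", "shadow", "cosmic", "turbo", "silent", "wander", "byte", "nova"]
JOB_TITLES = ["Software Engineer", "Registered Nurse", "Civil Engineer",
              "Graphic Designer", "Data Analyst", "School Teacher"]
HOBBIES = ["rock climbing", "watercolor painting", "urban photography",
           "chess tournaments", "trail running", "vintage vinyl"]

def make_nickname(rng: random.Random) -> str:
    # Two parts plus two digits lands inside the 10-20 character range.
    name = rng.choice(NICK_PARTS) + "_" + rng.choice(NICK_PARTS) + str(rng.randint(10, 99))
    return name[:20]

def generate_pages(path: str, n: int = 200_000, seed: int = 42) -> None:
    rng = random.Random(seed)
    with open(path, "w") as f:
        for page_id in range(1, n + 1):  # each ID appears exactly once
            f.write(f"{page_id},{make_nickname(rng)},{rng.choice(JOB_TITLES)},"
                    f"{rng.randint(1, 50)},{rng.choice(HOBBIES)}\n")
```

Because every string comes from a curated pool, the no-commas and length constraints are enforced by construction rather than checked per record.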

Follows Generation

Scale: 20,000,000 relationships

# Pseudocode example
for col_rel in range(1, 20000001):
    id1 = random_int(1, 200000)
    id2 = random_int(1, 200000)
    
    # Ensure ID1 != ID2 (no self-follows)
    while id2 == id1:
        id2 = random_int(1, 200000)
    
    date_of_relation = random_int(1, 1000000)
    description = generate_relation_description(20, 50)  # No commas
    
    write_csv_line(col_rel, id1, id2, date_of_relation, description)

Key Considerations

  • Self-Follow Prevention: ID1 must never equal ID2
  • One-Directional: (ID1 → ID2) is independent of (ID2 → ID1)
  • Scale: 20M records means ~100 average follows per user
  • Description Variety: Create diverse relationship types
Performance Tip: For 20 million records, use buffered writing to avoid excessive I/O operations.

Realistic Distribution

Consider creating a power-law distribution where:
  • Some users are very popular (thousands of followers)
  • Most users have moderate followers (50-200)
  • Some users have few or no followers
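One simple way to approximate such a skew is to weight followee IDs by a Zipf-like curve and sample with replacement. The sketch below runs at reduced scale; the exponent and user count are assumptions, not requirements:

```python
import random

def zipf_weights(n_users: int, exponent: float = 1.0) -> list:
    """Zipf-like weights: the user at popularity rank r gets weight 1/r**exponent."""
    return [1.0 / (rank ** exponent) for rank in range(1, n_users + 1)]

def sample_followees(n_users: int, n_edges: int, seed: int = 7) -> list:
    """Draw ID2 values so low-ranked IDs collect far more followers."""
    rng = random.Random(seed)
    ids = list(range(1, n_users + 1))
    return rng.choices(ids, weights=zipf_weights(n_users), k=n_edges)

followees = sample_followees(1000, 10_000)
```

With exponent 1.0, the top-ranked user receives hundreds of times more follows than a mid-ranked one, which makes "most popular page" analytics far more interesting than a uniform draw.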

ActivityLog Generation

Scale: 10,000,000 actions

# Pseudocode example
viewed_pairs = set()  # (ByWho, WhatPage) pairs that already have a view action

for action_id in range(1, 10000001):
    by_who = random_int(1, 200000)
    what_page = random_int(1, 200000)
    action_time = random_int(1, 1000000)
    
    pair = (by_who, what_page)
    
    # First action for this pair must be a view
    if pair not in viewed_pairs:
        action_type = generate_view_action(20, 50)
        viewed_pairs.add(pair)
    else:
        # Can be view or interaction
        action_type = generate_any_action(20, 50)
    
    write_csv_line(action_id, by_who, what_page, action_type, action_time)

Key Considerations

Critical Rule: Any non-view action must be preceded by a view action for the same (ByWho, WhatPage) pair.
  • View-First Constraint: Track which pairs have had view actions
  • Action Type Variety: Mix views, comments, likes, pokes, etc.
  • Temporal Distribution: Spread actions across time range
  • Self-Interaction: Users CAN access their own pages
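Note that a plain Python set of up to 10 million (ByWho, WhatPage) tuples can consume several gigabytes. One possible workaround, sketched below under the 200,000-user assumption from the schema, is to pack each pair into a single integer so the set stores small ints instead of tuples:

```python
def pair_key(by_who: int, what_page: int, n_users: int = 200_000) -> int:
    """Pack a (ByWho, WhatPage) pair into one unique int for a compact set."""
    return (by_who - 1) * n_users + (what_page - 1)

viewed = set()
key = pair_key(42, 137)
first_action_is_view = key not in viewed  # True on the first encounter
viewed.add(key)
```

The packing is injective for IDs in 1-200,000, so membership tests behave exactly like the tuple version while using noticeably less memory.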

Action Type Examples

Views (20-50 chars):
  • “viewed profile page”
  • “viewed photos section and albums”
  • “viewed recent posts feed”
Interactions (20-50 chars):
  • “left a comment on recent post about vacation”
  • “poked user playfully to say hello”
  • “liked profile photo and cover banner”
  • “sent friend request with nice message”
  • “shared post to own timeline for friends”

Implementation Strategies

Memory-Efficient Generation

Best Practice: Write records incrementally rather than storing all in memory.
# Good: Stream writing
with open('ActivityLog.csv', 'w') as f:
    for i in range(1, 10000001):
        record = generate_activity_record(i)
        f.write(record + '\n')
        
        # Optional: flush every N records
        if i % 100000 == 0:
            f.flush()

# Bad: Store all in memory first
records = []
for i in range(1, 10000001):
    records.append(generate_activity_record(i))  # Memory overflow!
write_all(records)

Buffered Writing

For optimal performance with large files:
buffer_size = 100000  # Write every 100K records
buffer = []

for i in range(1, 20000001):
    buffer.append(generate_follow_record(i))
    
    if len(buffer) >= buffer_size:
        write_buffer_to_file(buffer)
        buffer = []

# Write remaining records
if buffer:
    write_buffer_to_file(buffer)

Progress Tracking

For long-running generation jobs, add progress indicators:
total = 20000000
for i in range(1, total + 1):
    generate_and_write_record(i)
    
    if i % 1000000 == 0:
        print(f"Progress: {i}/{total} ({100*i/total:.1f}%)")

Validation

After generation, validate your datasets:

Record Counts

wc -l CircleNetPage.csv  # Should be 200000
wc -l Follows.csv        # Should be 20000000
wc -l ActivityLog.csv    # Should be 10000000

Format Validation

# Check for correct number of columns
head -n 1000 CircleNetPage.csv | awk -F',' '{print NF}' | sort -u  # Should be 5
head -n 1000 Follows.csv | awk -F',' '{print NF}' | sort -u        # Should be 5
head -n 1000 ActivityLog.csv | awk -F',' '{print NF}' | sort -u    # Should be 5

Constraint Checks

# Validate no self-follows in Follows dataset
with open('Follows.csv', 'r') as f:
    for line in f:
        parts = line.strip().split(',')
        col_rel, id1, id2, date, desc = parts
        assert id1 != id2, f"Self-follow found at ColRel {col_rel}"
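The same streaming pass can also confirm referential integrity. A sketch, assuming the 1-200,000 ID range from the schema (the function name is illustrative):

```python
N_USERS = 200_000

def check_follows(path: str) -> None:
    """Validate ID ranges and the no-self-follow rule in a Follows CSV."""
    with open(path) as f:
        for line_no, line in enumerate(f, start=1):
            col_rel, id1, id2, date, desc = line.rstrip("\n").split(",")
            id1, id2 = int(id1), int(id2)
            assert 1 <= id1 <= N_USERS, f"ID1 out of range on line {line_no}"
            assert 1 <= id2 <= N_USERS, f"ID2 out of range on line {line_no}"
            assert id1 != id2, f"Self-follow on line {line_no}"
```

Because it reads one line at a time, the check handles the full 20M-record file without loading it into memory.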

Performance Tips

  1. Use Appropriate Data Structures
    • Sets for tracking viewed pairs (O(1) lookup)
    • Lists for buffering records
    • Generators for memory-efficient iteration
  2. Optimize Random Generation
    • Pre-generate lists of nicknames, hobbies, job titles
    • Use random.choice() on pre-built lists
    • Avoid complex string operations in tight loops
  3. Parallel Generation
    • Consider generating Follows and ActivityLog in chunks
    • Use multiprocessing for CPU-bound operations
    • Merge chunks into final file
  4. File I/O Optimization
    • Use buffered writes (default buffer size: 8192 bytes)
    • Consider using faster I/O libraries
    • Flush periodically to avoid memory buildup
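As a sketch of tip 3, each worker can write its own part file with an independent seed, and the parts can be concatenated afterwards. The chunking scheme, file naming, and description text here are illustrative assumptions:

```python
import multiprocessing as mp
import random

def generate_chunk(args):
    """Generate one contiguous range of Follows records into its own part file."""
    start, end, path = args
    rng = random.Random(start)  # per-chunk seed keeps workers independent
    with open(path, "w") as f:
        for col_rel in range(start, end + 1):
            id1 = rng.randint(1, 200_000)
            id2 = rng.randint(1, 200_000)
            while id2 == id1:  # enforce the no-self-follow rule
                id2 = rng.randint(1, 200_000)
            f.write(f"{col_rel},{id1},{id2},{rng.randint(1, 1_000_000)},"
                    f"follows and comments regularly\n")
    return path

def parallel_generate(total: int, n_chunks: int, prefix: str) -> list:
    """Split [1, total] into ranges, one worker process per range.
    Merge in order afterwards, e.g. `cat prefix_*.csv > Follows.csv`."""
    size = total // n_chunks
    jobs = [(i * size + 1, (i + 1) * size, f"{prefix}_{i}.csv")
            for i in range(n_chunks)]
    with mp.Pool(n_chunks) as pool:
        return pool.map(generate_chunk, jobs)
```

Disjoint ColRel ranges mean the merged file still contains each ID exactly once, with no coordination between workers.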

Next Steps

After generating the datasets:
  1. Validate all files for correctness
  2. Load into HDFS for distributed processing
  3. Run MapReduce jobs for analytics tasks

Related pages:
  • CircleNetPage: review schema details
  • Follows: review relationship rules
  • ActivityLog: review activity constraints
