Learn how to create the three CircleNet Analytics datasets with the proper scale, constraints, and realistic data for big data processing and analysis.

Generation Overview

You need to write code to generate three CSV files that will be loaded into HDFS for MapReduce analytics.
Dataset        | Records    | File Size Est. | Complexity
CircleNetPage  | 200,000    | ~20-30 MB      | Low
Follows        | 20,000,000 | ~1.5-2 GB      | Medium
ActivityLog    | 10,000,000 | ~700 MB - 1 GB | High
Design Goal: Create a scalable solution that can generate large datasets efficiently without running out of memory.

General Requirements

File Format Rules

Critical Requirements:
  • NO column headers in the files
  • Values separated by commas
  • No commas inside string values
  • One record per line
  • Files must be plain CSV format
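As a concrete illustration of these rules, a small helper can sanitize generated strings and join fields into a header-less, comma-separated record. This is a minimal sketch; the function names are illustrative, not part of any required API:

```python
def sanitize(value: str) -> str:
    """Remove commas and newlines so the value is safe for plain CSV."""
    return value.replace(",", " ").replace("\n", " ").strip()

def csv_line(*fields) -> str:
    """Join fields into one comma-separated record (no header, no quoting)."""
    return ",".join(sanitize(str(f)) for f in fields)

print(csv_line(1, "pixel_wanderer", "Data Analyst", 7, "rock climbing"))
# 1,pixel_wanderer,Data Analyst,7,rock climbing
```

Sanitizing at write time means every generator can share one code path that guarantees the no-commas rule.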

Data Quality

  1. Realistic Values: Use creative, plausible data
    • NickNames should sound like actual usernames
    • JobTitles should be real occupations
    • FavoriteHobbies should be genuine activities
    • Relationship descriptions should make sense
    • Action types should reflect actual social media behavior
  2. Referential Integrity: All foreign keys must be valid
    • Follows.ID1 and ID2 must exist in CircleNetPage (1-200,000)
    • ActivityLog.ByWho and WhatPage must exist in CircleNetPage (1-200,000)
  3. Constraints: Follow all schema rules
    • See individual dataset pages for detailed constraints

CircleNetPage Generation

Scale: 200,000 users

# Pseudocode example
for id in range(1, 200001):
    nickname = generate_nickname(10, 20)  # 10-20 chars, no commas
    job_title = generate_job_title(10, 20)  # 10-20 chars, no commas
    region_code = random_int(1, 50)
    hobby = generate_hobby(5, 30)  # 5-30 chars, no commas
    write_csv_line(id, nickname, job_title, region_code, hobby)

Key Considerations

  • Uniqueness: Each ID from 1 to 200,000 exactly once
  • No Commas: Strip or replace commas in generated strings
  • Variety: Create diverse hobbies and job titles for interesting analytics
  • RegionCode Distribution: Distribute users across all 50 regions
Consider using lists of common hobbies and job titles to ensure realism rather than random character generation.
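Following that advice, a minimal runnable sketch of the CircleNetPage generator might look like this. The word lists, helper names, and seed are illustrative; real pools should be much larger to get good variety:

```python
import random

# Illustrative pools; a real generator would use far larger lists.
NICK_PARTS = ["pixel", "shadow", "cosmic", "turbo", "silent", "wander", "byte", "nova"]
JOB_TITLES = ["Software Engineer", "Registered Nurse", "Civil Engineer",
              "Graphic Designer", "Data Analyst", "School Teacher"]
HOBBIES = ["rock climbing", "watercolor painting", "urban photography",
           "chess tournaments", "trail running", "vintage vinyl"]

def make_nickname(rng: random.Random) -> str:
    # Two parts plus two digits lands inside the 10-20 character range.
    name = rng.choice(NICK_PARTS) + "_" + rng.choice(NICK_PARTS) + str(rng.randint(10, 99))
    return name[:20]

def generate_pages(path: str, n: int = 200_000, seed: int = 42) -> None:
    rng = random.Random(seed)
    with open(path, "w") as f:
        for page_id in range(1, n + 1):  # each ID appears exactly once
            f.write(f"{page_id},{make_nickname(rng)},{rng.choice(JOB_TITLES)},"
                    f"{rng.randint(1, 50)},{rng.choice(HOBBIES)}\n")
```

Because every string comes from a curated pool, the no-commas and length constraints are enforced by construction rather than checked per record.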

Follows Generation

Scale: 20,000,000 relationships

# Pseudocode example
for col_rel in range(1, 20000001):
    id1 = random_int(1, 200000)
    id2 = random_int(1, 200000)
    
    # Ensure ID1 != ID2 (no self-follows)
    while id2 == id1:
        id2 = random_int(1, 200000)
    
    date_of_relation = random_int(1, 1000000)
    description = generate_relation_description(20, 50)  # No commas
    
    write_csv_line(col_rel, id1, id2, date_of_relation, description)

Key Considerations

  • Self-Follow Prevention: ID1 must never equal ID2
  • One-Directional: (ID1 → ID2) is independent of (ID2 → ID1)
  • Scale: 20M records means ~100 average follows per user
  • Description Variety: Create diverse relationship types
Performance Tip: For 20 million records, use buffered writing to avoid excessive I/O operations.

Realistic Distribution

Consider creating a power-law distribution where:
  • Some users are very popular (thousands of followers)
  • Most users have moderate followers (50-200)
  • Some users have few or no followers
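One simple way to approximate such a skew is to weight followee IDs by a Zipf-like curve and sample with replacement. The sketch below runs at reduced scale; the exponent and user count are assumptions, not requirements:

```python
import random

def zipf_weights(n_users: int, exponent: float = 1.0) -> list:
    """Zipf-like weights: the user at popularity rank r gets weight 1/r**exponent."""
    return [1.0 / (rank ** exponent) for rank in range(1, n_users + 1)]

def sample_followees(n_users: int, n_edges: int, seed: int = 7) -> list:
    """Draw ID2 values so low-ranked IDs collect far more followers."""
    rng = random.Random(seed)
    ids = list(range(1, n_users + 1))
    return rng.choices(ids, weights=zipf_weights(n_users), k=n_edges)

followees = sample_followees(1000, 10_000)
```

With exponent 1.0, the top-ranked user receives hundreds of times more follows than a mid-ranked one, which makes "most popular page" analytics far more interesting than a uniform draw.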

ActivityLog Generation

Scale: 10,000,000 actions

# Pseudocode example
viewed_pairs = set()  # (ByWho, WhatPage) pairs that already have a view action

for action_id in range(1, 10000001):
    by_who = random_int(1, 200000)
    what_page = random_int(1, 200000)
    action_time = random_int(1, 1000000)
    
    pair = (by_who, what_page)
    
    # First action for this pair must be a view
    if pair not in viewed_pairs:
        action_type = generate_view_action(20, 50)
        viewed_pairs.add(pair)
    else:
        # Can be view or interaction
        action_type = generate_any_action(20, 50)
    
    write_csv_line(action_id, by_who, what_page, action_type, action_time)

Key Considerations

Critical Rule: Any non-view action must be preceded by a view action for the same (ByWho, WhatPage) pair.
  • View-First Constraint: Track which pairs have had view actions
  • Action Type Variety: Mix views, comments, likes, pokes, etc.
  • Temporal Distribution: Spread actions across time range
  • Self-Interaction: Users CAN access their own pages
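Note that a plain Python set of up to 10 million (ByWho, WhatPage) tuples can consume several gigabytes. One possible workaround, sketched below under the 200,000-user assumption from the schema, is to pack each pair into a single integer so the set stores small ints instead of tuples:

```python
def pair_key(by_who: int, what_page: int, n_users: int = 200_000) -> int:
    """Pack a (ByWho, WhatPage) pair into one unique int for a compact set."""
    return (by_who - 1) * n_users + (what_page - 1)

viewed = set()
key = pair_key(42, 137)
first_action_is_view = key not in viewed  # True on the first encounter
viewed.add(key)
```

The packing is injective for IDs in 1-200,000, so membership tests behave exactly like the tuple version while using noticeably less memory.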

Action Type Examples

Views (20-50 chars):
  • “viewed profile page”
  • “viewed photos section and albums”
  • “viewed recent posts feed”
Interactions (20-50 chars):
  • “left a comment on recent post about vacation”
  • “poked user playfully to say hello”
  • “liked profile photo and cover banner”
  • “sent friend request with nice message”
  • “shared post to own timeline for friends”

Implementation Strategies

Memory-Efficient Generation

Best Practice: Write records incrementally rather than storing all in memory.
# Good: Stream writing
with open('ActivityLog.csv', 'w') as f:
    for i in range(1, 10000001):
        record = generate_activity_record(i)
        f.write(record + '\n')
        
        # Optional: flush every N records
        if i % 100000 == 0:
            f.flush()

# Bad: Store all in memory first
records = []
for i in range(1, 10000001):
    records.append(generate_activity_record(i))  # Memory overflow!
write_all(records)

Buffered Writing

For optimal performance with large files:
buffer_size = 100000  # Write every 100K records
buffer = []

for i in range(1, 20000001):
    buffer.append(generate_follow_record(i))
    
    if len(buffer) >= buffer_size:
        write_buffer_to_file(buffer)
        buffer = []

# Write remaining records
if buffer:
    write_buffer_to_file(buffer)

Progress Tracking

For long-running generation jobs, add progress indicators:
total = 20000000
for i in range(1, total + 1):
    generate_and_write_record(i)
    
    if i % 1000000 == 0:
        print(f"Progress: {i}/{total} ({100*i/total:.1f}%)")

Validation

After generation, validate your datasets:

Record Counts

wc -l CircleNetPage.csv  # Should be 200000
wc -l Follows.csv        # Should be 20000000
wc -l ActivityLog.csv    # Should be 10000000

Format Validation

# Check for correct number of columns
head -n 1000 CircleNetPage.csv | awk -F',' '{print NF}' | sort -u  # Should be 5
head -n 1000 Follows.csv | awk -F',' '{print NF}' | sort -u        # Should be 5
head -n 1000 ActivityLog.csv | awk -F',' '{print NF}' | sort -u    # Should be 5

Constraint Checks

# Validate no self-follows in Follows dataset
with open('Follows.csv', 'r') as f:
    for line in f:
        parts = line.strip().split(',')
        col_rel, id1, id2, date, desc = parts
        assert id1 != id2, f"Self-follow found at ColRel {col_rel}"
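The same streaming pass can also confirm referential integrity. A sketch, assuming the 1-200,000 ID range from the schema (the function name is illustrative):

```python
N_USERS = 200_000

def check_follows(path: str) -> None:
    """Validate ID ranges and the no-self-follow rule in a Follows CSV."""
    with open(path) as f:
        for line_no, line in enumerate(f, start=1):
            col_rel, id1, id2, date, desc = line.rstrip("\n").split(",")
            id1, id2 = int(id1), int(id2)
            assert 1 <= id1 <= N_USERS, f"ID1 out of range on line {line_no}"
            assert 1 <= id2 <= N_USERS, f"ID2 out of range on line {line_no}"
            assert id1 != id2, f"Self-follow on line {line_no}"
```

Because it reads one line at a time, the check handles the full 20M-record file without loading it into memory.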

Performance Tips

  1. Use Appropriate Data Structures
    • Sets for tracking viewed pairs (O(1) lookup)
    • Lists for buffering records
    • Generators for memory-efficient iteration
  2. Optimize Random Generation
    • Pre-generate lists of nicknames, hobbies, job titles
    • Use random.choice() on pre-built lists
    • Avoid complex string operations in tight loops
  3. Parallel Generation
    • Consider generating Follows and ActivityLog in chunks
    • Use multiprocessing for CPU-bound operations
    • Merge chunks into final file
  4. File I/O Optimization
    • Use buffered writes (default buffer size: 8192 bytes)
    • Consider using faster I/O libraries
    • Flush periodically to avoid memory buildup
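As a sketch of tip 3, each worker can write its own part file with an independent seed, and the parts can be concatenated afterwards. The chunking scheme, file naming, and description text here are illustrative assumptions:

```python
import multiprocessing as mp
import random

def generate_chunk(args):
    """Generate one contiguous range of Follows records into its own part file."""
    start, end, path = args
    rng = random.Random(start)  # per-chunk seed keeps workers independent
    with open(path, "w") as f:
        for col_rel in range(start, end + 1):
            id1 = rng.randint(1, 200_000)
            id2 = rng.randint(1, 200_000)
            while id2 == id1:  # enforce the no-self-follow rule
                id2 = rng.randint(1, 200_000)
            f.write(f"{col_rel},{id1},{id2},{rng.randint(1, 1_000_000)},"
                    f"follows and comments regularly\n")
    return path

def parallel_generate(total: int, n_chunks: int, prefix: str) -> list:
    """Split [1, total] into ranges, one worker process per range.
    Merge in order afterwards, e.g. `cat prefix_*.csv > Follows.csv`."""
    size = total // n_chunks
    jobs = [(i * size + 1, (i + 1) * size, f"{prefix}_{i}.csv")
            for i in range(n_chunks)]
    with mp.Pool(n_chunks) as pool:
        return pool.map(generate_chunk, jobs)
```

Disjoint ColRel ranges mean the merged file still contains each ID exactly once, with no coordination between workers.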

Next Steps

After generating the datasets:
  1. Validate all files for correctness
  2. Load into HDFS for distributed processing
  3. Run MapReduce jobs for analytics tasks

Related pages:
  • CircleNetPage: review schema details
  • Follows: review relationship rules
  • ActivityLog: review activity constraints
