This guide explains how to benchmark CircleNet Analytics jobs and compare the performance of different optimization approaches.
All CircleNet Analytics tasks include built-in timing instrumentation using the JobTimer utility class.
JobTimer Implementation

```java
import java.io.FileWriter;
import java.io.IOException;

import org.apache.hadoop.mapreduce.Job;

public class JobTimer {

    // Run a job and measure execution time
    public static boolean run(Job job, String taskName) throws Exception {
        long start = System.currentTimeMillis();
        boolean success = job.waitForCompletion(true);
        long elapsed = System.currentTimeMillis() - start;
        System.out.println(taskName + ",job,time_ms=" + elapsed);
        // Optionally append a record to the timing file: taskName,phase,time_ms
        log(taskName + ",job,time_ms=" + elapsed);
        return success;
    }

    // Record total time including all jobs
    public static void total(String taskName, long totalStart) {
        long totalElapsed = System.currentTimeMillis() - totalStart;
        System.out.println(taskName + ",total,total_time_ms=" + totalElapsed);
        log(taskName + ",total,total_time_ms=" + totalElapsed);
    }

    // Append one CSV record if CIRCLENET_TIMING_FILE is set
    private static void log(String record) {
        String timingFile = System.getenv("CIRCLENET_TIMING_FILE");
        if (timingFile == null) {
            return;
        }
        try (FileWriter out = new FileWriter(timingFile, true)) {
            out.write(record + "\n");
        } catch (IOException e) {
            // Timing is best-effort; never fail the task over logging
        }
    }
}
```
Using JobTimer in Tasks

```java
public static void main(String[] args) throws Exception {
    long totalStart = System.currentTimeMillis();
    // ... configure and run job ...
    boolean ok = JobTimer.run(job, "TaskAOptimized");
    JobTimer.total("TaskAOptimized", totalStart);
    System.exit(ok ? 0 : 1);
}
```
For multi-job tasks:

```java
long totalStart = System.currentTimeMillis();

// Job 1: Count accesses
if (!JobTimer.run(job1, "TaskBOptimized-job1-count")) {
    JobTimer.total("TaskBOptimized", totalStart);
    System.exit(1);
}

// Job 2: Top 10 selection with map-side join
boolean ok = JobTimer.run(job2, "TaskBOptimized-job2-top10-join");
JobTimer.total("TaskBOptimized", totalStart);
```
Benchmarking Environment Setup
Set Environment Variables
Configure paths and enable timing collection:

```bash
export JAR=/home/ds503/ds503_bdm-1.0-SNAPSHOT.jar
export PAGES=/circlenet/pages/CircleNetPage.csv
export FOLLOWS=/circlenet/follows/Follows.csv
export ACTIVITY=/circlenet/activitylog/ActivityLog.csv
export OUT=/circlenet/output
export CIRCLENET_TIMING_FILE=/home/ds503/task_times.csv
```
Reset for Fresh Benchmark
Clean up previous outputs and timing data:

```bash
# Remove timing file
rm -f $CIRCLENET_TIMING_FILE

# Remove output directories
hdfs dfs -rm -r -f $OUT/taskA $OUT/taskB $OUT/taskC $OUT/taskD \
  $OUT/taskE $OUT/taskF $OUT/taskG $OUT/taskH
```
Run Benchmark Suite
Execute both simple and optimized versions of each task
Collect and Analyze Results
Extract timing data and compare performance
Running Benchmarks
Single Task Benchmark
```bash
# Task A: Hobby Frequency
hadoop jar $JAR circlenet.taskA.TaskA $PAGES $OUT/taskA/simple
hadoop jar $JAR circlenet.taskA.TaskAOptimized $PAGES $OUT/taskA/optimized
```
Multi-Job Task Benchmark
For tasks requiring multiple MapReduce jobs:
```bash
# Task B: Top 10 Most Accessed Pages

# Simple version (3 jobs: count, join, top10)
hadoop jar $JAR circlenet.taskB.TaskBSimple \
  $ACTIVITY $PAGES \
  $OUT/taskB/tmp_count_simple \
  $OUT/taskB/tmp_join_simple \
  $OUT/taskB/simple

# Optimized version (2 jobs: count with combiner, top10 with map-side join)
hadoop jar $JAR circlenet.taskB.TaskBOptimized \
  $ACTIVITY $PAGES \
  $OUT/taskB/tmp_count_opt \
  $OUT/taskB/optimized
```
Analyzing Results
View Timing Data
```bash
# View all timing records
cat $CIRCLENET_TIMING_FILE

# Filter for total times only
rg ",total," $CIRCLENET_TIMING_FILE

# Example output:
# TaskA,total,total_time_ms=45231
# TaskAOptimized,total,total_time_ms=32156
# TaskB,total,total_time_ms=123450
# TaskBOptimized,total,total_time_ms=87342
```
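The improvement percentage can also be computed directly from these records; a minimal Java sketch, assuming the `name,total,total_time_ms=N` format shown above (the `TimingCompare` class and its helper names are illustrative, not part of the project code):

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class TimingCompare {

    // Parse "name,total,total_time_ms=N" records into a name -> millis map
    static Map<String, Long> parseTotals(List<String> lines) {
        Map<String, Long> totals = new LinkedHashMap<>();
        for (String line : lines) {
            String[] parts = line.split(",");
            if (parts.length == 3 && parts[1].equals("total")) {
                long ms = Long.parseLong(parts[2].substring(parts[2].indexOf('=') + 1));
                totals.put(parts[0], ms);
            }
        }
        return totals;
    }

    // Percentage improvement of the optimized run over the simple run
    static double improvement(long simpleMs, long optimizedMs) {
        return 100.0 * (simpleMs - optimizedMs) / simpleMs;
    }

    public static void main(String[] args) {
        Map<String, Long> totals = parseTotals(List.of(
            "TaskA,total,total_time_ms=45231",
            "TaskAOptimized,total,total_time_ms=32156"));
        System.out.printf("TaskA: %.1f%% faster%n",
            improvement(totals.get("TaskA"), totals.get("TaskAOptimized")));
        // prints "TaskA: 28.9% faster"
    }
}
```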
Copy Results to Host
```bash
# Copy timing file from container
docker cp dadc9d47d16e:$CIRCLENET_TIMING_FILE ./task_times.csv
```
Comparison Table Template
Based on the project report structure:
| Task | Description | Simple (ms) | Optimized (ms) | Improvement | Key Optimization |
|---|---|---|---|---|---|
| A | Hobby frequency | - | - | - | Combiner |
| B | Top 10 pages | - | - | - | Map-side join + combiner |
| C | Filter by hobby | - | - | - | Map-only job |
| D | Popularity factor | - | - | - | Count + map-side join |
| E | Favorites behavior | - | - | - | Dedupe + combine |
| F | Above average | - | - | - | Cache lookup + filter |
| G | Outdated pages | - | - | - | Map-only + threshold |
| H | One-way follows | - | - | - | Cache-based join |
```python
# Example Python script to analyze timing data
import pandas as pd

df = pd.read_csv('task_times.csv', names=['task', 'phase', 'time_ms'])
# The third field looks like "time_ms=45231"; keep only the number
df['time_ms'] = df['time_ms'].str.split('=').str[-1].astype(int)

# Filter for total times
totals = df[df['phase'] == 'total']

# Compare each simple total against its optimized counterpart
for task in ['TaskA', 'TaskB', 'TaskC', 'TaskD', 'TaskE', 'TaskF', 'TaskG', 'TaskH']:
    # Simple totals may be logged as e.g. "TaskA" or "TaskASimple"
    simple = totals[totals['task'].isin([task, task + 'Simple'])]['time_ms'].values
    optimized = totals[totals['task'] == task + 'Optimized']['time_ms'].values
    if len(simple) > 0 and len(optimized) > 0:
        improvement = ((simple[0] - optimized[0]) / simple[0]) * 100
        print(f"{task}: {improvement:.1f}% faster")
```
Correctness Verification
Always verify correctness before comparing performance! An incorrect optimization is worthless.
Verification Procedure
Copy Both Outputs
```bash
hdfs dfs -get -f $OUT/taskA/simple /tmp/taskA_simple
hdfs dfs -get -f $OUT/taskA/optimized /tmp/taskA_optimized
```
Canonicalize Output
Sort outputs to compare regardless of order:

```bash
cat /tmp/taskA_simple/part-* | sort > /tmp/taskA_simple.txt
cat /tmp/taskA_optimized/part-* | sort > /tmp/taskA_optimized.txt
```
Compare Results
```bash
diff -u /tmp/taskA_simple.txt /tmp/taskA_optimized.txt
```
No output means the results are identical!
For tasks with intentionally different output formats, compare only the key columns or counts.
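Key-column comparison can be scripted too; a hedged Java sketch, assuming tab-separated output with the key in the first field (the `KeyCompare` class and the file paths are placeholders for illustration):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Set;
import java.util.TreeSet;

public class KeyCompare {

    // Collect the first tab-separated field of every non-empty line
    static Set<String> keys(Path file) throws IOException {
        Set<String> keys = new TreeSet<>();
        for (String line : Files.readAllLines(file)) {
            if (!line.isEmpty()) {
                keys.add(line.split("\t", 2)[0]);
            }
        }
        return keys;
    }

    public static void main(String[] args) throws IOException {
        Set<String> simple = keys(Path.of("/tmp/taskA_simple.txt"));
        Set<String> optimized = keys(Path.of("/tmp/taskA_optimized.txt"));
        System.out.println(simple.equals(optimized)
            ? "Key sets match"
            : "Key sets differ");
    }
}
```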
High Impact Optimizations:
- Adding combiners to aggregation jobs (Task A, B, D)
- Converting to map-only jobs (Task C, Task G final stage)
- Map-side joins replacing reduce-side joins (Task B, D, F, H)
Low/No Impact Optimizations:
- Combiners on map-only jobs (no effect)
- Map-side joins when both datasets are large
- Optimizations on small datasets (overhead dominates)
When Optimizations May Not Help
From the project findings:
- Task C Optimized: No real gain over simple version on small datasets
- Task E Optimized: Actually slower due to additional job overhead
- Always benchmark your specific use case!
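The Task E result is easy to model with back-of-envelope arithmetic; a minimal Java sketch, where the startup cost and per-record throughput are made-up assumptions for illustration, not measurements:

```java
public class JobOverheadModel {

    // Rough model: each MapReduce job pays a fixed startup cost,
    // plus time proportional to the records it processes
    static long estimateMs(int jobs, long records,
                           long startupMsPerJob, double msPerRecord) {
        return jobs * startupMsPerJob + (long) (records * msPerRecord);
    }

    public static void main(String[] args) {
        long startup = 15_000;    // assumed JVM/container startup per job
        long smallInput = 10_000; // records

        // "Optimized" variant adds a job but halves per-record work
        long simple = estimateMs(1, smallInput, startup, 1.0);
        long optimized = estimateMs(2, smallInput, startup, 0.5);

        // On small inputs the extra startup cost dominates the saved work
        System.out.println("simple=" + simple + "ms optimized=" + optimized + "ms");
        // prints "simple=25000ms optimized=35000ms"
    }
}
```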
Optimization Decision Framework
Questions to Ask

1. What is the bottleneck?
   - Shuffle phase → Try combiner or map-side join
   - Too many jobs → Consolidate with caching
   - No aggregation → Use map-only job
2. What are the data sizes?
   - Small reference data → Map-side join
   - Large datasets → Combiners
   - Filtering only → Map-only
3. What is the operation type?
   - SUM/COUNT/MAX → Safe for combiner
   - AVG/DISTINCT → Need careful approach
   - Filter/Transform → Map-only candidate
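The AVG caveat deserves a concrete illustration; a minimal Java sketch (with illustrative values) showing why averaging per-split averages is wrong, while carrying (sum, count) pairs through a combiner-style merge stays correct:

```java
public class AvgCombinerSafety {

    // A combiner-safe partial aggregate for AVG: carry sum and count,
    // divide only once at the very end
    record SumCount(long sum, long count) {
        SumCount merge(SumCount other) {
            return new SumCount(sum + other.sum, count + other.count);
        }
        double average() {
            return (double) sum / count;
        }
    }

    public static void main(String[] args) {
        // Two "map tasks" with uneven group sizes
        // split 1: {10, 20, 30} -> avg 20; split 2: {100} -> avg 100

        // WRONG: average of the per-split averages
        double naive = (20.0 + 100.0) / 2;   // 60.0

        // RIGHT: merge (sum, count) pairs, then divide
        SumCount a = new SumCount(10 + 20 + 30, 3);
        SumCount b = new SumCount(100, 1);
        double correct = a.merge(b).average(); // 160 / 4 = 40.0

        System.out.println("naive=" + naive + " correct=" + correct);
        // prints "naive=60.0 correct=40.0"
    }
}
```

The same reasoning explains the SUM/COUNT/MAX row above: those operations are associative and commutative on their own, so the reducer can double as the combiner unchanged.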
Optimization Checklist
From the project report template:
```latex
\section{Task A: Frequency of Favorite Hobby}

\subsection*{Simple Solution}
Read CircleNetPage, emit (FavoriteHobby, 1) in the mapper, and sum in the reducer.

\subsection*{Optimization Thinking and Implementation}
This task is safe for a combiner because sum is associative and commutative.
We used the same sum reducer as the combiner to reduce network shuffle.

\subsection*{Performance Results}
Simple version: X ms \\
Optimized version: Y ms \\
Improvement: Z\% faster
```
Best Practices
- Consistent Environment: Run all benchmarks in the same environment
- Multiple Runs: Execute each version 3-5 times and average results
- Warm-up: First run may be slower due to JVM warm-up
- Clean State: Clear outputs between runs to avoid caching effects
- Document Everything: Record dataset sizes, cluster configuration, and Hadoop settings
Use the `hdfs dfs -rm -r -f` command before each run to ensure a clean slate.
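The multiple-runs and warm-up advice can be applied when post-processing the timing file; a small Java sketch that drops the first (warm-up) run and averages the rest (the `RunAverager` class and the sample timings are illustrative):

```java
import java.util.List;

public class RunAverager {

    // Average repeated timings, discarding the first run as JVM warm-up
    static double steadyStateMean(List<Long> runsMs) {
        return runsMs.stream()
                     .skip(1) // drop warm-up run
                     .mapToLong(Long::longValue)
                     .average()
                     .orElseThrow();
    }

    public static void main(String[] args) {
        // e.g. five timings of the same task collected from task_times.csv
        List<Long> runs = List.of(52_000L, 45_100L, 44_900L, 45_300L, 44_700L);
        System.out.printf("steady-state mean: %.0f ms%n", steadyStateMean(runs));
        // prints "steady-state mean: 45000 ms"
    }
}
```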
Next Steps