This guide explains how to benchmark CircleNet Analytics jobs and compare the performance of different optimization approaches.

Performance Measurement Setup

All CircleNet Analytics tasks include built-in timing instrumentation using the JobTimer utility class.

JobTimer Implementation

JobTimer.java
import org.apache.hadoop.mapreduce.Job;

import java.io.FileWriter;
import java.io.IOException;

public class JobTimer {
    // Run a job and measure wall-clock execution time
    public static boolean run(Job job, String taskName) throws Exception {
        long start = System.currentTimeMillis();
        boolean success = job.waitForCompletion(true);
        long elapsed = System.currentTimeMillis() - start;

        System.out.println(taskName + ",job,time_ms=" + elapsed);
        log(taskName, "job", "time_ms=" + elapsed);
        return success;
    }

    // Record total time including all jobs
    public static void total(String taskName, long totalStart) {
        long totalElapsed = System.currentTimeMillis() - totalStart;
        System.out.println(taskName + ",total,total_time_ms=" + totalElapsed);
        log(taskName, "total", "total_time_ms=" + totalElapsed);
    }

    // Optionally append to a CSV file: taskName,phase,time_ms
    private static void log(String taskName, String phase, String timing) {
        String timingFile = System.getenv("CIRCLENET_TIMING_FILE");
        if (timingFile == null) return;
        try (FileWriter w = new FileWriter(timingFile, true)) {
            w.write(taskName + "," + phase + "," + timing + "\n");
        } catch (IOException e) {
            System.err.println("Could not write timing file: " + e.getMessage());
        }
    }
}

Using JobTimer in Tasks

public static void main(String[] args) throws Exception {
    long totalStart = System.currentTimeMillis();
    
    // ... configure and run job ...
    
    boolean ok = JobTimer.run(job, "TaskAOptimized");
    JobTimer.total("TaskAOptimized", totalStart);
    
    System.exit(ok ? 0 : 1);
}
For multi-job tasks:
TaskBOptimized.java
long totalStart = System.currentTimeMillis();

// Job 1: Count accesses
if (!JobTimer.run(job1, "TaskBOptimized-job1-count")) {
    JobTimer.total("TaskBOptimized", totalStart);
    System.exit(1);
}

// Job 2: Top 10 selection with map-side join
boolean ok = JobTimer.run(job2, "TaskBOptimized-job2-top10-join");
JobTimer.total("TaskBOptimized", totalStart);

Benchmarking Environment Setup

Step 1: Set Environment Variables

Configure paths and enable timing collection:
export JAR=/home/ds503/ds503_bdm-1.0-SNAPSHOT.jar
export PAGES=/circlenet/pages/CircleNetPage.csv
export FOLLOWS=/circlenet/follows/Follows.csv
export ACTIVITY=/circlenet/activitylog/ActivityLog.csv
export OUT=/circlenet/output
export CIRCLENET_TIMING_FILE=/home/ds503/task_times.csv
Step 2: Reset for Fresh Benchmark

Clean up previous outputs and timing data:
# Remove timing file
rm -f $CIRCLENET_TIMING_FILE

# Remove output directories
hdfs dfs -rm -r -f $OUT/taskA $OUT/taskB $OUT/taskC $OUT/taskD \
                    $OUT/taskE $OUT/taskF $OUT/taskG $OUT/taskH
Step 3: Run Benchmark Suite

Execute both the simple and optimized versions of each task.
Step 4: Collect and Analyze Results

Extract the timing data and compare performance.

Running Benchmarks

Single Task Benchmark

# Task A: Hobby Frequency
hadoop jar $JAR circlenet.taskA.TaskA $PAGES $OUT/taskA/simple
hadoop jar $JAR circlenet.taskA.TaskAOptimized $PAGES $OUT/taskA/optimized

Multi-Job Task Benchmark

For tasks requiring multiple MapReduce jobs:
# Task B: Top 10 Most Accessed Pages
# Simple version (3 jobs: count, join, top10)
hadoop jar $JAR circlenet.taskB.TaskBSimple \
  $ACTIVITY $PAGES \
  $OUT/taskB/tmp_count_simple \
  $OUT/taskB/tmp_join_simple \
  $OUT/taskB/simple

# Optimized version (2 jobs: count with combiner, top10 with map-side join)
hadoop jar $JAR circlenet.taskB.TaskBOptimized \
  $ACTIVITY $PAGES \
  $OUT/taskB/tmp_count_opt \
  $OUT/taskB/optimized

Analyzing Results

View Timing Data

# View all timing records
cat $CIRCLENET_TIMING_FILE

# Filter for total times only
grep ",total," $CIRCLENET_TIMING_FILE

# Example output:
# TaskA,total,total_time_ms=45231
# TaskAOptimized,total,total_time_ms=32156
# TaskB,total,total_time_ms=123450
# TaskBOptimized,total,total_time_ms=87342
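The records above can also be parsed without pandas; a minimal Python sketch, assuming the JobTimer record layout of taskName,phase,key=value shown above:

```python
def parse_timing_line(line):
    """Parse a record like 'TaskA,total,total_time_ms=45231'
    into (task, phase, milliseconds)."""
    task, phase, kv = line.strip().split(",")
    # The third field is key=value; keep only the numeric value
    ms = int(kv.split("=")[1])
    return task, phase, ms

print(parse_timing_line("TaskA,total,total_time_ms=45231"))
# -> ('TaskA', 'total', 45231)
```

This is handy for ad-hoc checks on a single line before running the full analysis script.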

Copy Results to Host

# Copy timing file from the container (replace dadc9d47d16e with your container ID)
docker cp dadc9d47d16e:$CIRCLENET_TIMING_FILE ./task_times.csv

Performance Comparison Framework

Comparison Table Template

Based on the project report structure:
Task | Description         | Simple (ms) | Optimized (ms) | Improvement | Key Optimization
A    | Hobby frequency     | -           | -              | -           | Combiner
B    | Top 10 pages        | -           | -              | -           | Map-side join + combiner
C    | Filter by hobby     | -           | -              | -           | Map-only job
D    | Popularity factor   | -           | -              | -           | Count + map-side join
E    | Favorites behavior  | -           | -              | -           | Dedupe + combine
F    | Above average       | -           | -              | -           | Cache lookup + filter
G    | Outdated pages      | -           | -              | -           | Map-only + threshold
H    | One-way follows     | -           | -              | -           | Cache-based join
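As a sketch, the Simple/Optimized/Improvement columns can be filled programmatically from the timing file. This assumes the JobTimer record format shown earlier, and that the simple run is logged either under the bare task name (e.g. TaskA) or with a Simple suffix:

```python
def build_comparison(lines):
    """Build (task, simple_ms, optimized_ms, improvement) rows
    from JobTimer records like 'TaskA,total,total_time_ms=45231'."""
    totals = {}
    for line in lines:
        task, phase, kv = line.strip().split(",")
        if phase == "total":
            totals[task] = int(kv.split("=")[1])

    rows = []
    for letter in "ABCDEFGH":
        base = f"Task{letter}"
        simple = totals.get(base, totals.get(base + "Simple"))
        optimized = totals.get(base + "Optimized")
        if simple is None or optimized is None:
            continue
        pct = (simple - optimized) / simple * 100
        rows.append((base, simple, optimized, f"{pct:.1f}%"))
    return rows

records = [
    "TaskA,total,total_time_ms=45231",
    "TaskAOptimized,total,total_time_ms=32156",
]
for row in build_comparison(records):
    print(row)  # ('TaskA', 45231, 32156, '28.9%')
```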

Calculating Performance Metrics

# Example Python script to analyze timing data
import pandas as pd

# Records look like: TaskA,total,total_time_ms=45231
df = pd.read_csv('task_times.csv', names=['task', 'phase', 'time_ms'])

# Strip the 'total_time_ms=' prefix and convert to integers
df['time_ms'] = df['time_ms'].astype(str).str.split('=').str[-1].astype(int)

# Keep only the per-task totals
totals = df[df['phase'] == 'total']

# Compare the simple and optimized version of each task
# (the simple run may be logged as e.g. 'TaskA' or 'TaskASimple')
for task in ['TaskA', 'TaskB', 'TaskC', 'TaskD', 'TaskE', 'TaskF', 'TaskG', 'TaskH']:
    simple = totals[totals['task'].isin([task, task + 'Simple'])]['time_ms'].values
    optimized = totals[totals['task'] == task + 'Optimized']['time_ms'].values

    if len(simple) > 0 and len(optimized) > 0:
        improvement = ((simple[0] - optimized[0]) / simple[0]) * 100
        print(f"{task}: {improvement:.1f}% faster")

Correctness Verification

Always verify correctness before comparing performance! An incorrect optimization is worthless.

Verification Procedure

Step 1: Copy Both Outputs

hdfs dfs -get -f $OUT/taskA/simple /tmp/taskA_simple
hdfs dfs -get -f $OUT/taskA/optimized /tmp/taskA_optimized
Step 2: Canonicalize Output

Sort outputs to compare regardless of order:
cat /tmp/taskA_simple/part-* | sort > /tmp/taskA_simple.txt
cat /tmp/taskA_optimized/part-* | sort > /tmp/taskA_optimized.txt
Step 3: Compare Results

diff -u /tmp/taskA_simple.txt /tmp/taskA_optimized.txt
No output means the results are identical!
For tasks with intentionally different output formats, compare only the key columns or counts.

Performance Analysis Guidelines

Expected Performance Patterns

High Impact Optimizations:
  • Adding combiners to aggregation jobs (Task A, B, D)
  • Converting to map-only jobs (Task C, Task G final stage)
  • Map-side joins replacing reduce-side joins (Task B, D, F, H)
Low/No Impact Optimizations:
  • Combiners on map-only jobs (no effect)
  • Map-side joins when both datasets are large
  • Optimizations on small datasets (overhead dominates)

When Optimizations May Not Help

From the project findings:
  • Task C Optimized: No real gain over simple version on small datasets
  • Task E Optimized: Actually slower due to additional job overhead
  • Always benchmark your specific use case!

Optimization Decision Framework

Questions to Ask

  1. What is the bottleneck?
    • Shuffle phase → Try combiner or map-side join
    • Too many jobs → Consolidate with caching
    • No aggregation → Use map-only job
  2. What are the data sizes?
    • Small reference data → Map-side join
    • Large datasets → Combiners
    • Filtering only → Map-only
  3. What is the operation type?
    • SUM/COUNT/MAX → Safe for combiner
    • AVG/DISTINCT → Need careful approach
    • Filter/Transform → Map-only candidate
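The AVG caveat in question 3 is easy to demonstrate: averaging per-partition averages gives the wrong answer when partitions differ in size, while carrying (sum, count) pairs through the combiner stays exact.

```python
# Two map partitions of different sizes
partition1 = [10, 20, 30]   # local average 20
partition2 = [100]          # local average 100

# Wrong: naive average of the per-partition averages
avg_of_avgs = (sum(partition1) / len(partition1)
               + sum(partition2) / len(partition2)) / 2
print(avg_of_avgs)  # 60.0 -- not the true mean

# Correct: combiner emits (sum, count); reducer divides once
s = sum(partition1) + sum(partition2)    # 160
n = len(partition1) + len(partition2)    # 4
print(s / n)  # 40.0 -- true mean of [10, 20, 30, 100]
```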

Optimization Checklist

  • Run simple version first
  • Measure baseline performance
  • Identify bottleneck (shuffle, jobs, I/O)
  • Apply appropriate optimization
  • Verify correctness
  • Measure optimized performance
  • Calculate and document improvement
  • Explain why optimization helped (or didn’t)

Reporting Performance Results

From the project report template:
\section{Task A: Frequency of Favorite Hobby}
\subsection*{Simple Solution}
Read CircleNetPage, emit (FavoriteHobby, 1) in mapper, and sum in reducer.

\subsection*{Optimization Thinking and Implementation}
This task is safe for combiner because sum is associative and commutative.
We used the same sum reducer as combiner to reduce network shuffle.

\subsection*{Performance Results}
Simple version: X ms
Optimized version: Y ms
Improvement: Z% faster

Best Practices

  1. Consistent Environment: Run all benchmarks in the same environment
  2. Multiple Runs: Execute each version 3-5 times and average results
  3. Warm-up: First run may be slower due to JVM warm-up
  4. Clean State: Clear outputs between runs to avoid caching effects
  5. Document Everything: Record dataset sizes, cluster configuration, and Hadoop settings
Use the hdfs dfs -rm -r -f command before each run to ensure a clean slate.
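A hypothetical Python wrapper (the command lines and paths are placeholders, not part of the project) can automate the multiple-runs, warm-up, and clean-state practices above:

```python
import statistics
import subprocess
import time

def time_one_run(cmd, output_dir):
    """Clear the output directory (best practice 4), then run
    the given hadoop command and return elapsed milliseconds."""
    subprocess.run(["hdfs", "dfs", "-rm", "-r", "-f", output_dir], check=False)
    start = time.monotonic()
    subprocess.run(cmd, check=True)
    return (time.monotonic() - start) * 1000

def summarize(times_ms, discard_first=True):
    """Discard the JVM warm-up run (best practice 3) and
    report the median of the remaining runs."""
    runs = times_ms[1:] if discard_first and len(times_ms) > 1 else times_ms
    return statistics.median(runs)

# e.g. summarize([52000, 45100, 44900, 45300]) -> 45100 (warm-up dropped)
```

The median is used instead of the mean so a single outlier run does not skew the reported number.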
