This guide explains how to benchmark CircleNet Analytics jobs and compare the performance of different optimization approaches.
All CircleNet Analytics tasks include built-in timing instrumentation using the JobTimer utility class.
JobTimer Implementation

```java
import java.io.FileWriter;
import java.io.IOException;

import org.apache.hadoop.mapreduce.Job;

public class JobTimer {

    // Run a job and measure execution time
    public static boolean run(Job job, String taskName) throws Exception {
        long start = System.currentTimeMillis();
        boolean success = job.waitForCompletion(true);
        long elapsed = System.currentTimeMillis() - start;
        System.out.println(taskName + ",job,time_ms=" + elapsed);
        // Optionally append a record to the timing file: taskName,phase,time_ms
        log(taskName + ",job,time_ms=" + elapsed);
        return success;
    }

    // Record total time including all jobs
    public static void total(String taskName, long totalStart) {
        long totalElapsed = System.currentTimeMillis() - totalStart;
        System.out.println(taskName + ",total,total_time_ms=" + totalElapsed);
        log(taskName + ",total,total_time_ms=" + totalElapsed);
    }

    // Append one CSV record if CIRCLENET_TIMING_FILE is set
    private static void log(String record) {
        String timingFile = System.getenv("CIRCLENET_TIMING_FILE");
        if (timingFile == null) {
            return;
        }
        try (FileWriter out = new FileWriter(timingFile, true)) {
            out.write(record + "\n");
        } catch (IOException e) {
            // Timing is best-effort; never fail the task over logging
        }
    }
}
```
Using JobTimer in Tasks

```java
public static void main(String[] args) throws Exception {
    long totalStart = System.currentTimeMillis();
    // ... configure and run job ...
    boolean ok = JobTimer.run(job, "TaskAOptimized");
    JobTimer.total("TaskAOptimized", totalStart);
    System.exit(ok ? 0 : 1);
}
```
For multi-job tasks:

```java
long totalStart = System.currentTimeMillis();

// Job 1: Count accesses
if (!JobTimer.run(job1, "TaskBOptimized-job1-count")) {
    JobTimer.total("TaskBOptimized", totalStart);
    System.exit(1);
}

// Job 2: Top 10 selection with map-side join
boolean ok = JobTimer.run(job2, "TaskBOptimized-job2-top10-join");
JobTimer.total("TaskBOptimized", totalStart);
```
Benchmarking Environment Setup
Set Environment Variables
Configure paths and enable timing collection:

```bash
export JAR=/home/ds503/ds503_bdm-1.0-SNAPSHOT.jar
export PAGES=/circlenet/pages/CircleNetPage.csv
export FOLLOWS=/circlenet/follows/Follows.csv
export ACTIVITY=/circlenet/activitylog/ActivityLog.csv
export OUT=/circlenet/output
export CIRCLENET_TIMING_FILE=/home/ds503/task_times.csv
```
Reset for Fresh Benchmark
Clean up previous outputs and timing data:

```bash
# Remove timing file
rm -f $CIRCLENET_TIMING_FILE

# Remove output directories
hdfs dfs -rm -r -f $OUT/taskA $OUT/taskB $OUT/taskC $OUT/taskD \
  $OUT/taskE $OUT/taskF $OUT/taskG $OUT/taskH
```
Run Benchmark Suite
Execute both simple and optimized versions of each task
Collect and Analyze Results
Extract timing data and compare performance
Running Benchmarks
Single Task Benchmark
```bash
# Task A: Hobby Frequency
hadoop jar $JAR circlenet.taskA.TaskA $PAGES $OUT/taskA/simple
hadoop jar $JAR circlenet.taskA.TaskAOptimized $PAGES $OUT/taskA/optimized
```
Multi-Job Task Benchmark
For tasks requiring multiple MapReduce jobs:
```bash
# Task B: Top 10 Most Accessed Pages

# Simple version (3 jobs: count, join, top10)
hadoop jar $JAR circlenet.taskB.TaskBSimple \
  $ACTIVITY $PAGES \
  $OUT/taskB/tmp_count_simple \
  $OUT/taskB/tmp_join_simple \
  $OUT/taskB/simple

# Optimized version (2 jobs: count with combiner, top10 with map-side join)
hadoop jar $JAR circlenet.taskB.TaskBOptimized \
  $ACTIVITY $PAGES \
  $OUT/taskB/tmp_count_opt \
  $OUT/taskB/optimized
```
Analyzing Results
View Timing Data
```bash
# View all timing records
cat $CIRCLENET_TIMING_FILE

# Filter for total times only
rg ",total," $CIRCLENET_TIMING_FILE

# Example output:
# TaskA,total,total_time_ms=45231
# TaskAOptimized,total,total_time_ms=32156
# TaskB,total,total_time_ms=123450
# TaskBOptimized,total,total_time_ms=87342
```
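The improvement percentage can also be computed directly from these records; a minimal Java sketch, assuming the `name,total,total_time_ms=N` format shown above (the `TimingCompare` class and its helper names are illustrative, not part of the project code):

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class TimingCompare {

    // Parse "name,total,total_time_ms=N" records into a name -> millis map
    static Map<String, Long> parseTotals(List<String> lines) {
        Map<String, Long> totals = new LinkedHashMap<>();
        for (String line : lines) {
            String[] parts = line.split(",");
            if (parts.length == 3 && parts[1].equals("total")) {
                long ms = Long.parseLong(parts[2].substring(parts[2].indexOf('=') + 1));
                totals.put(parts[0], ms);
            }
        }
        return totals;
    }

    // Percentage improvement of the optimized run over the simple run
    static double improvement(long simpleMs, long optimizedMs) {
        return 100.0 * (simpleMs - optimizedMs) / simpleMs;
    }

    public static void main(String[] args) {
        Map<String, Long> totals = parseTotals(List.of(
            "TaskA,total,total_time_ms=45231",
            "TaskAOptimized,total,total_time_ms=32156"));
        System.out.printf("TaskA: %.1f%% faster%n",
            improvement(totals.get("TaskA"), totals.get("TaskAOptimized")));
        // prints "TaskA: 28.9% faster"
    }
}
```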
Copy Results to Host
```bash
# Copy timing file from container
docker cp dadc9d47d16e:$CIRCLENET_TIMING_FILE ./task_times.csv
```
Comparison Table Template
Based on the project report structure:
| Task | Description | Simple (ms) | Optimized (ms) | Improvement | Key Optimization |
|---|---|---|---|---|---|
| A | Hobby frequency | - | - | - | Combiner |
| B | Top 10 pages | - | - | - | Map-side join + combiner |
| C | Filter by hobby | - | - | - | Map-only job |
| D | Popularity factor | - | - | - | Count + map-side join |
| E | Favorites behavior | - | - | - | Dedupe + combine |
| F | Above average | - | - | - | Cache lookup + filter |
| G | Outdated pages | - | - | - | Map-only + threshold |
| H | One-way follows | - | - | - | Cache-based join |
```python
# Example Python script to analyze timing data
import pandas as pd

df = pd.read_csv('task_times.csv', names=['task', 'phase', 'time_ms'])
# The third field looks like "time_ms=45231"; keep only the number
df['time_ms'] = df['time_ms'].str.split('=').str[-1].astype(int)

# Filter for total times
totals = df[df['phase'] == 'total']

# Compare each simple total against its optimized counterpart
for task in ['TaskA', 'TaskB', 'TaskC', 'TaskD', 'TaskE', 'TaskF', 'TaskG', 'TaskH']:
    # Simple totals may be logged as e.g. "TaskA" or "TaskASimple"
    simple = totals[totals['task'].isin([task, task + 'Simple'])]['time_ms'].values
    optimized = totals[totals['task'] == task + 'Optimized']['time_ms'].values
    if len(simple) > 0 and len(optimized) > 0:
        improvement = ((simple[0] - optimized[0]) / simple[0]) * 100
        print(f"{task}: {improvement:.1f}% faster")
```
Correctness Verification
Always verify correctness before comparing performance! An incorrect optimization is worthless.
Verification Procedure
Copy Both Outputs
```bash
hdfs dfs -get -f $OUT/taskA/simple /tmp/taskA_simple
hdfs dfs -get -f $OUT/taskA/optimized /tmp/taskA_optimized
```
Canonicalize Output
Sort outputs to compare regardless of order:

```bash
cat /tmp/taskA_simple/part-* | sort > /tmp/taskA_simple.txt
cat /tmp/taskA_optimized/part-* | sort > /tmp/taskA_optimized.txt
```
Compare Results
```bash
diff -u /tmp/taskA_simple.txt /tmp/taskA_optimized.txt
```
No output means the results are identical!
For tasks with intentionally different output formats, compare only the key columns or counts.
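Key-column comparison can be scripted too; a hedged Java sketch, assuming tab-separated output with the key in the first field (the `KeyCompare` class and the file paths are placeholders for illustration):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Set;
import java.util.TreeSet;

public class KeyCompare {

    // Collect the first tab-separated field of every non-empty line
    static Set<String> keys(Path file) throws IOException {
        Set<String> keys = new TreeSet<>();
        for (String line : Files.readAllLines(file)) {
            if (!line.isEmpty()) {
                keys.add(line.split("\t", 2)[0]);
            }
        }
        return keys;
    }

    public static void main(String[] args) throws IOException {
        Set<String> simple = keys(Path.of("/tmp/taskA_simple.txt"));
        Set<String> optimized = keys(Path.of("/tmp/taskA_optimized.txt"));
        System.out.println(simple.equals(optimized)
            ? "Key sets match"
            : "Key sets differ");
    }
}
```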
High Impact Optimizations:
- Adding combiners to aggregation jobs (Task A, B, D)
- Converting to map-only jobs (Task C, Task G final stage)
- Map-side joins replacing reduce-side joins (Task B, D, F, H)
Low/No Impact Optimizations:
- Combiners on map-only jobs (no effect)
- Map-side joins when both datasets are large
- Optimizations on small datasets (overhead dominates)
When Optimizations May Not Help
From the project findings:
- Task C Optimized: No real gain over simple version on small datasets
- Task E Optimized: Actually slower due to additional job overhead
- Always benchmark your specific use case!
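The Task E result is easy to model with back-of-envelope arithmetic; a minimal Java sketch, where the startup cost and per-record throughput are made-up assumptions for illustration, not measurements:

```java
public class JobOverheadModel {

    // Rough model: each MapReduce job pays a fixed startup cost,
    // plus time proportional to the records it processes
    static long estimateMs(int jobs, long records,
                           long startupMsPerJob, double msPerRecord) {
        return jobs * startupMsPerJob + (long) (records * msPerRecord);
    }

    public static void main(String[] args) {
        long startup = 15_000;    // assumed JVM/container startup per job
        long smallInput = 10_000; // records

        // "Optimized" variant adds a job but halves per-record work
        long simple = estimateMs(1, smallInput, startup, 1.0);
        long optimized = estimateMs(2, smallInput, startup, 0.5);

        // On small inputs the extra startup cost dominates the saved work
        System.out.println("simple=" + simple + "ms optimized=" + optimized + "ms");
        // prints "simple=25000ms optimized=35000ms"
    }
}
```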
Optimization Decision Framework
Questions to Ask

1. What is the bottleneck?
   - Shuffle phase → Try combiner or map-side join
   - Too many jobs → Consolidate with caching
   - No aggregation → Use map-only job
2. What are the data sizes?
   - Small reference data → Map-side join
   - Large datasets → Combiners
   - Filtering only → Map-only
3. What is the operation type?
   - SUM/COUNT/MAX → Safe for combiner
   - AVG/DISTINCT → Need careful approach
   - Filter/Transform → Map-only candidate
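The AVG caveat deserves a concrete illustration; a minimal Java sketch (with illustrative values) showing why averaging per-split averages is wrong, while carrying (sum, count) pairs through a combiner-style merge stays correct:

```java
public class AvgCombinerSafety {

    // A combiner-safe partial aggregate for AVG: carry sum and count,
    // divide only once at the very end
    record SumCount(long sum, long count) {
        SumCount merge(SumCount other) {
            return new SumCount(sum + other.sum, count + other.count);
        }
        double average() {
            return (double) sum / count;
        }
    }

    public static void main(String[] args) {
        // Two "map tasks" with uneven group sizes
        // split 1: {10, 20, 30} -> avg 20; split 2: {100} -> avg 100

        // WRONG: average of the per-split averages
        double naive = (20.0 + 100.0) / 2;   // 60.0

        // RIGHT: merge (sum, count) pairs, then divide
        SumCount a = new SumCount(10 + 20 + 30, 3);
        SumCount b = new SumCount(100, 1);
        double correct = a.merge(b).average(); // 160 / 4 = 40.0

        System.out.println("naive=" + naive + " correct=" + correct);
        // prints "naive=60.0 correct=40.0"
    }
}
```

The same reasoning explains the SUM/COUNT/MAX row above: those operations are associative and commutative on their own, so the reducer can double as the combiner unchanged.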
Optimization Checklist
From the project report template:
```latex
\section{Task A: Frequency of Favorite Hobby}

\subsection*{Simple Solution}
Read CircleNetPage, emit (FavoriteHobby, 1) in the mapper, and sum in the reducer.

\subsection*{Optimization Thinking and Implementation}
This task is safe for a combiner because sum is associative and commutative.
We used the same sum reducer as the combiner to reduce network shuffle.

\subsection*{Performance Results}
Simple version: X ms \\
Optimized version: Y ms \\
Improvement: Z\% faster
```
Best Practices
- Consistent Environment: Run all benchmarks in the same environment
- Multiple Runs: Execute each version 3-5 times and average results
- Warm-up: First run may be slower due to JVM warm-up
- Clean State: Clear outputs between runs to avoid caching effects
- Document Everything: Record dataset sizes, cluster configuration, and Hadoop settings
Use the `hdfs dfs -rm -r -f` command before each run to ensure a clean slate.
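The multiple-runs and warm-up advice can be applied when post-processing the timing file; a small Java sketch that drops the first (warm-up) run and averages the rest (the `RunAverager` class and the sample timings are illustrative):

```java
import java.util.List;

public class RunAverager {

    // Average repeated timings, discarding the first run as JVM warm-up
    static double steadyStateMean(List<Long> runsMs) {
        return runsMs.stream()
                     .skip(1) // drop warm-up run
                     .mapToLong(Long::longValue)
                     .average()
                     .orElseThrow();
    }

    public static void main(String[] args) {
        // e.g. five timings of the same task collected from task_times.csv
        List<Long> runs = List.of(52_000L, 45_100L, 44_900L, 45_300L, 44_700L);
        System.out.printf("steady-state mean: %.0f ms%n", steadyStateMean(runs));
        // prints "steady-state mean: 45000 ms"
    }
}
```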
Next Steps