This guide covers the Docker-based workflow for developing and running CircleNet Analytics MapReduce jobs.
Development Workflow Overview
CircleNet Analytics uses a containerized development workflow:
1. Write Java Code - Develop MapReduce jobs in IntelliJ IDEA or your preferred IDE
2. Build JAR - Maven compiles the Java code and packages it into a JAR file
3. Copy to Container - Transfer the JAR file to the Docker container
4. Run on Hadoop - Execute the compiled Java code inside the container using Hadoop
Container Creation
Initial Setup
Create a container from the cs585-ds503-image with proper port mappings and volume mounts.
Container Configuration:
docker run -d \
--name ds503-container \
-p 3000:22 \
-p 3001:4040 \
-p 3002:50070 \
-p 3003:8080 \
-p 3004:8081 \
-v /path/to/local/data:/home/ds503/data \
cs585-ds503-image
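If you recreate the container often, the same command can be assembled by a small helper instead of retyping every flag. A minimal sketch, assuming the container name, image name, and port mappings used in this guide (`build_run_cmd` and the `ports` array are illustrative, not part of the project):

```shell
#!/usr/bin/env bash
# Port mappings from this guide: host:container pairs.
ports=("3000:22" "3001:4040" "3002:50070" "3003:8080" "3004:8081")

# Assemble the full `docker run` command for a given local data directory.
build_run_cmd() {
  local data_dir="$1"
  local cmd="docker run -d --name ds503-container"
  local p
  for p in "${ports[@]}"; do
    cmd+=" -p $p"
  done
  cmd+=" -v $data_dir:/home/ds503/data cs585-ds503-image"
  echo "$cmd"
}

# Example: print the command instead of running it, so you can review it first.
build_run_cmd /home/user/datasets/circlenet
```

Printing the command before running it makes it easy to double-check the volume path on Windows vs. Linux/Mac.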
Port Mappings
Local Port   Container Port   Service
3000         22               SSH access
3001         4040             Spark Jobs UI
3002         50070            Hadoop NameNode Web UI
3003         8080             Spark Master UI
3004         8081             Spark Worker UI
Access the Hadoop Web UI at http://localhost:3002 to monitor HDFS and job progress.
Volume Mount
Mount your local dataset directory to the container:
Container path: /home/ds503/data
Local directory: Path to your generated CSV files
Windows example: D:\DATA_SCIENCE_2025\semester2\bdm_data\project1_resources\dataset_generation\my_output_full
Linux/Mac example: /home/user/datasets/circlenet
Connecting to the Container
SSH Connection
ssh -p 3000 ds503@localhost
Credentials:
Username: ds503
Password: ds503
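To avoid retyping the port and user on every connection, an entry in your host's ~/.ssh/config works well; the alias ds503-dev below is an arbitrary choice:

```
# ~/.ssh/config on the host machine
Host ds503-dev
    HostName localhost
    Port 3000
    User ds503
```

After that, `ssh ds503-dev` opens the same session as the full command above.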
Verify Connection (Windows PowerShell)
Test-NetConnection -ComputerName localhost -Port 3002
Hadoop Cluster Management
Starting the Hadoop Cluster
Connect via SSH
ssh -p 3000 ds503@localhost
Start Hadoop Services
hadoop/sbin/start-all.sh
This starts:
NameNode
DataNode
ResourceManager
NodeManager
Verify Cluster is Running
jps
The output should list NameNode, DataNode, ResourceManager, and NodeManager.
Stopping the Hadoop Cluster
Always stop the Hadoop cluster gracefully before stopping the Docker container to avoid data corruption.
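The two-step shutdown (Hadoop first, then the container) can be wrapped in one host-side function so the steps are never done out of order. A sketch, assuming the SSH port mapping and container name from this guide (`shutdown_stack` is an illustrative name):

```shell
#!/usr/bin/env bash
# Stop Hadoop inside the container first, then stop the container itself.
# Port, user, and container name follow this guide; adjust if you changed them.
shutdown_stack() {
  ssh -p 3000 ds503@localhost 'hadoop/sbin/stop-all.sh' && \
    docker stop ds503-container
}

# Usage (from the host machine):
# shutdown_stack
```

The `&&` ensures `docker stop` only runs if the Hadoop shutdown succeeded.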
Building and Deploying Jobs
Project Structure
ds503_bdm/
├── pom.xml
└── src/main/java/
└── circlenet/
├── taskA/
│ ├── TaskA.java
│ └── TaskAOptimized.java
├── taskB/
│ ├── TaskBSimple.java
│ └── TaskBOptimized.java
├── util/
│ └── JobTimer.java
└── common/
└── CsvUtils.java
Building the JAR
Open Maven Panel
In IntelliJ IDEA, open the Maven panel on the right side
Run Package
Navigate to: Lifecycle → Package
Or use the command line: mvn clean package -DskipTests
Locate JAR File
Built JAR is located at: target/ds503_bdm-1.0-SNAPSHOT.jar
Copying JAR to Container
Get Container ID
docker ps
Example output:
CONTAINER ID IMAGE ...
dadc9d47d16e cs585-ds503-image ...
Copy JAR File
docker cp target/ds503_bdm-1.0-SNAPSHOT.jar dadc9d47d16e:/home/ds503/
Replace dadc9d47d16e with your actual container ID.
Verify Copy
Inside the container: ls -lh /home/ds503/ds503_bdm-1.0-SNAPSHOT.jar
Environment Variables Setup
For convenience, set up environment variables inside the container:
# JAR file location
export JAR=/home/ds503/ds503_bdm-1.0-SNAPSHOT.jar
# HDFS input paths
export PAGES=/circlenet/pages/CircleNetPage.csv
export FOLLOWS=/circlenet/follows/Follows.csv
export ACTIVITY=/circlenet/activitylog/ActivityLog.csv
# HDFS output base directory
export OUT=/circlenet/output
# Local output directory (inside container)
export LOCAL_OUT=/home/ds503/results
# Timing collection file
export CIRCLENET_TIMING_FILE=/home/ds503/task_times.csv
Add these to ~/.bashrc to persist across sessions:
echo 'export JAR=/home/ds503/ds503_bdm-1.0-SNAPSHOT.jar' >> ~/.bashrc
source ~/.bashrc
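Appending blindly to ~/.bashrc duplicates lines every time you repeat the setup. A sketch of an idempotent alternative (`persist_exports` is an illustrative helper; the rc-file path is a parameter so you can point it anywhere):

```shell
#!/usr/bin/env bash
# Append each line to the given rc file only if it is not already present.
persist_exports() {
  local rc="$1"; shift
  local line
  for line in "$@"; do
    grep -qxF "$line" "$rc" 2>/dev/null || echo "$line" >> "$rc"
  done
}

# Usage: safe to re-run; each export ends up in ~/.bashrc exactly once.
# persist_exports ~/.bashrc \
#   'export JAR=/home/ds503/ds503_bdm-1.0-SNAPSHOT.jar' \
#   'export OUT=/circlenet/output'
```

`grep -qxF` matches the whole line literally, so re-running the setup never grows the file.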
Running MapReduce Jobs
Basic Job Execution
hadoop jar $JAR circlenet.taskA.TaskA $PAGES $OUT/taskA/simple
Command breakdown:
hadoop jar $JAR - Run a Hadoop job from JAR file
circlenet.taskA.TaskA - Fully qualified class name with main method
$PAGES - Input path (HDFS)
$OUT/taskA/simple - Output path (HDFS)
Running Multiple Versions
# Simple version
hadoop jar $JAR circlenet.taskA.TaskA $PAGES $OUT/taskA/simple
# Optimized version
hadoop jar $JAR circlenet.taskA.TaskAOptimized $PAGES $OUT/taskA/optimized
Multi-Argument Jobs
Some tasks require multiple inputs:
# Task D: Pages and Follows as input
hadoop jar $JAR circlenet.taskD.TaskDSimple \
  $PAGES \
  $FOLLOWS \
  $OUT/taskD/simple
# Task B: Multiple intermediate outputs
hadoop jar $JAR circlenet.taskB.TaskBOptimized \
  $ACTIVITY \
  $PAGES \
  $OUT/taskB/tmp_count_opt \
  $OUT/taskB/optimized
Retrieving Results
View Results in HDFS
# List output files
hdfs dfs -ls $OUT/taskA/simple
# View first 20 lines
hdfs dfs -cat $OUT/taskA/simple/part-r-00000 | head -20
Copy Results to Container
# Create local output directory
mkdir -p $LOCAL_OUT/taskA
# Copy from HDFS to container local filesystem
hdfs dfs -get $OUT/taskA/simple $LOCAL_OUT/taskA/
Copy Results to Host Machine
# From host machine (PowerShell or bash)
docker cp dadc9d47d16e:/home/ds503/results/taskA ./results_taskA
Use the -f flag with hdfs dfs -get to overwrite existing local files: hdfs dfs -get -f $OUT/taskA/simple $LOCAL_OUT/taskA/
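Fetching several task outputs one by one gets repetitive; the steps above can be folded into a loop over task paths. A sketch, assuming the OUT and LOCAL_OUT variables set earlier (`fetch_results` is an illustrative name):

```shell
#!/usr/bin/env bash
# Pull each HDFS output directory into the matching spot under $LOCAL_OUT.
# Uses -f so re-runs overwrite previously fetched results.
fetch_results() {
  local task dest
  for task in "$@"; do
    dest="${LOCAL_OUT:?}/$(dirname "$task")"
    mkdir -p "$dest"
    hdfs dfs -get -f "${OUT:?}/$task" "$dest/"
  done
}

# Usage:
# fetch_results taskA/simple taskA/optimized taskB/optimized
```

The `${VAR:?}` expansions abort with an error if OUT or LOCAL_OUT is unset, rather than writing to an unexpected path.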
Cleanup and Reset
Clean Output Directories
# Remove all task outputs
hdfs dfs -rm -r -f $OUT/taskA $OUT/taskB $OUT/taskC $OUT/taskD \
  $OUT/taskE $OUT/taskF $OUT/taskG $OUT/taskH
Reset for Fresh Benchmark
# Remove timing file
rm -f $CIRCLENET_TIMING_FILE
# Remove local results
rm -rf $LOCAL_OUT/*
# Remove HDFS outputs
hdfs dfs -rm -r -f $OUT/*
# Recreate local output directory
mkdir -p $LOCAL_OUT
Troubleshooting
Container Won’t Start
Check if ports 3000-3004 are available:
# Windows
netstat -ano | findstr :3000
# Linux/Mac
lsof -i :3000
Change port mappings if needed: docker run -p 3010:22 ...
Volume Mount Permission Denied
Ensure the local directory exists and has proper permissions:
# Create directory if it doesn't exist
mkdir -p /path/to/local/data
# Set permissions (Linux/Mac)
chmod 755 /path/to/local/data
Hadoop Cluster Issues
Check if NameNode is running: jps
The output should show: NameNode, DataNode, ResourceManager, NodeManager
Restart Hadoop:
hadoop/sbin/stop-all.sh
hadoop/sbin/start-all.sh
Safe Mode Preventing Writes
HDFS may enter safe mode on startup:
# Check safe mode status
hdfs dfsadmin -safemode get
# Leave safe mode
hdfs dfsadmin -safemode leave
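Instead of forcing safe mode off, you can also poll until HDFS leaves it on its own. A sketch (`wait_safemode_off` is an illustrative helper; it matches on the "Safe mode is OFF" line that `hdfs dfsadmin -safemode get` prints):

```shell
#!/usr/bin/env bash
# Poll safe-mode status every 2 seconds, up to a bounded number of tries.
wait_safemode_off() {
  local tries="${1:-30}"
  while [ "$tries" -gt 0 ]; do
    if hdfs dfsadmin -safemode get | grep -q 'Safe mode is OFF'; then
      return 0
    fi
    sleep 2
    tries=$((tries - 1))
  done
  return 1   # still in safe mode after the timeout
}

# Usage:
# wait_safemode_off 30 && echo "HDFS ready for writes"
```

This is gentler than `-safemode leave`, which skips the block-report checks safe mode exists for.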
Job Execution Issues
Ensure:
JAR was built with mvn clean package
JAR was copied to container
Fully qualified class name is correct
# List classes in JAR
jar tf $JAR | grep Task
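These checks can be rolled into one pre-flight helper that runs before every submission. A sketch (`check_job` is an illustrative name; it converts the dotted class name into the slash-separated path form that `jar tf` prints):

```shell
#!/usr/bin/env bash
# Verify the JAR exists and contains the requested main class before submitting.
check_job() {
  local jarfile="$1" cls="$2"
  [ -f "$jarfile" ] || { echo "missing JAR: $jarfile" >&2; return 1; }
  # Turn circlenet.taskA.TaskA into circlenet/taskA/TaskA.class and look it up.
  jar tf "$jarfile" | grep -qF "${cls//.//}.class" \
    || { echo "class not found: $cls" >&2; return 1; }
}

# Usage:
# check_job "$JAR" circlenet.taskA.TaskA && \
#   hadoop jar "$JAR" circlenet.taskA.TaskA $PAGES $OUT/taskA/simple
```

Catching a typo in the class name here is faster than waiting for the Hadoop launcher to fail.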
Output Directory Already Exists
Hadoop won’t overwrite existing output directories:
# Remove existing output
hdfs dfs -rm -r $OUT/taskA/simple
# Then run job again
hadoop jar $JAR circlenet.taskA.TaskA $PAGES $OUT/taskA/simple
Quick Reference Commands
Docker Commands
# Start container
docker start ds503-container
# Stop container
docker stop ds503-container
# View container logs
docker logs ds503-container
# Get container ID
docker ps
# Copy file to container
docker cp <local_file> <container_id>:<container_path>
# Copy file from container
docker cp <container_id>:<container_path> <local_path>
Hadoop Commands
# Start cluster
hadoop/sbin/start-all.sh
# Stop cluster
hadoop/sbin/stop-all.sh
# Run job
hadoop jar $JAR <main_class> <args>
# Check Java processes
jps
Next Steps