This guide covers the Docker-based workflow for developing and running CircleNet Analytics MapReduce jobs.

Development Workflow Overview

CircleNet Analytics uses a containerized development workflow:
  1. Write Java Code: develop MapReduce jobs in IntelliJ IDEA or your preferred IDE.
  2. Build JAR: Maven compiles the Java code and packages it into a JAR file.
  3. Copy to Container: transfer the JAR file to the Docker container.
  4. Run on Hadoop: execute the compiled Java code inside the container using Hadoop.
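The four workflow steps can be sketched as one helper script. This is a minimal sketch, not part of the project: it assumes the container name ds503-container, Maven on the host PATH, and the JAR/class/HDFS paths used later in this guide. Setting DRY_RUN=1 prints each command instead of executing it, so you can preview the pipeline safely.

```shell
#!/bin/sh
# build-deploy-run.sh: sketch of the build -> copy -> run pipeline.
# Assumed names: ds503-container, ds503_bdm-1.0-SNAPSHOT.jar, circlenet.taskA.TaskA.
JAR=target/ds503_bdm-1.0-SNAPSHOT.jar
CONTAINER=ds503-container

run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "would run: $*"   # preview mode: print the command instead of executing it
  else
    "$@"
  fi
}

deploy() {
  run mvn clean package -DskipTests                       # step 2: build JAR
  run docker cp "$JAR" "$CONTAINER:/home/ds503/"          # step 3: copy to container
  run docker exec "$CONTAINER" \
    hadoop jar /home/ds503/ds503_bdm-1.0-SNAPSHOT.jar \
    circlenet.taskA.TaskA /circlenet/pages/CircleNetPage.csv \
    /circlenet/output/taskA/simple                        # step 4: run on Hadoop
}
```

Run `DRY_RUN=1` first to inspect the commands, then call `deploy` for real once they look right.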

Container Creation

Initial Setup

Create a container from the cs585-ds503-image with the required port mappings and volume mounts.
Container configuration:
docker run -d \
  --name ds503-container \
  -p 3000:22 \
  -p 3001:4040 \
  -p 3002:50070 \
  -p 3003:8080 \
  -p 3004:8081 \
  -v /path/to/local/data:/home/ds503/data \
  cs585-ds503-image

Port Mappings

  Local Port   Container Port   Service
  3000         22               SSH access
  3001         4040             Spark Jobs UI
  3002         50070            Hadoop NameNode Web UI
  3003         8080             Spark Master UI
  3004         8081             Spark Worker UI
Access the Hadoop Web UI at http://localhost:3002 to monitor HDFS and job progress.

Volume Mount

Mount your local dataset directory to the container:
  • Container path: /home/ds503/data
  • Local directory: Path to your generated CSV files
    • Windows example: D:\DATA_SCIENCE_2025\semester2\bdm_data\project1_resources\dataset_generation\my_output_full
    • Linux/Mac example: /home/user/datasets/circlenet
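On Windows, quote the host path in the -v mount so backslashes and spaces survive. A PowerShell variant of the docker run command above, reusing the Windows example path (the path itself is illustrative; substitute your own dataset directory):

```shell
# Windows PowerShell example (backtick = line continuation); host path is illustrative.
docker run -d `
  --name ds503-container `
  -p 3000:22 -p 3001:4040 -p 3002:50070 -p 3003:8080 -p 3004:8081 `
  -v "D:\DATA_SCIENCE_2025\semester2\bdm_data\project1_resources\dataset_generation\my_output_full:/home/ds503/data" `
  cs585-ds503-image
```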

Connecting to the Container

SSH Connection

ssh -p 3000 ds503@localhost
Credentials:
  • Username: ds503
  • Password: ds503

Verify Connection (Windows PowerShell)

Test-NetConnection -ComputerName localhost -Port 3002

Hadoop Cluster Management

Starting the Hadoop Cluster

  1. Connect via SSH:
ssh -p 3000 ds503@localhost
  2. Start the Hadoop services:
hadoop/sbin/start-all.sh
This starts:
  • NameNode
  • DataNode
  • ResourceManager
  • NodeManager
  3. Verify the cluster is running:
hdfs dfs -ls /
Expected output:
/tmp
/user

Stopping the Hadoop Cluster

hadoop/sbin/stop-all.sh
Always stop the Hadoop cluster gracefully before stopping the Docker container to avoid data corruption.

Building and Deploying Jobs

Project Structure

ds503_bdm/
├── pom.xml
└── src/main/java/
    └── circlenet/
        ├── taskA/
        │   ├── TaskA.java
        │   └── TaskAOptimized.java
        ├── taskB/
        │   ├── TaskBSimple.java
        │   └── TaskBOptimized.java
        ├── util/
        │   └── JobTimer.java
        └── common/
            └── CsvUtils.java

Building the JAR

  1. Open the Maven panel: in IntelliJ IDEA, open the Maven panel on the right side.
  2. Run Package: navigate to Lifecycle → Package, or use the command line:
mvn clean package -DskipTests
  3. Locate the JAR file. The built JAR is located at:
target/ds503_bdm-1.0-SNAPSHOT.jar

Copying JAR to Container

  1. Get the container ID:
docker ps
Example output:
CONTAINER ID   IMAGE              ...
dadc9d47d16e   cs585-ds503-image  ...
  2. Copy the JAR file (replace dadc9d47d16e with your actual container ID):
docker cp target/ds503_bdm-1.0-SNAPSHOT.jar dadc9d47d16e:/home/ds503/
  3. Verify the copy inside the container:
ls -lh /home/ds503/ds503_bdm-1.0-SNAPSHOT.jar
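A small check catches an empty or truncated JAR before you waste a job run on it. verify_jar below is a hypothetical helper, not part of the project; it fails when the file is missing or zero bytes.

```shell
# verify_jar PATH: fail fast if the JAR is missing or empty (hypothetical helper).
verify_jar() {
  if [ -s "$1" ]; then
    echo "OK: $1 ($(wc -c < "$1") bytes)"   # -s: file exists and is non-empty
  else
    echo "MISSING or empty: $1" >&2
    return 1
  fi
}
```

Inside the container: `verify_jar /home/ds503/ds503_bdm-1.0-SNAPSHOT.jar`.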

Environment Variables Setup

For convenience, set up environment variables inside the container:
# JAR file location
export JAR=/home/ds503/ds503_bdm-1.0-SNAPSHOT.jar

# HDFS input paths
export PAGES=/circlenet/pages/CircleNetPage.csv
export FOLLOWS=/circlenet/follows/Follows.csv
export ACTIVITY=/circlenet/activitylog/ActivityLog.csv

# HDFS output base directory
export OUT=/circlenet/output

# Local output directory (inside container)
export LOCAL_OUT=/home/ds503/results

# Timing collection file
export CIRCLENET_TIMING_FILE=/home/ds503/task_times.csv
Add these to ~/.bashrc to persist across sessions:
echo 'export JAR=/home/ds503/ds503_bdm-1.0-SNAPSHOT.jar' >> ~/.bashrc
source ~/.bashrc
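To persist the full set in one step, the exports above can be appended with a single heredoc (same values as the block above; run it once to avoid duplicate lines in ~/.bashrc):

```shell
# Append all CircleNet environment variables to ~/.bashrc in one go.
cat >> ~/.bashrc <<'EOF'
export JAR=/home/ds503/ds503_bdm-1.0-SNAPSHOT.jar
export PAGES=/circlenet/pages/CircleNetPage.csv
export FOLLOWS=/circlenet/follows/Follows.csv
export ACTIVITY=/circlenet/activitylog/ActivityLog.csv
export OUT=/circlenet/output
export LOCAL_OUT=/home/ds503/results
export CIRCLENET_TIMING_FILE=/home/ds503/task_times.csv
EOF
```

Then run `source ~/.bashrc` or open a new shell to pick up the variables.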

Running MapReduce Jobs

Basic Job Execution

hadoop jar $JAR circlenet.taskA.TaskA $PAGES $OUT/taskA/simple
Command breakdown:
  • hadoop jar $JAR - run a Hadoop job from the JAR file
  • circlenet.taskA.TaskA - fully qualified name of the class containing the main method
  • $PAGES - input path (HDFS)
  • $OUT/taskA/simple - output path (HDFS)
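Every job in this guide follows the same invocation shape, so the command can be assembled mechanically. build_cmd below is a hypothetical helper (not part of the project) that prints the full hadoop jar command for a given class, output subdirectory, and input paths, using the $JAR and $OUT variables from the environment setup above.

```shell
# build_cmd CLASS OUTPUT_SUBDIR INPUT... : print the hadoop invocation for one job.
# Hypothetical helper; relies on $JAR and $OUT from the environment setup.
build_cmd() {
  class=$1
  subdir=$2
  shift 2
  # "$*" joins the remaining arguments (the input paths) with spaces.
  printf 'hadoop jar %s %s %s %s/%s\n' "$JAR" "$class" "$*" "$OUT" "$subdir"
}
```

Example: `build_cmd circlenet.taskA.TaskA taskA/simple "$PAGES"` prints the command from the breakdown above; pipe it to `sh` (or eval it) to execute.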

Running Multiple Versions

# Simple version
hadoop jar $JAR circlenet.taskA.TaskA $PAGES $OUT/taskA/simple

# Optimized version
hadoop jar $JAR circlenet.taskA.TaskAOptimized $PAGES $OUT/taskA/optimized

Multi-Argument Jobs

Some tasks require multiple inputs:
# Task D: Pages and Follows as input
hadoop jar $JAR circlenet.taskD.TaskDSimple \
  $PAGES \
  $FOLLOWS \
  $OUT/taskD/simple

# Task B: Multiple intermediate outputs
hadoop jar $JAR circlenet.taskB.TaskBOptimized \
  $ACTIVITY \
  $PAGES \
  $OUT/taskB/tmp_count_opt \
  $OUT/taskB/optimized

Retrieving Results

View Results in HDFS

# List output files
hdfs dfs -ls $OUT/taskA/simple

# View first 20 lines
hdfs dfs -cat $OUT/taskA/simple/part-r-00000 | head -20

Copy Results to Container

# Create local output directory
mkdir -p $LOCAL_OUT/taskA

# Copy from HDFS to container local filesystem
hdfs dfs -get $OUT/taskA/simple $LOCAL_OUT/taskA/

Copy Results to Host Machine

# From host machine (PowerShell or bash)
docker cp dadc9d47d16e:/home/ds503/results/taskA ./results_taskA
Use the -f flag with hdfs dfs -get to overwrite existing local files:
hdfs dfs -get -f $OUT/taskA/simple $LOCAL_OUT/taskA/

Cleanup and Reset

Clean Output Directories

# Remove all task outputs
hdfs dfs -rm -r -f $OUT/taskA $OUT/taskB $OUT/taskC $OUT/taskD \
                    $OUT/taskE $OUT/taskF $OUT/taskG $OUT/taskH

Reset for Fresh Benchmark

# Remove timing file
rm -f $CIRCLENET_TIMING_FILE

# Remove local results
rm -rf $LOCAL_OUT/*

# Remove HDFS outputs
hdfs dfs -rm -r -f $OUT/*

# Recreate local output directory
mkdir -p $LOCAL_OUT
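Because this reset deletes the timing file, local results, and every HDFS output, a confirmation guard is a cheap safety net. reset_benchmark is a hypothetical wrapper around the commands above; it refuses to run unless explicitly confirmed with --yes.

```shell
# reset_benchmark [--yes]: destructive reset; refuses to run without --yes.
# Hypothetical wrapper around the cleanup commands above.
reset_benchmark() {
  if [ "$1" != "--yes" ]; then
    echo "refusing to reset without --yes (would delete timing file, local results, HDFS outputs)" >&2
    return 1
  fi
  rm -f "$CIRCLENET_TIMING_FILE"
  rm -rf "${LOCAL_OUT:?}"/*        # ${VAR:?} aborts if LOCAL_OUT is unset/empty
  hdfs dfs -rm -r -f "$OUT"/*
  mkdir -p "$LOCAL_OUT"
}
```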

Troubleshooting

Container Won’t Start

Check if ports 3000-3004 are available:
# Windows
netstat -ano | findstr :3000

# Linux/Mac
lsof -i :3000
Change port mappings if needed:
docker run -p 3010:22 ...
Ensure the local directory exists and has proper permissions:
# Create directory if it doesn't exist
mkdir -p /path/to/local/data

# Set permissions (Linux/Mac)
chmod 755 /path/to/local/data

Hadoop Cluster Issues

  1. Check that the required daemons are running:
jps
Should show: NameNode, DataNode, ResourceManager, NodeManager
  2. If any are missing, restart Hadoop:
hadoop/sbin/stop-all.sh
hadoop/sbin/start-all.sh
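Checking the jps listing by eye is error-prone in scripts. check_daemons (hypothetical helper) reports which of the four required daemons are absent from a jps listing passed to it as a string:

```shell
# check_daemons "JPS_OUTPUT": print OK, or the daemons missing from the listing.
# Hypothetical helper; pass it the output of `jps`.
check_daemons() {
  missing=""
  for d in NameNode DataNode ResourceManager NodeManager; do
    case "$1" in
      *"$d"*) ;;                    # daemon name present in the listing
      *) missing="$missing $d" ;;   # daemon name absent
    esac
  done
  if [ -z "$missing" ]; then echo "OK"; else echo "missing:$missing"; fi
}
```

Usage inside the container: `check_daemons "$(jps)"`.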
HDFS may enter safe mode on startup:
# Check safe mode status
hdfs dfsadmin -safemode get

# Leave safe mode
hdfs dfsadmin -safemode leave
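Scripts that run jobs right after startup often need to wait for safe mode to clear rather than force-leave it. safemode_off (hypothetical helper) checks the status string printed by hdfs dfsadmin -safemode get, which reports "Safe mode is ON" or "Safe mode is OFF":

```shell
# safemode_off "STATUS": succeed iff the status string says safe mode is OFF.
# Hypothetical helper for use in a wait loop.
safemode_off() {
  case "$1" in
    *"Safe mode is OFF"*) return 0 ;;
    *) return 1 ;;
  esac
}
```

Example wait loop: `until safemode_off "$(hdfs dfsadmin -safemode get)"; do sleep 2; done`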

Job Execution Issues

Ensure:
  1. JAR was built with mvn clean package
  2. JAR was copied to container
  3. Fully qualified class name is correct
# List classes in JAR
jar tf $JAR | grep Task
Hadoop won’t overwrite existing output directories:
# Remove existing output
hdfs dfs -rm -r $OUT/taskA/simple

# Then run job again
hadoop jar $JAR circlenet.taskA.TaskA $PAGES $OUT/taskA/simple

Quick Reference Commands

Docker Commands

# Start container
docker start ds503-container

# Stop container
docker stop ds503-container

# View container logs
docker logs ds503-container

# Get container ID
docker ps

# Copy file to container
docker cp <local_file> <container_id>:<container_path>

# Copy file from container
docker cp <container_id>:<container_path> <local_path>

Hadoop Commands

# Start cluster
hadoop/sbin/start-all.sh

# Stop cluster
hadoop/sbin/stop-all.sh

# Run job
hadoop jar $JAR <main_class> <args>

# Check Java processes
jps
