This guide covers the Docker-based workflow for developing and running CircleNet Analytics MapReduce jobs.

Development Workflow Overview

CircleNet Analytics uses a containerized development workflow:
  1. Write Java Code: develop MapReduce jobs in IntelliJ IDEA or your preferred IDE.
  2. Build JAR: Maven compiles the Java code and packages it into a JAR file.
  3. Copy to Container: transfer the JAR file to the Docker container.
  4. Run on Hadoop: execute the compiled Java code inside the container using Hadoop.
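The four workflow steps can be sketched as one helper script. This is a minimal sketch, not part of the project: it assumes the container name ds503-container, Maven on the host PATH, and the JAR/class/HDFS paths used later in this guide. Setting DRY_RUN=1 prints each command instead of executing it, so you can preview the pipeline safely.

```shell
#!/bin/sh
# build-deploy-run.sh: sketch of the build -> copy -> run pipeline.
# Assumed names: ds503-container, ds503_bdm-1.0-SNAPSHOT.jar, circlenet.taskA.TaskA.
JAR=target/ds503_bdm-1.0-SNAPSHOT.jar
CONTAINER=ds503-container

run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "would run: $*"   # preview mode: print the command instead of executing it
  else
    "$@"
  fi
}

deploy() {
  run mvn clean package -DskipTests                       # step 2: build JAR
  run docker cp "$JAR" "$CONTAINER:/home/ds503/"          # step 3: copy to container
  run docker exec "$CONTAINER" \
    hadoop jar /home/ds503/ds503_bdm-1.0-SNAPSHOT.jar \
    circlenet.taskA.TaskA /circlenet/pages/CircleNetPage.csv \
    /circlenet/output/taskA/simple                        # step 4: run on Hadoop
}
```

Run `DRY_RUN=1` first to inspect the commands, then call `deploy` for real once they look right.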

Container Creation

Initial Setup

Create a container from the cs585-ds503-image with the required port mappings and volume mounts.
Container configuration:
docker run -d \
  --name ds503-container \
  -p 3000:22 \
  -p 3001:4040 \
  -p 3002:50070 \
  -p 3003:8080 \
  -p 3004:8081 \
  -v /path/to/local/data:/home/ds503/data \
  cs585-ds503-image

Port Mappings

  Local Port   Container Port   Service
  3000         22               SSH access
  3001         4040             Spark Jobs UI
  3002         50070            Hadoop NameNode Web UI
  3003         8080             Spark Master UI
  3004         8081             Spark Worker UI
Access the Hadoop Web UI at http://localhost:3002 to monitor HDFS and job progress.

Volume Mount

Mount your local dataset directory to the container:
  • Container path: /home/ds503/data
  • Local directory: Path to your generated CSV files
    • Windows example: D:\DATA_SCIENCE_2025\semester2\bdm_data\project1_resources\dataset_generation\my_output_full
    • Linux/Mac example: /home/user/datasets/circlenet
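On Windows, quote the host path in the -v mount so backslashes and spaces survive. A PowerShell variant of the docker run command above, reusing the Windows example path (the path itself is illustrative; substitute your own dataset directory):

```shell
# Windows PowerShell example (backtick = line continuation); host path is illustrative.
docker run -d `
  --name ds503-container `
  -p 3000:22 -p 3001:4040 -p 3002:50070 -p 3003:8080 -p 3004:8081 `
  -v "D:\DATA_SCIENCE_2025\semester2\bdm_data\project1_resources\dataset_generation\my_output_full:/home/ds503/data" `
  cs585-ds503-image
```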

Connecting to the Container

SSH Connection

ssh -p 3000 ds503@localhost
Credentials:
  • Username: ds503
  • Password: ds503

Verify Connection (Windows PowerShell)

Test-NetConnection -ComputerName localhost -Port 3002

Hadoop Cluster Management

Starting the Hadoop Cluster

  1. Connect via SSH:
ssh -p 3000 ds503@localhost
  2. Start the Hadoop services:
hadoop/sbin/start-all.sh
This starts:
  • NameNode
  • DataNode
  • ResourceManager
  • NodeManager
  3. Verify the cluster is running:
hdfs dfs -ls /
Expected output:
/tmp
/user

Stopping the Hadoop Cluster

hadoop/sbin/stop-all.sh
Always stop the Hadoop cluster gracefully before stopping the Docker container to avoid data corruption.

Building and Deploying Jobs

Project Structure

ds503_bdm/
├── pom.xml
└── src/main/java/
    └── circlenet/
        ├── taskA/
        │   ├── TaskA.java
        │   └── TaskAOptimized.java
        ├── taskB/
        │   ├── TaskBSimple.java
        │   └── TaskBOptimized.java
        ├── util/
        │   └── JobTimer.java
        └── common/
            └── CsvUtils.java

Building the JAR

  1. Open the Maven panel: in IntelliJ IDEA, open the Maven panel on the right side.
  2. Run Package: navigate to Lifecycle → Package, or use the command line:
mvn clean package -DskipTests
  3. Locate the JAR file. The built JAR is located at:
target/ds503_bdm-1.0-SNAPSHOT.jar

Copying JAR to Container

  1. Get the container ID:
docker ps
Example output:
CONTAINER ID   IMAGE              ...
dadc9d47d16e   cs585-ds503-image  ...
  2. Copy the JAR file (replace dadc9d47d16e with your actual container ID):
docker cp target/ds503_bdm-1.0-SNAPSHOT.jar dadc9d47d16e:/home/ds503/
  3. Verify the copy inside the container:
ls -lh /home/ds503/ds503_bdm-1.0-SNAPSHOT.jar
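A small check catches an empty or truncated JAR before you waste a job run on it. verify_jar below is a hypothetical helper, not part of the project; it fails when the file is missing or zero bytes.

```shell
# verify_jar PATH: fail fast if the JAR is missing or empty (hypothetical helper).
verify_jar() {
  if [ -s "$1" ]; then
    echo "OK: $1 ($(wc -c < "$1") bytes)"   # -s: file exists and is non-empty
  else
    echo "MISSING or empty: $1" >&2
    return 1
  fi
}
```

Inside the container: `verify_jar /home/ds503/ds503_bdm-1.0-SNAPSHOT.jar`.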

Environment Variables Setup

For convenience, set up environment variables inside the container:
# JAR file location
export JAR=/home/ds503/ds503_bdm-1.0-SNAPSHOT.jar

# HDFS input paths
export PAGES=/circlenet/pages/CircleNetPage.csv
export FOLLOWS=/circlenet/follows/Follows.csv
export ACTIVITY=/circlenet/activitylog/ActivityLog.csv

# HDFS output base directory
export OUT=/circlenet/output

# Local output directory (inside container)
export LOCAL_OUT=/home/ds503/results

# Timing collection file
export CIRCLENET_TIMING_FILE=/home/ds503/task_times.csv
Add these to ~/.bashrc to persist across sessions:
echo 'export JAR=/home/ds503/ds503_bdm-1.0-SNAPSHOT.jar' >> ~/.bashrc
source ~/.bashrc
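To persist the full set in one step, the exports above can be appended with a single heredoc (same values as the block above; run it once to avoid duplicate lines in ~/.bashrc):

```shell
# Append all CircleNet environment variables to ~/.bashrc in one go.
cat >> ~/.bashrc <<'EOF'
export JAR=/home/ds503/ds503_bdm-1.0-SNAPSHOT.jar
export PAGES=/circlenet/pages/CircleNetPage.csv
export FOLLOWS=/circlenet/follows/Follows.csv
export ACTIVITY=/circlenet/activitylog/ActivityLog.csv
export OUT=/circlenet/output
export LOCAL_OUT=/home/ds503/results
export CIRCLENET_TIMING_FILE=/home/ds503/task_times.csv
EOF
```

Then run `source ~/.bashrc` or open a new shell to pick up the variables.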

Running MapReduce Jobs

Basic Job Execution

hadoop jar $JAR circlenet.taskA.TaskA $PAGES $OUT/taskA/simple
Command breakdown:
  • hadoop jar $JAR - run a Hadoop job from the JAR file
  • circlenet.taskA.TaskA - fully qualified name of the class containing the main method
  • $PAGES - input path (HDFS)
  • $OUT/taskA/simple - output path (HDFS)
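Every job in this guide follows the same invocation shape, so the command can be assembled mechanically. build_cmd below is a hypothetical helper (not part of the project) that prints the full hadoop jar command for a given class, output subdirectory, and input paths, using the $JAR and $OUT variables from the environment setup above.

```shell
# build_cmd CLASS OUTPUT_SUBDIR INPUT... : print the hadoop invocation for one job.
# Hypothetical helper; relies on $JAR and $OUT from the environment setup.
build_cmd() {
  class=$1
  subdir=$2
  shift 2
  # "$*" joins the remaining arguments (the input paths) with spaces.
  printf 'hadoop jar %s %s %s %s/%s\n' "$JAR" "$class" "$*" "$OUT" "$subdir"
}
```

Example: `build_cmd circlenet.taskA.TaskA taskA/simple "$PAGES"` prints the command from the breakdown above; pipe it to `sh` (or eval it) to execute.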

Running Multiple Versions

# Simple version
hadoop jar $JAR circlenet.taskA.TaskA $PAGES $OUT/taskA/simple

# Optimized version
hadoop jar $JAR circlenet.taskA.TaskAOptimized $PAGES $OUT/taskA/optimized

Multi-Argument Jobs

Some tasks require multiple inputs:
# Task D: Pages and Follows as input
hadoop jar $JAR circlenet.taskD.TaskDSimple \
  $PAGES \
  $FOLLOWS \
  $OUT/taskD/simple

# Task B: Multiple intermediate outputs
hadoop jar $JAR circlenet.taskB.TaskBOptimized \
  $ACTIVITY \
  $PAGES \
  $OUT/taskB/tmp_count_opt \
  $OUT/taskB/optimized

Retrieving Results

View Results in HDFS

# List output files
hdfs dfs -ls $OUT/taskA/simple

# View first 20 lines
hdfs dfs -cat $OUT/taskA/simple/part-r-00000 | head -20

Copy Results to Container

# Create local output directory
mkdir -p $LOCAL_OUT/taskA

# Copy from HDFS to container local filesystem
hdfs dfs -get $OUT/taskA/simple $LOCAL_OUT/taskA/

Copy Results to Host Machine

# From host machine (PowerShell or bash)
docker cp dadc9d47d16e:/home/ds503/results/taskA ./results_taskA
Use the -f flag with hdfs dfs -get to overwrite existing local files:
hdfs dfs -get -f $OUT/taskA/simple $LOCAL_OUT/taskA/

Cleanup and Reset

Clean Output Directories

# Remove all task outputs
hdfs dfs -rm -r -f $OUT/taskA $OUT/taskB $OUT/taskC $OUT/taskD \
                    $OUT/taskE $OUT/taskF $OUT/taskG $OUT/taskH

Reset for Fresh Benchmark

# Remove timing file
rm -f $CIRCLENET_TIMING_FILE

# Remove local results
rm -rf $LOCAL_OUT/*

# Remove HDFS outputs
hdfs dfs -rm -r -f $OUT/*

# Recreate local output directory
mkdir -p $LOCAL_OUT
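Because this reset deletes the timing file, local results, and every HDFS output, a confirmation guard is a cheap safety net. reset_benchmark is a hypothetical wrapper around the commands above; it refuses to run unless explicitly confirmed with --yes.

```shell
# reset_benchmark [--yes]: destructive reset; refuses to run without --yes.
# Hypothetical wrapper around the cleanup commands above.
reset_benchmark() {
  if [ "$1" != "--yes" ]; then
    echo "refusing to reset without --yes (would delete timing file, local results, HDFS outputs)" >&2
    return 1
  fi
  rm -f "$CIRCLENET_TIMING_FILE"
  rm -rf "${LOCAL_OUT:?}"/*        # ${VAR:?} aborts if LOCAL_OUT is unset/empty
  hdfs dfs -rm -r -f "$OUT"/*
  mkdir -p "$LOCAL_OUT"
}
```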

Troubleshooting

Container Won’t Start

Check if ports 3000-3004 are available:
# Windows
netstat -ano | findstr :3000

# Linux/Mac
lsof -i :3000
Change port mappings if needed:
docker run -p 3010:22 ...
Ensure the local directory exists and has proper permissions:
# Create directory if it doesn't exist
mkdir -p /path/to/local/data

# Set permissions (Linux/Mac)
chmod 755 /path/to/local/data

Hadoop Cluster Issues

  1. Check that the required daemons are running:
jps
Should show: NameNode, DataNode, ResourceManager, NodeManager
  2. If any are missing, restart Hadoop:
hadoop/sbin/stop-all.sh
hadoop/sbin/start-all.sh
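Checking the jps listing by eye is error-prone in scripts. check_daemons (hypothetical helper) reports which of the four required daemons are absent from a jps listing passed to it as a string:

```shell
# check_daemons "JPS_OUTPUT": print OK, or the daemons missing from the listing.
# Hypothetical helper; pass it the output of `jps`.
check_daemons() {
  missing=""
  for d in NameNode DataNode ResourceManager NodeManager; do
    case "$1" in
      *"$d"*) ;;                    # daemon name present in the listing
      *) missing="$missing $d" ;;   # daemon name absent
    esac
  done
  if [ -z "$missing" ]; then echo "OK"; else echo "missing:$missing"; fi
}
```

Usage inside the container: `check_daemons "$(jps)"`.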
HDFS may enter safe mode on startup:
# Check safe mode status
hdfs dfsadmin -safemode get

# Leave safe mode
hdfs dfsadmin -safemode leave
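Scripts that run jobs right after startup often need to wait for safe mode to clear rather than force-leave it. safemode_off (hypothetical helper) checks the status string printed by hdfs dfsadmin -safemode get, which reports "Safe mode is ON" or "Safe mode is OFF":

```shell
# safemode_off "STATUS": succeed iff the status string says safe mode is OFF.
# Hypothetical helper for use in a wait loop.
safemode_off() {
  case "$1" in
    *"Safe mode is OFF"*) return 0 ;;
    *) return 1 ;;
  esac
}
```

Example wait loop: `until safemode_off "$(hdfs dfsadmin -safemode get)"; do sleep 2; done`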

Job Execution Issues

Ensure:
  1. JAR was built with mvn clean package
  2. JAR was copied to container
  3. Fully qualified class name is correct
# List classes in JAR
jar tf $JAR | grep Task
Hadoop won’t overwrite existing output directories:
# Remove existing output
hdfs dfs -rm -r $OUT/taskA/simple

# Then run job again
hadoop jar $JAR circlenet.taskA.TaskA $PAGES $OUT/taskA/simple

Quick Reference Commands

Docker Commands

# Start container
docker start ds503-container

# Stop container
docker stop ds503-container

# View container logs
docker logs ds503-container

# Get container ID
docker ps

# Copy file to container
docker cp <local_file> <container_id>:<container_path>

# Copy file from container
docker cp <container_id>:<container_path> <local_path>

Hadoop Commands

# Start cluster
hadoop/sbin/start-all.sh

# Stop cluster
hadoop/sbin/stop-all.sh

# Run job
hadoop jar $JAR <main_class> <args>

# Check Java processes
jps
