This guide walks you through setting up a complete Hadoop development environment for CircleNet Analytics using Docker. You’ll configure the container, start the Hadoop cluster, and load your datasets into HDFS.

Prerequisites

Before you begin, ensure you have:
  • Docker Desktop installed and running
  • At least 8GB RAM available for the container
  • CircleNet datasets generated and available locally
  • SSH client (built into macOS/Linux, PuTTY for Windows)

Step 1: Create the Docker container

Create a container from the cs585-ds503-image with the following configuration:

Container configuration

Container name: ds503-container

Port mappings:
docker run -d \
  --name ds503-container \
  -p 3000:22 \
  -p 3001:4040 \
  -p 3002:50070 \
  -p 3003:8080 \
  -p 3004:8081 \
  -v /path/to/your/datasets:/home/ds503/data \
  cs585-ds503-image
Port descriptions:
Port   Service        Description
3000   SSH            Remote shell access to container
3001   Spark Jobs     Spark job monitoring
3002   NameNode UI    HDFS file browser and cluster status
3003   Spark Master   Spark master node interface
3004   Spark Worker   Spark worker node interface
Volume mount:
  • Container path: /home/ds503/data
  • Local path: Update this to your dataset directory (e.g., D:\DATA_SCIENCE_2025\semester2\bdm_data\project1_resources\dataset_generation\my_output_full)
Ensure your local dataset directory contains CircleNetPage.csv, Follows.csv, and ActivityLog.csv before mounting.
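That pre-mount file check can be sketched as a small helper (a hypothetical DatasetCheck class, not part of the project; pass your dataset directory as the argument):

```java
import java.io.File;

// Checks that the three required CircleNet CSVs exist before mounting.
// Hypothetical helper for illustration only.
public class DatasetCheck {
    static final String[] REQUIRED = {
        "CircleNetPage.csv", "Follows.csv", "ActivityLog.csv"
    };

    public static boolean allPresent(String dir) {
        for (String name : REQUIRED) {
            if (!new File(dir, name).isFile()) {
                System.err.println("Missing: " + name);
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String dir = args.length > 0 ? args[0] : ".";
        System.out.println(allPresent(dir)
                ? "Dataset directory OK"
                : "Dataset directory incomplete");
    }
}
```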

Verify container is running

docker ps
You should see ds503-container in the list with all ports mapped correctly.

Step 2: Connect to the container

Connect via SSH using the configured port:
ssh -p 3000 ds503@localhost
Default credentials:
  • Username: ds503
  • Password: ds503
If you’re on Windows and using PuTTY, enter localhost as the hostname and 3000 as the port.
Once connected, you’ll see your container’s shell prompt:
➜  ~ 

Step 3: Start the Hadoop cluster

Start all Hadoop services (NameNode, DataNode, ResourceManager, NodeManager):
hadoop/sbin/start-all.sh
You should see output indicating services are starting:
Starting namenodes on [localhost]
Starting datanodes
Starting secondary namenodes
Starting resourcemanager
Starting nodemanagers

Verify the cluster is running

From inside the container:
hdfs dfs -ls /
You should see default directories:
Found 2 items
drwxrwxrwt   - root supergroup          0 /tmp
drwxr-xr-x   - ds503 supergroup         0 /user
From your local machine: Test the NameNode Web UI connection (Test-NetConnection is PowerShell; on macOS/Linux use nc -z localhost 3002 instead):
Test-NetConnection -ComputerName localhost -Port 3002
Open your browser and visit the NameNode Web UI:
http://localhost:3002/
Bookmark http://localhost:3002/explorer.html#/ - this is the HDFS file browser you’ll use frequently.

Step 4: Create HDFS directory structure

Create the directory structure for CircleNet datasets:

Create base directory

hdfs dfs -mkdir /circlenet

Create subdirectories for each dataset

hdfs dfs -mkdir /circlenet/pages
hdfs dfs -mkdir /circlenet/follows
hdfs dfs -mkdir /circlenet/activitylog

Verify directory creation

hdfs dfs -ls /circlenet
Expected output:
Found 3 items
drwxr-xr-x   - ds503 supergroup          0 /circlenet/activitylog
drwxr-xr-x   - ds503 supergroup          0 /circlenet/follows
drwxr-xr-x   - ds503 supergroup          0 /circlenet/pages
You can also verify this visually at http://localhost:3002/explorer.html#/circlenet

Step 5: Load datasets into HDFS

Now upload the three CSV files from your mounted volume to HDFS:
hdfs dfs -put data/CircleNetPage.csv /circlenet/pages/
hdfs dfs -put data/Follows.csv /circlenet/follows/
hdfs dfs -put data/ActivityLog.csv /circlenet/activitylog/
These put operations may take several minutes depending on dataset size. CircleNetPage (200K rows) is fast, but Follows (20M rows) and ActivityLog (10M rows) take longer.

Verify uploads

Check that files were uploaded successfully:
hdfs dfs -ls /circlenet/pages/
hdfs dfs -ls /circlenet/follows/
hdfs dfs -ls /circlenet/activitylog/
Each directory should show one CSV file with its size.

Inspect data samples

View the first 5 rows of each dataset:
hdfs dfs -cat /circlenet/pages/CircleNetPage.csv | head -5
View the last 5 rows to verify complete upload:
hdfs dfs -cat /circlenet/follows/Follows.csv | tail -5
CircleNetPage format: ID, NickName, JobTitle, RegionCode, FavoriteHobby
Follows format: ColRel, ID1, ID2, DateOfRelation, Description
ActivityLog format: ActionId, ByWho, WhatPage, ActionType, ActionTime
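
Parsing these rows in a mapper reduces to splitting on commas. A minimal sketch in the spirit of the project's CsvUtils (a hypothetical FollowsParser helper; the real CsvUtils.java may differ, for example if fields can contain quoted commas):

```java
// Sketch of a CSV field splitter for the Follows schema above.
// Hypothetical helper; assumes fields never contain embedded commas.
public class FollowsParser {
    // Follows format: ColRel, ID1, ID2, DateOfRelation, Description
    public static String[] parseFollows(String line) {
        String[] fields = line.split(",", -1); // -1 keeps trailing empty fields
        if (fields.length < 5) {
            throw new IllegalArgumentException("Malformed Follows row: " + line);
        }
        for (int i = 0; i < fields.length; i++) {
            fields[i] = fields[i].trim();
        }
        return fields;
    }

    public static void main(String[] args) {
        String[] f = parseFollows("F,101,202,2024-05-01,follows");
        System.out.println(f[1] + " -> " + f[2]); // prints "101 -> 202"
    }
}
```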

Step 6: Set up environment variables

For easier job execution, set up environment variables:
export JAR=/home/ds503/ds503_bdm-1.0-SNAPSHOT.jar
export PAGES=/circlenet/pages/CircleNetPage.csv
export FOLLOWS=/circlenet/follows/Follows.csv
export ACTIVITY=/circlenet/activitylog/ActivityLog.csv
export OUT=/circlenet/output
export LOCAL_OUT=/home/ds503/results
export CIRCLENET_TIMING_FILE=/home/ds503/task_times.csv
Create the local results directory:
mkdir -p $LOCAL_OUT
Add these exports to your ~/.bashrc file to make them permanent:
echo 'export JAR=/home/ds503/ds503_bdm-1.0-SNAPSHOT.jar' >> ~/.bashrc
echo 'export PAGES=/circlenet/pages/CircleNetPage.csv' >> ~/.bashrc
# ... add all other exports
source ~/.bashrc

Step 7: Build and deploy your project

Build the JAR file

On your host machine (not in the container), navigate to your project root and build:
mvn clean package -DskipTests
This creates target/ds503_bdm-1.0-SNAPSHOT.jar with all compiled MapReduce jobs.

Copy JAR to container

Copy the JAR into the container. docker cp accepts the container name directly, so you don't need to look up an ID:
docker cp target/ds503_bdm-1.0-SNAPSHOT.jar ds503-container:/home/ds503/

Verify JAR is accessible

From inside the container:
ls -lh /home/ds503/ds503_bdm-1.0-SNAPSHOT.jar

Common HDFS operations

Here are essential HDFS commands you’ll use frequently:

List files and directories

hdfs dfs -ls /
hdfs dfs -ls /circlenet
hdfs dfs -ls -R /circlenet  # Recursive listing

View file contents

hdfs dfs -cat /circlenet/pages/CircleNetPage.csv | head -20
hdfs dfs -tail /circlenet/pages/CircleNetPage.csv  # Last 1KB

Copy files from HDFS to local

hdfs dfs -get /circlenet/output/taskA/simple ~/results/

Delete files and directories

hdfs dfs -rm /circlenet/output/taskA/simple/part-r-00000
hdfs dfs -rm -r /circlenet/output/taskA  # Recursive delete
hdfs dfs -rm -r -f /circlenet/output/*   # Force delete all outputs

Check disk usage

hdfs dfs -du -h /circlenet
All hdfs dfs commands work on HDFS paths (starting with /), not local filesystem paths. To work with local files, use regular Linux commands like ls, cat, rm, etc.

Stopping the cluster

When you’re done working, stop the Hadoop cluster to free resources:
hadoop/sbin/stop-all.sh
This stops all NameNode, DataNode, ResourceManager, and NodeManager processes.
Don’t stop the cluster if you have running MapReduce jobs - they will be killed.

Project structure

Your project follows this structure:
ds503_bdm/
├── pom.xml
└── src/main/java/
    └── circlenet/
        ├── taskA/
        │   ├── TaskA.java           # Simple mapper/reducer
        │   └── TaskAOptimized.java  # With combiner
        ├── taskB/
        │   ├── TaskBSimple.java
        │   └── TaskBOptimized.java
        ├── ... (tasks C through H)
        ├── common/
        │   ├── CsvUtils.java        # CSV parsing utilities
        │   └── JobTimer.java        # Performance timing
        └── util/
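
JobTimer.java records per-task runtimes. A minimal sketch of such a timer, appending one "task,millis" row per run to the CIRCLENET_TIMING_FILE path from Step 6 (illustrative only; the project's actual JobTimer may differ):

```java
import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;

// Illustrative timing helper in the spirit of JobTimer.java;
// the project's real implementation may differ.
public class JobTimer {
    private final String taskName;
    private final long startNanos;

    public JobTimer(String taskName) {
        this.taskName = taskName;
        this.startNanos = System.nanoTime();
    }

    /** Stops the timer and appends "task,millis" to the timing CSV. */
    public long stopAndRecord(String timingFile) throws IOException {
        long elapsedMillis = (System.nanoTime() - startNanos) / 1_000_000;
        try (PrintWriter out = new PrintWriter(new FileWriter(timingFile, true))) {
            out.println(taskName + "," + elapsedMillis);
        }
        return elapsedMillis;
    }

    public static void main(String[] args) throws IOException {
        JobTimer timer = new JobTimer("taskA");
        // ... run the MapReduce job here ...
        long ms = timer.stopAndRecord(System.getenv()
                .getOrDefault("CIRCLENET_TIMING_FILE", "task_times.csv"));
        System.out.println("taskA took " + ms + " ms");
    }
}
```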

Maven configuration

The pom.xml uses Hadoop 2.7.7 dependencies:
pom.xml
<dependencies>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-common</artifactId>
        <version>2.7.7</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-mapreduce-client-core</artifactId>
        <version>2.7.7</version>
    </dependency>
</dependencies>

Next steps

Now that your environment is configured, you’re ready to run analytics:

  • Run your first job: Complete the quickstart guide to run Task A
  • Explore all tasks: Learn about all 8 analytics tasks
  • Dataset details: Understand the data structure and relationships
  • Optimization guide: Learn MapReduce optimization techniques

Troubleshooting

Ensure the Hadoop cluster is running:
jps
You should see these processes: NameNode, DataNode, SecondaryNameNode, ResourceManager, NodeManager. If not, run:
hadoop/sbin/start-all.sh
Check your HDFS permissions:
hdfs dfs -ls -d /circlenet
If needed, change ownership:
hdfs dfs -chown -R ds503:supergroup /circlenet
Check HDFS disk usage:
hdfs dfsadmin -report
Clean up old output directories:
hdfs dfs -rm -r /circlenet/output/*
Check container logs:
docker logs ds503-container
The container may need more memory. Increase Docker Desktop’s memory allocation to at least 8GB.
