This guide walks you through setting up a complete Hadoop development environment for CircleNet Analytics using Docker. You’ll configure the container, start the Hadoop cluster, and load your datasets into HDFS.

Prerequisites

Before you begin, ensure you have:
  • Docker Desktop installed and running
  • At least 8GB RAM available for the container
  • CircleNet datasets generated and available locally
  • SSH client (built into macOS/Linux, PuTTY for Windows)

Step 1: Create the Docker container

Create a container from the cs585-ds503-image with the following configuration:

Container configuration

Container name: ds503-container

Port mappings:
docker run -d \
  --name ds503-container \
  -p 3000:22 \
  -p 3001:4040 \
  -p 3002:50070 \
  -p 3003:8080 \
  -p 3004:8081 \
  -v /path/to/your/datasets:/home/ds503/data \
  cs585-ds503-image
Port descriptions:
Port   Service        Description
3000   SSH            Remote shell access to container
3001   Spark Jobs     Spark job monitoring
3002   NameNode UI    HDFS file browser and cluster status
3003   Spark Master   Spark master node interface
3004   Spark Worker   Spark worker node interface
Volume mount:
  • Container path: /home/ds503/data
  • Local path: Update this to your dataset directory (e.g., D:\DATA_SCIENCE_2025\semester2\bdm_data\project1_resources\dataset_generation\my_output_full)
Ensure your local dataset directory contains CircleNetPage.csv, Follows.csv, and ActivityLog.csv before mounting.
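That pre-mount file check can be sketched as a small helper (a hypothetical DatasetCheck class, not part of the project; pass your dataset directory as the argument):

```java
import java.io.File;

// Checks that the three required CircleNet CSVs exist before mounting.
// Hypothetical helper for illustration only.
public class DatasetCheck {
    static final String[] REQUIRED = {
        "CircleNetPage.csv", "Follows.csv", "ActivityLog.csv"
    };

    public static boolean allPresent(String dir) {
        for (String name : REQUIRED) {
            if (!new File(dir, name).isFile()) {
                System.err.println("Missing: " + name);
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String dir = args.length > 0 ? args[0] : ".";
        System.out.println(allPresent(dir)
                ? "Dataset directory OK"
                : "Dataset directory incomplete");
    }
}
```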

Verify container is running

docker ps
You should see ds503-container in the list with all ports mapped correctly.

Step 2: Connect to the container

Connect via SSH using the configured port:
ssh -p 3000 ds503@localhost
Default credentials:
  • Username: ds503
  • Password: ds503
If you’re on Windows and using PuTTY, enter localhost as the hostname and 3000 as the port.
Once connected, you’ll see your container’s shell prompt:
➜  ~ 

Step 3: Start the Hadoop cluster

Start all Hadoop services (NameNode, DataNode, ResourceManager, NodeManager):
hadoop/sbin/start-all.sh
You should see output indicating services are starting:
Starting namenodes on [localhost]
Starting datanodes
Starting secondary namenodes
Starting resourcemanager
Starting nodemanagers

Verify the cluster is running

From inside the container:
hdfs dfs -ls /
You should see default directories:
Found 2 items
drwxrwxrwt   - root supergroup          0 /tmp
drwxr-xr-x   - ds503 supergroup         0 /user
From your local machine: Test the NameNode Web UI connection (Test-NetConnection is PowerShell; on macOS/Linux use nc -z localhost 3002 instead):
Test-NetConnection -ComputerName localhost -Port 3002
Open your browser and visit the NameNode Web UI:
http://localhost:3002/
Bookmark http://localhost:3002/explorer.html#/ - this is the HDFS file browser you’ll use frequently.

Step 4: Create HDFS directory structure

Create the directory structure for CircleNet datasets:

Create base directory

hdfs dfs -mkdir /circlenet

Create subdirectories for each dataset

hdfs dfs -mkdir /circlenet/pages
hdfs dfs -mkdir /circlenet/follows
hdfs dfs -mkdir /circlenet/activitylog

Verify directory creation

hdfs dfs -ls /circlenet
Expected output:
Found 3 items
drwxr-xr-x   - ds503 supergroup          0 /circlenet/activitylog
drwxr-xr-x   - ds503 supergroup          0 /circlenet/follows
drwxr-xr-x   - ds503 supergroup          0 /circlenet/pages
You can also verify this visually at http://localhost:3002/explorer.html#/circlenet

Step 5: Load datasets into HDFS

Now upload the three CSV files from your mounted volume to HDFS:
hdfs dfs -put data/CircleNetPage.csv /circlenet/pages/
hdfs dfs -put data/Follows.csv /circlenet/follows/
hdfs dfs -put data/ActivityLog.csv /circlenet/activitylog/
These put operations may take several minutes depending on dataset size. CircleNetPage (200K rows) is fast, but Follows (20M rows) and ActivityLog (10M rows) take longer.

Verify uploads

Check that files were uploaded successfully:
hdfs dfs -ls /circlenet/pages/
hdfs dfs -ls /circlenet/follows/
hdfs dfs -ls /circlenet/activitylog/
Each directory should show one CSV file with its size.

Inspect data samples

View the first 5 rows of each dataset:
hdfs dfs -cat /circlenet/pages/CircleNetPage.csv | head -5
View the last 5 rows to verify complete upload:
hdfs dfs -cat /circlenet/follows/Follows.csv | tail -5
CircleNetPage format: ID, NickName, JobTitle, RegionCode, FavoriteHobby
Follows format: ColRel, ID1, ID2, DateOfRelation, Description
ActivityLog format: ActionId, ByWho, WhatPage, ActionType, ActionTime
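
Parsing these rows in a mapper reduces to splitting on commas. A minimal sketch in the spirit of the project's CsvUtils (a hypothetical FollowsParser helper; the real CsvUtils.java may differ, for example if fields can contain quoted commas):

```java
// Sketch of a CSV field splitter for the Follows schema above.
// Hypothetical helper; assumes fields never contain embedded commas.
public class FollowsParser {
    // Follows format: ColRel, ID1, ID2, DateOfRelation, Description
    public static String[] parseFollows(String line) {
        String[] fields = line.split(",", -1); // -1 keeps trailing empty fields
        if (fields.length < 5) {
            throw new IllegalArgumentException("Malformed Follows row: " + line);
        }
        for (int i = 0; i < fields.length; i++) {
            fields[i] = fields[i].trim();
        }
        return fields;
    }

    public static void main(String[] args) {
        String[] f = parseFollows("F,101,202,2024-05-01,follows");
        System.out.println(f[1] + " -> " + f[2]); // prints "101 -> 202"
    }
}
```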

Step 6: Set up environment variables

For easier job execution, set up environment variables:
export JAR=/home/ds503/ds503_bdm-1.0-SNAPSHOT.jar
export PAGES=/circlenet/pages/CircleNetPage.csv
export FOLLOWS=/circlenet/follows/Follows.csv
export ACTIVITY=/circlenet/activitylog/ActivityLog.csv
export OUT=/circlenet/output
export LOCAL_OUT=/home/ds503/results
export CIRCLENET_TIMING_FILE=/home/ds503/task_times.csv
Create the local results directory:
mkdir -p $LOCAL_OUT
Add these exports to your ~/.bashrc file to make them permanent:
echo 'export JAR=/home/ds503/ds503_bdm-1.0-SNAPSHOT.jar' >> ~/.bashrc
echo 'export PAGES=/circlenet/pages/CircleNetPage.csv' >> ~/.bashrc
# ... add all other exports
source ~/.bashrc

Step 7: Build and deploy your project

Build the JAR file

On your host machine (not in the container), navigate to your project root and build:
mvn clean package -DskipTests
This creates target/ds503_bdm-1.0-SNAPSHOT.jar with all compiled MapReduce jobs.

Copy JAR to container

Copy the JAR into the container. docker cp accepts the container name directly, so you don't need to look up an ID:
docker cp target/ds503_bdm-1.0-SNAPSHOT.jar ds503-container:/home/ds503/

Verify JAR is accessible

From inside the container:
ls -lh /home/ds503/ds503_bdm-1.0-SNAPSHOT.jar

Common HDFS operations

Here are essential HDFS commands you’ll use frequently:

List files and directories

hdfs dfs -ls /
hdfs dfs -ls /circlenet
hdfs dfs -ls -R /circlenet  # Recursive listing

View file contents

hdfs dfs -cat /circlenet/pages/CircleNetPage.csv | head -20
hdfs dfs -tail /circlenet/pages/CircleNetPage.csv  # Last 1KB

Copy files from HDFS to local

hdfs dfs -get /circlenet/output/taskA/simple ~/results/

Delete files and directories

hdfs dfs -rm /circlenet/output/taskA/simple/part-r-00000
hdfs dfs -rm -r /circlenet/output/taskA  # Recursive delete
hdfs dfs -rm -r -f /circlenet/output/*   # Force delete all outputs

Check disk usage

hdfs dfs -du -h /circlenet
All hdfs dfs commands work on HDFS paths (starting with /), not local filesystem paths. To work with local files, use regular Linux commands like ls, cat, rm, etc.

Stopping the cluster

When you’re done working, stop the Hadoop cluster to free resources:
hadoop/sbin/stop-all.sh
This stops all NameNode, DataNode, ResourceManager, and NodeManager processes.
Don’t stop the cluster if you have running MapReduce jobs - they will be killed.

Project structure

Your project follows this structure:
ds503_bdm/
├── pom.xml
└── src/main/java/
    └── circlenet/
        ├── taskA/
        │   ├── TaskA.java           # Simple mapper/reducer
        │   └── TaskAOptimized.java  # With combiner
        ├── taskB/
        │   ├── TaskBSimple.java
        │   └── TaskBOptimized.java
        ├── ... (tasks C through H)
        ├── common/
        │   ├── CsvUtils.java        # CSV parsing utilities
        │   └── JobTimer.java        # Performance timing
        └── util/
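
JobTimer.java records per-task runtimes. A minimal sketch of such a timer, appending one "task,millis" row per run to the CIRCLENET_TIMING_FILE path from Step 6 (illustrative only; the project's actual JobTimer may differ):

```java
import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;

// Illustrative timing helper in the spirit of JobTimer.java;
// the project's real implementation may differ.
public class JobTimer {
    private final String taskName;
    private final long startNanos;

    public JobTimer(String taskName) {
        this.taskName = taskName;
        this.startNanos = System.nanoTime();
    }

    /** Stops the timer and appends "task,millis" to the timing CSV. */
    public long stopAndRecord(String timingFile) throws IOException {
        long elapsedMillis = (System.nanoTime() - startNanos) / 1_000_000;
        try (PrintWriter out = new PrintWriter(new FileWriter(timingFile, true))) {
            out.println(taskName + "," + elapsedMillis);
        }
        return elapsedMillis;
    }

    public static void main(String[] args) throws IOException {
        JobTimer timer = new JobTimer("taskA");
        // ... run the MapReduce job here ...
        long ms = timer.stopAndRecord(System.getenv()
                .getOrDefault("CIRCLENET_TIMING_FILE", "task_times.csv"));
        System.out.println("taskA took " + ms + " ms");
    }
}
```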

Maven configuration

The pom.xml uses Hadoop 2.7.7 dependencies:
pom.xml
<dependencies>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-common</artifactId>
        <version>2.7.7</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-mapreduce-client-core</artifactId>
        <version>2.7.7</version>
    </dependency>
</dependencies>

Next steps

Now that your environment is configured, you’re ready to run analytics:

  • Run your first job: Complete the quickstart guide to run Task A
  • Explore all tasks: Learn about all 8 analytics tasks
  • Dataset details: Understand the data structure and relationships
  • Optimization guide: Learn MapReduce optimization techniques

Troubleshooting

Ensure the Hadoop cluster is running:
jps
You should see these processes: NameNode, DataNode, SecondaryNameNode, ResourceManager, NodeManager. If not, run:
hadoop/sbin/start-all.sh
Check your HDFS permissions:
hdfs dfs -ls -d /circlenet
If needed, change ownership:
hdfs dfs -chown -R ds503:supergroup /circlenet
Check HDFS disk usage:
hdfs dfsadmin -report
Clean up old output directories:
hdfs dfs -rm -r /circlenet/output/*
Check container logs:
docker logs ds503-container
The container may need more memory. Increase Docker Desktop’s memory allocation to at least 8GB.
