## Prerequisites
Before you begin, ensure you have:

- Docker Desktop installed and running
- At least 8GB RAM available for the container
- CircleNet datasets generated and available locally
- SSH client (built into macOS/Linux, PuTTY for Windows)
## Step 1: Create the Docker container
Create a container from the `cs585-ds503-image` image with the following configuration:
### Container configuration
Container name: `ds503-container`
Port mappings:
| Port | Service | Description |
|---|---|---|
| 3000 | SSH | Remote shell access to container |
| 3001 | Spark Jobs | Spark job monitoring |
| 3002 | NameNode UI | HDFS file browser and cluster status |
| 3003 | Spark Master | Spark master node interface |
| 3004 | Spark Worker | Spark worker node interface |
Volume mapping:

- Container path: `/home/ds503/data`
- Local path: update this to your dataset directory (e.g., `D:\DATA_SCIENCE_2025\semester2\bdm_data\project1_resources\dataset_generation\my_output_full`)
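Translated into a `docker run` invocation, the configuration above might look like the following sketch. The container-side ports are assumptions about the image (22 for SSH, 4040 for Spark jobs, 50070 for the Hadoop 2.x NameNode UI, 8080/8081 for the Spark master and worker); adjust them if your image differs.

```shell
# Sketch only: container-side ports are assumptions about the image.
docker run -d --name ds503-container \
  -p 3000:22 \
  -p 3001:4040 \
  -p 3002:50070 \
  -p 3003:8080 \
  -p 3004:8081 \
  -v "/path/to/your/dataset:/home/ds503/data" \
  cs585-ds503-image
```

On Windows, use your local dataset path (e.g., the `D:\...` path above) as the left-hand side of `-v`.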
### Verify the container is running
Run `docker ps`; you should see `ds503-container` in the list with all ports mapped correctly.
## Step 2: Connect to the container
Connect via SSH using the configured port:

- Username: `ds503`
- Password: `ds503`
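On macOS or Linux, the connection looks like:

```shell
# SSH into the container via the mapped host port 3000
ssh -p 3000 ds503@localhost
# enter the password ds503 when prompted
```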
If you’re on Windows and using PuTTY, enter `localhost` as the hostname and `3000` as the port.

## Step 3: Start the Hadoop cluster
Start all Hadoop services (NameNode, DataNode, ResourceManager, NodeManager), then verify from inside the container that the cluster is running.
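Assuming the image puts the standard Hadoop `sbin` scripts on the `PATH` (common for single-node teaching images), starting and verifying might look like:

```shell
# Start HDFS (NameNode, DataNode) and YARN (ResourceManager, NodeManager)
start-dfs.sh
start-yarn.sh

# jps lists running JVM processes; all four daemons should appear
jps
```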
## Step 4: Create HDFS directory structure

Create the directory structure for the CircleNet datasets: a base directory, then one subdirectory per dataset, and verify the result.
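With the `hdfs` CLI the sequence might look like this; the subdirectory names are assumptions, so match them to your course handout:

```shell
# Base directory for all CircleNet data
hdfs dfs -mkdir /circlenet

# One subdirectory per dataset (names are assumptions)
hdfs dfs -mkdir /circlenet/pages /circlenet/follows /circlenet/activitylog

# Verify the directory tree
hdfs dfs -ls -R /circlenet
```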
You can also verify this visually at http://localhost:3002/explorer.html#/circlenet.

## Step 5: Load datasets into HDFS
Now upload the three CSV files from your mounted volume to HDFS:
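Assuming file names matching the three datasets (adjust to your generator's actual output), the uploads might look like:

```shell
# Source paths are the mounted volume; target paths are HDFS
hdfs dfs -put /home/ds503/data/circlenetpage.csv /circlenet/pages/
hdfs dfs -put /home/ds503/data/follows.csv /circlenet/follows/
hdfs dfs -put /home/ds503/data/activitylog.csv /circlenet/activitylog/
```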
### Verify uploads

Check that files were uploaded successfully:
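For example:

```shell
# Recursively list everything under the base directory
hdfs dfs -ls -R /circlenet
```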
### Inspect data samples

View the first 5 rows of each dataset:

- CircleNetPage format: `ID, NickName, JobTitle, RegionCode, FavoriteHobby`
- Follows format: `ColRel, ID1, ID2, DateOfRelation, Description`
- ActivityLog format: `ActionId, ByWho, WhatPage, ActionType, ActionTime`
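For a quick look, pipe `hdfs dfs -cat` through `head` (file paths are assumptions):

```shell
hdfs dfs -cat /circlenet/pages/circlenetpage.csv | head -5
hdfs dfs -cat /circlenet/follows/follows.csv | head -5
hdfs dfs -cat /circlenet/activitylog/activitylog.csv | head -5
```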
## Step 6: Set up environment variables
For easier job execution, set up environment variables:
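The variable names and values here are illustrative, not prescribed by the course setup:

```shell
# Append these to ~/.bashrc to persist across SSH sessions
export CIRCLENET=/circlenet
export JAR=/home/ds503/ds503_bdm-1.0-SNAPSHOT.jar
```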
## Step 7: Build and deploy your project

### Build the JAR file
On your host machine (not in the container), navigate to your project root and run the Maven build. A successful build produces `target/ds503_bdm-1.0-SNAPSHOT.jar` containing all compiled MapReduce jobs.
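With a standard Maven setup, the build command is:

```shell
# Run from the project root on the host
mvn clean package
```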
### Copy the JAR to the container
Get your container ID with `docker ps`, then copy the JAR into the container (replace `dadc9d47d16e` with your container ID):
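For example (the container-side destination directory is an assumption):

```shell
# Find the container ID
docker ps
# Copy the built JAR into the container
docker cp target/ds503_bdm-1.0-SNAPSHOT.jar dadc9d47d16e:/home/ds503/
```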
### Verify the JAR is accessible
From inside the container:
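Assuming the JAR was copied to the home directory:

```shell
ls -lh /home/ds503/ds503_bdm-1.0-SNAPSHOT.jar
```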
## Common HDFS operations

Here are essential HDFS commands you’ll use frequently: listing files and directories, viewing file contents, copying files from HDFS to the local filesystem, deleting files and directories, and checking disk usage.
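As a reference sheet (the paths are illustrative):

```shell
# List files and directories
hdfs dfs -ls /circlenet
# View file contents
hdfs dfs -cat /circlenet/pages/circlenetpage.csv
# Copy files from HDFS to the local filesystem
hdfs dfs -get /circlenet/pages/circlenetpage.csv .
# Delete files and directories (recursively)
hdfs dfs -rm -r /circlenet/old_output
# Check disk usage
hdfs dfs -du -h /circlenet
```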
All `hdfs dfs` commands work on HDFS paths (starting with `/`), not local filesystem paths. To work with local files, use regular Linux commands like `ls`, `cat`, and `rm`.
## Stopping the cluster

When you’re done working, stop the Hadoop cluster to free resources:
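Assuming the standard Hadoop `sbin` scripts are on the `PATH`:

```shell
# Stop YARN first, then HDFS
stop-yarn.sh
stop-dfs.sh
```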
## Project structure

Your project follows this structure:

### Maven configuration
The `pom.xml` uses Hadoop 2.7.7 dependencies:
## Next steps
Now that your environment is configured, you’re ready to run analytics:

- **Run your first job**: complete the quickstart guide to run Task A
- **Explore all tasks**: learn about all 8 analytics tasks
- **Dataset details**: understand the data structure and relationships
- **Optimization guide**: learn MapReduce optimization techniques
## Troubleshooting
### Cannot connect to localhost:3002

Ensure the Hadoop cluster is running: you should see the processes NameNode, DataNode, ResourceManager, and NodeManager. If any are missing, restart the Hadoop services.
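For example, assuming the standard `sbin` scripts are on the `PATH`:

```shell
# List running JVM processes; all four daemons should appear
jps
# If daemons are missing, restart the services
stop-yarn.sh; stop-dfs.sh
start-dfs.sh; start-yarn.sh
```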
### Permission denied errors in HDFS
Check your HDFS permissions and, if needed, change ownership:
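For example (the `ds503:ds503` user and group are assumptions):

```shell
# Inspect current permissions and owners
hdfs dfs -ls /circlenet
# Change ownership recursively if needed
hdfs dfs -chown -R ds503:ds503 /circlenet
```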
### No space left on device
Check HDFS disk usage and clean up old output directories:
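For example (the output path is illustrative):

```shell
# Report HDFS capacity and usage
hdfs dfs -df -h
# Remove old job output directories to reclaim space
hdfs dfs -rm -r /circlenet/output_taskA
```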
### Container exits immediately after `docker run`
Check the container logs. The container may need more memory; increase Docker Desktop’s memory allocation to at least 8GB.
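The logs can be viewed with:

```shell
docker logs ds503-container
```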