Prerequisites
Before you begin, ensure you have:
- Docker container running with Hadoop installed (see setup guide)
- CircleNet datasets loaded into HDFS at /circlenet/pages/CircleNetPage.csv
- Maven installed for building the project
- SSH access to your container on port 3000
If you haven’t set up your environment yet, follow the complete setup guide first.
Step 1: Build the JAR file
From your project root directory, build the project using Maven. The build command:
- Cleans previous builds
- Compiles all Java source files in the circlenet package
- Packages everything into target/ds503_bdm-1.0-SNAPSHOT.jar
- Skips tests for faster builds
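A Maven invocation that matches the four behaviors listed above could look like the sketch below. The exact flags used by the project are not shown in this guide, so treat this as one plausible command rather than the project's official one.

```shell
# Run from the project root (the directory that contains pom.xml).
# clean        -> removes the previous target/ directory
# package      -> compiles the sources and builds the JAR under target/
# -DskipTests  -> skips running the tests
mvn clean package -DskipTests

# Confirm the JAR named in this guide was produced.
ls -lh target/ds503_bdm-1.0-SNAPSHOT.jar
```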
Step 2: Copy JAR to container
Get your container ID and copy the JAR file into the container (replace dadc9d47d16e with your own container ID).
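As a rough sketch, the two operations can be done with docker ps and docker cp. The destination directory /home/ds503/ is an assumption based on the default username in this guide; adjust it to wherever you want the JAR inside the container.

```shell
# List running containers and note the CONTAINER ID column.
docker ps

# Copy the JAR into the container.
# dadc9d47d16e is the example ID from this guide; /home/ds503/ is an assumed destination.
docker cp target/ds503_bdm-1.0-SNAPSHOT.jar dadc9d47d16e:/home/ds503/
```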
Step 3: Set up environment variables
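Connecting to the container might look like the sketch below. The port (3000) and the username (ds503) come from this guide; the host localhost is an assumption that holds only if the container's SSH port is published on your own machine.

```shell
# Open an SSH session to the container on port 3000.
# "localhost" is an assumed host; replace it if the container runs elsewhere.
ssh -p 3000 ds503@localhost
```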
SSH into your container. Default credentials are username ds503 and password ds503.
Step 4: Run Task A
Task A analyzes hobby frequency by reading the CircleNetPage dataset and counting occurrences of each hobby.
Run the simple version
The simple run:
- Executes the TaskA class from the JAR
- Reads input from /circlenet/pages/CircleNetPage.csv
- Writes output to /circlenet/output/taskA/simple
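A hadoop jar invocation consistent with the list above might look like the following sketch. Three things here are assumptions: the fully qualified class name circlenet.taskA.TaskA (inferred from the source path src/main/java/circlenet/taskA/TaskA.java), the JAR being in the current directory, and the job accepting the input and output paths as command-line arguments.

```shell
# Run inside the container, from the directory that holds the JAR.
# Class name and argument order are assumptions; check TaskA's main method.
hadoop jar ds503_bdm-1.0-SNAPSHOT.jar circlenet.taskA.TaskA \
  /circlenet/pages/CircleNetPage.csv \
  /circlenet/output/taskA/simple
```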
Run the optimized version
The optimized version uses a combiner to reduce data shuffling. It typically runs 20-30% faster because the combiner aggregates hobby counts locally before the shuffle phase.
Step 5: View the results
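One way to look at the job output is with the hdfs dfs commands, sketched below for the simple run's output directory. The part-* glob is used because the exact part file names depend on how the job is configured.

```shell
# List the files the job wrote.
hdfs dfs -ls /circlenet/output/taskA/simple

# Print the first lines of the result files (quoted so HDFS expands the glob).
hdfs dfs -cat '/circlenet/output/taskA/simple/part-*' | head -n 20
```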
Inspect the output directly in HDFS.
Step 6: Copy results to local
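Copying the output out of HDFS can be sketched with hdfs dfs -get. The local directory name taskA_simple is hypothetical; any path that does not already exist will do.

```shell
# Copy the whole output directory from HDFS to the container's filesystem.
# ./taskA_simple is a hypothetical local destination.
hdfs dfs -get /circlenet/output/taskA/simple ./taskA_simple

# Check what arrived.
ls -lh ./taskA_simple
```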
Copy the results from HDFS to your container’s local filesystem.
Step 7: Check performance metrics
View the timing information.
Understanding the code
Here’s how Task A works under the hood.
Mapper
The mapper reads each line from CircleNetPage.csv and emits the hobby as the key (see src/main/java/circlenet/taskA/TaskA.java).
Reducer
The reducer sums up all the counts for each hobby (see src/main/java/circlenet/taskA/TaskA.java).
This is a classic word count pattern - the most fundamental MapReduce algorithm. The mapper emits (hobby, 1) pairs, and the reducer sums the counts.
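To make the pattern concrete, the same count can be reproduced on a local copy of the CSV with a small shell function, which is also a handy way to cross-check the job's output. This is a local stand-in, not the project's Java code. The function name count_hobbies and the variable HOBBY_COL are hypothetical, and the default column index of 5 is an assumption about where the hobby field sits in each row.

```shell
# HOBBY_COL: hypothetical 1-based index of the hobby field (assumed to be 5).
HOBBY_COL="${HOBBY_COL:-5}"

count_hobbies() {
  # "Map": pick the hobby field from each line.
  # "Reduce": keep one running counter per hobby, then print the totals.
  awk -F',' -v col="$HOBBY_COL" '
    { counts[$col]++ }
    END { for (h in counts) printf "%s\t%d\n", h, counts[h] }
  ' | sort
}

# Usage, reading a local copy of the dataset:
#   count_hobbies < CircleNetPage.csv
```

The output is one tab-separated line per hobby, sorted by hobby name, which is the same (hobby, count) shape the reducer produces.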
Next steps
- Explore Task B: Find the 10 most popular CircleNetPages
- All tasks: View all 8 analytics tasks
- Optimization guide: Learn MapReduce optimization techniques
- Dataset details: Understand the data structure
Troubleshooting
Output directory already exists
If you see an error about the output directory already existing, remove the directory first, then rerun your job.
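Removing the old output could look like this sketch, shown for the simple run's output path from Step 4. Substitute the path of whichever run failed.

```shell
# Delete the existing output directory so the job can recreate it.
# -r removes the directory and everything in it; double-check the path first.
hdfs dfs -rm -r /circlenet/output/taskA/simple
```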
ClassNotFoundException
This usually means the JAR wasn’t copied correctly. Verify that the JAR exists inside the container; if it is missing, repeat Step 2.
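A quick existence check might look like the sketch below. The helper name check_jar is hypothetical, and the default location under $HOME is an assumption; set JAR to wherever you copied the file in Step 2.

```shell
check_jar() {
  # Report whether the file at the given path exists.
  if [ -f "$1" ]; then
    echo "found: $1"
  else
    echo "missing: $1 (repeat Step 2)"
    return 1
  fi
}

# Assumed default location; override with JAR=/path/to/file.jar
check_jar "${JAR:-$HOME/ds503_bdm-1.0-SNAPSHOT.jar}" || true
```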
Input path does not exist