Quickstart

This guide walks you through running Task A, which analyzes the frequency of favorite hobbies across all CircleNet users. You’ll learn how to build the JAR, copy it to your container, and execute a MapReduce job.

Prerequisites

Before you begin, ensure you have:

Docker container running with Hadoop installed (see setup guide)
CircleNet datasets loaded into HDFS at /circlenet/pages/CircleNetPage.csv
Maven installed for building the project
SSH access to your container on port 3000

If you haven’t set up your environment yet, follow the complete setup guide first.

Step 1: Build the JAR file

From your project root directory, build the project using Maven:

mvn clean package -DskipTests

This command:

Cleans previous builds
Compiles all Java source files in the circlenet package
Packages everything into target/ds503_bdm-1.0-SNAPSHOT.jar
Skips test execution for faster builds (test sources are still compiled)

The JAR file contains all compiled MapReduce jobs including mappers, reducers, and driver classes for all eight tasks.

Step 2: Copy JAR to container

List the running containers to find your container ID:

docker ps

Copy the JAR using your container ID (replace dadc9d47d16e with yours):

docker cp target/ds503_bdm-1.0-SNAPSHOT.jar dadc9d47d16e:/home/ds503/

Step 3: Set up environment variables

SSH into your container:

ssh -p 3000 ds503@localhost

Default credentials are username ds503 and password ds503.

Set up your environment variables for easier job execution:

export JAR=/home/ds503/ds503_bdm-1.0-SNAPSHOT.jar
export PAGES=/circlenet/pages/CircleNetPage.csv
export OUT=/circlenet/output
export LOCAL_OUT=/home/ds503/results
export CIRCLENET_TIMING_FILE=/home/ds503/task_times.csv

Create the local output directory:

mkdir -p $LOCAL_OUT

Step 4: Run Task A

Task A analyzes hobby frequency by reading the CircleNetPage dataset and counting occurrences of each hobby.

Run the simple version

hadoop jar $JAR circlenet.taskA.TaskA $PAGES $OUT/taskA/simple

This command:

Executes the TaskA class from the JAR
Reads input from /circlenet/pages/CircleNetPage.csv
Writes output to /circlenet/output/taskA/simple

Run the optimized version

The optimized version uses a combiner to reduce data shuffling:

hadoop jar $JAR circlenet.taskA.TaskAOptimized $PAGES $OUT/taskA/optimized

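The optimized version typically runs 20-30% faster because the combiner aggregates hobby counts on each map task before the shuffle phase, so fewer records cross the network. The actual gain depends on the dataset size and the cluster.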
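
The effect of local aggregation can be illustrated without a cluster. The following is a minimal plain-Java sketch, not the project's TaskAOptimized code: the class CombinerSketch, its methods, and the two sample splits are hypothetical, and it assumes the combiner runs exactly once per map task (Hadoop treats the combiner as an optional optimization and may run it zero or more times).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Plain-Java sketch of how a combiner shrinks the data sent to reducers. */
final class CombinerSketch {

    /** Local aggregation of one map task's output: hobby -> partial count. */
    static Map<String, Integer> combine(List<String> hobbiesFromOneSplit) {
        Map<String, Integer> partial = new TreeMap<>();
        for (String hobby : hobbiesFromOneSplit) {
            partial.merge(hobby, 1, Integer::sum);
        }
        return partial;
    }

    /** Reduce step: add up the partial counts from every split. */
    static Map<String, Integer> reduce(List<Map<String, Integer>> partials) {
        Map<String, Integer> totals = new TreeMap<>();
        for (Map<String, Integer> partial : partials) {
            partial.forEach((hobby, count) -> totals.merge(hobby, count, Integer::sum));
        }
        return totals;
    }

    /** Without a combiner, every emitted (hobby, 1) pair is shuffled. */
    static int recordsWithoutCombiner(List<List<String>> splits) {
        int records = 0;
        for (List<String> split : splits) {
            records += split.size();
        }
        return records;
    }

    /** With a combiner, one record per distinct hobby per split is shuffled. */
    static int recordsWithCombiner(List<List<String>> splits) {
        int records = 0;
        for (List<String> split : splits) {
            records += combine(split).size();
        }
        return records;
    }

    public static void main(String[] args) {
        // Hypothetical hobbies emitted by two map tasks.
        List<List<String>> splits = Arrays.asList(
            Arrays.asList("Hiking", "Cooking", "Hiking", "Hiking"),
            Arrays.asList("Cooking", "Cooking", "Gardening"));

        List<Map<String, Integer>> partials = new ArrayList<>();
        for (List<String> split : splits) {
            partials.add(combine(split));
        }

        System.out.println("shuffled records without combiner: " + recordsWithoutCombiner(splits));
        System.out.println("shuffled records with combiner: " + recordsWithCombiner(splits));
        System.out.println("final counts: " + reduce(partials));
    }
}
```

In this sample, seven emitted pairs collapse to four shuffled records, and the final counts are unchanged because addition is associative and commutative.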

Step 5: View the results

Inspect the output directly in HDFS:

hdfs dfs -ls $OUT/taskA/simple

View the first 20 results:

hdfs dfs -cat $OUT/taskA/simple/part-r-00000 | head -20

Expected output format (hobby and count separated by a tab; rows within a part file are sorted by hobby name):

Cooking	4234
Gardening	3876
Hiking	3945
Photography	4102
PodcastBinging	4523
...
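
Because the job's output is ordered by hobby name rather than by count, finding the most popular hobby takes one more step. The following is a minimal plain-Java sketch that parses the tab-separated rows shown above and ranks them by count; the class HobbyRankingSketch and its methods are hypothetical helpers, not part of the project.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

/** Sketch: parse "hobby<TAB>count" lines and rank hobbies by count. */
final class HobbyRankingSketch {

    /** One parsed output row. */
    static final class HobbyCount {
        final String hobby;
        final int count;

        HobbyCount(String hobby, int count) {
            this.hobby = hobby;
            this.count = count;
        }
    }

    /** Returns the rows sorted by count, largest first. */
    static List<HobbyCount> rank(List<String> lines) {
        List<HobbyCount> rows = new ArrayList<>();
        for (String line : lines) {
            String[] parts = line.split("\t");
            if (parts.length != 2) {
                continue; // skip blank or malformed lines
            }
            rows.add(new HobbyCount(parts[0], Integer.parseInt(parts[1].trim())));
        }
        rows.sort(Comparator.comparingInt((HobbyCount row) -> row.count).reversed());
        return rows;
    }

    public static void main(String[] args) {
        // The sample rows from this guide.
        List<String> lines = Arrays.asList(
            "PodcastBinging\t4523",
            "Gardening\t3876",
            "Photography\t4102",
            "Hiking\t3945",
            "Cooking\t4234");
        for (HobbyCount row : rank(lines)) {
            System.out.println(row.hobby + "\t" + row.count);
        }
    }
}
```

To try it on real output, the lines could come from the files copied out of HDFS in Step 6 instead of the hard-coded list.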

Step 6: Copy results to local

Copy the results from HDFS to your container’s local filesystem:

mkdir -p $LOCAL_OUT/taskA/simple
mkdir -p $LOCAL_OUT/taskA/optimized
hdfs dfs -get $OUT/taskA/simple $LOCAL_OUT/taskA/
hdfs dfs -get $OUT/taskA/optimized $LOCAL_OUT/taskA/

From your host machine, copy results out of the container:

docker cp dadc9d47d16e:/home/ds503/results/taskA ./results_taskA

Step 7: Check performance metrics

View the timing information:

cat $CIRCLENET_TIMING_FILE

Filter for total execution time:

grep ",total," $CIRCLENET_TIMING_FILE

The timing file shows execution time in milliseconds for each phase (map, reduce, total).
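
Once you have the total times for both versions, the speedup is simple arithmetic. The sketch below is a hypothetical helper, not part of the project. Since the column layout of the timing file is not described here, it takes the two totals as plain numbers, and the values in main are made up for illustration.

```java
import java.util.Locale;

/** Sketch: compare two total run times read from the timing file. */
final class SpeedupSketch {

    /**
     * Percentage reduction in elapsed time of the optimized run
     * relative to the simple run. Negative if the optimized run was slower.
     */
    static double percentFaster(long simpleMillis, long optimizedMillis) {
        if (simpleMillis <= 0) {
            throw new IllegalArgumentException("simpleMillis must be positive");
        }
        return 100.0 * (simpleMillis - optimizedMillis) / simpleMillis;
    }

    public static void main(String[] args) {
        // Hypothetical totals in milliseconds, not measurements from the project.
        long simpleMillis = 40_000;
        long optimizedMillis = 30_000;
        System.out.println(String.format(Locale.ROOT,
            "optimized run took %.1f%% less time",
            percentFaster(simpleMillis, optimizedMillis)));
    }
}
```

With these made-up totals the reduction is 25%, which falls inside the 20-30% range quoted earlier for the combiner.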

Understanding the code

Here’s how Task A works under the hood:

Mapper

The mapper reads each line from CircleNetPage.csv and emits the hobby as the key:

src/main/java/circlenet/taskA/TaskA.java

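// Imports needed at the top of TaskA.java for this class
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public static class TaskAMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    // Reused across map() calls to avoid allocating new objects per record
    private static final IntWritable one = new IntWritable(1);
    private final Text hobby = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        String[] fields = line.split(",");
        // Skip lines that do not split into exactly five fields
        if (fields.length == 5) {
            hobby.set(fields[4]);  // FavoriteHobby is the 5th field
            context.write(hobby, one);
        }
    }
}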
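
The field check in the mapper can be tried without Hadoop. The following is a minimal plain-Java sketch, assuming only what the mapper shows: comma-separated rows with the hobby in the fifth field. The class HobbyFieldSketch, the method extractHobby, and the sample rows are hypothetical.

```java
import java.util.Optional;

/** Plain-Java sketch of the mapper's parsing step (no Hadoop dependency). */
final class HobbyFieldSketch {

    /**
     * Returns the fifth comma-separated field, or empty when the line
     * does not split into exactly five fields (the mapper skips such lines).
     */
    static Optional<String> extractHobby(String line) {
        String[] fields = line.split(",");
        if (fields.length == 5) {
            return Optional.of(fields[4]);
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Hypothetical rows; only the position of the hobby column comes from the guide.
        String[] sample = {
            "1,Ann,US,17,Hiking", // five fields: hobby is extracted
            "2,Bob,CA,Cooking",   // four fields: skipped
            "3,Cy,UK,9,"          // empty hobby: split drops the trailing field, skipped
        };
        for (String line : sample) {
            System.out.println(line + " -> " + extractHobby(line).orElse("(skipped)"));
        }
    }
}
```

The third row is worth noting: String.split discards trailing empty strings, so a row with an empty hobby yields four fields and is dropped rather than counted under an empty key. A row with an extra comma inside a field yields more than five fields and is dropped as well.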

Reducer

The reducer sums up all the counts for each hobby:

src/main/java/circlenet/taskA/TaskA.java

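// Imports needed at the top of TaskA.java for this class
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public static class TaskAReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    // Reused across reduce() calls to avoid allocating a new object per key
    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // Add up every count emitted for this hobby
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        result.set(sum);
        context.write(key, result);
    }
}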

This is the classic word count pattern: the mapper emits (hobby, 1) pairs, the framework groups the pairs by hobby, and the reducer sums the counts in each group.
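
To see the three phases together, here is a minimal in-memory sketch in plain Java. It mirrors the logic of the mapper and reducer above but is not the project's code: the class HobbyCountSketch, its methods, and the sample rows are hypothetical, and a TreeMap stands in for the sorted grouping that Hadoop performs between map and reduce.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** In-memory sketch of the map, shuffle, and reduce phases for hobby counting. */
final class HobbyCountSketch {

    /** Map phase: emit one (hobby, 1) pair per line with exactly five fields. */
    static List<Map.Entry<String, Integer>> map(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            String[] fields = line.split(",");
            if (fields.length == 5) {
                pairs.add(new AbstractMap.SimpleEntry<>(fields[4], 1));
            }
        }
        return pairs;
    }

    /** Shuffle phase: group values by key; the TreeMap keeps keys sorted. */
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    /** Reduce phase: sum the values collected for each key. */
    static Map<String, Integer> reduce(Map<String, List<Integer>> grouped) {
        Map<String, Integer> totals = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> entry : grouped.entrySet()) {
            int sum = 0;
            for (int value : entry.getValue()) {
                sum += value;
            }
            totals.put(entry.getKey(), sum);
        }
        return totals;
    }

    /** Runs the three phases in order. */
    static Map<String, Integer> countHobbies(List<String> lines) {
        return reduce(shuffle(map(lines)));
    }

    public static void main(String[] args) {
        // Hypothetical rows; only the position of the hobby column comes from the guide.
        List<String> lines = Arrays.asList(
            "1,Ann,US,17,Hiking",
            "2,Bob,CA,4,Cooking",
            "3,Cy,UK,9,Hiking",
            "malformed line");
        countHobbies(lines).forEach((hobby, count) -> System.out.println(hobby + "\t" + count));
    }
}
```

On a cluster the same three steps run in parallel across many map and reduce tasks; the sketch only shows the data flow for a single process.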

Next steps

Explore Task B: Find the 10 most popular CircleNetPages
All tasks: View all 8 analytics tasks
Optimization guide: Learn MapReduce optimization techniques
Dataset details: Understand the data structure

Troubleshooting

Output directory already exists

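Hadoop refuses to write into an output directory that already exists. If you see this error, remove the directory first. The command below deletes the output of both the simple and the optimized run: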

hdfs dfs -rm -r $OUT/taskA

Then rerun your job.

ClassNotFoundException

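This usually means the JAR wasn’t copied correctly, or the class name on the command line doesn’t match the fully qualified name (for example circlenet.taskA.TaskA). Verify the JAR exists: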

ls -lh /home/ds503/ds503_bdm-1.0-SNAPSHOT.jar

If missing, repeat Step 2.

Input path does not exist

Ensure your datasets are loaded into HDFS:

hdfs dfs -ls /circlenet/pages/

If empty, follow the setup guide to load your datasets.
