This guide covers essential HDFS operations for managing datasets in the CircleNet Analytics project.

HDFS Command Structure

All HDFS commands follow this pattern:
hadoop/bin/hdfs dfs -<command> <arguments>
Breakdown:
  • hadoop/bin/hdfs - Main HDFS executable
  • dfs - Subcommand that runs distributed file system operations
  • -<command> - Specific operation (mkdir, put, get, cat, etc.)
  • <arguments> - Command-specific arguments
Most HDFS commands mirror Unix/Linux file system commands with a - prefix.

Setting Up Directory Structure

Creating Directories

1. Create Base User Directory

hdfs dfs -mkdir -p /user/ds503
Convention: /user/<username> for user-specific data. The -p flag creates the parent /user directory if it does not already exist.
2. Create Project Directory

hdfs dfs -mkdir /circlenet
3. Create Dataset Directories

hdfs dfs -mkdir /circlenet/pages
hdfs dfs -mkdir /circlenet/follows
hdfs dfs -mkdir /circlenet/activitylog
4. Verify Directory Structure

hdfs dfs -ls /circlenet
Expected output:
drwxr-xr-x   - ds503 supergroup          0 2024-03-01 10:30 /circlenet/activitylog
drwxr-xr-x   - ds503 supergroup          0 2024-03-01 10:30 /circlenet/follows
drwxr-xr-x   - ds503 supergroup          0 2024-03-01 10:30 /circlenet/pages

Directory Structure Overview

/
├── user/
│   └── ds503/
└── circlenet/
    ├── pages/
    │   └── CircleNetPage.csv
    ├── follows/
    │   └── Follows.csv
    ├── activitylog/
    │   └── ActivityLog.csv
    └── output/
        ├── taskA/
        ├── taskB/
        └── ...

Loading Data into HDFS

Uploading CSV Files

1. Navigate to Data Directory

Inside the container, your mounted data is at:
cd /home/ds503/data
ls
Should show: CircleNetPage.csv, Follows.csv, ActivityLog.csv
2. Upload Files to HDFS

hdfs dfs -put CircleNetPage.csv /circlenet/pages/
hdfs dfs -put Follows.csv /circlenet/follows/
hdfs dfs -put ActivityLog.csv /circlenet/activitylog/
The put command uploads from local filesystem to HDFS.
3. Verify Upload

hdfs dfs -ls /circlenet/pages
hdfs dfs -ls /circlenet/follows
hdfs dfs -ls /circlenet/activitylog
Check file sizes match your local files.
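Beyond eyeballing file sizes, you can compare line counts directly. A minimal sketch wrapped in a helper function (counts_match is a hypothetical name, not an HDFS command):

```shell
# Hypothetical helper: compare a local file's line count with its HDFS copy.
counts_match() {
  local_n=$(wc -l < "$1" | tr -d ' ')                 # local line count
  hdfs_n=$(hdfs dfs -cat "$2" | wc -l | tr -d ' ')    # HDFS line count
  if [ "$local_n" -eq "$hdfs_n" ]; then
    echo "OK: $local_n lines"
  else
    echo "MISMATCH: local=$local_n hdfs=$hdfs_n"
  fi
}
# usage: counts_match Follows.csv /circlenet/follows/Follows.csv
```

Streaming a 20-million-row file through -cat just to count lines is slow; for a quick sanity check, comparing byte sizes with hdfs dfs -ls -h is usually enough.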

Dataset Specifications

CircleNet Analytics datasets:
File                 Records       Description
CircleNetPage.csv    200,000       User profiles (ID, NickName, JobTitle, RegionCode, FavoriteHobby)
Follows.csv          20,000,000    Follow relationships (ColRel, ID1, ID2, DateOfRelation, Description)
ActivityLog.csv      10,000,000    User actions (ActionId, ByWho, WhatPage, ActionType, ActionTime)
CSV files should NOT include column headers. Only data values, separated by commas.
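If an exported CSV still carries a header row, strip it locally before uploading. A quick sketch using a throwaway file with hypothetical column names:

```shell
# Build a throwaway CSV with a header row, then drop line 1 with tail -n +2.
printf 'ID,NickName\n1,alice\n2,bob\n' > /tmp/with_header.csv
tail -n +2 /tmp/with_header.csv > /tmp/no_header.csv
head -1 /tmp/no_header.csv    # first line is now a data row: 1,alice
# then upload the cleaned file:
# hdfs dfs -put /tmp/no_header.csv /circlenet/pages/
```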

Viewing Data in HDFS

Inspecting File Contents

# View first 5 rows
hdfs dfs -cat /circlenet/follows/Follows.csv | head -5

# View last 5 rows
hdfs dfs -cat /circlenet/follows/Follows.csv | tail -5

# View specific number of lines
hdfs dfs -cat /circlenet/pages/CircleNetPage.csv | head -20

File Statistics

# Detailed file information
hdfs dfs -ls /circlenet/pages/CircleNetPage.csv

# Human-readable file sizes
hdfs dfs -ls -h /circlenet/pages/CircleNetPage.csv

# Count files, directories, and total size
hdfs dfs -count -h /circlenet
Example output:
    DIR_COUNT   FILE_COUNT   CONTENT_SIZE   PATHNAME
            3            3          2.5 G   /circlenet

Verifying Data Integrity

1. Count Lines

hdfs dfs -cat /circlenet/pages/CircleNetPage.csv | wc -l
Should show 200,000 lines for CircleNetPage.csv
2. Sample Random Records

hdfs dfs -cat /circlenet/pages/CircleNetPage.csv | shuf | head -10
3. Check for Format Issues

# Look for lines with unexpected field counts
hdfs dfs -cat /circlenet/pages/CircleNetPage.csv | head -100 | awk -F',' '{print NF}' | sort | uniq -c
All lines should have the same field count (5 for CircleNetPage).
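The same awk check can be turned into a validator that reports the line number and field count of every malformed row. A runnable demo on a throwaway file with one deliberately bad row:

```shell
# Rows whose field count is not 5 are reported as "<line>: <count> fields".
printf 'a,b,c,d,e\n1,2,3,4,5\nbad,row\n' > /tmp/sample.csv
awk -F',' 'NF != 5 {print NR": "NF" fields"}' /tmp/sample.csv
# prints: 3: 2 fields
# against HDFS data:
# hdfs dfs -cat /circlenet/pages/CircleNetPage.csv | awk -F',' 'NF != 5 {print NR": "NF" fields"}'
```

Silence means every row passed, which makes this easy to use as a gate in upload scripts.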

Managing Job Outputs

Listing Output Files

# List all outputs for Task A
hdfs dfs -ls /circlenet/output/taskA/simple

# Typical output structure:
# _SUCCESS              (empty file indicating job completion)
# part-r-00000          (reducer output)
# part-r-00001          (if multiple reducers)

Reading Output Files

# View complete output
hdfs dfs -cat /circlenet/output/taskA/simple/part-r-00000

# View first 20 results
hdfs dfs -cat /circlenet/output/taskA/simple/part-r-00000 | head -20

# Search for specific results
hdfs dfs -cat /circlenet/output/taskA/simple/part-r-00000 | grep "Chess"

Combining Multiple Output Parts

# When job has multiple reducers, combine all parts
hdfs dfs -cat /circlenet/output/taskA/simple/part-* | sort -k2 -nr | head -10

Downloading Data from HDFS

Copy to Container Local Filesystem

# Create local directory
mkdir -p /home/ds503/results/taskA

# Download from HDFS
hdfs dfs -get /circlenet/output/taskA/simple /home/ds503/results/taskA/

# Force overwrite existing files
hdfs dfs -get -f /circlenet/output/taskA/simple /home/ds503/results/taskA/

Copy to Host Machine

From your host machine (outside container):
# Get container ID
docker ps

# Copy results directory (substitute the container ID shown by docker ps)
docker cp dadc9d47d16e:/home/ds503/results/taskA ./results_taskA
Use environment variables for cleaner commands:
export OUT=/circlenet/output
export LOCAL_OUT=/home/ds503/results
hdfs dfs -get $OUT/taskA/simple $LOCAL_OUT/taskA/

Deleting Data

Remove Output Directories

# Remove single output
hdfs dfs -rm -r /circlenet/output/taskA/simple

# Remove all outputs for a task
hdfs dfs -rm -r /circlenet/output/taskA

# Force removal (-f suppresses errors for missing paths; add -skipTrash to bypass the trash)
hdfs dfs -rm -r -f /circlenet/output/taskA
Important: Hadoop will NOT overwrite existing output directories. You must delete them first before re-running jobs.
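A small wrapper makes this delete-then-run pattern hard to forget (run_clean is a hypothetical helper, not an HDFS command):

```shell
# Hypothetical helper: clear the output path, then launch the job with
# that path appended as its final argument.
run_clean() {
  out="$1"; shift
  hdfs dfs -rm -r -f "$out"   # -f: no error if the path does not exist
  "$@" "$out"
}
# usage, with the variables used elsewhere in this guide:
# run_clean /circlenet/output/taskA/simple \
#     hadoop jar "$JAR" circlenet.taskA.TaskA "$PAGES"
```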

Clean All Task Outputs

# Remove all task outputs at once
hdfs dfs -rm -r -f /circlenet/output/taskA \
                    /circlenet/output/taskB \
                    /circlenet/output/taskC \
                    /circlenet/output/taskD \
                    /circlenet/output/taskE \
                    /circlenet/output/taskF \
                    /circlenet/output/taskG \
                    /circlenet/output/taskH

Using Wildcards

# Remove all task outputs
hdfs dfs -rm -r -f /circlenet/output/task*

# Remove specific patterns
hdfs dfs -rm -r -f /circlenet/output/*/simple
hdfs dfs -rm -r -f /circlenet/output/*/tmp_*

Advanced Operations

Moving and Renaming

# Rename file or directory
hdfs dfs -mv /circlenet/output/taskA/simple /circlenet/output/taskA/v1

# Move to different directory
hdfs dfs -mv /circlenet/output/taskA/temp /circlenet/archive/

Copying Within HDFS

# Copy within HDFS
hdfs dfs -cp /circlenet/output/taskA/simple /circlenet/output/taskA/backup

# Copy recursively
hdfs dfs -cp -r /circlenet/output/taskA /circlenet/archive/taskA_$(date +%Y%m%d)

Changing Permissions

# Change file permissions
hdfs dfs -chmod 755 /circlenet/pages/CircleNetPage.csv

# Change ownership
hdfs dfs -chown ds503:supergroup /circlenet/output/taskA

# Recursive permission change
hdfs dfs -chmod -R 755 /circlenet/output

Disk Usage

# Check disk usage for directory
hdfs dfs -du -h /circlenet

# Summary of total usage
hdfs dfs -du -s -h /circlenet

# Example output:
# 500.2 M  /circlenet/activitylog
# 1.2 G    /circlenet/follows
# 50.3 M   /circlenet/pages

Web UI for HDFS

Access the Hadoop NameNode Web UI for visual file browsing at http://localhost:3002.

Features:

  • Browse HDFS: Navigate directories visually
  • Download files: Click files to download
  • View file properties: Size, replication, block locations
  • Cluster health: DataNode status, capacity
Direct file browser: http://localhost:3002/explorer.html#/circlenet
The Web UI is useful for:
  • Verifying data uploads
  • Checking job output completion
  • Monitoring cluster storage

HDFS Safe Mode

HDFS may enter safe mode on startup, preventing write operations.

Check Safe Mode Status

hdfs dfsadmin -safemode get
Output:
  • Safe mode is ON - Cannot write to HDFS
  • Safe mode is OFF - Normal operation

Exit Safe Mode

# Leave safe mode
hdfs dfsadmin -safemode leave

# Wait for safe mode to exit automatically
hdfs dfsadmin -safemode wait
Do not force leave safe mode if HDFS is performing startup checks. Wait for automatic exit when possible.
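To make upload scripts robust against a cluster that is still starting up, check safe mode before writing. A sketch with a hypothetical wrapper safe_put:

```shell
# Hypothetical wrapper: block until safe mode clears, then upload.
safe_put() {
  if hdfs dfsadmin -safemode get | grep -q 'ON'; then
    hdfs dfsadmin -safemode wait   # returns once safe mode is OFF
  fi
  hdfs dfs -put "$1" "$2"
}
# usage: safe_put CircleNetPage.csv /circlenet/pages/
```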

Common Workflows

Upload New Dataset Version

1. Remove Old Data

hdfs dfs -rm /circlenet/pages/CircleNetPage.csv
2. Upload New Data

hdfs dfs -put data/CircleNetPage_v2.csv /circlenet/pages/CircleNetPage.csv
3. Verify

hdfs dfs -ls -h /circlenet/pages/
hdfs dfs -cat /circlenet/pages/CircleNetPage.csv | head -5

Re-run Job with Clean Output

1. Remove Previous Output

hdfs dfs -rm -r -f /circlenet/output/taskA/simple
2. Run Job

hadoop jar $JAR circlenet.taskA.TaskA $PAGES /circlenet/output/taskA/simple
3. Verify Results

hdfs dfs -ls /circlenet/output/taskA/simple
hdfs dfs -test -e /circlenet/output/taskA/simple/_SUCCESS && echo "job succeeded"
hdfs dfs -cat /circlenet/output/taskA/simple/part-r-00000 | head -20

Compare Simple vs Optimized Outputs

# Download both outputs
hdfs dfs -get /circlenet/output/taskA/simple /tmp/taskA_simple
hdfs dfs -get /circlenet/output/taskA/optimized /tmp/taskA_optimized

# Canonicalize and compare
cat /tmp/taskA_simple/part-* | sort > /tmp/taskA_simple.txt
cat /tmp/taskA_optimized/part-* | sort > /tmp/taskA_optimized.txt
diff -u /tmp/taskA_simple.txt /tmp/taskA_optimized.txt
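The sort step matters because different reducer counts can split the same records across part files in a different order. A minimal local demo with throwaway part files:

```shell
# Same records, different order/partitioning: equal after canonicalizing.
mkdir -p /tmp/out_a /tmp/out_b
printf 'Chess\t42\nGolf\t7\n' > /tmp/out_a/part-r-00000
printf 'Golf\t7\nChess\t42\n' > /tmp/out_b/part-r-00000
cat /tmp/out_a/part-* | sort > /tmp/a.sorted
cat /tmp/out_b/part-* | sort > /tmp/b.sorted
diff -q /tmp/a.sorted /tmp/b.sorted && echo "outputs match"
```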

Quick Reference

Essential Commands

Operation          Command
List files         hdfs dfs -ls <path>
Create directory   hdfs dfs -mkdir <path>
Upload file        hdfs dfs -put <local> <hdfs>
Download file      hdfs dfs -get <hdfs> <local>
View file          hdfs dfs -cat <path>
Delete file/dir    hdfs dfs -rm -r <path>
Copy               hdfs dfs -cp <src> <dst>
Move/Rename        hdfs dfs -mv <src> <dst>
Disk usage         hdfs dfs -du -h <path>
Count              hdfs dfs -count -h <path>

Path Shortcuts

# Relative to user home directory
hdfs dfs -ls            # no path defaults to /user/ds503
hdfs dfs -ls .          # same: /user/ds503

# Absolute paths
hdfs dfs -ls /circlenet

Troubleshooting

Check file ownership and permissions:
hdfs dfs -ls -l /circlenet
Fix permissions:
hdfs dfs -chmod 755 /circlenet/pages
hdfs dfs -chown ds503 /circlenet/pages/CircleNetPage.csv
Hadoop will not overwrite output directories. Remove first:
hdfs dfs -rm -r /circlenet/output/taskA/simple
Verify path and file existence:
hdfs dfs -ls /circlenet/pages
hdfs dfs -ls -R /circlenet
Check for typos in file names (case-sensitive).
Check disk usage:
hdfs dfs -df -h
hdfs dfs -du -s -h /circlenet
Clean up old outputs:
hdfs dfs -rm -r /circlenet/output/old_*
hdfs dfs -rm -r /circlenet/archive/*
