This guide covers essential HDFS operations for managing datasets in the CircleNet Analytics project.
HDFS Command Structure
All HDFS commands follow this pattern:
hadoop/bin/hdfs dfs -<command> <arguments>
Breakdown:
hadoop/bin/hdfs - Main HDFS executable
dfs - Distributed File System parameter
-<command> - Specific operation (mkdir, put, get, cat, etc.)
<arguments> - Command-specific arguments
Most HDFS commands mirror Unix/Linux file system commands with a - prefix.
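The correspondence is close enough that you can often translate a local command one-for-one. A minimal local sketch of the mapping (the HDFS equivalents are shown as comments, since they need a running cluster):

```shell
# Unix command on the local filesystem      # HDFS equivalent
mkdir -p /tmp/demo/data                     # hdfs dfs -mkdir -p /demo/data
ls /tmp/demo                                # hdfs dfs -ls /demo
echo "a,b,c" > /tmp/demo/data/sample.csv    # (use -put to upload instead)
cat /tmp/demo/data/sample.csv               # hdfs dfs -cat /demo/data/sample.csv
rm -r /tmp/demo                             # hdfs dfs -rm -r /demo
```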
Setting Up Directory Structure
Creating Directories
Create Base User Directory
hdfs dfs -mkdir /user/ds503
Convention: /user/<username> for user-specific data
Create Project Directory
hdfs dfs -mkdir /circlenet
Create Dataset Directories
hdfs dfs -mkdir /circlenet/pages
hdfs dfs -mkdir /circlenet/follows
hdfs dfs -mkdir /circlenet/activitylog
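If a parent directory does not exist yet, plain -mkdir fails; the -p flag creates missing parents, so the whole tree can be built in one call. A sketch wrapped in a helper (hypothetical function name; assumes the hdfs client is on PATH inside the container):

```shell
# Create the full CircleNet directory tree in one command.
# -p creates any missing parent directories along the way.
create_circlenet_dirs() {
    hdfs dfs -mkdir -p \
        /circlenet/pages \
        /circlenet/follows \
        /circlenet/activitylog \
        /circlenet/output
}
# Usage: create_circlenet_dirs
```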
Verify Directory Structure
hdfs dfs -ls /circlenet
Expected output: drwxr-xr-x - ds503 supergroup 0 2024-03-01 10:30 /circlenet/activitylog
drwxr-xr-x - ds503 supergroup 0 2024-03-01 10:30 /circlenet/follows
drwxr-xr-x - ds503 supergroup 0 2024-03-01 10:30 /circlenet/pages
Directory Structure Overview
/
├── user/
│ └── ds503/
└── circlenet/
├── pages/
│ └── CircleNetPage.csv
├── follows/
│ └── Follows.csv
├── activitylog/
│ └── ActivityLog.csv
└── output/
├── taskA/
├── taskB/
└── ...
Loading Data into HDFS
Uploading CSV Files
Navigate to Data Directory
Inside the container, your mounted data is in the data/ directory:
ls data/
Should show: CircleNetPage.csv, Follows.csv, ActivityLog.csv
Upload Files to HDFS
hdfs dfs -put data/CircleNetPage.csv /circlenet/pages/
hdfs dfs -put data/Follows.csv /circlenet/follows/
hdfs dfs -put data/ActivityLog.csv /circlenet/activitylog/
The put command uploads a file from the local filesystem to HDFS.
Verify Upload
hdfs dfs -ls /circlenet/pages
hdfs dfs -ls /circlenet/follows
hdfs dfs -ls /circlenet/activitylog
Check file sizes match your local files.
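One way to compare sizes is to compute the local byte count and check it against the HDFS listing. A minimal local sketch (the HDFS half is commented out since it needs the cluster):

```shell
# Compute a local file's size in bytes.
printf 'id,name\n1,alice\n' > /tmp/sample_upload.csv
local_bytes=$(wc -c < /tmp/sample_upload.csv)
echo "local size: ${local_bytes} bytes"
# Compare against the uploaded copy; the first column of -du is bytes:
# hdfs dfs -du /circlenet/pages/CircleNetPage.csv
```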
Dataset Specifications
CircleNet Analytics datasets:
File                Records      Description
CircleNetPage.csv   200,000      User profiles (ID, NickName, JobTitle, RegionCode, FavoriteHobby)
Follows.csv         20,000,000   Follow relationships (ColRel, ID1, ID2, DateOfRelation, Description)
ActivityLog.csv     10,000,000   User actions (ActionId, ByWho, WhatPage, ActionType, ActionTime)
CSV files should NOT include column headers. Only data values, separated by commas.
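If a source file does arrive with a header row, strip it before uploading. A sketch using `tail -n +2` on a throwaway sample (the final -put is commented out since it needs the cluster):

```shell
# Build a sample CSV with a header row.
printf 'ID,NickName\n101,alice\n102,bob\n' > /tmp/with_header.csv
# Keep everything from line 2 onward, dropping the header.
tail -n +2 /tmp/with_header.csv > /tmp/no_header.csv
cat /tmp/no_header.csv
# Then upload the headerless copy:
# hdfs dfs -put /tmp/no_header.csv /circlenet/pages/
```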
Viewing Data in HDFS
Inspecting File Contents
# View first 5 rows
hdfs dfs -cat /circlenet/follows/Follows.csv | head -5
# View last 5 rows
hdfs dfs -cat /circlenet/follows/Follows.csv | tail -5
# View specific number of lines
hdfs dfs -cat /circlenet/pages/CircleNetPage.csv | head -20
File Statistics
# Detailed file information
hdfs dfs -ls /circlenet/pages/CircleNetPage.csv
# Human-readable file sizes
hdfs dfs -ls -h /circlenet/pages/CircleNetPage.csv
# Count files, directories, and total size
hdfs dfs -count -h /circlenet
Example output:
DIR_COUNT FILE_COUNT CONTENT_SIZE PATHNAME
3 3 2.5 G /circlenet
Verifying Data Integrity
Count Lines
hdfs dfs -cat /circlenet/pages/CircleNetPage.csv | wc -l
Should show 200,000 lines for CircleNetPage.csv
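The same check can be scripted for all three files using the expected counts from the specifications table. A sketch with a hypothetical helper (the HDFS count is commented so the comparison logic runs standalone):

```shell
# Compare an actual line count against the expected one.
check_count() {
    # $1 = actual count, $2 = expected count, $3 = file label
    if [ "$1" -eq "$2" ]; then
        echo "$3: OK ($1 lines)"
    else
        echo "$3: MISMATCH (got $1, expected $2)"
    fi
}
# With a live cluster, the actual count comes from:
#   actual=$(hdfs dfs -cat /circlenet/pages/CircleNetPage.csv | wc -l)
check_count 200000 200000 "CircleNetPage.csv"
# -> CircleNetPage.csv: OK (200000 lines)
```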
Sample Random Records
hdfs dfs -cat /circlenet/pages/CircleNetPage.csv | shuf | head -10
Check for Format Issues
# Look for lines with unexpected field counts
hdfs dfs -cat /circlenet/pages/CircleNetPage.csv | head -100 | awk -F ',' '{print NF}' | sort | uniq -c
All lines should have the same field count (5 for CircleNetPage).
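For a full-file check rather than just the first 100 lines, the same awk filter can print the line number of every offending row. A local sketch on fabricated data:

```shell
# Fabricate a file where one row has the wrong number of fields.
printf 'a,b,c,d,e\n1,2,3,4,5\nbad,row\n' > /tmp/check.csv
# Print line number and field count for every row that is not 5 fields wide.
awk -F ',' 'NF != 5 {print NR": "NF" fields"}' /tmp/check.csv
# -> 3: 2 fields
# Against HDFS, stream the whole file through the same filter:
# hdfs dfs -cat /circlenet/pages/CircleNetPage.csv | awk -F ',' 'NF != 5 {print NR": "NF" fields"}'
```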
Managing Job Outputs
Listing Output Files
# List all outputs for Task A
hdfs dfs -ls /circlenet/output/taskA/simple
# Typical output structure:
# _SUCCESS (empty file indicating job completion)
# part-r-00000 (reducer output)
# part-r-00001 (if multiple reducers)
Reading Output Files
# View complete output
hdfs dfs -cat /circlenet/output/taskA/simple/part-r-00000
# View first 20 results
hdfs dfs -cat /circlenet/output/taskA/simple/part-r-00000 | head -20
# Search for specific results
hdfs dfs -cat /circlenet/output/taskA/simple/part-r-00000 | grep "Chess"
Combining Multiple Output Parts
# When job has multiple reducers, combine all parts
hdfs dfs -cat /circlenet/output/taskA/simple/part-* | sort -k2 -nr | head -10
Downloading Data from HDFS
Copy to Container Local Filesystem
# Create local directory
mkdir -p /home/ds503/results/taskA
# Download from HDFS
hdfs dfs -get /circlenet/output/taskA/simple /home/ds503/results/taskA/
# Force overwrite existing files
hdfs dfs -get -f /circlenet/output/taskA/simple /home/ds503/results/taskA/
Copy to Host Machine
From your host machine (outside container):
# Get container ID
docker ps
# Copy results directory
docker cp <container_id>:/home/ds503/results/taskA ./results_taskA
Use environment variables for cleaner commands:
export OUT=/circlenet/output
export LOCAL_OUT=/home/ds503/results
hdfs dfs -get $OUT/taskA/simple $LOCAL_OUT/taskA/
Deleting Data
Remove Output Directories
# Remove single output
hdfs dfs -rm -r /circlenet/output/taskA/simple
# Remove all outputs for a task
hdfs dfs -rm -r /circlenet/output/taskA
# Force removal (skip trash)
hdfs dfs -rm -r -f /circlenet/output/taskA
Important: Hadoop will NOT overwrite existing output directories. You must delete them first before re-running jobs.
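That delete-before-rerun step can be wrapped in a guard so it never fails on a missing directory. A sketch with a hypothetical helper (assumes hdfs on PATH; `hdfs dfs -test -d` exits 0 when the directory exists):

```shell
# Remove an output directory only if it already exists, then report.
clean_output() {
    dir="$1"
    if hdfs dfs -test -d "$dir"; then
        hdfs dfs -rm -r -f "$dir" && echo "removed $dir"
    else
        echo "$dir does not exist, nothing to remove"
    fi
}
# Usage: clean_output /circlenet/output/taskA/simple
```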
Clean All Task Outputs
# Remove all task outputs at once
hdfs dfs -rm -r -f /circlenet/output/taskA \
/circlenet/output/taskB \
/circlenet/output/taskC \
/circlenet/output/taskD \
/circlenet/output/taskE \
/circlenet/output/taskF \
/circlenet/output/taskG \
/circlenet/output/taskH
Using Wildcards
# Remove all task outputs
hdfs dfs -rm -r -f /circlenet/output/task*
# Remove specific patterns
hdfs dfs -rm -r -f /circlenet/output/*/simple
hdfs dfs -rm -r -f /circlenet/output/*/tmp_*
Advanced Operations
Moving and Renaming
# Rename file or directory
hdfs dfs -mv /circlenet/output/taskA/simple /circlenet/output/taskA/v1
# Move to different directory
hdfs dfs -mv /circlenet/output/taskA/temp /circlenet/archive/
Copying Within HDFS
# Copy within HDFS
hdfs dfs -cp /circlenet/output/taskA/simple /circlenet/output/taskA/backup
# Copy recursively
hdfs dfs -cp -r /circlenet/output/taskA /circlenet/archive/taskA_$(date +%Y%m%d)
Changing Permissions
# Change file permissions
hdfs dfs -chmod 755 /circlenet/pages/CircleNetPage.csv
# Change ownership
hdfs dfs -chown ds503:supergroup /circlenet/output/taskA
# Recursive permission change
hdfs dfs -chmod -R 755 /circlenet/output
Disk Usage
# Check disk usage for directory
hdfs dfs -du -h /circlenet
# Summary of total usage
hdfs dfs -du -s -h /circlenet
# Example output:
# 500.2 M /circlenet/activitylog
# 1.2 G /circlenet/follows
# 50.3 M /circlenet/pages
Web UI for HDFS
Access the Hadoop NameNode Web UI for visual file browsing:
URL: http://localhost:3002
Features:
Browse HDFS : Navigate directories visually
Download files : Click files to download
View file properties : Size, replication, block locations
Cluster health : DataNode status, capacity
Direct file browser: http://localhost:3002/explorer.html#/circlenet
The Web UI is useful for:
Verifying data uploads
Checking job output completion
Monitoring cluster storage
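The NameNode endpoint usually also serves the WebHDFS REST API, which lets you script what the UI shows. A sketch that builds the listing URL (this assumes WebHDFS is enabled and reachable on the same port 3002 mapping as the UI):

```shell
# Build a WebHDFS LISTSTATUS URL for a given HDFS path.
webhdfs_url() {
    echo "http://localhost:3002/webhdfs/v1${1}?op=LISTSTATUS"
}
webhdfs_url /circlenet
# Fetch the directory listing as JSON with:
# curl -s "$(webhdfs_url /circlenet)"
```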
HDFS Safe Mode
HDFS may enter safe mode on startup, preventing write operations.
Check Safe Mode Status
hdfs dfsadmin -safemode get
Output:
Safe mode is ON - Cannot write to HDFS
Safe mode is OFF - Normal operation
Exit Safe Mode
# Leave safe mode
hdfs dfsadmin -safemode leave
# Wait for safe mode to exit automatically
hdfs dfsadmin -safemode wait
Do not force leave safe mode if HDFS is performing startup checks. Wait for automatic exit when possible.
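This makes `-safemode wait` a natural guard in job scripts: block until HDFS is writable, then submit. A sketch with a hypothetical wrapper (assumes hdfs and hadoop on PATH):

```shell
# Block until safe mode is off, then run whatever command follows.
run_when_ready() {
    hdfs dfsadmin -safemode wait && "$@"
}
# Usage:
# run_when_ready hadoop jar $JAR circlenet.taskA.TaskA $PAGES /circlenet/output/taskA/simple
```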
Common Workflows
Upload New Dataset Version
Remove Old Data
hdfs dfs -rm /circlenet/pages/CircleNetPage.csv
Upload New Data
hdfs dfs -put data/CircleNetPage_v2.csv /circlenet/pages/CircleNetPage.csv
Verify
hdfs dfs -ls -h /circlenet/pages/
hdfs dfs -cat /circlenet/pages/CircleNetPage.csv | head -5
Re-run Job with Clean Output
Remove Previous Output
hdfs dfs -rm -r -f /circlenet/output/taskA/simple
Run Job
hadoop jar $JAR circlenet.taskA.TaskA $PAGES /circlenet/output/taskA/simple
Verify Results
hdfs dfs -ls /circlenet/output/taskA/simple
hdfs dfs -test -e /circlenet/output/taskA/simple/_SUCCESS && echo "Job completed"  # _SUCCESS is an empty marker file; its presence is what matters
hdfs dfs -cat /circlenet/output/taskA/simple/part-r-00000 | head -20
Compare Simple vs Optimized Outputs
# Download both outputs
hdfs dfs -get /circlenet/output/taskA/simple /tmp/taskA_simple
hdfs dfs -get /circlenet/output/taskA/optimized /tmp/taskA_optimized
# Canonicalize and compare
cat /tmp/taskA_simple/part-* | sort > /tmp/taskA_simple.txt
cat /tmp/taskA_optimized/part-* | sort > /tmp/taskA_optimized.txt
diff -u /tmp/taskA_simple.txt /tmp/taskA_optimized.txt
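The comparison can be collapsed into a yes/no answer with `cmp`. A local sketch on fabricated outputs (against real results, replace the printf lines with the two downloaded part-file concatenations):

```shell
# Fabricate two result sets whose rows differ only in order.
printf 'b\t2\na\t1\n' > /tmp/simple.txt
printf 'a\t1\nb\t2\n' > /tmp/optimized.txt
# Canonicalize by sorting, then compare byte-for-byte.
sort /tmp/simple.txt > /tmp/simple.sorted
sort /tmp/optimized.txt > /tmp/optimized.sorted
if cmp -s /tmp/simple.sorted /tmp/optimized.sorted; then
    echo "outputs MATCH"
else
    echo "outputs DIFFER"
fi
# -> outputs MATCH
```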
Quick Reference
Essential Commands
Operation          Command
List files         hdfs dfs -ls <path>
Create directory   hdfs dfs -mkdir <path>
Upload file        hdfs dfs -put <local> <hdfs>
Download file      hdfs dfs -get <hdfs> <local>
View file          hdfs dfs -cat <path>
Delete file/dir    hdfs dfs -rm -r <path>
Copy               hdfs dfs -cp <src> <dst>
Move/Rename        hdfs dfs -mv <src> <dst>
Disk usage         hdfs dfs -du -h <path>
Count              hdfs dfs -count -h <path>
Path Shortcuts
# Relative paths resolve against the HDFS home directory
hdfs dfs -ls .   # /user/ds503
hdfs dfs -ls     # no path also defaults to /user/ds503
# Note: avoid ~ in HDFS paths; the local shell expands it to your local home directory
# Absolute paths
hdfs dfs -ls /circlenet
Troubleshooting
Permission Denied
Check file ownership and permissions: hdfs dfs -ls /circlenet
Fix permissions: hdfs dfs -chmod 755 /circlenet/pages
hdfs dfs -chown ds503 /circlenet/pages/CircleNetPage.csv
Output Directory Already Exists
Hadoop will not overwrite output directories. Remove first: hdfs dfs -rm -r /circlenet/output/taskA/simple
File Not Found
Verify path and file existence: hdfs dfs -ls /circlenet/pages
hdfs dfs -ls -R /circlenet
Check for typos in file names (HDFS paths are case-sensitive).
Insufficient Disk Space
Check disk usage: hdfs dfs -df -h
hdfs dfs -du -s -h /circlenet
Clean up old outputs: hdfs dfs -rm -r /circlenet/output/old_*
hdfs dfs -rm -r /circlenet/archive/*
Next Steps