This guide covers essential HDFS operations for managing datasets in the CircleNet Analytics project.
HDFS Command Structure
All HDFS commands follow this pattern:
hadoop/bin/hdfs dfs -<command> <arguments>
Breakdown:
hadoop/bin/hdfs - Main HDFS executable
dfs - Distributed File System parameter
-<command> - Specific operation (mkdir, put, get, cat, etc.)
<arguments> - Command-specific arguments
Most HDFS commands mirror Unix/Linux file system commands with a - prefix.
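The correspondence is close enough that you can often translate a local command one-for-one. A minimal local sketch of the mapping (the HDFS equivalents are shown as comments, since they need a running cluster):

```shell
# Unix command on the local filesystem      # HDFS equivalent
mkdir -p /tmp/demo/data                     # hdfs dfs -mkdir -p /demo/data
ls /tmp/demo                                # hdfs dfs -ls /demo
echo "a,b,c" > /tmp/demo/data/sample.csv    # (use -put to upload instead)
cat /tmp/demo/data/sample.csv               # hdfs dfs -cat /demo/data/sample.csv
rm -r /tmp/demo                             # hdfs dfs -rm -r /demo
```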
Setting Up Directory Structure
Creating Directories
Create Base User Directory
hdfs dfs -mkdir /user/ds503
Convention: /user/<username> for user-specific data
Create Project Directory
hdfs dfs -mkdir /circlenet
Create Dataset Directories
hdfs dfs -mkdir /circlenet/pages
hdfs dfs -mkdir /circlenet/follows
hdfs dfs -mkdir /circlenet/activitylog
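If a parent directory does not exist yet, plain -mkdir fails; the -p flag creates missing parents, so the whole tree can be built in one call. A sketch wrapped in a helper (hypothetical function name; assumes the hdfs client is on PATH inside the container):

```shell
# Create the full CircleNet directory tree in one command.
# -p creates any missing parent directories along the way.
create_circlenet_dirs() {
    hdfs dfs -mkdir -p \
        /circlenet/pages \
        /circlenet/follows \
        /circlenet/activitylog \
        /circlenet/output
}
# Usage: create_circlenet_dirs
```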
Verify Directory Structure
hdfs dfs -ls /circlenet
Expected output: drwxr-xr-x - ds503 supergroup 0 2024-03-01 10:30 /circlenet/activitylog
drwxr-xr-x - ds503 supergroup 0 2024-03-01 10:30 /circlenet/follows
drwxr-xr-x - ds503 supergroup 0 2024-03-01 10:30 /circlenet/pages
Directory Structure Overview
/
├── user/
│ └── ds503/
└── circlenet/
├── pages/
│ └── CircleNetPage.csv
├── follows/
│ └── Follows.csv
├── activitylog/
│ └── ActivityLog.csv
└── output/
├── taskA/
├── taskB/
└── ...
Loading Data into HDFS
Uploading CSV Files
Navigate to Data Directory
Inside the container, your mounted data is in the data/ directory:
ls data/
Should show: CircleNetPage.csv, Follows.csv, ActivityLog.csv
Upload Files to HDFS
hdfs dfs -put data/CircleNetPage.csv /circlenet/pages/
hdfs dfs -put data/Follows.csv /circlenet/follows/
hdfs dfs -put data/ActivityLog.csv /circlenet/activitylog/
The put command uploads a file from the local filesystem to HDFS.
Verify Upload
hdfs dfs -ls /circlenet/pages
hdfs dfs -ls /circlenet/follows
hdfs dfs -ls /circlenet/activitylog
Check file sizes match your local files.
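One way to compare sizes is to compute the local byte count and check it against the HDFS listing. A minimal local sketch (the HDFS half is commented out since it needs the cluster):

```shell
# Compute a local file's size in bytes.
printf 'id,name\n1,alice\n' > /tmp/sample_upload.csv
local_bytes=$(wc -c < /tmp/sample_upload.csv)
echo "local size: ${local_bytes} bytes"
# Compare against the uploaded copy; the first column of -du is bytes:
# hdfs dfs -du /circlenet/pages/CircleNetPage.csv
```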
Dataset Specifications
CircleNet Analytics datasets:
File                Records      Description
CircleNetPage.csv   200,000      User profiles (ID, NickName, JobTitle, RegionCode, FavoriteHobby)
Follows.csv         20,000,000   Follow relationships (ColRel, ID1, ID2, DateOfRelation, Description)
ActivityLog.csv     10,000,000   User actions (ActionId, ByWho, WhatPage, ActionType, ActionTime)
CSV files should NOT include column headers. Only data values, separated by commas.
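If a source file does arrive with a header row, strip it before uploading. A sketch using `tail -n +2` on a throwaway sample (the final -put is commented out since it needs the cluster):

```shell
# Build a sample CSV with a header row.
printf 'ID,NickName\n101,alice\n102,bob\n' > /tmp/with_header.csv
# Keep everything from line 2 onward, dropping the header.
tail -n +2 /tmp/with_header.csv > /tmp/no_header.csv
cat /tmp/no_header.csv
# Then upload the headerless copy:
# hdfs dfs -put /tmp/no_header.csv /circlenet/pages/
```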
Viewing Data in HDFS
Inspecting File Contents
# View first 5 rows
hdfs dfs -cat /circlenet/follows/Follows.csv | head -5
# View last 5 rows
hdfs dfs -cat /circlenet/follows/Follows.csv | tail -5
# View specific number of lines
hdfs dfs -cat /circlenet/pages/CircleNetPage.csv | head -20
File Statistics
# Detailed file information
hdfs dfs -ls /circlenet/pages/CircleNetPage.csv
# Human-readable file sizes
hdfs dfs -ls -h /circlenet/pages/CircleNetPage.csv
# Count files, directories, and total size
hdfs dfs -count -h /circlenet
Example output:
DIR_COUNT FILE_COUNT CONTENT_SIZE PATHNAME
3 3 2.5 G /circlenet
Verifying Data Integrity
Count Lines
hdfs dfs -cat /circlenet/pages/CircleNetPage.csv | wc -l
Should show 200,000 lines for CircleNetPage.csv
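The same check can be scripted for all three files using the expected counts from the specifications table. A sketch with a hypothetical helper (the HDFS count is commented so the comparison logic runs standalone):

```shell
# Compare an actual line count against the expected one.
check_count() {
    # $1 = actual count, $2 = expected count, $3 = file label
    if [ "$1" -eq "$2" ]; then
        echo "$3: OK ($1 lines)"
    else
        echo "$3: MISMATCH (got $1, expected $2)"
    fi
}
# With a live cluster, the actual count comes from:
#   actual=$(hdfs dfs -cat /circlenet/pages/CircleNetPage.csv | wc -l)
check_count 200000 200000 "CircleNetPage.csv"
# -> CircleNetPage.csv: OK (200000 lines)
```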
Sample Random Records
hdfs dfs -cat /circlenet/pages/CircleNetPage.csv | shuf | head -10
Check for Format Issues
# Look for lines with unexpected field counts
hdfs dfs -cat /circlenet/pages/CircleNetPage.csv | head -100 | awk -F ',' '{print NF}' | sort | uniq -c
All lines should have the same field count (5 for CircleNetPage).
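For a full-file check rather than just the first 100 lines, the same awk filter can print the line number of every offending row. A local sketch on fabricated data:

```shell
# Fabricate a file where one row has the wrong number of fields.
printf 'a,b,c,d,e\n1,2,3,4,5\nbad,row\n' > /tmp/check.csv
# Print line number and field count for every row that is not 5 fields wide.
awk -F ',' 'NF != 5 {print NR": "NF" fields"}' /tmp/check.csv
# -> 3: 2 fields
# Against HDFS, stream the whole file through the same filter:
# hdfs dfs -cat /circlenet/pages/CircleNetPage.csv | awk -F ',' 'NF != 5 {print NR": "NF" fields"}'
```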
Managing Job Outputs
Listing Output Files
# List all outputs for Task A
hdfs dfs -ls /circlenet/output/taskA/simple
# Typical output structure:
# _SUCCESS (empty file indicating job completion)
# part-r-00000 (reducer output)
# part-r-00001 (if multiple reducers)
Reading Output Files
# View complete output
hdfs dfs -cat /circlenet/output/taskA/simple/part-r-00000
# View first 20 results
hdfs dfs -cat /circlenet/output/taskA/simple/part-r-00000 | head -20
# Search for specific results
hdfs dfs -cat /circlenet/output/taskA/simple/part-r-00000 | grep "Chess"
Combining Multiple Output Parts
# When job has multiple reducers, combine all parts
hdfs dfs -cat /circlenet/output/taskA/simple/part-* | sort -k2 -nr | head -10
Downloading Data from HDFS
Copy to Container Local Filesystem
# Create local directory
mkdir -p /home/ds503/results/taskA
# Download from HDFS
hdfs dfs -get /circlenet/output/taskA/simple /home/ds503/results/taskA/
# Force overwrite existing files
hdfs dfs -get -f /circlenet/output/taskA/simple /home/ds503/results/taskA/
Copy to Host Machine
From your host machine (outside container):
# Get container ID
docker ps
# Copy results directory
docker cp <container_id>:/home/ds503/results/taskA ./results_taskA
Use environment variables for cleaner commands:
export OUT=/circlenet/output
export LOCAL_OUT=/home/ds503/results
hdfs dfs -get $OUT/taskA/simple $LOCAL_OUT/taskA/
Deleting Data
Remove Output Directories
# Remove single output
hdfs dfs -rm -r /circlenet/output/taskA/simple
# Remove all outputs for a task
hdfs dfs -rm -r /circlenet/output/taskA
# Force removal (skip trash)
hdfs dfs -rm -r -f /circlenet/output/taskA
Important: Hadoop will NOT overwrite existing output directories. You must delete them first before re-running jobs.
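That delete-before-rerun step can be wrapped in a guard so it never fails on a missing directory. A sketch with a hypothetical helper (assumes hdfs on PATH; `hdfs dfs -test -d` exits 0 when the directory exists):

```shell
# Remove an output directory only if it already exists, then report.
clean_output() {
    dir="$1"
    if hdfs dfs -test -d "$dir"; then
        hdfs dfs -rm -r -f "$dir" && echo "removed $dir"
    else
        echo "$dir does not exist, nothing to remove"
    fi
}
# Usage: clean_output /circlenet/output/taskA/simple
```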
Clean All Task Outputs
# Remove all task outputs at once
hdfs dfs -rm -r -f /circlenet/output/taskA \
/circlenet/output/taskB \
/circlenet/output/taskC \
/circlenet/output/taskD \
/circlenet/output/taskE \
/circlenet/output/taskF \
/circlenet/output/taskG \
/circlenet/output/taskH
Using Wildcards
# Remove all task outputs
hdfs dfs -rm -r -f /circlenet/output/task*
# Remove specific patterns
hdfs dfs -rm -r -f /circlenet/output/*/simple
hdfs dfs -rm -r -f /circlenet/output/*/tmp_*
Advanced Operations
Moving and Renaming
# Rename file or directory
hdfs dfs -mv /circlenet/output/taskA/simple /circlenet/output/taskA/v1
# Move to different directory
hdfs dfs -mv /circlenet/output/taskA/temp /circlenet/archive/
Copying Within HDFS
# Copy within HDFS
hdfs dfs -cp /circlenet/output/taskA/simple /circlenet/output/taskA/backup
# Copy recursively
hdfs dfs -cp -r /circlenet/output/taskA /circlenet/archive/taskA_$(date +%Y%m%d)
Changing Permissions
# Change file permissions
hdfs dfs -chmod 755 /circlenet/pages/CircleNetPage.csv
# Change ownership
hdfs dfs -chown ds503:supergroup /circlenet/output/taskA
# Recursive permission change
hdfs dfs -chmod -R 755 /circlenet/output
Disk Usage
# Check disk usage for directory
hdfs dfs -du -h /circlenet
# Summary of total usage
hdfs dfs -du -s -h /circlenet
# Example output:
# 500.2 M /circlenet/activitylog
# 1.2 G /circlenet/follows
# 50.3 M /circlenet/pages
Web UI for HDFS
Access the Hadoop NameNode Web UI for visual file browsing:
URL: http://localhost:3002
Features:
Browse HDFS : Navigate directories visually
Download files : Click files to download
View file properties : Size, replication, block locations
Cluster health : DataNode status, capacity
Direct file browser: http://localhost:3002/explorer.html#/circlenet
The Web UI is useful for:
Verifying data uploads
Checking job output completion
Monitoring cluster storage
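The NameNode endpoint usually also serves the WebHDFS REST API, which lets you script what the UI shows. A sketch that builds the listing URL (this assumes WebHDFS is enabled and reachable on the same port 3002 mapping as the UI):

```shell
# Build a WebHDFS LISTSTATUS URL for a given HDFS path.
webhdfs_url() {
    echo "http://localhost:3002/webhdfs/v1${1}?op=LISTSTATUS"
}
webhdfs_url /circlenet
# Fetch the directory listing as JSON with:
# curl -s "$(webhdfs_url /circlenet)"
```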
HDFS Safe Mode
HDFS may enter safe mode on startup, preventing write operations.
Check Safe Mode Status
hdfs dfsadmin -safemode get
Output:
Safe mode is ON - Cannot write to HDFS
Safe mode is OFF - Normal operation
Exit Safe Mode
# Leave safe mode
hdfs dfsadmin -safemode leave
# Wait for safe mode to exit automatically
hdfs dfsadmin -safemode wait
Do not force leave safe mode if HDFS is performing startup checks. Wait for automatic exit when possible.
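This makes `-safemode wait` a natural guard in job scripts: block until HDFS is writable, then submit. A sketch with a hypothetical wrapper (assumes hdfs and hadoop on PATH):

```shell
# Block until safe mode is off, then run whatever command follows.
run_when_ready() {
    hdfs dfsadmin -safemode wait && "$@"
}
# Usage:
# run_when_ready hadoop jar $JAR circlenet.taskA.TaskA $PAGES /circlenet/output/taskA/simple
```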
Common Workflows
Upload New Dataset Version
Remove Old Data
hdfs dfs -rm /circlenet/pages/CircleNetPage.csv
Upload New Data
hdfs dfs -put data/CircleNetPage_v2.csv /circlenet/pages/CircleNetPage.csv
Verify
hdfs dfs -ls -h /circlenet/pages/
hdfs dfs -cat /circlenet/pages/CircleNetPage.csv | head -5
Re-run Job with Clean Output
Remove Previous Output
hdfs dfs -rm -r -f /circlenet/output/taskA/simple
Run Job
hadoop jar $JAR circlenet.taskA.TaskA $PAGES /circlenet/output/taskA/simple
Verify Results
hdfs dfs -ls /circlenet/output/taskA/simple
hdfs dfs -test -e /circlenet/output/taskA/simple/_SUCCESS && echo "Job completed"  # _SUCCESS is an empty marker file; its presence is what matters
hdfs dfs -cat /circlenet/output/taskA/simple/part-r-00000 | head -20
Compare Simple vs Optimized Outputs
# Download both outputs
hdfs dfs -get /circlenet/output/taskA/simple /tmp/taskA_simple
hdfs dfs -get /circlenet/output/taskA/optimized /tmp/taskA_optimized
# Canonicalize and compare
cat /tmp/taskA_simple/part-* | sort > /tmp/taskA_simple.txt
cat /tmp/taskA_optimized/part-* | sort > /tmp/taskA_optimized.txt
diff -u /tmp/taskA_simple.txt /tmp/taskA_optimized.txt
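The comparison can be collapsed into a yes/no answer with `cmp`. A local sketch on fabricated outputs (against real results, replace the printf lines with the two downloaded part-file concatenations):

```shell
# Fabricate two result sets whose rows differ only in order.
printf 'b\t2\na\t1\n' > /tmp/simple.txt
printf 'a\t1\nb\t2\n' > /tmp/optimized.txt
# Canonicalize by sorting, then compare byte-for-byte.
sort /tmp/simple.txt > /tmp/simple.sorted
sort /tmp/optimized.txt > /tmp/optimized.sorted
if cmp -s /tmp/simple.sorted /tmp/optimized.sorted; then
    echo "outputs MATCH"
else
    echo "outputs DIFFER"
fi
# -> outputs MATCH
```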
Quick Reference
Essential Commands
Operation          Command
List files         hdfs dfs -ls <path>
Create directory   hdfs dfs -mkdir <path>
Upload file        hdfs dfs -put <local> <hdfs>
Download file      hdfs dfs -get <hdfs> <local>
View file          hdfs dfs -cat <path>
Delete file/dir    hdfs dfs -rm -r <path>
Copy               hdfs dfs -cp <src> <dst>
Move/Rename        hdfs dfs -mv <src> <dst>
Disk usage         hdfs dfs -du -h <path>
Count              hdfs dfs -count -h <path>
Path Shortcuts
# Relative paths resolve against the HDFS home directory
hdfs dfs -ls .   # /user/ds503
hdfs dfs -ls     # no path also defaults to /user/ds503
# Note: avoid ~ in HDFS paths; the local shell expands it to your local home directory
# Absolute paths
hdfs dfs -ls /circlenet
Troubleshooting
Permission Denied
Check file ownership and permissions: hdfs dfs -ls /circlenet
Fix permissions: hdfs dfs -chmod 755 /circlenet/pages
hdfs dfs -chown ds503 /circlenet/pages/CircleNetPage.csv
Output Directory Already Exists
Hadoop will not overwrite output directories. Remove first: hdfs dfs -rm -r /circlenet/output/taskA/simple
File Not Found
Verify path and file existence: hdfs dfs -ls /circlenet/pages
hdfs dfs -ls -R /circlenet
Check for typos in file names (HDFS paths are case-sensitive).
Insufficient Disk Space
Check disk usage: hdfs dfs -df -h
hdfs dfs -du -s -h /circlenet
Clean up old outputs: hdfs dfs -rm -r /circlenet/output/old_*
hdfs dfs -rm -r /circlenet/archive/*
Next Steps