The extract-tweets command reads your Twitter archive JSON file and converts it to a structured CSV format suitable for analysis.

Usage

python src/main.py extract-tweets
This command takes no additional arguments or options.

What it does

The extract-tweets command performs the following operations:
  1. Reads the Twitter archive JSON file from data/tweets/tweets.json
  2. Parses the tweet data using the JSON parser
  3. Transforms tweets into a normalized CSV format
  4. Writes the output to data/tweets/transformed/tweets.csv
This is always the first command you should run. The analyze-tweets command depends on the CSV file generated by this command.
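The four steps above can be sketched in Python roughly as follows. This is an illustrative sketch, not the project's actual implementation: the function name, the assumption that the archive is a JSON array of tweet objects, and the exact field names are all hypothetical.

```python
import csv
import json
from pathlib import Path

def extract_tweets(archive_path: str, output_path: str) -> int:
    """Read a Twitter archive JSON file and write a normalized CSV.

    Hypothetical sketch: assumes the archive is a JSON array of
    tweet objects with id/content/created_at keys.
    """
    # 1-2. Read and parse the archive JSON.
    with open(archive_path, encoding="utf-8") as f:
        tweets = json.load(f)

    # Create the output directory if it doesn't exist.
    out = Path(output_path)
    out.parent.mkdir(parents=True, exist_ok=True)

    # 3-4. Transform each tweet into a normalized row and write the CSV.
    with out.open("w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["id", "content", "created_at"])
        writer.writeheader()
        for t in tweets:
            writer.writerow({
                "id": t.get("id"),
                "content": t.get("content"),
                "created_at": t.get("created_at"),
            })

    # 5. Report the extracted tweet count.
    return len(tweets)
```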

Input requirements

data/tweets/tweets.json (file, required)
Your Twitter archive JSON file. This file is obtained by:
  1. Requesting your Twitter archive from X.com
  2. Downloading the archive ZIP file (after 24-48 hours)
  3. Extracting the ZIP and copying data/tweets.json to your project’s data/tweets/ directory

Archive structure

Your data directory should look like:
data/
└── tweets/
    └── tweets.json  # From your X archive

Output

data/tweets/transformed/tweets.csv (file)
A CSV file containing all extracted tweets with normalized fields including:
  • id: Tweet ID
  • content: Tweet text
  • created_at: Timestamp
  • Other metadata fields
The output directory is created automatically if it doesn’t exist.
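To sanity-check the generated file, the CSV can be loaded with Python's standard library. This is a generic inspection helper, not part of the tool itself:

```python
import csv

def load_extracted_tweets(csv_path: str) -> list[dict]:
    """Load the extracted tweets CSV into a list of row dicts."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))
```

For example, `rows = load_extracted_tweets("data/tweets/transformed/tweets.csv")` followed by `print(len(rows), rows[0]["content"])` confirms the row count and that the text field survived extraction.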

Success output

When the extraction succeeds, you’ll see:
Extracting tweets from archive...
Successfully extracted 1247 tweets
The number displayed is the total count of tweets extracted from your archive.

Error handling

The command handles several error scenarios:

File not found

Error type: file_not_found
Cause: The tweets archive file doesn’t exist at the expected path
Solution: Ensure you’ve placed your Twitter archive at data/tweets/tweets.json
Extracting tweets from archive...
Error: [Errno 2] No such file or directory: 'data/tweets/tweets.json'

Invalid format

Error type: invalid_format
Cause: The JSON file is corrupted or has an unexpected structure
Solution: Re-download your Twitter archive and ensure the file is not corrupted

Permission denied

Error type: permission_denied
Cause: Insufficient permissions to read the archive or write to the output directory
Solution: Check file and directory permissions

Unexpected errors

Error type: unexpected_error
Cause: An unforeseen error occurred during extraction
Solution: Check the logs for detailed error information and stack traces
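One plausible way these error-type labels are produced is by mapping raised exceptions to categories. The sketch below is an assumption about the pattern, not the project's actual code:

```python
import json

def classify_extraction_error(exc: Exception) -> str:
    """Map an exception raised during extraction to an error-type label.

    Hypothetical sketch of the classification; the real implementation
    may catch these exceptions at different points in the pipeline.
    """
    if isinstance(exc, FileNotFoundError):
        return "file_not_found"
    if isinstance(exc, PermissionError):
        return "permission_denied"
    # json.JSONDecodeError is a subclass of ValueError.
    if isinstance(exc, (json.JSONDecodeError, KeyError, ValueError)):
        return "invalid_format"
    return "unexpected_error"
```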

Configuration

The command uses these settings from your configuration (src/config.py:26):
TWEETS_ARCHIVE_PATH (string, default: "data/tweets/tweets.json")
Location of your Twitter archive JSON file

TRANSFORMED_TWEETS_PATH (string, default: "data/tweets/transformed/tweets.csv")
Output location for the extracted CSV file
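In code, these settings presumably reduce to two module-level constants. A hypothetical sketch of the relevant part of src/config.py, mirroring only the documented defaults:

```python
from pathlib import Path

# Hypothetical constants mirroring the documented defaults;
# the real src/config.py may define or derive these differently.
TWEETS_ARCHIVE_PATH = Path("data/tweets/tweets.json")
TRANSFORMED_TWEETS_PATH = Path("data/tweets/transformed/tweets.csv")
```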

Implementation details

The extraction process:
  1. Initializes a JSONParser with the archive path (src/application.py:50)
  2. Parses all tweets from the JSON structure (src/application.py:51)
  3. Opens a CSV writer with the output path (src/application.py:54)
  4. Writes all tweets to CSV format (src/application.py:55)
  5. Returns a success result with the tweet count (src/application.py:60)
The extraction is performed entirely in memory: the full archive is parsed before any rows are written. For very large archives (100k+ tweets), ensure you have sufficient available RAM.

Example workflow

Here’s a complete example of preparing and extracting your archive:
# 1. Create the data directory structure
mkdir -p data/tweets

# 2. Copy your Twitter archive
cp /path/to/twitter-archive/data/tweets.json data/tweets/tweets.json

# 3. Run extraction
python src/main.py extract-tweets

# Expected output:
# Extracting tweets from archive...
# Successfully extracted 1247 tweets

# 4. Verify the output file was created
ls -lh data/tweets/transformed/tweets.csv

Next steps

After successfully extracting your tweets:
  1. Review the generated CSV file to ensure tweets were extracted correctly
  2. Configure your analysis criteria in config.json
  3. Run the analyze-tweets command to process the tweets

Analyze tweets

Continue to the next step: analyzing your extracted tweets with AI

Logging

Detailed logs are written during execution. Set your desired log level in .env:
LOG_LEVEL=INFO  # Options: DEBUG, INFO, WARNING, ERROR
Extraction logs include:
  • Archive file path being read
  • Number of tweets found
  • Output file path
  • Any errors encountered

Performance

Extraction performance varies by archive size:
Archive Size      Approximate Time   Memory Usage
1,000 tweets      < 1 second         ~10 MB
10,000 tweets     1-2 seconds        ~50 MB
50,000 tweets     5-10 seconds       ~200 MB
100,000+ tweets   20+ seconds        ~500+ MB
Extraction is a one-time operation. Once you’ve successfully extracted your tweets, you don’t need to run this command again unless you download a fresh archive.
