The extract-tweets command reads your Twitter archive JSON file and converts it to a structured CSV format suitable for analysis.

Usage

python src/main.py extract-tweets
This command takes no additional arguments or options.

What it does

The extract-tweets command performs the following operations:
  1. Reads the Twitter archive JSON file from data/tweets/tweets.json
  2. Parses the tweet data using the JSON parser
  3. Transforms tweets into a normalized CSV format
  4. Writes the output to data/tweets/transformed/tweets.csv
This is always the first command you should run. The analyze-tweets command depends on the CSV file generated by this command.
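The four steps above can be sketched in Python roughly as follows. This is an illustrative sketch, not the project's actual implementation: the function name, the assumption that the archive is a JSON array of tweet objects, and the exact field names are all hypothetical.

```python
import csv
import json
from pathlib import Path

def extract_tweets(archive_path: str, output_path: str) -> int:
    """Read a Twitter archive JSON file and write a normalized CSV.

    Hypothetical sketch: assumes the archive is a JSON array of
    tweet objects with id/content/created_at keys.
    """
    # 1-2. Read and parse the archive JSON.
    with open(archive_path, encoding="utf-8") as f:
        tweets = json.load(f)

    # Create the output directory if it doesn't exist.
    out = Path(output_path)
    out.parent.mkdir(parents=True, exist_ok=True)

    # 3-4. Transform each tweet into a normalized row and write the CSV.
    with out.open("w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["id", "content", "created_at"])
        writer.writeheader()
        for t in tweets:
            writer.writerow({
                "id": t.get("id"),
                "content": t.get("content"),
                "created_at": t.get("created_at"),
            })

    # 5. Report the extracted tweet count.
    return len(tweets)
```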

Input requirements

data/tweets/tweets.json (file, required)
Your Twitter archive JSON file. This file is obtained by:
  1. Requesting your Twitter archive from X.com
  2. Downloading the archive ZIP file (after 24-48 hours)
  3. Extracting the ZIP and copying data/tweets.json to your project’s data/tweets/ directory

Archive structure

Your data directory should look like:
data/
└── tweets/
    └── tweets.json  # From your X archive

Output

data/tweets/transformed/tweets.csv (file)
A CSV file containing all extracted tweets with normalized fields including:
  • id: Tweet ID
  • content: Tweet text
  • created_at: Timestamp
  • Other metadata fields
The output directory is created automatically if it doesn’t exist.
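To sanity-check the generated file, the CSV can be loaded with Python's standard library. This is a generic inspection helper, not part of the tool itself:

```python
import csv

def load_extracted_tweets(csv_path: str) -> list[dict]:
    """Load the extracted tweets CSV into a list of row dicts."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))
```

For example, `rows = load_extracted_tweets("data/tweets/transformed/tweets.csv")` followed by `print(len(rows), rows[0]["content"])` confirms the row count and that the text field survived extraction.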

Success output

When the extraction succeeds, you’ll see:
Extracting tweets from archive...
Successfully extracted 1247 tweets
The number displayed is the total count of tweets extracted from your archive.

Error handling

The command handles several error scenarios:

File not found

Error type: file_not_found
Cause: The tweets archive file doesn’t exist at the expected path
Solution: Ensure you’ve placed your Twitter archive at data/tweets/tweets.json
Extracting tweets from archive...
Error: [Errno 2] No such file or directory: 'data/tweets/tweets.json'

Invalid format

Error type: invalid_format
Cause: The JSON file is corrupted or has an unexpected structure
Solution: Re-download your Twitter archive and ensure the file is not corrupted

Permission denied

Error type: permission_denied
Cause: Insufficient permissions to read the archive or write to the output directory
Solution: Check file and directory permissions

Unexpected errors

Error type: unexpected_error
Cause: An unforeseen error occurred during extraction
Solution: Check the logs for detailed error information and stack traces
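One plausible way these error-type labels are produced is by mapping raised exceptions to categories. The sketch below is an assumption about the pattern, not the project's actual code:

```python
import json

def classify_extraction_error(exc: Exception) -> str:
    """Map an exception raised during extraction to an error-type label.

    Hypothetical sketch of the classification; the real implementation
    may catch these exceptions at different points in the pipeline.
    """
    if isinstance(exc, FileNotFoundError):
        return "file_not_found"
    if isinstance(exc, PermissionError):
        return "permission_denied"
    # json.JSONDecodeError is a subclass of ValueError.
    if isinstance(exc, (json.JSONDecodeError, KeyError, ValueError)):
        return "invalid_format"
    return "unexpected_error"
```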

Configuration

The command uses these settings from your configuration (src/config.py:26):
TWEETS_ARCHIVE_PATH (string, default: "data/tweets/tweets.json")
Location of your Twitter archive JSON file

TRANSFORMED_TWEETS_PATH (string, default: "data/tweets/transformed/tweets.csv")
Output location for the extracted CSV file
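In code, these settings presumably reduce to two module-level constants. A hypothetical sketch of the relevant part of src/config.py, mirroring only the documented defaults:

```python
from pathlib import Path

# Hypothetical constants mirroring the documented defaults;
# the real src/config.py may define or derive these differently.
TWEETS_ARCHIVE_PATH = Path("data/tweets/tweets.json")
TRANSFORMED_TWEETS_PATH = Path("data/tweets/transformed/tweets.csv")
```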

Implementation details

The extraction process:
  1. Initializes a JSONParser with the archive path (src/application.py:50)
  2. Parses all tweets from the JSON structure (src/application.py:51)
  3. Opens a CSV writer with the output path (src/application.py:54)
  4. Writes all tweets to CSV format (src/application.py:55)
  5. Returns a success result with the tweet count (src/application.py:60)
The extraction is performed entirely in memory: the full archive is parsed before any rows are written. For very large archives (100k+ tweets), ensure you have sufficient available RAM.

Example workflow

Here’s a complete example of preparing and extracting your archive:
# 1. Create the data directory structure
mkdir -p data/tweets

# 2. Copy your Twitter archive
cp /path/to/twitter-archive/data/tweets.json data/tweets/tweets.json

# 3. Run extraction
python src/main.py extract-tweets

# Expected output:
# Extracting tweets from archive...
# Successfully extracted 1247 tweets

# 4. Verify the output file was created
ls -lh data/tweets/transformed/tweets.csv

Next steps

After successfully extracting your tweets:
  1. Review the generated CSV file to ensure tweets were extracted correctly
  2. Configure your analysis criteria in config.json
  3. Run the analyze-tweets command to process the tweets

Analyze tweets

Continue to the next step: analyzing your extracted tweets with AI

Logging

Detailed logs are written during execution. Set your desired log level in .env:
LOG_LEVEL=INFO  # Options: DEBUG, INFO, WARNING, ERROR
Extraction logs include:
  • Archive file path being read
  • Number of tweets found
  • Output file path
  • Any errors encountered

Performance

Extraction performance varies by archive size:
Archive Size      Approximate Time   Memory Usage
1,000 tweets      < 1 second         ~10 MB
10,000 tweets     1-2 seconds        ~50 MB
50,000 tweets     5-10 seconds       ~200 MB
100,000+ tweets   20+ seconds        ~500+ MB
Extraction is a one-time operation. Once you’ve successfully extracted your tweets, you don’t need to run this command again unless you download a fresh archive.
