Before analyzing tweets, you need to extract them from your X (Twitter) archive export. This guide covers the entire extraction process.

Prerequisites

Step 1: Request your X archive

Go to X.com and request your data archive:
  1. Navigate to More → Settings and Privacy → Your Account
  2. Click Download an archive of your data
  3. Verify your identity and confirm the request
  4. Wait 24-48 hours for X to prepare your archive
Step 2: Download and extract the archive

Once ready, X will email you a download link:
# Download the ZIP file from the email link
# Extract it to a temporary location
unzip twitter-archive.zip -d /tmp/twitter-archive
The archive contains a data folder with your tweets in JSON format.
Step 3: Copy tweets.json to your project

Locate and copy the tweets file:
# Create the data directory
mkdir -p data/tweets

# Copy tweets.json from the archive
cp /tmp/twitter-archive/data/tweets.json data/tweets/tweets.json

Understanding the archive format

The X archive uses a nested JSON structure. Here’s what it looks like:
tweets.json
[
  {
    "tweet": {
      "id_str": "1234567890123456789",
      "full_text": "This is my tweet content",
      "created_at": "Wed Jan 01 12:00:00 +0000 2020",
      "retweet_count": "0",
      "favorite_count": "5",
      "lang": "en"
    }
  },
  {
    "tweet": {
      "id_str": "9876543210987654321",
      "full_text": "Another tweet here",
      "created_at": "Thu Jan 02 14:30:00 +0000 2020",
      "retweet_count": "2",
      "favorite_count": "10",
      "lang": "en"
    }
  }
]
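As a quick sanity check, you can walk this nested structure yourself. The sketch below uses an inline sample mirroring the format above (illustrative data, not your real archive):

```python
import json

# Inline sample mirroring the archive structure shown above (illustrative data)
raw = """
[
  {"tweet": {"id_str": "1234567890123456789", "full_text": "This is my tweet content"}},
  {"tweet": {"id_str": "9876543210987654321", "full_text": "Another tweet here"}}
]
"""
data = json.loads(raw)

# Each list entry wraps the actual tweet under a "tweet" key
for item in data:
    tweet = item["tweet"]
    print(tweet["id_str"], "->", tweet["full_text"])
```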

Required fields

The extraction process reads these fields, whose names are defined in storage.py:14-15:
tweet.id_str
string
required
The unique tweet ID as a string.
TWITTER_ARCHIVE_ID_FIELD = "id_str"
tweet.full_text
string
required
The complete tweet content (up to 280 characters).
TWITTER_ARCHIVE_TEXT_FIELD = "full_text"
If either field is missing, extraction will fail with:
ValueError: Missing required field 'id_str' in data/tweets/tweets.json
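A minimal sketch of this validation, assuming a helper like the following (the function name `require_fields` is hypothetical, not part of the tool):

```python
REQUIRED_FIELDS = ("id_str", "full_text")

def require_fields(item: dict, path: str) -> None:
    # Raise the same style of error the tool reports for a missing field
    tweet = item.get("tweet", {})
    for field in REQUIRED_FIELDS:
        if field not in tweet:
            raise ValueError(f"Missing required field '{field}' in {path}")

# Passes silently: both required fields are present
require_fields({"tweet": {"id_str": "1", "full_text": "ok"}}, "data/tweets/tweets.json")
```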

Running the extraction

Once your tweets.json is in place, run the extraction command:
python src/main.py extract-tweets

What happens during extraction

The extraction process (application.py:47-62) follows these steps:
Step 1: Load JSON archive

The JSONParser reads the archive file:
logger.info(f"Reading tweets from {settings.tweets_archive_path}")
parser = JSONParser(settings.tweets_archive_path)
tweets = parser.parse()
Implementation in storage.py:39-58:
def parse(self) -> list[Tweet]:
    try:
        with open(self.file_path, encoding=FILE_ENCODING) as file:
            data = json.load(file)
            return [
                Tweet(
                    id=item["tweet"][TWITTER_ARCHIVE_ID_FIELD],
                    content=item["tweet"][TWITTER_ARCHIVE_TEXT_FIELD],
                )
                for item in data
            ]
    except FileNotFoundError as e:
        raise FileNotFoundError(f"Tweet archive not found: {self.file_path}") from e
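The excerpt above only shows the missing-file handler; the "Invalid JSON" error under Troubleshooting suggests a companion handler for malformed files. Here is a hedged sketch of how that might look (not the tool's actual implementation):

```python
import json

def load_archive(file_path: str) -> list:
    # Sketch: re-raise JSON decode errors with the file path attached,
    # matching the "Invalid JSON in ..." message format the tool reports
    try:
        with open(file_path, encoding="utf-8") as file:
            return json.load(file)
    except json.JSONDecodeError as e:
        raise ValueError(f"Invalid JSON in {file_path}: {e}") from e
```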
Step 2: Transform to CSV

Tweets are converted to a simpler CSV format:
logger.info(f"Extracted {len(tweets)} tweets, writing to CSV")
with CSVWriter(settings.transformed_tweets_path) as writer:
    writer.write_tweets(tweets)
The CSVWriter creates a two-column format (storage.py:161-170):
def write_tweets(self, tweets: list[Tweet]) -> None:
    if not self.header_written:
        self.writer.writerow([TWEET_CSV_ID_COLUMN, TWEET_CSV_TEXT_COLUMN])
        self.header_written = True

    for tweet in tweets:
        self.writer.writerow([tweet.id, tweet.content])
Step 3: Save transformed tweets

The CSV is saved to data/tweets/transformed/tweets.csv:
logger.info(
    f"Successfully wrote {len(tweets)} tweets to {settings.transformed_tweets_path}"
)
return Result(success=True, count=len(tweets))

Expected output

Successful extraction displays:
Extracting tweets from archive...
2024-01-15 10:30:00 - application - INFO - Reading tweets from data/tweets/tweets.json
2024-01-15 10:30:01 - application - INFO - Extracted 1523 tweets, writing to CSV
2024-01-15 10:30:01 - application - INFO - Successfully wrote 1523 tweets to data/tweets/transformed/tweets.csv
Successfully extracted 1523 tweets

Output format

The extracted CSV (data/tweets/transformed/tweets.csv) has this structure:
id,text
1234567890123456789,"This is my tweet content"
9876543210987654321,"Another tweet here"
1122334455667788990,"RT @someone: This is a retweet"
Retweets are included in the CSV but will be automatically skipped during analysis.
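The analysis stage's retweet skipping presumably keys off the `RT @` prefix visible in the text column. A hedged sketch of reading the CSV back and filtering that way (the prefix check is an assumption, not the tool's confirmed logic):

```python
import csv
import io

# Inline CSV mirroring the output format above (illustrative data)
csv_text = (
    "id,text\n"
    '1234567890123456789,"This is my tweet content"\n'
    '1122334455667788990,"RT @someone: This is a retweet"\n'
)

reader = csv.DictReader(io.StringIO(csv_text))
# Assumption: retweets are identified by the "RT @" text prefix
originals = [row for row in reader if not row["text"].startswith("RT @")]
print(len(originals))  # number of non-retweet rows
```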

File permissions and security

The tool sets secure file permissions on all output files:
PRIVATE_FILE_MODE = 0o600  # Owner read/write only
PRIVATE_DIR_MODE = 0o750  # Owner read/write/execute, group read/execute
From storage.py:139-151:
def __enter__(self) -> "CSVWriter":
    dir_path = os.path.dirname(self.file_path)
    if dir_path:
        os.makedirs(dir_path, mode=PRIVATE_DIR_MODE, exist_ok=True)

    file_exists = os.path.exists(self.file_path)
    self.header_written = self.append and file_exists

    mode = "a" if self.append and file_exists else "w"
    self.file = open(self.file_path, mode, encoding=FILE_ENCODING, newline="")
    self.writer = csv.writer(self.file)

    os.chmod(self.file_path, PRIVATE_FILE_MODE)
    return self
Your tweet data is private! The tool ensures only you (the file owner) can read/write the CSV files.
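You can confirm the permission behavior with a short standalone check. This sketch applies the same file mode to a throwaway temporary file; it is not the tool's own code:

```python
import os
import stat
import tempfile

PRIVATE_FILE_MODE = 0o600  # same mode the tool applies to output files

# Create a throwaway file and apply the owner-only mode
fd, path = tempfile.mkstemp()
os.close(fd)
os.chmod(path, PRIVATE_FILE_MODE)

# stat.S_IMODE strips the file-type bits, leaving just the permission bits
mode = stat.S_IMODE(os.stat(path).st_mode)
print(oct(mode))  # expect 0o600 on POSIX systems: owner read/write only
os.remove(path)
```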

Troubleshooting extraction

File not found error

Error: Tweet archive not found: data/tweets/tweets.json
Solution: Verify the archive is in the correct location:
ls -la data/tweets/tweets.json
If missing, copy it from your X archive:
cp /path/to/twitter-archive/data/tweets.json data/tweets/tweets.json

Invalid JSON format

Error: Invalid JSON in data/tweets/tweets.json: Expecting value: line 1 column 1 (char 0)
Solution: The file might be corrupted. Verify it’s valid JSON:
python -m json.tool data/tweets/tweets.json | head -20
If invalid, re-download your X archive.

Missing required fields

Error: Missing required field 'full_text' in data/tweets/tweets.json
Solution: Your archive format might be outdated. The tool expects:
[
  {
    "tweet": {
      "id_str": "...",
      "full_text": "..."
    }
  }
]
Check if your archive uses "text" instead of "full_text". If so, you’ll need to modify storage.py:15:
# Change this:
TWITTER_ARCHIVE_TEXT_FIELD = "full_text"

# To this:
TWITTER_ARCHIVE_TEXT_FIELD = "text"
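Alternatively, instead of editing the constant, a field fallback can handle both variants. This is a hypothetical helper to illustrate the approach, not part of the tool:

```python
def tweet_text(tweet: dict) -> str:
    # Prefer the modern "full_text" field, fall back to the legacy "text" field
    for field in ("full_text", "text"):
        if field in tweet:
            return tweet[field]
    raise ValueError("Missing required field 'full_text'")

print(tweet_text({"text": "Legacy-format tweet"}))
```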

Permission denied error

Error: Permission denied: data/tweets/transformed/tweets.csv
Solution: Ensure the directory is writable:
chmod -R u+w data/

Verifying extraction results

After extraction, verify the CSV was created correctly:
# Check file exists
ls -lh data/tweets/transformed/tweets.csv

# View first 10 tweets
head -10 data/tweets/transformed/tweets.csv

# Count total tweets (subtract 1 for header)
wc -l data/tweets/transformed/tweets.csv
Example output:
-rw------- 1 user user 523K Jan 15 10:30 data/tweets/transformed/tweets.csv
    1524 data/tweets/transformed/tweets.csv
The line count includes the header row, so 1524 lines = 1523 tweets.

Re-extracting tweets

If you need to re-extract (for example, after getting a new archive):
# Remove the old transformed CSV
rm data/tweets/transformed/tweets.csv

# Replace the archive
cp /path/to/new/tweets.json data/tweets/tweets.json

# Re-run extraction
python src/main.py extract-tweets
Re-extracting will overwrite tweets.csv. If you’ve started analysis, you may also want to remove:
  • data/checkpoint.txt (to restart from the beginning)
  • data/tweets/processed/results.csv (to clear previous results)

What gets extracted

The extraction includes:
  • ✅ Original tweets
  • ✅ Replies to others
  • ✅ Retweets (but they’re skipped during analysis)
  • ✅ Quote tweets
  • ✅ Tweets with media (only the text is extracted)
  • ✅ Threads (each tweet is separate)
The extraction does NOT include:
  • ❌ Deleted tweets
  • ❌ Tweets from suspended accounts
  • ❌ Direct messages
  • ❌ Likes/favorites
  • ❌ Media files (images, videos)
  • ❌ Tweet metadata (likes, retweets, dates)
Only the tweet ID and text content are needed for analysis. Other metadata is ignored.

Understanding the Tweet model

Extracted tweets are stored as simple data objects from models.py:13-19:
@dataclass(frozen=True)
class Tweet:
    id: str
    content: str

    def __repr__(self) -> str:
        preview = self.content[:50] + "..." if len(self.content) > 50 else self.content
        return f"Tweet(id={self.id!r}, content={preview!r})"
  • id: The unique tweet identifier (used to construct URLs)
  • content: The full tweet text (up to 280 characters)
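Reproducing the dataclass here (so the snippet is self-contained) makes it easy to see the truncated `repr` in action:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Tweet:
    id: str
    content: str

    def __repr__(self) -> str:
        # Truncate long content to a 50-character preview
        preview = self.content[:50] + "..." if len(self.content) > 50 else self.content
        return f"Tweet(id={self.id!r}, content={preview!r})"

tweet = Tweet(id="1234567890123456789", content="x" * 60)
print(repr(tweet))  # content preview is cut to 50 characters plus "..."
```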

Next steps

After successful extraction, you’re ready to analyze your tweets:

Analyze tweets

Run AI analysis on your extracted tweets

Customize criteria

Fine-tune what gets flagged for deletion
