
Overview

The setup_temporal_data.py script processes raw ADNI data to create the temporal dataset required for STGNN training. It combines acquisition dates from TADPOLE_COMPLETE.csv with labels from TADPOLE_Simplified.csv to generate TADPOLE_TEMPORAL.csv with temporal features.

Prerequisites

Required Input Files

TADPOLE_COMPLETE.csv

Complete ADNI dataset with all visits and acquisition dates

TADPOLE_Simplified.csv

Simplified ADNI dataset with subject labels
Both files should be placed in the data/ directory.

File Locations

data/
├── TADPOLE_COMPLETE.csv     # Input: Complete ADNI data with dates
├── TADPOLE_Simplified.csv   # Input: Subject labels
└── TADPOLE_TEMPORAL.csv     # Output: Generated temporal dataset

Running the Script

Step 1: Navigate to project directory

cd /path/to/stgnn

Step 2: Verify input files exist

ls -lh data/TADPOLE_COMPLETE.csv data/TADPOLE_Simplified.csv
The script will error if either file is missing.

Step 3: Run the preparation script

python setup_temporal_data.py
This creates data/TADPOLE_TEMPORAL.csv with temporal features.

Step 4: Review output statistics

The script displays statistics about temporal sequences and prediction opportunities.
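
The input check from step 2 can also be done from Python before launching the script. The sketch below is illustrative only: the helper name missing_inputs is hypothetical and is not part of setup_temporal_data.py. It reports which of the two expected files are absent from a given data directory.

```python
from pathlib import Path

# File names taken from the Prerequisites section.
REQUIRED_INPUTS = ("TADPOLE_COMPLETE.csv", "TADPOLE_Simplified.csv")


def missing_inputs(data_dir: str = "data") -> list[str]:
    """Return the required input files that are not present in data_dir."""
    base = Path(data_dir)
    return [name for name in REQUIRED_INPUTS if not (base / name).is_file()]


if __name__ == "__main__":
    missing = missing_inputs()
    if missing:
        print("Missing input files:", ", ".join(missing))
    else:
        print("Both input files found; ready to run setup_temporal_data.py")
```

An empty list means both inputs are in place.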

Script Workflow

1. Data Loading

From setup_temporal_data.py:45-52:
print("Loading TADPOLE_COMPLETE...")
df_complete = pd.read_csv(complete_path)

print("Loading TADPOLE_Simplified for labels...")
df_simplified = pd.read_csv(simplified_path)

# Clean subject IDs (remove underscores in complete, keep in simplified)
df_complete['Subject'] = df_complete['Subject'].str.replace('_', '', regex=False)
Underscores are removed from the subject IDs in TADPOLE_COMPLETE.csv (IDs from TADPOLE_Simplified.csv are left unchanged) to ensure consistency with FC matrix filenames.

2. Date Parsing

From setup_temporal_data.py:13-27:
def parse_date(date_str: str) -> datetime:
    """Parse date string from TADPOLE_COMPLETE format."""
    try:
        # Try common formats
        for fmt in ['%m/%d/%Y', '%m/%d/%y', '%Y-%m-%d', '%d/%m/%Y']:
            try:
                return datetime.strptime(str(date_str).strip(), fmt)
            except ValueError:
                continue
        
        raise ValueError(f"Cannot parse date: {date_str}")
    except Exception as e:
        print(f"Warning: Could not parse date '{date_str}': {e}")
        return None
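Supported date formats, tried in this order:
  • MM/DD/YYYY (e.g., 03/15/2018)
  • MM/DD/YY (e.g., 03/15/18)
  • YYYY-MM-DD (e.g., 2018-03-15)
  • DD/MM/YYYY (e.g., 15/03/2018)
Because MM/DD/YYYY is tried before DD/MM/YYYY, an ambiguous date such as 03/04/2018 is read as March 4, not April 3. The DD/MM/YYYY format only takes effect when the day is greater than 12.
Rows with unparseable dates are automatically excluded.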
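
To see which format a given string ends up matching, the format loop can be reproduced in isolation. This is a minimal sketch that mirrors the format list from the excerpt; the helper name first_matching_format is hypothetical and does not exist in the script.

```python
from datetime import datetime
from typing import Optional

# Same format order as the parse_date excerpt.
FORMATS = ["%m/%d/%Y", "%m/%d/%y", "%Y-%m-%d", "%d/%m/%Y"]


def first_matching_format(date_str: str) -> Optional[str]:
    """Return the first format that parses date_str, or None if none does."""
    for fmt in FORMATS:
        try:
            datetime.strptime(date_str.strip(), fmt)
            return fmt
        except ValueError:
            continue
    return None


for sample in ["03/04/2018", "15/03/2018", "03/15/18", "2018-03-15", "March 2018"]:
    print(f"{sample!r:14} -> {first_matching_format(sample)}")
```

Running this against a handful of values from the Acq Date column is a quick way to spot dates that would be silently read as month-first.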

3. Temporal Sequence Creation

From setup_temporal_data.py:72-115:
Step 1: Group by subject

Data is grouped by subject ID and sorted chronologically by acquisition date.

Step 2: Remove duplicates

subject_data = subject_data.drop_duplicates(
    subset=['Subject', 'Visit', 'Acq Date'],
    keep='first'
)
Duplicates based on subject, visit, and date are removed.

Step 3: Calculate temporal features

For each visit:
  • Months from baseline: Time elapsed since first visit
  • Months to next: Time gap until next visit
  • Visit order: Chronological position (1, 2, 3, …)
  • Total visits: Number of visits for this subject

Step 4: Assign labels

Each subject's label is read from the first matching row in TADPOLE_Simplified.csv:
subject_label = simplified_subject['Label_CS_Num'].iloc[0]
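
The grouping, de-duplication, and ordering steps can be illustrated without pandas. The following is a simplified sketch, not the script's implementation: the sample rows and subject IDs are made up, and duplicates are dropped before sorting, which gives the same result for exact duplicates.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical rows: (Subject, Visit, Acq Date); the second row is an exact duplicate.
rows = [
    ("S001", "m06", "09/20/2018"),
    ("S001", "m06", "09/20/2018"),
    ("S001", "bl", "03/15/2018"),
    ("S002", "bl", "01/10/2019"),
]

# Group by subject, skipping repeated (Subject, Visit, Acq Date) keys (first one wins).
by_subject = defaultdict(list)
seen = set()
for subject, visit, acq_date in rows:
    key = (subject, visit, acq_date)
    if key in seen:
        continue
    seen.add(key)
    by_subject[subject].append((datetime.strptime(acq_date, "%m/%d/%Y"), visit))

# Sort each subject's visits chronologically and number them from 1.
sequences = {}
for subject, visits in by_subject.items():
    visits.sort()
    sequences[subject] = [
        {"Visit": visit, "Visit_Order": order, "Total_Visits": len(visits)}
        for order, (_, visit) in enumerate(visits, start=1)
    ]

print(sequences["S001"])
```

Here subject S001 ends up with two visits (bl, then m06) even though three rows were supplied.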

4. Temporal Feature Calculation

Months From Baseline

From setup_temporal_data.py:88-95:
# Get baseline date (first visit) for this subject
baseline_date = subject_data.iloc[0]['Acq_Date_Parsed']

for idx, (_, row) in enumerate(subject_data.iterrows()):
    # Calculate months from baseline
    months_from_baseline = months_between_dates(
        baseline_date, 
        row['Acq_Date_Parsed']
    )
The baseline is always the first chronological visit for each subject.

Months To Next Visit

From setup_temporal_data.py:97-102:
# Calculate time to next visit
if idx < len(subject_data) - 1:
    next_date = subject_data.iloc[idx + 1]['Acq_Date_Parsed']
    months_to_next = months_between_dates(row['Acq_Date_Parsed'], next_date)
else:
    months_to_next = np.nan  # Last visit has no next
The last visit for each subject has months_to_next = NaN since there is no subsequent visit.

Date Difference Calculation

From setup_temporal_data.py:29-37:
def months_between_dates(date1: datetime, date2: datetime) -> float:
    """Calculate months between two dates."""
    if date1 is None or date2 is None:
        return np.nan
    
    # Calculate total days and convert to months (approximate)
    days_diff = (date2 - date1).days
    months_diff = days_diff / 30.44  # Average days per month
    return round(months_diff, 1)
Months are approximated by dividing the difference in days by 30.44 (the average month length, 365.25 / 12 ≈ 30.44) and rounding to one decimal place.
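
A short worked example shows how the two temporal features come out for one subject. The visit dates below are hypothetical, and months_between is a stripped-down stand-in for months_between_dates (no None handling).

```python
import math
from datetime import datetime


def months_between(d1: datetime, d2: datetime) -> float:
    """Approximate months between two dates (days / 30.44, one decimal)."""
    return round((d2 - d1).days / 30.44, 1)


# Hypothetical visit dates for one subject, already sorted chronologically.
visits = [datetime(2018, 3, 15), datetime(2018, 9, 20), datetime(2019, 4, 1)]
baseline = visits[0]

features = []
for idx, date in enumerate(visits):
    has_next = idx < len(visits) - 1
    features.append({
        "Months_From_Baseline": months_between(baseline, date),
        # The last visit has no successor, so its gap is NaN.
        "Months_To_Next": months_between(date, visits[idx + 1]) if has_next else math.nan,
    })

for row in features:
    print(row)
```

For these dates the gaps are 189 and 193 days (382 days in total), which gives 0.0, 6.2, 12.5 months from baseline and 6.2, 6.3, NaN months to next.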

Output Format

Generated File

Filename: data/TADPOLE_TEMPORAL.csv

Columns

From setup_temporal_data.py:105-120:
| Column | Type | Description | Example |
|---|---|---|---|
| Subject | string | Subject ID (no underscores) | 123456 |
| Visit | string | Visit code from original data | bl, m06, m12 |
| Acq_Date | string | Original acquisition date | 03/15/2018 |
| Months_From_Baseline | float | Months since first visit | 0.0, 6.2, 12.5 |
| Months_To_Next | float | Months until next visit | 6.2, 6.3, NaN |
| Label_CS_Num | int | Cognitive stage label (0=stable, 1=converter) | 0 or 1 |
| Visit_Order | int | Chronological visit number | 1, 2, 3, … |
| Total_Visits | int | Total visits for this subject | 3, 5, 7, … |
| Age | float | Subject age at visit (if available) | 72.5 |
| Sex | string | Subject sex (if available) | M, F |
| Group | string | Diagnostic group (if available) | AD, MCI, CN |
The output column is Months_To_Next, but the dataset loader expects Months_To_Next_Original. You may need to rename this column:
df = pd.read_csv('data/TADPOLE_TEMPORAL.csv')
df = df.rename(columns={'Months_To_Next': 'Months_To_Next_Original'})
df.to_csv('data/TADPOLE_TEMPORAL.csv', index=False)

Output Statistics

The script provides analysis of the temporal dataset:

Subject Counts

From setup_temporal_data.py:138-141:
visits_per_subject = df.groupby('Subject').size()
multi_visit_subjects = visits_per_subject[visits_per_subject > 1]

print(f"  Single visit: {len(visits_per_subject[visits_per_subject == 1])}")
print(f"  Multi-visit: {len(multi_visit_subjects)}")
Example output:
Subjects available:
  Single visit: 45
  Multi-visit: 123
  Total prediction pairs: 287
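
The same counts can be reproduced on a small list to check how they relate. This sketch uses made-up subject IDs, and it assumes that a "prediction pair" means two consecutive visits of the same subject; the excerpt does not show how the script computes that total.

```python
from collections import Counter

# Hypothetical Subject column: one entry per row of the temporal dataset.
subject_column = ["S001", "S001", "S001", "S002", "S003", "S003"]

visits_per_subject = Counter(subject_column)
single_visit = sum(1 for n in visits_per_subject.values() if n == 1)
multi_visit = sum(1 for n in visits_per_subject.values() if n > 1)

# Assumption: a subject with n visits contributes n - 1 consecutive-visit pairs.
total_pairs = sum(n - 1 for n in visits_per_subject.values())

print(f"  Single visit: {single_visit}")
print(f"  Multi-visit: {multi_visit}")
print(f"  Total prediction pairs: {total_pairs}")
```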

Prediction Horizons

From setup_temporal_data.py:144-153:
# Define prediction horizon bins
bins = [0, 6, 12, 24, float('inf')]
labels = ['0-6m', '6-12m', '12-24m', '24m+']

print(f"\nPrediction horizons:")
for i in range(len(bins)-1):
    count = len(valid_gaps[(valid_gaps >= bins[i]) & (valid_gaps < bins[i+1])])
    print(f"  {labels[i]:8s}: {count:3d} prediction opportunities")
Example output:
Prediction horizons:
  0-6m    :  87 prediction opportunities
  6-12m   : 112 prediction opportunities
  12-24m  :  65 prediction opportunities
  24m+    :  23 prediction opportunities
These statistics show how many visit pairs exist for different prediction time windows.
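
The bins are half-open intervals, so a gap of exactly 6.0 months counts toward 6-12m rather than 0-6m. A minimal sketch of the binning follows, with hypothetical gap values and the assumption that valid_gaps holds the non-NaN Months_To_Next values (the excerpt does not show how it is built).

```python
import math

bins = [0, 6, 12, 24, float("inf")]
labels = ["0-6m", "6-12m", "12-24m", "24m+"]

# Hypothetical Months_To_Next values; NaN marks a subject's last visit.
months_to_next = [5.9, 6.0, 6.3, 12.0, 23.9, 30.1, math.nan]

# Assumption: only non-NaN gaps count as prediction opportunities.
valid_gaps = [gap for gap in months_to_next if not math.isnan(gap)]

# Each bin includes its lower edge and excludes its upper edge.
counts = {
    label: sum(1 for gap in valid_gaps if lower <= gap < upper)
    for label, lower, upper in zip(labels, bins[:-1], bins[1:])
}

for label, count in counts.items():
    print(f"  {label:8s}: {count:3d} prediction opportunities")
```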

Complete Example Run

$ python setup_temporal_data.py

SETTING UP TEMPORAL DATA FOR TIME-AWARE PREDICTION
============================================================

============================================================
CREATING TEMPORAL DATASET
============================================================
Loading TADPOLE_COMPLETE...
Loading TADPOLE_Simplified for labels...
TADPOLE_COMPLETE: 12741 rows, 1737 subjects
TADPOLE_Simplified: 1737 rows, 1737 subjects
Processing data: 12741 rows, 1737 subjects
Parsing acquisition dates...
Data with valid dates: 12689 rows
Creating temporal sequences...
Temporal dataset created: 12689 rows, 1737 subjects

============================================================
PREDICTION PAIRS ANALYSIS
============================================================
Subjects available:
  Single visit: 1234
  Multi-visit: 503
  Total prediction pairs: 1456

Prediction horizons:
  0-6m    : 245 prediction opportunities
  6-12m   : 578 prediction opportunities
  12-24m  : 412 prediction opportunities
  24m+    : 221 prediction opportunities

SAVING RESULTS...
File created:
   data/TADPOLE_TEMPORAL.csv

Ready for time-aware training!

Troubleshooting

Error: "Error: data/TADPOLE_COMPLETE.csv not found!"
Solution: Ensure both input CSV files are in the data/ directory:
ls data/TADPOLE_COMPLETE.csv data/TADPOLE_Simplified.csv

Warning: "Warning: Could not parse date '...': ..."
Impact: Rows with unparseable dates are excluded from the temporal dataset.
Solution: Check the date format in TADPOLE_COMPLETE.csv. Supported formats are listed above.

Issue: Fewer subjects in output than expected
Possible causes:
  1. Subject not in TADPOLE_Simplified.csv (no label available)
  2. No valid acquisition dates for subject
  3. Subject ID format mismatch
Solution: Check that subject IDs match between the two input files.
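
To narrow down a mismatch, the two sets of subject IDs can be compared directly. This is a minimal sketch, assuming the IDs have already been read from each file; the helper name compare_subject_ids and the sample IDs are hypothetical.

```python
def compare_subject_ids(complete_ids, simplified_ids):
    """Return the IDs that appear in only one of the two collections."""
    complete, simplified = set(complete_ids), set(simplified_ids)
    return {
        "only_in_complete": sorted(complete - simplified),
        "only_in_simplified": sorted(simplified - complete),
    }


# Hypothetical IDs: the second list still carries underscores in one entry.
report = compare_subject_ids(
    ["012S3456", "034S7890"],
    ["012S3456", "034_S_7890"],
)
print(report)
```

IDs that show up on both sides of the report in slightly different spellings usually point to a formatting difference, not a missing subject.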

Issue: Dataset loader expects Months_To_Next_Original but script creates Months_To_Next
Solution: Rename the column after generation:
import pandas as pd
df = pd.read_csv('data/TADPOLE_TEMPORAL.csv')
df = df.rename(columns={'Months_To_Next': 'Months_To_Next_Original'})
df.to_csv('data/TADPOLE_TEMPORAL.csv', index=False)

Next Steps

After generating TADPOLE_TEMPORAL.csv:

Prepare FC Matrices

Organize functional connectivity matrices in the required format

Start Training

Begin training STGNN model with temporal data
