
Overview

The TADPOLE_TEMPORAL.csv file is the primary label and temporal information source for STGNN training. It contains subject labels, visit information, and temporal features for each scan in the dataset.
This file is typically generated by setup_temporal_data.py from the original TADPOLE data files, but can also be created manually following this specification.

File Location

data/
├── FC_Matrices/
├── TADPOLE_TEMPORAL.csv     # Required label file
├── TADPOLE_COMPLETE.csv     # (Optional) Source data
└── TADPOLE_Simplified.csv   # (Optional) Source data
From FC_ADNIDataset.py:10 and README line 18:
  • Default filename: TADPOLE_TEMPORAL.csv
  • Location: Root data directory (same level as FC_Matrices/)
  • Format: CSV with header row

Required Columns

From FC_ADNIDataset.py:123-158 and setup_temporal_data.py:105-120:

Subject

  • Type: string
  • Description: Unique subject identifier
  • Format: No underscores (e.g., 123456, not 123_456)
  • Example: 002S0295, 123456
Subject IDs should not contain underscores. As a safeguard, the dataset loader strips any that remain:
df['Subject'] = df['Subject'].str.replace('_', '', regex=False)
This ensures consistency with FC matrix filenames (sub-123456_run-01_fc_matrix.npz).
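One pitfall when preparing the file by hand: pandas infers column types, so a purely numeric ID such as 012345 would be read as the integer 12345 and the `.str` accessor would then fail. A minimal sketch, assuming the CSV is loaded with pandas and using made-up sample rows, that reads Subject as text before stripping underscores:

```python
import io
import pandas as pd

# Hypothetical two-row sample; the real file has more columns
csv_text = "Subject,Visit\n002_S_0295,bl\n012345,bl\n"

# dtype=str keeps numeric-looking IDs as text and preserves leading zeros
df = pd.read_csv(io.StringIO(csv_text), dtype={"Subject": str})
df["Subject"] = df["Subject"].str.replace("_", "", regex=False)

subjects = df["Subject"].tolist()
print(subjects)  # ['002S0295', '012345']
```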

Visit

  • Type: string
  • Description: Visit code/identifier
  • Format: Free text (typically ADNI visit codes)
  • Examples: bl (baseline), m06 (6-month), m12 (12-month), m24 (24-month)
Used for visit identification and debugging. Stored in graph attribute visit_code.

Label_CS_Num

  • Type: integer
  • Description: Cognitive stage label (classification target)
  • Values:
    • 0: Stable (no cognitive decline)
    • 1: Converter (progressed to worse cognitive stage)
  • Usage: Binary classification label for the subject
From FC_ADNIDataset.py:41:
graph.y = torch.tensor([label_dict.get(base_id, 0)], dtype=torch.long)
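Labels are assigned at the subject level. All visits from the same subject receive the same label, determined by the subject’s final cognitive stage (last chronological visit). Note that the .get(base_id, 0) fallback in the line above means a graph whose ID is missing from label_dict receives label 0 (Stable) instead of raising an error.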

Visit_Order

  • Type: integer
  • Description: Chronological visit number for this subject
  • Values: 1, 2, 3, … (1-indexed)
  • Example: For a subject with 3 visits, values are 1, 2, 3
From setup_temporal_data.py:112:
'Visit_Order': idx + 1,
Used to sort visits chronologically and map run numbers to specific visits.

Months_From_Baseline

  • Type: float
  • Description: Time elapsed (in months) from subject’s first visit
  • Values: ≥ 0.0 (baseline visit is 0.0)
  • Precision: Rounded to 1 decimal place
  • Examples: 0.0, 6.2, 12.5, 24.8
From setup_temporal_data.py:29-37:
def months_between_dates(date1: datetime, date2: datetime) -> float:
    days_diff = (date2 - date1).days
    months_diff = days_diff / 30.44  # Average days per month
    return round(months_diff, 1)
Calculated using average month length (30.44 days).
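As a worked example of the arithmetic, the sketch below repeats the function and applies it to two illustrative date pairs (the dates are made up, not taken from the dataset). A 182-day gap gives 182 / 30.44 ≈ 5.98, which rounds to 6.0; a 366-day gap gives ≈ 12.02, which rounds to 12.0.

```python
from datetime import datetime

def months_between_dates(date1: datetime, date2: datetime) -> float:
    """Elapsed months from date1 to date2, using 30.44 days per month."""
    days_diff = (date2 - date1).days
    return round(days_diff / 30.44, 1)

baseline = datetime(2020, 1, 1)
six_month = datetime(2020, 7, 1)  # 182 days after baseline
one_year = datetime(2021, 1, 1)   # 366 days after baseline (2020 is a leap year)

print(months_between_dates(baseline, six_month))  # 6.0
print(months_between_dates(baseline, one_year))   # 12.0
```

Because the divisor is an average, values will rarely land exactly on 6.0 or 12.0 for real scan dates.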

Months_To_Next_Original

  • Type: float
  • Description: Time gap (in months) until subject’s next visit
  • Values:
    • Positive float for non-final visits
    • -1 or NaN for final visit (no subsequent visit)
  • Examples: 6.2, 12.3, -1 (last visit)
From FC_ADNIDataset.py:49 and setup_temporal_data.py:110:
# In dataset loader:
'months_to_next': visit_row.get('Months_To_Next_Original', -1)

# In setup script:
'Months_To_Next': months_to_next  # May need renaming to Months_To_Next_Original
Column name discrepancy: The setup_temporal_data.py script creates Months_To_Next, but the dataset expects Months_To_Next_Original. You may need to rename:
df = df.rename(columns={'Months_To_Next': 'Months_To_Next_Original'})
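When building the file manually, the gap column can be derived from Months_From_Baseline. The sketch below is one possible approach, not the setup script's actual computation (which is not shown here): it assumes the gap is the difference between consecutive Months_From_Baseline values within a subject and uses -1 for the final visit. The sample values are made up.

```python
import pandas as pd

df = pd.DataFrame({
    "Subject": ["A", "A", "A", "B"],
    "Visit_Order": [1, 2, 3, 1],
    "Months_From_Baseline": [0.0, 6.0, 12.5, 0.0],
})

df = df.sort_values(["Subject", "Visit_Order"])

# Months_From_Baseline of the following visit; NaN on each subject's last row
next_months = df.groupby("Subject")["Months_From_Baseline"].shift(-1)

# Gap to the next visit, with -1 marking "no subsequent visit"
df["Months_To_Next_Original"] = (
    (next_months - df["Months_From_Baseline"]).round(1).fillna(-1)
)

gaps = df["Months_To_Next_Original"].tolist()
print(gaps)  # [6.0, 6.5, -1.0, -1.0]
```

Differences of already-rounded values can be off by 0.1 from a gap computed directly from dates, so prefer computing from Acq_Date when it is available.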

Optional Columns

Acq_Date

  • Type: string
  • Description: Scan acquisition date
  • Format: Various date formats (MM/DD/YYYY, YYYY-MM-DD, etc.)
  • Example: 03/15/2018, 2018-03-15
Used during temporal data preparation but not required for training.
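Since Acq_Date may appear in more than one format, a preparation script has to try several patterns. The helper below is a hypothetical sketch (the name parse_acq_date and the format list are assumptions, not part of setup_temporal_data.py) covering the two formats shown in the example above:

```python
from datetime import datetime

# Formats to try, in order; extend this tuple if the data uses others
DATE_FORMATS = ("%m/%d/%Y", "%Y-%m-%d")

def parse_acq_date(text: str) -> datetime:
    """Parse an Acq_Date string, trying each known format in turn."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt)
        except ValueError:
            continue
    raise ValueError(f"Unrecognized Acq_Date format: {text!r}")

print(parse_acq_date("03/15/2018") == parse_acq_date("2018-03-15"))  # True
```

Ambiguous day-first dates such as 15/03/2018 would be rejected rather than guessed at, which is usually the safer failure mode.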

Total_Visits

  • Type: integer
  • Description: Total number of visits for this subject
  • Example: For a subject with 3 scans: all rows have Total_Visits=3
From setup_temporal_data.py:113:
'Total_Visits': len(subject_data)
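For a manually created file, Visit_Order and Total_Visits can both be derived from the scan dates. This is a sketch under the assumption that Acq_Date has already been parsed into datetimes; it is not the setup script's loop, and the sample rows are made up.

```python
import pandas as pd

df = pd.DataFrame({
    "Subject": ["A", "B", "A", "A"],
    "Acq_Date": pd.to_datetime(
        ["2018-07-20", "2019-05-22", "2018-01-15", "2019-01-28"]
    ),
})

# Chronological order within each subject
df = df.sort_values(["Subject", "Acq_Date"]).reset_index(drop=True)

# 1-indexed position of each visit within its subject
df["Visit_Order"] = df.groupby("Subject").cumcount() + 1

# Number of rows per subject, repeated on every row of that subject
df["Total_Visits"] = df.groupby("Subject")["Subject"].transform("size")

visit_order = df["Visit_Order"].tolist()
total_visits = df["Total_Visits"].tolist()
print(visit_order, total_visits)  # [1, 2, 3, 1] [3, 3, 3, 1]
```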

Age

  • Type: float
  • Description: Subject age at this visit
  • Units: Years
  • Example: 72.5, 68.2
From setup_temporal_data.py:117-119:
for col in ['Age', 'Sex', 'Group']:
    if col in row:
        temporal_entry[col] = row[col]

Sex

  • Type: string
  • Description: Subject biological sex
  • Values: M (Male), F (Female)

Group

  • Type: string
  • Description: Diagnostic group at this visit
  • Values: CN (Cognitively Normal), MCI (Mild Cognitive Impairment), AD (Alzheimer’s Disease)
  • Example: A converter might progress from CN → MCI → AD across visits

Example CSV

Subject,Visit,Acq_Date,Months_From_Baseline,Months_To_Next_Original,Label_CS_Num,Visit_Order,Total_Visits,Age,Sex,Group
002S0295,bl,01/15/2018,0.0,6.2,0,1,3,72.5,M,CN
002S0295,m06,07/20/2018,6.2,6.3,0,2,3,73.0,M,CN
002S0295,m12,01/28/2019,12.5,-1,0,3,3,73.5,M,CN
011S4105,bl,03/10/2017,0.0,12.1,1,1,2,68.2,F,MCI
011S4105,m12,03/15/2018,12.1,-1,1,2,2,69.2,F,AD
023S4020,bl,05/22/2019,0.0,-1,0,1,1,75.8,F,CN

Row Structure

One Row Per Visit

Each row represents one visit for one subject:
Subject 002S0295:
  Row 1: Visit 1 (baseline) - Month 0.0
  Row 2: Visit 2 (6-month) - Month 6.2  
  Row 3: Visit 3 (12-month) - Month 12.5

Subject 011S4105:
  Row 1: Visit 1 (baseline) - Month 0.0
  Row 2: Visit 2 (12-month) - Month 12.1

Subject-Level vs Visit-Level Data

| Field | Level | Varies Across Visits? |
| --- | --- | --- |
| Subject | Subject | No (same for all visits) |
| Label_CS_Num | Subject | No (determined by final visit) |
| Sex | Subject | No |
| Visit | Visit | Yes |
| Visit_Order | Visit | Yes |
| Months_From_Baseline | Visit | Yes |
| Months_To_Next_Original | Visit | Yes |
| Age | Visit | Yes |
| Group | Visit | Yes (can change) |

Mapping to FC Matrices

Subject ID Matching

From FC_ADNIDataset.py:94-108:
# FC Matrix filename: sub-002S0295_run-01_fc_matrix.npz
# Extracted: subj_id = '002S0295', run_num = '01'
# Full ID: '002S0295_run01'

# CSV lookup:
# Subject = '002S0295' (no underscores)
# Visit_Order = 1 → maps to run01
Mapping logic:
  1. Extract subject ID from filename (remove sub- prefix)
  2. Extract run number from filename
  3. Format as {subject}_run{run_num} (e.g., 002S0295_run01)
  4. Map to CSV row where Subject = '002S0295' and Visit_Order = 1
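The four steps above can be sketched with a regular expression. The loader's actual parsing code is not reproduced in this document, so the pattern and the function name fc_filename_to_key are assumptions based only on the filename convention shown:

```python
import re

# Assumed filename pattern: sub-<subject>_run-<NN>_fc_matrix.npz
FC_NAME = re.compile(r"^sub-(?P<subject>.+)_run-(?P<run>\d+)_fc_matrix\.npz$")

def fc_filename_to_key(filename: str) -> str:
    """Turn an FC matrix filename into a '<subject>_runNN' lookup key."""
    match = FC_NAME.match(filename)
    if match is None:
        raise ValueError(f"Unexpected FC matrix filename: {filename!r}")
    subject = match.group("subject").replace("_", "")  # IDs carry no underscores
    run_num = int(match.group("run"))
    return f"{subject}_run{run_num:02d}"

print(fc_filename_to_key("sub-002S0295_run-01_fc_matrix.npz"))  # 002S0295_run01
```

Raising on an unexpected filename makes stray files in FC_Matrices/ visible early instead of silently dropping them.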

Run Number to Visit Order

From FC_ADNIDataset.py:144-152:
# Sort visits by Visit_Order
subject_data = df[df['Subject'] == subject].sort_values('Visit_Order')

# Map run numbers to chronological visits
for run_idx, (_, visit_row) in enumerate(subject_data.iterrows()):
    run_key = f"{subject}_run{run_idx + 1:02d}"  # run01, run02, etc.
    
    visit_dict[run_key] = {
        'visit_code': visit_row['Visit'],
        'visit_months': visit_row['Months_From_Baseline'],
        'months_to_next': visit_row.get('Months_To_Next_Original', -1)
    }
Example:
CSV rows (sorted by Visit_Order):
  Subject=002S0295, Visit=bl,   Visit_Order=1 → run01
  Subject=002S0295, Visit=m06,  Visit_Order=2 → run02  
  Subject=002S0295, Visit=m12,  Visit_Order=3 → run03

FC Matrix files:
  sub-002S0295_run-01_fc_matrix.npz → Visit=bl,  Months=0.0
  sub-002S0295_run-02_fc_matrix.npz → Visit=m06, Months=6.2
  sub-002S0295_run-03_fc_matrix.npz → Visit=m12, Months=12.5
Run numbers in FC filenames must match the chronological order of visits in the CSV (sorted by Visit_Order).

Label Assignment Logic

From FC_ADNIDataset.py:130-135:
# Use the label from the last visit per subject
label_dict = {}
for subject in df['Subject'].unique():
    subject_data = df[df['Subject'] == subject]
    # Use the last visit's label as the overall subject label
    label_dict[subject] = subject_data.iloc[-1][label_col]
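Logic:
  1. Group rows by subject
  2. Take Label_CS_Num from the subject's last row (iloc[-1])
  3. Apply this label to all visits for the subject
The excerpt above does not sort by Visit_Order itself, so the last row is the last chronological visit only if each subject's rows are already in Visit_Order order (either in the CSV or through sorting done elsewhere in the loader).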
Rationale: The label represents the subject’s final cognitive trajectory. All historical scans are labeled based on where the patient eventually ended up (stable vs. converter).
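To make the "last chronological visit" rule independent of row order in the file, the lookup can sort explicitly first. This is an alternative sketch, not the loader's code, and the sample rows are made up; it also checks that each subject carries a single label value, as the specification requires.

```python
import pandas as pd

# Rows deliberately out of chronological order
df = pd.DataFrame({
    "Subject": ["011S4105", "002S0295", "011S4105", "002S0295"],
    "Visit_Order": [2, 1, 1, 2],
    "Label_CS_Num": [1, 0, 1, 0],
})

# Sort so that .last() really is the highest Visit_Order per subject
last_labels = (
    df.sort_values(["Subject", "Visit_Order"])
      .groupby("Subject")["Label_CS_Num"]
      .last()
)
label_dict = {subject: int(label) for subject, label in last_labels.items()}

# Subject-level labels should not vary across a subject's visits
labels_consistent = bool(
    df.groupby("Subject")["Label_CS_Num"].nunique().eq(1).all()
)

print(label_dict)  # {'002S0295': 0, '011S4105': 1}
```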

Data Validation

Required Validation Checks

1. Check required columns

required_cols = [
    'Subject', 'Visit', 'Label_CS_Num', 'Visit_Order',
    'Months_From_Baseline', 'Months_To_Next_Original'
]
missing = [col for col in required_cols if col not in df.columns]
assert not missing, f"Missing required columns: {missing}"

2. Verify data types

# The pandas type helpers accept any integer or float width (int32, Int64, ...)
assert pd.api.types.is_integer_dtype(df['Label_CS_Num'])
assert pd.api.types.is_integer_dtype(df['Visit_Order'])
assert pd.api.types.is_float_dtype(df['Months_From_Baseline'])

3. Check label values

assert df['Label_CS_Num'].isin([0, 1]).all()
print(f"Label distribution: {df['Label_CS_Num'].value_counts().to_dict()}")

4. Validate visit ordering

for subject in df['Subject'].unique():
    subject_data = df[df['Subject'] == subject]
    # Visit_Order should start at 1 and be consecutive
    orders = sorted(subject_data['Visit_Order'].values)
    assert orders == list(range(1, len(orders) + 1)), subject

5. Check baseline months

# First visit should have Months_From_Baseline = 0.0
for subject in df['Subject'].unique():
    subject_data = df[df['Subject'] == subject].sort_values('Visit_Order')
    assert subject_data.iloc[0]['Months_From_Baseline'] == 0.0, subject
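Since Visit_Order is meant to be chronological, one further check that may be worth running is that Months_From_Baseline never decreases as Visit_Order increases. This check is a suggestion rather than something the loader enforces, and the function name is hypothetical:

```python
import pandas as pd

def subjects_with_nonmonotonic_months(df: pd.DataFrame) -> list:
    """Return subjects whose Months_From_Baseline decreases along Visit_Order."""
    ordered = df.sort_values(["Subject", "Visit_Order"])
    non_decreasing = ordered.groupby("Subject")["Months_From_Baseline"].apply(
        lambda months: months.is_monotonic_increasing
    )
    return non_decreasing[~non_decreasing].index.tolist()

# Made-up sample: subject B has visits 2 and 3 swapped
sample = pd.DataFrame({
    "Subject": ["A", "A", "A", "B", "B", "B"],
    "Visit_Order": [1, 2, 3, 1, 2, 3],
    "Months_From_Baseline": [0.0, 6.0, 12.0, 0.0, 12.0, 6.0],
})
bad_subjects = subjects_with_nonmonotonic_months(sample)
print(bad_subjects)  # ['B']
```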

Validation Script

import pandas as pd
import numpy as np

def validate_tadpole_temporal(csv_path):
    """Validate TADPOLE_TEMPORAL.csv format and contents."""
    print(f"Validating {csv_path}...")
    
    # Load CSV; read Subject as text so numeric-looking IDs keep leading zeros
    df = pd.read_csv(csv_path, dtype={'Subject': str})
    
    # Check required columns before any of them is accessed
    required_cols = [
        'Subject', 'Visit', 'Label_CS_Num', 'Visit_Order',
        'Months_From_Baseline', 'Months_To_Next_Original'
    ]
    missing_cols = [col for col in required_cols if col not in df.columns]
    if missing_cols:
        print(f"ERROR: Missing required columns: {missing_cols}")
        return False
    print("✓ All required columns present")
    print(f"Loaded {len(df)} rows, {df['Subject'].nunique()} subjects")
    
    # Check data types (only report success when the check passes)
    if pd.api.types.is_integer_dtype(df['Label_CS_Num']):
        print("✓ Data types correct")
    else:
        print(f"WARNING: Label_CS_Num type is {df['Label_CS_Num'].dtype}, expected int")
    
    # Check label values
    if not df['Label_CS_Num'].isin([0, 1]).all():
        print("ERROR: Label_CS_Num contains values other than 0 or 1")
        return False
    # .get avoids a KeyError when one class is absent from the file
    label_dist = df['Label_CS_Num'].value_counts()
    print(f"✓ Labels valid: {label_dist.get(0, 0)} stable (0) rows, "
          f"{label_dist.get(1, 0)} converter (1) rows")
    
    # Check visit ordering
    issues = 0
    for subject in df['Subject'].unique():
        subject_data = df[df['Subject'] == subject].sort_values('Visit_Order')
        
        # Check Visit_Order is consecutive
        orders = subject_data['Visit_Order'].values
        expected_orders = list(range(1, len(orders) + 1))
        if not np.array_equal(orders, expected_orders):
            print(f"WARNING: Subject {subject} has non-consecutive Visit_Order: {orders}")
            issues += 1
        
        # Check baseline is 0.0
        if subject_data.iloc[0]['Months_From_Baseline'] != 0.0:
            print(f"WARNING: Subject {subject} baseline is not 0.0")
            issues += 1
    
    if issues == 0:
        print("✓ Visit ordering and baselines correct")
    else:
        print(f"⚠ Found {issues} issues with visit ordering")
    
    # Check subject ID format (no underscores)
    if df['Subject'].str.contains('_', regex=False).any():
        print("ERROR: Subject IDs contain underscores (not allowed)")
        return False
    print("✓ Subject IDs formatted correctly (no underscores)")
    
    print("\n✓ Validation complete!")
    return True

if __name__ == "__main__":
    validate_tadpole_temporal('data/TADPOLE_TEMPORAL.csv')

Common Issues

Issue: KeyError: 'Months_To_Next_Original'
Cause: CSV has Months_To_Next instead of Months_To_Next_Original
Solution:
df = pd.read_csv('data/TADPOLE_TEMPORAL.csv', dtype={'Subject': str})
df = df.rename(columns={'Months_To_Next': 'Months_To_Next_Original'})
df.to_csv('data/TADPOLE_TEMPORAL.csv', index=False)

Issue: Subject IDs contain underscores
Solution: Remove underscores:
df['Subject'] = df['Subject'].str.replace('_', '', regex=False)

Issue: Visit_Order values like 1, 3, 5 (missing 2, 4)
Solution: Re-number consecutively, ranking the existing values so that each subject's relative visit order is preserved regardless of row order in the file:
df['Visit_Order'] = (
    df.groupby('Subject')['Visit_Order'].rank(method='first').astype(int)
)

Issue: Some subjects in FC_Matrices/ not found in CSV
Solution: Ensure all subjects with FC matrices have corresponding CSV entries. Check subject ID formatting (remove sub- prefix and underscores).
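To locate the mismatched files for the last issue, the FC matrix filenames can be compared against the (Subject, Visit_Order) pairs in the CSV. The helper below is a hypothetical sketch: it assumes the filename convention described earlier and that run numbers equal Visit_Order, which holds when Visit_Order is consecutive from 1.

```python
import re

# Assumed filename pattern: sub-<subject>_run-<NN>_fc_matrix.npz
FC_NAME = re.compile(r"^sub-(.+)_run-(\d+)_fc_matrix\.npz$")

def unmatched_fc_files(filenames, csv_rows):
    """Return FC filenames with no matching (Subject, Visit_Order) CSV row.

    csv_rows is an iterable of (subject, visit_order) pairs.
    Filenames that do not follow the expected pattern are also returned.
    """
    known = {(str(s).replace("_", ""), int(o)) for s, o in csv_rows}
    unmatched = []
    for name in filenames:
        match = FC_NAME.match(name)
        if match is None:
            unmatched.append(name)
            continue
        key = (match.group(1).replace("_", ""), int(match.group(2)))
        if key not in known:
            unmatched.append(name)
    return unmatched

files = [
    "sub-002S0295_run-01_fc_matrix.npz",
    "sub-002S0295_run-02_fc_matrix.npz",
    "sub-999S9999_run-01_fc_matrix.npz",
]
rows = [("002S0295", 1), ("002S0295", 2)]
print(unmatched_fc_files(files, rows))  # ['sub-999S9999_run-01_fc_matrix.npz']
```

With real data, `filenames` could come from `[p.name for p in Path('data/FC_Matrices').glob('*.npz')]` and `csv_rows` from `zip(df['Subject'], df['Visit_Order'])`.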

Integration with Training

The TADPOLE_TEMPORAL.csv file is loaded during dataset initialization:
from FC_ADNIDataset import FC_ADNIDataset

# Dataset automatically loads TADPOLE_TEMPORAL.csv from root directory
dataset = FC_ADNIDataset(
    root='data/',
    label_csv='TADPOLE_TEMPORAL.csv'  # Can customize filename
)

# Each graph now has temporal information
for graph in dataset:
    print(f"Subject: {graph.subj_id}")
    print(f"Label: {graph.y.item()}")
    print(f"Visit: {graph.visit_code}")
    print(f"Months from baseline: {graph.visit_months}")
    print(f"Months to next: {graph.months_to_next}")
After validation and preparation, the CSV is ready for use with the STGNN training pipeline.
