Building Cache

Before using Platzi Viewer, you must build the courses cache by scanning your Google Drive folder structure. This process generates courses_cache.json, which maps all courses, modules, and classes to their Drive file IDs.

What is the Cache?

The cache (courses_cache.json) is a comprehensive index of your course library:

Size: ~20 MB for a typical library of 500 courses
Content: Category/route/course structure with Drive file IDs for all videos, summaries, subtitles, and resources
Purpose: Enables fast navigation without querying Drive API on every page load
Validity: Stays valid until you add courses or reorganize your Drive folder structure

The cache only stores metadata and file IDs - actual video files and content remain in Google Drive and are streamed on-demand.

Cache Structure

The cache follows this hierarchy:

{
  "categories": [
    {
      "name": "Desarrollo Web",
      "icon": "🌐",
      "routes": [
        {
          "name": "Desarrollo Backend con Node.js",
          "courses": [
            {
              "name": "Curso de Fundamentos de Node.js",
              "id": "1ABC...xyz",  // Drive folder ID
              "modules": [
                {
                  "name": "Introducción",
                  "classes": [
                    {
                      "name": "Bienvenida al curso",
                      "hasVideo": true,
                      "hasSummary": true,
                      "files": {
                        "video": "1OOJ5lrsLfFEnp6AKVKZKYZH5A-NasCjl",
                        "summary": "1WWggG3NLugsK6dZ37wzbNeLAPqFdOVfj",
                        "subtitles": "1ABCdef...",
                        "reading": null,
                        "html": null
                      },
                      "resources": [
                        {
                          "name": "slides.pdf",
                          "file": "1QWE456...",
                          "ext": ".pdf",
                          "viewable": true
                        }
                      ]
                    }
                  ]
                }
              ],
              "moduleCount": 5,
              "classCount": 47
            }
          ]
        }
      ]
    }
  ],
  "stats": {
    "totalCategories": 8,
    "totalRoutes": 120,
    "totalCourses": 500,
    "totalClasses": 20000
  }
}
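
As a rough illustration of how this hierarchy can be traversed, the sketch below walks a trimmed-down cache dictionary and collects the video file IDs. The helper names (`iter_classes`, `video_ids`) and the placeholder IDs are assumptions made for this example; they are not taken from the Platzi Viewer source.

```python
def iter_classes(cache):
    """Yield (category, route, course, module, class_dict) for every class."""
    for category in cache.get("categories", []):
        for route in category.get("routes", []):
            for course in route.get("courses", []):
                for module in course.get("modules", []):
                    for cls in module.get("classes", []):
                        yield (category["name"], route["name"],
                               course["name"], module["name"], cls)


def video_ids(cache):
    """Return the Drive file IDs of all classes that have a video."""
    return [
        cls["files"]["video"]
        for *_, cls in iter_classes(cache)
        if cls.get("files", {}).get("video")
    ]


# Trimmed sample that follows the structure shown above (IDs are placeholders).
sample = {
    "categories": [{
        "name": "Desarrollo Web",
        "routes": [{
            "name": "Desarrollo Backend con Node.js",
            "courses": [{
                "name": "Curso de Fundamentos de Node.js",
                "modules": [{
                    "name": "Introducción",
                    "classes": [
                        {"name": "Bienvenida al curso",
                         "files": {"video": "VIDEO_ID_1", "summary": None}},
                        {"name": "Lectura",
                         "files": {"video": None, "summary": "SUMMARY_ID"}},
                    ],
                }],
            }],
        }],
    }],
}

print(video_ids(sample))
```

To run it against a real cache, load the file with `json.load` and pass the resulting dictionary to the same helpers.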

Expected Drive Folder Structure

Your Google Drive should be organized like this:

Platzi Courses/  (Root folder shared with service account)
├── Curso de Python/
│   ├── 1. Introducción/  (Module folder)
│   │   ├── 1. Bienvenida.mp4
│   │   ├── 1. Bienvenida_summary.html
│   │   ├── 1. Bienvenida.vtt
│   │   ├── 1. Bienvenida - Lecturas recomendadas.txt
│   │   ├── 2. Instalación de Python.mp4
│   │   ├── 2. Instalación de Python_summary.html
│   │   └── ...
│   ├── 2. Fundamentos/
│   │   ├── 1. Variables y tipos de datos.mp4
│   │   └── ...
│   └── presentation.html  (Optional course presentation)
├── Curso de JavaScript/
│   ├── 1. Primeros Pasos/
│   └── ...
└── ...

File Naming Conventions

The scanner recognizes files by their extensions and naming patterns:

File Type	Pattern	Example
Video	`{num}. {name}.mp4`	`1. Introducción.mp4`
Summary	`{num}. {name}_summary.html`	`1. Introducción_summary.html`
Subtitles	`{num}. {name}.vtt`	`1. Introducción.vtt`
Reading	`{num}. {name} - Lecturas recomendadas.txt`	`1. Intro - Lecturas recomendadas.txt`
HTML	`{num}. {name}.html`	`1. Demo.html`
Resources	`{num}. {name}.{ext}`	`1. Slides.pdf`

Each class file must start with a number and a period (e.g., 1., 2., 3.)
Class name follows the number and period
Files without numbers are treated as course-level resources
The scanner automatically groups files by their leading number
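
The grouping rules above can be sketched roughly as follows. This is an illustration of the naming conventions only, not the scanner's actual code; the function names and the regular expression are assumptions.

```python
import re
from collections import defaultdict

# A class file starts with digits, a period, then the class name.
LEADING_NUMBER = re.compile(r"^(\d+)\.\s*(.+)$")


def classify(filename):
    """Map a class filename to a file type, following the table above."""
    lower = filename.lower()
    if lower.endswith(".mp4"):
        return "video"
    if lower.endswith("_summary.html"):  # checked before the generic .html rule
        return "summary"
    if lower.endswith(".vtt"):
        return "subtitles"
    if lower.endswith(" - lecturas recomendadas.txt"):
        return "reading"
    if lower.endswith(".html"):
        return "html"
    return "resource"


def group_by_number(filenames):
    """Group numbered files per class; collect unnumbered files separately."""
    classes = defaultdict(dict)
    course_level = []
    for name in filenames:
        match = LEADING_NUMBER.match(name)
        if not match:
            course_level.append(name)
            continue
        number = int(match.group(1))
        kind = classify(name)
        if kind == "resource":
            classes[number].setdefault("resources", []).append(name)
        else:
            classes[number][kind] = name
    return dict(classes), course_level


files = [
    "1. Bienvenida.mp4",
    "1. Bienvenida_summary.html",
    "1. Bienvenida.vtt",
    "2. Slides.pdf",
    "presentation.html",
]
classes, course_level = group_by_number(files)
print(classes)
print(course_level)
```

Checking the `_summary.html` suffix before the plain `.html` suffix matters here, since a summary file would otherwise be classified as a generic HTML file.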

Running the Cache Builder

Prerequisites

Before building the cache, ensure:

✅ Service account is configured (see Google Drive Setup)
✅ Drive folder is shared with service account
✅ Dependencies are installed (pip install -r requirements.txt)
✅ You have the Drive root folder ID

Build Command

Navigate to project directory

cd platzi-viewer

Activate virtual environment (if using one)

# Windows
.venv\Scripts\activate

# Linux/Mac
source .venv/bin/activate

Update Drive root folder ID

Open rebuild_cache_drive.py and update the DRIVE_ROOT_ID constant:

rebuild_cache_drive.py:18

DRIVE_ROOT_ID = "17kPqqPSheDtQ5S1HM6Qvvh2qJ7O3YADm"  # Replace with your folder ID

Find your folder ID from the Drive URL:

https://drive.google.com/drive/folders/17kPqqPSheDtQ5S1HM6Qvvh2qJ7O3YADm
                                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
                                         This is your folder ID
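
If you prefer to extract the ID programmatically, a minimal sketch using only the standard library could look like this. The function name `folder_id_from_url` is hypothetical, and the sketch assumes the ID is the path segment that follows `folders`.

```python
from urllib.parse import urlparse


def folder_id_from_url(url):
    """Return the path segment after 'folders', or None if it is absent."""
    parts = [p for p in urlparse(url).path.split("/") if p]
    if "folders" in parts:
        idx = parts.index("folders")
        if idx + 1 < len(parts):
            return parts[idx + 1]
    return None


url = "https://drive.google.com/drive/folders/17kPqqPSheDtQ5S1HM6Qvvh2qJ7O3YADm"
print(folder_id_from_url(url))
```

Because `urlparse` separates the query string from the path, a trailing `?usp=sharing` does not end up in the result.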

Run the rebuild script

python rebuild_cache_drive.py

You’ll see output like:

============================================================
📦 Rebuilding courses_cache.json from Google Drive
============================================================

📖 Parsing PlatziRoutes.md...
   8 categories, 120 routes, 500 course entries

📁 Listing Drive root folder...
   487 course folders found in Drive
   0 courses already scanned (resumable)

🔗 Matching courses to Drive folders and scanning content...

  🌐 Desarrollo Web...
     ✓ 85 courses matched to Drive

  🎨 Diseño y UX...
     ✓ 42 courses matched to Drive

  ...

Wait for completion

The scan takes 15-30 minutes for a full library due to API rate limiting. Progress is saved automatically every 10 courses to drive_scan_progress.json. If interrupted, simply run the command again to resume.

Build Output

Once complete, you’ll see:

============================================================
✅ Cache rebuilt from Google Drive!
   Categories:     8
   Routes:         120
   Course entries: 500
   Matched Drive:  487
   With content:   478
   Total classes:  19,847
   File size:      18.7 MB
   Scanned this run: 487
   API calls:      ~5,240
============================================================

Understanding the Matching Process

The script matches courses from PlatziRoutes.md to Drive folders using fuzzy matching:

rebuild_cache_drive.py:213

def match_course_to_drive(md_name, drive_names_san, drive_names_map):
    """Try to match an MD course name to a Drive folder.

    Returns (drive_folder_name, drive_folder_id, match_type) or (None, None, None).
    """
    san = sanitize_for_match(md_name)

    # 1. Exact match
    if san in drive_names_san:
        info = drive_names_map[san]
        return info["name"], info["id"], "exact"

    # 2. MD name starts with Drive name
    for ds, info in drive_names_map.items():
        if san.startswith(ds) and len(ds) > 20:
            return info["name"], info["id"], "prefix"

    # 3. Drive name starts with MD name
    for ds, info in drive_names_map.items():
        if ds.startswith(san) and len(san) > 20:
            return info["name"], info["id"], "prefix"

    # 4. High word overlap (80%+ matching words)
    ...

Match types:

exact: Exact match after sanitization (removing special chars, lowercasing)
prefix: One name is a prefix of the other
fuzzy: High word overlap (≥80% common words)

Courses not matched to Drive folders are still included in the cache with foundInDrive: false and classCount: 0.
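
The word-overlap step is elided in the excerpt above, so the following is only a sketch of how such a check might work. The `sanitize` function is a hypothetical stand-in for `sanitize_for_match`, and the overlap ratio (shared words divided by the size of the larger word set) is an assumption rather than the script's actual formula.

```python
import re
import unicodedata


def sanitize(name):
    """Strip accents, lowercase, and replace special characters with spaces."""
    decomposed = unicodedata.normalize("NFKD", name)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    cleaned = re.sub(r"[^a-z0-9 ]+", " ", no_accents.lower())
    return " ".join(cleaned.split())


def word_overlap(a, b):
    """Fraction of shared words, relative to the larger of the two word sets."""
    words_a, words_b = set(sanitize(a).split()), set(sanitize(b).split())
    if not words_a or not words_b:
        return 0.0
    return len(words_a & words_b) / max(len(words_a), len(words_b))


def best_fuzzy_match(md_name, drive_names, threshold=0.8):
    """Return the Drive name with the highest overlap if it meets the threshold."""
    best = max(drive_names, key=lambda d: word_overlap(md_name, d), default=None)
    if best is not None and word_overlap(md_name, best) >= threshold:
        return best
    return None


drive_folders = ["Curso de Introduccion a Python (2023)", "Curso de Java"]
print(best_fuzzy_match("Curso de Introducción a Python", drive_folders))
```

Dividing by the larger set keeps a short name from matching a much longer one just because all of its words happen to appear there.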

Resume Capability

If the scan is interrupted (Ctrl+C, connection loss, etc.):

Progress is saved to drive_scan_progress.json every 10 courses
Run the same command again to resume from where it stopped
Already scanned courses are loaded from the progress file

Progress file structure:

drive_scan_progress.json

{
  "1ABC...xyz": {  // Drive folder ID
    "modules": [...],
    "moduleCount": 5,
    "classCount": 47,
    "hasPresentation": true,
    "presentationId": "1XYZ..."
  },
  ...
}

To start fresh (rescan everything):

rm drive_scan_progress.json
python rebuild_cache_drive.py
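
The checkpoint-and-resume behavior described in this section can be sketched as below. This is not the script's code: the function names are hypothetical, `scan_course` stands in for whatever scans a single Drive folder, and the atomic write through a temporary file is a design choice of this example.

```python
import json
import os
import tempfile


def load_progress(path):
    """Load previously scanned courses, keyed by Drive folder ID."""
    if os.path.exists(path):
        with open(path, encoding="utf-8") as fh:
            return json.load(fh)
    return {}


def save_progress(path, progress):
    """Write to a temp file, then rename, so an interrupt cannot corrupt the file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as fh:
        json.dump(progress, fh, ensure_ascii=False)
    os.replace(tmp, path)


def scan_all(folder_ids, scan_course, path, every=10):
    """Scan folders not yet in the progress file, saving every `every` courses."""
    progress = load_progress(path)
    scanned = 0
    for folder_id in folder_ids:
        if folder_id in progress:
            continue  # already scanned in an earlier run
        progress[folder_id] = scan_course(folder_id)
        scanned += 1
        if scanned % every == 0:
            save_progress(path, progress)
    save_progress(path, progress)
    return progress, scanned
```

A second call with the same progress file skips every folder ID that is already recorded, which is what makes rerunning the command after an interruption cheap.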

Rate Limiting

The script includes automatic throttling to stay within Google Drive API quotas:

rebuild_cache_drive.py:26

def api_call_throttle():
    """Simple rate limiter to avoid hitting Drive API quotas."""
    global API_CALL_COUNT, API_CALL_START
    API_CALL_COUNT += 1
    elapsed = time.time() - API_CALL_START
    # Google Drive API: 12,000 queries per minute for service accounts
    # Be conservative: max ~100 calls per second
    if API_CALL_COUNT % 50 == 0 and elapsed < 1.0:
        wait = 1.0 - elapsed
        time.sleep(wait)
        API_CALL_START = time.time()
        API_CALL_COUNT = 0

Limits:

Google Drive API: 12,000 queries/minute for service accounts
Script throttles to: ~100 calls/second (conservative)
Total calls for 500 courses: ~5,000-10,000 (depends on folder depth)

If you encounter “User Rate Limit Exceeded” errors, the script will automatically retry with exponential backoff (up to 5 retries). If errors persist, wait a few minutes before running again.
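
The retry behavior mentioned above is not shown in the excerpt, so here is a generic sketch of exponential backoff. `RateLimitError` is a hypothetical stand-in for whatever exception the Drive client raises, and the base delay and jitter values are assumptions.

```python
import random
import time


class RateLimitError(Exception):
    """Hypothetical stand-in for the Drive API's rate-limit error."""


def call_with_backoff(func, max_retries=5, base_delay=1.0, sleep=time.sleep):
    """Call func(); on a rate-limit error wait 1s, 2s, 4s, ... and retry."""
    for attempt in range(max_retries + 1):
        try:
            return func()
        except RateLimitError:
            if attempt == max_retries:
                raise  # retries exhausted, surface the error
            # Double the delay each attempt and add jitter to avoid lockstep retries.
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.5)
            sleep(delay)
```

The `sleep` parameter is injectable so the function can be exercised without actually waiting, for example by passing a list's `append` method to record the delays.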

Updating the Cache

When you add new courses to Drive or reorganize folders:

# Full rebuild (recommended for major changes)
rm drive_scan_progress.json
python rebuild_cache_drive.py

# Resume scan (if you only added new courses)
python rebuild_cache_drive.py

The server can reload the cache without restarting:

# Rebuild cache
python rebuild_cache_drive.py

# Trigger server reload (from localhost only)
curl http://localhost:8080/api/refresh

The /api/refresh endpoint is restricted to loopback addresses (localhost, 127.0.0.1) for security. Remote clients cannot reload the cache.
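
A loopback restriction of this kind can be expressed with the standard `ipaddress` module. The sketch below is illustrative only; the function name is hypothetical and the server's actual check may differ.

```python
import ipaddress


def is_loopback(client_host):
    """Return True for 'localhost' or any loopback IP such as 127.0.0.1 or ::1."""
    if client_host == "localhost":
        return True
    try:
        return ipaddress.ip_address(client_host).is_loopback
    except ValueError:
        return False  # not an IP address at all


for host in ("127.0.0.1", "::1", "192.168.1.10"):
    print(host, is_loopback(host))
```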

Troubleshooting

“Drive service not available”

Problem: Cannot connect to Google Drive API

Solution:

# Check service account configuration
ls service_account.json

# Test Drive access
python -c "from drive_service import drive_service; print('Drive OK')"

# Verify folder is shared with service account
# (check service_account.json for client_email)

See Google Drive Setup for more details.

“No courses matched to Drive”

Problem: All courses show foundInDrive: false

Solution:

Verify DRIVE_ROOT_ID points to the correct folder
Check folder is shared with service account email
Ensure course folders exist in Drive (not empty)
Review folder naming - must loosely match names in PlatziRoutes.md

“Scan is very slow”

Problem: Taking longer than 30 minutes

Causes:

Many subfolders/files per course
API rate limiting kicking in
Network latency

Solution:

Let it run - progress is saved every 10 courses
Use wired connection instead of WiFi for stability
Avoid running during peak hours

“Invalid Drive file IDs detected”

Problem: Cache validation shows local refs or invalid IDs

Solution: This should not happen with rebuild_cache_drive.py. If you see this:

# Check cache integrity
python server.py
# Look at http://localhost:8080/api/health for cache.driveOnlyCheck

# Rebuild cache from scratch
rm courses_cache.json drive_scan_progress.json
python rebuild_cache_drive.py

“Out of memory during scan”

Problem: Python crashes with memory errors

Solution: The script loads everything in memory. For very large libraries (1000+ courses):

# Raise the shell's virtual memory cap, if one is set (Linux/Mac; value in KB)
ulimit -v 8388608  # 8 GB

# Or process in chunks (requires code modification)
# Split PlatziRoutes.md into smaller files

Cache File Locations

The application checks for cache files in this order:

$PLATZI_DATA_PATH/courses_cache.json (if PLATZI_DATA_PATH is set)
$PLATZI_VIEWER_PATH/courses_cache.json (if PLATZI_VIEWER_PATH is set)
./courses_cache.json (current directory)
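
The lookup order above can be sketched as follows. The function names are hypothetical and the application's real resolution logic may differ in detail; only the order of the three locations comes from the list above.

```python
import os
from pathlib import Path


def candidate_cache_paths(env=None, filename="courses_cache.json"):
    """Return the cache locations in the order they are checked."""
    env = os.environ if env is None else env
    paths = []
    for var in ("PLATZI_DATA_PATH", "PLATZI_VIEWER_PATH"):
        value = env.get(var)
        if value:  # unset or empty variables are skipped
            paths.append(Path(value) / filename)
    paths.append(Path.cwd() / filename)
    return paths


def find_cache(env=None):
    """Return the first candidate that exists on disk, or None."""
    for path in candidate_cache_paths(env):
        if path.is_file():
            return path
    return None


for path in candidate_cache_paths():
    print(path)
```

Printing the candidates is a quick way to confirm which location will be used when the environment variables are set in a deployment.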

Generated files:

File	Purpose	Safe to Delete?
`courses_cache.json`	Main cache - required for app to run	❌ No (must rebuild)
`drive_scan_progress.json`	Resume checkpoint	✅ Yes (will rescan)
`PlatziRoutes.md`	Course definitions	❌ No (required for rebuild)

Next Steps

Start the Server

With the cache built, you’re ready to launch Platzi Viewer. Go to Quickstart →

Explore the Application

Learn how to navigate, watch videos, and track your progress. View User Guide →
