
Overview

Platzi Viewer streams all course content from Google Drive using a service account. No files are stored locally—only metadata (file IDs, course structure) is cached in courses_cache.json.

Authentication

Service Account Setup

The application uses OAuth2 Service Account credentials for Drive API access:
1. Create Service Account: In Google Cloud Console, create a service account with Drive API access.
2. Download Credentials: Download the JSON key file as service_account.json.
3. Share Drive Folder: Share your Drive folder with the service account email (found in the JSON key).
4. Configure Application: Place service_account.json in the app directory, or point to it via an environment variable.
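
To find the address for step 3, you can read it straight from the key file. A minimal sketch, assuming the standard service-account key layout (the address lives under client_email):

```python
import json

def service_account_email(key_path="service_account.json"):
    """Return the email address the Drive folder must be shared with."""
    with open(key_path, "r", encoding="utf-8") as f:
        return json.load(f)["client_email"]
```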

Credential Loading

From drive_service.py:55-90:
def authenticate(self):
    service_account_json = os.environ.get("GOOGLE_SERVICE_ACCOUNT_JSON")
    if service_account_json:
        try:
            service_info = json.loads(service_account_json)
        except json.JSONDecodeError as e:
            raise Exception("GOOGLE_SERVICE_ACCOUNT_JSON is not valid JSON") from e

        self.creds = Credentials.from_service_account_info(service_info, scopes=SCOPES)
        self.service_account_source = "env:GOOGLE_SERVICE_ACCOUNT_JSON"
    else:
        selected_path = None
        for candidate_path in _candidate_service_account_paths():
            if os.path.exists(candidate_path):
                selected_path = candidate_path
                break

        if not selected_path:
            searched = _candidate_service_account_paths()
            searched_text = ", ".join(searched)
            raise Exception(
                "Service account file not found. "
                "Set GOOGLE_SERVICE_ACCOUNT_FILE or GOOGLE_SERVICE_ACCOUNT_JSON. "
                f"Searched: {searched_text}"
            )

        self.creds = Credentials.from_service_account_file(selected_path, scopes=SCOPES)
        self.service_account_source = selected_path

    print(f"[INFO] Drive credentials loaded from: {self.service_account_source}")

Credential Search Order

From drive_service.py:17-43:
def _candidate_service_account_paths():
    candidates = []
    env_path = os.environ.get("GOOGLE_SERVICE_ACCOUNT_FILE")
    if env_path:
        candidates.append(env_path)

    # PyInstaller onefile/onedir location
    if getattr(sys, "frozen", False):
        exe_dir = os.path.dirname(os.path.abspath(sys.executable))
        candidates.append(os.path.join(exe_dir, "service_account.json"))

    # Local working directory and repository directory
    candidates.append(os.path.join(os.getcwd(), "service_account.json"))
    candidates.append(os.path.join(os.path.dirname(__file__), "service_account.json"))

    # Deduplicate
    unique = []
    seen = set()
    for path in candidates:
        normalized = os.path.abspath(path)
        if normalized not in seen:
            seen.add(normalized)
            unique.append(normalized)

    return unique
You can provide credentials via:
  1. Environment variable: GOOGLE_SERVICE_ACCOUNT_JSON (inline JSON)
  2. Environment variable: GOOGLE_SERVICE_ACCOUNT_FILE (file path)
  3. File in working directory: ./service_account.json
  4. File in code directory: <repo>/service_account.json
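
For illustration, the lookup order can be sketched as a standalone function. This is not the app's code: env and exists are injectable here purely so the precedence is easy to demonstrate, and the PyInstaller executable-directory case is folded into extra_dirs.

```python
import json
import os

def resolve_credentials(env=os.environ, exists=os.path.exists, extra_dirs=()):
    """Return ("inline", parsed_json) or ("file", path) per the documented order."""
    # 1. Inline JSON wins outright
    inline = env.get("GOOGLE_SERVICE_ACCOUNT_JSON")
    if inline:
        return ("inline", json.loads(inline))
    # 2. Explicit file path, then well-known directories
    candidates = []
    if env.get("GOOGLE_SERVICE_ACCOUNT_FILE"):
        candidates.append(env["GOOGLE_SERVICE_ACCOUNT_FILE"])
    for d in (*extra_dirs, os.getcwd()):
        candidates.append(os.path.join(d, "service_account.json"))
    for path in candidates:
        if exists(path):
            return ("file", path)
    raise FileNotFoundError(f"Service account file not found; searched: {candidates}")
```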

Required Scopes

From drive_service.py:13:
SCOPES = ["https://www.googleapis.com/auth/drive.readonly"]
The application only requires read-only Drive access.

DriveService Class

Initialization

From drive_service.py:46-53:
class DriveService:
    def __init__(self):
        self.creds = None
        self.service_account_source = None
        self._thread_local = threading.local()
        self._shared_session = None
        self._shared_session_lock = threading.Lock()
        self.authenticate()

Thread-Local Service Instances

From drive_service.py:97-100:
def get_service(self):
    if not hasattr(self._thread_local, "service"):
        self._thread_local.service = build("drive", "v3", credentials=self.creds, cache_discovery=False)
    return self._thread_local.service
Each thread gets its own drive.v3 service instance. This is necessary because the Google API client library is not thread-safe for service objects.
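
The pattern can be reduced to a self-contained sketch. build_client below stands in for googleapiclient.discovery.build; any zero-argument constructor works:

```python
import threading

class PerThreadClient:
    """Each thread lazily builds and caches its own client object."""
    def __init__(self, build_client):
        self._build = build_client
        self._local = threading.local()  # attributes are per-thread

    def get(self):
        if not hasattr(self._local, "client"):
            self._local.client = self._build()
        return self._local.client
```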

Shared Authenticated Session

From drive_service.py:102-110:
def _get_session(self):
    # Shared session avoids cold-start latency on each per-request thread.
    if self._shared_session is None:
        with self._shared_session_lock:
            if self._shared_session is None:
                session = AuthorizedSession(self.creds)
                session.headers.update({"Accept-Encoding": "identity"})
                self._shared_session = session
    return self._shared_session
Accept-Encoding: identity is set to disable Drive API compression. We want raw bytes for video streaming—compression would break Range requests.
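
The initialization pattern in _get_session() is classic double-checked locking: the lock is taken only on the cold path, and the inner re-check stops two racing threads from both constructing the shared object. A generic sketch (LazyShared is illustrative, not part of the codebase):

```python
import threading

class LazyShared:
    """Build a shared object exactly once, even under concurrent access."""
    def __init__(self, factory):
        self._factory = factory
        self._obj = None
        self._lock = threading.Lock()

    def get(self):
        if self._obj is None:          # fast path: no lock once initialized
            with self._lock:
                if self._obj is None:  # re-check under the lock
                    self._obj = self._factory()
        return self._obj
```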

File Streaming

Range Request Support

From drive_service.py:212-250:
def download_file_range(self, file_id, start=None, end=None, range_header=None):
    """Descarga un rango de bytes de un archivo."""
    file_id = self._validate_drive_id(file_id, "file_id")
    
    session = self._get_session()
    url = f"https://www.googleapis.com/drive/v3/files/{file_id}"
    headers = {}
    params = {
        "alt": "media",
        "supportsAllDrives": "true",
    }
    if range_header:
        headers["Range"] = str(range_header).strip()
    elif start is not None or end is not None:
        generated_range = f"bytes={start if start is not None else ''}-{end if end is not None else ''}"
        headers["Range"] = generated_range

    response = session.get(url, headers=headers, params=params, stream=True, timeout=(5, 60))
    response.raise_for_status()
    return response
The method accepts:
  • range_header: Raw Range header from client (e.g., bytes=0-1048575)
  • start/end: Explicit byte offsets
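
The header construction can be isolated into a small helper for clarity (make_range_header is illustrative, not an actual method of DriveService). An open start such as bytes=-500 means "the last 500 bytes" per RFC 7233:

```python
def make_range_header(start=None, end=None, range_header=None):
    """Build a Range header: a client-supplied header wins, else start/end."""
    if range_header:
        return str(range_header).strip()
    if start is None and end is None:
        return None  # no Range header: fetch the whole file
    return f"bytes={start if start is not None else ''}-{end if end is not None else ''}"
```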

Streaming Workflow

Chunk Size Optimization

From server.py:1185:
for chunk in resp.iter_content(chunk_size=1024 * 1024):  # 1MB chunks
    if chunk:
        self.wfile.write(chunk)
        total_bytes += len(chunk)
Streaming in 1 MB chunks rather than small buffers is a deliberate trade-off:
  • Reduced overhead: Fewer system calls and context switches
  • Better A/V sync: Video players get data faster, reducing buffer starvation
  • Higher throughput: Network stack can optimize larger writes
  • Memory: Each streaming request uses 1MB buffer (negligible for modern systems)
  • Latency: First byte arrives slightly later (not noticeable for video)

Cache Building

Drive Folder Structure

Expected Drive folder hierarchy:
Root Folder (DRIVE_ROOT_ID)
├── Course 1 Name/
│   ├── presentation.html (optional)
│   ├── 1. Module Name/
│   │   ├── 1. Class Name.mp4
│   │   ├── 1. Class Name_summary.html
│   │   ├── 1. Class Name.vtt
│   │   └── 1. Class Name - Extra.pdf
│   └── 2. Module Name/
│       └── ...
└── Course 2 Name/
    └── ...

Cache Rebuild Process

From rebuild_cache_drive.py:270-423:
def main():
    print("📦 Rebuilding courses_cache.json from Google Drive")

    # 1. Parse PlatziRoutes.md
    parsed = parse_routes.parse()
    categories = parsed["categories"]

    # 2. List all course folders from Drive root
    root_items = list_drive_folder(DRIVE_ROOT_ID)
    drive_folders = [f for f in root_items if f["mimeType"] == "application/vnd.google-apps.folder"]

    # Build Drive lookup: sanitized name -> {name, id}
    drive_map = {}
    for df in drive_folders:
        san = sanitize_for_match(df["name"])
        drive_map[san] = {"name": df["name"], "id": df["id"]}
    drive_san = set(drive_map)  # sanitized names, used by match_course_to_drive below

    # 3. Load previous scan progress (for resume)
    scanned = load_scan_progress()

    # 4. Match courses and scan Drive content
    for cat in categories:
        for route in cat["routes"]:
            for course in route.get("courses", []):
                md_name = course["name"]
                drive_folder_name, drive_folder_id, match_type = match_course_to_drive(md_name, drive_san, drive_map)

                if drive_folder_name:
                    # Check if already scanned
                    if drive_folder_id in scanned:
                        pass  # Use cached scan from a previous run
                    else:
                        # Scan Drive for modules/classes
                        modules, has_pres, pres_id = scan_drive_course(drive_folder_id, drive_folder_name)
                        # Save progress
                        scanned[drive_folder_id] = {...}

    # 5. Save courses_cache.json
    with open(OUTPUT_FILE, "w", encoding="utf-8") as f:
        json.dump(result, f, ensure_ascii=False, indent=2)

Matching Logic

From rebuild_cache_drive.py:213-250:
def match_course_to_drive(md_name, drive_names_san, drive_names_map):
    """Try to match an MD course name to a Drive folder."""
    san = sanitize_for_match(md_name)

    # 1. Exact match
    if san in drive_names_san:
        info = drive_names_map[san]
        return info["name"], info["id"], "exact"

    # 2. MD name starts with Drive name
    for ds, info in drive_names_map.items():
        if san.startswith(ds) and len(ds) > 20:
            return info["name"], info["id"], "prefix"

    # 3. Drive name starts with MD name
    for ds, info in drive_names_map.items():
        if ds.startswith(san) and len(san) > 20:
            return info["name"], info["id"], "prefix"

    # 4. High word overlap
    san_words = set(san.split())
    best_match = None
    best_overlap = 0
    for ds, info in drive_names_map.items():
        ds_words = set(ds.split())
        overlap = len(san_words & ds_words)
        min_len = min(len(san_words), len(ds_words))
        if min_len > 0 and overlap / min_len >= 0.8 and overlap > best_overlap:
            best_overlap = overlap
            best_match = (info["name"], info["id"], "fuzzy")

    if best_match and best_overlap >= 4:
        return best_match

    return None, None, None
The matcher falls back through progressively looser rules to handle name variations:
  • Exact match on sanitized names
  • Prefix match when one sanitized name starts with the other (the contained name must be longer than 20 characters)
  • Fuzzy match when at least 80% of the shorter name’s words appear in the other name, with at least 4 words in common
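
To see the stage-4 arithmetic concretely, here is a simplified sketch. sanitize is a hypothetical stand-in for sanitize_for_match (the real sanitizer may also strip accents and punctuation):

```python
def sanitize(name):
    """Lowercase and collapse whitespace (stand-in for sanitize_for_match)."""
    return " ".join(name.lower().split())

def fuzzy_overlap(a, b):
    """Return (shared_word_count, ratio of overlap to the shorter name)."""
    wa, wb = set(sanitize(a).split()), set(sanitize(b).split())
    overlap = len(wa & wb)
    min_len = min(len(wa), len(wb))
    return overlap, (overlap / min_len if min_len else 0.0)
```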

Class Scanning

From rebuild_cache_drive.py:68-170:
def scan_drive_classes(folder_id):
    """Scan class files inside a Drive module folder."""
    classes = []
    files = list_drive_folder(folder_id)

    # Group files by class number
    class_files = {}
    for f in files:
        name = f["name"]
        parts = name.split(". ", 1)
        if len(parts) >= 2 and parts[0].isdigit():
            num = int(parts[0])
            if num not in class_files:
                class_files[num] = []
            class_files[num].append(f)

    for num in sorted(class_files.keys()):
        flist = class_files[num]
        video = summary = vtt = reading = html = None
        video_id = summary_id = vtt_id = reading_id = html_id = None
        resources = []

        for f in flist:
            fname = f["name"]
            fid = f["id"]

            if fname.endswith(".mp4"):
                video = fname
                video_id = fid
            elif fname.endswith("_summary.html"):
                summary = fname
                summary_id = fid
            elif fname.endswith(".vtt"):
                vtt = fname
                vtt_id = fid
            # ... other file types

        classes.append({
            "num": num,
            "name": flist[0]["name"][:60],  # class title taken from the first grouped file
            "hasVideo": video is not None,
            "hasSummary": summary is not None,
            "files": {
                "video": video_id,
                "summary": summary_id,
                "subtitles": vtt_id,
                # ...
            },
            "resources": resources,
        })

    return classes

Rate Limiting

From rebuild_cache_drive.py:26-37:
def api_call_throttle():
    """Simple rate limiter to avoid hitting Drive API quotas."""
    global API_CALL_COUNT, API_CALL_START
    API_CALL_COUNT += 1
    elapsed = time.time() - API_CALL_START
    # Google Drive API: 12,000 queries per minute for service accounts
    # Be conservative: max ~100 calls per second
    if API_CALL_COUNT % 50 == 0 and elapsed < 1.0:
        wait = 1.0 - elapsed
        time.sleep(wait)
        API_CALL_START = time.time()
        API_CALL_COUNT = 0
Google Drive’s API quota is 12,000 queries per minute for service accounts. The throttle pauses whenever a batch of 50 calls completes in under one second, keeping the rebuild comfortably below that limit.
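
A sliding-window variant enforces the same cap more uniformly than the counter reset above. This is a sketch, not the project's code; the injectable clock and sleep exist only to make it testable without real waiting:

```python
import time
from collections import deque

class SlidingWindowThrottle:
    """Allow at most max_calls per period seconds, sleeping when over the cap."""
    def __init__(self, max_calls, period=1.0, clock=time.monotonic, sleep=time.sleep):
        self.max_calls, self.period = max_calls, period
        self.clock, self.sleep = clock, sleep
        self.calls = deque()  # timestamps of recent calls

    def wait(self):
        now = self.clock()
        while self.calls and now - self.calls[0] >= self.period:
            self.calls.popleft()                      # drop expired timestamps
        if len(self.calls) >= self.max_calls:
            self.sleep(self.period - (now - self.calls[0]))
            now = self.clock()
            while self.calls and now - self.calls[0] >= self.period:
                self.calls.popleft()
        self.calls.append(now)
```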

Resume Support

From rebuild_cache_drive.py:256-268:
def load_scan_progress():
    """Load previously scanned course data to allow resuming."""
    if os.path.exists(PROGRESS_FILE):
        with open(PROGRESS_FILE, "r", encoding="utf-8") as f:
            return json.load(f)
    return {}

def save_scan_progress(progress):
    """Save scan progress for resume capability."""
    with open(PROGRESS_FILE, "w", encoding="utf-8") as f:
        json.dump(progress, f, ensure_ascii=False)
Progress is saved every 10 courses:
if courses_scanned_this_run % 10 == 0:
    save_scan_progress(scanned)
    print(f"💾 Progress saved ({courses_scanned_this_run} scanned this run)")
If rebuild_cache_drive.py is interrupted, re-running it will resume from the last checkpoint, skipping already-scanned courses.
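
The checkpointing pattern generalizes. A minimal sketch of a resumable loop in the same spirit (the file name and process function here are illustrative):

```python
import json
import os

def process_with_resume(items, process, progress_file="scan_progress.json", every=10):
    """Process items, skipping any recorded in progress_file; flush every N items."""
    done = {}
    if os.path.exists(progress_file):
        with open(progress_file, "r", encoding="utf-8") as f:
            done = json.load(f)
    new = 0
    for key in items:
        if key in done:
            continue                    # already scanned in a previous run
        done[key] = process(key)
        new += 1
        if new % every == 0:            # periodic checkpoint
            with open(progress_file, "w", encoding="utf-8") as f:
                json.dump(done, f)
    with open(progress_file, "w", encoding="utf-8") as f:
        json.dump(done, f)              # final flush
    return done
```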

File Listing with Retries

From drive_service.py:126-178:
def list_files(self, folder_id):
    """Lista archivos y carpetas en un directorio con reintentos."""
    files_list = []
    page_token = None
    query = f"'{folder_id}' in parents and trashed = false"

    service = self.get_service()
    page_num = 1
    
    while True:
        retry_count = 0
        max_retries = 5
        success = False

        while not success and retry_count < max_retries:
            try:
                results = (
                    service.files()
                    .list(
                        q=query,
                        fields="nextPageToken, files(id, name, mimeType, size)",
                        pageSize=500,
                        pageToken=page_token,
                        orderBy="name",
                    )
                    .execute()
                )
                success = True
            except Exception as e:
                retry_count += 1
                wait_time = 2**retry_count  # Exponential backoff
                print(f"[WARN] Drive API Error listing files (try {retry_count}/{max_retries}): {e}")
                time.sleep(wait_time)

        if not success:
            print(f"[ERROR] Falló listado de carpeta {folder_id} tras {max_retries} intentos.")
            break

        files = results.get("files", [])
        files_list.extend(files)

        page_token = results.get("nextPageToken")
        if not page_token:
            break
        page_num += 1

    return files_list
Retries use exponential backoff: 2s, 4s, 8s, 16s, 32s. This handles transient Drive API errors (rate limits, network hiccups) gracefully.
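
The same retry loop can be extracted into a reusable helper following the 2**attempt schedule above. This is a sketch, not part of the codebase; sleep is injectable for testing:

```python
import time

def call_with_backoff(fn, max_retries=5, sleep=time.sleep):
    """Call fn(); on failure retry with exponentially growing delays (2s, 4s, ...)."""
    for attempt in range(1, max_retries + 1):
        try:
            return fn()
        except Exception as e:
            if attempt == max_retries:
                raise                      # out of retries: propagate the error
            wait = 2 ** attempt
            print(f"[WARN] attempt {attempt}/{max_retries} failed: {e}; retrying in {wait}s")
            sleep(wait)
```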

Drive ID Validation

From drive_service.py:14 and drive_service.py:92-95:
DRIVE_ID_RE = re.compile(r"^[A-Za-z0-9_-]{10,}$")

def _validate_drive_id(self, value, field_name="id"):
    if not isinstance(value, str) or not DRIVE_ID_RE.match(value.strip()):
        raise ValueError(f"Invalid Google Drive {field_name}")
    return value.strip()
All Drive file IDs are validated before API calls to prevent injection attacks.
The server rejects local: prefixed file references (from old local-file mode). Only valid Drive IDs (10+ alphanumeric/-/_ characters) are accepted.
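
Usage of the validator, shown as a standalone sketch mirroring the method above (the real version lives on DriveService as _validate_drive_id):

```python
import re

DRIVE_ID_RE = re.compile(r"^[A-Za-z0-9_-]{10,}$")

def validate_drive_id(value, field_name="id"):
    """Accept only plausible Drive IDs; reject local: refs, paths, non-strings."""
    if not isinstance(value, str) or not DRIVE_ID_RE.match(value.strip()):
        raise ValueError(f"Invalid Google Drive {field_name}")
    return value.strip()
```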

Performance Metrics

Streaming Metrics

From server.py:1193-1199:
if duration > 0.5:
    speed = (total_bytes / 1024 / 1024) / duration
    print(f"[STREAM] {file_id} | Range: {range_header or 'Full'} | "
          f"{total_bytes/1024/1024:.2f} MB in {duration:.2f}s ({speed:.2f} MB/s)")
Example output:
[STREAM] 1a2b3c4d5e | Range: bytes=0-1048575 | 1.00 MB in 0.87s (1.15 MB/s)

Health Endpoint

From server.py:817-835:
if self.path == "/api/health":
    ds = get_drive_service()
    ffmpeg_executable = _get_ffmpeg_executable()
    with compat_stream_lock:
        compat_snapshot = dict(compat_stream_stats)
    payload = {
        "status": "ok",
        "drive": {
            "available": bool(ds),
            "error": None if ds else get_drive_service_error(),
        },
        "ffmpeg": {
            "available": bool(ffmpeg_executable),
            "path": ffmpeg_executable,
        },
        "compatStream": compat_snapshot,
    }
    self._send_json(200, payload)
Check /api/health to verify:
  • Drive service authentication status
  • FFmpeg availability
  • Compatibility stream statistics
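
A small client-side check of that payload shape can look like this (health_problems is illustrative; fetching the endpoint, e.g. with urllib.request, is left out so the check stays offline-testable):

```python
def health_problems(payload):
    """Return a list of human-readable problems found in a /api/health payload."""
    problems = []
    if payload.get("status") != "ok":
        problems.append("server not ok")
    drive = payload.get("drive", {})
    if not drive.get("available"):
        problems.append(f"drive unavailable: {drive.get('error')}")
    if not payload.get("ffmpeg", {}).get("available"):
        problems.append("ffmpeg missing (compatibility mode disabled)")
    return problems
```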

Common Issues

Cause: The service account doesn’t have access to the Drive folder.
Fix: Share the folder with the service account email (found in service_account.json under client_email).

Cause: Invalid or expired credentials.
Fix: Download a fresh service_account.json from Google Cloud Console.

Cause: Incorrect DRIVE_ROOT_ID or file ID.
Fix: Verify the folder ID in the Drive URL: drive.google.com/drive/folders/{ID}

Cause: Drive API rate limiting or network latency.
Fix:
  • Check /api/health for metrics
  • Use FFmpeg compatibility mode for problematic files
  • Note the streaming chunk size is already tuned to 1 MB, so no change is needed there

Next Steps

Architecture Overview

Return to system architecture overview

Configuration

Configure Drive credentials and environment variables
