
Overview

Platzi Viewer streams all course content from Google Drive using a service account. No files are stored locally—only metadata (file IDs, course structure) is cached in courses_cache.json.

Authentication

Service Account Setup

The application uses OAuth2 Service Account credentials for Drive API access:
1. Create Service Account: In Google Cloud Console, create a service account with Drive API access.
2. Download Credentials: Download the JSON key file as service_account.json.
3. Share Drive Folder: Share your Drive folder with the service account email (found in the JSON key).
4. Configure Application: Place service_account.json in the app directory, or point to it via an environment variable.
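
To find the address for step 3, you can read it straight from the key file. A minimal sketch, assuming the standard service-account key layout (the address lives under client_email):

```python
import json

def service_account_email(key_path="service_account.json"):
    """Return the email address the Drive folder must be shared with."""
    with open(key_path, "r", encoding="utf-8") as f:
        return json.load(f)["client_email"]
```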

Credential Loading

From drive_service.py:55-90:
def authenticate(self):
    service_account_json = os.environ.get("GOOGLE_SERVICE_ACCOUNT_JSON")
    if service_account_json:
        try:
            service_info = json.loads(service_account_json)
        except json.JSONDecodeError as e:
            raise Exception("GOOGLE_SERVICE_ACCOUNT_JSON is not valid JSON") from e

        self.creds = Credentials.from_service_account_info(service_info, scopes=SCOPES)
        self.service_account_source = "env:GOOGLE_SERVICE_ACCOUNT_JSON"
    else:
        selected_path = None
        for candidate_path in _candidate_service_account_paths():
            if os.path.exists(candidate_path):
                selected_path = candidate_path
                break

        if not selected_path:
            searched = _candidate_service_account_paths()
            searched_text = ", ".join(searched)
            raise Exception(
                "Service account file not found. "
                "Set GOOGLE_SERVICE_ACCOUNT_FILE or GOOGLE_SERVICE_ACCOUNT_JSON. "
                f"Searched: {searched_text}"
            )

        self.creds = Credentials.from_service_account_file(selected_path, scopes=SCOPES)
        self.service_account_source = selected_path

    print(f"[INFO] Drive credentials loaded from: {self.service_account_source}")

Credential Search Order

From drive_service.py:17-43:
def _candidate_service_account_paths():
    candidates = []
    env_path = os.environ.get("GOOGLE_SERVICE_ACCOUNT_FILE")
    if env_path:
        candidates.append(env_path)

    # PyInstaller onefile/onedir location
    if getattr(sys, "frozen", False):
        exe_dir = os.path.dirname(os.path.abspath(sys.executable))
        candidates.append(os.path.join(exe_dir, "service_account.json"))

    # Local working directory and repository directory
    candidates.append(os.path.join(os.getcwd(), "service_account.json"))
    candidates.append(os.path.join(os.path.dirname(__file__), "service_account.json"))

    # Deduplicate
    unique = []
    seen = set()
    for path in candidates:
        normalized = os.path.abspath(path)
        if normalized not in seen:
            seen.add(normalized)
            unique.append(normalized)

    return unique
You can provide credentials via:
  1. Environment variable: GOOGLE_SERVICE_ACCOUNT_JSON (inline JSON)
  2. Environment variable: GOOGLE_SERVICE_ACCOUNT_FILE (file path)
  3. File in working directory: ./service_account.json
  4. File in code directory: <repo>/service_account.json
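
For illustration, the lookup order can be sketched as a standalone function. This is not the app's code: env and exists are injectable here purely so the precedence is easy to demonstrate, and the PyInstaller executable-directory case is folded into extra_dirs.

```python
import json
import os

def resolve_credentials(env=os.environ, exists=os.path.exists, extra_dirs=()):
    """Return ("inline", parsed_json) or ("file", path) per the documented order."""
    # 1. Inline JSON wins outright
    inline = env.get("GOOGLE_SERVICE_ACCOUNT_JSON")
    if inline:
        return ("inline", json.loads(inline))
    # 2. Explicit file path, then well-known directories
    candidates = []
    if env.get("GOOGLE_SERVICE_ACCOUNT_FILE"):
        candidates.append(env["GOOGLE_SERVICE_ACCOUNT_FILE"])
    for d in (*extra_dirs, os.getcwd()):
        candidates.append(os.path.join(d, "service_account.json"))
    for path in candidates:
        if exists(path):
            return ("file", path)
    raise FileNotFoundError(f"Service account file not found; searched: {candidates}")
```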

Required Scopes

From drive_service.py:13:
SCOPES = ["https://www.googleapis.com/auth/drive.readonly"]
The application only requires read-only Drive access.

DriveService Class

Initialization

From drive_service.py:46-53:
class DriveService:
    def __init__(self):
        self.creds = None
        self.service_account_source = None
        self._thread_local = threading.local()
        self._shared_session = None
        self._shared_session_lock = threading.Lock()
        self.authenticate()

Thread-Local Service Instances

From drive_service.py:97-100:
def get_service(self):
    if not hasattr(self._thread_local, "service"):
        self._thread_local.service = build("drive", "v3", credentials=self.creds, cache_discovery=False)
    return self._thread_local.service
Each thread gets its own drive.v3 service instance. This is necessary because the Google API client library is not thread-safe for service objects.
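
The pattern can be reduced to a self-contained sketch. build_client below stands in for googleapiclient.discovery.build; any zero-argument constructor works:

```python
import threading

class PerThreadClient:
    """Each thread lazily builds and caches its own client object."""
    def __init__(self, build_client):
        self._build = build_client
        self._local = threading.local()  # attributes are per-thread

    def get(self):
        if not hasattr(self._local, "client"):
            self._local.client = self._build()
        return self._local.client
```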

Shared Authenticated Session

From drive_service.py:102-110:
def _get_session(self):
    # Shared session avoids cold-start latency on each per-request thread.
    if self._shared_session is None:
        with self._shared_session_lock:
            if self._shared_session is None:
                session = AuthorizedSession(self.creds)
                session.headers.update({"Accept-Encoding": "identity"})
                self._shared_session = session
    return self._shared_session
Accept-Encoding: identity is set to disable Drive API compression. We want raw bytes for video streaming—compression would break Range requests.
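
The initialization pattern in _get_session() is classic double-checked locking: the lock is taken only on the cold path, and the inner re-check stops two racing threads from both constructing the shared object. A generic sketch (LazyShared is illustrative, not part of the codebase):

```python
import threading

class LazyShared:
    """Build a shared object exactly once, even under concurrent access."""
    def __init__(self, factory):
        self._factory = factory
        self._obj = None
        self._lock = threading.Lock()

    def get(self):
        if self._obj is None:          # fast path: no lock once initialized
            with self._lock:
                if self._obj is None:  # re-check under the lock
                    self._obj = self._factory()
        return self._obj
```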

File Streaming

Range Request Support

From drive_service.py:212-250:
def download_file_range(self, file_id, start=None, end=None, range_header=None):
    """Descarga un rango de bytes de un archivo."""
    file_id = self._validate_drive_id(file_id, "file_id")
    
    session = self._get_session()
    url = f"https://www.googleapis.com/drive/v3/files/{file_id}"
    headers = {}
    params = {
        "alt": "media",
        "supportsAllDrives": "true",
    }
    if range_header:
        headers["Range"] = str(range_header).strip()
    elif start is not None or end is not None:
        generated_range = f"bytes={start if start is not None else ''}-{end if end is not None else ''}"
        headers["Range"] = generated_range

    response = session.get(url, headers=headers, params=params, stream=True, timeout=(5, 60))
    response.raise_for_status()
    return response
The method accepts:
  • range_header: Raw Range header from client (e.g., bytes=0-1048575)
  • start/end: Explicit byte offsets
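
The header construction can be isolated into a small helper for clarity (make_range_header is illustrative, not an actual method of DriveService). An open start such as bytes=-500 means "the last 500 bytes" per RFC 7233:

```python
def make_range_header(start=None, end=None, range_header=None):
    """Build a Range header: a client-supplied header wins, else start/end."""
    if range_header:
        return str(range_header).strip()
    if start is None and end is None:
        return None  # no Range header: fetch the whole file
    return f"bytes={start if start is not None else ''}-{end if end is not None else ''}"
```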

Streaming Workflow

Chunk Size Optimization

From server.py:1185:
for chunk in resp.iter_content(chunk_size=1024 * 1024):  # 1MB chunks
    if chunk:
        self.wfile.write(chunk)
        total_bytes += len(chunk)
Streaming in 1 MB chunks rather than small buffers is a deliberate trade-off:
  • Reduced overhead: Fewer system calls and context switches
  • Better A/V sync: Video players get data faster, reducing buffer starvation
  • Higher throughput: Network stack can optimize larger writes
  • Memory: Each streaming request uses 1MB buffer (negligible for modern systems)
  • Latency: First byte arrives slightly later (not noticeable for video)

Cache Building

Drive Folder Structure

Expected Drive folder hierarchy:
Root Folder (DRIVE_ROOT_ID)
├── Course 1 Name/
│   ├── presentation.html (optional)
│   ├── 1. Module Name/
│   │   ├── 1. Class Name.mp4
│   │   ├── 1. Class Name_summary.html
│   │   ├── 1. Class Name.vtt
│   │   └── 1. Class Name - Extra.pdf
│   └── 2. Module Name/
│       └── ...
└── Course 2 Name/
    └── ...

Cache Rebuild Process

From rebuild_cache_drive.py:270-423:
def main():
    print("📦 Rebuilding courses_cache.json from Google Drive")

    # 1. Parse PlatziRoutes.md
    parsed = parse_routes.parse()
    categories = parsed["categories"]

    # 2. List all course folders from Drive root
    root_items = list_drive_folder(DRIVE_ROOT_ID)
    drive_folders = [f for f in root_items if f["mimeType"] == "application/vnd.google-apps.folder"]

    # Build Drive lookup: sanitized name -> {name, id}
    drive_map = {}
    for df in drive_folders:
        san = sanitize_for_match(df["name"])
        drive_map[san] = {"name": df["name"], "id": df["id"]}
    drive_san = set(drive_map)  # sanitized names, used by match_course_to_drive below

    # 3. Load previous scan progress (for resume)
    scanned = load_scan_progress()

    # 4. Match courses and scan Drive content
    for cat in categories:
        for route in cat["routes"]:
            for course in route.get("courses", []):
                md_name = course["name"]
                drive_folder_name, drive_folder_id, match_type = match_course_to_drive(md_name, drive_san, drive_map)

                if drive_folder_name:
                    # Check if already scanned
                    if drive_folder_id in scanned:
                        pass  # Use cached scan from a previous run
                    else:
                        # Scan Drive for modules/classes
                        modules, has_pres, pres_id = scan_drive_course(drive_folder_id, drive_folder_name)
                        # Save progress
                        scanned[drive_folder_id] = {...}

    # 5. Save courses_cache.json
    with open(OUTPUT_FILE, "w", encoding="utf-8") as f:
        json.dump(result, f, ensure_ascii=False, indent=2)

Matching Logic

From rebuild_cache_drive.py:213-250:
def match_course_to_drive(md_name, drive_names_san, drive_names_map):
    """Try to match an MD course name to a Drive folder."""
    san = sanitize_for_match(md_name)

    # 1. Exact match
    if san in drive_names_san:
        info = drive_names_map[san]
        return info["name"], info["id"], "exact"

    # 2. MD name starts with Drive name
    for ds, info in drive_names_map.items():
        if san.startswith(ds) and len(ds) > 20:
            return info["name"], info["id"], "prefix"

    # 3. Drive name starts with MD name
    for ds, info in drive_names_map.items():
        if ds.startswith(san) and len(san) > 20:
            return info["name"], info["id"], "prefix"

    # 4. High word overlap
    san_words = set(san.split())
    best_match = None
    best_overlap = 0
    for ds, info in drive_names_map.items():
        ds_words = set(ds.split())
        overlap = len(san_words & ds_words)
        min_len = min(len(san_words), len(ds_words))
        if min_len > 0 and overlap / min_len >= 0.8 and overlap > best_overlap:
            best_overlap = overlap
            best_match = (info["name"], info["id"], "fuzzy")

    if best_match and best_overlap >= 4:
        return best_match

    return None, None, None
The matcher falls back through progressively looser rules to handle name variations:
  • Exact match on sanitized names
  • Prefix match when one sanitized name starts with the other (the contained name must be longer than 20 characters)
  • Fuzzy match when at least 80% of the shorter name’s words appear in the other name, with at least 4 words in common
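
To see the stage-4 arithmetic concretely, here is a simplified sketch. sanitize is a hypothetical stand-in for sanitize_for_match (the real sanitizer may also strip accents and punctuation):

```python
def sanitize(name):
    """Lowercase and collapse whitespace (stand-in for sanitize_for_match)."""
    return " ".join(name.lower().split())

def fuzzy_overlap(a, b):
    """Return (shared_word_count, ratio of overlap to the shorter name)."""
    wa, wb = set(sanitize(a).split()), set(sanitize(b).split())
    overlap = len(wa & wb)
    min_len = min(len(wa), len(wb))
    return overlap, (overlap / min_len if min_len else 0.0)
```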

Class Scanning

From rebuild_cache_drive.py:68-170:
def scan_drive_classes(folder_id):
    """Scan class files inside a Drive module folder."""
    classes = []
    files = list_drive_folder(folder_id)

    # Group files by class number
    class_files = {}
    for f in files:
        name = f["name"]
        parts = name.split(". ", 1)
        if len(parts) >= 2 and parts[0].isdigit():
            num = int(parts[0])
            if num not in class_files:
                class_files[num] = []
            class_files[num].append(f)

    for num in sorted(class_files.keys()):
        flist = class_files[num]
        video = summary = vtt = reading = html = None
        video_id = summary_id = vtt_id = reading_id = html_id = None
        resources = []

        for f in flist:
            fname = f["name"]
            fid = f["id"]

            if fname.endswith(".mp4"):
                video = fname
                video_id = fid
            elif fname.endswith("_summary.html"):
                summary = fname
                summary_id = fid
            elif fname.endswith(".vtt"):
                vtt = fname
                vtt_id = fid
            # ... other file types

        classes.append({
            "num": num,
            "name": flist[0]["name"][:60],  # class title taken from the first grouped file
            "hasVideo": video is not None,
            "hasSummary": summary is not None,
            "files": {
                "video": video_id,
                "summary": summary_id,
                "subtitles": vtt_id,
                # ...
            },
            "resources": resources,
        })

    return classes

Rate Limiting

From rebuild_cache_drive.py:26-37:
def api_call_throttle():
    """Simple rate limiter to avoid hitting Drive API quotas."""
    global API_CALL_COUNT, API_CALL_START
    API_CALL_COUNT += 1
    elapsed = time.time() - API_CALL_START
    # Google Drive API: 12,000 queries per minute for service accounts
    # Be conservative: max ~100 calls per second
    if API_CALL_COUNT % 50 == 0 and elapsed < 1.0:
        wait = 1.0 - elapsed
        time.sleep(wait)
        API_CALL_START = time.time()
        API_CALL_COUNT = 0
Google Drive’s API quota is 12,000 queries per minute for service accounts. The throttle pauses whenever a batch of 50 calls completes in under one second, keeping the rebuild comfortably below that limit.
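
A sliding-window variant enforces the same cap more uniformly than the counter reset above. This is a sketch, not the project's code; the injectable clock and sleep exist only to make it testable without real waiting:

```python
import time
from collections import deque

class SlidingWindowThrottle:
    """Allow at most max_calls per period seconds, sleeping when over the cap."""
    def __init__(self, max_calls, period=1.0, clock=time.monotonic, sleep=time.sleep):
        self.max_calls, self.period = max_calls, period
        self.clock, self.sleep = clock, sleep
        self.calls = deque()  # timestamps of recent calls

    def wait(self):
        now = self.clock()
        while self.calls and now - self.calls[0] >= self.period:
            self.calls.popleft()                      # drop expired timestamps
        if len(self.calls) >= self.max_calls:
            self.sleep(self.period - (now - self.calls[0]))
            now = self.clock()
            while self.calls and now - self.calls[0] >= self.period:
                self.calls.popleft()
        self.calls.append(now)
```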

Resume Support

From rebuild_cache_drive.py:256-268:
def load_scan_progress():
    """Load previously scanned course data to allow resuming."""
    if os.path.exists(PROGRESS_FILE):
        with open(PROGRESS_FILE, "r", encoding="utf-8") as f:
            return json.load(f)
    return {}

def save_scan_progress(progress):
    """Save scan progress for resume capability."""
    with open(PROGRESS_FILE, "w", encoding="utf-8") as f:
        json.dump(progress, f, ensure_ascii=False)
Progress is saved every 10 courses:
if courses_scanned_this_run % 10 == 0:
    save_scan_progress(scanned)
    print(f"💾 Progress saved ({courses_scanned_this_run} scanned this run)")
If rebuild_cache_drive.py is interrupted, re-running it will resume from the last checkpoint, skipping already-scanned courses.
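
The checkpointing pattern generalizes. A minimal sketch of a resumable loop in the same spirit (the file name and process function here are illustrative):

```python
import json
import os

def process_with_resume(items, process, progress_file="scan_progress.json", every=10):
    """Process items, skipping any recorded in progress_file; flush every N items."""
    done = {}
    if os.path.exists(progress_file):
        with open(progress_file, "r", encoding="utf-8") as f:
            done = json.load(f)
    new = 0
    for key in items:
        if key in done:
            continue                    # already scanned in a previous run
        done[key] = process(key)
        new += 1
        if new % every == 0:            # periodic checkpoint
            with open(progress_file, "w", encoding="utf-8") as f:
                json.dump(done, f)
    with open(progress_file, "w", encoding="utf-8") as f:
        json.dump(done, f)              # final flush
    return done
```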

File Listing with Retries

From drive_service.py:126-178:
def list_files(self, folder_id):
    """Lista archivos y carpetas en un directorio con reintentos."""
    files_list = []
    page_token = None
    query = f"'{folder_id}' in parents and trashed = false"

    service = self.get_service()
    page_num = 1
    
    while True:
        retry_count = 0
        max_retries = 5
        success = False

        while not success and retry_count < max_retries:
            try:
                results = (
                    service.files()
                    .list(
                        q=query,
                        fields="nextPageToken, files(id, name, mimeType, size)",
                        pageSize=500,
                        pageToken=page_token,
                        orderBy="name",
                    )
                    .execute()
                )
                success = True
            except Exception as e:
                retry_count += 1
                wait_time = 2**retry_count  # Exponential backoff
                print(f"[WARN] Drive API Error listing files (try {retry_count}/{max_retries}): {e}")
                time.sleep(wait_time)

        if not success:
            print(f"[ERROR] Falló listado de carpeta {folder_id} tras {max_retries} intentos.")
            break

        files = results.get("files", [])
        files_list.extend(files)

        page_token = results.get("nextPageToken")
        if not page_token:
            break
        page_num += 1

    return files_list
Retries use exponential backoff: 2s, 4s, 8s, 16s, 32s. This handles transient Drive API errors (rate limits, network hiccups) gracefully.
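
The same retry loop can be extracted into a reusable helper following the 2**attempt schedule above. This is a sketch, not part of the codebase; sleep is injectable for testing:

```python
import time

def call_with_backoff(fn, max_retries=5, sleep=time.sleep):
    """Call fn(); on failure retry with exponentially growing delays (2s, 4s, ...)."""
    for attempt in range(1, max_retries + 1):
        try:
            return fn()
        except Exception as e:
            if attempt == max_retries:
                raise                      # out of retries: propagate the error
            wait = 2 ** attempt
            print(f"[WARN] attempt {attempt}/{max_retries} failed: {e}; retrying in {wait}s")
            sleep(wait)
```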

Drive ID Validation

From drive_service.py:14 and drive_service.py:92-95:
DRIVE_ID_RE = re.compile(r"^[A-Za-z0-9_-]{10,}$")

def _validate_drive_id(self, value, field_name="id"):
    if not isinstance(value, str) or not DRIVE_ID_RE.match(value.strip()):
        raise ValueError(f"Invalid Google Drive {field_name}")
    return value.strip()
All Drive file IDs are validated before API calls to prevent injection attacks.
The server rejects local: prefixed file references (from old local-file mode). Only valid Drive IDs (10+ alphanumeric/-/_ characters) are accepted.
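
Usage of the validator, shown as a standalone sketch mirroring the method above (the real version lives on DriveService as _validate_drive_id):

```python
import re

DRIVE_ID_RE = re.compile(r"^[A-Za-z0-9_-]{10,}$")

def validate_drive_id(value, field_name="id"):
    """Accept only plausible Drive IDs; reject local: refs, paths, non-strings."""
    if not isinstance(value, str) or not DRIVE_ID_RE.match(value.strip()):
        raise ValueError(f"Invalid Google Drive {field_name}")
    return value.strip()
```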

Performance Metrics

Streaming Metrics

From server.py:1193-1199:
if duration > 0.5:
    speed = (total_bytes / 1024 / 1024) / duration
    print(f"[STREAM] {file_id} | Range: {range_header or 'Full'} | "
          f"{total_bytes/1024/1024:.2f} MB in {duration:.2f}s ({speed:.2f} MB/s)")
Example output:
[STREAM] 1a2b3c4d5e | Range: bytes=0-1048575 | 1.00 MB in 0.87s (1.15 MB/s)

Health Endpoint

From server.py:817-835:
if self.path == "/api/health":
    ds = get_drive_service()
    ffmpeg_executable = _get_ffmpeg_executable()
    with compat_stream_lock:
        compat_snapshot = dict(compat_stream_stats)
    payload = {
        "status": "ok",
        "drive": {
            "available": bool(ds),
            "error": None if ds else get_drive_service_error(),
        },
        "ffmpeg": {
            "available": bool(ffmpeg_executable),
            "path": ffmpeg_executable,
        },
        "compatStream": compat_snapshot,
    }
    self._send_json(200, payload)
Check /api/health to verify:
  • Drive service authentication status
  • FFmpeg availability
  • Compatibility stream statistics
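
A small client-side check of that payload shape can look like this (health_problems is illustrative; fetching the endpoint, e.g. with urllib.request, is left out so the check stays offline-testable):

```python
def health_problems(payload):
    """Return a list of human-readable problems found in a /api/health payload."""
    problems = []
    if payload.get("status") != "ok":
        problems.append("server not ok")
    drive = payload.get("drive", {})
    if not drive.get("available"):
        problems.append(f"drive unavailable: {drive.get('error')}")
    if not payload.get("ffmpeg", {}).get("available"):
        problems.append("ffmpeg missing (compatibility mode disabled)")
    return problems
```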

Common Issues

Cause: The service account doesn’t have access to the Drive folder.
Fix: Share the folder with the service account email (found in service_account.json under client_email).

Cause: Invalid or expired credentials.
Fix: Download a fresh service_account.json from Google Cloud Console.

Cause: Incorrect DRIVE_ROOT_ID or file ID.
Fix: Verify the folder ID in the Drive URL: drive.google.com/drive/folders/{ID}

Cause: Drive API rate limiting or network latency.
Fix:
  • Check /api/health for metrics
  • Use FFmpeg compatibility mode for problematic files
  • Note the streaming chunk size is already tuned to 1 MB, so no change is needed there

Next Steps

Architecture Overview

Return to system architecture overview

Configuration

Configure Drive credentials and environment variables
