Supported Data Warehouses
- Google BigQuery: Serverless data warehouse with ML and BI capabilities
- Snowflake: Cloud data platform with elastic compute and storage
- Amazon Redshift: AWS data warehouse with columnar storage
- Apache Doris: Real-time analytical database
Google BigQuery
Configuration
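A minimal sketch of what a BigQuery destination config might look like; the key names here are assumptions modeled on common Mage-style configs, not a confirmed schema:

```yaml
# Hypothetical BigQuery destination config (key names are assumptions)
database_type: bigquery
config:
  path_to_credentials_json_file: /path/to/service_account.json
  project_id: my-gcp-project   # GCP project that owns the dataset
  dataset_id: analytics        # target dataset
  location: US                 # dataset location
```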
Features
- Batch load API - High-performance data loading using native BigQuery Storage API
- Automatic partitioning - Date-based table partitioning
- Schema evolution - Automatic column additions and type changes
- Nested structures - ARRAY and STRUCT (JSON) support
- Query parameters - Parameterized inserts for large datasets
- Transaction support - ACID compliance with sessions
Batch Loading
Enable batch loading for significantly faster data ingestion:
- Pandas DataFrames → Parquet format
- Arrays in data → JSON format
- Storage Write API for parallel uploads
Performance Comparison
- Standard Insert: ~1,000 rows/second
- Batch Load: ~50,000+ rows/second

Batch loading is recommended for:
- Large datasets (>10,000 rows)
- Frequent data loads
- Time-sensitive pipelines
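Enabling batch load is typically a single config switch; a sketch, assuming a `use_batch_load` key (not a confirmed name):

```yaml
database_type: bigquery
config:
  use_batch_load: true   # route inserts through the Storage Write API
```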
Partitioning
Automatically partition tables by date.

Data Type Mapping
| Python Type | BigQuery Type |
|---|---|
| str | STRING |
| int | INT64 |
| float | FLOAT64 |
| bool | BOOL |
| datetime | DATETIME |
| dict | JSON |
| list | ARRAY |
Unique Constraints
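A sketch of how unique constraints might be declared so the exporter issues MERGE instead of plain INSERT; the `unique_constraints` and `unique_conflict_method` keys follow Mage's destination conventions but are assumptions here:

```yaml
unique_constraints:
  - id                           # column(s) that identify a row
unique_conflict_method: UPDATE   # update the existing row on conflict
```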
Handle duplicates with MERGE operations.

Snowflake
Configuration
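A minimal sketch of a Snowflake destination config; key names are assumptions, not a confirmed schema:

```yaml
# Hypothetical Snowflake destination config (key names are assumptions)
database_type: snowflake
config:
  account: myorg-account1                  # account identifier
  user: LOADER
  private_key_path: /path/to/rsa_key.p8    # key pair auth instead of a password
  warehouse: LOAD_WH
  database: ANALYTICS
  schema: PUBLIC
  role: LOADER_ROLE
```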
Features
- Batch loading - Fast data loading with write_pandas
- Key pair authentication - Secure authentication without passwords
- Role-based access - Fine-grained permissions
- MERGE operations - Efficient upserts
- VARIANT columns - Store semi-structured data (JSON, Arrays)
- Temporary tables - Session-scoped staging tables
Batch Loading
Snowflake’s batch load uses the native write_pandas function:
- Automatic Parquet conversion
- Parallel file uploads
- Native compression
- Up to 10x faster than standard SQL inserts
Unique Constraints with Temp Tables
Upserts use temporary tables for staging:
- Create temporary table matching target schema
- Load data into temp table using write_pandas
- Execute MERGE from temp to target
- Drop temporary table
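The staging flow above is typically driven by the same unique-constraint settings as other destinations; a sketch (key names are assumptions):

```yaml
config:
  use_batch_load: true    # stage via write_pandas
  unique_constraints:
    - ID                  # MERGE key between temp and target table
  unique_conflict_method: UPDATE
```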
Data Type Mapping
| Python Type | Snowflake Type |
|---|---|
| str | VARCHAR |
| int | NUMBER |
| float | FLOAT |
| bool | BOOLEAN |
| datetime | TIMESTAMP |
| dict | VARIANT |
| list | ARRAY |
Disable Double Quotes
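A sketch of the flag; the `disable_double_quotes` name is an assumption:

```yaml
disable_double_quotes: true   # emit unquoted identifiers, which Snowflake uppercases
```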
Snowflake uppercases identifiers by default. Disable double quotes to match this behavior.

Amazon Redshift
Configuration
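A minimal sketch of a Redshift destination config; key names are assumptions, not a confirmed schema:

```yaml
# Hypothetical Redshift destination config (key names are assumptions)
database_type: redshift
config:
  host: my-cluster.abc123.us-east-1.redshift.amazonaws.com
  port: 5439
  database: analytics
  user: loader
  password: "{{ env_var('REDSHIFT_PASSWORD') }}"
```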
Features
- IAM authentication - AWS credential-based access
- Redshift Serverless - Automatic scaling without cluster management
- MERGE operations - Efficient deduplication
- Columnar storage - Optimized for analytical queries
- Distribution keys - Optimize data distribution across nodes
- Sort keys - Improve query performance
Merge Load Method
Use staging tables for better performance:
- Create staging table with same schema
- Load data into staging table
- Execute MERGE to deduplicate and update
- Drop staging table
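The staging steps above would usually be selected through the load method in config; a sketch assuming a `use_merge_load` toggle (the key name is an assumption):

```yaml
config:
  use_merge_load: true   # stage, MERGE, then drop the staging table
  unique_constraints:
    - id
```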
Data Type Mapping
| Python Type | Redshift Type |
|---|---|
| str | VARCHAR(65535) |
| int | BIGINT |
| float | DOUBLE PRECISION |
| bool | BOOLEAN |
| datetime | TIMESTAMP |
| dict | VARCHAR(65535) |
| list | VARCHAR(65535) |
Redshift Serverless
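Redshift Serverless exposes a workgroup endpoint rather than a cluster endpoint; a sketch of pointing the same config at it (key names are assumptions):

```yaml
config:
  host: my-workgroup.123456789012.us-east-1.redshift-serverless.amazonaws.com
  port: 5439
  database: analytics
```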
Connecting to Redshift Serverless requires no cluster to manage; use the serverless workgroup endpoint as the host.

Apache Doris
Configuration
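A minimal sketch of a Doris destination config; Doris speaks the MySQL protocol on the FE query port (9030 by default), and the key names here are assumptions:

```yaml
# Hypothetical Doris destination config (key names are assumptions)
database_type: doris
config:
  host: doris-fe.internal   # frontend (FE) node
  port: 9030                # MySQL-protocol query port
  user: root
  password: ""
  database: analytics
```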
Features
- Real-time analytics - Low-latency queries
- MySQL protocol - Compatible with MySQL clients
- Vectorized execution - High performance queries
- MPP architecture - Distributed processing
Common Configuration Options
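A combined sketch of the three options below; `disable_update_column_types` appears later on this page, while `lower_case` and `table_name` are assumed names:

```yaml
config:
  lower_case: true                    # force lowercase column names
  disable_update_column_types: true   # never alter existing column types
  table_name: custom_events           # override the stream's default table name
```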
Lowercase Column Names
Force lowercase column names.

Disable Column Type Updates
Prevent automatic schema changes.

Custom Table Names
Override table names per stream.

Performance Optimization
BigQuery Optimization
Use Batch Load

Partition Large Tables

Query Optimization
- BigQuery automatically manages query parameters
- Supports up to 100MB query payload
- Automatic retry on concurrent update conflicts
Snowflake Optimization
Enable Batch Load

Warehouse Sizing
- Start with X-Small for testing
- Scale up for production workloads
- Use multi-cluster warehouses for concurrency
Key Pair Authentication
- Faster than password authentication
- More secure for production
- Supports key rotation
Redshift Optimization
Use Merge Load

Distribution Strategy

VACUUM and ANALYZE
- Run VACUUM regularly to reclaim space
- Run ANALYZE after large loads to update statistics
Example: BigQuery Export with Batch Load
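A sketch of a full export config tying the BigQuery options together; all key names are assumptions, not a confirmed schema:

```yaml
# Hypothetical end-to-end BigQuery export config (key names are assumptions)
database_type: bigquery
config:
  path_to_credentials_json_file: /path/to/service_account.json
  project_id: my-gcp-project
  dataset_id: analytics
  table_id: events
  use_batch_load: true            # Storage Write API instead of row inserts
  unique_constraints:
    - event_id                    # dedupe key for MERGE
  unique_conflict_method: UPDATE
```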
Error Handling
BigQuery Errors
Concurrent Update Conflicts
- Mage automatically retries up to 2 times
- Recreates table if needed
Query Size Limits
- Automatically splits large queries
- Uses query parameters to reduce size
Schema Mismatch Errors
- Check the disable_update_column_types setting
- Verify column type compatibility
Snowflake Errors
Authentication Failures
- Verify account identifier format: account.region
- Check role permissions
- Validate the key pair if using key pair authentication
Warehouse Auto-Resume
- Snowflake auto-resumes warehouses
- May cause initial query delay
Redshift Errors
Connection Timeouts
- Check VPC security groups
- Verify cluster is publicly accessible (if needed)
- Use VPN or SSH tunnel for private clusters
Permission Errors
- Grant INSERT, UPDATE, DELETE on tables
- Grant USAGE on schema
Next Steps
- Cloud Storage: Export to S3, GCS, and Delta Lake
- Databases: Configure traditional database destinations