Cloud Fundamentals for Data Engineers
AWS, GCP, and Azure essentials — storage, compute, IAM, and the FinOps loop every interview tests.
Every modern data team operates on a cloud, and every interview starts with 'tell me how this pipeline would run on AWS.' The candidates who land mid-level data engineering roles can read a cloud bill, design a least-privilege IAM role, and explain when serverless beats a managed warehouse. Candidates who can't are screened out within 20 minutes.
What you’ll be able to do
- Read AWS / GCP / Azure billing line items — and design pipelines whose cost shape you can predict
- Choose between warehouse, lake, and lakehouse with a real understanding of the storage, query, and governance trade-offs
- Configure object storage, IAM, and VPC primitives that survive a security review — not just a demo
- Apply FinOps practices (rightsizing, scheduled autoscaling, query attribution) that cut cloud data spend 40–60%
Curriculum
Phase 1: Cloud Foundations
The shape, the storage layer, the pipeline, the bill
The Shape of Cloud Data Systems
The five layers every cloud data platform has — ingestion, storage, orchestration, transformation, serving — and why this shape stays constant from a 3-person startup to Netflix. The vocabulary every later module sharpens.
Object Storage as the Landing Zone
S3 / GCS / ADLS as the durability layer underneath every cloud warehouse and lake: bucket partitioning, prefix limits, request-rate caps, lifecycle policies, and the storage-tier choices (Standard / IA / Glacier) that determine whether a 50 TB lake costs $1K or $9K a month.
What a Cloud Pipeline Actually Is
Source → schedule → transform → destination — the four-step shape every orchestrated pipeline collapses to. EventBridge / Cloud Scheduler / Airflow as the trigger layer, plus the idempotency + retry rules that decide whether your Tuesday DAG survives Saturday's network blip.
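The idempotency and retry rules can be sketched in a few lines. This is a minimal illustration, assuming a hypothetical in-memory destination keyed by run date; `run_with_retry`, `load_partition`, and `flaky_extract` are made-up names, not part of Airflow, EventBridge, or Cloud Scheduler.

```python
import time


def run_with_retry(task, max_attempts=3, base_delay=0.0):
    """Call task() until it succeeds, backing off exponentially between attempts."""
    for attempt in range(1, max_attempts + 1):
        try:
            return task()
        except ConnectionError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the failure to the scheduler
            time.sleep(base_delay * 2 ** (attempt - 1))


def load_partition(destination, run_date, rows):
    """Idempotent load: overwrite the partition for run_date instead of appending."""
    destination[run_date] = list(rows)
    return len(rows)


# Hypothetical flaky source that fails once, then succeeds.
calls = {"n": 0}


def flaky_extract():
    calls["n"] += 1
    if calls["n"] == 1:
        raise ConnectionError("simulated network blip")
    return [{"id": 1}, {"id": 2}]


warehouse = {}
rows = run_with_retry(flaky_extract)
load_partition(warehouse, "2024-01-02", rows)
load_partition(warehouse, "2024-01-02", rows)  # re-run of the same date adds no duplicates
```

Overwriting a whole partition per run date is only one way to get idempotency; a merge on a primary key is a common alternative when partitions are too large to rewrite.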
How Cloud Bills Are Built
Compute-seconds + GB-months + request-counts as the three line items every cloud bill reduces to — read the bill, you can predict it; predict it, you can cut it. The forensics that turn a $300K Snowflake surprise into a $90K reality.
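As a rough sketch of that three-line-item model, the function below multiplies usage by unit prices. The prices are placeholder values chosen for easy arithmetic, not any provider's published rates, and `UnitPrices` and `estimate_monthly_cost` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class UnitPrices:
    per_compute_second: float  # assumed placeholder, USD
    per_gb_month: float        # assumed placeholder, USD
    per_1k_requests: float     # assumed placeholder, USD


def estimate_monthly_cost(prices, compute_seconds, gb_stored, requests):
    """Break a month's usage into the three line items and a total."""
    compute = compute_seconds * prices.per_compute_second
    storage = gb_stored * prices.per_gb_month
    request = requests / 1000 * prices.per_1k_requests
    return {
        "compute": compute,
        "storage": storage,
        "requests": request,
        "total": compute + storage + request,
    }


prices = UnitPrices(per_compute_second=0.001, per_gb_month=0.02, per_1k_requests=0.005)
bill = estimate_monthly_cost(
    prices, compute_seconds=100_000, gb_stored=5_000, requests=2_000_000
)
```

Real bills add items this sketch ignores, such as network egress and support plans, so treat it as a first-order estimate only.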
Phase 2: Data Cloud Services
Storage at scale, warehouse vs lake vs lakehouse, ELT, serverless
Object Storage at Scale
Partitioning strategies that survive 100 M objects, prefix entropy to dodge S3 request-rate limits, multipart uploads, S3 Select for skinny scans (note that AWS has closed S3 Select to new customers), and the tier-transition policies that drop a $9K/month lake to $2K without losing query performance.
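One way to picture prefix entropy: S3 applies its request-rate limits per prefix, so spreading keys across several prefixes raises the aggregate ceiling. The sketch below derives a short, stable shard prefix from a hash of the key. `spread_key` and the shard count of 16 are illustrative assumptions, not an AWS recommendation.

```python
import hashlib


def spread_key(logical_key, shards=16):
    """Prepend a stable hash-derived shard so keys fan out across prefixes."""
    digest = hashlib.sha256(logical_key.encode("utf-8")).hexdigest()
    shard = int(digest[:8], 16) % shards
    return f"{shard:02x}/{logical_key}"


key = spread_key("events/2024/01/02/part-0001.parquet")
prefixes = {spread_key(f"events/part-{i}").split("/")[0] for i in range(1000)}
```

The trade-off is that hashed prefixes break simple date-ordered listing, so this pattern fits high-request-rate paths better than partitions that query engines prune by date.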
Warehouse, Lake, Lakehouse at Depth
Snowflake vs BigQuery vs Redshift Serverless on storage format, query engine, and cost shape — when a lakehouse (Iceberg + Trino) beats a managed warehouse, and when the operational cost of running it eats the savings.
ETL vs ELT in the Cloud
Why ELT (load raw → transform in-warehouse) won in the cloud and ETL (transform in-flight) still owns streaming and PII-filtering paths. Glue / Dataflow / Fabric trade-offs, the cost shape of each pattern, and the contract decisions that determine which one your platform standardizes on.
Serverless for Data Workloads
Lambda + Cloud Functions for event-driven ingestion, Fargate / Cloud Run for containerized ETL, and the 15-minute / cold-start / cost-per-invocation cliffs that decide when serverless wins vs when you should pay for an EC2 box.
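The cost-per-invocation cliff is easiest to see as arithmetic. This is a minimal sketch, assuming placeholder prices and a flat monthly price for an always-on instance. The 900-second cap mirrors the 15-minute limit mentioned above, and every function name and number here is hypothetical.

```python
def serverless_monthly_cost(invocations, seconds_per_call, gb_memory,
                            price_per_gb_second, price_per_million_requests):
    """Pay-per-use cost: duration times memory, plus a per-request fee."""
    compute = invocations * seconds_per_call * gb_memory * price_per_gb_second
    requests = invocations / 1_000_000 * price_per_million_requests
    return compute + requests


def cheaper_option(invocations, seconds_per_call, gb_memory,
                   price_per_gb_second, price_per_million_requests,
                   instance_monthly_cost, max_seconds=900):
    """Return 'serverless' or 'instance' for a given workload shape."""
    if seconds_per_call > max_seconds:
        return "instance"  # the job does not fit inside the function timeout
    cost = serverless_monthly_cost(invocations, seconds_per_call, gb_memory,
                                   price_per_gb_second, price_per_million_requests)
    return "serverless" if cost < instance_monthly_cost else "instance"


# Placeholder prices, chosen for readable arithmetic rather than accuracy.
PRICES = dict(price_per_gb_second=0.00002, price_per_million_requests=0.20,
              instance_monthly_cost=60.0)

low_volume = cheaper_option(100_000, 2, 1, **PRICES)
high_volume = cheaper_option(10_000_000, 2, 1, **PRICES)
long_job = cheaper_option(100, 1_200, 1, **PRICES)
```

The sketch leaves out cold-start latency, which is a performance cost rather than a billing line and has to be weighed separately.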
Phase 3: Security, FinOps & AI
Breach patterns, the cloud-first case, FinOps loop, AI layer
How Cloud Data Platforms Get Breached
Over-permissive IAM, leaked keys in CI, unencrypted EBS volumes, public buckets: four patterns that recur across cloud-data breach post-mortems from the last three years, plus least-privilege role design, KMS envelope encryption, and VPC endpoints for warehouse traffic.
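Least-privilege role design can be illustrated with a read-only policy scoped to one prefix, plus a small check that flags wildcard grants. The bucket name, prefix, and helper names below are hypothetical. The policy follows the standard IAM JSON layout, but it is a sketch and has not been validated against a real account.

```python
import json


def read_only_bucket_policy(bucket, prefix):
    """Build an IAM policy document that can list and read a single prefix only."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "ListLandingPrefix",
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": [f"arn:aws:s3:::{bucket}"],
                "Condition": {"StringLike": {"s3:prefix": [f"{prefix}/*"]}},
            },
            {
                "Sid": "ReadLandingObjects",
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": [f"arn:aws:s3:::{bucket}/{prefix}/*"],
            },
        ],
    }


def _as_list(value):
    return value if isinstance(value, list) else [value]


def has_wildcard_grant(policy):
    """True if any Allow statement uses '*' or 'service:*' actions, or resource '*'."""
    for stmt in policy["Statement"]:
        if stmt["Effect"] != "Allow":
            continue
        actions = _as_list(stmt["Action"])
        resources = _as_list(stmt["Resource"])
        if any(a == "*" or a.endswith(":*") for a in actions) or "*" in resources:
            return True
    return False


policy = read_only_bucket_policy("example-raw", "landing")
policy_json = json.dumps(policy, indent=2)
admin_like = {"Version": "2012-10-17",
              "Statement": [{"Effect": "Allow", "Action": "*", "Resource": "*"}]}
```

A check like `has_wildcard_grant` is a coarse lint, not a substitute for a proper access review.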
Why Cloud Wins for Data Engineering
Elasticity, managed services, and cloud-native-first tooling as the three structural reasons even cost-sensitive teams move data workloads off-prem — and the two scenarios (sub-millisecond latency, sovereign data) where on-prem still wins.
FinOps for Data Engineers
Cost attribution per pipeline (per-DAG tagging, per-query labels), warehouse autoscaling schedules, query-level cost optimization (filter pushdown, partition pruning), and the FinOps loop — measure → allocate → optimize — that closes the gap between cloud bills and engineering decisions.
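The measure → allocate step can be sketched as a roll-up of billing line items by tag. The record shape (`cost_usd`, `tags`) and the `pipeline` tag key are assumptions made for illustration. Real billing exports use provider-specific column names.

```python
from collections import defaultdict


def cost_by_tag(line_items, tag_key="pipeline"):
    """Sum cost per tag value, most expensive first; missing tags go to 'untagged'."""
    totals = defaultdict(float)
    for item in line_items:
        owner = item.get("tags", {}).get(tag_key, "untagged")
        totals[owner] += item["cost_usd"]
    return dict(sorted(totals.items(), key=lambda kv: kv[1], reverse=True))


# Hypothetical line items standing in for a billing export.
line_items = [
    {"service": "warehouse", "cost_usd": 120.0, "tags": {"pipeline": "orders_daily"}},
    {"service": "storage", "cost_usd": 30.0, "tags": {"pipeline": "orders_daily"}},
    {"service": "warehouse", "cost_usd": 45.5, "tags": {"pipeline": "clickstream"}},
    {"service": "compute", "cost_usd": 12.25},  # no tags: nobody owns this spend yet
]

report = cost_by_tag(line_items)
```

The size of the `untagged` bucket is itself a useful metric, since spend that nobody owns cannot be optimized by anyone.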
The AI Layer in Cloud Platforms
Bedrock / Vertex AI / Azure AI Foundry as the managed-LLM layer, the data-engineering work that feeds them (vector stores, retrieval pipelines, evaluation harnesses), and the cost-attribution model that prevents the $40K AI-agent surprise bill.
What you’ll build
- A cost-attributed Terraform-deployed pipeline (S3 → Glue → Snowflake) with per-stage tagging, scheduled warehouse autoscaling, and an AWS Budgets alert that pages before the bill hits 80% of plan
- A least-privilege IAM design — separate roles for ingestion / transform / serving, KMS-encrypted state, VPC endpoints, and a quarterly access-review query you can hand to security
- A warehouse cost teardown — pick any production query, attribute its cost across compute-seconds + bytes-scanned + result-cache savings, and present a 3-week optimization plan that cuts it 40%
- A multi-cloud comparison doc — the same pipeline costed on AWS + GCP + Azure with real line items, plus the architecture decisions (managed service, storage class, network egress path) that explain the delta
Without cloud fundamentals, you build pipelines that pass code review and die the day they meet a real bill.
WHAT GOES WRONG
- The 'works-on-my-laptop' interview — candidate explains a pipeline that 'loads to S3 and queries from Snowflake' but can't answer what S3 charges per GET, what a Snowflake warehouse-hour costs, or why ELT replaced ETL — screened out in round 1
- The forgotten S3 bucket — engineer opens a bucket 'temporarily' to debug a Glue job and never closes it; six weeks later it's on a credential-stuffing forum and HR is on the phone
- The $300K Snowflake surprise — query without partition pruning runs full-table on every dbt invocation, three weeks of compounding spend, the bill arrives before anyone reads it
- The IAM over-grant — `AdministratorAccess` on the service role because 'we'll lock it down later'; the SOC2 audit fails, the team rebuilds RBAC under a deadline
What is Cloud Fundamentals?
Cloud fundamentals for data engineers covers the storage, compute, security, and cost-management primitives that every modern data platform runs on. The curriculum walks through S3 / GCS / ADLS object storage, the warehouse-vs-lake-vs-lakehouse decision, ELT pipeline mechanics on AWS / GCP / Azure, least-privilege IAM, and the FinOps loop that turns a $300K surprise bill into a predictable cost line.
Why this matters in production
Every modern data team operates on a cloud, and the interview screens reflect it — hiring panels at Stripe, Datadog, and every cloud-native data org expect candidates to read the bill, design IAM roles, and explain when serverless beats a managed warehouse. Cloud fundamentals isn't a specialist skill; it's the shared vocabulary the rest of data engineering builds on.
Common use cases
- Reading AWS / GCP / Azure billing line items and predicting pipeline cost shape before the bill arrives
- Choosing between warehouse (Snowflake / BigQuery), lake (S3 + Glue), and lakehouse (Iceberg + Trino) for new workloads
- Designing least-privilege IAM with separate ingest / transform / serve roles and KMS-encrypted state
- Setting up ELT pipelines with Glue / Dataflow / Fabric and the idempotency + retry rules that survive Saturday outages
- Applying FinOps practices (rightsizing, scheduled autoscaling, per-pipeline cost tagging) to cut cloud data spend 40–60%
- Preparing for mid-level data engineering interviews where cloud fluency is the floor screen
Cloud vs alternatives
AWS vs GCP
AWS has the broadest service portfolio and the most data-engineering job postings. GCP leads in analytics with BigQuery's serverless query engine. Both are strong for data engineering, so the choice usually comes down to existing organizational investment rather than a capability gap.
Cloud vs On-Premise
Cloud provides elastic scaling, managed services, and pay-per-use pricing — the structural reasons even cost-sensitive teams move off-prem. On-premise still wins for sub-millisecond latency requirements and sovereign-data workloads. Most modern data teams are cloud-first with selective hybrid.
Single-Cloud vs Multi-Cloud
Single-cloud simplifies operations, IAM, and cross-service data transfer cost. Multi-cloud reduces vendor lock-in and enables best-of-breed services (e.g., BigQuery for analytics, AWS for everything else). Most companies are single-cloud primary with selective multi-cloud for specific services.
Why this skill matters
Cloud fundamentals is the *floor* skill for a mid-level data engineering offer. Hiring panels at Stripe, Datadog, Snowflake, and every cloud-native data team test for it on day one — not because they need a specialist, but because nothing else on a resume reads as credible without it. The engineer who can read the AWS bill is the one who gets trusted with the platform.
Common questions about cloud fundamentals
Do data engineers really need cloud skills?
Yes — it's the floor screen. Every mid-level data engineering interview tests whether you can read a cloud bill, design an IAM role, and explain when serverless beats a managed warehouse. You can't fake it past a 30-minute screen.
Which cloud platform should I learn first?
AWS has the broadest job market and most data-engineering tooling. GCP excels in analytics with BigQuery. Azure is strongest in Microsoft-heavy enterprises. Start with whichever your target employers use — concepts transfer between providers within a few weeks of hands-on practice.
How long does it take to learn cloud fundamentals?
Core concepts across all three providers take 3–4 weeks of focused study. Real fluency with one provider's data services — enough to design and operate pipelines without surprises — typically takes 2–3 months of hands-on usage.
How do I learn cloud without spending money?
AWS, GCP, and Azure all offer free tiers with enough capacity for the storage, compute, and IAM work in this curriculum. Set a $1/month billing alert as your first task — that habit alone separates engineers who get trusted with platforms from those who don't.
What cloud certifications help data engineers?
AWS Certified Data Engineer - Associate (the successor to the retired AWS Data Analytics Specialty), GCP Professional Data Engineer, and Azure Data Engineer Associate are the most relevant. Practical experience matters more than certifications, but they validate baseline knowledge, and the prep work itself is high-quality cloud fundamentals practice.
Is one cloud certification enough to get hired?
A certification gets your resume past the first filter. The interview still tests whether you can read the bill, design IAM, and explain trade-offs. Treat certifications as proof of effort, not proof of skill — pair them with a portfolio project that exercises the concepts.
What is FinOps and why should data engineers care?
FinOps applies financial accountability to cloud spending — measure cost per pipeline, allocate it to the team that owns it, optimize what's most expensive. For data engineers, it means tagging warehouses by team, attributing query cost, and making cost a first-class design consideration alongside latency and reliability.