Python Foundations
Python from notebook to pipeline: type hints, dataclasses, error handling, logging, and the OOP + functional patterns that turn one-off scripts into reusable code.
OOP, functional programming, pandas, and production-ready code patterns.
Every interview that asks "can you write Python?" actually means "can you write production code?" The difference between a notebook and a pipeline is the difference between a junior applicant and someone who lands their first real role.
Core Python, OOP, data structures
Python from notebook to pipeline: type hints, dataclasses, error handling, logging, and the OOP + functional patterns that turn one-off scripts into reusable code.
Your first end-to-end pipeline: extract from an API → transform with Polars → load to Postgres + S3. The shape every production pipeline rhymes with.
Data manipulation, quality, reliability
Polars vs pandas decision matrix, lazy frames + streaming, joins/groupby/pivot at scale, and when single-machine processing beats Spark.
Pydantic validation, pytest fixtures + parametrize, property-based testing with Hypothesis, retry/backoff patterns, and the dead-letter queue every pipeline needs.
ETL, performance, distributed, cloud deployment
SQLAlchemy 2.0 connections, upsert + bulk-insert patterns, transaction management, idempotent writes, and the pipeline structure that survives mid-run failures.
PySpark DataFrame fundamentals, pandas (vectorized) UDFs vs row-at-a-time Python UDFs, async I/O with asyncio for API ingestion, multiprocessing for CPU-bound work, and profiling with cProfile.
Cloud storage patterns (S3 + GCS + Azure Blob), Docker multi-stage builds, Lambda / Cloud Functions deployment, secrets management, and the IAM least-privilege setup that doesn't break in production.
End-to-end pipeline architecture: design → ingest → transform → load → monitor → deploy. Plus the production-grade decisions (retries, idempotency, observability) that separate juniors from mid-level.
LLMs, embeddings, RAG pipelines, async batching
LLM API fundamentals (OpenAI / Anthropic SDKs), embeddings + vector search basics, prompt + response evals, async batching for LLM throughput, and where Python data engineering meets AI infra.
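As a rough illustration of the async batching listed in the curriculum above, the sketch below caps the number of in-flight requests with a semaphore and collects results in input order. It uses only the standard library. `fake_llm_call` is a hypothetical stand-in for a real SDK call, and `max_concurrency` is an illustrative parameter, not a recommended value.

```python
import asyncio


async def fake_llm_call(prompt: str) -> str:
    """Hypothetical stand-in for a network call to an LLM API."""
    await asyncio.sleep(0.01)  # simulate I/O latency
    return prompt.upper()


async def run_batched(prompts: list[str], max_concurrency: int = 4) -> list[str]:
    """Process prompts concurrently with at most max_concurrency calls in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def worker(prompt: str) -> str:
        # The semaphore blocks here once max_concurrency workers are active.
        async with semaphore:
            return await fake_llm_call(prompt)

    # gather returns results in the same order as the awaitables passed in.
    return await asyncio.gather(*(worker(p) for p in prompts))


if __name__ == "__main__":
    print(asyncio.run(run_batched(["first", "second", "third"], max_concurrency=2)))
```

Bounding concurrency this way keeps throughput high without exceeding a provider's rate limits. A real pipeline would also need per-request timeouts and retries.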
Without production Python patterns, you risk:
Python for data engineering focuses on writing production-grade code for data pipelines, ETL processes, and data services. Unlike data science Python, data engineering Python emphasizes OOP, error handling, testing, and deployment — the patterns used at companies like Spotify, Airbnb, and Uber to process terabytes daily.
Python is the primary language for data pipeline orchestration, API integrations, and custom transformation logic. At Netflix, Python orchestrates thousands of data jobs daily. Production Python requires proper error handling, logging, and testing — not just notebook-style scripting.
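To make "proper error handling and logging" concrete, here is a minimal sketch of a retry helper with exponential backoff, using only the standard library. The function name `retry`, its parameters, and the logger name `pipeline` are illustrative assumptions, not part of any specific framework.

```python
import logging
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")
logger = logging.getLogger("pipeline")  # hypothetical logger name


def retry(
    func: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 0.1,
    retry_on: Tuple[Type[Exception], ...] = (Exception,),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call func, retrying on the given exceptions with exponential backoff."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except retry_on as exc:
            if attempt == attempts:
                # Out of attempts: log and re-raise so the caller sees the failure.
                logger.error("giving up after %d attempts: %s", attempt, exc)
                raise
            delay = base_delay * 2 ** (attempt - 1)  # 0.1s, 0.2s, 0.4s, ...
            logger.warning("attempt %d failed (%s); retrying in %.2fs", attempt, exc, delay)
            sleep(delay)
    raise AssertionError("unreachable")  # the loop always returns or raises
```

Passing `sleep` as a parameter lets tests replace the real delay with a recorder, so the backoff schedule can be checked without waiting. In real code, `retry_on` would usually be narrowed to transient errors such as `ConnectionError` or `TimeoutError`.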
Python offers faster development and a richer data ecosystem. Java provides better performance for low-latency systems. Most data teams choose Python for pipeline logic and reserve Java/Scala for framework-level code.
Python dominates data engineering due to PySpark and broader library support. Scala is preferred for Spark-native development and performance-critical jobs. PySpark bridges both worlds.
Python handles orchestration, API calls, and complex logic. SQL handles warehouse transformations. Production data engineers use both — Python for pipeline code, SQL for data transformations.
Production Python is the line between writing notebooks and writing pipelines. Once you can structure a project, handle errors, test the edge cases, and deploy to the cloud, you've cleared the bar for junior data engineering roles and laid the foundations for every senior trajectory (batch, streaming, ML, platform).
Data engineers need OOP, error handling, testing with pytest, data manipulation with Polars/pandas, and deployment with Docker. Notebook-only Python is not sufficient for production work.
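As one small example of what that looks like in practice, the sketch below combines type hints, a frozen dataclass that validates itself, and two test functions written so that pytest would discover them. The `Order` record and its fields are hypothetical. The second test uses a plain try/except so the block runs without pytest installed; under pytest, `pytest.raises` is the usual idiom.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Order:
    """Hypothetical record type; immutable once constructed."""
    order_id: str
    amount_cents: int

    def __post_init__(self) -> None:
        # Validate at construction so bad records fail early and loudly.
        if not self.order_id:
            raise ValueError("order_id must be non-empty")
        if self.amount_cents < 0:
            raise ValueError("amount_cents must be non-negative")


def parse_order(raw: dict) -> Order:
    """Convert an untyped record (e.g. a JSON object) into a validated Order."""
    return Order(
        order_id=str(raw["order_id"]).strip(),
        amount_cents=int(raw["amount_cents"]),
    )


def test_parse_order_accepts_valid_record() -> None:
    assert parse_order({"order_id": " A1 ", "amount_cents": "250"}) == Order("A1", 250)


def test_parse_order_rejects_negative_amount() -> None:
    try:
        parse_order({"order_id": "A1", "amount_cents": -1})
    except ValueError:
        return
    raise AssertionError("expected ValueError for a negative amount")
```

Keeping validation inside the type means every code path that builds an `Order` gets the same checks, which is harder to guarantee with loose dictionaries in a notebook.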
Python plus SQL covers most data engineering work. You will also need familiarity with cloud platforms, Docker, and orchestration tools like Airflow, which all have Python interfaces.
With prior programming experience, expect 2-3 months of focused practice. The key is learning production patterns (error handling, testing, and deployment), not just data manipulation.
Both are essential and complementary. SQL handles warehouse queries and transformations. Python handles orchestration, APIs, custom logic, and testing. Every data engineer uses both daily.
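A small sketch of that division of labor: SQL expresses the write, and Python supplies the transaction handling and orchestration around it. It uses the standard-library `sqlite3` module as a stand-in for a real warehouse or Postgres. The table `daily_totals` and both function names are hypothetical, and the `ON CONFLICT` upsert syntax assumes SQLite 3.24 or newer.

```python
import sqlite3
from typing import Iterable, Tuple


def make_connection() -> sqlite3.Connection:
    """Create an in-memory database with the hypothetical target table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE daily_totals (day TEXT PRIMARY KEY, total_cents INTEGER NOT NULL)"
    )
    return conn


def load_daily_totals(conn: sqlite3.Connection, rows: Iterable[Tuple[str, int]]) -> None:
    """Upsert rows keyed on day, so re-running the same load is safe."""
    # Using the connection as a context manager commits on success
    # and rolls back if any statement raises.
    with conn:
        conn.executemany(
            """
            INSERT INTO daily_totals (day, total_cents)
            VALUES (?, ?)
            ON CONFLICT(day) DO UPDATE SET total_cents = excluded.total_cents
            """,
            rows,
        )


if __name__ == "__main__":
    conn = make_connection()
    batch = [("2024-01-01", 100), ("2024-01-02", 200)]
    load_daily_totals(conn, batch)
    load_daily_totals(conn, batch)  # second run leaves the table unchanged
    print(conn.execute("SELECT * FROM daily_totals ORDER BY day").fetchall())
```

Because the write is keyed on `day`, a pipeline that fails mid-run can be restarted from the top without producing duplicate rows.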
Polars is increasingly chosen over pandas for new projects because of its performance. A common setup is Polars for single-machine processing and PySpark for distributed workloads.
Common frameworks include Airflow for orchestration, PySpark for distributed processing, FastAPI for data services, pytest for testing, and Docker for deployment.
Polars is the better first choice. Same DataFrame mental model, dramatically better performance (Rust-backed, lazy execution, larger-than-RAM streaming), and an API that teaches good habits (no SettingWithCopyWarning chaos). pandas is still useful for legacy code + ML library compatibility, but learn Polars first and pick up pandas as needed.