Airflow
DAG scheduling, executor errors, task failures, and connection issues.
Spark
OOM errors, executor failures, shuffle bottlenecks, and performance problems.
RAG
Retrieval quality, embedding issues, vector database errors, and relevance tuning.
MLOps
Model serving failures, drift detection, feature store issues, and pipeline errors.
LLM Pipeline
Tokenization errors, data quality issues, and training data pipeline failures.
LLM Eval
Evaluation metric issues, benchmark failures, and scoring inconsistencies.
Agentic
Agent tool call failures, memory errors, and multi-agent coordination issues.