Context
Module 03 deploys the registered scikit-learn / XGBoost churn model behind a REST API on Kubernetes. The serving layer must:
- Package the model + its dependencies into a Docker image
- Expose `/predict` with Pydantic-validated input
- Auto-scale on CPU pressure (HPA)
- Support canary rollouts (10 % → 30 % → 50 % → 100 %)
- Hit the <50 ms P99 budget at 1k req/sec
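The auto-scaling requirement can be made concrete with the replica calculation that the Kubernetes documentation gives for the Horizontal Pod Autoscaler: `desired = ceil(current * currentMetric / targetMetric)`, skipped when the ratio is within a tolerance of 1.0 (0.1 by default). The sketch below is illustrative only: the 70 % CPU target and the replica bounds are assumptions, not values taken from this project's HPA manifest.

```python
import math


def desired_replicas(
    current_replicas: int,
    current_cpu_pct: float,
    target_cpu_pct: float = 70.0,  # assumed target, not from the project's HPA
    min_replicas: int = 2,         # assumed lower bound
    max_replicas: int = 10,        # assumed upper bound
    tolerance: float = 0.1,        # HPA default tolerance
) -> int:
    """Approximate the HPA scaling decision for a CPU utilization metric."""
    ratio = current_cpu_pct / target_cpu_pct
    # Within tolerance of the target: the HPA leaves the replica count alone.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    # Clamp to the configured replica bounds.
    return max(min_replicas, min(max_replicas, desired))
```

For example, four pods averaging 140 % of their CPU request against a 70 % target would be scaled to eight under these assumptions.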
Classic options:
- BentoML — ML-aware serving framework: model packaging, REST API, batching, GPU runners, OpenAPI spec, Docker image build.
- TorchServe — PyTorch-only; the project uses scikit-learn / XGBoost.
- Seldon Core — heavier-weight Kubernetes-native serving with inference graphs.
- FastAPI by hand — write the API, write the Dockerfile, write the Pydantic models, write the model loader.
Decision
Adopt BentoML 1.3+.
```python
# service.py
import bentoml
from pydantic import BaseModel


class ChurnInput(BaseModel):
    customer_id: int
    tenure_months: float
    monthly_charges: float
    # ... rest of features


@bentoml.service(traffic={"timeout": 30})
class ChurnPredictor:
    # Loaded once, when the service class is defined
    model = bentoml.models.load_model("churn-predictor:latest")

    @bentoml.api
    async def predict(self, input: ChurnInput) -> dict:
        # _fetch_features (not shown) returns the feature vector for this customer
        features = await self._fetch_features(input.customer_id)
        # Single-row batch: take the probability of class index 1
        score = self.model.predict_proba([features])[0][1]
        return {"churn_probability": float(score)}
```
```bash
# Module 03 deploy flow
bentoml build    # produces a Bento package
bentoml containerize churn_predictor:latest --image-tag $REGISTRY/churn:$SHA
docker push ...
kubectl set image deployment/churn-predictor app=$REGISTRY/churn:$SHA
```
Tradeoffs we accept
| Lever | BentoML (chosen) | TorchServe | Seldon | FastAPI by hand |
|---|---|---|---|---|
| Framework support | scikit-learn, XGBoost, PyTorch, TF, ONNX, custom | PyTorch only | All (via runtimes) | All |
| Day-1 setup | bentoml.service decorator | TorchScript packaging | Kubernetes operator + CRDs | Hand-roll everything |
| OpenAPI spec | Auto-generated | Manual | Auto-generated | Manual |
| Pydantic validation | First-class | Manual | Manual | First-class |
| Adaptive batching | Built-in | Built-in | Via inference graph | Build it |
| Docker image build | bentoml containerize | Manual | Manual | Manual |
| K8s deploy footprint | Plain Deployment | Plain Deployment | CRDs + operator | Plain Deployment |
| Model registry | MLflow + S3 + Bento store | TorchScript files | Custom | Custom |
| Inference graphs (multi-step) | Limited | None | Strong | Build it |
| Vendor lock-in | None (open-source) | None | None | None |
We optimize for scikit-learn/XGBoost first-class support + zero-handroll Docker build. TorchServe is a non-starter (no sklearn). Seldon is overkill for a single-model service. FastAPI by hand is what most teams start with and rebuild as BentoML once they realize they're rebuilding BentoML.
Consequences (positive)
- Module 03 ships a tested API in <100 lines of Python (`service.py`).
- `bentoml containerize` produces a multi-stage Docker image with the Python runtime, dependencies, and model baked in, using a single command.
- The Pydantic input model is the same in `service.py`, the integration tests, and the OpenAPI doc, so the three cannot drift apart.
- Adaptive batching is built in and enabled per endpoint (`batchable=True` on `@bentoml.api`); batching predictions at the API layer gives a 2-3× throughput improvement at the same latency budget.
- The MLflow Model Registry → Bento store handoff is documented in Module 03's `service.py` (`bentoml.models.import_from_mlflow(...)`).
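To illustrate why batching at the API layer raises throughput, the sketch below groups concurrent single-row requests into one model call. It is a simplified fixed-window micro-batcher, not BentoML's adaptive algorithm; `MicroBatcher`, `predict_batch`, and the default limits are hypothetical names and values chosen for illustration.

```python
import asyncio
from typing import Callable, List, Optional, Sequence


class MicroBatcher:
    """Collect concurrent single-row requests into one batched model call."""

    def __init__(
        self,
        predict_batch: Callable[[List[Sequence[float]]], List[float]],
        max_batch_size: int = 32,   # assumed limit
        max_wait_s: float = 0.005,  # assumed batching window
    ) -> None:
        self._predict_batch = predict_batch
        self._max_batch_size = max_batch_size
        self._max_wait_s = max_wait_s
        self._queue: Optional[asyncio.Queue] = None
        self._worker: Optional[asyncio.Task] = None

    async def submit(self, row: Sequence[float]) -> float:
        # Create the queue and worker lazily so they bind to the running loop.
        if self._queue is None:
            self._queue = asyncio.Queue()
            self._worker = asyncio.create_task(self._run())
        future = asyncio.get_running_loop().create_future()
        self._queue.put_nowait((row, future))
        return await future

    async def _run(self) -> None:
        loop = asyncio.get_running_loop()
        while True:
            # Block until the first request arrives, then open the window.
            batch = [await self._queue.get()]
            deadline = loop.time() + self._max_wait_s
            while len(batch) < self._max_batch_size:
                remaining = deadline - loop.time()
                if remaining <= 0:
                    break
                try:
                    batch.append(await asyncio.wait_for(self._queue.get(), remaining))
                except asyncio.TimeoutError:
                    break
            try:
                scores = self._predict_batch([row for row, _ in batch])
            except Exception as exc:
                # Propagate a model failure to every caller in the batch.
                for _, future in batch:
                    if not future.done():
                        future.set_exception(exc)
                continue
            for (_, future), score in zip(batch, scores):
                if not future.done():
                    future.set_result(score)
```

Each caller awaits `submit(row)` as if it were a single prediction, while the model sees one vectorized call per window. The trade-off is up to `max_wait_s` of added latency per request, which has to fit inside the latency budget.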
Consequences (negative)
- No native multi-model inference graph. If you need `feature-fetch → model-A → fan-out-to-model-B → join`, you build it yourself in `service.py`. Mitigation: Module 03 shows the pattern for the single-model path; inference graphs are a Seldon decision point if the platform grows.
- GPU runners are paid in BentoML Cloud. OSS BentoML supports GPU via `nvidia-docker` but the orchestration is BYO. Mitigation: this project is CPU-only (sklearn + XGBoost), so no impact.
- Bento store is a new mental model. Learners have to internalize "Bentos contain Models contain Frameworks". Mitigation: Module 03's first 2 lessons walk through the abstraction explicitly.
- No native A/B framework. Canary rollouts use Kubernetes-level splits (`k8s/canary-deployment.yaml` + `scripts/canary_promote.py`), not BentoML's runtime. Mitigation: this is the production-correct pattern (k8s splits are observable in the same Prometheus labels).
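The Kubernetes-level canary flow can be sketched as a small promotion gate. This is a hypothetical illustration, not the contents of `scripts/canary_promote.py`: the function names and the 1 % error-rate threshold are assumptions, and the replica helper assumes traffic is split by pod ratio behind a shared Service. The step percentages and the 50 ms P99 budget come from the Context section.

```python
import math

CANARY_STEPS = [10, 30, 50, 100]  # traffic percentages from the rollout plan
P99_BUDGET_MS = 50.0              # latency budget from the Context section
MAX_ERROR_RATE = 0.01             # assumed threshold, not from the project


def next_canary_weight(current_weight: int, p99_ms: float, error_rate: float) -> int:
    """Return the next canary traffic percentage, or 0 to roll back."""
    # Roll back as soon as the canary breaks the latency or error budget.
    if p99_ms >= P99_BUDGET_MS or error_rate > MAX_ERROR_RATE:
        return 0
    # Otherwise advance to the first step above the current weight.
    for step in CANARY_STEPS:
        if step > current_weight:
            return step
    return 100  # already fully promoted


def canary_replicas(total_replicas: int, weight_pct: int) -> int:
    """Canary pod count when traffic share is approximated by pod ratio."""
    if weight_pct <= 0:
        return 0
    return math.ceil(total_replicas * weight_pct / 100)
```

With 10 total replicas, a healthy canary at 10 % would move to 30 %, which corresponds to 3 canary pods under the pod-ratio assumption.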
Reversal plan
`service.py` is a single file with a Pydantic input + a `predict` method. Replacement is bounded:
- Add `serving/fastapi_service.py` with the same contract:

  ```python
  from fastapi import FastAPI
  from pydantic import BaseModel

  app = FastAPI()

  class ChurnInput(BaseModel): ...

  @app.post("/predict")
  async def predict(input: ChurnInput) -> dict: ...
  ```
- Replace `bentoml containerize` with a hand-written Dockerfile.
- Update `k8s/deployment.yaml`'s image command (`bentoml serve` → `uvicorn fastapi_service:app`).
- Re-run the integration + load tests in `tests/`: the latency SLA assertions will catch any regression.
Estimated effort: 3-5 engineer-days for a tested swap. Reversible.
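A latency SLA assertion of the kind the reversal plan relies on might look like the sketch below. The project's actual tests are not shown here, so the helper names are hypothetical and the nearest-rank percentile is an assumed definition; only the 50 ms P99 budget comes from this document.

```python
import math
from typing import Sequence

P99_BUDGET_MS = 50.0  # latency budget from the Context section


def percentile(samples: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample covering pct% of the data."""
    if not samples:
        raise ValueError("no latency samples collected")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def assert_p99_within_budget(
    latencies_ms: Sequence[float], budget_ms: float = P99_BUDGET_MS
) -> float:
    """Fail if the measured P99 latency does not beat the budget."""
    p99 = percentile(latencies_ms, 99)
    assert p99 < budget_ms, f"P99 {p99:.1f} ms exceeds budget {budget_ms:.1f} ms"
    return p99
```

Running the same assertion against both the BentoML and the FastAPI service, under the same load profile, gives a like-for-like check that the swap did not regress latency.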
References
- `service.py`
- `bentofile.yaml`
- `k8s/{deployment,service,hpa,canary-deployment}.yaml`
- `.github/workflows/deploy.yml`
- `scripts/canary_promote.py`
- `tests/integration/test_api.py`
- ADR-003 (MLflow Registry: Bento imports models from this Registry)
- ADR-005 (Deprecated event-driven retraining — affects how new model versions reach Bento)