ADR-004: Model serving is BentoML, not TorchServe / Seldon / FastAPI-by-hand | PredictFlow Feature Store

Context

Module 03 deploys the registered scikit-learn / XGBoost churn model behind a REST API on Kubernetes. The serving layer must:

Package the model + its dependencies into a Docker image
Expose /predict with Pydantic-validated input
Auto-scale on CPU pressure (HPA)
Support canary rollouts (10 % → 30 % → 50 % → 100 %)
Hit the <50 ms P99 budget at 1k req/sec
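
For the HPA requirement, the replica count follows the standard Kubernetes scaling rule, desired = ceil(current × currentMetric / targetMetric), with no change while the ratio stays inside a tolerance band (0.1 by default). A minimal sketch of that arithmetic, assuming a single CPU-utilization metric and ignoring minReplicas/maxReplicas and stabilization windows; the function name and the numbers are illustrative, not taken from k8s/hpa.yaml:

```python
import math


def desired_replicas(current_replicas: int, current_cpu_pct: float,
                     target_cpu_pct: float, tolerance: float = 0.1) -> int:
    """Replica count the HPA rule yields for one CPU-utilization metric."""
    ratio = current_cpu_pct / target_cpu_pct
    # Inside the tolerance band the replica count is left unchanged.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    return math.ceil(current_replicas * ratio)


# Hypothetical load: 4 pods averaging 90 % CPU against a 60 % target.
print(desired_replicas(4, 90.0, 60.0))  # 6
# Ratio 1.05 is inside the tolerance band, so nothing changes.
print(desired_replicas(4, 63.0, 60.0))  # 4
```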

Classic options:

BentoML — ML-aware serving framework: model packaging, REST API, batching, GPU runners, OpenAPI spec, Docker image build.
TorchServe — PyTorch-only; the project uses scikit-learn / XGBoost.
Seldon Core — heavier-weight Kubernetes-native serving with inference graphs.
FastAPI by hand — write the API, write the Dockerfile, write the Pydantic models, write the model loader.

Decision

Adopt BentoML 1.3+.

# service.py
import bentoml
from pydantic import BaseModel

class ChurnInput(BaseModel):
    customer_id: int
    tenure_months: float
    monthly_charges: float
    # ... rest of features

@bentoml.service(traffic={"timeout": 30})
class ChurnPredictor:
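    # Class-level reference so `bentoml build` packages the model with the Bento.
    bento_model = bentoml.models.get("churn-predictor:latest")

    def __init__(self) -> None:
        # Load with the bentoml module the model was stored under (bentoml.sklearn
        # here; bentoml.xgboost or bentoml.mlflow otherwise).
        self.model = bentoml.sklearn.load_model(self.bento_model)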

    @bentoml.api
    async def predict(self, input: ChurnInput) -> dict:
        features = await self._fetch_features(input.customer_id)
        score = self.model.predict_proba([features])[0][1]
        return {"churn_probability": float(score)}

# Module 03 deploy flow
bentoml build  # produces a Bento package
bentoml containerize churn_predictor:latest --image-tag $REGISTRY/churn:$SHA
docker push ...
kubectl set image deployment/churn-predictor app=$REGISTRY/churn:$SHA

Tradeoffs we accept

Lever	BentoML (chosen)	TorchServe	Seldon	FastAPI by hand
Framework support	scikit-learn, XGBoost, PyTorch, TF, ONNX, custom	PyTorch only	All (via runtimes)	All
Day-1 setup	`bentoml.service` decorator	TorchScript packaging	Kubernetes operator + CRDs	Hand-roll everything
OpenAPI spec	Auto-generated	Manual	Auto-generated	Manual
Pydantic validation	First-class	Manual	Manual	First-class
Adaptive batching	Built-in	Built-in	Via inference graph	Build it
Docker image build	`bentoml containerize`	Manual	Manual	Manual
K8s deploy footprint	Plain `Deployment`	Plain `Deployment`	CRDs + operator	Plain `Deployment`
Model registry	MLflow + S3 + Bento store	TorchScript files	Custom	Custom
Inference graphs (multi-step)	Limited	None	Strong	Build it
Vendor lock-in	None (open-source)	None	None	None

We optimize for scikit-learn/XGBoost first-class support + zero-handroll Docker build. TorchServe is a non-starter (no sklearn). Seldon is overkill for a single-model service. FastAPI by hand is what most teams start with and rebuild as BentoML once they realize they're rebuilding BentoML.

Consequences (positive)

Module 03 ships a tested API in <100 lines of Python (service.py).
bentoml containerize produces a multi-stage Docker image with the Python runtime, dependencies, and model baked in — single command.
The Pydantic input model is the same in service.py, the integration tests, and the OpenAPI doc, so the three cannot drift apart.
Adaptive batching is built in but opt-in per endpoint (@bentoml.api(batchable=True)); batching predictions at the API layer gives a 2-3× throughput improvement at the same latency budget.
The MLflow Model Registry → Bento store handoff is documented in Module 03's service.py (bentoml.mlflow.import_model(...)).
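
To make the batching consequence concrete, the idea is that concurrent requests are held for a short window and scored in one model call. The sketch below is a fixed-window micro-batcher in plain asyncio; it is not BentoML's adaptive algorithm, and MicroBatcher, fake_predict_batch, and the window sizes are hypothetical names and values chosen for illustration:

```python
import asyncio
import contextlib
from typing import Callable, List, Sequence

Row = Sequence[float]


class MicroBatcher:
    """Groups concurrent requests so the model is called once per batch."""

    def __init__(self, predict_batch: Callable[[List[Row]], List[float]],
                 max_batch_size: int = 32, max_wait_ms: float = 5.0) -> None:
        self._predict_batch = predict_batch
        self._max_batch_size = max_batch_size
        self._max_wait_s = max_wait_ms / 1000.0
        self._queue: asyncio.Queue = asyncio.Queue()

    async def submit(self, row: Row) -> float:
        """Enqueue one request and wait for its score."""
        future = asyncio.get_running_loop().create_future()
        await self._queue.put((row, future))
        return await future

    async def run(self) -> None:
        """Take one request, then fill the batch until full or the window closes."""
        loop = asyncio.get_running_loop()
        while True:
            batch = [await self._queue.get()]
            deadline = loop.time() + self._max_wait_s
            while len(batch) < self._max_batch_size:
                remaining = deadline - loop.time()
                if remaining <= 0:
                    break
                try:
                    batch.append(await asyncio.wait_for(self._queue.get(), remaining))
                except asyncio.TimeoutError:
                    break
            try:
                scores = self._predict_batch([row for row, _ in batch])
            except Exception as exc:
                # Propagate a model failure to every caller in the batch.
                for _, future in batch:
                    future.set_exception(exc)
                continue
            for (_, future), score in zip(batch, scores):
                future.set_result(score)


async def demo():
    batch_sizes = []

    def fake_predict_batch(rows):
        # Stand-in for model.predict_proba(rows)[:, 1]; records each batch size.
        batch_sizes.append(len(rows))
        return [row[0] / 100.0 for row in rows]

    batcher = MicroBatcher(fake_predict_batch, max_batch_size=8, max_wait_ms=20.0)
    worker = asyncio.create_task(batcher.run())
    scores = await asyncio.gather(*(batcher.submit([float(i)]) for i in range(8)))
    worker.cancel()
    with contextlib.suppress(asyncio.CancelledError):
        await worker
    return scores, batch_sizes


scores, batch_sizes = asyncio.run(demo())
print(scores, batch_sizes)
```

The throughput gain comes from amortizing per-call overhead across the batch; the cost is up to max_wait_ms of added latency per request, which has to fit inside the P99 budget.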

Consequences (negative)

No native multi-model inference graph. If you need feature-fetch → model-A → fan-out-to-model-B → join, you build it yourself in service.py. Mitigation: Module 03 shows the pattern for the single-model path; inference graphs are a Seldon decision point if the platform grows.
Managed GPU runners are a paid BentoML Cloud feature. OSS BentoML supports GPU via nvidia-docker but the orchestration is BYO. Mitigation: this project is CPU-only (sklearn + XGBoost), so no impact.
Bento store is a new mental model. Learners have to internalize "Bentos contain Models contain Frameworks". Mitigation: Module 03's first 2 lessons walk through the abstraction explicitly.
No native A/B framework. Canary rollouts use Kubernetes-level splits (k8s/canary-deployment.yaml + scripts/canary_promote.py), not BentoML's runtime. Mitigation: this is the production-correct pattern (k8s splits are observable in the same Prometheus labels).
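
The Kubernetes-level canary can be pictured as a loop over the 10 % → 30 % → 50 % → 100 % schedule with a health gate at each step. This is a sketch of that control flow only, not the contents of scripts/canary_promote.py; set_weight and is_healthy are hypothetical hooks (for example, patching the traffic split and querying Prometheus):

```python
from typing import Callable, List, Sequence

CANARY_STEPS = (10, 30, 50, 100)  # percent of traffic, per the rollout plan


def promote_canary(set_weight: Callable[[int], None],
                   is_healthy: Callable[[int], bool],
                   steps: Sequence[int] = CANARY_STEPS) -> int:
    """Advance the canary step by step; roll back to 0 % on the first failed check.

    Returns the final canary traffic percentage.
    """
    for percent in steps:
        set_weight(percent)
        if not is_healthy(percent):
            set_weight(0)  # roll back: all traffic returns to the stable version
            return 0
    return steps[-1]


# Hypothetical run: the health check starts failing at the 50 % step.
history: List[int] = []
final = promote_canary(history.append, lambda percent: percent < 50)
print(final, history)  # 0 [10, 30, 50, 0]
```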

Reversal plan

service.py is a single file with a Pydantic input + a predict method. Replacement is bounded:

Add serving/fastapi_service.py with the same contract:

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class ChurnInput(BaseModel): ...  # same fields as in service.py

@app.post("/predict")
async def predict(input: ChurnInput) -> dict: ...

Replace bentoml containerize with a hand-written Dockerfile.
Update the container command in k8s/deployment.yaml (bentoml serve → uvicorn fastapi_service:app).
Re-run the integration + load tests in tests/ — the latency SLA assertions will catch any regression.
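
A latency SLA assertion of the kind the last step relies on could look like the following. It is a sketch under stated assumptions: the actual checks in tests/ are not shown in this ADR, the helper names are hypothetical, the sample latencies are made up, and P99 is computed with the nearest-rank method:

```python
from typing import Sequence


def p99(samples_ms: Sequence[float]) -> float:
    """Nearest-rank 99th percentile of latency samples, in milliseconds."""
    if not samples_ms:
        raise ValueError("no latency samples")
    ordered = sorted(samples_ms)
    # ceil(0.99 * n) in integer arithmetic, to avoid float rounding surprises.
    rank = -(-99 * len(ordered) // 100)
    return ordered[rank - 1]


def assert_latency_sla(samples_ms: Sequence[float], budget_ms: float = 50.0) -> None:
    """Fail if the observed P99 is not under the latency budget."""
    observed = p99(samples_ms)
    assert observed < budget_ms, f"P99 {observed:.1f} ms exceeds {budget_ms} ms budget"


# Made-up load-test samples: 1000 requests, P99 lands on a 45 ms request.
baseline = [10.0] * 980 + [45.0] * 15 + [120.0] * 5
assert_latency_sla(baseline)

# Made-up regression: 1.5 % of requests now take 120 ms, so P99 breaches the budget.
regressed = [10.0] * 980 + [45.0] * 5 + [120.0] * 15
try:
    assert_latency_sla(regressed)
except AssertionError as err:
    print(err)
```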

Estimated effort: 3-5 engineer-days for a tested swap. Reversible.

References

service.py
bentofile.yaml
k8s/{deployment,service,hpa,canary-deployment}.yaml
.github/workflows/deploy.yml
scripts/canary_promote.py
tests/integration/test_api.py
ADR-003 (MLflow Registry — Bento imports models from this Registry)
ADR-005 (Deprecated event-driven retraining — affects how new model versions reach Bento)