Model serving is BentoML, not TorchServe / Seldon / FastAPI-by-hand

Status: Accepted · Project: PredictFlow Feature Store · Module: 03 (CI/CD & Model Serving)
By AI-DE Engineering Team · Stakeholders: ML platform engineer, DevOps lead, infra reviewer

Context

Module 03 deploys the registered scikit-learn / XGBoost churn model behind a REST API on Kubernetes. The serving layer must:

  • Package the model + its dependencies into a Docker image
  • Expose /predict with Pydantic-validated input
  • Auto-scale on CPU pressure (HPA)
  • Support canary rollouts (10 % → 30 % → 50 % → 100 %)
  • Hit the <50 ms P99 budget at 1k req/sec
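
The CPU-based autoscaling requirement can be reasoned about with the scaling rule Kubernetes documents for the HPA: desired replicas = ceil(current replicas × current metric / target metric), skipped when the ratio is within a small tolerance of 1.0. The sketch below reproduces that arithmetic. The 60 % CPU target and the replica bounds are illustrative assumptions, not values taken from k8s/hpa.yaml.

```python
import math


def desired_replicas(
    current_replicas: int,
    current_cpu_pct: float,
    target_cpu_pct: float = 60.0,   # assumed target, not from k8s/hpa.yaml
    min_replicas: int = 2,          # assumed lower bound
    max_replicas: int = 10,         # assumed upper bound
    tolerance: float = 0.1,         # Kubernetes' default HPA tolerance
) -> int:
    """Apply the HPA rule: ceil(current * currentMetric / targetMetric), clamped."""
    ratio = current_cpu_pct / target_cpu_pct
    # Inside the tolerance band the HPA leaves the replica count alone.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    wanted = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, wanted))


if __name__ == "__main__":
    for cpu in (30, 63, 90, 150):
        print(f"4 replicas at {cpu}% CPU -> {desired_replicas(4, cpu)} replicas")
```

Because the rule is proportional, a sustained doubling of CPU utilisation roughly doubles the pod count until max_replicas caps it.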

Classic options:

  1. BentoML — ML-aware serving framework: model packaging, REST API, batching, GPU runners, OpenAPI spec, Docker image build.
  2. TorchServe — PyTorch-only; the project uses scikit-learn / XGBoost.
  3. Seldon Core — heavier-weight Kubernetes-native serving with inference graphs.
  4. FastAPI by hand — write the API, write the Dockerfile, write the Pydantic models, write the model loader.

Decision

Adopt BentoML 1.3+.

# service.py
import bentoml
from pydantic import BaseModel

class ChurnInput(BaseModel):
    customer_id: int
    tenure_months: float
    monthly_charges: float
    # ... rest of features

@bentoml.service(traffic={"timeout": 30})  # per-request timeout, in seconds
class ChurnPredictor:
    # Class attribute: the model is loaded once per worker, not once per request.
    model = bentoml.models.load_model("churn-predictor:latest")

    @bentoml.api
    async def predict(self, input: ChurnInput) -> dict:
        # _fetch_features (the feature-store lookup) is not shown in this excerpt.
        features = await self._fetch_features(input.customer_id)
        # predict_proba returns one row per input; index 1 is the positive
        # (churn) class, assuming 0/1 labels.
        score = self.model.predict_proba([features])[0][1]
        return {"churn_probability": float(score)}

# Module 03 deploy flow
bentoml build  # produces a Bento package
bentoml containerize churn_predictor:latest --image-tag "$REGISTRY/churn:$SHA"
docker push ...
kubectl set image deployment/churn-predictor app="$REGISTRY/churn:$SHA"

Tradeoffs we accept

| Lever | BentoML (chosen) | TorchServe | Seldon | FastAPI by hand |
| --- | --- | --- | --- | --- |
| Framework support | scikit-learn, XGBoost, PyTorch, TF, ONNX, custom | PyTorch only | All (via runtimes) | All |
| Day-1 setup | bentoml.service decorator | TorchScript packaging | Kubernetes operator + CRDs | Hand-roll everything |
| OpenAPI spec | Auto-generated | Manual | Auto-generated | Auto-generated (FastAPI built-in) |
| Pydantic validation | First-class | Manual | Manual | First-class |
| Adaptive batching | Built-in | Built-in | Via inference graph | Build it |
| Docker image build | bentoml containerize | Manual | Manual | Manual |
| K8s deploy footprint | Plain Deployment | Plain Deployment | CRDs + operator | Plain Deployment |
| Model registry | MLflow + S3 + Bento store | TorchScript files | Custom | Custom |
| Inference graphs (multi-step) | Limited | None | Strong | Build it |
| Vendor lock-in | None (open-source) | None | None | None |

We optimize for scikit-learn/XGBoost first-class support + zero-handroll Docker build. TorchServe is a non-starter (no sklearn). Seldon is overkill for a single-model service. FastAPI by hand is what most teams start with and rebuild as BentoML once they realize they're rebuilding BentoML.

Consequences (positive)

  • Module 03 ships a tested API in <100 lines of Python (service.py).
  • bentoml containerize produces a multi-stage Docker image with the Python runtime, dependencies, and model baked in — single command.
  • The Pydantic input model is the same in service.py, the integration tests, and the OpenAPI doc — drift-impossible.
  • Adaptive batching is built in and enabled per endpoint (batchable=True on @bentoml.api); batching predictions at the API layer gives a 2-3× throughput improvement at the same latency budget.
  • The MLflow Model Registry → Bento store handoff is documented in Module 03's service.py (bentoml.mlflow.import_model(...)).
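
To make the batching claim concrete, here is a toy illustration of the idea behind adaptive batching: concurrent requests are queued, grouped until either a size cap or a short wait deadline is reached, and scored with a single model call. This is a standard-library sketch of the general technique, not BentoML's scheduler. MicroBatcher, max_batch_size, max_wait_s, and the stand-in scoring function are all hypothetical names chosen for the example.

```python
import asyncio


class MicroBatcher:
    """Toy micro-batcher: groups concurrent requests into one model call."""

    def __init__(self, predict_batch, max_batch_size=4, max_wait_s=0.05):
        self._predict_batch = predict_batch    # callable: list of rows -> list of scores
        self._max_batch_size = max_batch_size
        self._max_wait_s = max_wait_s
        self._queue = asyncio.Queue()
        self.batch_sizes = []                  # recorded for inspection only

    async def submit(self, row):
        """Enqueue one request and wait for its individual result."""
        fut = asyncio.get_running_loop().create_future()
        await self._queue.put((row, fut))
        return await fut

    async def run(self):
        """Worker loop: collect a batch, score it once, fan results back out."""
        loop = asyncio.get_running_loop()
        while True:
            batch = [await self._queue.get()]  # block until the first request arrives
            deadline = loop.time() + self._max_wait_s
            while len(batch) < self._max_batch_size:
                remaining = deadline - loop.time()
                if remaining <= 0:
                    break
                try:
                    batch.append(await asyncio.wait_for(self._queue.get(), remaining))
                except asyncio.TimeoutError:
                    break
            scores = self._predict_batch([row for row, _ in batch])
            self.batch_sizes.append(len(batch))
            for (_, fut), score in zip(batch, scores):
                fut.set_result(score)


async def demo():
    # Stand-in for a vectorised model call such as predict_proba on many rows.
    batcher = MicroBatcher(lambda rows: [sum(r) for r in rows])
    worker = asyncio.create_task(batcher.run())
    results = await asyncio.gather(*(batcher.submit([i, i]) for i in range(8)))
    worker.cancel()
    return results, batcher.batch_sizes


if __name__ == "__main__":
    results, sizes = asyncio.run(demo())
    print("results:", results)
    print("batch sizes:", sizes)
```

The throughput gain comes from amortising per-call overhead across the batch. The wait deadline bounds how much latency any single request can pay for that.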

Consequences (negative)

  • No native multi-model inference graph. If you need feature-fetch → model-A → fan-out-to-model-B → join, you build it yourself in service.py. Mitigation: Module 03 shows the pattern for the single-model path; inference graphs are a Seldon decision point if the platform grows.
  • GPU runners are paid in BentoML Cloud. OSS BentoML supports GPU via nvidia-docker but the orchestration is BYO. Mitigation: this project is CPU-only (sklearn + XGBoost), so no impact.
  • Bento store is a new mental model. Learners have to internalize "Bentos contain Models contain Frameworks". Mitigation: Module 03's first 2 lessons walk through the abstraction explicitly.
  • No native A/B framework. Canary rollouts use Kubernetes-level splits (k8s/canary-deployment.yaml + scripts/canary_promote.py), not BentoML's runtime. Mitigation: this is the production-correct pattern (k8s splits are observable in the same Prometheus labels).
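
One common way to realise a Kubernetes-level split with plain Deployments is to run a stable and a canary Deployment behind the same Service and size their replica counts in proportion to the desired traffic share. The sketch below shows that arithmetic for the 10 % → 30 % → 50 % → 100 % plan. It assumes the Service spreads requests evenly across pods. replica_split is a hypothetical helper, not the contents of scripts/canary_promote.py.

```python
import math
from typing import Tuple

CANARY_STEPS = (10, 30, 50, 100)  # percentages from the rollout plan


def replica_split(total_replicas: int, canary_percent: int) -> Tuple[int, int]:
    """Return (stable, canary) replica counts approximating the traffic split."""
    if total_replicas < 1:
        raise ValueError("total_replicas must be at least 1")
    if not 0 <= canary_percent <= 100:
        raise ValueError("canary_percent must be between 0 and 100")
    # Round up so that any non-zero percentage gets at least one canary pod.
    canary = math.ceil(total_replicas * canary_percent / 100)
    return total_replicas - canary, canary


if __name__ == "__main__":
    for step in CANARY_STEPS:
        stable, canary = replica_split(10, step)
        print(f"{step:>3}% canary -> stable={stable}, canary={canary}")
```

Replica-ratio splitting is only as fine-grained as the pod count: with 4 replicas, the smallest possible canary share is 25 %, so the first 10 % step needs at least 10 pods to be honoured exactly.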

Reversal plan

service.py is a single file with a Pydantic input + a predict method. Replacement is bounded:

  1. Add serving/fastapi_service.py with the same contract:
    from fastapi import FastAPI
    from pydantic import BaseModel
    app = FastAPI()
    class ChurnInput(BaseModel): ...  # same fields as in service.py
    @app.post("/predict")
    async def predict(input: ChurnInput) -> dict: ...
    
  2. Replace bentoml containerize with a hand-written Dockerfile.
  3. Update k8s/deployment.yaml's image command (bentoml serve → uvicorn fastapi_service:app).
  4. Re-run the integration + load tests in tests/ — the latency SLA assertions will catch any regression.

Estimated effort: 3-5 engineer-days for a tested swap. Reversible.
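
The latency SLA assertion in step 4 can be as small as a percentile check over the latencies recorded by a load test. The sketch below uses the nearest-rank definition of P99 and the <50 ms budget from the Context section. The function names and the sample data are hypothetical, and the real assertions in tests/ may be written differently.

```python
import math
from typing import Sequence


def p99(latencies_ms: Sequence[float]) -> float:
    """Nearest-rank 99th percentile of the recorded latencies."""
    if not latencies_ms:
        raise ValueError("need at least one latency sample")
    ordered = sorted(latencies_ms)
    rank = math.ceil(99 * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


def meets_budget(latencies_ms: Sequence[float], budget_ms: float = 50.0) -> bool:
    """True when P99 is strictly under the budget (the <50 ms requirement)."""
    return p99(latencies_ms) < budget_ms


if __name__ == "__main__":
    # Illustrative sample: 1,000 requests, 10 of them slow.
    samples = [10.0] * 990 + [80.0] * 10
    print("P99 (ms):", p99(samples))
    print("within budget:", meets_budget(samples))
```

Running the same check against both the BentoML and the FastAPI image gives a like-for-like comparison before the Deployment is switched over.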

References

  • service.py
  • bentofile.yaml
  • k8s/{deployment,service,hpa,canary-deployment}.yaml
  • .github/workflows/deploy.yml
  • scripts/canary_promote.py
  • tests/integration/test_api.py
  • ADR-003 (MLflow Registry — Bento imports models from this Registry)
  • ADR-005 (Deprecated event-driven retraining — affects how new model versions reach Bento)