MLflow Model Serving
MLflow provides comprehensive model serving capabilities to deploy your machine learning models as REST APIs for real-time inference. Whether you're working with MLflow OSS or Databricks Managed MLflow, you can serve models locally, in cloud environments, or through managed endpoints.
Overview
MLflow serving transforms your trained models into production-ready inference servers that can handle HTTP requests and return predictions. The serving infrastructure supports various deployment patterns, from local development servers to scalable cloud deployments.
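For context, here is a minimal, hedged sketch of producing a servable model in the first place, assuming scikit-learn and a local tracking setup; the registry name "iris_model" is a placeholder.
import mlflow
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

# Train a small example model
X, y = load_iris(return_X_y=True)
model = RandomForestClassifier(n_estimators=10).fit(X, y)

# Log (and optionally register) the model so it can be served later;
# "iris_model" is an arbitrary placeholder name
with mlflow.start_run():
    model_info = mlflow.sklearn.log_model(
        model, "model", registered_model_name="iris_model"
    )

# This URI can be passed to `mlflow models serve -m ...`
print(model_info.model_uri)
Once a model is logged like this, any of the serving options below can pick it up by URI.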
Key Features
- 🔌 REST API Endpoints: Automatic generation of standardized REST endpoints for model inference
- 🧬 Multiple Model Formats: Support for various ML frameworks through MLflow's flavor system
- 🧠 Custom Applications: Build sophisticated serving applications with custom logic and preprocessing
- 📈 Scalable Deployment: Deploy to various targets including local servers, cloud platforms, and Kubernetes
- 🗂️ Model Registry Integration: Seamless integration with MLflow Model Registry for version management
Serving Options
MLflow OSS Serving
Open-source MLflow provides several serving options:
- Local Serving: Quick deployment for development and testing using `mlflow models serve`
- Custom PyFunc Models: Advanced serving with custom preprocessing, postprocessing, and business logic (see the sketch after this list)
- Docker Deployment: Containerized serving for consistent deployment across environments
- Cloud Platform Integration: Deploy to AWS SageMaker, Azure ML, and other cloud services
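To illustrate the Custom PyFunc option, the sketch below wraps toy preprocessing and prediction logic in mlflow.pyfunc.PythonModel and logs it for serving; the scaling constant, feature names, and artifact path are hypothetical.
import mlflow
import pandas as pd

class ScaledSumModel(mlflow.pyfunc.PythonModel):
    """Toy custom pyfunc: rescale inputs, then apply simple business logic."""

    def load_context(self, context):
        # Load artifacts (e.g. a fitted estimator) here in a real model
        self.scale = 0.01  # hypothetical preprocessing constant

    def predict(self, context, model_input: pd.DataFrame):
        scaled = model_input * self.scale  # preprocessing step
        return scaled.sum(axis=1)  # stand-in for real prediction logic

with mlflow.start_run():
    info = mlflow.pyfunc.log_model(
        artifact_path="custom_model",
        python_model=ScaledSumModel(),
        input_example=pd.DataFrame({"feature1": [1.0], "feature2": [2.0]}),
    )
print(f"Serve with: mlflow models serve -m {info.model_uri}")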
Databricks Managed MLflow
Databricks provides additional managed serving capabilities:
- Model Serving Endpoints: Fully managed, auto-scaling endpoints with built-in monitoring (a query example follows this list)
- Foundation Model APIs: Direct access to foundation models through pay-per-token endpoints
- Advanced Security: Enterprise-grade security with access controls and audit logging
- Real-time Monitoring: Built-in metrics, logging, and performance monitoring
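As a rough sketch of querying a Databricks Model Serving endpoint from Python, the example below uses the MLflow deployments client; the endpoint name "my-endpoint" is a placeholder, and Databricks authentication (for example DATABRICKS_HOST and DATABRICKS_TOKEN) is assumed to be configured.
from mlflow.deployments import get_deploy_client

# Assumes Databricks credentials are already configured in the environment
client = get_deploy_client("databricks")

# "my-endpoint" is a placeholder for an existing Model Serving endpoint
response = client.predict(
    endpoint="my-endpoint",
    inputs={
        "dataframe_split": {
            "columns": ["feature1", "feature2"],
            "data": [[1.0, 2.0]],
        }
    },
)
print(response)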
Quick Start
Basic Model Serving
For a simple model serving setup (by default the serve command builds a new virtual environment for the model; pass --env-manager local to reuse the current environment):
# Serve a logged model
mlflow models serve -m "models:/<model-id>" -p 5000
# Serve a registered model
mlflow models serve -m "models:/<model-name>/<model-version>" -p 5000
# Serve a model from local path
mlflow models serve -m ./path/to/model -p 5000
Making Predictions
Once your model is served, you can make predictions via HTTP requests:
curl -X POST http://localhost:5000/invocations \
-H "Content-Type: application/json" \
-d '{"inputs": [[1, 2, 3, 4]]}'
Architecture
MLflow serving uses a standardized architecture:
- 🧠 Model Loading: Models are loaded using their respective MLflow flavors
- 🌐 HTTP Server: A built-in inference server handles incoming requests, with MLServer available as an alternative serving backend
- 🔄 Prediction Pipeline: Requests are processed through the model's predict method
- 📦 Response Formatting: Results are returned in standardized JSON format
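To make the request/response contract concrete, here is a hedged sketch of common JSON payload shapes for the /invocations endpoint and how to read the predictions field from the response; the URL and feature names are placeholders, and the exact accepted formats depend on the model signature and MLflow version.
import json
import requests

# Two common payload shapes accepted by the /invocations endpoint
payload_split = {
    "dataframe_split": {"columns": ["feature1", "feature2"], "data": [[1.0, 2.0]]}
}
payload_records = {"dataframe_records": [{"feature1": 1.0, "feature2": 2.0}]}

resp = requests.post(
    "http://localhost:5000/invocations",
    headers={"Content-Type": "application/json"},
    data=json.dumps(payload_split),
    timeout=10,
)

# Recent MLflow scoring servers wrap outputs as {"predictions": ...}
result = resp.json()
print(result.get("predictions", result))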
Best Practices
Performance Optimization
- Use appropriate hardware resources based on model requirements
- Implement request batching for improved throughput (see the batching sketch after this list)
- Consider model quantization for faster inference
- Monitor memory usage and optimize accordingly
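One low-effort form of batching happens on the client side: send many rows in a single /invocations request instead of one request per row. A minimal sketch, with placeholder URL and feature names:
import json
import requests

# Score many rows in one request to amortize HTTP and serialization overhead
rows = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
payload = {
    "dataframe_split": {
        "columns": ["feature1", "feature2", "feature3"],
        "data": rows,
    }
}

resp = requests.post(
    "http://localhost:5000/invocations",
    headers={"Content-Type": "application/json"},
    data=json.dumps(payload),
    timeout=30,
)
print(resp.json())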
Security Considerations
- Implement proper authentication and authorization
- Use HTTPS in production environments
- Validate input data to prevent security vulnerabilities (see the validation sketch after this list)
- Regularly update dependencies and monitor for security issues
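For input validation, one hedged approach is to check the request schema in a lightweight gateway before forwarding it to the model server; the sketch below uses FastAPI and Pydantic with hypothetical field names and bounds.
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel, Field

app = FastAPI()

class PredictionRequest(BaseModel):
    # Hypothetical schema: three bounded numeric features
    feature1: float = Field(..., ge=0.0, le=1000.0)
    feature2: float = Field(..., ge=0.0, le=1000.0)
    feature3: float = Field(..., ge=0.0, le=1000.0)

@app.post("/validated-predict")
async def validated_predict(request: PredictionRequest):
    # Malformed payloads are rejected with a 422 before reaching this point;
    # forward the validated row to the MLflow scoring server from here.
    row = [request.feature1, request.feature2, request.feature3]
    if any(value != value for value in row):  # reject NaN values explicitly
        raise HTTPException(status_code=400, detail="NaN values are not allowed")
    return {"validated_input": row}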
Monitoring and Observability
- Set up comprehensive logging for debugging and auditing
- Monitor key metrics like latency, throughput, and error rates
- Implement health checks for service reliability (see the health-check sketch after this list)
- Use distributed tracing for complex serving pipelines
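As a hedged example of a health check with a simple latency measurement against a locally served model, the sketch below polls the scoring server's /ping endpoint; the URL is a placeholder, and endpoint availability can vary by MLflow version.
import logging
import time
import requests

logging.basicConfig(level=logging.INFO)
SCORING_URL = "http://localhost:5000"  # locally served model

def check_health() -> bool:
    """Return True if the scoring server answers its health endpoint."""
    try:
        # The local scoring server exposes /ping (and /health on recent versions)
        resp = requests.get(f"{SCORING_URL}/ping", timeout=5)
        return resp.status_code == 200
    except requests.RequestException as exc:
        logging.error("Health check failed: %s", exc)
        return False

if __name__ == "__main__":
    start = time.perf_counter()
    healthy = check_health()
    latency_ms = (time.perf_counter() - start) * 1000
    logging.info("healthy=%s latency_ms=%.1f", healthy, latency_ms)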
Common Use Cases
Real-time Inference
Serve models for real-time predictions in web applications, mobile apps, or microservices architectures.
import requests
import json

# Single prediction
data = {
    "dataframe_split": {
        "columns": ["feature1", "feature2", "feature3"],
        "data": [[1.0, 2.0, 3.0]],
    }
}

response = requests.post(
    "http://localhost:5000/invocations",
    headers={"Content-Type": "application/json"},
    data=json.dumps(data),
)
print(response.json())
Batch Processing
Run batch inference on large datasets with controlled resource usage, either by loading the model as a Spark UDF (shown below) or by calling a deployed serving endpoint.
import mlflow
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, struct

# Parameters
model_name = "YOUR_MODEL_NAME"
model_version = "YOUR_MODEL_VERSION"
input_table = "YOUR_INPUT_TABLE_NAME"
output_table = "YOUR_OUTPUT_TABLE_NAME"

# In Databricks notebooks `spark` is predefined; otherwise create a session
spark = SparkSession.builder.getOrCreate()

# Load data
df = spark.table(input_table)

# Apply the model using a Spark UDF
model_uri = f"models:/{model_name}/{model_version}"
predict_udf = mlflow.pyfunc.spark_udf(spark, model_uri)

# Make predictions
predictions_df = df.withColumn(
    "prediction", predict_udf(struct([col(c) for c in df.columns]))
)

# Save results
predictions_df.write.mode("overwrite").saveAsTable(output_table)
See the Databricks batch inference documentation for built-in batch inference with AI Functions against a deployed serving endpoint.
A/B Testing
Deploy multiple model versions simultaneously to compare performance and gradually roll out improvements.
from fastapi import FastAPI, Request
from fastapi.responses import JSONResponse
import requests
import random
import json
import logging
import uvicorn

logging.basicConfig(level=logging.INFO)

app = FastAPI()

# Model endpoints
MODEL_A_URL = "http://localhost:5000/invocations"  # Current model
MODEL_B_URL = "http://localhost:5001/invocations"  # New model

# Traffic split configuration
TRAFFIC_SPLIT = {
    "model_a": 0.8,  # 80% to current model
    "model_b": 0.2,  # 20% to new model
}

@app.post("/predict")
async def predict(request: Request):
    # Route traffic based on the configured split
    rand = random.random()
    if rand < TRAFFIC_SPLIT["model_a"]:
        endpoint = MODEL_A_URL
        model_version = "A"
    else:
        endpoint = MODEL_B_URL
        model_version = "B"

    # Forward the request
    try:
        req_json = await request.json()
        response = requests.post(
            endpoint,
            headers={"Content-Type": "application/json"},
            data=json.dumps(req_json),
            timeout=30,
        )
        result = response.json()

        # Log for analysis
        logging.info(f"Model: {model_version}, Request: {req_json}, Response: {result}")

        return JSONResponse(
            content={"prediction": result, "model_version": model_version}
        )
    except Exception as e:
        logging.error(f"Error with model {model_version}: {e}")
        return JSONResponse(content={"error": str(e)}, status_code=500)

if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8080)
Multi-model Serving
Serve multiple models from a single endpoint for ensemble predictions or model routing based on input characteristics.
import mlflow
import pandas as pd
from fastapi import FastAPI, Request
from fastapi.responses import JSONResponse
import uvicorn

app = FastAPI()

# Load specialized models
models = {
    "fraud_detection": mlflow.pyfunc.load_model("models:/<fraud-model-id>"),
    "recommendation": mlflow.pyfunc.load_model("models:/<recommendation-model-id>"),
    "classification": mlflow.pyfunc.load_model("models:/<classification-model-id>"),
}

def route_request(input_data):
    """Route the request to the appropriate model based on input characteristics"""
    # Example routing logic
    if "transaction_amount" in input_data.columns:
        return "fraud_detection"
    elif "user_id" in input_data.columns and "item_id" in input_data.columns:
        return "recommendation"
    else:
        return "classification"

@app.post("/predict")
async def smart_predict(request: Request):
    data = await request.json()
    input_df = pd.DataFrame(data["data"], columns=data["columns"])

    # Route to the appropriate model
    model_name = route_request(input_df)
    model = models[model_name]

    # Make a prediction
    prediction = model.predict(input_df)

    return JSONResponse(
        content={"model_used": model_name, "prediction": prediction.tolist()}
    )

if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=5000)
Next Steps
- Explore Custom Applications to build advanced serving logic
- Understand ResponsesAgent for handling complex response patterns
For more detailed information about MLflow serving capabilities, refer to the official MLflow documentation and experiment with the examples provided in each section.