MLflow Model Serving
MLflow provides comprehensive model serving capabilities to deploy your machine learning models as REST APIs for real-time inference. Whether you're working with MLflow OSS or Databricks Managed MLflow, you can serve models locally, in cloud environments, or through managed endpoints.
Overview
MLflow serving transforms your trained models into production-ready inference servers that can handle HTTP requests and return predictions. The serving infrastructure supports various deployment patterns, from local development servers to scalable cloud deployments.
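For context, here is a minimal, hedged sketch of producing a servable model in the first place, assuming scikit-learn and a local tracking setup; the registry name "iris_model" is a placeholder.
import mlflow
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

# Train a small example model
X, y = load_iris(return_X_y=True)
model = RandomForestClassifier(n_estimators=10).fit(X, y)

# Log (and optionally register) the model so it can be served later;
# "iris_model" is an arbitrary placeholder name
with mlflow.start_run():
    model_info = mlflow.sklearn.log_model(
        model, "model", registered_model_name="iris_model"
    )

# This URI can be passed to `mlflow models serve -m ...`
print(model_info.model_uri)
Once a model is logged like this, any of the serving options below can pick it up by URI.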
Key Features
- 🔌 REST API Endpoints: Automatic generation of standardized REST endpoints for model inference
- 🧬 Multiple Model Formats: Support for various ML frameworks through MLflow's flavor system
- 🧠 Custom Applications: Build sophisticated serving applications with custom logic and preprocessing
- 📈 Scalable Deployment: Deploy to various targets including local servers, cloud platforms, and Kubernetes
- 🗂️ Model Registry Integration: Seamless integration with MLflow Model Registry for version management
Serving Options
MLflow OSS Serving
Open-source MLflow provides several serving options:
- Local Serving: Quick deployment for development and testing using `mlflow models serve`
- Custom PyFunc Models: Advanced serving with custom preprocessing, postprocessing, and business logic (see the sketch after this list)
- Docker Deployment: Containerized serving for consistent deployment across environments
- Cloud Platform Integration: Deploy to AWS SageMaker, Azure ML, and other cloud services
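To illustrate the Custom PyFunc option, the sketch below wraps toy preprocessing and prediction logic in mlflow.pyfunc.PythonModel and logs it for serving; the scaling constant, feature names, and artifact path are hypothetical.
import mlflow
import pandas as pd

class ScaledSumModel(mlflow.pyfunc.PythonModel):
    """Toy custom pyfunc: rescale inputs, then apply simple business logic."""

    def load_context(self, context):
        # Load artifacts (e.g. a fitted estimator) here in a real model
        self.scale = 0.01  # hypothetical preprocessing constant

    def predict(self, context, model_input: pd.DataFrame):
        scaled = model_input * self.scale  # preprocessing step
        return scaled.sum(axis=1)  # stand-in for real prediction logic

with mlflow.start_run():
    info = mlflow.pyfunc.log_model(
        artifact_path="custom_model",
        python_model=ScaledSumModel(),
        input_example=pd.DataFrame({"feature1": [1.0], "feature2": [2.0]}),
    )
print(f"Serve with: mlflow models serve -m {info.model_uri}")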
Databricks Managed MLflow
Databricks provides additional managed serving capabilities:
- Model Serving Endpoints: Fully managed, auto-scaling endpoints with built-in monitoring (a query example follows this list)
- Foundation Model APIs: Direct access to foundation models through pay-per-token endpoints
- Advanced Security: Enterprise-grade security with access controls and audit logging
- Real-time Monitoring: Built-in metrics, logging, and performance monitoring
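As a rough sketch of querying a Databricks Model Serving endpoint from Python, the example below uses the MLflow deployments client; the endpoint name "my-endpoint" is a placeholder, and Databricks authentication (for example DATABRICKS_HOST and DATABRICKS_TOKEN) is assumed to be configured.
from mlflow.deployments import get_deploy_client

# Assumes Databricks credentials are already configured in the environment
client = get_deploy_client("databricks")

# "my-endpoint" is a placeholder for an existing Model Serving endpoint
response = client.predict(
    endpoint="my-endpoint",
    inputs={
        "dataframe_split": {
            "columns": ["feature1", "feature2"],
            "data": [[1.0, 2.0]],
        }
    },
)
print(response)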
Quick Start
Basic Model Serving
For a simple model serving setup (by default the serve command builds a new virtual environment for the model; pass --env-manager local to reuse the current environment):
# Serve a logged model
mlflow models serve -m "models:/<model-id>" -p 5000
# Serve a registered model
mlflow models serve -m "models:/<model-name>/<model-version>" -p 5000
# Serve a model from local path
mlflow models serve -m ./path/to/model -p 5000
Making Predictions
Once your model is served, you can make predictions via HTTP requests:
curl -X POST http://localhost:5000/invocations \
-H "Content-Type: application/json" \
-d '{"inputs": [[1, 2, 3, 4]]}'
Architecture
MLflow serving uses a standardized architecture:
- 🧠 Model Loading: Models are loaded using their respective MLflow flavors
- 🌐 HTTP Server: A built-in inference server handles incoming requests, with MLServer available as an alternative serving backend
- 🔄 Prediction Pipeline: Requests are processed through the model's predict method
- 📦 Response Formatting: Results are returned in standardized JSON format
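To make the request/response contract concrete, here is a hedged sketch of common JSON payload shapes for the /invocations endpoint and how to read the predictions field from the response; the URL and feature names are placeholders, and the exact accepted formats depend on the model signature and MLflow version.
import json
import requests

# Two common payload shapes accepted by the /invocations endpoint
payload_split = {
    "dataframe_split": {"columns": ["feature1", "feature2"], "data": [[1.0, 2.0]]}
}
payload_records = {"dataframe_records": [{"feature1": 1.0, "feature2": 2.0}]}

resp = requests.post(
    "http://localhost:5000/invocations",
    headers={"Content-Type": "application/json"},
    data=json.dumps(payload_split),
    timeout=10,
)

# Recent MLflow scoring servers wrap outputs as {"predictions": ...}
result = resp.json()
print(result.get("predictions", result))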
Best Practices
Performance Optimization
- Use appropriate hardware resources based on model requirements
- Implement request batching for improved throughput (see the batching sketch after this list)
- Consider model quantization for faster inference
- Monitor memory usage and optimize accordingly
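One low-effort form of batching happens on the client side: send many rows in a single /invocations request instead of one request per row. A minimal sketch, with placeholder URL and feature names:
import json
import requests

# Score many rows in one request to amortize HTTP and serialization overhead
rows = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
payload = {
    "dataframe_split": {
        "columns": ["feature1", "feature2", "feature3"],
        "data": rows,
    }
}

resp = requests.post(
    "http://localhost:5000/invocations",
    headers={"Content-Type": "application/json"},
    data=json.dumps(payload),
    timeout=30,
)
print(resp.json())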
Security Considerations
- Implement proper authentication and authorization
- Use HTTPS in production environments
- Validate input data to prevent security vulnerabilities (see the validation sketch after this list)
- Regularly update dependencies and monitor for security issues
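For input validation, one hedged approach is to check the request schema in a lightweight gateway before forwarding it to the model server; the sketch below uses FastAPI and Pydantic with hypothetical field names and bounds.
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel, Field

app = FastAPI()

class PredictionRequest(BaseModel):
    # Hypothetical schema: three bounded numeric features
    feature1: float = Field(..., ge=0.0, le=1000.0)
    feature2: float = Field(..., ge=0.0, le=1000.0)
    feature3: float = Field(..., ge=0.0, le=1000.0)

@app.post("/validated-predict")
async def validated_predict(request: PredictionRequest):
    # Malformed payloads are rejected with a 422 before reaching this point;
    # forward the validated row to the MLflow scoring server from here.
    row = [request.feature1, request.feature2, request.feature3]
    if any(value != value for value in row):  # reject NaN values explicitly
        raise HTTPException(status_code=400, detail="NaN values are not allowed")
    return {"validated_input": row}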
Monitoring and Observability
- Set up comprehensive logging for debugging and auditing
- Monitor key metrics like latency, throughput, and error rates
- Implement health checks for service reliability (see the health-check sketch after this list)
- Use distributed tracing for complex serving pipelines
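As a hedged example of a health check with a simple latency measurement against a locally served model, the sketch below polls the scoring server's /ping endpoint; the URL is a placeholder, and endpoint availability can vary by MLflow version.
import logging
import time
import requests

logging.basicConfig(level=logging.INFO)
SCORING_URL = "http://localhost:5000"  # locally served model

def check_health() -> bool:
    """Return True if the scoring server answers its health endpoint."""
    try:
        # The local scoring server exposes /ping (and /health on recent versions)
        resp = requests.get(f"{SCORING_URL}/ping", timeout=5)
        return resp.status_code == 200
    except requests.RequestException as exc:
        logging.error("Health check failed: %s", exc)
        return False

if __name__ == "__main__":
    start = time.perf_counter()
    healthy = check_health()
    latency_ms = (time.perf_counter() - start) * 1000
    logging.info("healthy=%s latency_ms=%.1f", healthy, latency_ms)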
Common Use Cases
Real-time Inference
Serve models for real-time predictions in web applications, mobile apps, or microservices architectures.
import requests
import json

# Single prediction
data = {
    "dataframe_split": {
        "columns": ["feature1", "feature2", "feature3"],
        "data": [[1.0, 2.0, 3.0]],
    }
}

response = requests.post(
    "http://localhost:5000/invocations",
    headers={"Content-Type": "application/json"},
    data=json.dumps(data),
)
print(response.json())
Batch Processing
Run batch inference on large datasets with controlled resource usage, either by loading the model as a Spark UDF (shown below) or by calling a deployed serving endpoint.
import mlflow
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, struct

# Parameters
model_name = "YOUR_MODEL_NAME"
model_version = "YOUR_MODEL_VERSION"
input_table = "YOUR_INPUT_TABLE_NAME"
output_table = "YOUR_OUTPUT_TABLE_NAME"

# In Databricks notebooks `spark` is predefined; otherwise create a session
spark = SparkSession.builder.getOrCreate()

# Load data
df = spark.table(input_table)

# Apply the model using a Spark UDF
model_uri = f"models:/{model_name}/{model_version}"
predict_udf = mlflow.pyfunc.spark_udf(spark, model_uri)

# Make predictions
predictions_df = df.withColumn(
    "prediction", predict_udf(struct([col(c) for c in df.columns]))
)

# Save results
predictions_df.write.mode("overwrite").saveAsTable(output_table)
See the Databricks batch inference documentation for built-in batch inference with AI Functions against a deployed serving endpoint.
A/B Testing
Deploy multiple model versions simultaneously to compare performance and gradually roll out improvements.
from fastapi import FastAPI, Request
from fastapi.responses import JSONResponse
import requests
import random
import json
import logging
import uvicorn

logging.basicConfig(level=logging.INFO)

app = FastAPI()

# Model endpoints
MODEL_A_URL = "http://localhost:5000/invocations"  # Current model
MODEL_B_URL = "http://localhost:5001/invocations"  # New model

# Traffic split configuration
TRAFFIC_SPLIT = {
    "model_a": 0.8,  # 80% to current model
    "model_b": 0.2,  # 20% to new model
}

@app.post("/predict")
async def predict(request: Request):
    # Route traffic based on the configured split
    rand = random.random()
    if rand < TRAFFIC_SPLIT["model_a"]:
        endpoint = MODEL_A_URL
        model_version = "A"
    else:
        endpoint = MODEL_B_URL
        model_version = "B"

    # Forward the request
    try:
        req_json = await request.json()
        response = requests.post(
            endpoint,
            headers={"Content-Type": "application/json"},
            data=json.dumps(req_json),
            timeout=30,
        )
        result = response.json()

        # Log for analysis
        logging.info(f"Model: {model_version}, Request: {req_json}, Response: {result}")

        return JSONResponse(
            content={"prediction": result, "model_version": model_version}
        )
    except Exception as e:
        logging.error(f"Error with model {model_version}: {e}")
        return JSONResponse(content={"error": str(e)}, status_code=500)

if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8080)
Multi-model Serving
Serve multiple models from a single endpoint for ensemble predictions or model routing based on input characteristics.
import mlflow
import pandas as pd
from fastapi import FastAPI, Request
from fastapi.responses import JSONResponse
import uvicorn

app = FastAPI()

# Load specialized models
models = {
    "fraud_detection": mlflow.pyfunc.load_model("models:/<fraud-model-id>"),
    "recommendation": mlflow.pyfunc.load_model("models:/<recommendation-model-id>"),
    "classification": mlflow.pyfunc.load_model("models:/<classification-model-id>"),
}

def route_request(input_data):
    """Route the request to the appropriate model based on input characteristics"""
    # Example routing logic
    if "transaction_amount" in input_data.columns:
        return "fraud_detection"
    elif "user_id" in input_data.columns and "item_id" in input_data.columns:
        return "recommendation"
    else:
        return "classification"

@app.post("/predict")
async def smart_predict(request: Request):
    data = await request.json()
    input_df = pd.DataFrame(data["data"], columns=data["columns"])

    # Route to the appropriate model
    model_name = route_request(input_df)
    model = models[model_name]

    # Make a prediction
    prediction = model.predict(input_df)

    return JSONResponse(
        content={"model_used": model_name, "prediction": prediction.tolist()}
    )

if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=5000)
Next Steps
- Explore Custom Applications to build advanced serving logic
- Understand ResponsesAgent for handling complex response patterns
For more detailed information about MLflow serving capabilities, refer to the official MLflow documentation and experiment with the examples provided in each section.