MLflow Model Serving

MLflow provides comprehensive model serving capabilities to deploy your machine learning models as REST APIs for real-time inference. Whether you're working with MLflow OSS or Databricks Managed MLflow, you can serve models locally, in cloud environments, or through managed endpoints.

Overview

MLflow serving transforms your trained models into production-ready inference servers that can handle HTTP requests and return predictions. The serving infrastructure supports various deployment patterns, from local development servers to scalable cloud deployments.

Key Features

  • 🔌 REST API Endpoints: Automatic generation of standardized REST endpoints for model inference
  • 🧬 Multiple Model Formats: Support for various ML frameworks through MLflow's flavor system
  • 🧠 Custom Applications: Build sophisticated serving applications with custom logic and preprocessing
  • 📈 Scalable Deployment: Deploy to various targets including local servers, cloud platforms, and Kubernetes
  • 🗂️ Model Registry Integration: Seamless integration with MLflow Model Registry for version management (a short logging-and-registration sketch follows this list)
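
For example, a model can be logged through its framework flavor and registered in the Model Registry in a single call, after which it can be served by name. A minimal sketch: the registered model name "my-model" and the toy iris classifier below are placeholders, not part of MLflow's required setup.

import mlflow
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# Train a small placeholder model
X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000).fit(X, y)

with mlflow.start_run():
    # Log the model with the sklearn flavor and register it as "my-model"
    mlflow.sklearn.log_model(model, "model", registered_model_name="my-model")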

Serving Options

MLflow OSS Serving

Open-source MLflow provides several serving options:

  • Local Serving: Quick deployment for development and testing using mlflow models serve
  • Custom PyFunc Models: Advanced serving with custom preprocessing, postprocessing, and business logic (see the sketch after this list)
  • Docker Deployment: Containerized serving for consistent deployment across environments
  • Cloud Platform Integration: Deploy to AWS SageMaker, Azure ML, and other cloud services
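
As a sketch of the custom PyFunc option, the wrapper below adds a preprocessing step before delegating to an inner model; the fillna logic and the tiny placeholder regression are illustrative assumptions, not a prescribed pattern.

import mlflow
import pandas as pd
from sklearn.linear_model import LinearRegression


class PreprocessingModel(mlflow.pyfunc.PythonModel):
    """Custom pyfunc that cleans inputs before delegating to an inner model."""

    def __init__(self, inner_model):
        self.inner_model = inner_model

    def predict(self, context, model_input: pd.DataFrame):
        # Illustrative preprocessing: fill missing values before inference
        cleaned = model_input.fillna(0.0)
        return self.inner_model.predict(cleaned)


# Train a tiny placeholder model and log the wrapped version for serving
inner = LinearRegression().fit([[0.0], [1.0], [2.0]], [0.0, 1.0, 2.0])

with mlflow.start_run():
    mlflow.pyfunc.log_model("model", python_model=PreprocessingModel(inner))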

Databricks Managed MLflow

Databricks provides additional managed serving capabilities:

  • Model Serving Endpoints: Fully managed, auto-scaling endpoints with built-in monitoring (a query sketch follows this list)
  • Foundation Model APIs: Direct access to foundation models through pay-per-token endpoints
  • Advanced Security: Enterprise-grade security with access controls and audit logging
  • Real-time Monitoring: Built-in metrics, logging, and performance monitoring
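
As a sketch, a managed serving endpoint on Databricks can be queried over REST with a personal access token; the workspace URL, endpoint name, and DATABRICKS_TOKEN environment variable below are placeholders for your own workspace.

import os

import requests

# Placeholder workspace URL and endpoint name
DATABRICKS_HOST = "https://<your-workspace>.cloud.databricks.com"
ENDPOINT_NAME = "my-endpoint"

response = requests.post(
    f"{DATABRICKS_HOST}/serving-endpoints/{ENDPOINT_NAME}/invocations",
    headers={
        "Authorization": f"Bearer {os.environ['DATABRICKS_TOKEN']}",
        "Content-Type": "application/json",
    },
    json={"dataframe_split": {"columns": ["feature1"], "data": [[1.0]]}},
)
print(response.json())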

Quick Start

Basic Model Serving

For a simple model serving setup:

# Serve a logged model
mlflow models serve -m "models:/<model-id>" -p 5000

# Serve a registered model
mlflow models serve -m "models:/<model-name>/<model-version>" -p 5000

# Serve a model from local path
mlflow models serve -m ./path/to/model -p 5000

Making Predictions

Once your model is served, you can make predictions via HTTP requests:

curl -X POST http://localhost:5000/invocations \
-H "Content-Type: application/json" \
-d '{"inputs": [[1, 2, 3, 4]]}'

Architecture

MLflow serving uses a standardized architecture:

  1. 🧠 Model Loading: Models are loaded using their respective MLflow flavors
  2. 🌐 HTTP Server: A Flask-based scoring server (or MLServer, when enabled) handles incoming requests
  3. 🔄 Prediction Pipeline: Requests are processed through the model's predict method (see the sketch after this list)
  4. 📦 Response Formatting: Results are returned in standardized JSON format
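
The same predict path can be exercised directly in Python, which is a quick way to verify a model behaves as expected before standing up the server; the model URI below is a placeholder.

import mlflow
import pandas as pd

# Placeholder URI; a models:/, runs:/, or local-path URI works here
model = mlflow.pyfunc.load_model("models:/my-model/1")

# This is the same predict() call the scoring server makes for each request
inputs = pd.DataFrame([[1.0, 2.0, 3.0]], columns=["feature1", "feature2", "feature3"])
print(model.predict(inputs))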

Best Practices

Performance Optimization

  • Use appropriate hardware resources based on model requirements
  • Implement request batching for improved throughput (see the batching sketch after this list)
  • Consider model quantization for faster inference
  • Monitor memory usage and optimize accordingly
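
For example, several rows can be grouped into a single invocations request instead of issuing one HTTP call per row; the feature names and local endpoint below match the earlier examples and are otherwise assumptions.

import requests

# One request carrying a batch of rows amortizes HTTP and serialization overhead
batch = {
    "dataframe_split": {
        "columns": ["feature1", "feature2", "feature3"],
        "data": [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]],
    }
}

response = requests.post("http://localhost:5000/invocations", json=batch)
print(response.json())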

Security Considerations

  • Implement proper authentication and authorization
  • Use HTTPS in production environments
  • Validate input data to prevent security vulnerabilities (a signature-based sketch follows this list)
  • Regularly update dependencies and monitor for security issues
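
One way to get basic input validation is to log the model with a signature; the pyfunc serving layer then checks incoming column names and types against it before the data reaches the model. A minimal sketch, assuming a scikit-learn classifier trained on the iris dataset:

import mlflow
from mlflow.models import infer_signature
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True, as_frame=True)
model = LogisticRegression(max_iter=1000).fit(X, y)

with mlflow.start_run():
    # The signature records expected column names and types, so malformed
    # requests are rejected at the serving layer instead of reaching the model
    signature = infer_signature(X, model.predict(X))
    mlflow.sklearn.log_model(model, "model", signature=signature)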

Monitoring and Observability

  • Set up comprehensive logging for debugging and auditing
  • Monitor key metrics like latency, throughput, and error rates
  • Implement health checks for service reliability (see the /ping sketch after this list)
  • Use distributed tracing for complex serving pipelines
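
As a small sketch, the local scoring server started by mlflow models serve exposes a /ping endpoint that can back a liveness or readiness probe; the port matches the quick-start example above.

import requests

# The scoring server answers GET /ping with HTTP 200 when it is healthy
resp = requests.get("http://localhost:5000/ping", timeout=5)
print("healthy" if resp.status_code == 200 else f"unhealthy: {resp.status_code}")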

Common Use Cases

Serve models for real-time predictions in web applications, mobile apps, or microservices architectures.

import requests
import json

# Single prediction
data = {
    "dataframe_split": {
        "columns": ["feature1", "feature2", "feature3"],
        "data": [[1.0, 2.0, 3.0]],
    }
}

response = requests.post(
    "http://localhost:5000/invocations",
    headers={"Content-Type": "application/json"},
    data=json.dumps(data),
)
print(response.json())

Next Steps

For more detailed information about MLflow serving capabilities, refer to the official MLflow documentation and experiment with the examples provided in each section.