Deploy MLflow Model to Kubernetes

Using MLServer as the Inference Server

By default, MLflow deployment uses Flask, a widely used WSGI web application framework for Python, to serve the inference endpoint. However, Flask is mainly designed for lightweight applications and might not be suitable for production use cases at scale. To address this gap, MLflow integrates with MLServer as an alternative deployment option. MLServer is used as the core Python inference server in Kubernetes-native frameworks such as Seldon Core and KServe (formerly known as KFServing). With MLServer, you can take advantage of the scalability and reliability of Kubernetes to serve your model at scale. See Serving Framework for a detailed comparison between Flask and MLServer, and why MLServer is the better choice for ML production use cases.

Building a Docker Image for MLflow Model

The essential step in deploying an MLflow model to Kubernetes is building a Docker image that contains the MLflow model and the inference server. This can be done via the build-docker CLI command or the Python API.

mlflow models build-docker -m runs:/<run_id>/model -n <image_name> --enable-mlserver

If you want to use the bare-bones Flask server instead of MLServer, remove the --enable-mlserver flag. For other options, see the build-docker command documentation.
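The same image can also be built from the Python API. Below is a minimal sketch; the model URI and image name are placeholders that you should replace with your own values.

import mlflow

# Build a Docker image for the model; enable_mlserver=True selects MLServer
# instead of the default Flask-based scoring server.
mlflow.models.build_docker(
    model_uri="runs:/<run_id>/model",  # replace <run_id> with your run ID
    name="<image_name>",               # replace with your image name
    enable_mlserver=True,
)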

Important

Since MLflow 2.10.1, the Docker image spec has been changed to reduce image size and improve performance. Most notably, Java is no longer installed in the image, except for Java-dependent model flavors such as spark. If you need Java for other flavors, e.g., a custom Python model that uses SparkML, specify the --install-java flag to enforce Java installation.
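Once the image is built, you can sanity-check it locally before deploying to Kubernetes by running the container and sending a request to the /invocations endpoint. The sketch below assumes the container's default port 8080 is published to localhost (for example, docker run -p 8080:8080 <image_name>) and uses a purely illustrative numeric input; adjust the payload to match your model's signature.

import requests

# Hypothetical local address; once the image runs in Kubernetes, replace this
# with the address of your service (e.g., reached via port-forwarding).
url = "http://localhost:8080/invocations"

# The MLflow scoring protocol accepts JSON payloads such as "inputs" or
# "dataframe_split"; the feature values below are purely illustrative.
payload = {"inputs": [[5.1, 3.5, 1.4, 0.2]]}

response = requests.post(url, json=payload)
print(response.status_code, response.json())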

Deployment Steps

Please refer to the following partner documentation for deploying MLflow Models to Kubernetes using MLServer. You can also follow the tutorial below to learn the end-to-end process, including environment setup, model training, and deployment.