Models, Flavors, and PyFuncs in MLflow

In the MLflow ecosystem, “flavors” play a pivotal role in model management. Essentially, a “flavor” is a designated wrapper for a specific machine learning library. For instance, the spark-ml package produces diverse model types such as Pipeline, LogisticRegressionModel, and RandomForestModel, yet all of them fall under the Spark flavor umbrella. This abstraction ensures that, irrespective of the model’s foundational structure, its Spark flavor variant can be saved, logged, and retrieved using MLflow’s named flavor utilities.

Flavors streamline the process of saving, loading, and handling machine learning models across different frameworks. They consider each library’s unique approach to model serialization and deserialization.

MLflow’s flavor design ensures a degree of uniformity. For every library, its corresponding MLflow flavor defines the behavior of the loaded pyfunc for inference deployment. Each flavor prescribes a predict method behavior, ensuring a consistent yet somewhat rigid format.

To understand these constraints, consider the sklearn flavor as an example. The diagram below delineates its implementation, highlighting the APIs and serialization methods MLflow has standardized:

A depiction of the standards adopted by MLflow’s sklearn flavor

While MLflow endeavors to offer a universally applicable pyfunc representation for each flavor, it’s not always feasible to accommodate every unique model scenario generated by a specific library.

However, there’s a silver lining. MLflow offers the flexibility to craft a custom pyfunc by extending the foundational PythonModel base class, which underlies all named flavors’ pyfunc variants. With a correct implementation of PythonModel, you can embed any code or model from any library within a custom class, all while enjoying the uniformity benefits associated with a named flavor.

To delve deeper into these functionalities, let’s examine the core structure of an MLflow Model.

Components of a Model in MLflow

When thinking of a “model,” most practitioners envision the learned parameters or weights from a machine learning training process. These are typically saved as a file or a directory of files, and then utilized for predictions on new, unseen data. However, in the realm of MLOps and especially within MLflow, the concept of a “model” is much broader.

In MLflow, a model is not just the binary file containing the learned parameters. It’s a comprehensive package or bundle that encapsulates everything needed to reproduce predictions reliably in various environments.

This includes the model’s weights, but it goes far beyond that.

The basic components of a model in MLflow

The Model Binary: This is the central piece - the actual saved model weights or parameters. It’s what many think of as “the model.”
Additional Binary Files: For some models, auxiliary files might be needed. For example, tokenizers for NLP models, scalers for preprocessing, or even non-parametric elements like decision trees or k-means centroids.
Pre-loaded Code: Certain models might need custom code to be loaded in the inference environment. This could be for preprocessing, postprocessing, or other custom logic.
Library Dependencies: For the model to function correctly, it might depend on specific versions of libraries. MLflow keeps track of these dependencies, ensuring that the environment where the model runs matches the one where it was trained.
Metadata: This contains vital information about the model’s lineage. It can track details like who trained the model, with what code, when, and where. This metadata is crucial for model governance, auditing, and reproducibility.
PyFunc Signature: To ensure seamless deployment and inference, MLflow wraps the model in a standardized pyfunc interface. This interface defines the expected input and output formats, ensuring consistency.
Input Example: An optional component, this provides a sample input that can be used for testing, ensuring that the deployed model is functioning correctly.

All of these elements are viewable in the MLflow UI’s artifact viewer when looking at a saved model.

Model components seen in the MLflow artifact viewer

Note

The contents of the logged model directory within MLflow depend on both the optional arguments supplied when saving or logging the model and the underlying base model type. Some model flavors include additional metadata and serialized artifacts that others do not.

The components shown here are important to understand when creating a custom pyfunc, as this structure and the elements within it are what you will be interfacing with when creating and using custom PyFuncs.

Understanding “Named Flavors”

A named flavor in MLflow refers to a predefined entity associated with a specific machine learning or data processing framework. For instance, if you’re working with a Scikit-Learn model, you might employ methods like mlflow.sklearn.save_model(), mlflow.sklearn.load_model(), and mlflow.sklearn.log_model().

Key properties of named flavors include:

Root Namespace Integration: Named flavors are accessible directly from the MLflow root namespace, allowing for straightforward interactions.
PyFunc Compatibility: Models saved with named flavors can be loaded back as a PyFunc. This facilitates integration with various deployment environments, be it real-time inference platforms, Spark-based batch processing, or any system that can invoke a Python function.
Autologging: Certain named flavors support autologging, a feature that automatically logs model artifacts and training metadata upon the completion of a training process.

Characteristics of Named Flavors

Named flavors encapsulate several functionalities:

Unified API: Despite the underlying differences in machine learning frameworks, named flavors offer a consistent set of methods for model saving, loading, and logging. This consistency extends to advanced features such as signature declaration, input example storage, custom dependencies, and model registration.
Maintenance & Reliability: Being part of the MLflow project, named flavors undergo rigorous testing and updates by the core maintainers.
Serialization Methods: Each named flavor leverages native serialization mechanisms pertinent to its associated framework.
Custom Python Function Wrappers: Each flavor contains a specific implementation that maps the underlying framework’s methods to a standard Python function, making certain decisions about the function’s behavior.
Simplified High-Level APIs: Despite their capacity to handle intricate details, the high-level APIs for named flavors are designed for ease of use.

Criteria for Inclusion as a Named Flavor

The inclusion of a framework as a named flavor within MLflow isn’t arbitrary.

Criteria include:

Popularity & Demand: Frameworks with significant adoption in the industry are favored. The inclusion also depends on the frequency of user requests and the perceived demand within the broader ML community.
Framework Stability: Named flavors are typically associated with frameworks that are stable, actively maintained, and free of overly intricate or restrictive build requirements that would make integration impractical.

The Anatomy of Named Flavors

Every named flavor in MLflow typically implements a set of core functions:

get_default_conda_env(): Returns the default conda environment definition for the flavor, including its conda and pip dependencies.
get_default_pip_requirements(): Lists the PyPI dependencies vital for the flavor.
load_model(): Handles the process of deserialization, retrieving a model instance from a given artifact store via a provided resolvable model_uri.
save_model(): Manages the serialization process, ensuring the model, its metadata, and other associated artifacts are appropriately stored.
log_model(): Builds on save_model() by logging the saved model as an artifact of an MLflow run, with the option of registering it in the model registry.

Moreover, to ensure that a flavor’s model can be loaded as a generic Python function, a Wrapper class is required in order to integrate with mlflow.pyfunc.load_model().

Addressing Unsupported Models in MLflow

For machine learning frameworks not supported as named flavors, MLflow provides the flexibility to define custom PyFuncs.

This tutorial will guide you through the process, enabling you to incorporate virtually any model into the MLflow ecosystem.

Creating Reusable Custom Flavors

For those frequently using specific custom PyFuncs across various projects, MLflow’s architecture supports the development of custom flavors through a plugin-style interface. While a comprehensive guide on this topic is beyond the scope of this tutorial, the general approach involves creating a module that encompasses functions for saving, loading, and logging the model type. A PyFunc wrapper class is then crafted to provide integration for loading the custom flavor as a PyFunc.
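The general shape of such a module can be sketched with the standard library alone. Everything below is hypothetical: the flavor name, the ThresholdModel class, the file names, and the JSON serialization are stand-ins chosen for illustration, and the code does not hook into MLflow's own flavor machinery. It only mirrors the division of labor described above: a save function, a load function, and a wrapper class that exposes a uniform predict method.

```python
import json
import os
import tempfile

FLAVOR_NAME = "toy_flavor"  # hypothetical flavor name


class ThresholdModel:
    """Stand-in for a model object from an unsupported library."""

    def __init__(self, threshold):
        self.threshold = threshold

    def classify(self, values):
        return [int(v >= self.threshold) for v in values]


def save_model(model, path):
    """Serialize the model and a small metadata file into a directory."""
    os.makedirs(path, exist_ok=True)
    with open(os.path.join(path, "model.json"), "w") as f:
        json.dump({"threshold": model.threshold}, f)
    with open(os.path.join(path, "flavor.json"), "w") as f:
        json.dump({"flavor": FLAVOR_NAME, "data": "model.json"}, f)


def load_model(path):
    """Rebuild the native model object from the saved directory."""
    with open(os.path.join(path, "flavor.json")) as f:
        meta = json.load(f)
    with open(os.path.join(path, meta["data"])) as f:
        params = json.load(f)
    return ThresholdModel(params["threshold"])


class ToyFlavorWrapper:
    """Maps the native classify() method onto a generic predict()."""

    def __init__(self, native_model):
        self.native_model = native_model

    def predict(self, model_input):
        return self.native_model.classify(model_input)


def load_as_pyfunc(path):
    """Hypothetical entry point returning the wrapped, generic model."""
    return ToyFlavorWrapper(load_model(path))


with tempfile.TemporaryDirectory() as tmp:
    model_dir = os.path.join(tmp, "toy_model")
    save_model(ThresholdModel(threshold=0.5), model_dir)
    preds = load_as_pyfunc(model_dir).predict([0.2, 0.7, 0.5])

print(preds)
```

A real custom flavor would additionally write MLflow's model metadata and dependency files and provide a log_model() counterpart, which this sketch leaves out.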