Model Evaluation

This guide covers MLflow's core model evaluation capabilities for classification and regression tasks, showing how to comprehensively assess model performance with automated metrics, visualizations, and diagnostic tools.

Quick Start: Evaluating a Classification Model

The simplest way to evaluate a model is with MLflow's unified evaluation API:

import mlflow
import xgboost as xgb
import shap
from sklearn.model_selection import train_test_split
from mlflow.models import infer_signature

# Load the UCI Adult Dataset
X, y = shap.datasets.adult()
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42
)

# Train model
model = xgb.XGBClassifier().fit(X_train, y_train)

# Create evaluation dataset
eval_data = X_test.copy()
eval_data["label"] = y_test

with mlflow.start_run():
    # Log model with signature
    signature = infer_signature(X_test, model.predict(X_test))
    mlflow.sklearn.log_model(model, name="model", signature=signature)
    model_uri = mlflow.get_artifact_uri("model")

    # Comprehensive evaluation
    result = mlflow.models.evaluate(
        model_uri,
        eval_data,
        targets="label",
        model_type="classifier",
        evaluators=["default"],
    )

print(f"Accuracy: {result.metrics['accuracy_score']:.3f}")
print(f"F1 Score: {result.metrics['f1_score']:.3f}")
print(f"ROC AUC: {result.metrics['roc_auc']:.3f}")

This single call automatically generates:

  • Performance Metrics: Accuracy, precision, recall, F1-score, ROC-AUC
  • Visualizations: Confusion matrix, ROC curve, precision-recall curve
  • Feature Importance: SHAP values and feature contribution analysis
  • Model Artifacts: All plots and diagnostic information saved to MLflow

Supported Model Types

MLflow supports different model types, each with specialized metrics and evaluations:

  • classifier - Binary and multiclass classification models
  • regressor - Regression models for continuous target prediction
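
Regression models are evaluated the same way. The minimal sketch below uses scikit-learn's diabetes dataset and a LinearRegression model purely for illustration:

# Minimal regression evaluation sketch (illustrative dataset and model)
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression

X_reg, y_reg = load_diabetes(return_X_y=True, as_frame=True)
reg_model = LinearRegression().fit(X_reg, y_reg)

reg_eval_data = X_reg.copy()
reg_eval_data["target"] = y_reg

with mlflow.start_run():
    signature = infer_signature(X_reg, reg_model.predict(X_reg))
    mlflow.sklearn.log_model(reg_model, name="model", signature=signature)
    model_uri = mlflow.get_artifact_uri("model")

    reg_result = mlflow.models.evaluate(
        model_uri,
        reg_eval_data,
        targets="target",
        model_type="regressor",  # Computes regression metrics such as MAE, RMSE, and R2
    )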

For classification tasks, MLflow automatically computes comprehensive metrics:

# Binary Classification
result = mlflow.models.evaluate(
    model_uri,
    eval_data,
    targets="label",
    model_type="classifier",  # Automatically detects binary vs multiclass
    evaluators=["default"],
)

# Access classification-specific metrics
metrics = result.metrics
print(f"Precision: {metrics['precision_score']:.3f}")
print(f"Recall: {metrics['recall_score']:.3f}")
print(f"F1 Score: {metrics['f1_score']:.3f}")
print(f"ROC AUC: {metrics['roc_auc']:.3f}")

Automatic Classification Metrics:

  • Accuracy, Precision, Recall, F1-Score
  • ROC-AUC and Precision-Recall AUC
  • Log Loss and Brier Score
  • Confusion Matrix and Classification Report

Advanced Evaluation Configurations

Specifying Evaluators

Control which evaluators run during assessment:

# Run only default metrics (fastest)
result = mlflow.models.evaluate(
    model_uri,
    eval_data,
    targets="label",
    model_type="classifier",
    evaluators=["default"],
)

# Include SHAP explainer for feature importance
result = mlflow.models.evaluate(
    model_uri,
    eval_data,
    targets="label",
    model_type="classifier",
    evaluators=["default"],
    evaluator_config={"log_explainer": True},
)

Configuration Options Reference

SHAP Configuration

  • log_explainer: Whether to log the SHAP explainer as a model
  • explainer_type: Type of SHAP explainer ("exact", "permutation", "partition")
  • max_error_examples: Maximum number of error examples to analyze
  • log_model_explanations: Whether to log individual prediction explanations

Performance Options

  • pos_label: Positive class label for binary classification metrics
  • average: Averaging strategy for multiclass metrics ("macro", "micro", "weighted")
  • sample_weights: Sample weights for weighted metrics
  • normalize: Normalization for confusion matrix ("true", "pred", "all")
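
These options are passed through the evaluator_config dictionary. The sketch below combines several of the keys listed above (a sketch only; option availability can vary by MLflow version):

# Combining configuration options from the reference above
result = mlflow.models.evaluate(
    model_uri,
    eval_data,
    targets="label",
    model_type="classifier",
    evaluators=["default"],
    evaluator_config={
        "log_explainer": True,
        "explainer_type": "permutation",
        "pos_label": 1,
        "average": "weighted",
    },
)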

Custom Metrics and Artifacts

MLflow provides a powerful framework for defining custom evaluation metrics using the make_metric function:

import mlflow
import numpy as np
from mlflow.models import make_metric


def weighted_accuracy(predictions, targets, metrics, sample_weights=None):
    """Custom weighted accuracy metric."""
    if sample_weights is None:
        return (predictions == targets).mean()
    else:
        correct = predictions == targets
        return np.average(correct, weights=sample_weights)


# Create custom metric
custom_accuracy = make_metric(
    eval_fn=weighted_accuracy, greater_is_better=True, name="weighted_accuracy"
)

# Use in evaluation
result = mlflow.models.evaluate(
    model_uri,
    eval_data,
    targets="label",
    model_type="classifier",
    extra_metrics=[custom_accuracy],
)
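
Once the evaluation completes, the custom metric should appear in result.metrics under the name passed to make_metric (a brief illustration based on the run above):

# The custom metric is reported alongside the built-in metrics
print(f"Weighted Accuracy: {result.metrics['weighted_accuracy']:.3f}")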

Working with Evaluation Results

The evaluation result object provides comprehensive access to all generated metrics and artifacts:

# Run evaluation
result = mlflow.models.evaluate(
    model_uri, eval_data, targets="label", model_type="classifier"
)

# Access metrics
print("All Metrics:")
for metric_name, value in result.metrics.items():
    print(f" {metric_name}: {value}")

# Access artifacts (plots, tables, etc.)
print("\nGenerated Artifacts:")
for artifact_name, path in result.artifacts.items():
    print(f" {artifact_name}: {path}")

# Access evaluation dataset
eval_table = result.tables["eval_results_table"]
print(f"\nEvaluation table shape: {eval_table.shape}")
print(f"Columns: {list(eval_table.columns)}")

Model Comparison and Advanced Workflows

Compare multiple models systematically:

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# Define models to compare
models = {
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "logistic_regression": LogisticRegression(random_state=42),
    "svm": SVC(probability=True, random_state=42),
}

# Evaluate each model
results = {}

for model_name, model in models.items():
    with mlflow.start_run(run_name=f"eval_{model_name}"):
        # Train model
        model.fit(X_train, y_train)

        # Log model
        signature = infer_signature(X_train, model.predict(X_train))
        mlflow.sklearn.log_model(model, name="model", signature=signature)
        model_uri = mlflow.get_artifact_uri("model")

        # Evaluate model
        result = mlflow.models.evaluate(
            model_uri, eval_data, targets="label", model_type="classifier"
        )

        results[model_name] = result.metrics

        # Log comparison metrics
        mlflow.log_metrics(
            {
                "accuracy": result.metrics["accuracy_score"],
                "f1": result.metrics["f1_score"],
                "roc_auc": result.metrics["roc_auc"],
            }
        )

# Compare results
comparison_df = pd.DataFrame(results).T
print("Model Comparison:")
print(comparison_df[["accuracy_score", "f1_score", "roc_auc"]].round(3))

Model Validation and Quality Gates

attention

MLflow 2.18.0 has moved the model validation functionality from the mlflow.models.evaluate() API to a dedicated mlflow.validate_evaluation_results() API. The relevant parameters, such as baseline_model, are deprecated and will be removed from the older API in future versions.

With the mlflow.validate_evaluation_results() API, you can validate metrics generated during model evaluation to assess the quality of your model against a baseline.

from mlflow.models import MetricThreshold

# Evaluate your model first
result = mlflow.models.evaluate(
    model_uri, eval_data, targets="label", model_type="classifier"
)

# Define static performance thresholds
static_thresholds = {
    "accuracy_score": MetricThreshold(
        threshold=0.85, greater_is_better=True  # Must achieve 85% accuracy
    ),
    "precision_score": MetricThreshold(
        threshold=0.80, greater_is_better=True  # Must achieve 80% precision
    ),
    "recall_score": MetricThreshold(
        threshold=0.75, greater_is_better=True  # Must achieve 75% recall
    ),
}

# Validate against static thresholds
try:
    mlflow.validate_evaluation_results(
        candidate_result=result,
        baseline_result=None,  # No baseline comparison
        validation_thresholds=static_thresholds,
    )
    print("✅ Model meets all static performance thresholds.")
except mlflow.exceptions.ModelValidationFailedException as e:
    print(f"❌ Model failed static validation: {e}")
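
To validate against a baseline model rather than fixed values, evaluate the baseline on the same data and pass its result as baseline_result. The sketch below uses an illustrative DummyClassifier baseline and an assumed minimum improvement of 0.05:

# Sketch: validate the candidate against a baseline model's evaluation result
from sklearn.dummy import DummyClassifier

with mlflow.start_run(run_name="baseline"):
    baseline_model = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
    mlflow.sklearn.log_model(baseline_model, name="model")
    baseline_uri = mlflow.get_artifact_uri("model")
    baseline_result = mlflow.models.evaluate(
        baseline_uri, eval_data, targets="label", model_type="classifier"
    )

# Candidate must reach 0.85 accuracy and exceed the baseline by at least 0.05
baseline_thresholds = {
    "accuracy_score": MetricThreshold(
        threshold=0.85,
        min_absolute_change=0.05,
        greater_is_better=True,
    ),
}

mlflow.validate_evaluation_results(
    candidate_result=result,
    baseline_result=baseline_result,
    validation_thresholds=baseline_thresholds,
)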

More information on model validation behavior and outputs can be found in the mlflow.validate_evaluation_results() API documentation.

Error Analysis and Debugging

Analyze model errors in detail:

def analyze_model_errors(result, eval_data, targets, top_n=20):
    """Analyze model errors in detail."""

    # Load evaluation results
    eval_table = result.tables["eval_results_table"]

    # Identify errors
    errors = eval_table[eval_table["prediction"] != eval_table[targets]]

    if len(errors) > 0:
        print(f"Total errors: {len(errors)} out of {len(eval_table)} predictions")
        print(f"Error rate: {len(errors) / len(eval_table) * 100:.2f}%")

        # Most confident wrong predictions
        if "prediction_score" in errors.columns:
            confident_errors = errors.nlargest(top_n, "prediction_score")
            print(f"\nTop {top_n} most confident errors:")
            print(confident_errors[["prediction", targets, "prediction_score"]].head())

        # Error patterns by true class
        error_by_class = errors.groupby(targets).size()
        print("\nErrors by true class:")
        print(error_by_class)

    return errors


# Usage
errors = analyze_model_errors(result, eval_data, "label")

Best Practices and Optimization

Complete evaluation workflow with best practices:

def comprehensive_model_evaluation(
    model, X_train, y_train, eval_data, targets, model_type
):
    """Complete evaluation workflow with best practices."""

    with mlflow.start_run():
        # Train model
        model.fit(X_train, y_train)

        # Log training info
        mlflow.log_params(
            {"model_class": model.__class__.__name__, "training_samples": len(X_train)}
        )

        # Log model with signature
        signature = infer_signature(X_train, model.predict(X_train))
        mlflow.sklearn.log_model(model, name="model", signature=signature)
        model_uri = mlflow.get_artifact_uri("model")

        # Comprehensive evaluation
        result = mlflow.models.evaluate(
            model_uri,
            eval_data,
            targets=targets,
            model_type=model_type,
            evaluators=["default"],
            evaluator_config={
                "log_explainer": True,
                "explainer_type": "exact",
                "log_model_explanations": True,
            },
        )

    return result
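
As an illustrative call, the helper above can be applied to the quick-start classifier and evaluation data (assuming those variables are still in scope):

# Illustrative usage with the quick-start data
result = comprehensive_model_evaluation(
    xgb.XGBClassifier(), X_train, y_train, eval_data, "label", "classifier"
)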

Conclusion

MLflow's model evaluation capabilities provide a comprehensive framework for assessing model performance across classification and regression tasks. The unified API simplifies complex evaluation workflows while providing deep insights into model behavior through automated metrics, visualizations, and diagnostic tools.

Key benefits of MLflow model evaluation include:

  • Comprehensive Assessment: Automated generation of task-specific metrics and visualizations
  • Reproducible Workflows: Consistent evaluation processes with complete tracking and versioning
  • Advanced Analysis: Error investigation, feature impact analysis, and model comparison capabilities
  • Production Integration: Seamless integration with MLflow tracking for experiment organization and reporting

Whether you're evaluating a single model or comparing multiple candidates, MLflow's evaluation framework provides the tools needed to make informed decisions about model performance and production readiness.