Scikit-learn with MLflow

In this comprehensive guide, we'll walk you through how to use scikit-learn with MLflow for experiment tracking, model management, and production deployment. We'll cover both autologging and manual logging approaches, from basic usage to advanced production patterns.

Quick Start with Autologging

The fastest way to get started is with MLflow's scikit-learn autologging. With just a single line of code, you can automatically track parameters, metrics, and models from your scikit-learn experiments. This approach requires no changes to your existing training code and captures everything you need for reproducible ML workflows.

import mlflow
import mlflow.sklearn
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split

# Enable autologging for scikit-learn
mlflow.sklearn.autolog()

# Load sample data
wine = load_wine()
X_train, X_test, y_train, y_test = train_test_split(
    wine.data, wine.target, test_size=0.2, random_state=42
)

# Train your model - MLflow automatically logs everything
with mlflow.start_run():
    model = RandomForestClassifier(n_estimators=100, max_depth=3, random_state=42)
    model.fit(X_train, y_train)

    # Evaluation metrics are automatically captured
    train_score = model.score(X_train, y_train)
    test_score = model.score(X_test, y_test)

    print(f"Training accuracy: {train_score:.3f}")
    print(f"Test accuracy: {test_score:.3f}")

This simple example automatically logs all model parameters, training metrics, the trained model with proper serialization, and model signatures for deployment, all without any additional code.
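Because autologging stores the fitted model as a run artifact, you can load it back later without re-training. Below is a minimal sketch, assuming autologging used its default artifact path "model" and that the quick-start run is the most recent run in this session:

# Load the autologged model back for inference
# (assumes autologging stored it under the default artifact path "model")
run = mlflow.last_active_run()
loaded_model = mlflow.sklearn.load_model(f"runs:/{run.info.run_id}/model")
print(loaded_model.predict(X_test[:5]))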

Understanding Autologging Behavior

MLflow's scikit-learn autologging captures comprehensive information about your training process automatically. Here's exactly what gets tracked every time you train a model:

Category   | Information Captured
Parameters | All parameters from estimator.get_params(deep=True)
Metrics    | Training score, classification/regression metrics
Tags       | Estimator class name and fully qualified class name
Artifacts  | Serialized model, model signature, metric information

The autologging system is designed to be comprehensive yet non-intrusive. It captures everything you need for reproducibility without requiring changes to your existing scikit-learn code.
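To see this for yourself, you can query the finished run through the tracking API and print what was recorded. A minimal sketch, assuming the quick-start run above is the most recent run in this session:

from mlflow import MlflowClient

# Inspect the parameters, metrics, and tags that autologging recorded
client = MlflowClient()
run_id = mlflow.last_active_run().info.run_id
run_data = client.get_run(run_id).data

print("Parameters:", run_data.params)  # e.g. n_estimators, max_depth, ...
print("Metrics:", run_data.metrics)
print("Estimator tag:", run_data.tags.get("estimator_class"))  # tag set by scikit-learn autologging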

Logging Approaches

For complete control over what gets logged, you can manually instrument your scikit-learn code. This approach is ideal when you need custom metrics, specific artifact logging, or want to organize experiments in a particular way:

import mlflow
import mlflow.sklearn
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from mlflow.models import infer_signature

# Generate sample data
X, y = make_classification(n_samples=1000, n_features=20, n_classes=2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Manual logging approach
with mlflow.start_run():
    # Define hyperparameters
    params = {"C": 1.0, "max_iter": 1000, "solver": "lbfgs", "random_state": 42}

    # Log parameters
    mlflow.log_params(params)

    # Train model
    model = LogisticRegression(**params)
    model.fit(X_train, y_train)

    # Make predictions
    y_pred = model.predict(X_test)

    # Calculate and log metrics
    metrics = {
        "accuracy": accuracy_score(y_test, y_pred),
        "precision": precision_score(y_test, y_pred, average="weighted"),
        "recall": recall_score(y_test, y_pred, average="weighted"),
        "f1_score": f1_score(y_test, y_pred, average="weighted"),
    }
    mlflow.log_metrics(metrics)

    # Infer model signature
    signature = infer_signature(X_train, model.predict(X_train))

    # Log the model
    mlflow.sklearn.log_model(
        sk_model=model,
        name="model",
        signature=signature,
        input_example=X_train[:5],  # Sample input for documentation
    )

Hyperparameter Tuning

MLflow provides exceptional support for scikit-learn's hyperparameter optimization tools, automatically creating organized child runs for parameter search experiments. This makes it easy to track and compare different parameter combinations:

import mlflow
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

# Enable autologging with hyperparameter tuning support
mlflow.sklearn.autolog(max_tuning_runs=10) # Track top 10 parameter combinations

# Load data
digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.2, random_state=42
)

# Define parameter grid
param_grid = {
    "n_estimators": [50, 100, 200],
    "max_depth": [5, 10, 15, None],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
}

with mlflow.start_run(run_name="Random Forest Hyperparameter Tuning"):
    # Create and fit GridSearchCV
    rf = RandomForestClassifier(random_state=42)
    grid_search = GridSearchCV(
        rf, param_grid, cv=5, scoring="accuracy", n_jobs=-1, verbose=1
    )

    grid_search.fit(X_train, y_train)

    # Best model evaluation
    best_score = grid_search.score(X_test, y_test)
    print(f"Best parameters: {grid_search.best_params_}")
    print(f"Best cross-validation score: {grid_search.best_score_:.3f}")
    print(f"Test score: {best_score:.3f}")

MLflow automatically creates a parent run containing the overall search results and child runs for each parameter combination, making it easy to analyze which parameters work best.
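If you want to analyze the candidates outside the UI, you can search for the child runs by their parent-run tag. A minimal sketch, assuming the tuning run above is the most recent run in this session; the mean_test_score metric and the params.* column names are assumptions based on what autologging typically records for each candidate:

# Query the child runs created for the parameter search
parent_run_id = mlflow.last_active_run().info.run_id

child_runs = mlflow.search_runs(
    filter_string=f"tags.`mlflow.parentRunId` = '{parent_run_id}'",
    order_by=["metrics.mean_test_score DESC"],  # assumed metric name from cv_results_
)
print(f"Tracked {len(child_runs)} candidate parameter combinations")
print(child_runs[["params.n_estimators", "params.max_depth", "metrics.mean_test_score"]].head())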

Model Evaluation with MLflow

MLflow provides a comprehensive evaluation API that automatically generates metrics, visualizations, and diagnostic tools for scikit-learn models:

import mlflow
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from mlflow.models import infer_signature

# Load data and train model
wine = load_wine()
X_train, X_test, y_train, y_test = train_test_split(
    wine.data, wine.target, test_size=0.2, random_state=42
)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Create evaluation dataset
eval_data = X_test.copy()
eval_data = pd.DataFrame(eval_data, columns=wine.feature_names)
eval_data["label"] = y_test

with mlflow.start_run():
    # Log model with signature
    signature = infer_signature(X_test, model.predict(X_test))
    mlflow.sklearn.log_model(model, name="model", signature=signature)
    model_uri = mlflow.get_artifact_uri("model")

    # Comprehensive evaluation with MLflow
    result = mlflow.evaluate(
        model_uri,
        eval_data,
        targets="label",
        model_type="classifier",  # or "regressor" for regression
        evaluators=["default"],
    )

    # Access automatic metrics
    print(f"Accuracy: {result.metrics['accuracy_score']:.3f}")
    print(f"F1 Score: {result.metrics['f1_score']:.3f}")
    print(f"ROC AUC: {result.metrics['roc_auc']:.3f}")

    # Access generated artifacts
    print("Generated artifacts:")
    for artifact_name, path in result.artifacts.items():
        print(f"  {artifact_name}: {path}")

Automatic Generation Includes:

- Performance Metrics: accuracy, precision, recall, F1-score, and ROC-AUC for classification.
- Visualizations: confusion matrix, ROC curve, and precision-recall curve.
- Feature Importance: SHAP values and feature contribution analysis.
- Model Artifacts: all plots and diagnostic information saved to MLflow.

Model Comparison and Selection

Use MLflow evaluate to systematically compare multiple scikit-learn models:

import mlflow
import mlflow.sklearn
from mlflow.models import infer_signature
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# Define models to compare
sklearn_models = {
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "gradient_boosting": GradientBoostingClassifier(n_estimators=100, random_state=42),
    "logistic_regression": LogisticRegression(random_state=42, max_iter=1000),
    "svm": SVC(probability=True, random_state=42),
}

# Evaluate each model systematically
comparison_results = {}

for model_name, model in sklearn_models.items():
    with mlflow.start_run(run_name=f"eval_{model_name}"):
        # Train model
        model.fit(X_train, y_train)

        # Log model
        signature = infer_signature(X_train, model.predict(X_train))
        mlflow.sklearn.log_model(model, name="model", signature=signature)
        model_uri = mlflow.get_artifact_uri("model")

        # Comprehensive evaluation with MLflow
        result = mlflow.evaluate(
            model_uri,
            eval_data,
            targets="label",
            model_type="classifier",
            evaluators=["default"],
        )

        comparison_results[model_name] = result.metrics

        # Log key metrics for comparison
        mlflow.log_metrics(
            {
                "accuracy": result.metrics["accuracy_score"],
                "f1": result.metrics["f1_score"],
                "roc_auc": result.metrics["roc_auc"],
                "precision": result.metrics["precision_score"],
                "recall": result.metrics["recall_score"],
            }
        )

# Create comparison summary
import pandas as pd

comparison_df = pd.DataFrame(comparison_results).T
print("Model Comparison Summary:")
print(comparison_df[["accuracy_score", "f1_score", "roc_auc"]].round(3))

# Identify best model
best_model = comparison_df["f1_score"].idxmax()
print(f"\nBest model by F1 score: {best_model}")

Model Validation and Quality Gates

Use MLflow's validation API to ensure scikit-learn model quality:

from mlflow.models import MetricThreshold

# First, evaluate your scikit-learn model
result = mlflow.evaluate(model_uri, eval_data, targets="label", model_type="classifier")

# Define quality thresholds for classification models
quality_thresholds = {
    "accuracy_score": MetricThreshold(threshold=0.85, greater_is_better=True),
    "f1_score": MetricThreshold(threshold=0.80, greater_is_better=True),
    "roc_auc": MetricThreshold(threshold=0.75, greater_is_better=True),
}

# Validate model meets quality standards
try:
    mlflow.validate_evaluation_results(
        candidate_result=result,
        validation_thresholds=quality_thresholds,
    )
    print("✅ Scikit-learn model meets all quality thresholds")
except mlflow.exceptions.ModelValidationFailedException as e:
    print(f"❌ Model failed validation: {e}")

# Compare against baseline model (e.g., previous model version)
baseline_result = mlflow.evaluate(
    baseline_model_uri, eval_data, targets="label", model_type="classifier"
)

# Validate improvement over baseline
improvement_thresholds = {
    "f1_score": MetricThreshold(
        min_absolute_change=0.02,  # F1 must improve by at least 0.02 over the baseline
        greater_is_better=True,
    ),
}

try:
    mlflow.validate_evaluation_results(
        candidate_result=result,
        baseline_result=baseline_result,
        validation_thresholds=improvement_thresholds,
    )
    print("✅ New model improves over baseline")
except mlflow.exceptions.ModelValidationFailedException as e:
    print(f"❌ Model doesn't improve sufficiently: {e}")

Model Management

MLflow supports multiple serialization formats for scikit-learn models, each optimized for different deployment scenarios. Understanding these options helps you choose the right approach for your production needs:

import mlflow.sklearn
from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor(n_estimators=100)
model.fit(X_train, y_train)

# Cloudpickle format (default) - better cross-system compatibility
mlflow.sklearn.log_model(
    sk_model=model,
    name="cloudpickle_model",
    serialization_format=mlflow.sklearn.SERIALIZATION_FORMAT_CLOUDPICKLE,
)

# Pickle format - faster but less portable
mlflow.sklearn.log_model(
    sk_model=model,
    name="pickle_model",
    serialization_format=mlflow.sklearn.SERIALIZATION_FORMAT_PICKLE,
)

Cloudpickle is the default format because it provides better cross-system compatibility by identifying and packaging code dependencies with the serialized model. Pickle is faster but less portable across different environments.
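Whichever format you choose, the logged model can be loaded back either as a native scikit-learn estimator or as a generic pyfunc model. A minimal sketch, assuming the cloudpickle_model artifact logged above and a placeholder run ID:

import mlflow.pyfunc
import mlflow.sklearn

run_id = "your_run_id"  # placeholder for the run that logged the model
model_uri = f"runs:/{run_id}/cloudpickle_model"

# Native flavor: full scikit-learn API (predict, score, etc.)
sk_model = mlflow.sklearn.load_model(model_uri)

# Pyfunc flavor: uniform predict() interface used by MLflow serving tools
pyfunc_model = mlflow.pyfunc.load_model(model_uri)
predictions = pyfunc_model.predict(X_test)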

Production Deployment

The Model Registry provides centralized model management with version control and alias-based deployment. This is essential for managing models from development through production deployment:

# Register model to MLflow Model Registry
import mlflow
from mlflow import MlflowClient

client = MlflowClient()

# Log and register model in one step
with mlflow.start_run():
    mlflow.sklearn.log_model(
        sk_model=model,
        name="model",
        registered_model_name="CustomerChurnModel",
        signature=signature,
    )

# Or register an existing model
run_id = "your_run_id"
model_uri = f"runs:/{run_id}/model"

# Register the model
registered_model = mlflow.register_model(model_uri=model_uri, name="CustomerChurnModel")

# Use aliases instead of deprecated stages for deployment management
# Set aliases for different deployment environments
client.set_registered_model_alias(
    name="CustomerChurnModel",
    alias="champion",  # Production model
    version=registered_model.version,
)

client.set_registered_model_alias(
    name="CustomerChurnModel",
    alias="challenger",  # A/B testing model
    version=registered_model.version,
)

# Use tags to track model status and metadata
client.set_model_version_tag(
    name="CustomerChurnModel",
    version=registered_model.version,
    key="validation_status",
    value="approved",
)

client.set_model_version_tag(
    name="CustomerChurnModel",
    version=registered_model.version,
    key="deployment_date",
    value="2025-05-29",
)

Modern Model Registry Features:

Model Aliases replace deprecated stages with flexible, named references. You can assign multiple aliases to any model version (e.g., champion, challenger, shadow), update aliases independently of model training for seamless deployments, and use them for A/B testing and gradual rollouts.

Model Tags provide rich metadata and status tracking. Track validation status with validation_status: approved, mark deployment readiness with ready_for_prod: true, and add team ownership with team: data-science.

Environment-based Models support mature MLOps workflows. Create separate registered models per environment: dev.CustomerChurnModel, staging.CustomerChurnModel, prod.CustomerChurnModel, and use copy_model_version() to promote models across environments.

# Promote model from staging to production environment
client.copy_model_version(
    src_model_uri="models:/staging.CustomerChurnModel@candidate",
    dst_name="prod.CustomerChurnModel",
)
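Once an alias points at the version you want to serve, downstream code can resolve the model through the alias instead of hard-coding a version number. A minimal sketch, assuming the champion alias assigned above:

# Load the current production model by alias rather than by version number
champion_model = mlflow.sklearn.load_model("models:/CustomerChurnModel@champion")
predictions = champion_model.predict(X_test)

Re-pointing the alias at a newer version later changes what this code loads without any code changes.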

Advanced Features

Scikit-learn pipelines are first-class citizens in MLflow, providing end-to-end workflow tracking from data preprocessing through model training. This ensures reproducibility of your entire ML workflow:

import mlflow
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import LinearRegression
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Enable autologging for pipelines
mlflow.sklearn.autolog()

# Create a complex preprocessing and modeling pipeline
numeric_features = ["age", "income", "credit_score"]
categorical_features = ["occupation", "location"]

# Preprocessing pipeline
numeric_transformer = Pipeline(
    steps=[("scaler", StandardScaler()), ("selector", SelectKBest(f_regression, k=2))]
)

categorical_transformer = Pipeline(
    steps=[("encoder", OneHotEncoder(drop="first", sparse_output=False))]
)

# Combine preprocessing steps
preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numeric_features),
        ("cat", categorical_transformer, categorical_features),
    ]
)

# Complete pipeline with model
pipeline = Pipeline(
    steps=[("preprocessor", preprocessor), ("regressor", LinearRegression())]
)

# Train pipeline - all steps are automatically logged
with mlflow.start_run(run_name="Complete Pipeline Experiment"):
    pipeline.fit(X_train, y_train)

    # Pipeline scoring is automatically captured
    train_score = pipeline.score(X_train, y_train)
    test_score = pipeline.score(X_test, y_test)

    print(f"Pipeline R² score: {test_score:.3f}")

MLflow automatically logs parameters from each pipeline stage, making it easy to understand exactly how your data was processed and which model parameters were used.
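The logged names follow scikit-learn's step__parameter convention, so each setting can be traced back to its pipeline stage. A minimal sketch with hypothetical synthetic data matching the column names assumed above, showing how the nested parameter names look:

import numpy as np
import pandas as pd

# Hypothetical data matching the columns the pipeline example assumes
rng = np.random.default_rng(42)
X_train = pd.DataFrame(
    {
        "age": rng.integers(18, 80, size=200),
        "income": rng.normal(60_000, 15_000, size=200),
        "credit_score": rng.integers(300, 850, size=200),
        "occupation": rng.choice(["engineer", "teacher", "nurse"], size=200),
        "location": rng.choice(["urban", "suburban", "rural"], size=200),
    }
)
y_train = rng.normal(size=200)

pipeline.fit(X_train, y_train)

# Stage parameters use scikit-learn's "step__param" naming, which is also how
# they appear among the run's autologged parameters
params = pipeline.get_params()
print(params["preprocessor__num__selector__k"])  # 2
print(params["regressor__fit_intercept"])  # True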

Conclusion

MLflow's scikit-learn integration provides a comprehensive solution for experiment tracking, model management, and deployment in traditional machine learning workflows. Whether you're using simple autologging for quick experiments or implementing complex production pipelines, MLflow scales to meet your needs.

Key benefits of using MLflow with scikit-learn:

- Effortless Experiment Tracking: one-line autologging captures everything you need for reproducible ML.
- Hyperparameter Optimization: built-in support for grid search with organized child runs and easy comparison.
- Comprehensive Evaluation: automatic metric generation, visualizations, and SHAP analysis through mlflow.evaluate().
- Production-Ready Deployment: model registry integration with alias-based deployment and quality gates.
- Team Collaboration: centralized experiment management with rich metadata and artifacts.

The patterns and examples in this guide provide a solid foundation for building scalable, reproducible machine learning systems with scikit-learn and MLflow. Start with autologging for immediate benefits, then gradually adopt more advanced features like model evaluation, registry, and custom configurations as your needs grow.