XGBoost with MLflow

In this comprehensive guide, we'll explore how to use XGBoost with MLflow for experiment tracking, model management, and production deployment. We'll cover both the native XGBoost API and scikit-learn compatible interface, from basic autologging to advanced distributed training patterns.

Quick Start with Autologging

The fastest way to get started is with MLflow's XGBoost autologging. Enable comprehensive experiment tracking with a single line:

import mlflow
import mlflow.xgboost
import xgboost as xgb
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split

# Enable autologging for XGBoost
mlflow.xgboost.autolog()

# Load sample data
data = load_diabetes()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42
)

# Prepare DMatrix format
dtrain = xgb.DMatrix(X_train, label=y_train)
dtest = xgb.DMatrix(X_test, label=y_test)

# Define training parameters
params = {
    "objective": "reg:squarederror",
    "max_depth": 6,
    "learning_rate": 0.1,
    "subsample": 0.8,
    "colsample_bytree": 0.8,
    "random_state": 42,
}

# Train model - MLflow automatically logs everything
with mlflow.start_run():
    model = xgb.train(
        params=params,
        dtrain=dtrain,
        num_boost_round=100,
        evals=[(dtrain, "train"), (dtest, "test")],
        early_stopping_rounds=10,
        verbose_eval=False,
    )

print(f"Best iteration: {model.best_iteration}")
print(f"Best score: {model.best_score}")

This simple example automatically logs:

- All XGBoost parameters and the training configuration
- Training and validation metrics for each boosting round
- Feature importance plots and JSON artifacts
- The trained model with proper serialization
- Early stopping metrics and best iteration information
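
If you want to verify what autologging captured, you can inspect the finished run programmatically. A minimal sketch, assuming it runs right after the training block above:

# Inspect the run that autologging just populated
run = mlflow.last_active_run()
print(f"Run ID: {run.info.run_id}")
print(f"Logged params: {run.data.params}")
print(f"Logged metrics: {run.data.metrics}")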

Understanding XGBoost Autologging

MLflow's XGBoost autologging captures comprehensive information about your gradient boosting process automatically:

Category | Information Captured
--- | ---
Parameters | All booster parameters, training configuration, callback settings
Metrics | Training/validation metrics per iteration, early stopping metrics
Feature Importance | Weight, gain, cover, and total_gain importance with visualizations
Artifacts | Trained model, feature importance plots, JSON importance data

The autologging system is designed to be comprehensive yet non-intrusive. It captures everything you need for reproducibility without requiring changes to your existing XGBoost code.
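
Autologging can also be tuned rather than used with its defaults. The sketch below uses optional arguments of mlflow.xgboost.autolog(); treat the exact argument set as version-dependent and check your MLflow release:

# Customize what autologging records (argument availability depends on MLflow version)
mlflow.xgboost.autolog(
    importance_types=["weight", "gain"],  # which importance types to plot and log
    log_input_examples=True,  # store a small input example alongside the model
    log_model_signatures=True,  # infer and attach a model signature
    log_models=True,  # set to False to track only params and metrics
)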

Logging Approaches

For complete control over experiment tracking, you can manually instrument your XGBoost training:

import mlflow
import mlflow.xgboost
import xgboost as xgb
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, roc_auc_score
import numpy as np

# Generate sample data
X, y = make_classification(n_samples=10000, n_features=20, n_classes=2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Manual logging approach
with mlflow.start_run():
    # Define and log parameters
    params = {
        "objective": "binary:logistic",
        "max_depth": 8,
        "learning_rate": 0.05,
        "subsample": 0.9,
        "colsample_bytree": 0.9,
        "min_child_weight": 1,
        "gamma": 0,
        "reg_alpha": 0,
        "reg_lambda": 1,
        "random_state": 42,
    }

    training_config = {
        "num_boost_round": 500,
        "early_stopping_rounds": 50,
    }

    # Log all parameters
    mlflow.log_params(params)
    mlflow.log_params(training_config)

    # Prepare data
    dtrain = xgb.DMatrix(X_train, label=y_train)
    dtest = xgb.DMatrix(X_test, label=y_test)

    # Custom evaluation tracking
    eval_results = {}

    # Train model and capture per-iteration evaluation results
    model = xgb.train(
        params=params,
        dtrain=dtrain,
        num_boost_round=training_config["num_boost_round"],
        evals=[(dtrain, "train"), (dtest, "test")],
        early_stopping_rounds=training_config["early_stopping_rounds"],
        evals_result=eval_results,
        verbose_eval=False,
    )

    # Log training history
    for epoch, (train_metrics, test_metrics) in enumerate(
        zip(eval_results["train"]["logloss"], eval_results["test"]["logloss"])
    ):
        mlflow.log_metrics(
            {"train_logloss": train_metrics, "test_logloss": test_metrics}, step=epoch
        )

    # Final evaluation
    y_pred_proba = model.predict(dtest)
    y_pred = (y_pred_proba > 0.5).astype(int)

    final_metrics = {
        "accuracy": accuracy_score(y_test, y_pred),
        "roc_auc": roc_auc_score(y_test, y_pred_proba),
        "best_iteration": model.best_iteration,
        "best_score": model.best_score,
    }

    mlflow.log_metrics(final_metrics)

    # Log the model with signature
    from mlflow.models import infer_signature

    signature = infer_signature(X_train, y_pred_proba)

    mlflow.xgboost.log_model(
        xgb_model=model,
        name="model",
        signature=signature,
        input_example=X_train[:5],
    )

Hyperparameter Optimization

MLflow provides exceptional support for XGBoost hyperparameter optimization, automatically creating organized child runs for parameter search experiments:

import mlflow
from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier

# Enable autologging with hyperparameter tracking
mlflow.sklearn.autolog(max_tuning_runs=10)

# Define parameter grid
param_grid = {
    "n_estimators": [50, 100, 200],
    "max_depth": [3, 6, 9],
    "learning_rate": [0.01, 0.1, 0.2],
    "subsample": [0.8, 0.9, 1.0],
    "colsample_bytree": [0.8, 0.9, 1.0],
}

with mlflow.start_run(run_name="XGBoost Grid Search"):
    # Create base model
    xgb_model = XGBClassifier(random_state=42)

    # Grid search with cross-validation
    grid_search = GridSearchCV(
        xgb_model, param_grid, cv=5, scoring="roc_auc", n_jobs=-1, verbose=1
    )

    grid_search.fit(X_train, y_train)

    # Best parameters and scores are automatically logged
    print(f"Best parameters: {grid_search.best_params_}")
    print(f"Best CV score: {grid_search.best_score_:.3f}")

    # Evaluate on test set
    test_score = grid_search.score(X_test, y_test)
    print(f"Test score: {test_score:.3f}")

MLflow automatically creates a parent run containing the overall search results and child runs for each parameter combination, making it easy to analyze which parameters work best.
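
To inspect those child runs afterwards, one option is to query them by parent run ID. A sketch, assuming parent_run_id holds the ID of the grid search run above (for example captured via mlflow.last_active_run()) and that the child runs expose scikit-learn's mean_test_score metric:

# Fetch all child runs created for the grid search parent run
child_runs = mlflow.search_runs(
    filter_string=f"tags.mlflow.parentRunId = '{parent_run_id}'",
    order_by=["metrics.mean_test_score DESC"],
)
print(child_runs[["run_id", "metrics.mean_test_score"]].head())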

Feature Importance Analysis

XGBoost provides multiple types of feature importance, and MLflow captures them all automatically:

import json
import matplotlib.pyplot as plt
import seaborn as sns


def comprehensive_feature_importance_analysis(model, feature_names=None):
    """Analyze and log comprehensive feature importance."""

    importance_types = ["weight", "gain", "cover", "total_gain"]

    with mlflow.start_run(run_name="Feature Importance Analysis"):
        for imp_type in importance_types:
            # Get importance scores
            importance = model.get_score(importance_type=imp_type)

            if not importance:
                continue

            # Sort features by importance
            sorted_features = sorted(
                importance.items(), key=lambda x: x[1], reverse=True
            )

            # Log individual feature scores
            for feature, score in sorted_features[:20]:  # Top 20 features
                mlflow.log_metric(f"{imp_type}_{feature}", score)

            # Create visualization
            features, scores = zip(*sorted_features[:20])

            plt.figure(figsize=(10, 8))
            sns.barplot(x=list(scores), y=list(features))
            plt.title(f"Top 20 Feature Importance ({imp_type.title()})")
            plt.xlabel("Importance Score")
            plt.tight_layout()

            # Save and log plot
            plot_filename = f"feature_importance_{imp_type}.png"
            plt.savefig(plot_filename, dpi=300, bbox_inches="tight")
            mlflow.log_artifact(plot_filename)
            plt.close()

            # Log importance as JSON artifact
            json_filename = f"feature_importance_{imp_type}.json"
            with open(json_filename, "w") as f:
                json.dump(importance, f, indent=2)
            mlflow.log_artifact(json_filename)


# Usage - analyze an already trained booster (reuses params and dtrain from above)
model = xgb.train(params, dtrain, num_boost_round=100)
comprehensive_feature_importance_analysis(model)

Model Management

XGBoost supports various serialization formats, each optimized for different deployment scenarios:

import mlflow.xgboost

# Train model
model = xgb.train(params, dtrain, num_boost_round=100)

with mlflow.start_run():
    # JSON format (recommended) - Human readable and version stable
    mlflow.xgboost.log_model(xgb_model=model, name="model_json", model_format="json")

    # UBJ format - More compact binary format
    mlflow.xgboost.log_model(xgb_model=model, name="model_ubj", model_format="ubj")

    # Legacy XGBoost format (deprecated but sometimes needed)
    mlflow.xgboost.log_model(xgb_model=model, name="model_xgb", model_format="xgb")

JSON format is recommended for production as it's human-readable and version-stable. UBJ format provides more compact binary serialization. The legacy XGBoost format is deprecated but sometimes needed for compatibility.
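
Whichever format you choose, the logged model can be loaded back either through the native XGBoost flavor or as a generic pyfunc model. A minimal sketch, assuming model_uri points to one of the models logged above (for example a runs:/<run_id>/model_json URI):

# Native flavor: returns an xgboost.Booster with the full XGBoost API
booster = mlflow.xgboost.load_model(model_uri)

# Pyfunc flavor: uniform predict() interface, convenient for serving
pyfunc_model = mlflow.pyfunc.load_model(model_uri)
predictions = pyfunc_model.predict(X_test)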

Production Deployment

The Model Registry provides centralized model management with version control and alias-based deployment. This is essential for managing XGBoost models from development through production deployment:

from mlflow import MlflowClient

client = MlflowClient()

# Register model to MLflow Model Registry
with mlflow.start_run():
    mlflow.xgboost.log_model(
        xgb_model=model,
        name="model",
        registered_model_name="XGBoostChurnModel",
        signature=signature,
        model_format="json",
    )

# Use aliases instead of deprecated stages for deployment management
# Set aliases for different deployment environments
model_version = client.get_latest_versions("XGBoostChurnModel")[0]

client.set_registered_model_alias(
    name="XGBoostChurnModel",
    alias="champion",  # Production model
    version=model_version.version,
)

client.set_registered_model_alias(
    name="XGBoostChurnModel",
    alias="challenger",  # A/B testing model
    version=model_version.version,
)

# Use tags to track model status and metadata
client.set_model_version_tag(
    name="XGBoostChurnModel",
    version=model_version.version,
    key="validation_status",
    value="approved",
)

client.set_model_version_tag(
    name="XGBoostChurnModel",
    version=model_version.version,
    key="model_type",
    value="xgboost_classifier",
)

client.set_model_version_tag(
    name="XGBoostChurnModel",
    version=model_version.version,
    key="feature_importance_type",
    value="gain",
)

Modern Model Registry Features:

Model Aliases replace deprecated stages with flexible, named references. You can assign multiple aliases to any model version (e.g., champion, challenger, shadow), update aliases independently of model training for seamless deployments, and use them for A/B testing and gradual rollouts.
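
Aliases can also be referenced directly in model URIs, so deployment code never hard-codes a version number. A minimal sketch, assuming the champion alias assigned above:

# Load whichever version currently holds the "champion" alias
champion_model = mlflow.xgboost.load_model("models:/XGBoostChurnModel@champion")

# Or resolve the alias to version metadata without loading the model
champion_version = client.get_model_version_by_alias("XGBoostChurnModel", "champion")
print(f"Champion alias points to version {champion_version.version}")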

Model Tags provide rich metadata and status tracking. Track validation status with validation_status: approved, mark model characteristics with model_type: xgboost_classifier, and add performance metrics like best_auc_score: 0.95.

Environment-based Models support mature MLOps workflows. Create separate registered models per environment: dev.XGBoostChurnModel, staging.XGBoostChurnModel, prod.XGBoostChurnModel, and use copy_model_version() to promote models across environments.

# Promote model from staging to production environment
client.copy_model_version(
    src_model_uri="models:/staging.XGBoostChurnModel@candidate",
    dst_name="prod.XGBoostChurnModel",
)

Advanced Features

XGBoost allows custom objective functions and evaluation metrics, which MLflow can track:

def custom_objective_function(y_pred, y_true):
    """Custom objective function for XGBoost."""
    # Example: Focal loss for imbalanced classification
    alpha = 0.25
    gamma = 2.0

    # Convert DMatrix to numpy array
    y_true = y_true.get_label()

    # Calculate focal loss gradients and hessians
    p = 1 / (1 + np.exp(-y_pred))  # sigmoid

    # Focal loss gradient
    grad = alpha * (1 - p) ** gamma * (gamma * p * np.log(p + 1e-8) + p - y_true)

    # Focal loss hessian
    hess = (
        alpha
        * (1 - p) ** gamma
        * (gamma * (gamma + 1) * p * np.log(p + 1e-8) + 2 * gamma * p + p)
    )

    return grad, hess


def custom_eval_metric(y_pred, y_true):
    """Custom evaluation metric."""
    y_true = y_true.get_label()
    y_pred = 1 / (1 + np.exp(-y_pred))  # sigmoid

    # Custom F-beta score
    beta = 2.0
    precision = np.sum((y_pred > 0.5) & (y_true == 1)) / np.sum(y_pred > 0.5)
    recall = np.sum((y_pred > 0.5) & (y_true == 1)) / np.sum(y_true == 1)

    f_beta = (1 + beta**2) * precision * recall / (beta**2 * precision + recall)

    return "f_beta", f_beta


# Train with custom objective and metric
with mlflow.start_run():
    model = xgb.train(
        params=params,
        dtrain=dtrain,
        obj=custom_objective_function,
        feval=custom_eval_metric,
        num_boost_round=100,
        evals=[(dtrain, "train"), (dtest, "test")],
        verbose_eval=10,
    )

Model Evaluation with MLflow

MLflow provides a comprehensive evaluation API that automatically generates metrics, visualizations, and diagnostic tools:

import mlflow
import xgboost as xgb
import pandas as pd
from sklearn.model_selection import train_test_split
from mlflow.models import infer_signature

# Prepare data and train model
model = xgb.XGBClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Create evaluation dataset (mlflow.evaluate expects a DataFrame with a target column)
eval_data = pd.DataFrame(X_test)
eval_data["label"] = y_test

with mlflow.start_run():
    # Log model with signature
    signature = infer_signature(X_test, model.predict(X_test))
    mlflow.sklearn.log_model(model, name="model", signature=signature)
    model_uri = mlflow.get_artifact_uri("model")

    # Comprehensive evaluation with MLflow
    result = mlflow.evaluate(
        model_uri,
        eval_data,
        targets="label",
        model_type="classifier",  # or "regressor" for regression
        evaluators=["default"],
    )

    # Access automatic metrics
    print(f"Accuracy: {result.metrics['accuracy_score']:.3f}")
    print(f"F1 Score: {result.metrics['f1_score']:.3f}")
    print(f"ROC AUC: {result.metrics['roc_auc']:.3f}")

    # Access generated artifacts
    print("Generated artifacts:")
    for artifact_name, path in result.artifacts.items():
        print(f" {artifact_name}: {path}")

Automatic Generation Includes:

- Performance Metrics: accuracy, precision, recall, F1-score, and ROC-AUC for classification
- Visualizations: confusion matrix, ROC curve, and precision-recall curve
- Feature Importance: SHAP values and feature contribution analysis
- Model Artifacts: all plots and diagnostic information saved to MLflow

Model Comparison and Selection

Use MLflow evaluate to systematically compare multiple XGBoost configurations:

from sklearn.ensemble import RandomForestClassifier

# Define XGBoost variants to compare
xgb_models = {
    "xgb_shallow": xgb.XGBClassifier(max_depth=3, n_estimators=100, random_state=42),
    "xgb_deep": xgb.XGBClassifier(max_depth=8, n_estimators=100, random_state=42),
    "xgb_boosted": xgb.XGBClassifier(max_depth=6, n_estimators=200, random_state=42),
}

# Compare with other algorithms
all_models = {
    **xgb_models,
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=42),
}

# Evaluate each model systematically
comparison_results = {}

for model_name, model in all_models.items():
    with mlflow.start_run(run_name=f"eval_{model_name}"):
        # Train model
        model.fit(X_train, y_train)

        # Log model
        signature = infer_signature(X_train, model.predict(X_train))
        mlflow.sklearn.log_model(model, name="model", signature=signature)
        model_uri = mlflow.get_artifact_uri("model")

        # Comprehensive evaluation with MLflow
        result = mlflow.evaluate(
            model_uri,
            eval_data,
            targets="label",
            model_type="classifier",
            evaluators=["default"],
        )

        comparison_results[model_name] = result.metrics

        # Log key metrics for comparison
        mlflow.log_metrics(
            {
                "accuracy": result.metrics["accuracy_score"],
                "f1": result.metrics["f1_score"],
                "roc_auc": result.metrics["roc_auc"],
                "precision": result.metrics["precision_score"],
                "recall": result.metrics["recall_score"],
            }
        )

# Create comparison summary
import pandas as pd

comparison_df = pd.DataFrame(comparison_results).T
print("Model Comparison Summary:")
print(comparison_df[["accuracy_score", "f1_score", "roc_auc"]].round(3))

# Identify best model
best_model = comparison_df["f1_score"].idxmax()
print(f"\nBest model by F1 score: {best_model}")

Model Validation and Quality Gates

Use MLflow's validation API to ensure model quality:

from mlflow.models import MetricThreshold

# First, evaluate your XGBoost model
result = mlflow.evaluate(model_uri, eval_data, targets="label", model_type="classifier")

# Define quality thresholds for XGBoost models
quality_thresholds = {
    "accuracy_score": MetricThreshold(threshold=0.85, greater_is_better=True),
    "f1_score": MetricThreshold(threshold=0.80, greater_is_better=True),
    "roc_auc": MetricThreshold(threshold=0.75, greater_is_better=True),
}

# Validate model meets quality standards
try:
    mlflow.validate_evaluation_results(
        candidate_result=result,
        validation_thresholds=quality_thresholds,
    )
    print("✅ XGBoost model meets all quality thresholds")
except mlflow.exceptions.ModelValidationFailedException as e:
    print(f"❌ Model failed validation: {e}")

# Compare against baseline model (e.g., previous XGBoost version)
baseline_result = mlflow.evaluate(
    baseline_model_uri, eval_data, targets="label", model_type="classifier"
)

# Validate improvement over baseline
improvement_thresholds = {
    "f1_score": MetricThreshold(
        min_absolute_change=0.02,  # Must improve F1 by at least 0.02 over the baseline
        greater_is_better=True,
    ),
}

try:
    mlflow.validate_evaluation_results(
        candidate_result=result,
        baseline_result=baseline_result,
        validation_thresholds=improvement_thresholds,
    )
    print("✅ New XGBoost model improves over baseline")
except mlflow.exceptions.ModelValidationFailedException as e:
    print(f"❌ Model doesn't improve sufficiently: {e}")

Advanced XGBoost Features

XGBoost naturally handles multi-class classification with MLflow tracking:

from sklearn.datasets import load_digits
from sklearn.metrics import classification_report

# Multi-class classification
digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
digits.data, digits.target, test_size=0.2, random_state=42
)

with mlflow.start_run(run_name="Multi-class XGBoost"):
# XGBoost naturally handles multi-class
model = XGBClassifier(
objective="multi:softprob",
num_class=10, # 10 digit classes
n_estimators=100,
max_depth=6,
random_state=42,
)

model.fit(X_train, y_train)

# Multi-class predictions
y_pred = model.predict(X_test)
y_pred_proba = model.predict_proba(X_test)

# Multi-class metrics
report = classification_report(y_test, y_pred, output_dict=True)

# Log per-class metrics
for class_label, metrics in report.items():
if isinstance(metrics, dict):
mlflow.log_metrics(
{
f"class_{class_label}_precision": metrics["precision"],
f"class_{class_label}_recall": metrics["recall"],
f"class_{class_label}_f1": metrics["f1-score"],
}
)

Best Practices and Organization

Ensure reproducible XGBoost experiments with comprehensive environment tracking:

import platform
import random

import mlflow
import numpy as np
import xgboost
from xgboost import XGBClassifier


def reproducible_xgboost_experiment(experiment_name, random_state=42):
    """Set up reproducible XGBoost experiment."""

    # Set random seeds for reproducibility
    np.random.seed(random_state)
    random.seed(random_state)

    # Set experiment
    mlflow.set_experiment(experiment_name)

    with mlflow.start_run():
        mlflow.set_tags(
            {
                "python_version": platform.python_version(),
                "xgboost_version": xgboost.__version__,
                "platform": platform.platform(),
                "random_state": random_state,
            }
        )

        # Log dataset information (assumes X_train / y_train are in scope)
        mlflow.log_params(
            {
                "dataset_size": len(X_train),
                "n_features": X_train.shape[1],
                "n_classes": len(np.unique(y_train)),
                "class_distribution": dict(
                    zip(*np.unique(y_train, return_counts=True))
                ),
            }
        )

        # Your model training code here
        params = {
            "objective": "binary:logistic",
            "max_depth": 6,
            "learning_rate": 0.1,
            "random_state": random_state,
            "n_jobs": -1,
        }

        model = XGBClassifier(**params)
        model.fit(X_train, y_train)

        return model


# Usage
model = reproducible_xgboost_experiment("Customer_Churn_Analysis_v2")

Conclusion

MLflow's XGBoost integration provides a comprehensive solution for gradient boosting experiment management and deployment. Whether you're using the native XGBoost API for maximum performance or the scikit-learn interface for pipeline integration, MLflow captures all the essential information needed for reproducible machine learning.

Key benefits of using MLflow with XGBoost:

- Comprehensive Autologging: one-line setup that captures parameters, metrics, and feature importance
- Dual API Support: seamless integration with both the native and scikit-learn XGBoost interfaces
- Advanced Feature Analysis: multiple importance types with automatic visualization
- Production-Ready Deployment: model registry integration with multiple serialization formats
- Performance Optimization: support for GPU acceleration and memory-efficient training
- Competition-Grade Tracking: detailed experiment management for winning ML solutions

The patterns and examples in this guide provide a solid foundation for building scalable, reproducible gradient boosting systems with XGBoost and MLflow. Start with autologging for immediate benefits, then gradually adopt more advanced features like custom objectives, callbacks, and sophisticated deployment patterns as your projects grow in complexity and scale.