XGBoost with MLflow
In this comprehensive guide, we'll explore how to use XGBoost with MLflow for experiment tracking, model management, and production deployment. We'll cover both the native XGBoost API and the scikit-learn-compatible interface, from basic autologging to advanced patterns such as custom objectives, callbacks, and hyperparameter optimization.
Quick Start with Autologging
The fastest way to get started is with MLflow's XGBoost autologging. Enable comprehensive experiment tracking with a single line:
import mlflow
import mlflow.xgboost
import xgboost as xgb
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
# Enable autologging for XGBoost
mlflow.xgboost.autolog()
# Load sample data
data = load_diabetes()
X_train, X_test, y_train, y_test = train_test_split(
data.data, data.target, test_size=0.2, random_state=42
)
# Prepare DMatrix format
dtrain = xgb.DMatrix(X_train, label=y_train)
dtest = xgb.DMatrix(X_test, label=y_test)
# Define training parameters
params = {
"objective": "reg:squarederror",
"max_depth": 6,
"learning_rate": 0.1,
"subsample": 0.8,
"colsample_bytree": 0.8,
"random_state": 42,
}
# Train model - MLflow automatically logs everything
with mlflow.start_run():
model = xgb.train(
params=params,
dtrain=dtrain,
num_boost_round=100,
evals=[(dtrain, "train"), (dtest, "test")],
early_stopping_rounds=10,
verbose_eval=False,
)
print(f"Best iteration: {model.best_iteration}")
print(f"Best score: {model.best_score}")
This simple example automatically logs:
- All XGBoost parameters and training configuration
- Training and validation metrics for each boosting round
- Feature importance plots and JSON artifacts
- The trained model with proper serialization
- Early stopping metrics and best iteration information
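To double-check what autologging captured without opening the UI, you can read the run back programmatically. A minimal sketch, assuming it executes right after the training block above in the same process:
import mlflow
from mlflow import MlflowClient

# Fetch the run that autologging just populated
run = mlflow.last_active_run()
client = MlflowClient()

print("Logged parameters:", run.data.params)
print("Logged metrics:", run.data.metrics)

# List logged artifacts such as the model and feature importance files
for artifact in client.list_artifacts(run.info.run_id):
    print("Artifact:", artifact.path)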
Understanding XGBoost Autologging
- What Gets Logged
- Native vs Scikit-learn API
MLflow's XGBoost autologging captures comprehensive information about your gradient boosting process automatically:
| Category | Information Captured |
|---|---|
| Parameters | All booster parameters, training configuration, callback settings |
| Metrics | Training/validation metrics per iteration, early stopping metrics |
| Feature Importance | Weight, gain, cover, and total_gain importance with visualizations |
| Artifacts | Trained model, feature importance plots, JSON importance data |
The autologging system is designed to be comprehensive yet non-intrusive. It captures everything you need for reproducibility without requiring changes to your existing XGBoost code.
XGBoost offers two main interfaces, and MLflow supports both seamlessly:
# Native XGBoost API - Maximum control and performance
import xgboost as xgb
mlflow.xgboost.autolog()
dtrain = xgb.DMatrix(X_train, label=y_train)
model = xgb.train(params, dtrain, num_boost_round=100)
# Scikit-learn API - Familiar interface with sklearn integration
from xgboost import XGBClassifier
mlflow.sklearn.autolog() # Note: Use sklearn autolog for XGBoost sklearn API
model = XGBClassifier(n_estimators=100, max_depth=6)
model.fit(X_train, y_train)
Choosing the Right API:
Native XGBoost API - Use when you need maximum performance with direct access to all XGBoost optimizations, advanced features like custom objectives and evaluation metrics, memory efficiency with fine-grained control over data loading, or competition settings where every bit of performance matters.
Scikit-learn API - Use when you need pipeline integration with sklearn preprocessing and feature engineering, hyperparameter tuning using GridSearchCV or RandomizedSearchCV, team familiarity with sklearn patterns, or rapid prototyping with familiar interfaces.
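You don't have to choose once and for all: if you start with the scikit-learn estimator and later need native-API features, you can pull out the underlying Booster instead of retraining. A small sketch, assuming model is a fitted XGBClassifier and X_test is available from the example above:
import xgboost as xgb

# Access the native Booster behind a fitted sklearn-style estimator
booster = model.get_booster()

# Native-API features become available, e.g. gain-based importance scores
print(booster.get_score(importance_type="gain"))

# Native predictions go through a DMatrix
native_preds = booster.predict(xgb.DMatrix(X_test))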
Logging Approaches
- Manual Logging
- Scikit-learn Integration
For complete control over experiment tracking, you can manually instrument your XGBoost training:
import mlflow
import mlflow.xgboost
import xgboost as xgb
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, roc_auc_score
import numpy as np
# Generate sample data
X, y = make_classification(n_samples=10000, n_features=20, n_classes=2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
# Manual logging approach
with mlflow.start_run():
# Define and log parameters
params = {
"objective": "binary:logistic",
"max_depth": 8,
"learning_rate": 0.05,
"subsample": 0.9,
"colsample_bytree": 0.9,
"min_child_weight": 1,
"gamma": 0,
"reg_alpha": 0,
"reg_lambda": 1,
"random_state": 42,
}
training_config = {
"num_boost_round": 500,
"early_stopping_rounds": 50,
}
# Log all parameters
mlflow.log_params(params)
mlflow.log_params(training_config)
# Prepare data
dtrain = xgb.DMatrix(X_train, label=y_train)
dtest = xgb.DMatrix(X_test, label=y_test)
# Custom evaluation tracking
eval_results = {}
# Train model with custom callback
model = xgb.train(
params=params,
dtrain=dtrain,
num_boost_round=training_config["num_boost_round"],
evals=[(dtrain, "train"), (dtest, "test")],
early_stopping_rounds=training_config["early_stopping_rounds"],
evals_result=eval_results,
verbose_eval=False,
)
# Log training history
for epoch, (train_metrics, test_metrics) in enumerate(
zip(eval_results["train"]["logloss"], eval_results["test"]["logloss"])
):
mlflow.log_metrics(
{"train_logloss": train_metrics, "test_logloss": test_metrics}, step=epoch
)
# Final evaluation
y_pred_proba = model.predict(dtest)
y_pred = (y_pred_proba > 0.5).astype(int)
final_metrics = {
"accuracy": accuracy_score(y_test, y_pred),
"roc_auc": roc_auc_score(y_test, y_pred_proba),
"best_iteration": model.best_iteration,
"best_score": model.best_score,
}
mlflow.log_metrics(final_metrics)
# Log the model with signature
from mlflow.models import infer_signature
signature = infer_signature(X_train, y_pred_proba)
mlflow.xgboost.log_model(
xgb_model=model,
name="model",
signature=signature,
input_example=X_train[:5],
)
XGBoost's scikit-learn compatible estimators work seamlessly with MLflow's sklearn autologging:
import mlflow
import mlflow.sklearn
from xgboost import XGBClassifier, XGBRegressor
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
# Enable sklearn autologging for XGBoost sklearn estimators
mlflow.sklearn.autolog()
# Load data
wine = load_wine()
X_train, X_test, y_train, y_test = train_test_split(
wine.data, wine.target, test_size=0.2, random_state=42
)
with mlflow.start_run(run_name="XGBoost Sklearn API"):
# XGBoost with scikit-learn interface
model = XGBClassifier(
n_estimators=100,
max_depth=6,
learning_rate=0.1,
subsample=0.8,
colsample_bytree=0.8,
random_state=42,
early_stopping_rounds=10,
eval_metric="logloss",
)
# Fit with evaluation set for early stopping
model.fit(X_train, y_train, eval_set=[(X_test, y_test)], verbose=False)
# Cross-validation scores are automatically logged
cv_scores = cross_val_score(model, X_train, y_train, cv=5)
print(f"CV Score: {cv_scores.mean():.3f} (+/- {cv_scores.std() * 2:.3f})")
Pipeline Integration:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
# Create preprocessing pipeline
preprocessor = ColumnTransformer(
transformers=[
("num", StandardScaler(), [0, 1, 2, 3]),
("cat", OneHotEncoder(drop="first"), [4, 5]),
]
)
# Complete ML pipeline
pipeline = Pipeline(
[
("preprocessor", preprocessor),
("classifier", XGBClassifier(n_estimators=100, random_state=42)),
]
)
with mlflow.start_run():
# Entire pipeline is logged including preprocessing steps
pipeline.fit(X_train, y_train)
# Pipeline scoring is automatically captured
train_score = pipeline.score(X_train, y_train)
test_score = pipeline.score(X_test, y_test)
Hyperparameter Optimization
- GridSearchCV
- RandomizedSearchCV
MLflow provides exceptional support for XGBoost hyperparameter optimization, automatically creating organized child runs for parameter search experiments:
import mlflow
from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier
# Enable autologging with hyperparameter tracking
mlflow.sklearn.autolog(max_tuning_runs=10)
# Define parameter grid
param_grid = {
"n_estimators": [50, 100, 200],
"max_depth": [3, 6, 9],
"learning_rate": [0.01, 0.1, 0.2],
"subsample": [0.8, 0.9, 1.0],
"colsample_bytree": [0.8, 0.9, 1.0],
}
with mlflow.start_run(run_name="XGBoost Grid Search"):
# Create base model
xgb_model = XGBClassifier(random_state=42)
# Grid search with cross-validation
grid_search = GridSearchCV(
xgb_model, param_grid, cv=5, scoring="roc_auc", n_jobs=-1, verbose=1
)
grid_search.fit(X_train, y_train)
# Best parameters and scores are automatically logged
print(f"Best parameters: {grid_search.best_params_}")
print(f"Best CV score: {grid_search.best_score_:.3f}")
# Evaluate on test set
test_score = grid_search.score(X_test, y_test)
print(f"Test score: {test_score:.3f}")
MLflow automatically creates a parent run containing the overall search results and child runs for each parameter combination, making it easy to analyze which parameters work best.
For more efficient hyperparameter exploration, especially with large parameter spaces, RandomizedSearchCV provides a great alternative:
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint, uniform
# Define parameter distributions for more efficient exploration
param_distributions = {
"n_estimators": randint(50, 300),
"max_depth": randint(5, 20),
"min_child_weight": randint(1, 10),
"learning_rate": uniform(0.01, 0.3),
"subsample": uniform(0.6, 0.4),
"colsample_bytree": uniform(0.6, 0.4),
"gamma": uniform(0, 0.5),
"reg_alpha": uniform(0, 1),
"reg_lambda": uniform(0, 1),
}
with mlflow.start_run(run_name="XGBoost Randomized Search"):
xgb_model = XGBClassifier(random_state=42)
random_search = RandomizedSearchCV(
xgb_model,
param_distributions,
n_iter=50, # Try 50 random combinations
cv=5,
scoring="roc_auc",
random_state=42,
n_jobs=-1,
)
random_search.fit(X_train, y_train)
# MLflow automatically creates child runs for parameter combinations
# The parent run contains the best model and overall results
The max_tuning_runs parameter in autolog controls how many of the best parameter combinations get their own child runs, helping you focus on the most promising results.
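You can also pull the child runs back into a DataFrame for your own analysis. A sketch, assuming the search was started as with mlflow.start_run(...) as parent_run: so the parent run object is available:
import mlflow

# Child runs created by autologging are tagged with their parent's run ID
child_runs = mlflow.search_runs(
    filter_string=f"tags.mlflow.parentRunId = '{parent_run.info.run_id}'"
)

# Inspect the logged parameter/metric columns for each tuning run
print(child_runs.filter(regex="^(params|metrics)").head())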
Feature Importance Analysis
- Multiple Importance Types
- Feature Selection
XGBoost provides multiple types of feature importance, and MLflow captures them all automatically:
import json
import matplotlib.pyplot as plt
import seaborn as sns
def comprehensive_feature_importance_analysis(model, feature_names=None):
"""Analyze and log comprehensive feature importance."""
importance_types = ["weight", "gain", "cover", "total_gain"]
with mlflow.start_run(run_name="Feature Importance Analysis"):
for imp_type in importance_types:
# Get importance scores
importance = model.get_score(importance_type=imp_type)
if not importance:
continue
# Sort features by importance
sorted_features = sorted(
importance.items(), key=lambda x: x[1], reverse=True
)
# Log individual feature scores
for feature, score in sorted_features[:20]: # Top 20 features
mlflow.log_metric(f"{imp_type}_{feature}", score)
# Create visualization
features, scores = zip(*sorted_features[:20])
plt.figure(figsize=(10, 8))
sns.barplot(x=list(scores), y=list(features))
plt.title(f"Top 20 Feature Importance ({imp_type.title()})")
plt.xlabel("Importance Score")
plt.tight_layout()
# Save and log plot
plot_filename = f"feature_importance_{imp_type}.png"
plt.savefig(plot_filename, dpi=300, bbox_inches="tight")
mlflow.log_artifact(plot_filename)
plt.close()
# Log importance as JSON artifact
json_filename = f"feature_importance_{imp_type}.json"
with open(json_filename, "w") as f:
json.dump(importance, f, indent=2)
mlflow.log_artifact(json_filename)
# Usage
model = xgb.train(params, dtrain, num_boost_round=100)
comprehensive_feature_importance_analysis(model, feature_names=wine.feature_names)
Use XGBoost feature importance for automated feature selection:
from sklearn.feature_selection import SelectFromModel
def feature_selection_pipeline(X_train, y_train, X_test, y_test):
"""Pipeline with XGBoost-based feature selection."""
with mlflow.start_run(run_name="Feature Selection Pipeline"):
# Step 1: Train initial model for feature selection
selector_model = XGBClassifier(n_estimators=50, max_depth=6, random_state=42)
selector_model.fit(X_train, y_train)
# Step 2: Feature selection based on importance
selector = SelectFromModel(
selector_model,
threshold="median", # Select features above median importance
prefit=True,
)
X_train_selected = selector.transform(X_train)
X_test_selected = selector.transform(X_test)
# Log feature selection results
selected_features = selector.get_support()
n_selected = sum(selected_features)
mlflow.log_metrics(
{
"original_features": X_train.shape[1],
"selected_features": n_selected,
"feature_reduction_ratio": n_selected / X_train.shape[1],
}
)
# Step 3: Train final model on selected features
final_model = XGBClassifier(
n_estimators=100, max_depth=8, learning_rate=0.1, random_state=42
)
final_model.fit(X_train_selected, y_train)
# Evaluate performance
train_score = final_model.score(X_train_selected, y_train)
test_score = final_model.score(X_test_selected, y_test)
mlflow.log_metrics(
{
"train_accuracy_selected": train_score,
"test_accuracy_selected": test_score,
}
)
# Log the final model and selector
mlflow.sklearn.log_model(final_model, name="final_model")
mlflow.sklearn.log_model(selector, name="feature_selector")
return final_model, selector
Model Management
- Serialization & Formats
- Model Signatures
- Loading & Usage
XGBoost supports various serialization formats, each optimized for different deployment scenarios:
import mlflow.xgboost
# Train model
model = xgb.train(params, dtrain, num_boost_round=100)
with mlflow.start_run():
# JSON format (recommended) - Human readable and version stable
mlflow.xgboost.log_model(xgb_model=model, name="model_json", model_format="json")
# UBJ format - More compact binary format
mlflow.xgboost.log_model(xgb_model=model, name="model_ubj", model_format="ubj")
# Legacy XGBoost format (deprecated but sometimes needed)
mlflow.xgboost.log_model(xgb_model=model, name="model_xgb", model_format="xgb")
JSON format is recommended for production as it's human-readable and version-stable. UBJ format provides more compact binary serialization. The legacy XGBoost format is deprecated but sometimes needed for compatibility.
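If you need the serialized model on local disk rather than attached to a run (for example, to hand it to a system that never talks to the tracking server), mlflow.xgboost.save_model accepts the same format option. A minimal sketch; the output directory name is arbitrary:
import mlflow.xgboost

# Save to a local directory in JSON format instead of logging to a run
mlflow.xgboost.save_model(
    xgb_model=model,
    path="exported_xgb_model",
    model_format="json",
)

# The directory can be loaded back with the standard loader
reloaded = mlflow.xgboost.load_model("exported_xgb_model")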
Model signatures describe input and output schemas, providing crucial validation for production deployment:
from mlflow.models import infer_signature
import pandas as pd
# Create model signature for production deployment
X_sample = X_train[:100]
# For native XGBoost
predictions = model.predict(xgb.DMatrix(X_sample))
signature = infer_signature(X_sample, predictions)
# For sklearn XGBoost
# predictions = model.predict(X_sample)
# signature = infer_signature(X_sample, predictions)
with mlflow.start_run():
mlflow.xgboost.log_model(
xgb_model=model,
name="model",
signature=signature,
input_example=X_sample[:5], # Sample input for documentation
model_format="json",
)
Model signatures are automatically inferred when autologging is enabled, but you can also create them manually for more control over the schema validation process.
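If you need tighter control than inference provides (for example, fixed column names and types for a serving contract), you can assemble the signature by hand. A sketch; the column names below are illustrative rather than taken from the datasets above:
import mlflow
import mlflow.xgboost
from mlflow.models import ModelSignature
from mlflow.types.schema import Schema, ColSpec

# Hypothetical feature columns, for illustration only
feature_columns = ["age", "bmi", "blood_pressure", "cholesterol"]

input_schema = Schema([ColSpec("double", name) for name in feature_columns])
output_schema = Schema([ColSpec("double")])
manual_signature = ModelSignature(inputs=input_schema, outputs=output_schema)

with mlflow.start_run():
    mlflow.xgboost.log_model(
        xgb_model=model,
        name="model",
        signature=manual_signature,
        model_format="json",
    )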
MLflow provides flexible ways to load and use your saved XGBoost models:
# Load model in different ways
run_id = "your_run_id_here"
# Load as native XGBoost model (preserves all XGBoost functionality)
xgb_model = mlflow.xgboost.load_model(f"runs:/{run_id}/model")
predictions = xgb_model.predict(xgb.DMatrix(X_test))
# Load as PyFunc model (generic Python function interface)
pyfunc_model = mlflow.pyfunc.load_model(f"runs:/{run_id}/model")
predictions = pyfunc_model.predict(pd.DataFrame(X_test))
# Load from model registry (production deployment)
registered_model = mlflow.pyfunc.load_model("models:/XGBoostModel@champion")
The PyFunc format is particularly useful for deployment scenarios where you need a consistent interface across different model types and frameworks.
Production Deployment
- Model Registry
- Model Serving
The Model Registry provides centralized model management with version control and alias-based deployment. This is essential for managing XGBoost models from development through production deployment:
from mlflow import MlflowClient
client = MlflowClient()
# Register model to MLflow Model Registry
with mlflow.start_run():
mlflow.xgboost.log_model(
xgb_model=model,
name="model",
registered_model_name="XGBoostChurnModel",
signature=signature,
model_format="json",
)
# Use aliases instead of deprecated stages for deployment management
# Set aliases for different deployment environments
model_version = client.get_latest_versions("XGBoostChurnModel")[0]
client.set_registered_model_alias(
name="XGBoostChurnModel",
alias="champion", # Production model
version=model_version.version,
)
client.set_registered_model_alias(
name="XGBoostChurnModel",
alias="challenger", # A/B testing model
version=model_version.version,
)
# Use tags to track model status and metadata
client.set_model_version_tag(
name="XGBoostChurnModel",
version=model_version.version,
key="validation_status",
value="approved",
)
client.set_model_version_tag(
name="XGBoostChurnModel",
version=model_version.version,
key="model_type",
value="xgboost_classifier",
)
client.set_model_version_tag(
name="XGBoostChurnModel",
version=model_version.version,
key="feature_importance_type",
value="gain",
)
Modern Model Registry Features:
Model Aliases replace deprecated stages with flexible, named references. You can assign multiple aliases to any model version (e.g., champion, challenger, shadow), update aliases independently of model training for seamless deployments, and use them for A/B testing and gradual rollouts.
Model Tags provide rich metadata and status tracking. Track validation status with validation_status: approved, mark model characteristics with model_type: xgboost_classifier, and add performance metrics like best_auc_score: 0.95.
Environment-based Models support mature MLOps workflows. Create separate registered models per environment (dev.XGBoostChurnModel, staging.XGBoostChurnModel, prod.XGBoostChurnModel) and use copy_model_version() to promote models across environments.
# Promote model from staging to production environment
client.copy_model_version(
src_model_uri="models:/staging.XGBoostChurnModel@candidate",
dst_name="prod.XGBoostChurnModel",
)
MLflow provides built-in model serving capabilities that make it easy to deploy your XGBoost models as REST APIs:
# Serve model using alias for production deployment
mlflow models serve \
-m "models:/XGBoostChurnModel@champion" \
-p 5000 \
--no-conda
# Or serve a specific version
mlflow models serve \
-m "models:/XGBoostChurnModel/3" \
-p 5000 \
--no-conda
Deployment Best Practices:
Use aliases for production serving by pointing to @champion or @production instead of hard-coding version numbers. Implement blue-green deployments by updating aliases to switch traffic between model versions instantly. Log model signatures so inputs are validated automatically at serving time. Use the JSON format for better compatibility and debugging.
Once your model is served, you can make predictions by sending POST requests:
import requests
import json
# Example prediction request
data = {"inputs": [[1.2, 0.8, 3.4, 2.1]]} # Feature values
response = requests.post(
"http://localhost:5000/invocations",
headers={"Content-Type": "application/json"},
data=json.dumps(data),
)
predictions = response.json()
For larger production deployments, you can also deploy MLflow models to cloud platforms like AWS SageMaker, Azure ML, or deploy them as Docker containers for Kubernetes orchestration.
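For container-based deployments, recent MLflow versions also expose the image build step as a Python API. A hedged sketch, assuming Docker is installed locally and your MLflow version provides mlflow.models.build_docker:
from mlflow.models import build_docker

# Build a Docker image that serves the champion model over the same REST interface
build_docker(
    model_uri="models:/XGBoostChurnModel@champion",
    name="xgb-churn-service",
)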
Advanced Features
- Custom Objectives & Metrics
- Autolog Configuration
- Performance Optimization
XGBoost allows custom objective functions and evaluation metrics, which MLflow can track:
def custom_objective_function(y_pred, y_true):
"""Custom objective function for XGBoost."""
# Example: Focal loss for imbalanced classification
alpha = 0.25
gamma = 2.0
# The second argument is the training DMatrix; extract its labels
y_true = y_true.get_label()
# Calculate focal loss gradients and hessians
p = 1 / (1 + np.exp(-y_pred)) # sigmoid
# Focal loss gradient
grad = alpha * (1 - p) ** gamma * (gamma * p * np.log(p + 1e-8) + p - y_true)
# Focal loss hessian
hess = (
alpha
* (1 - p) ** gamma
* (gamma * (gamma + 1) * p * np.log(p + 1e-8) + 2 * gamma * p + p)
)
return grad, hess
def custom_eval_metric(y_pred, y_true):
"""Custom evaluation metric."""
y_true = y_true.get_label()
y_pred = 1 / (1 + np.exp(-y_pred)) # sigmoid
# Custom F-beta score
beta = 2.0
precision = np.sum((y_pred > 0.5) & (y_true == 1)) / np.sum(y_pred > 0.5)
recall = np.sum((y_pred > 0.5) & (y_true == 1)) / np.sum(y_true == 1)
f_beta = (1 + beta**2) * precision * recall / (beta**2 * precision + recall)
return "f_beta", f_beta
# Train with custom objective and metric
with mlflow.start_run():
model = xgb.train(
params=params,
dtrain=dtrain,
obj=custom_objective_function,
feval=custom_eval_metric,
num_boost_round=100,
evals=[(dtrain, "train"), (dtest, "test")],
verbose_eval=10,
)
MLflow's XGBoost autologging behavior can be customized to fit your specific workflow needs:
# Fine-tune autologging behavior
mlflow.xgboost.autolog(
importance_types=["weight", "gain", "cover"], # Types of importance to log
log_input_examples=True, # Include input examples in logged models
log_model_signatures=True, # Include model signatures
log_models=True, # Log trained models
log_datasets=True, # Log dataset information
model_format="json", # Use JSON format for better compatibility
registered_model_name="XGBoostModel", # Auto-register models
extra_tags={"team": "data-science", "project": "customer-churn"},
)
These configuration options give you fine-grained control over the autologging behavior. Importance types controls which feature importance metrics are captured. Dataset logging tracks the data used for training and evaluation. Input examples and signatures are crucial for production deployment. Extra tags help organize experiments across teams and projects.
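A common variation is keeping the automatic parameter and metric capture but handling model logging yourself, for example to control the signature or the registration step. A sketch of that pattern, reusing params, dtrain, and X_train from the earlier examples:
import mlflow
import mlflow.xgboost
import xgboost as xgb
from mlflow.models import infer_signature

# Keep parameter/metric autologging, but skip automatic model logging
mlflow.xgboost.autolog(log_models=False)

with mlflow.start_run():
    model = xgb.train(params, dtrain, num_boost_round=100)

    # Log the model manually with an explicit signature and registry entry
    signature = infer_signature(X_train, model.predict(xgb.DMatrix(X_train)))
    mlflow.xgboost.log_model(
        xgb_model=model,
        name="model",
        signature=signature,
        registered_model_name="XGBoostModel",
    )

# Autologging can be switched off entirely when it is not wanted
mlflow.xgboost.autolog(disable=True)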
XGBoost offers several performance optimization options that MLflow can track:
# GPU-accelerated training
def gpu_accelerated_training(X_train, y_train, X_test, y_test):
"""GPU-accelerated XGBoost training."""
with mlflow.start_run(run_name="GPU XGBoost"):
# GPU-optimized parameters
params = {
"tree_method": "gpu_hist",  # GPU training (XGBoost >= 2.0 prefers tree_method="hist" with device="cuda")
"gpu_id": 0,  # GPU device ID
"predictor": "gpu_predictor",  # Use GPU for prediction
"objective": "binary:logistic",
"eval_metric": "logloss",
"max_depth": 8,
"learning_rate": 0.1,
}
dtrain = xgb.DMatrix(X_train, label=y_train)
dtest = xgb.DMatrix(X_test, label=y_test)
model = xgb.train(
params=params,
dtrain=dtrain,
num_boost_round=500,
evals=[(dtrain, "train"), (dtest, "test")],
early_stopping_rounds=50,
)
return model
# Memory-efficient training for large datasets
def memory_efficient_training():
"""Memory efficient training for large datasets."""
with mlflow.start_run():
# Enable histogram-based algorithm for faster training
params = {
"tree_method": "hist", # Use histogram-based algorithm
"max_bin": 256, # Number of bins for histogram
"single_precision_histogram": True, # Use single precision
"objective": "reg:squarederror",
"eval_metric": "rmse",
}
# For very large datasets, consider loading from file
# dtrain = xgb.DMatrix('train.libsvm')
# dtest = xgb.DMatrix('test.libsvm')
model = xgb.train(
params=params,
dtrain=dtrain,
num_boost_round=1000,
evals=[(dtest, "test")],
early_stopping_rounds=50,
verbose_eval=100,
)
return model
Model Evaluation with MLflow
- MLflow Evaluate API
- Regression Evaluation
- Custom Metrics & Artifacts
- Manual Evaluation
MLflow provides a comprehensive evaluation API that automatically generates metrics, visualizations, and diagnostic tools:
import mlflow
import xgboost as xgb
from sklearn.model_selection import train_test_split
from mlflow.models import infer_signature
# Prepare data and train model
model = xgb.XGBClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
# Create evaluation dataset (assumes X_test is a pandas DataFrame)
eval_data = X_test.copy()
eval_data["label"] = y_test
with mlflow.start_run():
# Log model with signature
signature = infer_signature(X_test, model.predict(X_test))
mlflow.sklearn.log_model(model, name="model", signature=signature)
model_uri = mlflow.get_artifact_uri("model")
# Comprehensive evaluation with MLflow
result = mlflow.evaluate(
model_uri,
eval_data,
targets="label",
model_type="classifier", # or "regressor" for regression
evaluators=["default"],
)
# Access automatic metrics
print(f"Accuracy: {result.metrics['accuracy_score']:.3f}")
print(f"F1 Score: {result.metrics['f1_score']:.3f}")
print(f"ROC AUC: {result.metrics['roc_auc']:.3f}")
# Access generated artifacts
print("Generated artifacts:")
for artifact_name, path in result.artifacts.items():
print(f" {artifact_name}: {path}")
Automatic Generation Includes:
- Performance Metrics: accuracy, precision, recall, F1-score, and ROC-AUC for classification
- Visualizations: confusion matrix, ROC curve, precision-recall curve
- Feature Importance: SHAP values and feature contribution analysis
- Model Artifacts: all plots and diagnostic information saved to MLflow
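You can also wrap the evaluation frame in an MLflow dataset object so the run records dataset lineage alongside the metrics. A sketch, assuming eval_data is the pandas DataFrame built above:
import mlflow
import mlflow.data

# Wrap the evaluation DataFrame so its metadata is tracked with the run
dataset = mlflow.data.from_pandas(eval_data, targets="label", name="holdout_set")

with mlflow.start_run():
    result = mlflow.evaluate(
        model_uri,
        dataset,
        model_type="classifier",
        evaluators=["default"],
    )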
For XGBoost regression models, MLflow automatically provides regression-specific metrics:
from sklearn.datasets import fetch_california_housing
# Load regression dataset
housing = fetch_california_housing(as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
housing.data, housing.target, test_size=0.2, random_state=42
)
# Train XGBoost regressor
reg_model = xgb.XGBRegressor(n_estimators=100, max_depth=6, random_state=42)
reg_model.fit(X_train, y_train)
# Create evaluation dataset
eval_data = X_test.copy()
eval_data["target"] = y_test
with mlflow.start_run():
# Log and evaluate regression model
signature = infer_signature(X_train, reg_model.predict(X_train))
mlflow.sklearn.log_model(reg_model, name="model", signature=signature)
model_uri = mlflow.get_artifact_uri("model")
result = mlflow.evaluate(
model_uri,
eval_data,
targets="target",
model_type="regressor",
evaluators=["default"],
)
print(f"MAE: {result.metrics['mean_absolute_error']:.3f}")
print(f"RMSE: {result.metrics['root_mean_squared_error']:.3f}")
print(f"R² Score: {result.metrics['r2_score']:.3f}")
Automatic Regression Metrics:
Mean Absolute Error (MAE), Mean Squared Error (MSE), and Root MSE provide error magnitude assessment. R² Score and Adjusted R² measure model fit quality. Mean Absolute Percentage Error (MAPE) shows relative error rates. Residual plots and distribution analysis help identify violations of model assumptions.
Extend MLflow evaluation with custom metrics and visualizations:
from mlflow.models import make_metric
import matplotlib.pyplot as plt
import numpy as np
import os
def profit_metric(predictions, targets, sample_weights=None):
"""Custom business metric: profit from correct predictions."""
# Assume profit of $100 per correct prediction, $50 loss per error
correct_predictions = (predictions == targets).sum()
incorrect_predictions = len(predictions) - correct_predictions
profit = (correct_predictions * 100) - (incorrect_predictions * 50)
return profit
def create_feature_importance_comparison(eval_df, builtin_metrics, artifacts_dir):
"""Compare XGBoost native importance with SHAP values."""
# This would use model feature importance from eval_df
# Create comparison visualization
plt.figure(figsize=(12, 8))
# Placeholder for actual feature importance comparison
features = [f"feature_{i}" for i in range(10)]
xgb_importance = np.random.random(10)
shap_importance = np.random.random(10)
x = np.arange(len(features))
width = 0.35
plt.bar(x - width / 2, xgb_importance, width, label="XGBoost Native", alpha=0.8)
plt.bar(x + width / 2, shap_importance, width, label="SHAP Values", alpha=0.8)
plt.xlabel("Features")
plt.ylabel("Importance")
plt.title("Feature Importance Comparison")
plt.xticks(x, features, rotation=45)
plt.legend()
plt.tight_layout()
plot_path = os.path.join(artifacts_dir, "importance_comparison.png")
plt.savefig(plot_path)
plt.close()
return {"importance_comparison": plot_path}
# Create custom metric
custom_profit = make_metric(
eval_fn=profit_metric, greater_is_better=True, name="profit_score"
)
# Use custom metrics and artifacts
result = mlflow.evaluate(
model_uri,
eval_data,
targets="label",
model_type="classifier",
extra_metrics=[custom_profit],
custom_artifacts=[create_feature_importance_comparison],
)
print(f"Custom Profit Score: ${result.metrics['profit_score']:.2f}")
For cases where you need more control or custom evaluation logic, you can still implement manual evaluation:
import numpy as np
from sklearn.metrics import (
accuracy_score,
precision_score,
recall_score,
f1_score,
roc_auc_score,
roc_curve,
precision_recall_curve,
confusion_matrix,
average_precision_score,
)
import matplotlib.pyplot as plt
import seaborn as sns
def comprehensive_xgboost_evaluation(model, X_test, y_test, X_train=None, y_train=None):
"""Comprehensive XGBoost model evaluation with MLflow logging."""
with mlflow.start_run(run_name="Comprehensive Model Evaluation"):
# Predictions
if hasattr(model, "predict_proba"):
y_pred_proba = model.predict_proba(X_test)[:, 1]
y_pred = (y_pred_proba > 0.5).astype(int)
else:
# Native XGBoost model
if isinstance(X_test, xgb.DMatrix):
dtest = X_test
else:
dtest = xgb.DMatrix(X_test)
y_pred_proba = model.predict(dtest)
y_pred = (y_pred_proba > 0.5).astype(int)
# Basic metrics
metrics = {
"accuracy": accuracy_score(y_test, y_pred),
"precision": precision_score(y_test, y_pred, average="weighted"),
"recall": recall_score(y_test, y_pred, average="weighted"),
"f1_score": f1_score(y_test, y_pred, average="weighted"),
"roc_auc": roc_auc_score(y_test, y_pred_proba),
}
mlflow.log_metrics(metrics)
# Training metrics if provided
if X_train is not None and y_train is not None:
if hasattr(model, "predict_proba"):
y_train_pred = model.predict_proba(X_train)[:, 1]
else:
dtrain = (
xgb.DMatrix(X_train)
if not isinstance(X_train, xgb.DMatrix)
else X_train
)
y_train_pred = model.predict(dtrain)
train_metrics = {
"train_accuracy": accuracy_score(
y_train, (y_train_pred > 0.5).astype(int)
),
"train_roc_auc": roc_auc_score(y_train, y_train_pred),
}
mlflow.log_metrics(train_metrics)
# ROC Curve
fpr, tpr, _ = roc_curve(y_test, y_pred_proba)
plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, label=f'ROC Curve (AUC = {metrics["roc_auc"]:.3f})')
plt.plot([0, 1], [0, 1], "k--", label="Random Classifier")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC Curve")
plt.legend()
plt.grid(True)
plt.savefig("roc_curve.png", dpi=300, bbox_inches="tight")
mlflow.log_artifact("roc_curve.png")
plt.close()
# Precision-Recall Curve
precision, recall, _ = precision_recall_curve(y_test, y_pred_proba)
avg_precision = average_precision_score(y_test, y_pred_proba)
plt.figure(figsize=(8, 6))
plt.plot(recall, precision, label=f"PR Curve (AP = {avg_precision:.3f})")
plt.xlabel("Recall")
plt.ylabel("Precision")
plt.title("Precision-Recall Curve")
plt.legend()
plt.grid(True)
plt.savefig("precision_recall_curve.png", dpi=300, bbox_inches="tight")
mlflow.log_artifact("precision_recall_curve.png")
plt.close()
# Confusion Matrix
cm = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues")
plt.title("Confusion Matrix")
plt.ylabel("True Label")
plt.xlabel("Predicted Label")
plt.savefig("confusion_matrix.png", dpi=300, bbox_inches="tight")
mlflow.log_artifact("confusion_matrix.png")
plt.close()
mlflow.log_metric("average_precision", avg_precision)
Model Comparison and Selection
- MLflow Model Comparison
- Hyperparameter Evaluation
Use MLflow evaluate to systematically compare multiple XGBoost configurations:
from sklearn.ensemble import RandomForestClassifier
# Define XGBoost variants to compare
xgb_models = {
"xgb_shallow": xgb.XGBClassifier(max_depth=3, n_estimators=100, random_state=42),
"xgb_deep": xgb.XGBClassifier(max_depth=8, n_estimators=100, random_state=42),
"xgb_boosted": xgb.XGBClassifier(max_depth=6, n_estimators=200, random_state=42),
}
# Compare with other algorithms
all_models = {
**xgb_models,
"random_forest": RandomForestClassifier(n_estimators=100, random_state=42),
}
# Evaluate each model systematically
comparison_results = {}
for model_name, model in all_models.items():
with mlflow.start_run(run_name=f"eval_{model_name}"):
# Train model
model.fit(X_train, y_train)
# Log model
signature = infer_signature(X_train, model.predict(X_train))
mlflow.sklearn.log_model(model, name="model", signature=signature)
model_uri = mlflow.get_artifact_uri("model")
# Comprehensive evaluation with MLflow
result = mlflow.evaluate(
model_uri,
eval_data,
targets="label",
model_type="classifier",
evaluators=["default"],
)
comparison_results[model_name] = result.metrics
# Log key metrics for comparison
mlflow.log_metrics(
{
"accuracy": result.metrics["accuracy_score"],
"f1": result.metrics["f1_score"],
"roc_auc": result.metrics["roc_auc"],
"precision": result.metrics["precision_score"],
"recall": result.metrics["recall_score"],
}
)
# Create comparison summary
import pandas as pd
comparison_df = pd.DataFrame(comparison_results).T
print("Model Comparison Summary:")
print(comparison_df[["accuracy_score", "f1_score", "roc_auc"]].round(3))
# Identify best model
best_model = comparison_df["f1_score"].idxmax()
print(f"\nBest model by F1 score: {best_model}")
Combine hyperparameter tuning with MLflow evaluation:
from sklearn.model_selection import ParameterGrid
# Define parameter grid for XGBoost
param_grid = {
"max_depth": [3, 6, 9],
"learning_rate": [0.01, 0.1, 0.2],
"n_estimators": [100, 200],
"subsample": [0.8, 1.0],
}
# Evaluate each parameter combination
grid_results = []
for params in ParameterGrid(param_grid):
with mlflow.start_run(run_name="xgb_grid_search"):
# Log parameters
mlflow.log_params(params)
# Train model with current parameters
model = xgb.XGBClassifier(**params, random_state=42)
model.fit(X_train, y_train)
# Log and evaluate
signature = infer_signature(X_train, model.predict(X_train))
mlflow.sklearn.log_model(model, name="model", signature=signature)
model_uri = mlflow.get_artifact_uri("model")
# MLflow evaluation
result = mlflow.evaluate(
model_uri,
eval_data,
targets="label",
model_type="classifier",
evaluators=["default"],
)
# Track results
grid_results.append(
{
**params,
"f1_score": result.metrics["f1_score"],
"roc_auc": result.metrics["roc_auc"],
"accuracy": result.metrics["accuracy_score"],
}
)
# Log selection metric
mlflow.log_metric("grid_search_score", result.metrics["f1_score"])
# Find best parameters
best_result = max(grid_results, key=lambda x: x["f1_score"])
print(f"Best parameters: {best_result}")
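A natural follow-up is to retrain on the winning combination and register the result. A sketch; the registered model name is illustrative:
from mlflow.models import infer_signature

# Separate the hyperparameters from the metrics tracked alongside them
metric_keys = {"f1_score", "roc_auc", "accuracy"}
best_params = {k: v for k, v in best_result.items() if k not in metric_keys}

with mlflow.start_run(run_name="xgb_best_from_grid"):
    mlflow.log_params(best_params)
    final_model = xgb.XGBClassifier(**best_params, random_state=42)
    final_model.fit(X_train, y_train)

    signature = infer_signature(X_train, final_model.predict(X_train))
    mlflow.sklearn.log_model(
        final_model,
        name="model",
        signature=signature,
        registered_model_name="XGBoostGridSearchBest",  # illustrative name
    )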
Model Validation and Quality Gates
Use MLflow's validation API to ensure model quality:
from mlflow.models import MetricThreshold
# First, evaluate your XGBoost model
result = mlflow.evaluate(model_uri, eval_data, targets="label", model_type="classifier")
# Define quality thresholds for XGBoost models
quality_thresholds = {
"accuracy_score": MetricThreshold(threshold=0.85, greater_is_better=True),
"f1_score": MetricThreshold(threshold=0.80, greater_is_better=True),
"roc_auc": MetricThreshold(threshold=0.75, greater_is_better=True),
}
# Validate model meets quality standards
try:
mlflow.validate_evaluation_results(
candidate_result=result,
validation_thresholds=quality_thresholds,
)
print("✅ XGBoost model meets all quality thresholds")
except mlflow.exceptions.ModelValidationFailedException as e:
print(f"❌ Model failed validation: {e}")
# Compare against baseline model (e.g., previous XGBoost version)
baseline_result = mlflow.evaluate(
baseline_model_uri, eval_data, targets="label", model_type="classifier"
)
# Validate improvement over baseline
improvement_thresholds = {
"f1_score": MetricThreshold(
threshold=0.02, greater_is_better=True # Must be 2% better
),
}
try:
mlflow.validate_evaluation_results(
candidate_result=result,
baseline_result=baseline_result,
validation_thresholds=improvement_thresholds,
)
print("✅ New XGBoost model improves over baseline")
except mlflow.exceptions.ModelValidationFailedException as e:
print(f"❌ Model doesn't improve sufficiently: {e}")
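These checks fit naturally in front of registration so that only models passing the gates reach the registry. A sketch of that gating pattern, reusing result, quality_thresholds, and model_uri from above; the registered model name is illustrative:
import mlflow

def validate_and_register(candidate_result, thresholds, model_uri, name="XGBoostChurnModel"):
    """Register the candidate model only if it clears the quality gates."""
    try:
        mlflow.validate_evaluation_results(
            candidate_result=candidate_result,
            validation_thresholds=thresholds,
        )
    except mlflow.exceptions.ModelValidationFailedException as e:
        print(f"Skipping registration, validation failed: {e}")
        return None

    # All gates passed: promote the model into the registry
    return mlflow.register_model(model_uri=model_uri, name=name)

model_version = validate_and_register(result, quality_thresholds, model_uri)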
Advanced XGBoost Features
- Multi-Class Classification
- Custom Callbacks
XGBoost naturally handles multi-class classification with MLflow tracking:
from sklearn.datasets import load_digits
from sklearn.metrics import classification_report
# Multi-class classification
digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
digits.data, digits.target, test_size=0.2, random_state=42
)
with mlflow.start_run(run_name="Multi-class XGBoost"):
# XGBoost naturally handles multi-class
model = XGBClassifier(
objective="multi:softprob",
num_class=10, # 10 digit classes
n_estimators=100,
max_depth=6,
random_state=42,
)
model.fit(X_train, y_train)
# Multi-class predictions
y_pred = model.predict(X_test)
y_pred_proba = model.predict_proba(X_test)
# Multi-class metrics
report = classification_report(y_test, y_pred, output_dict=True)
# Log per-class metrics
for class_label, metrics in report.items():
if isinstance(metrics, dict):
mlflow.log_metrics(
{
f"class_{class_label}_precision": metrics["precision"],
f"class_{class_label}_recall": metrics["recall"],
f"class_{class_label}_f1": metrics["f1-score"],
}
)
Implement custom callbacks for advanced monitoring and control:
class MLflowCallback(xgb.callback.TrainingCallback):
def __init__(self):
self.metrics_history = []
def after_iteration(self, model, epoch, evals_log):
# Log metrics in real-time
metrics = {}
for dataset, metric_dict in evals_log.items():
for metric_name, values in metric_dict.items():
key = f"{dataset}_{metric_name}"
metrics[key] = values[-1] # Latest value
mlflow.log_metrics(metrics, step=epoch)
self.metrics_history.append(metrics)
# Custom logic for model checkpointing
if epoch % 50 == 0:
temp_model_path = f"checkpoint_epoch_{epoch}.json"
model.save_model(temp_model_path)
mlflow.log_artifact(temp_model_path)
return False # Continue training
# Usage
with mlflow.start_run():
callback = MLflowCallback()
model = xgb.train(params, dtrain, callbacks=[callback], num_boost_round=1000)
Best Practices and Organization
- Reproducibility
- Experiment Organization
Ensure reproducible XGBoost experiments with comprehensive environment tracking:
import platform
import random
import numpy as np
import xgboost
def reproducible_xgboost_experiment(experiment_name, random_state=42):
"""Set up reproducible XGBoost experiment."""
# Set random seeds for reproducibility
np.random.seed(random_state)
random.seed(random_state)
# Set experiment
mlflow.set_experiment(experiment_name)
with mlflow.start_run():
mlflow.set_tags(
{
"python_version": platform.python_version(),
"xgboost_version": xgboost.__version__,
"platform": platform.platform(),
"random_state": random_state,
}
)
# Log dataset information
mlflow.log_params(
{
"dataset_size": len(X_train),
"n_features": X_train.shape[1],
"n_classes": len(np.unique(y_train)),
"class_distribution": dict(
zip(*np.unique(y_train, return_counts=True))
),
}
)
# Your model training code here
params = {
"objective": "binary:logistic",
"max_depth": 6,
"learning_rate": 0.1,
"random_state": random_state,
"n_jobs": -1,
}
model = XGBClassifier(**params)
model.fit(X_train, y_train)
return model
# Usage
model = reproducible_xgboost_experiment("Customer_Churn_Analysis_v2")
Organize XGBoost experiments effectively for team collaboration:
# Organize experiments with descriptive names and tags
experiment_name = "XGBoost Customer Churn - Q4 2024"
mlflow.set_experiment(experiment_name)
with mlflow.start_run(run_name="Baseline XGBoost Model"):
# Use consistent tagging for easy filtering and organization
mlflow.set_tags(
{
"model_type": "gradient_boosting",
"algorithm": "xgboost",
"dataset_version": "v2.1",
"feature_engineering": "standard",
"purpose": "baseline",
"tree_method": "hist",
"objective": "binary:logistic",
}
)
# Train model with comprehensive logging
model = XGBClassifier(
n_estimators=100, max_depth=6, learning_rate=0.1, random_state=42
)
model.fit(X_train, y_train)
Consistent tagging and naming conventions make it much easier to find, compare, and understand XGBoost experiments later. Consider establishing team-wide conventions for experiment names, tags, and run organization.
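Consistent tags pay off when you later need to pull related runs back out of the tracking server. A sketch that filters the experiment above by the tags it sets:
import mlflow

# Find baseline XGBoost runs in this experiment using the tags defined above
runs = mlflow.search_runs(
    experiment_names=["XGBoost Customer Churn - Q4 2024"],
    filter_string="tags.algorithm = 'xgboost' AND tags.purpose = 'baseline'",
)
print(runs[["run_id", "tags.dataset_version"]].head())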
Conclusion
MLflow's XGBoost integration provides a comprehensive solution for gradient boosting experiment management and deployment. Whether you're using the native XGBoost API for maximum performance or the scikit-learn interface for pipeline integration, MLflow captures all the essential information needed for reproducible machine learning.
Key benefits of using MLflow with XGBoost:
- Comprehensive Autologging: one-line setup that captures parameters, metrics, and feature importance
- Dual API Support: seamless integration with both the native and scikit-learn XGBoost interfaces
- Advanced Feature Analysis: multiple importance types with automatic visualization
- Production-Ready Deployment: model registry integration with multiple serialization formats
- Performance Optimization: support for GPU acceleration and memory-efficient training
- Competition-Grade Tracking: detailed experiment management for winning ML solutions
The patterns and examples in this guide provide a solid foundation for building scalable, reproducible gradient boosting systems with XGBoost and MLflow. Start with autologging for immediate benefits, then gradually adopt more advanced features like custom objectives, callbacks, and sophisticated deployment patterns as your projects grow in complexity and scale.