Dataset Evaluation

Dataset evaluation allows you to assess model performance on pre-computed predictions without re-running the model. This is particularly useful for evaluating large-scale batch inference results or historical predictions, or when you want to separate the prediction and evaluation phases.

Quick Start: Evaluating Static Predictions

The simplest dataset evaluation involves a DataFrame with predictions and targets:

import mlflow
import pandas as pd
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Generate sample data and train a model
X, y = make_classification(n_samples=1000, n_features=20, n_classes=2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Generate predictions (this could be from a batch job, stored results, etc.)
predictions = model.predict(X_test)
prediction_probabilities = model.predict_proba(X_test)[:, 1]

# Create evaluation dataset with predictions already computed
eval_dataset = pd.DataFrame(
    {
        "prediction": predictions,
        "prediction_proba": prediction_probabilities,
        "target": y_test,
    }
)

# Add original features for analysis (optional)
feature_names = [f"feature_{i}" for i in range(X_test.shape[1])]
for i, feature_name in enumerate(feature_names):
    eval_dataset[feature_name] = X_test[:, i]

with mlflow.start_run():
    # Evaluate static dataset - no model needed!
    result = mlflow.evaluate(
        data=eval_dataset,
        predictions="prediction",  # Column containing predictions
        targets="target",  # Column containing true labels
        model_type="classifier",
    )

print(f"Accuracy: {result.metrics['accuracy_score']:.3f}")
print(f"F1 Score: {result.metrics['f1_score']:.3f}")
print(f"ROC AUC: {result.metrics['roc_auc']:.3f}")

This approach is perfect when:

  • You have batch prediction results from a production system
  • You want to evaluate historical predictions
  • You're comparing different versions of the same model's outputs
  • You need to separate compute-intensive prediction from evaluation

Dataset Management

For more structured dataset management, use MLflow's PandasDataset:

import mlflow.data

# Create MLflow dataset with prediction column specified
dataset = mlflow.data.from_pandas(
    eval_dataset,
    predictions="prediction",  # Specify prediction column
    targets="target",  # Specify target column
)

with mlflow.start_run():
    # Log the dataset
    mlflow.log_input(dataset, context="evaluation")

    # Evaluate using the dataset (predictions=None since specified in dataset)
    result = mlflow.evaluate(
        data=dataset,
        predictions=None,  # Already specified in dataset creation
        targets="target",
        model_type="classifier",
    )

print("Evaluation completed using MLflow PandasDataset")

Batch Evaluation Workflows

For production batch inference results:

def evaluate_batch_predictions(batch_results_path):
    """Evaluate large batch prediction results efficiently."""

    # Read batch results (could be from S3, database, etc.)
    batch_df = pd.read_parquet(batch_results_path)

    print(f"Evaluating {len(batch_df)} batch predictions")

    with mlflow.start_run(run_name="Batch_Evaluation"):
        # Log batch metadata
        mlflow.log_params(
            {
                "batch_size": len(batch_df),
                "batch_date": (
                    batch_df["prediction_date"].iloc[0]
                    if "prediction_date" in batch_df.columns and len(batch_df) > 0
                    else "unknown"
                ),
                "data_source": batch_results_path,
            }
        )

        # Evaluate full batch
        result = mlflow.evaluate(
            data=batch_df,
            predictions="model_prediction",
            targets="true_label",
            model_type="classifier",
        )

        # Additional batch-specific analysis
        if "prediction_timestamp" in batch_df.columns:
            # Analyze performance over time
            batch_df["hour"] = pd.to_datetime(batch_df["prediction_timestamp"]).dt.hour
            hourly_accuracy = batch_df.groupby("hour").apply(
                lambda x: (x["model_prediction"] == x["true_label"]).mean()
            )

            # Log time-based metrics
            for hour, accuracy in hourly_accuracy.items():
                mlflow.log_metric(f"accuracy_hour_{hour}", accuracy)

    return result


# Usage
# result = evaluate_batch_predictions("s3://my-bucket/batch-predictions/2024-01-15.parquet")

Working with Large Datasets

For datasets too large to fit in memory:

import pyarrow.parquet as pq


def evaluate_large_dataset_in_chunks(data_path, chunk_size=50000):
    """Evaluate very large datasets by processing them in chunks."""

    # Stream the Parquet file in record batches
    # (pandas cannot read Parquet in chunks directly)
    parquet_file = pq.ParquetFile(data_path)
    chunk_results = []
    total_samples = 0

    with mlflow.start_run(run_name="Large_Dataset_Evaluation"):
        for chunk_idx, record_batch in enumerate(
            parquet_file.iter_batches(batch_size=chunk_size)
        ):
            chunk = record_batch.to_pandas()
            chunk_size_actual = len(chunk)
            total_samples += chunk_size_actual

            # Evaluate chunk
            with mlflow.start_run(run_name=f"Chunk_{chunk_idx}", nested=True):
                chunk_result = mlflow.evaluate(
                    data=chunk,
                    predictions="prediction",
                    targets="target",
                    model_type="classifier",
                )

                # Weight metrics by chunk size for aggregation
                weighted_metrics = {
                    f"{k}_weighted": v * chunk_size_actual
                    for k, v in chunk_result.metrics.items()
                    if isinstance(v, (int, float))
                }

                chunk_results.append(
                    {
                        "chunk_idx": chunk_idx,
                        "chunk_size": chunk_size_actual,
                        "metrics": chunk_result.metrics,
                        "weighted_metrics": weighted_metrics,
                    }
                )

                mlflow.log_param("chunk_size", chunk_size_actual)

        # Aggregate results across chunks
        if chunk_results:
            # Calculate weighted averages
            total_weighted = {}
            for chunk_summary in chunk_results:
                for metric, value in chunk_summary["weighted_metrics"].items():
                    total_weighted[metric] = total_weighted.get(metric, 0) + value

            # Log aggregated metrics
            aggregated_metrics = {
                k.replace("_weighted", "_aggregate"): v / total_samples
                for k, v in total_weighted.items()
            }

            mlflow.log_metrics(aggregated_metrics)
            mlflow.log_params(
                {
                    "total_samples": total_samples,
                    "chunks_processed": len(chunk_results),
                    "avg_chunk_size": total_samples / len(chunk_results),
                }
            )

    return chunk_results


# Usage
# results = evaluate_large_dataset_in_chunks("large_predictions.parquet")

Key Use Cases and Benefits

Dataset evaluation in MLflow is particularly valuable for several scenarios:

Batch Processing - Perfect for evaluating large-scale batch prediction results from production systems without re-running expensive inference.

Historical Analysis - Ideal for analyzing model performance trends over time using previously computed predictions and ground truth data.

Model Comparison - Excellent for comparing different model versions' outputs on the same dataset without re-training or re-inference.

Production Monitoring - Essential for automated evaluation pipelines that assess model performance on incoming batch predictions.

Cost Optimization - Reduces computational costs by separating prediction generation from performance assessment, allowing evaluation without model re-execution.
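
As a concrete example of the model-comparison case, the sketch below evaluates two sets of stored predictions against the same targets in separate runs. The DataFrame name comparison_df and its columns model_v1_prediction, model_v2_prediction, and target are hypothetical; substitute your own column names.

# Hedged sketch: compare two model versions' stored predictions on one dataset.
# `comparison_df` and its column names are assumptions for illustration.
import mlflow

for version, prediction_column in [
    ("v1", "model_v1_prediction"),
    ("v2", "model_v2_prediction"),
]:
    with mlflow.start_run(run_name=f"comparison_{version}"):
        result = mlflow.evaluate(
            data=comparison_df,
            predictions=prediction_column,
            targets="target",
            model_type="classifier",
        )
        print(version, result.metrics["accuracy_score"])

Because both runs share the same targets and differ only in the prediction column, their logged metrics can be compared directly in the MLflow UI.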

Best Practices

When using dataset evaluation, consider these best practices:

  • Data Validation: Always validate that prediction and target columns contain the expected data types and value ranges (see the sketch after this list)
  • Missing Values: Handle missing predictions or targets appropriately before evaluation
  • Memory Management: Use chunked processing or sampling for very large datasets
  • Metadata Logging: Log dataset characteristics, processing parameters, and evaluation context
  • Storage Formats: Use efficient formats like Parquet for large prediction datasets
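
As a minimal sketch of the first two practices, the helper below checks an evaluation DataFrame like the eval_dataset from the Quick Start before calling mlflow.evaluate. The function name, column names, and the binary-label check are illustrative assumptions, not part of the MLflow API.

# Hedged pre-evaluation validation sketch (names and checks are assumptions).
import pandas as pd


def validate_eval_dataset(df, prediction_col="prediction", target_col="target"):
    # Required columns must exist
    missing_cols = {prediction_col, target_col} - set(df.columns)
    if missing_cols:
        raise ValueError(f"Missing required columns: {missing_cols}")

    # Drop rows with missing predictions or targets before evaluation
    clean_df = df.dropna(subset=[prediction_col, target_col])
    dropped = len(df) - len(clean_df)
    if dropped:
        print(f"Dropped {dropped} rows with missing predictions or targets")

    # Binary-classification sanity check: labels should be 0/1
    if not set(clean_df[target_col].unique()) <= {0, 1}:
        raise ValueError("Unexpected target values for a binary classifier")

    return clean_df


# Usage (assuming the Quick Start eval_dataset)
# eval_dataset = validate_eval_dataset(eval_dataset)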

Conclusion

Dataset evaluation in MLflow provides powerful capabilities for assessing model performance on pre-computed predictions. This approach is essential for production ML systems where you need to separate prediction generation from performance assessment.

Key advantages of dataset evaluation include:

  • Flexibility: Evaluate predictions from any source without re-running models
  • Efficiency: Skip expensive model inference when predictions are already available
  • Scale: Handle large-scale batch predictions and historical analysis
  • Integration: Seamlessly work with production prediction pipelines

Whether you're analyzing batch predictions, conducting historical performance studies, or implementing automated evaluation pipelines, MLflow's dataset evaluation capabilities provide the tools needed for comprehensive model assessment at scale.