MLflow Dataset Tracking

The mlflow.data module is a comprehensive solution for dataset management throughout the machine learning lifecycle. It enables you to track, version, and manage datasets used in training, validation, and evaluation, providing complete lineage from raw data to model predictions.

Why Dataset Tracking Matters​

Dataset tracking is essential for reproducible machine learning and provides several key benefits:

  • Data Lineage: Track the complete journey from raw data sources to model inputs
  • Reproducibility: Ensure experiments can be reproduced with identical datasets
  • Version Control: Manage different versions of datasets as they evolve
  • Collaboration: Share datasets and their metadata across teams
  • Evaluation Integration: Seamlessly integrate with MLflow's evaluation capabilities
  • Production Monitoring: Track datasets used in production inference and evaluation

Core Components​

MLflow's dataset tracking revolves around two main abstractions:

Dataset​

The Dataset abstraction is a metadata tracking object that holds comprehensive information about a logged dataset. The information stored within a Dataset object includes:

Core Properties:

  • Name: Descriptive identifier for the dataset (defaults to "dataset" if not specified)
  • Digest: Unique hash/fingerprint for dataset identification (automatically computed)
  • Source: DatasetSource containing lineage information to the original data location
  • Schema: Optional dataset schema (implementation-specific, e.g., MLflow Schema)
  • Profile: Optional summary statistics (implementation-specific, e.g., row count, column stats)

Supported Dataset Types:

  • PandasDataset - created with mlflow.data.from_pandas() for pandas DataFrames
  • NumpyDataset - created with mlflow.data.from_numpy() for NumPy arrays
  • SparkDataset - created with mlflow.data.from_spark() for Spark DataFrames
  • HuggingFaceDataset - created with mlflow.data.from_huggingface() for Hugging Face datasets
  • TensorFlowDataset - created with mlflow.data.from_tensorflow() for TensorFlow datasets
  • MetaDataset - metadata-only dataset that records name, source, and schema without hashing the underlying data

Special Dataset Types:

  • EvaluationDataset - Internal dataset type used specifically with mlflow.evaluate() for model evaluation workflows

DatasetSource​

The DatasetSource component provides linked lineage to the original source of the data, whether it's a file URL, S3 bucket, database table, or any other data source. This ensures you can always trace back to where your data originated.

The DatasetSource can be retrieved using the mlflow.data.get_source() API, which accepts instances of Dataset, DatasetEntity, or DatasetInput.
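For example, here is a minimal sketch of retrieving a dataset's source and materializing the underlying data, assuming a dataset object such as the one created in the quick start below:

source = mlflow.data.get_source(dataset)

# Inspect the source type, e.g. an HTTP(S) source for a file referenced by URL
print(type(source).__name__)

# Most concrete source types can re-fetch the original data for reproducibility;
# load() returns a local path to the downloaded copy
local_path = source.load()
print(local_path)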

Quick Start: Basic Dataset Tracking​

Here's how to get started with basic dataset tracking:

import mlflow.data
import pandas as pd

# Load your data
dataset_source_url = "https://raw.githubusercontent.com/mlflow/mlflow/master/tests/datasets/winequality-white.csv"
raw_data = pd.read_csv(dataset_source_url, delimiter=";")

# Create a Dataset object
dataset = mlflow.data.from_pandas(
    raw_data, source=dataset_source_url, name="wine-quality-white", targets="quality"
)

# Log the dataset to an MLflow run
with mlflow.start_run():
    mlflow.log_input(dataset, context="training")

    # Your training code here
    # model = train_model(raw_data)
    # mlflow.sklearn.log_model(model, "model")
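After a dataset has been logged, the run records it as an input that can be read back later. Here is a rough sketch, continuing from the code above and using the standard run inputs fields, of logging the dataset and then retrieving its metadata from the tracking server:

with mlflow.start_run() as run:
    mlflow.log_input(dataset, context="training")

# Read the dataset metadata back from the logged run
logged_run = mlflow.get_run(run.info.run_id)
for dataset_input in logged_run.inputs.dataset_inputs:
    print(dataset_input.dataset.name)    # "wine-quality-white"
    print(dataset_input.dataset.digest)  # content-based digest
    print(dataset_input.tags)            # tags such as the logging context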

Dataset Information and Metadata​

When you create a dataset, MLflow automatically captures rich metadata:

# Access dataset metadata
print(f"Dataset name: {dataset.name}")  # Defaults to "dataset" if not specified
print(f"Dataset digest: {dataset.digest}")  # Unique hash identifier (computed automatically)
print(f"Dataset source: {dataset.source}")  # DatasetSource object
print(f"Dataset profile: {dataset.profile}")  # Optional: implementation-specific statistics
print(f"Dataset schema: {dataset.schema}")  # Optional: implementation-specific schema

Example output:

Dataset name: wine-quality-white
Dataset digest: 2a1e42c4
Dataset source: <DatasetSource object>
Dataset profile: {"num_rows": 4898, "num_elements": 58776}
Dataset schema: {"mlflow_colspec": [
    {"type": "double", "name": "fixed acidity"},
    {"type": "double", "name": "volatile acidity"},
    ...
    {"type": "long", "name": "quality"}
]}

Dataset Properties

The profile and schema properties are implementation-specific and may vary depending on the dataset type (PandasDataset, SparkDataset, etc.). Some dataset types may return None for these properties.

Dataset Sources and Lineage​

MLflow supports datasets from various sources:

# From local file
local_dataset = mlflow.data.from_pandas(
    df, source="/path/to/local/file.csv", name="local-data"
)

# From cloud storage
s3_dataset = mlflow.data.from_pandas(
    df, source="s3://bucket/data.parquet", name="s3-data"
)

# From database
db_dataset = mlflow.data.from_pandas(
    df, source="postgresql://user:pass@host/db", name="db-data"
)

# From URL
url_dataset = mlflow.data.from_pandas(
    df, source="https://example.com/data.csv", name="web-data"
)
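pandas DataFrames are not the only supported input type. As a brief sketch (using the from_numpy() helper listed earlier; the source string here is just an illustrative identifier), the same pattern applies to other in-memory formats:

import numpy as np
import mlflow
import mlflow.data

features = np.random.rand(100, 4)
targets = np.random.randint(0, 2, size=100)

numpy_dataset = mlflow.data.from_numpy(
    features,
    targets=targets,
    source="synthetic-generator",  # illustrative source identifier
    name="numpy-demo",
)

with mlflow.start_run():
    mlflow.log_input(numpy_dataset, context="training")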

Dataset Tracking in MLflow UI​

When you log datasets to MLflow runs, they appear in the MLflow UI with comprehensive metadata. You can view dataset information, schema, and lineage directly in the interface.

Dataset in MLflow UI

The UI displays:

  • Dataset name and digest
  • Schema information with column types
  • Profile statistics (row counts, etc.)
  • Source lineage information
  • Context in which the dataset was used

Integration with MLflow Evaluate​

One of the most powerful features of MLflow datasets is their seamless integration with MLflow's evaluation capabilities. MLflow automatically converts various data types to EvaluationDataset objects internally when using mlflow.evaluate().

EvaluationDataset

MLflow uses an internal EvaluationDataset class when working with mlflow.evaluate(). This dataset type is automatically created from your input data and provides optimized hashing and metadata tracking specifically for evaluation workflows.

Use datasets directly with MLflow evaluate:

import mlflow
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Prepare data and train model
data = pd.read_csv("classification_data.csv")
X = data.drop("target", axis=1)
y = data["target"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

model = RandomForestClassifier()
model.fit(X_train, y_train)

# Create evaluation dataset
eval_data = X_test.copy()
eval_data["target"] = y_test

eval_dataset = mlflow.data.from_pandas(
    eval_data, targets="target", name="evaluation-set"
)

with mlflow.start_run():
    # Log model
    mlflow.sklearn.log_model(model, name="model", input_example=X_test)

    # Evaluate using the dataset
    result = mlflow.evaluate(
        model="runs:/{}/model".format(mlflow.active_run().info.run_id),
        data=eval_dataset,
        model_type="classifier",
    )

print(f"Accuracy: {result.metrics['accuracy_score']:.3f}")

MLflow Evaluate Integration Example​

Here's a complete example showing how datasets integrate with MLflow's evaluation capabilities:

Dataset Evaluation in MLflow UI

The evaluation run shows how the dataset, model, metrics, and evaluation artifacts (like confusion matrices) are all logged together, providing a complete view of the evaluation process.

Advanced Dataset Management​

Track dataset versions as they evolve:

def create_versioned_dataset(data, version, base_name="customer-data"):
    """Create a versioned dataset with metadata."""

    dataset = mlflow.data.from_pandas(
        data,
        source=f"data_pipeline_v{version}",
        name=f"{base_name}-v{version}",
        targets="target",
    )

    with mlflow.start_run(run_name=f"Dataset_Version_{version}"):
        mlflow.log_input(dataset, context="versioning")

        # Log version metadata
        mlflow.log_params(
            {
                "dataset_version": version,
                "data_size": len(data),
                "features_count": len(data.columns) - 1,
                "target_distribution": data["target"].value_counts().to_dict(),
            }
        )

        # Log data quality metrics
        mlflow.log_metrics(
            {
                "missing_values_pct": (data.isnull().sum().sum() / data.size) * 100,
                "duplicate_rows": data.duplicated().sum(),
                "target_balance": data["target"].std(),
            }
        )

    return dataset


# Create multiple versions
v1_dataset = create_versioned_dataset(data_v1, "1.0")
v2_dataset = create_versioned_dataset(data_v2, "2.0")
v3_dataset = create_versioned_dataset(data_v3, "3.0")

Production Use Cases​

Monitor datasets used in production batch prediction:

def monitor_batch_predictions(batch_data, model_version, date):
    """Monitor production batch prediction datasets."""

    # Create dataset for batch predictions
    batch_dataset = mlflow.data.from_pandas(
        batch_data,
        source=f"production_batch_{date}",
        name=f"batch_predictions_{date}",
        targets="true_label" if "true_label" in batch_data.columns else None,
        predictions="prediction" if "prediction" in batch_data.columns else None,
    )

    with mlflow.start_run(run_name=f"Batch_Monitor_{date}"):
        mlflow.log_input(batch_dataset, context="production_batch")

        # Log production metadata
        mlflow.log_params(
            {
                "batch_date": date,
                "model_version": model_version,
                "batch_size": len(batch_data),
                "has_ground_truth": "true_label" in batch_data.columns,
            }
        )

        # Monitor prediction distribution
        if "prediction" in batch_data.columns:
            pred_metrics = {
                "prediction_mean": batch_data["prediction"].mean(),
                "prediction_std": batch_data["prediction"].std(),
                "unique_predictions": batch_data["prediction"].nunique(),
            }
            mlflow.log_metrics(pred_metrics)

        # Evaluate if ground truth is available
        if all(col in batch_data.columns for col in ["prediction", "true_label"]):
            result = mlflow.evaluate(data=batch_dataset, model_type="classifier")
            print(f"Batch accuracy: {result.metrics.get('accuracy_score', 'N/A')}")

    return batch_dataset


# Usage
batch_dataset = monitor_batch_predictions(daily_batch_data, "v2.1", "2024-01-15")

Best Practices​

When working with MLflow datasets, follow these best practices:

Data Quality: Always validate data quality before logging datasets. Check for missing values, duplicates, and data types.

Naming Conventions: Use consistent, descriptive names for datasets that include version information and context.

Source Documentation: Always specify meaningful source URLs or identifiers that allow you to trace back to the original data.

Context Specification: Use clear context labels when logging datasets (e.g., "training", "validation", "evaluation", "production").
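For instance, a minimal sketch (the train/validation/test dataset variables are illustrative) of logging each split under its own context within a single run:

with mlflow.start_run():
    mlflow.log_input(train_dataset, context="training")
    mlflow.log_input(val_dataset, context="validation")
    mlflow.log_input(test_dataset, context="evaluation")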

Metadata Logging: Include relevant metadata about data collection, preprocessing steps, and data characteristics.

Version Control: Track dataset versions explicitly, especially when data preprocessing or collection methods change.

Digest Computation: Dataset digests are computed differently for different dataset types:

  • Standard datasets: Based on data content and structure
  • MetaDataset: Based on metadata only (name, source, schema) - no actual data hashing
  • EvaluationDataset: Optimized hashing using sample rows for large datasets
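As a quick illustration of the content-based digests described above (a sketch; actual digest values will differ), identical data yields an identical digest while any change to the data produces a new one:

import pandas as pd
import mlflow.data

df = pd.DataFrame({"feature": [1, 2, 3], "target": [0, 1, 0]})

ds_a = mlflow.data.from_pandas(df, name="digest-demo")
ds_b = mlflow.data.from_pandas(df.copy(), name="digest-demo")
print(ds_a.digest == ds_b.digest)  # True: same content, same digest

df_modified = df.assign(feature=[1, 2, 4])
ds_c = mlflow.data.from_pandas(df_modified, name="digest-demo")
print(ds_a.digest == ds_c.digest)  # False: content changed, digest changed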

Source Flexibility: DatasetSource supports various source types including HTTP URLs, file paths, database connections, and cloud storage locations.

Evaluation Integration: Design datasets with evaluation in mind by clearly specifying target and prediction columns.

Key Benefits​

MLflow dataset tracking provides several key advantages for ML teams:

Reproducibility: Ensure experiments can be reproduced with identical datasets, even as data sources evolve.

Lineage Tracking: Maintain complete data lineage from source to model predictions, enabling better debugging and compliance.

Collaboration: Share datasets and their metadata across team members with consistent interfaces.

Evaluation Integration: Seamlessly integrate with MLflow's evaluation capabilities for comprehensive model assessment.

Production Monitoring: Track datasets used in production systems for performance monitoring and data drift detection.

Quality Assurance: Automatically capture data quality metrics and monitor changes over time.

Whether you're tracking training datasets, managing evaluation data, or monitoring production batch predictions, MLflow's dataset tracking capabilities provide the foundation for reliable, reproducible machine learning workflows.