
MLflow Data Model

MLflow's data model provides a structured approach to developing and managing GenAI applications, organizing how you log, debug, and evaluate them against quality, cost, and latency goals. It addresses key challenges in reproducibility, quality assessment, and iterative development.

Overview

The MLflow data model consists of several interconnected entities that work together to support your GenAI application development workflow:

🧪 Experiment - The root container for your GenAI application

🤖 LoggedModel - A first-class entity representing your AI model or agent with integrated tracking

🔍 Trace - A log of inputs, outputs, and intermediate steps from a single application execution

📊 Assessments - Quality judgments on a Trace, categorized as Feedback or Expectations

🎯 Scorers - Definitions of automated evaluation functions that produce Feedback

📋 Evaluation Datasets - Curated sets of inputs (and optional Expectations) for offline testing

🚀 Evaluation Runs - Results from running app versions against Evaluation Datasets, containing new, scored Traces

🏷️ Labeling Sessions - Collections of Traces organized for human review

MLflow Experiment

An Experiment is the top-level container for each distinct application or use case. It contains all Traces from development and production alongside all other entities in the data model. We recommend creating a single Experiment for each application.

Setting up an Experiment

import mlflow

# Create or set an experiment
mlflow.set_experiment("my-genai-app")

# Or create it explicitly (returns the new experiment's ID)
experiment_id = mlflow.create_experiment("my-genai-app")

MLflow LoggedModel: Model Management

A LoggedModel is a first-class entity that represents your AI model, agent, or GenAI application within an Experiment. It provides unified tracking of model artifacts, execution traces, evaluation metrics, and metadata throughout the development lifecycle.

Key Features of LoggedModel

LoggedModel serves as the central hub that connects:

  • Model artifacts and configuration parameters
  • Execution traces from development and production
  • Evaluation metrics and performance assessments
  • Version history and deployment tracking
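
For a sense of how this connected metadata can be pulled back out, here is a minimal sketch, assuming an MLflow 3.x environment where mlflow.get_logged_model is available; the model ID and attribute names are assumptions for illustration:

import mlflow

# Retrieve a previously logged model by its ID (hypothetical ID for illustration)
model = mlflow.get_logged_model(model_id="m-1234567890abcdef")

# The LoggedModel entity ties together the metadata listed above
# (attribute names assume the MLflow 3.x LoggedModel entity)
print(model.name)    # the model or agent's name
print(model.params)  # configuration parameters logged with the model
print(model.tags)    # tags attached to the model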

Creating LoggedModels

Create a LoggedModel by logging your model directly:

import mlflow

# Log a model with comprehensive metadata
logged_model = mlflow.langchain.log_model(
    lc_model=your_chain,
    name="customer_support_agent",
    params={"temperature": 0.1, "max_tokens": 2000},
    model_type="agent",
    input_example={"messages": "How can I help you?"},
)

print(f"Model ID: {logged_model.model_id}")

LoggedModel Benefits

Unified Tracking: All model-related artifacts, traces, and metrics are organized under a single entity, providing complete visibility into model behavior and performance.

Automatic Trace Linking: When using the active model pattern, all execution traces are automatically associated with the LoggedModel, eliminating manual tracking overhead.

Version Management: LoggedModel supports systematic versioning and comparison across different model iterations, enabling data-driven model selection.

Evaluation Integration: Evaluation metrics and results are directly linked to LoggedModel, providing comprehensive performance assessment.
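
As a concrete sketch of the active model pattern mentioned above (assuming MLflow 3.x, a LangChain-based app, and the customer_support_agent model logged earlier; your_chain is a placeholder):

import mlflow

# Mark the LoggedModel as "active" so new traces link to it automatically
mlflow.set_active_model(name="customer_support_agent")

# Enable automatic tracing for the app's framework (LangChain in this sketch)
mlflow.langchain.autolog()

# Any invocation now produces a trace attached to the active LoggedModel
your_chain.invoke({"messages": "How do I reset my password?"})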

MLflow Traces: The Foundation

The foundational concept is the Trace: a single, complete execution of your GenAI application (e.g., a user request or API call).

How Traces are Generated

Traces are generated either automatically, through MLflow's library integrations, or manually, by instrumenting your own code. Automatic tracing is enabled with a single line of code for 20+ popular LLM SDKs:

import mlflow

# Enable automatic tracing for OpenAI
mlflow.openai.autolog()

# Your existing code works unchanged
from openai import OpenAI

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4", messages=[{"role": "user", "content": "Hello!"}]
)

Purpose of Traces

Traces enable:

  • Observability: Gain insights into application performance
  • Debugging: Understand execution flow to resolve issues
  • Quality Evaluation: Assess response quality over time
  • Human Review: Provide data for expert annotation
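
To make these uses concrete, captured traces can be queried back out of an experiment for inspection. A minimal sketch, assuming MLflow 3.x where mlflow.search_traces returns a pandas DataFrame and using the experiment from earlier:

import mlflow

mlflow.set_experiment("my-genai-app")

# Fetch recent traces from the active experiment as a pandas DataFrame
traces = mlflow.search_traces(max_results=10)

# Each row captures a single execution: its request, response, timing, and status
# (exact column names can vary between MLflow versions)
print(traces.head())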

Assessments: Quality Judgments

Assessments are qualitative or quantitative judgments attached to Traces to understand and improve GenAI application quality. A Trace can have multiple Assessments, which fall into two categories: Feedback and Expectations.

Feedback Assessments

Feedback captures evaluations of a Trace and includes:

  • ๐Ÿท๏ธ Name - Developer-defined category (e.g., relevance, correctness)
  • โญ Score - Evaluation (e.g., thumbs up/down, numerical rating)
  • ๐Ÿ’ญ Rationale - Optional textual explanation for the score
# Log feedback programmatically
mlflow.log_feedback(
trace_id="trace-123",
name="relevance",
value=4,
rationale="Response was highly relevant to the question",
)

Expectation Assessments

Expectations are ground truth labels for a Trace (e.g., expected_facts, expected_response). These are primarily used in offline evaluation to compare app output against known correct answers.
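
As an illustrative sketch, assuming MLflow 3.x where mlflow.log_expectation is available and using a hypothetical trace ID, a ground truth label could be attached to a trace like this:

import mlflow

# Attach a ground truth label to an existing trace (trace ID is hypothetical)
mlflow.log_expectation(
    trace_id="trace-123",
    name="expected_response",
    value="MLflow is an open source platform for managing the ML lifecycle.",
)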

Scorers: Automated Quality Measurement

Scorers are functions that programmatically assess Trace quality, producing Feedback. They can be:

Code-based Heuristics

def check_response_length(trace):
    """Custom scorer to check if response is appropriate length"""
    response = trace.outputs.get("response", "")
    if 50 <= len(response) <= 500:
        return {"score": 1, "rationale": "Response length is appropriate"}
    else:
        return {"score": 0, "rationale": "Response too short or too long"}

LLM Judges

from mlflow.metrics import genai

# Use built-in LLM judge
relevance_metric = genai.relevance()

# Evaluate traces with the metric
results = mlflow.genai.evaluate(
    predict_fn=your_model, data=evaluation_data, scorers=[relevance_metric]
)

Evaluation Datasets and Runs

MLflow provides systematic offline testing through Evaluation Datasets and Evaluation Runs.

Evaluation Datasets

An Evaluation Dataset is a curated collection of example inputs used to evaluate and improve app performance:

# Create an evaluation dataset
dataset = mlflow.data.from_dict(
    {
        "inputs": ["What is MLflow?", "How do I log metrics?"],
        "expectations": {
            "expected_outputs": ["MLflow is...", "To log metrics..."],
        },
    }
)

# Register the dataset
mlflow.log_input(dataset, context="evaluation")

Evaluation Runs

An Evaluation Run stores results from running a new app version against an Evaluation Dataset:

# Run evaluation
results = mlflow.genai.evaluate(
    predict_fn=your_model,
    data=evaluation_dataset,
    scorers=[relevance_metric, accuracy_metric],
)

The evaluation process:

  1. New app version processes inputs from Evaluation Dataset
  2. MLflow generates a new Trace for each input
  3. Configured Scorers annotate Traces with Feedback
  4. All annotated Traces are stored in the Evaluation Run
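
To close the loop, the aggregate scores from an Evaluation Run can be inspected programmatically. A brief sketch, assuming the results object returned by mlflow.genai.evaluate above exposes a metrics mapping and the ID of its Evaluation Run:

# Aggregate scores produced by the configured Scorers
# (attribute names are assumptions about the evaluation result object)
for metric_name, metric_value in results.metrics.items():
    print(f"{metric_name}: {metric_value}")

# The run ID identifies the Evaluation Run that stores the annotated Traces
print(f"Evaluation run: {results.run_id}")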

Labeling Sessions: Human Review

Labeling Sessions organize Traces for human review, typically through the MLflow UI. Domain experts can browse these Traces and attach Feedback as Assessments.

# Create a labeling session
session = mlflow.genai.create_labeling_session(name="quality-review-session")

Complete Data Model Structure

All components operate within an Experiment, which serves as the root of this comprehensive hierarchy.

Getting Started

To begin using the MLflow data model:

  1. 🧪 Set up an experiment for your GenAI application
  2. 🤖 Create or set a LoggedModel to organize your model tracking
  3. 🔄 Enable automatic tracing for your LLM library
  4. 🎯 Define custom scorers for your quality metrics
  5. 📋 Create evaluation datasets from representative examples
  6. 🚀 Run evaluations to compare different versions
  7. 👥 Review traces and add human feedback as needed

import mlflow

# 1. Set up experiment
mlflow.set_experiment("my-genai-app")

# 2. Create LoggedModel
logged_model = mlflow.langchain.log_model(
    lc_model=your_model, name="my_agent", params={"temperature": 0.1}
)

# 3. Enable tracing with active model
mlflow.set_active_model(name="my_agent")
mlflow.langchain.autolog()

# 4. Your app code runs normally
# Traces are automatically captured and linked to LoggedModel

# 5. Evaluate and iterate
results = mlflow.genai.evaluate(
    predict_fn=your_model, data=evaluation_data, scorers=your_custom_scorers
)

Next Steps

  • 🤖 LoggedModel Guide: Learn comprehensive model lifecycle management
  • 🔄 Automatic Tracing: Learn how to enable automatic tracing for your LLM library
  • 🛠️ Custom Tracing: Add manual instrumentation to your application
  • 📊 Evaluation Guide: Dive deeper into evaluation workflows
  • 🖥️ MLflow UI: Explore traces and results in the web interface

MLflow's comprehensive data model empowers systematic observation, debugging, evaluation, and improvement of GenAI applications, providing the foundation for building high-quality, reliable, and maintainable GenAI systems.