
MLflow Data Model

MLflow's data model provides a structured approach to developing and managing GenAI applications, organizing how you log, debug, and evaluate them against quality, cost, and latency goals. It addresses key challenges in reproducibility, quality assessment, and iterative development.

Overview

The MLflow data model consists of several interconnected entities that work together to support your GenAI application development workflow:

🧪 Experiment - The root container for your GenAI application

🤖 LoggedModel - A first-class entity representing your AI model or agent with integrated tracking

🔍 Trace - A log of inputs, outputs, and intermediate steps from a single application execution

📊 Assessments - Quality judgments on a Trace, categorized as Feedback or Expectations

🎯 Scorers - Definitions of automated evaluation functions that produce Feedback

📋 Evaluation Datasets - Curated sets of inputs (and optional Expectations) for offline testing

🚀 Evaluation Runs - Results from running app versions against Evaluation Datasets, containing new, scored Traces

🏷️ Labeling Sessions - Collections of Traces organized for human review

MLflow Experiment

An Experiment is the top-level container for each distinct application or use case. It contains all Traces from development and production alongside all other entities in the data model. We recommend creating a single Experiment for each application.

Setting up an Experiment

import mlflow

# Create or set an experiment
mlflow.set_experiment("my-genai-app")

# Or create it explicitly (returns the new experiment's ID)
experiment_id = mlflow.create_experiment("my-genai-app")

MLflow LoggedModel: Model Management

A LoggedModel is a first-class entity that represents your AI model, agent, or GenAI application within an Experiment. It provides unified tracking of model artifacts, execution traces, evaluation metrics, and metadata throughout the development lifecycle.

Key Features of LoggedModel

LoggedModel serves as the central hub that connects:

  • Model artifacts and configuration parameters
  • Execution traces from development and production
  • Evaluation metrics and performance assessments
  • Version history and deployment tracking
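
For a sense of how this connected metadata can be pulled back out, here is a minimal sketch, assuming an MLflow 3.x environment where mlflow.get_logged_model is available; the model ID and attribute names are assumptions for illustration:

import mlflow

# Retrieve a previously logged model by its ID (hypothetical ID for illustration)
model = mlflow.get_logged_model(model_id="m-1234567890abcdef")

# The LoggedModel entity ties together the metadata listed above
# (attribute names assume the MLflow 3.x LoggedModel entity)
print(model.name)    # the model or agent's name
print(model.params)  # configuration parameters logged with the model
print(model.tags)    # tags attached to the model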

Creating LoggedModels

Create a LoggedModel by logging your model directly:

import mlflow

# Log a model with comprehensive metadata
logged_model = mlflow.langchain.log_model(
    lc_model=your_chain,
    name="customer_support_agent",
    params={"temperature": 0.1, "max_tokens": 2000},
    model_type="agent",
    input_example={"messages": "How can I help you?"},
)

print(f"Model ID: {logged_model.model_id}")

LoggedModel Benefits

Unified Tracking: All model-related artifacts, traces, and metrics are organized under a single entity, providing complete visibility into model behavior and performance.

Automatic Trace Linking: When using the active model pattern, all execution traces are automatically associated with the LoggedModel, eliminating manual tracking overhead.

Version Management: LoggedModel supports systematic versioning and comparison across different model iterations, enabling data-driven model selection.

Evaluation Integration: Evaluation metrics and results are directly linked to LoggedModel, providing comprehensive performance assessment.
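
As a concrete sketch of the active model pattern mentioned above (assuming MLflow 3.x, a LangChain-based app, and the customer_support_agent model logged earlier; your_chain is a placeholder):

import mlflow

# Mark the LoggedModel as "active" so new traces link to it automatically
mlflow.set_active_model(name="customer_support_agent")

# Enable automatic tracing for the app's framework (LangChain in this sketch)
mlflow.langchain.autolog()

# Any invocation now produces a trace attached to the active LoggedModel
your_chain.invoke({"messages": "How do I reset my password?"})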

MLflow Traces: The Foundation

The foundational concept is the Trace: a single, complete execution of your GenAI application (e.g., a user request or API call).

How Traces are Generated

Traces are generated either automatically, through MLflow's library integrations, or manually, by instrumenting your own code. Automatic tracing is enabled with a single line of code for 20+ popular LLM SDKs:

import mlflow

# Enable automatic tracing for OpenAI
mlflow.openai.autolog()

# Your existing code works unchanged
from openai import OpenAI

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4", messages=[{"role": "user", "content": "Hello!"}]
)

Purpose of Traces

Traces enable:

  • Observability: Gain insights into application performance
  • Debugging: Understand execution flow to resolve issues
  • Quality Evaluation: Assess response quality over time
  • Human Review: Provide data for expert annotation
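
To make these uses concrete, captured traces can be queried back out of an experiment for inspection. A minimal sketch, assuming MLflow 3.x where mlflow.search_traces returns a pandas DataFrame and using the experiment from earlier:

import mlflow

mlflow.set_experiment("my-genai-app")

# Fetch recent traces from the active experiment as a pandas DataFrame
traces = mlflow.search_traces(max_results=10)

# Each row captures a single execution: its request, response, timing, and status
# (exact column names can vary between MLflow versions)
print(traces.head())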

Assessments: Quality Judgments

Assessments are qualitative or quantitative judgments attached to Traces to understand and improve GenAI application quality. A Trace can have multiple Assessments, which fall into two categories: Feedback and Expectations.

Feedback Assessments

Feedback captures evaluations of a Trace and includes:

  • ๐Ÿท๏ธ Name - Developer-defined category (e.g., relevance, correctness)
  • โญ Score - Evaluation (e.g., thumbs up/down, numerical rating)
  • ๐Ÿ’ญ Rationale - Optional textual explanation for the score
# Log feedback programmatically
mlflow.log_feedback(
trace_id="trace-123",
name="relevance",
value=4,
rationale="Response was highly relevant to the question",
)

Expectation Assessments

Expectations are ground truth labels for a Trace (e.g., expected_facts, expected_response). These are primarily used in offline evaluation to compare app output against known correct answers.
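
As an illustrative sketch, assuming MLflow 3.x where mlflow.log_expectation is available and using a hypothetical trace ID, a ground truth label could be attached to a trace like this:

import mlflow

# Attach a ground truth label to an existing trace (trace ID is hypothetical)
mlflow.log_expectation(
    trace_id="trace-123",
    name="expected_response",
    value="MLflow is an open source platform for managing the ML lifecycle.",
)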

Scorers: Automated Quality Measurement

Scorers are functions that programmatically assess Trace quality, producing Feedback. They can be:

Code-based Heuristics

def check_response_length(trace):
    """Custom scorer to check if response is appropriate length"""
    response = trace.outputs.get("response", "")
    if 50 <= len(response) <= 500:
        return {"score": 1, "rationale": "Response length is appropriate"}
    else:
        return {"score": 0, "rationale": "Response too short or too long"}

LLM Judges

from mlflow.metrics import genai

# Use built-in LLM judge
relevance_metric = genai.relevance()

# Evaluate traces with the metric
results = mlflow.genai.evaluate(
    predict_fn=your_model, data=evaluation_data, scorers=[relevance_metric]
)

Evaluation Datasets and Runs

MLflow provides systematic offline testing through Evaluation Datasets and Evaluation Runs.

Evaluation Datasets

An Evaluation Dataset is a curated collection of example inputs used to evaluate and improve app performance:

# Create an evaluation dataset
dataset = mlflow.data.from_dict(
    {
        "inputs": ["What is MLflow?", "How do I log metrics?"],
        "expectations": {
            "expected_outputs": ["MLflow is...", "To log metrics..."],
        },
    }
)

# Register the dataset
mlflow.log_input(dataset, context="evaluation")

Evaluation Runs

An Evaluation Run stores results from running a new app version against an Evaluation Dataset:

# Run evaluation
results = mlflow.genai.evaluate(
    predict_fn=your_model,
    data=evaluation_dataset,
    scorers=[relevance_metric, accuracy_metric],
)

The evaluation process:

  1. New app version processes inputs from Evaluation Dataset
  2. MLflow generates a new Trace for each input
  3. Configured Scorers annotate Traces with Feedback
  4. All annotated Traces are stored in the Evaluation Run
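
To close the loop, the aggregate scores from an Evaluation Run can be inspected programmatically. A brief sketch, assuming the results object returned by mlflow.genai.evaluate above exposes a metrics mapping and the ID of its Evaluation Run:

# Aggregate scores produced by the configured Scorers
# (attribute names are assumptions about the evaluation result object)
for metric_name, metric_value in results.metrics.items():
    print(f"{metric_name}: {metric_value}")

# The run ID identifies the Evaluation Run that stores the annotated Traces
print(f"Evaluation run: {results.run_id}")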

Labeling Sessions: Human Review

Labeling Sessions organize Traces for human review, typically through the MLflow UI. Domain experts can browse these Traces and attach Feedback as Assessments.

# Create a labeling session
session = mlflow.genai.create_labeling_session(name="quality-review-session")

Complete Data Model Structure

All components operate within an Experiment, which serves as the root of this comprehensive hierarchy.

Getting Started

To begin using the MLflow data model:

  1. 🧪 Set up an experiment for your GenAI application
  2. 🤖 Create or set a LoggedModel to organize your model tracking
  3. 🔄 Enable automatic tracing for your LLM library
  4. 🎯 Define custom scorers for your quality metrics
  5. 📋 Create evaluation datasets from representative examples
  6. 🚀 Run evaluations to compare different versions
  7. 👥 Review traces and add human feedback as needed

import mlflow

# 1. Set up experiment
mlflow.set_experiment("my-genai-app")

# 2. Create LoggedModel
logged_model = mlflow.langchain.log_model(
    lc_model=your_model, name="my_agent", params={"temperature": 0.1}
)

# 3. Enable tracing with active model
mlflow.set_active_model(name="my_agent")
mlflow.langchain.autolog()

# 4. Your app code runs normally
# Traces are automatically captured and linked to LoggedModel

# 5. Evaluate and iterate
results = mlflow.genai.evaluate(
    predict_fn=your_model, data=evaluation_data, scorers=your_custom_scorers
)

Next Steps

  • 🤖 LoggedModel Guide: Learn comprehensive model lifecycle management
  • 🔄 Automatic Tracing: Learn how to enable automatic tracing for your LLM library
  • 🛠️ Custom Tracing: Add manual instrumentation to your application
  • 📊 Evaluation Guide: Dive deeper into evaluation workflows
  • 🖥️ MLflow UI: Explore traces and results in the web interface

MLflow's comprehensive data model empowers systematic observation, debugging, evaluation, and improvement of GenAI applications, providing the foundation for building high-quality, reliable, and maintainable GenAI systems.