Query Traces via SDK
This guide shows you how to query traces programmatically with the MLflow SDK for debugging, monitoring, and analysis. It focuses on practical SDK usage patterns for observability workflows; for comprehensive search syntax and filtering capabilities, see the Search Traces guide.
Why Query Traces Programmatically?
The MLflow UI is great for interactive exploration, but programmatic access enables automation and integration with your existing workflows. You can analyze error patterns across thousands of traces, monitor performance trends over time, and build custom analysis workflows.
The search_traces API returns a pandas DataFrame by default (if pandas is installed), but can also return a list of Trace objects when return_type="list" is specified.
Syntax Differences: When using the DataFrame return type, the field names used in filter_string differ from the column names in the returned DataFrame (the sketch after this list shows the mapping in practice):
- Filter: status → DataFrame: state
- Filter: timestamp_ms → DataFrame: request_time
- Filter: execution_time_ms → DataFrame: execution_duration
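As a quick illustration, the following sketch filters on the trace-level field names and then reads the corresponding DataFrame columns; it assumes an experiment with ID "1" and that pandas is installed:
import mlflow

# Filter uses trace-level field names: status, execution_time_ms
slow_errors = mlflow.search_traces(
    experiment_ids=["1"],
    filter_string="status = 'ERROR' AND execution_time_ms > 1000",
)

# The returned DataFrame exposes the same information under different names
if not slow_errors.empty:
    print(slow_errors[["state", "request_time", "execution_duration"]].head())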
Basic Query Methods
The mlflow.search_traces() function returns trace data in two formats:
- pandas DataFrame (default if pandas is installed): Perfect for data analysis and quick insights
- List of Trace objects: Useful when you need to work with the full Trace object structure
You can control the return type using the return_type parameter:
import mlflow
# Default: Get traces as a pandas DataFrame (if pandas is installed)
error_traces_df = mlflow.search_traces(
    experiment_ids=["1"], filter_string="status = 'ERROR'", max_results=100
)

# Easy analysis with pandas
print(f"Found {len(error_traces_df)} errors")
print(f"Average execution time: {error_traces_df['execution_duration'].mean():.1f}ms")

# Alternative: Get traces as a list of Trace objects
error_traces_list = mlflow.search_traces(
    experiment_ids=["1"],
    filter_string="status = 'ERROR'",
    max_results=100,
    return_type="list",
)

# Work with Trace objects directly
for trace in error_traces_list[:3]:
    print(f"Trace {trace.info.trace_id}: {trace.info.execution_duration}ms")
Choosing the Right Return Type
Use DataFrame (default) when:
- You need to perform statistical analysis on trace data
- You want to aggregate metrics across multiple traces
- You're creating visualizations or reports
- You're familiar with pandas and want to leverage its powerful data manipulation capabilities
Use List of Trace objects when:
- You need access to the full Trace object methods and properties (see the sketch after this list)
- You're integrating with other MLflow APIs that expect Trace objects
- You want to work with individual traces in detail
- pandas is not installed in your environment
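For example, when you need to inspect the spans inside each trace, the list return type gives you the full object. The snippet below is a minimal sketch that assumes each Trace exposes its spans through trace.data.spans, as in recent MLflow versions:
import mlflow

traces = mlflow.search_traces(
    experiment_ids=["1"],
    filter_string="status = 'ERROR'",
    max_results=10,
    return_type="list",
)

for trace in traces:
    # trace.info carries trace-level metadata; trace.data holds the spans
    print(f"Trace {trace.info.trace_id}: {len(trace.data.spans)} spans")
    for span in trace.data.spans:
        print(f"  - {span.name} ({span.span_type})")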
Common Use Cases
Finding and Analyzing Errors
When you need to understand what's going wrong in your application, start with simple queries to identify patterns:
import mlflow
from datetime import datetime, timedelta
# Get errors from the last 24 hours
yesterday = datetime.now() - timedelta(days=1)
timestamp_ms = int(yesterday.timestamp() * 1000)
error_traces = mlflow.search_traces(
    experiment_ids=["1"],
    filter_string=f"status = 'ERROR' AND timestamp_ms > {timestamp_ms}",
    order_by=["timestamp_ms DESC"],
)

print(f"Found {len(error_traces)} errors in the last 24 hours")

# Look at error distribution by tags
if not error_traces.empty:
    # Count errors by user if user tags exist
    if "tags" in error_traces.columns:
        print("\nError patterns:")
        tag_analysis = {}
        for tags in error_traces["tags"].dropna():
            if isinstance(tags, dict):
                user = tags.get("user_id", "unknown")
                tag_analysis[user] = tag_analysis.get(user, 0) + 1

        for user, count in sorted(
            tag_analysis.items(), key=lambda x: x[1], reverse=True
        )[:5]:
            print(f"  {user}: {count} errors")
Performance Monitoring
Track how your application performs over time and identify bottlenecks:
# Get successful traces to analyze performance
recent_traces = mlflow.search_traces(
    experiment_ids=["1"],
    filter_string="status = 'OK'",
    order_by=["timestamp_ms DESC"],
    max_results=1000,
)

if not recent_traces.empty:
    # Basic performance metrics
    avg_time = recent_traces["execution_duration"].mean()
    p95_time = recent_traces["execution_duration"].quantile(0.95)

    print(f"Average execution time: {avg_time:.1f}ms")
    print(f"95th percentile: {p95_time:.1f}ms")

    # Find slowest traces
    slow_traces = recent_traces[recent_traces["execution_duration"] > p95_time]
    print(f"Found {len(slow_traces)} slow traces (>{p95_time:.1f}ms)")
User Session Analysis
Understand how users interact with your application by analyzing their sessions:
# Analyze traces for a specific user
user_traces = mlflow.search_traces(
    experiment_ids=["1"],
    filter_string="tags.user_id = 'user123'",
    order_by=["timestamp_ms ASC"],
)

if not user_traces.empty:
    print(f"User has {len(user_traces)} interactions")

    # Calculate session metrics
    total_time = user_traces["execution_duration"].sum()
    error_count = len(user_traces[user_traces["state"] == "ERROR"])

    print(f"Total execution time: {total_time:.1f}ms")
    print(
        f"Error rate: {error_count}/{len(user_traces)} ({error_count/len(user_traces)*100:.1f}%)"
    )

    # Show recent activity
    print("\nRecent traces:")
    for _, trace in user_traces.tail(3).iterrows():
        status = "✅" if trace["state"] == "OK" else "❌"
        print(f"  {status} {trace['execution_duration']:.1f}ms")
Best Practices
Start Simple: Begin with basic queries and gradually add complexity as needed. Most monitoring can be done with simple filters and pandas operations.
Use Time Windows: Always filter by timestamp_ms when analyzing recent data to keep queries fast and relevant.
Handle Missing Data: Production traces may have missing fields, so always check if data exists before using it.
Keep Queries Focused: Use specific filters to get only the data you need rather than retrieving everything and filtering in memory.
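For example, a focused query pushes the status, tag, and time-window conditions into filter_string so the server returns only the rows you need (the experiment ID and tag value here are placeholders):
import mlflow
from datetime import datetime, timedelta

cutoff_ms = int((datetime.now() - timedelta(hours=6)).timestamp() * 1000)

focused = mlflow.search_traces(
    experiment_ids=["1"],
    filter_string=(
        f"status = 'ERROR' AND timestamp_ms > {cutoff_ms} "
        "AND tags.user_id = 'user123'"
    ),
    max_results=200,
)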
Error Handling
Add basic error handling to make your scripts more robust:
import mlflow
import pandas as pd


def safe_trace_query(experiment_ids, filter_string=None):
    """Query traces with basic error handling."""
    try:
        return mlflow.search_traces(
            experiment_ids=experiment_ids, filter_string=filter_string
        )
    except Exception as e:
        print(f"Error querying traces: {e}")
        return pd.DataFrame()  # Return empty DataFrame on error


# Usage
traces = safe_trace_query(["1"], "status = 'ERROR'")
if not traces.empty:
    print(f"Found {len(traces)} traces")
else:
    print("No traces found or query failed")
Summary
Programmatic trace querying with the MLflow SDK enables powerful automation for debugging, monitoring, and analysis. Start with simple queries to understand your data, then build up to more sophisticated analysis workflows.
The key is to focus on actionable insights that help you understand and improve your application's behavior in production.
Next Steps
MLflow Tracing UI: Learn the web interface for interactive trace exploration
Search Traces: Master advanced search and filtering techniques
Delete and Manage Traces: Understand trace lifecycle management