Query Traces via SDK
This guide shows you how to query traces programmatically with the MLflow SDK for debugging, monitoring, and analysis. It focuses on practical SDK usage patterns for observability workflows; for comprehensive search syntax and filtering capabilities, see the Search Traces guide.
Why Query Traces Programmatically?
The MLflow UI is great for interactive exploration, but programmatic access enables automation and integration with your existing workflows. You can analyze error patterns across thousands of traces, monitor performance trends over time, and build custom analysis workflows.
The search_traces API returns a pandas DataFrame by default (if pandas is installed), but can also return a list of Trace objects when return_type="list" is specified.
Syntax Differences: When using the DataFrame return type, the field names used in filter_string differ from the column names in the returned DataFrame (the sketch after this list shows the mapping in practice):
- Filter: status → DataFrame: state
- Filter: timestamp_ms → DataFrame: request_time
- Filter: execution_time_ms → DataFrame: execution_duration
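As a quick illustration, the following sketch filters on the trace-level field names and then reads the corresponding DataFrame columns; it assumes an experiment with ID "1" and that pandas is installed:
import mlflow

# Filter uses trace-level field names: status, execution_time_ms
slow_errors = mlflow.search_traces(
    experiment_ids=["1"],
    filter_string="status = 'ERROR' AND execution_time_ms > 1000",
)

# The returned DataFrame exposes the same information under different names
if not slow_errors.empty:
    print(slow_errors[["state", "request_time", "execution_duration"]].head())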
Basic Query Methods
The mlflow.search_traces() function returns trace data in two formats:
- pandas DataFrame (default if pandas is installed): Perfect for data analysis and quick insights
- List of Trace objects: Useful when you need to work with the full Trace object structure
You can control the return type using the return_type parameter:
import mlflow
# Default: Get traces as a pandas DataFrame (if pandas is installed)
error_traces_df = mlflow.search_traces(
    experiment_ids=["1"], filter_string="status = 'ERROR'", max_results=100
)

# Easy analysis with pandas
print(f"Found {len(error_traces_df)} errors")
print(f"Average execution time: {error_traces_df['execution_duration'].mean():.1f}ms")

# Alternative: Get traces as a list of Trace objects
error_traces_list = mlflow.search_traces(
    experiment_ids=["1"],
    filter_string="status = 'ERROR'",
    max_results=100,
    return_type="list",
)

# Work with Trace objects directly
for trace in error_traces_list[:3]:
    print(f"Trace {trace.info.trace_id}: {trace.info.execution_duration}ms")
Choosing the Right Return Type
Use DataFrame (default) when:
- You need to perform statistical analysis on trace data
- You want to aggregate metrics across multiple traces
- You're creating visualizations or reports
- You're familiar with pandas and want to leverage its powerful data manipulation capabilities
Use List of Trace objects when:
- You need access to the full Trace object methods and properties (see the sketch after this list)
- You're integrating with other MLflow APIs that expect Trace objects
- You want to work with individual traces in detail
- pandas is not installed in your environment
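For example, when you need to inspect the spans inside each trace, the list return type gives you the full object. The snippet below is a minimal sketch that assumes each Trace exposes its spans through trace.data.spans, as in recent MLflow versions:
import mlflow

traces = mlflow.search_traces(
    experiment_ids=["1"],
    filter_string="status = 'ERROR'",
    max_results=10,
    return_type="list",
)

for trace in traces:
    # trace.info carries trace-level metadata; trace.data holds the spans
    print(f"Trace {trace.info.trace_id}: {len(trace.data.spans)} spans")
    for span in trace.data.spans:
        print(f"  - {span.name} ({span.span_type})")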
Common Use Cases
Finding and Analyzing Errors
When you need to understand what's going wrong in your application, start with simple queries to identify patterns:
import mlflow
from datetime import datetime, timedelta
# Get errors from the last 24 hours
yesterday = datetime.now() - timedelta(days=1)
timestamp_ms = int(yesterday.timestamp() * 1000)
error_traces = mlflow.search_traces(
    experiment_ids=["1"],
    filter_string=f"status = 'ERROR' AND timestamp_ms > {timestamp_ms}",
    order_by=["timestamp_ms DESC"],
)

print(f"Found {len(error_traces)} errors in the last 24 hours")

# Look at error distribution by tags
if not error_traces.empty:
    # Count errors by user if user tags exist
    if "tags" in error_traces.columns:
        print("\nError patterns:")
        tag_analysis = {}
        for tags in error_traces["tags"].dropna():
            if isinstance(tags, dict):
                user = tags.get("user_id", "unknown")
                tag_analysis[user] = tag_analysis.get(user, 0) + 1

        for user, count in sorted(
            tag_analysis.items(), key=lambda x: x[1], reverse=True
        )[:5]:
            print(f"  {user}: {count} errors")
Performance Monitoring
Track how your application performs over time and identify bottlenecks:
# Get successful traces to analyze performance
recent_traces = mlflow.search_traces(
    experiment_ids=["1"],
    filter_string="status = 'OK'",
    order_by=["timestamp_ms DESC"],
    max_results=1000,
)

if not recent_traces.empty:
    # Basic performance metrics
    avg_time = recent_traces["execution_duration"].mean()
    p95_time = recent_traces["execution_duration"].quantile(0.95)

    print(f"Average execution time: {avg_time:.1f}ms")
    print(f"95th percentile: {p95_time:.1f}ms")

    # Find slowest traces
    slow_traces = recent_traces[recent_traces["execution_duration"] > p95_time]
    print(f"Found {len(slow_traces)} slow traces (>{p95_time:.1f}ms)")
User Session Analysis
Understand how users interact with your application by analyzing their sessions:
# Analyze traces for a specific user
user_traces = mlflow.search_traces(
    experiment_ids=["1"],
    filter_string="tags.user_id = 'user123'",
    order_by=["timestamp_ms ASC"],
)

if not user_traces.empty:
    print(f"User has {len(user_traces)} interactions")

    # Calculate session metrics
    total_time = user_traces["execution_duration"].sum()
    error_count = len(user_traces[user_traces["state"] == "ERROR"])

    print(f"Total execution time: {total_time:.1f}ms")
    print(
        f"Error rate: {error_count}/{len(user_traces)} ({error_count/len(user_traces)*100:.1f}%)"
    )

    # Show recent activity
    print("\nRecent traces:")
    for _, trace in user_traces.tail(3).iterrows():
        status = "✅" if trace["state"] == "OK" else "❌"
        print(f"  {status} {trace['execution_duration']:.1f}ms")
Best Practices
Start Simple: Begin with basic queries and gradually add complexity as needed. Most monitoring can be done with simple filters and pandas operations.
Use Time Windows: Always filter by timestamp_ms when analyzing recent data to keep queries fast and relevant.
Handle Missing Data: Production traces may have missing fields, so always check if data exists before using it.
Keep Queries Focused: Use specific filters to get only the data you need rather than retrieving everything and filtering in memory.
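For example, a focused query pushes the status, tag, and time-window conditions into filter_string so the server returns only the rows you need (the experiment ID and tag value here are placeholders):
import mlflow
from datetime import datetime, timedelta

cutoff_ms = int((datetime.now() - timedelta(hours=6)).timestamp() * 1000)

focused = mlflow.search_traces(
    experiment_ids=["1"],
    filter_string=(
        f"status = 'ERROR' AND timestamp_ms > {cutoff_ms} "
        "AND tags.user_id = 'user123'"
    ),
    max_results=200,
)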
Error Handling
Add basic error handling to make your scripts more robust:
import mlflow
import pandas as pd


def safe_trace_query(experiment_ids, filter_string=None):
    """Query traces with basic error handling."""
    try:
        return mlflow.search_traces(
            experiment_ids=experiment_ids, filter_string=filter_string
        )
    except Exception as e:
        print(f"Error querying traces: {e}")
        return pd.DataFrame()  # Return empty DataFrame on error


# Usage
traces = safe_trace_query(["1"], "status = 'ERROR'")
if not traces.empty:
    print(f"Found {len(traces)} traces")
else:
    print("No traces found or query failed")
Summary
Programmatic trace querying with the MLflow SDK enables powerful automation for debugging, monitoring, and analysis. Start with simple queries to understand your data, then build up to more sophisticated analysis workflows.
The key is to focus on actionable insights that help you understand and improve your application's behavior in production.
Next Steps
MLflow Tracing UI: Learn the web interface for interactive trace exploration
Search Traces: Master advanced search and filtering techniques
Delete and Manage Traces: Understand trace lifecycle management