FAQ

Q: Can I use MLflow Tracing for production applications?

Yes, MLflow Tracing is stable and designed to be used in production environments.

When using MLflow Tracing in production environments, we recommend using the MLflow Tracing SDK (mlflow-tracing) to instrument your code/models/agents with a minimal set of dependencies and a smaller installation footprint. The SDK is designed for production environments where you want an efficient and lightweight tracing solution. Please refer to the Production Monitoring section for more details.
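
As a minimal sketch, the snippet below instruments a simple function with the decorator API; the tracking URI and experiment name are placeholder values that you would replace with your own backend.

# Install the lightweight SDK instead of the full MLflow package:
#   pip install mlflow-tracing

import mlflow

# Placeholder values; point these at your own tracking backend and experiment
mlflow.set_tracking_uri("http://localhost:5000")
mlflow.set_experiment("my-production-app")


@mlflow.trace
def answer(question: str) -> str:
    # Your model or agent logic goes here
    return "dummy answer to: " + question


answer("What is MLflow Tracing?")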

Q: I cannot open my trace in the MLflow UI. What should I do?

There are multiple possible reasons why a trace may not be viewable in the MLflow UI.

  1. The trace is not completed yet: If the trace is still being collected, MLflow cannot display its spans in the UI. Ensure that all spans are properly ended with either an "OK" or "ERROR" status (see the sketch after this list).

  2. The browser cache is outdated: When you upgrade MLflow to a new version, the browser cache may contain outdated data and prevent the UI from displaying traces correctly. Clear your browser cache (Shift+F5) and refresh the page.
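
For reason 1, the simplest way to guarantee that spans are properly ended is to use the decorator or the context manager rather than starting and ending spans by hand; both close the span with an "OK" status on normal exit and an "ERROR" status when an exception is raised. A minimal sketch:

import mlflow


@mlflow.trace
def decorated_step():
    # The decorator ends this span automatically, even if an exception is raised
    return "done"


def context_manager_step():
    # The context manager ends the span when the with-block exits
    with mlflow.start_span(name="manual_step") as span:
        span.set_inputs({"query": "example"})
        span.set_outputs({"result": "example"})


decorated_step()
context_manager_step()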

Q: The model execution gets stuck and my trace is "in progress" forever.

Sometimes a model or an agent gets stuck in a long-running operation or an infinite loop, causing the trace to be stuck in the "in progress" state.

To prevent this, you can set a timeout for the trace using the MLFLOW_TRACE_TIMEOUT_SECONDS environment variable. If the trace exceeds the timeout, MLflow will automatically halt the trace with ERROR status and export it to the backend, so that you can analyze the spans to identify the issue. By default, the timeout is not set.

note

The timeout only applies to the MLflow trace. The main program, model, or agent will continue to run even if the trace is halted.

For example, the following code sets the timeout to 5 seconds and simulates how MLflow handles a long-running operation:

import mlflow
import os
import time

# Set the timeout to 5 seconds for demonstration purposes
os.environ["MLFLOW_TRACE_TIMEOUT_SECONDS"] = "5"


# Simulate a long-running operation
@mlflow.trace
def long_running():
    for _ in range(10):
        child()


@mlflow.trace
def child():
    time.sleep(1)


long_running()

note

MLflow monitors trace execution time and expiration in a background thread. By default, this check is performed every second, with negligible resource consumption. To adjust the interval, set the MLFLOW_TRACE_TIMEOUT_CHECK_INTERVAL_SECONDS environment variable.
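
For example, the following (illustrative) setting makes MLflow check for expired traces every five seconds instead of every second:

import os

# Check for expired traces every 5 seconds instead of the default 1 second
os.environ["MLFLOW_TRACE_TIMEOUT_CHECK_INTERVAL_SECONDS"] = "5"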

Q: My trace is split into multiple traces when using multi-threading. How can I combine them into a single trace?

MLflow Tracing relies on Python's ContextVar, so each thread has its own trace context by default. However, it is possible to generate a single trace for multi-threaded applications with a few additional steps. Refer to the Multi-threading section for more information.
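
As a minimal sketch of one common approach, assuming a simple ThreadPoolExecutor setup, the main thread's trace context can be copied into each worker with contextvars.copy_context(), so spans created in the worker threads attach to the same trace:

import contextvars
from concurrent.futures import ThreadPoolExecutor

import mlflow


@mlflow.trace
def worker(item):
    # Runs in a worker thread, but inside the copied trace context,
    # so this span becomes a child of the main span
    return item * 2


@mlflow.trace
def main(items):
    with ThreadPoolExecutor(max_workers=4) as executor:
        # Copy the current context (which holds the active span) into each task
        futures = [
            executor.submit(contextvars.copy_context().run, worker, item)
            for item in items
        ]
        return [f.result() for f in futures]


main([1, 2, 3, 4])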