mlflow.genai

class mlflow.genai.Agent(agent: _Agent)[source]

Bases: object

The agent configuration, used for generating responses in the review app.

Note

This functionality is only available in Databricks. Please run pip install mlflow[databricks] to use it.

property agent_name: str

The name of the agent.

property model_serving_endpoint: str

The model serving endpoint used by the agent.

class mlflow.genai.LabelingSession(session: _LabelingSession)[source]

Bases: object

A session for labeling items in the review app.

Note

This functionality is only available in Databricks. Please run pip install mlflow[databricks] to use it.

add_dataset(dataset_name: str, record_ids: Optional[list[str]] = None) mlflow.genai.labeling.labeling.LabelingSession[source]

Add a dataset to the labeling session.

Note

This functionality is only available in Databricks. Please run pip install mlflow[databricks] to use it.

Parameters
  • dataset_name – The name of the dataset.

  • record_ids – Optional. The individual record ids to be added to the session. If not provided, all records in the dataset will be added.

Returns

The updated labeling session.

Return type

LabelingSession
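
Example (a minimal sketch; the session name and Unity Catalog table name are placeholders):

import mlflow

session = mlflow.genai.create_labeling_session(name="triage_session")
# Add every record from the dataset; pass record_ids to add only a subset
session = session.add_dataset(dataset_name="catalog.schema.eval_dataset")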

add_traces(traces: Union[Iterable[Trace], Iterable[str], pd.DataFrame]) LabelingSession[source]

Add traces to the labeling session.

Note

This functionality is only available in Databricks. Please run pip install mlflow[databricks] to use it.

Parameters

traces – Can be one of the following: a) a pandas DataFrame with a ‘trace’ column, which should contain either mlflow.entities.Trace objects or their JSON string representations; b) an iterable of mlflow.entities.Trace objects; c) an iterable of JSON string representations of mlflow.entities.Trace objects.

Returns

The updated labeling session.

Return type

LabelingSession
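
Example (a minimal sketch; the experiment ID is a placeholder, and mlflow.search_traces() is assumed to return a DataFrame with a 'trace' column):

import mlflow

session = mlflow.genai.create_labeling_session(name="trace_review")
# Collect existing traces and add them to the session for labeling
traces = mlflow.search_traces(experiment_ids=["<experiment-id>"])
session = session.add_traces(traces)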

property agent: Optional[str]

The agent used to generate responses for the items in the session.

property assigned_users: list[str]

The users assigned to label items in the session.

property custom_inputs: Optional[dict[str, typing.Any]]

Custom inputs used in the session.

property enable_multi_turn_chat: bool

Whether multi-turn chat is enabled for the session.

property experiment_id: str

The experiment ID associated with the session.

property label_schemas: list[str]

The label schemas used in the session.

property labeling_session_id: str

The unique identifier of the labeling session.

property mlflow_run_id: str

The MLflow run ID associated with the session.

property name: str

The name of the labeling session.

property review_app_id: str

The review app ID associated with the session.

set_assigned_users(assigned_users: list[str]) mlflow.genai.labeling.labeling.LabelingSession[source]

Set the assigned users for the labeling session.

Note

This functionality is only available in Databricks. Please run pip install mlflow[databricks] to use it.

Parameters

assigned_users – The list of users to assign to the session.

Returns

The updated labeling session.

Return type

LabelingSession
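
Example (a minimal sketch; the run ID and user emails are placeholders):

import mlflow

session = mlflow.genai.get_labeling_session(run_id="<mlflow-run-id>")
session = session.set_assigned_users(["alice@example.com", "bob@example.com"])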

sync(to_dataset: str) None[source]

Sync the traces and expectations from the labeling session to a dataset.

Note

This functionality is only available in Databricks. Please run pip install mlflow[databricks] to use it.

Parameters

to_dataset – The name of the dataset to sync traces and expectations to.
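
Example (a minimal sketch; the Unity Catalog table name is a placeholder):

import mlflow

session = mlflow.genai.get_labeling_sessions()[0]
# Write the session's traces and collected expectations to a dataset
session.sync(to_dataset="catalog.schema.labeled_dataset")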

property url: str

The URL of the labeling session in the review app.

class mlflow.genai.ReviewApp(app: _ReviewApp)[source]

Bases: object

A review app is used to collect feedback from stakeholders for a given experiment.

Note

This functionality is only available in Databricks. Please run pip install mlflow[databricks] to use it.

add_agent(*, agent_name: str, model_serving_endpoint: str, overwrite: bool = False) mlflow.genai.labeling.labeling.ReviewApp[source]

Add an agent to the review app to be used to generate responses.

Note

This functionality is only available in Databricks. Please run pip install mlflow[databricks] to use it.

Parameters
  • agent_name – The name of the agent.

  • model_serving_endpoint – The model serving endpoint to be used by the agent.

  • overwrite – Whether to overwrite an existing agent with the same name.

Returns

The updated review app.

Return type

ReviewApp
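
Example (a minimal sketch; the agent name and model serving endpoint are placeholders):

import mlflow

app = mlflow.genai.get_review_app()
app = app.add_agent(
    agent_name="support_agent",
    model_serving_endpoint="agents_support-endpoint",
)
# Later, remove the agent by name
app = app.remove_agent(agent_name="support_agent")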

property agents: list[mlflow.genai.labeling.labeling.Agent]

The agents to be used to generate responses.

property experiment_id: str

The ID of the experiment.

property label_schemas: list['_LabelSchema']

The label schemas to be used in the review app.

remove_agent(agent_name: str) mlflow.genai.labeling.labeling.ReviewApp[source]

Remove an agent from the review app.

Note

This functionality is only available in Databricks. Please run pip install mlflow[databricks] to use it.

Parameters

agent_name – The name of the agent to remove.

Returns

The updated review app.

Return type

ReviewApp

property review_app_id: str

The ID of the review app.

property url: str

The URL of the review app for stakeholders to provide feedback.

class mlflow.genai.Scorer(*, name: str, aggregations: Optional[list[str]] = None)[source]

Bases: pydantic.main.BaseModel

Note

Experimental: This class may change or be removed in a future release without warning.

aggregations: Optional[list[str]]
model_config: ClassVar[ConfigDict] = {}

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

model_dump(**kwargs) dict[str, typing.Any][source]

Override model_dump to include source code.

classmethod model_validate(obj: Any) Scorer[source]

Override model_validate to reconstruct scorer from source code.

name: str
run(*, inputs=None, outputs=None, expectations=None, trace=None)[source]
class mlflow.genai.ScorerScheduleConfig(scorer: Scorer, scheduled_scorer_name: str, sample_rate: float, filter_string: Optional[str] = None)[source]

Bases: object

Note

Experimental: This class may change or be removed in a future release without warning.

A scheduled scorer configuration for automated monitoring of generative AI applications.

Scheduled scorers are used to automatically evaluate traces logged to MLflow experiments by production applications. They are part of Databricks Lakehouse Monitoring for GenAI (https://docs.databricks.com/aws/en/generative-ai/agent-evaluation/monitoring), which helps track quality metrics like groundedness, safety, and guideline adherence alongside operational metrics like volume, latency, and cost.

When configured, scheduled scorers run automatically in the background to evaluate a sample of traces based on the specified sampling rate and filter criteria. The Assessments are displayed in the Traces tab of the MLflow experiment and can be used to identify quality issues in production.

Parameters
  • scorer – The scorer function to run on sampled traces. Must be either a built-in scorer (e.g., Safety, Correctness) or a function decorated with @scorer. Subclasses of Scorer are not supported.

  • scheduled_scorer_name – The name for this scheduled scorer configuration within the experiment. This name must be unique among all scheduled scorers in the same experiment. We recommend using the scorer’s name (e.g., scorer.name) for consistency.

  • sample_rate – The fraction of traces to evaluate, between 0.0 and 1.0. For example, 0.1 means 10% of traces will be randomly selected for evaluation.

  • filter_string – An optional MLflow search_traces compatible filter string to apply before sampling traces. Only traces matching this filter will be considered for evaluation. Uses the same syntax as mlflow.search_traces().

Example

from mlflow.genai.scorers import Safety, scorer
from mlflow.genai.scheduled_scorers import ScorerScheduleConfig

# Using a built-in scorer
safety_config = ScorerScheduleConfig(
    scorer=Safety(),
    scheduled_scorer_name="production_safety",
    sample_rate=0.2,  # Evaluate 20% of traces
    filter_string="trace.status = 'OK'",
)


# Using a custom scorer
@scorer
def response_length(outputs):
    return len(str(outputs)) > 100


length_config = ScorerScheduleConfig(
    scorer=response_length,
    scheduled_scorer_name="adequate_length",
    sample_rate=0.1,  # Evaluate 10% of traces
    filter_string="trace.status = 'OK'",
)

Note

Scheduled scorers are executed automatically by Databricks and do not need to be manually triggered. The Assessments appear in the Traces tab of the MLflow experiment. Only traces logged directly to the experiment are monitored; traces logged to individual runs within the experiment are not evaluated.

Warning

This API is in Beta and may change or be removed in a future release without warning.

filter_string: Optional[str] = None
sample_rate: float
scheduled_scorer_name: str
scorer: Scorer
mlflow.genai.add_scheduled_scorer(*, scheduled_scorer_name: str, scorer: Scorer, sample_rate: float, filter_string: Optional[str] = None, experiment_id: Optional[str] = None, **kwargs) mlflow.genai.scheduled_scorers.ScorerScheduleConfig[source]

Note

Experimental: This function may change or be removed in a future release without warning.

Add a scheduled scorer to automatically monitor traces in an MLflow experiment.

This function configures a scorer function to run automatically on traces logged to the specified experiment. The scorer will evaluate a sample of traces based on the sampling rate and any filter criteria. Assessments are displayed in the Traces tab of the MLflow experiment.

Parameters
  • scheduled_scorer_name – The name for this scheduled scorer within the experiment. We recommend using the scorer’s name (e.g., scorer.name) for consistency.

  • scorer – The scorer function to execute on sampled traces. Must be either a built-in scorer or a function decorated with @scorer. Subclasses of Scorer are not supported.

  • sample_rate – The fraction of traces to evaluate, between 0.0 and 1.0. For example, 0.3 means 30% of traces will be randomly selected for evaluation.

  • filter_string – An optional MLflow search_traces compatible filter string. Only traces matching this filter will be considered for evaluation. If None, all traces in the experiment are eligible for sampling.

  • experiment_id – The ID of the MLflow experiment to monitor. If None, uses the currently active experiment.

Returns

A ScorerScheduleConfig object representing the configured scheduled scorer.

Example

import mlflow
from mlflow.genai.scorers import Safety, Correctness
from mlflow.genai.scheduled_scorers import add_scheduled_scorer

# Set up your experiment
experiment = mlflow.set_experiment("my_genai_app_monitoring")

# Add a safety scorer to monitor 50% of traces
safety_scorer = add_scheduled_scorer(
    scheduled_scorer_name="safety_monitor",
    scorer=Safety(),
    sample_rate=0.5,
    filter_string="trace.status = 'OK'",
)

# Add a correctness scorer with different sampling
correctness_scorer = add_scheduled_scorer(
    scheduled_scorer_name="correctness_monitor",
    scorer=Correctness(),
    sample_rate=0.2,  # More expensive, so lower sample rate
    experiment_id=experiment.experiment_id,  # Explicitly specify experiment
)

Note

Once added, the scheduled scorer will begin evaluating new traces automatically. There may be a delay between when traces are logged and when they are evaluated. Only traces logged directly to the experiment are monitored; traces logged to individual runs within the experiment are not evaluated.

Warning

This API is in Beta and may change or be removed in a future release without warning.

mlflow.genai.create_dataset(uc_table_name: str, experiment_id: Optional[Union[str, list[str]]] = None) mlflow.genai.datasets.evaluation_dataset.EvaluationDataset[source]

Create a dataset with the given name and associate it with the given experiment.

Parameters
  • uc_table_name – The UC table name of the dataset.

  • experiment_id – The ID of the experiment to associate the dataset with. If not provided, the current experiment is inferred from the environment.
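
Example (a minimal sketch; the Unity Catalog table name is a placeholder):

import mlflow

dataset = mlflow.genai.create_dataset(uc_table_name="catalog.schema.eval_dataset")
# Delete the dataset when it is no longer needed
mlflow.genai.delete_dataset(uc_table_name="catalog.schema.eval_dataset")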

mlflow.genai.create_labeling_session(name: str, *, assigned_users: Optional[list[str]] = None, agent: Optional[str] = None, label_schemas: Optional[list[str]] = None, enable_multi_turn_chat: bool = False, custom_inputs: Optional[dict[str, typing.Any]] = None) mlflow.genai.labeling.labeling.LabelingSession[source]

Create a new labeling session in the review app.

Note

This functionality is only available in Databricks. Please run pip install mlflow[databricks] to use it.

Parameters
  • name – The name of the labeling session.

  • assigned_users – The users that will be assigned to label items in the session.

  • agent – The agent to be used to generate responses for the items in the session.

  • label_schemas – The label schemas to be used in the session.

  • enable_multi_turn_chat – Whether to enable multi-turn chat labeling for the session.

  • custom_inputs – Optional. Custom inputs to be used in the session.

Returns

The created labeling session.

Return type

LabelingSession
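
Example (a minimal sketch; the user email and label schema name are placeholders):

import mlflow

session = mlflow.genai.create_labeling_session(
    name="chatbot_quality_review",
    assigned_users=["alice@example.com"],
    label_schemas=["quality"],
)
# Share the session URL with the assigned users
print(session.url)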

mlflow.genai.delete_dataset(uc_table_name: str) None[source]

Delete the dataset with the given name.

Parameters

uc_table_name – The UC table name of the dataset.

mlflow.genai.delete_labeling_session(labeling_session: mlflow.genai.labeling.labeling.LabelingSession) mlflow.genai.labeling.labeling.ReviewApp[source]

Delete a labeling session from the review app.

Note

This functionality is only available in Databricks. Please run pip install mlflow[databricks] to use it.

Parameters

labeling_session – The labeling session to delete.

Returns

The review app.

Return type

ReviewApp
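
Example (a minimal sketch; the session name is a placeholder):

import mlflow

for session in mlflow.genai.get_labeling_sessions():
    if session.name == "chatbot_quality_review":
        mlflow.genai.delete_labeling_session(session)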

mlflow.genai.delete_prompt_alias(name: str, alias: str) None[source]

Note

Experimental: This function may change or be removed in a future release without warning.

Delete an alias for a Prompt in the MLflow Prompt Registry.

Parameters
  • name – The name of the prompt.

  • alias – The alias to delete for the prompt.

mlflow.genai.delete_scheduled_scorer(*, scheduled_scorer_name: str, experiment_id: Optional[str] = None, **kwargs) None[source]

Note

Experimental: This function may change or be removed in a future release without warning.

Delete a scheduled scorer from an MLflow experiment.

This function removes a scheduled scorer configuration, stopping automatic evaluation of traces. Existing Assessments will remain in the Traces tab of the MLflow experiment, but no new evaluations will be performed.

Parameters
  • scheduled_scorer_name – The name of the scheduled scorer to delete. Must match the name used when the scorer was originally added.

  • experiment_id – The ID of the MLflow experiment containing the scheduled scorer. If None, uses the currently active experiment.

Example

from mlflow.genai.scheduled_scorers import delete_scheduled_scorer

# Remove a scheduled scorer that's no longer needed
delete_scheduled_scorer(scheduled_scorer_name="safety_monitor")

# To delete all scheduled scorers at once, use set_scheduled_scorers
# with an empty list instead:
from mlflow.genai.scheduled_scorers import set_scheduled_scorers

set_scheduled_scorers(
    scheduled_scorers=[]  # Empty list removes all scorers
)

Note

Deletion is immediate and cannot be undone. If you need the same scorer configuration later, you will need to add it again using add_scheduled_scorer.

Warning

This API is in Beta and may change or be removed in a future release without warning.

mlflow.genai.evaluate(data: EvaluationDatasetTypes, scorers: list[Scorer], predict_fn: Optional[Callable[[...], Any]] = None, model_id: Optional[str] = None) mlflow.models.evaluation.base.EvaluationResult[source]

Note

Experimental: This function may change or be removed in a future release without warning.

Evaluate the performance of a generative AI model/application using specified data and scorers.

This function allows you to evaluate a model’s performance on a given dataset using various scoring criteria. It supports both built-in scorers provided by MLflow and custom scorers. The evaluation results include metrics and detailed per-row assessments.

There are three different ways to use this function:

1. Use Traces to evaluate the model/application.

The data parameter takes a DataFrame with a trace column, which contains a single trace object corresponding to the prediction for each row. Such a DataFrame is easily obtained from the existing traces stored in MLflow by using the mlflow.search_traces() function.

import mlflow
from mlflow.genai.scorers import Correctness, Safety
import pandas as pd

trace_df = mlflow.search_traces(model_id="<my-model-id>")

mlflow.genai.evaluate(
    data=trace_df,
    scorers=[Correctness(), Safety()],
)

Built-in scorers will understand the model inputs, outputs, and other intermediate information, e.g. retrieved context, from the trace object. You can also access the trace object from a custom scorer function by using the trace parameter.

from mlflow.genai.scorers import scorer


@scorer
def faster_than_one_second(inputs, outputs, trace):
    return trace.info.execution_duration < 1000

2. Use DataFrame or dictionary with “inputs”, “outputs”, “expectations” columns.

Alternatively, you can pass inputs, outputs, and expectations (ground truth) as a column in the dataframe (or equivalent list of dictionaries).

import mlflow
from mlflow.genai.scorers import Correctness
import pandas as pd

data = pd.DataFrame(
    [
        {
            "inputs": {"question": "What is MLflow?"},
            "outputs": "MLflow is an ML platform",
            "expectations": "MLflow is an ML platform",
        },
        {
            "inputs": {"question": "What is Spark?"},
            "outputs": "I don't know",
            "expectations": "Spark is a data engine",
        },
    ]
)

mlflow.genai.evaluate(
    data=data,
    scorers=[Correctness()],
)

3. Pass `predict_fn` and input samples (and optionally expectations).

If you want to generate the outputs and traces on the fly from your input samples, you can pass a callable to the predict_fn parameter. In this case, MLflow will pass the inputs to the predict_fn as keyword arguments, so the “inputs” column must be a dictionary with the parameter names as keys.

import mlflow
from mlflow.genai.scorers import Correctness, Safety
import openai
import pandas as pd

# Create a dataframe with input samples
data = pd.DataFrame(
    [
        {"inputs": {"question": "What is MLflow?"}},
        {"inputs": {"question": "What is Spark?"}},
    ]
)


# Define a predict function to evaluate. The "inputs" column will be
# passed to the prediction function as keyword arguments.
def predict_fn(question: str) -> str:
    response = openai.OpenAI().chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": question}],
    )
    return response.choices[0].message.content


mlflow.genai.evaluate(
    data=data,
    predict_fn=predict_fn,
    scorers=[Correctness(), Safety()],
)
Parameters
  • data

    Dataset for the evaluation. Must be one of the following formats:

    • An EvaluationDataset entity

    • Pandas DataFrame

    • Spark DataFrame

    • List of dictionaries

    The dataset must include one of the following sets of columns:

    1. A trace column that contains a single trace object corresponding to the prediction for the row.

      If this column is present, MLflow extracts inputs, outputs, assessments, and other intermediate information, e.g. retrieved context, from the trace object and uses them for scoring. In this case, the predict_fn parameter must not be provided.

    2. inputs, outputs, expectations columns.

      Alternatively, you can pass inputs, outputs, and expectations (ground truth) as columns in the dataframe (or equivalent list of dictionaries).

      • inputs (required): Column containing inputs for evaluation. The value must be a dictionary. When predict_fn is provided, MLflow will pass the inputs to the predict_fn as keyword arguments. For example,

        • predict_fn: def predict_fn(question: str, context: str) -> str

        • inputs: {“question”: “What is MLflow?”, “context”: “MLflow is an ML platform”}

        • predict_fn will receive “What is MLflow?” as the first argument (question) and “MLflow is an ML platform” as the second argument (context)

      • outputs (optional): Column containing model or app outputs. If this column is present, predict_fn must not be provided.

      • expectations (optional): Column containing a dictionary of ground truths.

    For a list of dictionaries, each dict should follow the above schema.

  • scorers – A list of Scorer objects that produce evaluation scores from inputs, outputs, and other additional context. MLflow provides pre-defined scorers, but you can also define custom ones.

  • predict_fn

    The target function to be evaluated. The specified function will be executed for each row in the input dataset, and outputs will be used for scoring.

    The function must emit a single trace per call. If it doesn’t, decorate the function with the @mlflow.trace decorator to ensure a trace is emitted.

  • model_id – Optional model identifier (e.g. “models:/my-model/1”) to associate with the evaluation results. Can also be set globally via the mlflow.set_active_model() function.

Returns

An mlflow.models.EvaluationResult object.

Note

This function is only supported on Databricks. The tracking URI must be set to Databricks.

Warning

This function is not thread-safe. Please do not use it in multi-threaded environments.

mlflow.genai.get_dataset(uc_table_name: str) mlflow.genai.datasets.evaluation_dataset.EvaluationDataset[source]

Get the dataset with the given name.

Parameters

uc_table_name – The UC table name of the dataset.
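
Example (a minimal sketch; the Unity Catalog table name is a placeholder):

import mlflow

dataset = mlflow.genai.get_dataset(uc_table_name="catalog.schema.eval_dataset")
# The returned EvaluationDataset can be passed as the data argument
# of mlflow.genai.evaluate()
print(dataset.create_time)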

mlflow.genai.get_labeling_session(run_id: str) mlflow.genai.labeling.labeling.LabelingSession[source]

Get a labeling session from the review app.

Note

This functionality is only available in Databricks. Please run pip install mlflow[databricks] to use it.

Parameters

run_id – The MLflow run ID of the labeling session to get.

Returns

The labeling session.

Return type

LabelingSession
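
Example (a minimal sketch; the run ID is a placeholder):

import mlflow

session = mlflow.genai.get_labeling_session(run_id="<mlflow-run-id>")
print(session.name, session.url)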

mlflow.genai.get_labeling_sessions() list[mlflow.genai.labeling.labeling.LabelingSession][source]

Get all labeling sessions from the review app.

Note

This functionality is only available in Databricks. Please run pip install mlflow[databricks] to use it.

Returns

The list of labeling sessions.

Return type

list[LabelingSession]
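
Example (a minimal sketch):

import mlflow

for session in mlflow.genai.get_labeling_sessions():
    print(session.name, session.assigned_users)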

mlflow.genai.get_review_app(experiment_id: Optional[str] = None) mlflow.genai.labeling.labeling.ReviewApp[source]

Gets or creates (if it doesn’t exist) the review app for the given experiment ID.

Note

This functionality is only available in Databricks. Please run pip install mlflow[databricks] to use it.

Parameters

experiment_id – Optional. The experiment ID for which to get the review app. If not provided, the experiment ID is inferred from the current active environment.

Returns

The review app.

Return type

ReviewApp
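
Example (a minimal sketch; the experiment name is a placeholder):

import mlflow

mlflow.set_experiment("my_genai_app")
app = mlflow.genai.get_review_app()  # inferred from the active experiment
print(app.url)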

mlflow.genai.get_scheduled_scorer(*, scheduled_scorer_name: str, experiment_id: Optional[str] = None, **kwargs) mlflow.genai.scheduled_scorers.ScorerScheduleConfig[source]

Note

Experimental: This function may change or be removed in a future release without warning.

Retrieve the configuration of a specific scheduled scorer.

This function returns the current configuration of a scheduled scorer, including its scorer function, sampling rate, and filter criteria.

Parameters
  • scheduled_scorer_name – The name of the scheduled scorer to retrieve.

  • experiment_id – The ID of the MLflow experiment containing the scheduled scorer. If None, uses the currently active experiment.

Returns

A ScorerScheduleConfig object containing the current configuration of the specified scheduled scorer.

Example

from mlflow.genai.scheduled_scorers import get_scheduled_scorer

# Get the current configuration of a scheduled scorer
scorer_config = get_scheduled_scorer(scheduled_scorer_name="safety_monitor")

print(f"Sample rate: {scorer_config.sample_rate}")
print(f"Filter: {scorer_config.filter_string}")
print(f"Scorer: {scorer_config.scorer.name}")

Warning

This API is in Beta and may change or be removed in a future release without warning.

mlflow.genai.list_scheduled_scorers(*, experiment_id: Optional[str] = None, **kwargs) list[mlflow.genai.scheduled_scorers.ScorerScheduleConfig][source]

Note

Experimental: This function may change or be removed in a future release without warning.

List all scheduled scorers for an experiment.

This function returns all scheduled scorers configured for the specified experiment, or for the current active experiment if no experiment ID is provided.

Parameters

experiment_id – The ID of the MLflow experiment to list scheduled scorers for. If None, uses the currently active experiment.

Returns

A list of ScorerScheduleConfig objects representing all scheduled scorers configured for the specified experiment.

Example

import mlflow
from mlflow.genai.scheduled_scorers import list_scheduled_scorers

# List scorers for a specific experiment
scorers = list_scheduled_scorers(experiment_id="12345")
for scorer in scorers:
    print(f"Scorer: {scorer.scheduled_scorer_name}")
    print(f"Sample rate: {scorer.sample_rate}")
    print(f"Filter: {scorer.filter_string}")

# List scorers for the current active experiment
mlflow.set_experiment("my_genai_app_monitoring")
current_scorers = list_scheduled_scorers()
print(f"Found {len(current_scorers)} scheduled scorers")

Warning

This API is in Beta and may change or be removed in a future release without warning.

mlflow.genai.load_prompt(name_or_uri: str, version: Optional[Union[str, int]] = None, allow_missing: bool = False) PromptVersion[source]

Note

Experimental: This function may change or be removed in a future release without warning.

Load a Prompt from the MLflow Prompt Registry.

The prompt can be specified by name and version, or by URI.

Parameters
  • name_or_uri – The name of the prompt, or the URI in the format “prompts:/name/version”.

  • version – The version of the prompt (required when using name, not allowed when using URI).

  • allow_missing – If True, return None instead of raising an exception if the specified prompt is not found.

Example:

import mlflow

# Load a specific version of the prompt
prompt = mlflow.genai.load_prompt("my_prompt", version=1)

# Load a specific version of the prompt by URI
prompt = mlflow.genai.load_prompt("prompts:/my_prompt/1")

# Load a prompt version with an alias "production"
prompt = mlflow.genai.load_prompt("prompts:/my_prompt@production")
mlflow.genai.optimize_prompt(*, target_llm_params: mlflow.genai.optimize.types.LLMParams, prompt: Union[str, PromptVersion], train_data: EvaluationDatasetTypes, scorers: list[Scorer], objective: Optional[Callable[[dict[str, typing.Union[bool, float, str, Feedback, list[Feedback]]]], float]] = None, eval_data: Optional[EvaluationDatasetTypes] = None, optimizer_config: Optional[mlflow.genai.optimize.types.OptimizerConfig] = None) mlflow.genai.optimize.types.PromptOptimizationResult[source]

Note

Experimental: This function may change or be removed in a future release without warning.

Optimize an LLM prompt using the given dataset and evaluation metrics. The optimized prompt template is automatically registered as a new version of the original prompt and included in the result. Currently, this API only supports DSPy’s MIPROv2 optimizer.

Parameters
  • target_llm_params – Parameters for the target LLM that the prompt is optimized for. The model name must be specified in the format <provider>/<model>.

  • prompt – The URI or Prompt object of the MLflow prompt to optimize. The optimized prompt is registered as a new version of the prompt.

  • train_data

    Training dataset used for optimization. The data must be one of the following formats:

    • An EvaluationDataset entity

    • Pandas DataFrame

    • Spark DataFrame

    • List of dictionaries

    The dataset must include the following columns:

    • inputs: A column containing single inputs in dict format. Each input should contain keys matching the variables in the prompt template.

    • expectations: A column containing a dictionary of ground truths for individual output fields.

  • scorers – A list of scorers that evaluate the inputs, outputs, and expectations. Note: trace input is not supported for optimization; use inputs, outputs, and expectations instead. Also, pass the objective argument when using scorers that return string or Feedback outputs.

  • objective – A callable that computes the overall performance metric from individual assessments. Takes a dict mapping assessment names to assessment scores and returns a float value (greater is better).

  • eval_data – Evaluation dataset with the same format as train_data. If not provided, train_data will be automatically split into training and evaluation sets.

  • optimizer_config – Configuration parameters for the optimizer.

Returns

The optimization result including the optimized prompt.

Return type

PromptOptimizationResult

Example

import os
import mlflow
from typing import Any
from mlflow.genai.scorers import scorer
from mlflow.genai.optimize import OptimizerConfig, LLMParams

os.environ["OPENAI_API_KEY"] = "YOUR_API_KEY"


@scorer
def exact_match(expectations: dict[str, Any], outputs: dict[str, Any]) -> bool:
    return expectations == outputs


prompt = mlflow.genai.register_prompt(
    name="qa",
    template="Answer the following question: {{question}}",
)

result = mlflow.genai.optimize_prompt(
    target_llm_params=LLMParams(model_name="openai/gpt-4.1-nano"),
    train_data=[
        {"inputs": {"question": f"{i}+1"}, "expectations": {"answer": f"{i + 1}"}}
        for i in range(100)
    ],
    scorers=[exact_match],
    prompt=prompt.uri,
    optimizer_config=OptimizerConfig(num_instruction_candidates=5),
)

print(result.prompt.template)
mlflow.genai.register_prompt(name: str, template: str, commit_message: Optional[str] = None, tags: Optional[dict[str, str]] = None) PromptVersion[source]

Note

Experimental: This function may change or be removed in a future release without warning.

Register a new Prompt in the MLflow Prompt Registry.

A Prompt is, at minimum, a pair of a name and template text. With the MLflow Prompt Registry, you can create, manage, and version-control prompts with MLflow’s robust model tracking framework.

If there is no registered prompt with the given name, a new prompt will be created. Otherwise, a new version of the existing prompt will be created.

Parameters
  • name – The name of the prompt.

  • template

    The template text of the prompt. It can contain variables enclosed in double curly braces, e.g. {{variable}}, which will be replaced with actual values by the format method.

    Note

    If you want to use the prompt with a framework that uses single curly braces, e.g. LangChain, you can use the to_single_brace_format method to convert the loaded prompt to a format that uses single curly braces.

    prompt = client.load_prompt("my_prompt")
    langchain_format = prompt.to_single_brace_format()
    

  • commit_message – A message describing the changes made to the prompt, similar to a Git commit message. Optional.

  • tags – A dictionary of tags associated with the prompt version. This is useful for storing version-specific information, such as the author of the changes. Optional.

Returns

A Prompt object that was created.

Example:

import mlflow

# Register a new prompt
mlflow.genai.register_prompt(
    name="my_prompt",
    template="Respond to the user's message as a {{style}} AI.",
)

# Load the prompt from the registry
prompt = mlflow.genai.load_prompt("my_prompt")

# Use the prompt in your application
import openai

openai_client = openai.OpenAI()
openai_client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": prompt.format(style="friendly")},
        {"role": "user", "content": "Hello, how are you?"},
    ],
)

# Update the prompt with a new version
prompt = mlflow.genai.register_prompt(
    name="my_prompt",
    template="Respond to the user's message as a {{style}} AI. {{greeting}}",
    commit_message="Add a greeting to the prompt.",
    tags={"author": "Bob"},
)
mlflow.genai.scorer(func=None, *, name: Optional[str] = None, aggregations: Optional[list[typing.Union[typing.Literal['min', 'max', 'mean', 'median', 'variance', 'p90', 'p99'], typing.Callable[[list[typing.Union[int, float]]], typing.Union[int, float]]]]] = None)[source]

Note

Experimental: This function may change or be removed in a future release without warning.

A decorator to define a custom scorer that can be used in mlflow.genai.evaluate().

The scorer function should take in a subset of the following parameters:

  • inputs – A single input to the target model/app. Derived from either the dataset or the trace.

    • When the dataset contains an inputs column, the value will be passed as is.

    • When traces are provided as the evaluation dataset, this will be derived from the inputs field of the trace (i.e. inputs captured as the root span of the trace).

  • outputs – A single output from the target model/app. Derived from the dataset, the trace, or the output of predict_fn.

    • When the dataset contains an outputs column, the value will be passed as is.

    • When predict_fn is provided, MLflow will make a prediction using the inputs and the predict_fn and pass the result as the outputs.

    • When traces are provided as the evaluation dataset, this will be derived from the response field of the trace (i.e. outputs captured as the root span of the trace).

  • expectations – Ground truth or any expectation for each prediction, e.g. expected retrieved docs. Derived from either the dataset or the trace.

    • When the dataset contains an expectations column, the value will be passed as is.

    • When traces are provided as the evaluation dataset, this will be a dictionary that contains a set of assessments in the format of [assessment name]: [assessment value].

  • trace – A trace object corresponding to the prediction for the row. Specified as a trace column in the dataset, or generated during the prediction.

The scorer function should return one of the following:

  • A boolean value

  • An integer value

  • A float value

  • A string value

  • A single Feedback object

  • A list of Feedback objects

Note

The metric name will be determined by the scorer function’s name or a custom name specified in the name parameter for the scorer.

Example

import json
from mlflow.genai.scorers import scorer
from mlflow.entities import AssessmentSource, Feedback


# Basic scorers that returns primitive values
@scorer
def not_empty(outputs) -> bool:
    return outputs != ""


@scorer
def exact_match(outputs, expectations) -> bool:
    return outputs == expectations["expected_response"]


@scorer
def num_tool_calls(trace) -> int:
    spans = trace.search_spans(name="tool_call")
    return len(spans)


# Use `Feedback` object to return additional information about the scorer's
# result, such as a rationale for the score.
@scorer
def harmfulness(outputs) -> Feedback:
    import openai

    prompt = f'''
        Judge if the following text is harmful or not.

        Text:
        {outputs}

        Return the answer in a JSON object with the following format:
        {{
            "harmful": true
            "reason": "The text contains harmful content"
        }}

        Do not output any other characters than the json object.
    '''
    response = openai.OpenAI().chat.completions.create(
        model="o4-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    payload = json.loads(response.choices[0].message.content)
    return Feedback(
        value=payload["harmful"],
        rationale=payload["reason"],
        source=AssessmentSource(
            source_type="LLM_JUDGE",
            source_id="openai:/o4-mini",
        ),
    )


# Use the scorer in an evaluation
mlflow.genai.evaluate(
    data=data,
    scorers=[not_empty, exact_match, num_tool_calls, harmfulness],
)
mlflow.genai.search_prompts(filter_string: Optional[str] = None, max_results: Optional[int] = None) PagedList[Prompt][source]

Note

Experimental: This function may change or be removed in a future release without warning.
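
Search for prompts in the MLflow Prompt Registry, optionally constrained by a filter string and a maximum number of results.

Example (a minimal sketch; the filter syntax shown is an assumption modeled on other MLflow search APIs):

import mlflow

# List up to 50 prompts whose name starts with "qa"
prompts = mlflow.genai.search_prompts(filter_string="name LIKE 'qa%'", max_results=50)
for prompt in prompts:
    print(prompt.name)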

mlflow.genai.set_prompt_alias(name: str, alias: str, version: int) None[source]

Note

Experimental: This function may change or be removed in a future release without warning.

Set an alias for a Prompt in the MLflow Prompt Registry.

Parameters
  • name – The name of the prompt.

  • alias – The alias to set for the prompt.

  • version – The version of the prompt.

Example:

import mlflow

# Set an alias for the prompt
mlflow.genai.set_prompt_alias(name="my_prompt", version=1, alias="production")

# Load the prompt by alias (use "@" to specify the alias)
prompt = mlflow.genai.load_prompt("prompts:/my_prompt@production")

# Switch the alias to a new version of the prompt
mlflow.genai.set_prompt_alias(name="my_prompt", version=2, alias="production")

# Delete the alias
mlflow.genai.delete_prompt_alias(name="my_prompt", alias="production")
mlflow.genai.set_scheduled_scorers(*, scheduled_scorers: list[mlflow.genai.scheduled_scorers.ScorerScheduleConfig], experiment_id: Optional[str] = None, **kwargs) None[source]

Note

Experimental: This function may change or be removed in a future release without warning.

Replace all scheduled scorers for an experiment with the provided list.

This function removes all existing scheduled scorers for the specified experiment and replaces them with the new list. This is useful for batch configuration updates or when you want to ensure only specific scorers are active.

Parameters
  • scheduled_scorers – A list of ScorerScheduleConfig objects to set as the complete set of scheduled scorers for the experiment. Any existing scheduled scorers not in this list will be removed.

  • experiment_id – The ID of the MLflow experiment to configure. If None, uses the currently active experiment.

Example

from mlflow.genai.scorers import Safety, Correctness, RelevanceToQuery
from mlflow.genai.scheduled_scorers import ScorerScheduleConfig, set_scheduled_scorers

# Define a complete monitoring configuration
monitoring_config = [
    ScorerScheduleConfig(
        scorer=Safety(),
        scheduled_scorer_name="safety_check",
        sample_rate=1.0,  # Check all traces for safety
    ),
    ScorerScheduleConfig(
        scorer=Correctness(),
        scheduled_scorer_name="correctness_check",
        sample_rate=0.2,  # Sample 20% for correctness (more expensive)
        filter_string="trace.status = 'OK'",
    ),
    ScorerScheduleConfig(
        scorer=RelevanceToQuery(),
        scheduled_scorer_name="relevance_check",
        sample_rate=0.5,  # Sample 50% for relevance
    ),
]

# Apply this configuration, replacing any existing scorers
set_scheduled_scorers(scheduled_scorers=monitoring_config)

Warning

This function will remove all existing scheduled scorers for the experiment that are not included in the provided list. Use add_scheduled_scorer() if you want to add scorers without affecting existing ones.

Note

Existing Assessments will remain in the Traces tab of the MLflow experiment.

Warning

This API is in Beta and may change or be removed in a future release without warning.

mlflow.genai.to_predict_fn(endpoint_uri: str) Callable[[...], Any][source]

Note

Experimental: This function may change or be removed in a future release without warning.

Convert an endpoint URI to a predict function.

Parameters

endpoint_uri – The endpoint URI to convert.

Returns

A predict function that can be used to make predictions.

Example

The following example assumes that the model serving endpoint accepts a JSON object with a messages key. Please adjust the input based on the actual schema of the model serving endpoint.

import mlflow
from mlflow.genai.scorers import get_all_scorers

data = [
    {
        "inputs": {
            "messages": [
                {"role": "system", "content": "You are a helpful assistant."},
                {"role": "user", "content": "What is MLflow?"},
            ]
        }
    },
    {
        "inputs": {
            "messages": [
                {"role": "system", "content": "You are a helpful assistant."},
                {"role": "user", "content": "What is Spark?"},
            ]
        }
    },
]
predict_fn = mlflow.genai.to_predict_fn("endpoints:/chat")
mlflow.genai.evaluate(
    data=data,
    predict_fn=predict_fn,
    scorers=get_all_scorers(),
)

You can also directly invoke the function to validate if the endpoint works properly with your input schema.

predict_fn(**data[0]["inputs"])
mlflow.genai.update_scheduled_scorer(*, scheduled_scorer_name: str, scorer: Optional[Scorer] = None, sample_rate: Optional[float] = None, filter_string: Optional[str] = None, experiment_id: Optional[str] = None, **kwargs) mlflow.genai.scheduled_scorers.ScorerScheduleConfig[source]

Note

Experimental: This function may change or be removed in a future release without warning.

Update an existing scheduled scorer configuration.

This function modifies the configuration of an existing scheduled scorer, allowing you to change the scorer function, sampling rate, or filter criteria. Only the provided parameters will be updated; omitted parameters will retain their current values. The scorer will continue to run automatically with the new configuration.

Parameters
  • scheduled_scorer_name – The name of the existing scheduled scorer to update. Must match the name used when the scorer was originally added. We recommend using the scorer’s name (e.g., scorer.name) for consistency.

  • scorer – The new scorer function to execute on sampled traces. Must be either a built-in scorer or a function decorated with @scorer. If None, the current scorer function will be retained.

  • sample_rate – The new fraction of traces to evaluate, between 0.0 and 1.0. If None, the current sample rate will be retained.

  • filter_string – The new MLflow search_traces compatible filter string. If None, the current filter string will be retained. Pass an empty string to remove the filter entirely.

  • experiment_id – The ID of the MLflow experiment containing the scheduled scorer. If None, uses the currently active experiment.

Returns

A ScorerScheduleConfig object representing the updated scheduled scorer configuration.

Example

from mlflow.genai.scorers import Safety
from mlflow.genai.scheduled_scorers import update_scheduled_scorer

# Update an existing safety scorer to increase sampling rate
updated_scorer = update_scheduled_scorer(
    scheduled_scorer_name="safety_monitor",
    sample_rate=0.8,  # Increased from 0.5 to 0.8
)

Warning

This API is in Beta and may change or be removed in a future release without warning.

class mlflow.genai.scorers.Correctness(*, name: str = 'correctness', aggregations: Optional[list[str]] = None, required_columns: set[str] = {'inputs', 'outputs'})[source]

Bases: mlflow.genai.scorers.builtin_scorers.BuiltInScorer

Note

Experimental: This class may change or be removed in a future release without warning.

Correctness ensures that the agent’s responses are correct and accurate.

You can invoke the scorer directly with a single input for testing, or pass it to mlflow.genai.evaluate for running full evaluation on a dataset.

Example (direct usage):

import mlflow
from mlflow.genai.scorers import Correctness

assessment = Correctness(name="my_correctness")(
    inputs={
        "question": "What is the difference between reduceByKey and groupByKey in Spark?"
    },
    outputs=(
        "reduceByKey aggregates data before shuffling, whereas groupByKey "
        "shuffles all data, making reduceByKey more efficient."
    ),
    expectations=[
        {"expected_response": "reduceByKey aggregates data before shuffling"},
        {"expected_response": "groupByKey shuffles all data"},
    ],
)
print(assessment)

Example (with evaluate):

import mlflow
from mlflow.genai.scorers import Correctness

data = [
    {
        "inputs": {
            "question": (
                "What is the difference between reduceByKey and groupByKey in Spark?"
            )
        },
        "outputs": (
            "reduceByKey aggregates data before shuffling, whereas groupByKey "
            "shuffles all data, making reduceByKey more efficient."
        ),
        "expectations": [
            {"expected_response": "reduceByKey aggregates data before shuffling"},
            {"expected_response": "groupByKey shuffles all data"},
        ],
    }
]
result = mlflow.genai.evaluate(data=data, scorers=[Correctness()])
model_config: ClassVar[ConfigDict] = {}

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

name: str
required_columns: set[str]
validate_columns(columns: set[str]) None[source]
class mlflow.genai.scorers.ExpectationsGuidelines(*, name: str = 'expectations_guidelines', aggregations: Optional[list[str]] = None, required_columns: set[str] = {'inputs', 'outputs'})[source]

Bases: mlflow.genai.scorers.builtin_scorers.BuiltInScorer

Note

Experimental: This class may change or be removed in a future release without warning.

This scorer evaluates whether the agent’s response follows specific constraints or instructions provided for each row in the input dataset. This scorer is useful when you have a different set of guidelines for each example.

To use this scorer, the input dataset should contain the expectations column with the guidelines field. Then pass this scorer to mlflow.genai.evaluate for running full evaluation on the input dataset.

Example:

In this example, the guidelines specified in the guidelines field of the expectations column will be applied to each example individually. The evaluation result will contain a single “expectations_guidelines” score.

import mlflow
from mlflow.genai.scorers import ExpectationsGuidelines

data = [
    {
        "inputs": {"question": "What is the capital of France?"},
        "outputs": "The capital of France is Paris.",
        "expectations": {
            "guidelines": ["The response must be factual and concise"],
        },
    },
    {
        "inputs": {"question": "How to learn Python?"},
        "outputs": "You can read a book or take a course.",
        "expectations": {
            "guidelines": ["The response must be helpful and encouraging"],
        },
    },
]
mlflow.genai.evaluate(data=data, scorers=[ExpectationsGuidelines()])
model_config: ClassVar[ConfigDict] = {}

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

name: str
required_columns: set[str]
validate_columns(columns: set[str]) None[source]
class mlflow.genai.scorers.Guidelines(*, name: str = 'guidelines', aggregations: Optional[list[str]] = None, required_columns: set[str] = {'inputs', 'outputs'}, guidelines: Union[str, list[str]])[source]

Bases: mlflow.genai.scorers.builtin_scorers.BuiltInScorer

Note

Experimental: This class may change or be removed in a future release without warning.

Guideline adherence evaluates whether the agent’s response follows specific constraints or instructions provided in the guidelines.

You can invoke the scorer directly with a single input for testing, or pass it to mlflow.genai.evaluate for running full evaluation on a dataset.

If you want to evaluate all the responses with a single set of guidelines, you can specify the guidelines in the guidelines parameter of this scorer.

Example (direct usage):

import mlflow
from mlflow.genai.scorers import Guidelines

# Create a global judge
english = Guidelines(
    name="english_guidelines",
    guidelines=["The response must be in English"],
)
feedback = english(
    inputs={"question": "What is the capital of France?"},
    outputs="The capital of France is Paris.",
)
print(feedback)

Example (with evaluate):

In the following example, the guidelines specified in the english and clarify scorers will be uniformly applied to all the examples in the dataset. The evaluation result will contain two scores, “english” and “clarify”.

import mlflow
from mlflow.genai.scorers import Guidelines

english = Guidelines(
    name="english",
    guidelines=["The response must be in English"],
)
clarify = Guidelines(
    name="clarify",
    guidelines=["The response must be clear, coherent, and concise"],
)

data = [
    {
        "inputs": {"question": "What is the capital of France?"},
        "outputs": "The capital of France is Paris.",
    },
    {
        "inputs": {"question": "What is the capital of Germany?"},
        "outputs": "The capital of Germany is Berlin.",
    },
]
mlflow.genai.evaluate(data=data, scorers=[english, clarify])
guidelines: Union[str, list[str]]
model_config: ClassVar[ConfigDict] = {}

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

name: str
required_columns: set[str]
class mlflow.genai.scorers.RelevanceToQuery(*, name: str = 'relevance_to_query', aggregations: Optional[list[str]] = None, required_columns: set[str] = {'inputs', 'outputs'})[source]

Bases: mlflow.genai.scorers.builtin_scorers.BuiltInScorer

Note

Experimental: This class may change or be removed in a future release without warning.

Relevance ensures that the agent’s response directly addresses the user’s input without deviating into unrelated topics.

You can invoke the scorer directly with a single input for testing, or pass it to mlflow.genai.evaluate for running full evaluation on a dataset.

Example (direct usage):

import mlflow
from mlflow.genai.scorers import RelevanceToQuery

assessment = RelevanceToQuery(name="my_relevance_to_query")(
    inputs={"question": "What is the capital of France?"},
    outputs="The capital of France is Paris.",
)
print(assessment)

Example (with evaluate):

import mlflow
from mlflow.genai.scorers import RelevanceToQuery

data = [
    {
        "inputs": {"question": "What is the capital of France?"},
        "outputs": "The capital of France is Paris.",
    }
]
result = mlflow.genai.evaluate(data=data, scorers=[RelevanceToQuery()])
model_config: ClassVar[ConfigDict] = {}

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

name: str
required_columns: set[str]
class mlflow.genai.scorers.RetrievalGroundedness(*, name: str = 'retrieval_groundedness', aggregations: Optional[list[str]] = None, required_columns: set[str] = {'inputs', 'trace'})[source]

Bases: mlflow.genai.scorers.builtin_scorers.BuiltInScorer

Note

Experimental: This class may change or be removed in a future release without warning.

RetrievalGroundedness assesses whether the agent’s response is aligned with the information provided in the retrieved context.

You can invoke the scorer directly with a single input for testing, or pass it to mlflow.genai.evaluate for running full evaluation on a dataset.

Example (direct usage):

import mlflow
from mlflow.genai.scorers import RetrievalGroundedness

trace = mlflow.get_trace("<your-trace-id>")
feedback = RetrievalGroundedness(name="my_retrieval_groundedness")(trace=trace)
print(feedback)

Example (with evaluate):

import mlflow

data = mlflow.search_traces(...)
result = mlflow.genai.evaluate(data=data, scorers=[RetrievalGroundedness()])
model_config: ClassVar[ConfigDict] = {}

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

name: str
required_columns: set[str]
class mlflow.genai.scorers.RetrievalRelevance(*, name: str = 'retrieval_relevance', aggregations: Optional[list[str]] = None, required_columns: set[str] = {'inputs', 'trace'})[source]

Bases: mlflow.genai.scorers.builtin_scorers.BuiltInScorer

Note

Experimental: This class may change or be removed in a future release without warning.

Retrieval relevance measures whether each chunk is relevant to the input request.

You can invoke the scorer directly with a single input for testing, or pass it to mlflow.genai.evaluate for running full evaluation on a dataset.

Example (direct usage):

import mlflow
from mlflow.genai.scorers import RetrievalRelevance

trace = mlflow.get_trace("<your-trace-id>")
feedbacks = RetrievalRelevance(name="my_retrieval_relevance")(trace=trace)
print(feedbacks)

Example (with evaluate):

import mlflow

data = mlflow.search_traces(...)
result = mlflow.genai.evaluate(data=data, scorers=[RetrievalRelevance()])
model_config: ClassVar[ConfigDict] = {}

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

name: str
required_columns: set[str]
class mlflow.genai.scorers.RetrievalSufficiency(*, name: str = 'retrieval_sufficiency', aggregations: Optional[list[str]] = None, required_columns: set[str] = {'inputs', 'trace'})[source]

Bases: mlflow.genai.scorers.builtin_scorers.BuiltInScorer

Note

Experimental: This class may change or be removed in a future release without warning.

Retrieval sufficiency evaluates whether the retrieved documents provide all necessary information to generate the expected response.

You can invoke the scorer directly with a single input for testing, or pass it to mlflow.genai.evaluate for running full evaluation on a dataset.

Example (direct usage):

import mlflow
from mlflow.genai.scorers import RetrievalSufficiency

trace = mlflow.get_trace("<your-trace-id>")
feedback = RetrievalSufficiency(name="my_retrieval_sufficiency")(trace=trace)
print(feedback)

Example (with evaluate):

import mlflow

data = mlflow.search_traces(...)
result = mlflow.genai.evaluate(data=data, scorers=[RetrievalSufficiency()])
model_config: ClassVar[ConfigDict] = {}

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

name: str
required_columns: set[str]
validate_columns(columns: set[str]) None[source]
class mlflow.genai.scorers.Safety(*, name: str = 'safety', aggregations: Optional[list[str]] = None, required_columns: set[str] = {'inputs', 'outputs'})[source]

Bases: mlflow.genai.scorers.builtin_scorers.BuiltInScorer

Note

Experimental: This class may change or be removed in a future release without warning.

Safety ensures that the agent’s responses do not contain harmful, offensive, or toxic content.

You can invoke the scorer directly with a single input for testing, or pass it to mlflow.genai.evaluate for running full evaluation on a dataset.

Example (direct usage):

import mlflow
from mlflow.genai.scorers import Safety

assessment = Safety(name="my_safety")(outputs="The capital of France is Paris.")
print(assessment)

Example (with evaluate):

import mlflow
from mlflow.genai.scorers import Safety

data = [
    {
        "inputs": {"question": "What is the capital of France?"},
        "outputs": "The capital of France is Paris.",
    }
]
result = mlflow.genai.evaluate(data=data, scorers=[Safety()])
model_config: ClassVar[ConfigDict] = {}

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

name: str
required_columns: set[str]
mlflow.genai.scorers.get_all_scorers() list[mlflow.genai.scorers.builtin_scorers.BuiltInScorer][source]

Note

Experimental: This function may change or be removed in a future release without warning.

Returns a list of all built-in scorers.

Example:

import mlflow
from mlflow.genai.scorers import get_all_scorers

data = [
    {
        "inputs": {"question": "What is the capital of France?"},
        "outputs": "The capital of France is Paris.",
        "expectations": [
            {"expected_response": "Paris is the capital city of France."},
        ],
    }
]
result = mlflow.genai.evaluate(data=data, scorers=get_all_scorers())
mlflow.genai.scorers.scorer(func=None, *, name: Optional[str] = None, aggregations: Optional[list[typing.Union[typing.Literal['min', 'max', 'mean', 'median', 'variance', 'p90', 'p99'], typing.Callable[[list[typing.Union[int, float]]], typing.Union[int, float]]]]] = None)[source]

Note

Experimental: This function may change or be removed in a future release without warning.

A decorator to define a custom scorer that can be used in mlflow.genai.evaluate().

The scorer function should take in a subset of the following parameters:

  • inputs – A single input to the target model/app. Derived from either the dataset or the trace.

    • When the dataset contains an inputs column, the value will be passed as is.

    • When traces are provided as the evaluation dataset, this will be derived from the inputs field of the trace (i.e. inputs captured as the root span of the trace).

  • outputs – A single output from the target model/app. Derived from the dataset, the trace, or the output of predict_fn.

    • When the dataset contains an outputs column, the value will be passed as is.

    • When predict_fn is provided, MLflow will make a prediction using the inputs and the predict_fn and pass the result as the outputs.

    • When traces are provided as the evaluation dataset, this will be derived from the response field of the trace (i.e. outputs captured as the root span of the trace).

  • expectations – Ground truth or any expectation for each prediction, e.g. expected retrieved docs. Derived from either the dataset or the trace.

    • When the dataset contains an expectations column, the value will be passed as is.

    • When traces are provided as the evaluation dataset, this will be a dictionary that contains a set of assessments in the format of [assessment name]: [assessment value].

  • trace – A trace object corresponding to the prediction for the row. Specified as a trace column in the dataset, or generated during the prediction.

The scorer function should return one of the following:

  • A boolean value

  • An integer value

  • A float value

  • A string value

  • A single Feedback object

  • A list of Feedback objects

Note

The metric name will be determined by the scorer function’s name or a custom name specified in the name parameter for the scorer.

Example

import json

import mlflow
from mlflow.genai.scorers import scorer
from mlflow.entities import AssessmentSource, Feedback


# Basic scorers that return primitive values
@scorer
def not_empty(outputs) -> bool:
    return outputs != ""


@scorer
def exact_match(outputs, expectations) -> bool:
    return outputs == expectations["expected_response"]


@scorer
def num_tool_calls(trace) -> int:
    spans = trace.search_spans(name="tool_call")
    return len(spans)


# Use `Feedback` object to return additional information about the scorer's
# result, such as a rationale for the score.
@scorer
def harmfulness(outputs) -> Feedback:
    import openai

    prompt = f'''
        Judge if the following text is harmful or not.

        Text:
        {outputs}

        Return the answer in a JSON object with the following format:
        {{
            "harmful": true
            "reason": "The text contains harmful content"
        }}

        Do not output any other characters than the json object.
    '''
    response = openai.OpenAI().chat.completions.create(
        model="o4-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    payload = json.loads(response.choices[0].message.content)
    return Feedback(
        value=payload["harmful"],
        rationale=payload["reason"],
        source=AssessmentSource(
            source_type="LLM_JUDGE",
            source_id="openai:/o4-mini",
        ),
    )


# Use the scorer in an evaluation
mlflow.genai.evaluate(
    data=data,
    scorers=[not_empty, exact_match, num_tool_calls, harmfulness],
)
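
The decorator also accepts a custom name and aggregations, as shown in the signature above. A minimal sketch; the metric name and function body are illustrative:

from mlflow.genai.scorers import scorer


# Register the metric under a custom name and aggregate per-row scores
# with the built-in "mean" and "max" aggregations.
@scorer(name="response_length", aggregations=["mean", "max"])
def length_of_response(outputs) -> int:
    return len(outputs)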
Databricks Agent Datasets Python SDK. For more details, see Databricks Agent Evaluation: <https://docs.databricks.com/en/generative-ai/agent-evaluation/index.html>

The API docs can be found here: <https://api-docs.databricks.com/python/databricks-agents/latest/databricks_agent_eval.html#datasets>

class mlflow.genai.datasets.EvaluationDataset(dataset: ManagedDataset)[source]

Bases: mlflow.data.dataset.Dataset, mlflow.data.pyfunc_dataset_mixin.PyFuncConvertibleDatasetMixin

A dataset for storing evaluation records (inputs and expectations).

Currently, this class is only supported for Databricks managed datasets. To use this class, you must have the databricks-agents package installed.

property create_time: Optional[str]

The time the dataset was created.

property created_by: Optional[str]

The user who created the dataset.

property dataset_id: str

The unique identifier of the dataset.

property digest: Optional[str]

String digest (hash) of the dataset provided by the caller that uniquely identifies the dataset.

property last_update_time: Optional[str]

The time the dataset was last updated.

property last_updated_by: Optional[str]

The user who last updated the dataset.

merge_records(records: Union[list[dict[str, typing.Any]], pd.DataFrame, pyspark.sql.DataFrame]) EvaluationDataset[source]

Merge records into the dataset.
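
Example

A minimal sketch of merging a list of dictionaries into an existing dataset; the UC table name and record fields are illustrative and assume a Databricks managed dataset.

from mlflow.genai.datasets import get_dataset

dataset = get_dataset(uc_table_name="main.eval.qa_dataset")
dataset = dataset.merge_records(
    [
        {
            "inputs": {"question": "What is the capital of France?"},
            "expectations": {"expected_response": "Paris is the capital of France."},
        }
    ]
)
print(dataset.to_df().head())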

property name: Optional[str]

The UC table name of the dataset.

property profile: Optional[str]

The profile of the dataset, e.g. summary statistics.

property schema: Optional[str]

The schema of the dataset.

set_profile(profile: str) mlflow.genai.datasets.evaluation_dataset.EvaluationDataset[source]

Set the profile of the dataset.

property source: Optional[str]

Source information for the dataset.

property source_type: Optional[str]

The type of the dataset source, e.g. “databricks-uc-table”, “DBFS”, “S3”, …

to_df() pd.DataFrame[source]

Convert the dataset to a pandas DataFrame.

to_evaluation_dataset(path=None, feature_names=None) mlflow.data.evaluation_dataset.EvaluationDataset[source]

Converts the dataset to the legacy EvaluationDataset for model evaluation. Required for use with mlflow.evaluate().

mlflow.genai.datasets.create_dataset(uc_table_name: str, experiment_id: Optional[Union[str, list[str]]] = None) mlflow.genai.datasets.evaluation_dataset.EvaluationDataset[source]

Create a dataset with the given name and associate it with the given experiment.

Parameters
  • uc_table_name – The UC table name of the dataset.

  • experiment_id – The ID of the experiment to associate the dataset with. If not provided, the current experiment is inferred from the environment.

mlflow.genai.datasets.delete_dataset(uc_table_name: str) None[source]

Delete the dataset with the given name.

Parameters

uc_table_name – The UC table name of the dataset.

mlflow.genai.datasets.get_dataset(uc_table_name: str) mlflow.genai.datasets.evaluation_dataset.EvaluationDataset[source]

Get the dataset with the given name.

Parameters

uc_table_name – The UC table name of the dataset.
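
Example

A minimal sketch of the dataset lifecycle; the UC table name and experiment path are illustrative and require the databricks-agents package.

import mlflow
from mlflow.genai.datasets import create_dataset, delete_dataset, get_dataset

mlflow.set_experiment("/Shared/agent-eval")

dataset = create_dataset(uc_table_name="main.eval.qa_dataset")
same_dataset = get_dataset(uc_table_name="main.eval.qa_dataset")
print(same_dataset.dataset_id)

delete_dataset(uc_table_name="main.eval.qa_dataset")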

Databricks Agent Label Schemas Python SDK. For more details see Databricks Agent Evaluation: <https://docs.databricks.com/en/generative-ai/agent-evaluation/index.html>

The API docs can be found here: <https://api-docs.databricks.com/python/databricks-agents/latest/databricks_agent_eval.html#review-app>

class mlflow.genai.label_schemas.InputCategorical(options: list[str])[source]

Bases: mlflow.genai.label_schemas.label_schemas.InputType

A single-select dropdown for collecting assessments from stakeholders.

Note

This functionality is only available in Databricks. Please run pip install mlflow[databricks] to use it.

options: list[str]

List of available options for the categorical selection.

class mlflow.genai.label_schemas.InputCategoricalList(options: list[str])[source]

Bases: mlflow.genai.label_schemas.label_schemas.InputType

A multi-select dropdown for collecting assessments from stakeholders.

Note

This functionality is only available in Databricks. Please run pip install mlflow[databricks] to use it.

options: list[str]

List of available options for the multi-select categorical (dropdown).

class mlflow.genai.label_schemas.InputNumeric(min_value: Optional[float] = None, max_value: Optional[float] = None)[source]

Bases: mlflow.genai.label_schemas.label_schemas.InputType

A numeric input for collecting assessments from stakeholders.

Note

This functionality is only available in Databricks. Please run pip install mlflow[databricks] to use it.

max_value: Optional[float] = None

Maximum allowed numeric value. None means no maximum limit.

min_value: Optional[float] = None

Minimum allowed numeric value. None means no minimum limit.

class mlflow.genai.label_schemas.InputText(max_length: Optional[int] = None)[source]

Bases: mlflow.genai.label_schemas.label_schemas.InputType

A free-form text box for collecting assessments from stakeholders.

Note

This functionality is only available in Databricks. Please run pip install mlflow[databricks] to use it.

max_length: Optional[int] = None

Maximum character length for the text input. None means no limit.

class mlflow.genai.label_schemas.InputTextList(max_length_each: Optional[int] = None, max_count: Optional[int] = None)[source]

Bases: mlflow.genai.label_schemas.label_schemas.InputType

Like InputText, but allows multiple text entries.

Note

This functionality is only available in Databricks. Please run pip install mlflow[databricks] to use it.

max_count: Optional[int] = None

Maximum number of text entries allowed. None means no limit.

max_length_each: Optional[int] = None

Maximum character length for each individual text entry. None means no limit.

class mlflow.genai.label_schemas.LabelSchema(name: str, type: mlflow.genai.label_schemas.label_schemas.LabelSchemaType, title: str, input: Union[mlflow.genai.label_schemas.label_schemas.InputCategorical, mlflow.genai.label_schemas.label_schemas.InputCategoricalList, mlflow.genai.label_schemas.label_schemas.InputText, mlflow.genai.label_schemas.label_schemas.InputTextList, mlflow.genai.label_schemas.label_schemas.InputNumeric], instruction: Optional[str] = None, enable_comment: bool = False)[source]

Bases: object

A label schema for collecting input from stakeholders.

Note

This functionality is only available in Databricks. Please run pip install mlflow[databricks] to use it.

enable_comment: bool = False

Whether to enable additional comment functionality for reviewers.

input: Union[mlflow.genai.label_schemas.label_schemas.InputCategorical, mlflow.genai.label_schemas.label_schemas.InputCategoricalList, mlflow.genai.label_schemas.label_schemas.InputText, mlflow.genai.label_schemas.label_schemas.InputTextList, mlflow.genai.label_schemas.label_schemas.InputNumeric]

Input type specification that defines how stakeholders will provide their assessment (e.g., dropdown, text box, numeric input)

instruction: Optional[str] = None

Optional detailed instructions shown to stakeholders for guidance.

name: str

Unique name identifier for the label schema.

title: str

Display title shown to stakeholders in the labeling review UI.

type: mlflow.genai.label_schemas.label_schemas.LabelSchemaType

Type of the label schema, either ‘feedback’ or ‘expectation’.

class mlflow.genai.label_schemas.LabelSchemaType(value)[source]

Bases: mlflow.genai.utils.enum_utils.StrEnum

Type of label schema.

EXPECTATION = 'expectation'
FEEDBACK = 'feedback'
mlflow.genai.label_schemas.create_label_schema(name: str, *, type: Literal['feedback', 'expectation'], title: str, input: Union[mlflow.genai.label_schemas.label_schemas.InputCategorical, mlflow.genai.label_schemas.label_schemas.InputCategoricalList, mlflow.genai.label_schemas.label_schemas.InputText, mlflow.genai.label_schemas.label_schemas.InputTextList, mlflow.genai.label_schemas.label_schemas.InputNumeric], instruction: Optional[str] = None, enable_comment: bool = False, overwrite: bool = False) mlflow.genai.label_schemas.label_schemas.LabelSchema[source]

Create a new label schema for the review app.

A label schema defines the type of input that stakeholders will provide when labeling items in the review app.

Note

This functionality is only available in Databricks. Please run pip install mlflow[databricks] to use it.

Parameters
  • name – The name of the label schema. Must be unique across the review app.

  • type – The type of the label schema. Either “feedback” or “expectation”.

  • title – The title of the label schema shown to stakeholders.

  • input – The input type of the label schema.

  • instruction – Optional. The instruction shown to stakeholders.

  • enable_comment – Optional. Whether to enable comments for the label schema.

  • overwrite – Optional. Whether to overwrite the existing label schema with the same name.

Returns

The created label schema.

Return type

LabelSchema
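
Example

A minimal sketch of creating a feedback schema with a single-select dropdown; the schema name, title, and options are illustrative.

from mlflow.genai.label_schemas import InputCategorical, create_label_schema

quality = create_label_schema(
    name="response_quality",
    type="feedback",
    title="How good is the response?",
    input=InputCategorical(options=["Poor", "Fair", "Good", "Excellent"]),
    instruction="Rate the overall quality of the assistant's response.",
    enable_comment=True,
)
print(quality.name)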

mlflow.genai.label_schemas.delete_label_schema(name: str) mlflow.genai.labeling.labeling.ReviewApp[source]

Delete a label schema from the review app.

Note

This functionality is only available in Databricks. Please run pip install mlflow[databricks] to use it.

Parameters

name – The name of the label schema to delete.

Returns

The review app.

Return type

ReviewApp

mlflow.genai.label_schemas.get_label_schema(name: str) mlflow.genai.label_schemas.label_schemas.LabelSchema[source]

Get a label schema from the review app.

Note

This functionality is only available in Databricks. Please run pip install mlflow[databricks] to use it.

Parameters

name – The name of the label schema to get.

Returns

The label schema.

Return type

LabelSchema
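
Example

A minimal sketch that retrieves and then deletes the illustrative schema created above.

from mlflow.genai.label_schemas import delete_label_schema, get_label_schema

schema = get_label_schema(name="response_quality")
print(schema.title)

delete_label_schema(name="response_quality")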

class mlflow.genai.optimize.LLMParams(model_name: str, base_uri: Optional[str] = None, temperature: Optional[float] = None)[source]

Bases: object

Note

Experimental: This class may change or be removed in a future release without warning.

Parameters for configuring an LLM model.

Parameters
  • model_name – Name of the model in the format <provider>/<model name>. For example, “openai/gpt-4” or “anthropic/claude-4”.

  • base_uri – Optional base URI for the API endpoint. If not provided, the default endpoint for the provider will be used.

  • temperature – Optional sampling temperature for the model’s outputs. Higher values (e.g., 0.8) make the output more random, while lower values (e.g., 0.2) make it more deterministic.

base_uri: Optional[str] = None
model_name: str
temperature: Optional[float] = None
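
Example

A minimal construction sketch; the model name, endpoint URI, and temperature are illustrative.

from mlflow.genai.optimize import LLMParams

target = LLMParams(
    model_name="openai/gpt-4o-mini",
    base_uri="https://my-gateway.example.com/v1",  # optional; defaults to the provider endpoint
    temperature=0.2,  # optional
)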
class mlflow.genai.optimize.OptimizerConfig(num_instruction_candidates: int = 6, max_few_show_examples: int = 6, num_threads: int = <factory>, optimizer_llm: Optional[mlflow.genai.optimize.types.LLMParams] = None, algorithm: str = 'DSPy/MIPROv2', verbose: bool = False, autolog: bool = False)[source]

Bases: object

Note

Experimental: This class may change or be removed in a future release without warning.

Configuration for prompt optimization.

Parameters
  • num_instruction_candidates – Number of candidate instructions to generate during each optimization iteration. Higher values may lead to better results but increase optimization time. Default: 6

  • max_few_show_examples – Maximum number of examples to show in few-shot demonstrations. Default: 6

  • num_threads – Number of threads to use for parallel optimization. Default: (number of CPU cores * 2 + 1)

  • optimizer_llm – Optional LLM parameters for the teacher model. If not provided, the target LLM will be used as the teacher.

  • algorithm – The optimization algorithm to use. Default: “DSPy/MIPROv2”

  • verbose – Whether to show optimizer logs during optimization. Default: False

  • autolog – Whether to log the optimization parameters, datasets and metrics. If set to True, an MLflow run is automatically created to store them. Default: False

algorithm: str = 'DSPy/MIPROv2'
autolog: bool = False
max_few_show_examples: int = 6
num_instruction_candidates: int = 6
num_threads: int
optimizer_llm: Optional[mlflow.genai.optimize.types.LLMParams] = None
verbose: bool = False
class mlflow.genai.optimize.PromptOptimizationResult(prompt: Prompt)[source]

Bases: object

Note

Experimental: This class may change or be removed in a future release without warning.

Result of the mlflow.genai.optimize_prompt() API.

Parameters

prompt – A prompt entity containing the optimized template.

prompt: Prompt
mlflow.genai.optimize.optimize_prompt(*, target_llm_params: mlflow.genai.optimize.types.LLMParams, prompt: Union[str, PromptVersion], train_data: EvaluationDatasetTypes, scorers: list[Scorer], objective: Optional[Callable[[dict[str, typing.Union[bool, float, str, Feedback, list[Feedback]]]], float]] = None, eval_data: Optional[EvaluationDatasetTypes] = None, optimizer_config: Optional[mlflow.genai.optimize.types.OptimizerConfig] = None) mlflow.genai.optimize.types.PromptOptimizationResult[source]

Note

Experimental: This function may change or be removed in a future release without warning.

Optimize an LLM prompt using the given dataset and evaluation metrics. The optimized prompt template is automatically registered as a new version of the original prompt and included in the result. Currently, this API only supports DSPy’s MIPROv2 optimizer.

Parameters
  • target_llm_params – Parameters for the LLM that the prompt is optimized for. The model name must be specified in the format <provider>/<model>.

  • prompt – The URI or Prompt object of the MLflow prompt to optimize. The optimized prompt is registered as a new version of the prompt.

  • train_data

    Training dataset used for optimization. The data must be one of the following formats:

    • An EvaluationDataset entity

    • Pandas DataFrame

    • Spark DataFrame

    • List of dictionaries

    The dataset must include the following columns:

    • inputs: A column containing single inputs in dict format. Each input should contain keys matching the variables in the prompt template.

    • expectations: A column containing a dictionary of ground truths for individual output fields.

  • scorers – List of scorers that evaluate the inputs, outputs, and expectations. Note: trace input is not supported for optimization; use inputs, outputs, and expectations instead. Also, pass the objective argument when using scorers that return string or Feedback values.

  • objective – A callable that computes the overall performance metric from individual assessments. Takes a dict mapping assessment names to assessment scores and returns a float value (greater is better). A minimal sketch of such a callable appears after the example below.

  • eval_data – Evaluation dataset with the same format as train_data. If not provided, train_data will be automatically split into training and evaluation sets.

  • optimizer_config – Configuration parameters for the optimizer.

Returns

The optimization result including the optimized prompt.

Return type

PromptOptimizationResult

Example

import os
import mlflow
from typing import Any
from mlflow.genai.scorers import scorer
from mlflow.genai.optimize import OptimizerConfig, LLMParams

os.environ["OPENAI_API_KEY"] = "YOUR_API_KEY"


@scorer
def exact_match(expectations: dict[str, Any], outputs: dict[str, Any]) -> bool:
    return expectations == outputs


prompt = mlflow.genai.register_prompt(
    name="qa",
    template="Answer the following question: {{question}}",
)

result = mlflow.genai.optimize_prompt(
    target_llm_params=LLMParams(model_name="openai/gpt-4.1-nano"),
    train_data=[
        {"inputs": {"question": f"{i}+1"}, "expectations": {"answer": f"{i + 1}"}}
        for i in range(100)
    ],
    scorers=[exact_match],
    prompt=prompt.uri,
    optimizer_config=OptimizerConfig(num_instruction_candidates=5),
)

print(result.prompt.template)
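
As referenced in the objective parameter above, the callable maps per-assessment scores to a single float. A minimal sketch; the conversion rule below (booleans, numbers, "yes"/"no" strings, or Feedback values averaged into one score) is illustrative:

from mlflow.entities import Feedback


def combined_score(scores: dict) -> float:
    # Convert each assessment to a float, then average; greater is better.
    def to_float(value) -> float:
        if isinstance(value, Feedback):
            value = value.value
        if isinstance(value, str):
            return 1.0 if value.lower() == "yes" else 0.0
        return float(value)

    return sum(to_float(v) for v in scores.values()) / len(scores)


# Passed as: mlflow.genai.optimize_prompt(..., objective=combined_score)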
class mlflow.genai.judges.CategoricalRating(value)[source]

Bases: mlflow.genai.utils.enum_utils.StrEnum

A categorical rating for an assessment.

Example

from mlflow.genai.judges import CategoricalRating
from mlflow.entities import Feedback

# Create feedback with categorical rating
feedback = Feedback(
    name="my_metric", value=CategoricalRating.YES, rationale="The metric is passing."
)
NO = 'no'
UNKNOWN = 'unknown'
YES = 'yes'
mlflow.genai.judges.custom_prompt_judge(*, name: str, prompt_template: str, numeric_values: Optional[dict[str, typing.Union[int, float]]] = None) Callable[[...], Feedback][source]

Note

Experimental: This function may change or be removed in a future release without warning.

Create a custom prompt judge that evaluates inputs using a template.

Example prompt template:

You will look at the response and determine the formality of the response.

<request>{{request}}</request>
<response>{{response}}</response>

You must choose one of the following categories.

[[formal]]: The response is very formal.
[[semi_formal]]: The response is somewhat formal. The response is somewhat formal if the
response mentions friendship, etc.
[[not_formal]]: The response is not formal.

Variable names in the template should be enclosed in double curly braces, e.g., {{request}}, {{response}}. They should be alphanumeric and can include underscores, but should not contain spaces or special characters.

It is required for the prompt template to request choices as outputs, with each choice enclosed in square brackets. Choice names should be alphanumeric and can include underscores and spaces.

Parameters
  • name – Name of the judge, used as the name of the returned mlflow.entities.Feedback object.

  • prompt_template – Template string with {{var_name}} placeholders for variable substitution. The template should ask the judge to output one of the defined choices.

  • numeric_values – Optional mapping from categorical values to numeric scores. Useful if you want to create a custom judge that returns continuous valued outputs. Defaults to None.

Returns

A callable that takes keyword arguments mapping to the template variables and returns an mlflow.entities.Feedback object.
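
Example

A minimal sketch that reuses the formality template above; the numeric mapping and the request/response values are illustrative.

from mlflow.genai.judges import custom_prompt_judge

formality_template = """
You will look at the response and determine the formality of the response.

<request>{{request}}</request>
<response>{{response}}</response>

You must choose one of the following categories.

[[formal]]: The response is very formal.
[[semi_formal]]: The response is somewhat formal.
[[not_formal]]: The response is not formal.
"""

formality_judge = custom_prompt_judge(
    name="formality",
    prompt_template=formality_template,
    numeric_values={"formal": 1.0, "semi_formal": 0.5, "not_formal": 0.0},
)

feedback = formality_judge(
    request="Please summarize the quarterly results.",
    response="Certainly. Revenue grew 12% quarter over quarter.",
)
print(feedback.value)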

mlflow.genai.judges.is_context_relevant(*, request: str, context: Any, name: Optional[str] = None) Feedback[source]

LLM judge determines whether the given context is relevant to the input request.

Parameters
  • request – Input to the application to evaluate, user’s question or query.

  • context – Context to evaluate the relevance to the request. Supports any JSON-serializable object.

  • name – Optional name for overriding the default name of the returned feedback.

Returns

A mlflow.entities.assessment.Feedback object with a “yes” or “no” value indicating whether the context is relevant to the request.

Example

The following example shows how to evaluate whether a document retrieved by a retriever is relevant to the user’s question.

from mlflow.genai.judges import is_context_relevant

feedback = is_context_relevant(
    request="What is the capital of France?",
    context="Paris is the capital of France.",
)
print(feedback.value)  # "yes"

feedback = is_context_relevant(
    request="What is the capital of France?",
    context="Paris is known for its Eiffel Tower.",
)
print(feedback.value)  # "no"
mlflow.genai.judges.is_context_sufficient(*, request: str, context: Any, expected_facts: list[str], expected_response: Optional[str] = None, name: Optional[str] = None) Feedback[source]

LLM judge determines whether the given context is sufficient to answer the input request.

Parameters
  • request – Input to the application to evaluate, user’s question or query.

  • context – Context to evaluate the sufficiency of. Supports any JSON-serializable object.

  • expected_facts – A list of expected facts that should be present in the context.

  • expected_response – The expected response from the application. Optional.

  • name – Optional name for overriding the default name of the returned feedback.

Returns

A mlflow.entities.assessment.Feedback object with a “yes” or “no” value indicating whether the context is sufficient to answer the request.

Example

The following example shows how to evaluate whether the documents returned by a retriever gives sufficient context to answer the user’s question.

from mlflow.genai.judges import is_context_sufficient

feedback = is_context_sufficient(
    request="What is the capital of France?",
    context=[
        {"content": "Paris is the capital of France."},
        {"content": "Paris is known for its Eiffel Tower."},
    ],
    expected_facts=["Paris is the capital of France."],
)
print(feedback.value)  # "yes"
mlflow.genai.judges.is_correct(*, request: str, response: str, expected_facts: list[str], expected_response: Optional[str] = None, name: Optional[str] = None) Feedback[source]

LLM judge determines whether the given response is correct for the input request.

Parameters
  • request – Input to the application to evaluate, user’s question or query.

  • response – The response from the application to evaluate.

  • expected_facts – A list of expected facts that should be present in the response.

  • expected_response – The expected response from the application. Optional.

  • name – Optional name for overriding the default name of the returned feedback.

Returns

A mlflow.entities.assessment.Feedback object with a “yes” or “no” value indicating whether the response is correct for the request.
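
Example

The following example shows how to evaluate whether a response is correct; the question and expected facts are illustrative.

from mlflow.genai.judges import is_correct

feedback = is_correct(
    request="What is the capital of France?",
    response="The capital of France is Paris.",
    expected_facts=["Paris is the capital of France."],
)
print(feedback.value)  # "yes" (illustrative)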

mlflow.genai.judges.is_grounded(*, request: str, response: str, context: Any, name: Optional[str] = None) Feedback[source]

LLM judge determines whether the given response is grounded in the given context.

Parameters
  • request – Input to the application to evaluate, user’s question or query.

  • response – The response from the application to evaluate.

  • context – Context to evaluate the response against. Supports any JSON-serializable object.

  • name – Optional name for overriding the default name of the returned feedback.

Returns

A mlflow.entities.assessment.Feedback object with a “yes” or “no” value indicating whether the response is grounded in the context.

Example

The following example shows how to evaluate whether the response is grounded in the context.

from mlflow.genai.judges import is_grounded

feedback = is_grounded(
    request="What is the capital of France?",
    response="Paris",
    context=[
        {"content": "Paris is the capital of France."},
        {"content": "Paris is known for its Eiffel Tower."},
    ],
)
print(feedback.value)  # "yes"
mlflow.genai.judges.is_safe(*, content: str, name: Optional[str] = None) Feedback[source]

LLM judge determines whether the given response is safe.

Parameters
  • content – Text content to evaluate for safety.

  • name – Optional name for overriding the default name of the returned feedback.

Returns

A mlflow.entities.assessment.Feedback object with a “yes” or “no” value indicating whether the response is safe.

Example

from mlflow.genai.judges import is_safe

feedback = is_safe(content="I am a happy person.")
print(feedback.value)  # "yes"
mlflow.genai.judges.meets_guidelines(*, guidelines: Union[str, list[str]], context: dict[str, typing.Any], name: Optional[str] = None) Feedback[source]

LLM judge determines whether the given response meets the given guideline(s).

Parameters
  • guidelines – A single guideline or a list of guidelines.

  • context – Mapping of context to be evaluated against the guidelines. For example, pass {“response”: “<response text>”} to evaluate whether the response meets the given guidelines.

  • name – Optional name for overriding the default name of the returned feedback.

Returns

A mlflow.entities.assessment.Feedback object with a “yes” or “no” value indicating whether the response meets the guideline(s).

Example

The following example shows how to evaluate whether the response meets the given guideline(s).

from mlflow.genai.judges import meets_guidelines

feedback = meets_guidelines(
    guidelines="Be polite and respectful.",
    context={"response": "Hello, how are you?"},
)
print(feedback.value)  # "yes"

feedback = meets_guidelines(
    guidelines=["Be polite and respectful.", "Must be in English."],
    context={"response": "Hola, ¿cómo estás?"},
)
print(feedback.value)  # "no"