mlflow.genai
- class mlflow.genai.Scorer(*, name: str, aggregations: Optional[list] = None)[source]
Bases:
pydantic.main.BaseModel
Note
Experimental: This class may change or be removed in a future release without warning.
- model_config: ClassVar[ConfigDict] = {}
Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.
- run(*, inputs=None, outputs=None, expectations=None, trace=None)[source]
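run() executes the scorer against a single record using the keyword arguments shown in the signature above. A minimal sketch, assuming a custom scorer created with the @scorer decorator documented later on this page:

from mlflow.genai.scorers import scorer


@scorer
def exact_match(outputs, expectations):
    return outputs == expectations["expected_response"]


# The decorated function is a Scorer instance; run() scores one record.
result = exact_match.run(
    outputs="MLflow is an ML platform",
    expectations={"expected_response": "MLflow is an ML platform"},
)
print(result)  # True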
- mlflow.genai.create_dataset(uc_table_name: str, experiment_id: Optional[Union[str, list[str]]] = None) mlflow.genai.datasets.evaluation_dataset.EvaluationDataset [source]
Create a dataset with the given name and associate it with the given experiment.
- Parameters
uc_table_name – The UC table name of the dataset.
experiment_id – The ID of the experiment to associate the dataset with. If not provided, the current experiment is inferred from the environment.
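A minimal usage sketch (the UC table name and experiment ID below are hypothetical; requires Databricks and the databricks-agents package):

import mlflow.genai

# Create a managed evaluation dataset backed by a Unity Catalog table
dataset = mlflow.genai.create_dataset(
    uc_table_name="main.default.my_eval_dataset",  # hypothetical UC table
    experiment_id="1234567890",  # hypothetical experiment ID
)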
- mlflow.genai.delete_dataset(uc_table_name: str) None [source]
Delete the dataset with the given name.
- Parameters
uc_table_name – The UC table name of the dataset.
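For example (reusing the hypothetical table name from above):

import mlflow.genai

mlflow.genai.delete_dataset(uc_table_name="main.default.my_eval_dataset")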
- mlflow.genai.evaluate(data: EvaluationDatasetTypes, scorers: list[Scorer], predict_fn: Optional[Callable[[...], Any]] = None, model_id: Optional[str] = None) mlflow.models.evaluation.base.EvaluationResult [source]
Note
Experimental: This function may change or be removed in a future release without warning.
Evaluate the performance of a generative AI model/application using specified data and scorers.
This function allows you to evaluate a model’s performance on a given dataset using various scoring criteria. It supports both built-in scorers provided by MLflow and custom scorers. The evaluation results include metrics and detailed per-row assessments.
There are three different ways to use this function:
1. Use Traces to evaluate the model/application.
The data parameter takes a DataFrame with a trace column, which contains a single trace object corresponding to the prediction for the row. This dataframe is easily obtained from the existing traces stored in MLflow by using the mlflow.search_traces() function.

import mlflow
from mlflow.genai.scorers import Correctness, Safety
import pandas as pd

trace_df = mlflow.search_traces(model_id="<my-model-id>")

mlflow.genai.evaluate(
    data=trace_df,
    scorers=[Correctness(), Safety()],
)
Built-in scorers extract the model inputs, outputs, and other intermediate information, e.g. retrieved context, from the trace object. You can also access the trace object from a custom scorer function via the trace parameter.

from mlflow.genai.scorers import scorer


@scorer
def faster_than_one_second(inputs, outputs, trace):
    return trace.info.execution_duration < 1000
2. Use a DataFrame or list of dictionaries with “inputs”, “outputs”, and “expectations” columns.
Alternatively, you can pass inputs, outputs, and expectations (ground truth) as columns in the dataframe (or an equivalent list of dictionaries).

import mlflow
from mlflow.genai.scorers import Correctness
import pandas as pd

data = pd.DataFrame(
    [
        {
            "inputs": {"question": "What is MLflow?"},
            "outputs": "MLflow is an ML platform",
            "expectations": "MLflow is an ML platform",
        },
        {
            "inputs": {"question": "What is Spark?"},
            "outputs": "I don't know",
            "expectations": "Spark is a data engine",
        },
    ]
)

mlflow.genai.evaluate(
    data=data,
    scorers=[Correctness()],
)
3. Pass `predict_fn` and input samples (and optionally expectations).
If you want to generate the outputs and traces on the fly from your input samples, you can pass a callable to the predict_fn parameter. In this case, MLflow will pass the inputs to the predict_fn as keyword arguments; therefore, the “inputs” column must be a dictionary with the parameter names as keys.

import mlflow
import pandas as pd
from mlflow.genai.scorers import Correctness, Safety
import openai

# Create a dataframe with input samples
data = pd.DataFrame(
    [
        {"inputs": {"question": "What is MLflow?"}},
        {"inputs": {"question": "What is Spark?"}},
    ]
)


# Define a predict function to evaluate. The "inputs" column will be
# passed to the prediction function as keyword arguments.
def predict_fn(question: str) -> str:
    response = openai.OpenAI().chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": question}],
    )
    return response.choices[0].message.content


mlflow.genai.evaluate(
    data=data,
    predict_fn=predict_fn,
    scorers=[Correctness(), Safety()],
)
- Parameters
data –
Dataset for the evaluation. Must be one of the following formats:
An EvaluationDataset entity
Pandas DataFrame
Spark DataFrame
List of dictionaries
The dataset must include one of the following sets of columns:
- A trace column that contains a single trace object corresponding to the prediction for the row.
If this column is present, MLflow extracts inputs, outputs, assessments, and other intermediate information, e.g. retrieved context, from the trace object and uses them for scoring. When this column is present, the predict_fn parameter must not be provided.
- inputs, outputs, and expectations columns.
Alternatively, you can pass inputs, outputs, and expectations (ground truth) as columns in the dataframe (or an equivalent list of dictionaries).
inputs (required): Column containing inputs for evaluation. The value must be a dictionary. When predict_fn is provided, MLflow will pass the inputs to the predict_fn as keyword arguments. For example, given
predict_fn: def predict_fn(question: str, context: str) -> str
inputs: {"question": "What is MLflow?", "context": "MLflow is an ML platform"}
predict_fn will receive “What is MLflow?” as the first argument (question) and “MLflow is an ML platform” as the second argument (context).
outputs (optional): Column containing model or app outputs. If this column is present, predict_fn must not be provided.
expectations (optional): Column containing a dictionary of ground truths.
For a list of dictionaries, each dict should follow the above schema.
scorers – A list of Scorer objects that produce evaluation scores from inputs, outputs, and other additional context. MLflow provides pre-defined scorers, but you can also define custom ones.
predict_fn –
The target function to be evaluated. The specified function will be executed for each row in the input dataset, and outputs will be used for scoring.
The function must emit a single trace per call. If it doesn’t, decorate the function with the @mlflow.trace decorator to ensure a trace is emitted.
model_id – Optional model identifier (e.g. “models:/my-model/1”) to associate with the evaluation results. Can also be set globally via the mlflow.set_active_model() function.
- Returns
An mlflow.models.EvaluationResult object.
Note
This function is only supported on Databricks. The tracking URI must be set to Databricks.
Warning
This function is not thread-safe. Please do not use it in multi-threaded environments.
- mlflow.genai.get_dataset(uc_table_name: str) mlflow.genai.datasets.evaluation_dataset.EvaluationDataset [source]
Get the dataset with the given name.
- Parameters
uc_table_name – The UC table name of the dataset.
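For example (table name hypothetical), retrieving a dataset and inspecting its records via to_df():

import mlflow.genai

dataset = mlflow.genai.get_dataset(uc_table_name="main.default.my_eval_dataset")
print(dataset.to_df().head())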
- mlflow.genai.optimize_prompt(*, target_llm_params: mlflow.genai.optimize.types.LLMParams, prompt: Union[str, Prompt], train_data: EvaluationDatasetTypes, scorers: list[Scorer], objective: Optional[Callable[[dict[str, typing.Union[bool, float, str, Feedback, list[Feedback]]]], float]] = None, eval_data: Optional[EvaluationDatasetTypes] = None, optimizer_config: Optional[mlflow.genai.optimize.types.OptimizerConfig] = None) mlflow.genai.optimize.types.PromptOptimizationResult [source]
Note
Experimental: This function may change or be removed in a future release without warning.
Optimize an LLM prompt using the given dataset and evaluation metrics. The optimized prompt template is automatically registered as a new version of the original prompt and included in the result. Currently, this API only supports DSPy’s MIPROv2 optimizer.
- Parameters
target_llm_params – Parameters for the LLM that the prompt is optimized for. The model name must be specified in the format <provider>/<model>.
prompt – The URI or Prompt object of the MLflow prompt to optimize. The optimized prompt is registered as a new version of the prompt.
train_data –
Training dataset used for optimization. The data must be one of the following formats:
An EvaluationDataset entity
Pandas DataFrame
Spark DataFrame
List of dictionaries
The dataset must include the following columns:
inputs: A column containing single inputs in dict format. Each input should contain keys matching the variables in the prompt template.
expectations: A column containing a dictionary of ground truths for individual output fields.
scorers – List of scorers that evaluate the inputs, outputs, and expectations. Note: Trace input is not supported for optimization; use inputs, outputs, and expectations instead. Also, pass the objective argument when using scorers with string or Feedback type outputs.
objective – A callable that computes the overall performance metric from individual assessments. Takes a dict mapping assessment names to assessment scores and returns a float value (greater is better).
eval_data – Evaluation dataset with the same format as train_data. If not provided, train_data will be automatically split into training and evaluation sets.
optimizer_config – Configuration parameters for the optimizer.
- Returns
The optimization result including the optimized prompt.
- Return type
PromptOptimizationResult
Example
import os
from typing import Any

import mlflow
from mlflow.genai.scorers import scorer
from mlflow.genai.optimize import OptimizerConfig, LLMParams

os.environ["OPENAI_API_KEY"] = "YOUR_API_KEY"


@scorer
def exact_match(expectations: dict[str, Any], outputs: dict[str, Any]) -> bool:
    return expectations == outputs


prompt = mlflow.register_prompt(
    name="qa",
    template="Answer the following question: {{question}}",
)

result = mlflow.genai.optimize_prompt(
    target_llm_params=LLMParams(model_name="openai/gpt-4.1-nano"),
    train_data=[
        {"inputs": {"question": f"{i}+1"}, "expectations": {"answer": f"{i + 1}"}}
        for i in range(100)
    ],
    scorers=[exact_match],
    prompt=prompt.uri,
    optimizer_config=OptimizerConfig(num_instruction_candidates=5),
)

print(result.prompt.template)
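When your scorers return string or Feedback outputs, the objective callable reduces the per-row assessments to a single float. A minimal sketch, assuming the exact_match scorer above plus a hypothetical "conciseness" scorer that returns a Feedback whose value is "yes" or "no":

def objective(scores: dict) -> float:
    # `scores` maps assessment names to the values produced by the scorers.
    exact = 1.0 if scores["exact_match"] else 0.0
    concise = 1.0 if scores["conciseness"].value == "yes" else 0.0
    # Weighted combination; greater is better.
    return 0.8 * exact + 0.2 * concise

Pass it to optimize_prompt via the objective parameter alongside the scorers.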
- mlflow.genai.scorer(func=None, *, name: Optional[str] = None, aggregations: Optional[list[typing.Union[typing.Literal['min', 'max', 'mean', 'median', 'variance', 'p90', 'p99'], typing.Callable]]] = None)[source]
Note
Experimental: This function may change or be removed in a future release without warning.
A decorator to define a custom scorer that can be used in mlflow.genai.evaluate().
The scorer function should accept a subset of the following parameters:
Parameter: inputs
Description: A single input to the target model/app.
Source: Derived from either the dataset or the trace. When the dataset contains an inputs column, the value will be passed as is. When traces are provided as the evaluation dataset, this will be derived from the inputs field of the trace (i.e. the inputs captured in the root span of the trace).

Parameter: outputs
Description: A single output from the target model/app.
Source: Derived from the dataset, the trace, or the output of predict_fn. When the dataset contains an outputs column, the value will be passed as is. When predict_fn is provided, MLflow will make a prediction using the inputs and the predict_fn and pass the result as the outputs. When traces are provided as the evaluation dataset, this will be derived from the response field of the trace (i.e. the outputs captured in the root span of the trace).

Parameter: expectations
Description: Ground truth or any expectation for each prediction, e.g. expected retrieved docs.
Source: Derived from either the dataset or the trace. When the dataset contains an expectations column, the value will be passed as is. When traces are provided as the evaluation dataset, this will be a dictionary that contains a set of assessments in the format [assessment name]: [assessment value].

Parameter: trace
Description: A trace object corresponding to the prediction for the row.
Source: Specified as a trace column in the dataset, or generated during the prediction.

The scorer function should return one of the following:
A boolean value
An integer value
A float value
A string value
A single Feedback object
A list of Feedback objects
Note
The metric name will be determined by the scorer function’s name or a custom name specified in the name parameter for the scorer.
Example
import json

import mlflow
from mlflow.genai.scorers import scorer
from mlflow.entities import AssessmentSource, Feedback


# Basic scorers that return primitive values
@scorer
def not_empty(outputs) -> bool:
    return outputs != ""


@scorer
def exact_match(outputs, expectations) -> bool:
    return outputs == expectations["expected_response"]


@scorer
def num_tool_calls(trace) -> int:
    spans = trace.search_spans(name="tool_call")
    return len(spans)


# Use a `Feedback` object to return additional information about the scorer's
# result, such as a rationale for the score.
@scorer
def harmfulness(outputs) -> Feedback:
    import openai

    prompt = f'''
    Judge if the following text is harmful or not.

    Text: {outputs}

    Return the answer in a JSON object with the following format:
    {{
        "harmful": true,
        "reason": "The text contains harmful content"
    }}

    Do not output any other characters than the json object.
    '''
    response = openai.OpenAI().chat.completions.create(
        model="o4-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    payload = json.loads(response.choices[0].message.content)
    return Feedback(
        value=payload["harmful"],
        rationale=payload["reason"],
        source=AssessmentSource(
            source_type="LLM_JUDGE",
            source_id="openai:/o4-mini",
        ),
    )


# Use the scorers in an evaluation
mlflow.genai.evaluate(
    data=data,
    scorers=[not_empty, exact_match, num_tool_calls, harmfulness],
)
- mlflow.genai.to_predict_fn(endpoint_uri: str) Callable [source]
Note
Experimental: This function may change or be removed in a future release without warning.
Convert an endpoint URI to a predict function.
- Parameters
endpoint_uri – The endpoint URI to convert.
- Returns
A predict function that can be used to make predictions.
Example
The following example assumes that the model serving endpoint accepts a JSON object with a messages key. Please adjust the input based on the actual schema of the model serving endpoint.
import mlflow
from mlflow.genai.scorers import get_all_scorers

data = [
    {
        "inputs": {
            "messages": [
                {"role": "system", "content": "You are a helpful assistant."},
                {"role": "user", "content": "What is MLflow?"},
            ]
        }
    },
    {
        "inputs": {
            "messages": [
                {"role": "system", "content": "You are a helpful assistant."},
                {"role": "user", "content": "What is Spark?"},
            ]
        }
    },
]

predict_fn = mlflow.genai.to_predict_fn("endpoints:/chat")

mlflow.genai.evaluate(
    data=data,
    predict_fn=predict_fn,
    scorers=get_all_scorers(),
)
You can also directly invoke the function to validate if the endpoint works properly with your input schema.
predict_fn(**data[0]["inputs"])
- class mlflow.genai.scorers.Correctness(*, name: str = 'correctness', aggregations: Optional[list] = None, required_columns: set[str] = {'inputs', 'outputs'})[source]
Bases:
mlflow.genai.scorers.builtin_scorers.BuiltInScorer
Note
Experimental: This class may change or be removed in a future release without warning.
Correctness ensures that the agent’s responses are correct and accurate.
You can invoke the scorer directly with a single input for testing, or pass it to mlflow.genai.evaluate() to run a full evaluation on a dataset.
Example (direct usage):
import mlflow
from mlflow.genai.scorers import Correctness

assessment = Correctness(name="my_correctness")(
    inputs={
        "question": "What is the difference between reduceByKey and groupByKey in Spark?"
    },
    outputs=(
        "reduceByKey aggregates data before shuffling, whereas groupByKey "
        "shuffles all data, making reduceByKey more efficient."
    ),
    expectations=[
        {"expected_response": "reduceByKey aggregates data before shuffling"},
        {"expected_response": "groupByKey shuffles all data"},
    ],
)
print(assessment)
Example (with evaluate):
import mlflow
from mlflow.genai.scorers import Correctness

data = [
    {
        "inputs": {
            "question": (
                "What is the difference between reduceByKey and groupByKey in Spark?"
            )
        },
        "outputs": (
            "reduceByKey aggregates data before shuffling, whereas groupByKey "
            "shuffles all data, making reduceByKey more efficient."
        ),
        "expectations": [
            {"expected_response": "reduceByKey aggregates data before shuffling"},
            {"expected_response": "groupByKey shuffles all data"},
        ],
    }
]
result = mlflow.genai.evaluate(data=data, scorers=[Correctness()])
- model_config: ClassVar[ConfigDict] = {}
Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.
- validate_columns(columns: set[str]) None [source]
- class mlflow.genai.scorers.ExpectationsGuidelines(*, name: str = 'expectations_guidelines', aggregations: Optional[list] = None, required_columns: set[str] = {'inputs', 'outputs'})[source]
Bases:
mlflow.genai.scorers.builtin_scorers.BuiltInScorer
Note
Experimental: This class may change or be removed in a future release without warning.
This scorer evaluates whether the agent’s response follows specific constraints or instructions provided for each row in the input dataset. This scorer is useful when you have a different set of guidelines for each example.
To use this scorer, the input dataset should contain an expectations column with a guidelines field. Then pass this scorer to mlflow.genai.evaluate() to run a full evaluation on the input dataset.
Example:
In this example, the guidelines specified in the guidelines field of the expectations column will be applied to each example individually. The evaluation result will contain a single “expectations_guidelines” score.
import mlflow
from mlflow.genai.scorers import ExpectationsGuidelines

data = [
    {
        "inputs": {"question": "What is the capital of France?"},
        "outputs": "The capital of France is Paris.",
        "expectations": {
            "guidelines": ["The response must be factual and concise"],
        },
    },
    {
        "inputs": {"question": "How to learn Python?"},
        "outputs": "You can read a book or take a course.",
        "expectations": {
            "guidelines": ["The response must be helpful and encouraging"],
        },
    },
]
mlflow.genai.evaluate(data=data, scorers=[ExpectationsGuidelines()])
- model_config: ClassVar[ConfigDict] = {}
Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.
- validate_columns(columns: set[str]) None [source]
- class mlflow.genai.scorers.Guidelines(*, name: str = 'guidelines', aggregations: Optional[list] = None, required_columns: set[str] = {'inputs', 'outputs'}, guidelines: Union[str, list[str]])[source]
Bases:
mlflow.genai.scorers.builtin_scorers.BuiltInScorer
Note
Experimental: This class may change or be removed in a future release without warning.
Guideline adherence evaluates whether the agent’s response follows specific constraints or instructions provided in the guidelines.
You can invoke the scorer directly with a single input for testing, or pass it to mlflow.genai.evaluate() to run a full evaluation on a dataset.
If you want to evaluate all the responses with a single set of guidelines, you can specify them in the guidelines parameter of this scorer.
Example (direct usage):
import mlflow
from mlflow.genai.scorers import Guidelines

# Create a global judge
english = Guidelines(
    name="english_guidelines",
    guidelines=["The response must be in English"],
)

feedback = english(
    inputs={"question": "What is the capital of France?"},
    outputs="The capital of France is Paris.",
)
print(feedback)
Example (with evaluate):
In the following example, the guidelines specified in the english and clarify scorers are uniformly applied to all examples in the dataset. The evaluation result will contain two scores, “english” and “clarify”.
import mlflow
from mlflow.genai.scorers import Guidelines

english = Guidelines(
    name="english",
    guidelines=["The response must be in English"],
)
clarify = Guidelines(
    name="clarify",
    guidelines=["The response must be clear, coherent, and concise"],
)

data = [
    {
        "inputs": {"question": "What is the capital of France?"},
        "outputs": "The capital of France is Paris.",
    },
    {
        "inputs": {"question": "What is the capital of Germany?"},
        "outputs": "The capital of Germany is Berlin.",
    },
]
mlflow.genai.evaluate(data=data, scorers=[english, clarify])
- class mlflow.genai.scorers.RelevanceToQuery(*, name: str = 'relevance_to_query', aggregations: Optional[list] = None, required_columns: set[str] = {'inputs', 'outputs'})[source]
Bases:
mlflow.genai.scorers.builtin_scorers.BuiltInScorer
Note
Experimental: This class may change or be removed in a future release without warning.
Relevance ensures that the agent’s response directly addresses the user’s input without deviating into unrelated topics.
You can invoke the scorer directly with a single input for testing, or pass it to mlflow.genai.evaluate() to run a full evaluation on a dataset.
Example (direct usage):
import mlflow
from mlflow.genai.scorers import RelevanceToQuery

assessment = RelevanceToQuery(name="my_relevance_to_query")(
    inputs={"question": "What is the capital of France?"},
    outputs="The capital of France is Paris.",
)
print(assessment)
Example (with evaluate):
import mlflow
from mlflow.genai.scorers import RelevanceToQuery

data = [
    {
        "inputs": {"question": "What is the capital of France?"},
        "outputs": "The capital of France is Paris.",
    }
]
result = mlflow.genai.evaluate(data=data, scorers=[RelevanceToQuery()])
- class mlflow.genai.scorers.RetrievalGroundedness(*, name: str = 'retrieval_groundedness', aggregations: Optional[list] = None, required_columns: set[str] = {'inputs', 'trace'})[source]
Bases:
mlflow.genai.scorers.builtin_scorers.BuiltInScorer
Note
Experimental: This class may change or be removed in a future release without warning.
RetrievalGroundedness assesses whether the agent’s response is aligned with the information provided in the retrieved context.
You can invoke the scorer directly with a single input for testing, or pass it to mlflow.genai.evaluate() to run a full evaluation on a dataset.
Example (direct usage):
import mlflow
from mlflow.genai.scorers import RetrievalGroundedness

trace = mlflow.get_trace("<your-trace-id>")
feedback = RetrievalGroundedness(name="my_retrieval_groundedness")(trace=trace)
print(feedback)
Example (with evaluate):
import mlflow
from mlflow.genai.scorers import RetrievalGroundedness

data = mlflow.search_traces(...)
result = mlflow.genai.evaluate(data=data, scorers=[RetrievalGroundedness()])
- class mlflow.genai.scorers.RetrievalRelevance(*, name: str = 'retrieval_relevance', aggregations: Optional[list] = None, required_columns: set[str] = {'inputs', 'trace'})[source]
Bases:
mlflow.genai.scorers.builtin_scorers.BuiltInScorer
Note
Experimental: This class may change or be removed in a future release without warning.
Retrieval relevance measures whether each chunk is relevant to the input request.
You can invoke the scorer directly with a single input for testing, or pass it to mlflow.genai.evaluate() to run a full evaluation on a dataset.
Example (direct usage):
import mlflow
from mlflow.genai.scorers import RetrievalRelevance

trace = mlflow.get_trace("<your-trace-id>")
feedbacks = RetrievalRelevance(name="my_retrieval_relevance")(trace=trace)
print(feedbacks)
Example (with evaluate):
import mlflow
from mlflow.genai.scorers import RetrievalRelevance

data = mlflow.search_traces(...)
result = mlflow.genai.evaluate(data=data, scorers=[RetrievalRelevance()])
- class mlflow.genai.scorers.RetrievalSufficiency(*, name: str = 'retrieval_sufficiency', aggregations: Optional[list] = None, required_columns: set[str] = {'inputs', 'trace'})[source]
Bases:
mlflow.genai.scorers.builtin_scorers.BuiltInScorer
Note
Experimental: This class may change or be removed in a future release without warning.
Retrieval sufficiency evaluates whether the retrieved documents provide all necessary information to generate the expected response.
You can invoke the scorer directly with a single input for testing, or pass it to mlflow.genai.evaluate() to run a full evaluation on a dataset.
Example (direct usage):
import mlflow
from mlflow.genai.scorers import RetrievalSufficiency

trace = mlflow.get_trace("<your-trace-id>")
feedback = RetrievalSufficiency(name="my_retrieval_sufficiency")(trace=trace)
print(feedback)
Example (with evaluate):
import mlflow
from mlflow.genai.scorers import RetrievalSufficiency

data = mlflow.search_traces(...)
result = mlflow.genai.evaluate(data=data, scorers=[RetrievalSufficiency()])
- model_config: ClassVar[ConfigDict] = {}
Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.
- validate_columns(columns: set[str]) None [source]
- class mlflow.genai.scorers.Safety(*, name: str = 'safety', aggregations: Optional[list] = None, required_columns: set[str] = {'inputs', 'outputs'})[source]
Bases:
mlflow.genai.scorers.builtin_scorers.BuiltInScorer
Note
Experimental: This class may change or be removed in a future release without warning.
Safety ensures that the agent’s responses do not contain harmful, offensive, or toxic content.
You can invoke the scorer directly with a single input for testing, or pass it to mlflow.genai.evaluate() to run a full evaluation on a dataset.
Example (direct usage):
import mlflow
from mlflow.genai.scorers import Safety

assessment = Safety(name="my_safety")(outputs="The capital of France is Paris.")
print(assessment)
Example (with evaluate):
import mlflow
from mlflow.genai.scorers import Safety

data = [
    {
        "inputs": {"question": "What is the capital of France?"},
        "outputs": "The capital of France is Paris.",
    }
]
result = mlflow.genai.evaluate(data=data, scorers=[Safety()])
- mlflow.genai.scorers.get_all_scorers() list[mlflow.genai.scorers.builtin_scorers.BuiltInScorer] [source]
Note
Experimental: This function may change or be removed in a future release without warning.
Returns a list of all built-in scorers.
Example:
import mlflow
from mlflow.genai.scorers import get_all_scorers

data = [
    {
        "inputs": {"question": "What is the capital of France?"},
        "outputs": "The capital of France is Paris.",
        "expectations": [
            {"expected_response": "Paris is the capital city of France."},
        ],
    }
]
result = mlflow.genai.evaluate(data=data, scorers=get_all_scorers())
- mlflow.genai.scorers.scorer(func=None, *, name: Optional[str] = None, aggregations: Optional[list[typing.Union[typing.Literal['min', 'max', 'mean', 'median', 'variance', 'p90', 'p99'], typing.Callable]]] = None)[source]
Note
Experimental: This function may change or be removed in a future release without warning.
A decorator to define a custom scorer that can be used in mlflow.genai.evaluate().
The scorer function should accept a subset of the following parameters:
Parameter: inputs
Description: A single input to the target model/app.
Source: Derived from either the dataset or the trace. When the dataset contains an inputs column, the value will be passed as is. When traces are provided as the evaluation dataset, this will be derived from the inputs field of the trace (i.e. the inputs captured in the root span of the trace).

Parameter: outputs
Description: A single output from the target model/app.
Source: Derived from the dataset, the trace, or the output of predict_fn. When the dataset contains an outputs column, the value will be passed as is. When predict_fn is provided, MLflow will make a prediction using the inputs and the predict_fn and pass the result as the outputs. When traces are provided as the evaluation dataset, this will be derived from the response field of the trace (i.e. the outputs captured in the root span of the trace).

Parameter: expectations
Description: Ground truth or any expectation for each prediction, e.g. expected retrieved docs.
Source: Derived from either the dataset or the trace. When the dataset contains an expectations column, the value will be passed as is. When traces are provided as the evaluation dataset, this will be a dictionary that contains a set of assessments in the format [assessment name]: [assessment value].

Parameter: trace
Description: A trace object corresponding to the prediction for the row.
Source: Specified as a trace column in the dataset, or generated during the prediction.

The scorer function should return one of the following:
A boolean value
An integer value
A float value
A string value
A single Feedback object
A list of Feedback objects
Note
The metric name will be determined by the scorer function’s name or a custom name specified in the name parameter for the scorer.
Example
import json

import mlflow
from mlflow.genai.scorers import scorer
from mlflow.entities import AssessmentSource, Feedback


# Basic scorers that return primitive values
@scorer
def not_empty(outputs) -> bool:
    return outputs != ""


@scorer
def exact_match(outputs, expectations) -> bool:
    return outputs == expectations["expected_response"]


@scorer
def num_tool_calls(trace) -> int:
    spans = trace.search_spans(name="tool_call")
    return len(spans)


# Use a `Feedback` object to return additional information about the scorer's
# result, such as a rationale for the score.
@scorer
def harmfulness(outputs) -> Feedback:
    import openai

    prompt = f'''
    Judge if the following text is harmful or not.

    Text: {outputs}

    Return the answer in a JSON object with the following format:
    {{
        "harmful": true,
        "reason": "The text contains harmful content"
    }}

    Do not output any other characters than the json object.
    '''
    response = openai.OpenAI().chat.completions.create(
        model="o4-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    payload = json.loads(response.choices[0].message.content)
    return Feedback(
        value=payload["harmful"],
        rationale=payload["reason"],
        source=AssessmentSource(
            source_type="LLM_JUDGE",
            source_id="openai:/o4-mini",
        ),
    )


# Use the scorers in an evaluation
mlflow.genai.evaluate(
    data=data,
    scorers=[not_empty, exact_match, num_tool_calls, harmfulness],
)
- Databricks Agent Datasets Python SDK. For more details, see the Databricks Agent Evaluation documentation: https://docs.databricks.com/en/generative-ai/agent-evaluation/index.html
The API docs can be found here: https://api-docs.databricks.com/python/databricks-agents/latest/databricks_agent_eval.html#datasets
- class mlflow.genai.datasets.EvaluationDataset(dataset: ManagedDataset)[source]
Bases:
mlflow.data.dataset.Dataset, mlflow.data.pyfunc_dataset_mixin.PyFuncConvertibleDatasetMixin
A dataset for storing evaluation records (inputs and expectations).
Currently, this class is only supported for Databricks managed datasets. To use this class, you must have the databricks-agents package installed.
- property digest: Optional[str]
String digest (hash) of the dataset provided by the caller that uniquely identifies the dataset.
- insert(records: Union[list[dict], pd.DataFrame, pyspark.sql.DataFrame]) EvaluationDataset [source]
Insert records into the dataset.
- set_profile(profile: str) mlflow.genai.datasets.evaluation_dataset.EvaluationDataset [source]
Set the profile of the dataset.
- property source_type: Optional[str]
The type of the dataset source, e.g. “databricks-uc-table”, “DBFS”, “S3”, …
- to_df() pd.DataFrame [source]
Convert the dataset to a pandas DataFrame.
- to_evaluation_dataset(path=None, feature_names=None) mlflow.data.evaluation_dataset.EvaluationDataset [source]
Converts the dataset to the legacy EvaluationDataset for model evaluation. Required for use with mlflow.evaluate().
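A short sketch tying these methods together (the table name is hypothetical; requires Databricks and the databricks-agents package):

import mlflow.genai

dataset = mlflow.genai.create_dataset(uc_table_name="main.default.my_eval_dataset")

# Insert evaluation records; a pandas or Spark DataFrame also works
dataset = dataset.insert(
    [
        {
            "inputs": {"question": "What is MLflow?"},
            "expectations": {"expected_response": "MLflow is an ML platform"},
        }
    ]
)

# Read the records back as a pandas DataFrame
print(dataset.to_df())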
- mlflow.genai.datasets.create_dataset(uc_table_name: str, experiment_id: Optional[Union[str, list[str]]] = None) mlflow.genai.datasets.evaluation_dataset.EvaluationDataset [source]
Create a dataset with the given name and associate it with the given experiment.
- Parameters
uc_table_name – The UC table name of the dataset.
experiment_id – The ID of the experiment to associate the dataset with. If not provided, the current experiment is inferred from the environment.
- mlflow.genai.datasets.delete_dataset(uc_table_name: str) None [source]
Delete the dataset with the given name.
- Parameters
uc_table_name – The UC table name of the dataset.
- mlflow.genai.datasets.get_dataset(uc_table_name: str) mlflow.genai.datasets.evaluation_dataset.EvaluationDataset [source]
Get the dataset with the given name.
- Parameters
uc_table_name – The UC table name of the dataset.
- class mlflow.genai.optimize.LLMParams(model_name: str, base_uri: Optional[str] = None, temperature: Optional[float] = None)[source]
Bases:
object
Note
Experimental: This class may change or be removed in a future release without warning.
Parameters for configuring an LLM model.
- Parameters
model_name – Name of the model in the format <provider>/<model name>. For example, “openai/gpt-4” or “anthropic/claude-4”.
base_uri – Optional base URI for the API endpoint. If not provided, the default endpoint for the provider will be used.
temperature – Optional sampling temperature for the model’s outputs. Higher values (e.g., 0.8) make the output more random, while lower values (e.g., 0.2) make it more deterministic.
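For example (model choice illustrative):

from mlflow.genai.optimize import LLMParams

params = LLMParams(
    model_name="openai/gpt-4o-mini",  # <provider>/<model name>
    temperature=0.2,  # lower values make outputs more deterministic
)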
- class mlflow.genai.optimize.OptimizerConfig(num_instruction_candidates: int = 6, max_few_show_examples: int = 6, num_threads: int = <factory>, optimizer_llm: Optional[mlflow.genai.optimize.types.LLMParams] = None, algorithm: str = 'DSPy/MIPROv2', verbose: bool = False, autolog: bool = False)[source]
Bases:
object
Note
Experimental: This class may change or be removed in a future release without warning.
Configuration for prompt optimization.
- Parameters
num_instruction_candidates – Number of candidate instructions to generate during each optimization iteration. Higher values may lead to better results but increase optimization time. Default: 6
max_few_show_examples – Maximum number of examples to show in few-shot demonstrations. Default: 6
num_threads – Number of threads to use for parallel optimization. Default: (number of CPU cores * 2 + 1)
optimizer_llm – Optional LLM parameters for the teacher model. If not provided, the target LLM will be used as the teacher.
algorithm – The optimization algorithm to use. Default: “DSPy/MIPROv2”
verbose – Whether to show optimizer logs during optimization. Default: False
autolog – Whether to log the optimization parameters, datasets, and metrics. If set to True, an MLflow run is automatically created to store them. Default: False
- optimizer_llm: Optional[mlflow.genai.optimize.types.LLMParams] = None
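For example, a smaller, verbose search that uses a separate teacher model (the model choice here is illustrative):

from mlflow.genai.optimize import LLMParams, OptimizerConfig

config = OptimizerConfig(
    num_instruction_candidates=4,
    optimizer_llm=LLMParams(model_name="openai/gpt-4o"),  # teacher model
    verbose=True,
)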
- class mlflow.genai.optimize.PromptOptimizationResult(prompt: Prompt)[source]
Bases:
object
Note
Experimental: This class may change or be removed in a future release without warning.
Result of the mlflow.genai.optimize_prompt() API.
- Parameters
prompt – A prompt entity containing the optimized template.
- mlflow.genai.optimize.optimize_prompt(*, target_llm_params: mlflow.genai.optimize.types.LLMParams, prompt: Union[str, Prompt], train_data: EvaluationDatasetTypes, scorers: list[Scorer], objective: Optional[Callable[[dict[str, typing.Union[bool, float, str, Feedback, list[Feedback]]]], float]] = None, eval_data: Optional[EvaluationDatasetTypes] = None, optimizer_config: Optional[mlflow.genai.optimize.types.OptimizerConfig] = None) mlflow.genai.optimize.types.PromptOptimizationResult [source]
Note
Experimental: This function may change or be removed in a future release without warning.
Optimize an LLM prompt using the given dataset and evaluation metrics. The optimized prompt template is automatically registered as a new version of the original prompt and included in the result. Currently, this API only supports DSPy’s MIPROv2 optimizer.
- Parameters
target_llm_params – Parameters for the LLM that the prompt is optimized for. The model name must be specified in the format <provider>/<model>.
prompt – The URI or Prompt object of the MLflow prompt to optimize. The optimized prompt is registered as a new version of the prompt.
train_data –
Training dataset used for optimization. The data must be one of the following formats:
An EvaluationDataset entity
Pandas DataFrame
Spark DataFrame
List of dictionaries
The dataset must include the following columns:
inputs: A column containing single inputs in dict format. Each input should contain keys matching the variables in the prompt template.
expectations: A column containing a dictionary of ground truths for individual output fields.
scorers – List of scorers that evaluate the inputs, outputs, and expectations. Note: Trace input is not supported for optimization; use inputs, outputs, and expectations instead. Also, pass the objective argument when using scorers with string or Feedback type outputs.
objective – A callable that computes the overall performance metric from individual assessments. Takes a dict mapping assessment names to assessment scores and returns a float value (greater is better).
eval_data – Evaluation dataset with the same format as train_data. If not provided, train_data will be automatically split into training and evaluation sets.
optimizer_config – Configuration parameters for the optimizer.
- Returns
The optimization result including the optimized prompt.
- Return type
PromptOptimizationResult
Example
import os
from typing import Any

import mlflow
from mlflow.genai.scorers import scorer
from mlflow.genai.optimize import OptimizerConfig, LLMParams

os.environ["OPENAI_API_KEY"] = "YOUR_API_KEY"


@scorer
def exact_match(expectations: dict[str, Any], outputs: dict[str, Any]) -> bool:
    return expectations == outputs


prompt = mlflow.register_prompt(
    name="qa",
    template="Answer the following question: {{question}}",
)

result = mlflow.genai.optimize_prompt(
    target_llm_params=LLMParams(model_name="openai/gpt-4.1-nano"),
    train_data=[
        {"inputs": {"question": f"{i}+1"}, "expectations": {"answer": f"{i + 1}"}}
        for i in range(100)
    ],
    scorers=[exact_match],
    prompt=prompt.uri,
    optimizer_config=OptimizerConfig(num_instruction_candidates=5),
)

print(result.prompt.template)
- mlflow.genai.judges.is_context_relevant(*, request: str, context: Any, name: Optional[str] = None) Feedback [source]
LLM judge determines whether the given context is relevant to the input request.
- Parameters
request – Input to the application to evaluate, user’s question or query.
context – Context to evaluate the relevance to the request. Supports any JSON-serializable object.
name – Optional name for overriding the default name of the returned feedback.
- Returns
An mlflow.entities.assessment.Feedback object with a “yes” or “no” value indicating whether the context is relevant to the request.
Example
The following example shows how to evaluate whether a document retrieved by a retriever is relevant to the user’s question.
from mlflow.genai.judges import is_context_relevant

feedback = is_context_relevant(
    request="What is the capital of France?",
    context="Paris is the capital of France.",
)
print(feedback.value)  # "yes"

feedback = is_context_relevant(
    request="What is the capital of France?",
    context="Paris is known for its Eiffel Tower.",
)
print(feedback.value)  # "no"
- mlflow.genai.judges.is_context_sufficient(*, request: str, context: Any, expected_facts: list[str], expected_response: Optional[str] = None, name: Optional[str] = None) Feedback [source]
LLM judge determines whether the given context is sufficient to answer the input request.
- Parameters
request – Input to the application to evaluate, user’s question or query.
context – Context to evaluate the sufficiency of. Supports any JSON-serializable object.
expected_facts – A list of expected facts that should be present in the context.
expected_response – The expected response from the application. Optional.
name – Optional name for overriding the default name of the returned feedback.
- Returns
An mlflow.entities.assessment.Feedback object with a “yes” or “no” value indicating whether the context is sufficient to answer the request.
Example
The following example shows how to evaluate whether the documents returned by a retriever give sufficient context to answer the user’s question.
from mlflow.genai.judges import is_context_sufficient

feedback = is_context_sufficient(
    request="What is the capital of France?",
    context=[
        {"content": "Paris is the capital of France."},
        {"content": "Paris is known for its Eiffel Tower."},
    ],
    expected_facts=["Paris is the capital of France."],
)
print(feedback.value)  # "yes"
- mlflow.genai.judges.is_correct(*, request: str, response: str, expected_facts: list[str], expected_response: Optional[str] = None, name: Optional[str] = None) Feedback [source]
LLM judge determines whether the given response is correct for the input request.
- Parameters
request – Input to the application to evaluate, user’s question or query.
response – The response from the application to evaluate.
expected_facts – A list of expected facts that should be present in the response.
expected_response – The expected response from the application. Optional.
name – Optional name for overriding the default name of the returned feedback.
- Returns
An mlflow.entities.assessment.Feedback object with a “yes” or “no” value indicating whether the response is correct for the request.
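Example
A minimal sketch mirroring the other judges on this page:

from mlflow.genai.judges import is_correct

feedback = is_correct(
    request="What is the capital of France?",
    response="Paris is the capital of France.",
    expected_facts=["Paris is the capital of France."],
)
print(feedback.value)  # "yes"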
- mlflow.genai.judges.is_grounded(*, request: str, response: str, context: Any, name: Optional[str] = None) Feedback [source]
LLM judge determines whether the given response is grounded in the given context.
- Parameters
request – Input to the application to evaluate, user’s question or query.
response – The response from the application to evaluate.
context – Context to evaluate the response against. Supports any JSON-serializable object.
name – Optional name for overriding the default name of the returned feedback.
- Returns
An mlflow.entities.assessment.Feedback object with a “yes” or “no” value indicating whether the response is grounded in the context.
Example
The following example shows how to evaluate whether the response is grounded in the context.
from mlflow.genai.judges import is_grounded

feedback = is_grounded(
    request="What is the capital of France?",
    response="Paris",
    context=[
        {"content": "Paris is the capital of France."},
        {"content": "Paris is known for its Eiffel Tower."},
    ],
)
print(feedback.value)  # "yes"
- mlflow.genai.judges.is_safe(*, content: str, name: Optional[str] = None) Feedback [source]
LLM judge determines whether the given response is safe.
- Parameters
content – Text content to evaluate for safety.
name – Optional name for overriding the default name of the returned feedback.
- Returns
An mlflow.entities.assessment.Feedback object with a “yes” or “no” value indicating whether the response is safe.
Example
from mlflow.genai.judges import is_safe

feedback = is_safe(content="I am a happy person.")
print(feedback.value)  # "yes"
- mlflow.genai.judges.meets_guidelines(*, guidelines: Union[str, list[str]], context: dict[str, typing.Any], name: Optional[str] = None) Feedback [source]
LLM judge determines whether the given response meets the given guideline(s).
- Parameters
guidelines – A single guideline or a list of guidelines.
context – Mapping of context to be evaluated against the guidelines. For example, pass {"response": "<response text>"} to evaluate whether the response meets the given guidelines.
name – Optional name for overriding the default name of the returned feedback.
- Returns
An mlflow.entities.assessment.Feedback object with a “yes” or “no” value indicating whether the response meets the guideline(s).
Example
The following example shows how to evaluate whether the response meets the given guideline(s).
from mlflow.genai.judges import meets_guidelines

feedback = meets_guidelines(
    guidelines="Be polite and respectful.",
    context={"response": "Hello, how are you?"},
)
print(feedback.value)  # "yes"

feedback = meets_guidelines(
    guidelines=["Be polite and respectful.", "Must be in English."],
    context={"response": "Hola, ¿cómo estás?"},
)
print(feedback.value)  # "no"