Evaluate Prompts
Creating effective prompts is an iterative process. Simply writing a prompt and deploying it is rarely sufficient. To build high-quality GenAI applications, you need to systematically evaluate different prompt variations to understand their impact on your application's output quality, cost, and latency. This page discusses conceptual approaches and shows how MLflow's broader toolset (such as Tracing and Evaluation) can be leveraged for prompt evaluation, even though the Prompt Registry itself is primarily for storage and versioning.
This page will cover:
- Setting up experiments for prompt evaluation.
- Strategies for comparing different prompt versions.
- Analyzing evaluation results to make informed decisions.
- Selecting the most effective prompts based on data.
Setting Up Prompt Evaluation Experiments
Effective prompt evaluation begins with a clear experimental setup. The goal is to isolate the impact of prompt changes on your application's behavior.
- Define Your Goal: What are you trying to achieve by changing the prompt? (e.g., improve factual accuracy, reduce verbosity, enhance a specific skill, lower cost by using a cheaper model with a better prompt).
- Identify Key Metrics: How will you measure success? Metrics can be:
- Quality Metrics: Relevance, coherence, helpfulness, accuracy, adherence to style guides. These can be assessed by human reviewers or automated LLM-as-a-judge metrics.
- Performance Metrics: Latency of the LLM response, token usage (cost).
- Task-Specific Metrics: e.g., code generation correctness, summarization abstractiveness, question-answering precision.
- Prepare an Evaluation Dataset: You need a consistent set of inputs to test your prompt variations against. This dataset should be representative of the types of queries your application will handle. This could be curated from:
- Real user queries (collected via MLflow Tracing).
- Manually crafted examples covering edge cases.
- Synthetic data generated to test specific behaviors.
- Refer to the "Build an Evaluation Dataset" section in "Evaluate & Monitor" for more details; a minimal sketch of curating examples from traces follows this list.
- Select Prompt Variations: Choose the different prompt versions from the MLflow Prompt Registry that you want to compare. These could be different iterations of the same base prompt or entirely different approaches to the same task.
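As a sketch of the first dataset source above (real user queries collected via MLflow Tracing), you might pull recent traces into a DataFrame for manual curation. The experiment ID below is a placeholder, and the exact columns returned by mlflow.search_traces() can vary across MLflow versions:

import mlflow

# Placeholder experiment ID: point this at the experiment your application logs traces to.
traces_df = mlflow.search_traces(
    experiment_ids=["123456789"],
    max_results=50,
)

# Inspect the captured requests and responses, then curate and label them manually
# before turning them into evaluation inputs and expectations.
print(traces_df.columns)
print(traces_df.head())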
Comparing Different Prompt Versions
Once you have your setup, you can run your application with each selected prompt version against your evaluation dataset.
Workflow using MLflow tools:
- Application Instrumentation: Ensure your application is instrumented with MLflow Tracing. This will capture detailed information about each execution, including the inputs, outputs, any intermediate steps, latency, and token counts.
- Iterate and Log: For each prompt version you want to evaluate:
  a. Modify your application code to load the specific prompt version from the registry (e.g., mlflow.genai.load_prompt(uri="prompts:/my-prompt/1"), then mlflow.genai.load_prompt(uri="prompts:/my-prompt/2")).
  b. Run your application against each item in your evaluation dataset.
  c. The traces generated will capture the outputs produced using that specific prompt version.
- Leverage mlflow.genai.evaluate(): MLflow's mlflow.genai.evaluate() API is a powerful tool here (a combined sketch follows this list). You can configure it to:
  - Take your application (which internally loads a specific prompt version) as the "model" input.
  - Run it against your evaluation dataset.
  - Apply predefined or custom metrics (including LLM judges for quality assessment and functions to parse cost/latency from traces).
  - Log all results, linking them back to the run, which can be associated with the prompt version being tested.
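A minimal sketch combining these steps is shown below. It is illustrative rather than definitive: the prompt name my-prompt (assumed to contain a {{ question }} variable), the model, and the tiny inline dataset are placeholders, and the built-in Correctness scorer stands in for whatever metrics you choose:

import mlflow
import openai
from mlflow.genai.scorers import Correctness

# Automatically trace OpenAI calls (inputs, outputs, latency, token counts).
mlflow.openai.autolog()

# A tiny illustrative dataset; in practice, use your curated evaluation dataset.
eval_dataset = [
    {
        "inputs": {"question": "What does MLflow Tracing capture?"},
        "expectations": {"expected_response": "MLflow Tracing captures inputs, outputs, latency, and intermediate steps of each execution."},
    },
]

def make_predict_fn(prompt_uri: str):
    # Build a prediction function bound to one specific prompt version.
    def predict(question: str) -> str:
        prompt = mlflow.genai.load_prompt(prompt_uri)
        completion = openai.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": prompt.format(question=question)}],
        )
        return completion.choices[0].message.content

    return predict

# Evaluate each prompt version in its own run so the results stay comparable.
for version in [1, 2]:
    prompt_uri = f"prompts:/my-prompt/{version}"
    with mlflow.start_run(run_name=f"my-prompt-v{version}"):
        mlflow.log_param("prompt_uri", prompt_uri)
        mlflow.genai.evaluate(
            predict_fn=make_predict_fn(prompt_uri),
            data=eval_dataset,
            scorers=[Correctness()],
        )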
Analyzing Evaluation Results
After running your evaluations, MLflow provides tools to analyze the results:
- MLflow UI: Compare runs corresponding to different prompt versions. View logged metrics, parameters (like the prompt URI or version number you logged), and associated traces.
- Traces: For each evaluation, dive into the traces to see the exact inputs, intermediate steps (if any), and final outputs generated by your application with a specific prompt. This is crucial for qualitative analysis and understanding why a prompt performed a certain way.
- Metrics Comparison: Systematically compare the quantitative metrics (quality scores, latency, cost) across different prompt versions; a small programmatic sketch follows this list.
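If you prefer to pull the same comparison into a notebook, a small sketch using mlflow.search_runs() is shown below; the experiment name is a placeholder, and the metric column names depend on the scorers you used:

import mlflow

# Placeholder experiment name; replace with the experiment holding your evaluation runs.
runs = mlflow.search_runs(experiment_names=["prompt-evaluation-experiments"])

# Keep the prompt identifier and any logged metrics for a side-by-side comparison.
comparison_columns = [
    c for c in runs.columns if c.startswith("metrics.") or c == "params.prompt_uri"
]
print(runs[["run_id"] + comparison_columns])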
Selecting the Most Effective Prompts
Based on your analysis, you can make data-driven decisions about which prompt versions are most effective for different scenarios, or which ones to promote for further testing or deployment (e.g., by updating an alias like staging or production to point to the winning version).
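For example, a hedged sketch of promoting a winning version via an alias is shown below. It assumes your MLflow release exposes mlflow.genai.set_prompt_alias() (some releases expose the same operation as mlflow.set_prompt_alias()); the prompt name, alias, and version number are placeholders:

import mlflow

# Point the "production" alias at the winning version (name and version are placeholders).
mlflow.genai.set_prompt_alias(name="summarization-prompt", alias="production", version=2)

# Application code can then load by alias and pick up future promotions automatically.
prompt = mlflow.genai.load_prompt("prompts:/summarization-prompt@production")
print(prompt.template)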
Consider:
- Trade-offs: Often, there are trade-offs. A prompt might produce higher quality responses but at a higher latency or cost. Your selection should align with your overall application objectives.
- Iterative Refinement: Evaluation results might highlight areas for further prompt improvement. Use these insights to create new prompt versions and repeat the evaluation cycle.
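As a sketch of that cycle, registering an improved template under the same prompt name creates the next version, which you can then feed back into the evaluation loop; the template text here is purely illustrative:

import mlflow

# Registering under an existing name creates a new version of that prompt.
improved_template = """\
Summarize the content below in {{ num_sentences }} sentences.
Be factual, avoid speculation, and keep a neutral tone.
Sentences: {{ sentences }}
"""

prompt = mlflow.genai.register_prompt(
    name="summarization-prompt",
    template=improved_template,
    commit_message="Tighten style constraints based on evaluation feedback",
)
print(f"Registered version {prompt.version}")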
Key Takeaways
- Systematic prompt evaluation is key to improving GenAI application quality.
- Define clear goals, metrics, and a representative evaluation dataset for your experiments.
- Leverage MLflow Tracing to capture detailed execution data and mlflow.genai.evaluate() to run evaluations and compute metrics across different prompt versions.
- Analyze both quantitative metrics and qualitative trace data to understand prompt performance.
- Make data-driven decisions for selecting and refining prompts, considering trade-offs between quality, cost, and latency.
Prerequisites
- Understanding of basic evaluation metrics and principles.
- Access to an evaluation dataset relevant to your application.
- Familiarity with MLflow Tracing and MLflow Evaluation concepts.
- Prompts registered in the MLflow Prompt Registry that you wish to evaluate.
While the Prompt Registry itself is the system of record for your prompt templates, its true power in enhancing quality comes when its versioned prompts are used within a robust evaluation framework enabled by other MLflow components like Tracing and Evaluate.
Quickstart
1. Install Required Libraries
First, install MLflow and the OpenAI SDK. If you use a different LLM provider, install its SDK instead.
pip install "mlflow>=3" openai -qU
Also, set your OpenAI API key (or the API key for whichever LLM provider you use, e.g., Anthropic).
import os
from getpass import getpass
os.environ["OPENAI_API_KEY"] = getpass("Enter your OpenAI API key: ")
2. Create a Prompt
- UI
  - Run mlflow ui in your terminal to start the MLflow UI.
  - Navigate to the Prompts tab in the MLflow UI.
  - Click on the Create Prompt button.
  - Fill in the prompt details such as name, prompt template text, and commit message (optional).
  - Click Create to register the prompt.
- Python
  To create a new prompt using the Python API, use the mlflow.genai.register_prompt() API:
import mlflow
# Use double curly braces for variables in the template
initial_template = """\
Summarize content you are provided with in {{ num_sentences }} sentences.
Sentences: {{ sentences }}
"""
# Register a new prompt
prompt = mlflow.genai.register_prompt(
    name="summarization-prompt",
    template=initial_template,
    # Optional: Provide a commit message to describe the changes
    commit_message="Initial commit",
)
# The prompt object contains information about the registered prompt
print(f"Created prompt '{prompt.name}' (version {prompt.version})")
3. Prepare Evaluation Data
Below, we create a small summarization dataset for demonstration purposes.
import pandas as pd
eval_data = pd.DataFrame(
    {
        "inputs": [
            {
                "data": "Artificial intelligence has transformed how businesses operate in the 21st century. Companies are leveraging AI for everything from customer service to supply chain optimization. The technology enables automation of routine tasks, freeing human workers for more creative endeavors. However, concerns about job displacement and ethical implications remain significant. Many experts argue that AI will ultimately create more jobs than it eliminates, though the transition may be challenging."
            },
            {
                "data": "Climate change continues to affect ecosystems worldwide at an alarming rate. Rising global temperatures have led to more frequent extreme weather events including hurricanes, floods, and wildfires. Polar ice caps are melting faster than predicted, contributing to sea level rise that threatens coastal communities. Scientists warn that without immediate and dramatic reductions in greenhouse gas emissions, many of these changes may become irreversible. International cooperation remains essential but politically challenging."
            },
            {
                "data": "The human genome project was completed in 2003 after 13 years of international collaborative research. It successfully mapped all of the genes of the human genome, approximately 20,000-25,000 genes in total. The project cost nearly $3 billion but has enabled countless medical advances and spawned new fields like pharmacogenomics. The knowledge gained has dramatically improved our understanding of genetic diseases and opened pathways to personalized medicine. Today, a complete human genome can be sequenced in under a day for about $1,000."
            },
            {
                "data": "Remote work adoption accelerated dramatically during the COVID-19 pandemic. Organizations that had previously resisted flexible work arrangements were forced to implement digital collaboration tools and virtual workflows. Many companies reported surprising productivity gains, though concerns about company culture and collaboration persisted. After the pandemic, a hybrid model emerged as the preferred approach for many businesses, combining in-office and remote work. This shift has profound implications for urban planning, commercial real estate, and work-life balance."
            },
            {
                "data": "Quantum computing represents a fundamental shift in computational capability. Unlike classical computers that use bits as either 0 or 1, quantum computers use quantum bits or qubits that can exist in multiple states simultaneously. This property, known as superposition, theoretically allows quantum computers to solve certain problems exponentially faster than classical computers. Major technology companies and governments are investing billions in quantum research. Fields like cryptography, material science, and drug discovery are expected to be revolutionized once quantum computers reach practical scale."
            },
        ],
        "expectations": [
            {
                "expected_response": "AI has revolutionized business operations through automation and optimization, though ethical concerns about job displacement persist alongside predictions that AI will ultimately create more employment opportunities than it eliminates."
            },
            {
                "expected_response": "Climate change is causing accelerating environmental damage through extreme weather events and melting ice caps, with scientists warning that without immediate reduction in greenhouse gas emissions, many changes may become irreversible."
            },
            {
                "expected_response": "The Human Genome Project, completed in 2003, mapped approximately 20,000-25,000 human genes at a cost of $3 billion, enabling medical advances, improving understanding of genetic diseases, and establishing the foundation for personalized medicine."
            },
            {
                "expected_response": "The COVID-19 pandemic forced widespread adoption of remote work, revealing unexpected productivity benefits despite collaboration challenges, and resulting in a hybrid work model that impacts urban planning, real estate, and work-life balance."
            },
            {
                "expected_response": "Quantum computing uses qubits existing in multiple simultaneous states to potentially solve certain problems exponentially faster than classical computers, with major investment from tech companies and governments anticipating revolutionary applications in cryptography, materials science, and pharmaceutical research."
            },
        ],
    }
)
4. Define Prediction Function
Define a prediction function that MLflow calls once per row of the evaluation data. The fields of each row's inputs dictionary (only data in this example) are passed to the function as keyword arguments, and the returned string is scored against the expectations column.
import mlflow
import openai
def predict(data: str) -> str:
    prompt = mlflow.genai.load_prompt("prompts:/summarization-prompt/1")
    content = prompt.format(sentences=data, num_sentences=1)
    completion = openai.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": content}],
        temperature=0.1,
    )
    return completion.choices[0].message.content
5. Run Evaluation
Run the mlflow.genai.evaluate() API to evaluate the model with the prepared data and prompt. In this example, we use two built-in scorers, Correctness and RelevanceToQuery.
from mlflow.genai.scorers import Correctness, RelevanceToQuery
with mlflow.start_run(run_name="prompt-evaluation"):
    mlflow.log_param("model", "gpt-4o-mini")
    mlflow.log_param("temperature", 0.1)

    results = mlflow.genai.evaluate(
        predict_fn=predict,
        data=eval_data,
        scorers=[
            Correctness(),
            RelevanceToQuery(),
        ],
    )
6. View Results
You can view the evaluation results in the MLflow UI. Navigate to the Experiments tab and click on the evaluation run (prompt-evaluation in this example) to view the results.
If you have multiple Evaluation Runs, you can compare the metrics across runs in the chart view.
Moreover, you can open the Traces tab on the evaluation run page to see every input and output produced by the LLM during evaluation and understand how the model responds to different prompts.
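If you prefer to inspect the same information programmatically, the sketch below fetches the traces logged to the evaluation run; it assumes mlflow.search_traces() accepts a run_id argument in your MLflow version and that the run name from this quickstart is unchanged:

import mlflow

# Look up the evaluation run by name in the active experiment.
runs = mlflow.search_runs(filter_string="tags.mlflow.runName = 'prompt-evaluation'")
run_id = runs["run_id"].iloc[0]

# Each returned row corresponds to one LLM call made during evaluation.
traces_df = mlflow.search_traces(run_id=run_id)
print(traces_df.head())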