
Compare App Versions

After evaluating your customer support agent, let's make an improvement and measure whether it actually helps. This guide shows how to track changes and objectively compare different versions of your application.

Creating an Improved Version

Building on our customer support agent from previous sections, let's improve it by adding a more empathetic tone to the prompt:

import mlflow
import subprocess
import openai

# Get current git commit for the new version
try:
    git_commit = (
        subprocess.check_output(["git", "rev-parse", "HEAD"])
        .decode("ascii")
        .strip()[:8]
    )
    version_identifier = f"git-{git_commit}"
except subprocess.CalledProcessError:
    version_identifier = "local-dev"  # Fallback if not in a git repo

improved_model_name = f"customer_support_agent-v2-{version_identifier}"

# Set the new active model
improved_model_info = mlflow.set_active_model(name=improved_model_name)

# Log parameters for the improved version
improved_params = {
    "llm": "gpt-4o-mini",
    "temperature": 0.7,
    "retrieval_strategy": "vector_search_v3",
}
mlflow.log_model_params(model_id=improved_model_info.model_id, params=improved_params)


# Define the improved agent with more empathetic prompting
def improved_agent(question: str) -> str:
    client = openai.OpenAI()

    # Enhanced system prompt with empathy focus
    system_prompt = """You are a caring and empathetic customer support agent.
    Always acknowledge the customer's feelings and frustrations before providing solutions.
    Use phrases like 'I understand how frustrating this must be' and 'I'm here to help'.
    Provide clear, actionable steps while maintaining a warm, supportive tone."""

    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": question},
        ],
        temperature=0.7,
        max_tokens=150,
    )
    return response.choices[0].message.content

Now let's evaluate this improved version using the same dataset and scorers from the previous section. This ensures we can make an apples-to-apples comparison between versions:

# Evaluate the improved version with the same dataset
evaluation_results_v2 = mlflow.genai.evaluate(
    data=eval_data,  # Same evaluation dataset from previous section
    predict_fn=improved_agent,
    model_id=improved_model_info.model_id,
    scorers=[
        relevance_scorer,
        support_guidelines,
    ],  # Same scorers from previous section
)

print(f"V2 Metrics: {evaluation_results_v2.metrics}")

Comparing Versions in the MLflow UI

After creating multiple versions of your application, you need to systematically compare them to:

  • Validate improvements - Confirm that your changes actually improved the metrics you care about
  • Identify regressions - Ensure new versions don't degrade performance in unexpected ways
  • Select the best version - Make data-driven decisions about which version to deploy to production
  • Guide iteration - Understand which changes had the most impact to inform future improvements

The MLflow UI provides visual comparison tools that make this analysis intuitive:

  1. Navigate to your experiment containing the customer support agent versions
  2. Select multiple runs by checking the boxes next to each version's evaluation run
  3. Click "Compare" to see:
    • Side-by-side parameter differences
    • Metric comparisons with charts
    • Parallel coordinates plot for multi-metric analysis

Comparing Versions with the MLflow API

For automated workflows like CI/CD pipelines, regression testing, or programmatic version selection, MLflow provides powerful APIs to search, rank, and compare your LoggedModel versions. These APIs enable you to:

  • Automatically flag versions that don't meet quality thresholds
  • Generate comparison reports for code reviews
  • Select the best version for deployment without manual intervention
  • Track version improvements over time in your analytics

Ranking and Searching Versions

Use search_logged_models to find all versions of your application and rank them by quality, speed, or other performance characteristics. This helps you identify trends and find the best performing versions:

from mlflow import search_logged_models

# Search for all versions of our customer support agent
# Order by creation time to see version progression
all_versions = search_logged_models(
    filter_string=f"name IN ('{logged_model_name}', '{improved_model_name}')",
    order_by=[{"field_name": "creation_time", "ascending": False}],  # Most recent first
    output_format="list",
)

print(f"Found {len(all_versions)} versions of customer support agent\n")

# Compare key metrics across versions
for model in all_versions[:2]: # Compare latest 2 versions
print(f"Version: {model.name}")
print(f" Model ID: {model.model_id}")
print(f" Created: {model.creation_timestamp}")

# Display evaluation metrics
for metric in model.metrics:
print(f" {metric.key}: {metric.value}")

# Display parameters
print(f" Temperature: {model.params.get('temperature', 'N/A')}")
print(f" LLM: {model.params.get('llm', 'N/A')}")
print()

# Find the best performing version by a specific metric
best_by_guidelines = max(
    all_versions,
    key=lambda model: next(
        # Default to 0 so versions missing this metric don't break the comparison
        (m.value for m in model.metrics if m.key == "support_guidelines/mean"),
        0,
    ),
)
print(f"Best version for support guidelines: {best_by_guidelines.name}")
print(
    f" Score: {next((m.value for m in best_by_guidelines.metrics if m.key == 'support_guidelines/mean'), None)}"
)

Side-by-Side Comparison

Once you've identified versions to compare, perform a detailed side-by-side analysis to understand exactly what changed and how it impacted performance:

# Get the two specific models we want to compare
v1 = mlflow.get_logged_model(
    model_id=active_model_info.model_id
)  # Original from previous section
v2 = mlflow.get_logged_model(model_id=improved_model_info.model_id) # Improved

print("=== Version Comparison ===")
print(f"V1: {v1.name} vs V2: {v2.name}\n")

# Compare parameters (what changed)
print("Parameter Changes:")
all_params = set(v1.params.keys()) | set(v2.params.keys())
for param in sorted(all_params):
    v1_val = v1.params.get(param, "N/A")
    v2_val = v2.params.get(param, "N/A")
    if v1_val != v2_val:
        print(f" {param}: '{v1_val}' → '{v2_val}' ✓")
    else:
        print(f" {param}: '{v1_val}' (unchanged)")

# Compare metrics (impact of changes)
print("\nMetric Improvements:")
v1_metrics = {m.key: m.value for m in v1.metrics}
v2_metrics = {m.key: m.value for m in v2.metrics}

metric_keys = ["relevance_to_query/mean", "support_guidelines/mean"]
for metric in metric_keys:
    v1_score = v1_metrics.get(metric, 0)
    v2_score = v2_metrics.get(metric, 0)
    improvement = ((v2_score - v1_score) / max(v1_score, 0.01)) * 100
    print(f" {metric}:")
    print(f"   V1: {v1_score:.3f}")
    print(f"   V2: {v2_score:.3f}")
    print(f"   Change: {improvement:+.1f}%")

Automating Deployment Decisions

One of the most powerful uses of the LoggedModels API is automating deployment decisions. Instead of manually reviewing each version, you can codify your quality criteria and let your CI/CD pipeline automatically determine whether a new version is ready for production.

This approach ensures consistent quality standards and enables rapid, safe deployments:

# Decision logic based on your criteria
min_relevance_threshold = 0.80
min_guidelines_threshold = 0.80

if (
    v2_metrics.get("relevance_to_query/mean", 0) >= min_relevance_threshold
    and v2_metrics.get("support_guidelines/mean", 0) >= min_guidelines_threshold
    and v2_metrics.get("support_guidelines/mean", 0)
    > v1_metrics.get("support_guidelines/mean", 0)
):
    print(f"✅ Version 2 ({v2.name}) is ready for deployment!")
    print(" - Meets all quality thresholds")
    print(" - Shows improvement in support guidelines")
else:
    print("❌ Version 2 needs more work before deployment")

You can extend this pattern to:

  • Create quality gates in your CI/CD pipeline that block deployments if metrics drop (see the sketch after this list)
  • Implement gradual rollouts based on improvement margins
  • Trigger alerts when a version shows significant regression
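
As a sketch of the quality-gate idea, the threshold check above can be turned into a standalone script that your CI/CD pipeline runs after evaluation; a non-zero exit code then blocks the deployment step. The QUALITY_GATES values and the CANDIDATE_MODEL_NAME environment variable below are illustrative assumptions for this example, not part of MLflow:

import os
import sys

from mlflow import search_logged_models

# Illustrative thresholds -- tune these to your own quality bar
QUALITY_GATES = {
    "relevance_to_query/mean": 0.80,
    "support_guidelines/mean": 0.80,
}


def passes_quality_gates(model_name: str) -> bool:
    """Return True if the most recent version of `model_name` meets every gate."""
    candidates = search_logged_models(
        filter_string=f"name = '{model_name}'",
        order_by=[{"field_name": "creation_time", "ascending": False}],
        output_format="list",
    )
    if not candidates:
        print(f"No logged models found for '{model_name}'")
        return False

    metrics = {m.key: m.value for m in candidates[0].metrics}
    failed = False
    for key, threshold in QUALITY_GATES.items():
        score = metrics.get(key, 0)
        if score < threshold:
            print(f"FAILED {key}: {score:.3f} < {threshold:.2f}")
            failed = True
    return not failed


if __name__ == "__main__":
    # CANDIDATE_MODEL_NAME is a hypothetical variable your CI job would set to
    # the full name of the LoggedModel under review
    candidate = os.environ.get("CANDIDATE_MODEL_NAME", "customer_support_agent-v2")
    if not passes_quality_gates(candidate):
        sys.exit(1)  # Non-zero exit blocks the deployment step in CI/CD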

Next Steps

You can now deploy your best-performing version and monitor its performance in production by linking traces to deployed versions.
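
As a preview of that workflow, linking traces to a deployed version typically amounts to setting the chosen version as the active model in your serving code, so that every trace emitted afterwards is associated with that LoggedModel. A minimal sketch, assuming tracing is enabled and using a placeholder model name (the exact name string and handle_support_request function are illustrative):

import mlflow
import openai

# Placeholder name -- use the LoggedModel name of the version you deployed
mlflow.set_active_model(name="customer_support_agent-v2-git-abc12345")

mlflow.openai.autolog()  # Capture LLM calls as spans on each trace


@mlflow.trace
def handle_support_request(question: str) -> str:
    # Traces from production requests are linked to the active model version
    client = openai.OpenAI()
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": question}],
        temperature=0.7,
        max_tokens=150,
    )
    return response.choices[0].message.content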