Compare App Versions
After evaluating your customer support agent, let's make an improvement and compare performance across versions. This guide shows how to track changes and objectively compare different versions of your application.
Creating an Improved Version
Building on our customer support agent from previous sections, let's improve it by adding a more empathetic tone to the prompt:
import mlflow
import subprocess
import openai
# Get current git commit for the new version
try:
    git_commit = (
        subprocess.check_output(["git", "rev-parse", "HEAD"])
        .decode("ascii")
        .strip()[:8]
    )
    version_identifier = f"git-{git_commit}"
except subprocess.CalledProcessError:
    version_identifier = "local-dev"  # Fallback if not in a git repo
improved_model_name = f"customer_support_agent-v2-{version_identifier}"
# Set the new active model
improved_model_info = mlflow.set_active_model(name=improved_model_name)
# Log parameters for the improved version
improved_params = {
"llm": "gpt-4o-mini",
"temperature": 0.7,
"retrieval_strategy": "vector_search_v3",
}
mlflow.log_model_params(model_id=improved_model_info.model_id, params=improved_params)
# Define the improved agent with more empathetic prompting
def improved_agent(question: str) -> str:
    client = openai.OpenAI()
    # Enhanced system prompt with empathy focus
    system_prompt = """You are a caring and empathetic customer support agent.
    Always acknowledge the customer's feelings and frustrations before providing solutions.
    Use phrases like 'I understand how frustrating this must be' and 'I'm here to help'.
    Provide clear, actionable steps while maintaining a warm, supportive tone."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": question},
        ],
        temperature=0.7,
        max_tokens=150,
    )
    return response.choices[0].message.content
Now let's evaluate this improved version using the same dataset and scorers from the previous section. This ensures we can make an apples-to-apples comparison between versions:
# Evaluate the improved version with the same dataset
evaluation_results_v2 = mlflow.genai.evaluate(
data=eval_data, # Same evaluation dataset from previous section
predict_fn=improved_agent,
model_id=improved_model_info.model_id,
scorers=[
relevance_scorer,
support_guidelines,
], # Same scorers from previous section
)
print(f"V2 Metrics: {evaluation_results_v2.metrics}")
Comparing Versions in the MLflow UI
After creating multiple versions of your application, you need to systematically compare them to:
- Validate improvements - Confirm that your changes actually improved the metrics you care about
- Identify regressions - Ensure new versions don't degrade performance in unexpected ways
- Select the best version - Make data-driven decisions about which version to deploy to production
- Guide iteration - Understand which changes had the most impact to inform future improvements
The MLflow UI provides visual comparison tools that make this analysis intuitive:
- Navigate to your experiment containing the customer support agent versions
- Select multiple runs by checking the boxes next to each version's evaluation run
- Click "Compare" to see:
- Side-by-side parameter differences
- Metric comparisons with charts
- Parallel coordinates plot for multi-metric analysis
Comparing Versions with the MLflow API
For automated workflows like CI/CD pipelines, regression testing, or programmatic version selection, MLflow provides powerful APIs to search, rank, and compare your LoggedModel versions. These APIs enable you to:
- Automatically flag versions that don't meet quality thresholds
- Generate comparison reports for code reviews (see the report sketch after this list)
- Select the best version for deployment without manual intervention
- Track version improvements over time in your analytics
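For example, here is a minimal sketch of the comparison-report idea, suitable for attaching to a code review. It assumes you already have the two model IDs (for instance, the ones created earlier in this guide); the write_comparison_report helper and the comparison.md output path are illustrative names, not part of the MLflow API:
import mlflow

def write_comparison_report(v1_model_id: str, v2_model_id: str, path: str = "comparison.md") -> None:
    # Fetch both LoggedModel versions and index their evaluation metrics by key
    v1 = mlflow.get_logged_model(model_id=v1_model_id)
    v2 = mlflow.get_logged_model(model_id=v2_model_id)
    v1_metrics = {m.key: m.value for m in v1.metrics}
    v2_metrics = {m.key: m.value for m in v2.metrics}
    # Build a small Markdown table that reviewers can scan at a glance
    lines = [f"# {v1.name} vs {v2.name}", "", "| Metric | V1 | V2 |", "| --- | --- | --- |"]
    for key in sorted(set(v1_metrics) | set(v2_metrics)):
        lines.append(f"| {key} | {v1_metrics.get(key, 'N/A')} | {v2_metrics.get(key, 'N/A')} |")
    with open(path, "w") as f:
        f.write("\n".join(lines))

# Example usage with the model IDs from this guide:
# write_comparison_report(active_model_info.model_id, improved_model_info.model_id)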
Ranking and Searching Versions
Use search_logged_models to find all versions of your application and rank them by quality, speed, or other performance characteristics. This helps you identify trends and find the best performing versions:
from mlflow import search_logged_models
# Search for all versions of our customer support agent
# Order by creation time to see version progression
all_versions = search_logged_models(
filter_string=f"name IN ('{logged_model_name}', '{improved_model_name}')",
order_by=[{"field_name": "creation_time", "ascending": False}], # Most recent first
output_format="list",
)
print(f"Found {len(all_versions)} versions of customer support agent\n")
# Compare key metrics across versions
for model in all_versions[:2]:  # Compare latest 2 versions
    print(f"Version: {model.name}")
    print(f" Model ID: {model.model_id}")
    print(f" Created: {model.creation_timestamp}")
    # Display evaluation metrics
    for metric in model.metrics:
        print(f" {metric.key}: {metric.value}")
    # Display parameters
    print(f" Temperature: {model.params.get('temperature', 'N/A')}")
    print(f" LLM: {model.params.get('llm', 'N/A')}")
    print()
# Find the best performing version by a specific metric
best_by_guidelines = max(
    all_versions,
    key=lambda model: next(
        # Default to 0.0 so versions without this metric don't break the comparison
        (m.value for m in model.metrics if m.key == "support_guidelines/mean"),
        0.0,
    ),
)
print(f"Best version for support guidelines: {best_by_guidelines.name}")
print(
    f" Score: {next((m.value for m in best_by_guidelines.metrics if m.key == 'support_guidelines/mean'), None)}"
)
Side-by-Side Comparison
Once you've identified versions to compare, perform a detailed side-by-side analysis to understand exactly what changed and how it impacted performance:
# Get the two specific models we want to compare
v1 = mlflow.get_logged_model(
model_id=active_model_info.model_id
) # Original from previous section
v2 = mlflow.get_logged_model(model_id=improved_model_info.model_id) # Improved
print("=== Version Comparison ===")
print(f"V1: {v1.name} vs V2: {v2.name}\n")
# Compare parameters (what changed)
print("Parameter Changes:")
all_params = set(v1.params.keys()) | set(v2.params.keys())
for param in sorted(all_params):
    v1_val = v1.params.get(param, "N/A")
    v2_val = v2.params.get(param, "N/A")
    if v1_val != v2_val:
        print(f" {param}: '{v1_val}' → '{v2_val}' ✓")
    else:
        print(f" {param}: '{v1_val}' (unchanged)")
# Compare metrics (impact of changes)
print("\nMetric Improvements:")
v1_metrics = {m.key: m.value for m in v1.metrics}
v2_metrics = {m.key: m.value for m in v2.metrics}
metric_keys = ["relevance_to_query/mean", "support_guidelines/mean"]
for metric in metric_keys:
    v1_score = v1_metrics.get(metric, 0)
    v2_score = v2_metrics.get(metric, 0)
    improvement = ((v2_score - v1_score) / max(v1_score, 0.01)) * 100
    print(f" {metric}:")
    print(f"   V1: {v1_score:.3f}")
    print(f"   V2: {v2_score:.3f}")
    print(f"   Change: {improvement:+.1f}%")
Automating Deployment Decisions
One of the most powerful uses of the LoggedModel API is automating deployment decisions. Instead of manually reviewing each version, you can codify your quality criteria and let your CI/CD pipeline automatically determine whether a new version is ready for production.
This approach ensures consistent quality standards and enables rapid, safe deployments:
# Decision logic based on your criteria
min_relevance_threshold = 0.80
min_guidelines_threshold = 0.80
if (
    v2_metrics.get("relevance_to_query/mean", 0) >= min_relevance_threshold
    and v2_metrics.get("support_guidelines/mean", 0) >= min_guidelines_threshold
    and v2_metrics.get("support_guidelines/mean", 0)
    > v1_metrics.get("support_guidelines/mean", 0)
):
    print(f"✅ Version 2 ({v2.name}) is ready for deployment!")
    print(" - Meets all quality thresholds")
    print(" - Shows improvement in support guidelines")
else:
    print("❌ Version 2 needs more work before deployment")
You can extend this pattern to:
- Create quality gates in your CI/CD pipeline that block deployments if metrics drop (see the CI sketch after this list)
- Implement gradual rollouts based on improvement margins
- Trigger alerts when a version shows significant regression
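As a minimal sketch of the quality-gate idea, assuming the same metric keys and thresholds used above, the snippet below exits with a non-zero status when the candidate version misses a threshold or regresses against the baseline, which most CI systems treat as a failed check. The quality_gate helper is illustrative, not an MLflow API:
import sys
import mlflow

def quality_gate(candidate_model_id: str, baseline_model_id: str) -> None:
    # Load both LoggedModel versions and index their metrics by key
    candidate = mlflow.get_logged_model(model_id=candidate_model_id)
    baseline = mlflow.get_logged_model(model_id=baseline_model_id)
    candidate_metrics = {m.key: m.value for m in candidate.metrics}
    baseline_metrics = {m.key: m.value for m in baseline.metrics}
    failures = []
    for key, threshold in [("relevance_to_query/mean", 0.80), ("support_guidelines/mean", 0.80)]:
        score = candidate_metrics.get(key, 0.0)
        if score < threshold:
            failures.append(f"{key}={score:.3f} is below the {threshold} threshold")
        if score < baseline_metrics.get(key, 0.0):
            failures.append(f"{key}={score:.3f} regressed vs. the baseline")
    if failures:
        print("❌ Quality gate failed:\n  " + "\n  ".join(failures))
        sys.exit(1)  # Non-zero exit blocks the deployment step in CI
    print("✅ Quality gate passed")

# Example usage in a CI job, with the model IDs from this guide:
# quality_gate(improved_model_info.model_id, active_model_info.model_id)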
Next Steps
You can now deploy your best performing version and monitor its performance in production by linking traces to deployed versions.
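For example, one way to link production traces to the version you ship is to make that version the active model in your serving code; with autologging enabled, traces from the agent's LLM calls are then associated with that LoggedModel. A minimal sketch, assuming you deploy the improved agent defined above (the handle_support_request wrapper is illustrative):
import mlflow

mlflow.openai.autolog()  # Capture traces for the agent's OpenAI calls

# Reuse the evaluated version's name so production traces are linked to that LoggedModel
mlflow.set_active_model(name=improved_model_name)

def handle_support_request(question: str) -> str:
    # Traces generated here are associated with the active model version
    return improved_agent(question)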