
Feedback Concepts

This guide introduces the core concepts of feedback and assessment in MLflow's GenAI evaluation framework. Understanding these concepts is essential for effectively measuring and improving the quality of your GenAI applications.

What is Feedback?

Feedback in MLflow represents the result of any quality assessment performed on your GenAI application outputs. It provides a standardized way to capture evaluations, whether they come from automated systems, LLM judges, or human reviewers.

Feedback serves as the bridge between running your application and understanding its quality, enabling you to systematically track performance across different dimensions like correctness, relevance, safety, and adherence to guidelines.

Core Concepts

Feedback Object

The Feedback object (also referred to as an Assessment in some contexts) is the fundamental building block of MLflow's evaluation system. It serves as a standardized container for the result of any quality check, providing a common language for assessment across different evaluation methods.

Every Feedback object contains three core components:

Name: A string identifying the specific quality aspect being assessed

Examples: "correctness", "relevance_to_query", "is_safe", "guideline_adherence_politeness"

Value: The actual result of the assessment, which can be:

  • Numeric scores (e.g., 0.0 to 1.0, 1 to 5)
  • Boolean values (True/False)
  • Categorical labels (e.g., "PASS", "FAIL", "EXCELLENT")
  • Structured data (e.g., {"score": 0.8, "confidence": 0.9})

Rationale: A string explaining why the assessment resulted in the given value

This explanation is crucial for transparency, debugging, and understanding evaluation behavior, especially for LLM-based assessments.
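As a concrete illustration, the sketch below builds a feedback object carrying these three components. It assumes MLflow 3's mlflow.entities.Feedback class together with the optional AssessmentSource used to record who or what produced the assessment; treat the exact constructor arguments as assumptions that may vary between versions.

# Minimal sketch: a Feedback object with a name, value, and rationale.
# Assumes MLflow 3's mlflow.entities.Feedback and AssessmentSource;
# argument names may differ slightly across versions.
from mlflow.entities import AssessmentSource, AssessmentSourceType, Feedback

feedback = Feedback(
    name="correctness",  # the quality dimension being assessed
    value=0.9,           # numeric, boolean, categorical, or structured
    rationale="The answer matches the reference for all key facts.",
    source=AssessmentSource(  # optional: records who produced the assessment
        source_type=AssessmentSourceType.LLM_JUDGE,
        source_id="gpt-4o-mini",
    ),
)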

Assessment Dimensions

Feedback can evaluate various aspects of your GenAI application's performance:

Factual Accuracy: Whether the generated content contains correct information

Answer Completeness: How thoroughly the response addresses the user's question

Logical Consistency: Whether the reasoning and conclusions are sound

Example feedback:

{
  "name": "factual_accuracy",
  "value": 0.85,
  "rationale": "The response correctly identifies 3 out of 4 key facts about MLflow, but incorrectly states the founding year."
}
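To attach an assessment like this to a recorded trace, MLflow 3 provides a logging API. The sketch below assumes mlflow.log_feedback and uses a placeholder trace ID; treat the exact signature as an assumption.

# Sketch: attach the factual_accuracy assessment above to an existing trace.
# Assumes MLflow 3's mlflow.log_feedback; "tr-123" is a placeholder trace ID.
import mlflow

mlflow.log_feedback(
    trace_id="tr-123",
    name="factual_accuracy",
    value=0.85,
    rationale=(
        "The response correctly identifies 3 out of 4 key facts about MLflow, "
        "but incorrectly states the founding year."
    ),
)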

Feedback Lifecycle

Feedback flows through your evaluation process in the following stages:

During Application Execution: Traces are created as your GenAI application processes requests

Post-Execution Evaluation: Feedback is generated by evaluating the trace data (inputs, outputs, intermediate steps)

Multiple Evaluators: Different evaluation methods can assess the same trace, creating multiple feedback objects

Batch or Real-time: Feedback can be generated immediately or in batch processes
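Put together, the lifecycle might look like the sketch below: the application is traced while it runs, and feedback is attached afterwards. The @mlflow.trace decorator, mlflow.get_last_active_trace_id(), and mlflow.log_feedback calls are assumptions based on MLflow 3's tracing API, and the keyword check stands in for a real evaluator.

# Lifecycle sketch: (1) a traced run produces a trace, (2) an evaluator inspects
# the output, (3) feedback is logged against the trace. Helper names are
# assumptions and may vary across MLflow versions.
import mlflow

@mlflow.trace
def answer_question(question: str) -> str:
    # ... call your model or retrieval pipeline here ...
    return "MLflow is an open source platform for the ML lifecycle."

output = answer_question("What is MLflow?")
trace_id = mlflow.get_last_active_trace_id()

# Post-execution evaluation: any number of evaluators can add feedback to this trace.
is_grounded = "open source" in output  # placeholder for a real groundedness check
mlflow.log_feedback(
    trace_id=trace_id,
    name="groundedness",
    value=is_grounded,
    rationale="Heuristic keyword check used as a placeholder evaluator.",
)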

Types of Feedback

MLflow supports different types of feedback to accommodate various evaluation needs:

Scalar Feedback

Numeric Scores: Continuous values representing quality on a scale

  • Range: Often 0.0 to 1.0 or 1 to 5
  • Use case: Measuring degrees of quality like relevance or accuracy
  • Example: {"name": "relevance", "value": 0.87}

Boolean Values: Binary assessments for pass/fail criteria

  • Values: true or false
  • Use case: Safety checks, guideline compliance
  • Example: {"name": "contains_pii", "value": false}

Categorical Feedback

Labels: Discrete categories representing quality levels

  • Values: Predefined labels like "EXCELLENT", "GOOD", "POOR"
  • Use case: Human-like quality ratings
  • Example: {"name": "overall_quality", "value": "GOOD"}

Classification: Specific category assignments

  • Values: Domain-specific categories
  • Use case: Content classification, intent recognition
  • Example: {"name": "response_type", "value": "INFORMATIONAL"}

Structured Feedback

Complex Objects: Rich data structures containing multiple assessment aspects

  • Format: JSON objects with nested properties
  • Use case: Comprehensive evaluations with multiple dimensions
  • Example:
    {
      "name": "comprehensive_quality",
      "value": {
        "overall_score": 0.85,
        "accuracy": 0.9,
        "fluency": 0.8,
        "confidence": 0.75
      }
    }
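The same feedback structure accommodates each of these value types. The sketch below creates one assessment of every kind, again assuming mlflow.entities.Feedback; plain dictionaries would carry the same information.

# One feedback object per value type, all sharing the same structure.
# Assumes mlflow.entities.Feedback.
from mlflow.entities import Feedback

feedbacks = [
    Feedback(name="relevance", value=0.87),          # numeric score
    Feedback(name="contains_pii", value=False),      # boolean check
    Feedback(name="overall_quality", value="GOOD"),  # categorical label
    Feedback(                                        # structured value
        name="comprehensive_quality",
        value={"overall_score": 0.85, "accuracy": 0.9, "fluency": 0.8, "confidence": 0.75},
        rationale="Aggregated from per-dimension sub-scores.",
    ),
]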

Evaluation Methods

Different approaches for generating feedback:

Automated LLM Evaluation: Using language models to assess quality

Advantages:

  • Scales to large volumes of data
  • Evaluates subjective criteria
  • Provides detailed reasoning
  • Applies consistent evaluation criteria

Use Cases:

  • Content quality assessment
  • Relevance evaluation
  • Style and tone analysis
  • Complex reasoning evaluation

Example: An LLM judge evaluating response helpfulness with detailed rationale explaining specific strengths and weaknesses.
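A minimal LLM-judge sketch is shown below: it asks a model for a score and a rationale and packages the reply as a feedback record. It uses a generic OpenAI-compatible client; the prompt, model name, and JSON contract are illustrative assumptions, not MLflow's built-in judges.

# Sketch of an LLM judge: rate helpfulness and explain why, then convert the
# reply into a feedback record. The client, model, and prompt are placeholders.
import json
from openai import OpenAI

client = OpenAI()

def judge_helpfulness(question: str, answer: str) -> dict:
    prompt = (
        "Rate how helpful the answer is to the question on a 0.0-1.0 scale. "
        'Reply with JSON: {"score": <float>, "rationale": "<why>"}.\n\n'
        f"Question: {question}\nAnswer: {answer}"
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder judge model
        messages=[{"role": "user", "content": prompt}],
    )
    result = json.loads(response.choices[0].message.content)
    return {"name": "helpfulness", "value": result["score"], "rationale": result["rationale"]}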

Integration with MLflow

Feedback integrates seamlessly with MLflow's ecosystem:

Trace Connection

Direct Association: Feedback objects are linked to specific traces, providing context about what was evaluated

Execution Context: Access to complete application execution data when performing evaluations

Multi-Step Evaluation: Ability to evaluate individual spans within a trace or the overall trace result
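The sketch below illustrates this trace connection: fetch a trace, inspect an intermediate step, and record feedback against the trace. It assumes mlflow.get_trace and mlflow.log_feedback, and the span names and attributes accessed here are assumptions.

# Sketch: use a trace's execution context when generating feedback.
# Assumes mlflow.get_trace / mlflow.log_feedback; "tr-123", the span names,
# and trace.info.trace_id are assumptions.
import mlflow

trace = mlflow.get_trace("tr-123")
retriever_spans = [s for s in trace.data.spans if "retriev" in s.name.lower()]

# Evaluate an intermediate step (retrieval) rather than only the final answer.
mlflow.log_feedback(
    trace_id=trace.info.trace_id,
    name="retrieval_step_present",
    value=len(retriever_spans) > 0,
    rationale=f"Found {len(retriever_spans)} retrieval span(s) in the trace.",
)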

Evaluation Framework

Scorer Functions: Automated functions that generate feedback based on trace data

Judge Functions: LLM-based evaluators that provide intelligent assessment

Custom Metrics: Ability to define domain-specific evaluation criteria
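A custom scorer might look like the sketch below. It assumes MLflow 3's scorer decorator from mlflow.genai.scorers and the mlflow.genai.evaluate entry point; the decorator's argument names and the commented-out evaluate() call are assumptions drawn from that API.

# Sketch of a custom scorer returning a Feedback object. The scorer decorator
# and the evaluate() usage below are assumptions about MLflow 3's GenAI API.
from mlflow.entities import Feedback
from mlflow.genai.scorers import scorer

@scorer
def concise_enough(outputs) -> Feedback:
    word_count = len(str(outputs).split())
    return Feedback(
        name="concise_enough",
        value=word_count <= 150,
        rationale=f"Response contains {word_count} words (limit: 150).",
    )

# Usage sketch (eval_dataset is a hypothetical evaluation dataset):
# mlflow.genai.evaluate(data=eval_dataset, scorers=[concise_enough])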

Analysis and Monitoring

Quality Dashboards: Visualize feedback trends and patterns over time

Performance Tracking: Monitor how changes to your application affect quality metrics

Alerting: Set up notifications when quality metrics fall below thresholds
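As a simple illustration of threshold-based monitoring, the sketch below aggregates feedback records per dimension and flags any whose mean score falls below a target. The records, threshold, and print-based alert are placeholders; no specific MLflow dashboard or alerting API is assumed.

# Sketch: aggregate feedback per dimension and flag regressions.
# Records, threshold, and the alerting mechanism are placeholders.
from collections import defaultdict

records = [
    {"name": "relevance", "value": 0.91},
    {"name": "relevance", "value": 0.62},
    {"name": "factual_accuracy", "value": 0.85},
]

THRESHOLD = 0.8
scores = defaultdict(list)
for record in records:
    scores[record["name"]].append(record["value"])

for name, values in scores.items():
    mean = sum(values) / len(values)
    if mean < THRESHOLD:
        print(f"ALERT: {name} averaged {mean:.2f}, below the {THRESHOLD} threshold")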

Best Practices

Feedback Design

Clear Names: Use descriptive, consistent names for feedback dimensions

Appropriate Scales: Choose value types and ranges that match your evaluation needs

Meaningful Rationale: Provide clear explanations that help with debugging and improvement

Evaluation Strategy

Multiple Dimensions: Assess various aspects of quality, not just a single metric

Balanced Approach: Combine automated and human evaluation methods

Regular Review: Periodically review and update evaluation criteria

Quality Monitoring

Baseline Establishment: Set quality baselines for comparison

Trend Monitoring: Track quality changes over time and across versions

Root Cause Analysis: Use feedback and trace data together to understand quality issues

Getting Started

To begin using feedback in your GenAI evaluation workflow:

LLM Evaluation Guide: Learn how to evaluate your GenAI applications

Custom Metrics: Create domain-specific evaluation functions

Trace Analysis: Explore how to query and analyze trace data with feedback

Quality Monitoring: Set up ongoing quality assessment


Feedback concepts form the foundation for systematic quality assessment in MLflow. By understanding how feedback objects work and integrate with traces, you can build comprehensive evaluation strategies that improve your GenAI applications over time.