Feedback Concepts
This guide introduces the core concepts of feedback and assessment in MLflow's GenAI evaluation framework. Understanding these concepts is essential for effectively measuring and improving the quality of your GenAI applications.
What is Feedback?
Feedback in MLflow represents the result of any quality assessment performed on your GenAI application outputs. It provides a standardized way to capture evaluations, whether they come from automated systems, LLM judges, or human reviewers.
Feedback serves as the bridge between running your application and understanding its quality, enabling you to systematically track performance across different dimensions like correctness, relevance, safety, and adherence to guidelines.
Core Concepts
Feedback Object
The Feedback object (also referred to as an Assessment in some contexts) is the fundamental building block of MLflow's evaluation system. It serves as a standardized container for the result of any quality check, providing a common language for assessment across different evaluation methods.
Feedback Structure
Every Feedback object contains three core components:
Name: A string identifying the specific quality aspect being assessed. Examples: "correctness", "relevance_to_query", "is_safe", "guideline_adherence_politeness"
Value: The actual result of the assessment, which can be:
- Numeric scores (e.g., 0.0 to 1.0 or 1 to 5)
- Boolean values (True/False)
- Categorical labels (e.g., "PASS", "FAIL", "EXCELLENT")
- Structured data (e.g., {"score": 0.8, "confidence": 0.9})
Rationale: A string explaining why the assessment resulted in the given value.
This explanation is crucial for transparency, debugging, and understanding evaluation behavior, especially for LLM-based assessments.
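For example, a minimal sketch of constructing such an object, assuming MLflow 3.x, where the Feedback entity lives in mlflow.entities:

```python
# A minimal sketch, assuming MLflow 3.x's Feedback entity.
from mlflow.entities import Feedback

feedback = Feedback(
    name="factual_accuracy",  # the quality dimension being assessed
    value=0.85,               # numeric score on a 0.0-1.0 scale
    rationale="3 of 4 key facts are correct; the founding year is wrong.",
)
```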
Feedback Sources
Feedback can originate from multiple sources, each with different characteristics:
LLM-based Evaluations: Automated assessments using language models as judges
- Fast and scalable
- Can evaluate complex, subjective criteria
- Provide detailed reasoning in rationale
Programmatic Checks: Rule-based or algorithmic evaluations
- Deterministic and consistent
- Fast execution
- Good for objective, measurable criteria
Human Reviews: Manual assessments from human evaluators
- Highest quality for subjective evaluations
- Slower and more expensive
- Essential for establishing ground truth
All feedback types are treated equally in MLflow and can be combined to provide comprehensive quality assessment.
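The source of a feedback object can be recorded explicitly alongside its value. A minimal sketch, assuming MLflow 3.x's AssessmentSource and AssessmentSourceType entities:

```python
# A minimal sketch, assuming MLflow 3.x's assessment-source entities.
from mlflow.entities import AssessmentSource, AssessmentSourceType

# An automated LLM-based evaluator (the judge id is illustrative).
llm_source = AssessmentSource(
    source_type=AssessmentSourceType.LLM_JUDGE,
    source_id="gpt-4o",
)

# A human reviewer.
human_source = AssessmentSource(
    source_type=AssessmentSourceType.HUMAN,
    source_id="reviewer@example.com",
)
```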
Trace Attachment
Feedback objects are attached to MLflow Traces, creating a direct connection between application execution and quality assessment:
Execution + Assessment: Each trace captures how your application processed a request, while feedback captures how well it performed
Multi-dimensional Quality: A single trace can have multiple feedback objects assessing different quality dimensions
Historical Analysis: Attached feedback enables tracking quality trends over time and across different application versions
Debugging Context: When quality issues arise, you can examine both the execution trace and the assessment rationale
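A minimal sketch of attaching feedback to a trace, assuming MLflow 3.x's mlflow.log_feedback API; the trace ID shown is hypothetical:

```python
import mlflow
from mlflow.entities import AssessmentSource, AssessmentSourceType

mlflow.log_feedback(
    trace_id="tr-1234567890abcdef",  # hypothetical ID of a previously captured trace
    name="relevance_to_query",
    value="HIGH",
    rationale="Response directly answers the user's question.",
    source=AssessmentSource(
        source_type=AssessmentSourceType.LLM_JUDGE,
        source_id="gpt-4o",
    ),
)
```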
Assessment Dimensions
Feedback can evaluate various aspects of your GenAI application's performance:
Correctness & Accuracy
Factual Accuracy: Whether the generated content contains correct information
Answer Completeness: How thoroughly the response addresses the user's question
Logical Consistency: Whether the reasoning and conclusions are sound
Example feedback:
```json
{
  "name": "factual_accuracy",
  "value": 0.85,
  "rationale": "The response correctly identifies 3 out of 4 key facts about MLflow, but incorrectly states the founding year."
}
```
Relevance & Context
Query Relevance: How well the response addresses the specific user question
Context Utilization: Whether retrieved documents or provided context were used effectively
Topic Adherence: Staying on-topic and avoiding irrelevant tangents
Example feedback:
```json
{
  "name": "relevance_to_query",
  "value": "HIGH",
  "rationale": "Response directly answers the user's question about MLflow features and provides relevant examples."
}
```
Safety & Guidelines
Content Safety: Detecting harmful, inappropriate, or toxic content
Guideline Adherence: Following specific organizational or ethical guidelines
Bias Detection: Identifying unfair bias or discrimination in responses
Example feedback:
```json
{
  "name": "is_safe",
  "value": true,
  "rationale": "Content contains no harmful, toxic, or inappropriate material."
}
```
Quality & Style
Writing Quality: Grammar, clarity, and coherence of the response
Tone Appropriateness: Whether the tone matches the intended context
Helpfulness: How useful the response is to the user
Example feedback:
```json
{
  "name": "helpfulness",
  "value": 4,
  "rationale": "Response provides clear, actionable information but could include more specific examples."
}
```
Feedback Lifecycle
Understanding how feedback flows through your evaluation process:
Generation
During Application Execution: Traces are created as your GenAI application processes requests
Post-Execution Evaluation: Feedback is generated by evaluating the trace data (inputs, outputs, intermediate steps)
Multiple Evaluators: Different evaluation methods can assess the same trace, creating multiple feedback objects
Batch or Real-time: Feedback can be generated immediately or in batch processes
Attachment
Trace Association: Each feedback object is linked to a specific trace using trace IDs
Persistent Storage: Feedback is stored alongside trace data in MLflow's backend
Metadata Preservation: All context about the evaluation method and timing is maintained
Version Tracking: Changes to feedback or re-evaluations are tracked over time
Aggregation
Quality Metrics: Individual feedback objects can be aggregated into overall quality scores
Trend Analysis: Historical feedback enables tracking quality changes over time
Comparative Analysis: Compare feedback across different model versions, prompts, or configurations
Reporting: Generate quality reports and dashboards from aggregated feedback data
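A minimal sketch of the aggregation step, assuming MLflow 3.x, where mlflow.search_traces() returns a pandas DataFrame carrying each trace's attached assessments (the exact column layout may vary by version):

```python
import mlflow

# Fetch traces for a (hypothetical) experiment; each row carries its assessments.
traces = mlflow.search_traces(experiment_ids=["1"])

scores = []
for assessments in traces["assessments"]:
    for a in assessments:
        # Collect numeric "relevance" feedback values; skip other assessments.
        value = getattr(a, "value", None)
        if getattr(a, "name", None) == "relevance" and isinstance(value, (int, float)):
            scores.append(value)

if scores:
    print(f"Mean relevance over {len(scores)} values: {sum(scores) / len(scores):.2f}")
```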
Types of Feedback
MLflow supports different types of feedback to accommodate various evaluation needs:
Scalar Feedback
Numeric Scores: Continuous values representing quality on a scale
- Range: Often 0.0 to 1.0 or 1 to 5
- Use case: Measuring degrees of quality like relevance or accuracy
- Example: {"name": "relevance", "value": 0.87}
Boolean Values: Binary assessments for pass/fail criteria
- Values: true or false
- Use case: Safety checks, guideline compliance
- Example: {"name": "contains_pii", "value": false}
Categorical Feedback
Labels: Discrete categories representing quality levels
- Values: Predefined labels like "EXCELLENT", "GOOD", "POOR"
- Use case: Human-like quality ratings
- Example: {"name": "overall_quality", "value": "GOOD"}
Classification: Specific category assignments
- Values: Domain-specific categories
- Use case: Content classification, intent recognition
- Example: {"name": "response_type", "value": "INFORMATIONAL"}
Structured Feedback
Complex Objects: Rich data structures containing multiple assessment aspects
- Format: JSON objects with nested properties
- Use case: Comprehensive evaluations with multiple dimensions
- Example:
```json
{
  "name": "comprehensive_quality",
  "value": {
    "overall_score": 0.85,
    "accuracy": 0.9,
    "fluency": 0.8,
    "confidence": 0.75
  }
}
```
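A minimal sketch of creating structured feedback in code, assuming the Feedback entity accepts a dict value as in MLflow 3.x:

```python
from mlflow.entities import Feedback

structured = Feedback(
    name="comprehensive_quality",
    value={"overall_score": 0.85, "accuracy": 0.9, "fluency": 0.8},  # nested dimensions
    rationale="Accurate and fluent; minor omissions lower the overall score.",
)
```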
Evaluation Methods
Different approaches for generating feedback:
LLM Judges
Automated LLM Evaluation: Using language models to assess quality
Advantages:
- Scale to large volumes of data
- Evaluate subjective criteria
- Provide detailed reasoning
- Consistent evaluation criteria
Use Cases:
- Content quality assessment
- Relevance evaluation
- Style and tone analysis
- Complex reasoning evaluation
Example: An LLM judge evaluating response helpfulness with detailed rationale explaining specific strengths and weaknesses.
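A minimal sketch of an LLM judge written as a custom scorer, assuming MLflow 3.x's @scorer decorator; call_llm is a hypothetical stand-in for your own LLM client:

```python
import json
from mlflow.entities import Feedback
from mlflow.genai.scorers import scorer

def call_llm(prompt: str) -> str:
    # Hypothetical stand-in: replace with a real call to your LLM provider.
    return '{"score": 4, "reason": "Clear and actionable, but lacks examples."}'

@scorer
def helpfulness(inputs, outputs):
    prompt = (
        "Rate the helpfulness of this response on a 1-5 scale and explain briefly.\n"
        f"Question: {inputs}\nResponse: {outputs}\n"
        'Reply as JSON: {"score": <int>, "reason": "<string>"}'
    )
    result = json.loads(call_llm(prompt))
    return Feedback(name="helpfulness", value=result["score"], rationale=result["reason"])
```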
Programmatic Checks
Rule-Based Evaluation: Algorithmic assessment using predefined logic
Advantages:
- Deterministic and consistent
- Fast execution
- Objective measurement
- Easy to understand and debug
Use Cases:
- Format validation
- Length constraints
- Keyword presence/absence
- Quantitative metrics
Example: Checking if a response contains required elements or meets length requirements.
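A minimal sketch of such a check, assuming MLflow 3.x's @scorer decorator (the length bounds are illustrative):

```python
from mlflow.genai.scorers import scorer

@scorer
def meets_length_requirements(outputs) -> bool:
    # Deterministic pass/fail: the response must be 50-2000 characters long.
    return 50 <= len(str(outputs)) <= 2000
```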
Human Review
Manual Assessment: Human evaluators providing direct feedback
Advantages:
- Highest quality for subjective criteria
- Nuanced understanding
- Ground truth establishment
- Complex context evaluation
Use Cases:
- Quality benchmarking
- Edge case evaluation
- Sensitive content review
- Final quality validation
Example: Human reviewers assessing response quality using standardized rubrics and providing detailed feedback.
Integration with MLflow
Feedback integrates seamlessly with MLflow's ecosystem:
Trace Connection
Direct Association: Feedback objects are linked to specific traces, providing context about what was evaluated
Execution Context: Access to complete application execution data when performing evaluations
Multi-Step Evaluation: Ability to evaluate individual spans within a trace or the overall trace result
Evaluation Framework
Scorer Functions: Automated functions that generate feedback based on trace data
Judge Functions: LLM-based evaluators that provide intelligent assessment
Custom Metrics: Ability to define domain-specific evaluation criteria
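A minimal sketch tying scorers to an evaluation run, assuming MLflow 3.x's mlflow.genai.evaluate API; my_app is a hypothetical application entry point, and the scorers are the ones sketched above:

```python
import mlflow

eval_data = [
    {"inputs": {"question": "What is MLflow?"}},
    {"inputs": {"question": "How do traces relate to feedback?"}},
]

def my_app(question: str) -> str:
    # Hypothetical entry point; replace with your real application.
    return f"An answer about: {question}"

results = mlflow.genai.evaluate(
    data=eval_data,
    predict_fn=my_app,  # called once per row with the inputs dict unpacked as kwargs
    scorers=[helpfulness, meets_length_requirements],
)
```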
Analysis and Monitoring
Quality Dashboards: Visualize feedback trends and patterns over time
Performance Tracking: Monitor how changes to your application affect quality metrics
Alerting: Set up notifications when quality metrics fall below thresholds
Best Practices
Feedback Design
Clear Names: Use descriptive, consistent names for feedback dimensions
Appropriate Scales: Choose value types and ranges that match your evaluation needs
Meaningful Rationale: Provide clear explanations that help with debugging and improvement
Evaluation Strategy
Multiple Dimensions: Assess various aspects of quality, not just a single metric
Balanced Approach: Combine automated and human evaluation methods
Regular Review: Periodically review and update evaluation criteria
Quality Monitoring
Baseline Establishment: Set quality baselines for comparison
Trend Monitoring: Track quality changes over time and across versions
Root Cause Analysis: Use feedback and trace data together to understand quality issues
Getting Started
To begin using feedback in your GenAI evaluation workflow:
LLM Evaluation Guide: Learn how to evaluate your GenAI applications
Custom Metrics: Create domain-specific evaluation functions
Trace Analysis: Explore how to query and analyze trace data with feedback
Quality Monitoring: Set up ongoing quality assessment
Feedback concepts form the foundation for systematic quality assessment in MLflow. By understanding how feedback objects work and integrate with traces, you can build comprehensive evaluation strategies that improve your GenAI applications over time.