LLM Evaluation Examples
The notebooks listed below contain step-by-step tutorials on how to use MLflow to evaluate LLMs.
The first set of notebooks focuses on evaluating an LLM for question answering with a prompt engineering approach. The second set focuses on evaluating a RAG system.
All the notebooks demonstrate how to use MLflow's built-in metrics, such as token_count and toxicity, as well as LLM-judged metrics such as answer_relevance.
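As a quick orientation before diving into the tutorials, the sketch below shows roughly how these metrics come together in a single mlflow.evaluate call. It scores a small set of pre-computed answers using the question-answering model type (which enables built-in metrics such as toxicity and token_count) plus the LLM-judged answer_relevance metric. The sample questions, answers, and the GPT-4 judge model are illustrative assumptions, not part of the tutorials, and the judge metric requires OpenAI credentials to be configured.

```python
import mlflow
import pandas as pd

# Illustrative QA evaluation data with pre-computed model outputs.
eval_data = pd.DataFrame(
    {
        "inputs": [
            "What is MLflow?",
            "What is Apache Spark?",
        ],
        "outputs": [
            "MLflow is an open-source platform for managing the machine learning lifecycle.",
            "Apache Spark is a distributed computing engine for large-scale data processing.",
        ],
        "ground_truth": [
            "MLflow is an open-source platform for managing the end-to-end machine learning lifecycle.",
            "Apache Spark is an open-source, distributed computing system for big data workloads.",
        ],
    }
)

with mlflow.start_run():
    results = mlflow.evaluate(
        data=eval_data,
        predictions="outputs",
        targets="ground_truth",
        # The question-answering model type enables built-in metrics such as
        # toxicity and token_count without extra configuration.
        model_type="question-answering",
        # answer_relevance is an LLM-judged metric; it calls out to a judge
        # model (here GPT-4, which assumes OPENAI_API_KEY is set).
        extra_metrics=[mlflow.metrics.genai.answer_relevance(model="openai:/gpt-4")],
    )
    print(results.metrics)
```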
QA Evaluation Tutorial
Learn how to evaluate various LLMs and RAG systems with MLflow, leveraging simple metrics such as toxicity, LLM-judged metrics such as relevance, and even custom LLM-judged metrics such as professionalism.
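To give a sense of what a custom LLM-judged metric looks like, here is a minimal sketch of a professionalism metric built with make_genai_metric. The definition, grading prompt, example, and GPT-4 judge model are illustrative placeholders rather than the tutorial's exact configuration.

```python
from mlflow.metrics.genai import EvaluationExample, make_genai_metric

# An illustrative graded example that shows the judge what a low score looks like.
professionalism_example = EvaluationExample(
    input="What is MLflow?",
    output="MLflow is like your friendly neighborhood toolkit for ML, yo!",
    score=2,
    justification="The response uses casual slang, which lowers its professionalism.",
)

professionalism = make_genai_metric(
    name="professionalism",
    definition="Professionalism measures whether the response uses a formal, respectful tone.",
    grading_prompt=(
        "Score from 1 to 5, where 1 is very casual or inappropriate language "
        "and 5 is consistently formal and polished."
    ),
    examples=[professionalism_example],
    model="openai:/gpt-4",  # judge model; assumes OpenAI credentials are configured
    greater_is_better=True,
)

# The metric can then be passed to mlflow.evaluate via extra_metrics,
# alongside the built-in and LLM-judged metrics shown above.
```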
Learn how to evaluate various open-source LLMs available on Hugging Face, leveraging MLflow's built-in LLM metrics and experiment tracking to manage models and evaluation results.
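The sketch below illustrates the general pattern that notebook follows: log a Hugging Face pipeline with MLflow, then evaluate it with the text model type so the metrics land in the same tracked run. The gpt2 model and the prompts are placeholder assumptions chosen to keep the example small; the tutorial itself uses a larger chat model.

```python
import mlflow
import pandas as pd
from transformers import pipeline

# A small open-source text-generation model used here purely for illustration.
text_gen = pipeline("text-generation", model="gpt2")

eval_data = pd.DataFrame(
    {"inputs": ["Explain what MLflow is.", "Explain what a vector database is."]}
)

with mlflow.start_run():
    # Log the pipeline so the evaluated model is tracked alongside its results.
    model_info = mlflow.transformers.log_model(
        transformers_model=text_gen,
        artifact_path="text_generator",
    )

    # model_type="text" applies MLflow's built-in text metrics
    # (e.g. toxicity and readability scores) to the generated outputs.
    results = mlflow.evaluate(
        model_info.model_uri,
        data=eval_data,
        model_type="text",
    )
    print(results.metrics)
```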