LLM Evaluation Examples

The notebooks listed below contain step-by-step tutorials on how to use MLflow to evaluate LLMs. The first notebook evaluates an LLM for question answering with a prompt engineering approach. The second evaluates a RAG system. Both demonstrate MLflow's built-in metrics, such as token_count and toxicity, as well as LLM-judged metrics, such as answer_relevance (see the sketch below). The third notebook repeats the RAG evaluation but uses a Databricks-served llama2-70b model as the judge instead of gpt-4.
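
As a rough sketch of the pattern these notebooks follow (the dataset, column names, and judge-model URI below are illustrative assumptions, not the notebooks' exact code), mlflow.evaluate can score a table of questions and pre-computed answers with both built-in metrics and an LLM-judged metric:

```python
import mlflow
import pandas as pd

# Illustrative evaluation data; the notebooks build theirs from real model outputs.
eval_data = pd.DataFrame(
    {
        "inputs": [
            "What is MLflow?",
            "What is a RAG system?",
        ],
        "predictions": [
            "MLflow is an open-source platform for managing the ML lifecycle.",
            "A RAG system retrieves relevant documents and passes them to an LLM as context.",
        ],
    }
)

# LLM-judged metric. The judge model URI is an assumption: gpt-4 via OpenAI here;
# the third notebook instead points the judge at a Databricks-served llama2-70b endpoint.
answer_relevance = mlflow.metrics.genai.answer_relevance(model="openai:/gpt-4")

with mlflow.start_run():
    results = mlflow.evaluate(
        data=eval_data,
        predictions="predictions",
        extra_metrics=[
            mlflow.metrics.token_count(),  # built-in metric
            mlflow.metrics.toxicity(),     # built-in metric
            answer_relevance,              # LLM-judged metric
        ],
    )
    print(results.metrics)
```

Swapping the judge, as the third notebook does, amounts to passing a different endpoint URI to the metric's model argument.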