LLM Evaluation Examples
The notebooks listed below contain step-by-step tutorials on how to use MLflow to evaluate LLMs.
The first set of notebooks focuses on evaluating an LLM for question answering with a prompt engineering approach. The second set focuses on evaluating a RAG system.
All the notebooks demonstrate how to use MLflow's built-in metrics, such as token_count and toxicity, as well as LLM-judged metrics such as answer_relevance.
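As a quick orientation before diving into the tutorials, the sketch below shows roughly how these metrics come together in a single mlflow.evaluate call. It scores a small set of pre-computed answers using the question-answering model type (which enables built-in metrics such as toxicity and token_count) plus the LLM-judged answer_relevance metric. The sample questions, answers, and the GPT-4 judge model are illustrative assumptions, not part of the tutorials, and the judge metric requires OpenAI credentials to be configured.

```python
import mlflow
import pandas as pd

# Illustrative QA evaluation data with pre-computed model outputs.
eval_data = pd.DataFrame(
    {
        "inputs": [
            "What is MLflow?",
            "What is Apache Spark?",
        ],
        "outputs": [
            "MLflow is an open-source platform for managing the machine learning lifecycle.",
            "Apache Spark is a distributed computing engine for large-scale data processing.",
        ],
        "ground_truth": [
            "MLflow is an open-source platform for managing the end-to-end machine learning lifecycle.",
            "Apache Spark is an open-source, distributed computing system for big data workloads.",
        ],
    }
)

with mlflow.start_run():
    results = mlflow.evaluate(
        data=eval_data,
        predictions="outputs",
        targets="ground_truth",
        # The question-answering model type enables built-in metrics such as
        # toxicity and token_count without extra configuration.
        model_type="question-answering",
        # answer_relevance is an LLM-judged metric; it calls out to a judge
        # model (here GPT-4, which assumes OPENAI_API_KEY is set).
        extra_metrics=[mlflow.metrics.genai.answer_relevance(model="openai:/gpt-4")],
    )
    print(results.metrics)
```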
QA Evaluation Tutorial
Learn how to evaluate various LLMs and RAG systems with MLflow, leveraging simple metrics such as toxicity, LLM-judged metrics such as relevance, and even custom LLM-judged metrics such as professionalism.
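To give a sense of what a custom LLM-judged metric looks like, here is a minimal sketch of a professionalism metric built with make_genai_metric. The definition, grading prompt, example, and GPT-4 judge model are illustrative placeholders rather than the tutorial's exact configuration.

```python
from mlflow.metrics.genai import EvaluationExample, make_genai_metric

# An illustrative graded example that shows the judge what a low score looks like.
professionalism_example = EvaluationExample(
    input="What is MLflow?",
    output="MLflow is like your friendly neighborhood toolkit for ML, yo!",
    score=2,
    justification="The response uses casual slang, which lowers its professionalism.",
)

professionalism = make_genai_metric(
    name="professionalism",
    definition="Professionalism measures whether the response uses a formal, respectful tone.",
    grading_prompt=(
        "Score from 1 to 5, where 1 is very casual or inappropriate language "
        "and 5 is consistently formal and polished."
    ),
    examples=[professionalism_example],
    model="openai:/gpt-4",  # judge model; assumes OpenAI credentials are configured
    greater_is_better=True,
)

# The metric can then be passed to mlflow.evaluate via extra_metrics,
# alongside the built-in and LLM-judged metrics shown above.
```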
Learn how to evaluate various open-source LLMs available on Hugging Face, leveraging MLflow's built-in LLM metrics and experiment tracking to manage models and evaluation results.
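The sketch below illustrates the general pattern that notebook follows: log a Hugging Face pipeline with MLflow, then evaluate it with the text model type so the metrics land in the same tracked run. The gpt2 model and the prompts are placeholder assumptions chosen to keep the example small; the tutorial itself uses a larger chat model.

```python
import mlflow
import pandas as pd
from transformers import pipeline

# A small open-source text-generation model used here purely for illustration.
text_gen = pipeline("text-generation", model="gpt2")

eval_data = pd.DataFrame(
    {"inputs": ["Explain what MLflow is.", "Explain what a vector database is."]}
)

with mlflow.start_run():
    # Log the pipeline so the evaluated model is tracked alongside its results.
    model_info = mlflow.transformers.log_model(
        transformers_model=text_gen,
        artifact_path="text_generator",
    )

    # model_type="text" applies MLflow's built-in text metrics
    # (e.g. toxicity and readability scores) to the generated outputs.
    results = mlflow.evaluate(
        model_info.model_uri,
        data=eval_data,
        model_type="text",
    )
    print(results.metrics)
```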