Evaluate a Hugging Face LLM with mlflow.evaluate()

This guide shows how to load a pre-trained Hugging Face pipeline, log it to MLflow, and use mlflow.evaluate() to evaluate the model with builtin metrics as well as custom LLM-judged metrics.

For detailed information, please read the documentation on using MLflow evaluate.

Start MLflow Server

You can either:

  • Start a local tracking server by running mlflow ui from the directory containing your notebook.

  • Use an existing tracking server, as described in this overview. (A short sketch for pointing the notebook at either option follows this list.)
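
For example, a minimal sketch for connecting this notebook to a tracking server. The URI below assumes a local server started with mlflow ui on the default port; replace it with your own server's URI if you use a remote one.

import mlflow

# Assumed local tracking server started with `mlflow ui` (default port 5000);
# substitute your own tracking server URI for a remote setup.
mlflow.set_tracking_uri("http://127.0.0.1:5000")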

Install necessary dependencies

[ ]:
%pip install -q mlflow transformers torch torchvision evaluate datasets openai tiktoken fastapi rouge_score textstat
[2]:
# Necessary imports
import warnings

import pandas as pd
from datasets import load_dataset
from transformers import pipeline

import mlflow
from mlflow.metrics.genai import EvaluationExample, answer_correctness, make_genai_metric
[3]:
# Disable FutureWarnings
warnings.filterwarnings("ignore", category=FutureWarning)

Load a pretrained Hugging Face pipeline

Here we load a text generation pipeline, but you can also use a text summarization or question answering pipeline (illustrative alternatives are sketched after the next cell).

[4]:
mpt_pipeline = pipeline("text-generation", model="mosaicml/mpt-7b-chat")
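
As noted above, other pipeline types can be evaluated the same way. The alternatives below are purely illustrative (the model checkpoints are common public choices, not requirements) and are not used in the rest of this guide.

# Illustrative alternatives, not used below.
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")
qa = pipeline("question-answering", model="deepset/roberta-base-squad2")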

Log the model using MLflow

We log our pipeline as an MLflow Model, which follows a standard format that lets you save a model in different “flavors” that can be understood by different downstream tools. In this case, the model is of the transformers “flavor”.

[5]:
mlflow.set_experiment("Evaluate Hugging Face Text Pipeline")

# Define the signature
signature = mlflow.models.infer_signature(
    model_input="What are the three primary colors?",
    model_output="The three primary colors are red, yellow, and blue.",
)

# Log the model using mlflow
with mlflow.start_run():
    model_info = mlflow.transformers.log_model(
        transformers_model=mpt_pipeline,
        artifact_path="mpt-7b",
        signature=signature,
        registered_model_name="mpt-7b-chat",
    )
Successfully registered model 'mpt-7b-chat'.
Created version '1' of model 'mpt-7b-chat'.
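
As an optional sanity check (not part of the original workflow), the logged model can be loaded back as a generic pyfunc model and asked for a single prediction before running a full evaluation:

# Optional: load the logged model back and generate one prediction to verify it works.
loaded_model = mlflow.pyfunc.load_model(model_info.model_uri)
print(loaded_model.predict(["What are the three primary colors?"]))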

Load Evaluation Data

Load in a dataset from Hugging Face Hub to use for evaluation.

The data fields in the dataset below represent:

  • instruction: Describes the task that the model should perform. Each row within the dataset is a unique instruction (task) to be performed.

  • input: Optional contextual information that relates to the task defined in the instruction field. For example, for the instruction “Identify the odd one out”, the input contextual guidance is given as the list of items to select an outlier from, “Twitter, Instagram, Telegram”.

  • output: The answer to the instruction (given the optional input context) as generated by text-davinci-003 from OpenAI, the model originally used to create the dataset.

  • text: The full prompt text produced by applying the instruction, input, and output to the prompt template; this is the text used for fine-tuning.

[7]:
dataset = load_dataset("tatsu-lab/alpaca")
eval_df = pd.DataFrame(dataset["train"])
eval_df.head(10)
[7]:
|   | instruction | input | output | text |
|---|-------------|-------|--------|------|
| 0 | Give three tips for staying healthy. | | 1.Eat a balanced diet and make sure to include... | Below is an instruction that describes a task.... |
| 1 | What are the three primary colors? | | The three primary colors are red, blue, and ye... | Below is an instruction that describes a task.... |
| 2 | Describe the structure of an atom. | | An atom is made up of a nucleus, which contain... | Below is an instruction that describes a task.... |
| 3 | How can we reduce air pollution? | | There are a number of ways to reduce air pollu... | Below is an instruction that describes a task.... |
| 4 | Describe a time when you had to make a difficu... | | I had to make a difficult decision when I was ... | Below is an instruction that describes a task.... |
| 5 | Identify the odd one out. | Twitter, Instagram, Telegram | Telegram | Below is an instruction that describes a task,... |
| 6 | Explain why the following fraction is equivale... | 4/16 | The fraction 4/16 is equivalent to 1/4 because... | Below is an instruction that describes a task,... |
| 7 | Write a short story in third person narration ... | | John was at a crossroads in his life. He had j... | Below is an instruction that describes a task.... |
| 8 | Render a 3D model of a house | | <nooutput> This type of instruction cannot be ... | Below is an instruction that describes a task.... |
| 9 | Evaluate this sentence for spelling and gramma... | He finnished his meal and left the resturant | He finished his meal and left the restaurant. | Below is an instruction that describes a task,... |

Define Metrics

Since we are evaluating how well our model answers a given instruction, we may want to choose some metrics that measure this, in addition to the builtin metrics that mlflow.evaluate() provides.

Let’s measure how well our model is doing on the following two metrics:

  • Is the answer correct? Let’s use the predefined metric answer_correctness here.

  • Is the answer fluent, clear, and concise? We will define a custom metric answer_quality to measure this.

We will need to pass both of these into the extra_metrics argument for mlflow.evaluate() in order to assess the quality of our model.

What is an Evaluation Metric?

An evaluation metric encapsulates any quantitative or qualitative measure you want to calculate for your model. For each model type, mlflow.evaluate() automatically calculates a default set of builtin metrics; refer here for the builtin metrics associated with each model type. You can also pass in any other metrics you want to calculate as extra metrics. MLflow provides a set of predefined metrics that you can find here, or you can define your own custom metrics. In this example, we will combine the predefined metric mlflow.metrics.genai.answer_correctness with a custom metric for quality evaluation.
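
For reference, a fully custom (non-LLM) metric can be built with mlflow.metrics.make_metric. The word_count metric below is hypothetical and shown only to illustrate the mechanism; it is not used in this tutorial.

import numpy as np

from mlflow.metrics import MetricValue, make_metric


def word_count_eval_fn(predictions, targets, metrics):
    # Score each prediction by its whitespace-delimited word count.
    scores = [len(str(prediction).split()) for prediction in predictions]
    return MetricValue(
        scores=scores,
        aggregate_results={"mean": float(np.mean(scores))},
    )


# Hypothetical heuristic metric; could be passed to extra_metrics like any other metric.
word_count_metric = make_metric(
    eval_fn=word_count_eval_fn,
    greater_is_better=False,
    name="word_count",
)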

Let’s load our predefined metrics - in this case we are using answer_correctness with GPT-4.

[9]:
answer_correctness_metric = answer_correctness(model="openai:/gpt-4")

Now we want to create a custom LLM-judged metric named answer_quality using make_genai_metric(). We need to provide a metric definition and a grading rubric, as well as some examples for the LLM judge to use.

[8]:
# The definition explains what "answer quality" entails
answer_quality_definition = """Please evaluate answer quality for the provided output on the following criteria:
fluency, clarity, and conciseness. Each of the criteria is defined as follows:
  - Fluency measures how naturally and smoothly the output reads.
  - Clarity measures how understandable the output is.
  - Conciseness measures the brevity and efficiency of the output without compromising meaning.
The more fluent, clear, and concise a text, the higher the score it deserves.
"""

# The grading prompt explains what each possible score means
answer_quality_grading_prompt = """Answer quality: Below are the details for different scores:
  - Score 1: The output is entirely incomprehensible and cannot be read.
  - Score 2: The output conveys some meaning, but needs significant improvement in fluency, clarity, and conciseness.
  - Score 3: The output is understandable but still needs improvement.
  - Score 4: The output performs well on two of fluency, clarity, and conciseness, but could be improved on one of these criteria.
  - Score 5: The output reads smoothly, is easy to understand, and is clear and concise. There is no obvious way to improve the output on these criteria.
"""

# We provide an example of a "bad" output
example1 = EvaluationExample(
    input="What is MLflow?",
    output="MLflow is an open-source platform. For managing machine learning workflows, it "
    "including experiment tracking model packaging versioning and deployment as well as a platform "
    "simplifying for on the ML lifecycle.",
    score=2,
    justification="The output is difficult to understand and demonstrates extremely low clarity. "
    "However, it still conveys some meaning so this output deserves a score of 2.",
)

# We also provide an example of a "good" output
example2 = EvaluationExample(
    input="What is MLflow?",
    output="MLflow is an open-source platform for managing machine learning workflows, including "
    "experiment tracking, model packaging, versioning, and deployment.",
    score=5,
    justification="The output is easily understandable, clear, and concise. It deserves a score of 5.",
)

answer_quality_metric = make_genai_metric(
    name="answer_quality",
    definition=answer_quality_definition,
    grading_prompt=answer_quality_grading_prompt,
    version="v1",
    examples=[example1, example2],
    model="openai:/gpt-4",
    greater_is_better=True,
)

Evaluate

We need to set our OpenAI API key, since we are using GPT-4 for our LLM-judged metrics.

To set your key safely, either export it in your current command-line session, or, to make it available in all future sessions, add the following entry to your preferred shell configuration file (e.g., .bashrc or .zshrc):

export OPENAI_API_KEY=<your openai API key>
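
Alternatively, for a single notebook session, you can set the same environment variable from Python before calling mlflow.evaluate():

import os

# Set the key for this Python process only; replace the placeholder with your actual key.
os.environ["OPENAI_API_KEY"] = "<your openai API key>"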

Now, we can call mlflow.evaluate(). Just to test it out, let’s use the first 10 rows of the data. Using the "text" model type, toxicity and readability metrics are calculated as builtin metrics. We also pass in the two metrics we defined above into the extra_metrics parameter to be evaluated.

[14]:
with mlflow.start_run():
    results = mlflow.evaluate(
        model_info.model_uri,
        eval_df.head(10),
        evaluators="default",
        model_type="text",
        targets="output",
        extra_metrics=[answer_correctness_metric, answer_quality_metric],
        evaluator_config={"col_mapping": {"inputs": "instruction"}},
    )
2023/12/28 11:57:30 INFO mlflow.store.artifact.artifact_repo: The progress bar can be disabled by setting the environment variable MLFLOW_ENABLE_ARTIFACTS_PROGRESS_BAR to false
2023/12/28 12:00:25 INFO mlflow.models.evaluation.base: Evaluating the model with the default evaluator.
2023/12/28 12:00:25 INFO mlflow.models.evaluation.default_evaluator: Computing model predictions.
2023/12/28 12:02:23 INFO mlflow.models.evaluation.default_evaluator: Testing metrics on first row...
Using default facebook/roberta-hate-speech-dynabench-r4-target checkpoint
2023/12/28 12:02:43 INFO mlflow.models.evaluation.default_evaluator: Evaluating builtin metrics: token_count
2023/12/28 12:02:43 INFO mlflow.models.evaluation.default_evaluator: Evaluating builtin metrics: toxicity
2023/12/28 12:02:44 INFO mlflow.models.evaluation.default_evaluator: Evaluating builtin metrics: flesch_kincaid_grade_level
2023/12/28 12:02:44 INFO mlflow.models.evaluation.default_evaluator: Evaluating builtin metrics: ari_grade_level
2023/12/28 12:02:44 INFO mlflow.models.evaluation.default_evaluator: Evaluating metrics: answer_correctness
2023/12/28 12:02:53 INFO mlflow.models.evaluation.default_evaluator: Evaluating metrics: answer_quality

View results

results.metrics is a dictionary with the aggregate values for all the metrics calculated. Refer here for details on the builtin metrics for each model type.

[15]:
results.metrics
[15]:
{'toxicity/v1/mean': 0.00809656630299287,
 'toxicity/v1/variance': 0.0004603014839856817,
 'toxicity/v1/p90': 0.010559113975614286,
 'toxicity/v1/ratio': 0.0,
 'flesch_kincaid_grade_level/v1/mean': 4.9,
 'flesch_kincaid_grade_level/v1/variance': 6.3500000000000005,
 'flesch_kincaid_grade_level/v1/p90': 6.829999999999998,
 'ari_grade_level/v1/mean': 4.1899999999999995,
 'ari_grade_level/v1/variance': 16.6329,
 'ari_grade_level/v1/p90': 7.949999999999998,
 'answer_correctness/v1/mean': 1.5,
 'answer_correctness/v1/variance': 1.45,
 'answer_correctness/v1/p90': 2.299999999999999,
 'answer_quality/v1/mean': 2.4,
 'answer_quality/v1/variance': 1.44,
 'answer_quality/v1/p90': 4.1}

We can also view the eval_results_table, which shows us the metrics for each row of data.

[16]:
results.tables["eval_results_table"]
[16]:
|   | instruction | input | text | output | outputs | token_count | toxicity/v1/score | flesch_kincaid_grade_level/v1/score | ari_grade_level/v1/score | answer_correctness/v1/score | answer_correctness/v1/justification | answer_quality/v1/score | answer_quality/v1/justification |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Give three tips for staying healthy. | | Below is an instruction that describes a task.... | 1.Eat a balanced diet and make sure to include... | Give three tips for staying healthy.\n1. Eat a... | 19 | 0.000446 | 4.1 | 4.0 | 2 | The output provided by the model only includes... | 3 | The output is understandable and fluent but it... |
| 1 | What are the three primary colors? | | Below is an instruction that describes a task.... | The three primary colors are red, blue, and ye... | What are the three primary colors?\nThe three ... | 19 | 0.000217 | 5.0 | 4.9 | 5 | The output provided by the model is completely... | 5 | The model's output is fluent, clear, and conci... |
| 2 | Describe the structure of an atom. | | Below is an instruction that describes a task.... | An atom is made up of a nucleus, which contain... | Describe the structure of an atom.\nAn atom is... | 18 | 0.000139 | 3.1 | 2.2 | 1 | The output provided by the model is incomplete... | 2 | The output is incomplete and lacks clarity, ma... |
| 3 | How can we reduce air pollution? | | Below is an instruction that describes a task.... | There are a number of ways to reduce air pollu... | How can we reduce air pollution?\nThere are ma... | 18 | 0.000140 | 5.0 | 5.5 | 1 | The output provided by the model is completely... | 1 | The output is entirely incomprehensible and ca... |
| 4 | Describe a time when you had to make a difficu... | | Below is an instruction that describes a task.... | I had to make a difficult decision when I was ... | Describe a time when you had to make a difficu... | 18 | 0.000159 | 5.2 | 2.9 | 1 | The output provided by the model is completely... | 2 | The output is incomplete and lacks clarity, ma... |
| 5 | Identify the odd one out. | Twitter, Instagram, Telegram | Below is an instruction that describes a task,... | Telegram | Identify the odd one out.\n\n1. A car\n2. A tr... | 18 | 0.072345 | 0.1 | -5.4 | 1 | The output provided by the model is completely... | 2 | The output is not clear and lacks fluency. The... |
| 6 | Explain why the following fraction is equivale... | 4/16 | Below is an instruction that describes a task,... | The fraction 4/16 is equivalent to 1/4 because... | Explain why the following fraction is equivale... | 23 | 0.000320 | 6.4 | 7.6 | 1 | The output provided by the model is completely... | 2 | The output is not clear and does not answer th... |
| 7 | Write a short story in third person narration ... | | Below is an instruction that describes a task.... | John was at a crossroads in his life. He had j... | Write a short story in third person narration ... | 20 | 0.000247 | 10.7 | 11.1 | 1 | The output provided by the model is completely... | 1 | The output is exactly the same as the input, a... |
| 8 | Render a 3D model of a house | | Below is an instruction that describes a task.... | <nooutput> This type of instruction cannot be ... | Render a 3D model of a house in Blender - Blen... | 19 | 0.003694 | 5.2 | 2.7 | 1 | The output provided by the model is completely... | 2 | The output is partially understandable but lac... |
| 9 | Evaluate this sentence for spelling and gramma... | He finnished his meal and left the resturant | Below is an instruction that describes a task,... | He finished his meal and left the restaurant. | Evaluate this sentence for spelling and gramma... | 18 | 0.003260 | 4.2 | 6.4 | 1 | The output provided by the model is completely... | 4 | The output is fluent and clear, but it is not ... |
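
If you prefer to slice this table programmatically rather than scanning the full output, you can select just the columns of interest (the column names are exactly as shown above):

eval_table = results.tables["eval_results_table"]

# Compare the two LLM-judged scores side by side for each instruction.
eval_table[["instruction", "answer_correctness/v1/score", "answer_quality/v1/score"]]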

View results in UI

Finally, we can view our evaluation results in the MLflow UI. Selecting our experiment in the left sidebar brings us to the following page, where we can see that one run contains our logged model “mpt-7b-chat” and the other run contains the evaluation results.

Evaluation Main

We click on the Evaluation tab and hide any irrelevant runs.

Evaluation Filtering

We can now choose what columns we want to group by, as well as which column we want to compare. In the following example, we are looking at the score for answer correctness for each input-output pair, but we could choose any other metric to compare.

Evaluation Selection

Finally, we get to the following view, where we can see the justification and score for answer correctness for each row.

Evaluation Comparison