Building and Serving an OpenAI-compatible Chatbot
Welcome to our tutorial on using Transformers and MLflow to create an OpenAI-compatible chat model. In MLflow 2.11 and up, the ChatModel class has been added, allowing for convenient creation of served models that conform to the OpenAI API spec. This enables you to seamlessly swap out your chat app’s backing LLM, or to easily evaluate different models without having to edit your client-side code.
If you haven’t already seen it, you may find it helpful to go through our introductory notebook on chat and Transformers before proceeding with this one, as this notebook is slightly higher-level and does not delve too deeply into the inner workings of Transformers or MLflow Tracking.
Learning objectives
In this tutorial, you will:
Create an OpenAI-compatible chat model using TinyLlama-1.1B-Chat
Serve the model with MLflow Model Serving
Learn how to use MLflow's `pyfunc` API to add arbitrary customization to your model
[ ]:
%pip install "mlflow>=2.11.0" -q -U
# OpenAI-compatible chat model support is available for Transformers 4.34.0 and above
%pip install "transformers>=4.34.0" -q -U
[1]:
# Disable tokenizers warnings when constructing pipelines
%env TOKENIZERS_PARALLELISM=false
import warnings
# Disable a few less-than-useful UserWarnings from setuptools and pydantic
warnings.filterwarnings("ignore", category=UserWarning)
env: TOKENIZERS_PARALLELISM=false
Building a Chat Model
MLflow's native Transformers integration allows you to specify the `task` param when saving or logging your pipelines. Originally, this param accepts any of the Transformers pipeline task types, but the `mlflow.transformers` flavor adds a few more MLflow-specific keys for `text-generation` pipeline types.
For `text-generation` pipelines, instead of specifying `text-generation` as the task type, you can provide one of two string literals conforming to the MLflow Deployments Server's `endpoint_type` specification ("llm/v1/embeddings" can be specified as a task on models saved with `mlflow.sentence_transformers`):
“llm/v1/chat” for chat-style applications
“llm/v1/completions” for generic completions
When one of these keys is specified, MLflow will automatically handle everything required to serve a chat or completions model. This includes:
Setting a chat/completions compatible signature on the model
Performing data pre- and post-processing to ensure the inputs and outputs conform to the Chat/Completions API spec, which is compatible with OpenAI’s API spec.
Note that these modifications only apply when the model is loaded with `mlflow.pyfunc.load_model()` (e.g. when serving the model with the `mlflow models serve` CLI tool). If you want to load just the base pipeline, you can always do so via `mlflow.transformers.load_model()`.
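For illustration, here's a quick sketch of the two load paths, using the `tinyllama-chat` path that we'll save later in this tutorial:

import mlflow

# pyfunc flavor: adds the chat signature and the OpenAI-compatible
# input/output handling described above
chat_model = mlflow.pyfunc.load_model("tinyllama-chat")

# transformers flavor: returns the raw text-generation pipeline,
# with no chat-specific handling
raw_pipeline = mlflow.transformers.load_model("tinyllama-chat")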
In the next few cells, we'll learn how to serve a chat model with a local Transformers pipeline and MLflow, using TinyLlama-1.1B-Chat as an example.
To begin, let’s go through the original flow of saving a text generation pipeline:
[27]:
from transformers import pipeline

import mlflow

generator = pipeline(
    "text-generation",
    model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
)

# save the model using the vanilla `text-generation` task type
mlflow.transformers.save_model(
    path="tinyllama-text-generation", transformers_model=generator, task="text-generation"
)
/var/folders/qd/9rwd0_gd0qs65g4sdqlm51hr0000gp/T/ipykernel_55429/4268198845.py:11: FutureWarning: The 'transformers' MLflow Models integration is known to be compatible with the following package version ranges: ``4.25.1`` - ``4.37.1``. MLflow Models integrations with transformers may not succeed when used with package versions outside of this range.
mlflow.transformers.save_model(
Now, let's load the model and use it for inference. Our loaded model is a `text-generation` pipeline; let's take a look at its signature to see its expected inputs and outputs.
[28]:
# load the model for inference
model = mlflow.pyfunc.load_model("tinyllama-text-generation")
model.metadata.signature
2024/02/26 21:06:51 WARNING mlflow.transformers: Could not specify device parameter for this pipeline type
[28]:
inputs:
[string (required)]
outputs:
[string (required)]
params:
None
Unfortunately, it only accepts `string` as input, which isn't directly compatible with a chat interface. When interacting with OpenAI's API, for example, we expect to simply be able to input a list of messages. In order to do this with our current model, we'll have to write some additional boilerplate:
[29]:
# first, apply the tokenizer's chat template, since the
# model is tuned to accept prompts in a chat format. this
# also converts the list of messages to a string.
messages = [{"role": "user", "content": "Write me a hello world program in python"}]
prompt = generator.tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

model.predict(prompt)
[29]:
['<|user|>\nWrite me a hello world program in python</s>\n<|assistant|>\nHere\'s a simple hello world program in Python:\n\n```python\nprint("Hello, world!")\n```\n\nThis program prints the string "Hello, world!" to the console. You can run this program by typing it into the Python interpreter or by running the command `python hello_world.py` in your terminal.']
Now we’re getting somewhere, but formatting our messages prior to inference is cumbersome.
Additionally, the output format isn't compatible with the OpenAI API spec either: it's just a list of strings. If we were looking to evaluate different model backends for our chat app, we'd have to rewrite some of our client-side code to both format the input, and to parse this new response.
To simplify all this, let's just pass in `"llm/v1/chat"` as the task param when saving the model.
[30]:
# save the model using the `"llm/v1/chat"`
# task type instead of `text-generation`
mlflow.transformers.save_model(
    path="tinyllama-chat", transformers_model=generator, task="llm/v1/chat"
)
/var/folders/qd/9rwd0_gd0qs65g4sdqlm51hr0000gp/T/ipykernel_55429/609241782.py:3: FutureWarning: The 'transformers' MLflow Models integration is known to be compatible with the following package version ranges: ``4.25.1`` - ``4.37.1``. MLflow Models integrations with transformers may not succeed when used with package versions outside of this range.
mlflow.transformers.save_model(
Once again, let’s load the model and inspect the signature:
[31]:
model = mlflow.pyfunc.load_model("tinyllama-chat")
model.metadata.signature
2024/02/26 21:10:04 WARNING mlflow.transformers: Could not specify device parameter for this pipeline type
[31]:
inputs:
['messages': Array({content: string (required), name: string (optional), role: string (required)}) (required), 'temperature': double (optional), 'max_tokens': long (optional), 'stop': Array(string) (optional), 'n': long (optional), 'stream': boolean (optional)]
outputs:
['id': string (required), 'object': string (required), 'created': long (required), 'model': string (required), 'choices': Array({finish_reason: string (required), index: long (required), message: {content: string (required), name: string (optional), role: string (required)} (required)}) (required), 'usage': {completion_tokens: long (required), prompt_tokens: long (required), total_tokens: long (required)} (required)]
params:
None
Now when performing inference, we can pass our messages in a dict as we’d expect to do when interacting with the OpenAI API. Furthermore, the response we receive back from the model also conforms to the spec.
[32]:
messages = [{"role": "user", "content": "Write me a hello world program in python"}]
model.predict({"messages": messages})
[32]:
[{'id': '8435a57d-9895-485e-98d3-95b1cbe007c0',
'object': 'chat.completion',
'created': 1708949437,
'model': 'TinyLlama/TinyLlama-1.1B-Chat-v1.0',
'usage': {'prompt_tokens': 24, 'completion_tokens': 71, 'total_tokens': 95},
'choices': [{'index': 0,
'finish_reason': 'stop',
'message': {'role': 'assistant',
'content': 'Here\'s a simple hello world program in Python:\n\n```python\nprint("Hello, world!")\n```\n\nThis program prints the string "Hello, world!" to the console. You can run this program by typing it into the Python interpreter or by running the command `python hello_world.py` in your terminal.'}}]}]
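You can also include the optional params declared in the signature above (such as `temperature` and `max_tokens`) alongside your messages. A quick sketch:

# optional inference params are passed in the same dict as the messages; their
# exact effect depends on the underlying pipeline (see the notes on inference
# params in the serving section below)
model.predict({"messages": messages, "temperature": 0.7, "max_tokens": 64})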
Serving the Chat Model
To take this example further, let's use MLflow to serve our chat model, so we can interact with it like a web API. To do this, we can use the `mlflow models serve` CLI tool.
In a terminal shell, run:
$ mlflow models serve -m tinyllama-chat
When the server has finished initializing, you should be able to interact with the model via HTTP requests. The input format is almost identical to the format described in the MLflow Deployments Server docs, with the exception that `temperature` defaults to `1.0` instead of `0.0`.
Here’s a quick example:
[33]:
%%sh
curl http://127.0.0.1:5000/invocations \
-H 'Content-Type: application/json' \
-d '{ "messages": [{"role": "user", "content": "Write me a hello world program in python"}] }' \
| jq
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 706 100 617 100 89 25 3 0:00:29 0:00:23 0:00:06 160
[
{
"id": "fc3d08c3-d37d-420d-a754-50f77eb32a92",
"object": "chat.completion",
"created": 1708949465,
"model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
"usage": {
"prompt_tokens": 24,
"completion_tokens": 71,
"total_tokens": 95
},
"choices": [
{
"index": 0,
"finish_reason": "stop",
"message": {
"role": "assistant",
"content": "Here's a simple hello world program in Python:\n\n```python\nprint(\"Hello, world!\")\n```\n\nThis program prints the string \"Hello, world!\" to the console. You can run this program by typing it into the Python interpreter or by running the command `python hello_world.py` in your terminal."
}
}
]
}
]
It’s that easy!
You can also call the API with a few optional inference params to adjust the model's responses. These map to Transformers pipeline params, and are passed in directly at inference time (see the example request after this list).

`max_tokens` (maps to `max_new_tokens`): The maximum number of new tokens the model should generate.

`temperature` (maps to `temperature`): Controls the creativity of the model's response. Note that this is not guaranteed to be supported by all models, and in order for this param to have an effect, the pipeline must have been created with `do_sample=True`.

`stop` (maps to `stopping_criteria`): A list of tokens at which to stop generation.

Note: `n` does not have an equivalent Transformers pipeline param, and is not supported in queries. However, you can implement a model that consumes the `n` param using Custom Pyfunc (details below).
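As a quick sketch (assuming the server from the previous cell is still running on port 5000), a request that also sets `temperature` and `max_tokens` might look like this:

$ curl http://127.0.0.1:5000/invocations \
    -H 'Content-Type: application/json' \
    -d '{
          "messages": [{"role": "user", "content": "Write me a hello world program in python"}],
          "temperature": 0.7,
          "max_tokens": 64
        }'

As noted above, `temperature` will only have an effect if the underlying pipeline was created with `do_sample=True`.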
Customizing the model
As always, custom functionality can be achieved with MLflow's `pyfunc` API. In the cell below, we create a custom Chat-flavored `pyfunc` model by subclassing `mlflow.pyfunc.ChatModel`.
In this example, we'll use our previously-saved TinyLlama pipeline as the backing model by loading it with `mlflow.transformers.load_model()`, and then build our own customizations on top of it. However, `ChatModel` is agnostic to how you generate the outputs. You could use another transformer, Langchain, native OpenAI integrations, or even no LLM at all! As long as the result of `predict` is of type `mlflow.types.llm.ChatResponse`, you'll be able to take advantage of the automatic signature generation and input/output parsing.
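For instance, here's a minimal sketch of the "no LLM at all" case: a ChatModel that simply echoes the last user message back. The EchoChatModel class is purely illustrative (not part of MLflow), and it assumes the incoming messages arrive as ChatMessage objects with a `content` attribute:

import uuid

import mlflow
from mlflow.types.llm import ChatResponse


class EchoChatModel(mlflow.pyfunc.ChatModel):
    def predict(self, context, messages, params):
        # no inference at all: just echo the last user message back
        text = messages[-1].content if messages else ""

        response = {
            "id": str(uuid.uuid4()),
            "model": "EchoChatModel",
            "choices": [
                {
                    "index": 0,
                    "message": {"role": "assistant", "content": text},
                    "finish_reason": "stop",
                }
            ],
            "usage": {"prompt_tokens": 0, "completion_tokens": 0, "total_tokens": 0},
        }

        return ChatResponse(**response)

Because only the `ChatResponse` return type matters, this model gets the same automatic signature generation and input/output handling when saved and served.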
The possibilities for customization are endless here, but as a quick example, the code below simply edits the ID of the response, rather than having it be a random UUID. Of course, you could also insert any side-effects you wanted here, such as asynchronously logging some metadata for analytics.
[34]:
import random

import mlflow
from mlflow.types.llm import ChatResponse


class MyChatModel(mlflow.pyfunc.ChatModel):
    def load_context(self, context):
        # load our previously-saved Transformers pipeline from context.artifacts
        self.pipeline = mlflow.transformers.load_model(context.artifacts["chat_model_path"])

    def predict(self, context, messages, params):
        tokenizer = self.pipeline.tokenizer
        prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

        # perform inference using the loaded pipeline
        output = self.pipeline(prompt, return_full_text=False, generation_kwargs=params.to_dict())
        text = output[0]["generated_text"]

        id = f"some_meaningful_id_{random.randint(0, 100)}"

        # construct token usage information
        prompt_tokens = len(tokenizer.encode(prompt))
        completion_tokens = len(tokenizer.encode(text))
        usage = {
            "prompt_tokens": prompt_tokens,
            "completion_tokens": completion_tokens,
            "total_tokens": prompt_tokens + completion_tokens,
        }

        # here, we can do any post-processing or side-effects required.
        # for example, we could log the generated text to a database for
        # analytics, or check the output for any banned words or phrases
        # and return a different response if any are found.

        # in this example, we just return the generated text as the response
        response = {
            "id": id,
            "model": "MyChatModel",
            "choices": [
                {
                    "index": 0,
                    "message": {"role": "assistant", "content": text},
                    "finish_reason": "stop",
                }
            ],
            "usage": usage,
        }

        return ChatResponse(**response)
Similar to what happens above, upon saving an instance of MyChatModel, MLflow will automatically recognize the `mlflow.pyfunc.ChatModel` subclass, set chat signatures, and handle input and output parsing. Note that enforcement is performed on the output: MLflow will run inference on an example input and assert that the output is of type `ChatResponse`.
Full documentation for the `ChatResponse` type can be found in the API reference.
[36]:
mlflow.pyfunc.save_model(
    path="my-model",
    python_model=MyChatModel(),
    # provide the path to the pipeline we saved earlier
    artifacts={"chat_model_path": "tinyllama-chat"},
)
2024/02/26 21:12:47 INFO mlflow.pyfunc: Predicting on input example to validate output
/var/folders/qd/9rwd0_gd0qs65g4sdqlm51hr0000gp/T/ipykernel_55429/2958668123.py:10: FutureWarning: The 'transformers' MLflow Models integration is known to be compatible with the following package version ranges: ``4.25.1`` - ``4.37.1``. MLflow Models integrations with transformers may not succeed when used with package versions outside of this range.
self.pipeline = mlflow.transformers.load_model(context.artifacts["chat_model_path"])
2024/02/26 21:12:47 WARNING mlflow.transformers: Could not specify device parameter for this pipeline type
2024/02/26 21:13:19 INFO mlflow.store.artifact.artifact_repo: The progress bar can be disabled by setting the environment variable MLFLOW_ENABLE_ARTIFACTS_PROGRESS_BAR to false
As before, we can now serve the model by running the following in a terminal shell:
$ mlflow models serve -m my-model
And we should now be able to query it via HTTP request:
[37]:
%%sh
curl http://127.0.0.1:5000/invocations \
-H 'Content-Type: application/json' \
-d '{ "messages": [{"role": "user", "content": "Write me a hello world program in python"}] }' \
| jq
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 666 100 577 100 89 23 3 0:00:29 0:00:24 0:00:05 141
{
"id": "some_meaningful_id_33",
"model": "MyChatModel",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": "Here's a simple hello world program in Python:\n\n```python\nprint(\"Hello, world!\")\n```\n\nThis program prints the string \"Hello, world!\" to the console. You can run this program by typing it into the Python interpreter or by running the command `python hello_world.py` in your terminal."
},
"finish_reason": "stop"
}
],
"usage": {
"prompt_tokens": 25,
"completion_tokens": 71,
"total_tokens": 96
},
"object": "chat.completion",
"created": 1708949708
}
Conclusion
In this tutorial, you learned how to create an OpenAI-compatible chat model by specifying “llm/v1/chat” as the task when saving Transformers pipelines. You also learned how to leverage Custom Pyfunc to add customizations that fit your specific use-case.
What’s next?
In-depth Pyfunc Walkthrough. We briefly touched on custom `pyfunc` in this tutorial, but if you're looking for more detail on the anatomy of a `pyfunc` model, the linked page provides an in-depth overview of all the components.
More on MLflow Deployments. In this tutorial, we saw how to deploy a model using a local server, but MLflow provides many other ways to deploy your models to production. Check out this page to learn more about the different options.
More on MLflow’s Transformers Integration. This page provides a comprehensive overview on MLflow’s Transformers integrations, along with lots of hands-on guides and notebooks. Learn how to fine-tune models, use prompt templates, and more!
Other LLM Integrations. Aside from Transformers, MLflow has integrations with many other popular LLM libraries, such as Langchain and OpenAI.