Chat
This page includes information about managing multi-turn chat sessions, templating, and maintaining the embedding history. Here's how to run an interactive chat session from the terminal:
python3 -m nano_llm.chat --api mlc \
    --model meta-llama/Meta-Llama-3-8B-Instruct \
    --quantization q4f16_ft
If you load a multimodal model (like liuhaotian/llava-v1.6-vicuna-7b), you can enter image filenames or URLs and a query to chat about images. Enter /reset to reset the chat history.
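For example, a multimodal session can be started the same way by swapping in the LLaVA model (this reuses the flags shown above; whether q4f16_ft is the right quantization for your build is an assumption you may need to adjust):
python3 -m nano_llm.chat --api mlc \
    --model liuhaotian/llava-v1.6-vicuna-7b \
    --quantization q4f16_ft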
Code Example
#!/usr/bin/env python3
import argparse
import termcolor

from nano_llm import NanoLLM, ChatHistory

# parse arguments
parser = argparse.ArgumentParser(formatter_class=argparse.ArgumentDefaultsHelpFormatter)

parser.add_argument('--model', type=str, default='meta-llama/Meta-Llama-3-8B-Instruct', help="path to the model, or HuggingFace model repo")
parser.add_argument('--max-new-tokens', type=int, default=256, help="the maximum response length for each bot reply")

args = parser.parse_args()

# load model
model = NanoLLM.from_pretrained(
    model=args.model,
    quantization='q4f16_ft',
    api='mlc'
)

# create the chat history
chat_history = ChatHistory(model, system_prompt="You are a helpful and friendly AI assistant.")

while True:
    # enter the user query from terminal
    print('>> ', end='', flush=True)
    prompt = input().strip()

    # add user prompt and generate chat tokens/embeddings
    chat_history.append('user', prompt)
    embedding, position = chat_history.embed_chat()

    # generate bot reply
    reply = model.generate(
        embedding,
        streaming=True,
        kv_cache=chat_history.kv_cache,
        stop_tokens=chat_history.template.stop,
        max_new_tokens=args.max_new_tokens,
    )

    # stream the output
    for token in reply:
        termcolor.cprint(token, 'blue', end='\n\n' if reply.eos else '', flush=True)

    # save the final output
    chat_history.append('bot', reply)
Templates
These are the built-in chat templates that are automatically determined from the model type, or settable with the --chat-template command-line argument:
* llama-2, llama-3
* vicuna-v0, vicuna-v1
* stablelm-zephyr
* chat-ml
* sheared-llama
* nous-obsidian
* phi-2-chat, phi-2-instruct
* gemma
See nano_llm/chat/templates.py for their definitions. You can also specify a JSON file containing the template.
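As a sketch, the chat template can also be selected programmatically when constructing the chat history (the template name here is one of the built-ins listed above; a path to a JSON template file should also work per the note above):
from nano_llm import NanoLLM, ChatHistory

model = NanoLLM.from_pretrained(model='meta-llama/Meta-Llama-3-8B-Instruct', api='mlc', quantization='q4f16_ft')

# override the auto-detected template with one of the built-in names
chat_history = ChatHistory(model, chat_template='llama-3', system_prompt="You are a helpful and friendly AI assistant.")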
Chat History
- class ChatHistory(model, chat_template=None, system_prompt=None, **kwargs)[source]
Bases: object
Multimodal chat history that can contain a mix of media including text/images.
ChatHistory objects can be indexed like a list to access its messages, where each ChatMessage can have a different type of content:
chat_history[n]  # will return the n-th chat entry
Each type of media has an associated embedding function (e.g. LLMs typically do text token embedding internally, and images use CLIP + projection layers). From these, it assembles the embedding for the entire chat as input to the LLM.
It uses templating to add the required special tokens as defined by different model architectures. In normal 2-turn chat, there are ‘user’ and ‘bot’ roles defined, but arbitrary roles can be added, each with their own template.
The system prompt can also be configured through the chat template and by setting the ChatHistory.system_prompt property.
- kv_cache
The KVCache from NanoLLM.generate() used to store the model state.
- property num_tokens
Return the number of tokens used by the chat so far. embed_chat() needs to have been called for this to be updated, because otherwise the input wouldn't have been tokenized yet.
- __delitem__(key)[source]
Remove one or more messages from the chat history:
del chat_history[-2]   # remove the second-to-last entry
del chat_history[-2:]  # pop the last 2 entries
del chat_history[1:]   # remove all entries but the first
This will also update the KV cache and alter the bot's memory.
- append(role='user', msg=None, **kwargs)[source]
Add a chat entry consisting of a text message, image, etc. See the ChatMessage class for a description of the arguments. This can also accept an existing ChatMessage set to msg.
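A minimal sketch of appending different content types with append() (the image filename is hypothetical):
chat_history.append('user', image='my_image.jpg')            # image file path, URL, np.ndarray, or PIL.Image
chat_history.append('user', 'What is shown in this image?')  # plain text message
embedding, position = chat_history.embed_chat()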
- remove(start, stop=None)[source]
Remove the chat entries from the start (inclusive) to stop (exclusive) indexes. If stop is not specified, then only the single entry at the start index will be removed:
chat_history.remove(0)     # remove the first chat entry
chat_history.remove(0,2)   # remove the first and second chat entries
chat_history.remove(-1)    # remove the last chat entry
chat_history.remove(-2,0)  # remove the last two entries
This will also update the KV cache and alter the bot's memory (potentially destructively).
- reset(system_prompt=True, use_cache=True, wrap_tokens=None)[source]
Reset the chat history, and optionally add the system prompt to the new chat. If use_cache=True, then the system prompt tokens/embedding will be cached. If wrap_tokens is set, then the most recent N tokens from the chat will be kept.
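For example, one way to wire up the /reset command mentioned at the top of this page, inside the chat loop from the code example above:
if prompt == '/reset':
    chat_history.reset()  # keeps the (cached) system prompt by default
    continue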
- to_list(messages=None, html=False)[source]
Serialize the history to a list of dicts, where each dict is a chat entry with the non-critical keys removed (suitable for web transport, etc.)
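For instance, a sketch of shipping the history to a web client as JSON (the exact keys in each dict depend on the message contents):
import json

payload = json.dumps(chat_history.to_list(html=False))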
- add_system_prompt(system_prompt=None, use_cache=True)[source]
Add the system prompt message to the chat, containing ChatHistory.system_prompt followed by the tool function descriptions if tools are enabled. If the system role is not defined by the model's chat template, then this function does nothing.
- Parameters:
use_cache (bool) – If true, then the system prompt tokens/embeddings will be cached. This is the default because the system prompt typically does not change.
- Returns:
The ChatMessage that was added to the chat with the system role.
- property system_prompt
Get the system prompt, the typically hidden instruction at the beginning of the chat like “You are a curious and helpful AI assistant, …”
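As a sketch, the property can also be assigned to change the instruction used for subsequent turns (the prompt text is just an example):
chat_history.system_prompt = "You are a terse assistant that answers in one sentence."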
- embed_chat(use_cache=True, max_tokens=None, wrap_tokens=None, **kwargs)[source]
Assemble the embedding of either the latest or entire chat.
If use_cache=True (the default), only the new embeddings will be returned. If use_cache=False, then the entire chat history will be returned.
This function returns an (embedding, position) tuple, where the embedding array contains the new embeddings (or tokens) from the chat, and position is the current overall position in the history (up to the model's context window length).
If the number of tokens in the chat history exceeds the length given in the max_tokens argument (which is typically the model's context window, minus the max generation length), then the chat history will drop all but the latest wrap_tokens, starting with a user prompt. If max_tokens is provided but wrap_tokens is not, then the overflow tokens will be truncated.
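For example, a sketch of leaving headroom for the reply as the chat nears the context window, using the args from the code example above (the 8192-token context length for Llama-3 and the wrap size are assumed values):
embedding, position = chat_history.embed_chat(
    max_tokens=8192 - args.max_new_tokens,  # assumed context window, minus the generation length
    wrap_tokens=512,                        # on overflow, keep roughly the last 512 tokens of chat
)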
- reindex()[source]
Update the linked lists in the messages that refer to each other. This gets called after messages are added, removed, or their order changed. You wouldn’t typically need to call this yourself.
- find_wrap_entry(wrap_tokens)[source]
Find the oldest entry from which the chat doesn't exceed the number of wrap_tokens, with the added constraint that the entry should be a user query. This is used to keep the most recent chat entries when the history overflows past the model's max context window.
Chat Message
- class ChatMessage(role='user', text=None, image=None, **kwargs)[source]
Bases: object
Create a chat entry consisting of a text message, image, etc. as input.
- Parameters:
role (str) – The chat’s turn template to apply, typically ‘user’ or ‘bot’. The role should have a corresponding entry in the active ChatTemplate.
text (str) – String containing the message’s content for text messages.
image (str|image) – Either a np.ndarray, torch.Tensor, cudaImage, PIL.Image, or a path to an image file (.jpg, .png, .bmp, etc.)
kwargs –
For messages with alternate content types, pass them in via kwargs and they will automatically be determined like so:
message = ChatMessage(role='user', audio='sounds.wav')
There are additional lower-level kwargs that can be set below.
use_cache (bool) – cache the tokens/embeddings for reused prompts (defaults to false)
tokens (list[int] or np.ndarray) – the message contents already having been tokenized
embedding (np.ndarray) – the message contents already having been embedded
history (ChatHistory) – the ChatHistory object this message belongs to
- role
The user role or character ('user', 'assistant', 'system', etc.)
- template
The version of this message with the role template applied
- tokens
The tokenized version of the message
- embedding
The embedding of the message
- history
The ChatHistory object this message belongs to
- use_cache
Set to true if the tokens/embeddings should be cached for reused prompts
- cached
Set to true if the message is already in the chat embedding
- index
The index of this message in the chat history
- prev
The previous message in the chat history
- next
The next message in the chat history
- type
The type of the message ('text', 'image', 'audio', etc.)
- content
The content or media contained in the message
- property num_tokens
Return the number of tokens used by this message. embed() needs to have been called for this to be valid.
- property start_token
The token offset or position in the chat history at which this message begins.
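A small sketch of inspecting the most recent message through the attributes above:
msg = chat_history[-1]  # ChatHistory supports list-style indexing
print(msg.role, msg.type, msg.content)
print('tokens used:', msg.num_tokens, 'starting at', msg.start_token)  # valid once the message has been embedded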
Function Calling
You can expose Python functions that the model is able to invoke using its code generation abilities, should you instruct it to. A list of functions can be provided to NanoLLM.generate() that will be called inline with the generation, and receive the output produced so far by the model.
These functions can then parse the text from the bot to determine if they were called, and execute accordingly. Any text returned by these functions will be added to the chat before generation resumes, so the bot is able to utilize the results in the rest of its reply.
The bot_function() decorator automatically wraps Python functions, performs regex matching on the model output, runs them via Python eval() if they were called, and returns any results:
from nano_llm import NanoLLM, ChatHistory, BotFunctions, bot_function
from datetime import datetime

@bot_function
def DATE():
    """ Returns the current date. """
    return datetime.now().strftime("%A, %B %-d %Y")

@bot_function
def TIME():
    """ Returns the current time. """
    return datetime.now().strftime("%-I:%M %p")

# load the model
model = NanoLLM.from_pretrained(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    quantization='q4f16_ft',
    api='mlc'
)

# create the chat history
system_prompt = "You are a helpful and friendly AI assistant." + BotFunctions.generate_docs()
chat_history = ChatHistory(model, system_prompt=system_prompt)

while True:
    # enter the user query from terminal
    print('>> ', end='', flush=True)
    prompt = input().strip()

    # add user prompt and generate chat tokens/embeddings
    chat_history.append(role='user', msg=prompt)
    embedding, position = chat_history.embed_chat()

    # generate bot reply (give it function access)
    reply = model.generate(
        embedding,
        streaming=True,
        functions=BotFunctions(),
        kv_cache=chat_history.kv_cache,
        stop_tokens=chat_history.template.stop
    )

    # stream the output
    for token in reply:
        print(token, end='\n\n' if reply.eos else '', flush=True)

    # save the final output
    chat_history.append(role='bot', text=reply.text, tokens=reply.tokens)
    chat_history.kv_cache = reply.kv_cache
- bot_function(func, name=None, docs=None, enabled=True)[source]
Decorator for exposing a function to be callable by the LLM. This will create wrapper functions that do the parsing to determine if this function was called in the output text, and then interpret it to invoke the function call. Text returned from these functions will be added to the chat.
For example, this definition will expose the TIME() function to the bot:
@bot_function
def TIME():
    ''' Returns the current time. '''
    return datetime.now().strftime("%-I:%M %p")
You should then add instructions for calling it to the system prompt so that the bot knows it's available. BotFunctions.generate_docs() can automatically generate the function descriptions for you from their Python docstrings, which you can then add to the chat history.
- Parameters:
func (Callable) – The function to be called by the model.
name (str) – The function name that the model should refer to. By default, it will be the actual Python function name.
docs (str) – Description of the function that overrides its pydoc string.
enabled (bool) – Boolean that toggles whether this function is added to the system prompt and able to be called or not.
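A sketch of registering a function explicitly instead of with the decorator, using BotFunctions.register() documented below (the LOCATION function and its return value are hypothetical):
def LOCATION():
    """ Returns the assistant's current location. """
    return "Pittsburgh, PA"  # hypothetical stand-in value

# equivalent to decorating LOCATION with @bot_function, with the docstring overridden
BotFunctions.register(LOCATION, docs="Returns the current city and state.")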
- class BotFunctions(all=False, load=True, test=False)[source]
Bases: object
Manager of functions able to be called by the LLM that have been registered with the bot_function() decorator or BotFunctions.register(). This is a singleton that is mostly intended to be used like a list, where BotFunctions() returns the currently enabled functions.
You can pass these to NanoLLM.generate(), and they will be called inline with the generation:
model.generate(
    BotFunctions().generate_docs() + "What is the date?",
    functions=BotFunctions()
)
BotFunctions.generate_docs() will automatically generate function descriptions from their Python docstrings. You can filter and disable functions with BotFunctions.filter().
- static __new__(cls, all=False, load=True, test=False)[source]
Return the list of enabled functions whenever BotFunctions() is called, making it seem like you are just calling a function that returns a list:
for func in BotFunctions():
    func("SQRT(64)")
If all=True, then even the disabled functions will be included. If load=True, then the built-in functions will be loaded (if they haven't yet been). If test=True, then the built-in functions will be tested (if they haven't yet been).
- classmethod list(all=False)[source]
Return the list of all enabled functions available to the bot. If all=True, then even the disabled functions will be included.
- classmethod filter(filters, mode='enable')[source]
Apply filters to the registered functions, either enabling or disabling them if their names are matched against the filter list.
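For example, a sketch of toggling the DATE() function defined earlier (assuming 'disable' is the complementary mode value to the default 'enable'):
BotFunctions.filter(['DATE'], mode='disable')  # DATE() is no longer enabled or callable
BotFunctions.filter(['DATE'], mode='enable')   # re-enable it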
- classmethod find(name, functions=None)[source]
Find a function by name, or return None if not found.
- classmethod generate_docs(prologue=True, epilogue=True, spec='python', functions=None)[source]
Collate the documentation strings from all the enabled functions.
- classmethod register(func, name=None, docs=None, enabled=True)[source]
See the docs for
bot_function()