This page includes information about managing multi-turn chat sessions, templating, and maintaining the embedding history. Here’s how to run it interactively from the terminal:

python3 -m --api mlc \
  --model meta-llama/Meta-Llama-3-8B-Instruct \
  --quantization q4f16_ft

If you load a multimodal model (like liuhaotian/llava-v1.6-vicuna-7b), you can enter image filenames or URLs and a query to chat about images. Enter /reset to reset the chat history.

Code Example

#!/usr/bin/env python3
import argparse
import termcolor

from nano_llm import NanoLLM, ChatHistory

# parse arguments
parser = argparse.ArgumentParser(formatter_class=argparse.ArgumentDefaultsHelpFormatter)
parser.add_argument('--model', type=str, default='meta-llama/Meta-Llama-3-8B-Instruct', help="path to the model, or HuggingFace model repo")
parser.add_argument('--max-new-tokens', type=int, default=256, help="the maximum response length for each bot reply")
args = parser.parse_args()

# load model
model = NanoLLM.from_pretrained(

# create the chat history
chat_history = ChatHistory(model, system_prompt="You are a helpful and friendly AI assistant.")

while True:
    # enter the user query from terminal
    print('>> ', end='', flush=True)
    prompt = input().strip()

    # add user prompt and generate chat tokens/embeddings
    chat_history.append('user', prompt)
    embedding, position = chat_history.embed_chat()

    # generate bot reply
    reply = model.generate(
    # stream the output
    for token in reply:
        termcolor.cprint(token, 'blue', end='\n\n' if reply.eos else '', flush=True)

    # save the final output
    chat_history.append('bot', reply)


These are the built-in chat templates that are automatically determined from the model type, or settable with the --chat-template command-line argument:

* llama-2, llama-3
* vicuna-v0, vicuna-v1
* stablelm-zephyr
* chat-ml
* sheared-llama
* nous-obsidian
* phi-2-chat, phi-2-instruct
* gemma

See nano_llm/chat/ for them. You can also specify a JSON file containing the template.

Chat History

class ChatHistory(model, chat_template=None, system_prompt=None, **kwargs)[source]

Bases: object

Multimodal chat history that can contain a mix of media including text/images.

ChatHistory objects can be indexed like a list to access its messages, where each ChatMessage can have a different type of content:

chat_history[n]  # will return the n-th chat entry

Each type of media has an associated embedding function (e.g. LLM’s typically do text token embedding internally, and images use CLIP + projection layers). From these, it assembles the embedding for the entire chat as input to the LLM.

It uses templating to add the required special tokens as defined by different model architectures. In normal 2-turn chat, there are ‘user’ and ‘bot’ roles defined, but arbitrary roles can be added, each with their own template.

The system prompt can also be configured through the chat template and by setting the ChatHistory.system_prompt property.


The KVCache from NanoLLM.generate() used to store the model state.

property num_tokens

Return the number of tokens used by the chat so far. embed_chat() needs to have been called for this to be upated, because otherwise the input wouldn’t have been tokenized yet.


Returns the number of messages in the chat history


Return the n-th chat message with the subscript indexing operator


Remove one or more messages from the chat history:

del chat_history[-2]   # remove the second-to-last entry
del chat_history[-2:]  # pop the last 2 entries
del chat_history[1:]   # remove all entries but the first

This will also update the KV cache and alter the bot memory.

append(role='user', msg=None, **kwargs)[source]

Add a chat entry consisting of a text message, image, ect. See the ChatMessage class for description of arguments. This can also accept an existing ChatMessage set to msg.


Remove the last N messages from the chat and KV cache.

remove(start, stop=None)[source]

Remove the chat entries from the start (inclusive) to stop (exclusive) indexes. If stop is not specified, then only the single entry at the start index will be removed:

chat_history.remove(0)    # remove the first chat entry
chat_history.remove(0,2)  # remove the first and second chat entries
chat_history.remove(-1)   # remove the last chat entry
chat_history.remove(-2,0) # remove the last two entries

This will also update the KV cache and alter the bot’s memory (potentially destructively)

reset(system_prompt=True, use_cache=True, wrap_tokens=None)[source]

Reset the chat history, and optionally add the system prompt to the new chat. If use_cache=True, then the system prompt tokens/embedding will be cached. If wrap_tokens is set, then the most recent N tokens from the chat will be kept.


Returns true if it’s the given role’s turn in the chat, otherwise false.

to_list(messages=None, html=False)[source]

Serialize the history to a list of dicts, where each dict is a chat entry with the non-critical keys removed (suitable for web transport, ect)

add_system_prompt(system_prompt=None, use_cache=True)[source]

Add the system prompt message to the chat, containing ChatHistory.system_prompt appended by the tool function descriptions if tools are enabled. If the system role is not defined by the model’s chat template, then this function does nothing.


use_cache (bool) – If true, then the system prompt tokens/embeddedings will be cached. This is the default because the system prompt typically may not change.


The ChatMessage that was added to the chat with the system role.

property system_prompt

Get the system prompt, the typically hidden instruction at the beginning of the chat like “You are a curious and helpful AI assistant, …”

embed_chat(use_cache=True, max_tokens=None, wrap_tokens=None, **kwargs)[source]

Assemble the embedding of either the latest or entire chat.

If use_cache=True (the default), and only the new embeddings will be returned. If use_cache=False, then the entire chat history will be returned.

This function returns an (embedding, position) tuple, where the embedding array contains the new embeddings (or tokens) from the chat, and position is the current overall position in the history (up to the model’s context window length)

If the number of tokens in the chat history exceeds the length given in max_tokens argument (which is typically the model’s context window, minus the max generation length), then the chat history will drop all but the latest wrap_tokens, starting with a user prompt. If max_tokens is provided but wrap_tokens is not, then the overflow tokens will be truncated.


Update the linked lists in the messages that refer to each other. This gets called after messages are added, removed, or their order changed. You wouldn’t typically need to call this yourself.


Find the oldest entry from which the chat doesn’t exceed the number of wrap_tokens, and that the entry should be a user query. This is used to keep those more recent chat entries when the history overflows past the max context window of the model.


Sanitize message contents to HTML representation, apply code formatting, ect.

run_tools(message, tools={}, append=True)[source]

Invoke any function calls in the output text and return the results.

Chat Message

class ChatMessage(role='user', text=None, image=None, **kwargs)[source]

Bases: object

Create a chat entry consisting of a text message, image, ect as input.

  • role (str) – The chat’s turn template to apply, typically ‘user’ or ‘bot’. The role should have a corresponding entry in the active ChatTemplate.

  • text (str) – String containing the message’s content for text messages.

  • image (str|image) – Either a np.ndarray, torch.Tensor, cudaImage, PIL.Image, or a path to an image file (.jpg, .png, .bmp, ect)

  • kwargs

    For messages with alternate content types, pass them in via kwargs and they will automatically be determined like so:

    message = ChatMessage(role='user', audio='sounds.wav')

    There are additional lower-level kwargs that can be set below.

  • use_cache (bool) – cache the tokens/embeddings for reused prompts (defaults to false)

  • tokens (list[int] or np.ndarray) – the message contents already having been tokenized

  • embedding (np.ndarray) – the message contents already having been embedded

  • history (ChatHistory) – the ChatHistory object this message belongs to


The user role or character (‘user’, ‘assistant’, ‘system’, ect)


The version of this message with the role template applied


The tokenized version of the message


The embedding of the message


The ChatHistory object this message belongs to


Set to true if the tokens/embeddings should be cached for reused prompts


Set to true if the message is already in the chat embedding


The index of this message in the chat history


The previous message in the chat history


The next message in the chat history


The type of the message (‘text’, ‘image’, ‘audio’, ect)


The content or media contained in the message

property num_tokens

Return the number of tokens used by this message. embed() needs to have been called for this to be valid.

property start_token

The token offset or position in the chat history at which this message begins.

static content_type(content)[source]

Try to automatically determine the message content type.


Return true if the message is of the given type (like ‘text’, ‘image’, ect)

embed(return_tensors='np', **kwargs)[source]

Apply message templates, tokenization, and generate the embedding.

Function Calling

You can expose Python functions that the model is able to invoke using its code generation abilities, should you so instruct it to. A list of functions can be provided to NanoLLM.generate() that will be called inline with the generation, and recieve the output produced so far by the model.

These functions can then parse the text from the bot to determine if it was called, and execute it accordingly. Any text returned by these functions will be added to the chat before resuming generation, so the bot is able to utilize them the rest of its reply.

The bot_function() decorator automatically wraps Python functions, performs regex matching on the model output, runs them if they were called using Python eval(), and returns any results:

from nano_llm import NanoLLM, ChatHistory, BotFunctions, bot_function
from datetime import datetime

def DATE():
    """ Returns the current date. """
    return"%A, %B %-m %Y")
def TIME():
    """ Returns the current time. """
    return"%-I:%M %p")
# load the model   
model = NanoLLM.from_pretrained(

# create the chat history
system_prompt = "You are a helpful and friendly AI assistant." + BotFunctions.generate_docs()
chat_history = ChatHistory(model, system_prompt=system_prompt)

while True:
    # enter the user query from terminal
    print('>> ', end='', flush=True)
    prompt = input().strip()

    # add user prompt and generate chat tokens/embeddings
    chat_history.append(role='user', msg=prompt)
    embedding, position = chat_history.embed_chat()

    # generate bot reply (give it function access)
    reply = model.generate(
    # stream the output
    for token in reply:
        print(token, end='\n\n' if reply.eos else '', flush=True)

    # save the final output
    chat_history.append(role='bot', text=reply.text, tokens=reply.tokens)
    chat_history.kv_cache = reply.kv_cache
bot_function(func, name=None, docs=None, enabled=True)[source]

Decorator for exposing a function to be callable by the LLM. This will create wrapper functions that do the parsing to determine if this function was called in the output text, and then interpret it to invoke the function call. Text returned from these functions will be added to the chat.

For example, this definition will expose the TIME() function to the bot:

def TIME():
    ''' Returns the current time. '''
    return"%-I:%M %p")

You should then add instructions for calling it to the system prompt so that the bot knows it’s available. BotFunctions.generate_docs() can automatically generate the function descriptions for you from their Python docstrings, which you can then add to the chat history.

  • func (Callable) – The function to be called by the model.

  • name (str) – The function name that the model should refer to. By default, it will be the actual Python function name.

  • docs (str) – Description of the function that overrides its pydoc string.

  • enabled (bool) – Boolean that toggles whether this function is added to the system prompt and able to be called or not.

class BotFunctions(all=False, load=True, test=False)[source]

Bases: object

Manager of functions able to be called by the LLM that have been registered with the bot_function() decorator or BotFunctions.register(). This is a singleton that is mostly intended to be used like a list, where BotFunction() returns the currently enabled functions.

You can pass these to NanoLLM.generate(), and they will be called inline with the generation:

    BotFunctions().generate_docs() + "What is the date?",

BotFunctions.generate_docs() will automatically generate function descriptions from their Python docstrings. You can filter and disable functions with BotFunctions.filter()

static __new__(cls, all=False, load=True, test=False)[source]

Return the list of enabled functions whenever BotFunctions() is called, making it seem like you are just calling a function that returns a list:

for func in BotFunctions():

If all=True, then even the disabled functions will be included. If load=True, then the built-in functions will be loaded (if they haven’t yet been). If test=True, then the built-in functions will be tested (if they haven’t yet been).

classmethod len()[source]

Returns the number of all registered bot functions.

classmethod list(all=False)[source]

Return the list of all enabled functions available to the bot. If all=True, then even the disabled functions will be included.

classmethod filter(filters, mode='enable')[source]

Apply filters to the registered functions, either enabling or disabling them if their names are matched against the filter list.

classmethod find(name, functions=None)[source]

Find a function by name, or return None if not found

classmethod generate_docs(prologue=True, epilogue=True, spec='python', functions=None)[source]

Collate the documentation strings from all the enabled functions

classmethod register(func, name=None, docs=None, enabled=True)[source]

See the docs for bot_function()

classmethod run(text, template=None, functions=None)[source]

Invoke any function calls in the output text and return the results.

classmethod load(test=True)[source]

Load the built-in functions by importing their modules.

classmethod test(disable_on_error=True)[source]

Test that the functions are able to be run, and disable them if not. Returns true if all tests passed, otherwise false.