Agents

Agents are just plugins that create or connect pipelines of other nested plugins, implementing higher-level behaviors with more advanced control flow. They are designed to be layered on top of each other, so that you can combine the capabilities of different agents.

Chat Agent

class ChatAgent(model='meta-llama/Llama-2-7b-chat-hf', interactive=True, **kwargs)[source]

Bases: Agent

Agent for two-turn multimodal chat.

__init__(model='meta-llama/Llama-2-7b-chat-hf', interactive=True, **kwargs)[source]
Parameters:
  • model (NanoLLM|str) – either the loaded model instance, or model name/path to load.

  • interactive (bool) – should the agent get user input from the terminal or not (default True)

pipeline

input() → LLM → print() pipeline.

chat

The ChatQuery session manager

model

The loaded NanoLLM model instance

on_interrupt(signum, frame)[source]

Interrupts the bot output when the user presses Ctrl+C.
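
A minimal usage sketch, assuming ChatAgent is importable from nano_llm.agents and that the Agent base class provides a blocking run() method (the model name is just the default shown above):

# hypothetical example: load the model by name and start the interactive terminal chat loop
from nano_llm.agents import ChatAgent

agent = ChatAgent(model='meta-llama/Llama-2-7b-chat-hf', interactive=True)
agent.run()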

Voice Chat

class VoiceChat(asr=None, llm=None, tts=None, **kwargs)[source]

Bases: Agent

Agent for ASR → LLM → TTS pipeline.

__init__(asr=None, llm=None, tts=None, **kwargs)[source]
Parameters:
  • asr (NanoLLM.plugins.AutoASR|str) – the ASR plugin instance or model name to connect with the LLM.

  • llm (NanoLLM.Plugin|str) – The LLM model plugin instance (like ChatQuery) or model name.

  • tts (NanoLLM.plugins.AutoTTS|str) – the TTS plugin instance (or model name) – if None, it will be loaded from kwargs.

prompt

Text prompts from web UI or CLI.

asr_partial(text)[source]

Callback that occurs when the ASR has a partial transcript (while the user is speaking). These partial transcripts get revised mid-stream until the user finishes their phrase. This is also used for pausing/interrupting the bot output when the user starts speaking.

asr_final(text)[source]

Callback that occurs when the ASR outputs a final transcript, after a pause in the user's speech (like at the end of a sentence or paragraph). This will interrupt/cancel any ongoing bot output.

on_interrupt()[source]

Interrupt/cancel the bot output when the user submits (or speaks) a full query.
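
A minimal usage sketch, assuming VoiceChat is importable from nano_llm.agents, that Agent provides a blocking run() method, and that the 'riva' ASR and 'piper' TTS backends are available in your build (substitute the model and backend names you actually have):

# hypothetical example: microphone -> ASR -> LLM -> TTS -> speakers
from nano_llm.agents import VoiceChat

agent = VoiceChat(asr='riva', llm='meta-llama/Llama-2-7b-chat-hf', tts='piper')
agent.run()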

Web Chat

class WebChat(**kwargs)[source]

Bases: VoiceChat

Adds webserver hooks to the ASR/TTS voice chat agent and provides a web UI. When a multimodal model is loaded, the user can drag & drop images into the UI to chat about them. Also supports streaming the client's microphone and output speakers using WebAudio.

__init__(**kwargs)[source]
Parameters:
  • upload_dir (str) – the path to save files uploaded from the client

See VoiceChat and WebServer for inherited arguments.

on_message(msg, msg_type=0, metadata='', **kwargs)[source]

Websocket message handler from the client.

SAVE("<insert info here>") - save information about the user, for example SAVE("Mary likes to garden")[source]
property system_prompt

Get the instruction prologue of the system prompt, before functions or RAG are added.

generate_system_prompt(instruct=None, enable_autodoc=None, enable_profile=None, force_reset=False)[source]

Assemble the system prompt from the instruction prologue, function docs, and user profile.

on_asr_partial(text)[source]

Update the web chat history when a partial ASR transcript arrives.

on_asr_waiting(transcript)[source]

If the ASR partial transcript hasn't changed, it was probably a misrecognized sound or echo, so cancel it.

on_llm_reply(text)[source]

Update the web chat history when the latest LLM response arrives.

on_tts_samples(audio)[source]

Send audio samples to the client when they arrive.

send_chat_history()[source]

Sanitize the chat history for HTML and send it to the client.

start()[source]

Start the webserver & websocket listening in other threads.
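
A minimal usage sketch, assuming WebChat is importable from nano_llm.agents and accepts the inherited VoiceChat/WebServer keyword arguments listed above (the ASR/TTS backend names and upload path are placeholders):

# hypothetical example: voice chat with the browser-based web UI
from nano_llm.agents import WebChat

agent = WebChat(asr='riva', tts='piper',
                llm='meta-llama/Llama-2-7b-chat-hf',
                upload_dir='/tmp/uploads')
agent.run()  # then connect to the web UI from a browser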

Video Stream

class VideoStream(video_input=None, video_output=None, **kwargs)[source]

Bases: Agent

Relay, view, or test a video stream. Use the --video-input and --video-output arguments to set the video source and output protocols used from jetson_utils like V4L2, CSI, RTP/RTSP, WebRTC, or static video files.

For example, this will capture a V4L2 camera and serve it via WebRTC with H.264 encoding:

python3 -m nano_llm.agents.video_stream \
   --video-input /dev/video0 \
   --video-output webrtc://@:8554/output

It’s also used as a basic test of video streaming before using more complex agents that rely on it.

__init__(video_input=None, video_output=None, **kwargs)[source]
Parameters:
  • video_input (Plugin|str) – the VideoSource plugin instance, or URL of the video stream or camera device.

  • video_output (Plugin|str) – the VideoOutput plugin instance, or output stream URL / device ID.
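
The Python equivalent of the command-line example above would look roughly like this (a sketch, assuming VideoStream is importable from nano_llm.agents and that Agent provides a blocking run() method):

# hypothetical example: capture a V4L2 camera and serve it over WebRTC
from nano_llm.agents import VideoStream

agent = VideoStream(video_input='/dev/video0',
                    video_output='webrtc://@:8554/output')
agent.run()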

Video Query

class VideoQuery(model='liuhaotian/llava-v1.5-13b', nanodb=None, vision_scaling='resize', **kwargs)[source]

Bases: Agent

Closed-loop visual agent that repeatedly applies a set of prompts to a video stream, and is also able to match the incoming stream against a vector database, and then use the matching metadata for multimodal RAG. Also serves an interactive web UI for the user to change queries, set event filters, and tag images in the database.

__init__(model='liuhaotian/llava-v1.5-13b', nanodb=None, vision_scaling='resize', **kwargs)[source]
Parameters:
  • model (NanoLLM|str) – the NanoLLM multimodal model instance, or name/path of a multimodal model to load.

  • nanodb (NanoDB|str) – optional NanoDB plugin instance (or path to a NanoDB on disk) to match the incoming stream against.

  • vision_scaling (str) – 'resize' to ignore aspect ratio when downscaling to the often-square resolution of the vision encoder, or 'crop' to center-crop the images first to maintain aspect ratio (while discarding pixels).

  • kwargs – forwarded to the plugin initializers for ChatQuery, VideoSource, and VideoOutput

llm

The model plugin (ChatQuery)

video_source

The video source plugin

video_output

The video output plugin

server

The webserver (by default on https://localhost:8050)

events

Event filters for parsing bot output and triggering actions when conditions are met.

on_video(image)[source]

When a new frame is received from the video source, run the model on it with the set prompt, applying RAG using the metadata from the most-similar match in the vector database (if enabled). Then render the latest text from the model over the image and send it to the output video stream.

on_text(text)[source]

When new output is received from the model, update the text to render, and check whether it satisfies any of the event filters once the output is complete.

on_image_embedding(embedding)[source]

Receive the image embedding from CLIP that was used when the model processed the last image, and search it against the database to find the most similar images and their metadata for RAG. Also, if the user requested that the last image be tagged, add the embedding to the vector database along with the metadata tags.

Receive the similar matches from the vector database and update RAG with them, along with the most recent results shown in the web UI.

on_websocket(msg, msg_type=0, metadata='', **kwargs)[source]

Websocket message handler from the client.

start()[source]

Start the webserver & websocket listening in other threads.
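
A minimal usage sketch, assuming VideoQuery is importable from nano_llm.agents, that Agent provides a blocking run() method, and that the video_input/video_output kwargs are forwarded to the VideoSource/VideoOutput plugins as described above (the camera device and NanoDB path are placeholders):

# hypothetical example: run the set prompt over a live camera stream,
# with multimodal RAG against an on-disk NanoDB index
from nano_llm.agents import VideoQuery

agent = VideoQuery(model='liuhaotian/llava-v1.5-13b',
                   nanodb='/path/to/your/nanodb',
                   video_input='/dev/video0',
                   video_output='webrtc://@:8554/output')
agent.run()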