Agents
Agents are plugins that create or connect pipelines of other nested plugins to implement higher-level behaviors with more advanced control flow. They are designed to be layered on top of each other, so that you can combine the capabilities of different agents.
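As a sketch of the idea, an agent builds its pipeline by connecting plugins downstream of one another (a minimal sketch, assuming plugins are wired together with Plugin.add()):

    # hypothetical sketch of an agent composing nested plugins into a pipeline,
    # assuming Plugin.add() connects another plugin to this plugin's output
    from nano_llm import Plugin

    def build_pipeline(source: Plugin, llm: Plugin, sink: Plugin) -> Plugin:
        source.add(llm)   # route source outputs into the LLM
        llm.add(sink)     # route LLM outputs into the sink (e.g. print or TTS)
        return source     # the head of the pipeline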
Chat Agent
- class ChatAgent(model='meta-llama/Llama-2-7b-chat-hf', interactive=True, **kwargs)[source]
Bases: Agent
Agent for two-turn multimodal chat.
- __init__(model='meta-llama/Llama-2-7b-chat-hf', interactive=True, **kwargs)[source]
- Parameters:
model (NanoLLM|str) – either the loaded model instance, or model name/path to load.
interactive (bool) – whether the agent should get user input from the terminal (default True)
- pipeline
input() → LLM → print() pipeline.
- chat
The ChatQuery session manager.
- model
The loaded NanoLLM model instance
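A minimal usage sketch (assuming ChatAgent is importable from nano_llm.agents and that the Agent base class provides a blocking run() loop):

    from nano_llm.agents import ChatAgent

    # load the default Llama-2 chat model and take user input from the terminal
    agent = ChatAgent(model='meta-llama/Llama-2-7b-chat-hf', interactive=True)
    agent.run()  # assumed to start the agent's processing loop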
Voice Chat
- class VoiceChat(asr=None, llm=None, tts=None, **kwargs)[source]
Bases: Agent
Agent for ASR → LLM → TTS pipeline.
- __init__(asr=None, llm=None, tts=None, **kwargs)[source]
- Parameters:
asr (NanoLLM.plugins.AutoASR|str) – the ASR plugin instance or model name to connect with the LLM.
llm (NanoLLM.Plugin|str) – the LLM plugin instance (like ChatQuery) or model name.
tts (NanoLLM.plugins.AutoTTS|str) – the TTS plugin instance or model name; if None, it will be loaded from kwargs.
- prompt
Text prompts from web UI or CLI.
- asr_partial(text)[source]
Callback that occurs when the ASR has a partial transcript (while the user is speaking). These partial transcripts get revised mid-stream until the user finishes their phrase. This is also used to pause or interrupt the bot's output when the user starts speaking.
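A minimal sketch of wiring up the voice pipeline by name (the 'riva' ASR and 'piper' TTS names are illustrative placeholders, not confirmed defaults):

    from nano_llm.agents import VoiceChat

    # ASR → LLM → TTS, with each stage loaded from a model name or an
    # existing plugin instance; backend names below are placeholders
    agent = VoiceChat(asr='riva', llm='meta-llama/Llama-2-7b-chat-hf', tts='piper')
    agent.run()  # assumed to start the agent's processing loop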
Web Chat
- class WebChat(**kwargs)[source]
Bases: VoiceChat
Adds webserver hooks to the ASR/TTS voice chat agent and provides a web UI. When a multimodal model is loaded, the user can drag & drop images into the UI to chat about them. Also supports streaming the client’s microphone and output speakers using WebAudio.
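By analogy with the other agents, a hedged launch sketch from the command line (the module path follows the video_stream example below, and the ASR/TTS backend names are placeholders):

    python3 -m nano_llm.agents.web_chat \
        --model meta-llama/Llama-2-7b-chat-hf \
        --asr riva --tts piper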
- __init__(**kwargs)[source]
- Parameters:
upload_dir (str) – the path to save files uploaded from the client
See VoiceChat and WebServer for inherited arguments.
- on_message(msg, msg_type=0, metadata='', **kwargs)[source]
Websocket message handler from the client.
- SAVE("<insert info here>") - save information about the user, for example SAVE("Mary likes to garden")[source]
- property system_prompt
Get the instruction prologue of the system prompt, before functions or RAG are added.
- generate_system_prompt(instruct=None, enable_autodoc=None, enable_profile=None, force_reset=False)[source]
Assemble the system prompt from the instruction prologue, function docs, and user profile.
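For example, to reset the prompt with a new instruction prologue (the argument values here are illustrative):

    # rebuild the system prompt, keeping function docs and user profile enabled
    agent.generate_system_prompt(
        instruct='You are a helpful assistant running on a Jetson.',
        enable_autodoc=True, enable_profile=True, force_reset=True,
    )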
- on_asr_waiting(transcript)[source]
If the ASR partial transcript hasn’t changed, it was probably a misrecognized sound or echo, so cancel it.
Video Stream
- class VideoStream(video_input=None, video_output=None, **kwargs)[source]
Bases: Agent
Relay, view, or test a video stream. Use the --video-input and --video-output arguments to set the video source and output protocols used from jetson_utils, like V4L2, CSI, RTP/RTSP, WebRTC, or static video files.
For example, this will capture a V4L2 camera and serve it via WebRTC with H.264 encoding:
    python3 -m nano_llm.agents.video_stream \
        --video-input /dev/video0 \
        --video-output webrtc://@:8554/output
It’s also used as a basic test of video streaming before using more complex agents that rely on it.
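The same pipeline can also be sketched programmatically (assuming the constructor takes the same source/output strings as the CLI flags, and that run() starts the agent):

    from nano_llm.agents import VideoStream

    # capture a V4L2 camera and serve it over WebRTC with H.264 encoding
    agent = VideoStream(video_input='/dev/video0',
                        video_output='webrtc://@:8554/output')
    agent.run()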
Video Query
- class VideoQuery(model='liuhaotian/llava-v1.5-13b', nanodb=None, vision_scaling='resize', **kwargs)[source]
Bases: Agent
Closed-loop visual agent that repeatedly applies a set of prompts to a video stream. It can also match the incoming stream against a vector database and use the matching metadata for multimodal RAG. It also serves an interactive web UI where the user can change queries, set event filters, and tag images in the database.
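A hedged command-line sketch, following the pattern of the video_stream example above (the flag names mirror the constructor arguments and are assumptions):

    python3 -m nano_llm.agents.video_query \
        --model liuhaotian/llava-v1.5-13b \
        --video-input /dev/video0 \
        --video-output webrtc://@:8554/output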
- __init__(model='liuhaotian/llava-v1.5-13b', nanodb=None, vision_scaling='resize', **kwargs)[source]
- Parameters:
model (NanoLLM|str) – the NanoLLM multimodal model instance, or name/path of a multimodal model to load.
nanodb (NanoDB|str) – optional NanoDB plugin instance (or path to a NanoDB on disk) to match the incoming stream against.
vision_scaling (str) – 'resize' to ignore aspect ratio when downscaling to the often-square resolution of the vision encoder, or 'crop' to center-crop the images first to maintain aspect ratio (while discarding pixels).
kwargs – forwarded to the plugin initializers for ChatQuery, VideoSource, and VideoOutput.
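A construction sketch using these parameters (the NanoDB path is hypothetical, and video_input is assumed to be a forwarded VideoSource kwarg):

    from nano_llm.agents import VideoQuery

    # match incoming frames against an on-disk NanoDB index, center-cropping
    # frames before the vision encoder
    agent = VideoQuery(model='liuhaotian/llava-v1.5-13b',
                       nanodb='/path/to/nanodb',   # hypothetical index path
                       vision_scaling='crop',
                       video_input='/dev/video0')  # forwarded to VideoSource
    agent.run()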
- llm
The model plugin (ChatQuery)
- video_source
The video source plugin
- video_output
The video output plugin
- server
The webserver (by default on https://localhost:8050)
- events
Event filters for parsing bot output and triggering actions when conditions are met.
- on_video(image)[source]
When a new frame is received from the video source, run the model on it with the set prompt, applying RAG using the metadata from the most similar match in the vector database (if enabled). Then render the latest text from the model over the image, and send it to the output video stream.
- on_text(text)[source]
When new output is received from the model, update the text to render, and check whether it satisfies any of the event filters once the output is complete.
- on_image_embedding(embedding)[source]
Receive the image embedding from CLIP that was used when the model processed the last image, and search it against the database to find the most similar images and their metadata for RAG. Also, if the user requested that the last image be tagged, add the embedding to the vector database along with the metadata tags.
- on_search(results)[source]
Receive the similar matches from the vector database and update RAG with them, along with the most recent results shown in the web UI.