Multimodal
Multimodal Agents on Jetson AI Lab
Refer to these guides and tutorials on Jetson AI Lab: llamaspeak | Live Llava | NanoVLM
NanoLLM provides optimized multimodal pipelines, including vision/language models (VLM), vector databases (NanoDB), and speech services that can be integrated into interactive agents.
These are implemented through the model and chat interfaces covered in the previous sections:
python3 -m nano_llm.chat --api=mlc \
  --model Efficient-Large-Model/VILA1.5-3b \
  --prompt '/data/images/lake.jpg' \
  --prompt 'please describe the scene.' \
  --prompt 'are there any hazards to be aware of?'
See the Tested Models section for the list of multimodal models that are supported in NanoLLM.
Image Messages
You can get a vision/language model to respond about an image by adding it to the chat history and then asking a question about it:
chat_history.append(role='user', image=img) # np.ndarray, torch.Tensor, PIL.Image, cudaImage
chat_history.append(role='user', text='Describe the image.')
print(model.generate(chat_history.embed_chat()[0], streaming=False))
Image messages will be embedded into the chat using the model’s CLIP/SigLIP vision encoder and multimodal projector. Supported image types are np.ndarray, torch.Tensor, PIL.Image, jetson_utils.cudaImage, and URLs or local paths to image files (jpg, png, tga, bmp, gif).
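For example, here is a minimal sketch that passes the same image first as a local file path and then as a PIL.Image. It assumes model and chat_history have already been created as in the previous sections, and reuses the sample image path from the command above:

from PIL import Image

# assumes `model` and `chat_history` were created as in the previous sections, e.g.:
#   model = NanoLLM.from_pretrained('Efficient-Large-Model/VILA1.5-3b', api='mlc')
#   chat_history = ChatHistory(model)

# pass the image as a local path (URLs are handled the same way)
chat_history.append(role='user', image='/data/images/lake.jpg')
chat_history.append(role='user', text='Describe the image.')
print(model.generate(chat_history.embed_chat()[0], streaming=False))

# or load it yourself and pass a PIL.Image (np.ndarray, torch.Tensor, and cudaImage also work)
chat_history.reset()
chat_history.append(role='user', image=Image.open('/data/images/lake.jpg'))
chat_history.append(role='user', text='Are there any hazards to be aware of?')
print(model.generate(chat_history.embed_chat()[0], streaming=False))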
Code Example
#!/usr/bin/env python3
#
# This multimodal example is a simplified version of the 'Live Llava' demo,
# wherein the same prompt (or set of prompts) is applied to a stream of images.
#
# You can run it like this (these options will replicate the defaults)
#
#    python3 -m nano_llm.vision.example \
#      --model Efficient-Large-Model/VILA1.5-3b \
#      --video-input "/data/images/*.jpg" \
#      --prompt "Describe the image." \
#      --prompt "Are there people in the image?"
#
# You can specify multiple prompts (or a text file) to be applied to each image,
# and the video inputs can be sequences of files, camera devices, or network streams.
#
# For example, `--video-input /dev/video0` will capture from a V4L2 webcam. See here:
# https://github.com/dusty-nv/jetson-inference/blob/master/docs/aux-streaming.md
#
import time
import termcolor
from nano_llm import NanoLLM, ChatHistory
from nano_llm.utils import ArgParser, load_prompts
from nano_llm.plugins import VideoSource
from jetson_utils import cudaMemcpy, cudaToNumpy
# parse args and set some defaults
args = ArgParser(extras=ArgParser.Defaults + ['prompt', 'video_input']).parse_args()
prompts = load_prompts(args.prompt)
if not prompts:
    prompts = ["Describe the image.", "Are there people in the image?"]

if not args.model:
    args.model = "Efficient-Large-Model/VILA1.5-3b"

if not args.video_input:
    args.video_input = "/data/images/*.jpg"

print(args)

# load vision/language model
model = NanoLLM.from_pretrained(
    args.model,
    api=args.api,
    quantization=args.quantization,
    max_context_len=args.max_context_len,
    vision_model=args.vision_model,
    vision_scaling=args.vision_scaling,
)

assert(model.has_vision)

# create the chat history
chat_history = ChatHistory(model, args.chat_template, args.system_prompt)

# open the video stream
video_source = VideoSource(**vars(args), cuda_stream=0, return_copy=False)

# apply the prompts to each frame
while True:
    img = video_source.capture()

    if img is None:
        continue

    chat_history.append('user', image=img)
    time_begin = time.perf_counter()

    for prompt in prompts:
        chat_history.append('user', prompt, use_cache=True)
        embedding, _ = chat_history.embed_chat()

        print('>>', prompt)

        reply = model.generate(
            embedding,
            kv_cache=chat_history.kv_cache,
            max_new_tokens=args.max_new_tokens,
            min_new_tokens=args.min_new_tokens,
            do_sample=args.do_sample,
            repetition_penalty=args.repetition_penalty,
            temperature=args.temperature,
            top_p=args.top_p,
        )

        for token in reply:
            termcolor.cprint(token, 'blue', end='\n\n' if reply.eos else '', flush=True)

        chat_history.append('bot', reply)

    time_elapsed = time.perf_counter() - time_begin
    print(f"time: {time_elapsed*1000:.2f} ms  rate: {1.0/time_elapsed:.2f} FPS")

    chat_history.reset()

    if video_source.eos:
        break
Video Sequences
The code in vision/video.py keeps a rolling history of image frames, and can be used with models that were trained to understand video (like VILA-1.5) to apply video search/summarization, action & behavior analysis, change detection, and other temporal-based vision functions:
python3 -m nano_llm.vision.video \
  --model Efficient-Large-Model/VILA1.5-3b \
  --max-images 8 \
  --max-new-tokens 48 \
  --video-input /data/my_video.mp4 \
  --video-output /data/my_output.mp4 \
  --prompt 'What changes occurred in the video?'
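If you want to experiment with the same idea using only the interfaces from the code example above, a rolling window can be approximated by embedding the last few frames into the chat before each query. This is a minimal sketch, not the actual vision/video.py implementation; the cudaMemcpy copy is an assumption to keep captured frames valid between captures when return_copy=False is used:

from collections import deque
from jetson_utils import cudaMemcpy

# `model`, `chat_history`, and `video_source` are created as in the code example above
MAX_IMAGES = 8                     # rolling window size (like --max-images above)
frames = deque(maxlen=MAX_IMAGES)  # oldest frames are dropped automatically

while True:
    img = video_source.capture()

    if img is None:
        continue

    # copy the frame so it isn't overwritten by the next capture (assumption for return_copy=False)
    frames.append(cudaMemcpy(img))

    # rebuild the chat from the most recent frames, then ask about the sequence
    chat_history.reset()

    for frame in frames:
        chat_history.append('user', image=frame)

    chat_history.append('user', 'What changes occurred in the video?')

    print(model.generate(chat_history.embed_chat()[0], streaming=False, max_new_tokens=48))

    if video_source.eos:
        break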
Multimodal Agents
llamaspeak - Talk live with Llama using Riva ASR/TTS, and chat about images with VLMs.
Live Llava - Run multimodal models on live video streams over a repeating set of prompts.
Video VILA - Process multiple images per query for temporal understanding of video sequences.