Multimodal

Multimodal Agents on Jetson AI Lab

Refer to these guides and tutorials on Jetson AI Lab: llamaspeak | Live Llava | NanoVLM

NanoLLM provides optimized multimodal pipelines, including vision/language models (VLMs), vector databases (NanoDB), and speech services, which can be integrated into interactive agents.

These are implemented through the model and chat interfaces covered in the previous sections:

python3 -m nano_llm.chat --api=mlc \
  --model Efficient-Large-Model/VILA1.5-3b \
  --prompt '/data/images/lake.jpg' \
  --prompt 'please describe the scene.' \
  --prompt 'are there any hazards to be aware of?'

See the Tested Models section for the list of multimodal models that are supported in NanoLLM.

Image Messages

You can prompt a vision/language model about an image by adding the image to the chat history and then asking a query about it:

chat_history.append(role='user', image=img) # np.ndarray, torch.Tensor, PIL.Image, cudaImage
chat_history.append(role='user', text='Describe the image.')

print(model.generate(chat_history.embed_chat()[0], streaming=False))

Image messages are embedded into the chat using the model's CLIP/SigLIP vision encoder and multimodal projector. Supported image types are np.ndarray, torch.Tensor, PIL.Image, jetson_utils.cudaImage, and URLs or local paths to image files (jpg, png, tga, bmp, gif).
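
For instance, any of the supported forms can be passed as the image. The snippet below is a minimal sketch that assumes model and chat_history have already been created as in the previous sections; the image path is a placeholder:

from PIL import Image
from jetson_utils import loadImage  # CUDA-accelerated image loading (returns a cudaImage)

# any of these can be passed as the image= argument:
img = Image.open('/data/images/lake.jpg')      # PIL.Image
# img = loadImage('/data/images/lake.jpg')     # jetson_utils.cudaImage (in CUDA memory)
# img = '/data/images/lake.jpg'                # local path or URL string

chat_history.append(role='user', image=img)
chat_history.append(role='user', text='Describe the image.')

print(model.generate(chat_history.embed_chat()[0], streaming=False))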

Code Example

#!/usr/bin/env python3
#
# This multimodal example is a simplified version of the 'Live Llava' demo,
# wherein the same prompt (or set of prompts) is applied to a stream of images.
#
# You can run it like this (these options will replicate the defaults)
#
#    python3 -m nano_llm.vision.example \
#      --model Efficient-Large-Model/VILA1.5-3b \
#      --video-input "/data/images/*.jpg" \
#      --prompt "Describe the image." \
#      --prompt "Are there people in the image?"
#
# You can specify multiple prompts (or a text file) to be applied to each image,
# and the video inputs can be sequences of files, camera devices, or network streams.
#
# For example, `--video-input /dev/video0` will capture from a V4L2 webcam. See here:
# https://github.com/dusty-nv/jetson-inference/blob/master/docs/aux-streaming.md
#
import time
import termcolor

from nano_llm import NanoLLM, ChatHistory
from nano_llm.utils import ArgParser, load_prompts
from nano_llm.plugins import VideoSource

from jetson_utils import cudaMemcpy, cudaToNumpy


# parse args and set some defaults
args = ArgParser(extras=ArgParser.Defaults + ['prompt', 'video_input']).parse_args()
prompts = load_prompts(args.prompt)

if not prompts:
    prompts = ["Describe the image.", "Are there people in the image?"]
    
if not args.model:
    args.model = "Efficient-Large-Model/VILA1.5-3b"

if not args.video_input:
    args.video_input = "/data/images/*.jpg"
    
print(args)

# load vision/language model
model = NanoLLM.from_pretrained(
    args.model, 
    api=args.api,
    quantization=args.quantization, 
    max_context_len=args.max_context_len,
    vision_model=args.vision_model,
    vision_scaling=args.vision_scaling, 
)

assert(model.has_vision)

# create the chat history
chat_history = ChatHistory(model, args.chat_template, args.system_prompt)

# open the video stream
video_source = VideoSource(**vars(args), cuda_stream=0, return_copy=False)

# apply the prompts to each frame
while True:
    img = video_source.capture()
    
    if img is None:
        continue

    chat_history.append('user', image=img)
    time_begin = time.perf_counter()
    
    for prompt in prompts:
        chat_history.append('user', prompt, use_cache=True)
        embedding, _ = chat_history.embed_chat()
        
        print('>>', prompt)
        
        reply = model.generate(
            embedding,
            kv_cache=chat_history.kv_cache,
            max_new_tokens=args.max_new_tokens,
            min_new_tokens=args.min_new_tokens,
            do_sample=args.do_sample,
            repetition_penalty=args.repetition_penalty,
            temperature=args.temperature,
            top_p=args.top_p,
        )
        
        for token in reply:
            termcolor.cprint(token, 'blue', end='\n\n' if reply.eos else '', flush=True)

        chat_history.append('bot', reply)
      
    time_elapsed = time.perf_counter() - time_begin
    print(f"time:  {time_elapsed*1000:.2f} ms  rate:  {1.0/time_elapsed:.2f} FPS")
    
    chat_history.reset()
    
    if video_source.eos:
        break

Video Sequences

The code in vision/video.py keeps a rolling history of image frames, and can be used with models trained to understand video (like VILA-1.5) for video search/summarization, action and behavior analysis, change detection, and other temporal vision tasks:

  python3 -m nano_llm.vision.video \
    --model Efficient-Large-Model/VILA1.5-3b \
    --max-images 8 \
    --max-new-tokens 48 \
    --video-input /data/my_video.mp4 \
    --video-output /data/my_output.mp4 \
    --prompt 'What changes occurred in the video?'
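
The rolling frame window can also be approximated with the same model and chat interfaces shown above. The sketch below is illustrative rather than the actual vision/video.py implementation: it assumes a window of 8 frames (matching --max-images 8), the same placeholder video path, and ChatHistory's default chat template and system prompt:

from collections import deque

from nano_llm import NanoLLM, ChatHistory
from nano_llm.plugins import VideoSource

model = NanoLLM.from_pretrained('Efficient-Large-Model/VILA1.5-3b', api='mlc')
chat_history = ChatHistory(model)

# frames are retained across captures, so return_copy is left at its default here
video_source = VideoSource(video_input='/data/my_video.mp4', cuda_stream=0)

frames = deque(maxlen=8)  # rolling window of the most recent frames

while True:
    img = video_source.capture()

    if img is None:
        continue

    frames.append(img)

    # rebuild the chat from the current frame window plus the prompt
    chat_history.reset()

    for frame in frames:
        chat_history.append(role='user', image=frame)

    chat_history.append(role='user', text='What changes occurred in the video?')

    embedding, _ = chat_history.embed_chat()
    print(model.generate(embedding, streaming=False, max_new_tokens=48))

    if video_source.eos:
        break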

Multimodal Agents

llamaspeak - Talk live with Llama using Riva ASR/TTS, and chat about images with VLMs.

Live Llava - Run multimodal models on live video streams over a repeating set of prompts.

Video VILA - Process multiple images per query for temporal understanding of video sequences.