VipLLaVA Model Card

Below is the model card of the VipLlava 13b model, which is copied from the original Llava model card that you can find here. Check out also the Google Colab demo to run Llava on a free-tier Google Colab instance (the model works similarly to Llava).

Model details

Vip-LlaVa enhances the training protocol of Llava by marking images and interacting with the model using natural cues like a “red bounding box” or “pointed arrow” during training.
How to use the model
First, make sure to have transformers >= 4.35.3.
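If needed, the required version can be installed or upgraded with pip:

pip install "transformers>=4.35.3"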
The model supports multi-image and multi-prompt generation, meaning that you can pass multiple images in your prompt. Make sure also to follow the correct prompt template and add the token <image> at the location where you want to query images.

According to the official code base, it is recommended to use this template:

A chat between a curious human and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the human's questions.###Human: <image>\n<prompt>###Assistant:
Where <prompt> denotes the prompt asked by the user.
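For example, a single prompt that queries two images can simply repeat the <image> token at the relevant positions; the question wording below is purely illustrative:

# Illustrative multi-image prompt (the question text is made up)
prompt = (
    "A chat between a curious human and an artificial intelligence assistant. "
    "The assistant gives helpful, detailed, and polite answers to the human's questions."
    "###Human: <image>\n<image>\nWhat is the difference between these two images?###Assistant:"
)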
Using pipeline:

from transformers import pipeline
from PIL import Image
import requests

model_id = "llava-hf/vip-llava-13b-hf"
pipe = pipeline("image-to-text", model=model_id)

# Download the example image and build the prompt following the template above
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/ai2d-demo.jpg"
image = Image.open(requests.get(url, stream=True).raw)
question = "What does the label 15 represent? (1) lava (2) core (3) tunnel (4) ash cloud"
prompt = f"A chat between a curious human and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the human's questions.###Human: <image>\n{question}###Assistant:"

# Run generation and print the raw pipeline output
outputs = pipe(image, prompt=prompt, generate_kwargs={"max_new_tokens": 200})
print(outputs)
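The pipeline normally returns a list with one dictionary per input; assuming the usual image-to-text output format, the answer can be extracted like this:

# outputs is typically a list such as [{"generated_text": "..."}]
answer = outputs[0]["generated_text"]
print(answer)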
Using pure transformers:

Below is an example script to run generation in float16 precision on a GPU device:

import requests
from PIL import Image
import torch
from transformers import AutoProcessor, VipLlavaForConditionalGeneration

model_id = "llava-hf/vip-llava-13b-hf"

question = "What are these?"
prompt = f"A chat between a curious human and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the human's questions.###Human: <image>\n{question}###Assistant:"
image_file = "http://images.cocodataset.org/val2017/000000039769.jpg"

# Load the model in float16 and move it to the first GPU
model = VipLlavaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
).to(0)

processor = AutoProcessor.from_pretrained(model_id)

# Preprocess the image and prompt, then generate
raw_image = Image.open(requests.get(image_file, stream=True).raw)
inputs = processor(prompt, raw_image, return_tensors='pt').to(0, torch.float16)

output = model.generate(**inputs, max_new_tokens=200, do_sample=False)
# Decode the generated sequence, skipping the first two tokens and any special tokens
print(processor.decode(output[0][2:], skip_special_tokens=True))
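Because the model supports multi-prompt generation, several image/prompt pairs can also be batched into a single call. This is a minimal sketch, assuming the processor accepts lists of prompts and images plus a padding=True argument (behaviour may vary across transformers versions):

# Hypothetical batched example: two images, two prompts
prompts = [
    "A chat between a curious human and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the human's questions.###Human: <image>\nWhat are these?###Assistant:",
    "A chat between a curious human and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the human's questions.###Human: <image>\nWhat does the label 15 represent? (1) lava (2) core (3) tunnel (4) ash cloud###Assistant:",
]
urls = [
    "http://images.cocodataset.org/val2017/000000039769.jpg",
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/ai2d-demo.jpg",
]
images = [Image.open(requests.get(u, stream=True).raw) for u in urls]

inputs = processor(prompts, images, return_tensors="pt", padding=True).to(0, torch.float16)
output = model.generate(**inputs, max_new_tokens=200, do_sample=False)
for text in processor.batch_decode(output, skip_special_tokens=True):
    print(text)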
Model optimization
4-bit quantization through bitsandbytes library

First make sure to install bitsandbytes, pip install bitsandbytes, and make sure to have access to a CUDA-compatible GPU device. Simply change the snippet above with:

model = VipLlavaForConditionalGeneration.from_pretrained(
model_id,
torch_dtype=torch.float16,
low_cpu_mem_usage=True,
+ load_in_4bit=True
)
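Alternatively, on recent transformers versions the same 4-bit setup can be expressed with a BitsAndBytesConfig object; a minimal sketch, assuming bitsandbytes is installed (exact option names may differ between versions):

from transformers import BitsAndBytesConfig

# 4-bit weights with float16 compute (assumed equivalent to load_in_4bit=True above)
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
)
model = VipLlavaForConditionalGeneration.from_pretrained(
    model_id,
    low_cpu_mem_usage=True,
    quantization_config=quantization_config,
)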
Use Flash-Attention 2 to further speed-up generation

First make sure to install flash-attn. Refer to the original repository of Flash Attention regarding that package installation. Simply change the snippet above with:

model = VipLlavaForConditionalGeneration.from_pretrained(
model_id,
torch_dtype=torch.float16,
low_cpu_mem_usage=True,
+ use_flash_attention_2=True
).to(0)
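Note that newer transformers releases replace the use_flash_attention_2 flag with the attn_implementation argument; an assumed equivalent would be:

# Assumes a transformers version that supports attn_implementation
model = VipLlavaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
    attn_implementation="flash_attention_2",
).to(0)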
License

Llama 2 is licensed under the LLAMA 2 Community License, Copyright (c) Meta Platforms, Inc. All Rights Reserved.
Citation

To cite this work please use:
@misc{cai2023making,
title={Making Large Multimodal Models Understand Arbitrary Visual Prompts},
author={Mu Cai and Haotian Liu and Siva Karthik Mustikovela and Gregory P. Meyer and Yuning Chai and Dennis Park and Yong Jae Lee},
year={2023},
eprint={2312.00784},
archivePrefix={arXiv},
primaryClass={cs.CV}
}