vip-llava-13b-hf


Technical information

Open-source URL
https://modelscope.cn/models/mirror013/vip-llava-13b-hf

Project details

VipLLaVA Model Card


Below is the model card of the VipLlava 13b model, which is copied from the original Llava model card that you can find here.

Also check out the Google Colab demo to run Llava on a free-tier Google Colab instance (the model works similarly to Llava):

Or check out our Spaces demo!

Model details

Model type: LLaVA is an open-source chatbot trained by fine-tuning LLaMA/Vicuna on GPT-generated multimodal instruction-following data. It is an auto-regressive language model, based on the transformer architecture.

Vip-LlaVa enhances the training protocol of Llava by marking images and interacting with the model using natural cues like a "red bounding box" or "pointed arrow" during training.

Model date: ViP-LLaVa was released in December 2023.

Paper or resources for more information: https://vip-llava.github.io/

How to use the model

First, make sure to have transformers >= 4.35.3 installed. The model supports multi-image and multi-prompt generation, meaning that you can pass multiple images in your prompt. Also make sure to follow the correct prompt template and add the token <image> at the location where you want to query images:

According to the official code base, it is recommended to use this template:

A chat between a curious human and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the human's questions.###Human: <image>\n<prompt>###Assistant:

where <prompt> denotes the prompt asked by the user.
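For clarity, here is a minimal sketch showing how the template can be filled in programmatically, with the \n separator between <image> and the question kept in place. The build_prompt helper is not part of the original code base, just an illustration:

# Hypothetical helper that fills in the recommended ViP-LLaVA template.
SYSTEM = (
    "A chat between a curious human and an artificial intelligence assistant. "
    "The assistant gives helpful, detailed, and polite answers to the human's questions."
)

def build_prompt(question: str) -> str:
    # <image> marks where the image features are inserted; \n separates it from the question.
    return f"{SYSTEM}###Human: <image>\n{question}###Assistant:"

print(build_prompt("What is shown in the image?"))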

Using pipeline:

from transformers import pipeline
from PIL import Image
import requests

model_id = "llava-hf/vip-llava-13b-hf"
pipe = pipeline("image-to-text", model=model_id)
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/ai2d-demo.jpg"

# Load the demo image and build the prompt following the recommended template.
image = Image.open(requests.get(url, stream=True).raw)
question = "What does the label 15 represent? (1) lava (2) core (3) tunnel (4) ash cloud"
prompt = f"A chat between a curious human and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the human's questions.###Human: <image>\n{question}###Assistant:"

outputs = pipe(image, prompt=prompt, generate_kwargs={"max_new_tokens": 200})
print(outputs)
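The pipeline returns a list of dictionaries. Assuming the standard image-to-text output format with a generated_text key (an assumption, not something stated in the original card), the answer can be pulled out like this:

# The image-to-text pipeline usually returns [{"generated_text": "...###Assistant: <answer>"}].
generated = outputs[0]["generated_text"]
answer = generated.split("###Assistant:")[-1].strip()
print(answer)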

Using pure transformers:

Below is an example script to run generation in float16 precision on a GPU device:

import requests
from PIL import Image

import torch
from transformers import AutoProcessor, VipLlavaForConditionalGeneration

model_id = "llava-hf/vip-llava-13b-hf"

question = "What are these?"
prompt = f"A chat between a curious human and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the human's questions.###Human: <image>\n{question}###Assistant:"

image_file = "http://images.cocodataset.org/val2017/000000039769.jpg"

# Load the model in float16 and place it on the first GPU.
model = VipLlavaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
).to(0)

processor = AutoProcessor.from_pretrained(model_id)

# Prepare the image and prompt tensors, then generate.
raw_image = Image.open(requests.get(image_file, stream=True).raw)
inputs = processor(prompt, raw_image, return_tensors='pt').to(0, torch.float16)

output = model.generate(**inputs, max_new_tokens=200, do_sample=False)
print(processor.decode(output[0][2:], skip_special_tokens=True))
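As noted above, the model supports multi-image and multi-prompt generation. The following is a hedged sketch of a batched call that reuses the two example images from the snippets above; keyword arguments and padding=True are used defensively, since the exact processor argument order can differ between transformers versions:

# Illustrative sketch: batched generation with two images and two prompts.
url1 = "http://images.cocodataset.org/val2017/000000039769.jpg"
url2 = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/ai2d-demo.jpg"
images = [Image.open(requests.get(u, stream=True).raw) for u in (url1, url2)]

questions = ["What are these?", "What does the label 15 represent?"]
prompts = [
    "A chat between a curious human and an artificial intelligence assistant. "
    "The assistant gives helpful, detailed, and polite answers to the human's questions."
    f"###Human: <image>\n{q}###Assistant:"
    for q in questions
]

# padding=True lets prompts of different lengths be batched together.
inputs = processor(text=prompts, images=images, padding=True, return_tensors="pt").to(0, torch.float16)
outputs = model.generate(**inputs, max_new_tokens=200, do_sample=False)
for text in processor.batch_decode(outputs, skip_special_tokens=True):
    print(text)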

Model optimization

4-bit quantization through the bitsandbytes library

First make sure to install bitsandbytes (pip install bitsandbytes) and that you have access to a CUDA-compatible GPU device. Then simply change the snippet above as follows:

model = VipLlavaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
+   load_in_4bit=True
)
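On recent transformers and bitsandbytes versions, the same 4-bit setup can also be expressed through an explicit quantization config object. This is a hedged alternative under that assumption, not part of the original card:

from transformers import BitsAndBytesConfig

# Equivalent 4-bit loading via an explicit config (assumes a recent transformers
# release; bnb_4bit_compute_dtype is optional and only sets the compute precision).
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
)

model = VipLlavaForConditionalGeneration.from_pretrained(
    model_id,
    low_cpu_mem_usage=True,
    quantization_config=quantization_config,
)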

Use Flash Attention 2 to further speed up generation

First make sure to install flash-attn. Refer to the original repository of Flash Attention for installation instructions. Then simply change the snippet above as follows:

model = VipLlavaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
+   use_flash_attention_2=True
).to(0)
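On newer transformers releases (roughly 4.36 and later), Flash Attention 2 is requested through the attn_implementation argument instead; the snippet below is a variant under that assumption:

# Hedged variant for recent transformers versions: the use_flash_attention_2 flag
# was superseded by attn_implementation="flash_attention_2".
model = VipLlavaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
    attn_implementation="flash_attention_2",
).to(0)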

License

Llama 2 is licensed under the LLAMA 2 Community License, Copyright (c) Meta Platforms, Inc. All Rights Reserved.

Citation

To cite this work please use

@misc{cai2023making,
      title={Making Large Multimodal Models Understand Arbitrary Visual Prompts},
      author={Mu Cai and Haotian Liu and Siva Karthik Mustikovela and Gregory P. Meyer and Yuning Chai and Dennis Park and Yong Jae Lee},
      year={2023},
      eprint={2312.00784},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}
