Anonymous user · July 31, 2024

Technical Information

Open-source URL
https://modelscope.cn/models/AI-ModelScope/fuyu-8b
License
cc-by-nc-4.0

Details

Fuyu-8B Model Card

We’re releasing Fuyu-8B, a small version of the multimodal model that powers our product. The model is available on HuggingFace. We think Fuyu-8B is exciting because:

  1. It has a much simpler architecture and training procedure than other multi-modal models, which makes it easier to understand, scale, and deploy.
  2. It’s designed from the ground up for digital agents, so it can support arbitrary image resolutions, answer questions about graphs and diagrams, answer UI-based questions, and do fine-grained localization on screen images.
  3. It’s fast - we can get responses for large images in less than 100 milliseconds.
  4. Despite being optimized for our use case, it performs well at standard image-understanding benchmarks such as visual question-answering and natural-image captioning.

Please note that the model we have released is a base model. We expect you to need to fine-tune the model for specific use cases like verbose captioning or multimodal chat. In our experience, the model responds well to few-shotting and fine-tuning for a variety of use cases.

Model

Fuyu-8B is a multi-modal text and image transformer trained by Adept AI.

Architecturally, Fuyu is a vanilla decoder-only transformer - there is no image encoder. Image patches are instead linearly projected into the first layer of the transformer, bypassing the embedding lookup. We simply treat the transformer decoder like an image transformer (albeit with no pooling and causal attention). See the diagram below for more details.

[architecture diagram]

This simplification allows us to support arbitrary image resolutions. To accomplish this, we treat the sequence of image tokens like the sequence of text tokens. We remove image-specific position embeddings and feed in as many image tokens as necessary in raster-scan order. To tell the model when a line has broken, we simply use a special image-newline character. The model can use its existing position embeddings to reason about different image sizes, and we can use images of arbitrary size at training time, removing the need for separate high- and low-resolution training stages.
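The patch-to-token scheme above can be sketched in a few lines. This is a hypothetical illustration, not the actual Adept implementation: the patch size, hidden size, and all names (`patchify`, `proj`, `image_newline`) are assumptions made for the example. Each fixed-size patch is flattened, linearly projected into the transformer's embedding space, and a learned "image newline" embedding is appended after each raster row.

```python
import torch

PATCH = 30      # assumed patch size (illustrative)
D_MODEL = 4096  # assumed hidden size (illustrative)

# per-patch linear projection into the first transformer layer
proj = torch.nn.Linear(PATCH * PATCH * 3, D_MODEL)
# learned "image newline" embedding marking the end of a patch row
image_newline = torch.nn.Parameter(torch.randn(D_MODEL))

def patchify(img: torch.Tensor) -> torch.Tensor:
    """img: (3, H, W) with H, W divisible by PATCH -> (n_tokens, D_MODEL)."""
    c, h, w = img.shape
    rows = []
    for y in range(0, h, PATCH):
        row_patches = []
        for x in range(0, w, PATCH):
            patch = img[:, y:y + PATCH, x:x + PATCH].reshape(-1)  # flatten
            row_patches.append(proj(patch))                       # project to d_model
        rows.append(torch.stack(row_patches + [image_newline]))   # row + newline token
    return torch.cat(rows)  # tokens in raster-scan order

tokens = patchify(torch.randn(3, 90, 120))
print(tokens.shape)  # torch.Size([15, 4096]) — 3 rows x 4 patches + 3 newline tokens
```

Because image tokens reuse the ordinary 1-D text position embeddings, nothing in this scheme depends on a fixed resolution - a larger image simply produces more tokens.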

Model Description

  • Developed by: Adept-AI
  • Model type: Decoder-only multi-modal transformer model
  • License: CC-BY-NC
  • Model Description: This is a multi-modal model that can consume images and text and produce text.
  • Resources for more information: Check out our blog post.

Evaluation

Though not the focus of this model, we did evaluate it on standard image-understanding benchmarks:

Eval Task      Fuyu-8B  Fuyu-Medium  LLaVA 1.5 (13.5B)  QWEN-VL (10B)  PALI-X (55B)  PALM-e-12B  PALM-e-562B
VQAv2          74.2     77.4         80                 79.5           86.1          76.2        80.0
OKVQA          60.6     63.1         n/a                58.6           66.1          55.5        66.1
COCO Captions  141      138          n/a                n/a            149           135         138
AI2D           64.5     73.7         n/a                62.3           81.2          n/a         n/a

How to Use

You can load the model and perform inference as follows:

from modelscope import AutoModelForCausalLM, AutoTokenizer
from transformers import FuyuProcessor, FuyuImageProcessor
from PIL import Image
import os
import torch

# load model, tokenizer, and processor
pretrained_path = "AI-ModelScope/fuyu-8b"
tokenizer = AutoTokenizer.from_pretrained(pretrained_path)

image_processor = FuyuImageProcessor()
processor = FuyuProcessor(image_processor=image_processor, tokenizer=tokenizer)

model = AutoModelForCausalLM.from_pretrained(pretrained_path, device_map="auto", torch_dtype=torch.float16)

# test inference
text_prompt = "Generate a coco-style caption.\n"
image_path = os.path.join(model.model_dir, "aiyisita.jpg")
image_pil = Image.open(image_path)

model_inputs = processor(text=text_prompt, images=[image_pil], return_tensors="pt")
for k, v in model_inputs.items():
    model_inputs[k] = v.to("cuda:0")

generation_output = model.generate(**model_inputs, max_new_tokens=7)
generation_text = processor.batch_decode(generation_output[:, -7:], skip_special_tokens=True)
print(generation_text)

N.B.: The token |SPEAKER| is a placeholder token for image patch embeddings, so it will show up in the model context (e.g., in the portion of generation_output representing the model context). |NEWLINE| is the "image newline" token, denoting new rows in the raster-scan order input of the image patches. \x04 is the "beginning of answer" token.
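Given that layout, the model's answer is whatever follows the \x04 token in the decoded string. A minimal post-processing helper - hypothetical, not part of the released code - could therefore split on the beginning-of-answer token:

```python
def extract_answer(decoded: str) -> str:
    """Return only the text after the beginning-of-answer token \x04.

    If \x04 is absent (e.g. the decode window started past it), this
    returns an empty string rather than the full context.
    """
    _, _, answer = decoded.partition("\x04")
    return answer.strip()

# illustrative input shaped like a decoded prompt + answer
print(extract_answer("What color is the bus?\n\x04 The bus is blue.\n"))
# -> The bus is blue.
```

This is only needed when decoding a window that still contains context tokens; slicing generation_output to the last max_new_tokens positions, as in the snippets here, largely avoids the issue.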

Fuyu can also perform some question answering on natural images and charts/diagrams (though fine-tuning may be required for good performance):

text_prompt = "What color is the bus?\n"
image_path = os.path.join(model.model_dir, "bus.png")
image_pil = Image.open(image_path)

model_inputs = processor(text=text_prompt, images=[image_pil], return_tensors="pt")
for k, v in model_inputs.items():
    model_inputs[k] = v.to("cuda:0")

generation_output = model.generate(**model_inputs, max_new_tokens=6)
generation_text = processor.batch_decode(generation_output[:, -6:], skip_special_tokens=True)
print(generation_text)


text_prompt = "What is the highest life expectancy at birth of males?\n"
image_path = os.path.join(model.model_dir, "chart.png")
image_pil = Image.open(image_path)

model_inputs = processor(text=text_prompt, images=[image_pil], return_tensors="pt")
for k, v in model_inputs.items():
    model_inputs[k] = v.to("cuda:0")

generation_output = model.generate(**model_inputs, max_new_tokens=16)
generation_text = processor.batch_decode(generation_output[:, -16:], skip_special_tokens=True)
print(generation_text)

For best performance, it's recommended to end questions with \n, as shown above!

Uses

Direct Use

The model is intended for research purposes only. Because this is a raw model release, we have not added further fine-tuning, post-processing, or sampling strategies to control for undesirable outputs. You should expect to have to fine-tune the model for your use case.

Possible research areas and tasks include:

  • Applications in computer control or digital agents.
  • Research on multi-modal models generally.

Excluded uses are described below.

Out-of-Scope Use

The model was not trained to produce factual or true representations of people or events, and therefore using the model to generate such content is out of scope for the abilities of this model.

Limitations and Bias

Limitations

  • Faces and people in general may not be generated properly.

Bias

While the capabilities of these models are impressive, they can also reinforce or exacerbate social biases.
