We’re releasing Fuyu-8B, a small version of the multimodal model that powers our product. The model is available on HuggingFace. We think Fuyu-8B is exciting.

Fuyu-8B is a multi-modal text and image transformer trained by Adept AI. Architecturally, Fuyu is a vanilla decoder-only transformer: there is no image encoder. Image patches are instead linearly projected into the first layer of the transformer, bypassing the embedding lookup. We simply treat the transformer decoder like an image transformer (albeit with no pooling, and with causal attention). See the diagram below for more details.

This simplification allows us to support arbitrary image resolutions. To accomplish this, we treat the sequence of image tokens like the sequence of text tokens: we remove image-specific position embeddings and feed in as many image tokens as necessary, in raster-scan order. To tell the model when a line has broken, we simply use a special image-newline character. The model can use its existing position embeddings to reason about different image sizes, and we can use images of arbitrary size at training time, removing the need for separate high- and low-resolution training stages.

Though not the focus of this model, we did evaluate it on standard image-understanding benchmarks (see Evaluation below). You can load the model and perform inference as shown under How to Use. Fuyu can also perform some question answering on natural images and charts/diagrams (though fine-tuning may be required for good performance). For best performance, it's recommended to end questions with "\n". The model is intended for research purposes only.
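To make the patch-handling scheme above concrete, here is a minimal sketch of how image patches in raster-scan order could be linearly projected into transformer input embeddings, with a learned image-newline embedding marking the end of each patch row. This is an illustration only, not Adept's actual implementation; the patch size, hidden size, and all names (`image_to_tokens`, `patch_proj`, `newline_emb`) are assumptions made for the example.

```python
import torch
import torch.nn as nn

# Assumed hyperparameters, for illustration only (not Fuyu-8B's real values).
PATCH = 30      # patch side length in pixels
D_MODEL = 64    # transformer hidden size

patch_proj = nn.Linear(3 * PATCH * PATCH, D_MODEL)  # replaces the embedding lookup
newline_emb = nn.Parameter(torch.randn(D_MODEL))    # special "image-newline" embedding

def image_to_tokens(img: torch.Tensor) -> torch.Tensor:
    """Project an image of shape (3, H, W) into a (num_tokens, D_MODEL) sequence.

    Patches are emitted in raster-scan order, with a newline embedding after
    each row of patches so the model knows where lines break. H and W are
    assumed to be multiples of PATCH; arbitrary resolutions simply change the
    number of tokens, so no resizing stage is needed.
    """
    _, h, w = img.shape
    rows, cols = h // PATCH, w // PATCH
    patches = (img
               .unfold(1, PATCH, PATCH)   # (3, rows, W, PATCH)
               .unfold(2, PATCH, PATCH)   # (3, rows, cols, PATCH, PATCH)
               .permute(1, 2, 0, 3, 4)    # (rows, cols, 3, PATCH, PATCH)
               .reshape(rows, cols, -1))  # flatten each patch to 3*PATCH*PATCH
    tokens = patch_proj(patches)                    # (rows, cols, D_MODEL)
    nl = newline_emb.expand(rows, 1, D_MODEL)       # one newline token per row
    return torch.cat([tokens, nl], dim=1).reshape(-1, D_MODEL)

seq = image_to_tokens(torch.randn(3, 60, 90))  # a 2x3 grid of patches
print(tuple(seq.shape))  # 2 rows * (3 patches + 1 newline) = 8 tokens of width 64
```

Because the image becomes an ordinary token sequence, the transformer's existing position embeddings apply to it directly, which is what lets a single training run mix images of different sizes.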
Possible research areas and excluded uses are described below. The model was not trained to be a factual or true representation of people or events, and therefore using the model to generate such content is out-of-scope for the abilities of this model. While the capabilities of these models are impressive, they can also reinforce or exacerbate social biases.

Fuyu-8B Model Card
Model
Model Description
Evaluation
| Eval Task     | Fuyu-8B | Fuyu-Medium | LLaVA 1.5 (13.5B) | QWEN-VL (10B) | PALI-X (55B) | PALM-e-12B | PALM-e-562B |
|---------------|---------|-------------|-------------------|---------------|--------------|------------|-------------|
| VQAv2         | 74.2    | 77.4        | 80                | 79.5          | 86.1         | 76.2       | 80.0        |
| OKVQA         | 60.6    | 63.1        | n/a               | 58.6          | 66.1         | 55.5       | 66.1        |
| COCO Captions | 141     | 138         | n/a               | n/a           | 149          | 135        | 138         |
| AI2D          | 64.5    | 73.7        | n/a               | 62.3          | 81.2         | n/a        | n/a         |
How to Use
from modelscope import AutoModelForCausalLM, AutoTokenizer
from transformers import FuyuProcessor, FuyuImageProcessor
from PIL import Image
import os
import torch

# load model, tokenizer, and processor
pretrained_path = "AI-ModelScope/fuyu-8b"
tokenizer = AutoTokenizer.from_pretrained(pretrained_path)
image_processor = FuyuImageProcessor()
processor = FuyuProcessor(image_processor=image_processor, tokenizer=tokenizer)
model = AutoModelForCausalLM.from_pretrained(pretrained_path, device_map="auto", torch_dtype=torch.float16)

# test inference
text_prompt = "Generate a coco-style caption.\n"
image_path = os.path.join(model.model_dir, "aiyisita.jpg")
image_pil = Image.open(image_path)
model_inputs = processor(text=text_prompt, images=[image_pil], device="cuda:0")
for k, v in model_inputs.items():
    model_inputs[k] = v.to("cuda:0")
generation_output = model.generate(**model_inputs, max_new_tokens=7)
generation_text = processor.batch_decode(generation_output[:, -7:], skip_special_tokens=True)
print(generation_text)
|SPEAKER| is a placeholder token for image patch embeddings, so it will show up in the model context (e.g., in the portion of generation_output representing the model context).
|NEWLINE| is the "image newline" token, denoting new rows in the raster-scan-order input of the image patches.
\x04 is the "beginning of answer" token.

text_prompt = "What color is the bus?\n"
image_path = os.path.join(model.model_dir, "bus.png")
image_pil = Image.open(image_path)
model_inputs = processor(text=text_prompt, images=[image_pil], device="cuda:0")
for k, v in model_inputs.items():
    model_inputs[k] = v.to("cuda:0")
generation_output = model.generate(**model_inputs, max_new_tokens=6)
generation_text = processor.batch_decode(generation_output[:, -6:], skip_special_tokens=True)
print(generation_text)
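When decoding with skip_special_tokens=False, the special tokens described above remain in the output string. As a hedged illustration of how a raw decode could be cleaned up by hand (the sample string and the strip_special helper are made up for this example; the token strings match the list above):

```python
# A made-up raw decode containing patch placeholders, an image newline,
# and the beginning-of-answer marker before the actual answer text.
RAW = "|SPEAKER||SPEAKER||NEWLINE||SPEAKER||SPEAKER||NEWLINE|\x04 A bus parked on the street."

def strip_special(text: str) -> str:
    """Remove image-patch placeholders, image newlines, and the
    beginning-of-answer marker from a raw decoded string."""
    for tok in ("|SPEAKER|", "|NEWLINE|", "\x04"):
        text = text.replace(tok, "")
    return text.strip()

print(strip_special(RAW))  # -> "A bus parked on the street."
```

In practice, passing skip_special_tokens=True to batch_decode (as in the examples here) handles this for you; the manual version just shows what is being removed.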
text_prompt = "What is the highest life expectancy at birth of male?\n"
image_path = os.path.join(model.model_dir, "chart.png")
image_pil = Image.open(image_path)
model_inputs = processor(text=text_prompt, images=[image_pil], device="cuda:0")
for k, v in model_inputs.items():
    model_inputs[k] = v.to("cuda:0")
generation_output = model.generate(**model_inputs, max_new_tokens=16)
generation_text = processor.batch_decode(generation_output[:, -16:], skip_special_tokens=True)
print(generation_text)
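The slicing pattern in these examples (generation_output[:, -16:]) works because generate returns the prompt tokens followed by the newly generated ones, so taking the last max_new_tokens positions isolates the answer. A minimal sketch with a dummy tensor standing in for the generate output (the ids here are fabricated for illustration):

```python
import torch

max_new_tokens = 16
prompt_len = 5

# Dummy stand-in for generate() output: prompt ids followed by new ids.
prompt_ids = torch.arange(0, prompt_len).unsqueeze(0)               # shape (1, 5)
new_ids = torch.arange(100, 100 + max_new_tokens).unsqueeze(0)      # shape (1, 16)
generation_output = torch.cat([prompt_ids, new_ids], dim=1)         # shape (1, 21)

# Keep only the newly generated tokens, exactly as in the examples above.
answer_ids = generation_output[:, -max_new_tokens:]
print(tuple(answer_ids.shape))  # (1, 16)
```

This is why the slice width must match the max_new_tokens value passed to generate: a mismatch would either drop answer tokens or leak prompt tokens into the decoded text.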
For best performance, it's recommended to end questions with "\n", as shown above!

Uses
Direct Use
Out-of-Scope Use
Limitations and Bias
Limitations
Bias