DeepSeek-VL: Towards Real-World Vision-Language Understanding

Haoyu Lu, Wen Liu, Bo Zhang*, Bingxuan Wang, Kai Dong, Bo Liu, Jingxiang Sun, Tongzheng Ren, Zhuoshu Li, Yaofeng Sun, Chengqi Deng, Hanwei Xu, Zhenda Xie, Chong Ruan (Equal Contribution, **Project Leader)

1. Introduction

Introducing DeepSeek-VL, an open-source Vision-Language (VL) Model designed for real-world vision and language understanding applications. DeepSeek-VL possesses general multimodal understanding capabilities, capable of processing logical diagrams, web pages, formula recognition, scientific literature, natural images, and embodied intelligence in complex scenarios.

2. Model Summary

DeepSeek-VL-1.3b-chat is a tiny vision-language model. It uses SigLIP-L as the vision encoder, supporting 384 x 384 image input, and is built on DeepSeek-LLM-1.3b-base, which was trained on an approximate corpus of 500B text tokens. The whole DeepSeek-VL-1.3b-base model was finally trained on around 400B vision-language tokens. DeepSeek-VL-1.3b-chat is an instruction-tuned version of DeepSeek-VL-1.3b-base.

This code repository is licensed under the MIT License. The use of DeepSeek-VL Base/Chat models is subject to the DeepSeek Model License. The DeepSeek-VL series (including Base and Chat) supports commercial use. If you have any questions, please raise an issue or contact us at service@deepseek.com.
3. Quick Start
Installation

On the basis of a Python >= 3.8 environment, install the necessary dependencies by running the following commands:

git clone https://github.com/deepseek-ai/DeepSeek-VL
cd DeepSeek-VL
pip install -e .
Simple Inference Example

import torch
from transformers import AutoModelForCausalLM

from deepseek_vl.models import VLChatProcessor, MultiModalityCausalLM
from deepseek_vl.utils.io import load_pil_images
from modelscope import snapshot_download

# specify the path to the model
model_path = snapshot_download("deepseek-ai/deepseek-vl-1.3b-chat")
vl_chat_processor: VLChatProcessor = VLChatProcessor.from_pretrained(model_path)
tokenizer = vl_chat_processor.tokenizer

vl_gpt: MultiModalityCausalLM = AutoModelForCausalLM.from_pretrained(model_path, trust_remote_code=True)
vl_gpt = vl_gpt.to(torch.bfloat16).cuda().eval()

conversation = [
    {
        "role": "User",
        "content": "<image_placeholder>Describe each stage of this image.",
        "images": ["./images/training_pipelines.png"]
    },
    {
        "role": "Assistant",
        "content": ""
    }
]

# load images and prepare for inputs
pil_images = load_pil_images(conversation)
prepare_inputs = vl_chat_processor(
    conversations=conversation,
    images=pil_images,
    force_batchify=True
).to(vl_gpt.device)

# run image encoder to get the image embeddings
inputs_embeds = vl_gpt.prepare_inputs_embeds(**prepare_inputs)

# run the model to get the response
outputs = vl_gpt.language_model.generate(
    inputs_embeds=inputs_embeds,
    attention_mask=prepare_inputs.attention_mask,
    pad_token_id=tokenizer.eos_token_id,
    bos_token_id=tokenizer.bos_token_id,
    eos_token_id=tokenizer.eos_token_id,
    max_new_tokens=512,
    do_sample=False,
    use_cache=True
)

answer = tokenizer.decode(outputs[0].cpu().tolist(), skip_special_tokens=True)
print(f"{prepare_inputs['sft_format'][0]}", answer)
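For reference, the conversation format above is plain Python data and can be sanity-checked without loading the model. A minimal sketch: the `collect_image_paths` helper below is hypothetical (not part of the deepseek_vl package), but it mirrors the first step `load_pil_images` performs, gathering each message's image paths in order before opening the files.

```python
def collect_image_paths(conversation):
    # Gather image paths from each message's "images" list, preserving order.
    # Messages without an "images" key (e.g. the empty Assistant turn)
    # contribute nothing.
    paths = []
    for message in conversation:
        paths.extend(message.get("images", []))
    return paths

conversation = [
    {"role": "User",
     "content": "<image_placeholder>Describe each stage of this image.",
     "images": ["./images/training_pipelines.png"]},
    {"role": "Assistant", "content": ""},
]

print(collect_image_paths(conversation))  # ['./images/training_pipelines.png']
```

Note that each `<image_placeholder>` tag in a message's content is expected to correspond to one entry in that message's `images` list.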
CLI Chat

python cli_chat.py --model_path "deepseek-ai/deepseek-vl-1.3b-chat"

# or local path
python cli_chat.py --model_path "local model path"
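The actual implementation of cli_chat.py lives in the repository; as an illustration only, a chat driver of this general shape reads user turns in a loop, stops on an exit keyword, and prints each model reply. The `respond` function here is a stub standing in for the model call, not the real DeepSeek-VL inference code.

```python
def chat_loop(respond, get_input, output):
    # Minimal REPL skeleton: read a user message, stop on "exit",
    # otherwise pass the message to respond() and emit the reply.
    while True:
        user_message = get_input()
        if user_message.strip().lower() == "exit":
            break
        output(respond(user_message))

# Stubbed usage: an echo responder driven by a scripted input sequence.
inputs = iter(["hello", "exit"])
replies = []
chat_loop(lambda m: f"echo: {m}", lambda: next(inputs), replies.append)
print(replies)  # ['echo: hello']
```

In an interactive session, `get_input` would be `input` and `output` would be `print`; injecting them as parameters keeps the loop testable.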
4. License

5. Citation
@misc{lu2024deepseekvl,
      title={DeepSeek-VL: Towards Real-World Vision-Language Understanding},
      author={Haoyu Lu and Wen Liu and Bo Zhang and Bingxuan Wang and Kai Dong and Bo Liu and Jingxiang Sun and Tongzheng Ren and Zhuoshu Li and Yaofeng Sun and Chengqi Deng and Hanwei Xu and Zhenda Xie and Chong Ruan},
      year={2024},
      eprint={2403.05525},
      archivePrefix={arXiv},
      primaryClass={cs.AI}
}
6. Contact