CodeFuse-VLM
CodeFuse-VLM is a multimodal LLM (MLLM) framework that offers a choice of vision encoders, multimodal alignment adapters, and LLMs. With it, users can assemble an MLLM customized for their own tasks. As more models are published in the Hugging Face community, the pool of open-source vision encoders and LLMs keeps growing, and each model has its own strengths and weaknesses: Code Llama, for example, is good at code-related tasks but performs poorly on Chinese tasks. We therefore built CodeFuse-VLM to support multiple vision encoders, multimodal alignment adapters, and LLMs, so that the combination can be matched to the task at hand.
Using the CodeFuse-VLM framework, we trained the CodeFuse-VLM-14B model from a cross-attention multimodal adapter, the Qwen-14B LLM, and Qwen-VL's vision encoder. CodeFuse-VLM-14B outperforms Qwen-VL on all six benchmarks listed below and outperforms LLAVA-1.5 on MMBench, MMBench-CN, and TextVQA, while LLAVA-1.5 remains ahead on VqaV2, GQA, and Vizwiz.
The table below compares the benchmark performance of the three MLLMs:
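To make the role of a cross-attention adapter concrete, here is a minimal, dependency-free sketch of single-head scaled dot-product cross-attention, in which a small set of query vectors attends over image patch features and produces a fixed number of visual tokens. It is a generic illustration, not the adapter shipped in this repository: the function names, the dimensions, the random inputs, and the omission of learned projection matrices are all assumptions made for brevity.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, features):
    """Single-head scaled dot-product cross-attention (no learned projections).

    queries:  N vectors of dimension d (stand-ins for learnable queries)
    features: M vectors of dimension d (stand-ins for vision encoder outputs)
    Returns N vectors, each a softmax-weighted average of the features.
    """
    d = len(queries[0])
    outputs = []
    for q in queries:
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
            for k in features
        ]
        weights = softmax(scores)
        outputs.append(
            [sum(w * v[j] for w, v in zip(weights, features)) for j in range(d)]
        )
    return outputs


# Hypothetical sizes: 16 image patches compressed into 4 visual tokens.
rng = random.Random(0)
dim, num_patches, num_queries = 8, 16, 4
patch_features = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_patches)]
learned_queries = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_queries)]

visual_tokens = cross_attention(learned_queries, patch_features)
print(len(visual_tokens), len(visual_tokens[0]))  # 4 8
```

The number of output tokens is set by the number of queries rather than by the number of image patches, which is one reason this style of adapter is convenient for feeding images of varying resolution to an LLM with a fixed visual token budget.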
Model | MMBench | MMBench-CN | VqaV2 | GQA | TextVQA | Vizwiz |
---|---|---|---|---|---|---|
LLAVA-1.5 | 67.7 | 63.6 | 80.0 | 63.3 | 61.3 | 53.6 |
Qwen-VL | 60.6 | 56.7 | 78.2 | 57.5 | 63.8 | 38.9 |
CodeFuse-VLM-14B | 75.7 | 69.8 | 79.3 | 59.4 | 63.9 | 45.3 |
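As a quick way to read the table, the following sketch finds the top-scoring model for each benchmark. The scores are copied from the table above; the variable and function names are illustrative only.

```python
# Benchmark scores copied from the table above.
benchmarks = ["MMBench", "MMBench-CN", "VqaV2", "GQA", "TextVQA", "Vizwiz"]
scores = {
    "LLAVA-1.5": [67.7, 63.6, 80.0, 63.3, 61.3, 53.6],
    "Qwen-VL": [60.6, 56.7, 78.2, 57.5, 63.8, 38.9],
    "CodeFuse-VLM-14B": [75.7, 69.8, 79.3, 59.4, 63.9, 45.3],
}


def leader(column):
    """Return the model with the highest score in the given column index."""
    return max(scores, key=lambda model: scores[model][column])


leaders = {name: leader(i) for i, name in enumerate(benchmarks)}
for name in benchmarks:
    print(f"{name:<11} {leaders[name]}")

# Benchmarks on which CodeFuse-VLM-14B scores higher than Qwen-VL.
beats_qwen = [
    name
    for i, name in enumerate(benchmarks)
    if scores["CodeFuse-VLM-14B"][i] > scores["Qwen-VL"][i]
]
```

Run as written, this reports CodeFuse-VLM-14B as the leader on MMBench, MMBench-CN, and TextVQA, and LLAVA-1.5 as the leader on VqaV2, GQA, and Vizwiz; `beats_qwen` contains all six benchmarks.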
Install
Please run sh init_env.sh
Datasets
Here's the table of datasets we used to train CodeFuse-VLM-14B:
Dataset | Task Type | Number of Samples |
---|---|---|
synthdog-en | OCR | 800,000 |
synthdog-zh | OCR | 800,000 |
cc3m(downsampled) | Image Caption | 600,000 |
SBU | Image Caption | 850,000 |
Visual Genome VQA (Downsampled) | Visual Question Answering (VQA) | 500,000 |
Visual Genome Region descriptions (Downsampled) | Reference Grounding | 500,000 |
Visual Genome objects (Downsampled) | Grounded Caption | 500,000 |
OCR VQA (Downsampled) | OCR and VQA | 500,000 |
Please download these datasets from their official websites.
Multimodal Alignment
Please run sh scripts/pretrain.sh or sh scripts/pretrain_multinode.sh
Visual Instruction Tuning
Please run sh scripts/finetune.sh or sh scripts/finetune_multinode.sh
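The install and training steps above can be chained from a small driver script. The sketch below only assembles and prints the commands by default. It assumes that the scripts are run from the repository root, that they need no extra arguments, and that multimodal alignment runs before visual instruction tuning, following the order of this README. The helper names `training_commands` and `run_pipeline` are hypothetical, and the multi-node scripts may need to be launched separately on each node.

```python
import subprocess


def training_commands(multinode=False):
    """Return the shell commands for setup and both training stages, in order."""
    suffix = "_multinode" if multinode else ""
    return [
        ["sh", "init_env.sh"],                    # environment setup
        ["sh", f"scripts/pretrain{suffix}.sh"],   # multimodal alignment
        ["sh", f"scripts/finetune{suffix}.sh"],   # visual instruction tuning
    ]


def run_pipeline(multinode=False, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in training_commands(multinode):
        print(" ".join(cmd))
        if not dry_run:
            # check=True stops the pipeline if a stage exits with an error.
            subprocess.run(cmd, check=True)


run_pipeline(multinode=False, dry_run=True)
```

Calling `run_pipeline(dry_run=False)` would actually execute the stages; keeping the dry run as the default makes it easy to confirm which scripts will run before committing GPU time.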
Evaluation
Please run the Python scripts in the llava/eval/ directory. Our pre-trained CodeFuse-VLM-14B can be loaded with the following code:
import os
from llava.model.builder import load_mixed_pretrained_model
model_path = '/pretrained/model/path'
tokenizer, model, image_processor, context_len = load_mixed_pretrained_model(
    model_path, None, 'qwen-vl-14b',
    os.path.join(model_path, 'Qwen-VL-visual'),
    'cross_attn',
    os.path.join(model_path, 'mm_projector/mm_projector.bin'))
Alternatively, run scripts/merge_qwen_vl_weights.sh first and then load the merged model with the following code:
from llava.model import LlavaQWenForCausalLM
model = LlavaQWenForCausalLM.from_pretrained('/path/to/our/pretrained/model')
Join Us
We are the AI Native team within the Platform Technology Business Group at Ant Group, dedicated to making Ant Group's platform engineering intelligent. Established more than three years ago, our team has played a pivotal role in supporting the intelligent operation and maintenance of Ant Group's cloud computing infrastructure. Our mission is to build widely used algorithm services and platforms through world-class technological innovation, supporting the rollout of internal and external products and businesses. Driven by innovation, our team not only supports business delivery but also extends Ant Group's technological influence. Over the past three years, we have published more than 20 papers at top conferences such as ICLR, NeurIPS, KDD, and ACL. Our business innovations have earned us two T-Star awards, Ant Technology's highest honor, and one SuperMA award from Ant Group. Our open-source project CodeFuse had received 4K stars as of February 2024, and our models have been downloaded more than 1.5 million times on Hugging Face and ModelScope.
We are looking for top talent to join our vibrant team! If you are eager to develop your career in an environment filled with energy, innovation, and a culture of excellence, we welcome you to explore our career opportunities for both campus and experienced hires. Join us and help create the next milestone in the industry.
Campus Recruitment: https://hrrecommend.antgroup.com/guide.html?code=8uoP5mlus5DqQYbEEnqcE2FD5JZH21MwvMUIb9mb6X3osXPuBraG54SyM8GLn7
Experienced Hires: https://talent.antgroup.com/off-campus-position?positionId=1933830