CogVLM
Introduction
CogVLM is a powerful open-source visual language model that deeply fuses language and vision representations through a visual expert module, achieving SOTA performance on 14 authoritative cross-modal benchmarks. Only English is supported at the moment; a bilingual Chinese-English version will be released later, so stay tuned!
CogVLM is a powerful open-source visual language foundation model. Unlike the popular shallow-alignment methods, which map image features into the input space of the language model, CogVLM bridges the gap between the frozen pretrained language model and the image encoder with a trainable visual expert module in the attention and FFN layers. This allows deep fusion of vision-language features without sacrificing any performance on NLP tasks.
CogVLM-17B achieves state-of-the-art performance on 10 classic cross-modal benchmarks, including NoCaps, Flickr30K captioning, RefCOCO, RefCOCO+, RefCOCOg, Visual7W, GQA, ScienceQA, VizWiz VQA and TDIUC, and ranks 2nd on VQAv2, OKVQA, TextVQA, COCO captioning, etc., surpassing or matching PaLI-X 55B.
We anticipate that the open-sourcing of CogVLM will greatly help the research and industrial application of visual understanding.
Dependencies
pip install en_core_web_sm -f https://modelscope.oss-cn-beijing.aliyuncs.com/releases/repo.html
Example Code
# Before running, install the small English language model provided by spaCy:
# pip install en_core_web_sm -f https://modelscope.oss-cn-beijing.aliyuncs.com/releases/repo.html
from modelscope.utils.constant import Tasks
from modelscope.pipelines import pipeline
from modelscope import snapshot_download

# Download the Vicuna tokenizer used by CogVLM-chat.
local_tokenizer_dir = snapshot_download("AI-ModelScope/vicuna-7b-v1.5", revision='v1.0.0')

# Build the chat pipeline from the CogVLM-chat weights and the local tokenizer.
pipe = pipeline(task=Tasks.chat, model='AI-ModelScope/cogvlm-chat', model_revision='v1.0.7', local_tokenizer=local_tokenizer_dir)

# First turn: pass the question and an image; 'history' is None for a new conversation.
inputs = {'text': 'Who is the man in the picture?', 'history': None, 'image': "https://modelscope-open.oss-cn-hangzhou.aliyuncs.com/resources/aiyinsitan.jpg"}
result = pipe(inputs)
print(result["response"])

# Follow-up turn: reuse the returned 'history' so the model keeps the conversational and visual context.
inputs = {'text': 'How did he die?', 'history': result['history']}
result = pipe(inputs)
print(result["response"])
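For convenience, the two-turn protocol above can be wrapped in a small helper. The sketch below is our own illustration built only on the pipeline call already shown; the helper name and its defaults are not part of the ModelScope API.

def chat(pipe, text, image=None, history=None):
    """Hypothetical helper around the chat pipeline shown above.

    Pass an image on the first turn; on follow-up turns reuse the returned
    'history' so the model keeps the visual and textual context.
    """
    inputs = {'text': text, 'history': history}
    if image is not None:
        inputs['image'] = image
    result = pipe(inputs)
    return result['response'], result['history']

response, history = chat(pipe, 'Who is the man in the picture?',
                         image='https://modelscope-open.oss-cn-hangzhou.aliyuncs.com/resources/aiyinsitan.jpg')
print(response)
response, history = chat(pipe, 'How did he die?', history=history)
print(response)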
Examples
CogVLM is powerful for answering various types of visual questions, including Detailed Description & Visual Question Answering, Complex Counting, Visual Math Problem Solving, OCR-Free Reasoning, OCR-Free Visual Question Answering, World Knowledge, Referring Expression Comprehension, Programming with Visual Input, Grounding with Caption, Grounding Visual Question Answering, etc.
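All of these question types go through the same chat pipeline shown in the example code above; the snippet below is only an illustrative sketch, and the image URL is a placeholder to replace with your own image.

# Illustrative only: reuses the 'pipe' object built in the example code above.
# The image URL is a placeholder, not a real resource.
questions = [
    'Describe this image in detail.',        # detailed description
    'How many people are in the picture?',   # complex counting
    'What does the text in the image say?',  # OCR-free visual question answering
]
for question in questions:
    result = pipe({'text': question, 'history': None, 'image': 'https://example.com/your_image.jpg'})
    print(question, '->', result['response'])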
Method
The CogVLM model comprises four fundamental components: a vision transformer (ViT) encoder, an MLP adapter, a pretrained large language model (GPT), and a visual expert module. See the paper for more details.
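To make the visual expert idea concrete, here is a highly simplified, hypothetical PyTorch-style sketch. It is not the actual CogVLM implementation: the class and variable names are our own, and the FFN expert, weight initialization, rotary embeddings and other details are omitted. It only illustrates the core point that image tokens and text tokens share one attention operation, while image tokens use their own trainable QKV projections and the pretrained language-model projections stay frozen.

import torch
import torch.nn as nn

class VisualExpertAttention(nn.Module):
    """Illustrative sketch: per-modality QKV projections, shared attention."""

    def __init__(self, dim: int, num_heads: int):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        # Frozen QKV projection from the pretrained language model (used for text tokens).
        self.qkv_text = nn.Linear(dim, 3 * dim)
        self.qkv_text.requires_grad_(False)
        # Trainable "visual expert" QKV projection (used for image tokens only).
        self.qkv_image = nn.Linear(dim, 3 * dim)
        self.out = nn.Linear(dim, dim)

    def forward(self, hidden: torch.Tensor, image_mask: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, seq, dim); image_mask: (batch, seq) bool, True where a token is visual.
        b, s, d = hidden.shape
        # Route each token through the projection of its own modality.
        qkv = torch.where(image_mask.unsqueeze(-1), self.qkv_image(hidden), self.qkv_text(hidden))
        q, k, v = qkv.chunk(3, dim=-1)
        # Reshape to (batch, heads, seq, head_dim) and run ordinary causal attention over the mixed sequence.
        q, k, v = (t.reshape(b, s, self.num_heads, self.head_dim).transpose(1, 2) for t in (q, k, v))
        attn = nn.functional.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.out(attn.transpose(1, 2).reshape(b, s, d))

In the same spirit, the FFN in each layer would have a frozen text branch and a trainable image branch selected by the same token mask.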
License
The code in this repository is open source under the Apache-2.0 license, while the use of the CogVLM model weights must comply with the Model License.
Citation & Acknowledgements
If you find our work helpful, please consider citing the following papers
In the instruction fine-tuning phase, CogVLM uses some English image-text data from the MiniGPT-4, LLaVA, LRV-Instruction, LLaVAR and Shikra projects, as well as many classic cross-modal datasets. We sincerely thank them for their contributions.