InternVL-Chat-ViT-6B-Vicuna-7B

Open-source address: https://modelscope.cn/models/OpenGVLab/InternVL-Chat-ViT-6B-Vicuna-7B

Model Card for InternVL-Chat-ViT-6B-Vicuna-7B

What is InternVL?

[Paper] [GitHub] [Chat Demo]

InternVL scales the ViT up to 6B parameters and aligns it with a large language model (LLM).

It is trained using web-scale, noisy image-text pairs. The data are all publicly available and comprise multilingual content, including LAION-en, LAION-multi, LAION-COCO, COYO, Wukong, CC12M, CC3M, and SBU.
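
To make the alignment step concrete, the sketch below shows a generic CLIP-style contrastive loss of the kind used for web-scale image-text pretraining. This is an illustrative assumption rather than the exact InternVL training recipe, and the random tensors stand in for ViT and text-encoder outputs.

```python
# Illustrative sketch of a CLIP-style contrastive image-text alignment loss,
# the general kind of objective used for web-scale image-text pretraining.
# NOT the exact InternVL recipe; embeddings and dims below are placeholders.
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(image_feats: torch.Tensor,
                               text_feats: torch.Tensor,
                               temperature: float = 0.07) -> torch.Tensor:
    """image_feats, text_feats: (batch, dim) embeddings from the two encoders."""
    image_feats = F.normalize(image_feats, dim=-1)
    text_feats = F.normalize(text_feats, dim=-1)
    logits = image_feats @ text_feats.t() / temperature  # (batch, batch) similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    # Symmetric InfoNCE: match each image to its paired caption and vice versa.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Toy usage with random embeddings standing in for encoder outputs.
loss = contrastive_alignment_loss(torch.randn(8, 768), torch.randn(8, 768))
```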

It is the largest open-source vision/vision-language foundation model (14B) to date, achieving state-of-the-art results on 32 benchmarks spanning tasks such as visual perception, cross-modal retrieval, and multimodal dialogue.


How to Run?

Please refer to this README to run this model.

Note: We have retained the original documentation of LLaVA 1.5 as a more detailed manual. In most cases, you will only need to refer to the new documentation that we have added.
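
As a concrete starting point, the snippet below sketches how inference might look with the LLaVA-1.5-style Python API that the retained documentation describes. The module paths, argument names, and example image URL follow the LLaVA 1.5 examples and are assumptions for this checkpoint; defer to the referenced README for the authoritative instructions.

```python
# Minimal sketch, assuming the LLaVA-1.5-style API retained by this codebase.
# Module paths and argument names follow the LLaVA 1.5 examples and may differ here.
from llava.mm_utils import get_model_name_from_path
from llava.eval.run_llava import eval_model

model_path = "OpenGVLab/InternVL-Chat-ViT-6B-Vicuna-7B"

# eval_model expects a simple namespace-like args object; these fields mirror
# the LLaVA 1.5 example and are assumptions for this checkpoint.
args = type("Args", (), {
    "model_path": model_path,
    "model_base": None,
    "model_name": get_model_name_from_path(model_path),
    "query": "Describe this image in detail.",
    "conv_mode": None,
    "image_file": "https://llava-vl.github.io/static/images/view.jpg",
    "sep": ",",
    "temperature": 0.2,
    "top_p": None,
    "num_beams": 1,
    "max_new_tokens": 512,
})()

eval_model(args)
```

If the API matches LLaVA 1.5, eval_model prints the model's answer for the given image and query to stdout.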

Model details

Model type: InternVL-Chat is an open-source chatbot trained by fine-tuning LLaMA/Vicuna on GPT-generated multimodal instruction-following data. It is an auto-regressive language model, based on the transformer architecture.
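
To make the training objective concrete, the sketch below shows the standard auto-regressive (next-token) cross-entropy with the instruction/prompt tokens masked out of the loss, which is how instruction-following fine-tuning of this kind is typically implemented; the tensor shapes and names are illustrative assumptions, not the actual InternVL-Chat components.

```python
# Minimal sketch of an auto-regressive instruction-tuning objective:
# next-token cross-entropy where prompt tokens are excluded from the loss.
import torch
import torch.nn.functional as F

def instruction_tuning_loss(logits: torch.Tensor,
                            input_ids: torch.Tensor,
                            prompt_len: int) -> torch.Tensor:
    """logits: (seq, vocab) from the language model; input_ids: (seq,) token ids."""
    labels = input_ids.clone()
    labels[:prompt_len] = -100          # ignore the instruction/prompt tokens
    # Standard causal-LM shift: position t predicts token t+1.
    shift_logits = logits[:-1, :]
    shift_labels = labels[1:]
    return F.cross_entropy(shift_logits, shift_labels, ignore_index=-100)

# Toy example: 10-token sequence whose first 4 tokens are the prompt.
vocab = 32000
loss = instruction_tuning_loss(torch.randn(10, vocab),
                               torch.randint(0, vocab, (10,)),
                               prompt_len=4)
```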

Model date: InternVL-Chat-ViT-6B-Vicuna-7B was trained in November 2023.

Paper or resources for more information: https://github.com/OpenGVLab/InternVL

License

InternVL is released under the MIT license.

Llama 2 is licensed under the LLAMA 2 Community License, Copyright (c) Meta Platforms, Inc. All Rights Reserved.

Where to send questions or comments about the model: https://github.com/OpenGVLab/InternVL/issues

Intended use

Primary intended uses: The primary use of InternVL-Chat is research on large multimodal models and chatbots.

Primary intended users: The primary intended users of the model are researchers and hobbyists in computer vision, natural language processing, machine learning, and artificial intelligence.

Training dataset

  • 558K filtered image-text pairs from LAION/CC/SBU, captioned by BLIP.
  • 158K GPT-generated multimodal instruction-following data.
  • 450K academic-task-oriented VQA data mixture.
  • 40K ShareGPT data.

Evaluation dataset

A collection of 12 benchmarks, including 5 academic VQA benchmarks and 7 recent benchmarks specifically proposed for instruction-following LMMs.

Acknowledgement

This model card is adapted from LLaVA's model card. Thanks for their awesome work!

Citation

If you find this project useful in your research, please consider citing:

@article{chen2023internvl,
  title={InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks},
  author={Chen, Zhe and Wu, Jiannan and Wang, Wenhai and Su, Weijie and Chen, Guo and Xing, Sen and Zhong, Muyan and Zhang, Qinglong and Zhu, Xizhou and Lu, Lewei and Li, Bin and Luo, Ping and Lu, Tong and Qiao, Yu and Dai, Jifeng},
  journal={arXiv preprint arXiv:2312.14238},
  year={2023}
}