Qwen-VL-Chat-Finetuned-Dense-Captioner

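News

2024-07-18: Open-sourced the model Qwen-VL-Chat-Finetuned-Dense-Captioner
2024-07-16: Open-sourced the dataset SA1B-描述-子图对 (SA1B description and sub-image pairs)
2024-05-24: Open-sourced the dataset SA1B-长文本图文描述 (SA1B long-text image descriptions)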

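Model Introduction

Qwen-VL-Chat-Finetuned-Dense-Captioner is a large model obtained by LoRA fine-tuning Qwen-VL-Chat on generated data plus manually produced data. It outputs structured descriptions of images. The model can write image descriptions in Chinese or English, and it can take an original description of the image as input and produce a more accurate and detailed one.

The model's output is a structured data format: global_caption is the overall description of the image, and caption_list is a list of descriptions, each relating to one local region of the image. A concrete example is shown below. More examples can be found in Quick Start.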

{
    "global_caption": "这是一张在自然光照下拍摄的海滩上与狗互动的照片。一位女性坐在沙滩上，身穿格子衬衫，正与一只黄色的拉布拉多犬互动。狗狗似乎在向她伸出爪子，而女性则微笑着回应。他们周围是细腻的沙滩和波光粼粼的海水，背景是温暖的夕阳，为整个场景增添了一抹金色的温暖。",
    "caption_list": [
        "一位女性坐在沙滩上，身穿格子衬衫，正在与一只黄色拉布拉多犬互动。",
        "黄色拉布拉多犬似乎在向女性伸出爪子，表情活泼。",
        "背景是波光粼粼的海水和细腻的沙滩，夕阳的余晖洒在海面上，营造出宁静的氛围。"
    ]
}
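
Because the reply is a JSON string with these two keys, it can be loaded with a standard JSON parser. The sketch below is a minimal illustration and not part of the model's own tooling: `parse_dense_caption` is a hypothetical helper name, and the sample reply is a shortened, made-up string in the same shape as the example above. It assumes the model returns well-formed JSON, which may not always hold.

```python
import json


def parse_dense_caption(response: str) -> dict:
    """Parse a reply into {'global_caption': str, 'caption_list': [str, ...]}."""
    data = json.loads(response)  # raises json.JSONDecodeError on malformed JSON
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object at the top level")
    global_caption = data.get("global_caption", "")
    caption_list = data.get("caption_list", [])
    if not isinstance(global_caption, str) or not isinstance(caption_list, list):
        raise ValueError("unexpected types for global_caption or caption_list")
    return {
        "global_caption": global_caption,
        "caption_list": [str(caption) for caption in caption_list],
    }


# Hypothetical, shortened reply used only to exercise the parser.
sample_reply = (
    '{"global_caption": "A woman and a dog on a beach.", '
    '"caption_list": ["A woman in a plaid shirt.", "A yellow Labrador."]}'
)
parsed = parse_dense_caption(sample_reply)
print(parsed["global_caption"])
for index, caption in enumerate(parsed["caption_list"], start=1):
    print(index, caption)
```

Validating the two fields before use keeps a malformed reply from failing later in a downstream step.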

Prompt List

The model supports the following four prompts. The table shows the scenario each one is for:

Scenario	Prompt
Chinese description	用中文生成输入图片内容的详细描述和图片中所有实体的描述列表。输出为格式为：{"global_caption":"详细描述", "caption_list":["实体A的描述", "实体B的描述", "实体C的描述", ...]}。
Chinese description based on an original description	根据输入的图片和描述提示：###<original description in Chinese or English>###用中文生成图片内容的详细描述和图片中所有实体的描述列表。输出为格式为：{"global_caption":"详细描述", "caption_list":["实体A的描述", "实体B的描述", "实体C的描述", ...]}。
English description	Generate an English detailed description of the content of the input image and a list of descriptions for all entities in the image. The output should be in the format: {"global_caption":"Detailed description", "caption_list":["Description of Entity A", "Description of Entity B", "Description of Entity C", ...]}.
English description based on an original description	Given an image and some tips: ###<original description in Chinese or English>### related to image, generate an English detailed description of the content of the input image and a list of descriptions for all entities in the image. The output should be in the format: {"global_caption":"Detailed description", "caption_list":["Description of Entity A", "Description of Entity B", "Description of Entity C", ...]}.

Note: the # characters in the prompts must be kept.
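
The four prompts differ only in the output language and in whether a hint is wrapped in ###. A minimal sketch of a helper that assembles them follows. `build_prompt` and its `language` and `hint` parameters are hypothetical names introduced for illustration, and the key names use the spelling from the usage code later in this document (global_caption, caption_list).

```python
# Format strings copied from the usage code later in this document.
FORMAT_ZH = '{"global_caption":"详细描述", "caption_list":["实体A的描述", "实体B的描述", "实体C的描述", ...]}'
FORMAT_EN = ('{"global_caption":"Detailed description", "caption_list":'
             '["Description of Entity A", "Description of Entity B", '
             '"Description of Entity C", ...]}')


def build_prompt(language: str, hint: str = "") -> str:
    """Return one of the four prompts; `hint` is an optional original description."""
    if language not in ("zh", "en"):
        raise ValueError("language must be 'zh' or 'en'")
    if language == "zh":
        if hint:
            # The ### markers around the hint must be kept.
            head = ("根据输入的图片和描述提示：###" + hint
                    + "###用中文生成图片内容的详细描述和图片中所有实体的描述列表。")
        else:
            head = "用中文生成输入图片内容的详细描述和图片中所有实体的描述列表。"
        return head + "输出为格式为：" + FORMAT_ZH + "。"
    if hint:
        head = ("Given an image and some tips: ###" + hint + "### related to image, "
                "generate an English detailed description of the content of the input image "
                "and a list of descriptions for all entities in the image. ")
    else:
        head = ("Generate an English detailed description of the content of the input image "
                "and a list of descriptions for all entities in the image. ")
    return head + "The output should be in the format: " + FORMAT_EN + "."


print(build_prompt("en", "The dog is stretching out its right front paw"))
```

The hint can be written in either language regardless of the output language, matching the mixed-language scenarios in the table.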

Requirements

Python 3.8 or later
PyTorch 1.12 or later; 2.0 or later is recommended
CUDA 11.4 or later is recommended (applies to GPU users)

pip install modelscope -U
pip install transformers accelerate tiktoken -U
pip install einops transformers_stream_generator -U
pip install "pillow==9.*" -U
pip install torchvision
pip install matplotlib -U
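
To check an existing environment against the minimum versions listed above, something like the following could be used. It is a rough sketch: `version_tuple` and `check_environment` are hypothetical helpers, only the major and minor version numbers are compared, and the CUDA check reads the CUDA version PyTorch was built with (`torch.version.cuda`), which is an assumption about what the CUDA recommendation refers to.

```python
import re
import sys


def version_tuple(version: str) -> tuple:
    """Turn a string such as '2.1.0+cu118' into (2, 1); missing parts become 0."""
    core = version.split("+")[0]
    numbers = []
    for piece in core.split(".")[:2]:
        match = re.match(r"\d+", piece)  # keep only the leading digits
        numbers.append(int(match.group()) if match else 0)
    while len(numbers) < 2:
        numbers.append(0)
    return tuple(numbers)


def check_environment() -> list:
    """Return a list of human-readable problems; an empty list means no issue found."""
    problems = []
    if sys.version_info < (3, 8):
        problems.append("Python 3.8 or later is required")
    try:
        import torch
    except ImportError:
        problems.append("PyTorch is not installed")
        return problems
    if version_tuple(torch.__version__) < (1, 12):
        problems.append("PyTorch 1.12 or later is required")
    # torch.version.cuda is None on CPU-only builds, so guard before parsing.
    if torch.cuda.is_available() and torch.version.cuda:
        if version_tuple(torch.version.cuda) < (11, 4):
            problems.append("CUDA 11.4 or later is recommended")
    return problems


if __name__ == "__main__":
    for problem in check_environment():
        print(problem)
```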

Quick Start

You can call the model with the following code:

from modelscope import (
    snapshot_download, AutoModelForCausalLM, AutoTokenizer, GenerationConfig
)
import torch


model_id = 'Tongyi-DataEngine/Qwen-VL-Chat-Finetuned-Dense-Captioner'

model_dir = snapshot_download(model_id)
torch.manual_seed(1234)

tokenizer = AutoTokenizer.from_pretrained(model_dir, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_dir, device_map="auto", trust_remote_code=True).eval()
model.generation_config = GenerationConfig.from_pretrained(model_dir, trust_remote_code=True)

# Chinese prompt, Chinese output
query = tokenizer.from_list_format([
    {'image': 'https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg'},
    {'text': '用中文生成输入图片内容的详细描述和图片中所有实体的描述列表。输出为格式为：{"global_caption":"详细描述", "caption_list":["实体A的描述", "实体B的描述", "实体C的描述", ...]}。'},
])
response, history = model.chat(tokenizer, query=query, history=None)
print(response)
#{"global_caption": "这是一张在自然光照下拍摄的海滩上与狗互动的照片。一位女性坐在沙滩上，身穿格子衬衫，正与一只黄色的拉布拉多犬互动。狗狗似乎在向她伸出爪子，而女性则微笑着回应。他们周围是细腻的沙滩和波光粼粼的海水，背景是温暖的夕阳，为整个场景增添了一抹金色的温暖。", "caption_list": ["一位女性坐在沙滩上，身穿格子衬衫，正在与一只黄色拉布拉多犬互动。", "黄色拉布拉多犬似乎在向女性伸出爪子，表情活泼。", "背景是波光粼粼的海水和细腻的沙滩，夕阳的余晖洒在海面上，营造出宁静的氛围。"]}

# Chinese output guided by a Chinese original description
query = tokenizer.from_list_format([
    {'image': 'https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg'},
    {'text': '根据输入的图片和描述提示：###狗伸出右前爪###用中文生成图片内容的详细描述和图片中所有实体的描述列表。输出为格式为：{"global_caption":"详细描述", "caption_list":["实体A的描述", "实体B的描述", "实体C的描述", ...]}。'},
])
response, history = model.chat(tokenizer, query=query, history=None)
print(response)
#{"global_caption": "这是一张展现人与动物和谐共处的自然光照摄影风格的照片，画面中一位女性坐在沙滩上，她的右腿伸直，与一只黄色的拉布拉多犬进行互动。狗狗伸出它的右前爪，似乎在与女性进行友好的交流。他们坐在沙滩上，周围是平静的海浪和远处的海平线，天空呈现出温暖的金色调，整个场景洋溢着宁静和温馨的氛围。", "caption_list": ["一位女性坐在沙滩上，穿着格子衬衫，她的右腿伸直，与狗狗互动。", "一只黄色的拉布拉多犬伸出它的右前爪，看起来像是在与女性进行友好的交流。", "背景是平静的海浪和远处的海平线，天空呈现出温暖的金色调。"]}

# Chinese output guided by an English original description
query = tokenizer.from_list_format([
    {'image': 'https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg'},
    {'text': '根据输入的图片和描述提示：###The dog is stretching out its right front paw###用中文生成图片内容的详细描述和图片中所有实体的描述列表。输出为格式为：{"global_caption":"详细描述", "caption_list":["实体A的描述", "实体B的描述", "实体C的描述", ...]}。'},
])
response, history = model.chat(tokenizer, query=query, history=None)
print(response)
#{"global_caption": "这是一张展现人与动物和谐共处的自然光照摄影图像，画面中一位女性坐在沙滩上，她的右腿伸展着，而一只黄色的拉布拉多犬正用它的右前爪与她击掌。他们位于沙滩上，背景是波光粼粼的海面和温暖的夕阳，营造出一种宁静而温馨的氛围。", "caption_list": ["一位女性坐在沙滩上，右腿伸展，与一只拉布拉多犬击掌", "一只黄色的拉布拉多犬正用它的右前爪与女性击掌", "背景是波光粼粼的海面和温暖的夕阳"]}


# English prompt, English output
query = tokenizer.from_list_format([
    {'image': 'https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg'},
    {'text': 'Generate an English detailed description of the content of the input image and a list of descriptions for all entities in the image. The output should be in the format: {"global_caption":"Detailed description", "caption_list":["Description of Entity A", "Description of Entity B", "Description of Entity C", ...]}.'},
])
response, history = model.chat(tokenizer, query=query, history=None)
print(response)
#{"global_caption": "This is a photo taken on the beach at sunset, showing a woman sitting on the sand, interacting with a large dog. The woman is wearing a plaid shirt, smiling at the dog, and the dog is raising its front paw, seemingly playing or greeting. They are surrounded by the soft light of the sunset, and the waves gently lap the beach, creating a tranquil and harmonious atmosphere.", "caption_list": ["A woman in a plaid shirt is sitting on the beach, smiling at a dog, holding a cell phone in her hand.", "A large dog is sitting on the beach, raising its front paw, seemingly interacting with the woman.", "The background is the beach and the waves at sunset, with the sky presenting a warm orange-yellow hue."]}

# English output guided by an English original description
query = tokenizer.from_list_format([
    {'image': 'https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg'},
    {'text': 'Given an image and some tips: ###The dog is stretching out its right front paw### related to image, generate an English detailed description of the content of the input image and a list of descriptions for all entities in the image. The output should be in the format: {"global_caption":"Detailed description", "caption_list":["Description of Entity A", "Description of Entity B", "Description of Entity C", ...]}.'},
])
response, history = model.chat(tokenizer, query=query, history=None)
print(response)
#{"global_caption": "This is a photo taken on the beach at sunset, showing a woman sitting on the sand, interacting with a yellow Labrador Retriever. The dog is stretching out its right front paw, seemingly playing or seeking attention from the woman. The woman is wearing a plaid shirt, smiling, and seems to be enjoying this moment. The background is the vast ocean and the sky, with the sun about to set, adding a warm tone to the entire scene.", "caption_list": ["A woman in a plaid shirt is sitting on the beach, smiling, interacting with a dog.", "A yellow Labrador Retriever is stretching out its right front paw, seemingly playing or seeking attention.", "The background is the vast ocean and the sky, with the sun about to set, adding a warm tone to the scene."]}

# English output guided by a Chinese original description
query = tokenizer.from_list_format([
    {'image': 'https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg'},
    {'text': 'Given an image and some tips: ###狗伸出右前爪### related to image, generate an English detailed description of the content of the input image and a list of descriptions for all entities in the image. The output should be in the format: {"global_caption":"Detailed description", "caption_list":["Description of Entity A", "Description of Entity B", "Description of Entity C", ...]}.'},
])
response, history = model.chat(tokenizer, query=query, history=None)
print(response)
#{"global_caption": "This is a photo taken on the beach at sunset, showing a woman sitting on the sand, interacting with a yellow Labrador Retriever. The dog is stretching out its right front paw, seemingly asking for a high five. The woman is wearing a plaid shirt, sitting on the sand, smiling at the dog, and holding a mobile phone in her hand. The surrounding environment is peaceful, with the waves gently lapping the beach, and the sky is a light blue, with the sun about to set.", "caption_list": ["A woman in a plaid shirt is sitting on the beach, smiling at a dog, holding a mobile phone in her hand.", "A yellow Labrador Retriever is stretching out its right front paw, seemingly asking for a high five.", "The background is a peaceful beach, with waves gently lapping the shore, and the sky is a light blue, with the sun about to set."]}
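
Each example above repeats the same three steps: build the query with `tokenizer.from_list_format`, call `model.chat`, and read the response. As a sketch, they can be folded into one function. `caption_image` and `caption_images` are hypothetical names, and `model` and `tokenizer` are assumed to be the objects loaded earlier in this section.

```python
def caption_image(model, tokenizer, image: str, prompt: str) -> str:
    """Run one single-turn captioning request and return the raw reply."""
    # Same list format as the examples above: one image entry, one text entry.
    query = tokenizer.from_list_format([
        {'image': image},
        {'text': prompt},
    ])
    # history=None starts a fresh conversation for every image.
    response, _history = model.chat(tokenizer, query=query, history=None)
    return response


def caption_images(model, tokenizer, images, prompt: str) -> dict:
    """Caption several images with the same prompt, keyed by image path or URL."""
    return {image: caption_image(model, tokenizer, image, prompt) for image in images}
```

With the model loaded, a call would look like `caption_image(model, tokenizer, image_url, prompt)`, where `image_url` and `prompt` are placeholders for an image location and one of the four supported prompts.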

License Agreement

Use of this model is subject to the Qwen-VL-Chat license agreement.

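Limitations and Disclaimer

Like other LLMs, Qwen-VL-Chat-Finetuned-Dense-Captioner may generate inaccurate, biased, or otherwise objectionable content in certain situations. Please use the model's output with care and do not spread any harmful information.

We solemnly warn that the Qwen-VL-Chat-Finetuned-Dense-Captioner model must not be used to create or disseminate harmful information, or to engage in any activity that may endanger the public, national security, or social stability, or that violates laws and regulations. We accept no liability for any problems arising from use of the model, including data security vulnerabilities, public opinion risks, or any risks and problems related to the model being misunderstood, misused, disseminated, or used in a non-compliant manner.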

Acknowledgements

Thanks to the Tongyi Qianwen and Tongyi Wanxiang teams for their open-source model work.
