QA-CLIP-ViT-L-14

Posted by an anonymous user, July 31, 2024
Categories: ai, chinese_clip, pytorch
Source: https://modelscope.cn/models/AI-ModelScope/QA-CLIP-ViT-L-14
License: apache-2.0


Introduction

This project aims to provide a better Chinese CLIP model. The training data consists of publicly accessible image URLs paired with Chinese text descriptions, 400 million pairs in total. After screening, 100 million pairs were used for training. The project is produced by QQ-ARC Joint Lab, Tencent PCG. For more detailed information, please refer to the main page of the QA-CLIP project. The code is also open-sourced on GitHub as QA-CLIP, and stars are welcome!

Results

We ran zero-shot image-text retrieval tests on the MUGE Retrieval, Flickr30K-CN, and COCO-CN datasets, and zero-shot image classification tests on the ImageNet dataset. The results are shown in the tables below:

Flickr30K-CN Zero-shot Retrieval (Official Test Set):

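| Model | Text-to-Image R@1 | Text-to-Image R@5 | Text-to-Image R@10 | Image-to-Text R@1 | Image-to-Text R@5 | Image-to-Text R@10 |
|---|---|---|---|---|---|---|
| CN-CLIP RN50 | 48.8 | 76.0 | 84.6 | 60.0 | 85.9 | 92.0 |
| QA-CLIP RN50 | 50.5 | 77.4 | 86.1 | 67.1 | 87.9 | 93.2 |
| CN-CLIP ViT-B/16 | 62.7 | 86.9 | 92.8 | 74.6 | 93.5 | 97.1 |
| QA-CLIP ViT-B/16 | 63.8 | 88.0 | 93.2 | 78.4 | 96.1 | 98.5 |
| CN-CLIP ViT-L/14 | 68.0 | 89.7 | 94.4 | 80.2 | 96.6 | 98.2 |
| AltClip ViT-L/14 | 69.7 | 90.1 | 94.8 | 84.8 | 97.7 | 99.1 |
| QA-CLIP ViT-L/14 | 69.3 | 90.3 | 94.7 | 85.3 | 97.9 | 99.2 |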


MUGE Zero-shot Retrieval (Official Validation Set):

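| Model | Text-to-Image R@1 | Text-to-Image R@5 | Text-to-Image R@10 | Image-to-Text R@1 | Image-to-Text R@5 | Image-to-Text R@10 |
|---|---|---|---|---|---|---|
| CN-CLIP RN50 | 42.6 | 68.5 | 78.0 | 30.0 | 56.2 | 66.9 |
| QA-CLIP RN50 | 44.0 | 69.9 | 79.5 | 32.4 | 59.5 | 70.3 |
| CN-CLIP ViT-B/16 | 52.1 | 76.7 | 84.4 | 38.7 | 65.6 | 75.1 |
| QA-CLIP ViT-B/16 | 53.2 | 77.7 | 85.1 | 40.7 | 68.2 | 77.2 |
| CN-CLIP ViT-L/14 | 56.4 | 79.8 | 86.2 | 42.6 | 69.8 | 78.6 |
| AltClip ViT-L/14 | 29.6 | 49.9 | 58.8 | 21.4 | 42.0 | 51.9 |
| QA-CLIP ViT-L/14 | 57.4 | 81.0 | 87.7 | 45.5 | 73.0 | 81.4 |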


COCO-CN Zero-shot Retrieval (Official Test Set):

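| Model | Text-to-Image R@1 | Text-to-Image R@5 | Text-to-Image R@10 | Image-to-Text R@1 | Image-to-Text R@5 | Image-to-Text R@10 |
|---|---|---|---|---|---|---|
| CN-CLIP RN50 | 48.1 | 81.3 | 90.5 | 50.9 | 81.1 | 90.5 |
| QA-CLIP RN50 | 50.1 | 82.5 | 91.7 | 56.7 | 85.2 | 92.9 |
| CN-CLIP ViT-B/16 | 62.2 | 87.1 | 94.9 | 56.3 | 84.0 | 93.3 |
| QA-CLIP ViT-B/16 | 62.9 | 87.7 | 94.7 | 61.5 | 87.6 | 94.8 |
| CN-CLIP ViT-L/14 | 64.9 | 88.8 | 94.2 | 60.6 | 84.4 | 93.1 |
| AltClip ViT-L/14 | 63.5 | 87.6 | 93.5 | 62.6 | 88.5 | 95.9 |
| QA-CLIP ViT-L/14 | 65.7 | 90.2 | 95.0 | 64.5 | 88.3 | 95.1 |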
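
For readers unfamiliar with the R@K columns in the retrieval tables, the following is a minimal sketch of how Recall@K is commonly computed from a query-by-candidate similarity matrix. It is not the project's evaluation script: the function name `recall_at_k`, the `positives` structure, and the toy scores are all illustrative assumptions, and the official benchmarks may differ in details such as how multiple captions per image are handled.

```python
def recall_at_k(scores, positives, k):
    """Fraction of queries with at least one correct candidate in the top k.

    scores[i][j]  -- similarity between query i and candidate j
    positives[i]  -- set of candidate indices that are correct for query i
    """
    hits = 0
    for row, correct in zip(scores, positives):
        # Candidate indices sorted by descending similarity.
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if correct & set(ranked[:k]):
            hits += 1
    return hits / len(scores)


# Toy text-to-image example: 3 text queries, 4 candidate images (made-up scores).
scores = [
    [0.9, 0.1, 0.3, 0.2],  # query 0: correct image 0 is ranked first
    [0.2, 0.4, 0.8, 0.5],  # query 1: correct image 1 is ranked third
    [0.1, 0.6, 0.7, 0.3],  # query 2: correct image 2 is ranked first
]
positives = [{0}, {1}, {2}]

for k in (1, 2, 3):
    print(f"R@{k} = {100 * recall_at_k(scores, positives, k):.1f}")
```

Using a set of positives per query lets the same function cover image-to-text retrieval, where one image may have several correct captions.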


Zero-shot Image Classification on ImageNet:

| Model | ImageNet |
|---|---|
| CN-CLIP RN50 | 33.5 |
| QA-CLIP RN50 | 35.5 |
| CN-CLIP ViT-B/16 | 48.4 |
| QA-CLIP ViT-B/16 | 49.7 |
| CN-CLIP ViT-L/14 | 54.7 |
| QA-CLIP ViT-L/14 | 55.8 |
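
Zero-shot classification with a CLIP-style model usually assigns each image to the class whose text embedding is most similar to the image embedding. The sketch below illustrates that idea on toy vectors, assuming the features have already been extracted. The helper names (`l2_normalize`, `zero_shot_predict`, `top1_accuracy`) and the data are hypothetical, and the prompts and protocol the authors used for ImageNet are not stated here.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def zero_shot_predict(image_feature, class_features):
    """Return the index of the class whose text feature is closest to the image."""
    img = l2_normalize(image_feature)
    sims = [
        sum(a * b for a, b in zip(img, l2_normalize(cls)))
        for cls in class_features
    ]
    return max(range(len(sims)), key=sims.__getitem__)


def top1_accuracy(image_features, class_features, labels):
    """Share of images whose predicted class matches the ground-truth label."""
    correct = sum(
        zero_shot_predict(feat, class_features) == label
        for feat, label in zip(image_features, labels)
    )
    return correct / len(labels)


# Toy 2-D features: two classes, three images (values are made up).
class_features = [[1.0, 0.0], [0.0, 1.0]]
image_features = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]]
labels = [0, 1, 1]

print(f"top-1 accuracy = {100 * top1_accuracy(image_features, class_features, labels):.1f}")
```

With a real checkpoint, the class features would come from encoding one or more text prompts per class name, and the image features from the image encoder.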




Getting Started

Inference Code

Inference code example:

from PIL import Image
import requests
from transformers import ChineseCLIPProcessor, ChineseCLIPModel

model = ChineseCLIPModel.from_pretrained("TencentARC/QA-CLIP-ViT-L-14")
processor = ChineseCLIPProcessor.from_pretrained("TencentARC/QA-CLIP-ViT-L-14")

url = "https://clip-cn-beijing.oss-cn-beijing.aliyuncs.com/pokemon.jpeg"
image = Image.open(requests.get(url, stream=True).raw)
# Squirtle, Bulbasaur, Charmander, Pikachu in English
texts = ["杰尼龟", "妙蛙种子", "小火龙", "皮卡丘"]

# compute image feature
inputs = processor(images=image, return_tensors="pt")
image_features = model.get_image_features(**inputs)
image_features = image_features / image_features.norm(p=2, dim=-1, keepdim=True)  # normalize

# compute text features
inputs = processor(text=texts, padding=True, return_tensors="pt")
text_features = model.get_text_features(**inputs)
text_features = text_features / text_features.norm(p=2, dim=-1, keepdim=True)  # normalize

# compute image-text similarity scores
inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)
logits_per_image = outputs.logits_per_image  # this is the image-text similarity score
probs = logits_per_image.softmax(dim=1)
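
The example above stops at `probs` without showing how to read it. As a rough, pure-Python sketch of that last step, the helper below turns cosine similarities into softmax probabilities and pairs them with the candidate texts. The names `softmax` and `rank_labels`, the default `logit_scale=100.0`, and the similarity values are illustrative assumptions, not outputs of the model.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def rank_labels(cosine_sims, labels, logit_scale=100.0):
    """Pair each label with its probability, most likely first.

    logit_scale is an assumed temperature; a real checkpoint stores its own value.
    """
    probs = softmax([logit_scale * s for s in cosine_sims])
    return sorted(zip(labels, probs), key=lambda pair: pair[1], reverse=True)


# Made-up cosine similarities between one image and the four candidate texts.
texts = ["杰尼龟", "妙蛙种子", "小火龙", "皮卡丘"]
cosine_sims = [0.20, 0.22, 0.21, 0.30]

for label, prob in rank_labels(cosine_sims, texts):
    print(f"{label}: {prob:.4f}")
```

With the tensors from the inference example, the similarities would be `(image_features @ text_features.T)[0].tolist()`. If only the final ranking matters, `probs[0].tolist()` can be zipped with `texts` and sorted directly.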



Acknowledgments

The project code is based on the implementation of Chinese-CLIP, and we are very grateful for their outstanding open-source contributions.
