Introduction
This project aims to provide a better Chinese CLIP model. The training data used in this project consists of publicly accessible image URLs and the associated Chinese text descriptions, totaling 400 million pairs. After screening, we ultimately used 100 million pairs for training.
This project is produced by QQ-ARC Joint Lab, Tencent PCG. For more detailed information, please refer to the main page of the QA-CLIP project. We have also open-sourced our code on GitHub, QA-CLIP, and you are welcome to star it!
Results
We conducted zero-shot tests on the MUGE Retrieval, Flickr30K-CN, and COCO-CN datasets for image-text retrieval, and on the ImageNet dataset for zero-shot image classification. The results are shown in the tables below:
Flickr30K-CN Zero-shot Retrieval (Official Test Set):
| Task | Text-to-Image | | | Image-to-Text | | |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: |
| Metric | R@1 | R@5 | R@10 | R@1 | R@5 | R@10 |
| CN-CLIP<sub>RN50</sub> | 48.8 | 76.0 | 84.6 | 60.0 | 85.9 | 92.0 |
| QA-CLIP<sub>RN50</sub> | 50.5 | 77.4 | 86.1 | 67.1 | 87.9 | 93.2 |
| CN-CLIP<sub>ViT-B/16</sub> | 62.7 | 86.9 | 92.8 | 74.6 | 93.5 | 97.1 |
| QA-CLIP<sub>ViT-B/16</sub> | 63.8 | 88.0 | 93.2 | 78.4 | 96.1 | 98.5 |
| CN-CLIP<sub>ViT-L/14</sub> | 68.0 | 89.7 | 94.4 | 80.2 | 96.6 | 98.2 |
| AltClip<sub>ViT-L/14</sub> | 69.7 | 90.1 | 94.8 | 84.8 | 97.7 | 99.1 |
| QA-CLIP<sub>ViT-L/14</sub> | 69.3 | 90.3 | 94.7 | 85.3 | 97.9 | 99.2 |
MUGE Zero-shot Retrieval (Official Validation Set):
| Task | Text-to-Image | | | Image-to-Text | | |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: |
| Metric | R@1 | R@5 | R@10 | R@1 | R@5 | R@10 |
| CN-CLIP<sub>RN50</sub> | 42.6 | 68.5 | 78.0 | 30.0 | 56.2 | 66.9 |
| QA-CLIP<sub>RN50</sub> | 44.0 | 69.9 | 79.5 | 32.4 | 59.5 | 70.3 |
| CN-CLIP<sub>ViT-B/16</sub> | 52.1 | 76.7 | 84.4 | 38.7 | 65.6 | 75.1 |
| QA-CLIP<sub>ViT-B/16</sub> | 53.2 | 77.7 | 85.1 | 40.7 | 68.2 | 77.2 |
| CN-CLIP<sub>ViT-L/14</sub> | 56.4 | 79.8 | 86.2 | 42.6 | 69.8 | 78.6 |
| AltClip<sub>ViT-L/14</sub> | 29.6 | 49.9 | 58.8 | 21.4 | 42.0 | 51.9 |
| QA-CLIP<sub>ViT-L/14</sub> | 57.4 | 81.0 | 87.7 | 45.5 | 73.0 | 81.4 |
COCO-CN Zero-shot Retrieval (Official Test Set):
| Task | Text-to-Image | | | Image-to-Text | | |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: |
| Metric | R@1 | R@5 | R@10 | R@1 | R@5 | R@10 |
| CN-CLIP<sub>RN50</sub> | 48.1 | 81.3 | 90.5 | 50.9 | 81.1 | 90.5 |
| QA-CLIP<sub>RN50</sub> | 50.1 | 82.5 | 91.7 | 56.7 | 85.2 | 92.9 |
| CN-CLIP<sub>ViT-B/16</sub> | 62.2 | 87.1 | 94.9 | 56.3 | 84.0 | 93.3 |
| QA-CLIP<sub>ViT-B/16</sub> | 62.9 | 87.7 | 94.7 | 61.5 | 87.6 | 94.8 |
| CN-CLIP<sub>ViT-L/14</sub> | 64.9 | 88.8 | 94.2 | 60.6 | 84.4 | 93.1 |
| AltClip<sub>ViT-L/14</sub> | 63.5 | 87.6 | 93.5 | 62.6 | 88.5 | 95.9 |
| QA-CLIP<sub>ViT-L/14</sub> | 65.7 | 90.2 | 95.0 | 64.5 | 88.3 | 95.1 |
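The R@1/R@5/R@10 metrics above are Recall@K: the fraction of queries whose ground-truth match is ranked within the top K retrieved candidates. As a minimal illustration only (the official benchmark scripts additionally handle details such as multiple captions per image), Recall@K over a query-candidate similarity matrix can be sketched like this:

```python
import numpy as np

def recall_at_k(similarity: np.ndarray, k: int) -> float:
    """Fraction of queries whose ground-truth item appears in the top-k results.

    similarity[i, j] is the score between query i and candidate j; for this toy
    setup the ground-truth candidate for query i is assumed to be index i.
    """
    # Rank candidates for every query from highest to lowest score.
    ranking = np.argsort(-similarity, axis=1)
    # Check whether the matching index i appears within the first k positions.
    hits = (ranking[:, :k] == np.arange(len(similarity))[:, None]).any(axis=1)
    return float(hits.mean())

# Toy example: 3 queries, 3 candidates, diagonal scores are the ground truth.
sim = np.array([[0.9, 0.1, 0.2],
                [0.3, 0.8, 0.1],
                [0.2, 0.4, 0.7]])
print(recall_at_k(sim, k=1))  # 1.0
```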
Zero-shot Image Classification on ImageNet:
| Task | ImageNet |
| :--- | :---: |
| CN-CLIP<sub>RN50</sub> | 33.5 |
| QA-CLIP<sub>RN50</sub> | 35.5 |
| CN-CLIP<sub>ViT-B/16</sub> | 48.4 |
| QA-CLIP<sub>ViT-B/16</sub> | 49.7 |
| CN-CLIP<sub>ViT-L/14</sub> | 54.7 |
| QA-CLIP<sub>ViT-L/14</sub> | 55.8 |
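Zero-shot classification scores an image against one text prompt per class and predicts the highest-scoring class. The sketch below illustrates the idea with a few hypothetical Chinese class names and a simple prompt template; it is not the exact pipeline behind the ImageNet numbers above, which covers all 1,000 classes:

```python
import torch
import requests
from PIL import Image
from transformers import ChineseCLIPModel, ChineseCLIPProcessor

model = ChineseCLIPModel.from_pretrained("TencentARC/QA-CLIP-ViT-L-14")
processor = ChineseCLIPProcessor.from_pretrained("TencentARC/QA-CLIP-ViT-L-14")

# Hypothetical class names; a real ImageNet run uses all 1,000 labels in Chinese.
class_names = ["金鱼", "大熊猫", "消防车"]
prompts = [f"一张{name}的照片。" for name in class_names]  # "a photo of a {label}."

url = "https://clip-cn-beijing.oss-cn-beijing.aliyuncs.com/pokemon.jpeg"
image = Image.open(requests.get(url, stream=True).raw)

with torch.no_grad():
    inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
    outputs = model(**inputs)

# logits_per_image holds the similarity of the image to each class prompt.
pred = outputs.logits_per_image.softmax(dim=-1).argmax(dim=-1).item()
print(class_names[pred])
```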
Getting Started
Inference Code
Inference code example:
```python
from PIL import Image
import requests
from transformers import ChineseCLIPProcessor, ChineseCLIPModel

model = ChineseCLIPModel.from_pretrained("TencentARC/QA-CLIP-ViT-L-14")
processor = ChineseCLIPProcessor.from_pretrained("TencentARC/QA-CLIP-ViT-L-14")

url = "https://clip-cn-beijing.oss-cn-beijing.aliyuncs.com/pokemon.jpeg"
image = Image.open(requests.get(url, stream=True).raw)
# Squirtle, Bulbasaur, Charmander, Pikachu in English
texts = ["杰尼龟", "妙蛙种子", "小火龙", "皮卡丘"]

# compute image features
inputs = processor(images=image, return_tensors="pt")
image_features = model.get_image_features(**inputs)
image_features = image_features / image_features.norm(p=2, dim=-1, keepdim=True)  # normalize

# compute text features
inputs = processor(text=texts, padding=True, return_tensors="pt")
text_features = model.get_text_features(**inputs)
text_features = text_features / text_features.norm(p=2, dim=-1, keepdim=True)  # normalize

# compute image-text similarity scores
inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)
logits_per_image = outputs.logits_per_image  # image-text similarity scores
probs = logits_per_image.softmax(dim=1)  # probabilities over the candidate texts
```
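As a small follow-up (not part of the original example), the probabilities can be used to pick the best-matching caption, and the manually normalized features above yield the same ranking as `logits_per_image`, differing only by the model's learned temperature scale:

```python
# Best-matching caption for the image.
best = probs.argmax(dim=1).item()
print(f"best match: {texts[best]} (p={probs[0, best].item():.3f})")

# Cosine similarities from the normalized features give the same ordering
# as logits_per_image (up to the learned temperature scale).
cosine_sim = image_features @ text_features.T
print(cosine_sim)
```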
Acknowledgments
The project code is based on the implementation of Chinese-CLIP, and we are very grateful for their outstanding open-source contributions.