siglip-so400m-patch14-384

Source: https://modelscope.cn/models/thomas/siglip-so400m-patch14-384
License: apache-2.0


SigLIP (shape-optimized model)

SigLIP model pre-trained on WebLI at resolution 384x384. It was introduced in the paper Sigmoid Loss for Language Image Pre-Training by Zhai et al. and first released in this repository.

This model has the SoViT-400m architecture, which is the shape-optimized version as presented in Getting ViT in Shape: Scaling Laws for Compute-Optimal Model Design by Alabdulmohsin et al.
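To inspect the shape-optimized hyperparameters of this checkpoint yourself, a minimal sketch (assuming only the checkpoint name used below and the standard Transformers config API) is:

from transformers import AutoConfig

# load the checkpoint's configuration without downloading the weights
config = AutoConfig.from_pretrained("google/siglip-so400m-patch14-384")
print(config.vision_config)  # shape-optimized vision tower (patch size 14, image size 384, ...)
print(config.text_config)    # paired text tower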

Disclaimer: The team releasing SigLIP did not write a model card for this model so this model card has been written by the Hugging Face team.

Model description

SigLIP is CLIP, a multimodal model, with a better loss function. The sigmoid loss operates solely on image-text pairs and does not require a global view of the pairwise similarities for normalization. This allows further scaling up the batch size, while also performing better at smaller batch sizes.
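As an illustrative sketch of this idea (not the released training code; the tensor names and the handling of the learnable temperature and bias are assumptions), the pairwise sigmoid loss from the paper can be written as:

import torch
import torch.nn.functional as F

def sigmoid_loss(image_embeds, text_embeds, t, b):
    # image_embeds, text_embeds: (batch, dim) L2-normalized embeddings
    # t, b: learnable temperature and bias (scalars)
    logits = image_embeds @ text_embeds.t() * t + b
    # +1 on the diagonal (matching pairs), -1 everywhere else
    labels = 2 * torch.eye(logits.size(0), device=logits.device) - 1
    # each pair contributes an independent binary term;
    # no batch-wide softmax normalization is needed
    return -F.logsigmoid(labels * logits).sum() / logits.size(0)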

A TLDR of SigLIP by one of the authors can be found here.

Intended uses & limitations

You can use the raw model for tasks like zero-shot image classification and image-text retrieval. See the model hub to look for other versions on a task that interests you.

How to use

Here is how to use this model to perform zero-shot image classification:

from PIL import Image
import requests
from transformers import AutoProcessor, AutoModel
import torch

model = AutoModel.from_pretrained("google/siglip-so400m-patch14-384")
processor = AutoProcessor.from_pretrained("google/siglip-so400m-patch14-384")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

texts = ["a photo of 2 cats", "a photo of 2 dogs"]
# important: pass padding="max_length", since the model was trained with that padding
inputs = processor(text=texts, images=image, padding="max_length", return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

logits_per_image = outputs.logits_per_image
probs = torch.sigmoid(logits_per_image) # these are the probabilities
print(f"{probs[0][0]:.1%} that image 0 is '{texts[0]}'")

Alternatively, one can leverage the pipeline API which abstracts away the complexity for the user:

from transformers import pipeline
from PIL import Image
import requests

# load pipe
image_classifier = pipeline(task="zero-shot-image-classification", model="google/siglip-so400m-patch14-384")

# load image
url = 'http://images.cocodataset.org/val2017/000000039769.jpg'
image = Image.open(requests.get(url, stream=True).raw)

# inference
outputs = image_classifier(image, candidate_labels=["2 cats", "a plane", "a remote"])
outputs = [{"score": round(output["score"], 4), "label": output["label"] } for output in outputs]
print(outputs)

For more code examples, we refer to the documentation.
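For the image-text retrieval use case mentioned above, a minimal, illustrative sketch (assuming the model class exposes get_image_features and get_text_features, as the SigLIP integration in Transformers does) could look like this:

from PIL import Image
import requests
import torch
from transformers import AutoProcessor, AutoModel

model = AutoModel.from_pretrained("google/siglip-so400m-patch14-384")
processor = AutoProcessor.from_pretrained("google/siglip-so400m-patch14-384")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
texts = ["a photo of 2 cats", "a photo of 2 dogs", "a photo of an airplane"]

with torch.no_grad():
    image_inputs = processor(images=image, return_tensors="pt")
    text_inputs = processor(text=texts, padding="max_length", return_tensors="pt")
    image_embeds = model.get_image_features(**image_inputs)
    text_embeds = model.get_text_features(**text_inputs)

# rank the candidate texts by cosine similarity to the image
image_embeds = image_embeds / image_embeds.norm(dim=-1, keepdim=True)
text_embeds = text_embeds / text_embeds.norm(dim=-1, keepdim=True)
similarity = (image_embeds @ text_embeds.t()).squeeze(0)
for idx in similarity.argsort(descending=True).tolist():
    print(f"{similarity[idx].item():.3f}  {texts[idx]}")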

Training procedure

Training data

SigLIP is pre-trained on the WebLI dataset (Chen et al., 2023).

Preprocessing

Images are resized/rescaled to the same resolution (384x384) and normalized across the RGB channels with mean (0.5, 0.5, 0.5) and standard deviation (0.5, 0.5, 0.5).

Texts are tokenized and padded to the same length (64 tokens).
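As an illustrative sanity check of these preprocessing settings (using the same checkpoint and image as above):

from PIL import Image
import requests
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("google/siglip-so400m-patch14-384")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

inputs = processor(text=["a photo of 2 cats"], images=image,
                   padding="max_length", return_tensors="pt")

# images are resized to 384x384 and normalized to roughly [-1, 1]
print(inputs["pixel_values"].shape)  # expected: torch.Size([1, 3, 384, 384])
# texts are padded to a fixed length of 64 tokens
print(inputs["input_ids"].shape)     # expected: torch.Size([1, 64])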

Compute

The model was trained on 16 TPU-v4 chips for three days.

Evaluation results

Evaluation of SigLIP compared to CLIP is shown below (taken from the paper).

[Figure: zero-shot evaluation results for SigLIP compared to CLIP, taken from the paper.]

BibTeX entry and citation info

@misc{zhai2023sigmoid,
      title={Sigmoid Loss for Language Image Pre-Training}, 
      author={Xiaohua Zhai and Basil Mustafa and Alexander Kolesnikov and Lucas Beyer},
      year={2023},
      eprint={2303.15343},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}