siglip-so400m-patch14-384

Source: https://modelscope.cn/models/thomas/siglip-so400m-patch14-384
License: apache-2.0


SigLIP (shape-optimized model)

SigLIP model pre-trained on WebLI at resolution 384x384. It was introduced in the paper Sigmoid Loss for Language Image Pre-Training by Zhai et al. and first released in this repository.

This model has the SoViT-400m architecture, which is the shape-optimized version as presented in Getting ViT in Shape: Scaling Laws for Compute-Optimal Model Design by Alabdulmohsin et al.
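To inspect the shape-optimized hyperparameters of this checkpoint yourself, a minimal sketch (assuming only the checkpoint name used below and the standard Transformers config API) is:

from transformers import AutoConfig

# load the checkpoint's configuration without downloading the weights
config = AutoConfig.from_pretrained("google/siglip-so400m-patch14-384")
print(config.vision_config)  # shape-optimized vision tower (patch size 14, image size 384, ...)
print(config.text_config)    # paired text tower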

Disclaimer: The team releasing SigLIP did not write a model card for this model so this model card has been written by the Hugging Face team.

Model description

SigLIP is CLIP, a multimodal model, with a better loss function. The sigmoid loss operates solely on image-text pairs and does not require a global view of the pairwise similarities for normalization. This allows further scaling up the batch size, while also performing better at smaller batch sizes.
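As an illustrative sketch of this idea (not the released training code; the tensor names and the handling of the learnable temperature and bias are assumptions), the pairwise sigmoid loss from the paper can be written as:

import torch
import torch.nn.functional as F

def sigmoid_loss(image_embeds, text_embeds, t, b):
    # image_embeds, text_embeds: (batch, dim) L2-normalized embeddings
    # t, b: learnable temperature and bias (scalars)
    logits = image_embeds @ text_embeds.t() * t + b
    # +1 on the diagonal (matching pairs), -1 everywhere else
    labels = 2 * torch.eye(logits.size(0), device=logits.device) - 1
    # each pair contributes an independent binary term;
    # no batch-wide softmax normalization is needed
    return -F.logsigmoid(labels * logits).sum() / logits.size(0)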

A TLDR of SigLIP by one of the authors can be found here.

Intended uses & limitations

You can use the raw model for tasks like zero-shot image classification and image-text retrieval. See the model hub to look for other versions on a task that interests you.

How to use

Here is how to use this model to perform zero-shot image classification:

from PIL import Image
import requests
from transformers import AutoProcessor, AutoModel
import torch

model = AutoModel.from_pretrained("google/siglip-so400m-patch14-384")
processor = AutoProcessor.from_pretrained("google/siglip-so400m-patch14-384")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

texts = ["a photo of 2 cats", "a photo of 2 dogs"]
# important: pass padding="max_length", since the model was trained with that padding
inputs = processor(text=texts, images=image, padding="max_length", return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

logits_per_image = outputs.logits_per_image
probs = torch.sigmoid(logits_per_image) # these are the probabilities
print(f"{probs[0][0]:.1%} that image 0 is '{texts[0]}'")

Alternatively, one can leverage the pipeline API which abstracts away the complexity for the user:

from transformers import pipeline
from PIL import Image
import requests

# load pipe
image_classifier = pipeline(task="zero-shot-image-classification", model="google/siglip-so400m-patch14-384")

# load image
url = 'http://images.cocodataset.org/val2017/000000039769.jpg'
image = Image.open(requests.get(url, stream=True).raw)

# inference
outputs = image_classifier(image, candidate_labels=["2 cats", "a plane", "a remote"])
outputs = [{"score": round(output["score"], 4), "label": output["label"] } for output in outputs]
print(outputs)

For more code examples, we refer to the documentation.
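For the image-text retrieval use case mentioned above, a minimal, illustrative sketch (assuming the model class exposes get_image_features and get_text_features, as the SigLIP integration in Transformers does) could look like this:

from PIL import Image
import requests
import torch
from transformers import AutoProcessor, AutoModel

model = AutoModel.from_pretrained("google/siglip-so400m-patch14-384")
processor = AutoProcessor.from_pretrained("google/siglip-so400m-patch14-384")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
texts = ["a photo of 2 cats", "a photo of 2 dogs", "a photo of an airplane"]

with torch.no_grad():
    image_inputs = processor(images=image, return_tensors="pt")
    text_inputs = processor(text=texts, padding="max_length", return_tensors="pt")
    image_embeds = model.get_image_features(**image_inputs)
    text_embeds = model.get_text_features(**text_inputs)

# rank the candidate texts by cosine similarity to the image
image_embeds = image_embeds / image_embeds.norm(dim=-1, keepdim=True)
text_embeds = text_embeds / text_embeds.norm(dim=-1, keepdim=True)
similarity = (image_embeds @ text_embeds.t()).squeeze(0)
for idx in similarity.argsort(descending=True).tolist():
    print(f"{similarity[idx].item():.3f}  {texts[idx]}")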

Training procedure

Training data

SigLIP is pre-trained on the WebLI dataset (Chen et al., 2023).

Preprocessing

Images are resized/rescaled to the same resolution (384x384) and normalized across the RGB channels with mean (0.5, 0.5, 0.5) and standard deviation (0.5, 0.5, 0.5).

Texts are tokenized and padded to the same length (64 tokens).
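As an illustrative sanity check of these preprocessing settings (using the same checkpoint and image as above):

from PIL import Image
import requests
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("google/siglip-so400m-patch14-384")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

inputs = processor(text=["a photo of 2 cats"], images=image,
                   padding="max_length", return_tensors="pt")

# images are resized to 384x384 and normalized to roughly [-1, 1]
print(inputs["pixel_values"].shape)  # expected: torch.Size([1, 3, 384, 384])
# texts are padded to a fixed length of 64 tokens
print(inputs["input_ids"].shape)     # expected: torch.Size([1, 64])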

Compute

The model was trained on 16 TPU-v4 chips for three days.

Evaluation results

Evaluation of SigLIP compared to CLIP is shown below (taken from the paper).

[Figure: zero-shot evaluation results for SigLIP compared to CLIP, taken from the paper.]

BibTeX entry and citation info

@misc{zhai2023sigmoid,
      title={Sigmoid Loss for Language Image Pre-Training}, 
      author={Xiaohua Zhai and Basil Mustafa and Alexander Kolesnikov and Lucas Beyer},
      year={2023},
      eprint={2303.15343},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}