jina-clip-v1 - open-source AI project - 程序员客栈

The embedding set trained by Jina AI.

Jina CLIP: your CLIP model is also your text retriever!

Intended Usage & Model Info

jina-clip-v1 is a state-of-the-art English multimodal (text-image) embedding model.

Traditional text embedding models, such as jina-embeddings-v2-base-en, excel in text-to-text retrieval but are incapable of cross-modal tasks. Models like openai/clip-vit-base-patch32 effectively align image and text embeddings but are not optimized for text-to-text retrieval due to their training methodologies and context limitations.

jina-clip-v1 bridges this gap by offering robust performance in both domains. Its text component matches the retrieval efficiency of jina-embeddings-v2-base-en, while its overall architecture sets a new benchmark for cross-modal retrieval. This dual capability makes it an excellent tool for multimodal retrieval-augmented generation (MuRAG) applications, enabling seamless text-to-text and text-to-image searches within a single model.

Data & Parameters

Check out our paper

Usage

The easiest way to start using jina-clip-v1 is through Jina AI's Embeddings API.
Alternatively, you can use Jina CLIP directly via the transformers package.

!pip install transformers einops timm pillow
from transformers import AutoModel

# Initialize the model
model = AutoModel.from_pretrained('jinaai/jina-clip-v1', trust_remote_code=True)

# New meaningful sentences
sentences = ['A blue cat', 'A red cat']

# Public image URLs
image_urls = [
    'https://i.pinimg.com/600x315/21/48/7e/21487e8e0970dd366dafaed6ab25d8d8.jpg',
    'https://i.pinimg.com/736x/c9/f2/3e/c9f23e212529f13f19bad5602d84b78b.jpg'
]

# Encode text and images
text_embeddings = model.encode_text(sentences)
image_embeddings = model.encode_image(image_urls)  # also accepts PIL.image, local filenames, dataURI

# Compute similarities
print(text_embeddings[0] @ text_embeddings[1].T)  # text embedding similarity
print(text_embeddings[0] @ image_embeddings[0].T)  # text-image cross-modal similarity
print(text_embeddings[0] @ image_embeddings[1].T)  # text-image cross-modal similarity
print(text_embeddings[1] @ image_embeddings[0].T)  # text-image cross-modal similarity
print(text_embeddings[1] @ image_embeddings[1].T)  # text-image cross-modal similarity
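
For retrieval, the pairwise similarities are usually turned into a ranking over many candidates. The following is a minimal sketch of that step, not part of the model's API: `cosine_similarity_matrix` and `rank_candidates` are hypothetical helper names, and the small hand-written vectors stand in for real embeddings so the snippet runs without downloading the model. It normalizes explicitly, so it does not rely on the embeddings already being unit length.

```python
import numpy as np


def cosine_similarity_matrix(queries, candidates):
    """Return pairwise cosine similarities, shape (n_queries, n_candidates)."""
    q = np.asarray(queries, dtype=np.float64)
    c = np.asarray(candidates, dtype=np.float64)
    # Normalize each row to unit length so the dot product is a cosine.
    q = q / np.linalg.norm(q, axis=1, keepdims=True)
    c = c / np.linalg.norm(c, axis=1, keepdims=True)
    return q @ c.T


def rank_candidates(query, candidates):
    """Return candidate indices (best first) and their similarity scores."""
    sims = cosine_similarity_matrix([query], candidates)[0]
    order = np.argsort(-sims)  # descending similarity
    return order, sims[order]


# Stand-in vectors (assumption): replace with model outputs such as
# text_embeddings[0] and image_embeddings in a real pipeline.
text_query = np.array([1.0, 0.0, 0.0])
image_vectors = np.array([
    [0.0, 1.0, 0.0],
    [2.0, 0.1, 0.0],
    [1.0, 1.0, 0.0],
])

order, scores = rank_candidates(text_query, image_vectors)
print(order, scores)
```

The same helper works for text-to-text ranking by passing text embeddings as the candidates.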

JavaScript developers can use Jina CLIP via the Transformers.js library. Note that to use this model, you need to install Transformers.js v3 from source using npm install xenova/transformers.js#v3.

import { AutoTokenizer, CLIPTextModelWithProjection, AutoProcessor, CLIPVisionModelWithProjection, RawImage, cos_sim } from '@xenova/transformers';

// Load tokenizer and text model
const tokenizer = await AutoTokenizer.from_pretrained('jinaai/jina-clip-v1');
const text_model = await CLIPTextModelWithProjection.from_pretrained('jinaai/jina-clip-v1');

// Load processor and vision model
const processor = await AutoProcessor.from_pretrained('Xenova/clip-vit-base-patch32');
const vision_model = await CLIPVisionModelWithProjection.from_pretrained('jinaai/jina-clip-v1');

// Run tokenization
const texts = ['A blue cat', 'A red cat'];
const text_inputs = tokenizer(texts, { padding: true, truncation: true });

// Compute text embeddings
const { text_embeds } = await text_model(text_inputs);

// Read images and run processor
const urls = [
    'https://i.pinimg.com/600x315/21/48/7e/21487e8e0970dd366dafaed6ab25d8d8.jpg',
    'https://i.pinimg.com/736x/c9/f2/3e/c9f23e212529f13f19bad5602d84b78b.jpg'
];
const image = await Promise.all(urls.map(url => RawImage.read(url)));
const image_inputs = await processor(image);

// Compute vision embeddings
const { image_embeds } = await vision_model(image_inputs);

// Compute similarities
console.log(cos_sim(text_embeds[0].data, text_embeds[1].data)) // text embedding similarity
console.log(cos_sim(text_embeds[0].data, image_embeds[0].data)) // text-image cross-modal similarity
console.log(cos_sim(text_embeds[0].data, image_embeds[1].data)) // text-image cross-modal similarity
console.log(cos_sim(text_embeds[1].data, image_embeds[0].data)) // text-image cross-modal similarity
console.log(cos_sim(text_embeds[1].data, image_embeds[1].data)) // text-image cross-modal similarity

Performance

Text-Image Retrieval

Name	Flickr Image Retr. R@1	Flickr Image Retr. R@5	Flickr Text Retr. R@1	Flickr Text Retr. R@5
ViT-B-32	0.597	0.8398	0.781	0.938
ViT-B-16	0.6216	0.8572	0.822	0.966
jina-clip	0.6748	0.8902	0.811	0.965

Name	MSCOCO Image Retr. R@1	MSCOCO Image Retr. R@5	MSCOCO Text Retr. R@1	MSCOCO Text Retr. R@5
ViT-B-32	0.342	0.6001	0.5234	0.7634
ViT-B-16	0.3309	0.5842	0.5242	0.767
jina-clip	0.4111	0.6644	0.5544	0.7904

Text-Text Retrieval

Name	STS12	STS15	STS17	STS13	STS14	STS16	STS22	STSBenchmark	SummEval
jina-embeddings-v2	0.7427	0.8755	0.8888	0.833	0.7917	0.836	0.6346	0.8404	0.3056
jina-clip	0.7352	0.8746	0.8976	0.8323	0.7868	0.8377	0.6583	0.8493	0.3048

Name	ArguAna	FiQA2018	NFCorpus	Quora	SCIDOCS	SciFact	TRECCOVID
jina-embeddings-v2	0.4418	0.4158	0.3245	0.882	0.1986	0.6668	0.6591
jina-clip	0.4933	0.3827	0.3352	0.8789	0.2024	0.6734	0.7161

Contact

Join our Discord community and chat with other community members about ideas.

Citation

If you find jina-clip-v1 useful in your research, please cite the following paper:

@misc{2405.20204,
    Author = {Andreas Koukounas and Georgios Mastrapas and Michael Günther and Bo Wang and Scott Martens and Isabelle Mohr and Saba Sturua and Mohammad Kalim Akram and Joan Fontanals Martínez and Saahil Ognawala and Susana Guzman and Maximilian Werk and Nan Wang and Han Xiao},
    Title = {Jina CLIP: Your CLIP Model Is Also Your Text Retriever},
    Year = {2024},
    Eprint = {arXiv:2405.20204},
}

FAQ

I encounter this problem, what should I do?

ValueError: The model class you are passing has a `config_class` attribute that is not consistent with the config class you passed (model has <class 'transformers_modules.jinaai.jina-clip-implementation.7f069e2d54d609ef1ad2eb578c7bf07b5a51de41.configuration_clip.JinaCLIPConfig'> and you passed <class 'transformers_modules.jinaai.jina-clip-implementation.7f069e2d54d609ef1ad2eb578c7bf07b5a51de41.configuration_cli.JinaCLIPConfig'>. Fix one of those so they match!

There was a bug in the Transformers library between versions 4.40.x and 4.41.1. You can update transformers to >4.41.2 or <=4.40.0.
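
To check an environment against that recommendation before loading the model, a small guard can help. This is only a sketch: `parse_version` and `is_recommended` are hypothetical helpers, and the bounds are copied literally from the answer above (so 4.41.2 itself is treated as outside the recommended range). It reads the installed version through the standard library's `importlib.metadata` and skips the check if transformers is not installed.

```python
import re
from importlib.metadata import PackageNotFoundError, version


def parse_version(version_string):
    """Parse the leading 'X.Y.Z' of a version string into a tuple of ints."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", version_string)
    if match is None:
        raise ValueError(f"Unrecognized version string: {version_string!r}")
    return tuple(int(part) for part in match.groups())


def is_recommended(version_string):
    """True if the version is >4.41.2 or <=4.40.0, as stated in the FAQ."""
    v = parse_version(version_string)
    return v > (4, 41, 2) or v <= (4, 40, 0)


try:
    installed = version("transformers")
except PackageNotFoundError:
    installed = None  # transformers is not installed in this environment

if installed is not None and not is_recommended(installed):
    print(f"transformers {installed} may trigger the config_class error; "
          "consider upgrading or downgrading.")
```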

Given one query, how can I merge its text-text and text-image cosine similarity?

Our empirical study shows that text-text cosine similarity is normally larger than text-image cosine similarity! If you want to merge the two scores, we recommend 2 ways:

weighted average of text-text sim and text-image sim:

combined_scores = sim(text, text) + lambda * sim(text, image)  # optimal lambda depends on your dataset, but in general lambda=2 can be a good choice.

apply z-score normalization before merging scores:

# pseudo code
query_document_mean = np.mean(cos_sim_query_documents)
query_document_std = np.std(cos_sim_query_documents)
text_image_mean = np.mean(cos_sim_text_images)
text_image_std = np.std(cos_sim_text_images)

query_document_sim_normalized = (cos_sim_query_documents - query_document_mean) / query_document_std
text_image_sim_normalized = (cos_sim_text_images - text_image_mean) / text_image_std
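
Both merging strategies can be sketched as runnable NumPy code. The function names (`weighted_merge`, `zscore`, `zscore_merge`) and the sample scores are illustrative assumptions, not part of the model or its API. The sketch also assumes that each candidate has both a text-text and a text-image score for the same query, and that the z-scored values are combined by simple addition, which the answer above does not specify.

```python
import numpy as np


def weighted_merge(text_text_sims, text_image_sims, lam=2.0):
    """Combine scores as sim(text, text) + lam * sim(text, image)."""
    tt = np.asarray(text_text_sims, dtype=np.float64)
    ti = np.asarray(text_image_sims, dtype=np.float64)
    return tt + lam * ti


def zscore(scores):
    """Shift and scale scores to zero mean and unit standard deviation."""
    x = np.asarray(scores, dtype=np.float64)
    std = x.std()
    if std == 0:
        # All scores are identical, so none stands out after normalization.
        return np.zeros_like(x)
    return (x - x.mean()) / std


def zscore_merge(text_text_sims, text_image_sims):
    """Normalize each score set separately, then add them (assumed merge rule)."""
    return zscore(text_text_sims) + zscore(text_image_sims)


# Made-up similarity scores for three candidates (illustration only).
cos_sim_query_documents = [0.80, 0.60, 0.70]
cos_sim_text_images = [0.30, 0.35, 0.20]

weighted = weighted_merge(cos_sim_query_documents, cos_sim_text_images, lam=2.0)
merged = zscore_merge(cos_sim_query_documents, cos_sim_text_images)
print(weighted)
print(merged)
```

The weighted variant needs a tuned `lam` per dataset, while the z-score variant adapts to each score distribution but needs enough candidates for the mean and standard deviation to be meaningful.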
