The text embedding set trained by Jina AI.
## Quick Start

The easiest way to start using `jina-embeddings-v2-base-de` is to use Jina AI's Embedding API.
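For illustration, here is a minimal sketch of calling the Embedding API from Python. It assumes the `https://api.jina.ai/v1/embeddings` endpoint and an API key stored in a `JINA_API_KEY` environment variable; refer to the API documentation for the authoritative request format.

```python
# Sketch: requesting embeddings from the Jina AI Embedding API.
# Assumptions: the https://api.jina.ai/v1/embeddings endpoint and a JINA_API_KEY env variable.
import os
import requests

response = requests.post(
    "https://api.jina.ai/v1/embeddings",
    headers={"Authorization": f"Bearer {os.environ['JINA_API_KEY']}"},
    json={
        "model": "jina-embeddings-v2-base-de",
        "input": ["How is the weather today?", "Wie ist das Wetter heute?"],
    },
)
embeddings = [item["embedding"] for item in response.json()["data"]]
print(len(embeddings), len(embeddings[0]))  # number of inputs, embedding dimension
```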
## Intended Usage & Model Info
`jina-embeddings-v2-base-de` is a German/English bilingual text embedding model supporting a sequence length of 8192 tokens.
Additionally, we provide the following embedding models:

- `jina-embeddings-v2-small-en`: 33 million parameters.
- `jina-embeddings-v2-base-en`: 137 million parameters.
- `jina-embeddings-v2-base-zh`: 161 million parameters, Chinese-English bilingual embeddings.
- `jina-embeddings-v2-base-de`: 161 million parameters, German-English bilingual embeddings.
- `jina-embeddings-v2-base-es`: Spanish-English bilingual embeddings (soon).

## Data & Parameters

We will publish a report with technical details about the training of the bilingual models soon. The training of the English model is described in this [technical report](https://arxiv.org/abs/2310.19923).
## Usage

### Why mean pooling?

`mean pooling` takes all token embeddings from the model output and averages them at the sentence/paragraph level.
It has been proven to be the most effective way to produce high-quality sentence embeddings.
We offer an `encode` function to deal with this.
However, if you would like to do it without using the default `encode` function:
```python
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

def mean_pooling(model_output, attention_mask):
    # Average the token embeddings, ignoring padding positions via the attention mask.
    token_embeddings = model_output[0]
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

sentences = ['How is the weather today?', 'What is the current weather like today?']

tokenizer = AutoTokenizer.from_pretrained('jinaai/jina-embeddings-v2-base-de')
model = AutoModel.from_pretrained('jinaai/jina-embeddings-v2-base-de', trust_remote_code=True)

encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

with torch.no_grad():
    model_output = model(**encoded_input)

embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
embeddings = F.normalize(embeddings, p=2, dim=1)
```
You can use Jina Embedding models directly from the `transformers` package; on ModelScope, load the model through `modelscope`'s `AutoModel`:

```python
!pip install modelscope
from modelscope import AutoModel
from numpy.linalg import norm

# Cosine similarity between two embedding vectors.
cos_sim = lambda a, b: (a @ b.T) / (norm(a) * norm(b))

model = AutoModel.from_pretrained('jinaai/jina-embeddings-v2-base-de', trust_remote_code=True)  # trust_remote_code is needed to use the encode method
embeddings = model.encode(['How is the weather today?', 'Wie ist das Wetter heute?'])
print(cos_sim(embeddings[0], embeddings[1]))
```
If you only want to handle shorter sequences, such as 2k, pass the `max_length` parameter to the `encode` function:

```python
embeddings = model.encode(
    ['Very long ... document'],
    max_length=2048
)
```
## Alternatives to Using Transformers Package
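One common alternative is the `sentence-transformers` package. The sketch below is illustrative only and assumes a `sentence-transformers` version that supports loading this model with `trust_remote_code=True`:

```python
# Sketch: loading the model with sentence-transformers instead of transformers/modelscope.
# Assumption: an installed sentence-transformers version that supports trust_remote_code.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("jinaai/jina-embeddings-v2-base-de", trust_remote_code=True)
model.max_seq_length = 2048  # optionally cap the sequence length for shorter inputs

embeddings = model.encode(
    ["How is the weather today?", "Wie ist das Wetter heute?"],
    normalize_embeddings=True,
)
print(embeddings.shape)
```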
## Benchmark Results

We evaluated our bilingual model on all German and English evaluation tasks available in the MTEB benchmark. In addition, we evaluated the models against a couple of other German, English, and multilingual models on additional German evaluation tasks.
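As an illustration of how such an evaluation can be run, here is a sketch using the `mteb` package; the task shown is only an example, not the full benchmark setup reported above:

```python
# Sketch: running a single MTEB task against the model.
# Assumptions: the mteb and sentence-transformers packages; "STS22" is an example task only.
from mteb import MTEB
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("jinaai/jina-embeddings-v2-base-de", trust_remote_code=True)
evaluation = MTEB(tasks=["STS22"])  # any German/English MTEB task name works here
results = evaluation.run(model, output_folder="mteb_results")
```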
## Use Jina Embeddings for RAG

According to the latest blog post from LlamaIndex: "In summary, to achieve the peak performance in both hit rate and MRR, the combination of OpenAI or JinaAI-Base embeddings with the CohereRerank/bge-reranker-large reranker stands out."
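To illustrate the retrieve-then-rerank pattern described above, here is a small sketch using `sentence-transformers` with the open `bge-reranker-large` cross-encoder; it is not the LlamaIndex pipeline from the blog post, and the query/documents are made-up examples:

```python
# Sketch: dense retrieval with Jina embeddings, then reranking with a cross-encoder.
# Assumption: sentence-transformers is installed; this is not the benchmark setup cited above.
import numpy as np
from sentence_transformers import SentenceTransformer, CrossEncoder

embedder = SentenceTransformer("jinaai/jina-embeddings-v2-base-de", trust_remote_code=True)
reranker = CrossEncoder("BAAI/bge-reranker-large")

docs = [
    "Berlin is the capital of Germany.",
    "The weather in Berlin is mild today.",
    "Jina AI builds embedding models for long documents.",
]
query = "What is the capital of Germany?"

# Stage 1: retrieve candidates by cosine similarity over normalized embeddings.
doc_emb = embedder.encode(docs, normalize_embeddings=True)
query_emb = embedder.encode([query], normalize_embeddings=True)[0]
scores = doc_emb @ query_emb
candidates = [docs[i] for i in np.argsort(-scores)[:2]]

# Stage 2: rerank the retrieved candidates with the cross-encoder.
rerank_scores = reranker.predict([(query, doc) for doc in candidates])
print(candidates[int(np.argmax(rerank_scores))])
```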
## Contact

Join our Discord community and chat with other community members about ideas.
## Citation

If you find Jina Embeddings useful in your research, please cite the following paper:
```bibtex
@misc{günther2023jina,
      title={Jina Embeddings 2: 8192-Token General-Purpose Text Embeddings for Long Documents},
      author={Michael Günther and Jackmin Ong and Isabelle Mohr and Alaeddine Abdessalem and Tanguy Abel and Mohammad Kalim Akram and Susana Guzman and Georgios Mastrapas and Saba Sturua and Bo Wang and Maximilian Werk and Nan Wang and Han Xiao},
      year={2023},
      eprint={2310.19923},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```