Jina AI Text Embedding Model v2-base (Chinese-English Bilingual)

Finetuner logo: Finetuner helps you to create experiments in order to improve embeddings on search tasks. It accompanies you to deliver the last mile of performance-tuning for neural search applications.

The text embedding set trained by Jina AI.

Quick Start

The easiest way to start using jina-embeddings-v2-base-zh is to use Jina AI's Embedding API.

Intended Usage & Model Info

jina-embeddings-v2-base-zh is a Chinese/English bilingual text embedding model that supports a sequence length of up to 8192. It is based on a BERT architecture (JinaBERT) that supports the symmetric bidirectional variant of ALiBi to allow longer sequence lengths. JinaBERT is an improvement on the BERT architecture and the first to apply ALiBi to an encoder architecture. Unlike earlier monolingual or multilingual embedding models, this bilingual model is designed for high performance in monolingual (Chinese query, Chinese documents) and cross-lingual (Chinese query, English documents) retrieval, and it is trained specifically to support mixed Chinese-English input without bias. Additionally, we provide the following embedding models:

jina-embeddings-v2-small-en: 33 million parameters.
jina-embeddings-v2-base-en: 137 million parameters.
jina-embeddings-v2-base-zh: 161 million parameters, Chinese-English bilingual embeddings (you are here).
jina-embeddings-v2-base-de: 161 million parameters, German-English bilingual embeddings.
jina-embeddings-v2-base-es: Spanish-English bilingual embeddings (soon).
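
The symmetric bidirectional ALiBi variant mentioned above replaces learned position embeddings with a per-head linear penalty on the distance between query and key positions. The following is a minimal sketch of how such a bias could be built, assuming the slope schedule from the original ALiBi paper for a power-of-two number of heads. The function names are illustrative and are not taken from the JinaBERT code.

```python
import numpy as np

def alibi_slopes(num_heads):
    # Geometric slope schedule from the ALiBi paper (assumes num_heads is a power of two).
    start = 2.0 ** (-8.0 / num_heads)
    return np.array([start ** (h + 1) for h in range(num_heads)])

def symmetric_alibi_bias(seq_len, num_heads):
    # Distance |i - j| is symmetric, so tokens on both sides are penalized equally.
    positions = np.arange(seq_len)
    distance = np.abs(positions[:, None] - positions[None, :])  # (seq_len, seq_len)
    slopes = alibi_slopes(num_heads)                            # (num_heads,)
    # Bias added to attention scores before softmax: (num_heads, seq_len, seq_len)
    return -slopes[:, None, None] * distance[None, :, :]

bias = symmetric_alibi_bias(seq_len=4, num_heads=8)
print(bias[0])  # head 0 uses slope 0.5
```

Because the bias depends only on relative distance, the same formula applies to any sequence length, which is what lets a model trained on shorter inputs be applied to longer ones.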

Data & Parameters

We will publish a report with technical details about the training of the bilingual models soon. The training of the English model is described in this technical report.

Usage

Please apply mean pooling when integrating the model.

Why mean pooling?

Mean pooling takes all token embeddings from the model output and averages them at the sentence/paragraph level. It has proven to be the most effective way to produce high-quality sentence embeddings. We offer an `encode` function that handles this. However, if you would like to do it without using the default `encode` function:

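import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

def mean_pooling(model_output, attention_mask):
    # model_output[0] holds the token embeddings: (batch, seq_len, hidden)
    token_embeddings = model_output[0]
    # Broadcast the mask over the hidden dimension so padding tokens contribute zero
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    # Masked sum divided by the number of real tokens (clamped to avoid division by zero)
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

sentences = ['How is the weather today?', '今天天气怎么样?']

tokenizer = AutoTokenizer.from_pretrained('jinaai/jina-embeddings-v2-base-zh')
model = AutoModel.from_pretrained('jinaai/jina-embeddings-v2-base-zh', trust_remote_code=True)

encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

# Inference only, so gradients are not needed
with torch.no_grad():
    model_output = model(**encoded_input)

embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
# L2-normalize so that dot products equal cosine similarities
embeddings = F.normalize(embeddings, p=2, dim=1)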
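
To see what the attention mask contributes, here is a toy version of the same masked average written with NumPy. The array values are made up for illustration. The third token stands in for padding and is given a large value so that any leak into the average would be obvious.

```python
import numpy as np

def mean_pool(token_embeddings, attention_mask):
    # Expand the mask to (batch, seq_len, 1) so it broadcasts over the hidden dimension.
    mask = attention_mask[..., None].astype(float)
    summed = (token_embeddings * mask).sum(axis=1)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)  # avoid division by zero
    return summed / counts

# One sentence, three token slots, hidden size 2; the last slot is padding.
token_embeddings = np.array([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]])
attention_mask = np.array([[1, 1, 0]])

pooled = mean_pool(token_embeddings, attention_mask)
print(pooled)  # [[2. 3.]]
```

Only the two real tokens are averaged. Without the mask, the padding slot would pull the result toward 100.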

You can use Jina Embedding models directly from the modelscope package:

!pip install modelscope
from modelscope import AutoModel
from numpy.linalg import norm

# Cosine similarity between two 1-D embedding vectors
cos_sim = lambda a,b: (a @ b.T) / (norm(a)*norm(b))
model = AutoModel.from_pretrained('jinaai/jina-embeddings-v2-base-zh', trust_remote_code=True) # trust_remote_code is needed to use the encode method
embeddings = model.encode(['How is the weather today?', '今天天气怎么样?'])
print(cos_sim(embeddings[0], embeddings[1]))
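
The `cos_sim` lambda above compares two single vectors. For retrieval you usually score one query against many documents at once. A minimal sketch of that batched form follows, using toy vectors in place of real `encode` output. The helper name `cosine_similarity_matrix` is illustrative.

```python
import numpy as np

def cosine_similarity_matrix(queries, documents):
    # Normalize each row to unit length, then a matrix product gives all pairwise cosines.
    q = np.asarray(queries, dtype=float)
    d = np.asarray(documents, dtype=float)
    q = q / np.linalg.norm(q, axis=1, keepdims=True)
    d = d / np.linalg.norm(d, axis=1, keepdims=True)
    return q @ d.T  # shape: (num_queries, num_documents)

# Toy 2-D vectors standing in for embeddings returned by model.encode
query_vecs = np.array([[1.0, 0.0]])
doc_vecs = np.array([[2.0, 0.0], [0.0, 3.0], [1.0, 1.0]])

scores = cosine_similarity_matrix(query_vecs, doc_vecs)
ranking = np.argsort(-scores[0])  # document indices, best match first
print(scores[0], ranking)
```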

If you only want to handle shorter sequences, such as 2k tokens, pass the max_length parameter to the encode function:

embeddings = model.encode(
    ['Very long ... document'],
    max_length=2048
)

If you want to use the model together with the sentence-transformers package, make sure that you have installed the latest release and set trust_remote_code=True as well.
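
For the sentence-transformers route, a minimal sketch could look like the following. It assumes a sentence-transformers release recent enough to accept `trust_remote_code` in the `SentenceTransformer` constructor. The helper names and the `max_seq_length` default of 1024 are illustrative choices, not values from the model card.

```python
import numpy as np

MODEL_NAME = "jinaai/jina-embeddings-v2-base-zh"

def cosine_similarity(a, b):
    # Cosine similarity between two 1-D embedding vectors.
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def embed_with_sentence_transformers(sentences, max_seq_length=1024):
    # Imported lazily so cosine_similarity works without the package installed.
    from sentence_transformers import SentenceTransformer

    # trust_remote_code=True lets the custom JinaBERT modeling code be loaded.
    model = SentenceTransformer(MODEL_NAME, trust_remote_code=True)
    # Assumption: cap the input length to limit memory use on long inputs.
    model.max_seq_length = max_seq_length
    return model.encode(sentences)
```

Calling `embed_with_sentence_transformers(['How is the weather today?', '今天天气怎么样?'])` downloads the model on first use. The two returned vectors can then be passed to `cosine_similarity`.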

Alternatives to Using the Transformers Package

Managed SaaS: Get started with a free key on Jina AI's Embedding API.
Private and high-performance deployment: Get started by picking from our suite of models and deploy them on AWS Sagemaker.
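
For the managed API option, a request could be assembled roughly as follows. This is a sketch under assumptions: the endpoint URL, the JSON field names, and the response layout are not stated in this document and should be checked against Jina AI's current API documentation. The function names are illustrative, and `YOUR_API_KEY` is a placeholder.

```python
import json
import urllib.request

# Assumption: endpoint and payload shape follow the common "embeddings" REST convention.
API_URL = "https://api.jina.ai/v1/embeddings"

def build_embedding_request(texts, api_key, model="jina-embeddings-v2-base-zh"):
    # Build (but do not send) a POST request carrying the texts to embed.
    payload = json.dumps({"model": model, "input": list(texts)}).encode("utf-8")
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {api_key}",
    }
    return urllib.request.Request(API_URL, data=payload, headers=headers, method="POST")

def fetch_embeddings(request):
    # Send the request; assumes the response lists one {"embedding": [...]} item per input.
    with urllib.request.urlopen(request) as response:
        body = json.loads(response.read().decode("utf-8"))
    return [item["embedding"] for item in body["data"]]

request = build_embedding_request(["How is the weather today?"], api_key="YOUR_API_KEY")
```

Keeping request construction separate from sending makes it easy to inspect the payload before any network call is made.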

Use Jina Embeddings for RAG

According to the latest blog post from LlamaIndex,

In summary, to achieve the peak performance in both hit rate and MRR, the combination of OpenAI or JinaAI-Base embeddings with the CohereRerank/bge-reranker-large reranker stands out.
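
The two metrics named in the quote can be computed as follows. This is a minimal sketch assuming each query has exactly one relevant document. Hit rate is the fraction of queries whose relevant document appears in the retrieved list, and MRR is the mean of 1/rank of that document (0 when it is missing). The document ids below are made up.

```python
def hit_rate_and_mrr(ranked_results, relevant_ids):
    # ranked_results: one list of retrieved doc ids per query, best first.
    # relevant_ids: the single relevant doc id for each query.
    hits = 0
    reciprocal_ranks = []
    for ranked, relevant in zip(ranked_results, relevant_ids):
        reciprocal_rank = 0.0
        for rank, doc_id in enumerate(ranked, start=1):
            if doc_id == relevant:
                reciprocal_rank = 1.0 / rank
                break
        if reciprocal_rank > 0:
            hits += 1
        reciprocal_ranks.append(reciprocal_rank)
    num_queries = len(ranked_results)
    return hits / num_queries, sum(reciprocal_ranks) / num_queries

ranked_results = [["a", "b", "c"], ["d", "e", "f"], ["g", "h", "i"]]
relevant_ids = ["a", "e", "z"]

hit_rate, mrr = hit_rate_and_mrr(ranked_results, relevant_ids)
print(hit_rate, mrr)  # two of three queries hit; MRR = (1 + 0.5 + 0) / 3 = 0.5
```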

Contact

Join our Discord community and chat with other community members about ideas.

Citation

If you find Jina Embeddings useful in your research, please cite the following paper:

@misc{günther2023jina,
      title={Jina Embeddings 2: 8192-Token General-Purpose Text Embeddings for Long Documents},
      author={Michael Günther and Jackmin Ong and Isabelle Mohr and Alaeddine Abdessalem and Tanguy Abel and Mohammad Kalim Akram and Susana Guzman and Georgios Mastrapas and Saba Sturua and Bo Wang and Maximilian Werk and Nan Wang and Han Xiao},
      year={2023},
      eprint={2310.19923},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
