Jina AI Text Embedding Model v2-base, Chinese-English Bilingual

Anonymous user · July 31, 2024

Technical Info

Repository
https://modelscope.cn/models/jinaai/jina-embeddings-v2-base-zh
License
apache-2.0

Details



Finetuner helps you to create experiments in order to improve embeddings on search tasks. It accompanies you to deliver the last mile of performance tuning for neural search applications.

The text embedding set trained by Jina AI.

Quick Start

The easiest way to start using jina-embeddings-v2-base-zh is with Jina AI's Embedding API.

Intended Usage & Model Info

jina-embeddings-v2-base-zh is a Chinese/English bilingual text embedding model supporting an 8192 sequence length. It is based on a BERT architecture (JinaBERT) that supports the symmetric bidirectional variant of ALiBi to allow longer sequence lengths. We have designed it for high performance in monolingual & cross-lingual applications and trained it specifically to support mixed Chinese-English input without bias. Additionally, we provide the following embedding models:

jina-embeddings-v2-base-zh is a bilingual Chinese/English text embedding model that supports encoding texts of up to 8192 tokens. The model is built on the BERT architecture (JinaBERT), an improvement on BERT that is the first to apply ALiBi to an encoder architecture to support longer sequences. Unlike previous monolingual/multilingual embedding models, we designed a bilingual model to better support both monolingual (Chinese-to-Chinese) and cross-lingual (Chinese-to-English) document retrieval. Beyond this, we also provide other embedding models:
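For intuition on the architecture note above: ALiBi replaces positional embeddings with a penalty on attention logits that grows linearly with token distance, which is why it extrapolates to sequence lengths unseen in training. The sketch below is an illustration of the symmetric bidirectional variant for a single attention head, not JinaBERT's actual implementation:

```python
import numpy as np

def symmetric_alibi_bias(seq_len: int, slope: float) -> np.ndarray:
    """Bias added to attention logits: -slope * |i - j|.

    Symmetric in i and j, so it suits a bidirectional encoder;
    it depends only on distance, so the same formula applies to
    any sequence length.
    """
    positions = np.arange(seq_len)
    distance = np.abs(positions[:, None] - positions[None, :])
    return -slope * distance

bias = symmetric_alibi_bias(4, slope=0.5)
print(bias)
```

Each head uses its own slope; tokens attend most strongly to nearby positions, and the penalty for distant ones grows without any learned position table.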

Data & Parameters

We will publish a report with technical details about the training of the bilingual models soon. The training of the English model is described in this technical report.

Usage

Please apply mean pooling when integrating the model.

### Why mean pooling?

`mean pooling` takes all token embeddings from the model output and averages them at the sentence/paragraph level. It has been proven to be the most effective way to produce high-quality sentence embeddings. We offer an `encode` function to deal with this. However, if you would like to do it without using the default `encode` function:


```python
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

def mean_pooling(model_output, attention_mask):
    # First element of model_output holds all token embeddings
    token_embeddings = model_output[0]
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

sentences = ['How is the weather today?', '今天天气怎么样?']

tokenizer = AutoTokenizer.from_pretrained('jinaai/jina-embeddings-v2-base-zh')
model = AutoModel.from_pretrained('jinaai/jina-embeddings-v2-base-zh', trust_remote_code=True)

encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

with torch.no_grad():
    model_output = model(**encoded_input)

embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
embeddings = F.normalize(embeddings, p=2, dim=1)
```
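As a sanity check on what masked mean pooling computes, the same operation can be reproduced with plain numpy on toy token embeddings (the vectors below are stand-ins for real model output, not actual Jina embeddings):

```python
import numpy as np

# Toy "token embeddings" for one sentence of 3 positions, dim 2,
# where the third position is padding (mask = 0).
token_embeddings = np.array([[[1.0, 2.0],
                              [3.0, 4.0],
                              [9.0, 9.0]]])  # padded position, must be ignored
attention_mask = np.array([[1, 1, 0]])

mask = attention_mask[..., None].astype(float)   # expand mask to embedding dim
summed = (token_embeddings * mask).sum(axis=1)   # sum over real tokens only
counts = np.clip(mask.sum(axis=1), 1e-9, None)   # avoid division by zero
sentence_embedding = summed / counts             # mean over real tokens

print(sentence_embedding)  # [[2. 3.]] — the padded [9, 9] vector is excluded
```

Without the mask, the padding vector would drag the average to [[4.33, 5.0]]; the mask is what makes the pooled vector depend only on real tokens.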

You can use Jina Embedding models directly from the modelscope package:

```python
!pip install modelscope
from modelscope import AutoModel
from numpy.linalg import norm

cos_sim = lambda a, b: (a @ b.T) / (norm(a) * norm(b))
model = AutoModel.from_pretrained('jinaai/jina-embeddings-v2-base-zh', trust_remote_code=True)  # trust_remote_code is needed to use the encode method
embeddings = model.encode(['How is the weather today?', '今天天气怎么样?'])
print(cos_sim(embeddings[0], embeddings[1]))
```

If you only want to handle shorter sequences, such as 2k, pass the max_length parameter to the encode function:

```python
embeddings = model.encode(
    ['Very long ... document'],
    max_length=2048
)
```
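When a document exceeds whatever max_length you choose, one common workaround (not prescribed by this model card) is to embed overlapping chunks and average the normalized chunk vectors. In the sketch below, `embed_chunk` is a deterministic stand-in for `model.encode`, used only so the example runs without downloading the model:

```python
import numpy as np

def chunk_text(text: str, chunk_size: int, overlap: int) -> list[str]:
    """Split text into overlapping character windows."""
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, max(len(text) - overlap, 1), step)]

def embed_chunk(chunk: str, dim: int = 8) -> np.ndarray:
    """Stand-in for model.encode: a deterministic pseudo-embedding."""
    rng = np.random.default_rng(abs(hash(chunk)) % (2**32))
    v = rng.normal(size=dim)
    return v / np.linalg.norm(v)

def embed_long_document(text: str, chunk_size: int = 2048, overlap: int = 128) -> np.ndarray:
    vectors = [embed_chunk(c) for c in chunk_text(text, chunk_size, overlap)]
    mean = np.mean(vectors, axis=0)
    return mean / np.linalg.norm(mean)  # re-normalize the averaged vector

doc_vector = embed_long_document('Very long document ' * 500)
print(doc_vector.shape)  # (8,)
```

Averaging chunk vectors blurs fine-grained detail, so for retrieval it is often better to index the chunk vectors individually instead of collapsing them into one document vector.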

If you want to use the model together with the sentence-transformers package, make sure that you have installed the latest release and set trust_remote_code=True as well:

```python
!pip install -U sentence-transformers
from numpy.linalg import norm
from sentence_transformers import SentenceTransformer

cos_sim = lambda a, b: (a @ b.T) / (norm(a) * norm(b))
model = SentenceTransformer('jinaai/jina-embeddings-v2-base-zh', trust_remote_code=True)  # trust_remote_code is needed to load the custom architecture
embeddings = model.encode(['How is the weather today?', '今天天气怎么样?'])
print(cos_sim(embeddings[0], embeddings[1]))
```

Alternatives to Using the Transformers Package

  1. Managed SaaS: Get started with a free key on Jina AI's Embedding API.
  2. Private and high-performance deployment: Get started by picking from our suite of models and deploy them on AWS SageMaker.

Use Jina Embeddings for RAG

According to the latest blog post from LlamaIndex:

> In summary, to achieve the peak performance in both hit rate and MRR, the combination of OpenAI or JinaAI-Base embeddings with the CohereRerank/bge-reranker-large reranker stands out.
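In a RAG pipeline, the embedding model drives first-stage retrieval before any reranker is applied. The sketch below shows that stage with toy unit vectors standing in for real Jina embeddings (the numbers are illustrative, not model output):

```python
import numpy as np

def top_k(query_vec: np.ndarray, doc_vecs: np.ndarray, k: int = 2) -> list[int]:
    """Rank documents by cosine similarity to the query."""
    sims = doc_vecs @ query_vec / (
        np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(query_vec))
    return np.argsort(-sims)[:k].tolist()

# Toy stand-ins for embeddings of one query and three documents.
query = np.array([1.0, 0.0, 0.0])
docs = np.array([[0.9, 0.1, 0.0],   # doc 0: very close to the query
                 [0.0, 1.0, 0.0],   # doc 1: orthogonal
                 [0.7, 0.7, 0.0]])  # doc 2: partially similar

print(top_k(query, docs))  # [0, 2]
```

A reranker such as bge-reranker-large would then rescore only these top candidates, which is the embeddings-plus-reranker combination the quoted benchmark refers to.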

Contact

Join our Discord community and chat with other community members about ideas.

Citation

If you find Jina Embeddings useful in your research, please cite the following paper:

```bibtex
@misc{günther2023jina,
      title={Jina Embeddings 2: 8192-Token General-Purpose Text Embeddings for Long Documents},
      author={Michael Günther and Jackmin Ong and Isabelle Mohr and Alaeddine Abdessalem and Tanguy Abel and Mohammad Kalim Akram and Susana Guzman and Georgios Mastrapas and Saba Sturua and Bo Wang and Maximilian Werk and Nan Wang and Han Xiao},
      year={2023},
      eprint={2310.19923},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```
