Jina AI Text Embedding Model v2-base, German-English Bilingual

Anonymous user, 2024-07-31

Technical Information

Open-Source Address
https://modelscope.cn/models/jinaai/jina-embeddings-v2-base-de
License
apache-2.0

Details



Jina AI logo: Jina AI is your Portal to Multimodal AI

The text embedding set trained by Jina AI.

Quick Start

The easiest way to start using jina-embeddings-v2-base-de is Jina AI's Embedding API.
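As a minimal sketch of calling the Embedding API from the standard library (the endpoint URL and request schema here are assumptions based on Jina AI's public API; check the official docs for the current format):

```python
import json
import urllib.request

API_URL = "https://api.jina.ai/v1/embeddings"  # assumed endpoint

def build_request(texts, model="jina-embeddings-v2-base-de", api_key="YOUR_API_KEY"):
    """Build an HTTP request for the embedding endpoint (schema assumed)."""
    payload = {"model": model, "input": texts}
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
    )

req = build_request(["How is the weather today?", "Wie ist das Wetter heute?"])
# Sending the request requires a real API key:
# with urllib.request.urlopen(req) as resp:
#     embeddings = json.load(resp)["data"]
```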

Intended Usage & Model Info

jina-embeddings-v2-base-de is a German/English bilingual text embedding model supporting a sequence length of 8192 tokens. It is based on a BERT architecture (JinaBERT) that supports the symmetric bidirectional variant of ALiBi to allow longer sequence lengths. We have designed it for high performance in monolingual & cross-lingual applications and trained it specifically to support mixed German-English input without bias. Additionally, we provide the following embedding models:
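The symmetric bidirectional ALiBi variant mentioned above replaces positional embeddings with a linear penalty added to attention scores; because the penalty depends only on the distance |i − j|, it extrapolates to sequences longer than those seen in training. A minimal numpy sketch (the slope schedule follows the original ALiBi recipe; the exact integration into JinaBERT is not shown here):

```python
import numpy as np

def alibi_slopes(n_heads):
    """Geometric slope schedule: 2^(-8/n), 2^(-16/n), ... for n attention heads."""
    start = 2.0 ** (-8.0 / n_heads)
    return np.array([start ** (i + 1) for i in range(n_heads)])

def symmetric_alibi_bias(seq_len, n_heads):
    """Per-head bias -slope * |i - j|; symmetric, so attention stays bidirectional."""
    pos = np.arange(seq_len)
    dist = np.abs(pos[:, None] - pos[None, :])           # (seq_len, seq_len)
    return -alibi_slopes(n_heads)[:, None, None] * dist  # (n_heads, seq_len, seq_len)

bias = symmetric_alibi_bias(seq_len=4, n_heads=8)
```

The bias matrix is added to the raw attention logits before softmax; no learned position table means no hard cap at a trained maximum length.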


Data & Parameters

We will publish a report with technical details about the training of the bilingual models soon. The training of the English model is described in this technical report.

Usage

Please apply mean pooling when integrating the model.

Why mean pooling?

Mean pooling takes all token embeddings from the model output and averages them at the sentence/paragraph level. It has proven to be the most effective way to produce high-quality sentence embeddings. We offer an `encode` function to handle this. However, if you would like to do it without using the default `encode` function:

import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

def mean_pooling(model_output, attention_mask):
    # Average token embeddings, ignoring padding positions via the attention mask
    token_embeddings = model_output[0]
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

sentences = ['How is the weather today?', 'What is the current weather like today?']

tokenizer = AutoTokenizer.from_pretrained('jinaai/jina-embeddings-v2-base-de')
model = AutoModel.from_pretrained('jinaai/jina-embeddings-v2-base-de', trust_remote_code=True)

encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

with torch.no_grad():
    model_output = model(**encoded_input)

embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
embeddings = F.normalize(embeddings, p=2, dim=1)
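The final normalization step scales each embedding to unit L2 norm, so downstream cosine similarity reduces to a plain dot product. A small numpy illustration (the vectors are made up):

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale rows to unit L2 norm, mirroring F.normalize(embeddings, p=2, dim=1)."""
    return x / np.maximum(np.linalg.norm(x, axis=axis, keepdims=True), eps)

a = l2_normalize(np.array([[3.0, 4.0]]))[0]  # unit vector [0.6, 0.8]
b = l2_normalize(np.array([[4.0, 3.0]]))[0]
cosine = float(a @ b)  # dot product of unit vectors == cosine similarity
```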

You can use Jina Embedding models directly via the modelscope package:

!pip install modelscope
from modelscope import AutoModel
from numpy.linalg import norm

cos_sim = lambda a, b: (a @ b.T) / (norm(a) * norm(b))
model = AutoModel.from_pretrained('jinaai/jina-embeddings-v2-base-de', trust_remote_code=True)  # trust_remote_code is needed to use the encode method
embeddings = model.encode(['How is the weather today?', 'Wie ist das Wetter heute?'])
print(cos_sim(embeddings[0], embeddings[1]))

If you only want to handle shorter sequences, such as 2k tokens, pass the max_length parameter to the encode function:

embeddings = model.encode(
    ['Very long ... document'],
    max_length=2048
)

Alternatives to Using the Transformers Package

  1. Managed SaaS: Get started with a free key on Jina AI's Embedding API.
  2. Private and high-performance deployment: Get started by picking from our suite of models and deploying them on AWS SageMaker.

Benchmark Results

We evaluated our bilingual model on all German and English evaluation tasks available on the MTEB benchmark. In addition, we evaluated the model against several other German, English, and multilingual models on additional German evaluation tasks:

Use Jina Embeddings for RAG

According to the latest blog post from LlamaIndex,

In summary, to achieve peak performance in both hit rate and MRR, the combination of OpenAI or JinaAI-Base embeddings with the CohereRerank/bge-reranker-large reranker stands out.
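The retrieve-then-rerank pattern the quote refers to is a two-stage pipeline: embed the corpus once, fetch the top-k candidates by cosine similarity, then hand only those candidates to a slower, more accurate reranker. A minimal numpy sketch with made-up vectors; the `rerank` stub is hypothetical and stands in for a real cross-encoder such as bge-reranker-large:

```python
import numpy as np

def top_k_by_cosine(query_vec, corpus_vecs, k=2):
    """Stage 1: cheap dense retrieval over precomputed embeddings."""
    q = query_vec / np.linalg.norm(query_vec)
    c = corpus_vecs / np.linalg.norm(corpus_vecs, axis=1, keepdims=True)
    scores = c @ q
    idx = np.argsort(-scores)[:k]
    return idx, scores[idx]

def rerank(query, candidates):
    """Stage 2 stub: a real system would score (query, candidate) pairs
    with a cross-encoder here instead of this placeholder ordering."""
    return sorted(candidates, key=len)

corpus = np.array([[1.0, 0.0], [0.7, 0.7], [0.0, 1.0]])  # pretend document embeddings
idx, scores = top_k_by_cosine(np.array([1.0, 0.1]), corpus, k=2)
```

Only the k retrieved candidates pass through the expensive second stage, which is what makes the combination practical at corpus scale.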

Contact

Join our Discord community and chat with other community members about ideas.

Citation

If you find Jina Embeddings useful in your research, please cite the following paper:

@misc{günther2023jina,
      title={Jina Embeddings 2: 8192-Token General-Purpose Text Embeddings for Long Documents},
      author={Michael Günther and Jackmin Ong and Isabelle Mohr and Alaeddine Abdessalem and Tanguy Abel and Mohammad Kalim Akram and Susana Guzman and Georgios Mastrapas and Saba Sturua and Bo Wang and Maximilian Werk and Nan Wang and Han Xiao},
      year={2023},
      eprint={2310.19923},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

