plant-nucleotide-transformer

Posted by an anonymous user on July 31, 2024
Categories: ai, esm, Pytorch, genomics, biology, DNA
Source: https://modelscope.cn/models/zhangtaolab/plant-nucleotide-transformer
License: CC-BY-NC-SA-4.0

Project details

Plant foundation DNA large language models

The plant DNA large language models (LLMs) are a series of foundation models built on different model architectures and pre-trained on various plant reference genomes.
All the models are of comparable size, between 90 MB and 150 MB. Each uses a BPE tokenizer with a vocabulary of 8000 tokens.
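
To illustrate how byte-pair encoding (BPE) can build variable-length tokens from DNA, here is a minimal toy sketch. It is not the tokenizer shipped with these models: the helper names (`merge_pair`, `learn_bpe`, `apply_bpe`) and the tiny corpus are hypothetical, and only three merges are learned here, whereas the real vocabulary has 8000 tokens.

```python
from collections import Counter

def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` with the concatenated symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged

def learn_bpe(sequences, num_merges):
    """Learn up to `num_merges` merge rules from a list of DNA strings."""
    corpus = [list(seq) for seq in sequences]  # start from single nucleotides
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols in corpus:
            pair_counts.update(zip(symbols, symbols[1:]))
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)  # most frequent adjacent pair
        merges.append(best)
        corpus = [merge_pair(symbols, best) for symbols in corpus]
    return merges

def apply_bpe(sequence, merges):
    """Tokenize a DNA string by replaying the learned merges in order."""
    symbols = list(sequence)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return symbols

# Hypothetical toy corpus, for illustration only
toy_corpus = ["ATATATCG", "ATCGATCG", "GGATATCC"]
merges = learn_bpe(toy_corpus, num_merges=3)
tokens = apply_bpe("ATATCG", merges)
print(merges)
print(tokens)
```

Because frequent nucleotide pairs are merged first, common motifs end up as single tokens while rare stretches stay split into shorter pieces.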

Developed by: zhangtaolab

Model Sources

  • Repository: Plant DNA LLMs
  • Manuscript: Versatile applications of foundation DNA large language models in plant genomes

Architecture

The model is based on the InstaDeepAI/nucleotide-transformer-v2-100m-multi-species model, with a modified tokenizer that replaces k-mer tokenization with BPE.
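
For contrast with BPE, the following sketch shows what fixed-length k-mer tokenization looks like. It is a simplified illustration rather than the original Nucleotide Transformer tokenizer: the function name `kmer_tokenize`, the default `k=6`, and the fallback to single characters for short or non-ACGT windows are assumptions made for this example.

```python
def kmer_tokenize(sequence, k=6):
    """Split a DNA string into non-overlapping k-mers, left to right.

    Assumption for this sketch: a window that is shorter than k or that
    contains a character other than A, C, G, T is emitted one character
    at a time instead of as a k-mer.
    """
    tokens, i = [], 0
    while i < len(sequence):
        window = sequence[i:i + k]
        if len(window) == k and set(window) <= set("ACGT"):
            tokens.append(window)
            i += k
        else:
            tokens.append(sequence[i])
            i += 1
    return tokens

kmer_tokens = kmer_tokenize("GGGTATCGCTTCCGAC")
print(kmer_tokens)
print(kmer_tokenize("ATATACGGCCGNC"))
```

Token boundaries here depend only on position, so a one-base shift changes every k-mer; BPE tokens instead follow frequently occurring subsequences learned from the training genomes.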

How to use

Install the runtime libraries first (the inference example also imports torch):

pip install transformers torch

Here is a simple inference example:

from transformers import AutoModelForMaskedLM, AutoTokenizer
import torch

model_name = 'plant-nucleotide-transformer'
# load model and tokenizer
model = AutoModelForMaskedLM.from_pretrained(f'zhangtaolab/{model_name}', trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(f'zhangtaolab/{model_name}', trust_remote_code=True)

# example sequence and tokenization
sequences = ['ATATACGGCCGNC','GGGTATCGCTTCCGAC']
tokens = tokenizer(sequences,padding="longest")['input_ids']
print(f"Tokenized sequence: {tokenizer.batch_decode(tokens)}")

# inference
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
model.to(device)
inputs = tokenizer(sequences, truncation=True, padding='max_length', max_length=512, 
                   return_tensors="pt")
inputs = {k: v.to(device) for k, v in inputs.items()}
with torch.no_grad():  # inference only, so skip gradient tracking
    outs = model(
        **inputs,
        output_hidden_states=True
    )

# get the final layer embeddings and prediction logits
# move tensors to the CPU before converting, since .numpy() fails on CUDA tensors
embeddings = outs['hidden_states'][-1].detach().cpu().numpy()
logits = outs['logits'].detach().cpu().numpy()
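
The embeddings above have one vector per token, and with padding='max_length' many of those positions are padding. One common way to obtain a single vector per sequence is mask-aware mean pooling. This is a sketch of that idea, not a step described by the model authors; the function name `mean_pool` and the stand-in arrays are hypothetical.

```python
import numpy as np

def mean_pool(hidden, attention_mask):
    """Average token embeddings over non-padding positions.

    hidden:         array of shape (batch, seq_len, dim)
    attention_mask: array of shape (batch, seq_len), 1 for real tokens, 0 for padding
    """
    mask = attention_mask[..., None].astype(hidden.dtype)
    summed = (hidden * mask).sum(axis=1)
    counts = np.clip(mask.sum(axis=1), 1.0, None)  # guard against all-padding rows
    return summed / counts

# Stand-in arrays with the same layout as the model outputs (values are made up)
dummy_hidden = np.arange(2 * 4 * 3, dtype=np.float32).reshape(2, 4, 3)
dummy_mask = np.array([[1, 1, 0, 0], [1, 1, 1, 1]])
pooled = mean_pool(dummy_hidden, dummy_mask)
print(pooled.shape)
```

With the variables from the inference example, this would be called as mean_pool(embeddings, inputs['attention_mask'].cpu().numpy()).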

Training data

The model was pre-trained with the masked language modeling (MaskedLM) objective; tokenized sequences have a maximum length of 512 tokens.
The detailed training procedure can be found in our manuscript.
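
As a rough illustration of the masked language modeling objective, the sketch below prepares inputs and labels for one tokenized sequence. The text above states only the objective and the 512-token limit. The 15% selection rate and the 80/10/10 replacement split are the standard BERT recipe, assumed here rather than taken from the manuscript, and `mask_tokens`, `mask_id=4`, and the made-up token IDs are hypothetical. Special tokens are not excluded from masking in this simplified version.

```python
import random

def mask_tokens(token_ids, mask_id, vocab_size, mask_prob=0.15, seed=0):
    """Return (inputs, labels) for masked language modeling.

    Labels are -100 at unselected positions (the value ignored by
    PyTorch's cross-entropy loss by default) and the original token ID
    at selected positions.
    """
    rng = random.Random(seed)
    inputs = list(token_ids)
    labels = [-100] * len(token_ids)
    for i, tok in enumerate(token_ids):
        if rng.random() >= mask_prob:
            continue  # position not selected for prediction
        labels[i] = tok
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = mask_id  # replace with the mask token
        elif roll < 0.9:
            inputs[i] = rng.randrange(vocab_size)  # replace with a random token
        # otherwise keep the original token unchanged
    return inputs, labels

MAX_LENGTH = 512
token_ids = list(range(10, 610))[:MAX_LENGTH]  # made-up IDs, truncated to 512
masked_inputs, labels = mask_tokens(token_ids, mask_id=4, vocab_size=8000)
print(sum(label != -100 for label in labels), "positions selected for prediction")
```

The model sees masked_inputs and is trained to recover the original token at each position where the label is not -100.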

Hardware

The model was pre-trained on an NVIDIA RTX 4090 GPU (24 GB).
