LucaProt

Anonymous user, July 31, 2024

Technical Information

Official website
https://github.com/alibaba/LucaProt
Open-source repository
https://modelscope.cn/models/alibabacloudlucaprot/LucaProt
License
Apache License 2.0

Project Details

LucaProt

LucaProt (DeepProtFunc) is an open source project developed by Alibaba and licensed under the Apache License (Version 2.0).

This product contains various third-party components under other open source licenses. See the NOTICE file for more information.

Introduction

LucaProt: A novel deep learning framework that incorporates protein amino acid sequence and structure information to predict protein function.

1. Model

1) Model Introduction

We developed a new deep learning model, namely, Deep Sequential and Structural Information Fusion Network for Proteins Function Prediction (DeepProtFunc/LucaProt), which takes into account protein sequence composition and structure to facilitate the accurate annotation of protein function.

Here, we applied LucaProt to identify viral RNA-dependent RNA polymerase (RdRP).

2) Model Architecture

We treat protein function prediction as a classification problem. For example, viral RdRP identification is a binary-class classification task, and protein general function annotation is a multi-label classification task. The model includes five modules: Input, Tokenizer, Encoder, Pooling, and Output. Its architecture is shown in Figure 1.

Figure 1: The Architecture of LucaProt

3) Model Input/Output

The model takes the amino acid letter sequence as its input. It outputs the function label of the input protein, which is a single tag (binary-class classification or multi-class classification) or a set of tags (multi-label classification).
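
To make the three output shapes concrete, here is a minimal sketch of how raw model scores could be turned into a single tag or a set of tags. It is not taken from the LucaProt code: the function name `decode_output` and the use of raw logits as input are assumptions made for illustration, and the 0.5 threshold mirrors the `--threshold` default described in the Inference section.

```python
import math
from typing import List, Sequence, Union


def sigmoid(x: float) -> float:
    """Logistic function mapping a raw score to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


def decode_output(
    logits: Sequence[float], task_type: str, threshold: float = 0.5
) -> Union[int, List[int]]:
    """Hypothetical helper: turn raw scores into label(s) for one protein."""
    if task_type == "binary_class":
        # One score -> probability -> 0 (negative) or 1 (positive).
        return int(sigmoid(logits[0]) >= threshold)
    if task_type == "multi_label":
        # Every label whose probability reaches the threshold is kept.
        return [i for i, z in enumerate(logits) if sigmoid(z) >= threshold]
    if task_type == "multi_class":
        # No threshold: the highest-scoring class wins.
        return max(range(len(logits)), key=lambda i: logits[i])
    raise ValueError(f"unknown task_type: {task_type}")


if __name__ == "__main__":
    print(decode_output([2.0], "binary_class"))
    print(decode_output([-1.0, 3.0, 0.2], "multi_label"))
    print(decode_output([0.1, 2.5, -1.0], "multi_class"))
```

The binary and multi-label cases share the sigmoid threshold, while the multi-class case ignores it, which matches the way the threshold parameter is documented for the prediction scripts.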

2. Dependencies

System: Ubuntu 20.04.5 LTS
Python: 3.9.13
Download anaconda: anaconda
Cuda: cuda11.7 (torch==1.13.1)

# Select 'YES' during installation for initializing the conda environment
sh Anaconda3-2022.10-Linux-x86_64.sh
# Source the environment
source ~/.bashrc
# Verification
conda
# Install env and python 3.9.13
conda create -n lucaprot python=3.9.13
# activate env
conda activate lucaprot
# Install git
sudo apt-get update
sudo apt install git-all

# Enter the project
cd LucaProt

# Install
pip install -r requirements.txt -i https://pypi.tuna.tsinghua.edu.cn/simple

3. Inference

1) Prediction from one sample

cd LucaProt/src/prediction/
sh run_predict_one_sample.sh

Note: the embedding matrix of the sample is predicted in real time.

Or:

cd LucaProt/src/

export CUDA_VISIBLE_DEVICES=0

python predict_one_sample.py \
    --protein_id protein_1 \
    --sequence MTTSTAFTGKTLMITGGTGSFGNTVLKHFVHTDLAEIRIFSRDEKKQDDMRHRLQEKSPELADKVRFFIGDVRNLQSVRDAMHGVDYIFHAAALKQVPSCEFFPMEAVRTNVLGTDNVLHAAIDEGVDRVVCLSTDKAAYPINAMGKSKAMMESIIYANARNGAGRTTICCTRYGNVMCSRGSVIPLFIDRIRKGEPLTVTDPNMTRFLMNLDEAVDLVQFAFEHANPGDLFIQKAPASTIGDLAEAVQEVFGRVGTQVIGTRHGEKLYETLMTCEERLRAEDMGDYFRVACDSRDLNYDKFVVNGEVTTMADEAYTSHNTSRLDVAGTVEKIKTAEYVQLALEGREYEAVQ    \
    --emb_dir ./emb/ \
    --truncation_seq_length 4096 \
    --dataset_name rdrp_40_extend \
    --dataset_type protein \
    --task_type binary_class \
    --model_type sefn \
    --time_str 20230201140320 \
    --step 100000 \
    --threshold 0.5

Parameters:

  • --protein_id
    str, the protein id.

  • --sequence
    str, the protein sequence.

  • --truncation_seq_length
    int, truncate sequences longer than the given value. Recommended values: 4096, 2048, 1984, 1792, 1534, 1280, 1152, 1024, default: 4096.

  • --emb_dir (optional)
    path, the directory in which the predicted protein embedding matrix or vector is saved during prediction.

  • --dataset_name
    str, the dataset name used to build our trained model (rdrp_40_extend).

  • --dataset_type
    str, the dataset type used to build our trained model (protein).

  • --task_type
    str, the task name used to build our trained model (binary_class).

  • --model_type
    str, the model name used to build our trained model (sefn).

  • --time_str
    str, the running time string (yyyymmddHimiss) of the run that built our trained model (20230201140320).

  • --step
    int, the training global step of model finalization (100000).

  • --threshold
    float, sigmoid threshold for binary-class or multi-label classification, None for multi-class classification, default: 0.5.

2) Prediction from many samples

The samples are stored in a *.fasta file and are predicted one sample at a time.

  • --fasta_file
    str, the FASTA file of the samples.

  • --save_file
    str, file path; the predicted results are saved into this file.

  • --print_per_number
    int, print progress information each time this number of samples has been completed, default: 100.

cd LucaProt/src/prediction/
sh run_predict_many_samples.sh

Or:

cd LucaProt/src/

export CUDA_VISIBLE_DEVICES=0

python predict_many_samples.py \
    --fasta_file ../data/rdrp/test/test.fasta \
    --save_file ../result/rdrp/test/test_result.csv \
    --emb_dir ../emb/ \
    --truncation_seq_length 4096 \
    --dataset_name rdrp_40_extend \
    --dataset_type protein \
    --task_type binary_class \
    --model_type sefn \
    --time_str 20230201140320 \
    --step 100000 \
    --threshold 0.5 \
    --print_per_number 10
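
As an illustration of what sample-by-sample processing of a FASTA file with periodic progress messages can look like, here is a minimal, dependency-free sketch. It is not the repository's reader: `read_fasta` and `iter_with_progress` are hypothetical helpers, and taking the protein id as the first whitespace-separated word of the header line is an assumption.

```python
from typing import Iterable, Iterator, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (protein_id, sequence) pairs from FASTA-formatted lines."""
    protein_id, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if protein_id is not None:
                yield protein_id, "".join(chunks)
            header = line[1:].split()
            # Assumption: the id is the first word of the header line.
            protein_id, chunks = (header[0] if header else ""), []
        else:
            chunks.append(line)
    if protein_id is not None:
        yield protein_id, "".join(chunks)


def iter_with_progress(records, print_per_number: int = 100):
    """Pass records through, printing a message every N completed samples."""
    for done, record in enumerate(records, start=1):
        yield record
        if done % print_per_number == 0:
            print(f"{done} samples completed")


if __name__ == "__main__":
    demo = [">p1 demo", "MTTSTA", "FTGKTL", ">p2", "VGGLFDYYSVPIMT"]
    for pid, seq in iter_with_progress(read_fasta(demo), print_per_number=1):
        print(pid, len(seq))
```

Because both helpers are generators, only one sequence is held in memory at a time, which is the property that matters for large FASTA files.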

3) Prediction from the file

The test data (small and real) is in demo.csv, where the 7th column of each line is the filename of the structural embedding information prepared in advance. The structural embedding files are stored in embs.

The test data includes 50 viral RdRPs and 50 non-viral RdRPs.

cd LucaProt/src/prediction/
sh run_predict_from_file.sh

Or:

cd LucaProt/src/

export CUDA_VISIBLE_DEVICES=0

python predict.py \
    --data_path ../data/rdrp/demo/demo.csv \
    --emb_dir ../data/rdrp/demo/embs/esm2_t36_3B_UR50D \
    --dataset_name rdrp_40_extend \
    --dataset_type protein \
    --task_type binary_class \
    --model_type sefn \
    --time_str 20230201140320 \
    --step 100000 \
    --evaluate \
    --threshold 0.5 \
    --batch_size 16 \
    --print_per_batch 2

Parameters:

  • --data_path
    path, the file path of the prediction data, with the 9 columns described in the dataset section (prot_id, seq, seq_len, pdb_filename, ptm, mean_plddt, emb_filename, label, and source). The value of the label column can be null.

  • --emb_dir
    path, the directory that holds the structural embedding information of all samples, prepared in advance.

  • --dataset_name
    str, the dataset name used to build our trained model (rdrp_40_extend).

  • --dataset_type
    str, the dataset type used to build our trained model (protein).

  • --task_type
    str, the task name used to build our trained model (binary_class).

  • --model_type
    str, the model name used to build our trained model (sefn).

  • --time_str
    str, the running time string (yyyymmddHimiss) of the run that built our trained model (20230201140320).

  • --step
    int, the training global step of model finalization (100000).

  • --threshold
    float, sigmoid threshold for binary-class or multi-label classification, None for multi-class classification, default: 0.5.

  • --evaluate (optional)
    store_true, whether to evaluate the predicted results.

  • --ground_truth_col_index (optional)
    int, the ground truth column index of ${data_path}, default: None.

  • --batch_size
    int, batch size per GPU/CPU for evaluation, default: 16.

  • --print_per_batch
    int, how many batches are completed between two progress messages, default: 1000.

Note: the embedding matrices of all the proteins in this file need to be prepared in advance ($emb_dir).
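
Since prediction from a file depends on embeddings that already exist on disk, a quick pre-flight check can save a failed run. The sketch below is an illustration only: `missing_embeddings` is a hypothetical helper, and it assumes a comma-separated file with a header row in which the 7th column holds the embedding filename, as described for demo.csv.

```python
import csv
from pathlib import Path
from typing import List

EMB_FILENAME_COL = 6  # 7th column, counted from zero


def missing_embeddings(data_path: str, emb_dir: str, has_header: bool = True) -> List[str]:
    """Return embedding filenames listed in the CSV but absent from emb_dir."""
    missing = []
    with open(data_path, newline="") as handle:
        reader = csv.reader(handle)
        if has_header:
            next(reader, None)  # skip the header row
        for row in reader:
            if len(row) <= EMB_FILENAME_COL:
                continue  # malformed or empty line
            emb_filename = row[EMB_FILENAME_COL]
            if emb_filename and not (Path(emb_dir) / emb_filename).is_file():
                missing.append(emb_filename)
    return missing


if __name__ == "__main__":
    # Paths taken from the example command; adjust to your checkout.
    print(missing_embeddings("../data/rdrp/demo/demo.csv",
                             "../data/rdrp/demo/embs/esm2_t36_3B_UR50D"))
```

An empty list means every row has its embedding file in place.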

4. Inference Time

LucaProt is fast because it only needs to predict the structural representation matrix rather than the complete 3D structure of the protein sequence.

Benchmark: for each sequence length range, 50 viral RdRPs and 50 non-viral RdRPs were selected to measure the inference time.

Note: the reported time includes the inference time of the structural representation matrix and excludes the model loading time.

1) GPU (Nvidia A100, Cuda: 11.7)

| Protein Seq Len Range | Average Time | Maximum Time | Minimum Time |
|---|---|---|---|
| 300 <= Len < 500 | 0.20s | 0.24s | 0.16s |
| 500 <= Len < 800 | 0.30s | 0.39s | 0.24s |
| 800 <= Len < 1,000 | 0.42s | 0.46s | 0.39s |
| 1,000 <= Len < 1,500 | 0.59s | 0.74s | 0.45s |
| 1,500 <= Len < 2,000 | 0.87s | 1.02s | 0.73s |
| 2,000 <= Len < 3,000 | 1.31s | 1.69s | 1.01s |
| 3,000 <= Len < 5,000 | 2.14s | 2.78s | 1.72s |
| 5,000 <= Len < 8,000 | 3.03s | 3.45s | 2.65s |
| 8,000 <= Len < 10,000 | 3.77s | 4.24s | 3.32s |
| 10,000 <= Len | 9.92s | 17.66s | 4.30s |

2) CPU (16 cores, 64G memory of Alibaba Cloud ECS)

| Protein Seq Len Range | Average Time | Maximum Time | Minimum Time |
|---|---|---|---|
| 300 <= Len < 500 | 3.97s | 5.71s | 2.77s |
| 500 <= Len < 800 | 5.78s | 7.50s | 4.48s |
| 800 <= Len < 1,000 | 8.23s | 9.41s | 7.41s |
| 1,000 <= Len < 1,500 | 11.49s | 16.42s | 9.22s |
| 1,500 <= Len < 2,000 | 17.71s | 22.36s | 14.93s |
| 2,000 <= Len < 3,000 | 26.97s | 36.68s | 20.99s |
| 3,000 <= Len < 5,000 | 45.56s | 58.42s | 35.82s |
| 5,000 <= Len < 8,000 | 56.57s | 58.17s | 55.55s |
| 8,000 <= Len < 10,000 | 57.76s | 58.86s | 56.66s |
| 10,000 <= Len | 66.49s | 76.80s | 58.42s |

3) CPU (96 cores, 768G memory of Alibaba Cloud ECS)

| Protein Seq Len Range | Average Time | Maximum Time | Minimum Time |
|---|---|---|---|
| 300 <= Len < 500 | 1.89s | 2.55s | 1.10s |
| 500 <= Len < 800 | 2.68s | 3.44s | 2.13s |
| 800 <= Len < 1,000 | 3.45s | 4.25s | 2.65s |
| 1,000 <= Len < 1,500 | 4.27s | 5.90s | 3.54s |
| 1,500 <= Len < 2,000 | 5.81s | 7.44s | 4.76s |
| 2,000 <= Len < 3,000 | 8.14s | 10.74s | 6.37s |
| 3,000 <= Len < 5,000 | 13.25s | 17.69s | 10.06s |
| 5,000 <= Len < 8,000 | 17.03s | 18.20s | 15.98s |
| 8,000 <= Len < 10,000 | 17.90s | 18.99s | 16.92s |
| 10,000 <= Len | 25.90s | 35.02s | 18.66s |
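
To read the three tables side by side, the ratio of CPU to GPU average time can be computed per length range. The sketch below copies a few average values from the tables above; the helper names are hypothetical and the selection of rows is arbitrary.

```python
# Average inference times in seconds for a few length ranges, copied from
# the tables above: (GPU A100, CPU 16 cores, CPU 96 cores).
AVERAGE_SECONDS = {
    "300 <= Len < 500": (0.20, 3.97, 1.89),
    "1,000 <= Len < 1,500": (0.59, 11.49, 4.27),
    "3,000 <= Len < 5,000": (2.14, 45.56, 13.25),
    "10,000 <= Len": (9.92, 66.49, 25.90),
}


def speedup(slow_seconds: float, fast_seconds: float) -> float:
    """How many times faster the second setup is than the first."""
    return slow_seconds / fast_seconds


def gpu_speedups(table):
    """Map each length range to (speedup vs 16-core CPU, speedup vs 96-core CPU)."""
    return {
        length_range: (speedup(cpu16, gpu), speedup(cpu96, gpu))
        for length_range, (gpu, cpu16, cpu96) in table.items()
    }


if __name__ == "__main__":
    for length_range, (vs16, vs96) in gpu_speedups(AVERAGE_SECONDS).items():
        print(f"{length_range}: {vs16:.1f}x vs 16 cores, {vs96:.1f}x vs 96 cores")
```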

5. Dataset for Virus RdRP

1) Fasta

  • Viral RdRP (Positive: 5,979)

    The positive sequence fasta file is in data/rdrp/all_dataset_positive.fasta.zip
    all_dataset_positive.fasta.zip

  • Non-viral RdRP (Negative: 229,434)

    The negative sequence fasta file is in dataset/rdrp/all_dataset_negative.fasta.zip
    including:

    • other proteins of the virus
    • other protein domains of the virus
    • non-viral proteins

    all_dataset_negative.fasta.zip

2) Structural embedding (matrix and vector)

All structural embedding files of the dataset for model building are available at: embs
The structural embedding files of the prediction data are still being prepared for release (because of the amount of data).

3) PDB (3D Structure)

All 3D-structure PDB files of the model building dataset and of the predicted data are still being prepared for release (because of the amount of data).

4) Vocab

  • structure vocab
    This vocab file is struct_vocab/rdrp_40_extend/protein/binary_class/struct_vocab.txt
    struct_vocab.txt

  • subword-level vocab
    The size of the sequence vocab we use is 20,000.
    This vocab file is vocab/rdrp_40_extend/protein/binary_class/subword_vocab_20000.txt
    subword_vocab_20000.txt

  • char-level vocab
    This vocab file is vocab/rdrp_40_extend/protein/binary_class/vocab.txt
    vocab.txt

5) Label

Viral RdRP identification is a binary-class classification task with a positive and a negative class, using 0 and 1 to represent a negative and a positive sample, respectively. The label list file is dataset/rdrp_40_extend/protein/binary_class/label.txt
label.txt

6) Dataset

We constructed a data set with 235,413 samples for model building, which included 5,979 positive samples of known viral RdRPs (i.e. the well-curated RdRP database described in the previous section of Methods), and 229,434 negative samples of confirmed non-virus RdRPs (to maintain an approximately 1:40 ratio of viral RdRPs to non-virus RdRPs). The non-virus RdRPs contained proteins from Eukaryota DNA dependent RNA polymerase (Eu DdRP, N=1,184), Eukaryota RNA dependent RNA polymerase (Eu RdRP, N=2,233), Reverse Transcriptase (RT, N=48,490), proteins obtained from DNA viruses (N=1,533), non-RdRP proteins obtained from RNA viruses (N=1,574), and a wide array of cellular proteins from different functional categories (N=174,420). We randomly divided the dataset into training, validation, and testing sets with a ratio of 8.5:1:1, which were used for model fitting, model finalization (based on the best F1-score training iteration), and performance reporting (including accuracy, precision, recall, F1-score, and Area under the ROC Curve (AUC)), respectively.
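
The counts and the split ratio in the paragraph above can be checked with a few lines of code. This is a minimal sketch, not the script used to build the released sets: `split_dataset`, the fixed seed, and the rounding rule are assumptions, so the set sizes it produces need not match the released files exactly.

```python
import random
from typing import List, Sequence, Tuple

# Negative-sample counts per source, copied from the paragraph above.
NEGATIVE_SOURCES = {
    "Eu DdRP": 1184,
    "Eu RdRP": 2233,
    "RT": 48490,
    "DNA viruses": 1533,
    "non-RdRP proteins from RNA viruses": 1574,
    "cellular proteins": 174420,
}
POSITIVE_COUNT = 5979


def split_dataset(
    samples: Sequence, ratios: Tuple[float, float, float] = (8.5, 1.0, 1.0), seed: int = 1234
) -> Tuple[List, List, List]:
    """Shuffle the samples and cut them into train/dev/test by the given ratio."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # seed is an arbitrary choice
    total = sum(ratios)
    n_train = round(len(shuffled) * ratios[0] / total)
    n_dev = round(len(shuffled) * ratios[1] / total)
    train = shuffled[:n_train]
    dev = shuffled[n_train:n_train + n_dev]
    test = shuffled[n_train + n_dev:]  # the remainder goes to the test set
    return train, dev, test


if __name__ == "__main__":
    negatives = sum(NEGATIVE_SOURCES.values())
    print(negatives, POSITIVE_COUNT + negatives)
    train, dev, test = split_dataset(range(POSITIVE_COUNT + negatives))
    print(len(train), len(dev), len(test))
```

The per-source negative counts sum to 229,434, and together with the 5,979 positives they give the stated total of 235,413 samples.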

  • Entire Dataset
    This file is dataset/rdrp/all_dataset_with_pdb_emb.csv.zip
    all_dataset_with_pdb_emb.csv.zip

  • Training set
    Copy this file to dataset/rdrp_40_extend/protein/binary_class/train_with_pdb_emb.csv
    train_with_pdb_emb.csv

  • Validation set
    Copy this file to dataset/rdrp_40_extend/protein/binary_class/dev_with_pdb_emb.csv
    dev_with_pdb_emb.csv

  • Testing set
    Copy this file to dataset/rdrp_40_extend/protein/binary_class/test_with_pdb_emb.csv
    test_with_pdb_emb.csv

One row in all the above files represents one sample. All three files consist of 9 columns: prot_id, seq, seq_len, pdb_filename, ptm, mean_plddt, emb_filename, label, and source. The details of these columns are as follows:

  • prot_id
    the protein id
  • seq
    the amino acid (aa) sequence
  • seq_len
    the length of the protein sequence
  • pdb_filename
    the filename of the PDB file of the 3D structure, which is either predicted by a computational model or obtained by experiments
  • ptm
    the pTM of the predicted 3D structure
  • mean_plddt
    the mean pLDDT of the predicted 3D structure
  • emb_filename
    the filename of the embedding matrix or vector of the protein structure
    Note: the embedding matrices of the dataset need to be prepared in advance.
  • label
    the sample label: 0 or 1 for binary-class classification, one of [0, 1, …, N-1] for multi-class classification, a list of values from [0, 1, …, N-1] for multi-label classification
  • source
    optional, the sample source (such as RdRP, RT, DdRP, non-virus RdRP, and Other)

Note: if using strategy one in the structure encoder, pdb_filename, ptm, and mean_plddt can be null.
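
The nine-column layout can be read into typed records as sketched below. This is an illustration only: the `Sample` dataclass and the helper names are hypothetical, empty fields are mapped to None, and the label is kept as a raw string because its shape depends on the task type.

```python
import csv
import io
from dataclasses import dataclass
from typing import List, Optional

COLUMNS = ["prot_id", "seq", "seq_len", "pdb_filename", "ptm",
           "mean_plddt", "emb_filename", "label", "source"]


@dataclass
class Sample:
    prot_id: str
    seq: str
    seq_len: int
    pdb_filename: Optional[str]
    ptm: Optional[float]
    mean_plddt: Optional[float]
    emb_filename: Optional[str]
    label: str  # raw text; interpretation depends on the task type
    source: Optional[str]


def _optional(value: str, cast=str):
    """Convert a field, treating the empty string as a null value."""
    return cast(value) if value != "" else None


def parse_row(row: List[str]) -> Sample:
    if len(row) != len(COLUMNS):
        raise ValueError(f"expected {len(COLUMNS)} columns, got {len(row)}")
    f = dict(zip(COLUMNS, row))
    return Sample(
        prot_id=f["prot_id"],
        seq=f["seq"],
        seq_len=int(f["seq_len"]),
        pdb_filename=_optional(f["pdb_filename"]),
        ptm=_optional(f["ptm"], float),
        mean_plddt=_optional(f["mean_plddt"], float),
        emb_filename=_optional(f["emb_filename"]),
        label=f["label"],
        source=_optional(f["source"]),
    )


def read_samples(text: str, has_header: bool = True) -> List[Sample]:
    reader = csv.reader(io.StringIO(text))
    if has_header:
        next(reader, None)
    return [parse_row(row) for row in reader if row]
```

A row with empty pdb_filename, ptm, and mean_plddt fields parses cleanly, which corresponds to the case in which only the structural embedding is used.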

6. Supported Task Types

  • binary-class classification
    The label is 0 or 1 for binary-class classification, such as viral RdRP identification.

  • multi-class classification
    The label is one value from 0~N-1 for multi-class classification, such as species prediction for proteins.

  • multi-label classification
    The labels form a list of values from 0~N-1 for multi-label classification, such as Gene Ontology annotation for proteins.
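
The three label formats can be illustrated with a small encoder that turns a raw label into a training target. This is a sketch under assumptions, not the repository's code: `encode_label` is a hypothetical helper, and the multi-hot vector for the multi-label case is one common encoding rather than a confirmed detail of LucaProt.

```python
from typing import List, Sequence, Union


def encode_label(
    label: Union[int, Sequence[int]], task_type: str, num_labels: int
) -> Union[int, List[int]]:
    """Validate a raw label and convert it into a training target."""
    if task_type == "binary_class":
        if label not in (0, 1):
            raise ValueError("binary-class label must be 0 or 1")
        return int(label)
    if task_type == "multi_class":
        if not 0 <= label < num_labels:
            raise ValueError("multi-class label must be in 0..N-1")
        return int(label)
    if task_type == "multi_label":
        # Multi-hot vector: position i is 1 when label i is present.
        target = [0] * num_labels
        for index in label:
            if not 0 <= index < num_labels:
                raise ValueError("multi-label index must be in 0..N-1")
            target[index] = 1
        return target
    raise ValueError(f"unknown task_type: {task_type}")


if __name__ == "__main__":
    print(encode_label(1, "binary_class", 2))
    print(encode_label(3, "multi_class", 5))
    print(encode_label([0, 2], "multi_label", 4))
```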

7. Building Your Model

1) Prediction of protein 3D-structure (Optional)

The script structure_from_esm_v1.py is in the directory "src/protein_structure", and it uses ESMFold (esmfold_v1) to predict the 3D structure of a protein.

I. Prediction from file

cd LucaProt/src/protein_structure/

export CUDA_VISIBLE_DEVICES=0

python structure_from_esm_v1.py \
    -i data/rdrp/rdrp.fasta \
    -o pdbs/rdrp/ \
    --num-recycles 4 \
    --truncation_seq_length 4096 \
    --chunk-size 64 \
    --cpu-offload \
    --batch_size 1

Parameters:

  • -i (input filepaths)

    • fasta filepath
    • csv filepath
      the first row is the header
      column 0: protein_id
      column 1: sequence
    • multiple filepaths
      concatenated with commas
  • -o (save dirpath)
    The directory in which the predicted 3D-structure data is saved. Each protein is stored in a PDB file, and each PDB file is named "protein_" + an auto-increment id + ".pdb", such as "protein_1.pdb".
    The mapping between protein ids and auto-increment ids is stored in the file "result_info.csv" (including: "index", "protein_id(uuid)", "seq_len", "ptm", "mean_plddt") in this directory.
    For failed samples (CUDA out of memory), this script saves their protein ids in "uncompleted.txt"; you can reduce the value of "truncation_seq_length" and add "--try_failure" to retry.

  • --batch_size
    the batch size for running, default: 1.

  • --truncation_seq_length
    truncate sequences longer than the given value, recommended values: 4096, 2048, 1984, 1792, 1536, 1280, 1152, 1022.

  • --num-recycles
    number of recycles to run.

  • --chunk-size
    chunks the axial attention computation to reduce memory usage from O(L^2) to O(L), recommended values: 128, 64, 32.

  • --try_failure
    retry the failed samples after reducing the "truncation_seq_length" value.

II. Prediction from input sequences

cd LucaProt/src/protein_structure/

export CUDA_VISIBLE_DEVICES=0

python structure_from_esm_v1.py \
    -name protein_id1,protein_id2 \
    -seq VGGLFDYYSVPIMT,LPDSWENKLLTDLILFAGSFVGSDTCGKLF \
    -o pdbs/rdrp/ \
    --num-recycles 4 \
    --truncation_seq_length 4096 \
    --chunk-size 64 \
    --cpu-offload \
    --batch_size 1

Parameters:

  • -name
    protein ids, concatenated with commas for multiple proteins.
  • -seq
    protein sequences, concatenated with commas for multiple proteins.

2) Prediction of protein structural embedding

The script embedding_from_esmfold.py is in "src/protein_structure", and it uses ESMFold (esm2_t36_3B_UR50D) to predict protein structural embedding matrices or vectors.

I. Prediction from file

cd LucaProt/src/protein_structure/

export CUDA_VISIBLE_DEVICES=0

python embedding_from_esmfold.py \
    --model_name esm2_t36_3B_UR50D \
    --file data/rdrp.fasta \
    --output_dir emb/rdrp/ \
    --include per_tok contacts bos \
    --truncation_seq_length 4094

Parameters:

  • --model_name
    the model name, default: "esm2_t36_3B_UR50D"

  • -i/--file (input filepath)

    • fasta filepath
    • csv filepath
      the first row is the header
      column 0: protein_id
      column 1: sequence
  • -o/--output_dir (save dirpath)
    The directory in which the predicted structural embedding data is saved. Each protein is stored in a pickle file, and each embedding file is named "embedding_" + auto-increment id + ".pt", such as "embedding_1.pt". The mapping between protein ids and auto-increment ids is stored in the file "{}_embed_fasta_id_2_idx.csv" (including: "index", "protein_id(uuid)") in this directory. For failed samples (CUDA out of memory), this script saves their protein ids in "{}_embed_uncompleted.txt"; you can reduce the "truncation_seq_length" value and add "--try_failure" to retry.

  • --truncation_seq_length
    truncate sequences longer than the given value. Recommended values: 4094, 2046, 1982, 1790, 1534, 1278, 1150, 1022.

  • --include
    The embedding matrix or vector types of the predicted structural embedding data, including per_tok, mean, contacts, and bos.

    • per_tok includes the full sequence, with an embedding per amino acid (seq_len x hidden_dim).
    • mean includes the embeddings averaged over the full sequence, per layer.
    • bos includes the embeddings from the beginning-of-sequence token.
    • contacts includes the attention value between two amino acids of the full sequence.

    Reference: https://github.com/facebookresearch/esm [Compute embeddings in bulk from FASTA]
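
The relationship between the per-token matrix and the mean vector can be shown with plain Python lists. This is a toy sketch with a hidden dimension of 2 rather than 2560; `mean_pool` is a hypothetical helper and does not read the .pt files produced by the script.

```python
from typing import List


def mean_pool(per_tok: List[List[float]]) -> List[float]:
    """Average a (seq_len x hidden_dim) matrix over the sequence axis."""
    if not per_tok:
        raise ValueError("empty embedding matrix")
    seq_len = len(per_tok)
    hidden_dim = len(per_tok[0])
    return [sum(row[j] for row in per_tok) / seq_len for j in range(hidden_dim)]


if __name__ == "__main__":
    # Three amino acids, hidden_dim = 2 (toy values).
    per_tok = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
    print(mean_pool(per_tok))
```

The per-token matrix keeps one row per amino acid, so its size grows with the sequence length, while the mean vector always has hidden_dim entries.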

II. Prediction from input sequences

cd LucaProt/src/protein_structure/

export CUDA_VISIBLE_DEVICES=0

python embedding_from_esmfold.py \
    --model_name esm2_t36_3B_UR50D \
    -name protein_id1,protein_id2 \
    -seq VGGLFDYYSVPIMT,LPDSWENKLLTDLILFAGSFVGSDTCGKLF \
    --output_dir embs/rdrp/test/ \
    --include per_tok contacts bos \
    --truncation_seq_length 4094

Parameters:

  • -name
    protein ids, concatenated with commas for multiple proteins.

  • -seq
    protein sequences, concatenated with commas for multiple proteins.

3) Construct dataset for model building

Construct your dataset, randomly divide it into training, validation, and testing sets with a specified ratio, and save the three sets in dataset/${dataset_name}/${dataset_type}/${task_type}, including train_*.csv, dev_*.csv, test_*.csv.

The file format can be .csv (must include the header) or .txt (does not need a header).

Each file line is a sample containing 9 columns: prot_id, seq, seq_len, pdb_filename, ptm, mean_plddt, emb_filename, label, and source.

Column seq is the sequence. Column pdb_filename is the saved PDB filename for structure encoder strategy 2. Columns ptm and mean_plddt are optional and are obtained from the 3D-structure prediction model. Column emb_filename is the saved embedding filename for structure encoder strategy 1. Column label is the sample class (a single value or a list of label indices or label names). Column source is the sample source (optional).

For example:

like_YP_009351861.1_Menghai_flavivirus,MEQNG...,3416,,,,embedding_21449.pt,1,rdrp

Note: if your dataset takes too much space to load into memory at once, use "src/data_process/data_preprocess_into_tfrecords_for_rdrp.py" to convert the dataset into "tfrecords", and create an index file: python -m tfrecord.tools.tfrecord2idx xxxx.tfrecords xxxx.index

4) Training the model

  • run.py
    the main script for building the model.

  • Parameters

    • data_dir: path, the dataset dirpath
    • filename_pattern: the dataset filename pattern, such as "{}_with_pdb_emb.csv", including train_with_pdb_emb.csv, dev_with_pdb_emb.csv, and test_with_pdb_emb.csv in ${data_dir}
    • separate_file: store_true, load the entire dataset into memory; the names of the pdb and embedding files are listed in the train/dev/test.csv, and these files need to be loaded.
    • tfrecords: store_true, whether the dataset is in tfrecords format. When true, only the specified number of samples (${shuffle_queue_size}) is loaded into memory at once. The tfrecords must consist of "${data_dir}/tfrecords/train/xxx.tfrecords", "${data_dir}/tfrecords/dev/xxx.tfrecords" and "${data_dir}/tfrecords/test/xxx.tfrecords". "xxx.tfrecords" is one of 01-of-01.tfrecords (only including sequence), 01-of-01_emb.records (including sequence and structural embedding), and 01-of-01_pdb_emb.records (including sequence, 3D-structure contact map, and structural embedding).
    • shuffle_queue_size: int, how many samples are loaded into memory at once, default: 5000.
    • dataset_name: str, your dataset name, such as "rdrp_40_extend"
    • dataset_type: str, your dataset type, such as "protein"
    • task_type: choices=["multi_label", "multi_class", "binary_class"], your task type, such as "binary_class"
    • model_type: choices=["sequence", "structure", "embedding", "sefn", "ssfn"], which take as input only the sequence, only the 3D-structure contact map, only the structural embedding, the sequence and the structural embedding, and the sequence and the 3D-structure contact map, respectively
    • subword: store_true, whether to process the sequence at the subword level.
    • codes_file: path, subword codes filepath when using subword, such as "../subword/rdrp/protein_codes_rdrp_20000.txt"
    • label_type: str, the label type name, such as "rdrp"
    • label_filepath: path, the label list filepath
    • cmap_type: choices=["C_alpha", "C_bert"], the calculation type of the 3D-structure contact map
    • cmap_thresh: the distance threshold (Unit: Angstrom) in contact map calculation. Two amino acids are linked if the distance between them is less than or equal to the threshold, default: 10.0.
    • output_dir: path, the output dirpath
    • log_dir: path, the logger save path
    • tb_log_dir: path, the save path of metric evaluation records in model training; tensorboardX can be used to show these metrics.
    • config_path: path, the configuration filepath of the model.
    • seq_vocab_path: path, the vocab filepath of the sequence tokenizer
    • struct_vocab_path: path, the vocab filepath of the 3D-structure node (Structural Encoder Strategy 2)
    • seq_pooling_type: choices=["none", "max", "value_attention"], the sequence representation matrix pooling type; "none" means that the <CLS> vector is used.
    • struct_pooling_type: choices=["max", "value_attention"], the 3D-structure representation matrix pooling type.
    • embedding_pooling_type: choices=["none", "max", "value_attention"], the structural embedding representation matrix pooling type; "none" means that the <CLS> vector is used.
    • evaluate_during_training: store_true, whether to evaluate the validation set and the testing set during training.
    • do_eval: store_true, whether to use the best saved model to evaluate the validation set.
    • do_predict: store_true, whether to use the best saved model to evaluate the testing set.
    • do_lower_case: store_true, whether to lowercase the input when tokenizing.
    • per_gpu_train_batch_size: int, batch size per GPU/CPU for training, default: 16
    • per_gpu_eval_batch_size: int, batch size per GPU/CPU for evaluation, default: 16
    • gradient_accumulation_steps: int, number of update steps to accumulate before performing a backward/update pass, default: 1.
    • learning_rate: float, the initial learning rate for Adam, default: 1e-4.
    • num_train_epochs: int, the total number of training epochs to perform, default: 50.
    • logging_steps: log every X update steps, default: 1000.
    • loss_type: choices=["focal_loss", "bce", "multilabel_cce", "asl", "cce"], loss-function type of model training, default: "bce".
    • max_metric_type: choices=["acc", "jaccard", "prec", "recall", "f1", "fmax", "roc_auc", "pr_auc"], which metric is used for model finalization, default: "f1".
    • pos_weight: float, positive samples weight for "bce".
    • focal_loss_alpha: float, alpha for focal loss, default: 0.7.
    • focal_loss_gamma: float, gamma for focal loss, default: 2.0.
    • focal_loss_reduce: store_true, "mean" for one sample in multi-label classification, default: "sum".
    • asl_gamma_neg: float, negative gamma for asl, default: 4.0.
    • asl_gamma_pos: float, positive gamma for asl, default: 1.0.
    • seq_max_length: int, input sequences longer than the max length are truncated, shorter ones are padded, default: 2048.
    • struct_max_length: int, input contact maps longer than the max length are truncated, shorter ones are padded, default: 2048.
    • trunc_type: choices=["left", "right"], the truncation type for the whole input sequence, default: "right".
    • no_position_embeddings: store_true, whether to skip the position embedding for the sequence.
    • no_token_type_embeddings: store_true, whether to skip the token type embedding for the sequence.
    • embedding_input_size: int, the dim of the structural embedding vector/matrix, default: 2560, {"esm2_t30_150M_UR50D": 640, "esm2_t33_650M_UR50D": 1280, "esm2_t36_3B_UR50D": 2560, "esm2_t48_15B_UR50D": 5120}.
    • embedding_type: choices=[None, "contacts", "bos", "matrix"], the type of the structural embedding info, default: "matrix".
    • embedding_max_length: int, input embedding matrices longer than the max length are truncated, shorter ones are padded, default: 2048.
    • save_all: store_true, the model for each evaluation is saved.
    • delete_old: store_true, only save the best metric (${max_metric_type}) model of all evaluations on the testing set during training.
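
The cmap_thresh rule in the parameter list (two amino acids are linked when their distance is at most the threshold) can be sketched as a binary contact map. This is an illustration under assumptions: `contact_map` is a hypothetical helper, the input is taken to be one 3D coordinate per residue (for example the C-alpha atom), and the code in "src/biotoolbox" may differ in details such as how self-contacts are handled.

```python
import math
from typing import List, Sequence, Tuple

Coordinate = Tuple[float, float, float]


def contact_map(coords: Sequence[Coordinate], cmap_thresh: float = 10.0) -> List[List[int]]:
    """Binary L x L matrix: 1 where two residues are within cmap_thresh Angstrom."""
    n = len(coords)
    cmap = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            # Euclidean distance between the two residue coordinates.
            if math.dist(coords[i], coords[j]) <= cmap_thresh:
                cmap[i][j] = cmap[j][i] = 1  # the map is symmetric
    return cmap


if __name__ == "__main__":
    residues = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0), (0.0, 0.0, 10.0), (0.0, 0.0, 25.0)]
    for row in contact_map(residues):
        print(row)
```

The double loop is quadratic in the sequence length, which is acceptable for a sketch but would normally be replaced by a vectorized distance computation for long proteins.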
  • Training

    ```shell
    #!/bin/bash

    export CUDA_VISIBLE_DEVICES=0

    DATASET_NAME="rdrp_40_extend"
    DATASET_TYPE="protein"
    TASK_TYPE="binary_class"

    # sequence + structural embedding
    MODEL_TYPE="sefn"
    CONFIG_NAME="sefn_config.json"
    INPUT_MODE="single"
    LABEL_TYPE="rdrp"
    embedding_input_size=2560
    embedding_type="matrix"
    SEQ_MAX_LENGTH="2048"
    embedding_max_length="2048"
    TRUNCT_TYPE="right"

    # none, max, value_attention
    SEQ_POOLING_TYPE="value_attention"

    # max, value_attention
    embedding_pooling_type="value_attention"
    VOCAB_NAME="subword_vocab_20000.txt"
    SUBWORD_CODES_NAME="protein_codes_rdrp_20000.txt"
    MAX_METRIC_TYPE="f1"
    time_str=$(date "+%Y%m%d%H%M%S")

    python run.py \
      --data_dir ../dataset/$DATASET_NAME/$DATASET_TYPE/$TASK_TYPE \
      --tfrecords \
      --filename_pattern {}_with_pdb_emb.csv \
      --dataset_name $DATASET_NAME \
      --dataset_type $DATASET_TYPE \
      --task_type $TASK_TYPE \
      --model_type $MODEL_TYPE \
      --subword \
      --codes_file ../subword/$DATASET_NAME/$DATASET_TYPE/$TASK_TYPE/$SUBWORD_CODES_NAME \
      --input_mode $INPUT_MODE \
      --label_type $LABEL_TYPE \
      --label_filepath ../dataset/$DATASET_NAME/$DATASET_TYPE/$TASK_TYPE/label.txt \
      --output_dir ../models/$DATASET_NAME/$DATASET_TYPE/$TASK_TYPE/$MODEL_TYPE/$time_str \
      --log_dir ../logs/$DATASET_NAME/$DATASET_TYPE/$TASK_TYPE/$MODEL_TYPE/$time_str \
      --tb_log_dir ../tb-logs/$DATASET_NAME/$DATASET_TYPE/$TASK_TYPE/$MODEL_TYPE/$time_str \
      --config_path ../config/$DATASET_NAME/$DATASET_TYPE/$TASK_TYPE/$CONFIG_NAME \
      --seq_vocab_path ../vocab/$DATASET_NAME/$DATASET_TYPE/$TASK_TYPE/$VOCAB_NAME \
      --seq_pooling_type $SEQ_POOLING_TYPE \
      --embedding_pooling_type $embedding_pooling_type \
      --do_train \
      --do_eval \
      --do_predict \
      --evaluate_during_training \
      --per_gpu_train_batch_size=16 \
      --per_gpu_eval_batch_size=16 \
      --gradient_accumulation_steps=1 \
      --learning_rate=1e-4 \
      --num_train_epochs=50 \
      --logging_steps=1000 \
      --save_steps=1000 \
      --overwrite_output_dir \
      --sigmoid \
      --loss_type bce \
      --max_metric_type $MAX_METRIC_TYPE \
      --seq_max_length=$SEQ_MAX_LENGTH \
      --embedding_max_length=$embedding_max_length \
      --trunc_type=$TRUNCT_TYPE \
      --no_token_type_embeddings \
      --embedding_input_size $embedding_input_size \
      --embedding_type $embedding_type \
      --shuffle_queue_size 10000 \
      --save_all
    ```

  • Configuration file
    The configuration files of all methods are in "config/rdrp_40_extend/protein/binary_class/". If training your own model, please put the configuration file in "config/${dataset_name}/${dataset_type}/${task_type}/"

  • Meaning of the values in the configuration file
    see "src/SSFN/README.md"

  • Baselines

    • LGBM (using the embedding vector: <CLS> as the input)
      cd src/baselines/
      sh run_lgbm.sh

    • XGBoost (using the embedding vector: <CLS> as the input)
      cd src/baselines/
      sh run_xgb.sh

    • DNN (using the embedding vector: <CLS> as the input)
      cd src/baselines/
      sh run_dnn.sh

    Or:

      cd src/training
      sh run_subword_rdrp_emb.sh

    • Transformer-Char Level (using the sequence as the input)
      cd src/training
      sh run_char_rdrp_seq.sh

    • Transformer-Subword Level (using the sequence as the input)
      cd src/training
      sh run_subword_rdrp_seq.sh

    • DNN2 (VALP + DNN, using the embedding matrix as the input)
      cd src/training
      sh run_subword_rdrp_emb_v2.sh

  • Ours

    • Ours (the sequence + the 3D-structure)
      coming soon…

    • Ours (the sequence + the embedding matrix)

      cd src/training
      sh run_subword_rdrp_sefn.sh

5) Training Logging Information

logs

The running information is saved in "logs/${dataset_name}/${dataset_type}/${task_type}/${model_type}/${time_str}/logs.txt".

The information includes the model configuration, model layers, running parameters, and evaluation information.

models

The checkpoints are saved in "models/${dataset_name}/${dataset_type}/${task_type}/${model_type}/${time_str}/checkpoint-${global_step}/". This directory includes "pytorch_model.bin", "config.json", "training_args.bin", and the tokenizer information "sequence" or "struct". The details are shown in Figure 2.

Figure 2: The File List in Checkpoint Dir Path

tb-logs

The metrics are recorded in "tb-logs/${dataset_name}/${dataset_type}/${task_type}/${model_type}/${time_str}/events.out.tfevents.xxxxx.xxxxx"

run: tensorboard --logdir=tb-logs/${dataset_name}/${dataset_type}/${task_type}/${model_type}/${time_str} --bind_all

predicts

The predicted results are saved in "predicts/${dataset_name}/${dataset_type}/${task_type}/${model_type}/${time_str}/checkpoint-${global_step}", including:

  • pred_confusion_matrix.png
  • pred_metrics.txt
  • pred_result.csv
  • seq_length_distribution.png

The details are shown in Figure 3.

Figure 3: The File List in Prediction Dir Path

Note: when using the saved model to predict, the "logs.txt" and the checkpoint dirpath will be used.
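
All four output locations share the same dataset/type/task/model/time suffix, so they can be derived from one set of arguments. The sketch below only builds the path strings described above; `run_paths` is a hypothetical helper, and the example values are the ones used by the prediction commands in this document.

```python
from pathlib import PurePosixPath
from typing import Dict


def run_paths(
    dataset_name: str,
    dataset_type: str,
    task_type: str,
    model_type: str,
    time_str: str,
    global_step: int,
) -> Dict[str, str]:
    """Build the logs/models/tb-logs/predicts locations of one training run."""
    suffix = PurePosixPath(dataset_name, dataset_type, task_type, model_type, time_str)
    checkpoint = f"checkpoint-{global_step}"
    return {
        "logs": str(PurePosixPath("logs") / suffix / "logs.txt"),
        "models": str(PurePosixPath("models") / suffix / checkpoint),
        "tb-logs": str(PurePosixPath("tb-logs") / suffix),
        "predicts": str(PurePosixPath("predicts") / suffix / checkpoint),
    }


if __name__ == "__main__":
    paths = run_paths("rdrp_40_extend", "protein", "binary_class",
                      "sefn", "20230201140320", 100000)
    for name, path in paths.items():
        print(f"{name}: {path}")
```

This also shows why the prediction scripts ask for --time_str and --step: together with the dataset, task, and model names they identify the checkpoint directory and the matching "logs.txt".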

8. Related to the Project

1) ClstrSearch

A conventional approach that clusters all proteins based on their sequence homology.

See ClstrSearch/README.md for details.

2) src

Construct RdRP Dataset for Model Building

*.py in "src/data_preprocess"

Model

*.py in "src/SSFN"

Prediction Shell Script

*.sh in "src/prediction"
including:

  • run_predict_from_file.sh
    run prediction for many samples from a file, with the structural embedding information prepared in advance.

  • run_predict_one_sample.sh
    run prediction for one sample from the input.

  • run_predict_many_samples.sh
    run prediction for many samples from the input.

We perform ablation studies on our model by removing specific modules (sequence-specific and embedding-specific) one at a time to explore their relative importance.

  • run_predict_only_seq_from_file.sh
    uses only the sequence to predict and calculate metrics on three positive testing datasets, three negative testing datasets, and our checked RdRPs predicted from SRA.

  • run_predict_only_emb_from_file.sh
    uses only the structural embedding to predict and calculate metrics on three positive testing datasets, three negative testing datasets, and our checked RdRPs predicted from SRA.

  • run_predict_seq_emb_from_file.sh
    uses the sequential info and the structural embedding to predict and calculate metrics on three positive testing datasets, three negative testing datasets, and our checked RdRPs predicted from SRA.

Baselies

*.py i "src/baselies", usig the embeddig vector as the iput, icludig:

  • DNN
  • LGBM
  • XGBoost

Baselies for Deep Learig

*.py i "src/deep_baselies", icludig:

CHEER: HierarCHical taxoomic classificatio for viral mEtagEomic data via deep learig(2021). code: CHEER

VirHuter: A Deep Learig-Based Method for Detectio of Novel RNA Viruses i Plat Sequecig Data(2022). code: VirHuter

Virtifier: a deep learig-based idetifier for viral sequeces from metageomes(2022). code: Virtifier

RNN-VirSeeker: RNN-VirSeeker: A Deep Learig Method for Idetificatio of Short Viral Sequeces From Metageomes. code: RNN-VirSeeker

  • run_deep_baselines.sh
    the script to train deep baseline models.

  • run_predict_deep_baselines.sh
    use trained deep baseline models to predict on three positive test datasets, three negative test datasets, and our checked RdRP datasets.

  • run.py
    the main script for training deep baseline models.

  • statistics
    the script to compute the accuracy on three kinds of test datasets (positive, negative, our checked) after prediction by deep baselines.
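
As a rough illustration of what such a per-dataset accuracy summary might compute, here is a minimal Python sketch. It is not the repository's statistics script. The function names, the dictionary layout, and the toy predictions are all hypothetical, and it assumes that every sample in a positive or checked set should be predicted as 1 and every sample in a negative set as 0.

```python
from typing import Dict, List


def accuracy(predicted: List[int], expected_label: int) -> float:
    """Fraction of predictions equal to the label the whole set should have."""
    if not predicted:
        raise ValueError("empty prediction list")
    hits = sum(1 for p in predicted if p == expected_label)
    return hits / len(predicted)


def summarize(predictions: Dict[str, List[int]],
              expected: Dict[str, int]) -> Dict[str, float]:
    """Accuracy per dataset kind (assumed keys: positive, negative, our_checked)."""
    return {name: accuracy(preds, expected[name])
            for name, preds in predictions.items()}


# Toy predictions, made up for illustration only (not results from the study).
toy_predictions = {
    "positive": [1, 1, 0, 1],
    "negative": [0, 0, 0, 1],
    "our_checked": [1, 1, 1, 1],
}
# Assumption: checked RdRPs are treated as positives (label 1).
expected_labels = {"positive": 1, "negative": 0, "our_checked": 1}

report = summarize(toy_predictions, expected_labels)
for kind, acc in report.items():
    print(f"{kind}: {acc:.2f}")
```

On a set where all samples share one true label, this accuracy equals recall for positive sets and specificity for negative sets.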

Contact Map Generator

*.py in "src/biotoolbox"

Loss & Metrics

*.py in "src/common"

Training Model

*.sh in "src/training"

Prediction of Model

*.sh in "src/prediction"

3) Data

Raw Data

The raw data is in "data/".

Dataset

The files of the dataset are in "dataset/${dataset_name}/${dataset_type}/${task_type}/".

4) Model Configuration

The configuration files of all methods are in "config/${dataset_name}/${dataset_type}/${task_type}/".
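
The two path templates above can be sketched in Python as follows. The helper names dataset_dir and config_dir are hypothetical and are not part of the repository. The example values rdrp_40_extend, protein, and binary_class are the ones that appear in the dataset copy paths elsewhere in this README.

```python
from pathlib import PurePosixPath


def dataset_dir(dataset_name: str, dataset_type: str, task_type: str,
                root: str = "dataset") -> PurePosixPath:
    """Build <root>/<dataset_name>/<dataset_type>/<task_type>."""
    return PurePosixPath(root) / dataset_name / dataset_type / task_type


def config_dir(dataset_name: str, dataset_type: str,
               task_type: str) -> PurePosixPath:
    """Same layout as dataset_dir, rooted at 'config' instead of 'dataset'."""
    return dataset_dir(dataset_name, dataset_type, task_type, root="config")


example_dataset = dataset_dir("rdrp_40_extend", "protein", "binary_class")
example_config = config_dir("rdrp_40_extend", "protein", "binary_class")
print(example_dataset)
print(example_config)
```

Keeping the dataset and configuration trees parallel means one (dataset_name, dataset_type, task_type) triple is enough to locate both.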

5) Pic

Some pictures are in "pics/".

6) Plot

The plotting scripts are in "src/plot".

7) Spider

The code and results of the Geo information Spider are in "src/geo_map".

9. Open Resource

The open resources of our study include six subdirectories: Known_RdRPs, Results, All_Contigs, All_Protein_Sequences, LucaProt, and Self_Sequencing_Reads.

LucaProt/ includes some resources related to LucaProt, including code, model building dataset, model testing datasets, and our trained model.

1) Code

As mentioned above.

2) Dataset

Model Building Dataset

  • sequential info
    train_with_pdb_emb.csv
    copy to LucaProt/dataset/rdrp_40_extend/protein/binary_class/

    dev_with_pdb_emb.csv
    copy to LucaProt/dataset/rdrp_40_extend/protein/binary_class/

    test_with_pdb_emb.csv
    copy to LucaProt/dataset/rdrp_40_extend/protein/binary_class/

  • structural info
    embs
    copy to LucaProt/dataset/rdrp_40_extend/protein/binary_class/embs/

  • tfrecords
    train
    copy to LucaProt/dataset/rdrp_40_extend/protein/binary_class/tfrecords/train/

    dev
    copy to LucaProt/dataset/rdrp_40_extend/protein/binary_class/tfrecords/dev/

    test
    copy to LucaProt/dataset/rdrp_40_extend/protein/binary_class/tfrecords/test/
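
After copying, it can help to confirm that everything landed where the copy instructions say. The following is a minimal Python sketch under stated assumptions: the helper missing_entries and the two constant lists are hypothetical (they are not part of LucaProt), and the file and directory names are taken from the copy targets listed above.

```python
from pathlib import Path
from typing import List

# Names taken from the copy instructions above.
EXPECTED_FILES = [
    "train_with_pdb_emb.csv",
    "dev_with_pdb_emb.csv",
    "test_with_pdb_emb.csv",
]
EXPECTED_DIRS = [
    "embs",
    "tfrecords/train",
    "tfrecords/dev",
    "tfrecords/test",
]


def missing_entries(repo_root: Path) -> List[str]:
    """Return the expected files/directories that are absent under repo_root."""
    base = (Path(repo_root) / "dataset" / "rdrp_40_extend"
            / "protein" / "binary_class")
    missing = []
    for name in EXPECTED_FILES:
        if not (base / name).is_file():
            missing.append(name)
    for name in EXPECTED_DIRS:
        if not (base / name).is_dir():
            missing.append(name + "/")
    return missing


if __name__ == "__main__":
    # Assumes the repository was cloned into ./LucaProt.
    for entry in missing_entries(Path("LucaProt")):
        print("missing:", entry)
```

An empty result only means the paths exist. It does not validate the contents of the CSV files, embeddings, or tfrecords.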

Model Testing (Validation) Dataset

  • Three Positive Testing Datasets

    • sequential info
      Neri RdRP
      copy to LucaProt/data/rdrp
      Reference: Expansion of the global RNA virome reveals diverse clades of bacteriophages

      Zayed RdRP
      copy to LucaProt/data/rdrp
      Reference: Cryptic and abundant marine viruses at the evolutionary origins of Earth’s RNA virome

      Chen RdRP
      copy to LucaProt/data/rdrp
      Reference: RNA viromes from terrestrial sites across China expand environmental viral diversity

    • structural info
      Neri RdRP

      Zayed RdRP

      Chen RdRP

  • Three Negative Testing Datasets

Results

  • Our Checked RdRP Dataset (Our Results)

    • sequential info
      ours_checked_rdrp_final.csv

    • structural info
      embs

    • PDB
      All 3D-structure PDB files of our predicted results are in the process of being released.

Self-Samples

3) Trained Model

The trained model for RdRP identification is available at:

  • logs
    logs
    copy to LucaProt/logs/

  • models
    models
    copy to LucaProt/models/

10. Contributor

LucaTeam:
Yong He, Zhaorong Li, Xin Hou, Mang Shi

11. FTP

All data of LucaProt are available at the website: Open Resources

12. Citation

The pre-print version:

@article{lucaprot,
author = {Xin Hou and Yong He and Pan Fang and Shi-Qiang Mei and Zan Xu and Wei-Chen Wu and Jun-Hua Tian and Shun Zhang and Zhen-Yu Zeng and Qin-Yu Gou and Gen-Yang Xin and Shi-Jia Le and Yin-Yue Xia and Yu-Lan Zhou and Feng-Ming Hui and Yuan-Fei Pan and John-Sebastian Eden and Zhao-Hui Yang and Chong Han and Yue-Long Shu and Deyin Guo and Jun Li and Edward C Holmes and Zhao-Rong Li and Mang Shi},
title = {Artificial intelligence redefines RNA virus discovery},
elocation-id = {2023.04.18.537342},
year = {2023},
doi = {10.1101/2023.04.18.537342},
publisher = {Cold Spring Harbor Laboratory},
URL = {https://www.biorxiv.org/content/early/2023/04/18/2023.04.18.537342},
eprint = {https://www.biorxiv.org/content/early/2023/04/18/2023.04.18.537342.full.pdf},
journal = {bioRxiv}
}

13. Pip

name: lucaprot
channels:
  - defaults
dependencies:
  - pip:
        - h5py==3.8.0
        - biopython==1.80
        - biotite==0.35.0
        - brotlipy==0.7.0
        - numpy==1.24.2
        - obonet==0.3.1
        - pandas==1.5.3
        - pickle5==0.0.11
        - Pillow==9.3.0
        - scikit-learn==1.2.1
        - scipy==1.10.1
        - seaborn==0.12.2
        - six==1.16.0
        - subword-nmt==0.3.8
        - tensorboard==2.11.2
        - tensorboardX==2.5.1
        - tensorflow==2.11.0
        - tensorflow-estimator==2.11.0
        - tfrecord==1.14.1
        - tokenizers==0.13.2
        - torch==1.13.1
        - torchaudio==0.13.1
        - torchvision==0.14.1
        - tqdm==4.64.1
        - transformers==4.26.0
        - huggingface-hub==0.12.0
        - matplotlib==3.6.3
        - Werkzeug==2.2.2
        - wget==3.2
        - wrapt==1.14.1
        - xgboost==1.7.3
        - zipp==3.12.0
        - lightgbm==3.3.5
        - BeautifulSoup4==4.11.1
        - requests==2.24.0
        - gemmi==0.5.8
        - networkx==3.0
        - fair-esm[esmfold]
        - dllogger @ git+https://github.com/NVIDIA/dllogger.git
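
Once the environment is installed, a quick way to compare a few installed versions against the pins above is the standard-library importlib.metadata module. This is only a sketch: the helper names installed_version and check_pins are hypothetical, and the PINS dictionary is a small hand-picked subset of the list above, not the full set.

```python
from importlib import metadata
from typing import Dict, Optional

# A hand-picked subset of the pins listed above (illustrative, not exhaustive).
PINS = {
    "numpy": "1.24.2",
    "pandas": "1.5.3",
    "torch": "1.13.1",
    "transformers": "4.26.0",
}


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def check_pins(pins: Dict[str, str]) -> Dict[str, str]:
    """Return {package: message} for every pin that is not satisfied."""
    problems = {}
    for name, wanted in pins.items():
        found = installed_version(name)
        if found is None:
            problems[name] = f"not installed (wanted {wanted})"
        elif found != wanted:
            problems[name] = f"found {found}, wanted {wanted}"
    return problems


if __name__ == "__main__":
    for name, message in check_pins(PINS).items():
        print(f"{name}: {message}")
```

The comparison is an exact string match, so a build with a local version suffix (for example a CUDA-specific torch wheel) would be reported as a mismatch even if the base version agrees.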
