LucaProt

LucaProt (DeepProtFunc) is an open-source project developed by Alibaba and licensed under the Apache License (Version 2.0).

This product contains various third-party components under other open source licenses. See the NOTICE file for more information.

Introduction

LucaProt: A novel deep learning framework that incorporates protein amino acid sequence and structure information to predict protein function.

1. Model

1) Model Introduction

We developed a new deep learning model, namely, Deep Sequential and Structural Information Fusion Network for Proteins Function Prediction (DeepProtFunc/LucaProt), which takes into account protein sequence composition and structure to facilitate the accurate annotation of protein function.

Here, we applied LucaProt to identify viral RdRP.

2) Model Architecture

We treat protein function prediction as a classification problem. For example, viral RdRP identification is a binary-class classification task, and protein general function annotation is a multi-label classification task. The model includes five modules: Input, Tokenizer, Encoder, Pooling, and Output. Its architecture is shown in Figure 1.

Figure 1 The Architecture of LucaProt

3) Model Input/Output

The model takes the amino acid letter sequence as input. It outputs the function label of the input protein, which is a single tag (binary-class or multi-class classification) or a set of tags (multi-label classification).

2. Dependencies

System: Ubuntu 20.04.5 LTS
Python: 3.9.13
Download anaconda: anaconda
Cuda: cuda11.7 (torch==1.13.1)

# Select 'YES' during installation for initializing the conda environment  
sh Anaconda3-2022.10-Linux-x86_64.sh  
# Source the environment
source ~/.bashrc  
# Verification
conda  
# Install env and python 3.9.13   
conda create -n lucaprot python=3.9.13    
# activate env
conda activate lucaprot  
# Install git      
sudo apt-get update         
sudo apt install git-all

# Enter the project   
cd LucaProt     

# Install
pip install -r requirements.txt -i https://pypi.tuna.tsinghua.edu.cn/simple

3. Inference

1) Prediction from one sample

cd LucaProt/src/prediction/ 
sh run_predict_one_sample.sh

Note: the embedding matrix of the sample is predicted in real time.

Or:

cd LucaProt/src/

export CUDA_VISIBLE_DEVICES=0

python predict_one_sample.py \
    --protein_id protein_1 \
    --sequence MTTSTAFTGKTLMITGGTGSFGNTVLKHFVHTDLAEIRIFSRDEKKQDDMRHRLQEKSPELADKVRFFIGDVRNLQSVRDAMHGVDYIFHAAALKQVPSCEFFPMEAVRTNVLGTDNVLHAAIDEGVDRVVCLSTDKAAYPINAMGKSKAMMESIIYANARNGAGRTTICCTRYGNVMCSRGSVIPLFIDRIRKGEPLTVTDPNMTRFLMNLDEAVDLVQFAFEHANPGDLFIQKAPASTIGDLAEAVQEVFGRVGTQVIGTRHGEKLYETLMTCEERLRAEDMGDYFRVACDSRDLNYDKFVVNGEVTTMADEAYTSHNTSRLDVAGTVEKIKTAEYVQLALEGREYEAVQ    \
    --emb_dir ./emb/ \
    --truncation_seq_length 4096 \
    --dataset_name rdrp_40_extend \
    --dataset_type protein \
    --task_type binary_class \
    --model_type sefn \
    --time_str 20230201140320 \
    --step 100000 \
    --threshold 0.5

--protein_id
str, the protein id.
--sequence
str, the protein sequence.
--truncation_seq_length
int, truncate sequences longer than the given value. Recommended values: 4096, 2048, 1984, 1792, 1534, 1280, 1152, 1024, default: 4096.
--emb_dir(optional)
path, the directory in which the predicted embedding matrix or vector of the protein is saved during prediction, optional.
--dataset_name
str, the dataset name used to build our trained model (rdrp_40_extend).
--dataset_type
str, the dataset type used to build our trained model (protein).
--task_type
str, the task name used to build our trained model (binary_class).
--model_type
str, the model name used to build our trained model (sefn).
--time_str
str, the running time string (yyyymmddHimiss) used to build our trained model (20230201140320).
--step
int, the global training step at which the model was finalized (100000).
--threshold
float, sigmoid threshold for binary-class or multi-label classification, None for multi-class classification, default: 0.5.
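
The role of `--threshold` for each task type can be sketched as follows. This is a minimal illustration, not LucaProt's own code: the helper name `apply_threshold`, the plain-list inputs, and the use of `>=` for the comparison are assumptions.

```python
from typing import List, Optional, Union


def apply_threshold(probs: List[float], task_type: str,
                    threshold: Optional[float] = 0.5) -> Union[int, List[int]]:
    """Turn output probabilities into labels (hypothetical helper)."""
    if task_type == "binary_class":
        # One sigmoid probability: positive (1) if it reaches the threshold.
        return int(probs[0] >= threshold)
    if task_type == "multi_label":
        # One sigmoid probability per label: keep every index that passes.
        return [i for i, p in enumerate(probs) if p >= threshold]
    if task_type == "multi_class":
        # No threshold is used: pick the class with the highest probability.
        return max(range(len(probs)), key=lambda i: probs[i])
    raise ValueError(f"unknown task type: {task_type}")
```

For multi-class classification the threshold argument is ignored, which matches the "None for multi-class classification" note above.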

2) Prediction from many samples

The samples are read from a *.fasta file and predicted one by one.

--fasta_file
str, the samples fasta file
--save_file
str, file path, save the predicted results into the file.
--print_per_number
int, print progress information for every number of samples completed, default: 100.

cd LucaProt/src/prediction/   
sh run_predict_many_samples.sh

Or:

cd LucaProt/src/

export CUDA_VISIBLE_DEVICES=0  

python predict_many_samples.py \
    --fasta_file ../data/rdrp/test/test.fasta  \
    --save_file ../result/rdrp/test/test_result.csv  \
    --emb_dir ../emb/   \
    --truncation_seq_length 4096  \
    --dataset_name rdrp_40_extend  \
    --dataset_type protein     \
    --task_type binary_class     \
    --model_type sefn     \
    --time_str 20230201140320   \
    --step 100000  \
    --threshold 0.5 \
    --print_per_number 10
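
For readers preparing their own input, the sample-by-sample reading of a FASTA file can be sketched in plain Python. The function `read_fasta` below is a hypothetical helper written for this illustration; it is not taken from `predict_many_samples.py`, and it assumes the protein id is the first whitespace-separated token of each header line.

```python
import io
from typing import Iterator, List, Optional, TextIO, Tuple


def read_fasta(handle: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield (protein_id, sequence) pairs from a FASTA text stream."""
    protein_id: Optional[str] = None
    chunks: List[str] = []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if protein_id is not None:
                yield protein_id, "".join(chunks)
            header = line[1:].split()
            protein_id = header[0] if header else ""
            chunks = []
        else:
            # Sequences may be wrapped over several lines.
            chunks.append(line)
    if protein_id is not None:
        yield protein_id, "".join(chunks)


# Usage with an in-memory file:
demo = io.StringIO(">protein_1 demo\nMTTSTA\nFTGK\n>protein_2\nVGGLFDYY\n")
records = list(read_fasta(demo))
```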

3) Prediction from the file

The test data (small and real) is in demo.csv, where the 7th column of each line is the filename of the structural embedding information prepared in advance.
The structural embedding files are stored in embs.

The test data includes 50 viral-RdRPs and 50 non-viral RdRPs.

cd LucaProt/src/prediction/   
sh run_predict_from_file.sh

Or:

cd LucaProt/src/

export CUDA_VISIBLE_DEVICES=0

python predict.py \
    --data_path ../data/rdrp/demo/demo.csv \
    --emb_dir ../data/rdrp/demo/embs/esm2_t36_3B_UR50D \
    --dataset_name rdrp_40_extend \
    --dataset_type protein \
    --task_type binary_class \
    --model_type sefn \
    --time_str 20230201140320 \
    --step 100000 \
    --evaluate \
    --threshold 0.5 \
    --batch_size 16 \
    --print_per_batch 2

--data_path
path, the file path of the prediction data, with the 9 columns described in the dataset section. The value of Column Label can be null.
--emb_dir
path, the directory that holds the structural embedding information of all samples, prepared in advance.
--dataset_name
str, the dataset name used to build our trained model (rdrp_40_extend).
--dataset_type
str, the dataset type used to build our trained model (protein).
--task_type
str, the task name used to build our trained model (binary_class).
--model_type
str, the model name used to build our trained model (sefn).
--time_str
str, the running time string (yyyymmddHimiss) used to build our trained model (20230201140320).
--step
int, the global training step at which the model was finalized (100000).
--threshold
float, sigmoid threshold for binary-class or multi-label classification, None for multi-class classification, default: 0.5.
--evaluate(optional)
store_true, whether to evaluate the predicted results.
--ground_truth_col_index(optional)
int, the ground truth column index in ${data_path}, default: None.
--batch_size
int, batch size per GPU/CPU for evaluation, default: 16.
--print_per_batch
int, print progress information every time this number of batches is completed, default: 1000.

Note: the embedding matrices of all the proteins in this file need to be prepared in advance (in $emb_dir).
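
Because every embedding file must exist before `predict.py` runs, a quick pre-flight check can save a failed run. The sketch below is an assumption-laden illustration: the helper name `missing_embeddings` is hypothetical, and it assumes the CSV has a header row and that the embedding filename is in the 7th column (index 6), as described for demo.csv.

```python
import csv
import os
from typing import List


def missing_embeddings(data_path: str, emb_dir: str, emb_col: int = 6) -> List[str]:
    """Return embedding filenames listed in the CSV but absent from emb_dir."""
    missing = []
    with open(data_path, newline="") as f:
        reader = csv.reader(f)
        next(reader, None)  # skip the header row (assumed to be present)
        for row in reader:
            # Rows without an embedding filename are ignored.
            if len(row) <= emb_col or not row[emb_col]:
                continue
            if not os.path.isfile(os.path.join(emb_dir, row[emb_col])):
                missing.append(row[emb_col])
    return missing
```

An empty result means every listed embedding file was found in the directory.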

4. Inference Time

LucaProt is fast because it only needs to predict the structural representation matrix rather than the complete 3D structure of the protein sequence.

Benchmark: for each sequence length range, 50 viral RdRPs and 50 non-viral RdRPs were selected to measure the inference time.

Note: the reported time includes the inference of the structural representation matrix and excludes model loading.

1) GPU(Nvidia A100, Cuda: 11.7)

Protein Seq Len Range	Average Time	Maximum Time	Minimum Time
300 <= Len < 500	0.20s	0.24s	0.16s
500 <= Len < 800	0.30s	0.39s	0.24s
800 <= Len < 1,000	0.42s	0.46s	0.39s
1,000 <= Len < 1,500	0.59s	0.74s	0.45s
1,500 <= Len < 2,000	0.87s	1.02s	0.73s
2,000 <= Len < 3,000	1.31s	1.69s	1.01s
3,000 <= Len < 5,000	2.14s	2.78s	1.72s
5,000 <= Len < 8,000	3.03s	3.45s	2.65s
8,000 <= Len < 10,000	3.77s	4.24s	3.32s
10,000 <= Len	9.92s	17.66s	4.30s

2) CPU (16 cores, 64G memory of Alibaba Cloud ECS)

Protein Seq Len Range	Average Time	Maximum Time	Minimum Time
300 <= Len < 500	3.97s	5.71s	2.77s
500 <= Len < 800	5.78s	7.50s	4.48s
800 <= Len < 1,000	8.23s	9.41s	7.41s
1,000 <= Len < 1,500	11.49s	16.42s	9.22s
1,500 <= Len < 2,000	17.71s	22.36s	14.93s
2,000 <= Len < 3,000	26.97s	36.68s	20.99s
3,000 <= Len < 5,000	45.56s	58.42s	35.82s
5,000 <= Len < 8,000	56.57s	58.17s	55.55s
8,000 <= Len < 10,000	57.76s	58.86s	56.66s
10,000 <= Len	66.49s	76.80s	58.42s

3) CPU (96 cores, 768G memory of Alibaba Cloud ECS)

Protein Seq Len Range	Average Time	Maximum Time	Minimum Time
300 <= Len < 500	1.89s	2.55s	1.10s
500 <= Len < 800	2.68s	3.44s	2.13s
800 <= Len < 1,000	3.45s	4.25s	2.65s
1,000 <= Len < 1,500	4.27s	5.90s	3.54s
1,500 <= Len < 2,000	5.81s	7.44s	4.76s
2,000 <= Len < 3,000	8.14s	10.74s	6.37s
3,000 <= Len < 5,000	13.25s	17.69s	10.06s
5,000 <= Len < 8,000	17.03s	18.20s	15.98s
8,000 <= Len < 10,000	17.90s	18.99s	16.92s
10,000 <= Len	25.90s	35.02s	18.66s

5. Dataset for Virus RdRP

1) Fasta

Viral RdRP (Positive: 5,979)

The positive sequence fasta file is data/rdrp/all_dataset_positive.fasta.zip
Non-viral RdRP (Negative: 229,434)

The negative sequence fasta file is dataset/rdrp/all_dataset_negative.fasta.zip
including:
- other proteins of the virus
- other protein domains of the virus
- non-viral proteins

2) Structural embedding(matrix and vector)

All structural embedding files of the dataset for model building are available at: embs
The structural embedding files of the prediction data are still being prepared for release (because of the amount of data).

3) PDB (3D Structure)

The 3D-structure PDB files of the model building dataset and the predicted data are still being prepared for release (because of the amount of data).

4) Vocab

structure vocab
This vocab file is struct_vocab/rdrp_40_extend/protein/binary_class/struct_vocab.txt
subword-level vocab
The size of the sequence vocab we use is 20,000.
This vocab file is vocab/rdrp_40_extend/protein/binary_class/subword_vocab_20000.txt
char-level vocab
This vocab file is vocab/rdrp_40_extend/protein/binary_class/vocab.txt

5) Label

Viral RdRP identification is a binary-class classification task with a negative and a positive class, represented by 0 and 1, respectively. The label list file is dataset/rdrp_40_extend/protein/binary_class/label.txt

6) Dataset

We constructed a dataset with 235,413 samples for model building, which included 5,979 positive samples of known viral RdRPs (i.e., the well-curated RdRP database described in the previous section of Methods) and 229,434 negative samples of confirmed non-virus RdRPs (to maintain a 1:40 ratio of viral RdRPs to non-virus RdRPs). The non-virus RdRPs contained proteins from Eukaryota DNA dependent RNA polymerase (Eu DdRP, N=1,184), Eukaryota RNA dependent RNA polymerase (Eu RdRP, N=2,233), Reverse Transcriptase (RT, N=48,490), proteins obtained from DNA viruses (N=1,533), non-RdRP proteins obtained from RNA viruses (N=1,574), and a wide array of cellular proteins from different functional categories (N=174,420). We randomly divided the dataset into training, validation, and testing sets with a ratio of 8.5:1:1, which were used for model fitting, model finalization (based on the training iteration with the best F1-score), and performance reporting (including accuracy, precision, recall, F1-score, and Area under the ROC Curve (AUC)), respectively.
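
A random 8.5:1:1 split of this kind can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the project's preprocessing script: the function name `split_dataset`, the fixed seed, and the rounding rule (remainder goes to the testing set) are all hypothetical choices.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_dataset(samples: Sequence[T],
                  ratios: Tuple[float, float, float] = (8.5, 1.0, 1.0),
                  seed: int = 0) -> Tuple[List[T], List[T], List[T]]:
    """Shuffle the samples and cut them into train/dev/test by ratio."""
    items = list(samples)
    random.Random(seed).shuffle(items)  # seeded so the split is reproducible
    total = sum(ratios)
    n_train = int(len(items) * ratios[0] / total)
    n_dev = int(len(items) * ratios[1] / total)
    # Whatever is left after train and dev becomes the testing set.
    return (items[:n_train],
            items[n_train:n_train + n_dev],
            items[n_train + n_dev:])


# The negative categories listed above add up to the stated total.
negatives = 1184 + 2233 + 48490 + 1533 + 1574 + 174420
train, dev, test = split_dataset(range(105))
```

Note that this split is purely random, as described in the text; it does not stratify by label.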

Entire Dataset
This file is dataset/rdrp/all_dataset_with_pdb_emb.csv.zip
Training set
Copy this file to dataset/rdrp_40_extend/protein/binary_class/train_with_pdb_emb.csv
Validation set
Copy this file to dataset/rdrp_40_extend/protein/binary_class/dev_with_pdb_emb.csv
Testing set
Copy this file to dataset/rdrp_40_extend/protein/binary_class/test_with_pdb_emb.csv

Each row in the above files represents one sample. Each file consists of 9 columns: prot_id, seq, seq_len, pdb_filename, ptm, mean_plddt, emb_filename, label, and source. The details of these columns are as follows:

prot_id
the protein id
seq
the amino acid(aa) sequence
seq_len
the length of the protein sequence.
pdb_filename
the PDB filename of the 3D-structure, which is either predicted by a computational model or obtained by experiments.
ptm
the pTM of the predicted 3D-structure.
mean_plddt
the mean pLDDT of the predicted 3D-structure.
emb_filename
The filename of the embedding matrix or vector of protein structure.
Note: the embedding matrices of the dataset need to be prepared in advance.
label
the sample label, 0 or 1 for binary-class classification, [0, 1, …, N-1] for multi-class classification, a list of [0, 1, …, N-1] for multi-label classification.
source
optional, the sample source (such as RdRP, RT, DdRP, non-virus RdRP, and Other).

Note: if using strategy one in the structure encoder, pdb_filename, ptm, and mean_plddt can be null.

6. Supported Task Types

binary-class classification
The label is 0 or 1 for binary-class classification, such as viral RdRP identification.
multi-class classification
The label is 0~N-1 for multi-class classification, such as the species prediction for proteins.
multi-label classification
The labels form a list of 0~N-1 for multi-label classification, such as Gene Ontology annotation for proteins.

7. Building Your Model

1) Prediction of protein 3D-structure(Optional)

The script structure_from_esm_v1.py is in the directory "src/protein_structure", and it uses ESMFold (esmfold_v1) to predict the 3D-structure of a protein.

I. Prediction from file

cd LucaProt/src/protein_structure/     

export CUDA_VISIBLE_DEVICES=0

python structure_from_esm_v1.py \
    -i data/rdrp/rdrp.fasta \
    -o pdbs/rdrp/ \
    --num-recycles 4 \
    --truncation_seq_length 4096 \
    --chunk-size 64 \
    --cpu-offload \
    --batch_size 1

Parameters:

-i (input filepaths)
- fasta filepath
- csv filepath
  the first row is the header
  column 0: protein_id
  column 1: sequence
- multiple filepaths
  comma-concatenation
-o (save dirpath)
The directory in which the predicted 3D-structure data is saved. Each protein is stored in a PDB file, and each PDB file is named "protein" + an auto-increment id + ".pdb", such as "protein1.pdb".
The mapping between protein ids and auto-increment ids is stored in the file "resultinfo.csv" (including: "index", "protein_id(uuid)", "seq_len", "ptm", "mean_plddt") in this directory.
For failed samples (CUDA out of memory), this script saves their protein ids in "uncompleted.txt"; you can reduce the value of "truncation_seq_length" and add "--try_failure" to retry.
--batch_size
the batch size of running, default: 1.
--truncation_seq_length
truncate sequences longer than the given value, recommended values: 4096, 2048, 1984, 1792, 1536, 1280, 1152, 1022.
--num-recycles
number of recycles to run.
--chunk-size
chunks axial attention computation to reduce memory usage from O(L^2) to O(L), recommended values: 128, 64, 32.
--try_failure
retry the failed samples after reducing the "truncation_seq_length" value.
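
The retry strategy described for failed samples (lower the truncation length, then try again) can be sketched as a simple loop. Everything here is illustrative: `predict_with_fallback` and the `predict` callable are hypothetical, `MemoryError` stands in for a CUDA out-of-memory error, and keeping the left part of the sequence when truncating is an assumption. In the actual script the retry is triggered manually with "--try_failure".

```python
from typing import Callable, Optional, Sequence

# Recommended truncation lengths, copied from the parameter description above.
RECOMMENDED_LENGTHS = (4096, 2048, 1984, 1792, 1536, 1280, 1152, 1022)


def predict_with_fallback(predict: Callable[[str, int], str],
                          sequence: str,
                          lengths: Sequence[int] = RECOMMENDED_LENGTHS) -> Optional[str]:
    """Try progressively shorter truncation lengths until one succeeds."""
    for max_len in lengths:
        try:
            # Truncate the sequence (left part kept; an assumption) and predict.
            return predict(sequence[:max_len], max_len)
        except MemoryError:
            # Stand-in for "CUDA out of memory": fall back to a shorter length.
            continue
    return None  # every length failed; the sample stays uncompleted
```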

II. Prediction from input sequences

cd LucaProt/src/protein_structure/    

export CUDA_VISIBLE_DEVICES=0

python structure_from_esm_v1.py \
    -name protein_id1,protein_id2  \
    -seq VGGLFDYYSVPIMT,LPDSWENKLLTDLILFAGSFVGSDTCGKLF \
    -o pdbs/rdrp/  \
    --num-recycles 4 \
    --truncation_seq_length 4096 \
    --chunk-size 64 \
    --cpu-offload \
    --batch_size 1

Parameters:

-name
protein ids, comma-concatenation for multi proteins.
-seq
protein sequences, comma-concatenation for multi proteins.

2) Prediction of protein structural embedding

The script embedding_from_esmfold.py is in "src/protein_structure", and it uses ESMFold (esm2_t36_3B_UR50D) to predict protein structural embedding matrices or vectors.

I. Prediction from file

cd LucaProt/src/protein_structure/    

export CUDA_VISIBLE_DEVICES=0  

python embedding_from_esmfold.py \
    --model_name esm2_t36_3B_UR50D \
    --file data/rdrp.fasta \
    --output_dir emb/rdrp/ \
    --include per_tok contacts bos \
    --truncation_seq_length 4094

Parameters:

--model_name
the model name, default: "esm2_t36_3B_UR50D"
-i/--file (input filepath)
- fasta filepath
- csv filepath
  the first row is the header
  column 0: protein_id
  column 1: sequence
-o/--output_dir (save dirpath)
The directory in which the predicted structural embedding data is saved. Each protein is stored in a pickle file, and each embedding file is named "embedding_" + auto-increment id + ".pt", such as "embedding_1.pt". The mapping between protein ids and auto-increment ids is stored in the file "{}embedfastaid2idx.csv" (including: "index", "protein_id(uuid)") in this directory. For failed samples (CUDA out of memory), this script saves their protein ids in "{}embeduncompleted.txt"; you can reduce the "truncation_seq_length" value and add "--try_failure" to retry.
--truncation_seq_length
truncate sequences longer than the given value. Recommended values: 4094, 2046, 1982, 1790, 1534, 1278, 1150, 1022.
--include
The embedding matrix or vector type of the predicted structural embedding data, including per_tok, mean, contacts, and bos.
- per_tok includes the full sequence, with an embedding per amino acid (seq_len x hidden_dim).
- mean includes the embeddings averaged over the full sequence, per layer.
- bos includes the embeddings from the beginning-of-sequence token.
- contacts includes the attention value between two amino acids of the full sequence.
Reference: https://github.com/facebookresearch/esm [Compute embeddings in bulk from FASTA]
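
The relationship between per_tok, mean, and bos can be made concrete with a small pure-Python sketch. It assumes the token representations of one layer are laid out as one row per token in the order [BOS, aa_1, ..., aa_L, EOS]; the helper name `select_embeddings` is hypothetical, and contacts is left out because it is derived from attention maps rather than from this matrix.

```python
from typing import Dict, List

Matrix = List[List[float]]


def select_embeddings(token_reprs: Matrix, seq_len: int) -> Dict[str, object]:
    """Derive per_tok, mean, and bos from one layer's token representations."""
    # Rows 1..seq_len are the amino acids; row 0 is BOS, the last row is EOS.
    per_tok = token_reprs[1:seq_len + 1]
    hidden_dim = len(per_tok[0])
    # Average each hidden dimension over the amino-acid rows only.
    mean = [sum(row[j] for row in per_tok) / seq_len for j in range(hidden_dim)]
    return {"per_tok": per_tok, "mean": mean, "bos": token_reprs[0]}


# Toy example: a 2-residue sequence with hidden_dim = 2.
toy = [[9.0, 9.0], [1.0, 2.0], [3.0, 4.0], [0.0, 0.0]]
selected = select_embeddings(toy, seq_len=2)
```

Here per_tok is a (seq_len x hidden_dim) matrix, while mean and bos are single vectors of length hidden_dim.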

II. Prediction from input sequences

cd LucaProt/src/protein_structure/     

export CUDA_VISIBLE_DEVICES=0  

python embedding_from_esmfold.py \
    --model_name esm2_t36_3B_UR50D \
    -name protein_id1,protein_id2 \
    -seq VGGLFDYYSVPIMT,LPDSWENKLLTDLILFAGSFVGSDTCGKLF \
    --output_dir embs/rdrp/test/ \
    --include per_tok contacts bos \
    --truncation_seq_length 4094

Parameters:

-name
protein ids, comma-concatenation for multi proteins.
-seq
protein sequences, comma-concatenation for multi proteins.

3) Construct dataset for model building

Construct your dataset and randomly divide the dataset into training, validation, and testing sets with a specified ratio, and save the three sets in dataset/${dataset_name}/${dataset_type}/${task_type}, including train.csv, dev.csv, test_*.csv.

The file format can be .csv (must include the header) or .txt (does not need the header).

Each file line is a sample containing 9 columns: prot_id, seq, seq_len, pdb_filename, ptm, mean_plddt, emb_filename, label, and source.

Column seq is the sequence. Column pdb_filename is the saved PDB filename for structure encoder strategy 2. Column ptm and Column mean_plddt are optional and are obtained from the 3D-structure prediction model. Column emb_filename is the saved embedding filename for structure encoder strategy 1. Column label is the sample class (a single value or a list of label indices or label names). Column source is the sample source (optional).

For example:

like_YP_009351861.1_Menghai_flavivirus,MEQNG...,3416,,,,embedding_21449.pt,1,rdrp
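
To make the 9-column layout concrete, the example line can be parsed into a typed record. The `Sample` dataclass and `parse_row` helper below are hypothetical names introduced for this sketch; the label is kept as a raw string because it may be a single value or a list, depending on the task type.

```python
import csv
import io
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Sample:
    prot_id: str
    seq: str
    seq_len: int
    pdb_filename: Optional[str]
    ptm: Optional[float]
    mean_plddt: Optional[float]
    emb_filename: Optional[str]
    label: str  # raw text: a single label or a list, depending on the task
    source: Optional[str]


def parse_row(row: List[str]) -> Sample:
    """Convert one 9-column row into a Sample; empty cells become None."""
    if len(row) != 9:
        raise ValueError(f"expected 9 columns, got {len(row)}")
    prot_id, seq, seq_len, pdb, ptm, plddt, emb, label, source = row
    return Sample(prot_id, seq, int(seq_len), pdb or None,
                  float(ptm) if ptm else None,
                  float(plddt) if plddt else None,
                  emb or None, label, source or None)


line = "like_YP_009351861.1_Menghai_flavivirus,MEQNG...,3416,,,,embedding_21449.pt,1,rdrp"
sample = parse_row(next(csv.reader(io.StringIO(line))))
```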

Note: if your dataset is too large to load into memory at once,
use "src/data_preprocess/datapreprocessintotfrecordsforrdrp.py" to convert the dataset into "tfrecords", and then create an index file: python -m tfrecord.tools.tfrecord2idx xxxx.tfrecords xxxx.index

4) Training the model

run.py
the main script for building the model.
Parameters
- data_dir: path, the dataset dirpath
- filename_pattern: the dataset filename pattern, such as "{}_with_pdb_emb.csv", including train_with_pdb_emb.csv, dev_with_pdb_emb.csv, and test_with_pdb_emb.csv in ${data_dir}
- separate_file: store_true, load the entire dataset into memory; the names of the pdb and embedding files are listed in the train/dev/test.csv, and these files need to be loaded.
- tfrecords: store_true, whether the dataset is in tfrecords; when true, only the specified number of samples (${shuffle_queue_size}) are loaded into memory at once. The tfrecords must consist of "${data_dir}/tfrecords/train/xxx.tfrecords", "${data_dir}/tfrecords/dev/xxx.tfrecords" and "${data_dir}/tfrecords/test/xxx.tfrecords". "xxx.tfrecords" is one of 01-of-01.tfrecords (only including sequence), 01-of-01_emb.records (including sequence and structural embedding), and 01-of-01_pdb_emb.records (including sequence, 3D-structure contact map, and structural embedding).
- shuffle_queue_size: int, how many samples are loaded into memory at once, default: 5000.
- dataset_name: str, your dataset name, such as "rdrp_40_extend"
- dataset_type: str, your dataset type, such as "protein"
- task_type: choices=["multi_label", "multi_class", "binary_class"], your task type, such as "binary_class"
- model_type: choices=["sequence", "structure", "embedding", "sefn", "ssfn"], they represent only the sequence for input, only the 3D-structure contact map for input, only the structural embedding for input, the sequence and the structural embedding for input, and the sequence and the 3D-structure contact map for input, respectively
- subword: store_true, whether to process the sequence at the subword level.
- codes_file: path, subword codes filepath when using subword, such as "../subword/rdrp/proteincodesrdrp20000.txt"
- label_type: str, the label type name, such as "rdrp"
- label_filepath: path, the label list filepath
- cmap_type: choices=["C_alpha", "C_bert"], the calculation type of the 3D-structure contact map
- cmap_thresh: the distance threshold (Unit: Angstrom) in contact map calculation. Two amino acids are linked if the distance between them is less than or equal to the threshold, default: 10.0.
- output_dir: path, the output dirpath
- log_dir: path, the logger savepath
- tb_log_dir: path, the save path of the metric evaluation records in model training; tensorboardX can be used to show these metrics.
- config_path: path, the configuration filepath of the model.
- seq_vocab_path: path, the vocab filepath of the sequence tokenizer
- struct_vocab_path: path, the vocab filepath of the 3D-structure node (Structural Encoder Strategy 2)
- seq_pooling_type: choices=["none", "max", "value_attention"], the pooling type of the sequence representation matrix; "none" means that a single token vector is used instead of pooling.
- struct_pooling_type: choices=["max", "value_attention"], the pooling type of the 3D-structure representation matrix.
- embedding_pooling_type: choices=["none", "max", "value_attention"], the pooling type of the structural embedding representation matrix; "none" means that a single token vector is used instead of pooling.
- evaluate_during_training: store_true, whether to evaluate the validation set and the testing set during training.
- do_eval: store_true, whether to use the best saved model to evaluate the validation set.
- do_predict: store_true, whether to use the best saved model to evaluate the testing set.
- do_lower_case: store_true, whether to lowercase the input when tokenizing.
- per_gpu_train_batch_size: int, batch size per GPU/CPU for training, default: 16
- per_gpu_eval_batch_size: int, batch size per GPU/CPU for evaluation, default: 16
- gradient_accumulation_steps: int, number of update steps to accumulate before performing a backward/update pass, default: 1.
- learning_rate: float, the initial learning rate for Adam, default: 1e-4.
- num_train_epochs: int, the total number of training epochs to perform, default: 50.
- logging_steps: log every X update steps, default: 1000.
- loss_type: choices=["focal_loss", "bce", "multilabel_cce", "asl", "cce"], loss-function type of model training, default: "bce".
- max_metric_type: choices=["acc", "jaccard", "prec", "recall", "f1", "fmax", "roc_auc", "pr_auc"], which metric is used for model finalization, default: "f1".
- pos_weight: float, positive samples weight for "bce".
- focal_loss_alpha: float, alpha for focal loss, default: 0.7.
- focal_loss_gamma: float, gamma for focal loss, default: 2.0.
- focal_loss_reduce: store_true, "mean" for one sample in multi-label classification, default: "sum".
- asl_gamma_neg: float, negative gamma for asl, default: 4.0.
- asl_gamma_pos: float, positive gamma for asl, default: 1.0.
- seq_max_length: int, input sequences longer than the max length are truncated and shorter ones are padded, default: 2048.
- struct_max_length: int, input contact maps longer than the max length are truncated and shorter ones are padded, default: 2048.
- trunc_type: choices=["left", "right"], the truncation type for the whole input sequence, default: "right".
- no_position_embeddings: store_true, whether to disable position embeddings for the sequence.
- no_token_type_embeddings: store_true, whether to disable token type embeddings for the sequence.
- embedding_input_size: int, the dim of the structural embedding vector/matrix, default: 2560, {"esm2_t30_150M_UR50D": 640, "esm2_t33_650M_UR50D": 1280, "esm2_t36_3B_UR50D": 2560, "esm2_t48_15B_UR50D": 5120}.
- embedding_type: choices=[None, "contacts", "bos", "matrix"], the type of the structural embedding info, default: "matrix".
- embedding_max_length: int, input embedding matrices longer than the max length are truncated and shorter ones are padded, default: 2048.
- save_all: store_true, the model for each evaluation is saved.
- delete_old: store_true, only keep the model with the best metric (${max_metric_type}) among all evaluations on the testing set during training.
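
The cmap_thresh rule in the parameter list (two amino acids are linked when their distance is at most the threshold) can be sketched as follows. This is a minimal illustration, not the project's contact map generator: the function name `contact_map` is hypothetical, the coordinates are assumed to be one 3D point per residue (for example the C-alpha atom), and the diagonal is set to 1 because a residue has distance 0 to itself.

```python
import math
from typing import List, Sequence, Tuple

Coord = Tuple[float, float, float]


def contact_map(coords: Sequence[Coord], thresh: float = 10.0) -> List[List[int]]:
    """Binary contact map: 1 where two residues are within thresh Angstrom."""
    n = len(coords)
    cmap = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            # Euclidean distance between the representative atoms.
            if math.dist(coords[i], coords[j]) <= thresh:
                cmap[i][j] = cmap[j][i] = 1  # the map is symmetric
    return cmap


# Three residues: the first two are 5 Angstrom apart, the third is far away.
cmap = contact_map([(0.0, 0.0, 0.0), (3.0, 4.0, 0.0), (20.0, 0.0, 0.0)])
```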
Training
```shell
#!/bin/bash

export CUDA_VISIBLE_DEVICES=0

DATASET_NAME="rdrp_40_extend"
DATASET_TYPE="protein"
TASK_TYPE="binary_class"

# sequence + structural embedding
MODEL_TYPE="sefn"
CONFIG_NAME="sefnconfig.json"
INPUT_MODE="single"
LABEL_TYPE="rdrp"
embedding_input_size=2560
embedding_type="matrix"
SEQ_MAX_LENGTH="2048"
embedding_max_length="2048"
TRUNCT_TYPE="right"

# none, max, value_attention
SEQ_POOLING_TYPE="value_attention"

# max, value_attention
embedding_pooling_type="value_attention"
VOCAB_NAME="subword_vocab_20000.txt"
SUBWORD_CODES_NAME="proteincodesrdrp20000.txt"
MAX_METRIC_TYPE="f1"
time_str=$(date "+%Y%m%d%H%M%S")

python run.py \
  --data_dir ../dataset/$DATASET_NAME/$DATASET_TYPE/$TASK_TYPE \
  --tfrecords \
  --filename_pattern {}_with_pdb_emb.csv \
  --dataset_name $DATASET_NAME \
  --dataset_type $DATASET_TYPE \
  --task_type $TASK_TYPE \
  --model_type $MODEL_TYPE \
  --subword \
  --codes_file ../subword/$DATASET_NAME/$DATASET_TYPE/$TASK_TYPE/$SUBWORD_CODES_NAME \
  --input_mode $INPUT_MODE \
  --label_type $LABEL_TYPE \
  --label_filepath ../dataset/$DATASET_NAME/$DATASET_TYPE/$TASK_TYPE/label.txt \
  --output_dir ../models/$DATASET_NAME/$DATASET_TYPE/$TASK_TYPE/$MODEL_TYPE/$time_str \
  --log_dir ../logs/$DATASET_NAME/$DATASET_TYPE/$TASK_TYPE/$MODEL_TYPE/$time_str \
  --tb_log_dir ../tb-logs/$DATASET_NAME/$DATASET_TYPE/$TASK_TYPE/$MODEL_TYPE/$time_str \
  --config_path ../config/$DATASET_NAME/$DATASET_TYPE/$TASK_TYPE/$CONFIG_NAME \
  --seq_vocab_path ../vocab/$DATASET_NAME/$DATASET_TYPE/$TASK_TYPE/$VOCAB_NAME \
  --seq_pooling_type $SEQ_POOLING_TYPE \
  --embedding_pooling_type $embedding_pooling_type \
  --do_train \
  --do_eval \
  --do_predict \
  --evaluate_during_training \
  --per_gpu_train_batch_size=16 \
  --per_gpu_eval_batch_size=16 \
  --gradient_accumulation_steps=1 \
  --learning_rate=1e-4 \
  --num_train_epochs=50 \
  --logging_steps=1000 \
  --save_steps=1000 \
  --overwrite_output_dir \
  --sigmoid \
  --loss_type bce \
  --max_metric_type $MAX_METRIC_TYPE \
  --seq_max_length=$SEQ_MAX_LENGTH \
  --embedding_max_length=$embedding_max_length \
  --trunc_type=$TRUNCT_TYPE \
  --no_token_type_embeddings \
  --embedding_input_size $embedding_input_size \
  --embedding_type $embedding_type \
  --shuffle_queue_size 10000 \
  --save_all
```
Configuration file
The configuration files of all methods are in "config/rdrp_40_extend/protein/binary_class/". If training your own model, please put the configuration file in "config/${dataset_name}/${dataset_type}/${task_type}/".
Value meaning in configuration file
Refer to "src/SSFN/README.md".
Baselines
- LGBM (using the embedding vector as the input)
```
  cd src/baselines/
  sh run_lgbm.sh
```
- XGBoost (using the embedding vector as the input)
```
  cd src/baselines/
  sh run_xgb.sh
```
- DNN (using the embedding vector as the input)
```
  cd src/baselines/
  sh run_dnn.sh
```
Or:
```
  cd src/training
  sh run_subword_rdrp_emb.sh
```
- Transformer-Char Level (using the sequence as the input)
```
  cd src/training
  sh run_char_rdrp_seq.sh
```
- Transformer-Subword Level (using the sequence as the input)
```
  cd src/training
  sh run_subword_rdrp_seq.sh
```
- DNN2 (VALP + DNN, using the embedding matrix as the input)
```
  cd src/training
  sh run_subword_rdrp_emb_v2.sh
```
Ours
- Ours (the sequence + the 3D-structure)
  coming soon…
- Ours (the sequence + the embedding matrix)
```
  cd src/training
  sh run_subword_rdrp_sefn.sh
```

5) Training Logging Information

logs

The running information is saved in "logs/${dataset_name}/${dataset_type}/${task_type}/${model_type}/${time_str}/logs.txt".

The information includes the model configuration, model layers, running parameters, and evaluation information.

models

The checkpoints are saved in "models/${dataset_name}/${dataset_type}/${task_type}/${model_type}/${time_str}/checkpoint-${global_step}/". This directory includes "pytorch_model.bin", "config.json", "training_args.bin", and the tokenizer information "sequence" or "struct". The details are shown in Figure 2.

Figure 2: The File List in Checkpoint Dir Path

tb-logs

The metrics are recorded in "tb-logs/${dataset_name}/${dataset_type}/${task_type}/${model_type}/${time_str}/events.out.tfevents.xxxxx.xxxxx"

run: tensorboard --logdir=tb-logs/${dataset_name}/${dataset_type}/${task_type}/${model_type}/${time_str} --bind_all

predicts

The predicted results are saved in "predicts/${dataset_name}/${dataset_type}/${task_type}/${model_type}/${time_str}/checkpoint-${global_step}", including:

pred_confusion_matrix.png
pred_metrics.txt
pred_result.csv
seq_length_distribution.png

The details are shown in Figure 3.

Figure 3: The File List in Prediction Dir Path

Note: when using the saved model to predict, the "logs.txt" and the checkpoint dirpath will be used.
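
Since the logs, checkpoints, TensorBoard records, and predictions all follow the same directory pattern, the four locations for one training run can be derived from the same identifiers. The helper `run_paths` below is a hypothetical sketch of that pattern, assuming paths relative to the project root; it is not part of the LucaProt code.

```python
from pathlib import PurePosixPath
from typing import Dict


def run_paths(dataset_name: str, dataset_type: str, task_type: str,
              model_type: str, time_str: str, global_step: int) -> Dict[str, str]:
    """Build the output locations of one training run (hypothetical helper)."""
    # The part shared by logs/, models/, tb-logs/, and predicts/.
    run = PurePosixPath(dataset_name, dataset_type, task_type, model_type, time_str)
    checkpoint = f"checkpoint-{global_step}"
    return {
        "logs": str(PurePosixPath("logs") / run / "logs.txt"),
        "checkpoint": str(PurePosixPath("models") / run / checkpoint),
        "tb_logs": str(PurePosixPath("tb-logs") / run),
        "predicts": str(PurePosixPath("predicts") / run / checkpoint),
    }


# Identifiers of the released viral RdRP model, as used in the inference examples.
paths = run_paths("rdrp_40_extend", "protein", "binary_class",
                  "sefn", "20230201140320", 100000)
```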

8. Related to the Project

1) ClstrSearch

A conventional approach that clustered all proteins based on their sequence homology.

See ClstrSearch/README.md for details.

2) src

Construct RdRP Dataset for Model Building

*.py in "src/data_preprocess"

Model

*.py in "src/SSFN"

Prediction Shell Script

*.sh in "src/prediction"
including:

run_predict_from_file.sh
run prediction for many samples from a file, with the structural embedding information prepared in advance.
run_predict_one_sample.sh
run prediction for one sample from the input.
run_predict_many_samples.sh
run prediction for many samples from the input.

We perform ablation studies on our model by removing specific modules (sequence-specific and embedding-specific) one at a time to explore their relative importance.

run_predict_only_seq_from_file.sh
only use the sequence to predict and calculate metrics on three positive testing datasets, three negative testing datasets, and our checked RdRPs from the SRA prediction.
run_predict_only_emb_from_file.sh
only use the structural embedding to predict and calculate metrics on three positive testing datasets, three negative testing datasets, and our checked RdRPs from the SRA prediction.
run_predict_seq_emb_from_file.sh
use the sequential info and the structural embedding to predict and calculate metrics on three positive testing datasets, three negative testing datasets, and our checked RdRPs from the SRA prediction.

Baselines

*.py in "src/baselines", using the embedding vector as the input, including:

DNN
LGBM
XGBoost

Baselines for Deep Learning

*.py in "src/deep_baselines", including:

CHEER: HierarCHical taxonomic classification for viral mEtagEnomic data via deep learning(2021). code: CHEER

VirHunter: A Deep Learning-Based Method for Detection of Novel RNA Viruses in Plant Sequencing Data(2022). code: VirHunter

Virtifier: a deep learning-based identifier for viral sequences from metagenomes(2022). code: Virtifier

RNN-VirSeeker: RNN-VirSeeker: A Deep Learning Method for Identification of Short Viral Sequences From Metagenomes. code: RNN-VirSeeker

run_deep_baselines.sh
the script to train deep baseline models.
run_predict_deep_baselines.sh
uses the trained deep baseline models to predict on three positive test datasets, three negative test datasets, and our checked RdRP datasets.
run.py
the main script for training deep baseline models.
statistics
the script to compute the accuracy on the three kinds of test datasets (positive, negative, our checked) after prediction by the deep baselines.
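
The per-dataset accuracy described above can be sketched as follows. This is a minimal illustration, not the project's statistics script: the function name `dataset_accuracy`, the `EXPECTED_LABEL` table, and the assumption that the checked RdRP dataset counts as positive (label 1) are all hypothetical.

```python
# Hypothetical mapping from test-dataset kind to the label every sample
# in that dataset should receive (1 = RdRP, 0 = not RdRP).
# Treating the "checked" dataset as positive is an assumption.
EXPECTED_LABEL = {"positive": 1, "negative": 0, "checked": 1}


def dataset_accuracy(predicted_labels, kind):
    """Fraction of predictions that match the expected label for this kind."""
    if not predicted_labels:
        raise ValueError("no predictions given")
    expected = EXPECTED_LABEL[kind]
    hits = sum(1 for label in predicted_labels if label == expected)
    return hits / len(predicted_labels)


if __name__ == "__main__":
    # Toy predictions, not real results.
    print(dataset_accuracy([1, 1, 0, 1], "positive"))
    print(dataset_accuracy([0, 0, 0, 1], "negative"))
```

Because every sample in a given test dataset shares one true label, accuracy here reduces to the fraction of samples predicted as that label.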

Contact Map Generator

*.py in "src/biotoolbox"

Loss & Metrics

*.py in "src/common"

Training Model

*.sh in "src/training"

Prediction of Model

*.sh in "src/prediction"

3) Data

Raw Data

The raw data is in "data/".

Dataset

The dataset files are in "dataset/${dataset_name}/${dataset_type}/${task_type}/".

4) Model Configuration

The configuration files of all methods are in "config/${dataset_name}/${dataset_type}/${task_type}/".
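
As a small illustration of how the dataset and configuration directories share the same three path components (dataset name, dataset type, task type), the layout can be sketched as below. The helper names `dataset_dir` and `config_dir` are hypothetical and not part of the project; the example values are the ones used for the RdRP dataset elsewhere in this README.

```python
from pathlib import Path


def dataset_dir(repo_root, dataset_name, dataset_type, task_type):
    """Directory holding the dataset files for one task (hypothetical helper)."""
    return Path(repo_root) / "dataset" / dataset_name / dataset_type / task_type


def config_dir(repo_root, dataset_name, dataset_type, task_type):
    """Directory holding the configuration files for the same task."""
    return Path(repo_root) / "config" / dataset_name / dataset_type / task_type


if __name__ == "__main__":
    parts = ("rdrp_40_extend", "protein", "binary_class")
    print(dataset_dir("LucaProt", *parts))
    print(config_dir("LucaProt", *parts))
```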

5) Pic

Some pictures are in "pics/".

6) Plot

The plotting scripts are in "src/plot".

7) Spider

The code and results of the geographic information spider are in "src/geo_map".

9. Open Resource

The open resources of our study include six subdirectories: Known_RdRPs, Results, All_Contigs, All_Protein_Sequences, LucaProt, and Self_Sequencing_Reads.

LucaProt/ contains the resources related to LucaProt: the code, the model building dataset, the model testing datasets, and our trained model.

1) Code

As mentioned above.

2) Dataset

Model Building Dataset

sequential info
train_with_pdb_emb.csv
copy to LucaProt/dataset/rdrp_40_extend/protein/binary_class/

dev_with_pdb_emb.csv
copy to LucaProt/dataset/rdrp_40_extend/protein/binary_class/

test_with_pdb_emb.csv
copy to LucaProt/dataset/rdrp_40_extend/protein/binary_class/
structural info
embs
copy to LucaProt/dataset/rdrp_40_extend/protein/binary_class/embs/
tfrecords
train
copy to LucaProt/dataset/rdrp_40_extend/protein/binary_class/tfrecords/train/

dev
copy to LucaProt/dataset/rdrp_40_extend/protein/binary_class/tfrecords/dev/

test
copy to LucaProt/dataset/rdrp_40_extend/protein/binary_class/tfrecords/test/
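
After copying, it can be useful to confirm that everything landed in the expected place. The check below is a minimal sketch: the function `missing_dataset_entries` is hypothetical, and the CSV file names assume the underscore spelling (`train_with_pdb_emb.csv`, `dev_with_pdb_emb.csv`, `test_with_pdb_emb.csv`).

```python
from pathlib import Path

# Entries expected under dataset/rdrp_40_extend/protein/binary_class/,
# taken from the copy targets listed above. File names are assumed.
EXPECTED = [
    "train_with_pdb_emb.csv",
    "dev_with_pdb_emb.csv",
    "test_with_pdb_emb.csv",
    "embs",
    "tfrecords/train",
    "tfrecords/dev",
    "tfrecords/test",
]


def missing_dataset_entries(repo_root):
    """Return the expected files or directories that do not exist yet."""
    base = Path(repo_root) / "dataset" / "rdrp_40_extend" / "protein" / "binary_class"
    return [rel for rel in EXPECTED if not (base / rel).exists()]


if __name__ == "__main__":
    for rel in missing_dataset_entries("LucaProt"):
        print("missing:", rel)
```

An empty result means every listed file and directory is present; it does not verify the contents.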

Model Testing (Validation) Dataset

Three Positive Testing Datasets
- sequential info
  Neri RdRP
  copy to LucaProt/data/rdrp
  Reference: Expansion of the global RNA virome reveals diverse clades of bacteriophages
  
  Zayed RdRP
  copy to LucaProt/data/rdrp
  Reference: Cryptic and abundant marine viruses at the evolutionary origins of Earth’s RNA virome
  
  Chen RdRP
  copy to LucaProt/data/rdrp
  Reference: RNA viromes from terrestrial sites across China expand environmental viral diversity
- structural info
  Neri RdRP
  
  Zayed RdRP
  
  Chen RdRP
Three Negative Testing Datasets
- sequential info
  RT
  copy to LucaProt/data/rdrp
  
  Eu DdRP
  copy to LucaProt/data/rdrp
  
  Eu RdRP
  copy to LucaProt/data/rdrp
- structural info
  RT
  
  Eu DdRP
  
  Eu RdRP

Results

Our Checked RdRP Dataset (Our Results)
- sequential info
  ours_checked_rdrp_final.csv
- structural info
  embs
- PDB
  The 3D-structure PDB files of all our predicted results are being prepared for public release.

Self-Samples

Our Sampled Dataset
- fasta
  00self_sequecing_300aa.pep

3) Trained Model

The trained model for RdRP identification is available at:

logs
copy to LucaProt/logs/
models
copy to LucaProt/models/

10. Contributor

LucaTeam:
Yong He, Zhaorong Li, Xin Hou, Mang Shi

11. FTP

All LucaProt data is available on the website: Open Resources

12. Citation

The pre-print version:

@article { lucaprot,
author = {Xin Hou and Yong He and Pan Fang and Shi-Qiang Mei and Zan Xu and Wei-Chen Wu and Jun-Hua Tian and Shun Zhang and Zhen-Yu Zeng and Qin-Yu Gou and Gen-Yang Xin and Shi-Jia Le and Yin-Yue Xia and Yu-Lan Zhou and Feng-Ming Hui and Yuan-Fei Pan and John-Sebastian Eden and Zhao-Hui Yang and Chong Han and Yue-Long Shu and Deyin Guo and Jun Li and Edward C Holmes and Zhao-Rong Li and Mang Shi},
title = {Artificial intelligence redefines RNA virus discovery},
elocation-id = {2023.04.18.537342},
year = {2023},
doi = {10.1101/2023.04.18.537342},
publisher = {Cold Spring Harbor Laboratory},
URL = {https://www.biorxiv.org/content/early/2023/04/18/2023.04.18.537342},
eprint = {https://www.biorxiv.org/content/early/2023/04/18/2023.04.18.537342.full.pdf},
journal = {bioRxiv}
}

13. Pip

name: lucaprot
channels:
  - defaults
dependencies:
  - pip:
        - h5py==3.8.0
        - biopython==1.80
        - biotite==0.35.0
        - brotlipy==0.7.0
        - numpy==1.24.2
        - obonet==0.3.1
        - pandas==1.5.3
        - pickle5==0.0.11
        - Pillow==9.3.0
        - scikit-learn==1.2.1
        - scipy==1.10.1
        - seaborn==0.12.2
        - six==1.16.0
        - subword-nmt==0.3.8
        - tensorboard==2.11.2
        - tensorboardX==2.5.1
        - tensorflow==2.11.0
        - tensorflow-estimator==2.11.0
        - tfrecord==1.14.1
        - tokenizers==0.13.2
        - torch==1.13.1
        - torchaudio==0.13.1
        - torchvision==0.14.1
        - tqdm==4.64.1
        - transformers==4.26.0
        - huggingface-hub==0.12.0
        - matplotlib==3.6.3
        - Werkzeug==2.2.2
        - wget==3.2
        - wrapt==1.14.1
        - xgboost==1.7.3
        - zipp==3.12.0
        - lightgbm==3.3.5
        - BeautifulSoup4==4.11.1
        - requests==2.24.0
        - gemmi==0.5.8
        - networkx==3.0
        - fair-esm[esmfold]
        - dllogger @ git+https://github.com/NVIDIA/dllogger.git
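
To check that an existing environment matches these pins, a small comparison can be run from Python. This is a sketch under assumptions: the function `find_mismatches` is hypothetical, and only a few pins from the list above are shown. It relies on the standard-library `importlib.metadata` module, which is available in Python 3.9.

```python
from importlib import metadata


def find_mismatches(pins, installed_version=metadata.version):
    """Map each package whose installed version differs from its pin
    to a (wanted, found) pair; found is None if it is not installed."""
    mismatches = {}
    for name, wanted in pins.items():
        try:
            found = installed_version(name)
        except metadata.PackageNotFoundError:
            found = None
        if found != wanted:
            mismatches[name] = (wanted, found)
    return mismatches


if __name__ == "__main__":
    # A few pins copied from the dependency list above.
    pins = {"torch": "1.13.1", "transformers": "4.26.0", "numpy": "1.24.2"}
    for name, (wanted, found) in find_mismatches(pins).items():
        print(f"{name}: wanted {wanted}, found {found}")
```

The version lookup is passed in as a parameter so the comparison can be exercised without installing anything.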
