CN118782150B - A method for designing and modifying new thermostable enzymes based on a deep learning model - Google Patents

A method for designing and modifying new thermostable enzymes based on a deep learning model

Info

Publication number
CN118782150B
Authority
CN
China
Prior art keywords
sequence
model
protein
sequences
temperature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310369852.7A
Other languages
Chinese (zh)
Other versions
CN118782150A (en
Inventor
江会锋
初环宇
田振阳
程健
常宏
胡玲玲
白杰
张鹤渐
卢丽娜
刘丁玉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianjin Institute of Industrial Biotechnology of CAS
Original Assignee
Tianjin Institute of Industrial Biotechnology of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianjin Institute of Industrial Biotechnology of CAS
Priority to CN202310369852.7A
Publication of CN118782150A
Application granted
Publication of CN118782150B
Active legal status
Anticipated expiration

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B30/20 Sequence assembly
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/217 Validation; Performance evaluation; Active pattern learning techniques
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/0475 Generative networks
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B10/00 ICT specially adapted for evolutionary bioinformatics, e.g. phylogenetic tree construction or analysis
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B30/10 Sequence alignment; Homology search
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Biotechnology (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Mathematical Physics (AREA)
  • Chemical & Material Sciences (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Analytical Chemistry (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Bioethics (AREA)
  • Epidemiology (AREA)
  • Public Health (AREA)
  • Databases & Information Systems (AREA)
  • Physiology (AREA)
  • Animal Behavior & Ethology (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The invention discloses a method for designing and modifying new thermostable enzymes based on a deep learning model. New enzyme sequences with thermostability characteristics are generated directly by fusing a protein thermostability discrimination model with a protein sequence generation model. The invention uses the optimal growth temperature information of heat-resistant species and applies a deep learning model to extract their thermostability features, which avoids both the shortage of thermostable protein data and the bias introduced by manual feature extraction. A protein sequence generation model effectively expands the protein sequence space, and a data feedback loop between the two models directly generates new enzyme sequences with thermostability characteristics. The method can overcome the path dependence of traditional directed evolution, markedly reduce the size of the directed evolution screening library, and efficiently obtain new enzyme sequences with thermostability characteristics.

Description

A method for designing and modifying new thermostable enzymes based on a deep learning model
Technical Field
The invention belongs to the fields of molecular biology and protein engineering, and in particular relates to a method for designing and modifying new thermostable enzymes based on deep learning models for thermostability discrimination and protein sequence generation.
Background
Enzymes are key to modern bioengineering and have been widely used in various fields such as food, medicine and chemistry. However, most natural enzymes are easily inactivated at high temperatures, limiting their use in industrial applications. Therefore, enhancing the thermostability of enzymes has become a common goal of enzyme engineering. Currently, directed evolution is still the most common method of improving enzyme thermostability. Although this method has been successful for many years, it still faces the problems of low screening efficiency and high workload, and often requires the construction of thousands of mutants to screen a suitable sequence. In recent years, various rational design methods have been developed to enhance the thermostability of enzymes. These include adjusting amino acid preferences, cleaving flexible regions of the target protein, and increasing interactions within the enzyme molecule, such as disulfide bonds, salt bridges, and the like. Most of these methods require a deeper understanding of the structure and catalytic mechanism of the target enzymes to find key sites, which limits their application. Computational designs can facilitate predicting mutations that improve thermal stability, but the accuracy of the computational tools currently in practical use is still not ideal.
As the amount of biological data increases, machine learning methods for improving protein thermostability are developing rapidly. These data-driven methods aim to correlate protein characteristics (e.g., sequence and/or structure) with stability-related measures (e.g., Tm) to improve the thermostability of enzymes. For example, the DeepDDG method predicts the effect of point mutations on protein stability with neural networks; however, it does not take into account a wider range of interactions between amino acids and suffers from limited training data, which limits its use in enzyme engineering. The DeepET method uses large-scale genome data to predict the optimal temperature of a protein, but it is limited by the accuracy of the data, so high-accuracy prediction of protein properties is difficult to achieve and the method is not easily applied to enzyme engineering.
In recent years, protein language models (such as ESM) have developed rapidly and have become an effective means of representing protein sequences. Protein sequence generation models (such as ProteinGAN) have also developed rapidly; they help expand the protein sequence space and overcome the difficulty of combining point mutations. However, current protein sequence generation models mainly aim at generating functional protein sequences and cannot generate sequences with specified properties (such as heat resistance or acid resistance).
Therefore, there is an urgent need in the art for a deep learning-based method that extracts thermostability information from a large number of protein sequences using a protein language model and directly generates thermostable enzyme sequences with a protein sequence generation model, so that the sequence space of the enzyme can be expanded and the experimental screening workload reduced.
Disclosure of Invention
To solve the problems in the prior art, the invention provides a method for designing and modifying new thermostable enzymes based on a deep learning model, which overcomes the shortcomings of existing directed evolution and rational design methods. The invention constructs a sequence generation model and a thermostable protein discrimination model and circulates data between them as feedback, so that optimization finally yields a deep learning model that can directly generate thermostable protein sequences. The protein sequence generation model overcomes the limitation that traditional machine learning methods only predict the effect of single-point mutations on protein stability, and data optimization with the thermostable protein discrimination model overcomes the limitation that traditional sequence generation models cannot generate sequences with specified properties. In addition, the invention can efficiently generate and screen thermostable protein sequences; compared with directed evolution, the number of candidate sequences requiring experimental testing is markedly reduced and the efficiency of enzyme modification is improved.
In order to achieve the above purpose, the invention adopts the following technical scheme:
In a first aspect, the invention provides a method for designing and modifying new thermostable enzymes based on a deep learning model, comprising the following steps:
1) Collecting labeled related sequences;
2) Constructing a protein sequence generation model;
3) Constructing a thermostable protein discrimination model;
4) Tuning the protein sequence generation model with the thermostable protein discrimination model;
5) Computationally screening the generated sequences;
6) Experimentally verifying the designed sequences that are judged to be thermostable.
Preferably, step 1) specifically comprises: collecting the optimal growth temperature (OGT) information of different species from databases; defining organisms as high-temperature species (HTO), medium-temperature species (MTO) or low-temperature species (LTO) according to their optimal growth temperatures; extracting the gene sequences of these species from gene datasets according to the labeled species information and labeling them as high-temperature, medium-temperature and low-temperature sequences, respectively; and collecting enzyme sequences with the desired function to obtain potential sequences of the enzyme with the desired function for constructing the protein sequence generation model;
Preferably, step 1) defines organisms with OGT ≥ 50 ℃, 30 ℃ ≤ OGT < 50 ℃ and OGT < 30 ℃ as high-temperature species (HTO), medium-temperature species (MTO) and low-temperature species (LTO), respectively;
Further preferably, step 1) collects the OGT information of different species from the TEMPURA, ExProtDB, NCBI and BacDive databases;
Still further preferably, step 1) extracts the gene sequences of the corresponding species from the UniProt and UniRef gene datasets according to the previously labeled species information.
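The labeling rule in step 1) can be sketched in a few lines of Python. This is only an illustration of the stated thresholds; the function names and the `(species, sequence)` record layout are assumptions made for the example and do not come from the patent.

```python
def classify_ogt(ogt_celsius: float) -> str:
    """Map an optimal growth temperature to HTO / MTO / LTO.

    Thresholds follow step 1): OGT >= 50 is HTO, 30 <= OGT < 50 is MTO,
    OGT < 30 is LTO.
    """
    if ogt_celsius >= 50.0:
        return "HTO"
    if ogt_celsius >= 30.0:
        return "MTO"
    return "LTO"


def label_sequences(species_ogt, gene_records):
    """Attach a temperature label to each gene sequence.

    species_ogt: dict mapping species name -> OGT in degrees Celsius.
    gene_records: iterable of (species name, sequence) pairs (hypothetical layout).
    Records whose species has no known OGT are skipped.
    """
    labeled = []
    for species, sequence in gene_records:
        if species in species_ogt:
            labeled.append((sequence, classify_ogt(species_ogt[species])))
    return labeled
```

For example, `label_sequences({"sp1": 75.0}, [("sp1", "MKV")])` pairs the sequence with the label `HTO`.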
Preferably, step 2) specifically comprises: filtering the potential sequences of the enzyme with the desired function obtained in step 1) by sequence length; building a generative adversarial network or diffusion network model with a deep learning framework to expand the sequence space of the target enzyme; and evaluating the quality of the generation model, so that once the model is stable the generator network produces new protein sequences whose distribution is similar to that of natural sequences;
Further preferably, step 2) uses the deep learning framework TensorFlow or PyTorch to build the generative adversarial network or diffusion network model, and uses methods such as tSNE and Shannon entropy calculation to evaluate the quality of the generation model.
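The patent names Shannon entropy as one quality check but does not say how it is computed. One common reading, shown here as a minimal sketch under that assumption, is per-column entropy over a set of aligned sequences of equal length; comparing the profile of generated sequences with that of natural ones indicates whether conserved and variable positions are reproduced. The function names are illustrative.

```python
import math
from collections import Counter


def column_entropy(column: str) -> float:
    """Shannon entropy (in bits) of the symbols in one alignment column."""
    counts = Counter(column)
    total = len(column)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def positional_entropy(aligned_seqs):
    """Per-position entropy for equal-length aligned sequences."""
    if not aligned_seqs:
        raise ValueError("need at least one sequence")
    length = len(aligned_seqs[0])
    if any(len(s) != length for s in aligned_seqs):
        raise ValueError("sequences must be aligned to the same length")
    return [
        column_entropy("".join(s[i] for s in aligned_seqs))
        for i in range(length)
    ]
```

A fully conserved column has entropy 0, and a column split evenly between two residues has entropy 1 bit.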
Preferably, step 3) specifically comprises: building a classification model from the high-temperature, medium-temperature and low-temperature sequences obtained in step 1), training the model, and judging whether an input sequence has thermostability characteristics;
Further preferably, step 3) encodes the high-temperature, medium-temperature and low-temperature sequences obtained in step 1) with a protein language model, randomly divides them into a training set and a test set, builds a classification model with the deep learning framework TensorFlow or PyTorch, trains the model on the training set to judge whether an input sequence has thermostability characteristics, and, after the model is stable, assesses model quality by its accuracy on the test set and its recall for high-temperature sequences.
Preferably, step 4) specifically comprises: encoding the new protein sequences generated in step 2) with the protein language model and feeding them into the thermostable protein discrimination model constructed in step 3); selecting the new protein sequences judged to be high-temperature sequences; and using them as new training sequences to optimize the protein sequence generation model obtained in step 2), until the optimized model directly generates a large proportion of new sequences that are judged to be thermostable.
Preferably, step 5) specifically comprises: passing the sequences generated by the final protein sequence generation model to the thermostable protein discrimination model for thermostability judgment; comparing them with natural sequences for similarity and for important functional domains to evaluate foldability; performing structural modeling with AlphaFold to evaluate the quality of the generated protein sequences; and selecting an appropriate number of sequences for experimental verification according to the experimental throughput.
Preferably, step 6) specifically comprises: synthesizing the selected designed sequences by DNA synthesis, transferring them into cells and determining their expression, purifying the designed proteins, and measuring enzyme activity at different temperatures to verify whether the designed sequences are thermostable.
Further preferably, the enzyme having the desired function in step 1) is glyceraldehyde-3-phosphate dehydrogenase.
Still more preferably, the amino acid sequence of the new thermostable enzyme verified in step 6) is shown in any one of SEQ ID NO. 1-6, the thermostability referring to enzyme activity at 65 ℃.
In a second aspect, the invention provides a thermostable glyceraldehyde-3-phosphate dehydrogenase whose amino acid sequence is shown in any one of SEQ ID NO. 1-6, the thermostability referring to enzyme activity at 65 ℃.
The invention has the beneficial effects that:
The invention constructs a protein sequence generation model and a thermostable protein discrimination model and circulates data between them as feedback, thereby directly generating new thermostable enzyme sequences. The method provided by the invention can expand the sequence space of the enzyme and avoids bias in the training data, and the results show that thermostable enzyme variants can be obtained effectively, so the method has good application prospects.
Drawings
FIG. 1 is a flow chart of the present invention;
FIG. 2 is a schematic diagram of the quality evaluation of the generated sequences of the present invention;
Orange points represent the distribution of natural sequences in the tSNE two-dimensional projection space, and purple points represent the distribution of generated sequences in the same space;
FIG. 3 shows the measured enzyme activity of the new thermostable enzymes designed in the present invention.
Wherein G3PDH is the natural G3PDH enzyme of Escherichia coli; B1, C1, E, F, M and D are the thermostable G3PDH enzymes designed in Example 1; G2 and G3 are G3PDH enzymes designed in the comparative example; B1C1E_nat is the natural G3PDH enzyme closest to B1, C1 and E; and F_nat, M_nat, D1_nat, G2_nat and G3_nat are the natural G3PDH enzymes closest to F, M, D, G1, G2 and G3, respectively. Blue represents enzyme activity at 30 ℃ and red represents enzyme activity at 65 ℃.
Detailed Description
The technical scheme of the invention will be further described in detail below with reference to specific embodiments. It is to be understood that the following examples are illustrative only and are not to be construed as limiting the scope of the invention. All techniques implemented based on the above description of the invention are intended to be included within the scope of the invention.
Unless defined otherwise or clearly indicated by context, all technical and scientific terms in this disclosure have the same meaning as commonly understood by one of ordinary skill in the art to which this disclosure belongs.
Unless otherwise indicated, the materials and reagents in the following examples or comparative examples are commercially available or may be prepared by known methods. Experimental methods in the following examples for which specific conditions are not noted are generally performed under conventional conditions or under the conditions recommended by the manufacturer.
The invention builds a generation model from functional sequences sharing a similar catalytic domain to expand the sequence space; builds a thermostable protein sequence discrimination model by collecting the optimal growth temperatures of different species from different databases and extracting the genome sequences of the corresponding species; directly generates new thermostable enzyme sequences through a feedback loop of sequence information between the two models; filters and screens the output computationally to obtain designed sequences for experimental verification; and finally identifies new thermostable enzymes experimentally. In the comparative example, only a protein sequence generation model was built to generate new functional enzyme sequences; because the generated sequences were not optimized by the thermostable protein discrimination model, their activity at high temperature was poor, which demonstrates that the invention gains its ability to generate thermostable protein sequences from the data-loop optimization between the two models.
Example 1: Design of thermostable G3PDH enzyme sequences based on a deep learning model
Step 1: collecting labeled related sequences.
(1) Collecting sequences labeled with temperature information. The invention collects optimal growth temperature information for different species from four sources. The first source is the TEMPURA database, which records the growth temperatures of common and rare prokaryotes; about 8000 organisms and their optimal growth temperatures (OGT) were obtained from it. The second source is the ExProtDB database, which collects extremophile proteins and their host organisms; about 300 thermophiles were collected from it. The third source is the NCBI database: the names of all microorganisms with sequenced genomes were retrieved from NCBI and then searched in NCBI and Wikipedia; if a web page contained keywords such as "hyperthermophilic", "thermophile", "high temperature", "thermoacidophilic" or "polyextremophilic", the microorganism was checked to confirm whether it is thermophilic, and about 500 thermophiles were collected in this way. The fourth source is the BacDive database, which contains information on bacterial and archaeal biodiversity; about 5000 microorganisms with growth temperature information were collected from it. After removing duplicate species, and after individually searching the web for some organisms lacking an OGT, a total of 10,189 species were collected. Of these, 805 organisms with OGT ≥ 50 ℃, 5122 organisms with 30 ℃ ≤ OGT < 50 ℃ and 4262 organisms with OGT < 30 ℃ were defined as high-temperature species (HTO), medium-temperature species (MTO) and low-temperature species (LTO), respectively. The corresponding genes were then obtained from three downloaded gene sets (UniProt reference proteomes, UniRef90 and UniRef) and further classified into HTO, MTO and LTO genes. From the UniProt reference proteomes, 25,724,264 genes were obtained, including 1,393,345 HTO genes, 12,317,734 MTO genes and 12,013,185 LTO genes. From UniRef90, 15,901,817 genes were obtained, including 973,655 HTO genes, 7,941,331 MTO genes and 6,986,831 LTO genes. From UniRef, 2,199,998 genes were obtained, including 165,625 HTO genes, 1,120,580 MTO genes and 913,793 LTO genes. These genes are considered potential training sets for constructing the thermostable protein discrimination model.
(2) Collecting enzyme/protein sequences with a potential specific function. In this example, glyceraldehyde-3-phosphate dehydrogenase (G3PDH) genes were collected as an example of functional enzyme generation. First, a non-redundant database was searched using the Pfam domain ID PF02800, yielding 67493 genes with this domain. Second, all G3PDH genes were retrieved from the KEGG database. Next, using the G3PDH genes in NCBI as query sequences, a local blastp search was performed against KEGG, and potential G3PDH genes that were best matches with similarity greater than 40% and alignment length greater than 200 were retained, 54896 in total. After filtering out excessively long and excessively short genes, 40000 potential G3PDH genes were finally selected to construct the protein sequence generation model.
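The blastp screening criteria above could be applied along the following lines. This sketch assumes the search was written in BLAST+ tabular format (`-outfmt 6`, whose default columns are qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore), that "similarity" corresponds to the percent-identity column, and that "best match" means the highest-bitscore hit per subject gene. None of these choices is stated in the patent, and the function name is illustrative.

```python
def filter_blast_hits(lines, min_identity=40.0, min_length=200):
    """Return subject IDs whose best hit passes the identity and length cutoffs.

    lines: iterable of tab-separated BLAST+ outfmt 6 rows (12 default columns).
    The best hit per subject is the one with the highest bitscore (assumption).
    """
    best = {}  # subject id -> (pident, alignment length, bitscore)
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) < 12:
            raise ValueError("expected 12 tab-separated columns")
        subject = fields[1]
        pident = float(fields[2])
        length = int(fields[3])
        bitscore = float(fields[11])
        if subject not in best or bitscore > best[subject][2]:
            best[subject] = (pident, length, bitscore)
    return sorted(
        subject
        for subject, (pident, length, _) in best.items()
        if pident > min_identity and length > min_length
    )
```

Both cutoffs are strict inequalities, matching the wording "greater than 40%" and "greater than 200".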
Step 2: constructing the protein sequence generation model. In this example, the protein sequence generation model is built on a generative adversarial network (GAN). Specifically, the 40000 potential G3PDH genes obtained in step 1 were first filtered by length and similarity: sequences shorter than 400 with pairwise similarity between 30% and 90% were selected, giving 15454 genes for building the generation model. These 15454 gene sequences were then randomly divided into a training set and a test set, and a GAN model was built on the basis of the ProteinGAN model. The model uses ResNet blocks for both the discriminator and the generator networks; the random vector input to the generator is 128-dimensional, and the output is a 512 × 21 matrix, corresponding to a one-hot encoded sequence of length 512 over 20 amino acids plus a gap symbol at the beginning or end of the sequence. During training, the generator produces 64 sequences as a batch of generated genes; 64 natural gene sequences are drawn from the training set by weighted sampling, and both batches are passed to the discriminator. In particular, when there is a target sequence, the weights of sequences similar to it can be increased so that the generated sequences are closer to the target. Training uses the non-saturating loss function with R1 regularization, and the Adam algorithm is used for network optimization. The model converged after 150000 training steps. The discriminator scores of the generated sequences did not differ significantly from those of the test set sequences. tSNE analysis of the generated sequences (FIG. 2) shows that the generated and natural sequences have similar distributions and that the generated sequences expand the distribution space of the natural sequences.
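To make the 512 × 21 representation concrete, the sketch below one-hot encodes a protein sequence over 20 amino acids plus a gap symbol. It pads only at the end of the sequence, which is a simplification (the source allows gaps at the beginning or the end); the alphabet ordering and function names are likewise assumptions for illustration.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard residues
GAP = "-"
ALPHABET = AMINO_ACIDS + GAP          # 21 symbols in total
MAX_LEN = 512


def one_hot_encode(seq, max_len=MAX_LEN):
    """Encode a sequence as a max_len x 21 matrix (list of lists of 0/1).

    The sequence is right-padded with the gap symbol (assumption).
    Raises KeyError for symbols outside the 21-letter alphabet.
    """
    if len(seq) > max_len:
        raise ValueError("sequence longer than max_len")
    index = {symbol: i for i, symbol in enumerate(ALPHABET)}
    padded = seq + GAP * (max_len - len(seq))
    matrix = []
    for symbol in padded:
        row = [0] * len(ALPHABET)
        row[index[symbol]] = 1
        matrix.append(row)
    return matrix


def decode(matrix):
    """Invert one_hot_encode by taking the argmax of each row and stripping gaps."""
    symbols = [ALPHABET[max(range(len(row)), key=row.__getitem__)] for row in matrix]
    return "".join(symbols).strip(GAP)
```

The same `decode` step, applied to the argmax of a generator's output rows, is how a 512 × 21 output matrix would be turned back into a sequence string.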
Step 3: constructing the thermostable protein discrimination model. The HTO and LTO genes obtained in step 1 were converted into protein sequences and filtered by length; 30968 high-temperature protein sequences and 162891 low-temperature protein sequences of 300-600 amino acids were selected as the dataset and divided into a training set and a test set. The training data were then preprocessed with the protein language model ESM-1b, which encodes each sequence as a 1280-dimensional vector. A three-layer neural network classifier with layer dimensions 1280:64:16 was trained on these encoded vectors in the PyTorch framework using a binary cross-entropy loss function. Training ran for 1000 epochs, by which point the loss had stabilized. Once stable, the model reached an overall accuracy of 95% on the test set, with a recall of 78.0% for high-temperature sequences.
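Because the dataset is imbalanced (30968 high-temperature versus 162891 low-temperature sequences), overall accuracy alone can look good even when many high-temperature sequences are missed, which is presumably why recall is reported alongside it. The two metrics can be computed as in this minimal sketch, where label 1 stands for a high-temperature sequence; the label encoding and function name are assumptions.

```python
def accuracy_and_recall(y_true, y_pred, positive=1):
    """Return (overall accuracy, recall for the positive class).

    y_true, y_pred: equal-length sequences of class labels.
    Recall is NaN when the positive class never occurs in y_true.
    """
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("label lists must be non-empty and of equal length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    true_pos = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    false_neg = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    accuracy = correct / len(y_true)
    positives = true_pos + false_neg
    recall = true_pos / positives if positives else float("nan")
    return accuracy, recall
```

With labels `[1, 1, 0, 0, 0]` and predictions `[1, 0, 0, 0, 0]`, accuracy is 0.8 while recall for the positive class is only 0.5.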
Step 4: tuning the sequence generation model with the thermostable protein discrimination model. This example uses a data feedback loop that combines the GAN model with the thermostability discrimination model to enhance sampling of the region of G3PDH sequence space with thermostability characteristics. First, 100,000 G3PDH sequences were generated, and the top 20,000 by GAN discriminator score were screened for the G3PDH conserved domain. The 18238 sequences with conserved functional residues were then classified for thermostability, and the 1354 sequences judged to be high-temperature-stable were added to the training set of the GAN model for fine-tuning. After retraining, the proportion of generated sequences classified as thermostable rose from 7.4% to 14.9%, while the overall distribution of the newly generated sequences remained similar to that of the originally generated sequences.
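One round of the feedback loop can be outlined as follows. This is a schematic only: `generate`, `is_functional` and `is_thermostable` are hypothetical callables standing in for the GAN generator, the conserved-domain screen and the discrimination model, and the actual fine-tuning of the GAN on the enlarged training set is not shown.

```python
def feedback_round(generate, is_functional, is_thermostable, training_set, n_samples):
    """Run one generate-screen-classify round.

    generate: zero-argument callable returning one candidate sequence (hypothetical).
    is_functional: callable(sequence) -> bool, e.g. a conserved-domain check (hypothetical).
    is_thermostable: callable(sequence) -> bool, the discriminator's verdict (hypothetical).
    Returns the enlarged training set and the fraction of functional
    candidates that were classified as thermostable.
    """
    candidates = [generate() for _ in range(n_samples)]
    functional = [s for s in candidates if is_functional(s)]
    thermostable = [s for s in functional if is_thermostable(s)]
    fraction = len(thermostable) / len(functional) if functional else 0.0
    return list(training_set) + thermostable, fraction
```

The returned fraction is the quantity tracked in the text: 1354 of 18238 screened sequences, about 7.4%, before fine-tuning.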
Step 5: screening the generated protein sequences. In this example, 100,000 new G3PDH sequences were generated with the tuned protein sequence generation model. The top 20,000 by GAN discriminator score were screened for the G3PDH conserved domain, yielding approximately 16000 designed sequences. After filtering with the thermostable protein discrimination model, 30 designed sequences were selected for experimental verification according to their similarity to the closest natural sequence and their AlphaFold structure prediction scores; their similarity to the closest natural sequence ranged from 70% to 95%.
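The similarity criterion in this step can be illustrated with a small filter that finds each design's closest natural sequence and keeps designs whose identity to it lies in a chosen window. The sketch assumes the sequences are already aligned to equal length and uses simple percent identity as the similarity measure; the patent does not specify the measure, and the 70-95% window here simply mirrors the range reported for the selected designs.

```python
def percent_identity(a, b):
    """Percent identity between two aligned, equal-length sequences.

    Columns where both sequences have a gap are ignored; a gap never
    counts as a match.
    """
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    pairs = [(x, y) for x, y in zip(a, b) if not (x == "-" and y == "-")]
    if not pairs:
        return 0.0
    matches = sum(1 for x, y in pairs if x == y and x != "-")
    return 100.0 * matches / len(pairs)


def select_by_identity(designs, naturals, low=70.0, high=95.0):
    """Keep designs whose identity to the closest natural sequence is in [low, high].

    designs, naturals: dicts mapping name -> aligned sequence.
    Returns a list of (design name, closest natural name, identity).
    """
    selected = []
    for name, seq in designs.items():
        closest, identity = max(
            ((nat_name, percent_identity(seq, nat_seq)) for nat_name, nat_seq in naturals.items()),
            key=lambda item: item[1],
        )
        if low <= identity <= high:
            selected.append((name, closest, identity))
    return selected
```

In practice the identities would come from a proper pairwise alignment tool rather than from pre-aligned strings.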
Step 6: experimentally verifying the designed highly thermostable sequences. The gene sequences encoding the proteins were synthesized and cloned into the pET28a expression vector and, after sequence verification, the constructs were transformed into Escherichia coli BL21 (DE3). The E. coli was inoculated at a ratio of 1:160 into 2YT medium containing 50 μg/mL kanamycin and grown at 37 ℃ and 220 rpm. When the OD600 of the culture reached 0.4-0.6, IPTG was added to a final concentration of 0.5 mM to induce expression. The strain was cultured overnight at 16 ℃ and 220 rpm and then harvested by centrifugation. Cells were resuspended in lysis buffer (50 mM Tris-HCl, pH 6.8) and lysed with 2-3 passes through a high-pressure homogenizer at 1200-1500 bar. Cell debris was removed by centrifugation (10,000 g, 40 min), and the Ni-NTA agarose column was equilibrated with ddH2O and lysis buffer for 2 column volumes. The supernatant was loaded onto the column, and the proteins were eluted with gradient elution buffer (50 mM Tris-HCl containing 10 mM, 50 mM and 200 mM imidazole, respectively). The isolated proteins were analyzed by SDS-PAGE, concentrated by centrifugation (4,000 g, 30 min) in a 10 kDa ultrafiltration tube (Centriplus YM series, Millipore), and finally flash-frozen in liquid nitrogen and stored at -80 ℃.
The activity of G3PDH was detected by measuring the formation of NADH. Three parallel samples of purified protein were each mixed with 10 mM NAD in a 993 μL reaction (40 mM triethanolamine, 50 mM Na2HPO4, 5 mM EDTA, 0.1 mM DTT, pH 8.6). […] μL of DL-G3PDH solution (Sigma) was added to the above system, and A340 was measured immediately. The reaction system was incubated at 30 ℃ for 10 minutes, and A340 was then measured again. The activity of G3PDH is calculated as: activity (unit/mg/min) = ΔA340 × Vt / (6.22 × concentration (mg) × time (sec)), where Vt is the volume of the tube. For the measurement of thermostability, 100 μL of the reaction system was placed in a 96-well plate. Plates were incubated for 30 minutes at the set temperature in a thermostatted microplate reader, with continuous reading of A340. Of the 30 designed G3PDHs tested in this example, 23 proteins were correctly expressed and purified. According to the G3PDH activity assay, 17 proteins showed normal G3PDH catalytic activity at 30 ℃. Next, the thermostability of these 17 proteins at 65 ℃ was measured: 5 proteins (B1, C1, E, F and M) showed significantly higher enzymatic activity at high temperature than their closest natural sequence proteins, while 1 protein (D1) showed a slight increase relative to its closest natural sequence protein (FIG. 3). These new enzymes differ from the natural sequences by about 20-50 amino acid residues (Table 1). These results show that the invention can effectively obtain new enzymes with high thermostability.
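The activity formula can be evaluated directly, as in the sketch below. It implements the expression exactly as printed and leaves the unit bookkeeping (the formula is labeled per minute while time is given in seconds) to the caller, since the source does not resolve it. The constant 6.22 is the millimolar extinction coefficient of NADH at 340 nm; the function and parameter names are illustrative.

```python
NADH_EXTINCTION_340 = 6.22  # mM^-1 cm^-1, extinction coefficient of NADH at 340 nm


def g3pdh_activity(a340_start, a340_end, total_volume, protein_mg, time):
    """Evaluate activity = dA340 * Vt / (6.22 * protein * time) as printed in the text.

    a340_start, a340_end: absorbance at 340 nm before and after incubation.
    total_volume: Vt, the reaction volume in the tube.
    protein_mg: amount of protein in the assay.
    time: incubation time, in the unit the caller adopts for the formula.
    """
    if protein_mg <= 0 or time <= 0:
        raise ValueError("protein amount and time must be positive")
    delta_a340 = a340_end - a340_start
    return delta_a340 * total_volume / (NADH_EXTINCTION_340 * protein_mg * time)
```

Averaging the value over the three parallel samples mentioned above would give the reported activity for each protein.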
TABLE 1
Comparative Example 1: Generation of G3PDH enzyme sequences with a conventional protein sequence generation model
Step 1: Collection of enzyme/protein sequences with the potential specific function. In this comparative example, glyceraldehyde-3-phosphate dehydrogenase (G3PDH) genes were collected as an example for functional enzyme generation. First, following (2) in step 1 of Example 1, the non-redundant database was searched with the Pfam domain ID PF02800, yielding 67,493 genes containing this domain. Second, all G3PDH genes were retrieved from the KEGG database. Next, using the G3PDH genes in NCBI as query sequences, a local blastp search was run against KEGG, and potential G3PDH genes were selected that were the best match, had a similarity greater than 40%, and had an alignment length greater than 200; 54,896 genes were selected in total. After excessively long and excessively short genes were filtered out, 40,000 potential G3PDH genes were finally selected for building the generation model.
Step 2: Construction of the protein sequence generation model. In this comparative example, a sequence generation model for G3PDH genes was built on a generative adversarial network (GAN), as follows. The 40,000 potential G3PDH sequences obtained in step 1 were filtered by length and similarity: sequences with a length of less than 400 and a pairwise similarity of 30-90% were retained, giving 15,454 genes for building the generation model. The 15,454 gene sequences were then randomly split into a training set and a test set, and a GAN model was built on the basis of the ProteinGAN model. The model uses ResNet blocks in both the discriminator and the generator networks. The random vector fed to the generator is 128-dimensional, and the output is a 512 × 21 matrix corresponding to a one-hot encoded sequence of length 512 over 20 amino acids plus a gap symbol at the beginning or end of the sequence. During training, the generator produces a batch of 64 generated sequences, which are mixed with 64 natural gene sequences drawn from the training set by weighted sampling and then passed to the discriminator. In particular, when there is a target sequence, the weights of sequences similar to the target can be increased so that the generated sequences are closer to it. Training used the non-saturating loss function with R1 regularization, and the Adam algorithm was used for network optimization. The model converged after 150,000 training steps, and the generated sequences did not differ significantly from the test-set sequences in discriminator score.
Step 3: Screening of the generated sequences. In this comparative example, 100,000 new G3PDH sequences were generated with the protein sequence generation model. The top 20,000 sequences by GAN discriminator score were screened for the conserved G3PDH domain, yielding approximately 16,000 design sequences. Based on their similarity to the closest natural sequence and their AlphaFold structure prediction score, 6 design sequences, with similarities to their closest natural sequences distributed between 70% and 95%, were selected for experimental verification.
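The blastp screening in this step can be illustrated with a short filter over tabular BLAST output. This is a sketch under stated assumptions: it expects the standard 12-column tabular format (`-outfmt 6`), treats percent identity as the "similarity" named in the text, and interprets "best match" as the highest bit score per database hit. The function name is hypothetical.

```python
def filter_blast_hits(tabular_text, min_identity=40.0, min_length=200):
    """Keep, for each subject (database) gene, its best-scoring alignment
    that exceeds the identity and alignment-length thresholds.

    Assumes 12-column tabular output: qseqid sseqid pident length mismatch
    gapopen qstart qend sstart send evalue bitscore.
    """
    best = {}
    for line in tabular_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment headers
        fields = line.split("\t")
        query, subject = fields[0], fields[1]
        identity, length = float(fields[2]), int(fields[3])
        bitscore = float(fields[11])
        if identity <= min_identity or length <= min_length:
            continue
        current = best.get(subject)
        if current is None or bitscore > current["bitscore"]:
            best[subject] = {
                "query": query,
                "identity": identity,
                "length": length,
                "bitscore": bitscore,
            }
    return best
```

The later length filter ("excessively long and excessively short genes") is not sketched because the text does not state its cut-offs.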
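The 512 × 21 sequence encoding and the weighted sampling of natural sequences described in this step can be sketched as follows. This is an illustration only, not the ProteinGAN code: the amino-acid ordering, the use of the last column for the gap symbol, padding at the end only, and both function names are assumptions.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed column order for the 20 residues
GAP_INDEX = len(AMINO_ACIDS)          # assumed: column 20 encodes the gap/padding


def one_hot_encode(sequence, length=512):
    """Return a length x 21 one-hot matrix (list of lists), gap-padded at the end."""
    if len(sequence) > length:
        raise ValueError("sequence is longer than the encoding length")
    matrix = []
    for position in range(length):
        row = [0] * (len(AMINO_ACIDS) + 1)
        if position < len(sequence):
            # str.index raises ValueError for non-standard residues
            row[AMINO_ACIDS.index(sequence[position])] = 1
        else:
            row[GAP_INDEX] = 1
        matrix.append(row)
    return matrix


def sample_natural_batch(sequences, weights, batch_size=64, seed=None):
    """Draw a batch of natural sequences with replacement, proportional to weights.

    Raising the weight of sequences similar to a target biases the batch
    towards that target, as described in the text.
    """
    rng = random.Random(seed)
    return rng.choices(sequences, weights=weights, k=batch_size)
```

Sampling with replacement is a simplification; the text does not say whether the weighted sampling allows repeats within a batch.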
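The screening funnel in this step (discriminator ranking, conserved-domain check, identity window, structure score) can be sketched as a single function. The dictionary keys `disc_score`, `has_domain`, `identity`, and `structure_score` are hypothetical field names, and the sketch assumes these values were computed beforehand by the discriminator, a domain search, a sequence alignment, and AlphaFold respectively.

```python
def screen_generated(candidates, top_n=20000, min_identity=70.0, max_identity=95.0):
    """Rank by discriminator score, keep the top_n, then require the conserved
    domain and an identity to the closest natural sequence within the window.

    Survivors are returned ordered by structure-prediction score, best first.
    """
    ranked = sorted(candidates, key=lambda c: c["disc_score"], reverse=True)
    shortlisted = ranked[:top_n]
    kept = [
        c for c in shortlisted
        if c["has_domain"] and min_identity <= c["identity"] <= max_identity
    ]
    return sorted(kept, key=lambda c: c["structure_score"], reverse=True)
```

Taking the first 6 entries of the returned list would mirror the number of designs sent for experimental verification, although the text does not state that the final choice was a strict top-6 cut.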
Step 4: Experimental verification of the designed high-thermostability sequences. The gene sequences encoding the proteins were synthesized and cloned into the pET28a expression vector; after sequence verification, the constructed vectors were transformed into Escherichia coli BL21(DE3). The E. coli cells were inoculated at a ratio of 1:160 into 2YT medium containing 50 μg/mL kanamycin and grown at 37 °C and 220 rpm. When the OD600 of the culture reached 0.4-0.6, IPTG was added to a final concentration of 0.5 mM to induce expression. The strain was cultured overnight at 16 °C and 220 rpm and then harvested by centrifugation. Cells were resuspended in lysis buffer (50 mM Tris-HCl, pH 6.8) and lysed by 2-3 passes through a high-pressure homogenizer at 1200-1500 bar. Cell debris was removed by centrifugation (10,000 × g, 40 min), and the Ni-NTA agarose column was equilibrated with ddH2O and lysis buffer for 2 column volumes. The supernatant was loaded onto the column, and the proteins were then eluted with a stepwise gradient of elution buffer (50 mM Tris-HCl containing 10 mM, 50 mM, and 200 mM imidazole, respectively). The isolated proteins were analyzed by SDS-PAGE, concentrated by centrifugation (4,000 × g, 30 min) in a 10 kDa ultrafiltration tube (Centriplus YM series, Millipore), and finally flash-frozen in liquid nitrogen and stored at -80 °C.
G3PDH activity was detected by measuring the formation of NADH. Three replicate samples of purified protein were each mixed with 10 mM NAD in a 993 μL reaction (40 mM triethanolamine, 50 mM Na2HPO4, 5 mM EDTA, 0.1 mM DTT, pH 8.6). μL of DL-G3PDH solution (Sigma) was added to the above system, and A340 was measured immediately. The reaction system was incubated at 30 °C for 10 minutes, and A340 was then measured again. G3PDH activity was calculated as: units/mg/min = ΔA340 × Vt (volume of tube) / (6.22 × concentration (mg) × time (sec)). For the thermal stability measurement, 100 μL of the reaction system was placed in a 96-well plate. Plates were incubated at the designated temperature in a thermostated microplate reader for 30 minutes, with A340 read continuously. In this comparative example, of the 6 designed G3PDHs tested, 3 proteins (G1, G2, and G3) were correctly expressed and purified; their amino acid sequences are shown in SEQ ID NO. 7-9, respectively. According to the G3PDH activity assay, these 3 proteins showed normal G3PDH catalytic activity. The thermal stability of these 3 proteins was then measured at 65 °C: neither G2 nor G3 showed activity at high temperature, and only G1 showed weak high-temperature activity. This comparative example shows that it is difficult to obtain new enzymes with thermostable properties using an existing protein sequence generation model alone, and that the thermostable protein discrimination model and the feedback optimization loop between the two models in the present invention are critical for generating new thermostable enzymes.
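The feedback loop between the generation model and the discrimination model, which the comparative example lacks, can be outlined schematically. This is a sketch and not the patented implementation: `generate`, `is_thermostable`, and `fine_tune` are hypothetical callables standing in for the two trained models, and the stopping fraction of 0.8 is an assumed stand-in for the "large proportion" that the document does not quantify.

```python
def feedback_optimize(generate, is_thermostable, fine_tune,
                      n_samples=1000, target_fraction=0.8, max_rounds=10):
    """Repeatedly sample from the generator, keep the sequences the
    discriminator judges thermostable, and fine-tune the generator on them.

    Stops once the thermostable fraction reaches target_fraction (assumed
    threshold) or after max_rounds. Returns the fraction seen in each round.
    """
    history = []
    for _ in range(max_rounds):
        sequences = generate(n_samples)
        stable = [seq for seq in sequences if is_thermostable(seq)]
        fraction = len(stable) / len(sequences) if sequences else 0.0
        history.append(fraction)
        if fraction >= target_fraction:
            break
        # Feed the predicted-thermostable sequences back as new training data.
        fine_tune(stable)
    return history
```

Passing the models in as callables keeps the loop independent of any particular deep learning framework.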
The embodiments of the present invention have been described above. However, the present invention is not limited to the above embodiments. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (10)

1. A method for designing and modifying new enzymes based on a deep learning model, characterized by comprising the following steps:
1) Collecting and labeling relevant sequences: collecting optimal growth temperature (OGT) information for different species from databases; defining organisms as high-temperature species (HTO), medium-temperature species (MTO), or low-temperature species (LTO) according to their optimal growth temperatures; extracting the gene sequences of these species from a gene dataset according to the species labels; and labeling the gene sequences as high-temperature, medium-temperature, and low-temperature sequences, respectively;
2) Constructing a protein sequence generation model: filtering the relevant sequences of potential enzymes with the expected function obtained in step 1) by sequence length; building a generation model with a deep learning framework to expand the sequence space of the target enzyme; evaluating the quality of the generation model with a relevant algorithm; and, once the model is stable, using the generation model to generate new protein sequences whose distribution is similar to that of natural sequences;
3) Constructing a thermostable protein discrimination model: building a classification model from the high-temperature, medium-temperature, and low-temperature sequences obtained in step 1), and training the model to judge whether an input sequence has thermostability characteristics;
4) Inputting the new protein sequences generated in step 2) into the thermostable protein discrimination model constructed in step 3); selecting the new protein sequences classified as high-temperature sequences as new training sequences; and optimizing the protein sequence generation model obtained in step 2) until the optimized model can directly generate a large proportion of new sequences that are judged to be thermostable;
5) Screening the generated sequences: passing the sequences generated by the final protein sequence generation model into the thermostable protein discrimination model for thermostability judgment; evaluating the quality of the sequences by a computational method; and selecting a suitable number of sequences for experimental verification according to the throughput of the experimental assay;
6) Experimentally verifying the designed sequences judged to be thermostable: obtaining the selected design sequences by experimental methods, determining their expression, and then experimentally verifying whether they are thermostable.
2. The method according to claim 1, wherein step 1) defines organisms with OGT ≥ 50 °C, 30 °C ≤ OGT < 50 °C, and OGT < 30 °C as high-temperature species (HTO), medium-temperature species (MTO), and low-temperature species (LTO), respectively.
3. The method according to claim 1, wherein step 1) collects optimal growth temperature (OGT) information for different species from the TEMPURA, exProtDB, NCBI, or BacDive databases.
4. The method according to claim 1, wherein step 1) extracts the corresponding gene sequences of the species from the UniProt or UniRef gene datasets according to the previously labeled species information.
5. The method according to claim 1, wherein step 2) uses the deep learning framework TensorFlow or PyTorch to build a generation model based on a generative adversarial network or a diffusion network.
6. The method according to claim 1, wherein step 2) uses t-SNE or Shannon entropy calculation to evaluate the quality of the generation model.
7. The method according to claim 1, wherein step 3) encodes the high-temperature, medium-temperature, and low-temperature sequences obtained in step 1) with the protein language model ESM-1b, randomly divides the sequences into a training set and a test set, builds a classification model with the deep learning framework TensorFlow or PyTorch, trains the model on the training set, and judges whether an input sequence has thermostability characteristics.
8. The method according to claim 1, wherein step 5) uses AlphaFold for structural modeling to evaluate the quality of the generated protein sequences.
9. The method according to any one of claims 1 to 8, wherein the enzyme of the desired function of step 1) is glyceraldehyde-3-phosphate dehydrogenase.
10. The method according to claim 9, wherein step 6) verifies that the amino acid sequence of the obtained thermostable novel enzyme is shown in any one of SEQ ID No.1 to 6, said thermostability being the enzyme activity at 65 ℃.
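The temperature thresholds recited in claim 2 translate directly into a labeling rule. The sketch below is illustrative only: the function names and the dictionary-based inputs are hypothetical, and sequences from species without an OGT record are simply skipped, which is an assumption rather than something the claims specify.

```python
def label_by_ogt(ogt_celsius):
    """Classify a species by optimal growth temperature (claim 2 thresholds)."""
    if ogt_celsius >= 50:
        return "HTO"  # high-temperature species
    if ogt_celsius >= 30:
        return "MTO"  # medium-temperature species
    return "LTO"      # low-temperature species


def label_sequences(species_ogt, sequence_species):
    """Map each sequence ID to the temperature class of its source species.

    species_ogt: {species name: OGT in degrees Celsius}
    sequence_species: {sequence ID: species name}
    """
    labels = {}
    for seq_id, species in sequence_species.items():
        if species in species_ogt:
            labels[seq_id] = label_by_ogt(species_ogt[species])
    return labels
```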
CN202310369852.7A 2023-04-07 2023-04-07 A method for designing and modifying new thermostable enzymes based on a deep learning model Active CN118782150B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310369852.7A CN118782150B (en) 2023-04-07 2023-04-07 A method for designing and modifying new thermostable enzymes based on a deep learning model

Publications (2)

Publication Number Publication Date
CN118782150A CN118782150A (en) 2024-10-15
CN118782150B true CN118782150B (en) 2025-09-02

Family

ID=92990563

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310369852.7A Active CN118782150B (en) 2023-04-07 2023-04-07 A method for designing and modifying new thermostable enzymes based on a deep learning model

Country Status (1)

Country Link
CN (1) CN118782150B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN119920310B (en) * 2025-01-02 2025-12-26 湖北大学 Protein thermal stability point mutation prediction model construction method and prediction method
CN120932732B (en) * 2025-10-14 2026-01-16 之江实验室 A method, apparatus, device, and storage medium for predicting protein sequence properties.

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110517730A (en) * 2019-09-02 2019-11-29 河南师范大学 A method of thermophilic protein is identified based on machine learning
CN113539374A (en) * 2021-06-29 2021-10-22 深圳先进技术研究院 Protein sequence generation method, apparatus, medium and apparatus for high thermostable enzymes

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120265513A1 (en) * 2011-04-08 2012-10-18 Jianwen Fang Methods and systems for designing stable proteins
WO2015048572A1 (en) * 2013-09-27 2015-04-02 Codexis, Inc. Automated screening of enzyme variants
CN116434844A (en) * 2019-05-19 2023-07-14 贾斯特-埃沃泰克生物制品有限公司 Methods and systems for generating amino acid sequences of proteins
US20220367007A1 (en) * 2019-09-27 2022-11-17 Uab Biomatter Designs Method for generating functional protein sequences with generative adversarial networks
EP4143830A1 (en) * 2020-04-27 2023-03-08 Flagship Pioneering Innovations VI, LLC Optimizing proteins using model based optimizations

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant