CN110600085B - Tree-LSTM-based organic matter physicochemical property prediction method
- Publication number: CN110600085B (application CN201910500140.8A)
- Authority: CN (China)
- Prior art keywords: tree, lstm, physical, molecular, organic
- Prior art date: 2019-06-01
- Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G—PHYSICS; G06N—Computing arrangements based on specific computational models; G06N 3/044—Recurrent networks, e.g. Hopfield networks
- G—PHYSICS; G06N—Computing arrangements based on specific computational models; G06N 3/045—Combinations of networks
- G—PHYSICS; G16C—Computational chemistry; chemoinformatics; G16C 20/30—Prediction of properties of chemical compounds, compositions or mixtures
- G—PHYSICS; G16C—Computational chemistry; chemoinformatics; G16C 20/70—Machine learning, data mining or chemometrics
Abstract
A method for predicting the physicochemical properties of organic compounds based on Tree-LSTM comprises two parts: generating a prediction model and predicting physicochemical properties. Generating the prediction model comprises the following steps: 1) the molecular structure of the organic compound is normalized and encoded, and a tree-shaped data structure (the molecular feature descriptor) is generated; 2) a Tree-LSTM model is trained with the molecular feature descriptors and experimental physicochemical-property data of the organic compounds to obtain a Tree-LSTM-based physicochemical property prediction model. Predicting the physicochemical properties comprises: normalizing and encoding the molecular structure and inputting it into the prediction model to obtain the predicted physicochemical properties of the organic compound. The invention enables a computer to automatically extract the relationship between molecular structure and physicochemical properties, is well suited to learning the molecular structure information of a wide range of organic compounds, and yields better prediction results.
Description
Technical Field
The invention relates to the field of chemistry (C07), and in particular to a method for predicting quantitative structure-property relationships of chemical substances based on artificial intelligence technology.
Background
Physicochemical properties are basic data closely tied to chemistry and chemical engineering. Properties such as critical properties, boiling point, heat of formation, and the octanol-water partition coefficient are closely related to research and production practice in chemistry and chemical engineering, and scientifically sound predicted values for these properties can reduce the amount of experimental measurement required and save considerable manpower and material resources. Experimental acquisition of physicochemical property data is often difficult because of harsh measurement conditions or objective factors such as the easy decomposition of the substance to be measured, so these properties are currently estimated mainly by the group contribution method and by topological index methods based on multiple linear regression. However, both the group contribution method and the topological index method require manual extraction of molecular structural features before prediction, which limits their range of application.
The Tree-LSTM recurrent neural network is an improvement on the LSTM (Long Short-Term Memory) recurrent neural network. It can learn dependency relationships more complex than a sequential structure and can autonomously learn, from the input data, the contribution of the molecular tree topology to the predicted quantity. In particular, it overcomes the inability of other neural networks to reproduce the atomic connectivity within a molecule, making it better suited to mining the implicit relationship between molecular structure and physicochemical properties. The existing group contribution method must decompose molecules into different groups (molecular substructure fragments) and then predicts the physicochemical properties of organic compounds by multiple linear fitting. Different group contribution schemes use different decomposition rules, and for some molecules no suitable decomposition can be found, so the prediction is biased or cannot be completed at all. Existing topological index methods are limited by the complexity of computing the topological indices and cannot intuitively represent the local structure of a molecule, so they lack broad physicochemical property prediction capability. To date, no method has emerged that predicts the physicochemical properties of organic compounds using a Tree-LSTM recurrent neural network alone.
Disclosure of Invention
The invention provides a Tree-LSTM-based method for predicting the physicochemical properties of organic compounds, which addresses the shortcomings of the prior art: a narrow prediction range, limited coverage of substances, and low prediction accuracy.
To solve these problems, the invention adopts the following technical scheme:
The method comprises two parts: Step A, generating a prediction model; Step B, predicting physicochemical properties.
Step A comprises the following steps:
A1: acquire experimental physicochemical-property data and molecular structure information for organic compounds, gathering a large amount of data from various databases using web crawler technology;
A2: normalize the molecular structure of each organic compound (with a graph canonicalization algorithm), traverse every atom in the molecule and generate the corresponding atomic feature descriptors, sort all atomic feature descriptors of the molecule in lexicographic order, and take the smallest as the molecular feature descriptor;
A3: following step A2, generate, from all the obtained organic molecular structures, the molecular feature descriptor representing each normalized molecular structure graph and its corresponding linear code;
A4: split all organic molecules into their constituent chemical bonds, arrange the character strings representing the bonds molecule by molecule, and generate word vectors for these strings using a word embedding algorithm;
A5: build a Tree-LSTM-based neural network model and load the physicochemical data acquired in A1 together with the molecular structure data processed in A2-A4; the Tree-LSTM automatically adapts to the topology of the normalized molecular structure graph. Manually adjust the hyperparameters, train the model, and select the best-performing parameters from the training process to obtain the Tree-LSTM-based physicochemical property prediction model.
Step B comprises the following step:
B1: process the molecular structure of an organic compound lacking experimental data for a given physicochemical property using steps A2-A4, load the generated feature descriptor and code into the physicochemical property prediction model obtained in A5, and obtain from the molecular feature descriptor the predicted data for the unknown physicochemical property.
As a further refinement, step A5 comprises the following:
A51: build a Tree-LSTM model under a Linux or Windows system;
A52: set the input dimension of the Tree-LSTM and the length of the input data; A53: set the ratio of training-set to test-set data; A54: set the Tree-LSTM optimizer and learning rate; A55: set the width of the hidden-layer neurons; A56: set the number of training iterations of the model; A57: continuously adjust the parameters, check the convergence of the model from the model loss, and select the parameters with the best convergence to form the Tree-LSTM-based physicochemical property prediction model.
Drawings
FIG. 1 is a flow chart of the organic-compound physicochemical property prediction of the invention;
FIG. 2 is the computational graph of the Tree-LSTM recurrent neural network when predicting a property of acetaldoxime;
FIG. 3 shows the prediction performance of the Tree-LSTM physicochemical property prediction model on the critical temperature of organic compounds; the crosses represent predicted values and the straight line represents experimental values;
FIG. 4 is an example of molecular feature descriptor generation, using acetaldoxime;
FIG. 5 shows the coding rules of the molecular feature descriptor, explaining the meaning of each bit of the code.
Detailed Description
The invention will now be described in detail with reference to the drawings and specific examples. It should be noted that the examples below serve only to aid understanding of the invention and do not limit it.
The invention provides a Tree-LSTM-based method for predicting the physicochemical properties of organic compounds, as shown in fig. 1, comprising two parts: Step A, generating a prediction model; Step B, predicting physicochemical properties.
Step A, generating the prediction model:
a1, acquiring experimental data of physical and chemical properties of the organic matters and molecular structure information of the organic matters, and capturing a large amount of data from various databases by utilizing a web crawler technology.
The physicochemical properties of the a11 organic matter mainly comprise: critical properties, normal boiling point, transfer properties, self-ignition point, flash point, toxicity, octanol water partition coefficient, biochemical activity, etc.
The A12 molecular structure information mainly takes SMILES expression, SMART expression, MOL file and SDF file as carriers.
A2: normalize the structure of each organic molecule, traverse every atom in the molecule and generate the corresponding atomic feature descriptors, sort all atomic feature descriptors of the molecule in lexicographic order, take the smallest as the molecular feature descriptor, and encode it.
A21: generate a normalized (canonical) graph from the two-dimensional topological graph of the organic molecule using a graph canonicalization algorithm from graph theory, so that isomorphic molecular graphs can be compared; the Nauty and Faulon canonicalization algorithms may be used.
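To make the canonicalization step concrete, the following Python sketch uses RDKit's canonical atom ranking as a stand-in for the Nauty/Faulon algorithms named above; RDKit, the example SMILES, and the helper function name are illustrative assumptions, not the patent's implementation.

```python
# Hypothetical sketch: canonical atom ordering via RDKit, standing in for the
# Nauty/Faulon graph canonicalization step described in A21.
from rdkit import Chem

def canonical_atom_order(smiles: str):
    """Return the atom indices of a molecule sorted by canonical rank."""
    mol = Chem.MolFromSmiles(smiles)        # parse the SMILES expression
    ranks = Chem.CanonicalRankAtoms(mol)    # canonical rank of each atom index
    # Sorting by rank gives the same order for isomorphic molecular graphs.
    return sorted(range(mol.GetNumAtoms()), key=lambda i: ranks[i])

print(canonical_atom_order("CC=NO"))        # "CC=NO" assumed here as acetaldoxime
```

Any canonicalization that yields identical orderings for isomorphic graphs can play the same role; the choice of library does not affect the rest of the pipeline.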
A22: the encoding can be done in either of two ways:
Method one: directly use the molecular feature descriptor output by the Faulon canonicalization algorithm as the code of the organic compound, as illustrated in fig. 4.
Method two: encode the molecular feature descriptor in a linear coding format, as exemplified in Table 1.
A3: following step A2, generate the molecular feature descriptor and corresponding code of every molecule from the molecular structure information of all the organic compounds obtained.
A4: split all organic molecules into their constituent chemical bonds, arrange the character strings representing the bonds molecule by molecule, and generate word vectors for these strings using a word embedding algorithm.
A5: build a Tree-LSTM-based neural network model, load the physicochemical data acquired in A1 and the molecular structure data processed in A2-A4, continuously adjust the parameters, and select the best-performing parameters to obtain the Tree-LSTM-based physicochemical property prediction model.
Step B comprises the following step:
B1: process the molecular structure of an organic compound lacking experimental data for a given physicochemical property using steps A2-A4, and load the generated feature descriptor and code into the physicochemical property prediction model to obtain the data of the unknown physicochemical property.
Step A4 further comprises the following:
A41: traverse each molecule in the database; taking each atom of the molecule as a starting point, traverse the chemical bonds and atoms connected to it, form character strings of the form "A-B", and record them as the raw data. Here "A" is the element symbol of atom A, "B" is the element symbol of atom B, and "-" denotes the type of chemical bond between atom A and atom B.
A42: split each character string in the raw data, such as "A-B", into sub-strings in three ways to form the sub-string set: combination one, "A" and "-B"; combination two, "A-" and "B"; combination three, "A", "-" and "B".
A43: build a neural network based on the skip-gram algorithm under a Linux or Windows system to obtain the embedding vector of every character string in the string set produced by A42; a minimal sketch is given below.
As a further refinement, step A5 includes the following (a sketch of the training setup follows this list):
A51: build a Tree-LSTM model under a Linux or Windows system; parse the feature descriptor or linear code of each molecule into a tree-shaped data structure and match each node of the tree (that is, each atom of the molecule) with the corresponding embedding vector obtained in A4.
A52: set the input dimension of the Tree-LSTM and the length of the input data; in the invention the input dimension is 1 and the length is 50.
A53: set the ratio of training-set to test-set data; in the invention the ratio is 4:1.
A54: set the Tree-LSTM optimizer and learning rate; the invention uses the Adam optimizer with a learning rate of 0.001.
A55: set the width of each hidden layer in neurons.
A56: set the number of training iterations of the model.
A57: adjust the number of hidden-layer nodes at a fixed number of iterations and adjust the number of iterations at a fixed number of hidden-layer nodes; check the convergence of the model from the overall loss and the per-iteration loss, and select the parameters with the best convergence to form the Tree-LSTM-based physicochemical property prediction model.
The structure of the Tree-LSTM neural network is shown in fig. 2.
The Tree-LSTM has two mathematical models: a child-node-addition model and a child-node-independent model.
The core of the Tree-LSTM is the control of the cell state c through three gates: the forget gate f_j, the input gate i_j, and the output gate o_j. For the current node j, the forget gate f_jk controls how much of each child node's cell state c_k is retained in the current cell state c_j; the input gate i_j controls how much of the current node's candidate state is written into c_j; the candidate cell state u_j controls how much new node information is added; and the output gate o_j controls how much of the current cell state c_j becomes the hidden-layer output h_j of the current node. The equations of the child-node-addition model include:

$f_{jk} = \sigma\left(W^{(f)} x_j + U^{(f)} h_k + b^{(f)}\right)$  (2)

$c_j = i_j \odot u_j + \sum_{k} f_{jk} \odot c_k$  (6)

$h_j = o_j \odot \tanh(c_j)$  (7)

where $W^{(f)}$, $W^{(i)}$, $W^{(o)}$ are the weight matrices of the forget, input, and output gates, $b^{(f)}$, $b^{(i)}$, $b^{(o)}$ are the corresponding bias terms, and $\sigma$ is the sigmoid function. The corresponding equations of the child-node-independent model are:

$c_j = i_j \odot u_j + \sum_{l} f_{jl} \odot c_{jl}$  (14)

$h_j = o_j \odot \tanh(c_j)$  (15)

The two models differ in whether the hidden states $h_{jl}$ of the child nodes are summed: the child-node-independent model assigns a separate set of parameters to the $h_{jl}$ of each child node, whereas the child-node-addition model provides training parameters for the sum of the children's $h_{jl}$.
The computation of a Tree-LSTM cell is illustrated in fig. 2. The inputs of the cell are: the cell states $c_{jl}$ of the child nodes, the hidden-layer outputs $h_{jl}$ of the child nodes, and the input value $x_j$ of the current node. The outputs of the cell are: the current cell state $c_j$ and the current hidden-layer output $h_j$.
The candidate cell state $u_j$ is determined jointly by the input $x_j$ of the current node and the hidden-layer outputs $h_{jl}$ of the child nodes (in the child-node-addition model, by the sum of the $h_{jl}$); its calculation is given by equation (4) or (12), where $W^{(u)}$ is the weight matrix of the candidate cell state, $b^{(u)}$ is its bias term, and tanh is the hyperbolic tangent function. The current cell state $c_j$ is determined jointly by the forget gates $f_{jl}$ (acting on the child cell states $c_{jl}$), the input gate $i_j$, and the candidate state $u_j$, as given by equation (6) or (14), where $\odot$ denotes element-wise multiplication. The hidden-layer output $h_j$ of the current node is determined jointly by the output gate $o_j$ and the current cell state $c_j$, as given by equation (7) or (15).
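As one concrete reading of the child-node-addition (child-sum) equations above, here is a minimal PyTorch sketch of a single Tree-LSTM node update; the class name, tensor shapes, and the fused i/o/u projection are implementation assumptions, and the recursion over a whole molecular tree is omitted.

```python
# Child-sum Tree-LSTM node update, following equations (2), (6) and (7) above.
import torch
from torch import nn

class ChildSumTreeLSTMCell(nn.Module):
    def __init__(self, in_dim: int, hidden_dim: int):
        super().__init__()
        # i, o and u share the summed child hidden state; f uses each child separately.
        self.W_iou = nn.Linear(in_dim, 3 * hidden_dim)
        self.U_iou = nn.Linear(hidden_dim, 3 * hidden_dim, bias=False)
        self.W_f = nn.Linear(in_dim, hidden_dim)
        self.U_f = nn.Linear(hidden_dim, hidden_dim, bias=False)

    def forward(self, x_j, child_h, child_c):
        # x_j: (in_dim,) input of node j; child_h, child_c: (num_children, hidden_dim)
        h_sum = child_h.sum(dim=0)                                      # sum of child hidden states
        i, o, u = torch.chunk(self.W_iou(x_j) + self.U_iou(h_sum), 3, dim=-1)
        i, o, u = torch.sigmoid(i), torch.sigmoid(o), torch.tanh(u)
        f = torch.sigmoid(self.W_f(x_j) + self.U_f(child_h))            # one forget gate per child, eq. (2)
        c_j = i * u + (f * child_c).sum(dim=0)                          # eq. (6)
        h_j = o * torch.tanh(c_j)                                       # eq. (7)
        return h_j, c_j

cell = ChildSumTreeLSTMCell(in_dim=50, hidden_dim=128)
h, c = cell(torch.randn(50), torch.zeros(2, 128), torch.zeros(2, 128))  # a node with two children
print(h.shape, c.shape)                                                 # torch.Size([128]) twice
```

A driver routine would apply this cell bottom-up over the tree parsed from the molecular feature descriptor, feeding each node the embedding vector of its atom as x_j.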
The output of the Tree-LSTM neural network is produced by a single-layer or multi-layer neural network. For example, with a single-layer neural network as the output layer, the calculation is:

$p_i = w \cdot h_j + b$  (16)

where the property $p_i$ of the i-th compound is determined by the Tree-LSTM output $h_j$ of the root node of the tree structure represented by that compound's molecular feature descriptor, and $w$ and $b$ are trainable parameters.
The invention adopts the mean squared error (MSE) or the mean absolute error (MAE) as the loss function:

$\mathrm{MSE} = \frac{1}{N}\sum_{i=1}^{N}\left(x_{\mathrm{exp},i} - x_{\mathrm{pred},i}\right)^2, \qquad \mathrm{MAE} = \frac{1}{N}\sum_{i=1}^{N}\left|x_{\mathrm{exp},i} - x_{\mathrm{pred},i}\right|$

where $N$ is the number of samples, $x_{\mathrm{exp}}$ is the experimental (observed) value, and $x_{\mathrm{pred}}$ is the predicted value.
Experimental example
The effect of the Tree-LSTM-based physicochemical property prediction method is illustrated below using the critical temperature of organic compounds as an example. This property serves as basic data for many thermodynamic and physical-property estimation models, so its prediction is both practical and representative.
Experimental critical-temperature data and the molecular structure information of the corresponding substances were collected for a total of 1759 organic compounds, of which 1407 were used as the training set and 352 as the test set.
The construction of the molecular feature descriptor is illustrated with acetaldoxime, one of the samples, as shown in fig. 4. The molecular feature descriptor is a data structure that stores molecular structure information by taking one atom of the molecule as the starting point and expanding the molecule as a tree. In this example the expansion of acetaldoxime starts from the carbon atom labelled zero. Starting from this root atom C0, the structure is searched downward to a predetermined distance (height), and the atoms encountered on each path, together with the types of the chemical bonds connecting them, are recorded to capture the features of the molecule. Traversing all atoms of the molecule as root atoms yields the atomic feature descriptors; different root atoms give different atomic feature descriptors. These descriptors are sorted in lexicographic order, and the first (smallest) one is taken as the molecular feature descriptor. Fig. 4 uses acetaldoxime to show (A) the molecular structure, (B) the tree expansion of the molecular structure, and (C) the atomic feature descriptors for height = 0 and height = 1. The child atoms of an atom are enclosed in nested brackets; when no bond type is given, the bond to the parent atom is a single bond, otherwise the bond is written as "=" for a double bond, "#" for a triple bond, and ":" for an aromatic bond.
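To make the tree expansion concrete, the sketch below generates height-limited atomic feature descriptors with RDKit and takes the lexicographically smallest as the molecular feature descriptor; the bracket notation, the SMILES "CC=NO" for acetaldoxime, and the helper names approximate fig. 4 but are illustrative assumptions, not the patent's exact format.

```python
# Illustrative atomic/molecular feature descriptor generation (RDKit assumed).
from rdkit import Chem

BOND_SYMBOL = {"DOUBLE": "=", "TRIPLE": "#", "AROMATIC": ":"}   # single bonds are left implicit

def atom_descriptor(mol, idx, height, visited=frozenset()):
    """Expand the molecule as a tree from atom `idx`, down to `height` bonds."""
    atom = mol.GetAtomWithIdx(idx)
    if height == 0:
        return atom.GetSymbol()
    parts = []
    for bond in atom.GetBonds():
        nbr = bond.GetOtherAtom(atom).GetIdx()
        if nbr in visited:                                       # do not walk back up the path
            continue
        sym = BOND_SYMBOL.get(bond.GetBondType().name, "")
        parts.append(sym + atom_descriptor(mol, nbr, height - 1, visited | {idx}))
    return atom.GetSymbol() + ("(" + "".join(sorted(parts)) + ")" if parts else "")

mol = Chem.MolFromSmiles("CC=NO")                                # acetaldoxime, assumed example
descriptors = sorted(atom_descriptor(mol, i, 2) for i in range(mol.GetNumAtoms()))
molecular_descriptor = descriptors[0]                            # lexicographically smallest
print(descriptors, molecular_descriptor)
```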
To store molecular feature descriptors conveniently, the invention introduces a linear code that represents the tree expansion of a molecule. Linear codes of molecular feature descriptors at various depths are shown in Table 1; atoms are separated in the code string by "|", and the meanings of the digits and letters within each atom field are given in fig. 5. The first atom is the root atom, whose current depth is 0, denoted "S"; since it has no parent atom, the parent-atom field is encoded as "S", and since there is no bond to a parent atom, that field is also encoded as "S".
The 1759 organic compounds were converted into molecular feature descriptors and linearly encoded. Before being fed into the neural network, the codes were parsed into tree structures, and each node (atom) was associated with the embedding vector obtained in step A43. For every molecule in the sample, each atom corresponds to a node of the Tree-LSTM neural network, and the atom's embedding vector is the input vector of that node. With 300 initial iterations, the width of the Tree-LSTM hidden layer (the number of output nodes of each Tree-LSTM cell) was adjusted repeatedly; 128 was finally taken as the optimum in this example. The structure of the Tree-LSTM network is determined by the molecule itself: it is a dynamic neural network that adapts to the topology of each molecule. In this example the learning rate was 0.008 for the first 300 training iterations and was then reduced to 0.00001 for a further 5000 iterations. To prevent overfitting, training was terminated early once the loss function value stopped decreasing. The prediction results in Table 3 were obtained; the closer the predicted values are to the experimental values in the table, the better the prediction. Table 2 gives the statistical evaluation parameters of the Tree-LSTM neural network for training and prediction of the critical temperature of organic compounds. In fig. 3, the crosses represent predicted values and the straight line represents experimental values; for most data points the Tree-LSTM gives a good prediction.
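A hedged sketch of the two-stage training schedule and early stopping described above, in PyTorch; the dummy data, placeholder linear model, batch size, patience value, and loss threshold are assumptions, while the 300 iterations at learning rate 0.008, the further 5000 at 0.00001, and stopping once the loss no longer decreases come from the text.

```python
# Two-stage learning-rate schedule with a simple early stop on a plateauing loss.
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

model = nn.Linear(50, 1)                                   # placeholder for the Tree-LSTM model
loss_fn = nn.MSELoss()
train_loader = DataLoader(TensorDataset(torch.randn(64, 50), torch.randn(64, 1)), batch_size=16)

best_loss, patience, bad_epochs = float("inf"), 20, 0      # patience value is an assumption
for epochs, lr in [(300, 0.008), (5000, 1e-5)]:            # schedule taken from the example above
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        for x_batch, y_batch in train_loader:
            optimizer.zero_grad()
            loss = loss_fn(model(x_batch), y_batch)
            loss.backward()
            optimizer.step()
        if loss.item() < best_loss - 1e-6:                 # last-batch loss as a simple proxy
            best_loss, bad_epochs = loss.item(), 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:                     # terminate early to prevent overfitting
                break
```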
Table 1. Examples of linear coding of molecular feature descriptors
Table 2. Statistical parameters for training and prediction of the critical temperature of organic compounds
Table 3. Partial prediction results for the critical temperature of organic compounds
The invention was compared with two representative group contribution methods, the Joback method and the Constantinou-Gani (CG) method, on the same list of substances; the results are shown in Table 4:
Table 4. Comparison of the predictive power of the invention with classical group contribution methods
The list of substances used for the comparison in Table 4 contains 460 substances in total. The Joback method can predict only 352 of them; when the method of the invention is applied to those same 352 substances, it outperforms the Joback method. The CG method can predict fewer substances than the invention, with slightly worse accuracy. When the invention is applied to the full list, it covers 452 of the substances and achieves acceptable accuracy. Superscript a denotes all predictable substances; superscript b denotes substances with more than three carbon atoms.
Claims (2)
1. A method for predicting the physicochemical properties of organic compounds based on Tree-LSTM, characterized in that the molecular graph of an organic compound is converted into a canonical graph so that a computer can recognize and learn it, thereby enabling the computer to capture the structural features of the molecule and to correlate those features with the physical or chemical properties of the compound, finally realizing prediction of the properties of the substance; the method comprises: Step A, generating a prediction model; Step B, predicting physicochemical properties;
Step A comprises the following steps:
A1: acquire experimental physicochemical-property data and molecular structure information for organic compounds, gathering a large amount of data from various databases using web crawler technology;
A2: normalize the structure of each organic molecule with a graph canonicalization algorithm, traverse every atom in the molecule and generate the corresponding atomic feature descriptors, sort all atomic feature descriptors of the molecule in lexicographic order, and take the smallest as the molecular feature descriptor;
A3: following step A2, generate the molecular feature descriptor and corresponding code of every molecule from the molecular structure information of all the organic compounds obtained;
A4: split all organic molecules into their constituent chemical bonds, arrange the character strings representing the bonds molecule by molecule, and generate word vectors for these strings using a word embedding algorithm;
A5: build a Tree-LSTM-based neural network model and load the physicochemical data acquired in A1 together with the molecular structure data processed in A2-A4, the Tree-LSTM automatically adapting to the topology of the normalized molecular structure graph; manually adjust the hyperparameters, train the model, and select the best-performing parameters from the training process to obtain the Tree-LSTM-based physicochemical property prediction model;
Step B comprises the following step:
B1: process the molecular structure of an organic compound lacking experimental data for a given physicochemical property using steps A2-A4, load the generated feature descriptor and code into the physicochemical property prediction model obtained in A5, and obtain from the molecular feature descriptor the predicted data for the unknown physicochemical property.
2. The Tree-LSTM-based method for predicting the physicochemical properties of organic compounds according to claim 1, wherein step A5 comprises the following steps:
A51: build a Tree-LSTM-based neural network under a Linux or Windows system; A52: set the input dimension of the Tree-LSTM and the length of the input data; A53: set the ratio of training-set to test-set data; A54: set the Tree-LSTM optimizer and learning rate; A55: set the width of the hidden-layer neurons; A56: set the number of training iterations of the model; A57: continuously adjust the parameters, check the convergence of the model from the model loss, and select the parameters with the best convergence to form the Tree-LSTM-based physicochemical property prediction model.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910500140.8A CN110600085B (en) | 2019-06-01 | 2019-06-01 | Tree-LSTM-based organic matter physicochemical property prediction method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110600085A CN110600085A (en) | 2019-12-20 |
CN110600085B true CN110600085B (en) | 2024-04-09 |
Family
ID=68852617
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910500140.8A Active CN110600085B (en) | 2019-06-01 | 2019-06-01 | Tree-LSTM-based organic matter physicochemical property prediction method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110600085B (en) |
Families Citing this family (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111524557B (en) * | 2020-04-24 | 2024-04-05 | 腾讯科技(深圳)有限公司 | Inverse synthesis prediction method, device, equipment and storage medium based on artificial intelligence |
CN111710375B (en) * | 2020-05-13 | 2023-07-04 | 中国科学院计算机网络信息中心 | A method and system for predicting molecular properties |
CN111899807B (en) * | 2020-06-12 | 2024-05-28 | 中国石油天然气股份有限公司 | Molecular structure generation method, system, equipment and storage medium |
CN111899814B (en) * | 2020-06-12 | 2024-05-28 | 中国石油天然气股份有限公司 | Single molecule and mixture physical property calculation method, device and storage medium |
CN115171807B (en) * | 2022-09-07 | 2022-12-06 | 合肥机数量子科技有限公司 | Molecular coding model training method, molecular coding method and molecular coding system |
Family Cites Families (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20150017694A1 (en) * | 2008-11-06 | 2015-01-15 | Kiverdi, Inc. | Engineered CO2-Fixing Chemotrophic Microorganisms Producing Carbon-Based Products and Methods of Using the Same |
US10430685B2 (en) * | 2016-11-16 | 2019-10-01 | Facebook, Inc. | Deep multi-scale video prediction |
US10699185B2 (en) * | 2017-01-26 | 2020-06-30 | The Climate Corporation | Crop yield estimation using agronomic neural network |
EP3474201A1 (en) * | 2017-10-17 | 2019-04-24 | Tata Consultancy Services Limited | System and method for quality evaluation of collaborative text inputs |
- 2019-06-01: application CN201910500140.8A filed in China (granted as CN110600085B, status Active)
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109476721A (en) * | 2016-04-04 | 2019-03-15 | 英蒂分子公司 | CD8- specificity capturing agent, composition and use and preparation method |
CN108108836A (en) * | 2017-12-15 | 2018-06-01 | 清华大学 | A kind of ozone concentration distribution forecasting method and system based on space-time deep learning |
CN109033738A (en) * | 2018-07-09 | 2018-12-18 | 湖南大学 | A kind of pharmaceutical activity prediction technique based on deep learning |
Non-Patent Citations (1)
Title |
---|
Research on the Application of Deep Neural Networks in Chemistry; Qin Qifeng et al.; Jiangxi Chemical Industry (江西化工); 2018-06-15, No. 03, pp. 1-4 *
Also Published As
Publication number | Publication date |
---|---|
CN110600085A (en) | 2019-12-20 |
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| GR01 | Patent grant | |