CN110600085B - Tree-LSTM-based organic matter physicochemical property prediction method
- Publication number: CN110600085B (application CN201910500140.8A)
- Authority: CN (China)
- Prior art keywords: tree, lstm, physical, molecular, organic
- Prior art date: 2019-06-01
- Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G—PHYSICS; G06N—Computing arrangements based on specific computational models; G06N 3/044—Recurrent networks, e.g. Hopfield networks
- G—PHYSICS; G06N—Computing arrangements based on specific computational models; G06N 3/045—Combinations of networks
- G—PHYSICS; G16C—Computational chemistry; chemoinformatics; G16C 20/30—Prediction of properties of chemical compounds, compositions or mixtures
- G—PHYSICS; G16C—Computational chemistry; chemoinformatics; G16C 20/70—Machine learning, data mining or chemometrics
Abstract
A method for predicting the physicochemical properties of organic compounds based on Tree-LSTM comprises two parts: generating a prediction model and predicting physicochemical properties. Generating the prediction model comprises the following steps: 1) the molecular structure of the organic compound is normalized and encoded, and a tree-shaped data structure (the molecular feature descriptor) is generated; 2) a Tree-LSTM model is trained with the molecular feature descriptors and experimental physicochemical-property data of the organic compounds to obtain a Tree-LSTM-based physicochemical property prediction model. Predicting the physicochemical properties comprises: normalizing and encoding the molecular structure and inputting it into the prediction model to obtain the predicted physicochemical properties of the organic compound. The invention enables a computer to automatically extract the relationship between molecular structure and physicochemical properties, is well suited to learning the molecular structure information of a wide range of organic compounds, and yields better prediction results.
Description
Technical Field
The invention relates to the field of chemistry (C07), and in particular to a method for predicting quantitative structure-property relationships of chemical substances based on artificial intelligence technology.
Background
Physicochemical properties are basic data closely tied to chemistry and chemical engineering. Properties such as critical properties, boiling point, heat of formation, and the octanol-water partition coefficient are closely related to research and production practice in chemistry and chemical engineering, and scientifically sound predicted values for these properties can reduce the amount of experimental measurement required and save considerable manpower and material resources. Experimental acquisition of physicochemical property data is often difficult because of harsh measurement conditions or objective factors such as the easy decomposition of the substance to be measured, so these properties are currently estimated mainly by the group contribution method and by topological index methods based on multiple linear regression. However, both the group contribution method and the topological index method require manual extraction of molecular structural features before prediction, which limits their range of application.
The Tree-LSTM recurrent neural network is an improvement on the LSTM (Long Short-Term Memory) recurrent neural network. It can learn dependency relationships more complex than a sequential structure and can autonomously learn, from the input data, the contribution of the molecular tree topology to the predicted quantity. In particular, it overcomes the inability of other neural networks to reproduce the atomic connectivity within a molecule, making it better suited to mining the implicit relationship between molecular structure and physicochemical properties. The existing group contribution method must decompose molecules into different groups (molecular substructure fragments) and then predicts the physicochemical properties of organic compounds by multiple linear fitting. Different group contribution schemes use different decomposition rules, and for some molecules no suitable decomposition can be found, so the prediction is biased or cannot be completed at all. Existing topological index methods are limited by the complexity of computing the topological indices and cannot intuitively represent the local structure of a molecule, so they lack broad physicochemical property prediction capability. To date, no method has emerged that predicts the physicochemical properties of organic compounds using a Tree-LSTM recurrent neural network alone.
Disclosure of Invention
The invention provides a Tree-LSTM-based method for predicting the physicochemical properties of organic compounds, which addresses the shortcomings of the prior art: a narrow prediction range, limited coverage of substances, and low prediction accuracy.
To solve these problems, the invention adopts the following technical scheme:
The method comprises two parts: Step A, generating a prediction model; Step B, predicting physicochemical properties.
Step A comprises the following steps:
A1: acquire experimental physicochemical-property data and molecular structure information for organic compounds, gathering a large amount of data from various databases using web crawler technology;
A2: normalize the molecular structure of each organic compound (with a graph canonicalization algorithm), traverse every atom in the molecule and generate the corresponding atomic feature descriptors, sort all atomic feature descriptors of the molecule in lexicographic order, and take the smallest as the molecular feature descriptor;
A3: following step A2, generate, from all the obtained organic molecular structures, the molecular feature descriptor representing each normalized molecular structure graph and its corresponding linear code;
A4: split all organic molecules into their constituent chemical bonds, arrange the character strings representing the bonds molecule by molecule, and generate word vectors for these strings using a word embedding algorithm;
A5: build a Tree-LSTM-based neural network model and load the physicochemical data acquired in A1 together with the molecular structure data processed in A2-A4; the Tree-LSTM automatically adapts to the topology of the normalized molecular structure graph. Manually adjust the hyperparameters, train the model, and select the best-performing parameters from the training process to obtain the Tree-LSTM-based physicochemical property prediction model.
Step B comprises the following step:
B1: process the molecular structure of an organic compound lacking experimental data for a given physicochemical property using steps A2-A4, load the generated feature descriptor and code into the physicochemical property prediction model obtained in A5, and obtain from the molecular feature descriptor the predicted data for the unknown physicochemical property.
As a further refinement, step A5 comprises the following:
A51: build a Tree-LSTM model under a Linux or Windows system;
A52: set the input dimension of the Tree-LSTM and the length of the input data; A53: set the ratio of training-set to test-set data; A54: set the Tree-LSTM optimizer and learning rate; A55: set the width of the hidden-layer neurons; A56: set the number of training iterations of the model; A57: continuously adjust the parameters, check the convergence of the model from the model loss, and select the parameters with the best convergence to form the Tree-LSTM-based physicochemical property prediction model.
Drawings
FIG. 1 is a flow chart of the organic-compound physicochemical property prediction of the invention;
FIG. 2 is the computational graph of the Tree-LSTM recurrent neural network when predicting a property of acetaldoxime;
FIG. 3 shows the prediction performance of the Tree-LSTM physicochemical property prediction model on the critical temperature of organic compounds; the crosses represent predicted values and the straight line represents experimental values;
FIG. 4 is an example of molecular feature descriptor generation, using acetaldoxime;
FIG. 5 shows the coding rules of the molecular feature descriptor, explaining the meaning of each bit of the code.
Detailed Description
The invention will now be described in detail with reference to the drawings and specific examples. It should be noted that the examples below serve only to aid understanding of the invention and do not limit it.
The invention provides a Tree-LSTM-based method for predicting the physicochemical properties of organic compounds, as shown in fig. 1, comprising two parts: Step A, generating a prediction model; Step B, predicting physicochemical properties.
Step A, generating the prediction model:
a1, acquiring experimental data of physical and chemical properties of the organic matters and molecular structure information of the organic matters, and capturing a large amount of data from various databases by utilizing a web crawler technology.
The physicochemical properties of the a11 organic matter mainly comprise: critical properties, normal boiling point, transfer properties, self-ignition point, flash point, toxicity, octanol water partition coefficient, biochemical activity, etc.
The A12 molecular structure information mainly takes SMILES expression, SMART expression, MOL file and SDF file as carriers.
A2: normalize the structure of each organic molecule, traverse every atom in the molecule and generate the corresponding atomic feature descriptors, sort all atomic feature descriptors of the molecule in lexicographic order, take the smallest as the molecular feature descriptor, and encode it.
A21: generate a normalized (canonical) graph from the two-dimensional topological graph of the organic molecule using a graph canonicalization algorithm from graph theory, so that isomorphic molecular graphs can be compared; the Nauty and Faulon canonicalization algorithms may be used.
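To make the canonicalization step concrete, the following Python sketch uses RDKit's canonical atom ranking as a stand-in for the Nauty/Faulon algorithms named above; RDKit, the example SMILES, and the helper function name are illustrative assumptions, not the patent's implementation.

```python
# Hypothetical sketch: canonical atom ordering via RDKit, standing in for the
# Nauty/Faulon graph canonicalization step described in A21.
from rdkit import Chem

def canonical_atom_order(smiles: str):
    """Return the atom indices of a molecule sorted by canonical rank."""
    mol = Chem.MolFromSmiles(smiles)        # parse the SMILES expression
    ranks = Chem.CanonicalRankAtoms(mol)    # canonical rank of each atom index
    # Sorting by rank gives the same order for isomorphic molecular graphs.
    return sorted(range(mol.GetNumAtoms()), key=lambda i: ranks[i])

print(canonical_atom_order("CC=NO"))        # "CC=NO" assumed here as acetaldoxime
```

Any canonicalization that yields identical orderings for isomorphic graphs can play the same role; the choice of library does not affect the rest of the pipeline.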
A22: the encoding can be done in either of two ways:
Method one: directly use the molecular feature descriptor output by the Faulon canonicalization algorithm as the code of the organic compound, as illustrated in fig. 4.
Method two: encode the molecular feature descriptor in a linear coding format, as exemplified in Table 1.
A3: following step A2, generate the molecular feature descriptor and corresponding code of every molecule from the molecular structure information of all the organic compounds obtained.
A4: split all organic molecules into their constituent chemical bonds, arrange the character strings representing the bonds molecule by molecule, and generate word vectors for these strings using a word embedding algorithm.
A5: build a Tree-LSTM-based neural network model, load the physicochemical data acquired in A1 and the molecular structure data processed in A2-A4, continuously adjust the parameters, and select the best-performing parameters to obtain the Tree-LSTM-based physicochemical property prediction model.
Step B comprises the following step:
B1: process the molecular structure of an organic compound lacking experimental data for a given physicochemical property using steps A2-A4, and load the generated feature descriptor and code into the physicochemical property prediction model to obtain the data of the unknown physicochemical property.
Step A4 further comprises the following:
A41: traverse each molecule in the database; taking each atom of the molecule as a starting point, traverse the chemical bonds and atoms connected to it, form character strings of the form "A-B", and record them as the raw data. Here "A" is the element symbol of atom A, "B" is the element symbol of atom B, and "-" denotes the type of chemical bond between atom A and atom B.
A42: split each character string in the raw data, such as "A-B", into sub-strings in three ways to form the sub-string set: combination one, "A" and "-B"; combination two, "A-" and "B"; combination three, "A", "-" and "B".
A43: build a neural network based on the skip-gram algorithm under a Linux or Windows system to obtain the embedding vector of every character string in the string set produced by A42; a minimal sketch is given below.
As a further refinement, step A5 includes the following (a sketch of the training setup follows this list):
A51: build a Tree-LSTM model under a Linux or Windows system; parse the feature descriptor or linear code of each molecule into a tree-shaped data structure and match each node of the tree (that is, each atom of the molecule) with the corresponding embedding vector obtained in A4.
A52: set the input dimension of the Tree-LSTM and the length of the input data; in the invention the input dimension is 1 and the length is 50.
A53: set the ratio of training-set to test-set data; in the invention the ratio is 4:1.
A54: set the Tree-LSTM optimizer and learning rate; the invention uses the Adam optimizer with a learning rate of 0.001.
A55: set the width of each hidden layer in neurons.
A56: set the number of training iterations of the model.
A57: adjust the number of hidden-layer nodes at a fixed number of iterations and adjust the number of iterations at a fixed number of hidden-layer nodes; check the convergence of the model from the overall loss and the per-iteration loss, and select the parameters with the best convergence to form the Tree-LSTM-based physicochemical property prediction model.
The structure of the Tree-LSTM neural network is shown in fig. 2.
The Tree-LSTM has two mathematical models: a child-node-addition model and a child-node-independent model.
The core of the Tree-LSTM is the control of the cell state c through three gates: the forget gate f_j, the input gate i_j, and the output gate o_j. For the current node j, the forget gate f_jk controls how much of each child node's cell state c_k is retained in the current cell state c_j; the input gate i_j controls how much of the current node's candidate state is written into c_j; the candidate cell state u_j controls how much new node information is added; and the output gate o_j controls how much of the current cell state c_j becomes the hidden-layer output h_j of the current node. The equations of the child-node-addition model include:

$f_{jk} = \sigma\left(W^{(f)} x_j + U^{(f)} h_k + b^{(f)}\right)$  (2)

$c_j = i_j \odot u_j + \sum_{k} f_{jk} \odot c_k$  (6)

$h_j = o_j \odot \tanh(c_j)$  (7)

where $W^{(f)}$, $W^{(i)}$, $W^{(o)}$ are the weight matrices of the forget, input, and output gates, $b^{(f)}$, $b^{(i)}$, $b^{(o)}$ are the corresponding bias terms, and $\sigma$ is the sigmoid function. The corresponding equations of the child-node-independent model are:

$c_j = i_j \odot u_j + \sum_{l} f_{jl} \odot c_{jl}$  (14)

$h_j = o_j \odot \tanh(c_j)$  (15)

The two models differ in whether the hidden states $h_{jl}$ of the child nodes are summed: the child-node-independent model assigns a separate set of parameters to the $h_{jl}$ of each child node, whereas the child-node-addition model provides training parameters for the sum of the children's $h_{jl}$.
The computation of a Tree-LSTM cell is illustrated in fig. 2. The inputs of the cell are: the cell states $c_{jl}$ of the child nodes, the hidden-layer outputs $h_{jl}$ of the child nodes, and the input value $x_j$ of the current node. The outputs of the cell are: the current cell state $c_j$ and the current hidden-layer output $h_j$.
The candidate cell state $u_j$ is determined jointly by the input $x_j$ of the current node and the hidden-layer outputs $h_{jl}$ of the child nodes (in the child-node-addition model, by the sum of the $h_{jl}$); its calculation is given by equation (4) or (12), where $W^{(u)}$ is the weight matrix of the candidate cell state, $b^{(u)}$ is its bias term, and tanh is the hyperbolic tangent function. The current cell state $c_j$ is determined jointly by the forget gates $f_{jl}$ (acting on the child cell states $c_{jl}$), the input gate $i_j$, and the candidate state $u_j$, as given by equation (6) or (14), where $\odot$ denotes element-wise multiplication. The hidden-layer output $h_j$ of the current node is determined jointly by the output gate $o_j$ and the current cell state $c_j$, as given by equation (7) or (15).
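As one concrete reading of the child-node-addition (child-sum) equations above, here is a minimal PyTorch sketch of a single Tree-LSTM node update; the class name, tensor shapes, and the fused i/o/u projection are implementation assumptions, and the recursion over a whole molecular tree is omitted.

```python
# Child-sum Tree-LSTM node update, following equations (2), (6) and (7) above.
import torch
from torch import nn

class ChildSumTreeLSTMCell(nn.Module):
    def __init__(self, in_dim: int, hidden_dim: int):
        super().__init__()
        # i, o and u share the summed child hidden state; f uses each child separately.
        self.W_iou = nn.Linear(in_dim, 3 * hidden_dim)
        self.U_iou = nn.Linear(hidden_dim, 3 * hidden_dim, bias=False)
        self.W_f = nn.Linear(in_dim, hidden_dim)
        self.U_f = nn.Linear(hidden_dim, hidden_dim, bias=False)

    def forward(self, x_j, child_h, child_c):
        # x_j: (in_dim,) input of node j; child_h, child_c: (num_children, hidden_dim)
        h_sum = child_h.sum(dim=0)                                      # sum of child hidden states
        i, o, u = torch.chunk(self.W_iou(x_j) + self.U_iou(h_sum), 3, dim=-1)
        i, o, u = torch.sigmoid(i), torch.sigmoid(o), torch.tanh(u)
        f = torch.sigmoid(self.W_f(x_j) + self.U_f(child_h))            # one forget gate per child, eq. (2)
        c_j = i * u + (f * child_c).sum(dim=0)                          # eq. (6)
        h_j = o * torch.tanh(c_j)                                       # eq. (7)
        return h_j, c_j

cell = ChildSumTreeLSTMCell(in_dim=50, hidden_dim=128)
h, c = cell(torch.randn(50), torch.zeros(2, 128), torch.zeros(2, 128))  # a node with two children
print(h.shape, c.shape)                                                 # torch.Size([128]) twice
```

A driver routine would apply this cell bottom-up over the tree parsed from the molecular feature descriptor, feeding each node the embedding vector of its atom as x_j.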
The output of the Tree-LSTM neural network is produced by a single-layer or multi-layer neural network. For example, with a single-layer neural network as the output layer, the calculation is:

$p_i = w \cdot h_j + b$  (16)

where the property $p_i$ of the i-th compound is determined by the Tree-LSTM output $h_j$ of the root node of the tree structure represented by that compound's molecular feature descriptor, and $w$ and $b$ are trainable parameters.
The invention adopts the mean squared error (MSE) or the mean absolute error (MAE) as the loss function:

$\mathrm{MSE} = \frac{1}{N}\sum_{i=1}^{N}\left(x_{\mathrm{exp},i} - x_{\mathrm{pred},i}\right)^2, \qquad \mathrm{MAE} = \frac{1}{N}\sum_{i=1}^{N}\left|x_{\mathrm{exp},i} - x_{\mathrm{pred},i}\right|$

where $N$ is the number of samples, $x_{\mathrm{exp}}$ is the experimental (observed) value, and $x_{\mathrm{pred}}$ is the predicted value.
Experimental example
The effect of the Tree-LSTM-based physicochemical property prediction method is illustrated below using the critical temperature of organic compounds as an example. This property serves as basic data for many thermodynamic and physical-property estimation models, so its prediction is both practical and representative.
Experimental critical-temperature data and the molecular structure information of the corresponding substances were collected for a total of 1759 organic compounds, of which 1407 were used as the training set and 352 as the test set.
The construction of the molecular feature descriptor is illustrated with acetaldoxime, one of the samples, as shown in fig. 4. The molecular feature descriptor is a data structure that stores molecular structure information by taking one atom of the molecule as the starting point and expanding the molecule as a tree. In this example the expansion of acetaldoxime starts from the carbon atom labelled zero. Starting from this root atom C0, the structure is searched downward to a predetermined distance (height), and the atoms encountered on each path, together with the types of the chemical bonds connecting them, are recorded to capture the features of the molecule. Traversing all atoms of the molecule as root atoms yields the atomic feature descriptors; different root atoms give different atomic feature descriptors. These descriptors are sorted in lexicographic order, and the first (smallest) one is taken as the molecular feature descriptor. Fig. 4 uses acetaldoxime to show (A) the molecular structure, (B) the tree expansion of the molecular structure, and (C) the atomic feature descriptors for height = 0 and height = 1. The child atoms of an atom are enclosed in nested brackets; when no bond type is given, the bond to the parent atom is a single bond, otherwise the bond is written as "=" for a double bond, "#" for a triple bond, and ":" for an aromatic bond.
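To make the tree expansion concrete, the sketch below generates height-limited atomic feature descriptors with RDKit and takes the lexicographically smallest as the molecular feature descriptor; the bracket notation, the SMILES "CC=NO" for acetaldoxime, and the helper names approximate fig. 4 but are illustrative assumptions, not the patent's exact format.

```python
# Illustrative atomic/molecular feature descriptor generation (RDKit assumed).
from rdkit import Chem

BOND_SYMBOL = {"DOUBLE": "=", "TRIPLE": "#", "AROMATIC": ":"}   # single bonds are left implicit

def atom_descriptor(mol, idx, height, visited=frozenset()):
    """Expand the molecule as a tree from atom `idx`, down to `height` bonds."""
    atom = mol.GetAtomWithIdx(idx)
    if height == 0:
        return atom.GetSymbol()
    parts = []
    for bond in atom.GetBonds():
        nbr = bond.GetOtherAtom(atom).GetIdx()
        if nbr in visited:                                       # do not walk back up the path
            continue
        sym = BOND_SYMBOL.get(bond.GetBondType().name, "")
        parts.append(sym + atom_descriptor(mol, nbr, height - 1, visited | {idx}))
    return atom.GetSymbol() + ("(" + "".join(sorted(parts)) + ")" if parts else "")

mol = Chem.MolFromSmiles("CC=NO")                                # acetaldoxime, assumed example
descriptors = sorted(atom_descriptor(mol, i, 2) for i in range(mol.GetNumAtoms()))
molecular_descriptor = descriptors[0]                            # lexicographically smallest
print(descriptors, molecular_descriptor)
```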
To store molecular feature descriptors conveniently, the invention introduces a linear code that represents the tree expansion of a molecule. Linear codes of molecular feature descriptors at various depths are shown in Table 1; atoms are separated in the code string by "|", and the meanings of the digits and letters within each atom field are given in fig. 5. The first atom is the root atom, whose current depth is 0, denoted "S"; since it has no parent atom, the parent-atom field is encoded as "S", and since there is no bond to a parent atom, that field is also encoded as "S".
The 1759 organic compounds were converted into molecular feature descriptors and linearly encoded. Before being fed into the neural network, the codes were parsed into tree structures, and each node (atom) was associated with the embedding vector obtained in step A43. For every molecule in the sample, each atom corresponds to a node of the Tree-LSTM neural network, and the atom's embedding vector is the input vector of that node. With 300 initial iterations, the width of the Tree-LSTM hidden layer (the number of output nodes of each Tree-LSTM cell) was adjusted repeatedly; 128 was finally taken as the optimum in this example. The structure of the Tree-LSTM network is determined by the molecule itself: it is a dynamic neural network that adapts to the topology of each molecule. In this example the learning rate was 0.008 for the first 300 training iterations and was then reduced to 0.00001 for a further 5000 iterations. To prevent overfitting, training was terminated early once the loss function value stopped decreasing. The prediction results in Table 3 were obtained; the closer the predicted values are to the experimental values in the table, the better the prediction. Table 2 gives the statistical evaluation parameters of the Tree-LSTM neural network for training and prediction of the critical temperature of organic compounds. In fig. 3, the crosses represent predicted values and the straight line represents experimental values; for most data points the Tree-LSTM gives a good prediction.
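A hedged sketch of the two-stage training schedule and early stopping described above, in PyTorch; the dummy data, placeholder linear model, batch size, patience value, and loss threshold are assumptions, while the 300 iterations at learning rate 0.008, the further 5000 at 0.00001, and stopping once the loss no longer decreases come from the text.

```python
# Two-stage learning-rate schedule with a simple early stop on a plateauing loss.
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

model = nn.Linear(50, 1)                                   # placeholder for the Tree-LSTM model
loss_fn = nn.MSELoss()
train_loader = DataLoader(TensorDataset(torch.randn(64, 50), torch.randn(64, 1)), batch_size=16)

best_loss, patience, bad_epochs = float("inf"), 20, 0      # patience value is an assumption
for epochs, lr in [(300, 0.008), (5000, 1e-5)]:            # schedule taken from the example above
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        for x_batch, y_batch in train_loader:
            optimizer.zero_grad()
            loss = loss_fn(model(x_batch), y_batch)
            loss.backward()
            optimizer.step()
        if loss.item() < best_loss - 1e-6:                 # last-batch loss as a simple proxy
            best_loss, bad_epochs = loss.item(), 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:                     # terminate early to prevent overfitting
                break
```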
Table 1. Examples of linear coding of molecular feature descriptors
Table 2. Statistical parameters for training and prediction of the critical temperature of organic compounds
Table 3. Partial prediction results for the critical temperature of organic compounds
The invention was compared with two representative group contribution methods, the Joback method and the Constantinou-Gani (CG) method, on the same list of substances; the results are shown in Table 4:
Table 4. Comparison of the predictive power of the invention with classical group contribution methods
The list of substances used for the comparison in Table 4 contains 460 substances in total. The Joback method can predict only 352 of them; when the method of the invention is applied to those same 352 substances, it outperforms the Joback method. The CG method can predict fewer substances than the invention, with slightly worse accuracy. When the invention is applied to the full list, it covers 452 of the substances and achieves acceptable accuracy. Superscript a denotes all predictable substances; superscript b denotes substances with more than three carbon atoms.
Claims (2)
1. A method for predicting the physicochemical properties of organic compounds based on Tree-LSTM, characterized in that the molecular graph of an organic compound is converted into a canonical graph so that a computer can recognize and learn it, thereby enabling the computer to capture the structural features of the molecule and to correlate those features with the physical or chemical properties of the compound, finally realizing prediction of the properties of the substance; the method comprises: Step A, generating a prediction model; Step B, predicting physicochemical properties;
Step A comprises the following steps:
A1: acquire experimental physicochemical-property data and molecular structure information for organic compounds, gathering a large amount of data from various databases using web crawler technology;
A2: normalize the structure of each organic molecule with a graph canonicalization algorithm, traverse every atom in the molecule and generate the corresponding atomic feature descriptors, sort all atomic feature descriptors of the molecule in lexicographic order, and take the smallest as the molecular feature descriptor;
A3: following step A2, generate the molecular feature descriptor and corresponding code of every molecule from the molecular structure information of all the organic compounds obtained;
A4: split all organic molecules into their constituent chemical bonds, arrange the character strings representing the bonds molecule by molecule, and generate word vectors for these strings using a word embedding algorithm;
A5: build a Tree-LSTM-based neural network model and load the physicochemical data acquired in A1 together with the molecular structure data processed in A2-A4, the Tree-LSTM automatically adapting to the topology of the normalized molecular structure graph; manually adjust the hyperparameters, train the model, and select the best-performing parameters from the training process to obtain the Tree-LSTM-based physicochemical property prediction model;
Step B comprises the following step:
B1: process the molecular structure of an organic compound lacking experimental data for a given physicochemical property using steps A2-A4, load the generated feature descriptor and code into the physicochemical property prediction model obtained in A5, and obtain from the molecular feature descriptor the predicted data for the unknown physicochemical property.
2. The Tree-LSTM-based method for predicting the physicochemical properties of organic compounds according to claim 1, wherein step A5 comprises the following steps:
A51: build a Tree-LSTM-based neural network under a Linux or Windows system; A52: set the input dimension of the Tree-LSTM and the length of the input data; A53: set the ratio of training-set to test-set data; A54: set the Tree-LSTM optimizer and learning rate; A55: set the width of the hidden-layer neurons; A56: set the number of training iterations of the model; A57: continuously adjust the parameters, check the convergence of the model from the model loss, and select the parameters with the best convergence to form the Tree-LSTM-based physicochemical property prediction model.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910500140.8A CN110600085B (en) | 2019-06-01 | 2019-06-01 | Tree-LSTM-based organic matter physicochemical property prediction method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110600085A CN110600085A (en) | 2019-12-20 |
CN110600085B true CN110600085B (en) | 2024-04-09 |
Family
ID=68852617
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910500140.8A Active CN110600085B (en) | 2019-06-01 | 2019-06-01 | Tree-LSTM-based organic matter physicochemical property prediction method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110600085B (en) |
Families Citing this family (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111524557B (en) * | 2020-04-24 | 2024-04-05 | 腾讯科技(深圳)有限公司 | Inverse synthesis prediction method, device, equipment and storage medium based on artificial intelligence |
CN111710375B (en) * | 2020-05-13 | 2023-07-04 | 中国科学院计算机网络信息中心 | A method and system for predicting molecular properties |
CN111899807B (en) * | 2020-06-12 | 2024-05-28 | 中国石油天然气股份有限公司 | Molecular structure generation method, system, equipment and storage medium |
CN111899814B (en) * | 2020-06-12 | 2024-05-28 | 中国石油天然气股份有限公司 | Single molecule and mixture physical property calculation method, device and storage medium |
CN115171807B (en) * | 2022-09-07 | 2022-12-06 | 合肥机数量子科技有限公司 | Molecular coding model training method, molecular coding method and molecular coding system |
Family Cites Families (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20150017694A1 (en) * | 2008-11-06 | 2015-01-15 | Kiverdi, Inc. | Engineered CO2-Fixing Chemotrophic Microorganisms Producing Carbon-Based Products and Methods of Using the Same |
US10430685B2 (en) * | 2016-11-16 | 2019-10-01 | Facebook, Inc. | Deep multi-scale video prediction |
US10699185B2 (en) * | 2017-01-26 | 2020-06-30 | The Climate Corporation | Crop yield estimation using agronomic neural network |
EP3474201A1 (en) * | 2017-10-17 | 2019-04-24 | Tata Consultancy Services Limited | System and method for quality evaluation of collaborative text inputs |
- 2019-06-01: application CN201910500140.8A filed in China (granted as CN110600085B, status Active)
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109476721A (en) * | 2016-04-04 | 2019-03-15 | 英蒂分子公司 | CD8- specificity capturing agent, composition and use and preparation method |
CN108108836A (en) * | 2017-12-15 | 2018-06-01 | 清华大学 | A kind of ozone concentration distribution forecasting method and system based on space-time deep learning |
CN109033738A (en) * | 2018-07-09 | 2018-12-18 | 湖南大学 | A kind of pharmaceutical activity prediction technique based on deep learning |
Non-Patent Citations (1)
Title |
---|
Research on the Application of Deep Neural Networks in Chemistry; Qin Qifeng et al.; Jiangxi Chemical Industry (江西化工); 2018-06-15, No. 03, pp. 1-4 *
Also Published As
Publication number | Publication date |
---|---|
CN110600085A (en) | 2019-12-20 |
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| GR01 | Patent grant | |