
CN107516012A - A Structure Descriptor Based on Calculation of 3D Molecular Structure of Organic Compounds - Google Patents

Info

Publication number
CN107516012A
Authority
CN
China
Prior art keywords
hydrogen atom
molecule
hydrogen atoms
organic compound
atom
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN201710718151.4A
Other languages
Chinese (zh)
Inventor
廖立敏
李建凤
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Neijiang Normal University
Original Assignee
Neijiang Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Neijiang Normal University
Priority to CN201710718151.4A
Publication of CN107516012A
Legal status: Withdrawn

Classifications

    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16C: COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
    • G16C10/00: Computational theoretical chemistry, i.e. ICT specially adapted for theoretical aspects of quantum chemistry, molecular mechanics, molecular dynamics or the like

Landscapes

  • Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a structure descriptor calculated from the three-dimensional molecular structure of organic compounds, belonging to the technical field of research methods for quantitative structure-property relationships of compounds. The purpose is to select some known compounds as a training set, characterize the structures of the training-set samples with the descriptor, and then build a quantitative structure-property relationship (QSPR/QSAR) model for that class of compounds with an appropriate mathematical method (multiple linear regression (MLR) or partial least squares regression (PLS)), so as to predict a given property of similar unknown compounds. The method comprises the following steps: step 1, classify the skeleton non-hydrogen atoms in the organic compound molecule; step 2, assign a parametric coloring value to each non-hydrogen atom; step 3, express the relationships between the different classes of non-hydrogen atoms with a reciprocal function; step 4, optimize the molecular structure to its lowest-energy state, obtain the spatial coordinates of the non-hydrogen atoms, and compute the structure descriptor with a program. By establishing a model relating the structure descriptor to a property, the properties of similar organic compounds can be predicted fairly accurately, which is of high reference value for QSPR/QSAR research on organic compounds.

Description

A Structure Descriptor Based on Calculation of the 3D Molecular Structure of Organic Compounds

Technical Field

The invention relates to a structure descriptor calculated from the three-dimensional molecular structure of organic compounds, and belongs to the technical field of research methods for quantitative structure-property relationships (QSPR/QSAR) of compounds.

Background Art

The structure of a compound determines its properties, and its properties reflect its structure. Building a quantitative relationship between molecular structure and properties requires suitable structure descriptors, and researchers have long done much meaningful work in this area. One approach describes structure using the molecule's geometry, topological properties, connectivity features, and various physicochemical parameters, and then builds a QSPR/QSAR model to predict the compound's properties; see: Generalized correlation index applied to the quantitative structure-chromatographic retention relationship of persistent environmental pollutants [J]. 分析化学 (Chinese Journal of Analytical Chemistry), 2006, 34(8): 1096-1100. However, these are all two-dimensional (2D) structure descriptors, which cannot reproduce the real spatial structure of a molecule and have difficulty distinguishing structures such as cis-trans isomers.

Three-dimensional (3D) descriptors have developed rapidly and become the mainstream of molecular structure characterization in QSPR/QSAR, chiefly the WHIM indices and CoMFA; see: MS-WHIM, new 3D theoretical descriptors derived from molecular surface properties: a comparative 3D QSAR study in a series of steroids. J Comput-Aided Mol Des, 1997, 11: 79-92; Investigation of structural requirements for inhibitory activity at the rat and housefly picrotoxinin binding sites in ionotropic GABA receptors using DISCOtech and CoMFA. Chemosphere, 2007, 69: 864-871. WHIM indices are obtained by applying weighted transformations, with different physical quantities as weights, to the atomic coordinates so as to produce rotation- and translation-invariant quantities; the calculation is fairly complicated, which has hindered wide adoption. The drawback of CoMFA is that studying a set of molecules first requires superimposing the sample molecules' spatial structures and aligning their conformations. In addition, spatial grid partitioning, control of the number of variables, and selection of the field probe are complicated, laborious, and subject to many uncertainties. These problems cannot be ignored. Constructing a simple, easily understood 3D descriptor based on the spatial structure of compound molecules is therefore of considerable importance, but no very effective and simple method has yet emerged.

Summary of the Invention

In view of the above deficiencies of the prior art, the purpose of the invention is to provide a simple, easily understood, and effective method for parametric characterization of molecular structure (a structure descriptor) for QSPR/QSAR research. In application, some known compounds are selected as a training set, the descriptor is used to characterize the structures of the training-set samples, and an appropriate mathematical method (multiple linear regression (MLR) or partial least squares regression (PLS)) is then used to build a quantitative structure-property relationship (QSPR/QSAR) model for that class of compounds. The model is used to predict a property of similar unknown compounds (such as chromatographic retention, toxicity, migration behavior, degradability, drug efficacy, or biological activity) and provides a reference for other related research.

The method of the invention comprises the following steps:

Step 1: Classify the skeleton non-hydrogen atoms in the organic compound molecule

The non-hydrogen atoms of an organic compound are joined into a molecule by different connection modes (chemical bonds). Ignoring the non-skeleton hydrogen atoms, each non-hydrogen atom in the molecule is assigned to one of four classes, A1, A2, A3, or A4, according to the number of non-hydrogen atoms bonded to it: 1, 2, 3, or 4, respectively.
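
The classification in step 1 amounts to counting heavy-atom neighbors. The following is a minimal sketch, not the patent's own program: the function name `classify_heavy_atoms` and the input layout (a list of element symbols plus a list of bonded index pairs) are assumptions made for illustration.

```python
from typing import Dict, List, Tuple


def classify_heavy_atoms(elements: List[str],
                         bonds: List[Tuple[int, int]]) -> Dict[int, str]:
    """Return {atom_index: 'A1'..'A4'} for every non-hydrogen atom.

    elements: element symbol per atom (hydrogens may be present or omitted).
    bonds:    pairs of atom indices that are chemically bonded.
    Both input conventions are hypothetical, chosen for this sketch.
    """
    # Start every non-hydrogen atom with zero heavy-atom neighbors.
    heavy_neighbors = {i: 0 for i, el in enumerate(elements) if el != "H"}
    for a, b in bonds:
        # Only bonds between two non-hydrogen atoms count.
        if a in heavy_neighbors and b in heavy_neighbors:
            heavy_neighbors[a] += 1
            heavy_neighbors[b] += 1
    classes = {}
    for i, count in heavy_neighbors.items():
        if not 1 <= count <= 4:
            # The text defines only A1..A4; anything else is out of scope here.
            raise ValueError(f"atom {i} has {count} heavy neighbors")
        classes[i] = f"A{count}"
    return classes


# o-Xylene skeleton: ring carbons 0-5, methyl carbons 6 and 7 on ring atoms 0 and 1.
o_xylene_bonds = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0), (0, 6), (1, 7)]
o_xylene_classes = classify_heavy_atoms(["C"] * 8, o_xylene_bonds)
```

For this o-xylene skeleton the sketch yields the class sequence A3, A3, A2, A2, A2, A2, A1, A1 that the embodiment reports.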

Step 2: Assign parametric coloring values to the non-hydrogen atoms

The character of a non-hydrogen atom in a molecule is determined mainly by factors such as its number of valence electrons and its number of electron shells. The following formula is therefore used to assign each non-hydrogen atom a parametric coloring value.

$$Z_i = \left[ m_i (n_i - 1) (X_C / X_i)^{1/2} - h_i \right]^{1/2} \qquad \text{(Formula 1)}$$

where n_i is the number of electron shells of non-hydrogen atom i, m_i is its number of outermost-shell electrons, X_i is its Pauling electronegativity, h_i is the number of hydrogen atoms directly bonded to it, and X_C is the Pauling electronegativity of the carbon atom.
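
As a sketch of how Formula 1 can be evaluated, the snippet below computes the coloring value for a few second-row elements. The function name `coloring_value` and the two lookup tables are illustrative assumptions; the electronegativities are the commonly tabulated Pauling values, and the patent does not state which table it used.

```python
import math

# Commonly tabulated Pauling electronegativities (assumed; not given in the text).
PAULING = {"C": 2.55, "N": 3.04, "O": 3.44}
# (number of electron shells n, number of outermost-shell electrons m)
SHELLS_AND_OUTER_ELECTRONS = {"C": (2, 4), "N": (2, 5), "O": (2, 6)}


def coloring_value(element: str, n_hydrogens: int) -> float:
    """Z_i = [m_i (n_i - 1) (X_C / X_i)^(1/2) - h_i]^(1/2)."""
    n, m = SHELLS_AND_OUTER_ELECTRONS[element]
    inner = m * (n - 1) * math.sqrt(PAULING["C"] / PAULING[element]) - n_hydrogens
    if inner < 0:
        raise ValueError("Formula 1 is undefined for a negative bracket")
    return math.sqrt(inner)


z_methyl = coloring_value("C", 3)    # carbon bearing three hydrogens
z_aromatic = coloring_value("C", 1)  # carbon bearing one hydrogen
```

For a methyl carbon the bracket is 4 × 1 × 1 - 3 = 1, so Z = 1, which matches the value Z_7 = Z_8 = 1.0000 quoted for the methyl carbons of o-xylene in the embodiment.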

Step 3: Express the relationships between different classes of non-hydrogen atoms with a reciprocal function

The relationship between non-hydrogen atoms in a molecule is not a specific physical interaction. It is meant to capture two things: the closeness of a pair should vary in the same direction as the atoms' Z_i values and in the opposite direction to the distance between them. A reciprocal-type function generally meets this requirement, and the following formula is used to express the relationship between different non-hydrogen atoms. In the form implied by the worked example in the embodiment (x_1 = m_11 = Z_7 Z_8 / r_7,8^2), summed over atom pairs:

$$m_{nl} = \sum_{i \in n,\; j \in l} \frac{Z_i Z_j}{r_{ij}^{2}}$$

r_ij is the relative distance between non-hydrogen atoms i and j in the molecule (the ratio of their spatial distance to the carbon-carbon single-bond length); n and l are the classes to which the two atoms belong. The four classes of non-hydrogen atoms can give 10 different relationship terms: m_11, m_12, m_13, m_14, m_22, m_23, m_24, m_33, m_34, and m_44, where m_12 denotes the relationship between class-1 and class-2 non-hydrogen atoms in the molecule, m_23 the relationship between class-2 and class-3 non-hydrogen atoms, and so on. The 10 terms are denoted x_1, x_2, x_3, x_4, x_5, x_6, x_7, x_8, x_9, and x_10 in that order, so a sample yields at most 10 variables related to molecular structure.
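
The 10 relationship terms can be sketched as a single pass over atom pairs. This is an illustration under stated assumptions, not the patent's C program: it assumes each term is the sum of Z_i·Z_j / r_ij² over all pairs of non-hydrogen atoms in the two classes (the per-pair form is taken from the o-xylene example in the embodiment; the summation over every pair, bonded or not, is an assumption), and the name `descriptor_vector` and the tuple layout are hypothetical.

```python
import math
from itertools import combinations

# Carbon-carbon single-bond length in nm, the value used in the o-xylene example.
CC_SINGLE_BOND = 0.1540


def descriptor_vector(atoms, bond_length=CC_SINGLE_BOND):
    """Return [x_1, ..., x_10] for a list of (atom_class, Z, (x, y, z)) tuples.

    atom_class is 1..4 (A1..A4); coordinates must be in the same unit as
    bond_length. Each unordered atom pair contributes Z_i * Z_j / r_ij**2 to
    the term for its pair of classes (assumed aggregation, see lead-in).
    """
    # Term order m_11, m_12, m_13, m_14, m_22, ..., m_44 maps to x_1..x_10.
    order = [(n, l) for n in range(1, 5) for l in range(n, 5)]
    terms = {key: 0.0 for key in order}
    for (class_a, z_a, pos_a), (class_b, z_b, pos_b) in combinations(atoms, 2):
        r = math.dist(pos_a, pos_b) / bond_length  # relative distance r_ij
        key = (min(class_a, class_b), max(class_a, class_b))
        terms[key] += z_a * z_b / r ** 2
    return [terms[key] for key in order]


# Two class-1 atoms with Z = 1, two bond lengths apart: only x_1 is non-zero.
demo = descriptor_vector([(1, 1.0, (0.0, 0.0, 0.0)), (1, 1.0, (0.308, 0.0, 0.0))])
```

Keeping the class pair as a sorted tuple means m_12 and m_21 land in the same term, which is what gives 10 terms rather than 16.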

Step 4: Optimize the molecular structure of the organic compound to its lowest-energy state, obtain the spatial coordinates of the non-hydrogen atoms, and compute the structure descriptor with a program

The initial three-dimensional structure of each molecule is built with ChemOffice 8.0 and optimized at the AM1 level with the MOPAC semi-empirical quantum chemistry package bundled with Chem 3D (cutoff value 0.001 kJ·mol^-1), giving the spatial coordinates of every atom. The coordinates and parametric coloring values of the atoms are then fed to a self-written C program, which outputs the molecular structure descriptor.

To apply the structure descriptor, the variables obtained in the above steps are first screened by stepwise regression according to their significance. The selected variable combination is then taken as the independent variable X and the compound property as the dependent variable Y, and an appropriate mathematical method (multiple linear regression (MLR) or partial least squares regression (PLS)) is used to build a QSPR/QSAR model for that class of compounds, which is then used to predict the properties of similar unknown compounds.

The beneficial effects of the invention are as follows. The invention provides a structure descriptor calculated from the three-dimensional molecular structure of organic compounds; it is simple to compute and easy to understand, and gives a parametric characterization of molecular structure. With QSPR/QSAR models built by an appropriate mathematical method (MLR or PLS), both the model correlation coefficient (R) and the cross-validation correlation coefficient (R_CV) are satisfactory, and the models reveal to some extent the structural factors that affect compound properties. The models can predict the relevant properties of similar unknown compounds fairly accurately and are of high reference value for QSPR/QSAR research on organic compounds.

Brief Description of the Drawings

Fig. 1 shows the molecular structure of o-xylene in the embodiment.

Fig. 2 shows how the MLR model correlation coefficients (R and R_CV) change during variable screening in the embodiment.

Fig. 3 shows how the MLR model standard deviations (SD and SD_CV) change during variable screening in the embodiment.

Fig. 4 is a scatter plot of the 37 samples in the score space of the first two PLS principal components in the embodiment.

Fig. 5 is a plot of predicted versus experimental values in the embodiment.

Detailed Description

The structure descriptor of the invention can be used to predict many properties of compounds (such as chromatographic retention, toxicity, migration behavior, degradability, drug efficacy, and biological activity). Its application to predicting chromatographic retention times is described below with reference to the drawings; other properties can be predicted in a similar way.

Experimental Materials

In this embodiment, 37 aroma components of wild jasmine flowers were selected as samples. The gas chromatographic retention time of a compound is denoted t_R; the experimental values are taken from: Analysis of the aroma components of wild jasmine flowers by solid-phase microextraction and GC-MS [J]. 精细化工 (Fine Chemicals), 2007, 24(2): 159-161. The compounds and their retention times (t_R) are listed in Table 1.

Table 1

Experimental Method

1) Characterization of molecular structure

The chromatographic retention time (t_R) of an organic compound depends not only on measurement conditions but also on molecular structure: the kinds and numbers of atoms making up the compound and the way they are connected all affect t_R. Non-hydrogen atoms are joined into a molecule by different connection modes (chemical bonds). Ignoring the non-skeleton hydrogen atoms, each non-hydrogen atom in the molecule is assigned to one of four classes, A1, A2, A3, or A4, according to the number of non-hydrogen atoms bonded to it: 1, 2, 3, or 4, respectively. For example, a secondary carbon atom bonded to two non-hydrogen atoms belongs to class A2. The character of a non-hydrogen atom in a molecule is determined mainly by factors such as its number of valence electrons and its number of electron shells, so formula (1) is used to assign each non-hydrogen atom a parametric coloring value.

$$Z_i = \left[ m_i (n_i - 1) (X_C / X_i)^{1/2} - h_i \right]^{1/2} \qquad (1)$$

where n_i is the number of electron shells of non-hydrogen atom i, m_i is its number of outermost-shell electrons, X_i is its Pauling electronegativity, h_i is the number of hydrogen atoms directly bonded to it, and X_C is the Pauling electronegativity of the carbon atom.

The relationship between non-hydrogen atoms in a molecule is not a specific physical interaction. It is meant to capture two things: the closeness of a pair should vary in the same direction as the atoms' Z_i values and in the opposite direction to the distance between them. A reciprocal-type function generally meets this requirement, and formula (2) is used to express the relationship between different non-hydrogen atoms. In the form implied by the o-xylene example below (x_1 = m_11 = Z_7 Z_8 / r_7,8^2), summed over atom pairs:

$$m_{nl} = \sum_{i \in n,\; j \in l} \frac{Z_i Z_j}{r_{ij}^{2}} \qquad (2)$$

r_ij is the relative distance between non-hydrogen atoms i and j in the molecule (the ratio of their spatial distance to the carbon-carbon single-bond length); n and l are the classes to which the two atoms belong. The four classes of non-hydrogen atoms can give 10 different relationship terms: m_11, m_12, m_13, m_14, m_22, m_23, m_24, m_33, m_34, and m_44, where m_12 denotes the relationship between class-1 and class-2 non-hydrogen atoms in the molecule, m_23 the relationship between class-2 and class-3 non-hydrogen atoms, and so on. The 10 terms are denoted x_1, x_2, x_3, x_4, x_5, x_6, x_7, x_8, x_9, and x_10 in that order, so a sample yields at most 10 variables related to molecular structure.

The initial three-dimensional structure of each organic compound was built with ChemOffice 8.0 and optimized at the AM1 level with the MOPAC semi-empirical quantum chemistry package bundled with Chem 3D (cutoff value 0.001 kJ·mol^-1), giving the spatial coordinates of every atom. The coordinates and parametric coloring values of the atoms were then fed to a self-written C program, which output the molecular structure descriptors.

o-Xylene is used here to illustrate how the structure descriptors are calculated. The non-hydrogen atoms in the molecule are first numbered, and the class of each is then determined. o-Xylene has 8 non-hydrogen atoms (see Fig. 1), of classes A3, A3, A2, A2, A2, A2, A1, and A1. Consider x_1, the relationship among the class-1 non-hydrogen atoms, which here is the relationship between atoms 7 and 8. In the optimal conformation of o-xylene (see Fig. 1), the coordinates of atoms 7 and 8 are (-0.3524, 2.0282, -0.0202) and (2.1627, 0.5775, 0.0627), so their spatial distance is 0.2905 nm and r_7,8 = 0.2905/0.1540 = 1.8864. From formula (1), Z_7 = Z_8 = 1.0000, so x_1 = m_11 = 1.0000 × 1.0000 / 1.8864^2 = 0.2810. The other structure descriptor values are calculated in the same way.
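
The arithmetic of the o-xylene example can be checked with a few lines. This is only a verification sketch: the variable names are illustrative, and the coordinates are assumed to be in ångströms, since that is the unit under which they reproduce the stated 0.2905 nm distance.

```python
import math

# Coordinates of atoms 7 and 8 from the text (assumed to be in ångströms).
p7 = (-0.3524, 2.0282, -0.0202)
p8 = (2.1627, 0.5775, 0.0627)

distance_nm = math.dist(p7, p8) / 10.0  # convert Å to nm
r_78 = distance_nm / 0.1540             # relative to the C-C single-bond length
z7 = z8 = 1.0                           # coloring value of a methyl carbon
x1 = z7 * z8 / r_78 ** 2                # the single class-1/class-1 pair

print(f"distance = {distance_nm:.4f} nm, r_78 = {r_78:.4f}, x_1 = {x1:.4f}")
```

Carrying full precision through the calculation gives values that agree with the rounded figures in the text to about three decimal places; the small differences come from the text rounding the distance before dividing.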

The structures of the samples were characterized in this way to obtain their structure descriptors. Because no relationship between class-4 non-hydrogen atoms occurs in the samples, 9 descriptors that are not identically zero were obtained; they are listed in Table 2.

Table 2

QSPR Modeling and Validation

Multiple linear regression (MLR) and partial least squares regression (PLS) were each used to build models, and the models were cross-validated by the leave-one-out method. It is generally accepted that a modeling correlation coefficient (R) between 0.8 and 1.0 indicates a highly correlated model, and a cross-validation correlation coefficient R_CV ≥ 0.7 indicates good robustness and predictive ability. Multicollinearity among the independent variables is assessed with the variance inflation factor (VIF), defined as VIF = (1 - r^2)^-1, where r is the correlation between a given independent variable and the other variables. A VIF above 5.0 indicates serious collinearity and an unreliable regression equation; if a variable's VIF is found to be too large, the model is rebuilt with fewer variables.
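
The VIF criterion is easy to state in code. The sketch below is illustrative only: the function names are hypothetical, and r is taken as an already computed correlation between one independent variable and the others, which is how the text defines it.

```python
import math


def variance_inflation_factor(r: float) -> float:
    """VIF = 1 / (1 - r^2), with r the correlation of one variable with the rest."""
    if not -1.0 < r < 1.0:
        raise ValueError("r must lie strictly between -1 and 1")
    return 1.0 / (1.0 - r * r)


def is_seriously_collinear(vif: float, threshold: float = 5.0) -> bool:
    """Apply the text's rule: a VIF above 5.0 signals serious collinearity."""
    return vif > threshold


vif_uncorrelated = variance_inflation_factor(0.0)
vif_at_r2_08 = variance_inflation_factor(math.sqrt(0.8))
```

The 5.0 threshold corresponds to r^2 = 0.8, that is, |r| of roughly 0.89; an uncorrelated variable has VIF = 1.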

Stepwise regression (SMR) was first used to select variables in order of significance. The selected variable combination was taken as the independent variable X and the gas chromatographic retention time (t_R) as the dependent variable Y, and a model was built by multiple linear regression (MLR). The variable screening and the corresponding results are shown in Table 3 and Figs. 2 and 3.

Table 3

A good predictive model should not only fit the internal samples well but also predict external samples well. When choosing a model, therefore, provided the internal fit is good, the model with the larger cross-validation correlation coefficient (R_CV) is preferred, to ensure strong predictive ability and stability. Table 3 and Figs. 2 and 3 show that the model built from the variable combination selected at step 3 of the stepwise regression (x_3, x_5, x_7) should be chosen. The regression equation between the retention time (t_R) and the structure descriptors is then:

$$t_R = 0.2095 + 0.4030\,x_3 + 0.2251\,x_5 + 0.2249\,x_7 \qquad (3)$$

N = 37, R = 0.9403, SD = 1.5498, F = 84.0232; R_CV = 0.9186, SD_CV = 1.7994, F_CV = 59.4872.

The modeling correlation coefficient (R) is 0.9403 (within 0.8-1.0) and the cross-validation correlation coefficient (R_CV) reaches its maximum of 0.9186 (above 0.7), indicating that the model is highly correlated, robust, and strongly predictive. The standard deviation (SD) is 1.5498 and the cross-validation standard deviation (SD_CV) is 1.7994; both are small, indicating fairly accurate prediction. Among the variables, x_5 has the largest variance inflation factor (VIF), 1.3379, far below the threshold of 5.0, so there is almost no collinearity among the model variables. The standardized regression coefficients of x_3, x_5, and x_7 are 0.6190, 1.0776, and 0.1581, respectively, so x_5 has the greatest influence on the retention time, followed by x_3, with x_7 having relatively little. x_5 corresponds to the relationship between class-2 non-hydrogen atoms, which indicates that the more class-2 non-hydrogen atoms a compound has, the longer its retention time.

The molecular structure descriptors selected at step 3 of the stepwise regression were taken as the X variables and the retention time (t_R) as the dependent variable Y, and a partial least squares (PLS) model was built on the training-set samples with SIMCA-P 11.5; the resulting model was cross-validated by the leave-one-out method. The final PLS model contains 2 principal components (A), and the regression equation between t_R and the original structure descriptors is equation (4).

$$t_R = 0.3275 + 0.3790\,x_3 + 0.2243\,x_5 + 0.3157\,x_7 \qquad (4)$$
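
Once x_3, x_5, and x_7 are known for a compound, equations (3) and (4) are applied directly. The sketch below uses the published coefficients; the function name `predict_retention_time` and the descriptor values passed to it are hypothetical and do not come from Table 2.

```python
# Intercept and coefficients for (x_3, x_5, x_7), taken from equations (3) and (4).
MLR_COEFFS = (0.2095, 0.4030, 0.2251, 0.2249)
PLS_COEFFS = (0.3275, 0.3790, 0.2243, 0.3157)


def predict_retention_time(coeffs, x3: float, x5: float, x7: float) -> float:
    """Evaluate t_R = b0 + b3*x3 + b5*x5 + b7*x7 for one compound."""
    b0, b3, b5, b7 = coeffs
    return b0 + b3 * x3 + b5 * x5 + b7 * x7


# Hypothetical descriptor values, for illustration only.
t_mlr = predict_retention_time(MLR_COEFFS, x3=1.0, x5=1.0, x7=1.0)
t_pls = predict_retention_time(PLS_COEFFS, x3=1.0, x5=1.0, x7=1.0)
```

The two models share the same three descriptors and differ only in their fitted coefficients, so the same function serves both.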

The model's correlation coefficients are R = 0.9381 (within 0.8-1.0) and R_CV = 0.8899 (above 0.7), and the standard deviations SD and SD_CV are 1.5362 and 1.7436, respectively. These indicate a good fit, good robustness, and strong predictive ability. Fig. 4 is a scatter plot of the 37 samples in the score space of the first two PLS principal components; the Hotelling T^2 ellipse marks the 95% confidence region. Nearly all sample points fall inside it, which indicates that the structure descriptor adequately represents the molecular structural features of the organic compounds and is reflected correctly in the statistical model.

The MLR and PLS models were used to predict the gas chromatographic retention times (t_R) of the samples; the predicted values are listed in the Cal1 and Cal2 columns of Table 1, respectively. Fig. 5 plots the predicted values against the experimental values. All sample points lie on or close to the 45° diagonal, showing that the predicted values are very close to the experimental ones and that the predictions are reliable overall.

Compared with the prior art, the structure descriptor constructed here is a three-dimensional (3D) molecular structure descriptor calculated from the three-dimensional structure of the molecule. It is simple, easy to understand, and cheap to compute, and it can distinguish isomers such as cis-trans isomers. The coloring values of the non-hydrogen atoms also incorporate rich structural information: principal quantum number, electronegativity, number of outermost-shell electrons, and number of attached hydrogen atoms.

The non-hydrogen atoms in each molecule were classified and assigned parametric coloring values, and the spatial distance relationships between the different non-hydrogen atoms were used as molecular structure descriptors to characterize the structures of a set of organic compounds. Stepwise regression (SMR) combined with multiple linear regression (MLR) and partial least squares regression (PLS) was used to build models relating compound structure to gas chromatographic retention time (t_R). The model correlation coefficient (R) and the cross-validation correlation coefficient (R_CV) are both satisfactory, and the models reveal to some extent the structural factors that affect t_R. The models can predict the retention times of volatile organic compounds in plant essential oils fairly accurately and are of high reference value for QSPR/QSAR research on organic compounds.

The above is a preferred embodiment of the present invention. It should be noted that, in addition to predicting gas chromatographic retention time (t_R), the molecular structure descriptor of the present invention can also be applied to the prediction of various other properties of compounds, such as toxicity, migration characteristics, degradability, drug efficacy, and biological activity. Those of ordinary skill in the art may make improvements and refinements without departing from the principles of the present invention, and such improvements and refinements shall also be regarded as falling within the scope of protection of the present invention.

Claims (1)

1. A structure descriptor calculated based on the three-dimensional molecular structure of organic compounds, characterized in that the method comprises the following steps:
Step 1: Classify the skeleton non-hydrogen atoms in the organic compound molecule
In organic compounds, non-hydrogen atoms form molecules through different connection modes (chemical bonds). Ignoring the influence of the non-skeleton hydrogen atoms, the non-hydrogen atoms in a molecule can be divided into four classes, A1, A2, A3, and A4, according to the number of non-hydrogen atoms each is connected to; the classes denote atoms connected to 1, 2, 3, and 4 non-hydrogen atoms, respectively.
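The classification in Step 1 can be sketched as follows. This is a minimal illustration that assumes the molecule is given as a list of element symbols plus a list of bonded index pairs; the function name and the input layout are hypothetical and are not taken from the patent's own program.

```python
from collections import defaultdict


def classify_skeleton_atoms(elements, bonds):
    """Return {atom_index: class} with class 1..4 for each non-hydrogen atom.

    elements: list of element symbols, one per atom.
    bonds: iterable of (i, j) index pairs, one per chemical bond.
    """
    heavy_neighbors = defaultdict(set)
    for i, j in bonds:
        # Only bonds between two non-hydrogen atoms count toward the class.
        if elements[i] != "H" and elements[j] != "H":
            heavy_neighbors[i].add(j)
            heavy_neighbors[j].add(i)

    classes = {}
    for index, symbol in enumerate(elements):
        if symbol == "H":
            continue  # hydrogen atoms are not part of the skeleton
        degree = len(heavy_neighbors[index])
        if not 1 <= degree <= 4:
            raise ValueError(f"atom {index} has {degree} non-hydrogen neighbors")
        classes[index] = degree  # class A1..A4 stored as 1..4
    return classes


# Isobutane skeleton: a central carbon bonded to three methyl carbons
# (hydrogen atoms omitted from the input).
elements = ["C", "C", "C", "C"]
bonds = [(0, 1), (0, 2), (0, 3)]
print(classify_skeleton_atoms(elements, bonds))  # {0: 3, 1: 1, 2: 1, 3: 1}
```

A molecule with a single non-hydrogen atom has no class under this scheme, so the sketch raises an error rather than guessing one.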
Step 2: Parametrically color the different non-hydrogen atoms
The character of a non-hydrogen atom in a molecule is determined mainly by factors such as its number of valence electrons and number of electron shells. The following formula is therefore used to parametrically color the different non-hydrogen atoms and obtain the parametric coloring value of each non-hydrogen atom.
$$Z_i = \left[ m_i (n_i - 1) (X_C / X_i)^{1/2} - h_i \right]^{1/2} \quad \text{(Formula 1)}$$
In the formula, n_i is the number of electron shells of non-hydrogen atom i, m_i is its number of outermost electrons, X_i is its Pauling electronegativity, h_i is the number of hydrogen atoms directly connected to it, and X_C is the Pauling electronegativity of the carbon atom.
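As a worked illustration of Formula 1, the sketch below computes the coloring value for a few common atom types. The Pauling electronegativity values (C 2.55, N 3.04, O 3.44) are the standard tabulated ones and are an assumption here, since the claim does not list the values the authors used; the function and table names are hypothetical.

```python
import math

X_C = 2.55  # Pauling electronegativity of carbon (assumed tabulated value)

# symbol -> (number of electron shells n, outermost electrons m, Pauling electronegativity X)
ATOM_PARAMS = {
    "C": (2, 4, 2.55),
    "N": (2, 5, 3.04),
    "O": (2, 6, 3.44),
}


def coloring_value(symbol, n_hydrogens):
    """Z_i = [m_i (n_i - 1) (X_C / X_i)^(1/2) - h_i]^(1/2), as in Formula 1."""
    n, m, x = ATOM_PARAMS[symbol]
    inner = m * (n - 1) * math.sqrt(X_C / x) - n_hydrogens
    if inner < 0:
        raise ValueError("Formula 1 is undefined for this atom and hydrogen count")
    return math.sqrt(inner)


for symbol, h in [("C", 3), ("C", 2), ("C", 1), ("C", 0), ("O", 1)]:
    print(symbol, h, round(coloring_value(symbol, h), 3))
```

For carbon the electronegativity ratio is 1, so a methyl carbon gets Z = 1 and a carbon with no attached hydrogen gets Z = 2; attached hydrogens lower the value.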
Step 3: Construct the relationships between different types of non-hydrogen atoms through a reciprocal function
The relationship between non-hydrogen atoms in a molecule is not a specific interatomic interaction. Rather, it should reflect a degree of closeness that varies in the same direction as the Z_i values of the non-hydrogen atoms and in the opposite direction to the distance between them. A function of reciprocal form generally meets this requirement, and the following formula is used to express the relationship between different non-hydrogen atoms.
In the formula, r_ij is the relative distance between non-hydrogen atoms i and j in the molecule (that is, the ratio of their spatial distance to the carbon-carbon single-bond length); n and l are the types to which the non-hydrogen atoms belong. The four classes of non-hydrogen atoms can produce 10 different relational terms in a compound molecule: m_11, m_12, m_13, m_14, m_22, m_23, m_24, m_33, m_34, and m_44, where m_12 represents the relationship between the first-class and second-class non-hydrogen atoms in the molecule, m_23 represents the relationship between the second-class and third-class non-hydrogen atoms, and so on. The 10 relational terms are denoted x_1, x_2, x_3, x_4, x_5, x_6, x_7, x_8, x_9, and x_10, respectively, so that at most 10 structure descriptors related to molecular structure are generated for the samples under study.
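The bookkeeping in Step 3 (relative distances and the mapping of class pairs onto x_1 to x_10) can be sketched as follows. The pairwise term itself is an assumption: the text only states that a reciprocal-form function of Z_i, Z_j, and r_ij is used, so `default_pair_term` below is a hypothetical placeholder and can be swapped for the actual formula. The 1.54 Å carbon-carbon single-bond length and all function names are likewise assumptions.

```python
import math
from itertools import combinations

CC_SINGLE_BOND = 1.54  # angstroms; assumed reference C-C single-bond length

# Class pairs in the order m11, m12, m13, m14, m22, m23, m24, m33, m34, m44,
# which correspond to x1 ... x10.
PAIR_ORDER = [(n, l) for n in range(1, 5) for l in range(n, 5)]


def default_pair_term(z_i, z_j, r_ij):
    # HYPOTHETICAL reciprocal form: grows with Z_i and Z_j, shrinks with distance.
    # Replace with the patent's actual formula when it is available.
    return z_i * z_j / r_ij


def descriptors(coords, classes, z_values, pair_term=default_pair_term):
    """Sum the pairwise terms of all non-hydrogen atom pairs into x1..x10.

    coords: list of (x, y, z) positions in angstroms, non-hydrogen atoms only.
    classes: list of atom classes (1..4), same order as coords.
    z_values: list of coloring values Z_i, same order as coords.
    """
    x = [0.0] * len(PAIR_ORDER)
    for i, j in combinations(range(len(coords)), 2):
        # Relative distance: spatial distance divided by the C-C bond length.
        r_ij = math.dist(coords[i], coords[j]) / CC_SINGLE_BOND
        key = tuple(sorted((classes[i], classes[j])))
        x[PAIR_ORDER.index(key)] += pair_term(z_values[i], z_values[j], r_ij)
    return x


# Idealized straight three-carbon chain, for illustration only.
coords = [(0.0, 0.0, 0.0), (1.54, 0.0, 0.0), (3.08, 0.0, 0.0)]
classes = [1, 2, 1]
z_values = [1.0, math.sqrt(2.0), 1.0]
print([round(v, 3) for v in descriptors(coords, classes, z_values)])
```

Because the terms depend on three-dimensional distances rather than on connectivity alone, two isomers with the same bonds but different geometry (for example cis and trans forms) generally give different descriptor values.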
Step 4: Optimize the organic compound molecular structure to its minimum-energy state, obtain the spatial coordinates of the non-hydrogen atoms, and calculate the structure descriptors with an application program
The initial three-dimensional structure of the organic compound molecule is built with ChemOffice 8.0, and the final optimized molecular structure is obtained at the AM1 level with the MOPAC semi-empirical quantum chemistry software bundled with Chem 3D (cutoff value 0.001 kJ·mol⁻¹), yielding the spatial coordinates of each atom. The spatial coordinates and parametric coloring values of the atoms in the molecule are input into a self-written C language application program for processing to obtain the molecular structure descriptors.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710718151.4A CN107516012A (en) 2017-08-21 2017-08-21 A Structure Descriptor Based on Calculation of 3D Molecular Structure of Organic Compounds

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710718151.4A CN107516012A (en) 2017-08-21 2017-08-21 A Structure Descriptor Based on Calculation of 3D Molecular Structure of Organic Compounds

Publications (1)

Publication Number Publication Date
CN107516012A true CN107516012A (en) 2017-12-26

Family

ID=60722355

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710718151.4A Withdrawn CN107516012A (en) 2017-08-21 2017-08-21 A Structure Descriptor Based on Calculation of 3D Molecular Structure of Organic Compounds

Country Status (1)

Country Link
CN (1) CN107516012A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111429980A (en) * 2020-04-14 2020-07-17 北京迈高材云科技有限公司 An automated method for obtaining crystal structure features of materials
CN111798935A (en) * 2019-04-09 2020-10-20 南京药石科技股份有限公司 A neural network-based prediction method for universal compound structure-property correlation
CN112185477A (en) * 2020-09-25 2021-01-05 北京望石智慧科技有限公司 Method and device for extracting molecular characteristics and calculating three-dimensional quantitative structure-activity relationship
CN114207619A (en) * 2019-09-05 2022-03-18 株式会社日立制作所 Material property prediction system and information processing method
CN119027614A (en) * 2024-08-16 2024-11-26 中国地质大学(北京) An automatic analysis system and method for detecting characteristic pollutants in drilling fluid

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107024558A (en) * 2017-01-10 2017-08-08 内江师范学院 A kind of organic compound molecule structure parameterization characterizing method

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107024558A (en) * 2017-01-10 2017-08-08 内江师范学院 A kind of organic compound molecule structure parameterization characterizing method

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
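Liao Limin et al.: "Quantitative structure-property relationship study of the ash/water partition coefficients of some organic pollutants", Journal of Nanjing University of Science and Technology *
Li Yue et al.: "A novel structure description method for simulating the water solubility of aromatic hydrocarbon compounds", Journal of Neijiang Normal University *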

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111798935A (en) * 2019-04-09 2020-10-20 南京药石科技股份有限公司 A neural network-based prediction method for universal compound structure-property correlation
CN114207619A (en) * 2019-09-05 2022-03-18 株式会社日立制作所 Material property prediction system and information processing method
CN111429980A (en) * 2020-04-14 2020-07-17 北京迈高材云科技有限公司 An automated method for obtaining crystal structure features of materials
CN112185477A (en) * 2020-09-25 2021-01-05 北京望石智慧科技有限公司 Method and device for extracting molecular characteristics and calculating three-dimensional quantitative structure-activity relationship
CN112185477B (en) * 2020-09-25 2024-04-16 北京望石智慧科技有限公司 Method and device for extracting molecular characteristics and calculating three-dimensional quantitative structure-activity relationship
CN119027614A (en) * 2024-08-16 2024-11-26 中国地质大学(北京) An automatic analysis system and method for detecting characteristic pollutants in drilling fluid

Similar Documents

Publication Publication Date Title
CN107516012A (en) A Structure Descriptor Based on Calculation of 3D Molecular Structure of Organic Compounds
Sergeev et al. Combining spatial autocorrelation with machine learning increases prediction accuracy of soil heavy metals
Spicer et al. Navigating freely-available software tools for metabolomics analysis
Lee et al. Methods of inference and learning for performance modeling of parallel applications
Manallack et al. Neural networks in drug discovery: have they lived up to their promise?
CN104766090B (en) A kind of Coherent Noise in GPR Record method for visualizing based on BEMD and SOFM
CN104237158A (en) Near infrared spectrum qualitative analysis method with universality
CN106055827A (en) Oil deposit numerical value simulation parameter sensibility analysis device and method
CN109085282A (en) A kind of chromatographic peaks analytic method based on wavelet transformation and Random Forest model
CN107132190A (en) A kind of soil organism spectra inversion model calibration samples collection construction method
CN101696968A (en) New method for monitoring heavy metal content in soil
Kuzmanovski et al. Counter-propagation neural networks in Matlab
Han et al. Automatic untargeted metabolic profiling analysis coupled with Chemometrics for improving metabolite identification quality to enhance geographical origin discrimination capability
CN107330219A (en) A kind of multipoint parallel global optimization method based on Kriging models
CN104268662A (en) Settlement prediction method based on step-by-step optimization quantile regression
CN106951325A (en) Space computational fields calculate intensity cube construction method
CN102323973A (en) A method for predicting the properties/activity of common environmental toxicants based on intelligent correlation index
CN103440391B (en) Semiconductor process corner scanning and simulating method based on numerical value selection function
CN107024558A (en) A kind of organic compound molecule structure parameterization characterizing method
CN108802251A (en) The method for quickly measuring chiral material based on limitation Alternating trilinear decomposition algorithm and HPLC-DAD instruments
Carrillo-Tripp et al. CapsidMaps: Protein–protein interaction pattern discovery platform for the structural analysis of virus capsids using Google Maps
CN106529680A (en) Multiscale extreme learning machine integrated modeling method based on empirical mode decomposition
CN116933386A (en) Aircraft pneumatic data fusion method based on MCOK proxy model
CN105824994A (en) Zoning-based combined approximate model building method
CN109741239B (en) A multi-spatial scale information extraction method based on soil quality parameters

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication

Application publication date: 20171226