WO2024063584A1 - Central-atom-vector-based protein-ligand binding structure analysis method of artificial intelligence new drug platform - Google Patents
Info
- Publication number
- WO2024063584A1 (PCT/KR2023/014454)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- ligand
- compound
- protein
- binding
- artificial intelligence
- Prior art date
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B25/00—ICT specially adapted for hybridisation; ICT specially adapted for gene or protein expression
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B40/00—ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
- G16B40/20—Supervised data analysis
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B5/00—ICT specially adapted for modelling or simulations in systems biology, e.g. gene-regulatory networks, protein interaction networks or metabolic networks
Definitions
- the present invention relates to in silico prescreening technology applied to the active-substance discovery stage of CADD (computer-aided drug discovery) or an AI drug platform for new drug development and, more specifically, to a central-atom-vector-based method for analyzing protein-ligand binding structures that generates the binding structure of a protein bound to a reference ligand when the binding information between the protein and a compound is not known.
- CADD: computer-aided drug discovery
- LigandScout: pharmacophore ensemble approach; generates an ensemble of pharmacophores from multiple active ligands or protein conformations (hours).
- ROCS: 3D align, 2D fingerprints, force field; rapid overlay of chemical structures using shape and chemical features for virtual screening (minutes to hours).
- the process of virtually screening drug candidates from compounds is divided into a pre-screening step based on chemical properties or structural similarity and an in-depth screening step that uses protein-ligand interaction information from 3D-docked structures.
- because the pre-screening step is generally used to reduce the number of candidates drawn from a large pool of substances, it relies on simple number-based discrimination rules such as the rule of 5 (Lipinski's guideline for drug design); however, because the information used for screening is limited, it is generally known to be reliable only while the screening rate is kept at about 10%.
- rule of 5: guideline for drug design (Lipinski)
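As an illustration of the number-based pre-screening rule mentioned above, the following minimal sketch counts rule-of-5 violations for a compound. The thresholds are the commonly cited Lipinski criteria (molecular weight over 500, logP over 5, more than 5 hydrogen-bond donors, more than 10 hydrogen-bond acceptors). The `Descriptors` class, the function names, and the one-violation tolerance are assumptions made for illustration and are not taken from the patent.

```python
from dataclasses import dataclass


@dataclass
class Descriptors:
    """Pre-computed molecular descriptors (hypothetical container)."""
    mol_weight: float   # Daltons
    logp: float         # octanol-water partition coefficient
    h_donors: int       # hydrogen-bond donors
    h_acceptors: int    # hydrogen-bond acceptors


def rule_of_five_violations(d: Descriptors) -> int:
    """Count how many of the four Lipinski criteria are violated."""
    checks = [
        d.mol_weight > 500,
        d.logp > 5,
        d.h_donors > 5,
        d.h_acceptors > 10,
    ]
    return sum(checks)


def passes_prescreen(d: Descriptors, max_violations: int = 1) -> bool:
    """Keep a compound if it violates at most `max_violations` criteria (assumed tolerance)."""
    return rule_of_five_violations(d) <= max_violations
```

A filter of this kind only needs four numbers per compound, which is why it scales to very large libraries but carries limited information about binding.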
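- to advance pre-screening, methods that compare the similarity of features including chemical properties, and methods that screen by similarity between substances by characterizing two-dimensional patterns of molecular structure, have been studied; however, the proportion of accurate results (T/P, true/positive) among the selected substances was not high, and not every substance structure could be patterned.
- accuracy: T/P (true/positive)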
- the present invention was created to solve the above problems.
- existing 3D-structure-based screening technology requires long analysis times, which makes it difficult to screen large-scale sets of materials; the goal of the present invention is to provide a protein-ligand binding structure analysis method that shortens the analysis time so that such screening can be carried out.
- existing 3D-structure-based screening technology also consumes substantial system resources because, in creating a structure with a high probability of interacting with the binding target material, it repeatedly generates a randomized structure from the structure of the previous stage; the aim is to provide a protein-ligand binding structure analysis method that allows various binding forms to be analyzed while minimizing the resources and time required for analysis.
- the present invention seeks to provide an automated protein-ligand binding structure analysis method through a standard algorithm by providing standards for binding positions in 3D structure-based screening technology.
- the present invention includes the steps of (A) constructing a database of a three-dimensional conformer of an analyte (compound); (B) extracting the coordinates of the core ring of the ligand from the 3D binding structure of the target protein-ligand; (C) extracting core ring coordinates from each binding position constituting the conformer of the compound to be screened; (D) Docking of the compound by aligning the core ring of the ligand with the binding position of the compound.
- step (A) can be performed by collecting structure-related data for compounds from compound data whose binding relationship has been verified and then converting the data into 3D information to create 3D conformers.
- the three-dimensional conformers generated in step (A) can be classified and selected according to the root mean square deviation (RMSD) calculated according to the inter-atomic distance information.
- RMSD: root mean square deviation
- the core ring of the ligand and the core ring of the compound may each be set as the group of atoms that lies closest to the center of gravity of the ligand or the compound and has an atomic bonding structure identical or similar to a ring.
- the protein-ligand interaction (PLI) index may be an index of the mutual binding relationship calculated from the number of interactive atoms, which induce binding between the protein and the ligand (or compound), and the number of repulsive atoms (crash), which exert a repulsive effect that hinders binding between the protein and the ligand (or compound).
- the PLI index may be calculated as the difference between the total number of interactive atoms and the number of repulsive atoms.
- a weight may be applied to the repulsive atoms (crash) according to the degree of their repulsive influence.
- the core ring of the ligand and the core ring of the compound are aligned so that they overlap, the binding position (pose) is rotated based on the aligned atom coordinates of the compound, the PLI score is calculated for each rotated pose, and the binding position with the highest binding force according to the PLI score is selected as the binding position (pose) for the corresponding conformer.
- the present invention may further include step (G), in which the binding energy between the compound and the target protein is calculated for the binding positions obtained for each conformer of the compound through step (F), and candidate substances are selected according to that binding energy.
- the binding energy of the (G) step may be calculated according to the three-dimensional arrangement of the central atom and surrounding atoms of the protein with respect to the binding structure of the target protein-compound.
- the present invention simplifies the flexible docking process for protein-ligand 3D structure analysis, which conventionally takes minutes to hours per material structure, and reduces the processing time to several seconds to tens of seconds, which makes it possible to analyze 10 to 100 times more substances in the same time than conventional analysis technology.
- when the present invention is applied to the pre-screening process for large-scale new drug candidates, the amount of computation and the working time of system resources required to determine screening materials based on the docking pose can be reduced to about 1/8.
- AI-drug platform: an artificial intelligence new drug platform
- Figure 2 is a conceptual diagram showing the cloud service structure of an artificial intelligence new drug platform to which the present invention is applied.
- Figure 3 is a conceptual diagram showing the effective substance discovery process of the artificial intelligence new drug platform to which the present invention is applied.
- Figure 4 is a conceptual diagram showing the lead material discovery process of the artificial intelligence new drug platform to which the present invention is applied.
- Figure 5 is a flowchart showing a method for analyzing the protein-ligand binding structure based on a central atom vector according to a specific embodiment of the present invention.
- Figure 6 is an example diagram showing the process of aligning the binding site through the core ring of the binding target material in the GAP-DOCK process according to the present invention.
- Figure 7 is a graph showing the comparison results of the analysis time required for the example to which GAP-DOCK according to the present invention is applied.
- Figure 8 is a graph showing the accuracy comparison results of the analysis results of the example to which GAP-DOCK according to the present invention is applied.
- a preferred embodiment of the present invention for achieving the above-described object includes (A) constructing a database of a three-dimensional conformer of an analyte (compound); (B) extracting the coordinates of the core ring of the ligand from the 3D binding structure of the target protein-ligand; (C) extracting core ring coordinates from each binding position constituting the conformer of the compound to be screened; (D) Docking of the compound by aligning the core ring of the ligand with the binding position of the compound.
- the conformer is binding form (pose) information for a compound, collected from binding information between proteins and compounds whose binding relationship has been verified; in step (A), structure-related data are collected from compound data whose binding relationship has been verified and converted into 3D information to generate 3D conformers, and the 3D conformers generated in step (A) are classified and selected according to an index value (RMSD, root mean square deviation) calculated from the distance information between atoms.
- each block in the attached block diagram and each step in the flow chart may be performed by computer program instructions (an execution engine); because these computer program instructions can be loaded onto a processor of a general-purpose computer, special-purpose computer, or other programmable data processing equipment, the instructions executed through that processor create a means of performing the functions described in each block of the block diagram or each step of the flow diagram.
- because the computer program instructions can also be loaded onto a computer or other programmable data processing equipment, a series of operation steps is performed on that equipment to create a computer-executed process, so that the instructions operating the computer or other programmable data processing equipment may also provide steps for executing the functions described in each block of the block diagram and each step of the flow diagram.
- each block or each step may represent a module, segment, or portion of code containing one or more executable instructions for executing specified logical functions, and in some alternative embodiments the functions mentioned in the blocks or steps may occur out of order.
- Figure 1 is a configuration diagram showing the overall configuration of an artificial intelligence new drug platform (AI-drug platform) to which the present invention is applied
- the AI-drug platform to which the present invention is applied is a platform that performs the entire process of discovering new drug candidates in the preclinical stage, and it can be provided as a service through the applicant's cloud (STB CLOUD).
- new drugs include synthetic new drugs (small molecules) and antibody drugs, and the artificial intelligence new drug platform (AI-drug platform) according to the present invention provides a discovery process for all of them.
- the artificial intelligence new drug platform (AI-drug platform) according to the present invention, as shown in Figure 1, includes a hit material automated discovery platform, a lead material automated discovery platform, and a drug reaction (ADMET: Absorption, Distribution, Metabolism, Excretion & Toxicity) automated analysis platform.
- the artificial intelligence new drug platform (AI-drug platform) according to the present invention is an artificial intelligence platform designed to perform the entire new drug development process: selecting active substances, discovering lead substances among them, and then selecting candidate substances through drug reaction analysis.
- Figure 2 shows the cloud service process of the AI-drug platform according to the present invention. As shown, the platform covers the drug discovery and development process from effective substance discovery, lead substance generation, and ADMET/PK through pharmacogenetics and biomarkers.
- the artificial intelligence new drug platform (AI-drug platform) according to the present invention is composed of three individual artificial intelligence systems: a generative and large-language-model-based artificial intelligence system (GPT/BERT), a three-dimensional structural artificial intelligence system (3D-CNN), and a molecular dynamics analysis system (Auto-MD simulation).
- as specific methods for implementing the hit material automated discovery platform, the lead material automated discovery platform, and the drug reaction automated analysis platform, the applied technologies include effective substance discovery through 3D structural information between proteins and ligands (hereinafter referred to as 'DMC-PRE', a technical name coined by the applicant) and central-atom-vector-based protein-ligand binding structure analysis.
- DMC-PRE and GAP-Dock are technologies applied at the pre-screening stage to discover active substances through the hit material automated discovery platform.
- DMC-SCR is applied to the molecular dynamics analysis system (Auto-MD simulation) and is a technology applied at the later stage to discover active substances through the hit material automated discovery platform.
- LEAD-GEN is a technology applied to discover lead materials through the lead material automated discovery platform.
- DMC-MD is applied to the molecular dynamics analysis system (Auto-MD simulation) and verifies the binding stability of results derived from the hit material automated discovery platform, the lead material automated discovery platform, and the drug reaction (ADMET: Absorption, Distribution, Metabolism, Excretion & Toxicity) automated analysis platform; 3bmGPT is applied to the generative artificial intelligence system (GPT/BERT) and selects the analyte targets to be computed when active substances are discovered through the hit material automated discovery platform.
- GPT/BERT: generative artificial intelligence system
- for the analyte substances selected through 3bmGPT, DMC-PRE and GAP-Dock are applied for preliminary screening, DMC-SCR is applied for in-depth screening, and DMC-MD is applied to verify binding stability, thereby deriving effective substances.
- LEAD-GEN is applied to discover the lead material, and DMC-MD is applied to verify binding stability, thereby deriving the lead material.
- the present invention quantifies the interactions that contribute to binding and the interactions that hinder binding, builds a protein-ligand interaction (PLI) scoring function from the difference between them, and uses this simplified PLI index to evaluate the reliability of a binding position (pose) simply and quickly.
- when the docking form (pose) of a compound on a target protein is analyzed, conformers defined in advance for the binding form of the compound are applied, which significantly reduces the analysis time.
- the present invention defines the ring structure located closest to the center of gravity of the substance to be screened as its core ring and, likewise, defines the ring structure closest to the center of gravity of the ligand extracted from the protein-ligand structure as the ligand's core ring.
- the 3D structure of the compound to be analyzed is aligned with the 3D structure of the ligand so that the two core rings coincide, and the compound's 3D structure is then rotated about the central axis that passes through the plane in which the two core rings lie.
- the present invention determines the stability of the binding position (pose) using a simplified PLI index for preset conformers, so it can perform a task similar to flexible docking with relatively few calculations even for candidate materials with a high degree of freedom.
- the present invention relates to a flexible docking technology used in the process of screening bioactive pharmaceutical substances from the 3D structure of the material selected through the DMC-PRE process of the pre-screening process.
- the change in binding position is used to continuously and repeatedly generate a random compound structure similar to the binding position (pose) of the previous step to create a compound more suitable for the pocket.
- the bonding position is set in the form of a preset conformer for the bonding form of the compound, and the bonding relationship is analyzed.
- the conformer is a form of binding to a protein defined for each compound, and may be defined to include anywhere from one to hundreds of trillions of structures depending on the shape of the protein, the number of atoms constituting the compound, the performance of the equipment used, and the target time.
- the interaction (binding stability) between a protein and a ligand aligned to a specific binding position (pose) can be determined, as described above, by aligning the core ring of each preset ligand structure to the core ring of a known reference structure and then finding the pose that maximizes the PLI (protein-ligand interaction) scoring function, which yields the optimal docking position for the target protein.
- PLI protein-ligand interaction
- because the 3D structures of the compound used to create a diversity of binding positions (poses) are generated in advance and stored in a database, there is no need to regenerate all templates; this saves the computational time and resources required to generate binding positions (poses) for the binding material and significantly reduces the time required for binding position (pose) analysis.
- the reduction in analysis time becomes more pronounced as the degree of freedom of the compound increases, because the number of possible binding positions grows in proportion to the square of the degree of freedom; for compounds with a high degree of freedom, the present invention can show a speed-up of as much as 1000 times.
- the analysis of the protein-ligand binding structure based on the central atom vector according to the present invention includes (A) building a database of the 3D conformer structures of the analyte, (B) extracting the coordinates of the core ring of the ligand from the 3D binding structure of the target protein-ligand, (C) extracting the coordinates of the core ring from each pose constituting the compound conformer to be screened, (D) obtaining the docking coordinate information of the compound by aligning the core ring of the compound pose with the core ring of the ligand, (E) deriving the docking pose with the highest PLI score by using the docking coordinate information of the compound as input to the PLI scoring function, and (F) repeating steps (A) to (E) for the preset compound conformers.
- step (G) refers to the step of estimating the affinity of a substance for a target protein based on the substance structure obtained with the docking algorithm according to the present invention and using this estimate to select a new drug candidate.
- the binding energy is a free energy value calculated, for the docking structure of the target protein-compound, from the 3D arrangement information of the core atom of the protein that is the subject of the energy calculation and its surrounding atoms; it can be calculated with various tools such as AMBER, CHARMM, OPLS, MM/PBSA, MM/GBSA, and enva (Syntekabio).
- step (A), building a database of the 3D conformer structures of the analyte, means obtaining structure-related data for new drug candidates from pharmaceuticals or chemicals (compounds whose binding relationships have already been verified) in sources such as the ZINC database or the ChEMBL database and then generating conformers with tools such as RDKit or Open Babel.
- step (B), extracting the coordinates of the core ring of the ligand from the 3D binding structure of the target protein-ligand, is a step of selecting, from the structure of the target protein-ligand, the coordinates corresponding to the binding center to which the compound will be bound; the ring located closest to the center of mass of the ligand is selected as the core ring.
- the coordinates of the three atoms closest to the center of mass are selected, assumed to be a ring structure, and selected as a core ring.
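The core ring selection described above can be sketched as follows. This is a minimal illustration, not the applicant's gap tool: it assumes the ring memberships are already known (for example from a cheminformatics toolkit), uses the unweighted centroid as a stand-in for the center of mass, and treats the three-closest-atoms rule as a fallback for molecules without rings, which is an interpretation of the text. The function names are hypothetical.

```python
import math


def centroid(points):
    """Unweighted geometric center of a list of (x, y, z) tuples."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(3))


def select_core_ring(coords, rings):
    """Return the atom indices of the core ring.

    coords: list of (x, y, z) atom coordinates.
    rings:  list of rings, each a list of atom indices into coords.
    """
    center = centroid(coords)  # assumption: centroid approximates center of mass
    if rings:
        # Ring whose own centroid lies closest to the molecular center.
        return min(
            rings,
            key=lambda ring: math.dist(centroid([coords[i] for i in ring]), center),
        )
    # Assumed fallback: treat the three atoms nearest the center as a ring.
    order = sorted(range(len(coords)), key=lambda i: math.dist(coords[i], center))
    return sorted(order[:3])
```

The same function can be applied to the reference ligand in step (B) and to every compound pose in step (C), so both sides are reduced to a comparable set of ring coordinates.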
- step (C), extracting core ring coordinates from each pose constituting the compound conformer to be screened, is a step of selecting, for a single pose in the compound conformer structure, the atom group within the compound that will sit at the binding center when the compound to be screened is bound to the target protein in place of the ligand; the ring located closest to the center of mass of the compound is designated as the core ring.
- step (C) is performed for each compound conformer structure.
- in step (D), obtaining docking coordinate information of the compound by aligning the core ring of the compound pose with the core ring of the ligand, the point at which the distance between the core ring coordinates of the compound structure and those of the ligand is minimized is selected, the compound is then rotated along the core ring plane so that it is placed in various directions, and the compound atom coordinates at each orientation are stored.
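The alignment and rotation of step (D) can be sketched as below. This is a simplified illustration under stated assumptions: the compound's core ring is only translated onto the ligand's core ring centroid (a full implementation would presumably also superimpose the two ring planes), and the rotation is performed about the ligand ring's normal through that centroid using Rodrigues' formula. All function names and the number of rotation steps are hypothetical.

```python
import math


def centroid(points):
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(3))


def add(a, b): return tuple(x + y for x, y in zip(a, b))
def sub(a, b): return tuple(x - y for x, y in zip(a, b))
def scale(a, s): return tuple(x * s for x in a)
def dot(a, b): return sum(x * y for x, y in zip(a, b))


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def ring_normal(ring_pts):
    """Unit normal of the ring plane, from the first two atoms and the centroid."""
    c = centroid(ring_pts)
    n = cross(sub(ring_pts[0], c), sub(ring_pts[1], c))
    return scale(n, 1.0 / math.sqrt(dot(n, n)))


def rotate_about_axis(p, origin, axis, angle):
    """Rodrigues rotation of point p about a unit axis passing through origin."""
    v = sub(p, origin)
    c, s = math.cos(angle), math.sin(angle)
    rotated = add(add(scale(v, c), scale(cross(axis, v), s)),
                  scale(axis, dot(axis, v) * (1.0 - c)))
    return add(origin, rotated)


def generate_poses(compound, compound_ring_idx, ligand_ring_pts, n_rot=12):
    """Translate the compound's core ring onto the ligand's, then rotate in-plane."""
    target = centroid(ligand_ring_pts)
    shift = sub(target, centroid([compound[i] for i in compound_ring_idx]))
    moved = [add(p, shift) for p in compound]
    axis = ring_normal(ligand_ring_pts)
    poses = []
    for k in range(n_rot):
        angle = 2.0 * math.pi * k / n_rot
        poses.append([rotate_about_axis(p, target, axis, angle) for p in moved])
    return poses
```

Each returned pose is a full list of compound atom coordinates, which is the form needed as input to a pose scoring function.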
- in step (E), deriving the docking pose with the highest PLI score by using the docking coordinate information of the compound as input to the PLI scoring function, the PLI score is calculated with the PLI scoring function from the atom coordinates of all compound placements obtained by aligning the core ring of the ligand and the core ring of the compound, and the rotated placement with the highest PLI score is selected as the optimal binding position (pose) for that combination.
- PLI: protein-ligand interaction score
- Interactive atom: an atom found to be able to interact with a residue in the target protein
- the weight multiplied by Crash in the PLI scoring function of Equation 1 is a value that can be adjusted experimentally; an analysis using a weight of 2 was presented as an example, but depending on the type of protein or the composition of the material data, the weight can be adjusted within the range of 0.1 to 10.
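Equation 1 itself is not reproduced in this text, so the sketch below assumes the form implied by the surrounding description: the number of interactive atoms minus the weighted number of crash atoms, with a default weight of 2 and an allowed range of 0.1 to 10. How interactive and crash atoms are counted (for example by distance cutoffs) is not specified here, so the counts are taken as inputs. The function names are hypothetical.

```python
def pli_score(n_interactive, n_crash, crash_weight=2.0):
    """Assumed form of the PLI score: interactive atoms minus weighted crash atoms."""
    if not 0.1 <= crash_weight <= 10:
        raise ValueError("crash_weight is expected to lie between 0.1 and 10")
    return n_interactive - crash_weight * n_crash


def best_pose(pose_counts, crash_weight=2.0):
    """Return the index of the pose with the highest PLI score.

    pose_counts: list of (n_interactive, n_crash) tuples, one per rotated pose.
    """
    return max(
        range(len(pose_counts)),
        key=lambda i: pli_score(*pose_counts[i], crash_weight),
    )
```

With counts of (10, 3), (8, 1), and (12, 5), the default weight of 2 favors the second pose, while a weight of 0.5 favors the third, which shows why the weight is described as something to tune per protein or dataset.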
- steps (A) to (E) are repeated for preset compound conformers, and the docking pose with the highest PLI score is selected as the optimal binding position (pose) for the conformer.
- a conformer whose RMSD differed by at least 0.2 from the previously generated conformers was defined as a distinct conformer, and the conformers generated in this way were used to construct a database linked to material coordinate data in PDB format.
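The RMSD-based selection of distinct conformers can be illustrated with the sketch below. The exact RMSD formula used by the applicant is not given; this version assumes an RMSD over pairwise inter-atomic distances, which matches the description of an index calculated from distance information between atoms and does not require prior superposition. The 0.2 threshold comes from the example above, and the function names are hypothetical.

```python
import math
from itertools import combinations


def pair_distances(coords):
    """All inter-atomic distances of one conformer, in a fixed pair order."""
    return [math.dist(a, b) for a, b in combinations(coords, 2)]


def distance_rmsd(conf_a, conf_b):
    """RMSD between the inter-atomic distance lists of two conformers."""
    da, db = pair_distances(conf_a), pair_distances(conf_b)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(da, db)) / len(da))


def select_distinct_conformers(conformers, threshold=0.2):
    """Keep a conformer only if it differs from every kept one by at least threshold."""
    kept = []
    for conf in conformers:
        if all(distance_rmsd(conf, k) >= threshold for k in kept):
            kept.append(conf)
    return kept
```

Both conformers must list the same atoms in the same order for the comparison to be meaningful.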
- the PDB structure of the target protein-ligand complex, 4yur, was downloaded from the PDBbind database, and the atom ID list of the ligand's core ring was extracted using gap, the applicant's own tool.
- the core ring of the ligand and the core ring of the compound were aligned using gap, the translated coordinates of the entire compound were calculated from the coordinate displacement of the compound's core atoms, and the result was saved in PDB format.
- the PLI score was calculated for each docking pose of each compound pose to determine a representative docking pose for each pose making up the conformer, and the docking pose with the highest PLI score among them was selected as the representative docking pose for the conformer.
- the bond energy was calculated from the representative docking pose of the conformer selected in step 5 above using enva, the applicant's own tool, and 1000 materials with a large absolute value of the calculated bond energy were selected.
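The final selection by binding energy can be sketched as a simple ranking. The energies themselves come from an external tool (enva in the example above); the dictionary input and the function name here are hypothetical, and the default of 1000 follows the example.

```python
def select_candidates(energies, top_n=1000):
    """Return the names of the top_n materials with the largest absolute binding energy.

    energies: dict mapping material name to its calculated binding energy.
    """
    ranked = sorted(energies.items(), key=lambda kv: abs(kv[1]), reverse=True)
    return [name for name, _ in ranked[:top_n]]
```

Ranking by absolute value follows the wording of the example; if energies are reported as negative free energies, this is equivalent to choosing the most negative values.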
- in the example according to the present invention, processing 100,000 materials took 5 hours, whereas Comparative Example 1 (the gnina algorithm) was confirmed to require 41 hours to process 100,000 materials.
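The reported totals can be converted into average time per material and a speed-up ratio with a few lines of arithmetic. Only the 5-hour, 41-hour, and 100,000-material figures come from the text; the function name is hypothetical.

```python
def seconds_per_material(total_hours, n_materials):
    """Average wall-clock seconds spent per material."""
    return total_hours * 3600 / n_materials


example = seconds_per_material(5, 100_000)       # example according to the invention
comparative = seconds_per_material(41, 100_000)  # Comparative Example 1 (gnina)
speedup = comparative / example                  # ratio of the two totals, 41 / 5
```

The ratio of 41 to 5 is 8.2, which is consistent with the stated reduction of computation and working time to roughly 1/8.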
- Npc_filter: number of positive controls after filtration
- Npc_input: number of positive controls in the input dataset
- Example according to the present invention shows analysis accuracy that exceeds the prior art, despite a significant reduction in analysis time compared to the prior art.
- the present invention relates to discovering effective substances for new drug development.
- the flexible docking process for protein-ligand 3D structure analysis, which conventionally takes minutes to hours per material structure, is simplified and its processing time is reduced to several seconds to tens of seconds, which makes it possible to analyze 10 to 100 times more substances in the same time than conventional analysis techniques.
Landscapes
- Physics & Mathematics (AREA)
- Health & Medical Sciences (AREA)
- Life Sciences & Earth Sciences (AREA)
- Engineering & Computer Science (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Medical Informatics (AREA)
- Biophysics (AREA)
- Spectroscopy & Molecular Physics (AREA)
- Theoretical Computer Science (AREA)
- Bioinformatics & Computational Biology (AREA)
- Biotechnology (AREA)
- Evolutionary Biology (AREA)
- General Health & Medical Sciences (AREA)
- Molecular Biology (AREA)
- Data Mining & Analysis (AREA)
- Physiology (AREA)
- Genetics & Genomics (AREA)
- Artificial Intelligence (AREA)
- Bioethics (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Databases & Information Systems (AREA)
- Epidemiology (AREA)
- Evolutionary Computation (AREA)
- Public Health (AREA)
- Software Systems (AREA)
- Investigating Or Analysing Biological Materials (AREA)
Abstract
Description
The present invention relates to in silico prescreening technology applied to the active-substance discovery stage of CADD (computer-aided drug discovery) or an AI drug platform for new drug development and, more specifically, to a central-atom-vector-based method for analyzing protein-ligand binding structures that generates the binding structure of a protein bound to a reference ligand when the binding information between the protein and a compound is not known.
In general, new drug development proceeds through candidate discovery and screening, followed by optimization, non-clinical and toxicity testing, and clinical trials. Recently, computational analysis technologies (AI, etc.) have been applied to reduce the time and cost required to discover new drug candidates.
Accordingly, various analysis tools are currently used in the field of candidate substance analysis (computational screening), and representative tools are shown in [Table 1] below.
[Table 1] (recoverable entries)
- Grid-based semi-empirical scoring function, which considers van der Waals forces, electrostatic interactions, and desolvation
- Empirical scoring function (GlideScore), which considers terms like Coulombic, van der Waals, and solvation effects
- 2.5 minutes (once trained)
- Phase, Discovery Studio, MOE
- 3D align, 2D fingerprints, force field
한편, 화합물(compound)로부터 신약 후보물질(drug candidate)을 가상으로 스크리닝 하는 과정에서는, 화학적 성질(chemical property)이나 구조적 유사성에 기반한 선행 스크리닝(pre-screening) 단계와 3차원 결합(3D docking)된 단백질-리간드 간 상호작용(phrotein-ligand interaction) 정보를 활용하는 심층 스크리닝(screening) 단계로 구분된다. Meanwhile, in the process of virtually screening drug candidates from compounds, a pre-screening step based on chemical properties or structural similarity and 3D docking are used. It is divided into an in-depth screening step that utilizes protein-ligand interaction information.
Because the pre-screening step is generally used to reduce the number of candidates drawn from a large pool of substances, it relies on simple number-based discrimination rules such as the rule of 5 (Lipinski's guideline for drug design). Since the information used for this screening is limited, it is generally known to be reliable only when the screening ratio is kept at around 10%.
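As a rough illustration of the kind of number-based filter referred to here, the sketch below applies Lipinski's rule of five to descriptors that are assumed to have been computed elsewhere. The `Descriptors` record and the function names are hypothetical, and tolerating at most one violation is a common convention assumed for this example, not something stated in this document.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Descriptors:
    """Precomputed molecular descriptors (hypothetical container)."""
    mol_weight: float   # molecular weight in daltons
    log_p: float        # octanol-water partition coefficient
    h_donors: int       # hydrogen-bond donors
    h_acceptors: int    # hydrogen-bond acceptors


def rule_of_five_violations(d: Descriptors) -> int:
    """Count how many of the four rule-of-five limits are exceeded."""
    checks = [
        d.mol_weight > 500.0,
        d.log_p > 5.0,
        d.h_donors > 5,
        d.h_acceptors > 10,
    ]
    return sum(checks)


def passes_rule_of_five(d: Descriptors, max_violations: int = 1) -> bool:
    """Assumed convention: accept a compound with at most one violation."""
    return rule_of_five_violations(d) <= max_violations
```

A filter like this is cheap because it needs only four numbers per compound, which is also why it carries so little structural information.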
However, as computing technology advances, demand is growing for computation over analysis targets numbering from thousands to tens or hundreds of billions, which has created a need to improve existing screening strategies.
To address this, strategies for improving pre-screening have been studied, such as comparing the similarity of features including chemical properties, or screening by inter-substance similarity after characterizing two-dimensional patterns of molecular structure. These approaches were limited, however, in that the proportion of true positives (T/P) among the selected results was not high and the structures of all substances could not be expressed as patterns.
Meanwhile, 3D- and 4D-based protein-ligand binding affinity prediction technologies, which are known to give relatively reliable screening, take from several minutes to several hours per analysis and are therefore difficult to apply to screening large numbers of analysis targets.
Accordingly, because of the limits of the low-dimensional information used in pre-screening, screening based on 3D structure is needed even though it consumes substantial system resources. However, the commercially available substances registered in chemical databases such as the ZINC database alone number in the billions, so covering all of them requires a simplified 3D-based screening algorithm that can reduce the time for 3D-information-based screening to a few seconds per substance.
The present invention was devised to solve the problems described above. Existing 3D structure-based screening technology requires long analysis times, which makes screening large numbers of substances difficult. The present invention therefore aims to provide a protein-ligand binding structure analysis method whose analysis time is short enough to allow such large-scale screening.
In addition, existing 3D structure-based screening consumes substantial system resources because, in building a structure with a high probability of interacting with the binding target, it repeatedly generates randomized structures from the structure of the previous step. By supplying efficient binding poses in a predefined form, the present invention aims to provide a protein-ligand binding structure analysis method that analyzes diverse binding poses while minimizing the resources and time required.
The present invention further aims to provide an automated protein-ligand binding structure analysis method for 3D structure-based screening, in which a reference for the binding position is given so that a standard algorithm can be applied.
According to a feature of the present invention for achieving the above objects, the method comprises: (A) building a database of three-dimensional conformers of the substances (compounds) to be analyzed; (B) extracting the coordinates of the core ring of the ligand from the 3D binding structure of the target protein and ligand; (C) extracting core ring coordinates from each of the binding poses that make up the conformers of the compound to be screened; (D) aligning the core ring of a binding pose of the compound with the core ring of the ligand to obtain docking coordinate information for the compound; (E) calculating a protein-ligand interaction score (PLI score) from the docking coordinate information of the compound and selecting a docking pose; and (F) repeating steps (A) to (E) for the preset conformers of the compound. Here, the conformers are binding-form (pose) information of compounds collected from binding information on compounds and proteins whose binding relationship has been verified.
Step (A) may be performed by collecting structure-related data of compounds from compound data whose binding relationship has been verified, and then converting the data into three-dimensional information to generate three-dimensional conformers.
The three-dimensional conformers generated in step (A) may be classified and selected according to an index value, the RMSD (root mean square deviation), calculated from inter-atomic distance information.
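One way such RMSD-based selection could work is sketched below. It is a minimal example under stated assumptions: conformers are lists of `(x, y, z)` tuples with identical atom ordering, no optimal superposition is performed before comparison, and the greedy keep-if-distinct rule and the `threshold` parameter are hypothetical choices rather than the procedure described in this document.

```python
import math


def rmsd(a, b):
    """Root mean square deviation between two conformers.

    Both arguments are lists of (x, y, z) tuples in the same atom order.
    Assumption: the conformers are already in a common frame.
    """
    if len(a) != len(b) or not a:
        raise ValueError("conformers must be non-empty and the same length")
    squared = sum(
        (p - q) ** 2
        for atom_a, atom_b in zip(a, b)
        for p, q in zip(atom_a, atom_b)
    )
    return math.sqrt(squared / len(a))


def select_distinct_conformers(conformers, threshold):
    """Greedily keep conformers that differ from every kept one by >= threshold."""
    kept = []
    for candidate in conformers:
        if all(rmsd(candidate, k) >= threshold for k in kept):
            kept.append(candidate)
    return kept
```

Pruning near-duplicate conformers in this way keeps the database small, so later steps score fewer poses per compound.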
The core ring of the ligand and the core ring of the compound may each be set as the group of bonded atoms that lies closest to the center of mass of the ligand or compound, as determined by distance, and that has an atomic bonding structure identical or similar to a ring.
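The selection rule can be sketched as follows. This is an illustrative example only: ring perception is assumed to have been done elsewhere (rings arrive as lists of atom indices), and the function names, the optional `masses` argument, and the use of the ring centroid as the ring's position are assumptions made for the sketch.

```python
import math


def centroid(points, weights=None):
    """Weighted mean of 3D points; unweighted if no weights are given."""
    if weights is None:
        weights = [1.0] * len(points)
    total = sum(weights)
    return tuple(
        sum(w * p[i] for w, p in zip(weights, points)) / total
        for i in range(3)
    )


def pick_core_ring(coords, rings, masses=None):
    """Return the index of the ring whose centroid is closest to the molecular center.

    coords: list of (x, y, z) per atom.
    rings:  list of rings, each a list of atom indices (assumed precomputed).
    masses: optional per-atom masses; if omitted, the geometric center is used.
    """
    center = centroid(coords, masses)

    def distance_to_center(ring):
        return math.dist(centroid([coords[i] for i in ring]), center)

    return min(range(len(rings)), key=lambda r: distance_to_center(rings[r]))
```

Passing `masses` gives a true center of mass, while omitting it gives the plain geometric center of the atoms.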
The protein-ligand interaction score (PLI score) may be an index of the mutual binding relationship calculated from the number of interactive atoms, which interact so as to promote binding between the protein and the ligand (or compound), and the number of repulsive atoms (crash), which exert a repulsive effect that hinders binding between the protein and the ligand (or compound).
The PLI score may also be calculated as the difference between the total number of interactive atoms and the total number of repulsive atoms (crash).
In calculating the total number of repulsive atoms (crash), a weight may be applied according to the degree of repulsive influence of each repulsive atom.
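A score of this shape, interactive atoms minus weighted repulsive atoms, might be sketched as below. The document does not say how an atom is classified as interactive or repulsive, so the distance cutoffs here (`clash_cutoff`, `contact_cutoff`), the single constant `crash_weight`, and the function name are all assumptions chosen only to make the example concrete.

```python
import math


def pli_score(ligand_atoms, protein_atoms,
              clash_cutoff=2.0, contact_cutoff=4.0, crash_weight=1.0):
    """Toy PLI-style score: (# interactive atoms) - crash_weight * (# crash atoms).

    Assumed classification, not taken from the source: a ligand atom whose
    nearest protein atom is closer than clash_cutoff counts as a crash; one
    whose nearest protein atom lies within contact_cutoff counts as interactive.
    Coordinates are (x, y, z) tuples; protein_atoms must be non-empty.
    """
    interactive = 0
    crash = 0
    for atom in ligand_atoms:
        nearest = min(math.dist(atom, p) for p in protein_atoms)
        if nearest < clash_cutoff:
            crash += 1
        elif nearest <= contact_cutoff:
            interactive += 1
    return interactive - crash_weight * crash
```

Because the score is a pair of counts rather than an energy evaluation, it can be recomputed cheaply for every trial pose.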
Step (E) may be performed by aligning the core ring of the ligand and the core ring of the compound so that they overlap, rotating the binding pose about the atom coordinates of the compound while calculating the PLI score for each rotation, and selecting the pose with the highest binding strength according to the PLI score as the binding pose for that conformer.
The present invention may further include step (G): for the binding poses calculated for each conformer of the compound in step (F), calculating the bond energy between the compound and the target protein and selecting candidate substances according to that bond energy.
The bond energy in step (G) may be calculated, for the binding structure of the target protein and the compound, according to the three-dimensional arrangement of a central atom of the protein and its surrounding atoms.
The central-atom-vector-based protein-ligand binding structure analysis method of the artificial intelligence new drug platform according to the present invention, as described above, can be expected to provide the following effects.
First, the present invention simplifies the flexible docking work in protein-ligand 3D structure analysis, which previously took from minutes to hours per molecular structure, and reduces the processing time to the order of seconds to tens of seconds. As a result, 10 to 100 times or more substances can be analyzed in the same time compared with conventional analysis techniques.
In addition, when the values of the PLI scoring function, used in the present invention as an indicator of interaction, were compared with bond energy, the indicator used in the prior art, a proportional correlation between the two indicators was confirmed. When step yield was analyzed under a 1% selection condition, recall of 25% or more was obtained, a screening accuracy similar to that of the gnina-based analysis method, a representative conventional technique. The present invention therefore maintains analysis accuracy equivalent to the three-dimensional bond energy analysis approach while dramatically reducing processing time.
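For readers unfamiliar with this kind of metric, the sketch below shows one common way to compute recall under a top-fraction selection condition. The document does not spell out how its recall figure was computed, so this definition (share of known actives that land in the top-scoring fraction) and the names `recall_at_fraction`, `scores`, and `is_active` are assumptions.

```python
import math


def recall_at_fraction(scores, is_active, fraction=0.01):
    """Fraction of known actives recovered in the top `fraction` of compounds.

    scores:    one score per compound, higher meaning better.
    is_active: one boolean per compound marking known actives.
    At least one compound is always selected.
    """
    total_actives = sum(1 for flag in is_active if flag)
    if total_actives == 0:
        raise ValueError("at least one active compound is required")
    n_selected = max(1, math.ceil(len(scores) * fraction))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits = sum(1 for i in ranked[:n_selected] if is_active[i])
    return hits / total_actives
```

Under this definition, a recall of 0.25 at `fraction=0.01` would mean that a quarter of all known actives appear in the top 1% of the ranked list.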
Consequently, when the present invention is applied to pre-screening of large numbers of new drug candidates, the computation and working time that system resources require to discriminate screening substances on the basis of docking pose can be reduced to about 1/8.
FIG. 1 is a configuration diagram showing the overall configuration of the artificial intelligence new drug platform (AI-drug platform) to which the present invention is applied.
FIG. 2 is a conceptual diagram showing the cloud service structure of the artificial intelligence new drug platform to which the present invention is applied.
FIG. 3 is a conceptual diagram showing the hit discovery process of the artificial intelligence new drug platform to which the present invention is applied.
FIG. 4 is a conceptual diagram showing the lead discovery process of the artificial intelligence new drug platform to which the present invention is applied.
FIG. 5 is a flowchart showing the central-atom-vector-based protein-ligand binding structure analysis method according to a specific embodiment of the present invention.
FIG. 6 is an illustrative diagram showing how the binding position is aligned through the core ring of the binding target substance in the GAP-DOCK process according to the present invention.
FIG. 7 is a graph comparing the analysis time required in an embodiment to which GAP-DOCK according to the present invention is applied.
FIG. 8 is a graph comparing the accuracy of the analysis results in an embodiment to which GAP-DOCK according to the present invention is applied.
A preferred embodiment of the present invention for achieving the above objects comprises: (A) building a database of three-dimensional conformers of the substances (compounds) to be analyzed; (B) extracting the coordinates of the core ring of the ligand from the 3D binding structure of the target protein and ligand; (C) extracting core ring coordinates from each of the binding poses that make up the conformers of the compound to be screened; (D) aligning the core ring of a binding pose of the compound with the core ring of the ligand to obtain docking coordinate information for the compound; (E) calculating a protein-ligand interaction score (PLI score) from the docking coordinate information of the compound and selecting a docking pose; and (F) repeating steps (A) to (E) for the preset conformers of the compound. Here, the conformers are binding-form (pose) information of compounds collected from binding information on compounds and proteins whose binding relationship has been verified. Step (A) is performed by collecting structure-related data of compounds from compound data whose binding relationship has been verified and then converting the data into three-dimensional information to generate three-dimensional conformers. The three-dimensional conformers generated in step (A) are classified and selected according to an index value, the RMSD (root mean square deviation), calculated from inter-atomic distance information.
Hereinafter, the central-atom-vector-based protein-ligand binding structure analysis method of the artificial intelligence new drug platform according to a specific embodiment of the present invention is described with reference to the accompanying drawings.
Before the description, note that the effects and features of the present invention, and the methods of achieving them, will become clear from the embodiments described in detail below together with the accompanying drawings. The present invention is not limited to the embodiments disclosed below, however, and may be implemented in various different forms. These embodiments are provided only so that the disclosure of the present invention is complete and so that a person of ordinary skill in the art to which the invention pertains is fully informed of the scope of the invention. The present invention is defined only by the scope of the claims.
In describing the embodiments of the present invention, detailed descriptions of well-known functions or configurations are omitted where they would unnecessarily obscure the gist of the invention. The terms used below are defined in consideration of their functions in the embodiments and may vary according to the intention or practice of users and operators. Their definitions should therefore be based on the content of this specification as a whole.
Combinations of the blocks of the attached block diagrams and the steps of the flowcharts may be carried out by computer program instructions (an execution engine). These instructions may be loaded onto the processor of a general-purpose computer, a special-purpose computer, or other programmable data processing equipment, so that the instructions executed by that processor create means for performing the functions described in each block of the block diagrams or each step of the flowcharts.
The computer program instructions may also be loaded onto a computer or other programmable data processing equipment, so that a series of operational steps is performed on that equipment to produce a computer-executed process. The instructions that run on the computer or other programmable data processing equipment can thereby provide steps for executing the functions described in each block of the block diagrams and each step of the flowcharts.
Each block or step may also represent a module, segment, or portion of code containing one or more executable instructions for carrying out the specified logical functions, and in some alternative embodiments the functions mentioned in the blocks or steps may occur out of order.
In the field of artificial intelligence new drug platforms to which the present invention is applied, most technical terms have no established Korean definition and are known by their English names. Where a technical term is given in Korean alongside its English name, it should be interpreted in the sense of the English name commonly used in the field.
Before describing specific embodiments of the central-atom-vector-based protein-ligand binding structure analysis method according to the present invention, the overall artificial intelligence new drug platform to which the invention is applied is described.
FIG. 1 is a configuration diagram showing the overall configuration of the artificial intelligence new drug platform (AI-drug platform) to which the present invention is applied, FIG. 2 is a conceptual diagram showing the cloud service structure of the platform, FIG. 3 is a conceptual diagram showing its hit discovery process, and FIG. 4 is a conceptual diagram showing its lead discovery process.
The artificial intelligence new drug platform (AI-drug platform) to which the present invention is applied is essentially a platform that carries out the entire process of discovering new drug candidates at the preclinical stage, and it may be offered by the applicant as a cloud service (STB CLOUD).
Here, new drugs include both synthetic new drugs (small molecules) and antibody new drugs (antibodies), and the AI-drug platform according to the present invention provides a discovery process for both.
To this end, as shown in FIG. 1, the AI-drug platform according to the present invention comprises an automated hit discovery platform, an automated lead discovery platform, and an automated drug response (ADMET: Absorption, Distribution, Metabolism, Excretion and Toxicity) analysis platform.
That is, the AI-drug platform according to the present invention is an artificial intelligence platform configured to carry out the whole new drug development process: screening hit substances, discovering lead substances among them, and then selecting candidate substances through drug response analysis.
FIG. 2 shows the cloud service process of the AI-drug platform according to the present invention. As shown there, the platform discovers hit substances, generates lead substances, and covers every area of the drug discovery and development process, from ADMET/PK to pharmacogenetic biomarkers.
To operate the platforms for each discovery stage of new drug development, the AI-drug platform according to the present invention employs three separate artificial intelligence systems: a large language model based system including generative AI (GPT/BERT), a three-dimensional structure AI system (3D-CNN), and a molecular dynamics analysis system (Auto-MD simulation).
Using these artificial intelligence systems, the following specific methods are applied to run the automated hit discovery platform, the automated lead discovery platform, and the automated drug response (ADMET) analysis platform: hit discovery using three-dimensional protein-ligand structural information (hereinafter 'DMC-PRE', a technology name coined by the applicant); central-atom-vector-based protein-ligand docking structure analysis (hereinafter 'GAP-Dock', a technology name coined by the applicant); prediction of optimized protein-compound binding structures using a 3D-CNN learning model (hereinafter 'DMC-SCR', a technology name coined by the applicant); derivative generation based on the binding pocket structure of the target protein (hereinafter 'LEAD-GEN', a technology name coined by the applicant); analysis of protein-compound binding stability using molecular dynamics simulation data (hereinafter 'DMC-MD', a technology name coined by the applicant); and a generative artificial intelligence model trained on three-dimensional protein-compound interaction data (hereinafter '3bmGPT', a technology name coined by the applicant).
Here, DMC-PRE and GAP-Dock are technologies applied first when discovering hit substances through the automated hit discovery platform. DMC-SCR is applied to the molecular dynamics analysis system (Auto-MD simulation) and is applied afterwards when discovering hit substances through the automated hit discovery platform. LEAD-GEN is a technology applied when discovering lead substances through the automated lead discovery platform.
DMC-MD is applied to the molecular dynamics analysis system (Auto-MD simulation) and verifies the binding stability of the results produced by the automated hit discovery platform, the automated lead discovery platform, and the automated drug response (ADMET) analysis platform. 3bmGPT is applied to the generative artificial intelligence system (GPT/BERT) and selects the substances to be analyzed when hit substances are derived through the automated hit discovery platform.
Specifically, in the hit discovery process of the AI-drug platform according to the present invention, as shown in FIG. 3, the substances to be analyzed are selected with 3bmGPT, pre-screened with DMC-PRE and GAP-Dock, screened in depth with DMC-SCR, and then checked for binding stability with DMC-MD to derive the hit substances.
In the lead discovery process of the AI-drug platform according to the present invention, as shown in FIG. 4, lead substances are discovered with LEAD-GEN and then checked for binding stability with DMC-MD to derive the lead substances.
Hereinafter, the central-atom-vector-based protein-ligand binding structure analysis method according to the present invention is described in detail.
FIG. 5 is a flowchart showing the central-atom-vector-based protein-ligand binding structure analysis method according to a specific embodiment of the present invention. FIG. 6 is an illustrative diagram showing how the binding position is aligned through the core ring of the binding target substance in the GAP-DOCK process. FIG. 7 is a graph comparing the analysis time required in an embodiment to which GAP-DOCK is applied, and FIG. 8 is a graph comparing the accuracy of the analysis results in that embodiment.
To summarize the technical features of the present invention: the invention quantifies the interactions that contribute to binding and the interactions that hinder binding, builds a PLI (protein-ligand-interaction) scoring function from the difference between them, and uses the resulting PLI score as a simplified way to evaluate the reliability of a binding pose simply and quickly.
그리고 본 발명은 표적 단백질에 화합물의 결합(docking) 형태/위치(pose)를 변화시켜 분석함에 있어, 화합물(compound)의 결합형태에 대한 컨포머(conformer)를 사전에 정의된 바에 의해 적용하여, 분석시간을 현저히 감소시키도록 한다.In addition, in the present invention, when analyzing the docking form/pose of a compound on a target protein, a conformer for the binding form of the compound is applied as defined in advance, Significantly reduce analysis time.
또한, 본 발명은 스크리닝 대상 물질의 무게중심에 가까운 중심에 위치한 ring 구조를 코어링(core ring)으로 정의하고, 마찬가지로, protein-ligand 구조로부터 추출된 ligand의 center of atom에 가까운 ring 구조를 코어링(core ring)으로 정의하여, 양측의 코어링(core ring)이 매칭되도록 ligand의 3D 구조에 분석대상 compound의 3D 구조를 align 한 뒤, 두 개의 코어링이 배치된 평면을 관통하는 중심축을 기준으로 회전시켜 각각의 compound pose로부터 가장 높은 PLI-score를 갖는 결합 pose를 도출하고, 다른 conformer compound pose들에 대해서도 이 과정을 반복하여 모든 conformer compound pose 들 중 PLI score가 가장 우수한 구조를 해당 3D 구조에 대한 최적 docking pose로 정의한다.In addition, the present invention defines the ring structure located at the center close to the center of gravity of the substance to be screened as core ring, and similarly, the ring structure close to the center of atom of the ligand extracted from the protein-ligand structure is defined as core ring. Defined as a (core ring), the 3D structure of the compound to be analyzed is aligned with the 3D structure of the ligand so that the core rings on both sides are matched, and then the 3D structure of the compound to be analyzed is aligned based on the central axis that passes through the plane where the two core rings are placed. Rotate to derive the bonding pose with the highest PLI-score from each compound pose, repeat this process for other conformer compound poses, and select the structure with the best PLI score among all conformer compound poses for the corresponding 3D structure. Defined as the optimal docking pose.
이와 같이, 본 발명은 기 설정된 컨포머에 대하여 간소화된 PLI 지수를 이용하는 결합위치(pose)에 대한 안정성을 판단하므로, 자유도가 높은 후보 물질에 대해서도 비교적 적은 연산으로 flexible docking 과 유사한 작업 수행이 가능하다. In this way, the present invention determines the stability of the binding position (pose) using a simplified PLI index for a preset conformer, so it is possible to perform a task similar to flexible docking with relatively few calculations even for candidate materials with a high degree of freedom. .
즉, 본 발명은 선행 스크리닝(pre-screening) 과정의 DMC-PRE 과정을 거쳐 선별된 물질의 3D 구조로부터 의약품 생리활성 물질을 스크리닝하는 과정에서 사용되는 flexible docking 기술에 관한 것이다. In other words, the present invention relates to a flexible docking technology used in the process of screening bioactive pharmaceutical substances from the 3D structure of the material selected through the DMC-PRE process of the pre-screening process.
Conventional analysis varies the binding position by repeatedly generating random compound structures similar to the pose of the previous step, iteratively producing 3D coordinates of the compound that fit the pocket better. In contrast, the present invention sets the binding positions from predefined conformers of the compound and analyzes the binding relationship for those.
Here, a conformer is the set of binding forms with the protein defined for each compound, and it may be defined to contain from one to hundreds of trillions of structures depending on the shape of the protein, the number of atoms in the compound, the performance of the equipment used, and the target time.
Meanwhile, the interaction (binding stability) between a protein and a ligand arranged in a particular pose is assessed, as described above, by aligning the core rings of the predefined ligand structures to the core ring of a known reference structure and then finding the pose that maximizes the PLI (protein-ligand interaction) scoring function; this yields the optimal docking pose for the target protein.
That is, as shown in FIG. 6, docking poses generated for the same target protein can be compared quantitatively by their PLI scores, so compounds favorable for docking can be selected first.
The 3D compound structures used to create the variety of poses are generated in advance and stored in a database, so they do not need to be regenerated for every template. This saves the computation time and resources needed to generate poses for a compound and significantly reduces the time required for pose analysis.
Moreover, the reduction in analysis time becomes more pronounced as the degrees of freedom of the compound increase, because the number of possible binding positions grows in proportion to the square of the degrees of freedom. For compounds with many degrees of freedom, the present invention shows a speed-up of 1,000 times or more.
The specific procedure of the central-atom-vector-based protein-ligand binding structure analysis method according to the present invention is described in detail below.
The central-atom-vector-based protein-ligand binding structure analysis according to the present invention is performed through: (A) building a database of the 3D conformer structures of the compounds to be analyzed; (B) extracting the coordinates of the ligand's core ring from the 3D binding structure of the target protein-ligand complex; (C) extracting core ring coordinates from each of the poses that make up the compound conformer to be screened; (D) aligning the core ring of the compound pose with the core ring of the ligand to obtain the docking coordinates of the compound; (E) using the docking coordinates of the compound as input to the PLI scoring function to derive the docking pose with the highest PLI score; and (F) repeating steps (A) to (E) for the predefined compound conformers.
The bond energies of the structures obtained in this way can then be compared to select compounds with a high probability of having affinity for the target protein.
Specifically, this refers to estimating the affinity a compound will have for the target protein on the basis of the compound structure obtained with the binding structure analysis method (docking algorithm) of the present invention, and using that estimate to select new drug candidates.
Here, the bond energy is a free energy value calculated for the docked target protein-compound structure from the 3D arrangement of the protein's core atoms, which are the subject of the energy calculation, and their surrounding atoms. It can be computed with a variety of tools such as AMBER, CHARMM, OPLS, MM/PBSA, MM/GBSA, and enva (Syntekabio).
Each of these steps is examined in detail below.
First, step (A), building a database of the 3D conformer structures of the compounds to be analyzed, means obtaining structural data on new drug candidates from collections of pharmaceuticals or chemicals (compounds whose binding relationships have already been verified) such as the ZINC database or the ChEMBL database, and then generating conformers with a tool such as RDKit or Open Babel.
The form of the input data, the source database, and the conformer generation tool may each take various forms.
Next, step (B), extracting the coordinates of the ligand's core ring from the 3D binding structure of the target protein-ligand complex, selects the coordinates corresponding to the binding center at which the compound will be bound: the ring located closest to the ligand's center of mass is chosen as the core ring.
If the ligand has no ring structure, the coordinates of the three atoms closest to the center of mass are selected, treated as a ring, and used as the core ring.
Step (C), extracting core ring coordinates from each of the poses that make up the compound conformer to be screened, selects, for a single compound among the compound conformer structures, the group of atoms that will sit at the binding center when the compound binds to the target protein in place of the ligand: the ring located closest to the compound's center of mass is designated as the core ring.
Here too, if the compound has no ring structure, the coordinates of the three atoms closest to the center of mass are selected, treated as a ring, and used as the core ring.
Step (C) is carried out separately for each compound conformer structure.
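The core ring selection described for steps (B) and (C) can be sketched as follows. This is a minimal illustration, not the applicant's gap tool: the function names are hypothetical, ring perception is assumed to have been done elsewhere (rings are passed in as tuples of atom indices), and "closest to the center of mass" is interpreted here as the ring whose centroid is nearest the center of mass.

```python
import math

def center_of_mass(coords, masses):
    """Mass-weighted mean position of all atoms."""
    total = sum(masses)
    return tuple(sum(m * c[i] for m, c in zip(masses, coords)) / total
                 for i in range(3))

def centroid(points):
    """Unweighted mean position of a set of points."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(3))

def select_core_ring(coords, masses, rings):
    """Return the atom indices of the core ring.

    rings is a list of tuples of atom indices (assumed to be detected
    beforehand). If it is empty, the three atoms nearest the center of
    mass are treated as a ring, as the text describes.
    """
    com = center_of_mass(coords, masses)
    if rings:
        # Assumption: ring distance is measured from the ring centroid.
        return min(rings,
                   key=lambda ring: math.dist(
                       centroid([coords[i] for i in ring]), com))
    nearest = sorted(range(len(coords)),
                     key=lambda i: math.dist(coords[i], com))[:3]
    return tuple(nearest)
```

The same function applies to the ligand in step (B) and to each compound pose in step (C); only the input coordinates differ.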
Step (D), aligning the core ring of the compound pose with the core ring of the ligand to obtain the docking coordinates of the compound, selects the placement that minimizes the distance between the core ring coordinates of the compound structure and those of the ligand ring, then rotates the compound along the core ring plane to place it in various orientations, storing the compound atom coordinates for each.
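A simplified sketch of step (D) is given here. It makes several assumptions that the source does not specify: the alignment is reduced to a translation that superimposes the two core ring centroids (matching the orientation of the ring planes is omitted), the rotation axis is the normal of the ligand ring through its centroid, and the number of rotation steps `n_rotations` is an arbitrary illustrative value. All function names are hypothetical.

```python
import math

def sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

def add(a, b):
    return (a[0] + b[0], a[1] + b[1], a[2] + b[2])

def dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def centroid(points):
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(3))

def ring_normal(ring_pts):
    """Unit normal of the ring plane, from two ring atoms and the centroid."""
    c = centroid(ring_pts)
    n = cross(sub(ring_pts[0], c), sub(ring_pts[1], c))
    length = math.sqrt(dot(n, n))
    return (n[0] / length, n[1] / length, n[2] / length)

def rotate_about_axis(p, origin, axis, angle):
    """Rodrigues rotation of point p about a unit axis through origin."""
    v = sub(p, origin)
    c, s = math.cos(angle), math.sin(angle)
    kxv = cross(axis, v)
    kdv = dot(axis, v)
    r = tuple(v[i] * c + kxv[i] * s + axis[i] * kdv * (1.0 - c)
              for i in range(3))
    return add(r, origin)

def generate_poses(compound, compound_ring_idx, ligand_ring_pts, n_rotations=12):
    """Translate the compound onto the ligand core ring, then rotate it."""
    origin = centroid(ligand_ring_pts)
    shift = sub(origin, centroid([compound[i] for i in compound_ring_idx]))
    moved = [add(p, shift) for p in compound]
    axis = ring_normal(ligand_ring_pts)
    poses = []
    for k in range(n_rotations):
        angle = 2.0 * math.pi * k / n_rotations
        poses.append([rotate_about_axis(p, origin, axis, angle) for p in moved])
    return poses
```

Each returned pose is a full list of compound atom coordinates, which is the form the scoring step expects as input.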
Step (E), using the docking coordinates of the compound as input to the PLI scoring function to derive the docking pose with the highest PLI score, computes a PLI score from the atom coordinates of every compound placement obtained by aligning the ligand's core ring with the compound's core ring, and selects the rotated placement with the highest PLI score as the optimal binding position (pose) for that binding.
An example of the equation for calculating the PLI score is given in [Equation 1] below.
PLI = protein-ligand interaction score
Interactive atom = a searched atom that is able to interact with a residue in the target protein
Crash = a searched atom that is repulsive because its distance to a residue atom in the target protein is too small
Weight = the ratio by which a crash affects the affinity
In the PLI scoring function of Equation 1, the weight multiplied by Crash is a value that can be tuned experimentally. The present invention presents analysis with a weight of 2 as an example, but the value can be adjusted within the range of 0.1 to 10 depending on the type of protein or the composition of the compound data.
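Equation 1 itself is not reproduced in this text. Reading from the symbol definitions and from the statement that the score is the difference between favorable and unfavorable interactions, one plausible form is PLI = (number of interactive atoms) - Weight x (number of crash atoms). The sketch below implements that reading. The distance cutoffs used to classify an atom as "crash" or "interactive" are hypothetical placeholders, since the source does not state how these atoms are identified, and `pli_score` is an illustrative name rather than the applicant's function.

```python
import math

def pli_score(compound_atoms, protein_atoms, weight=2.0,
              crash_cutoff=2.0, interact_cutoff=4.0):
    """Count-based PLI score: interactive atoms minus weighted crashes.

    crash_cutoff and interact_cutoff are assumed values (same length
    unit as the coordinates); weight=2.0 follows the example in the text.
    """
    n_interactive = 0
    n_crash = 0
    for atom in compound_atoms:
        # Distance from this compound atom to the nearest protein atom.
        nearest = min(math.dist(atom, p) for p in protein_atoms)
        if nearest < crash_cutoff:
            n_crash += 1          # too close: counted as a crash
        elif nearest <= interact_cutoff:
            n_interactive += 1    # close enough to interact
    return n_interactive - weight * n_crash
```

With the weight exposed as a parameter, the 0.1 to 10 tuning range mentioned in the text can be explored without changing the function.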
Meanwhile, in step (F), steps (A) to (E) are repeated for the predefined compound conformers, and the docking pose with the highest PLI score is selected as the optimal binding position (pose) for that conformer.
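The selection in step (F) amounts to a maximum over two nested levels: the poses of a conformer, and the rotated docking poses generated from each of them. A minimal sketch, assuming the docking poses have already been generated and that any scoring function returning a number can be passed in (`best_docking_pose` and its arguments are hypothetical names):

```python
def best_docking_pose(docking_poses_by_compound_pose, score_fn):
    """Return (score, compound_pose_index, docking_pose_index) of the best pose.

    docking_poses_by_compound_pose[i] is the list of rotated docking poses
    generated from the i-th pose of the compound conformer.
    """
    best = None
    for ci, docking_poses in enumerate(docking_poses_by_compound_pose):
        for di, pose in enumerate(docking_poses):
            score = score_fn(pose)
            # Keep the first pose seen with the highest score.
            if best is None or score > best[0]:
                best = (score, ci, di)
    return best
```

Returning the indices alongside the score makes it possible to look up the winning coordinates afterwards for the bond energy calculation.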
The results of a specific example of the present invention are described in detail below, together with a comparative example.
Example
1. Generation of the 3D conformer structures of the compounds to be analyzed
Information on 100,000 SMILES strings (data that encode a compound's structure as one-dimensional text) was extracted from the ZINC20 database, and 1,000 3D conformer structures were generated for each using RDKit.
A newly generated structure was counted as a distinct conformer only if its RMSD differed from the previously generated conformers by at least 0.2, and a database was built in which the compound coordinate data in PDB format are linked.
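The RMSD criterion above can be illustrated with a small greedy filter. This is a sketch under stated assumptions: the candidate structures are assumed to share atom ordering and to be already superimposed (no alignment is performed), the threshold of 0.2 is taken from the text without a unit, and the function names are hypothetical. In practice a cheminformatics toolkit would normally handle both generation and pruning.

```python
import math

def rmsd(a, b):
    """Root-mean-square deviation between two equally ordered coordinate lists."""
    return math.sqrt(sum(math.dist(p, q) ** 2 for p, q in zip(a, b)) / len(a))

def prune_conformers(candidates, threshold=0.2):
    """Keep a candidate only if it differs from every kept one by >= threshold."""
    kept = []
    for conf in candidates:
        if all(rmsd(conf, k) >= threshold for k in kept):
            kept.append(conf)
    return kept
```

Because the filter is greedy, the result depends on the order in which candidates are generated.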
As positive controls, 127 PLK4 ligands were downloaded from the ChEMBL database and included; their conformer database was built in the same way as for the ZINC20 compounds.
2. Extraction of the ligand's core ring coordinates from the 3D binding structure of the target protein-ligand complex
The PDB structure of the target protein-ligand complex, 4yur, was downloaded from the PDBbind database, and the atom ID list of the ligand's core ring was extracted using gap, the applicant's in-house tool.
3. Extraction of core ring coordinates from each of the poses that make up the compound conformer to be screened
An atom ID list was extracted for each single pose of the compound conformer. Then, as for the ligand, the atom ID list of the compound's core ring was extracted using gap, the applicant's in-house tool.
4. Generation of docking structures by aligning the core ring of the compound pose with the core ring of the ligand
Based on the coordinate information obtained from the ligand and compound structures, gap was used to align the compound's core ring with the ligand's core ring; the displacement of the compound's core atoms was then used to compute the moved coordinates of the whole compound, which were saved in PDB format.
The compound structure was then rotated along the ring formed by the core atoms to generate multiple docking poses.
5. Comparison of the PLI scores of the compound's docking structures
Using the PLI calculation formula supported by the gap code, a PLI score was computed for every docking pose of each compound pose to determine the representative docking pose of each pose in the conformer; among these, the docking pose with the highest PLI score was selected as the representative docking pose of the conformer.
6. Selection of candidate compounds from the compound set using the PLI score and BE filter
From the representative docking poses selected in step 5 above, bond energies were calculated using enva, the applicant's in-house tool, and the 1,000 compounds with the largest absolute bond energy were selected.
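The final filter in step 6 is a ranking by absolute bond energy. A minimal sketch, assuming the bond energies have already been computed by an external tool and collected in a dictionary keyed by compound identifier (the function name and data layout are illustrative):

```python
def select_top_by_bond_energy(bond_energies, n=1000):
    """Return the n compound IDs with the largest absolute bond energy."""
    ranked = sorted(bond_energies.items(),
                    key=lambda item: abs(item[1]),
                    reverse=True)
    return [compound_id for compound_id, _ in ranked[:n]]
```

For instance, `select_top_by_bond_energy({"a": -9.1, "b": -3.0, "c": -12.5}, n=2)` keeps `"c"` and `"a"`, in that order.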
Comparative Example 1
The same positive controls and input data as in the Example were used, but instead of the algorithm of the present invention, screening was performed with the widely used tool gnina on the basis of CNN affinity, and the time required was compared as in Analysis Example 1 (FIG. 7). The recall was also compared on the basis of Analysis Example 2 (FIG. 8).
Analysis Example 1
To compare the analysis time of the Example according to the present invention with that of Comparative Example 1, in the Example the 100,000 compounds to be screened were mixed with the 127 positive controls, the top 1,000 compounds were selected on a hardware system equipped with 100 CPUs, and the time required for the selection process was measured.
For Comparative Example 1, the gnina dataset was downloaded from GitHub, CNN affinity was calculated with the 3D structures as input, and compounds with high CNN affinity were selected first; the time required to select the top 1% of compounds was measured.
As shown in FIG. 7, the Example according to the present invention took 5 hours to process the 100,000 compounds, whereas Comparative Example 1 (the gnina algorithm) took 41 hours.
Analysis Example 2
For each algorithm's screening method, the number of positive controls in the final list extracted at the top-1% cutoff was counted, and the recall was calculated with [Equation 2] below.
recall: yield of positive controls after the filtration algorithm
Npc_filter: number of positive controls after filtration
Npc_input: number of positive controls in the input dataset
Comparing the Example and Comparative Example 1 in this way under the condition of extracting 10,000 drug candidates, as shown in FIG. 8, the Example according to the present invention achieved a recall of 37.8% and Comparative Example 1 achieved 22.0%. This confirms that the Example, despite its markedly shorter analysis time, shows analysis accuracy exceeding that of the conventional technique.
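Equation 2 is not reproduced in this text; from the symbol definitions it reads naturally as recall = Npc_filter / Npc_input, expressed as a percentage. The sketch below computes it. The counts 48 and 28 used in the usage note are not stated in the source; they are simply the integer counts that reproduce the reported 37.8% and 22.0% against 127 positive controls.

```python
def recall_percent(n_pc_filter, n_pc_input):
    """Recall (%) = positive controls retained / positive controls in input."""
    if n_pc_input <= 0:
        raise ValueError("n_pc_input must be positive")
    return 100.0 * n_pc_filter / n_pc_input
```

Under that assumption, `round(recall_percent(48, 127), 1)` gives 37.8 and `round(recall_percent(28, 127), 1)` gives 22.0.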
The scope of rights of the present invention is not limited to the embodiments described above but is defined by the claims, and it is evident that a person of ordinary skill in the art can make various modifications and adaptations within the scope of the claims.
The present invention relates to the discovery of active substances for new drug development. According to the present invention, protein-ligand 3D structure analysis, which used to take minutes to hours per compound structure, is reduced to several seconds to several tens of seconds by simplifying the flexible docking task, making it possible to analyze 10 to 100 times or more compounds in the same time compared with conventional analysis techniques.
Claims (10)
Applications Claiming Priority (4)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
KR10-2022-0119661 | 2022-09-21 | ||
KR20220119661 | 2022-09-21 | ||
KR1020230126610A KR20240040670A (en) | 2022-09-21 | 2023-09-21 | Aanalysis methods of protein-ligand docking structure based on vector for AI drug platform |
KR10-2023-0126610 | 2023-09-21 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2024063584A1 true WO2024063584A1 (en) | 2024-03-28 |
Family
ID=90454737
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/KR2023/014454 WO2024063584A1 (en) | 2022-09-21 | 2023-09-21 | Central-atom-vector-based protein-ligand binding structure analysis method of artificial intelligence new drug platform |
Country Status (1)
Country | Link |
---|---|
WO (1) | WO2024063584A1 (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2005038429A2 (en) * | 2003-10-14 | 2005-04-28 | Verseon | Method and apparatus for analysis of molecular configurations and combinations |
KR20200128710A (en) * | 2018-03-05 | 2020-11-16 | 더 보드 어브 트러스티스 어브 더 리랜드 스탠포드 주니어 유니버시티 | A method for improving binding and activity prediction based on machine learning and molecular simulation |
KR102296188B1 (en) * | 2019-10-21 | 2021-09-01 | 주식회사 스탠다임 | Methods and apparatus for designing compounds |
KR20210153540A (en) * | 2020-06-10 | 2021-12-17 | 주식회사 에이조스바이오 | System for phenotype-based anticancer drug screening using artificial intelligence deep learning |
Non-Patent Citations (1)
Title |
---|
CRAMPON KEVIN, GIORKALLOS ALEXIS, DELDOSSI MYRTILLE, BAUD STÉPHANIE, STEFFENEL LUIZ ANGELO: "Machine-learning methods for ligand–protein molecular docking", DRUG DISCOVERY TODAY, ELSEVIER, AMSTERDAM, NL, vol. 27, no. 1, 1 January 2022 (2022-01-01), AMSTERDAM, NL , pages 151 - 164, XP093150102, ISSN: 1359-6446, DOI: 10.1016/j.drudis.2021.09.007 * |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 23868636 Country of ref document: EP Kind code of ref document: A1 |
|
NENP | Non-entry into the national phase |
Ref country code: DE |