WO2021080122A1 - Method and apparatus for analyzing protein-ligand interaction using parallel operation - Google Patents
- Publication number
- WO2021080122A1 (PCT/KR2020/008789)
- Authority
- WO
- WIPO (PCT)
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16H—HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
- G16H50/00—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
- G16H50/20—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for computer-aided diagnosis, e.g. based on medical expert systems
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B15/00—ICT specially adapted for analysing two-dimensional or three-dimensional molecular structures, e.g. structural or functional relations or structure alignment
- G16B15/30—Drug targeting using structural data; Docking or binding prediction
-
- G—PHYSICS
- G01—MEASURING; TESTING
- G01N—INVESTIGATING OR ANALYSING MATERIALS BY DETERMINING THEIR CHEMICAL OR PHYSICAL PROPERTIES
- G01N33/00—Investigating or analysing materials by specific methods not covered by groups G01N1/00 - G01N31/00
- G01N33/48—Biological material, e.g. blood, urine; Haemocytometers
- G01N33/50—Chemical analysis of biological material, e.g. blood, urine; Testing involving biospecific ligand binding methods; Immunological testing
- G01N33/68—Chemical analysis of biological material, e.g. blood, urine; Testing involving biospecific ligand binding methods; Immunological testing involving proteins, peptides or amino acids
- G01N33/6803—General methods of protein analysis not limited to specific proteins or families of proteins
- G01N33/6845—Methods of identifying protein-protein interactions in protein mixtures
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B15/00—ICT specially adapted for analysing two-dimensional or three-dimensional molecular structures, e.g. structural or functional relations or structure alignment
- G16B15/20—Protein or domain folding
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B5/00—ICT specially adapted for modelling or simulations in systems biology, e.g. gene-regulatory networks, protein interaction networks or metabolic networks
Definitions
- the present disclosure relates to a method and an apparatus associated with molecular docking to predict protein-ligand interaction.
- One particular embodiment relates to a method and apparatus for performing molecular docking simulation related to a target protein and a ligand using parallel computing with a plurality of processors.
- molecular docking simulation is used to derive information associated with binding between one molecule and another molecule.
- Such molecular docking simulation is mainly used for simulation associated with binding through interaction between specific proteins and ligands, and in this manner, it is possible to perform a simulation using a computer before a biochemical experiment.
- the molecular docking simulation may provide information on a pose of a receptor and a ligand, the pose in which the two molecules can form a stable binding.
- a simulation tool may perform various computations to verify whether the pose between the receptor and the ligand allows a stable binding.
- An aspect provides a method and an apparatus for simulation associated with molecular docking. Another aspect provides a method and an apparatus for molecular docking simulation by performing a plurality of processes in parallel.
- a method for simulation associated with molecular docking includes identifying information on a plurality of ligands, identifying protein information including grid information on a plurality of proteins, identifying at least one ligand set based on a size of each ligand according to the information on the plurality of ligands, determining a score for the at least one ligand set based on properties of ligands included in the ligand set, copying, to a Graphics Processing Unit (GPU) memory, information on ligands included in ligand sets in a first group selected based on the score for the at least one ligand set and the number of GPU cores involved in performing molecular docking simulation, and the protein information, and performing molecular docking simulation using the GPU cores based on the information copied to the GPU memory.
- an apparatus for molecular docking simulation includes a Central Processing Unit (CPU) core, a plurality of Graphics Processing Unit (GPU) cores, and a controller, and the controller is configured to identify information on a plurality of ligands, identify protein information including grid information on a plurality of proteins, identify at least one ligand set based on a size of each ligand according to the information on the plurality of ligands, determine a score for the at least one ligand set based on properties of ligands included in the ligand set, copy, to a GPU memory, information on ligands included in ligand sets in a first group selected based on the score for the at least one ligand set and the number of GPU cores involved in performing the molecular docking simulation, and the protein information, and perform the molecular docking simulation using the GPU cores based on the information copied to the GPU memory.
- a method and an apparatus that perform molecular docking simulation by carrying out computations in parallel are provided.
- a processor such as a Graphics Processing Unit (GPU)
- FIG. 1 is a diagram illustrating a structure of a Central Processing Unit (CPU) and a Graphics Processing Unit (GPU) according to an embodiment of the present disclosure.
- FIG. 2 is a flowchart illustrating a method for molecular docking simulation according to an embodiment of the present disclosure.
- FIG. 3 is a flowchart illustrating a method for performing a simulation using parallel computing according to an embodiment of the present disclosure.
- FIG. 4 is a flowchart illustrating a method in which a computation and a data structure corresponding to a second processing unit are identified from a computation and a data structure corresponding to a first processing unit for simulation, and a computation associated with simulation is then performed based on the identified computation and data structure.
- FIG. 5 is a flowchart illustrating a method for molecular docking simulation using a Graphics Processing Unit (GPU) core according to an embodiment of the present disclosure.
- FIG. 6 is a schematic view illustrating ligand information including array information according to an embodiment of the present disclosure.
- FIG. 7 is a schematic view illustrating a step of copying ligand information including array information to a GPU memory according to an embodiment of the present disclosure.
- FIG. 8 is a schematic view illustrating a method for classifying ligand sets by score according to an embodiment of the present disclosure.
- FIG. 9 is a flowchart illustrating a method for performing molecular docking simulations in parallel according to an embodiment of the present disclosure.
- FIG. 10 is a block diagram illustrating an apparatus for molecular docking simulation according to an embodiment of the present disclosure.
- each block of the flowchart illustrations, and combinations of blocks in the flowchart illustrations can be implemented by computer program instructions.
- These computer program instructions can be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create a means for implementing the functions specified in the flowchart block or blocks.
- These computer program instructions may also be stored in a computer-usable or computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-usable or computer-readable memory produce an article of manufacture including instruction means that implement the function specified in the flowchart block or blocks.
- the computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operations to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions that are executed on the computer or other programmable apparatus provide operations for implementing the functions specified in the flowchart block or blocks.
- blocks of the flowchart refer to parts of code, segments, or modules that include one or more executable instructions to perform one or more logic functions. It should be noted that the functions described in the blocks of the flowchart may be performed in a different order from the embodiments described above. For example, the functions described in two adjacent blocks may be performed at the same time or in reverse order.
- the component 'module' refers to a software element or a hardware element such as an FPGA or an ASIC, and performs a corresponding function. It should be understood, however, that the component 'module' is not limited to a software or hardware element.
- the component 'module' may be implemented in storage media that can be designated by addresses and may also be configured to run on one or more processors.
- the component 'module' may include various types of elements (e.g., software elements, object-oriented software elements, class elements, task elements, etc.), segments (e.g., processes, functions, achieves, attribute, procedures, sub-routines, program codes, etc.), drivers, firmware, micro-codes, circuit, data, data base, data structures, tables, arrays, variables, etc.
- functions provided by elements and the components 'modules' may be combined into a smaller number of elements and components 'modules' or may be further divided into additional elements and components 'modules.'
- elements and components 'modules' may also be implemented to run on one or more CPUs in devices or secure multimedia cards.
- FIG. 1 is a diagram illustrating the structure of a Central Processing Unit (CPU) and a Graphics Processing Unit (GPU) according to an embodiment of the present disclosure.
- configurations of a CPU 100 and configurations of a GPU 150 are illustrated.
- the CPU 100 may include at least one of a controller 105, an arithmetic logic unit (ALU) 110, a cache memory 115, and a DRAM 120.
- the controller 105 may control overall operations of the CPU 100.
- the CPU may include at least one ALU 110, and the ALU 110 may perform a computation based on a control of the controller 105.
- the cache memory 115 may store and process data associated with a computation of the CPU 100.
- the cache memory 115 may hold copies of the most frequently used locations in the main memory, and may be a memory that has a relatively low capacity but can store and retrieve data quickly. Because memory accesses tend to cluster around specific locations, copying data to the cache memory can reduce the average memory access time.
- when the controller tries to read from or write to the main memory, the controller may first check whether data corresponding to the address exists in the cache memory. If it does, the data may be read directly from the cache memory; otherwise, the main memory may be accessed directly. In the latter case, most processors copy the data fetched from the main memory into the cache memory, so that later accesses to the same address can read and write the data in the cache memory.
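- the read path described above can be sketched as follows. This is only an illustrative model: the class name `CachedMemory`, the `capacity` parameter, and the first-in-first-out eviction rule are assumptions made for the example and are not taken from the disclosure.

```python
class CachedMemory:
    """Toy model of a read path that checks a small cache before main memory."""

    def __init__(self, main_memory, capacity=4):
        self.main_memory = main_memory  # address -> value (stands in for the DRAM)
        self.capacity = capacity        # hypothetical number of cache entries
        self.cache = {}                 # address -> value, kept in insertion order
        self.hits = 0
        self.misses = 0

    def read(self, address):
        # Step 1: check whether the address is already present in the cache.
        if address in self.cache:
            self.hits += 1
            return self.cache[address]
        # Step 2: on a miss, access main memory directly.
        self.misses += 1
        value = self.main_memory[address]
        # Step 3: copy the fetched data into the cache for later accesses.
        # ASSUMPTION: evict the oldest entry (FIFO) when the cache is full.
        if len(self.cache) >= self.capacity:
            oldest = next(iter(self.cache))
            del self.cache[oldest]
        self.cache[address] = value
        return value
```

- reading the same address twice produces one miss followed by one hit, which is the effect that reduces the average access time.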
- the DRAM 120 may be the main memory and may store data associated with a computation of the CPU 100.
- the GPU 150 may include at least one of a controller 155, an ALU 160, a cache memory 165, and a DRAM 170.
- the controller 155 may control an operation of the corresponding ALU 160.
- the GPU 150 may include one or more controllers 155, and the controller 155 may include the corresponding cache memory 165 and at least one ALU 160. As such, since each controller 155 may include the cache memory 165 and at least one ALU 160, there is an advantage in performing multiple computations in parallel.
- the ALU 110 of the CPU 100 may often be superior in computing performance to the ALU 160 of the GPU 150.
- for computations that can be performed in parallel, the GPU 150 may nevertheless achieve faster computing performance than the CPU 100.
- it is necessary to convert the computation and also necessary to convert address-related information for accessing the cache memory and the main memory.
- FIG. 2 is a flowchart illustrating a molecular docking simulation method according to an embodiment of the present disclosure.
- Referring to FIG. 2, a method for performing molecular docking simulation in a computing device is illustrated.
- the computing device may identify at least one of a protein and a ligand.
- at least one of the protein and the ligand may be identified based on at least one of: information input by a user, information stored in a database, and a result learned by a specific computing device.
- a pose search associated with identification of interaction may be performed.
- the computing device may perform at least one of: energy evaluation associated with binding and evaluation for optimization.
- a genetic algorithm may be used for the energy evaluation.
- analysis of a differential equation corresponding to each element involved in the binding may be performed.
- a result of interaction may be identified according to a result associated with the above computation.
- a pose search may be performed to identify interaction between the protein and the ligand.
- through the pose search, it is possible to identify the interaction between the protein and the ligand, and to predict the binding energy and the properties of the molecule formed after the binding.
- FIG. 3 is a flowchart illustrating a method for performing simulation using parallel computing according to an embodiment of the present disclosure.
- Referring to FIG. 3, a method for performing parallel computing for a pose search in a computing device is illustrated.
- the computing device may identify information associated with analysis of a genetic algorithm and analysis of an optimization differential equation in relation to the pose search.
- at least one of the energy-related computation and optimization-related computation may be performed, and, in an embodiment, the computing device may perform the computations in a different method or order according to a target protein and a ligand type.
- a data structure for parallel computing and a memory index corresponding thereto may be identified.
- a procedure of dividing one computation into a plurality of computations, performing the plurality of computations, and merging results from the plurality of computations may also be performed.
- simulation may be performed as follows: the corresponding computation is divided into a plurality of computations, the plurality of computations are performed by the GPU, and a result value is derived by merging the results of the plurality of computations.
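- the divide, compute, and merge procedure can be sketched as follows. This is a minimal sketch under stated assumptions: CPU threads stand in for GPU cores, and the helper names `split`, `partial_energy`, and `total_energy`, as well as the placeholder summation, are hypothetical and not part of the disclosure.

```python
from concurrent.futures import ThreadPoolExecutor


def split(items, parts):
    """Divide items into at most `parts` contiguous chunks of near-equal size."""
    size, extra = divmod(len(items), parts)
    chunks, start = [], 0
    for i in range(parts):
        end = start + size + (1 if i < extra else 0)
        if end > start:  # skip empty chunks when there are fewer items than parts
            chunks.append(items[start:end])
        start = end
    return chunks


def partial_energy(chunk):
    # Placeholder for one sub-computation (here: summing hypothetical energy terms).
    return sum(chunk)


def total_energy(terms, workers=4):
    """Divide the computation, run the pieces concurrently, then merge the results."""
    chunks = split(terms, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(partial_energy, chunks))
    return sum(partials)  # merge step
```

- the merge step here is a plain sum; any associative reduction could take its place.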
- a data structure necessary to perform the computations in the GPU may be formed so that the computation can be easily performed in the GPU, and the data structure may be copied to a GPU-related memory.
- parallel computing associated with molecular docking simulation may be performed in the CPU and the GPU.
- data associated with simulation may be converted into an array type so that the data associated with simulation can be copied to the GPU-related memory, and the converted data may also include information on a target protein as well as information on a configuration and a structure of a molecule associated with simulation.
- the computing device may perform parallel computing associated with molecular docking simulation in a plurality of processes based on at least one of the information identified in the previous step, and may derive a simulation result based on results of performed computations.
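- one way to picture the conversion of simulation data into an array type is shown below. It is a sketch only: the flat coordinate array plus offset table layout, and the names `flatten_ligands` and `atom`, are assumptions chosen for illustration, and no actual GPU copy is performed.

```python
from array import array


def flatten_ligands(ligands):
    """Pack per-ligand atom coordinates into flat arrays plus an offset table.

    ligands: list of ligands, each a list of (x, y, z) tuples.
    Returns (coords, offsets); ligand i owns atoms offsets[i] .. offsets[i + 1] - 1.
    """
    coords = array("f")          # x0, y0, z0, x1, y1, z1, ... as 32-bit floats
    offsets = array("i", [0])    # running atom count, one entry per ligand boundary
    for atoms in ligands:
        for x, y, z in atoms:
            coords.extend((x, y, z))
        offsets.append(offsets[-1] + len(atoms))
    return coords, offsets


def atom(coords, offsets, ligand_index, atom_index):
    """Read one atom back from the flat layout using only integer indexing."""
    base = 3 * (offsets[ligand_index] + atom_index)
    return tuple(coords[base:base + 3])
```

- a contiguous buffer of this kind can be copied in a single transfer, and each parallel worker can locate its ligand with the offset table alone.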
- FIG. 4 is a flowchart of a method for identifying a computation and a data structure corresponding to a second processing unit based on a computation and a data structure corresponding to a first processing unit for simulation and performing a computation associated with simulation based on the identified computation and the identified data structure according to an embodiment of the present disclosure.
- a computing device identifies a computation and a data structure corresponding to a second processing unit based on a computation and a data structure corresponding to a first processing unit and performs a computation based on the identified computation and the identified data structure during molecular docking simulation.
- the computing device may identify the computation and the data structure corresponding to the first processing unit for simulation.
- the first processing unit may include a CPU, and a computation to be performed for simulation and a data structure corresponding to the computation may be identified.
- the computing device may identify the computation and the data structure corresponding to the second processing unit for simulation based on the identified information.
- the second processing unit may include a GPU, and a computation for performing simulation on the GPU and a data structure corresponding to the computation may be identified based on the information identified in the previous step.
- the computing device may identify a computation and a data structure by further considering performance of the GPU used for the computation.
- the computing device may identify a processor to perform a computation based on the number of GPUs used for the computation, and may copy data matching a data structure corresponding to the identified processor to a memory corresponding to the processor.
- the computing device may perform a computation associated with simulation using the second processing unit.
- the computation may be performed in parallel, and a control associated with the parallel computing may be performed through at least one of a controller, a CPU, and a GPU of the computing device. By performing such a control, it is possible to perform a computation corresponding to the first processing unit by the second processing unit in parallel and to identify a final computation result associated with the simulation by merging results of the parallel computing.
- the computing device may identify result values of the parallel computing, and may derive a simulation result value based on the identified values.
- since simulation is performed by converting a computation and a data structure corresponding to the CPU so that they can be performed in parallel in the GPU, it is possible to reduce the time taken for simulation associated with molecular docking without using high-performance hardware.
- molecular docking to predict protein-ligand interaction may be simulated, and a program such as AutoDock Vina (AD Vina) may be used for the simulation.
- although a computation can be performed through the CPU for the simulation, it is necessary to perform the computation in parallel, given the nature of docking simulation.
- GPU-accelerated molecular simulation has become easier to access in the field of molecular dynamics simulation.
- Such a computation is not limited to porting the computation performed in the CPU to be performed in the GPU environment, and may be generally used to process a computation for molecular docking simulation in parallel.
- one of the processes requiring the largest number of computations in docking simulation is the "pose search", and the pose search may be simulated with parallel computing, thereby reducing the time taken for the computation. More specifically, a large portion of the computation in simulation associated with docking may correspond to identifying a pose for docking and evaluating energy therefor. It is expected that performance improvement can be achieved by performing such a computation in parallel, and the parallel computing may be more easily implemented in a GPU environment.
- a simulation environment may be realized by converting code associated with simulation implemented in the CPU so that the converted code can be run in the GPU.
- the simulation environment may be realized through a GPU-related code corresponding to an algorithm for energy evaluation.
- it is necessary to perform optimization for the pose search, and it is necessary to write code that calculates a differential equation for the optimization in parallel in the GPU.
- Such a composite data structure may include a data structure for storing data regarding proteins, ligands, and docking calculation therebetween, and for storing and processing information associated with calculation.
- the data structure may be generated using a static memory.
- data-related expressions corresponding to each other to be interoperable both in the CPU and the GPU may be used.
- a GPU-related code may be implemented to correspond to a CPU-related code.
- a cross-docking campaign may be performed. For example, a single query ligand requires approximately 10,000 docking computations corresponding to the number of target proteins.
- parallel computing with the CPU alone may not lead to a significant improvement in performance, and it requires a high-performance computing device with a large number of CPUs.
- when the aforementioned computations are performed through the GPU, it is possible to increase the computational speed with less expensive hardware. For example, when docking simulation performed by the CPU is instead processed in parallel on a typical GPU, it is possible to perform calculations at 10 times or more the speed of the related art.
- the aforementioned AD Vina program may perform a pose search for molecular docking using a Monte Carlo search algorithm and the Metropolis-Hastings algorithm.
- optimization for a pose search and energy evaluation may be performed using a quasi-Newton method (in particular, the Broyden-Fletcher-Goldfarb-Shanno (BFGS) method).
- ligand information and protein information may be read from a PDBQT file through AD Vina, and a grid cache may be generated based on the read ligand information and the read protein information. Then, an optimal pose with strong binding energy may be predicted through the Monte Carlo search using the generated grid cache.
- the Monte Carlo search is a random sampling method that uses random values, that is, a search method for performing docking simulation while changing a molecular structure based on a statistical distribution.
- energy is evaluated by arbitrarily changing the orientation and torsion of atoms and molecules. Then, optimization between pose and energy evaluation may be performed as follows: if the energy of the new structure relative to the energy of the previous structure is consistent with the Boltzmann distribution, the new structure is selected; if not, the new structure is discarded and the search returns to the previous structure.
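- the accept-or-revert step can be illustrated with the standard Metropolis criterion, which is assumed here to be what the Boltzmann-distribution test refers to. The function names, the `temperature` default, and the generic `energy` and `perturb` callables are hypothetical; this is a sketch, not the AD Vina implementation.

```python
import math
import random


def metropolis_accept(e_old, e_new, temperature, rng=random):
    """Return True if the new structure should replace the previous one."""
    if e_new <= e_old:
        return True  # a lower-energy structure is always kept
    # A higher-energy structure is kept with Boltzmann probability exp(-dE / T).
    return rng.random() < math.exp(-(e_new - e_old) / temperature)


def monte_carlo_search(energy, perturb, start, steps, temperature=1.2, seed=0):
    """Random search that keeps or reverts each randomly perturbed structure."""
    rng = random.Random(seed)
    current, e_current = start, energy(start)
    best, e_best = current, e_current
    for _ in range(steps):
        candidate = perturb(current, rng)   # e.g. a random orientation/torsion change
        e_candidate = energy(candidate)
        if metropolis_accept(e_current, e_candidate, temperature, rng):
            current, e_current = candidate, e_candidate
            if e_current < e_best:
                best, e_best = current, e_current
        # otherwise the candidate is discarded and `current` is left unchanged
    return best, e_best
```

- for a toy one-dimensional energy, the search could be called as `monte_carlo_search(lambda x: (x - 3) ** 2, lambda x, rng: x + rng.uniform(-1, 1), 0.0, 2000)`.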
- a pose search may be performed using a quasi-Newton method.
- the quasi-Newton method is an estimation method that uses approximate values, by which a pose with a high energy-evaluation value can be estimated using changes and mutations of the ligand conformation.
- a pose search may be performed using the Metropolis-Hastings algorithm by extracting rejected samples, and the ligand conformations (e.g., 20 conformations) with optimal values of protein-ligand binding energy evaluation may be output.
- the number of ligand conformations to be output may be set based on at least one of: a user setting, characteristics of an estimated sample, and a configuration of a computing device.
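- selecting a configurable number of output conformations can be sketched as below. The assumption, labeled here because the text does not state it, is that a lower (more negative) energy value is the better one; the function name `best_conformations` and the `(energy, pose_id)` tuple layout are hypothetical.

```python
import heapq


def best_conformations(scored_poses, k=20):
    """Return the k poses with the lowest energy value, best first.

    scored_poses: iterable of (energy, pose_id) tuples.
    ASSUMPTION: lower energy means stronger predicted binding.
    """
    return heapq.nsmallest(k, scored_poses, key=lambda item: item[0])
```

- the parameter `k` plays the role of the user-configurable number of conformations mentioned above.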
- a grid cache may be generated based on grid information generated based on protein information.
- once the grid cache has been generated, its size and shape do not affect the simulation execution time, owing to the characteristics of the grid cache. For example, when a cache for the grid is generated based on protein information, the total simulation execution time may be 1 second or less, and, after the simulation, repeated tasks such as memory copying may be reduced by efficiently utilizing the protein information.
- the number of repetitions of the Monte Carlo Search may be determined heuristically according to the number of atoms in a ligand and the degree of freedom of a ligand, by which the number of times to perform the energy evaluation can be determined.
- the number of times to perform the energy evaluation may be determined according to the number of atoms in a ligand and the degree of freedom of a ligand, and at least one of a method and a processor to perform a computation may be determined based on the determined number of times.
- a method and an apparatus for simulation that can perform computations in a CPU and a GPU may be implemented.
- the maximum number of dockings may be processed within a predetermined time by using the CPU and GPU performance at the same time.
- a basic algorithm and a parameter used in AD Vina may be utilized, and a method of performing a computation associated with L × P dockings at once in parallel according to the number (L) of ligands and the number (P) of proteins which are considered in docking may be devised.
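- the L × P docking tasks can be enumerated with a single flat task index, which is how a parallel worker would typically find its (ligand, protein) pair. The row-major mapping and the function names below are assumptions for illustration only.

```python
def pair_to_index(ligand_index, protein_index, num_proteins):
    """Map a (ligand, protein) pair to a flat task index (row-major order)."""
    return ligand_index * num_proteins + protein_index


def index_to_pair(task_index, num_proteins):
    """Recover the (ligand, protein) pair handled by a given flat task index."""
    return divmod(task_index, num_proteins)


def docking_tasks(num_ligands, num_proteins):
    """List all L x P docking tasks in flat-index order."""
    total = num_ligands * num_proteins
    return [index_to_pair(t, num_proteins) for t in range(total)]
```

- with L = 2 ligands and P = 3 proteins, this yields six independent tasks, each of which can be dispatched to a separate worker.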
- FIG. 5 is a flowchart illustrating a method for molecular docking simulation using GPU cores according to an embodiment of the present disclosure.
- information on a plurality of ligands that are the target of the molecular docking simulation may be identified.
- Information on the plurality of ligands may include ligand property information including at least one of: ligand conformation information, the number of atoms of the ligands, and the degree of freedom of the ligands.
- protein information on a plurality of proteins that are the target of the molecular docking simulation may be identified, and in this case, the protein information may include grid information.
- the protein information may include information on the size of proteins in a grid cache type, a structure of a receptor to bind with a ligand, and the like.
- At least one ligand set may be identified based on the size of each ligand according to the information on the plurality of ligands.
- a plurality of ligands may be clustered into a single ligand set based on the number of atoms of the ligands or the degree of freedom of the ligands. That is, ligands having either the same number of atoms of each ligand or the same degree of freedom of each ligand may be considered to have similar sizes, and the ligands having the similar sizes may be clustered into one ligand set.
- the plurality of ligands L_1 to L_n may be classified into ligand sets S_1 to S_m (n and m are integers satisfying n ≥ m). In doing so, it is possible to implement simulation so that ligands having similar sizes and a plurality of proteins can be docked.
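- the clustering of ligands into sets of similar size can be sketched as a simple grouping by a size key. The `Ligand` record and the `cluster_ligands` helper are hypothetical names; exact-match grouping on the number of atoms (or on the degree of freedom) follows the description above.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Ligand:
    name: str
    num_atoms: int
    degrees_of_freedom: int


def cluster_ligands(ligands, key=lambda lig: lig.num_atoms):
    """Group ligands whose size key matches into one ligand set.

    By default the key is the number of atoms; pass
    key=lambda lig: lig.degrees_of_freedom to group by degree of freedom.
    Returns a dict mapping each key value to its list of ligands.
    """
    sets = defaultdict(list)
    for lig in ligands:
        sets[key(lig)].append(lig)
    return dict(sets)
```

- because every ligand lands in exactly one set, the number of sets m can never exceed the number of ligands n.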
- a ligand size may vary according to the environment of a system where molecular docking simulation is performed and the conformation of a ligand to be simulated.
- a score for at least one ligand set may be determined based on properties of a ligand included in the ligand set. Specifically, a score for a ligand set may be determined based on a first index and a second index associated with a ligand included in the ligand set. For example, a score for a ligand set may correspond to a value obtained by multiplying the first index and the second index.
- the first index may be determined based on the number of atoms of ligands included in a ligand set and the degree of freedom of ligands included in a ligand set.
- a second index may be determined based on the number of ligands included in a ligand set and the number of a plurality of proteins corresponding thereto. These properties of ligands may be identified from the information on the plurality of ligands.
- the first index may be determined based on the number of atoms of ligands, the degree of freedom of ligands, and the warp size of a GPU. More specifically, the first index may be determined according to a value proportional to the square of the number of atoms of ligands and inversely proportional to at least one of: the degree of freedom of ligands and the warp size of a GPU. This is because energy efficiency according to a pose of a protein-ligand complex is proportional to the square of the number of atoms of a ligand, and because the degree of freedom of a ligand serves to reduce the energy efficiency.
- a value inversely proportional to the warp size of a GPU may correspond to a value proportional to the maximum number of GPU thread blocks capable of simultaneously executing docking simulation, by which the number of computational procedures capable of being executed simultaneously may be determined.
- the first index may be determined using Equation 1 below.
- N denotes the number of atoms of a ligand
- w denotes the warp size of a GPU
- F denotes the degree of freedom of a ligand. Interaction between atoms of a ligand compound may be evaluated through N, and w may be generally a constant value of 32.
- the second index may be determined according to a value that is proportional to at least one of: the number of ligands included in a ligand set and the number of a plurality of proteins. This is because the size of the ligand set (which can correspond to the number of ligands included in the ligand set) and the number of the proteins have a linear direct effect on data copying to a GPU memory.
- the second index may be determined using Equation 2 below.
- e denotes a data array conversion constant
- L denotes the number of ligands included in a ligand set
- P denotes the number of proteins.
- e may be set to a default value of 8 or may be set by a user. When set by the user, e may be set to 20 or less.
- e may correspond to the number of simulations to be processed in parallel in a GPU core per docking simulation. This may be a value corresponding to the number of data converted to array information per execution of the docking simulation when the simulation is implemented.
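- The scoring described above can be sketched as follows. Equations 1 and 2 are not reproduced in this text, so the exact forms used here (N²/(w·F) for the first index and e·L·P for the second index) are assumptions chosen only to match the stated proportionalities; the function names are hypothetical.

```python
WARP_SIZE = 32   # w: the text states this is generally 32
DEFAULT_E = 8    # e: default data array conversion constant per the text

def first_index(num_atoms, dof, warp_size=WARP_SIZE):
    # Assumed form: proportional to N^2, inversely proportional to F and w.
    return num_atoms ** 2 / (dof * warp_size)

def second_index(num_ligands, num_proteins, e=DEFAULT_E):
    # Assumed form: proportional to the ligand count L and the protein count P.
    if e > 20:
        raise ValueError("e is described as 20 or less when set by a user")
    return e * num_ligands * num_proteins

def ligand_set_score(num_atoms, dof, num_ligands, num_proteins, e=DEFAULT_E):
    # The text gives the score as the product of the first and second indices.
    return first_index(num_atoms, dof) * second_index(num_ligands, num_proteins, e)

# Illustrative values: N = 32, F = 8, L = 10, P = 4.
score = ligand_set_score(num_atoms=32, dof=8, num_ligands=10, num_proteins=4)
```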
- a plurality of ligand sets may be selected as a first group based on a score for at least one ligand set and the number of GPU cores involved in performing molecular docking simulation, and protein information and information on ligands belonging to the ligand sets in the first group may be copied to a GPU memory.
- molecular docking simulation may be performed using GPU cores based on the information copied to the GPU memory in step 511.
- the ligand information copied to the GPU memory may include one or more of: ligand conformation information, the number of atoms of a ligand, and the degree of freedom of a ligand.
- the information copied to the GPU memory may be array information including information on ligands.
- the array information may include an array of constituent molecular information and an array of connection relationship information of constituent molecules in relation to ligands. This will be described with reference to FIG. 6.
- FIG. 6 is a schematic diagram illustrating ligand information including array information.
- constituent molecules associated with ligands and the connection relationship thereof may be as shown in a state 610 dynamically allocated in a tree structure.
- a computation may be performed in one processor, but it is difficult to perform a computation in a processor not suitable for the corresponding memory structure.
- ligand information may be expressed in an array type so that the information can be interoperable in both a CPU core and a GPU core.
- ligand information 610 in a tree structure may be implemented as continuous array information 620. In this case, data with the same parent nodes according to the tree structure may be implemented in parallel in the same array information.
- the information 610 dynamically allocated in the tree structure may be implemented as the continuous array information 620 that consists of data of nodes 1, 2, 3, 4, 5, 6 and 7 and the parent nodes of the respective nodes -, 1, 1, 1, 2, 3, and 3.
- When the ligand information 610 is expressed in this array type, it is possible to effectively copy data to a memory corresponding to each processor so that parallel computing can be performed easily.
- Using the array information including the ligand information, it is possible to enable a mutation module, a pose search, energy evaluation, and an optimization module to operate in a GPU core.
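- The conversion from a tree structure to continuous arrays can be sketched as follows. This is a minimal illustration, assuming a hypothetical nested-dictionary tree whose shape is inferred from the parent list given for nodes 1 to 7; the breadth-first traversal is an illustrative choice, not a statement of how the embodiment orders nodes.

```python
from collections import deque

def flatten_tree(root):
    """Flatten a nested node tree into two parallel arrays in breadth-first order."""
    node_ids, parents = [], []
    queue = deque([(root, None)])
    while queue:
        node, parent_id = queue.popleft()
        node_ids.append(node["id"])
        parents.append(parent_id)          # None marks the root ("-" in the text)
        for child in node.get("children", []):
            queue.append((child, node["id"]))
    return node_ids, parents

# Hypothetical tree: node 1 has children 2, 3, 4; node 2 has child 5; node 3 has children 6, 7.
tree = {"id": 1, "children": [
    {"id": 2, "children": [{"id": 5}]},
    {"id": 3, "children": [{"id": 6}, {"id": 7}]},
    {"id": 4},
]}

node_ids, parents = flatten_tree(tree)
```

- Because both arrays are contiguous and index-aligned, they can be copied to another memory in a single transfer without following pointers.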
- a computation associated with such array information may be performed using a software or hardware layer-related tool that enables the use of a virtual instruction set of a GPU, and, for example, such a computation may be implemented using the CUDA Thrust library.
- other appropriate parallel algorithms may be used to express a highly complex data structure.
- In step 509, the number of times to copy, to the GPU memory, the information on ligands belonging to the ligand sets included in the first group and the protein information may be determined based on the number of times to perform a pose search for the ligands belonging to the ligand sets. This will be described with reference to FIG. 7.
- FIG. 7 is a schematic diagram illustrating a step in which ligand information including array information is copied to a GPU memory.
- ligand information 710 in a tree structure may be implemented as a plurality of array information 720-1, 720-2, and 720-3.
- Each of the array information 720-1, 720-2, and 720-3 may include ligand-related array information, which includes information on ligand-related constituent molecules, a connection relationship of the constituent molecules, the number of atoms of a ligand, and atomic coordinates forming a ligand molecule.
- each of the array information 720-1, 720-2, and 720-3 may include information on proteins to be docked, and accordingly, each of the array information may include array information on a pair LP of one ligand and one protein. Since information on all ligand-protein pairs for docking is included in such array information, if the corresponding information is copied once to a memory associated with a processor for performing such a computation, a computation associated with molecular docking simulation may be performed in the processor. Through this procedure, a computation associated with simulation may be performed in parallel in a plurality of processors. Even in an environment where the processor to perform the next computation is determined during the course of the computation, the corresponding computation may be performed independently in each processor simply by copying to a memory and receiving a result of the computation, thereby improving computation efficiency.
- the array information 720-1, 720-2, and 720-3 may be copied to a GPU memory a number of times corresponding to the number of pose searches of ligands.
- array information corresponding to information on ligand-protein pairs for docking may be generated and copied.
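- The generation of array information for ligand-protein pairs can be sketched as follows. This is an illustrative sketch only: the function name and the use of two index arrays are assumptions, and real array information would also carry atom coordinates and grid data.

```python
from itertools import product

def build_pair_arrays(ligand_ids, protein_ids):
    """Return two parallel index arrays with one entry per ligand-protein pair."""
    pairs = list(product(range(len(ligand_ids)), range(len(protein_ids))))
    ligand_index = [l for l, _ in pairs]
    protein_index = [p for _, p in pairs]
    return ligand_index, protein_index

# Hypothetical identifiers: L = 3 ligands and P = 2 proteins give L x P = 6 pairs.
lig_idx, prot_idx = build_pair_arrays(["L1", "L2", "L3"], ["P1", "P2"])
```

- Entry i of both arrays describes one independent docking, which is what allows each pair to be handled by a separate processor after a single copy.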
- the number of times to perform a pose search for a ligand may be determined by complexity of a ligand, and the complexity of a ligand may be based on the number of atoms of a ligand or the degree of freedom of a ligand.
- the number of atoms of a ligand may be offset by an increase in computational efficiency due to the use of a GPU core when evaluating the binding energy of the protein-ligand complex. Therefore, complexity of a pose search for a ligand may be increased according to the degree of freedom of the ligand, and the computational load of docking simulation may be increased by increasing the number of times to copy to a GPU memory.
- the array information 720-1, 720-2, and 720-3 may be copied all at once in units of ligand sets. That is, when copied to the GPU memory once, array information about ligands belonging to the same ligand set may be copied together. For example, when ligands according to ligand information included in the array information 720-1 and the array information 720-2 are classified into the same ligand set due to the same number of atoms or the same degree of freedom, the array information 720-1 and the array information 720-2 may be copied together to the GPU memory, unlike the array information 720-3, and each array information may be separated and copied to a memory corresponding to each GPU that performs a computation. In this case, the array information may be copied after the pose search for a ligand is completed.
- a docking model for molecular docking simulation is a complex structure that includes not just information on a structure and constituent atoms of a compound but also information on a shape and a conformation change of the compound. Accordingly, in an embodiment of the present disclosure, since such a complex structure is capable of being expressed as memories in a continuous form, a large number of docking simulations may be performed.
- data synchronization between a CPU core and a GPU core may be enabled by a single memory copying computation, so that an overhead required for the data synchronization can be greatly reduced and a large amount of docking can be performed.
- Direct Memory Access (DMA) in the system is effective in copying a large amount of memory at a time, which can increase the efficiency of docking simulation.
- ligand sets included in a first group may be selected based on a value that is proportional to a score for at least one ligand set and the number of GPU cores and inversely proportional to the warp size of a GPU.
- the value inversely proportional to the warp size of a GPU may correspond to a value proportional to a maximum number of GPU thread blocks that can execute docking simulation simultaneously.
- ligand sets each having a ligand set score greater than a calculated value considering the number of GPU cores may be selected as a first group.
- ligand sets each having a ligand set score smaller than a calculated value considering the number of GPU cores may be selected as a second group.
- the score for the ligand set included in the first group may be greater than the score for the ligand set included in the second group.
- a ligand set including ligands equal to or greater than a predetermined size may be subject to molecular docking simulation in a GPU core.
- the ligand sets corresponding to the first group may be determined using Equation 3 below.
- k denotes a constant according to the GPU performance index
- w denotes the GPU warp size that is generally 32.
- FIG. 8 is a schematic diagram showing a method of classifying ligand sets by the score according to an embodiment.
- FIG. 9 is a flowchart illustrating a method for performing molecular docking simulation on a first group and molecular docking simulation on a second group in parallel according to an embodiment.
- a plurality of ligand sets may be sorted in order from a low ligand set score to a high ligand set score.
- a ligand set score and a calculated value considering the number of GPU cores may be compared in step 901, and ligand sets each having a ligand set score greater than a calculated value considering the number of GPU cores may be selected as a first group, which represents Large Set 810, in step 902.
- ligand sets each having a ligand set score smaller than a calculated value considering the number of GPU cores may be selected as a second group, which represents Small Set 830, in step 903.
- a calculated value considering the number of GPU cores may be determined based on GPU performance index k, and more specifically, the calculated value may be a value proportional to the GPU performance index.
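- The comparison in steps 901 to 903 can be sketched as follows. Equation 3 is not reproduced in this text, so the threshold form k × (number of GPU cores) / w is an assumption that only reflects the stated proportionalities; the function names are hypothetical, and placing a score equal to the threshold in the second group is an arbitrary choice.

```python
def gpu_threshold(k, num_gpu_cores, warp_size=32):
    # Assumed form only: proportional to k and the core count, inverse to w.
    return k * num_gpu_cores / warp_size

def split_groups(scores, threshold):
    """Split {set_name: score} into (first_group, second_group) name lists."""
    ordered = sorted(scores, key=scores.get)      # low score to high score
    first = [s for s in ordered if scores[s] > threshold]    # Large Set, for the GPU
    second = [s for s in ordered if scores[s] <= threshold]  # Small Set, for the CPU
    return first, second

# Illustrative scores for four ligand sets.
scores = {"S1": 40.0, "S2": 1280.0, "S3": 300.0, "S4": 95.0}
threshold = gpu_threshold(k=1.0, num_gpu_cores=3200)
large_set, small_set = split_groups(scores, threshold)
```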
- information on ligands belonging to the ligand sets 810 selected as the first group and protein information may be copied to a GPU memory in step 904.
- molecular docking simulation may be performed in a GPU core 820 based on the information copied to the GPU memory in step 906.
- When the ligand set score is greater than a maximum block allocation value of the GPU by a specific value, the performance of the GPU may be maximized. In that case, docking calculation can be considered to be performed more efficiently in the GPU than in the CPU.
- ligand information and protein information may be copied to a GPU memory, and the ligand information may be copied in order from a ligand set with the highest score among ligand sets in the ligand set 810 included in the first group. In doing so, molecular docking simulation may be performed in the GPU core 820 in the order from a ligand set with the highest score.
- molecular docking simulation may be performed using a CPU core 840 based on information on the ligand set 830 included in the second group and the protein information.
- molecular docking simulation may be performed using the CPU core 840 in order from a ligand set with the lowest score among the ligand sets 830 included in the second group.
- While the ligand set 810 corresponding to the first group is simulated in the GPU core 820, the ligand set 830 corresponding to the second group may be simulated in the CPU core 840. That is, according to an embodiment of the present disclosure, molecular docking simulation for the first group and molecular docking simulation for the second group may be performed in parallel.
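- The parallel handling of the two groups can be sketched as follows. This is a structural sketch only: `run_on_gpu` and `run_on_cpu` are hypothetical stand-ins that return the processing order instead of performing docking, and Python threads are used merely to show the two queues being served at the same time.

```python
from concurrent.futures import ThreadPoolExecutor

def run_on_gpu(ligand_sets):
    # Stand-in for copying to GPU memory and docking; highest score first.
    return [name for name, _ in sorted(ligand_sets, key=lambda s: s[1], reverse=True)]

def run_on_cpu(ligand_sets):
    # Stand-in for CPU docking; lowest score first.
    return [name for name, _ in sorted(ligand_sets, key=lambda s: s[1])]

# Illustrative (set name, score) pairs for the first and second groups.
first_group = [("S3", 300.0), ("S2", 1280.0)]
second_group = [("S4", 95.0), ("S1", 40.0)]

with ThreadPoolExecutor(max_workers=2) as pool:
    gpu_future = pool.submit(run_on_gpu, first_group)   # first group on the GPU path
    cpu_future = pool.submit(run_on_cpu, second_group)  # second group on the CPU path
    gpu_order, cpu_order = gpu_future.result(), cpu_future.result()
```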
- the step 511 of performing molecular docking simulation may include a step of energy evaluation to evaluate energy according to a pose of a protein-ligand complex, the pose being determined based on ligand information and protein information. In doing so, it is possible to predict how strongly and in which conformation a ligand molecule can bind with a protein structure, and to optimize a pose and energy.
- the time required for the classification is insignificant compared to the time required for the entire docking simulation. For example, 10 days or more are required to perform 100,000 docking simulations, but it may take 5 minutes or less to classify 100,000 docking models.
- the time required to perform the molecular docking simulation may be reduced to 70%.
- the time required to perform the entire docking simulation may be reduced to about 7 days.
- a processor to perform simulation may be selected by scoring ligand sets, and this procedure may be performed first, so that each processor performs the computation suited to its characteristics, and accordingly, the utilization of all processors may increase.
- ligand sets may not be selected as a specific group but may be classified by a score, and ligand sets at both ends of the score ordering may be simulated in a CPU and a GPU, so that all processors may be utilized effectively and a difference in computational speed may not be significant even when ligand sets are distributed variously.
- FIG. 10 is a block diagram schematically illustrating a molecular docking simulation apparatus according to an embodiment of the present disclosure.
- the molecular docking simulation apparatus may include a CPU core 1010, a GPU core 1020, and a controller 1030.
- the CPU core 1010 may be a device included in the CPU 100 illustrated in FIG. 1.
- the GPU core 1020 may include a GPU memory, and may be a device included in the GPU 150 illustrated in FIG. 1. A repeated description of the CPU core 1010 and the GPU core 1020 will be omitted.
- the controller 1030 may identify information on a plurality of ligands and identify protein information including grid information on a plurality of proteins. In addition, the controller 1030 may identify at least one ligand set based on a size of each ligand according to the information on the plurality of ligands, and may determine a score for at least one ligand set based on properties of ligands included in each ligand set.
- the controller 1030 may include at least one of the controller 105 and the controller 155 of FIG. 1, and the operation of the controller 1030 may be implemented as a configuration of a controller for an additional computation other than the controller of FIG. 1.
- the controller 1030 may copy, to a GPU memory, protein information and information on ligands belonging to ligand sets selected as a first group based on a score for at least one ligand set and the number of GPU cores 1020 involved in performing molecular docking simulation.
- the molecular docking simulation may be performed using the GPU cores 1020 based on the information copied to the GPU memory.
- the GPU memory may include at least one of a cache memory 165, an ALU 160, and a DRAM 170 of FIG. 1, and the operation of the GPU memory may be performed even using a separate memory structure for storing information necessary for GPU computation.
- the controller 1030 may further perform molecular docking simulation using the CPU core 1010 based on protein information and information on ligand sets that are selected as the second group based on a score for at least one ligand set.
- the controller 1030 may perform molecular docking simulation using the CPU core 1010 in order from a ligand set with the lowest score among the ligand sets included in the second group. In doing so, it is possible to perform molecular docking simulation for the first group and molecular docking simulation for the second group in parallel.
Abstract
Provided are a method and an apparatus for analyzing protein-ligand interaction through parallel computing. A method for molecular docking simulation according to the present disclosure includes: identifying information on a plurality of ligands, identifying protein information including grid information on a plurality of proteins, identifying at least one ligand set based on a size of each ligand according to the information on the plurality of ligands, determining a score for the at least one ligand set based on properties of ligands included in the ligand set, copying, to a GPU memory, information on ligands included in ligand sets in a first group selected based on the score for the at least one ligand set and the number of GPU cores involved in performing molecular docking simulation, and the protein information, and performing molecular docking simulation using the GPU cores based on the information copied to the GPU memory.
Description
The present disclosure relates to a method and an apparatus associated with molecular docking to predict protein-ligand interaction. One particular embodiment relates to a method and apparatus for performing molecular docking simulation related to a target protein and a ligand using parallel computing with a plurality of processors.
In molecule modeling, molecular docking simulation is used to derive information associated with binding between one molecule and another molecule. Such molecular docking simulation is mainly used for simulation associated with binding through interaction between specific proteins and ligands, and in this manner, it is possible to perform a simulation using a computer before a biochemical experiment.
The molecular docking simulation may provide information on a pose of a receptor and a ligand, the pose in which the two molecules can form a stable binding. In this case, a simulation tool may perform various computations to verify whether the pose between the receptor and the ligand allows a stable binding.
However, in such computations, the amount of calculation increases depending on a size or structure of the receptor and the ligand, and thus, a method and an apparatus for effectively processing such computations are required.
An aspect provides a method and an apparatus for simulation associated with molecular docking. Another aspect provides a method and an apparatus for molecular docking simulation by performing a plurality of processes in parallel.
In order to solve the above-described problem, a method for simulation associated with molecular docking according to an embodiment of the present disclosure includes: identifying information on a plurality of ligands; identifying protein information including grid information on a plurality of proteins; identifying at least one ligand set based on a size of each ligand according to the information on the plurality of ligands; determining a score for the at least one ligand set based on properties of ligands included in the ligand set; copying, to a Graphics Processing Unit (GPU) memory, the protein information and information on ligands included in ligand sets in a first group selected based on the score for the at least one ligand set and the number of GPU cores involved in performing molecular docking simulation; and performing molecular docking simulation using the GPU cores based on the information copied to the GPU memory.
In addition, in order to solve the above-described problem, an apparatus for molecular docking simulation according to an embodiment of the present disclosure includes a Central Processing Unit (CPU) core, a plurality of Graphics Processing Unit (GPU) cores, and a controller, and the controller is configured to: identify information on a plurality of ligands; identify protein information including grid information on a plurality of proteins; identify at least one ligand set based on a size of each ligand according to the information on the plurality of ligands; determine a score for the at least one ligand set based on properties of ligands included in the ligand set; copy, to a GPU memory, the protein information and information on ligands included in ligand sets in a first group selected based on the score for the at least one ligand set and the number of GPU cores involved in performing the molecular docking simulation; and perform the molecular docking simulation using the GPU cores based on the information copied to the GPU memory.
According to an embodiment of the present disclosure, a method and an apparatus for molecular docking simulation by performing computations in parallel are provided. Using the method and the apparatus, it is possible to more effectively perform the simulation using a processor, such as a Graphics Processing Unit (GPU), which is specialized in parallel processing. In addition, according to an embodiment of the present disclosure, it is possible to optimize a procedure associated with parallel processing according to a configuration of a computing device performing a simulation, thereby achieving optimal simulation performance.
FIG. 1 is a diagram illustrating a structure of a Central Processing Unit (CPU) and a Graphics Processing Unit (GPU) according to an embodiment of the present disclosure.
FIG. 2 is a flowchart illustrating a method for molecular docking simulation according to an embodiment of the present disclosure.
FIG. 3 is a flowchart illustrating a method for performing a simulation using parallel computing according to an embodiment of the present disclosure.
FIG. 4 is a flowchart illustrating a method in which a computation and a data structure corresponding to a second processing unit are identified through a computation and a data structure corresponding to a first processing unit for simulation and then a computation associated with simulation is performed based on the identified computation and data structure.
FIG. 5 is a flowchart illustrating a method for molecular docking simulation using a Graphics Processing Unit (GPU) core according to an embodiment of the present disclosure.
FIG. 6 is a schematic view illustrating ligand information including array information according to an embodiment of the present disclosure.
FIG. 7 is a schematic view illustrating a step of copying ligand information including array information to a GPU memory according to an embodiment of the present disclosure.
FIG. 8 is a schematic view illustrating a method for classifying ligand sets by score according to an embodiment of the present disclosure.
FIG. 9 is a flowchart illustrating a method for performing molecular docking simulations in parallel according to an embodiment of the present disclosure.
FIG. 10 is a block diagram illustrating an apparatus for molecular docking simulation according to an embodiment of the present disclosure.
Hereinafter, embodiments of the present disclosure are described with reference to the accompanying drawings in detail.
Detailed description of well-known functions and structures incorporated herein may be omitted to avoid obscuring the subject matter of the present disclosure. This aims to omit unnecessary description so as to make the subject matter of the present disclosure clear.
For the same reason, some elements are exaggerated, omitted or simplified in the drawings. In addition, the elements may, in practice, have sizes and/or shapes different from those shown in the drawings. In the drawings, identical or corresponding elements are provided with identical reference numerals.
The advantages and features of the present disclosure and methods of achieving them will be apparent from the following exemplary embodiments that will be described in more detail with reference to the accompanying drawings. However, the present disclosure is not limited to the embodiments set forth below, but may be implemented in various different forms. The following embodiments are provided only to completely disclose the present disclosure and inform those skilled in the art of the scope of the present disclosure, and the present disclosure is defined only by the scope of the appended claims. Throughout the specification, the same or like reference numerals designate the same or like elements.
Here, it will be understood that each block of the flowchart illustrations, and combinations of blocks in the flowchart illustrations, can be implemented by computer program instructions. These computer program instructions can be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create a means for implementing the functions specified in the flowchart block or blocks. These computer program instructions may also be stored in a computer-usable or computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-usable or computer-readable memory produce an article of manufacture including instruction means that implement the function specified in the flowchart block or blocks. The computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operations to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions that are executed on the computer or other programmable apparatus provide operations for implementing the functions specified in the flowchart block or blocks.
In addition, the blocks of the flow chart refer to part of codes, segments or modules that include one or more executable instructions to perform one or more logic functions. It should be noted that the functions described in the blocks of the flow chart may be performed in a different order from the embodiments described above. For example, the functions described in two adjacent blocks may be performed at the same time or in reverse order.
In the embodiments, the term 'module' refers to a software element or a hardware element such as an FPGA or an ASIC, and performs a corresponding function. It should be, however, understood that the 'module' is not limited to a software or hardware element. The 'module' may be implemented in storage media that can be designated by addresses and may also be configured to execute on one or more processors. For example, the 'module' may include various types of elements (e.g., software elements, object-oriented software elements, class elements, task elements, etc.), segments (e.g., processes, functions, attributes, procedures, sub-routines, program codes, etc.), drivers, firmware, micro-codes, circuits, data, databases, data structures, tables, arrays, variables, etc. Functions provided by elements and 'modules' may be combined into a smaller number of elements and 'modules' or may be divided into additional elements and 'modules.' In addition, elements and 'modules' may also be implemented to execute on one or more CPUs in devices or secure multimedia cards.
FIG. 1 is a diagram illustrating the structure of a Central Processing Unit (CPU) and a Graphics Processing Unit (GPU) according to an embodiment of the present disclosure.
Referring to FIG. 1, configurations of a CPU 100 and configurations of a GPU 150 are illustrated.
The CPU 100 may include at least one of a controller 105, an arithmetic logic unit (ALU) 110, a cache memory 115, and a DRAM 120.
The controller 105 may control overall operations of the CPU 100. In addition, the CPU may include at least one ALU 110, and the ALU 110 may perform a computation based on a control of the controller 105.
The cache memory 115 may store and process data associated with a computation of the CPU 100. For instance, the cache memory 115 may include data at a location most frequently used in the main memory, and may be a memory having a relatively low capacity but capable of quickly storing and acquiring data. Most memory accesses tend to occur frequently in the vicinity of a specific location, so copying data to cache memory can save an average memory access time.
In one embodiment, when the controller tries to read or write the main memory, the controller may first check whether data corresponding to the address exists in the cache memory. If the data corresponding to the address exists in the cache memory, the data may be read directly from the cache memory; otherwise the main memory may be accessed directly. At this point, most processors may directly access the main memory and copy transmitted data to the cache memory, so that the processors can directly read and write data in the cache memory when accessing the same address.
The DRAM 120 may be the main memory and may store data associated with a computation of the CPU 100.
The GPU 150 may include at least one of a controller 155, an ALU 160, a cache memory 165, and a DRAM 170.
The controller 155 may control an operation of the corresponding ALU 160. The GPU 150 may include one or more controllers 155, and the controller 155 may include the corresponding cache memory 165 and at least one ALU 160. As such, since each controller 155 may include the cache memory 165 and at least one ALU 160, there is an advantage in performing multiple computations in parallel.
In general, the ALU 110 of the CPU 100 is often superior in computing performance to the ALU 160 of the GPU 150. However, when performing parallel computing with a plurality of ALUs 160, the GPU 150 may achieve faster overall computation than the CPU 100. In order for a computation performed by the CPU 100 to be performed in parallel by the GPU 150, it is necessary to convert the computation and also to convert the address-related information used for accessing the cache memory and the main memory.
FIG. 2 is a flowchart illustrating a molecular docking simulation method according to an embodiment of the present disclosure.
Referring to FIG. 2, a method for performing molecular docking simulation in a computing device is illustrated.
In step 205, the computing device may identify at least one of a protein and a ligand. In an embodiment, at least one of the protein and the ligand may be identified based on at least one of: information input by a user, information stored in a database, and a result learned by a specific computing device.
In step 210, a pose search associated with identification of interaction may be performed. In an exemplary embodiment, for the pose search, the computing device may perform at least one of: energy evaluation associated with binding and evaluation for optimization. In an embodiment, a genetic algorithm may be used for the energy evaluation. In addition, for the optimization, analysis of a differential equation corresponding to each element involved in the binding may be performed.
In step 215, a result of interaction may be identified according to a result associated with the above computation.
As such, a pose search may be performed to identify interaction between the protein and the ligand. Through the pose search, it is possible to identify the interaction between the protein and the ligand, and predict binding energy and properties of a molecule formed after the binding.
FIG. 3 is a flowchart illustrating a method for performing simulation using parallel computing according to an embodiment of the present disclosure.
Referring to FIG. 3, a method for performing parallel computing for a pose search in a computing device is illustrated.
In step 305, the computing device may identify information associated with analysis of a genetic algorithm and analysis of an optimization differential equation in relation to the pose search. In relation to the pose search, at least one of the energy-related computation and optimization-related computation may be performed, and, in an embodiment, the computing device may perform the computations in a different method or order according to a target protein and a ligand type.
In step 310, a data structure for parallel computing and a memory index corresponding thereto may be identified. In addition, in an embodiment, in order to perform relevant computations in parallel, a procedure of dividing one computation into a plurality of computations, performing the plurality of computations, and merging results from the plurality of computations may also be performed. For example, in order to enable the GPU to perform a computation performed by the CPU, simulation may be performed as follows: the corresponding computation is divided into a plurality of computations, the plurality of computations are performed by the GPU, and a result value is derived by merging the results of the plurality of computations. To this end, a data structure necessary to perform the computations in the GPU may be formed so that the computation can be easily performed in the GPU, and the data structure may be copied to a GPU-related memory. In doing so, parallel computing associated with molecular docking simulation may be performed in the CPU and the GPU. More specifically, data associated with simulation may be converted into an array type so that the data can be copied to the GPU-related memory, and the converted data may also include information on a target protein as well as information on a configuration and a structure of a molecule associated with simulation. As such, by copying the converted data to the GPU-related memory that is to perform a computation, it is possible to effectively perform the computation in the GPU, and the efficiency of simulation may be increased using parallelization.
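The divide, compute, and merge procedure of step 310 can be sketched as follows. This is only an illustrative sketch in Python using a CPU thread pool in place of GPU cores; the function names `split`, `partial_energy`, and `parallel_energy`, and the squared-sum stand-in for an energy term, are hypothetical and do not appear in the disclosure.

```python
from concurrent.futures import ThreadPoolExecutor


def split(items, n_chunks):
    """Divide one computation's input into contiguous chunks."""
    size = max(1, -(-len(items) // n_chunks))  # ceiling division
    return [items[i:i + size] for i in range(0, len(items), size)]


def partial_energy(chunk):
    # Hypothetical stand-in for a per-chunk energy term.
    return sum(x * x for x in chunk)


def parallel_energy(values, n_workers=4):
    """Split the input, evaluate the chunks concurrently, merge the results."""
    chunks = split(values, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(partial_energy, chunks))
    return sum(partials)  # merge step
```

The merged value equals what a single serial pass would produce, which is the property the conversion from a CPU computation to a parallel one has to preserve.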
In step 315, the computing device may perform parallel computing associated with molecular docking simulation in a plurality of processes based on at least one of the information identified in the previous step, and may derive a simulation result based on results of performed computations.
FIG. 4 is a flowchart of a method for identifying a computation and a data structure corresponding to a second processing unit based on a computation and a data structure corresponding to a first processing unit for simulation and performing a computation associated with simulation based on the identified computation and the identified data structure according to an embodiment of the present disclosure.
Referring to FIG. 4, there is illustrated a method in which a computing device according to an embodiment identifies a computation and a data structure corresponding to a second processing unit based on a computation and a data structure corresponding to a first processing unit and performs a computation based on the identified computation and the identified data structure during molecular docking simulation.
In step 405, the computing device may identify the computation and the data structure corresponding to the first processing unit for simulation. In an embodiment, the first processing unit may include a CPU, and a computation to be performed for simulation and a data structure corresponding to the computation may be identified.
In step 410, the computing device may identify the computation and the data structure corresponding to the second processing unit for simulation based on the identified information. In an embodiment, the second processing unit may include a GPU, and a computation for performing simulation on the GPU and a data structure corresponding to the computation may be identified based on the information identified in the previous step. In an embodiment, the computing device may identify a computation and a data structure by further considering performance of the GPU used for the computation. In addition, in an embodiment, the computing device may identify a processor to perform a computation based on the number of GPUs used for the computation, and may copy data matching a data structure corresponding to the identified processor to a memory corresponding to the processor.
In step 415, the computing device may perform a computation associated with simulation using the second processing unit. In an embodiment, the computation may be performed in parallel, and a control associated with the parallel computing may be performed through at least one of a controller, a CPU, and a GPU of the computing device. By performing such a control, it is possible to perform a computation corresponding to the first processing unit by the second processing unit in parallel and to identify a final computation result associated with the simulation by merging results of the parallel computing.
In step 420, the computing device may identify result values of the parallel computing, and may derive a simulation result value based on the identified values.
As described above, as simulation is performed by converting a computation and a data structure corresponding to the CPU so that they can be performed in parallel in the GPU, it is possible to reduce time to be taken for simulation associated with molecular docking without using high-performance hardware.
In addition, molecular docking to predict protein-ligand interaction may be simulated, and a program such as AutoDock Vina (AD Vina) may be used for the simulation. Although a computation for the simulation can be performed through the CPU, it is necessary to perform the computation in parallel, given the nature of docking simulation. Recently, GPU-accelerated simulation has become more accessible in the field of molecular dynamics simulation.
In response to simulation that performs a computation through a CPU, it is possible to port the code of a simulation program and perform optimization to run the code in a GPU environment. Such a computation is not limited to porting the computation performed in the CPU to be performed in the GPU environment, and may be generally used to process a computation for molecular docking simulation in parallel.
For example, one of the processes requiring the largest number of computations in docking simulation is "pose search", and the pose search may be simulated with parallel computing, thereby reducing time to be taken for the computation. More specifically, a large portion of the computation in simulation associated with docking may correspond to identifying a pose for docking and evaluating energy therefor. It is expected that performance improvement can be achieved by performing such a computation in parallel, and the parallel computing may be more easily implemented in a GPU environment.
For example, a simulation environment may be realized, by converting a code associated with simulation implemented in the CPU so that the converted code can be implemented in the GPU. In this case, the simulation environment may be realized through a GPU-related code corresponding to an algorithm for energy evaluation. In addition, it is necessary to perform optimization for the pose search, and it is necessary to write a code for calculating a differential equation for the optimization in parallel in the GPU. As such, it is necessary to write a code to convert a program implemented in the CPU to be implemented in the GPU environment, and it is necessary to optimize the code for docking simulation. It is necessary to designate an ALU for performing a computation for simulation and to implement an additional layer for applying a code implemented in the CPU to the GPU in response to access to the cache/main memory. In addition, in order to realize such simulation, a composite data structure interoperable between the CPU and the GPU and a library corresponding to the composite data structure are necessary. Such a composite data structure may include a data structure for storing data regarding proteins, ligands, and docking calculation therebetween, and for storing and processing information associated with calculation. In an embodiment, at a time of generating a data structure, the data structure may be generated using a static memory. In addition, in order to perform such a computation, data-related expressions corresponding to each other to be interoperable both in the CPU and the GPU may be used. In addition, in order to implement an interface to access a data structure, a GPU-related code may be implemented to correspond to a CPU-related code.
Protein-ligand interaction prediction usually requires a very large number of docking calculations in an application program for simulation, and a cross-docking campaign may be performed for this purpose. For example, a single query ligand requires approximately 10,000 docking computations, corresponding to the number of target proteins. Parallel computing with the CPU may not lead to significant improvement of performance for this computation, and it requires a high-performance computing device with a large number of CPUs. Thus, when the aforementioned computations are performed through the GPU, it is possible to increase the computational speed using less expensive hardware. For example, when docking simulation performed by the CPU is processed in parallel using a typical GPU, it is possible to perform calculations 10 or more times faster than the related art.
The aforementioned AD Vina program may perform a pose search for molecular docking using a Monte Carlo search algorithm and the Metropolis-Hastings algorithm. In addition, optimization for a pose search and energy evaluation may be performed using a quasi-Newton method (in particular, the Broyden-Fletcher-Goldfarb-Shanno (BFGS) method).
Specifically, ligand information and protein information may be read from a PDBQT file through AD Vina, and a grid cache may be generated based on the read ligand information and the read protein information. Then, an optimal pose with strong binding energy may be predicted using the generated grid cache through the Monte Carlo search.
The Monte Carlo search is a random-sampling method, that is, a search method that performs docking simulation while changing a molecular structure based on a statistical distribution. In the process of the Monte Carlo search, energy is evaluated after arbitrarily changing the orientation and torsion of atoms and molecules. Then, pose optimization and energy evaluation may proceed as follows: if the energy difference between the previous structure and the new structure is consistent with the Boltzmann distribution, the new structure is selected; if not, the new structure is discarded and the search returns to the previous structure.
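The acceptance rule described above is commonly written as the Metropolis criterion. The following Python sketch shows that standard criterion under stated assumptions: the function name `metropolis_accept`, the `temperature` parameter (expressed in the same units as the energies), and the optional uniform draw `u` are hypothetical, and the sketch is not taken from AD Vina's source code.

```python
import math
import random


def metropolis_accept(e_old, e_new, temperature, u=None):
    """Return True if the new pose should replace the previous one.

    A lower-energy pose is always accepted. A higher-energy pose is accepted
    with Boltzmann probability exp(-(e_new - e_old) / temperature).
    """
    delta = e_new - e_old
    if delta <= 0:
        return True
    if u is None:
        u = random.random()  # uniform draw in [0, 1)
    return u < math.exp(-delta / temperature)
```

Passing `u` explicitly makes the decision reproducible, which can help when comparing CPU and GPU implementations of the same search.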
In addition, as part of the Monte Carlo search, a pose search may be performed using a quasi-Newton method. The quasi-Newton method is a method of estimation using an approximate value, by which a pose with a high value of energy evaluation can be estimated using a change and a mutation of ligand conformation. In addition, a pose search may be performed using the Metropolis-Hastings algorithm by extracting rejected samples, and ligand conformations (e.g., in number of 20) with optimal values of protein-ligand binding energy evaluation may be output. In an embodiment, the number of ligand conformations to be output may be set based on at least one of: a user setting, characteristics of an estimated sample, and a configuration of a computing device.
In particular, in relation to the energy evaluation, a grid cache may be generated based on grid information generated based on protein information. The size and the shape of the grid cache do not affect a simulation execution time due to the characteristics of the grid cache after the generation of the grid cache. For example, when a cache for the grid is generated based on protein information, a total simulation execution time may be 1 second or less, and, after the simulation, repeated tasks such as memory copying may be reduced by efficiently utilizing the protein information.
In addition, the number of repetitions of the Monte Carlo Search may be determined heuristically according to the number of atoms in a ligand and the degree of freedom of a ligand, by which the number of times to perform the energy evaluation can be determined. In addition, the number of times to perform the energy evaluation may be determined according to the number of atoms in a ligand and the degree of freedom of a ligand, and at least one of a method and a processor to perform a computation may be determined based on the determined number of times.
Meanwhile, in an embodiment of the present disclosure, in order to process a computation for molecular docking simulation more quickly, a method and an apparatus for simulation that can perform computations in a CPU and a GPU may be implemented. Specifically, through the embodiment of the present disclosure, the maximum number of dockings may be processed within a predetermined time by using the CPU and GPU performance at the same time. To this end, a basic algorithm and a parameter used in AD Vina may be utilized, and a method of performing a computation associated with L×P dockings at once in parallel according to the number (L) of ligands and the number (P) of proteins which are considered in docking may be devised.
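As a small illustration of the L×P workload, the set of docking jobs can be enumerated as the Cartesian product of ligands and proteins. The function name `docking_jobs` and the string identifiers are hypothetical.

```python
from itertools import product


def docking_jobs(ligands, proteins):
    """Enumerate every (ligand, protein) pair; there are L x P of them."""
    return list(product(ligands, proteins))
```

Each pair is an independent docking, which is what allows the pairs to be distributed over CPU and GPU cores.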
FIG. 5 is a flowchart illustrating a method for molecular docking simulation using GPU cores according to an embodiment of the present disclosure.
According to an embodiment of the present disclosure, prior to molecular docking simulation in step 501, information on a plurality of ligands that are the target of the molecular docking simulation may be identified. Information on the plurality of ligands may include information on ligand property information including at least one of: ligand conformation information, the number of atoms of the ligands, and the degree of freedom of the ligands.
In addition, in step 503, protein information on a plurality of proteins that are the target of the molecular docking simulation may be identified, and in this case, the protein information may include grid information. According to an embodiment, the protein information may include information on the size of proteins in a grid cache type, a structure of a receptor to bind with a ligand, and the like.
In step 505, at least one ligand set may be identified based on the size of each ligand according to the information on the plurality of ligands. According to an embodiment, a plurality of ligands may be clustered into a single ligand set based on the number of atoms of the ligands or the degree of freedom of the ligands. That is, ligands having either the same number of atoms of each ligand or the same degree of freedom of each ligand may be considered to have similar sizes, and the ligands having the similar sizes may be clustered into one ligand set. For example, the plurality of ligands L1 to Ln may be classified into ligand sets S1 to Sm (n is an integer that satisfies n≥m (m is an integer)). In doing so, it is possible to implement simulation so that the ligands having the similar sizes and a plurality of proteins can be docked. A ligand size may vary according to the environment of a system where molecular docking simulation is performed and the conformation of a ligand to be simulated.
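The clustering of step 505 may be sketched as a grouping by a size key. This is a minimal Python sketch, assuming ligands are grouped by exact equality of either the number of atoms or the degree of freedom; the `Ligand` record and the function `cluster_by_size` are hypothetical names.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Ligand:
    name: str
    num_atoms: int
    dof: int  # degree of freedom


def cluster_by_size(ligands, key="num_atoms"):
    """Group ligands sharing the same size key into ligand sets S1..Sm."""
    sets = defaultdict(list)
    for lig in ligands:
        sets[getattr(lig, key)].append(lig)
    # Return the sets ordered by increasing key so the result is deterministic.
    return [sets[k] for k in sorted(sets)]
```

Because every ligand lands in exactly one set, the number of sets m never exceeds the number of ligands n.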
In step 507, a score for at least one ligand set may be determined based on properties of a ligand included in the ligand set. Specifically, a score for a ligand set may be determined based on a first index and a second index associated with a ligand included in the ligand set. For example, a score for a ligand set may correspond to a value obtained by multiplying the first index and the second index.
According to an embodiment, the first index may be determined based on the number of atoms of ligands included in a ligand set and the degree of freedom of ligands included in a ligand set. In addition, a second index may be determined based on the number of ligands included in a ligand set and the number of a plurality of proteins corresponding thereto. These properties of ligands may be identified from the information on the plurality of ligands.
According to an embodiment, the first index may be determined based on the number of atoms of ligands, the degree of freedom of ligands, and the warp size of a GPU. More specifically, the first index may be determined according to a value proportional to the square of the number of atoms of ligands and inversely proportional to at least one of: the degree of freedom of ligands and the warp size of a GPU. This is because energy efficiency according to a pose of a protein-ligand complex is proportional to the square of the number of atoms of a ligand, and because the degree of freedom of a ligand serves to reduce the energy efficiency. In addition, a value inversely proportional to the warp size of a GPU may correspond to a value proportional to the maximum number of GPU thread blocks capable of simultaneously executing docking simulation, by which the number of computational procedures capable of being executed simultaneously may be determined.
Alternatively, the first index may be determined using Equation 1 below.
[Equation 1]
Here, N denotes the number of atoms of a ligand, w denotes the warp size of a GPU, and F denotes the degree of freedom of a ligand. Interaction between atoms of a ligand compound may be evaluated through N, and w may generally be a constant value of 32.
According to an embodiment, the second index may be determined according to a value that is proportional to at least one of: the number of ligands included in a ligand set and the number of a plurality of proteins. This is because the size of the ligand set (which can correspond to the number of ligands included in the ligand set) and the number of the proteins have a linear direct effect on data copying to a GPU memory.
Alternatively, the second index may be determined using Equation 2 below.
[Equation 2]
Second Index = e×L×P
Here, e denotes a data array conversion constant, L denotes the number of ligands included in a ligand set, and P denotes the number of proteins. According to an embodiment, e may be set to a default value of 8 or may be set by a user. When set by the user, e may be set to 20 or less. In addition, e may correspond to the number of simulations to be processed in parallel in a GPU core per docking simulation. This may be a value corresponding to the number of data converted to array information per execution of the docking simulation when the simulation is implemented.
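The score of step 507 can be sketched as the product of the two indices. In the Python sketch below, `second_index` follows Equation 2 as stated. The form used in `first_index`, N²/(w×F), is an assumption: it is merely the simplest expression that is proportional to the square of the number of atoms and inversely proportional to both the degree of freedom and the warp size, and it may differ from Equation 1. All function names are hypothetical.

```python
def first_index(num_atoms, dof, warp_size=32):
    """ASSUMED form N^2 / (w * F); only the proportionality is given in the text."""
    if dof <= 0 or warp_size <= 0:
        raise ValueError("dof and warp_size must be positive")
    return num_atoms ** 2 / (warp_size * dof)


def second_index(num_ligands, num_proteins, e=8):
    """Equation 2: e x L x P, with e defaulting to 8."""
    return e * num_ligands * num_proteins


def ligand_set_score(num_atoms, dof, num_ligands, num_proteins,
                     e=8, warp_size=32):
    """Score of a ligand set: first index multiplied by second index."""
    return (first_index(num_atoms, dof, warp_size)
            * second_index(num_ligands, num_proteins, e))
```

For instance, under these assumptions a set of 10 ligands with 32 atoms and 8 degrees of freedom each, docked against 100 proteins, gets a first index of 4.0 and a second index of 8000.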
As a next step for molecular docking simulation, in step 509, a plurality of ligand sets may be selected as a first group based on a score for at least one ligand set and the number of GPU cores involved in performing molecular docking simulation, and protein information and information on ligands belonging to the ligand sets in the first group may be copied to a GPU memory. In addition, molecular docking simulation may be performed using GPU cores based on the information copied to the GPU memory in step 511. According to an embodiment, the ligand information copied to the GPU memory may include one or more of: ligand conformation information, the number of atoms of a ligand, and the degree of freedom of a ligand.
According to an embodiment, the information copied to the GPU memory may be array information including information on ligands. In this case, the array information may include an array of constituent molecular information and an array of a connection relationship information of constituent molecules in relation to ligands. This will be described with reference to FIG. 6.
FIG. 6 is a schematic diagram illustrating ligand information including array information.
Referring to FIG. 6, constituent molecules associated with ligands and the connection relationship thereof may be as shown in a state 610 dynamically allocated in a tree structure. In this dynamically allocated state, a computation may be performed in one processor, but it is difficult to perform a computation in a processor not suited to the corresponding memory structure. In an embodiment of the present disclosure, ligand information may be expressed in an array type so that the information can be interoperable in both a CPU core and a GPU core. For example, ligand information 610 in a tree structure may be implemented as continuous array information 620. In this case, data with the same parent node according to the tree structure may be implemented in parallel in the same array information. For example, the information 610 dynamically allocated in the tree structure may be implemented as the continuous array information 620 that consists of the data of nodes 1, 2, 3, 4, 5, 6, and 7 and the parent nodes of the respective nodes: -, 1, 1, 1, 2, 3, and 3. As the ligand information is held in this array type 620, it is possible to effectively copy data to a memory corresponding to each processor so that parallel computing can be performed easily.
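The conversion from the tree structure 610 to the continuous array information 620 can be sketched as a breadth-first flattening into two parallel arrays. This is a minimal Python sketch; the `Node` class and the `flatten` function are hypothetical, and parents are stored as 0-based array positions (with -1 for the root) rather than as node labels.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Node:
    label: int
    children: list = field(default_factory=list)


def flatten(root):
    """Flatten a tree into two parallel arrays: node labels and parent positions."""
    labels, parents = [], []
    queue = deque([(root, -1)])  # -1 marks the root, which has no parent
    while queue:
        node, parent_pos = queue.popleft()
        pos = len(labels)
        labels.append(node.label)
        parents.append(parent_pos)
        for child in node.children:
            queue.append((child, pos))
    return labels, parents
```

For the tree of FIG. 6 (node 1 with children 2, 3, 4; node 2 with child 5; node 3 with children 6 and 7), the sketch yields labels 1 to 7 and parent positions -1, 0, 0, 0, 1, 2, 2, which correspond to the parent nodes -, 1, 1, 1, 2, 3, 3. Two flat arrays of this kind contain no pointers, so they can be copied between memories as they are.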
According to the implementation of the array information including the ligand information according to an embodiment, it is possible to enable a mutation module, a pose search, energy evaluation, and an optimization module to operate in a GPU core.
A computation associated with such array information may be performed using a software or hardware layer-related tool that enables the use of a virtual instruction set of a GPU; for example, such a computation may be implemented using the CUDA Thrust library. However, other appropriate parallel algorithms may be used to express a highly complex data structure.
Meanwhile, in step 509, the number of times to copy, to the GPU memory, the information on ligands belonging to the ligand sets included in the first group and the protein information may be determined based on the number of times to perform a pose search for the ligands belonging to the ligand sets. This will be described with reference to FIG. 7.
FIG. 7 is a schematic diagram illustrating a step in which ligand information including array information is copied to a GPU memory.
Referring to FIG. 7, ligand information 710 in a tree structure may be implemented as a plurality of array information 720-1, 720-2, and 720-3. Each of the array information 720-1, 720-2, and 720-3 may include ligand-related array information, which includes information on ligand-related constituent molecules, a connection relationship of the constituent molecules, the number of atoms of a ligand, and atomic coordinates forming a ligand molecule.
In addition, each of the array information 720-1, 720-2, and 720-3 may include information on proteins to be docked, and accordingly, each of the array information may include array information on a pair LP of one ligand and one protein. Since information on all ligand-protein pairs for docking is included in such array information, once the corresponding information is copied to a memory associated with a processor for performing such a computation, a computation associated with molecular docking simulation may be performed in the processor. Through this procedure, a computation associated with simulation may be performed in parallel in a plurality of processors. Even in an environment where the processor to perform the next computation is determined in the course of computation, the corresponding computation may be performed independently in each processor simply by copying the information to its memory and receiving the result of the computation, thereby improving computation efficiency.
According to an embodiment, the array information 720-1, 720-2, and 720-3 may be copied to a GPU memory a number of times corresponding to the number of pose searches of ligands. Alternatively, array information corresponding to information on ligand-protein pairs for docking may be generated and copied. According to an embodiment, the number of times to perform a pose search for a ligand may be determined by the complexity of the ligand, and the complexity of a ligand may be based on the number of atoms of the ligand or the degree of freedom of the ligand. However, the effect of the number of atoms of a ligand may be offset by an increase in computational efficiency due to the use of a GPU core when evaluating the binding energy of the protein-ligand complex. Therefore, the complexity of a pose search for a ligand may increase according to the degree of freedom of the ligand, and the computational load of docking simulation may increase as the number of copies to a GPU memory increases.
In addition, the array information 720-1, 720-2, and 720-3 may be copied all at once in units of ligand sets. That is, when copied to the GPU memory once, array information about ligands belonging to the same ligand set may be copied together. For example, when ligands according to ligand information included in the array information 720-1 and the array information 720-2 are classified into the same ligand set due to the same number of atoms or the same degree of freedom, the array information 720-1 and the array information 720-2 may be copied together to the GPU memory, unlike the array information 720-3, and each array information may be separated and copied to a memory corresponding to each GPU that performs a computation. In this case, the array information may be copied after the pose search for a ligand is completed.
According to an embodiment, as information on ligands is implemented as array information, all data in a complex form may be expressed in a continuous memory form. A docking model for molecular docking simulation is a complex structure that includes not just information on a structure and constituent atoms of a compound but also information on a shape and a conformation change of the compound. Accordingly, in an embodiment of the present disclosure, since such a complex structure is capable of being expressed as memory in a continuous form, a large number of docking simulations may be performed.
Specifically, data synchronization between a CPU core and a GPU core may be enabled by a single memory copying computation, so that an overhead required for the data synchronization can be greatly reduced and a large amount of docking can be performed. In this regard, DMA (Direct Memory Access) in the system is effective in copying a large amount of memory at a time, which can increase the efficiency of docking simulation.
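The idea of expressing all docking data in continuous memory so that one copy suffices can be sketched with flat typed arrays. This Python sketch only illustrates the layout on the host side: the function `pack_pairs`, the offset-array scheme, and the integer protein identifiers are assumptions, and no actual GPU transfer is performed.

```python
from array import array


def pack_pairs(pairs):
    """Pack (ligand_coords, protein_id) pairs into flat, contiguous arrays.

    ligand_coords is a list of (x, y, z) tuples. offsets[i]:offsets[i + 1]
    delimits the coordinates of pair i inside the shared data array.
    """
    offsets = array("i", [0])
    data = array("f")
    protein_ids = array("i")
    for coords, protein_id in pairs:
        for xyz in coords:
            data.extend(xyz)
        offsets.append(len(data))
        protein_ids.append(protein_id)
    return offsets, data, protein_ids
```

Because `data` is a single contiguous block, `data.tobytes()` gives one buffer that could be handed to a single device copy instead of one transfer per ligand.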
Meanwhile, according to an embodiment of the present disclosure, ligand sets included in a first group may be selected based on a value that is proportional to a score for at least one ligand set and the number of GPU cores and inversely proportional to the warp size of a GPU. In this case, the value inversely proportional to the warp size of a GPU may correspond to a value proportional to a maximum number of GPU thread blocks that can execute docking simulation simultaneously. For example, ligand sets each having a ligand set score greater than a calculated value considering the number of GPU cores may be selected as a first group. Conversely, ligand sets each having a ligand set score smaller than a calculated value considering the number of GPU cores may be selected as a second group. Accordingly, the score for the ligand set included in the first group may be greater than the score for the ligand set included in the second group. In addition, a ligand set including ligands equal to or greater than a predetermined size may be subject to molecular docking simulation in a GPU core.
Alternatively, the ligand sets corresponding to the first group may be determined using Equation 3 below.
[Equation 3]
Here, k denotes a constant according to the GPU performance index, denotes the number of GPU cores, and w denotes the GPU warp size, which is generally 32.
In this regard, FIG. 8 is a schematic diagram showing a method of classifying ligand sets by the score according to an embodiment, and FIG. 9 is a flowchart illustrating a method for performing molecular docking simulation on a first group and molecular docking simulation on a second group in parallel according to an embodiment.
Referring to FIGS. 8 and 9, a plurality of ligand sets may be sorted in order from a low ligand set score to a high ligand set score. A ligand set score and a calculated value considering the number of GPU cores may be compared in step 901, and ligand sets each having a ligand set score greater than a calculated value considering the number of GPU cores may be selected as a first group, which represents Large Set 810, in step 902. Also, ligand sets each having a ligand set score smaller than a calculated value considering the number of GPU cores may be selected as a second group, which represents Small Set 830, in step 903. In an embodiment, a calculated value considering the number of GPU cores may be determined based on GPU performance index k, and more specifically, the calculated value may be a value proportional to the GPU performance index.
According to an embodiment, information on ligands belonging to the ligand sets 810 selected as the first group and protein information may be copied to a GPU memory in step 904. In addition, molecular docking simulation may be performed in a GPU core 820 based on the information copied to the GPU memory in step 906. In this way, the embodiment of the present disclosure makes it possible to predict, from the ligand set score, the number of GPU cores required for the entire molecular docking simulation. In addition, when the ligand set score exceeds the maximum block allocation value of the GPU by a specific value, the performance of the GPU may be maximized. In such a case, the docking calculation can be regarded as being performed more efficiently in the GPU than in the CPU.
In addition, ligand information and protein information may be copied to a GPU memory, and the ligand information may be copied in order from the ligand set with the highest score among the ligand sets 810 included in the first group. In doing so, molecular docking simulation may be performed in the GPU core 820 in order from the ligand set with the highest score.
According to an embodiment, in step 905, molecular docking simulation may be performed using a CPU core 840 based on information on the ligand set 830 included in the second group and the protein information. In particular, molecular docking simulation may be performed using the CPU core 840 in order from a ligand set with the lowest score among the ligand sets 830 included in the second group.
In addition, while the ligand set 810 corresponding to the first group is simulated in the GPU core 820, the ligand set 830 corresponding to the second group may be simulated in the CPU core 840. That is, according to an embodiment of the present disclosure, molecular docking simulation for the first group and molecular docking simulation for the second group may be performed in parallel.
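One way to picture this parallel execution is two workers running at the same time, each draining its own group in the order described above (highest score first for the GPU, lowest score first for the CPU). The sketch below is a hedged illustration using Python threads. `dock_on_gpu` and `dock_on_cpu` are hypothetical placeholders that only record what was processed; they stand in for the actual docking computation, which is not shown.

```python
from concurrent.futures import ThreadPoolExecutor


def dock_on_gpu(name):
    # Hypothetical placeholder for a GPU docking run.
    return ("gpu", name)


def dock_on_cpu(name):
    # Hypothetical placeholder for a CPU docking run.
    return ("cpu", name)


def run_group(dock, scored_sets, highest_first):
    """Process one group sequentially in score order and return the results."""
    ordered = sorted(scored_sets, key=lambda item: item[1], reverse=highest_first)
    return [dock(name) for name, _score in ordered]


def run_in_parallel(first_group, second_group):
    # Both groups are submitted together so that neither waits for the other.
    with ThreadPoolExecutor(max_workers=2) as pool:
        gpu_future = pool.submit(run_group, dock_on_gpu, first_group, True)
        cpu_future = pool.submit(run_group, dock_on_cpu, second_group, False)
        return gpu_future.result(), cpu_future.result()


first_group = [("B", 250.0), ("D", 400.0)]   # (name, ligand set score)
second_group = [("C", 80.0), ("A", 10.0)]
gpu_done, cpu_done = run_in_parallel(first_group, second_group)
print(gpu_done)
print(cpu_done)
```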
Meanwhile, according to an embodiment, the step 511 of performing molecular docking simulation may include an energy evaluation step of evaluating energy according to a pose of a protein-ligand complex, the pose being determined based on ligand information and protein information. In doing so, it is possible to predict how strongly and in which conformation a ligand molecule can bind with a protein structure, and to optimize the pose and the energy.
In addition, it is possible to provide information on a result of molecular docking simulation by evaluating energy according to a pose of the protein-ligand complex. For example, a report including information on results of the entire docking simulation may be issued.
According to an embodiment of the present disclosure, since only a linear search is performed to classify a plurality of ligand sets into the first group or the second group, the time required for the classification is insignificant compared to the time required for the entire docking simulation. For example, 10 days or more are required to perform 100,000 docking simulations, but it may take 5 minutes or less to classify 100,000 docking models.
Therefore, once the set of ligands and proteins to be docked is determined, using both a GPU and a CPU may reduce the time required to perform the molecular docking simulation by about 30% compared to using only a CPU. For example, assuming that the time required to perform docking using only a CPU is 100, when a GPU performs 30% of the entire docking models, the time required to perform the entire docking simulation may be reduced to 70. Likewise, simulating about 100,000 docking models on a computer having 48 cores requires 10 days or more; if a GPU performs about 30% of the docking models, the time required to perform the entire docking simulation may be reduced to about 7 days.
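The arithmetic behind these figures can be written out under a simple assumed model: the CPU and the GPU work concurrently, the GPU takes a fraction of the docking models, and the GPU finishes its share no later than the CPU finishes the rest. The function name `hybrid_time` and the `gpu_speedup` parameter are hypothetical and serve only as an illustration.

```python
def hybrid_time(cpu_only_time, gpu_fraction, gpu_speedup=None):
    """Estimated wall time when a GPU takes `gpu_fraction` of the docking models.

    cpu_only_time: time for the CPU alone to run every model.
    gpu_speedup:   assumed GPU throughput relative to the CPU; None means
                   the GPU is assumed never to be the bottleneck.
    """
    cpu_part = cpu_only_time * (1.0 - gpu_fraction)
    if gpu_speedup is None:
        return cpu_part
    gpu_part = cpu_only_time * gpu_fraction / gpu_speedup
    # Both processors run concurrently, so the slower side sets the total.
    return max(cpu_part, gpu_part)


print(hybrid_time(100, 0.30))  # normalized time units
print(hybrid_time(10, 0.30))   # days
```

Under this model the saving equals the GPU's share of the work only as long as the GPU is not the slower side; passing a `gpu_speedup` value shows when that assumption stops holding.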
Meanwhile, in an embodiment, the ligand sets may be scored first and the processor to perform each simulation may be selected based on the score, so that each processor performs the computation suited to its characteristics and the utilization of all processors increases. For example, instead of being assigned to a specific group, the ligand sets may simply be ordered by score, and the CPU and the GPU may simulate ligand sets starting from the two opposite ends of that ordering. In this way all processors are used effectively, and the difference in computational speed may remain small even when the ligand sets are distributed in various ways.
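Consuming a score-ordered list from both ends can be sketched with a double-ended queue. This is a simplified, single-threaded illustration and not the disclosed implementation: the alternating turn order stands in for two processors that would in practice run concurrently, the assignment of the high end to the GPU and the low end to the CPU follows the ordering described for the first and second groups, and `schedule_from_both_ends` is a hypothetical name.

```python
from collections import deque


def schedule_from_both_ends(scored_sets):
    """Assign score-sorted ligand sets to the GPU (high end) and CPU (low end)."""
    queue = deque(sorted(scored_sets, key=lambda item: item[1]))
    gpu_work, cpu_work = [], []
    while queue:
        gpu_work.append(queue.pop())          # highest remaining score
        if queue:
            cpu_work.append(queue.popleft())  # lowest remaining score
    return gpu_work, cpu_work


scored = [("A", 10.0), ("B", 250.0), ("C", 80.0), ("D", 400.0), ("E", 5.0)]
gpu_work, cpu_work = schedule_from_both_ends(scored)
print([name for name, _ in gpu_work])
print([name for name, _ in cpu_work])
```

In a concurrent setting each processor would take its next item whenever it becomes free, so no fixed cut-off between the two groups would be needed.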
FIG. 10 is a block diagram schematically illustrating a molecular docking simulation apparatus according to an embodiment of the present disclosure.
The molecular docking simulation apparatus according to an embodiment of the present disclosure may include a CPU core 1010, a GPU core 1020, and a controller 1030.
The CPU core 1010 may be a device included in the CPU 100 illustrated in FIG. 1. Also, the GPU core 1020 may include a GPU memory, and may be a device included in the GPU 150 illustrated in FIG. 1. A repeated description of the CPU core 1010 and the GPU core 1020 will be omitted.
The controller 1030 according to an embodiment of the present disclosure may identify information on a plurality of ligands and identify protein information including grid information on a plurality of proteins. In addition, the controller 1030 may identify at least one ligand set based on a size of each ligand according to the information on the plurality of ligands, and may determine a score for at least one ligand set based on properties of ligands included in each ligand set. The controller 1030 may include at least one of the controller 105 and the controller 155 of FIG. 1, or the operation of the controller 1030 may be implemented by a separate controller, other than the controllers of FIG. 1, configured for the additional computation.
The controller 1030 may copy, to a GPU memory, protein information and information on ligands belonging to ligand sets selected as a first group based on a score for at least one ligand set and the number of GPU cores 1020 involved in performing molecular docking simulation. In addition, the molecular docking simulation may be performed using the GPU cores 1020 based on the information copied to the GPU memory. The GPU memory may include at least one of a cache memory 165, an ALU 160, and a DRAM 170 of FIG. 1, and the operation of the GPU memory may be performed even using a separate memory structure for storing information necessary for GPU computation.
In addition, the controller 1030 according to an embodiment may further perform molecular docking simulation using the CPU core 1010 based on protein information and information on ligand sets that are selected as the second group based on a score for at least one ligand set.
In addition, the controller 1030 may perform molecular docking simulation using the CPU core 1010 in order from a ligand set with the lowest score among the ligand sets included in the second group. In doing so, it is possible to perform molecular docking simulation for the first group and molecular docking simulation for the second group in parallel.
Meanwhile, although the preferred embodiments have been disclosed in this specification and drawings and specific terms have been used therein, they have been used in common meanings for easily describing the technological contents of the present disclosure and helping understanding of the present disclosure, but are not intended to limit the scope of the present disclosure. It will be evident to those skilled in the art that various implementations based on the technological spirit of the present disclosure are possible in addition to the disclosed embodiments.
Claims (20)
- A method for molecular docking simulation, the method comprising: identifying information on a plurality of ligands; identifying protein information including grid information on a plurality of proteins; identifying at least one ligand set based on a size of each ligand according to the information on the plurality of ligands; determining a score for the at least one ligand set based on properties of ligands included in the ligand set; copying, to a Graphics Processing Unit (GPU) memory, information on ligands included in ligand sets in a first group selected based on the score for the at least one ligand set and the number of GPU cores involved in performing molecular docking simulation, and the protein information; and performing molecular docking simulation using the GPU cores based on the information copied to the GPU memory.
- The method according to claim 1, further comprising performing molecular docking simulation using the CPU cores based on information on ligand sets included in a second group and the protein information, wherein the ligand sets included in the second group are selected based on the score for the at least one ligand set.
- The method of claim 2, wherein the molecular docking simulation is performed using the CPU cores in order from a ligand set with the lowest score among the ligand sets included in the second group.
- The method of claim 3, wherein a score for a ligand set included in the first group is greater than a score for a ligand set included in the second group.
- The method of claim 3, wherein molecular docking simulation for the first group and molecular docking simulation for the second group are performed in parallel.
- The method of claim 1, wherein the ligand sets included in the first group are selected based on a specific value, the specific value being proportional to the score for the at least one ligand set and the number of GPU cores, and inversely proportional to a warp size of the GPU.
- The method of claim 1, wherein the information on the plurality of ligands comprises at least one of: ligand conformation information, the number of atoms of the ligands, and degree of freedom of the ligands.
- The method of claim 1, wherein the information copied to the GPU memory comprises array information comprising the information on the plurality of ligands, and the array information comprises an array of constituent molecule information and an array of a connection relationship information of constituent molecules in relation to the plurality of ligands.
- The method of claim 1, wherein the number of times to copy the information on the ligands included in the ligand sets in the first group and the protein information to the GPU memory is determined based on the number of times to perform a pose search for the ligands included in the ligand sets.
- The method of claim 1, wherein the copying comprises copying, to the GPU memory, corresponding ligand information on ligand sets in order from a ligand set with the highest score among the ligand sets in the first group, and the protein information.
- The method of claim 1, wherein: the score for the ligand set is determined based on a first index and a second index associated with the ligands included in the ligand set, the first index is determined based on the number of atoms of the ligands included in the ligand set and degree of freedom of the ligands included in the ligand set, and the second index is determined based on the number of the ligands included in the ligand set and the number of the plurality of proteins.
- The method of claim 11, wherein the first index is determined based on the number of atoms of the ligands, the degree of freedom of the ligands, and the warp size of the GPU.
- The method of claim 12, wherein the first index is proportional to a square of the number of atoms of the ligands and inversely proportional to at least one of: the degree of freedom of the ligands or the warp size of the GPU.
- The method of claim 11, wherein the second index is determined according to a value that is proportional to at least one of: the number of the ligands included in the ligand set or the number of the plurality of proteins.
- The method of claim 1, wherein the performing of the molecular docking simulation comprises performing energy evaluation according to a pose of a protein-ligand complex determined based on the information on the ligands and the protein information, and providing information on a result of the molecular docking simulation.
- An apparatus for molecular docking simulation, comprising: a Central Processing Unit (CPU) core; a plurality of Graphics Processing Unit (GPU) cores; and a controller, wherein the controller is configured to: identify information on a plurality of ligands; identify protein information including grid information on a plurality of proteins; identify at least one ligand set based on a size of each ligand according to the information on the plurality of ligands; determine a score for the at least one ligand set based on properties of ligands included in the ligand set; copy, to a GPU memory, information on ligands included in ligand sets in a first group selected based on the score for the at least one ligand set and the number of GPU cores involved in performing the molecular docking simulation, and the protein information; and perform molecular docking simulation using the GPU cores based on the information copied to the GPU memory.
- The apparatus of claim 16, wherein the controller is configured to further perform molecular docking simulation using the CPU cores based on information on ligand sets included in a second group and the protein information, wherein the ligand sets included in the second group are selected based on the score for the at least one ligand set.
- The apparatus of claim 17, wherein the controller is configured to perform the molecular docking simulation using the CPU cores in order from a ligand set with the lowest score among the ligand sets included in the second group, and to perform molecular docking simulation for the first group and molecular docking simulation for the second group in parallel.
- The apparatus of claim 18, wherein a score for a ligand set included in the first group is greater than a score for a ligand set included in the second group.
- The apparatus of claim 16, wherein: the score for the ligand set is determined based on a first index and a second index associated with the ligands included in the ligand set, the first index is determined based on the number of atoms of the ligands included in the ligand set and degree of freedom of the ligands included in the ligand set, and the second index is determined based on the number of the ligands included in the ligand set and the number of the plurality of proteins.
Applications Claiming Priority (4)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
KR10-2019-0130752 | 2019-10-21 | ||
KR20190130752 | 2019-10-21 | ||
KR1020200046176A KR102209526B1 (en) | 2019-10-21 | 2020-04-16 | Method and apparatus for analysis protein-ligand interaction using parallel operation |
KR10-2020-0046176 | 2020-04-16 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2021080122A1 (en) | 2021-04-29 |
Family
ID=74571536
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/KR2020/008789 WO2021080122A1 (en) | 2019-10-21 | 2020-07-06 | Method and apparatus for analyzing protein-ligand interaction using parallel operation |
Country Status (2)
Country | Link |
---|---|
KR (1) | KR102209526B1 (en) |
WO (1) | WO2021080122A1 (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115016951B (en) * | 2022-08-10 | 2022-10-25 | 中国空气动力研究与发展中心计算空气动力研究所 | Flow field numerical simulation method and device, computer equipment and storage medium |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
KR20100092596A (en) * | 2009-02-13 | 2010-08-23 | 건국대학교 산학협력단 | A molecular docking simulation method and its apparatus |
US8036867B2 (en) * | 2003-10-14 | 2011-10-11 | Verseon | Method and apparatus for analysis of molecular configurations and combinations |
US20150051090A1 (en) * | 2013-08-19 | 2015-02-19 | D.E. Shaw Research, Llc | Methods for in silico screening |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2007219760A (en) * | 2006-02-15 | 2007-08-30 | Fujitsu Ltd | Docking simulation program, recording medium recording the program, docking simulation apparatus, and docking simulation method |
KR101879419B1 (en) * | 2017-03-15 | 2018-08-17 | 주식회사 클래스액트 | A task distribution method using parallel processing algorithm |
2020
- 2020-04-16 KR KR1020200046176A patent/KR102209526B1/en active Active
- 2020-07-06 WO PCT/KR2020/008789 patent/WO2021080122A1/en active Application Filing
Also Published As
Publication number | Publication date |
---|---|
KR102209526B1 (en) | 2021-02-01 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 20878645; Country of ref document: EP; Kind code of ref document: A1 |
| | NENP | Non-entry into the national phase | Ref country code: DE |
| | 122 | Ep: pct application non-entry in european phase | Ref document number: 20878645; Country of ref document: EP; Kind code of ref document: A1 |