KR20230082064A

KR20230082064A - Device and method for target gene selection using natural language and processing

Info

Publication number: KR20230082064A
Application number: KR1020210169622A
Authority: KR
Inventors: 최원재
Original assignee: 주식회사 꿀비
Priority date: 2021-12-01
Filing date: 2021-12-01
Publication date: 2023-06-08

Abstract

The present invention relates to a method and system for selecting target genes using natural language processing. Specifically, it extracts disease-, gene-, and protein-related terms from one or more items of bio-field document data and extracts the relations between those terms for use in target gene selection. The natural language processing literature analysis comprises: a first step of linking to a database (DB) that stores the abstracts or full texts of bio-field literature, splitting the text into sentences, and recognizing and extracting named entities; and a second step of recognizing relations among the named entities extracted from the sentences of the abstracts or full texts and generating relation events to construct a pathway. The result of this analysis is used for target gene selection.

Description

Apparatus and method for target gene selection using natural language processing {Device and method for target gene selection using natural language and processing}

The present invention relates to an apparatus and method for selecting target genes using natural language processing. More specifically, it relates to a target gene selection apparatus and method that can efficiently analyze bio-field literature by natural language processing and process the related gene information, in order to discover new gene candidates for drug development or to select gene candidates that are highly related to a target gene and judged to be worth researching.

In drug development and gene research targeting a specific disease, it is important to understand the interrelationships among genes, compounds, and diseases together. Conventional molecular biology, however, analyzed its subject by breaking it into parts. For example, to study the function of gene A, researchers would subdivide the question into which cellular roles the gene affects (cell motility, cell death, cell growth, and so on), select only one of the ten genes regulated by gene A, and observe how that downstream gene changes cell function. It was accepted that simply adding up these subdivided gene functions would yield an understanding of how the cell functions.

However, recent advances in molecular biological analysis have revealed processes that arise from the interaction of many genes, and technologies that can analyze many genes simultaneously (e.g., NGS sequencing, microarrays, mass spectrometry) have matured. As a result, systems biology, which analyzes identified genes together rather than studying each gene's function one by one, has come to the fore. Systems biology represents a shift to a research paradigm in which gene function is not simply the sum of the functions of individual genes and can be understood only by analyzing the whole. It treats biological phenomena as complex systems and aims to analyze and simulate them using principles not only from biology but also from computer science, mathematics, physics, and chemistry. One example is the abnormal increase in expression of genes associated with cancer development or tumor suppression in human tumors. Some genes commonly suppress abnormal expression in tumors, but because tumors more often show gene patterns specific to the tumor type, these patterns can be used to infer the origin of the tumor tissue. Characteristic gene and protein expression patterns, as causes of the gene dysfunction underlying disease, provide opportunities to develop markers for diagnosis, prognosis, and prediction, and can thereby contribute greatly to reducing future social costs.

Despite more than a century of steady genetic research, even yeast, a simple single-celled organism, still has about 1,000 genes whose functions have never been experimentally verified. Considering the pleiotropy of genes, the portion of overall gene function that is currently known is necessarily very limited.

Meanwhile, the gene candidates detected by existing gene analysis methods may include genes that researchers are already actively studying or whose experimental validation is complete. With tens of thousands of new research results accumulating every year through paper publications, patent registrations, academic conferences, and the like, it is nearly impossible to assign personnel capable of understanding this research to read the relevant papers, bring the corresponding gene information up to date, and select valid genes. Research is therefore under way on how to select target gene candidates.

This selection process requires collecting and organizing widely scattered data in order to uncover the relations between diseases and genes, or between diseases and proteins, during drug development. Today a vast amount of knowledge is distributed as unstructured text, so identifying disease and gene/protein relations quickly from it takes considerable effort. Accordingly, before new research and development into disease and gene/protein relations begins, the target gene selection process needs technology that lets a device with computing capability (for example, a computer) perform natural language processing on the unstructured text contained in bio-field document data.

An object of the present invention is to provide an analysis method and system that uses natural language processing of bio-field literature to overcome the shortcomings of the prior art, effectively predict unknown gene functions, and output gene candidates.

Another object of the present invention is to effectively discover a list of new gene candidates for drug development.

Another object of the present invention is to discover a list of candidate genes for drugs that treat pathogenic microorganisms by predicting similarity to a target gene or disease.

The technical problems to be solved by the present invention are not limited to those mentioned above, and other technical problems not mentioned will be clearly understood from the description below by those of ordinary skill in the art to which the present invention pertains.

To achieve the above technical objects, a method for selecting target genes using natural language processing according to an embodiment of the present invention extracts disease-, gene-, and protein-related terms from one or more items of bio-field document data and extracts the relations between those terms for use in target gene selection. The method includes a natural language processing literature analysis process comprising: step 1, linking to a database (DB) that stores the abstracts or full texts of bio-field literature, splitting the text into sentences, and recognizing and extracting named entities; and step 2, recognizing relations among the named entities extracted from the sentences of the abstracts or full texts and generating relation events to construct a pathway.

In addition, in the method according to an embodiment of the present invention, the natural language processing literature analysis process may further include a step of generating a multidimensional index of the positions of the extracted named entities and storing it so that it can be viewed from the perspective of each gene or disease index.

In addition, in the method according to an embodiment of the present invention, the multidimensional index may include: a first dimension selected from gene, disease, drug, pathway, chemical, and protein; a second dimension selected from the relations gene-gene, gene-disease, gene-drug, gene-pathway, gene-chemical, gene-protein, disease-drug, disease-chemical, disease-protein, drug-pathway, drug-chemical, drug-protein, pathway-chemical, pathway-protein, and chemical-protein; and a third dimension consisting of the gene-disease-pathway relation.
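The three index dimensions described above can be pictured with a small in-memory structure. The following is only a minimal sketch under stated assumptions: the class name `MultiDimIndex`, its methods, and the choice of sentence IDs as the indexed "position" are hypothetical and are not taken from the specification; only the entity types and the list of pair types follow the text.

```python
from collections import defaultdict
from itertools import combinations, product

# First-dimension entity categories, as listed in the description.
ENTITY_TYPES = ("gene", "disease", "drug", "pathway", "chemical", "protein")

# Second-dimension pair types, as listed in the description (order-insensitive).
PAIR_TYPES = {
    tuple(sorted(p)) for p in [
        ("gene", "gene"), ("gene", "disease"), ("gene", "drug"),
        ("gene", "pathway"), ("gene", "chemical"), ("gene", "protein"),
        ("disease", "drug"), ("disease", "chemical"), ("disease", "protein"),
        ("drug", "pathway"), ("drug", "chemical"), ("drug", "protein"),
        ("pathway", "chemical"), ("pathway", "protein"), ("chemical", "protein"),
    ]
}


class MultiDimIndex:
    """Hypothetical index: dim1 = entities, dim2 = pairs, dim3 = gene-disease-pathway."""

    def __init__(self):
        self.dim1 = defaultdict(set)  # (type, name) -> sentence ids
        self.dim2 = defaultdict(set)  # ((type, name), (type, name)) -> sentence ids
        self.dim3 = defaultdict(set)  # (gene, disease, pathway) -> sentence ids

    def add_sentence(self, sent_id, entities):
        """Index the (type, name) tuples that were found in one sentence."""
        entities = sorted(set(entities))
        for etype, _ in entities:
            if etype not in ENTITY_TYPES:
                raise ValueError(f"unknown entity type: {etype}")
        for ent in entities:
            self.dim1[ent].add(sent_id)
        # Pairs are stored in sorted order so lookups are order-insensitive.
        for a, b in combinations(entities, 2):
            if tuple(sorted((a[0], b[0]))) in PAIR_TYPES:
                self.dim2[(a, b)].add(sent_id)
        genes = [n for t, n in entities if t == "gene"]
        diseases = [n for t, n in entities if t == "disease"]
        pathways = [n for t, n in entities if t == "pathway"]
        for triple in product(genes, diseases, pathways):
            self.dim3[triple].add(sent_id)

    def lookup_pair(self, a, b):
        """Return the sentence ids in which entities a and b co-occur."""
        return set(self.dim2.get(tuple(sorted((a, b))), set()))
```

Storing pairs under a sorted key lets the same index be read "from the gene side" or "from the disease side" without duplicating entries; a production system would more likely use a database with comparable composite keys.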

In addition, in the method according to an embodiment of the present invention, the named entity is at least one of a disease, a drug, a pathway, a chemical, and a protein, and the relation between the gene and the named entity may be extracted as one of four classes: upregulator, downregulator, indirect-upregulator, and indirect-downregulator.

In addition, in the method according to an embodiment of the present invention, the natural language processing literature analysis process may further include a step of filtering the target documents or genes, either by searching on keywords of the bio-field literature or by searching on document details such as the impact factor or hit count.
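As a rough illustration of this filtering step, the sketch below filters a list of document records by keyword, impact factor, and hit count. The `Paper` record, its field names, and the threshold parameters are assumptions made for the example; the specification does not define a data model.

```python
from dataclasses import dataclass


@dataclass
class Paper:
    """Hypothetical document record; field names are illustrative only."""
    title: str
    abstract: str
    impact_factor: float
    hit_count: int


def filter_papers(papers, keywords=(), min_impact_factor=0.0, min_hit_count=0):
    """Keep papers that mention any keyword and meet both numeric thresholds."""
    lowered = [k.lower() for k in keywords]
    kept = []
    for paper in papers:
        text = f"{paper.title} {paper.abstract}".lower()
        # An empty keyword list means "do not filter by keyword".
        if lowered and not any(k in text for k in lowered):
            continue
        if paper.impact_factor < min_impact_factor:
            continue
        if paper.hit_count < min_hit_count:
            continue
        kept.append(paper)
    return kept
```

For example, `filter_papers(papers, keywords=["TP53"], min_impact_factor=5.0)` would keep only papers that mention TP53 and were published in a journal with an impact factor of at least 5.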

In addition, the method according to an embodiment of the present invention may further include at least one of: a step of preparing FASTQ data for a base sequence; a step of performing NGS (next-generation sequencing)/DEG (differentially expressed gene) analysis on the prepared base sequence; a medical statistics analysis step of extracting candidate genes related to a specific disease by using machine learning to determine the effect of genes on that disease; and a gene network analysis step of selecting genes through gene network analysis.

In addition, a target gene selection system using natural language processing according to an embodiment of the present invention is built using the target gene selection method described above.

In addition, a program for target gene selection using natural language processing according to an embodiment of the present invention is stored on a computer-readable recording medium so as to execute the target gene selection method described above.

With the gene analysis system and method according to the present invention, there is no need to test every gene in the genome to search for a new trait gene. Gene candidates with high research value are selected through an analysis tool that uses natural language processing, and testing is concentrated on the selected candidates, which can greatly reduce time, effort, and cost.

In addition, key disease-related genes associated with a biological process can be selected more accurately and efficiently from a given sample group, reducing the time and effort needed to select target gene candidates.

In addition, the integrated analysis platform is provided as a web service. Instead of installing a separate program or uploading analysis files, the user enters a download address and the server computer downloads the data directly and analyzes it, so users can easily use the analysis service.

The effects of the present invention are not limited to those described above, and should be understood to include all effects that can be inferred from the configuration of the invention described in the detailed description or the claims.

FIG. 1 is a diagram showing the configuration of a natural language processing literature analysis means according to an embodiment of the present invention.
FIG. 2 is a diagram showing the detailed form analysis of individual sentences for natural language processing.
FIG. 3 shows examples of the relation types used to determine relations such as gene-disease relations.
FIG. 4 is a diagram showing a web UI/UX screen for bio-field literature analysis according to an embodiment.
FIG. 5 is a diagram showing example results of bio-field literature processing.
FIG. 6 is a diagram showing the configuration of a target gene selection apparatus using natural language processing according to an embodiment of the present invention.
FIG. 7 is a diagram of the process of the present invention for selecting target genes.
FIG. 8 is a diagram of how the integrated analysis system and method according to the present invention build a machine learning model for extracting candidate genes related to a specific disease.
FIG. 9 is a diagram of the structure for natural language processing of papers, patent documents, and the like.
FIG. 10 is a diagram of a method for predicting gene function through gene network analysis.
FIG. 11 is a diagram showing, as one example, the steps of selecting target genes for cancer tissue and normal tissue using the integrated analysis system and method according to the present invention, for the purpose of discovering new gene candidates for drug development.
FIG. 12 is an output graph of an exemplary survival curve analysis.
FIG. 13 is an output graph of an exemplary ROC curve analysis.

Hereinafter, the present invention is described with reference to the accompanying drawings. The present invention may, however, be implemented in many different forms and is therefore not limited to the embodiments described herein.

Throughout the specification, when a part is said to be "connected (joined, in contact, coupled)" to another part, this includes not only the case where it is "directly connected" but also the case where it is "indirectly connected" with another member in between. In addition, when a part is said to "include" a component, this means that it may further include other components rather than excluding them, unless specifically stated otherwise.

The terms used in this specification are used only to describe specific embodiments and are not intended to limit the present invention. Singular expressions include plural expressions unless the context clearly indicates otherwise. In this specification, terms such as "include" or "have" are intended to specify the presence of the features, numbers, steps, operations, components, parts, or combinations thereof described in the specification, and should be understood not to preclude the presence or addition of one or more other features, numbers, steps, operations, components, parts, or combinations thereof.

Conventional gene analysis systems rely on text mining and extract only the relations between entities within a single sentence. Although all the factors related to a disease are sometimes listed in one sentence of a document, they are more often described across multiple sentences. In the prior art, even when relations between entities are extracted and the entities are linked according to those relations, natural language processing is still performed independently for each sentence. Because of this limitation, processing is slow, and there are limits to grasping the exact meaning of the data and analyzing gene function reliably.

Accordingly, the present inventors use natural language processing based on artificial intelligence (AI) deep learning to take into account the surrounding context of the whole text, the form of the words themselves, and similar cues across many documents. The relations between paper data and genes are derived, and information on the extracted genes and their relevance is delivered to the user, thereby overcoming the existing technical limitations.

The apparatus and method for selecting target genes using natural language processing according to an embodiment of the present invention can be used to select valid gene candidates through a natural-language-based literature analysis tool when researching the entire group of molecular-biologically verifiable target genes would be impractical in terms of time and cost.

The bio-field literature addressed by the present invention may include patent documents, academic papers, and the like. Academic papers include theses and dissertations, conference papers, journal articles, technical reports, and so on. The analysis below is described mainly with academic research papers as an example, but is not limited to them.

The natural-language-processing-based literature analysis tool combines natural language processing with neural networks to understand a database (DB) built from published papers and similar documents, classify it automatically, and reflect its content. The tool includes: a natural language processing system; a database (DB) built from published papers, patent documents, reliable academic journals, and the like; and a pipeline that sends the output to a web page.

The database (DB) is built by collecting the abstracts of papers or patent documents, but it may also be built by collecting full texts.

The natural language processing system combines natural language processing with neural networks, based on the database (DB) built from published papers and similar documents, to understand their content and automatically classify candidate genes. SyntaxNet, BERT, and similar technologies can be used for the natural language processing.

First, the natural language processing system is described in detail.

Existing natural language processing systems are based on text mining: they extract patterns or relations from unstructured natural-language text to find valuable and meaningful information. That is, they extract the terms to be treated as important from the text and express the importance of each extracted term by computing a weight function and a support function based on where and how often the word occurs. For example, a word that appears many times within one document is assumed to be important, but if it occurs frequently across many documents, its importance is considered low.
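The weighting idea in the preceding paragraph (frequent within a document raises importance, frequent across documents lowers it) corresponds to what is commonly called TF-IDF. The specification does not name a formula, so the sketch below is an assumption: it uses the textbook form, term frequency multiplied by the logarithm of the inverse document frequency, and the function name `tf_idf` is illustrative.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Compute textbook TF-IDF weights.

    docs: list of token lists, one per document.
    Returns a list of {term: weight} dicts, one per document.
    """
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))

    weights = []
    for doc in docs:
        term_counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in term_counts.items()
        })
    return weights
```

A term that occurs in every document receives weight zero under this form, which matches the description's point that words common across many documents carry little importance.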

The natural language processing system of the present invention goes one step beyond this text mining by applying rule-based machine learning. It performs multidimensional analysis of genes, diseases, and so on, and applies a multidimensional index to a large volume of bio-field literature to effectively extract gene-gene relationships or gene-disease relationships. That is, indexes for diseases, genes, functions, biological pathways, and so on are built from the large volume of bio-field literature and stored according to a predetermined index storage structure. Using the stored indexes, a user or a computer enters a search term and analyzes diseases, genes, functions, biological pathways, and so on in the literature, so that gene-gene or gene-disease relationships can be extracted.

The overall flow of the natural language processing system in the present invention is as follows. It includes step 1, linking to a database (DB) that stores the abstracts or full texts of bio-field literature, splitting the text into sentences, and recognizing and extracting named entities; and step 2, recognizing relations among the named entities extracted from the sentences of the abstracts or full texts and generating relation events to construct a pathway. It additionally includes a step of generating a multidimensional index of the positions of the extracted named entities and storing it so that it can be viewed from the perspective of each gene or disease index.
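Step 1 of this flow (sentence splitting followed by named entity recognition) can be sketched as follows. This is a deliberately simplified stand-in: the specification describes neural models for entity recognition, whereas this example uses a tiny hand-written lexicon and regular expressions so that it stays self-contained. `LEXICON` and all function names are hypothetical.

```python
import re

# Hypothetical mini-lexicon standing in for a trained entity recognizer.
LEXICON = {
    "TP53": "gene",
    "MDM2": "gene",
    "lung cancer": "disease",
    "p53": "protein",
}


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' when followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def recognize_entities(sentence):
    """Return (name, type, start_offset) for each lexicon term in the sentence."""
    found = []
    for name, entity_type in LEXICON.items():
        pattern = r"\b" + re.escape(name) + r"\b"
        for match in re.finditer(pattern, sentence):
            found.append((name, entity_type, match.start()))
    # Order entities by where they appear in the sentence.
    return sorted(found, key=lambda entity: entity[2])


def analyze(text):
    """Step 1 as a whole: split into sentences, then tag entities per sentence."""
    return [(sentence, recognize_entities(sentence)) for sentence in split_sentences(text)]
```

The per-sentence entity lists with character offsets are the kind of positional information that the later indexing step could store.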

In step 1, in which the sentences of the abstracts or full texts in the bio-field literature database are split into sentence units and named entities are recognized and extracted, the named entities can be extracted from segmented morphemes using an artificial neural network model. A named entity is a noun with a specific meaning within a sentence, such as a gene name, a disease name, or a chemical formula. Named entity recognition means extracting such entities from a document and classifying them into categories. Conventional named entity recognition used dictionary-based and rule-based methods, but artificial neural network models based on technologies such as RNNs (recurrent neural networks) and CNNs (convolutional neural networks) are now being developed. An RNN stacks the data from each time step in the network structure, giving it one of the deepest structures among deep learning networks, and it is a representative deep learning network for sequence data in which surrounding context or order matters. By way of example, an embodiment of the present invention may use an RNN model, an LSTM (long short-term memory) model, or a BERT model.

To perform step 1, in which named entities are recognized and extracted using the artificial neural network model, the system includes: a segmentation unit that divides the input text into predetermined units to generate segmented text; a morpheme segmentation unit that divides the segmented text into morphemes to generate segmented morphemes; an inference unit that derives an inference result, in the form of vector data, from the segmented morphemes; and a named entity extraction unit that extracts named entities based on the inference result to produce the named entity result.

The inference unit may consist of a first inference unit that derives a first inference result, and a second inference unit that, based on the first inference result, derives a second inference result in the form of vector data for the named entities. The first inference unit includes two or more trained artificial neural network models, and the first inference result includes the two or more outputs derived from those models. The second inference unit may include one or more trained artificial neural network models.

In addition, the artificial neural network models of the first inference unit may include a first model that is trained on a per-word basis and derives a word-level result for an input word, and a second model that is trained on a per-part-of-speech basis and derives a part-of-speech-level result for an input word.

Next, the second step is described, in which relations among the entity names extracted from the database (DB) are recognized and relation events are generated to build a pathway.

Here, a pathway means in-depth biological knowledge that expresses, in network form, the dynamics or interactions among biological elements such as proteins, genes, and cells.

In this step, a web search is performed based on the inference result derived by the entity name extraction unit in the first step, and the documents in which the entities appear and the intracellular locations obtained by analyzing the base sequences of the entities are collected. Relation events are then generated from the collected information, for example the relation between two entities, the diseases associated with the two entities, and the location information of each entity, and on that basis a pathway is generated that displays the entities at the corresponding locations within the cell.

For example, contexts may be extracted using preset context patterns, and relations may be recognized from the extracted contexts. As one example of such a relation recognition method, among the verbs that appear at high frequency together with gene or protein entity names, event verbs that express an interaction, such as 'activate' or 'inhibit', are extracted, their patterns are analyzed, and the analyzed pattern information is used to recognize the relations between entities.
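The event-verb idea above can be illustrated with a minimal Python sketch. It is not the patented method: the single `<entity> <verb> <entity>` pattern, the `EVENT_VERBS` table, and the function name `extract_relation_events` are assumptions made for illustration, and a real system would learn many more patterns from the corpus.

```python
import re

# Hypothetical table: event verb stem -> relation label. Only the two verbs
# named in the text are included; the labels are illustrative.
EVENT_VERBS = {"activate": "activator", "inhibit": "inhibitor"}


def extract_relation_events(sentence, entities):
    """Return (subject, relation, object) triples matched by one simple
    context pattern: '<entity> <event verb> <entity>'."""
    # Longest names first so that a longer name wins over its prefix.
    names = "|".join(re.escape(e) for e in sorted(entities, key=len, reverse=True))
    verbs = "|".join(EVENT_VERBS)
    pattern = re.compile(
        rf"\b({names})\b\s+({verbs})(?:s|es|d|ed)?\s+\b({names})\b",
        re.IGNORECASE,
    )
    return [
        (subj, EVENT_VERBS[verb.lower()], obj)
        for subj, verb, obj in pattern.findall(sentence)
    ]


events = extract_relation_events(
    "TP53 activates CDKN1A and MDM2 inhibits TP53.",
    ["TP53", "CDKN1A", "MDM2"],
)
```

Each triple corresponds to one relation event; passive constructions and negation are deliberately out of scope for this sketch.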

To generate the above pathway, the present invention may provide a pathway construction system including: a relation event generation unit that generates relation events from the collected information; and a pathway generation unit that generates a pathway by displaying the entities at the corresponding locations within the cell based on the recognized relation events.

Next, the step is described in which a multidimensional index is generated for the position of each extracted entity name and stored so that the data can be viewed from the perspective of each gene and disease index.

So that the performance and accuracy of search can be greatly improved by using an index instead of directly accessing the full sentences of bio-field documents, which increase by several thousand every year, the present invention builds a multidimensional index and stores it according to a predetermined index storage structure.

For example, in the present invention the first dimension consists of Gene, Disease, Drugs, Pathway, Chemical (compound), and Protein; the second dimension consists of the gene-gene, gene-disease, gene-drug, gene-pathway, gene-compound, gene-protein, disease-drug, disease-compound, disease-protein, drug-pathway, drug-compound, drug-protein, pathway-compound, pathway-protein, and compound-protein relations, among others; and the third dimension is configured so that the gene-disease-pathway relation can be analyzed.

As an example of the first dimension, the gene index stores information such as the GEO ID, name, organism, and gene expression value, and the standard gene name and synonyms of the gene name may be stored in association with it.

Likewise, the disease index stores the PubMed ID, sentence number, disease ID, disease name, start position, and end position, and the standard disease name and synonyms of the disease may be stored in association with it.

Other analysis dimensions can be handled in the same way: with reference to the above, only the appropriate index information needs to be added as required, so a further analysis dimension can easily be added to establish the multidimensional analysis model.

In the present invention, the multidimensional index technique stores the sentences of the database so that they can be viewed from the perspective of each gene and disease index, which allows gene-disease-compound relations to be extracted.
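As a rough illustration of how such an index could avoid scanning full sentences, the following Python sketch builds a one-dimensional index (entity to sentence locations) and a two-dimensional index (entity pair to sentence locations) from mention records. The `Mention` fields follow the disease index fields listed above; the class name, the function names, and the in-memory dictionaries are assumptions, since the patent does not specify the storage structure.

```python
from collections import defaultdict
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Mention:
    pubmed_id: str
    sentence_no: int
    category: str   # e.g. "gene", "disease", "chemical"
    entity_id: str
    start: int      # character offset where the mention starts
    end: int        # character offset where the mention ends


def build_indexes(mentions):
    """Build 1-D (entity -> sentences) and 2-D (entity pair -> sentences) indexes."""
    one_d = defaultdict(set)
    by_sentence = defaultdict(set)
    for m in mentions:
        key = (m.category, m.entity_id)
        loc = (m.pubmed_id, m.sentence_no)
        one_d[key].add(loc)
        by_sentence[loc].add(key)
    two_d = defaultdict(set)
    for loc, keys in by_sentence.items():
        for a, b in combinations(sorted(keys), 2):
            two_d[frozenset((a, b))].add(loc)
    return one_d, two_d


def sentences_with_all(one_d, *keys):
    """Sentences in which every given entity occurs (e.g. gene-disease-compound)."""
    sets = [one_d.get(k, set()) for k in keys]
    return set.intersection(*sets) if sets else set()
```

A three-way query such as gene-disease-compound then reduces to a set intersection over the one-dimensional index, without touching the sentence text.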

The database (DB) of the present invention is built from published papers, patent documents, reliable academic journals, and the like, and is stored in memory. To keep the database (DB) up to date, an algorithm may be applied that immediately reflects newly available papers, patent documents, and similar material.

The structure of the update solution for the natural language processing database is as follows.

First, a training dataset is built using the paper information database constructed on the basis of rules. Then a model is designed that takes a parse tree as input and searches for and classifies the desired features. When a sentence is input, the model first performs preprocessing that removes preset elements such as parentheses and duplicate spaces. Next, based on rules, the sentence from which the unnecessary elements have been removed is split, or part of it is removed, according to the nature of the conjunctions it contains. Finally, sentence components such as the subject, verb, object, complement, and modifiers are recognized in the split or trimmed sentence, and through this process a trained feature extraction model and its implementation can be produced.
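The preprocessing and rule-based splitting described above might look roughly like the following Python sketch. The conjunction list `SPLIT_CONJUNCTIONS` and both function names are assumptions: the text does not enumerate the conjunctions or the rules attached to each, and splitting on every "and" is a simplification that a parse-tree-based rule would refine.

```python
import re

# Assumed split points; the source does not list the conjunctions it uses.
SPLIT_CONJUNCTIONS = ("and", "but", "whereas", "while")


def preprocess(sentence):
    """Remove parenthesized spans (innermost level only) and collapse repeated spaces."""
    no_parens = re.sub(r"\([^()]*\)", " ", sentence)
    return re.sub(r"\s+", " ", no_parens).strip()


def split_on_conjunctions(sentence):
    """Split a preprocessed sentence into clauses at the assumed conjunctions."""
    alternation = "|".join(SPLIT_CONJUNCTIONS)
    parts = re.split(rf",?\s+\b(?:{alternation})\b\s+", sentence)
    return [p.strip(" .,") for p in parts if p.strip(" .,")]


raw = "TP53  (also known as p53) activates CDKN1A,  but MDM2 inhibits TP53."
clauses = split_on_conjunctions(preprocess(raw))
```

Each resulting clause is short enough to be passed on to the component recognition stage (subject, verb, object, complement).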

The natural language processing analysis device or means for sentences input from bio-field literature may include a gene recognition module, a relevance determination module, a relation derivation module, a connection type classification module, a learning module, and the like, and may further include an input module, an output module, and the like as needed (see FIG. 1).

The gene recognition module finds the gene to be analyzed in the paper database and recognizes it as an entity, and the relevance determination module analyzes the recognized gene entity and the sentences in the paper to determine its relevance to the object. The relation derivation module derives the relation (correlation) between the gene and the object through sentence analysis, based on the result from the relevance determination module. The connection type classification module classifies the relations between the gene and the object obtained by the relation derivation module by type (see FIG. 2).

In the gene selection device of the present invention, the bio-field documents matching the gene list are extracted according to the object value entered in the input unit of the natural-language-based search filter, and the object is selected from one or more of the categories Gene, Disease, Drugs, Pathway, Chemical, and Protein. In this way, related terms such as genes, diseases, drugs, and pathways are extracted from the bio-field literature data, and the relations between the terms are derived.

The natural language processing method of the present invention analyzes the sentences in bio-field documents into morphemes and uses an artificial neural network model to learn the words regarded as the components that determine the correlation between a gene and an object. Through this process, the gene recognition module and the relation derivation module are trained in advance and configured to derive relations from paper data.

From the bio-field documents extracted using the gene list and the literature analysis tool built on the natural-language-processing-based database, the correlation between a gene and the object can be extracted in one of four classes: upregulator, downregulator, indirect-upregulator, and indirect-downregulator (see FIG. 2).

At a finer level, up to 13 types of correlation can be extracted, which additionally include no relation (not), inhibitor, activator, antagonist, agonist-inhibitor, agonist-activator, substrate, product (product_of), and substrate-product (substrate_product).

For example, the trained artificial neural network model extracts contexts using the context patterns of papers and recognizes relations from them: among the verbs that appear at high frequency together with the name of a gene or protein entity in the gene list, event verbs that express an interaction, such as 'activate' or 'inhibit', are extracted, their patterns are analyzed, and the analyzed pattern information is used to recognize the relations between entities. If a gene in the gene list has an upregulator correlation with a specific disease, it can be inferred that the candidate gene is upregulated in that disease. If the result is no relation (not), it can be concluded that no sentence relating the candidate gene to the specific disease was found.

A gene in the gene list derived from the search can be judged to have research value when it shows a correlation in gene-gene, gene-disease, gene-pathway, or gene-drug terms.

To strengthen the user's confidence in using a gene, the result includes the sentence from the paper database that was judged to show a correlation with the object.

Through the literature analysis tool built on the natural-language-processing-based database, the output unit can present the candidate genes after eliminating those that match the bio-field literature strongly, or present the relevant paper information.

In the present invention, the literature analysis tool built on the natural-language-processing-based database may also be configured to provide a download link through which the bio-field documents related to a given gene can be consulted.

First, the text collected from the bio-field literature database is divided into segmented text, which is in turn divided into morphemes. To remove meaningless parts of a sentence, parentheses are removed, and the sentence is split at coordinating conjunctions, subordinating conjunctions, and object-clause conjunctions. The remaining words are classified into sentence-form elements, and the subject, verb, object, and complement are extracted as the sentence components of those words (see FIG. 3).

The present invention can provide a paper database. Specifically, its content may be the abstracts of papers, but it is not limited thereto. By providing paper data, which is unstructured, in the form of a refined paper database, the time a user needs to derive analysis results can be reduced.

First, in the natural language processing paper analysis system, an artificial neural network model whose structure is preset by the artificial intelligence learning module derives, from the paper database, relation information for the gene list that has been analyzed, selected, and extracted.

The natural-language-processing-based paper analysis of the present invention consists of three steps, including auxiliary preprocessing (see FIG. 4).

The first step uses a keyword-centered search filter on papers, the second step queries the database with a search filter centered on detailed paper information, and the third step performs a natural-language-based search for correlations within the sentences of papers.

In the keyword-centered search filter of the first step, when the gene list is entered in the input unit, the data of papers that are stored in the paper database and contain keywords from the entered gene list can be retrieved, and relations between the gene entities can be derived from the keyword data of those papers. The input unit supports multi-keyword search, which searches for genes by the paper keywords extracted from the paper database. A function for searching within a specified publication period is additionally provided as needed.

As the search result, the number of papers containing keywords related to the entered genes and the most recent representative papers are derived and provided. As detailed information, the paper title, author names, and publisher name can be presented. In addition, paper subjects are classified into natural science, engineering, and medicine and pharmacy, and paper types are divided into academic journals, conference materials, professional magazines, research reports, and the like, which makes it easier for the user to grasp the progress or results of research on a given gene.

The detailed-information search filter of the second step takes the results of the first-step search filter and adds filters on the Impact Factor and the hit count of each paper, or selects papers in consideration of the importance of the papers related to the gene list under study, which helps in checking the value of a gene and deciding whether to add it to or eliminate it from the final gene set. For example, if the papers related to a gene have a high Impact Factor and a high hit count, the gene is already well known and is not suitable as a candidate for predicting new function, so it may be eliminated from the final genes (see FIG. 5).
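The first two filter steps can be sketched in Python as follows. This is only an illustration under stated assumptions: the `Paper` record, the function names, the thresholds, and the rule that one paper exceeding both thresholds marks a gene as well known are all hypothetical, because the patent does not define how "high" Impact Factor or hit count is decided.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Paper:
    title: str
    keywords: frozenset
    published: date
    impact_factor: float
    hits: int


def keyword_filter(papers, genes, start=None, end=None):
    """Step 1: papers whose keywords contain any gene, optionally within a period."""
    wanted = {g.upper() for g in genes}
    selected = []
    for paper in papers:
        if not wanted & {k.upper() for k in paper.keywords}:
            continue
        if start is not None and paper.published < start:
            continue
        if end is not None and paper.published > end:
            continue
        selected.append(paper)
    return selected


def well_known_genes(papers, genes, min_impact_factor, min_hits):
    """Step 2 (assumed rule): flag a gene if any matching paper exceeds both thresholds."""
    flagged = set()
    for gene in genes:
        for paper in keyword_filter(papers, [gene]):
            if paper.impact_factor >= min_impact_factor and paper.hits >= min_hits:
                flagged.add(gene)
                break
    return flagged
```

Genes returned by `well_known_genes` would be candidates for elimination from the final list, leaving the less-studied genes for the third, sentence-level step.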

Then, in the third step, the natural-language-based search for correlations within paper sentences, the correlations within the sentences of the paper abstracts filtered through the paper database are queried using the gene list derived from the above analysis (see FIG. 1).

As shown in FIG. 6, the target gene selection device (10) using natural language processing of the present invention, and its method, include the natural language processing analysis device described above as a natural language processing literature analysis module. In addition, they may include: a base sequence analysis module that performs DEG (differentially expressed gene) analysis on NGS (next generation sequencing) detection data and is provided with an integrated web UI/UX-based medical statistical analysis tool covering the overall process of sequencing analysis (RNA sequencing); a medical statistics analysis module that uses machine learning to determine the effectiveness of genes with respect to a specific disease and extracts candidate genes related to that disease; and a gene network analysis module that selects genes through gene network analysis.

Here, the terms target gene and candidate gene may refer not only to the genes that ultimately become research subjects but also to all candidate genes extracted by the analysis system and method of the present invention.

The integrated web UI/UX-based medical statistical analysis tool of the present invention refers to a system and method that analyzes raw sequence data over the public Internet and provides the output results. It provides server resources, a pipeline capable of analyzing the FASTQ files that result from sequencing, and web UI/UX-based GUI analysis software together, enabling one-stop analysis and the selection of meaningful target genes. Here, FASTQ is a text-based format for storing both a biological sequence (usually a nucleotide sequence) and its corresponding quality scores. General-purpose tools for analyzing and graphing FASTQ files are fairly common, but analyzing raw sequence data requires substantial computing resources and storage. The present invention provides a server on which a FASTQ analysis pipeline is already built, so that individual users do not have to meet these resource requirements themselves and can analyze large FASTQ files easily, using the server resources, without installing any separate program.

The steps for selecting target genes according to the present invention include: a first step of performing DEG (differentially expressed gene) analysis using NGS (next generation sequencing) detection data; a second step of analysis with a machine-learning-based medical statistical analysis tool; a third step of analysis with a natural-language-based literature analysis tool; and a fourth step of analysis with a focused search tool based on gene network analysis (see FIG. 7).

First, the first step of performing DEG (differentially expressed gene) analysis using NGS (next generation sequencing) detection data is described.

NGS (next generation sequencing) analysis is a high-speed method for analyzing the base sequence of a genome, and is also called high-throughput sequencing, massively parallel sequencing, or second-generation sequencing. Unlike conventional Sanger sequencing, it is characterized by processing a large number of DNA fragments in parallel. That is, a genome is broken into numerous fragments, each fragment is read simultaneously, and the reads are then assembled computationally, so that vast amounts of genomic information can be decoded quickly. With the advent of next generation sequencing, the cost of genome analysis has dropped sharply, and the technique is now used widely in many fields.

DEG (differentially expressed gene) analysis is a statistical method for analyzing how much the mean expression level of the same gene differs between conditions. A typical procedure is as follows.

First, sample statistics (mean, standard deviation, etc.) are computed from the gene expression data. The gene expression data may be collected by any method known in the art, for example microarray gene expression data, multiplex PCR (multiplex polymerase chain reaction), quantitative RT-PCR (quantitative reverse transcription polymerase chain reaction), transcriptome analysis using tiling arrays, or short read sequencing, without being limited to these. Next, a population distribution is assumed from the sample statistics, and the difference between the experimental and control groups is expressed as a significance probability (p-value); for example, genes with p-value < 0.05 are generally designated as DEGs. Analyzing gene expression patterns in this way yields the DEGs (differentially expressed genes) that show high or low expression under a specific condition. In other words, the effect of the expression of a specific gene on the expression of other genes can be analyzed.
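The procedure above can be sketched with a two-sample Student's t-test in standard-library Python. This is a minimal illustration under assumptions, not the tool's implementation: equal variances are assumed, the p-value is obtained by numerically integrating the t density with Simpson's rule rather than by a statistics library, no multiple-testing correction is applied, and the function names and gene labels are hypothetical.

```python
import math
from statistics import mean, variance


def t_pdf(x, df):
    """Probability density of Student's t distribution with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)


def two_sided_p(t, df, steps=2000):
    """P(|T| >= |t|), using Simpson's rule on [0, |t|] (steps must be even)."""
    a = abs(t)
    if a == 0:
        return 1.0
    h = a / steps
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return max(0.0, 1.0 - 2.0 * total * h / 3.0)


def student_t_test(control, treated):
    """Equal-variance two-sample t-test; raises ZeroDivisionError if both groups are constant."""
    n1, n2 = len(control), len(treated)
    df = n1 + n2 - 2
    pooled = ((n1 - 1) * variance(control) + (n2 - 1) * variance(treated)) / df
    t = (mean(treated) - mean(control)) / math.sqrt(pooled * (1 / n1 + 1 / n2))
    return t, two_sided_p(t, df)


def select_degs(expression, alpha=0.05):
    """expression: gene -> (control values, treated values). Returns gene -> p-value."""
    degs = {}
    for gene, (control, treated) in expression.items():
        _, p = student_t_test(control, treated)
        if p < alpha:
            degs[gene] = p
    return degs
```

In practice thousands of genes are tested at once, so a correction for multiple comparisons would normally be layered on top of the raw p-value threshold.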

Although the analysis of NGS detection data, DEG selection, and medical statistical analysis are processes that need to be carried out in an integrated manner, no service yet provides information by integrating them. The present invention carries them out in an integrated manner and thereby offers improvements in analysis efficiency and accuracy.

In the present invention, NGS detection data analysis is supported by a module that provides a comprehensive set of gene analysis tools, including extraction of DEGs (differentially expressed genes) between multiple analysis groups, statistical verification using the t-test (p-value < 0.05), and pathway analysis using GO terms (gene ontology terms) and the Reactome pathway DB (a database of biological pathways), so that data detected in the NGS (next generation sequencing) process can be analyzed and DEG analysis performed. That is, with the analysis module of the present invention, the analysis server can perform analysis by identifying, on the basis of at least one target gene, at least one of the gene ontology information or pathway information stored in another DB.

Next, the second step of performing analysis with the machine-learning-based medical statistics analysis module is described.

The machine-learning-based medical statistics analysis module is an analysis tool that can perform medical statistical analysis using machine-learning-based graph analysis or statistical techniques such as survival curve analysis, ROC curve analysis, and LASSO regression analysis.

Survival curve analysis is a technique that analyzes and predicts the time from a defined starting point until a given event occurs. For example, using the TCGA data set, a public genome database of more than 10,000 cancer patients in the United States, only the genes that are statistically highly associated with the survival time of cancer patients can be selected, reducing the number of candidate genes from the previous 1,000-2,000 to about 100-200.
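One common way to estimate a survival curve is the Kaplan-Meier estimator, shown below as a small Python sketch. The patent does not name the estimator it uses, so this choice, the function name, and the input layout are assumptions; comparing the curves of high- and low-expression patient groups (for example with a log-rank test) is left out.

```python
def kaplan_meier(times, events):
    """Kaplan-Meier survival estimate.

    times[i]  : follow-up time of patient i
    events[i] : 1 if the event (e.g. death) was observed, 0 if censored
    Returns [(time, survival_probability)] at each distinct event time.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        # Group all patients sharing the same follow-up time.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths:
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= removed
    return curve
```

Running the estimator separately for patients with high and low expression of a gene gives two curves whose separation indicates how strongly that gene is associated with survival time.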

ROC curve analysis is a graph analysis technique that visualizes the evaluation of classification models so that they can be compared easily. The closer the AUROC, the area under the ROC curve, is to 1, the better the model performs; the closer it is to 0.5, the closer the model is to a random model that predicts at chance, which indicates a poor model. With ROC curve analysis, the effectiveness of a given gene for disease diagnosis can be determined, so even a clinician can carry out molecular biology research quickly and easily.
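The AUROC can be computed directly from its probabilistic definition: the chance that a randomly chosen positive sample scores higher than a randomly chosen negative one, with ties counted as one half. The Python sketch below uses that pairwise definition; the function name and the use of a gene's expression level as the score are illustrative assumptions.

```python
def auroc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives.

    labels : 1 for diseased, 0 for healthy
    scores : e.g. the expression level of one candidate gene per sample
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))
```

The pairwise form is quadratic in the number of samples, which is acceptable for a sketch; a rank-based formulation gives the same value more efficiently on large cohorts.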

LASSO regression analysis is one of the methods for selecting important variables in a regression model. The complexity of the regression model can be reduced by removing unimportant variables from a large amount of data, and in the present invention methods such as LASSO regression and the SVM (support vector machine) can be used to effectively select candidate genes that are highly associated with a disease.

Next, the method of building a machine learning system model for extracting candidate genes related to a specific disease is as follows.

Using DEG files for diseases whose genetic information is known, a first type of training dataset labeled with the upper boundary (gene data with positive fold change values) and a second type of training dataset labeled with the lower boundary (gene data with negative fold change values) are constructed. Next, a plurality of first models are trained on the first training dataset, and the best-performing of them is used as the model for calculating the upper boundary; likewise, a plurality of second models are trained on the second training dataset, and the best-performing of them is used as the model for calculating the lower boundary (see FIG. 8). The model for calculating the upper boundary and the model for calculating the lower boundary may be different models.
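The data split and model selection described above can be outlined in Python as follows. The patent does not specify the candidate model families, the features, or the performance metric, so in this sketch models are plain callables and the scoring function is supplied by the caller; every name here is hypothetical.

```python
def split_by_fold_change(deg_rows):
    """Split (gene, fold_change) rows into the upper (positive) and lower (negative) sets.

    Rows with a fold change of exactly 0 belong to neither set.
    """
    upper = [(g, fc) for g, fc in deg_rows if fc > 0]
    lower = [(g, fc) for g, fc in deg_rows if fc < 0]
    return upper, lower


def pick_best_model(candidates, dataset, score):
    """Return the name of the candidate with the highest score on the dataset.

    candidates : dict mapping a model name to a trained model (any callable)
    score      : function (model, dataset) -> float, higher is better (assumed metric)
    """
    return max(candidates, key=lambda name: score(candidates[name], dataset))
```

Calling `pick_best_model` once with the upper dataset and once with the lower dataset yields two winners, which may well come from different model families.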

Conventionally, as described above, DEG (differentially expressed gene) analysis has often been performed by distinguishing two sample groups using the general statistical methods of existing NGS analysis (fold change and t-test). With these techniques, however, about 1,000 candidate genes are detected on average, which makes it difficult to selectively identify specific target gene information.

When the machine learning-based medical statistical analysis tool is run according to the present invention, it determines whether the roughly 1,000 to 2,000 gene candidates extracted in step 1 have a clear medical association with the target gene or the related disease group, so that they can be narrowed down to about 100 to 200 valid candidate genes.

Next, in the natural language processing analysis, sentences are structured using SyntaxNet, BERT, or the like, a parse tree that structures each sentence and captures its rules is generated, and part-of-speech (POS) tagging is performed to identify and tag the part of speech of each word in the sentence. A rule-based feature extraction pipeline is then built by traversing the parse tree. Here, a feature may be a subject, verb, object, complement, or the like. A paper information database (DB) is then built by combining the output of the pipeline with labeling (correction of misclassifications made by the pipeline, correlation coefficients between features, direction of the relationship, and so on) (see FIG. 9).
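A rule-based extraction step of this kind might look like the sketch below. It assumes that a parser has already produced, for each token, a dependency label and the index of its head token; the labels `ROOT`, `nsubj`, and `obj`, the token layout, and the placeholder gene names are assumptions of this sketch rather than details given in the specification.

```python
def extract_svo(tokens):
    """Extract (subject, verb, object) triples from one parsed sentence.

    tokens: list of dicts with 'text', 'dep' (dependency label) and
            'head' (index of the head token), as a parser might emit.
    """
    triples = []
    for i, tok in enumerate(tokens):
        if tok["dep"] != "ROOT":
            continue
        # Rule: the subject and object are direct children of the root verb.
        subj = [t["text"] for t in tokens if t["head"] == i and t["dep"] == "nsubj"]
        obj = [t["text"] for t in tokens if t["head"] == i and t["dep"] == "obj"]
        if subj and obj:
            triples.append((subj[0], tok["text"], obj[0]))
    return triples


# Illustrative sentence: "GENE_A activates GENE_B" (placeholder names).
sentence = [
    {"text": "GENE_A", "dep": "nsubj", "head": 1},
    {"text": "activates", "dep": "ROOT", "head": 1},
    {"text": "GENE_B", "dep": "obj", "head": 1},
]
triples = extract_svo(sentence)
```

Each triple could then be stored alongside its manual labeling when the paper information database is built.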

In the present invention, the literature analysis tool backed by the natural language processing-based database excludes, from the 100 to 200 valid candidate genes selected in step 2, those genes for which the relevant findings have already been published, thereby providing about 5 to 20 candidate genes on which molecular biological validation experiments can be performed.
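The exclusion of already-published genes amounts to a filter over the candidate list. In the sketch below, `published` stands in for the result of a lookup in the literature database; how that lookup is performed is not specified in the text, so it is treated here as a given input.

```python
def exclude_published(candidates, published):
    """Drop candidate genes that already appear in the published literature.

    candidates: ordered list of gene identifiers from the previous step
    published:  iterable of gene identifiers found in the literature DB
                (hypothetical lookup result)
    """
    published_set = set(published)
    # Preserve the original ranking of the remaining candidates.
    return [g for g in candidates if g not in published_set]
```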

Next, step 4, in which analysis is performed with the intensive search tool based on gene network analysis, will be described.

The gene network analysis is an analysis method that applies network analysis techniques to gene-gene or gene-disease interactions in order to derive associations among genes or to identify related diseases. For example, to understand disease mechanisms, such as gene function and expression information, G protein signal transduction studied with systems biology, and the mechanisms of immune responses or cancer, network analysis of the numerous interacting elements within a cell can reveal the interrelationships between gene functions and diseases, and can also predict unknown functions of the related genes.

Several studies have demonstrated that gene network analysis is very useful for predicting gene functions. If the function of an unknown gene can be predicted, that gene is very likely to be studied as a new drug target for treating disease, so this analysis is essential when selecting target genes for new drug development.

The network analysis algorithm used for the intensive search based on gene network analysis may be the Guilt-By-Association (GBA) technique, a network connectivity-based prediction technique, an essential gene-based prediction technique, or the like.

The GBA technique predicts a new function of a specific gene from the genes whose functions are already known among all the genes connected to it, under the assumption that genes connected to each other in the network are all functionally closely related.
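One simple way to realize the GBA idea is neighbor voting, sketched below. The adjacency-list representation, the `annotations` mapping, and the majority-vote rule are assumptions chosen for illustration; the specification does not fix a particular scoring rule.

```python
from collections import Counter


def gba_predict(network, annotations, gene):
    """Predict a gene's function from its annotated neighbors.

    network:     dict mapping a gene to a list of directly connected genes
    annotations: dict mapping genes of known function to that function
    Returns the most common function among annotated neighbors, or None.
    """
    votes = Counter(annotations[n] for n in network.get(gene, ())
                    if n in annotations)
    if not votes:
        return None
    return votes.most_common(1)[0][0]
```

A gene with no annotated neighbors yields `None`, reflecting that GBA relies on known functions in the neighborhood.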

The network connectivity-based prediction technique defines clusters of genes using connectivity in the network and predicts the functions of member genes with unknown function from the functions of the known genes that appear most frequently in the same cluster. It resembles the GBA technique, but whereas GBA supports only supervised learning, the network connectivity-based technique also supports unsupervised learning. That is, even when a defined cluster contains no gene of known function at all, patterns in the dataset can be identified to hypothesize that the cluster itself represents some unknown functional module.
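The cluster-based prediction can be sketched as follows. As a deliberate simplification, clusters are taken to be the connected components of the network; a real implementation would likely use a dedicated clustering algorithm, which the specification does not name. All function and variable names are illustrative.

```python
from collections import Counter, deque


def connected_components(edges):
    """Group genes into clusters; here a cluster is a connected component."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    seen, clusters = set(), []
    for start in adj:
        if start in seen:
            continue
        comp, queue = set(), deque([start])
        seen.add(start)
        while queue:  # breadth-first traversal of one component
            node = queue.popleft()
            comp.add(node)
            for nb in adj[node]:
                if nb not in seen:
                    seen.add(nb)
                    queue.append(nb)
        clusters.append(comp)
    return clusters


def predict_by_cluster(edges, annotations):
    """Assign each unannotated gene the most frequent known function in its cluster.

    Genes in a cluster with no annotated member map to None, i.e. the cluster
    is a candidate for an unknown functional module.
    """
    predictions = {}
    for comp in connected_components(edges):
        known = Counter(annotations[g] for g in comp if g in annotations)
        label = known.most_common(1)[0][0] if known else None
        for g in comp:
            if g not in annotations:
                predictions[g] = label
    return predictions
```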

Next, the essential gene-based prediction technique predicts how essential each gene is to the survival of the organism (its essentiality) from the gene's centrality in the gene network. Because network analysis is a kind of graph model, biological hypotheses can be derived using graph theory, and a high correlation between centrality and essentiality has been reported across many network models (Jeong, H., et al., Lethality and centrality in protein networks. Nature, 2001. 411(6833): p. 41-2).
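As an illustration of centrality-based ranking, the sketch below computes degree centrality, one of several possible centrality measures; the specification does not say which measure is used. It assumes an undirected edge list without duplicate edges or self-loops.

```python
def degree_centrality(edges):
    """Degree centrality of each gene: degree divided by (number of genes - 1).

    edges: list of (gene, gene) pairs, assumed undirected and without
           duplicates or self-loops.
    """
    degree = {}
    for a, b in edges:
        degree[a] = degree.get(a, 0) + 1
        degree[b] = degree.get(b, 0) + 1
    n = len(degree)
    if n <= 1:
        return {g: 0.0 for g in degree}
    return {g: d / (n - 1) for g, d in degree.items()}


def rank_by_centrality(edges):
    """Genes sorted from most to least central (ties broken by name)."""
    c = degree_centrality(edges)
    return sorted(c, key=lambda g: (-c[g], g))
```

Genes at the top of the ranking would be the first candidates to examine for essentiality.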

Discovering essential genes is particularly important in pathogenic microorganisms, because the essential genes of a pathogen can provide important target genes for developing new drugs that inhibit it.

Examples of predicting the function of a gene through such gene network analysis are as follows.

When two proteins that interact with each other are found and one of them is identified as a cancer-related protein, the function of the other, Unknown protein is judged likely to be related to cancer (see FIG. 10(a)).

As another example, gene A is more strongly interrelated with other genes than the rest, so its functional importance is predicted to be higher than that of the other genes. In addition, if gene G is already known to be involved in breast cancer, the Unknown-1 and Unknown-2 genes can be inferred to be likely involved in breast cancer as well. Accordingly, gene A is predicted to be likely essential for cell survival, and the Unknown-1 and Unknown-2 genes, although not yet studied, are strong candidates for selection as target genes for new drug development (see FIG. 10(b)).

In the present invention, the GBA prediction model, the network connectivity-based prediction technique, and essential gene prediction are used as gene network analysis algorithms to select target genes for new drug development for two purposes.

First, a group of candidate target genes for new drugs to treat a disease is selected using the GBA prediction model. When a gene with a known function or a known disease association is available, the genes that interact with it are presumed likely to be associated with it. The network connectivity-based prediction technique is then used to predict the functions of novel genes connected to the disease-related gene under investigation, and a list of target genes to be tested with top priority is provided, containing 10 or fewer and preferably about 2 to 3 genes.

Second, a group of candidate target genes for new drugs against pathogenic microorganisms is selected using essential gene prediction. Genes with high centrality are predicted to be critical for survival, and target genes are selected such that new drugs developed against them can be expected to be effective against the pathogenic microorganism.

As an example, when the goal is to discover new gene candidates for drug development from cancer tissue and normal tissue, target genes are selected with the integrated analysis tool according to the present invention as follows. First, the data (FASTQ format) detected by NGS for cancer genes and normal genes are imported into the integrated analysis tool, and DEG (differentially expressed gene) analysis is performed to select about 2,000 candidate genes. The machine learning-based medical statistical analysis tool then reduces the number of candidate genes to about 200. Next, analysis with the natural language-based literature analysis tool leaves about 20 candidate genes, and these are analyzed with the intensive search tool based on gene network analysis to select 10 or fewer target genes (see FIG. 11).

The tool that applies the GBA-based algorithm in the present invention is structured as follows. A web UI/UX is provided so that information from the natural language processing-based database can be used in the DEG analysis and selection workflow, and the tool is linked to the natural language processing-based literature database and the DEG gene database. The literature database and the gene database may each be an external or an internal database. Genes in the DEG gene list that are related to the disease selected by the user are tagged. A clustering algorithm is then applied to select the functions of the Unknown genes connected to the tagged genes, and the results are transmitted to the linked network analysis tool. For user convenience, the analysis output may also be added to the existing PDF-based analysis report.

The network analysis tool is structured as follows. The web page provides a UI/UX for performing network-based node analysis on gene entities. Each gene entity is represented as a network node, and after the number of nodes and the input data are determined, a centrality-based process for selecting essential genes and important genes is carried out. The system is built so that the DEG analysis data and the natural language processed database are linked to each other. The network analysis tool may be designed directly using a network analysis algorithm, or STRING, Cytoscape, Yeastnet, bioPIXE, Wormnet, and the like may be applied.

The integrated analysis tool according to the present invention can perform gene network analysis in two ways using the natural language processing-based database. The first is to select unknown genes. This approach selects candidate genes on the assumption that the less is known about the relationships of a gene and the fewer papers have studied it, the higher its research value. It is therefore preferably used when the aim is to study unexplored genes whose relationships have not yet been defined, or when certain characteristics have been identified. The second is to search for and select candidate genes whose functions or relationships with other genes have been established. In the present invention, relationships between genes are expressed in four broad categories, subdivided into 13 cases. Genes with no relationship are filtered out, and only candidate genes that have a relationship are searched for and selected. The relationship-based classification used for gene selection may be further subdivided or reduced as needed.
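The two selection modes can be sketched as two small filters. The paper-count threshold `max_papers`, the data layouts, and the function names are assumptions of this sketch; the relation type names in the usage below (upregulator, downregulator) follow the categories named in the claims.

```python
def select_understudied(paper_counts, max_papers):
    """Mode 1: keep genes studied in at most max_papers papers, fewest first.

    paper_counts: dict mapping a gene to its number of papers in the
                  literature DB (hypothetical lookup result)
    max_papers:   hypothetical cutoff chosen by the user
    """
    kept = [g for g, n in paper_counts.items() if n <= max_papers]
    return sorted(kept, key=lambda g: (paper_counts[g], g))


def select_related(relations, allowed):
    """Mode 2: keep genes with at least one extracted relation of an allowed type.

    relations: iterable of (gene, relation_type) pairs from the literature DB
    allowed:   set of relation types considered meaningful
    """
    return sorted({gene for gene, rel_type in relations if rel_type in allowed})
```

For instance, `select_related(pairs, {"upregulator", "downregulator"})` would discard genes for which no regulatory relation was extracted.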

In the present invention, the gene network analysis intensive search tool can be used to select target genes either by selecting unknown genes or by analyzing candidate genes whose functions or relationships with other genes have been established.

Step 4, in which the gene network analysis intensive search tool is run on the basis of gene network analysis, proceeds as follows. A gene network is generated from the data of the roughly 5 to 20 candidate genes obtained at the end of the step 3 analysis. One or more clusters are extracted from the generated gene network and each cluster is defined. The functions of the known genes that show highly frequent connectivity within each cluster are used to predict the functions of member genes with unknown function in the same cluster. In this way, about 10 target gene candidates can be derived that are unknown but worth studying, or that are closely related to the function under study.

Next, the invention is described in more detail through the configuration of a web page provided as an embodiment. The web UI/UX-based analysis tool according to the present invention first presents a login screen where the user enters user information to link an account, and after authentication presents a home screen where projects can be created. The user can create a project, upload a gene list file, and import the gene list. The gene list file may be in FASTQ format. The uploaded gene list can then be analyzed automatically with one click and no user-set options, or analyzed manually with the user setting the options for each step.

In one-click automatic analysis without user options, target genes are selected automatically through the steps of analyzing with the machine learning-based medical statistical analysis tool, analyzing with the natural language-based literature analysis tool, analyzing with the intensive search tool based on gene network analysis, and outputting the analysis results. When the analysis finishes, the list of selected target genes is output, and the user can view a report for reviewing the results.

In manual analysis with options set for each step, genes are selected according to the options given at each step, through the steps of analyzing with the machine learning-based medical statistical analysis tool, analyzing with the natural language-based literature analysis tool, analyzing with the intensive search tool based on gene network analysis, and outputting the analysis results. First, in the step of analyzing with the machine learning-based medical statistical analysis tool, a machine learning model can be selected. The machine learning models may include multidimensional scaling (MDS), survival curve analysis, and regression analysis (LASSO, Ridge, Elastic net). Among these, multidimensional scaling (MDS) is an analysis method for removing unnecessary data and dimensions, based on finding a low-dimensional coordinate system that preserves as much as possible the distances between objects in a d-dimensional space.

In addition, the foregoing description of the present invention is illustrative, and a person of ordinary skill in the art to which the invention pertains will understand that it can readily be modified into other specific forms without changing its technical idea or essential features. The embodiments described above should therefore be understood as illustrative in all respects and not restrictive. For example, each component described as a single unit may be implemented in a distributed or divided manner, and likewise components described as distributed or divided may be implemented in combined form within the scope understood by a person skilled in the art. Furthermore, a step of the method may be performed multiple times on its own, or multiple times in combination with at least one other step.

The scope of the present invention is defined by the claims below, and all changes or modifications derived from the meaning and scope of the claims and their equivalents should be construed as falling within the scope of the present invention.

Claims

A method for selecting a target gene using natural language processing, in which terms related to diseases, genes, and proteins are extracted from one or more bio-field document data and the relationships between the terms are extracted for use in target gene selection, the method comprising a natural language processing literature analysis process including:
a first step of, in conjunction with a database (DB) storing abstracts or full texts of bio-field literature, separating the text into sentences and recognizing and extracting entity names; and
a second step of recognizing relationships among the entity names extracted from the sentences of the abstracts or full texts and generating relation events to build a pathway.

The method for selecting a target gene using natural language processing according to claim 1,
wherein the natural language processing literature analysis process further includes
generating a multidimensional index for each location of the extracted entity names and storing it so that it can be viewed from the perspective of each gene disease index.

The method for selecting a target gene using natural language processing according to claim 2,
wherein the multidimensional index includes:
a first dimension selected from Gene, Disease, Drug, Pathway, Chemical, and Protein;
a second dimension selected from gene-gene, gene-disease, gene-drug, gene-pathway, gene-chemical, gene-protein, disease-drug, disease-chemical, disease-protein, drug-pathway, drug-chemical, drug-protein, pathway-chemical, pathway-protein, and chemical-protein relationships; and
a third dimension composed of gene-disease-pathway relationships.

The method for selecting a target gene using natural language processing according to claim 3,
wherein the entity name is at least one of a disease, a drug, a pathway, a chemical, and a protein, and
the relationship between the gene and the entity name is extracted by being classified into upregulator, downregulator, indirect-upregulator, and indirect-downregulator.

The method for selecting a target gene using natural language processing according to claim 1,
wherein the natural language processing literature analysis process includes
searching and filtering based on keywords of the bio-field literature, or searching and filtering based on detailed information of the bio-field literature including Impact Factor or Hit Count.

The method for selecting a target gene using natural language processing according to claim 1, further comprising at least one of:
a step of preparing FASTQ data for a nucleotide sequence;
a step of performing NGS (next generation sequencing)/DEG (differentially expressed gene) analysis on the prepared nucleotide sequence;
a medical statistical analysis step of determining, based on machine learning, the effectiveness between a gene and a specific disease and extracting candidate genes related to the specific disease; and
a gene network analysis step of selecting genes through gene network analysis.

A system constructed using the target gene selection method using natural language processing according to any one of claims 1 to 6.

A computer program stored in a computer-readable recording medium to execute the target gene selection method using natural language processing according to any one of claims 1 to 6.