KR20200106643A

KR20200106643A - High sensitive genetic variation detection and reporting system based on barcode sequence information

Info

Publication number: KR20200106643A
Application number: KR1020190025109A
Authority: KR
Inventors: 이경표; 형기은; 김형용; 정은미; 강병철; 최남우
Original assignee: (주)인실리코젠
Priority date: 2019-03-05
Filing date: 2019-03-05
Publication date: 2020-09-15

Abstract

The present invention relates to a high-sensitivity genetic variation detection system based on barcode sequence information.
The system includes a digital sequencing analyzer that performs digital sequencing on patient DNA obtained by liquid biopsy to detect and analyze genetic variants, and a service server that consults public databases about the variants detected by the analyzer and generates and provides a clinical support report related to those variants.
According to the present invention, liquid biopsy NGS data obtained by digital sequencing can be analyzed to detect genetic variants, and the results can be delivered as a report in a form that medical professionals can refer to for clinical decision making.

Description

High sensitivity genetic variation detection and reporting system based on barcode sequence information {HIGH SENSITIVE GENETIC VARIATION DETECTION AND REPORTING SYSTEM BASED ON BARCODE SEQUENCE INFORMATION}

The present invention relates to a high-sensitivity genetic variation detection system based on barcode sequence information. More specifically, it relates to a system that analyzes liquid biopsy NGS data obtained by digital sequencing to detect genetic variants and delivers the results as a report in a form that medical professionals can refer to for clinical decision making.

Important disease-related information can be obtained from the DNA sequence of an organism, and in recent years next-generation sequencing (NGS) has been used increasingly not only in variant research but also in clinical practice.

With the introduction of next-generation sequencing, testing has moved away from conventional, experiment-dependent methods, in which probes are designed directly for a handful of genes or a single marker is used to confirm a variant, toward examining many genes and many types of variants associated with the disease of interest.

In general, samples for confirming variants in cancer-related target genes are collected by biopsy, in which tissue is removed to obtain DNA. Because this is a physically invasive procedure that can cause the patient suffering such as bleeding, inflammatory reactions, and metastasis, attention has recently turned to liquid biopsy, a non-invasive method of collecting DNA for variant testing.

Liquid biopsy has further advantages. When periodic sampling is needed to test for drug resistance after a prescription, a non-invasive method reduces risk and pain for the patient. Studies have also reported that it allows earlier detection than imaging and tissue biopsy, which could only yield a diagnosis once the disease had already progressed to some degree.

Tissue collected by biopsy shows the variant frequency within a local tissue. With liquid biopsy, or cell-free DNA (cfDNA), variants in the target genes must be identified against the background of all DNA in the body, so the variant frequency is inevitably lower. Such low-frequency variants fall within the error rate of next-generation sequencing instruments (0.1% to 1%).

To detect true variants beyond this error-rate limit, enough data must be produced to demonstrate that an observation is a real variant rather than an error. This calls for deep sequencing, which produces a large number of reads.
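
As a rough illustration of why depth and error rate matter, the following sketch uses a simple binomial model that is not part of the original disclosure. It assumes sequencing errors are independent and occur at a uniform per-base rate; the depth, error rate, and read threshold are illustrative values only, and the function name `prob_at_least` is hypothetical.

```python
from math import comb


def prob_at_least(k: int, n: int, p: float) -> float:
    """Return P(X >= k) for X ~ Binomial(n, p)."""
    below_k = sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k))
    return 1.0 - below_k


# Illustrative values (assumptions): 1,000 reads cover the site and the
# per-base error rate is 0.1%, the low end of the range cited above.
depth = 1000
error_rate = 0.001

# Chance that errors alone yield at least 5 non-reference reads, the number
# a true variant present at 0.5% would be expected to contribute at this depth.
p_false = prob_at_least(5, depth, error_rate)
print(f"P(>=5 error reads at depth {depth}) = {p_false:.4f}")
```

Under this simplified model the probability is well below 1% at an error rate of 0.1%, whereas the same call with an error rate of 1% shows that five non-reference reads would be unremarkable. Raw read counts alone therefore cannot reliably separate a low-frequency variant from noise near the instrument error rate.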

However, deep sequencing raises production cost and therefore the cost of diagnostic testing. It is also difficult to determine whether a detected variant is an error introduced during PCR (polymerase chain reaction) or sequencing or a variant that actually exists, which can lead to inaccurate results in quantitative analysis.

In digital sequencing, a unique random sequence of about 12 bp called a UMI (unique molecular index) is attached to each read before PCR amplification, and only UMIs are counted during quantitative analysis, which makes it possible to avoid sequencing errors.
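
The idea of counting UMIs rather than reads can be sketched as follows. This is a minimal illustration, not the analyzer's actual implementation: the `(umi, base)` input layout, the function name `count_molecules`, and the example UMI strings are all assumptions made for the example.

```python
from collections import defaultdict


def count_molecules(reads):
    """Count distinct UMIs supporting each base observed at one site.

    `reads` is an iterable of (umi, base) pairs (hypothetical layout).
    Reads that share a UMI are treated as PCR copies of one molecule.
    """
    umis_per_base = defaultdict(set)
    for umi, base in reads:
        umis_per_base[base].add(umi)
    return {base: len(umis) for base, umis in umis_per_base.items()}


# Five reads, but only two original molecules (two distinct 12 bp UMIs).
reads = [
    ("AAACCCGGGTTT", "A"),
    ("AAACCCGGGTTT", "A"),
    ("AAACCCGGGTTT", "A"),
    ("TTTGGGCCCAAA", "G"),
    ("TTTGGGCCCAAA", "G"),
]
molecule_counts = count_molecules(reads)
print(molecule_counts)
```

Counting by read would give a 3:2 ratio that reflects uneven amplification, while counting by UMI gives one molecule for each base.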

A dedicated solution is therefore needed to analyze liquid biopsy NGS data obtained by digital sequencing. Such a solution should cover registering patient information, uploading NGS data, evaluating data quality (quality control), detecting genetic variants, and delivering the results as a report in a form that medical professionals can refer to for clinical decision making. To date, no solution with all of these functions has been available.

Korean Laid-Open Patent Publication No. 10-2016-0020400 (published February 23, 2016; title: Method for predicting single-gene genetic variation in a fetus using maternal serum DNA)
Korean Laid-Open Patent Publication No. 10-2014-0023847 (published February 27, 2014; title: Non-invasive detection of fetal genetic abnormality)

The technical problem addressed by the present invention is to provide a high-sensitivity genetic variation detection system based on barcode sequence information that analyzes liquid biopsy NGS data obtained by digital sequencing to detect genetic variants and delivers the results as a report in a form that medical professionals can refer to for clinical decision making.

A further technical problem is, when digital next-generation sequencing (Digital NGS) data using barcode sequences has been produced with an analysis panel for liquid biopsy, to mitigate the detection limit and suppress NGS noise at the level of the analysis algorithm by forming clusters of reads that share barcode sequence information.

A further technical problem is to design an analysis pipeline that, unlike read analysis in conventional next-generation sequencing, takes into account reads containing barcode sequence information, thereby retaining the advantages of liquid biopsy while compensating for the limited variant-detection sensitivity of next-generation sequencing.

A further technical problem is to move beyond conventional biopsy-based analysis and, from the patient's standpoint, provide cancer diagnosis or customized diagnostic solutions using genetic markers through a system package that incorporates a new analysis technique for more convenient and safer non-invasive next-generation sequencing.

A further technical problem is to improve portability by designing a system that can analyze both single-end and paired-end read data generated on different NGS sequencing platforms, namely Ion Torrent and Illumina, and to let users who are unfamiliar with specialized NGS analysis easily register samples and patients, run analyses, and retrieve reports through a web-based graphical user interface (GUI).

A further technical problem is to generate a genetic variation report that includes clinically applicable variant annotation, so that medical professionals can use it as supporting evidence when choosing drugs or therapy for variant-related diseases.

To solve these technical problems, the high-sensitivity genetic variation detection system based on barcode sequence information according to the present invention includes a digital sequencing analyzer that performs digital sequencing on patient DNA obtained by liquid biopsy to detect and analyze genetic variants, and a service server that consults public databases about the genetic variants detected by the digital sequencing analyzer and generates and provides a clinical support report related to those variants.

In the system according to the present invention, the service server includes: a patient information management unit that receives and manages patient information; an NGS data management unit that uploads NGS data, which is the result of NGS (next-generation sequencing) analysis of the patient DNA, evaluates the quality of the NGS data (quality control), and, when the quality meets the acceptance criteria, passes the NGS data to the digital sequencing analyzer to request detection and analysis of genetic variants in the patient DNA; an information collection unit that receives information on the genetic variants of the patient DNA from the digital sequencing analyzer and collects detailed descriptive information on those variants by consulting a variant information DB published on the Internet; and a report management unit that processes the detailed descriptive information collected by the information collection unit into a predefined format to generate and provide a clinical support report related to the genetic variants.

In the system according to the present invention, the clinical support report includes information on the patient who provided the DNA, information identifying the DNA sample, and the variant frequency of each genetic variant relative to the reference sequence defined in HG19 or HG38 (the fraction of all unique barcodes that contain the variant; Variant MT (barcode) fraction, VMF).

In the system according to the present invention, the report management unit presents, through the clinical support report, the association between each genetic variant and phenotype, divided into a plurality of categories on the basis of clinical significance (pathogenicity) and activity and risk (actionability), in accordance with the guidelines presented in the variant information DB.

In the system according to the present invention, the digital sequencing analyzer performs: a read generation step of performing digital sequencing on the patient DNA to generate reads that contain barcodes; a trimming step of removing adapter sequences and low-quality reads from the reads; a reference mapping step of aligning the remaining reads to a reference sequence, which is the representative sequence of the target species; a clustering step of searching for and defining barcode sequences based on the mapping information of the reads and clustering the reads by the defined barcode sequences; and a variant detection step of detecting genetic variants within reads that share the same barcode among the clustered reads.

According to the present invention, a high-sensitivity genetic variation detection system based on barcode sequence information is provided that analyzes liquid biopsy NGS data obtained by digital sequencing to detect genetic variants and delivers the results as a report in a form that medical professionals can refer to for clinical decision making.

In addition, when digital next-generation sequencing (Digital NGS) data using barcode sequences has been produced with an analysis panel for liquid biopsy, the detection limit can be mitigated and NGS noise suppressed at the level of the analysis algorithm by forming clusters of reads that share barcode sequence information.

In addition, by designing an analysis pipeline that, unlike read analysis in conventional next-generation sequencing, takes into account reads containing barcode sequence information, the advantages of liquid biopsy can be retained while its weakness, variant-detection sensitivity, is compensated.

In addition, moving beyond conventional biopsy-based analysis and taking the patient's standpoint, cancer diagnosis or customized diagnostic solutions using genetic markers can be provided through a system package that incorporates a new analysis technique for more convenient and safer non-invasive next-generation sequencing.

In addition, portability is improved by a system that can analyze both single-end and paired-end read data generated on different NGS sequencing platforms, namely Ion Torrent and Illumina, and users who are unfamiliar with specialized NGS analysis can easily register samples and patients, run analyses, and retrieve reports through a web-based graphical user interface (GUI).

In addition, a genetic variation report that includes clinically applicable variant annotation can be generated, so that medical professionals can use it as supporting evidence when choosing drugs or therapy for variant-related diseases.

FIG. 1 is a diagram showing a high-sensitivity genetic variation detection system based on barcode sequence information according to an embodiment of the present invention.
FIG. 2 is a diagram illustrating an exemplary operation of the high-sensitivity genetic variation detection system based on barcode sequence information according to an embodiment of the present invention.
FIG. 3 is a diagram conceptually illustrating the detection of errors within reads using barcode sequence information in an embodiment of the present invention.
FIG. 4 is a diagram showing an example of a sample registration screen in an embodiment of the present invention.
FIG. 5 is a diagram showing an example of a sample data upload screen in an embodiment of the present invention.
FIG. 6 is a diagram showing an example of a screen listing received samples and analysis status in an embodiment of the present invention.
FIG. 7 is a diagram showing an example of an analysis information and report download screen in an embodiment of the present invention.
FIG. 8 is a diagram showing an example of a clinical support report screen, which is the output of the genetic variation analysis, in an embodiment of the present invention.

The specific structural or functional descriptions of the embodiments according to the concept of the present invention disclosed in this specification are given only for the purpose of describing those embodiments. Embodiments according to the concept of the present invention may be implemented in various forms and are not limited to the embodiments described herein.

Because embodiments according to the concept of the present invention may be modified in various ways and may take various forms, the embodiments are illustrated in the drawings and described in detail in this specification. This is not intended to limit the embodiments to the specific forms disclosed; they include all modifications, equivalents, and substitutes that fall within the spirit and technical scope of the present invention.

Terms such as first and second may be used to describe various components, but the components should not be limited by these terms. The terms serve only to distinguish one component from another. For example, without departing from the scope of rights according to the concept of the present invention, a first component may be called a second component and, similarly, a second component may be called a first component.

When a component is said to be "connected" or "coupled" to another component, it may be directly connected or coupled to that component, but it should be understood that intervening components may be present. In contrast, when a component is said to be "directly connected" or "directly coupled" to another component, it should be understood that no intervening component is present. Other expressions describing relationships between components, such as "between" and "directly between" or "adjacent to" and "directly adjacent to", should be interpreted in the same way.

The terms used in this specification are used only to describe specific embodiments and are not intended to limit the present invention. Singular expressions include plural expressions unless the context clearly indicates otherwise. In this specification, terms such as "include" or "have" specify the presence of the stated features, numbers, steps, operations, components, parts, or combinations thereof, and do not preclude the presence or addition of one or more other features, numbers, steps, operations, components, parts, or combinations thereof.

Unless otherwise defined, all terms used herein, including technical and scientific terms, have the same meaning as commonly understood by a person of ordinary skill in the art to which the present invention belongs. Terms defined in commonly used dictionaries should be interpreted as having a meaning consistent with their meaning in the context of the related art, and should not be interpreted in an idealized or excessively formal sense unless expressly so defined in this specification.

Hereinafter, preferred embodiments of the present invention will be described in detail with reference to the accompanying drawings.

Referring to FIGS. 1 to 8, the high-sensitivity genetic variation detection system based on barcode sequence information according to an embodiment of the present invention includes a digital sequencing analyzer 20 and a service server 30.

The digital sequencing analyzer 20 performs digital sequencing on patient DNA obtained by liquid biopsy and detects and analyzes genetic variants.

For example, the digital sequencing analyzer 20 may be configured to perform: a read generation step of performing digital sequencing on the patient DNA to generate reads that contain barcodes; a trimming step of removing adapter sequences and low-quality reads from the reads; a reference mapping step of aligning the remaining reads to a reference sequence, which is the representative sequence of the target species; a clustering step of searching for and defining barcode sequences based on the mapping information of the reads and clustering the reads by the defined barcode sequences; and a variant detection step of detecting genetic variants within reads that share the same barcode among the clustered reads.
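
The clustering and variant detection steps can be sketched as follows. This is a simplified illustration under stated assumptions rather than the analyzer's actual algorithm: reads are assumed to be already trimmed, aligned to the same reference window, equal in length, and paired with an extracted barcode. The function names, the 0.6 agreement threshold, and the minimum family size of 2 are hypothetical choices for the example.

```python
from collections import Counter, defaultdict


def cluster_by_barcode(reads):
    """Group aligned read sequences by their barcode."""
    clusters = defaultdict(list)
    for barcode, sequence in reads:
        clusters[barcode].append(sequence)
    return dict(clusters)


def consensus(sequences, min_agreement=0.6):
    """Column-wise majority vote; emit 'N' where agreement is too low."""
    result = []
    for column in zip(*sequences):
        base, count = Counter(column).most_common(1)[0]
        result.append(base if count / len(column) >= min_agreement else "N")
    return "".join(result)


def call_variants(reads, reference, min_family_size=2):
    """Return {barcode: [(position, ref_base, alt_base), ...]}.

    Positions are 0-based offsets into `reference`. Barcode families
    smaller than `min_family_size` are skipped because a lone read
    cannot be checked against its PCR copies.
    """
    calls = {}
    for barcode, sequences in cluster_by_barcode(reads).items():
        if len(sequences) < min_family_size:
            continue
        cons = consensus(sequences)
        calls[barcode] = [
            (pos, ref, alt)
            for pos, (ref, alt) in enumerate(zip(reference, cons))
            if alt != "N" and alt != ref
        ]
    return calls


reference = "ACGTAC"
reads = [
    ("BC1", "ACGTAC"), ("BC1", "ACGTAC"), ("BC1", "ACCTAC"),  # one read with an error
    ("BC2", "ACGAAC"), ("BC2", "ACGAAC"), ("BC2", "ACGAAC"),  # consistent variant
    ("BC3", "TCGTAC"),                                        # singleton family
]
calls = call_variants(reads, reference)
print(calls)
```

In this example the mismatch seen in only one of the three BC1 reads is outvoted by the other copies, while the T-to-A change shared by every BC2 read is reported as a variant.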

The service server 30 consults public databases about the genetic variants detected by the digital sequencing analyzer 20 and generates and provides a clinical support report related to those variants.

For example, the service server 30 may include a patient information management unit 310, an NGS data management unit 320, an information collection unit 330, and a report management unit 340.

The patient information management unit 310 receives and manages patient information.

The NGS data management unit 320 uploads NGS data, which is the result of NGS (next-generation sequencing) analysis of the patient DNA, and evaluates the quality of the NGS data (quality control). When the quality of the NGS data meets the acceptance criteria, it passes the NGS data to the digital sequencing analyzer 20 and requests detection and analysis of genetic variants in the patient DNA.
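
A quality gate of this kind could look like the sketch below. The specification does not state the acceptance criteria, so the metric used here (the fraction of bases at or above Q30) and the 0.8 threshold are assumptions chosen only for illustration, as are the function names. The quality strings are assumed to use the common Phred+33 FASTQ encoding.

```python
def phred_scores(quality_string, offset=33):
    """Decode a FASTQ quality string (Phred+33 assumed) into integer scores."""
    return [ord(ch) - offset for ch in quality_string]


def fraction_at_least(quality_strings, threshold=30):
    """Fraction of all bases whose Phred score is >= threshold."""
    scores = [q for s in quality_strings for q in phred_scores(s)]
    if not scores:
        return 0.0
    return sum(q >= threshold for q in scores) / len(scores)


def passes_qc(quality_strings, min_q30_fraction=0.8):
    """Hypothetical acceptance rule: forward data only if enough bases are Q30+."""
    return fraction_at_least(quality_strings) >= min_q30_fraction


good_run = ["IIII", "IIII"]   # 'I' encodes Q40
mixed_run = ["IIII", "!!!!"]  # '!' encodes Q0
print(passes_qc(good_run), passes_qc(mixed_run))
```

Only data for which `passes_qc` returns true would be handed on to the analyzer in this sketch; a real deployment would likely combine several metrics.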

The information collection unit 330 receives information on the genetic variants of the patient DNA from the digital sequencing analyzer 20 and collects detailed descriptive information on those variants by consulting the variant information DB 40 published on the Internet.

The report management unit 340 processes the detailed descriptive information on the genetic variants collected by the information collection unit 330 into a predefined format to generate and provide a clinical support report related to the genetic variants.

For example, the clinical support report generated and provided by the report management unit 340 may include information on the patient who provided the DNA (name, date of birth, and so on), information identifying the DNA sample (sample ID, sample type, registration date, analysis date, and so on), and the variant frequency of each genetic variant relative to the reference sequence defined in HG19 or HG38 (the fraction of all unique barcodes that contain the variant; Variant MT (barcode) fraction, VMF).
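
The VMF defined above is a simple ratio, sketched here for a single site. The input layout (one consensus base per unique barcode) and the function name are assumptions made for the example.

```python
def variant_barcode_fraction(barcode_calls, variant_allele):
    """Compute VMF at one site.

    `barcode_calls` maps each unique barcode to the consensus base its
    reads show at the site (hypothetical layout). VMF is the number of
    barcodes carrying `variant_allele` divided by all unique barcodes.
    """
    total = len(barcode_calls)
    if total == 0:
        return 0.0
    with_variant = sum(1 for base in barcode_calls.values() if base == variant_allele)
    return with_variant / total


# Four unique barcodes cover the site; one carries the variant base T.
calls_at_site = {"BC1": "A", "BC2": "A", "BC3": "T", "BC4": "A"}
vmf = variant_barcode_fraction(calls_at_site, "T")
print(f"VMF = {vmf:.2f}")
```

Because each barcode is counted once regardless of how many reads it produced, the ratio reflects original molecules rather than amplified copies.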

For example, the report management unit 340 may be configured to present, through the clinical support report, the association between a genetic variation and phenotype, divided into a plurality of classes based on clinical significance (pathogenicity) and on activity and risk (actionability), following the guidelines presented in the variation information DB 40. As a specific example, to interpret the clinical meaning of a genetic variation, variants can be divided into the following five classes according to a standard guideline issued by a specific organization (for example, the ACMG guideline, https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4544753/), based on clinical significance (pathogenicity) and on activity and risk (actionability).

1. Pathogenic - strong clinical significance and activity

2. Likely pathogenic - potential clinical significance and activity

3. Likely benign - variant that is likely to be benign

4. Benign - benign variant

5. Uncertain significance - variant without sufficient evidence to be placed in any of the four classes above
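
As an illustration only, the five classes above could be represented in software as a small enumeration. The names `VariantClass` and `parse_variant_class` are hypothetical and do not appear in the specification; the sketch simply mirrors the labels and their order as listed.

```python
from enum import Enum


class VariantClass(Enum):
    """The five classes, in the order given in the text (hypothetical type)."""
    PATHOGENIC = "Pathogenic"
    LIKELY_PATHOGENIC = "Likely pathogenic"
    LIKELY_BENIGN = "Likely benign"
    BENIGN = "Benign"
    UNCERTAIN_SIGNIFICANCE = "Uncertain significance"


def parse_variant_class(label: str) -> VariantClass:
    """Map a free-text label to a class, ignoring case and surrounding spaces."""
    normalized = label.strip().lower()
    for cls in VariantClass:
        if cls.value.lower() == normalized:
            return cls
    raise ValueError(f"unknown variant class: {label!r}")
```

A report generator could use such a type to reject labels that fall outside the five classes before they reach the clinical support report.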

The user terminal 10 is a means of accessing the service server 30 and may be, for example, a terminal used by a physician or similar user.

The detailed configuration of a highly sensitive genetic variation detection system based on barcode sequence information according to an embodiment of the present invention is described below by way of example.

1. Configuration and features

The highly sensitive genetic variation detection system based on barcode sequence information according to an embodiment of the present invention extends conventional next-generation sequencing by additionally using adapter sequences that contain barcode sequence information. It preprocesses reads that contain barcode sequence information and detects variants based on that information. This makes it possible to identify amplification errors and sequencing errors that can arise between library preparation and sequence determination, and thereby to overcome the detection limit that is a drawback of liquid biopsy. An adapter sequence is a short DNA sequence of known composition that is attached to both ends of the target DNA. A primer sequence binds complementarily to part of the adapter sequence, and sequencing proceeds from there.

The highly sensitive genetic variation detection system based on barcode sequence information according to an embodiment of the present invention includes two kinds of components. The first is a specialized genomic bioinformatics analysis pipeline, which judges whether a variant is genuine, calculates the allele frequency of the variant, and links to a DB containing information on detected variants. The second is a management and reporting system, namely the service server 30. Once patient information has been registered and the DNA has been run on the sequencer, the service server 30 uploads the raw data generated by the sequencer to the cloud analysis environment and, after the pipeline's series of bioinformatics analyses, generates a clinically applicable variant report in PDF format.

2. Configuration of the highly sensitive genetic variation detection system based on barcode sequence information

The highly sensitive genetic variation detection system based on barcode sequence information according to an embodiment of the present invention can be divided into two major components: the digital sequencing analyzer 20, which performs genetic variation detection and analysis, and the service server 30, which performs sample registration, analysis processing, report management, and the like.

A. Genetic variation detection and analysis process

The genetic variation detection and analysis process of the highly sensitive genetic variation detection system based on barcode sequence information may be performed according to the flow illustrated in FIG. 2. To improve the sensitivity of liquid biopsy, the analysis method may include a barcode sequence information clustering (barcode clustering) step, which distinguishes it from typical WGS (Whole Genome Sequencing) or WES (Whole Exome Sequencing) analysis.

[Summary of the overall process]

In step (1), digital sequencing is performed to generate reads that contain barcode sequence information (S160).

Specifically, in step (1), the DNA extracted from the liquid biopsy goes through DNA library preparation (library prep), in which only the target genes are isolated, the sequences required for next-generation sequencing are attached, and many copies of the target genes are produced by amplification. When the target genes are then loaded into the sequencer, bases complementary to the target genes are incorporated one after another and the sequences of short fragments are determined. The short sequences generated this way are called reads, and the sequencer stores the base sequence of each generated read.

In step (2), high-quality reads are selected by removing adapter sequences and low-quality reads (S170).

Specifically, in step (2), each read carries quality information, and reads that were not read cleanly have lower quality. Because such reads can introduce sequencing errors, they are removed in advance. In addition, because only the sequence information of the target gene is needed in this step, the adapter sequences attached during library preparation are removed as well.
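
The read selection in step (2) can be sketched as follows. This is a minimal illustration, not the filter used in the embodiment: it assumes Phred+33 quality strings, an exact-match adapter search, and a mean-quality threshold of 20, none of which are stated in the specification. All function names are hypothetical.

```python
def mean_phred(quality: str, offset: int = 33) -> float:
    """Mean Phred score of a FASTQ quality string (Phred+33 assumed)."""
    if not quality:
        return 0.0
    return sum(ord(ch) - offset for ch in quality) / len(quality)


def trim_adapter(seq: str, qual: str, adapter: str):
    """Cut the read at the first exact occurrence of the adapter, if any."""
    idx = seq.find(adapter)
    if idx == -1:
        return seq, qual
    return seq[:idx], qual[:idx]


def select_high_quality(reads, adapter: str, min_mean_q: float = 20.0, min_len: int = 1):
    """Yield (name, seq, qual) for reads that survive trimming and filtering.

    `reads` is an iterable of (name, seq, qual) tuples. The thresholds are
    illustrative assumptions, not values taken from the specification.
    """
    for name, seq, qual in reads:
        seq, qual = trim_adapter(seq, qual, adapter)
        if len(seq) >= min_len and mean_phred(qual) >= min_mean_q:
            yield name, seq, qual
```

Production trimmers also handle partial adapter matches and per-base quality trimming, which this sketch leaves out for brevity.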

In step (3), the reads containing barcode sequence information are mapped to a reference sequence, which can be regarded as the representative sequence of the target species (S180).

Specifically, in step (3), the reads carrying only the target gene sequence are mapped to the reference sequence. Mapping can be performed with BWA-mem, a sequence alignment tool that is fast and accurate with a relatively low error rate. Once mapping is complete, multiple reads typically cover the desired target sequence, and the presence or absence of variants can be checked from that data. However, if a read contains many repeats, or if a similar sequence exists elsewhere in the reference sequence, the read may be mapped to the wrong location. This yields incorrect variant information and raises the error rate of variant calling.

As illustrated in FIG. 3, reads produced by digital sequencing contain a barcode sequence that is unrelated to the genetic information of the experimental sample. It is an ID sequence attached when the library is prepared for the genomic DNA or target sequence containing the variant region of interest, before NGS data is produced. With barcode sequences available, reads sharing the same barcode can be examined, either when amplification occurs during NGS sequencing or after mapping, so that misincorporated bases are treated as sequencing errors and removed, and reads that were mapped from the wrong location are removed along with them.

In step (4), barcode sequences are located and defined based on the mapping information of the reads, and the reads are clustered by the defined barcode sequences (S190, S200, S210).

Specifically, in the variant analysis of step (4), comparing reads that share the same barcode sequence makes it possible to judge whether a detected variant results from a PCR or sequencing error or is a genuinely present variant. Variant analysis therefore requires that, after mapping, the barcode sequence be extracted from every read and that reads with the same barcode be clustered together. This processing can be further divided into four sub-steps.

(4-1) The position of the barcode sequence within a read is predicted from the FLAG, POS, and CIGAR columns of the bam format (https://samtools.github.io/hts-specs/SAMv1.pdf) file generated by reference mapping. The FLAG column describes whether the read is mapped and its orientation, the POS column describes where on the reference sequence the read is mapped, and the CIGAR column describes how the read is aligned.

(4-2) Assuming that the barcode sequence is located at the front of the read, the position where mapping of the read begins is identified from the POS and CIGAR columns. The positions extending 12 bp toward the front of the read from the mapping start are predicted to be the barcode sequence (the barcode consists of 12 random bases). The FLAG column is used at this point to determine whether the read is in the reverse orientation.

(4-3) Once the barcode position has been identified in each read, the barcode sequence is extracted and the reads are grouped by the extracted barcode. This step is based on a published clustering method (see Reference 2 in the literature list: Peng et al., BMC Genomics, 2015), with the sequence handling and other details optimized for the characteristics of the data.

(4-4) After clustering, a new bam file is generated by rewriting the read IDs of the existing bam file so that they carry the barcode sequence information.
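
Sub-steps (4-1) to (4-4) can be sketched on plain SAM fields as follows. This is a simplified illustration under several assumptions that the specification does not state: the barcode is taken to be soft-clipped at the 5' end of the read; for reverse-strand reads (FLAG bit 0x10) it is taken from the trailing soft clip of SEQ and reverse-complemented; grouping uses exact barcode matches rather than the optimized method derived from Peng et al.; and the `name:barcode` read ID format is hypothetical. POS gives the reference coordinate of the first aligned base and is not needed to find the in-read offset here.

```python
import re
from collections import defaultdict

BARCODE_LEN = 12  # the text states that the barcode is 12 random bases
CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def revcomp(seq: str) -> str:
    """Reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def extract_barcode(flag: int, cigar: str, seq: str):
    """Return the 12 bases adjacent to the aligned part of the read, or None."""
    if flag & 0x4 or cigar == "*":  # unmapped read or missing CIGAR
        return None
    # Hard-clipped bases are absent from SEQ, so they are ignored.
    ops = [(int(n), op) for n, op in CIGAR_RE.findall(cigar) if op != "H"]
    if not ops:
        return None
    if flag & 0x10:
        # Reverse strand: the original 5' end is at the right end of SEQ.
        n, op = ops[-1]
        if op != "S" or n < BARCODE_LEN:
            return None
        start = len(seq) - n
        return revcomp(seq[start:start + BARCODE_LEN])
    n, op = ops[0]
    if op != "S" or n < BARCODE_LEN:
        return None
    return seq[n - BARCODE_LEN:n]


def cluster_and_tag(records):
    """Group read names by barcode and build barcode-tagged read IDs.

    `records` is an iterable of (qname, flag, cigar, seq) tuples.
    """
    clusters = defaultdict(list)
    renamed = {}
    for qname, flag, cigar, seq in records:
        barcode = extract_barcode(flag, cigar, seq)
        if barcode is None:
            continue
        clusters[barcode].append(qname)
        renamed[qname] = f"{qname}:{barcode}"  # hypothetical ID format
    return dict(clusters), renamed
```

Writing the renamed IDs back into a new bam file would require a BAM library and is omitted from this sketch.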

In step (5), a barcode-aware variant caller is used to detect variants within reads that share the same barcode sequence (S220).

Specifically, in step (5), an embodiment of the present invention uses a variant analysis module based on barcode sequence information (barcode-aware variant calling).

General variant analysis pipelines are not optimized for data generated by digital sequencing. In an embodiment of the present invention, the variant analysis pipeline was therefore built on smCounter, a molecular barcode-aware variant caller that can use barcode sequence information for variant analysis.

Because smCounter works only with paired-end reads, it was extended to work with single-end reads as well.

Reads within the target region are retrieved using the region's alignment information, reads with the same barcode sequence are gathered to calculate barcode depth (MT depth), quality, pi (prediction index), and so on, and the variants finally identified are written to a VCF format file.
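
The idea of gathering reads by barcode before calling can be illustrated with a simple majority vote at a single reference position. This is not smCounter's statistical model; it is a deliberately reduced sketch, and the function name and both thresholds (`min_reads`, `min_agreement`) are assumptions made for illustration.

```python
from collections import Counter, defaultdict


def barcode_consensus(observations, min_reads: int = 2, min_agreement: float = 0.8):
    """Consensus base per barcode at one reference position.

    `observations` is an iterable of (barcode, base) pairs, one per read
    covering the position. A barcode contributes a consensus base only if
    it has at least `min_reads` reads and the most common base reaches
    `min_agreement` of them. Both thresholds are illustrative.
    """
    by_barcode = defaultdict(Counter)
    for barcode, base in observations:
        by_barcode[barcode][base] += 1

    consensus = {}
    for barcode, counts in by_barcode.items():
        total = sum(counts.values())
        base, n = counts.most_common(1)[0]
        if total >= min_reads and n / total >= min_agreement:
            consensus[barcode] = base
    return consensus
```

Under this sketch the barcode depth at the position is `len(consensus)`, and a base seen in only a minority of reads within one barcode group is discarded as a likely PCR or sequencing error.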

In step (6), variant description information is looked up in public databases for the detected variants.

Specifically, in step (6), variant annotation information is attached by referring to genetic variation DBs published on the Internet.

For the variants identified in the sample, only the information that can be used clinically is extracted from genetic variation DBs containing useful variant information (for example, ClinVar, OMIM, dbSNP, mycancergenome) to generate a primary result file.

Data that can be used include, but are not limited to, the ClinVar variant classification, NM ID, gDNA ID, the dbSNP rs number, mycancergenome therapy information, and COSMIC prediction and drug resistance.

B. Sample registration, analysis processing, and report management

As illustrated in FIGS. 4 to 8, the service server 30 is responsible for the structure and management functions of the entire system other than genetic variation analysis, and for outputting results in report form. It provides four main functions.

For example, referring to FIG. 4, the service server 30 may support patient and sample registration.

Patient information can be entered directly one record at a time, or registered in bulk from a file that follows a prescribed Excel format.

For example, referring to FIG. 5, the service server 30 may support automatic transfer of the generated next-generation sequencing data to the cloud analysis server.

Because NGS result files are large (0.3 to 2 GB per sample), a dedicated large-file transfer function is provided. It is divided into a “Local” function for cases where the instrument is on the local network and “GridFTP, S3” functions for cases where it is not. Registering a specific directory starts the file transfer automatically.

For example, referring to FIGS. 6 and 7, the service server 30 supports checking the progress of variant analysis and managing analysis cases.

When the liquid biopsy NGS data of a registered patient has been uploaded, the sample status changes to a state in which QC analysis can be run. After QC analysis, if the NGS data is confirmed to be of adequate quality, the status changes to a state in which digital sequencing variant analysis can be run, and the analysis can be performed. When the analysis is complete, the result report can be downloaded from the detail view.
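
The status changes described above behave like a small state machine, which can be sketched as follows. The state and event names are hypothetical, and the `QC_FAILED` state is an added assumption: the specification does not say what happens to a sample whose data fails QC.

```python
from enum import Enum


class SampleStatus(Enum):
    REGISTERED = "registered"          # patient and sample registered
    QC_READY = "qc_ready"              # upload finished, QC can be run
    QC_FAILED = "qc_failed"            # assumption: not described in the text
    ANALYSIS_READY = "analysis_ready"  # QC passed, variant analysis can be run
    REPORT_READY = "report_ready"      # analysis finished, report downloadable


# Allowed (current status, event) pairs and the status they lead to.
TRANSITIONS = {
    (SampleStatus.REGISTERED, "upload_complete"): SampleStatus.QC_READY,
    (SampleStatus.QC_READY, "qc_passed"): SampleStatus.ANALYSIS_READY,
    (SampleStatus.QC_READY, "qc_failed"): SampleStatus.QC_FAILED,
    (SampleStatus.ANALYSIS_READY, "analysis_complete"): SampleStatus.REPORT_READY,
}


def advance(status: SampleStatus, event: str) -> SampleStatus:
    """Return the next status, or raise ValueError for a disallowed event."""
    try:
        return TRANSITIONS[(status, event)]
    except KeyError:
        raise ValueError(f"event {event!r} not allowed in state {status.name}") from None
```

Keeping the allowed transitions in one table makes it straightforward to prevent, for example, variant analysis from starting before QC has passed.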

For example, referring to FIG. 8, the service server 30 provides a function for outputting a clinically usable report, that is, the clinical support report.

For example, the clinical support report may include information on the patient who provided the DNA (name, date of birth, etc.) and DNA sample information (sample ID, type, registration date, analysis date, etc.).

In addition, for example, for each identified variant, the variant frequency within the sample (the fraction of all unique barcodes that carry the variant; Variant MT (barcode) fraction, VMF) can be checked relative to the wild type, for example the reference sequence defined in HG19 or HG38.
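
The VMF defined above is a simple ratio, and a minimal sketch of it follows. The function name and the input layout (one consensus allele per unique barcode at the variant position) are assumptions made for illustration; how the embodiment derives the per-barcode allele is not restated here.

```python
def variant_mt_fraction(consensus_by_barcode: dict, alt_allele: str) -> float:
    """Fraction of unique barcodes whose consensus allele is the variant.

    `consensus_by_barcode` maps each unique barcode covering the position
    to its consensus allele. Returns 0.0 when no barcode covers the site.
    """
    total = len(consensus_by_barcode)
    if total == 0:
        return 0.0
    alt = sum(1 for allele in consensus_by_barcode.values() if allele == alt_allele)
    return alt / total
```

For instance, if four unique barcodes cover a position and one of them carries the variant allele, the function returns 0.25.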

In addition, for example, annotation information from the genetic variation DBs is used to show each DB's variant classification (risk group), and medical specialists can use this information when interpreting the test results for the patient. The variants of interest in each gene can be reviewed, and the report may be configured to assist prescribing based on the other annotation information and therapies associated with those variants.

As described in detail above, the present invention provides a highly sensitive genetic variation detection system based on barcode sequence information that analyzes liquid biopsy NGS data obtained by digital sequencing to detect genetic variation and delivers the result as a report in a form that medical specialists can consult for clinical decision-making.

In addition, when digital next-generation sequencing (Digital NGS) data using barcode sequences is produced with an analysis panel for liquid biopsy, the system mitigates the detection limit and suppresses NGS noise at the level of the analysis algorithm by forming clusters of reads that contain barcode sequence information.

In addition, to improve portability, the system is designed to analyze both single-end and paired-end read data generated on platforms based on two different NGS sequencers, Ion Torrent and Illumina. Even users unfamiliar with specialized NGS analysis can easily register samples and patients, run analyses, and retrieve reports through a web-based graphical user interface (GUI).

In addition, by generating a genetic variation report that includes clinically applicable variant annotation information, the system can help medical specialists use the report as supporting evidence when choosing a drug or therapy for a variant-related disease.

10: user terminal
20: digital sequencing analyzer
30: service server
40: variation information DB
310: patient information management unit
320: NGS data management unit
330: information collection unit
340: report management unit

Claims

A highly sensitive genetic variation detection system based on barcode sequence information, comprising:
a digital sequencing analyzer that detects and analyzes genetic variation by performing digital sequencing on patient DNA obtained by liquid biopsy; and
a service server that, for information on the genetic variation detected by the digital sequencing analyzer, refers to a public database to generate and provide a clinical support report related to the genetic variation.

The system of claim 1,
wherein the service server comprises:
a patient information management unit that receives and manages patient information;
an NGS data management unit that uploads NGS data resulting from next-generation sequencing (NGS) of the patient DNA, evaluates the quality of the NGS data (quality control), and, if the quality of the NGS data meets the acceptance criteria, passes the NGS data to the digital sequencing analyzer to request detection and analysis of genetic variation in the patient DNA;
an information collection unit that receives information on the genetic variation of the patient DNA from the digital sequencing analyzer and collects detailed descriptive information on the genetic variation by referring to a variation information DB published on the Internet; and
a report management unit that processes the detailed descriptive information collected by the information collection unit into a predefined format and generates and provides the clinical support report related to the genetic variation.

The system of claim 2,
wherein the clinical support report includes
information on the patient who provided the DNA, information identifying the DNA sample, and information on the variant frequency of the genetic variation relative to the reference sequence defined in HG19 or HG38 (the fraction of all unique barcodes that carry the variant; Variant MT (barcode) fraction, VMF).

The system of claim 3,
wherein the report management unit
presents, through the clinical support report, the association between the genetic variation and phenotype, divided into a plurality of classes based on clinical significance (pathogenicity) and on activity and risk (actionability), in accordance with the guidelines presented in the variation information DB.

The system of claim 1,
wherein the digital sequencing analyzer performs:
a read generation process of generating reads that contain a barcode by performing digital sequencing on the patient DNA;
a trimming process of removing adapter sequences and low-quality reads from the reads;
a reference sequence mapping process of mapping the reads, from which the adapter sequences and low-quality reads have been removed, to a reference sequence that is a representative sequence of the target species;
a clustering process of locating and defining barcode sequences based on the mapping information of the reads and clustering the reads by the defined barcode sequences; and
a genetic variation detection process of detecting genetic variation within reads that share the same barcode among the reads clustered by barcode sequence.