CN121034398A

CN121034398A - A method, apparatus, and storage medium for analyzing the degree of copy number variation.

Info

Publication number: CN121034398A
Application number: CN202511553985.5A
Authority: CN
Inventors: 许德德; 聂佩瑶; 梁相志; 张顺利; 毋晓宁; 郭强
Original assignee: Haining Aoling Medical Laboratory Co ltd
Current assignee: Haining Aoling Medical Laboratory Co ltd
Priority date: 2025-10-29
Filing date: 2025-10-29
Publication date: 2025-11-28

Abstract

The invention provides a method, a device and a storage medium for analyzing the degree of copy number variation, and relates to the technical field of bioinformatics analysis. The method comprises: determining a copy number baseline for each chromosome region from a reference sample set; determining the Z standard scores of the window-level chromosome regions according to the copy number baseline of each chromosome region; summing the Z standard scores whose absolute values are larger than a set threshold to obtain a whole genome cumulative score; and determining, by an ROC curve method, the whole genome cumulative score threshold that separates normal samples from abnormal samples. The whole genome cumulative score of a sample to be tested is calculated in the same way and judged against this threshold, which reduces the risk of the analysis result deviating from reality and improves the reliability of the whole genome cumulative score in clinical application.

Description

Analysis method, device and storage medium for copy number variation degree

Technical Field

The invention relates to the technical field of bioinformatics analysis, in particular to a copy number variation degree analysis method, a copy number variation degree analysis device and a storage medium.

Background

Copy number variation (Copy Number Variation, CNV) refers to the phenomenon of increasing or decreasing the copy number of a genomic DNA region, and is one of the important forms of genomic structural variation. Numerous studies have demonstrated that CNV plays a critical role in the development and progression of a variety of diseases, particularly cancer and genetic diseases. Therefore, accurate and reliable CNV detection and evaluation has great clinical significance for early diagnosis, prognosis judgment and personalized treatment of diseases.

However, existing Z-score based evaluation models have significant limitations in application. To simplify the calculation model or to emphasize particular regions, these methods often consider only the CNVs of a few preset "important" chromosomes or chromosome arms, or of those generally considered to be highly related to a specific disease. As a result, the analysis result may deviate from reality and may even produce missed or incorrect judgments, which affects reliability in clinical application.

Disclosure of Invention

Aiming at the problems existing in the prior art, the invention provides an analysis method, an analysis device and a storage medium for copy number variation degree.

The invention provides an analysis method of copy number variation degree, which comprises the following steps:

obtaining a reference sample set, wherein the reference sample set comprises a normal reference sample and an abnormal reference sample;

Determining a copy number baseline for each chromosome region from a normal reference sample in the reference sample set;

Determining a first Z standard score of a corresponding chromosome region of each reference sample in the reference sample set according to the chromosome region copy number baseline, summing the first Z standard scores with absolute values larger than a set threshold value to obtain a first whole genome cumulative score of each reference sample in the reference sample set, and determining a whole genome cumulative score threshold value based on the first whole genome cumulative score of the normal reference sample and the first whole genome cumulative score of the abnormal reference sample;

Determining a second Z standard score of the corresponding chromosome region of the sample to be detected according to the copy number baseline of each chromosome region, and summing the second Z standard scores with absolute values larger than a set threshold value to obtain a second whole genome accumulated score;

Analyzing the copy number variation degree of the sample to be tested through the second whole genome cumulative score and the whole genome cumulative score threshold value.

According to the analysis method of copy number variation degree provided by the invention, the chromosome region is a window-level chromosome region;

The determining a copy number baseline for each chromosome region from normal reference samples in the reference sample set comprises:

acquiring sequencing data of a normal reference sample in the reference sample set, and respectively preprocessing the sequencing data to obtain the original coverage of a chromosome region of each window level of the normal reference sample;

determining a first average of the original coverage of all window-level chromosomal regions for each reference sample;

And obtaining the internal standard coverage of the chromosome region of each window level according to the first average value so as to determine the copy number baseline of the chromosome region of each window level.

According to the analysis method of copy number variation degree provided by the invention, the second Z standard scores with absolute values larger than the set threshold are summed according to a formula to obtain the second whole genome cumulative score, wherein the formula is as follows:

wherein CZAR is the second whole genome cumulative score, n is the number of window-level chromosome regions across all chromosomes, Z_i represents the second Z standard score of the i-th window-level chromosome region, |·| represents taking the absolute value, and C represents the set threshold.

According to the analysis method of copy number variation degree provided by the invention, after the copy number variation degree of the sample to be tested is analyzed through the second whole genome cumulative score and the whole genome cumulative score threshold value, the method further comprises:

Under the condition that the sample to be detected is an abnormal sample, acquiring a copy number baseline of each chromosome arm;

determining the internal standard coverage of the corresponding chromosome arm according to the arithmetic sum of the internal standard coverage of the chromosome region of the window level corresponding to each chromosome arm;

and analyzing chromosome aberration of the sample to be tested according to the internal standard coverage of the chromosome arm and the corresponding copy number baseline.

and under the condition that the sample to be detected is an abnormal sample, carrying out visual analysis on the copy number variation degree of the chromosome region of the sample to be detected based on the second Z standard score of the chromosome region of the window level.

According to the analysis method of copy number variation degree provided by the invention, after obtaining the internal standard coverage of the chromosome region of each window level according to the first average value, the method further comprises:

determining a second average value and standard deviation of the internal standard coverage of the chromosome region of the corresponding window level according to the internal standard coverage of the chromosome region of the same window level of each normal reference sample in the reference sample set;

Determining a normal coverage range through the second average value and the standard deviation, and rejecting the normal reference samples with the internal standard coverage of the chromosome region of the window level outside the normal coverage range;

repeating the steps according to the removed residual normal reference samples until the internal standard coverage of the chromosome region of the window level of the normal reference samples is within the normal coverage range, so as to obtain the internal standard coverage of the chromosome region of the window level of at least two baseline reference samples;

Determining a copy number baseline for each chromosomal region, comprising:

determining a copy number baseline for the chromosomal region at each window level based on the internal standard coverage of the chromosomal region at the window level for the at least two baseline reference samples.

According to the analysis method of copy number variation degree provided by the invention, the first Z standard score of the corresponding chromosome region of each reference sample in the reference sample set is determined according to the copy number baseline of the chromosome region, and the analysis method comprises the following steps:

determining a first Z standard score of the chromosome region of the corresponding window level of each reference sample in the reference sample set according to the copy number baseline of the chromosome region of each window level and the internal standard coverage of the chromosome region of the corresponding window level of each reference sample in the reference sample set;

determining a whole genome cumulative score threshold value based on the first whole genome cumulative score of the normal reference sample and the first whole genome cumulative score of the abnormal reference sample, comprising:

A full genome cumulative score threshold value is determined using ROC curve method based on the first full genome cumulative score of the normal reference sample and the first full genome cumulative score of the abnormal reference sample.

The invention also provides an analysis device for copy number variation degree, which comprises:

A reference sample set acquisition module for acquiring a reference sample set, wherein the reference sample set comprises a normal reference sample and an abnormal reference sample;

A copy number baseline determination module for determining a copy number baseline for each chromosome region from normal reference samples in the reference sample set;

A whole genome cumulative score threshold value determining module, configured to determine a first Z-standard score of a corresponding chromosome region of each reference sample in the reference sample set according to the chromosome region copy number baseline, and sum the first Z-standard scores with absolute values greater than a set threshold value to obtain a first whole genome cumulative score of each reference sample in the reference sample set, so as to determine a whole genome cumulative score threshold value based on the first whole genome cumulative score of the normal reference sample and the first whole genome cumulative score of the abnormal reference sample;

The second whole genome cumulative score determining module is used for determining a second Z standard score of a corresponding chromosome region of the sample to be detected according to the copy number baseline of each chromosome region, and summing the second Z standard scores with absolute values larger than a set threshold value to obtain a second whole genome cumulative score;

and the copy number variation analysis module is used for analyzing the copy number variation degree of the sample to be tested through the second whole genome cumulative score and the whole genome cumulative score threshold value.

The invention also provides an electronic device comprising a memory, a processor and a computer program stored on the memory and capable of running on the processor, wherein the processor realizes the analysis method of the copy number variation degree when executing the computer program.

The present invention also provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements a method of analysing the degree of copy number variation as described in any of the above.

The invention also provides a computer program product comprising a computer program which when executed by a processor implements a method of analysing the degree of copy number variation as described in any of the above.

According to the analysis method for copy number variation degree provided by the invention, the copy number baseline of each chromosome region is determined from the normal reference samples in the reference sample set, and the first Z standard score of the corresponding chromosome region of each reference sample is determined according to that baseline. The first Z standard scores with absolute values larger than the set threshold are summed to obtain the first whole genome cumulative score of each reference sample, and the whole genome cumulative score threshold is determined based on the first whole genome cumulative scores of the normal reference samples and of the abnormal reference samples. The second Z standard score of the corresponding chromosome region of the sample to be tested is determined according to the same copy number baseline, and the second Z standard scores with absolute values larger than the set threshold are summed to obtain the second whole genome cumulative score. By comparing the second whole genome cumulative score with the whole genome cumulative score threshold, the stability of all chromosomes of the sample to be tested can be evaluated through a single score, which reduces the risk of the analysis result deviating from reality and improves reliability in clinical application.

Drawings

In order to more clearly illustrate the invention or the technical solutions of the prior art, the following description will briefly explain the drawings used in the embodiments or the description of the prior art, and it is obvious that the drawings in the following description are some embodiments of the invention, and other drawings can be obtained according to the drawings without inventive effort for a person skilled in the art.

FIG. 1 is a flow chart of the method for analyzing the degree of copy number variation provided by the invention.

FIG. 2 is a schematic diagram of a method for analyzing the degree of copy number variation according to the present invention.

FIG. 3 is a schematic representation of the ROC curve application of the method for analyzing the degree of copy number variation provided by the invention.

Fig. 4 is a schematic structural diagram of an analysis device for copy number variation degree provided by the present invention.

Fig. 5 is a schematic structural diagram of an electronic device provided by the present invention.

Detailed Description

For the purpose of making the objects, technical solutions and advantages of the present invention more apparent, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is apparent that the described embodiments are some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.

The occurrence of complex diseases such as cancer is often accompanied by extensive and complex variation across the genome, which may be spread over multiple chromosomes rather than being limited to a few "hot spot" regions. Because the prior art considers the copy number variation of only a few important chromosomes, it systematically ignores variation in other areas of the genome that may also be of biological significance, so the evaluation result cannot comprehensively and objectively reflect the real genome instability state of the sample.

Therefore, the embodiment of the invention provides a method, a device and a storage medium for analyzing copy number variation degree, and a technical scheme in the embodiment of the invention is described below with reference to the drawings in the embodiment of the invention.

FIG. 1 is a flow chart of the analysis method for copy number variation degree provided by the invention, as shown in FIG. 1, the method comprises the following steps:

Step 101, a reference sample set is obtained, wherein the reference sample set comprises a normal reference sample and an abnormal reference sample.

The normal reference sample refers to a biological sample of target cells meeting the requirements of the research background. For example, the normal reference sample may be a biological sample of a healthy individual, or may be a biological sample that itself may be an abnormal sample but meets the set experimental conditions.

Abnormal reference samples refer to biological samples of target cells that do not meet the requirements of the research setting. For example, the abnormal reference sample may be a biological sample of an individual patient, or may be a biological sample that may itself be a normal sample but does not meet the set experimental conditions.

Step 102, determining a copy number baseline of each chromosome region according to the normal reference sample in the reference sample set.

It should be noted that the raw sequencing data (raw data) of the reference sample set of 18-22 cervical exfoliated cells can be obtained and then subjected to quality control and cleaning, in which the adapter sequences and the low-quality reads in the raw sequencing data are removed. This yields the quality-controlled sequencing data (clean data), from which the copy number baseline of each chromosome region can be determined.

The chromosomal region is a DNA sequence which is a minimum analysis unit and is clearly defined when performing copy number variation analysis.

The low-quality reads may include reads in which N bases make up more than 10% of the sequence, and reads in which bases with a quality value below 10 make up 50% or more of the whole read.
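The two read filters described above can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: the function name `is_low_quality`, its parameter names, and the choice to treat an empty read as low quality are assumptions made for the example, and the per-base quality values are assumed to be already decoded into integers.

```python
def is_low_quality(seq, quals, max_n_frac=0.10, min_base_q=10, max_low_q_frac=0.50):
    """Return True if a read fails either filter described in the text.

    seq   : base sequence of the read (string)
    quals : per-base quality values as integers, same length as seq
    All names and defaults are hypothetical; the defaults mirror the
    10% / quality 10 / 50% figures quoted in the description.
    """
    if not seq:
        return True  # assumption: empty reads are discarded
    if len(seq) != len(quals):
        raise ValueError("sequence and quality lists must have equal length")
    # Filter 1: proportion of N bases strictly greater than 10%.
    n_frac = seq.upper().count("N") / len(seq)
    # Filter 2: bases with quality below 10 make up 50% or more of the read.
    low_q_frac = sum(q < min_base_q for q in quals) / len(quals)
    return n_frac > max_n_frac or low_q_frac >= max_low_q_frac
```

A read with exactly 10% N bases passes the first filter, because the description uses "greater than 10%", whereas a read with exactly 50% low-quality bases is rejected, because the description uses "50% or more".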

There are many ways to determine the copy number baseline for each chromosomal region from sequencing data, such as analysis based on read depth (Read Depth, RD), analysis based on B-allele frequency (B-Allele Frequency, BAF), etc., and the present invention is not limited in this regard.

Step 103, determining a first Z standard score of the corresponding chromosome region of each reference sample in the reference sample set according to the copy number baseline of the chromosome region, and summing the first Z standard scores with absolute values larger than a set threshold to obtain a first whole genome cumulative score of each reference sample in the reference sample set, so as to determine a whole genome cumulative score threshold based on the first whole genome cumulative score of the normal reference sample and the first whole genome cumulative score of the abnormal reference sample.

For example, where the copy number baseline of each chromosome region is determined from the coverage of each chromosome region of the reference samples relative to the corresponding region of the human reference genome, the coverage of the corresponding chromosome region of each reference sample may be calculated in the same way, and its first Z standard score may be determined from the copy number baseline of that chromosome region and the coverage.

The set threshold is a constant value, and the set threshold may be 3, 4, or the like, and may be actually determined according to the specific situation, which is not limited in the present invention.

It will be appreciated that, after the first Z standard scores of all chromosome regions of each reference sample are determined, summing those first Z standard scores of a reference sample whose absolute values are greater than the set threshold yields the first whole genome cumulative score of that sample. The method for determining the whole genome cumulative score threshold based on the first whole genome cumulative scores of the normal reference samples and of the abnormal reference samples can be set according to the actual precision requirement, and the invention is not limited in this regard.

Step 104, determining a second Z standard score of the corresponding chromosome region of the sample to be detected according to the copy number baseline of each chromosome region, and summing the second Z standard scores with absolute values larger than a set threshold value to obtain a second whole genome cumulative score.

The working principle of obtaining the second whole genome cumulative score is substantially the same as the working principle of obtaining the first whole genome cumulative score, and will not be described in detail herein.

Step 105, analyzing the copy number variation degree of the sample to be tested through the second whole genome cumulative score and the whole genome cumulative score threshold.

Preferably, whether the sample to be tested is a normal sample or an abnormal sample may be directly determined according to the magnitude of the second whole genome cumulative score and the whole genome cumulative score threshold value. In addition, whether the sample to be tested is a normal sample or an abnormal sample may be indirectly determined according to the ratio of the second whole genome cumulative score to the whole genome cumulative score threshold value, which is not limited in the present invention.

Based on the above embodiment, the chromosomal region is a window-level chromosomal region;

It should be noted that the preprocessing of the sequencing data may be to divide the human reference genome into consecutive non-overlapping windows (bins) according to the size of the chromosomal region at the window level. The different GC content of each window can cause uneven sequence distribution in the chromosomal region at the corresponding window level of the sequencing data, cause GC bias, and affect the accuracy of copy number variation detection. Therefore, the GC content on the chromosome region of each window level and the corresponding human reference genome window is calculated for the sequencing data after quality control, and GC correction is performed, so that the original coverage after GC correction is obtained.

The size of the chromosomal region at the window level may be 100 kb or 200 kb, and the present invention is not limited thereto. GC content is the proportion of G and C bases within a stretch of sequence.

There are many ways to perform GC correction on the chromosome region at each window level, such as the loess function in R, a regression model, etc., and the present invention is not limited thereto.
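As a rough illustration of GC correction, the Python sketch below computes the GC content of a window and rescales each window's coverage by the median coverage of windows with similar GC content. This is deliberately a simpler stand-in for loess or regression, not the method of the invention; the names `gc_fraction` and `gc_correct` and the 1% GC grouping step are assumptions made for the example.

```python
from collections import defaultdict
from statistics import median


def gc_fraction(seq):
    """Proportion of G and C among the A/C/G/T bases of a window sequence."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    return (seq.count("G") + seq.count("C")) / acgt if acgt else 0.0


def gc_correct(coverage, gc, step=0.01):
    """Median-based GC correction (a simplified substitute for loess).

    coverage : raw coverage per window
    gc       : GC fraction per window, same order as coverage
    step     : width of the GC groups (hypothetical default of 1%)
    Each window is scaled by global median / median of its GC group.
    """
    global_median = median(coverage)
    groups = defaultdict(list)
    for cov, g in zip(coverage, gc):
        groups[round(g / step)].append(cov)
    corrected = []
    for cov, g in zip(coverage, gc):
        group_median = median(groups[round(g / step)])
        # Windows in a group with zero median coverage are set to 0.0.
        corrected.append(cov * global_median / group_median if group_median > 0 else 0.0)
    return corrected
```

After correction, windows that differed only because of their GC content end up with comparable coverage, which is the effect the description attributes to GC correction.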

Illustratively, after obtaining a first average of the original coverage of the chromosomal regions at all window levels of a normal reference sample, a ratio of the original coverage of the chromosomal regions at each window level to the first average may be determined to obtain an internal standard coverage of the chromosomal regions at each window level of the normal reference sample, thereby determining a copy number baseline for each chromosomal region.

It can be understood that, according to the first average value of the original coverage of the chromosome regions of all window levels of the normal reference sample, the internal standard coverage of the chromosome region of each window level is obtained, so that the difference of the overall sequencing depth of the normal reference sample can be eliminated, the internal of the normal reference sample is standardized, and the influence of the total amount of sequencing data of the normal reference sample on subsequent analysis is reduced.
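The within-sample normalization described above, in which each window's original coverage is divided by the first average over all windows of the same sample, can be sketched in Python as follows. The function name `internal_standard_coverage` is hypothetical.

```python
from statistics import fmean


def internal_standard_coverage(raw_coverage):
    """Divide each window's original coverage by the sample-wide mean.

    raw_coverage : GC-corrected original coverage per window for one sample
    Returns the internal standard coverage per window, whose mean is 1.
    """
    first_average = fmean(raw_coverage)
    if first_average <= 0:
        raise ValueError("mean coverage must be positive")
    return [cov / first_average for cov in raw_coverage]


# Two samples with the same profile but a tenfold difference in depth
# give the same internal standard coverage.
shallow = internal_standard_coverage([10, 20, 30])
deep = internal_standard_coverage([100, 200, 300])
```

Because the result depends only on the shape of the coverage profile, samples sequenced to different total depths become directly comparable.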

Under the condition that the copy number base line of each chromosome region is determined according to a plurality of normal reference samples, after the copy number base line of each chromosome region of each normal reference sample is obtained, the plurality of copy number base lines of each chromosome region can be further processed to obtain a final single copy number base line. The mode of further processing the plurality of copy number baselines to obtain the final single copy number baselines can be set according to actual requirements, and the invention is not limited to the mode.

Based on any of the above embodiments, the second Z standard scores with absolute values greater than the set threshold are summed according to a formula to obtain the second whole genome cumulative score, where the formula is as follows:

It will be appreciated that the first whole genome cumulative score of each reference sample in the reference sample set, obtained by summing the first Z standard scores with absolute values greater than the set threshold, is also calculated with reference to the above formula, with the first Z standard score used in place of the second Z standard score.

Illustratively, after the second Z-standard score calculation of bin levels (window-level chromosomal regions) of 22 chromosomes of the test sample is completed, the second whole genome cumulative score CZAR may be calculated using the calculated second Z-standard score according to the above formula. And comparing the calculated second whole genome cumulative score CZAR with a whole genome cumulative score threshold value to determine whether the sample to be detected is a normal sample or an abnormal sample.
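The cumulative score can be sketched in Python as below. The formula itself is not reproduced in this text, so one point is an assumption: the description says that the Z standard scores whose absolute values exceed the set threshold C are summed, but does not show whether the signed scores or their magnitudes are added. The sketch adds magnitudes by default, so that gains and losses do not cancel, and offers `use_abs=False` for the literal signed sum. The function name `czar` and its parameters are hypothetical.

```python
def czar(z_scores, c=3.0, use_abs=True):
    """Whole genome cumulative score over window-level Z standard scores.

    z_scores : Z standard score of each window-level chromosome region
    c        : the set threshold (3 and 4 are the examples given in the text)
    use_abs  : assumption, see lead-in; True sums |Z|, False sums signed Z
    Only windows with |Z| strictly greater than c contribute.
    """
    selected = [z for z in z_scores if abs(z) > c]
    if use_abs:
        return sum(abs(z) for z in selected)
    return sum(selected)


z = [0.5, -4.0, 3.5, 3.0, -1.2]
score_magnitude = czar(z, c=3.0)               # windows -4.0 and 3.5 contribute
score_signed = czar(z, c=3.0, use_abs=False)   # same windows, signs kept
```

The window with Z exactly equal to 3.0 is excluded in both variants, because the description requires the absolute value to be larger than the set threshold.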

Based on any of the above embodiments, after obtaining the internal standard coverage of the chromosomal region at each window level from the first average value, the method further comprises:

Determining a copy number baseline for each chromosomal region, comprising:

For example, an outlier detection upper limit may be determined as upper = second average + 3 × standard deviation, and an outlier detection lower limit as lower = second average − 3 × standard deviation; the internal standard coverage range greater than the lower limit and less than the upper limit is then determined as the normal coverage range.

Repeating the steps according to the removed residual normal reference sample, redefining a second average value and standard deviation of the internal standard coverage of the chromosome region of the window level according to the removed residual normal reference sample, redefining the normal coverage range of the removed residual normal reference sample by adopting a similar method until the internal standard coverage of the removed residual normal reference sample is between the currently-determined abnormal value detection upper limit and the currently-determined abnormal value detection lower limit, and determining the copy number baseline of each chromosome region according to the currently-determined second average value and standard deviation.
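The iterative rejection described above can be sketched in Python as follows. This is an illustration under stated assumptions: the function name `build_baseline` and the dictionary layout are hypothetical, the sample standard deviation (`statistics.stdev`) is used because the description does not say which estimator applies, and the bounds are treated as inclusive so that a window with zero spread does not reject every sample.

```python
from statistics import fmean, stdev


def build_baseline(samples, k=3.0):
    """Iteratively drop normal reference samples with any out-of-range window.

    samples : dict mapping sample name to its list of per-window
              internal standard coverage values (equal lengths)
    k       : number of standard deviations defining the normal range
    Returns (kept_samples, baseline), where baseline is a list of
    (mean, standard deviation) per window computed from the kept samples.
    """
    kept = dict(samples)
    while len(kept) >= 2:
        n_windows = len(next(iter(kept.values())))
        baseline = []
        for w in range(n_windows):
            values = [cov[w] for cov in kept.values()]
            baseline.append((fmean(values), stdev(values)))
        # A sample is rejected as a whole if any of its windows is out of range.
        rejected = [
            name
            for name, cov in kept.items()
            if any(not (m - k * s <= x <= m + k * s) for x, (m, s) in zip(cov, baseline))
        ]
        if not rejected:
            return kept, baseline
        for name in rejected:
            del kept[name]
    raise ValueError("fewer than two baseline reference samples remain")
```

With the sample standard deviation, a single value can lie at most (n − 1)/√n standard deviations from the mean of n values, so a 3 × standard deviation rule can only reject a sample when at least 11 samples are present; the 20 normal reference samples of the example given later in the description satisfy this.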

It will be appreciated that filtering the reference samples through the normal coverage range can reduce the risk of using abnormal samples as normal reference samples, thereby improving the reliability of the resulting copy number baseline for each chromosomal region.

In addition, compared with the method of removing only the internal standard coverage of the chromosome region at the window level, in the embodiment, under the condition that the internal standard coverage of the chromosome region at the window level in a certain normal reference sample is out of the normal coverage range, the normal reference sample is directly removed, so that the risk of obtaining the copy number baseline of the chromosome region based on suspicious data can be further reduced.

In addition, in this embodiment, the copy number baseline of the chromosome region is obtained through the normal reference samples with the internal standard coverage of the chromosome region at the window level within the normal coverage range, and the detection analysis of the copy number variation degree is performed according to the copy number baseline, so that the sensitivity of the detection analysis can be improved.

Based on any of the above embodiments, the determining a first Z-standard score for the corresponding chromosomal region for each reference sample in the reference sample set from the copy number baseline for the chromosomal region comprises:

And determining a first Z standard score of the chromosome region of the corresponding window level of each reference sample in the reference sample set according to the copy number baseline of the chromosome region of the window level and the internal standard coverage of the chromosome region of the corresponding window level of each reference sample in the reference sample set.

For example, a first Z standard score for a window-level chromosomal region may be determined based on the following formula:

$$Z_i = \frac{x_i - \mu_i}{\sigma_i}$$

wherein $x_i$ represents the internal standard coverage of the i-th window-level chromosomal region, $\mu_i$ represents the mean value of the copy number baseline of the corresponding chromosomal region, and $\sigma_i$ represents the standard deviation of the copy number baseline of the corresponding chromosomal region.
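In other words, the standard score of a window is the deviation of its internal standard coverage from the baseline mean, divided by the baseline standard deviation. A minimal Python sketch, with the hypothetical name `window_z_scores` and a baseline represented as a list of (mean, standard deviation) pairs, could look like this:

```python
def window_z_scores(coverage, baseline):
    """Z standard score per window-level chromosome region.

    coverage : internal standard coverage per window for one sample
    baseline : list of (mean, standard deviation) per window, taken from
               the baseline reference samples (layout is an assumption)
    """
    scores = []
    for x, (mu, sigma) in zip(coverage, baseline):
        if sigma <= 0:
            # Assumption: a window with no baseline spread cannot be scored.
            raise ValueError("baseline standard deviation must be positive")
        scores.append((x - mu) / sigma)
    return scores


# A window at 1.2 against a baseline of 1.0 with standard deviation 0.05
# lies about four standard deviations above the baseline.
z = window_z_scores([1.2, 0.9], [(1.0, 0.05), (1.0, 0.05)])
```

The same function applies to reference samples (first Z standard score) and to the sample to be tested (second Z standard score), since both are scored against the same baseline.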

The whole genome cumulative score may also be referred to as the cumulative Z-score (Cumulative Z-Score for Genome-wide Chromosomal Aberrant Regions, CZAR).

The first whole genome cumulative score of each baseline reference sample may be calculated with reference to the formula described above for calculating the second whole genome cumulative score.

Based on any of the above embodiments, determining the whole genome cumulative score threshold value based on the first whole genome cumulative score of the normal reference sample and the first whole genome cumulative score of the abnormal reference sample comprises:

In another embodiment, for each of at least two baseline reference samples, the first Z standard scores of the window-level chromosome regions whose absolute values are greater than the set threshold may be summed to obtain at least two first whole genome cumulative scores, and the maximum of the at least two first whole genome cumulative scores is taken as the whole genome cumulative score threshold.

It will be appreciated that in the case of obtaining the first whole genome cumulative score of at least two baseline reference samples, determining the threshold value using ROC curve method can minimize the risk of misjudging the normal sample as an abnormal sample, achieve 100% specificity, and thus most clearly separate the normal sample from the abnormal sample.
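Both ways of setting the threshold can be sketched in Python. The description does not state which operating point on the ROC curve is chosen, so the sketch assumes the Youden index (sensitivity + specificity − 1) as the selection criterion; the function names and the toy scores are invented for illustration and are not data from the invention.

```python
def roc_threshold(normal_scores, abnormal_scores):
    """Pick the cut-off maximizing the Youden index over observed scores.

    A sample is called abnormal when its score is >= the cut-off
    (assumption). Returns (threshold, youden_index); ties keep the
    lowest threshold.
    """
    best_t, best_j = None, -1.0
    for t in sorted(set(normal_scores) | set(abnormal_scores)):
        sensitivity = sum(s >= t for s in abnormal_scores) / len(abnormal_scores)
        specificity = sum(s < t for s in normal_scores) / len(normal_scores)
        j = sensitivity + specificity - 1
        if j > best_j:
            best_t, best_j = t, j
    return best_t, best_j


def max_normal_threshold(normal_scores):
    """Alternative embodiment: the largest score among normal references.

    Calling a sample abnormal only when its score exceeds this value
    gives no false positives on the reference set itself.
    """
    return max(normal_scores)


# Toy cumulative scores, invented for the example.
normal = [0.0, 3.2, 5.0, 8.1]
abnormal = [7.5, 20.0, 45.3, 120.0]
threshold, youden = roc_threshold(normal, abnormal)
```

On these toy values the two approaches disagree slightly: the Youden criterion accepts one misclassified normal sample in exchange for detecting every abnormal one, while the maximum-of-normal rule trades sensitivity for specificity on the reference set.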

The working principle and the technical effect of calculating and determining the second whole genome cumulative score of the sample to be tested are basically the same as those of calculating and determining the first whole genome cumulative score of the baseline reference sample, and are not described in detail herein.

Based on any of the above embodiments, after the analyzing the copy number variation degree of the test sample by the second whole genome cumulative score and the whole genome cumulative score threshold value, the method further comprises:

It should be noted that, after the sequencing data of at least two reference samples of the target cells are preprocessed to obtain the original coverage of the chromosome region of each window level, the original coverage of the corresponding chromosome arm can be determined according to the arithmetic sum of the chromosome regions of the window level corresponding to each chromosome arm, the third average value of all chromosome arms of each reference sample is determined, and the internal standard coverage of each chromosome arm is obtained according to the third average value, so as to determine the copy number baseline of each chromosome arm.

The Z-standard score of a chromosome arm can be calculated by the following formula:

$$Z_{arm} = \frac{x_{arm} - \mu_{arm}}{\sigma_{arm}}$$

where $x_{arm}$ represents the internal standard coverage of the chromosome arm, $\mu_{arm}$ represents the mean of the copy number baseline of the corresponding chromosome arm, and $\sigma_{arm}$ represents the standard deviation of the copy number baseline of the corresponding chromosome arm.
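The arm-level calculation described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names are hypothetical, and the step "obtained from the third average value" is assumed here to mean dividing each arm's summed coverage by the sample's mean over all arms.

```python
from statistics import mean


def arm_internal_standard_coverage(window_cov, window_arm):
    """Aggregate window-level coverage to arm level and normalize per sample.

    window_cov: original coverage of each window-level region.
    window_arm: arm label of each window (e.g. "1p"), same length.
    """
    arm_raw = {}
    for cov, arm in zip(window_cov, window_arm):
        # Arithmetic sum of the windows that belong to each arm.
        arm_raw[arm] = arm_raw.get(arm, 0.0) + cov
    # "Third average value": mean over all arms of this sample.
    third_avg = mean(arm_raw.values())
    # Assumption: internal standard coverage = arm coverage / third average.
    return {arm: raw / third_avg for arm, raw in arm_raw.items()}


def arm_z_score(x, baseline_mean, baseline_sd):
    """Z-standard score of one arm against its copy number baseline."""
    return (x - baseline_mean) / baseline_sd
```

For example, windows with coverage 10 and 20 on arm 1p and 30 and 60 on arm 1q give arm sums of 30 and 90, a third average of 60, and internal standard coverages of 0.5 and 1.5.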

In the case where the sample to be tested is determined to be an abnormal sample, visual analysis of the copy number variation degree of the whole genome of the sample to be tested is performed based on the second Z-standard scores of the window-level chromosome regions.

It can be understood that, once the stability of all chromosomes of the sample to be tested has been evaluated by a single score and the sample has been determined to be abnormal, analyses at different resolution levels, such as visual analysis of the window-level chromosome regions and chromosome aberration analysis of the chromosome arms, can reveal the details of variation across the sample, further characterize its overall chromosomal instability, and are of value for monitoring the development of cancer.

Fig. 2 is a schematic diagram of a data query method according to the present invention. As shown in Fig. 2, a specific example is provided below to illustrate the analysis method for copy number variation degree according to the present embodiment.

The reference sample set can be obtained by cervical swab sampling. The reference sample set comprises 20 normal reference samples and 35 abnormal reference samples, and each sample is subjected to low-depth high-throughput sequencing to obtain corresponding sequencing data. The 35 abnormal reference samples are samples judged abnormal by a gold standard method, and can comprise 7 benign samples, 5 cervical low-grade squamous intraepithelial lesion (Low-grade Squamous Intraepithelial Lesion, LSIL) samples, 12 cervical high-grade squamous intraepithelial lesion (High-grade Squamous Intraepithelial Lesion, HSIL) samples and 11 cervical cancer samples.

Low-grade squamous intraepithelial lesions represent a mild abnormal proliferation of cervical squamous epithelial cells, usually caused by HPV infection, and are low-risk precancerous lesions. High-grade squamous intraepithelial lesions are a type of cervical precancerous lesion and indicate a significant cellular abnormality that requires timely intervention to prevent progression to cervical cancer.

The sequencing data are compared with a reference genome for preprocessing to obtain the original coverage of each window-level chromosome region of the reference samples. Afterwards, redundancy removal can be performed on the sequencing data: adapter sequences and low-quality reads are removed from the original sequencing data to obtain quality-controlled sequencing data (clean data), as shown in Table 1.

TABLE 1 sequencing data after quality control

Wherein CleanReads is the number of reads after filtering, CleanBases is the number of bases after filtering, N is the calculated base content percentage, Q20 is the proportion of bases with a Q value of 20 or more among all bases, and Q30 is the proportion of bases with a Q value of 30 or more among all bases.

Specifically, the quality-controlled sequencing data can be divided into window-level chromosome regions and aligned to the human reference genome; only non-duplicate, uniquely aligned reads are retained, the read counts are normalized, and GC correction is performed to obtain the GC-corrected original coverage. Values for some window-level chromosome regions are shown in Table 2.

TABLE 2 Values of some window-level chromosome regions

Wherein the number 1 represents chromosome one and p represents the chromosome short arm.
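As a rough illustration of the windowing step, the sketch below counts non-duplicate, uniquely aligned reads per fixed-size window. It assumes 1-based positions and the 200 kb window size used in this example; the record layout `(chrom, pos, is_unique, is_duplicate)` and the function names are hypothetical, and normalization and GC correction are deliberately omitted because the text does not specify how they are performed.

```python
from collections import Counter

WINDOW_SIZE = 200_000  # 200 kb windows, as in the example windows of chromosome 1


def window_of(pos, size=WINDOW_SIZE):
    """Map a 1-based position to its (start, end) window, both inclusive."""
    idx = (pos - 1) // size
    return idx * size + 1, (idx + 1) * size


def count_reads_per_window(alignments):
    """Count reads per window, keeping only unique, non-duplicate alignments.

    alignments: iterable of (chrom, pos, is_unique, is_duplicate) tuples
    (a hypothetical, simplified stand-in for parsed alignment records).
    """
    counts = Counter()
    for chrom, pos, is_unique, is_duplicate in alignments:
        if is_unique and not is_duplicate:
            counts[(chrom, *window_of(pos))] += 1
    return counts
```

With this convention, position 1000001 and position 1200000 both fall in the window 1000001-1200000, and position 1200001 starts the next window.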

A normal coverage range is determined from the second average value and the standard deviation, so that normal reference samples whose internal standard coverage of the window-level chromosome region falls outside the normal coverage range are rejected.

The above steps are repeated on the remaining normal reference samples: the second average value and standard deviation of the internal standard coverage of the window-level chromosome region are re-determined from the remaining normal reference samples, and the normal coverage range is re-determined in the same way, until no normal reference sample falls outside the normal coverage range. The copy number baseline is then determined from the final second average value and standard deviation, and is used to calculate the first Z-standard score of the internal standard coverage of the corresponding window-level chromosome region for all reference samples, as shown in Table 3.

TABLE 3 First Z-standard scores of internal standard coverage for some window-level chromosome regions of all samples

Wherein 1000001-1200000, 1200001-1400000, 1400001-1600000, etc. are window-level chromosome regions on the short arm of chromosome 1, each spanning a 200 kb interval.
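The iterative rejection used to build the baseline for one window can be sketched as below. The text does not state how wide the normal coverage range is, so the multiplier `k` (mean ± k standard deviations, default 3) is an assumption, as are the use of the sample standard deviation and the function names.

```python
from statistics import mean, stdev


def window_baseline(values, k=3.0):
    """Iteratively reject values outside mean +/- k*sd for one window.

    values: internal standard coverage of this window in each normal
    reference sample. Returns (baseline_mean, baseline_sd, kept_values).
    """
    kept = list(values)
    while len(kept) > 2:
        mu, sd = mean(kept), stdev(kept)
        inside = [v for v in kept if mu - k * sd <= v <= mu + k * sd]
        # Stop when nothing is rejected, or when too few samples would remain.
        if len(inside) == len(kept) or len(inside) < 2:
            break
        kept = inside
    return mean(kept), stdev(kept), kept


def z_score(x, baseline_mean, baseline_sd):
    """First or second Z-standard score of a window against its baseline."""
    return (x - baseline_mean) / baseline_sd
```

The loop recomputes the mean and standard deviation after each round of rejection, which mirrors the "repeat until no sample falls outside the range" wording; the final mean and standard deviation play the role of the copy number baseline.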

For each reference sample, the first Z-standard scores whose absolute values are greater than the set threshold are summed to obtain the corresponding first whole genome cumulative score, as shown in Table 4.

TABLE 4 first genome-wide cumulative score for all samples
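A minimal sketch of the cumulative score follows. Two points are assumptions rather than statements from the text: the per-window cut-off `z_threshold` (3.0 here is only a placeholder), and the choice to add absolute values, which is made because all reported CZAR values are positive.

```python
def cumulative_z_score(z_scores, z_threshold=3.0):
    """Whole genome cumulative score (CZAR) for one sample.

    z_scores: Z-standard scores of all window-level regions of the sample.
    Only windows with |Z| > z_threshold contribute; their absolute values
    are added (an assumption about the lost formula).
    """
    return sum(abs(z) for z in z_scores if abs(z) > z_threshold)


# Windows with Z = -3.5 and 4.0 exceed the cut-off; the others are ignored.
example = cumulative_z_score([0.5, -3.5, 4.0, 2.9, -1.0])
```

Filtering before summing means that windows within normal fluctuation add nothing, so the score grows only with the number and magnitude of clearly deviating windows.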

Analysis of the first whole genome cumulative scores (CZAR) of all samples gave the following ranges:

The CZAR range is 20.84-429.3 for normal samples, 710.51-3917.21 for benign samples, 456.84-8270.61 for LSIL samples, 892.89-9376.57 for HSIL samples, and 594.09-30336.7 for cancer samples.

As shown in Fig. 3, a CZAR threshold may be derived from the first whole genome cumulative scores (CZAR) of the normal and abnormal samples using the receiver operating characteristic (ROC) curve.

Accordingly, the whole genome cumulative score threshold was set to 443.07: a sample with CZAR ≤ 443.07 is judged to be a normal sample and a sample with CZAR > 443.07 is judged to be an abnormal sample. The second whole genome cumulative score of a sample to be tested is compared with this threshold, and the results determined according to CZAR are shown in Table 5.

TABLE 5 Performance of CZAR in distinguishing between different classes

It can be seen that CZAR can effectively distinguish normal samples from abnormal samples and can be used as a biomarker for cancer diagnosis.

After the gold standard results and the actual detection results of the samples are collated, a threshold can be calculated using the receiver operating characteristic (ROC) curve in order to verify that the whole genome cumulative score threshold is reasonable.
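The threshold in this example can be sanity-checked with a small ROC-style scan. The sketch below is illustrative only: it uses just the range endpoints reported above (not the full per-sample scores, which are not reproduced here), takes midpoints between adjacent scores as candidate cut-offs, and selects the one maximizing Youden's J (sensitivity + specificity - 1). Both the candidate grid and the Youden criterion are assumptions; the text says only that an ROC curve is used.

```python
def roc_threshold(normal_scores, abnormal_scores):
    """Pick the cut-off maximizing Youden's J; score > cut-off means abnormal."""
    scores = sorted(set(normal_scores) | set(abnormal_scores))
    candidates = [(a + b) / 2 for a, b in zip(scores, scores[1:])]
    best_cut, best_j = None, -1.0
    for cut in candidates:
        sensitivity = sum(s > cut for s in abnormal_scores) / len(abnormal_scores)
        specificity = sum(s <= cut for s in normal_scores) / len(normal_scores)
        j = sensitivity + specificity - 1
        if j > best_j:
            best_cut, best_j = cut, j
    return best_cut, best_j


# Range endpoints of CZAR reported for each class in this example.
normal = [20.84, 429.3]
abnormal = [710.51, 3917.21, 456.84, 8270.61, 892.89, 9376.57, 594.09, 30336.7]
cut, j = roc_threshold(normal, abnormal)
```

Under these assumptions the selected cut-off is the midpoint between the highest normal score (429.3) and the lowest abnormal score (456.84), that is 443.07, which agrees with the threshold stated in the example and corresponds to 100% sensitivity and specificity on these endpoints.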

The apparatus for analyzing the degree of copy number variation provided by the present invention will be described below, and the apparatus for analyzing the degree of copy number variation described below and the method for analyzing the degree of copy number variation described above may be referred to in correspondence with each other.

Fig. 4 is a schematic structural diagram of an analysis device for copy number variation degree provided by the present invention, as shown in fig. 4, the device includes:

A reference sample set obtaining module 401, configured to obtain a reference sample set, where the reference sample set includes a normal reference sample and an abnormal reference sample;

a copy number baseline determination module 402 for determining a copy number baseline for each chromosome region from a reference sample of target cells;

A whole genome cumulative score threshold value determining module 403, configured to determine a first Z standard score of a corresponding chromosome region of at least one normal sample according to the copy number baseline of each chromosome region, and sum the first Z standard scores with absolute values greater than a set threshold value to obtain a whole genome cumulative score threshold value;

A second whole genome cumulative score determining module 404, configured to determine a second Z standard score of a corresponding chromosome region of the sample to be tested according to the copy number baseline of each chromosome region, and sum the second Z standard scores with absolute values greater than a set threshold to obtain a second whole genome cumulative score;

A copy number variation analysis module 405, configured to analyze the copy number variation degree of the sample to be tested through the second whole genome cumulative score and the whole genome cumulative score threshold value.

Based on any of the above embodiments, the chromosomal region is a window-level chromosomal region;

the copy number baseline determination module 402 is specifically configured to:

Based on any of the above embodiments, the apparatus further comprises a chromosome aberration analysis module for:

Based on any of the above embodiments, the apparatus further includes a global visual analysis module, configured to perform visual analysis on a copy number variation degree of a chromosome region of the sample to be tested based on the second Z-standard score of the chromosome region of the window level in a case where the sample to be tested is determined to be an abnormal sample.

Based on any of the above embodiments, the apparatus further includes an internal standard coverage determination module configured to:

The copy number baseline determination module 402 is specifically configured to determine a copy number baseline of the chromosome region at each window level according to the internal standard coverage of the chromosome region at the window level of the at least two baseline reference samples.

Based on any of the above embodiments, the genome-wide cumulative score threshold value determination module 403 is specifically configured to:

Fig. 5 illustrates a schematic structural diagram of an electronic device. As shown in Fig. 5, the electronic device may include a processor 510, a communications interface 520, a memory 530, and a communication bus 540, where the processor 510, the communications interface 520, and the memory 530 communicate with each other through the communication bus 540. The processor 510 may invoke logic instructions in the memory 530 to perform a method of analyzing the degree of copy number variation, the method comprising: determining a copy number baseline for each chromosome region from a reference sample of target cells; determining a first Z-standard score for the corresponding chromosome region of at least one normal sample according to the copy number baseline of each chromosome region, and summing the first Z-standard scores whose absolute values are greater than a set threshold to obtain a whole genome cumulative score threshold; determining a second Z-standard score for the corresponding chromosome region of a sample to be tested according to the copy number baseline of each chromosome region, and summing the second Z-standard scores whose absolute values are greater than the set threshold to obtain a second whole genome cumulative score; and analyzing the degree of copy number variation of the sample to be tested through the second whole genome cumulative score and the whole genome cumulative score threshold.

Further, the logic instructions in the memory 530 described above may be implemented in the form of software functional units and may be stored in a computer-readable storage medium when sold or used as a stand-alone product. Based on this understanding, the technical solution of the present invention, in essence, or the part contributing to the prior art, or a part of the technical solution, may be embodied in the form of a software product stored in a storage medium, comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention. The storage medium includes a USB flash drive, a removable hard disk, a read-only memory (ROM, Read-Only Memory), a random access memory (RAM, Random Access Memory), a magnetic disk, an optical disk, or other various media capable of storing program code.

In another aspect, the present invention provides a computer program product comprising a computer program that can be stored on a non-transitory computer readable storage medium. When executed by a processor, the computer program performs the method of analyzing the degree of copy number variation provided by the above methods, the method comprising: determining a copy number baseline for each chromosome region from a reference sample of target cells; determining a first Z-standard score for the corresponding chromosome region of at least one normal sample according to the copy number baseline of each chromosome region, and summing the first Z-standard scores whose absolute values are greater than a set threshold to obtain a whole genome cumulative score threshold; determining a second Z-standard score for the corresponding chromosome region of a sample to be tested according to the copy number baseline of each chromosome region, and summing the second Z-standard scores whose absolute values are greater than the set threshold to obtain a second whole genome cumulative score; and analyzing the degree of copy number variation of the sample to be tested through the second whole genome cumulative score and the whole genome cumulative score threshold.

In yet another aspect, the present invention provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs the method of analyzing the degree of copy number variation provided by the above methods, the method comprising: determining a copy number baseline for each chromosome region from a reference sample of target cells; determining a first Z-standard score for the corresponding chromosome region of at least one normal sample according to the copy number baseline of each chromosome region, and summing the first Z-standard scores whose absolute values are greater than a set threshold to obtain a whole genome cumulative score threshold; determining a second Z-standard score for the corresponding chromosome region of a sample to be tested according to the copy number baseline of each chromosome region, and summing the second Z-standard scores whose absolute values are greater than the set threshold to obtain a second whole genome cumulative score; and analyzing the degree of copy number variation of the sample to be tested through the second whole genome cumulative score and the whole genome cumulative score threshold.

The apparatus embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement the present invention without creative effort.

From the above description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus necessary general hardware platforms, or of course may be implemented by means of hardware. Based on this understanding, the foregoing technical solution may be embodied essentially or in a part contributing to the prior art in the form of a software product, which may be stored in a computer readable storage medium, such as ROM/RAM, a magnetic disk, an optical disk, etc., including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method described in the respective embodiments or some parts of the embodiments.

It should be noted that the above-mentioned embodiments are merely for illustrating the technical solution of the present invention, and not for limiting the same, and although the present invention has been described in detail with reference to the above-mentioned embodiments, it should be understood by those skilled in the art that the technical solution described in the above-mentioned embodiments may be modified or some technical features may be equivalently replaced, and these modifications or substitutions do not make the essence of the corresponding technical solution deviate from the spirit and scope of the technical solution of the embodiments of the present invention.

Claims

1. A method for analyzing the degree of copy number variation, comprising:

2. The method for analyzing the degree of copy number variation according to claim 1, wherein the chromosomal region is a window-level chromosomal region;

3. The method of claim 2, wherein summing the second Z-standard scores having absolute values greater than a set threshold yields a second whole genome cumulative score calculated according to the formula:

;

4. The method according to claim 2, wherein after the analysis of the copy number variation degree of the test sample by the second whole genome cumulative score and the whole genome cumulative score threshold value, the method further comprises:

5. The method according to claim 2 or 4, wherein after analyzing the degree of copy number variation of the sample to be tested by the second whole genome cumulative score and the whole genome cumulative score threshold value, the method further comprises:

6. The method of analyzing the degree of copy number variation according to claim 2, wherein after obtaining the internal standard coverage of the chromosome region of each window level from the first average value, the method further comprises:

Determining a copy number baseline for each chromosomal region, comprising:

7. The method of claim 6, wherein determining a first Z-standard score for a corresponding chromosomal region of each reference sample in the reference sample set based on the chromosomal region's copy number baseline comprises:

8. An analysis device for the degree of copy number variation, comprising:

9. An electronic device comprising a memory, a processor and a computer program stored on the memory and running on the processor, wherein the processor implements the method of analysing the degree of copy number variation according to any one of claims 1 to 7 when executing the computer program.

10. A non-transitory computer readable storage medium having stored thereon a computer program, wherein the computer program when executed by a processor implements the method of analyzing the degree of copy number variation of any of claims 1 to 7.