Disclosure of Invention
In order to solve the defects of the prior art, the embodiment of the application discloses a pattern recognition method and device for medical gene sequencing data. The application solves the technical problems of insufficient accuracy and the like of the data filtering in the prior art, and is realized by the following technical scheme.
A pattern recognition method for medical gene sequencing data comprises the following steps of preprocessing a target gene sequence obtained through sequencing, including sequence reading, quality assessment, pairing and splicing, length screening, generating a preprocessed sequence file, analyzing a reference sequence library based on the preprocessed sequence file, constructing an adaptive scoring matrix, dynamically adjusting a position point score by utilizing neural network training, utilizing the adaptive scoring matrix to scan sequence differences through a sliding window and analyze the sequence differences by combining with a Gaussian mixture model, generating a mutation area information file, reading the mutation area information file, constructing a base conversion matrix to conduct sequence comparison, generating a mutation classification result by analyzing and comparing paths, carrying out standardization processing on mutation area feature vectors based on the mutation classification result, and generating mutation pattern types by an improved clustering algorithm.
The method comprises the steps of preprocessing a target gene sequence obtained by sequencing, including sequence reading, quality assessment, pairing and splicing and length screening, and generating a preprocessed sequence file, wherein the steps of reading the gene sequence file output by a sequencer, establishing a corresponding relation array of a base sequence and a quality value, assessing the quality of the sequence by utilizing a self-adaptive sliding window, marking a region with the quality value lower than a preset threshold value as a low-quality region, splicing paired sequences by adopting pairing information, setting the minimum overlapping length and the allowable mismatch rate, performing length screening on the spliced sequences, removing joint sequences at two ends of the sequence by adopting a dynamic programming algorithm, writing the preprocessed sequence into a temporary file, and converting the preprocessed sequence into a target format.
The method comprises the steps of analyzing a reference sequence library based on a preprocessed sequence file, constructing a self-adaptive scoring matrix, utilizing a neural network to train and dynamically adjust a site score, and comprises the steps of reading the preprocessed sequence file, analyzing a gene sequence in the reference sequence library, counting the occurrence frequency of bases of each site, generating a frequency matrix, calculating the information entropy of each site based on the frequency matrix, determining the conservation score of the site according to the information entropy, dividing sequences in the reference sequence library into a training set and a verification set according to a preset proportion, comparing the training set with multiple sequences, extracting feature vectors of each site in the multiple sequence comparison result, inputting the feature vectors into the neural network to train, outputting the neural network based on the trained neural network to dynamically adjust the site specificity score, determining conservation grades according to the site specificity scores, and adopting differential scoring strategies for different conservation grade areas.
The method comprises the steps of utilizing a self-adaptive scoring matrix, adopting a sliding window to scan sequence differences and combining Gaussian mixture model analysis to generate a variation region information file, wherein the step comprises the steps of adopting the sliding window to scan sequence differences based on a differential scoring strategy, marking a region with the mismatch rate exceeding a preset threshold as a potential variation region, extracting sequence contexts before and after the potential variation region, calculating local background characteristics, utilizing the local background characteristics, adopting the Gaussian mixture model to determine a content threshold interval, integrating sequence comparison information to evaluate conservation, calculating the secondary structural stability of the sequence of the potential variation region, taking a free energy variation value as a characteristic, and combining and analyzing a plurality of variation sites which are closely adjacent based on the characteristic to generate the variation region information file.
The method comprises the steps of reading a variation region information file, constructing a base conversion matrix for sequence comparison, generating a variation classification result through analysis and comparison of paths, and comprises the steps of reading the variation region information file, constructing the base conversion matrix, determining variation types through sequence comparison, comparing a reference sequence and a target sequence of a variation region based on the base conversion matrix to obtain an optimal comparison path, dividing the variation region into three types of substitution, insertion or deletion according to the optimal comparison path, calculating feature vectors of sequences before and after division of the variation region, including base composition and structural features, analyzing sequence fragments among the variation regions, determining associated variation, and generating the variation classification result.
The method comprises the steps of carrying out standardization processing on feature vectors of mutation areas based on mutation classification results and generating mutation pattern types through an improved clustering algorithm, wherein the steps comprise the steps of reading the mutation classification results, constructing feature vectors for each mutation area, wherein the feature vectors comprise mutation type codes, position values and length values, carrying out standardization processing on the feature vectors to enable average values and standard deviations of all dimensions to meet preset conditions, carrying out grouping by adopting the improved clustering algorithm based on the feature vectors subjected to standardization processing, introducing distance weights based on sequence similarity, and determining convergence conditions on the clustering grouping results through iterative optimization to determine the mutation pattern types.
The method comprises the steps of calculating information entropy of each site based on a frequency matrix, determining a conservation score of the site according to the information entropy, calculating a logarithmic value of the occurrence frequency of a base of each site, accumulating and summing the product of the logarithmic value and the base frequency of the site, taking a negative value of the accumulated and summed result to obtain the information entropy value of the site, and comparing the information entropy value of the site with a preset reference value to determine the conservation score of the site.
The method comprises the steps of inputting feature vectors of a locus into a trained neural network to obtain predicted probability distribution vectors of various nucleotides of the locus, calculating logarithmic values of predicted probabilities of the nucleotides in the predicted probability distribution vectors, summing products of the logarithmic values and the predicted probabilities to obtain negative values, searching homologous sequences in an evolution database, carrying out multiple comparison on the obtained homologous sequences, counting occurrence frequencies of the nucleotides in multiple comparison results, calculating logarithmic products of the frequencies to obtain an evolution conservation score, and carrying out weighted combination on the predicted probability distribution calculation results and the evolution conservation score to obtain an adjustment value of the locus specificity score.
Determining conservation grades according to the site specificity scores, and adopting a differential scoring strategy for different conservation grade areas, wherein the differential scoring strategy comprises the steps of comparing the evolution conservation scores of sites with a preset first threshold value, comparing information entropy with a preset second threshold value, dividing the sites into highly-conserved areas, variable areas or medium-conserved areas according to comparison results, setting reference values of a first group of matching scores, mismatch scores and gap penalties for the highly-conserved areas, setting reference values of a second group of matching scores, mismatch scores and gap penalties for the variable areas, adopting weighted average values of the first group of reference values and the second group of reference values for the medium-conserved areas, calculating probability distribution of substitution of each type, and correcting the reference values of the conservation grade areas according to the substitution probability distribution to obtain final scoring values.
A pattern recognition device for medical gene sequencing data is characterized by comprising a processor, a memory and a system bus, wherein the processor and the memory are connected through the system bus, and the memory is used for storing one or more programs, and the one or more programs comprise instructions which when executed by the processor cause the processor to execute the pattern recognition method for medical gene sequencing data according to any one of the embodiments.
In the pattern recognition method and the pattern recognition device for the medical gene sequencing data disclosed by the embodiment of the application, an adaptive sliding window mechanism is introduced in a sequence preprocessing link, so that the accuracy of quality control can be improved. Secondly, in the aspect of a sequence comparison algorithm, a self-adaptive scoring matrix driven by machine learning is constructed, and dynamic adjustment of the bit-point specificity score is realized through a three-layer neural network structure. In the aspects of mutation site detection and classification, the embodiment of the application establishes a set of multi-level analysis framework, and realizes the positioning of potential mutation regions. By analyzing the distance between the variation regions and the co-occurrence characteristics, the complex mutation mode is identified. In the mode cluster analysis link, a distance weight mechanism based on sequence similarity is introduced, so that the accuracy of clustering is remarkably improved, and the capability of processing complex variation modes is enhanced.
Detailed Description
Various exemplary embodiments of the present disclosure will now be described in detail with reference to the accompanying drawings. It should be noted that the relative arrangement of the components and steps, numerical expressions and numerical values set forth in these embodiments do not limit the scope of the present disclosure unless it is specifically stated otherwise.
It will be appreciated by those of skill in the art that the terms "first," "second," etc. in embodiments of the present disclosure are used merely to distinguish between different steps, devices or modules, etc., and do not represent any particular technical meaning nor necessarily logical order between them. It should also be understood that in embodiments of the present disclosure, "plurality" may refer to two or more, and "at least one" may refer to one, two or more. It should also be appreciated that any component, data, or structure referred to in the presently disclosed embodiments may be generally understood as one or more without explicit limitation or the contrary in the context. In addition, the term "and/or" in this disclosure is merely an association relation describing the association object, and indicates that three kinds of relations may exist, for example, a and/or B may indicate that a exists alone, and a and B exist together, and B exists alone. In addition, the character "/" in this disclosure generally indicates that the related object is an or relationship, and it should be further understood that the description of the embodiments in this disclosure emphasizes the differences between the embodiments, and the same or similar features may be referred to each other, so that they will not be repeated for brevity.
Meanwhile, it should be understood that the sizes of the respective parts shown in the drawings are not drawn in actual scale for convenience of description. The following description of at least one exemplary embodiment is merely illustrative in nature and is in no way intended to limit the disclosure, its application, or uses. Techniques, methods, and apparatus known to one of ordinary skill in the relevant art may not be discussed in detail, but are intended to be part of the specification where appropriate. It should be noted that like reference numerals and letters refer to like items in the following figures, and thus once an item is defined in one figure, no further discussion thereof is necessary in subsequent figures.
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present application more apparent, the technical solutions of the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present application, and it is apparent that the described embodiments are some embodiments of the present application, but not all embodiments of the present application. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
FIG. 1 is a flow chart of a pattern recognition method for medical gene sequencing data according to an embodiment of the present application.
It is to be understood that, in the field of analysis of gene sequencing data, currently most commonly used tools for sequence alignment and mutation detection include BLAST (Basic Local ALIGNMENT SEARCH Tool), BWA (Burows-WHEELER ALIGNER), GATK (Genome Analysis Toolkit), and the like. BLAST uses seed expansion strategy to align sequences, uses fixed length seed sequences (typically 11 bp) as anchor points, and although it is advantageous in terms of computational efficiency, its scoring matrix uses fixed BLOSUM or PAM matrix, and cannot dynamically adjust parameters according to sequence specificity, which results in easy occurrence of false positives when dealing with highly variable regions. BWA relies mainly on the Burrows-Wheeler transform for sequence indexing, and uses a uniform quality threshold (typically Q20 or Q30) for filtering during sequence preprocessing, and fails to consider the influence of local sequence features on quality assessment. In the mutation detection process of the GATK, although a mutation detection algorithm based on a Bayesian model is introduced, the mutation site positioning mainly depends on simple mismatch rate statistics, and independent assumptions are often adopted when adjacent mutation is processed, so that complex associated mutation modes cannot be effectively identified and analyzed.
In contrast, the embodiment of the application performs targeted technical improvement and parameter optimization in a plurality of key links. In the sequence preprocessing stage, different from a BWA fixed quality threshold strategy, the embodiment of the application introduces a self-adaptive window mechanism based on local GC content, the window size is dynamically adjusted within the range of 30-45bp, and the base with low quality value is directionally corrected by combining with the Phred algorithm, so that the accuracy of data preprocessing is remarkably improved. And in the sequence comparison algorithm level, the limitation of a BLAST fixed scoring matrix is broken through, and a self-adaptive scoring system based on deep learning is constructed. The system integrates 23-dimensional characteristic information comprising nucleotide types, local GC content, secondary structure prediction and the like through a three-layer neural network (128-64-32 node structure), and realizes accurate scoring of different conservation areas. In particular when processing highly variable regions, the system can be based on the entropy of the information of the sitesAnd evolutionary conservation scoreThe match score and mismatch score are automatically adjusted to reveal a more accurate sequence alignment than the fixed scoring matrix.
In the mutation detection and classification links, the embodiment of the application overcomes the defects of GATK and other tools in the process of complex mutation modes. By introducing a scanning mechanism with the size of a 23bp window and the step length of 7bp and combining dynamic adjustment of the GC content threshold by a Gaussian mixture model, the high-precision positioning of the potential variation region is realized. Particularly, when adjacent mutation is processed, a 3×4 base conversion matrix and a correlation mutation analysis mechanism based on a 34bp distance threshold are designed, and when the co-occurrence frequency between the mutation exceeds 0.567 and the sequence characteristic similarity is higher than 0.823, the system can automatically identify the correlation mutation as the correlation mutation and perform unified processing. This approach significantly improves the ability to identify complex mutation patterns. In the final pattern clustering analysis, the more accurate classification and identification of the mutation pattern are realized by constructing 23-dimensional feature vectors comprising mutation type 3-dimensional one-hot codes, position information, length features and the like and introducing a distance weight mechanism based on sequence similarity (the weight is increased by 1.2 times when the similarity is more than 0.845). Not only do these technical innovations overcome the limitations of existing tools in dealing with complex patterns of variation, but a more rapid and reliable framework for analysis of gene sequencing data is also disclosed.
As shown in fig. 1, at step S101, a target gene sequence obtained by sequencing is preprocessed, including sequence reading, quality evaluation, pairing and splicing, and length screening, to generate a preprocessed sequence file. The method comprises the steps of reading a gene sequence file output by a sequencer, establishing a corresponding relation array of a base sequence and a quality value, evaluating the quality of the sequence by utilizing a self-adaptive sliding window, marking a region with the quality value lower than a preset threshold value as a low-quality region, splicing paired sequences by adopting pairing information, setting the minimum overlapping length and the allowable mismatch rate, screening the lengths of the spliced sequences, removing joint sequences at two ends of the sequence by adopting a dynamic programming algorithm, writing the preprocessed sequence into a temporary file, and converting the preprocessed sequence into a target format.
In one embodiment, the sequence of the target gene obtained by sequencing is subjected to data preprocessing. And reading the FASTQ format file output by the sequencer into a memory, establishing a corresponding relation array of base sequences and quality values, extracting three fields of ID identification, base sequences and quality character strings of each sequence, and respectively storing the three fields in the character string array with the length of N, wherein N is the total number of the sequences. And (3) evaluating the sequence quality by adopting a self-adaptive sliding window method, wherein the window size is dynamically adjusted within a range of 30-45bp according to the local GC content, and the step length correspondingly floats within a range of 3-7 bp. When the average magnitude of the homogeneity within the window is below 23, this region is marked as a low quality region. Sequences with sequencing reads less than 67bp or low mass regions exceeding 12% of the total length were filtered. And calculating the mass value of each base by using a Phred algorithm, and correcting the base with the mass value lower than 18. And splicing the paired sequences by utilizing double-end pairing information, wherein the minimum overlapping length is set to be 31bp during splicing, and the allowable mismatch rate is 0.8-0.9%. And (3) carrying out length screening on the spliced sequences, and reserving sequence fragments with the length in the range of 89-457 bp. And removing the joint sequences at two ends of the sequence by adopting a dynamic programming algorithm, and simultaneously removing areas with more than 4N continuously appearing in the sequence. Writing the preprocessed sequence and the quality information thereof into a temporary file in a binary format, and recording statistical information such as sequence number, length interval and the like at the head of the file. The processed sequences are converted into FASTA format and each sequence is assigned a unique identifier.
At step S102, a reference sequence library is analyzed based on the preprocessed sequence file, an adaptive scoring matrix is constructed, and the site scores are dynamically adjusted using neural network training. The method comprises the steps of reading a pretreatment sequence file, analyzing a gene sequence in a reference sequence library, counting the base occurrence frequency of each site to generate a frequency matrix, calculating the information entropy of each site based on the frequency matrix, determining the conservation score of the site according to the information entropy, dividing the sequence in the reference sequence library into a training set and a verification set according to a preset proportion, comparing the training set with multiple sequences, extracting the characteristic vector of each site in the multiple sequence comparison result, inputting the characteristic vector into a neural network for training, outputting the neural network based on the trained neural network, dynamically adjusting the specific score of the site, determining the conservation grade according to the specific score of the site, and adopting a differential scoring strategy for different conservation grade areas.
Specifically, an improved sequence alignment algorithm model is constructed. By analyzing 1000 known E2 gene sequences in a reference sequence library, the base occurrence frequency of each position is counted, and a4 xL frequency matrix is generated, wherein L is the sequence length. For each site i, calculating its information entropyWhereinIs the frequency of occurrence of base j at position i. And determining the conservation score of the locus according to the information entropy. And constructing a self-adaptive scoring matrix driven by machine learning based on a Smith-Waterman local comparison algorithm. By analyzing the E2 gene sequence group in the reference sequence library, a deep learning method is adopted to calculate the dynamic conservation weight for each position. For the highly conserved region, the alignment parameters are dynamically adjusted according to sequence statistics, the matching score ranges from 2.0 to 2.5, the mismatch deduction ranges from-1.5 to-2.0, the gap penalty ranges from-1.8 to-2.3, and for the variable region, the system calculates the corresponding parameter interval through a sequence evolution model. A sparse matrix is used to store a dynamic programming scoring matrix, recording only non-zero elements with scores greater than a threshold. In compressed row storage format, each non-zero element records its column index and score value. By maintaining an index range of the active computation band, invalid computations in the matrix filling process are reduced. The minimum similarity threshold for sequence alignment was set to 87.3% and the maximum allowed gap ratio was set to 3.2%. In the comparison process, an improved backtracking algorithm is adopted, a search space is optimized through setting branch-and-bound conditions, and branches are directly pruned when the local comparison score is lower than a threshold value of 13.7. For sequences with the length exceeding 234bp, a segmentation comparison strategy is adopted, the sequences are divided into a plurality of overlapped subsequences for comparison, the subsequences are 156bp in length, and the overlapping length of adjacent subsequences is 47bp. And reading the preprocessed temporary file in a memory mapping mode, so as to realize efficient data access.
Further, calculating information entropy of each site based on the frequency matrix, and determining a conservation score of the site according to the information entropy, wherein the information entropy comprises calculating a logarithmic value of the occurrence frequency of the base of each site, and accumulating and summing the product of the logarithmic value and the base frequency of the site; and comparing the site information entropy value with a preset reference value to determine the conservation score of the site.
The method comprises the steps of inputting feature vectors of a locus into the trained neural network to obtain predicted probability distribution vectors of various nucleotides of the locus, calculating logarithmic values of predicted probabilities of the nucleotides in the predicted probability distribution vectors, summing products of the logarithmic values and the predicted probabilities to take negative values, searching homologous sequences in an evolution database, carrying out multiple comparison on the obtained homologous sequences, counting occurrence frequencies of the nucleotides in the multiple comparison results, calculating logarithmic products of the frequencies to obtain an evolution conservation score, and carrying out weighted combination on the calculated results of the predicted probability distribution and the evolution conservation score to obtain an adjustment value of the locus specificity score.
Determining conservation grades according to the site specificity scores, adopting a differential scoring strategy for different conservation grade areas, comprising the steps of comparing the evolution conservation scores of the sites with a preset first threshold value, comparing information entropy with a preset second threshold value, dividing the sites into highly-conserved areas, variable areas or medium-conserved areas according to comparison results, setting reference values of a first group of matching scores, mismatch deductions and gap penalties for the highly-conserved areas, setting reference values of a second group of matching scores, mismatch deductions and gap penalties for the variable areas, adopting weighted average values of the first group of reference values and the second group of reference values for the medium-conserved areas, calculating probability distribution of each type of substitution, and correcting the reference values of each conservation grade area according to the substitution probability distribution to obtain a final scoring value.
In one embodiment, a scoring matrix for the Smith-Waterman local alignment algorithm is first constructed. E2 gene sequences in a reference sequence library are randomly divided into a training set and a verification set according to the proportion of 78.9%, and sequences in the training set are subjected to multi-sequence comparison to generate an initial position specificity scoring matrix. Obtaining a site-specific score by calculating the nucleotide frequency distribution at each position
WhereinIndicating the frequency of occurrence of nucleotide j at position i,Indicating the background frequency of the nucleotide. For any two sequence alignment positions (i, j), the initial scoring value is passedIntegration is performed with the nucleotide composition feature vector within the 29bp window of the local sequence context. For each site, features including nucleotide type, local GC content, secondary structure prediction score, etc. are extracted, and a 23-dimensional feature vector is constructed. These feature vectors are input into a feed-forward neural network comprising three hidden layers, 128, 64 and 32 neurons in number, respectively, using the ReLU activation function. Network training is performed by minimizing the cross entropy loss function on the verification set, an Adam optimizer is used for parameter updating, the initial learning rate is set to 0.00234, and training is stopped when the relative change rate of loss values of 5 epoch verification sets in succession is smaller than 0.00127.
After the trained neural network is obtained, the network output is utilized to dynamically adjust the position point specificity score. For each locus i, inputting the feature vector of the locus i into a trained neural network to obtain a probability distribution vector p i of an output layer. Based on the probability distribution, information entropy of the sites is calculatedWhereinRepresenting the predicted probability of the kth nucleotide occurring at position i. Combining information entropy with sequence evolution information, searching homologous sequences in an evolution database through BLAST, and setting an E-value threshold to be 1E-12. Multiple comparison is carried out on the obtained homologous sequences, and the evolutionary conservation score is calculatedWhereinThe frequency of the kth nucleotide in the multiplex alignment for this site. Based on the information entropy and the evolutionary conservation score, the conservation grade of the locus is calculated whenBelow threshold 0.537 and H i below 0.682, this site is marked as a highly conserved region, whenAbove 0.892 or H i above 1.237, this position is marked as the variable region and the remaining positions are marked as the moderately conserved regions.
Differential scoring strategies are employed for regions of different conservation classes. For the highly conserved region, a match score benchmark of 2.37 was set, a mismatch score benchmark of-1.83, and a gap penalty benchmark of-2.13. The reference values are adjusted to 1.73, -1.27 and-1.62 for the variable regions, respectively. The reference value for the medium conserved region is then a weighted average of the two. On this basis, the evolution pattern information of the binding sites fine-tunes the baseline value. Calculating a substitution pattern probability distribution for each site by analyzing homologous sequences in the evolution databaseThe probability of mutation of nucleotide j to k is indicated. For any aligned position (i, j), the actual scoring value is calculated by scoring = if it is a matching siteIf mismatch sites, score =If it is a gap, penalty =WhereinIn order to adjust the coefficient of the power supply,Indicating the probability of deletion of nucleotide j.
At step S103, the sequence differences are scanned using a sliding window and analyzed in combination with a gaussian mixture model using an adaptive scoring matrix to generate a variance region information file. The method comprises the steps of scanning sequence differences by adopting a sliding window based on a differential scoring strategy, marking a region with a mismatch rate exceeding a preset threshold as a potential variation region, extracting sequence contexts before and after the potential variation region, calculating local background features, determining a content threshold interval by utilizing the local background features and adopting a Gaussian mixture model, evaluating conservation of integrated sequence comparison information, calculating the secondary structural stability of the sequence of the potential variation region, taking a free energy variation value as a feature, and carrying out merging analysis on a plurality of variation sites which are closely adjacent based on the feature to generate a variation region information file.
For the localization of the site of sequence variation. Based on the comparison result, the sequence difference is scanned by adopting a sliding window method, the window size is set to be 23bp, and the step length is 7bp. When the mismatch within the window exceeds 6.8%, the region is marked as a potential variant region. For each potential mutation region, the sequence context of 50bp before and after each potential mutation region is extracted. Local background features of the 101bp long sequence fragment were calculated, including GC content distribution, repeat density, and the like. And dynamically determining a GC content threshold interval based on a local sequence background by adopting a Gaussian mixture model, and simultaneously integrating sequence alignment information among species to evaluate evolution conservation. And (3) cutting the sequence with the step length of 3bp, counting the occurrence times of all possible triplets, and normalizing to obtain the complexity score. Regions with a complexity greater than 0.783 and meeting the dynamic GC content threshold were determined to be reliable variant regions. And storing each determined mutation area information in a mutation feature array, wherein each array element comprises a starting position, a termination position, a reference sequence fragment and a target sequence fragment of the mutation area. For each detected mutation region, calculating the secondary structural stability of the local sequence, and taking the free energy change value as one of important characteristics of mutation sites. For a plurality of closely adjacent mutation sites, if the distance is less than 12bp, they are combined into one mutation region for analysis. And writing all the identified mutation area information into an intermediate file to prepare for subsequent classification labeling.
In step S104, the mutation region information file is read, a base conversion matrix is constructed for sequence comparison, and mutation classification results are generated by analyzing the comparison path. The method comprises the steps of reading a variation region information file, constructing a base conversion matrix, determining variation types through sequence comparison, comparing a reference sequence and a target sequence of a variation region based on the base conversion matrix, obtaining an optimal comparison path, dividing the variation region into three types of replacement, insertion or deletion according to the optimal comparison path, calculating feature vectors of sequences before and after the divided variation region, including base composition and structural features, analyzing sequence fragments among the variation regions, determining associated variation, and generating a variation classification result.
In one embodiment, the identified variations are labeled for classification. Reading the mutation area information file generated in the previous step, constructing a 3 multiplied by 4 base conversion matrix, wherein the rows represent the original bases A/T/C/G, and the columns represent the mutated bases. For each mutation region, determining specific mutation type by sequence alignment, namely firstly constructing a local alignment matrix, setting a matching score to be 1.0, mismatch deduction to be-0.7, gap initiation penalty to be-1.2 and gap extension penalty to be-0.4. Based on the scoring scheme, the reference sequence and the target sequence of the variation region are precisely compared, and an optimal comparison path is obtained. According to the conditions of matching, mismatch and vacancy in the alignment path, the mutation region is divided into three basic types of substitution, insertion or deletion. Information such as mutation type, start position, mutation length, affected base sequence, etc. is recorded for each mutation region. Meanwhile, feature vectors of 21bp sequences before and after the mutation region are calculated, wherein the feature vectors comprise information such as base composition, secondary structure features, local complexity and the like. When the distance between two mutation areas is found to be smaller than 34bp, extracting sequence fragments between the mutation areas, analyzing the conservation and functional characteristics of the sequences, if the co-occurrence frequency is more than 0.567 and the similarity of the sequence characteristics is higher than 0.823, marking the mutation as associated mutation, and combining the characteristic information of the mutation as associated mutation. And storing the classification labeling result in a structural format, wherein the classification labeling result comprises a plurality of data tables such as basic variation information, sequence characteristics, association relations and the like.
At step S105, the mutation region feature vector is normalized based on the mutation classification result, and the mutation pattern type is generated by the improved clustering algorithm. The method comprises the steps of reading a variation classification result, constructing a feature vector for each variation region, wherein the feature vector comprises features of variation type codes, position values and length values, carrying out standardization processing on the feature vector to enable average values and standard deviations of all dimensions to meet preset conditions, carrying out grouping by adopting an improved clustering algorithm based on the feature vector subjected to the standardization processing, introducing distance weights based on sequence similarity, and determining convergence conditions on the clustering grouping result through iterative optimization to determine mutation mode types.
And (5) establishing a cluster analysis flow of the mutation mode. The mutation classification result of the previous step can be read, 23-dimensional feature vectors are constructed for each mutation region, namely, the mutation type is converted into 3-dimensional one-hot codes, the normalized relative position value is 1-dimensional, the logarithmic value of the mutation length is 1-dimensional, the local GC content is 1-dimensional, the sequence complexity is 1-dimensional, and the base composition of each 8bp context is represented by 16-dimensional. And (3) carrying out standardization processing on all the feature vectors to ensure that the average value of each dimension is 0 and the standard deviation is 1. The improved K-means clustering algorithm is adopted for grouping, and the initial clustering center number is set to be 4. In the clustering process, a distance weight based on sequence similarity is introduced, namely when the sequence context similarity of two variant regions is higher than 0.845, the distance weight of the two variant regions in the feature space is increased by 1.2 times. Through the iterative optimization process, when the ratio change rate of the inter-class distance to the intra-class distance is smaller than 0.0023 and the cluster center position change amounts of 3 continuous iterations are smaller than 0.0067, the clusters are considered to be converged. And integrating the sequence characteristics of the variation regions in each class and the associated variation information according to the clustering result, and generating the characteristic description abstract of the class. Comparing the typical variation sequence in each cluster with the known patterns in the reference database, adopting a local sequence comparison algorithm, setting a matching threshold value to be 0.8-0.85, and determining the mutation pattern type represented by each cluster.
FIG. 2 is a schematic diagram showing a sequence conservation analysis of E2 gene according to the embodiment of the present application. As shown in the figure, the E2 gene sequence conservation analysis schematic comprises an upper graphic component, a middle graphic component and a lower graphic component which are mutually related. Wherein, the abscissa adopts sequence position values ranging from 0 to 450.
In the first graphic element above, the ordinate indicates the base frequency, and the value ranges from 0 to 2.00. The A, T, C, G bases are shown as red, green, blue and bluish, respectively. The data show that the frequency of the A base is stably maintained between 1.00 and 1.25 in the three sequence intervals of 50 to 150, 180 to 280 and 300 to 400, the frequency of other bases is relatively low, and the four base frequencies have obvious fluctuation in the rest intervals, and the maximum fluctuation amplitude reaches 0.75.
In the middle second graphic assembly, the ordinate represents the local feature value, and the value range is 0-2.0. The black curve shows that the local eigenvalues remain stable around 0.5 in the highly conserved region and rise above 1.5 in the variable region. The background area is divided into four colors of gray, green, orange and powder, and the four colors respectively correspond to the sequence intervals of 0-100, 100-200, 200-300 and 300-400. Sequencing data shows that when the characteristic value is less than 0.5, the sequence conservation of the corresponding region reaches more than 87.3 percent.
In the lower third graphical component, the ordinate represents the conservation score, ranging from 0 to 1.0. The purple solid line represents the site-specific score, and the degree of conservation is divided into three intervals by the red (0.537) and green (0.892) dashed lines. The data show that the conservation score is lower than 0.6 in the sequence intervals of 100-150, 200-250, 300-350, etc., and generally higher than 0.8 in other intervals. The locus is marked as a highly conserved region when the conservation score is above 0.892, as a variable region when the conservation score is below 0.537, and as a moderately conserved region when the score is between the two.
FIG. 3 is a comparison of a sequencing mass distribution graph according to an embodiment of the present application. The graph comprises three subgraphs, namely an upper sequencing mass interval distribution box diagram, a lower left GC content and mass score correlation diagram and a lower right dynamic step length density distribution diagram.
As shown in the box diagram, the horizontal axis is a sequencing read long interval (bp), the sequence is divided into five sections according to 89-156bp, 156-234bp, 234-347bp, 347-457bp and 457bp+, and the vertical axis is a Phred quality score. Wherein the blue bin line data set corresponds to the Smith-Waterman traditional method and the green bin line data set corresponds to the adaptive sliding window method of the embodiments of the application, the red dashed line is located at the mered quality score of 23.7. By the self-adaptive sliding window method, the window size is dynamically adjusted within the range of 30-45bp, and the step length correspondingly floats within the range of 3-7bp, so that the Phred quality score of each reading length interval is obviously higher than the quality threshold line by 23.7. The concrete numerical values are that the median value of the quality score in the 89-156bp interval is increased from 20.5 to 24.7,156-234bp interval from 21.8 to 26.5,234-347bp interval from 23.2 to 27.8,347-457bp interval from 25.6 to 29.3, and the interval above 457bp is increased from 27.9 to 31.2.
In the GC content correlation diagram shown in the lower left, the horizontal axis represents the GC content interval (%), the range is 30-60%, and the left vertical axis represents the mass score (bp). Wherein the blue line is a reference line, and the green line and the light envelope region thereof represent the mass distribution of the adaptive method according to the embodiment of the present application. The mass score was low when the GC content was in the interval of 35-40%, the lowest point corresponding to 33.5 score at 42% GC content, and then on an ascending trend with increasing GC content, reaching 37.5 mass score when 55%.
In the step distribution density map shown in the lower right, the horizontal axis represents the dynamic step range (bp), and the vertical axis represents the frequency density. The curve is unimodal, the peak is located at 5bp, and the corresponding frequency density is 0.8. The frequency density values between 3-4bp and 6-7bp are relatively low, and the symmetrical descending trend is shown.
Further, the embodiment of the application also discloses a pattern recognition device of medical gene sequencing data, which comprises a processor, a memory and a system bus, wherein the processor and the memory are connected through the system bus, and the memory is used for storing one or more programs, and the one or more programs comprise instructions which when executed by the processor, cause the processor to execute any one of the methods.
In summary, in the embodiment of the application, an adaptive sliding window mechanism is introduced in the sequence preprocessing link, the window size is dynamically adjusted within the range of 30-45bp according to the local GC content, and the step length correspondingly floats within the range of 3-7bp, so that the accuracy of quality control is remarkably improved. Secondly, in the aspect of a sequence comparison algorithm, a self-adaptive scoring matrix driven by machine learning is constructed, and dynamic adjustment of the bit-point specificity score is realized through a three-layer neural network structure (the number of hidden layer neurons is 128, 64 and 32 respectively). The network realizes a differential scoring strategy for different conservation areas by integrating 23-dimensional characteristics such as nucleotide types, local GC content, secondary structure prediction scores and the like. Specifically, the system dynamically adjusts scoring parameters based on the information entropy (H i) and the evolutionary conservation score (C i) of the locus, identifying the locus as a highly conserved region when C i is below 0.537 and H i is below 0.682, using a higher matching score (2.37) and a stricter mismatch score (-1.83), and using a more flexible parameter setting for the variable region (C i is above 0.892 or H i is above 1.237).
In terms of mutation site detection and classification, the embodiment of the application establishes a set of multi-level analysis framework. The sequence difference is scanned by a sliding window method (the window size is 23bp, the step length is 7 bp), and the GC content threshold interval is dynamically determined by combining a Gaussian mixture model, so that the accurate positioning of the potential variation region is realized. In the mutation classification process, a3×4 base conversion matrix and an associated mutation analysis mechanism are introduced, and the accurate identification of the complex mutation mode is realized by analyzing the distance between mutation areas (threshold 34 bp) and the co-occurrence characteristic (frequency threshold 0.567). Finally, in the mode cluster analysis link, a 23-dimensional feature vector system is designed, the system comprises 3-dimensional one-hot codes of variation types, position information, length features and sequence context information, and the accurate classification of variation modes is realized through an improved K-means algorithm, wherein a distance weight mechanism based on sequence similarity is introduced, and when the sequence context similarity is higher than 0.845, the feature space distance weight is increased by 1.2 times, so that the clustering accuracy is remarkably improved. The technical improvements effectively solve the limitation of the traditional method in processing complex mutation modes, and realize more accurate and rapid analysis and identification of medical gene sequencing data.
Further, embodiments of the present application also disclose a computer program product which, when run on a terminal device, causes the terminal device to perform any of the methods described above.
From the above description of embodiments, it will be apparent to those skilled in the art that all or part of the steps of the above described example methods may be implemented in software plus necessary general purpose hardware platforms. Based on such understanding, the technical solution of the present application may be embodied essentially or in a part contributing to the prior art in the form of a software product, which may be stored in a storage medium, such as ROM/RAM, a magnetic disk, an optical disk, etc., including several instructions for causing a computer device (which may be a personal computer, a server, or a network communication device such as a media gateway, etc.) to execute the method described in the embodiments or some parts of the embodiments of the present application.
It should be noted that, in the present description, each embodiment is described in a progressive manner, and each embodiment is mainly described in a different manner from other embodiments, and identical and similar parts between the embodiments are all enough to refer to each other. For the device disclosed in the embodiment, since it corresponds to the method disclosed in the embodiment, the description is relatively simple, and the relevant points refer to the description of the method section.
It is further noted that relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises an element.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.