CN112580691B - Term matching method, matching system and storage medium for metadata field - Google Patents
- Publication number
- CN112580691B (application CN202011342621.XA)
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/28—Determining representative reference patterns, e.g. by averaging or distorting; Generating dictionaries
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/38—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
Abstract
The application discloses a term matching method, a matching system and a storage medium for metadata fields. The method comprises the following steps: preprocessing the first search terms in a metadata training set to obtain second search terms; classifying the second search terms and matching them against the vocabulary to be trained in a vocabulary database to be trained, to obtain the to-be-retrieved terms, that is, the terms still awaiting a match; segmenting the to-be-retrieved terms with a conditional random field word segmentation algorithm to obtain a plurality of third search terms for each to-be-retrieved term, and matching the third search terms against the vocabulary to be trained to determine matching words; updating the vocabulary database to be trained according to the matching words, and training the conditional random field word segmentation algorithm to obtain a trained classifier and a trained vocabulary; and matching the words to be matched in the metadata fields to be matched by using the trained classifier and the trained vocabulary. Because the to-be-retrieved terms are segmented with the conditional random field word segmentation algorithm before the matching words are determined, the matching accuracy can be improved while the matching remains automatic.
Description
Technical Field
The present application relates to the field of data identification technologies, and in particular, to a method, a system, and a storage medium for matching terms of metadata fields.
Background
Most businesses spend a great deal of time and effort sorting out and integrating cluttered data. Their employees either cannot find the appropriate data or do not trust the data they find. Most importantly, various industry regulations restrict self-service and data autonomy processes. Enterprises therefore attempt to repair data through labor-intensive tasks (writing custom programs, developing global replacement functions, and so on), which severely reduces the productivity of data analysts and data scientists. This is especially true for large businesses, where many years of parallel procurement have accumulated systems and databases of every kind, resulting in extremely complex data environments. While maintaining these legacy data environments has already become a burden for the enterprise, new data continues to be generated at an unexpected rate.
In view of the foregoing, there is a need to provide a term matching method, matching system, and storage medium for metadata fields that is automatic and highly accurate.
Disclosure of Invention
In order to solve the above problems, the present application proposes a term matching method, a matching system and a storage medium for metadata fields.
In a first aspect, the present application proposes a term matching method for metadata fields, including:
Preprocessing the first search terms in the metadata training set to obtain second search terms;
Classifying the second search terms, and matching them against the vocabulary to be trained in the vocabulary database to be trained, to obtain the to-be-retrieved terms;
Performing word segmentation on the to-be-retrieved terms by using a conditional random field word segmentation algorithm to obtain a plurality of third search terms for each to-be-retrieved term, and matching the plurality of third search terms against the vocabulary to be trained to determine matching words;
Updating the vocabulary database to be trained according to the matching words, and training the conditional random field word segmentation algorithm to obtain a trained classifier and a trained vocabulary;
And matching the metadata fields to be matched by using the trained classifier and the trained vocabulary.
Preferably, before the preprocessing is performed on the first term in the metadata training set to obtain the second term, the method further includes:
Collecting table data and cleaning illegal characters from the table data to establish a vocabulary to be trained;
And establishing a vocabulary database to be trained by using the vocabulary to be trained.
Preferably, the preprocessing the first term in the metadata training set to obtain a second term includes:
Acquiring all first search words in metadata in a metadata training set;
And removing illegal characters in each first search term to obtain a second search term corresponding to each first search term.
Preferably, classifying the second search terms and matching them against the vocabulary to be trained in the vocabulary database to be trained to obtain the to-be-retrieved terms includes:
Classifying the second search terms to obtain Chinese second search terms and non-Chinese second search terms;
Directly matching the Chinese second search terms in the vocabulary database to be trained, and determining the Chinese second search terms for which no matching word is found;
And taking the non-Chinese second search terms, together with the Chinese second search terms for which no matching word is found, as the to-be-retrieved terms.
Preferably, performing word segmentation on the to-be-retrieved terms by using a conditional random field word segmentation algorithm to obtain a plurality of third search terms for each to-be-retrieved term, and matching the plurality of third search terms against the vocabulary to be trained to determine matching words, includes:
Performing word segmentation on the longer to-be-retrieved terms by using the conditional random field word segmentation algorithm, and generating a plurality of third search terms for each word segment of the to-be-retrieved term, wherein the third search terms comprise: full spelling, simple spelling, English name and/or Chinese name;
Performing character matching between all the third search terms of each to-be-retrieved term and the vocabulary to be trained, and calculating the matching degree of each third search term corresponding to each to-be-retrieved term;
And determining the matching word of the to-be-retrieved term according to the matching degree and the matching threshold.
Preferably, matching the metadata fields to be matched by using the trained classifier and the trained vocabulary includes:
Matching the metadata fields to be matched by using the trained vocabulary, and taking the metadata fields that are not successfully matched as words to be matched;
Using the trained conditional random field word segmentation algorithm as the trained classifier to segment each word to be matched, and obtaining a plurality of third search terms corresponding to each word to be matched;
Performing character matching of the plurality of third search terms in the trained vocabulary, and calculating the matching degree of each third search term;
And determining the matching word corresponding to each word to be matched according to the matching degree and the matching threshold, wherein the matching word is the matching term of the word to be matched.
Preferably, the table data includes:
A customer terminology table, an industry standard field interpretation mapping table, and a metadata field and interpretation comparison table approved by the business system, wherein each table comprises: Chinese name, English name, full spelling, and simple spelling.
Preferably, the types of the first search term and the second search term each include: Chinese, full spelling, simple spelling, English, English abbreviations, and/or mixtures thereof.
In a second aspect, the present application proposes a term matching system for metadata fields, comprising:
A training module, used for preprocessing the first search terms in the metadata training set to obtain second search terms; classifying the second search terms, and matching them against the vocabulary to be trained in the vocabulary database to be trained, to obtain the to-be-retrieved terms; performing word segmentation on the to-be-retrieved terms by using a conditional random field word segmentation algorithm to obtain a plurality of third search terms for each to-be-retrieved term, and matching the plurality of third search terms against the vocabulary to be trained to determine matching words; and updating the vocabulary database to be trained according to the matching words, and training the conditional random field word segmentation algorithm to obtain a trained classifier and a trained vocabulary;
And a matching module, used for matching the words to be matched in the metadata fields to be matched by using the trained classifier and the trained vocabulary.
In a third aspect, the present application provides a storage medium storing computer readable instructions that, when executed by one or more processors, cause the one or more processors to perform a method of term matching of metadata fields as described above.
The application has the following advantages: the second search terms are classified and matched against the words in the vocabulary database to obtain the search terms that were not successfully matched; these search terms are then segmented with a conditional random field word segmentation algorithm and classified with a trained classifier to obtain a plurality of third search terms; the third search terms are matched against the trained vocabulary, the matching degree is calculated, and the matching words are determined according to a matching threshold. The matching accuracy of the search terms is thereby improved while the matching remains automatic.
Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for the purpose of illustrating the preferred embodiments and are not to be construed as limiting the application. Also, like reference numerals are used to designate like parts throughout the figures. In the drawings:
FIG. 1 is a schematic diagram of steps of a method for term matching of metadata fields provided by the present application;
FIG. 2 is a flow chart of a method for term matching of metadata fields provided by the present application;
FIG. 3 is a schematic diagram of the Aho-Corasick algorithm preprocessing pattern strings into a finite state automaton in the term matching method for metadata fields provided by the present application;
FIG. 4 is a schematic diagram of an adjacency matrix data structure of a term matching method for metadata fields according to the present application;
FIG. 5 is a schematic diagram of the adjacency list data structure used for the Viterbi algorithm solution in the term matching method for metadata fields provided by the present application;
FIG. 6 is a schematic diagram of a term matching system for metadata fields provided by the present application.
Detailed Description
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
In a first aspect, according to an embodiment of the present application, a term matching method for metadata fields is provided, as shown in fig. 1, including:
S101, preprocessing the first search terms in a metadata training set to obtain second search terms;
S102, classifying the second search terms, and matching them against the vocabulary to be trained in the vocabulary database to be trained, to obtain the to-be-retrieved terms;
S103, performing word segmentation on the to-be-retrieved terms by using a conditional random field word segmentation algorithm to obtain a plurality of third search terms for each to-be-retrieved term, and matching the third search terms against the vocabulary to be trained to determine matching words;
S104, updating the vocabulary database to be trained according to the matching words, and training the conditional random field word segmentation algorithm to obtain a trained classifier and a trained vocabulary;
S105, matching the metadata fields to be matched by using the trained classifier and the trained vocabulary.
Preferably, before preprocessing the first term in the metadata training set to obtain the second term, the method further includes: collecting table data and cleaning illegal characters in the table data to establish a vocabulary to be trained; a vocabulary database to be trained is established using the vocabulary to be trained.
The vocabulary database to be trained consists of the words and their word segments, together with lookup tables of the Chinese, simple spelling, full spelling, English, English abbreviation or mixed forms corresponding to each word and/or word segment.
The collected table data includes: a customer terminology table, an industry standard field interpretation mapping table, and a metadata field and interpretation comparison table approved by the business system. Each table comprises: Chinese name, English name, full spelling, simple spelling, etc.
Preprocessing the first search term in the metadata training set to obtain a second search term, including: acquiring all first search words in metadata in a metadata training set; and removing illegal characters in each first search term to obtain a second search term corresponding to each first search term.
The first search terms in the metadata training set are labeled words, that is, words whose matching words have already been determined. The types of the first search term and the second search term include: Chinese, full spelling, simple spelling, English, English abbreviations, and/or mixtures thereof.
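The preprocessing step can be sketched as follows. This is only an illustration: the source does not define which characters count as illegal, so the character class below (anything other than CJK ideographs, ASCII letters, digits and underscores) and the function name `preprocess` are assumptions.

```python
import re

# Assumption: "illegal characters" are anything other than CJK ideographs,
# ASCII letters, digits and underscores. The source does not define the set.
ILLEGAL = re.compile(r"[^\u4e00-\u9fffA-Za-z0-9_]")


def preprocess(first_terms):
    """Map each first search term to its cleaned second search term."""
    return {term: ILLEGAL.sub("", term) for term in first_terms}


second_terms = preprocess(["xing_ming#", " 姓名 "])
```

Keeping the mapping from first to second search term preserves the label attached to each first search term for the later training step.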
Classifying the second search terms and matching them against the vocabulary to be trained in the vocabulary database to be trained to obtain the to-be-retrieved terms includes:
Classifying the second search terms to obtain Chinese second search terms and non-Chinese second search terms; directly matching the Chinese second search terms in the vocabulary database to be trained, and determining the Chinese second search terms for which no matching word is found; and taking the non-Chinese second search terms, together with the Chinese second search terms for which no matching word is found, as the to-be-retrieved terms. Direct matching means a direct lookup (search) match.
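A minimal sketch of this routing step is given below. It assumes that "Chinese" means every character lies in the basic CJK Unified Ideographs block and that the vocabulary database can be modeled as a dictionary keyed by Chinese name; both are assumptions, as are the function names.

```python
def is_chinese(term):
    # Assumption: a term is "Chinese" if it is non-empty and every character
    # falls in the basic CJK Unified Ideographs block.
    return bool(term) and all("\u4e00" <= ch <= "\u9fff" for ch in term)


def route_second_terms(second_terms, vocabulary):
    """Split second search terms into directly matched and to-be-retrieved.

    vocabulary: dict keyed by Chinese name (a stand-in for the vocabulary
    database to be trained).
    """
    matched, to_be_retrieved = {}, []
    for term in second_terms:
        if is_chinese(term) and term in vocabulary:
            matched[term] = vocabulary[term]      # direct lookup match
        else:
            to_be_retrieved.append(term)          # non-Chinese or unmatched
    return matched, to_be_retrieved


vocab = {"姓名": {"full": "xingming", "simple": "xm", "english": "name"}}
matched, pending = route_second_terms(["姓名", "xm", "年龄"], vocab)
```

Here `pending` holds both the non-Chinese term and the Chinese term with no direct match, which are the inputs to the segmentation step.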
Performing word segmentation on the to-be-retrieved terms by using a conditional random field word segmentation algorithm to obtain a plurality of third search terms for each to-be-retrieved term, and matching the plurality of third search terms against the vocabulary to be trained to determine matching words, includes: segmenting the longer to-be-retrieved terms with the conditional random field word segmentation algorithm, and generating a plurality of third search terms for each word segment of the to-be-retrieved term, wherein the third search terms comprise: full spelling, simple spelling, English name and/or Chinese name; performing character matching between all third search terms of each to-be-retrieved term and the vocabulary to be trained, and calculating the matching degree of each third search term corresponding to each to-be-retrieved term; and determining the matching word of the to-be-retrieved term according to the matching degree and the matching threshold. The plurality of third search terms may also be generated directly from the to-be-retrieved term without segmentation. For example, if the second search term is a Chinese term, the third search terms corresponding to it may be generated directly.
A third search term may be Chinese, full spelling, simple spelling, English, English abbreviation and/or a mixture thereof.
Character matching is measured as the ratio of the number of characters matched between the third search term and the corresponding item (word) in the vocabulary to be trained to the length of the longer of the two character strings. The matching degree CD is defined as:
$$CD = \frac{I_C}{L_{MAX}}$$
where $I_C$ is the number of matched characters and $L_{MAX}$ is the larger of the number of characters in the third search term and the number of characters in the corresponding item in the vocabulary to be trained.
Matching the metadata fields to be matched by using the trained classifier and the trained vocabulary includes: matching the metadata fields to be matched by using the trained vocabulary, and taking the metadata fields that are not successfully matched as words to be matched; using the trained conditional random field (CRF) word segmentation algorithm as the trained classifier to segment each word to be matched, obtaining a plurality of third search terms corresponding to each word to be matched; performing character matching of the third search terms in the trained vocabulary, and calculating the matching degree of each third search term; and determining the matching word corresponding to each word to be matched according to the matching degree and the matching threshold, wherein the matching word is the matching term of the word to be matched.
When a matching word is used to update the trained vocabulary database, the matching word is stored together with the first, second and/or third search terms corresponding to it. The vocabulary database includes a plurality of vocabularies.
During matching, results are ranked by decision suggestion and by matching degree. The decision suggestion is the matching word (matching term) with the largest matching degree, which is taken as the term for the word to be matched. Under matching-degree ranking, if several classification results share the same largest matching degree, the matching word that is ranked first, or whose matching degree was calculated first, is selected as the term for the word to be matched.
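The ranking rule described above (largest matching degree wins; on a tie, the candidate calculated first wins) can be sketched as below. The function name and the choice of treating the threshold as inclusive (`>=`) are assumptions; the source does not state whether the comparison is strict.

```python
def pick_decision(candidates, threshold):
    """Return the (matching_word, degree) pair to suggest, or None.

    candidates: list of (matching_word, degree) pairs in the order in which
    their matching degrees were calculated.
    """
    best = None
    for word, degree in candidates:
        if degree < threshold:        # assumption: the threshold is inclusive
            continue
        # Strict '>' keeps the earliest candidate when degrees are tied.
        if best is None or degree > best[1]:
            best = (word, degree)
    return best


suggestion = pick_decision([("a", 0.8), ("b", 0.9), ("c", 0.9)], threshold=0.5)
```

With the sample input, "b" and "c" tie at 0.9 and "b" is returned because it was calculated first.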
The following is a further explanation of the embodiments of the present application.
As shown in FIG. 2, a vocabulary to be trained must first be established for matching against the terms in the metadata training set. To this end, at least the following table data are collected: customer terminology tables (containing Chinese names, English names, full spelling codes and/or simple spelling codes), industry standard field interpretation mapping tables, and metadata field and interpretation comparison tables approved by the business system. A model is then pre-trained on existing industry-oriented datasets and labels to form a machine-learning-based classifier that provides decision suggestions and term assignment rules. The confidence score of a term match is expressed on a scale of 0 to 1 and is modeled on the order of letters in the keywords and the number of matches.
The collected table data is first cleaned by removing illegal characters. The words and their word segments are then established, forming a vocabulary to be trained that consists of lookup tables of the Chinese (Chinese name), simple spelling code, full spelling code, English (English name), English abbreviation or mixed forms corresponding to each word and/or word segment. The vocabulary database to be trained is established from the vocabulary to be trained.
Metadata whose search terms (first search terms) have determined meanings are obtained to form the metadata training set; illegal characters are removed from each first search term to obtain the corresponding second search term.
The type of each second search term is then determined, separating Chinese words from non-Chinese words to obtain Chinese second search terms and non-Chinese second search terms. The types of the second search term include: full spelling, simple spelling, English abbreviations and/or mixtures thereof. The Chinese second search terms are looked up directly in the vocabulary database to be trained; the Chinese second search terms for which no matching word is found, together with the non-Chinese second search terms, are taken as the to-be-retrieved terms. The Chinese search terms for which a matching word is found are used directly to update the vocabulary database to be trained.
The to-be-retrieved terms are segmented with the conditional random field word segmentation algorithm to obtain a plurality of third search terms for each to-be-retrieved term, and the third search terms are matched against the vocabulary to be trained to determine matching words. Specifically, the third search terms can be generated directly from the to-be-retrieved term, or a longer to-be-retrieved term can first be segmented with the conditional random field word segmentation algorithm and a plurality of third search terms generated for each of its word segments. The third search terms comprise: full spelling, simple spelling, English name and/or Chinese name. For a Chinese search term, the full spelling code and the simple spelling code are obtained according to the pinyin spelling rules, the English name is obtained by lookup in a Chinese-English dictionary, and the English abbreviation is obtained from an English-to-abbreviation comparison table.
Assume the to-be-retrieved term is the Chinese word for "name". Its third search terms are generated directly and include: name, xingming, and/or xm, etc. Assume instead that the to-be-retrieved term is xing_ming. The term is segmented with the conditional random field word segmentation algorithm and illegal characters are removed from each word segment, giving the two word segments xing and ming, from which third search terms are generated. The third search terms corresponding to the word segment xing include: surname (in Chinese), surname, family name, last name, x, etc.; the third search terms corresponding to the word segment ming include: name (in Chinese), given name, first name, m, etc.
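The xing_ming example can be sketched as follows. Two things here are stand-ins and not the patent's method: the CRF segmenter is replaced by a plain split on separator characters, and the pinyin and dictionary lookups are replaced by a tiny hand-written table (`ROWS`). All names in the block are hypothetical.

```python
import re

# Hypothetical toy lexicon: each row lists equivalent forms of one concept.
ROWS = [
    {"chinese": "姓名", "full": "xingming", "simple": "xm",
     "english": ["name"]},
    {"chinese": "姓", "full": "xing", "simple": "x",
     "english": ["surname", "family name", "last name"]},
    {"chinese": "名", "full": "ming", "simple": "m",
     "english": ["name", "given name", "first name"]},
]


def variants(segment):
    """Collect every form that shares a lexicon row with the segment."""
    segment = segment.lower()
    found = []
    for row in ROWS:
        forms = [row["chinese"], row["full"], row["simple"], *row["english"]]
        if segment in forms:
            found.extend(f for f in forms if f not in found)
    return found


def third_search_terms(term):
    # Stand-in for the CRF segmenter: split on anything that is not a
    # letter, digit or CJK ideograph (so "_" and spaces act as separators).
    segments = [s for s in re.split(r"[^0-9A-Za-z\u4e00-\u9fff]+", term) if s]
    return {segment: variants(segment) for segment in segments}


terms = third_search_terms("xing_ming")
```

For "xing_ming" this yields one list of candidate third search terms per word segment, mirroring the two lists given in the text.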
The words in the vocabulary to be trained are traversed, character matching is performed between all third search terms of each to-be-retrieved term and those words, and the matching degree of each third search term against each word is calculated; the matching word of the to-be-retrieved term is then determined according to the matching degree and the set matching threshold. The Chinese names of the words whose summed matching degree is highest and exceeds the matching threshold are selected as the new search terms. For a third search term that is an English search term, if the full spelling code and the simple spelling code are the same, no full spelling code has a matching degree of 1, and a simple spelling code has a matching degree of 1, then the Chinese names of all words whose simple spelling code has a matching degree of 1 are selected.
Take the longer Chinese term "A B C D" as an example. Segmentation first yields "A B" and "C D", and then the four word segments "A", "B", "C" and "D". The Chinese names, English names, full spelling codes, simple spelling codes, etc. corresponding to these four word segments are used as third search terms.
The third search terms corresponding to each word segment are then matched by character matching. Take character matching on the Chinese names of "A", "B", "C" and "D" as an example: if the matched Chinese names are "soft-shelled turtle", "Kongji", "dipropyl" and "pudding", the matching degrees are 1/2, 1/3, 1/2 and 1/2 respectively. Assume that character matching is then performed on the full spelling codes of "A", "B", "C" and "D", giving matching degrees of 1, 2/3, 4/5 and 0 respectively, and that the matching threshold is 1. The combination of the full spelling codes of "A", "B" and "C" with the Chinese name of "D" has the highest sum of matching degrees, approximately 2.967, so the full spelling codes of "A", "B" and "C" and the Chinese name of "D" are used as the matching words of the to-be-retrieved term.
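The arithmetic of this example can be checked with exact fractions. The sketch below assumes that the highest sum is obtained by picking, for each word segment, whichever of its two candidate forms has the larger matching degree; the variable names are illustrative.

```python
from fractions import Fraction as F

# Matching degrees taken from the example in the text.
chinese_name = {"A": F(1, 2), "B": F(1, 3), "C": F(1, 2), "D": F(1, 2)}
full_spelling = {"A": F(1), "B": F(2, 3), "C": F(4, 5), "D": F(0)}

# Assumption: per segment, keep the form with the larger matching degree.
choice = {}
for seg in chinese_name:
    if full_spelling[seg] > chinese_name[seg]:
        choice[seg] = ("full spelling", full_spelling[seg])
    else:
        choice[seg] = ("Chinese name", chinese_name[seg])

total = sum(degree for _, degree in choice.values())  # 1 + 2/3 + 4/5 + 1/2
```

The total is 89/30, which rounds to 2.967 and exceeds the threshold of 1; the selected forms are the full spelling for "A", "B" and "C" and the Chinese name for "D".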
A confidence score is assigned to every candidate matching word of every third search term to help select the correct matching word. If the matching confidence reaches 90%, the term is assigned to the metadata automatically. If matching is unsuccessful, manual matching is performed, and the manual matching result is used to update the vocabulary to be trained.
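A minimal sketch of this decision is shown below. The 90% threshold comes from the text; the function name, the candidate format and the use of a simple list as the manual-review queue are assumptions.

```python
AUTO_ASSIGN_CONFIDENCE = 0.90  # threshold stated in the text


def assign_term(field, candidates, manual_queue):
    """Auto-assign the best candidate, or queue the field for manual matching.

    candidates: list of (matching_word, confidence) pairs, confidence in [0, 1].
    manual_queue: hypothetical list collecting fields that need a human.
    """
    if candidates:
        word, confidence = max(candidates, key=lambda c: c[1])
        if confidence >= AUTO_ASSIGN_CONFIDENCE:
            return word
    manual_queue.append(field)
    return None


queue = []
auto = assign_term("xm", [("姓名", 0.95), ("项目", 0.60)], queue)
manual = assign_term("abc", [("未知", 0.50)], queue)
```

The manually resolved fields in `queue` would then be fed back into the vocabulary to be trained, as the text describes.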
Search terms are classified into: Chinese search terms, English search terms, mixed search terms, spelling search terms and English abbreviation search terms. A Chinese search term contains only Chinese characters, an English search term contains only English characters, and everything else is a mixed search term. The Chinese name of a Chinese search term is the search term itself, and its English name is the empty string; the English name of an English search term is the search term itself, and its Chinese name is the empty string; the Chinese name and the English name of a mixed search term are both the search term itself.
During character matching of a third search term, the character string of the corresponding item of each word in the vocabulary to be trained is scanned from left to right, and the number of matches is counted character by character against the third search term. The order in which the characters occur is ignored, and English characters are matched case-insensitively.
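One way to read this counting rule, combined with the matching degree defined earlier (matched characters divided by the length of the longer string), is a case-insensitive multiset intersection. The sketch below makes that assumption explicit: each character of one string can be matched by at most one character of the other, which the source does not state.

```python
from collections import Counter


def matching_degree(third_term, vocab_item):
    """Matched characters divided by the length of the longer string."""
    a, b = third_term.lower(), vocab_item.lower()   # case-insensitive
    if not a or not b:
        return 0.0
    # Assumption: order-insensitive multiset intersection, so every character
    # is consumed at most once when counting matches.
    matched = sum((Counter(a) & Counter(b)).values())
    return matched / max(len(a), len(b))


exact = matching_degree("XM", "xm")
partial = matching_degree("a", "ab")
```

Because order is ignored, two strings that are rearrangements of each other also score 1 under this reading.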
When the words in the vocabulary to be trained are traversed, the fields are visited in a type-dependent order and the matching degree is calculated for each: for a Chinese to-be-retrieved term, the full spelling code, Chinese name, simple spelling code, etc.; for an English to-be-retrieved term, the English name, full spelling code, simple spelling code, etc.; and for a mixed to-be-retrieved term, the full spelling code, English name, simple spelling code, Chinese name, etc. If, during the traversal, a word in the vocabulary to be trained is found whose full spelling code or English name has a matching degree of 1 with the to-be-retrieved term, the Chinese name of that word is taken as the new search term, the traversal ends, matching of the to-be-retrieved term is complete, and the vocabulary to be trained is updated.
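The traversal order and the early exit can be sketched as follows. The field names (`full`, `simple`, `english`, `chinese`), the vocabulary layout and the helper `degree` are assumptions made for illustration; `degree` uses the same order-insensitive, case-insensitive reading of character matching as the text.

```python
from collections import Counter

# Field traversal order per term type, as listed in the text.
ORDER = {
    "chinese": ["full", "chinese", "simple"],
    "english": ["english", "full", "simple"],
    "mixed": ["full", "english", "simple", "chinese"],
}


def degree(a, b):
    a, b = a.lower(), b.lower()
    if not a or not b:
        return 0.0
    return sum((Counter(a) & Counter(b)).values()) / max(len(a), len(b))


def traverse(term, kind, vocabulary):
    """Return (chinese_name or None, scores collected before any early exit)."""
    scores = []
    for entry in vocabulary:
        for field in ORDER[kind]:
            d = degree(term, entry.get(field, ""))
            # Early exit: a perfect full-spelling or English-name match.
            if d == 1.0 and field in ("full", "english"):
                return entry["chinese"], scores
            scores.append((entry["chinese"], field, d))
    return None, scores


vocab = [
    {"chinese": "年龄", "full": "nianling", "simple": "nl", "english": "age"},
    {"chinese": "姓名", "full": "xingming", "simple": "xm", "english": "name"},
]
hit, partial_scores = traverse("name", "english", vocab)
```

When no perfect match is found, the collected `scores` would be passed on to the threshold-based selection described earlier.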
After all the first words in the metadata training set are matched, training and updating the conditional random field word segmentation algorithm according to the matching words actually corresponding to all the first words in the metadata training set and the matching words actually matched, taking the trained conditional random field word segmentation algorithm as a trained classifier, and finally obtaining the trained classifier and the trained and updated vocabulary. And then matching the metadata fields to be matched by using the training method, the trained classifier and the trained vocabulary.
The method segments the unsuccessfully matched to-be-searched words using the conditional random field word segmentation algorithm, classifies them with the trained classifier to obtain a classification result for each third search term, matches the third search terms against the trained vocabulary, calculates the matching degree, and determines the matching words according to a matching threshold. This improves the matching accuracy for the to-be-searched words while keeping the matching automatic.
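The final step, choosing a matching word from the calculated matching degrees, might look like the sketch below. The function name `pick_match` and the default threshold of 0.8 are assumptions made for illustration; the text does not state the threshold value.

```python
from typing import Dict, Optional


def pick_match(scores: Dict[str, float], threshold: float = 0.8) -> Optional[str]:
    """Return the candidate with the highest matching degree, or None
    if no candidate reaches the (hypothetical) matching threshold."""
    if not scores:
        return None
    best = max(scores, key=scores.get)
    return best if scores[best] >= threshold else None
```

Returning `None` when no candidate reaches the threshold leaves room for a manual review step, which is one plausible way to handle words that still cannot be matched automatically.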
For the vocabulary to be trained and the trained vocabulary, the dictionary storage data structure is chosen as follows. Chinese has 7000 common characters and 56000 common words, so although the data is easy to load into memory, it is difficult to serve highly concurrent operations at millisecond level. A double-array Trie (Double-Array Trie) structure is therefore adopted, which represents the Trie tree using only two linear arrays. This structure combines the fast retrieval of a digital search tree (Digital Search Tree) with the compact space usage of a linked Trie representation, so that single-pattern matching can be completed in O(n) time.
The double-array Trie is essentially a deterministic finite state automaton (DFA): each node represents a state of the automaton, state transitions are performed according to the input character, and a query completes when an end state is reached or no transition is possible. The relationships between the characters contained in all keys of the double array are expressed by simple addition, which speeds up retrieval, eliminates the large number of pointers used in a linked structure, and saves storage space.
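To make the "two linear arrays plus addition" idea concrete, here is a minimal sketch of a double-array Trie. It is not the implementation used in the application: the class name, the character codes, and the simple first-fit search for a free `base` value are assumptions chosen for brevity. A transition from state `s` on a character with code `c` leads to `t = base[s] + c` and is valid only if `check[t] == s`.

```python
from typing import Dict, List


class DoubleArrayTrie:
    """Toy double-array trie built with a first-fit base search."""

    END = ""  # terminator symbol marking the end of a key

    def __init__(self, keys: List[str]) -> None:
        # Step 1: build an ordinary nested-dict trie.
        root: Dict[str, dict] = {}
        for key in keys:
            node = root
            for ch in key:
                node = node.setdefault(ch, {})
            node.setdefault(self.END, {})
        # Step 2: give every symbol a positive integer code (terminator = 1).
        alphabet = sorted({ch for key in keys for ch in key})
        self.code = {ch: i + 2 for i, ch in enumerate(alphabet)}
        self.code[self.END] = 1
        # Step 3: fill base/check breadth-first; -1 in check marks a free slot.
        self.base = [0]
        self.check = [0]          # slot 0 is the root state
        queue = [(0, root)]
        while queue:
            state, node = queue.pop(0)
            if not node:
                continue
            codes = [self.code[ch] for ch in node]
            b = 0
            while any(self._used(b + c) for c in codes):
                b += 1            # first base that leaves all child slots free
            self.base[state] = b
            for ch, child in node.items():
                t = b + self.code[ch]
                self._grow(t)
                self.check[t] = state
                queue.append((t, child))

    def _used(self, i: int) -> bool:
        return i < len(self.check) and self.check[i] != -1

    def _grow(self, i: int) -> None:
        while len(self.check) <= i:
            self.base.append(0)
            self.check.append(-1)

    def contains(self, key: str) -> bool:
        """Exact lookup: one addition and one comparison per character."""
        state = 0
        for symbol in list(key) + [self.END]:
            c = self.code.get(symbol)
            if c is None:
                return False
            t = self.base[state] + c
            if t >= len(self.check) or self.check[t] != state:
                return False
            state = t
        return True
```

Usage: `DoubleArrayTrie(["he", "she", "his", "hers"]).contains("hers")` returns `True`, while `contains("her")` returns `False` because "her" is only a prefix of a stored key.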
To complete multi-pattern matching in O(n) time when constructing the word graph, the pattern strings need to be preprocessed into a finite state automaton by the Aho-Corasick algorithm, as shown in FIG. 3, where the pattern strings are he/she/his/hers and the text is "ushers". When leaf node 5 is reached for the first time, the next matching step can start directly from node 2, so all pattern strings can be identified in a single traversal.
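The he/she/his/hers example can be reproduced with a short Aho-Corasick sketch. The function names are hypothetical, and the automaton is stored in plain Python dictionaries rather than in a double array, purely to keep the example small. With the patterns inserted in the order he, she, his, hers, the state numbering of this sketch happens to agree with the node numbers mentioned in the text: "she" ends in state 5, and its failure link points to state 2 ("he").

```python
from collections import deque
from typing import Dict, List, Tuple


def build_automaton(patterns: List[str]):
    """Build goto, failure, and output tables for the given patterns."""
    goto: List[Dict[str, int]] = [{}]
    fail: List[int] = [0]
    out: List[List[str]] = [[]]
    for pattern in patterns:                      # phase 1: build the trie
        state = 0
        for ch in pattern:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(pattern)
    queue = deque(goto[0].values())               # phase 2: failure links (BFS)
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            out[nxt] = out[nxt] + out[fail[nxt]]  # inherit matches of suffixes
    return goto, fail, out


def search(text: str, patterns: List[str]) -> List[Tuple[int, str]]:
    """Return (start_index, pattern) for every occurrence, in one pass."""
    goto, fail, out = build_automaton(patterns)
    state = 0
    hits: List[Tuple[int, str]] = []
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]                   # fall back instead of restarting
        state = goto[state].get(ch, 0)
        for pattern in out[state]:
            hits.append((i - len(pattern) + 1, pattern))
    return hits
```

Calling `search("ushers", ["he", "she", "his", "hers"])` returns `[(1, "she"), (2, "he"), (2, "hers")]`: after "she" is recognised, the automaton continues from the "he" state and goes on to find "hers" without rescanning the text.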
The conditional random field word segmentation algorithm is a character-based word segmentation method. Character-based segmentation does not match words against the sentence in advance; instead, it treats word segmentation as a sequence labeling problem in which each character is tagged as B (Begin), I (Inside), O (Outside), E (End) or S (Single). The input is therefore a feature composed of each character and the characters before and after it, and the output is a classification tag. This classification problem is solved with statistical machine learning. A conditional random field is a discriminative undirected graphical model that models the conditional probability of multiple variables given observations: for a labeling sequence Y and an observation sequence X, it defines the conditional probability P(Y|X) rather than modeling the joint probability. The conditional random field is arguably the most commonly used algorithm at present for word segmentation, part-of-speech tagging and entity recognition, and it recognizes unregistered (out-of-vocabulary) words well. The embodiment of the application uses the conditional random field word segmentation algorithm from statistical machine learning: the problem is abstracted with a series of algorithms to obtain a model, and the model is then used to solve similar problems. The model can also be regarded as a function that takes the input characters as X and yields a label f(X) = Y for each character. In addition, machine learning models are generally divided into two categories, generative models and discriminative models, which differ essentially in how the relationship between X and Y is modeled.
A generative model assumes that "the output Y generates the input X according to a certain rule" and models the joint probability P(X, Y); a discriminative model considers that Y is determined by X and directly models the posterior probability P(Y|X). Each has advantages and disadvantages: a generative model describes the relationships between variables more clearly, while a discriminative model is easier to build and learn.
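Once a classifier has assigned a tag to every character, the tags have to be turned back into words. The sketch below shows only this decoding step, not the conditional random field itself. The function name is hypothetical, and treating O as "outside any word, so the character is dropped" is an assumption.

```python
from typing import List


def tags_to_words(chars: str, tags: List[str]) -> List[str]:
    """Convert per-character B/I/E/S/O tags into a list of words."""
    words: List[str] = []
    current = ""
    for ch, tag in zip(chars, tags):
        if tag == "B":                 # begin a new word
            if current:
                words.append(current)
            current = ch
        elif tag == "I":               # continue the current word
            current += ch
        elif tag == "E":               # close the current word
            words.append(current + ch)
            current = ""
        else:                          # "S" or "O"
            if current:
                words.append(current)
                current = ""
            if tag == "S":             # single-character word
                words.append(ch)
    if current:                        # tolerate a sequence that ends mid-word
        words.append(current)
    return words
```

For instance, the characters of "username" tagged as B I I E B I I E decode to `["user", "name"]`, and "客户姓名" tagged as B E B E decodes to `["客户", "姓名"]`.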
The embodiment of the application uses an adjacency matrix to store the matching degrees between the to-be-searched word, the third search terms and the to-be-trained words; when matching a to-be-matched word in a metadata field to be matched, it likewise stores the matching degree between the to-be-matched word and the matching words in the trained vocabulary. The adjacency matrix stores the relationships between individual words. The results of the matching calculation are stored using an adjacency list, from which the resulting weight between a to-be-searched word and its matching word is obtained. These data structures thus hold the weights between search terms and matching words during the calculation of the matching degree.
As shown in FIG. 4, the adjacency matrix represents nodes by array subscripts and edge weights by array values, i.e., d[i][j] = v means that the weight of the edge between node i and node j is v.
The adjacency list establishes a singly linked list for each node in the graph, which greatly saves storage space for sparse graphs. The nodes in the i-th singly linked list represent the edges attached to vertex i, as shown in FIG. 5. In practical applications, especially when the Viterbi algorithm is used to find the optimal path, the graph is traversed breadth-first, so it is preferable to store the graph using the adjacency list, which gives convenient access to all successors of a given node.
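A small sketch of the adjacency-matrix convention d[i][j] = v follows. The node labels and weights are made-up illustration values, and storing each weight symmetrically is an assumption.

```python
from typing import Dict, List

# Hypothetical nodes: a to-be-searched word and two candidate matching words.
nodes: List[str] = ["yhmc", "用户名称", "用户名"]
index: Dict[str, int] = {name: i for i, name in enumerate(nodes)}

# d[i][j] holds the weight (matching degree) of the edge between node i and node j.
d: List[List[float]] = [[0.0] * len(nodes) for _ in nodes]


def set_weight(a: str, b: str, v: float) -> None:
    """Store weight v for the edge between a and b (assumed symmetric)."""
    i, j = index[a], index[b]
    d[i][j] = v
    d[j][i] = v


set_weight("yhmc", "用户名称", 1.0)
set_weight("yhmc", "用户名", 0.75)
```

Looking up a weight is then a constant-time array access, for example `d[index["yhmc"]][index["用户名"]]`, at the cost of storing one cell for every pair of nodes.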
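The following sketch stores a small word graph as an adjacency list and finds the best-scoring path with a Viterbi-style dynamic program. Several things here are assumptions: nodes are numbered in topological order (for a word graph, character positions), every edge points from a lower to a higher node number, edge weights are additive scores such as log-probabilities, and the last node is reachable from node 0. The function name and the example weights are hypothetical.

```python
from typing import Dict, List, Tuple

# adjacency[u] lists (v, weight) for every edge u -> v.
Adjacency = Dict[int, List[Tuple[int, float]]]


def best_path(n_nodes: int, adjacency: Adjacency) -> Tuple[List[int], float]:
    """Return the highest-scoring path from node 0 to node n_nodes - 1."""
    neg_inf = float("-inf")
    best = [neg_inf] * n_nodes      # best score found so far for each node
    back = [-1] * n_nodes           # predecessor on the best path
    best[0] = 0.0
    for u in range(n_nodes):        # nodes are visited in topological order
        if best[u] == neg_inf:
            continue                # node u is unreachable
        for v, w in adjacency.get(u, []):
            if best[u] + w > best[v]:
                best[v] = best[u] + w
                back[v] = u
    path = []
    v = n_nodes - 1
    while v != -1:                  # follow predecessors back to node 0
        path.append(v)
        v = back[v]
    return path[::-1], best[n_nodes - 1]


graph: Adjacency = {
    0: [(1, 0.5), (2, 0.9)],
    1: [(2, 0.6)],
    2: [(3, 0.7)],
}
```

With this graph, `best_path(4, graph)` selects the path 0, 1, 2, 3, because 0.5 + 0.6 + 0.7 exceeds 0.9 + 0.7. The adjacency list makes the inner loop proportional to the number of outgoing edges of each node rather than to the total number of nodes.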
In a second aspect, the present application proposes a term matching system for metadata fields, as shown in FIG. 6, including:
The training module 101, configured to preprocess the first search terms in the metadata training set to obtain second search terms; judge the second search terms and match them against the to-be-trained words in the vocabulary database to be trained, to obtain the to-be-searched words that were not successfully matched; segment the to-be-searched words using a conditional random field word segmentation algorithm to obtain a plurality of third search terms for each to-be-searched word, and match the plurality of third search terms against the to-be-trained words to determine matching words; update the vocabulary database to be trained according to the matching words, and train the conditional random field word segmentation algorithm to obtain a trained classifier and a trained vocabulary;
The matching module 102, configured to match the metadata fields to be matched using the trained classifier and the trained vocabulary.
In a third aspect, the present application provides a storage medium storing computer readable instructions that, when executed by one or more processors, cause the one or more processors to perform the term matching method for metadata fields described above.
In the method, the second search term is judged and matched against the vocabulary in the trained vocabulary database to obtain the unsuccessfully matched to-be-searched words. The conditional random field word segmentation algorithm is then used to segment the to-be-searched words into a plurality of third search terms, the third search terms are matched against the trained vocabulary, the matching degree is calculated, and the matching word is determined according to the matching threshold, which improves the matching accuracy for the to-be-searched words while keeping the matching automatic. In addition, a double-array Trie (Double-Array Trie) structure is adopted, which speeds up retrieval and saves storage space.
The present application is not limited to the above-mentioned embodiments, and any changes or substitutions that can be easily understood by those skilled in the art within the technical scope of the present application are intended to be included in the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.
Claims (9)
1. A method for term matching of metadata fields, comprising:
Preprocessing the first search term in the metadata training set to obtain a second search term;
Judging the second search word, and matching the second search word with the vocabulary to be trained in the vocabulary database to be trained to obtain the to-be-searched words that were not successfully matched;
Performing word segmentation on the words to be searched by using a conditional random field word segmentation algorithm to obtain a plurality of third search words of each word to be searched, and matching the plurality of third search words with the words to be trained to determine matching words;
Updating a vocabulary database to be trained according to the matching words, and training the conditional random field word segmentation algorithm to obtain a trained classifier and a trained vocabulary;
Matching metadata fields to be matched by using the trained classifier and the trained vocabulary;
before the first search term in the metadata training set is preprocessed to obtain the second search term, the method further comprises the following steps:
collecting table data and cleaning illegal characters in the table data to establish a vocabulary to be trained;
Establishing a vocabulary database to be trained by using the vocabulary to be trained;
the vocabulary to be trained and the trained vocabulary adopt a double-array Trie structure; preprocessing the pattern string into a finite state automaton by using an Aho-Corasick algorithm;
and storing the to-be-retrieved words, the third retrieval words and the matching degree by using an adjacency matrix.
2. The method for term matching of metadata fields according to claim 1, wherein the preprocessing the first term in the metadata training set to obtain the second term comprises:
Acquiring all first search words in metadata in a metadata training set;
And removing illegal characters in each first search term to obtain a second search term corresponding to each first search term.
3. The method for matching terms of metadata fields according to claim 1, wherein said determining the second term and matching the second term with the vocabulary to be trained in the vocabulary database to be trained to obtain the term to be searched includes:
Judging the second search term to obtain a Chinese second search term and a non-Chinese second search term;
Directly matching the Chinese second search terms in a vocabulary database to be trained, and determining the Chinese second search terms which are not matched with the matched terms;
and taking the non-Chinese second retrieval word and the Chinese second retrieval word which is not matched with the matched word as the to-be-retrieved word.
4. The method for matching terms of metadata fields according to claim 1, wherein said using a conditional random field word segmentation algorithm to segment said to-be-retrieved words to obtain a plurality of third retrieval words of each to-be-retrieved word, matching said to-be-trained vocabulary, and determining a matching word includes:
Performing word segmentation on the longer to-be-searched words by using a conditional random field word segmentation algorithm to generate a plurality of third search words for each segment of the to-be-searched word, wherein the third search words comprise: full spelling, simple spelling, English name and/or Chinese name;
Performing character matching on all the third search words of each to-be-searched word and the to-be-trained word, and calculating the matching degree of each third search word corresponding to each to-be-searched word;
and determining the matching word of the to-be-searched word according to the matching degree and the matching threshold value.
5. The method of claim 1, wherein said matching metadata fields to be matched using said trained classifier and said trained vocabulary comprises:
Matching the metadata fields to be matched by using the trained vocabulary, and obtaining the metadata fields which are not successfully matched as words to be matched;
using a trained conditional random field word segmentation algorithm as a trained classifier to segment each word to be matched, and obtaining a plurality of third search words corresponding to each word to be matched;
performing character matching on the plurality of third search terms in the trained vocabulary, and calculating the matching degree of each third search term;
And determining the matching word corresponding to each word to be matched according to the matching degree and the matching threshold, wherein the matching word is a matching term of the word to be matched.
6. The term matching method of metadata fields according to claim 2, wherein said table data comprises:
A customer terminology table, an industry standard field interpretation mapping table, and a metadata field and interpretation comparison table approved by the business system, wherein each table comprises: Chinese name, English name, full spelling, and simple spelling.
7. The term matching method of metadata fields according to claim 2, wherein the types of the first term and the second term each include: Chinese, full spelling, simple spelling, English abbreviations, and/or mixtures.
8. A term matching system for metadata fields, comprising:
The training module is used for collecting table data and cleaning illegal characters in the table data, and establishing a vocabulary to be trained; establishing a vocabulary database to be trained by using the vocabulary to be trained; the vocabulary to be trained and the trained vocabulary adopt a double-array Trie structure; preprocessing the pattern string into a finite state automaton by using an Aho-Corasick algorithm; preprocessing the first search term in the metadata training set to obtain a second search term; judging the second search word, and matching the second search word with the vocabulary to be trained in the vocabulary database to be trained to obtain the to-be-searched words that were not successfully matched; performing word segmentation on the words to be searched by using a conditional random field word segmentation algorithm to obtain a plurality of third search words of each word to be searched, and matching the plurality of third search words with the words to be trained to determine matching words; updating a vocabulary database to be trained according to the matching words, and training the conditional random field word segmentation algorithm to obtain a trained classifier and a trained vocabulary; storing the to-be-retrieved words, the third retrieval words and the matching degree by using an adjacency matrix;
and the matching module is used for matching the metadata fields to be matched by using the trained classifier and the trained vocabulary.
9. A storage medium storing computer readable instructions which, when executed by one or more processors, cause the one or more processors to perform a term matching method for metadata fields as claimed in any one of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011342621.XA CN112580691B (en) | 2020-11-25 | 2020-11-25 | Term matching method, matching system and storage medium for metadata field |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011342621.XA CN112580691B (en) | 2020-11-25 | 2020-11-25 | Term matching method, matching system and storage medium for metadata field |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112580691A (en) | 2021-03-30 |
CN112580691B (en) | 2024-05-14 |
Family
ID=75123569
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011342621.XA (Active), CN112580691B (en) | 2020-11-25 | 2020-11-25 | Term matching method, matching system and storage medium for metadata field |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112580691B (en) |
Families Citing this family (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116361532A (en) * | 2021-12-28 | 2023-06-30 | 中移(杭州)信息技术有限公司 | Resource search method, device and computer-readable storage medium |
CN114896381A (en) * | 2022-05-16 | 2022-08-12 | 广州太平洋电脑信息咨询有限公司 | Fault-tolerant matching method and device for automobile vehicle information |
CN114969001B (en) * | 2022-05-24 | 2024-05-10 | 浪潮卓数大数据产业发展有限公司 | Database metadata field matching method, device, equipment and medium |
Citations (15)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101751430A (en) * | 2008-12-12 | 2010-06-23 | 汉王科技股份有限公司 | Electronic dictionary fuzzy searching method |
CN103336850A (en) * | 2013-07-24 | 2013-10-02 | 昆明理工大学 | Method and device for confirming index word in database retrieval system |
CN103412858A (en) * | 2012-07-02 | 2013-11-27 | 清华大学 | A method for large-scale feature matching for text or web content analysis |
CN103810168A (en) * | 2012-11-06 | 2014-05-21 | 深圳市世纪光速信息技术有限公司 | Search application method, device and terminal |
CN106933883A (en) * | 2015-12-31 | 2017-07-07 | 中移(苏州)软件技术有限公司 | Point of interest Ordinary search word sorting technique, device based on retrieval daily record |
CN108255813A (en) * | 2018-01-23 | 2018-07-06 | 重庆邮电大学 | A kind of text matching technique based on term frequency-inverse document and CRF |
CN108304375A (en) * | 2017-11-13 | 2018-07-20 | 广州腾讯科技有限公司 | A kind of information identifying method and its equipment, storage medium, terminal |
US10242320B1 (en) * | 2018-04-19 | 2019-03-26 | Maana, Inc. | Machine assisted learning of entities |
CN110309368A (en) * | 2018-03-26 | 2019-10-08 | 腾讯科技(深圳)有限公司 | Determination method, apparatus, storage medium and the electronic device of data address |
CN110931137A (en) * | 2018-09-19 | 2020-03-27 | 京东方科技集团股份有限公司 | Machine-assisted dialog system, method and device |
CN111291195A (en) * | 2020-01-21 | 2020-06-16 | 腾讯科技(深圳)有限公司 | Data processing method, device, terminal and readable storage medium |
CN111310456A (en) * | 2020-02-13 | 2020-06-19 | 支付宝(杭州)信息技术有限公司 | Entity name matching method, device and equipment |
CN111899829A (en) * | 2020-07-31 | 2020-11-06 | 青岛百洋智能科技股份有限公司 | Full-text retrieval matching engine based on ICD9/10 participle lexicon |
CN111967242A (en) * | 2020-08-17 | 2020-11-20 | 支付宝(杭州)信息技术有限公司 | Text information extraction method, device and equipment |
CN112148885A (en) * | 2020-09-04 | 2020-12-29 | 上海晏鼠计算机技术股份有限公司 | Intelligent searching method and system based on knowledge graph |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20180232443A1 (en) * | 2017-02-16 | 2018-08-16 | Globality, Inc. | Intelligent matching system with ontology-aided relation extraction |
Patent Citations (15)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101751430A (en) * | 2008-12-12 | 2010-06-23 | 汉王科技股份有限公司 | Electronic dictionary fuzzy searching method |
CN103412858A (en) * | 2012-07-02 | 2013-11-27 | 清华大学 | A method for large-scale feature matching for text or web content analysis |
CN103810168A (en) * | 2012-11-06 | 2014-05-21 | 深圳市世纪光速信息技术有限公司 | Search application method, device and terminal |
CN103336850A (en) * | 2013-07-24 | 2013-10-02 | 昆明理工大学 | Method and device for confirming index word in database retrieval system |
CN106933883A (en) * | 2015-12-31 | 2017-07-07 | 中移(苏州)软件技术有限公司 | Point of interest Ordinary search word sorting technique, device based on retrieval daily record |
CN108304375A (en) * | 2017-11-13 | 2018-07-20 | 广州腾讯科技有限公司 | A kind of information identifying method and its equipment, storage medium, terminal |
CN108255813A (en) * | 2018-01-23 | 2018-07-06 | 重庆邮电大学 | A kind of text matching technique based on term frequency-inverse document and CRF |
CN110309368A (en) * | 2018-03-26 | 2019-10-08 | 腾讯科技(深圳)有限公司 | Determination method, apparatus, storage medium and the electronic device of data address |
US10242320B1 (en) * | 2018-04-19 | 2019-03-26 | Maana, Inc. | Machine assisted learning of entities |
CN110931137A (en) * | 2018-09-19 | 2020-03-27 | 京东方科技集团股份有限公司 | Machine-assisted dialog system, method and device |
CN111291195A (en) * | 2020-01-21 | 2020-06-16 | 腾讯科技(深圳)有限公司 | Data processing method, device, terminal and readable storage medium |
CN111310456A (en) * | 2020-02-13 | 2020-06-19 | 支付宝(杭州)信息技术有限公司 | Entity name matching method, device and equipment |
CN111899829A (en) * | 2020-07-31 | 2020-11-06 | 青岛百洋智能科技股份有限公司 | Full-text retrieval matching engine based on ICD9/10 participle lexicon |
CN111967242A (en) * | 2020-08-17 | 2020-11-20 | 支付宝(杭州)信息技术有限公司 | Text information extraction method, device and equipment |
CN112148885A (en) * | 2020-09-04 | 2020-12-29 | 上海晏鼠计算机技术股份有限公司 | Intelligent searching method and system based on knowledge graph |
Non-Patent Citations (2)
Title |
---|
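Improved Chinese Word Segmentation Disambiguation Model Based on Conditional Random Fields; Fanjin Mai et al.; Proceedings of the 4th International Conference on Computer Engineering and Networks; 20151231; Vol. 355; full text * |
Research on Subject-Domain Ontology Learning and Semantic Annotation of Academic Resources (学科领域本体学习及学术资源语义标注研究); Jiang Ting (蒋婷); China Doctoral Dissertations Full-text Database, Information Science and Technology; 20180615; full text * |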
Also Published As
Publication number | Publication date |
---|---|
CN112580691A (en) | 2021-03-30 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US9971974B2 (en) | Methods and systems for knowledge discovery | |
EP3020005B1 (en) | Active featuring in computer-human interactive learning | |
US9672205B2 (en) | Methods and systems related to information extraction | |
Cohen et al. | Exploiting dictionaries in named entity extraction: combining semi-markov extraction processes and data integration methods | |
CN112580691B (en) | Term matching method, matching system and storage medium for metadata field | |
US20040107189A1 (en) | System for identifying similarities in record fields | |
US20040107205A1 (en) | Boolean rule-based system for clustering similar records | |
CN113553400A (en) | Method and device for constructing entity link model of enterprise knowledge graph | |
CN114239828B (en) | A method for constructing a supply chain event graph based on causality | |
CN112328800A (en) | System and method for automatically generating programming specification question answers | |
CN111309944B (en) | A Digital Humanities Search Method Based on Graph Database | |
CN112036178A (en) | A Semantic Search Method Related to Distribution Network Entity | |
CN109933787A (en) | Method, device and medium for extracting text key information | |
CN114049165B (en) | Commodity price comparison method, device, equipment and medium for purchasing system | |
Wang et al. | A probabilistic address parser using conditional random fields and stochastic regular grammar | |
CN111767733A (en) | A document classification method based on statistical word segmentation | |
CN115438195A (en) | A method and device for constructing a knowledge map in the field of financial standardization | |
Dewi et al. | The design of automatic summarization of Indonesian texts using a hybrid approach | |
CN112949287B (en) | Hot word mining method, system, computer equipment and storage medium | |
CN116127971B (en) | English push named entity extraction method and device based on subjective and objective word list | |
US11995110B2 (en) | Systems and methods for candidate database querying | |
Liu et al. | A Financial Advertisement Recognition Algorithm Model Based on Text | |
Ghavimi et al. | EXmatcher: Combining Features Based on Reference Strings and Segments to Enhance Citation Matching | |
Giner Pérez de Lucía | Named entity recognition in handwritten text images from the k best transcripts | |
Cherkassky et al. | Conventional and associative memory-based spelling checkers |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||