
CN115438181A - Metadata label classification method and device - Google Patents

Metadata label classification method and device

Info

Publication number
CN115438181A
Authority
CN
China
Prior art keywords
metadata
label
classifier
information
semantic information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211035702.4A
Other languages
Chinese (zh)
Inventor
周檬
吴宏杰
檀康
李静
陈汉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Unionpay Co Ltd
Original Assignee
China Unionpay Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Unionpay Co Ltd
Priority to CN202211035702.4A
Publication of CN115438181A

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/38 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Library & Information Science (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Embodiments of the present invention provide a method and device for label classification of metadata, applied in the field of big data technology, to solve the prior-art problem that determining the label classification of metadata is inefficient. The method includes: for any piece of metadata, obtaining the Chinese semantic information and the English semantic information of the metadata; determining, according to a Chinese word segmentation set, a first feature vector corresponding to the Chinese semantic information of the metadata, where the Chinese word segmentation set is obtained by segmenting the Chinese semantic information of each piece of metadata; determining, according to an English word segmentation set, a second feature vector corresponding to the English semantic information of the metadata, where the English word segmentation set is obtained by segmenting the English semantic information of each piece of metadata; concatenating the first feature vector with the second feature vector to obtain a feature encoding vector of the metadata; and inputting the feature encoding vector of the metadata into each label classifier to determine the label category of the metadata.

Figure 202211035702

Description

Method and device for label classification of metadata

Technical Field

Embodiments of the present invention relate to the technical field of big data, and in particular to a method and device for label classification of metadata.

Background

With the continuous development of technology, one task in big data governance is to classify metadata by label. Determining the label category of a piece of metadata effectively helps users understand the data better.

At present, the label category of each piece of metadata is judged manually, which makes determining the label classification of metadata inefficient.

In summary, how to address the low efficiency of determining the label classification of metadata is a technical problem that urgently needs to be solved.

Summary of the Invention

Embodiments of the present invention provide a method for label classification of metadata, to solve the prior-art problem that determining the label classification of metadata is inefficient.

In a first aspect, an embodiment of the present invention provides a method for label classification of metadata, including: for any piece of metadata, obtaining the Chinese semantic information and the English semantic information of the metadata; determining, according to a Chinese word segmentation set, a first feature vector corresponding to the Chinese semantic information of the metadata, where the Chinese word segmentation set is obtained by segmenting the Chinese semantic information of each piece of metadata; determining, according to an English word segmentation set, a second feature vector corresponding to the English semantic information of the metadata, where the English word segmentation set is obtained by segmenting the English semantic information of each piece of metadata; concatenating the first feature vector with the second feature vector to obtain a feature encoding vector of the metadata; and inputting the feature encoding vector of the metadata into each label classifier to determine the label category of the metadata.

In this embodiment of the present invention, the Chinese semantic information and the English semantic information of the metadata are segmented to determine the feature encoding vector of the metadata. The feature encoding vector is then input into each label classifier, and the label category of the label classifier with the highest prediction probability is taken as the label category of the metadata. This allows the label category of the metadata to be determined more accurately and more efficiently.

Optionally, determining, according to the Chinese word segmentation set, the first feature vector corresponding to the Chinese semantic information of the metadata includes: segmenting the Chinese semantic information of the metadata to obtain first segmented words; for any first segmented word, if the first segmented word is determined to exist in the Chinese word segmentation set, setting the sub-feature vector corresponding to the first segmented word to a first value; if the first segmented word is determined not to exist in the Chinese word segmentation set, setting the sub-feature vector corresponding to the first segmented word to a second value; and concatenating the sub-feature vectors corresponding to the first segmented words, according to the position of each first segmented word in the Chinese semantic information of the metadata, to obtain the first feature vector corresponding to the Chinese semantic information of the metadata.

In this embodiment of the present invention, the Chinese semantic information of the metadata is segmented to obtain the first segmented words, and the first feature vector of the metadata is then determined from the Chinese word segmentation set and the first segmented words. The feature encoding vector of the metadata can subsequently be determined from the first feature vector and input into each label classifier, so that the label category of the metadata is determined more accurately and more quickly.

Optionally, determining, according to the English word segmentation set, the second feature vector corresponding to the English semantic information of the metadata includes: segmenting the English semantic information of the metadata to obtain second segmented words; for any second segmented word, if the second segmented word is determined to exist in the English word segmentation set, setting the sub-feature vector corresponding to the second segmented word to the first value; if the second segmented word is determined not to exist in the English word segmentation set, setting the sub-feature vector corresponding to the second segmented word to the second value; and concatenating the sub-feature vectors corresponding to the second segmented words, according to the position of each second segmented word in the English semantic information of the metadata, to obtain the second feature vector corresponding to the English semantic information of the metadata.

In this embodiment of the present invention, the English semantic information of the metadata is segmented to obtain the second segmented words, and the second feature vector of the metadata is then determined from the second segmented words and the English word segmentation set. The feature encoding vector of the metadata can subsequently be determined from the second feature vector and input into each label classifier, so that the label category of the metadata is determined more accurately and more quickly.
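A minimal sketch of the second feature vector, under stated assumptions. The text does not say how English semantic information is segmented, so splitting on underscores is an assumption, as are the helper names `english_tokens`, `build_vocab`, and `second_feature_vector`. The field names are the ones used as examples in the detailed description, and the first and second values are taken as 1 and 0.

```python
from typing import Dict, List


def english_tokens(field_name: str) -> List[str]:
    """Split a field name into tokens (assumption: underscore-delimited)."""
    return [part for part in field_name.split("_") if part]


def build_vocab(field_names: List[str]) -> List[str]:
    """Collect distinct tokens in first-seen order so vector positions are fixed."""
    vocab: List[str] = []
    for name in field_names:
        for token in english_tokens(name):
            if token not in vocab:
                vocab.append(token)
    return vocab


def second_feature_vector(field_name: str, vocab: List[str],
                          first_value: int = 1, second_value: int = 0) -> List[int]:
    """One entry per vocabulary token: first_value if present, else second_value."""
    present = set(english_tokens(field_name))
    return [first_value if token in present else second_value for token in vocab]


FIELD_NAMES = ["ISS_INS_ID_CD", "NAME", "ACQ_INS_CD"]
ENGLISH_VOCAB = build_vocab(FIELD_NAMES)
SECOND_VECTORS: Dict[str, List[int]] = {
    name: second_feature_vector(name, ENGLISH_VOCAB) for name in FIELD_NAMES
}
```

Every vector has the same length as the vocabulary, which is what allows it to be concatenated with the first feature vector and passed to a fixed-size classifier input.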

Optionally, inputting the feature encoding vector of the metadata into each label classifier to determine the label category of the metadata includes: inputting the feature encoding vector of the metadata into each label classifier to obtain the prediction probability corresponding to each label classifier; and determining the label category of the metadata according to the prediction probabilities corresponding to the label classifiers.

In this embodiment of the present invention, the label category of the label classifier with the highest prediction probability is taken as the label category of the metadata, so that the label category of the metadata is determined more accurately and more efficiently.
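The concatenation and maximum-probability selection can be sketched as follows. The classifiers here are hypothetical stand-in functions, not trained models, and the label names and the vector values are illustrative assumptions; only the two steps (concatenate, then take the label with the highest prediction probability) come from the text.

```python
from typing import Callable, Dict, List, Tuple

Vector = List[int]
Classifier = Callable[[Vector], float]  # returns a prediction probability


def encode(first: Vector, second: Vector) -> Vector:
    """Concatenate the first and second feature vectors."""
    return first + second


def classify(features: Vector,
             classifiers: Dict[str, Classifier]) -> Tuple[str, Dict[str, float]]:
    """Run every label classifier and return the label with the highest probability."""
    probs = {label: clf(features) for label, clf in classifiers.items()}
    best = max(probs, key=probs.get)
    return best, probs


# Hypothetical stand-in classifiers: each looks at a single vector position.
CLASSIFIERS: Dict[str, Classifier] = {
    "institution information": lambda v: 0.9 if v[1] else 0.1,
    "user information": lambda v: 0.8 if v[4] else 0.05,
}

FEATURES = encode([1, 1, 1, 0, 0], [1, 1, 1, 1, 0, 0])
LABEL, PROBS = classify(FEATURES, CLASSIFIERS)
```

Keeping one classifier per label, rather than a single multi-class model, is what makes per-classifier thresholds possible later in the method.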

Optionally, each label classifier is obtained by training on training samples, and each label classifier has an upper threshold and a lower threshold obtained through training. Determining the label category of the metadata according to the prediction probabilities corresponding to the label classifiers includes: if the first prediction probability corresponding to a first label classifier is higher than the upper threshold of the first label classifier, and the second prediction probability corresponding to each second label classifier is lower than the lower threshold of that second label classifier, determining that the metadata has the label category corresponding to the first label classifier, where the second label classifiers are all of the label classifiers other than the first label classifier.

In this embodiment of the present invention, metadata that meets the condition is used as a training sample, which increases the number of training samples and makes the upper and lower thresholds obtained by training each label classifier more accurate, so that the label category of the metadata is determined more accurately and more efficiently.
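The threshold rule can be sketched as below. The function name `decide_label`, the `(upper, lower)` tuple layout, and the numeric thresholds are assumptions made for illustration; in the method the thresholds are obtained through training.

```python
from typing import Dict, Optional, Tuple

Thresholds = Dict[str, Tuple[float, float]]  # label -> (upper, lower)


def decide_label(probs: Dict[str, float], thresholds: Thresholds) -> Optional[str]:
    """Return a label only if its probability exceeds its own upper threshold
    and every other classifier's probability is below that classifier's lower
    threshold; otherwise return None (no confident decision)."""
    for label, prob in probs.items():
        upper, _ = thresholds[label]
        if prob <= upper:
            continue
        others_low = all(
            probs[other] < thresholds[other][1]
            for other in probs if other != label
        )
        if others_low:
            return label
    return None


# Illustrative thresholds only.
THRESHOLDS: Thresholds = {"a": (0.8, 0.2), "b": (0.8, 0.2), "c": (0.8, 0.2)}
CONFIDENT = decide_label({"a": 0.95, "b": 0.05, "c": 0.10}, THRESHOLDS)
AMBIGUOUS = decide_label({"a": 0.95, "b": 0.50, "c": 0.10}, THRESHOLDS)
```

The second call returns no label because classifier `b` is neither confidently positive nor confidently negative, which is the case the two-threshold design is meant to catch.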

Optionally, the method further includes: for any label classifier, if the prediction probability of the label classifier is lower than the lower threshold of the label classifier, determining that the metadata does not have the label category corresponding to the label classifier, and using the metadata as a training sample for updating the label classifiers.

In this embodiment of the present invention, metadata that meets the condition is used as a training sample, which increases the number of training samples, so that the upper and lower thresholds of each label classifier are optimized during training and the label category of the metadata is determined more accurately and more efficiently.

Optionally, after the label category of the metadata is determined, the method further includes: using the metadata as a training sample for updating the label classifiers.

In this embodiment of the present invention, the metadata is used as a training sample for updating the label classifiers, so that the upper and lower thresholds of each label classifier can be optimized and the label category of the metadata is determined more accurately and more efficiently.

Optionally, the method further includes: if no label classifier has a prediction probability higher than its upper threshold, then after the label classifiers have been updated, inputting the feature encoding vector of the metadata into each updated label classifier again to determine the label category of the metadata.

In this embodiment of the present invention, the feature encoding vector of the metadata is input into each updated label classifier, which improves the efficiency of determining the label category of the metadata and allows the label category to be determined more accurately.
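The optional update steps above amount to routing each classified item into one or more queues. The sketch below shows only that routing; the `UpdatePlan` structure, its field names, and the thresholds are hypothetical, and the retraining itself is left out because the text does not specify the model.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

Thresholds = Dict[str, Tuple[float, float]]  # label -> (upper, lower)


@dataclass
class UpdatePlan:
    negative_for: List[str]        # classifiers for which the item lacks the label
    positive_for: Optional[str]    # decided label, if one was determined
    reclassify_after_update: bool  # no classifier exceeded its upper threshold


def plan_update(probs: Dict[str, float], thresholds: Thresholds,
                decided_label: Optional[str] = None) -> UpdatePlan:
    """Decide how an item feeds back into classifier updates."""
    negative_for = [
        label for label, prob in probs.items() if prob < thresholds[label][1]
    ]
    none_confident = all(
        prob <= thresholds[label][0] for label, prob in probs.items()
    )
    return UpdatePlan(negative_for, decided_label, none_confident)


# Illustrative thresholds only.
THRESHOLDS: Thresholds = {"a": (0.8, 0.2), "b": (0.8, 0.2)}
DECIDED = plan_update({"a": 0.95, "b": 0.05}, THRESHOLDS, decided_label="a")
UNDECIDED = plan_update({"a": 0.50, "b": 0.05}, THRESHOLDS)
```

An item can therefore be a negative training sample for one classifier and still be queued for reclassification, since the two conditions are evaluated independently.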

In a second aspect, an embodiment of the present invention provides a device for label classification of metadata, including: an obtaining unit, configured to obtain, for any piece of metadata, the Chinese semantic information and the English semantic information of the metadata; and a processing unit, configured to: determine, according to a Chinese word segmentation set, a first feature vector corresponding to the Chinese semantic information of the metadata, where the Chinese word segmentation set is obtained by segmenting the Chinese semantic information of each piece of metadata; determine, according to an English word segmentation set, a second feature vector corresponding to the English semantic information of the metadata, where the English word segmentation set is obtained by segmenting the English semantic information of each piece of metadata; concatenate the first feature vector with the second feature vector to obtain a feature encoding vector of the metadata; and input the feature encoding vector of the metadata into each label classifier to determine the label category of the metadata.

Optionally, the processing unit is specifically configured to: segment the Chinese semantic information of the metadata to obtain first segmented words; for any first segmented word, if the first segmented word is determined to exist in the Chinese word segmentation set, set the sub-feature vector corresponding to the first segmented word to a first value; if the first segmented word is determined not to exist in the Chinese word segmentation set, set the sub-feature vector corresponding to the first segmented word to a second value; and concatenate the sub-feature vectors corresponding to the first segmented words, according to the position of each first segmented word in the Chinese semantic information of the metadata, to obtain the first feature vector corresponding to the Chinese semantic information of the metadata.

Optionally, the processing unit is specifically configured to: segment the English semantic information of the metadata to obtain second segmented words; for any second segmented word, if the second segmented word is determined to exist in the English word segmentation set, set the sub-feature vector corresponding to the second segmented word to the first value; if the second segmented word is determined not to exist in the English word segmentation set, set the sub-feature vector corresponding to the second segmented word to the second value; and concatenate the sub-feature vectors corresponding to the second segmented words, according to the position of each second segmented word in the English semantic information of the metadata, to obtain the second feature vector corresponding to the English semantic information of the metadata.

Optionally, the processing unit is specifically configured to: input the feature encoding vector of the metadata into each label classifier to obtain the prediction probability corresponding to each label classifier; and determine the label category of the metadata according to the prediction probabilities corresponding to the label classifiers.

Optionally, each label classifier is obtained by training on training samples, and each label classifier has an upper threshold and a lower threshold obtained through training. The processing unit is specifically configured to: if the first prediction probability corresponding to a first label classifier is higher than the upper threshold of the first label classifier, and the second prediction probability corresponding to each second label classifier is lower than the lower threshold of that second label classifier, determine that the metadata has the label category corresponding to the first label classifier, where the second label classifiers are all of the label classifiers other than the first label classifier.

Optionally, the processing unit is specifically configured to: for any label classifier, if the prediction probability of the label classifier is lower than the lower threshold of the label classifier, determine that the metadata does not have the label category corresponding to the label classifier, and use the metadata as a training sample for updating the label classifiers.

Optionally, the processing unit is specifically configured to: use the metadata as a training sample for updating the label classifiers.

Optionally, the processing unit is specifically configured to: if no label classifier has a prediction probability higher than its upper threshold, then after the label classifiers have been updated, input the feature encoding vector of the metadata into each updated label classifier again to determine the label category of the metadata.

In a third aspect, an embodiment of the present invention further provides a computing device, including at least one processor and at least one memory, where the memory stores a computer program that, when executed by the processor, causes the processor to perform the method for label classification of metadata according to the first aspect.

In a fourth aspect, an embodiment of the present invention further provides a computer-readable storage medium storing a program that, when run on a computer, causes the computer to perform the method for label classification of metadata according to the first aspect.

Brief Description of the Drawings

To describe the technical solutions in the embodiments of the present invention more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below are merely some embodiments of the present invention, and a person of ordinary skill in the art can derive other drawings from them without creative effort.

FIG. 1 is a flowchart of a method for label classification of metadata provided by an embodiment of the present invention;

FIG. 2 is a flowchart of a method for improving the accuracy of the upper threshold and the lower threshold of each label classifier provided by an embodiment of the present invention;

FIG. 3 is a schematic structural diagram of metadata label classification provided by an embodiment of the present invention;

FIG. 4 is a schematic structural diagram of a computing device provided by an embodiment of the present invention.

Detailed Description

To make the objectives, technical solutions, and advantages of the present invention clearer, the present invention is described in further detail below with reference to the drawings. The described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the present invention.

Some terms used in this application are explained below in general terms to aid understanding by a person skilled in the art. These explanations do not limit the terms used in this application.

1. Metadata: metadata, also called intermediary data or relay data, is data that describes data (data about data), and mainly consists of information describing the properties of data.

In one possible scenario, when performing data governance, company A needs to classify by label the metadata in the second-level directories of its data resource catalog. The label categories include institution information, merchant information, user information, terminal information, transaction information, and so on. Classifying metadata helps users understand the data better, and it also helps a big data platform establish data grading.

As described in the Background, because the volume of metadata is large, judging the label category of metadata manually wastes labor and resources, and manual judgment is also error-prone. Manual judgment therefore raises labor cost while lowering both the efficiency of label classification and the accuracy of the label categories determined.

In view of the above problems, this application proposes a method for label classification of metadata. By inputting metadata into label classifiers, the method obtains the label category corresponding to the metadata, which improves the efficiency of metadata classification and reduces labor cost.

Metadata can carry many kinds of attribute information. For example, collected metadata attribute information includes, but is not limited to, the English name, Chinese name, field English name, field type, field Chinese name, and field value description. The metadata is encoded into a feature encoding vector according to these attributes, the feature encoding vector is input into each label classifier, and the label category of the metadata is output, so that the label category is determined more accurately and label classification becomes more efficient. For ease of description, the following takes the Chinese semantic information and the English semantic information of the metadata as the attribute information. This is only an example to aid understanding and does not limit the attribute information of metadata in this application.

FIG. 1 is a flowchart of a method for label classification of metadata provided by an embodiment of the present invention. The method includes the following steps:

Step 101: for any piece of metadata, obtain the Chinese semantic information and the English semantic information of the metadata.

In this embodiment of the present invention, for any piece of metadata, the Chinese semantic information and the English semantic information of the metadata are obtained. For example, if data A is 01000000, the obtained Chinese semantic information A' of the metadata is 发卡机构标识码 (card issuer identification code) and the English semantic information A' is ISS_INS_ID_CD. As another example, if data B is 李明明 (Li Mingming), the obtained Chinese semantic information B' of the metadata is 姓名 (name) and the English semantic information B' is NAME. As a further example, if data C is 8020, the obtained Chinese semantic information C' of the metadata is 收单机构标识码 (acquirer identification code) and the English semantic information C' is ACQ_INS_CD.

Step 102: determine, according to the Chinese word segmentation set, the first feature vector corresponding to the Chinese semantic information of the metadata.

In this embodiment of the present invention, the Chinese word segmentation set is determined from the Chinese semantic information of the metadata. For example, suppose there are metadata A', metadata B', and metadata C'. The Chinese semantic information A' is 发卡机构标识码 (card issuer identification code); segmenting it gives the first segmented words 发卡 (card issuing), 机构 (institution), and 标识码 (identification code). The Chinese semantic information B' is 姓名 (name); segmenting it gives the single first segmented word 姓名. The Chinese semantic information C' is 收单机构标识码 (acquirer identification code); segmenting it gives the first segmented words 收单 (acquiring), 机构, and 标识码. The Chinese word segmentation set of the metadata is therefore {发卡, 机构, 标识码, 收单, 姓名}.
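A minimal sketch of building the Chinese word segmentation set from the three examples. The text does not name a segmenter, so forward maximum matching against a small hand-written lexicon is swapped in here purely for illustration; `LEXICON`, `segment`, and `build_segmentation_set` are hypothetical names, and a production system would more likely use a dedicated Chinese segmentation library.

```python
from typing import List, Set

# Hypothetical lexicon, chosen so the toy segmenter reproduces the example.
LEXICON: Set[str] = {"发卡", "机构", "标识码", "收单", "姓名"}


def segment(text: str, lexicon: Set[str], max_len: int = 3) -> List[str]:
    """Forward maximum matching: at each position take the longest lexicon
    entry, falling back to a single character when nothing matches."""
    tokens: List[str] = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in lexicon:
                tokens.append(piece)
                i += size
                break
    return tokens


def build_segmentation_set(texts: List[str], lexicon: Set[str]) -> Set[str]:
    """Union of the segmented words of every metadata description."""
    result: Set[str] = set()
    for text in texts:
        result.update(segment(text, lexicon))
    return result


DESCRIPTIONS = ["发卡机构标识码", "姓名", "收单机构标识码"]
SEGMENTATION_SET = build_segmentation_set(DESCRIPTIONS, LEXICON)
```

A set has no inherent order, so an ordering has to be fixed before the set can index the positions of a feature vector.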

The first feature vector of the metadata is then determined from the Chinese word segmentation set and the Chinese semantic information of the metadata. Continuing the example with metadata A', metadata B' and metadata C', the Chinese word segmentation set is {发卡, 机构, 标识码, 收单, 姓名}, the Chinese semantic information A' is 发卡机构标识码, B' is 姓名, and C' is 收单机构标识码. For any first segmented word, if it is determined that the first segmented word exists in the Chinese word segmentation set, the sub-feature vector corresponding to that word is set to a first value; if it is determined that the first segmented word does not exist in the Chinese word segmentation set, the sub-feature vector is set to a second value. Here the first value is taken as 1 and the second value as 0. The sub-feature vectors are then concatenated according to the position of each first segmented word in the Chinese semantic information of the metadata, giving the first feature vector corresponding to the Chinese semantic information. In this example each vector has one position per entry of the Chinese word segmentation set, in the order the set is listed. The first feature vector of metadata A' is therefore {1, 1, 1, 0, 0}, that of metadata B' is {0, 0, 0, 0, 1}, and that of metadata C' is {0, 1, 1, 1, 0}.
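The encoding in this example can be sketched as a multi-hot vector over the Chinese word segmentation set. This is a minimal sketch, not the patented implementation: the function name `multi_hot` and the pre-segmented token lists are illustrative assumptions, and the sketch assumes that one 0/1 position is kept per set entry, which is the reading that reproduces the example vectors.

```python
def multi_hot(tokens, vocabulary):
    """Return one 0/1 entry per vocabulary word: 1 if the word is among tokens."""
    present = set(tokens)
    return [1 if word in present else 0 for word in vocabulary]


# Chinese word segmentation set, in the order listed in the example.
chinese_vocabulary = ["发卡", "机构", "标识码", "收单", "姓名"]

# Assumed output of a Chinese word segmenter for A', B' and C'.
segmented = {
    "A'": ["发卡", "机构", "标识码"],
    "B'": ["姓名"],
    "C'": ["收单", "机构", "标识码"],
}

first_vectors = {
    name: multi_hot(tokens, chinese_vocabulary)
    for name, tokens in segmented.items()
}

for name, vector in first_vectors.items():
    print(name, vector)
```

The segmentation itself is left out of the sketch; any segmenter that produces the token lists above could feed `multi_hot`.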

Step 103: determine, according to the English word segmentation set, the second feature vector corresponding to the English semantic information of the metadata.

In this embodiment of the present invention, the English word segmentation set is determined from the English semantic information of the metadata. Take metadata A', metadata B' and metadata C' as an example. The English semantic information A' is ISS_INS_ID_CD; segmenting it yields the second segmented words ISS, INS, ID and CD. The English semantic information B' is NAME; segmenting it yields the single second segmented word NAME. The English semantic information C' is ACQ_INS_CD; segmenting it yields the second segmented words ACQ, INS and CD. The English word segmentation set of the metadata is therefore {ISS, INS, ID, CD, ACQ, NAME}.

The second feature vector of the metadata is then determined from the English word segmentation set and the English semantic information of the metadata. Continuing the example, the English word segmentation set is {ISS, INS, ID, CD, ACQ, NAME}, the English semantic information A' is ISS_INS_ID_CD, B' is NAME, and C' is ACQ_INS_CD. For any second segmented word, if it is determined that the second segmented word exists in the English word segmentation set, the sub-feature vector corresponding to that word is set to the first value; if it is determined that the second segmented word does not exist in the English word segmentation set, the sub-feature vector is set to the second value. Here the first value is taken as 1 and the second value as 0. The sub-feature vectors are then concatenated according to the position of each second segmented word in the English semantic information of the metadata, giving the second feature vector corresponding to the English semantic information. The second feature vector of metadata A' is therefore {1, 1, 1, 1, 0, 0}, that of metadata B' is {0, 0, 0, 0, 0, 1}, and that of metadata C' is {0, 1, 1, 1, 1, 0}.

To make the solution easier to follow, step 102 is described as being performed before step 103. This is one possible implementation; performing step 102 after step 103, or performing the two steps in parallel, are also possible implementations, and no limitation is imposed here.
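For the English semantic information, segmentation in the example amounts to splitting the field name at underscores. The sketch below shows that split and one way to collect the segmentation set. The helper names `split_field_name` and `build_vocabulary` are hypothetical, and the order in which the names are passed in (A', C', B') is an assumption chosen so that first-appearance order matches the set as it is listed in the example.

```python
def split_field_name(name):
    """Split an underscore-delimited field name into its segmented words."""
    return [part for part in name.split("_") if part]


def build_vocabulary(names):
    """Collect segmented words in order of first appearance, without duplicates."""
    vocabulary = []
    for name in names:
        for token in split_field_name(name):
            if token not in vocabulary:
                vocabulary.append(token)
    return vocabulary


# Assumed processing order: A', C', B'.
english_vocabulary = build_vocabulary(["ISS_INS_ID_CD", "ACQ_INS_CD", "NAME"])

print(split_field_name("ISS_INS_ID_CD"))
print(english_vocabulary)
```

The resulting list can be encoded position by position in the same way as the Chinese word segmentation set.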

Step 104: concatenate the first feature vector and the second feature vector to obtain the feature encoding vector of the metadata.

In this embodiment of the present invention, the first feature vector and the second feature vector of the metadata are concatenated into the feature encoding vector of the metadata. For example, if the first feature vector of metadata A' is {1, 1, 1, 0, 0} and its second feature vector is {1, 1, 1, 1, 0, 0}, concatenating the two gives the feature encoding vector {1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0} for metadata A'.

As another example, if the first feature vector of metadata B' is {0, 0, 0, 0, 1} and its second feature vector is {0, 0, 0, 0, 0, 1}, concatenating the two gives the feature encoding vector {0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1} for metadata B'.

As a further example, if the first feature vector of metadata C' is {0, 1, 1, 1, 0} and its second feature vector is {0, 1, 1, 1, 1, 0}, concatenating the two gives the feature encoding vector {0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0} for metadata C'.
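The concatenation in step 104 can be sketched in a few lines. The function name `feature_encoding` is an illustrative assumption; the input vectors are those given for metadata A' and metadata B' in the examples above.

```python
def feature_encoding(first_vector, second_vector):
    """Concatenate the Chinese-side and English-side vectors, Chinese side first."""
    return list(first_vector) + list(second_vector)


encoding_a = feature_encoding([1, 1, 1, 0, 0], [1, 1, 1, 1, 0, 0])
encoding_b = feature_encoding([0, 0, 0, 0, 1], [0, 0, 0, 0, 0, 1])

print(encoding_a)
print(encoding_b)
```

The length of the result is the sum of the two set sizes, 5 + 6 = 11 in this example, so every metadata item is encoded with the same dimension.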

Step 105: input the feature encoding vector of the metadata into each label classifier to determine the label category of the metadata.

In this embodiment of the present invention, N label categories are set. They may be preset, or they may be set according to the specific situation; no limitation is imposed here. For example, if N is 5, the five label categories are the institution information category, the merchant information category, the user information category, the terminal information category and the transaction information category. Five label classifiers are determined from the five label categories: the institution information label classifier, the merchant information label classifier, the user information label classifier, the terminal information label classifier and the transaction information label classifier. The feature encoding vector of the metadata is input into each of the five label classifiers to obtain the predicted probability corresponding to each classifier.

For example, suppose the feature encoding vector of metadata A' is {1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0} and there are five label classifiers: the institution information, merchant information, user information, terminal information and transaction information label classifiers. Metadata A' is input into each of the five label classifiers. The output predicted probability is 90% for the institution information label classifier, 20% for the merchant information label classifier, 21% for the user information label classifier, 22% for the terminal information label classifier and 19% for the transaction information label classifier. The label category of the label classifier with the largest predicted probability is determined as the label category of metadata A'. Since 90% > 22% > 21% > 20% > 19%, the label category of metadata A' is institution information.
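The selection rule in this example is an argmax over the per-classifier predicted probabilities. The sketch below uses the probabilities quoted for metadata A'; the dictionary layout and the function name `pick_label` are illustrative assumptions, and the classifiers themselves are not modelled.

```python
def pick_label(probabilities):
    """Return the label category whose classifier reports the highest probability."""
    return max(probabilities, key=probabilities.get)


# Predicted probabilities quoted in the example for metadata A'.
predicted_for_a = {
    "institution information": 0.90,
    "merchant information": 0.20,
    "user information": 0.21,
    "terminal information": 0.22,
    "transaction information": 0.19,
}

label_for_a = pick_label(predicted_for_a)
print(label_for_a)
```

Note that the five probabilities need not sum to 1, because each comes from a separate label classifier. If two classifiers tie for the maximum, `max` returns the first one in dictionary order; the text does not say how ties are handled.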

As steps 101 to 105 show, the metadata is segmented to determine its feature encoding vector, the feature encoding vector is input into each label classifier, and the label category of the label classifier with the largest predicted probability is taken as the label category of the metadata. The label category of the metadata is thereby determined more accurately and more efficiently.

Because the present application inputs the feature encoding vector of the metadata into each label classifier, the label category of the metadata is determined quickly, which improves the efficiency of metadata classification. Training and optimizing each label classifier further improves the accuracy with which the label category is determined. How the label classifiers are trained and optimized is described below.

To this end, training samples can be input into each label classifier so that each label classifier is better trained and optimized. Test metadata can then be input into the optimized label classifiers, so that the label category of the metadata is determined more accurately and the efficiency of metadata classification is improved. A training sample is metadata whose label category has been judged manually. In other words, the label category of each training sample is known, and the training samples are used to train and optimize the label classifiers.

The training samples are input into the label classifiers to determine the upper threshold and the lower threshold of each label classifier. For example, suppose the training samples are metadata A', metadata B', metadata C', metadata D', metadata E' and metadata F', and the label classifiers are the institution information, merchant information, user information, terminal information and transaction information label classifiers. The known label categories are institution information for metadata A' and metadata B', user information for metadata C', merchant information for metadata D', terminal information for metadata E', and transaction information for metadata F'.

Metadata A' and metadata B' are input into the institution information label classifier, and N iterations are run. In the first iteration, metadata A' and metadata B' are input into the classifier, the upper threshold and the lower threshold of the classifier are re-determined according to the accuracy of the classifier's output, and the classifier is updated with the re-determined thresholds. In the next iteration, metadata A' and metadata B' are input into the updated classifier, the thresholds are again re-determined according to the accuracy of the output, and the classifier is updated again. After N iterations, the classifier outputs the predicted probability that the label category of metadata A' is institution information and the predicted probability that the label category of metadata B' is institution information.

The upper threshold and the lower threshold of the institution information label classifier are used to decide whether metadata belongs to the institution information label category. Suppose the training samples yield an upper threshold of 85% and a lower threshold of 20%. If metadata M is later input into the institution information label classifier and, after N iterations, the output predicted probability is 19%, metadata M is determined not to belong to the institution information label category, because 19% is below the lower threshold of 20%. If the output predicted probability is 90%, metadata M is determined to belong to the institution information label category, because 90% exceeds the upper threshold of 85%.
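The two-threshold decision described for metadata M can be sketched as follows. The function name `threshold_decision` is hypothetical. This passage does not say what happens when the predicted probability falls between the two thresholds, so returning `None` for that case is an assumption of the sketch rather than part of the described method.

```python
from typing import Optional


def threshold_decision(probability: float, lower: float, upper: float) -> Optional[bool]:
    """Apply a classifier's thresholds to one predicted probability.

    Returns True above the upper threshold, False below the lower threshold,
    and None in between (assumed here to mean "undecided").
    """
    if probability > upper:
        return True
    if probability < lower:
        return False
    return None


# Institution information label classifier from the example: lower 20%, upper 85%.
print(threshold_decision(0.19, lower=0.20, upper=0.85))  # M does not belong
print(threshold_decision(0.90, lower=0.20, upper=0.85))  # M belongs
```

Keeping the undecided case explicit makes it easy to route such metadata to whatever further handling a deployment chooses.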

Metadata C' is input into the user information label classifier, and N iterations are run. In the first iteration, metadata C' is input into the classifier, the upper threshold and the lower threshold of the classifier are re-determined according to the accuracy of the classifier's output, and the classifier is updated with the re-determined thresholds. In the next iteration, metadata C' is input into the updated classifier, the thresholds are again re-determined according to the accuracy of the output, and the classifier is updated again. After N iterations, the classifier outputs the predicted probability that the label category of metadata C' is user information. The upper threshold and the lower threshold of the user information label classifier are used to decide whether metadata belongs to the user information label category. Suppose the upper threshold is 90% and the lower threshold is 25%. If test metadata N is later input into the user information label classifier and the output predicted probability is 19%, metadata N is determined not to belong to the user information label category, because 19% is below the lower threshold of 25%. If the output predicted probability is 91%, metadata N is determined to belong to the user information label category, because 91% exceeds the upper threshold of 90%.

Metadata D' is input into the merchant information label classifier, and N iterations are run. In the first iteration, metadata D' is input into the classifier, the upper threshold and the lower threshold of the classifier are re-determined according to the accuracy of the classifier's output, and the classifier is updated with the re-determined thresholds. In the next iteration, metadata D' is input into the updated classifier, the thresholds are again re-determined according to the accuracy of the output, and the classifier is updated again. After N iterations, the classifier outputs the predicted probability that the label category of metadata D' is merchant information. The upper threshold and the lower threshold of the merchant information label classifier are used to decide whether metadata belongs to the merchant information label category. Suppose the upper threshold is 80% and the lower threshold is 15%. If metadata K is later input into the merchant information label classifier and the output predicted probability is 13%, metadata K is determined not to belong to the merchant information label category, because 13% is below the lower threshold of 15%. If the output predicted probability is 85%, metadata K is determined to belong to the merchant information label category, because 85% exceeds the upper threshold of 80%.

Metadata E' is input into the terminal information label classifier, and N iterations are run. In the first iteration, metadata E' is input into the classifier, the upper threshold and the lower threshold of the classifier are re-determined according to the accuracy of the classifier's output, and the classifier is updated with the re-determined thresholds. In the next iteration, metadata E' is input into the updated classifier, the thresholds are again re-determined according to the accuracy of the output, and the classifier is updated again. After N iterations, the classifier outputs the predicted probability that the label category of metadata E' is terminal information. The upper threshold and the lower threshold of the terminal information label classifier are used to decide whether metadata belongs to the terminal information label category. Suppose the upper threshold is 87% and the lower threshold is 23%. If metadata L is later input into the terminal information label classifier and the output predicted probability is 18%, metadata L is determined not to belong to the terminal information label category, because 18% is below the lower threshold of 23%. If the output predicted probability is 90%, metadata L is determined to belong to the terminal information label category, because 90% exceeds the upper threshold of 87%.

Metadata F' is input into the transaction information label classifier, and N iterations are run. In the first iteration, metadata F' is input into the classifier, the upper threshold and the lower threshold of the classifier are re-determined according to the accuracy of the classifier's output, and the classifier is updated with the re-determined thresholds. In the next iteration, metadata F' is input into the updated classifier, the thresholds are again re-determined according to the accuracy of the output, and the classifier is updated again. After N iterations, the classifier outputs the predicted probability that the label category of metadata F' is transaction information, and the upper threshold and the lower threshold of the transaction information label classifier are determined from that predicted probability. The upper threshold and the lower threshold of the transaction information label classifier are used to decide whether metadata belongs to the transaction information label category. Suppose the upper threshold is 88% and the lower threshold is 25%. If test metadata W is later input into the transaction information label classifier and the output predicted probability is 20%, metadata W is determined not to belong to the transaction information label category, because 20% is below the lower threshold of 25%. If the output predicted probability is 91%, test metadata W is determined to belong to the transaction information label category, because 91% exceeds the upper threshold of 88%.

Once each label classifier has been trained and optimized, the label category of metadata can be determined by inputting the metadata into each label classifier. How the label category of metadata is determined is described below.

In one possible case, the training samples are numerous and varied enough to meet the conditions for training each label classifier, so the upper threshold and the lower threshold of each label classifier can be determined fairly accurately from the training samples. Metadata is then input into each trained label classifier, and each label classifier outputs its corresponding predicted probability. For example, suppose there are five label classifiers: the institution information, merchant information, user information, terminal information and transaction information label classifiers. Inputting the corresponding training samples into these five label classifiers determines their upper and lower thresholds. The institution information label classifier has an upper threshold of 85% and a lower threshold of 20%. The merchant information label classifier has an upper threshold of 90% and a lower threshold of 25%. The user information label classifier has an upper threshold of 86% and a lower threshold of 19%. The terminal information label classifier has an upper threshold of 91% and a lower threshold of 23%. The transaction information label classifier has an upper threshold of 95% and a lower threshold of 30%. If metadata Z is input into each of the five trained label classifiers, five predicted probabilities are output: 90% for the institution information label classifier, 70% for the merchant information label classifier, 60% for the user information label classifier, 55% for the terminal information label classifier and 55% for the transaction information label classifier. The label category of the label classifier with the largest of these five predicted probabilities is taken as the label category of metadata Z, so the label category of metadata Z is determined to be institution information.

In another possible case, the training samples are too few in number and variety to meet the conditions for training each label classifier, so the upper threshold and the lower threshold determined from the training samples alone have low accuracy. The accuracy of the upper threshold and the lower threshold of each label classifier therefore needs to be improved, so that the label category of metadata can subsequently be determined more accurately by the label classifiers. How the accuracy of the upper and lower thresholds of each label classifier is improved is described below.

To make the solution easier to understand, assume the label classifiers are the institution information, merchant information, user information, terminal information and transaction information label classifiers, and take the judgment of whether metadata belongs to the institution information label classifier as the example for describing how the accuracy of the upper and lower thresholds of each label classifier is improved. The example is given to aid understanding by those skilled in the art and does not limit the present application to judging whether metadata belongs to the institution information label classifier.

FIG. 2 is a flowchart of a method, provided by an embodiment of the present invention, for improving the accuracy of the upper and lower thresholds of each label classifier. The method includes the following steps:

Step 201: input the metadata into each label classifier.

In this embodiment of the present invention, the training samples are input into the corresponding label classifiers to preliminarily determine the upper and lower thresholds of each label classifier, which yields the trained label classifiers. The metadata is then input into each of the trained label classifiers.

Step 202: determine whether the number of iterations is greater than N. If yes, go to step 203; if no, go to step 204.

In this embodiment of the present invention, the metadata is iterated multiple times after being input into the label classifiers, and each iteration outputs the predicted probability of the metadata for each label classifier. Once the number of iterations exceeds N, the predicted probabilities output by the label classifiers are relatively accurate.

Step 203: each label classifier outputs its predicted probability, and the label category of the metadata is determined from these predicted probabilities.

In this embodiment of the present invention, when the number of iterations exceeds N, the iteration ends and each label classifier outputs its predicted probability. The label category of the metadata is then determined by comparing the magnitudes of these predicted probabilities.

For example, the label classifiers are the institution information, merchant information, user information, terminal information, and transaction information label classifiers. The institution information label classifier has an upper threshold of 87% and a lower threshold of 21%. The merchant information label classifier has an upper threshold of 91% and a lower threshold of 24%. The user information label classifier has an upper threshold of 85% and a lower threshold of 18%. The terminal information label classifier has an upper threshold of 92% and a lower threshold of 24%. The transaction information label classifier has an upper threshold of 96% and a lower threshold of 26%. If metadata U is input into these five trained label classifiers, then after more than N iterations five predicted probabilities are output: 90% for the institution information label classifier, 70% for the merchant information label classifier, 60% for the user information label classifier, 55% for the terminal information label classifier, and 55% for the transaction information label classifier. The label category of the classifier with the largest of the five predicted probabilities is taken as the label category of metadata U. Because the label category corresponding to the institution information label classifier is institution information, the label category of metadata U is determined to be institution information.
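The selection rule in step 203 amounts to taking the label with the largest predicted probability. A minimal Python sketch follows, using the metadata U figures from the example; the dictionary keys and the function name `pick_label` are illustrative assumptions and do not come from the application.

```python
# Predicted probabilities for metadata U after more than N iterations,
# copied from the example above and expressed as fractions.
predicted_u = {
    "institution": 0.90,
    "merchant": 0.70,
    "user": 0.60,
    "terminal": 0.55,
    "transaction": 0.55,
}


def pick_label(probabilities):
    """Return the label whose classifier reports the highest probability."""
    return max(probabilities, key=probabilities.get)


print(pick_label(predicted_u))  # institution
```

How a tie between two classifiers should be resolved is not stated in the description; this sketch simply keeps the first maximum encountered.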

Step 204: determine whether the predicted probability of the institution information classifier is greater than its upper threshold and whether the predicted probability of every other label classifier is less than that classifier's own lower threshold. If yes, go to step 205; if no, go to step 206.

In this embodiment of the present invention, for example, the institution information label classifier has an upper threshold of 87% and a lower threshold of 21%, the merchant information label classifier has an upper threshold of 91% and a lower threshold of 24%, and the user information label classifier has an upper threshold of 85% and a lower threshold of 18%. Metadata R is input into each label classifier. After M iterations, where M is less than N, the predicted probabilities are 98% for the institution information classifier, 10% for the user information classifier, 9% for the merchant information classifier, 8% for the terminal information classifier, and 5% for the transaction information classifier. The predicted probability of the institution information classifier (98%) is greater than its upper threshold (87%). The predicted probability of the user information classifier (10%) is less than its lower threshold (18%), that of the merchant information classifier (9%) is less than its lower threshold (24%), and those of the terminal information classifier (8%) and the transaction information classifier (5%) are less than the lower thresholds of their respective classifiers. The label category of metadata R can therefore be determined to be institution information.
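The step 204 condition can be sketched in Python as follows. This is only an illustration under assumptions: the threshold table reuses the five pairs from the metadata U example earlier in this description, and the names `THRESHOLDS` and `is_confident_label` are hypothetical.

```python
# (lower, upper) thresholds per classifier, as fractions. The values reuse the
# metadata U example in this description; the table layout is an assumption.
THRESHOLDS = {
    "institution": (0.21, 0.87),
    "merchant": (0.24, 0.91),
    "user": (0.18, 0.85),
    "terminal": (0.24, 0.92),
    "transaction": (0.26, 0.96),
}


def is_confident_label(candidate, probabilities, thresholds=THRESHOLDS):
    """Step 204: the candidate is above its upper threshold and every
    other classifier is below its own lower threshold."""
    _, upper = thresholds[candidate]
    if probabilities[candidate] <= upper:
        return False
    return all(
        probabilities[name] < thresholds[name][0]
        for name in thresholds
        if name != candidate
    )


# Metadata R from the example above.
metadata_r = {
    "institution": 0.98,
    "user": 0.10,
    "merchant": 0.09,
    "terminal": 0.08,
    "transaction": 0.05,
}
print(is_confident_label("institution", metadata_r))  # True
```

Each of the other classifiers is compared against its own lower threshold, which is how claim 5 words the condition.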

Step 205: determine that the label category of the metadata is institution information, and use the metadata as a training sample.

In this embodiment of the present invention, because the label category of the metadata has been determined before the number of iterations reaches N, the metadata can be used as a training sample. The upper and lower thresholds of each label classifier can then be updated according to the training samples to obtain updated label classifiers, which improves the accuracy of subsequently determined label categories.

Step 206: determine whether the predicted probability of the institution information classifier is less than its lower threshold. If yes, go to step 207; if no, go to step 201.

In this embodiment of the present invention, there are three possibilities for the label category of the metadata. First, the label category is institution information. Second, the label category is definitely not institution information. Third, the label category may or may not be institution information, and it cannot be determined for the time being. Step 204 establishes that the metadata does not match the first possibility, so step 206 is needed to decide between the second and the third. For example, the institution information label classifier has an upper threshold of 87% and a lower threshold of 21%, the merchant information label classifier has an upper threshold of 91% and a lower threshold of 24%, and the user information label classifier has an upper threshold of 85% and a lower threshold of 18%. Metadata H is input into each label classifier. After M iterations, where M is less than N, the predicted probabilities are 13% for the institution information classifier, 70% for the user information classifier, 74% for the merchant information classifier, 75% for the terminal information classifier, and 80% for the transaction information classifier. Because the predicted probability of the institution information classifier (13%) is less than its lower threshold (21%), the label category of metadata H is not institution information.

As another example, the institution information label classifier has an upper threshold of 87% and a lower threshold of 21%, the merchant information label classifier has an upper threshold of 91% and a lower threshold of 24%, and the user information label classifier has an upper threshold of 85% and a lower threshold of 18%. Metadata P is input into each label classifier. After M iterations, where M is less than N, the predicted probabilities are 25% for the institution information classifier, 70% for the user information classifier, 74% for the merchant information classifier, 75% for the terminal information classifier, and 80% for the transaction information classifier. Because the predicted probability of the institution information classifier (25%) is greater than its lower threshold (21%) and less than its upper threshold (87%), the label category of metadata P cannot be determined for the time being: it may or may not be institution information. Because the number of iterations is less than N, the metadata is input into each label classifier again so that its label category can be determined more accurately later.
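The three possible outcomes described above (is institution information, is not, or undetermined) can be expressed as a small decision function. The Python sketch below assumes the five threshold pairs from the metadata U example; `Outcome` and `judge` are hypothetical names introduced only for illustration.

```python
from enum import Enum


class Outcome(Enum):
    IS_LABEL = "is the label"
    NOT_LABEL = "is not the label"
    UNDETERMINED = "undetermined"


# (lower, upper) thresholds as fractions, reused from the metadata U example.
THRESHOLDS = {
    "institution": (0.21, 0.87),
    "merchant": (0.24, 0.91),
    "user": (0.18, 0.85),
    "terminal": (0.24, 0.92),
    "transaction": (0.26, 0.96),
}


def judge(candidate, probabilities, thresholds=THRESHOLDS):
    """Combine the checks of steps 204 and 206 for one candidate label."""
    lower, upper = thresholds[candidate]
    p = probabilities[candidate]
    others_low = all(
        probabilities[name] < thresholds[name][0]
        for name in thresholds
        if name != candidate
    )
    if p > upper and others_low:
        return Outcome.IS_LABEL      # step 205
    if p < lower:
        return Outcome.NOT_LABEL     # step 207
    return Outcome.UNDETERMINED      # return to step 201


others = {"user": 0.70, "merchant": 0.74, "terminal": 0.75, "transaction": 0.80}
metadata_h = {"institution": 0.13, **others}
metadata_p = {"institution": 0.25, **others}
print(judge("institution", metadata_h))  # Outcome.NOT_LABEL
print(judge("institution", metadata_p))  # Outcome.UNDETERMINED
```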

Step 207: determine that the label category of the metadata is not institution information, and use the metadata as a training sample.

In this embodiment of the present invention, because it has been determined before the number of iterations reaches N that the label category of the metadata is definitely not institution information, the metadata can be used as a training sample. The upper and lower thresholds of each label classifier can then be updated according to the training samples to obtain updated label classifiers, which improves the accuracy of subsequently determined label categories.

As steps 201 to 207 show, while the metadata is iterated through the label classifiers, part of the metadata is selected as training samples according to the above conditions. This enlarges the training set of each label classifier, allows the upper and lower thresholds of each label classifier to be updated, and thereby improves the accuracy of subsequently determined label categories.
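Steps 201 to 207 can be read as one loop with an iteration budget. The Python sketch below is an interpretation under stated assumptions, not the applicant's implementation: `predict` is a hypothetical callable that returns one probability per label on each call, `training_pool` is a plain list standing in for the training set, and retraining the classifiers from that pool is left out because the description does not detail it.

```python
def classify_with_feedback(features, predict, thresholds, candidate,
                           max_iterations, training_pool):
    """Sketch of steps 201-207 for a single candidate label.

    predict: callable taking features and returning {label: probability}.
    thresholds: {label: (lower, upper)}.
    training_pool: list collecting (features, candidate, is_member) samples.
    Returns the decided label, or None when the candidate is ruled out.
    """
    lower, upper = thresholds[candidate]
    probabilities = predict(features)                     # step 201
    for _ in range(max_iterations):                       # step 202
        p = probabilities[candidate]
        others_low = all(
            probabilities[name] < thresholds[name][0]
            for name in thresholds
            if name != candidate
        )
        if p > upper and others_low:                      # steps 204 and 205
            training_pool.append((features, candidate, True))
            return candidate
        if p < lower:                                     # steps 206 and 207
            training_pool.append((features, candidate, False))
            return None
        probabilities = predict(features)                 # back to step 201
    # Step 203: budget exhausted, fall back to the largest probability.
    return max(probabilities, key=probabilities.get)
```

A caller would pass the metadata's feature encoding vector as `features` and, after the pool grows, retrain or re-threshold the classifiers before the next batch.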

Based on the same technical concept, an embodiment of the present invention further provides a label classification device for metadata, which can execute the method in the foregoing method embodiments. The structure of the device is shown in FIG. 3. The device 300 includes an acquisition unit 301 and a processing unit 302. The acquisition unit 301 is configured to obtain, for any metadata, the Chinese semantic information and the English semantic information of the metadata. The processing unit 302 is configured to: determine, according to a Chinese word segmentation set, a first feature vector corresponding to the Chinese semantic information of the metadata, where the Chinese word segmentation set is obtained by segmenting the Chinese semantic information of each metadata; determine, according to an English word segmentation set, a second feature vector corresponding to the English semantic information of the metadata, where the English word segmentation set is obtained by segmenting the English semantic information of each metadata; concatenate the first feature vector and the second feature vector to obtain a feature encoding vector of the metadata; and input the feature encoding vector of the metadata into each label classifier to determine the label category of the metadata.

Optionally, the processing unit 302 is specifically configured to: segment the Chinese semantic information of the metadata to obtain first word segments; for any first word segment, set the sub-feature vector corresponding to the first word segment to a first value if the first word segment exists in the Chinese word segmentation set, and to a second value if it does not; and concatenate the sub-feature vectors of the first word segments in the order in which the first word segments appear in the Chinese semantic information of the metadata, to obtain the first feature vector corresponding to the Chinese semantic information of the metadata.

Optionally, the processing unit 302 is specifically configured to: segment the English semantic information of the metadata to obtain second word segments; for any second word segment, set the sub-feature vector corresponding to the second word segment to the first value if the second word segment exists in the English word segmentation set, and to the second value if it does not; and concatenate the sub-feature vectors of the second word segments in the order in which the second word segments appear in the English semantic information of the metadata, to obtain the second feature vector corresponding to the English semantic information of the metadata.
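The feature encoding described above can be illustrated with a short Python sketch. Several points are assumptions rather than details from the application: the first and second values are taken to be 1 and 0, the inputs are assumed to be already segmented into tokens, and the names `encode` and `feature_encoding` as well as the sample vocabularies are hypothetical.

```python
def encode(tokens, vocabulary, first_value=1, second_value=0):
    """Map each token to first_value if it is in the vocabulary,
    otherwise second_value, preserving the token order."""
    return [first_value if token in vocabulary else second_value
            for token in tokens]


def feature_encoding(chinese_tokens, english_tokens,
                     chinese_vocabulary, english_vocabulary):
    """Concatenate the Chinese and English feature vectors."""
    first_vector = encode(chinese_tokens, chinese_vocabulary)
    second_vector = encode(english_tokens, english_vocabulary)
    return first_vector + second_vector


# Hypothetical segmentation sets and one pre-segmented metadata item.
chinese_vocabulary = {"商户", "机构"}
english_vocabulary = {"merchant", "name"}
vector = feature_encoding(["商户", "名称"], ["merchant", "name"],
                          chinese_vocabulary, english_vocabulary)
print(vector)  # [1, 0, 1, 1]
```

The resulting list is what would be passed to each label classifier as the feature encoding vector.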

Optionally, the processing unit 302 is specifically configured to: input the feature encoding vector of the metadata into each label classifier and obtain the predicted probability corresponding to each label classifier; and determine the label category of the metadata according to the predicted probabilities corresponding to the label classifiers.

Optionally, each label classifier is obtained by training on training samples, and each label classifier has an upper threshold and a lower threshold obtained through training. The processing unit 302 is specifically configured to: if a first predicted probability corresponding to a first label classifier is higher than the upper threshold of the first label classifier, and a second predicted probability corresponding to each second label classifier is lower than the lower threshold of that second label classifier, determine that the metadata has the label category corresponding to the first label classifier. The second label classifiers are all of the label classifiers other than the first label classifier.

Optionally, the processing unit 302 is specifically configured to: for any label classifier, if the predicted probability of the label classifier is lower than the lower threshold of the label classifier, determine that the metadata does not have the label category corresponding to the label classifier, and use the metadata as a training sample for updating the label classifiers.

Optionally, the processing unit 302 is specifically configured to: use the metadata as a training sample for updating the label classifiers.

Optionally, the processing unit 302 is specifically configured to: if no label classifier has a predicted probability higher than its upper threshold, then after the label classifiers are updated, input the feature encoding vector of the metadata into each of the updated label classifiers again to determine the label category of the metadata.

Based on the same technical concept, an embodiment of the present application further provides a computing device 400. As shown in FIG. 4, the computing device includes at least one processor 401 and a memory 402 connected to the at least one processor. The embodiments of the present application do not limit the specific connection medium between the processor 401 and the memory 402; in FIG. 4, a bus connection between the processor 401 and the memory 402 is taken as an example. The bus may be divided into an address bus, a data bus, a control bus, and so on.

In this embodiment of the present application, the memory 402 stores instructions executable by the at least one processor 401, and by executing the instructions stored in the memory 402, the at least one processor 401 can perform the steps included in the foregoing label classification method for metadata.

The processor 401 is the control center of the computing device. It can connect the various parts of the computing device through various interfaces and lines, and it performs data processing by running or executing the instructions stored in the memory 402 and calling the data stored in the memory 402. Optionally, the processor 401 may include one or more processing units, and the processor 401 may integrate an application processor and a modem processor, where the application processor mainly handles the operating system, user interface, and application programs, and the modem processor mainly handles issued instructions. It can be understood that the modem processor may also not be integrated into the processor 401. In some embodiments, the processor 401 and the memory 402 may be implemented on the same chip; in other embodiments, they may be implemented on separate chips.

The processor 401 may be a general-purpose processor, such as a central processing unit (CPU), a digital signal processor, an application-specific integrated circuit (ASIC), a field-programmable gate array or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component, and can implement or execute the methods, steps, and logic block diagrams disclosed in the embodiments of the present application. The general-purpose processor may be a microprocessor or any conventional processor. The steps of the method disclosed in the embodiments of the label classification method for metadata may be performed directly by a hardware processor, or by a combination of hardware and software modules in the processor.

The memory 402, as a non-volatile computer-readable storage medium, can be used to store non-volatile software programs, non-volatile computer-executable programs, and modules. The memory 402 may include at least one type of storage medium, for example flash memory, a hard disk, a multimedia card, card-type memory, random access memory (RAM), static random access memory (SRAM), programmable read-only memory (PROM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), magnetic memory, a magnetic disk, an optical disc, and so on. The memory 402 may be any other medium that can carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, but is not limited thereto. The memory 402 in the embodiments of the present application may also be a circuit or any other device capable of implementing a storage function, and is used to store program instructions and/or data.

Based on the same technical concept, an embodiment of the present application further provides a computer-readable storage medium storing a computer program executable by a computing device. When the program runs on the computing device, it causes the computing device to perform the steps of the foregoing label classification method for metadata.

Those skilled in the art should understand that the embodiments of the present application may be provided as a method, a system, or a computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, and optical storage) containing computer-usable program code.

The present application is described with reference to flowcharts and/or block diagrams of the method, device (system), and computer program product according to the present application. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and any combination of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce an apparatus for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to operate in a specific manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction apparatus that implements the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

These computer program instructions may also be loaded onto a computer or other programmable data processing device, such that a series of operational steps are performed on the computer or other programmable device to produce computer-implemented processing. The instructions executed on the computer or other programmable device thereby provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

Although preferred embodiments of the present application have been described, those skilled in the art can make additional changes and modifications to these embodiments once they learn of the basic inventive concept. The appended claims are therefore intended to be construed as covering the preferred embodiments and all changes and modifications that fall within the scope of the present application.

Obviously, those skilled in the art can make various changes and variations to the present application without departing from its spirit and scope. If these changes and variations fall within the scope of the claims of the present application and their equivalent technologies, the present application is intended to include them as well.

Claims (11)

1. A label classification method for metadata, comprising:
for any metadata, obtaining Chinese semantic information of the metadata and English semantic information of the metadata;
determining, according to a Chinese word segmentation set, a first feature vector corresponding to the Chinese semantic information of the metadata, wherein the Chinese word segmentation set is obtained by segmenting the Chinese semantic information of each metadata;
determining, according to an English word segmentation set, a second feature vector corresponding to the English semantic information of the metadata, wherein the English word segmentation set is obtained by segmenting the English semantic information of each metadata;
concatenating the first feature vector and the second feature vector to obtain a feature encoding vector of the metadata; and
inputting the feature encoding vector of the metadata into each label classifier to determine the label category of the metadata.

2. The method according to claim 1, wherein determining, according to the Chinese word segmentation set, the first feature vector corresponding to the Chinese semantic information of the metadata comprises:
segmenting the Chinese semantic information of the metadata to obtain first word segments;
for any first word segment, setting the sub-feature vector corresponding to the first word segment to a first value if the first word segment exists in the Chinese word segmentation set, and setting the sub-feature vector corresponding to the first word segment to a second value if the first word segment does not exist in the Chinese word segmentation set; and
concatenating the sub-feature vectors corresponding to the first word segments according to the positions of the first word segments in the Chinese semantic information of the metadata, to obtain the first feature vector corresponding to the Chinese semantic information of the metadata.

3. The method according to claim 1, wherein determining, according to the English word segmentation set, the second feature vector corresponding to the English semantic information of the metadata comprises:
segmenting the English semantic information of the metadata to obtain second word segments;
for any second word segment, setting the sub-feature vector corresponding to the second word segment to the first value if the second word segment exists in the English word segmentation set, and setting the sub-feature vector corresponding to the second word segment to the second value if the second word segment does not exist in the English word segmentation set; and
concatenating the sub-feature vectors corresponding to the second word segments according to the positions of the second word segments in the English semantic information of the metadata, to obtain the second feature vector corresponding to the English semantic information of the metadata.

4. The method according to any one of claims 1 to 3, wherein inputting the feature encoding vector of the metadata into each label classifier to determine the label category of the metadata comprises:
inputting the feature encoding vector of the metadata into each label classifier, and obtaining the predicted probability corresponding to each label classifier; and
determining the label category of the metadata according to the predicted probabilities corresponding to the label classifiers.

5. The method according to claim 4, wherein each label classifier is obtained by training on training samples, and each label classifier has an upper threshold and a lower threshold obtained through training;
determining the label category of the metadata according to the predicted probabilities corresponding to the label classifiers comprises:
if a first predicted probability corresponding to a first label classifier is higher than the upper threshold of the first label classifier, and a second predicted probability corresponding to a second label classifier is lower than the lower threshold of the second label classifier, determining that the metadata has the label category corresponding to the first label classifier, wherein the second label classifier is each of the label classifiers other than the first label classifier.

6. The method according to claim 5, further comprising: for any label classifier, if the predicted probability of the label classifier is lower than the lower threshold of the label classifier, determining that the metadata does not have the label category corresponding to the label classifier, and using the metadata as a training sample for updating the label classifiers.

7. The method according to claim 5, further comprising, after determining the label category of the metadata:
using the metadata as a training sample for updating the label classifiers.

8. 
The method according to claim 5, further comprising: if there is no predicted probability corresponding to any label classifier higher than the upper threshold of the label classifier, performing After the update, continue to input the feature encoding vectors of the metadata into the updated label classifiers respectively to determine the label category of the metadata. 9.一种元数据的标签分类装置,其特征在于,包括:9. A tag classification device for metadata, comprising: 获取单元,用于针对任一元数据,获取所述元数据的中文语义信息和所述元数据的英文语义信息;An acquisition unit, configured to acquire the Chinese semantic information of the metadata and the English semantic information of the metadata for any metadata; 处理单元,用于根据中文分词集合,确定所述元数据的中文语义信息对应的第一特征向量;所述中文分词集合是通过对各元数据的中文语义信息进行分词得到的;根据英文分词集合,确定所述元数据的英文语义信息对应的第二特征向量;所述英文分词集合是通过对各元数据的英文语义信息进行分词得到的;将所述第一特征向量与所述第二特征向量拼接,得到所述元数据的特征编码向量;将所述元数据的特征编码向量分别输入各个标签分类器,确定所述元数据的标签类别。The processing unit is used to determine the first feature vector corresponding to the Chinese semantic information of the metadata according to the Chinese word segmentation set; the Chinese word segmentation set is obtained by segmenting the Chinese semantic information of each metadata; according to the English word segmentation set , determine the second feature vector corresponding to the English semantic information of the metadata; the English word segmentation set is obtained by segmenting the English semantic information of each metadata; combine the first feature vector with the second feature Concatenating the vectors to obtain the feature encoding vectors of the metadata; inputting the feature encoding vectors of the metadata into respective label classifiers to determine the label categories of the metadata. 10.一种计算设备,其特征在于,包括至少一个处理器以及至少一个存储器,其中,所述存储器存储有计算机程序,当所述程序被所述处理器执行时,使得所述处理器执行权利要求1至8任一权利要求所述的方法。10. 
A computing device, characterized in that it comprises at least one processor and at least one memory, wherein the memory stores a computer program that, when the program is executed by the processor, causes the processor to execute the rights The method according to any one of claims 1 to 8. 11.一种计算机可读存储介质,其特征在于,所述存储介质存储有程序,当所述程序在计算机上运行时,使得计算机实现执行权利要求1至8中任一项所述的方法。11. A computer-readable storage medium, wherein the storage medium stores a program, and when the program is run on a computer, the computer is enabled to implement the method according to any one of claims 1 to 8.
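As an informal illustration only (not part of the claims), the encoding of claims 1 to 3 and the threshold rule of claim 5 might be sketched in Python as follows. Several details here are assumptions that the claims leave open: the first and second values are taken to be 1 and 0, the input is assumed to be already segmented into words, and the names `encode_tokens`, `encode_metadata`, and `decide_label`, the label names, and the threshold numbers are all hypothetical.

```python
from typing import Dict, List, Optional, Sequence, Set, Tuple

FIRST_VALUE = 1   # assumed value for a word found in the segmentation set
SECOND_VALUE = 0  # assumed value for a word not found in the set


def encode_tokens(tokens: Sequence[str], vocab: Set[str]) -> List[int]:
    """Emit one sub-feature per segmented word, kept in its original position."""
    return [FIRST_VALUE if tok in vocab else SECOND_VALUE for tok in tokens]


def encode_metadata(
    zh_tokens: Sequence[str],
    en_tokens: Sequence[str],
    zh_vocab: Set[str],
    en_vocab: Set[str],
) -> List[int]:
    """Concatenate the Chinese and English feature vectors (claim 1)."""
    return encode_tokens(zh_tokens, zh_vocab) + encode_tokens(en_tokens, en_vocab)


def decide_label(
    probs: Dict[str, float],
    thresholds: Dict[str, Tuple[float, float]],
) -> Optional[str]:
    """Apply the rule of claim 5.

    probs maps each label to its classifier's predicted probability;
    thresholds maps each label to (lower, upper). A label is returned only
    if its probability exceeds its upper threshold and every other
    classifier's probability is below that classifier's lower threshold.
    """
    for label, p in probs.items():
        if p <= thresholds[label][1]:
            continue
        others_low = all(
            probs[other] < thresholds[other][0]
            for other in probs
            if other != label
        )
        if others_low:
            return label
    return None  # undecided; claim 8 retries after the classifiers are updated


# Hypothetical, pre-segmented metadata and segmentation sets.
zh_vocab = {"交易", "金额"}
en_vocab = {"trans", "amt"}
vector = encode_metadata(["交易", "金额", "备注"], ["trans", "amt", "memo"], zh_vocab, en_vocab)
print(vector)  # [1, 1, 0, 1, 1, 0]

thresholds = {"amount": (0.2, 0.8), "identity": (0.1, 0.7)}
print(decide_label({"amount": 0.92, "identity": 0.05}, thresholds))  # amount
print(decide_label({"amount": 0.92, "identity": 0.50}, thresholds))  # None
```

Because the vector length in this sketch follows the number of segmented words, a real classifier would likely need padding or truncation to a fixed length; the claims do not say how that is handled.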
CN202211035702.4A 2022-08-26 2022-08-26 Metadata label classification method and device Pending CN115438181A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211035702.4A CN115438181A (en) 2022-08-26 2022-08-26 Metadata label classification method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211035702.4A CN115438181A (en) 2022-08-26 2022-08-26 Metadata label classification method and device

Publications (1)

Publication Number Publication Date
CN115438181A true CN115438181A (en) 2022-12-06

Family

ID=84243873

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211035702.4A Pending CN115438181A (en) 2022-08-26 2022-08-26 Metadata label classification method and device

Country Status (1)

Country Link
CN (1) CN115438181A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180090136A1 (en) * 2016-09-27 2018-03-29 International Business Machines Corporation System, method and computer program product for improving dialog service quality via user feedback
CN112118783A (en) * 2018-03-16 2020-12-22 Zoll医疗公司 Monitoring physiological states based on biological vibration and RF data analysis
CN114491030A (en) * 2022-01-19 2022-05-13 北京百度网讯科技有限公司 Skill label extraction and candidate phrase classification model training method and device
CN114817526A (en) * 2022-02-21 2022-07-29 华院计算技术(上海)股份有限公司 Text classification method and device, storage medium and terminal

Similar Documents

Publication Publication Date Title
US11348352B2 (en) Contract lifecycle management
US10963691B2 (en) Platform for document classification
CN113886573B (en) Text review method, device, electronic device and storage medium
CN114020916B (en) Text classification method, device, storage medium and electronic device
CN112070093A (en) Method for generating image classification model, image classification method, device and equipment
CN111242358A (en) Enterprise information loss prediction method with double-layer structure
WO2019242442A1 (en) Multi-model feature-based malware identification method, system and related apparatus
CN112488557A (en) Automatic calculation method, device and terminal based on grading standard objective scores
CN114120304A (en) Entity identification method, device and computer program product
CN111460137B (en) A topic model-based microservice focus identification method, equipment and medium
CN112989050A (en) Table classification method, device, equipment and storage medium
CN113918709A (en) Industry classification model training method, classification method and device
CN119378494A (en) An entity relationship extraction method and system for building knowledge graphs in the financial field
CN116029280A (en) A document key information extraction method, device, computing device and storage medium
US11765193B2 (en) Contextual embeddings for improving static analyzer output
CN112749293A (en) Image classification method and device and storage medium
CN115544256A (en) Automatic data classification and classification method and system based on NLP algorithm model
CN112685374A (en) Log classification method and device and electronic equipment
CN119939356A (en) A method, device and electronic device for optimizing data classification and grading scanning performance
US11321527B1 (en) Effective classification of data based on curated features
CN111898378B (en) Industry classification method and device for government enterprise clients, electronic equipment and storage medium
CN115438181A (en) Metadata label classification method and device
CN118966778A (en) A method, system and computer equipment for predicting comprehensive contract risks
CN114443803B (en) A text information mining method, device, electronic device and storage medium
CN117997845A (en) Classification model updating method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination