
CN114724167A - Marketing text recognition method and system

Info

Publication number
CN114724167A
CN114724167A
Authority
CN
China
Prior art keywords
text
word
label
representation
graph
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210498687.0A
Other languages
Chinese (zh)
Other versions
CN114724167B (en)
Inventor
马坤
李乐平
纪科
陈贞翔
杨波
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Jinan
Original Assignee
University of Jinan
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Jinan filed Critical University of Jinan
Priority to CN202210498687.0A priority Critical patent/CN114724167B/en
Publication of CN114724167A publication Critical patent/CN114724167A/en
Application granted granted Critical
Publication of CN114724167B publication Critical patent/CN114724167B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G06F 18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/048 Activation functions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Machine Translation (AREA)

Abstract

The invention provides a marketing text recognition method and system comprising the following steps: acquiring a text to be recognized and preprocessing it; constructing a text graph of the text to be recognized based on the preprocessed text; generating text-level word representations based on the text graph and combining them with the embedded representations of all labels to generate a text representation; and, based on the text representation, using a classifier to decide whether the text to be recognized is marketing text. The embedded representations of the labels are obtained as follows: topic-word probability distributions are generated based on the text graphs and labels of the training set, mapped into the label vector space, and, under the guidance of a label graph, the correlations and semantic information among labels are learned to obtain the embedded representations of the labels. This achieves the goal of generating complete label embeddings; jointly learning words and labels captures more classification-relevant information and improves the accuracy of marketing text recognition.

Description

Marketing text recognition method and system

Technical field

The invention belongs to the technical field of natural language processing, and in particular relates to a marketing text recognition method and system.

Background

The statements in this section merely provide background information related to the present invention and do not necessarily constitute prior art.

Social media platforms publish large numbers of marketing articles every day that contain advertising content for promotional purposes. To avoid alienating readers and to win the trust of potential customers, marketing content is often hidden inside ordinary article content, making it difficult to identify. Unlike traditional media, some self-media editors carefully craft advertising content for profit, even exaggerating facts and fabricating false information; this not only misleads consumers and harms their interests but also damages a healthy online environment. Methods and systems for detecting self-media content-marketing articles are therefore urgently needed.

Most studies treat the detection of content-marketing articles as a text classification problem. Existing approaches fall mainly into the following categories: (1) traditional text classification methods, such as naive Bayes, maximum entropy, decision trees, and support vector machines, which rely chiefly on manually engineered features, ignore correlations between words, and are inefficient; (2) deep learning methods, such as TextRNN, TextRCNN, and fastText, which learn classification features automatically but focus on local word patterns and lack long-distance, non-consecutive word interactions; and (3) graph-neural-network methods, such as TextGCN, HyperGAT, and TextING, which can directly process complex structured data and prioritize global features, but which do not consider fine-grained label information or label-related textual information.

In recent years, studies have found that labels are directly relevant to text classification and can help a model acquire information more pertinent to the task. Multi-label classification tasks often involve many labels with complex inter-category relationships that are hard to describe in a principled way, so effectively mining label information is key to successful multi-label text classification. To exploit label information for better classification, the deep extreme multi-label learning method (DXML) captures label dependencies by exploring the label structure, and the sequence generation model (SGM) processes label-sequence dependencies with a long short-term memory network (LSTM) to capture complex label relationships. Although these models learn label information from different angles, they ignore the role of label co-occurrence in capturing label relationships, and they struggle to describe those relationships well enough to improve classification. The label embedding attention model (LEAM) jointly embeds words and labels to obtain text representations, and the explicit interaction model (EXAM) uses class representations to obtain word-label interaction information. Such methods consider label information in the form of label embeddings, but when the label space is vectorized, no feature information is propagated between the label vectors, so the label representations cannot cover the full semantics of the label space. The label-specific attention network (LSAN) uses self-attention to identify label-specific information, and some autoencoder-based methods produce analogous text-label scores through ranking-based autoencoder architectures. However, these methods assume that all labels are mutually independent and do not fully consider label semantics and overall label correlation.

Summary of the invention

To solve the technical problems in the background art above, the present invention provides a marketing text recognition method and system that capture the global semantics of labels and their overall correlations, generate complete label embeddings, and jointly learn words and labels to capture more classification-relevant information, thereby improving the accuracy of marketing text recognition.

To achieve the above object, the present invention adopts the following technical solutions:

A first aspect of the present invention provides a marketing text recognition method, comprising:

acquiring a text to be recognized and preprocessing it;

constructing a text graph of the text to be recognized based on the preprocessed text;

generating text-level word representations based on the text graph of the text to be recognized, and combining them with the embedded representations of all labels to generate a text representation;

based on the text representation, using a classifier to obtain a result indicating whether the text to be recognized is marketing text;

wherein the embedded representations of the labels are obtained as follows: topic-word probability distributions are generated based on the text graphs of the training set and their labels, the topic-word probability distributions are mapped into the label vector space, and the correlations and semantic information among labels are learned under the guidance of the label graph to obtain the embedded representations of the labels.

Further, the preprocessing includes cleaning non-text data, removing stop words, removing low-frequency words, removing high-frequency words, and lemmatization.

Further, the text graph is constructed as follows: for a text, the co-occurrence counts of words within a fixed sliding window are computed; each word is a vertex of the text graph, and the co-occurrence counts between words form its edges.

Further, the label graph is constructed as follows: for all labels in the training set, the co-occurrence counts of labels within a fixed sliding window are computed, with each label as a vertex and the label co-occurrence counts as edges.

Further, the topic-word probability distribution is generated as follows:

a Dirichlet distribution with a first prior parameter generates, for every word in the training-set vocabulary, a multinomial probability distribution over all topics, yielding the topic-word probability distribution;

for each text in the training set, a second prior parameter is generated based on the text graph of the training set and its corresponding label information; a Dirichlet distribution with the second prior parameter generates a topic distribution; using the topic distribution as the parameter of a multinomial distribution, the topic number of each word in each text is obtained; and the word distribution of each topic number, used as the parameter of a multinomial distribution, generates the words;

based on the generated words, the model parameters and the topic-word probability distribution are updated.

Further, the text-level word representations are generated as follows:

based on the text graph of the text to be recognized, a first gated graph neural network layer merges each text-graph node representation with information from its first-order neighbor nodes, updating the embedded representation of each word;

based on the updated embedded representation of each word, a second gated graph neural network layer produces the text-level representation of each word.

Further, the text representation is generated as follows:

based on the text-level word representations and the embedded representations of all labels, the attention value of each text word with respect to each label is calculated;

the embedded representations of all labels are weighted and summed with the attention values to obtain the label semantic component of each word;

based on the label semantic component of each word, a bidirectional long short-term memory layer produces the label representation of each word;

the label representation of each word is concatenated with its text-level word representation and weighted to obtain weighted features;

based on the weighted features, max pooling, summation, and averaging operations are performed to obtain the text representation.

A second aspect of the present invention provides a marketing text recognition system, comprising:

a preprocessing module configured to acquire a text to be recognized and preprocess it;

a graph construction module configured to construct a text graph of the text to be recognized based on the preprocessed text;

a joint learning module configured to generate text-level word representations based on the text graph of the text to be recognized and to combine them with the embedded representations of all labels to generate a text representation;

a classification module configured to use a classifier, based on the text representation, to obtain a result indicating whether the text to be recognized is marketing text;

wherein the embedded representations of the labels are obtained as follows: topic-word probability distributions are generated based on the text graphs of the training set and their labels, mapped into the label vector space, and the correlations and semantic information among labels are learned under the guidance of the label graph to obtain the embedded representations of the labels.

A third aspect of the present invention provides a computer-readable storage medium on which a computer program is stored; when executed by a processor, the program implements the steps of the marketing text recognition method described above.

A fourth aspect of the present invention provides a computer device comprising a memory, a processor, and a computer program stored in the memory and runnable on the processor; when the processor executes the program, the steps of the marketing text recognition method described above are implemented.

Compared with the prior art, the beneficial effects of the present invention are:

The present invention provides a marketing text recognition method with a topic-prior-adaptive labeled Dirichlet topic model. The model associates each topic directly with one label and, according to the label set of each document, maps the document topic space to a k-dimensional space to modify the Dirichlet topic prior, generating topic-word probability distributions that cover global semantic information.

The present invention further provides a label-information integration method: when generating label embeddings, the topic-word probability distributions are mapped to real-valued vectors, and the statistical co-occurrence information of labels is propagated between these vectors, so that the global semantics and correlations of labels are fused into the label representations, yielding label embeddings that cover global information.

The present invention obtains word-specific label representations and jointly learns them with the word representations, extracting more semantic information between labels and texts and improving the model's classification performance.

Brief description of the drawings

The accompanying drawings, which form a part of the present invention, are provided for further understanding of the invention; the exemplary embodiments and their descriptions explain the invention and do not unduly limit it.

Fig. 1 is a flowchart of a marketing text recognition method according to Embodiment 1 of the present invention;

Fig. 2 is a diagram of the topic-prior-adaptive labeled Dirichlet topic model of Embodiment 1 of the present invention.

Detailed description

The present invention is further described below with reference to the accompanying drawings and embodiments.

It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the invention. Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs.

It should also be noted that the terminology used herein is for describing specific embodiments only and is not intended to limit the exemplary embodiments of the present invention. As used herein, unless the context clearly dictates otherwise, singular forms are intended to include plural forms as well; furthermore, it should be understood that the terms "comprising" and/or "including", when used in this specification, indicate the presence of features, steps, operations, devices, components, and/or combinations thereof.

Terminology:

Label embedding representation: representing the labels of a text in a vector form that is convenient to operate on.

Word embedding representation: word embedding is a class of techniques in which each word is represented as a real-valued vector in a predefined vector space; each word is mapped to one vector, its word embedding representation.

Embodiment 1

This embodiment provides a marketing text recognition method. First, an independent text graph is constructed for each document, and a label graph is constructed from the label information of the training set. Then a topic model is trained using the text-graph nodes and the label information corresponding to each text graph, generating topic-word probability distributions. Next, a perceptron maps the topic-word probability distributions into the label vector space, and a graph convolutional network, guided by the label graph, learns label correlations and semantic information and updates the label vectors to obtain the label embeddings. Finally, words and labels are jointly learned to obtain the final text representation, and a sigmoid layer outputs the classification result. As shown in Fig. 1, the method comprises the following steps:

Step 1: obtain the training set and the text to be recognized, and preprocess each. The preprocessing includes cleaning non-text data, removing stop words, removing low-frequency and high-frequency words, and lemmatization. Specifically:

Step (101): clean non-text data. The data may contain non-text content such as HTML tags and URLs, which contributes nothing to classification and may even hurt it;

Step (102): remove stop words. Auxiliary verbs, function words, punctuation marks, and the like appear in large numbers in every article and do not reflect the meaning of the text. Removing these stop words usually lets the model fit semantic features better and improves its generalization ability;

Step (103): remove low-frequency and high-frequency words. The importance of a word is proportional to the number of times it appears in a text and inversely proportional to its frequency in the corpus; corpus-wide high-frequency and low-frequency words therefore contribute little to classification and can be removed (this step is unnecessary for short texts);

Step (104): lemmatization. Words such as is, are, and am are entirely different tokens to a computer but share the same meaning; lemmatization unifies them and simplifies subsequent processing and analysis.
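For illustration, a minimal Python sketch of the preprocessing pipeline of steps (101)-(104); the stop-word list, the frequency thresholds, and the use of NLTK's WordNetLemmatizer are assumptions chosen for the example, not details fixed by this embodiment:

```python
import re
from collections import Counter
from nltk.stem import WordNetLemmatizer  # assumed lemmatizer; any equivalent works

STOP_WORDS = {"is", "are", "am", "the", "a", "an", "of", "and", "or"}  # illustrative subset

def preprocess_corpus(raw_texts, min_freq=5, max_doc_ratio=0.5):
    lemmatizer = WordNetLemmatizer()
    docs = []
    for raw in raw_texts:
        text = re.sub(r"<[^>]+>|https?://\S+", " ", raw)       # (101) strip HTML tags / URLs
        tokens = re.findall(r"[a-zA-Z]+", text.lower())
        tokens = [t for t in tokens if t not in STOP_WORDS]     # (102) stop-word removal
        docs.append([lemmatizer.lemmatize(t) for t in tokens])  # (104) lemmatization
    # (103) drop corpus-wide low- and high-frequency words
    freq = Counter(t for doc in docs for t in doc)
    doc_freq = Counter(t for doc in docs for t in set(doc))
    keep = {t for t in freq
            if freq[t] >= min_freq and doc_freq[t] / len(docs) <= max_doc_ratio}
    return [[t for t in doc if t in keep] for doc in docs]
```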

Step 2: graph construction. For the text to be recognized, build a text graph for each text based on its preprocessed content; for the training set, build a text graph for each text in the preprocessed training set and, from all labels in the training set, build one label graph. Specifically:

Build the text graph from word co-occurrence counts within a sliding window of a given size: for the text to be recognized, or for any text in the training set, count the co-occurrences of words within a fixed sliding window; each unique word is a vertex of the text graph, the co-occurrence counts between words are its edges, and an independent text graph is built for each text. Text-graph vertices are also called text-graph nodes.

Based on the label information (each text in the training set has a corresponding label set, i.e., the set of categories to which it belongs), build the label graph from label co-occurrence counts within a sliding window of a given size: for all labels in the training set, count the co-occurrences of labels within a fixed sliding window, with the category corresponding to each label as a vertex and the label co-occurrence counts as edges, yielding a single label graph.
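Both graphs follow the same sliding-window co-occurrence recipe, so one helper can sketch either; the window size, the stride, and the treatment of a document as a plain token sequence (words for a text graph, label IDs for the label graph) are illustrative assumptions:

```python
from collections import defaultdict

def cooccurrence_graph(sequences, window=3, step=1):
    """Vertices are unique tokens; edge weights are windowed co-occurrence counts."""
    edges = defaultdict(int)
    for seq in sequences:
        for start in range(0, max(len(seq) - window + 1, 1), step):
            span = seq[start:start + window]
            for i in range(len(span)):
                for j in range(i + 1, len(span)):
                    if span[i] != span[j]:
                        a, b = sorted((span[i], span[j]))
                        edges[(a, b)] += 1
    nodes = sorted({t for seq in sequences for t in seq})
    return nodes, dict(edges)

# text graph: one graph per document (doc_tokens is an assumed token list)
text_nodes, text_edges = cooccurrence_graph([doc_tokens], window=3)
# label graph: one graph over all training label sequences (train_label_sets assumed)
label_nodes, label_edges = cooccurrence_graph(train_label_sets, window=3)
```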

The label graph and the text graphs of all training texts are used to train the marketing text recognition model, which comprises a label-embedding acquisition module, a word-label joint learning module, and a classification module. The text graph of the text to be recognized is fed into the word-label joint learning module and the classification module of the trained model to obtain the label of the text (i.e., the result of whether it is marketing text), that is, its categories (articles that contain no advertising content and whose content is authentic and credible, articles with a neutral viewpoint, articles containing advertising content, and so on). Specifically: based on the text graph of the text to be recognized, two gated graph neural networks connected in sequence produce text-level word representations; the text-level word representations are combined with the embedded representations of all labels to obtain the text representation; and, based on the text representation, a classifier outputs the classification result.

Step 3: label-embedding acquisition module. Based on the text graphs of the training set and their corresponding label information, generate the topic-word probability distributions, map them into the label vector space, and learn the correlations and semantic information among labels under the guidance of the label graph to obtain the label embeddings. This module comprises a topic-word-distribution module and a label-information integration module.

(1) Topic-word-distribution module: first, build the topic-prior-adaptive labeled Dirichlet topic model shown in Fig. 2; then train the topic model with collapsed Gibbs sampling over the generative process of all training documents to obtain the topic-word probability distributions.

The topic-prior-adaptive labeled Dirichlet topic model is a probabilistic graphical model in which each label corresponds directly to one topic. For any topic $k$, a multinomial distribution with parameter $\eta$ generates its word distribution $\beta_k$. For any document $d$, $\alpha^{(d)}$ is updated by matrix operations so that the prior probabilities of all topics vary with the document's label set, and a Dirichlet distribution with parameter $\alpha^{(d)}$ generates its topic distribution $\theta^{(d)}$. For the $j$-th word of document $d$, its topic number is drawn from the topic distribution, $z_{j,d} \sim \text{Multinomial}(\theta^{(d)})$, and the word is then drawn from that topic's word distribution, $w_{j,d} \sim \text{Multinomial}(\beta_{z_{j,d}})$. The process is repeated until all documents are generated.

The topic-prior-adaptive labeled Dirichlet topic model is built and trained as follows:

Step 1: establish a one-to-one correspondence between labels and topics; the number of label categories in the dataset equals the number of topics $K$. Here a label is the category an article belongs to, and a topic is a handful of words that can summarize the article. A Dirichlet distribution with first prior parameter $\eta$ generates, for each word in the training vocabulary, a multinomial probability distribution over all topics, giving the word distribution of each topic (the topic-word probability distribution) $\beta_k = (\beta_{k,1}, \ldots, \beta_{k,V-1}, \beta_{k,V}) \sim \text{Dir}(\cdot \mid \eta)$, where $V$ is the vocabulary size of all training-set texts, $\beta_{k,v}$ is the probability that the $v$-th word is generated by the $k$-th topic, $\text{Dir}$ denotes the Dirichlet distribution, $\eta$ is the first prior parameter, and $\beta_k$ is the word distribution of the $k$-th topic, $k \in \{1, \ldots, K\}$;

Step 2: for each text in the training set, generate the second prior parameter from the text graph of the training set and its corresponding label information. Specifically, for each text $d$, $d \in \{1, \ldots, D\}$: when its labels are unknown, generate the label vector $\Lambda^{(d)}$ from a Bernoulli distribution with label prior $\Phi$; when the labels are known, initialize a $1 \times K$ label matrix $\Lambda^{(d)}$ whose entries are 1 at the positions of the text's labels and 0 elsewhere. Initialize an identity matrix $q^{(d)}$ whose dimension equals the number of topics; sum $q^{(d)}$ and the text's label matrix $\Lambda^{(d)}$ and multiply by the second prior parameter $\alpha$, i.e., $\alpha^{(d)} = (q^{(d)} + \Lambda^{(d)}) \times \alpha$. These operations assign a high topic prior to the topics corresponding to the text's label set and a low topic prior to the remaining topics, updating the second prior parameter $\alpha^{(d)}$;

Step 3: a Dirichlet distribution with second prior parameter $\alpha^{(d)}$ generates the topic distribution $\theta^{(d)} \sim \text{Dirichlet}(\cdot \mid \alpha^{(d)})$;

Step 4: with the topic distribution $\theta^{(d)}$ as the parameter of a multinomial distribution, obtain the topic number of the $j$-th word of text $d$, $z_{j,d} \sim \text{Multinomial}(\theta^{(d)})$, $j \in \{1, \ldots, N_d\}$, where $N_d$ is the number of words in text $d$;

Step 5: with the word distribution $\beta_{z_{j,d}}$ of the topic numbered $z_{j,d}$ as the parameter of a multinomial distribution, generate the word $w_{j,d} \sim \text{Multinomial}(\beta_{z_{j,d}})$;

Step 6: based on all words $w_{j,d}$, train the topic model with collapsed Gibbs sampling and update the model parameters (including the first prior parameter $\eta$, the second prior parameter $\alpha$, and the topic distribution $\theta$) and the topic-word probability distribution; as the parameters are updated, $\beta$ is updated as well. Steps 1-5 are repeated until training yields the topic-word probability distribution $\beta$ over all documents.
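A minimal sketch, assuming NumPy, of the label-adaptive prior of Step 2 (simplified to a per-topic vector) and of a single collapsed-Gibbs reassignment of one word; the count matrices `doc_topic` and `topic_word` and the conditional used here are standard LDA bookkeeping, not details fixed by the patent:

```python
import numpy as np

def label_adaptive_prior(label_vec, alpha):
    """alpha^(d) = (q^(d) + Lambda^(d)) * alpha, vectorized: label topics get
    prior 2*alpha, the remaining topics get alpha (label_vec is 0/1, length K)."""
    return (np.ones_like(label_vec, dtype=float) + label_vec) * alpha

def gibbs_resample_word(z, doc_topic, topic_word, d, j, w, alpha_d, eta):
    """One collapsed-Gibbs update for word w at position j of document d."""
    k_old = z[d][j]
    doc_topic[d, k_old] -= 1          # remove the word's current assignment
    topic_word[k_old, w] -= 1
    # p(z = k | rest) proportional to (n_dk + alpha_dk) * (n_kw + eta) / (n_k + V*eta)
    V = topic_word.shape[1]
    p = (doc_topic[d] + alpha_d) * (topic_word[:, w] + eta) \
        / (topic_word.sum(axis=1) + V * eta)
    k_new = np.random.choice(len(p), p=p / p.sum())
    z[d][j] = k_new                   # record the new assignment
    doc_topic[d, k_new] += 1
    topic_word[k_new, w] += 1
```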

(2) Label-information integration module: this module generates the label embeddings. A perceptron layer maps the topic-word probability distributions into the label vector space; a two-layer graph convolutional network then, under the guidance of the label graph, learns label correlations and semantic information and fuses them into the label vectors, yielding the embedded representations of the labels.

Step 1: build a ReLU-activated multilayer perceptron (MLP) that maps the topic-word probability distribution $\beta$ into the label embedding space, obtaining embedded representations of all labels that cover global information;

Step 2: build the first graph convolutional network (GCN) layer; guided by the label graph, it propagates label correlations and semantic information between neighboring nodes and updates the label embeddings, giving the updated label embedding $H^1$;

Step 3: build the second graph convolutional network (GCN) layer, which propagates higher-order information between label nodes and enhances the label embeddings, giving the enhanced label embedding $H^2$.

The graph convolutional network learns as follows:

$H^{l+1} = \sigma\big(A_L H^l W^l\big)$  (1)

where $H^l$ is the component representation from the previous layer, $H^{l+1}$ is the output of the current layer, $\sigma$ is the LeakyReLU activation function, $W^l$ is a trainable parameter matrix, and $A_L$ is the normalized adjacency matrix of the label graph. Equations (2) and (3) appear only as images in the original; consistent with the surrounding definitions, they normalize the label graph's adjacency matrix $\tilde{A}_L$, whose entry $\tilde{a}_{ij}$ is the value in row $i$, column $j$, here reconstructed in the standard symmetric form used with graph convolutional networks:

$D_{ii} = \sum_{j} \tilde{a}_{ij}$  (2)

$A_L = D^{-1/2} \tilde{A}_L D^{-1/2}$  (3)
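The label-embedding path of Steps 1-3 (an MLP followed by two GCN layers over the normalized label graph) might look as follows in PyTorch; the layer widths, the added self-loops, and the symmetric normalization as reconstructed above are assumptions of the sketch:

```python
import torch
import torch.nn as nn

def normalize_adj(adj):
    """Symmetric normalization D^{-1/2} (A + I) D^{-1/2}; self-loops are assumed."""
    a = adj + torch.eye(adj.size(0))
    d_inv_sqrt = a.sum(dim=1).pow(-0.5)
    return d_inv_sqrt.unsqueeze(1) * a * d_inv_sqrt.unsqueeze(0)

class LabelEmbedder(nn.Module):
    def __init__(self, vocab_size, hidden_dim, embed_dim):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(vocab_size, embed_dim), nn.ReLU())  # Step 1
        self.w1 = nn.Linear(embed_dim, hidden_dim, bias=False)                 # Step 2
        self.w2 = nn.Linear(hidden_dim, embed_dim, bias=False)                 # Step 3
        self.act = nn.LeakyReLU()

    def forward(self, beta, adj):
        # beta: K x V topic-word distributions; adj: K x K label-graph adjacency
        a_l = normalize_adj(adj)
        h0 = self.mlp(beta)               # map distributions into the label space
        h1 = self.act(a_l @ self.w1(h0))  # eq. (1), first GCN layer  -> H^1
        h2 = self.act(a_l @ self.w2(h1))  # eq. (1), second GCN layer -> H^2
        return h2
```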

Step 4: word-label joint learning module. This module comprises a text-level word representation component, a word-specific label semantics component, a bidirectional long short-term memory layer, a soft attention layer, and a readout layer. The text-level word representation component uses a two-layer gated graph neural network (GGNN) for text-level information interaction, updating the word embeddings to obtain text-level word representations. The word-specific label semantics component computes, through an attention mechanism, the attention values of each text word with respect to all labels and uses them to weight the label embeddings, obtaining each word's specific label semantic component. The bidirectional long short-term memory layer builds a BiLSTM to further study the dependencies and word-label semantic information in the word-specific label representations, generating the word-specific label representations. The soft attention layer weights the concatenation of the text-level word representations and the word-specific label representations, producing weighted features. The readout layer first sums and averages the weighted features along a given dimension, then applies max pooling to them, and finally merges the results of the two operations into the complete text representation.

Step (401): based on the text graph of the text to be recognized, the first gated graph neural network layer learns the text graph one text at a time; each text-graph node merges its own features with semantic and other feature information from its first-order neighbor nodes, updating the embedded representation of each word;

Step (402): based on the updated embedded representation of each word, the second gated graph neural network layer propagates higher-order information between the text-graph nodes, giving the text-level word representations $X$.

The gated graph neural network learns as follows:

$a^t = A_T h^{t-1} W_a$  (4)

$z^t = \sigma(W_z a^t + U_z h^{t-1} + b_z)$  (5)

$r^t = \sigma(W_r a^t + U_r h^{t-1} + b_r)$  (6)

$\tilde{h}^t = \tanh\big(W_h a^t + U_h (r^t \odot h^{t-1}) + b_h\big)$  (7)

$h^t = \tilde{h}^t \odot z^t + h^{t-1} \odot (1 - z^t)$  (8)

where $h^t$ denotes the word embeddings at the $t$-th gated-network layer, i.e., the embedded representations of all words of one text graph; $W$, $U$, and $b$ are trainable parameters; $\sigma$ is the sigmoid function; $\odot$ denotes element-wise product; and $A_T$ is the normalized adjacency matrix of the text graph, normalized analogously to the label-graph adjacency matrix. The node vectors $h^{t-1}$ are the input of the $t$-th GGNN layer. When $t = 2$, $X = h^2$: $t = 1$ in the formulas gives the first gated graph neural network layer and $t = 2$ the second. Equations (4)-(8) are the generic form of the gated network; only two layers are used in the model described here. $X$ is the text-level word representation learned after the two gated graph neural network layers.
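One GGNN propagation step implementing equations (4)-(8) could be sketched as follows in PyTorch; the parameter shapes and the stacking of exactly two such layers follow steps (401)-(402), while everything else is illustrative:

```python
import torch
import torch.nn as nn

class GGNNLayer(nn.Module):
    """One gated graph neural network step, eqs. (4)-(8)."""
    def __init__(self, dim):
        super().__init__()
        self.wa = nn.Linear(dim, dim, bias=False)  # W_a in eq. (4)
        self.wz, self.uz = nn.Linear(dim, dim), nn.Linear(dim, dim)
        self.wr, self.ur = nn.Linear(dim, dim), nn.Linear(dim, dim)
        self.wh, self.uh = nn.Linear(dim, dim), nn.Linear(dim, dim)

    def forward(self, h, a_t):
        # h: N x dim node embeddings; a_t: N x N normalized text-graph adjacency
        a = a_t @ self.wa(h)                               # eq. (4): neighbor aggregation
        z = torch.sigmoid(self.wz(a) + self.uz(h))         # eq. (5): update gate
        r = torch.sigmoid(self.wr(a) + self.ur(h))         # eq. (6): reset gate
        h_tilde = torch.tanh(self.wh(a) + self.uh(r * h))  # eq. (7): candidate state
        return h_tilde * z + h * (1 - z)                   # eq. (8): gated update

# two layers, as in steps (401)-(402): X = layer2(layer1(h0, A_T), A_T)
```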

Step (403): using equation (9), compute, text by text, the attention value $\mu$ of each text word with respect to each label:

$\mu_{ij} = \exp\!\big(X_i (H_j^2)^\top\big) \Big/ \sum_{k=1}^{K} \exp\!\big(X_i (H_k^2)^\top\big)$  (9)

where $H^2$ is the enhanced label embedding, $H_j^2$ is the $j$-th enhanced label embedding, $X_i$ is the text-level word representation of the $i$-th word of the text to be recognized, $\exp(\cdot)$ is the exponential function, and $(\cdot)^\top$ denotes transposition.

Step (404): based on the attention values, use equation (10) to take a weighted sum of the embedded representations $H^2$ of all labels, obtaining each word's specific label semantic component $Q$:

$Q_i = \sum_{j=1}^{K} \mu_{ij} H_j^2$  (10)

Step (405): use equation (11) to build the bidirectional long short-term memory layer, further learning the dependencies and semantic information of the word-specific label semantic components and generating the word-specific label representations $R$:

$R = \text{BiLSTM}(Q)$  (11)

where $\mu_{ij}$ is the attention value of the $i$-th word of the text with respect to the $j$-th label, and $Q_i$ is the label semantic component specific to the $i$-th word.
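Steps (403)-(405) in a compact PyTorch sketch; the inputs `x` (the text-level word representations $X$) and `h2` (the enhanced label embeddings $H^2$), as well as an even embedding width, are assumptions of the example:

```python
import torch
import torch.nn as nn

def label_attention(x, h2):
    """x: N x d text-level word reps; h2: K x d enhanced label embeddings."""
    mu = torch.softmax(x @ h2.t(), dim=-1)  # eq. (9): attention of each word to each label
    return mu @ h2                          # eq. (10): label semantic components Q

d = x.size(-1)                              # embedding width (assumed even)
bilstm = nn.LSTM(d, d // 2, bidirectional=True, batch_first=True)
q = label_attention(x, h2)
r, _ = bilstm(q.unsqueeze(0))               # eq. (11): R = BiLSTM(Q)
r = r.squeeze(0)                            # N x d word-specific label representations
```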

Step (406): along the second dimension of the matrix, concatenate the word-specific label representations $R$ after the text-level word representations $X$ of all words, obtaining $T$;

Step (407): use equation (12) to compute the soft attention values of $T$ and weight the concatenated content with them, obtaining the weighted features $H_V$:

$H_V = \sigma\big(f_{att}(T)\big) \odot \tanh\big(f_{tr}(T)\big)$  (12)

Step (408): use equation (13) to apply max pooling and sum-and-average operations to the weighted content and merge the results, obtaining $H_G$:

$H_G = \text{Max}(h_1, \ldots, h_V) + \frac{1}{V} \sum_{v=1}^{V} h_v$  (13)

where $f_{att}$ and $f_{tr}$ are perceptrons with sigmoid and ReLU activations, respectively; the former implements soft attention and the latter a nonlinear transformation. $\text{Max}$ denotes the max pooling operation, $V$ is the number of weighted features, and $h_v$ is the $v$-th weighted feature in $H_V$.
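A sketch of the readout of steps (406)-(408), assuming PyTorch; merging the mean and max results by addition follows the reconstruction of equation (13) above and is an assumption of the sketch:

```python
import torch
import torch.nn as nn

class Readout(nn.Module):
    """Steps (406)-(408): soft attention over T = [X ; R], then mean+max readout."""
    def __init__(self, dim):
        super().__init__()
        self.f_att = nn.Linear(2 * dim, 2 * dim)  # sigmoid-activated perceptron (soft attention)
        self.f_tr = nn.Linear(2 * dim, 2 * dim)   # relu-activated perceptron (transformation)

    def forward(self, x, r):
        t = torch.cat([x, r], dim=-1)                   # step (406): T = [X ; R]
        h_v = torch.sigmoid(self.f_att(t)) \
            * torch.tanh(torch.relu(self.f_tr(t)))      # eq. (12): weighted features H_V
        return h_v.mean(dim=0) + h_v.max(dim=0).values  # eq. (13): merge mean and max

# step 5 (below) would then apply, e.g., torch.sigmoid(nn.Linear(2 * dim, K)(h_g))
```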

Step 5: classifier. The resulting text representation is processed with the sigmoid function to obtain the final classification result: passing $H_G$ through the sigmoid layer yields, for the text, one or more of the labels such as "good article" (an article that contains no advertising content and whose content is authentic and credible), article with a neutral viewpoint, or article containing advertising content.

In this embodiment, the sliding window reads a fixed-length region of an article each time, moving from left to right; the fixed length is the window size, and the distance moved each time is the window stride.

In this embodiment, word co-occurrence information is the number of times words appear together within a sliding window of a given size.

In this embodiment, label co-occurrence information is the number of times labels appear together within a sliding window of a given size.

In this embodiment, a topic-word probability distribution is the word probability distribution corresponding to a given topic.

In this embodiment, collapsed Gibbs sampling is a Markov chain Monte Carlo algorithm commonly used for statistical inference.

In this embodiment, a perceptron is a kind of neural network.

In this embodiment, label correlation refers to relationships of mutual dependence or mutual exclusion between labels.

In this embodiment, a graph convolutional network is an application of convolutional networks, from machine learning, to graph data.

In this embodiment, a gated graph neural network is a classical GRU-based spatial-domain message-passing model.

In this embodiment, higher-order information refers to the effect of stacking t layers of gated graph neural networks: each node's information can reach nodes t hops away, and this process propagates higher-order information.

In this embodiment, a text-level word representation is the new word embedding obtained after a two-layer gated graph neural network learns the text graph and performs text-level information interaction.

In this embodiment, concatenation means appending one array after a specific dimension of another array along a particular axis.

In this embodiment, soft attention adaptively re-weights all information before it is aggregated, which separates out important information and keeps it from being disturbed by unimportant information.

In this embodiment, max pooling takes the point with the largest value in the local receptive field.

Embodiment 2

This embodiment provides a marketing text recognition system, which comprises the following modules:

a preprocessing module configured to acquire a text to be recognized and preprocess it;

a graph construction module configured to construct a text graph of the text to be recognized based on the preprocessed text;

a joint learning module configured to generate text-level word representations based on the text graph of the text to be recognized and to combine them with the embedded representations of all labels to generate a text representation;

a classification module configured to use a classifier, based on the text representation, to obtain a result indicating whether the text to be recognized is marketing text;

wherein the embedded representations of the labels are obtained as follows: topic-word probability distributions are generated based on the text graphs of the training set and their labels, mapped into the label vector space, and the correlations and semantic information among labels are learned under the guidance of the label graph to obtain the embedded representations of the labels.

It should be noted that the modules of this embodiment correspond one-to-one to the steps of Embodiment 1 and are implemented in the same way, which is not repeated here.

Embodiment 3

This embodiment provides a computer-readable storage medium on which a computer program is stored; when executed by a processor, the program implements the steps of the marketing text recognition method of Embodiment 1.

Embodiment 4

This embodiment provides a computer device comprising a memory, a processor, and a computer program stored in the memory and runnable on the processor; when the processor executes the program, the steps of the marketing text recognition method of Embodiment 1 are implemented.

Those skilled in the art should understand that embodiments of the present invention may be provided as a method, a system, or a computer program product. Accordingly, the invention may take the form of a hardware embodiment, a software embodiment, or an embodiment combining software and hardware. Moreover, the invention may take the form of a computer program product embodied on one or more computer-usable storage media (including but not limited to disk storage and optical storage) containing computer-usable program code.

The present invention is described with reference to flowcharts and/or block diagrams of methods, devices (systems), and computer program products according to embodiments of the invention. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations thereof, can be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing device to produce a machine, such that the instructions executed by the processor create means for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture comprising instruction means that implement the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

These computer program instructions may also be loaded onto a computer or other programmable data processing device, so that a series of operational steps is performed on it to produce computer-implemented processing, whereby the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

Those of ordinary skill in the art will understand that all or part of the processes of the above method embodiments can be implemented by instructing the relevant hardware through a computer program, which may be stored in a computer-readable storage medium; when executed, the program may include the processes of the method embodiments above. The storage medium may be a magnetic disk, an optical disc, a read-only memory (ROM), a random access memory (RAM), or the like.

The above are only preferred embodiments of the present invention and are not intended to limit it; those skilled in the art may make various modifications and variations. Any modification, equivalent replacement, improvement, etc. made within the spirit and principles of the present invention shall be included within its scope of protection.

Claims (10)

1. A marketing text recognition method, comprising:
acquiring a text to be recognized and preprocessing it;
constructing a text graph of the text to be recognized based on the preprocessed text;
generating text-level word representations based on the text graph of the text to be recognized, and generating a text representation by combining them with the embedded representations of all labels;
based on the text representation, using a classifier to obtain a result indicating whether the text to be recognized is marketing text;
wherein the embedded representations of the labels are obtained by: generating topic-word probability distributions based on the text graphs of the training set and their labels, mapping the topic-word probability distributions into the label vector space, and learning the correlations and semantic information among labels under the guidance of the label graph to obtain the embedded representations of the labels.
2. The marketing text recognition method of claim 1, wherein the preprocessing comprises cleaning non-text data, removing stop words, removing low-frequency words, removing high-frequency words, and lemmatization.
3. The marketing text recognition method of claim 1, wherein the text graph is constructed by: for a text, counting the co-occurrences of words within a fixed sliding window, with each word as a vertex of the text graph and the co-occurrence counts between words as the edges of the text graph.
4. The marketing text recognition method of claim 1, wherein the label graph is constructed by: for all labels in the training set, counting the co-occurrences of labels within a fixed sliding window, with each label as a vertex and the label co-occurrence counts as edges.
5. The marketing text recognition method of claim 1, wherein the topic-word probability distribution is generated by:
generating, for each word in the training-set vocabulary, a multinomial probability distribution over all topics using a Dirichlet distribution with a first prior parameter, to obtain the topic-word probability distribution;
for each text in the training set, generating a second prior parameter based on the text graph of the training set and its corresponding label information, generating a topic distribution using a Dirichlet distribution with the second prior parameter, using the topic distribution as the parameter of a multinomial distribution to obtain a topic index for each word in the text, and generating words by taking the word distribution corresponding to each topic index as the parameter of a multinomial distribution;
updating the model parameters and the topic-word probability distribution based on the generated words.
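A schematic NumPy rendering of the claim-5 generative story. The number of topics K, the symmetric prior beta, the fixed document length, and the label_prior function (standing in for the second prior parameter derived from the text graph and label information) are all illustrative assumptions; the parameter-update step, which in practice would be carried out by an inference procedure such as Gibbs sampling, is omitted here.

```python
import numpy as np

rng = np.random.default_rng(0)

def generate_corpus(vocab_size, K, docs_labels, label_prior, beta=0.01, doc_len=50):
    """Sample documents from a label-conditioned topic model (generative direction only)."""
    # Dirichlet with the first prior parameter: one word distribution per topic
    phi = rng.dirichlet(np.full(vocab_size, beta), size=K)
    corpus = []
    for labels in docs_labels:
        alpha = label_prior(labels)              # second prior parameter (length K)
        theta = rng.dirichlet(alpha)             # per-document topic distribution
        doc = []
        for _ in range(doc_len):
            z = rng.choice(K, p=theta)           # topic index for this word position
            w = rng.choice(vocab_size, p=phi[z]) # word drawn from topic z's distribution
            doc.append(w)
        corpus.append(doc)
    return phi, corpus
```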
6. The marketing text recognition method of claim 1, wherein the text-level word representations are generated by:
based on the text graph of the text to be recognized, using a first-layer gated graph neural network to combine each text-graph node with its first-order neighbor nodes and update the embedded representation of each word;
using a second-layer gated graph neural network to obtain the text-level word representation of each word based on the updated embedded representations.
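One common reading of claim 6's two gated layers is a GGNN-style update, in which neighbor messages are aggregated over the adjacency matrix and gated through a GRU cell; this PyTorch sketch assumes that reading, a dense (and suitably normalized) adjacency tensor, and equal input and hidden dimensions.

```python
import torch
import torch.nn as nn

class GatedGraphLayer(nn.Module):
    """One gated update: aggregate first-order neighbors, then gate with a GRU cell."""
    def __init__(self, dim):
        super().__init__()
        self.msg = nn.Linear(dim, dim)
        self.gru = nn.GRUCell(dim, dim)

    def forward(self, h, adj):
        # h: (N, dim) word embeddings; adj: (N, N) weighted adjacency of the text graph
        messages = adj @ self.msg(h)       # combine each node with its 1-hop neighbors
        return self.gru(messages, h)       # gated update of each word's embedding

class TextLevelEncoder(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.layer1 = GatedGraphLayer(dim)   # first layer: update embedded representations
        self.layer2 = GatedGraphLayer(dim)   # second layer: text-level word representations

    def forward(self, h, adj):
        return self.layer2(self.layer1(h, adj), adj)
```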
7. The marketing text recognition method of claim 1, wherein the text representation is generated by:
calculating an attention value of each word in the text with respect to each label based on the text-level word representations and the embedded representations of all labels;
performing a weighted summation of the embedded representations of all labels using the attention values to obtain a label semantic component for each word;
obtaining a label representation of each word by applying a bidirectional long short-term memory layer to the label semantic components;
concatenating the label representation of each word with its text-level word representation and weighting the result to obtain weighted features;
performing max-pooling, summation, and averaging operations on the weighted features to obtain the text representation.
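A sketch of the claim-7 fusion, assuming batched inputs, an even hidden dimension, and a learned sigmoid gate as the weighting step; the claim does not fix how the concatenation is weighted, so that part in particular is a guess, as is reading "summation and averaging" as a mean readout added to the max pool.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LabelAttentionFusion(nn.Module):
    def __init__(self, dim):
        super().__init__()
        assert dim % 2 == 0, "bidirectional LSTM halves expect an even dimension"
        self.bilstm = nn.LSTM(dim, dim // 2, bidirectional=True, batch_first=True)
        self.gate = nn.Linear(2 * dim, 1)

    def forward(self, words, labels):
        # words: (B, T, d) text-level word representations; labels: (L, d) label embeddings
        att = F.softmax(words @ labels.T, dim=-1)    # attention of each word over each label
        sem = att @ labels                           # label semantic component per word
        lab_rep, _ = self.bilstm(sem)                # label representation of each word
        fused = torch.cat([lab_rep, words], dim=-1)  # concatenate label and word representations
        fused = fused * torch.sigmoid(self.gate(fused))       # weighted features
        # max pooling plus a summed-and-averaged readout gives the text representation
        return fused.max(dim=1).values + fused.mean(dim=1)
```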
8. A marketing text recognition system, comprising:
a preprocessing module configured to acquire a text to be recognized and preprocess it;
a graph building module configured to construct a text graph of the text to be recognized based on the preprocessed text;
a joint learning module configured to generate text-level word representations based on the text graph of the text to be recognized, and to generate a text representation by combining them with the embedded representations of all labels;
a classification module configured to, based on the text representation, apply a classifier to determine whether the text to be recognized belongs to marketing text;
wherein the embedded representations of the labels are acquired by: generating a topic-word probability distribution based on the text graph and labels of the training set, mapping the topic-word probability distribution to a label vector space, and learning the correlations and semantic information among labels under the guidance of the label graph to obtain the embedded representations of the labels.
9. A computer-readable storage medium, on which a computer program is stored which, when executed by a processor, implements the steps of the marketing text recognition method according to any one of claims 1 to 7.
10. A computer device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the program, implements the steps of the marketing text recognition method according to any one of claims 1 to 7.
CN202210498687.0A 2022-05-09 2022-05-09 Marketing text recognition method and system Active CN114724167B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210498687.0A CN114724167B (en) 2022-05-09 2022-05-09 Marketing text recognition method and system

Publications (2)

Publication Number Publication Date
CN114724167A true CN114724167A (en) 2022-07-08
CN114724167B CN114724167B (en) 2025-04-01

Family

ID=82231608

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210498687.0A Active CN114724167B (en) 2022-05-09 2022-05-09 Marketing text recognition method and system

Country Status (1)

Country Link
CN (1) CN114724167B (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050008193A1 (en) * 2000-06-13 2005-01-13 Microsoft Corporation System and process for bootstrap initialization of nonparametric color models
CN113886577A (en) * 2021-09-10 2022-01-04 润联软件系统(深圳)有限公司 A text classification method, device, equipment and storage medium
CN113806547A (en) * 2021-10-15 2021-12-17 南京大学 Deep learning multi-label text classification method based on graph model
CN114358007A (en) * 2022-01-11 2022-04-15 平安科技(深圳)有限公司 Multi-label identification method and device, electronic equipment and storage medium
CN114330338A (en) * 2022-01-13 2022-04-12 东北电力大学 Program language identification system and method fusing associated information

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
CHEN Kejia; LIU Hui: "An improved text representation method for text classification based on the global vectors for word representation (GloVe) model and latent Dirichlet allocation", Science Technology and Engineering, vol. 21, no. 029, 31 December 2021 (2021-12-31) *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117574896A (en) * 2024-01-16 2024-02-20 之江实验室 Surgery fee identification method, device and storage medium based on electronic medical record text
CN117574896B (en) * 2024-01-16 2024-04-09 之江实验室 Surgical fee identification method, device and storage medium based on electronic medical record text

Also Published As

Publication number Publication date
CN114724167B (en) 2025-04-01

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant