CN110457597A

CN110457597A - Advertisement recognition method and device

Info

Publication number: CN110457597A
Application number: CN201910728853.XA
Authority: CN
Inventors: 任宁; 晋耀红; 李德彦
Original assignee: Zhongke Dingfu (beijing) Science And Technology Development Co Ltd
Current assignee: Zhongke Dingfu (beijing) Science And Technology Development Co Ltd
Priority date: 2019-08-08
Filing date: 2019-08-08
Publication date: 2019-11-15

Abstract

Embodiments of the present application provide an advertisement recognition method and device. The method uses a classification model containing classification expressions to obtain suspected advertisement information from media information posted by users; generates an advertisement weight for the suspected advertisement information according to at least one preset weight factor, where the weight factors include one or more of the proportion of the length of the suspected advertisement information in the full text of the media information, the proportion of advertisement information posted by the user among all media information posted by the user, and the number of pictures in the suspected advertisement information; and determines suspected advertisement information whose advertisement weight is greater than a preset threshold to be advertisement information. The method screens media information at two levels using the classification model and the advertisement weight: it first determines suspected advertisement information and then decides, according to the advertisement weight, whether the suspected advertisement information is an advertisement. Advertisements are thereby identified accurately, which addresses the inability of the prior art to identify advertisements within social platforms in a timely and effective manner.

Description

Advertisement recognition method and device

Technical field

The present application relates to the technical field of natural language processing, and in particular to an advertisement recognition method and device.

Background

Social platforms such as Weibo, WeChat, Tieba, and forums are frequent targets for advertisers. Advertisers register large numbers of accounts on these platforms and generate large volumes of advertising messages and replies. As a result, the normal content of a social platform becomes interspersed with advertising, which lowers content quality, forces users to see advertisements passively while browsing, and degrades the user experience.

At present, to control the advertisements that appear on social platforms, platform managers or operators usually set up accounts with moderation privileges, and the people holding these accounts find and delete advertisements by manual inspection. However, advertisers usually use software to post automatically in order to increase volume, and the number of advertisements is so large that manual inspection cannot contain them effectively or in time. The advertising problem on social platforms has therefore never been effectively solved.

Summary of the invention

Embodiments of the present application provide an advertisement recognition method and device to solve the problem that the prior art cannot identify advertisements within social platforms in a timely and effective manner.

In a first aspect, an embodiment of the present application provides an advertisement recognition method, including: using a classification model containing classification expressions to obtain suspected advertisement information from media information posted by a user; generating an advertisement weight of the suspected advertisement information according to at least one preset weight factor, where the weight factors include one or more of the proportion of the length of the suspected advertisement information in the full text of the media information, the proportion of advertisement information posted by the user among all media information posted by the user, and the number of pictures in the suspected advertisement information; and determining the suspected advertisement information whose advertisement weight is greater than a preset threshold to be advertisement information.

In a second aspect, an embodiment of the present application provides an advertisement recognition device, including: an information acquisition module, configured to use a classification model containing classification expressions to obtain suspected advertisement information from media information posted by a user; a weight generation module, configured to generate an advertisement weight of the suspected advertisement information according to at least one preset weight factor, where the weight factors include one or more of the proportion of the length of the suspected advertisement information in the full text of the media information, the proportion of advertisement information posted by the user among all media information posted by the user, and the number of pictures in the suspected advertisement information; and an advertisement determination module, configured to determine the suspected advertisement information whose advertisement weight is greater than a preset threshold to be advertisement information.

As can be seen from the above technical solutions, embodiments of the present application provide an advertisement recognition method and device that can use a classification model containing classification expressions to obtain suspected advertisement information from media information posted by a user; generate an advertisement weight of the suspected advertisement information according to at least one preset weight factor, where the weight factors include one or more of the proportion of the length of the suspected advertisement information in the full text of the media information, the proportion of advertisement information posted by the user among all media information posted by the user, and the number of pictures in the suspected advertisement information; and determine the suspected advertisement information whose advertisement weight is greater than a preset threshold to be advertisement information. The technical solution screens media information at two levels using the classification model and the advertisement weight: it first determines suspected advertisement information, then determines the advertisement weight from factors such as the character length and content of the suspected advertisement information and the behavior of the corresponding user, and finally decides according to the advertisement weight whether the suspected advertisement information is an advertisement. Advertisements are thereby identified accurately, which solves the problem that the prior art cannot identify advertisements within social platforms in a timely and effective manner.

Brief description of the drawings

To illustrate the technical solutions of the present application more clearly, the drawings used in the embodiments are briefly introduced below. A person of ordinary skill in the art can obtain other drawings from these drawings without creative effort.

Fig. 1 is a flowchart of an advertisement recognition method provided by an embodiment of the present application;

Fig. 2 is a flowchart of obtaining the length of suspected advertisement information provided by an embodiment of the present application;

Fig. 3 is a flowchart of generating the advertisement weight of suspected advertisement information provided by an embodiment of the present application;

Fig. 4 is a flowchart of managing users who post advertisements provided by an embodiment of the present application;

Fig. 5 is a schematic structural diagram of an advertisement recognition device provided by an embodiment of the present application.

Detailed description

To enable those skilled in the art to better understand the technical solutions of the present application, the technical solutions in the embodiments are described clearly and completely below with reference to the drawings. The described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art based on these embodiments without creative effort fall within the scope of protection of the present application.

In the prior art, advertisements in web pages take at least two forms. The first is text advertisements, picture advertisements, pop-up advertisements, and the like placed at designated positions in web pages by website service providers, advertisers, or advertising alliances; such advertisements can be blocked by filtering web page elements. The second is advertisements that advertisers publish on social platforms such as Weibo, WeChat, Tieba, and forums by posting, leaving messages, and replying with platform accounts; these advertisements are themselves part of the platform's content and therefore cannot be blocked by web page element filtering.

To control the advertisements that appear on social platforms, platform managers or operators usually set up accounts with moderation privileges, and the people holding these accounts find and delete advertisements by manual inspection. However, advertisers usually use software to post automatically in order to increase volume, and the number of advertisements is so large that manual inspection cannot contain them effectively or in time. The advertising problem on social platforms has therefore never been effectively solved.

The following are method embodiments of the present application.

The method embodiment provides an advertisement recognition method. Fig. 1 is a flowchart of the method. The method can be applied to servers, PCs (personal computers), tablets, mobile phones, and other devices.

As shown in Fig. 1, the method includes the following steps:

Step S101: use a classification model containing classification expressions to obtain suspected advertisement information from media information posted by a user.

Structurally, the classification model used in the embodiments includes at least one classification node. Each classification node corresponds to one advertisement category, and each advertisement category corresponds to one category weight. Each classification node includes at least one classification expression, which is used to identify the suspected advertisement information in the media information.

In terms of content, the classification model consists of three parts: ontology, elements, and concepts. Business-specific expressions that may appear in advertisement descriptions are built into different elements according to their content and semantics, and each element corresponds to one business classification. Language-general expressions that may appear in advertisement descriptions are built into different concepts according to their content and semantics, and each concept contains at least one concept expression expressing the same meaning. One or more concept expressions are combined through operators such as "and", "or", "not", "distance", and "order" to form a classification expression. At least one business classification, together with at least one classification expression in each business classification, constitutes the ontology of the model.

Table 1 shows the structure of the classification model.

Table 1

As shown in Table 1, the business classifications corresponding to the elements include "高权重广告识别" (high-weight advertisement identification), under which there are multiple classification expressions. Take the classification expression "c_推销网址+{0，30}c_尝试或沟通" as an example: it contains the two concepts "c_推销网址" (promotional URL) and "c_尝试或沟通" (try or contact), the "and" operator "+", and the distance operator "{0，30}".

Further, each concept may have a corresponding concept tree containing at least one concept expression, which is used to match the content corresponding to the concept in the corpus. For example, the concept "推销话术" (sales pitch) may include concept expressions such as "温和.*亲肤" (gentle ... skin-friendly) and "亲肤嫩肤" (skin-friendly and skin-rejuvenating), where "." and "*" are operators written according to regular-expression rules. The classification expression "c_推销网址+{0，30}c_尝试或沟通" therefore matches corpus content that matches "c_推销网址" and, within a range of 0 to 30 characters, also matches "c_尝试或沟通".
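The "and" plus distance operator described above can be sketched in Python as follows. This is only an illustration: the concept patterns in `CONCEPTS` are hypothetical stand-ins (the source does not list the real concept trees), and the sketch assumes the distance is measured from the end of the first concept's match to the start of the second, with the second concept following the first. The source does not state the direction.

```python
import re

# Hypothetical concept expressions; the real concept trees are not given in the source.
CONCEPTS = {
    "c_推销网址": [r"(?:https?://|www\.)[A-Za-z0-9./-]+"],
    "c_尝试或沟通": [r"咨询", r"联系", r"试试"],
}

def concept_spans(concept, text):
    """Return the (start, end) span of every match of any expression of a concept."""
    spans = []
    for pattern in CONCEPTS[concept]:
        spans.extend(m.span() for m in re.finditer(pattern, text))
    return spans

def match_within(first, second, text, min_gap=0, max_gap=30):
    """Evaluate 'first + {min_gap,max_gap} second'.

    True when both concepts match and some match of `second` starts between
    min_gap and max_gap characters after some match of `first` ends
    (assumed interpretation of the distance operator).
    """
    for _, end_a in concept_spans(first, text):
        for start_b, _ in concept_spans(second, text):
            if min_gap <= start_b - end_a <= max_gap:
                return True
    return False
```

For instance, `match_within("c_推销网址", "c_尝试或沟通", "官网 www.example.com 欢迎咨询")` is true because "咨询" begins three characters after the URL ends.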

In the embodiments of the present application, the classification nodes may cover multiple advertisement categories, such as "high-weight advertisement identification" in Table 1 and "low-weight advertisement category" (not shown in Table 1). Different category weights are set for different advertisement categories; for example, the category weight W₁ of "high-weight advertisement identification" is set to 2, and the category weight W₁ of "low-weight advertisement category" is set to 1.

The embodiments use the above classification model to match the content of media information posted by users. If the media information matches any expression in the classification model, the media information is suspected advertisement information. According to the preset advertisement categories, the classification model divides suspected advertisement information into "high-weight advertisement information" and "low-weight advertisement information". "High-weight advertisement information" is suspected advertisement information that includes website links, contact details, discount information, shipping-fee information, and the like; "low-weight advertisement information" is suspected advertisement information that contains only sales pitches.
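A minimal sketch of classification nodes with category weights might look like the following. The regular expressions are hypothetical placeholders for classification expressions, and returning the highest-weight matching node when several match is an assumption; the source does not say how such ties are resolved.

```python
import re
from dataclasses import dataclass

@dataclass
class ClassificationNode:
    category: str       # advertisement category of this node
    weight: float       # category weight W1
    expressions: list   # regexes standing in for classification expressions

# Hypothetical nodes; weights 2 and 1 follow the example values in the text.
NODES = [
    ClassificationNode("high-weight", 2, [r"(?:微信|QQ)[:：]\s*\w+", r"www\.[A-Za-z0-9./-]+"]),
    ClassificationNode("low-weight", 1, [r"温和.*亲肤", r"亲肤嫩肤"]),
]

def classify(text):
    """Return (category, W1) for suspected advertisement text, or None if nothing matches."""
    matched = [n for n in NODES if any(re.search(e, text) for e in n.expressions)]
    if not matched:
        return None  # not suspected advertisement information
    best = max(matched, key=lambda n: n.weight)  # assumption: highest weight wins
    return best.category, best.weight
```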

An example of high-weight advertisement information:

关注我，3分钟高效选股，一辈子受益无穷。QQ：XXXXXX，微信：XXXXXX (Follow me: pick stocks efficiently in 3 minutes and benefit for a lifetime. QQ: XXXXXX, WeChat: XXXXXX)

An example of low-weight advertisement information:

温和干净，亲肤嫩肤，不伤角质层，是一款不可多得的好洁面 (Gentle and clean, skin-friendly and skin-rejuvenating, does not damage the stratum corneum; a rare, excellent facial cleanser)

Before obtaining suspected advertisement information from the media information, the embodiments also preprocess the media information to improve the accuracy of obtaining suspected advertisement information. Preprocessing includes one or more of: removing specific characters from the media information, performing character conversion on the media information, and converting Chinese-character numerals in the media information to digits.

For example, some advertisement publishers deliberately insert special characters into advertisement content to evade moderation, so preprocessing first removes the special characters. For example:

想改变自己现在的现状不妨拿出你的手机加【微】【信】abc111111，咨询咨询吧！ (If you want to change your current situation, take out your phone and add [We][Chat] abc111111 to inquire!)

After the special characters are removed:

想改变自己现在的现状不妨拿出你的手机加微信abc111111，咨询咨询吧！ (If you want to change your current situation, take out your phone and add WeChat abc111111 to inquire!)

In addition to the special characters "【" and "】" shown in the example above, the special characters that can be removed include ！～#$^&<>[]{}()*？/,.【】 and so on, which are not enumerated further here.
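The special-character removal step can be sketched with Python's built-in `str.translate`. The character set below is copied from the list in the text; the function name is illustrative only.

```python
# Characters listed in the text as removable special characters.
SPECIAL_CHARS = "！～#$^&<>[]{}()*？/,.【】"

# Translation table that deletes every character in SPECIAL_CHARS.
_STRIP_TABLE = str.maketrans("", "", SPECIAL_CHARS)

def strip_special(text):
    """Remove every listed special character from the text."""
    return text.translate(_STRIP_TABLE)
```

Because "." and "/" are in the list, this step also alters URLs; whether URL concepts are matched before or after this step is not stated in the source.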

After the special characters are removed, the next step is character conversion of the media information, for example converting full-width characters to half-width characters, traditional Chinese characters to simplified Chinese characters, and uppercase letters to lowercase letters.

Taking conversion from traditional to simplified characters as an example:

Before conversion: 在家做兼職，日結，每小時佰園 (Part-time work from home, paid daily, one hundred yuan per hour)

After conversion: 在家做兼职，日结，每小时百元
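The character conversion step could be approximated as below. Unicode NFKC normalization (from the standard `unicodedata` module) folds full-width forms to half-width ones, and `str.lower` handles letter case. The `T2S` table is a tiny hypothetical mapping covering only the characters in the example above (including the variant "佰" and the homophone "園"); a real system would need a complete conversion table.

```python
import unicodedata

# Illustrative mapping only; covers just the characters of the example in the text.
T2S = str.maketrans({"職": "职", "結": "结", "時": "时", "佰": "百", "園": "元"})

def normalize_chars(text):
    """Full-width to half-width, uppercase to lowercase, then traditional to simplified."""
    text = unicodedata.normalize("NFKC", text)  # also folds full-width punctuation to ASCII
    text = text.lower()
    return text.translate(T2S)
```

Note that NFKC also turns the full-width comma "，" into an ASCII ",", so expressions applied afterwards would need to expect half-width punctuation.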

After character conversion, the next step is to normalize numbers, specifically by converting numbers expressed in Chinese characters or homophones into Arabic numerals. For example:

重庆易瑞沙代购_易瑞沙价格_印度易瑞沙代购服务热线：幺三九二二五XXXXX (Chongqing Iressa purchasing agent_Iressa price_India Iressa purchasing service hotline: 幺三九二二五XXXXX)

is converted to:

重庆易瑞沙代购_易瑞沙价格_印度易瑞沙代购服务热线：139225XXXXX (Chongqing Iressa purchasing agent_Iressa price_India Iressa purchasing service hotline: 139225XXXXX)
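A possible sketch of the numeral normalization is shown here. The mapping includes the homophone "幺" for 1, as in the example. Converting only runs of three or more numeral characters is an assumption added for this sketch, so that ordinary words such as "一款" are left untouched; the source does not describe how such cases are handled.

```python
import re

# Chinese numeral characters (plus the homophone 幺) mapped to Arabic digits.
NUMERAL_MAP = {
    "零": "0", "〇": "0", "一": "1", "幺": "1", "二": "2", "三": "3",
    "四": "4", "五": "5", "六": "6", "七": "7", "八": "8", "九": "9",
}

# Assumption: only runs of at least three numeral characters are treated as numbers.
_RUN = re.compile("[" + "".join(NUMERAL_MAP) + "]{3,}")

def numerals_to_digits(text):
    """Replace each run of Chinese numeral characters with Arabic digits."""
    return _RUN.sub(lambda m: "".join(NUMERAL_MAP[c] for c in m.group()), text)
```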

After the above preprocessing, the media information is more regular and easier to recognize with classification expressions, which improves the accuracy of extracting suspected advertisement information.

Step S102: generate the advertisement weight of the suspected advertisement information according to at least one preset weight factor, where the weight factors include one or more of the proportion of the length of the suspected advertisement information in the full text of the media information, the proportion of advertisement information posted by the user among all media information posted by the user, and the number of pictures in the suspected advertisement information.

In the embodiments of the present application, at least three weight factors are defined, including:

1. The proportion of the length of the suspected advertisement information in the full text of the media information. For example, if a user posts 50 characters in a reply on a forum and 20 of those characters are identified as suspected advertisement information, then for this reply the proportion is X₂ = 20/50 = 0.4.

Fig. 2 is a flowchart of obtaining the length of suspected advertisement information provided by an embodiment of the present application.

In one embodiment, as shown in Fig. 2, obtaining the length of the suspected advertisement information includes the following steps:

Step S201: obtain the start position and end position of the suspected advertisement information matched by each classification expression.

For example, for a given piece of media information (corpus), all matched concept expressions and their positions are shown in Table 2:

Row | Concept name | Concept expression | Matched content | Start position | End position
1 | 推销话术 (sales pitch) | 温和.*亲肤 | 温和干净，亲肤 | 0 | 8
2 | 推销话术 (sales pitch) | 亲肤嫩肤 | 亲肤嫩肤 | 6 | 10

Table 2

As Table 2 shows, the content matched by the concept expression in row 1 starts at position 0 and ends at position 8; the content matched by the concept expression in row 2 starts at position 6 and ends at position 10.

Step S202: determine, from the start and end positions, whether the positions of the suspected advertisement information overlap.

With reference to Table 2, the start position of row 1 is less than the start position of row 2, the end position of row 1 is greater than the start position of row 2, and the end position of row 1 is less than the end position of row 2, so the content matched by row 1 and row 2 overlaps.

Step S203: if the positions of the suspected advertisement information overlap, take the difference between the maximum end position and the minimum start position as the length of the suspected advertisement information.

With reference to Table 2, the positions of the suspected advertisement information overlap, the maximum end position is 10, and the minimum start position is 0. The length of the suspected advertisement information is therefore the end position of row 2 minus the start position of row 1, that is, 10 − 0 = 10. The total length of the text is 28, so X₂ = 10/28.
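Steps S201 to S203 amount to merging overlapping match spans before measuring their length. A minimal sketch follows; summing the lengths of spans that do not overlap is an assumption, since the text only describes the overlapping case.

```python
def suspected_length(spans):
    """Total length covered by (start, end) spans, counting overlapping characters once."""
    total = 0
    cur_start = cur_end = None
    for start, end in sorted(spans):
        if cur_end is None or start > cur_end:
            # No overlap with the current group: close it and start a new one.
            if cur_end is not None:
                total += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            # Overlap: extend the current group to the larger end position.
            cur_end = max(cur_end, end)
    if cur_end is not None:
        total += cur_end - cur_start
    return total

def length_ratio(spans, text_length):
    """Weight factor X2: suspected advertisement length over full text length."""
    return suspected_length(spans) / text_length if text_length else 0.0
```

With the spans from Table 2, `suspected_length([(0, 8), (6, 10)])` gives 10, and `length_ratio([(0, 8), (6, 10)], 28)` gives 10/28.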

2. The proportion of advertisement information posted by the user among all media information posted by the user. For example, the user's behavior can be tracked and counted according to identity information such as the account ID, user name, and IP address used to post media information, including counting the number C1 of all media information posted by the user and the number C2 of advertisement information posted by the user. Based on these counts, the proportion is X₃ = C2/C1.
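The per-user counters C1 and C2 could be kept as in the sketch below. The class name and the `user_key` parameter are hypothetical; the key stands for whatever identity information (account ID, user name, IP address) the platform uses to track a user.

```python
from collections import defaultdict

class UserStats:
    """Tracks, per user, the total number of posts (C1) and advertisement posts (C2)."""

    def __init__(self):
        self.total = defaultdict(int)  # C1 per user
        self.ads = defaultdict(int)    # C2 per user

    def record(self, user_key, is_ad):
        """Count one posted item, and count it as an advertisement if is_ad is true."""
        self.total[user_key] += 1
        if is_ad:
            self.ads[user_key] += 1

    def ad_ratio(self, user_key):
        """Weight factor X3 = C2 / C1, or 0.0 for a user with no recorded posts."""
        c1 = self.total.get(user_key, 0)
        return self.ads.get(user_key, 0) / c1 if c1 else 0.0
```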

3. The number of pictures in the suspected advertisement information. For example, if a user's reply on a forum is identified as containing suspected advertisement information and the reply contains 3 pictures, then the number of pictures in the suspected advertisement information is X₄ = 3.

Fig. 3 is a flowchart of generating the advertisement weight of suspected advertisement information provided by an embodiment of the present application.

As shown in Fig. 3, based on the above three weight factors, the embodiments obtain the advertisement weight of the suspected advertisement information through the following steps:

Step S301: multiply each weight factor by its corresponding weight coefficient to obtain the weight value of each weight factor.

The length proportion X₂ of the suspected advertisement information in the full text of the media information corresponds to a first weight value, which is the product of X₂ and a first weight coefficient W₂, namely W₂X₂. The first weight coefficient W₂ is a preset parameter with a value greater than 0.

The proportion X₃ of advertisement information posted by the user among all media information posted by the user corresponds to a second weight value, which is the product of X₃ and a second weight coefficient W₃, namely W₃X₃. The second weight coefficient W₃ is a preset parameter with a value greater than 0.

The number X₄ of pictures in the suspected advertisement information corresponds to a third weight value, which is the product of X₄ and a third weight coefficient W₄, namely W₄X₄. The third weight coefficient W₄ is a preset parameter with a value greater than 0.

The values of the first weight coefficient W₂, the second weight coefficient W₃, and the third weight coefficient W₄ can be set flexibly. For example, when users who post advertisements repeatedly need to be monitored closely, W₃ can be set to a larger value; when advertising on a forum or social platform consists mainly of picture advertisements, W₄ can be set to a larger value.

Step S302: add the category weight of the suspected advertisement information to the weight values of the weight factors to obtain the advertisement weight.

In the embodiments of the present application, the advertisement weight indicates the likelihood that a piece of media information (including forum posts, various updates on social platforms, and so on) is an advertisement. If the advertisement weight is denoted X, then:

X = W₁ + W₂X₂ + W₃X₃ + W₄X₄
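The weighted sum above translates directly into code. In this sketch the coefficient defaults of 1.0 are placeholders, since the source only requires each coefficient to be greater than 0, and the factors default to 0 so that any subset of them can be used.

```python
def ad_weight(w1, x2=0.0, x3=0.0, x4=0.0, w2=1.0, w3=1.0, w4=1.0):
    """Advertisement weight X = W1 + W2*X2 + W3*X3 + W4*X4.

    w1: category weight of the matched classification node
    x2: length proportion, x3: user's advertisement proportion, x4: picture count
    w2, w3, w4: preset weight coefficients (placeholder defaults)
    """
    return w1 + w2 * x2 + w3 * x3 + w4 * x4
```

For example, with W₁ = 2, X₂ = 0.4, X₃ = 0.5, X₄ = 3 and illustrative coefficients W₂ = 1, W₃ = 2, W₄ = 0.5, the result is 2 + 0.4 + 1.0 + 1.5 = 4.9.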

Step S103: determine the suspected advertisement information whose advertisement weight is greater than a preset threshold to be advertisement information.

In the embodiments of the present application, a threshold M is preset and compared with the advertisement weight X. If X is greater than or equal to M, the media information is determined to be an advertisement; if X is less than M, the media information is determined not to be an advertisement.

The threshold M can be set flexibly according to the advertisement recognition strategy. Under a strict strategy, M can take a smaller value, so that advertisements and media information with advertising tendencies are all recognized, without regard to false positives. Under a lenient strategy, M can take a larger value, so that only fairly obvious advertisements are recognized, without regard to misses. Because forum or social media operators decide how aggressively to filter advertisements based on factors such as management cost and impact on user experience, M can be determined by the operator in practice and is not specifically limited in the embodiments of the present application.
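The effect of the threshold M can be illustrated with a short sketch. It uses the "greater than or equal to" comparison from the embodiment description; the function name and the sample weights are made up for illustration.

```python
def recognize(weighted_items, threshold):
    """Return the items whose advertisement weight X is at least the threshold M."""
    return [item for item, weight in weighted_items if weight >= threshold]

# Hypothetical (item, advertisement weight) pairs.
items = [("post_a", 1.2), ("post_b", 2.6), ("post_c", 4.9)]

strict = recognize(items, 1.0)   # small M: everything with an advertising tendency
lenient = recognize(items, 3.0)  # large M: only the most obvious advertisement
```

Here `strict` contains all three posts, while `lenient` contains only "post_c".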

Fig. 4 is a flowchart of managing users who publish advertisements, provided by an embodiment of the present application.

As shown in Fig. 4, in one embodiment, after the suspected advertisement information is determined to be an advertisement, the following steps are further included to manage users who publish advertisements:

Step S401: update the quantity of advertisement information published by the user.

In the embodiments of the present application, the quantity of advertisement information published by each user is counted, and the count is updated whenever the user is recognized as having published new advertisement information. The embodiments may use the user's IP address, mobile phone number, email address, real-name authentication information, and the like to determine whether the accounts publishing advertisements belong to the same user. If they do, the quantities of advertisement information published through those accounts are combined, which prevents users from registering multiple accounts to publish advertisements in order to evade management.
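
One way to combine counts across accounts that share an identifier is a union-find grouping, sketched below. The source does not specify how accounts are matched, so this is only an assumption: two accounts are treated as the same user if they share any identifier string, and the function name `merge_ad_counts` and the `"ip:"`/`"phone:"`/`"email:"` prefixes are hypothetical.

```python
from collections import defaultdict


def merge_ad_counts(accounts, ad_counts):
    """Group accounts that share any identifier and sum their advertisement counts.

    accounts:  dict account_id -> set of identifier strings, e.g. {"ip:1.2.3.4"}
    ad_counts: dict account_id -> number of advertisements recognized for that account
    Returns a dict mapping each account_id to the merged count of its group.
    """
    parent = {account: account for account in accounts}

    def find(a):
        # Follow parent links to the group representative (with path halving).
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a

    first_owner = {}  # identifier -> first account seen with it
    for account, identifiers in accounts.items():
        for ident in identifiers:
            if ident in first_owner:
                root_a, root_b = find(first_owner[ident]), find(account)
                if root_a != root_b:
                    parent[root_b] = root_a
            else:
                first_owner[ident] = account

    totals = defaultdict(int)
    for account in accounts:
        totals[find(account)] += ad_counts.get(account, 0)
    return {account: totals[find(account)] for account in accounts}


accounts = {
    "a": {"ip:1.1.1.1", "phone:100"},
    "b": {"phone:100"},
    "c": {"email:x@example.com"},
}
merged = merge_ad_counts(accounts, {"a": 2, "b": 3, "c": 1})
# merged == {"a": 5, "b": 5, "c": 1}
```

Accounts "a" and "b" share a phone number, so each is charged with the combined count of 5. Matching on IP address alone can group unrelated users behind a shared address, so an operator might weigh identifiers differently.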

Step S402: judge whether the quantity of advertisement information published by the user is greater than a quantity threshold.

Step S403: when the quantity of advertisement information published by the user is greater than the quantity threshold, ban the user's IP address and delete the user's account information.

In the embodiments of the present application, deleting the user's account information prevents the user who sent the advertisement information from logging in to the account again, and banning the user's IP address prevents the user from registering a new account, thereby fundamentally eliminating the publishing of advertisements by that user.

In some embodiments, when the quantity of advertisement information published by a user is counted, the count may be taken per period, covering the advertisement information the user published within the most recent preset time period, for example the number of advertisements published within one hour or within one day. A quantity threshold is set separately for each time period, for example 3 for one hour, 5 for one day, 10 for one week, and so on. When step S402 judges whether the quantity of advertisement information published by the user is greater than the quantity threshold, step S403 is triggered as soon as the quantity published within any one time period exceeds the corresponding threshold, so that the user's IP address is banned and the user's account information is deleted.
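
The per-period counting described above can be sketched as a sliding-window monitor. The thresholds are the example values from the text (3 per hour, 5 per day, 10 per week); the class name `AdRateMonitor`, the use of timestamps in seconds, and the assumption that advertisements are recorded in non-decreasing time order are illustrative choices, not details from the source.

```python
from collections import deque

# Window length in seconds -> quantity threshold (example values from the text).
WINDOW_THRESHOLDS = {3600: 3, 86400: 5, 7 * 86400: 10}


class AdRateMonitor:
    """Tracks one user's recognized advertisements over several time windows."""

    def __init__(self, thresholds=WINDOW_THRESHOLDS):
        self.thresholds = dict(thresholds)
        self.max_window = max(self.thresholds)
        self.timestamps = deque()  # advertisement times, oldest first

    def record_ad(self, now):
        """Record one advertisement at time `now` (seconds).

        Returns True if the count in any window now exceeds its threshold,
        which would trigger step S403.
        """
        self.timestamps.append(now)
        # Timestamps older than the largest window can never count again.
        while self.timestamps and now - self.timestamps[0] >= self.max_window:
            self.timestamps.popleft()
        return any(
            sum(1 for t in self.timestamps if now - t < window) > limit
            for window, limit in self.thresholds.items()
        )


monitor = AdRateMonitor()
flags = [monitor.record_ad(t) for t in (0, 60, 120, 180)]
# flags == [False, False, False, True]: the fourth ad within an hour exceeds 3.
```

A user who spaces advertisements two hours apart never exceeds the hourly threshold but is still caught by the daily one on the sixth advertisement, which is why each window carries its own threshold.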

As can be seen from the above technical solutions, the embodiments of the present application provide an advertisement recognition method, including: obtaining suspected advertisement information from media information published by a user, using a classification model that contains classification expressions; generating an advertisement weight of the suspected advertisement information according to at least one preset weight factor, the weight factor including one or more of the proportion of the length of the suspected advertisement information to the full text of the media information, the proportion of the advertisement information already published by the user to all media information already published by the user, and the number of pictures in the suspected advertisement information; and determining the suspected advertisement information whose advertisement weight is greater than a preset threshold to be advertisement information. The method provided by the embodiments of the present application screens media information at two levels using the classification model and the advertisement weight: it first determines the suspected advertisement information, then determines its advertisement weight based on factors such as the character length and content of the suspected advertisement information and the user behavior associated with it, and finally determines from the advertisement weight whether the suspected advertisement information is an advertisement. Advertisements are thereby identified accurately, which solves the problem that the prior art cannot identify advertisements on social platforms in a timely and effective manner.

The following is a device embodiment of the present application, which provides an advertisement recognition device that can be used to carry out the method embodiments of the present application. For technical details not disclosed in the device embodiment, refer to the method embodiments of the present application.

As shown in Fig. 5, the device includes:

an information acquisition module 501, configured to obtain suspected advertisement information from media information published by a user, using a classification model that contains classification expressions;

a weight generation module 502, configured to generate an advertisement weight of the suspected advertisement information according to at least one preset weight factor, the weight factor including one or more of the proportion of the length of the suspected advertisement information to the full text of the media information, the proportion of the advertisement information already published by the user to all media information already published by the user, and the number of pictures in the suspected advertisement information;

an advertisement determination module 503, configured to determine the suspected advertisement information whose advertisement weight is greater than a preset threshold to be advertisement information.
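
The three modules can be pictured as a small pipeline. The sketch below is a hypothetical composition, not the device of Fig. 5: classification expressions are stood in for by regular expressions, only the length-proportion factor is used, and all class names, the pattern, and the numeric values are assumptions made for illustration.

```python
import re
from dataclasses import dataclass


@dataclass
class Suspect:
    text: str               # matched suspected advertisement text
    category_weight: float  # weight of the advertisement category that matched


class InfoAcquisitionModule:  # corresponds to module 501
    def __init__(self, expressions):
        # expressions: list of (regex pattern, category weight); regexes are a stand-in.
        self.expressions = [(re.compile(p), w) for p, w in expressions]

    def acquire(self, media_text):
        for pattern, weight in self.expressions:
            match = pattern.search(media_text)
            if match:
                return Suspect(match.group(0), weight)
        return None


class WeightGenerationModule:  # corresponds to module 502
    def __init__(self, length_coeff):
        self.length_coeff = length_coeff

    def generate(self, suspect, media_text):
        # Only the length proportion is used here; other factors are omitted.
        length_ratio = len(suspect.text) / max(len(media_text), 1)
        return suspect.category_weight + self.length_coeff * length_ratio


class AdDeterminationModule:  # corresponds to module 503
    def __init__(self, threshold):
        self.threshold = threshold

    def determine(self, ad_weight):
        # Strict ">" follows the wording of the module description.
        return ad_weight > self.threshold


class AdRecognitionDevice:
    def __init__(self, acquisition, weighting, determination):
        self.acquisition = acquisition
        self.weighting = weighting
        self.determination = determination

    def recognize(self, media_text):
        suspect = self.acquisition.acquire(media_text)
        if suspect is None:
            return False  # first-level screening found nothing suspicious
        weight = self.weighting.generate(suspect, media_text)
        return self.determination.determine(weight)


device = AdRecognitionDevice(
    InfoAcquisitionModule([(r"call \d{7,}", 0.5)]),
    WeightGenerationModule(length_coeff=0.5),
    AdDeterminationModule(threshold=0.6),
)
short_post_is_ad = device.recognize("Great deal, call 5551234567")
```

Keeping the modules separate means the second-level weighting only runs on text that the first-level classification has already marked as suspected advertisement information.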

As can be seen from the above technical solutions, the embodiments of the present application provide an advertisement recognition device configured to: obtain suspected advertisement information from media information published by a user, using a classification model that contains classification expressions; generate an advertisement weight of the suspected advertisement information according to at least one preset weight factor, the weight factor including one or more of the proportion of the length of the suspected advertisement information to the full text of the media information, the proportion of the advertisement information already published by the user to all media information already published by the user, and the number of pictures in the suspected advertisement information; and determine the suspected advertisement information whose advertisement weight is greater than a preset threshold to be advertisement information. The device provided by the embodiments of the present application screens media information at two levels using the classification model and the advertisement weight: it first determines the suspected advertisement information, then determines its advertisement weight based on factors such as the character length and content of the suspected advertisement information and the user behavior associated with it, and finally determines from the advertisement weight whether the suspected advertisement information is an advertisement. Advertisements are thereby identified accurately, which solves the problem that the prior art cannot identify advertisements on social platforms in a timely and effective manner.

Other embodiments of the present application will readily occur to those skilled in the art after considering the specification and practicing the application disclosed herein. The present application is intended to cover any variations, uses, or adaptations of the application that follow its general principles and include common knowledge or customary technical means in the art that are not disclosed in the application. The specification and embodiments are to be regarded as exemplary only, with the true scope and spirit of the application indicated by the following claims.

It should be understood that the present application is not limited to the precise structures described above and shown in the accompanying drawings, and that various modifications and changes may be made without departing from its scope. The scope of the application is limited only by the appended claims.

Claims

1. An advertisement recognition method, characterized by comprising:

obtaining suspected advertisement information from media information published by a user, using a classification model that contains classification expressions;

generating an advertisement weight of the suspected advertisement information according to at least one preset weight factor, the weight factor including one or more of: the proportion of the length of the suspected advertisement information to the full text of the media information, the proportion of the advertisement information already published by the user to all media information already published by the user, and the number of pictures in the suspected advertisement information;

determining the suspected advertisement information whose advertisement weight is greater than a preset threshold to be advertisement information.

2. The method according to claim 1, characterized in that, before obtaining the suspected advertisement information from the media information published by the user using the classification model that contains classification expressions, the method further comprises:

preprocessing the media information, the preprocessing including one or more of: removing specific characters from the media information, performing character conversion on the media information, and converting Chinese-character numerals in the media information to digits.

3. The method according to claim 1, characterized in that:

the classification model includes at least one category node, each category node corresponds to one advertisement category, and each advertisement category corresponds to one category weight; each category node includes at least one classification expression, the classification expression being used to identify the suspected advertisement information in the media information.

4. The method according to claim 3, characterized in that generating the advertisement weight of the suspected advertisement information according to the at least one preset weight factor comprises:

multiplying each weight factor by its corresponding weight coefficient to obtain the factor weight of each weight factor;

adding the category weight of the suspected advertisement information to the factor weight of each weight factor to obtain the advertisement weight.

5. The method according to claim 4, characterized in that:

the factor weights include a first weight, the first weight being the product of the proportion of the length of the suspected advertisement information to the full text of the media information and a first weight coefficient.

6. The method according to claim 4, characterized in that:

the factor weights include a second weight, the second weight being the product of the proportion of the advertisement information already published by the user to all media information already published by the user and a second weight coefficient.

7. The method according to claim 4, characterized in that:

the factor weights include a third weight, the third weight being the product of the number of pictures in the suspected advertisement information and a third weight coefficient.

8. The method according to claim 1, characterized in that the length of the suspected advertisement information is obtained through the following steps:

obtaining, for each classification expression, the start position and the end position of the suspected advertisement information that it matches;

judging, according to the start positions and end positions, whether the positions of the suspected advertisement information intersect;

if the positions of the suspected advertisement information intersect, taking the difference between the maximum end position and the minimum start position of the suspected advertisement information as the length of the suspected advertisement information.

9. The method according to claim 1, characterized in that, after determining the suspected advertisement information whose advertisement weight is greater than the preset threshold to be advertisement information, the method further comprises:

updating the quantity of advertisement information published by the user;

judging whether the quantity of advertisement information published by the user is greater than a quantity threshold;

when the quantity of advertisement information published by the user is greater than the quantity threshold, banning the user's IP address and deleting the user's account information.

10. An advertisement recognition device, characterized by comprising:

an information acquisition module, configured to obtain suspected advertisement information from media information published by a user, using a classification model that contains classification expressions;

a weight generation module, configured to generate an advertisement weight of the suspected advertisement information according to at least one preset weight factor, the weight factor including one or more of: the proportion of the length of the suspected advertisement information to the full text of the media information, the proportion of the advertisement information already published by the user to all media information already published by the user, and the number of pictures in the suspected advertisement information;

an advertisement determination module, configured to determine the suspected advertisement information whose advertisement weight is greater than a preset threshold to be advertisement information.