CN114722321A

CN114722321A - Webpage content processing method and device, electronic equipment and storage medium

Info

Publication number: CN114722321A
Application number: CN202110005229.4A
Authority: CN
Inventors: 陈煌
Original assignee: Tencent Technology Shenzhen Co Ltd
Current assignee: Tencent Technology Shenzhen Co Ltd
Priority date: 2021-01-05
Filing date: 2021-01-05
Publication date: 2022-07-08
Anticipated expiration: 2041-01-05
Also published as: CN114722321B

Abstract

Embodiments of the application disclose a webpage content processing method, apparatus, electronic device, and storage medium. The method obtains to-be-processed webpage content of a webpage, where the to-be-processed webpage content comprises webpage content units and their corresponding webpage tags, and each webpage content unit comprises content characters and identification tags corresponding to the content characters. Position information of each content character in the webpage is identified from the webpage tag and the identification tag, and the processing type of the content character is determined from that position information. When the content character is of the canvas drawing type, it is drawn on a preset canvas; when it is of the tag processing type, its identification tag is added to a preset tag set according to a preset rule. Target webpage content of the webpage is then generated and output from the current canvas and the current tag set, which can effectively prevent crawlers from capturing the content on the webpage and improves the security of the webpage content.

Description

Web content processing method, apparatus, electronic device and storage medium

Technical Field

The present application relates to the field of computer technologies, and in particular to a webpage content processing method, apparatus, electronic device, and storage medium.

Background

With the development of Internet technology, the Internet has penetrated every corner of daily social and economic life. Webpages have become an important means of transmitting information: people browse information through webpages, and enterprises carry out production, marketing, and other activities through webpages.

However, when information is browsed through a webpage, for example when an e-book is read on a webpage, the content on the webpage is easily captured by crawlers. A crawler only needs to wait until the content on the webpage has finished loading and can then obtain the displayed content with a script, so the security of the webpage content is low.

Summary of the Invention

Embodiments of the present application provide a webpage content processing method, apparatus, electronic device, and storage medium that can improve the security of webpage content.

An embodiment of the present application provides a webpage content processing method, including:

obtaining to-be-processed webpage content of a webpage, where the to-be-processed webpage content includes at least one webpage content unit and a webpage tag corresponding to the webpage content unit, and the webpage content unit includes at least one content character and an identification tag corresponding to the content character;

identifying position information of the content character in the webpage according to the webpage tag corresponding to the webpage content unit and the identification tag corresponding to the content character;

for each content character in the webpage content, determining a processing type of the content character according to the position information of the content character in the webpage;

when the processing type of the content character is a canvas drawing type, drawing the content character on a preset canvas;

when the processing type of the content character is a tag processing type, adding the identification tag corresponding to the content character to a preset tag set in a preset order; and

when a preset content generation condition is met, generating and outputting target webpage content of the webpage according to the current canvas and the current tag set.
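The steps of the method can be sketched roughly as follows. This is only an illustration under stated assumptions, not the implementation of the application: the names `ContentChar`, `TargetContent`, `processing_type`, `process`, and `CHAR_WIDTH` are hypothetical, the canvas is modeled as a list of recorded draw calls rather than a real browser canvas, and the rule that sends even columns to the canvas is invented purely to make the example concrete.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

CHAR_WIDTH = 16  # assumed width of one character cell, in pixels


@dataclass
class ContentChar:
    char: str
    x: int  # horizontal position of the character in the page
    y: int  # vertical position of the character in the page


@dataclass
class TargetContent:
    # Stand-in for the preset canvas: the (char, x, y) draw calls it would receive.
    canvas_ops: List[Tuple[str, int, int]] = field(default_factory=list)
    # Stand-in for the preset tag set: identification tags in insertion order.
    tag_set: List[str] = field(default_factory=list)


def processing_type(c: ContentChar) -> str:
    """Hypothetical position rule: even columns go to the canvas, odd columns stay as tags."""
    column = c.x // CHAR_WIDTH
    return "canvas" if column % 2 == 0 else "tag"


def process(chars: List[ContentChar]) -> TargetContent:
    """Route every content character to the canvas or to the tag set."""
    target = TargetContent()
    for c in chars:
        if processing_type(c) == "canvas":
            target.canvas_ops.append((c.char, c.x, c.y))
        else:
            target.tag_set.append(f"<span>{c.char}</span>")
    return target
```

With such a split, a script that reads only the markup sees part of the characters, while the rest exist only as pixels on the canvas.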

Correspondingly, an embodiment of the present application further provides a webpage content processing apparatus, including:

an acquisition unit, configured to obtain to-be-processed webpage content of a webpage, where the to-be-processed webpage content includes at least one webpage content unit and a webpage tag corresponding to the webpage content unit, and the webpage content unit includes at least one content character and an identification tag corresponding to the content character;

an identification unit, configured to identify position information of the content character in the webpage according to the webpage tag corresponding to the webpage content unit and the identification tag corresponding to the content character;

a determining unit, configured to determine, for each content character in the webpage content, a processing type of the content character according to the position information of the content character in the webpage;

a drawing unit, configured to draw the content character on a preset canvas when the processing type of the content character is a canvas drawing type;

an adding unit, configured to add the identification tag corresponding to the content character to a preset tag set in a preset order when the processing type of the content character is a tag processing type; and

a generating unit, configured to generate and output target webpage content of the webpage according to the current canvas and the current tag set when a preset content generation condition is met.

In one embodiment, before the acquisition unit, the apparatus further includes:

an acquisition subunit, configured to obtain initial webpage content of the webpage, where the initial webpage content includes at least one webpage tag and sub-webpage content corresponding to the webpage tag, and the sub-webpage content includes at least one content character;

a splitting subunit, configured to split the content characters in the sub-webpage content to obtain an initial webpage content unit, where the initial webpage content unit includes at least one content character; and

an adding subunit, configured to add identification tags to the content characters in the initial webpage content unit to obtain a webpage content unit.

The adding subunit may be configured to:

determine the number of content characters in the initial webpage content unit;

determine, according to the number of content characters in the initial webpage content unit, the identification tag corresponding to the content characters in the initial webpage content unit; and

add the identification tag to the content characters in the initial webpage content unit to obtain the webpage content unit.
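The splitting and tagging performed by these subunits might look like the sketch below. It is an assumption-laden illustration: `split_units`, `choose_tag`, and `tag_units` are hypothetical names, and the rule that maps the number of characters in a unit to `span` or `em` is invented, since the text does not state how the count selects the tag.

```python
from html import escape
from typing import List


def split_units(text: str, unit_size: int = 1) -> List[str]:
    """Split sub-webpage content into initial content units of at most unit_size characters."""
    return [text[i:i + unit_size] for i in range(0, len(text), unit_size)]


def choose_tag(char_count: int) -> str:
    """Hypothetical rule mapping the number of characters in a unit to an inline element."""
    return "span" if char_count == 1 else "em"


def tag_units(text: str, unit_size: int = 1) -> List[str]:
    """Wrap every initial content unit in its identification tag."""
    units = []
    for unit in split_units(text, unit_size):
        tag = choose_tag(len(unit))
        # escape() keeps characters such as "<" from being read as markup.
        units.append(f"<{tag}>{escape(unit)}</{tag}>")
    return units
```

For instance, `tag_units("ab")` yields one `<span>` unit per character, while `tag_units("abc", 2)` yields a two-character `<em>` unit followed by a single-character `<span>` unit.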

In one embodiment, the determining unit includes:

a comparison subunit, configured to compare, for each content character in the webpage content, the position information of the content character in the webpage with preset position information to obtain a comparison result; and

a determining subunit, configured to determine the processing type of the content character according to the comparison result.
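One simple way to read the comparison step is as a membership test against a set of preset positions. The sketch below assumes that a position is a `(row, column)` pair and that a match means "draw on the canvas"; both choices, and the names `Position`, `compare`, and `determine_type`, are hypothetical and are not taken from the application.

```python
from typing import Set, Tuple

Position = Tuple[int, int]  # assumed representation: (row, column) of a character in the page


def compare(position: Position, preset: Set[Position]) -> bool:
    """Comparison result: True when the character sits at one of the preset positions."""
    return position in preset


def determine_type(position: Position, preset: Set[Position]) -> str:
    """Map the comparison result to a processing type (assumed mapping)."""
    return "canvas" if compare(position, preset) else "tag"
```

Keeping the preset positions as data rather than code would let a server change which characters are drawn on the canvas without touching the comparison logic.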

In one embodiment, the drawing unit includes:

a determining subunit, configured to determine a drawing function for drawing the content character when the processing type of the content character is the canvas drawing type; and

a drawing subunit, configured to draw the content character at the corresponding position on the preset canvas according to the drawing function and the position information of the content character in the webpage.

In one embodiment, the adding unit includes:

a determining subunit, configured to determine a corresponding tag order setting function according to the preset order when the processing type of the content character is the tag processing type; and

an adding subunit, configured to use the tag order setting function to add the identification tag corresponding to the content character to the preset tag set in the preset order.
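The text does not say what the tag order setting function does. One possible reading, shown here purely as an assumption, is that the tags are emitted in a scrambled source order while each tag carries its visual position, for example through the CSS `order` property of a flex container. The names `set_tag_order` and `visual_text` are hypothetical.

```python
import random
import re
from html import escape, unescape
from typing import List

_TAG = re.compile(r'<span style="order:(\d+)">(.*?)</span>')


def set_tag_order(chars: List[str], seed: int = 0) -> List[str]:
    """Emit one tag per character in shuffled source order, keeping the visual index in the style."""
    indexed = list(enumerate(chars))
    random.Random(seed).shuffle(indexed)  # seeded so that the scramble is reproducible
    return [f'<span style="order:{i}">{escape(ch)}</span>' for i, ch in indexed]


def visual_text(tags: List[str]) -> str:
    """Recover the text a reader would see by sorting the tags on their order value."""
    pairs = []
    for tag in tags:
        match = _TAG.fullmatch(tag)
        if match is None:
            raise ValueError(f"unexpected tag: {tag!r}")
        pairs.append((int(match.group(1)), unescape(match.group(2))))
    return "".join(ch for _, ch in sorted(pairs))
```

Under this reading, a crawler that concatenates the tags in source order gets scrambled text, while a renderer that honors the order values shows the characters in their intended sequence.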

In one embodiment, the generating unit includes:

an insertion subunit, configured to insert, when the preset content generation condition is met, the current canvas at the corresponding position of the webpage according to the position information in the webpage of the content characters on the current canvas; and

an output subunit, configured to output the content characters corresponding to the identification tags to the corresponding positions of the webpage according to the order of the identification tags in the current tag set.

An embodiment of the present application further provides a computer program product or computer program that includes computer instructions stored in a computer-readable storage medium. A processor of a computer device reads the computer instructions from the computer-readable storage medium and executes them, causing the computer device to perform the method provided in the various optional implementations of the above aspect.

Correspondingly, an embodiment of the present application further provides a computer-readable storage medium storing instructions that, when executed by a processor, implement the webpage content processing method provided in any embodiment of the present application.

In the embodiments of the present application, to-be-processed webpage content of a webpage is obtained, where the to-be-processed webpage content includes at least one webpage content unit and a webpage tag corresponding to the webpage content unit, and the webpage content unit includes at least one content character and an identification tag corresponding to the content character; position information of the content character in the webpage is identified according to the webpage tag corresponding to the webpage content unit and the identification tag corresponding to the content character; for each content character in the webpage content, the processing type of the content character is determined according to its position information in the webpage; when the processing type is the canvas drawing type, the content character is drawn on a preset canvas; when the processing type is the tag processing type, the identification tag corresponding to the content character is added to a preset tag set in a preset order; and when a preset content generation condition is met, target webpage content of the webpage is generated and output according to the current canvas and the current tag set. By drawing the content characters of the canvas drawing type on the preset canvas, and adding the identification tags of the content characters of the tag processing type to the preset tag set in the preset order, this solution can effectively prevent crawlers from capturing the content on the webpage, thereby improving the security of the webpage content.

Description of Drawings

To describe the technical solutions in the embodiments of the present application more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below are only some embodiments of the present application; those skilled in the art can obtain other drawings from them without creative effort.

FIG. 1 is a schematic diagram of a scenario of the webpage content processing method provided by an embodiment of the present application;

FIG. 2a is a schematic flowchart of the webpage content processing method provided by an embodiment of the present application;

FIG. 2b is a schematic diagram of another scenario of the webpage content processing method provided by an embodiment of the present application;

FIG. 2c is a schematic diagram of webpage content provided by an embodiment of the present application;

FIG. 2d is another schematic diagram of webpage content provided by an embodiment of the present application;

FIG. 2e is a schematic diagram of position information of content characters provided by an embodiment of the present application;

FIG. 2f is a schematic diagram of another scenario of the webpage content processing method provided by an embodiment of the present application;

FIG. 3a is another schematic flowchart of the network content processing method provided by an embodiment of the present application;

FIG. 3b is a schematic diagram of another scenario of the webpage content processing method provided by an embodiment of the present application;

FIG. 4a is a schematic structural diagram of the network content processing apparatus provided by an embodiment of the present application;

FIG. 4b is another schematic structural diagram of the network content processing apparatus provided by an embodiment of the present application;

FIG. 5 is a schematic structural diagram of the network device provided by an embodiment of the present application.

Detailed Description

The technical solutions in the embodiments of the present application are described clearly and completely below with reference to the accompanying drawings. The described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by those skilled in the art based on the embodiments in the present application without creative effort fall within the protection scope of the present application.

Embodiments of the present application provide a webpage content processing method, apparatus, electronic device, and computer-readable storage medium. The webpage content processing apparatus may be integrated in an electronic device, and the electronic device may be a server, a terminal, or another device. The terminal may be an electronic device that has a storage unit and a microprocessor, such as a notebook computer, a desktop computer, or a tablet computer (PC, Personal Computer).

As an example, assume that the webpage content processing apparatus is integrated in a terminal. As shown in FIG. 1, the terminal first obtains to-be-processed webpage content of a webpage, where the to-be-processed webpage content includes at least one webpage content unit and a webpage tag corresponding to the webpage content unit, and the webpage content unit includes at least one content character and an identification tag corresponding to the content character. The terminal can then identify the position information of the content character in the webpage according to the webpage tag corresponding to the webpage content unit and the identification tag corresponding to the content character. For each content character in the webpage content, the terminal can determine the processing type of the content character according to its position information in the webpage. When the processing type is the canvas drawing type, the terminal can draw the content character on a preset canvas; when the processing type is the tag processing type, the terminal adds the identification tag corresponding to the content character to a preset tag set in a preset order. When a preset content generation condition is met, the terminal generates and outputs the target webpage content of the webpage according to the current canvas and the current tag set.

Detailed descriptions are given below. Note that the order in which the following embodiments are described does not limit the preferred order of the embodiments.

The embodiments of the present application are described from the perspective of the webpage content processing apparatus, which may be integrated in an electronic device; the electronic device may be a server, a terminal, or another device.

As shown in FIG. 2a, a webpage content processing method is provided. The specific process includes:

101. Obtain to-be-processed webpage content of a webpage, where the to-be-processed webpage content includes at least one webpage content unit and a webpage tag corresponding to the webpage content unit, and the webpage content unit includes at least one content character and an identification tag corresponding to the content character.

The to-be-processed webpage content may include content displayed through a browser. For example, it may include a browser document, that is, a document that the browser can recognize and render for display on a webpage.

The to-be-processed webpage content may be in any of several language formats.

For example, the to-be-processed webpage content may be a document written in HyperText Markup Language (HTML). An HTML document can be read by a variety of web browsers, which render it into a page that conveys various kinds of information. HTML is a markup language: it includes a series of tags that define the meaning and structure of webpage content.

As another example, the to-be-processed webpage content may be an HTML document with an embedded scripting language such as JavaScript. JavaScript is a high-level scripting language used in web development. It can add a wide range of dynamic functions to a webpage, making the presentation of webpage content in the browser smoother and more attractive. Scripts embedded in HTML can affect the behavior of the webpage.

As another example, the to-be-processed webpage content may be an HTML document with both an embedded scripting language such as JavaScript and Cascading Style Sheets (CSS). CSS is a language for defining style properties such as fonts, colors, and positions. It describes how information on a webpage is formatted and displayed, and can affect the appearance and layout of the content presented in the webpage.

In one embodiment, the to-be-processed webpage content may be of various content types. For example, it may be one or more of text, pictures, tables, animations, and even music.

In one embodiment, the to-be-processed webpage content may include at least one webpage content unit and a webpage tag corresponding to the webpage content unit, where the webpage content unit may include at least one content character and an identification tag corresponding to the content character.

A webpage tag may be a tag used to build the webpage structure and layout and to carry content. Through webpage tags, the browser can generate the corresponding webpage structure and layout and identify the content information marked by each webpage tag. According to the position information of the webpage tag in the to-be-processed webpage content, the browser renders the content marked by the webpage tag at the corresponding position of the webpage, so that the content is presented on the webpage.

For example, in HTML, a webpage tag may consist of "<" and ">" and a block-level element enclosed between them. Block-level elements can be used to build the webpage structure and layout and to carry content. For example, block-level elements may include head, div, p, body, and so on. Accordingly, webpage tags may be <head>, <div>, <p>, <body>, and so on.

In one embodiment, each webpage tag has its own function. For example, some webpage tags define the style in which the marked content is presented by the browser, and some define the format in which the marked content is presented. For example, in HTML, the <head> tag can define information about the to-be-processed webpage content, such as its character encoding in the browser; the <div> tag can define a section in the to-be-processed webpage content; the <p> tag can define a paragraph; the <body> tag can define the body of the document; and so on.

In one embodiment, a webpage tag may include a start tag and an end tag. The content enclosed by the start tag and the end tag is the content presented on the webpage, and it is presented according to the function of the tag. For example, in an HTML document, the <p> tag may include the start tag <p> and the end tag </p>; the content marked by the <p> tag is the content between <p> and </p>, which can be presented on the webpage as a paragraph.

For example, as shown in FIG. 2b, in the HTML1 document, the content enclosed by the start tag <p> and the end tag </p> of the <p> tag is "This is the first paragraph". Because the content between <p> and </p> can be presented as a paragraph, "This is the first paragraph" is presented on the webpage as a paragraph.

As another example, the <title> tag can define the title of an HTML document. As shown in FIG. 2b, in the HTML1 document, the content enclosed by the start tag <title> and the end tag </title> is "Example", so the content "Example" enclosed in the <title> tag can be presented on the webpage as a title.

In one embodiment, a webpage tag may be used on its own. For example, the <input> tag, which defines an input control, can appear alone.

In one embodiment, webpage tags can be nested within one another. For example, as shown in FIG. 2b, the <p> tag can be nested inside the <body> tag.

In one embodiment, an identification tag may be a tag that carries content. Through identification tags, the browser can identify the content marked by an identification tag and the position information of that content, and render the content at the corresponding position of the webpage.

For example, in HTML, an identification tag may consist of "<" and ">" and an inline element enclosed between them. Inline elements may be basic, semantic-level elements. For example, inline elements may include span, strong, em, and so on. Accordingly, identification tags may be <span>, <strong>, <em>, and so on. Each kind of identification tag can identify the content information and position information of the webpage content it marks.

In one embodiment, an identification tag may include a start tag and an end tag. The content enclosed by the start tag and the end tag may be the content presented on the webpage, and it may be presented according to the function of the tag. For example, in an HTML document, the <span> tag may appear as a pair whose start tag is <span> and whose end tag is </span>; the content marked by the <span> tag is the content between <span> and </span>.

In one embodiment, an identification tag may be used on its own. For example, the <img> tag can be used alone, and the content it marks is an image.

In one embodiment, an identification tag can be nested inside a webpage tag, where it further marks the content marked by the webpage tag. For example, as shown in FIG. 2c, the <span> tag can be nested inside the <p> tag.

When identification tags are nested inside a webpage tag, the browser can identify not only the content and position information marked by the webpage tag but also the content and position information marked by the identification tags nested inside it. For example, as shown in FIG. 2c, when the browser parses the webpage content, it first identifies the content enclosed by the webpage tag <div> and then the content marked by the two webpage tags <p>. When identifying the first <p> tag, the browser identifies that the marked content is "What kind of people are written in this book?" and where that content is located in the webpage. When identifying the second <p> tag, the browser identifies not only the content and position information marked by that tag but also the content and position information marked by the identification tags <span> nested in it. For example, the browser can identify the content "This is a group of forgotten people." marked by the first identification tag <span> and its position information, as well as the content “有” ("there are") marked by the second identification tag <span> and its position information.
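The nested identification described here can be imitated with Python's standard `html.parser` module. The sketch is only an approximation: FIG. 2c is not reproduced in the text, so the `SAMPLE` markup is an assumed reconstruction of it, and `UnitCollector`, `collect_units`, `BLOCK_TAGS`, and `INLINE_TAGS` are hypothetical names. Position is represented simply by the order in which the text runs are encountered.

```python
from html.parser import HTMLParser
from typing import List, Optional, Tuple

BLOCK_TAGS = {"head", "div", "p", "body"}   # "webpage tags" in the text
INLINE_TAGS = {"span", "strong", "em"}      # "identification tags" in the text

# Assumed reconstruction of the markup shown in FIG. 2c.
SAMPLE = (
    "<div>"
    "<p>What kind of people are written in this book?</p>"
    "<p><span>This is a group of forgotten people.</span><span>有</span></p>"
    "</div>"
)


class UnitCollector(HTMLParser):
    """Record each text run with its nearest enclosing webpage tag and identification tag."""

    def __init__(self) -> None:
        super().__init__()
        self.stack: List[str] = []
        self.units: List[Tuple[Optional[str], Optional[str], str]] = []

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Pop up to and including the matching start tag.
            while self.stack and self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        page_tag = next((t for t in reversed(self.stack) if t in BLOCK_TAGS), None)
        id_tag = next((t for t in reversed(self.stack) if t in INLINE_TAGS), None)
        self.units.append((page_tag, id_tag, text))


def collect_units(markup: str) -> List[Tuple[Optional[str], Optional[str], str]]:
    parser = UnitCollector()
    parser.feed(markup)
    parser.close()
    return parser.units
```

Calling `collect_units(SAMPLE)` returns three entries in document order: the first paragraph has no identification tag, while the two text runs of the second paragraph each report `p` as the webpage tag and `span` as the identification tag.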

在一实施例中，网页内容单元可以为网页内容中的内容单元。比如，网页内容单元可以是网页内容中最小的内容单位。例如，以文本内容为例，网页内容单元可以是段落、句子、词组、字符等等。又例如，以图片内容为例，网页内容单元可以是图片等。In one embodiment, the webpage content unit may be a content unit in the webpage content. For example, the webpage content unit may be the smallest content unit in the webpage content. For example, taking text content as an example, web content units may be paragraphs, sentences, phrases, characters, and so on. For another example, taking picture content as an example, the webpage content unit may be a picture or the like.

In one embodiment, a webpage content unit may include content characters. Content characters may be symbols of various kinds: the scripts used in different countries, relational symbols that define relationships between characters, self-created symbols, and so on. For example, content characters may be Chinese characters, English letters, Arabic numerals, punctuation marks, Japanese characters, and so on.

In one embodiment, to make the content characters in the webpage content easier to process, a webpage content unit may be formed by combining content characters with an identification tag. For example, an identification tag may include a start tag and an end tag; the start tag, the end tag, and the content characters they enclose together constitute one webpage content unit. For example, as shown in FIG. 2c, the character "有" together with its enclosing identification tag may constitute a webpage content unit, and so may "This is a group of forgotten people." together with its enclosing identification tag.

In one embodiment, the webpage content to be processed may be stored in a storage carrier with a storage function. For example, it may be stored in the memory of a local server or terminal, in which case it can be obtained directly from that memory when needed. As another example, it may be stored in the memory of a non-local server or terminal. In that case, a request is first made to the browser; the browser then fetches the webpage content to be processed from the non-local server or terminal and loads it, after which the content can be obtained through the browser.

In one embodiment, before the webpage content to be processed is acquired, the content may be preprocessed by character splitting, which improves the efficiency of its subsequent processing. Specifically, before the step "acquiring the webpage content to be processed of the webpage", the method further includes:

acquiring initial webpage content of the webpage, the initial webpage content including at least one webpage tag and the sub-page content corresponding to the webpage tag, wherein the sub-page content includes at least one content character;

splitting the content characters in the sub-page content to obtain an initial webpage content unit, the initial webpage content unit including at least one content character;

adding an identification tag to the content characters in the initial webpage content unit to obtain a webpage content unit.

The initial webpage content may be the webpage content obtained before the webpage content to be processed is preprocessed, that is, the webpage content as obtained, without any processing. For example, as shown in FIG. 2d, the initial webpage content may be webpage content whose content characters have no corresponding identification tags.

The initial webpage content may likewise take a variety of language formats. For example, the browser document may be an HTML document, an HTML document with an embedded scripting language such as JavaScript, an HTML document with embedded JavaScript and CSS, and so on.

In one embodiment, the initial webpage content may be stored in a storage carrier with a storage function. For example, it may be stored in the memory of a local server or terminal, in which case it can be obtained directly from that memory when needed. As another example, it may be stored in the memory of a non-local server or terminal. In that case, a request is first made to the browser; the browser then fetches the initial webpage content from the non-local server or terminal and loads it, after which the initial webpage content can be obtained through the browser.

In one embodiment, the initial webpage content may include at least one webpage tag and the sub-page content corresponding to the webpage tag, wherein the sub-page content includes at least one content character.

The sub-page content corresponding to a webpage tag may be a run of content characters that is marked only by the webpage tag and has no corresponding identification tag. For example, as shown in FIG. 2d, in the initial webpage content, the content "How are you?" marked by one webpage tag may be one piece of sub-page content, and the content "I'm fine." marked by another webpage tag may be another.

In one embodiment, after the sub-page content is acquired, the content characters in it are split to obtain the initial webpage content units.

Before the characters in the sub-page content are split, the number of content characters in the sub-page content may first be determined; a splitting rule is then chosen according to that number, and the content characters are split according to the rule. Specifically, the step "splitting the content characters in the sub-page content to obtain an initial webpage content unit" includes:

determining the number of content characters in the sub-page content;

determining a splitting rule according to the number of content characters in the sub-page content;

splitting the content characters in the sub-page content according to the splitting rule to obtain the initial webpage content unit.

For example, when the sub-page content contains only one content character, that character can be split off directly to obtain the initial webpage content unit.

For example, when the sub-page content contains multiple content characters, every content character may be split off individually to obtain the initial webpage content units. Alternatively, the content characters may be split in pairs, and so on.
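The splitting rules described above can be sketched in JavaScript as follows. This is only an illustration under stated assumptions: the function name `splitSubPageContent` and the `unitSize` parameter (1 for per-character splitting, 2 for pairwise splitting) are hypothetical and do not come from the application.

```javascript
// Split a piece of sub-page content into initial webpage content units.
// `unitSize` is a hypothetical knob: 1 splits off every character,
// 2 splits the characters in pairs.
function splitSubPageContent(text, unitSize = 1) {
  if (!Number.isInteger(unitSize) || unitSize < 1) {
    throw new RangeError("unitSize must be a positive integer");
  }
  // Array.from iterates by Unicode code point, so characters outside the
  // Basic Multilingual Plane are not cut in half as text.split("") could do.
  const chars = Array.from(text);
  const units = [];
  for (let i = 0; i < chars.length; i += unitSize) {
    units.push(chars.slice(i, i + unitSize).join(""));
  }
  return units;
}

console.log(splitSubPageContent("你好吗？"));    // one unit per character
console.log(splitSubPageContent("你好吗？", 2)); // units of two characters
```

When the content holds a single character, the loop runs once and returns that character as the only unit, which matches the single-character case above.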

In one embodiment, after the initial webpage content unit is obtained, an identification tag may be added to the content characters in it to obtain the webpage content unit.

Before an identification tag is added to the content characters in the initial webpage content unit, the number of content characters in the unit may first be determined; the identification tag is then chosen according to that number, and finally the identification tag is added to the initial webpage content unit. Specifically, the step "adding an identification tag to the content characters in the initial webpage content unit to obtain a webpage content unit" includes:

determining the number of content characters in the initial webpage content unit;

determining, according to the number of content characters in the initial webpage content unit, the identification tag corresponding to the content characters in the initial webpage content unit;

adding the identification tag to the initial webpage content unit to obtain the webpage content unit.

For example, when the initial webpage content unit contains one content character, the identification tag may be a tag recognized by the browser that makes it possible to identify the position information of that content character. Examples are the <…> tag, the <…> tag, the <…> tag, and the like.

For example, when the initial webpage content unit contains multiple content characters, the identification tag may be a tag recognized by the browser that makes it possible to identify the position information of each content character in the unit.
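As a rough sketch of the tagging step, the snippet below wraps each initial content unit in a start tag and an end tag. The helper names `escapeHtml` and `wrapUnits` are hypothetical, and the default tag name `span` is an assumption made for illustration, since the text above does not fix a particular tag.

```javascript
// Characters that must be escaped when content is placed inside HTML markup.
const HTML_ESCAPES = { "&": "&amp;", "<": "&lt;", ">": "&gt;", '"': "&quot;", "'": "&#39;" };

function escapeHtml(text) {
  return text.replace(/[&<>"']/g, (ch) => HTML_ESCAPES[ch]);
}

// Wrap every initial content unit in an identification tag.
// `tagName` defaults to "span" purely as an illustrative assumption.
function wrapUnits(units, tagName = "span") {
  return units
    .map((unit) => `<${tagName}>${escapeHtml(unit)}</${tagName}>`)
    .join("");
}

console.log(wrapUnits(["你", "好", "吗", "？"]));
```

Each start tag, end tag, and enclosed character then forms one webpage content unit whose position the browser can report individually.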

In one embodiment, the preprocessed initial webpage content may be loaded and rendered in a browser to obtain the webpage content to be processed.

102. Identify the position information of the content characters in the webpage according to the webpage tags corresponding to the webpage content and the identification tags corresponding to the content characters.

The position information may include information describing where a content character is located. For example, the position information of a content character in the webpage may include positioning information and size information. The positioning information may include the coordinates of the content character in the webpage, and the size information may include the length, width, and height of the content character. The position information may comprise multiple position parameters, each describing one aspect of the content character's position. For example, if the position information includes the width and height of the content character, the width corresponds to one position parameter and the height to another.

In one embodiment, the webpage may be treated as a rectangular coordinate system, so that the position information of a content character in the webpage can be expressed as coordinates on its axes. Any point in the webpage may be taken as the origin of the coordinate system.

For example, the upper left corner of the webpage may be taken as the origin of the coordinate system, as shown by 1021 in FIG. 2e. The position information of a content character in the webpage may be represented by a rectangle (Rect) object. The Rect object of each content character may include four position parameters, (x, y, width, height), where x is the abscissa of the upper left corner of the rectangle, that is, the abscissa of the content character in the webpage; y is the ordinate of the upper left corner of the rectangle, that is, the ordinate of the content character in the webpage; width is the width of the rectangle; and height is the height of the rectangle.

The position information of a content character is determined relative to the chosen coordinate system. For example, as shown in FIG. 2e, the webpage is treated as a rectangular coordinate system in which the x-axis ranges from 0 to 50000 and the y-axis ranges from 0 to 30000. In this case, the position information of the character "置" may be (100, 28000, 20, 35).

As another example, the center of the webpage may be taken as the origin of the coordinate system, as shown by 1022 in FIG. 2e. The x-axis may then range from -25000 to 25000 and the y-axis from -15000 to 15000. In this case, the position information of the character "位" may be (-24980, 14000, 20, 35), the position information of the character "置" may be (24980, -1400, 20, 35), and so on.

In one embodiment, the webpage tags corresponding to the webpage content and the identification tags corresponding to the content characters expose the content and position information of the content characters they mark, so the position information of a content character in the webpage can be identified through these tags. For example, the webpage content to be processed may be traversed, and the content of each content character and its position in the webpage determined from the webpage tags and identification tags encountered.

For example, in an HTML document, a Document Object Model (DOM) traversal may be performed on the webpage content to be processed. The traversal yields the webpage tags and identification tags in the content, together with the content and position information of the content characters that those tags mark.

For example, as shown in FIG. 2e, DOM traversal yields the content characters marked by the webpage tag and the identification tags, namely "位置" ("position"), together with the position information of "位置" in the webpage.

In one embodiment, the content of the content characters and their position information in the webpage obtained by the traversal may be stored in a preset storage space.

For example, in a browser document in which a scripting language such as JavaScript is embedded in HTML, an array may be created to store the content of the content characters obtained by the traversal and their position information in the webpage.
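A minimal sketch of such a traversal is given below. It assumes that, after preprocessing, every content character sits in a leaf element, and it relies only on the standard `children`, `textContent`, and `getBoundingClientRect()` members of DOM elements. The function name `collectCharRects` and the record layout are hypothetical. Note also that `getBoundingClientRect()` reports coordinates relative to the viewport, so a real page would need to add the scroll offsets to obtain page coordinates.

```javascript
// Depth-first traversal that stores the text and rectangle of every leaf
// element in an array. `root` is a DOM element, or any object exposing the
// same three members, which keeps the function testable outside a browser.
function collectCharRects(root, records = []) {
  const children = Array.from(root.children || []);
  if (children.length === 0) {
    // Leaf element: treated here as an identification tag around characters.
    const text = root.textContent || "";
    if (text.length > 0) {
      const { x, y, width, height } = root.getBoundingClientRect();
      records.push({ text, x, y, width, height });
    }
    return records;
  }
  for (const child of children) {
    collectCharRects(child, records);
  }
  return records;
}

// In a browser this might be called as collectCharRects(document.body).
```

The returned array plays the role of the preset storage space: one entry per content unit, holding its characters and the four position parameters.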

103. For each content character in the webpage content, determine the processing type of the content character according to its position information in the webpage.

The processing type indicates which kind of processing is applied to a content character. For example, the processing types may include a canvas drawing type and a tag processing type. When the processing type of a content character is the canvas drawing type, the content character is drawn on a preset canvas; when it is the tag processing type, the identification tag corresponding to the content character is added to a preset tag set in a preset order. In one embodiment, the processing type of each content character may be determined directly from its position information in the webpage.

For example, when all position parameters of a content character are positive, the content character is of the canvas drawing type; otherwise it is of the tag processing type. Suppose the webpage is treated as a rectangular coordinate system whose origin is the center of the webpage, as shown by 1022 in FIG. 2e, where the position information of "位" may be (-24980, 14000, 20, 35) and that of "置" may be (24980, -1400, 20, 35). Because the position parameters of both "位" and "置" include a negative number, both characters are of the tag processing type.

As another example, when one particular position parameter in the position information of a content character is positive, the content character is of the canvas drawing type; otherwise it is of the tag processing type. For example, the type may be determined from the parameter on the abscissa. In that case, "位" is of the tag processing type and "置" is of the canvas drawing type.

In one embodiment, when the position parameters of a content character include its width and its height, the two may be compared to determine the type of the content character. For example, when the width is greater than the height, the content character is of the canvas drawing type; when the width is less than or equal to the height, it is of the tag processing type; and so on.
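The three direct rules above can be written as small predicates. The sketch below is illustrative only: the type labels `"canvas-draw"` and `"tag-process"` and the function names are hypothetical, and the sample rectangles are the ones given for "位" and "置" with the origin at the center of the webpage.

```javascript
const CANVAS_DRAW = "canvas-draw"; // hypothetical label for the canvas drawing type
const TAG_PROCESS = "tag-process"; // hypothetical label for the tag processing type

// Rule 1: canvas drawing only if every position parameter is positive.
function classifyByAllPositive(rect) {
  const values = [rect.x, rect.y, rect.width, rect.height];
  return values.every((v) => v > 0) ? CANVAS_DRAW : TAG_PROCESS;
}

// Rule 2: look at a single parameter, here the abscissa.
function classifyByAbscissa(rect) {
  return rect.x > 0 ? CANVAS_DRAW : TAG_PROCESS;
}

// Rule 3: compare width with height.
function classifyByAspect(rect) {
  return rect.width > rect.height ? CANVAS_DRAW : TAG_PROCESS;
}

const wei = { x: -24980, y: 14000, width: 20, height: 35 }; // "位"
const zhi = { x: 24980, y: -1400, width: 20, height: 35 };  // "置"
console.log(classifyByAllPositive(wei), classifyByAllPositive(zhi));
console.log(classifyByAbscissa(wei), classifyByAbscissa(zhi));
```

With these inputs the first rule sends both characters to tag processing, while the abscissa rule sends "位" to tag processing and "置" to canvas drawing, as in the examples above.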

In one embodiment, the position information of each content character in the webpage may be compared with preset position information, and the processing type of the content character determined from the comparison result. Specifically, the step "for each content character in the webpage content, determining the processing type of the content character according to its position information in the webpage" includes:

for each content character in the webpage content, comparing the position information of the content character in the webpage with preset position information to obtain a comparison result;

determining the processing type of the content character according to the comparison result.

The position information may include several position parameters, and the preset position information may likewise include several preset position parameters. One or more position parameters in the position information are compared with the corresponding preset position parameters to obtain a comparison result, and the processing type of the content character is then determined from that result.

For example, when the webpage is treated as a rectangular coordinate system whose origin is the upper left corner of the webpage, the position information of a content character may be (x, y, width, height) and the preset position information may be (x1, y1). The processing type of the content character can then be determined by comparing x with x1, or by comparing y with y1. Here x1 may be a value within the range of the abscissa, and y1 a value within the range of the ordinate.

For example, the webpage is treated as a rectangular coordinate system in which the x-axis ranges from 0 to 50000 and the y-axis ranges from 0 to 30000. The value y1 in the preset position information may be set to 8000, and the ordinate y of each content character in the webpage coordinate system is then compared with 8000 to determine the processing type of the content character.

In one embodiment, the processing type may be determined from the comparison result as follows: when the position parameter in the position information is greater than or equal to the preset position parameter, the content character is of the canvas drawing type; when it is less than the preset position parameter, the content character is of the tag processing type.

For example, y1 in the preset position information may be set to 8000. When the y of a content character is greater than or equal to 8000, its processing type may be the canvas drawing type; when the y is less than 8000, its processing type may be the tag processing type.

In another embodiment, the assignment is reversed: when the position parameter in the position information is greater than or equal to the preset position parameter, the content character is of the tag processing type; when it is less than the preset position parameter, the content character is of the canvas drawing type.

For example, y1 in the preset position information may be set to 8000. When the y of a content character is greater than or equal to 8000, its processing type may be the tag processing type; when the y is less than 8000, its processing type may be the canvas drawing type.
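Both threshold variants can be captured by one function with a switch for the direction. This is a sketch under stated assumptions: the function name `classifyByThreshold`, its option names, and the type labels are hypothetical, and the comparison uses "greater than or equal to" as in the embodiments above.

```javascript
// Compare one position parameter with a preset value and return the type.
// `canvasAtOrAbove` selects which side of the threshold is drawn on canvas.
function classifyByThreshold(rect, options = {}) {
  const { axis = "y", threshold = 8000, canvasAtOrAbove = true } = options;
  const atOrAbove = rect[axis] >= threshold;
  return atOrAbove === canvasAtOrAbove ? "canvas-draw" : "tag-process";
}

const zhiRect = { x: 100, y: 28000, width: 20, height: 35 }; // "置", top-left origin
console.log(classifyByThreshold(zhiRect));
console.log(classifyByThreshold(zhiRect, { canvasAtOrAbove: false }));
```

Keeping the threshold and the direction as parameters means the split between canvas and tags can be changed without touching the traversal or drawing code.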

In one embodiment, the processing type of each content character may be determined immediately after its position information is compared with the preset position information, according to the current comparison result.

For example, the position information of the content characters stored in the array may be extracted one entry at a time and compared with the preset position information, and the processing type of each content character determined from the current comparison result.

In one embodiment, after the position information of every content character has been compared with the preset position information, the content characters may be grouped according to the comparison results, and the processing type then determined per group.

For example, the position information of the content characters stored in the array may be extracted and compared with the preset position information, and the content characters then divided into a first content character group and a second content character group according to the comparison results. The content characters in the first group may be of the canvas drawing type and those in the second group of the tag processing type, or the reverse.
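The grouping variant can be sketched as a single pass that partitions the stored records. The function name `groupByThreshold` and the property names `first` and `second` are hypothetical; which group is then drawn on the canvas and which is handled through tags is left to the caller, as in the text above.

```javascript
// Partition records into two groups by comparing one position parameter
// with a preset value. Records keep their original relative order.
function groupByThreshold(records, threshold, axis = "y") {
  const first = [];
  const second = [];
  for (const record of records) {
    (record[axis] >= threshold ? first : second).push(record);
  }
  return { first, second };
}

const stored = [
  { text: "位", x: 80, y: 28000, width: 20, height: 35 },
  { text: "置", x: 100, y: 28000, width: 20, height: 35 },
  { text: "有", x: 100, y: 500, width: 20, height: 35 },
];
const groups = groupByThreshold(stored, 8000);
console.log(groups.first.map((r) => r.text), groups.second.map((r) => r.text));
```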

104. When the processing type of the content character is the canvas drawing type, draw the content character on a preset canvas.

The canvas may be an image container in a browser document, in which the image information of the document can be stored. For example, in HTML a canvas can be created with the <canvas> tag. The <canvas> tag has several built-in attributes; setting them customizes the created canvas so that it fits the webpage. For example, the attributes of <canvas> include width and height, which set the width and the height of the canvas, respectively, so that a canvas of the desired size is generated.

In one embodiment, when the processing type of the content character is the canvas drawing type, a drawing function may be used to draw the content character on the preset canvas. Specifically, the step "when the processing type of the content character is the canvas drawing type, drawing the content character on a preset canvas" includes:

when the processing type of the content character is the canvas drawing type, determining a drawing function for drawing the content character;

drawing the content character at the corresponding position on the preset canvas according to the drawing function and the position information of the content character in the webpage.

The drawing function may be an operation function that draws content characters on the preset canvas. For example, in HTML the drawing function may be the fillText() function or the strokeText() function of the canvas drawing context, among others: fillText() draws filled content characters, and strokeText() draws outlined (stroked) content characters.

The drawing function takes several parameters, and setting them adjusts the style and position of the content characters drawn on the preset canvas. For example, fillText() may be called as fillText(content character, X, Y), where X specifies the position on the abscissa at which the content character is drawn and Y the position on the ordinate. Specifically, X may correspond to the position parameter x of the content character in the webpage and Y to the position parameter y, so that the call becomes fillText(content character, x, y). By setting the parameters of the drawing function in this way, the content character is drawn at the corresponding position on the preset canvas.

The preset canvas may be a canvas generated in advance. For example, in HTML a canvas can be created beforehand with the <canvas> tag, and the content characters then drawn on it.

In one embodiment, before the content characters are drawn on the preset canvas, the drawing environment of the preset canvas should be obtained, to avoid errors during drawing. Specifically, an environment acquisition function may be used to obtain the drawing environment. For example, in HTML, an environment acquisition function such as getContext() can be called on the canvas to obtain the environment used for drawing content characters on it.
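Putting the canvas steps together, a drawing routine might look like the sketch below. `getContext("2d")`, `fillText()`, `font`, and `textBaseline` are standard Canvas 2D API members; the function name `drawChars`, the record layout, and the default font are assumptions made for illustration.

```javascript
// Draw every record of the canvas drawing type at its page position.
// `ctx` is a CanvasRenderingContext2D (or any object with the same members).
function drawChars(ctx, records, font = "16px sans-serif") {
  ctx.font = font;
  // With the "top" baseline, (x, y) refers to the top-left corner of the
  // text, which matches the (x, y) of the Rect described above.
  ctx.textBaseline = "top";
  for (const { text, x, y } of records) {
    ctx.fillText(text, x, y);
  }
}

// Browser usage, assuming a <canvas> element already exists on the page:
//   const canvas = document.querySelector("canvas");
//   const ctx = canvas.getContext("2d"); // null if the context is unavailable
//   if (ctx) drawChars(ctx, canvasRecords);
```

Checking the return value of `getContext()` before drawing is one way to honor the requirement that the drawing environment be obtained first.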

105. When the processing type of the content character is the tag processing type, add the identification tag corresponding to the content character to the preset tag set according to a preset order.

In one embodiment, when the processing type of the content character is the tag processing type, a tag order setting function may be determined according to the preset order, and this function is then used to add the identification tags corresponding to the content characters to the preset tag set in the preset order. Specifically, the step "when the processing type of the content character is the tag processing type, adding the identification tag corresponding to the content character to the preset tag set according to a preset order" includes:

when the processing type of the content character is the tag processing type, determining a corresponding tag order setting function according to the preset order;

adding, by using the tag order setting function, the identification tag corresponding to the content character to the preset tag set in the preset order.

The preset order may be the order in which identification tags are added to the preset tag set. This order can take several forms: for example, the identification tags may be added to the preset tag set in a random order, or according to a regular pattern.

The tag order setting function may be a function determined according to the preset order.

For example, when the identification tags are added to the preset tag set in a random order, the tag order setting function may be a random function. In an HTML document with an embedded scripting language such as JavaScript, the tag order setting function may be a function such as Math.random().

For example, when the identification tags are added to the preset tag set according to a regular pattern, the tag order setting function may be a function that follows a preset rule.

The preset tag set may be a storage space capable of storing identification tags. For example, in an HTML document with an embedded scripting language such as JavaScript, the preset tag set may be an array, and the identification tags may be added to the array in the preset order.
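One way to realise a random preset order is sketched below. The text names only Math.random(); the Fisher-Yates shuffle, the function name addTagsInRandomOrder and the { text, index } tag records are assumptions made for illustration.

```javascript
// Hypothetical sketch: append identification tags to the preset tag set
// (an array) in random order, using a Fisher-Yates shuffle.
function addTagsInRandomOrder(tags, presetTagSet = []) {
  const shuffled = tags.slice(); // leave the caller's array untouched
  for (let i = shuffled.length - 1; i > 0; i--) {
    const j = Math.floor(Math.random() * (i + 1)); // 0 <= j <= i
    [shuffled[i], shuffled[j]] = [shuffled[j], shuffled[i]];
  }
  presetTagSet.push(...shuffled);
  return presetTagSet;
}

// Each record keeps its original index so the order can be recovered later.
const tags = ["你", "好", "吗", "？"].map((text, index) => ({ text, index }));
const tagSet = addTagsInRandomOrder(tags);
console.log(tagSet.map((t) => t.text).join(""));
```

Keeping the original index alongside each tag is a design choice of this sketch: it is what allows a later step to place the characters back in their original positions.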

106. When a preset content generation condition is met, generate and output the target webpage content of the webpage according to the current canvas and the current tag set.

The preset content generation condition may be the condition that the content to be processed, after undergoing the subsequent processing steps, is ready to be output to the webpage.

For example, the preset content generation condition may be that all content characters in the content to be processed have been processed. Here, processing means drawing the content character on the preset canvas when its processing type is the canvas drawing type, and adding its identification tag to the preset tag set in the preset order when its processing type is the tag processing type.

The current canvas may be the canvas carrying all content characters whose processing type is the canvas drawing type. The current tag set may be the tag set storing the identification tags of all content characters whose processing type is the tag processing type.

In one embodiment, the step "when a preset content generation condition is met, generating and outputting the target webpage content of the webpage according to the current canvas and the current tag set" includes:

when the preset content generation condition is met, inserting the current canvas at the corresponding position of the webpage according to the position information, in the webpage, of the content characters on the current canvas;

outputting the content characters corresponding to the identification tags to the corresponding positions of the webpage according to the order of the identification tags in the current tag set.

For example, the content characters on the current canvas are "我", "很", "好" and "。" (together "我很好。", "I am fine."), and their position information in the webpage covers ordinate values from 8975 to 8940 and abscissa values from 200 to 280, so the canvas is inserted into that range.
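The range quoted above can be reproduced with a short calculation. The sketch below assumes each character is recorded as (x, y, width, height), with (x, y) the upper-left corner of its rectangle and the origin at the lower-left corner of the page, and assumes 20-by-35 character rectangles as in the later worked example; the helper name coveringRange is hypothetical.

```javascript
// Hypothetical helper: smallest range covering all character rectangles.
// Convention assumed here: (x, y) is a rectangle's upper-left corner and the
// y axis points upward, so a rectangle spans y - height .. y vertically.
function coveringRange(rects) {
  return {
    xMin: Math.min(...rects.map((r) => r.x)),
    xMax: Math.max(...rects.map((r) => r.x + r.width)),
    yTop: Math.max(...rects.map((r) => r.y)),
    yBottom: Math.min(...rects.map((r) => r.y - r.height)),
  };
}

// Four characters side by side, starting at x = 200 on the line y = 8975.
const rects = [200, 220, 240, 260].map((x) => ({ x, y: 8975, width: 20, height: 35 }));
console.log(coveringRange(rects));
// { xMin: 200, xMax: 280, yTop: 8975, yBottom: 8940 }
```

The right edge is 260 + 20 = 280 and the bottom edge is 8975 - 35 = 8940, which matches the abscissa range 200 to 280 and the ordinate range 8975 to 8940.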

In one embodiment, when the content characters are output to the webpage in the order of the identification tags in the tag set, the order of the content characters on the webpage changes, which changes the content they express. For example, as shown in FIG. 2f, when the processing type is the tag processing type, the identification tags of the content characters are added to the preset tag set in the preset order, and the webpage is then generated directly from the identification tags in the tag set. The order of the content on the webpage then differs from the original, which alters the meaning of the original content.

Therefore, in one embodiment, a positioning function for outputting the content characters corresponding to the identification tags to the corresponding positions of the webpage may be determined according to the order of the identification tags in the current tag set, and the positioning function is then used to output those content characters to the corresponding positions of the webpage.

The positioning function for the original order may be a function that positions the identification tags added to the preset tag set back to their original positions according to their original order. For example, in an HTML document with embedded JavaScript and CSS, the positioning function may be the positioning mechanism of CSS, through which the position at which an identification tag should appear, relative to its original position, can be specified.
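The effect of such a positioning step can be illustrated with generated markup. The text says only that the positioning mechanism of CSS may be used; the choice of absolute positioning, the span tag, the pixel offsets and the function name renderPositionedTags are all assumptions of this sketch.

```javascript
// Hypothetical sketch: emit tags in their shuffled order, but give each one
// an absolute CSS position so the rendered text appears in its original place.
function renderPositionedTags(tagSet) {
  return tagSet
    .map(
      ({ text, left, top }) =>
        `<span style="position:absolute;left:${left}px;top:${top}px">${text}</span>`
    )
    .join("");
}

// Source order is scrambled ("吗你？好"); the offsets restore "你好吗？" visually.
const shuffledTags = [
  { text: "吗", left: 40, top: 0 },
  { text: "你", left: 0, top: 0 },
  { text: "？", left: 60, top: 0 },
  { text: "好", left: 20, top: 0 },
];
const markup = renderPositionedTags(shuffledTags);
console.log(markup);
```

A crawler reading the markup in document order sees the scrambled sequence, while a browser applying the styles shows the characters in their original order. Real content would also need HTML escaping before being interpolated into markup.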

In the webpage content processing method described above, after the webpage content to be processed is acquired, the position information of the content characters in the webpage can be identified according to the webpage tags corresponding to the webpage content units and the identification tags corresponding to the content characters. The processing type of each content character is then determined according to its position information in the webpage. When the processing type is the canvas drawing type, the content character is drawn on the preset canvas; when it is the tag processing type, the identification tag corresponding to the content character is added to the preset tag set in the preset order. Finally, the target webpage content of the webpage is generated and output according to the current canvas and the current tag set, which can effectively prevent crawlers from capturing the content of the webpage and thereby improves the security of the webpage content.

Based on the method described in the above embodiments, further details are given below by way of example.

In this embodiment, the method is described by taking as an example a webpage content processing apparatus integrated in a terminal.

As shown in FIG. 3a, the specific flow of a webpage content processing method is as follows:

201. The terminal acquires initial webpage content of a webpage, where the initial webpage content includes at least one webpage tag and sub-page content corresponding to the webpage tag, and the sub-page content includes at least one content character.

The initial webpage content may be the webpage content acquired before the preprocessing that yields the webpage content to be processed; that is, the initial webpage content may be webpage content as acquired, without any processing. For example, as shown in FIG. 3b, the initial webpage content may be webpage content that contains no identification tags.

A webpage tag may be a tag used to build the structure and layout of a webpage and to carry content. For example, in HTML, a webpage tag may be a tag composed of a block-level element. For example, as shown in the figure, the webpage tags may be the <body> tag, the <div> tag and other block-level tags.

Through the webpage tags, the browser can generate the corresponding webpage structure and layout and identify the content marked by each webpage tag. According to the position of a webpage tag in the webpage content to be processed, the browser renders the content marked by that tag to the corresponding position of the webpage, so that the content is presented on the webpage.

A content character may be any of various symbols. For example, content characters may be the characters of any national script, relational symbols that define relationships between characters, self-created symbols, and so on. Examples include Chinese characters, English letters, Arabic numerals, punctuation marks and Japanese characters. As shown in FIG. 3b, in an HTML document, "你" is a content character, and "？" is also a content character.

The sub-page content may be a segment of content characters marked only by webpage tags. For example, as shown in FIG. 3b, the sub-page content marked by the <body> tag and the <div> tag is "你好吗？" ("How are you?") and "我很好。" ("I am fine."), while the sub-page content marked by the first inner tag is "你好吗？" and that marked by the second inner tag is "我很好。".

In one embodiment, the initial webpage content may be stored in a storage medium. For example, it may be stored in the memory of a local server or terminal, in which case it can be read directly from there when needed. Alternatively, it may be stored in the memory of a non-local server or terminal. In that case, a request for the initial webpage content is first made to the browser; the browser fetches the initial webpage content from the non-local server or terminal and loads it, after which the initial webpage content can be obtained through the browser.

202. The terminal preprocesses the content characters in the sub-page content to obtain the webpage content to be processed of the webpage, where the webpage content to be processed includes at least one webpage tag and sub-page content corresponding to the webpage tag, and the sub-page content includes at least one content character.

An identification tag may be a tag that carries content. Through an identification tag, the browser can identify the content marked by the tag and the position information of that content, and render the content to the corresponding position of the webpage. For example, in HTML, an identification tag may be a tag composed of an inline element.

An identification tag may include a start tag and an end tag; the start tag, the end tag, and the content wrapped between them together constitute one webpage content unit.

For example, as shown in FIG. 3b, the character "你" is wrapped between the start tag and the end tag of an identification tag, so the start tag, "你" and the end tag together constitute one webpage content unit. Likewise, "好" together with the tags that wrap it constitutes another webpage content unit, and so on.

In one embodiment, the preprocessing may include the terminal splitting the content characters in the sub-page content to obtain initial webpage content units, each of which includes at least one content character. Identification tags are then added to the content characters in the initial webpage content units to obtain webpage content units. Finally, the webpage content to be processed is obtained from the webpage content units.

In one embodiment, before the characters in the sub-page content are split, the number of content characters in the sub-page content may first be identified and a splitting rule determined according to that number. The content characters of the sub-page content are then split according to the splitting rule to obtain the initial webpage content units.

For example, when the sub-page content contains multiple content characters, every content character may be split off individually to obtain the initial webpage content units. Alternatively, the content characters may be split off in pairs, and so on.

For example, as shown in FIG. 3b, the sub-page contents "你好吗？" and "我很好。" are each identified as containing four content characters. When splitting, the content characters can be separated one by one: "你好吗？" is split into "你", "好", "吗" and "？", and "我很好。" is split into "我", "很", "好" and "。". Each character obtained in this way constitutes one initial webpage content unit; for example, "你", "好", "吗" and "？" are each a complete initial webpage content unit.

In one embodiment, after the initial webpage content units are obtained, the number of content characters in each initial webpage content unit may be confirmed. The identification tag corresponding to the content characters in the unit is then determined according to that number. Finally, the identification tag is added to the content characters in the initial webpage content unit to obtain a webpage content unit.

For example, if the content to be tagged contains only the single content character "你", an identification tag can be added to "你" to obtain the corresponding webpage content unit, in which "你" is wrapped by the identification tag.
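The splitting and tagging described above can be sketched as follows. The function names splitIntoUnits and wrapUnits are hypothetical, and the tag name "span" is an assumption; the text only requires an inline identification tag.

```javascript
// Hypothetical sketch of the preprocessing step.
// Splits sub-page content into initial webpage content units of
// `charsPerUnit` characters each (one by one by default, or in pairs).
function splitIntoUnits(subPageContent, charsPerUnit = 1) {
  const chars = Array.from(subPageContent); // split by code point
  const units = [];
  for (let i = 0; i < chars.length; i += charsPerUnit) {
    units.push(chars.slice(i, i + charsPerUnit).join(""));
  }
  return units;
}

// Wraps each unit in a start tag and an end tag to form a webpage content unit.
// The tag name "span" is an assumed example of an inline tag.
function wrapUnits(units, tagName = "span") {
  return units.map((unit) => `<${tagName}>${unit}</${tagName}>`);
}

const units = splitIntoUnits("你好吗？"); // four one-character units
console.log(wrapUnits(units).join(""));
```

Calling splitIntoUnits("你好吗？", 2) instead yields the pairwise split mentioned in the text, with two units of two characters each.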

In one embodiment, to obtain the webpage content to be processed from the webpage content units, the terminal may load the webpage content that includes the webpage content units into the browser for rendering, thereby obtaining the webpage content to be processed.

203. The terminal identifies the position information of the content characters in the webpage according to the webpage tags corresponding to the webpage content and the identification tags corresponding to the content characters.

The position information of a content character in the webpage may be information indicating where the content character is located in the webpage.

In one embodiment, the webpage may be treated as a rectangular coordinate system whose origin is the lower-left corner of the webpage, so that the position information obtained by traversal can be expressed as coordinates. For example, as shown in FIG. 3b, the position information of a content character obtained by DOM traversal may be represented by a rectangle (Rect) object. The Rect object of each content character may include four parameters, (x, y, width, height): x is the abscissa of the upper-left corner of the rectangle, that is, the abscissa of the content character in the webpage; y is the ordinate of the upper-left corner, that is, the ordinate of the content character in the webpage; width is the width of the rectangle; and height is its height. The position information of a content character in the webpage can be obtained through its Rect object.

For example, the webpage is treated as a rectangular coordinate system in which the x axis ranges from 0 to 50000 and the y axis from 0 to 30000. The position information of the content characters "你", "好", "吗" and "？" may be (200, 9000, 20, 35), (220, 9000, 20, 35), (240, 9000, 20, 35) and (260, 9000, 20, 35), respectively, and that of "我", "很", "好" and "。" may be (200, 8975, 20, 35), (220, 8975, 20, 35), (240, 8975, 20, 35) and (260, 8975, 20, 35), respectively.

The content and position information of each content character obtained by traversal may be stored in a preset storage space.

For example, in an HTML document with an embedded scripting language such as JavaScript, an array may be created to store the content and the position information of each content character obtained by traversal.
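A traversal of this kind might look like the sketch below. It assumes the rectangle comes from the standard DOM method getBoundingClientRect(), which the text does not name, and the function name collectPositions and the record shape are hypothetical. Note that getBoundingClientRect() reports coordinates relative to the viewport, with the origin at the upper-left corner, so a conversion would be needed to match the lower-left origin described above.

```javascript
// Hypothetical sketch: record each content character and its rectangle.
// `elements` is any iterable of DOM elements, for example the result of
// document.querySelectorAll("span") in a browser.
function collectPositions(elements) {
  const records = [];
  for (const el of elements) {
    // getBoundingClientRect() returns a rectangle with x, y, width, height.
    const { x, y, width, height } = el.getBoundingClientRect();
    records.push({ text: el.textContent, x, y, width, height });
  }
  return records;
}
```

The returned array plays the role of the preset storage space: one entry per content character, holding both the character and its position.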

204. For each content character in the webpage content, the terminal determines the processing type of the content character according to its position information in the webpage.

In one embodiment, the position information of each content character in the webpage may be compared with preset position information, and the processing type of the content character is determined according to the comparison result.

For example, when the webpage is treated as a rectangular coordinate system whose origin is the lower-left corner of the webpage, the position information of a content character may be (x, y, width, height) and the preset position information may be (x1, y1). The processing type of the content character can then be determined by comparing x with x1, or by comparing y with y1. Here, x1 may be a value within the range of the abscissa, and y1 a value within the range of the ordinate.

In one embodiment, the terminal determines the processing type from the comparison result as follows: when the position parameter in the position information is less than the corresponding preset position parameter, the processing type of the content character is the canvas drawing type; when it is greater than or equal to the preset position parameter, the processing type is the tag processing type.

For example, as shown in FIG. 3b, when y1 in the preset position information is set to 8980, the y value of the content characters "你", "好", "吗" and "？" (9000) is greater than 8980, so their processing type is the tag processing type; the y value of "我", "很", "好" and "。" (8975) is less than 8980, so their processing type is the canvas drawing type.

The terminal may take the position information of the content characters out of the array one entry at a time, compare it with the preset position information, and determine the processing type of each content character directly from the comparison result.
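The comparison rule above can be sketched as a small function. The names classifyByY, CANVAS and TAG are hypothetical; the threshold 8980 and the position values are those of the example in the text.

```javascript
// Hypothetical labels for the two processing types.
const CANVAS = "canvas-drawing";
const TAG = "tag-processing";

// Compares each character's y with the preset y1:
// y < y1 gives the canvas drawing type, y >= y1 the tag processing type.
function classifyByY(records, y1) {
  return records.map((r) => ({ ...r, type: r.y < y1 ? CANVAS : TAG }));
}

const records = [
  { text: "你", x: 200, y: 9000, width: 20, height: 35 },
  { text: "我", x: 200, y: 8975, width: 20, height: 35 },
];
const classified = classifyByY(records, 8980);
console.log(classified.map((r) => `${r.text}: ${r.type}`));
```

With y1 = 8980, "你" (y = 9000) is assigned the tag processing type and "我" (y = 8975) the canvas drawing type, as in the example.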

205. When the processing type of the content character is the canvas drawing type, the terminal draws the content character on the preset canvas.

In one embodiment, when the processing type of the content character is the canvas drawing type, a drawing function may be used to draw the content character on the preset canvas. The drawing function may be an operation function that draws content characters on the preset canvas, for example the fillText() function.

Several setting parameters are built into the drawing function, and the style and position of the content characters drawn on the preset canvas can be adjusted by setting these parameters.

For example, the fillText() function may take the form fillText(content character, X, Y), where parameter X indicates the position on the horizontal axis of the webpage at which the content character is drawn, and parameter Y indicates the position on the vertical axis. Specifically, X may correspond to the position parameter x of the content character on the webpage, and Y may correspond to the position parameter y, so that the function may be set as fillText(content character, x, y). By setting the parameters of the drawing function, the content character can be drawn at the corresponding position on the preset canvas.

For example, when the content characters whose processing type is the canvas drawing type are "我", "很", "好" and "。", the terminal can use the fillText() function to draw them at the corresponding positions on the preset canvas. To draw "我", the function may be set as fillText("我", 200, 8970); to draw "很", it may be set as fillText("很", 220, 8970).

206. When the processing type of the content character is the tag processing type, the terminal adds the identification tag corresponding to the content character to the preset tag set according to the preset order.

In one embodiment, when the processing type of the content character is the tag processing type, an order setting function may be determined according to the preset order, and this function is then used to add the identification tags corresponding to the content characters to the preset tag set in the preset order.

The preset order may be the order in which identification tags are added to the preset tag set. For example, the terminal may add the identification tags to the preset tag set in a random order.

For example, when the identification tags are added to the preset tag set in a random order, the order setting function may be a random function. In an HTML document with an embedded scripting language such as JavaScript, the order setting function may be the Math.random() function.

For example, when the content characters whose processing type is the tag processing type are "你", "好", "吗" and "？", the terminal can use Math.random() to add their identification tags to the preset tag set in a random order.

The preset tag set may be a storage space capable of storing identification tags. For example, in an HTML document with an embedded scripting language such as JavaScript, the preset tag set may be an array.

207. When the preset content generation condition is met, the terminal generates and outputs the target webpage content of the webpage according to the current canvas and the current tag set.

The current canvas may be the canvas on which all content characters whose processing type is the canvas drawing type have been drawn. For example, the current canvas may be the canvas on which the content characters "我", "很", "好" and "。" have been drawn.

The current tag set may be the tag set storing the identification tags of all content characters whose processing type is the tag processing type. For example, the current tag set may be the tag set storing the identification tags corresponding to the content characters "你", "好", "吗" and "？".

In one embodiment, the current canvas may be inserted at the corresponding position of the webpage according to the position information, in the webpage, of the content characters on the current canvas.

In one embodiment, a positioning function for recovering the original order of the identification tags may be determined according to the preset order, and the terminal then uses the positioning function to output the content characters corresponding to the identification tags in the current tag set to the corresponding positions of the webpage.

The original order of the identification tags may be the order the identification tags had before they were added to the preset tag set in the preset order.

The positioning function for the original order may be a function that, according to the original order of the identification tags, positions the identification tags added to the preset tag set back to their original positions. For example, in an HTML document with embedded JavaScript and CSS, the positioning function may be the positioning mechanism of CSS, through which the position at which an identification tag should appear, relative to its original position, can be specified.

In the webpage content processing method described above, the terminal obtains the initial webpage content of the webpage, where the initial webpage content includes at least one webpage tag and the sub-webpage content corresponding to the webpage tag, and the sub-webpage content includes at least one content character. The terminal then processes the content characters in the sub-webpage content to obtain the to-be-processed webpage content of the webpage, where the to-be-processed webpage content includes webpage content units and the webpage tags corresponding to them, and each webpage content unit includes content characters and the identification tags corresponding to the content characters. The terminal determines the processing type of each content character according to the position information of the content character in the webpage. When the processing type of a content character is the canvas drawing type, the terminal draws the content character on the preset canvas; when the processing type is the tag processing type, the terminal adds the identification tag corresponding to the content character to the preset tag set in the preset order. Finally, when the preset content generation condition is met, the terminal generates and outputs the target webpage content of the webpage according to the current canvas and the current tag set. In this way, crawlers can be effectively prevented from capturing the content on the webpage, which improves the security of the webpage content.
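
The preprocessing step of this method (splitting the sub-webpage content into content characters and attaching an identification tag to each) might look like the following Python sketch. The fixed `unit_size`, the `c<index>` naming of identification tags and the `ContentUnit` type are assumptions made for illustration; the application does not prescribe them.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ContentUnit:
    webpage_tag: str      # enclosing webpage tag, e.g. "p"
    chars: List[str]      # content characters of this unit
    id_tags: List[str]    # one identification tag per content character


def build_units(webpage_tag: str, text: str, unit_size: int = 4) -> List[ContentUnit]:
    """Split text into units of at most unit_size characters and tag each character."""
    if unit_size < 1:
        raise ValueError("unit_size must be positive")
    units = []
    for start in range(0, len(text), unit_size):
        chunk = list(text[start:start + unit_size])
        # The number of characters in the unit determines how many tags are created.
        ids = [f"c{start + offset}" for offset in range(len(chunk))]
        units.append(ContentUnit(webpage_tag, chunk, ids))
    return units
```

Because each identification tag encodes the character's index in the sub-webpage content, the position of a character in the webpage can later be recovered from the webpage tag together with the identification tag.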

To better implement the webpage content processing method provided by the embodiments of the present application, an embodiment further provides a webpage content processing apparatus, which may be integrated in a terminal or a server. The terms used have the same meanings as in the webpage content processing method above, and for specific implementation details reference may be made to the description of the method embodiments.

In one embodiment, a webpage content processing apparatus is provided, which may be integrated in a terminal such as a webpage content processing terminal. As shown in FIG. 4a, the webpage content processing apparatus includes an obtaining unit 301, an identification unit 302, a determining unit 303, a drawing unit 304, an adding unit 305 and a generating unit 306, as follows:

The obtaining unit 301 is configured to obtain the to-be-processed webpage content of the webpage, the to-be-processed webpage content including at least one webpage content unit and a webpage tag corresponding to the webpage content unit, wherein the webpage content unit includes at least one content character and an identification tag corresponding to the content character;

The identification unit 302 is configured to identify the position information of the content character in the webpage according to the webpage tag corresponding to the webpage content unit and the identification tag corresponding to the content character;

The determining unit 303 is configured to, for each content character in the webpage content, determine the processing type of the content character according to the position information of the content character in the webpage;

The drawing unit 304 is configured to draw the content character on a preset canvas when the processing type of the content character is the canvas drawing type;

The adding unit 305 is configured to add the identification tag corresponding to the content character to a preset tag set in a preset order when the processing type of the content character is the tag processing type;

The generating unit 306 is configured to generate and output the target webpage content of the webpage according to the current canvas and the current tag set when a preset content generation condition is met.

In one embodiment, as shown in FIG. 4b, the determining unit 303 may include:

A comparison subunit 3031, configured to, for each content character in the webpage content, compare the position information of the content character in the webpage with preset position information to obtain a comparison result;

A determination subunit 3032, configured to determine the processing type of the content character according to the comparison result.
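
The comparison and determination performed by these two subunits can be illustrated with a short Python sketch. It assumes, purely for illustration, that the position information is a character index and that the preset position information is a set of indices whose characters are to be drawn on the canvas; `ProcessingType`, `classify` and `split_by_type` are hypothetical names.

```python
from enum import Enum


class ProcessingType(Enum):
    CANVAS = "canvas drawing"
    TAG = "tag processing"


def classify(position, preset_positions):
    """Compare a character's position with the preset positions (assumed: a set of indices)."""
    if position in preset_positions:
        return ProcessingType.CANVAS
    return ProcessingType.TAG


def split_by_type(chars, preset_positions):
    """Return (canvas_chars, tag_chars), each a list of (position, character) pairs."""
    canvas, tagged = [], []
    for position, ch in enumerate(chars):
        if classify(position, preset_positions) is ProcessingType.CANVAS:
            canvas.append((position, ch))
        else:
            tagged.append((position, ch))
    return canvas, tagged
```

Calling `split_by_type("我很好。你好吗？", {0, 1, 2, 3})` sends the first four characters to the canvas and the remaining four to the tag set, which matches the example of the current canvas and current tag set given in the method embodiment.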

In one embodiment, as shown in FIG. 4b, the drawing unit 304 may include:

A determination subunit 3041, configured to determine a drawing function for drawing the content character when the processing type of the content character is the canvas drawing type;

A drawing subunit 3042, configured to draw the content character at the corresponding position on the preset canvas according to the drawing function and the position information of the content character in the webpage.
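
As an illustration of the drawing subunit, the Python sketch below generates the browser-side drawing calls for a list of content characters. It assumes that `fillText` of the Canvas 2D API is the chosen drawing function, that the text occupies a single line, and that each character occupies a cell `cell_px` pixels wide; `canvas_script` and its parameters are hypothetical names, and the variable `ctx` is assumed to hold a 2D rendering context obtained in the page.

```python
import json


def canvas_script(draws, cell_px=16, canvas_var="ctx"):
    """Build JavaScript that draws each (position, character) pair on a canvas.

    draws: iterable of (position, character); position is a character index
           within a single line of text (an assumption of this sketch).
    """
    lines = [
        f'{canvas_var}.font = "{cell_px}px sans-serif";',
        f'{canvas_var}.textBaseline = "top";',
    ]
    for position, ch in draws:
        x = position * cell_px  # horizontal offset of the character's cell
        # json.dumps yields a safely quoted JavaScript string literal.
        literal = json.dumps(ch, ensure_ascii=False)
        lines.append(f"{canvas_var}.fillText({literal}, {x}, 0);")
    return "\n".join(lines)
```

Characters drawn this way exist in the delivered page only as pixels on the canvas, so they do not appear as text nodes in the markup that a crawler parses.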

In one embodiment, as shown in FIG. 4b, the adding unit 305 may include:

A determination subunit 3051, configured to determine the corresponding tag order setting function according to the preset order when the processing type of the content character is the tag processing type;

An adding subunit 3052, configured to use the tag order setting function to add the identification tag corresponding to the content character to the preset tag set in the preset order.
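
A minimal Python sketch of selecting a tag order setting function from the preset order is given below. The two example orders ("reverse" and a seeded "shuffle") and the names `ORDER_SETTERS` and `add_to_tag_set` are assumptions chosen for illustration; the application leaves the concrete preset order open.

```python
import random


def reverse_order(n):
    """Indices n-1 .. 0."""
    return list(range(n - 1, -1, -1))


def seeded_shuffle(n, seed=0):
    """A reproducible pseudo-random permutation of 0 .. n-1."""
    order = list(range(n))
    random.Random(seed).shuffle(order)
    return order


# Maps a preset order name to its tag order setting function (hypothetical registry).
ORDER_SETTERS = {"reverse": reverse_order, "shuffle": seeded_shuffle}


def add_to_tag_set(id_tags, preset_order):
    """Return the identification tags arranged in the preset order."""
    order = ORDER_SETTERS[preset_order](len(id_tags))
    return [id_tags[i] for i in order]
```

Because the order is a deterministic permutation, the generating side can later derive the original order from it when the content characters are output to the webpage.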

In one embodiment, as shown in FIG. 4b, the generating unit 306 may include:

An insertion subunit 3061, configured to, when the preset content generation condition is met, insert the current canvas at the corresponding position of the webpage according to the position information, in the webpage, of the content characters on the current canvas;

An output subunit 3062, configured to output the content characters corresponding to the identification tags to the corresponding positions of the webpage according to the order of the identification tags in the current tag set.

In specific implementations, each of the above units may be implemented as an independent entity, or the units may be combined arbitrarily and implemented as one or several entities. For the specific implementation of each unit, reference may be made to the foregoing method embodiments, which are not repeated here.

The webpage content processing apparatus described above can effectively prevent crawlers from capturing the content on the webpage.

An embodiment of the present application further provides a computer device, which may be a terminal or a server. For example, the computer device may serve as a webpage content processing terminal, such as a mobile phone or a tablet computer; as another example, the computer device may be a server, such as a webpage content processing server. As in FIG. 5, the to-be-processed webpage content of the webpage is obtained, the to-be-processed webpage content including at least one webpage content unit and a webpage tag corresponding to the webpage content unit, wherein the webpage content unit includes at least one content character and an identification tag corresponding to the content character;

FIG. 5 shows a schematic structural diagram of the terminal involved in an embodiment of the present application. Specifically:

The computer device may include components such as a processor 401 with one or more processing cores, a memory 402 comprising one or more computer-readable storage media, a power supply 403 and an input unit 404. Those skilled in the art will understand that the computer device structure shown in FIG. 5 does not limit the computer device, which may include more or fewer components than shown, combine certain components, or use a different arrangement of components. Wherein:

The processor 401 is the control center of the computer device. It connects the various parts of the entire computer device through various interfaces and lines, and performs the various functions of the computer device and processes data by running or executing the software programs and/or modules stored in the memory 402 and invoking the data stored in the memory 402, thereby monitoring the computer device as a whole. Optionally, the processor 401 may include one or more processing cores. Preferably, the processor 401 may integrate an application processor and a modem processor, where the application processor mainly handles the operating system, user interface, application programs and the like, and the modem processor mainly handles wireless communication. It can be understood that the modem processor may also not be integrated into the processor 401.

The memory 402 may be used to store software programs and modules, and the processor 401 performs various functional applications and data processing by running the software programs and modules stored in the memory 402. The memory 402 may mainly include a program storage area and a data storage area, where the program storage area may store the operating system, the application programs required for at least one function (such as a sound playback function or an image playback function) and the like, and the data storage area may store data created through the use of the computer device and the like. In addition, the memory 402 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other volatile solid-state storage device. Accordingly, the memory 402 may further include a memory controller to provide the processor 401 with access to the memory 402.

The computer device further includes a power supply 403 that supplies power to the various components. Preferably, the power supply 403 may be logically connected to the processor 401 through a power management system, so that functions such as charging, discharging and power consumption management are handled through the power management system. The power supply 403 may also include any components such as one or more DC or AC power sources, a recharging system, a power failure detection circuit, a power converter or inverter, and a power status indicator.

The computer device may further include an input unit 404, which may be used to receive input numeric or character information and to generate keyboard, mouse, joystick, optical or trackball signal input related to user settings and function control.

Although not shown, the computer device may further include a display unit and the like, which are not described here. Specifically, in this embodiment, the processor 401 in the computer device loads the executable files corresponding to the processes of one or more application programs into the memory 402 according to the following instructions, and the processor 401 runs the application programs stored in the memory 402 to implement various functions, as follows:

As can be seen from the above, the terminal provided by the embodiments of the present application can effectively prevent crawlers from capturing the content on the webpage, thereby improving the security of the webpage content.

For the specific implementation of each of the above operations, reference may be made to the foregoing embodiments, which are not repeated here.

The storage medium may include a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disc, or the like.

Because the instructions stored in the storage medium can execute the steps of any webpage content processing method provided by the embodiments of the present application, they can achieve the beneficial effects achievable by any of those methods. For details, see the foregoing embodiments, which are not repeated here.

In addition, an embodiment of the present application provides a computer program product or computer program that includes computer instructions stored in a computer-readable storage medium. The processor of a computer device reads the computer instructions from the computer-readable storage medium and executes them, so that the computer device performs the methods provided in the various optional implementations of the above aspect.

For example, the computer program may perform the following steps:

The webpage content processing method, apparatus, electronic device and storage medium provided by the embodiments of the present application have been described in detail above. Specific examples are used herein to explain the principles and implementations of the present application, and the description of the above embodiments is intended only to help in understanding the method of the present application and its core idea. Meanwhile, those skilled in the art may, in accordance with the idea of the present application, make changes to the specific implementations and the scope of application. In summary, the content of this specification should not be construed as limiting the present application.

Claims

1. A webpage content processing method, characterized by comprising:

obtaining to-be-processed webpage content of a webpage, the to-be-processed webpage content including at least one webpage content unit and a webpage tag corresponding to the webpage content unit, wherein the webpage content unit includes at least one content character and an identification tag corresponding to the content character;

identifying position information of the content character in the webpage according to the webpage tag corresponding to the webpage content unit and the identification tag corresponding to the content character;

for each content character in the webpage content, determining a processing type of the content character according to the position information of the content character in the webpage;

when the processing type of the content character is a canvas drawing type, drawing the content character on a preset canvas;

when the processing type of the content character is a tag processing type, adding the identification tag corresponding to the content character to a preset tag set in a preset order;

when a preset content generation condition is met, generating and outputting target webpage content of the webpage according to the current canvas and the current tag set.

2. The webpage content processing method according to claim 1, wherein the determining, for each content character in the webpage content, a processing type of the content character according to the position information of the content character in the webpage comprises:

for each content character in the webpage content, comparing the position information of the content character in the webpage with preset position information to obtain a comparison result;

determining the processing type of the content character according to the comparison result.

3. The webpage content processing method according to claim 1, wherein the drawing the content character on a preset canvas when the processing type of the content character is a canvas drawing type comprises:

when the processing type of the content character is the canvas drawing type, determining a drawing function for drawing the content character;

drawing the content character at the corresponding position on the preset canvas according to the drawing function and the position information of the content character in the webpage.

4. The webpage content processing method according to claim 1, wherein the adding the identification tag corresponding to the content character to a preset tag set in a preset order when the processing type of the content character is a tag processing type comprises:

when the processing type of the content character is the tag processing type, determining a corresponding tag order setting function according to the preset order;

adding the identification tag corresponding to the content character to the preset tag set in the preset order by using the tag order setting function.

5. The webpage content processing method according to claim 1, wherein before the obtaining to-be-processed webpage content of a webpage, the method further comprises:

obtaining initial webpage content of the webpage, the initial webpage content including at least one webpage tag and sub-webpage content corresponding to the webpage tag, wherein the sub-webpage content includes at least one content character;

splitting the content characters in the sub-webpage content to obtain an initial webpage content unit, the initial webpage content unit including at least one content character;

adding an identification tag to the content characters in the initial webpage content unit to obtain the webpage content unit.

6. The webpage content processing method according to claim 5, wherein the adding an identification tag to the content characters in the initial webpage content unit to obtain the webpage content unit comprises:

determining the number of content characters in the initial webpage content unit;

determining the identification tags corresponding to the content characters in the initial webpage content unit according to the number of content characters in the initial webpage content unit;

adding the determined identification tags to the content characters in the initial webpage content unit to obtain the webpage content unit.

7. The webpage content processing method according to claim 1, wherein the generating and outputting target webpage content of the webpage according to the current canvas and the current tag set when a preset content generation condition is met comprises:

when the preset content generation condition is met, inserting the current canvas at the corresponding position of the webpage according to the position information, in the webpage, of the content characters on the current canvas;

outputting the content characters corresponding to the identification tags to the corresponding positions of the webpage according to the order of the identification tags in the current tag set.

8. A webpage content processing apparatus, characterized by comprising:

an obtaining unit, configured to obtain to-be-processed webpage content of a webpage, the to-be-processed webpage content including at least one webpage content unit and a webpage tag corresponding to the webpage content unit, wherein the webpage content unit includes at least one content character and an identification tag corresponding to the content character;

an identification unit, configured to identify position information of the content character in the webpage according to the webpage tag corresponding to the webpage content unit and the identification tag corresponding to the content character;

a determining unit, configured to, for each content character in the webpage content, determine a processing type of the content character according to the position information of the content character in the webpage;

a drawing unit, configured to draw the content character on a preset canvas when the processing type of the content character is a canvas drawing type;

an adding unit, configured to add the identification tag corresponding to the content character to a preset tag set in a preset order when the processing type of the content character is a tag processing type;

a generating unit, configured to generate and output target webpage content of the webpage according to the current canvas and the current tag set when a preset content generation condition is met.

9. An electronic device, characterized by comprising a memory and a processor, wherein the memory stores an application program, and the processor is configured to run the application program in the memory to perform the operations in the webpage content processing method according to any one of claims 1 to 7.

10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a plurality of instructions, and the instructions are adapted to be loaded by a processor to perform the steps in the webpage content processing method according to any one of claims 1 to 7.