
CN111651656B - Method and system for dynamic webpage crawler based on agent mode - Google Patents

Method and system for dynamic webpage crawler based on agent mode

Info

Publication number
CN111651656B
CN111651656B
Authority
CN
China
Prior art keywords
data
service
crawling
crawler
url
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010488720.2A
Other languages
Chinese (zh)
Other versions
CN111651656A (en)
Inventor
杨杰
程克非
吴渝
李红波
叶雯静
刘钟书
刘洋旗
Current Assignee
Chongqing University of Post and Telecommunications
Original Assignee
Chongqing University of Post and Telecommunications
Priority date
Filing date
Publication date
Application filed by Chongqing University of Post and Telecommunications
Priority to CN202010488720.2A
Publication of CN111651656A
Application granted
Publication of CN111651656B
Legal status: Active
Anticipated expiration


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/90 Details of database functions independent of the retrieved data types
    • G06F 16/95 Retrieval from the web
    • G06F 16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/90 Details of database functions independent of the retrieved data types
    • G06F 16/95 Retrieval from the web
    • G06F 16/955 Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
    • G06F 16/9566 URL specific, e.g. using aliases, detecting broken or misspelled links
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02P CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
    • Y02P 90/00 Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
    • Y02P 90/30 Computing systems specially adapted for manufacturing

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The invention discloses a method and system for crawling dynamic web pages based on a foundry (OEM) mode. The method comprises the following steps: receive the business information, configure the crawler parameters, evaluate the business, and complete the preparatory work; allocate system resources and launch several business crawlers as independent processes; crawl the original URLs of the dynamic web pages in simulated-browser mode and return the URLs of the target static data content; check the validity and non-duplication of the URLs, construct a production task message list from the crawl tasks that pass the check, and launch several multi-threaded production crawlers; crawl the static URL pages in automated-program mode and return the target data and attachment files; process and store the returned content; and export the data. The invention builds business crawlers and production crawlers separately and, based on the foundry mode, applies different crawling strategies to dynamic pages and static content, making maximal use of system resources and achieving large-scale, fast crawling of dynamic webpage data.

Description

A Method and System for Dynamic Webpage Crawling Based on a Foundry (OEM) Mode

Technical Field

The present invention relates to the technical fields of Internet information retrieval and search engines, and in particular to a dynamic webpage crawling method and system based on a foundry (OEM) mode.

Background

Web crawlers are an important component of Internet search engines. They are mainly used to fetch data from web pages on the Internet and build indexes for search engines. Whether the crawl volume is large enough determines how rich a search engine's content is, and whether crawling is timely directly affects the search engine's overall quality. In the era of big data, web crawlers are also widely used to collect network data on public opinion, commodity transactions, culture, sports, and entertainment, providing massive amounts of raw data for further data mining and analysis.

A general-purpose web crawler works by visiting the URL of a target page to obtain its HTML, parsing the DOM nodes in the HTML, extracting the target data or URL links to further data, saving them to a database, and then continuing to crawl more pages following a depth-first or breadth-first strategy. Because crawlers inevitably place some load on target sites, and for reasons such as data protection, many websites adopt anti-crawler measures. In addition, for business reasons some websites do not display all of their information when a user first opens a page; instead, the content is loaded dynamically via Ajax only after manual operations such as clicking a button or dragging a scroll bar. For crawling such dynamic pages, the prior art simulates a browser (e.g., with Selenium or PhantomJS) and uses a program to reproduce mouse and keyboard actions wherever manual operation is required, so as to trigger the page to dynamically load new content. The biggest drawbacks of this approach are low efficiency and simplistic task scheduling, which cannot meet the needs of large-scale crawling tasks.
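The parse-and-extract step of the general crawler loop described above can be sketched with only the Python standard library; the HTML snippet and URLs below are hypothetical stand-ins for a fetched page.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collects the href of every <a> tag, resolved against a base URL."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(urljoin(self.base_url, value))

# In a real crawler the HTML would come from an HTTP fetch of the page URL;
# here a hardcoded snippet stands in for the response body.
html = ('<html><body><a href="/detail?id=1">item</a>'
        '<a href="https://example.com/p2">p2</a></body></html>')
parser = LinkExtractor("https://example.com/list")
parser.feed(html)
print(parser.links)  # ['https://example.com/detail?id=1', 'https://example.com/p2']
```

The extracted URLs would then be queued for the next crawl round under a depth-first or breadth-first policy.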

Summary of the Invention

The technical problem to be solved by the present invention is that existing crawling methods are inefficient, use simplistic task scheduling, and cannot meet the needs of large-scale crawling tasks. The purpose of the invention is to provide a dynamic webpage crawling method and system based on a foundry (OEM) mode that completes large-scale crawling tasks simply and efficiently.

The present invention is realized through the following technical solution:

A dynamic webpage crawling method based on a foundry (OEM) mode comprises the following steps. S1: receive the business information input by a user, configure the crawler business parameters, perform a business evaluation, and complete the preparatory work. S2: according to the business information, allocate system resources at the specified time and launch multiple business crawlers as independent processes. S3: the business crawlers crawl the original URLs of the dynamic web pages in simulated-browser mode and return the URLs of the target static data content. S4: check the validity and non-duplication of these URLs, construct a production task message list from the crawl tasks that pass the check, and launch multiple multi-threaded production crawlers on distributed servers. S5: the production crawlers crawl the URL pages containing static content in automated-program mode and return the target data fields and attachment files. S6: preprocess the target data fields; the preprocessed fields and the attachment files form the business data, which is then stored. S7: export the business data and return it to the user.

By adopting different crawling strategies for dynamic pages and static content, the invention plays to the strengths of each and makes maximal use of system resources. Based on the foundry mode, the method is organized by business function and can be adjusted and configured for the specifics of each business; the steps are scheduled in an orderly, cooperative fashion, giving a high degree of flexibility and enabling large-scale, fast crawling of dynamic web pages. The method uses business crawlers that simulate browser operations, run as independent processes, and operate synchronously on the dynamic pages; once the URLs of the static target content have been obtained, they are handed over to the more efficient production crawlers, which run as automated programs in multi-threaded, asynchronous mode and crawl in parallel. If the business evaluation shows that the data volume is huge and the production crawlers are highly parallel, storage is sharded across databases and tables to improve access efficiency.
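The handoff from the synchronous business crawler to the parallel production crawlers can be sketched as follows, assuming hypothetical stand-ins for both crawlers (the real system would drive a simulated browser and real HTTP fetches):

```python
from concurrent.futures import ThreadPoolExecutor

def discover_static_urls():
    # Stand-in for a business crawler: in the real system a simulated
    # browser walks the dynamic page and yields static-content URLs.
    return [f"https://example.com/item/{i}" for i in range(8)]

def fetch_static(url):
    # Stand-in for one production-crawler fetch of a static page.
    return {"url": url, "data": f"payload of {url}"}

urls = discover_static_urls()                      # synchronous, single process
with ThreadPoolExecutor(max_workers=4) as pool:    # parallel production crawl
    records = list(pool.map(fetch_static, urls))
print(len(records))  # 8
```

The design point is the asymmetry: URL discovery is the slow, browser-bound step and stays sequential, while the cheap static fetches are fanned out across worker threads.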

Further, the business information includes the business ID, a business description, the starting URL, login credentials, the execution strategy, counter-anti-crawling measures, the data fields to crawl and their locators, and the export data format.

Further, the login credentials include an account, a password, and a CA certificate; the execution strategy is either one-time crawling or timed incremental crawling; and the counter-anti-crawling measures include IP proxies, browser headers, and CSS offsets. If the target URL is protected by anti-crawler measures, the production crawlers apply the counter-anti-crawling measures when they are started.
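Two of the named counter-measures, rotating IP proxies and browser headers, can be sketched as below; the proxy addresses and user-agent strings are illustrative placeholders, not values from the patent.

```python
import random

# Hypothetical pools; a real deployment would load these from the
# business parameters (counter-anti-crawling measures).
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Mozilla/5.0 (X11; Linux x86_64)",
]
PROXIES = ["203.0.113.10:8080", "203.0.113.11:8080"]

def build_request_profile(rng=random):
    """Pick a browser header set and a proxy for one outgoing request."""
    return {
        "headers": {"User-Agent": rng.choice(USER_AGENTS),
                    "Accept-Language": "zh-CN,zh;q=0.9"},
        "proxy": rng.choice(PROXIES),
    }

profile = build_request_profile()
print(profile["proxy"] in PROXIES)  # True
```

Each production-crawler request would draw a fresh profile so that no single IP or header fingerprint dominates the traffic to the target site.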

Further, the preparatory work in step S1 includes the following. Business-parameter completeness check: ensure the business information is complete, i.e., that the business-related information entered by the user is intact so that the system can execute the crawl task according to those parameters. Access-pass preparation: according to the business information, log in to the target website, obtain and save the cookies, and verify the CA certificate, ensuring that the account logs in successfully. Small-task trial: using the starting URL and the prepared access pass, construct a small crawl task and, from its execution, analyze the target website's character encoding, data volume, access latency, and crawl success rate, providing resource-planning reference information for the formal crawl task to follow. Data-field design and normalization requirements: according to the business information, design the field names, field formats, field lengths, and normalization requirements of the target data. Resource creation: create the database, data tables, and attachment storage space, i.e., the persistent-storage resources for this business.
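The small-task trial reduces to summarizing a handful of probe fetches; a minimal sketch, where the result tuples are hypothetical trial data rather than real fetches:

```python
from statistics import mean

def trial_report(results):
    """Summarize a small trial crawl. `results` is a list of
    (succeeded, elapsed_seconds, bytes_fetched) tuples."""
    ok = [r for r in results if r[0]]
    return {
        "success_rate": len(ok) / len(results),
        "avg_latency_s": mean(r[1] for r in ok) if ok else None,
        "avg_bytes": mean(r[2] for r in ok) if ok else None,
    }

# Hypothetical outcome of a ten-page trial run: nine successes, one timeout.
sample = [(True, 0.8, 42_000)] * 9 + [(False, 5.0, 0)]
report = trial_report(sample)
print(report["success_rate"])  # 0.9
```

These numbers are exactly the resource-planning inputs named above: success rate and latency drive how many processes and how much time the formal crawl is allotted, and bytes-per-page drives the storage estimate.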

Further, the business evaluation in step S1 covers the server hardware, software, storage, and network bandwidth the business requires. Based on the current operating state and resources of the system, it judges whether the business can be undertaken immediately; if it can, business preparation begins, and if it cannot, the result is fed back to the user and execution is deferred.

Further, if the data volume is huge and the production crawlers are highly parallel, storage is sharded across databases and tables to improve access efficiency.

Further, in step S3 the simulated-browser mode includes simulating browser mouse clicks, scroll-bar dragging, keyboard input, and copy-and-paste.

Further, step S4 comprises the following sub-steps. S01: check the validity of each URL against legality rules. S02: map each URL to a key of a HashMap with a hash algorithm and use a Bloom filter to check whether the URL is a duplicate; duplicates are discarded and logged.
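The Bloom-filter deduplication of sub-step S02 can be sketched with only the standard library; the bit-array size, the number of hash functions, and the use of salted SHA-256 digests are implementation choices for illustration, not details fixed by the patent.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter over a fixed-size bit array, with k hash
    positions derived from salted SHA-256 digests of the URL."""
    def __init__(self, size_bits=8192, num_hashes=4):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, url):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{url}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, url):
        for pos in self._positions(url):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, url):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(url))

seen = BloomFilter()
url = "https://example.com/detail?id=1"
print(seen.might_contain(url))  # False: not crawled yet, pass it on
seen.add(url)
print(seen.might_contain(url))  # True: treat as duplicate, discard and log
```

A Bloom filter can report false positives but never false negatives, so a "duplicate" verdict occasionally drops a fresh URL, while a "new" verdict is always safe; sizing the bit array for the expected URL volume keeps the false-positive rate acceptable.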

Further, in step S6 the target data fields are preprocessed as follows. Data-integrity check: verify that the fields to be crawled are complete and that the data content meets the requirements; data that does not can be corrected, flagged, or discarded according to severity. Field normalization: verify that each field's data format and data type match the specification, and convert individually any data that does not. Attachment transfer: after virus-scanning each crawled attachment file, assign it a new, hard-to-duplicate file name containing a timestamp, move it to a fixed storage location, and add a data field recording the attachment's path and file name.
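The attachment-transfer step can be sketched as below; the naming scheme (timestamp prefix plus a counter on collision) is one plausible way to make names hard to duplicate, and virus scanning is assumed to have happened before the call.

```python
import shutil
import time
from pathlib import Path
from tempfile import TemporaryDirectory

def transfer_attachment(src: Path, store_dir: Path) -> str:
    """Rename a (virus-scanned) attachment with a timestamp, move it to
    the fixed storage location, and return the stored path, which the
    caller records in the new data field."""
    stamp = time.strftime("%Y%m%d%H%M%S")
    dest = store_dir / f"{stamp}_{src.name}"
    n = 0
    while dest.exists():                      # keep the name non-repeating
        n += 1
        dest = store_dir / f"{stamp}_{n}_{src.name}"
    shutil.move(str(src), str(dest))
    return str(dest)

with TemporaryDirectory() as tmp:             # throwaway dirs for the demo
    store = Path(tmp) / "attachments"; store.mkdir()
    src = Path(tmp) / "report.pdf"; src.write_bytes(b"%PDF- dummy")
    stored = transfer_attachment(src, store)
    moved_ok = Path(stored).exists() and not src.exists()
    print(Path(stored).name)  # e.g. 20200601120000_report.pdf
```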

In another implementation of the present invention, a dynamic webpage crawler system based on a foundry (OEM) mode comprises the following. Business interface module: as the business-facing user interface, it receives the business information input by the user, configures the crawler business parameters, performs the business evaluation, and completes the preparatory work; the business information includes the business ID, business description, starting URL, login credentials, execution strategy, counter-anti-crawling measures, the data fields to crawl and their locators, and the final export data format; the preparatory work includes the business-parameter completeness check, access-pass preparation, small-task trial, data-field design and normalization requirements, and creation of the database, data tables, and attachment storage space; the business evaluation covers the server hardware, software, storage, and network bandwidth the business requires. Business scheduling module: according to the business information, it allocates system resources at the specified time and launches multiple business crawlers as independent processes. Business crawlers: they crawl the original URLs of the dynamic web pages in simulated-browser mode and return the URLs of the target static data content. Production scheduling module: it receives the crawl tasks returned by the business crawlers, checks the validity and non-duplication of the URLs, constructs a production task message list from the tasks that pass the check, and launches multiple multi-threaded production crawlers on the distributed cluster servers. Production crawlers: they crawl the URL pages containing static content in automated-program mode and return the target data and attachment files. Storage module: it receives the target data and attachment files returned by the production crawlers, performs integrity and normalization processing on the target data, and stores the processed data and attachment files in the database to form the business data. Export module: it exports the business data in the final export data format and returns it to the user.

Crawl tasks are distributed evenly across the distributed servers via load balancing, and the production crawlers are started in multi-threaded, asynchronous mode to crawl the target data; each production crawler accesses its target URL as an automated program, extracts the target data according to the locators, sends it to the storage module, and reports task progress back to the production scheduling module. The crawling of the dynamic web pages is performed by the business crawlers, launched by the business scheduling module as several independent processes in simulated-browser mode according to the business requirements and related information. Based on the foundry mode, the system applies simulated browser operation, independent processes, and synchronous mode to the dynamic pages; once the URLs of the static target content have been obtained, they are handed over to the more efficient production crawlers, which run as automated programs in multi-threaded, asynchronous mode and crawl in parallel. If the business evaluation shows that the data volume is huge and the production crawlers are highly parallel, storage is sharded across databases and tables to improve access efficiency. The export module exports the crawled data from the database and attachment files in the format and form specified in the business parameters, generates statistical descriptions of the data, and delivers the package to the user.

Compared with the prior art, the present invention has the following advantages and beneficial effects:

The invention adopts different crawling strategies for dynamic pages and static content, playing to the strengths of each and making maximal use of system resources. Based on the foundry mode, the system is divided into seven modules by business function; each module can be configured and adapted internally to the specifics of the business, and the modules are scheduled in an orderly, cooperative fashion, giving a high degree of flexibility and enabling large-scale, fast crawling of dynamic web pages.

Brief Description of the Drawings

The drawings described here are provided for a further understanding of the embodiments of the invention and constitute a part of this application; they do not limit the embodiments of the invention. In the drawings:

Fig. 1 is a structural diagram of the system of the present invention;

Fig. 2 is an architecture diagram of the method of the present invention;

Fig. 3 is a functional diagram of the business interface module of the present invention;

Fig. 4 shows the correspondence between business crawlers and production crawlers and the contents of a task order in the present invention.

Detailed Description

To make the purpose, technical solution, and advantages of the present invention clearer, the invention is described in further detail below in conjunction with the embodiments and drawings. The illustrative embodiments and their descriptions are used only to explain the invention and do not limit it.

In the following description, numerous specific details are set forth to provide a thorough understanding of the invention. It will be apparent to those of ordinary skill in the art, however, that the invention can be practiced without these specific details. In other instances, well-known structures, circuits, materials, or methods are not described in detail to avoid obscuring the invention.

Throughout this specification, a reference to "one embodiment", "an embodiment", "one example", or "an example" means that a particular feature, structure, or characteristic described in connection with that embodiment or example is included in at least one embodiment of the invention. Thus, appearances of these phrases in various places throughout the specification do not necessarily all refer to the same embodiment or example. Furthermore, the particular features, structures, or characteristics may be combined in any suitable combination and/or sub-combination in one or more embodiments or examples. Those of ordinary skill in the art will also appreciate that the drawings provided here are for illustration and are not necessarily drawn to scale. The term "and/or" as used here includes any and all combinations of one or more of the associated listed items.

Embodiment

This embodiment is a dynamic webpage crawler system based on a foundry (OEM) mode. As shown in Fig. 1, the system is divided into seven parts: a business interface module, a business scheduling module, business crawlers, a production scheduling module, production crawlers, a storage module, and an export module. The main functions of each part are as follows:

(1) Business interface module: as the business-facing user interface, it receives user input, configures the crawler business parameters, performs the business evaluation, and completes the preparatory work;

(2) Business scheduling module: according to the business information, it allocates system resources at the specified time and launches several business crawlers as independent processes;

(3) Business crawlers: they crawl the original URLs of the dynamic web pages in simulated-browser mode and return the URLs of the target static data content;

(4) Production scheduling module: it receives the crawl tasks returned by the business crawlers, checks the validity and non-duplication of the URLs, constructs a production task message list from the tasks that pass the check, and launches several multi-threaded production crawlers on the distributed cluster servers;

(5) Production crawlers: they crawl the URL pages containing static content in automated-program mode and return the target data fields and attachment files;

(6) Storage module: it receives the data and attachments returned by the production crawlers, performs integrity and normalization processing on the data fields, transfers the attachment files, and persists all data in the database;

(7) Export module: it exports the business data in the final export format the business requires and returns it to the user.

As shown in Fig. 2, the relationships between the modules correspond to the stages of industrial contract (OEM) manufacturing. Each module can be configured and adapted internally to the specifics of the business, and the modules are scheduled in an orderly, cooperative fashion, giving a high degree of flexibility. During crawl production, the dynamic pages are handled by business crawlers using simulated browser operation, independent processes, and synchronous mode; once the URLs of the static target content have been obtained, they are handed over to the more efficient production crawlers, which run as automated programs in multi-threaded, asynchronous mode and crawl in parallel.

In a concrete implementation, hardware servers can be deployed according to the actual workload. If the workload is heavy, each module can be deployed on its own hardware, with the business crawlers and production crawlers deployed as distributed clusters and the storage module spread across multiple servers, one per database. If the workload is moderate, the business interface module, business scheduling module, and business crawlers can share one server while the production scheduling module, production crawlers, storage module, and export module share another. If the workload is genuinely small, all modules can even be deployed on a single server.

As shown in Fig. 2, the system first receives the crawler business information entered by the user as the parameters for the subsequent production scheduling. The specific contents include the business ID, business description, starting URL, login credentials (account, password, CA certificate, etc.), execution strategy (one-time crawling or timed incremental crawling), counter-anti-crawling measures (IP proxies, browser headers, CSS offsets, etc.), the data fields to crawl and their locators, and the final export data format. All of this information is edited by the user and saved as a JSON file; the system obtains the business parameters by reading the JSON file.
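A business-parameter file of this kind might look as follows; the field names and values are purely illustrative, since the patent does not fix a JSON schema.

```python
import json

# Hypothetical contents of the user-edited business-parameter JSON file.
business_json = """
{
  "business_id": "B20200601",
  "description": "forum post crawl",
  "start_url": "https://example.com/forum",
  "login": {"account": "user", "password": "***"},
  "strategy": {"mode": "incremental", "schedule": "02:00"},
  "counter_anti_crawl": {"ip_proxy": true, "browser_headers": true},
  "fields": [{"name": "title", "locator": "h1.title"}],
  "export_format": "csv"
}
"""
# The system would read this from disk; json.loads stands in for that here.
params = json.loads(business_json)
print(params["strategy"]["mode"])  # incremental
```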

The business interface module is responsible for interacting with the user, receiving the parameters the user enters, and performing the review, evaluation, and business preparation. Its main functions are shown in Fig. 3.

The business evaluation in the business interface module covers resources such as the server hardware, software, storage, and network bandwidth the business requires. Based on the current operating state and resources of the system, it judges whether the business can be undertaken immediately; if it can, business preparation begins, and if it cannot, the result is fed back to the user and execution is deferred.

The preparatory work in the business interface module includes:

(1) Business-parameter completeness check: the business information entered by the user is complete, and the system can execute the crawl task according to those parameters;

(2) Access-pass preparation: log in to the target website with the login credentials, obtain and save the cookies, or verify the CA certificate, ensuring the account can log in successfully. On some websites, a few of the cookie keys obtained right after login differ from the keys that must be carried on subsequent requests; therefore, after obtaining the cookies for the first time, verify that login actually succeeds with them, and if it does not, inspect the cookie keys further;

(3) Small-task trial: using the starting URL and access pass from the business parameters, construct a small crawl task and, from its execution, analyze information such as the target page's character encoding, data volume, access latency, and crawl success rate, providing resource-planning reference information for the formal crawl task to follow;

(4) Data-field design and normalization requirements: according to the crawl target and business requirements, design the field names, field formats, field lengths, and normalization requirements of the target data;

(5) Create the persistent-storage resources for the business, such as the database, data tables, and attachment storage space.

As shown in Figure 2, once the small task attempt succeeds and the storage specification and space are ready, the system enters the scheduling stage. The task is first decomposed, and the service scheduling module launches several business crawlers, each running in an independent process in simulated-browser mode, to crawl the dynamic web pages.

The execution strategy for the business crawlers is either one-time crawling or timed incremental crawling. One-time crawling executes the entire crawl in a single run, based on the target page URL, keywords, update time, maximum page count, maximum depth, and similar information. Timed incremental crawling starts the crawl at a specified time, first locating the position of the most recent crawl from the database or log records, and then crawling only the newly added content. The strategy can be set according to the service requirements.
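In a specific implementation, the timed incremental strategy can be sketched as follows; the in-memory "log" and the `(item_id, url)` record shape are illustrative stand-ins for the database or log lookup described above.

```python
# Incremental crawl sketch: resume from the most recently crawled item,
# so that only content added since the last run is crawled.

def find_resume_point(log: list[int]) -> int:
    """Return the highest item id already crawled, or 0 if none."""
    return max(log) if log else 0

def select_new_items(entries: list[tuple[int, str]],
                     log: list[int]) -> list[tuple[int, str]]:
    """Keep only the entries added after the last crawl."""
    last = find_resume_point(log)
    return [(i, url) for i, url in entries if i > last]
```

A one-time crawl is simply the degenerate case where the log is empty, so every entry is selected.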

With timed incremental crawling, the validity of the access pass must be checked again before each crawl, to prevent the interval between crawls from exceeding the login expiration time set by the target website. If the access pass has expired, it must be obtained again.

The business crawler can simulate browser actions such as mouse clicks, scrollbar sliding, keyboard input, and copy-and-paste. It sends the link URLs of the target pages to the production scheduling module synchronously, while reporting task execution status back to the service scheduling module. In a specific implementation, this can be developed with tools such as Selenium or PhantomJS.

After receiving the crawl tasks returned by the business crawlers, the production scheduling module assembles them into a task message queue and reviews them in turn. The specific steps are:

First, the validity of each URL is checked against legality rules;

Then, the URL is mapped by a hash algorithm to a HashMap key, and a Bloom filter checks whether the URL is a duplicate within this service. If it is, the task is discarded and logged; if not, a production crawler is launched on the distributed cluster server to crawl the static pages.
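A minimal sketch of the Bloom-filter de-duplication step follows. The bit-array size and number of hash functions are illustrative; a Bloom filter can report false positives but never false negatives, which is acceptable here because a false positive only skips one URL.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter for URL de-duplication (illustrative sizes)."""

    def __init__(self, size_bits: int = 1 << 20, num_hashes: int = 3):
        self.m = size_bits
        self.k = num_hashes
        self.bits = 0  # a Python big int used as a bit array

    def _positions(self, url: str):
        # Derive k independent hash positions by salting the URL.
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{url}".encode()).hexdigest()
            yield int(h, 16) % self.m

    def add(self, url: str) -> None:
        for p in self._positions(url):
            self.bits |= 1 << p

    def might_contain(self, url: str) -> bool:
        return all(self.bits >> p & 1 for p in self._positions(url))

def accept_task(bf: BloomFilter, url: str) -> bool:
    """Accept a URL only if it is not a (probable) duplicate, then record it."""
    if bf.might_contain(url):
        return False  # duplicate: discard and log
    bf.add(url)
    return True
```

In production, the filter state would live in shared storage (e.g. Redis) so that all scheduler instances see the same de-duplication set.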

The production scheduling module distributes crawl tasks evenly across the distributed servers via load balancing, and starts production crawlers in a multi-threaded, asynchronous manner to crawl the target data. A production crawler accesses the target URL with an automated program, extracts the target data according to its page location, sends it to the storage module, and reports task execution status back to the production scheduling module. In a specific implementation, this can be developed with tools such as Scrapy, Urllib, or Requests.
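The even task distribution above can be sketched as a simple round-robin plan; the server names are hypothetical, and a real deployment would weight the assignment by server load rather than distribute purely by count.

```python
from itertools import cycle

def distribute(tasks: list[str], servers: list[str]) -> dict[str, list[str]]:
    """Round-robin assignment of crawl tasks to distributed servers."""
    plan: dict[str, list[str]] = {s: [] for s in servers}
    for task, server in zip(tasks, cycle(servers)):
        plan[server].append(task)
    return plan
```

Each server then runs its share of the tasks with its own thread pool, which is the multi-threaded, asynchronous execution described above.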

If the target website deploys anti-crawler measures, the production crawler is started with a corresponding counter-strategy. In a specific implementation, the common countermeasures such as IP proxies, access-frequency throttling, and User-Agent customization can be implemented first; if the target website uses additional anti-crawler techniques, the counter-strategy can then be extended case by case.
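Two of the common countermeasures named above — User-Agent rotation and access-frequency throttling — can be sketched as follows; the User-Agent strings, the minimum interval, and the `fetch` callable are illustrative assumptions.

```python
import time
from itertools import cycle

# Illustrative User-Agent pool; a real deployment would use current
# browser strings and a larger pool.
USER_AGENTS = cycle([
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)",
])

def build_request_headers() -> dict:
    """Rotate the User-Agent on every request to vary the client signature."""
    return {"User-Agent": next(USER_AGENTS)}

def throttled_fetch(fetch, url: str, min_interval: float = 1.0, _last=[0.0]):
    """Enforce a minimum interval between requests (frequency throttling).

    The mutable default list deliberately keeps the last-request time
    across calls; a real crawler would keep this state per target host.
    """
    wait = _last[0] + min_interval - time.monotonic()
    if wait > 0:
        time.sleep(wait)
    _last[0] = time.monotonic()
    return fetch(url, build_request_headers())
```

An IP proxy would be added at the same point, by routing `fetch` through a rotating proxy pool.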

Business crawlers and production crawlers generally have a 1:N correspondence; the contents of the task list are shown in Figure 4. A production crawler should be traceable by its business crawler ID, so that when a crawl exception occurs, the corresponding business crawler and the other crawlers it dispatched can be found in the logs and analyzed. Data fields can be located on the page in several ways, such as find, CSS selectors, or XPath; XPath is generally used because it is convenient to write and to search with.
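XPath-style field location can be sketched with the limited XPath subset supported by the standard library's `xml.etree.ElementTree` (a real crawler would typically use lxml's fuller XPath engine); the page fragment and the tag/class names are illustrative assumptions.

```python
import xml.etree.ElementTree as ET

# A well-formed fragment standing in for a fetched static page;
# the structure and class names are illustrative.
PAGE = """
<html><body>
  <div class="post">
    <span class="title">First post</span>
    <span class="author">alice</span>
  </div>
</body></html>
"""

def extract_fields(page_xml: str) -> dict:
    """Locate target data fields with XPath-style path expressions."""
    root = ET.fromstring(page_xml)
    return {
        "title": root.findtext(".//span[@class='title']"),
        "author": root.findtext(".//span[@class='author']"),
    }
```

The same paths would appear in the task list of Figure 4 as each field's page-location entry.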

As shown in Figure 2, the storage module receives the data and attachments captured by the production crawlers, places them in a cache, preprocesses the data, and then transfers it to the database for persistent storage. Data preprocessing includes:

(1) Data integrity check: whether the specified fields were crawled completely and whether the data content meets requirements; data that falls short can be corrected, flagged, or discarded according to severity;

(2) Field normalization: whether each field's data format and data type match the specification; data that does not is converted individually (for example, Sina Weibo and some forum websites describe the time of a new post as "** minutes ago", or express dates as "today" or "yesterday", which must be converted to a standard date-time format);
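A minimal sketch of the relative-time conversion described above; a reference time `now` is passed in explicitly so the conversion is deterministic, and the patterns mirror the Weibo-style examples ("分钟前" = "minutes ago", "今天"/"昨天" = "today"/"yesterday").

```python
import re
from datetime import datetime, timedelta

def normalize_post_time(raw: str, now: datetime) -> str:
    """Convert '5分钟前' / '今天 09:30' / '昨天 18:00' to 'YYYY-MM-DD HH:MM'."""
    m = re.match(r"(\d+)分钟前", raw)
    if m:
        t = now - timedelta(minutes=int(m.group(1)))
        return t.strftime("%Y-%m-%d %H:%M")
    for word, days_back in (("今天", 0), ("昨天", 1)):
        m = re.match(word + r"\s*(\d{2}:\d{2})", raw)
        if m:
            day = now.date() - timedelta(days=days_back)
            return f"{day} {m.group(1)}"
    return raw  # already in a standard form: keep as-is
```

Records whose time field cannot be normalized would be flagged for the integrity check in step (1).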

(3) Attachment file transfer: after the crawled attachment files are scanned for viruses, they are renamed with timestamped, collision-resistant file names and moved to a fixed storage location, and a new data field is added to record information such as each attachment's path and file name.
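The renaming and record-keeping step can be sketched as follows; the storage directory and record layout are illustrative assumptions, and the virus scan is assumed to have already passed.

```python
import uuid
from datetime import datetime
from pathlib import PurePosixPath

def archived_name(original: str, when: datetime) -> str:
    """Timestamped, hard-to-collide file name that preserves the extension."""
    ext = PurePosixPath(original).suffix
    stamp = when.strftime("%Y%m%d%H%M%S")
    return f"{stamp}_{uuid.uuid4().hex}{ext}"

def attachment_record(original: str, store_dir: str, when: datetime) -> dict:
    """The new data field recording the attachment's stored path and name."""
    name = archived_name(original, when)
    return {"file_name": name, "path": f"{store_dir}/{name}"}
```

The random UUID component makes collisions practically impossible even when many files are archived within the same second.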

During persistent storage, based on the earlier service evaluation, if the data volume is huge and the production crawlers run with high parallelism, the data is stored with database and table sharding to improve access efficiency. In a specific implementation, the split is usually horizontal, and a record's storage location can be chosen by taking a modulus, i.e. id mod n, before writing it to the database. If the initial service estimate proves too low, or the execution plan changes so that database storage must be expanded, additional databases and servers can be added.
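The id-mod-n routing can be sketched as follows; the shard naming scheme is an illustrative assumption.

```python
def shard_for(record_id: int, num_shards: int) -> str:
    """Pick the target database/table for a record by id mod n."""
    return f"crawl_db_{record_id % num_shards}"

def route(records: list[dict], num_shards: int) -> dict[str, list[dict]]:
    """Group records by target shard before bulk insertion."""
    plan: dict[str, list[dict]] = {}
    for r in records:
        plan.setdefault(shard_for(r["id"], num_shards), []).append(r)
    return plan
```

Note that plain id mod n forces data migration when n changes; deployments that expect frequent expansion often use consistent hashing instead, at the cost of slightly more complex routing.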

The export module exports the crawled data from the database and attachment files according to the format and form required by the service parameters, generates statistical descriptions of the data, and delivers the package to the user. In a specific implementation, a user interface can be developed in which a service ID or related information is entered to view the current execution status, data volume, and other details, after which a system administrator or authorized account performs the export according to established rules.

The specific embodiments described above further explain the purpose, technical solutions, and beneficial effects of the present invention in detail. It should be understood that the above are merely specific embodiments of the present invention and are not intended to limit its scope of protection; any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall be included within its scope of protection.

Claims (6)

1. A dynamic webpage crawler method based on a proxy mode is characterized by comprising the following steps:
s1: receiving service information input by a user, configuring crawler service parameters, performing service evaluation, and preparing;
s2: according to the service information, system resources are distributed at a specified time, and a plurality of service crawlers with independent processes are initiated;
s3: the business crawler crawls the original URL of the dynamic webpage by adopting a simulated browser mode and returns the URL of the target static data content;
s4: checking the validity and non-repeatability of the URL, constructing a production task message list for the crawl tasks passing the checking, and initiating a plurality of multi-thread production crawlers on a distributed server;
s5: the production crawler crawls URL pages containing static content by adopting an automatic program mode, and returns target data fields and attachment files;
s6: preprocessing the target data field, forming service data by the preprocessed target data field and the attachment file, and storing the service data;
s7: exporting the service data and feeding back the service data to a user;
the service information comprises a service ID, a service description, an initial URL, login data, an execution strategy, anti-crawling measures, a crawling data field and a positioning and exporting data format;
the login data comprises an account number, a password and a CA certificate, the execution strategy is one-time crawling or timed incremental crawling, and the anti-crawling measures comprise an IP proxy, a browser header and CSS offset;
wherein the preparation work in the step S1 includes:
checking the completeness of the service parameters: ensuring the integrity of the service information;
preparing an access pass: logging in a target website according to the service information, acquiring and storing Cookies information, and verifying the CA certificate;
and (3) small task trying: constructing a small-sized crawling task according to the initial URL and the access pass preparation, and analyzing character codes, data volume, access duration and crawling success rate of the target website through execution conditions;
design data fields and normalization requirements: according to the service information, designing a field name, a field format, a field length and a standardization requirement of target data;
establishing a database, a data table and an accessory storage space;
the service evaluation in step S1 includes evaluation of server hardware, software, storage and network bandwidth required by the service.
2. The dynamic web page crawler method based on the agent mode as recited in claim 1, wherein if the data size is huge and the degree of parallelism of the production crawlers is high, the data is stored in a sharded manner across multiple databases and tables.
3. The method according to claim 1, wherein in step S3 the simulated browser mode comprises mouse click, scroll bar sliding, keyboard input, and copy and paste.
4. The agent mode-based dynamic web crawler method according to claim 1, wherein the step S4 comprises the following sub-steps:
s01: checking the validity of the URL through a validity rule;
s02: mapping the URL to a Key of the HashMap by a Hash algorithm, and adopting a bloom filter to check whether the URL is repeated; and if the data is repeated, discarding and logging.
5. The method according to claim 1, wherein in step S6, the target data field is preprocessed, and details thereof are as follows:
data integrity checking: checking whether the specified fields to be crawled are complete and whether the data content meets requirements; data that cannot meet the requirements is corrected, flagged or discarded according to severity;
field normalization processing: checking whether the data format and the data type of each field conform to the settings, and individually converting data that cannot meet the requirements;
and attachment file transfer: after virus scanning is carried out on the crawled attachment file, a timestamped file name that is not easily duplicated is reassigned, the file is then stored in a fixed position, and a data field is newly added for recording the path and the file name of the attachment file.
6. A dynamic web crawler system based on a proxy mode, comprising:
a service interface module: serving as the user interface for the service, receiving service information input by a user, configuring parameters related to the crawler service, performing service evaluation, and making preparations; the service information includes: service ID, service description, initial URL, login data, execution strategy, anti-crawling measures, crawling data fields and their positioning, and the final export data format; the preparation work comprises checking the completeness of service parameters, preparing access passes, trying small tasks, designing data fields and normalization requirements, establishing a database, establishing a data table and establishing an attachment storage space; the service evaluation comprises the evaluation of server hardware, software, storage and network bandwidth required by the service;
a service scheduling module: according to the service related information, system resources are distributed at the appointed time, and a plurality of service crawlers with independent processes are initiated;
a service crawler: crawling the original URL of the dynamic webpage in a simulated browser mode, and returning the URL of the target static data content;
a production scheduling module: receiving a crawling task returned by the service crawler, checking the validity and non-repeatability of the URL, constructing a production task message list for the crawling task passing the checking, and initiating a plurality of multi-threaded production crawlers on the distributed cluster server;
a production crawler: crawling URL pages containing static content in an automated-program mode, and returning target data and attachment files;
a storage module: receiving target data and an attachment file returned by the production crawler, performing integrity and standardization processing on the target data, and storing the processed target data and the processed attachment file into a database to form service data;
an export module: exporting the service data according to the final export data format and feeding it back to the user;
wherein, the service parameter completeness check means: ensuring the integrity of the service information;
the access pass preparation is that: logging in a target website according to the service information, acquiring and storing Cookies information, and verifying a CA certificate;
the small task attempt refers to: constructing a small-sized crawling task according to the initial URL and the access pass preparation, and analyzing character codes, data volume, access duration and crawling success rate of the target website through execution conditions;
the design data field and the normalization requirement refer to: and designing the field name, the field format, the field length and the standardization requirement of the target data according to the service information.
CN202010488720.2A 2020-06-02 2020-06-02 Method and system for dynamic webpage crawler based on agent mode Active CN111651656B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010488720.2A CN111651656B (en) 2020-06-02 2020-06-02 Method and system for dynamic webpage crawler based on agent mode


Publications (2)

Publication Number Publication Date
CN111651656A CN111651656A (en) 2020-09-11
CN111651656B true CN111651656B (en) 2023-02-24

Family

ID=72352714


Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112749315B (en) * 2021-01-15 2024-11-15 大连海关技术中心 A crawler method for a subject portal website
CN112836108A (en) * 2021-01-29 2021-05-25 宝宝巴士股份有限公司 Method and terminal for crawling third-party website data
CN113821705B (en) * 2021-08-30 2024-02-20 湖南大学 Webpage content acquisition method, terminal equipment and readable storage medium
CN114443926A (en) * 2021-12-27 2022-05-06 国网河南省电力公司郑州供电公司 Electric power business environment information collection system based on web crawler technology
CN116433032B (en) * 2023-04-26 2024-04-09 中国农业科学院农业环境与可持续发展研究所 Intelligent assessment method based on web crawler mode

Family Cites Families (5)

Publication number Priority date Publication date Assignee Title
US20070022085A1 (en) * 2005-07-22 2007-01-25 Parashuram Kulkarni Techniques for unsupervised web content discovery and automated query generation for crawling the hidden web
CN102937989B (en) * 2012-10-29 2016-06-22 北京腾逸科技发展有限公司 Parallelization distributed interconnection data grab method and system thereof
CN104899323B (en) * 2015-06-19 2018-09-11 成都国腾实业集团有限公司 A kind of crawler system for IDC harmful information monitoring platforms
CN105243159B (en) * 2015-10-28 2019-06-25 福建亿榕信息技术有限公司 A kind of distributed network crawler system based on visualization script editing machine
CN105608134B (en) * 2015-12-18 2019-06-14 盐城工学院 A kind of web crawling system based on multi-thread and its web crawling method



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant