CN103902386B

CN103902386B - Multi-thread network crawler processing method based on connection proxy optimal management

Info

Publication number: CN103902386B
Application number: CN201410146375.9A
Authority: CN
Inventors: 罗邦慧; 曾剑平
Original assignee: Fudan University
Current assignee: Fudan University
Priority date: 2014-04-11
Filing date: 2014-04-11
Publication date: 2017-05-10
Anticipated expiration: 2034-04-11
Also published as: CN103902386A

Abstract

The invention belongs to the technical field of information processing, and specifically relates to a multi-threaded web crawler processing method based on connection agent optimization management, which first obtains a public agent server on the network, tests the network connection performance of the agent server, and obtains the optimal method according to the agent server performance. The number of threads; then manage the proxy server pool, and set an effective proxy server for each Http request; finally execute the Web page access request. The beneficial effects of the present invention are: the number of threads is obtained through calculation, which can effectively make the maximum use of resources without causing waste of resources; balance the use times of each available proxy server, and effectively prevent frequent access from being detected by the server .

Description

A multi-threaded web crawler processing method based on connection proxy optimization management

技术领域technical field

本发明涉及信息处理技术领域，具体涉及一种新型的Web页面信息获取方法，特别是在现有网络爬虫原理的基础上进行连接代理优化管理设计的新型网络爬虫处理方法。The invention relates to the technical field of information processing, in particular to a novel web page information acquisition method, in particular to a novel web crawler processing method based on the existing web crawler principle for optimal management and design of connection agents.

背景技术Background technique

随着网络的迅速发展，网络成为大量信息的载体，如何有效地提取这些信息成为一个巨大的挑战。With the rapid development of the network, the network has become the carrier of a large amount of information, how to effectively extract this information has become a huge challenge.

网络爬虫是搜索引擎系统中十分重要的组成部分，它负责从互联网中搜集网页，采集信息，这些网页信息用于建立索引从而为搜索引擎提供支持，其性能的优劣直接影响着搜索引擎的效果。随着网络信息量几何级的增长，对网络爬虫页面采集的性能和效率的要求也越来越高。The web crawler is a very important part of the search engine system. It is responsible for collecting web pages from the Internet and collecting information. These web page information is used to build an index to provide support for the search engine. Its performance directly affects the effect of the search engine. . As the amount of network information increases geometrically, the requirements for the performance and efficiency of web crawler page collection are also getting higher and higher.

我们总是希望在更短的时间内，获取更多的数据，但是，这会对网站造成非常高的负载，也带来了网络流量增加，泄露隐私数据等问题，很多网站采用爬虫检测技术，分析Web访问日志，当判断出爬虫时，则禁止爬虫使用地址，拒绝爬虫继续访问。为让爬虫能够避免被监测出来，针对网站对爬虫检测方法，目前已经设计出了大量的分布式网络爬虫,伪装网络爬虫和使用代理的网络爬虫，如斯坦福大学设计的Google爬虫、Disguised Spider、Internet Archive爬虫。它们采取更换 UserAgent，设置访问时间间隔和优化URL访问策略，使用代理服务器、多线程等方法优化爬虫。但在实际使用爬虫时，会遇到如下问题：（1）间隔参数没有具体的定义标准，爬虫性能得不到保证。当网站对爬虫检测非常严格时，爬虫需要很长的间隔时间，导致爬虫实用性不高，间隔缩短，会导致爬虫被鉴别出来而不可用。（2）使用代理服务器，需要根据代理服务器性能和多线程的数量来优化爬虫效率，但是目前的研究中并没有给出如何优化的具体方法，如果设置了不合适的关系值，那么爬虫效率会非常低。We always hope to obtain more data in a shorter period of time, but this will cause a very high load on the website, and it will also bring about problems such as increased network traffic and leakage of private data. Many websites use crawler detection technology. Analyze the web access log, and when a crawler is identified, the crawler is prohibited from using the address, and the crawler is refused to continue accessing. In order to prevent crawlers from being monitored, a large number of distributed web crawlers, camouflaged web crawlers and web crawlers using proxies have been designed for website crawler detection methods, such as Google crawlers designed by Stanford University, Disguised Spider, Internet Archive crawler. They replace the UserAgent, set the access time interval and optimize the URL access strategy, and use proxy servers, multi-threading and other methods to optimize crawlers. However, when the crawler is actually used, the following problems will be encountered: (1) There is no specific definition standard for the interval parameter, and the performance of the crawler cannot be guaranteed. When the website is very strict on crawler detection, the crawler needs a long interval, resulting in low practicality of the crawler, and shortening the interval will cause the crawler to be identified and unavailable. (2) When using a proxy server, it is necessary to optimize the crawler efficiency according to the performance of the proxy server and the number of multi-threads, but the current research does not give a specific method for how to optimize it. If an inappropriate relationship value is set, the crawler efficiency will decrease. very low.

由此可见，在现有爬虫技术的基础上，采用合理的代理连接管理方法来对爬虫数据获取线程进行优化配置，对于提升爬虫性能、避免爬虫被服务器拒绝是非常重要的。本发明给出了一种符合这种要求的设计方法。It can be seen that on the basis of the existing crawler technology, it is very important to optimize the configuration of the crawler data acquisition thread by using a reasonable proxy connection management method to improve the performance of the crawler and prevent the crawler from being rejected by the server. The present invention provides a design method that meets this requirement.

发明内容Contents of the invention

本发明的主要目的是针对爬虫访问Web 页面时被拒的问题，提出一种基于连接代理优化管理的多线程网络爬虫，避免被服务器端检测。这种方法具有一定的适应能力，能够解决爬虫在进行Web页面获取时被拒绝的问题。这种方法充分利用了现有互联网上公开的网络连接代理服务，在多线程爬行中进行了代理连接的优化管理和设计，通过代理服务的自动选择机制避免重复使用同一个客户端IP地址连接Web服务器，从而避免被服务端检测。The main purpose of the present invention is to propose a multi-threaded web crawler based on connection proxy optimization management to avoid being detected by the server for the problem that the crawler is rejected when accessing the web page. This method has a certain adaptability, and can solve the problem that the crawler is rejected when acquiring Web pages. This method makes full use of the existing network connection proxy services on the Internet, optimizes management and design of proxy connections in multi-threaded crawling, and avoids repeated use of the same client IP address to connect to the Web through the automatic selection mechanism of proxy services. server, so as to avoid being detected by the server.

本发明提出的基于连接代理优化管理的多线程网络爬虫处理方法，主要使用了多个连接代理，并提出有效的代理管理策略和参数设置方案，在多线程爬虫的基础上应用多代理进行数据爬取。其首先获取网络上公开代理服务器，测试代理服务器的网络连接性能，并根据代理服务器性能得到最优的线程数量；然后对代理服务器池进行管理，并为每一个 Http 请求设置一个有效代理服务器，最终执行Web页面访问请求。其中：The multi-threaded web crawler processing method based on connection agent optimization management proposed by the present invention mainly uses a plurality of connection agents, and proposes effective agent management strategies and parameter setting schemes, and uses multi-agents to perform data crawling on the basis of multi-thread crawlers. Pick. It first obtains the public proxy server on the network, tests the network connection performance of the proxy server, and obtains the optimal number of threads according to the performance of the proxy server; then manages the proxy server pool, and sets an effective proxy server for each Http request, and finally Execute web page access requests. in:

根据代理池中的代理服务器性能确定线程数量M，采用的计算公式如下：The number of threads M is determined according to the performance of the proxy server in the proxy pool, and the calculation formula adopted is as follows:

其中，为代理服务器的失败率，v 为爬取速度，为代理池中代理服务in, is the failure rate of the proxy server, v is the crawling speed, Serving the agents in the agent pool

器的响应时间期望值；The response time expectation of the device;

对代理服务器池进行管理，并为每一个 Http 请求设置一个有效代理服务器时，把代理服务器的失败率和响应时间作为鉴别一个代理是否有效的标准；隔段时间以后，再次尝试使用之前被判断为不能使用的代理，从而把无效的代理服务器池中可用的代理放入有效的代理服务器池中；代理任务分配时，为每个线程获取有效的代理服务器池中使用次数最少的代理，以均衡地把任务分配给每一个代理。When managing the proxy server pool and setting an effective proxy server for each Http request, set the failure rate of the proxy server to and response time as a criterion for identifying whether a proxy is valid; after a period of time, try to use the proxy that was judged to be unusable before, so as to put the available proxy in the invalid proxy server pool into the effective proxy server pool; proxy When assigning tasks, obtain the least-used agent in the effective agent server pool for each thread, so as to distribute tasks to each agent in a balanced manner.

本发明中，采用最小使用次数优先队列实现代理任务分配。用链表来存储代理，并记录在时间窗口内代理的使用次数。按照代理的使用次数由小到大的顺序，对链表进行排序。插入代理时，使用插入排序方法，从链表表尾的位置开始比较代理的使用次数，当在链表中找到使用次数不大于该代理的代理时，插入其后。需要提供代理使用时，获取链表表头位置的代理，并删除此代理。In the present invention, the agent task assignment is realized by adopting the priority queue with the minimum usage times. Use a linked list to store the agent, and record the number of times the agent is used within the time window. Sort the linked list in descending order of the number of times the agents are used. When inserting a proxy, use the insertion sorting method to compare the usage times of the proxy from the end of the linked list, and when the proxy is found in the linked list with a usage count not greater than that of the proxy, insert it. When a proxy needs to be provided, get the proxy at the head of the linked list and delete the proxy.

本发明中，根据代理服务器响应时间，计算出爬虫的平均反应时间期望值，计算公式如下：In the present invention, according to the response time of the proxy server, the average response time expectation value of the crawler is calculated ,Calculated as follows:

其中，为第 i 个代理服务器的响应时间， N 为代理服务器个数。in, is the response time of the i-th proxy server, and N is the number of proxy servers.

本发明中，根据代理服务器状态队列、代理服务器在时间窗口 W 内的失败次数，计算出失败率；计算公式如下：In the present invention, the failure rate is calculated according to the proxy server state queue and the number of failures of the proxy server in the time window W ;Calculated as follows:

其中，failedTimes 为连接执行期间代理服务器的失败次数，usedTimes 为Among them, failedTimes is the number of failures of the proxy server during connection execution, and usedTimes is

代理服务器的使用次数。The number of times the proxy server is used.

本发明的有益效果在于：（1）多线程的数目是基于所使用的代理服务器性能和数目得到。通过动态地计算线程数目与代理服务器参数的关系，使任何可用的代理服务器都能适应本发明提出的网络爬虫，不会因为代理服务器的不稳定性影响爬虫的正常使用。而且，通过计算得到线程数目，能有效地使资源得到最大的利用又不造成资源的浪费。（2）提出了灵活的管理代理服务器方法，针对不同的访问网站，可以选择使用对该网站有效的代理服务器进行访问。（3）提出代理服务器选择方法，使每个代理服务器的使用时间间隔最长。爬虫为每一个 Http 请求分配一个代理服务器进行访问，均衡每个可用代理服务器的使用次数，有效地避免频繁访问被服务器端检测出来。The beneficial effects of the present invention are: (1) The number of multi-threads is obtained based on the performance and number of the proxy server used. By dynamically calculating the relationship between the number of threads and the parameters of the proxy server, any available proxy server can adapt to the web crawler proposed by the present invention, and the normal use of the crawler will not be affected by the instability of the proxy server. Moreover, by calculating the number of threads, resources can be effectively utilized to the maximum without causing waste of resources. (2) A flexible method of managing proxy servers is proposed. For different visiting websites, you can choose to use a valid proxy server to access the website. (3) Propose a proxy server selection method to make the use time interval of each proxy server the longest. The crawler assigns a proxy server for each Http request for access, balances the number of times each available proxy server is used, and effectively prevents frequent access from being detected by the server.

附图说明Description of drawings

图1是本发明流程图。Fig. 1 is the flow chart of the present invention.

图2是本发明中Proxy 状态队列的一个示意图。Fig. 2 is a schematic diagram of the Proxy state queue in the present invention.

具体实施方式detailed description

下面结合附图和实施例对本发明进一步详细说明。The present invention will be described in further detail below in conjunction with the accompanying drawings and embodiments.

图1是本发明流程的进一步说明。图中，虚框 A 内的流程是建立爬虫所需要做的初始工作，只需要执行一次。虚框 B 内的流程是爬虫爬取Web页面的过程，需重复执行，直到结束。Figure 1 is a further illustration of the process of the present invention. In the figure, the process in the virtual box A is the initial work needed to build a crawler, and it only needs to be executed once. The process in the virtual frame B is the process of the crawler crawling the web page, which needs to be executed repeatedly until the end.

（1）获取代理服务器，存入代理服务器池。(1) Obtain a proxy server and store it in the proxy server pool.

（2）测试代理服务器的网络连接性能。(2) Test the network connection performance of the proxy server.

（3）根据代理服务器性能创建一定数量的多线程。(3) Create a certain number of multi-threads according to the performance of the proxy server.

（4）把爬虫开始的爬取目标地址转化成一个 Http 请求，并从代理服务器池(4) Convert the crawling target address started by the crawler into an Http request, and send it from the proxy server pool

中获取一个有效的代理服务器，设置 Http 请求通过该代理服务器执行。Obtain a valid proxy server in , and set the Http request to be executed through the proxy server.

（5）把 Http 请求加入一个Http请求队列里。(5) Add the Http request to an Http request queue.

（6）处于空闲状态的线程从队列 Http 请求队列里获取 Http 请求任务。(6) The idle thread gets the Http request task from the queue Http request queue.

（7）线程执行其获取的 Http 请求。执行过程中，如果遇到新的请求目标地(7) The thread executes the Http request it gets. During execution, if a new request destination is encountered

址，则如步骤（4）的方法，创建新的 Http 请求，加入 Http 请求队列里。address, create a new Http request as in step (4), and add it to the Http request queue.

（8） Http 请求执行完成后，得到 Http 响应信息，并将 Web 页面信息存储(8) After the Http request is executed, get the Http response information and store the Web page information

在本地。locally.

（9）线程回到空闲状态。(9) The thread returns to the idle state.

（10）重复步骤（6）到步骤（9），直到所有的线程都空闲并且Http 请求(10) Repeat steps (6) to (9) until all threads are idle and Http requests

队列里没有请求任务。There are no requested tasks in the queue.

本发明中，基于连接代理优化管理的多线程网络爬虫框架设计主要分为三个模块：Http 请求模块、代理服务器模块和爬取模块。Http请求模块根据 Http 请求生成一个待执行的 Http 请求，并放入Http 请求队列中，同时到代理服务器模块中获取一个有效的代理，设置使用该代理执行这个 Http请求。其中，代理服务器模块进行代理服务器的管理,为需要使用代理的模块提供有效的代理。Http 请求队列中的请求任务，由爬取模块来执行。爬取模块创建 M 个并行线程分别去Http 请求队列中拿任务来执行，执行期间若产生了新的 Http 请求，则继续添加到Http请求队列中。In the present invention, the multi-threaded network crawler frame design based on connection agent optimization management is mainly divided into three modules: Http request module, proxy server module and crawling module. The Http request module generates an Http request to be executed according to the Http request, and puts it into the Http request queue, and at the same time obtains an effective proxy from the proxy server module, and sets the proxy to execute the Http request. Among them, the proxy server module manages the proxy server and provides effective proxy for the modules that need to use the proxy. The request tasks in the Http request queue are executed by the crawling module. The crawling module creates M parallel threads to execute tasks in the Http request queue respectively. If a new Http request is generated during execution, it will continue to be added to the Http request queue.

1)代理服务器模块1) Proxy server module

1.1)获取代理服务器1.1) Obtain a proxy server

在数据库中建立代理服务器池，把 Http 代理服务器添加到代理服务器池中，以供使用。可以在互联网上提供代理服务器的网站站点中找到可用的代理服务器。具体的列表可以通过人工方式或另外的小爬虫自动获取。代理服务器的获取可以在互联网上有很多提供代理服务器的网站站点中找到。Create a proxy server pool in the database, and add the Http proxy server to the proxy server pool for use. Available proxy servers can be found at sites on the Internet that offer proxy servers. The specific list can be obtained manually or automatically by another small crawler. The acquisition of the proxy server can be found in many websites that provide proxy servers on the Internet.

1.2)代理服务器的定义1.2) Definition of proxy server

代理服务器定义为一个六元组，其中 address 为字符串类型数据，代表代理服务器的地址；port 为整型数据，代表代理服务器的端口；usedTimes为整型数据，代表该代理服务器的使用次数；timeQueue为一个链表，存储值为 1.3.2) 中计算的响应时间, 每次在链表的表头加入新的时间值，并且当链表的大小大于时间窗口 W 内的最大使用次数 S 时，删除链表中第S个时间后面的时间；statusQueue 也为一个链表，即 1.3.1）的 b.2)中定义的 statusQueue。proxy server It is defined as a six-tuple, where address is string type data, representing the address of the proxy server; port is integer data, representing the port of the proxy server; usedTimes is integer data, representing the number of times the proxy server is used; timeQueue is a Linked list, the storage value is the response time calculated in 1.3.2) , each time a new time value is added to the head of the linked list, and when the size of the linked list is greater than the maximum number of uses S in the time window W, delete the time after the Sth time in the linked list; statusQueue is also a linked list, that is, 1.3 The statusQueue defined in b.2) of .1).

1.3)管理数据库中的代理服务器1.3) Proxy servers in the management database

代理服务器并不是任何时候都能很好的完成工作。本发明希望能检查出代理服务器是否能够正常使用，并且完成一个 Http 请求的响应时间是整个爬虫系统能承受的。所以，为了得到能有效完成访问任务的代理，本发明把代理爬取数据的失败率和响应时间（responseTime ）作为鉴别一个代理是否有效的标准。Proxy servers don't do their job well all the time. The present invention hopes to check whether the proxy server can be used normally, and the response time for completing an Http request is acceptable to the entire crawler system. Therefore, in order to obtain an agent that can effectively complete the access task, the present invention takes the failure rate of the agent crawling data And response time (responseTime) as a criterion to identify whether a proxy is valid.

本发明设计了有效的代理服务器池和无效的代理服务器池，分别存储有效的代理和无效代理。在数据库中创建两个链表 ) , 其中 name 为链表的表名，分别设置为 “ValidProxyPool” 和 “invalidProxyPool”； key 为存储的代理的服务器地址，存储格式为 <Address>:<Port>；代理的形式定义见 1.2），在存储时，使用JSon 把代理类转换为输入流数据存入数据库。The present invention designs an effective agent server pool and an invalid agent server pool, and stores effective agents and invalid agents respectively. Create two linked lists in the database ), where name is the table name of the linked list, set to "ValidProxyPool" and "invalidProxyPool"respectively; key is the server address of the stored proxy, and the storage format is <Address>:<Port>; see 1.2 for the definition of the proxy form), in When storing, use JSon to convert the proxy class into input stream data and store it in the database.

1.3.1)无效代理的判定1.3.1) Judgment of Invalid Proxy

a)无效代理定义a) Invalid proxy definition

本发明将计算在最近使用的一个时间窗口 W 内，当同时满足以下条件 i 和 ii或者条件 i 和 iii 时，则此代理被视为无效：The present invention will calculate within a recently used time window W, when the following conditions i and ii or conditions i and iii are met at the same time, the agent is considered invalid:

i.usedTimes>Min_Used_Timesi.usedTimes>Min_Used_Times

其中 usedTimes 为在 W 内代理服务器被使用的次数，计算方法可见 b.3)。Min_Used_Times为设定的代理服务器最少被使用的次数，可以取值为一个较小的整数，如10 次。Among them, usedTimes is the number of times the proxy server is used in W, and the calculation method can be found in b.3). Min_Used_Times is the minimum number of times the proxy server is used, which can be a smaller integer, such as 10 times.

ii. failedRate>Failure_Rateii. failedRate>Failure_Rate

其中failedRate为时间窗口W内该代理的失败率，表示为。failedTimes 为代理服务器在 W 内访问失败的次数，计算方法可见 b.3)。Failure_Rate为可以接受的代理服务器的失败率，可以设置值为 10% 。Where failedRate is the failure rate of the agent within the time window W, expressed as . failedTimes is the number of times the proxy server failed to access within W, and the calculation method can be found in b.3). Failure_Rate is the acceptable failure rate of the proxy server, which can be set to 10%.

.responseTime<Max_Resonpse_Time.responseTime<Max_Resonpse_Time

其中，responseTime 为在时间窗口 W 内，平均响应时间，表示为Among them, responseTime is the average response time within the time window W, expressed as

。 .

为第 i 次使用该代理的响应时间。s 为在时间窗口 W 内的使用次数，即。Max_Response_Time为可以接受的代理服务器的最长响应时间，比如，可以设置值为 300s 。 is the response time of the i-th use of the proxy. s is the number of uses in the time window W, that is . Max_Response_Time is the longest acceptable response time of the proxy server, for example, the value can be set to 300s.

b)计算代理的失败率 b) Calculate the failure rate of the agent

b.1) 为每一个代理新建一个使用足够大的状态队列，存储代理的状态，并定义为一个二元组其中 time, state 分别表示使用时刻时间值与使用结果对应值状态。格式如下：b.1) Create a new state queue with a large enough size for each agent, store the state of the agent, and define it as a two-tuple Among them, time and state respectively represent the time value at the time of use and the corresponding value state of the use result. The format is as follows:

元素名称element name timetime statestate 取值类型value type TimestampTimestamp 0或10 or 1

具体而言：in particular:

time 是流程图1中，空闲线程从 Http 请求队列中获取请求的时刻。time is the moment when the idle thread obtains the request from the Http request queue in Flowchart 1.

state 则表示使用该代理执行 Http 请求后，执行结果成功与否的标记。其值若为 0，则表示访问成功；为 1，则表示访问失败。根据执行请求得到的 Http 响应状态码，当状态码为 2xx 和 3xx 时，访问成功；为其它值时，访问失败。如果代理服务器自行定义了Http 响应状态码，同理判断访问成功与否。state indicates whether the execution result is successful or not after using the proxy to execute the Http request. If its value is 0, it means the access is successful; if it is 1, it means the access failed. According to the Http response status code obtained by executing the request, when the status code is 2xx and 3xx, the access is successful; when it is other values, the access fails. If the proxy server defines the Http response status code by itself, judge whether the access is successful or not in the same way.

b.2)代理的状态队列b.2) Agent's state queue

代理的状态队列statusQueue 也为一个链表，存储b.1)中定义的 status, 每次在链表的表头加入新的 status，并且当队列中的 status 的时间值不在时间窗口内时，即当前时间与该状态的时间值的差值大于T, 删除该 status。The agent's status queue statusQueue is also a linked list , store the status defined in b.1), each time a new status is added to the head of the linked list, and when the time value of the status in the queue is not within the time window, that is, the difference between the current time and the time value of the state If it is greater than T, delete the status.

当代理执行完请求任务后，根据b.1)的方法，计算出 status，并加入队列statusQueue中。After the agent executes the request task, it calculates the status according to the method of b.1), and adds it to the queue statusQueue.

代理的状态队列的一个示意图如图2所示。A schematic diagram of an agent's state queue is shown in Figure 2.

b.3) 统计使用次数和失败次数b.3) Count usage and failure times

使用次数 usedTimes 是时间窗口 W 内Proxy 的状态数量。失败次数failedTimes 为 W 内，Proxy 状态 status = 1 的个数。The number of times usedTimes is the number of states of the Proxy within the time window W. The number of failures failedTimes is the number of Proxy status = 1 within W.

由此，可得失败率From this, the failure rate can be obtained

1.3.2）计算代理的响应时间期望 T 1.3.2) Calculate the agent's response time expectation T

在测试代理服务器的网络连接性能时，用代理池中的代理分别去访问目标地址，访问 s 次，在这里，s 的大小可以为略大于Min_Used_Time / Failure_Rate 。Proxy 某一次的响应时间为，开始执行Http请求到执行结束。t1 为线程从 Http请求队列中获取使用了该代理的请求任务的时刻，t2 为请求执行完成，线程回到空闲状态的时刻。可得：When testing the network connection performance of the proxy server, use the proxy in the proxy pool to visit the target address for s times. Here, the size of s can be slightly larger than Min_Used_Time / Failure_Rate. The response time of a certain time of Proxy For, start to execute the Http request to the end of execution. t1 is the moment when the thread obtains the request task using the agent from the Http request queue, and t2 is the moment when the request execution is completed and the thread returns to the idle state. Available:

所以，可得响应时间 T 近似为响应时间期望，如下：Therefore, the available response time T is approximately the response time expectation, as follows:

即，which is,

1.3.3）管理代理服务器池1.3.3) Manage Proxy Pools

因为代理服务器可能不稳定，导致它某一段时间内无法使用，但是以后可能依然能继续使用。或者当代理因为频繁访问等因素被禁了，但是一段时间以后会禁止访问权限可能会被解除。所以，本发明隔段时间以后，再次尝试使用之前被判断为不能使用的代理，从而把无效的代理服务器池中可用的代理放入有效的代理服务器池中。Because the proxy server may be unstable, causing it to be unavailable for a certain period of time, but it may continue to be available in the future. Or when the proxy is banned due to factors such as frequent access, but the access permission may be lifted after a period of time. Therefore, after a certain period of time, the present invention tries to use the agent judged as unusable before again, so as to put the available agents in the invalid agent server pool into the effective agent server pool.

1.4)提供代理服务器的使用方法。1.4) Provide the usage method of proxy server.

本发明希望尽可能均衡地使用代理服务器池中的每一个代理，减少代理负载过多及重复IP访问而被拒的问题，所以为每个线程获取 Proxy Pool 中使用次数最少的代理，以均衡地把任务分配给每一个 proxy。The present invention hopes to use each agent in the proxy server pool as evenly as possible, reducing the problem of excessive agent load and repeated IP access and being rejected, so obtain the agent with the least number of times of use in the Proxy Pool for each thread, to balance Assign tasks to each proxy.

在具体实现方式上，采用最小使用次数优先队列来确保这个代理服务器的使用策略。该队列表示为，用链表来存储代理，存储值为 proxy，形式化定义见1.2)。优先参数为 proxy 的属性 usedTimes，按照代理的使用次数由小到大的顺序，对链表进行排序。插入代理时，使用插入排序方法，从链表表尾的位置开始比较代理的使用次数，当在链表中找到使用次数不大于该代理的代理时，插入其后。需要提供代理使用时，获取链表表头位置的代理，并删除此代理。以此保证每次获取到的是代理池中，使用次数最少的代理，均衡地把使用任务分配到每一个代理。In the specific implementation manner, the priority queue with minimum usage times is used to ensure the usage strategy of the proxy server. The queue is represented as , use a linked list to store the proxy, and the stored value is proxy, see 1.2 for the formal definition. The priority parameter is the usedTimes attribute of the proxy, and the linked list is sorted in descending order of the times the proxy is used. When inserting a proxy, use the insertion sorting method to compare the usage times of the proxy from the end of the linked list, and when the proxy is found in the linked list with a usage count not greater than that of the proxy, insert it. When a proxy needs to be provided, get the proxy at the head of the linked list and delete the proxy. In this way, it is ensured that the agent with the least number of times of use in the agent pool is obtained each time, and the usage tasks are distributed to each agent in a balanced manner.

2)Http 请求模块2) Http request module

Http 请求是负责在爬虫中建立连接请求，实现Web页面信息的获取。其工作原理是：执行 Http 请求，返回 Http 响应结果。具体过程如下：The Http request is responsible for establishing a connection request in the crawler to obtain web page information. Its working principle is: execute Http request and return Http response result. The specific process is as follows:

针对每一个 Http 请求，新建一个对应的 Http 请求对象。并利用在代理模块中获取的代理连接，从有效的代理服务器池里获取代理，用 HttpClient 设置 Http 请求的的代理信息。For each Http request, create a corresponding Http request object. And use the proxy connection obtained in the proxy module to obtain the proxy from the effective proxy server pool, and use HttpClient to set the proxy information of the Http request.

2.1)执行爬取任务。2.1) Execute crawling tasks.

2.2)根据 Http 请求信息，爬取对应的页面内容。2.2) Crawl the corresponding page content according to the Http request information.

3) 爬取模块3) Crawling module

该模块负责创建多线程爬取任务、建立Http请求队列列表。This module is responsible for creating multi-threaded crawling tasks and establishing Http request queue lists.

3.1）创建多线程爬取任务3.1) Create a multi-threaded crawling task

每个线程从 Http 请求队列中得到爬取任务，让执行完当前爬取任务后，继续从队列中获取任务，直到线程结束。为了加速爬虫爬取速度，本发明将建立一个线程池，让多个线程并行爬取数据。Each thread gets the crawling task from the Http request queue. After the current crawling task is executed, it continues to get the task from the queue until the thread ends. In order to speed up the crawling speed of crawlers, the present invention will establish a thread pool to allow multiple threads to crawl data in parallel.

3.1.1) 计算代理池中代理服务器性能，得到响应时间期望值3.1.1) Calculate the performance of the proxy server in the proxy pool to get the expected response time

假设，一共有 N 个可用的代理，则得到总的响应时间期望值：Assuming that there are a total of N proxies available, the total expected response time is obtained:

为第 i个代理服务器的响应时间，计算方法见 1.3.2）。因为，如前面对代理使用的设计，本发明希望尽可能可以近似地当作每个代理使用的概率相等，即： is the response time of the i-th proxy server, see 1.3.2 for the calculation method). Because, as in the previous design for the use of agents, the present invention hopes to approximate as much as possible the probability that each agent uses is equal, that is:

3.1.2) 在线程池中创建 M 个线程3.1.2) Create M threads in the thread pool

根据代理服务器运行情况，计算合适的 M，使爬虫高效运行。根据，有效的代理服务器池中，可用的代理的平均响应时间和本发明期望的获取数据速度，来设置线程数量。According to the operating conditions of the proxy server, calculate the appropriate M to make the crawler run efficiently. The number of threads is set according to the average response time of available agents in the effective agent server pool and the expected data acquisition speed of the present invention.

对 M 个线程，完成 Q 个数据请求的时间开销 t 为：For M threads, the time cost t to complete Q data requests is:

根据代理服务器的响应时间期望，本发明可以根据需要的爬取数据速度，设置不同的线程数，以高效地完成数据爬取任务。According to the response time expectation of the proxy server, the present invention can set different numbers of threads according to the required data crawling speed, so as to efficiently complete the data crawling task.

即：which is:

3.2）建立多线程的爬取任务Http 请求队列列表3.2) Establish a multi-threaded crawling task Http request queue list

Http 请求队列的建立，与现在网络上成熟的网络爬虫中 URL 队列的建立方式相似。采用先进先出模式，从作为爬取入口的 Http请求开始，解析对应页面，从该页面中抽取出其包含的要爬取的下一个地址的信息集合，构造 Http 请求，用 HttpClient 新建Http 请求，并将 Http 请求加入Http 请求队列。The establishment of the Http request queue is similar to the establishment of the URL queue in the mature web crawler on the Internet. Using the first-in-first-out mode, starting from the Http request as the crawling entry, parsing the corresponding page, extracting the information set of the next address to be crawled from the page, constructing the Http request, and using HttpClient to create a new Http request, And add the Http request to the Http request queue.

从上述实施过程可以看出，本发明在爬虫获取Web页面信息过程中，动态地管理代理服务器，自动调整设置代理服务器与线程数的参量关系，在多线程爬行中通过代理服务的自动选择机制避免重复使用同一个客户端IP地址连接Web服务器，从而避免了被服务端检测。As can be seen from the above-mentioned implementation process, the present invention dynamically manages the proxy server during the process of crawler obtaining Web page information, automatically adjusts the parameter relationship between the proxy server and the number of threads, and avoids the problem by automatically selecting the proxy service in multi-threaded crawling. Reuse the same client IP address to connect to the web server, thus avoiding detection by the server.

Claims

1. A multithreading web crawler processing method based on connection proxy optimization management is characterized by comprising the following specific steps of firstly obtaining a public proxy server on a network, testing the network connection performance of the proxy server, and obtaining the optimal thread number according to the performance of the proxy server; then, managing a proxy server pool, and setting an effective proxy server for each Http request; finally executing the Web page access request; wherein:

the thread number M is determined according to the performance of the proxy servers in the proxy server pool, and the calculation formula is as follows:

here, ,the failure rate of the proxy server, v the crawling speed,the expected value of the response time of the proxy server in the proxy server pool is obtained;

managing the proxy server pool, and setting a valid proxy server for each Http request, determining the failure rate of the proxy serverAnd response time as a criterion for identifying whether an agent is valid; after the interval, trying again to use the agent which is judged to be unusable before, thereby putting the usable agent in the invalid agent server pool into the valid agent server pool; when the agent task is distributed, the agent with the least use times in the effective agent server pool is obtained for each thread, so that the task is distributed to each agent in a balanced manner; wherein: the invalidation agent is defined as follows:

calculating that within a time window W of recent use, the agent is considered invalid when the following conditions i and ii or conditions i and iii are satisfied simultaneously;

i.usedTimes>Min_Used_Times；

ii.failedRate>Failure_Rate；

responsetime < Max _ Response _ Time, where:

usedTimes is the number of Times the proxy server is Used in W, Min _ Used _ Times is the set minimum number of Times the proxy server is Used, failedRate is the Failure Rate of the proxy in Time window W, Failure _ Rate is the Failure Rate of the acceptable proxy server, responseTime is the average Response Time in Time window W, and Max _ Response _ Time is the longest Response Time of the acceptable proxy server.

2. The multithreading web crawler processing method based on connection broker optimization management of claim 1, wherein: and realizing agent task allocation by adopting a priority queue with minimum use times.

3. The multithreading web crawler processing method based on connection broker optimization management of claim 1, wherein: response time expectation of the proxy serverThe calculation formula of (a) is as follows:

wherein,the response time of the ith proxy server is shown, and N is the number of the proxy servers.

4. The multithreading web crawler processing method based on connection broker optimization management of claim 1, wherein: failure rate of the proxy serverAccording to the state queue of the proxy server and the failure times of the proxy server in the time window w, the calculation formula is as follows:

the failedTimes are failure times of the proxy server during connection execution, and the usedTimes are usage times of the proxy server.