CN106681992B

CN106681992B - Method and device for managing website login information

Info

Publication number: CN106681992B
Application number: CN201510745533.7A
Authority: CN
Inventors: 崔志伸
Original assignee: Beijing Gridsum Technology Co Ltd
Current assignee: Beijing Gridsum Technology Co Ltd
Priority date: 2015-11-05
Filing date: 2015-11-05
Publication date: 2020-12-01
Anticipated expiration: 2035-11-05
Also published as: CN106681992A

Abstract

The invention discloses a method and a device for managing website login information and relates to the field of Internet technology. It addresses a problem in the prior art: when a crawler program determines that a piece of login information is invalid, it discards it, and the discarded login information must then be processed manually, which makes managing website login information inefficient. The method mainly includes: acquiring locally stored invalid login information; judging whether the invalidation duration of the login information is greater than a preset time threshold corresponding to the login information; and, if the invalidation duration is greater than the preset time threshold, restoring the login information to valid login information. The invention is mainly applicable to scenarios in which a crawler program crawls web pages using a login credential.

Description

Method and device for managing website login information

Technical field

The invention relates to the field of Internet technologies, and in particular to a method and device for managing website login information.

Background

A web crawler is a program that automatically fetches information from the World Wide Web according to certain rules. In practice, when crawling various websites, a crawler program often encounters websites that require a login credential before their page content may be crawled. In this case, before crawling the website, the crawler program first sends login information (including a login account and password) to the website server; after receiving the login information, the website server verifies it against its verification rules; if the verification passes, the server returns a login credential to the crawler program, so that the crawler program can crawl the page content of the website using the login credential. Login information is therefore an essential condition for the crawler program to obtain a login credential.

However, in practice a login credential often cannot be obtained because the login information used has become invalid. Invalidation of login information falls into three main cases: (1) permanent invalidation; (2) invalidation for a certain period of time, after which the login information can be used normally again; (3) a request for a login credential fails because of network or other reasons, and the crawler program mistakes this for invalid login information. When the crawler program learns that a piece of login information is invalid, it discards that login information; a person then judges whether the discarded login information can be used again and, if so, adds it back to the crawler program. The whole process of managing invalid login information is thus complicated and requires manual work, which makes managing website login information inefficient.

Summary of the invention

In view of the above technical problem, the present invention proposes a method and device for managing website login information, which can solve the prior-art problem that when a crawler program determines that a piece of login information is invalid, it discards it, and the discarded login information must then be processed manually, resulting in low efficiency in managing website login information.

In one aspect, the present invention provides a method for managing website login information, the method comprising:

acquiring locally stored invalid login information;

judging whether the invalidation duration of the login information is greater than a preset time threshold corresponding to the login information;

if the invalidation duration is greater than the preset time threshold, restoring the login information to valid login information.

In another aspect, the present invention provides a device for managing website login information, the device comprising:

an acquisition unit, configured to acquire locally stored invalid login information;

a judgment unit, configured to judge whether the invalidation duration of the login information acquired by the acquisition unit is greater than a preset time threshold corresponding to the login information;

a restoration unit, configured to restore the login information to valid login information when the judgment result of the judgment unit is that the invalidation duration is greater than the preset time threshold.

With the above technical solution, the method and device for managing website login information provided by the present invention can, after the crawler program determines that a piece of login information is invalid, save it locally and check it, judging whether its invalidation duration is greater than the preset time threshold corresponding to that login information; when the invalidation duration is greater than the preset time threshold, the login information is restored to valid login information. Throughout the process of restoring validity, the crawler program does not need to discard invalid login information, and no manual handling of invalid login information is needed, which improves the efficiency of managing website login information.

The above description is only an overview of the technical solution of the present invention. So that the technical means of the invention can be understood more clearly and implemented in accordance with the specification, and so that the above and other objects, features and advantages of the invention are more readily apparent, specific embodiments of the invention are set out below.

Description of drawings

Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are for the purpose of illustrating preferred embodiments only and are not to be considered limiting of the invention. Throughout the drawings, the same components are denoted by the same reference numerals. In the drawings:

FIG. 1 shows a flowchart of a method for managing website login information provided by an embodiment of the present invention;

FIG. 2 shows a block diagram of a device for managing website login information provided by an embodiment of the present invention;

FIG. 3 shows a block diagram of another device for managing website login information provided by an embodiment of the present invention.

Detailed description

Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited by the embodiments set forth herein. Rather, these embodiments are provided so that the present disclosure will be more thoroughly understood and its scope fully conveyed to those skilled in the art.

An embodiment of the present invention provides a method for managing website login information. As shown in FIG. 1, the method includes:

101. Acquire locally stored invalid login information.

Specifically, when the crawler program cannot obtain a login credential using a piece of login information, it determines that the login information is invalid. At this point, the crawler program stores locally the invalid login information together with the time point at which it was determined to be invalid (that is, the starting invalidation time point). Whether a piece of login information is invalid can then be judged directly from whether it has a corresponding starting invalidation time point.

In addition, to make acquiring invalid login information more efficient, when the crawler program determines that a piece of login information is invalid, it may also add an invalidation mark to it, so that whether the currently acquired login information is invalid can be determined quickly from the mark.

It should be noted that the login information used to log in to a website is generally a login account and password, so the login information referred to in the embodiments of the present invention is mainly a login account and password.
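The bookkeeping described for step 101 can be pictured with a minimal Python sketch. The record layout and the names `LoginRecord`, `mark_invalid` and `is_invalid` are hypothetical illustrations, not part of the patent; the sketch only assumes that each record keeps an optional starting invalidation time point and an optional invalidation mark.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class LoginRecord:
    """Hypothetical local record for one piece of login information."""
    account: str
    password: str
    invalid_since: Optional[datetime] = None  # starting invalidation time point
    invalid_flag: bool = False                # optional invalidation mark


def mark_invalid(record: LoginRecord, now: datetime) -> None:
    """Keep the record locally instead of discarding it."""
    record.invalid_since = now
    record.invalid_flag = True


def is_invalid(record: LoginRecord) -> bool:
    """Without a mark, the presence of a starting time point is the test."""
    return record.invalid_since is not None


record = LoginRecord("alice", "secret")
mark_invalid(record, datetime(2015, 10, 1, 18, 25, 4))
```

Storing the time point at the moment of failure is what later makes the invalidation duration computable without any manual input.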

102. Judge whether the invalidation duration of the login information is greater than the preset time threshold corresponding to the login information.

The preset time threshold is the recovery period used to judge whether invalid login information can be restored to validity. In practice, the time required for invalid login information to regain validity often differs from website to website, so the preset time threshold involved in this step may differ accordingly.

Specifically, a recovery period correspondence table may be stored locally. The table records at least the correspondence between login information and preset time thresholds, and may also record information such as the URL of the corresponding website. When acquired invalid login information needs to be judged, the preset time threshold corresponding to it can be looked up in the locally stored table, and the judgment is then made against that threshold.

In addition, as mentioned in step 101, when invalid login information is recorded, its starting invalidation time point is recorded as well. The invalidation duration of the login information can therefore be derived from the starting invalidation time point and compared with the preset time threshold corresponding to the login information, in order to judge whether the login information can be restored to validity.

It should be noted that in most cases crawler developers do not know the actual recovery period set by each website, so the preset time threshold mentioned in this step may be derived from empirical statistics.

103. If the invalidation duration is greater than the preset time threshold, restore the login information to valid login information.

When the judgment result is that the invalidation duration of the login information being checked is greater than the preset time threshold, the validity of the login information has already been restored on the website server side, which means a login credential can now be obtained successfully with it. From the crawler program's point of view, however, the login information is still invalid. So that the crawler program treats it as valid, the login information must be restored to valid login information on the crawler side as well. There are various ways of letting the crawler program determine that the login information is valid; for example, a valid mark can be added to the login information to identify it as valid.

In addition, in practice the locally stored invalid login information may be checked for restorability either in real time or at scheduled intervals; the embodiments of the present invention place no limit on this.

The method for managing website login information provided by the embodiment of the present invention can, after the crawler program determines that a piece of login information is invalid, save it locally and check it, judging whether its invalidation duration is greater than the preset time threshold corresponding to that login information; when the invalidation duration is greater than the preset time threshold, the login information is restored to valid login information. Throughout the process of restoring validity, the crawler program does not need to discard invalid login information, and no manual handling of invalid login information is needed, which improves the efficiency of managing website login information.

Further, as mentioned in step 101, when the crawler program determines that a piece of login information is invalid, it may add an invalidation mark to it so that the login information can later be identified quickly as invalid. When invalid login information carries an invalidation mark, step 101 can therefore be refined as: traverse the locally stored login information; judge whether the current login information carries an invalidation mark; if it does, determine that the current login information is invalid; if it does not, determine that the current login information is not invalid. The invalidation mark is a mark added to the login information when the crawler program determines that the login information is invalid.

It should be noted that, as mentioned in the above embodiment, whether login information is invalid can be judged directly from whether it has a corresponding starting invalidation time point. However, where the crawler program adds an invalidation mark to invalid login information, it judges whether a piece of login information is invalid solely by the invalidation mark and does not consider whether a corresponding starting invalidation time point exists.

Further, when the invalidation duration of the login information is determined to be greater than the preset time threshold, it can be concluded that a login credential can be obtained successfully with the login information. To prevent the crawler program from continuing to identify it as invalid, the login information must therefore be restored to valid login information. Specifically, the invalidation mark carried by the login information may be changed to a valid mark, or the invalidation mark may simply be deleted.
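As a rough illustration of the mark-based refinement, the following Python sketch traverses locally stored records, selects those carrying an invalidation mark, and restores one by deleting the mark. The dictionary layout and the key name `invalid_flag` are assumptions made for the example; the text equally allows changing the mark to a valid mark instead of deleting it.

```python
def find_invalid(records):
    """Traverse the local records and keep those carrying the invalidation mark.

    Only the mark is consulted here; the starting invalidation time point
    is deliberately ignored, as the text describes for the marked case.
    """
    return [r for r in records if r.get("invalid_flag", False)]


def restore(record):
    """Restore validity by deleting the invalidation mark."""
    record.pop("invalid_flag", None)


# Hypothetical local store with one invalid and one valid record.
records = [
    {"account": "alice", "invalid_flag": True},
    {"account": "bob"},
]
invalid_accounts = [r["account"] for r in find_invalid(records)]
```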

Further, in practice, when the crawler program determines that a piece of login information is invalid, it saves the invalid login information locally and records the corresponding starting invalidation time point, so that the invalidation duration can later be derived from it. Judging whether the invalidation duration of the login information is greater than the corresponding preset time threshold can therefore be implemented as follows: obtain locally the starting invalidation time point corresponding to the login information; calculate the invalidation duration of the login information from the starting invalidation time point; look up locally the preset time threshold corresponding to the login information; judge whether the invalidation duration is greater than the preset time threshold.

For example, if the starting invalidation time point of the login information is 18:25:04 on October 1, 2015 and the current time is 07:50:04 on October 2, 2015, the crawler program calculates from these two time points an invalidation duration of 13 hours 25 minutes, while the preset time threshold corresponding to the login information is 10 hours. The invalidation duration has thus exceeded the preset time threshold, so the crawler program can restore the validity of the login information.
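The arithmetic of this example can be checked with Python's standard `datetime` module. The function names below are illustrative assumptions; the sketch only shows the subtraction and the strict "greater than" comparison described in the text.

```python
from datetime import datetime, timedelta


def invalidation_duration(invalid_since, now):
    """Elapsed time since the starting invalidation time point."""
    return now - invalid_since


def can_restore(invalid_since, now, threshold):
    """True only if the duration is strictly greater than the threshold."""
    return invalidation_duration(invalid_since, now) > threshold


start = datetime(2015, 10, 1, 18, 25, 4)
now = datetime(2015, 10, 2, 7, 50, 4)
threshold = timedelta(hours=10)

print(invalidation_duration(start, now))  # 13:25:00
print(can_restore(start, now, threshold))  # True
```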

Further, because the preset time threshold may be derived from empirical statistics and may deviate from the actual recovery period, the preset time threshold may turn out to be smaller than the actual recovery period. In that case the following problem can arise: when the crawler program determines that the invalidation duration of a piece of login information is greater than the preset time threshold, it restores the login information to valid login information, although the actual recovery period has not yet elapsed. If the crawler program then uses the login information to request a login credential, the request fails, and the crawler program again determines the login information to be invalid, which reduces the efficiency with which the crawler program crawls web pages. To solve this technical problem, the embodiment of the present invention proposes the following scheme:

Counting from when the login information is restored to valid login information, if the login information changes from valid to invalid within a preset time period, the preset time threshold corresponding to the login information is raised according to a first preset algorithm.

The first preset algorithm may be to multiply the current preset time threshold, for example by 2; it may also be to increase the current preset time threshold by a fixed value, for example by adding 5 minutes.
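The two variants of the first preset algorithm, and the condition that triggers them, might be sketched as follows. The function names, the default factor and step, and the choice to treat a failure exactly at the end of the preset time period as still "within" it are assumptions made for illustration.

```python
from datetime import datetime, timedelta


def raise_by_factor(threshold, factor=2):
    """Variant 1: multiply the current preset time threshold."""
    return threshold * factor


def raise_by_step(threshold, step=timedelta(minutes=5)):
    """Variant 2: add a fixed value to the current preset time threshold."""
    return threshold + step


def adjust_after_relapse(threshold, restored_at, failed_at, window):
    """Raise the threshold only if the login failed again within the window.

    Assumption: the boundary case (failure exactly at the end of the
    window) counts as inside the preset time period.
    """
    if failed_at - restored_at <= window:
        return raise_by_factor(threshold)
    return threshold


restored_at = datetime(2015, 10, 2, 8, 0, 0)
new_threshold = adjust_after_relapse(
    timedelta(hours=10),
    restored_at,
    restored_at + timedelta(minutes=10),  # failed again ten minutes later
    window=timedelta(minutes=30),
)
```

Multiplying recovers quickly from a badly underestimated threshold, while adding a fixed step overshoots less; which is preferable depends on how far off the empirical estimate is expected to be.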

In addition, the preset time threshold may turn out to be far larger than the actual recovery period, which lowers the utilization of the login information. To solve this problem, the embodiment of the present invention proposes the following scheme:

After the login information is restored to valid login information, the preset time threshold corresponding to the login information is adjusted according to a preset adjustment rule to obtain an optimal time threshold. The optimal time threshold is a threshold such that, when invalid login information is restored based on the current preset time threshold, a login credential can be obtained successfully with the login information, whereas if the current preset time threshold were lowered according to the second preset algorithm in the preset adjustment rule, a login credential could not be obtained successfully with login information restored based on the lowered threshold.

Specifically, the preset adjustment rule may be: (1) lower the preset time threshold corresponding to the login information according to the second preset algorithm; (2) if, after at least one lowering, the login information restored using the lowered preset time threshold fails for the first time to obtain a login credential, stop lowering and record the preset time threshold at which this first failure occurred (hereinafter the first time threshold); then raise the first time threshold at least once according to the first preset algorithm, until a login credential can be obtained successfully after the login information is restored using the raised preset time threshold; (3) repeat steps (1)-(2) until the following situation arises: while the raised preset time threshold is being lowered at least once, a login credential is obtained successfully after each lowering, but before lowering the threshold produced by the last of these lowerings, it is found that a further lowering would give a preset time threshold less than or equal to the maximum of the recorded first time thresholds, that is, a threshold with which the restored login information could not obtain a login credential. The crawler program then stops lowering and takes the threshold produced by the last lowering as the final preset time threshold, that is, the optimal time threshold.

Because lowering the preset time threshold by a large amount is likely to push the original preset time threshold quickly below the actual recovery period, the second preset algorithm adjusts in small steps. A common approach is to subtract a fixed value from the current preset time threshold, for example 2 minutes.

For example, suppose the preset time threshold is 8 hours and the second preset algorithm subtracts 0.5 hours from the current preset time threshold. When the invalidation duration of an invalid piece of login information exceeds 8 hours, the crawler program restores it to valid login information. If the crawler program can then obtain a login credential with the login information, the preset time threshold is lowered according to the second preset algorithm, giving 8 - 0.5 = 7.5 hours. If, after a longer period (for example one month), the login information becomes invalid again for some reason, the time used to judge whether it can be restored is now 7.5 hours; that is, once the invalidation duration exceeds 7.5 hours, the crawler program restores the login information to valid login information. If the crawler program can still obtain a login credential with the login information, the preset time threshold is lowered again according to the second preset algorithm, giving 7 hours.

Suppose the initial preset time threshold has been lowered 6 times and none of the lowered thresholds was too small, that is, a login credential was obtained successfully each time the login information was restored using the lowered threshold. After the 7th lowering, however, the login information restored using the lowered threshold fails to obtain a login credential; that is, at 4.5 hours the preset time threshold is smaller than the actual recovery period. The threshold produced by the 7th lowering must therefore be raised according to the first preset algorithm. If the first preset algorithm multiplies the current preset time threshold by 2, the raised threshold becomes 4.5 * 2 = 9 hours.

With 9 hours as the preset time threshold, a login credential can be obtained successfully after the login information is restored. The current preset time threshold is then lowered again according to the second preset algorithm. While 9 hours is lowered step by step to 5 hours, a login credential is obtained successfully after each restoration using the lowered threshold. When the crawler program is about to lower 5 hours again according to the second preset algorithm, it finds that the lowered threshold would be 4.5 hours, which was previously determined to be shorter than the actual recovery period. The crawler program therefore sets 5 hours as the final preset time threshold and makes no further adjustments.
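The lowering and raising procedure in this example can be simulated in Python. This is a sketch under stated assumptions, not the patent's implementation: `login_succeeds` is a hypothetical stand-in for a real restore-and-login attempt (each call would in practice span one full invalidation cycle), the actual recovery period of 4 hours 45 minutes is invented so that 5 hours succeeds and 4.5 hours fails, and the guard against a non-positive threshold is an added safety check.

```python
from datetime import timedelta


def tune_threshold(initial, step, login_succeeds, raise_threshold):
    """Lower the threshold step by step; raise it after a failure.

    Stops when one more lowering would reach a threshold already
    recorded as too small (a "first time threshold").
    """
    threshold = initial
    failed = []  # recorded first time thresholds
    while True:
        if login_succeeds(threshold):
            candidate = threshold - step
            if candidate <= timedelta(0):
                return threshold  # added guard: never go to zero or below
            if failed and candidate <= max(failed):
                return threshold  # optimal time threshold found
            threshold = candidate
        else:
            failed.append(threshold)
            # Raise at least once, until the login succeeds again.
            while not login_succeeds(threshold):
                threshold = raise_threshold(threshold)


# Hypothetical actual recovery period; unknown to the crawler in reality.
actual_period = timedelta(hours=4, minutes=45)

final = tune_threshold(
    initial=timedelta(hours=8),
    step=timedelta(minutes=30),
    login_succeeds=lambda t: t >= actual_period,
    raise_threshold=lambda t: t * 2,
)
print(final)  # 5:00:00
```

The trace matches the example: six successful lowerings from 8 hours down to 5 hours, a failure at 4.5 hours, a raise to 9 hours, and a second descent that stops at 5 hours because the next step would return to the recorded 4.5 hours.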

It follows that, using the above methods of raising and lowering the preset time threshold, the threshold obtained after several adjustments will be closer to the actual recovery period.

Further, permanently invalid login information will never be restored to validity on the website server side, however long it has been invalid. Therefore, counting from when the crawler program restores the login information to valid login information, if the following occurs several times in succession: within the preset time period, the login information changes from valid to invalid, then the crawler program determines that the login information is permanently invalid and discards it.
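One way to picture the permanent-invalidation check is a small counter of consecutive relapses, sketched below in Python. The class name `RelapseTracker`, the limit of three consecutive relapses, and the decision to reset the streak when a failure falls outside the preset time period are all assumptions; the text only says "several times in succession".

```python
from datetime import datetime, timedelta


class RelapseTracker:
    """Counts consecutive failures that occur shortly after a restoration."""

    def __init__(self, window, max_relapses=3):
        self.window = window              # preset time period after restoration
        self.max_relapses = max_relapses  # assumed limit, not from the source
        self.restored_at = None
        self.relapses = 0

    def on_restored(self, now):
        self.restored_at = now

    def on_invalid(self, now):
        """Return True if the login should be treated as permanently invalid."""
        if self.restored_at is not None and now - self.restored_at <= self.window:
            self.relapses += 1
        else:
            self.relapses = 0  # failure outside the window breaks the streak
        return self.relapses >= self.max_relapses


tracker = RelapseTracker(window=timedelta(hours=1))
t = datetime(2015, 10, 1, 8, 0, 0)
results = [tracker.on_invalid(t)]  # first failure, nothing restored yet
for _ in range(3):
    t += timedelta(hours=10)
    tracker.on_restored(t)
    t += timedelta(minutes=5)
    results.append(tracker.on_invalid(t))  # fails again five minutes later
```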

Further, based on the above method embodiment, another embodiment of the present invention provides an apparatus for managing website login information. As shown in FIG. 2, the apparatus includes an acquisition unit 21, a judgment unit 22, and a restoration unit 23, wherein:

the acquisition unit 21 is configured to acquire locally stored invalid login information;

the judgment unit 22 is configured to judge whether the invalidation duration of the login information acquired by the acquisition unit 21 is greater than the preset time threshold corresponding to the login information;

the restoration unit 23 is configured to restore the login information to valid login information when the judgment result of the judgment unit 22 is that the invalidation duration is greater than the preset time threshold.

With the apparatus for managing website login information provided by this embodiment of the present invention, after the crawler program determines that a piece of login information is invalid, the login information is saved locally and checked: the apparatus judges whether the invalidation duration of the login information is greater than the preset time threshold corresponding to that login information, and when it is, restores the login information to valid login information. Throughout the process of restoring validity, the crawler program does not need to discard the invalid login information, and no manual handling of the invalid login information is required, which improves the efficiency of managing website login information.

Further, as shown in FIG. 3, the acquisition unit 21 includes:

a traversal module 211, configured to traverse the locally stored login information;

a judgment module 212, configured to judge whether the current login information carries an invalidation mark, the invalidation mark being a mark added to the login information when the crawler program determines that the login information is invalid;

a determination module 213, configured to determine that the current login information is invalid login information when the judgment result of the judgment module 212 is that the current login information carries an invalidation mark.

Further, the restoration unit 23 is configured to change the invalidation mark carried by the login information into a valid mark.
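
The mark-based acquisition and restoration described above might look like the following. This is a sketch only: the `LoginInfo` record, the string values of the marks, and the function names are hypothetical, since the source does not say how the marks are stored.

```python
from dataclasses import dataclass
from typing import Iterable, List

# Mark values are illustrative; the source only says a mark is attached.
INVALID_MARK = "invalid"
VALID_MARK = "valid"


@dataclass
class LoginInfo:
    account: str
    password: str
    mark: str = VALID_MARK


def find_invalid(logins: Iterable[LoginInfo]) -> List[LoginInfo]:
    """Traverse local login information and keep entries carrying the invalidation mark."""
    return [login for login in logins if login.mark == INVALID_MARK]


def restore(login: LoginInfo) -> None:
    """Restore validity by changing the invalidation mark into a valid mark."""
    login.mark = VALID_MARK


store = [
    LoginInfo("alice", "pw1"),
    LoginInfo("bob", "pw2", mark=INVALID_MARK),
]
for login in find_invalid(store):
    restore(login)
print([login.mark for login in store])  # ['valid', 'valid']
```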

Further, as shown in FIG. 3, the judgment unit 22 includes:

an acquisition module 221, configured to acquire, from local storage, the initial invalidation time point corresponding to the login information;

a calculation module 222, configured to calculate the invalidation duration of the login information according to the initial invalidation time point acquired by the acquisition module 221;

a search module 223, configured to look up, in local storage, the preset time threshold corresponding to the login information;

a judgment module 224, configured to judge whether the invalidation duration obtained by the calculation module 222 is greater than the preset time threshold found by the search module 223.
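
The four modules of the judgment unit map naturally onto a lookup, a subtraction, a second lookup, and a comparison. The sketch below assumes, purely for illustration, that the initial invalidation time points and the per-login thresholds are kept in dictionaries keyed by account name; the source says only that both are stored locally.

```python
from datetime import datetime, timedelta
from typing import Dict


def invalidation_duration(invalid_since: datetime, now: datetime) -> timedelta:
    """Invalidation duration = current time minus initial invalidation time point."""
    return now - invalid_since


def should_restore(
    account: str,
    invalid_since: Dict[str, datetime],   # hypothetical local store
    thresholds: Dict[str, timedelta],     # hypothetical local store
    now: datetime,
) -> bool:
    """Return True when the invalidation duration exceeds the login's threshold."""
    duration = invalidation_duration(invalid_since[account], now)
    return duration > thresholds[account]


now = datetime(2015, 11, 5, 12, 0)
since = {"bob": datetime(2015, 11, 5, 2, 0)}   # invalid for 10 hours
limits = {"bob": timedelta(hours=9)}
print(should_restore("bob", since, limits, now))  # True
```

The comparison is strict, matching the "greater than" wording of the text, so a duration exactly equal to the threshold does not trigger restoration.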

Further, as shown in FIG. 3, the apparatus further includes:

an adjustment unit 24, configured to raise the preset time threshold corresponding to the login information according to the first preset algorithm if, within a preset time period counted from the moment the login information is restored to valid login information, the login information changes from valid login information to invalid login information.
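
A minimal sketch of the adjustment unit's raising rule follows. The first preset algorithm is not specified in this section, so it is passed in as a callable; the fixed 0.5-hour increment in the usage line, the function name, and the use of hours are assumptions.

```python
from typing import Callable


def maybe_raise_threshold(
    threshold_hours: float,
    hours_valid_after_restore: float,
    window_hours: float,
    raise_fn: Callable[[float], float],
) -> float:
    """Raise the threshold if the login became invalid again within the window.

    `raise_fn` stands in for the first preset algorithm (hypothetical here).
    """
    if hours_valid_after_restore <= window_hours:
        # Restored too early: the website had not actually re-enabled the login.
        return raise_fn(threshold_hours)
    return threshold_hours


print(maybe_raise_threshold(4.5, 0.2, 1.0, lambda hours: hours + 0.5))  # 5.0
```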

Further, the adjustment unit 24 is also configured to adjust the preset time threshold corresponding to the login information according to a preset adjustment rule to obtain an optimal time threshold;

wherein the optimal time threshold is a preset time threshold such that, when the invalid login information is restored to valid based on the current preset time threshold, a login credential can be successfully obtained using the login information, and, if the current preset time threshold is lowered according to the second preset algorithm in the preset adjustment rule, a login credential cannot be successfully obtained using the login information when the invalid login information is restored to valid based on the lowered preset time threshold.
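
The definition of the optimal time threshold can be stated compactly. The notation here is introduced only for illustration and does not appear in the source: $S(T)$ is true when a login credential is successfully obtained after restoring the login information with threshold $T$, and $g$ denotes the second preset algorithm, with $g(T) < T$.

```latex
T^{*} \text{ is optimal} \iff S(T^{*}) \,\wedge\, \neg S\bigl(g(T^{*})\bigr),
\qquad g(T^{*}) < T^{*}
```

In the earlier worked example, with a lowering step of 0.5 hours assumed, $T^{*} = 5$ hours because $S(5)$ holds and $S(4.5)$ does not.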

This apparatus embodiment corresponds to the foregoing method embodiment. For ease of reading, the details of the method embodiment are not repeated here one by one, but it should be clear that the apparatus in this embodiment can implement everything in the foregoing method embodiment.

The apparatus for managing website login information includes a processor and a memory. The acquisition unit, the judgment unit, the restoration unit, and the other units described above are stored in the memory as program units, and the processor executes these program units stored in the memory to implement the corresponding functions.

The processor includes a kernel, and the kernel retrieves the corresponding program unit from the memory. One or more kernels may be provided, and the efficiency with which the crawler program manages website login information is improved by adjusting kernel parameters.

The memory may include volatile memory in computer-readable media, such as random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). The memory includes at least one memory chip.

The present application also provides a computer program product which, when executed on a data processing device, is adapted to execute program code initialized with the following method steps:

judging whether the invalidation duration of the login information is greater than the preset time threshold corresponding to the login information;

if the invalidation duration is greater than the preset time threshold, restoring the login information to valid login information.

Those skilled in the art will appreciate that the embodiments of the present application may be provided as a method, a system, or a computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product implemented on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, and optical storage) containing computer-usable program code.

The present application is described with reference to flowcharts and/or block diagrams of methods, devices (systems), and computer program products according to embodiments of the present application. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce means for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means, the instruction means implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

These computer program instructions may also be loaded onto a computer or other programmable data processing device, so that a series of operational steps are performed on the computer or other programmable device to produce computer-implemented processing, such that the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.

The memory may include volatile memory in computer-readable media, such as random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.

Computer-readable media include permanent and non-permanent, removable and non-removable media, and may store information by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory media, such as modulated data signals and carrier waves.

The above are merely embodiments of the present application and are not intended to limit it. Those skilled in the art may make various modifications and variations to the present application. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present application shall fall within the scope of the claims of the present application.

Claims

1. A method for managing website login information, the method being applied to a crawler program and comprising:

acquiring locally stored invalid login information;

judging whether an invalidation duration of the login information is greater than a preset time threshold corresponding to the login information, wherein the preset time threshold is a recovery period used to judge whether the invalid login information can regain validity; and

if the invalidation duration is greater than the preset time threshold, restoring the login information to valid login information.

2. The method of claim 1, wherein acquiring the locally stored invalid login information comprises:

traversing locally stored login information;

judging whether the current login information carries an invalidation mark, wherein the invalidation mark is a mark added to the login information when the crawler program determines that the login information is invalid; and

if the current login information carries an invalidation mark, determining that the current login information is invalid login information.

3. The method of claim 2, wherein restoring the login information to valid login information comprises:

changing the invalidation mark carried by the login information into a valid mark.

4. The method of claim 1, wherein judging whether the invalidation duration of the login information is greater than the preset time threshold corresponding to the login information comprises:

acquiring, from local storage, an initial invalidation time point corresponding to the login information;

calculating the invalidation duration of the login information according to the initial invalidation time point;

looking up, in local storage, the preset time threshold corresponding to the login information; and

judging whether the invalidation duration is greater than the preset time threshold.

5. The method of claim 1, further comprising:

if, within a preset time period counted from the moment the login information is restored to valid login information, the login information changes from valid login information to invalid login information, raising the preset time threshold corresponding to the login information according to a first preset algorithm.

6. The method of claim 5, wherein after restoring the login information to valid login information, the method further comprises:

adjusting the preset time threshold corresponding to the login information according to a preset adjustment rule to obtain an optimal time threshold;

wherein the optimal time threshold is a preset time threshold such that, when the invalid login information is restored to valid based on the current preset time threshold, a login credential can be successfully obtained using the login information, and, if the current preset time threshold is lowered according to a second preset algorithm in the preset adjustment rule, a login credential cannot be successfully obtained using the login information when the invalid login information is restored to valid based on the lowered preset time threshold.

7. An apparatus for managing website login information, the apparatus being applied to a crawler program and comprising:

an acquisition unit, configured to acquire locally stored invalid login information;

a judgment unit, configured to judge whether an invalidation duration of the login information acquired by the acquisition unit is greater than a preset time threshold corresponding to the login information, wherein the preset time threshold is a recovery period used to judge whether the invalid login information can regain validity; and

a restoration unit, configured to restore the login information to valid login information when the judgment result of the judgment unit is that the invalidation duration is greater than the preset time threshold.

8. The apparatus of claim 7, wherein the acquisition unit comprises:

a traversal module, configured to traverse locally stored login information;

a judgment module, configured to judge whether the current login information carries an invalidation mark, wherein the invalidation mark is a mark added to the login information when the crawler program determines that the login information is invalid; and

a determination module, configured to determine that the current login information is invalid login information when the judgment result of the judgment module is that the current login information carries an invalidation mark.

9. The apparatus of claim 8, wherein the restoration unit is configured to change the invalidation mark carried by the login information into a valid mark.

10. The apparatus of claim 7, wherein the judgment unit comprises:

an acquisition module, configured to acquire, from local storage, an initial invalidation time point corresponding to the login information;

a calculation module, configured to calculate the invalidation duration of the login information according to the initial invalidation time point acquired by the acquisition module;

a search module, configured to look up, in local storage, the preset time threshold corresponding to the login information; and

a judgment module, configured to judge whether the invalidation duration obtained by the calculation module is greater than the preset time threshold found by the search module.

11. A storage medium comprising a stored program, wherein, when the program runs, a device in which the storage medium is located is controlled to execute the method for managing website login information according to any one of claims 1 to 6.

12. A processor configured to run a program, wherein the program, when run, executes the method for managing website login information according to any one of claims 1 to 6.