CN112256529A - Web crawler monitoring method and device, computer equipment and storage medium - Google Patents

Web crawler monitoring method and device, computer equipment and storage medium

Info

Publication number
CN112256529A
Authority
CN
China
Prior art keywords
data
web crawler
crawler
web
abnormal
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011137903.6A
Other languages
Chinese (zh)
Inventor
沈天诗
林进兴
伍庭波
李彦威
宁达楚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Yougarage Network Technology Development Shenzhen Co ltd
Original Assignee
Yougarage Network Technology Development Shenzhen Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Yougarage Network Technology Development Shenzhen Co ltd
Priority to CN202011137903.6A
Publication of CN112256529A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/3051 Monitoring arrangements for monitoring the configuration of the computing system or of the computing system component, e.g. monitoring the presence of processing resources, peripherals, I/O links, software programs
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/34 Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation; Recording or statistical evaluation of user activity, e.g. usability assessment
    • G06F11/3409 Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation; Recording or statistical evaluation of user activity, e.g. usability assessment for performance assessment
    • G06F11/3419 Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation; Recording or statistical evaluation of user activity, e.g. usability assessment for performance assessment by assessing time

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Quality & Reliability (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Hardware Design (AREA)
  • Computing Systems (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The application relates to a web crawler monitoring method, a web crawler monitoring device, computer equipment and a storage medium. The method comprises: obtaining, from a database at a preset frequency, the latest data in the data table corresponding to each web crawler, where the latest data includes its warehousing time point (the time at which it was stored in the database); determining the warehousing duration of the latest data from the current time point and the warehousing time point of the latest data; when the warehousing duration is greater than or equal to the crawling interval corresponding to the web crawler, judging that the running condition of the web crawler is abnormal; and, when the running condition of the web crawler is abnormal, issuing an abnormality alarm. The method can improve monitoring efficiency.

Description

Web crawler monitoring method and device, computer equipment and storage medium
Technical Field
The present application relates to the field of computer technologies, and in particular, to a web crawler monitoring method and apparatus, a computer device, and a storage medium.
Background
A web crawler is a program or script that automatically captures information from a resource platform (e.g., a web page or an application) according to certain rules. With the rapid development of the internet and the arrival of the information era, the number of websites keeps growing and the structure of website pages keeps changing, so the number of web crawlers that must be deployed on a server keeps increasing. If a web crawler stops running and its administrator does not learn of this in time, losses result. It is therefore important to monitor the running conditions of a large number of web crawlers effectively.
In the traditional method, the running state of a crawler is generally determined from multidimensional information such as the network state, the IP address of the crawler server and the website page structure. However, this method requires modifying the code of the original web crawler, and the conditions encountered during monitoring are complex and variable, so monitoring efficiency is low.
Disclosure of Invention
In view of the foregoing, it is desirable to provide a web crawler monitoring method, apparatus, computer device and storage medium capable of improving monitoring efficiency.
A web crawler monitoring method, the method comprising:
obtaining, from a database at a preset frequency, the latest data in the data table corresponding to each web crawler, the latest data including the warehousing time point of the latest data;
determining the warehousing duration of the latest data according to the current time point and the warehousing time point of the latest data;
when the warehousing duration is greater than or equal to the crawling interval corresponding to the web crawler, judging that the running condition of the web crawler is abnormal; and
when the running condition of the web crawler is abnormal, issuing an abnormality alarm.
In one embodiment, the method further comprises:
when the latest data in the data table corresponding to the web crawler is not acquired, judging that the running condition of the web crawler is abnormal.
In one embodiment, the method further comprises:
acquiring data in a data table corresponding to each web crawler from the database;
performing multi-dimensional statistical analysis according to the data and the warehousing time points of the data to obtain the statistical result of each web crawler under each dimension;
and visually displaying the statistical result under each dimension.
In one embodiment, performing multi-dimensional statistical analysis according to the data and the warehousing time points of the data to obtain the statistical result of each web crawler under each dimension includes:
performing multi-dimensional statistical analysis according to the data and the warehousing time points of the data, and determining at least one of the following statistical results: total data volume, newly added data volume in the previous statistical period, contemporaneous data volume in the current statistical period, ring ratio (period-over-period change) of the contemporaneous data volume across two consecutive statistical periods, latest crawling time point, crawler state, data table primary key, data table index, statistical time point, and whether the web crawler is a newly added crawler.
In one embodiment, the visually displaying the statistical results in the dimensions includes at least one of the following steps:
displaying the statistical result under each dimension in a data form;
and displaying the statistical result corresponding to each web crawler through a statistical chart.
In one embodiment, after issuing the abnormality alarm when the running condition of the web crawler is abnormal, the method further includes:
detecting the state of a resource platform corresponding to the web crawler;
and when the state of the resource platform is closed, stopping the operation of the web crawler.
In one embodiment, displaying the statistical result under each dimension in a data form includes:
displaying, in a data form, the number of web crawlers whose running condition is normal, the number of web crawlers whose running condition is abnormal, and the number of web crawlers that have stopped running.
A web crawler monitoring apparatus, the apparatus comprising:
a data acquisition module, configured to obtain, from the database at a preset frequency, the latest data in the data table corresponding to each web crawler, the latest data including the warehousing time point of the latest data;
a warehousing duration determining module, configured to determine the warehousing duration of the latest data according to the current time point and the warehousing time point of the latest data;
a running condition determining module, configured to judge that the running condition of the web crawler is abnormal when the warehousing duration is greater than or equal to the crawling interval corresponding to the web crawler; and
an alarm module, configured to issue an abnormality alarm when the running condition of the web crawler is abnormal.
A computer device comprising a memory and a processor, wherein the memory stores a computer program, and the computer program, when executed by the processor, causes the processor to perform the steps of the web crawler monitoring method according to the embodiments of the present application.
A computer-readable storage medium, having a computer program stored thereon, which, when executed by a processor, causes the processor to perform the steps of the web crawler monitoring method according to embodiments of the present application.
According to the web crawler monitoring method, device, computer equipment and storage medium, the latest data in the data table corresponding to each web crawler is obtained from the database at a preset frequency, and the warehousing duration of the latest data is then determined from the current time point and the warehousing time point of the latest data. When the warehousing duration is greater than or equal to the crawling interval corresponding to the web crawler, the running condition of the web crawler is judged to be abnormal. In this way the running condition of the web crawler can be determined quickly and abnormalities can be found in time, which improves monitoring efficiency. When the running condition of the web crawler is abnormal, an abnormality alarm is issued so that the abnormality is reported promptly, which further improves monitoring efficiency.
Drawings
FIG. 1 is a diagram of an application environment of a web crawler monitoring method in one embodiment;
FIG. 2 is a flow diagram that illustrates a web crawler monitoring method in one embodiment;
FIG. 3 is a schematic of a crawl interval and a warehousing duration in one embodiment;
FIG. 4 is a schematic diagram illustrating an overall flowchart of a web crawler monitoring method according to an embodiment;
FIG. 5 is a diagram illustrating statistical results in the form of data according to one embodiment;
FIG. 6 is a diagram illustrating statistical results by a histogram in one embodiment;
FIG. 7 is a diagram illustrating statistical results by a pie chart according to an embodiment;
FIG. 8 is a block diagram showing the structure of a web crawler monitoring apparatus according to an embodiment;
FIG. 9 is a block diagram showing the construction of a crawler monitoring apparatus according to another embodiment;
FIG. 10 is a diagram showing an internal structure of a computer device according to an embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
The web crawler monitoring method provided by the application can be applied to the application environment shown in fig. 1, in which the terminal 102 and the server 104 communicate with each other via a network. The server 104 may run a web crawler. The terminal 102 may retrieve data from the database of the server 104 and determine the running condition of the web crawler based on the retrieved data. The terminal 102 may be, but is not limited to, a personal computer, notebook computer, smart phone, tablet computer or portable wearable device, and the server 104 may be implemented as an independent server or as a server cluster formed by a plurality of servers.
In one embodiment, as shown in fig. 2, a web crawler monitoring method is provided. Taking its application to the terminal 102 in fig. 1 as an example, the method includes the following steps:
S202: obtaining, from the database at a preset frequency, the latest data in the data table corresponding to each web crawler; the latest data includes the warehousing time point of the latest data.
The preset frequency refers to how often data is acquired. A web crawler is a program or script that automatically captures information from a resource platform according to certain rules; the resource platform may be, for example, a web page or an application. The latest data refers to the piece of data most recently stored in the data table. The warehousing time point refers to the time point at which the latest data was stored in the database.
It can be understood that each web crawler corresponds to one data table in the database, and the data stored in the database by each web crawler is stored in the data table corresponding to the web crawler.
In one embodiment, the preset frequency may specify a time interval between acquisitions. The terminal can obtain the latest data in the data table corresponding to each web crawler from the database at the time interval specified by the preset frequency. The time interval may be expressed in units of minutes, hours, days, weeks, or the like. For example, data may be acquired once every 20 hours, or once every other day.
In another embodiment, the preset frequency may specify a fixed time point at which data is acquired. The terminal can obtain the latest data in the data table corresponding to each web crawler from the database each time the time point specified by the preset frequency is reached. The time point may be a certain time of each day; for example, data may be acquired once at 0:00 every day.
It can be understood that the terminal needs to acquire the latest data in the database after the web crawler has run. Therefore, however the preset frequency is set, the acquisition time must fall after the web crawler has run. For example, the terminal can acquire the latest data in the database at 0:00 every day, by which time the web crawler has finished running.
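As a minimal sketch of the acquisition in step S202, the following Python example reads the newest warehousing time point from one crawler's data table. The use of SQLite, the table names `crawler_a` and `crawler_b`, the column name `stored_at`, and the ISO-8601 text encoding of timestamps are all assumptions made for illustration; the application does not specify a database engine or schema.

```python
import sqlite3
from datetime import datetime
from typing import Optional


def fetch_latest_warehousing_time(
    conn: sqlite3.Connection, table: str, time_column: str = "stored_at"
) -> Optional[datetime]:
    """Return the newest warehousing time point in `table`, or None if it is empty."""
    # Table and column names cannot be bound as SQL parameters, so they are
    # checked to be plain identifiers before being interpolated.
    for name in (table, time_column):
        if not name.isidentifier():
            raise ValueError(f"unsafe identifier: {name!r}")
    # ISO-8601 strings sort chronologically, so MAX() yields the newest one.
    row = conn.execute(f"SELECT MAX({time_column}) FROM {table}").fetchone()
    return datetime.fromisoformat(row[0]) if row[0] is not None else None


# Hypothetical demo data: one populated table and one empty table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE crawler_a (payload TEXT, stored_at TEXT)")
conn.execute("CREATE TABLE crawler_b (payload TEXT, stored_at TEXT)")
conn.executemany(
    "INSERT INTO crawler_a VALUES (?, ?)",
    [("x", "2020-10-17T01:00:00"), ("y", "2020-10-17T02:00:00")],
)
latest_a = fetch_latest_warehousing_time(conn, "crawler_a")
latest_b = fetch_latest_warehousing_time(conn, "crawler_b")
```

A `None` result corresponds to the case in which the latest data is not acquired. Scheduling the call at the preset frequency is left to whatever scheduler the deployment uses.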
In one embodiment, when the terminal acquires the latest data, step S204 is executed.
In another embodiment, when the terminal does not acquire the latest data, the terminal may determine that the operation status of the web crawler is abnormal.
S204: determining the warehousing duration of the latest data according to the current time point and the warehousing time point of the latest data.
The warehousing duration refers to the length of time for which the latest data has been stored in the database.
Specifically, the terminal may determine the warehousing duration of the latest data as the time difference between the current time point and the warehousing time point of the latest data.
For example, if the current time point is 3 o'clock and the warehousing time point of the latest data is 2 o'clock, the time difference is 1 hour, so the warehousing duration of the latest data is 1 hour.
As shown in fig. 3, the period from the warehousing time point of the latest data to the current time point is the warehousing duration of the latest data.
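The 3 o'clock and 2 o'clock example can be reproduced with a short Python sketch. The function name `warehousing_duration` and the calendar date used here are hypothetical; only the two clock times come from the description.

```python
from datetime import datetime, timedelta


def warehousing_duration(now: datetime, warehousing_time: datetime) -> timedelta:
    """Time elapsed between the warehousing time point and the current time point."""
    return now - warehousing_time


# Current time point 3:00, warehousing time point 2:00 (the date is arbitrary).
duration = warehousing_duration(
    datetime(2020, 10, 17, 3, 0), datetime(2020, 10, 17, 2, 0)
)
```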
S206: when the warehousing duration is greater than or equal to the crawling interval corresponding to the web crawler, judging that the running condition of the web crawler is abnormal.
The crawling interval refers to how often the web crawler captures information from the resource platform.
In one embodiment, the crawling interval may be the time difference between two adjacent time points at which the web crawler starts to crawl information, or the time difference between two adjacent time points at which data is warehoused (i.e., stored in the database). As shown in fig. 3, the time between the previous warehousing time point and the warehousing time point of the latest data is the crawling interval.
Specifically, when the warehousing duration corresponding to the web crawler is greater than or equal to the crawling interval corresponding to the web crawler, the terminal may determine that the running status of the web crawler is abnormal.
It can be understood that when the warehousing duration of the latest data is greater than or equal to the crawling interval corresponding to the web crawler, the web crawler has not stored newly captured data in the database for at least one full crawling interval; therefore, the terminal can determine that the operating condition of the web crawler is abnormal.
In an embodiment, when the warehousing duration corresponding to the web crawler is less than the crawling interval corresponding to the web crawler, the terminal may determine that the running status of the web crawler is normal.
It can be understood that the terminal judges whether the running status of each web crawler is abnormal according to the warehousing duration and the crawling interval corresponding to that web crawler. For example, when the warehousing duration corresponding to web crawler A is greater than or equal to the crawling interval corresponding to web crawler A, the terminal may determine that the operating condition of web crawler A is abnormal. When the warehousing duration corresponding to web crawler B is less than the crawling interval corresponding to web crawler B, the terminal can judge that the running condition of web crawler B is normal.
S208: when the running condition of the web crawler is abnormal, issuing an abnormality alarm.
In one embodiment, when the operation condition of the web crawler is abnormal, the terminal may send alarm information to inform a worker that the operation condition of the web crawler is abnormal. In one embodiment, the terminal may send the alarm information by any one of mail, short message, WeChat, telephone, and the like.
In one embodiment, when the operation condition of the web crawler is abnormal, the terminal may issue the abnormality alarm by emitting an alarm prompt tone. In one embodiment, the alarm prompt tone may be any one of alarm music, an alarm sound, an alarm voice, and the like.
In one embodiment, after receiving the abnormality alarm, the worker may perform maintenance on the web crawler.
In the web crawler monitoring method, the terminal obtains the latest data in the data table corresponding to each web crawler from the database at a preset frequency, and then determines the warehousing duration of the latest data from the current time point and the warehousing time point of the latest data. When the warehousing duration is greater than or equal to the crawling interval corresponding to the web crawler, the running condition of the web crawler is judged to be abnormal. In this way the running condition of the web crawler can be determined quickly and abnormalities can be found in time, which improves monitoring efficiency. When the running condition of the web crawler is abnormal, an abnormality alarm is issued so that the abnormality is reported promptly, which further improves monitoring efficiency and helps keep the web crawler running stably. In addition, the web crawlers can be monitored simply and conveniently without modifying them, and the method extends easily to a plurality of crawlers, which improves the convenience, efficiency and extensibility of monitoring.
In one embodiment, the method further comprises: and when the latest data in the data table corresponding to the web crawler is not acquired, judging that the running condition of the web crawler is abnormal.
Specifically, when the latest data in the data table corresponding to the web crawler is not acquired, it indicates that the web crawler does not capture information for a long time, and therefore, the terminal may determine that the operating condition of the web crawler is abnormal.
In this embodiment, the terminal may quickly determine that the operation status of the web crawler is abnormal when the terminal does not acquire the latest data in the data table corresponding to the web crawler, so that the monitoring efficiency is improved.
Fig. 4 is a schematic overall flow chart of the web crawler monitoring method in the present application. After the web crawler has run, the terminal can acquire the latest data in the data table corresponding to the web crawler from the database, where the latest data includes the warehousing time point. When the latest data does not exist (i.e., the latest data is not acquired), the terminal may determine that the operation status of the web crawler is abnormal. When the latest data exists (i.e., the latest data is acquired), the terminal can calculate the time difference between the current time point and the warehousing time point of the latest data; when the time difference is greater than or equal to the crawling interval, the terminal can judge that the operation condition of the web crawler is abnormal, and when the time difference is smaller than the crawling interval, the terminal can judge that the operation condition of the web crawler is normal. When the operation condition of the web crawler is abnormal, the terminal can issue an abnormality alarm.
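The overall flow can be sketched in Python as follows. This is an illustration under stated assumptions, not the applicant's implementation: the function names `check_crawler` and `monitor`, the status strings, the crawler names, the 24-hour crawling intervals and the callback-style alarm are all hypothetical.

```python
from datetime import datetime, timedelta
from typing import Callable, Dict, List, Optional


def check_crawler(
    latest: Optional[datetime], now: datetime, crawl_interval: timedelta
) -> str:
    """Classify one crawler from the warehousing time point of its latest data."""
    if latest is None:
        # No latest data was acquired: treated as abnormal.
        return "abnormal"
    duration = now - latest  # warehousing duration
    return "abnormal" if duration >= crawl_interval else "normal"


def monitor(
    latest_by_crawler: Dict[str, Optional[datetime]],
    intervals: Dict[str, timedelta],
    now: datetime,
    alarm: Callable[[str], None],
) -> Dict[str, str]:
    """Check every crawler against its own crawling interval and alarm on abnormal ones."""
    statuses: Dict[str, str] = {}
    for name, interval in intervals.items():
        status = check_crawler(latest_by_crawler.get(name), now, interval)
        statuses[name] = status
        if status == "abnormal":
            alarm(name)
    return statuses


# Hypothetical demo: the alarm simply records the crawler name in a list.
alerts: List[str] = []
statuses = monitor(
    {
        "crawler_a": datetime(2020, 10, 16, 2, 0),  # stored 25 hours ago
        "crawler_b": datetime(2020, 10, 17, 2, 0),  # stored 1 hour ago
        "crawler_c": None,                          # nothing acquired
    },
    {name: timedelta(hours=24) for name in ("crawler_a", "crawler_b", "crawler_c")},
    datetime(2020, 10, 17, 3, 0),
    alerts.append,
)
```

In a deployment, the `alarm` callback would be replaced by one of the channels the description lists (mail, short message, WeChat, telephone, or a prompt tone).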
In one embodiment, the method further comprises: acquiring data in the data table corresponding to each web crawler from the database; performing multi-dimensional statistical analysis according to the data and the warehousing time point of each piece of data to obtain the statistical result of each web crawler in each dimension; and visually displaying the statistical result in each dimension.
Specifically, the terminal can acquire all data in the data table corresponding to each web crawler from the database, perform multi-dimensional statistical analysis according to all the data and the warehousing time point of each piece of data to obtain the statistical result of each web crawler in each dimension, and visually display the statistical result in each dimension. It can be understood that each piece of data acquired from the database includes the warehousing time point corresponding to that piece of data.
In an embodiment, the terminal may obtain all data in the data table corresponding to each web crawler from the database at a second preset frequency, perform multi-dimensional statistical analysis according to all the data and the warehousing time point of each piece of data to obtain the statistical result of each web crawler in each dimension, and visually display the statistical result in each dimension.
In one embodiment, the second preset frequency may be any one of once per day, once per week, once per month, and the like. For example: the terminal can acquire all data in the data table corresponding to each web crawler from the database every day, then carries out multi-dimensional statistical analysis according to all data and the warehousing time point of each piece of data to obtain the statistical result of each web crawler under each dimension, and then carries out visual display on the statistical result under each dimension, namely, the terminal can update the displayed statistical result every day.
In one embodiment, the dimension of the statistical analysis may include at least one of a total data volume dimension, a newly added data volume dimension, a contemporaneous data volume dimension, a data volume comparison dimension, a time dimension, a crawler status dimension, a crawler attribute dimension, a data table setting dimension, and a resource platform quantity dimension, among others.
In one embodiment, the statistics under the total data volume dimension may include the total data volume.
In one embodiment, the statistics in the newly added data volume dimension may include the newly added data volume in the specified statistical period. The unit of the statistical period may be at least one of hours, days, weeks, months, and years. For example, the statistics in the newly added data volume dimension may include the newly added data volume of the previous month, of today, or of yesterday.
In one embodiment, the statistics in the contemporaneous data volume dimension may include the contemporaneous data volume within the specified statistical period. The unit of the statistical period may be at least one of week, month, year, and the like. For example, the statistics may include the contemporaneous data volume of the previous month, of the previous year, or of the current month. For example, if today is October 17, the contemporaneous data volume of the previous month is the data volume from September 1 to September 17, and the contemporaneous data volume of the current month is the data volume from October 1 to October 17. The contemporaneous data volume of the previous year is defined analogously.
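The date windows in the October 17 example can be sketched in Python as below. The sketch assumes that "contemporaneous" means the span from the first day of the month up to the same day of the month; the function names, the year 2020 and the clamping of the day for shorter months (for instance, March 31 mapping to the last day of February) are assumptions not stated in the description.

```python
import calendar
from datetime import date
from typing import Tuple


def current_month_window(today: date) -> Tuple[date, date]:
    """First day of the current month through today, inclusive."""
    return today.replace(day=1), today


def previous_month_window(today: date) -> Tuple[date, date]:
    """First day of the previous month through the same day of that month."""
    if today.month == 1:
        year, month = today.year - 1, 12
    else:
        year, month = today.year, today.month - 1
    # Clamp the day when the previous month is shorter (an assumption).
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, 1), date(year, month, min(today.day, last_day))


today = date(2020, 10, 17)  # the year is arbitrary
this_window = current_month_window(today)
prev_window = previous_month_window(today)
```

Counting the rows whose warehousing time point falls inside each window then gives the two contemporaneous data volumes.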
In one embodiment, the statistical result in the data volume comparison dimension may include the ring ratio (period-over-period change) of the contemporaneous data volume between two adjacent statistical periods. The unit of the statistical period may be at least one of hours, days, weeks, months, and years. For example, the statistics may include the ring ratio between the contemporaneous data volume of the previous month and that of the current month, i.e., (contemporaneous data volume of the current month - contemporaneous data volume of the previous month) / contemporaneous data volume of the previous month.
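The ring ratio formula can be checked with a short Python sketch. The function name `ring_ratio` and the handling of a zero previous-period volume are assumptions; the two input figures are the monthly contemporaneous data volumes reported in the description of Fig. 5.

```python
def ring_ratio(current: float, previous: float) -> float:
    """(current - previous) / previous, as a fraction rather than a percentage."""
    if previous == 0:
        # The description does not say how an empty previous period is handled.
        raise ValueError("previous-period volume is zero; ring ratio is undefined")
    return (current - previous) / previous


# Current month 128240, previous month 345829 (figures from the Fig. 5 description).
ratio = ring_ratio(128240, 345829)
print(f"{ratio:.2%}")  # -62.92%
```

The result is negative because the current-month volume is lower than the previous-month volume; its magnitude matches the 62.92% shown for Fig. 5.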
In one embodiment, the statistics in the time dimension may include at least one of a most recent crawl time point, a statistical time point, and the like. The latest crawling time point refers to a time point of the crawling data closest to the time of the statistical analysis. The statistical time point refers to a time point for acquiring data of each web crawler from the database and performing statistical analysis.
In one embodiment, the statistics in the crawler status dimension may include whether the web crawler's operational status is normal or abnormal.
In one embodiment, the statistics in the crawler attribute dimension may include whether the web crawler is a newly added web crawler.
In one embodiment, the statistics in the data table setup dimension may include an index or primary key in the data table corresponding to the web crawler.
In one embodiment, the statistics in the resource platform number dimension may include at least one of a total number of resource platforms crawled by the web crawler and a number of resource platforms crawled by the web crawler within a statistics period. For example, the statistical result in the resource platform quantity dimension can be the quantity of the resource platforms crawled by the web crawler yesterday.
In this embodiment, the terminal can carry out multidimensional statistical analysis on the data in the data tables respectively corresponding to the web crawlers acquired from the database, and visually display the statistical results, so that the operation conditions of the web crawlers can be visually displayed, the working personnel can know the operation conditions of the web crawlers in time, and the monitoring efficiency of the web crawlers is improved. In addition, the multidimensional statistical result is displayed, the web crawler can be comprehensively monitored, and the monitoring comprehensiveness is improved.
In one embodiment, performing multi-dimensional statistical analysis according to the data and the warehousing time point of each piece of data to obtain the statistical result of each web crawler in each dimension includes: determining, according to the data and the warehousing time points of the data, at least one of the following statistical results: total data volume, newly added data volume in the previous statistical period, contemporaneous data volume in the current statistical period, ring ratio of the contemporaneous data volume across two consecutive statistical periods, latest crawling time point, crawler state, data table primary key, data table index, monitoring time length, and whether the web crawler is a newly added crawler.
Here, the total data volume is the total amount of data captured by the web crawler. The newly added data volume in the previous statistical period is the amount of data captured by the web crawler in the previous statistical period. The contemporaneous data volume in the previous statistical period is the amount of data captured by the web crawler in the corresponding portion of the previous statistical period. The contemporaneous data volume in the current statistical period is the amount of data captured by the web crawler in the current statistical period. The ring ratio of the contemporaneous data volume across two consecutive statistical periods is the ring ratio between the contemporaneous data volumes of two adjacent statistical periods.
Specifically, the terminal may obtain all data in the data table corresponding to each web crawler from the database at the second preset frequency, and then determine, according to all the data and the warehousing time point of each piece of data, at least one of the following statistical results for each web crawler: total data volume, newly added data volume in the previous statistical period, contemporaneous data volume in the current statistical period, ring ratio of the contemporaneous data volume across two consecutive statistical periods, latest crawling time point, crawler state, data table primary key, data table index, monitoring time length, and whether the web crawler is a newly added crawler.
In one embodiment, the time unit of the statistical period may be at least one of hours, days, weeks, months, years, and the like. In one embodiment, the newly added data volume in the previous statistical period may be the newly added data volume of the previous month. In another embodiment, the newly added data volume in the previous statistical period may be the newly added data volume of the previous day (i.e., yesterday's newly added data volume).
In one embodiment, the contemporaneous data volume in the previous statistical period may be the contemporaneous data volume of the previous month. In one embodiment, the contemporaneous data volume in the current statistical period may be the contemporaneous data volume of the current month. In one embodiment, the period-over-period ratio (ring ratio) of the contemporaneous data volumes in two consecutive statistical periods may be the ratio between the contemporaneous data volumes of the previous month and the current month.
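As a rough illustration of how such statistics might be derived from warehousing time points alone, the following Python sketch computes a few of the dimensions for one crawler, using a monthly statistical period. The function name `crawler_stats`, the month-to-date reading of "contemporaneous", and the relative-change form of the period-over-period ratio are assumptions made for this example; the embodiments above do not define them.

```python
from datetime import datetime, timedelta


def crawler_stats(warehousing_times, now):
    """Summarise one crawler's data table from its rows' warehousing time points."""
    today = now.replace(hour=0, minute=0, second=0, microsecond=0)
    yesterday = today - timedelta(days=1)
    this_month = today.replace(day=1)
    last_month = (this_month - timedelta(days=1)).replace(day=1)
    # Assumption: "contemporaneous" means the same elapsed span of each month,
    # capped so the previous-month window never spills into the current month.
    prev_end = min(last_month + (now - this_month), this_month)

    current = sum(1 for t in warehousing_times if this_month <= t < now)
    previous = sum(1 for t in warehousing_times if last_month <= t < prev_end)
    return {
        "total": len(warehousing_times),
        "new_yesterday": sum(1 for t in warehousing_times if yesterday <= t < today),
        "contemporaneous_current": current,
        "contemporaneous_previous": previous,
        # Assumption: ratio expressed as relative change; None when previous is 0.
        "ring_ratio": (current - previous) / previous if previous else None,
        "latest_crawl": max(warehousing_times, default=None),
    }


# Hypothetical warehousing time points for a single crawler.
rows = [datetime(2020, 9, 3), datetime(2020, 9, 5), datetime(2020, 9, 7),
        datetime(2020, 9, 25), datetime(2020, 10, 2), datetime(2020, 10, 9, 12)]
stats = crawler_stats(rows, now=datetime(2020, 10, 10))
```

Because every value is computed from timestamps already stored with the data, a sketch like this needs no change to the crawlers themselves.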
In one embodiment, the crawler state may indicate whether the running condition of the web crawler is normal or abnormal.
It will be appreciated that the data table primary key and the data table index are included in the statistics so that it can be determined whether the primary key or index is set reasonably.
In this embodiment, the terminal can perform multidimensional statistical analysis on the web crawler to obtain multidimensional statistical results, thereby improving the comprehensiveness of monitoring the web crawler.
In one embodiment, visually displaying the statistics in each dimension comprises at least one of the following steps: displaying the statistical result under each dimension in a data form; and displaying the statistical result corresponding to each web crawler through a statistical chart.
In one embodiment, the terminal may present the statistical results in each dimension in data form. Fig. 5 is a schematic interface diagram showing the statistical results in each dimension in data form. The figure shows that the cumulative crawled data volume (i.e., the total data volume) is 28136532, the cumulative number of crawled platforms (i.e., the number of resource platforms crawled) is 20, the contemporaneous data volume of the current month is 128240, the contemporaneous data volume of the previous month is 345829, the period-over-period ratio (ring ratio) between the contemporaneous data volumes of the previous month and the current month is 62.92%, yesterday's data volume is 6071, the number of platforms crawled yesterday is 4, the number of normal crawler tables is 12, the number of abnormal crawler tables is 3, and the number of stopped crawler tables is 5.
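The text does not state how the 62.92% figure is computed. One reading that matches the displayed values is the relative change from the previous month's contemporaneous volume to the current month's, shown unsigned in the interface. The short Python check below assumes that formula; the name `relative_change` is hypothetical.

```python
def relative_change(previous, current):
    """Relative change from the previous period's volume to the current one."""
    return (current - previous) / previous


# Values taken from the Fig. 5 description above.
change = relative_change(previous=345829, current=128240)
print(f"{change:.2%}")  # prints "-62.92%"
```

Under this reading, the current month's contemporaneous volume is about 62.92% lower than the previous month's.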
In one embodiment, the terminal may show the statistical result corresponding to each web crawler through a statistical chart. The terminal may display the statistical results of different web crawlers in visually distinct form within the chart, so that the statistical results of each web crawler can be viewed intuitively.
In one embodiment, the statistical chart may be at least one of a histogram, a pie chart, a line chart, and the like.
Fig. 6 is a schematic diagram showing the statistical results corresponding to the web crawlers through a histogram. It shows, as bars, the historical data volume, the newly added data volume of the previous month, and the contemporaneous data volume of the current month for web crawlers A, B, C, D, and E.
Fig. 7 is a schematic diagram showing the statistical results corresponding to the web crawlers through a pie chart. It shows yesterday's newly added data volume for web crawlers A, B, C, and D.
In this embodiment, the terminal may display the statistical result in at least one of a data form and a statistical chart form, so as to visually display the monitoring result of the web crawler, so that the worker can visually and timely know the condition of the web crawler, and the monitoring efficiency of the web crawler is improved.
In one embodiment, after the step of issuing an anomaly alarm when the running condition of the web crawler is abnormal, the method further includes: detecting the state of the resource platform corresponding to the web crawler; and when the state of the resource platform is closed, stopping the operation of the web crawler.
Specifically, when the terminal determines that the running condition of the web crawler is abnormal, the terminal may issue an anomaly alarm and detect the state of the resource platform corresponding to the web crawler. When the resource platform is closed, the terminal may stop the operation of the web crawler. When the resource platform is open, the terminal does not stop the operation of the web crawler.
In this embodiment, the terminal may detect the state of the resource platform corresponding to the web crawler that operates abnormally, and stop the operation of the web crawler when the resource platform is closed, thereby avoiding waste of system resources.
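The alarm-then-check sequence described above can be sketched as follows. This is a minimal Python illustration; `PlatformState`, `Crawler`, and `handle_abnormal` are hypothetical names, and how the platform state is actually detected is left outside the sketch because the embodiment does not specify it.

```python
from dataclasses import dataclass
from enum import Enum


class PlatformState(Enum):
    OPEN = "open"
    CLOSED = "closed"


@dataclass
class Crawler:
    name: str
    running: bool = True

    def stop(self):
        self.running = False


def handle_abnormal(crawler, platform_state, alarm):
    """Alarm on an abnormal crawler, then stop it only if its platform is closed."""
    alarm(f"web crawler {crawler.name} is running abnormally")
    if platform_state is PlatformState.CLOSED:
        crawler.stop()
    return crawler.running


alerts = []
still_running = handle_abnormal(Crawler("A"), PlatformState.CLOSED, alerts.append)
```

A crawler whose platform is still open keeps running after the alarm, so staff can investigate the cause.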
In one embodiment, presenting statistics in the form of data for each dimension includes: and displaying the number of the web crawlers with normal running conditions, the number of the web crawlers with abnormal running conditions and the number of the web crawlers which stop running in a data form.
Specifically, the terminal may respectively show the number of web crawlers whose running conditions are normal, the number of web crawlers whose running conditions are abnormal, and the number of web crawlers that stop running in a data form.
As shown in Fig. 5, the data-form display shows a normal crawler table count of 12 (that is, 12 web crawlers are running normally), an abnormal crawler table count of 3 (that is, 3 web crawlers are running abnormally), and a stopped crawler table count of 5 (that is, 5 web crawlers have stopped running).
In this embodiment, the terminal can show the number of the web crawlers whose operation states are normal and abnormal and the number of the web crawlers which stop operating respectively, so that the monitoring result of the web crawlers can be shown visually, a worker can know the condition of the web crawlers timely and visually, and the monitoring efficiency is improved.
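A tally of this kind can be produced with a simple count over per-crawler states. The Python sketch below is illustrative only: the state labels, the crawler names, and the function `count_by_state` are assumptions, and the sample data is built to reproduce the counts shown in Fig. 5.

```python
from collections import Counter


def count_by_state(crawler_states):
    """Count crawlers per state, always reporting all three states."""
    counts = Counter(crawler_states.values())
    return {state: counts.get(state, 0) for state in ("normal", "abnormal", "stopped")}


# Hypothetical crawler names; the counts mirror the Fig. 5 example.
states = {f"crawler_{i}": "normal" for i in range(12)}
states.update({f"crawler_{i}": "abnormal" for i in range(12, 15)})
states.update({f"crawler_{i}": "stopped" for i in range(15, 20)})
summary = count_by_state(states)
```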
It should be understood that, although the steps in the flowchart of Fig. 2 are shown in the order indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated otherwise herein, the steps are not strictly limited to the order shown and may be performed in other orders. Moreover, at least some of the steps in Fig. 2 may include multiple sub-steps or stages. These are not necessarily performed at the same time and may be performed at different times, and they are not necessarily performed in sequence; they may be performed in turn or alternately with other steps, or with at least some of the sub-steps or stages of other steps.
In one embodiment, as shown in fig. 8, there is provided a web crawler monitoring apparatus 800 comprising: a data acquisition module 802, a warehousing duration determination module 804, an operating condition determination module 806, and an alarm module 808, wherein:
the data acquisition module 802 is configured to acquire, according to a preset frequency, the latest data in the data table corresponding to each web crawler from the database; the latest data includes the warehousing time point of the latest data.
The warehousing duration determination module 804 is configured to determine the warehousing duration of the latest data according to the current time point and the warehousing time point of the latest data.
The operating condition determination module 806 is configured to determine that the running condition of the web crawler is abnormal when the warehousing duration is greater than or equal to the crawling interval corresponding to the web crawler.
The alarm module 808 is configured to issue an anomaly alarm when the running condition of the web crawler is abnormal.
In one embodiment, the operating condition determination module 806 is further configured to determine that the running condition of the web crawler is abnormal when no latest data is obtained from the data table corresponding to the web crawler.
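The division of work among these modules can be sketched in a few lines of Python. The function names and the in-memory dictionaries standing in for the database are assumptions for illustration; only the comparison rule (warehousing duration greater than or equal to the crawling interval, or no latest data at all) comes from the description.

```python
from datetime import datetime, timedelta
from typing import Callable, Dict, List, Optional


def warehousing_duration(now: datetime, latest: datetime) -> timedelta:
    """Time elapsed since the latest data was stored."""
    return now - latest


def is_abnormal(now: datetime, latest: Optional[datetime], interval: timedelta) -> bool:
    """Abnormal if there is no latest data, or it is at least one crawl interval old."""
    if latest is None:
        return True
    return warehousing_duration(now, latest) >= interval


def monitor(latest_by_crawler: Dict[str, datetime],
            intervals: Dict[str, timedelta],
            now: datetime,
            alarm: Callable[[str], None]) -> List[str]:
    """Check every crawler once and raise an alarm for each abnormal one."""
    abnormal = []
    for name, interval in intervals.items():
        if is_abnormal(now, latest_by_crawler.get(name), interval):
            alarm(name)
            abnormal.append(name)
    return abnormal


now = datetime(2020, 10, 22, 12, 0)
latest = {"A": datetime(2020, 10, 22, 11, 30), "B": datetime(2020, 10, 22, 9, 0)}
intervals = {"A": timedelta(hours=1), "B": timedelta(hours=2), "C": timedelta(hours=1)}
alerted = []
abnormal = monitor(latest, intervals, now, alerted.append)
```

In this example crawler B is flagged because its latest data is three hours old against a two-hour interval, and crawler C is flagged because no latest data was found for it.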
In one embodiment, the web crawler monitoring apparatus 800 further comprises:
the statistical module 810 is configured to obtain data in the data table corresponding to each web crawler from the database; perform multidimensional statistical analysis according to each piece of data and its warehousing time point to obtain the statistical result of each web crawler in each dimension; and visually display the statistical result in each dimension.
In one embodiment, the statistical module 810 is further configured to perform multidimensional statistical analysis according to each piece of data and its warehousing time point, and to determine, for each web crawler, at least one of the following statistical results: the total data volume, the newly added data volume in the previous statistical period, the contemporaneous data volume in the current statistical period, the period-over-period ratio (ring ratio) of the contemporaneous data volumes in two consecutive statistical periods, the latest crawling time point, the crawler state, the data table primary key, the data table index, the statistical time point, and whether the web crawler is a newly added crawler.
In one embodiment, the statistics module 810 is further configured to perform at least one of the following steps: displaying the statistical result under each dimension in a data form; and displaying the statistical result corresponding to each web crawler through a statistical chart.
In one embodiment, as shown in fig. 9, the web crawler monitoring apparatus 800 further includes:
a web crawler control module 812, configured to detect a state of a resource platform corresponding to a web crawler; and when the state of the resource platform is closed, stopping the operation of the web crawler.
In one embodiment, the statistical module 810 is further configured to display, in data form, the number of web crawlers whose running condition is normal, the number of web crawlers whose running condition is abnormal, and the number of web crawlers that have stopped running.
In the above web crawler monitoring apparatus, the terminal acquires the latest data in the data table corresponding to each web crawler from the database according to a preset frequency, and then determines the warehousing duration of the latest data according to the current time point and the warehousing time point of the latest data. When the warehousing duration is greater than or equal to the crawling interval corresponding to the web crawler, the running condition of the web crawler is determined to be abnormal. In this way, the running condition of the web crawler can be determined quickly and abnormal conditions can be found in time, which improves monitoring efficiency. When the running condition of the web crawler is abnormal, an anomaly alarm is issued, so that the abnormality is reported promptly, which further improves monitoring efficiency and helps ensure stable operation of the web crawler. In addition, monitoring is achieved simply, without modifying the web crawlers themselves, and is easy to extend to multiple crawlers, which improves the convenience, efficiency, and scalability of monitoring.
For specific limitations of the web crawler monitoring apparatus, reference may be made to the above limitations of the web crawler monitoring method, which are not repeated here. The modules in the web crawler monitoring apparatus may be implemented wholly or partially by software, hardware, or a combination thereof. The modules may be embedded in, or independent of, a processor in the computer device in hardware form, or may be stored in a memory of the computer device in software form, so that the processor can call and execute the operations corresponding to the modules.
In one embodiment, a computer device is provided, which may be a terminal, and its internal structure diagram may be as shown in fig. 10. The computer device includes a processor, a memory, a communication interface, a display screen, and an input device connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The communication interface of the computer device is used for carrying out wired or wireless communication with an external terminal, and the wireless communication can be realized through WIFI, an operator network, NFC (near field communication) or other technologies. The computer program is executed by a processor to implement a web crawler monitoring method. The display screen of the computer equipment can be a liquid crystal display screen or an electronic ink display screen, and the input device of the computer equipment can be a touch layer covered on the display screen, a key, a track ball or a touch pad arranged on the shell of the computer equipment, an external keyboard, a touch pad or a mouse and the like.
Those skilled in the art will appreciate that the architecture shown in Fig. 10 is merely a block diagram of some of the structures associated with the disclosed aspects and does not limit the computer devices to which the disclosed aspects apply; a particular computer device may include more or fewer components than those shown, may combine certain components, or may have a different arrangement of components.
In one embodiment, a computer device is further provided, which includes a memory and a processor, the memory stores a computer program, and the processor implements the steps of the above method embodiments when executing the computer program.
In an embodiment, a computer-readable storage medium is provided, on which a computer program is stored which, when being executed by a processor, carries out the steps of the above-mentioned method embodiments.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program instructing the relevant hardware. The computer program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the method embodiments described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein can include at least one of non-volatile and volatile memory. Non-volatile memory may include Read-Only Memory (ROM), magnetic tape, floppy disk, flash memory, optical storage, or the like. Volatile memory can include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM can take many forms, such as Static Random Access Memory (SRAM) or Dynamic Random Access Memory (DRAM), among others.
The technical features of the above embodiments can be combined arbitrarily. For the sake of brevity, not all possible combinations of these technical features are described; however, as long as a combination of technical features involves no contradiction, it should be considered to be within the scope of this specification.
The above embodiments express only several implementations of the present application. Their description is relatively specific and detailed, but it should not be construed as limiting the scope of the invention. It should be noted that a person skilled in the art can make several variations and modifications without departing from the concept of the present application, and these all fall within the scope of protection of the present application. Therefore, the scope of protection of this patent shall be subject to the appended claims.

Claims (10)

1. A web crawler monitoring method, the method comprising:
according to the preset frequency, obtaining the latest data in the data table corresponding to each web crawler from the database; the latest data comprises the warehousing time point of the latest data;
determining the warehousing duration of the latest data according to the current time point and the warehousing time point of the latest data;
when the warehousing duration is greater than or equal to the crawling interval corresponding to the web crawler, judging that the running condition of the web crawler is abnormal;
and when the running condition of the web crawler is abnormal, performing abnormal alarm.
2. The method of claim 1, further comprising:
and when the latest data in the data table corresponding to the web crawler is not acquired, judging that the running condition of the web crawler is abnormal.
3. The method of claim 1, further comprising:
acquiring data in a data table corresponding to each web crawler from the database;
performing multi-dimensional statistical analysis according to the data and the warehousing time points of the data to obtain the statistical result of each web crawler under each dimension;
and visually displaying the statistical result under each dimension.
4. The method according to claim 3, wherein the performing multidimensional statistical analysis according to each of the data and the warehousing time point of each of the data to obtain the statistical result of each of the web crawlers in each of the dimensions comprises:
and performing multidimensional statistical analysis according to the data and the warehousing time points of the data, and determining at least one statistical result of total data volume, newly increased data volume in the previous statistical period, contemporaneous data volume in the statistical period, contemporaneous data volume ring ratio in two continuous statistical periods, latest crawling time point, crawler state, data table main key, data table index, statistical time point and whether the web crawler is a newly increased crawler.
5. The method of claim 3, wherein the visually presenting the statistics in each dimension comprises at least one of:
displaying the statistical result under each dimension in a data form;
and displaying the statistical result corresponding to each web crawler through a statistical chart.
6. The method according to claim 5, wherein after the performing an anomaly alarm when the operating condition of the web crawler is abnormal, the method further comprises:
detecting the state of a resource platform corresponding to the web crawler;
and when the state of the resource platform is closed, stopping the operation of the web crawler.
7. The method of claim 6, wherein the presenting the statistics in the form of data for each dimension comprises:
and displaying the number of the web crawlers with normal running conditions, the number of the web crawlers with abnormal running conditions and the number of the web crawlers which stop running in a data form.
8. A web crawler monitoring apparatus, the apparatus comprising:
the data acquisition module is used for acquiring the latest data in the data table corresponding to each web crawler from the database according to the preset frequency; the latest data comprises the warehousing time point of the latest data;
the warehousing duration determining module is used for determining the warehousing duration of the latest data according to the current time point and the warehousing time point of the latest data;
the operation condition determining module is used for judging that the operation condition of the web crawler is abnormal when the warehousing duration is greater than or equal to the crawling interval corresponding to the web crawler;
and the alarm module is used for giving an abnormal alarm when the running condition of the web crawler is abnormal.
9. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor, when executing the computer program, implements the steps of the method of any of claims 1 to 7.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 7.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011137903.6A CN112256529A (en) 2020-10-22 2020-10-22 Web crawler monitoring method and device, computer equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011137903.6A CN112256529A (en) 2020-10-22 2020-10-22 Web crawler monitoring method and device, computer equipment and storage medium

Publications (1)

Publication Number Publication Date
CN112256529A true CN112256529A (en) 2021-01-22

Family

ID=74264675

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011137903.6A Pending CN112256529A (en) 2020-10-22 2020-10-22 Web crawler monitoring method and device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112256529A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113220549A (en) * 2021-04-01 2021-08-06 深圳市猎芯科技有限公司 Crawler data monitoring method, system, computer equipment and storage medium
CN113835957A (en) * 2021-09-22 2021-12-24 上海妙一生物科技有限公司 Crawler task monitoring method and device

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101355455A (en) * 2008-09-12 2009-01-28 中兴通讯股份有限公司 Alarm system and method for service management platform
US20110082897A1 (en) * 2009-10-05 2011-04-07 Tynt Multimedia Inc. Systems and methods for deterring traversal of domains containing network resources
CN103248625A (en) * 2013-04-27 2013-08-14 北京京东尚科信息技术有限公司 Monitoring method and system for abnormal operation of web crawler
CN107302469A (en) * 2016-04-14 2017-10-27 北京京东尚科信息技术有限公司 The real time monitoring apparatus and method updated for Distributed Services cluster system data
CN107403005A (en) * 2017-07-24 2017-11-28 浙江极赢信息技术有限公司 A kind of web publishing method and device
CN109298987A (en) * 2017-07-25 2019-02-01 北京国双科技有限公司 A kind of method and device detecting web crawlers operating status

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20210122
