CN112417240A - Website link detection method and device and computer equipment - Google Patents
Website link detection method and device and computer equipment
- Publication number
- CN112417240A (CN202010107165.4A)
- Authority
- CN
- China
- Prior art keywords
- link
- website
- detected
- webpage
- queue
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/951—Indexing; Web crawling techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/955—Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
- G06F16/9558—Details of hyperlinks; Management of linked annotations
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Transfer Between Computers (AREA)
Abstract
The invention discloses a website link detection method and device, computer equipment, and a readable storage medium, and belongs to the technical field of the Internet. The website link detection method comprises the following steps: when a link detection instruction is received, acquiring, from a queue of links to be detected, a target link that currently needs to be detected in a website to be analyzed; acquiring a web crawler, and crawling the webpage corresponding to the target link through the web crawler; when the webpage is crawled, acquiring the link addresses and webpage information contained in the webpage through a preset content script; judging whether the target link is a valid link according to the webpage information, and adding the link addresses to the link queue when the target link is a valid link; and returning to the operation of acquiring a target link from the link queue until all links in the link queue have been detected. The invention enables comprehensive link detection.
Description
Technical Field
The invention relates to the technical field of internet, in particular to a website link detection method, a website link detection device and computer equipment.
Background
Currently, each page of a website may contain multiple links, and normally each link should lead to a website page. However, the longer a website stays online, the more likely it is that some invalid links (links that no longer lead to a website page) appear. Invalid links interfere with normal browsing and can also cause search engines to lower the website's ranking.
To reduce or avoid the influence of invalid links on normal browsing, the links in every webpage of a website need to be checked to find the invalid ones. In the prior art, links in a website are generally opened manually one by one, or invalid links are detected with an online tool. However, manual detection is very inefficient, and online tools can only check a single page; they cannot comprehensively check all directories and pages in a website.
Disclosure of Invention
In view of the above, a website link detection method, a website link detection device, a computer device, and a computer readable storage medium are provided to solve the problem that the conventional website link detection method cannot perform comprehensive link detection on all directories and pages in a website.
The invention provides a website link detection method, which comprises the following steps: when a link detection instruction is received, acquiring, from a link queue to be detected, a target link that currently needs to be detected in a website to be analyzed; acquiring a web crawler, and crawling the webpage corresponding to the target link through the web crawler; when the webpage is crawled, acquiring the link addresses and webpage information contained in the webpage through a preset content script; judging whether the target link is a valid link according to the webpage information, and adding the link addresses to the link queue when the target link is a valid link; and returning to the operation of acquiring, from the link queue to be detected, the target link that currently needs to be detected in the website to be analyzed, until all links in the link queue have been detected.
Optionally, the step of acquiring, from the link queue to be detected, the target link that currently needs to be detected in the website to be analyzed comprises: acquiring state information of each link in the link queue to be detected; and acquiring, from the link queue to be detected, a link whose state information is unprocessed as the target link.
Optionally, the acquiring the web crawler includes: detecting the state of each web crawler in the crawler group; and acquiring a web crawler in an idle state from the crawler group.
Optionally, adding the link address to the link queue comprises: searching the link addresses for links containing a preset address; and adding all links in the link addresses, except those containing the preset address, to the link queue.
Optionally, the website link detection method further includes: receiving an initial access link set by a user, and adding the initial access link to the link queue.
Optionally, the website link detection method further includes: when the target link is judged to be an invalid link according to the webpage information, marking the target link as an invalid link.
Optionally, crawling the webpage corresponding to the target link through the web crawler includes: creating an IFRAME object through the web crawler; executing an IFRAME initialization operation; registering a load event of the window in the IFRAME; and loading the webpage corresponding to the target link, wherein the webpage is considered crawled when loading completes, and not crawled when loading times out.
The invention also provides a website link detection device, which comprises: a first acquisition module, used for acquiring, from a link queue to be detected, a target link that currently needs to be detected in a website to be analyzed when a link detection instruction is received; a crawling module, used for acquiring a web crawler and crawling the webpage corresponding to the target link through the web crawler; a second acquisition module, used for acquiring the link addresses and webpage information contained in the webpage through a preset content script when the webpage is crawled; an adding module, used for judging whether the target link is a valid link according to the webpage information and adding the link addresses to the link queue when the target link is a valid link; and a returning module, used for returning to the operation of acquiring the target link from the link queue to be detected until all links in the link queue have been detected.
The invention also provides a computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the steps of the above method when executing the computer program.
The present invention also provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the above method.
The beneficial effects of the above technical scheme are as follows:
According to the website link detection method in the embodiment of the invention, after the target link that currently needs to be detected in the website to be analyzed is obtained from the link queue to be detected, the target link can serve as a detection starting point. Based on the target link, a web crawler crawls the corresponding webpage, and when the webpage is crawled, the link addresses and webpage information contained in the webpage are obtained through the preset content script. The link addresses contained in the webpage are then stored in the link queue as link addresses that need to be detected, and each of them is in turn used as a target link. In this way the link addresses contained in all pages of the website can be covered, which realizes comprehensive detection of all link addresses in the website. Meanwhile, judging the target link by its webpage information makes it possible to determine accurately whether the target link is an invalid link.
Drawings
FIG. 1 is a system block diagram of an embodiment of the website link detection method according to the present invention;
FIG. 2 is a flowchart illustrating a website link detection method according to an embodiment of the present invention;
FIG. 3 is a flowchart illustrating a detailed procedure of acquiring a target link to be detected currently in a website to be analyzed from a link queue to be detected according to the present invention;
FIG. 4 is a flowchart detailing the steps of obtaining a web crawler according to the present invention;
FIG. 5 is a flowchart illustrating a step refinement for crawling the web page corresponding to the target link by the web crawler in accordance with the present invention;
FIG. 6 is a flowchart detailing the steps of adding the link address to the link queue according to the present invention;
FIG. 7 is a block diagram of an embodiment of a website link detection apparatus according to the present invention;
FIG. 8 is a schematic hardware structure diagram of a computer device for executing a website link detection method according to an embodiment of the present invention.
Detailed Description
The advantages of the invention are further illustrated in the following description of specific embodiments in conjunction with the accompanying drawings.
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The implementations described in the exemplary embodiments below are not intended to represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims.
The terminology used in the present disclosure is for the purpose of describing particular embodiments only and is not intended to be limiting of the disclosure. As used in this disclosure and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items.
It is to be understood that although the terms first, second, third, etc. may be used herein to describe various information, such information should not be limited to these terms. These terms are only used to distinguish one type of information from another. For example, first information may also be referred to as second information, and similarly, second information may also be referred to as first information, without departing from the scope of the present disclosure. The word "if" as used herein may be interpreted as "at the time of", "when", or "in response to a determination", depending on the context.
In the description of the present invention, it should be understood that the numerical references before the steps do not identify the order of performing the steps, but merely serve to facilitate the description of the present invention and to distinguish each step, and thus should not be construed as limiting the present invention.
Fig. 1 schematically shows an application environment diagram of a website link detection method according to an embodiment of the present application. In an exemplary embodiment, the system of the application environment can include a client 102, and a server 104.
The client 102 and the server 104 communicate over a network. The client 102 is preferably the Google Chrome browser, running on a computer device. The server 104 may be implemented as a stand-alone server or as a server cluster composed of multiple servers. The client 102 integrates the website link detection device corresponding to the website link detection method of the present invention. The device can be integrated into the client 102 as a browser extension; an extension can change the browser's user interface, for example by adding a toolbar, without directly affecting the visual content of webpages. Through the added toolbar the user may configure the initial access link together with detection rules, including but not limited to a specified domain name, a specified path, a specified file extension, whether a link is an external link, the link depth, and whether a path or page contains certain characters or character strings.
The invention provides a website link detection method for solving the problem that the traditional website link detection method cannot perform comprehensive link detection on all directories and pages in a website. Fig. 2 is a flowchart illustrating a website link detection method according to an embodiment of the invention. It is to be understood that the flow charts in the embodiments of the present method are not intended to limit the order in which the steps are performed. Taking the application of the method to the client in fig. 1 as an example, the following description will be made, and the website link detection method provided in this embodiment includes:
Step S20, when the link detection instruction is received, acquiring the target link which needs to be detected currently in the website to be analyzed from the link queue to be detected.
Specifically, the website to be analyzed is a website which needs to be analyzed currently by the user, the link queue is a queue for storing the links to be analyzed, and a plurality of links can be stored in the link queue. In this embodiment, the target links to be detected are stored through the storage structure of the queue, so that when the stored links need to be acquired from the link queue, the links to be detected can be sequentially taken out from the link queue according to the sequence of the link storage time.
In this embodiment, a link queue is used to store the links to be detected. And when a link detection instruction triggered by a user is received, taking out one link from the link queue as a target link needing to be detected currently to detect. There are various ways for the user to trigger the link detection instruction, for example, the user triggers the link detection instruction by clicking a start detection button in the client. After the click operation of the user is received, a link detection instruction is generated based on the click operation. When the link detection instruction is received, the target link is obtained from the link queue to be detected.
It should be noted that, in this embodiment, the target link may be an address of one page, or may refer to a link relationship that points from one web page to another web page, where the pointed page may be another web page, or different positions on the same web page, or a page that includes pictures, files, and the like. In this embodiment, the target link is a URL (Uniform Resource Locator) of the access page.
In an embodiment, referring to fig. 3, the acquiring, from the link queue to be detected, a target link that needs to be detected currently in the website to be analyzed includes:
Step S30, obtaining status information of each link in the link queue to be detected.
Specifically, to facilitate management of the link detection process, a state attribute may be set for each link stored in the link queue. The state attribute stores the state information of the link, which describes the state the link is currently in. The states include an unprocessed state, an in-process state, and processed states, where the processed states may include a valid link, an invalid link, an access timeout, and the like.
It should be noted that, in this embodiment, the state of a link is updated as the link is detected. For example, when detection of the link starts, its state is updated from unprocessed to in process; when detection of the link completes, its state is updated again, for example to valid link or invalid link.
Step S31, acquiring a link with unprocessed status information from the link queue to be detected as the target link.
Specifically, the state information of each link in the link queue is sequentially queried to find a link of which the state information is unprocessed, and then the link is used as the target link, so that the target link is detected. In the present embodiment, when no link whose status information is unprocessed is found, the link detection processing flow is ended.
In this embodiment, the status of each link in the link queue is queried to extract the link whose status is unprocessed from the status of each link, so that the situation that the link is repeatedly detected can be avoided.
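The state-based selection of steps S30 and S31 can be sketched as follows. This is a minimal illustration rather than the patented implementation: the class name `LinkQueue`, the state strings, the method names, and the rule that a URL is stored only once are all assumptions made for the example.

```javascript
// Hypothetical state labels; the description names these states only informally.
const State = Object.freeze({
  UNPROCESSED: "unprocessed",
  PROCESSING: "processing",
  VALID: "valid",
  INVALID: "invalid",
  TIMEOUT: "timeout",
});

class LinkQueue {
  constructor() {
    this.items = []; // kept in insertion order, oldest first
  }

  // Store a link once; re-adding a known URL is ignored (assumption).
  add(url) {
    if (this.items.some((item) => item.url === url)) return false;
    this.items.push({ url, state: State.UNPROCESSED });
    return true;
  }

  // Steps S30/S31: scan the states in order and return the first
  // unprocessed link, or null when none is left.
  nextUnprocessed() {
    return this.items.find((item) => item.state === State.UNPROCESSED) || null;
  }

  // Record a new state for a stored link; unknown URLs are ignored.
  setState(url, state) {
    const item = this.items.find((entry) => entry.url === url);
    if (item) item.state = state;
  }
}
```

When `nextUnprocessed()` returns `null`, the caller would end the detection flow, as the description states.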
In order to set a link for initial detection, in an embodiment, the website link detection method further includes:
Receiving an initial access link set by a user, and adding the initial access link to the link queue.
Specifically, the client may present a user input interface through which the user sets an initial access link. For example, if the user needs to detect all links in the "sports" channel of a website to see whether an invalid link exists, and the path of that channel is http://localhost/sports/, the user may set the link http://localhost/sports/ as the initial access link through the input interface. After the initial access link set by the user is received, it is added to the link queue.
In one embodiment, the user may also set the detection rules through the input interface, such as setting which links are not to be detected.
The embodiment of the invention can detect the link addresses contained in all pages in the website by setting an initial access link as a detection starting point, then adds the detected link into the link queue, then uses each link in the link queue as each link to be detected, and repeats the detection steps to realize the comprehensive detection of all the link addresses in the website.
Step S21, acquiring a web crawler, and crawling the webpage corresponding to the target link through the web crawler.
Specifically, a web crawler is a program or script that automatically crawls web information. In this embodiment, there may be a plurality of web crawlers, so that the webpages corresponding to a plurality of target links can be crawled simultaneously, which improves crawling efficiency.
In one embodiment, referring to fig. 4, the step of acquiring the web crawler includes:
Step S40, detecting the status of each web crawler in the crawler group.
Step S41, obtaining a web crawler in an idle state from the crawler group.
Specifically, a plurality of web crawlers exist in this embodiment, and together they form a crawler group. To facilitate management of each web crawler in the crawler group, state information can be set for each web crawler. The state information includes an idle state, a timeout state, a processing state, and the like: the idle state means the web crawler is not currently crawling a webpage, the timeout state means the web crawler has timed out while crawling a webpage, and the processing state means the web crawler is currently crawling a webpage.
In this embodiment, when a webpage needs to be crawled, the states of the web crawlers in the crawler group are checked so that a web crawler in the idle state can be taken from the group to crawl the webpage. When no web crawler in the crawler group is idle, the current acquisition flow can be exited. When an idle web crawler exists, it is taken out and activated to start crawling, the target link it is crawling is recorded, and its state is set to processing.
In this embodiment, checking the state of the web crawlers ensures that the web crawler taken out is idle, avoids assigning a webpage to a web crawler that is not idle, and improves crawling efficiency.
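Steps S40 and S41 amount to a scan over the crawler group. The following sketch assumes each crawler is represented by a plain record of the form `{ id, state, target }`; that record shape, the state strings, and the function name `acquireIdleCrawler` are hypothetical and chosen only for illustration.

```javascript
// Take an idle crawler from the group and assign it a target link.
// Returns null when every crawler is busy or timed out.
function acquireIdleCrawler(crawlers, targetUrl) {
  // Step S40: inspect each crawler's state; step S41: take the first idle one.
  const crawler = crawlers.find((candidate) => candidate.state === "idle");
  if (!crawler) return null; // no idle crawler: leave the acquisition flow

  // Record the link being crawled and mark the crawler as processing.
  crawler.state = "processing";
  crawler.target = targetUrl;
  return crawler;
}
```

Returning `null` instead of waiting keeps the function non-blocking, so the caller can simply retry on its next pass over the link queue.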
In one embodiment, referring to fig. 5, the step of crawling the web page corresponding to the target link by the web crawler includes:
Step S50, creating an IFRAME object through the web crawler.
Step S51, executing an IFRAME initialization operation.
Step S52, registering a load event of the window in the IFRAME.
Step S53, loading the webpage corresponding to the target link, wherein the webpage is considered crawled when loading completes, and not crawled when loading times out.
Specifically, IFRAME, also known as an inline frame, is an HTML tag that can be used anywhere in HTML. Both it and <FRAME> can be used to display multiple pages in one window; each page is called a frame, and each frame is independent of the others.
In this embodiment, an IFRAME is created in order to send an access request to the website to be analyzed, and the IFRAME initialization operation is then performed so that the webpage corresponding to the target link can be displayed in the IFRAME. After initialization completes, a load event of the window in the IFRAME is registered, so that once the webpage corresponding to the target link has loaded, the link addresses and webpage information contained in the webpage can be acquired through the preset content script.
In one embodiment, to avoid continuing to load a webpage whose loading has already timed out, a timeout timer may be created when the load event of the window in the IFRAME is registered. The timer measures the loading time of the webpage, and loading is stopped once the time limit is exceeded.
In the embodiment of the invention, the load event of the window in the IFRAME is registered in the process of loading the webpage, so that when the webpage corresponding to the target link is a dynamic page, the link address contained in the webpage can be detected.
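Steps S50 to S53, together with the timeout timer, could look roughly like the sketch below. It is an assumption-laden illustration: the function name `crawlInIframe`, the callback shape, and hiding the frame during initialization are not taken from the description, and the sketch listens for `load` on the iframe element itself rather than on the framed window, which is a simplification.

```javascript
// doc is a Document (or a test double); onDone receives { crawled, frame }.
function crawlInIframe(doc, url, timeoutMs, onDone) {
  const frame = doc.createElement("iframe"); // S50: create the IFRAME object
  frame.style.display = "none"; // S51: initialization (hidden frame is an assumption)

  let finished = false;
  let timer = null;

  // Report only the first outcome, whichever of load or timeout comes first.
  const finish = (crawled) => {
    if (finished) return;
    finished = true;
    clearTimeout(timer);
    onDone({ crawled, frame });
  };

  // S52: register the load event; the timer covers pages that never finish loading.
  frame.addEventListener("load", () => finish(true));
  timer = setTimeout(() => finish(false), timeoutMs);

  // S53: start loading the page behind the target link.
  frame.src = url;
  doc.body.appendChild(frame);
}
```

A caller would treat `crawled === false` as the "not crawled" case, for example by marking the target link as an access timeout.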
Step S22, when the webpage is crawled, the link address and the webpage information contained in the webpage are obtained through the preset content script.
Specifically, the content script refers to Content Scripts in the Chrome browser: JavaScript scripts that run in the context of a webpage and can read the content of the webpage, including its link addresses and webpage information, where the webpage information may be a character or a character string.
In a specific application, the content script may obtain all link addresses contained in the webpage through a function of the document object. It can be understood that, in order to obtain all link addresses through this function, it is first necessary to set run_at of content_scripts in the extension's manifest.json to document_start, then register a load event of the window when the detected page starts to load, and create a timeout timer. In the callback functions of the load event and the timeout timer, all link addresses contained in the webpage can then be obtained through document.querySelectorAll('a[href]').
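As a rough sketch of such a content script, the function below collects the link addresses and the page text. The manifest fragment in the comment, the match pattern, the file name `content.js`, the use of `all_frames`, and the function name `collectPageData` are assumptions for illustration; only `run_at: "document_start"` and the `a[href]` selector come from the description.

```javascript
// Assumed manifest.json fragment (shown as a comment because it is JSON):
//   "content_scripts": [{
//     "matches": ["http://localhost/*"],
//     "js": ["content.js"],
//     "run_at": "document_start",
//     "all_frames": true
//   }]

// Collect every link address plus the text of the page.
// doc is the page's document object (or a test double with the same members).
function collectPageData(doc) {
  const anchors = Array.from(doc.querySelectorAll("a[href]"));
  return {
    // In a browser, a.href is the resolved absolute URL of the anchor.
    links: anchors.map((anchor) => anchor.href),
    text: doc.body ? doc.body.innerText : "",
  };
}
```

Passing the document in as a parameter, instead of reading the global `document`, is a design choice that lets the same function be exercised outside a browser.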
In another embodiment of the present invention, when the web page is not crawled, for example, when the web crawler crawls the web page for timeout, the target link may be marked as an invalid link or marked as an access timeout link.
It can be understood that when a web crawler times out while crawling the webpage corresponding to the target link, the web crawler stops working. To prevent it from remaining stopped indefinitely, a polling mechanism may be added: the state of each web crawler is checked at regular intervals, for example every 5 seconds, and when a web crawler is found to be in the timeout state, its state is reset so that it can continue working.
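The polling mechanism can be sketched with a periodic timer. The crawler record shape `{ state, target }`, the function names, and the choice to reset a timed-out crawler to the idle state are assumptions; the 5000 ms default mirrors the 5-second example above.

```javascript
// Reset every crawler stuck in the timeout state; returns how many were reset.
function resetTimedOutCrawlers(crawlers) {
  let resetCount = 0;
  for (const crawler of crawlers) {
    if (crawler.state === "timeout") {
      crawler.state = "idle"; // assumption: "reset" means back to idle
      crawler.target = null;
      resetCount += 1;
    }
  }
  return resetCount;
}

// Check the crawler group at a fixed interval; returns a function that stops polling.
function startCrawlerPolling(crawlers, intervalMs = 5000) {
  const handle = setInterval(() => resetTimedOutCrawlers(crawlers), intervalMs);
  return () => clearInterval(handle);
}
```

Calling the function returned by `startCrawlerPolling` once detection has finished stops the timer.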
Step S23, determining whether the target link is an effective link according to the web page information, and adding the link address to the link queue when the target link is an effective link.
Specifically, after the webpage information is acquired, it may be determined whether the webpage information includes a preset character string, for example the character string "current page does not exist" or a similar character string. If the webpage information includes the preset character string, the target link is determined to be an invalid link; if it does not, the target link is determined to be a valid link. In one embodiment, the target link is marked as an invalid link when it is determined to be invalid, and as a valid link when it is determined to be valid.
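The string-based judgment can be sketched in a few lines. The function name `classifyLink`, the marker list, and the case-insensitive comparison are assumptions; the description only requires checking for a preset character string such as "current page does not exist".

```javascript
// Hypothetical marker list; the description gives this string as one example.
const DEFAULT_INVALID_MARKERS = ["current page does not exist"];

// Return "invalid" when the page text contains any preset marker, else "valid".
function classifyLink(pageText, markers = DEFAULT_INVALID_MARKERS) {
  const text = pageText.toLowerCase(); // case-insensitive match is an assumption
  const hit = markers.some((marker) => text.includes(marker.toLowerCase()));
  return hit ? "invalid" : "valid";
}
```

Because the check relies on page text alone, a site whose error pages use different wording would need its own markers passed in.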
It should be noted that invalid links are generally called dead links, that is, links that cannot be reached. Typically such a link could be accessed normally at first, but later, because of website migration, a version change, improper operation, or similar reasons, the target page it points to no longer exists and the link can no longer be accessed; such a link is called an invalid link. Causes of link failure include: a file or page in a website is moved, so links to it become dead; webpage content is updated and placed under other links, so the original link becomes a dead link; and so on. Invalid links affect not only search engine crawling and indexing but also the search engine's evaluation of the website's weight. Generally, a search engine assigns a weight value to each website, and the weight value is passed along through link relationships. If a large number of dead links exist in a website, weight is lost inside the website, and the weight of the whole website is reduced.
In the embodiment of the present invention, after determining that the target link is a valid link, the obtained link address may be automatically added to the link queue. In one embodiment, the link address may be added to the link queue through a postMessage function, where the postMessage function is a commonly used function in Windows API (application program interface) for placing a message into the message queue.
In one embodiment, referring to fig. 6, the step of adding the link address to the link queue includes:
Step S60, finding the links containing the preset address among the link addresses.
Step S61, adding all links in the link addresses except the link containing the preset address to the link queue.
Specifically, to avoid wasting resources on links that do not need to be detected, the user may configure preset addresses in advance. A link containing a preset address is a link that the user has specified in advance as not needing detection, that is, a blacklisted link. When the link addresses are obtained, they are searched for links containing a preset address so that such links are not added to the link queue. For example, when detecting links under http://localhost/sports/ in a website, since the webpage at http://localhost/sports/arrow is still under construction, links containing http://localhost/sports/arrow may be preset for exclusion.
And after finding the links containing the preset addresses, excluding the links containing the preset addresses, and adding all the links except the excluded links into the link queue.
In this embodiment, the links including the preset addresses are excluded, so that the links corresponding to the webpage being constructed can be prevented from being mistaken as invalid links.
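Steps S60 and S61 reduce to a filter over the collected link addresses. In the sketch below, the function name `filterLinks` and the plain substring match are assumptions made for illustration.

```javascript
// Keep only the links that contain none of the preset (blacklisted) addresses.
function filterLinks(linkAddresses, excludedAddresses) {
  // S60: find links that contain any preset address; S61: keep the rest.
  return linkAddresses.filter(
    (link) => !excludedAddresses.some((excluded) => link.includes(excluded))
  );
}
```

For instance, with the preset address http://localhost/sports/arrow, every link under that path is dropped before the remaining links are added to the link queue.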
Step S24: return to the operation of obtaining, from the link queue to be detected, the target link that currently needs to be detected in the website to be analyzed, until all links in the link queue have been detected.
Specifically, after detection of the current target link is completed, the process returns to step S20, and steps S20 to S23 are repeated to detect the other links in the link queue until all links have been detected.
In one embodiment, after the detection of all the links is completed, the detection results of the links are displayed, so that the user can know which links are invalid and which links are valid.
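The overall loop of steps S20 to S24 can be sketched as follows. This is only an illustration under stated assumptions: fetch_page is a hypothetical stand-in for "crawl the page and run the content script", the result labels "valid", "invalid" and "timeout" are chosen for the example, and the step labels in the comments reflect one reading of the step order described above.

```python
from collections import deque
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical callback: returns (page_text, link_addresses) for a crawled page,
# or None when the page could not be crawled (for example, on timeout).
FetchFn = Callable[[str], Optional[Tuple[str, List[str]]]]


def detect_links(initial_link: str, fetch_page: FetchFn,
                 invalid_marker: str = "current page does not exist") -> Dict[str, str]:
    queue = deque([initial_link])
    results: Dict[str, str] = {}
    while queue:                      # S24: repeat until the queue is drained
        target = queue.popleft()      # S20: take the next target link
        if target in results:
            continue                  # already detected, skip repeated detection
        page = fetch_page(target)     # S21/S22: crawl and read the page
        if page is None:
            results[target] = "timeout"
            continue
        text, links = page
        if invalid_marker in text:    # S23: judge validity from the page information
            results[target] = "invalid"
            continue
        results[target] = "valid"
        queue.extend(link for link in links if link not in results)
    return results


# A three-page in-memory "website" used in place of real crawling.
site = {
    "http://localhost/sports/": ("Sports home", ["http://localhost/sports/a",
                                                 "http://localhost/sports/b"]),
    "http://localhost/sports/a": ("Page A", ["http://localhost/sports/"]),
    "http://localhost/sports/b": ("current page does not exist", []),
}
report = detect_links("http://localhost/sports/", site.get)
```

The returned dictionary plays the role of the displayed detection result: it maps each detected link to its final state.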
In order to facilitate understanding of the present invention, the following two application scenarios are taken as examples to illustrate the embodiments of the present invention:
Scenario one: a website operator needs to detect all links in the "sports" channel of the website to see whether any of them are invalid. The channel link is http://localhost/sports/. Because the "archery" item under the channel is still under construction, links pointing to that item, that is, links containing the character string http://localhost/sports/arrow, need to be excluded during detection.
In this scenario, the method can be implemented by the following steps:
Step 1: open the developer mode of Chrome;
Step 2: integrate the website link detection device corresponding to the website link detection method of this embodiment into Chrome as an extension;
Step 3: set the initial access link (URL), here http://localhost/sports/;
Step 4: set an exclusion rule, here excluding URLs that contain http://localhost/sports/arrow;
Step 5: start executing the website link detection method of this embodiment;
Step 6: after detection is finished, obtain the complete list of invalid links and the complete list of valid links.
Scenario two: a technician of a website needs to migrate the contents of a secondary directory to a new directory. To do so, the technician needs to (1) find out how many links in the whole website point to the old directory and (2) check whether any path conflicts exist. The old directory address is http://localhost/sports/socker, and the new directory address is http://localhost/sports/football.
In this scenario, the method can be implemented by the following steps:
Step 1: open the developer mode of Chrome;
Step 2: integrate the website link detection device corresponding to the website link detection method of this embodiment into Chrome as an extension;
Step 3: set the initial access link (URL), here http://localhost/sports/;
Step 4: start executing the website link detection method of this embodiment;
Step 5: after detection is finished, obtain the complete list of links;
Step 6: check whether the list contains any link of the form http://localhost/sports/football;
Step 7: count how many links of the form http://localhost/sports/socker are in the list.
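Steps 6 and 7 of scenario two amount to a simple pass over the finished link list. The sketch below assumes that "a link of the form X" means a link whose address starts with X; the function name summarize_migration and the sample list are hypothetical.

```python
from typing import Iterable, Tuple


def summarize_migration(links: Iterable[str], old_dir: str, new_dir: str) -> Tuple[int, bool]:
    """Count links under old_dir and report whether any link already uses new_dir."""
    all_links = list(links)
    old_count = sum(1 for link in all_links if link.startswith(old_dir))
    conflict = any(link.startswith(new_dir) for link in all_links)
    return old_count, conflict


# Hypothetical result list produced by a finished detection run.
detected = [
    "http://localhost/sports/",
    "http://localhost/sports/socker/news.html",
    "http://localhost/sports/socker/teams.html",
    "http://localhost/sports/tennis/",
]
old_count, conflict = summarize_migration(
    detected,
    old_dir="http://localhost/sports/socker",
    new_dir="http://localhost/sports/football",
)
```

Here old_count is the number of links that would need rewriting after the migration, and conflict indicates whether the new directory path is already in use.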
With the website link detection method of this embodiment, after the target link that currently needs to be detected is obtained from the link queue to be detected, the target link serves as the starting point of detection: a web crawler crawls the web page corresponding to the target link, and when the web page is crawled, the link addresses and the web page information contained in it are obtained through the preset content script. The link addresses found in this way are stored in the link queue as links to be detected and are then taken in turn as target links, so the link addresses contained in all pages of the website are covered and all link addresses in the website are detected comprehensively. In addition, judging the target link by the web page information allows an invalid link to be identified accurately.
Fig. 7 is a block diagram of a website link detection apparatus 700 according to an embodiment of the present invention.
In this embodiment, the website link detection apparatus 700 includes a series of computer program instructions stored in a memory, and when the computer program instructions are executed by a processor, the functions of the website link detection method according to the embodiments of the present invention can be implemented. In some embodiments, website link detection apparatus 700 may be divided into one or more modules based on the particular operations implemented by the portions of the computer program instructions. For example, in fig. 7, the website link detection apparatus 700 may be divided into a first obtaining module 701, a crawling module 702, a second obtaining module 703, an adding module 704, and a returning module 705.
The modules are as follows:
The first obtaining module 701 is configured to obtain, when a link detection instruction is received, the target link that currently needs to be detected in the website to be analyzed from the link queue to be detected.
Specifically, the website to be analyzed is a website which needs to be analyzed currently by the user, the link queue is a queue for storing the links to be analyzed, and a plurality of links can be stored in the link queue. In this embodiment, the target links to be detected are stored through the storage structure of the queue, so that when the stored links need to be acquired from the link queue, the links to be detected can be sequentially taken out from the link queue according to the sequence of the link storage time.
In this embodiment, a link queue is used to store the links to be detected. And when a link detection instruction triggered by a user is received, taking out one link from the link queue as a target link needing to be detected currently to detect. There are various ways for the user to trigger the link detection instruction, for example, the user triggers the link detection instruction by clicking a start detection button in the client. After the click operation of the user is received, a link detection instruction is generated based on the click operation. When the link detection instruction is received, the target link is obtained from the link queue to be detected.
It should be noted that, in this embodiment, the target link may be an address of one page, or may refer to a link relationship that points from one web page to another web page, where the pointed page may be another web page, or different positions on the same web page, or a page that includes pictures, files, and the like. In this embodiment, the target link is a URL (Uniform Resource Locator) of the access page.
In an embodiment, the first obtaining module 701 is further configured to obtain status information of each link in a link queue to be detected.
Specifically, to make the link detection process easier to manage, a state attribute may be set for each link stored in the link queue. The state attribute stores the state information of the link, which describes the state the link is currently in. The states include unprocessed, processing, and processed, where the processed state may in turn be a valid link, an invalid link, an access timeout, and so on.
It should be noted that, in this embodiment, the state of a link is updated as the link is detected. For example, when detection of the link starts, its state is updated from unprocessed to processing; when detection is complete, its state is updated again, for example to valid link or invalid link.
The first obtaining module 701 is further configured to obtain, from a link queue to be detected, a link whose state information is unprocessed as the target link.
Specifically, the state information of each link in the link queue is sequentially queried to find a link of which the state information is unprocessed, and then the link is taken as the target link, so that the target link is detected. In the present embodiment, when no link whose status information is unprocessed is found, the link detection processing flow is ended.
In this embodiment, the status of each link in the link queue is queried to extract the link whose status is unprocessed from the status of each link, so that the situation that the link is repeatedly detected can be avoided.
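The state attribute and the selection of the next unprocessed link can be sketched as below. The names LinkState, LinkEntry and next_target are assumptions made for illustration; the state values follow the states named in this embodiment.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class LinkState(Enum):
    UNPROCESSED = "unprocessed"
    PROCESSING = "processing"
    VALID = "valid"
    INVALID = "invalid"
    TIMEOUT = "timeout"


@dataclass
class LinkEntry:
    url: str
    state: LinkState = LinkState.UNPROCESSED


def next_target(queue: List[LinkEntry]) -> Optional[LinkEntry]:
    """Return the first unprocessed link, marking it as processing.

    Returns None when no unprocessed link remains, which ends the detection flow.
    """
    for entry in queue:
        if entry.state is LinkState.UNPROCESSED:
            entry.state = LinkState.PROCESSING
            return entry
    return None


link_queue = [
    LinkEntry("http://localhost/sports/", LinkState.VALID),
    LinkEntry("http://localhost/sports/football"),
]
first = next_target(link_queue)   # picks the football link
second = next_target(link_queue)  # nothing left to process
```

Because a link leaves the unprocessed state as soon as it is selected, the same link is not picked up twice.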
To set the link at which detection starts, in one embodiment the website link detection apparatus further includes:
a receiving module, configured to receive an initial access link set by a user and add the initial access link to the link queue.
Specifically, the client may present a user input interface through which the user sets the initial access link. For example, if the user needs to detect all links in the "sports" channel of a website to see whether any are invalid, and the path of the channel is http://localhost/sports/, the user may set the link http://localhost/sports/ as the initial access link through the input interface. After the initial access link set by the user is received, it is added to the link queue.
In one embodiment, the user may also set the detection rules through the input interface, such as setting which links are not to be detected.
The embodiment of the invention can detect the link addresses contained in all pages in the website by setting an initial access link as a detection starting point, then adds the detected link into the link queue, then uses each link in the link queue as each link to be detected, and repeats the detection steps to realize the comprehensive detection of all the link addresses in the website.
The crawling module 702 is configured to acquire a web crawler and crawl the web page corresponding to the target link through the web crawler.
Specifically, a web crawler is a program or script that automatically crawls web information.
In this embodiment, there may be a plurality of web crawlers, so that the web pages corresponding to a plurality of target links can be crawled at the same time, which improves crawling efficiency.
In one embodiment, the crawling module 702 is further configured to detect a status of each web crawler in the crawler group; and acquiring a web crawler in an idle state from the crawler group.
Specifically, in this embodiment there are a plurality of web crawlers, which together form a crawler group. To make the web crawlers in the group easier to manage, state information may be set for each web crawler. The states include idle, timeout, and processing: idle means the web crawler is not currently crawling a web page, timeout means the web crawler has timed out while crawling a web page, and processing means the web crawler is currently crawling a web page.
In this embodiment, when a web page needs to be crawled, the states of the web crawlers in the crawler group are checked so that a web crawler in the idle state can be taken from the group to crawl the page. If no web crawler in the group is idle, the current acquisition flow may be exited. If an idle web crawler is found, it is taken out and activated to start crawling, and the state of the target link being crawled and the state of the web crawler are both set to processing.
In this embodiment, checking the state of the web crawlers ensures that the crawler taken out is idle and avoids assigning a web page to a crawler that is not idle, which improves the efficiency of web page crawling.
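Acquiring an idle crawler from the crawler group can be sketched as follows. The names Crawler, CrawlerState and acquire_idle are assumptions for illustration; updating the state of the target link itself is left out to keep the sketch short.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class CrawlerState(Enum):
    IDLE = "idle"
    PROCESSING = "processing"
    TIMEOUT = "timeout"


@dataclass
class Crawler:
    name: str
    state: CrawlerState = CrawlerState.IDLE
    target: Optional[str] = None


def acquire_idle(group: List[Crawler], target: str) -> Optional[Crawler]:
    """Take the first idle crawler, assign it the target link and mark it as processing.

    Returns None when no crawler is idle, in which case the caller exits the
    acquisition flow and tries again later.
    """
    for crawler in group:
        if crawler.state is CrawlerState.IDLE:
            crawler.state = CrawlerState.PROCESSING
            crawler.target = target
            return crawler
    return None


group = [Crawler("c1", CrawlerState.PROCESSING), Crawler("c2")]
picked = acquire_idle(group, "http://localhost/sports/")
none_left = acquire_idle(group, "http://localhost/sports/football")
```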
In one embodiment, the crawling module 702 is further configured to: create an IFRAME object through the web crawler; execute an IFRAME initialization operation; register a load event of the window in the IFRAME; and load the web page corresponding to the target link, where the web page is regarded as crawled when it finishes loading and as not crawled when loading times out.
Specifically, IFRAME, also known as an inline frame, is an HTML tag that can be used anywhere in an HTML document. Like <FRAME>, it can be used to display multiple pages in one window; each page is called a frame, and each frame is independent of the others.
In this embodiment, an access request is sent to the website to be analyzed by creating an IFRAME. The IFRAME initialization operation is then performed so that the web page corresponding to the target link can be displayed in the IFRAME. After initialization is complete, a load event of the window in the IFRAME is registered, so that once the web page corresponding to the target link has loaded, the link addresses and web page information contained in it can be obtained through the preset content script.
In one embodiment, to avoid continuing to load a web page after its loading time has run out, a timeout timer may be created when the load event of the window in the IFRAME is registered. The timer measures the loading time of the web page, and loading is stopped when the time limit is exceeded.
In the embodiment of the invention, because the load event of the window in the IFRAME is registered while the web page is loading, the link addresses contained in the web page can be detected even when the web page corresponding to the target link is a dynamic page.
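The interplay between the load event and the timeout timer is essentially a race: whichever fires first decides whether the page counts as crawled. The sketch below reproduces only that race in Python with asyncio; it does not create an IFRAME, and load_page, fast_page and slow_page are hypothetical stand-ins for the page load.

```python
import asyncio
from typing import Awaitable, Callable


async def crawl_with_timeout(load_page: Callable[[], Awaitable[None]],
                             timeout_s: float) -> str:
    """Return "crawled" if load_page finishes within timeout_s, otherwise "timeout"."""
    try:
        await asyncio.wait_for(load_page(), timeout=timeout_s)
        return "crawled"
    except asyncio.TimeoutError:
        # The pending load is cancelled by wait_for, mirroring "stop loading".
        return "timeout"


async def fast_page() -> None:
    await asyncio.sleep(0.01)   # simulated page that loads quickly


async def slow_page() -> None:
    await asyncio.sleep(1.0)    # simulated page that loads too slowly


fast_result = asyncio.run(crawl_with_timeout(fast_page, timeout_s=0.5))
slow_result = asyncio.run(crawl_with_timeout(slow_page, timeout_s=0.05))
```

In the browser setting described above, the same decision would be made in the callbacks of the load event and the timeout timer.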
The second obtaining module 703 is configured to obtain, through a preset content script, a link address and web page information included in the web page when the web page is crawled.
Specifically, the content script is a Content Script of the Chrome browser. Content Scripts are JavaScript scripts that run in the context of a web page and can read the content of the page, including the link addresses and the web page information in it, where the web page information may be a character or a character string.
In a specific application, the content script may obtain all link addresses contained in the web page through document.querySelectorAll('a[href]'). To obtain all link addresses through this function, run_at under content_scripts in the extension's manifest.json is first set to document_start; then, when the detected page starts to load, a load event of the window is registered and a timeout timer is created, so that in the callback functions of the load event and the timeout timer all link addresses contained in the web page can be obtained through document.querySelectorAll('a[href]').
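Outside the browser, the effect of selecting anchors that carry an href attribute can be approximated with Python's standard html.parser module. This is an offline analogue for illustration, not the content script itself; the class name AnchorCollector and the sample markup are assumptions, and anchors whose href has no value are skipped here for simplicity.

```python
from html.parser import HTMLParser
from typing import List


class AnchorCollector(HTMLParser):
    """Collect the href value of every <a> tag that has one."""

    def __init__(self) -> None:
        super().__init__()
        self.links: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value is not None:
                self.links.append(value)
                break  # one address per anchor


def extract_links(html: str) -> List[str]:
    collector = AnchorCollector()
    collector.feed(html)
    return collector.links


sample = (
    '<p><a href="/sports/football">Football</a> '
    '<a name="top">no href</a> '
    '<a href="/sports/tennis">Tennis</a></p>'
)
links = extract_links(sample)
```

Unlike a content script running after the load event, this parser sees only static markup, so links inserted by page scripts would be missed.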
In another embodiment of the present invention, when the web page is not crawled, for example, when the web crawler crawls the web page for timeout, the target link may be marked as an invalid link or marked as an access timeout link.
It can be understood that when crawling the web page corresponding to the target link times out, the web crawler stops working. To keep the crawler from remaining stopped indefinitely, a polling mechanism may be added: the state of each web crawler is checked at regular intervals, for example every 5 seconds, and when a crawler is found to be in the timeout state, its state is reset so that it can continue working.
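One pass of such a polling mechanism can be sketched as below. The function name reset_timed_out and the dictionary layout are assumptions for illustration; in practice the function would be invoked by a timer at the chosen interval, such as every 5 seconds.

```python
from enum import Enum
from typing import Dict


class CrawlerState(Enum):
    IDLE = "idle"
    PROCESSING = "processing"
    TIMEOUT = "timeout"


def reset_timed_out(states: Dict[str, CrawlerState]) -> int:
    """One polling pass: return timed-out crawlers to idle and count how many were reset."""
    reset = 0
    for name, state in states.items():
        if state is CrawlerState.TIMEOUT:
            states[name] = CrawlerState.IDLE  # replacing a value keeps the key set unchanged
            reset += 1
    return reset


crawler_states = {
    "c1": CrawlerState.TIMEOUT,
    "c2": CrawlerState.PROCESSING,
}
reset_count = reset_timed_out(crawler_states)
```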
The adding module 704 is configured to determine, according to the web page information, whether the target link is a valid link, and to add the link address to the link queue when the target link is a valid link.
Specifically, after the web page information is acquired, it may be determined whether the web page information contains a preset character string, for example the character string "current page does not exist" or a similar one. If the web page information contains the preset character string, the target link is determined to be an invalid link; if it does not, the target link is determined to be a valid link. In one embodiment, the target link is marked as an invalid link when it is determined to be invalid and as a valid link when it is determined to be valid.
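The validity judgment reduces to a substring check against the preset character strings. In the sketch below, the function name is_valid_link is an assumption, and the second marker, "404 Not Found", is a hypothetical example of "a similar character string" that does not come from the embodiment.

```python
from typing import Iterable


def is_valid_link(page_info: str, invalid_markers: Iterable[str]) -> bool:
    """A link is valid when its page information contains none of the preset strings."""
    return not any(marker in page_info for marker in invalid_markers)


# The first marker is quoted from the embodiment; the second is a hypothetical addition.
MARKERS = ["current page does not exist", "404 Not Found"]

ok = is_valid_link("Welcome to the sports channel", MARKERS)
dead = is_valid_link("Sorry, the current page does not exist.", MARKERS)
```

A check of this kind depends on the wording the website uses for its error page, so the preset strings would need to match that site.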
In an embodiment, the adding module 704 is further configured to find the links containing a preset address among the link addresses, and to add all of the link addresses, except the links containing the preset address, to the link queue.
Specifically, a link containing the preset address is a link that the user has designated in advance as not requiring detection. In this embodiment, the user may first set the preset address; when the link addresses are obtained, they are searched for links containing the preset address so that such links are not added to the link queue. For example, when detecting the links under "http://localhost/sports/" in a website, the web pages under "http://localhost/sports/arrow" may still be under construction, so links containing "http://localhost/sports/arrow" can be preset for exclusion before the website is detected.
The returning module 705 is configured to return to execute an operation of obtaining a target link currently required to be detected in a website to be analyzed from a link queue to be detected until all links in the link queue are detected.
Fig. 8 schematically shows a hardware architecture diagram of a computer device 2 suitable for implementing the website link detection method according to an embodiment of the present application. In the present embodiment, the computer device 2 is a device capable of automatically performing numerical calculation and/or information processing according to instructions that are set or stored in advance. For example, the computer device may be a tablet computer, a notebook computer, a desktop computer, a rack server, a blade server, a tower server, or a cabinet server (including an independent server or a server cluster composed of a plurality of servers). As shown in Fig. 8, the computer device 2 includes at least, but is not limited to, a memory 801, a processor 802, and a network interface 803, which may be communicatively linked to each other by a system bus. Wherein:
The memory 801 includes at least one type of computer-readable storage medium, including a flash memory, a hard disk, a multimedia card, a card-type memory (e.g., SD or DX memory, etc.), a Random Access Memory (RAM), a Static Random Access Memory (SRAM), a Read-Only Memory (ROM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), a Programmable Read-Only Memory (PROM), a magnetic memory, a magnetic disk, an optical disk, and the like. In some embodiments, the memory 801 may be an internal storage module of the computer device 2, such as a hard disk or a memory of the computer device 2. In other embodiments, the memory 801 may be an external storage device of the computer device 2, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash memory card (Flash Card) provided on the computer device 2. Of course, the memory 801 may also include both internal and external storage modules of the computer device 2. In this embodiment, the memory 801 is generally used to store the operating system and the various types of application software installed in the computer device 2, such as the program code of the website link detection method in the above-described embodiments. In addition, the memory 801 can also be used to temporarily store various types of data that have been output or are to be output.
The processor 802 may be a Central Processing Unit (CPU), controller, microcontroller, microprocessor, or other data Processing chip in some embodiments. The processor 802 generally serves to control the overall operation of the computer device 2, such as to perform control and processing related to data interaction or communication with the computer device 2. In this embodiment, the processor 802 is configured to execute program codes stored in the memory 801 or process data.
The network interface 803 may include a wireless network interface or a wired network interface, and is typically used to establish a communication link between the computer device 2 and other computer devices. For example, the network interface 803 is used to connect the computer device 2 to an external terminal via a network and to establish a data transmission channel and a communication link between the computer device 2 and the external terminal. The network may be a wireless or wired network such as an intranet, the Internet, a Global System for Mobile Communications (GSM) network, a Wideband Code Division Multiple Access (WCDMA) network, a 4G network, a 5G network, Bluetooth, or Wi-Fi.
It is noted that FIG. 8 only shows a computer device having components 801-803, but it is understood that not all of the shown components are required to be implemented, and that more or fewer components may be implemented instead.
In this embodiment, the program code of the website link detection method stored in the memory 801 may be further divided into one or more program modules and executed by one or more processors (in this embodiment, the processor 802) to carry out the present invention.
The embodiment of the present application provides a non-volatile computer-readable storage medium, on which a computer program is stored, and the computer program, when executed by a processor, implements the steps of the website link detection method in the above embodiments.
In this embodiment, the computer-readable storage medium includes a flash memory, a hard disk, a multimedia card, a card type memory (e.g., SD or DX memory, etc.), a Random Access Memory (RAM), a Static Random Access Memory (SRAM), a Read Only Memory (ROM), an Electrically Erasable Programmable Read Only Memory (EEPROM), a Programmable Read Only Memory (PROM), a magnetic memory, a magnetic disk, an optical disk, and the like. In some embodiments, the computer readable storage medium may be an internal storage unit of the computer device, such as a hard disk or a memory of the computer device. In other embodiments, the computer readable storage medium may be an external storage device of the computer device, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and the like provided on the computer device. Of course, the computer-readable storage medium may also include both internal and external storage devices of the computer device. In this embodiment, the computer-readable storage medium is generally used for storing an operating system and various types of application software installed in the computer device, for example, the program codes of the website link detection methods in the above-described embodiments, and the like. Further, the computer-readable storage medium may also be used to temporarily store various types of data that have been output or are to be output.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and the parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on at least two network units. Some or all of the modules can be selected according to actual needs to achieve the purpose of the solution of the embodiments of the present application. One of ordinary skill in the art can understand and implement it without inventive effort.
From the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a general-purpose hardware platform, and can of course also be implemented by hardware. Those skilled in the art will understand that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program instructing the relevant hardware; the program can be stored in a computer-readable storage medium and, when executed, can include the processes of the method embodiments described above. The storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), or the like.
Finally, it should be noted that: the above embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present application.
Claims (10)
1. A website link detection method, characterized by comprising the following steps:
when a link detection instruction is received, acquiring, from a queue of links to be detected, a target link in a website to be analyzed that currently needs to be detected;
acquiring a web crawler, and crawling the webpage corresponding to the target link through the web crawler;
when the webpage has been crawled, acquiring the link addresses and the webpage information contained in the webpage through a preset content script;
judging, according to the webpage information, whether the target link is a valid link, and adding the link addresses to the link queue when the target link is a valid link;
and returning to the operation of acquiring, from the queue of links to be detected, the target link in the website to be analyzed that currently needs to be detected, until all links in the link queue have been detected.
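As a non-limiting illustration, the loop recited in claim 1 can be sketched in Python as follows. The function `detect_links`, the `fetch` callback that stands in for the web crawler and the content script, the `status == 200` validity rule, and the in-memory `SITE` table are all assumptions introduced for this sketch; the claim does not prescribe them. The check that skips links already detected is likewise an added assumption, so that the loop terminates on sites whose pages link to one another.

```python
from collections import deque
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical page record: (webpage information, link addresses found on the page).
Page = Tuple[Dict[str, int], List[str]]


def detect_links(start: str, fetch: Callable[[str], Optional[Page]]) -> Dict[str, bool]:
    """Drain the link queue, returning link -> True (valid) or False (invalid).

    `fetch` returns None when the page could not be crawled.
    """
    queue = deque([start])
    results: Dict[str, bool] = {}
    while queue:                       # repeat until every queued link is detected
        target = queue.popleft()       # current target link
        if target in results:          # assumption: skip links already detected
            continue
        page = fetch(target)           # crawl the page and read it
        # Assumed validity rule: the page was crawled and reports status 200.
        valid = page is not None and page[0].get("status") == 200
        results[target] = valid
        if valid:
            queue.extend(page[1])      # enqueue link addresses found on a valid page
    return results


# Hypothetical three-page site used only to exercise the loop.
SITE: Dict[str, Page] = {
    "/": ({"status": 200}, ["/a", "/b"]),
    "/a": ({"status": 200}, ["/"]),
    "/b": ({"status": 404}, ["/c"]),
}
results = detect_links("/", SITE.get)
```

In this sketch `/c` is never visited, because it is reachable only from `/b`, which is judged invalid and therefore contributes no link addresses to the queue.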
2. The website link detection method according to claim 1, wherein acquiring, from the queue of links to be detected, the target link in the website to be analyzed that currently needs to be detected comprises:
acquiring state information of each link in the queue of links to be detected;
and acquiring, from the queue of links to be detected, a link whose state information is unprocessed as the target link.
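The state-based selection of claim 2 might look like the following minimal sketch. The `LinkState` values other than "unprocessed", the `QueuedLink` record, and the choice to mark a link as being processed once it is selected are assumptions made for illustration only.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class LinkState(Enum):
    UNPROCESSED = "unprocessed"
    PROCESSING = "processing"   # assumed intermediate state
    DONE = "done"               # assumed final state


@dataclass
class QueuedLink:
    url: str
    state: LinkState = LinkState.UNPROCESSED


def next_target(queue: List[QueuedLink]) -> Optional[QueuedLink]:
    """Return the first link whose state is unprocessed, or None if there is none."""
    for link in queue:
        if link.state is LinkState.UNPROCESSED:
            # Assumption: mark the link so that it is not selected a second time.
            link.state = LinkState.PROCESSING
            return link
    return None


# Hypothetical queue used only to exercise the selection.
demo_queue = [QueuedLink("/done", LinkState.DONE), QueuedLink("/todo")]
first = next_target(demo_queue)
second = next_target(demo_queue)
```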
3. The website link detection method according to claim 1, wherein acquiring the web crawler comprises:
detecting the state of each web crawler in a crawler group;
and acquiring a web crawler in an idle state from the crawler group.
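One possible reading of claim 3 is a small pool from which an idle crawler is taken and to which it is later returned. The `Crawler` record, its `busy` flag, and the `acquire_idle` and `release` helpers are hypothetical names chosen for this sketch.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Crawler:
    name: str
    busy: bool = False   # False means the crawler is idle


def acquire_idle(group: List[Crawler]) -> Optional[Crawler]:
    """Check each crawler's state and hand out the first idle one, if any."""
    for crawler in group:
        if not crawler.busy:
            crawler.busy = True   # the crawler is now in use
            return crawler
    return None                   # every crawler in the group is busy


def release(crawler: Crawler) -> None:
    """Return a crawler to the idle state once its page has been handled."""
    crawler.busy = False


# Hypothetical two-crawler group used only to exercise the helpers.
group = [Crawler("c1", busy=True), Crawler("c2")]
picked = acquire_idle(group)
none_left = acquire_idle(group)
release(group[0])
picked_again = acquire_idle(group)
```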
4. The website link detection method according to claim 1, wherein adding the link addresses to the link queue comprises:
searching the link addresses for links containing a preset address;
and adding to the link queue all of the link addresses except those containing the preset address.
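The exclusion step of claim 4 can be sketched as a simple filter. Treating "containing a preset address" as a substring match is an assumption, as are the function name `filter_links` and the sample addresses.

```python
from typing import Iterable, List


def filter_links(link_addresses: Iterable[str], preset_addresses: Iterable[str]) -> List[str]:
    """Keep only the link addresses that contain none of the preset addresses."""
    presets = tuple(preset_addresses)
    return [
        link
        for link in link_addresses
        if not any(preset in link for preset in presets)  # assumed: substring match
    ]


# Hypothetical addresses used only to exercise the filter.
found = [
    "https://example.com/news",
    "https://example.com/logout",
    "https://ads.example.net/banner",
]
to_enqueue = filter_links(found, ["/logout", "ads.example.net"])
```

Only the links that survive the filter would then be appended to the link queue.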
5. The website link detection method according to any one of claims 1 to 4, further comprising:
and receiving an initial access link set by a user, and adding the initial access link to the link queue.
6. The website link detection method according to claim 5, further comprising:
and when the target link is judged to be an invalid link according to the webpage information, marking the target link as an invalid link.
7. The website link detection method according to claim 5, wherein crawling the webpage corresponding to the target link through the web crawler comprises:
creating an IFRAME object through the web crawler;
executing an IFRAME initialization operation;
registering a load event for the window in the IFRAME;
and loading the webpage corresponding to the target link, wherein the webpage is deemed to have been crawled when it has loaded completely, and deemed not to have been crawled when loading times out.
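The following sketch models only the final decision of claim 7, namely that a page counts as crawled if loading completes and as not crawled if loading times out. It is not browser IFRAME code: the load event is replaced by an awaitable, the timeout is enforced with Python's `asyncio.wait_for`, and the names `crawl_with_timeout`, `_fast_load`, and `_slow_load` are hypothetical.

```python
import asyncio
from typing import Awaitable, Callable


async def crawl_with_timeout(load: Callable[[], Awaitable[None]], timeout_s: float) -> bool:
    """Return True if `load` finishes within `timeout_s` seconds, else False."""
    try:
        # Stand-in for waiting on the load event of the IFRAME window.
        await asyncio.wait_for(load(), timeout=timeout_s)
        return True                   # loaded completely: treated as crawled
    except asyncio.TimeoutError:
        return False                  # loading timed out: treated as not crawled


async def _fast_load() -> None:
    await asyncio.sleep(0)            # simulated page that loads immediately


async def _slow_load() -> None:
    await asyncio.sleep(1)            # simulated page that loads too slowly


crawled_fast = asyncio.run(crawl_with_timeout(_fast_load, 0.2))
crawled_slow = asyncio.run(crawl_with_timeout(_slow_load, 0.05))
```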
8. A website link detection apparatus, comprising:
a first acquisition module, configured to acquire, from a queue of links to be detected, a target link in a website to be analyzed that currently needs to be detected, when a link detection instruction is received;
a crawling module, configured to acquire a web crawler and to crawl the webpage corresponding to the target link through the web crawler;
a second acquisition module, configured to acquire the link addresses and the webpage information contained in the webpage through a preset content script when the webpage has been crawled;
an adding module, configured to judge, according to the webpage information, whether the target link is a valid link, and to add the link addresses to the link queue when the target link is a valid link;
and a returning module, configured to return to the operation of acquiring, from the queue of links to be detected, the target link in the website to be analyzed that currently needs to be detected, until all links in the link queue have been detected.
9. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the steps of the website link detection method of any one of claims 1 to 7 when executing the computer program.
10. A computer-readable storage medium having stored thereon a computer program, characterized in that the computer program, when executed by a processor, implements the steps of the website link detection method of any one of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010107165.4A CN112417240A (en) | 2020-02-21 | 2020-02-21 | Website link detection method and device and computer equipment |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010107165.4A CN112417240A (en) | 2020-02-21 | 2020-02-21 | Website link detection method and device and computer equipment |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112417240A (en) | 2021-02-26 |
Family
ID=74844187
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010107165.4A Pending CN112417240A (en) | 2020-02-21 | 2020-02-21 | Website link detection method and device and computer equipment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112417240A (en) |
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113505287A (en) * | 2021-06-24 | 2021-10-15 | 微梦创科网络科技(中国)有限公司 | Website link detection method and system |
CN113590987A (en) * | 2021-09-29 | 2021-11-02 | 飞狐信息技术(天津)有限公司 | Link detection method and device |
CN113992378A (en) * | 2021-10-22 | 2022-01-28 | 绿盟科技集团股份有限公司 | Safety monitoring method and device, electronic equipment and storage medium |
CN114036364A (en) * | 2021-11-08 | 2022-02-11 | 北京百度网讯科技有限公司 | Method, apparatus, device, medium and product for identifying a crawler |
CN114760086A (en) * | 2022-01-24 | 2022-07-15 | 北京中交兴路信息科技有限公司 | Website page compliance detection method and device, storage medium and terminal |
CN114861101A (en) * | 2022-01-25 | 2022-08-05 | 浙江浩瀚能源科技有限公司 | A method, device, device and medium for detecting abnormal hyperlinks on portal website |
CN115459946A (en) * | 2022-08-02 | 2022-12-09 | 广州市玄武无线科技股份有限公司 | Abnormal webpage identification method, device, equipment and computer storage medium |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101561814A (en) * | 2009-05-08 | 2009-10-21 | 华中科技大学 | Topic crawler system based on social labels |
CN102469113A (en) * | 2010-11-01 | 2012-05-23 | 北京启明星辰信息技术股份有限公司 | Security gateway and method for forwarding webpage |
CN103970788A (en) * | 2013-02-01 | 2014-08-06 | 北京英富森信息技术有限公司 | Webpage-crawling-based crawler technology |
CN106326485A (en) * | 2016-09-05 | 2017-01-11 | 郑州悉知信息科技股份有限公司 | Method for detecting web link and device thereof |
CN108536691A (en) * | 2017-03-01 | 2018-09-14 | 中兴通讯股份有限公司 | Web page crawl method and apparatus |
CN110399546A (en) * | 2019-07-23 | 2019-11-01 | 中南民族大学 | Link De-weight method, device, equipment and storage medium based on web crawlers |
CN110781437A (en) * | 2019-10-28 | 2020-02-11 | 北京字节跳动网络技术有限公司 | Method and device for acquiring webpage image loading duration and electronic equipment |
2020
- 2020-02-21 CN CN202010107165.4A patent/CN112417240A/en active Pending
Cited By (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113505287A (en) * | 2021-06-24 | 2021-10-15 | 微梦创科网络科技(中国)有限公司 | Website link detection method and system |
CN113590987A (en) * | 2021-09-29 | 2021-11-02 | 飞狐信息技术(天津)有限公司 | Link detection method and device |
CN113992378A (en) * | 2021-10-22 | 2022-01-28 | 绿盟科技集团股份有限公司 | Safety monitoring method and device, electronic equipment and storage medium |
CN113992378B (en) * | 2021-10-22 | 2023-11-07 | 绿盟科技集团股份有限公司 | Security monitoring method and device, electronic equipment and storage medium |
CN114036364A (en) * | 2021-11-08 | 2022-02-11 | 北京百度网讯科技有限公司 | Method, apparatus, device, medium and product for identifying a crawler |
CN114036364B (en) * | 2021-11-08 | 2022-10-21 | 北京百度网讯科技有限公司 | Method, apparatus, device, medium, and system for identifying crawlers |
CN114760086A (en) * | 2022-01-24 | 2022-07-15 | 北京中交兴路信息科技有限公司 | Website page compliance detection method and device, storage medium and terminal |
CN114760086B (en) * | 2022-01-24 | 2023-12-05 | 北京中交兴路信息科技有限公司 | Method and device for detecting compliance of web pages, storage medium and terminal |
CN114861101A (en) * | 2022-01-25 | 2022-08-05 | 浙江浩瀚能源科技有限公司 | A method, device, device and medium for detecting abnormal hyperlinks on portal website |
CN115459946A (en) * | 2022-08-02 | 2022-12-09 | 广州市玄武无线科技股份有限公司 | Abnormal webpage identification method, device, equipment and computer storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112417240A (en) | Website link detection method and device and computer equipment | |
CN108304498B (en) | Webpage data acquisition method and device, computer equipment and storage medium | |
US9602347B2 (en) | Method, system and program for browser to switch IE kernel | |
JP5425699B2 (en) | Information processing apparatus, test case generation method, program, and recording medium | |
CN105335404B (en) | Page info loading method and device | |
RU2665920C2 (en) | Optimized visualization process in browser | |
CN109376291B (en) | A method and device for scanning website fingerprint information based on web crawler | |
CN110275705A (en) | Generate method, apparatus, equipment and the storage medium for preloading page code | |
CN104881607A (en) | XSS vulnerability detection method based on simulating browser behavior | |
CN112384940B (en) | Mechanism for WEB crawling of e-commerce resource pages | |
CN111367595B (en) | Data processing method, program running method, device and processing equipment | |
CN111090797B (en) | Data acquisition method, device, computer equipment and storage medium | |
CN112637361B (en) | Page proxy method, device, electronic equipment and storage medium | |
CN106844486A (en) | Crawl the method and device of dynamic web page | |
CN112800309A (en) | Crawler system based on HTTP proxy and its realization method | |
EP3745292A1 (en) | Hidden link detection method and apparatus for website | |
CN112632358B (en) | Resource link obtaining method and device, electronic equipment and storage medium | |
CN111221711A (en) | User behavior data processing method, server and storage medium | |
US20160034378A1 (en) | Method and system for testing page link addresses | |
CN106371987A (en) | Test method and apparatus | |
CN109815083B (en) | Application crash monitoring method and device, electronic equipment and medium | |
EP2998885A1 (en) | Method and device for information search | |
CN110719344B (en) | Domain name acquisition method and device, electronic equipment and storage medium | |
CN112835779A (en) | Test case determination method and device and computer equipment | |
CN111291288A (en) | Webpage link extraction method and system |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20210226 |