
CN109561163A - The generation method and device of uniform resource locator rewriting rule - Google Patents

The generation method and device of uniform resource locator rewriting rule

Info

Publication number
CN109561163A
CN109561163A CN201710892706.7A CN201710892706A CN109561163A CN 109561163 A CN109561163 A CN 109561163A CN 201710892706 A CN201710892706 A CN 201710892706A CN 109561163 A CN109561163 A CN 109561163A
Authority
CN
China
Prior art keywords
url
parameter
prefix
rewriting rule
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201710892706.7A
Other languages
Chinese (zh)
Other versions
CN109561163B (en)
Inventor
张旭俊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba Group Holding Ltd
Original Assignee
Alibaba Group Holding Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba Group Holding Ltd
Priority to CN201710892706.7A
Publication of CN109561163A
Application granted
Publication of CN109561163B
Current legal status: Active
Anticipated expiration

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/14 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1408 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic by monitoring network traffic
    • H04L63/1425 Traffic logging, e.g. anomaly detection
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L61/00 Network arrangements, protocols or services for addressing or naming
    • H04L61/09 Mapping addresses
    • H04L61/10 Mapping addresses of different types
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L61/00 Network arrangements, protocols or services for addressing or naming
    • H04L61/45 Network directories; Name-to-address mapping
    • H04L61/4505 Network directories; Name-to-address mapping using standardised directories; using standardised directory access protocols
    • H04L61/4511 Network directories; Name-to-address mapping using standardised directories; using standardised directory access protocols using domain name system [DNS]
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/02 Network architectures or network communication protocols for network security for separating internal from external traffic, e.g. firewalls
    • H04L63/0227 Filtering policies
    • H04L63/0263 Rule management
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/14 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1441 Countermeasures against malicious traffic
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L2101/00 Indexing scheme associated with group H04L61/00
    • H04L2101/60 Types of network addresses
    • H04L2101/604 Address structures or formats

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Security & Cryptography (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Computer Hardware Design (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • General Business, Economics & Management (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

This application provides a method and a device for generating uniform resource locator (URL) rewriting rules. The method comprises: obtaining a target URL set of a target website, the target website being the website for which URL rewriting rules are to be generated; obtaining a parameter set of mutually corresponding prefix parameters and resource parameters in the target URL set, wherein each resource parameter is a subpath of its prefix parameter; and generating a URL rewriting rule set for the target website according to the parameter set. With the embodiments of this application, the URL rewriting rules of a website can be derived automatically from its access log, without manual involvement.

Description

The generation method and device of uniform resource locator rewriting rule
Technical field
This application relates to the field of Internet data processing technology, and in particular to a method and device for generating uniform resource locator (Uniform Resource Locator, URL) rewriting rules, a URL scanning method and scanner based on URL rewriting rules, and a network device.
Background
To ensure the security of a website, a scanner is usually used to perform a security scan of the website. The scanner can take the website's web access log as its input source, deduplicate the URLs, and then scan the parameters under each URL. A web access log may contain thousands of URLs, and a large number of different URLs may represent the same scan path, because URLs may also contain meaningless parameters that are not part of the path. In this case the scanner still has to scan each of these URLs separately, which makes scanning inefficient.
In the prior art, to improve scanner efficiency, developers can manually configure URL rewriting rules for a website. The rules indicate which parameters in a URL are meaningless, so that after the original URLs are mapped according to the rules, the many URLs that represent the same scan path can be filtered out and only one URL is kept for the scanner to scan.
Summary of the invention
However, the inventor found during research that manually configuring URL rewriting rules requires developers to inspect all the web logs and configure the rules by hand based on what they observe. Because the volume of web log data is large, this approach is error-prone, and it also consumes considerable manpower and material resources.
In view of this, this application provides a method and device for generating URL rewriting rules, a URL scanning method and scanner based on URL rewriting rules, and a network device, which automatically analyze the URLs in a website's web access log according to certain rules and thereby generate the website's URL rewriting rules without manual involvement.
To solve the above problem, this application discloses a method for generating URL rewriting rules, the method comprising:
obtaining a target URL set of a target website, the target website being the website for which uniform resource locator (URL) rewriting rules are to be generated;
obtaining a parameter set of mutually corresponding prefix parameters and resource parameters in the target URL set, wherein each resource parameter is a subpath of its prefix parameter;
generating a URL rewriting rule set of the target website according to the parameter set.
Wherein, obtaining the target URL set of the target website comprises:
preprocessing the initial URL set in the access log of the target website to obtain the target URL set.
Wherein, preprocessing the initial URL set in the access log of the target website to obtain the target URL set comprises:
filtering out, according to Hypertext Transfer Protocol (HTTP) status codes, the invalid URLs corresponding to invalid URL requests from the initial URL set in the access log of the target website;
normalizing the initial URL set from which the invalid URLs have been filtered to obtain a normalized URL set, each normalized URL in the normalized URL set comprising a domain name, a path and a file name;
deduplicating the normalized URL set to obtain the target URL set.
Wherein, obtaining the parameter set of prefix parameters and resource parameters in the target URL set comprises:
splitting each target URL in the target URL set on a preset separator to obtain the character array corresponding to each target URL;
determining, according to the order in which the character strings in the character array form the target URL, the corresponding prefix parameters and resource parameters in each target URL, so as to obtain the parameter set.
Wherein, determining the corresponding prefix parameters and resource parameters in each target URL according to the order in which the character strings in the character array form the target URL comprises:
taking any one character array as the current array and executing an array loop process, the array loop process comprising:
obtaining, in front-to-back order, the first character string in the current array as the current prefix parameter;
saving the current prefix parameter and the resource parameter immediately following it, in correspondence with each other, to an initial parameter set;
judging whether the current prefix parameter is in an initial URL rewriting rule set; if so, combining the current prefix parameter with a preset rewrite parameter to form an updated prefix parameter; if not, combining the current prefix parameter with the resource parameter immediately following it to form the updated prefix parameter;
taking the updated prefix parameter as the current prefix parameter and performing again the step of saving the current prefix parameter and the resource parameter immediately following it to the initial parameter set, until all character strings of the current array have been processed;
judging whether all character arrays have been processed; if not, taking any unprocessed character array as the current array and triggering the array loop process;
if so, taking the initial parameter set as the target parameter set corresponding to the target URL set.
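The array loop process above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function name `collect_parameters`, the use of "/" to join a prefix with the next segment, and the placeholder `${dynamic}` as the preset rewrite parameter are all choices made for the example.

```python
from collections import defaultdict

# Assumed placeholder for the "preset rewrite parameter".
REWRITE_PARAM = "${dynamic}"


def collect_parameters(target_urls, rules, sep="/"):
    """Map each prefix parameter to the set of resource parameters seen under it.

    target_urls: normalized URLs such as "a.com/search/winter/2"
    rules:       the current (initial) URL rewriting rule set, a set of prefixes
    """
    params = defaultdict(set)
    for url in target_urls:
        segments = [s for s in url.split(sep) if s]  # one character array per URL
        if not segments:
            continue
        prefix = segments[0]                         # first string: the domain name
        for resource in segments[1:]:
            params[prefix].add(resource)             # save prefix -> resource
            if prefix in rules:
                # Prefix is already a rule: extend it with the rewrite parameter.
                prefix = prefix + sep + REWRITE_PARAM
            else:
                # Otherwise extend it with the actual resource parameter.
                prefix = prefix + sep + resource
    return dict(params)
```

With an empty rule set, every segment extends the prefix literally; once a prefix such as `a.com/search` is in the rule set, all URLs below it share the single prefix `a.com/search/${dynamic}`.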
Wherein, generating the URL rewriting rule set of the target website according to the path parameters and non-path parameters comprises:
for each prefix parameter, judging whether the number of resource parameters under the prefix parameter is greater than a preset threshold; if so, adding the prefix parameter to the initial URL rewriting rule set to obtain an updated URL rewriting rule set, until the initial URL rewriting rule set is no longer updated;
determining the updated URL rewriting rule set as the target URL rewriting rule set.
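The threshold test and the "until no longer updated" condition can be read as a fixed-point iteration, sketched below. The sketch assumes that the parameter set is recomputed against the updated rule set after every round, which this passage does not spell out; the helper `_collect`, the name `generate_rules`, the "/" separator and the `${dynamic}` placeholder are likewise illustrative assumptions.

```python
from collections import defaultdict

REWRITE_PARAM = "${dynamic}"  # assumed placeholder for the preset rewrite parameter


def _collect(urls, rules, sep="/"):
    """Hypothetical helper: prefix parameter -> set of resource parameters."""
    params = defaultdict(set)
    for url in urls:
        segments = [s for s in url.split(sep) if s]
        if not segments:
            continue
        prefix = segments[0]
        for resource in segments[1:]:
            params[prefix].add(resource)
            prefix += sep + (REWRITE_PARAM if prefix in rules else resource)
    return params


def generate_rules(urls, threshold, sep="/"):
    """Grow the rule set until a full pass adds no new prefix."""
    rules = set()  # the initially empty URL rewriting rule set
    while True:
        params = _collect(urls, rules, sep)
        # A prefix with more than `threshold` distinct resources becomes a rule.
        candidates = {p for p, res in params.items() if len(res) > threshold}
        if candidates <= rules:
            return rules  # nothing new was added: the rule set is stable
        rules |= candidates
```

For example, with a threshold of 3 and four URLs of the form `a.com/search/<season>/2`, the prefix `a.com/search` has four distinct resource parameters and is added as a rule; the next pass finds no further prefix above the threshold and the loop stops.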
Wherein, the method further comprises:
mapping a URL to be mapped to a rewritten URL according to the URL rewriting rule set of the target website.
Wherein, mapping the URL to be mapped to the rewritten URL according to the URL rewriting rule set of the target website comprises:
normalizing the URL to be mapped to obtain a normalized URL;
splitting the normalized URL on the preset separator to obtain a split character array;
mapping the URL to be mapped to the rewritten URL according to the result of matching each prefix parameter in the split character array against the URL rewriting rule set.
Wherein, mapping the URL to be mapped to the rewritten URL according to the result of matching each prefix parameter in the split character array against the URL rewriting rule set comprises:
obtaining, in front-to-back order, the first character string in the split character array as the current prefix parameter;
judging whether the current prefix parameter is in the URL rewriting rule set; if so, combining the current prefix parameter with the preset rewrite parameter to form an updated prefix parameter; if not, combining the current prefix parameter with the resource parameter immediately following it to form the updated prefix parameter;
taking the updated prefix parameter as the current prefix parameter and performing again the step of saving the current prefix parameter and the resource parameter immediately following it, in correspondence with each other, to the initial parameter set, until all character strings in the split character array have been processed;
taking the resulting updated prefix parameter as the rewritten URL.
Wherein, in the case where the current prefix parameter is in the URL rewriting rule set, the method further comprises:
obtaining the resource parameter corresponding to the current prefix parameter and the value of the resource parameter;
saving the resource parameter, the value of the resource parameter and the query string in correspondence with the rewritten URL.
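A minimal sketch of the mapping step follows, assuming the URL has already been normalized and that "/" is the preset separator. The function name `rewrite_url` and the `${dynamic}` placeholder are assumptions for illustration. The sketch records only the path segments that were replaced; how the resource parameter, its value and the query string are actually stored is not detailed in this passage.

```python
REWRITE_PARAM = "${dynamic}"  # assumed placeholder for the preset rewrite parameter


def rewrite_url(normalized_url, rules, sep="/"):
    """Return (rewritten_url, replaced_segments) for one normalized URL."""
    segments = [s for s in normalized_url.split(sep) if s]
    if not segments:
        return normalized_url, []
    prefix = segments[0]
    replaced = []
    for resource in segments[1:]:
        if prefix in rules:
            # The segment under a rule prefix is treated as a dynamic value.
            replaced.append(resource)
            prefix = prefix + sep + REWRITE_PARAM
        else:
            prefix = prefix + sep + resource
    # After the last segment, the accumulated prefix is the rewritten URL.
    return prefix, replaced
```

With the rule set `{"a.com/search"}`, the URLs `a.com/search/winter/2` and `a.com/search/summer/2` both map to `a.com/search/${dynamic}/2`, while a URL outside any rule prefix is returned unchanged.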
The embodiments of this application also disclose a URL scanning method based on URL rewriting rules, the method comprising:
obtaining a pre-generated URL rewriting rule set and an initial URL set of a target website to be scanned, the URL rewriting rule set being generated as follows: obtaining a target URL set of the target website, the target website being the website for which uniform resource locator (URL) rewriting rules are to be generated; obtaining a parameter set of mutually corresponding prefix parameters and resource parameters in the target URL set; and generating the URL rewriting rule set of the target website according to the parameter set;
rewriting the initial URLs in the initial URL set according to the URL rewriting rule set to obtain a rewritten initial URL set;
deduplicating the rewritten initial URL set to obtain a target URL set;
scanning the target URLs in the target URL set.
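The rewrite-then-deduplicate part of the scanning method can be sketched as below. The rewriting step is passed in as a callable so that the sketch stays independent of how the rules are applied; the name `select_scan_targets` and the order-preserving deduplication are assumptions for illustration, and the actual scanning is left out.

```python
from typing import Callable, Iterable, List


def select_scan_targets(initial_urls: Iterable[str],
                        rewrite: Callable[[str], str]) -> List[str]:
    """Rewrite every initial URL, then keep one copy of each rewritten URL."""
    seen = set()
    targets = []
    for url in initial_urls:
        rewritten = rewrite(url)
        if rewritten not in seen:  # deduplicate, keeping first-seen order
            seen.add(rewritten)
            targets.append(rewritten)
    return targets
```

The scanner then only has to visit the returned list, which contains one entry for each distinct rewritten URL rather than one for each original URL.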
The embodiments of this application also disclose a device for generating URL rewriting rules, the device comprising:
a URL set obtaining unit, configured to obtain a target URL set of a target website, the target website being the website for which uniform resource locator (URL) rewriting rules are to be generated;
a parameter obtaining unit, configured to obtain a parameter set of mutually corresponding prefix parameters and resource parameters in the target URL set, wherein each resource parameter is a subpath of its prefix parameter;
a generation unit, configured to generate a URL rewriting rule set of the target website according to the parameter set.
Wherein, the URL set obtaining unit is configured to preprocess the initial URL set in the access log of the target website to obtain the target URL set.
Wherein, the URL set obtaining unit comprises:
a filtering subunit, configured to filter out, according to Hypertext Transfer Protocol (HTTP) status codes, the invalid URLs corresponding to invalid URL requests from the initial URL set in the access log of the target website;
a normalization subunit, configured to normalize the initial URL set from which the invalid URLs have been filtered to obtain a normalized URL set, each normalized URL in the normalized URL set comprising a domain name, a path and a file name; and
a deduplication subunit, configured to deduplicate the normalized URL set to obtain the target URL set.
Wherein, the unit that gets parms, comprising:
Divide subelement, for being split based on default separator to each target URL in the target set of URL, point The corresponding character array of each target URL is not obtained;
Parameter determines subelement, for forming the sequence of the target URL according to each character string in the character array, point Do not determine corresponding prefix parameter and resource parameters in each target URL, to obtain parameter set.
Wherein, the parameter determines subelement, is specifically used for:
Any one character array is obtained as current array, executes array circulation process, the array circulation process packet It includes:
According to vertical sequence, the first character string in the current array is obtained as current prefix parameter; Save to initial parameter corresponding with resource parameters adjacent thereafter of the current prefix parameter is concentrated;Judge the current prefix Whether parameter is concentrated in initial URL rewriting rule, if it is, the current prefix parameter and default overwrite parameter group are combined into Update prefix parameter;Prefix ginseng is updated if it is not, then the current prefix parameter and resource parameters group adjacent thereafter are combined into Number;With the update prefix parameter for current prefix parameter, execute described by the current prefix parameter and money adjacent thereafter Source parameter is corresponding to save to initial parameter the step of concentrating, until all character strings of current goal array are all recycled and finished;
Judge whether all circulation finishes all character arrays, if it is not, then any one uncirculated character array is made For current array, triggering executes the array circulation process;
If it is, using the initial parameter collection as the corresponding target component collection of target set of URL.
Wherein, the generation unit comprises:
a judgment subunit, configured to judge, for each prefix parameter, whether the number of resource parameters under the prefix parameter is greater than a preset threshold;
an update subunit, configured to, when the result of the judgment subunit is yes, add the prefix parameter to the initial URL rewriting rule set to obtain an updated URL rewriting rule set, until the initial URL rewriting rule set is no longer updated;
a rule determination subunit, configured to determine the updated URL rewriting rule set as the target URL rewriting rule set.
Wherein, the device further comprises:
a mapping unit, configured to map an original URL to a rewritten URL according to the URL rewriting rule set of the target website.
Wherein, the mapping unit 504 may include:
a normalization subunit, configured to normalize the URL to be mapped to obtain a normalized URL;
a splitting subunit, configured to split the normalized URL on the preset separator to obtain a split character array;
a mapping subunit, configured to map the URL to be mapped to the rewritten URL according to the result of matching each prefix parameter in the split character array against the URL rewriting rule set.
Wherein, the mapping subunit is specifically configured to:
obtain, in front-to-back order, the first character string in the split character array as the current prefix parameter; judge whether the current prefix parameter is in the URL rewriting rule set; if so, combine the current prefix parameter with the preset rewrite parameter to form an updated prefix parameter; if not, combine the current prefix parameter with the resource parameter immediately following it to form the updated prefix parameter; take the updated prefix parameter as the current prefix parameter and perform again the step of saving the current prefix parameter and the resource parameter immediately following it, in correspondence with each other, to the initial parameter set, until all character strings in the split character array have been processed; and take the resulting updated prefix parameter as the rewritten URL.
Wherein, the mapping subunit is further configured to:
obtain the resource parameter corresponding to the current prefix parameter and the value of the resource parameter; and save the resource parameter, the value of the resource parameter and the query string in correspondence with the rewritten URL.
The embodiments of this application also disclose a scanner, the scanner comprising:
a URL obtaining unit, configured to obtain a pre-generated URL rewriting rule set and an initial URL set of a target website to be scanned, the URL rewriting rule set being generated as follows: obtaining a target URL set of the target website, the target website being the website for which uniform resource locator (URL) rewriting rules are to be generated; obtaining a parameter set of mutually corresponding prefix parameters and resource parameters in the target URL set; and generating the URL rewriting rule set of the target website according to the parameter set;
a rewriting unit, configured to rewrite the initial URLs in the initial URL set according to the URL rewriting rule set to obtain a rewritten initial URL set;
a deduplication unit, configured to deduplicate the rewritten initial URL set to obtain a target URL set;
a scanning unit, configured to scan the target URLs in the target URL set.
The embodiments of this application also disclose a network device, comprising: a processor, a memory, a network interface and a bus system;
the bus system is configured to couple the hardware components of the network device together;
the network interface is configured to implement a communication connection between the network device and at least one other network device;
the memory is configured to store program instructions and/or data;
the processor is configured to read the instructions and/or data stored in the memory and perform the following operations:
obtaining a target URL set of a target website, the target website being the website for which uniform resource locator (URL) rewriting rules are to be generated;
obtaining a parameter set of mutually corresponding prefix parameters and resource parameters in the target URL set, wherein each resource parameter is a subpath of its prefix parameter;
generating a URL rewriting rule set of the target website according to the parameter set.
Compared with the prior art, the embodiments of this application have the following advantages:
In the embodiments of this application, the prefix parameters and resource parameters of each URL in the URL set of a website's web access log can be analyzed to determine the URL rewrite parameters, and the prefix parameter preceding a URL rewrite parameter is used as a URL rewriting rule, finally yielding the target URL rewriting rule set of the website. Because the embodiments do not require manual analysis of the web access log, considerable manpower and material costs are saved and errors made when configuring URL rules by hand are reduced, so that URL rewriting rules can be generated quickly even in scenarios involving a large or massive number of websites.
Further, an original URL can be rewritten, according to the URL rewriting rules in the URL rewriting rule set, into another URL different from the original URL for the scanner to scan. The URL rewrite parameter part "${dynamic}" is not scanned by the scanner as a path, which not only reduces the number of objects the scanner has to scan but also ensures that the scanner is not easily attacked by an attacker.
Of course, a product implementing this application does not necessarily need to achieve all of the above advantages at the same time.
Description of the drawings
To explain the technical solutions in the embodiments of this application more clearly, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings described below are only some embodiments of this application, and those of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flowchart of an embodiment of the method for generating URL rewriting rules of this application;
Fig. 2 is a schematic diagram of the results of preprocessing URLs in the method embodiment of this application;
Fig. 3 is a flowchart of splitting URLs to obtain the parameter set in the method embodiment of this application;
Fig. 4 is a flowchart of mapping an original URL in the method embodiment of this application;
Fig. 5 is a flowchart of an embodiment of the URL scanning method based on URL rewriting rules of this application;
Fig. 6 is a structural block diagram of an embodiment of the device for generating URL rewriting rules of this application;
Fig. 7 is a structural block diagram of an embodiment of the scanner of this application;
Fig. 8 is a structural block diagram of a network device 800 according to an exemplary embodiment.
Detailed description
The technical solutions in the embodiments of this application are described clearly and completely below with reference to the drawings in the embodiments. Obviously, the described embodiments are only some, not all, of the embodiments of this application. All other embodiments obtained by those of ordinary skill in the art based on the embodiments in this application without creative effort fall within the protection scope of this application.
One of the main ideas of this application is to obtain the web access logs of one or more websites and to generate a URL rewriting rule set for each website from the URLs in its web access log. Specifically, the URLs in the access log can first be preprocessed, for example by filtering, normalization or deduplication, to obtain a preprocessed URL set as the target URL set; the target URL set is then split by path to obtain an array corresponding to the website. The membership relationships between each prefix parameter in the array and the path parameters under that prefix parameter are then used, for example whether the number of path parameters under a prefix parameter is greater than a preset threshold, to collect all prefix parameters that need to be added to the URL rewriting rule set, yielding the URL rewriting rule set of the website.
Referring to Fig. 1, which shows a flowchart of an embodiment of the method for generating URL rewriting rules of this application, the embodiment may include the following steps:
Step 101: obtain a target URL set of a target website; the target website is the website for which uniform resource locator (URL) rewriting rules are to be generated.
In this embodiment, the target website may be a web site for which a URL rewriting rule set is to be generated. In practice there may be one or more target websites, and a URL rewriting rule set is generated separately for each. The access log of a target website can be obtained by querying, for example, a database on the server corresponding to the target website.
In this step, the initial URLs in the access log of the target website can be used directly as the target URL set; alternatively, the initial URLs in the access log can be preprocessed and the preprocessed URLs used as the target URL set. The preprocessing mainly consists of operations such as filtering, normalizing and deduplicating the URLs in the website's access log. Specifically, step 101 may include the following steps A1 to A3:
Step A1: according to HTTP status codes, filter out the invalid URLs corresponding to invalid URL requests from the initial URL set in the access log of the target website.
The HTTP status code is a 3-digit numeric code that indicates the status of the web server's HTTP response. A status code of 200 indicates that the HTTP request was processed successfully, and the response headers or data body that the request expects are returned with the response. Therefore, in this step, URLs whose HTTP status code is not 200 need to be filtered out, so that URLs that do not exist or that produce errors do not interfere with the generation of the URL rewriting rule set.
Step A2: normalize the initial URL set from which the invalid URLs have been filtered to obtain a normalized URL set, each normalized URL in the normalized URL set comprising a domain name, a path and a file name.
After the invalid URLs whose HTTP status code is not 200 have been filtered out, the remaining URLs are normalized by discarding the protocol, user name, query string and so on, to obtain normalized URLs, each of which includes a domain name, a path and a file name. It will be understood that if the port is 80 or 443, the port can also be discarded, so that only the domain name, path and file name are kept. Those skilled in the art may also convert all uppercase English letters in the domain name to lowercase to obtain the final normalized URL.
Step A3: deduplicate the normalized URLs in the normalized URL set to obtain the target URL set.
In this step, the normalized URLs in the normalized URL set are deduplicated: of any group of mutually duplicate URLs, only one is kept, finally giving the target URL set.
Fig. 2 is a schematic diagram of the results of each step when URLs are filtered, normalized and deduplicated. The URL "http://a.com/blog.php" at the upper left of Fig. 2, whose HTTP status code is "404", is filtered out in step A1, giving the five URLs at the upper right of Fig. 2. The five URLs at the upper right are then normalized: protocols such as "http://" and query strings such as "?id=211" are deleted, "A.COM" is converted to lowercase "a.com", and so on, giving the five normalized URLs at the lower right of Fig. 2. The URLs at the lower right are then deduplicated: of the two duplicate URLs "a.com/blog.php" only one is kept, giving the four URLs at the lower right-hand corner of Fig. 2. Of course, the URLs in Fig. 2 are only a specific example from practical application, and those skilled in the art should not construe them as limiting this application.
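Steps A1 to A3 can be sketched in Python as follows. This is an illustrative sketch only: it assumes the access log has already been parsed into (URL, HTTP status code) pairs, the function names `normalize` and `preprocess` are made up for the example, and the sample URLs used with it are hypothetical rather than the exact contents of Fig. 2.

```python
from urllib.parse import urlsplit


def normalize(url):
    """Keep only domain name, path and file name (step A2)."""
    # urlsplit needs "//" before the host when no scheme is present.
    parts = urlsplit(url if "://" in url else "//" + url)
    host = parts.hostname or ""      # lowercased, user name and password removed
    port = parts.port
    if port not in (None, 80, 443):  # ports 80 and 443 are dropped
        host = "%s:%d" % (host, port)
    return host + parts.path         # scheme, query string and fragment dropped


def preprocess(records):
    """records: iterable of (url, http_status) pairs from an access log."""
    seen = set()
    targets = []
    for url, status in records:
        if status != 200:            # step A1: filter out non-200 requests
            continue
        norm = normalize(url)        # step A2: normalize
        if norm not in seen:         # step A3: deduplicate
            seen.add(norm)
            targets.append(norm)
    return targets
```

For instance, a 404 record is dropped, and `http://A.COM/blog.php?id=211` and `https://a.com/blog.php` both reduce to the single target URL `a.com/blog.php`.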
Step 102: Obtain a parameter set of mutually corresponding prefix parameters and resource parameters in the target URL set, where a resource parameter is a subpath of its prefix parameter.
In this step, each target URL in the preprocessed target URL set is split by path to obtain an array containing the strings that result from the split, the first of which is the domain name. Then, following the path hierarchy, the domain name, the domain name plus its subpath, the domain name plus its subpath and that subpath's subpath, and so on are taken in turn as the prefix parameter, and the resource parameter corresponding to each prefix parameter is determined. The result is a parameter set in which prefix parameters are stored together with their corresponding resource parameters.
In practical applications, an empty URL rewriting rule set may be preset as one of the bases for determining the parameter set. Step 102 first splits each target URL in the target URL set on a preset separator to obtain a character array for each target URL; the preset separator may be "/", that is, the path separator. Then, according to the order in which the strings of each character array make up the target URL, the corresponding prefix parameters and resource parameters in each target URL are determined to obtain the parameter set.
Specifically, the process of determining the corresponding prefix parameters and resource parameters to obtain the parameter set may include:
Step B1: Obtain any one character array as the current array.
In this embodiment, one target URL corresponds to one character array. Any character array that has not yet been processed is first taken as the current array. For example, the current array obtained for the target URL "a.com/search/winter/2" is {'a.com', 'search', 'winter', '2'}, where 'a.com' is the 1st string of the array and '2' is the 4th; the array contains 4 strings in total.
Step B2: Execute the array loop procedure.
For the current array obtained, {'a.com', 'search', 'winter', '2'}, the array loop procedure includes steps B21 to B24:
Step B21: In front-to-back order, obtain the first string in the current array as the current prefix parameter.
In this step, the 1st string of the array, 'a.com', is taken as the current prefix parameter "prefix". The resource parameter "resource" immediately following the current prefix parameter is the second string in the array, 'search'.
Step B22: Save the current prefix parameter and the resource parameter immediately following it, in correspondence, to the initial parameter set.
'a.com' and 'search' are saved in correspondence to the initial parameter set. In this embodiment, the initial parameter set may start out empty.
Step B23: Judge whether the current prefix parameter is in the initial URL rewriting rule set. If so, combine the current prefix parameter with the default rewrite parameter to form the updated prefix parameter; if not, combine the current prefix parameter with the resource parameter immediately following it to form the updated prefix parameter.
Then, for 'a.com', judge whether it is in the initial URL rewriting rule set. The initial URL rewriting rule set may start out empty; as more and more target URLs are analyzed, it accumulates more and more rewriting rules, until it is no longer updated.
Assuming the initial URL rewriting rule set is empty, 'a.com' is not in the initial URL rewriting rule set, so the current prefix parameter 'a.com' and the resource parameter immediately following it, 'search', are combined into the updated prefix parameter "a.com/search". If instead 'a.com/search' is in the initial URL rewriting rule set, this indicates that the resource parameter immediately following 'a.com/search', 'winter', is a URL rewrite parameter; in this case, the current prefix parameter 'a.com/search' and the default rewrite parameter (for example "dynamic") are combined into the updated prefix parameter "a.com/search/${dynamic}".
Step B24: Take the updated prefix parameter as the current prefix parameter and execute the step of saving the current prefix parameter and the resource parameter immediately following it, in correspondence, to the initial parameter set, until all strings of the current array have been looped over.
In this step, the updated prefix parameter obtained in step B23, that is, "a.com/search" or "a.com/search/${dynamic}", is taken as the current prefix parameter and saved in correspondence with the resource parameter immediately following it to the initial parameter set. For example, the prefix parameter "a.com/search" is saved together with its resource parameter "winter" to the initial parameter set. As another example, the prefix parameter "a.com/search/${dynamic}" is saved together with its resource parameter "2" to the initial parameter set, until all strings in the current array have been looped over.
Step B3: Judge whether all character arrays have been looped over. If not, take any character array that has not been looped over as the current array and trigger step B2 to execute the array loop procedure; if so, go to step B4.
That is, judge whether all character arrays corresponding to the target website have been looped over; if not, take any character array that has not been looped over as the current array and trigger step B2 to execute the array loop procedure.
Step B4: Obtain the initial parameter set as the target parameter set corresponding to the target URL set.
In this step, each group of prefix parameter and resource parameters in the initial parameter set, which is by now no longer updated, can be output in correspondence as the basis for updating the URL rewriting rule set.
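Steps B1 to B4 can be sketched as a single pass over the target URLs. This is an illustrative sketch only: the function name `collect_params` and the choice of a dictionary of sets as the parameter set are assumptions, and the placeholder text follows the "${dynamic}" example given above.

```python
from collections import defaultdict

# Default rewrite parameter, written as in the example above.
PLACEHOLDER = "${dynamic}"


def collect_params(target_urls, rules):
    """Map each prefix to the set of resources seen right after it.

    target_urls: normalized URLs such as 'a.com/search/winter/2'
    rules: the current (initial) URL rewriting rule set
    """
    params = defaultdict(set)
    for url in target_urls:               # steps B1 / B3: one array per URL
        segments = url.split("/")         # split on the path separator
        prefix = segments[0]              # step B21: start at the domain name
        for resource in segments[1:]:
            params[prefix].add(resource)  # step B22: save the pair
            if prefix in rules:           # step B23: prefix is a known rule
                prefix = f"{prefix}/{PLACEHOLDER}"
            else:
                prefix = f"{prefix}/{resource}"
    return dict(params)                   # step B4: the parameter set


for prefix, resources in collect_params(
    ["a.com/search/winter/2"], {"a.com/search"}
).items():
    print(prefix, sorted(resources))
```

Storing resources in a set means that repeated occurrences of the same resource under a prefix are counted once, which suits the later comparison of the number of distinct resource parameters against a threshold.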
Step 103: Generate the URL rewriting rule set of the target website according to the parameter set.
In this step, whether a prefix parameter should be added to the URL rewriting rule set can be determined according to how many resource parameters correspond to each prefix parameter in the parameter set.
In practical applications, the determination can be based on the number of distinct resource parameters under the same prefix parameter. For a normal URL path, the number of resource parameters under one prefix parameter should be limited, whereas for a prefix parameter that needs to serve as a URL rewriting rule, the corresponding resource parameters amount to arbitrary values entered by users, so their number is large. A threshold can therefore be preset: if the number of resource parameters exceeds the threshold, the value of the prefix parameter is added to the URL rewriting rule set. This finally yields an updated URL rewriting rule set; once the initial URL rewriting rule set is no longer updated, the updated URL rewriting rule set can be determined as the target URL rewriting rule set.
A URL rewriting rule uses one URL to represent one rewriting rule and indicates which parameters are rewrite parameters: the path parameter immediately following a URL in the URL rewriting rule set is a URL rewrite parameter. Specifically, the URL rewriting rule set may include multiple URL rewriting rules, where the format of a single URL rewriting rule is a URL, such as "a.com/s". The meaning of this URL rewriting rule is that the path parameter after "a.com/s" (such as "a1" in a.com/s/a1/a2) is a URL rewrite parameter. Because a single URL in the access log may be subject to multiple path rewriting rules (for example, "a.com/search/test/2" can be mapped to "a.com/search.php?keyword=test&page=2"), each URL in the access log needs to be processed iteratively until the URLs in the URL rewriting rule set no longer change.
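The threshold test and the iteration "until the rule set no longer changes" can be sketched as a fixed-point loop. The sketch below is an assumption-laden illustration: the name `generate_rules`, the strict "greater than" comparison, and the sample URLs and threshold are all hypothetical, and the parameter collection of step 102 is inlined so that the block is self-contained.

```python
from collections import defaultdict

PLACEHOLDER = "${dynamic}"  # default rewrite parameter from the example


def generate_rules(target_urls, threshold):
    """Grow the rule set until a full pass adds no new rule."""
    rules = set()  # the initial URL rewriting rule set starts out empty
    while True:
        # Step 102: collect prefix -> distinct resources under current rules.
        seen = defaultdict(set)
        for url in target_urls:
            head, *rest = url.split("/")
            prefix = head
            for resource in rest:
                seen[prefix].add(resource)
                nxt = PLACEHOLDER if prefix in rules else resource
                prefix = f"{prefix}/{nxt}"
        # Step 103: a prefix with more than `threshold` distinct
        # resources is added to the rule set.
        updated = rules | {p for p, r in seen.items() if len(r) > threshold}
        if updated == rules:  # no change: this is the target rule set
            return rules
        rules = updated


urls = [
    "a.com/search/winter/1", "a.com/search/winter/2",
    "a.com/search/summer/3", "a.com/search/summer/4",
    "a.com/search/spring/5", "a.com/search/fall/6",
    "a.com/about",
]
print(sorted(generate_rules(urls, threshold=3)))
```

With these sample URLs, the second-level rule "a.com/search/${dynamic}" can only be found on the second pass, after "a.com/search" has entered the rule set and the keyword segment has been collapsed into the placeholder, which is why a single pass is not enough.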
It can be seen that, with the embodiment of the present application, the prefix parameters and resource parameters of each URL in the URL set from a website's web access log can be analyzed to determine the URL rewrite parameters, and the prefix parameter preceding each URL rewrite parameter is taken as a URL rewriting rule, thereby obtaining the target URL rewriting rule set of the website. Because the embodiment of the present application requires no manual analysis of web access logs, it saves substantial labor and material costs and also reduces errors made when URL rules are configured manually, so that URL rewriting rules can be generated quickly even in scenarios involving a large or even massive number of websites.
In practical applications, the original web log can also be processed based on the obtained URL rewriting rule set to map original URLs to rewritten URL paths. Because a URL rewriting rule can represent a URL rewrite parameter, the URLs that include URL rewrite parameters can be supplied to the scanner as one and the same URL, which reduces the number of URLs the scanner has to call. Therefore, after step 103, the method may further include:
Step 104: Map the original URL to the rewritten URL according to the URL rewriting rule set of the target website.
In this step, the original URL can be mapped to the rewritten URL based on the URL rewriting rule set, and the query string and the rewrite parameters identified by the URL rewriting rules can be extracted as the input source of the scanner. Specifically, the implementation of step 104 may include steps C1 to C5:
Step C1: Normalize the URL to be mapped to obtain the normalized URL, and store the query string of the URL to be mapped.
In this step, the original URL to be mapped needs to be normalized. For the specific normalization steps, refer to the description of step A2, which is not repeated here. For example, if the original URL is "a.com/search/winter/2?a=b", then "?a=b" is stored as the query string, and the normalized URL in this step is a.com/search/winter/2.
Step C2: Split the normalized URL on the preset separator to obtain the split character array.
In this step, the normalized URL is likewise split on the preset path separator "/" to obtain the character array {'a.com', 'search', 'winter', '2'}.
Step C3: Map the URL to be mapped to the rewritten URL according to the result of matching each prefix parameter in the split character array against the URL rewriting rule set.
Then, according to the result of matching each prefix parameter in the character array, for example "a.com", "a.com/search", "a.com/search/winter", and "a.com/search/winter/2", against the URL rewriting rule set, the URL to be mapped is mapped to the rewritten URL.
Specifically, the mapping process of step C3 may include steps C31 to C35:
Step C31: In front-to-back order, obtain the first string in the split character array as the current prefix parameter.
Again, "a.com" is obtained from the character array as the current prefix parameter.
Step C32: Judge whether the current prefix parameter is in the URL rewriting rule set; if so, go to step C33; if not, go to step C34.
If the current prefix parameter, for example "a.com/search", is in the URL rewriting rule set, go to step C33; if the current prefix parameter, for example "a.com", is not in the URL rewriting rule set, go to step C34.
Step C33: Combine the current prefix parameter with the default rewrite parameter to form the updated prefix parameter, obtain the resource parameter corresponding to the current prefix parameter and the value of that resource parameter, save the resource parameter, the resource parameter value, and the query string in correspondence with the rewritten URL, and go to step C35.
In this step, the current prefix parameter "a.com/search" and the default rewrite parameter "${dynamic}" are combined into the updated prefix parameter "a.com/search/${dynamic}". In addition, the resource parameter value "winter" of "a.com/search" and the query string "?a=b" also need to be obtained.
Step C34: Combine the current prefix parameter with the resource parameter immediately following it to form the updated prefix parameter, and go to step C35.
In this step, the current prefix parameter, for example "a.com/search", and its resource parameter "winter" are combined into the updated prefix parameter "a.com/search/winter".
Step C4: Take the updated prefix parameter as the current prefix parameter and execute the step of saving the current prefix parameter and the resource parameter immediately following it, in correspondence, to the initial parameter set, until all strings in the split character array have been looped over.
Then "a.com/search/${dynamic}" from step C33, or "a.com/search/winter" from step C34, is taken as the current prefix parameter, and the process returns to step C32 to judge whether it is in the URL rewriting rule set, until all strings in the character array have been looped over. At this point the URL rewrite parameters, such as "winter" or "2", and the query string "?a=b" have been obtained.
Step C5: Obtain the updated prefix parameter as the rewritten URL.
The updated prefix parameter that is no longer updated is finally obtained as the rewritten URL. Assuming the URL rewriting rule set includes "a.com/search" and "a.com/search/${dynamic}", the rewritten URL obtained in this step is "a.com/search/${dynamic}/${dynamic}", where the resource parameter dynamic_1 has the value winter, the resource parameter dynamic_2 has the value 2, and the resource parameter a has the value b.
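The mapping of steps C1 to C5 can be sketched as below. This is only an illustration under stated assumptions: the function name `map_url` is hypothetical, the input is assumed to be already normalized apart from its query string, and the numbering scheme dynamic_1, dynamic_2 follows the example in the text.

```python
from urllib.parse import parse_qsl

PLACEHOLDER = "${dynamic}"  # default rewrite parameter from the example


def map_url(url, rules):
    """Return (rewritten_url, params) for one URL to be mapped."""
    # Step C1: separate and keep the query string.
    normalized, _, query = url.partition("?")
    # Step C2: split on the path separator.
    head, *rest = normalized.split("/")
    prefix = head                      # step C31: start at the domain name
    params = {}
    for resource in rest:
        if prefix in rules:            # steps C32 / C33: rewrite parameter
            params[f"dynamic_{len(params) + 1}"] = resource
            prefix = f"{prefix}/{PLACEHOLDER}"
        else:                          # step C34: ordinary path segment
            prefix = f"{prefix}/{resource}"
    # Query-string parameters are kept alongside the rewrite parameters.
    params.update(parse_qsl(query))
    return prefix, params              # step C5: the rewritten URL


rules = {"a.com/search", "a.com/search/${dynamic}"}
print(map_url("a.com/search/winter/2?a=b", rules))
```

Returning the extracted values together with the rewritten URL keeps the information a scanner would need as input: which positions are variable, and one concrete value for each of them.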
It can be seen that, in the embodiment of the present application, multiple original URLs that include the same URL rewrite parameter are also rewritten into one target URL according to the URL rewriting rules in the URL rewriting rule set, for the scanner to scan. The URL rewrite parameter part "${dynamic}" is not scanned by the scanner as a path, which not only reduces the number of objects the scanner scans but also ensures that the scanner is not easily attacked by an attacker.
Referring to Fig. 3, a flowchart of splitting the URLs in the access log and obtaining prefix parameters and resource parameters in a method embodiment of the present application is shown. The process may include the following steps:
Step 301: Obtain the target URL set.
Step 302: Split a target URL in the target URL set to obtain an array.
Step 303: Initialize the parameters: n is set to 1, and the prefix parameter prefix is set to the 0th element of the array, that is, the domain name.
In this step, the obtained array url_array is again taken to be {'a.com', 'search', 'winter', '2'} as an example. The parameters are initialized with n = 1, and prefix is "a.com" at this point.
Step 304: Begin the loop of steps 304 to 307; take resource to be the 1st element of the array, "search".
Step 305: Store the current value of prefix and its corresponding resource.
"a.com" and "search" are stored in correspondence.
Step 306: Judge whether the current value of prefix is in the URL rewriting rule set. If so, set prefix to prefix + "${dynamic}"; otherwise, set prefix to prefix + resource.
Judge whether "a.com" is in the URL rewriting rule set. Assuming the corresponding URL rewriting rule set includes the rewriting rule "a.com/search", "a.com" is not in the URL rewriting rule set, so prefix becomes prefix + resource, that is, a.com/search.
Step 307: Set n = n + 1 and judge whether n is less than the length of the array. If so, continue with the first step of the loop of steps 304 to 307; otherwise, go to step 308.
With n = 2, n is less than the array length 4, so the resource for prefix is taken again, "winter"; step 305 stores the pair, and steps 306 and 307 are executed in turn, until n equals the length of the array.
Step 308: Output all values of prefix and their corresponding resource values stored in step 305.
For example, in the present example, output result can be as shown in table 1:
Table 1
Referring to Fig. 4, an example flowchart of mapping an original URL according to the URL rule set in a method embodiment of the present application is shown. This embodiment may include the following steps:
Step 401: Normalize the URL to be mapped, and store the parameter names and parameter values in the query string.
Assuming the URL to be mapped is "a.com/search/winter/2?a=b", the query string parameter stored in this step has the name a and the value b.
Step 402: Split the normalized URL on the path separator "/" to obtain an array.
The normalized URL "a.com/search/winter/2" is split on the separator "/" to obtain an array.
Step 403: Initialize the parameters: n = 1, and the prefix parameter prefix is set to the 0th element of the array, that is, the domain name.
In this step, the parameters are initialized with n = 1, and prefix is "a.com" at this point.
Step 404: Begin the loop body of steps 404 to 406; take resource to be the 1st element of the array, "search".
Step 405: Judge whether the current value of prefix is in the URL rewriting rule set. If so, set prefix to prefix + "${dynamic}" and store the value of resource; otherwise, set prefix to prefix + resource.
Step 406: Set n = n + 1 and judge whether n is less than the length of the array. If so, continue with the first step of the loop body, step 404; otherwise, go to step 407.
Step 407: Output the current prefix as the rewritten URL, together with the parameter names and parameter values of the query string from step 401 and all resource parameter names and resource parameter values stored in step 405 of the loop body.
Specifically, assuming the rewriting rules in the URL rewriting rule set are "a.com/search" and "a.com/search/${dynamic}", one possible output of step 407 is shown in Table 2.
Table 2
Referring to Fig. 5, the present application also provides a flowchart of an embodiment of a URL scanning method based on URL rewriting rules. This embodiment may include the following steps:
Step 501: Obtain a pre-generated URL rewriting rule set and the initial URL set of the target website to be scanned.
In practical applications, after the URL rewriting rule set has been obtained using the method shown in Fig. 1, the URL set in the web log can also be rewritten according to the URL rewriting rule set. Specifically, the scanner may save in advance the URL rewriting rule set obtained using the method shown in Fig. 1, for example in memory, and may also obtain all URLs of the target website as the initial URL set to be scanned. Because a URL rewriting rule can represent a URL rewrite parameter, many initial URLs that include the same URL rewrite parameter can, after subsequent rewriting and deduplication, be supplied to the scanner as a single URL, which reduces the number of URLs the scanner scans.
Step 502: Rewrite the initial URLs in the initial URL set according to the URL rewriting rule set to obtain the rewritten initial URL set.
After the URL rewriting rule set and the initial URL set have been obtained, each initial URL in the initial URL set is rewritten using the URL rewriting rule set. For the specific rewriting process, refer to the detailed description of steps C1 to C5, which is not repeated here.
Step 503: Deduplicate the rewritten initial URL set to obtain the target URL set.
The rewritten initial URLs are deduplicated to obtain multiple distinct target URLs. Many initial URLs may include the same URL rewrite parameter, which means that these initial URLs in fact all point to the same page and that their addresses after rewriting are identical; therefore, only one of these rewritten initial URLs is retained. In this way, a target URL set much smaller than the initial URL set can be obtained.
Step 504: Scan the target URLs in the target URL set.
The scanner then scans each target URL in the target URL set. Because the number of target URLs is much smaller than the number of initial URLs, the scanner's efficiency can be improved.
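Steps 502 and 503 can be sketched as a rewrite followed by an order-preserving deduplication. The names `rewrite` and `scan_targets` are hypothetical, the input URLs are assumed to be normalized already, and the actual scanning of step 504 is left out because the text does not specify it.

```python
PLACEHOLDER = "${dynamic}"  # default rewrite parameter from the example


def rewrite(url, rules):
    """Replace every segment that follows a rule prefix with the placeholder."""
    head, *rest = url.split("/")
    prefix = head
    for resource in rest:
        nxt = PLACEHOLDER if prefix in rules else resource
        prefix = f"{prefix}/{nxt}"
    return prefix


def scan_targets(initial_urls, rules):
    """Steps 502 and 503: rewrite each initial URL, then drop duplicates."""
    # dict.fromkeys keeps the first occurrence of each key, in input order.
    return list(dict.fromkeys(rewrite(u, rules) for u in initial_urls))


rules = {"a.com/search", "a.com/search/${dynamic}"}
initial = ["a.com/search/winter/1", "a.com/search/summer/2", "a.com/about"]
for target in scan_targets(initial, rules):
    print(target)
```

In this example three initial URLs collapse to two targets, because both search URLs rewrite to the same address.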
For simplicity of description, the foregoing method embodiments are each expressed as a series of action combinations, but those skilled in the art should understand that the present application is not limited by the described order of actions, because according to the present application certain steps may be performed in other orders or simultaneously. Those skilled in the art should also understand that the embodiments described in the specification are all preferred embodiments, and the actions and modules involved are not necessarily required by the present application.
Corresponding to the method provided by the above embodiment of the method for generating URL rewriting rules, and referring to Fig. 6, the present application also provides an embodiment of an apparatus for generating URL rewriting rules. In this embodiment, the apparatus may include:
A URL set obtaining unit 601, configured to obtain the target URL set of the target website, where the target website is a website for which Uniform Resource Locator (URL) rewriting rules are to be generated.
The URL set obtaining unit 601 may be configured to preprocess the initial URL set in the access log of the target website to obtain the target URL set.
To perform the preprocessing, the URL set obtaining unit 601 may include:
a filtering subunit, configured to filter out, according to Hypertext Transfer Protocol (HTTP) status codes, the illegal URLs corresponding to illegal URL requests from the initial URL set in the access log of the target website;
a normalization subunit, configured to normalize the initial URL set from which illegal URLs have been filtered to obtain a normalized URL set, where each normalized URL in the normalized URL set includes a domain name, a path, and a file name; and
a deduplication subunit, configured to deduplicate the normalized URL set to obtain the target URL set.
A parameter obtaining unit 602, configured to obtain a parameter set of mutually corresponding prefix parameters and resource parameters in the target URL set, where a resource parameter is a subpath of its prefix parameter.
The parameter obtaining unit 602 may include:
a splitting subunit, configured to split each target URL in the target URL set on a preset separator to obtain a character array for each target URL; and
a parameter determining subunit, configured to determine, according to the order in which the strings of each character array make up the target URL, the corresponding prefix parameters and resource parameters in each target URL, to obtain the parameter set.
The parameter determining subunit is specifically configured to:
obtain any one character array as the current array and execute the array loop procedure, where the array loop procedure includes:
in front-to-back order, obtaining the first string in the current array as the current prefix parameter;
saving the current prefix parameter and the resource parameter immediately following it, in correspondence, to the initial parameter set;
judging whether the current prefix parameter is in the initial URL rewriting rule set; if so, combining the current prefix parameter with the default rewrite parameter to form the updated prefix parameter; if not, combining the current prefix parameter with the resource parameter immediately following it to form the updated prefix parameter;
taking the updated prefix parameter as the current prefix parameter and executing the step of saving the current prefix parameter and the resource parameter immediately following it, in correspondence, to the initial parameter set, until all strings of the current array have been looped over;
judge whether all character arrays have been looped over; if not, take any character array that has not been looped over as the current array and trigger execution of the array loop procedure;
if so, take the initial parameter set as the target parameter set corresponding to the target URL set.
A generation unit 603, configured to generate the URL rewriting rule set of the target website according to the parameter set.
The generation unit 603 may specifically include:
a judging subunit, configured to judge, for each prefix parameter, whether the number of resource parameters under the prefix parameter is greater than a preset threshold;
an updating subunit, configured to, when the result of the judging subunit is yes, add the prefix parameter to the initial URL rewriting rule set to obtain an updated URL rewriting rule set, until the initial URL rewriting rule set is no longer updated; and
a rule determining subunit, configured to determine the updated URL rewriting rule set as the target URL rewriting rule set.
With the apparatus of the embodiment of the present application, the prefix parameters and resource parameters of the URLs in a website's web access log can be analyzed to determine the URL rewrite parameters, and URL rewriting rules are generated from the prefix parameters preceding the URL rewrite parameters, finally yielding the target URL rewriting rule set. Because no manual analysis of web access logs is required, substantial labor and material costs are saved and errors made when URL rules are configured manually are reduced, so that URL rewriting rules can be generated quickly even in scenarios involving a large or even massive number of websites.
The apparatus may further include:
A mapping unit 604, configured to map the original URL to the rewritten URL according to the URL rewriting rule set of the target website.
The mapping unit 604 may include:
a normalization subunit, configured to normalize the URL to be mapped to obtain the normalized URL; a splitting subunit, configured to split the normalized URL on the preset separator to obtain the split character array; and a mapping subunit, configured to map the URL to be mapped to the rewritten URL according to the result of matching each prefix parameter in the split character array against the URL rewriting rule set.
The mapping subunit may specifically be configured to:
in front-to-back order, obtain the first string in the split character array as the current prefix parameter; judge whether the current prefix parameter is in the URL rewriting rule set; if so, combine the current prefix parameter with the default rewrite parameter to form the updated prefix parameter; if not, combine the current prefix parameter with the resource parameter immediately following it to form the updated prefix parameter; take the updated prefix parameter as the current prefix parameter and execute the step of saving the current prefix parameter and the resource parameter immediately following it, in correspondence, to the initial parameter set, until all strings in the split character array have been looped over; and obtain the updated prefix parameter as the rewritten URL.
The mapping subunit may further be configured to obtain the resource parameter corresponding to the current prefix parameter and the value of that resource parameter, and to save the resource parameter, the resource parameter value, and the query string in correspondence with the rewritten URL.
It can be seen that the mapping unit 604 also rewrites the original URL into another URL according to the URL rewriting rules in the URL rewriting rule set, for the scanner to scan. The URL rewrite parameter part "${dynamic}" is not scanned by the scanner as a path, which not only reduces the number of objects the scanner scans but also ensures that the scanner is not easily attacked by an attacker.
Corresponding to the scanning method provided in Fig. 5, and referring to Fig. 7, the present application also provides a scanner, which may include:
A URL obtaining unit 701, configured to obtain a pre-generated URL rewriting rule set and the initial URL set of the target website to be scanned, where the URL rewriting rule set is generated as follows: obtain the target URL set of the target website, the target website being a website for which Uniform Resource Locator (URL) rewriting rules are to be generated; obtain a parameter set of mutually corresponding prefix parameters and resource parameters in the target URL set; and generate the URL rewriting rule set of the target website according to the parameter set.
A rewriting unit 702, configured to rewrite the initial URLs in the initial URL set according to the URL rewriting rule set to obtain the rewritten initial URL set.
A deduplication unit 703, configured to deduplicate the rewritten initial URL set to obtain the target URL set.
A scanning unit 704, configured to scan the target URLs in the target URL set.
Because the number of target URLs in this embodiment is much smaller than the number of initial URLs, the scanner of this embodiment scans more efficiently.
Fig. 8 is a schematic diagram of the hardware structure of a network device 800 in an embodiment of the present invention. The network device 800 can be used to execute the methods provided in the above embodiments. In this embodiment, the network device 800 includes a processor 801, a memory 802, a network interface 803, and a bus system 804.
The bus system 804 is configured to couple the hardware components of the network device 800 together.
The network interface 803 is configured to implement a communication connection between the network device 800 and at least one other network device, which may use the Internet, a wide area network, a local area network, a metropolitan area network, or the like.
The memory 802 is configured to store program instructions and/or data.
The processor 801 is configured to read the instructions and/or data stored in the memory 802 and perform the following operations:
obtaining a pre-generated URL rewriting rule set and the initial URL set of the target website to be scanned, where the URL rewriting rule set is generated using the aforementioned method for generating URL rewriting rules;
rewriting the initial URLs in the initial URL set according to the URL rewriting rule set to obtain the rewritten initial URL set;
deduplicating the rewritten initial URL set to obtain the target URL set; and
scanning the target URLs in the target URL set.
It should be noted that the embodiments in this specification are described in a progressive manner. Each embodiment focuses on its differences from the other embodiments, and the same or similar parts of the embodiments can be referred to each other. Because the device embodiments are basically similar to the method embodiments, they are described relatively briefly; for related details, see the corresponding description of the method embodiments.
Finally, it should also be noted that, in this document, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between these entities or operations. Moreover, the terms "include", "comprise" or any other variant thereof are intended to cover non-exclusive inclusion, so that a process, method, article or device that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, article or device. Without further limitation, an element defined by the phrase "including a ..." does not exclude the presence of other identical elements in the process, method, article or device that includes the element.
The method and device for generating URL rewriting rules, and the scanning method and device, provided by this application have been described in detail above. Specific examples are used herein to explain the principles and implementations of this application, and the description of the above embodiments is only intended to help understand the method of this application and its core idea. Meanwhile, a person skilled in the art may, following the idea of this application, make changes to the specific implementations and the scope of application. In summary, the content of this specification should not be construed as limiting this application.

Claims (14)

1. A method for generating uniform resource locator (URL) rewriting rules, characterized in that the method comprises:
Obtaining a target URL set of a target website, the target website being the website for which URL rewriting rules are to be generated;
Obtaining a parameter set of mutually corresponding prefix parameters and resource parameters in the target URL set, wherein each resource parameter is a subpath of its prefix parameter;
Generating a URL rewriting rule set of the target website according to the parameter set.
2. The method according to claim 1, characterized in that obtaining the target URL set of the target website comprises:
Preprocessing an initial URL set in an access log of the target website, to obtain the target URL set.
3. The method according to claim 2, characterized in that preprocessing the initial URL set in the access log of the target website to obtain the target URL set comprises:
Filtering, according to hypertext transfer protocol (HTTP) status codes, the illegal URLs corresponding to illegal URL requests out of the initial URL set in the access log of the target website;
Normalizing the initial URL set from which the illegal URLs have been filtered, to obtain a normalized URL set, wherein each normalized URL in the normalized URL set comprises a domain name, a path and a file name;
Deduplicating the normalized URL set, to obtain the target URL set.
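The three preprocessing steps of claim 3 can be sketched in Python as follows. This is only an illustration under stated assumptions: the claim does not say which status codes mark a request as illegal, so the sketch treats codes of 400 and above as illegal; the `(url, status)` log entry shape and the function name are likewise hypothetical.

```python
from typing import Iterable, List, Tuple
from urllib.parse import urlsplit


def preprocess_access_log(entries: Iterable[Tuple[str, int]]) -> List[str]:
    """Filter by HTTP status, normalize to domain + path, then deduplicate."""
    seen = set()
    target_urls = []
    for url, status in entries:
        if status >= 400:              # assumption: 4xx/5xx marks an illegal request
            continue
        parts = urlsplit(url)
        # Normalized form keeps domain name, path and file name only;
        # the query string and fragment are dropped.
        normalized = parts.netloc.lower() + (parts.path or "/")
        if normalized not in seen:
            seen.add(normalized)
            target_urls.append(normalized)
    return target_urls
```

For instance, `http://Example.com/a/b.html?x=1` and `http://example.com/a/b.html` normalize to the same entry, and a request answered with 404 is dropped.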
4. The method according to claim 1, characterized in that obtaining the parameter set of prefix parameters and resource parameters in the target URL set comprises:
Splitting each target URL in the target URL set based on a preset separator, to obtain a string array corresponding to each target URL;
Determining, according to the order in which the strings in each string array form the target URL, the corresponding prefix parameters and resource parameters in each target URL, so as to obtain the parameter set.
5. The method according to claim 4, characterized in that determining, according to the order in which the strings in each string array form the target URL, the corresponding prefix parameters and resource parameters in each target URL comprises:
Taking any one string array as the current array and executing an array loop process, the array loop process comprising:
Taking, in front-to-back order, the first string in the current array as the current prefix parameter;
Saving the current prefix parameter and the resource parameter immediately following it, in correspondence with each other, to an initial parameter set;
Judging whether the current prefix parameter is in an initial URL rewriting rule set; if so, combining the current prefix parameter with a preset rewrite parameter to form an updated prefix parameter; if not, combining the current prefix parameter with the resource parameter immediately following it to form the updated prefix parameter;
Taking the updated prefix parameter as the current prefix parameter and executing again the step of saving the current prefix parameter and the resource parameter immediately following it to the initial parameter set, until all strings of the current array have been processed;
Judging whether all string arrays have been processed; if not, taking any unprocessed string array as the current array and triggering the array loop process;
If so, taking the initial parameter set as the target parameter set corresponding to the target URL set.
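The array loop process of claim 5 can be sketched in Python as below. This is a sketch under assumptions rather than the patented code: the separator is taken to be `/`, the preset rewrite parameter is represented by the hypothetical placeholder `{*}`, and the parameter set is modeled as a mapping from each prefix to the set of resource parameters seen directly under it.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set

PLACEHOLDER = "{*}"  # hypothetical stand-in for the preset rewrite parameter


def collect_parameters(target_urls: Iterable[str],
                       rule_set: Set[str],
                       sep: str = "/") -> Dict[str, Set[str]]:
    """Build the prefix -> resource-parameter mapping for a set of URLs."""
    params: Dict[str, Set[str]] = defaultdict(set)
    for url in target_urls:
        parts = [p for p in url.split(sep) if p]   # string array for this URL
        if not parts:
            continue
        prefix = parts[0]                          # first string is the initial prefix
        for resource in parts[1:]:
            params[prefix].add(resource)           # save the (prefix, resource) pair
            if prefix in rule_set:
                # Prefix already has a rule: extend it with the rewrite parameter.
                prefix = prefix + sep + PLACEHOLDER
            else:
                # Otherwise extend it with the literal resource parameter.
                prefix = prefix + sep + resource
    return params
```

With an empty rule set, `example.com/item/1` and `example.com/item/2` yield the resources `1` and `2` under the prefix `example.com/item`; once that prefix is in the rule set, deeper segments are grouped under `example.com/item/{*}`.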
6. The method according to claim 5, characterized in that generating the URL rewriting rule set of the target website according to the path parameter and the non-path parameter comprises:
For each prefix parameter, judging whether the number of resource parameters under the prefix parameter is greater than a preset threshold; if so, adding the prefix parameter to the initial URL rewriting rule set to obtain an updated URL rewriting rule set, and repeating until the initial URL rewriting rule set is no longer updated;
Determining the updated URL rewriting rule set as the target URL rewriting rule set.
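The threshold test and the "repeat until the rule set stops changing" condition of claim 6 can be read as a fixed-point iteration, sketched below in Python. The helper `_collect` repeats the parameter-collection loop so that the block is self-contained; the `/` separator, the `{*}` placeholder, and the function names are assumptions for illustration only.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set

PLACEHOLDER = "{*}"  # hypothetical stand-in for the preset rewrite parameter


def _collect(urls: Iterable[str], rules: Set[str], sep: str) -> Dict[str, Set[str]]:
    """Map each prefix to the resource parameters found directly under it."""
    params: Dict[str, Set[str]] = defaultdict(set)
    for url in urls:
        parts = [p for p in url.split(sep) if p]
        if not parts:
            continue
        prefix = parts[0]
        for resource in parts[1:]:
            params[prefix].add(resource)
            prefix += sep + (PLACEHOLDER if prefix in rules else resource)
    return params


def generate_rules(target_urls: List[str], threshold: int, sep: str = "/") -> Set[str]:
    """Add every prefix with more than `threshold` resources; stop at a fixed point."""
    rules: Set[str] = set()
    while True:
        params = _collect(target_urls, rules, sep)
        new_rules = {p for p, res in params.items() if len(res) > threshold} - rules
        if not new_rules:          # rule set no longer updated
            return rules
        rules |= new_rules
```

Each pass recollects parameters under the current rules, so a prefix such as `example.com/item/{*}` only becomes visible, and eligible for its own rule, after `example.com/item` has been added.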
7. The method according to claim 1, characterized in that the method further comprises:
Mapping a URL to be mapped to a rewritten URL according to the URL rewriting rule set of the target website.
8. The method according to claim 7, characterized in that mapping the URL to be mapped to the rewritten URL according to the URL rewriting rule set of the target website comprises:
Normalizing the URL to be mapped, to obtain a normalized URL;
Splitting the normalized URL based on a preset separator, to obtain a split string array;
Mapping the URL to be mapped to the rewritten URL according to the result of matching each prefix parameter in the split string array against the URL rewriting rule set.
9. The method according to claim 8, characterized in that mapping the URL to be mapped to the rewritten URL according to the result of matching each prefix parameter in the split string array against the URL rewriting rule set comprises:
Taking, in front-to-back order, the first string in the split string array as the current prefix parameter;
Judging whether the current prefix parameter is in the URL rewriting rule set; if so, combining the current prefix parameter with a preset rewrite parameter to form an updated prefix parameter; if not, combining the current prefix parameter with the resource parameter immediately following it to form the updated prefix parameter;
Taking the updated prefix parameter as the current prefix parameter and executing again the step of saving the current prefix parameter and the resource parameter immediately following it to the initial parameter set, until all strings in the split string array have been processed;
Taking the updated prefix parameter as the rewritten URL.
10. The method according to claim 9, characterized in that, in the case where the current prefix parameter is in the URL rewriting rule set, the method further comprises:
Obtaining the resource parameter corresponding to the current prefix parameter and the value of the resource parameter;
Saving the resource parameter, the value of the resource parameter, and the query string in association with the rewritten URL.
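The mapping loop of claims 9 and 10 can be sketched in Python as follows. The sketch assumes the URL has already been normalized (no query string), uses `/` as the separator and the hypothetical placeholder `{*}` as the preset rewrite parameter, and returns the replaced values as a list of `(prefix, value)` pairs; the claims do not prescribe any of these choices.

```python
from typing import List, Set, Tuple

PLACEHOLDER = "{*}"  # hypothetical stand-in for the preset rewrite parameter


def rewrite_url(url: str, rule_set: Set[str],
                sep: str = "/") -> Tuple[str, List[Tuple[str, str]]]:
    """Return the rewritten URL and the resource values that were replaced."""
    parts = [p for p in url.split(sep) if p]
    if not parts:
        return url, []
    prefix = parts[0]
    captured: List[Tuple[str, str]] = []
    for resource in parts[1:]:
        if prefix in rule_set:
            # Rule matched: remember the concrete value, then generalize it.
            captured.append((prefix, resource))
            prefix = prefix + sep + PLACEHOLDER
        else:
            prefix = prefix + sep + resource
    return prefix, captured
```

Keeping the captured values alongside the rewritten URL means a concrete, requestable URL can still be reconstructed after deduplication.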
11. A URL scanning method, characterized in that the method comprises:
Obtaining a pre-generated URL rewriting rule set and an initial URL set to be scanned of a target website, wherein the URL rewriting rule set is generated as follows: obtaining a target URL set of the target website, the target website being the website for which uniform resource locator (URL) rewriting rules are to be generated; obtaining a parameter set of mutually corresponding prefix parameters and resource parameters in the target URL set; and generating the URL rewriting rule set of the target website according to the parameter set;
Rewriting the initial URLs in the initial URL set according to the URL rewriting rule set, to obtain a rewritten initial URL set;
Deduplicating the rewritten initial URL set, to obtain a target URL set;
Scanning the target URLs in the target URL set.
12. A device for generating URL rewriting rules, characterized in that the device comprises:
A URL set obtaining unit, configured to obtain a target URL set of a target website, the target website being the website for which uniform resource locator (URL) rewriting rules are to be generated;
A parameter obtaining unit, configured to obtain a parameter set of mutually corresponding prefix parameters and resource parameters in the target URL set, wherein each resource parameter is a subpath of its prefix parameter;
A generation unit, configured to generate a URL rewriting rule set of the target website according to the parameter set.
13. A scanner, characterized in that the scanner comprises:
A URL obtaining unit, configured to obtain a pre-generated URL rewriting rule set and an initial URL set to be scanned of a target website, wherein the URL rewriting rule set is generated as follows: obtaining a target URL set of the target website, the target website being the website for which uniform resource locator (URL) rewriting rules are to be generated; obtaining a parameter set of mutually corresponding prefix parameters and resource parameters in the target URL set; and generating the URL rewriting rule set of the target website according to the parameter set;
A rewriting unit, configured to rewrite the initial URLs in the initial URL set according to the URL rewriting rule set, to obtain a rewritten initial URL set;
A deduplication unit, configured to deduplicate the rewritten initial URL set, to obtain a target URL set;
A scanning unit, configured to scan the target URLs in the target URL set.
14. A network device, characterized in that the network device comprises: a processor, a memory, a network interface and a bus system;
The bus system is configured to couple the hardware components of the network device together;
The network interface is configured to implement a communication connection between the network device and at least one other network device;
The memory is configured to store program instructions and/or data;
The processor is configured to read the instructions and/or data stored in the memory and perform the following operations:
Obtaining a target URL set of a target website, the target website being the website for which uniform resource locator (URL) rewriting rules are to be generated;
Obtaining a parameter set of mutually corresponding prefix parameters and resource parameters in the target URL set, wherein each resource parameter is a subpath of its prefix parameter;
Generating a URL rewriting rule set of the target website according to the parameter set.
CN201710892706.7A 2017-09-27 2017-09-27 Method and device for generating uniform resource locator rewriting rule Active CN109561163B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710892706.7A CN109561163B (en) 2017-09-27 2017-09-27 Method and device for generating uniform resource locator rewriting rule

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710892706.7A CN109561163B (en) 2017-09-27 2017-09-27 Method and device for generating uniform resource locator rewriting rule

Publications (2)

Publication Number Publication Date
CN109561163A true CN109561163A (en) 2019-04-02
CN109561163B CN109561163B (en) 2022-03-15

Family

ID=65864234

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710892706.7A Active CN109561163B (en) 2017-09-27 2017-09-27 Method and device for generating uniform resource locator rewriting rule

Country Status (1)

Country Link
CN (1) CN109561163B (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110399546A (en) * 2019-07-23 2019-11-01 中南民族大学 Link De-weight method, device, equipment and storage medium based on web crawlers
CN110413861A (en) * 2019-07-23 2019-11-05 中南民族大学 Link extracting method, device, equipment and storage medium based on web crawlers
CN111461537A (en) * 2020-03-31 2020-07-28 山东胜软科技股份有限公司 Oil gas production data based classified quantity counting method and control system
CN114157648A (en) * 2021-11-30 2022-03-08 北京知道创宇信息技术股份有限公司 Request matching rule generation method and device, website server and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101267457A (en) * 2008-04-14 2008-09-17 华耀环宇科技(北京)有限公司 A network resource mapping method oriented to L1 customer
US8510454B2 (en) * 2006-05-04 2013-08-13 Digital River, Inc. Mapped parameter sets using bulk loading system and method
CN103685237A (en) * 2013-11-22 2014-03-26 北京奇虎科技有限公司 Method and device for improving website vulnerability scanning speed
CN104933056A (en) * 2014-03-18 2015-09-23 腾讯科技(深圳)有限公司 Uniform resource locator (URL) de-duplication method and device
CN106708952A (en) * 2016-11-25 2017-05-24 北京神州绿盟信息安全科技股份有限公司 Web page clustering method and device

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110399546A (en) * 2019-07-23 2019-11-01 中南民族大学 Link De-weight method, device, equipment and storage medium based on web crawlers
CN110413861A (en) * 2019-07-23 2019-11-05 中南民族大学 Link extracting method, device, equipment and storage medium based on web crawlers
CN110413861B (en) * 2019-07-23 2021-10-22 中南民族大学 Link extraction method, device, equipment and storage medium based on web crawler
CN110399546B (en) * 2019-07-23 2022-02-08 中南民族大学 Link duplicate removal method, device, equipment and storage medium based on web crawler
CN111461537A (en) * 2020-03-31 2020-07-28 山东胜软科技股份有限公司 Oil gas production data based classified quantity counting method and control system
CN114157648A (en) * 2021-11-30 2022-03-08 北京知道创宇信息技术股份有限公司 Request matching rule generation method and device, website server and storage medium
CN114157648B (en) * 2021-11-30 2023-11-28 北京知道创宇信息技术股份有限公司 Request matching rule generation method and device, website server and storage medium

Also Published As

Publication number Publication date
CN109561163B (en) 2022-03-15

Similar Documents

Publication Publication Date Title
CN102222187B (en) Domain name structural feature-based hang horse web page detection method
CN103501306B (en) A kind of network address knows method for distinguishing, server and system
CN109561163A (en) The generation method and device of uniform resource locator rewriting rule
CN104486461B (en) Domain name classification method and device, domain name identification method and system
CN101471818B (en) Detection method and system for malevolence injection script web page
CN101370024B (en) Distributed information collection method and system
CN106708952B (en) A kind of Webpage clustering method and device
CN105243159A (en) Visual script editor-based distributed web crawler system
CN109344053B (en) Interface coverage test method, system, computer device and storage medium
CN102855418A (en) Method for discovering Web intranet agent bugs
CN106095979A (en) URL merging treatment method and apparatus
CN108959539B (en) Rule-configurable webpage data analysis method
CN105447035B (en) data scanning method and device
CN109308258A (en) Construction method, device, computer equipment and storage medium of test data
CN109885782B (en) Ecological environment space big data integration method
CN109710826A (en) A kind of internet information artificial intelligence acquisition method and its system
CN111723400A (en) JS sensitive information leakage detection method, device, equipment and medium
CN106940711B (en) URL detection method and detection device
CN111597422A (en) Buried point mapping method and device, computer equipment and storage medium
CN103647774A (en) Web content information filtering method based on cloud computing
CN103927325A (en) URL (uniform resource locator) classifying method and device
CN114186102A (en) Tree structure data construction method and device and computer equipment
CN111090802B (en) Malicious web crawler monitoring and processing method and system based on machine learning
US20180309854A1 (en) Protocol model generator and modeling method thereof
CN103685237A (en) Method and device for improving website vulnerability scanning speed

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant