CN105005600A - Preprocessing method of URL (Uniform Resource Locator) in access log - Google Patents

Info

Publication number
CN105005600A
Authority
CN
China
Prior art keywords
url
referer
request
rule
matching
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201510383588.8A
Other languages
Chinese (zh)
Other versions
CN105005600B (en)
Inventor
陈静
房鹏展
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Focus Technology Co Ltd
Original Assignee
Focus Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Focus Technology Co Ltd filed Critical Focus Technology Co Ltd
Priority to CN201510383588.8A priority Critical patent/CN105005600B/en
Publication of CN105005600A publication Critical patent/CN105005600A/en
Application granted granted Critical
Publication of CN105005600B publication Critical patent/CN105005600B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/903 Querying
    • G06F16/90335 Query processing
    • G06F16/90344 Query processing by using string matching techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/958 Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The invention provides a method for preprocessing URLs (Uniform Resource Locators) in a website access log. The method comprises the following steps: Step 11, website URL collection: sorting out and summarizing the website's URL address system; Step 12, URL configuration and storage: configuring the website URLs collected in Step 11 and storing them in a URL rule storage table, which comprises the fields URL unique code, URL identification rule, URL name and URL matching order; Step 13, retrieving the information in the URL rule storage table obtained in Step 12 and sorting it by URL matching order so that each parent URL precedes its child URL; Step 14, obtaining the access log records, each of which includes the visitor's IP, the access time, REFERER information and REQUEST information; Step 15, matching the REFERER and the REQUEST in each access log record from Step 14 against the URL identification rules in the URL rule storage table, in the order obtained in Step 13; and Step 16, obtaining the records whose REFERER or REQUEST was not successfully matched in Step 15, that is, those coded as -1 or a null value.

Description

A Preprocessing Method for URLs in Access Logs

Technical field

The invention relates to the field of website analysis, and in particular to a method and device for preprocessing URLs in website access logs.

Background

Website access path analysis provides important data support and guidance for optimizing a website's structure and page layout and for understanding visitors' behavioral preferences. The basic data for path analysis comes from the website's access log, which records the visitor's IP, the access time, the REFERER (the previously visited page), the REQUEST (the currently visited page) and other information. Of these, REFERER and REQUEST are the key information for building the set of visited pages and the access paths.

The REFERER and REQUEST recorded in the access log are both URL addresses. For example, the URL (uniform resource locator, that is, the address of a WWW page) of the home page of Made-in-China.com (hereinafter "MIC") is "www.made-in-china.com". Path analysis based on the raw REFERER and REQUEST values runs into a problem: the values are too fine-grained, which hinders subsequent statistical analysis and the extraction of access paths. For example, MIC visitors mainly arrive at the MIC search list page through GOOGLE, and different search terms or search conditions produce different search list page URLs. A search for "led" yields the search list page URL

"www.made-in-china.com/productdirectory.do?word=led&subaction=hunt&style=b&mode=and&code=0&comProvince=nolimit&order=0&isOpenCorrection=1".

A search for "led light" yields the search list page URL

"www.made-in-china.com/productdirectory.do?subaction=hunt&style=b&mode=and&code=0&comProvince=nolimit&order=0&isOpenCorrection=1&word=led+light".

In practice, access path analysis requires that specific URLs such as the two above be generalized and classified, for example by identifying both as "MIC search list page". Only then can the visitor access paths of the whole website be analyzed.

At present, research on access paths concentrates on how to collect the set of pages each visitor has visited and how to construct access paths. How to preprocess the REFERER and REQUEST in the access log is rarely discussed, even though this step is an important prerequisite for constructing access paths.

The HTTP Referer is part of the request header. When a browser sends a request to a web server, it usually includes the Referer to tell the server which page the current page was linked from (that is, the previously visited page), and the server can use this information in its processing. The REQUEST header is the request line (the URL of the page being accessed) that a client (usually a browser) sends when it issues a request to a web server.

Summary of the invention

Purpose of the invention: the present invention provides a method and device for preprocessing the original URLs recorded in a website access log, solving the problem of preprocessing the REFERER and REQUEST data recorded in the access log for path analysis.

A method for preprocessing website access log URLs comprises the following steps:

S11: Website URL collection, that is, sorting out and summarizing the website's URL address system. The main or important page URLs of the website are collected and their basic information is confirmed, including the URL identification rule and the URL name. A URL identification rule is the set of structural features shared by the URLs of one type of page, obtained by analyzing and generalizing the URLs of the original web pages; it can be described with a regular expression;

S12: URL configuration and storage. The website URLs collected in S11 are configured and stored in the URL rule storage table, which has the following fields: URL unique code, URL identification rule, URL name and URL matching order. The "URL unique code" identifies each URL identification rule uniquely and is generated automatically by the database; the "URL identification rule" and "URL name" come from step S11; the "URL matching order" controls the order in which rules are matched;

The "URL matching order" is determined as follows: if B_URL is a substring of A_URL, then A_URL and B_URL are said to have a string containment relationship, in which A_URL is the parent URL and B_URL is the child URL. A_URL is then matched before B_URL; that is, the parent URL is placed before the child URL;

S13: Retrieve the information in the URL rule storage table obtained in S12 and sort it by "URL matching order", ensuring that each parent URL precedes its child URL;

S14: Obtain the access log records, including the visitor's IP, the access time, the REFERER (the previously visited page) and the REQUEST (the visited page);

S15: Match the REFERER and the REQUEST in each access log record from S14 against the URL identification rules in the URL rule storage table, in the order obtained in S13. If a match succeeds, record the URL unique code of the matching URL identification rule as the REFERER code or the REQUEST code. If the REFERER or REQUEST matches none of the URL identification rules, use -1 or a null value as the REFERER code or REQUEST code;

Preferably, the matching in step S15 of the REFERER and REQUEST in each access log record from S14 against the URL identification rules in the URL rule storage table, in the order obtained in S13, includes the following:

When the REFERER or REQUEST has a string containment relationship with a URL identification rule in the URL rule storage table, that is, when the REFERER or REQUEST is the parent URL of that URL identification rule, the REFERER or REQUEST matches that rule;

If the REFERER or REQUEST matches several URL identification rules in the URL rule storage table, the rule that comes first in the S13 order is used;

S16: Obtain the REFERER and REQUEST values that were not matched in S15, that is, the records whose REFERER code or REQUEST code is -1 or a null value, and merge all unmatched REFERER and REQUEST values into a single unmatched URL set;

S17: Statistically analyze the unmatched URL set from S16 to find the URLs that occur most often in it, and (optionally combined with manual judgment and monitoring) add the unmatched URLs to the URL rule configuration table, so that the URL identification rules in the table are continuously improved.
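Steps S16 and S17 can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the patent's implementation: the tuple layout of an encoded record and the function names `collect_unmatched` and `rank_unmatched` are hypothetical, and the sample codes are placeholders. Only the use of -1 or a null value as the "unmatched" marker comes from S15.

```python
from collections import Counter

UNMATCHED = (-1, None)  # S15 marks an unmatched URL with -1 or a null value


def collect_unmatched(encoded_records):
    """S16: merge every unmatched REFERER and REQUEST into one list.

    Each record is assumed (hypothetical layout) to be a tuple:
    (request_url, request_code, referer_url, referer_code).
    """
    unmatched = []
    for request_url, request_code, referer_url, referer_code in encoded_records:
        if request_code in UNMATCHED:
            unmatched.append(request_url)
        if referer_code in UNMATCHED:
            unmatched.append(referer_url)
    return unmatched


def rank_unmatched(unmatched_urls):
    """S17: count each unmatched URL and list the most frequent first."""
    return Counter(unmatched_urls).most_common()


# Illustrative records; the codes are placeholders for this sketch.
records = [
    ("www.made-in-china.com", 1003, "www.google.com", -1),
    ("www.made-in-china.com", 1003, "www.google.com", -1),
    ("sourcing.made-in-china.com/suppliers.html", None,
     "www.made-in-china.com/special/vacuum-pump/", 1005),
]
ranking = rank_unmatched(collect_unmatched(records))
```

The URLs at the top of `ranking` are the candidates a reviewer would consider turning into new identification rules.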

The present invention also provides a device for preprocessing URLs in an access log, comprising:

URL collection unit: collects the website's URLs and determines the URL identification rules and URL names. A URL identification rule is the set of structural features shared by the URLs of one type of page, obtained by analyzing and generalizing the URLs of the original web pages; it can be described with a regular expression;

URL rule storage unit: stores the URL identification rules and related information, namely the URL unique code, URL identification rule, URL name and URL matching order. The "URL unique code" identifies each URL identification rule uniquely; the "URL identification rule" and "URL name" come from the URL collection unit; the "URL matching order" controls the order in which rules are matched.

Preferably, the URL rule storage unit includes:

URL rule configuration module: determines the "URL unique code" and the "URL matching order". The "URL unique code" can be generated automatically by the database or assigned manually, provided that URL unique codes and URL identification rules stay in a one-to-one relationship. The "URL matching order" is determined as follows: if B_URL is a substring of A_URL (for example, A_URL is "abcd" and B_URL is "abc"), then A_URL and B_URL are said to have a string containment relationship, in which A_URL is the parent URL and B_URL is the child URL. A_URL is then matched before B_URL; that is, the parent URL is placed before the child URL.

URL rule storage module: stores the URL rule storage table, which holds the URL identification rules and related information, namely the URL unique code, URL identification rule, URL name and URL matching order;

URL rule acquisition unit: sorts the URL identification rules by URL matching order and retrieves the URL identification rules and URL unique codes in that order;

Log record acquisition unit: obtains every record in the access log, including the visitor's IP, the access time, the REFERER (the previously visited page), the REQUEST (the visited page) and other information;

URL matching unit: matches the REFERER and REQUEST of every record in the access log against the URL identification rules. One log record is taken, and its REFERER or REQUEST is matched against the URL identification rules one by one, in the order in which the rules were retrieved. If the REFERER or REQUEST is the parent URL of a URL identification rule, the match succeeds and that rule's URL unique code is taken as the REFERER code or REQUEST code. If the REFERER or REQUEST has no string containment relationship with any URL identification rule, the REFERER code or REQUEST code is given a special mark, such as "-1" or a null value. This completes the matching of that log record, and the matching loop for the record exits. The next log record is then taken and matched in the same way, until all log records have been matched;

Matching result set storage unit: stores the results of matching the access log against the URL identification rules, including the original information from the access log, such as the IP, access time, REFERER and REQUEST, together with the REFERER code and REQUEST code described above;

Unmatched URL monitoring unit, comprising: an unmatched data acquisition unit, which obtains the unmatched REFERER and REQUEST values from the matching result set and merges them into the unmatched URL set; an unmatched data statistics module, which counts the number of records for each URL in the unmatched URL set; and an unmatched data monitoring module, which sorts the URLs of the unmatched URL set in descending order of record count, yielding the collection of unmatched URLs. Depending on actual business needs, it can then be decided whether to configure these URLs into the URL rule storage table. If so, the process returns to the URL collection unit and follows the flow above, until every URL that needs to be analyzed has been added to the URL rule storage table.

The beneficial effects of the invention are as follows. The invention provides a method for preprocessing the original URLs recorded in a website access log, which solves the problem of preprocessing the REFERER and REQUEST data recorded in the access log for path analysis:

1) By collecting the website's page URLs into a URL rule storage table and matching the REFERER and REQUEST recorded in the original access log against the URL identification rules in that table, every REFERER and REQUEST is coded and named. The original URL form of the REFERER and REQUEST is thereby converted into codes and business names that are convenient for subsequent statistical analysis and application.

2) By monitoring and analyzing the unmatched URL set, the URL rule storage table can be improved continuously until it gradually covers all pages of the website, so that as many access log records as possible obtain a REFERER code and a REQUEST code. This provides complete, preprocessed data for subsequent analysis based on the access log.

Description of drawings

Figure 1 is a flowchart of a method for preprocessing website access log URLs according to an embodiment of the invention;

Figure 2 is a schematic structural diagram of a device for preprocessing website access log URLs according to an embodiment of the invention.

Detailed description of the embodiments

Specific embodiments of the invention are described in further detail below with reference to the drawings and examples. The described embodiments are only some of the embodiments of the invention, not all of them. Changes or equivalent variations made on the basis of the embodiments of this application and the technical substance of the claims still fall within the scope of protection of this application.

Referring to Figure 1, the implementation steps of this application are as follows:

S11: Website URL collection, that is, sorting out and summarizing the website's URL address system. In the initial stage, URL collection can be done manually: the main or important page URLs of the website are collected by hand and their basic information is confirmed, including the URL identification rule, the URL name and so on. A URL identification rule is the set of structural features shared by the URLs of one type of page, obtained by analyzing and generalizing the URLs of the original web pages.

For example, the URLs of the product search list pages of Made-in-China.com all begin with "www.made-in-china.com/productdirectory.do?", so the identification rule for the product search list page is "www.made-in-china.com/productdirectory.do?". URL identification rules can also be described with regular expressions.

The URL of a product search list page has the following form:

www.made-in-china.com/productdirectory.do?word=led&subaction=hunt&style=b&mode=and&code=0&comProvince=nolimit&order=0&isOpenCorrection=1

It begins with "www.made-in-china.com/productdirectory.do?", and the parameters that follow, such as "word", record the search term and other information. Hence, if a URL begins with "www.made-in-china.com/productdirectory.do?", it can be identified as a product search list page.
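The prefix test for the product search list page can be illustrated with a short Python sketch. The function names `is_product_search` and `search_word` are hypothetical, and stripping a leading scheme such as "http://" before the comparison is an assumption made here so that logged URLs with and without a scheme are treated alike. The two sample URLs are the "led" and "led light" searches quoted in the background section.

```python
from urllib.parse import parse_qs, urlsplit

SEARCH_PREFIX = "www.made-in-china.com/productdirectory.do?"


def is_product_search(url):
    """True when the URL, with any scheme removed, starts with the prefix."""
    bare = url.split("://", 1)[-1]  # drop "http://" or similar if present
    return bare.startswith(SEARCH_PREFIX)


def search_word(url):
    """Return the decoded `word` query parameter, or None when absent."""
    values = parse_qs(urlsplit(url).query).get("word")
    return values[0] if values else None


url_a = ("www.made-in-china.com/productdirectory.do?word=led&subaction=hunt"
         "&style=b&mode=and&code=0&comProvince=nolimit&order=0&isOpenCorrection=1")
url_b = ("www.made-in-china.com/productdirectory.do?subaction=hunt&style=b"
         "&mode=and&code=0&comProvince=nolimit&order=0&isOpenCorrection=1"
         "&word=led+light")
```

Both URLs are classified as the same page type even though their query strings differ, while `search_word` recovers "led" and "led light" respectively.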

The URL of the Made-in-China.com home page is: www.made-in-china.com.

The URL of the Made-in-China.com topic home page is: www.made-in-china.com/special.

The URL of a Made-in-China.com topic detail page (for example, the magic-show topic) is: www.made-in-china.com/special/magic-show/.

The URL identification rules and URL names for the four pages above can then be, respectively:

"www.made-in-china.com/productdirectory.do?", "Product search list page";

"www.made-in-china.com$", "MIC home page";

"www.made-in-china.com/special", "Topic home page";

"www.made-in-china.com/special/", "Topic detail page".

Here, the "$" in the identification rule for the MIC home page is regular expression notation meaning "ends with the string preceding the $"; in this case it denotes every string that ends with "www.made-in-china.com";
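A minimal Python sketch of how these four rules behave as patterns is given below. It rests on one loudly stated assumption: a trailing "$" is the only regular expression syntax in a rule, and every other character (including "." and "?") is matched literally. The source does not specify this, and the helper names `compile_rule` and `matching_names` are hypothetical.

```python
import re


def compile_rule(rule):
    """Turn a rule string into a compiled pattern.

    Assumption: a trailing "$" anchors the match at the end of the URL;
    all other characters are escaped and matched literally.
    """
    if rule.endswith("$"):
        return re.compile(re.escape(rule[:-1]) + "$")
    return re.compile(re.escape(rule))


RULES = {
    "Product search list page": "www.made-in-china.com/productdirectory.do?",
    "MIC home page": "www.made-in-china.com$",
    "Topic home page": "www.made-in-china.com/special",
    "Topic detail page": "www.made-in-china.com/special/",
}
PATTERNS = {name: compile_rule(rule) for name, rule in RULES.items()}


def matching_names(url):
    """Names of every rule found in the URL, before any ordering is applied."""
    return [name for name, pattern in PATTERNS.items() if pattern.search(url)]
```

Under this reading, `matching_names("www.made-in-china.com/special/magic-show/")` returns both the topic home page and the topic detail page, which is why a matching order is needed to pick one.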

S12: URL configuration and storage. The website URLs collected in S11 are configured and stored in the URL rule storage table, which has the following fields: URL unique code, URL identification rule, URL name and URL matching order. The "URL unique code" identifies each URL identification rule uniquely and can be generated automatically by the database; the "URL identification rule" and "URL name" come from step S11; the "URL matching order" controls the order in which rules are matched.

The "URL matching order" is determined as follows: if B_URL is a substring of A_URL (for example, A_URL is "abcd" and B_URL is "abc"), then A_URL and B_URL are said to have a string containment relationship, in which A_URL is the parent URL and B_URL is the child URL. A_URL is then matched before B_URL; that is, the parent URL is placed before the child URL.

Specifically, when the first page URL of the website is configured into the URL rule storage table, taking the product search list page of Made-in-China.com as an example, the URL unique code, URL identification rule, URL name and URL matching order are, respectively:

"1001", "www.made-in-china.com/productdirectory.do?", "Product search list page", "Product search list page". Note that at this point the URL rule storage table contains no configured URL identification rules yet, so the value of "URL matching order" can be chosen freely; here the URL name is used as its value.

The URL unique codes, URL identification rules, URL names and URL matching orders of the four pages above are, respectively:

"1002", "www.made-in-china.com/productdirectory.do?", "Product search list page", "Product search list page";

"1003", "www.made-in-china.com$", "MIC home page", "MIC home page";

"1004", "www.made-in-china.com/special", "Topic home page", "Topic page 2";

"1005", "www.made-in-china.com/special/", "Topic detail page", "Topic page 1".

Here the URL identification rule of the topic detail page (www.made-in-china.com/special/) is the parent URL of the identification rule of the topic home page (www.made-in-china.com/special). The URL matching orders of the topic detail page and the topic home page are therefore set to "Topic page 1" and "Topic page 2" respectively, which guarantees that when the rules are sorted in ascending URL matching order, the topic detail page comes before the topic home page.
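One way to realize a rule storage table whose unique code is generated by the database is sketched below with Python's built-in `sqlite3` module. The source does not name a database, so SQLite, the table name `url_rule_table` and the column names are all assumptions made for illustration. The codes generated here start at 1 and therefore differ from the 1002 to 1005 values in the example above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.execute(
    """
    CREATE TABLE url_rule_table (
        url_code            INTEGER PRIMARY KEY AUTOINCREMENT, -- URL unique code
        identification_rule TEXT NOT NULL UNIQUE,              -- URL identification rule
        url_name            TEXT NOT NULL,                     -- URL name
        match_order         TEXT NOT NULL                      -- URL matching order
    )
    """
)

rows = [
    ("www.made-in-china.com/productdirectory.do?",
     "Product search list page", "Product search list page"),
    ("www.made-in-china.com$", "MIC home page", "MIC home page"),
    ("www.made-in-china.com/special", "Topic home page", "Topic page 2"),
    ("www.made-in-china.com/special/", "Topic detail page", "Topic page 1"),
]
# The database assigns url_code, keeping code and rule in a one-to-one relation.
conn.executemany(
    "INSERT INTO url_rule_table (identification_rule, url_name, match_order) "
    "VALUES (?, ?, ?)",
    rows,
)
conn.commit()

stored = conn.execute(
    "SELECT url_code, identification_rule FROM url_rule_table ORDER BY url_code"
).fetchall()
```

The `UNIQUE` constraint on the rule column is a design choice of this sketch: together with the primary key it enforces the one-to-one relationship between codes and rules.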

S13: Retrieve the information in the URL rule storage table obtained in S12 and sort it by "URL matching order", ensuring that each parent URL precedes its child URL.

Specifically, retrieving the information in the URL rule storage table above and sorting it in ascending "URL matching order" gives:

"1003", "www.made-in-china.com$", "MIC home page", "MIC home page";

"1002", "www.made-in-china.com/productdirectory.do?", "Product search list page", "Product search list page";

"1005", "www.made-in-china.com/special/", "Topic detail page", "Topic page 1";

"1004", "www.made-in-china.com/special", "Topic home page", "Topic page 2".
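The sort in S13, and a check that no child rule ends up ahead of its parent, can be sketched in Python as follows. The class `UrlRule` and the functions `sort_rules` and `order_violations` are hypothetical names. The matching-order values are English renderings chosen for this sketch so that a plain string sort reproduces the order listed above; with the original Chinese values the result would depend on the database collation, which the source does not specify.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UrlRule:
    code: int
    rule: str
    name: str
    match_order: str


RULE_TABLE = [
    UrlRule(1002, "www.made-in-china.com/productdirectory.do?",
            "Product search list page", "Product search list page"),
    UrlRule(1003, "www.made-in-china.com$", "MIC home page", "MIC home page"),
    UrlRule(1004, "www.made-in-china.com/special",
            "Topic home page", "Topic page 2"),
    UrlRule(1005, "www.made-in-china.com/special/",
            "Topic detail page", "Topic page 1"),
]


def sort_rules(rules):
    """S13: ascending sort on the matching-order field."""
    return sorted(rules, key=lambda r: r.match_order)


def order_violations(ordered_rules):
    """Return (child_code, parent_code) pairs where a child rule comes first.

    A rule is a child of a later rule when its text is a proper substring
    of the later rule's text.
    """
    bad = []
    for i, earlier in enumerate(ordered_rules):
        for later in ordered_rules[i + 1:]:
            if earlier.rule != later.rule and earlier.rule in later.rule:
                bad.append((earlier.code, later.code))
    return bad


ordered = sort_rules(RULE_TABLE)
```

Running `order_violations` on the unsorted table reports the pair (1004, 1005), because the topic home page rule would shadow the topic detail page rule; on `ordered` it reports nothing.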

S14: Obtain the access log records, including the visitor's IP, the access time, the REFERER (the previously visited page), the REQUEST (the visited page) and other information. Specifically, the records in the access log can take the following form:

192.168.1.1, 2015-01-01 12:01:00, www.made-in-china.com, www.google.com;

192.168.1.1, 2015-01-01 12:01:30, www.made-in-china.com/special/vacuum-pump/, www.made-in-china.com;

192.168.1.1, 2015-01-01 12:01:30, sourcing.made-in-china.com/suppliers.html, www.made-in-china.com/special/vacuum-pump/;

192.168.2.1, 2015-01-01 12:02:10, www.made-in-china.com, www.google.com;

192.168.2.1, 2015-01-01 12:03:10, http://www.made-in-china.com/productdirectory.do?word=led&subaction=hunt&style=b&mode=and&code=0&comProvince=nolimit&order=0&isOpenCorrection=1, www.made-in-china.com;

Here "192.168.1.1" and "192.168.2.1" are visitors' IP addresses, and the time next to each IP address is the time at which the visitor accessed the page. The URL that follows the access time is the URL of the page the visitor is currently visiting, that is, the REQUEST (www.made-in-china.com in the first record). The URL after that is the URL of the page the visitor visited immediately before, that is, the REFERER (www.google.com in the first record). In other words, the visitor moved from the previous page (REFERER) to the current page (REQUEST); in the first record, the visitor moved from www.google.com to www.made-in-china.com.
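Reading one record of this form into named fields can be sketched in Python as below. The class `LogRecord` and the function `parse_record` are hypothetical, and the parser assumes the layout of the sample records: four comma-separated fields in the order IP, access time, REQUEST, REFERER, an optional trailing semicolon, and no commas inside the URLs. Real access logs generally need a more robust parser.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class LogRecord:
    ip: str
    access_time: datetime
    request: str  # page currently visited
    referer: str  # page visited immediately before


def parse_record(line):
    """Parse one 'ip, time, request, referer;' record from the example log.

    Assumes the URLs contain no commas, which holds for the sample
    records but not for URLs in general.
    """
    ip, time_text, request, referer = (
        field.strip() for field in line.strip().rstrip(";").split(",")
    )
    access_time = datetime.strptime(time_text, "%Y-%m-%d %H:%M:%S")
    return LogRecord(ip, access_time, request, referer)


record = parse_record(
    "192.168.1.1, 2015-01-01 12:01:00, www.made-in-china.com, www.google.com;"
)
```

A line with a different number of fields raises `ValueError` at the unpacking step, which is a deliberate choice here so that malformed records are not silently misread.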

S15: Match the REFERER and the REQUEST of each access log record from S14 against the URL identification rules in the URL rule storage table obtained in S13, in the order in which the rules were retrieved in S13. If a match succeeds, record the unique URL code of the matching URL identification rule as the REFERER code or the REQUEST code, respectively. If the REFERER or the REQUEST matches none of the URL identification rules, use -1 or a null value as its REFERER code or REQUEST code.

Preferably, in this application, matching the REFERER and REQUEST of each access log record from S14 against the URL identification rules in the URL rule storage table obtained in S13, in the order established in S13, includes the following:

When the REFERER or REQUEST and a URL identification rule in the URL rule storage table have a string containment relationship, i.e. the REFERER or REQUEST is the parent URL of that URL identification rule (the rule is a substring of it), the REFERER or REQUEST matches that rule.

If the REFERER or REQUEST matches several URL identification rules in the URL rule storage table, the rule ranked first according to the ordering in S13 is taken.
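The containment test and the first-match rule can be sketched as below. This is a minimal sketch under assumptions, not the patented implementation: the S13 rule table is not reproduced in this part of the document, so the rule strings for codes 1005 and 1003 are assumed, the pairing of code 1002 with the product search rule is inferred from the worked example, and `match_code` is a hypothetical name.

```python
from typing import List, Tuple

UNMATCHED = -1  # code used when no identification rule matches

# (unique URL code, identification rule), already sorted in matching order
# so that a parent URL (the longer, containing string) precedes its child.
# Assumption: the rule strings for 1005 and 1003 are illustrative only.
RULES: List[Tuple[int, str]] = [
    (1002, "www.made-in-china.com/productdirectory.do?"),
    (1005, "www.made-in-china.com/special/"),
    (1003, "www.made-in-china.com"),
]


def match_code(url: str, rules: List[Tuple[int, str]] = RULES) -> int:
    """Return the code of the first rule that is a substring of url."""
    for code, rule in rules:
        if rule in url:        # url is the "parent URL" of the rule
            return code        # first match in matching order wins
    return UNMATCHED


for url in ("www.made-in-china.com",
            "www.made-in-china.com/special/vacuum-pump/",
            "www.google.com"):
    print(url, match_code(url))
```

Because the loop returns on the first hit, the ordering of `RULES` alone decides which code a URL receives when several rules are contained in it.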

Specifically, the log records listed in S14 are matched against the URL rule storage table from S13.

Take the first record:

192.168.1.1, 2015-01-01 12:01:00, www.made-in-china.com, www.google.com;

The REQUEST is www.made-in-china.com, which matches the MIC home page rule in the URL rule storage table of S13, so the unique URL code "1003" of the MIC home page is taken as the REQUEST code of this record. The REFERER is www.google.com, which matches none of the URL identification rules in the URL rule storage table of S13, so the REFERER code of this record is set to "-1".

Take the second record:

192.168.1.1, 2015-01-01 12:01:30, www.made-in-china.com/special/vacuum-pump/, www.made-in-china.com;

The REQUEST is www.made-in-china.com/special/vacuum-pump/, which matches both the topic detail page and the topic home page in the URL rule storage table of S13. The rule ranked first in the matching order is taken, so the unique URL code "1005" of the topic detail page becomes the REQUEST code of this record. The REFERER is www.made-in-china.com, which matches the MIC home page in the URL rule storage table of S13, so the REFERER code of this record is "1003".

Matching continues in this way until all log records have been processed. The final matching results for all records are as follows (IP, access time, REQUEST, REFERER, REQUEST code, REFERER code):

192.168.1.1, 2015-01-01 12:01:00, www.made-in-china.com, www.google.com, 1003, -1;

192.168.1.1, 2015-01-01 12:01:30, www.made-in-china.com/special/vacuum-pump/, www.made-in-china.com, 1005, 1003;

192.168.1.1, 2015-01-01 12:01:30, sourcing.made-in-china.com/suppliers.html, www.made-in-china.com/special/vacuum-pump/, -1, 1005;

192.168.2.1, 2015-01-01 12:02:10, www.made-in-china.com, www.google.com, 1003, -1;

192.168.2.1, 2015-01-01 12:03:10, http://www.made-in-china.com/productdirectory.do?word=led&subaction=hunt&style=b&mode=and&code=0&comProvince=nolimit&order=0&isOpenCorrection=1, www.made-in-china.com, 1002, 1003;

S16: Collect the REFERERs and REQUESTs that were not matched in S15, i.e. those whose REFERER code or REQUEST code is -1 or null, and merge all of them into a single unmatched URL set.

Specifically, the REFERERs that were not matched in S15 are:

www.google.com;

www.google.com.

The REQUEST that was not matched is:

sourcing.made-in-china.com/suppliers.html.

Merging them gives the unmatched URL set:

www.google.com;

www.google.com;

sourcing.made-in-china.com/suppliers.html.
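Step S16 can be sketched as a simple filter over the S15 results. The tuple layout and the function name `collect_unmatched` are assumptions made for illustration; only the URLs and codes come from the worked example.

```python
from typing import List, Optional, Tuple

UNMATCHED = -1

# (REQUEST, REFERER, REQUEST code, REFERER code) taken from the S15 results.
Row = Tuple[str, str, Optional[int], Optional[int]]

results: List[Row] = [
    ("www.made-in-china.com", "www.google.com", 1003, -1),
    ("www.made-in-china.com/special/vacuum-pump/", "www.made-in-china.com", 1005, 1003),
    ("sourcing.made-in-china.com/suppliers.html",
     "www.made-in-china.com/special/vacuum-pump/", -1, 1005),
    ("www.made-in-china.com", "www.google.com", 1003, -1),
    ("http://www.made-in-china.com/productdirectory.do?word=led&subaction=hunt"
     "&style=b&mode=and&code=0&comProvince=nolimit&order=0&isOpenCorrection=1",
     "www.made-in-china.com", 1002, 1003),
]


def collect_unmatched(rows: List[Row]) -> List[str]:
    """Merge unmatched REFERERs and REQUESTs (code -1 or None) into one list."""
    referers = [ref for _, ref, _, ref_code in rows if ref_code in (UNMATCHED, None)]
    requests = [req for req, _, req_code, _ in rows if req_code in (UNMATCHED, None)]
    # Duplicates are kept on purpose: the later statistics count occurrences.
    return referers + requests


unmatched_urls = collect_unmatched(results)
print(unmatched_urls)
```

Keeping duplicates means the "set" here is really a multiset, which is what the frequency analysis in the next step needs.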

S17: Perform statistical analysis on the unmatched URL set from S16 to find the URLs that occur most often in it. Combined with manual judgment and monitoring, the unmatched URLs can then be configured into the URL rule configuration table, so that the URL identification rules in that table are continuously improved.

Specifically, the unmatched URLs in the unmatched URL set of S16 are www.google.com and sourcing.made-in-china.com/suppliers.html. Of these, www.google.com is a major search engine and the entry point for most overseas visitors, so it is a traffic source that most websites should pay close attention to. Therefore www.google.com can also be collected and configured into the URL rule storage table, in which case S11 to S13 are repeated.
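The statistical step of S17 amounts to counting occurrences and ranking them. One way to sketch it, assuming the unmatched URL set from the worked example and using Python's standard `collections.Counter` (the source does not prescribe any particular tool):

```python
from collections import Counter

# Unmatched URL set from S16 in the worked example (duplicates retained).
unmatched_urls = [
    "www.google.com",
    "www.google.com",
    "sourcing.made-in-china.com/suppliers.html",
]

# Count how often each URL occurs and rank from most to least frequent.
ranked = Counter(unmatched_urls).most_common()

for url, count in ranked:
    print(f"{count:>3}  {url}")
```

The top of this ranking is what a person would review when deciding which URLs deserve a new identification rule.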

The present invention also provides a URL preprocessing device for access logs, whose functional modules are arranged according to the above method.

URL collection unit: used to collect the URLs of the website and to determine the URL identification rules, URL names and so on. A URL identification rule is the compositional feature of the URLs of a certain type of page, obtained by analyzing and generalizing the URLs of the original web pages. For example, the URLs of the product search list pages on Made-in-China.com all begin with "www.made-in-china.com/productdirectory.do?", so the identification rule for the product search list page is "www.made-in-china.com/productdirectory.do?". URL identification rules can also be described using regular expressions.
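Since identification rules may be written as regular expressions, the product search list page rule could be expressed as follows. The pattern and the name `is_product_search` are illustrative assumptions; in particular, allowing an optional `http://` or `https://` prefix is a choice made here, not something the source specifies.

```python
import re

# Hypothetical regex form of the product search list page rule: an optional
# scheme, then the fixed prefix quoted in the text.
PRODUCT_SEARCH_RULE = re.compile(
    r"^(?:https?://)?www\.made-in-china\.com/productdirectory\.do\?"
)


def is_product_search(url: str) -> bool:
    """True if url looks like a product search list page under the rule above."""
    return PRODUCT_SEARCH_RULE.match(url) is not None


print(is_product_search(
    "http://www.made-in-china.com/productdirectory.do?word=led&subaction=hunt"
))
print(is_product_search("www.made-in-china.com"))
```

Anchoring the pattern at the start of the URL is stricter than plain substring containment, which can help avoid accidental matches inside query strings.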

URL rule storage unit: used to store the URL identification rules and related information, including the unique URL code, the URL identification rule, the URL name and the URL matching order. The "unique URL code" identifies each URL identification rule uniquely; the "URL identification rule" and "URL name" come from the URL collection unit; the "URL matching order" controls the order in which rules are matched.
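The matching order must place a parent URL (a rule that contains another rule as a substring) before its child. One simple way to obtain such an order, offered here as an assumption rather than as the method the patent uses, is to sort rules by length in descending order: a string that strictly contains another is always longer. The names `order_rules` and `parent_before_child` and the "special/" rule string are hypothetical.

```python
from typing import List


def order_rules(rules: List[str]) -> List[str]:
    """Sort rules so that longer (potential parent) rules come first."""
    return sorted(rules, key=len, reverse=True)


def parent_before_child(ordered: List[str]) -> bool:
    """Check that no rule appears before a rule that contains it."""
    for i, earlier in enumerate(ordered):
        for later in ordered[i + 1:]:
            if earlier != later and earlier in later:
                return False  # a child (substring) precedes its parent
    return True


rules = [
    "www.made-in-china.com",
    "www.made-in-china.com/productdirectory.do?",
    "www.made-in-china.com/special/",   # assumed rule string, for illustration
]
ordered = order_rules(rules)
print(ordered)
print(parent_before_child(ordered))
```

Length ordering is sufficient for the parent-before-child constraint but is stricter than necessary; unrelated rules could be placed in any relative order.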

Log record acquisition unit: used to obtain each record in the access log, including the visitor's IP, the access time, the REFERER (the previously visited page), the REQUEST (the page being visited) and other information.

URL matching unit: used to match the REFERER and REQUEST of each record in the access log against the URL identification rules. It takes one log record and matches the REFERER or REQUEST against the URL identification rules one by one, in the order in which the rules were retrieved. If the REFERER or REQUEST is the parent URL of some URL identification rule, the match succeeds and the unique URL code of that rule is taken as the REFERER code or REQUEST code. If the REFERER or REQUEST has no string containment relationship with any URL identification rule, the REFERER code or REQUEST code is given a special mark, such as "-1" or a null value. This completes the matching of that log record, and processing of the record ends. The unit then takes the next log record and matches it in the same way, until all log records have been matched.

Matching result set storage unit: used to store the results of matching the access log against the URL identification rules, including the original information from the access log, such as IP, access time, REFERER and REQUEST, together with the REFERER code and REQUEST code described above.

Unmatched URL monitoring unit: used to analyze the REFERERs and REQUESTs in the access log that were not matched, so that the URL identification rules can be improved until they cover all pages of the website, achieving gradual refinement and optimization.

The unmatched URL monitoring unit includes:

Unmatched data acquisition unit: used to obtain the unmatched REFERERs and REQUESTs from the matching result set and merge them into an unmatched URL set.

Unmatched data statistics module: counts the number of records for each URL in the unmatched URL set.

Unmatched data monitoring module: sorts the URLs in the unmatched URL set in descending order of record count, yielding a ranked collection of unmatched URLs. Based on actual business needs, it can then be decided whether to configure these URLs into the URL rule storage table. If so, control returns to the URL collection unit and the above process is repeated until all URLs that need to be analyzed have been added to the URL rule storage table.

The method and system provided by the present invention have been described in detail above, but this description should not be understood as limiting the scope of the invention. The scope of protection is defined by the appended claims, and any modification made on the basis of the claims falls within the scope of protection of the present invention.

Claims (3)

1. A preprocessing method for URLs in a website access log, characterized by comprising the steps of:

S11: website URL collection, i.e. sorting out and summarizing the website's URL address system; collecting the URLs of the main or important pages of the website and confirming the basic information of these URLs, including the URL identification rule and the URL name; wherein a URL identification rule is the compositional feature of the URLs of a certain type of page, obtained by analyzing and generalizing the URLs of the original web pages; URL identification rules can be described using regular expressions;

S12: URL configuration and storage, i.e. configuring the website URLs collected in S11 and storing them in a URL rule storage table; the URL rule storage table includes the following fields: unique URL code, URL identification rule, URL name, URL matching order; wherein the "unique URL code" identifies each URL identification rule uniquely and is generated automatically by the database; the "URL identification rule" and "URL name" come from step S11; the "URL matching order" controls the order in which rules are matched; the "URL matching order" is determined as follows: if B_URL is a substring of A_URL, then A_URL and B_URL are said to have a string containment relationship, with A_URL being the parent URL and B_URL the child URL; in the matching order A_URL then comes first and B_URL after it, i.e. the parent URL is ranked before the child URL;

S13: retrieving the information in the URL rule storage table obtained in S12 and sorting it by "URL matching order", ensuring that parent URLs are ranked before child URLs;

S14: obtaining the records of the access log, including the visitor's IP, the access time, the REFERER (the previously visited page) and the REQUEST (the page being visited);

S15: matching the REFERER and the REQUEST of each access log record from S14 against the URL identification rules in the URL rule storage table obtained in S13, in the order in which the rules were retrieved in S13; if a match succeeds, recording the unique URL code of the matching URL identification rule as the REFERER code or the REQUEST code; if the REFERER or REQUEST matches none of the URL identification rules, using -1 or a null value as the REFERER code or REQUEST code;

S16: collecting the REFERERs and REQUESTs that were not matched in S15, i.e. those whose REFERER code or REQUEST code is -1 or null, and merging all of them into an unmatched URL set;

S17: performing statistical analysis on the unmatched URL set from S16 to find the URLs that occur most often in it, and configuring the unmatched URLs into the URL rule configuration table, so that the URL identification rules in the URL rule configuration table can be continuously improved.

2. The preprocessing method for URLs in a website access log according to claim 1, characterized in that step S15, in which the REFERER and REQUEST of each access log record from step S14 are matched against the URL identification rules in the URL rule storage table obtained in S13 in the order established in S13, includes:

when the REFERER or REQUEST and a URL identification rule in the URL rule storage table have a string containment relationship, i.e. the REFERER or REQUEST is the parent URL of that URL identification rule, the REFERER or REQUEST matches that URL identification rule;

if the REFERER or REQUEST matches several URL identification rules in the URL rule storage table, the URL identification rule ranked first according to the ordering in S13 is taken.

3. A URL preprocessing device for access logs, characterized by comprising:

a URL collection unit: used to collect the URLs of the website and to determine the URL identification rules and URL names; wherein a URL identification rule is the compositional feature of the URLs of a certain type of page, obtained by analyzing and generalizing the URLs of the original web pages, and URL identification rules can be described using regular expressions;

a URL rule storage unit: used to store the URL identification rules and related information, including the unique URL code, the URL identification rule, the URL name and the URL matching order; wherein the "unique URL code" identifies each URL identification rule uniquely; the "URL identification rule" and "URL name" come from the URL collection unit; the "URL matching order" controls the order in which rules are matched;

a log record acquisition unit: used to obtain each record in the access log, including the visitor's IP, the access time, the REFERER and the REQUEST;

a URL matching unit: used to match the REFERER and REQUEST of each record in the access log against the URL identification rules; it takes one log record and matches the REFERER or REQUEST against the URL identification rules one by one, in the order in which the rules were retrieved; if the REFERER or REQUEST is the parent URL of some URL identification rule, the match succeeds and the unique URL code of that rule is taken as the REFERER code or REQUEST code; if the REFERER or REQUEST has no string containment relationship with any URL identification rule, the REFERER code or REQUEST code is given a special mark, such as "-1" or a null value, which completes the matching of that log record and ends its processing; the unit then takes the next log record and matches it in the same way, until all log records have been matched;

a matching result set storage unit: used to store the results of matching the access log against the URL identification rules, including the original information from the access log, such as IP, access time, REFERER and REQUEST, together with the REFERER code and REQUEST code described above;

an unmatched URL monitoring unit, including: an unmatched data acquisition unit, used to obtain the unmatched REFERERs and REQUESTs from the matching result set and merge them into an unmatched URL set; an unmatched data statistics module, which counts the number of records for each URL in the unmatched URL set; and an unmatched data monitoring module, which sorts the URLs in the unmatched URL set in descending order of record count, yielding a ranked collection of unmatched URLs; based on actual business needs, it can then be decided whether to configure these URLs into the URL rule storage table, and if so, control returns to the URL collection unit and the above process is repeated until all URLs that need to be analyzed have been added to the URL rule storage table;

wherein the URL rule storage unit includes:

a URL rule configuration module: used to determine the "unique URL code" and the "URL matching order"; wherein the "unique URL code" is generated automatically by the database or manually, provided that unique URL codes and URL identification rules are in a one-to-one relationship; the "URL matching order" is determined as follows: if B_URL is a substring of A_URL, then A_URL and B_URL are said to have a string containment relationship, with A_URL being the parent URL and B_URL the child URL; in the matching order A_URL then comes first and B_URL after it, i.e. the parent URL is ranked before the child URL;

a URL rule storage module: used to store the URL rule storage table, which holds the URL identification rules and related information, including the unique URL code, the URL identification rule, the URL name and the URL matching order;

a URL rule acquisition unit: sorts the URL identification rules by URL matching order and retrieves the URL identification rules and unique URL codes in that order.
CN201510383588.8A 2015-07-02 2015-07-02 Preprocessing method of URL (Uniform Resource Locator) in access log Expired - Fee Related CN105005600B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510383588.8A CN105005600B (en) 2015-07-02 2015-07-02 Preprocessing method of URL (Uniform Resource Locator) in access log

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510383588.8A CN105005600B (en) 2015-07-02 2015-07-02 Preprocessing method of URL (Uniform Resource Locator) in access log

Publications (2)

Publication Number Publication Date
CN105005600A true CN105005600A (en) 2015-10-28
CN105005600B CN105005600B (en) 2017-05-24

Family

ID=54378276

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510383588.8A Expired - Fee Related CN105005600B (en) 2015-07-02 2015-07-02 Preprocessing method of URL (Uniform Resource Locator) in access log

Country Status (1)

Country Link
CN (1) CN105005600B (en)

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105763633A (en) * 2016-04-14 2016-07-13 上海牙木通讯技术有限公司 Association method of domain name and website visiting behavior
CN106330563A (en) * 2016-08-30 2017-01-11 北京神州绿盟信息安全科技股份有限公司 Method and apparatus for determining service types of intranet HTTP communication flows
CN106445815A (en) * 2016-09-06 2017-02-22 合网络技术(北京)有限公司 Automated testing method and device
CN107317892A (en) * 2017-06-30 2017-11-03 北京知道创宇信息技术有限公司 A kind of processing method of the network address, computing device and readable storage medium storing program for executing
CN107330090A (en) * 2017-07-04 2017-11-07 北京锐安科技有限公司 A kind of information processing method and device
WO2017198145A1 (en) * 2016-05-20 2017-11-23 中兴通讯股份有限公司 Processing method and device for scheduling rule of uniform resource locator
CN109242528A (en) * 2018-07-26 2019-01-18 焦点科技股份有限公司 A kind of the funnel analysis method and device in the customized path of electric business platform
CN109995889A (en) * 2018-01-02 2019-07-09 中国移动通信有限公司研究院 Update method, device, gateway and the storage medium of mapping table
CN111162956A (en) * 2018-11-08 2020-05-15 优信数享(北京)信息技术有限公司 Log recording method and device
CN111368227A (en) * 2018-12-25 2020-07-03 阿里巴巴集团控股有限公司 URL processing method and device
CN115577197A (en) * 2022-12-07 2023-01-06 杭州城市大数据运营有限公司 Method, system and apparatus for component discovery

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106445700B (en) * 2016-09-20 2019-11-12 新华三技术有限公司 A kind of URL matching process and device

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080209030A1 (en) * 2007-02-28 2008-08-28 Microsoft Corporation Mining Web Logs to Debug Wide-Area Connectivity Problems
CN103297435A (en) * 2013-06-06 2013-09-11 中国科学院信息工程研究所 Abnormal access behavior detection method and system on basis of WEB logs
CN103377260A (en) * 2012-04-28 2013-10-30 阿里巴巴集团控股有限公司 Analysis method and device of URLs (Uniform Resource Locator) of weblog

Cited By (24)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2017177590A1 (en) * 2016-04-14 2017-10-19 上海牙木通讯技术有限公司 Method for associating domain name with website access behavior
CN105763633A (en) * 2016-04-14 2016-07-13 上海牙木通讯技术有限公司 Association method of domain name and website visiting behavior
RU2709647C9 (en) * 2016-04-14 2020-04-02 Шанхай Яму Коммюникейшн Текнолоджи Ко., Лтд Method of associating a domain name with a characteristic of visiting a website
RU2709647C1 (en) * 2016-04-14 2019-12-19 Шанхай Яму Коммюникейшн Текнолоджи Ко., Лтд Method of associating a domain name with a characteristic of visiting a website
GB2567749A (en) * 2016-04-14 2019-04-24 Shanghai Yamu Communication Tech Co Ltd Method for associating domain name with website access behavior
CN105763633B (en) * 2016-04-14 2019-05-21 上海牙木通讯技术有限公司 A kind of correlating method of domain name and website visiting behavior
WO2017198145A1 (en) * 2016-05-20 2017-11-23 中兴通讯股份有限公司 Processing method and device for scheduling rule of uniform resource locator
CN107404392A (en) * 2016-05-20 2017-11-28 中兴通讯股份有限公司 The processing method and processing device of the scheduling rule of uniform resource position mark URL
CN106330563B (en) * 2016-08-30 2019-09-17 北京神州绿盟信息安全科技股份有限公司 A kind of method and device of determining Intranet http communication stream service type
CN106330563A (en) * 2016-08-30 2017-01-11 北京神州绿盟信息安全科技股份有限公司 Method and apparatus for determining service types of intranet HTTP communication flows
CN106445815A (en) * 2016-09-06 2017-02-22 合网络技术(北京)有限公司 Automated testing method and device
CN106445815B (en) * 2016-09-06 2019-04-23 优酷网络技术(北京)有限公司 A kind of automated testing method and device
CN107317892B (en) * 2017-06-30 2020-08-07 北京知道创宇信息技术股份有限公司 Network address processing method, computing device and readable storage medium
CN107317892A (en) * 2017-06-30 2017-11-03 北京知道创宇信息技术有限公司 A kind of processing method of the network address, computing device and readable storage medium storing program for executing
CN107330090A (en) * 2017-07-04 2017-11-07 北京锐安科技有限公司 A kind of information processing method and device
CN109995889A (en) * 2018-01-02 2019-07-09 中国移动通信有限公司研究院 Update method, device, gateway and the storage medium of mapping table
CN109995889B (en) * 2018-01-02 2022-02-25 中国移动通信有限公司研究院 Method and device for updating mapping relation table, gateway equipment and storage medium
CN109242528A (en) * 2018-07-26 2019-01-18 焦点科技股份有限公司 A kind of the funnel analysis method and device in the customized path of electric business platform
CN111162956A (en) * 2018-11-08 2020-05-15 优信数享(北京)信息技术有限公司 Log recording method and device
CN111162956B (en) * 2018-11-08 2021-07-30 优信数享(北京)信息技术有限公司 A log recording method and device
CN111368227A (en) * 2018-12-25 2020-07-03 阿里巴巴集团控股有限公司 URL processing method and device
CN111368227B (en) * 2018-12-25 2023-06-27 阿里巴巴集团控股有限公司 URL processing method and device
CN115577197A (en) * 2022-12-07 2023-01-06 杭州城市大数据运营有限公司 Method, system and apparatus for component discovery
CN115577197B (en) * 2022-12-07 2023-10-27 杭州城市大数据运营有限公司 Methods, systems and devices for component discovery

Also Published As

Publication number Publication date
CN105005600B (en) 2017-05-24

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20170524