
CN103178982A - Method and device for analyzing log - Google Patents


Info

Publication number
CN103178982A
Authority
CN
China
Prior art keywords
session
journal file
daily record
analysis
submodule
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN2011104399568A
Other languages
Chinese (zh)
Inventor
乔平
许玉勤
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba Group Holding Ltd
Original Assignee
Alibaba Group Holding Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba Group Holding Ltd filed Critical Alibaba Group Holding Ltd
Priority to CN2011104399568A priority Critical patent/CN103178982A/en
Publication of CN103178982A publication Critical patent/CN103178982A/en
Pending legal-status Critical Current

Landscapes

  • Debugging And Monitoring (AREA)

Abstract

The invention provides a method and device for analyzing logs. The method includes: collecting log files generated by a website log server cluster; and performing distributed clickstream log analysis on the collected log files, with the session as the unit, at a preset interval period, where the interval period is chosen so that the system resources used to analyze the log files are used evenly over a day. The method and device achieve timely, distributed clickstream log analysis with the session as the unit and solve the prior-art problems that logs cannot be analyzed in real time and that system resources are used unevenly, thereby improving the flexibility and timeliness of website log analysis.

Description

Log analysis method and device
Technical field
The application relates to the field of Internet communication, and in particular to a log analysis method and device.
Background art
With the development of Internet information services, many enterprises, companies, government bodies, schools and the like have built their own websites. Managing a website requires attention not only to the server's daily throughput but also to how each page of the site is accessed, so that page content and quality, and the readability of the content, can be improved according to how often each page is clicked. Website administrators therefore need timely access to the results of log file analysis.
Existing clickstream log analysis collects, organizes, analyzes and aggregates a website's Web server logs to mine the commercial value they contain, converting data that describes user behavior into information that decision makers can use, and thereby provides decision support for website operators. A clickstream is the track of clicks a visitor makes while continuously browsing a website; as the visitor browses pages, the log files of the site's Web server record the information about each click. Clickstream analysis differs from the traditional business model, in which there is no direct channel for information exchange and feedback between Web users and the site's information provider, for example about which information is most popular with users or what effect adding or removing page content has on user clicks. Under that model, the site's administrators cannot improve page content and quality according to how each page is accessed.
As can be seen, although existing clickstream log analysis can mine the commercial value contained in the logs and provide decision support for website operators, it parses and analyzes logs at a granularity of one day. As the number of Internet users keeps growing, site visits have risen from the hundred-thousand and million level to the ten-million and hundred-million level, and the volume of web server log files has grown from tens of MB to tens of GB, or even to the order of TB. The timeliness demanded of log statistics and analysis has risen accordingly. Clickstream log analysis performed by the day may therefore have shortcomings, for example:
1) Host pressure: with daily analysis, the load on host CPU, I/O and memory and on the database is concentrated, so that in different scenarios the system may be overloaded at busy times and idle at others; host and database resources cannot be used evenly over a day;
2) Data timeliness: as the business evolves, daily analysis can no longer meet the timeliness required of the data. For advertising delivery effect data, for example, daily analysis means the data is updated once a day and results can only be derived from a full day's data, which falls far short of the timeliness that different businesses require;
3) Maintenance cost: if an error occurs partway through a daily analysis, the full data set must be rolled back and reprocessed. For example, if a log download fails, a full day's data must be processed again, which greatly increases the workload and delays the data.
Summary of the invention
The application provides a log analysis method and device to solve at least the prior-art problems that log files cannot be analyzed in real time and that system resources are used unevenly.
According to one aspect of the application, a log analysis method is provided, comprising: collecting log files generated by a website log server cluster; performing distributed clickstream log analysis on the collected log files, with the session as the unit, at a predetermined interval period, wherein the interval period causes the resources of the system used to analyze the log files to be used evenly over a day; and generating an analysis report from the analysis results obtained by the clickstream log analysis, wherein the analysis report is used to adjust, according to the analysis results, the structure of the website to which the log files correspond.
Preferably, the interval period is 1 hour.
Preferably, the step of performing distributed clickstream log analysis on the collected log files, with the session as the unit, at the predetermined interval period comprises: decoding the collected log files, removing erroneous and invalid log entries from the decoded log files, and converting the remaining log entries to a unified log format; and performing distributed clickstream log analysis, with the session as the unit, on the log files converted to the unified log format, and outputting the fact-table files to be loaded into the database.
Preferably, the step of decoding the collected log files comprises: reading the log entries in the collected log files line by line; splitting each entry into fields according to the field-splitting rule selected by its log source identifier, and removing erroneous entries that do not conform to the rule; filtering out, according to the configured filter rules and the field values obtained from splitting, entries caused by non-human access and entries from company intranet access; converting the remaining entries to the configured unified log output format; and sorting the output log files by business according to the configured business types, with the logs of different businesses written to different paths.
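The decoding steps above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the implementation described in the application: the tab-separated layout, the field names, the filter conditions and the rule that the first URL path segment names the business are all hypothetical, and the sketch returns per-business lists instead of writing to per-business paths.

```python
from collections import defaultdict

# Hypothetical field-splitting rules, keyed by log source identifier.
FIELD_RULES = {
    "web": {"sep": "\t", "fields": ["user_id", "timestamp", "url", "agent", "ip"]},
}


def is_filtered(record):
    """Assumed filter rules: drop crawler traffic and intranet addresses."""
    return "bot" in record["agent"].lower() or record["ip"].startswith("10.")


def business_of(record):
    """Assumed sorting rule: the first URL path segment names the business."""
    return record["url"].strip("/").split("/")[0] or "default"


def decode(lines, source="web"):
    """Return {business: [unified log lines]} for the given raw log lines."""
    rule = FIELD_RULES[source]
    by_business = defaultdict(list)
    for line in lines:                                 # read line by line
        parts = line.rstrip("\n").split(rule["sep"])   # split into fields
        if len(parts) != len(rule["fields"]):          # erroneous entry: drop
            continue
        record = dict(zip(rule["fields"], parts))
        if is_filtered(record):                        # invalid entry: drop
            continue
        unified = "|".join(record[f] for f in rule["fields"])  # unified format
        by_business[business_of(record)].append(unified)       # sort by business
    return dict(by_business)
```

In a real deployment each business's entries would be written to its own output path; the dictionary here only stands in for that separation.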
Preferably, the step of performing distributed clickstream log analysis, with the session as the unit, on the log files converted to the unified log format comprises: obtaining the log files converted to the unified log format together with the logs of sessions left unclosed in the previous interval period; grouping the obtained log entries by user ID, and within each group assigning adjacent entries whose page-click times are no more than a predetermined interval apart to the same session; and determining whether each resulting session is closed. If a session is closed, clickstream log analysis is performed on its log entries. If a session is not closed, its log entries are analyzed in the same way as those of a closed session, the output is flagged so that results for unclosed sessions can be distinguished from those for closed sessions, and a separate copy of the unclosed session's log entries is saved for analysis in the next interval period.
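The per-interval flow in this paragraph, including the carry-over of unclosed sessions, might look like the sketch below. The function name `run_interval` and its callback parameters (`sessionize`, `is_closed`, `analyze`) are hypothetical placeholders for the grouping, closure test and analysis steps; the application does not name them.

```python
def run_interval(new_entries, carried_over, sessionize, is_closed, analyze):
    """Process one interval period.

    new_entries:  unified log entries collected in this interval
    carried_over: entries of sessions left unclosed by the previous interval
    Returns (results, carry), where carry feeds the next interval.
    """
    entries = carried_over + new_entries
    results, carry = [], []
    for session in sessionize(entries):
        closed = is_closed(session)
        # Closed and unclosed sessions are analyzed the same way; the flag
        # lets downstream consumers tell the two kinds of result apart.
        results.append({"closed": closed, "metrics": analyze(session)})
        if not closed:
            carry.extend(session)   # keep a copy for the next interval
    return results, carry
```

Because unclosed sessions are both reported (flagged) and carried forward, a session that spans two interval periods is re-analyzed in full once its remaining entries arrive.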
Preferably, the step of assigning adjacent entries in each group whose page-click times are no more than the predetermined interval apart to the same session comprises: sorting the log entries in each group by page-click time; and, group by group, performing the following steps on each entry in sorted order: determining whether the interval between the page-click time of the entry currently being processed and that of the previous entry in the group exceeds the predetermined interval, wherein the previous entry has been assigned to a previous session; if the interval exceeds the predetermined interval, creating a current session, assigning it a globally unique session identifier, and assigning the current entry to the current session; and if it does not, assigning the current entry to the previous session.
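A minimal sketch of this grouping and splitting logic follows. The 30-minute gap, the `(user_id, click_time)` tuple layout with times in seconds, and the counter-based `new_session_id` helper are assumptions made for illustration only.

```python
import itertools
from collections import defaultdict

SESSION_GAP = 30 * 60          # assumed predetermined interval, in seconds

_counter = itertools.count(1)


def new_session_id():
    """Placeholder for a globally unique session identifier."""
    return f"s{next(_counter)}"


def sessionize(entries, gap=SESSION_GAP):
    """entries: iterable of (user_id, click_time); returns {session_id: [entries]}."""
    groups = defaultdict(list)
    for user_id, click_time in entries:          # group by user ID
        groups[user_id].append((user_id, click_time))

    sessions = {}
    for group in groups.values():
        group.sort(key=lambda e: e[1])           # sort by page-click time
        current, previous_time = None, None
        for entry in group:
            # Start a new session on the first entry or when the gap is exceeded.
            if previous_time is None or entry[1] - previous_time > gap:
                current = new_session_id()
                sessions[current] = []
            sessions[current].append(entry)
            previous_time = entry[1]
    return sessions
```

Each user's entries are compared only with that user's previous entry, so the groups can be processed independently, which is what makes the step easy to distribute.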
Preferably, the step of determining whether a resulting session is closed comprises: if the current log analysis reference time is later than or equal to the end of the day on which the session falls, determining that the session is closed; or, if the interval between the last page-click time of the user ID corresponding to the session and the current log analysis reference time exceeds the predetermined interval, determining that the session is closed; wherein the current log analysis reference time is the end time of the time window covered by the current interval period.
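The two closing conditions can be expressed as a small predicate. The sketch below assumes a 30-minute predetermined interval and uses hypothetical parameter names; the dates in the usage are arbitrary.

```python
from datetime import datetime, timedelta

SESSION_GAP = timedelta(minutes=30)   # assumed predetermined interval


def is_closed(session_start, last_click, reference_time, gap=SESSION_GAP):
    """reference_time is the end of the window covered by the current interval period."""
    # Condition 1: the reference time has reached the end of the session's day.
    end_of_day = datetime.combine(session_start.date(), datetime.min.time()) + timedelta(days=1)
    if reference_time >= end_of_day:
        return True
    # Condition 2: the user has been idle for longer than the predetermined interval.
    return reference_time - last_click > gap
```

With an hourly interval ending at 11:00, a session whose last click was at 10:50 stays open (10 minutes idle), while one whose last click was at 10:20 is closed (40 minutes idle).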
Preferably, the analysis result sets obtained by clickstream log analysis in the current interval period are saved as follows: a first set of tables in the database stores the analysis result sets of closed sessions, wherein the first set of tables holds the result sets of all closed sessions; a second set of tables in the database stores the analysis result sets of unclosed sessions, wherein the second set of tables holds the result sets of all sessions not closed in the current interval period; and the required parameters are extracted from the first set of tables for the log entries corresponding to sessions in the current interval period, and a mapping is established between the session identifiers of closed sessions and the extracted parameters; wherein some of the unclosed sessions become closed sessions in interval periods after the current one.
Preferably, the step of assigning a globally unique session identifier to the current session comprises: assigning the globally unique session identifier on a per-day basis, wherein session identifiers assigned on different dates may be the same or different; and assigning the globally unique session identifier dynamically, wherein the uniqueness of the session identifier is independent of the length of the interval period and of the number of distributed computing nodes used in the clickstream analysis.
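The application does not say how such an identifier is computed. One way to obtain an identifier whose uniqueness does not depend on the interval length or the number of computing nodes is to derive it deterministically from the session's own attributes, as in the sketch below. The hashing scheme, the function name `session_id` and the choice of key fields are assumptions, and a hash is unique only in the practical sense that collisions are extremely unlikely.

```python
import hashlib


def session_id(day, user_id, first_click_time):
    """Derive a session identifier from the day, the user ID and the session's first click.

    Any node that sees the same session computes the same identifier, so no
    coordination between nodes or interval periods is needed.
    """
    key = f"{day}|{user_id}|{first_click_time}".encode("utf-8")
    return hashlib.sha1(key).hexdigest()
```

A session carried over from one interval period to the next keeps the same first click, so it would also keep the same identifier under this scheme.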
Preferably, after performing distributed clickstream log analysis on the collected log files, with the session as the unit, at the predetermined interval period, the log analysis method further comprises: obtaining the analysis results of the clickstream log analysis; and generating an analysis report from the obtained results, wherein the analysis report is used to adjust, according to the analysis results, the structure of the website to which the log files correspond.
According to another aspect of the application, a log analysis device is provided, comprising: a collecting unit, configured to collect log files generated by a website log server cluster; an analysis unit, configured to perform distributed clickstream log analysis on the collected log files, with the session as the unit, at a predetermined interval period, wherein the interval period causes the resources of the system used to analyze the log files to be used evenly over a day; and a generating unit, configured to generate an analysis report from the analysis results obtained by the clickstream log analysis, wherein the analysis report is used to adjust, according to the analysis results, the structure of the website to which the log files correspond.
Preferably, the analysis unit comprises: a decoding module, configured to decode the collected log files, remove erroneous and invalid log entries from the decoded log files, and convert the remaining log entries to a unified log format; and an analysis module, configured to perform distributed clickstream log analysis, with the session as the unit, on the log files converted to the unified log format, and to output the fact-table files to be loaded into the database.
Preferably, the decoding module comprises: a reading submodule, configured to read the log entries in the collected log files line by line; a splitting submodule, configured to split each entry into fields according to the field-splitting rule selected by its log source identifier, and to remove erroneous entries that do not conform to the rule; a filtering submodule, configured to filter out, according to the configured filter rules and the field values obtained from splitting, entries caused by non-human access and entries from company intranet access; an output submodule, configured to convert the filtered entries to the configured unified log output format; and a sorting submodule, configured to sort the output log files by business according to the configured business types, with the logs of different businesses written to different paths.
Preferably, the filtering submodule comprises: a rule parsing submodule, configured to load the filter rules and initialize the filter functions; and a rule judging submodule, configured to determine, from the field values obtained by splitting a log entry, whether the entry matches any one of the loaded filter rules.
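The split between rule parsing (load rules, initialize filter functions) and rule judging (test an entry against the loaded rules) could be sketched as follows. Representing a rule as a `(field, regular expression)` pair is an assumption; the application does not specify the rule format.

```python
import re


def load_rules(rule_specs):
    """Rule parsing: compile (field, pattern) pairs once and return a judging function."""
    compiled = [(field, re.compile(pattern, re.IGNORECASE)) for field, pattern in rule_specs]

    def matches_any(record):
        """Rule judging: True if the record matches at least one loaded rule."""
        return any(rx.search(record.get(field, "")) for field, rx in compiled)

    return matches_any
```

For example, `load_rules([("agent", r"bot|spider"), ("ip", r"^10\.")])` yields a function that flags crawler user agents and addresses in the 10.x range, both of which are illustrative rules only.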
Preferably, the analysis module comprises: an obtaining submodule, configured to obtain the log files converted to the unified log format together with the logs of sessions left unclosed in the previous interval period; a grouping submodule, configured to group the obtained log entries by user ID and, within each group, to assign adjacent entries whose page-click times are no more than a predetermined interval apart to the same session; and a session submodule, configured to determine whether each resulting session is closed. If a session is closed, its log entries are analyzed. If a session is not closed, its log entries are analyzed in the same way as those of a closed session, the output is flagged so that results for unclosed sessions can be distinguished from those for closed sessions, and a separate copy of the unclosed session's log entries is saved for analysis in the next interval period.
Preferably, the grouping submodule comprises: a ordering submodule, configured to sort the log entries in each group by page-click time; a first judging submodule, configured to process each entry in each sorted group in order, group by group, and to determine whether the interval between the page-click time of the entry currently being processed and that of the previous entry in the group exceeds the predetermined interval, wherein the previous entry has been assigned to a previous session; a creating submodule, configured to create a current session when the interval exceeds the predetermined interval, assign it a globally unique session identifier, and assign the current entry to the current session; and an assigning submodule, configured to assign the current entry to the previous session when the interval does not exceed the predetermined interval.
Preferably, the session submodule comprises: a second judging submodule, configured to determine that a session is closed when the current log analysis reference time is later than or equal to the end of the day on which the session falls; or a third judging submodule, configured to determine that a session is closed when the interval between the last page-click time of the user ID corresponding to the session and the current log analysis reference time exceeds the predetermined interval; wherein the current log analysis reference time is the end time of the time window covered by the current interval period.
Preferably, the session submodule further comprises: a first saving submodule, configured to use a first set of tables in the database to store the analysis result sets of closed sessions, wherein the first set of tables holds the result sets of all closed sessions; a second saving submodule, configured to use a second set of tables in the database to store the analysis result sets of unclosed sessions, wherein the second set of tables holds the result sets of all sessions not closed in the current interval period, and wherein some of the unclosed sessions become closed sessions in interval periods after the current one; and a dimension table updating submodule, configured to identify metrics newly added by the clickstream log analysis of the current interval period and to update them in the dimension table, wherein the newly added metrics are used in the clickstream log analysis of the current interval period.
Preferably, the log analysis device further comprises: an obtaining unit, configured to obtain the analysis results of the clickstream log analysis after distributed clickstream log analysis has been performed on the collected log files, with the session as the unit, at the predetermined interval period; and a generating unit, configured to generate an analysis report from the obtained results, wherein the analysis report is used to adjust, according to the analysis results, the structure of the website to which the log files correspond.
In this application, the collected log files are analyzed with the session as the unit once every predetermined period. This improves the timeliness of log file analysis while spreading the use of the analysis system's resources evenly across the predetermined periods of a day. It solves the prior-art problems that log files cannot be analyzed in real time and that system resources are used unevenly, thereby improving the timeliness of log analysis and using system resources in a reasonable and balanced way.
Description of drawings
The accompanying drawings described here provide further understanding of the application and form part of it. The illustrative embodiments of the application and their descriptions explain the application and do not unduly limit it. In the drawings:
Fig. 1 is a preferred structural diagram of the log analysis system according to an embodiment of the application;
Fig. 2 is a preferred structural diagram of the log analysis device according to an embodiment of the application;
Fig. 3 is a preferred structural diagram of the analysis unit according to an embodiment of the application;
Fig. 4 is a preferred structural diagram of the decoding module according to an embodiment of the application;
Fig. 5 is a preferred structural diagram of the filtering submodule according to an embodiment of the application;
Fig. 6 is a preferred structural diagram of the analysis module according to an embodiment of the application;
Fig. 7 is a preferred structural diagram of the grouping submodule according to an embodiment of the application;
Fig. 8 is a preferred structural diagram of the session submodule according to an embodiment of the application;
Fig. 9 is another preferred structural diagram of the session submodule according to an embodiment of the application;
Fig. 10 is another preferred structural diagram of the log analysis device according to an embodiment of the application;
Fig. 11 is a preferred flow chart of the log analysis method according to an embodiment of the application;
Fig. 12 is another preferred flow chart of the log analysis method according to an embodiment of the application.
Detailed description
The application is described in detail below with reference to the drawings and in conjunction with the embodiments. It should be noted that, where they do not conflict, the embodiments of the application and the features of those embodiments may be combined with one another.
Before further details of the embodiments of the application are described, a suitable computing architecture that can be used to implement the principles of the application is described with reference to Fig. 1. In the following description, unless otherwise indicated, the embodiments are described with reference to symbolic representations of actions and operations performed by one or more computers. It will be understood that such actions and operations, sometimes referred to as computer-executed, include the manipulation by the computer's processing unit of electrical signals representing data in structured form. This manipulation transforms the data or maintains it at locations in the computer's memory system, which reconfigures or otherwise alters the operation of the computer in a manner well understood by those skilled in the art. The data structures in which data is maintained are physical locations of memory that have particular properties defined by the format of the data. Although the application is described in this context, this is not meant to be limiting; as those skilled in the art will appreciate, aspects of the actions and operations described below may also be implemented in hardware.
Turning to the drawings, in which like reference numerals refer to like elements, the principles of the application are shown implemented in a suitable computing environment. The following description is based on the described embodiments of the application and should not be taken as limiting the application with regard to alternative embodiments not explicitly described here.
Fig. 1 is a schematic diagram of an example computer architecture that can be used for these devices. For purposes of illustration, the architecture depicted is only one example of a suitable environment and does not suggest any limitation on the scope of use or functionality of the application. Nor should the computing system be interpreted as having any dependency or requirement relating to any one component shown in Fig. 1 or any combination of them.
The principles of the application can operate with other general-purpose or special-purpose computing or communication environments or configurations. Examples of well-known computing systems, environments and configurations suitable for the application include, but are not limited to, personal computers, servers, multiprocessor systems, microprocessor-based systems, minicomputers, mainframe computers, and distributed computing environments that include any of the above systems or devices.
In its most basic configuration, the log analysis system 100 in Fig. 1 comprises at least: a website application server cluster 102, a website log server cluster 104, a log analysis server cluster 106, and one or more clients 108. The website application server cluster 102, the website log server cluster 104 and the log analysis server cluster 106 may include, but are not limited to, a processing unit such as a microprocessor (MCU) or a programmable logic device (FPGA), a storage device for storing data, and a transmission device for communicating with clients. A client 108 may include a microprocessor (MCU), a transmission device for communicating with servers, and a display device for interacting with the user. In this specification and the claims, a "log analysis system" may also be defined as any hardware component or combination of hardware components capable of executing software, firmware or microcode to implement functions. The log analysis system 100 may even be distributed, to implement distributed functions.
As used in this application, the terms "module", "component" and "unit" may refer to software objects or routines executed on the log analysis system 100. The different components, modules, units, engines and services described here may be implemented as objects or processes executing on the log analysis system 100 (for example, as separate threads). Although the systems and methods described here are preferably implemented in software, implementations in hardware or in a combination of software and hardware are also possible and contemplated.
As shown in Fig. 1, the log analysis system 100 comprises: a website application server cluster 102, a website log server cluster 104, a log analysis server cluster 106, and one or more clients 108. In operation, a client 108 opens a page of the website in the user's browser; the website application server cluster 102 responds to the access request from the client 108; the browser on the client 108 receives the response returned by the website application server cluster 102 and sends a request to the website log server cluster 104; the website log server cluster 104 records the user's request in its logs; and the log analysis server cluster 106 collects the logs recorded by the website log server cluster 104 and performs clickstream log analysis on them, with the session as the unit, at a predetermined interval period. Further, the session in the above embodiment may be a session between the client 108 and the website log server cluster 104; analyzing the log files with the session as the unit improves the efficiency and accuracy of clickstream log analysis.
In the above preferred embodiment, clickstream log analysis is performed on the collected log files, with the session as the unit, at a predetermined interval period. This improves the timeliness of log file analysis while spreading the use of the analysis system's resources evenly across the predetermined periods of a day, so that system resources are used in a reasonable and balanced way. In addition, an analysis report can be generated from the analysis results so that the structure of the website to which the log files correspond can be adjusted accordingly, which increases the practical value of the analysis results.
In each of the following embodiments, communication may be implemented by wireless connections, wired connections, or a combination of both; the application places no restriction on this.
Embodiment 1
Based on the above preferred embodiment, the application provides a preferred log analysis device that improves the timeliness of log analysis and uses system resources in a reasonable and balanced way. Preferably, the log analysis device of this embodiment may be located in the log analysis server cluster 106 in Fig. 1. Specifically, as shown in Fig. 2, the log analysis device comprises: a collecting unit 202, configured to collect log files generated by the website log server cluster; and an analysis unit 204, in communication with the collecting unit 202, configured to perform distributed clickstream log analysis on the collected log files, with the session as the unit, at a predetermined interval period, wherein the interval period causes the resources of the system used to analyze the log files to be used evenly over a day.
In the above preferred embodiment, clickstream log analysis is performed on the collected log files, with the session as the unit, at a predetermined interval period. This improves the timeliness of log file analysis while spreading the use of the analysis system's resources evenly across the predetermined periods of a day. It solves the prior-art problems that log files cannot be analyzed in real time and that system resources cannot be used evenly, thereby improving the timeliness of log analysis and using system resources in a reasonable and balanced way.
Further, on the basis of the embodiments above, the interval period in the application may be, but is not limited to, 1 hour; depending on requirements it may also be 30 minutes, 2 hours, and so on, to improve the timeliness of log analysis. In addition, when an error occurs in a log file or an analysis result, only the interval period in which the error occurred needs to be reprocessed or reanalyzed, which reduces the workload.
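To make the effect of the interval period concrete, the sketch below divides a day into analysis windows for a given period. The function name `interval_windows` is hypothetical; the point is only that a failed run maps to one window, which can then be rerun on its own instead of reprocessing the whole day.

```python
from datetime import datetime, timedelta


def interval_windows(day_start, period):
    """Return the list of (start, end) analysis windows covering one day."""
    windows = []
    end_of_day = day_start + timedelta(days=1)
    start = day_start
    while start < end_of_day:
        end = min(start + period, end_of_day)   # clip the last window to the day
        windows.append((start, end))
        start = end
    return windows
```

A 1-hour period yields 24 windows per day, a 30-minute period 48, and a 2-hour period 12, so a shorter period gives fresher results and a smaller unit of rework at the cost of more runs.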
On the basis of the embodiments above, the application provides a preferred analysis unit 204 that improves the efficiency and accuracy of analysis. Specifically, as shown in Fig. 3, the analysis unit 204 comprises: a decoding module 302, configured to decode the collected log files, remove erroneous and invalid log entries from the decoded log files, and convert the remaining log entries to a unified log format; and an analysis module 304, in communication with the decoding module 302, configured to perform distributed clickstream log analysis, with the session as the unit, on the log files converted to the unified log format, and to output the fact-table files to be loaded into the database. In this embodiment, erroneous and invalid log entries are removed from the decoded log files before analysis, which improves the accuracy of the analysis; in addition, analyzing the log files in the unified log format with the session as the unit improves analysis efficiency.
On the basis of the above embodiment, this application provides a preferred decoding module 302 in order to improve analysis efficiency and the accuracy of the analysis. Specifically, as shown in Figure 4, the decoding module 302 comprises: a reading submodule 402, configured to read, line by line, the logs in the collected log files; a decomposing submodule 404, in communication with the reading submodule 402, configured to decompose each read log into fields according to the set field decomposition rule selected by the log source identifier, and to remove error logs that do not conform to the field decomposition rule; a filtering submodule 406, in communication with the decomposing submodule 404, configured to filter out, according to the set filtering rules and the values of the decomposed fields, logs caused by non-human access and logs of access from the company intranet; an output submodule 408, in communication with the filtering submodule 406, configured to unify the output format of the logs in the filtered log files according to the set log output format; and a sorting submodule 410, in communication with the output submodule 408, configured to sort the output log files by business according to the set business types, with the logs of different businesses output to different paths. In this embodiment, logs caused by non-human access and logs of company intranet access are filtered out according to the set filtering rules and the values of the decomposed fields, and only the remaining log files are analyzed, which improves the accuracy of the analysis and the reference value of the analysis results. In addition, the output format of the logs in the filtered log files is unified and the output log files are sorted according to the set business types, so that log files collected from different collection sources are sorted and output in a unified log output format, which helps improve analysis efficiency and the accuracy of the analysis.
Further, the logs caused by non-human access and the logs of company intranet access in the above embodiment may be logs produced by non-human access such as crawlers, by access from company staff for testing purposes, by iFrame floating frames, and the like. In website log analysis, cleaning out such invalid access improves the accuracy of the log analysis.
Whatever its format, a log has a fixed layout, so its field parameters can always be extracted accordingly. Suppose visit_ip is the IP address of the user's access; for logs in either of the above two formats, the IP address of the user's access can be extracted and stored in the $(visit_ip) parameter. The other parameters are extracted in the same way and stored in a KEY=VALUE structure. For example, the fields defined for log decoding may take the form shown in Table 1.
Table 1
Field parameter      Field meaning
$(visit_ip)          IP address of the user's access
$(visit_time)        Time of the user's request (click)
$(visit_zone)        Time zone of the user's request (click)
$(http_method)       HTTP method
$(http_version)      HTTP protocol version
$(http_code)         HTTP response code
$(http_flow)         HTTP traffic
$(entry_url)         URL of the current request
$(entry_query)       QUERY of the current request URL
$(refer_url)         URL of the previous-hop request
$(refer_query)       QUERY of the previous-hop request URL
$(agent_info)        Browser characteristics
$(cookie_id)         COOKIE_ID identifier
$(newcookie_flag)    New COOKIE_ID flag
$(a_cookie)          COOKIE A field
$(b_cookie)          COOKIE B field
$(c_cookie)          COOKIE C field
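The field extraction described above can be illustrated with a minimal sketch. The line layout used here is an assumption modeled on a common Apache-style access log with a trailing cookie field; the sample logs of this application are not reproduced in the text, so the regular expression and the names LINE_RE, split_url and decode_line are hypothetical.

```python
import re

# Assumed line layout (hypothetical, not taken from the application):
# ip - - [time zone] "METHOD url VERSION" code bytes "referer" "agent" "cookie_id"
LINE_RE = re.compile(
    r'(?P<visit_ip>\S+) \S+ \S+ '
    r'\[(?P<visit_time>[^\] ]+) (?P<visit_zone>[^\]]+)\] '
    r'"(?P<http_method>\S+) (?P<entry>\S+) (?P<http_version>[^"]+)" '
    r'(?P<http_code>\d{3}) (?P<http_flow>\d+|-) '
    r'"(?P<refer>[^"]*)" "(?P<agent_info>[^"]*)" "(?P<cookie_id>[^"]*)"$'
)


def split_url(url):
    """Split a URL into (part before the question mark, query string)."""
    base, _, query = url.partition("?")
    return base, query


def decode_line(line):
    """Return a KEY=VALUE style dict using Table 1 field names.

    Returns None when the line does not fit the assumed layout,
    which corresponds to an error log being removed.
    """
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    fields = match.groupdict()
    fields["entry_url"], fields["entry_query"] = split_url(fields.pop("entry"))
    fields["refer_url"], fields["refer_query"] = split_url(fields.pop("refer"))
    return fields
```

A different log source would select a different pattern, in the spirit of choosing the field decomposition rule by log source identifier.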
Further, in the above embodiments, the log files are converted to a unified format. Because the applications at the collection sources differ, the formats of the log files may differ slightly: the order of the fields may change, or some fields may be missing. In general, a standard log file comprises: the IP address of the user's access, the access time, the URL of the current access (ENTRY URL), the HTTP code, the HTTP traffic, the HTTP response time, the URL of the previous-hop access (REFER URL), the browser characteristics, the unique identifier of the user's access (that is, the COOKIE ID), and so on. If the URL of the previous-hop access is empty, that is, it does not exist, its value is "-". For example, an Apache cookie log file is as follows:
Figure BDA0000124400250000081
The following is a typical Beacon log file. It carries information such as the ENTRY URL and REFER URL of the user's access, which is encoded with BASE64 and placed after the question mark of the URL:
Figure BDA0000124400250000091
On the basis of the above embodiment, this application provides a preferred filtering submodule 406 in order to improve the filtering effect. Specifically, as shown in Figure 5, the filtering submodule 406 comprises: a rule parsing submodule 502, configured to load the filtering rules and initialize the filter functions; and a rule judging submodule 504, in communication with the rule parsing submodule 502, configured to judge, according to the field values obtained by decomposing a log, whether the log satisfies any one of the loaded filtering rules. In this embodiment, whether a log satisfies a filtering rule is judged from its decomposed field values, and logs that satisfy a filtering rule are filtered out, which improves the filtering effect.
Further, after a log is decomposed, each of its fields has a definite value. If the decomposed log satisfies any one of the filtering rules, the log is filtered out. The field names in Table 1 can be used directly as variables in filtering rules. Filtering rules support basic arithmetic, logical, relational and combination operations, whose precedence is the same as that of the corresponding operators in ANSI C++; integer and string constants are supported, and a single filtering rule may contain multiple variables. In addition, filtering rules have a number of commonly used string functions built in, for example llike, rlike, strstr, stristr, strlen, regex and atoi, where llike and rlike are the left-match and right-match functions for strings respectively, and the other functions are similar in definition and behavior to the ANSI C++ functions of the same name, as shown in Table 2.
Table 2
Figure BDA0000124400250000092
For example, the filtering rule with identifier 111001 filters out all logs whose COOKIE ID is empty; its priority "1" is the highest priority, and the filtered logs are output to the filtered-log file with filter code F110. The filtering rule with identifier 121001 filters out logs whose user access IP address is "127.0.0.1" or begins with "172.16."; its priority "2" is lower than that of rule 111001, and the filtered logs are output to the filtered-log file with filter code F210. The filtering rule with identifier 130101 filters out logs of GOOGLE bot access, that is, logs whose browser characteristics contain the string Googlebot; its priority "3" is lower than that of rule 121001, and the filtered logs are output to the filtered-log file with filter code F501. In addition, if a log satisfies several filtering rules at the same time, the rule with the highest filtering priority is applied first; if the priorities are the same, the rule with the smallest filtering rule ID is applied first.
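The three example rules and the tie-breaking order can be sketched as follows. The rule identifiers, priorities and filter codes are taken from the text; the tuple layout, the use of Python predicates in place of the rule expression language, and the names RULES and match_filter are illustrative assumptions.

```python
# Each rule: (rule_id, priority, filter_code, predicate over a decoded log dict).
# The predicates stand in for the rule expressions; this layout is an assumption.
RULES = [
    (111001, 1, "F110", lambda log: log.get("cookie_id", "") == ""),
    (121001, 2, "F210", lambda log: log.get("visit_ip", "") == "127.0.0.1"
                                     or log.get("visit_ip", "").startswith("172.16.")),
    (130101, 3, "F501", lambda log: "Googlebot" in log.get("agent_info", "")),
]


def match_filter(log, rules=RULES):
    """Return the filter code of the winning rule, or None if the log is kept."""
    hits = [(priority, rule_id, code)
            for rule_id, priority, code, predicate in rules
            if predicate(log)]
    if not hits:
        return None
    # Priority "1" is the highest, so the smallest number wins;
    # equal priorities fall back to the smallest rule ID.
    return min(hits)[2]
```

A log that satisfies several rules at once, such as an intranet request with an empty COOKIE ID, is routed to the file of the highest-priority rule only.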
Further, in the above embodiment, the output format of the logs in the filtered log files is unified, and the output log files are sorted according to the set business types, so that log files collected from different collection sources are sorted and output in a unified log output format. The collected log files may come from a website log server cluster that handles log collection for several businesses at once. For example, the website log server cluster of Alibaba B2B has 10 log collection servers and simultaneously collects the logs of several websites, such as the Chinese site, the international site and Ali Finance. During log analysis, the centrally collected logs are sorted by website. Because each website serves different URL resources, the centrally collected log files can be sorted according to the $(entry_url) field, so that the log files of each website can be analyzed independently, which helps improve the accuracy of the analysis results.
Further, some field-level conversions are applied to the sorted log files, such as removing anchors and deleting repeated parameters, and the result is then output in the unified format. The output format can be configured through parameters and generally uses the following format configuration:
Figure BDA0000124400250000101
The purpose of decoding is to shield the analysis from the raw format of the logs and from possible future format changes, and to filter out junk and unnecessary logs.
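The field-level conversions mentioned above (removing the anchor and deleting repeated parameters) might look like the following sketch. Keeping the first occurrence of a repeated parameter is an assumption, since the text does not say which occurrence is kept, and the name normalize_url is hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode


def normalize_url(url):
    """Drop the anchor (fragment) and repeated query parameters from a URL.

    Assumption: when a parameter name repeats, only its first occurrence is kept.
    """
    parts = urlsplit(url)
    seen = set()
    unique = []
    for key, value in parse_qsl(parts.query, keep_blank_values=True):
        if key not in seen:
            seen.add(key)
            unique.append((key, value))
    # An empty fragment makes urlunsplit omit the "#..." part entirely.
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(unique), ""))
```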
On the basis of the above preferred embodiments, this application provides a preferred analysis module 304 in order to improve the timeliness of the analysis. Specifically, as shown in Figure 6, the analysis module 304 comprises: an obtaining submodule 602, configured to obtain the log files converted into the unified log format together with the log files of sessions that were not closed in the previous interval period; a grouping submodule 604, in communication with the obtaining submodule 602, configured to divide the obtained log files belonging to the same user ID into one group and, within each group, to divide adjacent logs whose page-click times are no more than a predetermined interval apart into the same session; and a session submodule 606, in communication with the grouping submodule 604, configured to judge whether each session obtained by the division is closed. If a session is closed, the log files in the closed session are analyzed. If a session is not closed, the log files in the unclosed session are analyzed using the same analysis mechanism as for closed sessions, and an output flag distinguishes the analysis results of unclosed sessions from those of closed sessions; at the same time, a separate copy of the logs of the unclosed session is saved for analysis in the next interval period. In this embodiment, the log files are divided into groups by user ID, the log files in each group are divided into sessions according to the predetermined interval, and the log files are analyzed session by session, which helps improve analysis efficiency and the accuracy of the analysis results. In addition, the log files in closed sessions are analyzed at once, and the log files of unclosed sessions are analyzed again in the next interval period, so that log files in closed sessions are analyzed in real time, improving the timeliness of the analysis.
On the basis of the above embodiment, this application also provides a preferred grouping submodule 604 in order to improve analysis efficiency. Specifically, as shown in Figure 7, the grouping submodule 604 comprises: a sorting submodule 702, configured to sort the log files in each group by page-click time; a first judging submodule 704, in communication with the sorting submodule 702, configured to examine, group by group and in sorted order, each log file in each sorted group, and to judge whether the interval between the page-click time of the log file currently being processed and the page-click time of the previous log file in the group exceeds the predetermined interval, wherein the previous log file has been divided into a previous session; a creating submodule 706, in communication with the first judging submodule 704, configured to create a current session when the interval is judged to exceed the predetermined interval, allocate a globally unique session identifier to the current session, and divide the log file currently being processed into the current session; and a dividing submodule 708, in communication with the creating submodule 706, configured to divide the log file currently being processed into the previous session when the interval is judged not to exceed the predetermined interval. In this embodiment, after the log files in each group are sorted by page-click time, they are divided into sessions according to the predetermined interval, and a globally unique session identifier is allocated to each session, so that the log files can be analyzed and analysis efficiency is improved. At the same time, log files judged not to exceed the predetermined interval are divided into the previous session, so that log files are divided into sessions accurately, which improves the accuracy of the analysis.
Preferably, the creating submodule 706 in the above preferred embodiment may allocate a globally unique session identifier to the current session in the following ways. The session identifier may be allocated by day, wherein session identifiers allocated on different dates may be the same or different. The session identifier may also be allocated dynamically during click-stream analysis, wherein the uniqueness of the session identifier is independent of the length of the interval period and of the number of distributed computing nodes. Session identifiers allocated on the same day are thus unique, while those allocated on different dates may be the same or different, which makes this application more practicable; in addition, allocating globally unique session identifiers dynamically during analysis helps improve analysis efficiency.
Further, the session identifier in the above embodiment may be generated at the browser side, in a form based on the time, the browser characteristics and the IP address or host name of the visited host, plus a pseudo-random number, so as to ensure a high degree of uniqueness. This approach also stays close to the user's access behavior. Given the different scenarios in which a session may close, the approach taken in this application is that the front end does not allocate session identifiers; instead, during log analysis, the user's access track on the website front end is reconstructed according to the session concept and session identifiers are allocated then. The benefit of allocating session identifiers in this way is that they are guaranteed to be fully unique, and that the generation of a session identifier is tied to the hour in which the session ends. When data are reprocessed, this makes it easy to clear the session identifiers in the database for the time period that needs reprocessing and to load the data again.
Session identifiers also play a very important role throughout log analysis: all log analysis results contain a session identifier field, session identifiers can be used to associate tables with one another, and each session identifier is required to be unique within one day. The session identifier allocation method of this application is described taking allocation by the hour as an example, but is not limited to this. When session identifiers are allocated by the hour and the load is estimated at 100000 PageViews per second, then, for example, for the data from 0:00 to 1:00 the session identifiers 0 to 100000 × 3600 are allocated; these identifiers are then divided among the computing nodes, so that the session identifiers allocated by each node of the distributed computation are independent.
For example, taking 1 hour as the interval period:
MAX_PER_CLICK = maximum number of clicks per second on the website (assumed to be 100000);
SECOND_PER_HOUR = 3600 (number of seconds in one hour, 60 minutes × 60 seconds);
The maximum number of IDs available in one hour is then:
MAX_IDS_PER_HOUR = SECOND_PER_HOUR × MAX_PER_CLICK;
Suppose further that the current number of computing nodes is:
TTL_NODE_UNIT = number of computing nodes (assuming 8 computing nodes, the value is 8);
The number of IDs available to each computing node in one hour is then:
MAX_NODE_IDS = MAX_IDS_PER_HOUR / TTL_NODE_UNIT;
The process of allocating session identifiers by the hour is described in detail below for a specific computing node and a specific analysis hour:
CUR_NODE_ID = number of the current computing node (assuming 8 computing nodes, the range is [0-7]);
CUR_HOUR_UV = hour currently being analyzed (for example 8 o'clock or 10 o'clock), with range [0-23];
The starting SESSION_ID allocated to this computing node for this hour is:
MIN_SESSION_ID = MAX_IDS_PER_HOUR × CUR_HOUR_UV + MAX_NODE_IDS × CUR_NODE_ID;
The ending SESSION_ID is:
MAX_SESSION_ID = MIN_SESSION_ID + MAX_NODE_IDS;
The above SESSION_ID allocation algorithm guarantees that, for a given hour HH, the ID allocation space is:
[HH × MAX_IDS_PER_HOUR, (HH+1) × MAX_IDS_PER_HOUR]
Because the log file analysis results carry a SESSION_ID field, when the data of a particular hour need to be reprocessed, the results whose SESSION_ID falls in the range of that hour can simply be cleared in full.
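The allocation arithmetic above can be checked with a short sketch. The constant names follow the text; the function name session_id_range, the use of integer division, and the range checks are assumptions added for illustration.

```python
MAX_PER_CLICK = 100_000      # assumed maximum clicks per second, as in the text
SECOND_PER_HOUR = 3600       # 60 minutes x 60 seconds
MAX_IDS_PER_HOUR = SECOND_PER_HOUR * MAX_PER_CLICK


def session_id_range(cur_hour_uv, cur_node_id, ttl_node_unit=8):
    """Return (MIN_SESSION_ID, MAX_SESSION_ID) for one node in one analysis hour."""
    if not 0 <= cur_hour_uv <= 23:
        raise ValueError("analysis hour must be in [0, 23]")
    if not 0 <= cur_node_id < ttl_node_unit:
        raise ValueError("node number must be in [0, ttl_node_unit - 1]")
    # Integer division is an assumption; the text writes a plain "/".
    max_node_ids = MAX_IDS_PER_HOUR // ttl_node_unit
    min_session_id = MAX_IDS_PER_HOUR * cur_hour_uv + max_node_ids * cur_node_id
    return min_session_id, min_session_id + max_node_ids
```

With 8 nodes, each node receives 45,000,000 identifiers per hour; for hour 8 and node 3 the sketch yields the range starting at 3,015,000,000 and ending at 3,060,000,000, which lies inside the hour's allocation space [8 × MAX_IDS_PER_HOUR, 9 × MAX_IDS_PER_HOUR].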
Further, on the basis of the above embodiment, under a distributed computing framework all PageViews of one session must be analyzed on the same computing node. The log files are therefore first grouped by COOKIE_ID (user ID) during analysis, to ensure that all logs of the same COOKIE_ID are stored contiguously. The grouping comprises the following steps:
Step 1: determine the number of computing nodes of the distributed computing framework, assumed to be n;
Step 2: group the log files by COOKIE_ID into n groups, so that the log files of the same COOKIE_ID are stored contiguously and are analyzed on the same computing node;
Step 3: within each group obtained in step 2, sort the logs of each COOKIE_ID in ascending order of click time.
In addition, to limit the load on the website servers, the time in a log file is generally accurate only to the second, and logs clicked within the same second can be ordered as follows:
1) logs with REFER URL = "-" come first;
2) logs whose REFER URL is off-site come first. To judge whether a URL is on-site or off-site, first judge by domain; failing that, special cases such as top-level shops are considered: if the URL has appeared as an ENTRY URL it is judged to be on-site, otherwise off-site;
3) for two logs A and B, if the REFER URL of A equals the ENTRY URL of B, then B comes first.
The above ordering realizes sorting by ENTRY URL, which ensures that repeated runs produce consistent results.
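Steps 1 to 3 above can be sketched as follows. This is a simplified illustration, not the ordering of this application in full: only same-second rule 1 and a final ENTRY URL tie-break are implemented, rules 2 and 3 are omitted, the hash function (crc32) is an assumption because the text does not name one, and click times are assumed to be integer seconds. The names partition_by_cookie and sort_group are hypothetical.

```python
import zlib
from collections import defaultdict


def partition_by_cookie(logs, n):
    """Assign each log to one of n groups so that a COOKIE_ID never spans groups."""
    groups = defaultdict(list)
    for log in logs:
        # crc32 gives a hash that is stable across runs; the choice is an assumption.
        group_no = zlib.crc32(log["cookie_id"].encode("utf-8")) % n
        groups[group_no].append(log)
    return groups


def sort_group(group):
    """Order one group by COOKIE_ID, click time, then simplified tie-breaks."""
    return sorted(group, key=lambda log: (
        log["cookie_id"],                      # keeps one visitor's logs contiguous
        log["visit_time"],                     # ascending click time (integer seconds)
        0 if log["refer_url"] == "-" else 1,   # rule 1: empty REFER URL first
        log["entry_url"],                      # deterministic final tie-break
    ))
```

Each group would then be handed to one computing node, so that all PageViews of a visitor are analyzed in the same place.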
The log files produced by the above sorting are ordered, and the log files of the same COOKIE_ID are guaranteed to be stored contiguously. When dividing SESSIONs (sessions), the log file records are read COOKIE_ID by COOKIE_ID. If the click time of the current record differs from that of the previous record by more than the predetermined interval, a new SESSION is generated; if the click time of the last record in the sorted log file differs from the hour currently being processed by more than the predetermined interval, a new SESSION is generated; if the hour currently being processed is the last hour of the day, a new SESSION is generated. The way SESSIONs are divided is illustrated below taking a predetermined interval of 30 minutes as an example; 30 minutes is only a preferred example of this application, and the predetermined interval is not limited to it.
1) if the interval between two consecutive page clicks (PageViews) by the same visitor does not exceed 30 minutes, they are considered to belong to the same SESSION;
2) if a visitor makes no page click (PageView) for more than 30 minutes, the SESSION is considered closed;
3) log analysis is performed with the day as the unit, and accesses that span days belong to different SESSIONs;
4) during log analysis, an independent SESSION_ID is allocated to each complete SESSION.
With the above session division, the contiguously stored log files of the same COOKIE_ID are divided into sessions, and an independent SESSION_ID is allocated to each session.
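Rules 1) to 4) above can be sketched for a single visitor as follows. The sketch assumes click times are integer epoch seconds already sorted in ascending order, and it takes day boundaries as UTC days for simplicity; both are assumptions, as are the names split_sessions and id_source.

```python
from itertools import count

SESSION_GAP = 30 * 60          # predetermined interval: 30 minutes, in seconds
SECONDS_PER_DAY = 24 * 3600


def split_sessions(click_times, id_source):
    """Split one visitor's ascending click times into sessions.

    Returns a list of (session_id, [click times]) pairs. id_source is any
    iterator that yields unused session identifiers.
    """
    sessions = []
    prev = None
    for t in click_times:
        # Rule 3: clicks on different days never share a session.
        new_day = prev is not None and t // SECONDS_PER_DAY != prev // SECONDS_PER_DAY
        # Rules 1 and 2: a gap of more than 30 minutes starts a new session.
        if prev is None or t - prev > SESSION_GAP or new_day:
            sessions.append((next(id_source), []))   # rule 4: fresh SESSION_ID
        sessions[-1][1].append(t)
        prev = t
    return sessions
```

For instance, `split_sessions(times, count(100))` draws identifiers from a simple counter; in the scheme described earlier, the counter would start at the MIN_SESSION_ID of the node and hour being analyzed.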
On the basis of the above embodiment, this application also provides a preferred session submodule 606 in order to make this application more flexible to use. Specifically, as shown in Figure 8, the session submodule 606 comprises: a second judging submodule 802, in communication with the grouping submodule 604, configured to judge that a session is closed when the current log analysis reference time is later than or equal to the end of the day in which the session lies; or a third judging submodule 804, in communication with the grouping submodule 604, configured to judge that a session is closed when the interval between the last page-click time of the user ID corresponding to the session and the current log analysis reference time exceeds the predetermined interval, wherein the current log analysis reference time is the end time of the time period covered by the current interval period. In this embodiment, whether a session is closed is judged under different scenarios, which makes this application more flexible to use.
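The two closing conditions can be combined into one check, sketched below. Times are assumed to be integer epoch seconds, days are taken as UTC days, and the 30-minute default is the example interval used elsewhere in this application; the function name is_session_closed is hypothetical.

```python
SESSION_GAP = 30 * 60          # example predetermined interval, in seconds
SECONDS_PER_DAY = 24 * 3600


def is_session_closed(last_click, reference_time, gap=SESSION_GAP):
    """Decide whether a session is closed at the current analysis reference time.

    reference_time is the end time of the interval period being analyzed.
    """
    # Condition of submodule 802: the reference time has reached the end of
    # the day in which the session lies.
    end_of_day = (last_click // SECONDS_PER_DAY + 1) * SECONDS_PER_DAY
    if reference_time >= end_of_day:
        return True
    # Condition of submodule 804: no click for longer than the predetermined
    # interval before the reference time.
    return reference_time - last_click > gap
```

A session for which this returns False would be analyzed provisionally and its logs kept for the next interval period.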
On the basis of the above embodiments, this application further improves the session submodule 606 in order to improve analysis efficiency. Specifically, as shown in Figure 9, the session submodule 606 further comprises: a first saving submodule 902, in communication with the grouping submodule 604, configured to save the analysis result sets of closed sessions in a first set of tables in the database, wherein the first set of tables holds the analysis result sets of all closed sessions; a second saving submodule 904, in communication with the grouping submodule 604, configured to save the analysis result sets of unclosed sessions in a second set of tables in the database, wherein the second set of tables holds the analysis result sets of all sessions unclosed in the current interval period, and wherein some of the unclosed sessions become closed sessions in an interval period following the current one; and a dimension table updating submodule 906, in communication with the first saving submodule 902 and the second saving submodule 904, configured to identify the measures newly added by the click-stream log analysis of the current interval period and update them into the dimension table, wherein the newly added measures are used for the click-stream log analysis of the current interval period. In this embodiment, sessions are saved separately according to whether they are closed, so that closed and unclosed sessions are clearly distinguished and the log files corresponding to closed sessions can be analyzed, which improves analysis efficiency and the accuracy of the analysis results. At the same time, the measures newly added in the current interval period are identified and updated into the dimension table in real time, that is, newly added measures are analyzed in a timely and effective way, which improves the timeliness and accuracy of log analysis.
Further, on the basis of the above embodiment, when the first set of tables is used to save the analysis result sets of closed sessions, the required parameters can be extracted from the log files corresponding to the sessions in the first set of tables, and a mapping can be established between the session identifier of each closed session and the extracted parameters. The log files in closed sessions are then analyzed according to this mapping between session identifiers and extracted parameters, which improves analysis efficiency.
On the basis of the above preferred embodiments, this application improves the above log analysis device in order to increase the practical value of log analysis. Specifically, as shown in Figure 10, the log analysis device further comprises: an obtaining unit 1002, configured to obtain the analysis results of the click-stream log analysis after distributed click-stream log analysis has been performed on the collected log files, session by session, at the predetermined interval period; and a generating unit 1004, in communication with the obtaining unit 1002, configured to generate an analysis report from the obtained analysis results, wherein the analysis report is used to adjust the structure of the website corresponding to the log files according to the analysis results. In this embodiment, the analysis results of the click-stream log analysis are turned into an analysis report, so that the structure of the website corresponding to the log files can be adjusted according to the information fed back by the report and the website can better meet users' different service needs, which increases the practical value of log analysis.
Further, on the basis of the above preferred embodiments, the session may be a session between a client and a website, so that click-stream log analysis is performed on the log files session by session, improving the efficiency and accuracy of click-stream log analysis.
Embodiment 2
On the basis of Figures 1 to 10, this application provides a preferred log analysis method in order to improve the timeliness of log analysis and to use system resources reasonably and evenly. Specifically, as shown in Figure 11, the log analysis method comprises:
S1102: collect the log files generated by the website log server cluster;
S1104: perform distributed click-stream log analysis on the collected log files, session by session, at a predetermined interval period, wherein the interval period allows the resources of the system used to analyze the log files to be used evenly throughout a day.
In the above preferred embodiment, the collected log files are analyzed session by session at a predetermined interval period. This improves the timeliness of log file analysis while spreading the use of the resources of the system that analyzes the log files evenly across the predetermined periods of a day. It solves the prior-art problems that log files cannot be analyzed in real time and that system resources cannot be used in a balanced manner, thereby improving the timeliness of log analysis and using system resources reasonably and evenly.
Further, on the basis of the above embodiments, the interval period in this application may be, but is not limited to, 1 hour; depending on requirements it may also be 30 minutes, 2 hours, and so on, so as to improve the timeliness of log analysis. In addition, when an error occurs in a log file or in an analysis result, the data can be reprocessed or reanalyzed only for the interval period in which the error occurred, which reduces the workload.
On the basis of the above embodiments, this application provides a preferred method of performing distributed click-stream log analysis on the collected log files, session by session, at a predetermined interval period, in order to improve analysis efficiency and the timeliness and accuracy of the analysis results. Specifically, the method comprises: decoding the collected log files, removing error logs and invalid logs from the decoded log files, and converting the logs remaining in the log files into a unified log format; and performing distributed click-stream log analysis, session by session, on the log files converted into the unified log format, and outputting the files to be loaded into the fact table. In this embodiment, error logs and invalid logs are removed from the decoded log files before analysis, which improves the accuracy of the analysis; in addition, analyzing the log files in the unified log format session by session improves analysis efficiency.
On the basis of the above embodiment, this application provides a preferred method of decoding the collected log files in order to improve analysis efficiency and the accuracy of the analysis. Specifically, the method of decoding the collected log files comprises: reading, line by line, the logs in the collected log files; decomposing each read log into fields according to the set field decomposition rule selected by the log source identifier, and removing error logs that do not conform to the field decomposition rule; filtering out, according to the set filtering rules and the values of the decomposed fields, logs caused by non-human access and logs of access from the company intranet; unifying the output format of the logs in the filtered log files according to the set log output format; and sorting the output log files by business according to the set business types, with the logs of different businesses output to different paths. In this embodiment, logs caused by non-human access and logs of company intranet access are filtered out according to the set filtering rules and the values of the decomposed fields, and only the remaining log files are analyzed, which improves the accuracy of the analysis and the reference value of the analysis results. In addition, the output format of the logs in the filtered log files is unified and the output log files are sorted according to the set business types, so that log files collected from different collection sources are sorted and output in a unified log output format, which helps improve analysis efficiency and the accuracy of the analysis.
Further, the logs caused by non-human access and the logs of company intranet access in the above embodiment may be logs produced by non-human access such as crawlers, by access from company staff for testing purposes, by iFrame floating frames, and the like. In website log analysis, cleaning out such invalid access improves the accuracy of the log analysis.
Further, in the above embodiments, the log files are converted to a unified format. Because the applications at the collection sources differ, the formats of the log files may differ slightly: the order of the fields may change, or some fields may be missing. In general, however, a standard log file comprises: the IP address of the user's access, the access time, the URL of the current access (ENTRY URL), the HTTP code, the HTTP traffic, the HTTP response time, the URL of the previous-hop access (REFER URL), the browser characteristics, the unique identifier of the user's access (that is, the COOKIE ID), and so on. If the URL of the previous-hop access is empty, that is, it does not exist, its value is "-". For example, an Apache cookie log file is as follows:
Figure BDA0000124400250000151
The following is a typical Beacon log file. This log file contains information such as the ENTRY URL and REFER URL of the user's access, which is BASE64-encoded and placed after the question mark in the URL:
Figure BDA0000124400250000152
Whatever the format of a log, however, its field parameters can always be extracted according to its fixed format. Suppose visit_ip denotes the IP address of the user's access; for logs in either of the two formats above, the IP address of the user's access can be extracted and stored in the $(visit_ip) parameter. The other parameters are extracted in the same way and stored in a KEY=VALUE structure. For example, the fields defined for log decoding may take the form shown in Table 1.
Table 1
Field parameter       Field meaning
$(visit_ip)           IP address of the user's access
$(visit_time)         Time of the user's click request
$(visit_zone)         Time zone of the user's click request
$(http_method)        HTTP method
$(http_version)       HTTP protocol version
$(http_code)          HTTP response code
$(http_flow)          HTTP traffic
$(entry_url)          URL of the current request
$(entry_query)        QUERY of the URL of the current request
$(refer_url)          URL of the previous-hop request
$(refer_query)        QUERY of the URL of the previous-hop request
$(agent_info)         Browser characteristics
$(cookie_id)          COOKIE_ID identifier
$(newcookie_flag)     New COOKIE_ID flag
$(a_cookie)           COOKIE A field
$(b_cookie)           COOKIE B field
$(c_cookie)           COOKIE C field
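As an illustration of the decomposition into the KEY=VALUE structure of Table 1, the following Python sketch parses one log line. The actual log formats appear only in the figures above and are not reproduced here, so the line layout (an Apache-style access line with a trailing cookie ID), the regular expression and the function names are assumptions made for this example; only the field names come from Table 1.

```python
import re
from urllib.parse import urlsplit

# Hypothetical line layout (an assumption, not the format shown in the figures):
# ip [time zone] "METHOD url HTTP/version" code flow "referer" "agent" cookie_id
LINE_RE = re.compile(
    r'(?P<visit_ip>\S+) \[(?P<visit_time>[^\] ]+) (?P<visit_zone>[^\]]+)\] '
    r'"(?P<http_method>\S+) (?P<entry>\S+) HTTP/(?P<http_version>[\d.]+)" '
    r'(?P<http_code>\d{3}) (?P<http_flow>\d+) '
    r'"(?P<refer>[^"]*)" "(?P<agent_info>[^"]*)" (?P<cookie_id>\S+)$'
)


def split_url(url):
    """Split a URL into (URL without query and anchor, query); "-" means no URL."""
    if url == "-":
        return "-", ""
    parts = urlsplit(url)
    return parts._replace(query="", fragment="").geturl(), parts.query


def decode_line(line):
    """Return a dict keyed by Table 1 field names, or None for an erroneous line."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None  # the line does not conform to the decomposition rule
    fields = match.groupdict()
    entry_url, entry_query = split_url(fields.pop("entry"))
    refer_url, refer_query = split_url(fields.pop("refer"))
    fields.update(entry_url=entry_url, entry_query=entry_query,
                  refer_url=refer_url, refer_query=refer_query)
    return fields
```

A line that fails to match is reported as None, which corresponds to removing an erroneous entry that does not conform to the field decomposition rule.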
On the basis of the above embodiment, the application provides a preferred method of selecting a preset field decomposition rule according to the log source identifier, decomposing the log entries read into fields, and removing erroneous entries that do not conform to the field decomposition rule, in order to improve the filtering effect. Specifically, this method comprises: loading the filtering rules and initializing the filter functions; and judging, according to the field values obtained by decomposing a log entry, whether the entry satisfies any one of the loaded filtering rules. In this embodiment, whether a log entry satisfies a filtering rule is judged from its decomposed field values, and entries that satisfy a filtering rule are filtered out, which improves the filtering effect.
Further, after a log entry is decomposed, each of its fields has a definite value. If the decomposed entry satisfies any of the filtering rules, the entry is filtered out. The field names in Table 1 can be used directly as variables in the filtering rules. A filtering rule supports basic operations such as arithmetic, logical, relational and combination operations, with the same operator precedence as in ANSI C++; it supports integer and string constants, and a single rule may use several variables. In addition, some commonly used string functions are built into the filtering rules, for example llike, rlike, strstr, stnstr, strlen, regex and atoi, where llike and rlike are the left-match and right-match functions for strings, and the other functions are defined similarly to the ANSI C++ functions of the same name, as shown in Table 2.
Table 2
Figure BDA0000124400250000161
For example, the filtering rule with rule ID 111001 filters out all log entries whose COOKIE ID is empty; its priority, "1", is the highest, and the filtered entries are output to the filter log file with filter code F110. The filtering rule with rule ID 121001 filters out log entries whose user access IP address is "127.0.0.1" or begins with "172.16."; its priority, "2", is lower than that of rule 111001, and the filtered entries are output to the filter log file with filter code F210. The filtering rule with rule ID 130101 filters out log entries from GOOGLE bot access, that is, entries whose browser characteristics contain the string Googlebot; its priority, "3", is lower than that of rule 121001, and the filtered entries are output to the filter log file with filter code F501. In addition, if a log entry satisfies several filtering rules at the same time, the rule with the highest priority is applied first; if the priorities are equal, the rule with the smallest rule ID is applied first.
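The rule selection described in this example can be sketched in Python as follows. The rule IDs, priorities and filter codes are those of the example; the FilterRule class, the select_rule function and the representation of each rule as a Python predicate over the decoded fields are assumptions made for illustration and do not reproduce the rule syntax of Table 2. An empty COOKIE ID is assumed here to be the empty string.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class FilterRule:
    rule_id: int
    priority: int                      # 1 is the highest priority
    filter_code: str                   # names the filter log file for the entry
    matches: Callable[[dict], bool]    # predicate over the decoded fields


RULES = [
    FilterRule(111001, 1, "F110", lambda f: f.get("cookie_id", "") == ""),
    FilterRule(121001, 2, "F210",
               lambda f: f.get("visit_ip", "") == "127.0.0.1"
               or f.get("visit_ip", "").startswith("172.16.")),
    FilterRule(130101, 3, "F501",
               lambda f: "Googlebot" in f.get("agent_info", "")),
]


def select_rule(fields: dict, rules=RULES) -> Optional[FilterRule]:
    """Return the rule to apply to an entry, or None if the entry is kept."""
    hits = [rule for rule in rules if rule.matches(fields)]
    if not hits:
        return None
    # Highest priority first; among equal priorities, the smallest rule ID.
    return min(hits, key=lambda rule: (rule.priority, rule.rule_id))
```

An entry for which select_rule returns a rule would be written to the filter log file named by that rule's filter code rather than passed on to the analysis.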
Further, in the above embodiment the output format of the log entries in the filtered log files is unified and the output log files are sorted by preset business type, so that log files collected from different sources are sorted and output in a unified log output format. The collected log files may come from a website log server cluster that performs log collection for several businesses at once. For example, the website log server cluster of Alibaba B2B has 10 log collection servers and simultaneously collects the logs of several websites, such as the Chinese site, the international site and Ali Finance. During log analysis, the centrally collected logs are sorted by website. Because each website serves different URL resources, the centrally collected log files can be sorted according to the $(entry_url) field, so that the log files of each website can be analyzed independently, which helps to improve the accuracy of the analysis results.
Further, some field-level conversions are applied to the sorted log files, such as removing anchors and deleting repeated parameters, and the result is then output in a unified format. The output format can be set by parameter configuration, and the following format configuration is generally adopted:
Decoding thus shields the analysis from the original format of the logs and from possible later format changes, and filters out junk and unnecessary log entries.
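The field-level conversions mentioned above (removing anchors and deleting repeated parameters) can be sketched in Python with the standard urllib.parse module. The function name normalize_url and the choice to keep the first occurrence of a repeated parameter are assumptions; the source does not say which occurrence is kept.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def normalize_url(url: str) -> str:
    """Drop the anchor and repeated query parameters from a URL."""
    parts = urlsplit(url)
    seen = set()
    unique = []
    for key, value in parse_qsl(parts.query, keep_blank_values=True):
        if key not in seen:  # assumption: the first occurrence wins
            seen.add(key)
            unique.append((key, value))
    # The empty last component removes the anchor (fragment).
    # Note: urlencode re-encodes values, so escaping may be normalized too.
    return urlunsplit((parts.scheme, parts.netloc, parts.path,
                       urlencode(unique), ""))
```

For instance, under these assumptions a URL ending in "?a=1&b=2&a=3#top" would be reduced to the same URL ending in "?a=1&b=2".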
On the basis of each of the above preferred embodiments, the application provides a preferred method of performing distributed clickstream log analysis, session by session, on the log files converted to the unified log format, in order to improve the timeliness of the analysis. Specifically, this method comprises: obtaining the log files converted to the unified log format together with the log entries of sessions left unclosed in the previous interval period; dividing the log entries that belong to the same user identifier into one group, and, within each group, assigning adjacent entries whose web page click times are no more than a predetermined interval apart to the same session; and judging whether each session obtained by the division is closed. If a session is closed, clickstream log analysis is performed on the log entries in the closed session. If a session is not closed, its log entries are analyzed by the same mechanism as for closed sessions, an output flag distinguishes the analysis results of unclosed sessions from those of closed sessions, and a separate copy of the log entries of the unclosed session is saved for analysis in the next interval period. In this embodiment, the log entries are grouped by user identifier, the entries in each group are divided into sessions according to the predetermined interval, and the analysis is performed session by session, which helps to improve the efficiency of the analysis and the accuracy of its results. In addition, the log entries of closed sessions are analyzed in the current interval period while those of unclosed sessions are carried over to the next interval period, so that closed sessions are analyzed in real time and the timeliness of the analysis is improved.
On the basis of above-described embodiment, the application also provides the method for the journal file of a kind of adjacent daily record that preferably the webpage click interval is no more than predetermined space in each group in being divided into same session, in order to improve analysis efficiency.To achieve these goals, the method of the journal file during particularly, the above-mentioned adjacent daily record that the webpage click interval is no more than predetermined space in each group is divided into same session comprises: the journal file in each group is sorted according to the webpage click time; Each journal file in each group after to sequence is carried out following steps according to the order after sequence take group as unit: judge in a group after sequence that whether the interval of webpage click between the time of a upper journal file in webpage click time of the journal file of pre-treatment and this group be over predetermined space, wherein, a upper journal file is divided into a journal file in session; If surpass predetermined space, create a current sessions, be that current sessions distributes the session identification that the overall situation is unique, and the journal file that will work as pre-treatment is divided into the journal file in current sessions; If do not surpass predetermined space, the journal file that will work as pre-treatment is divided into a journal file in session.In the present embodiment, after journal file in each group was sorted according to the webpage click time, journal file is divided into session according to predetermined space, and is session identification that the overall situation is unique of session distribution, in order to journal file is analyzed, improve analysis efficiency.Simultaneously, be divided into a upper session with judging the journal file that does not surpass predetermined space, exactly journal file is divided into different sessions, improve the accuracy rate 
of analyzing.
On the basis of above-described embodiment, the application also provides the method for the unique session identification of overall situation of a kind of preferably current sessions distribution, in order to improve the efficient of analyzing.To achieve these goals, particularly, above-mentioned method for the unique session identification of overall situation of current sessions distribution comprises: be that current sessions distributes the unique session identification of the overall situation by the sky, wherein, the session identification of same date distribution is not identical or different; Dynamically for current sessions distributes the unique session identification of the overall situation, wherein, the number of the uniqueness of session identification and the length of gap periods, Distributed Calculation node is uncorrelated in click steam is analyzed.In the present embodiment, unique during the session identification realizing distributing the same day, the session identification that distributes of same date can be not identical or different, strengthened the application's operability, in addition, in analytic process, dynamically for current sessions distributes the unique session identification of the overall situation, help to improve the efficient of analysis.
Further, the session identification in above-described embodiment can be generated by browser end, adopts host ip or the Hostname of time-based, browser feature, access, adds the form of a pseudo random number, to guarantee the height uniqueness of session identification.simultaneously, the access behavior that this mode also more is close to the users, the different scenes of closing according to session, the mode that the application takes is that front end does not distribute session identification, user according to session concept reduction Website front-end when log analysis accesses track, the assign sessions sign, the benefit that the mode of this assign sessions sign is brought can guarantee that session identification is fully unique, and the generation of session identification and conversation end are hour relevant, when data are heavily processed, be easy to be applied to clear up the session sign in the process of database data, with the session identification in the time period of heavily processing for needs, data are put in storage again.
Simultaneously, session identification has extremely important effect in whole log analysis process, and all log analysis results all contain session identity fields, related between session sign can being used for table and showing, and requiring each session identification was unique in one day.The application's session identification distribution method to be being divided into by the hour example, but is not limited to this.When assign sessions identifies by the hour, by 1 second 100000 PageView amount calculate, such as the data of calculating 0 point~1, what distribute is 0~100000 * 3600 session identification, these some session identifications are divided by computing node, the session identification that distributes with each node that guarantees Distributed Calculation is independently again.
Taking 1 hour as an example:
MAX_PER_CLICK = maximum number of clicks per second on the website (assumed to be 100000);
SECOND_PER_HOUR = 3600 (the number of seconds in one hour, 60 minutes × 60 seconds);
The maximum number of IDs available in one hour is then:
MAX_IDS_PER_HOUR = SECOND_PER_HOUR × MAX_PER_CLICK;
Suppose further that the current number of computing nodes is:
TTL_NODE_UNIT = number of computing nodes (assuming 8 computing nodes, the value is 8);
The number of IDs available to each computing node in one hour is then:
MAX_NODE_IDS = MAX_IDS_PER_HOUR / TTL_NODE_UNIT;
The process of allocating session identifiers by the hour is described in detail below for a specific computing node and a specific analysis hour:
CUR_NODE_ID = number of the current computing node (assuming 8 computing nodes, the value range is [0-7]);
CUR_HOUR_UV = hour currently being analyzed (for example 8 o'clock or 10 o'clock), value range [0-23];
The starting SESSION_ID allocated to this computing node for this hour is:
MIN_SESSION_ID = MAX_IDS_PER_HOUR × CUR_HOUR_UV + MAX_NODE_IDS × CUR_NODE_ID;
The ending SESSION_ID is:
MAX_SESSION_ID = MIN_SESSION_ID + MAX_NODE_IDS;
The above SESSION_ID allocation algorithm guarantees that, for a given hour HH, the ID allocation space is:
[HH × MAX_IDS_PER_HOUR, (HH+1) × MAX_IDS_PER_HOUR]
Because the log file analysis results carry the SESSION_ID field, when the data of a particular hour need to be reprocessed, the SESSION_IDs in the range corresponding to that hour can simply be cleared.
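The allocation arithmetic above can be checked with a short Python sketch. The constants are the assumed values given in the text (100000 clicks per second, 8 computing nodes); the function names session_id_range and hour_of are hypothetical, and each range is treated as half-open (the ending SESSION_ID is excluded) so that adjacent nodes and hours never share an identifier, which is an assumption about how the boundaries are meant to be read.

```python
MAX_PER_CLICK = 100_000       # assumed peak clicks per second, as in the text
SECOND_PER_HOUR = 3600        # 60 minutes x 60 seconds
MAX_IDS_PER_HOUR = SECOND_PER_HOUR * MAX_PER_CLICK


def session_id_range(cur_hour: int, cur_node_id: int,
                     ttl_node_unit: int = 8) -> range:
    """SESSION_IDs one node may allocate for one hour (end excluded)."""
    if not 0 <= cur_hour <= 23:
        raise ValueError("cur_hour must be in [0, 23]")
    if not 0 <= cur_node_id < ttl_node_unit:
        raise ValueError("cur_node_id must be in [0, ttl_node_unit)")
    max_node_ids = MAX_IDS_PER_HOUR // ttl_node_unit
    min_session_id = (MAX_IDS_PER_HOUR * cur_hour
                      + max_node_ids * cur_node_id)
    return range(min_session_id, min_session_id + max_node_ids)


def hour_of(session_id: int) -> int:
    """Recover the analysis hour from a SESSION_ID, e.g. for reprocessing."""
    return session_id // MAX_IDS_PER_HOUR
```

With these values, node 3 analyzing hour 8 would allocate identifiers starting at 3015000000, and hour_of maps any identifier in that range back to hour 8, which is what makes clearing the identifiers of a single hour straightforward.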
Further, on the basis of the above embodiment, under a distributed computing framework all PageViews of a session must be analyzed on the same computing node. The log files are therefore first grouped by COOKIE_ID (the user identifier) at analysis time, to guarantee that all log entries with the same COOKIE_ID are stored contiguously. The grouping comprises the following steps:
Step 1: determine the number of computing nodes of the distributed computing framework, assumed to be n;
Step 2: divide the log entries into n groups by COOKIE_ID, so that the entries with the same COOKIE_ID are stored contiguously and are analyzed on the same computing node;
Step 3: sort the log entries of each COOKIE_ID in each group obtained in step 2 in ascending order of click time.
At the same time, to limit the load on the website servers, the time in a log file is generally accurate only to the second, and log entries clicked within the same second can be ordered as follows:
1) entries whose REFER URL is "-" are placed first;
2) entries whose REFER URL is off-site are placed first, where whether a URL is on-site or off-site can be judged first by its domain; failing that, special cases such as top retail shops are considered: if the URL has ever appeared as an ENTRY URL it is judged to be on-site, and otherwise off-site;
3) given two log entries A and B, if the REFER URL of A equals the ENTRY URL of B, B is placed first.
The above ordering implements sorting by ENTRY URL, so that repeated runs give consistent results.
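The grouping steps and the first two same-second ordering rules can be sketched in Python as follows. This is a simplified illustration under stated assumptions: entries are dicts with the Table 1 field names, visit_time is an integer number of seconds, on-site status is judged by domain only, and the helper names are hypothetical. Rule 3, which chains entries by matching REFER URL to ENTRY URL, and the ENTRY URL history check of rule 2 are not implemented here; ENTRY URL is used only as a final tie-breaker to keep the output deterministic.

```python
import zlib
from collections import defaultdict
from urllib.parse import urlsplit


def node_for(cookie_id: str, n: int) -> int:
    """Map a COOKIE_ID to one of n groups; crc32 is stable across runs."""
    return zlib.crc32(cookie_id.encode("utf-8")) % n


def partition(entries, n):
    """Step 2: all entries of one COOKIE_ID land in the same group."""
    groups = defaultdict(list)
    for entry in entries:
        groups[node_for(entry["cookie_id"], n)].append(entry)
    return groups


def refer_rank(refer_url, site_domains):
    """0: no referrer ("-"), 1: off-site referrer, 2: on-site referrer."""
    if refer_url == "-":
        return 0
    host = urlsplit(refer_url).hostname or ""
    on_site = any(host == d or host.endswith("." + d) for d in site_domains)
    return 2 if on_site else 1


def sort_group(entries, site_domains):
    """Step 3 plus same-second rules 1) and 2)."""
    return sorted(entries, key=lambda e: (
        e["cookie_id"], e["visit_time"],
        refer_rank(e["refer_url"], site_domains), e["entry_url"]))
```

Using a stable checksum rather than Python's built-in hash() for strings keeps the group of a given COOKIE_ID the same from one run to the next, which matters when unclosed sessions are carried over between interval periods.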
The log files produced by the above ordering are ordered, and the log entries with the same COOKIE_ID are guaranteed to be stored contiguously. When SESSIONs (sessions) are divided, the log records are read by COOKIE_ID. If the difference between the click time of the current record and that of the previous click exceeds the predetermined interval, a new SESSION is produced; if the difference between the click time of the last entry of the sorted log file and the hour currently being processed exceeds the predetermined interval, a new SESSION is produced; if the hour currently being processed is the last hour of the day, a new SESSION is produced. The division of SESSIONs is illustrated below with a predetermined interval of 30 minutes; 30 minutes is only a preferred example of the application, which is not limited to it.
1) if the interval between two consecutive web page clicks (PageViews) of the same visitor does not exceed 30 minutes, the clicks are considered to belong to the same SESSION;
2) if a visitor makes no web page click (PageView) for more than 30 minutes, the SESSION is considered closed;
3) log analysis is performed with the day as the unit, and accesses that cross a day boundary belong to different SESSIONs;
4) during log analysis, each complete SESSION is allocated an independent SESSION_ID.
With the above session division, the contiguously stored log entries with the same COOKIE_ID are divided into sessions, and each session is allocated an independent SESSION_ID.
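The four division rules above can be sketched in Python as follows. The input is assumed to be a list of (COOKIE_ID, click time) pairs already sorted by COOKIE_ID and then by click time; the function name sessionize and the use of a simple counter as the source of SESSION_IDs are assumptions made for illustration.

```python
from datetime import datetime, timedelta
from itertools import count


def sessionize(clicks, gap=timedelta(minutes=30), id_source=None):
    """Assign a SESSION_ID to each (cookie_id, click_time) pair.

    clicks must be sorted by cookie_id, then by click_time.
    Returns a list of (session_id, cookie_id, click_time) tuples.
    """
    ids = id_source if id_source is not None else count(0)
    result = []
    prev_cookie, prev_time, session_id = None, None, None
    for cookie_id, click_time in clicks:
        new_session = (
            cookie_id != prev_cookie                    # another visitor
            or click_time - prev_time > gap             # rule 2: gap exceeded
            or click_time.date() != prev_time.date()    # rule 3: new day
        )
        if new_session:
            session_id = next(ids)                      # rule 4: own ID
        result.append((session_id, cookie_id, click_time))
        prev_cookie, prev_time = cookie_id, click_time
    return result
```

A gap of exactly 30 minutes keeps the two clicks in the same session, matching rule 1, while a click shortly after midnight starts a new session even if the previous click was only a few minutes earlier.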
On the basis of the above embodiment, the application also provides a preferred method of judging whether a session obtained by the division is closed, in order to make the application more flexible to use. Specifically, this method comprises: if the current log analysis reference time is later than or equal to the end of the day in which the session lies, judging that the session is closed; or, if the interval between the last web page click time of the user identifier of the session and the current log analysis reference time exceeds the predetermined interval, judging that the session is closed, where the current log analysis reference time is the end time of the time period of the current interval period. In this embodiment, whether a session is closed is judged in different scenarios, which makes the application more flexible to use.
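The two closing conditions can be sketched in Python as follows. The function name is_session_closed is hypothetical, and the day in which the session lies is taken to be the calendar day of its last click, which is an assumption.

```python
from datetime import datetime, timedelta


def is_session_closed(last_click: datetime, reference_time: datetime,
                      gap: timedelta = timedelta(minutes=30)) -> bool:
    """Judge whether a session is closed at the analysis reference time.

    reference_time is the end of the time period covered by the current
    interval period (for hourly analysis, the end of the analyzed hour).
    """
    # Condition 1: the reference time has reached the end of the session's day.
    midnight = datetime.combine(last_click.date(), datetime.min.time())
    end_of_day = midnight + timedelta(days=1)
    if reference_time >= end_of_day:
        return True
    # Condition 2: no click for longer than the predetermined interval.
    return reference_time - last_click > gap
```

Under these assumptions, a session whose last click was at 8:45 is still open when the 8:00 to 9:00 period is analyzed, so its entries would be saved and carried over to the next interval period.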
On the basis of each of the above embodiments, the application provides a method of saving the analysis result sets obtained by the clickstream log analysis in the current interval period, in order to improve the efficiency of the analysis. Specifically, this method comprises: using a first set of tables in the database to save the analysis result sets of closed sessions, where the first set of tables holds the analysis result sets of all closed sessions; using a second set of tables in the database to save the analysis result sets of unclosed sessions, where the second set of tables holds the analysis result sets of all sessions unclosed in the current interval period; and extracting from the first set of tables the required parameters of the log entries corresponding to the sessions of the current interval period, and establishing a mapping between the session identifiers of the closed sessions and the extracted parameters. Some of the unclosed sessions become closed sessions in interval periods after the current one. In this embodiment, sessions are saved separately according to whether they are closed, so that closed and unclosed sessions are clearly distinguished and the log entries of closed sessions can be analyzed, which improves the efficiency of the analysis and the accuracy of its results. In addition, when the log entries of closed sessions are analyzed, the required parameters are extracted and mapped to the session identifiers, and this mapping makes it easier to analyze the log entries of closed sessions, which further improves the efficiency of the analysis.
Further, on the basis of the above embodiment, the application can also identify newly added metrics while performing the clickstream log analysis and write the identified metrics into the dimension tables, so that newly added metrics are analyzed promptly and effectively, which improves the timeliness and accuracy of the log analysis.
On the basis of each of the above preferred embodiments, the application further improves the above log analysis method in order to increase the practical value of the log analysis. Specifically, the above log analysis method further comprises: after the distributed clickstream log analysis has been performed, session by session, on the collected log files at the predetermined interval period, obtaining the analysis results of the clickstream log analysis; and generating an analysis report from the obtained analysis results, where the analysis report is used to adjust the structure of the website corresponding to the log files according to the analysis results. In this embodiment, the analysis results of the clickstream log analysis are turned into an analysis report, so that the structure of the website corresponding to the log files can be adjusted according to the information fed back by the report and better meets the different service demands of users, which increases the practical value of the log analysis.
Further, on the basis of each of the above preferred embodiments, the session may be a session between a client and the website, so that clickstream log analysis is performed on the log files session by session, which improves the efficiency and accuracy of the clickstream log analysis.
Embodiment 3
On the basis of each of the above preferred embodiments, the application provides a preferred log analysis method, in order to improve the timeliness of the log analysis and to use system resources in a reasonable and balanced way. Specifically, as shown in Figure 12, the log analysis method comprises:
S1: download the original log files by the hour and upload them to the distributed computing file system;
Preferably, downloading log files by the hour is equivalent to setting the predetermined period to 1 hour; the period is not limited to this, however, and may also be 30 minutes, 2 hours and so on according to different needs, in order to improve the timeliness of the log analysis. Moreover, when an error occurs in a log file or in the analysis results, the data can be reprocessed or reanalyzed only for the predetermined period in which the error occurred, which reduces the workload. In addition, a distributed architecture has marked advantages in scalability and cost. Taking into account factors such as openness, stability, extensibility and ease of development, the application can be implemented on a distributed computing framework; this is, of course, a preferred example, and the application is not limited to it.
S2: export the dimension table data from the database;
Preferably, log parsing relies first on a series of dimension tables, such as the QUERY_DIMT0 dimension table and the COOKIE_DIMT0 dimension table, which are used to extract the QUERY parameters and COOKIE parameters of web page URLs. Most dimension tables are maintained manually through a front-end interface, but there are exceptions, such as the COSITE_DIMT0 and URL_DIMT0 dimension tables. The COSITE_DIMT0 dimension table is used to extract the source information of partner websites: the cosite parameter and the location parameter together identify which position on which website the current access came from. The advertisement positions of partner websites change frequently, however, so manual configuration would suffer from delay and workload problems, and the entries basically need to be identified and inserted automatically by the program. The URL_DIMT0 dimension table holds the mapping between URLs and URL_IDs. An e-commerce website may have different URLs on the order of millions or tens of millions; storing the URLs directly in the log analysis would not only consume storage but also make report analysis at the ETL layer harder. The strategy adopted is to convert each URL by certain rules into a specific URL_ID, which is what is stored. The program can generate the correspondence between URLs and URL_IDs automatically according to the designed rules, and the correspondence can also be configured manually according to the rules, so the URL_DIMT0 dimension table is maintained by a combination of manual and automatic means.
S3: log decoding;
Preferably, the value of each field of each log entry is decomposed from the log files, entries caused by non-human access are filtered out, and the filtered log files are converted to the unified format and sorted by business into different file paths, so that the log files can be analyzed by business. At the same time, the dimension table data are updated: newly added URLs, COSITEs and the like are added and duplicate data are removed, which improves the accuracy and efficiency of the log analysis.
S4: export the dimension table data from the database;
S5: log parsing (equivalent to log file analysis);
Preferably, the log entries in the converted log files that belong to the same user identifier are divided into one group, the entries in each group are sorted by web page click time, adjacent entries in each group whose web page click times are no more than the predetermined interval apart are assigned to the same session, and each session is allocated a globally unique session identifier. If a session obtained by the division is closed, the log entries in the closed session are analyzed; unclosed sessions are saved in the second set of tables in the database for the analysis of the log files in the next predetermined period.
S6: export flat files from the distributed file system and import them into the database;
S7: rebuild the database table indexes and insert the ETL notification interface flag.
In the above preferred embodiment, the collected log files are analyzed session by session every predetermined period, so that the timeliness of the log file analysis is improved while the system resources used for analyzing the log files are spread evenly over the predetermined periods of a day; and the analysis results are turned into an analysis report, so that the structure of the website corresponding to the log files can be adjusted according to the analysis results. This solves the problems in the prior art that log files cannot be analyzed in real time and that system resources cannot be used in a balanced way, thereby improving the timeliness of the log analysis and using system resources in a reasonable and balanced way.
Further, on the basis of the above preferred embodiment, the session may be a session between a client and the website, so that clickstream log analysis is performed on the log files session by session, which improves the efficiency and accuracy of the clickstream log analysis.
Obviously, those skilled in the art should understand that each module or step of the application described above can be implemented with a general-purpose computing device. They can be concentrated on a single computing device or distributed over a network formed by several computing devices. Optionally, they can be implemented as program code executable by a computing device, so that they can be stored in a storage device and executed by a computing device; in some cases the steps shown or described can be performed in an order different from the one given here; or they can each be made into a separate integrated circuit module, or several of the modules or steps can be made into a single integrated circuit module. The application is thus not restricted to any specific combination of hardware and software.
The above are only preferred embodiments of the application and are not intended to limit the application; for those skilled in the art, the application may have various modifications and variations. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the application shall be included within the scope of protection of the application.

Claims (19)

1. a log analysis method, is characterized in that, comprising:
Gather the journal file that the web log file server cluster generates;
With predetermined gap periods to the journal file that gathers carry out take session as unit based on distributed click steam log analysis, wherein, described gap periods makes for the system resource of analyzing described journal file and was used fifty-fifty at one day.
2. method according to claim 1, is characterized in that, described gap periods is 1 hour.
3. method according to claim 1 and 2, is characterized in that, the journal file of described collection carried out comprising based on the step of distributed click steam log analysis take session as unit with predetermined gap periods:
The journal file of described collection is decoded, remove error log and invalid daily record from decoded journal file, and the journal format that will remove the daily record in described error log and invalid daily record journal file afterwards converts unified journal format to;
Take session as unit, the journal file that converts unified journal format to is carried out based on distributed click steam log analysis the library file to be entered of output fact table.
4. The method according to claim 3, characterized in that the step of decoding the collected log files comprises: reading the logs in the collected log files line by line;
selecting a preset field decomposition rule according to the log source identifier, decomposing the read logs into fields, and removing erroneous logs that do not conform to the field decomposition rule;
filtering out, according to preset filtering rules and the field values obtained after decomposition, logs caused by non-human access and logs of access from the company intranet from the logs in the log files;
unifying the output format of the logs in the filtered log files according to a preset log output format;
sorting the output log files by business according to preset business types, with logs of different businesses output to different paths.
5. The method according to claim 3, characterized in that the step of performing, with the session as the unit, distributed clickstream log analysis on the log files converted into the unified log format comprises:
obtaining the log files converted into the unified log format and the log files of sessions that were not closed in the previous interval period;
grouping the logs in the obtained log files that belong to the same user identifier into one group, and assigning adjacent logs in each group whose page click times are separated by no more than a predetermined interval to the same session;
determining whether each session obtained by the division is closed, and if the session is closed, performing clickstream log analysis on the logs in the closed session;
if the session is not closed, analyzing the logs in the unclosed session using the same analysis mechanism as for the logs in closed sessions, distinguishing the analysis results of unclosed sessions from those of closed sessions by an output flag, and additionally saving a copy of the logs of the unclosed session for analysis in the next interval period.
6. The method according to claim 5, characterized in that the step of assigning adjacent logs in each group whose page click times are separated by no more than the predetermined interval to the same session comprises:
sorting the logs in each group by page click time;
performing, group by group and in the sorted order, the following steps on each log in each sorted group:
determining whether the interval between the page click time of the log currently being processed in a sorted group and the page click time of the previous log in that group exceeds the predetermined interval, wherein the previous log has been assigned to a previous session;
if the predetermined interval is exceeded, creating a current session, assigning the current session a globally unique session identifier, and assigning the log currently being processed to the current session;
if the predetermined interval is not exceeded, assigning the log currently being processed to the previous session.
7. The method according to claim 5, characterized in that the step of determining whether a session obtained by the division is closed comprises:
if the current log analysis reference time is later than or equal to the end of the day on which the session falls, determining that the session is closed; or
if the interval between the last page click time of the user identifier corresponding to the session and the current log analysis reference time exceeds the predetermined interval, determining that the session is closed, wherein the current log analysis reference time is the end time of the time period covered by the current interval.
8. The method according to claim 5, characterized in that the analysis result set obtained by the clickstream log analysis in the current interval period is saved by the following steps:
using a first set of tables in the database to save the analysis result sets of closed sessions, wherein the first set of tables stores the analysis result sets of all closed sessions;
using a second set of tables in the database to save the analysis result sets of unclosed sessions, wherein the second set of tables stores the analysis result sets of all sessions not closed in the current interval period;
extracting from the first set of tables the required parameters in the logs corresponding to the sessions in the current interval period, and establishing a mapping between the session identifiers of the closed sessions and the extracted parameters;
wherein some of the unclosed sessions become closed sessions in an interval period after the current interval period.
9. The method according to claim 6, characterized in that the step of assigning the current session a globally unique session identifier comprises:
assigning the current session a globally unique session identifier on a per-day basis, wherein session identifiers assigned on different dates may be the same or different;
dynamically assigning the current session a globally unique session identifier during the clickstream analysis, wherein the uniqueness of the session identifier is independent of the length of the interval period and of the number of distributed computing nodes.
10. The method according to claim 1, characterized in that, after the distributed session-based clickstream log analysis is performed on the collected log files at the predetermined interval period, the method further comprises:
obtaining the analysis results of the clickstream log analysis;
generating an analysis report from the obtained analysis results, wherein the analysis report is used to adjust, according to the analysis results, the structure of the website corresponding to the log files.
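The session division and closure test recited in claims 5 to 7 can be illustrated with a minimal Python sketch. It is not the applicant's implementation. The 30-minute gap, the `(user_id, click_time)` record shape, the use of `uuid4` for session identifiers, and the choice of the first click's date as "the day on which the session falls" are all assumptions made for illustration; the claims only speak of a "predetermined interval" and a "globally unique" identifier.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from itertools import groupby
from uuid import uuid4

# Assumed value; the claims only require a "predetermined interval".
SESSION_GAP = timedelta(minutes=30)


@dataclass
class Session:
    user_id: str
    session_id: str
    clicks: list = field(default_factory=list)  # click times, ascending
    closed: bool = False


def is_closed(session, reference_time, gap=SESSION_GAP):
    """Closure test of claim 7; reference_time is the end of the current interval."""
    # Assumption: the session's day is the calendar day of its first click.
    first_day = session.clicks[0].date()
    end_of_day = datetime.combine(first_day, datetime.min.time()) + timedelta(days=1)
    # Closed if the day is over, or the user has been idle longer than the gap.
    return reference_time >= end_of_day or reference_time - session.clicks[-1] > gap


def sessionize(records, reference_time, gap=SESSION_GAP):
    """Split (user_id, click_time) records into sessions, as in claims 5 and 6.

    `records` is expected to hold the logs of the current interval period plus
    the saved logs of sessions left unclosed in the previous period.
    """
    sessions = []
    # Group by user, with each group sorted by click time.
    ordered = sorted(records, key=lambda r: (r[0], r[1]))
    for user_id, group in groupby(ordered, key=lambda r: r[0]):
        current = None
        for _, click_time in group:
            # A gap larger than the predetermined interval starts a new session.
            if current is None or click_time - current.clicks[-1] > gap:
                current = Session(user_id, uuid4().hex)  # uuid4 is an assumption
                sessions.append(current)
            current.clicks.append(click_time)
    for session in sessions:
        session.closed = is_closed(session, reference_time, gap)
    return sessions
```

In this sketch, sessions with `closed == False` are the ones whose logs would be saved and fed back into `records` for the next interval period, while their provisional results are kept apart from those of closed sessions by the `closed` flag.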
11. A log analysis device, characterized by comprising:
a collection unit, configured to collect log files generated by a website log server cluster;
an analysis unit, configured to perform, at a predetermined interval period, distributed session-based clickstream log analysis on the collected log files, wherein the interval period allows the resources of the system used for analyzing the log files to be used evenly throughout a day.
12. The device according to claim 11, characterized in that the analysis unit comprises:
a decoding module, configured to decode the collected log files, remove erroneous logs and invalid logs from the decoded log files, and convert the logs remaining in the log files after the removal into a unified log format;
an analysis module, configured to perform, with the session as the unit, distributed clickstream log analysis on the log files converted into the unified log format, and to output fact-table files to be loaded into the database.
13. The device according to claim 12, characterized in that the decoding module comprises:
a reading submodule, configured to read the logs in the collected log files line by line;
a decomposition submodule, configured to select a preset field decomposition rule according to the log source identifier, decompose the read logs into fields, and remove erroneous logs that do not conform to the field decomposition rule;
a filtering submodule, configured to filter out, according to preset filtering rules and the field values obtained after decomposition, logs caused by non-human access and logs of access from the company intranet from the logs in the log files;
an output submodule, configured to unify the output format of the logs in the filtered log files according to a preset log output format;
a sorting submodule, configured to sort the output log files by business according to preset business types, with logs of different businesses output to different paths.
14. The device according to claim 13, characterized in that the filtering submodule comprises:
a rule parsing submodule, configured to load the filtering rules and initialize the filter function;
a rule judgment submodule, configured to determine, according to the field values obtained by decomposing a log, whether the log matches any one of the loaded filtering rules.
15. The device according to claim 12, characterized in that the analysis module comprises:
an obtaining submodule, configured to obtain the log files converted into the unified log format and the log files of sessions that were not closed in the previous interval period;
a grouping submodule, configured to group the logs in the obtained log files that belong to the same user identifier into one group, and to assign adjacent logs in each group whose page click times are separated by no more than a predetermined interval to the same session;
a session submodule, configured to determine whether each session obtained by the division is closed, and if the session is closed, to perform clickstream log analysis on the logs in the closed session;
if the session is not closed, to analyze the logs in the unclosed session using the same analysis mechanism as for the logs in closed sessions, to distinguish the analysis results of unclosed sessions from those of closed sessions by an output flag, and additionally to save a copy of the logs of the unclosed session for analysis in the next interval period.
16. The device according to claim 15, characterized in that the grouping submodule comprises:
an ordering submodule, configured to sort the logs in each group by page click time;
a first judgment submodule, configured to determine, group by group and in the sorted order for each log in each sorted group, whether the interval between the page click time of the log currently being processed in a sorted group and the page click time of the previous log in that group exceeds the predetermined interval, wherein the previous log has been assigned to a previous session;
a creation submodule, configured to create a current session when the predetermined interval is determined to be exceeded, to assign the current session a globally unique session identifier, and to assign the log currently being processed to the current session;
a division submodule, configured to assign the log currently being processed to the previous session when the predetermined interval is determined not to be exceeded.
17. The device according to claim 15, characterized in that the session submodule comprises:
a second judgment submodule, configured to determine that the session is closed when the current log analysis reference time is later than or equal to the end of the day on which the session falls; or
a third judgment submodule, configured to determine that the session is closed when the interval between the last page click time of the user identifier corresponding to the session and the current log analysis reference time exceeds the predetermined interval, wherein the current log analysis reference time is the end time of the time period covered by the current interval.
18. The device according to claim 15, characterized in that the session submodule further comprises:
a first saving submodule, configured to use a first set of tables in the database to save the analysis result sets of closed sessions, wherein the first set of tables stores the analysis result sets of all closed sessions;
a second saving submodule, configured to use a second set of tables in the database to save the analysis result sets of unclosed sessions, wherein the second set of tables stores the analysis result sets of all sessions not closed in the current interval period, and wherein some of the unclosed sessions become closed sessions in an interval period after the current interval period;
a dimension table updating submodule, configured to identify metrics newly added in the clickstream log analysis of the current interval period and to update them into the dimension table, wherein the newly added metrics are used for the clickstream log analysis of the current interval period.
19. The device according to claim 11, characterized by further comprising:
an obtaining unit, configured to obtain the analysis results of the clickstream log analysis after the distributed session-based clickstream log analysis has been performed on the collected log files at the predetermined interval period;
a generation unit, configured to generate an analysis report from the obtained analysis results, wherein the analysis report is used to adjust, according to the analysis results, the structure of the website corresponding to the log files.
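The decoding steps recited in claims 4 and 13 (read line by line, decompose into fields by source, drop malformed logs, filter non-human and intranet access, unify the format, sort by business) can be sketched in Python as follows. This is an illustrative sketch only. The rule table `FIELD_RULES`, the bot markers, the `10.0.0.0/8` intranet range, the unified field order, and the rule that the business type is the first path segment of the URL are all hypothetical; the claims leave every one of these to configuration.

```python
import ipaddress
from collections import defaultdict

# Hypothetical field decomposition rules, keyed by log source identifier.
FIELD_RULES = {
    "web": {"sep": "\t", "fields": ["time", "ip", "user_id", "url", "agent"]},
    "app": {"sep": "|", "fields": ["ip", "time", "user_id", "url", "agent"]},
}
BOT_MARKERS = ("bot", "spider", "crawler")          # assumed signs of non-human access
INTRANET = ipaddress.ip_network("10.0.0.0/8")       # assumed company intranet range
UNIFIED_FIELDS = ["time", "user_id", "ip", "url"]   # assumed unified log format


def decompose(line, source):
    """Split one log line into named fields; return None for malformed lines."""
    rule = FIELD_RULES[source]
    parts = line.rstrip("\n").split(rule["sep"])
    if len(parts) != len(rule["fields"]):
        return None  # does not conform to the field decomposition rule
    return dict(zip(rule["fields"], parts))


def is_filtered(record):
    """Return True for logs caused by non-human access or intranet access."""
    agent = record["agent"].lower()
    if any(marker in agent for marker in BOT_MARKERS):
        return True
    try:
        return ipaddress.ip_address(record["ip"]) in INTRANET
    except ValueError:
        return True  # an unparseable address is treated as an invalid log


def business_of(record):
    """Hypothetical business type: first path segment of a path-only URL."""
    path = record["url"].split("?", 1)[0]
    return path.strip("/").split("/")[0] or "other"


def decode(lines, source):
    """Run the decode steps and return unified log lines keyed by business type."""
    by_business = defaultdict(list)
    for line in lines:
        record = decompose(line, source)
        if record is None or is_filtered(record):
            continue
        unified = "\t".join(record[name] for name in UNIFIED_FIELDS)
        by_business[business_of(record)].append(unified)
    return dict(by_business)
```

The returned dictionary stands in for the per-business output paths of the claims; writing each list to its own directory is left out to keep the sketch short.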
CN2011104399568A 2011-12-23 2011-12-23 Method and device for analyzing log Pending CN103178982A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2011104399568A CN103178982A (en) 2011-12-23 2011-12-23 Method and device for analyzing log

Publications (1)

Publication Number Publication Date
CN103178982A true CN103178982A (en) 2013-06-26

Family

ID=48638614

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2011104399568A Pending CN103178982A (en) 2011-12-23 2011-12-23 Method and device for analyzing log

Country Status (1)

Country Link
CN (1) CN103178982A (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100085888A1 (en) * 2005-12-30 2010-04-08 Jeanette Larosa Method and apparatus for analyzing source internet protocol activity in a network
CN101770487A (en) * 2008-12-26 2010-07-07 聚友空间网络技术有限公司 Method and system for calculating user influence in social network
CN102075355A (en) * 2010-12-30 2011-05-25 北京世纪互联工程技术服务有限公司 Log system and using method thereof

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
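俞辉 (Yu Hui): "Research on a real-time web page recommendation algorithm based on Web log mining", 《计算机工程与设计》 (Computer Engineering and Design), vol. 29, no. 7, 30 April 2008 (2008-04-30) *
李烈彪 (Li Liebiao) et al.: "Research on data preprocessing methods in Web log mining", 《计算机技术与发展》 (Computer Technology and Development), vol. 17, no. 7, 31 July 2007 (2007-07-31), pages 45 - 48 *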

Cited By (80)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103399855A (en) * 2013-07-01 2013-11-20 百度在线网络技术(北京)有限公司 Behavior intention determining method and device based on multiple data sources
CN103312568B (en) * 2013-07-09 2016-07-13 北京国双科技有限公司 Data statistical approach and device
CN103312568A (en) * 2013-07-09 2013-09-18 北京国双科技有限公司 Data statistical method and device
CN103401849A (en) * 2013-07-18 2013-11-20 盘石软件(上海)有限公司 Abnormal session analyzing method for website logs
CN103401849B (en) * 2013-07-18 2017-02-15 盘石软件(上海)有限公司 Abnormal session analyzing method for website logs
CN103414758A (en) * 2013-07-19 2013-11-27 北京奇虎科技有限公司 Method and device for processing logs
US10587707B2 (en) 2013-08-28 2020-03-10 Tencent Technology (Shenzhen) Company Limited Method and apparatus for monitoring website access data
CN104426713B (en) * 2013-08-28 2018-04-17 腾讯科技(北京)有限公司 The monitoring method and device of web site access effect data
CN104426713A (en) * 2013-08-28 2015-03-18 腾讯科技(北京)有限公司 Method and device for monitoring network site access effect data
CN103593791A (en) * 2013-11-07 2014-02-19 广州优蜜信息科技有限公司 Mobile advertisement putting method and system
CN103577586A (en) * 2013-11-08 2014-02-12 北京国双科技有限公司 Method and device for processing log records
CN103577586B (en) * 2013-11-08 2017-03-15 北京国双科技有限公司 The processing method and processing device of log recording
CN103595571B (en) * 2013-11-20 2018-02-02 北京国双科技有限公司 Preprocess method, the apparatus and system of web log
CN103595571A (en) * 2013-11-20 2014-02-19 北京国双科技有限公司 Preprocessing method, device and system for website access logs
CN104091276A (en) * 2013-12-10 2014-10-08 深圳市腾讯计算机系统有限公司 Click stream data online analyzing method and related device and system
CN103729479A (en) * 2014-01-26 2014-04-16 北京北纬通信科技股份有限公司 Web page content statistical method and system based on distributed file storage
CN105100128A (en) * 2014-04-24 2015-11-25 北京金山网络科技有限公司 Server cluster log acquiring and providing methods, log server and node server
CN105337930A (en) * 2014-06-30 2016-02-17 北京新媒传信科技有限公司 Data processing method and apparatus
CN105337930B (en) * 2014-06-30 2019-02-19 北京新媒传信科技有限公司 The method and device that a kind of pair of data are handled
CN104113605A (en) * 2014-07-30 2014-10-22 浪潮软件股份有限公司 Enterprise cloud application development monitoring processing method
US11475023B2 (en) 2014-11-05 2022-10-18 Ab Initio Technology Llc Impact analysis
CN107135663B (en) * 2014-11-05 2021-06-22 起元技术有限责任公司 Impact Analysis
CN107135663A (en) * 2014-11-05 2017-09-05 起元技术有限责任公司 Impact analysis
CN104298782A (en) * 2014-11-07 2015-01-21 辽宁四维科技发展有限公司 Method for analyzing active access behaviors of internet users
CN104298782B (en) * 2014-11-07 2017-10-24 郭磊 Internet user actively accesses the analysis method of action trail
CN104391954A (en) * 2014-11-27 2015-03-04 北京国双科技有限公司 Database log processing method and device
CN104391954B (en) * 2014-11-27 2019-04-09 北京国双科技有限公司 The processing method and processing device of database journal
CN104639387B (en) * 2014-12-09 2019-03-01 北京京东尚科信息技术有限公司 A method and device for tracking user network behavior
CN104639387A (en) * 2014-12-09 2015-05-20 北京京东尚科信息技术有限公司 Users' network behavior tracking method and equipment
CN105812324A (en) * 2014-12-30 2016-07-27 华为技术有限公司 Method, device and system for IDC information safety management
CN105812324B (en) * 2014-12-30 2019-04-05 华为技术有限公司 IDC information security management method, device and system
CN105141448B (en) * 2015-07-28 2018-10-02 杭州华为数字技术有限公司 A kind of acquisition method and device of daily record
CN105141448A (en) * 2015-07-28 2015-12-09 杭州华为数字技术有限公司 Method and device for collecting log
CN106453454B (en) * 2015-08-07 2019-08-16 北京国双科技有限公司 Session label information generation method and device
CN106453454A (en) * 2015-08-07 2017-02-22 北京国双科技有限公司 Dialogue identification information generating method and apparatus
CN106776264A (en) * 2015-11-24 2017-05-31 北京国双科技有限公司 The method of testing and device of application code
CN106817270A (en) * 2015-12-01 2017-06-09 精硕科技(北京)股份有限公司 Network traffics acquisition method, system and server
CN106909499A (en) * 2015-12-22 2017-06-30 阿里巴巴集团控股有限公司 Method of testing and device
CN105930329A (en) * 2015-12-28 2016-09-07 中国银联股份有限公司 Transaction log analysis method and apparatus
CN106021079B (en) * 2016-05-06 2018-10-09 华南理工大学 It is a kind of based on the Web application performance test methods for being frequently visited by the user series model
CN106021079A (en) * 2016-05-06 2016-10-12 华南理工大学 A Web application performance testing method based on a user frequent access sequence model
CN106130807A (en) * 2016-08-31 2016-11-16 百势软件(北京)有限公司 The extraction of a kind of Nginx daily record and analysis method and device
CN108241661A (en) * 2016-12-23 2018-07-03 航天星图科技(北京)有限公司 A kind of distributed traffic analysis method
CN108255879A (en) * 2016-12-29 2018-07-06 北京国双科技有限公司 The detection method and device of web page browsing flow cheating
CN108255879B (en) * 2016-12-29 2021-10-08 北京国双科技有限公司 Method and device for detecting cheating in web browsing traffic
CN106713041A (en) * 2016-12-29 2017-05-24 杭州迪普科技股份有限公司 Session log transmitting method and device
CN108629042A (en) * 2017-07-06 2018-10-09 深圳中兴飞贷金融科技有限公司 Big data acquisition method, apparatus and system
CN107317873A (en) * 2017-07-21 2017-11-03 曙光信息产业(北京)有限公司 A kind of conversation processing method and device
CN107517203A (en) * 2017-08-08 2017-12-26 北京奇安信科技有限公司 A kind of user behavior baseline method for building up and device
CN107517203B (en) * 2017-08-08 2020-07-14 奇安信科技集团股份有限公司 User behavior baseline establishing method and device
CN107688619B (en) * 2017-08-10 2020-06-16 奇安信科技集团股份有限公司 Log data processing method and device
CN107688619A (en) * 2017-08-10 2018-02-13 北京奇安信科技有限公司 A kind of daily record data processing method and processing device
CN108123840A (en) * 2017-12-22 2018-06-05 中国联合网络通信集团有限公司 Log processing method and system
CN108363649A (en) * 2017-12-29 2018-08-03 微梦创科网络科技(中国)有限公司 A kind of method and device of distribution statistical log visit capacity
CN110659918A (en) * 2018-06-28 2020-01-07 上海传漾广告有限公司 Optimization method for tracking and analyzing network advertisements
CN109190007B (en) * 2018-07-20 2022-10-04 创新先进技术有限公司 Data analysis method and device
CN109190007A (en) * 2018-07-20 2019-01-11 阿里巴巴集团控股有限公司 Data analysing method and device
CN109218401B (en) * 2018-08-08 2021-08-31 平安科技(深圳)有限公司 Log collection method, system, computer device and storage medium
WO2020029376A1 (en) * 2018-08-08 2020-02-13 平安科技(深圳)有限公司 Log acquisition method and system, and computer device and storage medium
CN109218401A (en) * 2018-08-08 2019-01-15 平安科技(深圳)有限公司 Log collection method, system, computer equipment and storage medium
US11647100B2 (en) 2018-09-30 2023-05-09 China Mobile Communication Co., Ltd Research Inst Resource query method and apparatus, device, and storage medium
CN109359263B (en) * 2018-10-16 2020-09-29 杭州安恒信息技术股份有限公司 A kind of user behavior feature extraction method and system
CN109325183A (en) * 2018-10-16 2019-02-12 深圳壹账通智能科技有限公司 Method, device and computer equipment for locating error problem based on crawler log
CN109359263A (en) * 2018-10-16 2019-02-19 杭州安恒信息技术股份有限公司 A kind of user behavior feature extraction method and system
CN111224807B (en) * 2018-11-27 2023-08-01 中国移动通信集团江西有限公司 Distributed log processing method, device, equipment and computer storage medium
CN111224807A (en) * 2018-11-27 2020-06-02 中国移动通信集团江西有限公司 Distributed log processing method, device, device and computer storage medium
CN109739821A (en) * 2018-12-18 2019-05-10 中国科学院计算机网络信息中心 Log data hierarchical storage method, device and storage medium
CN109885543A (en) * 2018-12-24 2019-06-14 航天信息股份有限公司 Log processing method and device based on big data cluster
CN111723063A (en) * 2019-03-18 2020-09-29 北京沃东天骏信息技术有限公司 A method and device for offline log data processing
CN112242919B (en) * 2019-07-19 2022-07-29 烽火通信科技股份有限公司 Fault file processing method and system
CN112242919A (en) * 2019-07-19 2021-01-19 烽火通信科技股份有限公司 Fault file processing method and system
CN110516440B (en) * 2019-08-12 2021-12-10 广州海颐信息安全技术有限公司 Method and device for linkage playback of privilege threat behavior track based on dragging
CN110516440A (en) * 2019-08-12 2019-11-29 广州海颐信息安全技术有限公司 Privilege based on dragging threatens the method and device of action trail association playback
CN110825943A (en) * 2019-10-23 2020-02-21 支付宝(杭州)信息技术有限公司 A method, system and device for generating user access path tree data
CN110825943B (en) * 2019-10-23 2023-10-10 支付宝(杭州)信息技术有限公司 A method, system and device for generating user access path tree data
CN112069048A (en) * 2020-09-09 2020-12-11 北京明略昭辉科技有限公司 Log processing method, device and storage medium
CN114827126A (en) * 2022-03-24 2022-07-29 中通服创立信息科技有限责任公司 IPTVDN user play log reporting method and system
CN114827126B (en) * 2022-03-24 2023-07-14 中通服创立信息科技有限责任公司 IPTVCDN user play log reporting method and system
CN116975013A (en) * 2022-08-23 2023-10-31 中国移动通信集团浙江有限公司 Method, device, equipment and computer storage medium for constructing log analysis model
CN116582423A (en) * 2023-05-23 2023-08-11 杭州电子科技大学 A log parsing method for edge gateway devices based on real-time stream processing

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 1182856

Country of ref document: HK

RJ01 Rejection of invention patent application after publication

Application publication date: 20130626