CN103178982A - Method and device for analyzing log - Google Patents
- Publication number
- CN103178982A
- Authority
- CN
- China
- Prior art keywords
- session
- journal file
- daily record
- analysis
- submodule
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Landscapes
- Debugging And Monitoring (AREA)
Abstract
The invention provides a method and device for analyzing logs. The method includes: collecting log files generated by a website log server cluster; and performing distributed, session-based clickstream log analysis on the collected log files at a preset interval period, where the interval period allows the system resources used for analyzing the log files to be used evenly over a day. The method and device achieve correct, real-time, distributed clickstream log analysis with the session as the unit, and solve the prior-art problem that logs cannot be analyzed in real time and system resources cannot be used evenly, thereby improving the flexibility and timeliness of website log analysis.
Description
Technical field
The application relates to the field of Internet communication, and in particular to a log analysis method and device.
Background technology
With the development of Internet information services, many enterprises, companies, government bodies, schools and the like have built or are building their own websites. Managing a website requires attention not only to the daily throughput of the servers but also to how each page of the website is accessed, so that the content and quality of the pages can be improved according to how often each page is clicked and the content made more readable. Website administrators therefore need to learn the analysis results of the log files in a timely manner.
At present, clickstream log analysis collects, organizes, analyzes and aggregates the Web server logs of a website, mines the commercial value hidden in them, and converts the data describing user behavior into effective information that decision makers can use, thereby providing decision support for website operators. A clickstream is the trail of clicks a visitor leaves while continuously browsing a website: when a visitor browses a page, the log file of the website's Web server records the information about that visitor's clicks. Clickstream analysis differs from the traditional business model. Under the traditional model there is no direct channel for information exchange and feedback between Web users and the website's information provider, for example about which information is most popular with users or how adding or deleting page content affects click volume, so the website manager cannot improve the content and quality of pages according to how each page is accessed.
As can be seen, although existing clickstream log analysis can mine the commercial value hidden in the logs and provide decision support for website operators, its log parsing granularity is one day. As the number of Internet users keeps growing, website traffic has risen from the hundreds of thousands or millions to the tens or hundreds of millions, and the volume of Web server log files has grown from tens of MB to tens of GB, or even to the TB order of magnitude. The timeliness requirements for log file statistics and analysis are correspondingly higher and higher. Clickstream log analysis performed by day may therefore have several shortcomings, for example:
1) From the perspective of host load: with day-level analysis, host CPU/IO/MEM load and database load are all concentrated, so in different scenarios the system may be "swamped when busy and idle when not"; host resources and database resources cannot be used evenly over the day;
2) From the perspective of data timeliness: as the business evolves, day-level analysis can no longer meet data timeliness requirements. For advertisement delivery effect data, for example, day-level analysis means that data is updated once a day and analysis results can only be derived from a full day of data, which falls far short of the timeliness that different businesses require;
3) From the perspective of maintenance cost: if an error occurs partway through day-level analysis, the full data set must be rolled back and reprocessed. For example, if a log download fails, a full day of data must be reprocessed, which greatly increases the workload and delays the data.
Summary of the invention
The application provides a log analysis method and device to solve at least the prior-art problems that log files cannot be analyzed in real time and that system resources are used unevenly.
According to one aspect of the application, a log analysis method is provided, comprising: collecting the log files generated by a website log server cluster; performing distributed, session-based clickstream log analysis on the collected log files at a predetermined interval period, wherein the interval period allows the resources of the system that analyzes the log files to be used evenly over a day; and generating an analysis report from the analysis results of the clickstream log analysis, wherein the analysis report is used to adjust the structure of the website corresponding to the log files according to the analysis results.
Preferably, the interval period is 1 hour.
Preferably, the step of performing distributed, session-based clickstream log analysis on the collected log files at the predetermined interval period comprises: decoding the collected log files, removing erroneous logs and invalid logs from the decoded log files, and converting the logs remaining in the log files into a unified log format; and performing distributed clickstream log analysis, with the session as the unit, on the log files converted into the unified log format, and outputting the fact-table files to be loaded into the database.
Preferably, the step of decoding the collected log files comprises: reading the logs in the collected log files line by line; splitting each log that is read into fields according to the field decomposition rule selected by the log source identifier, and removing erroneous logs that do not conform to the field decomposition rule; filtering out, according to preset filtering rules and the field values obtained from the decomposition, logs caused by non-human access and logs of access from the company intranet; converting the logs in the filtered log files into a unified output format according to a preset log output format; and sorting the output log files by business according to preset business types, with the logs of different businesses output to different paths.
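The decoding steps above (read by line, decompose into fields, drop malformed lines, filter, sort by business) can be sketched roughly as follows. This is only an illustrative sketch: the tab separator, the function names `decompose` and `decode`, and the way a business type is derived from a record are assumptions, not details given in the application.

```python
from typing import Callable, Dict, Iterable, List, Optional

Record = Dict[str, str]


def decompose(line: str, field_names: List[str], sep: str = "\t") -> Optional[Record]:
    """Split one raw line into named fields; return None for a malformed line."""
    parts = line.rstrip("\n").split(sep)
    if len(parts) != len(field_names):
        return None  # does not conform to the (assumed) field decomposition rule
    return dict(zip(field_names, parts))


def decode(
    lines: Iterable[str],
    field_names: List[str],
    filters: List[Callable[[Record], bool]],
    classify: Callable[[Record], str],
) -> Dict[str, List[Record]]:
    """Read, decompose, filter, and sort records by business type."""
    sorted_out: Dict[str, List[Record]] = {}
    for line in lines:
        record = decompose(line, field_names)
        if record is None:
            continue  # erroneous log: dropped
        if any(rule(record) for rule in filters):
            continue  # invalid log (for example crawler or intranet access): dropped
        # Records of different businesses go to different output buckets.
        sorted_out.setdefault(classify(record), []).append(record)
    return sorted_out


# Hypothetical usage with three fields per line.
fields = ["visit_ip", "entry_url", "agent_info"]
raw = [
    "1.2.3.4\t/shop/item\tMozilla",
    "bad line",
    "10.0.0.5\t/shop/x\tMozilla",
    "5.6.7.8\t/news/a\tbot",
]
rules = [
    lambda r: r["visit_ip"].startswith("10."),  # assumed intranet range
    lambda r: "bot" in r["agent_info"],         # assumed crawler marker
]
result = decode(raw, fields, rules, lambda r: r["entry_url"].split("/")[1])
```

In a real deployment each bucket would be written to its own output path in the unified log format; the in-memory dictionary here only stands in for that step.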
Preferably, the step of performing distributed clickstream log analysis, with the session as the unit, on the log files converted into the unified log format comprises: obtaining the log files converted into the unified log format together with the logs of sessions that were not closed in the previous interval period; grouping the obtained logs by user ID, and within each group assigning adjacent logs whose page-click times are no more than a predetermined interval apart to the same session; and judging whether each session obtained by the division is closed. If the session is closed, clickstream log analysis is performed on the logs in the closed session. If the session is not closed, the logs in the unclosed session are analyzed with the same analysis mechanism as for closed sessions, an identifier is output to distinguish the analysis results of unclosed sessions from those of closed sessions, and a separate copy of the logs of the unclosed session is saved for analysis in the next interval period.
Preferably, the step of assigning adjacent logs in each group whose page-click times are no more than the predetermined interval apart to the same session comprises: sorting the logs in each group by page-click time; and, group by group, performing the following steps on each log in sorted order: judging whether the interval between the page-click time of the log currently being processed and the page-click time of the previous log in the group exceeds the predetermined interval, wherein the previous log has been assigned to the previous session; if the interval exceeds the predetermined interval, creating a current session, allocating a globally unique session identifier to the current session, and assigning the log currently being processed to the current session; and if the interval does not exceed the predetermined interval, assigning the log currently being processed to the previous session.
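The session-splitting procedure described above can be illustrated with a short sketch. The 30-minute gap, the representation of a click as a `(user_id, timestamp)` pair in epoch seconds, and the use of `uuid4` for the globally unique session identifier are all assumptions made for illustration; the application does not prescribe them.

```python
import uuid
from collections import defaultdict
from typing import Dict, List, Tuple

Click = Tuple[str, int]  # (user_id, click time in epoch seconds); assumed layout

SESSION_GAP_SECONDS = 30 * 60  # assumed value of the predetermined interval


def split_sessions(clicks: List[Click], gap: int = SESSION_GAP_SECONDS) -> Dict[str, List[Click]]:
    """Group clicks by user, sort by time, and cut a new session at each long gap."""
    by_user: Dict[str, List[int]] = defaultdict(list)
    for user_id, ts in clicks:
        by_user[user_id].append(ts)

    sessions: Dict[str, List[Click]] = {}
    for user_id, times in by_user.items():
        times.sort()
        current_id = ""
        previous = None
        for ts in times:
            # First click of the group, or gap exceeded: open a new session.
            if previous is None or ts - previous > gap:
                current_id = uuid.uuid4().hex  # stand-in for a globally unique ID
                sessions[current_id] = []
            sessions[current_id].append((user_id, ts))
            previous = ts
    return sessions


# Hypothetical usage: u1 clicks twice, pauses for more than 30 minutes, clicks again.
example = split_sessions([("u1", 600), ("u1", 0), ("u1", 2401), ("u2", 100)])
session_sizes = sorted(len(v) for v in example.values())
```

A click whose distance from the previous one equals the gap exactly stays in the same session, matching the "no more than the predetermined interval" wording.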
Preferably, the step of judging whether a session obtained by the division is closed comprises: if the current log analysis reference time is later than or equal to the end of the day in which the session falls, judging that the session is closed; or, if the interval between the last page-click time of the user ID corresponding to the session and the current log analysis reference time exceeds the predetermined interval, judging that the session is closed, wherein the current log analysis reference time is the end time of the time period of the current interval.
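The two closing conditions can be expressed as a small predicate. This is a sketch under assumptions: times are epoch seconds, the predetermined interval defaults to 30 minutes, and the name `is_session_closed` is hypothetical.

```python
def is_session_closed(last_click: int, reference_time: int, day_end: int, gap: int = 1800) -> bool:
    """Return True if a session counts as closed at the end of the current interval.

    last_click     -- last page-click time of the session's user ID
    reference_time -- end time of the interval period being analyzed
    day_end        -- end of the day in which the session falls
    gap            -- predetermined interval (assumed 30 minutes)
    """
    # Condition 1: the analysis has reached or passed the end of the session's day.
    if reference_time >= day_end:
        return True
    # Condition 2: the user has been silent for longer than the predetermined interval.
    return reference_time - last_click > gap


# Hypothetical usage for the interval ending at 01:00 (3600 s into the day).
still_open = is_session_closed(last_click=3000, reference_time=3600, day_end=86400)
closed_by_gap = is_session_closed(last_click=1000, reference_time=3600, day_end=86400)
closed_by_day = is_session_closed(last_click=86000, reference_time=86400, day_end=86400)
```

Sessions for which the predicate returns False would be carried over, together with their logs, into the next interval period.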
Preferably, the analysis result sets obtained by the clickstream log analysis in the current interval period are saved by the following steps: using a first set of tables in the database to save the analysis result sets of closed sessions, wherein the first set of tables holds the analysis result sets of all closed sessions; using a second set of tables in the database to save the analysis result sets of unclosed sessions, wherein the second set of tables holds the analysis result sets of all sessions not closed in the current interval period; and extracting from the first set of tables the required parameters in the logs corresponding to the sessions of the current interval period, and establishing a mapping between the session identifiers of the closed sessions and the extracted parameters; wherein some of the unclosed sessions become closed sessions in interval periods after the current interval period.
Preferably, the step of allocating a globally unique session identifier to the current session comprises: allocating the globally unique session identifier to the current session by day, wherein session identifiers allocated on different dates may be identical or different; or dynamically allocating the globally unique session identifier to the current session, wherein the uniqueness of the session identifier is independent of the length of the interval period and of the number of distributed computing nodes used in the clickstream analysis.
Preferably, after the distributed, session-based clickstream log analysis is performed on the collected log files at the predetermined interval period, the log analysis method further comprises: obtaining the analysis results of the clickstream log analysis; and generating an analysis report from the obtained analysis results, wherein the analysis report is used to adjust the structure of the website corresponding to the log files according to the analysis results.
According to another aspect of the application, a log analysis device is provided, comprising: a collecting unit, used to collect the log files generated by a website log server cluster; an analysis unit, used to perform distributed, session-based clickstream log analysis on the collected log files at a predetermined interval period, wherein the interval period allows the resources of the system that analyzes the log files to be used evenly over a day; and a generation unit, used to generate an analysis report from the analysis results of the clickstream log analysis, wherein the analysis report is used to adjust the structure of the website corresponding to the log files according to the analysis results.
Preferably, the analysis unit comprises: a decoding module, used to decode the collected log files, remove erroneous logs and invalid logs from the decoded log files, and convert the logs remaining in the log files into a unified log format; and an analysis module, used to perform distributed clickstream log analysis, with the session as the unit, on the log files converted into the unified log format, and to output the fact-table files to be loaded into the database.
Preferably, the decoding module comprises: a reading submodule, used to read the logs in the collected log files line by line; a decomposition submodule, used to split each log that is read into fields according to the field decomposition rule selected by the log source identifier, and to remove erroneous logs that do not conform to the field decomposition rule; a filtering submodule, used to filter out, according to preset filtering rules and the field values obtained from the decomposition, logs caused by non-human access and logs of access from the company intranet; an output submodule, used to convert the logs in the filtered log files into a unified output format according to a preset log output format; and a sorting submodule, used to sort the output log files by business according to preset business types, with the logs of different businesses output to different paths.
Preferably, the filtering submodule comprises: a rule parsing submodule, used to load the filtering rules and initialize the filter functions; and a rule judging submodule, used to judge, according to the field values obtained by decomposing a log, whether the log matches any one of the loaded filtering rules.
Preferably, the analysis module comprises: an obtaining submodule, used to obtain the log files converted into the unified log format together with the logs of sessions that were not closed in the previous interval period; a grouping submodule, used to group the obtained logs by user ID and, within each group, to assign adjacent logs whose page-click times are no more than a predetermined interval apart to the same session; and a session submodule, used to judge whether each session obtained by the division is closed. If the session is closed, the logs in the closed session are analyzed. If the session is not closed, the logs in the unclosed session are analyzed with the same analysis mechanism as for closed sessions, an identifier is output to distinguish the analysis results of unclosed sessions from those of closed sessions, and a separate copy of the logs of the unclosed session is saved for analysis in the next interval period.
Preferably, the grouping submodule comprises: an ordering submodule, used to sort the logs in each group by page-click time; a first judging submodule, used to process, group by group, each log in sorted order and to judge whether the interval between the page-click time of the log currently being processed and the page-click time of the previous log in the group exceeds the predetermined interval, wherein the previous log has been assigned to the previous session; a creating submodule, used, when the interval is judged to exceed the predetermined interval, to create a current session, allocate a globally unique session identifier to the current session, and assign the log currently being processed to the current session; and a dividing submodule, used, when the interval is judged not to exceed the predetermined interval, to assign the log currently being processed to the previous session.
Preferably, the session submodule comprises: a second judging submodule, used to judge that a session is closed when the current log analysis reference time is later than or equal to the end of the day in which the session falls; or a third judging submodule, used to judge that a session is closed when the interval between the last page-click time of the user ID corresponding to the session and the current log analysis reference time exceeds the predetermined interval, wherein the current log analysis reference time is the end time of the time period of the current interval.
Preferably, the session submodule further comprises: a first saving submodule, used to save the analysis result sets of closed sessions in a first set of tables in the database, wherein the first set of tables holds the analysis result sets of all closed sessions; a second saving submodule, used to save the analysis result sets of unclosed sessions in a second set of tables in the database, wherein the second set of tables holds the analysis result sets of all sessions not closed in the current interval period, and wherein some of the unclosed sessions become closed sessions in interval periods after the current interval period; and a dimension table updating submodule, used to identify the measures newly added in the clickstream log analysis of the current interval period and to update them into the dimension tables, wherein the newly added measures are used for the clickstream log analysis of the current interval period.
Preferably, the log analysis device further comprises: an acquiring unit, used to obtain the analysis results of the clickstream log analysis after the distributed, session-based clickstream log analysis has been performed on the collected log files at the predetermined interval period; and a generation unit, used to generate an analysis report from the obtained analysis results, wherein the analysis report is used to adjust the structure of the website corresponding to the log files according to the analysis results.
In this application, the collected log files are analyzed with the session as the unit once every predetermined period. This improves the timeliness of log file analysis while spreading the use of the resources of the analyzing system evenly across the predetermined periods of a day, which solves the prior-art problems that log files cannot be analyzed in real time and that system resources are used unevenly, thereby improving the timeliness of log analysis and using system resources reasonably and evenly.
Description of drawings
The accompanying drawings described here are provided for a further understanding of the application and form a part of the application. The illustrative embodiments of the application and their descriptions are used to explain the application and do not unduly limit it. In the drawings:
Fig. 1 is a preferred structural diagram of the log analysis system according to an embodiment of the application;
Fig. 2 is a preferred structural diagram of the log analysis device according to an embodiment of the application;
Fig. 3 is a preferred structural diagram of the analysis unit according to an embodiment of the application;
Fig. 4 is a preferred structural diagram of the decoding module according to an embodiment of the application;
Fig. 5 is a preferred structural diagram of the filtering submodule according to an embodiment of the application;
Fig. 6 is a preferred structural diagram of the analysis module according to an embodiment of the application;
Fig. 7 is a preferred structural diagram of the grouping submodule according to an embodiment of the application;
Fig. 8 is a preferred structural diagram of the session submodule according to an embodiment of the application;
Fig. 9 is another preferred structural diagram of the session submodule according to an embodiment of the application;
Fig. 10 is another preferred structural diagram of the log analysis device according to an embodiment of the application;
Fig. 11 is a preferred flowchart of the log analysis method according to an embodiment of the application;
Fig. 12 is another preferred flowchart of the log analysis method according to an embodiment of the application.
Embodiment
The application is described in detail below with reference to the accompanying drawings and in conjunction with the embodiments. It should be noted that, where no conflict arises, the embodiments in the application and the features of the embodiments may be combined with one another.
Before describing further details of the embodiments of the application, a suitable computing architecture that can be used to implement the principles of the application is described with reference to Fig. 1. In the following description, unless otherwise indicated, the embodiments of the application are described with reference to acts and symbolic representations of operations performed by one or more computers. It will thus be understood that such acts and operations, which are at times referred to as being computer-executed, include the manipulation by the processing unit of the computer of electrical signals representing data in a structured form. This manipulation transforms the data or maintains it at locations in the memory system of the computer, which reconfigures or otherwise alters the operation of the computer in a manner well understood by those skilled in the art. The data structures in which the data is maintained are physical locations of the memory that have particular properties defined by the format of the data. However, although the application is described in the foregoing context, this is not meant to be limiting, and as those skilled in the art will appreciate, aspects of the acts and operations described below may also be implemented in hardware.
Turning to the drawings, in which like reference numerals refer to like elements, the principles of the application are illustrated as being implemented in a suitable computing environment. The following description is based on the described embodiments of the application and should not be taken as limiting the application with regard to alternative embodiments that are not explicitly described herein.
Fig. 1 shows a schematic diagram of an example computer architecture that can be used for these devices. For descriptive purposes, the architecture depicted is only one example of a suitable environment and is not intended to suggest any limitation on the scope of use or functionality of the application. Neither should the computing system be interpreted as having any dependency or requirement relating to any one component, or any combination of components, shown in Fig. 1.
The principles of the application can operate with other general-purpose or special-purpose computing or communication environments or configurations. Examples of well-known computing systems, environments and configurations suitable for the application include, but are not limited to, personal computers, servers, multiprocessor systems, microprocessor-based systems, minicomputers, mainframe computers, and distributed computing environments that include any of the above systems or devices.
In its most basic configuration, the log analysis system 100 in Fig. 1 comprises at least: a website application server cluster 102, a website log server cluster 104, a log analysis server cluster 106 and one or more clients 108. The website application server cluster 102, the website log server cluster 104 and the log analysis server cluster 106 may each include, but are not limited to, a processing unit such as a microprocessor (MCU) or a programmable logic device (FPGA), a storage device for storing data, and a transmission device for communicating with clients. A client 108 may include a microprocessor (MCU), a transmission device for communicating with servers, and a display device for interacting with the user. In this description and in the claims, a "log analysis system" may also be defined as any hardware component, or combination of hardware components, capable of executing software, firmware or microcode to perform a function. The log analysis system 100 may even be distributed, to perform distributed functions.
As used in this application, the terms "module", "component" or "unit" may refer to software objects or routines that execute on the log analysis system 100. The different components, modules, units, engines and services described herein may be implemented as objects or processes that execute on the log analysis system 100 (for example, as separate threads). Although the systems and methods described herein are preferably implemented in software, implementations in hardware or in a combination of software and hardware are also possible and contemplated.
As shown in Fig. 1, the log analysis system 100 comprises: a website application server cluster 102, a website log server cluster 104, a log analysis server cluster 106 and one or more clients 108. In operation, a client 108 opens a page of the website through the user's browser; the website application server cluster 102 responds to the access request of the client 108; the user's browser on the client 108 receives the response returned by the website application server cluster 102 and sends a request to the website log server cluster 104; the website log server cluster 104 records the user's request log; and the log analysis server cluster 106 collects the logs recorded by the website log server cluster 104 and performs clickstream log analysis on the collected logs, with the session as the unit, at a predetermined interval period. Further, the session in the above embodiment may be a session between the client 108 and the website log server cluster 104, so that clickstream log analysis is performed on the log files with the session as the unit, which improves the efficiency and accuracy of the clickstream log analysis.
In the above preferred embodiment, clickstream log analysis is performed on the collected log files, with the session as the unit, at a predetermined interval period. This improves the timeliness of log file analysis while spreading the use of the resources of the analyzing system evenly across the predetermined periods of a day, thereby improving the timeliness of log analysis and using system resources reasonably and evenly. In addition, an analysis report can be generated from the analysis results, so that the structure of the website corresponding to the log files can be adjusted according to the analysis results, which increases the practical value of the analysis results.
In the following embodiments, communication may be implemented by a wireless connection, a wired connection, or a combination of the two; the application places no restriction on this.
Embodiment 1
Based on the above preferred embodiment, the application provides a preferred log analysis device in order to improve the timeliness of log analysis and to use system resources reasonably and evenly. Preferably, the log analysis device of this embodiment may be located in the log analysis server cluster 106 in Fig. 1. Specifically, as shown in Fig. 2, the log analysis device comprises: a collecting unit 202, used to collect the log files generated by the website log server cluster; and an analysis unit 204, in communication with the collecting unit 202, used to perform distributed, session-based clickstream log analysis on the collected log files at a predetermined interval period, wherein the interval period allows the resources of the system that analyzes the log files to be used evenly over a day.
In the above preferred embodiment, clickstream log analysis is performed on the collected log files, with the session as the unit, at a predetermined interval period. This improves the timeliness of log file analysis while spreading the use of the resources of the analyzing system evenly across the predetermined periods of a day, which solves the prior-art problems that log files cannot be analyzed in real time and that system resources cannot be used evenly, thereby improving the timeliness of log analysis and using system resources reasonably and evenly.
Further, on the basis of the above embodiments, the interval period in the application may be, but is not limited to, 1 hour; depending on requirements it may also be 30 minutes, 2 hours and so on, so as to improve the timeliness of log analysis. In addition, when an error occurs in a log file or in an analysis result, only the interval period in which the error occurred needs to be reprocessed or reanalyzed, which reduces the workload.
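As a rough illustration of interval-based processing, the sketch below maps click times to the interval period they fall in, so that a single failed period can be reprocessed on its own instead of the whole day. The function names and the choice to align periods to midnight are assumptions made for this example.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from typing import Dict, Iterable, List


def window_start(ts: datetime, period: timedelta) -> datetime:
    """Return the start of the interval period (aligned to midnight) containing ts."""
    midnight = ts.replace(hour=0, minute=0, second=0, microsecond=0)
    elapsed = ts - midnight
    # timedelta // timedelta gives the number of whole periods elapsed today.
    return midnight + (elapsed // period) * period


def group_by_window(timestamps: Iterable[datetime], period: timedelta) -> Dict[datetime, List[datetime]]:
    """Bucket timestamps by interval period."""
    buckets: Dict[datetime, List[datetime]] = defaultdict(list)
    for ts in timestamps:
        buckets[window_start(ts, period)].append(ts)
    return dict(buckets)


# Hypothetical usage with three click times and a 1-hour period.
clicks = [
    datetime(2011, 12, 23, 10, 47, 5),
    datetime(2011, 12, 23, 10, 5, 0),
    datetime(2011, 12, 23, 11, 0, 0),
]
hourly = group_by_window(clicks, timedelta(hours=1))
# If only the 10:00 period failed, only hourly[datetime(2011, 12, 23, 10)] is rerun.
```

Changing `period` to `timedelta(minutes=30)` or `timedelta(hours=2)` gives the other interval lengths mentioned above.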
On the basis of the various embodiments described above, the application provides a kind of preferred analytic unit 204, in order to improve analysis efficiency, improves precision of analysis.To achieve these goals, particularly, as shown in Figure 3, above-mentioned analytic unit 204 comprises: decoder module 302, be used for the journal file that gathers is decoded, remove error log and invalid daily record from decoded journal file, and the journal format that will remove the daily record in error log and invalid daily record journal file afterwards converts unified journal format to; Analysis module 304 is communicated by letter with decoder module 302, is used for take session as unit, the journal file that converts unified journal format to being carried out based on distributed click steam log analysis the library file to be entered of output fact table.In the present embodiment, remove error log and invalid daily record from decoded journal file, in order to journal file is analyzed, thereby improved the accuracy of analyzing; In addition, take session as unit, the journal file of unified journal format is analyzed, improved analysis efficiency.
On the basis of the above embodiment, the application provides a preferred decoding module 302 in order to improve analysis efficiency and the accuracy of the analysis. Specifically, as shown in Figure 4, the decoding module 302 comprises: a reading submodule 402, configured to read the log entries of the collected log files line by line; a decomposition submodule 404, in communication with the reading submodule 402, configured to decompose the read log entries into fields according to the field decomposition rule selected by the log source identifier, and to remove erroneous log entries that do not conform to the field decomposition rule; a filtering submodule 406, in communication with the decomposition submodule 404, configured to filter out, according to the configured filtering rules and the field values obtained by decomposition, log entries caused by non-human access and log entries of company intranet access; an output submodule 408, in communication with the filtering submodule 406, configured to convert the log entries of the filtered log files into the configured unified log output format; and a sorting submodule 410, in communication with the output submodule 408, configured to sort the output log files by business according to the configured business types, log entries of different businesses being output to different paths. In this embodiment, log entries caused by non-human access and by company intranet access are filtered out according to the configured filtering rules and the decomposed field values, and the analysis is performed on the log files from which those entries have been removed, which improves the accuracy of the analysis and the reference value of the analysis result. In addition, the log entries in the filtered log files are converted to a unified output format and the output log files are sorted according to the configured business types, so that log files collected from different collection sources are sorted and output in a unified log output format, which helps improve analysis efficiency and the accuracy of the analysis.
Further, the log entries caused by non-human access and by company intranet access in the above embodiment may be, for example, entries produced by crawlers, by access from internal company staff for testing purposes, or by iFrame floating frames. In website log analysis, cleaning out such invalid accesses improves the accuracy of the log analysis.
Whatever the format of a log entry, its field parameters can always be extracted according to its fixed format. Suppose visit_ip is the IP address of the user's access; for log entries in the two formats above, the IP address of the user's access can be extracted and stored in the $(visit_ip) parameter, and the other parameters are extracted in the same way and stored in a KEY=VALUE structure. For example, the fields defined for log decoding may take the form shown in Table 1.
Table 1
| Field parameter | Field meaning |
| $(visit_ip) | IP address of the user's access |
| $(visit_time) | Time of the user's click request |
| $(visit_zone) | Time zone of the user's click request |
| $(http_method) | HTTP method |
| $(http_version) | HTTP protocol version |
| $(http_code) | HTTP response code |
| $(http_flow) | HTTP traffic |
| $(entry_url) | URL of the current request |
| $(entry_query) | QUERY of the current request URL |
| $(refer_url) | URL of the previous-hop request |
| $(refer_query) | QUERY of the previous-hop request URL |
| $(agent_info) | Browser characteristics |
| $(cookie_id) | COOKIE_ID identifier |
| $(newcookie_flag) | New COOKIE_ID flag |
| $(a_cookie) | COOKIE A field |
| $(b_cookie) | COOKIE B field |
| $(c_cookie) | COOKIE C field |
Further, in the above embodiment, the log files are converted to a unified format. Because the applications at the collection sources differ, log file formats may differ slightly: the order of the fields may be adjusted, or some fields may be missing. In general, a standard log file contains: the IP address of the user's access, the access time, the URL currently accessed (ENTRY URL), the HTTP code, the HTTP traffic, the HTTP response time, the URL of the previous hop (REFER URL), the browser characteristics, and the unique identifier of the user's access, i.e. the COOKIE ID. If the URL of the previous hop is empty, i.e. does not exist, its value is "-". For example, an Apache cookie log file is as follows:
The following is a typical Beacon log file. It contains information such as the ENTRY URL and REFER URL of the user's access, encoded with BASE64 and placed after the question mark of the URL:
On the basis of the above embodiment, the application provides a preferred filtering submodule 406 in order to improve the filtering effect. Specifically, as shown in Figure 5, the filtering submodule 406 comprises: a rule parsing submodule 502, configured to load the filtering rules and initialize the filter functions; and a rule judging submodule 504, in communication with the rule parsing submodule 502, configured to judge, according to the field values obtained by decomposing a log entry, whether the entry satisfies one of the loaded filtering rules. In this embodiment, whether a log entry satisfies a filtering rule is judged from its decomposed field values, and entries that satisfy a filtering rule are filtered out, which improves the filtering effect.
Further, after a log entry is decomposed, each of its fields has a definite value. If the decomposed log entry satisfies any one of the filtering rules, the entry is filtered out. The field names in Table 1 can be used directly as variables in filtering rules. Filtering rules support basic operations such as arithmetic, logical, relational, and grouping operations, with the same operator precedence as in ANSI C++; they support integer and string constants, and a single filtering rule may use several variables. In addition, some common string functions are built into the filtering rules, for example llike, rlike, strstr, stristr, strlen, regex, and atoi, where llike and rlike are the left-match and right-match string functions respectively, and the other functions are defined similarly to the ANSI C++ functions of the same name, as shown in Table 2.
Table 2
For example, the filtering rule with rule ID 111001 filters all log entries whose COOKIE ID is empty; its priority "1" is the highest priority, and the filtered entries are output to the filter log file with filter code F110. The filtering rule with rule ID 121001 filters log entries whose user access IP address is "127.0.0.1" or begins with "172.16."; its priority "2" is lower than that of rule 111001, and the filtered entries are output to the filter log file with filter code F210. The filtering rule with rule ID 130101 filters log entries of GOOGLE bot access, i.e. entries whose browser characteristics contain the string Googlebot; its priority "3" is lower than that of rule 121001, and the filtered entries are output to the filter log file with filter code F501. In addition, if a log entry satisfies several filtering rules at the same time, the rule with the highest priority is applied first; if the priorities are equal, the rule with the smallest rule ID is applied first.
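The rule selection described above can be illustrated with a minimal Python sketch. It is not the rule language of the application: the three example rules are written directly as Python predicates, and the names FilterRule, RULES, and select_rule are assumptions introduced for illustration. Treating an empty string as an "empty" COOKIE ID is likewise an assumption.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

Log = Dict[str, str]  # decoded log entry: Table 1 field name -> value


@dataclass(frozen=True)
class FilterRule:
    rule_id: int
    priority: int                   # 1 is the highest priority
    filter_code: str                # filter log file that receives the entry
    matches: Callable[[Log], bool]  # hypothetical stand-in for a parsed rule


# The three example rules from the text, expressed as predicates.
RULES: List[FilterRule] = [
    FilterRule(111001, 1, "F110",
               lambda log: log.get("cookie_id", "") == ""),
    FilterRule(121001, 2, "F210",
               lambda log: log.get("visit_ip", "") == "127.0.0.1"
               or log.get("visit_ip", "").startswith("172.16.")),
    FilterRule(130101, 3, "F501",
               lambda log: "Googlebot" in log.get("agent_info", "")),
]


def select_rule(log: Log, rules: List[FilterRule] = RULES) -> Optional[FilterRule]:
    """Return the rule applied to a log entry, or None if the entry is kept."""
    hits = [rule for rule in rules if rule.matches(log)]
    if not hits:
        return None
    # Highest priority first (smallest number); ties go to the smallest rule ID.
    return min(hits, key=lambda rule: (rule.priority, rule.rule_id))
```

An entry that satisfies all three rules is routed to F110, because rule 111001 has the highest priority.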
Further, in the above embodiment the log entries of the filtered log files are converted to a unified output format, and the output log files are sorted according to the configured business types, so that log files collected from different collection sources are sorted and output in a unified log output format. The collected log files may come from a website log server cluster that handles log collection for several businesses at once. For example, the website log server cluster of Alibaba B2B has 10 log collection servers and simultaneously collects the logs of several websites such as the Chinese site, the international site, and Ali Finance. During log analysis, the centrally collected logs are sorted by website. Because the URL resources offered by each website differ, the centrally collected log files can be sorted according to the $(entry_url) field, so that the log files of each website can be analyzed independently, which helps improve the accuracy of the analysis result.
Further, some field-level conversions are applied to the sorted log files, such as removing anchors and deleting repeated parameters, and the result is then output in the unified format. The output format can be set through parameter configuration and generally uses the following format configuration:
For the analysis, decoding hides the original format of the logs and any later format changes, and filters out junk and unnecessary log entries.
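Sorting by website on the $(entry_url) field might look like the following Python sketch. The host names and site labels in SITE_BY_HOST are hypothetical placeholders, not the real hosts of any website, and sort_by_site is an illustrative name. A real deployment would write each bucket to its own output path.

```python
from collections import defaultdict
from typing import Dict, List
from urllib.parse import urlsplit

# Hypothetical host-to-site mapping, for illustration only.
SITE_BY_HOST = {
    "china.example.com": "china",
    "intl.example.com": "international",
}


def sort_by_site(logs: List[Dict[str, str]],
                 site_by_host: Dict[str, str] = SITE_BY_HOST,
                 default: str = "other") -> Dict[str, List[Dict[str, str]]]:
    """Group decoded log entries by the website that served entry_url."""
    buckets: Dict[str, List[Dict[str, str]]] = defaultdict(list)
    for log in logs:
        host = urlsplit(log["entry_url"]).hostname or ""
        buckets[site_by_host.get(host, default)].append(log)
    return dict(buckets)
```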
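The field-level conversions mentioned above (removing anchors and deleting repeated parameters) can be sketched in Python with the standard urllib.parse module. The function name normalize_url is hypothetical. Keeping the first occurrence of a repeated parameter is an assumption, since the text does not say which occurrence is kept.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def normalize_url(url: str) -> str:
    """Drop the anchor (fragment) and repeated query parameters from a URL."""
    parts = urlsplit(url)
    seen = set()
    unique = []
    for key, value in parse_qsl(parts.query, keep_blank_values=True):
        if key not in seen:  # assumption: the first occurrence wins
            seen.add(key)
            unique.append((key, value))
    # An empty fragment removes the "#anchor" part of the URL.
    return urlunsplit((parts.scheme, parts.netloc, parts.path,
                       urlencode(unique), ""))
```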
On the basis of the above preferred embodiments, the application provides a preferred analysis module 304 in order to improve the timeliness of the analysis. Specifically, as shown in Figure 6, the analysis module 304 comprises: an obtaining submodule 602, configured to obtain the log files converted into the unified log format together with the log entries of sessions left unclosed in the previous interval period; a grouping submodule 604, in communication with the obtaining submodule 602, configured to put the obtained log entries that belong to the same user identifier into one group and, within each group, to put adjacent log entries whose page click times are no more than a predetermined interval apart into the same session; and a session submodule 606, in communication with the grouping submodule 604, configured to judge whether each resulting session is closed. If a session is closed, the log entries in the closed session are analyzed. If a session is not closed, the log entries in it are analyzed with the same analysis mechanism as for closed sessions, the analysis results of unclosed sessions are marked in the output to distinguish them from those of closed sessions, and a separate copy of the log entries of the unclosed session is saved for analysis in the next interval period. In this embodiment, the log files are grouped by user identifier, the log entries in each group are divided into sessions according to the predetermined interval, and the log files are analyzed with the session as the unit, which helps improve analysis efficiency and the accuracy of the analysis result. In addition, log entries in closed sessions are analyzed immediately, while log entries in unclosed sessions are analyzed in the next interval period, so that closed sessions are analyzed in real time and the timeliness of the analysis is improved.
On the basis of the above embodiment, the application also provides a preferred grouping submodule 604 in order to improve analysis efficiency. Specifically, as shown in Figure 7, the grouping submodule 604 comprises: an ordering submodule 702, configured to order the log entries in each group by page click time; a first judging submodule 704, in communication with the ordering submodule 702, configured to process the ordered log entries of each group in order, group by group, and to judge whether the interval between the page click time of the log entry currently being processed and that of the previous log entry in the group exceeds the predetermined interval, wherein the previous log entry has been assigned to the previous session; a creating submodule 706, in communication with the first judging submodule 704, configured, when the interval exceeds the predetermined interval, to create a current session, allocate a globally unique session identifier to the current session, and assign the log entry being processed to the current session; and a dividing submodule 708, in communication with the creating submodule 706, configured, when the interval does not exceed the predetermined interval, to assign the log entry being processed to the previous session. In this embodiment, after the log entries in each group are ordered by page click time, they are divided into sessions according to the predetermined interval and each session is allocated a globally unique session identifier, which facilitates the analysis and improves analysis efficiency. Moreover, assigning log entries that do not exceed the predetermined interval to the previous session divides the log entries accurately into sessions and improves the accuracy of the analysis.
Preferably, the creating submodule 706 in the above preferred embodiment may allocate a globally unique session identifier to the current session as follows: session identifiers are allocated by day, wherein session identifiers allocated on different dates may be the same or different; and session identifiers are allocated dynamically during clickstream analysis, wherein the uniqueness of a session identifier does not depend on the length of the interval period or on the number of distributed computing nodes. In this way the session identifiers allocated on a given day are unique, while those allocated on different dates may be the same or different, which enhances the operability of the application; in addition, allocating globally unique session identifiers dynamically during the analysis helps improve analysis efficiency.
Further, the session identifier in the above embodiment could be generated at the browser end from the time, the browser characteristics, and the IP address or host name of the accessed host, plus a pseudo-random number, to ensure a high degree of uniqueness; that approach is also closer to the user's access behavior. Given the different scenarios in which a session can close, however, the approach taken in the application is that the front end does not allocate session identifiers. Instead, during log analysis the user's access track on the website front end is restored according to the session concept, and session identifiers are allocated at that point. The benefit of this allocation method is that session identifiers are guaranteed to be fully unique and that the generation of a session identifier and the end of a session are tied to the hour. When data is reprocessed, it is therefore easy to clear from the database the data whose session identifiers fall within the time period to be reprocessed and then load the data again.
Session identifiers also play an extremely important role in the whole log analysis process: every log analysis result contains a session identifier field, session identifiers can be used to join tables, and each session identifier is required to be unique within one day. The session identifier allocation method of the application is described here with allocation by the hour as an example, but it is not limited to this. When session identifiers are allocated by the hour, assuming 100000 PageViews per second, then for example when the data for 0:00 to 1:00 is computed, the session identifiers from 0 to 100000 × 3600 are allocated; these identifiers are further divided among the computing nodes, so that the session identifiers allocated by each node of the distributed computation are independent.
Taking an interval period of 1 hour as an example:
MAX_PER_CLICK = maximum number of clicks on the website per second (assumed to be 100000);
SECOND_PER_HOUR = 3600 (the number of seconds in one hour, 60 minutes × 60 seconds);
The maximum number of IDs available in one hour is then:
MAX_IDS_PER_HOUR=SECOND_PER_HOUR×MAX_PER_CLICK;
Further suppose the current number of computing nodes is:
TTL_NODE_UNIT = number of computing nodes (assuming 8 computing nodes, the value is 8)
The number of IDs available to each computing node in one hour is then:
MAX_NODE_IDS=MAX_IDS_PER_HOUR/TTL_NODE_UNIT;
The process of allocating session identifiers by the hour is described in detail below with reference to a specific computing node and a specific analysis hour:
CUR_NODE_ID = number of the current computing node (assuming 8 computing nodes, the value range is [0-7])
CUR_HOUR_UV = the hour currently being analyzed (for example 8 o'clock or 10 o'clock), value range [0-23]
The starting SESSION_ID allocated to this computing node for this hour is:
MIN_SESSION_ID=MAX_IDS_PER_HOUR×CUR_HOUR_UV+MAX_NODE_IDS×CUR_NODE_ID;
The ending SESSION_ID is:
MAX_SESSION_ID=MIN_SESSION_ID+MAX_NODE_IDS;
The above SESSION_ID allocation algorithm guarantees that, for a given hour HH, the ID allocation space is:
[HH×MAX_IDS_PER_HOUR,(HH+1)×MAX_IDS_PER_HOUR]
Every log file analysis result carries a SESSION_ID field, so when the data of a particular hour needs to be reprocessed, the results whose SESSION_ID falls within the range corresponding to that hour can simply be cleared.
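The allocation formulas above can be checked with a short Python sketch that reuses the constant names from the text. The function session_id_range is a hypothetical helper, and treating the returned range as half-open (start included, end excluded) is an assumption made so that adjacent nodes and hours do not share an identifier.

```python
MAX_PER_CLICK = 100_000      # assumed maximum clicks per second
SECOND_PER_HOUR = 3600       # 60 minutes x 60 seconds
MAX_IDS_PER_HOUR = SECOND_PER_HOUR * MAX_PER_CLICK
TTL_NODE_UNIT = 8            # assumed number of computing nodes
MAX_NODE_IDS = MAX_IDS_PER_HOUR // TTL_NODE_UNIT


def session_id_range(cur_hour_uv: int, cur_node_id: int) -> tuple:
    """Return (MIN_SESSION_ID, MAX_SESSION_ID) for one node and one hour."""
    if not 0 <= cur_hour_uv <= 23:
        raise ValueError("hour must be in [0, 23]")
    if not 0 <= cur_node_id < TTL_NODE_UNIT:
        raise ValueError("node number out of range")
    min_session_id = (MAX_IDS_PER_HOUR * cur_hour_uv
                      + MAX_NODE_IDS * cur_node_id)
    max_session_id = min_session_id + MAX_NODE_IDS
    return min_session_id, max_session_id
```

With these assumed values, each node receives 45,000,000 identifiers per hour, and the range of the last node in an hour ends exactly where the range of the first node in the next hour begins.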
Further, on the basis of the above embodiment, under a distributed computing framework all PageViews of a session must be analyzed on the same computing node. Therefore, during analysis the log files are first grouped by COOKIE_ID (user identifier) to ensure that all log entries with the same COOKIE_ID are stored contiguously. The grouping comprises the following steps:
Step 1: determine the number of computing nodes of the distributed computing framework, assumed to be n;
Step 2: group the log files by COOKIE_ID into n groups, so that log entries with the same COOKIE_ID are stored contiguously and are analyzed on the same computing node;
Step 3: sort the log entries of each COOKIE_ID in each group obtained in Step 2 in ascending order of click time.
In addition, to limit the load on the website server, the time in a log file is generally accurate only to the second, and log entries clicked within the same second can be ordered as follows:
1) entries whose REFER URL is "-" are placed first;
2) entries whose REFER URL is off-site are placed first; to decide whether a URL is on-site or off-site, first check by domain whether it is on-site; otherwise consider special cases such as top-level shops: if the URL has appeared as an ENTRY URL, it is judged to be on-site, otherwise off-site;
3) for two log entries A and B, if the REFER URL of A equals the ENTRY URL of B, B is placed first.
The above ordering is implemented by sorting on ENTRY URL, which ensures that repeated runs give consistent results.
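Steps 1 to 3 above can be sketched in Python as follows. The text does not say how a COOKIE_ID is mapped to one of the n groups, so the CRC32 hash used here is an assumption, as are the function name partition_by_cookie and the representation of visit_time as integer seconds. The tie-breaking rules for clicks within the same second are not implemented in this sketch.

```python
import zlib
from typing import Dict, List

Log = Dict[str, object]  # needs "cookie_id" (str) and "visit_time" (int seconds)


def partition_by_cookie(logs: List[Log], n: int) -> List[List[Log]]:
    """Split log entries into n groups so each COOKIE_ID stays in one group."""
    groups: List[List[Log]] = [[] for _ in range(n)]
    for log in logs:
        # A stable hash keeps every entry of one COOKIE_ID on the same node.
        node = zlib.crc32(str(log["cookie_id"]).encode("utf-8")) % n
        groups[node].append(log)
    for group in groups:
        # Entries of one COOKIE_ID become contiguous, in ascending click time.
        group.sort(key=lambda log: (log["cookie_id"], log["visit_time"]))
    return groups
```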
The log files produced by the above ordering are ordered, and log entries with the same COOKIE_ID are guaranteed to be stored contiguously. When SESSIONs (sessions) are divided, the log records are read COOKIE_ID by COOKIE_ID. If the click time of the current record differs from that of the previous record by more than the predetermined interval, a new SESSION is generated; if the click time of the last record in the ordered log file differs from the hour currently being processed by more than the predetermined interval, a new SESSION is generated; and if the hour currently being processed is the last hour of the day, a new SESSION is generated. The way SESSIONs are divided is illustrated below with a predetermined interval of 30 minutes; 30 minutes is only a preferred example of the application, and the predetermined interval is not limited to this.
1) if the interval between two consecutive page clicks (PageViews) of the same visitor is no more than 30 minutes, the clicks are considered to belong to the same SESSION;
2) if a visitor has made no page click (PageView) for more than 30 minutes, the SESSION is considered closed;
3) log analysis takes the day as its unit, and accesses that span two days belong to different SESSIONs;
4) during log analysis each complete SESSION is allocated an independent SESSION_ID.
With the above session division, the contiguously stored log entries of the same COOKIE_ID are divided into sessions, and each session is allocated an independent SESSION_ID.
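Rules 1) to 4) above can be illustrated with a small Python sketch for the clicks of a single COOKIE_ID. The function name split_sessions and the simple counter used for SESSION_ID values are assumptions for illustration; in the application the identifiers would come from the per-node, per-hour range described earlier.

```python
from datetime import datetime, timedelta
from itertools import count
from typing import Iterable, Iterator, List, Optional, Tuple


def split_sessions(click_times: Iterable[datetime],
                   gap: timedelta = timedelta(minutes=30),
                   ids: Optional[Iterator[int]] = None
                   ) -> List[Tuple[int, List[datetime]]]:
    """Divide one visitor's click times into (session_id, clicks) pairs."""
    ids = ids if ids is not None else count(1)  # placeholder ID source
    sessions: List[Tuple[int, List[datetime]]] = []
    prev: Optional[datetime] = None
    for t in sorted(click_times):
        starts_new = (
            prev is None
            or t - prev > gap            # more than 30 minutes without a click
            or t.date() != prev.date()   # accesses across days are separate
        )
        if starts_new:
            sessions.append((next(ids), [t]))
        else:
            sessions[-1][1].append(t)
        prev = t
    return sessions
```

A gap of exactly 30 minutes keeps two clicks in the same session, matching the "no more than 30 minutes" wording of rule 1).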
On the basis of the above embodiment, the application also provides a preferred session submodule 606 in order to enhance the flexibility of use of the application. Specifically, as shown in Figure 8, the session submodule 606 comprises: a second judging submodule 802, in communication with the grouping submodule 604, configured to judge that a session is closed when the current log analysis reference time is later than or equal to the end of the day on which the session lies; or a third judging submodule 804, in communication with the grouping submodule 604, configured to judge that a session is closed when the interval between the current log analysis reference time and the last page click time of the user identifier corresponding to the session exceeds the predetermined interval, wherein the current log analysis reference time is the end time of the time period covered by the current interval period. In this embodiment, whether a session is closed is judged under different scenarios, which enhances the flexibility of use of the application.
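The two closing conditions can be expressed as a short Python sketch. The function name is_session_closed is hypothetical, and the default 30-minute interval is taken from the preferred example elsewhere in the text.

```python
from datetime import datetime, time, timedelta


def is_session_closed(last_click: datetime,
                      reference_time: datetime,
                      gap: timedelta = timedelta(minutes=30)) -> bool:
    """Judge whether a session is closed at the log analysis reference time.

    reference_time is the end of the time period covered by the current
    interval period; last_click is the session's last page click time.
    """
    # Condition of submodule 802: the reference time has reached the end of
    # the day on which the session lies.
    end_of_day = datetime.combine(last_click.date() + timedelta(days=1),
                                  time.min)
    if reference_time >= end_of_day:
        return True
    # Condition of submodule 804: no click for longer than the interval.
    return reference_time - last_click > gap
```

For an interval period ending at 11:00, a session whose last click was at 10:50 stays open and is carried into the next period, while one whose last click was at 10:20 is closed.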
On the basis of the above embodiments, the application further improves the session submodule 606 in order to improve analysis efficiency. Specifically, as shown in Figure 9, the session submodule 606 further comprises: a first saving submodule 902, in communication with the grouping submodule 604, configured to save the analysis result sets of closed sessions in a first set of tables in the database, wherein the first set of tables holds the analysis result sets of all closed sessions; a second saving submodule 904, in communication with the grouping submodule 604, configured to save the analysis result sets of unclosed sessions in a second set of tables in the database, wherein the second set of tables holds the analysis result sets of all sessions that are unclosed in the current interval period, and wherein some of the unclosed sessions become closed sessions in an interval period after the current one; and a dimension table updating submodule 906, in communication with the first saving submodule 902 and the second saving submodule 904, configured to identify the metrics newly added by the clickstream log analysis of the current interval period and to update them into the dimension tables, wherein the newly added metrics are used for the clickstream log analysis of the current interval period. In this embodiment, sessions are saved separately according to whether they are closed, so that closed and unclosed sessions are clearly distinguished and the log entries of closed sessions can be analyzed, which improves analysis efficiency and the accuracy of the analysis result. Moreover, the metrics newly added in the current interval period are identified and updated into the dimension tables in real time, that is, newly added metrics are analyzed promptly and effectively, which improves the timeliness and accuracy of the log analysis.
Further, on the basis of the above embodiment, when the first set of tables is used to save the analysis result sets of closed sessions, the required parameters can be extracted from the log entries corresponding to the sessions in the first set of tables, and a mapping can be established between the session identifier of each closed session and the extracted parameters. The log entries of closed sessions are then analyzed according to this mapping between session identifiers and extracted parameters, which improves analysis efficiency.
On the basis of the above preferred embodiments, the application improves the log analysis device in order to increase the practical value of log analysis. Specifically, as shown in Figure 10, the log analysis device further comprises: an acquiring unit 1002, configured to obtain the analysis result of the clickstream log analysis after distributed clickstream log analysis, with the session as the unit, has been performed on the collected log files at the predetermined interval period; and a generating unit 1004, in communication with the acquiring unit 1002, configured to generate an analysis report from the obtained analysis result, wherein the analysis report is used to adjust the structure of the website corresponding to the log files according to the analysis result. In this embodiment, the analysis result of the clickstream log analysis is turned into an analysis report, and the structure of the website corresponding to the log files is adjusted according to the information fed back by the report, so that the website structure better meets users' different service needs, which increases the practical value of log analysis.
Further, on the basis of the above preferred embodiments, the session may be the session between a client and the website, so that clickstream log analysis of the log files is carried out with the session as the unit, which improves the efficiency and accuracy of the clickstream log analysis.
Embodiment 2
On the basis of Figures 1 to 10, the application provides a preferred log analysis method in order to improve the timeliness of log analysis and to use system resources in a reasonable and balanced way. Specifically, as shown in Figure 11, the log analysis method comprises:
S1102: collect the log files generated by the website log server cluster;
S1104: perform distributed clickstream log analysis, with the session as the unit, on the collected log files at a predetermined interval period, wherein the interval period causes the resources of the system used to analyze the log files to be used evenly over a day.
In the above preferred embodiment, the collected log files are analyzed with the session as the unit at a predetermined interval period. This improves the timeliness of log file analysis while spreading the use of the analysis system's resources evenly over the interval periods of a day, and it solves the problems in the prior art that log files cannot be analyzed in real time and system resources cannot be used in a balanced way.
Further, on the basis of the above embodiments, the interval period in the application may be, but is not limited to, 1 hour; depending on requirements it may also be 30 minutes, 2 hours, and so on, so as to improve the timeliness of log analysis. Moreover, when an error occurs in a log file or an analysis result, only the interval period in which the error occurred needs to be reprocessed or reanalyzed, which reduces the workload.
On the basis of the above embodiments, the application provides a preferred method of performing distributed clickstream log analysis, with the session as the unit, on the collected log files at a predetermined interval period, in order to improve analysis efficiency and the timeliness and accuracy of the analysis result. Specifically, the method comprises: decoding the collected log files, removing erroneous and invalid log entries from the decoded log files, and converting the remaining log entries into a unified log format; and performing distributed clickstream log analysis, with the session as the unit, on the log files converted into the unified log format, and outputting the files to be loaded into the fact tables. In this embodiment, erroneous and invalid log entries are removed from the decoded log files before the log files are analyzed, which improves the accuracy of the analysis; in addition, analyzing the log files in the unified log format with the session as the unit improves analysis efficiency.
On the basis of the above embodiment, the application provides a preferred method of decoding the collected log files in order to improve analysis efficiency and the accuracy of the analysis. Specifically, the method of decoding the collected log files comprises: reading the log entries of the collected log files line by line; decomposing the read log entries into fields according to the field decomposition rule selected by the log source identifier, and removing erroneous log entries that do not conform to the field decomposition rule; filtering out, according to the configured filtering rules and the field values obtained by decomposition, log entries caused by non-human access and log entries of company intranet access; converting the log entries of the filtered log files into the configured unified log output format; and sorting the output log files by business according to the configured business types, log entries of different businesses being output to different paths. In this embodiment, log entries caused by non-human access and by company intranet access are filtered out according to the configured filtering rules and the decomposed field values, and the analysis is performed on the log files from which those entries have been removed, which improves the accuracy of the analysis and the reference value of the analysis result. In addition, the log entries in the filtered log files are converted to a unified output format and the output log files are sorted according to the configured business types, so that log files collected from different collection sources are sorted and output in a unified log output format, which helps improve analysis efficiency and the accuracy of the analysis.
Further, the log entries caused by non-human access and by company intranet access in the above embodiment may be, for example, entries produced by crawlers, by access from internal company staff for testing purposes, or by iFrame floating frames. In website log analysis, cleaning out such invalid accesses improves the accuracy of the log analysis.
Further, in the above embodiment, the log files are converted to a unified format. Because the applications at the collection sources differ, log file formats may differ slightly: the order of the fields may be adjusted, or some fields may be missing. In general, however, a standard log file contains: the IP address of the user's access, the access time, the URL currently accessed (ENTRY URL), the HTTP code, the HTTP traffic, the HTTP response time, the URL of the previous hop (REFER URL), the browser characteristics, and the unique identifier of the user's access, i.e. the COOKIE ID. If the URL of the previous hop is empty, i.e. does not exist, its value is "-". For example, an Apache cookie log file is as follows:
The following is a typical Beacon log file. It contains information such as the ENTRY URL and REFER URL of the user's access, encoded with BASE64 and placed after the question mark of the URL:
Whatever the format of a log entry, however, its field parameters can always be extracted according to its fixed format. Suppose visit_ip is the IP address of the user's access; for log entries in the two formats above, the IP address of the user's access can be extracted and stored in the $(visit_ip) parameter, and the other parameters are extracted in the same way and stored in a KEY=VALUE structure. For example, the fields defined for log decoding may take the form shown in Table 1.
Table 1
| Field parameter | Field meaning |
| $(visit_ip) | IP address of the user's access |
| $(visit_time) | Time of the user's click request |
| $(visit_zone) | Time zone of the user's click request |
| $(http_method) | HTTP method |
| $(http_version) | HTTP protocol version |
| $(http_code) | HTTP response code |
| $(http_flow) | HTTP traffic |
| $(entry_url) | URL of the current request |
| $(entry_query) | QUERY of the current request URL |
| $(refer_url) | URL of the previous-hop request |
| $(refer_query) | QUERY of the previous-hop request URL |
| $(agent_info) | Browser characteristics |
| $(cookie_id) | COOKIE_ID identifier |
| $(newcookie_flag) | New COOKIE_ID flag |
| $(a_cookie) | COOKIE A field |
| $(b_cookie) | COOKIE B field |
| $(c_cookie) | COOKIE C field |
On the basis of the above embodiment, the application provides a preferred method for decomposing the fields of each log read, according to the field decomposition rule selected by the configured log source identifier, and removing error logs that do not conform to the field decomposition rule, so as to improve the filtering effect. Specifically, the method includes: loading the filtering rules and initializing the filter functions; and judging, from the field values obtained by decomposing the log, whether the log matches one of the loaded filtering rules. In this embodiment, whether a log matches a filtering rule is judged from its decomposed field values, and logs that match a filtering rule are filtered out, which improves the filtering effect.
Further, after a log is decomposed, each of its fields has a definite value. If the decomposed log matches any of the filtering rules, the log is filtered out. The field names in Table 1 can be used directly as variables in filtering rules. Filtering rules support basic operations such as arithmetic, logical, relational, and combination operations, with operator precedence equal to that of the operators in ANSI C++; they support integer and string constants, and a single rule may contain several variables. Some common string functions are also built in, for example llike, rlike, strstr, stnstr, strlen, regex, and atoi, where llike and rlike are the left-match and right-match functions for strings, respectively, and the other functions are defined similarly to the ANSI C++ functions of the same names, as shown in Table 2.
Table 2
For example, the filtering rule with rule identifier 111001 filters all logs whose COOKIE ID is empty. Its priority is "1", the highest priority, and the filtered logs are output to the filtered-log file with filter code F110. The filtering rule with rule identifier 121001 filters logs whose user access IP address is "127.0.0.1" or begins with "172.16."; its priority is "2", lower than that of rule 111001, and the filtered logs are output to the filtered-log file with filter code F210. The filtering rule with rule identifier 130101 filters logs of GOOGLE bot access, that is, logs whose browser characteristics contain the string Googlebot; its priority is "3", lower than that of rule 121001, and the filtered logs are output to the filtered-log file with filter code F501. In addition, if a log satisfies several filtering rules at once, the rule with the highest filtering priority is applied first; if the priorities are equal, the rule with the smallest rule ID is applied first.
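The rule selection described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the `FilterRule` class, the `select_rule` function, and the use of Python predicates in place of the rule expression language are hypothetical and are not part of the application; the rule identifiers, priorities, and filter codes are the ones given in the example.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class FilterRule:
    rule_id: int                     # e.g. 111001
    priority: int                    # 1 is the highest priority
    filter_code: str                 # names the filtered-log output file
    matches: Callable[[dict], bool]  # predicate over the decoded fields


# The three example rules, written as predicates (an assumption; the
# application describes an expression language with llike, strstr, etc.).
RULES = [
    FilterRule(111001, 1, "F110", lambda f: f.get("cookie_id", "") == ""),
    FilterRule(121001, 2, "F210",
               lambda f: f.get("visit_ip", "") == "127.0.0.1"
               or f.get("visit_ip", "").startswith("172.16.")),
    FilterRule(130101, 3, "F501",
               lambda f: "Googlebot" in f.get("agent_info", "")),
]


def select_rule(fields: dict, rules=RULES) -> Optional[FilterRule]:
    """Return the rule to apply to a log, or None if the log is kept."""
    hits = [r for r in rules if r.matches(fields)]
    if not hits:
        return None
    # Highest priority (smallest number) wins; ties go to the smallest rule ID.
    return min(hits, key=lambda r: (r.priority, r.rule_id))
```

A log that matches several rules is therefore written only to the output file of the winning rule, which keeps each filtered log in exactly one filtered-log file.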
Further, in the above embodiment, the logs in the filtered log file are unified to a single log output format, and the output log files are sorted according to the configured business types, so that log files collected from different collection sources are sorted and output in a unified log output format. The collected log files may come from a website log server cluster that handles log collection for several businesses at once. For example, the website log server cluster of Alibaba B2B has 10 log collection servers and collects logs for several websites at the same time, such as the Chinese site, the international site, and Ali Finance. During log analysis, the centrally collected logs are sorted by website. Because each website serves different URL resources, the centrally collected log files can be sorted on the $(entry_url) field, so that the log files of each website can be analyzed independently, which helps improve the accuracy of the analysis results.
Further, some field-level conversions are applied to the sorted log files, such as removing anchors and deleting repeated parameters, and the result is then output in the unified format. The output format can be set through parameter configuration and generally adopts the following format configuration:
Decoding shields the analysis from the raw log format and from possible later format changes, and filters out garbage and unnecessary logs.
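The field-level conversions mentioned for the sorted log files (removing anchors and deleting repeated parameters) might look like the following Python sketch. The function name `normalize_url` is hypothetical, and keeping the first occurrence of a repeated parameter is an assumption; the application does not say which occurrence is kept.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def normalize_url(url: str) -> str:
    """Drop the #anchor and keep only the first occurrence of each query parameter."""
    parts = urlsplit(url)
    seen = set()
    deduped = []
    for name, value in parse_qsl(parts.query, keep_blank_values=True):
        if name not in seen:  # assumption: the first occurrence wins
            seen.add(name)
            deduped.append((name, value))
    # An empty fragment removes the anchor from the rebuilt URL.
    return urlunsplit((parts.scheme, parts.netloc, parts.path,
                       urlencode(deduped), ""))
```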
On the basis of the above preferred embodiments, the application provides a preferred method for performing session-based distributed clickstream log analysis on the log files converted into the unified log format, so as to improve the timeliness of the analysis. Specifically, the method includes: obtaining the log files converted into the unified log format together with the log files of sessions left unclosed in the previous interval period; dividing the logs that belong to the same user identifier into one group, and, within each group, dividing adjacent logs whose web page click times are no more than a predetermined interval apart into the same session; and judging whether each session obtained by the division is closed. If a session is closed, clickstream log analysis is performed on the log files in that session. If a session is not closed, its log files are analyzed by the same mechanism used for closed sessions, an output flag distinguishes the analysis results of unclosed sessions from those of closed sessions, and a separate copy of the unclosed session's logs is saved for analysis in the next interval period. In this embodiment, log files are grouped by user identifier, the log files in each group are divided into sessions according to the predetermined interval, and the analysis is performed in units of sessions, which helps improve both the efficiency of the analysis and the accuracy of its results. In addition, the log files of closed sessions are analyzed at once, while those of unclosed sessions are analyzed again in the next interval period, so that the log files of closed sessions are analyzed in real time and the timeliness of the analysis is improved.
On the basis of the above embodiment, the application also provides a preferred method for dividing adjacent logs in each group whose web page click times are no more than the predetermined interval apart into the same session, so as to improve analysis efficiency. Specifically, the method includes: sorting the log files in each group by web page click time; then, group by group and in sorted order, performing the following steps on each log file: judging whether the interval between the web page click time of the log file currently being processed and that of the previous log file in the same group exceeds the predetermined interval, where the previous log file has already been assigned to the previous session; if the interval exceeds the predetermined interval, creating a current session, allocating a globally unique session identifier to it, and assigning the log file currently being processed to the current session; if the interval does not exceed the predetermined interval, assigning the log file currently being processed to the previous session. In this embodiment, after the log files in each group are sorted by web page click time, they are divided into sessions according to the predetermined interval, and each session is allocated a globally unique session identifier, which facilitates the analysis of the log files and improves analysis efficiency. At the same time, log files that do not exceed the predetermined interval are assigned to the previous session, so that log files are divided into sessions accurately and the accuracy of the analysis is improved.
On the basis of the above embodiment, the application also provides a preferred method for allocating a globally unique session identifier to the current session, so as to improve the efficiency of the analysis. Specifically, the method includes: allocating the globally unique session identifier to the current session by day, where session identifiers allocated on different dates may be the same or different; and dynamically allocating the globally unique session identifier to the current session during clickstream analysis, where the uniqueness of the session identifier is independent of the length of the interval period and of the number of distributed computing nodes. In this embodiment, session identifiers are unique within a day, while those allocated on different dates may be the same or different, which makes the application easier to operate. In addition, dynamically allocating the globally unique session identifier to the current session during the analysis helps improve the efficiency of the analysis.
Further, the session identifier in the above embodiment could be generated at the browser end, based on the time, the browser characteristics, and the IP address or host name of the accessed host, plus a pseudo-random number, to make the session identifier highly unique; this approach also follows the user's access behavior more closely. However, given the different scenarios in which a session closes, the approach taken by the application is that the front end does not allocate session identifiers. Instead, during log analysis the user's access track on the website front end is reconstructed according to the session concept, and the session identifier is allocated at that point. This way of allocating session identifiers guarantees that they are completely unique, and ties the generation of a session identifier to the hour in which the session ends. When data is reprocessed, it is therefore easy to clear the database data by session identifier: the data whose session identifiers fall in the time period to be reprocessed is cleared, and the data is loaded again.
Session identifiers also play a very important role in the whole log analysis process. All log analysis results contain a session identifier field, which can be used to associate tables with one another, and each session identifier is required to be unique within a day. The session identifier allocation method of the application is described taking division by the hour as an example, but is not limited to this. When session identifiers are allocated by the hour, assuming 100000 PageViews per second, the data for 0:00 to 1:00, for example, is allocated session identifiers 0 to 100000 × 3600. These session identifiers are further divided among the computing nodes, so that the session identifiers allocated by each node of the distributed computation are independent of one another.
Taking 1 hour as an example:
MAX_PER_CLICK = maximum click volume of the website per second (assumed to be 100000);
SECOND_PER_HOUR = 3600 (number of seconds in one hour, 60 minutes × 60 seconds);
The maximum number of IDs available in one hour is then:
MAX_IDS_PER_HOUR = SECOND_PER_HOUR × MAX_PER_CLICK;
Further suppose the current number of computing nodes is:
TTL_NODE_UNIT = number of computing nodes (assuming 8 computing nodes, the value is 8);
The number of IDs available to each computing node in one hour is then:
MAX_NODE_IDS = MAX_IDS_PER_HOUR / TTL_NODE_UNIT;
The process of allocating session identifiers by the hour is described in detail below for a specific computing node and a specific analysis hour:
CUR_NODE_ID = number of the current computing node (assuming 8 computing nodes, the value range is [0-7]);
CUR_HOUR_UV = hour currently being analyzed (such as 8 o'clock or 10 o'clock), value range [0-23];
The starting SESSION_ID allocated to this computing node for this hour is:
MIN_SESSION_ID = MAX_IDS_PER_HOUR × CUR_HOUR_UV + MAX_NODE_IDS × CUR_NODE_ID;
The ending SESSION_ID is:
MAX_SESSION_ID = MIN_SESSION_ID + MAX_NODE_IDS;
The above SESSION_ID allocation algorithm guarantees that, for a given hour HH, the ID allocation space is:
[HH × MAX_IDS_PER_HOUR, (HH+1) × MAX_IDS_PER_HOUR]
The log file analysis results carry the SESSION_ID field; to reprocess the data of a given hour, it suffices to clear all records whose SESSION_ID falls in the range corresponding to that hour.
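The allocation formulas above translate directly into a short Python sketch. The function name `session_id_range` is hypothetical; the constants are the assumed values from the example (100000 clicks per second, 8 computing nodes). The sketch treats the ending SESSION_ID as an exclusive upper bound, which is an assumption made here so that adjacent node ranges do not share an ID.

```python
MAX_PER_CLICK = 100_000   # assumed maximum clicks per second for the website
SECOND_PER_HOUR = 3600    # 60 minutes x 60 seconds
MAX_IDS_PER_HOUR = SECOND_PER_HOUR * MAX_PER_CLICK


def session_id_range(cur_hour: int, cur_node_id: int, ttl_node_unit: int = 8):
    """Return (MIN_SESSION_ID, MAX_SESSION_ID) for one node and one hour.

    The upper bound is treated as exclusive (an assumption of this sketch).
    """
    if not 0 <= cur_hour <= 23:
        raise ValueError("cur_hour must be in [0, 23]")
    if not 0 <= cur_node_id < ttl_node_unit:
        raise ValueError("cur_node_id must be in [0, ttl_node_unit - 1]")
    max_node_ids = MAX_IDS_PER_HOUR // ttl_node_unit
    min_session_id = MAX_IDS_PER_HOUR * cur_hour + max_node_ids * cur_node_id
    return min_session_id, min_session_id + max_node_ids
```

With these values each node has 45,000,000 identifiers per hour, and the last node's range for hour 8 ends exactly where the first node's range for hour 9 begins, so clearing one hour for reprocessing is a single range delete on SESSION_ID.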
Further, on the basis of the above embodiment, under a distributed computing framework all PageViews of a session must be analyzed on the same computing node. The log files are therefore first grouped by COOKIE_ID (user identifier) during analysis, to guarantee that all logs of the same COOKIE_ID are stored contiguously. The grouping includes the following steps:
Step 1: determine the number of computing nodes of the distributed computing framework, assumed to be n;
Step 2: group the log files by COOKIE_ID into n groups, so that the logs of the same COOKIE_ID are stored contiguously and are guaranteed to be analyzed on the same computing node;
Step 3: sort the logs of each COOKIE_ID in each group obtained in Step 2 in ascending order of click time.
Meanwhile, considering the load on the website servers, log file timestamps are generally accurate only to the second, and logs clicked within the same second can be sorted as follows:
1) logs with REFER URL="-" are sorted first;
2) logs whose REFER URL is off-site are sorted first. To judge whether a URL is on-site or off-site, first judge by domain whether it is on-site; otherwise consider special cases such as top shops: if the URL has appeared as an ENTRY URL, it is judged to be on-site, otherwise off-site;
3) given two logs A and B, if the REFER URL of A equals the ENTRY URL of B, B is sorted first.
The above ordering then falls back to sorting by ENTRY URL, to ensure that repeated runs give consistent results.
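The grouping steps and part of the same-second ordering above can be sketched in Python as follows. This is a simplified illustration, not the method of the application: the names `node_for`, `refer_rank`, `sort_key`, and `partition_and_sort` and the domain `example.com` are hypothetical, click times are assumed to be integer seconds, on-site detection uses the domain only (the special cases for shops are left out), and rule 3, which chains REFER URL to ENTRY URL, is omitted because it is a pairwise relation rather than a sort key.

```python
import zlib
from urllib.parse import urlsplit

SITE_DOMAIN = "example.com"  # hypothetical on-site domain


def node_for(cookie_id: str, n: int) -> int:
    """Map a COOKIE_ID to one of n computing nodes with a stable hash."""
    return zlib.crc32(cookie_id.encode("utf-8")) % n


def refer_rank(refer_url: str) -> int:
    """0 for an empty referrer, 1 for off-site, 2 for on-site (domain check only)."""
    if refer_url == "-":
        return 0
    host = urlsplit(refer_url).hostname or ""
    on_site = host == SITE_DOMAIN or host.endswith("." + SITE_DOMAIN)
    return 2 if on_site else 1


def sort_key(log: dict):
    # COOKIE_ID keeps one visitor's logs contiguous, then click time,
    # then the same-second rules 1 and 2, then ENTRY URL as the final tie-break.
    return (log["cookie_id"], log["visit_time"],
            refer_rank(log["refer_url"]), log["entry_url"])


def partition_and_sort(logs, n):
    """Split logs into n groups by COOKIE_ID and sort each group."""
    groups = [[] for _ in range(n)]
    for log in logs:
        groups[node_for(log["cookie_id"], n)].append(log)
    return [sorted(group, key=sort_key) for group in groups]
```

A stable hash such as CRC32 is used instead of Python's built-in `hash`, which is randomized per process for strings and would send the same COOKIE_ID to different nodes on different runs.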
The log files produced by the above sorting are ordered, and the logs of the same COOKIE_ID are guaranteed to be stored contiguously. When dividing SESSIONs (sessions), log records are read by COOKIE_ID. If the click time of the current record differs from that of the previous click by more than the predetermined interval, a new SESSION is produced; if the click time of the last log in the sequence differs from the hour currently being processed by more than the predetermined interval, a new SESSION is produced; if the hour currently being processed is the last hour of the day, a new SESSION is produced. The division of SESSIONs is illustrated below with a predetermined interval of 30 minutes; 30 minutes is only a preferred example of the application, which is not limited to it.
1) if the interval between two consecutive web page clicks (PageViews) of the same visitor is no more than 30 minutes, they are considered to belong to the same SESSION;
2) if a visitor has no web page click (PageView) for more than 30 minutes, the SESSION is considered closed;
3) log analysis is performed in units of days, and accesses that span days belong to different SESSIONs;
4) during log analysis, each complete SESSION is allocated an independent SESSION_ID.
With the above session division, the contiguously stored log files of the same COOKIE_ID are divided into sessions, and each session is allocated an independent SESSION_ID.
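The division rules above can be sketched in Python for a single visitor as follows. The function name `sessionize` and the `id_source` iterator are hypothetical. Click times are assumed to be integer seconds already expressed in the analysis time zone, so that a day boundary falls on a multiple of 86400; that convention is an assumption of this sketch.

```python
from itertools import count

GAP_SECONDS = 30 * 60   # predetermined interval from the example
DAY_SECONDS = 86400


def sessionize(click_times, id_source):
    """Split one visitor's sorted click times into (session_id, [times]) pairs."""
    sessions = []
    prev = None
    for t in click_times:
        crosses_day = prev is not None and t // DAY_SECONDS != prev // DAY_SECONDS
        if prev is None or crosses_day or t - prev > GAP_SECONDS:
            # Start a new session and take the next free identifier.
            sessions.append((next(id_source), [t]))
        else:
            # Gap of at most 30 minutes on the same day: same session.
            sessions[-1][1].append(t)
        prev = t
    return sessions


if __name__ == "__main__":
    for session_id, times in sessionize([0, 600, 2400, 4201], count(1)):
        print(session_id, times)
```

In a distributed run, `id_source` would be drawn from the identifier range reserved for the current node and hour rather than from a simple counter.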
On the basis of the above embodiment, the application also provides a preferred method for judging whether a session obtained by the division is closed, so as to make the application more flexible to use. Specifically, the method includes: if the current log analysis reference time is later than or equal to the end of the day in which the session falls, judging that the session is closed; or, if the interval between the last web page click time of the user identifier corresponding to the session and the current log analysis reference time exceeds the predetermined interval, judging that the session is closed, where the current log analysis reference time is the end time of the time period of the current interval. In this embodiment, whether a session is closed is judged under different scenarios, which makes the application more flexible to use.
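The two closing conditions can be expressed as a small Python predicate. The function name `is_session_closed` is hypothetical; times are assumed to be integer seconds in the analysis time zone, and the day of the session is taken from its last click, which is an assumption that holds when sessions never span days.

```python
DAY_SECONDS = 86400
GAP_SECONDS = 30 * 60  # predetermined interval, 30 minutes in the example


def is_session_closed(last_click: int, reference_time: int,
                      gap: int = GAP_SECONDS) -> bool:
    """Decide whether a session is closed at the log analysis reference time.

    reference_time is the end time of the interval period being analyzed.
    """
    end_of_day = (last_click // DAY_SECONDS + 1) * DAY_SECONDS
    if reference_time >= end_of_day:
        return True  # the day containing the session has ended
    return reference_time - last_click > gap  # visitor idle for too long
```

For an hourly interval period, a session whose last click was 10 minutes before the end of the hour stays open and is carried into the next period, while one idle for more than 30 minutes is closed and analyzed immediately.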
On the basis of the above embodiments, the application provides a method for saving the analysis result sets obtained by clickstream log analysis in the current interval period, so as to improve analysis efficiency. Specifically, the method includes: using a first set of tables in the database to store the analysis result sets of closed sessions, where the first set of tables stores the analysis result sets of all closed sessions; using a second set of tables in the database to store the analysis result sets of unclosed sessions, where the second set of tables stores the analysis result sets of all sessions not closed in the current interval period; and extracting, from the first set of tables, the required parameters in the log files corresponding to the sessions in the current interval period, and establishing a mapping between the session identifier of each closed session and the extracted parameters; where some of the unclosed sessions become closed sessions in interval periods after the current one. In this embodiment, sessions are stored separately according to whether they are closed, so that closed and unclosed sessions are clearly distinguished and the log files corresponding to closed sessions can be analyzed, which improves analysis efficiency and the accuracy of the analysis results. In addition, when the log files corresponding to closed sessions are analyzed, the required parameters are extracted and a mapping is established between the session identifier and the extracted parameters; the log files corresponding to closed sessions can then be analyzed through this mapping, which improves analysis efficiency.
Further, on the basis of the above embodiment, the application can also identify newly added metrics while performing clickstream log analysis and update the identified metrics into the dimension tables, so that newly added metrics are analyzed promptly and effectively, which improves the timeliness and accuracy of log analysis.
On the basis of the above preferred embodiments, the application further improves the log analysis method, so as to increase the practical value of log analysis. Specifically, the log analysis method further includes: after performing session-based distributed clickstream log analysis on the collected log files at the predetermined interval period, obtaining the analysis results of the clickstream log analysis; and generating an analysis report from the obtained analysis results, where the analysis report is used to adjust the structure of the website corresponding to the log files according to the analysis results. In this embodiment, the analysis results of the clickstream log analysis are turned into an analysis report, so that the structure of the website corresponding to the log files can be adjusted according to the information fed back by the report. The website structure can then better meet users' different service needs, which increases the practical value of log analysis.
Further, on the basis of the above preferred embodiments, the session may be a session between a client and the website, so that clickstream log analysis is performed on the log files in units of sessions, which improves the efficiency and accuracy of clickstream log analysis.
Embodiment 3
On the basis of the above preferred embodiments, the application provides a preferred log analysis method, so as to improve the timeliness of log analysis and use system resources reasonably and evenly. Specifically, as shown in Figure 12, the log analysis method includes:
S1: download the raw log files by the hour and upload them to the distributed computing file system;
Preferably, downloading log files by the hour is equivalent to setting the predetermined period to 1 hour, but the period is not limited to this; it may also be 30 minutes, 2 hours, and so on, depending on requirements, so as to improve the timeliness of log analysis. Moreover, when a log file or an analysis result contains an error, only the predetermined period in which the error occurred needs to be reprocessed or reanalyzed, which reduces the workload. In addition, a distributed architecture has clear advantages in scalability and cost. Taking into account factors such as openness, stability, scalability, and ease of development, the application may be implemented on a distributed computing framework. This is, of course, only a preferred example, and the application is not limited to it.
S2: export dimension table data from the database;
Preferably, log parsing is based first on a series of dimension table data, such as the QUERY_DIMT0 dimension table and the COOKIE_DIMT0 dimension table, which are used to extract the QUERY parameters and COOKIE parameters of web page URLs. Most dimension tables are maintained manually through a front-end interface, but there are exceptions, such as the COSITE_DIMT0 dimension table and the URL_DIMT0 dimension table. The COSITE_DIMT0 dimension table is used to extract source information about partner websites: the cosite parameter and the location parameter identify which position on which website the current access came from. However, the advertisement positions on partner websites change frequently, so manual configuration would suffer from delay and workload problems, and the entries therefore basically need to be identified and inserted automatically by a program. The URL_DIMT0 dimension table stores the mapping between URL and URL_ID. For an e-commerce website, the number of distinct URLs can be on the order of millions or tens of millions; storing URLs directly in the log analysis results not only consumes storage but also hinders report analysis at the ETL layer. The strategy that can be adopted is to convert each URL, by certain rules, into a specific URL_ID for storage. The program can generate the correspondence between URL and URL_ID automatically according to the designed rules, and the correspondence can also be configured manually according to the rules. The URL_DIMT0 dimension table is therefore maintained by a combination of manual and automatic means.
S3: log decoding;
Preferably, the value of each field of each log is decomposed from the log file, logs caused by non-human access are filtered out, and the filtered log files are converted to the unified format and sorted by business into different file paths, so that the log files can be analyzed by business. At the same time, the dimension table data is updated: newly added URLs, COSITEs, and the like are added and duplicate data is removed, which improves the accuracy and efficiency of log analysis.
S4: export dimension table data from the database;
S5: log parsing (equivalent to log file analysis);
Preferably, the logs in the converted log files that belong to the same user identifier are divided into one group, the log files in each group are sorted by web page click time, adjacent logs in each group whose web page click times are no more than the predetermined interval apart are divided into the same session, and each session is allocated a globally unique session identifier. If a session obtained by the division is closed, the log files in the closed session are analyzed; unclosed sessions are saved in the second set of tables in the database for the log file analysis of the next predetermined period.
S6: export flat files from the distributed file system and import them into the database;
S7: rebuild the database table indexes and insert the ETL notification interface flag.
In the above preferred embodiment, the collected log files are analyzed in units of sessions once every predetermined period, which improves the timeliness of log file analysis while spreading the system resources used for analyzing the log files evenly across the predetermined periods of a day. The analysis results are turned into an analysis report, so that the structure of the website corresponding to the log files can be adjusted according to the analysis results. This solves the problems in the prior art that log files cannot be analyzed in real time and that system resources cannot be used evenly, thereby improving the timeliness of log analysis and using system resources reasonably and evenly.
Further, on the basis of the above preferred embodiment, the session may be a session between a client and the website, so that clickstream log analysis is performed on the log files in units of sessions, which improves the efficiency and accuracy of clickstream log analysis.
Obviously, those skilled in the art should understand that the modules or steps of the application described above can be implemented with general-purpose computing devices. They can be concentrated on a single computing device or distributed over a network formed by several computing devices. Optionally, they can be implemented as program code executable by a computing device, so that they can be stored in a storage device and executed by the computing device; in some cases the steps shown or described can be performed in an order different from the one given here. Alternatively, they can each be made into a separate integrated circuit module, or several of the modules or steps can be made into a single integrated circuit module. The application is thus not restricted to any specific combination of hardware and software.
The above are only preferred embodiments of the application and are not intended to limit it; for those skilled in the art, the application may have various modifications and variations. Any modification, equivalent replacement, improvement, and the like made within the spirit and principle of the application shall be included within its scope of protection.
Claims (19)
1. A log analysis method, characterized by comprising:
collecting log files generated by a website log server cluster;
performing, at a predetermined interval period, session-based distributed clickstream log analysis on the collected log files, wherein the interval period causes the system resources used for analyzing the log files to be used evenly over a day.
2. The method according to claim 1, characterized in that the interval period is 1 hour.
3. The method according to claim 1 or 2, characterized in that the step of performing, at the predetermined interval period, session-based distributed clickstream log analysis on the collected log files comprises:
decoding the collected log files, removing error logs and invalid logs from the decoded log files, and converting the log format of the logs in the log files from which the error logs and invalid logs have been removed into a unified log format;
performing, in units of sessions, distributed clickstream log analysis on the log files converted into the unified log format, and outputting the files of the fact tables to be loaded into the database.
4. The method according to claim 3, characterized in that the step of decoding the collected log files comprises: reading the logs in the collected log files line by line;
performing field decomposition on each log read according to the field decomposition rule selected by the configured log source identifier, and removing error logs that do not conform to the field decomposition rule;
filtering out, according to the configured filtering rules and the field values of the decomposed fields, logs caused by non-human access and logs of company intranet access from the logs in the resulting log file;
unifying the log output format of the logs in the filtered log file according to the configured log output format;
sorting the output log files by business according to the configured business types, the logs of different businesses being output to different paths.
5. method according to claim 3, is characterized in that, take session as unit, the journal file that converts unified journal format to carried out comprising based on the step of distributed click steam log analysis:
Obtain described journal file and a upper journal file that the gap periods session is not closed that converts unified journal format to;
The journal file that belongs to same user ID in the journal file that obtains is divided into one group, and the journal file of the adjacent daily record that is no more than predetermined space the webpage click time interval in each group in being divided into same session;
Whether judgement is divided the session that obtains and is closed, if session is closed, the journal file in described buttoned-up session is carried out the click steam log analysis;
If session is not closed, according to the analysis mechanisms of the journal file in described buttoned-up session, the journal file in described session of not closing is analyzed, by output identification to distinguish described session and the described different analysis result of session of having closed of not closing, portion is preserved in the daily record of simultaneously session not being closed in addition, analyzes for next gap periods.
6. The method according to claim 5, characterized in that the step of dividing adjacent logs in each group whose page-click time interval does not exceed the predetermined interval into the same session comprises:
Sorting the logs in each group by page-click time;
Performing, group by group and in the sorted order, the following steps on each log in each sorted group:
Determining whether the interval between the page-click time of the log currently being processed in a sorted group and the page-click time of the previous log in that group exceeds the predetermined interval, wherein the previous log has been assigned to a previous session;
If the predetermined interval is exceeded, creating a current session, allocating a globally unique session identifier to the current session, and assigning the log currently being processed to the current session;
If the predetermined interval is not exceeded, assigning the log currently being processed to the previous session.
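The per-group loop described in claim 6 can be sketched as follows. The function name, the use of seconds for click times, and the default `uuid4`-based identifier are assumptions made for illustration; the claim does not say how the first log of a group is handled, so this sketch simply lets it open a session.

```python
import uuid


def split_into_sessions(click_times, gap, new_id=lambda: uuid.uuid4().hex):
    """Assign a session ID to each click of one user group.

    click_times: page-click times in seconds for a single user ID.
    Returns a list of (click_time, session_id) in click-time order.
    """
    assigned = []
    prev_time = None
    session_id = None
    for t in sorted(click_times):
        if prev_time is None or t - prev_time > gap:
            # First log of the group, or gap exceeded: create a new session.
            session_id = new_id()
        # Otherwise the log joins the session of the previous log.
        assigned.append((t, session_id))
        prev_time = t
    return assigned
```

Passing `new_id` as a parameter keeps the splitting logic separate from whatever identifier scheme is chosen.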
7. The method according to claim 5, characterized in that the step of determining whether a session obtained by the division is closed comprises:
If the current log analysis reference time is later than or equal to the end time of the day on which the session falls, determining that the session is closed; or
If the interval between the last page-click time of the user ID corresponding to the session and the current log analysis reference time exceeds the predetermined interval, determining that the session is closed, wherein the current log analysis reference time is the end time of the time period in which the current interval falls.
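The two closure conditions can be expressed as a single predicate. This is a sketch under one assumption worth flagging: "the day on which the session falls" is taken here to be the calendar day of the session's last click, which the claim does not spell out.

```python
from datetime import datetime, timedelta


def is_closed(last_click, reference_time, gap):
    """Return True if a session counts as closed at reference_time.

    last_click: time of the session's last page click.
    reference_time: end of the time period covered by the current interval.
    gap: the predetermined interval, as a timedelta.
    """
    # Condition 1: the reference time has reached the end of the session's day
    # (assumed to be the day of the last click).
    day_end = datetime.combine(last_click.date(), datetime.min.time()) + timedelta(days=1)
    if reference_time >= day_end:
        return True
    # Condition 2: the user has been idle for longer than the predetermined interval.
    return reference_time - last_click > gap
```

Condition 1 forces every session to close at the day boundary, so no session spans two days even if the user keeps clicking across midnight.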
8. The method according to claim 5, characterized in that the analysis result set obtained by the clickstream log analysis in the current interval period is saved by the following steps:
Using a first set of tables in a database to save the analysis result sets of closed sessions, wherein the first set of tables holds the analysis result sets of all closed sessions;
Using a second set of tables in the database to save the analysis result sets of unclosed sessions, wherein the second set of tables holds the analysis result sets of all sessions not closed in the current interval period;
Extracting, from the first set of tables, the required parameters in the logs corresponding to the sessions in the current interval period, and establishing a mapping between the session identifiers of the closed sessions and the extracted parameters;
Wherein some of the unclosed sessions become closed sessions in interval periods after the current interval period.
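The split between the two sets of tables can be sketched with Python's built-in `sqlite3` module. The table names, the single `page_views` column, and the choice to rebuild the unclosed-session table every period are assumptions for illustration only; the claim does not describe a schema.

```python
import sqlite3


def save_results(conn, results):
    """Store one period's results; each result is (session_id, page_views, closed)."""
    conn.execute("CREATE TABLE IF NOT EXISTS closed_sessions "
                 "(session_id TEXT PRIMARY KEY, page_views INTEGER)")
    conn.execute("CREATE TABLE IF NOT EXISTS open_sessions "
                 "(session_id TEXT PRIMARY KEY, page_views INTEGER)")
    # The second set only reflects the current period, so it is rebuilt each time.
    conn.execute("DELETE FROM open_sessions")
    for session_id, page_views, closed in results:
        table = "closed_sessions" if closed else "open_sessions"
        conn.execute(f"INSERT OR REPLACE INTO {table} VALUES (?, ?)",
                     (session_id, page_views))
    conn.commit()
```

A session that was unclosed in one period and closes in a later one disappears from `open_sessions` and appears in `closed_sessions` with its final figures, which matches the transition described in the last clause of the claim.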
9. The method according to claim 6, characterized in that the step of allocating a globally unique session identifier to the current session comprises:
Allocating the globally unique session identifier to the current session on a per-day basis, wherein session identifiers allocated on different dates may be the same or different;
Dynamically allocating the globally unique session identifier to the current session during clickstream analysis, wherein the uniqueness of the session identifier is independent of the length of the interval period and of the number of distributed computing nodes.
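One way to obtain identifiers whose uniqueness does not depend on the interval length or on the number of computing nodes is to derive them from the session's own data rather than from a shared counter. The scheme below is an assumption, not the method of the patent: it combines the day, the user ID, and the time of the session's first click, and it relies on one user never starting two sessions within the same second.

```python
from datetime import datetime


def allocate_session_id(user_id, first_click):
    """Build a session ID from the day, the user ID, and the first click time.

    Any node computes the same ID for the same session, so no coordination
    between nodes is needed and the ID stays stable across interval periods.
    """
    day = first_click.strftime("%Y%m%d")
    return f"{day}-{user_id}-{first_click.strftime('%H%M%S')}"
```

Because the identifier is deterministic, an unclosed session that is re-analyzed in the next interval period receives the same identifier it had before.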
10. The method according to claim 1, characterized in that, after performing distributed, session-based clickstream log analysis on the collected log files at the predetermined interval period, the method further comprises:
Obtaining the analysis result of the clickstream log analysis;
Generating an analysis report from the obtained analysis result, wherein the analysis report is used to adjust the structure of the website corresponding to the log files according to the analysis result.
11. A log analysis device, characterized by comprising:
A collecting unit, configured to collect the log files generated by a website log server cluster;
An analysis unit, configured to perform distributed, session-based clickstream log analysis on the collected log files at a predetermined interval period, wherein the interval period is chosen so that the resources of the system that analyzes the log files are used evenly over a day.
12. The device according to claim 11, characterized in that the analysis unit comprises:
A decoding module, configured to decode the collected log files, remove erroneous logs and invalid logs from the decoded log files, and convert the logs remaining in the log files after that removal to a unified log format;
An analysis module, configured to perform distributed, session-based clickstream log analysis on the log files converted to the unified log format, and to output the fact-table files to be loaded into the database.
13. The device according to claim 12, characterized in that the decoding module comprises:
A reading submodule, configured to read the logs in the collected log files line by line;
A decomposition submodule, configured to select the set field decomposition rule according to the log source identifier, split the read logs into fields, and remove erroneous logs that do not conform to the field decomposition rule;
A filtering submodule, configured to filter out, from the logs obtained after decomposition, the logs caused by non-human access and the logs of company intranet access, according to the set filtering rules and the field values of the fields;
An output submodule, configured to convert the logs in the filtered log file to a unified log output format according to the set log output format;
A sorting submodule, configured to sort the output log file by business according to the set business types, with the logs of different businesses output to different paths.
14. The device according to claim 13, characterized in that the filtering submodule comprises:
A rule parsing submodule, configured to load the filtering rules and initialize the filter functions;
A rule judging submodule, configured to determine, from the field values obtained by decomposing a log, whether the log matches any one of the loaded filtering rules.
15. The device according to claim 12, characterized in that the analysis module comprises:
An obtaining submodule, configured to obtain the log file converted to the unified log format and the log file of the sessions that were not closed in the previous interval period;
A grouping submodule, configured to group the logs in the obtained log files that belong to the same user ID into one group, and to divide adjacent logs in each group whose page-click time interval does not exceed a predetermined interval into the same session;
A session submodule, configured to determine whether each session obtained by the division is closed and, if the session is closed, to perform clickstream log analysis on the logs in the closed session;
If the session is not closed, to analyze the logs in the unclosed session with the same analysis mechanism used for the logs in closed sessions, to mark the output with a flag that distinguishes the analysis results of unclosed sessions from those of closed sessions, and additionally to save a copy of the logs of the unclosed session for analysis in the next interval period.
16. The device according to claim 15, characterized in that the grouping submodule comprises:
A sorting submodule, configured to sort the logs in each group by page-click time;
A first judging submodule, configured to examine, group by group and in the sorted order, each log in each sorted group, and to determine whether the interval between the page-click time of the log currently being processed in a sorted group and the page-click time of the previous log in that group exceeds the predetermined interval, wherein the previous log has been assigned to a previous session;
A creating submodule, configured to create a current session when the predetermined interval is exceeded, allocate a globally unique session identifier to the current session, and assign the log currently being processed to the current session;
A dividing submodule, configured to assign the log currently being processed to the previous session when the predetermined interval is not exceeded.
17. The device according to claim 15, characterized in that the session submodule comprises:
A second judging submodule, configured to determine that the session is closed when the current log analysis reference time is later than or equal to the end time of the day on which the session falls; or
A third judging submodule, configured to determine that the session is closed when the interval between the last page-click time of the user ID corresponding to the session and the current log analysis reference time exceeds the predetermined interval, wherein the current log analysis reference time is the end time of the time period in which the current interval falls.
18. The device according to claim 15, characterized in that the session submodule further comprises:
A first saving submodule, configured to use a first set of tables in a database to save the analysis result sets of closed sessions, wherein the first set of tables holds the analysis result sets of all closed sessions;
A second saving submodule, configured to use a second set of tables in the database to save the analysis result sets of unclosed sessions, wherein the second set of tables holds the analysis result sets of all sessions not closed in the current interval period, and wherein some of the unclosed sessions become closed sessions in interval periods after the current interval period;
A dimension table updating submodule, configured to identify the metrics newly added by the clickstream log analysis of the current interval period and to update them in the dimension table, wherein the newly added metrics are used for the clickstream log analysis of the current interval period.
19. The device according to claim 11, characterized by further comprising:
An obtaining unit, configured to obtain the analysis result of the clickstream log analysis after distributed, session-based clickstream log analysis has been performed on the collected log files at the predetermined interval period;
A generating unit, configured to generate an analysis report from the obtained analysis result, wherein the analysis report is used to adjust the structure of the website corresponding to the log files according to the analysis result.
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN2011104399568A CN103178982A (en) | 2011-12-23 | 2011-12-23 | Method and device for analyzing log |
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN2011104399568A CN103178982A (en) | 2011-12-23 | 2011-12-23 | Method and device for analyzing log |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| CN103178982A true CN103178982A (en) | 2013-06-26 |
Family
ID=48638614
Family Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN2011104399568A, Pending, CN103178982A (en) | 2011-12-23 | 2011-12-23 | Method and device for analyzing log |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN103178982A (en) |
Patent Citations (3)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20100085888A1 (en) * | 2005-12-30 | 2010-04-08 | Jeanette Larosa | Method and apparatus for analyzing source internet protocol activity in a network |
| CN101770487A (en) * | 2008-12-26 | 2010-07-07 | 聚友空间网络技术有限公司 | Method and system for calculating user influence in social network |
| CN102075355A (en) * | 2010-12-30 | 2011-05-25 | 北京世纪互联工程技术服务有限公司 | Log system and using method thereof |
Non-Patent Citations (2)
| Title |
|---|
| 俞辉: "基于Web日志挖掘的网页实时推荐算法研究", 《计算机工程与设计》, vol. 29, no. 7, 30 April 2008 (2008-04-30) * |
| 李烈彪等: "Web日志挖掘中数据预处理方法的研究", 《计算机技术与发展》, vol. 17, no. 7, 31 July 2007 (2007-07-31), pages 45 - 48 * |
Cited By (80)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN103399855A (en) * | 2013-07-01 | 2013-11-20 | 百度在线网络技术(北京)有限公司 | Behavior intention determining method and device based on multiple data sources |
| CN103312568B (en) * | 2013-07-09 | 2016-07-13 | 北京国双科技有限公司 | Data statistical approach and device |
| CN103312568A (en) * | 2013-07-09 | 2013-09-18 | 北京国双科技有限公司 | Data statistical method and device |
| CN103401849A (en) * | 2013-07-18 | 2013-11-20 | 盘石软件(上海)有限公司 | Abnormal session analyzing method for website logs |
| CN103401849B (en) * | 2013-07-18 | 2017-02-15 | 盘石软件(上海)有限公司 | Abnormal session analyzing method for website logs |
| CN103414758A (en) * | 2013-07-19 | 2013-11-27 | 北京奇虎科技有限公司 | Method and device for processing logs |
| US10587707B2 (en) | 2013-08-28 | 2020-03-10 | Tencent Technology (Shenzhen) Company Limited | Method and apparatus for monitoring website access data |
| CN104426713B (en) * | 2013-08-28 | 2018-04-17 | 腾讯科技(北京)有限公司 | The monitoring method and device of web site access effect data |
| CN104426713A (en) * | 2013-08-28 | 2015-03-18 | 腾讯科技(北京)有限公司 | Method and device for monitoring network site access effect data |
| CN103593791A (en) * | 2013-11-07 | 2014-02-19 | 广州优蜜信息科技有限公司 | Mobile advertisement putting method and system |
| CN103577586A (en) * | 2013-11-08 | 2014-02-12 | 北京国双科技有限公司 | Method and device for processing log records |
| CN103577586B (en) * | 2013-11-08 | 2017-03-15 | 北京国双科技有限公司 | The processing method and processing device of log recording |
| CN103595571B (en) * | 2013-11-20 | 2018-02-02 | 北京国双科技有限公司 | Preprocess method, the apparatus and system of web log |
| CN103595571A (en) * | 2013-11-20 | 2014-02-19 | 北京国双科技有限公司 | Preprocessing method, device and system for website access logs |
| CN104091276A (en) * | 2013-12-10 | 2014-10-08 | 深圳市腾讯计算机系统有限公司 | Click stream data online analyzing method and related device and system |
| CN103729479A (en) * | 2014-01-26 | 2014-04-16 | 北京北纬通信科技股份有限公司 | Web page content statistical method and system based on distributed file storage |
| CN105100128A (en) * | 2014-04-24 | 2015-11-25 | 北京金山网络科技有限公司 | Server cluster log acquiring and providing methods, log server and node server |
| CN105337930A (en) * | 2014-06-30 | 2016-02-17 | 北京新媒传信科技有限公司 | Data processing method and apparatus |
| CN105337930B (en) * | 2014-06-30 | 2019-02-19 | 北京新媒传信科技有限公司 | The method and device that a kind of pair of data are handled |
| CN104113605A (en) * | 2014-07-30 | 2014-10-22 | 浪潮软件股份有限公司 | Enterprise cloud application development monitoring processing method |
| US11475023B2 (en) | 2014-11-05 | 2022-10-18 | Ab Initio Technology Llc | Impact analysis |
| CN107135663B (en) * | 2014-11-05 | 2021-06-22 | 起元技术有限责任公司 | Impact Analysis |
| CN107135663A (en) * | 2014-11-05 | 2017-09-05 | 起元技术有限责任公司 | Impact analysis |
| CN104298782A (en) * | 2014-11-07 | 2015-01-21 | 辽宁四维科技发展有限公司 | Method for analyzing active access behaviors of internet users |
| CN104298782B (en) * | 2014-11-07 | 2017-10-24 | 郭磊 | Internet user actively accesses the analysis method of action trail |
| CN104391954A (en) * | 2014-11-27 | 2015-03-04 | 北京国双科技有限公司 | Database log processing method and device |
| CN104391954B (en) * | 2014-11-27 | 2019-04-09 | 北京国双科技有限公司 | The processing method and processing device of database journal |
| CN104639387B (en) * | 2014-12-09 | 2019-03-01 | 北京京东尚科信息技术有限公司 | A method and device for tracking user network behavior |
| CN104639387A (en) * | 2014-12-09 | 2015-05-20 | 北京京东尚科信息技术有限公司 | Users' network behavior tracking method and equipment |
| CN105812324A (en) * | 2014-12-30 | 2016-07-27 | 华为技术有限公司 | Method, device and system for IDC information safety management |
| CN105812324B (en) * | 2014-12-30 | 2019-04-05 | 华为技术有限公司 | IDC information security management method, device and system |
| CN105141448B (en) * | 2015-07-28 | 2018-10-02 | 杭州华为数字技术有限公司 | A kind of acquisition method and device of daily record |
| CN105141448A (en) * | 2015-07-28 | 2015-12-09 | 杭州华为数字技术有限公司 | Method and device for collecting log |
| CN106453454B (en) * | 2015-08-07 | 2019-08-16 | 北京国双科技有限公司 | Session label information generation method and device |
| CN106453454A (en) * | 2015-08-07 | 2017-02-22 | 北京国双科技有限公司 | Dialogue identification information generating method and apparatus |
| CN106776264A (en) * | 2015-11-24 | 2017-05-31 | 北京国双科技有限公司 | The method of testing and device of application code |
| CN106817270A (en) * | 2015-12-01 | 2017-06-09 | 精硕科技(北京)股份有限公司 | Network traffics acquisition method, system and server |
| CN106909499A (en) * | 2015-12-22 | 2017-06-30 | 阿里巴巴集团控股有限公司 | Method of testing and device |
| CN105930329A (en) * | 2015-12-28 | 2016-09-07 | 中国银联股份有限公司 | Transaction log analysis method and apparatus |
| CN106021079B (en) * | 2016-05-06 | 2018-10-09 | 华南理工大学 | It is a kind of based on the Web application performance test methods for being frequently visited by the user series model |
| CN106021079A (en) * | 2016-05-06 | 2016-10-12 | 华南理工大学 | A Web application performance testing method based on a user frequent access sequence model |
| CN106130807A (en) * | 2016-08-31 | 2016-11-16 | 百势软件(北京)有限公司 | The extraction of a kind of Nginx daily record and analysis method and device |
| CN108241661A (en) * | 2016-12-23 | 2018-07-03 | 航天星图科技(北京)有限公司 | A kind of distributed traffic analysis method |
| CN108255879A (en) * | 2016-12-29 | 2018-07-06 | 北京国双科技有限公司 | The detection method and device of web page browsing flow cheating |
| CN108255879B (en) * | 2016-12-29 | 2021-10-08 | 北京国双科技有限公司 | Method and device for detecting cheating in web browsing traffic |
| CN106713041A (en) * | 2016-12-29 | 2017-05-24 | 杭州迪普科技股份有限公司 | Session log transmitting method and device |
| CN108629042A (en) * | 2017-07-06 | 2018-10-09 | 深圳中兴飞贷金融科技有限公司 | Big data acquisition method, apparatus and system |
| CN107317873A (en) * | 2017-07-21 | 2017-11-03 | 曙光信息产业(北京)有限公司 | A kind of conversation processing method and device |
| CN107517203A (en) * | 2017-08-08 | 2017-12-26 | 北京奇安信科技有限公司 | A kind of user behavior baseline method for building up and device |
| CN107517203B (en) * | 2017-08-08 | 2020-07-14 | 奇安信科技集团股份有限公司 | User behavior baseline establishing method and device |
| CN107688619B (en) * | 2017-08-10 | 2020-06-16 | 奇安信科技集团股份有限公司 | Log data processing method and device |
| CN107688619A (en) * | 2017-08-10 | 2018-02-13 | 北京奇安信科技有限公司 | A kind of daily record data processing method and processing device |
| CN108123840A (en) * | 2017-12-22 | 2018-06-05 | 中国联合网络通信集团有限公司 | Log processing method and system |
| CN108363649A (en) * | 2017-12-29 | 2018-08-03 | 微梦创科网络科技(中国)有限公司 | A kind of method and device of distribution statistical log visit capacity |
| CN110659918A (en) * | 2018-06-28 | 2020-01-07 | 上海传漾广告有限公司 | Optimization method for tracking and analyzing network advertisements |
| CN109190007B (en) * | 2018-07-20 | 2022-10-04 | 创新先进技术有限公司 | Data analysis method and device |
| CN109190007A (en) * | 2018-07-20 | 2019-01-11 | 阿里巴巴集团控股有限公司 | Data analysing method and device |
| CN109218401B (en) * | 2018-08-08 | 2021-08-31 | 平安科技(深圳)有限公司 | Log collection method, system, computer device and storage medium |
| WO2020029376A1 (en) * | 2018-08-08 | 2020-02-13 | 平安科技(深圳)有限公司 | Log acquisition method and system, and computer device and storage medium |
| CN109218401A (en) * | 2018-08-08 | 2019-01-15 | 平安科技(深圳)有限公司 | Log collection method, system, computer equipment and storage medium |
| US11647100B2 (en) | 2018-09-30 | 2023-05-09 | China Mobile Communication Co., Ltd Research Inst | Resource query method and apparatus, device, and storage medium |
| CN109359263B (en) * | 2018-10-16 | 2020-09-29 | 杭州安恒信息技术股份有限公司 | A kind of user behavior feature extraction method and system |
| CN109325183A (en) * | 2018-10-16 | 2019-02-12 | 深圳壹账通智能科技有限公司 | Method, device and computer equipment for locating error problem based on crawler log |
| CN109359263A (en) * | 2018-10-16 | 2019-02-19 | 杭州安恒信息技术股份有限公司 | A kind of user behavior feature extraction method and system |
| CN111224807B (en) * | 2018-11-27 | 2023-08-01 | 中国移动通信集团江西有限公司 | Distributed log processing method, device, equipment and computer storage medium |
| CN111224807A (en) * | 2018-11-27 | 2020-06-02 | 中国移动通信集团江西有限公司 | Distributed log processing method, device, device and computer storage medium |
| CN109739821A (en) * | 2018-12-18 | 2019-05-10 | 中国科学院计算机网络信息中心 | Log data hierarchical storage method, device and storage medium |
| CN109885543A (en) * | 2018-12-24 | 2019-06-14 | 航天信息股份有限公司 | Log processing method and device based on big data cluster |
| CN111723063A (en) * | 2019-03-18 | 2020-09-29 | 北京沃东天骏信息技术有限公司 | A method and device for offline log data processing |
| CN112242919B (en) * | 2019-07-19 | 2022-07-29 | 烽火通信科技股份有限公司 | Fault file processing method and system |
| CN112242919A (en) * | 2019-07-19 | 2021-01-19 | 烽火通信科技股份有限公司 | Fault file processing method and system |
| CN110516440B (en) * | 2019-08-12 | 2021-12-10 | 广州海颐信息安全技术有限公司 | Method and device for linkage playback of privilege threat behavior track based on dragging |
| CN110516440A (en) * | 2019-08-12 | 2019-11-29 | 广州海颐信息安全技术有限公司 | Privilege based on dragging threatens the method and device of action trail association playback |
| CN110825943A (en) * | 2019-10-23 | 2020-02-21 | 支付宝(杭州)信息技术有限公司 | A method, system and device for generating user access path tree data |
| CN110825943B (en) * | 2019-10-23 | 2023-10-10 | 支付宝(杭州)信息技术有限公司 | A method, system and device for generating user access path tree data |
| CN112069048A (en) * | 2020-09-09 | 2020-12-11 | 北京明略昭辉科技有限公司 | Log processing method, device and storage medium |
| CN114827126A (en) * | 2022-03-24 | 2022-07-29 | 中通服创立信息科技有限责任公司 | IPTVDN user play log reporting method and system |
| CN114827126B (en) * | 2022-03-24 | 2023-07-14 | 中通服创立信息科技有限责任公司 | IPTVCDN user play log reporting method and system |
| CN116975013A (en) * | 2022-08-23 | 2023-10-31 | 中国移动通信集团浙江有限公司 | Method, device, equipment and computer storage medium for constructing log analysis model |
| CN116582423A (en) * | 2023-05-23 | 2023-08-11 | 杭州电子科技大学 | A log parsing method for edge gateway devices based on real-time stream processing |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | C06 | Publication | |
| | PB01 | Publication | |
| | C10 | Entry into substantive examination | |
| | SE01 | Entry into force of request for substantive examination | |
| | REG | Reference to a national code | Ref country code: HK; Ref legal event code: DE; Ref document number: 1182856; Country of ref document: HK |
| | RJ01 | Rejection of invention patent application after publication | |
| | RJ01 | Rejection of invention patent application after publication | Application publication date: 20130626 |






