CN106570053A - Network data collection and validation method - Google Patents
- Publication number
- CN106570053A
- Authority
- CN
- China
- Prior art keywords
- data
- classification
- network data
- collection
- amount
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/3349—Reuse of stored results of previous queries
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/951—Indexing; Web crawling techniques
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Databases & Information Systems (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
Abstract
The invention discloses a network data collection and validation method. The method is implemented as follows: first, network data is collected, the information of the internet site from which the data was collected is classified, and categories are sampled at random; the amount of network data in the selected category is counted, the collected data stored in a database is retrieved through a database operation script, and the collected amount in the selected category is counted; the amount of network data and the collected amount are then compared to obtain the coverage rate of the collected data, where coverage rate = collected data amount / actual data amount, to verify whether any data was missed. Compared with the prior art, the method draws on the sampling survey theory of statistics to validate collected mass data in a scientific and reasonable way, and it also facilitates subsequent data analysis and mining work. The method is practical, widely applicable, and easy to popularize.
Description
Technical field
The present invention relates to the technical field of big data application analysis, and specifically to a practical network data collection and validation method.
Background
With the rapid growth of the internet and the information industry in recent years, data has penetrated every industry and business function and has become an important factor of production. The mining and use of massive data heralds a new wave of productivity growth and consumer surplus. The concept of big data has taken hold among experts and scholars in all fields and has also attracted wide public attention. At the same time, the network is filled with large amounts of distributed public information: the further opening of government data, the rapid development of e-commerce platforms such as Taobao, and the expansion of online tax services have all generated massive amounts of information data.
Today, the amount of information on web pages grows day by day, with many types and complex structures. Once data collection is complete, a scientific and reasonable verification method is therefore needed to check whether any data was missed. The larger a website's information volume (for example, portal sites or large e-commerce shopping platforms), the harder verification becomes. The information to be collected is often spread across different pages, so it is difficult to estimate the site-wide data total accurately, and the number of collected records cannot simply be compared against it to reach a judgment.
However, for reasons such as user experience, large websites do not present all of their information in a single undifferentiated form; most classify it, grouping data of the same type together. This suggests a new approach to data verification: sample categories at random from the site's existing classification, count the data in each sampled category, and compare that count with the total for the same category in the collection result. This verifies whether the collection has omissions and whether it has achieved the goal of covering the site's full information volume. On this basis, a scientific and efficient network data collection and validation method is provided.
Summary of the invention
The technical task of the present invention is to address the above shortcomings by providing a practical network data collection and validation method.
A network data collection and validation method is implemented as follows:
First, collect the network data; then classify the information of the internet site from which the data was collected, and sample categories at random.
Count the amount of network data in the selected category, then retrieve the collected data stored in the database through a database operation script and count the collected amount for the selected category.
Compare the two to obtain the coverage rate of the collected data, where coverage rate = collected data amount / actual data amount, to verify whether any data was missed.
The process of collecting network data and classifying it is as follows: visit the home page of the collected website in a browser and find the category entry points for the collected information; then select a category link by clicking one category at random at the entry point to enter its page and view the corresponding information, that is, the data that has been collected; then locate the items related to data volume.
The information classification applied to the collected data is the classification that already exists on the collected website. Clicking the category links on the site allows the site's information to be browsed by category, so that the data total under a category can either be read directly or derived indirectly by calculation.
Obtaining the amount of network data in the selected category means: when the page explicitly shows the total number of items, that value is taken directly; otherwise, if the page shows only a total of N pages, the number of records per page, M, is determined by observation, and M is multiplied by N to obtain the total data amount in the category, where M and N are positive integers.
The network data collection and validation method of the present invention has the following advantages:
The method effectively verifies data collected from internet sites (especially large websites). Through reasonable sampling and comparative analysis, it determines how much of the collected site's data total the collection result covers, and thus whether any data was missed. This verifies the authenticity and diversity of the data and also provides an effective means of verification for subsequent data analysis and mining work. After collecting from an internet site with a large information volume and a complex page structure, one category module is sampled, the data total of that module is counted by category and compared with the collected data of the same category, and the steps of the invention are repeated several times to determine whether data was missed. The accuracy and credibility of the collection result are thereby verified in a scientific and reasonable way. The method is practical, widely applicable, and easy to popularize.
Description of the drawings
Figure 1 is a schematic diagram of the implementation of the present invention.
Detailed description
The invention is further described below with reference to the drawing and specific embodiments.
The problem the invention addresses is the verification of high-volume information from large internet sites in network data collection technology, that is, determining whether the volume of the collection result is consistent with that of the original target website.
Because the information on a large internet site is widely distributed, its total is difficult to determine, so it cannot be verified directly whether the collection result covers all of the site's information. To address one or more problems in the related art, the present invention provides a verification method aimed mainly at data collected from large websites, so as to solve at least one of the above problems. As shown in Figure 1, the network data collection and validation method of the present invention first analyzes how the internet site classifies its information, samples categories at random, and counts the amount of network data in the selected category; it then retrieves the collected data stored in the database through a database operation script and counts the collected amount for the selected category. The two are compared to obtain the coverage rate of the collected data.
It is implemented as follows:
After data collection is complete, first analyze how the collected target site classifies its data, sample categories at random, and count the amount of network data in the selected category. Then retrieve the collection result in the database through an operation script and count the collected amount for the selected category. Compare the two to obtain the coverage rate of the collected data (collected amount / actual amount), to verify whether any data was missed.
In the above steps, the collected target website does not directly display its site-wide data total on any page, and, because the data volume is very large, the website has divided its data into categories for reasons such as ease of browsing.
The data classification is the classification that already exists on the collected website. Clicking the category links on the site allows the site's information to be browsed by category, so that the data total under a category can either be read directly or derived indirectly by simple calculation.
The coverage rate of the collected data is the ratio obtained, after a category of the target site is selected at random, by dividing the collected amount for the selected category by the amount of network data in that category. Considering factors such as website updates and the diversity and complexity of the data, if this value exceeds 90%, data collection can be regarded as reasonably complete, with no missed-collection problem.
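The ratio and the 90% reference value described above can be sketched in a few lines of Python. This is only an illustration: the function names `coverage_rate` and `is_collection_complete` and the `threshold` parameter are assumptions introduced here and do not come from the patent text.

```python
def coverage_rate(collected_count: int, site_count: int) -> float:
    """Coverage rate = collected amount / amount shown on the site, for one category."""
    if site_count <= 0:
        raise ValueError("site_count must be a positive integer")
    if collected_count < 0:
        raise ValueError("collected_count must not be negative")
    return collected_count / site_count


def is_collection_complete(collected_count: int, site_count: int,
                           threshold: float = 0.90) -> bool:
    """True when coverage exceeds the reference threshold (90% in the text)."""
    return coverage_rate(collected_count, site_count) > threshold


# Hypothetical numbers: the site shows 1000 items in a category, 950 were collected.
rate = coverage_rate(950, 1000)
complete = is_collection_complete(950, 1000)
```

Keeping the threshold as a parameter reflects the text's point that 90% is a reference value rather than a fixed rule.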
To describe the invention in more detail, it is now explained with reference to Figure 1:
Step 1, access the target page: visit the home page of the collected website in a browser and find the category entry points for the collected information. Websites usually place these in a sidebar or at the top of the page so that visitors can click them easily.
Step 2, select a category link: click one category at random at the category entry point to enter its page and view the corresponding information (the data that has been collected). Because the purpose is to count the data total under the category, the specific content of the information need not be examined; locate the items related to data volume, for example: how many pages in total, how many products in total, how many records in total, and so on.
Step 3, obtain the data volume of the selected category: if the page explicitly shows the total number of items, that value can be taken directly; otherwise, if it shows only a total of N pages (N is a positive integer), the number of records per page can be determined by observation, say M records (M is a positive integer), and M is multiplied by N to obtain the total data amount in the category.
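The two cases in Step 3 can be written as a small helper. The sketch below is an assumption-laden illustration: the function name `category_total` and its parameter names are hypothetical, and the values passed in are expected to have been read from the web page by a person or a separate scraper.

```python
from typing import Optional


def category_total(explicit_total: Optional[int] = None,
                   page_count: Optional[int] = None,
                   items_per_page: Optional[int] = None) -> int:
    """Return the data total for one category on the site.

    Use the explicitly displayed total when the page shows one; otherwise
    fall back to M * N (items per page times number of pages).
    """
    if explicit_total is not None:
        return explicit_total
    if page_count is None or items_per_page is None:
        raise ValueError("need either explicit_total or both page_count and items_per_page")
    if page_count <= 0 or items_per_page <= 0:
        raise ValueError("page_count (N) and items_per_page (M) must be positive integers")
    return items_per_page * page_count


shown_total = category_total(explicit_total=1234)                   # total displayed on the page
estimated_total = category_total(page_count=50, items_per_page=20)  # N = 50 pages, M = 20 per page
```

Note that the last page of a listing may hold fewer than M records, so M × N is best read as an upper estimate of the category total.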
Step 4, retrieve the collection result: query the collection result in the database by category through an operation script, and count the collected amount for the selected category.
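The patent does not say which database or script language is used for Step 4. As a minimal sketch, assuming the collected records sit in an SQLite table, a per-category count might look like the following; the table name `collected_items` and its columns are hypothetical.

```python
import sqlite3


def collected_counts_by_category(conn: sqlite3.Connection) -> dict:
    """Count collected records per category (table and column names are assumed)."""
    rows = conn.execute(
        "SELECT category, COUNT(*) FROM collected_items GROUP BY category"
    ).fetchall()
    return {category: count for category, count in rows}


# Small in-memory database standing in for the real collection store.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE collected_items (id INTEGER PRIMARY KEY, category TEXT, title TEXT)"
)
conn.executemany(
    "INSERT INTO collected_items (category, title) VALUES (?, ?)",
    [("books", "a"), ("books", "b"), ("phones", "c")],
)
counts = collected_counts_by_category(conn)
```

The resulting dictionary gives the collected amount for any sampled category, which is the numerator of the coverage rate.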
Step 5, verify the data volume: divide the collected amount for the selected category by the amount of network data in that category to obtain a ratio, the data coverage rate. Considering factors such as website updates and the diversity and complexity of the data, if this value exceeds 90%, data collection can be regarded as reasonably complete, with no missed-collection problem.
Repeat steps 2 to 5 several times, so that sampling verifies whether the collected data covers the whole site.
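Repeating steps 2 to 5 over several randomly chosen categories can be sketched as a loop. The code below is illustrative only: `verify_by_sampling`, its parameters, and the sample numbers are assumptions, and the per-category site totals are taken as already read from the website.

```python
import random


def verify_by_sampling(site_totals: dict, collected_totals: dict,
                       sample_size: int, threshold: float = 0.90,
                       seed=None) -> dict:
    """Sample categories at random and compute the coverage rate for each.

    Returns {category: (coverage_rate, passed)} for the sampled categories.
    """
    rng = random.Random(seed)
    categories = sorted(site_totals)
    chosen = rng.sample(categories, min(sample_size, len(categories)))
    report = {}
    for category in chosen:
        site_count = site_totals[category]
        if site_count <= 0:
            raise ValueError(f"site total for {category!r} must be positive")
        collected = collected_totals.get(category, 0)  # nothing collected counts as 0
        rate = collected / site_count
        report[category] = (rate, rate > threshold)
    return report


# Hypothetical totals: counts read from the site versus counts found in the database.
site_totals = {"books": 1000, "phones": 400, "toys": 250}
collected_totals = {"books": 980, "phones": 300, "toys": 250}
report = verify_by_sampling(site_totals, collected_totals, sample_size=3)
```

In this made-up data, `phones` falls below the reference value, which would flag that category for a closer look at missed records.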
The invention is not limited to cases where a large website provides category entry points by default. If none are provided, a reasonable classification can be made manually according to the content of the collected information. For example, when a website provides a search bar and information is obtained by keyword search, a keyword can be entered to obtain the search results, an operation script can then be used to obtain the amount of collected data containing that keyword, and the two can be compared. The core verification approach is the same as in this method.
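For the keyword variant, the database side reduces to counting collected records that contain the keyword. A minimal sketch follows, again assuming an SQLite store; the table `collected_items`, the column `content`, and the figure `site_search_hits` are hypothetical, and how a given site's search matches keywords may differ from a plain substring test.

```python
import sqlite3


def collected_count_for_keyword(conn: sqlite3.Connection, keyword: str) -> int:
    """Count collected records whose content contains the keyword (case-sensitive)."""
    row = conn.execute(
        "SELECT COUNT(*) FROM collected_items WHERE instr(content, ?) > 0",
        (keyword,),
    ).fetchone()
    return row[0]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE collected_items (id INTEGER PRIMARY KEY, content TEXT)")
conn.executemany(
    "INSERT INTO collected_items (content) VALUES (?)",
    [("red phone case",), ("phone charger",), ("desk lamp",), ("used phone",)],
)

site_search_hits = 4  # hypothetical: result count shown by the site's own search
collected = collected_count_for_keyword(conn, "phone")
keyword_coverage = collected / site_search_hits
```

Using `instr` with a bound parameter avoids having to escape wildcard characters that a `LIKE` pattern would require.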
Because the information on an internet site may be updated in real time, the data coverage rate obtained by dividing the collected amount for the selected category by the amount of network data in that category need not exceed 90% for the data to be judged complete. The judgment should take into account the time at which data collection was executed and the time at which the category was counted; if the two points in time are far apart, 90% may be lowered to a reasonable range as appropriate. The 90% figure is a reference value.
This data verification method is similar to stratified sampling in statistics, in which the population is divided into subgroups and sampling is then performed over the subgroups. Although sampling is random and the data under every category is not examined, according to statistical principles this verification method is scientific and reasonable.
The above specific embodiments are only specific cases of the present invention. The scope of patent protection of the invention includes but is not limited to these embodiments; any appropriate change or replacement that conforms to the claims of this network data collection and validation method and is made by a person of ordinary skill in the art shall fall within the scope of patent protection of the present invention.
Claims (4)
1. A network data collection and validation method, characterized in that it is implemented as follows:
First, collect the network data; then classify the information of the internet site from which the data was collected, and sample categories at random;
Count the amount of network data in the selected category, then retrieve the collected data stored in the database through a database operation script and count the collected amount for the selected category;
Compare the two to obtain the coverage rate of the collected data, where coverage rate = collected data amount / actual data amount, to verify whether any data was missed.
2. The network data collection and validation method according to claim 1, characterized in that the process of collecting network data and classifying it is: visit the home page of the collected website in a browser and find the category entry points for the collected information; then select a category link by clicking one category at random at the entry point to enter its page and view the corresponding information, that is, the data that has been collected; then locate the items related to data volume.
3. The network data collection and validation method according to claim 1, characterized in that the information classification applied to the collected data is the classification that already exists on the collected website; clicking the category links on the site allows the site's information to be browsed by category, so that the data total under a category can either be read directly or derived indirectly by calculation.
4. The network data collection and validation method according to claim 3, characterized in that obtaining the amount of network data in the selected category means: when the page explicitly shows the total number of items, that value is taken directly; otherwise, if the page shows only a total of N pages, the number of records per page, M, is determined by observation, and M is multiplied by N to obtain the total data amount in the category, where M and N are positive integers.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610840743.9A CN106570053A (en) | 2016-09-22 | 2016-09-22 | Network data collection and validation method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610840743.9A CN106570053A (en) | 2016-09-22 | 2016-09-22 | Network data collection and validation method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN106570053A true CN106570053A (en) | 2017-04-19 |
Family
ID=58531929
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610840743.9A Pending CN106570053A (en) | 2016-09-22 | 2016-09-22 | Network data collection and validation method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106570053A (en) |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20070244883A1 (en) * | 2006-04-14 | 2007-10-18 | Websidestory, Inc. | Analytics Based Generation of Ordered Lists, Search Engine Fee Data, and Sitemaps |
CN101222349A (en) * | 2007-01-12 | 2008-07-16 | 中国电信股份有限公司 | Method and system for collecting web user action and performance data |
CN104090931A (en) * | 2014-06-25 | 2014-10-08 | 华南理工大学 | Information prediction and acquisition method based on webpage link parameter analysis |
Non-Patent Citations (1)
Title |
---|
李恒训等: "WWW论坛采集关键技术研究", 《微计算机信息》 * |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108681579A (en) * | 2018-05-10 | 2018-10-19 | 北京鼎泰智源科技有限公司 | A kind of big data missing rate analysis method |
CN109685638A (en) * | 2018-12-28 | 2019-04-26 | 广东电网有限责任公司 | A kind of audit coverage measure method, apparatus and storage medium |
CN111008675A (en) * | 2019-12-26 | 2020-04-14 | 口碑(上海)信息技术有限公司 | Method and device for sampling and processing recall area |
CN111008675B (en) * | 2019-12-26 | 2020-11-24 | 口碑(上海)信息技术有限公司 | Method and device for sampling and processing recall area |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN103927398B (en) | The microblogging excavated based on maximum frequent itemsets propagandizes colony's discovery method | |
Arbesser et al. | Visplause: Visual data quality assessment of many time series using plausibility checks | |
CN102567494B (en) | Website classification method and device | |
CN107943838B (en) | Method and system for automatically acquiring xpath generated crawler script | |
CN102521248B (en) | Network user classification method and device | |
CN103023714B (en) | The liveness of topic Network Based and cluster topology analytical system and method | |
CN107111625A (en) | Realize the method and system of the efficient classification and exploration of data | |
CN104462611A (en) | Modeling method, ranking method, modeling device and ranking device for information ranking model | |
CN105469274A (en) | Method and system for comparing goods information of plurality of websites | |
CN106469185A (en) | Method for collecting data in website statistics | |
CN107704621A (en) | A kind of internet public feelings map visualization methods of exhibiting | |
CN106202108B (en) | Web crawlers grabs method for allocating tasks and device and data grab method and device | |
CN104462115A (en) | Spam message identifying method and device | |
CN101819585A (en) | Device and method for constructing forum event dissemination pattern | |
CN103377240B (en) | Information providing method, processing server and merging server | |
CN103970747B (en) | Data processing method for network side computer to order search results | |
CN106570053A (en) | Network data collection and validation method | |
CN107728210A (en) | The determination method and apparatus in road are lacked in multiple instruments gathered data | |
CN102722561B (en) | Method for analyzing webpage exit region and exit reason | |
CN113408207A (en) | Data mining method based on social network analysis technology | |
CN103440328A (en) | User classification method based on mouse behaviors | |
CN110309402A (en) | Detect the method and system of website | |
CN103605744A (en) | Method and device for analyzing website searching engine traffic data | |
CN111754340B (en) | Guarantee network risk investigation system based on graph database | |
Bourqui et al. | Detecting structural changes and command hierarchies in dynamic social networks |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | | Application publication date: 20170419 |