
CN106570053A - Network data collection and validation method - Google Patents

Info

Publication number
CN106570053A
Authority
CN
China
Prior art keywords
data
classification
network data
collection
amount
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201610840743.9A
Other languages
Chinese (zh)
Inventor
王洪添
邢荣
王传超
徐宏伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong Inspur Cloud Service Information Technology Co Ltd
Original Assignee
Shandong Inspur Cloud Service Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shandong Inspur Cloud Service Information Technology Co Ltd
Priority to CN201610840743.9A
Publication of CN106570053A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/3349 Reuse of stored results of previous queries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a method for collecting and verifying network data. The method is implemented as follows: first, network data is collected, the information on the internet sites in the collected data is classified, and categories are sampled at random; the amount of network data in each selected category is counted, the collected data stored in a database is retrieved with a database operation script, and the collected amount in the selected category is counted; the two amounts are then compared to obtain the coverage rate of the collected data, where coverage rate = collected data amount / actual data amount, to verify whether any data was missed during collection. Compared with the prior art, the method draws on the sample survey theory of statistics to verify massive collected data in a scientific and reasonable way, and it also facilitates subsequent data analysis and mining. The method is highly practical, widely applicable, and easy to popularize.

Description

A method for verifying network data collection
Technical field
The present invention relates to the technical field of big data application analysis, and specifically to a practical method for verifying network data collection.
Background
With the rapid growth of the internet and the information industry in recent years, data has penetrated every industry and business function and has become an important factor of production. The mining and use of massive data heralds a new wave of productivity growth and consumer surplus. The concept of big data has taken hold among experts and scholars in all trades and professions and has also attracted wide public attention. At the same time, the network is full of distributed public information: the further opening of government data, the rapid development of e-commerce platforms such as Taobao, and the expansion of online tax handling have all generated massive amounts of information data.
The amount of information on web pages grows day by day, and it is varied in type and complex in structure. Once data collection is complete, a scientific and reasonable verification method is therefore needed to check whether any data was missed. The larger the website, such as a portal or a large e-commerce shopping platform, the harder the verification. The collected information is often spread across different pages, so the total data volume of the whole site is hard to estimate accurately and cannot be compared directly with the number of records in the collection result.
For reasons such as user experience, a large website does not present all of its information in a single form; most sites classify it and group data of the same type together. This suggests a new approach to data verification: randomly sample the categories the website already provides, count the information in each sampled category, and compare it with the amount of same-category information in the collection result. This verifies whether anything was missed and whether the collection covers the full information volume of the site. On this basis, a scientific and efficient method for verifying network data collection is provided.
Summary of the invention
The technical task of the present invention is to address the above deficiencies by providing a practical method for verifying network data collection.
A method for verifying network data collection is implemented as follows:
First, network data is collected; the information on the collected internet site is then classified, and categories are sampled at random;
The amount of network data in each selected category is counted, the collected data stored in the database is then retrieved with a database operation script, and the collected amount for the selected category is counted;
The two amounts are compared to obtain the coverage rate of the collected data, where coverage rate = collected data amount / actual data amount, to verify whether any data was missed.
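The coverage formula above can be sketched in a few lines of Python. This is only an illustration: the function name `coverage_rate` and the example counts are assumptions made here, not part of the patent text.

```python
def coverage_rate(collected_count: int, actual_count: int) -> float:
    """Return collected_count / actual_count for one sampled category."""
    if collected_count < 0 or actual_count <= 0:
        raise ValueError("collected_count must be >= 0 and actual_count must be > 0")
    return collected_count / actual_count


# Hypothetical figures: the site shows 1200 items in a category and the
# database holds 1140 collected records for the same category.
rate = coverage_rate(1140, 1200)
print(f"coverage = {rate:.1%}")
```

A rate close to 1 suggests that little or nothing was missed for that category; a value above 1 would usually point to duplicates in the database or to a stale site count.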
The process of collecting and classifying network data is: open the home page of the collected website in a browser and find the category entry for the collected information; then select a category link by clicking a category at random at the category entry to enter its page and view the corresponding information, that is, the data that has been collected; and locate the item that relates to the data volume.
The information classification applied to the collected data is the classification that already exists on the collected website. Clicking the category links on the website lets the site's information be browsed by category, so that the total data volume of a category can either be read directly or obtained indirectly by calculation.
Obtaining the network data amount of the selected category means: if the web page explicitly shows how many items there are in total, that value is used directly; otherwise, if only a total of N pages is shown, the number of items per page, M, is obtained by observation, and M is multiplied by N to obtain the total amount of data in the category, where M and N are positive integers.
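The two ways of obtaining a category total can be written as a small helper. This is a minimal sketch; the function and parameter names are hypothetical. Note that M × N is an estimate: the last page of a listing may hold fewer than M items, so the product can slightly overstate the true total.

```python
from typing import Optional


def category_total(displayed_total: Optional[int] = None,
                   page_count: Optional[int] = None,
                   items_per_page: Optional[int] = None) -> int:
    """Return the site-side item count for one category.

    Use the total shown on the page when there is one; otherwise fall back
    to N pages times M items per page.
    """
    if displayed_total is not None:
        return displayed_total
    if page_count is None or items_per_page is None:
        raise ValueError("need displayed_total, or both page_count and items_per_page")
    if page_count <= 0 or items_per_page <= 0:
        raise ValueError("page_count (N) and items_per_page (M) must be positive")
    return items_per_page * page_count


# Hypothetical observations for two categories.
print(category_total(displayed_total=532))
print(category_total(page_count=12, items_per_page=20))
```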
The method for verifying network data collection of the present invention has the following advantages:
The verification method effectively verifies data collected from internet sites (especially large websites). Through reasonable sampling and comparative analysis, it yields the degree to which the collection result covers the total data volume of the collected site, and thereby determines whether any data was missed. It verifies the authenticity and diversity of the data and also provides an effective means of verification for subsequent data analysis and mining. After collecting from an internet site with a large information volume and a complex page structure, one category module is chosen by sampling, the total data volume of that module is counted by category and compared with the collected data of the same category, and the steps of the invention are repeated several times to determine whether any data was missed. The accuracy and credibility of the collection result are thus verified in a scientific and reasonable way. The method is practical, widely applicable, and easy to popularize.
Description of the drawings
Figure 1 is a schematic diagram of the implementation of the present invention.
Detailed description
The invention is further described below with reference to the drawing and specific embodiments.
The problem the invention solves is the verification of high-volume information in network data collection for large internet sites, that is, determining whether the volume of the collection result is consistent with the original target website.
Because the information on a large internet site is widely distributed, its total is hard to obtain, and it cannot be verified directly whether the collection result covers all of the site's information. To address one or more problems in the related art, the present invention solves at least one of the above problems by providing a verification method aimed mainly at data collected from large websites. As shown in Figure 1, the method first analyzes how the internet site classifies its information, samples categories at random, and counts the amount of network data in each selected category; it then retrieves the collected data stored in the database with a database operation script and counts the collected amount for the selected category. The two amounts are compared to obtain the coverage rate of the collected data.
It is implemented as follows:
After data collection is complete, first analyze how the collected target site classifies its data, sample categories at random, and count the amount of network data in each selected category. Then retrieve the collection result in the database with an operation script and count the collected amount for the selected category. Compare the two amounts to obtain the coverage rate of the collected data (collected amount / actual amount) and verify whether any data was missed.
In the above steps, the collected target website does not show its site-wide data total directly on a page; because the data volume is very large, and for reasons such as ease of browsing, the website has divided its data into categories.
The data classification is the classification that already exists on the collected website. Clicking the category links on the website lets the site's information be browsed by category, so that the total data volume of a category can either be read directly or obtained indirectly by simple calculation.
The coverage rate of the collected data is the ratio obtained, after categories of the target site have been selected at random, by dividing the collected amount for the selected category by the amount of network data in that category. Allowing for factors such as website updates and the diversity and complexity of the data, a value above 90% can be taken to mean that the data collection is reasonably complete and that no data was missed.
To describe the invention in more detail, it is now explained with reference to Figure 1:
Step 1, access the target page: open the home page of the collected website in a browser and find the category entry for the collected information. Websites usually place it in the sidebar or at the top of the page so that visitors can click it easily.
Step 2, select a category link: click a category at random at the category entry to enter its page and view the corresponding information (the data that has been collected). Because the purpose is to count the total data volume under the category, the specific content of the information can be ignored; locate the item that relates to the data volume, for example the total number of pages, the total number of products, or the total number of records.
Step 3, obtain the data volume of the selected category: if the web page explicitly shows how many items there are in total, use that value directly; otherwise, if only a total of N pages is shown (N is a positive integer), observe the number of items per page, say M (M is a positive integer), and multiply M by N to obtain the total amount of data in the category.
Step 4, retrieve the collection result: retrieve the collection result in the database by category with an operation script and count the collected amount for the selected category.
Step 5, verify the data volume: divide the collected amount for the selected category by the amount of network data in that category to obtain a ratio, the data coverage rate. Allowing for factors such as website updates and the diversity and complexity of the data, a value above 90% can be taken to mean that the data collection is reasonably complete and that no data was missed.
Repeat steps 2 to 5 several times, so that the sample survey verifies whether the collected data covers the whole site.
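Steps 4 and 5, repeated over several sampled categories, could look like the sketch below. The patent does not name a database or a schema, so everything concrete here is an assumption: an in-memory SQLite database stands in for the collection store, and the table `collected_items`, its columns, and the site-side totals are hypothetical.

```python
import sqlite3

THRESHOLD = 0.90  # reference value given in the description


def collected_count(conn: sqlite3.Connection, category: str) -> int:
    """Step 4: count the collected records stored for one category."""
    row = conn.execute(
        "SELECT COUNT(*) FROM collected_items WHERE category = ?", (category,)
    ).fetchone()
    return row[0]


def verify(conn, site_totals, threshold=THRESHOLD):
    """Step 5 for every sampled category.

    site_totals maps category -> total observed on the website (steps 2-3).
    Returns category -> (coverage rate, True if the rate exceeds threshold).
    """
    report = {}
    for category, actual in site_totals.items():
        rate = collected_count(conn, category) / actual
        report[category] = (rate, rate > threshold)
    return report


# Build a small stand-in database with hypothetical collected records.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE collected_items (id INTEGER PRIMARY KEY, category TEXT, title TEXT)"
)
rows = [("books", f"book {i}") for i in range(95)]
rows += [("phones", f"phone {i}") for i in range(40)]
conn.executemany("INSERT INTO collected_items (category, title) VALUES (?, ?)", rows)

# Hypothetical totals read from the website for the sampled categories.
site_totals = {"books": 100, "phones": 50}
report = verify(conn, site_totals)
for category, (rate, ok) in report.items():
    print(f"{category}: coverage {rate:.0%}, {'ok' if ok else 'possible missed data'}")
```

In this made-up data the "phones" category falls below the reference value and would be flagged for a closer look, while "books" passes.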
The invention is not applicable only when a large website provides displayed category entries by default. If none are provided, a reasonable classification can be made manually according to the content of the collected information. For example, when a website provides a search bar and returns information by keyword search, a keyword can be entered to obtain the search results, an operation script can then be used to obtain the amount of collected data that contains the keyword, and the two can be compared. The main verification idea is the same as in this method.
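The keyword variant can be sketched in the same way. The table `collected_items`, the `title` column, and the site-side hit count are hypothetical, and SQLite is again only a stand-in database. A substring match with `LIKE` is also an assumption: a website's own search engine may tokenize or rank differently, so the two counts are only roughly comparable.

```python
import sqlite3


def keyword_count(conn: sqlite3.Connection, keyword: str) -> int:
    """Count collected records whose title contains the keyword."""
    # Escape LIKE wildcards so the keyword is matched literally.
    escaped = (keyword.replace("\\", "\\\\")
                      .replace("%", "\\%")
                      .replace("_", "\\_"))
    row = conn.execute(
        "SELECT COUNT(*) FROM collected_items WHERE title LIKE ? ESCAPE '\\'",
        ("%" + escaped + "%",),
    ).fetchone()
    return row[0]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE collected_items (id INTEGER PRIMARY KEY, title TEXT)")
conn.executemany(
    "INSERT INTO collected_items (title) VALUES (?)",
    [("red phone case",), ("blue phone",), ("laptop stand",),
     ("Phone charger",), ("100% cotton shirt",)],
)

site_hits = 4  # hypothetical number of results the site's search reports for "phone"
collected_hits = keyword_count(conn, "phone")
print(f"coverage for 'phone' = {collected_hits / site_hits:.0%}")
```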
Because the information on an internet site may be updated in real time, the data coverage rate obtained by dividing the collected amount for the selected category by the amount of network data in that category need not exceed 90% before the data can be judged complete. The judgment depends on the time at which data collection was run and the time at which the category counts were taken: if the two points in time are far apart, the 90% figure can be lowered to a reasonable range as appropriate. 90% is a reference value.
This data verification method is similar to stratified sampling theory in statistics: the population is divided into different subgroups, and the subgroups are then sampled. Although the sampling is random and does not select the data under every category, the verification method is scientific and reasonable according to statistical principles.
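The random choice of categories can be made reproducible with a seeded generator, which helps when a verification run has to be repeated or audited. This is a minimal sketch; the category names, the sample size, and the seed are assumptions made for illustration.

```python
import random


def sample_categories(categories, k, seed=None):
    """Pick k distinct categories at random (all of them if k is too large)."""
    rng = random.Random(seed)
    pool = list(categories)
    return rng.sample(pool, min(k, len(pool)))


# Hypothetical category list read from a site's navigation bar.
categories = ["books", "phones", "clothing", "toys", "groceries", "furniture"]
picked = sample_categories(categories, k=3, seed=42)
print(picked)
```

Each picked category would then go through the count-and-compare steps described above; drawing a fresh sample with a different seed repeats the check on other parts of the site.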
The above specific embodiments are only specific examples of the present invention. The scope of patent protection of the invention includes but is not limited to the above specific embodiments; any appropriate change or replacement that conforms to the claims of this method for verifying network data collection and is made by a person of ordinary skill in the art shall fall within the scope of patent protection of the invention.

Claims (4)

1. A method for verifying network data collection, characterized in that it is implemented as follows:
First, network data is collected; the information on the collected internet site is then classified, and categories are sampled at random;
The amount of network data in each selected category is counted, the collected data stored in the database is then retrieved with a database operation script, and the collected amount for the selected category is counted;
The two amounts are compared to obtain the coverage rate of the collected data, where coverage rate = collected data amount / actual data amount, to verify whether any data was missed.
2. The method for verifying network data collection according to claim 1, characterized in that the process of collecting and classifying network data is: open the home page of the collected website in a browser and find the category entry for the collected information; then select a category link by clicking a category at random at the category entry to enter its page and view the corresponding information, that is, the data that has been collected; and locate the item that relates to the data volume.
3. The method for verifying network data collection according to claim 1, characterized in that the information classification applied to the collected data is the classification that already exists on the collected website; clicking the category links on the website lets the site's information be browsed by category, so that the total data volume of a category can either be read directly or obtained indirectly by calculation.
4. The method for verifying network data collection according to claim 3, characterized in that obtaining the network data amount of the selected category means: if the web page explicitly shows how many items there are in total, that value is used directly; otherwise, if only a total of N pages is shown, the number of items per page, M, is obtained by observation, and M is multiplied by N to obtain the total amount of data in the category, where M and N are positive integers.
CN201610840743.9A 2016-09-22 2016-09-22 Network data collection and validation method Pending CN106570053A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610840743.9A CN106570053A (en) 2016-09-22 2016-09-22 Network data collection and validation method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610840743.9A CN106570053A (en) 2016-09-22 2016-09-22 Network data collection and validation method

Publications (1)

Publication Number Publication Date
CN106570053A true CN106570053A (en) 2017-04-19

Family

ID=58531929

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610840743.9A Pending CN106570053A (en) 2016-09-22 2016-09-22 Network data collection and validation method

Country Status (1)

Country Link
CN (1) CN106570053A (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070244883A1 (en) * 2006-04-14 2007-10-18 Websidestory, Inc. Analytics Based Generation of Ordered Lists, Search Engine Fee Data, and Sitemaps
CN101222349A (en) * 2007-01-12 2008-07-16 中国电信股份有限公司 Method and system for collecting web user action and performance data
CN104090931A (en) * 2014-06-25 2014-10-08 华南理工大学 Information prediction and acquisition method based on webpage link parameter analysis

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
李恒训等: "WWW论坛采集关键技术研究", 《微计算机信息》 *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108681579A (en) * 2018-05-10 2018-10-19 北京鼎泰智源科技有限公司 A kind of big data missing rate analysis method
CN109685638A (en) * 2018-12-28 2019-04-26 广东电网有限责任公司 A kind of audit coverage measure method, apparatus and storage medium
CN111008675A (en) * 2019-12-26 2020-04-14 口碑(上海)信息技术有限公司 Method and device for sampling and processing recall area
CN111008675B (en) * 2019-12-26 2020-11-24 口碑(上海)信息技术有限公司 Method and device for sampling and processing recall area

Similar Documents

Publication Publication Date Title
CN103927398B (en) The microblogging excavated based on maximum frequent itemsets propagandizes colony's discovery method
Arbesser et al. Visplause: Visual data quality assessment of many time series using plausibility checks
CN102567494B (en) Website classification method and device
CN107943838B (en) Method and system for automatically acquiring xpath generated crawler script
CN102521248B (en) Network user classification method and device
CN103023714B (en) The liveness of topic Network Based and cluster topology analytical system and method
CN107111625A (en) Realize the method and system of the efficient classification and exploration of data
CN104462611A (en) Modeling method, ranking method, modeling device and ranking device for information ranking model
CN105469274A (en) Method and system for comparing goods information of plurality of websites
CN106469185A (en) Method for collecting data in website statistics
CN107704621A (en) A kind of internet public feelings map visualization methods of exhibiting
CN106202108B (en) Web crawlers grabs method for allocating tasks and device and data grab method and device
CN104462115A (en) Spam message identifying method and device
CN101819585A (en) Device and method for constructing forum event dissemination pattern
CN103377240B (en) Information providing method, processing server and merging server
CN103970747B (en) Data processing method for network side computer to order search results
CN106570053A (en) Network data collection and validation method
CN107728210A (en) The determination method and apparatus in road are lacked in multiple instruments gathered data
CN102722561B (en) Method for analyzing webpage exit region and exit reason
CN113408207A (en) Data mining method based on social network analysis technology
CN103440328A (en) User classification method based on mouse behaviors
CN110309402A (en) Detect the method and system of website
CN103605744A (en) Method and device for analyzing website searching engine traffic data
CN111754340B (en) Guarantee network risk investigation system based on graph database
Bourqui et al. Detecting structural changes and command hierarchies in dynamic social networks

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20170419