
CN108920617A - Determination system and method for data acquisition, and information data processing terminal - Google Patents

Determination system and method for data acquisition, and information data processing terminal

Info

Publication number
CN108920617A
Authority
CN
China
Prior art keywords
website
acquisition
value
text
determination method
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201810690116.0A
Other languages
Chinese (zh)
Other versions
CN108920617B (en)
Inventor
宋俊平
程国艮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Global Tone Communication Technology Co ltd
Original Assignee
Global Tone Communication Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Global Tone Communication Technology Co ltd
Priority to CN201810690116.0A
Publication of CN108920617A
Application granted
Publication of CN108920617B
Legal status: Active
Anticipated expiration

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention belongs to the technical field of computer software and discloses a determination system and method for website data acquisition, and an information data processing terminal. The determination method comprises: sampling and crawling website content; calculating the value of each influencing factor; calculating the website acquisition value from the values of the influencing factors; and determining, according to the website acquisition value, whether to acquire the website continuously. The invention assesses the acquisition value of a website from several aspects, including its subject field, article quality, article update frequency, and proportion of original content. It provides a method for quantifying the evaluation value of each factor that is grounded in solid engineering experience, simple, effective, and easy to operate. It also gives a method for calculating the website acquisition value from these evaluation values, so that the acquisition value of a website can be assessed automatically and quickly. Experiments show that the accuracy of the invention is higher than 99% and that it can be applied to real systems.

Description

Determination system and method for data acquisition, and information data processing terminal
Technical field
The invention belongs to the technical field of computer software, and in particular relates to a determination system and method for website data acquisition, and an information data processing terminal.
Background technique
At present, the prior art commonly used in the industry is as follows. With the rise of big data mining and artificial intelligence technology, the importance of data and content is increasingly recognized by the public. In a large-scale data acquisition system, how to discover new, valuable websites in time and then crawl their content continuously is an urgent problem for current data acquisition systems. Such a system discovers new websites automatically by extracting the links on a page and processing the URL links. Afterwards, the acquisition value of each new website has to be determined, for example whether the website belongs to a certain field and how good the quality of its published content is. Only websites with a high acquisition value are added to the acquisition list, so that their newly published content is crawled periodically. In general, different users define the value of data acquisition differently, and several factors are considered together to decide whether a website is worth acquiring over the long term.
In summary, the problems of the prior art are:
(1) how to assess the acquisition value of a website;
(2) which factors influence the acquisition value of a website;
(3) how to analyze these factors quantitatively.
Difficulty and significance of solving the above technical problems: solving these problems makes it possible to automate website discovery and the determination of website acquisition value, improve the speed and quality of information acquisition, and help users obtain more and better data faster.
Summary of the invention
In view of the problems of the prior art, the present invention provides a determination system and method for website data acquisition, and an information data processing terminal.
The invention is realized as follows. A determination method for website data acquisition comprises: sampling and crawling website content; calculating the value of each influencing factor; calculating the website acquisition value from the values of the influencing factors; and determining, according to the website acquisition value, whether to acquire the website continuously.
Further, the sampled website content is collected with a breadth-first algorithm, which gathers tens of thousands of articles.
Further, the influencing factors are:
(1) text type A, used to determine whether the content published by the website belongs to a field the user is interested in;
(2) text quality assessment Q, which checks whether the text data contains garbled text, JS code text, titles inconsistent with the content, or filler text;
(3) article update frequency F, represented by the website's average number of newly added articles per day;
(4) original content proportion O, represented by the share of original content among all news items.
Further, the text type is determined with a supervised machine learning method: a batch of in-field articles and a batch of out-of-field articles are prepared; a binary classifier is trained with machine learning or deep learning techniques; the trained classifier determines the type of each sampled text of the website; and the proportion of in-field articles among the sampled texts is counted. If the proportion is higher than a specified threshold, the content published by the website is judged to match the user's needs and A=1 is recorded; otherwise A=0 is recorded;
The text quality assessment Q scores the quality of each article with a text quality assessment method based on deep representation, and takes the average quality score of the sampled texts as the website's text quality score. Since the raw quality score ranges over [0,100], it is normalized: the value of Q is the raw text quality score divided by 100;
The article update frequency F is normalized by a formula in terms of Fmin and Fmax, where Fmin and Fmax are obtained from statistics over a large number of websites.
Further, the website acquisition value is calculated as:
V = A * (α * Q + β * F + γ * O);
where α, β and γ are the weights of the three influencing factors text quality assessment, article update frequency and original content proportion, respectively, and α + β + γ = 1; A denotes the text type, Q the text quality assessment, F the article update frequency, and O the original content proportion.
Further, in the acquisition determination, if the website acquisition value V is greater than a specified threshold, the website is added to the periodic acquisition list; otherwise it is not added.
Another object of the present invention is to provide a determination system for website data acquisition that implements the determination method for website data acquisition, the system comprising:
a sampling module, for sampling and crawling website content;
a computing module, for calculating the value of each influencing factor;
a website acquisition value module, for calculating the website acquisition value from the values of the influencing factors;
a determination module, for determining, according to the website acquisition value, whether to acquire the website continuously.
Another object of the present invention is to provide a computer program that implements the determination method for website data acquisition.
Another object of the present invention is to provide an information data processing terminal that implements the determination method for website data acquisition.
Another object of the present invention is to provide a computer-readable storage medium comprising instructions which, when run on a computer, cause the computer to execute the determination method for website data acquisition.
In summary, the advantages and positive effects of the present invention are as follows. The acquisition value of a website is assessed from several aspects, including its subject field, article quality, article update frequency, and proportion of original content. A method for quantifying the evaluation value of each factor is provided; it is grounded in solid engineering experience, simple, effective, and easy to operate. A method for calculating the website acquisition value from these evaluation values is also given, so the acquisition value of a website can be assessed automatically and quickly. Experiments show that the accuracy of the invention is higher than 99% and that it can be applied to real systems.
Brief description of the drawings
Fig. 1 is a schematic structural diagram of the determination system for website data acquisition provided by an embodiment of the present invention;
In the figure: 1, sampling module; 2, computing module; 3, website acquisition value module; 4, determination module.
Fig. 2 is a flow chart of the determination method for website data acquisition provided by an embodiment of the present invention.
Fig. 3 is an implementation flow chart of the determination method for website data acquisition provided by an embodiment of the present invention.
Detailed description of the embodiments
To make the objectives, technical solutions, and advantages of the present invention clearer, the invention is described in further detail below with reference to embodiments. It should be understood that the specific embodiments described here only explain the present invention and are not intended to limit it.
The invention assesses the acquisition value of a website from several aspects, including its subject field, article quality, article update frequency, and proportion of original content. It provides a method for quantifying the evaluation value of each factor that is grounded in solid engineering experience, simple, effective, and easy to operate.
As shown in Fig. 1, the determination system for website data acquisition provided by an embodiment of the present invention comprises:
a sampling module 1, for sampling and crawling website content;
a computing module 2, for calculating the value of each influencing factor;
a website acquisition value module 3, for calculating the website acquisition value from the values of the influencing factors;
a determination module 4, for determining, according to the website acquisition value, whether to acquire the website continuously.
As shown in Fig. 2, the determination method for website data acquisition provided by an embodiment of the present invention comprises the following steps:
S201: sample and crawl website content;
S202: calculate the value of each influencing factor;
S203: calculate the website acquisition value from the values of the influencing factors;
S204: determine, according to the website acquisition value, whether to acquire the website continuously.
As shown in Fig. 3, the determination method for website data acquisition provided by an embodiment of the present invention specifically comprises the following steps:
Step 1, website sampling
A portion of the website's articles is crawled for use in calculating the website acquisition value. A breadth-first algorithm is suggested, collecting tens of thousands of articles.
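The breadth-first sampling in Step 1 can be sketched as follows. This is only an illustrative sketch: the function name sample_site_bfs, the get_links callback and the toy link graph are assumptions introduced here, and a real crawler would fetch pages over the network and extract their links.

```python
from collections import deque
from typing import Callable, Iterable, List


def sample_site_bfs(start_url: str,
                    get_links: Callable[[str], Iterable[str]],
                    max_pages: int) -> List[str]:
    """Visit pages breadth-first from start_url, stopping at max_pages."""
    seen = {start_url}           # URLs already queued, to avoid revisits
    queue = deque([start_url])   # FIFO queue gives breadth-first order
    sampled: List[str] = []
    while queue and len(sampled) < max_pages:
        url = queue.popleft()
        sampled.append(url)
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return sampled


# Toy in-memory link graph standing in for a real website (hypothetical).
toy_site = {"/": ["/a", "/b"], "/a": ["/c"], "/b": ["/c", "/d"], "/c": [], "/d": []}
pages = sample_site_bfs("/", lambda u: toy_site.get(u, []), max_pages=4)
```

The max_pages cap plays the role of the sample size; the text suggests a sample on the order of tens of thousands of articles.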
Step 2, quantitative analysis of the influencing factors
As shown in Fig. 3, the website acquisition value is mainly influenced by four factors: whether the text content belongs to a specified field, the text quality, the article update frequency, and the original content proportion.
(1) Text type
The text type is mainly used to determine whether the content published by the website belongs to a field the user is interested in, for example whether it is news, or whether it belongs to science and technology or finance; if it does not, the website is not acquired.
The text type (A) is determined mainly with text classification techniques. The present invention uses a supervised method: first a batch of in-field articles and a batch of out-of-field articles are prepared, and then a binary classifier is trained with machine learning or deep learning techniques. The trained classifier determines the type of each sampled text of the website. Finally, the proportion of in-field articles among the sampled texts is counted; if the proportion is higher than a specified threshold (95% or more is recommended), the content published by the website is considered to match the user's needs, i.e. A=1; otherwise A=0.
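The thresholding that turns classifier outputs into A might look like the sketch below. The classifier itself is out of scope here: is_in_domain is a hypothetical stand-in for the trained binary classifier, and the keyword test in the usage example is only a placeholder for it.

```python
from typing import Callable, Sequence


def text_type_flag(texts: Sequence[str],
                   is_in_domain: Callable[[str], bool],
                   threshold: float = 0.95) -> int:
    """Return A=1 if the share of in-field texts exceeds threshold, else A=0."""
    if not texts:
        return 0  # assumption: an empty sample is treated as not matching
    in_domain = sum(1 for text in texts if is_in_domain(text))
    return 1 if in_domain / len(texts) > threshold else 0


# Placeholder classifier: a keyword test instead of a trained model.
def looks_like_finance(text: str) -> bool:
    return "finance" in text


mostly_finance = ["finance news"] * 97 + ["sports news"] * 3
mixed = ["finance news"] * 90 + ["sports news"] * 10
a_high = text_type_flag(mostly_finance, looks_like_finance)
a_low = text_type_flag(mixed, looks_like_finance)
```

Because A multiplies the whole weighted sum in the acquisition value formula, a website that fails this check gets an acquisition value of 0 regardless of its other scores.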
(2) Text quality
Text quality (Q) mainly assesses whether the text data exhibits problems such as garbled text, JS code text, titles inconsistent with the content, or filler text. A text quality assessment method based on deep representation (application number: 201810028932.5) is used to score the quality of each article, and the average quality score of the sampled texts is taken as the website's text quality score. Since the raw quality score ranges over [0,100], it is normalized: the value of Q is the raw text quality score divided by 100.
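As a small worked example of the averaging and normalization of Q, assuming the per-article scores in [0,100] have already been produced by the quality assessment method (the function name text_quality and the sample scores are illustrative only):

```python
from statistics import fmean
from typing import Sequence


def text_quality(raw_scores: Sequence[float]) -> float:
    """Average the per-article scores in [0, 100] and scale the mean to [0, 1]."""
    if not raw_scores:
        raise ValueError("at least one sampled article score is required")
    return fmean(raw_scores) / 100.0


q = text_quality([80, 90, 70])  # mean score 80 gives Q of about 0.8
```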
(3) Article update frequency
How quickly a website updates its content is an important indicator of its acquisition value; a website that has not been updated for a long time does not need to be acquired continuously. To improve practicality, the present invention does not use a detection method that tracks changes to web pages; instead it counts the time distribution of the sampled texts of the website and represents the article update frequency by the website's average number of newly added articles per day. In addition, for uniform data processing, the update frequency (F) is normalized by a formula in terms of Fmin and Fmax, where Fmin and Fmax are obtained by the acquisition system from statistics over a large number of websites.
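The normalization formula itself is not reproduced in this text. The symbols Fmin and Fmax suggest min-max scaling, so the sketch below assumes F = (f - Fmin) / (Fmax - Fmin), where f is the raw average number of new articles per day; that formula, the clamping to [0, 1], the inclusive day count, and all names and sample values are assumptions rather than the patent's stated method.

```python
from datetime import date
from typing import Sequence


def average_daily_new(publish_dates: Sequence[date]) -> float:
    """Articles per day over the inclusive date span covered by the sample."""
    if not publish_dates:
        return 0.0
    span_days = (max(publish_dates) - min(publish_dates)).days + 1
    return len(publish_dates) / span_days


def normalize_frequency(f_raw: float, f_min: float, f_max: float) -> float:
    """Assumed min-max scaling of the raw frequency, clamped to [0, 1]."""
    if f_max <= f_min:
        raise ValueError("f_max must be greater than f_min")
    scaled = (f_raw - f_min) / (f_max - f_min)
    return min(1.0, max(0.0, scaled))


# Ten sampled articles spread over five days (hypothetical data).
dates = [date(2018, 6, 1 + i % 5) for i in range(10)]
f_raw = average_daily_new(dates)
f_norm = normalize_frequency(f_raw, f_min=0.0, f_max=10.0)
```

Clamping keeps F in the same [0, 1] range as Q when a website falls outside the bounds observed across other websites.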
(4) Original content proportion
A website with a high share of original content has a higher acquisition value. To calculate the original content proportion, it is first necessary to distinguish which content is original.
The present invention uses a rule-based method and judges whether an article is reprinted or original by two factors. 1. Labels such as "source" that indicate where an article comes from. An article page usually contains a label such as "source" to indicate the article's origin, so the web page tags are traversed first; if such a label is present and its content does not match the current website, the article is marked "reprinted", otherwise it is marked "original". 2. An original article usually carries a byline such as "reporter XXX" at its end, so keyword matching is applied, and if the end of the article contains such a keyword the article is marked "original". 3. If neither factor is present in the page, the article defaults to "original".
The original content proportion (O) is calculated by counting, for each day, the ratio of original content among the newly added articles in the sampled texts, and averaging these ratios.
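The rules and the daily averaging can be sketched as below. Several details are assumptions made for illustration: the rules are applied in the listed order with the first match deciding, a source label "matches" the website when it contains the site name, the English keyword "reporter" stands in for the byline keyword, and each article is represented as a (day, is_original) pair.

```python
from collections import defaultdict
from statistics import fmean
from typing import Dict, Iterable, List, Optional, Tuple


def is_original(source_label: Optional[str], site_name: str, tail_text: str) -> bool:
    """Classify one article as original (True) or reprinted (False)."""
    # Rule 1: a source label decides; a label naming another site means reprint.
    if source_label is not None:
        return site_name in source_label
    # Rule 2: a byline keyword at the end of the article marks it as original.
    if "reporter" in tail_text.lower():
        return True
    # Rule 3: with neither signal present, the article defaults to original.
    return True


def original_ratio(articles: Iterable[Tuple[str, bool]]) -> float:
    """Average, over days, of the per-day share of original articles."""
    per_day: Dict[str, List[bool]] = defaultdict(list)
    for day, original in articles:
        per_day[day].append(original)
    if not per_day:
        raise ValueError("at least one sampled article is required")
    return fmean([sum(flags) / len(flags) for flags in per_day.values()])


sample = [
    ("2018-06-01", is_original("Source: Other Daily", "Example News", "")),
    ("2018-06-01", is_original(None, "Example News", "Reporter XXX")),
    ("2018-06-02", is_original("Source: Example News", "Example News", "")),
]
o = original_ratio(sample)  # day 1: 1 of 2 original, day 2: 1 of 1 original
```

Averaging per-day ratios, rather than pooling all articles, gives each day equal weight regardless of how many articles were published on it.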
Step 3, website acquisition value
The acquisition value of the website is calculated from the four evaluation values above, using the following formula:
V = A * (α * Q + β * F + γ * O);
where α, β and γ are the weights of the three influencing factors text quality assessment, article update frequency and original content proportion, respectively, and α + β + γ = 1; A denotes the text type, Q the text quality assessment, F the article update frequency, and O the original content proportion.
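A direct transcription of this formula, with a worked example, is shown below. The weight values 0.5, 0.3 and 0.2 and the factor values are hypothetical; the text only requires that the three weights sum to 1.

```python
def acquisition_value(a: int, q: float, f: float, o: float,
                      alpha: float, beta: float, gamma: float) -> float:
    """Compute V = A * (alpha*Q + beta*F + gamma*O), with weights summing to 1."""
    if abs(alpha + beta + gamma - 1.0) > 1e-9:
        raise ValueError("alpha + beta + gamma must equal 1")
    return a * (alpha * q + beta * f + gamma * o)


# In-field website: 0.5*0.8 + 0.3*0.2 + 0.2*0.75 = 0.61
v_in_field = acquisition_value(1, 0.8, 0.2, 0.75, alpha=0.5, beta=0.3, gamma=0.2)
# Out-of-field website: A = 0 forces V to 0 whatever the other factors are.
v_out_of_field = acquisition_value(0, 0.8, 0.2, 0.75, alpha=0.5, beta=0.3, gamma=0.2)
```

Since Q, F and O each lie in [0, 1] and the weights sum to 1, V also lies in [0, 1], which makes a single threshold meaningful across websites.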
Step 4, acquisition determination
If the website acquisition value V is greater than a specified threshold, the website is added to the periodic acquisition list; otherwise it is not added.
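The final decision reduces to a threshold comparison. In the sketch below the threshold of 0.6, the site names, and the use of a set for the acquisition list are all assumptions for illustration; the text does not specify a threshold value.

```python
from typing import Set


def update_acquisition_list(acquisition_list: Set[str], site: str,
                            value: float, threshold: float) -> bool:
    """Add site to the periodic acquisition list if value exceeds threshold."""
    if value > threshold:
        acquisition_list.add(site)
        return True
    return False


periodic_sites: Set[str] = set()
added_a = update_acquisition_list(periodic_sites, "site-a.example", 0.61, threshold=0.6)
added_b = update_acquisition_list(periodic_sites, "site-b.example", 0.40, threshold=0.6)
```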
The above embodiments may be implemented wholly or partly in software, hardware, firmware, or any combination thereof. When implemented wholly or partly as a computer program product, the computer program product comprises one or more computer instructions. When the computer program instructions are loaded or executed on a computer, the processes or functions described in the embodiments of the present invention are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable device. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wired means (such as coaxial cable, optical fiber, or Digital Subscriber Line (DSL)) or wireless means (such as infrared, radio, or microwave). The computer-readable storage medium may be any usable medium that a computer can access, or a data storage device such as a server or data center that integrates one or more usable media. The usable medium may be a magnetic medium (for example, a floppy disk, hard disk, or magnetic tape), an optical medium (for example, a DVD), or a semiconductor medium (for example, a Solid State Disk (SSD)).
The above are merely preferred embodiments of the present invention and are not intended to limit it; any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within its protection scope.

Claims (10)

1. A determination method for data acquisition, characterized in that the method comprises:
sampling and crawling website content;
calculating the value of each influencing factor;
calculating the website acquisition value from the values of the influencing factors;
determining, according to the website acquisition value, whether to acquire the website continuously.
2. The determination method for data acquisition according to claim 1, characterized in that the sampled website content is collected with a breadth-first algorithm, which gathers tens of thousands of articles.
3. The determination method for data acquisition according to claim 1, characterized in that the influencing factors are:
(1) text type, used to determine whether the content published by the website belongs to a field the user is interested in;
(2) text quality assessment, which checks whether the text data contains garbled text, JS code text, titles inconsistent with the content, or filler text;
(3) article update frequency, represented by the website's average number of newly added articles per day;
(4) original content proportion, for which original content is distinguished.
4. The determination method for data acquisition according to claim 3, characterized in that the text type is determined with a supervised method: a batch of in-field articles and a batch of out-of-field articles are prepared; a binary classifier is trained with machine learning or deep learning techniques; the trained classifier determines the type of each sampled text of the website; and the proportion of in-field articles among the sampled texts is counted;
the text quality is scored for each article with a text quality assessment method based on deep representation, and the average quality score of the sampled texts is taken as the website's text quality score; since the raw quality score ranges over [0,100], it is normalized: the value of Q is the raw text quality score divided by 100;
the article update frequency F is normalized by a formula in terms of Fmin and Fmax, where Fmin and Fmax are obtained from statistics over a large number of websites.
5. The determination method for data acquisition according to claim 1, characterized in that the website acquisition value is calculated as:
V = A * (α * Q + β * F + γ * O);
where α, β and γ are the weights of the three influencing factors, and α + β + γ = 1.
6. The determination method for data acquisition according to claim 1, characterized in that, in the acquisition determination, if the website acquisition value V is greater than a specified threshold, the website is added to the periodic acquisition list; otherwise it is not added.
7. A determination system for data acquisition that implements the determination method for data acquisition according to claim 1, characterized in that the determination system for website data acquisition comprises:
a sampling module, for sampling and crawling website content;
a computing module, for calculating the value of each influencing factor;
a website acquisition value module, for calculating the website acquisition value from the values of the influencing factors;
a determination module, for determining, according to the website acquisition value, whether to acquire the website continuously.
8. A computer program that implements the determination method for data acquisition according to any one of claims 1 to 6.
9. An information data processing terminal that implements the determination method for data acquisition according to any one of claims 1 to 6.
10. A computer-readable storage medium comprising instructions which, when run on a computer, cause the computer to execute the determination method for data acquisition according to any one of claims 1 to 6.
CN201810690116.0A 2018-06-28 2018-06-28 Data acquisition judging system and method and information data processing terminal Active CN108920617B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810690116.0A CN108920617B (en) 2018-06-28 2018-06-28 Data acquisition judging system and method and information data processing terminal

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810690116.0A CN108920617B (en) 2018-06-28 2018-06-28 Data acquisition judging system and method and information data processing terminal

Publications (2)

Publication Number Publication Date
CN108920617A (en) 2018-11-30
CN108920617B (en) 2022-07-12

Family

ID=64422052

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810690116.0A Active CN108920617B (en) 2018-06-28 2018-06-28 Data acquisition judging system and method and information data processing terminal

Country Status (1)

Country Link
CN (1) CN108920617B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100185619A1 (en) * 2009-01-22 2010-07-22 Alibaba Group Holding Limited Sampling analysis of search queries
CN105117501A (en) * 2015-10-09 2015-12-02 广州神马移动信息科技有限公司 Web crawler scheduling method and web crawler system applying same
CN105786799A (en) * 2016-03-21 2016-07-20 成都寻道科技有限公司 Web article originality judgment method
CN105824806A (en) * 2016-06-13 2016-08-03 腾讯科技(深圳)有限公司 Quality evaluation method and device for public accounts
CN106649871A (en) * 2017-01-03 2017-05-10 广州爱九游信息技术有限公司 Detection method, apparatus and computing equipment for repetition degree of articles
CN107577688A (en) * 2017-04-25 2018-01-12 上海市互联网信息办公室 Original article influence power analysis system based on media information collection

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109872181A (en) * 2019-01-08 2019-06-11 博拉网络股份有限公司 A kind of business information processing method, device and storage medium
CN109872181B (en) * 2019-01-08 2024-01-19 博拉网络股份有限公司 Commercial information processing method, device and storage medium
CN110427577A (en) * 2019-06-26 2019-11-08 五八有限公司 Impact evaluation method, apparatus, electronic equipment and the storage medium of content
CN110852718A (en) * 2019-11-12 2020-02-28 江苏税软软件科技有限公司 Method for carrying out system prejudgment on evidence obtaining computer
CN111680203A (en) * 2020-05-07 2020-09-18 支付宝(杭州)信息技术有限公司 Data acquisition method and device and electronic equipment
CN111680203B (en) * 2020-05-07 2023-04-18 支付宝(杭州)信息技术有限公司 Data acquisition method and device and electronic equipment
CN113343064A (en) * 2021-06-18 2021-09-03 北京百度网讯科技有限公司 Data processing method, device, equipment, storage medium and computer program product
CN113343064B (en) * 2021-06-18 2023-07-28 北京百度网讯科技有限公司 Data processing method, apparatus, device, storage medium, and computer program product
CN114936311A (en) * 2022-04-28 2022-08-23 中译语通科技股份有限公司 News website crawling updating method, system, medium, equipment and terminal

Also Published As

Publication number Publication date
CN108920617B (en) 2022-07-12

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant