CN107704538A - Junk text processing method, apparatus, device and storage medium
- Publication number
- CN107704538A (application number CN201710865928.XA)
- Authority
- CN
- China
- Prior art keywords
- url
- pending
- default
- value element
- request address
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/955—Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
- G06F16/9566—URL specific, e.g. using aliases, detecting broken or misspelled links
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/953—Querying, e.g. by the use of web search engines
- G06F16/9535—Search customisation based on user profiles and personalisation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Information Transfer Between Computers (AREA)
Abstract
The invention discloses a junk text processing method, apparatus, device and storage medium. The method includes: obtaining the uniform resource locators (URLs) of Hypertext Transfer Protocol (HTTP) data within a preset time, each URL including a request address and request parameters; screening the request addresses based on a preset screening rule to obtain pending URLs, and selecting the corresponding pending entity files according to the request addresses of the pending URLs; performing word segmentation on the pending entity files using a preset word segmentation algorithm to generate word segmentation results, identifying the value elements in the word segmentation results, and performing statistical analysis on the value elements to generate statistical results; and determining, according to the statistical results, whether to add the corresponding pending URL to a filter URL list. The embodiments of the invention address the prior-art problems of low junk text recognition accuracy and heavy load on the online system, minimizing the load on the online system and improving the reliability of recognition.
Description
Technical field
Embodiments of the present invention relate to computer technology, and in particular to a junk text processing method, apparatus, device and storage medium.
Background art
Text mining is an interdisciplinary field involving data mining, machine learning, pattern recognition, artificial intelligence, statistics, computational linguistics, computer network technology, informatics and other fields. Text mining is a set of methods and tools for discovering implicit knowledge and patterns in large numbers of documents. It developed from data mining but differs from traditional data mining in many respects. The objects of text mining are massive, heterogeneous and distributed documents; moreover, document content is written in human natural language and lacks semantics that a computer can understand. In practice, a large amount of data follows certain patterns, but the patterns are not uniform with one another, and unstructured data in various patterns is mixed together, for example HTTP POST data and cookies. The volume of such data is huge, yet most of it is worthless; it is referred to as junk text. Junk text occupies a large amount of storage space and seriously degrades system performance. Two junk text processing methods are mainly used at present. The first is document classification: text features are selected, a model is trained on data labeled in advance, and the trained model is used to judge whether the text under examination is junk text. The second is rule-based filtering: text is filtered according to rules set in advance by business experts.
For document classification, the content of HTTP POST request data is highly varied and has no stable format features or key features, so it is difficult to select usable features for learning and training. In addition, the precision of document classification is limited, so valuable data may be discarded. For rule-based filtering, the rules usually have to be determined in advance, but clear rules are hard to find in disorderly HTTP POST data; a large number of rules is needed to improve precision, and deploying those rules in the online environment reduces the processing performance of the system.
No effective solution to the above problems has yet been proposed.
Summary of the invention
The present invention provides a junk text processing method, apparatus, device and storage medium, so as to minimize the load on the online system and improve the reliability of recognition.
In a first aspect, an embodiment of the present invention provides a junk text processing method, the method including:
obtaining the uniform resource locators (URLs) of Hypertext Transfer Protocol (HTTP) data within a preset time, each URL including a request address and request parameters;
screening the request addresses based on a preset screening rule to obtain pending URLs, and selecting the corresponding pending entity files according to the request addresses of the pending URLs;
performing word segmentation on the pending entity files using a preset word segmentation algorithm to generate word segmentation results, identifying the value elements in the word segmentation results, and performing statistical analysis on the value elements to generate statistical results;
determining, according to the statistical results, whether to add the corresponding pending URL to a filter URL list.
Further, screening the request addresses based on a preset screening rule to obtain pending URLs includes:
grouping the HTTP data by request address and counting the number of HTTP data items in each group;
sorting the request addresses of the groups in descending order of that number, calculating the cumulative share, and selecting the URLs corresponding to the request addresses that satisfy a preset cumulative share as the pending URLs.
Further, identifying the value elements in the word segmentation results and performing statistical analysis on the value elements to generate statistical results includes:
matching the word segmentation results against preset value elements based on preset recognition rules to obtain the value elements in the word segmentation results;
counting the number of times the value elements occur in the pending entity file to generate the statistical result.
Further, determining, according to the statistical results, whether to add the corresponding pending URL to the filter URL list includes:
when the statistical result is less than a preset recommendation result, obtaining the pending URL and the pending entity file corresponding to the statistical result;
recommending the pending URL and the pending entity file for manual judgment, and obtaining the manual judgment result;
determining, based on the manual judgment result, whether to add the pending URL to the filter URL list.
Further, the preset word segmentation algorithm includes the reverse maximum matching algorithm.
In a second aspect, an embodiment of the present invention further provides a junk text processing apparatus, the apparatus including:
a URL acquisition module, configured to obtain the URLs of the HTTP data within a preset time, each URL including a request address and request parameters;
a pending entity file acquisition module, configured to screen the request addresses based on a preset screening rule to obtain pending URLs, and to select the corresponding pending entity files according to the request addresses of the pending URLs;
a statistical result generation module, configured to perform word segmentation on the pending entity files using a preset word segmentation algorithm to generate word segmentation results, to identify the value elements in the word segmentation results, and to perform statistical analysis on the value elements to generate statistical results;
a filter URL list generation module, configured to determine, according to the statistical results, whether to add the corresponding pending URL to the filter URL list.
Further, the pending entity file acquisition module includes:
a grouped-statistics unit, configured to group the HTTP data by request address and to count the number of HTTP data items in each group;
a pending URL acquisition unit, configured to sort the request addresses of the groups in descending order of that number, to calculate the cumulative share, and to select the URLs corresponding to the request addresses that satisfy a preset cumulative share as the pending URLs.
Further, the statistical result generation module includes:
a value element recognition unit, configured to match the word segmentation results against preset value elements based on preset recognition rules to obtain the value elements in the word segmentation results;
a statistical result generation unit, configured to count the number of times the value elements occur in the pending entity file to generate the statistical result.
In a third aspect, an embodiment of the present invention further provides a device, the device including:
one or more processors;
a memory, configured to store one or more programs;
when the one or more programs are executed by the one or more processors, the one or more processors implement the junk text processing method described above.
In a fourth aspect, an embodiment of the present invention further provides a computer-readable storage medium on which a computer program is stored, the program implementing the junk text processing method described above when executed by a processor.
The present invention obtains the URLs (Uniform Resource Locators) of the HTTP (Hypertext Transfer Protocol) data within a preset time, each URL including a request address and request parameters; screens the request addresses based on a preset screening rule to obtain pending URLs and selects the corresponding pending entity files according to the request addresses of the pending URLs; performs word segmentation on the pending entity files using a preset word segmentation algorithm to generate word segmentation results, identifies the value elements in the word segmentation results, and performs statistical analysis on the value elements to generate statistical results; and then determines, according to the statistical results, whether to add the corresponding pending URL to the filter URL list. This addresses the prior-art problems of low junk text recognition accuracy and heavy load on the online system, minimizing the load on the online system and improving the reliability of recognition.
Brief description of the drawings
Fig. 1 is a flowchart of a junk text processing method in Embodiment 1 of the present invention;
Fig. 2 is a flowchart of a junk text processing method in Embodiment 2 of the present invention;
Fig. 3 is a schematic structural diagram of a junk text processing apparatus in Embodiment 3 of the present invention;
Fig. 4 is a schematic structural diagram of a device in Embodiment 4 of the present invention.
Detailed description
The present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here are intended only to explain the present invention and not to limit it. It should also be noted that, for ease of description, the drawings show only the parts related to the present invention rather than the entire structure.
Embodiment 1
Fig. 1 is a flowchart of a junk text processing method provided by Embodiment 1 of the present invention. This embodiment is applicable to accurately identifying and filtering out text that contains no valuable information. The method may be performed by a junk text processing apparatus, which may be implemented in software and/or hardware and may be configured in a terminal, typically a mobile phone, a computer, a tablet computer or the like. As shown in Fig. 1, the method specifically includes the following steps:
Step S110: obtain the URLs of the HTTP data within a preset time, each URL including a request address and request parameters.
In a specific embodiment of the present invention, to ensure data coverage, the preset time is preferably one week, or at least one day, of data; the specific preset time may of course depend on the actual situation and is not specifically limited here. HTTP is an application-layer protocol mainly used to transfer hypertext from a World Wide Web (WWW) server to a local browser. A URL is also called a Web address, commonly known as a "website address". The general format of a URL consists of the following basic parts: scheme (or protocol) + "://" + host domain name (or IP address) + ":" + port number + directory path + file name, for example "scheme://authority/path?query". Examples are "http://www.sogou.com/sie?hdq=AQxRG-0000&query=URL&ie=utf8" and "http://www.amdc.m.taobao.com/amdc/mobileDispatch?platform=android&v=3.1&deviceId=&appkey=umeng%3A58b7fafb07fe6513dc001456". In addition, the complete common URL syntax, including the authority part, is as follows: protocol://username:password@subdomain.domain.top-level-domain:port/directory/filename.suffix?parameter=value#fragment. An example is "http://blog.163.com/xianyu_405@126/blog/static/161729131201082614930373/".
Specifically, a URL includes a request address and request parameters, separated by "?": the part before "?" is the request address and the part after "?" is the request parameters. For example, in "http://www.sogou.com/sie?hdq=AQxRG-0000&query=URL&ie=utf", "http://www.sogou.com/sie" is the request address and "hdq=AQxRG-0000&query=URL&ie=utf" is the request parameters. Request parameters may be passed as key-value pairs in the form key=value, with "&" separating the pairs.
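The split into request address and request parameters can be sketched with Python's standard `urllib.parse` module. This is a minimal illustration rather than the patent's implementation; the function name `split_url` is hypothetical, and the sample URL is the Sogou example from step S110.

```python
from urllib.parse import urlsplit, parse_qsl


def split_url(url):
    """Split a URL into its request address and its key=value request parameters."""
    parts = urlsplit(url)
    # Request address: everything before "?" (scheme, host and path).
    request_address = f"{parts.scheme}://{parts.netloc}{parts.path}"
    # Request parameters: key=value pairs separated by "&"; empty values are kept.
    request_params = parse_qsl(parts.query, keep_blank_values=True)
    return request_address, request_params


address, params = split_url("http://www.sogou.com/sie?hdq=AQxRG-0000&query=URL&ie=utf8")
print(address)
print(params)
```

Keeping blank values matters here because several of the example URLs carry empty parameters such as "imgfile=".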
It should be noted that the content of HTTP data with the same request address generally has the same structure. Take, for example, "http://s.taobao.com/search?q=水果&imgfile=&commend=all&ssid=s5-e&search_type=item&sourceId=tb.index&spm=a21bo.50862.201856-taobao-item.1&ie=utf8&initiative_id=tbindexz_20170916" and "http://s.taobao.com/search?q=%E6%B0%B4%E6%9E%9C%E6%96%B0%E9%B2%9C&imgfile=&commend=all&ssid=s5-e&search_type=item&sourceId=tb.index&spm=a21bo.50862.201856-taobao-item.1&ie=utf8&initiative_id=tbindexz_20170916". Both URLs have the same request address "http://s.taobao.com/search", and entering them in a browser shows that the requested content is similar, both being about "fruit". This also provides the basis for the subsequent screening of request addresses based on the preset screening rule to obtain pending URLs.
Step S120: screen the request addresses based on a preset screening rule to obtain pending URLs, and select the corresponding pending entity files according to the request addresses of the pending URLs.
In a specific embodiment of the present invention, the preset screening rule is a data processing rule determined in advance according to the actual situation or through continuous testing and refinement, with the aim of reducing the amount of data processed subsequently. Optionally, the preset screening rule is built on the fact that the content of the HTTP data whose URLs share the same request address has the same structure. Accordingly, the preset screening rule may be a grouped-statistics rule, a probabilistic-model rule or the like.
Step S130: perform word segmentation on the pending entity files using a preset word segmentation algorithm to generate word segmentation results, identify the value elements in the word segmentation results, and perform statistical analysis on the value elements to generate statistical results.
In a specific embodiment of the present invention, word segmentation is the process of recombining a continuous character sequence into a word sequence according to a given specification. In written English, spaces serve as natural delimiters between words, as in "I say a boy". In Chinese, characters, sentences and paragraphs can be delimited simply by obvious delimiters, but words have no formal delimiter, so dividing a sentence such as "this song is too bland" into a word sequence is considerably more complicated than in English. How a computer determines, for a given sentence, which character strings are words and which are not is the task of a word segmentation algorithm. Optionally, the preset word segmentation algorithm falls into one of three broad categories: algorithms based on string matching, algorithms based on understanding, and algorithms based on statistics. A segmentation algorithm based on string matching, also called a mechanical segmentation algorithm, matches the Chinese character string to be analyzed against the entries of a "sufficiently large" machine dictionary according to a certain strategy; if a string is found in the dictionary, the match succeeds (a word is identified). The basic idea of a segmentation algorithm based on understanding is to carry out syntactic and semantic analysis while segmenting, and to resolve ambiguity using syntactic and semantic information. The basic idea of a segmentation algorithm based on statistics is that the frequency or probability with which characters co-occur adjacently is a good indicator of how credibly they form a word. The frequency of each combination of adjacently co-occurring characters in a corpus is counted and their co-occurrence information is computed; this information reflects how tightly the Chinese characters are bound together. When the tightness exceeds a certain threshold, the character group may be considered to form a word and is added to the candidate word sequence, and the final determination is then made by manual inspection. Such an algorithm only needs to count character-group frequencies in the corpus and does not require a segmentation dictionary. Commonly used segmentation algorithms based on string matching include the forward maximum matching algorithm, the reverse maximum matching algorithm (Reverse Maximum Matching Method, RMM) and the minimum segmentation matching algorithm. Preferably, the pending entity files are segmented using the reverse maximum matching algorithm, which is used as the example below.
Specifically, because Chinese contains many modifier-head phrases, matching can be performed in reverse, that is, from back to front, to reduce the error rate. The basic procedure of the reverse maximum matching algorithm is as follows:
Step 1: let MaxLen be the number of characters in the longest entry of the segmentation dictionary, let S1 be the Chinese character string to be segmented, and set S2 = "".
Step 2: judge whether S1 is empty; if so, go to Step 7; otherwise go to Step 3.
Step 3: take no more than MaxLen characters from the right end of S1 as the candidate string W.
Step 4: look up the segmentation dictionary. If W is in the dictionary, the match succeeds: set S2 = "/" + W + S2 and S1 = S1 - W, and go to Step 2. If W is not in the dictionary, the match fails: go to Step 5.
Step 5: remove the leftmost character of W to obtain a new W.
Step 6: judge whether W is a single character. If so, treat W as a word in the same way as a successful match (S2 = "/" + W + S2, S1 = S1 - W) and go to Step 2; otherwise go to Step 4.
Step 7: output the result S2.
The most distinctive feature of the algorithm is that, when the dictionary lookup fails, a character is removed from the left. For example, for a string ABCD in the text, if CD and BCD are both in the dictionary but ABCD is not, the segmentation A/BCD is taken.
For example, let the Chinese character string to be segmented be S1 = "计算语言学课程有意思" ("the computational linguistics course is interesting"), let the number of characters in the longest dictionary entry be MaxLen = 5, and let S2 = "" with the separator "/".
S1 is not empty, so the candidate string W = "课程有意思" is taken from the right end of S1. W is not in the segmentation dictionary, so its leftmost character is removed, giving W = "程有意思"; this is not in the dictionary, so the leftmost character is removed again, giving W = "有意思"; this is not in the dictionary either, giving W = "意思". "意思" is in the dictionary, so W is added to S2, giving S2 = "/意思", and W is removed from S1, leaving S1 = "计算语言学课程有".
S1 is not empty, so W = "言学课程有" is taken from the right end of S1. W is not in the dictionary, and removing the leftmost character repeatedly gives W = "学课程有", then W = "课程有", then W = "程有", none of which is in the dictionary, and finally W = "有". W is now a single character, so it is added to S2, giving S2 = "/有/意思", and W is removed from S1, leaving S1 = "计算语言学课程".
S1 is not empty, so W = "语言学课程" is taken from the right end of S1. W is not in the dictionary, and removing the leftmost character gives W = "言学课程" and then W = "学课程", neither of which is in the dictionary, and then W = "课程". "课程" is in the dictionary, so W is added to S2, giving S2 = "/课程/有/意思", and W is removed from S1, leaving S1 = "计算语言学".
S1 is not empty, so W = "计算语言学" is taken from the right end of S1. W is in the dictionary, so it is added to S2, giving S2 = "计算语言学/课程/有/意思", and W is removed from S1, leaving S1 = "". S1 is now empty, so S2 is output as the word segmentation result and the segmentation process ends.
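The reverse maximum matching procedure can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the patent's implementation: the function name `rmm_segment` and the toy dictionary are hypothetical, a candidate that shrinks to a single character is emitted as a word (as the worked example does), and the result is returned as a list of words rather than a "/"-joined string. The test sentence is a Chinese sentence meaning "the computational linguistics course is interesting".

```python
def rmm_segment(text, dictionary, max_len):
    """Segment text from right to left, always preferring the longest dictionary match."""
    words = []
    end = len(text)
    while end > 0:
        # Take at most max_len characters from the right end of the unsegmented part.
        start = max(0, end - max_len)
        # Drop characters from the left until the candidate is in the
        # dictionary or only a single character remains.
        while end - start > 1 and text[start:end] not in dictionary:
            start += 1
        words.append(text[start:end])
        end = start
    words.reverse()  # words were collected back to front
    return words


# Hypothetical toy dictionary, for illustration only.
toy_dictionary = {"计算语言学", "课程", "意思"}
print("/".join(rmm_segment("计算语言学课程有意思", toy_dictionary, max_len=5)))
print("/".join(rmm_segment("ABCD", {"CD", "BCD"}, max_len=4)))
```

Working on slice indices instead of rebuilding S1 and S2 at every step avoids repeated string copies but follows the same order of lookups as the steps above.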
In a specific embodiment of the present invention, optionally, value elements include information of concern to the user, unknown value information and/or currently unidentified value information. Information of concern to the user includes, but is not limited to, hardware feature information and real identity information: hardware feature information is a unique identifier of a device, such as an IMEI or a MAC address, and real identity information is information that uniquely identifies a user's identity, such as a mobile phone number or an identity card number. Unknown value information is a value element that is difficult to express with rules, such as a user name. Currently unidentified value information refers to a virtual identity, that is, information that uniquely identifies a user in a web-based application, such as an account, a UID or an email address.
On this basis, optionally, the value elements in the word segmentation results are identified by recognition rules set in advance and are then analyzed statistically, for example by counting the number of times value elements occur in each word segmentation result.
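Rule-based recognition and counting of value elements might look like the sketch below. Everything in it is an assumption made for illustration: the patent does not give its recognition rules, so the regular expressions (a 15-digit IMEI, a colon- or hyphen-separated MAC address, an 11-digit mobile number starting with 1, a simple email shape), the rule names and the function name `count_value_elements` are all hypothetical.

```python
import re
from collections import Counter

# Hypothetical recognition rules; real rules would be supplied by configuration.
RULES = {
    "mac": re.compile(r"(?:[0-9A-Fa-f]{2}[:-]){5}[0-9A-Fa-f]{2}"),
    "imei": re.compile(r"\d{15}"),
    "phone": re.compile(r"1\d{10}"),
    "email": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
}


def count_value_elements(tokens):
    """Count, per rule, how many tokens of a segmentation result are value elements."""
    counts = Counter()
    for token in tokens:
        for name, pattern in RULES.items():
            # fullmatch: the whole token must fit the rule, not just a prefix.
            if pattern.fullmatch(token):
                counts[name] += 1
                break
    return counts


tokens = ["imei", "860123456789012", "mac", "00:1A:2B:3C:4D:5E", "hello", "13800138000"]
stats = count_value_elements(tokens)
print(dict(stats), sum(stats.values()))
```

The sum of the counts is one possible form of the per-file statistical result used in the next step.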
Step S140: determine, according to the statistical results, whether to add the corresponding pending URL to the filter URL list.
In a specific embodiment of the present invention, a filter URL is a URL that contains no value elements or in which the number of value elements is below a preset threshold. The value elements referred to here have the same meaning as in step S130.
It should be noted that steps S110 to S140 are performed offline, which achieves the aim of reducing the load on the online system. The online system (the system in use by users) only needs to apply, according to the result, the corresponding filtering operation to the entity files, that is, the text data, corresponding to the URLs in the filter URL list, so the impact on users is also kept to a minimum.
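The decision in step S140 can be sketched as a simple threshold test. This is a simplified illustration, not the patent's implementation: the function name `build_filter_list`, the parameter `min_value_count` and the sample addresses are hypothetical, and the manual confirmation described in the summary is left out.

```python
def build_filter_list(value_counts, min_value_count):
    """Return the request addresses whose value-element count is below the threshold."""
    return sorted(
        address
        for address, count in value_counts.items()
        if count < min_value_count
    )


# Hypothetical statistics: request address -> number of value elements found.
stats = {
    "http://a.example/heartbeat": 0,
    "http://b.example/login": 12,
    "http://c.example/track": 2,
}
print(build_filter_list(stats, min_value_count=3))
```

Because the list is computed offline, the online system only has to check whether an incoming request address is in it.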
The technical solution of this embodiment obtains the URLs of the HTTP data within a preset time, each URL including a request address and request parameters; screens the request addresses based on a preset screening rule to obtain pending URLs and selects the corresponding pending entity files according to the request addresses of the pending URLs; performs word segmentation on the pending entity files using a preset word segmentation algorithm to generate word segmentation results, identifies the value elements in the word segmentation results, and performs statistical analysis on the value elements to generate statistical results; and then determines, according to the statistical results, whether to add the corresponding pending URL to the filter URL list. This addresses the prior-art problems of low junk text recognition accuracy and heavy load on the online system, minimizing the load on the online system and improving the reliability of recognition.
Further, on the basis of the above technical solution, screening the request addresses based on a preset screening rule to obtain pending URLs includes:
grouping the HTTP data by request address and counting the number of HTTP data items in each group.
In a specific embodiment of the present invention, because the content of HTTP data with the same request address has the same structure, the HTTP data may first be grouped by request address, that is, HTTP data with the same request address is placed in the same group. Take, for example, the following HTTP data:
"http://www.sogou.com/sie?hdq=AQxRG-0000&query=%E7%BB%9F%E4%B8%80%E8%B5%84%E6%BA%90%E5%AE%9A%E4%BD%8D%E7%AC%A6&ie=utf8",
"http://www.sogou.com/sie?hdq=AQxRG-0000&query=统一资源定位符&ie=utf8",
"http://s.taobao.com/search?q=水果&imgfile=&commend=all&ssid=s5-e&search_type=item&sourceId=tb.index&spm=a21bo.50862.201856-taobao-item.1&ie=utf8&initiative_id=tbindexz_20170916" and
"http://s.taobao.com/search?q=%E6%B0%B4%E6%9E%9C%E6%96%B0%E9%B2%9C&imgfile=&commend=all&ssid=s5-e&search_type=item&sourceId=tb.index&spm=a21bo.50862.201856-taobao-item.1&ie=utf8&initiative_id=tbindexz_20170916".
According to the request address, the first two, which share the request address "http://www.sogou.com/sie", are placed in one group, and the last two, which share the request address "http://s.taobao.com/search", are placed in another group. Other HTTP data is grouped in the same way and is not described again here.
On this basis, the number of HTTP data items in each group is counted. For example, the number of HTTP data items with the request address "http://www.sogou.com/sie" is 250, and the number with the request address "http://s.taobao.com/search" is 400.
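Grouping by request address and counting each group can be sketched with `collections.Counter`. This is a minimal illustration, not the patent's implementation; the function name `count_by_request_address` and the shortened sample URLs are assumptions.

```python
from collections import Counter
from urllib.parse import urlsplit


def count_by_request_address(urls):
    """Group URLs by request address (the part before "?") and count each group."""
    counts = Counter()
    for url in urls:
        parts = urlsplit(url)
        counts[f"{parts.scheme}://{parts.netloc}{parts.path}"] += 1
    return counts


# Shortened sample data, for illustration only.
sample = [
    "http://www.sogou.com/sie?hdq=AQxRG-0000&query=URL&ie=utf8",
    "http://www.sogou.com/sie?hdq=AQxRG-0000&query=HTTP&ie=utf8",
    "http://s.taobao.com/search?q=fruit&imgfile=&commend=all",
]
for address, count in count_by_request_address(sample).most_common():
    print(address, count)
```

`most_common()` already yields the groups in descending order of count, which is the order needed for the cumulative-share step.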
According to number, descending sort is carried out to every group of request address, and calculates accumulative accounting, it is default accumulative to choose satisfaction
URL corresponding to the request address of accounting, as pending URL.
In a particular embodiment of the present invention, it is not at whole HTTP data in order to reduce data processing amount
Reason, but it is screened, the follow-up HTTP data for meeting preparatory condition for filtering out are handled.
Specifically, first, the request addresses of the groups are sorted in descending order of the counts obtained above. For example, group 1 "http://s.taobao.com/search" contains $f_1 = 400$ items, group 2 "http://www.sogou.com/sie" contains $f_2 = 250$ items, group 3 "http://baike.sogou.com/v431372.htm" contains $f_3 = 230$ items, group 4 "http://list.jd.com/list.html" contains $f_4 = 120$ items, and group 5 "https://mst.vip.com/k1WvCu4H5gssBn3KLODlHQ.php" contains $f_5 = 100$ items. Sorting $f_1$, $f_2$, $f_3$, $f_4$ and $f_5$ from largest to smallest gives the order $f_1$, $f_2$, $f_3$, $f_4$, $f_5$. Then the cumulative proportion is calculated according to the formula $x_i = \dfrac{\sum_{j=1}^{i} f_j}{\sum_{j=1}^{n} f_j}$, where $n$ is the number of groups. The calculation and results are as follows: $x_1 = 400/1100 \approx 36.4\%$, $x_2 = 650/1100 \approx 59.1\%$, $x_3 = 880/1100 = 80\%$, $x_4 = 1000/1100 \approx 90.9\%$, $x_5 = 1100/1100 = 100\%$. Finally, the URLs corresponding to the request addresses that satisfy the preset cumulative proportion are chosen as pending URLs. Earlier findings indicate that the request addresses within the first 80% of the cumulative proportion are few in number yet cover a sufficiently large share of the data, so the cost of learning and analysis is low. On this basis, the preset cumulative proportion is preferably set to 80%. Since $x_3 = 80\%$ satisfies this condition, the URLs corresponding to the request addresses of group 1, group 2 and group 3 are chosen as pending URLs.
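The descending sort and cumulative proportion selection can be sketched with the five example counts given above. This is an illustrative sketch only: the function name `select_pending` is hypothetical, and it assumes that a group "satisfies" the preset cumulative proportion when the running share up to and including that group does not exceed the threshold.

```python
def select_pending(counts, threshold_percent=80):
    """Sort groups by count (descending) and keep those whose cumulative
    share of all HTTP data items does not exceed threshold_percent."""
    total = sum(counts.values())
    ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    selected, running = [], 0
    for address, n in ranked:
        running += n
        # Integer comparison avoids floating-point rounding at the boundary.
        if running * 100 > threshold_percent * total:
            break
        selected.append(address)
    return selected


# Counts f1..f5 from the example in the text.
counts = {
    "http://s.taobao.com/search": 400,
    "http://www.sogou.com/sie": 250,
    "http://baike.sogou.com/v431372.htm": 230,
    "http://list.jd.com/list.html": 120,
    "https://mst.vip.com/k1WvCu4H5gssBn3KLODlHQ.php": 100,
}
pending = select_pending(counts)  # groups 1, 2 and 3
```

Under this reading of the rule, a single group that alone exceeds the threshold yields an empty selection; a variant that always keeps the group crossing the threshold would be an equally plausible design.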
It should be noted that the number of groups, the number of HTTP data items in each group, and the preset cumulative proportion need to be determined statistically according to the actual situation and are not specifically limited here.
In this way, the amount of data processed is reduced and the cost of learning and analysis is lowered, while sufficient data coverage is still ensured.
Further, on the basis of the above technical solution, identifying the value elements in the word segmentation result and performing statistical analysis on the value elements to generate a statistical result includes:
Based on a preset recognition rule, matching the word segmentation result against preset value elements to obtain the value elements in the word segmentation result;
Counting the number of times the value elements occur in the pending entity file to generate the statistical result.
In a specific embodiment of the present invention, the preset recognition rule is a rule for identifying value elements that is provided in advance in the form of a configuration file. For example, when the preset value elements are mobile phone numbers, ID card numbers, IMEIs, email addresses, MAC addresses and the like, the preset recognition rule is a regular-expression rule; when the preset value element is a user name, the preset recognition rule is a rule that specifies information such as the application and an attribute identifier. The preset value elements may belong to the same category, where value elements of the same category are those identified with the same recognition rule; a mobile phone number and an ID card number, for instance, are value elements of the same category. They may also belong to different categories, where value elements of different categories are those identified with different recognition rules; a mobile phone number and a user name, for instance, are value elements of different categories. These can be set according to the actual situation and are not specifically limited here. Preferably, the preset value elements include value elements of different categories, with at least one kind of value element in each category. The following description takes mobile phone numbers, ID card numbers, IMEIs, email addresses, MAC addresses and user names as the preset value elements.
Because the preset value elements are mobile phone numbers, ID card numbers, IMEIs, email addresses, MAC addresses and user names, the preset recognition rules are, correspondingly, regular-expression rules and rules that specify information such as the application and an attribute identifier. Based on these rules, the word segmentation result is checked for mobile phone numbers, ID card numbers, IMEIs, email addresses, MAC addresses and user names, and the value elements in the word segmentation result are identified by matching. For example, word segmentation result 1 is found to contain the value element mobile phone number, word segmentation result 2 the value element user name, and word segmentation result 3 the value elements mobile phone number and MAC address. Word segmentation results 1, 2 and 3 are generated by applying the preset word segmentation algorithm to pending entity file 1, pending entity file 2 and pending entity file 3, respectively. The number of times the value elements occur in each pending entity file is then counted. For example, the number of occurrences of the mobile phone number in pending entity file 1 is 20, the number of occurrences of the user name in pending entity file 2 is 1, and the number of occurrences of the mobile phone number and the MAC address in pending entity file 3 is 2.
It should be noted that the statistical result obtained above is the sum of the numbers of occurrences of all value elements. For example, suppose the word segmentation result obtained by applying the word segmentation algorithm to a pending entity file is matched against the preset value elements based on the preset recognition rule, and the value elements found in the word segmentation result are a mobile phone number and an ID card number. The statistical result generated for that pending entity file is then the number of occurrences of the mobile phone number plus the number of occurrences of the ID card number.
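The matching and counting of value elements can be sketched as follows. The regular expressions are illustrative assumptions rather than the patent's rules (which would be supplied through a configuration file), the names `RULES` and `count_value_elements` are hypothetical, and only three of the regular-expression categories are covered.

```python
import re
from collections import Counter

# Illustrative patterns only; real rules would come from a configuration file.
RULES = {
    "phone": re.compile(r"1[3-9]\d{9}"),
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "mac": re.compile(r"(?:[0-9A-Fa-f]{2}:){5}[0-9A-Fa-f]{2}"),
}


def count_value_elements(tokens):
    """Count, per category, the tokens that fully match a recognition rule."""
    counts = Counter()
    for token in tokens:
        for name, pattern in RULES.items():
            if pattern.fullmatch(token):
                counts[name] += 1
                break  # each token is counted at most once
    return counts


# Hypothetical word segmentation result for one pending entity file.
tokens = ["contact", "13800138000", "00:1A:2B:3C:4D:5E", "hello"]
counts = count_value_elements(tokens)
# The statistical result is the sum over all value elements.
statistical_result = sum(counts.values())
```

Here the statistical result is the number of phone-number matches plus the number of MAC-address matches, mirroring the summation rule stated in the note above.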
In this way, the statistical result obtained provides the basis for subsequently deciding whether to add the corresponding pending URL to the filter URL list.
Further, on the basis of the above technical solution, determining, according to the statistical result, whether to add the corresponding pending URL to the filter URL list includes:
When the statistical result is less than the preset recommendation result, obtaining the pending URL and the pending entity file corresponding to the statistical result;
In a specific embodiment of the present invention, the preset recommendation result can be determined by prior statistical analysis and is not specifically limited here. Preferably, the preset recommendation result is set to 2. When the statistical result is less than 2, the pending URL and the pending entity file corresponding to the statistical result need to be obtained; when the statistical result is greater than or equal to 2, this operation is not needed, because the pending entity file corresponding to the statistical result is determined to contain enough valuable information. For example, the user name occurs once in pending entity file 2, as described above, and this statistical result is less than 2. The pending URL and the pending entity file corresponding to the statistical result, namely the pending URL and pending entity file 2, therefore need to be obtained.
Recommending the pending URL and the pending entity file for manual judgment, and obtaining a manual judgment result;
In a specific embodiment of the present invention, the purpose of the manual judgment is as follows. In practice, entity files often contain virtual identities (such as email addresses); the virtual identity mentioned here is the virtual identity explained earlier and is in essence valuable information. Because such information may not be identified well, the statistical result may fall below the preset recommendation result. Manual judgment is therefore needed as a safeguard to ensure that valuable information is not deleted and to reduce the misjudgment rate. In addition, for recommended valuable information that is extractable but was not identified (such as email addresses), manual judgment makes it possible to find the data extraction pattern and generate a corresponding data extraction rule. That rule can then be added to the preset recognition rules described earlier for subsequent use, which in turn reduces the number of manual judgments.
Determining, based on the manual judgment result, whether to add the pending URL to the filter URL list.
In a specific embodiment of the present invention, for example, it is manually judged whether pending entity file 2 contains an email address. If it does, the condition that the statistical result is less than the preset recommendation result is no longer satisfied, so the pending URL corresponding to pending entity file 2 is not added to the filter URL list; otherwise, the pending URL corresponding to pending entity file 2 is added to the filter URL list.
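The decision flow described above (threshold check, manual judgment, filter list update) can be sketched as follows. The function name `decide` and the `reviewer` callback are hypothetical stand-ins; in the patent the review is performed by a person, not by code, and the threshold of 2 is the preferred value given in the text.

```python
def decide(pending_url, statistical_result, reviewer, filter_list, threshold=2):
    """Add pending_url to filter_list only if both the statistics and the
    reviewer indicate that the entity file holds no valuable information.

    reviewer is a stand-in for manual judgment: it returns True when the
    reviewer finds valuable information in the pending entity file.
    """
    if statistical_result >= threshold:
        return False  # enough value elements were found; no review needed
    if reviewer(pending_url):
        return False  # reviewer found valuable information; keep the URL
    filter_list.add(pending_url)
    return True


filter_list = set()
# Hypothetical URLs; the lambdas simulate the reviewer's verdict.
decide("http://a.example/page", 3, lambda url: False, filter_list)
decide("http://b.example/page", 1, lambda url: True, filter_list)
decide("http://c.example/page", 0, lambda url: False, filter_list)
```

Only the third URL ends up in `filter_list`: the first passes the threshold, and the second is rescued by the reviewer.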
Embodiment two
Fig. 2 is a flowchart of a rubbish text processing method provided by Embodiment two of the present invention. This embodiment is applicable to accurately identifying and filtering out text that contains no valuable information. The method can be performed by a rubbish text processing device, which can be implemented in software and/or hardware and can be configured in a terminal, typically a mobile phone, a computer, a tablet computer or the like. As shown in Fig. 2, the method specifically includes the following steps:
Step S210, obtain the URLs of the HTTP data within a preset time, where each URL includes a request address and request parameters;
Step S220, group the HTTP data according to the request address, count the number of HTTP data items in each group, sort the request addresses of the groups in descending order of count, and calculate the cumulative proportion;
Step S230, choose the URLs corresponding to the request addresses that satisfy the preset cumulative proportion as pending URLs, and choose the corresponding pending entity files according to the request addresses of the pending URLs;
Step S240, perform word segmentation on the pending entity file using a preset word segmentation algorithm, and, based on a preset recognition rule, match the word segmentation result against preset value elements to obtain the value elements in the word segmentation result;
In a specific embodiment of the present invention, the word segmentation of the pending entity file is preferably performed using the reverse maximum matching algorithm.
Step S250, count the number of times the value elements occur in the pending entity file to generate a statistical result;
Step S260, when the statistical result is less than the preset recommendation result, obtain the pending URL and the pending entity file corresponding to the statistical result;
Step S270, recommend the pending URL and the pending entity file for manual judgment, and obtain a manual judgment result;
Step S280, determine, based on the manual judgment result, whether to add the pending URL to the filter URL list.
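The reverse maximum matching algorithm preferred in step S240 can be sketched as follows. This is a textbook-style illustration under stated assumptions, not the patent's implementation: the dictionary, the maximum word length `max_len` and the function name `reverse_maximum_match` are all hypothetical, and a single character is accepted as a fallback token when no dictionary word matches.

```python
def reverse_maximum_match(text, dictionary, max_len=4):
    """Segment text by scanning from the end and taking the longest
    dictionary word that ends at the current position."""
    tokens = []
    end = len(text)
    while end > 0:
        for size in range(min(max_len, end), 0, -1):
            candidate = text[end - size:end]
            # A single character is accepted as a fallback token.
            if size == 1 or candidate in dictionary:
                tokens.append(candidate)
                end -= size
                break
    tokens.reverse()  # tokens were collected from right to left
    return tokens


# Hypothetical dictionary and input sentence.
dictionary = {"研究", "研究生", "生命", "起源"}
tokens = reverse_maximum_match("研究生命的起源", dictionary)
```

Scanning from the right yields 研究 / 生命 / 的 / 起源, whereas a forward scan with the same dictionary would first take 研究生 and split the sentence differently.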
In a specific embodiment of the present invention, the purpose of the manual judgment is as follows. In practice, entity files often contain virtual identities (such as email addresses); the virtual identity mentioned here is the virtual identity explained earlier and is in essence valuable information. Because such information may not be identified well, the statistical result may fall below the preset recommendation result. Manual judgment is therefore needed as a safeguard to ensure that valuable information is not deleted and to reduce the misjudgment rate. In addition, for recommended valuable information that is extractable but was not identified (such as email addresses), manual judgment makes it possible to find the data extraction pattern and generate a corresponding data extraction rule. That rule can then be added to the preset recognition rules described earlier for subsequent use, which in turn reduces the number of manual judgments.
It should be noted that steps S210 to S280 are performed offline, which relieves the load on the online system. The online system (the system the user is using) only needs to perform the corresponding filtering operation according to the processing result, so the impact on the user is kept to a minimum.
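The division of work between the offline steps and the online system can be illustrated with a small sketch of the online side. It assumes, hypothetically, that the filter URL list produced offline stores request addresses (URLs without request parameters) and is loaded into a set; the patent does not specify the storage format, and the function name `should_filter` is invented for illustration.

```python
from urllib.parse import urlsplit


def should_filter(url, filter_list):
    """Return True if the URL's request address is on the filter URL list."""
    parts = urlsplit(url)
    address = f"{parts.scheme}://{parts.netloc}{parts.path}"
    return address in filter_list


# Hypothetical filter URL list produced by the offline steps.
filter_list = {"http://ads.example.com/track"}
hit = should_filter("http://ads.example.com/track?id=1", filter_list)
miss = should_filter("http://www.example.com/search?q=x", filter_list)
```

A set lookup of this kind is a constant-time operation on average, which is consistent with the stated goal of keeping the online system's load low.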
In the technical solution of this embodiment, the URLs of the HTTP data within a preset time are obtained, each URL including a request address and request parameters; the request addresses are screened based on a preset screening rule to obtain pending URLs, and the corresponding pending entity files are chosen according to the request addresses of the pending URLs; word segmentation is performed on the pending entity files using a preset word segmentation algorithm to generate word segmentation results, the value elements in the word segmentation results are identified, and statistical analysis is performed on the value elements to generate statistical results; and whether to add the corresponding pending URL to the filter URL list is then determined according to the statistical result. This solves the problems in the prior art that rubbish text is identified with low precision and that the load on the online system is large, reduces the load on the online system as far as possible, and improves the reliability of identification.
Embodiment three
Fig. 3 is a schematic structural diagram of a rubbish text processing device provided by Embodiment three of the present invention. This embodiment is applicable to accurately identifying and filtering out text that contains no valuable information. The device can be implemented in software and/or hardware and can be configured in a terminal, typically a mobile phone, a computer, a tablet computer or the like. As shown in Fig. 3, the device specifically includes:
URL acquisition module 310, configured to obtain the URLs of the HTTP data within a preset time, where each URL includes a request address and request parameters;
Pending entity file acquisition module 320, configured to screen the request addresses based on a preset screening rule to obtain pending URLs, and to choose the corresponding pending entity files according to the request addresses of the pending URLs;
Statistical result generation module 330, configured to perform word segmentation on the pending entity file using a preset word segmentation algorithm to generate a word segmentation result, to identify the value elements in the word segmentation result, and to perform statistical analysis on the value elements to generate a statistical result;
In a specific embodiment of the present invention, the word segmentation of the pending entity file is preferably performed using the reverse maximum matching algorithm.
Filter URL list generation module 340, configured to determine, according to the statistical result, whether to add the corresponding pending URL to the filter URL list.
In the technical solution of this embodiment, URL acquisition module 310 obtains the URLs of the HTTP data within a preset time, each URL including a request address and request parameters; pending entity file acquisition module 320 screens the request addresses based on a preset screening rule to obtain pending URLs and chooses the corresponding pending entity files according to the request addresses of the pending URLs; statistical result generation module 330 performs word segmentation on the pending entity files using a preset word segmentation algorithm to generate word segmentation results, identifies the value elements in the word segmentation results, and performs statistical analysis on the value elements to generate statistical results; and filter URL list generation module 340 determines, according to the statistical result, whether to add the corresponding pending URL to the filter URL list. This solves the problems in the prior art that rubbish text is identified with low precision and that the load on the online system is large, reduces the load on the online system as far as possible, and improves the reliability of identification.
Further, on the basis of the above technical solution, pending entity file acquisition module 320 includes:
A grouping and counting unit, configured to group the HTTP data according to the request address and to count the number of HTTP data items in each group;
A pending URL acquisition unit, configured to sort the request addresses of the groups in descending order of count, to calculate the cumulative proportion, and to choose the URLs corresponding to the request addresses that satisfy the preset cumulative proportion as pending URLs.
Further, on the basis of the above technical solution, statistical result generation module 330 includes:
A value element recognition unit, configured to match the word segmentation result against preset value elements based on a preset recognition rule to obtain the value elements in the word segmentation result;
A statistical result generation unit, configured to count the number of times the value elements occur in the pending entity file to generate the statistical result.
Further, on the basis of the above technical solution, filter URL list generation module 340 is specifically configured to:
When the statistical result is less than the preset recommendation result, obtain the pending URL and the pending entity file corresponding to the statistical result;
Recommend the pending URL and the pending entity file for manual judgment, and obtain a manual judgment result;
Determine, based on the manual judgment result, whether to add the pending URL to the filter URL list.
The rubbish text processing device configured in a terminal, as provided by the embodiments of the present invention, can perform the rubbish text processing method applied to a terminal provided by any embodiment of the present invention, and has the functional modules and beneficial effects corresponding to the method performed.
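The way the four modules hand data to one another can be sketched as a small pipeline. This is only a structural illustration under stated assumptions: the class name `RubbishTextDevice`, the callable signatures and the stand-in lambdas are hypothetical, and the manual judgment performed by module 340 is reduced to a plain threshold check.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, Set


@dataclass
class RubbishTextDevice:
    acquire_urls: Callable[[], Iterable[str]]                  # module 310
    select_pending: Callable[[Iterable[str]], Dict[str, str]]  # module 320
    count_values: Callable[[str], int]                         # module 330
    decide: Callable[[str, int], bool]                         # module 340

    def run(self) -> Set[str]:
        """Run the modules in order and return the filter URL list."""
        filter_list: Set[str] = set()
        # Module 320 maps each pending URL to its pending entity file text.
        pending = self.select_pending(self.acquire_urls())
        for url, entity_text in pending.items():
            result = self.count_values(entity_text)
            if self.decide(url, result):
                filter_list.add(url)
        return filter_list


# Stand-in modules with hypothetical data.
device = RubbishTextDevice(
    acquire_urls=lambda: ["http://a.example/x?p=1", "http://b.example/y?p=2"],
    select_pending=lambda urls: {
        "http://a.example/x": "call 13800138000 or 13900139000",
        "http://b.example/y": "hello world",
    },
    count_values=lambda text: sum(tok.isdigit() for tok in text.split()),
    decide=lambda url, result: result < 2,
)
filter_list = device.run()
```

Passing the modules in as callables keeps each one replaceable, for example to swap the word segmentation algorithm or the recognition rules without touching the rest of the pipeline.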
Embodiment four
Fig. 4 is a schematic structural diagram of a piece of equipment provided by Embodiment four of the present invention. Fig. 4 shows a block diagram of exemplary equipment 412 suitable for implementing embodiments of the present invention. The equipment 412 shown in Fig. 4 is only an example and should not impose any limitation on the function or scope of use of the embodiments of the present invention.
As shown in Fig. 4, equipment 412 takes the form of a general-purpose computing device. The components of equipment 412 can include, but are not limited to, one or more processors 416, system memory 428, and bus 418 connecting the different system components (including system memory 428 and processor 416).
Bus 418 represents one or more of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, and a processor or local bus using any of a variety of bus structures. For example, these architectures include, but are not limited to, the Industry Standard Architecture (ISA) bus, the Micro Channel Architecture (MCA) bus, the Enhanced ISA bus, the Video Electronics Standards Association (VESA) local bus and the Peripheral Component Interconnect (PCI) bus.
Equipment 412 typically includes a variety of computer-system-readable media. These media can be any available media that can be accessed by equipment 412, including volatile and non-volatile media and removable and non-removable media.
System memory 428 can include computer-system-readable media in the form of volatile memory, such as random access memory (RAM) 430 and/or cache memory 432. Equipment 412 may further include other removable/non-removable, volatile/non-volatile computer system storage media. By way of example only, storage system 434 can be used to read from and write to a non-removable, non-volatile magnetic medium (not shown in Fig. 4, commonly called a "hard disk drive"). Although not shown in Fig. 4, a magnetic disk drive for reading from and writing to a removable non-volatile magnetic disk (such as a "floppy disk") and an optical disc drive for reading from and writing to a removable non-volatile optical disc (such as a CD-ROM, DVD-ROM or other optical medium) can be provided. In these cases, each drive can be connected to bus 418 through one or more data media interfaces. Memory 428 can include at least one program product having a set of (for example, at least one) program modules configured to perform the functions of the embodiments of the present invention.
Program/utility 440, having a set of (at least one) program modules 442, can be stored in, for example, memory 428. Such program modules 442 include, but are not limited to, an operating system, one or more application programs, other program modules and program data; each of these examples, or some combination of them, may include an implementation of a network environment. Program modules 442 generally perform the functions and/or methods of the embodiments described in the present invention.
Equipment 412 can also communicate with one or more external devices 414 (such as a keyboard, a pointing device, display 424, etc.), with one or more devices that enable a user to interact with equipment 412, and/or with any device (such as a network card, a modem, etc.) that enables equipment 412 to communicate with one or more other computing devices. Such communication can take place through input/output (I/O) interface 422. Equipment 412 can also communicate with one or more networks (such as a local area network (LAN), a wide area network (WAN) and/or a public network such as the Internet) through network adapter 420. As shown in the figure, network adapter 420 communicates with the other modules of equipment 412 through bus 418. It should be understood that, although not shown in Fig. 4, other hardware and/or software modules can be used in conjunction with equipment 412, including but not limited to microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives and data backup storage systems.
Processor 416 runs the programs stored in system memory 428 to execute various functional applications and data processing, for example to implement a rubbish text processing method provided by the embodiments of the present invention, including:
Obtaining the uniform resource locators (URLs) of the Hypertext Transfer Protocol (HTTP) data within a preset time, where each URL includes a request address and request parameters;
Screening the request addresses based on a preset screening rule to obtain pending URLs, and choosing the corresponding pending entity files according to the request addresses of the pending URLs;
Performing word segmentation on the pending entity file using a preset word segmentation algorithm to generate a word segmentation result, identifying the value elements in the word segmentation result, and performing statistical analysis on the value elements to generate a statistical result;
Determining, according to the statistical result, whether to add the corresponding pending URL to the filter URL list.
Embodiment five
Embodiment five of the present invention further provides a computer-readable storage medium on which a computer program is stored. When the program is executed by a processor, it implements a rubbish text processing method as provided by the embodiments of the present invention, the method including:
Obtaining the uniform resource locators (URLs) of the Hypertext Transfer Protocol (HTTP) data within a preset time, where each URL includes a request address and request parameters;
Screening the request addresses based on a preset screening rule to obtain pending URLs, and choosing the corresponding pending entity files according to the request addresses of the pending URLs;
Performing word segmentation on the pending entity file using a preset word segmentation algorithm to generate a word segmentation result, identifying the value elements in the word segmentation result, and performing statistical analysis on the value elements to generate a statistical result;
Determining, according to the statistical result, whether to add the corresponding pending URL to the filter URL list.
The computer storage medium of the embodiments of the present invention can be any combination of one or more computer-readable media. A computer-readable medium can be a computer-readable signal medium or a computer-readable storage medium. A computer-readable storage medium can be, for example, but is not limited to, an electrical, magnetic, optical, electromagnetic, infrared or semiconductor system, apparatus or device, or any combination of the above. More specific examples (a non-exhaustive list) of computer-readable storage media include: an electrical connection with one or more wires, a portable computer diskette, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above. In this document, a computer-readable storage medium can be any tangible medium that contains or stores a program, and the program can be used by or in connection with an instruction execution system, apparatus or device.
A computer-readable signal medium can include a data signal propagated in baseband or as part of a carrier wave, in which computer-readable program code is carried. Such a propagated data signal can take various forms, including but not limited to an electromagnetic signal, an optical signal, or any suitable combination of the above. A computer-readable signal medium can also be any computer-readable medium other than a computer-readable storage medium, and that medium can send, propagate or transmit a program for use by or in connection with an instruction execution system, apparatus or device.
The program code contained on a computer-readable medium can be transmitted over any suitable medium, including, but not limited to, wireless, wire, optical cable, RF and the like, or any suitable combination of the above.
Computer program code for performing the operations of the present invention can be written in one or more programming languages or a combination thereof. The programming languages include object-oriented programming languages, such as Java, Smalltalk and C++, as well as conventional procedural programming languages, such as the "C" language or similar programming languages. The program code can be executed entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. Where a remote computer is involved, the remote computer can be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or it can be connected to an external computer (for example, through the Internet using an Internet service provider).
Note that the above are only preferred embodiments of the present invention and the technical principles applied. Those skilled in the art will understand that the present invention is not limited to the specific embodiments described here, and that various obvious changes, readjustments and substitutions can be made by those skilled in the art without departing from the scope of protection of the present invention. Therefore, although the present invention has been described in some detail through the above embodiments, it is not limited to them and can include other equivalent embodiments without departing from the inventive concept; the scope of the present invention is determined by the scope of the appended claims.
Claims (10)
- A kind of 1. rubbish text processing method, it is characterised in that including:The uniform resource position mark URL of the HTTP HTTP data in preset time is obtained, the URL includes request Address and required parameter;Screening is carried out to the request address based on default screening rule and obtains pending URL, and according to the pending URL's Pending entity file corresponding to request address selection;Word segmentation processing generation word segmentation result is carried out to the pending entity file using default segmentation methods, and identifies described point Value element in word result, statistical analysis generation statistical result is carried out to the value element;According to the statistical result, it is determined whether corresponding pending URL is added into filtering URL name list.
- 2. according to the method for claim 1, it is characterised in that described that the request address is entered based on default screening rule Row screening obtains pending URL, including:The HTTP data are grouped according to the request address, and count the number of the HTTP data included in every group;According to the number, descending sort is carried out to every group of request address, and calculates accumulative accounting, it is default accumulative to choose satisfaction URL corresponding to the request address of accounting, as the pending URL.
- 3. method according to claim 1 or 2, it is characterised in that the value element in the identification word segmentation result, Statistical analysis generation statistical result is carried out to the value element, including:Based on default recognition rule, the word segmentation result and default value element are subjected to match cognization, obtain the participle knot Value element in fruit;The number that the value element occurs in the pending entity file is counted, generates statistical result.
- 4. The method according to claim 3, characterized in that determining, according to the statistical result, whether to add the corresponding pending URL to the filtering URL name list comprises: when the statistical result is less than a preset recommendation result, obtaining the pending URL and the pending entity file corresponding to the statistical result; recommending the pending URL and the pending entity file for manual judgment, and obtaining a manual judgment result; and determining, based on the manual judgment result, whether to add the pending URL to the filtering URL name list.
- 5. The method according to claim 4, characterized in that the preset word segmentation algorithm comprises a reverse maximum matching algorithm.
- 6. A rubbish text processing device, characterized by comprising: a URL acquisition module, configured to obtain the URLs of the HTTP data within a preset time, each URL comprising a request address and request parameters; a pending entity file acquisition module, configured to screen the request addresses based on a preset screening rule to obtain pending URLs, and to select the corresponding pending entity file according to the request address of each pending URL; a statistical result generation module, configured to perform word segmentation on the pending entity file using a preset word segmentation algorithm to generate a word segmentation result, to identify the value elements in the word segmentation result, and to perform statistical analysis on the value elements to generate a statistical result; and a filtering URL name list generation module, configured to determine, according to the statistical result, whether to add the corresponding pending URL to the filtering URL name list.
- 7. The device according to claim 6, characterized in that the pending entity file acquisition module comprises: a grouping statistics unit, configured to group the HTTP data by request address and to count the number of HTTP data items in each group; and a pending URL acquisition unit, configured to sort the request addresses of the groups in descending order of that number, to calculate the cumulative proportion, and to select the URLs corresponding to the request addresses that satisfy a preset cumulative proportion as the pending URLs.
- 8. The device according to claim 6 or 7, characterized in that the statistical result generation module comprises: a value element recognition unit, configured to match the word segmentation result against preset value elements based on a preset recognition rule, so as to obtain the value elements in the word segmentation result; and a statistical result generation unit, configured to count the number of times the value elements occur in the pending entity file to generate the statistical result.
- 9. Equipment, characterized by comprising: one or more processors; and a memory for storing one or more programs; wherein, when the one or more programs are executed by the one or more processors, the one or more processors implement the rubbish text processing method according to any one of claims 1-5.
- 10. A computer-readable storage medium on which a computer program is stored, characterized in that the program, when executed by a processor, implements the rubbish text processing method according to any one of claims 1-5.
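
The grouping, segmentation and counting steps recited in claims 5, 7 and 8 can be sketched roughly as follows. This is a minimal Python illustration under stated assumptions, not the patented implementation: the function names, the `max_len` window, the 0.8 cumulative proportion, and the reading of "request address" as host plus path (with the query string as the request parameters) are all hypothetical choices made for the example.

```python
from collections import Counter
from urllib.parse import urlsplit


def select_pending_addresses(urls, cumulative_share=0.8):
    """Claim 7 (sketch): group by request address, sort groups by size in
    descending order, and keep addresses until the cumulative share is reached."""
    # Assumption: request address = host + path; the query string is dropped.
    counts = Counter(urlsplit(u).netloc + urlsplit(u).path for u in urls)
    total = sum(counts.values())
    selected, running = [], 0
    for address, n in counts.most_common():  # largest groups first
        if running / total >= cumulative_share:
            break
        selected.append(address)
        running += n
    return selected


def reverse_maximum_match(text, dictionary, max_len=4):
    """Claim 5 (sketch): scan from the end of the text and, at each step,
    take the longest dictionary word that ends at the current position."""
    tokens, end = [], len(text)
    while end > 0:
        for size in range(min(max_len, end), 0, -1):
            piece = text[end - size:end]
            # A single character is always accepted so the scan cannot stall.
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                end -= size
                break
    return tokens[::-1]  # tokens were collected right to left


def count_value_elements(tokens, value_elements):
    """Claim 8 (sketch): count how many tokens match a preset value element."""
    return sum(1 for token in tokens if token in value_elements)


if __name__ == "__main__":
    sample_urls = (
        ["http://a.example/news?id=%d" % i for i in range(5)]
        + ["http://b.example/feed"] * 3
        + ["http://c.example/ad?x=1"] * 2
    )
    print(select_pending_addresses(sample_urls))
    words = reverse_maximum_match("研究生命起源", {"研究", "研究生", "生命", "起源"})
    print(words, count_value_elements(words, {"生命"}))
```

In this sketch the resulting count would then be compared against the preset thresholds to decide whether the URL goes on the filtering URL name list or, as in claim 4, is passed on for manual judgment; the threshold values themselves are not given in the claims above.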
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710865928.XA CN107704538A (en) | 2017-09-22 | 2017-09-22 | A kind of rubbish text processing method, device, equipment and storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710865928.XA CN107704538A (en) | 2017-09-22 | 2017-09-22 | A kind of rubbish text processing method, device, equipment and storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN107704538A (en) | 2018-02-16 |
Family
ID=61171855
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710865928.XA Pending CN107704538A (en) | 2017-09-22 | 2017-09-22 | A kind of rubbish text processing method, device, equipment and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107704538A (en) |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2008053228A3 (en) * | 2006-11-01 | 2009-01-08 | Bloxx Ltd | Methods and systems for web site categorisation training, categorisation and access control |
CN104348642A (en) * | 2013-07-31 | 2015-02-11 | 华为技术有限公司 | A spam information filtering method and device |
CN105320659A (en) * | 2014-06-04 | 2016-02-10 | 同程网络科技股份有限公司 | Sensitive word filtering method |
Non-Patent Citations (2)
Title |
---|
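TING XU et al.: "Understanding Network Behavior Patterns of Bus Wi-Fi Users Using Surfing Data", 《2017 IEEE 2ND ADVANCED INFORMATION TECHNOLOGY,ELECTRONIC AND AUTOMATION CONTROL CONFERENCE》 * |
GU JUN et al.: "Design and Implementation of a Topic Collection Tool for Intelligence Acquisition" (面向情报获取的主题采集工具设计与实现), 《图书情报工作》 (Library and Information Service) * |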
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109710861A (en) * | 2018-12-26 | 2019-05-03 | 贵阳朗玛信息技术股份有限公司 | A kind of system and method generating URL |
CN109710861B (en) * | 2018-12-26 | 2023-04-11 | 贵阳朗玛信息技术股份有限公司 | System and method for generating URL |
CN112650849A (en) * | 2019-09-25 | 2021-04-13 | 北京国双科技有限公司 | File processing method and device, storage medium and equipment |
CN111061777A (en) * | 2019-12-10 | 2020-04-24 | 广州电力工程监理有限公司 | Project data statistical analysis method and system |
Similar Documents
Publication | Title | Publication Date |
---|---|---|
JP7170779B2 (en) | Methods and systems for automatic intent mining, classification, and placement | |
US10725836B2 (en) | Intent-based organisation of APIs | |
US11681944B2 (en) | System and method to generate a labeled dataset for training an entity detection system | |
US10394956B2 (en) | Methods, devices, and systems for constructing intelligent knowledge base | |
US8782051B2 (en) | System and method for text categorization based on ontologies | |
US10191946B2 (en) | Answering natural language table queries through semantic table representation | |
US20230177360A1 (en) | Surfacing unique facts for entities | |
US20170185653A1 (en) | Predicting Knowledge Types In A Search Query Using Word Co-Occurrence And Semi/Unstructured Free Text | |
CN110162637B (en) | Information map construction method, device and equipment | |
CN113254649B (en) | Training method of sensitive content recognition model, text recognition method and related device | |
JP7254925B2 (en) | Transliteration of data records for improved data matching | |
CN111783903A (en) | Text processing method, text model processing method and device and computer equipment | |
CN110362815A (en) | Text vector generation method and device | |
CN105975497A (en) | Automatic microblog topic recommendation method and device | |
CN107704538A (en) | A kind of rubbish text processing method, device, equipment and storage medium | |
CN112241458B (en) | Text knowledge structuring processing method, device, equipment and readable storage medium | |
CN115809334B (en) | Training method of event relevance classification model, text processing method and device | |
CN106888201A (en) | A kind of method of calibration and device | |
US11361031B2 (en) | Dynamic linguistic assessment and measurement | |
CN116450781A (en) | Question and answer processing method and device | |
CN111930949B (en) | Search string processing method and device, computer readable medium and electronic equipment | |
Prakash et al. | Aspect based sentiment analysis for Amazon data products using PAM | |
CN113779190A (en) | Event causality identification method, device, electronic device and storage medium | |
CN116702784B (en) | Entity linking method, entity linking device, computer equipment and storage medium | |
CN111882224A (en) | Method and apparatus for classifying consumption scenarios |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20180216 |