[go: up one dir, main page]

CN104484381B - For searching for the method and system of multiple strings - Google Patents

For searching for the method and system of multiple strings Download PDF

Info

Publication number
CN104484381B
CN104484381B CN201410757944.3A CN201410757944A CN104484381B CN 104484381 B CN104484381 B CN 104484381B CN 201410757944 A CN201410757944 A CN 201410757944A CN 104484381 B CN104484381 B CN 104484381B
Authority
CN
China
Prior art keywords
string
word
pattern
text
substring
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201410757944.3A
Other languages
Chinese (zh)
Other versions
CN104484381A (en
Inventor
张�林
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
eBay Inc
Original Assignee
eBay Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by eBay Inc filed Critical eBay Inc
Priority to CN201410757944.3A priority Critical patent/CN104484381B/en
Priority claimed from CN201010116709.XA external-priority patent/CN102169485B/en
Publication of CN104484381A publication Critical patent/CN104484381A/en
Application granted granted Critical
Publication of CN104484381B publication Critical patent/CN104484381B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/245Query processing
    • G06F16/2453Query optimisation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention relates to for searching for the method and system of multiple strings.The system can include:Store the pattern string started respectively with the first word;For each the first different word, the combination of each different string length of pattern string that stores first word and started with first word;The word that one of identification and the first pattern word match in the text;String is iteratively extracted from text based on the string started with the word identified, and the string length equal with one of string length is stored together with matched first word;Iteratively each string extracted and each pattern string with first word and string length identical with the substring are compared;And the matching of one of at least one string extracted and the pattern string is determined based on this comparison.

Description

For searching for the method and system of multiple strings
The application is the application for a patent for invention (applying date of original bill Application No. 201010116709.X:2 months 2010 26 Day, denomination of invention:For searching for the method and system of multiple strings) divisional application.
Technical field
This invention relates generally to the technical fields of information processing, more specifically, are related to the method for information search And system.
Background technology
With the development of cyber-net correlation technique, more and more people perform numeric search to identify or find number The specific content for the needs of meeting them in word document.For example, people (such as parent) or authoritative institution can be attempted in child The limitation content (such as string, expression or word) of some unsuitable children is found in obtainable digital document, then so that child Son away from these contents.But in many situations, due to the size and/or substantial amounts of such as digital document, Ren Menhuo It is a time-consuming task that authoritative institution, which identifies or find these limitation contents,.Therefore, it is necessary to improved searching method, for holding The efficient search of row and reduction search time.
The content of the invention
The purpose of the application is to provide for being effectively carried out pair for the first word based on multiple strings and the combination of length The search of multiple strings is to reduce the system and method for search time.
According to the application's in a first aspect, providing one kind for existing by using one or more processors executive resident The system for instructing to search for multiple strings from text in computer.The system includes:First storage device is distinguished for storing The pattern string started with the first pattern word;And second storage device, for storing the pattern string started with the first pattern word The combination of first pattern word and corresponding pattern length.The system further includes:Search engine, for iteratively identifying in the text The word to match with one of first pattern word is to be arranged to current word;Extractor, for iteratively extracting with described current Word starts and has the substring of the substring length equal with one of pattern length;And comparator, for iteratively by the son It goes here and there compared with each pattern string with first pattern word and string length identical with the first pattern word and length of the substring. The system can also include the 3rd storage device, for matching in one of the substring and pattern string in the case of storage and the son Go here and there relevant information.
According to the another aspect of the application, one kind is provided for existing by using one or more processors executive resident The method for instructing to search for multiple strings from text in computer.This method includes:By what is started respectively with the first pattern word Pattern string is stored in the first storage device;And by the first pattern word of the pattern string started respectively with the first pattern word and accordingly The combination of pattern length be stored in the second storage device.This method further includes:Using search engine in the text iteratively The word that one of identification and first pattern word match, to be arranged to current word;It is iteratively extracted with institute using search engine State the substring that current word starts and has the substring length equal with one of pattern length;Using search engine iteratively by the son It goes here and there compared with each pattern string with first pattern word and string length identical with the first pattern word and length of the substring; And it if one of the substring and pattern string match, will be stored in the relevant information of the substring in the 3rd storage device.
The system and method for the application can efficiently find and position in the text a series of to use large character set (charset) the pre-defined pattern string (such as Chinese word) of language (such as Chinese) writing.Skill used by the application Art considers the characteristic of the language using large character set, and can obtain linear run time and reduce search time.The skill Art can for example be forbidden including in Bulletin Board Systems (BBS) thread the text of one or more pre-defined pattern words This.
Description of the drawings
In the accompanying drawings in an illustrative manner and unrestricted mode illustrates the embodiment of the present invention, the similar mark in attached drawing Number instruction similar components, in the accompanying drawings:
Fig. 1 be illustrate accoding to exemplary embodiment for searching for multiple target pattern string (target from text Pattern string) system block diagram;
Fig. 2 is the stream for being used to search for the method for multiple target pattern strings from text illustrated accoding to exemplary embodiment Cheng Tu;And
Fig. 3 is the block diagram that machine accoding to exemplary embodiment is illustrated with the exemplary form of computer system.
Specific embodiment
System and method for searching for multiple target pattern strings from text will be described.In the following description, in order to Illustrate and propose multiple details, to provide comprehensive understanding to the present invention.But those skilled in the art it will be clearly understood that The present invention can be also carried out without these details.
In many situations, people can attempt search digital document to find and position some specific contents.For example, certain A little parents or authoritative institution can attempt the digital document that search opens children, with determine whether to have in digital document with Use the harmful content of Asian language (for example, Chinese) writing of large character set.These harmful contents can be not suitable for child Multiple predeterminated target pattern strings (such as Chinese word or word), such as " pornographic ", " pornographic net ", " porny ", " violence " and " violence TV play " etc..Each target pattern string (such as " pornographic net ") has one first word (such as " color ") and a pattern string Length (such as 3).
Fig. 1 is the system 100 for being used to search for multiple target pattern strings from text illustrated accoding to exemplary embodiment Block diagram.In certain embodiments, system 100 can include the first storage device 10, the storage of the second storage device the 20, the 3rd is set Standby 30 and search engine 40.System 100 can also include one or more processors 50, retain in a computer for performing It instructs to operate other assemblies.
In certain embodiments, the first storage device 10 can store pre-defined target pattern string (for example, " color Feelings ", " pornographic net ", " porny ", " violence " and " violence TV play "), these target pattern strings each have the first word (example Such as, " color " or " sudden and violent ") and its pattern string length (for example, 2,3,4 or 5).First storage device 10 can include a HashSet (hash aggregation).The HashSet is the specific implementation of Set (set) interface.It creates the aggregation using hash table (collection) for storage.Hash table stores information by using the mechanism for being referred to as Hash (hashing). In some embodiments, system 100 can also include user interface 60, can be used for receiving that the first storage device will be stored in Target pattern string in 10.
In certain embodiments, the second storage device 20 can store target pattern string (for example, " pornographic ", " pornographic net ", " pornographic acute ", " porny ", " violence " and " violence TV play ") the first word (for example, " color " or " sudden and violent ") and pattern string it is long Spend (for example, 2,3,4 or 5) unique combinations (for example,<" color ", (2,3,4)>With<" sudden and violent ", (2,5)>).With the first pattern word First pattern word of the pattern string (for example, " pornographic ", " pornographic net ", " pornographic acute ", " porny ") that (such as " color ") starts With the combination of pattern length (for example, 2,3,3,4) (such as<" color ", (2,3,4)>) it is unduplicated.Second storage device 20 can To include at least one HashMap (hash figure).HashMap refers to a kind of data structure, using hash function come by some marks Know symbol or keyword (key) is efficiently mapped to associated value (for example, their telephone number).Hash function is used for key Word is converted into therefrom finding the index of the array element (slot (slot) or bucket (bucket)) of analog value.
In certain embodiments, search engine 40 can start iteratively one of identification and the first pattern word (example from text Such as " color ") word that matches, and the matching word is arranged to the current word in text.
In certain embodiments, search engine 40 can iteratively extract with current word (such as " color ") start and with etc. In the substring of the substring length of one of target pattern string length (such as 2).
In certain embodiments, search engine 40 can be iteratively by the substring with having the first word with the substring respectively First pattern word (such as " color ") identical with string length and each target pattern string (such as " pornographic ") of string length (such as 2) It compares.If the substring extracted matches with one of target pattern string (such as " pornographic "), with the relevant information of the substring It can then be stored in the 3rd storage device 30.It can include the position of the substring with the relevant information of the substring.
This process, which all the way moves forward to, reaches text ending.In certain embodiments, the word string navigated to is highlighted with police Show user.In certain embodiments, system 100 can include display 70, for showing and being stored in the 3rd storage device 30 In the relevant information of substring.
System 100 can efficiently find and position in the text and one or more in pre-defined target pattern string A substring to match, to reduce search time.
Fig. 2 is the method 200 for being used to search for multiple target pattern strings from text illustrated accoding to exemplary embodiment Flow chart.
In certain embodiments, at operation 202, by multiple target pattern strings (for example, " pornographic ", " pornographic net ", " color Feelings play ", " porny ", " violence " and " violence TV play ") it is respectively stored in the first storage device 10.These target patterns String is started respectively with the first pattern word (such as " color " or " sudden and violent ").
It, will be with the target pattern string that the first pattern word (such as " color " or " sudden and violent ") starts (for example, " color at operation 204 Feelings ", " pornographic net ", " pornographic acute ", " porny ", " violence " and " violence TV play ") the first pattern word (such as " color " or " sudden and violent ") and corresponding pattern length (such as 2,3,4 or 5) unique combinations (for example,<" color ", 2,3,4>) and<" sudden and violent ", 2,5>It deposits Storage is in the second storage device 20.
At operation 206, using search engine 40 come iteratively identify in the text with one of the first pattern word (such as " color ") word that matches, to be arranged to current word.
At operation 208, iteratively extracted using search engine 40 and start and have and mesh with current word (such as " color ") The substring of the equal substring length of one of sample string length of marking on a map (such as 2).
At operation 210, using search engine 40 iteratively by the substring with having the first pattern with the substring respectively Each target pattern string (example of word first pattern word identical with substring length (such as " color ") and target string length (such as 2) Such as " pornographic ") it compares.
It, will be with the substring phase at operation 212 if the substring is matched with one of target pattern string (such as " pornographic ") The information of pass is stored in the 3rd storage device 30.It can include the position of the substring with the relevant information of the substring.Operation 206 It is repeated to 212, the ending until reaching text.
It is related come all substrings for showing to being stored in the 3rd storage device 30 using display 70 at operation 214 Information.
The operation of embodiment can perform in two stages:Initialization and processing.In initial phase, target pattern word (searched word) is placed into HashSet, and the first word of each target pattern word and length are placed into HashMap. Since the HashMap in JAVA does not allow the keyword repeated, the different length of the identical pattern word of the first word is placed into HashMap。
In processing stage, text is iterated processing from the starting.Each word of text is examined.If in HashMap In find current word, then may the substring of current location be pattern word.Possible length is obtained from HashMap, and for each Whether possible length extracts the substring according to current length, to check it in HashSet.If hit, in text Target pattern word is found in this, starting position is current location and length is current length.Otherwise, process proceed to it is next can It can length.If be possible to length is processed, process if, proceeds to next word of text.
Assuming that text size is M, and there are N number of target pattern strings.It is A*N, wherein A the time required to initial phase Be include for by word storage into HashSet, extract the first word of substring and store word and length into HashMap Time constant.It is B*M the time required to processing stage, wherein B is to include searching word in HashMap, finding to order In in the case of extract substring and the constant of the time of the substring searched in HashSet.The total time of the algorithm is complicated Degree is the function of (A*N+B*M).Fig. 3 is the block diagram that a machine is illustrated with the exemplary form of computer system 300, in the meter The command sequence for any one method that machine is caused to perform in method discussed here can be performed in calculation machine system 300 Set.In an alternate embodiment, the machine can be server computer, client computer, personal computer (PC), Tablet PC, set-top box (STB), personal digital assistant (PDA), cellular phone, network tool, network router, interchanger or Bridge or the arbitrary machine for being able to carry out specifying the instruction set for the action that will be taken by the machine.It is although in addition, only single Machine is illustrated, but term " machine " can also include the arbitrary aggregation of multiple machines, these machines are individually or jointly Set of instructions performs any one or more method in the method discussed here.
Exemplary computer system 300 includes processor 302 (such as central processing unit (CPU), graphics processing unit (GPU) or both), main storage 304 and static memory 306, they communicate with one another via bus 308.Computer system 300 can also include video display unit 310 (such as liquid crystal display (LCD) or cathode-ray tube (CRT)).Computer system 300 further include Alphanumeric Entry Device 312 (such as keyboard), cursor control device 314 (such as mouse), disk drive unit 316th, signal generation equipment 328 (such as loud speaker) and network interface device 320.
Disk drive unit 316 includes machine readable media 322, stores embody in method or function described herein thereon The one or more instruction sets (such as software 324) of any one or more.Software 324 is held by computer system 300 It can also completely or at least partially be resided in main storage 304 and/or processor 320 between the departure date, main storage 304 and processing Device 320 also forms machine readable media.
Software 324 can also be sent or be received on network 326 via network interface device 320.It is although machine readable Medium 322 is shown as single medium in the exemplary embodiment, but term " machine readable media " is considered as including The single medium or multiple media of the one or more instruction sets of storage are (for example, centralized or distributed database and/or association Caching and server).Term " machine readable media " will also be believed to comprise store, encode or carry following instruction set The arbitrary medium of conjunction, in method or operation that described instruction set is executable by a machine and machine is caused to perform the embodiment of the present invention Any one or more.Term " machine readable media " therefore it will be believed to comprise (but being not limited to) solid-state memory, light With magnetic medium and carrier signal.
Therefore, it has been described from the method and system of the multiple target pattern strings of text search.Although the present invention by reference to Specific exemplary embodiment is described, but will be seen that, in the feelings for the wider spirit and scope for not departing from the present invention It can various modifications and changes may be made to these embodiments under condition.Therefore, specification and drawings will be considered as it is illustrative rather than Restricted.

Claims (14)

1. a kind of for searching for the method for word string in the text, this method includes:
Store the pattern string started respectively with the first word;
For each the first different word, each different string length of first word and the pattern string started with first word are stored Combination;
The word that one of identification and the first word match in the text;
Iteratively based on started with the word identified and string length be equal to combine with matched first word the string length that stores it One string extracts string from text;
Iteratively by each string extracted and with first word and string length identical with the string that this is extracted each pattern String compares;And
The matching of one of at least one string extracted and the pattern string is determined based on this comparison.
2. the method as described in claim 1 further includes from user interface and receives the pattern string.
3. it is relevant at least one substring extracted to further include display over the display for the method as described in claim 1 Information.
4. the method for claim 1, wherein the pattern string is stored as at least one HashSet.
5. method as claimed in claim 4, wherein, the combination is stored as at least one HashMap.
6. method as claimed in claim 3, wherein, exist at least one relevant information of the substring extracted including the substring Position in text.
It is shown 7. the method for claim 1, wherein in the text highlighting at least one substring extracted.
8. a kind of for searching for the system of word string in the text, which includes:
For storing the device of the pattern string started respectively with the first word;
For storing each different strings of first word and the pattern string started with first word for each the first different word The device of the combination of length;
For identifying the device of the word to match with one of the first word in the text;
For iteratively based on being started with the word identified and string length is equal to the string for combine with matched first word and storing and grows The string of one of degree extracts the device of string from text;
For iteratively by each string extracted and each with first word and string length identical with the string that this is extracted The device that pattern string compares;And
For determining the matched device of one of at least one string extracted and the pattern string based on this comparison.
9. system as claimed in claim 8 further includes to receive the device of the pattern string from user interface.
10. system as claimed in claim 8 is further included for upper display and at least one relevant information of substring extracted Device.
11. system as claimed in claim 9, wherein, include the substring at least one relevant information of the substring extracted Position in the text.
12. system as claimed in claim 8, wherein, include at least one HashSet for storing the device of pattern string.
13. system as claimed in claim 12, wherein, include at least one HashMap for storing the device of the combination.
14. system as claimed in claim 8 is further included and shown in the text highlighting at least one substring extracted Device.
CN201410757944.3A 2010-02-26 2010-02-26 For searching for the method and system of multiple strings Expired - Fee Related CN104484381B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410757944.3A CN104484381B (en) 2010-02-26 2010-02-26 For searching for the method and system of multiple strings

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201410757944.3A CN104484381B (en) 2010-02-26 2010-02-26 For searching for the method and system of multiple strings
CN201010116709.XA CN102169485B (en) 2010-02-26 2010-02-26 Method and system for searching a plurality of strings

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
CN201010116709.XA Division CN102169485B (en) 2010-02-26 2010-02-26 Method and system for searching a plurality of strings

Publications (2)

Publication Number Publication Date
CN104484381A CN104484381A (en) 2015-04-01
CN104484381B true CN104484381B (en) 2018-05-22

Family

ID=52758922

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410757944.3A Expired - Fee Related CN104484381B (en) 2010-02-26 2010-02-26 For searching for the method and system of multiple strings

Country Status (1)

Country Link
CN (1) CN104484381B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1324570A (en) * 2000-05-19 2001-12-05 住友化学工业株式会社 Heating fumigating preventing-killing method for injurious insects
US6484164B1 (en) * 2000-03-29 2002-11-19 Koninklijke Philips Electronics N.V. Data search user interface with ergonomic mechanism for user profile definition and manipulation
CN1890669A (en) * 2003-10-15 2007-01-03 施克莱无线公司 Incremental search for keyword strings
CN100557606C (en) * 2003-03-03 2009-11-04 皇家飞利浦电子股份有限公司 Method and device for finding strings

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060212426A1 (en) * 2004-12-21 2006-09-21 Udaya Shakara Efficient CAM-based techniques to perform string searches in packet payloads

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6484164B1 (en) * 2000-03-29 2002-11-19 Koninklijke Philips Electronics N.V. Data search user interface with ergonomic mechanism for user profile definition and manipulation
CN1324570A (en) * 2000-05-19 2001-12-05 住友化学工业株式会社 Heating fumigating preventing-killing method for injurious insects
CN100557606C (en) * 2003-03-03 2009-11-04 皇家飞利浦电子股份有限公司 Method and device for finding strings
CN1890669A (en) * 2003-10-15 2007-01-03 施克莱无线公司 Incremental search for keyword strings

Also Published As

Publication number Publication date
CN104484381A (en) 2015-04-01

Similar Documents

Publication Publication Date Title
US11599714B2 (en) Methods and systems for modeling complex taxonomies with natural language understanding
CN110929125B (en) Search recall method, device, equipment and storage medium thereof
US8577882B2 (en) Method and system for searching multilingual documents
US10649770B2 (en) κ-selection using parallel processing
US7917514B2 (en) Visual and multi-dimensional search
US8655805B2 (en) Method for classification of objects in a graph data stream
US9245007B2 (en) Dynamically detecting near-duplicate documents
CN110837550A (en) Knowledge graph-based question and answer method and device, electronic equipment and storage medium
US7739221B2 (en) Visual and multi-dimensional search
WO2016023471A1 (en) Methods for processing handwritten inputted characters, splitting and merging data and encoding and decoding processing
CN104809117B (en) Video data aggregation processing method, paradigmatic system and video search platform
CN107025239B (en) Sensitive word filtering method and device
CN107145496A (en) The method for being matched image with content item based on keyword
US20110113046A1 (en) Information processing apparatus, information extracting method, program, and information processing system
US20170068732A1 (en) Multi-system segmented search processing
CN107145497A (en) The method of the image of metadata selected and content matching based on image and content
CN105404677A (en) Tree structure based retrieval method
US11762829B2 (en) Scalable fine grained access control within a search engine
CN111602129B (en) Smart search for notes and ink
US20130304720A1 (en) Methods and Apparatus for Presenting Search Results with Indication of Relative Position of Search Terms
CN112380445A (en) Data query method, device, equipment and storage medium
CN104484381B (en) For searching for the method and system of multiple strings
US20190163810A1 (en) Search User Interface
CN109710844A (en) Method and device for quickly and accurately locating files based on search engine
CN103348348B (en) Information search apparatus and information search method

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20180522

Termination date: 20210226

CF01 Termination of patent right due to non-payment of annual fee