
CN104216878A - New word discovery system and method - Google Patents


Publication number
CN104216878A
Authority
CN
China
Prior art keywords
neologisms
module
new word
outer station
word discovery
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201310205571.4A
Other languages
Chinese (zh)
Inventor
王玉平
陈运文
姜迅
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Lianshang Network Technology Co Ltd
Original Assignee
Cool Sheng (tianjin) Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Cool Sheng (tianjin) Technology Co Ltd
Priority to CN201310205571.4A
Publication of CN104216878A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/958 Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a new word discovery system and method. The new word discovery system comprises an external-site capture module, a search new-word module, and a deduplication module. The external-site capture module captures new words from websites external to the current site and gathers them to obtain the total external-site new words. The search new-word module counts the entries that users search for and extracts the N most frequent entries as user-search new words. The deduplication module gathers and de-duplicates the total external-site new words captured by the external-site capture module and the user-search new words extracted by the search new-word module to obtain the final latest new words. The new word discovery system and method avoid the heavy computational burden of traditional new word discovery algorithms, obtain the latest new words, and effectively keep internet applications up to date.

Description

New word discovery system and method
Technical field
The present invention relates to a new word discovery system and method, and in particular to a new word discovery system and method for content recommendation.
Background
With the rapid development of computers, more and more applications in the internet industry involve text processing. The most common is search, and there are many other specific applications such as video recommendation, product recommendation, speech synthesis, and speech recognition. What these applications have in common is that they all depend on text and need to understand its content. At present, the basic processing steps are as follows: after the text is obtained, it is segmented into words; the segmented words are then tagged with their parts of speech; finally, further processing such as keyword extraction is performed, and the extracted keywords are used in subsequent processing. The most basic of these steps is word segmentation. If the segmentation result is poor, it has a large impact on all subsequent processing, so word segmentation is the top priority. However, any segmentation algorithm has difficulty handling words that do not occur in its training data or dictionary. As a result, after a system has been in use for some time and more and more new words appear, its processing quality becomes worse and worse.
New word discovery algorithms arose to solve this problem. Typically, a new word discovery algorithm extracts new words from massive internet data. This approach has several problems. First, even massive internet data can hardly cover all new words. Second, the computational cost of extracting new words from large volumes of internet data is very high. Third, every new word discovery algorithm introduces some noise, so that some of the extracted new words are malformed, which in turn has a considerable impact on segmentation quality; unless manual correction is added, automatically extracted new words are quite problematic.
Summary of the invention
To overcome the deficiencies of the prior art described above, the object of the present invention is to provide a new word discovery system and method for content recommendation. The final new words are obtained by combining new words captured from external websites, user-search new words extracted from the entries that users search for, and new words obtained from other sources. This avoids the heavy computational burden of traditional new word discovery algorithms, yields the latest new words, and effectively keeps internet applications up to date.
To achieve the above and other objects, the present invention proposes a new word discovery system, comprising at least:
an external-site capture module, for capturing new words from websites external to the current site and gathering them to obtain the total external-site new words;
a search new-word module, for counting the entries that users have searched for and extracting the N most frequent entries as user-search new words; and
a deduplication module, for gathering and de-duplicating the total external-site new words captured by the external-site capture module and the user-search new words extracted by the search new-word module, to obtain the final latest new words.
Further, the system also comprises an other-source new-word module, for obtaining new words from other channels as other-source new words.
Further, the other-source new-word module extracts the M most frequent entries in a database as the other-source new words.
Further, the deduplication module gathers and de-duplicates the total external-site new words captured by the external-site capture module, the user-search new words extracted by the search new-word module, and the other-source new words extracted by the other-source new-word module, to obtain the final latest new words.
Further, the system also comprises a first deduplication module, which de-duplicates the new words that the external-site capture module captures from each external website and then gathers them into the total external-site new words.
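The module structure listed above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the implementation described by the applicant: the class name `NewWordDiscoverySystem`, the `WordSource` alias, and the stub sources are all hypothetical, and each module that supplies words is assumed to be a plain callable returning a list of candidate words.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, List

# Hypothetical abstraction: a "source" is any zero-argument callable
# that returns candidate new words (external sites, search logs, ...).
WordSource = Callable[[], Iterable[str]]


@dataclass
class NewWordDiscoverySystem:
    external_site_source: WordSource          # words captured from external sites
    search_source: WordSource                 # top-N user-search entries
    other_sources: List[WordSource] = field(default_factory=list)  # optional extras

    def discover(self) -> List[str]:
        """Gather words from every source and drop duplicates (first occurrence wins)."""
        seen = set()
        merged: List[str] = []
        sources = [self.external_site_source, self.search_source, *self.other_sources]
        for source in sources:
            for word in source():
                if word not in seen:
                    seen.add(word)
                    merged.append(word)
        return merged


# Stub sources with placeholder words, for illustration only.
system = NewWordDiscoverySystem(
    external_site_source=lambda: ["w1", "w2"],
    search_source=lambda: ["w2", "w3"],
    other_sources=[lambda: ["w1", "w4"]],
)
result = system.discover()
print(result)
```

Treating every source as an interchangeable callable is one way to reflect that the other-source module is optional: adding or removing a source does not change the deduplication step.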
To achieve the above and other objects, the present invention also provides a new word discovery method, comprising the following steps:
capturing new words from websites external to the current site and gathering them to obtain the total external-site new words;
counting the entries that users have searched for and extracting the N most frequent entries as user-search new words;
gathering and de-duplicating the captured total external-site new words and the extracted user-search new words to obtain the final latest new words.
Further, before the step of gathering and de-duplicating the captured total external-site new words and the extracted user-search new words, the method also comprises a step of obtaining new words from other channels as other-source new words.
Further, the M most frequent entries in a database are extracted as the other-source new words.
Further, the captured total external-site new words, the extracted user-search new words, and the other-source new words are gathered and de-duplicated to obtain the final latest new words.
Further, the new words captured from each external website are de-duplicated and then gathered into the total external-site new words.
Compared with the prior art, the new word discovery system and method of the present invention obtain the final latest new words by combining the external-site new words captured from external websites, the user-search new words extracted from the entries that users search for, and the other-source new words. This not only avoids the heavy computational burden of new word discovery algorithms but also yields the latest new words and effectively keeps internet applications up to date.
Brief description of the drawings
Fig. 1 is a system architecture diagram of the new word discovery system of the present invention;
Fig. 2 is a flow chart of the steps of the new word discovery method of the present invention.
Detailed description
Embodiments of the present invention are described below through specific examples and with reference to the drawings; those skilled in the art can readily understand other advantages and effects of the present invention from the content disclosed in this specification. The present invention can also be implemented or applied through other, different specific examples, and the details in this specification can be modified and changed in various ways, based on different viewpoints and applications, without departing from the spirit of the present invention.
Fig. 1 is a system architecture diagram of the new word discovery system of the present invention. As shown in Fig. 1, the new word discovery system of the present invention comprises at least: an external-site capture module 101, a search new-word module 102, and a deduplication module 103.
The external-site capture module 101 captures new words from websites external to the current site and gathers them to obtain the total external-site new words. The external websites here can be Baidu's trending list, Sina Weibo's new words, and so on, but are not limited to these. Suppose the external-site new words captured from Baidu's trending list are "second kills, dive under water, thunder, ..." and the external-site new words captured from Sina Weibo's new words are "Ming, sofa, mottled bamboo, ...".
The search new-word module 102 counts the entries that users search for and extracts the N most frequent entries as user-search new words. Suppose users have searched for words such as "second kills, mottled bamboo, mouse hand, Embarrassing, sofa, donkey friend, hold live, river crab, ...". The search new-word module 102 counts how often each of these words was searched and extracts the N most frequent entries as the user-search new words; for example, the top N entries are "second kills, mottled bamboo, donkey friend, river crab".
The deduplication module 103 gathers and de-duplicates the total external-site new words captured by the external-site capture module 101 and the user-search new words extracted by the search new-word module 102 to obtain the final latest new words. In this example, the final latest new words are "second kills, dive under water, thunder, Ming, sofa, mottled bamboo, donkey friend, river crab".
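The example above can be checked with a short sketch. It assumes that "gather and de-duplicate" means concatenating the lists and keeping the first occurrence of each word; the text does not specify the procedure, but this choice reproduces the order of the final list given in the example. The variable names are illustrative only.

```python
# Words captured from the two external sites in the example.
baidu_words = ["second kills", "dive under water", "thunder"]
weibo_words = ["Ming", "sofa", "mottled bamboo"]

# Top-N user-search entries from the example.
search_words = ["second kills", "mottled bamboo", "donkey friend", "river crab"]

# Gather all sources, then de-duplicate while preserving first-seen order.
# dict.fromkeys keeps insertion order (guaranteed since Python 3.7).
gathered = baidu_words + weibo_words + search_words
final_new_words = list(dict.fromkeys(gathered))

print(final_new_words)
```

A plain `set` would also remove the duplicates, but it would not preserve the order in which the words were gathered.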
Preferably, because the external websites include not only Baidu's trending list and Sina Weibo's new words but also many other websites, the external-site new words captured from the various websites are likely to contain many duplicates. The new word discovery system of the present invention can therefore also comprise a first deduplication module 104, which de-duplicates the new words that the external-site capture module 101 captures from each external website and then gathers them into the total external-site new words.
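A minimal sketch of this first deduplication pass follows. The function name `total_external_new_words`, the dictionary layout, and the third site with its overlapping words are assumptions made for illustration; the text only states that words captured from each site are de-duplicated before being gathered.

```python
from typing import Dict, List


def total_external_new_words(words_by_site: Dict[str, List[str]]) -> List[str]:
    """Gather words from all sites into one list, skipping any word already seen."""
    seen = set()
    total: List[str] = []
    for words in words_by_site.values():
        for word in words:
            if word not in seen:
                seen.add(word)
                total.append(word)
    return total


captured = {
    "baidu": ["second kills", "dive under water", "thunder"],
    "weibo": ["Ming", "sofa", "mottled bamboo"],
    # Hypothetical third site whose list overlaps with the first two.
    "another_site": ["thunder", "sofa", "hold live"],
}
total = total_external_new_words(captured)
print(total)
```

Doing this pass before the final merge keeps the total external-site list small when many sites report the same trending words.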
Preferably, in addition to external websites and user-search entries, the present invention can also have other sources of new words. The new word discovery system of the present invention can therefore also comprise an other-source new-word module 105, for obtaining new words from other channels. For example, for data in a database, the other-source new-word module 105 extracts the M most frequent entries as other-source new words, such as "scribble, plug-in, second kills, binding". Accordingly, the deduplication module 103 gathers and de-duplicates the total external-site new words captured by the external-site capture module 101, the user-search new words extracted by the search new-word module 102, and the other-source new words extracted by the other-source new-word module 105 to obtain the final latest new words, which are then "second kills, dive under water, thunder, Ming, sofa, mottled bamboo, donkey friend, river crab, scribble, plug-in, binding".
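One possible way to extract the M most frequent database entries and merge them into the earlier result is sketched below. The text does not describe the database, so the in-memory SQLite table `entries`, its single `term` column, the row counts, and the value of `M` are all assumptions made for illustration.

```python
import sqlite3

M = 4  # illustrative value; the text leaves M unspecified

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE entries (term TEXT NOT NULL)")  # hypothetical schema

# Hypothetical data: one row per occurrence of a term.
rows = (
    [("scribble",)] * 5
    + [("plug-in",)] * 4
    + [("second kills",)] * 3
    + [("binding",)] * 2
    + [("sofa",)] * 1
)
conn.executemany("INSERT INTO entries (term) VALUES (?)", rows)

# The M most frequent terms; ties are broken alphabetically for determinism.
other_source_words = [
    term
    for (term,) in conn.execute(
        "SELECT term FROM entries GROUP BY term "
        "ORDER BY COUNT(*) DESC, term ASC LIMIT ?",
        (M,),
    )
]
conn.close()

# Result of merging external-site and user-search words in the example.
previous = [
    "second kills", "dive under water", "thunder", "Ming",
    "sofa", "mottled bamboo", "donkey friend", "river crab",
]

# Merge the third source and de-duplicate, keeping first-seen order.
final_new_words = list(dict.fromkeys(previous + other_source_words))
print(final_new_words)
```

Here "second kills" appears in both lists and is kept only once, so the merged list has eleven words, as in the example.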
Fig. 2 is a flow chart of the steps of the new word discovery method of the present invention. As shown in Fig. 2, the new word discovery method of the present invention comprises the following steps:
Step 201: capture new words from websites external to the current site and gather them to obtain the total external-site new words. The external websites can be Baidu's trending list, Sina Weibo's new words, and so on, but are not limited to these. For example, suppose the external-site new words captured from Baidu's trending list are "second kills, dive under water, thunder, ..." and those captured from Sina Weibo's new words are "Ming, sofa, mottled bamboo, ..."; the total external-site new words after gathering are then "second kills, dive under water, thunder, Ming, sofa, mottled bamboo, ...".
Step 202: count the entries that users search for and extract the N most frequent entries as user-search new words. For example, suppose users have searched for words such as "second kills, mottled bamboo, mouse hand, Embarrassing, sofa, donkey friend, hold live, river crab, ..."; this step counts how often each of these words was searched and extracts the N most frequent entries as the user-search new words, for example "second kills, mottled bamboo, donkey friend, river crab".
Step 203: gather and de-duplicate the captured total external-site new words and the extracted user-search new words to obtain the final latest new words. In this example, the final latest new words after gathering and de-duplication are "second kills, dive under water, thunder, Ming, sofa, mottled bamboo, donkey friend, river crab".
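The frequency count in step 202 can be sketched with Python's `collections.Counter`. The search log and its counts below are invented for illustration, chosen so that the top four entries match the example; the text gives no actual frequencies and no value for N.

```python
from collections import Counter

# Hypothetical search log: one item per query submitted by a user.
search_log = (
    ["second kills"] * 9
    + ["mottled bamboo"] * 7
    + ["donkey friend"] * 6
    + ["river crab"] * 5
    + ["sofa"] * 3
    + ["mouse hand"] * 2
    + ["Embarrassing"]
    + ["hold live"]
)

N = 4  # illustrative value; the text leaves N unspecified

# Count how often each entry was searched, then keep the N most frequent.
counts = Counter(search_log)
user_search_new_words = [term for term, _ in counts.most_common(N)]

print(user_search_new_words)
```

The resulting list is what step 203 merges with the total external-site new words.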
Preferably, because the external websites include not only Baidu's trending list and Sina Weibo's new words but also many other websites, the external-site new words captured from the various websites are likely to contain many duplicates. Therefore, in step 201, the new words captured from each external website are de-duplicated and then gathered into the total external-site new words.
Preferably, in addition to external websites and user-search entries, the present invention can also have other sources of new words. Before step 203, the method can also comprise the following step: obtaining new words from other channels as other-source new words. For example, for data in a database, the other-source new-word module 105 extracts the M most frequent entries as other-source new words, such as "scribble, plug-in, second kills, binding". Accordingly, in step 203, the captured total external-site new words, the extracted user-search new words, and the other-source new words are gathered and de-duplicated to obtain the final latest new words, which are then "second kills, dive under water, thunder, Ming, sofa, mottled bamboo, donkey friend, river crab, scribble, plug-in, binding".
In summary, the new word discovery system and method of the present invention obtain the final latest new words by combining the external-site new words captured from external websites, the user-search new words extracted from the entries that users search for, and the other-source new words. This not only avoids the heavy computational burden of new word discovery algorithms but also yields the latest new words and effectively keeps internet applications up to date.
The above embodiments merely illustrate the principles and effects of the present invention and are not intended to limit it. Any person skilled in the art can modify and change the above embodiments without departing from the spirit and scope of the present invention. Therefore, the scope of protection of the present invention shall be as set out in the claims.

Claims (10)

1. A new word discovery system, comprising at least:
an external-site capture module, for capturing new words from websites external to the current site and gathering them to obtain the total external-site new words;
a search new-word module, for counting the entries that users have searched for and extracting the N most frequent entries as user-search new words; and
a deduplication module, for gathering and de-duplicating the total external-site new words captured by the external-site capture module and the user-search new words extracted by the search new-word module, to obtain the final latest new words.
2. The new word discovery system as claimed in claim 1, characterized in that the system further comprises an other-source new-word module, for obtaining new words from other channels as other-source new words.
3. The new word discovery system as claimed in claim 2, characterized in that the other-source new-word module extracts the M most frequent entries in a database as the other-source new words.
4. The new word discovery system as claimed in claim 3, characterized in that the deduplication module gathers and de-duplicates the total external-site new words captured by the external-site capture module, the user-search new words extracted by the search new-word module, and the other-source new words extracted by the other-source new-word module, to obtain the final latest new words.
5. The new word discovery system as claimed in claim 1, characterized in that the system further comprises a first deduplication module, which de-duplicates the new words that the external-site capture module captures from each external website and then gathers them into the total external-site new words.
6. A new word discovery method, comprising the following steps:
capturing new words from websites external to the current site and gathering them to obtain the total external-site new words;
counting the entries that users have searched for and extracting the N most frequent entries as user-search new words;
gathering and de-duplicating the captured total external-site new words and the extracted user-search new words to obtain the final latest new words.
7. The new word discovery method as claimed in claim 6, characterized in that, before the step of gathering and de-duplicating the captured total external-site new words and the extracted user-search new words, the method further comprises a step of obtaining new words from other channels as other-source new words.
8. The new word discovery method as claimed in claim 7, characterized in that the M most frequent entries in a database are extracted as the other-source new words.
9. The new word discovery method as claimed in claim 8, characterized in that the captured total external-site new words, the extracted user-search new words, and the other-source new words are gathered and de-duplicated to obtain the final latest new words.
10. The new word discovery method as claimed in claim 6, characterized in that the new words captured from each external website are de-duplicated and then gathered into the total external-site new words.
CN201310205571.4A 2013-05-29 2013-05-29 New word discovery system and method Pending CN104216878A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310205571.4A CN104216878A (en) 2013-05-29 2013-05-29 New word discovery system and method

Publications (1)

Publication Number Publication Date
CN104216878A true CN104216878A (en) 2014-12-17

Family

ID=52098384

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310205571.4A Pending CN104216878A (en) 2013-05-29 2013-05-29 New word discovery system and method

Country Status (1)

Country Link
CN (1) CN104216878A (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1912872A (en) * 2006-07-25 2007-02-14 北京搜狗科技发展有限公司 Method and system for abstracting new word
CN102929862A (en) * 2012-11-06 2013-02-13 深圳市宜搜科技发展有限公司 New word acquiring method and system

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105512109A (en) * 2015-12-11 2016-04-20 北京锐安科技有限公司 New word discovery method and device
CN105512109B (en) * 2015-12-11 2019-04-16 北京锐安科技有限公司 The discovery method and device of new term

Similar Documents

Publication Publication Date Title
CN110781317B (en) Method and device for constructing event map and electronic equipment
CN103942335B (en) Construction method of uninterrupted crawler system oriented to web page structure change
CN106202211B (en) An Integrated Microblog Rumor Identification Method Based on Microblog Type
CN102087648B (en) Method and system for fetching news comment page
CN103838823B (en) Website content accessible detection method based on web page templates
CN107992469A (en) A kind of fishing URL detection methods and system based on word sequence
CN107544988B (en) Method and device for acquiring public opinion data
CN106815307A (en) Public Culture knowledge mapping platform and its use method
CN103617213B (en) Method and system for identifying newspage attributive characters
CN103177036A (en) Method and system for label automatic extraction
CN106372118B (en) Online semantic understanding search system and method towards mass media text data
CN107766399A (en) For the method and system and machine readable media for image is matched with content item
CN104462396B (en) Character string processing method and device
CN105095271B (en) Microblogging search method and microblogging retrieve device
CN104778164A (en) Method and device for detecting repeated URL (Uniform Resource Locator)
CN106156041A (en) Hot information finds method and system
CN106372202A (en) Text similarity calculation method and device
CN107193930A (en) A method for shielding website sensitive words
CN103823753B (en) Webpage sampling method oriented at barrier-free webpage content detection
CN105468780A (en) Normalization method and device of product name entity in microblog text
CN104199947A (en) Important person speech supervision and incidence relation excavating method
CN106250456A (en) A method and device for extracting bid-winning announcements
CN104216878A (en) New word discovery system and method
CN111222000B (en) An image classification method and system based on graph convolutional neural network
CN104281710A (en) Network data excavation method

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20180607

Address after: 201203 7, 1 Lane 666 lane, Zhang Heng Road, Pudong New Area, Shanghai.

Applicant after: SHANGHAI ZHANGMEN TECHNOLOGY CO., LTD.

Address before: 300467 Tianjin Binhai New Area Tianjin eco city animation road 126 anime building B1 area two layer 201-243

Applicant before: Cool Sheng (Tianjin) Technology Co., Ltd.

TA01 Transfer of patent application right

Effective date of registration: 20180803

Address after: 300450 Tianjin Binhai New Area Tianjin eco city animation road 126 anime building B1 area two layer 201-243

Applicant after: Cool Sheng (Tianjin) Technology Co., Ltd.

Address before: 201203 7, 1 Lane 666 lane, Zhang Heng Road, Pudong New Area, Shanghai.

Applicant before: SHANGHAI ZHANGMEN TECHNOLOGY CO., LTD.

TA01 Transfer of patent application right

Effective date of registration: 20181219

Address after: 201306 N2025 room 24, 2 New Town Road, mud town, Pudong New Area, Shanghai

Applicant after: Shanghai Lian Shang network technology Co., Ltd

Address before: 300450 Tianjin Binhai New Area Tianjin eco city animation road 126 anime building B1 area two layer 201-243

Applicant before: Cool Sheng (Tianjin) Technology Co., Ltd.

RJ01 Rejection of invention patent application after publication

Application publication date: 20141217