
CN113761125B - Dynamic summary determination method and device, computing device and computer storage medium - Google Patents


Info

Publication number
CN113761125B
CN113761125B (application CN202110577211.1A)
Authority
CN
China
Prior art keywords
word
keyword
keyword set
keywords
similarity
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110577211.1A
Other languages
Chinese (zh)
Other versions
CN113761125A (en)
Inventor
康战辉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN202110577211.1A
Publication of CN113761125A
Application granted
Publication of CN113761125B
Legal status: Active


Classifications

    All under G (Physics) › G06 (Computing or calculating; counting):
    • G06F 16/3344: Information retrieval of unstructured textual data; querying; query execution using natural language analysis
    • G06F 16/35: Information retrieval of unstructured textual data; clustering; classification
    • G06F 40/194: Handling natural language data; text processing; calculation of difference between files
    • G06F 40/279: Handling natural language data; natural language analysis; recognition of textual entities
    • G06F 40/30: Handling natural language data; semantic analysis
    • G06N 3/04: Neural networks; architecture, e.g. interconnection topology
    • G06N 3/08: Neural networks; learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract


The present application proposes a method and apparatus for determining a dynamic summary, a computing device, and a computer storage medium, the method comprising: obtaining a current document searched based on search content, the current document comprising a title part and a body part; extracting multiple keywords from the search content; filtering keywords that are not included in the title part of the current document from the multiple keywords as a first keyword set; extracting keywords from each sentence in the body part of the current document to correspondingly form a second keyword set for each sentence; traversing the sentences in the body part to determine the similarity between the first keyword set and the second keyword set of the traversed sentences; in response to the similarity being greater than a similarity threshold, determining a part of the dynamic summary for the current document based on the traversed sentences.

Description

Dynamic summary determination method and apparatus, computing device, and computer storage medium
Technical Field
The present disclosure relates to the field of natural language processing, and in particular, to a dynamic summary determination method and apparatus, a computing device, and a computer storage medium.
Background
With the development of computer technology, dynamic summaries are widely used, for example, in summarizing search results, marking key sentences in documents, and presenting content related to search content. As an example, the same document may have different dynamic summaries for different search content. A conventional dynamic summary determination method typically decides which sentences of a document should serve as its dynamic summary according to how many keywords of the search content each sentence contains.
However, in a dynamic summary determined by the conventional method, it often happens that some keywords of the search content appear repeatedly in both the document title and the dynamic summary while other keywords of the search content appear in neither. The determined dynamic summary is therefore not accurate enough to present information relevant to the search content as a whole, and may even stray far from the true query intention the search content expresses.
Disclosure of Invention
In view of the above, the present disclosure provides dynamic summary determination methods and apparatus, computing devices, and computer storage media, which desirably overcome some or all of the above-referenced shortcomings, as well as other possible shortcomings.
According to a first aspect of the present disclosure, there is provided a dynamic summary determination method including: obtaining a current document searched based on search content, the current document including a title portion and a body portion; extracting a plurality of keywords from the search content; screening, from the plurality of keywords, keywords not included in the title portion of the current document as a first keyword set; extracting keywords from each sentence of the body portion of the current document to correspondingly form a second keyword set for each sentence; traversing the sentences in the body portion and determining a similarity between the first keyword set and the second keyword set of the traversed sentence; and, in response to the similarity being greater than a similarity threshold, determining a portion of the dynamic summary for the current document based on the traversed sentence.
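Under stated assumptions, the flow of the first aspect can be sketched in Python. The function names (`determine_dynamic_summary`, `extract_keywords`, `similarity`), the default threshold, and the substring check against the title are illustrative choices, not the patent's exact implementation:

```python
def determine_dynamic_summary(query, title, body_sentences,
                              extract_keywords, similarity,
                              threshold=0.5):
    """Sketch of the claimed method: query keywords that are absent
    from the title drive sentence selection for the summary."""
    query_keywords = extract_keywords(query)
    # First keyword set: query keywords NOT already covered by the title
    # (a simple substring check stands in for real title matching here).
    first_set = {kw for kw in query_keywords if kw not in title}
    summary = []
    for sentence in body_sentences:              # traverse the body
        second_set = set(extract_keywords(sentence))
        if similarity(first_set, second_set) > threshold:
            summary.append(sentence)             # sentence joins the summary
    return summary
```

A caller would plug in its own tokenizer for `extract_keywords` and a set or vector similarity for `similarity`.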
In some embodiments, traversing sentences in the body part, determining similarity between the first keyword set and a second keyword set of the traversed sentences comprises determining word vectors of keywords in the first keyword set, determining first feature vectors of the first keyword set based on the word vectors of the keywords in the first keyword set, determining word vectors of the keywords in the second keyword set of the traversed sentences, determining second feature vectors of the second keyword set based on the word vectors of the keywords in the second keyword set, and determining similarity between the first keyword set and the second keyword set of the traversed sentences based on the first feature vectors and the second feature vectors.
In some embodiments, determining the first feature vector of the first keyword set based on the word vector of each keyword in the first keyword set includes performing bit-wise accumulation on the word vector of each keyword in the first keyword set to obtain the first feature vector of the first keyword set, and determining the second feature vector of the second keyword set based on the word vector of each keyword in the second keyword set includes performing bit-wise accumulation on the word vector of each keyword in the second keyword set to obtain the second feature vector of the second keyword set.
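The "bit-wise accumulation" described here is element-wise summation of the word vectors. A minimal pure-Python sketch (the name `feature_vector` is an assumption of this sketch):

```python
def feature_vector(word_vectors):
    """Accumulate word vectors element-wise ("bit-wise" in the patent's
    wording) into a single feature vector for a keyword set."""
    dim = len(word_vectors[0])     # all vectors share one dimensionality
    acc = [0.0] * dim
    for vec in word_vectors:
        for i in range(dim):
            acc[i] += vec[i]       # sum position by position
    return acc
```

The same function serves both the first and the second keyword set.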
In some embodiments, determining a similarity between the first set of keywords and a second set of keywords of the traversed sentence based on the first and second feature vectors includes determining a similarity between the first set of keywords and the second set of keywords of the traversed sentence based on a distance between the first and second feature vectors, wherein the distance includes one of a cosine distance, a Euclidean distance, and a Manhattan distance.
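Of the listed distances, the cosine distance is the most common choice for comparing keyword-set feature vectors. A sketch of a cosine-based similarity (the zero-vector guard is an added assumption, not part of the patent text):

```python
import math

def cosine_similarity(u, v):
    """Similarity of two feature vectors: 1.0 means identical direction,
    0.0 means orthogonal (or a zero vector)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0                 # guard against empty keyword sets
    return dot / (norm_u * norm_v)
```

Euclidean or Manhattan distance could be substituted, with the distance mapped to a similarity (e.g., 1 / (1 + distance)).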
In some embodiments, traversing sentences in the body part, determining similarity between the first set of keywords and a second set of keywords of the traversed sentences includes traversing sentences in the body part, and determining similarity between the first set of keywords and the second set of keywords of the traversed sentences when a current word count of a dynamic summary is less than a word count threshold.
In some embodiments, in response to the similarity being greater than the similarity threshold and the sum of the number of words of the traversed sentence and the current number of words of the dynamic summary being greater than a word count threshold, determining a portion of the dynamic summary for the current document based on the traversed sentence includes determining a part of the traversed sentence as a portion of the dynamic summary such that the sum of the number of words of that part and the current number of words of the dynamic summary equals the word count threshold.
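The truncation rule can be illustrated as follows; treating a sentence as a list of words (for Chinese text this would typically be characters or segmented words) and the function name are assumptions of this sketch:

```python
def append_with_budget(summary_words, sentence_words, word_count_threshold):
    """If adding the whole sentence would exceed the word budget, keep
    only the prefix that fills the summary exactly to the threshold."""
    remaining = word_count_threshold - len(summary_words)
    if remaining <= 0:
        return summary_words                    # budget already exhausted
    return summary_words + sentence_words[:remaining]
```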
In some embodiments, extracting a plurality of keywords in the search content comprises segmenting the search content to obtain a first segmented word set comprising a plurality of words, removing stop words from the plurality of words in the first segmented word set to obtain a second segmented word set, determining word weights of each word in the second segmented word set, and removing words with word weights smaller than a word weight threshold from the second segmented word set to obtain a plurality of keywords in the search content.
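The four-step extraction pipeline above can be sketched as below; the segmenter and the weight function (e.g., an IDF lookup) are supplied by the caller and, like the function name, are assumptions of this sketch:

```python
def extract_keywords(query, segment, stop_words, word_weight, weight_threshold):
    """Pipeline from the embodiment: segment the query, drop stop words,
    then drop words whose weight falls below the threshold."""
    first_word_set = segment(query)                         # word segmentation
    second_word_set = [w for w in first_word_set if w not in stop_words]
    return [w for w in second_word_set
            if word_weight(w) >= weight_threshold]          # keep heavy words
```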
In some embodiments, determining the word weight for each word in the second set of words includes determining an inverse document frequency value for each word in the second set of words and determining the inverse document frequency value for each word in the second set of words as the word weight for that word.
In some embodiments, determining the word weight of each word in the second segmented word set includes determining an inverse document frequency value of each word in the second segmented word set, and determining the word weight of each word based on its inverse document frequency value together with at least one of the word's part of speech, its position in the search content, its historical search count, and its historical click-through rate.
In some embodiments, determining the inverse document frequency value of each word in the second segmented word set includes: obtaining a query log including D pieces of search content; determining, for each respective word in the second segmented word set, the number d of pieces of search content in the query log that include the respective word; determining the quotient of the total number D of pieces of search content and the number d; and taking the logarithm of the quotient to obtain the inverse document frequency value of the respective word.
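This computation is the classic IDF formula, idf(w) = log(D / d). A minimal sketch over a query log of whitespace-segmented queries; the segmentation and the assumption d > 0 (the word occurs at least once) are illustrative:

```python
import math

def inverse_document_frequency(word, query_log):
    """IDF of a word over a query log: log of (total queries D) over
    (queries d that contain the word). Assumes the word occurs in
    at least one query (d > 0)."""
    D = len(query_log)                               # total pieces of search content
    d = sum(1 for q in query_log if word in q.split())
    return math.log(D / d)
```

Rare words get high IDF (high weight); ubiquitous words get IDF near zero.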
In some embodiments, determining the word vector of each keyword in the first keyword set includes determining the word vector of each keyword based on a trained word embedding model, wherein the trained word embedding model is trained by obtaining a query log and segmenting the search content in the query log to obtain a plurality of segments, and then training the word embedding model either by taking each respective segment of the plurality of segments as the input of the word embedding model and the contextual segments of the respective segment as the output of the word embedding model, or by taking each respective segment as the output of the word embedding model and its contextual segments as the input of the word embedding model.
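The two training directions described here correspond to skip-gram (a segment predicts its context) and CBOW (context predicts the segment). A sketch that builds the (input, output) training pairs from segmented search content; the window size, parameter names, and function name are assumptions of this sketch:

```python
def training_pairs(segmented_queries, window=1, skip_gram=True):
    """Build (input, output) pairs for word-embedding training.
    skip_gram=True: the word is the input, a context word the output.
    skip_gram=False (CBOW direction): a context word is the input."""
    pairs = []
    for tokens in segmented_queries:
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j == i:
                    continue                      # skip the center word itself
                ctx = tokens[j]
                pairs.append((word, ctx) if skip_gram else (ctx, word))
    return pairs
```

In practice these pairs would feed a shallow two-layer network (as in Word2Vec); the trained hidden layer then supplies the word vectors.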
According to a second aspect of the present disclosure, there is provided a dynamic summary determination apparatus including: a current document acquisition module configured to acquire a current document searched based on search content, the current document including a title portion and a body portion; a first keyword extraction module configured to extract a plurality of keywords from the search content; a first keyword set determination module configured to screen keywords not included in the title portion of the current document from the plurality of keywords as a first keyword set; a second keyword extraction module configured to extract keywords from each sentence of the body portion of the current document to form a second keyword set for each sentence; a similarity determination module configured to traverse the sentences in the body portion and determine a similarity between the first keyword set and the second keyword set of the traversed sentence; and a dynamic summary determination module configured to determine a portion of the dynamic summary for the current document based on the traversed sentence in response to the similarity being greater than a similarity threshold.
According to a third aspect of the present disclosure, there is provided a computing device comprising a memory configured to store computer-executable instructions and a processor configured to perform the method according to any embodiment of the first aspect when the computer-executable instructions are executed by the processor.
According to a fourth aspect of the present disclosure, there is provided a computer readable storage medium storing computer executable instructions which, when executed, perform any of the methods as described above.
In the dynamic summary determination method and apparatus, computing device, and computer storage medium claimed in the present disclosure, keywords of the search content already contained in the title of the current document are set aside when determining the dynamic summary, which prevents those keywords from appearing repeatedly in both the document title and the dynamic summary. Then, by traversing each sentence of the document body and comparing the similarity between the keywords of each sentence and the set of search-content keywords not included in the title, it is efficiently determined which sentences become part of the dynamic summary. Because the hit rate of the search content as a whole across the article title and the dynamic summary is considered, the accuracy of the determined dynamic summary is improved while repeated occurrence of certain keywords in the title and summary is avoided, so that the title and dynamic summary together present information relevant to the whole of the search content, thereby improving user experience.
These and other advantages of the present disclosure will become apparent from and elucidated with reference to the embodiments described hereinafter.
Drawings
Embodiments of the present disclosure will now be described in more detail and with reference to the accompanying drawings, in which:
FIG. 1 illustrates an exemplary application scenario in which a technical solution according to an embodiment of the present disclosure may be implemented;
FIG. 2 illustrates a schematic flow diagram of a dynamic summary determination method according to one embodiment of the present disclosure;
FIG. 3 illustrates a schematic flow diagram of a method of determining similarity between two keyword sets involved in the present disclosure, according to one embodiment of the present disclosure;
FIG. 4 illustrates a schematic flow diagram of a method of extracting a plurality of keywords in search content according to one embodiment of the present disclosure;
FIG. 5 illustrates an exemplary detailed schematic framework diagram of a word embedding model, according to one embodiment of the present disclosure;
FIG. 6 illustrates a schematic effect diagram of a dynamic summary determined using a related-art method;
FIG. 7 illustrates a schematic effect diagram of a dynamic summary determined using a dynamic summary determination method according to one embodiment of the present disclosure;
FIG. 8 illustrates an exemplary block diagram of a dynamic summary determination apparatus according to one embodiment of the present disclosure;
FIG. 9 illustrates an example system including an example computing device that represents one or more systems and/or devices that can implement the various techniques described herein.
Detailed Description
The following description provides specific details of various embodiments of the disclosure so that those skilled in the art may fully understand and practice them. It should be understood that the technical solutions of the present disclosure may be practiced without some of these details. In some instances, well-known structures or functions are not shown or described in detail to avoid obscuring the description of embodiments of the present disclosure. The terminology used in the present disclosure should be given its broadest reasonable interpretation, even when it is used in conjunction with a particular embodiment of the present disclosure.
First, some terms related to the embodiments of the present application will be described so as to be easily understood by those skilled in the art.
Dynamic summary (dynamic abstract): a search engine term for a technique that dynamically displays the primary content of a retrieved document. In response to a user's search content, the search engine extracts the text surrounding the positions where the search content occurs in a document and returns it as a dynamic summary. Because a document can be recalled by different search content, the dynamic summarization technique may form different dynamic summaries for the same document according to the different search content.
Search content (query): the words, sentences, or any other suitable content a user enters into a search engine in order to find a particular file, web site, record, or series of records in a database.
Query log: in computer science, a log is a record (e.g., a server log) of the operations of computer devices or software such as a server; when the equipment or software has problems, the log is an important basis for diagnosing them. A query log records information related to the search content that users input and that is received from clients.
Stop words: high-frequency words that carry no subject information, such as function words like "also" and "the". In information retrieval, it is preferable to filter these words out when processing natural language data (or text) in order to save storage space and improve search efficiency. Stop words are compiled manually rather than generated automatically, and together they form a stop word list; when filtering later, the stop words in a document can be identified by querying this list.
A word segmenter (tokenizer) is a tool that analyzes text input by a user into logically coherent units. Common segmenters include English segmenters and Chinese segmenters. The segmentation process of an English segmenter generally includes text input, keyword segmentation, stop word removal, lemmatization, and lowercase conversion. A Chinese segmenter splits a sequence of Chinese characters into individual words; in other words, word segmentation is the process of recombining a continuous character sequence into a word sequence according to certain rules. In this process, stop words, i.e., words that do not affect the semantics, may also be identified. Common segmenters include jieba, mmseg4j, ansj, etc.
Natural language processing (NLP) is an important direction in the fields of computer science and artificial intelligence. It studies theories and methods that enable effective communication between people and computers in natural language. Natural language processing is a science that integrates linguistics, computer science, and mathematics; research in this field involves natural language, i.e., the language people use daily, so it is closely related to linguistics. Natural language processing techniques typically include text processing, semantic understanding, machine translation, question answering, knowledge graph techniques, and the like. Natural language processing is mainly applied to machine translation, public opinion monitoring, automatic summarization, opinion extraction, text classification, question answering, text semantic comparison, speech recognition, Chinese OCR, and so on.
Word embedding: a representation of words in which words with similar meaning have similar representations; a general term for methods that map a vocabulary to real-valued vectors. Conceptually, it embeds a high-dimensional space, whose dimension is the number of all words, into a continuous vector space of much lower dimension, with each word or phrase mapped to a vector over the real numbers. Common word embeddings include Word2Vec, fastText, and GloVe. Word2Vec is a group of related models used to generate word vectors; these models are shallow, two-layer neural networks trained to reconstruct linguistic contexts of words, and once training is complete, the model maps each word to a vector taken from the hidden layer of the network. GloVe (Global Vectors for Word Representation) is a word representation tool based on global word frequency statistics (count-based and overall statistics) that represents a word as a vector of real numbers capturing semantic characteristics between words such as similarity and analogy. fastText is a fast text classification algorithm with two major advantages over neural-network-based classification algorithms: it speeds up training and testing while maintaining high accuracy, and it does not require pre-trained word vectors.
Inverse document frequency (IDF) is a common weighting technique in information retrieval and text mining. It is a statistical measure for evaluating how important a word is to a document in a document set or corpus. The importance of a word increases proportionally with the number of times it appears in the document but decreases inversely with the frequency of its appearance across the corpus. Various forms of inverse document frequency weighting are often applied by search engines as a measure or rating of the degree of relevance between a document and a user query.
Artificial intelligence (AI) is the theory, method, technique, and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use that knowledge to obtain optimal results. In other words, artificial intelligence is a comprehensive technology of computer science that attempts to understand the essence of intelligence and produce a new kind of intelligent machine that can react in a way similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines so that the machines can perceive, reason, and make decisions. Artificial intelligence is a comprehensive discipline that involves a wide range of fields, covering both hardware-level and software-level technologies. Basic artificial intelligence technologies generally include sensors, dedicated AI chips, cloud computing, distributed storage, big data processing, operation/interaction systems, and mechatronics. AI software technologies mainly include computer vision, speech processing, natural language processing, and machine learning/deep learning.
Machine learning (ML) is a multi-domain interdiscipline involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory, and other disciplines. It specializes in studying how a computer simulates or implements human learning behavior to acquire new knowledge or skills and to reorganize existing knowledge structures so as to continuously improve its own performance. Machine learning is the core of artificial intelligence and the fundamental approach to making computers intelligent; it is applied throughout all areas of artificial intelligence. Machine learning and deep learning typically include techniques such as artificial neural networks, convolutional neural networks, belief networks, reinforcement learning, transfer learning, and inductive learning.
Deep learning (DL) is a new research direction in the field of machine learning (ML), introduced to bring machine learning closer to its original goal: artificial intelligence (AI). Deep learning learns the inherent regularities and representation hierarchies of sample data, and the information obtained during such learning is helpful in interpreting data such as text, images, and sounds. Its final goal is to give machines analytical learning ability like a person's, able to recognize text, image, and sound data.
The technical solution provided by the present application relates to natural language processing technology, and mainly to dynamic summary technology.
Fig. 1 illustrates an exemplary application scenario 100 in which a technical solution according to an embodiment of the present disclosure may be implemented. As shown in fig. 1, the application scenario shown includes a terminal 110, a server 120, the terminal 110 being communicatively coupled with the server 120 via a network 130.
By way of example, the terminal 110 may, for example, act as an input device for inputting search content and transmit the search content to the server 120 via the network 130. The search content may be, for example, search content for searching for related documents.
As an example, the server 120 may obtain a current document searched based on the search content, extract a plurality of keywords from the search content, and screen keywords not included in the title portion of the current document from the plurality of keywords as a first keyword set. The server 120 may then extract keywords from each sentence of the body portion of the current document to form a second keyword set for each sentence, traverse the sentences in the body portion, and determine the similarity between the first keyword set and the second keyword set of each traversed sentence. Finally, in response to the similarity being greater than a similarity threshold, the server 120 may determine a portion of the dynamic summary for the current document based on the traversed sentence. The server 120 may send the determined dynamic summary to the terminal 110 for presentation.
The scenario described above is merely one example in which embodiments of the present disclosure may be implemented and is not limiting. For example, in some example scenarios, it is also possible that the dynamic summary determination process may be implemented on the terminal 110.
For example, the terminal 110 may serve as an input device for inputting search content and save the acquired current document, searched based on the search content, in the terminal background. The terminal 110 then extracts a plurality of keywords from the search content and screens keywords not included in the title portion of the current document from the plurality of keywords as a first keyword set. Next, the terminal 110 may extract keywords from each sentence of the body portion of the current document to form a second keyword set for each sentence, traverse the sentences in the body portion, and determine the similarity between the first keyword set and the second keyword set of each traversed sentence. Finally, in response to the similarity being greater than a similarity threshold, the terminal 110 may determine a portion of the dynamic summary for the current document based on the traversed sentence.
It should be noted that the terminal 110 may be, but is not limited to, a smart phone, a tablet computer, a notebook computer, a desktop computer, a smart speaker, a smart watch, etc. The server 120 may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDNs, basic cloud computing services such as big data and artificial intelligence platforms, and the like. The terminal and the server may be directly or indirectly connected through wired or wireless communication, and the present application is not limited herein. The network 130 may be, for example, a Wide Area Network (WAN), a Local Area Network (LAN), a wireless network, a public telephone network, an intranet, and any other type of network known to those skilled in the art.
In some embodiments, the application scenario 100 described above may be a distributed system composed of a cluster of terminals 110 and servers 120, which may, for example, constitute a blockchain system. For example, in the application scenario 100, the determination and storage of the dynamic summary may be performed in a blockchain system, so as to achieve the effect of decentralization. As an example, after determining the dynamic digest, the dynamic digest may be stored in a blockchain system for later retrieval from the blockchain system when the same search is conducted. Blockchains are novel application modes of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms, encryption algorithms, and the like. The blockchain is essentially a decentralised database, which is a series of data blocks generated by cryptographic methods, each data block containing a batch of information of network transactions for verifying the validity (anti-counterfeiting) of the information and generating the next block. The blockchain may include a blockchain underlying platform, a platform product services layer, and an application services layer.
Fig. 2 illustrates a schematic flow diagram of a dynamic summary determination method 200 according to one embodiment of the present disclosure. The dynamic digest determination method may be implemented by the terminal 110 or the server 120 as shown in fig. 1, for example. As shown in fig. 2, the method 200 includes the following steps.
At step 210, a current document searched based on the search content is acquired, the current document including a title portion and a body portion. Search content may typically include multiple words to more clearly characterize a user's query intent. A search engine may retrieve a plurality of related documents for the search content as search results for the search content. In an embodiment of the present disclosure, a related document may be acquired as the current document from among the plurality of related documents searched based on the search content.
In step 220, a plurality of keywords in the search content are extracted. In embodiments of the present disclosure, various techniques may be used to extract the plurality of keywords in the search content, which is not limiting. For example, the search content may be segmented using various segmenters described previously, and then a plurality of keywords in the search content may be extracted from each of the segmented words according to the importance of each of the segmented words obtained.
In step 230, keywords not included in the title portion of the current document are selected from the plurality of keywords as a first keyword set. As an example, suppose the keywords "car", "maintenance", "standard", "manual" are extracted from the search content. If the title of the current document (a search result for the search content) contains "car" and "maintenance", the first keyword set will contain "standard" and "manual". When the dynamic summary is then determined according to the first keyword set, it will contain one or both of "standard" and "manual", i.e., the dynamic summary and the title together will contain 3/4 or 4/4 of the keywords of the search content, for an overall hit rate of 75%-100% across the title and the dynamic summary. In contrast, in the conventional related art, the dynamic summary is determined according to all keywords of the search content, and it is highly likely to contain one or both of "car" and "maintenance" but neither "standard" nor "manual"; in that case the dynamic summary and the title together contain only 1/4 or 2/4 of the keywords of the search content, for an overall hit rate of 25%-50%.
Since the keywords already contained in the title of the current document are excluded when the first keyword set is determined, these keywords are also not considered in the subsequent dynamic summary determination steps. This avoids the phenomenon in which some keywords of the search content appear repeatedly in both the document title and the dynamic summary while other keywords appear in neither, and thus improves the hit rate of the search content as a whole across the article title and the dynamic summary. In other words, step 230 accounts for the hit rate of the whole search content in the article title and the dynamic summary, so that the document title and the dynamic summary extracted by the method of the present disclosure are sufficient to present information related to the whole search content, which improves the accuracy of the determined dynamic summary and, in turn, the user experience.
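The screening of step 230 can be sketched as follows. The substring-based title check and the example query echo the car-maintenance example above; both are illustrative assumptions, not an implementation fixed by the disclosure.

```python
def screen_first_keyword_set(query_keywords, title):
    """Keep only the query keywords not already contained in the title."""
    # A simple substring check stands in for whatever matching the
    # implementation actually uses (e.g. matching against title tokens).
    return [word for word in query_keywords if word not in title]

# Hypothetical query keywords and document title.
keywords = ["car", "maintenance", "standard", "manual"]
title = "Annual car maintenance checklist"
first_keyword_set = screen_first_keyword_set(keywords, title)
# first_keyword_set -> ["standard", "manual"]
```

With this first keyword set, only "standard" and "manual" are used to select summary sentences in the later steps.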
In step 240, keywords are extracted for each sentence of the body portion of the current document to correspondingly form a second set of keywords for each sentence. In embodiments of the present disclosure, keywords may be extracted for each sentence of the body portion of the current document using various technical means, which is not limiting.
In some embodiments, keywords may be extracted for each sentence of the body portion of the current document using the same method as that used in step 220 to extract the plurality of keywords in the search content. As an example, for the sentence "the benefit of Internet technology is obvious", the sentence is first segmented to obtain a first word set comprising "Internet", "technology", "benefit", "is", "obvious". Stop words are then removed from the first word set; here the stop word is "is", yielding a second word set. Next, the word weight of each word in the second word set is determined (calculated, for example, from the number of historical occurrences); here the word weights of "Internet", "technology", "benefit", "obvious" are determined to be 0.7, 0.5, 0.6, 0.2, respectively. Finally, words whose weights are less than a predetermined threshold are removed from the second word set; for example, with the predetermined threshold set to 0.4, "obvious" is removed, and the keywords of the sentence are finally determined to be "Internet", "technology", "benefit".
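A minimal sketch of this per-sentence extraction, using the example values above; the stop-word list, the word weights, and the 0.4 threshold are illustrative assumptions rather than values fixed by the disclosure.

```python
# Toy stop-word list and word weights for the example sentence.
STOP_WORDS = {"the", "of", "is"}
WORD_WEIGHTS = {"Internet": 0.7, "technology": 0.5,
                "benefit": 0.6, "obvious": 0.2}

def extract_sentence_keywords(tokens, weight_threshold=0.4):
    # Step 1: remove stop words from the segmented sentence.
    filtered = [t for t in tokens if t not in STOP_WORDS]
    # Step 2: remove words whose weight is below the threshold.
    return [t for t in filtered
            if WORD_WEIGHTS.get(t, 0.0) >= weight_threshold]

tokens = ["the", "benefit", "of", "Internet", "technology", "is", "obvious"]
sentence_keywords = extract_sentence_keywords(tokens)
# sentence_keywords -> ["benefit", "Internet", "technology"]
```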
In step 250, sentences in the body part are traversed, and the similarity between the first keyword set and the second keyword set of the traversed sentences is determined. Various methods may be used to determine the similarity between the first set of keywords and the second set of keywords of the traversed sentence, without limitation.
As an example, the first keyword set comprises "jet", "airplane", "cost", and the second keyword set of the traversed sentence includes "airplane", "airport", "cost", "hills". In some embodiments, the similarity is determined as the ratio of the number of terms shared by the two sets to the number of terms in the second keyword set. For example, the two sets share 2 words, "airplane" and "cost", and the second keyword set includes 4 words, so the similarity is 2/4 = 0.5. In other embodiments, a similarity matrix is determined based on the word vector corresponding to each word in the first set of keywords and the word vector corresponding to each word in the second set of keywords, and the similarity is extracted from the similarity matrix.
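One way to read the overlap measure in the example — the number of shared terms divided by the size of the traversed sentence's keyword set, which reproduces the stated 2/4 = 0.5 — can be sketched as follows; the choice of denominator is an assumption made for illustration.

```python
def overlap_similarity(first_set, second_set):
    # Keywords the two sets share, divided by the size of the
    # traversed sentence's keyword set (assumed denominator).
    shared = set(first_set) & set(second_set)
    return len(shared) / len(set(second_set))

sim = overlap_similarity(["jet", "airplane", "cost"],
                         ["airplane", "airport", "cost", "hills"])
# sim -> 0.5  (shared: "airplane", "cost")
```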
In some embodiments, traversing sentences in the body part, determining similarity between the first keyword set and a second keyword set of the traversed sentences may include traversing sentences in the body part, and determining similarity between the first keyword set and the second keyword set of the traversed sentences when a current word number of dynamic summaries is less than a word number threshold. As an example, the word number threshold of the dynamic abstract is 100 words, sentences in the text portion are traversed, if the current word number of the dynamic abstract is smaller than the word number threshold, for example, the current word number of the dynamic abstract is 90 words, the similarity between the first keyword set and the second keyword set of the traversed sentences is determined, and if the current word number of the dynamic abstract is not smaller than the word number threshold, for example, the current word number of the dynamic abstract is 100 words, the similarity between the first keyword set and the second keyword set of the traversed sentences is not determined.
In response to the similarity being greater than a similarity threshold, a portion of the dynamic summary for the current document is determined based on the traversed sentence at step 260. The similarity threshold may be preset as desired and is not limiting.
In some embodiments, when determining a portion of the dynamic summary for the current document based on the traversed sentence, part or all of the traversed sentence may be determined to be a portion of the dynamic summary of the current document. As an example, certain phrases, the sentence stem, or other parts of the traversed sentence may be extracted, or the sentence may be taken as a whole, as part of the dynamic summary for the current document. As sentences in the body portion are traversed and portions of the dynamic summary are determined from them, the final dynamic summary is formed.
In some embodiments, in response to the similarity being greater than the similarity threshold, determining a portion of the dynamic summary based on the traversed sentence may include the following: if, in addition, the sum of the word count of the traversed sentence and the current word count of the dynamic summary is greater than a word count threshold, only a part of the traversed sentence is determined as a portion of the dynamic summary, such that the sum of the word count of that part and the current word count of the dynamic summary equals the word count threshold. As an example, assuming that the word count threshold of the dynamic summary is 100 words, the current word count of the dynamic summary is 95 words, and the traversed sentence is 15 words, then 5 words of the traversed sentence (which may be taken from the head, tail, or middle of the sentence, etc., without limitation herein) are determined as part of the dynamic summary.
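The traversal of steps 250-260, together with the word-count budget just described, can be sketched as follows. The similarity function is left abstract (any of the measures discussed may be plugged in), head-of-sentence truncation is just one of the options mentioned above, and the toy sentences and scores are made up.

```python
def build_summary(sentences, similarity, sim_threshold=0.5, word_budget=100):
    """Traverse sentences; append those similar enough to the query's
    leftover keywords, truncating so the summary stays within budget."""
    summary_words = []
    for sentence in sentences:
        if len(summary_words) >= word_budget:
            break  # budget reached: stop computing similarities
        if similarity(sentence) > sim_threshold:
            words = sentence.split()
            room = word_budget - len(summary_words)
            summary_words.extend(words[:room])  # keep the head of the sentence
    return " ".join(summary_words)

# Toy sentences with hand-assigned similarity scores and a 4-word budget.
scores = {"alpha beta gamma": 0.9, "delta epsilon": 0.2, "zeta eta theta": 0.8}
summary = build_summary(list(scores), scores.get, word_budget=4)
# summary -> "alpha beta gamma zeta"
```

The second sentence is skipped (similarity 0.2 below the threshold), and the third is truncated to one word so the summary exactly fills the 4-word budget.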
The method 200 avoids the recurrence of certain keywords in the document title and dynamic summary by considering keywords of search content already contained in the title of the current document, and not considering those keywords already contained in the title of the current document when determining the dynamic summary. Then, each sentence of the document body is traversed, the similarity of the keywords of the sentence and the keyword set of the search content not contained in the title is compared, and whether the sentence is a part of the dynamic abstract is judged according to whether the similarity is larger than a similarity threshold. Because the hit rate of the whole search content in the article title and the dynamic abstract is considered when the dynamic abstract is determined, the accuracy of the determined dynamic abstract is improved while the repeated occurrence of certain keywords in the document title and the dynamic abstract is avoided, so that the document title and the dynamic abstract are enough to present the information related to the whole search content, and further the user experience is improved.
Fig. 3 illustrates a schematic flow diagram of a method 300 of determining similarity between two keyword sets, according to one embodiment of the present disclosure. The two keyword sets include the first keyword set and a second keyword set of the traversed sentence. The method 300 may be used, for example, to implement step 250 described with reference to fig. 2. As shown in fig. 3, the method 300 includes the following steps.
In step 310, a word vector for each keyword in the first set of keywords is determined. The word vector for each keyword in the first set of keywords may be determined using a trained word embedding model. The trained word embedding model may be obtained by training a word embedding model with an open-source corpus (such as an open corpus provided by Google) as a training set, or with a specific corpus (such as a training set built from words in a certain field) as a training set, which is not limited herein. The word embedding model may be a common word embedding model, such as Word2Vec, fastText, or GloVe, and is not limited herein.
In some embodiments, determining the word vector for each keyword in the first keyword set includes determining the word vector based on a trained word embedding model. The trained word embedding model may be obtained as follows: a query log (in which information related to search content received from clients is recorded) is acquired, and the search content in the query log is segmented to obtain a plurality of segmented words; the word embedding model is then trained by taking each respective segmented word of the plurality as an input of the word embedding model and a contextual segmented word of that respective word as an output, or conversely by taking the contextual segmented word as the input and the respective word as the output, to obtain the trained word embedding model. Because the segmented words used for training come from the query log, which records information related to search content input by users, the trained word embedding model can better extract features of search content, and the determined word vectors can better represent each word in the search content.
In step 320, a first feature vector of the first set of keywords is determined based on the word vectors of the keywords in the first set of keywords.
In some embodiments, determining the first feature vector of the first set of keywords based on the word vectors of the keywords in the first set of keywords includes bit-wise accumulating the word vectors of the keywords in the first set of keywords to obtain the first feature vector of the first set of keywords. As an example, the first keyword set includes words "fine dried noodles", "easy" and "paste pot", and after determining 200-dimensional word vectors corresponding to the three words respectively, the three 200-dimensional word vectors are accumulated according to positions to obtain a 200-dimensional vector, that is, a first feature vector of the first keyword set is obtained.
In step 330, a word vector is determined for each keyword in the second set of keywords for the traversed sentence. The word vectors for each keyword in the second set of keywords may be determined using the trained word embedding model described above or another suitable word embedding model, such as Word2Vec, fastText, or GloVe, which is not limited herein.
In step 340, a second feature vector for the second set of keywords is determined based on the word vectors for each keyword in the second set of keywords. In some embodiments, determining the second feature vector of the second set of keywords based on the word vector of each keyword in the second set of keywords includes bit-wise accumulating the word vectors of each keyword in the second set of keywords to obtain the second feature vector of the second set of keywords. As an example, the second keyword set includes words "handmade noodles", "fine dried noodles", "sliced noodles" and "paste" and after determining 200-dimensional word vectors corresponding to the four words, the four 200-dimensional word vectors are accumulated according to the positions to obtain a 200-dimensional vector, i.e. a second feature vector of the second keyword set is obtained.
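The bit-wise (element-wise) accumulation of steps 320 and 340 can be sketched as follows, using toy 3-dimensional word vectors in place of the 200-dimensional embeddings; the vectors themselves are made up for illustration.

```python
# Hypothetical word vectors for the fine-dried-noodles example.
WORD_VECTORS = {
    "fine dried noodles": [0.1, 0.2, 0.3],
    "easy":               [0.4, 0.0, 0.1],
    "paste pot":          [0.2, 0.3, 0.0],
}

def feature_vector(keywords, word_vectors=WORD_VECTORS):
    """Accumulate the keyword word vectors position by position."""
    dim = len(next(iter(word_vectors.values())))
    total = [0.0] * dim
    for word in keywords:
        for i, component in enumerate(word_vectors[word]):
            total[i] += component  # bit-wise (per-dimension) accumulation
    return total

fv = feature_vector(["fine dried noodles", "easy", "paste pot"])
# fv -> [0.7, 0.5, 0.4]
```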
In step 350, a similarity between the first set of keywords and the second set of keywords of the traversed sentence is determined based on the first and second feature vectors. As an example, the first feature vector and the second feature vector are each 200-dimensional vectors; the cosine similarity of the two vectors is calculated and used as the similarity between the first keyword set and the second keyword set of the traversed sentence. Cosine similarity between vectors is used to measure text similarity and depends on the cosine distance between the vectors, which maps the vectors into a vector space according to their coordinate values.

Assuming the coordinates of vectors $a$ and $b$ in two-dimensional space are $a = (x_1, y_1)$ and $b = (x_2, y_2)$, the cosine distance between $a$ and $b$ in two dimensions is expressed as follows:

$$\cos\theta = \frac{x_1 x_2 + y_1 y_2}{\sqrt{x_1^2 + y_1^2}\,\sqrt{x_2^2 + y_2^2}}$$

Assuming the coordinates of vectors $a$ and $b$ in $n$-dimensional space are $a = (A_1, A_2, \ldots, A_n)$ and $b = (B_1, B_2, \ldots, B_n)$, the cosine distance between $a$ and $b$ in $n$ dimensions is expressed as follows:

$$\cos\theta = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2}\,\sqrt{\sum_{i=1}^{n} B_i^2}}$$
In some embodiments, determining the similarity between the first set of keywords and the second set of keywords of the traversed sentence based on the first feature vector and the second feature vector may include determining the similarity between the first set of keywords and the second set of keywords of the traversed sentence based on a distance between the first feature vector and the second feature vector, wherein the distance includes one of a cosine distance, a Euclidean distance, and a Manhattan distance. As an example, the distance may be selected as a euclidean distance, i.e. the euclidean distance of the first feature vector and the second feature vector is calculated, and the euclidean distance is taken as the similarity between the first keyword set and the second keyword set of the traversed sentence.
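The $n$-dimensional cosine formula above can be sketched directly; the example vectors are illustrative, not outputs of a real embedding model.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Parallel vectors have cosine similarity 1; orthogonal vectors have 0.
sim = cosine_similarity([1.0, 2.0, 2.0], [2.0, 4.0, 4.0])  # -> 1.0
```

Swapping in Euclidean or Manhattan distance, as the embodiment above mentions, only changes the distance function; the comparison of the two feature vectors is otherwise the same.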
The method 300 determines a similarity between the first set of keywords and the second set of keywords of the traversed sentence by comparing the first feature vector based on the first set of keywords to the second feature vector based on the second set of keywords. The similarity extracted in this way can more accurately characterize the similarity between the first keyword set and the second keyword set of the traversed sentence, so that the dynamic abstract is determined from the similarity in a subsequent step.
Fig. 4 illustrates a schematic flow diagram of a method 400 of extracting a plurality of keywords in search content according to one embodiment of the present disclosure. The method 400 may be used, for example, to implement step 220 described with reference to fig. 2. As shown in fig. 4, the method 400 includes the following steps.
In step 410, the search content is segmented to obtain a first segmented set of words comprising a plurality of words. In embodiments of the present disclosure, the search content may be segmented using various technical means, which is not limiting. For example, the search content may be segmented using various segmentors previously described (e.g., jieba, mmseg, ansj, etc.).
At step 420, stop words are removed from the plurality of words in the first word set to obtain a second word set. As previously mentioned, stop words are high-frequency words that do not carry any subject information, such as "also" and "having been". Removing stop words saves storage space, improves search efficiency, and reduces interference in determining the dynamic summary. In some embodiments, the stop words in the first word set may be identified and removed by querying a pre-built stop word list.
In step 430, a word weight is determined for each word in the second word set. In some embodiments, determining the word weight for each word in the second word set may include determining an inverse document frequency value for each word in the second word set and taking that inverse document frequency value as the word weight of the word. The inverse document frequency is used to evaluate the importance of each word in the second word set.
In some embodiments, determining the word weight for each word in the second word set may include determining an inverse document frequency value for each word, and then determining the word weight of each word based on its inverse document frequency value together with at least one of its part of speech, its position in the search content, its historical search count, and its historical click-through rate. Because the word weight considers at least one of these factors in addition to the inverse document frequency value, the finally determined word weight can reflect the importance of each word in the second word set more comprehensively and accurately.
In some embodiments, determining the inverse document frequency value for each word in the second word set may include: obtaining a query log including D pieces of search content; determining, for each respective word in the second word set, the number d of pieces of search content in the query log that contain the respective word; determining the quotient of the total number D and the number d; and taking the logarithm of the quotient to obtain the inverse document frequency value of the respective word. As an example, the query log includes 1000 pieces of search content, of which 300 contain the respective word, i.e., D = 1000 and d = 300, so the inverse document frequency value of the respective word is log_e(D/d) = log_e(1000/300) ≈ 1.204.
As an example, the inverse document frequency may be determined by querying an inverse document frequency dictionary. The inverse document frequency dictionary may be computed from a query log, and the inverse document frequency $\mathrm{idf}_i$ of the word $t_i$ in the dictionary is calculated according to the following formula:

$$\mathrm{idf}_i = \log \frac{|D|}{|\{\, j : t_i \in q_j \,\}|}$$

where the numerator $|D|$ represents the total number of pieces of search content in the query log and the denominator represents the number of pieces of search content that contain the word $t_i$.
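The IDF computation above can be sketched as follows; the toy query log is an illustrative assumption standing in for a real log of segmented queries.

```python
import math

# Hypothetical query log: each entry is a segmented piece of search content.
QUERY_LOG = [
    ["car", "maintenance"],
    ["car", "price"],
    ["history", "western"],
    ["car", "manual"],
]

def inverse_document_frequency(term, log=QUERY_LOG):
    """idf(t) = log(|D| / d_t), with |D| queries in the log and d_t
    of them containing the term t."""
    d = sum(1 for query in log if term in query)
    return math.log(len(log) / d)

idf_car = inverse_document_frequency("car")          # in 3 of 4 queries
idf_history = inverse_document_frequency("history")  # in 1 of 4 queries
```

As expected, the rarer word gets the higher weight: `idf_history` (log 4) exceeds `idf_car` (log 4/3). A real implementation would also guard against terms that appear in no query (d = 0).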
In step 440, words having a word weight less than a word weight threshold are removed from the second word set to obtain the plurality of keywords in the search content. The word weight threshold may be set as desired and is not limiting. As an example, the second word set includes the words "Sima Qian", "history", "western", and "reputation", with word weights of 0.6, 0.5, 0.7, and 0.2, respectively; with the word weight threshold set to 0.4, "reputation" is removed, and the keywords in the search content are finally determined to be "Sima Qian", "history", "western".
The method 400 extracts a plurality of keywords in the search content by segmenting the search content, removing stop words, determining weights, and removing words having weights less than a weight threshold. The extracted keywords in this way can more accurately and briefly characterize the meaning of the search content than the search content, and the dynamic abstract can be conveniently determined according to the meaning of the search content in the subsequent steps.
FIG. 5 illustrates an exemplary detailed schematic framework diagram of a word embedding model, according to one embodiment of the present disclosure. As shown in fig. 5, the word embedding model is used to embed a high-dimensional space with a number of words in total into a continuous vector space with a much lower dimension, and each word or phrase is mapped into a vector on the real number domain, which may be a three-layer neural network including an input layer, a hidden layer, and an output layer.
The input layer is used to receive the input vector, which is usually a one-hot vector. The hidden layer processes the input vector through its nodes; for example, to represent a word with 300 features (i.e., each word can be represented as a 300-dimensional vector), the hidden layer will have 300 nodes, whose weights can be represented as a matrix with as many rows as the dimension of the input vector and 300 columns. The output layer processes the output of the hidden layer to produce a probability distribution; it may be a softmax regression classifier, each node of which outputs a value (probability) between 0 and 1, with the probabilities of all output-layer nodes summing to 1.
In training word embedding models, training is typically based on pairs of words, training samples being word pairs (input word, output word) that predict the output word from the input word, both input word and output word being vectors that are one-hot coded. Typically, the input word and the output word are words (e.g., adjacent words) having a contextual relationship, i.e., the contextual relationship of each word in the corpus is used to derive a trained word embedding model.
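The (input word, output word) pair construction described above can be sketched as follows for a skip-gram-style model, where each word is paired with its neighbors within a context window; the window size and the toy token list are illustrative assumptions.

```python
def skipgram_pairs(tokens, window=1):
    """Generate (input word, context word) training pairs from a token list."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))  # (input, output) pair
    return pairs

pairs = skipgram_pairs(["internet", "technology", "benefit"])
# pairs -> [("internet", "technology"), ("technology", "internet"),
#           ("technology", "benefit"), ("benefit", "technology")]
```

Reversing each pair (context word as input, center word as output) gives the CBOW-style direction also mentioned above; the one-hot encoding and the network training itself are handled by the three-layer model of fig. 5.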
Fig. 6 illustrates a schematic effect diagram of a dynamic summary determined using the related art. As shown in fig. 6, the dynamic summary determination scheme of the related art does not consider the hits already present in the title, i.e., it does not focus on the global hit experience. This problem is more pronounced when the search engine faces search content of greater length (as shown in fig. 6). As an example, when searching for "Luo Fushan hiking route", only the segments "Luo Fushan" and "hiking" are highlighted, whether in the title or in the dynamic summary, and "route" does not appear. This is because the related art considers only the similarity between the traversed sentence and the search content when traversing sentences in the body to determine the dynamic summary, without considering that the title of the current document already contains part of the search content. As a result, that part of the search content appears repeatedly in both the title and the dynamic summary of the current document, while the remaining part of the search content appears in neither.
As shown in fig. 6, the related art does not consider the hit rate of the whole search content in the article title and the dynamic digest, so that some keywords ("Luo Fushan", "hiking") in the search content repeatedly appear in the document title and the dynamic digest, but other keywords ("routes") in the search content do not appear in the document title and the dynamic digest, which makes the document title and the dynamic digest extracted with the related art insufficient to present information related to the whole search content. As an example, the hit effect of the keyword in the search content in the current document is embodied by font thickening, and may also be implemented by highlighting the relevant text, underlining the keyword font, tilting or enlarging the keyword font, marking the keyword font with a color different from the text or title color, or the like, which is not limited herein.
Fig. 7 illustrates a schematic effect diagram of a dynamic digest determined using a dynamic digest determination method according to one embodiment of the present disclosure. This embodiment is also directed to the dynamic summary of the current document searched for in "Luo Fushan hiking route", and the document title of the current document already contains "Luo Fushan", "hiking". As shown in FIG. 7, the system knows that the title has "Luo Fushan" and "hiking" when selecting the fragments contained in the dynamic summary of the text, and is more prone to extracting those text fragments containing at least "route" from the text as dynamic summary fragments and thickening them. This allows the user to determine that the content of the piece of article tends to be "Luo Fushan hiking" rather than "Luo Fushan hiking" when not clicking to read the full text.
As can be seen by comparing fig. 6 and fig. 7, the dynamic summary determination method proposed by the present disclosure considers the keywords of the search content already contained in the title of the current document ("Luo Fushan", "hiking") and excludes these keywords when determining the dynamic summary. This avoids the recurrence of certain keywords in both the document title and the dynamic summary, and improves the hit rate of the whole search content across the article title (hitting "Luo Fushan", "hiking") and the dynamic summary (hitting "route"), so that the document title and the dynamic summary extracted by the method of the present disclosure can present information related to the whole search content. The method thus improves the accuracy of the determined dynamic summary while avoiding keyword repetition, and in turn improves the user experience.
Meanwhile, as can be seen from fig. 7, the words hit in the body do not necessarily coincide exactly with the words of the search content not included in the title, but their semantics are essentially the same. In fig. 7, the word hit in the body according to the keyword "route" in the search content is a different word with essentially the same meaning (in the original language, two distinct words both meaning "route"). This is because the present disclosure compares the similarity between the keywords of a sentence and the set of search-content keywords not included in the title, rather than requiring the sentence keywords to be exactly identical to those search-content keywords, which gives the dynamic summary determined by the method of the present disclosure better robustness and accuracy.
Fig. 8 illustrates an exemplary block diagram of a dynamic summary determination apparatus 800 according to one embodiment of the present disclosure. As shown in fig. 8, the dynamic digest determining apparatus includes a current document acquisition module 810, a first keyword extraction module 820, a first keyword set determination module 830, a second keyword extraction module 840, a similarity determination module 850, and a dynamic digest determination module 860.
The current document acquisition module 810 is configured to acquire a current document searched based on the search content, the current document including a title portion and a body portion.
The first keyword extraction module 820 is configured to extract a plurality of keywords in the search content.
The first keyword set determining module 830 is configured to screen, as the first keyword set, keywords that are not included in the header portion of the current document from the plurality of keywords.
The second keyword extraction module 840 is configured to extract keywords for each sentence of the body portion of the current document to form a second keyword set for each sentence, respectively.
The similarity determination module 850 is configured to traverse sentences in the body portion and determine the similarity between the first keyword set and the second keyword set of each traversed sentence.
The dynamic summary determination module 860 is configured to determine a portion of the dynamic summary for the current document based on the traversed sentence in response to the similarity being greater than a similarity threshold.
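Taken together, modules 810 through 860 form the pipeline sketched below. This is a hypothetical simplification: the stop-word list, the regex tokenizer, and the Jaccard-style keyword overlap are stand-ins for the weighted keyword extraction and word-vector similarity of the actual embodiment, used only to make the data flow between the modules concrete.

```python
import re

# Simplified stand-in for the disclosure's stop-word filtering.
STOP_WORDS = {"the", "a", "for", "of", "in", "is", "on", "at"}

def extract_keywords(text):
    """Modules 820/840: tokenize and drop stop words (a stand-in for
    the weighted keyword extraction described in the disclosure)."""
    words = re.findall(r"[a-z]+", text.lower())
    return {w for w in words if w not in STOP_WORDS}

def dynamic_summary(query, title, body_sentences, overlap_threshold=0.5):
    # Module 830: keep only query keywords absent from the title.
    first_set = extract_keywords(query) - extract_keywords(title)
    summary = []
    # Modules 850/860: traverse body sentences and keep those whose
    # keyword set is similar enough to the residual query keywords
    # (set overlap stands in for the word-vector similarity).
    for sentence in body_sentences:
        if not first_set:
            break
        second_set = extract_keywords(sentence)
        overlap = len(first_set & second_set) / len(first_set)
        if overlap >= overlap_threshold:
            summary.append(sentence)
    return " ".join(summary)

title = "Luofushan hiking guide"
body = [
    "The route starts at the north gate.",
    "Bring plenty of water in summer.",
]
print(dynamic_summary("Luofushan hiking route", title, body))
# → The route starts at the north gate.
```

Because "Luofushan" and "hiking" already appear in the title, only "route" remains in the first keyword set, so only the sentence mentioning the route is selected for the digest.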
FIG. 9 illustrates an example system 900 that includes an example computing device 910 that represents one or more systems and/or devices that can implement the various techniques described herein. Computing device 910 may be, for example, a server of a service provider, a device associated with a server, a system-on-a-chip, and/or any other suitable computing device or computing system. The dynamic summary determination apparatus 800 described above with reference to fig. 8 may take the form of a computing device 910. Alternatively, the dynamic summary determination apparatus 800 may be implemented as a computer program in the form of an application 916.
The example computing device 910 as illustrated includes a processing system 911, one or more computer-readable media 912, and one or more I/O interfaces 913 communicatively coupled to each other. Although not shown, the computing device 910 may also include a system bus or other data and command transfer system that couples the various components to one another. A system bus may include any one or combination of different bus structures, such as a memory bus or memory controller, a peripheral bus, a universal serial bus, and/or a processor or local bus that utilizes any of a variety of bus architectures. Various other examples are also contemplated, such as control and data lines.
The processing system 911 is representative of functionality to perform one or more operations using hardware. Accordingly, the processing system 911 is illustrated as including hardware elements 914 that may be configured as processors, functional blocks, and so forth. This may include implementation in hardware as application-specific integrated circuits or other logic devices formed using one or more semiconductors. The hardware elements 914 are not limited by the materials from which they are formed or the processing mechanisms employed therein. For example, processors may be comprised of semiconductor(s) and/or transistors (e.g., electronic integrated circuits (ICs)). In such a context, processor-executable instructions may be electronically executable instructions.
The computer-readable medium 912 is illustrated as including a memory/storage 915. Memory/storage 915 represents memory/storage capacity associated with one or more computer-readable media. The memory/storage 915 may include volatile media (such as Random Access Memory (RAM)) and/or nonvolatile media (such as Read Only Memory (ROM), flash memory, optical disks, magnetic disks, and so forth). The memory/storage 915 may include fixed media (e.g., RAM, ROM, a fixed hard drive, etc.) and removable media (e.g., flash memory, a removable hard drive, an optical disk, and so forth). The computer-readable medium 912 may be configured in a variety of other ways as described further below.
One or more I/O interfaces 913 represent functionality that allows a user to input commands and information to computing device 910 using various input devices, and optionally also allows information to be presented to the user and/or other components or devices using various output devices. Examples of input devices include keyboards, cursor control devices (e.g., mice), microphones (e.g., for voice input), scanners, touch functions (e.g., capacitive or other sensors configured to detect physical touches), cameras (e.g., motion that does not involve touches may be detected as gestures using visible or invisible wavelengths such as infrared frequencies), and so forth. Examples of output devices include a display device (e.g., a display or projector), speakers, a printer, a network card, a haptic response device, and so forth. Accordingly, the computing device 910 may be configured in a variety of ways as described further below to support user interaction.
Computing device 910 also includes application 916. The application 916 may be, for example, a software instance of the dynamic summary determination apparatus 800, and implement the techniques described herein in combination with other elements in the computing device 910.
Various techniques may be described herein in the general context of software, hardware elements, or program modules. Generally, such modules include routines, programs, objects, elements, components, data structures, and so forth that perform particular tasks or implement particular abstract data types. The terms "module," "functionality," and "component" as used herein generally represent software, firmware, hardware, or a combination thereof. The features of the techniques described herein are platform-independent, meaning that the techniques may be implemented on a variety of computing platforms having a variety of processors.
An implementation of the described modules and techniques may be stored on or transmitted across some form of computer readable media. Computer-readable media can include a variety of media that can be accessed by computing device 910. By way of example, and not limitation, computer readable media may comprise "computer readable storage media" and "computer readable signal media".
"Computer-readable storage medium" refers to media and/or devices that enable persistent storage of information, and/or tangible storage, in contrast to mere signal transmission, carrier waves, or signals per se. Thus, computer-readable storage media refers to non-signal-bearing media. Computer-readable storage media include hardware such as volatile and nonvolatile, removable and non-removable media and/or storage devices implemented in a method or technology suitable for storage of information such as computer-readable instructions, data structures, program modules, logic elements/circuits, or other data. Examples of computer-readable storage media may include, but are not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical storage, hard disks, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or other storage devices, tangible media, or articles of manufacture suited to store the desired information and accessible by a computer.
"Computer-readable signal medium" refers to a signal-bearing medium configured to transmit instructions to the hardware of the computing device 910, such as via a network. Signal media typically embody computer-readable instructions, data structures, program modules, or other data in a modulated data signal, such as a carrier wave, data signal, or other transport mechanism. Signal media also include any information delivery media. The term "modulated data signal" means a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal. By way of example, and not limitation, communication media include wired media such as a wired network or direct-wired connection, and wireless media such as acoustic, RF, infrared, and other wireless media.
As previously described, the hardware elements 914 and computer-readable media 912 represent instructions, modules, programmable device logic, and/or fixed device logic implemented in hardware form that may be used in some embodiments to implement at least some aspects of the techniques described herein. The hardware elements may include components of an integrated circuit or system-on-chip, application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), complex programmable logic devices (CPLDs), and other implementations in silicon or other hardware devices. In this context, a hardware element may operate as a processing device that performs program tasks defined by instructions, modules, and/or logic embodied by the hardware element, as well as a hardware device utilized to store instructions for execution, e.g., the computer-readable storage media described previously.
Combinations of the foregoing may also be employed to implement the various techniques and modules described herein. Accordingly, software, hardware, or program modules and other program modules may be implemented as one or more instructions and/or logic embodied on some form of computer-readable storage media and/or by one or more hardware elements 914. The computing device 910 may be configured to implement particular instructions and/or functions corresponding to the software and/or hardware modules. Accordingly, implementation of a module executable by the computing device 910 as software may be achieved at least partially in hardware, e.g., through use of computer-readable storage media and/or the hardware elements 914 of the processing system. The instructions and/or functions may be executable/operable by one or more articles of manufacture (e.g., one or more computing devices 910 and/or processing systems 911) to implement the techniques, modules, and examples described herein.
In various implementations, the computing device 910 may take on a variety of different configurations. For example, the computing device 910 may be implemented as a computer-like device including a personal computer, desktop computer, multi-screen computer, laptop computer, netbook, and the like. Computing device 910 may also be implemented as a mobile apparatus-like device including a mobile device such as a mobile phone, portable music player, portable gaming device, tablet computer, multi-screen computer, or the like. Computing device 910 may also be implemented as a television-like device that includes devices having or connected to generally larger screens in casual viewing environments. Such devices include televisions, set-top boxes, gaming machines, and the like.
The techniques described herein may be supported by these various configurations of computing device 910 and are not limited to the specific examples of techniques described herein. The functionality may also be implemented in whole or in part on the "cloud" 920 through the use of a distributed system, such as through platform 922 as described below.
Cloud 920 includes and/or represents a platform 922 for resources 924. Platform 922 abstracts underlying functionality of hardware (e.g., servers) and software resources of cloud 920. Resources 924 may include applications and/or data that may be used when executing computer processing on servers remote from computing device 910. Resources 924 may also include services provided over the internet and/or over subscriber networks such as cellular or Wi-Fi networks.
Platform 922 may abstract resources and functions to connect computing device 910 with other computing devices. Platform 922 may also serve to abstract scaling of resources to provide a corresponding level of scale to encountered demand for the resources 924 that are implemented via platform 922. Accordingly, in an interconnected device embodiment, implementation of the functionality described herein may be distributed throughout the system 900. For example, the functionality may be implemented in part on the computing device 910 as well as via the platform 922 that abstracts the functionality of the cloud 920.
The present application provides a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computing device reads the computer instructions from the computer-readable storage medium and executes the computer instructions to cause the computing device to perform the dynamic summary determination methods provided in the various alternative implementations described above.
It should be understood that for clarity, embodiments of the present disclosure have been described with reference to different functional units. However, it will be apparent that the functionality of each functional unit may be implemented in a single unit, in a plurality of units or as part of other functional units without departing from the present disclosure. For example, functionality illustrated to be performed by a single unit may be performed by multiple different units. Thus, references to specific functional units are only to be seen as references to suitable units for providing the described functionality rather than indicative of a strict logical or physical structure or organization. Thus, the present disclosure may be implemented in a single unit or may be physically and functionally distributed between different units and circuits.
It will be understood that, although the terms first, second, third, etc. may be used herein to describe various devices, elements, components or sections, these devices, elements, components or sections should not be limited by these terms. These terms are only used to distinguish one device, element, component, or section from another device, element, component, or section.
Although the present disclosure has been described in connection with some embodiments, it is not intended to be limited to the specific form set forth herein. Rather, the scope of the present disclosure is limited only by the appended claims. Additionally, although individual features may be included in different claims, these may possibly be advantageously combined, and the inclusion in different claims does not imply that a combination of features is not feasible and/or advantageous. The order of features in the claims does not imply any specific order in which the features must be worked. Furthermore, in the claims, the word "comprising" does not exclude other elements, and the term "a" or "an" does not exclude a plurality. Reference signs in the claims are provided merely as a clarifying example and shall not be construed as limiting the scope of the claims in any way.

Claims (14)

1. A dynamic summary determination method, comprising: acquiring a current document searched based on search content, the current document comprising a title portion and a body portion; extracting a plurality of keywords from the search content; screening, from the plurality of keywords, keywords not included in the title portion of the current document as a first keyword set; extracting keywords from each sentence of the body portion of the current document to correspondingly form a second keyword set for each sentence; traversing sentences in the body portion, and determining a similarity between the first keyword set and the second keyword set of a traversed sentence; and in response to the similarity being greater than a similarity threshold, determining, based on the traversed sentence, a part of a dynamic summary for the current document.
2. The method according to claim 1, wherein traversing sentences in the body portion and determining the similarity between the first keyword set and the second keyword set of the traversed sentence comprises: determining a word vector of each keyword in the first keyword set; determining a first feature vector of the first keyword set based on the word vectors of the keywords in the first keyword set; determining a word vector of each keyword in the second keyword set of the traversed sentence; determining a second feature vector of the second keyword set based on the word vectors of the keywords in the second keyword set; and determining the similarity between the first keyword set and the second keyword set of the traversed sentence based on the first feature vector and the second feature vector.
3. The method according to claim 2, wherein determining the first feature vector of the first keyword set based on the word vectors of the keywords in the first keyword set comprises: accumulating, element-wise, the word vectors of the keywords in the first keyword set to obtain the first feature vector of the first keyword set; and wherein determining the second feature vector of the second keyword set based on the word vectors of the keywords in the second keyword set comprises: accumulating, element-wise, the word vectors of the keywords in the second keyword set to obtain the second feature vector of the second keyword set.
4. The method according to claim 2, wherein determining the similarity between the first keyword set and the second keyword set of the traversed sentence based on the first feature vector and the second feature vector comprises: determining the similarity between the first keyword set and the second keyword set of the traversed sentence based on a distance between the first feature vector and the second feature vector, wherein the distance comprises one of a cosine distance, a Euclidean distance, and a Manhattan distance.
5. The method according to claim 1, wherein traversing sentences in the body portion and determining the similarity between the first keyword set and the second keyword set of the traversed sentence comprises: traversing the sentences in the body portion and, when a current word count of the dynamic summary is less than a word count threshold, determining the similarity between the first keyword set and the second keyword set of the traversed sentence.
6. The method according to claim 1, wherein, in response to the similarity being greater than the similarity threshold, determining, based on the traversed sentence, the part of the dynamic summary for the current document comprises: in response to the similarity being greater than the similarity threshold and a sum of a word count of the traversed sentence and the current word count of the dynamic summary being greater than a word count threshold, determining a portion of the traversed sentence as the part of the dynamic summary for the current document, such that a sum of a word count of the portion of the traversed sentence and the current word count of the dynamic summary equals the word count threshold.
7. The method according to claim 1, wherein extracting the plurality of keywords from the search content comprises: segmenting the search content into words to obtain a first word set comprising a plurality of words; removing stop words from the plurality of words in the first word set to obtain a second word set; determining a word weight of each word in the second word set; and removing, from the second word set, words whose word weight is less than a word weight threshold to obtain the plurality of keywords in the search content.
8. The method according to claim 7, wherein determining the word weight of each word in the second word set comprises: determining an inverse document frequency value of each word in the second word set; and determining the inverse document frequency value of each word in the second word set as the word weight of that word.
9. The method according to claim 7, wherein determining the word weight of each word in the second word set comprises: determining an inverse document frequency value of each word in the second word set; and determining the word weight of each word in the second word set based on at least one of a part of speech of the word, a position of the word in the search content, a historical search count, and a historical click-through rate, together with the inverse document frequency value of the word.
10. The method according to claim 8 or 9, wherein determining the inverse document frequency value of each word in the second word set comprises: acquiring a query log, the query log comprising D pieces of search content; for each respective word in the second word set, determining a number d of pieces of search content in the query log that contain the respective word; and determining a quotient of the total number D of pieces of search content in the query log and the number d of pieces of search content containing the respective word, and taking a logarithm of the quotient to obtain the inverse document frequency value of the respective word.
11. The method according to claim 2, wherein determining the word vector of each keyword in the first keyword set comprises: determining the word vector of each keyword in the first keyword set based on a trained word embedding model, and wherein the trained word embedding model is obtained by: acquiring a query log, and segmenting the search content in the query log into words to obtain a plurality of tokens; and training the word embedding model by taking each respective token of the plurality of tokens as an input of the word embedding model and the context tokens of the respective token as an output of the word embedding model, or by taking each respective token of the plurality of tokens as an output of the word embedding model and the context tokens of the respective token as an input of the word embedding model, to obtain the trained word embedding model.
12. A dynamic summary determination apparatus, comprising: a current document acquisition module configured to acquire a current document searched based on search content, the current document comprising a title portion and a body portion; a first keyword extraction module configured to extract a plurality of keywords from the search content; a first keyword set determination module configured to screen, from the plurality of keywords, keywords not included in the title portion of the current document as a first keyword set; a second keyword extraction module configured to extract keywords from each sentence of the body portion of the current document to correspondingly form a second keyword set for each sentence; a similarity determination module configured to traverse sentences in the body portion and determine a similarity between the first keyword set and the second keyword set of a traversed sentence; and a dynamic summary determination module configured to determine, in response to the similarity being greater than a similarity threshold, a part of a dynamic summary for the current document based on the traversed sentence.
13. A computing device, comprising: a memory configured to store computer-executable instructions; and a processor configured to perform the method according to any one of claims 1-11 when the computer-executable instructions are executed by the processor.
14. A computer-readable storage medium storing computer-executable instructions which, when executed, perform the method according to any one of claims 1-11.
CN202110577211.1A 2021-05-26 2021-05-26 Dynamic summary determination method and device, computing device and computer storage medium Active CN113761125B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110577211.1A CN113761125B (en) 2021-05-26 2021-05-26 Dynamic summary determination method and device, computing device and computer storage medium

Publications (2)

Publication Number Publication Date
CN113761125A CN113761125A (en) 2021-12-07
CN113761125B 2025-06-03

Family

ID=78787218

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110577211.1A Active CN113761125B (en) 2021-05-26 2021-05-26 Dynamic summary determination method and device, computing device and computer storage medium

Country Status (1)

Country Link
CN (1) CN113761125B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116361446A (en) * 2021-12-24 2023-06-30 中国移动通信有限公司研究院 A method, device and electronic device for generating a text summary
CN115688771B (en) * 2023-01-05 2023-03-21 京华信息科技股份有限公司 Document content comparison performance improving method and system
CN117725197B (en) * 2023-03-28 2025-02-07 书行科技(北京)有限公司 Method, device, equipment and storage medium for determining summary of search results
CN120470188A (en) * 2024-02-06 2025-08-12 华为技术有限公司 Search result generating method, device, equipment and storage medium

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101661490A (en) * 2008-08-28 2010-03-03 国际商业机器公司 Search engine, client thereof and method for searching page
CN110837556A (en) * 2019-10-30 2020-02-25 深圳价值在线信息科技股份有限公司 Abstract generation method and device, terminal equipment and storage medium

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2783558B2 (en) * 1988-09-30 1998-08-06 株式会社東芝 Summary generation method and summary generation device
KR101099908B1 (en) * 2010-04-21 2011-12-28 엔에이치엔(주) Document and similarity calculation system between documents

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101661490A (en) * 2008-08-28 2010-03-03 国际商业机器公司 Search engine, client thereof and method for searching page
CN110837556A (en) * 2019-10-30 2020-02-25 深圳价值在线信息科技股份有限公司 Abstract generation method and device, terminal equipment and storage medium

Also Published As

Publication number Publication date
CN113761125A (en) 2021-12-07


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant