Detailed Description
The following description provides specific details of various embodiments of the disclosure so that those skilled in the art may fully understand and practice the various embodiments of the disclosure. It should be understood that the technical solutions of the present disclosure may be practiced without some of these details. In some instances, well-known structures or functions have not been shown or described in detail to avoid obscuring the description of embodiments of the present disclosure with such unnecessary description. The terminology used in the present disclosure should be understood in its broadest reasonable manner, even though it is being used in conjunction with a particular embodiment of the present disclosure.
First, some terms related to the embodiments of the present application will be described so as to be easily understood by those skilled in the art.
Dynamic abstract (dynamic summary), a search engine term, is a technique for dynamically displaying the primary content of a retrieved document. For a search engine, in response to user input of search content, text surrounding the search content in a document is extracted, according to the position of the search content in the document, and returned as a dynamic abstract. Because the same document may be recalled by different search contents, the dynamic abstract technique may form different dynamic abstracts for the same document according to the different search contents.
Search content, i.e., a query, refers to the words, sentences, or any other suitable data that a user enters into a search engine in order to find and retrieve a particular file, web site, record, or series of records from a database.
A log is a record used to document the work done each day. In computer science, a log refers to a record (e.g., a server log) of the operations of computer devices or software such as a server. When computer equipment or software malfunctions, the log is an important basis for diagnosing the problem. A query log is used to record information related to the search content input by a user and received from a client.
Stop words refer to high-frequency words that do not carry any subject information, such as function words like "also" and "having been". In information retrieval, it is preferable to filter out these words when processing natural language data (or text) in order to save storage space and improve search efficiency. Stop words are manually specified rather than automatically generated, and the specified stop words form a stop word list. When stop words are filtered later, the stop words in a document can be identified by querying the stop word list.
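The stop-word filtering described above can be sketched as follows; the stop word list and the example tokens are illustrative assumptions, not part of the disclosure.

```python
# Minimal sketch of stop-word filtering against a pre-built stop word list.
# The stop word list below is a hypothetical example.
STOP_WORDS = {"also", "having been", "is", "the", "a"}

def remove_stop_words(tokens):
    """Drop any token found in the stop word list."""
    return [t for t in tokens if t not in STOP_WORDS]

tokens = ["the", "engine", "also", "indexes", "a", "document"]
print(remove_stop_words(tokens))  # ['engine', 'indexes', 'document']
```

In practice the list would be loaded from a curated stop word file rather than hard-coded.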
A word segmentation device (tokenizer) is a tool for analyzing text input by a user into logically coherent pieces of text. Common word segmentation devices include English word segmentation devices, Chinese word segmentation devices, and the like. The segmentation process of an English word segmentation device generally comprises the steps of text input, keyword segmentation, stop word removal, morphological reduction, and lower-case conversion. A Chinese word segmentation device segments a sequence of Chinese characters into individual words. In other words, word segmentation is a process of recombining a continuous character sequence into a word sequence according to a certain specification. In this process, stop words, i.e., words that do not affect the semantic meaning, may be identified. Common word segmentation devices include jieba, mmseg, ansj, and the like.
Natural language processing (Natural Language Processing, NLP) is an important direction in the fields of computer science and artificial intelligence. It studies various theories and methods that enable effective communication between people and computers in natural language. Natural language processing is a science that integrates linguistics, computer science, and mathematics. Research in this field involves natural language, i.e., the language that people use daily, so it is closely related to research in linguistics. Natural language processing techniques typically include text processing, semantic understanding, machine translation, robotic question answering, knowledge graph techniques, and the like. Natural language processing is mainly applied to machine translation, public opinion monitoring, automatic summarization, viewpoint extraction, text classification, question answering, text semantic comparison, speech recognition, Chinese OCR, and the like.
Word embedding is a type of representation of words in which words with similar meanings have similar representations; it is a generic term for methods that map vocabulary to real-valued vectors. Conceptually, it refers to embedding a high-dimensional space, whose dimension is the number of all words, into a continuous vector space of much lower dimension, with each word or phrase mapped to a vector over the real number domain. Common word embeddings include Word2Vec, fastText, and GloVe. Word2Vec is a group of related models used to generate word vectors. These models are shallow, two-layer neural networks trained to reconstruct linguistic word contexts. After training is completed, a Word2Vec model can be used to map each word to a vector, which corresponds to the hidden layer of the neural network. GloVe, short for Global Vectors for Word Representation, is a word representation tool based on global word frequency statistics (count-based & overall statistics); it represents a word as a vector of real numbers that captures semantic characteristics between words, such as similarity and analogy. fastText is a fast text classification algorithm that has two major advantages over neural-network-based classification algorithms: it increases training and testing speed, and it does not require pre-trained word vectors, while maintaining high accuracy.
Inverse document frequency (inverse document frequency, IDF) is a commonly used weighting technique for information retrieval and information exploration. The inverse document frequency is a statistical measure for evaluating the importance of a word to one of the documents in a document set or corpus. The importance of a word increases proportionally with the number of times it appears in the document, but at the same time decreases inversely with the frequency with which it appears in the corpus. Various forms of inverse document frequency weighting are often applied by search engines as a measure or rating of the degree of relevance between a document and a user query.
Artificial intelligence (Artificial Intelligence, AI) is the theory, method, technique, and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use the knowledge to obtain optimal results. In other words, artificial intelligence is an integrated technology of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can react in a way similar to human intelligence. Artificial intelligence is the study of the design principles and implementation methods of various intelligent machines, enabling machines to have the functions of perception, reasoning, and decision making. Artificial intelligence technology is a comprehensive subject involving a wide range of fields, including both hardware-level and software-level technologies. Artificial intelligence infrastructure technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. Artificial intelligence software technologies mainly include computer vision, speech processing, natural language processing, and machine learning/deep learning.
Machine learning (Machine Learning, ML) is a multi-domain interdiscipline involving multiple disciplines such as probability theory, statistics, approximation theory, convex analysis, and algorithm complexity theory. It specializes in studying how a computer simulates or implements human learning behavior to acquire new knowledge or skills, and reorganizes existing knowledge structures to continuously improve its own performance. Machine learning is the core of artificial intelligence and the fundamental approach to endowing computers with intelligence; it is applied throughout the various areas of artificial intelligence. Machine learning and deep learning typically include techniques such as artificial neural networks, convolutional neural networks, belief networks, reinforcement learning, transfer learning, and inductive learning.
Deep learning (Deep Learning, DL) is a new research direction in the field of machine learning (Machine Learning, ML) that was introduced into machine learning to bring it closer to its original goal: artificial intelligence (Artificial Intelligence, AI). Deep learning learns the inherent regularities and representation hierarchies of sample data, and the information obtained during such learning is helpful in interpreting data such as text, images, and sounds. Its final goal is for machines to have analytical learning capabilities like a person and to be able to recognize text, image, and sound data.
The technical solution provided by the present application relates to natural language processing technology, and mainly relates to dynamic abstract technology.
Fig. 1 illustrates an exemplary application scenario 100 in which a technical solution according to an embodiment of the present disclosure may be implemented. As shown in fig. 1, the illustrated application scenario includes a terminal 110 and a server 120, the terminal 110 being communicatively coupled with the server 120 via a network 130.
By way of example, the terminal 110 may, for example, act as an input device for inputting search content and transmit the search content to the server 120 via the network 130. The search content may be, for example, search content for searching for related documents.
As an example, the server 120 may obtain a current document searched based on search content, extract a plurality of keywords in the search content, and screen, from the plurality of keywords, keywords not included in a title portion of the current document as a first keyword set. The server 120 may then extract keywords for each sentence of a body portion of the current document to form a second keyword set for each sentence. Next, the server 120 may traverse the sentences in the body portion and determine a similarity between the first keyword set and the second keyword set of each traversed sentence. Finally, in response to the similarity being greater than a similarity threshold, the server 120 may determine a portion of a dynamic summary for the current document based on the traversed sentence. The server 120 may send the determined dynamic summary to the terminal 110 for presentation.
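The server-side flow above can be sketched end to end as follows. The tokenization (whitespace splitting) and the similarity measure (keyword overlap ratio) are simplified stand-ins assumed for illustration only, not the specific techniques of the disclosure.

```python
def extract_keywords(text):
    """Hypothetical stand-in: treat every lower-cased word as a keyword."""
    return set(text.lower().replace(",", " ").replace(".", " ").split())

def dynamic_summary(query, title, body_sentences, sim_threshold=0.2):
    # Keywords of the search content not already covered by the title.
    first_set = extract_keywords(query) - extract_keywords(title)
    summary = []
    for sentence in body_sentences:              # traverse body sentences
        second_set = extract_keywords(sentence)  # per-sentence keyword set
        if not second_set:
            continue
        # Simplified similarity: overlap ratio against the sentence's keywords.
        sim = len(first_set & second_set) / len(second_set)
        if sim > sim_threshold:                  # sentence joins the summary
            summary.append(sentence)
    return " ".join(summary)

query = "car maintenance standard manual"
title = "Car maintenance tips"
body = ["The standard manual lists every service interval.",
        "Our office is closed on Sundays."]
print(dynamic_summary(query, title, body))
```

With these toy inputs, only the first body sentence shares keywords with the screened query set, so it alone forms the summary.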
The scenario described above is merely one example in which embodiments of the present disclosure may be implemented and is not limiting. For example, in some example scenarios, it is also possible that the dynamic summary determination process may be implemented on the terminal 110.
For example, the terminal 110 may serve as an input device for inputting search content and save the acquired current document, searched based on the search content, in the terminal background. A plurality of keywords in the search content are then extracted, and keywords not included in the title portion of the current document are screened from the plurality of keywords as a first keyword set. The terminal 110 may then extract keywords for each sentence of the body portion of the current document to form a second keyword set for each sentence. Next, the terminal 110 may traverse the sentences in the body portion and determine a similarity between the first keyword set and the second keyword set of each traversed sentence. Finally, in response to the similarity being greater than a similarity threshold, the terminal 110 may determine a portion of a dynamic summary for the current document based on the traversed sentence.
It should be noted that the terminal 110 may be, but is not limited to, a smart phone, a tablet computer, a notebook computer, a desktop computer, a smart speaker, a smart watch, etc. The server 120 may be an independent physical server, a server cluster or distributed system formed by a plurality of physical servers, or a cloud server providing cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDN, and basic cloud computing services such as big data and artificial intelligence platforms. The terminal and the server may be directly or indirectly connected through wired or wireless communication, which is not limited in the present application. The network 130 may be, for example, a Wide Area Network (WAN), a Local Area Network (LAN), a wireless network, a public telephone network, an intranet, or any other type of network known to those skilled in the art.
In some embodiments, the application scenario 100 described above may be a distributed system composed of a cluster of terminals 110 and servers 120, which may, for example, constitute a blockchain system. For example, in the application scenario 100, the determination and storage of the dynamic summary may be performed in a blockchain system, so as to achieve the effect of decentralization. As an example, after the dynamic summary is determined, it may be stored in the blockchain system so that it can later be retrieved from the blockchain system when the same search is conducted. Blockchain is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms, and encryption algorithms. A blockchain is essentially a decentralized database, a series of data blocks generated by cryptographic methods, each data block containing a batch of network transaction information used for verifying the validity (anti-counterfeiting) of the information and generating the next block. The blockchain may include a blockchain underlying platform, a platform product services layer, and an application services layer.
Fig. 2 illustrates a schematic flow diagram of a dynamic summary determination method 200 according to one embodiment of the present disclosure. The dynamic summary determination method may be implemented, for example, by the terminal 110 or the server 120 shown in fig. 1. As shown in fig. 2, the method 200 includes the following steps.
At step 210, a current document searched based on the search content is acquired, the current document including a title portion and a body portion. Search content may typically include multiple words to more clearly characterize a user's query intent. A search engine may retrieve a plurality of related documents for the search content as search results for the search content. In an embodiment of the present disclosure, a related document may be acquired as the current document from among the plurality of related documents searched based on the search content.
In step 220, a plurality of keywords in the search content are extracted. In embodiments of the present disclosure, various techniques may be used to extract the plurality of keywords in the search content, which is not limiting. For example, the search content may be segmented using the various segmenters described previously, and a plurality of keywords may then be extracted from the resulting segmented words according to the importance of each segmented word.
In step 230, keywords not included in the title portion of the current document are selected from the plurality of keywords as a first keyword set. As an example, the keywords "car", "maintenance", "standard", "manual" are extracted from the search content; if the title of the current document returned as a result for the search content contains "car" and "maintenance", the first keyword set will contain "standard" and "manual". When the dynamic summary is determined according to this first keyword set, the dynamic summary will contain one or both of "standard" and "manual", i.e., the dynamic summary and the title together will contain 3/4 or 4/4 of the keywords of the search content, and the hit rate of the search content on the title and the dynamic summary as a whole is 75%-100%. In contrast, in the conventional related art, the dynamic summary is determined according to all the keywords of the search content, and it is highly likely that the dynamic summary will contain one or both of "car" and "maintenance" but neither "standard" nor "manual"; in that case the dynamic summary and the title together will contain only 1/4 or 2/4 of the keywords of the search content, and the hit rate of the search content on the title and the dynamic summary as a whole is 25%-50%.
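The screening in step 230 amounts to a set difference. A minimal sketch with the "car maintenance standard manual" example (the pre-tokenized inputs are assumed):

```python
def first_keyword_set(query_keywords, title_words):
    """Keep only the query keywords that the title does not already contain."""
    title = set(title_words)
    return [k for k in query_keywords if k not in title]

query_keywords = ["car", "maintenance", "standard", "manual"]
title_words = ["car", "maintenance", "tips"]
print(first_keyword_set(query_keywords, title_words))  # ['standard', 'manual']
```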
Since the keywords already contained in the title of the current document are excluded when the first keyword set is determined, those keywords are not considered in the subsequent dynamic summary determination step. This avoids the phenomenon in which some keywords of the search content appear repeatedly in both the document title and the dynamic summary while other keywords appear in neither, and thus improves the hit rate of the search content as a whole on the article title and the dynamic summary. In other words, step 230 takes into account the hit rate of the whole search content on the article title and the dynamic summary, so that the document title and the dynamic summary extracted by the method of the present disclosure are sufficient to present information related to the whole search content; the accuracy of the determined dynamic summary is improved, and the user experience is improved accordingly.
In step 240, keywords are extracted for each sentence of the body portion of the current document to correspondingly form a second set of keywords for each sentence. In embodiments of the present disclosure, keywords may be extracted for each sentence of the body portion of the current document using various technical means, which is not limiting.
In some embodiments, the method of extracting keywords for each sentence of the body portion of the current document may be the same as the method of extracting the plurality of keywords in the search content in step 220. As an example, keywords may be extracted from the sentence "The benefit of Internet technology is obvious" of the current document as follows. First, the sentence is segmented to obtain a first segmented word set comprising "Internet", "technology", "benefit", "is", "obvious". Then, the stop words are removed from the first segmented word set; here the stop word is "is", yielding a second segmented word set. Next, the word weight of each word in the second segmented word set is determined (which may be calculated, for example, from the number of historical occurrences); here the word weights of "Internet", "technology", "benefit", "obvious" are determined to be 0.7, 0.5, 0.6, 0.2, respectively. Finally, words whose weights are less than a predetermined threshold are removed from the second segmented word set; for example, if the predetermined threshold is set to 0.4, "obvious" is removed, and the keywords of the sentence are finally determined to be "Internet", "technology", "benefit".
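The segmentation, stop-word removal, and weight-threshold steps can be sketched together as follows; the stop word list and the word weights are the hypothetical values of the example, not real statistics.

```python
STOP_WORDS = {"the", "of", "is"}
# Hypothetical word weights, e.g. derived from historical occurrence counts.
WORD_WEIGHTS = {"Internet": 0.7, "technology": 0.5, "benefit": 0.6, "obvious": 0.2}

def sentence_keywords(tokens, threshold=0.4):
    # tokens is the first segmented word set (segmentation assumed done).
    no_stops = [t for t in tokens if t.lower() not in STOP_WORDS]   # remove stop words
    weighted = [(t, WORD_WEIGHTS.get(t, 0.0)) for t in no_stops]    # look up weights
    return [t for t, w in weighted if w >= threshold]               # drop low weights

tokens = ["The", "benefit", "of", "Internet", "technology", "is", "obvious"]
print(sentence_keywords(tokens))  # ['benefit', 'Internet', 'technology']
```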
In step 250, sentences in the body part are traversed, and the similarity between the first keyword set and the second keyword set of the traversed sentences is determined. Various methods may be used to determine the similarity between the first set of keywords and the second set of keywords of the traversed sentence, without limitation.
As an example, the first set of keywords comprises "jet", "airplane", "cost", and the second set of keywords of the traversed sentence comprises "airplane", "airport", "cost", "hills". In some embodiments, the similarity is determined as the ratio of the number of words that the first keyword set has in common with the second keyword set to the number of words contained in the second keyword set. For example, the first keyword set has 2 words, "airplane" and "cost", in common with the second keyword set, and the second keyword set contains 4 words, so the similarity is 2/4 = 0.5. In other embodiments, a similarity matrix is determined based on the word vector corresponding to each word in the first keyword set and the word vector corresponding to each word in the second keyword set, and the similarity is extracted from the similarity matrix.
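The overlap-ratio variant can be sketched as follows, using the "jet/airplane/cost" example. Dividing by the size of the traversed sentence's keyword set reproduces the 2/4 = 0.5 of the example; this choice of denominator is an assumption consistent with the numbers given.

```python
def overlap_similarity(first_set, second_set):
    """Ratio of shared keywords to the size of the sentence's keyword set."""
    second = set(second_set)
    if not second:
        return 0.0
    return len(set(first_set) & second) / len(second)

first = {"jet", "airplane", "cost"}
second = {"airplane", "airport", "cost", "hills"}
print(overlap_similarity(first, second))  # 0.5
```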
In some embodiments, traversing sentences in the body part, determining similarity between the first keyword set and a second keyword set of the traversed sentences may include traversing sentences in the body part, and determining similarity between the first keyword set and the second keyword set of the traversed sentences when a current word number of dynamic summaries is less than a word number threshold. As an example, the word number threshold of the dynamic abstract is 100 words, sentences in the text portion are traversed, if the current word number of the dynamic abstract is smaller than the word number threshold, for example, the current word number of the dynamic abstract is 90 words, the similarity between the first keyword set and the second keyword set of the traversed sentences is determined, and if the current word number of the dynamic abstract is not smaller than the word number threshold, for example, the current word number of the dynamic abstract is 100 words, the similarity between the first keyword set and the second keyword set of the traversed sentences is not determined.
In response to the similarity being greater than a similarity threshold, a portion of the dynamic summary for the current document is determined based on the traversed sentence at step 260. The similarity threshold may be preset as desired and is not limiting.
In some embodiments, when determining a portion of the dynamic summary for the current document based on the traversed sentence, part or all of the traversed sentence may be determined to be a portion of the dynamic summary of the current document. As an example, certain phrases, the sentence stem, or the traversed sentence as a whole may be extracted or determined as part of the dynamic summary for the current document. As sentences in the body portion continue to be traversed and portions of the dynamic summary continue to be determined based on the traversed sentences, a final dynamic summary may be formed.
In some embodiments, in response to the similarity being greater than the similarity threshold, determining a portion of the dynamic summary for the current document based on the traversed sentence may include: in response to the similarity being greater than the similarity threshold and the sum of the word count of the traversed sentence and the current word count of the dynamic summary being greater than a word count threshold, determining a portion of the traversed sentence as a portion of the dynamic summary such that the sum of the word count of that portion and the current word count of the dynamic summary equals the word count threshold. As an example, assuming that the word count threshold of the dynamic summary is 100 words, the current word count of the dynamic summary is 95 words, and the traversed sentence is 15 words, then 5 words of the traversed sentence (which may be extracted from the head, tail, or middle of the sentence, without limitation) are determined as part of the dynamic summary.
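The word-count control of the embodiments above can be sketched in one loop: the similarity check is skipped once the summary is full, and a sentence that would overshoot the threshold is truncated (here from the head, one of the options mentioned). Pre-tokenized sentences and precomputed similarity scores are assumed for simplicity.

```python
def build_summary(sentences, similarities, word_threshold=100, sim_threshold=0.5):
    """sentences: lists of words; similarities: one precomputed score per sentence."""
    summary, count = [], 0
    for words, sim in zip(sentences, similarities):
        if count >= word_threshold:   # summary full: stop checking similarity
            break
        if sim <= sim_threshold:      # sentence not similar enough
            continue
        room = word_threshold - count
        take = words[:room]           # truncate from the head if it would overshoot
        summary.append(take)
        count += len(take)
    return summary, count

# Emulate the example: 95 words already taken, the next sentence (15 words) is cut to 5.
s1, s2 = ["w"] * 95, ["x"] * 15
summary, count = build_summary([s1, s2], [0.9, 0.8])
print(count)  # 100
```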
The method 200 avoids the repeated occurrence of certain keywords in the document title and the dynamic summary by taking into account the keywords of the search content already contained in the title of the current document and excluding those keywords when determining the dynamic summary. Each sentence of the document body is then traversed, the similarity between the keywords of the sentence and the set of search-content keywords not contained in the title is computed, and whether the sentence becomes part of the dynamic summary is decided according to whether the similarity is greater than the similarity threshold. Because the hit rate of the whole search content on the article title and the dynamic summary is considered when the dynamic summary is determined, the accuracy of the determined dynamic summary is improved while the repeated occurrence of certain keywords in the document title and the dynamic summary is avoided, so that the document title and the dynamic summary together are sufficient to present information related to the whole search content, thereby improving the user experience.
Fig. 3 illustrates a schematic flow diagram of a method 300 of determining similarity between two keyword sets, according to one embodiment of the present disclosure. The two keyword sets include the first keyword set and a second keyword set of the traversed sentence. The method 300 may be used, for example, to implement step 250 described with reference to fig. 2. As shown in fig. 3, the method 300 includes the following steps.
In step 310, a word vector for each keyword in the first keyword set is determined. The word vector for each keyword in the first keyword set may be determined using a trained word embedding model. The trained word embedding model may be obtained by training a word embedding model with an open source corpus (such as an open corpus provided by Google) as the training set, or with a specific corpus (such as a training set built from words in a certain field) as the training set, which is not limited herein. The word embedding model may be a common word embedding model, such as Word2Vec, fastText, or GloVe, and is not limited herein.
In some embodiments, determining the word vector for each keyword in the first keyword set includes determining the word vector based on a trained word embedding model, where the trained word embedding model is trained as follows. A query log (in which information related to search content received from a client is recorded) is obtained, and the search content in the query log is segmented to obtain a plurality of segmented words. The word embedding model is then trained either by taking each respective segmented word as the input of the word embedding model and the contextual segmented words of that word as the output, or by taking the contextual segmented words of each respective segmented word as the input and that word as the output, to obtain the trained word embedding model. Because the segmented words used for training come from the query log, which records information related to the search content input by users and received from clients, the trained word embedding model can better extract features of search content, and the determined word vectors can better represent each word in the search content.
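A minimal sketch of building training pairs from segmented query-log text. The word-as-input / context-as-output direction and its reverse follow the two alternatives described (they correspond to skip-gram-style and CBOW-style training, respectively); the window size and the pre-tokenized input are simplifying assumptions.

```python
def training_pairs(tokens, window=1, word_to_context=True):
    """Yield (input, output) pairs for training a word embedding model.

    word_to_context=True : word in, context word out (skip-gram style).
    word_to_context=False: context word in, word out (CBOW style).
    """
    pairs = []
    for i, word in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j == i:
                continue
            ctx = tokens[j]
            pairs.append((word, ctx) if word_to_context else (ctx, word))
    return pairs

tokens = ["car", "maintenance", "manual"]  # one segmented query from the log
print(training_pairs(tokens))
# [('car', 'maintenance'), ('maintenance', 'car'), ('maintenance', 'manual'), ('manual', 'maintenance')]
```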
In step 320, a first feature vector of the first set of keywords is determined based on the word vectors of the keywords in the first set of keywords.
In some embodiments, determining the first feature vector of the first keyword set based on the word vectors of the keywords in the first keyword set includes accumulating the word vectors of the keywords element by element (i.e., position by position) to obtain the first feature vector of the first keyword set. As an example, the first keyword set includes the words "fine dried noodles", "easy" and "paste pot"; after the 200-dimensional word vectors corresponding to the three words are determined, the three 200-dimensional word vectors are accumulated element by element to obtain a 200-dimensional vector, i.e., the first feature vector of the first keyword set.
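The position-by-position accumulation is an element-wise sum, sketched here with 3-dimensional toy vectors in place of the 200-dimensional ones.

```python
def feature_vector(word_vectors):
    """Element-wise (position-wise) sum of the keywords' word vectors."""
    dim = len(word_vectors[0])
    return [sum(vec[i] for vec in word_vectors) for i in range(dim)]

# Toy 3-dimensional vectors standing in for the three keywords' word vectors.
vectors = [[0.1, 0.2, 0.3], [0.4, 0.0, 0.1], [0.2, 0.3, 0.0]]
print(feature_vector(vectors))  # approximately [0.7, 0.5, 0.4]
```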
In step 330, a word vector is determined for each keyword in the second keyword set of the traversed sentence. The word vector for each keyword in the second keyword set may be determined using the trained word embedding model described above or another suitable word embedding model. The word embedding model may be, for example, a common word embedding model such as Word2Vec, fastText, or GloVe, and is not limited herein.
In step 340, a second feature vector for the second keyword set is determined based on the word vectors of the keywords in the second keyword set. In some embodiments, determining the second feature vector includes accumulating the word vectors of the keywords in the second keyword set element by element to obtain the second feature vector of the second keyword set. As an example, the second keyword set includes the words "handmade noodles", "fine dried noodles", "sliced noodles" and "paste"; after the 200-dimensional word vectors corresponding to the four words are determined, the four 200-dimensional word vectors are accumulated element by element to obtain a 200-dimensional vector, i.e., the second feature vector of the second keyword set.
In step 350, a similarity between the first keyword set and the second keyword set of the traversed sentence is determined based on the first and second feature vectors. As an example, the first feature vector and the second feature vector are each 200-dimensional vectors; the cosine similarity of the first feature vector and the second feature vector is calculated and used as the similarity between the first keyword set and the second keyword set of the traversed sentence. Cosine similarity between vectors is used to measure text similarity and depends on the cosine distance between the vectors: the vectors are mapped into a vector space according to their coordinate values, and the cosine distance between vectors a and b is expressed as follows.
Assuming the coordinates of vectors a and b in two-dimensional space are a = (x1, y1) and b = (x2, y2), respectively, the cosine distance between vectors a and b in two-dimensional space is expressed as follows:

cos θ = (x1·x2 + y1·y2) / (√(x1² + y1²) · √(x2² + y2²))

Assuming the coordinates of vectors a and b in n-dimensional space are a = (A1, A2, ……, An) and b = (B1, B2, ……, Bn), respectively, the cosine distance between vectors a and b in n-dimensional space is expressed as follows:

cos θ = (A1·B1 + A2·B2 + …… + An·Bn) / (√(A1² + A2² + …… + An²) · √(B1² + B2² + …… + Bn²))
In some embodiments, determining the similarity between the first set of keywords and the second set of keywords of the traversed sentence based on the first feature vector and the second feature vector may include determining the similarity between the first set of keywords and the second set of keywords of the traversed sentence based on a distance between the first feature vector and the second feature vector, wherein the distance includes one of a cosine distance, a Euclidean distance, and a Manhattan distance. As an example, the distance may be selected as a euclidean distance, i.e. the euclidean distance of the first feature vector and the second feature vector is calculated, and the euclidean distance is taken as the similarity between the first keyword set and the second keyword set of the traversed sentence.
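The cosine formula and the Euclidean alternative can be sketched directly; note that in practice a Euclidean distance would normally be converted so that a smaller distance yields a larger similarity, a step left out of this illustrative sketch.

```python
import math

def cosine_similarity(a, b):
    """cos θ = (Σ Ai·Bi) / (√(Σ Ai²) · √(Σ Bi²))"""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def euclidean_distance(a, b):
    """Straight-line distance between the two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

a, b = [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]
print(cosine_similarity(a, b))   # ≈ 1.0, since the vectors are parallel
print(euclidean_distance(a, b))
```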
The method 300 determines the similarity between the first keyword set and the second keyword set of the traversed sentence by comparing the first feature vector, based on the first keyword set, with the second feature vector, based on the second keyword set. The similarity determined in this way more accurately characterizes the relationship between the first keyword set and the second keyword set of the traversed sentence, so that the dynamic abstract can be determined from the similarity in a subsequent step.
Fig. 4 illustrates a schematic flow diagram of a method 400 of extracting a plurality of keywords in search content according to one embodiment of the present disclosure. The method 400 may be used, for example, to implement step 220 described with reference to fig. 2. As shown in fig. 4, the method 400 includes the following steps.
In step 410, the search content is segmented to obtain a first word-segmentation set comprising a plurality of words. In embodiments of the present disclosure, the search content may be segmented using various technical means, which is not limited herein. For example, the search content may be segmented using the various word segmenters described previously (e.g., jieba, mmseg, ansj, etc.).
In step 420, stop words are removed from the plurality of words in the first word-segmentation set to obtain a second set of words. As previously mentioned, stop words are high-frequency words that do not carry any subject information, such as the words "also" and "already". Removing the stop words saves storage space, improves search efficiency, and reduces interference in determining the dynamic abstract. In some embodiments, the stop words in the first word-segmentation set may be identified and removed by querying a pre-built stop word list.
In step 430, a word weight is determined for each word in the second set of words. In some embodiments, determining the word weight for each word in the second set of words may include determining an inverse document frequency value for each word in the second set of words and taking the inverse document frequency value of each word as the word weight of that word. The inverse document frequency is used to evaluate the importance of each word in the second set of words.
In some embodiments, determining the word weight for each word in the second set of words may include determining an inverse document frequency value for each word in the second set of words, and determining the word weight for each word based on its inverse document frequency value together with at least one of the word's part of speech, its position in the search content, its historical search count, and its historical click-through rate. Because the word weight thus considers at least one of the part of speech, position in the search content, historical search count, and historical click-through rate in addition to the inverse document frequency value of each word in the second set of words, the finally determined word weight can reflect the importance of each word in the second set of words more comprehensively and accurately.
In some embodiments, determining the inverse document frequency value for each word in the second set of words may include obtaining a query log including D pieces of search content; determining, for each respective word in the second set of words, the number d of pieces of search content in the query log that include the respective word; determining the quotient of the total number D of pieces of search content in the query log and the number d of pieces of search content that include the respective word; and taking the logarithm of the quotient to obtain the inverse document frequency value of the respective word. As an example, the query log includes 1000 pieces of search content, 300 of which include the respective word, i.e., D=1000, d=300, so the inverse document frequency value of the respective word is log e(D/d) = log e(1000/300) ≈ 1.204.
As an example, the inverse document frequency may be determined by querying an inverse document frequency dictionary. The inverse document frequency dictionary may be calculated from the query log, and the inverse document frequency idf i of the word t i in the inverse document frequency dictionary is calculated according to the following formula:

idf i = log( |D| / |{ j : t i ∈ q j }| )

where the numerator |D| represents the total number of pieces of search content in the query log, and the denominator |{ j : t i ∈ q j }| represents the number of pieces of search content that contain the word t i.
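As a minimal sketch (the query-log entries below are hypothetical), this IDF computation over a toy query log reproduces the log e(1000/300) example:

```python
import math

def inverse_document_frequency(query_log, word):
    # |D|: total number of pieces of search content in the query log.
    total = len(query_log)
    # d: number of pieces of search content that contain the word.
    containing = sum(1 for query in query_log if word in query)
    return math.log(total / containing)

# Toy query log: 1000 entries, 300 of which contain the word "history".
log_entries = (["history of China"] * 300) + (["weather today"] * 700)
idf = inverse_document_frequency(log_entries, "history")
print(round(idf, 3))  # log_e(1000/300) ≈ 1.204
```

In a production system this quotient would typically be precomputed offline over the full query log and stored in the inverse document frequency dictionary described above.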
In step 440, words having a word weight less than a word weight threshold are removed from the second set of words to obtain the plurality of keywords in the search content. The word weight threshold may be set as desired and is not limited herein. As an example, the second set of words includes the words "Sima Qian", "history", "western" and "reputation", with word weights of 0.6, 0.5, 0.7 and 0.2, respectively; with the word weight threshold set to 0.4, "reputation" is removed, and the keywords in the search content are finally determined to be "Sima Qian", "history" and "western".
The method 400 extracts a plurality of keywords in the search content by segmenting the search content, removing stop words, determining weights, and removing words having weights less than a weight threshold. The extracted keywords in this way can more accurately and briefly characterize the meaning of the search content than the search content, and the dynamic abstract can be conveniently determined according to the meaning of the search content in the subsequent steps.
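Steps 410-440 can be sketched end-to-end as follows. The segmentation step is stubbed out with pre-tokenized input (a real system would use a segmenter such as jieba), and the stop-word list and word weights are hypothetical values mirroring the example above:

```python
def extract_keywords(tokens, stop_words, word_weights, weight_threshold):
    # Step 420: remove stop words from the first word-segmentation set.
    second_set = [w for w in tokens if w not in stop_words]
    # Steps 430-440: look up each word's weight (e.g., its IDF value) and
    # remove words whose weight is less than the threshold.
    return [w for w in second_set if word_weights.get(w, 0.0) >= weight_threshold]

tokens = ["Sima Qian", "'s", "history", "western", "reputation"]
stop_words = {"'s", "also"}
weights = {"Sima Qian": 0.6, "history": 0.5, "western": 0.7, "reputation": 0.2}
print(extract_keywords(tokens, stop_words, weights, 0.4))
# → ['Sima Qian', 'history', 'western']
```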
FIG. 5 illustrates an exemplary detailed schematic framework diagram of a word embedding model according to one embodiment of the present disclosure. As shown in fig. 5, the word embedding model is used to embed a high-dimensional space, whose dimension equals the total number of words, into a continuous vector space of much lower dimension, with each word or phrase mapped to a vector over the real numbers. The word embedding model may be a three-layer neural network including an input layer, a hidden layer, and an output layer.
The input layer receives the input vector, which is usually a one-hot vector. The hidden layer processes the input vector through its nodes; for example, to represent each word with 300 features (i.e., each word can be represented as a 300-dimensional vector), the hidden layer is given 300 nodes, whose weights can be represented as a matrix with 300 columns and a number of rows equal to the dimension of the input vector. The output layer processes the output of the hidden layer to produce a probability distribution; it may be a softmax regression classifier, each node of which outputs a value (probability) between 0 and 1, with the probabilities of all output-layer neuron nodes summing to 1.
In training a word embedding model, training is typically based on word pairs: each training sample is a word pair (input word, output word), and the model predicts the output word from the input word, with both the input word and the output word being one-hot encoded vectors. Typically, the input word and the output word are words having a contextual relationship (e.g., adjacent words), i.e., the contextual relationship of each word in the corpus is used to derive the trained word embedding model.
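The forward pass of such a three-layer network can be sketched as follows. This is an illustrative skip-gram-style toy with random untrained weights, a 4-word hypothetical vocabulary, and a hidden layer of size 3 (300 in the description above):

```python
import math
import random

random.seed(0)
vocab = ["luofushan", "hiking", "route", "map"]
V, H = len(vocab), 3  # vocabulary size and hidden-layer size

# Hidden-layer weights: V rows (input-vector dimension) by H columns (hidden nodes).
W_in = [[random.uniform(-0.5, 0.5) for _ in range(H)] for _ in range(V)]
# Output-layer weights: H rows by V columns.
W_out = [[random.uniform(-0.5, 0.5) for _ in range(V)] for _ in range(H)]

def forward(word):
    # A one-hot input vector simply selects one row of W_in:
    # that row is the word's hidden (embedding) representation.
    h = W_in[vocab.index(word)]
    # Output layer: one score per word in the vocabulary.
    scores = [sum(h[k] * W_out[k][j] for k in range(H)) for j in range(V)]
    # Softmax turns the scores into probabilities that sum to 1.
    exp = [math.exp(s) for s in scores]
    total = sum(exp)
    return [e / total for e in exp]

probs = forward("hiking")
print(round(sum(probs), 6))  # probabilities over the vocabulary sum to 1.0
```

After training on (input word, output word) pairs, the rows of W_in would serve as the word vectors used elsewhere in this disclosure.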
Fig. 6 illustrates a schematic effect diagram of a dynamic summary determined using the related art. As shown in fig. 6, the dynamic summary determination scheme in the related art does not consider which keywords are already hit in the title, that is, it does not attend to the global hit experience across the title and the summary. This whole-summary hit problem is more pronounced when the search engine faces search content of larger length (as shown in fig. 6). As an example, when searching for "Luo Fushan hiking route", both the title and the dynamic summary hit only the segments "Luo Fushan" and "hiking", and "route" does not appear in either. This is because, when traversing the sentences in the body to determine the dynamic summary, the related art considers only the similarity of the traversed sentence to the search content and does not consider that the title of the current document already contains part of the search content; as a result, some keywords of the search content appear repeatedly in both the title and the dynamic summary of the current document, while other keywords of the search content appear in neither.
As shown in fig. 6, the related art does not consider the hit rate of the whole search content across the article title and the dynamic summary, so that some keywords ("Luo Fushan", "hiking") in the search content appear repeatedly in both the document title and the dynamic summary while other keywords ("route") appear in neither, which makes the document title and dynamic summary extracted with the related art insufficient to present information related to the whole search content. As an example, the hit of a keyword of the search content in the current document is shown by bold font; it may also be shown by highlighting the relevant text, underlining the keyword, italicizing or enlarging the keyword font, marking the keyword in a color different from the body or title color, or the like, which is not limited herein.
Fig. 7 illustrates a schematic effect diagram of a dynamic summary determined using a dynamic summary determination method according to one embodiment of the present disclosure. This embodiment likewise concerns the dynamic summary of the current document retrieved for "Luo Fushan hiking route", where the document title of the current document already contains "Luo Fushan" and "hiking". As shown in FIG. 7, knowing that the title already hits "Luo Fushan" and "hiking", the system, when selecting the fragments of the body to include in the dynamic summary, is more inclined to extract those text fragments containing at least "route" as dynamic summary fragments and to display them in bold. This allows the user to determine, without clicking to read the full text, that the content of the article concerns a "Luo Fushan hiking route" rather than merely "Luo Fushan hiking".
As can be seen by comparing fig. 6 and fig. 7, the dynamic summary determination method proposed by the present disclosure takes into account the keywords of the search content ("Luo Fushan", "hiking") already contained in the title of the current document and excludes them when determining the dynamic summary. This improves the hit rate of the whole search content across the article title (hitting "Luo Fushan", "hiking") and the dynamic summary (hitting "route"), and avoids the repeated occurrence of some keywords in both the document title and the dynamic summary, so that the document title and dynamic summary extracted by the method of the present disclosure can present information related to the whole search content. In this way, the accuracy of the determined dynamic summary is improved, the title and summary together suffice to present information related to the whole search content, and the user experience is further improved.
Meanwhile, as can be seen from fig. 7, the words hit in the body do not necessarily coincide exactly with the words of the search content not included in the title, but their semantics are essentially the same. In fig. 7, for example, the keyword "route" in the search content hits a synonymous word in the body; although the two are different words, their semantics are essentially the same. This is because the present disclosure compares the similarity between the keywords of each sentence and the set of search-content keywords not included in the title, rather than requiring the sentence keywords to be completely identical to those keywords, which gives the dynamic summary determined by the method of the present disclosure better robustness and accuracy.
Fig. 8 illustrates an exemplary block diagram of a dynamic summary determination apparatus 800 according to one embodiment of the present disclosure. As shown in fig. 8, the dynamic summary determination apparatus 800 includes a current document acquisition module 810, a first keyword extraction module 820, a first keyword set determination module 830, a second keyword extraction module 840, a similarity determination module 850, and a dynamic summary determination module 860.
The current document acquisition module 810 is configured to acquire a current document searched based on the search content, the current document including a title portion and a body portion.
The first keyword extraction module 820 is configured to extract a plurality of keywords in the search content.
The first keyword set determining module 830 is configured to screen, as the first keyword set, keywords that are not included in the header portion of the current document from the plurality of keywords.
The second keyword extraction module 840 is configured to extract keywords for each sentence of the body portion of the current document to form a second keyword set for each sentence, respectively.
The similarity determination module 850 is configured to traverse the sentences in the body portion and determine the similarity between the first keyword set and the second keyword set of each traversed sentence.
The dynamic summary determination module 860 is configured to determine a portion of the dynamic summary for the current document based on the traversed sentence in response to the similarity being greater than a similarity threshold.
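The cooperation of modules 810-860 can be sketched end-to-end as follows. The keyword extraction and similarity measure here are deliberately simplified stand-ins (whitespace tokenization and Jaccard overlap instead of the feature-vector comparison described above), and all inputs are hypothetical:

```python
def tokenize(text):
    # Simplified stand-in for the word segmentation of modules 820/840.
    return set(text.lower().split())

def jaccard(a, b):
    # Simplified stand-in for the feature-vector similarity of module 850.
    return len(a & b) / len(a | b) if a | b else 0.0

def determine_summary(search_content, title, body_sentences, threshold=0.1):
    keywords = tokenize(search_content)        # module 820
    first_set = keywords - tokenize(title)     # module 830: drop title hits
    summary = []
    for sentence in body_sentences:            # modules 840-860
        second_set = tokenize(sentence)
        if jaccard(first_set, second_set) > threshold:
            summary.append(sentence)
    return summary

# Hypothetical document mirroring the "Luo Fushan hiking route" example.
title = "Luofushan hiking guide"
body = ["the classic route starts at the east gate",
        "bring plenty of water in summer"]
print(determine_summary("luofushan hiking route", title, body))
# → ['the classic route starts at the east gate']
```

Because "luofushan" and "hiking" already appear in the title, only "route" remains in the first keyword set, so the sentence mentioning the route is selected, matching the behavior illustrated in fig. 7.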
FIG. 9 illustrates an example system 900 that includes an example computing device 910 that represents one or more systems and/or devices that can implement the various techniques described herein. Computing device 910 may be, for example, a server of a service provider, a device associated with a server, a system-on-a-chip, and/or any other suitable computing device or computing system. The dynamic summary determination apparatus 800 described above with reference to fig. 8 may take the form of a computing device 910. Alternatively, the dynamic summary determination apparatus 800 may be implemented as a computer program in the form of an application 916.
The example computing device 910 as illustrated includes a processing system 911, one or more computer-readable media 912, and one or more I/O interfaces 913 communicatively coupled to each other. Although not shown, the computing device 910 may also include a system bus or other data and command transfer system that couples the various components to one another. A system bus may include any one or combination of different bus structures, such as a memory bus or memory controller, a peripheral bus, a universal serial bus, and/or a processor or local bus that utilizes any of a variety of bus architectures. Various other examples are also contemplated, such as control and data lines.
The processing system 911 is representative of functionality to perform one or more operations using hardware. Thus, the processing system 911 is illustrated as including hardware elements 914 that may be configured as processors, functional blocks, and so forth. This may include implementation in hardware as application specific integrated circuits or other logic devices formed using one or more semiconductors. The hardware elements 914 are not limited by the materials from which they are formed or the processing mechanisms employed therein. For example, a processor may be comprised of semiconductor(s) and/or transistors (e.g., electronic integrated circuits (ICs)). In such a context, processor-executable instructions may be electronically-executable instructions.
The computer-readable medium 912 is illustrated as including a memory/storage 915. Memory/storage 915 represents memory/storage capacity associated with one or more computer-readable media. The memory/storage 915 may include volatile media (such as Random Access Memory (RAM)) and/or nonvolatile media (such as Read Only Memory (ROM), flash memory, optical disks, magnetic disks, and so forth). The memory/storage 915 may include fixed media (e.g., RAM, ROM, a fixed hard drive, etc.) and removable media (e.g., flash memory, a removable hard drive, an optical disk, and so forth). The computer-readable medium 912 may be configured in a variety of other ways as described further below.
One or more I/O interfaces 913 represent functionality that allows a user to input commands and information to computing device 910 using various input devices, and optionally also allows information to be presented to the user and/or other components or devices using various output devices. Examples of input devices include keyboards, cursor control devices (e.g., mice), microphones (e.g., for voice input), scanners, touch functions (e.g., capacitive or other sensors configured to detect physical touches), cameras (e.g., motion that does not involve touches may be detected as gestures using visible or invisible wavelengths such as infrared frequencies), and so forth. Examples of output devices include a display device (e.g., a display or projector), speakers, a printer, a network card, a haptic response device, and so forth. Accordingly, the computing device 910 may be configured in a variety of ways as described further below to support user interaction.
Computing device 910 also includes application 916. The application 916 may be, for example, a software instance of the dynamic summary determination apparatus 800, and implement the techniques described herein in combination with other elements in the computing device 910.
Various techniques may be described herein in the general context of software hardware elements or program modules. Generally, these modules include routines, programs, objects, elements, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The terms "module," "functionality," and "component" as used herein generally represent software, firmware, hardware, or a combination thereof. The features of the techniques described herein are platform-independent, meaning that the techniques may be implemented on a variety of computing platforms having a variety of processors.
An implementation of the described modules and techniques may be stored on or transmitted across some form of computer readable media. Computer-readable media can include a variety of media that can be accessed by computing device 910. By way of example, and not limitation, computer readable media may comprise "computer readable storage media" and "computer readable signal media".
"Computer-readable storage medium" refers to media and/or devices that enable persistent storage of information, and/or tangible storage, in contrast to mere signal transmission, carrier waves, or signals per se. Thus, computer-readable storage media refers to non-signal bearing media. Computer-readable storage media include hardware such as volatile and nonvolatile, removable and non-removable media and/or storage devices implemented in a method or technology suitable for storage of information such as computer readable instructions, data structures, program modules, logic elements/circuits, or other data. Examples of computer-readable storage media may include, but are not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, Digital Versatile Disks (DVDs) or other optical storage, hard disks, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or other storage devices, tangible media, or articles of manufacture suited to store the desired information and which may be accessed by a computer.
"Computer-readable signal medium" refers to a signal-bearing medium configured to transmit instructions to the hardware of the computing device 910, such as via a network. Signal media may typically embody computer readable instructions, data structures, program modules, or other data in a modulated data signal, such as a carrier wave, data signal, or other transport mechanism. Signal media also include any information delivery media. The term "modulated data signal" means a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal. By way of example, and not limitation, communication media include wired media such as a wired network or direct-wired connection, and wireless media such as acoustic, RF, infrared, and other wireless media.
As previously described, the hardware elements 914 and computer-readable media 912 represent instructions, modules, programmable device logic, and/or fixed device logic implemented in hardware that may be used in some embodiments to implement at least some aspects of the techniques described herein. Hardware elements may include components of an integrated circuit or system-on-a-chip, Application-Specific Integrated Circuits (ASICs), Field-Programmable Gate Arrays (FPGAs), Complex Programmable Logic Devices (CPLDs), and other implementations in silicon or other hardware devices. In this context, a hardware element may operate as a processing device that performs program tasks defined by instructions, modules, and/or logic embodied by the hardware element, as well as a hardware device that stores instructions for execution, e.g., the computer-readable storage media described previously.
Combinations of the foregoing may also be employed to implement the various techniques and modules described herein. Accordingly, software, hardware, or program modules may be implemented as one or more instructions and/or logic embodied on some form of computer-readable storage media and/or by one or more hardware elements 914. The computing device 910 may be configured to implement particular instructions and/or functions corresponding to software and/or hardware modules. Thus, for example, the modules may be implemented at least partially in hardware as modules executable by the computing device 910 as software, through use of the computer-readable storage media and/or hardware elements 914 of the processing system. The instructions and/or functions may be executable/operable by one or more articles of manufacture (e.g., one or more computing devices 910 and/or processing systems 911) to implement the techniques, modules, and examples described herein.
In various implementations, the computing device 910 may take on a variety of different configurations. For example, the computing device 910 may be implemented as a computer-like device including a personal computer, desktop computer, multi-screen computer, laptop computer, netbook, and the like. Computing device 910 may also be implemented as a mobile apparatus-like device including a mobile device such as a mobile phone, portable music player, portable gaming device, tablet computer, multi-screen computer, or the like. Computing device 910 may also be implemented as a television-like device that includes devices having or connected to generally larger screens in casual viewing environments. Such devices include televisions, set-top boxes, gaming machines, and the like.
The techniques described herein may be supported by these various configurations of computing device 910 and are not limited to the specific examples of techniques described herein. The functionality may also be implemented in whole or in part on the "cloud" 920 through the use of a distributed system, such as through platform 922 as described below.
Cloud 920 includes and/or represents a platform 922 for resources 924. Platform 922 abstracts underlying functionality of hardware (e.g., servers) and software resources of cloud 920. Resources 924 may include applications and/or data that may be used when executing computer processing on servers remote from computing device 910. Resources 924 may also include services provided over the internet and/or over subscriber networks such as cellular or Wi-Fi networks.
Platform 922 may abstract resources and functionality to connect computing device 910 with other computing devices. Platform 922 may also serve to abstract the scaling of resources to provide a corresponding level of scaling to the demand encountered for the resources 924 implemented via the platform 922. Thus, in an interconnected device embodiment, the implementation of the functionality described herein may be distributed throughout the system 900. For example, the functionality may be implemented in part on computing device 910 and in part by platform 922 that abstracts the functionality of cloud 920.
The present application provides a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computing device reads the computer instructions from the computer-readable storage medium and executes the computer instructions to cause the computing device to perform the dynamic summary determination methods provided in the various alternative implementations described above.
It should be understood that for clarity, embodiments of the present disclosure have been described with reference to different functional units. However, it will be apparent that the functionality of each functional unit may be implemented in a single unit, in a plurality of units or as part of other functional units without departing from the present disclosure. For example, functionality illustrated to be performed by a single unit may be performed by multiple different units. Thus, references to specific functional units are only to be seen as references to suitable units for providing the described functionality rather than indicative of a strict logical or physical structure or organization. Thus, the present disclosure may be implemented in a single unit or may be physically and functionally distributed between different units and circuits.
It will be understood that, although the terms first, second, third, etc. may be used herein to describe various devices, elements, components or sections, these devices, elements, components or sections should not be limited by these terms. These terms are only used to distinguish one device, element, component, or section from another device, element, component, or section.
Although the present disclosure has been described in connection with some embodiments, it is not intended to be limited to the specific form set forth herein. Rather, the scope of the present disclosure is limited only by the appended claims. Additionally, although individual features may be included in different claims, these may possibly be advantageously combined, and the inclusion in different claims does not imply that a combination of features is not feasible and/or advantageous. The order of features in the claims does not imply any specific order in which the features must be worked. Furthermore, in the claims, the word "comprising" does not exclude other elements, and the term "a" or "an" does not exclude a plurality. Reference signs in the claims are provided merely as a clarifying example and shall not be construed as limiting the scope of the claims in any way.