Disclosure of Invention
In view of the above problems, the present invention has been made to provide an identification method, apparatus, application search method, and server of an application search intention that overcome the above problems or at least partially solve the above problems.
According to an aspect of the present invention, there is provided an identification method of an application search intention, the method including:
obtaining search terms in each query session from a query session log of an application search engine;
mining a label system of each search term according to the search terms in each query session and a preset strategy;
and identifying the application search intention corresponding to each search term according to the label system of that search term.
Optionally, mining a label system of each search term according to the search term in each query session and a preset policy includes:
obtaining a training corpus set according to search terms in each query session;
inputting the training corpus set into an LDA model for training to obtain a search term-topic probability distribution result and a topic-keyword probability distribution result output by the LDA model;
and calculating a label system of each search term according to the search term-topic probability distribution result and the topic-keyword probability distribution result.
Optionally, the obtaining a training corpus set according to the search terms in each query session includes:
obtaining an original corpus of each search term according to the search terms in each query session;
forming an original corpus set from the original corpora of the search terms; and preprocessing the original corpus set to obtain the training corpus set.
Optionally, the obtaining the original corpus of each search term according to the search terms in each query session includes:
obtaining a search term sequence set corresponding to a plurality of query sessions according to the search terms in each query session; obtaining a search term set corresponding to the plurality of query sessions;
training the search term sequence set to obtain an N-dimensional search term vector file;
for each search term in the search term set, calculating the degree of association between the search term and each other search term according to the N-dimensional search term vector file; and taking the other search terms whose degrees of association with the search term meet a preset condition as the original corpus of the search term.
Optionally, the obtaining the search word sequence sets corresponding to the plurality of query sessions includes:
for each query session, arranging the search terms in the query session into a sequence in order; if a search term in the sequence corresponds to an application download operation, inserting the name of the downloaded application into the sequence immediately after that search term; and obtaining a search term sequence corresponding to the query session;
the obtaining a set of search terms corresponding to a plurality of query sessions comprises: and taking the set of search terms in the plurality of query sessions as the set of search terms corresponding to the plurality of query sessions.
Optionally, training the search word sequence set to obtain an N-dimensional search word vector file includes:
and taking each search word in the search word sequence set as a word, and training the search word sequence set by using a deep learning tool kit word2vec to generate an N-dimensional search word vector file.
Optionally, the calculating, for each search term in the search term set, a degree of association between the search term and each other search term according to the N-dimensional search term vector file, and taking the other search terms whose degrees of association with the search term meet a preset condition as the original corpus of the search term, includes:
performing an operation on the search term set and the N-dimensional search term vector file by using a KNN algorithm, and calculating the distance between every two search terms in the search term set according to the N-dimensional search term vector file;
and for each search term in the search term set, sorting the other search terms in ascending order of their distance from the search term, and selecting a first preset threshold number of nearest search terms as the original corpus of the search term.
Optionally, the preprocessing the original corpus set includes:
in the original corpus set, for each original corpus, performing word segmentation processing on the original corpus to obtain a word segmentation result containing a plurality of terms; searching for phrases formed by adjacent terms in the word segmentation result; and retaining the phrases, the terms that are nouns, and the terms that are verbs in the word segmentation result as the keywords retained for the original corpus.
Optionally, the searching for phrases composed of adjacent terms in the word segmentation result includes:
and calculating the cPMId value of every two adjacent terms in the word segmentation result, and determining that two adjacent terms form a phrase when their cPMId value is greater than a second preset threshold.
Optionally, the preprocessing the original corpus set further includes:
using the keywords retained for the original corpus of each search term as the first-stage training corpus of the search term;
forming a first-stage training corpus set from the first-stage training corpora of the search terms; and performing data cleaning on the keywords in the first-stage training corpus set.
Optionally, the performing data cleaning on the keywords in the first-stage corpus set includes:
in the first-stage corpus set,
calculating a TF-IDF value of each keyword in a first-stage training corpus of each search word; deleting the key words with the TF-IDF value higher than a third preset threshold value and/or lower than a fourth preset threshold value to obtain a training corpus of the search word;
the training corpora of the search terms constitute the training corpus set.
Optionally, the calculating, according to the search term-topic probability distribution result and the topic-keyword probability distribution result, a tag system of each search term includes:
calculating to obtain a search word-keyword probability distribution result according to the search word-topic probability distribution result and the topic-keyword probability distribution result;
and according to the search term-keyword probability distribution result, for each search term, sorting the keywords in descending order of their probability with respect to the search term, and selecting a fifth preset threshold number of top-ranked keywords.
Optionally, the calculating a search term-keyword probability distribution result according to the search term-topic probability distribution result and the topic-keyword probability distribution result includes:
for each search word, obtaining the probability of each topic about the search word according to the search word-topic probability distribution result;
for each topic, obtaining the probability of each keyword about the topic according to the topic-keyword probability distribution result;
for each keyword, taking the product of the probability of the keyword with respect to a topic and the probability of that topic with respect to a search term as the topic-based probability of the keyword with respect to the search term; and taking the sum of the topic-based probabilities of the keyword over all topics as the probability of the keyword with respect to the search term.
Optionally, the step of obtaining a tag system of each search term by calculation according to the search term-topic probability distribution result and the topic-keyword probability distribution result further includes:
taking the fifth preset threshold number of keywords selected for each search term as a first-stage label system of the search term;
for the first-stage label system of each search term, calculating a semantic relation value between each keyword in the first-stage label system and the search term; for each keyword, taking the product of its semantic relation value and its probability with respect to the search term as the corrected probability of the keyword with respect to the search term; and sorting all the keywords in the first-stage label system in descending order of corrected probability, and selecting a sixth preset threshold number of top-ranked keywords to form the label system of the search term.
Optionally, calculating a semantic relationship value between each keyword in the first-stage tagging system of the search term and the search term includes:
obtaining a search word sequence set corresponding to a plurality of query sessions according to search words in each query session; training the search word sequence set to obtain an N-dimensional keyword vector file;
calculating word vectors of the keywords according to the N-dimensional keyword vector files, and calculating the word vectors of each term in the search words;
calculating cosine similarity between the word vector of the keyword and the word vector of each term, and taking the cosine similarity as a semantic relation value of the keyword and the corresponding term;
and taking the sum of the semantic relation values of the keyword and each term as the semantic relation value between the keyword and the search word.
Optionally, the training the search word sequence set to obtain an N-dimensional keyword vector file includes:
and performing word segmentation processing on the search word sequence set, and training the search word sequence set subjected to word segmentation processing by using a deep learning tool package word2vec to generate an N-dimensional keyword vector file.
Optionally, the step of obtaining a tag system of each search term by calculation according to the search term-topic probability distribution result and the topic-keyword probability distribution result further includes:
taking the sixth preset threshold number of keywords selected for each search term as a second-stage label system of the search term;
for the second-stage label system of each search term, counting the TF-IDF value, in the training corpus of the search term, of each keyword in the second-stage label system; for each keyword, taking the product of its probability with respect to the search term and its TF-IDF value as the secondary corrected probability of the keyword with respect to the search term; and sorting all the keywords in the second-stage label system in descending order of secondary corrected probability, and selecting the first K keywords to form the label system of the search term.
Optionally, the selecting of the first K keywords to form the label system of the search term includes:
acquiring the query times of the search terms in a preset time period from a query session log of an application search engine;
selecting the first K keywords to form the label system of the search term according to the number of queries; wherein the value of K is a piecewise linear (broken-line) function of the number of queries corresponding to the search term.
According to another aspect of the present invention, there is provided an application search method, including:
constructing a search word tag database, wherein the search word tag database comprises a tag system of a plurality of search words;
receiving a current search word uploaded by a client, and acquiring a tag system of the current search word according to the search word tag database;
calculating the degree of association between the label system of the current search term and the label system of each application;
when the correlation degree between the label system of the current search word and the label system of one application meets a preset condition, returning the relevant information of the application to the client for displaying;
wherein the search term tag database is constructed by the method according to any one of the above aspects of the invention.
Optionally, the obtaining a tag system of the current search term according to the search term tag database includes:
calculating the semantic similarity between the current search term and each search term in the search term tag database, sorting the search terms in descending order of semantic similarity, and selecting a first preset threshold number of search terms;
and obtaining the label system of the current search word according to the label system of each selected search word.
Optionally, the calculating semantic similarity between the current search word and each search word in the search word tag database includes: calculating the Euclidean distance between the current search word and each search word in the search word tag database, and taking the Euclidean distance between each search word and the current search word as the semantic similarity corresponding to the search word;
the obtaining of the label system of the current search term according to the label system of each selected search term includes: taking the semantic similarity corresponding to each search term as the weight of each label in the label system of that search term; for the labels in the label systems of the selected search terms, adding up the weights of identical labels to obtain the final weight of each label; and sorting the labels in descending order of final weight, and selecting a second preset threshold number of top-ranked labels to form the label system of the current search term.
According to another aspect of the present invention, there is provided an apparatus for identifying an application search intention, the apparatus including:
the acquisition unit is suitable for acquiring search terms in each query session from a query session log of an application search engine;
the mining unit is suitable for mining a label system of each search term according to the search term in each query session and a preset strategy;
and the identification unit is suitable for identifying the application search intention corresponding to each search word according to the label system of the search word.
Optionally, the mining unit is adapted to obtain a training corpus set according to the search terms in each query session; input the training corpus set into an LDA model for training to obtain a search term-topic probability distribution result and a topic-keyword probability distribution result output by the LDA model; and calculate a label system of each search term according to the search term-topic probability distribution result and the topic-keyword probability distribution result.
Optionally, the mining unit is adapted to obtain an original corpus of each search term according to the search terms in each query session; form an original corpus set from the original corpora of the search terms; and preprocess the original corpus set to obtain the training corpus set.
Optionally, the mining unit is adapted to obtain a search term sequence set corresponding to a plurality of query sessions according to the search terms in each query session; obtain a search term set corresponding to the plurality of query sessions; train the search term sequence set to obtain an N-dimensional search term vector file; for each search term in the search term set, calculate the degree of association between the search term and each other search term according to the N-dimensional search term vector file; and take the other search terms whose degrees of association with the search term meet a preset condition as the original corpus of the search term.
Optionally, the mining unit is adapted to, for each query session, arrange the search terms in the query session into a sequence in order; if a search term in the sequence corresponds to an application download operation, insert the name of the downloaded application into the sequence immediately after that search term; obtain a search term sequence corresponding to the query session; and take the set of search terms in the plurality of query sessions as the search term set corresponding to the plurality of query sessions.
Optionally, the mining unit is adapted to train the search word sequence set by using a deep learning tool package word2vec to generate an N-dimensional search word vector file, where each search word in the search word sequence set is used as a word.
Optionally, the mining unit is adapted to perform an operation on the search term set and the N-dimensional search term vector file by using a KNN algorithm, and calculate the distance between every two search terms in the search term set according to the N-dimensional search term vector file; and for each search term in the search term set, sort the other search terms in ascending order of their distance from the search term and select a first preset threshold number of nearest search terms as the original corpus of the search term.
Optionally, the mining unit is adapted to perform word segmentation processing on each original corpus in the original corpus set to obtain a word segmentation result containing a plurality of terms; search for phrases formed by adjacent terms in the word segmentation result; and retain the phrases, the terms that are nouns, and the terms that are verbs in the word segmentation result as the keywords retained for the original corpus.
Optionally, the mining unit is adapted to calculate the cPMId value of every two adjacent terms in the word segmentation result, and determine that two adjacent terms form a phrase when their cPMId value is greater than a second preset threshold.
Optionally, the mining unit is further adapted to use the keywords retained for the original corpus of each search term as the first-stage training corpus of the search term; form a first-stage training corpus set from the first-stage training corpora of the search terms; and perform data cleaning on the keywords in the first-stage training corpus set.
Optionally, the mining unit is adapted to calculate, in the first-stage training corpus set, for the first-stage training corpus of each search term, the TF-IDF value of each keyword in that corpus; delete the keywords whose TF-IDF value is higher than a third preset threshold and/or lower than a fourth preset threshold to obtain the training corpus of the search term; the training corpora of the search terms constitute the training corpus set.
Optionally, the mining unit is adapted to calculate a search term-keyword probability distribution result according to the search term-topic probability distribution result and the topic-keyword probability distribution result; and according to the search term-keyword probability distribution result, for each search term, sort the keywords in descending order of their probability with respect to the search term and select a fifth preset threshold number of top-ranked keywords.
Optionally, the mining unit is adapted to, for each search term, obtain the probability of each topic with respect to the search term according to the search term-topic probability distribution result; for each topic, obtain the probability of each keyword with respect to the topic according to the topic-keyword probability distribution result; for each keyword, take the product of the probability of the keyword with respect to a topic and the probability of that topic with respect to a search term as the topic-based probability of the keyword with respect to the search term; and take the sum of the topic-based probabilities of the keyword over all topics as the probability of the keyword with respect to the search term.
Optionally, the mining unit is further adapted to take the fifth preset threshold number of keywords selected for each search term as a first-stage label system of the search term; for the first-stage label system of each search term, calculate a semantic relation value between each keyword in the first-stage label system and the search term; for each keyword, take the product of its semantic relation value and its probability with respect to the search term as the corrected probability of the keyword with respect to the search term; and sort all the keywords in the first-stage label system in descending order of corrected probability and select a sixth preset threshold number of top-ranked keywords to form the label system of the search term.
Optionally, the mining unit is adapted to obtain a search word sequence set corresponding to a plurality of query sessions according to search words in each query session; training the search word sequence set to obtain an N-dimensional keyword vector file; calculating word vectors of the keywords according to the N-dimensional keyword vector files, and calculating the word vectors of each term in the search words; calculating cosine similarity between the word vector of the keyword and the word vector of each term, and taking the cosine similarity as a semantic relation value of the keyword and the corresponding term; and taking the sum of the semantic relation values of the keyword and each term as the semantic relation value between the keyword and the search word.
Optionally, the mining unit is adapted to perform word segmentation on the search word sequence set, and train the search word sequence set after word segmentation by using a deep learning tool package word2vec to generate an N-dimensional keyword vector file.
Optionally, the mining unit is further adapted to take the sixth preset threshold number of keywords selected for each search term as a second-stage label system of the search term; for the second-stage label system of each search term, count the TF-IDF value, in the training corpus of the search term, of each keyword in the second-stage label system; for each keyword, take the product of its probability with respect to the search term and its TF-IDF value as the secondary corrected probability of the keyword with respect to the search term; and sort all the keywords in the second-stage label system in descending order of secondary corrected probability and select the first K keywords to form the label system of the search term.
Optionally, the mining unit is adapted to obtain, from a query session log of an application search engine, the number of queries of the search term within a preset time period; and select the first K keywords to form the label system of the search term according to the number of queries; wherein the value of K is a piecewise linear (broken-line) function of the number of queries corresponding to the search term.
According to still another aspect of the present invention, there is provided an application search server, including:
the database construction unit is suitable for constructing a search word tag database which comprises a plurality of tag systems of search words;
the interaction unit is suitable for receiving the current search terms uploaded by the client;
the search processing unit is suitable for acquiring a label system of the current search word according to the search word label database; calculating the degree of association between the label system of the current search term and the label system of each application;
the interaction unit is also suitable for returning the relevant information of the application to the client side for displaying when the association degree between the label system of the current search word and the label system of the application meets the preset condition;
wherein the process by which the database construction unit constructs the search term tag database is the same as that of the apparatus for identifying an application search intention according to any one of claims 22 to 39.
Optionally, the search processing unit is adapted to calculate semantic similarities between the current search word and the search words in the search word tag database, sort the search words according to the semantic similarities from large to small, and select a first preset threshold number of search words; and obtaining the label system of the current search word according to the label system of each selected search word.
Optionally, the search processing unit is adapted to calculate the Euclidean distance between the current search term and each search term in the search term tag database, and take the Euclidean distance between each search term and the current search term as the semantic similarity corresponding to that search term; take the semantic similarity corresponding to each search term as the weight of each label in the label system of that search term; for the labels in the label systems of the selected search terms, add up the weights of identical labels to obtain the final weight of each label; and sort the labels in descending order of final weight and select a second preset threshold number of top-ranked labels to form the label system of the current search term.
According to the solution of the invention, a user intention identification method is provided, namely a tagging method matched with the app tag system. The tag system corresponding to each search term is mined flexibly, effectively and accurately, and a search term tag database is established, so that a search term input by the user can be accurately described by its tag system and the problem of user intention identification is solved. Further, the user intention and the app can be mapped into the same tag system, and more accurate application search results can be obtained by matching them during search. The solution therefore addresses both user intention identification and relevance calculation for the application search engine, and lays a foundation for function search, a core technology of application search engines.
The foregoing description is only an overview of the technical solutions of the present invention, and the embodiments of the present invention are described below in order to make the technical means of the present invention more clearly understood and to make the above and other objects, features, and advantages of the present invention more clearly understandable.
Detailed Description
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
Hereinafter, "app" denotes an application, "query" denotes a search term, "tag" denotes a label, and "session" denotes a query session.
The invention provides a new user intention identification method for an application search engine. A tagging method is used to express the fine-grained query intention of the user flexibly and effectively, and a tag system of user intentions is constructed based on unsupervised machine learning, abandoning the traditional classification-based approach to user intention. The result is an automated user intention mining pipeline that generates a user intention tag list with high accuracy and recall, maps the user's queries and apps into a common tag system, and thereby addresses both user intention identification and relevance calculation for the application search engine, with very good results.
FIG. 1 shows a flow diagram of a method for identifying application search intent, according to one embodiment of the invention. As shown in fig. 1, the method includes:
step S110, obtaining search terms in each query session from a query session log of an application search engine;
step S120, mining a label system of each search term according to the search terms in each query session and a preset strategy;
step S130, identifying the application search intention corresponding to each search word according to the label system of the search word.
Traditional user intention identification relies on classification schemes designed for web pages and is not well suited to the app scenario: each application belongs to a fixed domain and provides a specific function, so mining the user's fine-grained functional needs with tags is appropriate, whereas classification-based methods are too coarse-grained. The present solution provides a user intention identification method, a tagging method matched with the application tag system, which is flexible and effective, maps the user intention and the application into the same tag system, solves the problem of user intention identification, solves the problem of relevance calculation for the application search engine, and is the basis for realizing function search, a core technology of the application search engine.
In general, a user's search term is a short text; the features that can be constructed from it are sparse and cannot comprehensively describe the user's need. However, if a user is looking for an app with a single functional scenario within a short period of time, the query terms are usually rewritten around that single requirement, so there is a strong semantic relationship between the queries issued in the session. This is an important characteristic of an application search engine.
In a search engine service, the system automatically records information related to each user search and stores it in a query log. For example, a user opens a Baidu search page, successively inputs search terms such as "game", "game software", "funny game" and "game application download", enters result pages, and possibly continues to input further search terms, until the user completes the search and closes the Baidu search page. The whole process is called a query session.
In an embodiment of the present invention, the step S120 of mining a label system of each search term according to the search terms in each query session and a preset strategy includes: obtaining a training corpus set according to the search terms in each query session; inputting the training corpus set into an LDA model for training to obtain a search term-topic probability distribution result and a topic-keyword probability distribution result output by the LDA model; and calculating a label system of each search term according to the search term-topic probability distribution result and the topic-keyword probability distribution result.
The technical difficulty in obtaining the training corpus is expanding the short query text into a long text so that each query can be regarded as a document; this is the key to using the LDA topic model effectively and to generating intention tags with high accuracy and high recall. Intention tags are divided into category tags, which reflect the application domain of the user's need, and function tags, which reflect the user's specific need.
Wherein the obtaining the training corpus set according to the search terms in each query session includes:
obtaining an original corpus of each search term according to the search terms in each query session; forming an original corpus set from the original corpora of the search terms; and preprocessing the original corpus set to obtain a training corpus set. Specifically, the obtaining the original corpus of each search term according to the search terms in each query session includes: obtaining a search term sequence set corresponding to a plurality of query sessions according to the search terms in each query session; and obtaining a search term set corresponding to the plurality of query sessions.
Concretely, the query search term sequence within each query session is kept, each search term being treated as a whole; if the user downloads some apps under a certain query, the app names are spliced into the sequence right after that query. For example, a user's session sequence is query1, query2, query3, and the user downloads app1 after entering query2; app1 is then spliced after query2 and before query3, giving query1, query2, app1, query3. Each session sequence is one line and is output to a file session_query-app_list, and all queries are output to another file query_all.
Training the search term sequence set yields an N-dimensional search term vector file; for each search term in the search term set, the degree of association between the search term and each other search term is calculated according to the N-dimensional search term vector file; and the other search terms whose degrees of association with the search term meet a preset condition are taken as the original corpus of the search term.
In an embodiment of the present invention, the obtaining the search term sequence set corresponding to the plurality of query sessions includes: for each query session, arranging the search terms in the query session into a sequence in order; if a search term in the sequence corresponds to an application download operation, inserting the name of the downloaded application into the sequence immediately after that search term; and obtaining a search term sequence corresponding to the query session. The obtaining the search term set corresponding to the plurality of query sessions includes: taking the set of search terms in the plurality of query sessions as the search term set corresponding to the plurality of query sessions.
For example, a user enters "search term 1", "search term 2" and "search term 3" in sequence in a query session, and downloads app1 after entering "search term 2". The search term sequence corresponding to this query session is therefore: search term 1, search term 2, app1, search term 3. The search term sequence of each query session forms one row, and the sequences of the plurality of query sessions together form the search term sequence set, as sketched below.
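As an illustration only, the following Python sketch builds such search term sequences from simplified session records; the record format and field names used here are assumptions made for the example, not the format of an actual query session log.

```python
# Minimal sketch: build search term sequences from simplified query sessions.
# A session is assumed to be a list of (search_term, downloaded_app_or_None) events.

def build_sequence(session_events):
    """session_events: list of (search_term, downloaded_app or None)."""
    sequence = []
    for term, downloaded_app in session_events:
        sequence.append(term)
        if downloaded_app:  # insert the app name right after the triggering term
            sequence.append(downloaded_app)
    return sequence

sessions = [
    [("search term 1", None), ("search term 2", "app1"), ("search term 3", None)],
]
sequences = [build_sequence(s) for s in sessions]        # search term sequence set
search_term_set = {t for s in sessions for t, _ in s}    # search term set
print(sequences)  # [['search term 1', 'search term 2', 'app1', 'search term 3']]
```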
The training the search term sequence set to obtain the N-dimensional search term vector file includes: taking each search term in the search term sequence set as a single word, and training the search term sequence set with the deep learning toolkit word2vec to generate the N-dimensional search term vector file. For example, training with word2vec generates 300-dimensional query vectors and produces a query vector file query_w2v_300.dict, i.e. the search term vector file.
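A minimal training sketch follows. It uses the gensim implementation of word2vec purely for illustration (the text only states that the word2vec toolkit is used); the 300 dimensions follow the example above, while the window size and other hyperparameters are assumptions.

```python
# Sketch: train 300-dimensional search term vectors with word2vec (gensim >= 4.0).
# Each whole search term (or app name) is treated as one token.
from gensim.models import Word2Vec

sequences = [
    ["search term 1", "search term 2", "app1", "search term 3"],
    # ... one list per query session
]

model = Word2Vec(
    sentences=sequences,
    vector_size=300,   # N = 300, as in the example above
    window=5,          # assumed context window
    min_count=1,       # keep rare queries in this toy example
    sg=1,              # skip-gram; an assumption, not stated in the text
)
model.wv.save_word2vec_format("query_w2v_300.dict")  # search term vector file
print(model.wv["search term 2"].shape)               # (300,)
```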
In practice, when searching for a desired application, the user may input search terms in various forms: a noun (e.g., "game"), a phrase (e.g., "funny game"), or a sentence (e.g., "I want to download a funny game").
In an embodiment of the present invention, the search term vector file obtained above is used as the basis for computing a vector for each search term in the search term set. For each search term in the search term set, the degree of association between the search term and each other search term is calculated according to the N-dimensional search term vector file, and the other search terms whose degrees of association with the search term meet a preset condition are taken as the original corpus of the search term. Specifically, this includes:
performing an operation on the search term set and the N-dimensional search term vector file by using a KNN algorithm, and calculating the distance between every two search terms in the search term set according to the N-dimensional search term vector file; and for each search term in the search term set, sorting the other search terms in ascending order of their distance from the search term and selecting a first preset threshold number of nearest search terms as the original corpus of that search term.
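The nearest-neighbor expansion can be sketched as follows, using scikit-learn's NearestNeighbors as one possible KNN implementation (the text does not name a specific library); the Euclidean metric matches Table 1, and the value 10 for the first preset threshold follows the example below.

```python
# Sketch: expand each query into an "original corpus" made of its nearest queries.
import numpy as np
from sklearn.neighbors import NearestNeighbors

queries = ["query_a", "query_b", "query_c", "query_d"]   # search term set
vectors = np.random.rand(len(queries), 300)              # stand-in for the 300-d query vectors

k = min(10 + 1, len(queries))                            # +1 because each query is its own nearest point
knn = NearestNeighbors(n_neighbors=k, metric="euclidean")
knn.fit(vectors)
distances, indices = knn.kneighbors(vectors)             # row i: neighbors of queries[i], nearest first

original_corpus = {
    q: [queries[j] for j in idx_row[1:]]                 # drop the query itself (distance 0)
    for q, idx_row in zip(queries, indices)
}
print(original_corpus["query_a"])
```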
Table 1 shows the top 10 nearest neighbors of the search term "Sogou" in an embodiment of the present invention. The nearest neighbors include both search terms and app names, such as "Sogou mobile phone input method" and "Sogou input method" in the first column of Table 1. In this example the first preset threshold is 10, and the second column of Table 1 gives the statistics (including the Euclidean distance) between each nearest neighbor and the search term "Sogou".
TABLE 1
Nearest neighbor | Statistical indices (based on Euclidean distance)
Sogou mobile phone input method | 38 303.827 0.838104
Sogou input method | 26 323.494 0.845153
Sogou | 20332 372.525 0.778589
Dog collecting device | 6986 385.809 0.76965
Sogou Pinyin | 14577 410.986 0.753037
Sogou input method (Xiaomi edition) | 4042 423.929 0.746941
Sogou phonetic input method | 4927 435.273 0.736172
Sohu input method | 18233 452.955 0.724872
Sogou input | 10274 455.505 0.720034
Mobile phone Sogou input method | 3075 476.93 0.721099
Table 2 shows the top 10 nearest neighbors of the search term "lottery draw inquiry" in an embodiment of the present invention; the meaning of each column is the same as in Table 1 and is not described again.
TABLE 2
In an embodiment of the present invention, after obtaining an original corpus set corresponding to each search term, the preprocessing the original corpus set includes:
in the original corpus set, performing word segmentation processing on each original corpus to obtain a word segmentation result containing a plurality of terms; searching for phrases formed by adjacent terms in the word segmentation result; and retaining the phrases, the terms that are nouns, and the terms that are verbs in the word segmentation result as the keywords retained for that original corpus.
For example, if a user inputs the search term "download game", the noun term of the search term is "game" and the verb term is "download".
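A minimal segmentation-and-filtering sketch is given below. It uses the jieba segmenter and its part-of-speech tags purely as one possible tool for Chinese word segmentation; the text does not name a specific segmenter, and the tag prefixes ("n" for nouns, "v" for verbs) are jieba conventions.

```python
# Sketch: segment one original corpus entry and keep only its nouns and verbs
# (phrase detection is handled separately; see the cPMId step below).
import jieba.posseg as pseg

def keep_nouns_and_verbs(text):
    kept = []
    for word, flag in pseg.cut(text):
        # jieba POS tags: flags starting with 'n' are nouns, with 'v' are verbs
        if flag.startswith("n") or flag.startswith("v"):
            kept.append(word)
    return kept

print(keep_nouns_and_verbs("下载游戏"))  # e.g. ['下载', '游戏'] ("download", "game")
```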
Wherein the searching for phrases composed of adjacent terms in the word segmentation result comprises:
calculating the cPMId value of every two adjacent terms in the word segmentation result, and determining that two adjacent terms form a phrase when their cPMId value is greater than a second preset threshold.
Equation 1 shows how the cPMId value is calculated, where D(x, y) denotes the co-occurrence frequency of two terms x and y, D(x) denotes the occurrence frequency of term x, D(y) denotes the occurrence frequency of term y, D denotes the total number of apps, and δ is set to 0.7.
Equation 1
For example, the term pairs are sorted in descending order of cPMId, and term combinations with cPMId above the threshold 5 are selected as phrases; the phrases are combined with the verbs and nouns retained above to generate a new file query_corp_seg_nouns_verbs_phrase.
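Since the body of Equation 1 is not reproduced here, the sketch below uses a generic discounted PMI-style co-occurrence score built from the quantities defined above (D(x, y), D(x), D(y), D and the discount δ); the exact functional form is an assumption for illustration and may differ from the cPMId formula of Equation 1.

```python
# Sketch: score adjacent term pairs with a discounted PMI-style statistic and
# keep high-scoring pairs as phrases. The formula below is illustrative only.
import math
from collections import Counter

segmented_docs = [
    ["funny", "game", "download"],
    ["funny", "game", "application"],
    ["game", "download"],
]

delta = 0.7        # discount, as stated in the text
threshold = -1.0   # stands in for the "second preset threshold"; value is arbitrary

term_count = Counter(t for doc in segmented_docs for t in doc)                 # D(x)
pair_count = Counter(p for doc in segmented_docs for p in zip(doc, doc[1:]))   # D(x, y)
D = len(segmented_docs)                                                        # corpus size

def score(x, y):
    # Discounted PMI-like score; an assumed stand-in for cPMId.
    return math.log((pair_count[(x, y)] - delta) * D /
                    (term_count[x] * term_count[y]))

phrases = [pair for pair in pair_count if score(*pair) > threshold]
print(phrases)
```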
Further, in an embodiment of the present invention, the preprocessing the original corpus set further includes: using the keywords retained for the original corpus of each search term as the first-stage training corpus of the search term; forming a first-stage training corpus set from the first-stage training corpora of the search terms; and performing data cleaning on the keywords in the first-stage training corpus set.
Specifically, the data cleaning of the keywords in the first-stage training corpus set includes: for the first-stage training corpus of each search term, calculating the TF-IDF value of each keyword in that corpus; deleting the keywords whose TF-IDF value is higher than a third preset threshold and/or lower than a fourth preset threshold to obtain the training corpus of the search term; the training corpora of the search terms constitute the training corpus set.
The purpose of this step is to find non-tag words in the first-stage training corpus set and clean them out. A term that occurs with very high or very low frequency is unlikely to be a tag. Using the TF-IDF statistic, the tf-idf weight of each term and phrase is computed over the first-stage training corpus set; terms or phrases above one threshold or below another are treated as non-tag words (the thresholds depend on the specific corpus, so no concrete values are listed here). The non-tag words are written to a blacklist black_tag.list, the non-tag words are filtered out of the first-stage training corpus set, and a new training corpus set is generated in the format: search term_id \t term 1 term 2 ... term n.
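One way to implement this cleaning step is sketched below with scikit-learn's TfidfVectorizer; the library choice and the concrete threshold values are assumptions for illustration, since the thresholds are stated to be corpus-dependent.

```python
# Sketch: drop keywords whose tf-idf weight in a query's first-stage corpus is
# too high or too low, keeping the rest as the query's training corpus.
from sklearn.feature_extraction.text import TfidfVectorizer

# One space-joined document per search term (its first-stage training corpus).
docs = {
    "funny game": "game funny download game application",
    "weather":    "weather forecast city weather application",
}

vectorizer = TfidfVectorizer(token_pattern=r"\S+")
matrix = vectorizer.fit_transform(docs.values()).toarray()
terms = vectorizer.get_feature_names_out()

HIGH, LOW = 0.9, 0.1   # third / fourth preset thresholds (assumed values)
cleaned = {}
for (query, text), row in zip(docs.items(), matrix):
    weight = dict(zip(terms, row))
    cleaned[query] = [t for t in text.split() if LOW <= weight.get(t, 0.0) <= HIGH]

print(cleaned)
```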
Table 3 shows some of the non-tag words discarded during data cleaning; such words occur with either too high or too low a frequency and are not meaningful for user search.
TABLE 3
After the training corpus set is obtained, the GibbsLDA++ implementation of the LDA model is used. The GibbsLDA++ source code is modified so that identical terms in the query corpus are initialized to the same topic; in the original code, every term is randomly initialized to a topic, so repeated occurrences of the same term can be initialized to different topics. For example, LDA training uses 120 topics and 300 iterations, and outputs two pieces of data: a topic-term probability distribution and a document-topic probability distribution.
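The embodiment above uses a modified GibbsLDA++; the gensim sketch below is only meant to illustrate obtaining the two output distributions (document-topic and topic-term) and does not reproduce the modified topic initialization.

```python
# Sketch: train LDA and read out document-topic and topic-term distributions.
from gensim import corpora
from gensim.models import LdaModel

texts = [
    ["game", "funny", "download"],
    ["weather", "forecast", "city"],
    ["game", "application", "download"],
]  # each list stands in for the cleaned training corpus of one search term

dictionary = corpora.Dictionary(texts)
bow_corpus = [dictionary.doc2bow(t) for t in texts]

lda = LdaModel(corpus=bow_corpus, id2word=dictionary,
               num_topics=120, iterations=300,   # 120 topics, 300 iterations, as above
               random_state=0)

doc_topics = lda.get_document_topics(bow_corpus[0], minimum_probability=0.0)  # P(topic | query)
topic_terms = lda.get_topic_terms(0, topn=10)                                 # P(term | topic)
print(doc_topics[:3], topic_terms[:3], sep="\n")
```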
According to the scheme, a label system of each search word is calculated according to the search word-topic probability distribution result and the topic-keyword probability distribution result, and the method comprises the following steps:
calculating a search term-keyword probability distribution result according to the search term-topic probability distribution result and the topic-keyword probability distribution result; and according to the search term-keyword probability distribution result, for each search term, sorting the keywords in descending order of their probability with respect to the search term and selecting a fifth preset threshold number of top-ranked keywords.
Wherein, the calculating the search term-keyword probability distribution result according to the search term-topic probability distribution result and the topic-keyword probability distribution result includes: for each search term, obtaining the probability of each topic with respect to the search term from the search term-topic probability distribution result; for each topic, obtaining the probability of each keyword with respect to the topic from the topic-keyword probability distribution result; for each keyword, taking the product of the probability of the keyword with respect to a topic and the probability of that topic with respect to a search term as the topic-based probability of the keyword with respect to the search term; and taking the sum of the topic-based probabilities of the keyword over all topics as the probability of the keyword with respect to the search term.
This step is the initial LDA tag generation, which yields the tags produced by LDA. LDA outputs a topic probability distribution for each query and a term probability distribution for each topic. To obtain the tags of each query, both distributions are sorted in descending order of probability, the top 50 topics under each query and the top 120 terms under each topic are selected, and the term probabilities are weighted by the topic probabilities, so that each tag term carries an LDA weight expressing its importance for the query; sorting by this weight in descending order gives the tag list generated by LDA. This list still contains a lot of noise and the ordering of the tags is not yet accurate.
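In matrix form, this combination step amounts to multiplying the document-topic distribution by the topic-term distribution; the numpy sketch below illustrates it, with the truncation to the top 50 topics and top 120 terms taken from the example above and toy matrix sizes assumed.

```python
# Sketch: P(keyword | query) = sum over topics of P(topic | query) * P(keyword | topic),
# restricted to the top topics per query and the top terms per topic.
import numpy as np

rng = np.random.default_rng(0)
n_queries, n_topics, n_terms = 4, 120, 500                   # toy sizes
theta = rng.dirichlet(np.ones(n_topics), size=n_queries)     # P(topic | query)
phi = rng.dirichlet(np.ones(n_terms), size=n_topics)         # P(term  | topic)

TOP_TOPICS, TOP_TERMS = 50, 120                              # as in the example above

def query_keyword_distribution(q):
    top_topics = np.argsort(theta[q])[::-1][:TOP_TOPICS]
    scores = np.zeros(n_terms)
    for t in top_topics:
        top_terms = np.argsort(phi[t])[::-1][:TOP_TERMS]
        scores[top_terms] += theta[q, t] * phi[t, top_terms]  # weighted accumulation
    return scores

lda_weights = query_keyword_distribution(0)
initial_tag_list = np.argsort(lda_weights)[::-1][:10]         # initial LDA tag ids for query 0
print(initial_tag_list)
```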
Further, to fine-tune the prediction result of the LDA model so that the important tags of each query move forward in the ranking, in an embodiment of the present invention, the calculating the label system of each search term according to the search term-topic probability distribution result and the topic-keyword probability distribution result further includes: taking the fifth preset threshold number of keywords selected for each search term as a first-stage label system of the search term; for the first-stage label system of each search term, calculating a semantic relation value between each keyword in the first-stage label system and the search term; for each keyword, taking the product of its semantic relation value and its probability with respect to the search term as the corrected probability of the keyword with respect to the search term; and sorting all the keywords in the first-stage label system in descending order of corrected probability and selecting a sixth preset threshold number of top-ranked keywords to form the label system of the search term.
Calculating a semantic relation value between each keyword in a first-stage label system of the search term and the search term comprises the following steps: obtaining a search word sequence set corresponding to a plurality of query sessions according to search words in each query session; training the search word sequence set to obtain an N-dimensional keyword vector file; calculating word vectors of the keywords according to the N-dimensional keyword vector files, and calculating the word vectors of each term in the search words; calculating cosine similarity between the word vector of the keyword and the word vector of each term, and taking the cosine similarity as a semantic relation value of the keyword and the corresponding term; and taking the sum of the semantic relation values of the keyword and each term as the semantic relation value between the keyword and the search word.
For example, the semantic relation between each tag word and the query is calculated using the trained word vector file term_w2v_300.dict, as follows: the cosine similarity between the tag word's vector and the vector of each word in the query is calculated and the similarities are accumulated, and the larger the value, the more important the tag; the tags are then weighted by their LDA weights and re-sorted in descending order, as sketched below.
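A sketch of this correction step follows; it assumes the keyword and query-term vectors have already been trained (random stand-ins are used here) and simply combines the accumulated cosine similarity with the LDA weight.

```python
# Sketch: correct each tag's LDA weight by its semantic relation to the query.
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(1)
word_vec = {w: rng.standard_normal(300) for w in
            ["funny", "game", "casual", "puzzle", "download"]}   # stand-in word2vec vectors

query_terms = ["funny", "game"]                                   # segmented query
first_stage_tags = {"casual": 0.30, "puzzle": 0.25, "download": 0.20}  # tag -> LDA weight

corrected = {}
for tag, lda_weight in first_stage_tags.items():
    # semantic relation value: sum of cosine similarities to every query term
    relation = sum(cosine(word_vec[tag], word_vec[t]) for t in query_terms)
    corrected[tag] = relation * lda_weight                         # corrected probability

ranking = sorted(corrected, key=corrected.get, reverse=True)
print(ranking)
```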
Specifically, the training the search word sequence set to obtain an N-dimensional keyword vector file includes: and performing word segmentation processing on the search word sequence set, and training the search word sequence set subjected to word segmentation processing by using a deep learning tool package word2vec to generate an N-dimensional keyword vector file.
For example, the search term sequence set is segmented into Chinese words and trained with the deep learning toolkit word2vec to generate 300-dimensional vectors, producing another word vector file term_w2v_300.dict, i.e. the keyword vector file.
Still further, in an embodiment of the present invention, the calculating the label system of each search term according to the search term-topic probability distribution result and the topic-keyword probability distribution result further includes: taking the sixth preset threshold number of keywords selected for each search term as a second-stage label system of the search term; for the second-stage label system of each search term, counting the TF-IDF value, in the training corpus of the search term, of each keyword in the second-stage label system; for each keyword, taking the product of its probability with respect to the search term and its TF-IDF value as the secondary corrected probability of the keyword with respect to the search term; and sorting all the keywords in the second-stage label system in descending order of secondary corrected probability and selecting the first K keywords to form the label system of the search term.
For example, the tags are weighted by their tf-idf weights in the query's expanded corpus, normalized, and the tag order is rearranged accordingly.
After these two corrections, the accuracy of the tag ordering for expressing the query intention is greatly improved.
In an embodiment of the present invention, the selecting the first K keywords to form the label system of the search term includes: acquiring, from a query session log of the application search engine, the number of queries of the search term within a preset time period; and selecting the first K keywords to form the label system of the search term according to the number of queries, wherein the value of K is a piecewise linear (broken-line) function of the number of queries corresponding to the search term.
This step determines the number of tags kept for each query: the top k tag words are retained, where k is a piecewise linear function of how often the query is searched, so each query keeps between 2 and 5 tags; the accuracy is 88% and the recall is 75%. At this step a query intent dictionary query_intent_tag is generated, as sketched below.
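The mapping from query frequency to k can be any piecewise linear (broken-line) function; the breakpoints below are assumptions chosen only so that k stays between 2 and 5, as stated above.

```python
# Sketch: choose how many tags (k) to keep as a broken-line function of query frequency.
import numpy as np

def tags_to_keep(query_count):
    counts = [0, 100, 1000, 10000]   # assumed breakpoints (number of queries)
    ks     = [2,   3,    4,     5]   # k ranges from 2 to 5, as stated in the text
    return int(round(np.interp(query_count, counts, ks)))

def final_tag_system(ranked_tags, query_count):
    return ranked_tags[:tags_to_keep(query_count)]

tags = ["casual", "puzzle", "download", "offline", "free"]
print(final_tag_system(tags, 50))      # low-frequency query keeps 2 tags
print(final_tag_system(tags, 20000))   # high-frequency query keeps 5 tags
```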
Further, in a specific example, the solution labels about 2.6 million queries with tag words that express the user's intention. Each query is treated as a whole, so after a user reformulates or rewrites a query into a synonymous form, the new query is not in the query intent dictionary. In that case the semantic similarity between the new query and the queries in the dictionary is calculated, and the intent tags of semantically similar queries are given to the new query. The calculation is as follows: the term vectors of the words in the new query are accumulated to form a new query vector, the Euclidean distance to each query vector in the query intent dictionary is calculated, and the 3 nearest queries are selected, using a KdTree to reduce the computational complexity; the Euclidean distances are smoothed with a Gaussian kernel and used as weights for the tag words; the intent tag words of the 3 neighboring queries are combined to generate the intent tag words of the new query; and the first 3 tags are kept to meet the user's search intention, with an accuracy of 80%.
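A rough sketch of this fallback for unseen queries is shown below, using scikit-learn's KDTree as one possible implementation of the KdTree lookup; the Gaussian bandwidth and the toy vectors are assumptions.

```python
# Sketch: give an unseen query the tags of its 3 nearest known queries, weighting
# each neighbor's tags with a Gaussian-smoothed Euclidean distance.
from collections import defaultdict
import numpy as np
from sklearn.neighbors import KDTree

rng = np.random.default_rng(2)
known_queries = ["funny game", "casual game", "weather forecast", "photo editor"]
query_vectors = rng.standard_normal((len(known_queries), 300))   # stand-in query vectors
intent_tags = {
    "funny game":       ["casual", "puzzle"],
    "casual game":      ["casual", "offline"],
    "weather forecast": ["weather"],
    "photo editor":     ["photo", "beauty"],
}

tree = KDTree(query_vectors)

def tag_new_query(new_vector, sigma=1.0, top_tags=3):
    dist, idx = tree.query(new_vector.reshape(1, -1), k=3)        # 3 nearest known queries
    weights = np.exp(-dist[0] ** 2 / (2 * sigma ** 2))            # Gaussian smoothing of distances
    scores = defaultdict(float)
    for w, i in zip(weights, idx[0]):
        for tag in intent_tags[known_queries[i]]:
            scores[tag] += w
    return sorted(scores, key=scores.get, reverse=True)[:top_tags]

new_query_vector = rng.standard_normal(300)   # e.g. the summed word vectors of the new query
print(tag_new_query(new_query_vector))
```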
FIG. 2 shows a flowchart of an application search method according to an embodiment of the present invention, the method comprising:
step 210, a search term tag database is constructed, wherein the search term tag database comprises a tag system of a plurality of search terms.
Step 220, receiving the current search term uploaded by the client, and acquiring a tag system of the current search term according to the search term tag database.
Step 230, calculating the degree of association between the tag system of the current search term and the tag system of each application.
Step 240, when the degree of association between the tag system of the current search term and the tag system of an application meets a preset condition, returning the relevant information of that application to the client for display.
In the process of constructing the search term tag database in step 210, the tag system of each search term is mined in the same way as in any of the embodiments of the method shown in fig. 1.
In an embodiment of the present invention, the obtaining the tag system of the current search term according to the search term tag database includes: calculating the semantic similarity between the current search term and each search term in the search term tag database, sorting the search terms in descending order of semantic similarity, and selecting a first preset threshold number of search terms; and obtaining the tag system of the current search term according to the tag systems of the selected search terms.
In one embodiment of the present invention, the calculating the semantic similarity between the current search term and each search term in the search term tag database includes: calculating the Euclidean distance between the current search term and each search term in the search term tag database, and taking the Euclidean distance between each search term and the current search term as the semantic similarity corresponding to that search term. The obtaining the tag system of the current search term according to the tag systems of the selected search terms includes: taking the semantic similarity corresponding to each search term as the weight of each tag in the tag system of that search term; for the tags in the tag systems of the selected search terms, adding up the weights of identical tags to obtain the final weight of each tag; and sorting the tags in descending order of final weight and selecting a second preset threshold number of top-ranked tags to form the tag system of the current search term, as sketched below.
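The aggregation of the selected tag systems can be sketched as follows; the similarity values and tag systems are toy data, and the similarity is used directly as a weight, mirroring the description above.

```python
# Sketch: build the tag system of the current search term from the tag systems of
# its most similar known search terms, weighting each tag by that similarity.
from collections import defaultdict

neighbors = {                       # selected search terms -> semantic similarity
    "funny game": 0.9,
    "casual game": 0.7,
}
tag_systems = {                     # tag system of each selected search term
    "funny game": ["casual", "puzzle"],
    "casual game": ["casual", "offline"],
}

final_weight = defaultdict(float)
for query, similarity in neighbors.items():
    for tag in tag_systems[query]:
        final_weight[tag] += similarity          # add up the weights of identical tags

SECOND_PRESET_THRESHOLD = 3                      # assumed value
current_tag_system = sorted(final_weight, key=final_weight.get,
                            reverse=True)[:SECOND_PRESET_THRESHOLD]
print(current_tag_system)   # ['casual', 'puzzle', 'offline']
```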
Table 4 shows the intent tag words of some of the search terms in the 360 Mobile Assistant application search.
TABLE 4
Fig. 3 shows an apparatus for identifying an application search intention according to an embodiment of the present invention, where the apparatus 300 for identifying an application search intention includes:
an obtaining unit 310, adapted to obtain a search term in each query session from a query session log of an application search engine;
the mining unit 320 is adapted to mine a label system of each search term according to the search term in each query session and a preset strategy;
the identifying unit 330 is adapted to identify an application search intention corresponding to each search term according to the label system of the search term.
In an embodiment of the present invention, the mining unit 320 is adapted to obtain a training corpus set according to the search terms in each query session; input the training corpus set into an LDA model for training to obtain a search term-topic probability distribution result and a topic-keyword probability distribution result output by the LDA model; and calculate a label system of each search term according to the search term-topic probability distribution result and the topic-keyword probability distribution result.
In an embodiment of the present invention, the mining unit 320 is adapted to obtain an original corpus of each search term according to the search terms in each query session; form an original corpus set from the original corpora of the search terms; and preprocess the original corpus set to obtain the training corpus set.
Specifically, in an embodiment of the present invention, the mining unit 320 is adapted to obtain a search term sequence set corresponding to a plurality of query sessions according to the search terms in each query session; obtain a search term set corresponding to the plurality of query sessions; train the search term sequence set to obtain an N-dimensional search term vector file; for each search term in the search term set, calculate the degree of association between the search term and each other search term according to the N-dimensional search term vector file; and take the other search terms whose degrees of association with the search term meet a preset condition as the original corpus of the search term.
That is, the mining unit 320 is adapted to, for each query session, arrange the search terms in the query session into a sequence in order; if a search term in the sequence corresponds to an application download operation, insert the name of the downloaded application into the sequence immediately after that search term; obtain a search term sequence corresponding to the query session; and take the set of search terms in the plurality of query sessions as the search term set corresponding to the plurality of query sessions.
For example, the mining unit 320 is adapted to treat each search word in the search word sequence set as a single word and train the search word sequence set with the deep learning toolkit word2vec to generate an N-dimensional search word vector file.
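One possible realization of this training step uses gensim's Word2Vec implementation (gensim 4.x argument names are assumed); each whole search word is kept as a single token, and the vector dimension and output file name are illustrative.

    from gensim.models import Word2Vec

    def train_term_vectors(term_sequences, dim=200):
        """term_sequences: list of search word sequences, one per query session."""
        model = Word2Vec(sentences=term_sequences, vector_size=dim,
                         window=5, min_count=1, workers=4)
        # export the N-dimensional search word vector file
        model.wv.save_word2vec_format("term_vectors.txt")
        return model.wv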
On this basis, in an embodiment of the present invention, the mining unit 320 is adapted to apply a KNN algorithm to the search term set and the N-dimensional search term vector file, calculating the distance between every two search terms in the search term set according to the N-dimensional search term vector file; and, for each search word in the search word set, sort the other search words by their distance to the search word in descending order and select the top first-preset-threshold number of search words as the original corpus of the search word.
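A hedged sketch of this neighbor-selection step follows. Cosine similarity over the word vectors is used here as the pairwise score; the patent only states that a KNN-style pairwise comparison is performed, so the exact metric and the dictionary `vectors` (search word to numpy vector) are assumptions.

    import numpy as np

    def nearest_terms(term, vectors, top_k):
        """Return the top_k search words closest to `term` in the vector space."""
        v = vectors[term]
        scores = []
        for other, u in vectors.items():
            if other == term:
                continue
            sim = float(np.dot(v, u) / (np.linalg.norm(v) * np.linalg.norm(u)))
            scores.append((sim, other))
        scores.sort(reverse=True)
        # the selected search words serve as the original corpus of `term`
        return [other for _, other in scores[:top_k]]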
During preprocessing, in an embodiment of the present invention, the mining unit 320 is adapted to perform word segmentation on each original corpus in the original corpus set to obtain a word segmentation result comprising a plurality of terms; search for phrases formed by adjacent terms in the word segmentation result; and retain the phrases, the terms that are nouns, and the terms that are verbs in the word segmentation result as the retained keywords corresponding to that original corpus.
Specifically, the mining unit 320 is adapted to calculate the cpmd value of every two adjacent terms in the word segmentation result, and determine that two adjacent terms form a phrase when their cpmd value is greater than a second preset threshold.
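This excerpt does not define how the cpmd value is computed, so the scoring function in the sketch below is only a placeholder (a PMI-style co-occurrence score over adjacent terms); only the thresholding logic mirrors the text.

    import math
    from collections import Counter

    def find_phrases(segmented_docs, threshold):
        """segmented_docs: list of word segmentation results (lists of terms)."""
        unigram = Counter(w for doc in segmented_docs for w in doc)
        bigram = Counter((a, b) for doc in segmented_docs for a, b in zip(doc, doc[1:]))
        total = sum(unigram.values())

        def score(a, b):
            # assumption: a PMI-like co-occurrence score stands in for the cpmd value
            return math.log((bigram[(a, b)] * total) / (unigram[a] * unigram[b]))

        # adjacent terms whose score exceeds the threshold are treated as a phrase
        return {(a, b) for (a, b) in bigram if score(a, b) > threshold}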
Further, in an embodiment of the present invention, the mining unit 320 is further adapted to take the retained keywords corresponding to the original corpus of each search term as the first-stage training corpus of that search term; form a first-stage training corpus set from the first-stage training corpora of all search terms; and perform data cleaning on the keywords in the first-stage training corpus set.
Specifically, in an embodiment of the present invention, the mining unit 320 is adapted to, for the first-stage training corpus of each search word, calculate the TF-IDF value of each keyword of that corpus within the first-stage training corpus set; and delete the keywords whose TF-IDF value is higher than a third preset threshold and/or lower than a fourth preset threshold to obtain the training corpus of the search word; the training corpora of all search words constitute the training corpus set.
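A minimal sketch of this TF-IDF based cleaning step is shown below. The concrete TF-IDF formula and the upper/lower cut-offs (the third and fourth preset thresholds) are assumptions for illustration.

    import math
    from collections import Counter

    def clean_corpus(stage1_corpus, upper, lower):
        """stage1_corpus: dict mapping each search word to its keyword list."""
        docs = list(stage1_corpus.values())
        df = Counter(w for doc in docs for w in set(doc))   # document frequency
        n_docs = len(docs)

        cleaned = {}
        for term, doc in stage1_corpus.items():
            if not doc:
                cleaned[term] = []
                continue
            tf = Counter(doc)
            kept = []
            for w in doc:
                tfidf = (tf[w] / len(doc)) * math.log(n_docs / (1 + df[w]))
                # keep only keywords whose TF-IDF lies between the two thresholds
                if lower <= tfidf <= upper:
                    kept.append(w)
            cleaned[term] = kept
        return cleaned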
In an embodiment of the present invention, the mining unit 320 is adapted to calculate a search word-keyword probability distribution result according to the search word-topic probability distribution result and the topic-keyword probability distribution result; and, according to the search word-keyword probability distribution result, for each search word, sort the keywords by their probability with respect to that search word in descending order and select the top fifth-preset-threshold number of keywords.
In an embodiment of the present invention, the mining unit 320 is adapted to, for each search word, obtain the probability of each topic with respect to the search word according to the search word-topic probability distribution result; for each topic, obtain the probability of each keyword with respect to the topic according to the topic-keyword probability distribution result; for each keyword, take the product of the probability of the keyword with respect to a topic and the probability of that topic with respect to the search word as the probability of the keyword with respect to the search word based on that topic; and take the sum, over all topics, of the probabilities of the keyword with respect to the search word based on each topic as the probability of the keyword with respect to the search word.
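The computation just described is a marginalization over topics: the probability of a keyword with respect to a search word is the sum, over all topics, of the keyword's probability given the topic multiplied by the topic's probability given the search word. The sketch below renders it directly, reusing the data structures assumed in the earlier LDA sketch.

    def term_keyword_probability(term_topic, topic_keyword, term, top_n):
        """term_topic[term]: list of (topic_id, p_topic); topic_keyword[topic_id]: list of (keyword, p_kw)."""
        probs = {}
        for topic_id, p_topic in term_topic[term]:
            for keyword, p_kw in topic_keyword[topic_id]:
                # probability of the keyword w.r.t. the search word based on this topic
                probs[keyword] = probs.get(keyword, 0.0) + p_kw * p_topic
        ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
        return ranked[:top_n]    # top fifth-preset-threshold keywords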
Further, in an embodiment of the present invention, the mining unit 320 is further adapted to take the top fifth-preset-threshold number of keywords selected for each search term as the first-stage label system of that search term; for the first-stage label system of each search word, calculate a semantic relation value between each keyword in the first-stage label system and the search word; for each keyword, take the product of its semantic relation value and its probability with respect to the search word as the corrected probability of the keyword with respect to the search word; and sort all keywords in the first-stage label system by their corrected probability with respect to the search word in descending order and select the top sixth-preset-threshold number of keywords to form the label system of the search word.
In an embodiment of the present invention, the mining unit 320 is adapted to obtain a search word sequence set corresponding to a plurality of query sessions according to the search words in each query session; train the search word sequence set to obtain an N-dimensional keyword vector file; obtain, from the N-dimensional keyword vector file, the word vector of the keyword and the word vector of each term in the search word; calculate the cosine similarity between the word vector of the keyword and the word vector of each term, and take that cosine similarity as the semantic relation value between the keyword and the corresponding term; and take the sum of the semantic relation values between the keyword and all terms as the semantic relation value between the keyword and the search word.
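A short sketch of the semantic relation value follows: the keyword's vector is compared with the vector of every term segmented from the search word, and the cosine similarities are summed. It assumes the N-dimensional keyword vector file has been loaded into a dictionary `kw_vectors`; terms missing from the file are simply skipped, which is an assumption.

    import numpy as np

    def semantic_relation(keyword, search_word_terms, kw_vectors):
        """search_word_terms: the terms obtained by segmenting the search word."""
        def cosine(u, v):
            return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))
        kv = kw_vectors[keyword]
        # sum of per-term cosine similarities = semantic relation value
        return sum(cosine(kv, kw_vectors[t]) for t in search_word_terms if t in kw_vectors)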
In an embodiment of the present invention, the mining unit 320 is adapted to perform word segmentation on the search word sequence set, and train the word-segmented search word sequence set with the deep learning toolkit word2vec to generate an N-dimensional keyword vector file.
Further, in an embodiment of the present invention, the mining unit 320 is further adapted to take the top sixth-preset-threshold number of keywords selected for each search word as the second-stage label system of that search word; for the second-stage label system of each search word, count the TF-IDF value, within the training corpus of the search word, of each keyword in the second-stage label system; for each keyword, take the product of its probability with respect to the search word and its TF-IDF value as the secondary corrected probability of the keyword with respect to the search word; and sort all keywords in the second-stage label system by their secondary corrected probability with respect to the search word in descending order and select the top K keywords to form the label system of the search word.
In an embodiment of the present invention, the mining unit 320 is adapted to obtain, from the query session log of the application search engine, the number of queries for the search term within a preset time period, and select the top K keywords to form the label system of the search word according to that number of queries, wherein the value of K is determined by a broken-line (piecewise linear) function of the number of queries for the search term.
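One way to realize K as a broken-line function of the query count is sketched below; the breakpoints and K values are purely illustrative assumptions, and the result is rounded to an integer.

    import numpy as np

    def choose_k(query_count):
        # breakpoints of the broken-line (piecewise linear) function - assumed values
        query_breakpoints = [0, 100, 1000, 10000]
        k_values          = [3, 5,   8,    12]
        # linear interpolation between breakpoints, clamped at both ends
        return int(round(np.interp(query_count, query_breakpoints, k_values)))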
Fig. 4 shows an application search server according to an embodiment of the present invention, the application search server 400 including:
a database construction unit 410 adapted to construct a search term tag database including a tag system of a plurality of search terms;
an interaction unit 420 adapted to receive a current search term uploaded by a client;
the search processing unit 430 is adapted to obtain a tag system of a current search term according to the search term tag database; calculating the degree of association between the label system of the current search term and the label system of each application;
the interaction unit 420 is further adapted to return the relevant information of an application to the client for displaying when the degree of association between the tag system of the current search term and the tag system of the application meets a preset condition;
The scheme by which the database construction unit 410 mines the tag system of a search term when constructing the search term tag database is the same as the scheme by which the apparatus 300 for identifying an application search intention according to any of the above embodiments of the present invention mines the tag system of a search term.
In an embodiment of the present invention, the search processing unit 430 is adapted to calculate the semantic similarity between the current search word and each search word in the search word tag database, sort the search words by semantic similarity in descending order, and select the top first-preset-threshold number of search words; and obtain the label system of the current search word according to the label systems of the selected search words.
In one embodiment of the present invention, the search processing unit 430 is adapted to calculate the Euclidean distance between the current search word and each search word in the search word tag database, and take the Euclidean distance between each search word and the current search word as the semantic similarity corresponding to that search word; take the semantic similarity corresponding to each selected search word as the weight of every label in that search word's label system; for the labels in the label systems of all selected search words, add up the weights of identical labels to obtain the final weight of each label; and sort the labels by final weight in descending order and select the top second-preset-threshold number of labels to form the label system of the current search word.
It should be noted that the embodiments of the apparatus shown in fig. 3-4 are the same as the embodiments of the method shown in fig. 1-2, and the detailed description is given above and will not be repeated herein.
In summary, the identification method and apparatus of an application search intention, the application search method, and the server provided by the present invention offer a tag-based user intention identification approach that is matched with the tag system of apps and flexibly expresses fine-grained user query intentions. The user intention tag system is built with unsupervised machine learning techniques, abandoning the traditional user intention classification approach and realizing an automatic user intention mining process that can generate user intention tag lists with high accuracy and recall. Because user intentions and apps are mapped into the same tag system, the problem of user intention identification is solved, the relevance computation problem of the application search engine is addressed, and a foundation is laid for function search, a core technology of the application search engine.
It should be noted that:
the algorithms and displays presented herein are not inherently related to any particular computer, virtual machine, or other apparatus. Various general purpose devices may be used with the teachings herein. The required structure for constructing such a device will be apparent from the description above. Moreover, the present invention is not directed to any particular programming language. It is appreciated that a variety of programming languages may be used to implement the teachings of the present invention as described herein, and any descriptions of specific languages are provided above to disclose the best mode of the invention.
In the description provided herein, numerous specific details are set forth. It is understood, however, that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description.
Similarly, it should be appreciated that in the foregoing description of exemplary embodiments of the invention, various features of the invention are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure and aiding in the understanding of one or more of the various inventive aspects. However, the disclosed method should not be interpreted as reflecting an intention that the claimed invention requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single foregoing disclosed embodiment. Thus, the claims following the detailed description are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of this invention.
Those skilled in the art will appreciate that the modules in the device in an embodiment may be adaptively changed and disposed in one or more devices different from the embodiment. The modules or units or components of the embodiments may be combined into one module or unit or component, and furthermore they may be divided into a plurality of sub-modules or sub-units or sub-components. All of the features disclosed in this specification (including any accompanying claims, abstract and drawings), and all of the processes or elements of any method or apparatus so disclosed, may be combined in any combination, except combinations where at least some of such features and/or processes or elements are mutually exclusive. Each feature disclosed in this specification (including any accompanying claims, abstract and drawings) may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise.
Furthermore, those skilled in the art will appreciate that while some embodiments described herein include some features included in other embodiments, rather than other features, combinations of features of different embodiments are meant to be within the scope of the invention and form different embodiments. For example, in the following claims, any of the claimed embodiments may be used in any combination.
The various component embodiments of the invention may be implemented in hardware, or in software modules running on one or more processors, or in a combination thereof. It will be appreciated by those skilled in the art that a microprocessor or Digital Signal Processor (DSP) may be used in practice to implement some or all of the functions of some or all of the components of the application search intent recognition apparatus and application search server according to embodiments of the present invention. The present invention may also be embodied as apparatus or device programs (e.g., computer programs and computer program products) for performing a portion or all of the methods described herein. Such programs implementing the present invention may be stored on computer-readable media or may be in the form of one or more signals. Such a signal may be downloaded from an internet website or provided on a carrier signal or in any other form.
It should be noted that the above-mentioned embodiments illustrate rather than limit the invention, and that those skilled in the art will be able to design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention may be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In the unit claims enumerating several means, several of these means may be embodied by one and the same item of hardware. The use of the words first, second, third, and so on does not indicate any ordering; these words may be interpreted as names.