Disclosure of Invention
In view of the above problems, the present invention has been made to provide an identification method, apparatus, application search method, and server of an application search intention that overcome the above problems or at least partially solve the above problems.
According to an aspect of the present invention, there is provided an identification method of an application search intention, the method including:
obtaining search terms in each query session from a query session log of an application search engine;
mining a label system of each search term according to the search terms in each query session and a preset strategy;
and identifying the application search intention corresponding to each search term according to the label system of that search term.
Optionally, mining a label system of each search term according to the search term in each query session and a preset policy includes:
obtaining a training corpus set according to search terms in each query session;
inputting the training corpus set into an LDA model for training to obtain a search term-topic probability distribution result and a topic-keyword probability distribution result output by the LDA model;
and calculating a label system of each search term according to the search term-topic probability distribution result and the topic-keyword probability distribution result.
Optionally, the obtaining a training corpus set according to the search terms in each query session includes:
obtaining an original corpus of each search term according to the search terms in each query session;
forming an original corpus set from the original corpora of the search terms; and preprocessing the original corpus set to obtain the training corpus set.
Optionally, the obtaining the original corpus of each search term according to the search terms in each query session includes:
obtaining a search term sequence set corresponding to a plurality of query sessions according to the search terms in each query session; obtaining a search term set corresponding to the plurality of query sessions;
training the search term sequence set to obtain an N-dimensional search term vector file;
for each search term in the search term set, calculating the degree of association between the search term and each other search term according to the N-dimensional search term vector file; and taking the other search terms whose degrees of association with the search term meet a preset condition as the original corpus of the search term.
Optionally, the obtaining the search word sequence sets corresponding to the plurality of query sessions includes:
for each query session, arranging the search terms in the query session into a sequence in order; if a search term in the sequence corresponds to an application download operation, inserting the name of the downloaded application into the sequence immediately after that search term; and obtaining a search term sequence corresponding to the query session;
the obtaining a set of search terms corresponding to a plurality of query sessions comprises: and taking the set of search terms in the plurality of query sessions as the set of search terms corresponding to the plurality of query sessions.
Optionally, training the search word sequence set to obtain an N-dimensional search word vector file includes:
and taking each search word in the search word sequence set as a word, and training the search word sequence set by using a deep learning tool kit word2vec to generate an N-dimensional search word vector file.
Optionally, the calculating, for each search term in the search term set, a degree of association between the search term and each other search term according to the N-dimensional search term vector file, and taking the other search terms whose degrees of association with the search term meet a preset condition as the original corpus of the search term, includes:
performing an operation on the search term set and the N-dimensional search term vector file by using a KNN algorithm, and calculating the distance between every two search terms in the search term set according to the N-dimensional search term vector file;
and for each search term in the search term set, sorting the other search terms in ascending order of their distance from the search term, and selecting a first preset threshold number of nearest search terms as the original corpus of the search term.
Optionally, the preprocessing the original corpus set includes:
in the original corpus set, for each original corpus, performing word segmentation processing on the original corpus to obtain a word segmentation result containing a plurality of terms; searching for phrases formed by adjacent terms in the word segmentation result; and retaining the phrases, the terms that are nouns, and the terms that are verbs in the word segmentation result as the keywords retained for the original corpus.
Optionally, the searching for phrases composed of adjacent terms in the word segmentation result includes:
and calculating the cPMId value of every two adjacent terms in the word segmentation result, and determining that two adjacent terms form a phrase when their cPMId value is greater than a second preset threshold.
Optionally, the preprocessing the original corpus set further includes:
using the keywords retained for the original corpus of each search term as the first-stage training corpus of the search term;
forming a first-stage training corpus set from the first-stage training corpora of the search terms; and performing data cleaning on the keywords in the first-stage training corpus set.
Optionally, the performing data cleaning on the keywords in the first-stage corpus set includes:
in the first-stage corpus set,
calculating a TF-IDF value of each keyword in a first-stage training corpus of each search word; deleting the key words with the TF-IDF value higher than a third preset threshold value and/or lower than a fourth preset threshold value to obtain a training corpus of the search word;
the training corpora of the search terms constitute the training corpus set.
Optionally, the calculating, according to the search term-topic probability distribution result and the topic-keyword probability distribution result, a tag system of each search term includes:
calculating to obtain a search word-keyword probability distribution result according to the search word-topic probability distribution result and the topic-keyword probability distribution result;
and according to the search term-keyword probability distribution result, for each search term, sorting the keywords in descending order of their probability with respect to the search term, and selecting a fifth preset threshold number of top-ranked keywords.
Optionally, the calculating a search term-keyword probability distribution result according to the search term-topic probability distribution result and the topic-keyword probability distribution result includes:
for each search word, obtaining the probability of each topic about the search word according to the search word-topic probability distribution result;
for each topic, obtaining the probability of each keyword about the topic according to the topic-keyword probability distribution result;
for each keyword, taking the product of the probability of the keyword with respect to a topic and the probability of that topic with respect to a search term as the topic-based probability of the keyword with respect to the search term; and taking the sum of the topic-based probabilities of the keyword over all topics as the probability of the keyword with respect to the search term.
Optionally, the step of obtaining a tag system of each search term by calculation according to the search term-topic probability distribution result and the topic-keyword probability distribution result further includes:
taking the fifth preset threshold number of keywords selected for each search term as a first-stage label system of the search term;
for the first-stage label system of each search term, calculating a semantic relation value between each keyword in the first-stage label system and the search term; for each keyword, taking the product of its semantic relation value and its probability with respect to the search term as the corrected probability of the keyword with respect to the search term; and sorting all the keywords in the first-stage label system in descending order of corrected probability, and selecting a sixth preset threshold number of top-ranked keywords to form the label system of the search term.
Optionally, calculating a semantic relationship value between each keyword in the first-stage tagging system of the search term and the search term includes:
obtaining a search word sequence set corresponding to a plurality of query sessions according to search words in each query session; training the search word sequence set to obtain an N-dimensional keyword vector file;
calculating word vectors of the keywords according to the N-dimensional keyword vector files, and calculating the word vectors of each term in the search words;
calculating cosine similarity between the word vector of the keyword and the word vector of each term, and taking the cosine similarity as a semantic relation value of the keyword and the corresponding term;
and taking the sum of the semantic relation values of the keyword and each term as the semantic relation value between the keyword and the search word.
Optionally, the training the search word sequence set to obtain an N-dimensional keyword vector file includes:
and performing word segmentation processing on the search word sequence set, and training the search word sequence set subjected to word segmentation processing by using a deep learning tool package word2vec to generate an N-dimensional keyword vector file.
Optionally, the step of obtaining a tag system of each search term by calculation according to the search term-topic probability distribution result and the topic-keyword probability distribution result further includes:
taking the sixth preset threshold number of keywords selected for each search term as a second-stage label system of the search term;
for the second-stage label system of each search term, counting the TF-IDF value, in the training corpus of the search term, of each keyword in the second-stage label system; for each keyword, taking the product of its probability with respect to the search term and its TF-IDF value as the secondary corrected probability of the keyword with respect to the search term; and sorting all the keywords in the second-stage label system in descending order of secondary corrected probability, and selecting the first K keywords to form the label system of the search term.
Optionally, the selecting of the first K keywords to form the label system of the search term includes:
acquiring the query times of the search terms in a preset time period from a query session log of an application search engine;
selecting the first K keywords to form the label system of the search term according to the number of queries; wherein the value of K is a piecewise linear (broken-line) function of the number of queries corresponding to the search term.
According to another aspect of the present invention, there is provided an application search method, including:
constructing a search word tag database, wherein the search word tag database comprises a tag system of a plurality of search words;
receiving a current search word uploaded by a client, and acquiring a tag system of the current search word according to the search word tag database;
calculating the degree of association between the label system of the current search term and the label system of each application;
when the correlation degree between the label system of the current search word and the label system of one application meets a preset condition, returning the relevant information of the application to the client for displaying;
wherein the search term tag database is constructed by the method according to any one of the above aspects of the invention.
Optionally, the obtaining a tag system of the current search term according to the search term tag database includes:
calculating the semantic similarity between the current search term and each search term in the search term tag database, sorting the search terms in descending order of semantic similarity, and selecting a first preset threshold number of search terms;
and obtaining the label system of the current search word according to the label system of each selected search word.
Optionally, the calculating semantic similarity between the current search word and each search word in the search word tag database includes: calculating the Euclidean distance between the current search word and each search word in the search word tag database, and taking the Euclidean distance between each search word and the current search word as the semantic similarity corresponding to the search word;
the obtaining of the label system of the current search term according to the label system of each selected search term includes: taking the semantic similarity corresponding to each search term as the weight of each label in the label system of that search term; for the labels in the label systems of the selected search terms, adding up the weights of identical labels to obtain the final weight of each label; and sorting the labels in descending order of final weight, and selecting a second preset threshold number of top-ranked labels to form the label system of the current search term.
According to another aspect of the present invention, there is provided an apparatus for identifying an application search intention, the apparatus including:
the acquisition unit is suitable for acquiring search terms in each query session from a query session log of an application search engine;
the mining unit is suitable for mining a label system of each search term according to the search term in each query session and a preset strategy;
and the identification unit is suitable for identifying the application search intention corresponding to each search word according to the label system of the search word.
Optionally, the mining unit is adapted to obtain a training corpus set according to the search terms in each query session; input the training corpus set into an LDA model for training to obtain a search term-topic probability distribution result and a topic-keyword probability distribution result output by the LDA model; and calculate a label system of each search term according to the search term-topic probability distribution result and the topic-keyword probability distribution result.
Optionally, the mining unit is adapted to obtain an original corpus of each search term according to the search terms in each query session; form an original corpus set from the original corpora of the search terms; and preprocess the original corpus set to obtain the training corpus set.
Optionally, the mining unit is adapted to obtain a search term sequence set corresponding to a plurality of query sessions according to the search terms in each query session; obtain a search term set corresponding to the plurality of query sessions; train the search term sequence set to obtain an N-dimensional search term vector file; for each search term in the search term set, calculate the degree of association between the search term and each other search term according to the N-dimensional search term vector file; and take the other search terms whose degrees of association with the search term meet a preset condition as the original corpus of the search term.
Optionally, the mining unit is adapted to, for each query session, arrange the search terms in the query session into a sequence in order; if a search term in the sequence corresponds to an application download operation, insert the name of the downloaded application into the sequence immediately after that search term; obtain a search term sequence corresponding to the query session; and take the set of search terms in the plurality of query sessions as the search term set corresponding to the plurality of query sessions.
Optionally, the mining unit is adapted to train the search word sequence set by using a deep learning tool package word2vec to generate an N-dimensional search word vector file, where each search word in the search word sequence set is used as a word.
Optionally, the mining unit is adapted to perform an operation on the search term set and the N-dimensional search term vector file by using a KNN algorithm, and calculate the distance between every two search terms in the search term set according to the N-dimensional search term vector file; and for each search term in the search term set, sort the other search terms in ascending order of their distance from the search term and select a first preset threshold number of nearest search terms as the original corpus of the search term.
Optionally, the mining unit is adapted to perform word segmentation processing on each original corpus in the original corpus set to obtain a word segmentation result containing a plurality of terms; search for phrases formed by adjacent terms in the word segmentation result; and retain the phrases, the terms that are nouns, and the terms that are verbs in the word segmentation result as the keywords retained for the original corpus.
Optionally, the mining unit is adapted to calculate the cPMId value of every two adjacent terms in the word segmentation result, and determine that two adjacent terms form a phrase when their cPMId value is greater than a second preset threshold.
Optionally, the mining unit is further adapted to use the keywords retained for the original corpus of each search term as the first-stage training corpus of the search term; form a first-stage training corpus set from the first-stage training corpora of the search terms; and perform data cleaning on the keywords in the first-stage training corpus set.
Optionally, the mining unit is adapted to calculate, in the first-stage training corpus set, for the first-stage training corpus of each search term, the TF-IDF value of each keyword in that corpus; delete the keywords whose TF-IDF value is higher than a third preset threshold and/or lower than a fourth preset threshold to obtain the training corpus of the search term; the training corpora of the search terms constitute the training corpus set.
Optionally, the mining unit is adapted to calculate a search term-keyword probability distribution result according to the search term-topic probability distribution result and the topic-keyword probability distribution result; and according to the search term-keyword probability distribution result, for each search term, sort the keywords in descending order of their probability with respect to the search term and select a fifth preset threshold number of top-ranked keywords.
Optionally, the mining unit is adapted to, for each search term, obtain the probability of each topic with respect to the search term according to the search term-topic probability distribution result; for each topic, obtain the probability of each keyword with respect to the topic according to the topic-keyword probability distribution result; for each keyword, take the product of the probability of the keyword with respect to a topic and the probability of that topic with respect to a search term as the topic-based probability of the keyword with respect to the search term; and take the sum of the topic-based probabilities of the keyword over all topics as the probability of the keyword with respect to the search term.
Optionally, the mining unit is further adapted to take the fifth preset threshold number of keywords selected for each search term as a first-stage label system of the search term; for the first-stage label system of each search term, calculate a semantic relation value between each keyword in the first-stage label system and the search term; for each keyword, take the product of its semantic relation value and its probability with respect to the search term as the corrected probability of the keyword with respect to the search term; and sort all the keywords in the first-stage label system in descending order of corrected probability and select a sixth preset threshold number of top-ranked keywords to form the label system of the search term.
Optionally, the mining unit is adapted to obtain a search word sequence set corresponding to a plurality of query sessions according to search words in each query session; training the search word sequence set to obtain an N-dimensional keyword vector file; calculating word vectors of the keywords according to the N-dimensional keyword vector files, and calculating the word vectors of each term in the search words; calculating cosine similarity between the word vector of the keyword and the word vector of each term, and taking the cosine similarity as a semantic relation value of the keyword and the corresponding term; and taking the sum of the semantic relation values of the keyword and each term as the semantic relation value between the keyword and the search word.
Optionally, the mining unit is adapted to perform word segmentation on the search word sequence set, and train the search word sequence set after word segmentation by using a deep learning tool package word2vec to generate an N-dimensional keyword vector file.
Optionally, the mining unit is further adapted to take the sixth preset threshold number of keywords selected for each search term as a second-stage label system of the search term; for the second-stage label system of each search term, count the TF-IDF value, in the training corpus of the search term, of each keyword in the second-stage label system; for each keyword, take the product of its probability with respect to the search term and its TF-IDF value as the secondary corrected probability of the keyword with respect to the search term; and sort all the keywords in the second-stage label system in descending order of secondary corrected probability and select the first K keywords to form the label system of the search term.
Optionally, the mining unit is adapted to obtain, from a query session log of an application search engine, the number of queries of the search term within a preset time period; and select the first K keywords to form the label system of the search term according to the number of queries; wherein the value of K is a piecewise linear (broken-line) function of the number of queries corresponding to the search term.
According to still another aspect of the present invention, there is provided an application search server, including:
the database construction unit is suitable for constructing a search word tag database which comprises a plurality of tag systems of search words;
the interaction unit is suitable for receiving the current search terms uploaded by the client;
the search processing unit is suitable for acquiring a label system of the current search word according to the search word label database; calculating the degree of association between the label system of the current search term and the label system of each application;
the interaction unit is also suitable for returning the relevant information of the application to the client side for displaying when the association degree between the label system of the current search word and the label system of the application meets the preset condition;
wherein the process by which the database construction unit constructs the search term tag database is the same as that of the apparatus for identifying an application search intention according to any one of claims 22 to 39.
Optionally, the search processing unit is adapted to calculate semantic similarities between the current search word and the search words in the search word tag database, sort the search words according to the semantic similarities from large to small, and select a first preset threshold number of search words; and obtaining the label system of the current search word according to the label system of each selected search word.
Optionally, the search processing unit is adapted to calculate the Euclidean distance between the current search term and each search term in the search term tag database, and take the Euclidean distance between each search term and the current search term as the semantic similarity corresponding to that search term; take the semantic similarity corresponding to each search term as the weight of each label in the label system of that search term; for the labels in the label systems of the selected search terms, add up the weights of identical labels to obtain the final weight of each label; and sort the labels in descending order of final weight and select a second preset threshold number of top-ranked labels to form the label system of the current search term.
According to the solution of the invention, a user intention identification method is provided, namely a tagging method matched with the app tag system. The tag system corresponding to each search term is mined flexibly, effectively and accurately, and a search term tag database is established, so that a search term input by the user can be accurately described by its tag system and the problem of user intention identification is solved. Further, the user intention and the app can be mapped into the same tag system, and more accurate application search results can be obtained by matching them during search. The solution therefore addresses both user intention identification and relevance calculation for the application search engine, and lays a foundation for function search, a core technology of application search engines.
The foregoing description is only an overview of the technical solutions of the present invention, and the embodiments of the present invention are described below in order to make the technical means of the present invention more clearly understood and to make the above and other objects, features, and advantages of the present invention more clearly understandable.
Detailed Description
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
Hereinafter, "app" denotes an application, "query" denotes a search term, "tag" denotes a label, and "session" denotes a query session.
The invention provides a new user intention identification method for an application search engine. A tagging method is used to express the fine-grained query intention of the user flexibly and effectively, and a tag system of user intentions is constructed based on unsupervised machine learning, abandoning the traditional classification-based approach to user intention. The result is an automated user intention mining pipeline that generates a user intention tag list with high accuracy and recall, maps the user's queries and apps into a common tag system, and thereby addresses both user intention identification and relevance calculation for the application search engine, with very good results.
FIG. 1 shows a flow diagram of a method for identifying application search intent, according to one embodiment of the invention. As shown in fig. 1, the method includes:
step S110, obtaining search terms in each query session from a query session log of an application search engine;
step S120, mining a label system of each search term according to the search terms in each query session and a preset strategy;
step S130, identifying the application search intention corresponding to each search word according to the label system of the search word.
Traditional user intention identification relies on classification schemes designed for web pages and is not well suited to the app scenario: each application belongs to a fixed domain and provides a specific function, so mining the user's fine-grained functional needs with tags is appropriate, whereas classification-based methods are too coarse-grained. The present solution provides a user intention identification method, a tagging method matched with the application tag system, which is flexible and effective, maps the user intention and the application into the same tag system, solves the problem of user intention identification, solves the problem of relevance calculation for the application search engine, and is the basis for realizing function search, a core technology of the application search engine.
In general, a user's search term is a short text; the features that can be constructed from it are sparse and cannot comprehensively describe the user's need. However, if a user is looking for an app with a single functional scenario within a short period of time, the query terms are usually rewritten around that single requirement, so there is a strong semantic relationship between the queries issued in the session. This is an important characteristic of an application search engine.
In a search engine service, the system automatically records information related to each user search and stores it in a query log. For example, a user opens a Baidu search page, successively inputs search terms such as "game", "game software", "funny game" and "game application download", enters result pages, and possibly continues to input further search terms, until the user completes the search and closes the Baidu search page. The whole process is called a query session.
In an embodiment of the present invention, the step S120 of mining a label system of each search term according to the search terms in each query session and a preset strategy includes: obtaining a training corpus set according to the search terms in each query session; inputting the training corpus set into an LDA model for training to obtain a search term-topic probability distribution result and a topic-keyword probability distribution result output by the LDA model; and calculating a label system of each search term according to the search term-topic probability distribution result and the topic-keyword probability distribution result.
The technical difficulty in obtaining the training corpus is expanding the short query text into a long text so that each query can be regarded as a document; this is the key to using the LDA topic model effectively and to generating intention tags with high accuracy and high recall. Intention tags are divided into category tags, which reflect the application domain of the user's need, and function tags, which reflect the user's specific need.
Wherein the obtaining the training corpus set according to the search terms in each query session includes:
obtaining an original corpus of each search term according to the search terms in each query session; forming an original corpus set from the original corpora of the search terms; and preprocessing the original corpus set to obtain a training corpus set. Specifically, the obtaining the original corpus of each search term according to the search terms in each query session includes: obtaining a search term sequence set corresponding to a plurality of query sessions according to the search terms in each query session; and obtaining a search term set corresponding to the plurality of query sessions.
Concretely, the query search term sequence within each query session is kept, each search term being treated as a whole; if the user downloads some apps under a certain query, the app names are spliced into the sequence right after that query. For example, a user's session sequence is query1, query2, query3, and the user downloads app1 after entering query2; app1 is then spliced after query2 and before query3, giving query1, query2, app1, query3. Each session sequence is one line and is output to a file session_query-app_list, and all queries are output to another file query_all.
Training the search term sequence set yields an N-dimensional search term vector file; for each search term in the search term set, the degree of association between the search term and each other search term is calculated according to the N-dimensional search term vector file; and the other search terms whose degrees of association with the search term meet a preset condition are taken as the original corpus of the search term.
In an embodiment of the present invention, the obtaining the search term sequence set corresponding to the plurality of query sessions includes: for each query session, arranging the search terms in the query session into a sequence in order; if a search term in the sequence corresponds to an application download operation, inserting the name of the downloaded application into the sequence immediately after that search term; and obtaining a search term sequence corresponding to the query session. The obtaining the search term set corresponding to the plurality of query sessions includes: taking the set of search terms in the plurality of query sessions as the search term set corresponding to the plurality of query sessions.
For example, a user enters "search term 1", "search term 2" and "search term 3" in sequence in a query session, and downloads app1 after entering "search term 2". The search term sequence corresponding to this query session is therefore: search term 1, search term 2, app1, search term 3. The search term sequence of each query session forms one row, and the sequences of the plurality of query sessions together form the search term sequence set, as sketched below.
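As an illustration only, the following Python sketch builds such search term sequences from simplified session records; the record format and field names used here are assumptions made for the example, not the format of an actual query session log.

```python
# Minimal sketch: build search term sequences from simplified query sessions.
# A session is assumed to be a list of (search_term, downloaded_app_or_None) events.

def build_sequence(session_events):
    """session_events: list of (search_term, downloaded_app or None)."""
    sequence = []
    for term, downloaded_app in session_events:
        sequence.append(term)
        if downloaded_app:  # insert the app name right after the triggering term
            sequence.append(downloaded_app)
    return sequence

sessions = [
    [("search term 1", None), ("search term 2", "app1"), ("search term 3", None)],
]
sequences = [build_sequence(s) for s in sessions]        # search term sequence set
search_term_set = {t for s in sessions for t, _ in s}    # search term set
print(sequences)  # [['search term 1', 'search term 2', 'app1', 'search term 3']]
```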
The training the search term sequence set to obtain the N-dimensional search term vector file includes: taking each search term in the search term sequence set as a single word, and training the search term sequence set with the deep learning toolkit word2vec to generate the N-dimensional search term vector file. For example, training with word2vec generates 300-dimensional query vectors and produces a query vector file query_w2v_300.dict, i.e. the search term vector file.
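A minimal training sketch follows. It uses the gensim implementation of word2vec purely for illustration (the text only states that the word2vec toolkit is used); the 300 dimensions follow the example above, while the window size and other hyperparameters are assumptions.

```python
# Sketch: train 300-dimensional search term vectors with word2vec (gensim >= 4.0).
# Each whole search term (or app name) is treated as one token.
from gensim.models import Word2Vec

sequences = [
    ["search term 1", "search term 2", "app1", "search term 3"],
    # ... one list per query session
]

model = Word2Vec(
    sentences=sequences,
    vector_size=300,   # N = 300, as in the example above
    window=5,          # assumed context window
    min_count=1,       # keep rare queries in this toy example
    sg=1,              # skip-gram; an assumption, not stated in the text
)
model.wv.save_word2vec_format("query_w2v_300.dict")  # search term vector file
print(model.wv["search term 2"].shape)               # (300,)
```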
In practice, when searching for a desired application, the user may input search terms in various forms: a noun (e.g., "game"), a phrase (e.g., "funny game"), or a sentence (e.g., "I want to download a funny game").
In an embodiment of the present invention, the search term vector file obtained above is used as the basis for computing a vector for each search term in the search term set. For each search term in the search term set, the degree of association between the search term and each other search term is calculated according to the N-dimensional search term vector file, and the other search terms whose degrees of association with the search term meet a preset condition are taken as the original corpus of the search term. Specifically, this includes:
performing an operation on the search term set and the N-dimensional search term vector file by using a KNN algorithm, and calculating the distance between every two search terms in the search term set according to the N-dimensional search term vector file; and for each search term in the search term set, sorting the other search terms in ascending order of their distance from the search term and selecting a first preset threshold number of nearest search terms as the original corpus of that search term.
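The nearest-neighbor expansion can be sketched as follows, using scikit-learn's NearestNeighbors as one possible KNN implementation (the text does not name a specific library); the Euclidean metric matches Table 1, and the value 10 for the first preset threshold follows the example below.

```python
# Sketch: expand each query into an "original corpus" made of its nearest queries.
import numpy as np
from sklearn.neighbors import NearestNeighbors

queries = ["query_a", "query_b", "query_c", "query_d"]   # search term set
vectors = np.random.rand(len(queries), 300)              # stand-in for the 300-d query vectors

k = min(10 + 1, len(queries))                            # +1 because each query is its own nearest point
knn = NearestNeighbors(n_neighbors=k, metric="euclidean")
knn.fit(vectors)
distances, indices = knn.kneighbors(vectors)             # row i: neighbors of queries[i], nearest first

original_corpus = {
    q: [queries[j] for j in idx_row[1:]]                 # drop the query itself (distance 0)
    for q, idx_row in zip(queries, indices)
}
print(original_corpus["query_a"])
```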
Table 1 shows the top 10 nearest neighbors of the search term "Sogou" in an embodiment of the present invention. The nearest neighbors include both search terms and app names, such as "Sogou mobile phone input method" and "Sogou input method" in the first column of Table 1. In this example the first preset threshold is 10, and the second column of Table 1 gives the statistics (including the Euclidean distance) between each nearest neighbor and the search term "Sogou".
TABLE 1
Nearest neighbor | Statistical indices (based on Euclidean distance)
Sogou mobile phone input method | 38 303.827 0.838104
Sogou input method | 26 323.494 0.845153
Sogou | 20332 372.525 0.778589
Dog collecting device | 6986 385.809 0.76965
Sogou Pinyin | 14577 410.986 0.753037
Sogou input method (Xiaomi edition) | 4042 423.929 0.746941
Sogou phonetic input method | 4927 435.273 0.736172
Sohu input method | 18233 452.955 0.724872
Sogou input | 10274 455.505 0.720034
Mobile phone Sogou input method | 3075 476.93 0.721099
Table 2 shows the top 10 nearest neighbors of the search term "lottery draw inquiry" in an embodiment of the present invention; the meaning of each column is the same as in Table 1 and is not described again.
TABLE 2
In an embodiment of the present invention, after obtaining an original corpus set corresponding to each search term, the preprocessing the original corpus set includes:
in the original corpus set, performing word segmentation processing on each original corpus to obtain a word segmentation result containing a plurality of terms; searching for phrases formed by adjacent terms in the word segmentation result; and retaining the phrases, the terms that are nouns, and the terms that are verbs in the word segmentation result as the keywords retained for that original corpus.
For example, if a user inputs the search term "download game", the noun term of the search term is "game" and the verb term is "download".
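A minimal segmentation-and-filtering sketch is given below. It uses the jieba segmenter and its part-of-speech tags purely as one possible tool for Chinese word segmentation; the text does not name a specific segmenter, and the tag prefixes ("n" for nouns, "v" for verbs) are jieba conventions.

```python
# Sketch: segment one original corpus entry and keep only its nouns and verbs
# (phrase detection is handled separately; see the cPMId step below).
import jieba.posseg as pseg

def keep_nouns_and_verbs(text):
    kept = []
    for word, flag in pseg.cut(text):
        # jieba POS tags: flags starting with 'n' are nouns, with 'v' are verbs
        if flag.startswith("n") or flag.startswith("v"):
            kept.append(word)
    return kept

print(keep_nouns_and_verbs("下载游戏"))  # e.g. ['下载', '游戏'] ("download", "game")
```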
Wherein the searching for phrases composed of adjacent terms in the word segmentation result comprises:
calculating the cPMId value of every two adjacent terms in the word segmentation result, and determining that two adjacent terms form a phrase when their cPMId value is greater than a second preset threshold.
Equation 1 shows how the cPMId value is calculated, where D(x, y) denotes the co-occurrence frequency of two terms x and y, D(x) denotes the occurrence frequency of term x, D(y) denotes the occurrence frequency of term y, D denotes the total number of apps, and δ is set to 0.7.
Equation 1
For example, the term pairs are sorted in descending order of cPMId, and term combinations with cPMId above the threshold 5 are selected as phrases; the phrases are combined with the verbs and nouns retained above to generate a new file query_corp_seg_nouns_verbs_phrase.
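Since the body of Equation 1 is not reproduced here, the sketch below uses a generic discounted PMI-style co-occurrence score built from the quantities defined above (D(x, y), D(x), D(y), D and the discount δ); the exact functional form is an assumption for illustration and may differ from the cPMId formula of Equation 1.

```python
# Sketch: score adjacent term pairs with a discounted PMI-style statistic and
# keep high-scoring pairs as phrases. The formula below is illustrative only.
import math
from collections import Counter

segmented_docs = [
    ["funny", "game", "download"],
    ["funny", "game", "application"],
    ["game", "download"],
]

delta = 0.7        # discount, as stated in the text
threshold = -1.0   # stands in for the "second preset threshold"; value is arbitrary

term_count = Counter(t for doc in segmented_docs for t in doc)                 # D(x)
pair_count = Counter(p for doc in segmented_docs for p in zip(doc, doc[1:]))   # D(x, y)
D = len(segmented_docs)                                                        # corpus size

def score(x, y):
    # Discounted PMI-like score; an assumed stand-in for cPMId.
    return math.log((pair_count[(x, y)] - delta) * D /
                    (term_count[x] * term_count[y]))

phrases = [pair for pair in pair_count if score(*pair) > threshold]
print(phrases)
```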
Further, in an embodiment of the present invention, the preprocessing the original corpus set further includes: using the keywords retained for the original corpus of each search term as the first-stage training corpus of the search term; forming a first-stage training corpus set from the first-stage training corpora of the search terms; and performing data cleaning on the keywords in the first-stage training corpus set.
Specifically, the data cleaning of the keywords in the first-stage training corpus set includes: for the first-stage training corpus of each search term, calculating the TF-IDF value of each keyword in that corpus; deleting the keywords whose TF-IDF value is higher than a third preset threshold and/or lower than a fourth preset threshold to obtain the training corpus of the search term; the training corpora of the search terms constitute the training corpus set.
The purpose of this step is to find non-tag words in the first-stage training corpus set and clean them out. A term that occurs with very high or very low frequency is unlikely to be a tag. Using the TF-IDF statistic, the tf-idf weight of each term and phrase is computed over the first-stage training corpus set; terms or phrases above one threshold or below another are treated as non-tag words (the thresholds depend on the specific corpus, so no concrete values are listed here). The non-tag words are written to a blacklist black_tag.list, the non-tag words are filtered out of the first-stage training corpus set, and a new training corpus set is generated in the format: search term_id \t term 1 term 2 ... term n.
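One way to implement this cleaning step is sketched below with scikit-learn's TfidfVectorizer; the library choice and the concrete threshold values are assumptions for illustration, since the thresholds are stated to be corpus-dependent.

```python
# Sketch: drop keywords whose tf-idf weight in a query's first-stage corpus is
# too high or too low, keeping the rest as the query's training corpus.
from sklearn.feature_extraction.text import TfidfVectorizer

# One space-joined document per search term (its first-stage training corpus).
docs = {
    "funny game": "game funny download game application",
    "weather":    "weather forecast city weather application",
}

vectorizer = TfidfVectorizer(token_pattern=r"\S+")
matrix = vectorizer.fit_transform(docs.values()).toarray()
terms = vectorizer.get_feature_names_out()

HIGH, LOW = 0.9, 0.1   # third / fourth preset thresholds (assumed values)
cleaned = {}
for (query, text), row in zip(docs.items(), matrix):
    weight = dict(zip(terms, row))
    cleaned[query] = [t for t in text.split() if LOW <= weight.get(t, 0.0) <= HIGH]

print(cleaned)
```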
Table 3 shows some of the non-tag words discarded during data cleaning; such words occur with either too high or too low a frequency and are not meaningful for user search.
TABLE 3
After the training corpus set is obtained, the GibbsLDA++ implementation of the LDA model is used. The GibbsLDA++ source code is modified so that identical terms in the query corpus are initialized to the same topic; in the original code, every term is randomly initialized to a topic, so repeated occurrences of the same term can be initialized to different topics. For example, LDA training uses 120 topics and 300 iterations, and outputs two pieces of data: a topic-term probability distribution and a document-topic probability distribution.
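The embodiment above uses a modified GibbsLDA++; the gensim sketch below is only meant to illustrate obtaining the two output distributions (document-topic and topic-term) and does not reproduce the modified topic initialization.

```python
# Sketch: train LDA and read out document-topic and topic-term distributions.
from gensim import corpora
from gensim.models import LdaModel

texts = [
    ["game", "funny", "download"],
    ["weather", "forecast", "city"],
    ["game", "application", "download"],
]  # each list stands in for the cleaned training corpus of one search term

dictionary = corpora.Dictionary(texts)
bow_corpus = [dictionary.doc2bow(t) for t in texts]

lda = LdaModel(corpus=bow_corpus, id2word=dictionary,
               num_topics=120, iterations=300,   # 120 topics, 300 iterations, as above
               random_state=0)

doc_topics = lda.get_document_topics(bow_corpus[0], minimum_probability=0.0)  # P(topic | query)
topic_terms = lda.get_topic_terms(0, topn=10)                                 # P(term | topic)
print(doc_topics[:3], topic_terms[:3], sep="\n")
```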
According to the scheme, a label system of each search word is calculated according to the search word-topic probability distribution result and the topic-keyword probability distribution result, and the method comprises the following steps:
calculating a search term-keyword probability distribution result according to the search term-topic probability distribution result and the topic-keyword probability distribution result; and according to the search term-keyword probability distribution result, for each search term, sorting the keywords in descending order of their probability with respect to the search term and selecting a fifth preset threshold number of top-ranked keywords.
Wherein, the calculating the search term-keyword probability distribution result according to the search term-topic probability distribution result and the topic-keyword probability distribution result includes: for each search term, obtaining the probability of each topic with respect to the search term from the search term-topic probability distribution result; for each topic, obtaining the probability of each keyword with respect to the topic from the topic-keyword probability distribution result; for each keyword, taking the product of the probability of the keyword with respect to a topic and the probability of that topic with respect to a search term as the topic-based probability of the keyword with respect to the search term; and taking the sum of the topic-based probabilities of the keyword over all topics as the probability of the keyword with respect to the search term.
This step is the initial LDA tag generation, which yields the tags produced by LDA. LDA outputs a topic probability distribution for each query and a term probability distribution for each topic. To obtain the tags of each query, both distributions are sorted in descending order of probability, the top 50 topics under each query and the top 120 terms under each topic are selected, and the term probabilities are weighted by the topic probabilities, so that each tag term carries an LDA weight expressing its importance for the query; sorting by this weight in descending order gives the tag list generated by LDA. This list still contains a lot of noise and the ordering of the tags is not yet accurate.
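In matrix form, this combination step amounts to multiplying the document-topic distribution by the topic-term distribution; the numpy sketch below illustrates it, with the truncation to the top 50 topics and top 120 terms taken from the example above and toy matrix sizes assumed.

```python
# Sketch: P(keyword | query) = sum over topics of P(topic | query) * P(keyword | topic),
# restricted to the top topics per query and the top terms per topic.
import numpy as np

rng = np.random.default_rng(0)
n_queries, n_topics, n_terms = 4, 120, 500                   # toy sizes
theta = rng.dirichlet(np.ones(n_topics), size=n_queries)     # P(topic | query)
phi = rng.dirichlet(np.ones(n_terms), size=n_topics)         # P(term  | topic)

TOP_TOPICS, TOP_TERMS = 50, 120                              # as in the example above

def query_keyword_distribution(q):
    top_topics = np.argsort(theta[q])[::-1][:TOP_TOPICS]
    scores = np.zeros(n_terms)
    for t in top_topics:
        top_terms = np.argsort(phi[t])[::-1][:TOP_TERMS]
        scores[top_terms] += theta[q, t] * phi[t, top_terms]  # weighted accumulation
    return scores

lda_weights = query_keyword_distribution(0)
initial_tag_list = np.argsort(lda_weights)[::-1][:10]         # initial LDA tag ids for query 0
print(initial_tag_list)
```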
Further, to fine-tune the prediction result of the LDA model so that the important tags of each query move forward in the ranking, in an embodiment of the present invention, the calculating the label system of each search term according to the search term-topic probability distribution result and the topic-keyword probability distribution result further includes: taking the fifth preset threshold number of keywords selected for each search term as a first-stage label system of the search term; for the first-stage label system of each search term, calculating a semantic relation value between each keyword in the first-stage label system and the search term; for each keyword, taking the product of its semantic relation value and its probability with respect to the search term as the corrected probability of the keyword with respect to the search term; and sorting all the keywords in the first-stage label system in descending order of corrected probability and selecting a sixth preset threshold number of top-ranked keywords to form the label system of the search term.
Calculating a semantic relation value between each keyword in a first-stage label system of the search term and the search term comprises the following steps: obtaining a search word sequence set corresponding to a plurality of query sessions according to search words in each query session; training the search word sequence set to obtain an N-dimensional keyword vector file; calculating word vectors of the keywords according to the N-dimensional keyword vector files, and calculating the word vectors of each term in the search words; calculating cosine similarity between the word vector of the keyword and the word vector of each term, and taking the cosine similarity as a semantic relation value of the keyword and the corresponding term; and taking the sum of the semantic relation values of the keyword and each term as the semantic relation value between the keyword and the search word.
For example, the semantic relation between each tag word and the query is calculated using the trained word vector file term_w2v_300.dict, as follows: the cosine similarity between the tag word's vector and the vector of each word in the query is calculated and the similarities are accumulated, and the larger the value, the more important the tag; the tags are then weighted by their LDA weights and re-sorted in descending order, as sketched below.
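A sketch of this correction step follows; it assumes the keyword and query-term vectors have already been trained (random stand-ins are used here) and simply combines the accumulated cosine similarity with the LDA weight.

```python
# Sketch: correct each tag's LDA weight by its semantic relation to the query.
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(1)
word_vec = {w: rng.standard_normal(300) for w in
            ["funny", "game", "casual", "puzzle", "download"]}   # stand-in word2vec vectors

query_terms = ["funny", "game"]                                   # segmented query
first_stage_tags = {"casual": 0.30, "puzzle": 0.25, "download": 0.20}  # tag -> LDA weight

corrected = {}
for tag, lda_weight in first_stage_tags.items():
    # semantic relation value: sum of cosine similarities to every query term
    relation = sum(cosine(word_vec[tag], word_vec[t]) for t in query_terms)
    corrected[tag] = relation * lda_weight                         # corrected probability

ranking = sorted(corrected, key=corrected.get, reverse=True)
print(ranking)
```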
Specifically, the training the search word sequence set to obtain an N-dimensional keyword vector file includes: and performing word segmentation processing on the search word sequence set, and training the search word sequence set subjected to word segmentation processing by using a deep learning tool package word2vec to generate an N-dimensional keyword vector file.
For example, the search term sequence set is segmented into Chinese words and trained with the deep learning toolkit word2vec to generate 300-dimensional vectors, producing another word vector file term_w2v_300.dict, i.e. the keyword vector file.
Still further, in an embodiment of the present invention, the calculating the label system of each search term according to the search term-topic probability distribution result and the topic-keyword probability distribution result further includes: taking the sixth preset threshold number of keywords selected for each search term as a second-stage label system of the search term; for the second-stage label system of each search term, counting the TF-IDF value, in the training corpus of the search term, of each keyword in the second-stage label system; for each keyword, taking the product of its probability with respect to the search term and its TF-IDF value as the secondary corrected probability of the keyword with respect to the search term; and sorting all the keywords in the second-stage label system in descending order of secondary corrected probability and selecting the first K keywords to form the label system of the search term.
For example, the tags are weighted by their tf-idf weights in the query's expanded corpus, normalized, and the tag order is rearranged accordingly.
After these two corrections, the accuracy of the tag ordering for expressing the query intention is greatly improved.
In an embodiment of the present invention, the selecting the first K keywords to form the label system of the search term includes: acquiring, from a query session log of the application search engine, the number of queries of the search term within a preset time period; and selecting the first K keywords to form the label system of the search term according to the number of queries, wherein the value of K is a piecewise linear (broken-line) function of the number of queries corresponding to the search term.
This step determines the number of tags kept for each query: the top k tag words are retained, where k is a piecewise linear function of how often the query is searched, so each query keeps between 2 and 5 tags; the accuracy is 88% and the recall is 75%. At this step a query intent dictionary query_intent_tag is generated, as sketched below.
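The mapping from query frequency to k can be any piecewise linear (broken-line) function; the breakpoints below are assumptions chosen only so that k stays between 2 and 5, as stated above.

```python
# Sketch: choose how many tags (k) to keep as a broken-line function of query frequency.
import numpy as np

def tags_to_keep(query_count):
    counts = [0, 100, 1000, 10000]   # assumed breakpoints (number of queries)
    ks     = [2,   3,    4,     5]   # k ranges from 2 to 5, as stated in the text
    return int(round(np.interp(query_count, counts, ks)))

def final_tag_system(ranked_tags, query_count):
    return ranked_tags[:tags_to_keep(query_count)]

tags = ["casual", "puzzle", "download", "offline", "free"]
print(final_tag_system(tags, 50))      # low-frequency query keeps 2 tags
print(final_tag_system(tags, 20000))   # high-frequency query keeps 5 tags
```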
Further, in a specific example, the solution labels about 2.6 million queries with tag words that express the user's intention. Each query is treated as a whole, so after a user reformulates or rewrites a query into a synonymous form, the new query is not in the query intent dictionary. In that case the semantic similarity between the new query and the queries in the dictionary is calculated, and the intent tags of semantically similar queries are given to the new query. The calculation is as follows: the term vectors of the words in the new query are accumulated to form a new query vector, the Euclidean distance to each query vector in the query intent dictionary is calculated, and the 3 nearest queries are selected, using a KdTree to reduce the computational complexity; the Euclidean distances are smoothed with a Gaussian kernel and used as weights for the tag words; the intent tag words of the 3 neighboring queries are combined to generate the intent tag words of the new query; and the first 3 tags are kept to meet the user's search intention, with an accuracy of 80%.
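A rough sketch of this fallback for unseen queries is shown below, using scikit-learn's KDTree as one possible implementation of the KdTree lookup; the Gaussian bandwidth and the toy vectors are assumptions.

```python
# Sketch: give an unseen query the tags of its 3 nearest known queries, weighting
# each neighbor's tags with a Gaussian-smoothed Euclidean distance.
from collections import defaultdict
import numpy as np
from sklearn.neighbors import KDTree

rng = np.random.default_rng(2)
known_queries = ["funny game", "casual game", "weather forecast", "photo editor"]
query_vectors = rng.standard_normal((len(known_queries), 300))   # stand-in query vectors
intent_tags = {
    "funny game":       ["casual", "puzzle"],
    "casual game":      ["casual", "offline"],
    "weather forecast": ["weather"],
    "photo editor":     ["photo", "beauty"],
}

tree = KDTree(query_vectors)

def tag_new_query(new_vector, sigma=1.0, top_tags=3):
    dist, idx = tree.query(new_vector.reshape(1, -1), k=3)        # 3 nearest known queries
    weights = np.exp(-dist[0] ** 2 / (2 * sigma ** 2))            # Gaussian smoothing of distances
    scores = defaultdict(float)
    for w, i in zip(weights, idx[0]):
        for tag in intent_tags[known_queries[i]]:
            scores[tag] += w
    return sorted(scores, key=scores.get, reverse=True)[:top_tags]

new_query_vector = rng.standard_normal(300)   # e.g. the summed word vectors of the new query
print(tag_new_query(new_query_vector))
```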
FIG. 2 shows a flowchart of an application search method according to an embodiment of the present invention, the method comprising:
step 210, a search term tag database is constructed, wherein the search term tag database comprises a tag system of a plurality of search terms.
Step 220, receiving the current search term uploaded by the client, and acquiring a tag system of the current search term according to the search term tag database.
Step 230, calculating the degree of association between the tag system of the current search term and the tag system of each application.
Step 240, when the degree of association between the tag system of the current search term and the tag system of an application meets a preset condition, returning the relevant information of that application to the client for display.
In the process of constructing the search term tag database in step 210, the tag system of each search term is mined in the same way as in any of the embodiments of the method shown in fig. 1.
In an embodiment of the present invention, the obtaining the tag system of the current search term according to the search term tag database includes: calculating the semantic similarity between the current search term and each search term in the search term tag database, sorting the search terms in descending order of semantic similarity, and selecting a first preset threshold number of search terms; and obtaining the tag system of the current search term according to the tag systems of the selected search terms.
In one embodiment of the present invention, the calculating the semantic similarity between the current search term and each search term in the search term tag database includes: calculating the Euclidean distance between the current search term and each search term in the search term tag database, and taking the Euclidean distance between each search term and the current search term as the semantic similarity corresponding to that search term. The obtaining the tag system of the current search term according to the tag systems of the selected search terms includes: taking the semantic similarity corresponding to each search term as the weight of each tag in the tag system of that search term; for the tags in the tag systems of the selected search terms, adding up the weights of identical tags to obtain the final weight of each tag; and sorting the tags in descending order of final weight and selecting a second preset threshold number of top-ranked tags to form the tag system of the current search term, as sketched below.
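The aggregation of the selected tag systems can be sketched as follows; the similarity values and tag systems are toy data, and the similarity is used directly as a weight, mirroring the description above.

```python
# Sketch: build the tag system of the current search term from the tag systems of
# its most similar known search terms, weighting each tag by that similarity.
from collections import defaultdict

neighbors = {                       # selected search terms -> semantic similarity
    "funny game": 0.9,
    "casual game": 0.7,
}
tag_systems = {                     # tag system of each selected search term
    "funny game": ["casual", "puzzle"],
    "casual game": ["casual", "offline"],
}

final_weight = defaultdict(float)
for query, similarity in neighbors.items():
    for tag in tag_systems[query]:
        final_weight[tag] += similarity          # add up the weights of identical tags

SECOND_PRESET_THRESHOLD = 3                      # assumed value
current_tag_system = sorted(final_weight, key=final_weight.get,
                            reverse=True)[:SECOND_PRESET_THRESHOLD]
print(current_tag_system)   # ['casual', 'puzzle', 'offline']
```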
Table 4 shows the intent tag words of some of the search terms in the 360 Mobile Assistant application search.
TABLE 4
Fig. 3 shows an apparatus for identifying an application search intention according to an embodiment of the present invention, where the apparatus 300 for identifying an application search intention includes:
an obtaining unit 310, adapted to obtain a search term in each query session from a query session log of an application search engine;
the mining unit 320 is adapted to mine a label system of each search term according to the search term in each query session and a preset strategy;
the identifying unit 330 is adapted to identify an application search intention corresponding to each search term according to the label system of the search term.
In an embodiment of the present invention, the mining unit 320 is adapted to obtain a training corpus set according to the search terms in each query session; input the training corpus set into an LDA model for training to obtain a search term-topic probability distribution result and a topic-keyword probability distribution result output by the LDA model; and calculate a label system of each search term according to the search term-topic probability distribution result and the topic-keyword probability distribution result.
In an embodiment of the present invention, the mining unit 320 is adapted to obtain an original corpus of each search term according to the search terms in each query session; form an original corpus set from the original corpora of the search terms; and preprocess the original corpus set to obtain the training corpus set.
Specifically, in an embodiment of the present invention, the mining unit 320 is adapted to obtain a search term sequence set corresponding to a plurality of query sessions according to the search terms in each query session; obtain a search term set corresponding to the plurality of query sessions; train the search term sequence set to obtain an N-dimensional search term vector file; for each search term in the search term set, calculate the degree of association between the search term and each other search term according to the N-dimensional search term vector file; and take the other search terms whose degrees of association with the search term meet a preset condition as the original corpus of the search term.
That is, the mining unit 320 is adapted to, for each query session, arrange the search terms in the query session into a sequence in order; if a search term in the sequence corresponds to an application download operation, insert the name of the downloaded application into the sequence immediately after that search term; obtain a search term sequence corresponding to the query session; and take the set of search terms in the plurality of query sessions as the search term set corresponding to the plurality of query sessions.
For example, the mining unit 320 is adapted to treat each search word in the search word sequence set as a single word and train the search word sequence set with the deep learning toolkit word2vec to generate an N-dimensional search word vector file.
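One possible realization of this training step uses gensim's Word2Vec implementation (gensim 4.x argument names are assumed); each whole search word is kept as a single token, and the vector dimension and output file name are illustrative.

    from gensim.models import Word2Vec

    def train_term_vectors(term_sequences, dim=200):
        """term_sequences: list of search word sequences, one per query session."""
        model = Word2Vec(sentences=term_sequences, vector_size=dim,
                         window=5, min_count=1, workers=4)
        # export the N-dimensional search word vector file
        model.wv.save_word2vec_format("term_vectors.txt")
        return model.wv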
On this basis, in an embodiment of the present invention, the mining unit 320 is adapted to apply a KNN algorithm to the search term set and the N-dimensional search term vector file, calculating the distance between every two search terms in the search term set according to the N-dimensional search term vector file; and, for each search word in the search word set, sort the other search words by their distance to the search word in descending order and select the top first-preset-threshold number of search words as the original corpus of the search word.
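A hedged sketch of this neighbor-selection step follows. Cosine similarity over the word vectors is used here as the pairwise score; the patent only states that a KNN-style pairwise comparison is performed, so the exact metric and the dictionary `vectors` (search word to numpy vector) are assumptions.

    import numpy as np

    def nearest_terms(term, vectors, top_k):
        """Return the top_k search words closest to `term` in the vector space."""
        v = vectors[term]
        scores = []
        for other, u in vectors.items():
            if other == term:
                continue
            sim = float(np.dot(v, u) / (np.linalg.norm(v) * np.linalg.norm(u)))
            scores.append((sim, other))
        scores.sort(reverse=True)
        # the selected search words serve as the original corpus of `term`
        return [other for _, other in scores[:top_k]]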
During preprocessing, in an embodiment of the present invention, the mining unit 320 is adapted to perform word segmentation on each original corpus in the original corpus set to obtain a word segmentation result comprising a plurality of terms; search for phrases formed by adjacent terms in the word segmentation result; and retain the phrases, the terms that are nouns, and the terms that are verbs in the word segmentation result as the retained keywords corresponding to that original corpus.
Specifically, the mining unit 320 is adapted to calculate the cpmd value of every two adjacent terms in the word segmentation result, and determine that two adjacent terms form a phrase when their cpmd value is greater than a second preset threshold.
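This excerpt does not define how the cpmd value is computed, so the scoring function in the sketch below is only a placeholder (a PMI-style co-occurrence score over adjacent terms); only the thresholding logic mirrors the text.

    import math
    from collections import Counter

    def find_phrases(segmented_docs, threshold):
        """segmented_docs: list of word segmentation results (lists of terms)."""
        unigram = Counter(w for doc in segmented_docs for w in doc)
        bigram = Counter((a, b) for doc in segmented_docs for a, b in zip(doc, doc[1:]))
        total = sum(unigram.values())

        def score(a, b):
            # assumption: a PMI-like co-occurrence score stands in for the cpmd value
            return math.log((bigram[(a, b)] * total) / (unigram[a] * unigram[b]))

        # adjacent terms whose score exceeds the threshold are treated as a phrase
        return {(a, b) for (a, b) in bigram if score(a, b) > threshold}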
Further, in an embodiment of the present invention, the mining unit 320 is further adapted to take the retained keywords corresponding to the original corpus of each search term as the first-stage training corpus of that search term; form a first-stage training corpus set from the first-stage training corpora of all search terms; and perform data cleaning on the keywords in the first-stage training corpus set.
Specifically, in an embodiment of the present invention, the mining unit 320 is adapted to, for the first-stage training corpus of each search word, calculate the TF-IDF value of each keyword of that corpus within the first-stage training corpus set; and delete the keywords whose TF-IDF value is higher than a third preset threshold and/or lower than a fourth preset threshold to obtain the training corpus of the search word; the training corpora of all search words constitute the training corpus set.
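A minimal sketch of this TF-IDF based cleaning step is shown below. The concrete TF-IDF formula and the upper/lower cut-offs (the third and fourth preset thresholds) are assumptions for illustration.

    import math
    from collections import Counter

    def clean_corpus(stage1_corpus, upper, lower):
        """stage1_corpus: dict mapping each search word to its keyword list."""
        docs = list(stage1_corpus.values())
        df = Counter(w for doc in docs for w in set(doc))   # document frequency
        n_docs = len(docs)

        cleaned = {}
        for term, doc in stage1_corpus.items():
            if not doc:
                cleaned[term] = []
                continue
            tf = Counter(doc)
            kept = []
            for w in doc:
                tfidf = (tf[w] / len(doc)) * math.log(n_docs / (1 + df[w]))
                # keep only keywords whose TF-IDF lies between the two thresholds
                if lower <= tfidf <= upper:
                    kept.append(w)
            cleaned[term] = kept
        return cleaned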
In an embodiment of the present invention, the mining unit 320 is adapted to calculate a search word-keyword probability distribution result according to the search word-topic probability distribution result and the topic-keyword probability distribution result; and, according to the search word-keyword probability distribution result, for each search word, sort the keywords by their probability with respect to that search word in descending order and select the top fifth-preset-threshold number of keywords.
In an embodiment of the present invention, the mining unit 320 is adapted to, for each search word, obtain the probability of each topic with respect to the search word according to the search word-topic probability distribution result; for each topic, obtain the probability of each keyword with respect to the topic according to the topic-keyword probability distribution result; for each keyword, take the product of the probability of the keyword with respect to a topic and the probability of that topic with respect to the search word as the probability of the keyword with respect to the search word based on that topic; and take the sum, over all topics, of the probabilities of the keyword with respect to the search word based on each topic as the probability of the keyword with respect to the search word.
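The computation just described is a marginalization over topics: the probability of a keyword with respect to a search word is the sum, over all topics, of the keyword's probability given the topic multiplied by the topic's probability given the search word. The sketch below renders it directly, reusing the data structures assumed in the earlier LDA sketch.

    def term_keyword_probability(term_topic, topic_keyword, term, top_n):
        """term_topic[term]: list of (topic_id, p_topic); topic_keyword[topic_id]: list of (keyword, p_kw)."""
        probs = {}
        for topic_id, p_topic in term_topic[term]:
            for keyword, p_kw in topic_keyword[topic_id]:
                # probability of the keyword w.r.t. the search word based on this topic
                probs[keyword] = probs.get(keyword, 0.0) + p_kw * p_topic
        ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
        return ranked[:top_n]    # top fifth-preset-threshold keywords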
Further, in an embodiment of the present invention, the mining unit 320 is further adapted to take the top fifth-preset-threshold number of keywords selected for each search term as the first-stage label system of that search term; for the first-stage label system of each search word, calculate a semantic relation value between each keyword in the first-stage label system and the search word; for each keyword, take the product of its semantic relation value and its probability with respect to the search word as the corrected probability of the keyword with respect to the search word; and sort all keywords in the first-stage label system by their corrected probability with respect to the search word in descending order and select the top sixth-preset-threshold number of keywords to form the label system of the search word.
In an embodiment of the present invention, the mining unit 320 is adapted to obtain a search word sequence set corresponding to a plurality of query sessions according to the search words in each query session; train the search word sequence set to obtain an N-dimensional keyword vector file; obtain, from the N-dimensional keyword vector file, the word vector of the keyword and the word vector of each term in the search word; calculate the cosine similarity between the word vector of the keyword and the word vector of each term, and take that cosine similarity as the semantic relation value between the keyword and the corresponding term; and take the sum of the semantic relation values between the keyword and all terms as the semantic relation value between the keyword and the search word.
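A short sketch of the semantic relation value follows: the keyword's vector is compared with the vector of every term segmented from the search word, and the cosine similarities are summed. It assumes the N-dimensional keyword vector file has been loaded into a dictionary `kw_vectors`; terms missing from the file are simply skipped, which is an assumption.

    import numpy as np

    def semantic_relation(keyword, search_word_terms, kw_vectors):
        """search_word_terms: the terms obtained by segmenting the search word."""
        def cosine(u, v):
            return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))
        kv = kw_vectors[keyword]
        # sum of per-term cosine similarities = semantic relation value
        return sum(cosine(kv, kw_vectors[t]) for t in search_word_terms if t in kw_vectors)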
In an embodiment of the present invention, the mining unit 320 is adapted to perform word segmentation on the search word sequence set, and train the word-segmented search word sequence set with the deep learning toolkit word2vec to generate an N-dimensional keyword vector file.
Further, in an embodiment of the present invention, the mining unit 320 is further adapted to take the top sixth-preset-threshold number of keywords selected for each search word as the second-stage label system of that search word; for the second-stage label system of each search word, count the TF-IDF value, within the training corpus of the search word, of each keyword in the second-stage label system; for each keyword, take the product of its probability with respect to the search word and its TF-IDF value as the secondary corrected probability of the keyword with respect to the search word; and sort all keywords in the second-stage label system by their secondary corrected probability with respect to the search word in descending order and select the top K keywords to form the label system of the search word.
In an embodiment of the present invention, the mining unit 320 is adapted to obtain, from the query session log of the application search engine, the number of queries for the search term within a preset time period, and select the top K keywords to form the label system of the search word according to that number of queries, wherein the value of K is determined by a broken-line (piecewise linear) function of the number of queries for the search term.
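One way to realize K as a broken-line function of the query count is sketched below; the breakpoints and K values are purely illustrative assumptions, and the result is rounded to an integer.

    import numpy as np

    def choose_k(query_count):
        # breakpoints of the broken-line (piecewise linear) function - assumed values
        query_breakpoints = [0, 100, 1000, 10000]
        k_values          = [3, 5,   8,    12]
        # linear interpolation between breakpoints, clamped at both ends
        return int(round(np.interp(query_count, query_breakpoints, k_values)))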
Fig. 4 shows an application search server according to an embodiment of the present invention, the application search server 400 including:
a database construction unit 410 adapted to construct a search term tag database including a tag system of a plurality of search terms;
an interaction unit 420 adapted to receive a current search term uploaded by a client;
the search processing unit 430 is adapted to obtain a tag system of a current search term according to the search term tag database; calculating the degree of association between the label system of the current search term and the label system of each application;
the interaction unit 420 is further adapted to return the relevant information of an application to the client for displaying when the degree of association between the tag system of the current search term and the tag system of the application meets a preset condition;
The scheme by which the database construction unit 410 mines the tag system of a search term when constructing the search term tag database is the same as the scheme by which the apparatus 300 for identifying an application search intention according to any of the above embodiments of the present invention mines the tag system of a search term.
In an embodiment of the present invention, the search processing unit 430 is adapted to calculate the semantic similarity between the current search word and each search word in the search word tag database, sort the search words by semantic similarity in descending order, and select the top first-preset-threshold number of search words; and obtain the label system of the current search word according to the label systems of the selected search words.
In one embodiment of the present invention, the search processing unit 430 is adapted to calculate the Euclidean distance between the current search word and each search word in the search word tag database, and take the Euclidean distance between each search word and the current search word as the semantic similarity corresponding to that search word; take the semantic similarity corresponding to each selected search word as the weight of every label in that search word's label system; for the labels in the label systems of all selected search words, add up the weights of identical labels to obtain the final weight of each label; and sort the labels by final weight in descending order and select the top second-preset-threshold number of labels to form the label system of the current search word.
It should be noted that the embodiments of the apparatus shown in fig. 3-4 are the same as the embodiments of the method shown in fig. 1-2, and the detailed description is given above and will not be repeated herein.
In summary, the identification method and apparatus of an application search intention, the application search method, and the server provided by the present invention offer a tag-based user intention identification approach that is matched with the tag system of apps and flexibly expresses fine-grained user query intentions. The user intention tag system is built with unsupervised machine learning techniques, abandoning the traditional user intention classification approach and realizing an automatic user intention mining process that can generate user intention tag lists with high accuracy and recall. Because user intentions and apps are mapped into the same tag system, the problem of user intention identification is solved, the relevance computation problem of the application search engine is addressed, and a foundation is laid for function search, a core technology of the application search engine.
It should be noted that:
the algorithms and displays presented herein are not inherently related to any particular computer, virtual machine, or other apparatus. Various general purpose devices may be used with the teachings herein. The required structure for constructing such a device will be apparent from the description above. Moreover, the present invention is not directed to any particular programming language. It is appreciated that a variety of programming languages may be used to implement the teachings of the present invention as described herein, and any descriptions of specific languages are provided above to disclose the best mode of the invention.
In the description provided herein, numerous specific details are set forth. It is understood, however, that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description.
Similarly, it should be appreciated that in the foregoing description of exemplary embodiments of the invention, various features of the invention are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure and aiding in the understanding of one or more of the various inventive aspects. However, the disclosed method should not be interpreted as reflecting an intention that the claimed invention requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single foregoing disclosed embodiment. Thus, the claims following the detailed description are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of this invention.
Those skilled in the art will appreciate that the modules in the device in an embodiment may be adaptively changed and disposed in one or more devices different from the embodiment. The modules or units or components of the embodiments may be combined into one module or unit or component, and furthermore they may be divided into a plurality of sub-modules or sub-units or sub-components. All of the features disclosed in this specification (including any accompanying claims, abstract and drawings), and all of the processes or elements of any method or apparatus so disclosed, may be combined in any combination, except combinations where at least some of such features and/or processes or elements are mutually exclusive. Each feature disclosed in this specification (including any accompanying claims, abstract and drawings) may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise.
Furthermore, those skilled in the art will appreciate that while some embodiments described herein include some features included in other embodiments, rather than other features, combinations of features of different embodiments are meant to be within the scope of the invention and form different embodiments. For example, in the following claims, any of the claimed embodiments may be used in any combination.
The various component embodiments of the invention may be implemented in hardware, or in software modules running on one or more processors, or in a combination thereof. It will be appreciated by those skilled in the art that a microprocessor or Digital Signal Processor (DSP) may be used in practice to implement some or all of the functions of some or all of the components of the application search intent recognition apparatus and application search server according to embodiments of the present invention. The present invention may also be embodied as apparatus or device programs (e.g., computer programs and computer program products) for performing a portion or all of the methods described herein. Such programs implementing the present invention may be stored on computer-readable media or may be in the form of one or more signals. Such a signal may be downloaded from an internet website or provided on a carrier signal or in any other form.
It should be noted that the above-mentioned embodiments illustrate rather than limit the invention, and that those skilled in the art will be able to design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention may be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In the unit claims enumerating several means, several of these means may be embodied by one and the same item of hardware. The use of the words first, second, third, and so on does not indicate any ordering; these words may be interpreted as names.