CN113407700B - A data query method, device and equipment - Google Patents
A data query method, device and equipment
- Publication number
- CN113407700B CN113407700B CN202110762089.5A CN202110762089A CN113407700B CN 113407700 B CN113407700 B CN 113407700B CN 202110762089 A CN202110762089 A CN 202110762089A CN 113407700 B CN113407700 B CN 113407700B
- Authority
- CN
- China
- Prior art keywords
- clustering
- question
- target
- target text
- cluster
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G—PHYSICS; G06—COMPUTING; CALCULATING OR COUNTING; G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/332—Query formulation
- G06F16/3329—Natural language query formulation
- G06F16/3331—Query processing
- G06F16/35—Clustering; Classification
- G06F16/355—Creation or modification of classes or clusters
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/23—Clustering techniques
- G06F18/232—Non-hierarchical techniques
- G06F18/2321—Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
- G06F18/23213—Non-hierarchical techniques using statistics or function optimisation with fixed number of clusters, e.g. K-means clustering
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
- G06F40/216—Parsing using statistical methods
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE; Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Artificial Intelligence (AREA)
- Databases & Information Systems (AREA)
- Computational Linguistics (AREA)
- Probability & Statistics with Applications (AREA)
- Mathematical Physics (AREA)
- Evolutionary Computation (AREA)
- Evolutionary Biology (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Bioinformatics & Computational Biology (AREA)
- General Health & Medical Sciences (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Human Computer Interaction (AREA)
- Life Sciences & Earth Sciences (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The embodiments of this specification provide a data query method, apparatus, and device in the technical field of artificial intelligence. The method comprises: receiving a target text input by a target user; preprocessing the target text to obtain a feature vector of the target text; determining a target cluster label according to the feature vector of the target text, where the target cluster label is used to represent the classification to which the target text belongs; and determining a query result corresponding to the target text based on the feature vector of the target text and the target cluster label. In the embodiments of this specification, the query range can be determined through the target cluster label and the query performed within that range according to the feature vector of the target text, which narrows the query range, improves query efficiency, and improves the accuracy of the query result.
Description
Technical Field
The embodiments of the specification relate to the technical field of artificial intelligence, and in particular to a data query method, apparatus, and device.
Background
Banking business is diverse, complex, and growing continuously and rapidly with the evolution of the internet. Even with intelligent customer service widely deployed, bank customer service staff inevitably have to answer a large number of customer questions every day.
In the prior art, when answering a customer question, the required answer is usually retrieved by searching for question keywords or related documents. However, question answering has strict requirements on timeliness and accuracy, and because question-answer knowledge is typically highly discrete and lacks statistical analysis, the retrieved results must be screened manually to obtain a correct answer. The final result is therefore affected by subjective factors, and the time cost increases. In short, the prior art cannot accurately determine the corresponding answer from the question input by the user.
In view of the above problems, no effective solution has been proposed at present.
Disclosure of Invention
The embodiments of the specification provide a data query method, a data query device, and data query equipment, so as to solve the problem in the prior art that a corresponding answer cannot be accurately determined from the question input by a user.
The embodiments of the specification provide a data query method, which comprises: receiving a target text input by a target user; preprocessing the target text to obtain a feature vector of the target text; determining a target cluster label according to the feature vector of the target text, wherein the target cluster label is used to represent the classification to which the target text belongs; and determining a query result corresponding to the target text based on the feature vector of the target text and the target cluster label.
The embodiments of the specification also provide a data query device, which comprises a receiving module, a preprocessing module, a first determining module and a second determining module, wherein: the receiving module is used for receiving a target text input by a target user; the preprocessing module is used for preprocessing the target text to obtain a feature vector of the target text; the first determining module is used for determining a target cluster label according to the feature vector of the target text, the target cluster label being used to represent the classification to which the target text belongs; and the second determining module is used for determining a query result corresponding to the target text based on the feature vector of the target text and the target cluster label.
The embodiment of the specification also provides a data query device, which comprises a processor and a memory for storing instructions executable by the processor, wherein the steps of any one of the method embodiments of the specification are realized when the instructions are executed by the processor.
The present description also provides a computer-readable storage medium having stored thereon computer instructions which, when executed, implement the steps of any of the method embodiments of the present description.
The embodiment of the specification provides a data query method, which can obtain a feature vector of a target text by receiving the target text input by a target user and preprocessing the target text. In order to determine the classification to which the target text belongs, a target cluster label may be determined according to the feature vector of the target text. Because the data volume stored in the database is large, and the question-answer knowledge often has the characteristics of high discreteness, lack of statistical analysis and the like, the query range can be determined through the target cluster labels, and the query is performed in the determined query range according to the feature vector of the target text, so that the query result corresponding to the target text is determined. Therefore, the purposes of reducing the query range and improving the query efficiency can be achieved, and the accuracy of the query result can be improved.
Drawings
The accompanying drawings, which are included to provide a further understanding of embodiments of the present specification, are incorporated in and constitute a part of this specification and do not limit the embodiments of the present specification. In the drawings:
FIG. 1 is a schematic diagram of steps of a data query method according to an embodiment of the present disclosure;
FIG. 2 is a schematic structural diagram of a data query device according to an embodiment of the present disclosure;
FIG. 3 is a schematic structural diagram of a data query device according to an embodiment of the present disclosure.
Detailed Description
The principles and spirit of the embodiments of the present specification will be described below with reference to several exemplary implementations. It should be understood that these implementations are presented merely to enable one skilled in the art to better understand and implement the embodiments of the present specification, and are not intended to limit their scope in any way. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
Those skilled in the art will appreciate that the embodiments of the present specification may be implemented as a system, an apparatus, a method, or a computer program product. Accordingly, they may be embodied entirely in hardware, entirely in software (including firmware, resident software, micro-code, etc.), or in a combination of hardware and software.
While the flow described below includes a number of operations occurring in a particular order, it should be apparent that these processes may include more or fewer operations, which may be performed sequentially or in parallel (e.g., using a parallel processor or a multi-threaded environment).
Referring to fig. 1, the present embodiment may provide a data query method. The data query method can be used for determining the query range through the target clustering label, and querying in the determined query range according to the feature vector of the target text to determine the query result corresponding to the target text, so as to achieve the purposes of reducing the query range and improving the query efficiency. The data query method may include the following steps.
S101, receiving target text input by a target user.
In this embodiment, a target text input by a target user may be received. The target text may be a text input by the target user in an input box of the search interface when the target user needs to perform a query, and the target text may be used for indicating the query purpose of the target user.
In this embodiment, the target text may be one or more keywords, or may be a sentence or a paragraph, which may be specifically determined according to the actual situation, and this embodiment of the present disclosure is not limited thereto.
S102, preprocessing the target text to obtain the feature vector of the target text.
In this embodiment, since the format of the target text input by the target user may be incorrect or the text may contain some irrelevant or redundant characters, the target text may be preprocessed to obtain the feature vector of the target text, in order to determine the main intention of the target user's query.
In this embodiment, the feature vector of the target text may include at least one feature word, and the feature vector may be, for example, < beijing, weather >, although it is understood that the feature vector of the target text may also be in other possible forms, and may be specifically determined according to practical situations, which is not limited in this embodiment of the present disclosure.
In this embodiment, the preprocessing may include data cleaning, word segmentation, stop-word removal, and the like. Of course, the preprocessing is not limited to the above examples, and other modifications may be made by those skilled in the art in light of the technical spirit of the embodiments of the present disclosure; as long as the functions and effects achieved are the same as or similar to those of the embodiments of the present disclosure, they shall fall within the protection scope of the embodiments of the present disclosure.
S103, determining a target cluster label according to the feature vector of the target text, wherein the target cluster label is used for representing the classification of the target text.
In this embodiment, the target cluster tag may be determined according to the feature vector of the target text, so as to determine the classification corresponding to the target text. The target cluster tag may be used to represent a classification to which the target text belongs.
In this embodiment, the question-answer data stored in the database may be subjected to cluster analysis in advance to obtain a plurality of cluster centers, so as to classify the historical question-answer data. Correspondingly, when a user performs a query, the cluster corresponding to the text input by the user can be determined first.
In this embodiment, the similarity between the feature vector of the target text and the feature vector of each piece of historical question-answer data may be calculated, and the cluster to which the historical question-answer data with the highest similarity belongs may be taken as the cluster to which the feature vector of the target text belongs. The similarity between feature vectors may be determined by calculating the cosine similarity, the Minkowski distance, and so on. Of course, the calculation method of the similarity is not limited to the above examples, and other modifications may be made by those skilled in the art in light of the technical spirit of the embodiments of the present disclosure; as long as the functions and effects achieved are the same as or similar to those of the embodiments of the present disclosure, they shall fall within the protection scope of the embodiments of the present disclosure.
In this embodiment, alternatively, the Euclidean distance from the feature vector of the target text to each cluster center may be calculated, and the target text may be assigned to the cluster whose center is nearest. It will of course be appreciated that other ways of determining the cluster to which the feature vector of the target text belongs may also be used.
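For illustration only, this assignment step could be sketched roughly as follows, assuming Python with NumPy and a hypothetical array `centers` holding the pre-computed cluster centers; the embodiments are not limited to this implementation:

```python
import numpy as np

def assign_cluster(text_vector, centers):
    """Return the index of the nearest cluster center (by Euclidean distance)."""
    distances = np.linalg.norm(centers - text_vector, axis=1)  # distance to every center
    return int(np.argmin(distances))

def cosine_similarity(a, b):
    """Cosine similarity, an alternative similarity measure mentioned above."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))
```

The target cluster label is then simply the label attached to the selected center.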
In this embodiment, each cluster center may have a cluster tag, which may be used to identify a class of clusters, and the cluster tag may be one or more feature words. It will be understood, of course, that the above-mentioned cluster labels may also be in other forms, such as charts, etc., and may be specifically determined according to practical situations, which are not limited in this embodiment of the present disclosure.
S104, determining a query result corresponding to the target text based on the feature vector of the target text and the target cluster label.
In this embodiment, the query result corresponding to the target text may be determined based on the feature vector of the target text and the target cluster tag. Therefore, the query can be performed under the classification corresponding to the target cluster label in the database, the purposes of reducing the query range and improving the query efficiency are achieved, and the accuracy of the query result can be improved.
In this embodiment, since the data amount stored in the database is large, and the question-answer knowledge often has the characteristics of high discreteness, lack of statistical analysis, and the like, the range of the query can be determined by the target cluster tag, and the query is performed in the determined query range according to the feature vector of the target text.
In this embodiment, there may be one or more query results corresponding to the target text; in some cases, the query result may also be empty. In the case that the query result is empty, default prompt information may be fed back to the target user, prompting the user to decide whether to modify the target text or the target cluster label. This may be determined according to the actual situation, and the embodiments of the present disclosure are not limited in this respect.
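Purely as an illustrative sketch of step S104 (Python/NumPy; the in-memory index `cluster_index` mapping each cluster label to its question-answer records is a hypothetical structure, not part of the claimed embodiments), the query could be restricted to the target cluster and ranked by similarity as follows:

```python
import numpy as np

def query_in_cluster(text_vector, target_label, cluster_index, top_n=3):
    """Search only the records filed under the target cluster label.

    cluster_index: dict mapping cluster label -> list of (answer_text, feature_vector).
    Returns up to top_n answers ranked by cosine similarity; an empty list means
    no result was found and a default prompt may be returned to the user instead.
    """
    scored = []
    for answer, vec in cluster_index.get(target_label, []):
        sim = float(np.dot(text_vector, vec) /
                    (np.linalg.norm(text_vector) * np.linalg.norm(vec) + 1e-12))
        scored.append((sim, answer))
    scored.sort(key=lambda pair: pair[0], reverse=True)  # highest similarity first
    return [answer for _, answer in scored[:top_n]]
```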
From the above description, it can be seen that the technical effect of the embodiment of the present specification is that the feature vector of the target text can be obtained by receiving the target text input by the target user and preprocessing the target text. In order to determine the classification to which the target text belongs, a target cluster label may be determined according to the feature vector of the target text. Because the data volume stored in the database is large, and the question-answer knowledge often has the characteristics of high discreteness, lack of statistical analysis and the like, the query range can be determined through the target cluster labels, and the query is performed in the determined query range according to the feature vector of the target text, so that the query result corresponding to the target text is determined. Therefore, the purposes of reducing the query range and improving the query efficiency can be achieved, and the accuracy of the query result can be improved.
In one embodiment, preprocessing the target text to obtain a feature vector of the target text may include word segmentation of the target text to obtain a target word segmentation set. Further, stop words in the target word segmentation set can be removed, and feature vectors of the target text can be obtained.
In this embodiment, since the format of the target text input by the target user may be incorrect or the text may contain some irrelevant or redundant characters, the target text may be preprocessed to obtain the feature vector of the target text. Specifically, the target text can first be segmented to obtain a target word segmentation set of the target text. Further, the stop words in the target word segmentation set can be removed, so as to obtain the feature vector of the target text.
In this embodiment, the word segmentation may be Chinese word segmentation, i.e., the process of splitting a sequence of Chinese characters into individual words and recombining the continuous character sequence into a word sequence according to certain rules. Stop words are certain words that are automatically filtered out before or after processing natural language data (or text) in information retrieval, for example to save storage space and improve search efficiency.
In this embodiment, a stop word list may be maintained, and the stop words in the target word segmentation set may be removed according to the stop word list.
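By way of a non-limiting illustration only, the preprocessing described above could be sketched roughly as follows, assuming Python, the open-source jieba tokenizer for Chinese word segmentation, and a hypothetical stop-word file named stopwords.txt; none of these choices is required by the embodiments of the present disclosure:

```python
import jieba

def load_stopwords(path="stopwords.txt"):
    """Load the maintained stop-word list, one word per line (hypothetical file name)."""
    with open(path, encoding="utf-8") as f:
        return {line.strip() for line in f if line.strip()}

def preprocess(text, stopwords):
    """Segment the text into words and remove stop words, returning the feature-word list."""
    words = jieba.lcut(text)  # Chinese word segmentation
    return [w for w in words if w.strip() and w not in stopwords]

# For example, preprocess("北京今天天气怎么样", load_stopwords()) might yield ["北京", "天气"].
```

The same routine can be reused for the question-answer data described later.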
In one embodiment, before determining the target clustering label according to the feature vector of the target text, the method can further comprise the step of acquiring a question and answer knowledge set from a target database, wherein the question and answer knowledge set comprises a plurality of sets of question and answer data, and each set of question and answer data comprises a question text and an answer text. And preprocessing each group of question-answer data in the question-answer knowledge set to obtain feature vectors of a plurality of groups of question-answer data. Further, according to the feature vectors of the multiple groups of question-answer data, a clustering result can be obtained by performing clustering analysis through a K-means clustering algorithm, wherein the clustering result comprises multiple clustering centers, and each clustering center corresponds to one cluster. And determining cluster labels of all clusters according to the clustering result, and obtaining a classification result of the question-answer knowledge set based on the clustering result and the cluster labels of all clusters.
In this embodiment, the question-answer knowledge set in the target database may be classified in advance, where the target database may be a data source of the query, and the query may be performed in the target database according to the target text, so as to obtain the query result. The question-answer data in the obtained question-answer knowledge set may be historical question-answer data recorded in the target database, and of course may also include question-answer data that is set manually and is often retrieved, and may be specifically determined according to actual situations, which is not limited in the embodiment of the present specification.
In this embodiment, the method for preprocessing each set of question-answer data in the question-answer knowledge set may be the same as the method for preprocessing the target text, and the repetition is not repeated.
In this embodiment, a K-means clustering algorithm may be used to perform cluster analysis on the feature vectors of the multiple sets of question-answer data, so as to obtain multiple cluster centers, where each cluster center may correspond to one cluster and each cluster may include at least one set of question-answer data. The K-means clustering algorithm is an iterative clustering analysis algorithm: the data are pre-divided into K groups, K objects are randomly selected as initial cluster centers, the distance between each object and each cluster center is calculated, and each object is assigned to the nearest cluster center. A cluster center and the objects assigned to it form a cluster. Each time samples are assigned, the center of each cluster is recomputed from the objects currently in the cluster. This process repeats until a termination condition is met, for example that no (or only a minimum number of) objects are reassigned to different clusters, that no (or only a minimum number of) cluster centers change, or that the sum of squared errors reaches a local minimum.
In this embodiment, in order to distinguish different clusters, the cluster labels of each cluster may be determined according to the clustering result, and the cluster labels may be used to distinguish different classifications, for example, the cluster labels may be feature words such as loan, deposit, etc., and may be specifically determined according to practical situations, which is not limited in this embodiment of the present disclosure.
In this embodiment, the above-mentioned clustering result and the cluster labels of the respective clusters may be used as classification results of the question-answer knowledge set.
In one embodiment, preprocessing each group of question-answer data in the question-answer knowledge set to obtain feature vectors of multiple groups of question-answer data may include word segmentation of the question text and the answer text in each group of question-answer data to obtain a word segmentation set of each group of question-answer data. And removing the stop words in the word segmentation set of each group of question-answer data to obtain the feature vector of each group of question-answer data.
In this embodiment, since the format of the question-answer data may be incorrect or there are some irrelevant redundant characters, preprocessing may be performed on each set of question-answer data in the question-answer knowledge set to obtain feature vectors of each set of question-answer data. Specifically, each group of question-answer data in the question-answer knowledge set can be firstly segmented to obtain a segmented set of the question-answer data. Further, the stop words in the word segmentation set of each group of question-answer data can be removed, so that the feature vector of each group of question-answer data is obtained.
In one embodiment, the clustering analysis is performed by using a K-means clustering algorithm according to the feature vectors of the multiple sets of question-answer data to obtain a clustering result, which may include calculating a TF-IDF weight two-dimensional matrix of each feature word in the feature vectors of the question-answer data of each set. Furthermore, according to the TF-IDF weight two-dimensional matrix of each feature word in each group of question-answer data, a K-means clustering algorithm is utilized for cluster analysis, and a clustering result is obtained.
In this embodiment, TF-IDF (Term Frequency–Inverse Document Frequency) is a common weighting technique in information retrieval and data mining, where TF is the term frequency and IDF is the inverse document frequency. TF-IDF is a statistical method for evaluating the importance of a word to a document in a document set or corpus: the importance of a word increases in proportion to the number of times it appears in the document, but decreases in inverse proportion to the frequency of its occurrence in the corpus. TF-IDF is calculated according to the following formula:
TF-IDF=TF×IDF
In this embodiment, the word frequency of each feature word in the feature vector of each set of question-answer data may be calculated first. The calculation formula of the word frequency is as follows:
tf_ij = n_ij / Σ_k n_kj
Wherein tf_ij is the word frequency of the i-th feature word in file set j, n_ij is the number of occurrences of the i-th feature word in file set j, n_kj is the number of occurrences of the k-th term in file set j, and Σ_k n_kj is the sum of the numbers of occurrences of all terms in file set j.
In this embodiment, the inverse document frequency may be further calculated. The calculation formula of the inverse document frequency is as follows:
idf_i = log( |D| / |{ j : t_i ∈ d_j }| )
Wherein idf_i is the inverse document frequency of the i-th feature word, |D| is the total number of records of question-answer data in the target database, t_i is the i-th feature word, and |{ j : t_i ∈ d_j }| is the number of file sets containing the i-th feature word.
In this embodiment, the TF-IDF value of each feature word of each set of question-answer data in file set j can be obtained from the word frequency and the inverse document frequency calculated above, and the TF-IDF weight two-dimensional matrix ω[i][j] of each feature word across the different file sets can then be obtained.
In this embodiment, according to the TF-IDF weight two-dimensional matrix of each feature word in the feature vector of each set of question-answer data obtained by the above calculation, non-important feature words (those whose TF-IDF value is smaller than a preset threshold) may be filtered out, so that the important feature words are retained.
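As a rough, non-authoritative sketch of the TF-IDF weight matrix described above (Python; the records are assumed to be already tokenized, and the filtering threshold value is illustrative only):

```python
import math
from collections import Counter

def tfidf_matrix(docs, vocab, min_weight=0.01):
    """Build the weight matrix w[i][j] for feature word i in file set j.

    docs:  list of token lists, one per set of question-answer data.
    vocab: list of feature words (the i dimension).
    Weights below min_weight are zeroed, i.e. non-important feature words are filtered out.
    """
    D = len(docs)
    counts = [Counter(doc) for doc in docs]
    df = {t: sum(1 for c in counts if t in c) for t in vocab}  # number of file sets containing t
    w = [[0.0] * D for _ in vocab]
    for j, c in enumerate(counts):
        total = sum(c.values()) or 1
        for i, t in enumerate(vocab):
            tf = c[t] / total                                  # tf_ij = n_ij / sum_k n_kj
            idf = math.log(D / df[t]) if df[t] else 0.0        # idf_i = log(|D| / |{j : t_i in d_j}|)
            weight = tf * idf
            w[i][j] = weight if weight >= min_weight else 0.0  # keep only the important feature words
    return w
```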
In this embodiment, according to the TF-IDF weight two-dimensional matrix of each feature word in the question-answer data of each group, performing cluster analysis by using a K-means clustering algorithm, to obtain a cluster result may include the following steps:
Step 300, setting a clustering convergence threshold Delta and a maximum iteration number N.
Step 301, randomly selecting a group of question-answer data from the question-answer knowledge set as an initial clustering center point.
Step 302, calculating the shortest distance between each row in ω[i][j] and the currently existing cluster centers (i.e., the Euclidean distance to the nearest cluster center), denoted D(x). Each row in ω[i][j] represents a sample, i.e., one set of question-answer data.
Then the probability that each sample point is selected as the next cluster center is calculated, and the next cluster center is selected by the roulette-wheel method. The roulette-wheel method is a proportional selection method whose basic idea is that the probability of an individual being selected is proportional to its fitness value.
Step 303, repeating step 302 until K cluster centers are selected.
Step 304, taking the selected K cluster centers as initial values, calculating the distance from each remaining row in ω[i][j] to the K initial cluster centers, and assigning each point to one of the K centers according to the minimum distance. The minimum distance is recorded in an array A[k][m], where k is the index of the cluster center and m is the index of the sample point assigned to that cluster center.
Step 305, recalculating the centroids of the K clusters. The newly generated K centroids are compared with the cluster centers calculated in the previous iteration; if the distances are all smaller than the convergence threshold Delta, the process has converged. The average distance a_k from all sample points in each cluster to its center is then calculated from A[k][m], the distance between any two cluster centers is calculated, and whether that distance is smaller than the average intra-cluster distances of the two clusters is judged; if so, the two clusters are merged. Having converged, the algorithm then ends.
Otherwise, whether the number of iterations exceeds N is checked. If it does, the value of K is judged to have been chosen unreasonably and only a partial, locally optimal solution has been obtained; the whole procedure is then recomputed with a larger K so that it converges more easily. The cluster number is updated to K = K + K/2 and step 301 is re-performed, until the algorithm ends.
If the number of iterations is less than or equal to N and convergence has occurred, whether there are clusters that need to be merged can be further judged; if so, they are merged and the algorithm ends after the merging.
In this embodiment, two clusters that are "close" to each other are merged, and when the clustering fails to converge within N iterations, the K value is enlarged to adjust the cluster centers and obtain more reasonable cluster center values.
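The initialization-and-clustering loop of steps 300–305 could be sketched, in simplified and non-limiting form, as follows (Python/NumPy). The roulette-wheel probability used here (proportional to D(x)) and the treatment of the merge step are assumptions for illustration only:

```python
import numpy as np

def choose_initial_centers(samples, K, rng):
    """Steps 301-303: roulette-wheel selection of K initial cluster centers."""
    centers = [samples[rng.integers(len(samples))]]              # one random sample to start
    while len(centers) < K:
        d = np.min([np.linalg.norm(samples - c, axis=1) for c in centers], axis=0)
        p = d / d.sum() if d.sum() > 0 else np.full(len(samples), 1.0 / len(samples))
        centers.append(samples[rng.choice(len(samples), p=p)])   # probability proportional to D(x)
    return np.array(centers)

def kmeans_with_restart(samples, K, delta=1e-4, max_iter=50, seed=0):
    """Steps 304-305: assign, recompute centroids, converge or enlarge K and restart."""
    rng = np.random.default_rng(seed)
    centers = choose_initial_centers(samples, K, rng)
    for _ in range(max_iter):
        dists = np.linalg.norm(samples[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)                            # assign each row to the nearest center
        new_centers = np.array([samples[labels == k].mean(axis=0) if np.any(labels == k)
                                else centers[k] for k in range(K)])
        moved = np.linalg.norm(new_centers - centers, axis=1).max()
        centers = new_centers
        if moved < delta:                                        # converged within the threshold Delta
            # a full implementation would also merge clusters whose centers are closer than
            # the average intra-cluster distances, as described in step 305
            return centers, labels
    # not converged within N iterations: enlarge K (K = K + K/2) and recompute from scratch
    return kmeans_with_restart(samples, K + K // 2, delta, max_iter, seed)
```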
In one embodiment, determining the cluster label of each cluster according to the clustering result may include arranging the feature words included in the target cluster in descending order according to the TF-IDF values of the feature words, and taking a preset number of top-ranked feature words as the cluster label of the target cluster.
In this embodiment, since the TF-IDF value may be used to evaluate the importance of a word to a document in a document set or corpus, the feature words included in the target cluster may be arranged in descending order according to their TF-IDF values, and a preset number of top-ranked feature words may be used as the cluster label of the target cluster; the cluster label of each cluster may be obtained in this manner.
In this embodiment, the preset number may be a positive integer, for example, 1,3, 6, etc., and may be specifically determined according to practical situations, which is not limited in the embodiment of the present specification.
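A minimal sketch of this label selection (Python; the weight matrix w and word list vocab follow the structure of the earlier TF-IDF sketch, and aggregating per-cluster weights by summation is an assumption, not a requirement of the embodiments):

```python
def cluster_label(member_ids, w, vocab, top_n=3):
    """Return the top_n feature words of a cluster, ranked by TF-IDF weight, as its label.

    member_ids: column indices j of the question-answer records assigned to the cluster.
    w:          TF-IDF weight matrix, w[i][j] for feature word i in file set j.
    """
    totals = {vocab[i]: sum(w[i][j] for j in member_ids) for i in range(len(vocab))}
    ranked = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)  # descending TF-IDF
    return [word for word, score in ranked[:top_n] if score > 0]

# A cluster about lending might, for example, yield a label such as ["loan", "interest", "repayment"].
```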
In one embodiment, after the classification result of the question-answer knowledge set is obtained based on the clustering result and the clustering labels of the clusters, the method can further comprise displaying the classification result of the question-answer knowledge set in a retrieval interface.
In this embodiment, the classification result of the question-answer knowledge set may be displayed through the front-end page, so as to provide a classification function to provide a query service for a user or a manual customer service.
In this embodiment, the classification result may include a plurality of classifications, and the name of each classification may be represented by the cluster label of the corresponding cluster. In some embodiments, only the names of the classifications may be displayed in the search interface, or the names of the classifications and corresponding representative question-answer data may be displayed together, so that the user or the human customer service can understand the meaning of each classification more intuitively and clearly. Of course, the manner of displaying the classification result is not limited to the above examples, and other modifications are possible by those skilled in the art in light of the technical spirit of the embodiments of the present disclosure; as long as the functions and effects achieved are the same as or similar to those of the embodiments of the present disclosure, they shall fall within the protection scope of the embodiments of the present disclosure.
Based on the same inventive concept, the embodiments of the present disclosure further provide a data query device, as described in the following embodiments. Because the principle of the data query device for solving the problem is similar to that of the data query method, the implementation of the data query device can refer to the implementation of the data query method, and the repetition is not repeated. As used below, the term "unit" or "module" may be a combination of software and/or hardware that implements the intended function. While the means described in the following embodiments are preferably implemented in software, implementation in hardware, or a combination of software and hardware, is also possible and contemplated. Fig. 2 is a block diagram of a data query device according to an embodiment of the present disclosure, and as shown in fig. 2, may include a receiving module 201, a preprocessing module 202, a first determining module 203, and a second determining module 204, and the structure will be described below.
The receiving module 201 may be configured to receive a target text input by a target user;
the preprocessing module 202 may be configured to preprocess the target text to obtain a feature vector of the target text;
the first determining module 203 may be configured to determine a target cluster label according to the feature vector of the target text, where the target cluster label is used to represent the class to which the target text belongs;
the second determining module 204 may be configured to determine a query result corresponding to the target text based on the feature vector of the target text and the target cluster label.
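Purely as a structural illustration of FIG. 2 (Python; the helper functions preprocess, assign_cluster and query_in_cluster are the hypothetical sketches given earlier in this description, and the vectorize callable is likewise an assumption):

```python
class DataQueryApparatus:
    """Composes the receiving, preprocessing, first determining and second determining modules."""

    def __init__(self, stopwords, centers, labels, cluster_index, vectorize):
        self.stopwords = stopwords          # maintained stop-word list
        self.centers = centers              # cluster centers from the offline clustering
        self.labels = labels                # cluster label attached to each center
        self.cluster_index = cluster_index  # cluster label -> question-answer records
        self.vectorize = vectorize          # callable turning feature words into a feature vector

    def query(self, target_text):
        # receiving module 201: accept the target text input by the target user
        # preprocessing module 202: word segmentation and stop-word removal
        vector = self.vectorize(preprocess(target_text, self.stopwords))
        # first determining module 203: nearest cluster center -> target cluster label
        label = self.labels[assign_cluster(vector, self.centers)]
        # second determining module 204: query within the range determined by the label
        return query_in_cluster(vector, label, self.cluster_index)
```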
The embodiment of the present disclosure further provides an electronic device, specifically may refer to a schematic diagram of a composition structure of the electronic device based on the data query method provided in the embodiment of the present disclosure shown in fig. 3, where the electronic device may specifically include an input device 31, a processor 32, and a memory 33. Wherein the input device 31 may be used for inputting target text in particular. The processor 32 may specifically be configured to receive a target text input by a target user, pre-process the target text to obtain a feature vector of the target text, determine a target cluster tag according to the feature vector of the target text, where the target cluster tag is used to represent a class to which the target text belongs, and determine a query result corresponding to the target text based on the feature vector of the target text and the target cluster tag. The memory 33 may be specifically configured to store parameters such as feature vectors of the target text, target cluster labels, and the like.
In this embodiment, the input device may specifically be one of the main apparatuses for exchanging information between the user and the computer system. The input device may include a keyboard, a mouse, a camera, a scanner, a light pen, a handwriting input board, a voice input means, and the like, used to input raw data and the programs that process those data into the computer. The input device may also acquire and receive data transmitted from other modules, units, and devices. The processor may be implemented in any suitable manner. For example, the processor may take the form of a microprocessor or processor together with a computer-readable medium storing computer-readable program code (e.g., software or firmware) executable by the (micro)processor, logic gates, switches, an Application Specific Integrated Circuit (ASIC), a programmable logic controller, an embedded microcontroller, and the like. The memory may in particular be a memory device for storing information in modern information technology. The memory may comprise multiple levels: in a digital system, anything that can store binary data may serve as memory; in an integrated circuit, a circuit with a storage function but no physical form, such as a RAM or a FIFO, is also called memory; and in a system, a storage device in physical form, such as a memory bank or a TF card, is also called memory.
In this embodiment, the specific functions and effects of the electronic device may be explained in comparison with other embodiments, which are not described herein.
The embodiment of the specification also provides a computer storage medium based on the data query method, wherein the computer storage medium stores computer program instructions, and the computer program instructions can be used for receiving target text input by a target user, preprocessing the target text to obtain a feature vector of the target text, determining a target cluster label according to the feature vector of the target text, wherein the target cluster label is used for representing classification of the target text, and determining a query result corresponding to the target text based on the feature vector of the target text and the target cluster label.
In the present embodiment, the storage medium includes, but is not limited to, a Random Access Memory (RAM), a Read-Only Memory (ROM), a Cache, a Hard Disk Drive (HDD), or a Memory Card. The memory may be used to store computer program instructions. The network communication unit may be an interface for performing network connection communication, set in accordance with a standard prescribed by a communication protocol.
In this embodiment, the functions and effects of the program instructions stored in the computer storage medium may be explained in comparison with other embodiments, and are not described herein.
It will be apparent to those skilled in the art that the modules or steps of the embodiments described above may be implemented in a general purpose computing device, they may be concentrated on a single computing device, or distributed across a network of computing devices, they may alternatively be implemented in program code executable by computing devices, so that they may be stored in a storage device for execution by computing devices, and in some cases, the steps shown or described may be performed in a different order than herein, or they may be separately fabricated into individual integrated circuit modules, or multiple modules or steps within them may be fabricated into a single integrated circuit module. Thus, embodiments of the present specification are not limited to any specific combination of hardware and software.
Although the present description provides the method operational steps as described in the above embodiments or flowcharts, more or fewer operational steps may be included in the method, either on a routine or non-inventive basis. In steps where there is logically no necessary causal relationship, the execution order of the steps is not limited to the execution order provided in the embodiments of the present specification. The described methods, when performed in an actual apparatus or an end product, may be performed sequentially or in parallel (e.g., in a parallel processor or multithreaded environment) as shown in the embodiments or figures.
It is to be understood that the above description is intended to be illustrative, and not restrictive. Many embodiments and many applications other than the examples provided will be apparent to those of skill in the art upon reading the above description. The scope of the embodiments of the specification should, therefore, be determined not with reference to the above description, but instead should be determined with reference to the appended claims, along with the full scope of equivalents to which such claims are entitled.
The above description is only of the preferred embodiments of the present embodiments and is not intended to limit the present embodiments, and various modifications and variations can be made to the present embodiments by those skilled in the art. Any modification, equivalent replacement, improvement, or the like made within the spirit and principles of the embodiments of the present specification should be included in the protection scope of the embodiments of the present specification.
Claims (8)
1. A method of querying data, comprising:
Receiving target text input by a target user;
Preprocessing the target text to obtain a feature vector of the target text;
determining a target cluster label according to the feature vector of the target text, wherein the target cluster label is used for representing the classification of the target text;
Determining a query result corresponding to the target text based on the feature vector of the target text and the target cluster tag;
Wherein, before the determining a target cluster label according to the feature vector of the target text, the method further comprises: obtaining a question-answer knowledge set from a target database; preprocessing each group of question-answer data in the question-answer knowledge set to obtain feature vectors of the question-answer data; calculating a TF-IDF weight two-dimensional matrix of each feature word in the feature vector of each group of question-answer data; performing cluster analysis by using a K-means clustering algorithm according to the TF-IDF weight two-dimensional matrix of each feature word in the feature vector of each group of question-answer data to obtain a clustering result, wherein the clustering result comprises a plurality of cluster centers and each cluster center corresponds to one cluster; determining a cluster label of each cluster according to the clustering result; and obtaining a classification result of the question-answer knowledge set based on the clustering result and the cluster labels of the clusters;
The clustering result is determined by: setting a clustering convergence threshold Delta and a maximum iteration number N; randomly selecting a group of question-answer data from the question-answer knowledge set as an initial cluster center point; calculating the shortest distance between each row in the TF-IDF weight two-dimensional matrix of each feature word and the currently existing cluster centers, calculating the probability that each sample point is selected as the next cluster center, and selecting the next cluster center according to the roulette-wheel method; repeating the selection of cluster centers until K cluster centers are selected; taking the selected K cluster centers as initial values, calculating the distance from each remaining row in the TF-IDF weight two-dimensional matrix of each feature word to the K initial cluster centers, assigning the corresponding question-answer data to one of the K center points according to the minimum distance, and storing the minimum distance into an array; recalculating the centroids of the K clusters and comparing the newly generated K centroids with the cluster centers calculated in the previous iteration; if the distances are smaller than the convergence threshold Delta, determining that the process has converged, calculating from the array the average distance from all sample points in each cluster to its center, calculating the distance between any two cluster centers, judging whether that distance is smaller than the average intra-cluster distances of the two clusters, merging the two clusters if so, and ending the algorithm; otherwise, judging whether the number of iterations is larger than N; if so, updating the cluster number to K = K + K/2 and restarting the selection of cluster centers until the algorithm ends; and if the number of iterations is less than or equal to N and convergence has occurred, further judging whether there are clusters that need to be merged, merging them if so, and ending the algorithm after the merging.
2. The method of claim 1, wherein preprocessing the target text to obtain feature vectors of the target text comprises:
word segmentation is carried out on the target text to obtain a target word segmentation set;
And removing stop words in the target word segmentation set to obtain the feature vector of the target text.
3. The method of claim 1, wherein preprocessing each set of question-answer data in the question-answer knowledge set to obtain feature vectors of multiple sets of question-answer data, comprises:
the question text and the answer text in each group of question and answer data are segmented to obtain a segmentation set of each group of question and answer data;
And removing the stop words in the word segmentation set of each group of question-answer data to obtain the feature vector of each group of question-answer data.
4. The method of claim 1, wherein determining cluster labels for each cluster based on the clustering results comprises:
arranging the feature words contained in the target cluster in descending order according to the TF-IDF values of the feature words;
and taking a preset number of top-ranked feature words as the cluster label of the target cluster.
5. The method of claim 1, further comprising, after deriving the classification result of the question-answer knowledge set based on the clustering result and the cluster labels of the respective clusters:
and displaying the classification result of the question-answer knowledge set in a retrieval interface.
6. A data query device, comprising:
the receiving module is used for receiving target text input by a target user;
The preprocessing module is used for preprocessing the target text to obtain a feature vector of the target text;
the first determining module is used for determining a target cluster label according to the feature vector of the target text, wherein the target cluster label is used for representing the classification of the target text;
the second determining module is used for determining a query result corresponding to the target text based on the feature vector of the target text and the target cluster label;
The device is further used for: obtaining a question-answer knowledge set from a target database, wherein the question-answer knowledge set comprises a plurality of groups of question-answer data and each group of question-answer data comprises a question text and an answer text; preprocessing each group of question-answer data in the question-answer knowledge set to obtain feature vectors of the plurality of groups of question-answer data; calculating a TF-IDF weight two-dimensional matrix of each feature word in the feature vector of each group of question-answer data; performing cluster analysis by using a K-means clustering algorithm according to the TF-IDF weight two-dimensional matrix of each feature word in the feature vector of each group of question-answer data to obtain a clustering result, wherein the clustering result comprises a plurality of cluster centers and each cluster center corresponds to one cluster; determining a cluster label of each cluster according to the clustering result; and obtaining a classification result of the question-answer knowledge set based on the clustering result and the cluster labels of the clusters;
The clustering result is determined by: setting a clustering convergence threshold Delta and a maximum iteration number N; randomly selecting a group of question-answer data from the question-answer knowledge set as an initial cluster center point; calculating the shortest distance between each row in the TF-IDF weight two-dimensional matrix of each feature word and the currently existing cluster centers, calculating the probability that each sample point is selected as the next cluster center, and selecting the next cluster center according to the roulette-wheel method; repeating the selection of cluster centers until K cluster centers are selected; taking the selected K cluster centers as initial values, calculating the distance from each remaining row in the TF-IDF weight two-dimensional matrix of each feature word to the K initial cluster centers, assigning the corresponding question-answer data to one of the K center points according to the minimum distance, and storing the minimum distance into an array; recalculating the centroids of the K clusters and comparing the newly generated K centroids with the cluster centers calculated in the previous iteration; if the distances are smaller than the convergence threshold Delta, determining that the process has converged, calculating from the array the average distance from all sample points in each cluster to its center, calculating the distance between any two cluster centers, judging whether that distance is smaller than the average intra-cluster distances of the two clusters, merging the two clusters if so, and ending the algorithm; otherwise, judging whether the number of iterations is larger than N; if so, updating the cluster number to K = K + K/2 and restarting the selection of cluster centers until the algorithm ends; and if the number of iterations is less than or equal to N and convergence has occurred, further judging whether there are clusters that need to be merged, merging them if so, and ending the algorithm after the merging.
7. A data querying device, comprising a processor and a memory for storing processor-executable instructions, which when executed by the processor implement the steps of the method of any of claims 1 to 5.
8. A computer readable storage medium having stored thereon computer instructions which when executed implement the steps of the method of any of claims 1 to 5.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110762089.5A CN113407700B (en) | 2021-07-06 | 2021-07-06 | A data query method, device and equipment |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113407700A CN113407700A (en) | 2021-09-17 |
CN113407700B true CN113407700B (en) | 2024-12-31 |
Family
ID=77685216
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110762089.5A Active CN113407700B (en) | 2021-07-06 | 2021-07-06 | A data query method, device and equipment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113407700B (en) |
Families Citing this family (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114416975A (en) * | 2021-12-21 | 2022-04-29 | 航天信息股份有限公司 | System and method for determining text similarity |
CN115292351A (en) * | 2022-08-11 | 2022-11-04 | 中国科学院微电子研究所 | Financial business data processing method, device and storage medium |
CN116069998A (en) * | 2022-12-16 | 2023-05-05 | 北京中电飞华通信有限公司 | Data resource processing method and device, electronic equipment and storage medium |
CN116738493B (en) * | 2023-08-15 | 2024-02-09 | 广州淘通科技股份有限公司 | Data encryption storage method and device based on classification category |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103810218A (en) * | 2012-11-14 | 2014-05-21 | 北京百度网讯科技有限公司 | Problem cluster-based automatic asking and answering method and device |
Family Cites Families (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105787097A (en) * | 2016-03-16 | 2016-07-20 | 中山大学 | Distributed index establishment method and system based on text clustering |
CN109885651B (en) * | 2019-01-16 | 2024-06-04 | 平安科技(深圳)有限公司 | Question pushing method and device |
CN109766428B (en) * | 2019-02-02 | 2021-05-28 | 中国银行股份有限公司 | Data query method and equipment and data processing method |
CN112925912B (en) * | 2021-02-26 | 2024-01-12 | 北京百度网讯科技有限公司 | Text processing method, synonymous text recall method and apparatus |
- 2021-07-06: application CN202110762089.5A filed in China (CN); resulting patent CN113407700B, status: Active
Also Published As
Publication number | Publication date |
---|---|
CN113407700A (en) | 2021-09-17 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN113407700B (en) | A data query method, device and equipment | |
CN108804641B (en) | Text similarity calculation method, device, equipment and storage medium | |
WO2020192401A1 (en) | System and method for generating answer based on clustering and sentence similarity | |
US8788503B1 (en) | Content identification | |
US20080281764A1 (en) | Machine Learning System | |
CN110516074B (en) | Website theme classification method and device based on deep learning | |
CN111984792A (en) | Website classification method and device, computer equipment and storage medium | |
CN112231555A (en) | Recall method, apparatus, device and storage medium based on user portrait label | |
CN105183831A (en) | Text classification method for different subject topics | |
AU2022204712B2 (en) | Extracting content from freeform text samples into custom fields in a software application | |
CN110347791A (en) | A kind of topic recommended method based on multi-tag classification convolutional neural networks | |
CN113535960A (en) | Text classification method, device and equipment | |
CN114781611A (en) | Natural language processing method, language model training method and related equipment | |
CN116887201A (en) | Intelligent short message pushing method and system based on user analysis | |
CN113705215A (en) | Meta-learning-based large-scale multi-label text classification method | |
CN111104422A (en) | Training method, device, equipment and storage medium of data recommendation model | |
CN117291192B (en) | Government affair text semantic understanding analysis method and system | |
US12361027B2 (en) | Iterative sampling based dataset clustering | |
CN111125329B (en) | Text information screening method, device and equipment | |
CN111259117B (en) | Short text batch matching method and device | |
CN115114493A (en) | Method and device for realizing intelligent question answering system based on question matching | |
CN117009596A (en) | Identification method and device for power grid sensitive data | |
WO2018100700A1 (en) | Data conversion device and data conversion method | |
CN115310564B (en) | Classification label updating method and system | |
CN117725555B (en) | Multi-source knowledge tree association fusion method and device, electronic equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |