CN118733844B - A general data fuzzy search and matching method based on FuzzyWuzzy algorithm - Google Patents
- Publication number
- CN118733844B (CN202411230352.6A)
- Authority
- CN
- China
- Prior art keywords
- dimension
- user
- search
- result set
- similarity
- Prior art date
- Legal status
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/903—Querying
- G06F16/90335—Query processing
- G06F16/90344—Query processing by using string matching techniques
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/903—Querying
- G06F16/90335—Query processing
- G06F16/90348—Query processing by searching ordered data, e.g. alpha-numerically ordered data
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/22—Matching criteria, e.g. proximity measures
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Computational Linguistics (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Evolutionary Computation (AREA)
- Evolutionary Biology (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Bioinformatics & Computational Biology (AREA)
- Artificial Intelligence (AREA)
- Life Sciences & Earth Sciences (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Input From Keyboards Or The Like (AREA)
Abstract
The application provides a universal data fuzzy search matching method based on the FuzzyWuzzy algorithm and relates to the field of search matching. The method comprises: receiving keywords and dimension parameters input by a user, analyzing the dimension parameters, and determining a priority; matching the keywords against a database according to the priority to generate a candidate result set, and reordering the candidate result set according to the dimension parameters to obtain a final result set; and performing a dimension check on the final result set, returning the data if the check passes, and re-executing the matching process or reporting an error if it does not.
Description
Technical Field
The application belongs to the technical field of search matching, and particularly relates to a universal data fuzzy search matching method based on FuzzyWuzzy algorithm.
Background
With the rapid development of information technology, enterprise digital transformation has become an irreversible trend, and the functionality and usability of an enterprise-level cloud platform serving as a core facility for supporting various business applications directly influence the operation efficiency of enterprises. The GSCloud enterprise digital cloud platform is used as a comprehensive management tool for serving enterprise users, and provides a series of services covering functions such as data management, flow control, report generation and the like. In order to further optimize the user experience, the platform is particularly integrated with powerful search functionality, enabling the user to quickly locate desired information through fuzzy searches. However, with the progress of cloud computing technology and the enhancement of big data analysis capability, the demands of enterprises on search functions are also increased. Conventional exact matching approaches have not been adequate to address the increasing demand, and fuzzy searches are particularly important because they provide a better understanding and satisfaction of the user's actual intent, requiring the system to handle not only standard queries, but also nonstandard inputs such as misspellings, synonyms, and abbreviations.
Currently, the fuzzy search function on the GSCloud platform is mainly implemented in SQL, using the LIKE operator to perform fuzzy matching. However, when the query pattern contains wildcards (in particular a leading wildcard), a LIKE query tends to cause a full table scan instead of using an index, which significantly increases response time on large data sets and complex queries. In some advanced application scenarios the platform instead compares data directly at the application level; although this can speed up responses in some cases, it brings high memory consumption and maintenance cost. Furthermore, if the search weights or the matching algorithm are not properly configured, the search results may not be accurate enough, and search quality degrades in particular when user input takes diverse forms.
Disclosure of Invention
The application provides a universal data fuzzy search matching method based on the FuzzyWuzzy algorithm, which aims to solve the problems of slow response, high memory consumption and inaccurate results when processing large-scale data queries.
The technical scheme adopted by the application is as follows:
An embodiment of the application provides a universal data fuzzy search matching method based on the FuzzyWuzzy algorithm, which comprises the following steps:
receiving keywords and dimension parameters input by a user, analyzing the dimension parameters, and determining a priority;
matching the keywords against a database according to the priority to generate a candidate result set, and reordering the candidate result set according to the dimension parameters to obtain a final result set;
performing a dimension check on the final result set, returning the data if the check passes, and re-executing the matching process or reporting an error if it does not.
The universal data fuzzy search matching method based on the FuzzyWuzzy algorithm provided by the application also has the following additional technical features. Receiving the keywords and dimension parameters input by a user, analyzing the dimension parameters, and determining the priority specifically comprises:
receiving a search character set, dimension information and a dimension order input by the user; determining, from the dimension information, the importance of each dimension and its role in the matching process; and sorting the dimensions by priority according to the dimension order.
According to an embodiment of the present application, matching the keywords against the database according to the priority to generate a candidate result set, and reordering the candidate result set according to the dimension parameters to obtain a final result set, specifically comprises:
splitting the search character set input by the user into a plurality of parts; matching each part against a target character set using the FuzzyWuzzy algorithm to generate a series of result sets; merging all the result sets and removing duplicates to form a preliminary candidate result set; and reordering the candidate result set according to the previously determined dimension information to obtain the final result set.
According to one embodiment of the application, performing the dimension check on the final result set specifically comprises checking the correctness and relevance of the matching results against the dimension parameters. If the results are correct, the data is prepared for return; if the results are incorrect, insufficiently similar, or irrelevant to the searched content, the matching process is re-executed or error information is returned.
According to one embodiment of the application, returning the data specifically means returning the ordered final result set to the user.
According to one embodiment of the application, returning the error information specifically means that, if any problem is found during the search, corresponding error information is returned to inform the user of the problem or to suggest re-entering the search conditions.
According to one embodiment of the present application, the FuzzyWuzzy algorithm is used to evaluate the similarity between two strings, and the FuzzyWuzzy algorithm includes an edit distance algorithm and Jaccard similarity.
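As a rough illustration of the two measures named above, the following Python sketch computes a Levenshtein edit distance and a word-level Jaccard similarity. The function names, the 0-100 normalisation, and the use of word sets for the Jaccard measure are assumptions made for this example; they are not taken from the FuzzyWuzzy library or from the application.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance by dynamic programming (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or keep when equal)
            ))
        previous = current
    return previous[-1]


def edit_similarity(a: str, b: str) -> float:
    """Map the edit distance onto a 0-100 scale (illustrative normalisation)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 100.0
    return 100.0 * (1 - levenshtein(a, b) / longest)


def jaccard_similarity(a: str, b: str) -> float:
    """Jaccard similarity of the two strings' word sets, on a 0-100 scale."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    if not set_a and not set_b:
        return 100.0
    return 100.0 * len(set_a & set_b) / len(set_a | set_b)
```

The edit-distance measure tolerates misspellings inside a word, while the Jaccard measure ignores word order; a combined score would need a weighting that the application does not specify.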
A computer program product containing instructions that, when run on a device, cause the device to perform the steps of the universal data fuzzy search matching method based on the FuzzyWuzzy algorithm.
A computer readable storage medium having stored thereon a program which, when executed by a processor, performs the steps of the universal data fuzzy search matching method based on the FuzzyWuzzy algorithm.
An electronic device comprising a memory, a processor and a program stored on the memory and executable on the processor, the processor implementing the steps of the universal data fuzzy search matching method based on the FuzzyWuzzy algorithm when the program is executed.
By adopting the technical scheme, the application has the following beneficial effects:
1. By introducing the FuzzyWuzzy algorithm, the similarity between two character strings can be evaluated using several similarity measures (such as edit distance and Jaccard similarity), so that content matching the user input can be identified more accurately in mass data. Compared with traditional exact matching, the method can better understand and meet the actual intent of the user, and more accurate search results can be obtained even if the user input contains misspellings, synonyms, abbreviations and the like.
2. The technical scheme allows the user to customize the dimension parameters and the dimension sequence, so that the search strategy can be flexibly adjusted according to specific requirements. The flexibility enables the system to adapt to diversified and complicated input forms, provides personalized search experience for users, and enhances user satisfaction.
3. The FuzzyWuzzy algorithm can still maintain a relatively fast matching speed when processing a large amount of data, which addresses the full-table scans caused by the traditional SQL LIKE operator. Through priority sorting, the system can screen out a high-quality result set in a short time, reducing unnecessary consumption of computing resources, shortening response time, and improving the overall efficiency of the system.
4. By dimension checking the final result set, the correctness and relevance of the returned data are ensured. If any problem is found, the system can automatically re-execute the matching process or report errors, so that misjudgment caused by the quality problem of the data is avoided. This mechanism improves the stability and reliability of the system.
5. The method not only improves the searching function, but also indirectly promotes the modernization process of enterprise data management. Through optimizing the search function, the system can process data more efficiently, help enterprises to better utilize information resources, support decision making and improve operation efficiency.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this specification, illustrate embodiments of the application and together with the description serve to explain the application and do not constitute a limitation on the application. In the drawings:
Fig. 1 is a flowchart of a universal data fuzzy search matching method based on FuzzyWuzzy algorithm according to an embodiment of the present application.
Detailed Description
In order to more clearly illustrate the general inventive concept, a detailed description is given below by way of example with reference to the accompanying drawings.
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present application, but the present application may be practiced in other ways than those described herein, and therefore the scope of the present application is not limited to the specific embodiments disclosed below. It should be noted that, without conflict, embodiments of the present application and features in each embodiment may be combined with each other.
In the present application, unless explicitly specified and limited otherwise, the terms "mounted," "connected," "secured," and the like are to be construed broadly, and may be, for example, fixedly connected, detachably connected, or integrally formed, mechanically connected, electrically connected, or in communication, directly connected, or indirectly connected via an intervening medium, or in communication between two elements or in an interaction relationship between two elements. The specific meaning of the above terms in the present application can be understood by those of ordinary skill in the art according to the specific circumstances.
In the present application, unless expressly stated or limited otherwise, a first feature "up" or "down" a second feature may be the first and second features in direct contact, or the first and second features in indirect contact via an intervening medium. In the description of the present specification, a description referring to terms "one embodiment," "some embodiments," "examples," "specific examples," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present application. In this specification, schematic representations of the above terms are not necessarily directed to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
Example 1
As shown in Fig. 1, a universal data fuzzy search matching method based on the FuzzyWuzzy algorithm includes:
Step S100: receiving keywords and dimension parameters input by a user, analyzing the dimension parameters, and determining the priority.
Specifically, receiving a keyword input by a user:
Keyword: the information that the user wishes to obtain by searching, such as "financial report". The keywords entered by the user may be individual words or phrases, and the system needs to capture this input accurately as the basis for the search.
Example: the user enters "financial report", and the system needs to identify and save this search string.
Receiving dimension parameters:
Dimension parameters: these provide additional guidance for the search, helping the system to better understand and process the user's query. They may include, but are not limited to, a similarity threshold, frequency of use, and a timestamp.
Example: the user might specify a similarity threshold of 80%, indicating that search results must be at least 80% similar; a high frequency of use, meaning that frequently mentioned content is displayed first; and a timestamp of the last week, indicating that only content updated within the last week is considered.
Analyzing dimension parameters:
Dimension information: the system needs to analyze the dimension parameters provided by the user and understand the specific meaning and effect of each dimension. For example, the "similarity" dimension may relate to the degree to which strings match, and the "frequency of use" dimension may relate to the number of times particular content appears in the data set.
Example: if the user specifies a similarity threshold of 80%, the system needs to ensure that the similarity of the matching results is not below this threshold.
Determining priority:
Dimension order: the dimension parameters entered by the user may contain a dimension order, indicating which dimensions the system should prioritize during matching.
Example: the user may specify that similarity takes priority over frequency of use, meaning that the system first matches according to similarity and then considers frequency of use where the similarity is the same.
The specific operation is as follows:
Receiving keywords:
The user enters a search keyword, such as "financial report," through the interface, which the system stores as a search character set.
Receiving dimension information:
The user simultaneously inputs or selects dimension information, such as 80% similarity threshold, high frequency of use, time stamp of the last week, which the system stores as dimension information.
Receiving dimension order:
The user specifies a dimension order, e.g., similarity takes precedence over frequency of use, which the system stores as the dimension order.
Analyzing dimension parameters:
The system analyzes the incoming dimension parameters, such as:
Similarity threshold: the system identifies that the user wishes the matching results to be at least 80% similar.
Frequency of use: the system identifies that the user wishes frequently occurring content to be displayed first.
Timestamp: the system identifies that the user wishes to consider only updates from the last week.
Determining priority:
According to the dimension sequence, the system ranks the dimensions according to priority:
similarity takes precedence over frequency of use.
In the case where the similarity is the same, the content with high frequency of use is prioritized.
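The inputs described above (search character set, dimension information, dimension order) could be held in a small structure such as the following Python sketch. The class name, field names, and the rule that unordered dimensions are appended after the ordered ones are all assumptions made for illustration; the application does not prescribe a data model.

```python
from dataclasses import dataclass, field


@dataclass
class SearchRequest:
    """Hypothetical container for the three user inputs."""
    keywords: str                                    # search character set
    dimensions: dict = field(default_factory=dict)   # dimension information
    order: list = field(default_factory=list)        # dimension order


def prioritized_dimensions(request: SearchRequest) -> list:
    """Return dimension names from highest to lowest priority.

    Dimensions named in `order` come first, in that order. Any remaining
    dimensions keep their input order and follow afterwards (an assumption).
    """
    unknown = [name for name in request.order if name not in request.dimensions]
    if unknown:
        raise ValueError(f"order refers to undefined dimensions: {unknown}")
    ranked = list(dict.fromkeys(request.order))  # drop repeats, keep order
    rest = [name for name in request.dimensions if name not in ranked]
    return ranked + rest


request = SearchRequest(
    keywords="financial report",
    dimensions={"similarity": 80, "usage_frequency": "high", "timestamp_days": 7},
    order=["similarity", "usage_frequency"],
)
print(prioritized_dimensions(request))
# ['similarity', 'usage_frequency', 'timestamp_days']
```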
In an implementation of the invention, receiving keywords input by a user:
Assume that a user wants to search for information about "financial reports" on the GSCloud enterprise digital cloud platform.
Keywords: the search term entered by the user is "financial report".
Receiving dimension parameters
The user may select or input a number of dimensional parameters to further refine the search results. Dimensional parameters include, but are not limited to:
Similarity: the user wishes the similarity between the search results and the entered keywords to reach a certain threshold.
Frequency of use: the user may be concerned with how often certain keywords appear in a document.
Timestamp: the user may wish the search results to be the most recently updated documents.
For example, the user may input the following dimension parameters:
Similarity: at least 80%.
Frequency of use: documents referenced more often in the past year.
Timestamp: updated in the last month.
Analyzing dimensional parameters
The system needs to analyze the dimension parameters input by the user, and determine the specific meaning of each dimension and the roles in the matching process. For example:
Similarity: the system uses the FuzzyWuzzy algorithm to calculate the similarity between strings and ensures that the similarity of search results is not less than 80%.
Frequency of use: the system counts the number of times each document was referenced in the past year and prioritizes documents referenced more often.
Timestamp: the system filters out documents that were not updated in the last month.
Determining priority
The user may also specify a priority order for the dimension parameters, such as:
Priority order: similarity > frequency of use > timestamp.
This means that the system considers similarity first in the matching process, further considers the frequency of use if the similarity of multiple documents is the same, and finally considers the time stamp.
Detailed operation
Receiving keywords:
the user enters the search term "financial report" through the interface.
Receiving dimension information:
the user inputs or selects dimension information through an interface, such as:
Similarity: at least 80%.
Frequency of use: documents referenced more often in the past year.
Timestamp: updated in the last month.
Receiving dimension order:
The user specifies a dimension order, e.g., similarity > frequency of use > timestamp, through the interface.
Analyzing dimension parameters:
The system analyzes the incoming dimensional parameters:
Similarity: the system recognizes that the user wishes the matching results to be at least 80% similar.
Frequency of use: the system recognizes that the user wishes documents referenced more often in the past year to be displayed first.
Timestamp: the system recognizes that the user wishes to consider only updates from the last month.
The system prioritizes the dimensions according to their order: similarity takes precedence over frequency of use; where the similarity is the same, documents with a higher frequency of use are prioritized; the timestamp is considered last.
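One way to read the rules above is as a filter followed by a multi-key sort. The Python sketch below applies a similarity threshold and a recency cutoff, then orders by similarity, reference count, and update date. The record fields, the sample documents, the fixed reference date, and the interpretation of "last month" as 30 days are hypothetical choices for this example.

```python
from datetime import date, timedelta


def rank(documents, today, min_similarity=80, max_age_days=30):
    """Filter by threshold and recency, then sort by the priority order
    similarity > frequency of use > timestamp (all descending)."""
    cutoff = today - timedelta(days=max_age_days)
    kept = [
        doc for doc in documents
        if doc["similarity"] >= min_similarity and doc["updated"] >= cutoff
    ]
    return sorted(
        kept,
        key=lambda doc: (doc["similarity"], doc["citations"], doc["updated"]),
        reverse=True,
    )


# Hypothetical sample data; similarity scores are assumed to be precomputed.
documents = [
    {"title": "Q3 financial report", "similarity": 92, "citations": 5, "updated": date(2024, 9, 20)},
    {"title": "Financial report draft", "similarity": 92, "citations": 12, "updated": date(2024, 9, 10)},
    {"title": "Finance memo", "similarity": 70, "citations": 40, "updated": date(2024, 9, 25)},
    {"title": "Old financial report", "similarity": 95, "citations": 3, "updated": date(2024, 6, 1)},
    {"title": "Financial reprot", "similarity": 85, "citations": 12, "updated": date(2024, 9, 28)},
]

ranked = rank(documents, today=date(2024, 9, 30))
print([doc["title"] for doc in ranked])
# ['Financial report draft', 'Q3 financial report', 'Financial reprot']
```

The two documents tied at 92 are separated by reference count, which is the tie-breaking behaviour the priority order describes.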
Through the series of operations, the system can accurately receive the search request of the user and determine the matching rule in the search process according to the specific requirements and preferences of the user, so that the search result which is more in line with the expectations of the user is provided.
It should be noted that, based on the above scheme, the system may automatically recommend a suitable dimensional parameter combination according to the historical search behavior and the user preference, so as to reduce the workload of manual configuration of the user.
Further, more dimensional information, such as geographic location, time stamps, user behavior tracks, etc., may also be introduced, making the search results more personalized and contextualized.
Furthermore, the semantic analysis can be performed on the search character set by combining with a natural language processing technology, and keywords and phrases are extracted, so that the search intention of a user can be more accurately understood.
Furthermore, instant feedback can be provided in the input process of the user, and the result with the highest similarity with the input content is displayed, so that the user can be helped to adjust the search strategy in time.
Further, the user may also be allowed to score or flag search results, and the system gathers such feedback for subsequent search optimization.
Further, the search history of the user may also be analyzed, recommending search dimensions and orders of possible interest based on the user's behavior patterns.
Further, possible searching intention can be predicted based on the input habit of the user, relevant dimension information is loaded in advance, and the searching process is accelerated.
Further, the user may be allowed to specify multiple data sources, and the system may be able to extract information from the data from the different sources, enriching the search results.
Further, third party service APIs, such as weather forecast, news information, etc., may also be integrated to add external information dimensions to the search results.
Step S200: matching the keywords against the database according to the priority to generate a candidate result set, and reordering the candidate result set according to the dimension parameters to obtain a final result set.
Generating candidate result sets according to priority matching
When a user enters keywords and defines relevant dimension parameters, the system first determines a priority order according to the dimension parameters. The system will then match the keywords in this order of priority, generating a preliminary set of candidate results. The specific process is as follows:
Receiving input: the system receives a search character set, dimension information (e.g., similarity, frequency of use, etc.), and a dimension order (e.g., similarity over frequency of use) entered by the user.
Determining priority: the system determines the importance of each dimension according to the dimension information and sorts the dimensions by priority according to the dimension order.
Generating a candidate result set: the system splits the search character set input by the user into a plurality of parts and matches each part against the target character set in the database using the FuzzyWuzzy algorithm, producing one result set per part. The system then merges all of these result sets and removes duplicates, forming a preliminary candidate result set.
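The split, match, merge and deduplicate sequence can be sketched in Python as follows. To keep the example dependency-free, it uses `difflib.SequenceMatcher` from the standard library as a stand-in for a FuzzyWuzzy score; the helper names, the whitespace split, the per-word comparison, and the threshold of 80 are assumptions for illustration.

```python
from difflib import SequenceMatcher


def ratio(a: str, b: str) -> int:
    """0-100 similarity from difflib, standing in for a FuzzyWuzzy score."""
    return round(100 * SequenceMatcher(None, a.lower(), b.lower()).ratio())


def best_word_score(part: str, target: str) -> int:
    """Best score of `part` against any single word of `target`."""
    return max((ratio(part, word) for word in target.split()), default=0)


def candidate_set(query: str, targets: list, threshold: int = 80) -> dict:
    """Split the query, match each part, and merge the per-part results."""
    merged = {}
    for part in query.split():
        for target in targets:
            score = best_word_score(part, target)
            if score >= threshold:
                # A target matched by several parts is kept once, with its best score.
                merged[target] = max(score, merged.get(target, 0))
    return merged


targets = ["Annual financial report", "Finacial summary", "Travel policy"]
candidates = candidate_set("financial report", targets)
print(sorted(candidates))
# ['Annual financial report', 'Finacial summary']
```

Keying the merged results by target is what removes duplicates: "Annual financial report" matches both "financial" and "report" but appears only once, while the misspelled "Finacial summary" is still picked up by the fuzzy comparison.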
Re-ordering according to dimension parameters to obtain final result set
Once the candidate result set is obtained, the next step is to reorder the candidate results according to the dimension parameters to obtain the final result set. This process is specifically as follows:
Dimension check: each record in the candidate result set is checked against the dimension parameters input by the user. The check covers the correctness and relevance of the matching result.
Sorting and screening: the candidate result set is re-sorted according to the previously determined dimension information. For example, if similarity is the highest-priority dimension, the system ranks the result set by similarity first.
Final result set: the system reorders the candidate result set according to the priority order of the dimension parameters and generates the final result set.
Assume that the user inputs the keyword "financial report" and sets the dimension parameters as "similarity" and "frequency of use", and specifies that similarity takes precedence over frequency of use. The system will first match the keyword "financial report" based on similarity to find out all documents that are similar to it. The system will then rank these similar documents by frequency of use, i.e., those documents that were referenced more in the past will be placed more forward. Eventually, the system will generate a final result set that considers both the similarity and the frequency of use and return it to the user.
In an implementation scenario of the present invention, assume that the user searches for information related to "financial reports" on the GSCloud enterprise digital cloud platform and sets two dimension parameters, similarity and frequency of use, where similarity takes precedence over frequency of use. The specific steps are as follows:
Receiving input: the user enters the search keyword "financial report" and sets the dimension parameters to similarity and frequency of use, with similarity prioritized over frequency of use.
Determining priority: the system determines the priority order from the dimension parameters set by the user, so that the matching process considers similarity first and frequency of use second.
Generating a candidate result set:
The system splits the user's search keyword "financial report" into individual words, "financial" and "report".
For each word, the system uses the FuzzyWuzzy algorithm to match against the target character set in the database. For example, for the word "financial", the system finds all document titles or content segments in the database that are similar to "financial". It is assumed here that the FuzzyWuzzy algorithm uses an edit distance algorithm to calculate similarity.
Similarly, for the word "report", the system finds all document titles or content segments that are similar to it.
The matching results for the two words are merged and duplicates are removed, forming a preliminary candidate result set.
Reordering to obtain the final result set:
Based on the previously determined dimensional parameters, i.e., similarity over frequency of use, the system first ranks each entry in the candidate result set based on similarity.
For the items with the same similarity, the items are ranked according to the use frequency.
Finally, the system returns the ordered result set to the user, which is the final result set.
Specific examples
Suppose there are the following documents in the database:
Document A: title "2020 financial report", similarity score 85%, medium frequency of use.
Document B: title "financial statement analysis", similarity score 70%, very high frequency of use.
Document C: title "2021 annual financial summary", similarity score 90%, average frequency of use.
Document D: title "financial budgeting", similarity score 75%, low frequency of use.
Sorted by similarity, the order is document C > document A > document D > document B.
Where the similarity is the same, documents are further ranked by frequency of use; in this example, however, every similarity score is unique, so the final ranking is document C > document A > document D > document B.
In the final result set, document C is ranked first because it has the highest similarity score, even though it is not the most frequently used, while document B is ranked last because it has the lowest similarity score, even though it is used most frequently. In this way, the system returns to the user the documents that best match the search intent.
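The ranking of the four example documents can be checked with a short Python sketch. The similarity scores are the ones stated for documents A to D; the integer frequency ranks are hypothetical stand-ins for the verbal frequency labels, and the rank for document A in particular is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Document:
    name: str
    similarity: int      # percent, as stated in the example
    frequency_rank: int  # hypothetical: 1 = low, 2 = medium/average, 4 = very high


documents = [
    Document("A", 85, 2),  # frequency rank assumed
    Document("B", 70, 4),
    Document("C", 90, 2),
    Document("D", 75, 1),
]

# Similarity first; frequency of use only breaks ties between equal scores.
final = sorted(documents, key=lambda d: (-d.similarity, -d.frequency_rank))
print([d.name for d in final])
# ['C', 'A', 'D', 'B']
```

Because every similarity score here is distinct, the frequency rank never influences the order: with scores of 90, 85, 75 and 70, the result is C, A, D, B.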
It should be noted that, on the basis of the above scheme, the matching result of frequent query can be saved by a caching mechanism, so as to reduce the calculation burden in each search.
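A minimal sketch of such a cache, assuming an in-memory list of titles and Python's `functools.lru_cache`; the function names, the sample titles, the threshold, and the use of `difflib` as the scoring stand-in are illustrative assumptions rather than part of the described scheme.

```python
from difflib import SequenceMatcher
from functools import lru_cache

# Hypothetical in-memory stand-in for the target character set in the database.
TITLES = ("Annual financial report", "Financial statement analysis", "Travel policy")


def score(query: str, title: str) -> int:
    """0-100 similarity from difflib, standing in for a FuzzyWuzzy score."""
    return round(100 * SequenceMatcher(None, query.lower(), title.lower()).ratio())


@lru_cache(maxsize=1024)
def cached_search(query: str, threshold: int = 60) -> tuple:
    """Return (title, score) pairs above the threshold, best first.

    Results are memoised per (query, threshold); a tuple is returned so the
    cached value cannot be mutated by callers.
    """
    hits = [(title, score(query, title)) for title in TITLES]
    hits = [hit for hit in hits if hit[1] >= threshold]
    return tuple(sorted(hits, key=lambda hit: hit[1], reverse=True))
```

A repeated call with the same arguments is answered from the cache without rescoring. When the underlying data changes, the cache has to be invalidated (here via `cached_search.cache_clear()`), otherwise stale results are returned.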
Further, inverted indexes or other efficient data structures may also be utilized to accelerate the search process.
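As a sketch of the idea, an inverted index maps each token to the documents that contain it, so that only a small candidate subset needs fuzzy scoring. The function names and sample documents below are hypothetical, and the index shown matches exact tokens only.

```python
from collections import defaultdict


def build_inverted_index(documents: dict) -> dict:
    """Map each lower-cased token to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return index


def lookup(index: dict, query: str) -> set:
    """Return ids of documents sharing at least one token with the query."""
    ids = set()
    for token in query.lower().split():
        ids |= index.get(token, set())
    return ids


documents = {1: "Annual financial report", 2: "Travel policy", 3: "Financial summary"}
index = build_inverted_index(documents)
print(sorted(lookup(index, "financial report")))
# [1, 3]
```

Since this lookup is exact, a misspelled query token would miss its documents; indexing character n-grams instead of whole words is one common way to keep the pre-filter tolerant of typos.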
Further, the search request can be distributed to different nodes for parallel processing in a distributed environment, so that the search speed is improved.
Further, a more intelligent predictive model can be trained in combination with the machine learning model to estimate the search intent of the user, thereby more accurately matching the results.
Further, natural language processing techniques may also be utilized to enhance search capabilities, such as entity recognition, emotion analysis, etc., to improve relevance of search results.
Further, context awareness may also be introduced to adjust the priority of search results based on factors such as the user's search history, geographic location, and the like.
Step S300: performing a dimension check on the final result set, returning the data if the check passes, and re-executing the matching process or reporting an error if it does not.
Dimension check: at this stage, the system checks the final result set against the dimension parameters provided by the user. Dimension parameters are the criteria or rules used to guide the search process, which may include, but are not limited to, the importance of keywords and the priority of searches. The purpose of the check is to verify whether the result set meets the predetermined dimension requirements, such as whether a minimum similarity threshold is met and whether the results are relevant to the search content.
Correctness verification: the system evaluates the correctness of the search results. A correct result means that the information found is what the user wants or is sufficiently similar to the query entered by the user. An incorrect result may be due to an error in the search process or to the result not meeting the user's expectations, e.g. because the similarity is too low or the result is not relevant at all.
Relevance verification: in addition to correctness, the relevance of the search results must also be verified. Even if search results are technically correct, they have no value if they do not match the user's needs. The relevance check helps ensure that the information returned to the user is of real interest to them.
Returning data: if the dimension check indicates that the search results are both correct and relevant, the system returns those results to the user. This typically means returning a list of results ordered in some way (e.g., in decreasing order of similarity).
Re-executing the matching process or reporting errors: if the check finds that the results are incorrect or insufficiently relevant, the system has two options. It can re-execute the matching process, or it can report an error to the user, indicating that there may be a problem (for example, the entered query conditions may be too vague or no matching items exist) and suggesting that the user re-enter or adjust the search conditions.
In an implementation scenario of the present invention, assume that a user of the finance module on the GSCloud enterprise digital cloud platform wishes to find the most recent sales records by fuzzy search. The user enters the keyword "sales records" and sets dimension parameters such as a date range and an amount interval.
Receiving the keywords and dimension parameters entered by the user: the user enters "sales record" as the search term and provides a date range (August 1, 2024 to August 31, 2024) and an amount interval (10,000 yuan to 50,000 yuan) as dimension parameters.
Analyzing the dimension parameters and determining priority: the system analyzes the dimension parameters and determines their priority, for example filtering first by date range and then narrowing further by amount interval.
Generating a candidate result set: based on the dimension parameters, the system searches the database, uses the FuzzyWuzzy algorithm to match data entries containing "sales record", and generates a candidate result set.
Reordering the candidate result set: the system reorders the candidate result set according to the date and amount dimension information to obtain the final result set.
Dimension checking: the system checks whether the final result set meets the preset dimension conditions. For example, every record should fall within the specified date range, and its sales amount should fall within the user-defined interval. If the result set contains non-conforming records, such as a record whose date is outside the user-specified range, the check fails.
Check passed: if all records meet the dimension conditions, the system continues to the next step.
Check failed: if there are non-conforming records, the system does not return the data. At this point, the system has two options:
Re-executing the matching process: the system may search again, this time adjusting the search strategy, for example by relaxing the matching conditions or re-analyzing the dimension parameters.
Error reporting: the system may instead report the error directly to the user, indicating that no satisfactory records were found and suggesting that the user re-enter the search criteria.
Returning data: if the final result set passes all dimension checks, the system returns the data to the user in sorted order.
Returning error information: if any problem is found during the search, the system returns the corresponding error information, informing the user of the problem or suggesting that the search conditions be re-entered.
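The dimension check in this scenario can be sketched as a simple predicate over the final result set. The record type `SalesRecord`, its field names, and the sample records are hypothetical and serve only to illustrate the date-range and amount-interval conditions described above.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class SalesRecord:
    """Hypothetical record shape; the real schema is not given in the text."""
    title: str
    day: date
    amount: float  # in yuan


def passes_dimension_check(records, start, end, low, high):
    """True only if every record lies inside both the date range and the amount interval.

    Note that an empty result set passes trivially; a caller may want to
    treat that case as "no matching records" instead.
    """
    return all(start <= r.day <= end and low <= r.amount <= high for r in records)


records = [
    SalesRecord("sales record 0815", date(2024, 8, 15), 12000.0),
    SalesRecord("sales record 0902", date(2024, 9, 2), 30000.0),  # outside the date range
]
ok = passes_dimension_check(
    records, date(2024, 8, 1), date(2024, 8, 31), 10000, 50000
)
```

Here `ok` is false because the second record is dated outside the user-specified range, which is the failure case described in the scenario.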
It should be noted that finer dimension check rules may be developed on the basis of the above scheme, allowing the user to customize more dimension parameters. For example, dimensions such as geographic location and product category may be added to the basic time range and amount interval. This makes the search more precise and reduces the likelihood of false positives.
Further, the historical search behavior and preferences of the user can be learned automatically by a machine learning model, so that the criteria of the dimension check are adjusted dynamically. For example, if the system detects that a user often searches for a certain type of information, the relevance score of such information may be increased automatically.
Further, natural language processing (NLP) techniques may be added to the fuzzy search to improve the system's understanding of context, so that the search better reflects the user's real intent. For example, when the user enters "recent order", the system may determine from context whether "recent" refers to today, this week, or this month, and adjust the search scope accordingly.
Furthermore, a feedback loop can be established: when the user is not satisfied with the search results, the user can conveniently report this to the system, and the system continuously optimizes the matching algorithm and the dimension check rules according to the feedback, improving search quality.
In some embodiments of the present application, receiving the keywords and dimension parameters input by the user, analyzing the dimension parameters, and determining priorities specifically comprises:
Receiving the search character set, dimension information, and dimension order input by the user; determining, from the dimension information, the importance of each dimension and its role in the matching process; and ranking the dimensions by priority according to the dimension order.
The specific steps are as follows:
Receiving keywords input by the user:
The user enters a search character set through the interface, for example "financial reports" as the search keywords.
The system receives this search character set and stores it as the basis for subsequent search matching.
Receiving dimension information:
The user also inputs or selects dimension information, which may include, but is not limited to, similarity, frequency of use, and timestamp.
For example, the user may specify a similarity threshold of 80%, meaning that search results must be at least 80% similar; a high frequency of use, meaning that frequently mentioned content is displayed preferentially; and a timestamp of the last week, meaning that only updates within the last week are considered.
Receiving the dimension order:
The user specifies a dimension order, e.g., similarity > frequency of use > timestamp.
This means that in the matching process the system considers similarity first, frequency of use second, and the timestamp last.
Analyzing the dimension parameters:
The system analyzes the incoming dimension parameters to determine the importance of each dimension and its role in the matching process.
For example, the system identifies that the user wants results with at least 80% similarity, prefers documents that have been referenced more often in the past year, and wants only updates from the last month to be considered.
Determining priority:
According to the dimension order, the system ranks the dimensions by priority.
For example, similarity takes precedence over frequency of use; when similarities are equal, documents with a higher frequency of use take precedence; the timestamp is considered last.
The specific operation is as follows:
Receiving the search character set:
The user enters the search keyword "financial report", which the system stores as the search character set.
Receiving dimension information:
The user inputs or selects dimension information through the interface, such as:
Similarity: at least 80%.
Frequency of use: documents that have been referenced more often in the past year.
Timestamp: updated in the last month.
Receiving the dimension order:
The user specifies a dimension order through the interface, e.g., similarity > frequency of use > timestamp.
Analyzing the dimension parameters:
The system analyzes the incoming dimension parameters:
Similarity: the system recognizes that the user wants results with at least 80% similarity.
Frequency of use: the system recognizes that the user wants documents that have been referenced more often in the past year to be displayed preferentially.
Timestamp: the system recognizes that the user wants only updates from the last month to be considered.
Determining priority:
According to the dimension order, the system ranks the dimensions by priority:
Similarity takes precedence over frequency of use.
When similarities are equal, documents with a higher frequency of use take precedence.
The timestamp is considered last.
Through the steps, the system can accurately receive the search request of the user and determine the matching rule in the search process according to the specific requirements and preferences of the user, so that the search result which is more in line with the expectations of the user is provided.
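The receiving and prioritizing steps above can be sketched as a small data structure plus a validation function. The names `SearchRequest` and `prioritize`, and the way the dimension settings are encoded, are assumptions made for this illustration; the text does not prescribe a concrete representation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SearchRequest:
    """Hypothetical container for the three user inputs."""
    keywords: str     # the search character set
    dimensions: dict  # dimension name -> user-supplied setting
    order: tuple      # dimension names, most important first


def prioritize(request: SearchRequest) -> list:
    """Return (priority, dimension, setting) triples, priority 1 being the highest."""
    unknown = [name for name in request.order if name not in request.dimensions]
    if unknown:
        # The dimension order must only mention dimensions the user supplied.
        raise ValueError(f"dimension order names unknown dimensions: {unknown}")
    return [
        (rank, name, request.dimensions[name])
        for rank, name in enumerate(request.order, start=1)
    ]


request = SearchRequest(
    keywords="financial report",
    dimensions={
        "similarity": 80,                # at least 80% similarity
        "usage_frequency": "past year",  # references counted over the past year
        "timestamp": "last month",       # only updates from the last month
    },
    order=("similarity", "usage_frequency", "timestamp"),
)
plan = prioritize(request)
```

The resulting `plan` lists similarity as priority 1, frequency of use as priority 2, and timestamp as priority 3, mirroring the order similarity > frequency of use > timestamp.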
In some embodiments of the present application, matching the keywords against the database according to the priority to generate a candidate result set, and reordering the candidate result set according to the dimension parameters to obtain a final result set, specifically comprises:
Splitting the search character set input by the user into a plurality of parts; matching each part against the target character set using the FuzzyWuzzy algorithm to generate a series of result sets; merging all the result sets and removing duplicates to form a preliminary candidate result set; and reordering the candidate result set according to the previously determined dimension information to obtain the final result set.
Splitting the search character set: first, the system splits the search character set entered by the user into a plurality of separate parts. For example, if the user enters "financial report 2021 year", the phrase may be split into "financial report", "2021", and "year".
Matching with the FuzzyWuzzy algorithm: next, for each part, the system uses the FuzzyWuzzy algorithm to match against the target character set in the database. The FuzzyWuzzy algorithm evaluates the similarity between two strings by a variety of methods, including but not limited to the edit distance algorithm (Levenshtein distance) and Jaccard similarity. Thus a match can be found even when the search term is not identical to a database entry, provided the two are sufficiently similar.
Generating result sets: matching each part produces a set of results representing the database entries closest to that part. For example, "financial report" may match entries such as "2021 financial report" and "financial statement".
Merging result sets: all the partial match results are merged into one large result set, and duplicates are removed so that no entry appears twice. If "financial report" and "2021" each match the same entry, that entry appears only once in the merged result set.
Forming the candidate result set: after duplicates are removed, a preliminary candidate result set is formed.
Reordering: finally, the candidate result set is reordered according to the previously determined dimension information (such as similarity, frequency of use, and timestamp). The dimension information determines how the results are prioritized, so that the most relevant entries are ranked at the top. For example, if the user set similarity as the most important dimension, the entries most similar to the search term are ranked first.
After the above steps, the system obtains a prioritized final result set, which is returned to the user.
This process lets the fuzzy search handle complex query requests while maintaining the relevance and accuracy of the results and taking the user's personalized requirements into account.
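The split, match, merge, and de-duplicate steps can be sketched as follows. This is an illustration under assumptions rather than the method of the present application: the parts are taken as already split, the function names are hypothetical, and a sliding-window comparison built on the standard-library `difflib.SequenceMatcher` stands in for FuzzyWuzzy's partial matching.

```python
from difflib import SequenceMatcher


def partial_score(part: str, target: str) -> int:
    """Best 0-100 similarity between the shorter string and any equally long
    window of the longer one (a simplified stand-in for partial matching)."""
    part, target = part.lower(), target.lower()
    if len(part) > len(target):
        part, target = target, part
    width = len(part)
    if width == 0:
        return 0
    return max(
        round(100 * SequenceMatcher(None, part, target[i:i + width]).ratio())
        for i in range(len(target) - width + 1)
    )


def candidate_set(parts, targets, threshold=80):
    """Match every part against every target, merge the per-part results,
    and keep each target once with its best score."""
    merged = {}
    for part in parts:
        for target in targets:
            score = partial_score(part, target)
            if score >= threshold:
                # Dictionary keys de-duplicate entries matched by several parts.
                merged[target] = max(score, merged.get(target, 0))
    # Highest score first; ties broken alphabetically for a stable order.
    return sorted(merged.items(), key=lambda item: (-item[1], item[0]))


parts = ["financial report", "2021"]
targets = ["2021 financial report", "financial statement", "sales record"]
candidates = candidate_set(parts, targets)
```

Both parts match "2021 financial report", yet it appears only once in `candidates`, which is the de-duplication behavior described above. Lowering `threshold` admits looser matches such as "financial statement".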
In some embodiments of the present application, performing the dimension check on the final result set, returning the data if the check passes, and re-executing the matching process or reporting an error if it fails, specifically comprises: checking the correctness and relevance of the matching results according to the dimension parameters; if the results are correct, preparing to return the data; and if the results are incorrect, because the similarity is not high enough or the results are irrelevant to the search content, re-executing the matching process or returning error information.
The process of dimension checking the final result set is specifically as follows:
Dimension checking: the system checks the correctness and relevance of the matching results according to the dimension parameters input by the user. The dimension parameters are the importance and priority settings used to guide the matching process during the search, such as the similarity threshold and frequency of use.
Verifying the correctness of the results: if the results are as expected, that is, correct and highly relevant to the search content, the system prepares to return the final result set to the user. Correctness here means not only that a result is accurate, but also that it closely matches the search content entered by the user.
Similarity and relevance assessment: if the similarity of the matching results is found to be not high enough, or the results are irrelevant to the search content, that is, the matching results do not match the user's search intent, the system needs to take further action. In this case, there are generally two processing modes:
Re-executing the matching process: the system restarts the search matching process, possibly adjusting the matching algorithm or parameter settings in order to obtain a better matching result.
Returning error information: if the system determines that better results cannot be obtained by retrying, or finds other problems in the search process (e.g., input errors or system errors), it returns the corresponding error information to the user. The error information may include a description of the problem and possible solutions or suggestions, such as suggesting that the user re-enter the search criteria.
This mechanism ensures that the data returned to the user is accurate and relevant, and also provides a way to handle abnormal conditions, improving both the user experience and the reliability of the system.
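The check, retry, and error-reporting flow can be sketched as a loop that relaxes the similarity threshold before giving up. The relaxation policy (`step`, `floor`), the exception `NoMatchError`, and the function names are assumptions for this sketch; the text only states that the matching conditions may be adjusted. The standard-library `difflib.SequenceMatcher` again stands in for the FuzzyWuzzy scorer.

```python
from difflib import SequenceMatcher


class NoMatchError(Exception):
    """Raised when no acceptable result set can be produced (hypothetical name)."""


def similarity(a: str, b: str) -> int:
    """Stand-in 0-100 similarity score."""
    return round(100 * SequenceMatcher(None, a.lower(), b.lower()).ratio())


def search_with_check(query, targets, threshold=80, floor=50, step=15):
    """Match, check, and then return results, retry with a relaxed threshold, or raise.

    Returns (results, threshold_used), where results are (target, score)
    pairs sorted by decreasing score.
    """
    if not query.strip():
        raise NoMatchError("empty query; please re-enter the search conditions")
    current = threshold
    while current >= floor:
        scored = [(t, similarity(query, t)) for t in targets]
        results = [(t, s) for t, s in scored if s >= current]
        if results:  # check passed: at least one entry meets the threshold
            return sorted(results, key=lambda pair: -pair[1]), current
        current -= step  # re-execute the matching with a relaxed condition
    raise NoMatchError(
        f"no entry reached similarity {floor}; please adjust the search conditions"
    )


targets = ["financial statement", "sales record"]
results, used = search_with_check("financial report", targets)
```

In this run no target reaches the initial threshold of 80, so the search is repeated at a lower threshold before results are returned; a query with no plausible match raises `NoMatchError`, corresponding to the error-reporting branch.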
In some embodiments of the present application, returning the data specifically comprises returning the sorted final result set to the user.
Final result set: after all matching and sorting steps are completed, the system obtains a final result set. This result set contains all entries that meet the user's search criteria and have passed the multi-dimensional check.
Sorting: the system sorts the final result set according to the dimension parameters defined by the user. The purpose of sorting is to present the most relevant entries, those the user most wants, first. For example, if the user gives similarity the highest priority, the system sorts by similarity first, then by frequency of use when similarities are equal, and so on.
Returning to the user: the system returns the sorted result set to the user, who sees a prioritized list of results with the most relevant entry at the top.
The specific steps are as follows:
Forming the final result set:
Through a series of matching and filtering steps, the system generates a candidate result set.
The candidate result set is sorted according to the dimension parameters specified by the user to obtain the final result set.
Sorting process:
The system sorts the candidate result set according to the dimension parameters entered by the user, such as similarity, frequency of use, and timestamp.
For example, if similarity is the most important dimension, the system sorts by similarity score first, by frequency of use when similarity scores are equal, and by timestamp when frequencies of use are also equal.
Returning to the user:
The system returns the sorted final result set to the user, who sees a prioritized list.
The user can immediately view the most relevant entries and quickly find the desired search result.
For example, assume that the user has entered the keyword "financial report" and has set the similarity to 80%, the frequency of use to be high, and the timestamp to be the last month. After matching and sorting, the system generates a final result set containing the following items:
Item A, similarity 90%, use frequency high, timestamp last week.
Item B, similarity 85%, frequency of use, and time stamp last two weeks.
Item C, similarity 80%, frequency of use high, time stamp last month.
The system ranks by similarity first, then by frequency of use when similarities are equal, and finally by timestamp. Because the three similarity scores differ, similarity alone determines the order here: the system ranks item A first (90%), item B second (85%), and item C third (80%), and returns this ordered result set to the user.
In this way, the user can see an optimized prioritized list of search results, thereby making it easier to find the most relevant information.
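The tie-breaking order described above (similarity, then frequency of use, then timestamp) maps directly onto a sort with a composite key. The entries below are hypothetical and chosen so that every tie-break is exercised; treating a more recent timestamp as better is also an assumption of this sketch.

```python
from datetime import date

# Hypothetical entries: (name, similarity 0-100, usage count, last update).
entries = [
    ("report-x", 85, 12, date(2024, 8, 20)),
    ("report-y", 90, 3, date(2024, 8, 1)),
    ("report-z", 85, 12, date(2024, 8, 28)),
    ("report-w", 85, 40, date(2024, 7, 15)),
]


def rank(items):
    """Sort by similarity, then usage count, then recency, all descending."""
    return sorted(
        items,
        key=lambda e: (-e[1], -e[2], -e[3].toordinal()),
    )


ordered = [name for name, *_ in rank(entries)]
```

`report-y` comes first on similarity alone; among the three entries tied at 85, the most used one leads, and the remaining tie on usage is resolved by the more recent update.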
In some embodiments of the present application, returning the error information specifically comprises: if any problem is found during the search, returning the corresponding error information to inform the user of the problem or to suggest re-entering the search conditions.
In some embodiments of the present application, the FuzzyWuzzy algorithm is used to evaluate the similarity between two strings; the FuzzyWuzzy algorithm includes the edit distance algorithm and Jaccard similarity.
The FuzzyWuzzy algorithm is a set of methods for evaluating the similarity of two strings, measuring it by different algorithms and techniques. In the fuzzy search matching scheme, the FuzzyWuzzy algorithm can process non-standard inputs such as spelling errors, synonyms, and abbreviations, and can evaluate the similarity between strings by various means according to the search conditions input by the user. The following are several commonly used evaluation methods:
Edit distance algorithm (Levenshtein distance): a method of measuring the degree of difference between two strings. It calculates the minimum number of single-character editing operations (inserting, deleting, or replacing a character) required to convert one string into the other. The smaller the edit distance, the more similar the two strings.
Jaccard similarity coefficient: used to compare the similarity between finite sample sets. Two strings are first converted into sets of characters or character sequences, and then the ratio of the number of elements in the intersection of the two sets to the number of elements in their union is calculated. The higher the ratio, the higher the similarity of the two strings.
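Both measures are short enough to write out directly. The sketch below implements the textbook definitions in plain Python; the function names and the choice of character bigrams (`n = 2`) for the Jaccard sets are assumptions for illustration, and this is not the code of any particular library.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or substitutions
    needed to turn `a` into `b` (row-by-row dynamic programming)."""
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


def jaccard(a: str, b: str, n: int = 2) -> float:
    """Jaccard coefficient |A ∩ B| / |A ∪ B| over the sets of character n-grams."""
    grams_a = {a[i:i + n] for i in range(len(a) - n + 1)}
    grams_b = {b[i:i + n] for i in range(len(b) - n + 1)}
    union = grams_a | grams_b
    if not union:
        # Both strings are shorter than n; fall back to exact comparison.
        return 1.0 if a == b else 0.0
    return len(grams_a & grams_b) / len(union)


distance = levenshtein("kitten", "sitting")
overlap = jaccard("night", "nacht")
```

Turning "kitten" into "sitting" takes two substitutions and one insertion, so `distance` is 3. "night" and "nacht" share one bigram ("ht") out of seven distinct bigrams, so `overlap` is 1/7.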
In addition to these two methods, FuzzyWuzzy also supports several other matching modes, including but not limited to:
Simple matching: a simple string comparison, typically used to quickly determine whether two strings are identical.
Partial matching: partial matches are allowed, that is, strings need not be identical to be considered similar.
Order-insensitive matching: the order of the characters is ignored during comparison, and only whether the characters appear in both strings is considered.
De-duplicated subset matching: duplicated characters or character sequences are removed during matching to avoid the effect of double counting.
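The difference between these modes can be illustrated with small stand-in functions. This sketch makes two assumptions that go beyond the text: it works on whitespace-separated tokens rather than individual characters, and it uses the standard-library `difflib.SequenceMatcher` as the underlying scorer. The function names are hypothetical and are not the library's API.

```python
from difflib import SequenceMatcher


def simple_ratio(a: str, b: str) -> int:
    """Plain 0-100 similarity of the two strings as given."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def order_insensitive_ratio(a: str, b: str) -> int:
    """Compare after sorting the tokens, so word order no longer matters."""
    return simple_ratio(" ".join(sorted(a.split())), " ".join(sorted(b.split())))


def deduplicated_ratio(a: str, b: str) -> int:
    """Compare after dropping repeated tokens and sorting what remains."""
    return simple_ratio(
        " ".join(sorted(set(a.split()))),
        " ".join(sorted(set(b.split()))),
    )


plain = simple_ratio("report financial", "financial report")
reordered = order_insensitive_ratio("report financial", "financial report")
deduped = deduplicated_ratio("report report financial", "financial report")
```

The plain comparison penalizes the swapped word order, whereas the order-insensitive and de-duplicated variants both score the pairs as identical.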
The FuzzyWuzzy algorithm has the advantage that different matching strategies can be selected flexibly to suit different application scenarios and requirements. For example, FuzzyWuzzy can maintain a relatively fast matching speed when processing large-scale data, improving overall efficiency, while a suitable matching mode can be selected according to actual requirements, improving both the accuracy and the efficiency of matching.
In summary, the FuzzyWuzzy algorithm provides a comprehensive fuzzy search matching solution by combining multiple evaluation methods, so that the system can more accurately understand and satisfy the user's intent when processing complex search requests.
Example 2
The inputs are two character sets, a search character set and a target character set, together with the incoming dimensions and the order of the incoming dimensions.
The search character set stores the search content; the target character set stores the search matching targets; the dimensions carry the corresponding dimension-related information; and the dimension order determines the order and relevance priority of the dimensions. Two of these inputs are configurable, so that matching results in different modes can be obtained through configuration.
The dimensional analysis of the data combines the incoming dimension parameters and, following the incoming dimension order, determines the order and relevance priority of the dimensions; the priority and order of the matching algorithm are then determined by the dimensions.
In simplified form, the dimensions are numbered sequentially according to the incoming dimension order.
The search content character set is split into individual search contents. For each of them, a result set is obtained by matching against the target character set in the FuzzyWuzzy manner. The result sets so obtained are merged and de-duplicated to give the target result set; that is, the target result set is the de-duplicated union of the per-part result sets.
After the target result set is obtained, it is re-sorted by the dimensions, in their numbered order, to give the final result set.
The dimensions are then checked against the incoming dimension parameters. If the check passes, the data is prepared for return; if there is an error (for example, a problem with similarity or relevance, inconsistent content between two of the inputs, or unreasonable character set content), the sorting must be re-iterated or an error returned.
Finally, either the sorted result set or the error information is returned.
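The pipeline of this example can be written compactly in symbols. The notation is an assumption introduced here for illustration and is not taken from the original formulas: $S$ is the search character set, $T$ the target character set, $d_1, \dots, d_m$ the dimensions numbered according to the incoming dimension order, $s_1, \dots, s_n$ the parts obtained by splitting $S$, $\mathrm{FW}(s_i, T)$ the result set of matching $s_i$ against $T$ in the FuzzyWuzzy manner, and $\mathrm{sort}_{d_1, \dots, d_m}$ a sort that compares by $d_1$ first and uses each later dimension only to break ties.

```latex
\begin{aligned}
S &\longrightarrow \{s_1, s_2, \dots, s_n\} \\
R_i &= \mathrm{FW}(s_i, T), \qquad i = 1, \dots, n \\
R &= \bigcup_{i=1}^{n} R_i \\
R_{\text{final}} &= \mathrm{sort}_{d_1, d_2, \dots, d_m}(R)
\end{aligned}
```

Because $R$ is a set union, an entry matched by several parts appears in it only once, which corresponds to the merge-and-de-duplicate step.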
Example 3
A computer program product containing instructions that, when run on a device, cause the device to perform the steps of the general data fuzzy search and matching method based on the FuzzyWuzzy algorithm.
Example 4
A computer-readable storage medium having stored thereon a program which, when executed by a processor, performs the steps of the general data fuzzy search and matching method based on the FuzzyWuzzy algorithm.
Example 5
An electronic device comprising a memory, a processor, and a program stored on the memory and executable on the processor, wherein the processor, when executing the program, implements the steps of the general data fuzzy search and matching method based on the FuzzyWuzzy algorithm.
Aspects not described in the present application can be implemented by adopting or referring to the prior art.
In this specification, the embodiments are described in a progressive manner: each embodiment mainly describes its differences from the others, and for identical or similar parts the embodiments may be referred to one another.
The foregoing is merely exemplary of the present application and is not intended to limit the present application. Various modifications and variations of the present application will be apparent to those skilled in the art. Any modification, equivalent replacement, or improvement that comes within the spirit and principles of the application is to be included in the scope of the claims of the present application.
Claims (8)
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202411230352.6A (CN118733844B) | 2024-09-04 | 2024-09-04 | A general data fuzzy search and matching method based on FuzzyWuzzy algorithm |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| CN118733844A | 2024-10-01 |
| CN118733844B | 2025-02-25 |
Family ID: 92853113
Family Applications (1)
| Application Number | Status | Priority Date | Filing Date | Title |
|---|---|---|---|---|
| CN202411230352.6A (CN118733844B) | Active | 2024-09-04 | 2024-09-04 | A general data fuzzy search and matching method based on FuzzyWuzzy algorithm |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN118733844B (en) |
Families Citing this family (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN120763380A (en) * | 2025-09-09 | 2025-10-10 | 拉扎斯网络科技(上海)有限公司 | Information search method, device, storage medium and computer equipment |
Citations (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN113342923A (en) * | 2021-06-29 | 2021-09-03 | 招商局金融科技有限公司 | Data query method and device, electronic equipment and readable storage medium |
Family Cites Families (5)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US8201087B2 (en) * | 2007-02-01 | 2012-06-12 | Tegic Communications, Inc. | Spell-check for a keyboard system with automatic correction |
| CN106997384B (en) * | 2017-03-24 | 2020-01-14 | 福州大学 | Semantic fuzzy searchable encryption method capable of verifying sequencing |
| CN112417096B (en) * | 2020-11-17 | 2024-05-28 | 平安科技(深圳)有限公司 | Question-answer pair matching method, device, electronic device and storage medium |
| US12067370B2 (en) * | 2021-06-08 | 2024-08-20 | Sap Se | Detection of abbreviation and mapping to full original term |
| CN118467851B (en) * | 2024-07-15 | 2024-10-25 | 北京蜂窝科技有限公司 | Artificial intelligent data searching and distributing method and system |
Also Published As
| Publication number | Publication date |
|---|---|
| CN118733844A (en) | 2024-10-01 |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |
| | GR01 | Patent grant | |