Disclosure of Invention
In order to solve the problem that existing retrieval systems cannot accurately identify the professional terms and key elements in query sentences, which degrades question-answering quality, the application provides a retrieval method, a retrieval system and an electronic device for a tax knowledge question-answering system, adopting the following technical scheme:
In a first aspect, the present application provides a retrieval method for a tax knowledge question-answering system, including the steps of:
Receiving tax question query sentences input by a user;
Performing word segmentation and part-of-speech tagging on the query sentence to obtain a word segmentation sequence with part-of-speech tagging;
Based on the word segmentation sequence, identifying and marking the tax professional terms therein by utilizing a pre-constructed tax professional dictionary, and generating a target sequence with tax term marks;
based on the target sequence, extracting tax elements by utilizing a preset tax element identification rule;
and constructing a query vector according to the extracted tax elements, and executing search matching in a tax knowledge base.
With this technical scheme, the application addresses the fact that the prior art can neither recognize certain complete tax concepts nor extract key tax elements, so that search results deviate from the user's actual consultation intention. The query sentence is first segmented and part-of-speech tagged to form basic language units; standard terms are then identified with the pre-constructed tax professional dictionary; key information is extracted according to the preset tax element identification rules; and finally a query vector incorporating professional knowledge is constructed to execute retrieval. The professional terms and key elements in the query sentence can thus be identified accurately, improving the intelligence and retrieval quality of tax question answering.
Optionally, the query sentence is segmented and part of speech tagged to obtain a segmented sequence with part of speech tag, which specifically comprises the following steps:
Performing preliminary word segmentation on the query sentence by using a word segmentation device to obtain an initial word segmentation sequence;
matching the initial word segmentation sequence with a preset tax abbreviation comparison table, and identifying tax professional abbreviations;
According to the tax abbreviation comparison table, the identified tax professional abbreviation is replaced by a corresponding standard term, and a standardized word segmentation sequence is obtained;
and performing part-of-speech tagging on the standardized word segmentation sequence to obtain a word segmentation sequence with part-of-speech tagging.
With this technical scheme, to address the unsatisfactory retrieval results caused by the large number of non-standard abbreviations used in tax query sentences, an initial sequence is first obtained with a word segmentation device; tax abbreviations are then identified through the tax abbreviation comparison table and replaced with the corresponding standard terms to obtain a standardized sequence; finally, part-of-speech tagging is performed to obtain the tagged sequence. Query sentences containing abbreviations can thus be understood more accurately, improving the user experience.
Optionally, based on the word segmentation sequence, identifying and marking tax professional terms in the word segmentation sequence by using a pre-constructed tax professional dictionary, and generating a target sequence with tax term marks, which specifically comprises the following steps:
matching the continuous phrase in the word segmentation sequence with a tax professional dictionary, and identifying standard tax professional terms to obtain a preliminary marking sequence;
calculating the character similarity between each unrecognized phrase in the preliminary marking sequence and each term in the tax professional dictionary to obtain a similarity result set;
screening out, from the similarity result set, the phrases whose similarity is higher than a preset threshold together with their corresponding standard tax professional terms, to obtain a fuzzy matching sequence;
And combining the preliminary marking sequence and the fuzzy matching sequence to generate a target sequence with tax term marks.
With this technical scheme, to address the incomplete recognition of professional terms caused by users' non-standard expressions in tax queries, the word segmentation sequence is first matched against the tax professional dictionary to recognize the standard terms; for each unmatched phrase, the edit distance or character overlap with the terms in the dictionary is then calculated, and when the similarity exceeds a preset threshold the phrase is marked as the corresponding standard term; finally, the preliminary marking and fuzzy matching results are combined into a complete term marking sequence. Expression variants are thereby covered more comprehensively, improving both the precision and the recall of term recognition.
Optionally, based on the target sequence, extracting tax elements by using a preset tax element recognition rule, and specifically includes the following steps:
Matching the target sequence with a preset associated word list, and identifying associated words and positions in the sequence to obtain an associated word position sequence;
Dividing the target sequence into a plurality of subsequences according to the related word position sequence to obtain a subsequence set;
respectively extracting tax elements from the subsequence set by using a preset tax element identification rule to obtain a preliminary element set;
Marking the logic relation among the elements in the preliminary element set according to the semantic type of the associated word in the associated word position sequence, and generating a complete element set containing the elements and the associated relation.
With this technical scheme, to handle complex queries that involve several tax elements, the positions of the associated words are first identified and the query sequence is divided into a plurality of subsequences; basic elements such as tax types, tax rates and policy types are then extracted from the subsequences; finally, the parallel relationships between tax type elements and the subordinate relationships between tax type elements and their attribute elements are marked according to the semantic features of the associated words, which improves the accuracy of complex queries.
Optionally, after the search matching is performed in the tax knowledge base, the method further comprises the following steps:
combining the candidate knowledge list obtained by retrieval with the original query statement of the user to construct an input prompt;
The input prompt is transmitted into a large language model, and a correlation analysis result of each candidate knowledge is obtained;
extracting a relevance score of each candidate knowledge based on the relevance analysis result;
Re-ordering the candidate knowledge according to the relevance score to generate a result list;
Judging whether the highest score in the result list exceeds a preset threshold value, and if so, returning the corresponding knowledge as a reply.
With this technical scheme, the application addresses the problem that a traditional tax knowledge retrieval system relies only on vector similarity, cannot accurately understand query semantics, and therefore returns insufficiently accurate search results. The user query and the candidate knowledge are first combined into a prompt, which is input into a large language model to obtain a detailed relevance analysis; relevance scores are then extracted from the analysis results; finally, the candidates are re-ordered by score, and the knowledge whose score exceeds the threshold is taken as the final answer. This improves the method's understanding of the query intention and the accuracy of the search results, and with it the user experience.
Optionally, when the highest score in the result list does not exceed a preset threshold, the method further includes the following steps:
extracting knowledge items with the relevance scores not lower than a preset recommendation threshold value from the result list, and generating a candidate recommendation set;
sorting in a descending order according to the relevance score of each knowledge item in the candidate recommendation set, and generating an ordered recommendation list;
The ordered recommendation list is formatted as a user-readable recommendation interface and returned.
With this technical scheme, to address the poor user experience when the tax knowledge question-answering system cannot find a fully matching answer, knowledge items whose relevance scores are not lower than the preset recommendation threshold are extracted into a candidate recommendation set, sorted in descending order of relevance score into an ordered recommendation list, and returned as a user-readable recommendation interface. This multi-level threshold and recommendation display mechanism improves the user experience when the retrieval result is not ideal, and thereby the practicability of the system.
Optionally, after returning the corresponding knowledge as a reply, the method further comprises the following steps:
generating an interaction record containing user inquiry, returned knowledge ID and returned time, and distributing a unique identifier;
Receiving and recording feedback information of a user based on the identifier corresponding to the interaction record, and generating a feedback record;
according to the feedback type in the feedback record, calculating a weight adjustment value of the knowledge item;
and adding the weight adjustment value to the original weight value of the corresponding entry in the knowledge base, and updating the original weight value.
With this technical scheme, a unique interaction ID is first generated for each query response, and the user's query content, the returned knowledge ID and the time are recorded; the user's feedback on the response is then collected and stored in association with the interaction ID; a weight adjustment value is calculated according to the feedback type and applied to the original weight of the knowledge item, dynamically updating the basis for retrieval ranking. The system can thus continuously learn from user feedback to optimize the ordering of retrieval results, gradually adapting to user needs and improving the user experience.
In a second aspect, the present application provides a retrieval system for a tax knowledge question-answering system, comprising:
The receiving module is used for receiving tax problem query sentences input by a user;
The word segmentation marking module is used for carrying out word segmentation and part-of-speech marking on the query sentence to obtain a word segmentation sequence with part-of-speech marking;
the term marking module is used for identifying and marking, based on the word segmentation sequence, the tax professional terms therein by utilizing a pre-constructed tax professional dictionary, to generate a target sequence with tax term marks;
the element extraction module is used for extracting tax elements by utilizing a preset tax element identification rule based on the target sequence;
and the retrieval matching module is used for constructing a query vector according to the extracted tax elements and executing search matching in the tax knowledge base.
In a third aspect, the present application provides an electronic device, including a memory, a processor, and a computer program stored on the memory and executable on the processor, the processor implementing the steps of the above-described retrieval method for a tax knowledge question-answering system when the computer program is executed.
In a fourth aspect, the present application provides a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the above-described retrieval method for a tax knowledge question-answering system.
In summary, the present application includes at least one of the following beneficial technical effects:
1. The query sentence is first segmented and part-of-speech tagged to form basic language units; standard terms are then identified with the pre-constructed tax professional dictionary; key information is extracted according to the preset tax element identification rules; and finally a query vector incorporating professional knowledge is constructed to execute retrieval, so that the professional terms and key elements in the query sentence can be accurately identified, and the intelligence and retrieval quality of tax question answering are improved;
2. To address the unsatisfactory retrieval results caused by the large number of non-standard abbreviations used in tax query sentences, an initial sequence is first obtained with a word segmentation device; tax abbreviations are then identified through the tax abbreviation comparison table and replaced with the corresponding standard terms to obtain a standardized sequence; finally, part-of-speech tagging is performed to obtain the tagged sequence, so that query sentences containing abbreviations can be understood more accurately and the user experience is improved;
3. To address the problem that a traditional tax knowledge retrieval system relies only on vector similarity, cannot accurately understand query semantics, and therefore returns insufficiently accurate search results, the user query and the candidate knowledge are first combined into a prompt, which is input into a large language model to obtain a detailed relevance analysis; relevance scores are then extracted from the analysis results; finally, the candidates are re-ordered by score, and the knowledge whose score exceeds the threshold is taken as the final answer, improving the method's understanding of the query intention, the accuracy of the search results and the user experience.
Detailed Description
The terminology used in the following embodiments of the application is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. As used in the specification of the present application and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates to the contrary. It should also be understood that the term "and/or" as used in this disclosure is intended to encompass any or all possible combinations of one or more of the listed items.
The terms "first," "second," and the like, are used below for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include one or more such features, and in the description of embodiments of the application, unless otherwise indicated, the meaning of "a plurality" is two or more.
Embodiments of the application are described in further detail below with reference to the drawings.
In a first aspect, the present application provides a retrieval method for a tax knowledge question-answering system, referring to fig. 1, comprising the steps of:
S110, receiving tax question query sentences input by a user.
In this embodiment, a tax question query sentence input by the user in natural language is received through the user interaction interface of the system. The query sentence may be a complete question, such as "what is the business income tax rate", or a keyword phrase, such as "value-added tax threshold for small-scale taxpayers".
Specifically, the system first preprocesses the received query sentence, including removing redundant spaces, normalizing punctuation marks, and converting full-width characters to half-width characters, so as to ensure consistent subsequent processing. For example, "is the business income tax rate.
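The preprocessing described above can be sketched as follows. This is a minimal illustration rather than the application's implementation: the function name `preprocess_query` is hypothetical, and the use of Unicode NFKC normalization for the full-width to half-width conversion is an assumption.

```python
import re
import unicodedata

def preprocess_query(text: str) -> str:
    """Normalize a raw query before segmentation (illustrative sketch)."""
    # NFKC folds full-width letters, digits, spaces and punctuation
    # into their half-width equivalents.
    text = unicodedata.normalize("NFKC", text)
    # Collapse runs of whitespace and trim both ends.
    text = re.sub(r"\s+", " ", text).strip()
    # Collapse repeated punctuation such as "??" into a single mark.
    text = re.sub(r"([?!.,;:])\1+", r"\1", text)
    return text

print(preprocess_query("  ＶＡＴ　rate？？ "))
```

Normalizing before segmentation means the dictionaries used in later steps only need to store one form of each entry.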
S120, performing word segmentation and part-of-speech tagging on the query sentence to obtain a word segmentation sequence with part-of-speech tagging.
In this embodiment, a word segmentation method based on statistical probability is adopted, and a preset dictionary is combined to segment the query sentence. The system maintains a comprehensive dictionary containing universal vocabulary and professional vocabulary in advance, and is used for guiding the word segmentation process, improving the word segmentation accuracy, and particularly improving the recognition capability of the special vocabulary in the tax field.
Specifically, the system first performs preliminary word segmentation on the text using a maximum matching algorithm and then tags each token with its part of speech. For example, for the query sentence "what is the tax rate of business income tax", the segmentation and tagging result is "business income tax/n of/u tax rate/n is/v what/r", where n denotes a noun, u an auxiliary word, v a verb, and r a pronoun.
S130, identifying and marking tax professional terms in the tax professional dictionary constructed in advance based on the word segmentation sequence, and generating a target sequence with tax term marks.
In this embodiment, the system performs term recognition through a pre-built tax professional dictionary. The dictionary contains standard tax terms and common variant forms thereof, supports accurate matching and fuzzy matching of tax technical terms, and ensures that the technical terms in the user query can be accurately identified.
Specifically, the system traverses the continuous word groups in the word segmentation sequence and matches with the tax professional dictionary. When a match is found, a special tag is used for identification, thereby generating a target sequence with semantic tags.
S140, extracting tax elements by using a preset tax element recognition rule based on the target sequence.
In this embodiment, the system extracts key elements in the query based on predefined tax element recognition rules. These rules include recognition patterns of elements such as tax types, tax payment subjects, collection objects, tax rates, etc., and can extract the core semantic information of the query from the tagged sequence.
Specifically, the system identifies various elements in the sequence in a rule matching mode, extracts tax type elements and inquiry attribute elements, and marks the subordinate relations between the tax type elements and the inquiry attribute elements.
S150, constructing a query vector according to the extracted tax elements, and executing search matching in a tax knowledge base.
In this embodiment, the system converts the extracted tax elements into structured query vectors, and matches the query vectors in the knowledge base for retrieval. A large amount of tax policy regulations, actual operation guidelines and other contents are stored in a knowledge base of the system.
Specifically, the system assigns a different weight to each dimension of the query vector; for example, the tax type dimension is given a higher weight and the attribute dimension a lower one. When the search is executed, a weighted cosine similarity calculation is used to find the knowledge items most similar to the query vector.
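A weighted cosine similarity of this kind can be sketched as below. The four-dimensional vector layout, the weight values and the knowledge IDs are hypothetical and chosen only to show that a tax type match outranks an attribute match when the tax type dimensions carry more weight.

```python
import math

def weighted_cosine(query, item, weights):
    """Cosine similarity with a non-negative weight applied to every dimension."""
    dot = sum(w * q * k for w, q, k in zip(weights, query, item))
    norm_q = math.sqrt(sum(w * q * q for w, q in zip(weights, query)))
    norm_k = math.sqrt(sum(w * k * k for w, k in zip(weights, item)))
    if norm_q == 0.0 or norm_k == 0.0:
        return 0.0  # an all-zero vector matches nothing
    return dot / (norm_q * norm_k)

# Hypothetical layout: [business income tax, value-added tax, tax rate, tax threshold]
WEIGHTS = [2.0, 2.0, 1.0, 1.0]   # tax type dimensions weigh more than attribute dimensions
query = [1.0, 0.0, 1.0, 0.0]     # "business income tax" + "tax rate"
knowledge = {
    "K1": [1.0, 0.0, 1.0, 0.0],  # same tax type, same attribute
    "K2": [1.0, 0.0, 0.0, 1.0],  # same tax type, different attribute
    "K3": [0.0, 1.0, 1.0, 0.0],  # different tax type, same attribute
}
ranked = sorted(knowledge,
                key=lambda k: weighted_cosine(query, knowledge[k], WEIGHTS),
                reverse=True)
print(ranked)
```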
In one embodiment, referring to fig. 2, in step S120, the query sentence is segmented and part of speech labeled to obtain a segmented word sequence with part of speech label, which specifically includes the following steps:
S121, performing preliminary word segmentation on the query sentence by using a word segmentation device to obtain an initial word segmentation sequence.
In this embodiment, the system adopts a word segmentation method based on a combination of a dictionary and a statistical model. The word segmentation device loads a preset dictionary library, wherein the dictionary library comprises a universal vocabulary dictionary and a tax field special dictionary, and each entry comprises basic attribute information such as word frequency, part of speech and the like.
Specifically, the word segmentation device first performs a preliminary segmentation of the text using a forward maximum matching algorithm, computes the probabilities of the alternative segmentation schemes with a statistical language model, and selects the segmentation with the highest probability. When an ambiguous segmentation is encountered, the system disambiguates using word frequency information and context and selects the optimal segmentation path.
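The idea of choosing the most probable segmentation from word frequencies can be sketched with a unigram dynamic program. This is an assumption-laden simplification: it uses only word frequencies (no context model, no forward maximum matching pass), the frequency table `FREQ` is invented, and a space-free English string stands in for unsegmented query text.

```python
import math

def best_segmentation(text, freq, max_len=8):
    """Return the split of `text` with the highest unigram log-probability."""
    total = sum(freq.values())
    # best[i] = (log-probability of the best split of text[:i], start of its last word)
    best = [(0.0, 0)] + [(-math.inf, 0)] * len(text)
    for end in range(1, len(text) + 1):
        for start in range(max(0, end - max_len), end):
            word = text[start:end]
            if word in freq:
                logp = math.log(freq[word] / total)
            elif end - start == 1:
                # Unseen single character: small fallback probability (assumption).
                logp = math.log(0.5 / total)
            else:
                continue
            score = best[start][0] + logp
            if score > best[end][0]:
                best[end] = (score, start)
    # Follow the back-pointers from the end of the text to recover the words.
    words, end = [], len(text)
    while end > 0:
        start = best[end][1]
        words.append(text[start:end])
        end = start
    return words[::-1]

# Hypothetical word frequencies; "in" + "come" competes with "income".
FREQ = {"in": 80, "come": 30, "income": 50, "tax": 90, "rate": 40}
print(best_segmentation("incometaxrate", FREQ))
```

Here the ambiguity between "in" + "come" and "income" is resolved by frequency alone, because the single word has a higher probability than the product of the two shorter ones.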
S122, matching the initial word segmentation sequence with a preset tax abbreviation comparison table, and identifying tax professional abbreviations.
In this embodiment, the system maintains a dedicated tax abbreviation comparison table in advance, which records the common professional abbreviations in the tax field and their corresponding standard full names. The table is stored as key-value pairs containing the abbreviation, the full name, the usage frequency and other information, and supports both exact matching and fuzzy matching.
Specifically, the system matches each token in the initial word segmentation sequence against the abbreviation comparison table. For example, when a user inputs professional abbreviations such as "IIT" (personal income tax), "CIT" (business income tax), "VAT" (value-added tax) or "RTF" (stamp duty), the system can recognize them accurately. Common industry shorthand, such as "individual household" for "individual industrial and commercial household", "special invoice" for "special value-added tax invoice", and "general invoice" for "general value-added tax invoice", can likewise be matched exactly through the comparison table.
S123, replacing the identified tax professional abbreviations with corresponding standard terms according to the tax abbreviation comparison table to obtain a standardized word segmentation sequence.
In this embodiment, the system replaces the identified abbreviations with corresponding standard names according to the mapping relation in the abbreviation comparison table.
Specifically, when the system finds an abbreviation in the word segmentation sequence, the system directly calls the standard full name in the comparison table to replace.
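Steps S122 and S123 amount to a table lookup followed by substitution, which can be sketched as follows. The table holds only three of the abbreviations named in the description, and the case-insensitive lookup and the function name `standardize` are assumptions.

```python
# Entries taken from the examples in the description; a real table would be larger
# and would also carry usage frequencies.
ABBREVIATION_TABLE = {
    "IIT": "personal income tax",
    "CIT": "business income tax",
    "VAT": "value-added tax",
}

def standardize(tokens, table=ABBREVIATION_TABLE):
    """Replace each recognized abbreviation with its standard full name."""
    standardized, replaced = [], []
    for position, token in enumerate(tokens):
        full_name = table.get(token.upper())  # case-insensitive lookup is an assumption
        if full_name is None:
            standardized.append(token)
        else:
            standardized.append(full_name)
            # Keep a record of what was replaced and where.
            replaced.append((position, token, full_name))
    return standardized, replaced

tokens, log = standardize(["vat", "rate", "for", "small-scale", "taxpayers"])
print(tokens)
print(log)
```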
S124, performing part-of-speech tagging on the standardized word segmentation sequence to obtain a word segmentation sequence with part-of-speech tagging.
In the embodiment, the system adopts a part-of-speech labeler based on a conditional random field, and the labeler can accurately identify the part of speech of the special vocabulary in the tax field through training of a large-scale tax text corpus. The labeling set comprises basic parts of speech such as nouns, verbs, adjectives and the like, and special labels such as tax types, tax rates, deadlines and the like.
Specifically, the labeler makes part-of-speech judgment on each word in the normalized word segmentation sequence, and determines the part-of-speech according to the context of the word and the characteristics of the word itself.
In one embodiment, referring to fig. 3, in step S130, based on the word segmentation sequence, the tax professional dictionary constructed in advance is utilized to identify and label the tax professional terms therein, and a target sequence with tax term labels is generated, which specifically includes the following steps:
S131, matching the continuous phrase in the word segmentation sequence with the tax professional dictionary, and identifying standard tax professional terms to obtain a preliminary marking sequence.
In this embodiment, the system processes word sequences using a multi-level matching strategy. The tax professional dictionary is classified and stored according to different dimensions of tax category, collection management, tax payment subject and the like, and each term contains attribute information such as standard name, category label, synonym set and the like.
Specifically, the system scans the word segmentation sequence with a sliding window whose size decreases from the maximum phrase length, matching the phrase in each window against the professional dictionary in turn.
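A longest-first sliding window match can be sketched as below. The dictionary contents, the category labels and the maximum window size of four tokens are hypothetical.

```python
def tag_terms(tokens, term_dict, max_window=4):
    """Longest-first sliding-window match of token spans against a term dictionary."""
    tagged, i = [], 0
    while i < len(tokens):
        # Start with the widest window and shrink it until a dictionary term is found.
        for size in range(min(max_window, len(tokens) - i), 0, -1):
            phrase = " ".join(tokens[i:i + size])
            if phrase in term_dict:
                tagged.append((phrase, term_dict[phrase]))  # (term, category label)
                i += size
                break
        else:
            tagged.append((tokens[i], None))  # no dictionary hit: leave the token unmarked
            i += 1
    return tagged

# Hypothetical dictionary mapping standard terms to category labels.
TERMS = {"small-scale taxpayer": "TAXPAYER", "value-added tax": "TAX_TYPE"}
result = tag_terms(["small-scale", "taxpayer", "value-added", "tax", "rate"], TERMS)
print(result)
```

Tokens left unmarked here (such as "rate") are the ones passed on to the fuzzy matching of step S132.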
S132, calculating the character similarity of the unrecognized phrase in the preliminary marking sequence and each term in the tax professional dictionary to obtain a similarity result set.
In this embodiment, the system performs fuzzy matching on phrases that cannot be matched exactly, using a character similarity calculation. The similarity calculation jointly considers several factors, such as edit distance, character overlap ratio and position weight, and generates a list of candidate terms for each phrase to be matched.
Specifically, the system calculates the similarity between each unmatched phrase and the dictionary terms. For example, when a user enters the non-standard expression "business tax preference", the system calculates its similarity to dictionary terms such as "business income tax preference policy" and "tax preference policy" and records the similarity scores.
S133, screening out phrases with similarity higher than a preset threshold value and corresponding standard tax professional terms from the similarity result set to obtain a fuzzy matching sequence.
In this embodiment, the system sets a dynamic similarity threshold mechanism. The base threshold is set to 0.75 and the system dynamically adjusts the threshold based on the query context and phrase length. For shorter phrases, the system adopts a higher threshold value to ensure matching precision, and for longer phrases, the threshold value is lowered to improve recall rate.
Specifically, the system sorts and screens the similarity result set. For example, when "business tax preference" is processed and its similarity to "business income tax preference policy" is 0.82, which exceeds the threshold, the pair is recorded in the fuzzy matching sequence and marked.
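Steps S132 and S133 can be sketched with a normalized edit distance and a length-dependent threshold. This sketch uses edit distance only (the character overlap and position weight factors are omitted); the base threshold of 0.75 comes from the description, while the length cut-offs and the 0.1 adjustment step are assumptions. It is not expected to reproduce the 0.82 score quoted in the example.

```python
def edit_distance(a, b):
    """Levenshtein distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution or match
        prev = cur
    return prev[-1]

def similarity(a, b):
    """1.0 for identical strings, 0.0 for completely different ones."""
    if not a and not b:
        return 1.0
    return 1.0 - edit_distance(a, b) / max(len(a), len(b))

def dynamic_threshold(phrase, base=0.75, short_len=6, long_len=20, step=0.1):
    """Stricter for short phrases, more lenient for long ones (cut-offs are assumptions)."""
    if len(phrase) <= short_len:
        return base + step
    if len(phrase) >= long_len:
        return base - step
    return base

def fuzzy_match(phrase, terms):
    """Return (standard term, score) if the best candidate clears the threshold, else None."""
    if not terms:
        return None
    best = max(terms, key=lambda t: similarity(phrase, t))
    score = similarity(phrase, best)
    return (best, score) if score > dynamic_threshold(phrase) else None

print(fuzzy_match("value aded tax", ["value-added tax", "business income tax"]))
```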
S134, combining the preliminary marking sequence and the fuzzy matching sequence to generate a target sequence with tax term marks.
In this embodiment, the system handles the merging of mark sequences with a priority policy. Exact match results have the highest priority, followed by fuzzy match results. When marks overlap, the system selects the optimal marking scheme according to the priority rules and the coverage.
Specifically, the system combines the marks of the preliminary marking sequence and the fuzzy matching sequence in the order of their positions in the original text.
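One way to realize this priority-based merge is sketched below. The representation of a mark as a `(start, end, term)` span over token positions, and the tie-break of preferring wider spans within the same priority, are assumptions made for illustration.

```python
def merge_marks(exact, fuzzy):
    """Merge (start, end, term) spans; exact spans win over overlapping fuzzy spans."""
    candidates = [(s, e, t, 0) for s, e, t in exact] + [(s, e, t, 1) for s, e, t in fuzzy]
    # Higher priority first (0 = exact match), then wider coverage.
    candidates.sort(key=lambda c: (c[3], -(c[1] - c[0])))
    chosen = []
    for s, e, t, p in candidates:
        # Keep a span only if it does not overlap any span already chosen.
        if all(e <= cs or s >= ce for cs, ce, _, _ in chosen):
            chosen.append((s, e, t, p))
    # Restore the order of the original text.
    return [(s, e, t, "exact" if p == 0 else "fuzzy") for s, e, t, p in sorted(chosen)]

exact = [(0, 2, "value-added tax")]
fuzzy = [(1, 3, "tax rate"), (4, 6, "business income tax preference policy")]
merged = merge_marks(exact, fuzzy)
print(merged)
```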
In one embodiment, referring to fig. 4, in step S140, tax elements are extracted by using preset tax element recognition rules based on a target sequence, and specifically includes the following steps:
S141, matching the target sequence with a preset associated word list, and identifying associated words and positions in the sequence to obtain an associated word position sequence.
In this embodiment, the system maintains an associated word list covering several categories, including logical associated words, conditional associated words, time associated words, and the like. Each category is annotated with its semantic function and usage scenario, and the system understands the semantic structure of the query sentence by recognizing the associated words.
Specifically, the system traverses the target sequence and matches each phrase against the associated word list. For example, for the query "the tax rates of business income tax and value-added tax, as well as the tax threshold for small-scale taxpayers", the system identifies the positions of the associated words "and" and "as well as" and marks them as parallel associated words.
S142, dividing the target sequence into a plurality of subsequences according to the related word position sequence to obtain a subsequence set.
In this embodiment, the system uses a divide-and-conquer strategy to divide the target sequence into a plurality of sub-sequences with complete semantics according to the recognized related word position.
Specifically, taking the above query as an example, the system divides the sequence at the two associated words "and" and "as well as" into three subsequences: "tax rate of business income tax", "tax rate of value-added tax", and "tax threshold for small-scale taxpayers". Each subsequence retains its original tax term marks for subsequent element extraction.
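Steps S141 and S142 can be sketched as a single pass that records the associated word positions and cuts the sequence at them. The associated word table and the token list are hypothetical. In the example of the description the shared attribute "tax rate" is also carried into each tax type subsequence; that completion step is not shown in this sketch.

```python
# Hypothetical associated word list: word -> semantic type.
ASSOCIATED_WORDS = {"and": "parallel", "as well as": "parallel", "if": "conditional"}

def split_on_associated_words(tokens, table=ASSOCIATED_WORDS):
    """Return (associated word positions, subsequences between them)."""
    positions, subsequences, current = [], [], []
    for index, token in enumerate(tokens):
        if token in table:
            positions.append((index, token, table[token]))
            if current:                 # close the subsequence that just ended
                subsequences.append(current)
                current = []
        else:
            current.append(token)
    if current:
        subsequences.append(current)
    return positions, subsequences

tokens = ["tax rate", "of", "business income tax", "and", "value-added tax",
          "as well as", "tax threshold", "for", "small-scale taxpayer"]
positions, subsequences = split_on_associated_words(tokens)
print(positions)
print(subsequences)
```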
S143, respectively extracting tax elements from the subsequence set by using a preset tax element identification rule to obtain a preliminary element set.
In this embodiment, the system uses a rule-based element recognition method, and the system defines specific recognition rules for different types of tax elements (e.g., tax types, tax rates, terms, conditions, etc.).
Specifically, the system applies the corresponding element recognition rules to each subsequence. For example, for the subsequence "tax rate of business income tax", the rules identify the tax type element "business income tax" and the attribute element "tax rate"; for "tax threshold for small-scale taxpayers", they identify the taxpayer element "small-scale taxpayers" and the attribute element "tax threshold". Both the original expression and the standardized expression of each element are retained during extraction, forming the preliminary element set.
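Rule-based element extraction of this kind can be sketched with one pattern per element type. The rule set, the element type names and the `STANDARD_FORMS` mapping are hypothetical; an actual rule base would be considerably richer.

```python
import re

# Hypothetical rules: (element type, pattern).
ELEMENT_RULES = [
    ("tax_type", re.compile(r"business income tax|value-added tax|\bVAT\b")),
    ("taxpayer", re.compile(r"small-scale taxpayer")),
    ("attribute", re.compile(r"tax rate|tax threshold")),
]
# Hypothetical mapping from a matched expression to its standardized form.
STANDARD_FORMS = {"VAT": "value-added tax"}

def extract_elements(subsequence):
    """Apply every rule to one subsequence and keep original and standardized forms."""
    elements = []
    for element_type, pattern in ELEMENT_RULES:
        for match in pattern.finditer(subsequence):
            original = match.group(0)
            elements.append({
                "type": element_type,
                "original": original,
                "standard": STANDARD_FORMS.get(original, original),
                "start": match.start(),
            })
    # Return the elements in the order in which they appear in the text.
    return sorted(elements, key=lambda e: e["start"])

for element in extract_elements("tax rate of business income tax"):
    print(element)
```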
S144, marking the logic relation among the elements in the preliminary element set according to the semantic type of the associated word in the associated word position sequence, and generating a complete element set containing the elements and the associated relation.
In this embodiment, the system constructs a network of logical relationships between the elements based on the semantic types of the associated words. The system defines several relationship types, including parallel, subordinate, conditional and adversative relationships, to describe the semantic links between different elements. The resulting complete element set contains not only the specific content of each element but also the logical relationships among the elements, forming a structured semantic network.
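The relationship marking can be sketched as a list of (element, relation, element) triples. The data layout is hypothetical, and two simplifying assumptions are made: an attribute is treated as subordinate to the tax type or taxpayer in the same subsequence, and the semantic type of an associated word labels the link between the subsequences on either side of it.

```python
HEAD_TYPES = {"tax_type", "taxpayer"}

def mark_relations(groups, connective_types):
    """groups: one list of (element type, standard name) pairs per subsequence.
    connective_types: semantic type of the associated word between neighbouring groups."""
    relations = []
    for group in groups:
        heads = [name for etype, name in group if etype in HEAD_TYPES]
        attributes = [name for etype, name in group if etype == "attribute"]
        # Inside one subsequence, each attribute is subordinate to its tax type or taxpayer.
        relations += [(head, "subordinate", attr) for head in heads for attr in attributes]
    for left, right, relation in zip(groups, groups[1:], connective_types):
        left_heads = [name for etype, name in left if etype in HEAD_TYPES]
        right_heads = [name for etype, name in right if etype in HEAD_TYPES]
        # Across subsequences, the associated word's type labels the link.
        relations += [(l, relation, r) for l in left_heads for r in right_heads]
    return relations

groups = [
    [("tax_type", "business income tax"), ("attribute", "tax rate")],
    [("tax_type", "value-added tax"), ("attribute", "tax rate")],
    [("taxpayer", "small-scale taxpayer"), ("attribute", "tax threshold")],
]
relations = mark_relations(groups, ["parallel", "parallel"])
for relation in relations:
    print(relation)
```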
In one embodiment, referring to fig. 5, after performing the search matching in the tax knowledge base, the steps further include:
S510, combining the candidate knowledge list obtained by retrieval with the original query statement of the user to construct an input prompt.
In this embodiment, the system uses a template-based prompt construction method: a dedicated prompt template is designed in advance to guide the large language model in analyzing the relevance of the candidate knowledge to the query. The prompt template comprises several parts, including the original query, the candidate knowledge content, a task description and scoring requirements, so that the information is organized in a structured manner.
Specifically, the system combines the user query and the candidate knowledge according to the preset template. For example, for the query "What is the value-added tax rate for small-scale taxpayers?", the system may construct the prompt: "Please analyze how relevant the following tax policy is to the user's question 'What is the value-added tax rate for small-scale taxpayers?': [candidate knowledge content]. Please evaluate along three dimensions (expertise, completeness and timeliness) and give a composite score from 0 to 10."
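As a non-limiting illustration, the template-based prompt construction may be sketched as follows. The constant `PROMPT_TEMPLATE` paraphrases the example prompt above, and the function name `build_prompts` is hypothetical; one prompt per candidate knowledge item is an assumption of this sketch.

```python
# Template paraphrasing the example prompt; {query} and {knowledge} are filled per candidate.
PROMPT_TEMPLATE = (
    "Please analyze how relevant the following tax policy is to the user's "
    "question '{query}': {knowledge}. "
    "Please evaluate along three dimensions (expertise, completeness, timeliness) "
    "and give a composite score from 0 to 10."
)


def build_prompts(query, candidates):
    """Build one input prompt per candidate knowledge item."""
    return [PROMPT_TEMPLATE.format(query=query, knowledge=c) for c in candidates]


prompts = build_prompts(
    "What is the value-added tax rate for small-scale taxpayers?",
    ["[candidate knowledge content 1]", "[candidate knowledge content 2]"],
)
```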
S520, inputting the input prompt into a large language model to obtain a relevance analysis result for each candidate knowledge item.
In this embodiment, the system inputs the constructed prompt into a pre-optimized large language model. Having been specially trained on tax-domain knowledge, the model can understand the professional content and nuances of tax policy texts, and can therefore judge the relevance of each candidate knowledge item.
S530, extracting the relevance score of each candidate knowledge item based on the relevance analysis result.
In this embodiment, the system uses regular expressions and text parsing rules to extract numerical scores from the analysis text output by the model. The system can identify the scores of the individual dimensions and the final composite score, and performs a plausibility check to ensure that each extracted score falls within the expected range.
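As a non-limiting illustration, the score extraction and range check may be sketched as follows. The output convention assumed here (a line such as "Composite score: 8.5") and the names `SCORE_PATTERN` and `extract_composite_score` are assumptions of this sketch; actual model output may require additional patterns.

```python
import re

# Assumed convention: the model writes "Composite score: <number>".
SCORE_PATTERN = re.compile(r"composite score\s*:?\s*(\d+(?:\.\d+)?)", re.IGNORECASE)


def extract_composite_score(analysis_text, low=0.0, high=10.0):
    """Return the composite score, or None if it is missing or out of range."""
    match = SCORE_PATTERN.search(analysis_text)
    if match is None:
        return None
    score = float(match.group(1))
    if not (low <= score <= high):
        return None  # plausibility check failed
    return score
```

Returning `None` rather than raising lets the caller decide whether to retry the model call or discard the candidate.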
S540, the candidate knowledge is reordered according to the relevance score, and a result list is generated.
In this embodiment, the system ranks the candidate knowledge based on the relevance score while considering other attributes of the knowledge, such as policy timeliness, file level, etc. The sorting algorithm adopts a weighted sorting mode, and the weights of different attributes can be adjusted according to actual application scenes.
Specifically, the system arranges the candidate knowledge in descending order of the composite score to generate a new result list.
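As a non-limiting illustration, the weighted re-ranking may be sketched as follows. The attribute names (`relevance`, `timeliness`, `level`), the default weights and the assumption that every attribute is already scaled to 0-10 are hypothetical choices for this sketch only.

```python
# Hypothetical default weights; the text only states that weights are adjustable.
DEFAULT_WEIGHTS = {"relevance": 0.7, "timeliness": 0.2, "level": 0.1}


def rerank(candidates, weights=None):
    """Sort candidates in descending order of their weighted composite score."""
    weights = weights or DEFAULT_WEIGHTS

    def composite(item):
        return sum(weights[name] * item[name] for name in weights)

    return sorted(candidates, key=composite, reverse=True)


ranked = rerank([
    {"id": "A", "relevance": 8.0, "timeliness": 5.0, "level": 5.0},
    {"id": "B", "relevance": 7.0, "timeliness": 10.0, "level": 10.0},
])
```

In this example the slightly less relevant but more current item `B` is placed first, which shows how the auxiliary attributes can change the order.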
S550, judging whether the highest score in the result list exceeds a preset threshold value, and if so, returning the corresponding knowledge as a reply.
In this embodiment, the system uses a dynamic threshold mechanism with a base threshold of 7 points. The system dynamically adjusts the threshold based on factors such as the complexity of the query and the number of candidate knowledge items. When the highest score exceeds the threshold, the system considers the knowledge sufficiently relevant and returns it as a valid answer.
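As a non-limiting illustration, a dynamic threshold of this kind may be sketched as follows. Only the base value of 7 comes from the description; the adjustment rules (their direction, the step of 0.5 and the cut-offs of 3) and the names `dynamic_threshold` and `select_reply` are purely hypothetical.

```python
BASE_THRESHOLD = 7.0  # base threshold stated in the description


def dynamic_threshold(num_elements, num_candidates):
    """Adjust the base threshold; both adjustment rules are assumptions."""
    threshold = BASE_THRESHOLD
    if num_elements > 3:      # assumed: complex queries demand a stricter match
        threshold += 0.5
    if num_candidates < 3:    # assumed: few candidates, relax slightly
        threshold -= 0.5
    return min(max(threshold, 0.0), 10.0)


def select_reply(result_list, num_elements):
    """result_list: (knowledge_id, score) pairs sorted by score, descending."""
    if not result_list:
        return None
    top_id, top_score = result_list[0]
    if top_score > dynamic_threshold(num_elements, len(result_list)):
        return top_id
    return None  # caller falls back to other handling
```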
In one embodiment, referring to fig. 6, when the highest score in the result list does not exceed the preset threshold, the method further includes the following steps:
S610, extracting knowledge items with the relevance score not lower than a preset recommendation threshold value from the result list, and generating a candidate recommendation set.
In this embodiment, the system sets a recommendation threshold lower than the main threshold for screening potentially relevant knowledge. The system collects all knowledge items whose scores are at or above the recommendation threshold to form a candidate recommendation pool.
Specifically, the system traverses each knowledge item in the results list, checking if its score reaches the recommendation threshold. For example, for a result list containing 10 pieces of knowledge, if 4 pieces of knowledge have scores of 6.8, 6.5, 5.8 and 5.2, respectively, and the recommendation threshold is 5, then all 4 pieces of knowledge are included in the candidate recommendation set. The system can record key attributes of each knowledge, such as policy validity period, application range and the like, for subsequent display and sorting.
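As a non-limiting illustration, the screening of step S610 may be sketched as follows using the figures of the example above. The function name `build_candidate_set` and the six scores below 5 are invented for this sketch, since the description does not give them.

```python
RECOMMENDATION_THRESHOLD = 5.0  # value used in the example above


def build_candidate_set(result_list, threshold=RECOMMENDATION_THRESHOLD):
    """Keep knowledge items whose score is not lower than the threshold."""
    return [(kid, score) for kid, score in result_list if score >= threshold]


# Ten items; the first four scores come from the example, the rest are made up.
scores = [6.8, 6.5, 5.8, 5.2, 4.9, 4.1, 3.7, 3.0, 2.2, 1.5]
result_list = [(f"k{i + 1}", s) for i, s in enumerate(scores)]
candidate_set = build_candidate_set(result_list)
```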
S620, sorting in descending order according to the relevance score of each knowledge item in the candidate recommendation set, and generating an ordered recommendation list.
In this embodiment, the system adopts a multidimensional ranking strategy, and in addition to considering the relevance score, factors such as timeliness, frequency of use, document source and the like of the knowledge items are combined. The system assigns a weight to each dimension and obtains a final ranking score by weighted calculation.
S630, formatting the ordered recommendation list into a recommendation interface readable by a user and returning.
In one embodiment, referring to fig. 7, after returning the corresponding knowledge as a reply, the following steps are further included:
S710, generating an interaction record containing the user query, the returned knowledge ID and the return time, and assigning a unique identifier.
In this embodiment, the system employs a distributed ID generation algorithm to generate a globally unique identifier for each query interaction. The interaction record contains desensitized user information, the query content, the ID of the returned knowledge, the interaction timestamp, the query source, and the like.
Specifically, when the system returns a tax policy knowledge, an interaction record is created immediately.
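As a non-limiting illustration, an interaction record may be sketched as follows. The description does not name a particular distributed ID generation algorithm; a random UUID (`uuid.uuid4`) is used here only as a stand-in, and the class and field names are hypothetical.

```python
import uuid
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class InteractionRecord:
    user_ref: str        # desensitized user reference, never the raw identity
    query: str           # query content
    knowledge_id: str    # ID of the returned knowledge
    source: str          # query source, e.g. a channel name
    returned_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )
    # Stand-in for the distributed ID generator: a random 128-bit UUID.
    record_id: str = field(default_factory=lambda: uuid.uuid4().hex)


record = InteractionRecord(
    user_ref="u-7f3a",
    query="What is the value-added tax rate for small-scale taxpayers?",
    knowledge_id="K-001",
    source="web",
)
```

Feedback received later can be stored with the same `record_id`, which is how the feedback record is associated with the interaction record.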
S720, receiving and recording feedback information of the user based on the identifier corresponding to the interaction record, and generating a feedback record.
In this embodiment, the feedback information of the user includes explicit feedback (such as an active rating by the user) and implicit feedback (such as browsing duration and whether the item was bookmarked). The feedback information is associated with the interaction record through the unique identifier, forming a complete chain of user experience data.
S730, calculating a weight adjustment value of the knowledge item according to the feedback type in the feedback record.
In this embodiment, the system designs a corresponding weight adjustment algorithm according to different types of feedback. The algorithm considers factors such as feedback type, user activity, feedback timeliness and the like, and obtains a final adjustment value through weighted calculation.
S740, adding the weight adjustment value to the original weight value of the corresponding entry in the knowledge base, and updating the original weight value.
In this embodiment, the system adopts a progressive weight update strategy to modify the original weight of the knowledge item by accumulating the adjustment values fed back multiple times. The time decay factor is considered in the updating process, so that the recent feedback has a larger influence. The system also regularly normalizes the weight values to maintain the balance of the overall weight distribution.
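As a non-limiting illustration, steps S730 and S740 may be sketched as follows. The feedback types and their base values, the exponential decay with a 30-day half-life, the clamping at zero and the sum-to-one normalization are all assumptions of this sketch; the description states only that feedback type, time decay and periodic normalization are taken into account.

```python
# Hypothetical base adjustment per feedback type.
FEEDBACK_VALUES = {"like": 0.10, "favorite": 0.05, "dislike": -0.10}


def adjustment(feedback_type, age_days, half_life_days=30.0):
    """Weight adjustment of one feedback, halved every half_life_days."""
    decay = 0.5 ** (age_days / half_life_days)
    return FEEDBACK_VALUES.get(feedback_type, 0.0) * decay


def update_weight(original, feedbacks):
    """Add the accumulated adjustments to the original weight (floored at 0)."""
    total = sum(adjustment(ftype, age) for ftype, age in feedbacks)
    return max(0.0, original + total)


def normalize(weights):
    """Rescale all weights so that they sum to 1, keeping their proportions."""
    total = sum(weights.values())
    if total == 0:
        return dict(weights)
    return {kid: w / total for kid, w in weights.items()}
```

With these assumptions, a "like" given today adds 0.10 to the weight while the same feedback from 30 days ago adds half as much, so recent feedback has the larger influence.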
It should be understood that the sequence numbers of the steps in the foregoing embodiments do not imply an execution order; the execution order of each process should be determined by its function and internal logic, and the sequence numbers should not limit the implementation of the embodiments of the present application in any way.
In a second aspect, the present application provides a retrieval system for a tax knowledge question-answering system, and the retrieval system for a tax knowledge question-answering system of the present application will be described below with reference to the above retrieval method for a tax knowledge question-answering system.
Referring to fig. 8, a retrieval system for a tax knowledge question-and-answer system, comprising:
The receiving module is used for receiving tax problem query sentences input by a user;
The word segmentation marking module is used for performing word segmentation and part-of-speech marking on the query sentence to obtain a word segmentation sequence with part-of-speech marking;
the term marking module is used for identifying and marking, based on the word segmentation sequence, the tax professional terms therein by utilizing a pre-constructed tax professional dictionary, to generate a target sequence with tax term marks;
the element extraction module is used for extracting tax elements by utilizing a preset tax element identification rule based on the target sequence;
and the retrieval matching module is used for constructing a query vector according to the extracted tax elements and executing retrieval matching in the tax knowledge base.
In one embodiment, the word segmentation annotation module includes:
The preliminary word segmentation unit is used for carrying out preliminary word segmentation on the query sentence by utilizing a word segmentation device to obtain an initial word segmentation sequence;
the abbreviation identification unit is used for matching the initial word segmentation sequence with a preset tax abbreviation comparison table and identifying tax professional abbreviations;
The standardized unit is used for replacing the identified tax professional abbreviations with corresponding standard terms according to the tax abbreviation comparison table to obtain standardized word segmentation sequences;
the part-of-speech tagging unit is used for performing part-of-speech tagging on the standardized word segmentation sequence to obtain a word segmentation sequence with part-of-speech tagging.
In one embodiment, the term tagging module includes:
the standard term identification unit is used for matching the continuous phrase in the word segmentation sequence with the tax professional dictionary and identifying standard tax professional terms to obtain a preliminary marking sequence;
The similarity calculation unit is used for calculating, for each unrecognized phrase in the preliminary marking sequence, its character similarity to each term in the tax professional dictionary to obtain a similarity result set;
the fuzzy matching unit is used for screening out phrases with similarity higher than a preset threshold value and corresponding standard tax professional terms from the similarity result set to obtain a fuzzy matching sequence;
And the sequence merging unit is used for merging the preliminary marking sequence and the fuzzy matching sequence to generate a target sequence with tax term marks.
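As a non-limiting illustration, the similarity calculation unit and the fuzzy matching unit may be sketched as follows. The character similarity measure is not specified here; the ratio computed by `difflib.SequenceMatcher` from the Python standard library is used only as a stand-in, and the threshold of 0.8 and the function name `fuzzy_match` are assumptions.

```python
from difflib import SequenceMatcher


def fuzzy_match(phrase, dictionary_terms, threshold=0.8):
    """Return (standard_term, similarity) for the best match above threshold.

    Returns None when no dictionary term is similar enough.
    """
    best_term, best_score = None, 0.0
    for term in dictionary_terms:
        # ratio() is 2 * matches / total length, a value in [0, 1].
        score = SequenceMatcher(None, phrase, term).ratio()
        if score > best_score:
            best_term, best_score = term, score
    if best_score > threshold:
        return best_term, best_score
    return None


terms = ["value-added tax", "consumption tax"]
hit = fuzzy_match("value-added tx", terms)   # misspelled phrase
miss = fuzzy_match("stamp duty", terms)      # no sufficiently similar term
```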
In one embodiment, the element extraction module includes:
The related word recognition unit is used for matching the target sequence with a preset related word list, recognizing related words and positions in the sequence and obtaining a related word position sequence;
The sequence dividing unit is used for dividing the target sequence into a plurality of subsequences according to the related word position sequence to obtain a subsequence set;
The element identification unit is used for respectively extracting tax elements from the subsequence set by utilizing a preset tax element identification rule to obtain a preliminary element set;
the relation marking unit is used for marking the logical relationships among the elements in the preliminary element set according to the semantic type of each related word in the related word position sequence, and generating a complete element set containing the elements and their relationships.
In one embodiment, further comprising:
The prompt construction module is used for combining the candidate knowledge list obtained by retrieval with the original query statement of the user to construct an input prompt;
The relevance analysis module is used for inputting the input prompt into the large language model to obtain a relevance analysis result for each candidate knowledge item;
The score extraction module is used for extracting the relevance score of each candidate knowledge item based on the relevance analysis result;
The sequencing module is used for re-sequencing the candidate knowledge according to the relevance score to generate a result list;
and the judging and returning module is used for judging whether the highest score in the result list exceeds a preset threshold value, and returning the corresponding knowledge as a reply if the highest score exceeds the preset threshold value.
In one embodiment, further comprising:
The recommendation set generation module is used for extracting knowledge items with the relevance score not lower than a preset recommendation threshold value from the result list and generating a candidate recommendation set;
the recommendation ordering module is used for ordering in a descending order according to the relevance score of each knowledge item in the candidate recommendation set, and generating an ordered recommendation list;
And the recommendation display module is used for formatting the ordered recommendation list into a recommendation interface readable by a user and returning the recommendation interface.
In one embodiment, further comprising:
the interaction record module is used for generating an interaction record comprising user inquiry, returned knowledge ID and returned time and distributing a unique identifier;
the feedback recording module is used for receiving and recording feedback information of the user based on the identifier corresponding to the interaction record and generating a feedback record;
the weight calculation module is used for calculating a weight adjustment value of the knowledge item according to the feedback type in the feedback record;
and the weight updating module is used for adding the weight adjustment value to the original weight value of the corresponding entry in the knowledge base and updating the original weight value.
In one embodiment, the present application provides an electronic device, which may be a server, and an internal structure thereof may be as shown in fig. 9. The electronic device includes a processor, a memory, and a network interface connected by a system bus. Wherein the processor of the electronic device is configured to provide computing and control capabilities. The memory of the electronic device includes a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, computer programs, and a database. The internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage media. The database of the electronic device is for storing data. The network interface of the electronic device is used for communicating with an external terminal through a network connection. The computer program, when executed by a processor, implements a retrieval method for a tax knowledge question-answering system.
It will be appreciated by those skilled in the art that the structure shown in fig. 9 is merely a block diagram of a portion of the structure associated with the present application and is not limiting of the electronic device to which the present application is applied, and that a particular electronic device may include more or fewer components than shown, or may combine certain components, or have a different arrangement of components.
In an embodiment, there is also provided an electronic device including a memory and a processor, the memory storing a computer program, the processor implementing the steps of the method embodiments described above when executing the computer program.
Those skilled in the art will appreciate that implementing all or part of the above-described methods may be accomplished by way of a computer program stored on a non-transitory computer readable storage medium, which when executed may comprise the steps of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium used in embodiments provided herein may include at least one of non-volatile and volatile memory. The nonvolatile Memory may include Read-Only Memory (ROM), magnetic tape, floppy disk, flash Memory, optical Memory, or the like. Volatile memory can include random access memory (Random Access Memory, RAM) or external cache memory. By way of illustration, and not limitation, RAM can be in various forms such as static random access memory (Static Random Access Memory, SRAM) or dynamic random access memory (Dynamic Random Access Memory, DRAM), etc.
The above embodiments are not intended to limit the scope of protection of the application; accordingly, all equivalent changes made according to the structure, shape and principle of the application shall fall within the scope of protection of the application.