CN121188178A

CN121188178A - Search processing method, device, equipment, medium and program product for query statement

Info

Publication number: CN121188178A
Application number: CN202511248214.5A
Authority: CN
Inventors: 黄兴如; 刘中亮; 李奕萱; 王功举; 胡博文; 闫龙; 李大中
Original assignee: China Unicom Data Intelligence Co ltd; China United Network Communications Group Co Ltd
Current assignee: China Unicom Data Intelligence Co ltd; China United Network Communications Group Co Ltd
Priority date: 2025-09-02
Filing date: 2025-09-02
Publication date: 2025-12-23

Abstract

The application provides a search processing method, apparatus, device, medium, and program product for a query statement. The method comprises: obtaining a query statement and a multi-layer index, wherein the multi-layer index comprises a text structure-based index, a text semantic-based index, and a text unit-based index; performing semantic similarity retrieval and keyword matching retrieval according to the query statement and the multi-layer index to obtain a semantic retrieval result set and a keyword retrieval result set; performing duplicate document block filtering on the semantic retrieval result set and the keyword retrieval result set to obtain a filtered retrieval result set; and ranking the filtered document blocks in the filtered retrieval result set to obtain a target document block candidate set, which provides a context prompt for generating an answer sentence for the query statement. Retrieval accuracy is thereby improved.

Description

Search processing method, device, equipment, medium and program product for query statement

Technical Field

The present application relates to the field of computer technologies, and in particular, to a method, an apparatus, a device, a medium, and a program product for processing search of a query statement.

Background

Intelligent question-answering technology enables a computer to answer a user's question in accurate and concise natural language, so as to meet the user's need to acquire information. However, an intelligent question-answering system may generate inaccurate or fabricated content when handling specialized queries in a particular domain.

To improve the quality and accuracy of intelligent question-answering results, existing intelligent question-answering technology is combined with RAG (Retrieval-Augmented Generation) technology: knowledge is embedded, stored in a vector database, and retrieved; related documents are retrieved by semantic similarity to serve as context, helping the question-answering model output more accurate answers. However, when the existing mechanism, which relies solely on vector retrieval, faces question answering in complex scenarios involving massive, multi-source, heterogeneous data, it struggles to capture fine-grained semantic differences and the associations between complex concepts, so the relevance of retrieval results tends to drop, generalization bias tends to arise, and retrieval precision decreases.

Based on this, the conventional technology has a problem of insufficient search accuracy.

Disclosure of Invention

The embodiment of the application provides a search processing method, a search processing device, search processing equipment, search processing media and search processing program products, which are used for achieving the effect of improving search precision.

In a first aspect, an embodiment of the present application provides a method for processing search of a query statement, including:

acquiring a query statement;

Acquiring a multi-layer index, wherein the multi-layer index comprises a text structure-based index, a text semantic-based index and a text unit-based index;

According to the query statement and the multi-layer index, carrying out semantic similarity retrieval and keyword matching retrieval to obtain a semantic retrieval result set and a keyword retrieval result set, wherein the semantic retrieval result set and the keyword retrieval result set both comprise a plurality of document blocks;

performing repeated document block filtering processing on the semantic search result set and the keyword search result set to obtain a filtered search result set, wherein the filtered search result set comprises a plurality of filtering document blocks;

And ranking the filtered document blocks in the filtered search result set to obtain a target document block candidate set, wherein the target document block candidate set is used for providing a context prompt for the answer sentence generation of the query sentence.

In one possible implementation, the text structure-based index includes a chapter topic index and a paragraph topic index;

The text semantic-based index includes a semantic topic index and a semantic question index;

The text unit-based index includes a semantic text keyword index and a semantic text vector index.

In one possible implementation, before the multi-layer index is obtained, the method further includes:

Acquiring a plurality of text data, wherein the text data comprises academic paper data, technical report data and science popularization article data;

performing data preprocessing on the text data to obtain vocabulary data;

Carrying out knowledge slicing processing on a plurality of text data according to vocabulary data to obtain knowledge slicing data, wherein the knowledge slicing data comprises chapter data, paragraph data and semantic block data;

generating structured storage information according to knowledge slice data;

And constructing a multi-layer index according to the structured storage information.

In one possible implementation, the data preprocessing is performed on the text data to obtain vocabulary data, including:

word segmentation is carried out on the text data to obtain first word data;

word frequency filtering is carried out on the first word data to obtain second word data;

And performing stem extraction processing on the second word data to obtain vocabulary data.

In one possible embodiment, performing knowledge slicing processing on a plurality of text data according to vocabulary data to obtain knowledge slice data includes:

extracting chapter data and paragraph data in text data according to the vocabulary data and the document analysis tool;

cutting the text data according to punctuation marks in the text data to obtain cut text data;

According to the pre-trained large language model, carrying out semantic block recognition processing on the segmented text data to obtain a plurality of semantic text blocks;

and adding identification information to the semantic text blocks to obtain semantic block data.

In one possible implementation, generating structured store information from knowledge slice data includes:

establishing a knowledge slice mapping relation according to the chapter data, the paragraph data and the semantic block data;

and adding metadata information to the knowledge slice mapping relationship to obtain structured storage information, wherein the metadata information comprises a plurality of sources, time stamps and domain labels.

In one possible implementation manner, according to the query sentence and the multi-layer index, the semantic similarity search and the keyword matching search are performed to obtain a semantic search result set and a keyword search result set, including:

Generating a query vector according to the query statement;

According to the query vector and the multi-layer index, carrying out semantic similarity retrieval to obtain a semantic retrieval result set;

Generating a query keyword set according to the query statement;

And carrying out keyword matching search according to the query keyword set and the multi-layer index to obtain a keyword search result set.

In one possible implementation, after ranking the filtered document blocks in the filtered search result set to obtain the target document block candidate set, the method further includes:

inputting the target document block candidate set and the query sentence into a preset large language model;

and determining an answer sentence corresponding to the query sentence according to the output result of the preset large language model.

In a second aspect, an embodiment of the present application provides a search processing apparatus for a query sentence, including:

the first acquisition module is used for acquiring the query statement;

The second acquisition module is used for acquiring the multi-layer index; wherein the multi-layer index comprises a text structure-based index, a text semantic-based index and a text unit-based index;

the search module is used for carrying out semantic similarity search and keyword matching search according to the query statement and the multi-layer index so as to obtain a semantic search result set and a keyword search result set, wherein the semantic search result set and the keyword search result set comprise a plurality of document blocks;

The filtering module is used for filtering the repeated document blocks of the semantic search result set and the keyword search result set to obtain a filtered search result set, wherein the filtered search result set comprises a plurality of filtering document blocks;

And the ranking module is used for ranking the filtered document blocks in the filtered search result set to obtain a target document block candidate set, wherein the target document block candidate set is used for providing a context prompt for the answer sentence generation of the query sentence.

In a third aspect, an embodiment of the present application provides a search processing apparatus for a query statement, including a memory, and a processor;

The memory stores computer-executable instructions;

the processor executes computer-executable instructions stored in the memory such that the processor performs the various possible implementations of the first aspect and/or the first aspect as described above.

In a fourth aspect, embodiments of the present application provide a computer-readable storage medium having stored therein computer-executable instructions for implementing the various possible implementations of the above first aspect and/or the first aspect when executed by a processor.

In a fifth aspect, embodiments of the present application provide a computer program product comprising a computer program which, when executed by a processor, implements the various possible implementations of the above first aspect and/or the first aspect.

According to the search processing method, apparatus, device, medium, and program product for a query statement provided by the embodiments of the application, a query statement and a multi-layer index are acquired, and semantic similarity retrieval and keyword matching retrieval are performed to obtain a semantic retrieval result set and a keyword retrieval result set, wherein the multi-layer index comprises a text structure-based index, a text semantic-based index, and a text unit-based index, and the semantic retrieval result set and the keyword retrieval result set each comprise a plurality of document blocks. Duplicate document block filtering is performed on the semantic retrieval result set and the keyword retrieval result set to obtain a filtered retrieval result set comprising a plurality of filtered document blocks, and the filtered document blocks are ranked to obtain a target document block candidate set, which provides a context prompt for generating an answer sentence for the query statement. The multi-layer index refines the query content at different levels and from different angles, fuses multi-dimensional information to enhance the index, and strengthens the associations among the query content; the results are then filtered and ranked, improving the accuracy and quality of the retrieval results.

Drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the application and together with the description, serve to explain the principles of the application.

FIG. 1 is a schematic diagram of a search processing system architecture for query sentences according to the present application;

FIG. 2 is a first flowchart of the search processing method of a query sentence provided by the present application;

FIG. 3 is a second flowchart of the search processing method of a query sentence provided by the present application;

FIG. 4 is a third flowchart of the search processing method of a query sentence provided by the present application;

FIG. 5 is a fourth flowchart of the search processing method of a query sentence provided by the present application;

FIG. 6 is a fifth flowchart of the search processing method of a query sentence provided by the present application;

FIG. 7 is a schematic diagram of a search processing device for query sentences according to the present application;

FIG. 8 is a schematic diagram of a search processing device for a query sentence according to the present application.

Specific embodiments of the present application have been shown by way of the above drawings and will be described in more detail below. The drawings and the written description are not intended to limit the scope of the inventive concepts in any way, but rather to illustrate the inventive concepts to those skilled in the art by reference to the specific embodiments.

Detailed Description

Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, the same numbers in different drawings refer to the same or similar elements, unless otherwise indicated. The implementations described in the following exemplary examples do not represent all implementations consistent with the application. Rather, they are merely examples of apparatus and methods consistent with aspects of the application as detailed in the accompanying claims.

It should be noted that, the data related to the present application are all information and data authorized by the user or fully authorized by each party, and the collection, use and processing of the related data need to comply with the related laws and regulations and standards, and a corresponding operation entry is provided for the user to select authorization or rejection.

Optionally, fig. 1 is a schematic diagram of a search processing system architecture of a query statement provided by the present application. As shown in fig. 1, the retrieval processing system architecture of the query sentence includes at least one of a data acquisition device 101, a processing device 102, and a display device 103.

It should be understood that the structures illustrated in the embodiments of the present application do not constitute a specific limitation on the architecture described above. In other possible embodiments of the present application, the architecture may include more or less components than those illustrated, or some components may be combined, some components may be split, or different component arrangements may be specifically determined according to the actual application scenario, and the present application is not limited herein. The components shown in fig. 1 may be implemented in hardware, software, or a combination of software and hardware.

In a specific implementation process, the data acquisition device 101 may include an input/output interface or a communication interface, and the data acquisition device 101 may be connected to a processing device through the input/output interface or the communication interface, for acquiring a query statement and acquiring a multi-layer index, where the multi-layer index includes a text structure-based index, a text semantic-based index, and a text unit-based index.

The processing device 102 may be configured to perform semantic similarity search and keyword matching search according to a query sentence and a multi-layer index, so as to obtain a semantic search result set and a keyword search result set, where the semantic search result set and the keyword search result set each include a plurality of document blocks, perform repeated document block filtering processing on the semantic search result set and the keyword search result set to obtain a filtered search result set, where the filtered search result set includes a plurality of filtered document blocks, and perform ranking processing on the filtered document blocks in the filtered search result set to obtain a target document block candidate set, where the target document block candidate set is used to provide a context hint for generating an answer sentence of the query sentence.

The display device 103 may be a touch display screen or the screen of a terminal device, and is used to display the above content and receive user instructions so as to enable interaction with the user.

The RAG technology is a technical framework combining information retrieval and language model generation capability and comprises a retrieval stage and a generation stage, wherein the retrieval stage is used for retrieving documents or text fragments related to user query from an external knowledge base, and the generation stage is used for combining the retrieved context information with the user query and inputting the context information into a large language model to generate an answer.

By combining the RAG technology and the intelligent question-answering technology, the problem that the generated content is inconsistent with the fact when a large language model processes specific fields or highly specialized queries can be effectively solved.

However, conventional RAG relies on either single vector retrieval or keyword-based matching, so it is difficult to capture semantic similarity and perform exact keyword matching at the same time. For technical terms that involve low-frequency vocabulary or have fuzzy semantic boundaries, single vector retrieval is prone to generalization bias, while keyword matching may ignore contextual semantics. Moreover, for massive, multi-source, heterogeneous text data, the structural and semantic information of the text is underused, so retrieval efficiency and retrieval precision are low.

In order to solve the technical problems, the application has the core concept that by introducing a multi-layer index system, text features are captured from different dimensions, structured information and semantic text blocks are fused, the limitation of single dimension on retrieval is broken, the retrieval results are ordered, and the accuracy of the retrieval results is improved.

The following describes the technical scheme of the present application and how the technical scheme of the present application solves the above technical problems in detail with specific embodiments. The following embodiments may be combined with each other, and the same or similar concepts or processes may not be described in detail in some embodiments. Embodiments of the present application will be described below with reference to the accompanying drawings.

Fig. 2 is a flow chart of a search processing method of a query sentence provided by the present application, as shown in fig. 2, the method includes:

S201, acquiring a query statement.

In this embodiment, the scenarios of the intelligent question-answering system include open domain questions and domain knowledge questions, including, but not limited to, academic, medical, financial, legal, business domain knowledge query sentences.

For example, in an application scenario of an intelligent question-answering system, the obtained query statement may be "What is the basic principle of quantum computing?".

S202, acquiring a multi-layer index, wherein the multi-layer index comprises a text structure-based index, a text semantic-based index and a text unit-based index.

Optionally, the text structure-based index includes a chapter topic index and a paragraph topic index.

In this embodiment, the chapter topic index is used to quickly locate the related chapter content, and the paragraph topic index is used to reflect the core semantics of the paragraphs, so as to implement retrieval based on semantic matching.

For a document with a chapter structure, the chapter topic is determined by jointly analyzing the chapter title, the opening paragraph of the chapter, the closing paragraph of the chapter, and professional vocabulary whose occurrence frequency is higher than a preset frequency. For a document without a chapter structure, the entire text data is analyzed as a single chapter to determine the chapter topic. The determined chapter topic is converted into a semantic vector by a preset large language model to serve as the chapter topic index.

A key sentence is extracted from each paragraph as the paragraph topic by using a text ranking algorithm (TextRank). The algorithm takes the sentences in the paragraph as nodes, constructs a graph in which the semantic similarity between sentences serves as the edge weight, iteratively computes an importance score for each node, and selects the sentence with the highest score as the key sentence. The extracted key sentences are converted into semantic vectors to construct the paragraph topic index.
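The graph-based key sentence extraction described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the bag-of-words cosine similarity stands in for the semantic similarity the embodiment would obtain from a language model, and the names `bow`, `key_sentence`, `damping`, and `iterations` are assumptions introduced here.

```python
import math
import re
from collections import Counter

def bow(sentence):
    """Bag-of-words vector; a stand-in for a real sentence embedding (assumption)."""
    return Counter(re.findall(r"\w+", sentence.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def key_sentence(sentences, damping=0.85, iterations=50):
    """Return the sentence with the highest score after a TextRank-style iteration."""
    n = len(sentences)
    if n == 0:
        return None
    vecs = [bow(s) for s in sentences]
    # Edge weights: pairwise similarity between sentences, no self-loops.
    w = [[cosine(vecs[i], vecs[j]) if i != j else 0.0 for j in range(n)] for i in range(n)]
    out_sum = [sum(row) for row in w]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        # Each node receives a share of its neighbours' scores, weighted by edge strength.
        scores = [
            (1 - damping) / n
            + damping * sum(w[j][i] / out_sum[j] * scores[j] for j in range(n) if out_sum[j] > 0)
            for i in range(n)
        ]
    best = max(range(n), key=lambda i: scores[i])
    return sentences[best]

paragraph = [
    "Qubits can exist in a superposition of states.",
    "Quantum gates act on qubits to change their states.",
    "The weather was pleasant that day.",
]
topic = key_sentence(paragraph)
```

In this toy paragraph the third sentence shares no words with the others, so it receives only the baseline score and is never selected as the paragraph topic.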

The text semantic-based index includes a semantic topic index and a semantic question index.

In this embodiment, a semantic topic and a semantic question are generated for each semantic text block of a corpus according to semantic understanding and generating capabilities of a preset large language model, the semantic topic is converted into a vector to serve as a semantic topic index, and the semantic question is converted into a vector to serve as a semantic question index.

In this embodiment, according to the knowledge slice mapping relationship, the chapter theme and the paragraph theme are fused to the corresponding semantic text blocks, the fused semantic text blocks are subjected to word segmentation and keyword extraction, and the importance weight of the keywords is determined by using a TF-IDF (Term Frequency-Inverse Document Frequency) algorithm to establish a semantic text keyword index.

Further, the semantic text keywords are converted into semantic vectors to establish corresponding semantic text vector indexes.
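As a rough illustration of the TF-IDF weighting used for the semantic text keyword index, the sketch below computes keyword weights per block. It assumes the chapter and paragraph topics have already been fused into each block's text and that whitespace splitting is an adequate tokenizer; the function name `tfidf_keyword_index` and the parameter `top_n` are hypothetical.

```python
import math
from collections import Counter

def tfidf_keyword_index(blocks, top_n=3):
    """Map each block id to its top_n (keyword, weight) pairs under TF-IDF."""
    tokenized = {bid: text.lower().split() for bid, text in blocks.items()}
    n_docs = len(tokenized)
    # Document frequency: number of blocks that contain each term.
    df = Counter(term for tokens in tokenized.values() for term in set(tokens))
    index = {}
    for bid, tokens in tokenized.items():
        tf = Counter(tokens)
        weights = {
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        }
        index[bid] = sorted(weights.items(), key=lambda kv: kv[1], reverse=True)[:top_n]
    return index

# Toy fused blocks (assumption: topics already merged into the block text).
blocks = {
    "b1": "quantum computing principles qubit superposition qubit",
    "b2": "quantum computing applications shor algorithm",
}
keyword_index = tfidf_keyword_index(blocks)
```

Terms that occur in every block, such as "quantum" here, get a weight of zero, which is why TF-IDF favors keywords that distinguish one semantic text block from another.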

S203, carrying out semantic similarity retrieval and keyword matching retrieval according to the query sentence and the multi-layer index to obtain a semantic retrieval result set and a keyword retrieval result set, wherein the semantic retrieval result set and the keyword retrieval result set each comprise a plurality of document blocks.

In this embodiment, Milvus (a vector database) is used to calculate the cosine similarity between the query vector and the stored documents, so as to capture the deep semantic relationship between them. The similarity calculation is performed against the corresponding semantic vectors in the multi-layer index, the calculation results are screened comprehensively, and a preset number of document blocks with the highest similarity are returned and assembled into a set to obtain the semantic retrieval result set.
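The cosine-similarity retrieval over several index layers can be sketched in plain Python as below. It deliberately does not use the vector database's API; the in-memory dictionaries, the function `semantic_search`, and the choice to score a block by its best similarity across layers are all assumptions, since the text does not specify how the per-layer results are combined.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def semantic_search(query_vec, layer_indexes, top_k=2):
    """Score each block by its best similarity across layers; return the top_k block ids."""
    best = {}
    for layer in layer_indexes.values():
        for block_id, vec in layer.items():
            score = cosine_similarity(query_vec, vec)
            best[block_id] = max(best.get(block_id, -1.0), score)
    ranked = sorted(best.items(), key=lambda kv: kv[1], reverse=True)
    return [block_id for block_id, _ in ranked[:top_k]]

# Toy two-dimensional vectors; real embeddings would have hundreds of dimensions.
layer_indexes = {
    "semantic_topic": {"b1": [1.0, 0.0], "b2": [0.0, 1.0], "b3": [0.7, 0.7]},
    "semantic_question": {"b1": [0.9, 0.1], "b2": [0.2, 0.8], "b3": [0.5, 0.5]},
}
semantic_results = semantic_search([1.0, 0.0], layer_indexes)
```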

Query keywords are extracted from the query sentence, document blocks containing the query keywords are retrieved from the stored documents by using a retriever based on BM25 (Best Match 25) together with an inverted index, and the retrieved document blocks are assembled into a set to obtain the keyword retrieval result set.
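A minimal sketch of keyword retrieval with an inverted index and BM25 scoring follows. The parameter values `k1=1.5` and `b=0.75` are common defaults rather than values given in the text, the IDF variant is one of several in use, and the helper names are hypothetical.

```python
import math
from collections import Counter, defaultdict

def build_inverted_index(blocks):
    """Map each term to {block_id: term frequency in that block}."""
    index = defaultdict(dict)
    for block_id, tokens in blocks.items():
        for term, count in Counter(tokens).items():
            index[term][block_id] = count
    return index

def bm25_search(query_terms, blocks, index, k1=1.5, b=0.75, top_k=2):
    """Return the ids of the top_k blocks under BM25, considering only blocks that match a query term."""
    n = len(blocks)
    avg_len = sum(len(tokens) for tokens in blocks.values()) / n
    scores = defaultdict(float)
    for term in query_terms:
        postings = index.get(term, {})
        # Smoothed inverse document frequency (non-negative variant).
        idf = math.log((n - len(postings) + 0.5) / (len(postings) + 0.5) + 1)
        for block_id, tf in postings.items():
            length_norm = 1 - b + b * len(blocks[block_id]) / avg_len
            scores[block_id] += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

blocks = {
    "b1": ["quantum", "computing", "uses", "qubits"],
    "b2": ["classical", "computing", "uses", "bits"],
    "b3": ["qubits", "support", "superposition"],
}
index = build_inverted_index(blocks)
keyword_results = bm25_search(["quantum", "qubits"], blocks, index)
```

Because scoring walks the postings lists of the query terms only, blocks that contain none of the query keywords (here `b2`) are never scored or returned.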

S204, performing repeated document block filtering processing on the semantic search result set and the keyword search result set to obtain a filtered search result set, wherein the filtered search result set comprises a plurality of filtering document blocks.

In this embodiment, a duplicate check is performed on the unique identifier of each document block in the semantic search result set and the keyword search result set. If the similarity between a document block in the semantic search result set and a document block in the keyword search result set is higher than a preset similarity threshold, the document block is considered a duplicate and is filtered out, and the remaining document blocks form the filtered search result set.
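The duplicate filtering step can be illustrated with the simplest case, where two blocks count as duplicates when their unique identifiers match. This sketch omits the similarity-threshold comparison mentioned above, and the dictionary layout with an `"id"` key is an assumption.

```python
def filter_duplicates(semantic_results, keyword_results):
    """Merge two result lists, keeping only the first occurrence of each block id."""
    seen = set()
    filtered = []
    for block in semantic_results + keyword_results:
        if block["id"] not in seen:
            seen.add(block["id"])
            filtered.append(block)
    return filtered

semantic_hits = [{"id": "b1", "text": "Qubits ..."}, {"id": "b3", "text": "Gates ..."}]
keyword_hits = [{"id": "b3", "text": "Gates ..."}, {"id": "b4", "text": "Shor ..."}]
filtered_results = filter_duplicates(semantic_hits, keyword_hits)
```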

And S205, ranking the filtered document blocks in the filtered search result set to obtain a target document block candidate set, wherein the target document block candidate set is used for providing a context prompt for answer sentence generation of the query sentence.

In this embodiment, the RRF (Reciprocal Rank Fusion) algorithm is used to rank the filtered document blocks in the filtered search result set, so as to obtain the target document block candidate set.
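Reciprocal Rank Fusion scores each document block as the sum of 1 / (k + rank) over the rankings in which it appears. The sketch below assumes the two retrieval paths each supply an ordered list of block ids; the constant `k=60` is a commonly used default rather than a value stated in the text, and `top_n` is a hypothetical cut-off.

```python
def reciprocal_rank_fusion(rankings, k=60, top_n=3):
    """Fuse several ranked lists of block ids into one list ordered by RRF score."""
    scores = {}
    for ranking in rankings:
        for rank, block_id in enumerate(ranking, start=1):
            scores[block_id] = scores.get(block_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

semantic_ranking = ["b1", "b3", "b2"]
keyword_ranking = ["b3", "b4", "b1"]
candidates = reciprocal_rank_fusion([semantic_ranking, keyword_ranking])
```

Here `b3` ranks second and first in the two lists, which outweighs `b1` ranking first and third, so `b3` leads the fused candidate set.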

Optionally, after ranking the filtered document blocks in the filtered search result set to obtain a target document block candidate set, the method further includes:

and inputting the target document block candidate set and the query sentence into a preset large language model.

In the present embodiment, the target document block candidate set and the query sentence are input to a preset large language model, so that the preset large language model generates an answer sentence with respect to the query sentence.

In this embodiment, the answer sentence generated by the preset large language model is output to the visual interface, so that the user obtains the answer sentence corresponding to the query sentence.
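One way the target document block candidate set could be turned into a context prompt is sketched below. The prompt wording and the function `build_prompt` are purely illustrative assumptions; the text does not specify a prompt template, and the call to the large language model itself is left out.

```python
def build_prompt(query, candidate_blocks):
    """Assemble a context prompt from ranked candidate blocks and the query sentence."""
    context = "\n".join(
        f"[{i}] {text}" for i, text in enumerate(candidate_blocks, start=1)
    )
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\nAnswer:"
    )

prompt = build_prompt(
    "What is the basic principle of quantum computing?",
    ["Qubits can exist in superposition.", "Quantum gates transform qubit states."],
)
```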

According to the search processing method of the query statement, provided by the application, the multi-layer index is introduced to search the related content of the query statement from different levels and angles, so that the limitation of a single index is broken through, the search result is more comprehensive and specific, the filtering processing and the ranking processing are carried out on the search result, the interference of an invalid document on the search result is reduced, and the accuracy and the search efficiency of the search result are improved.

Fig. 3 is a second flow chart of the search processing method of the query sentence provided in the present application. As shown in fig. 3, on the basis of the embodiment of fig. 2, before the multi-layer index is acquired in step S202, this embodiment further includes:

S301, acquiring a plurality of text data, wherein the text data comprise academic paper data, technical report data and science popularization article data.

In this embodiment, the text data is data related to the query sentence. For example, if the query sentence concerns quantum computing, the text data includes academic paper data, technical report data, and popular science article data related to quantum computing.

S302, performing data preprocessing on the text data to obtain vocabulary data.

Optionally, performing data preprocessing on the text data to obtain vocabulary data, including:

and performing word segmentation processing on the text data to obtain first word data.

In this embodiment, a word segmentation tool is used to segment continuous text data into first word data. For example, if a sentence in the continuous text data is "quantum computing uses the superposition states of qubits to realize an efficient computing process", the word segmentation tool segments the sentence, and the resulting first word data includes "quantum computing", "uses", "qubits", "superposition states", "realize", "efficient computing", and "process".

And performing word frequency filtering processing on the first word data to obtain second word data.

In this embodiment, word frequency filtering is performed on the first word data by using a preset general stop word list and a preset stop word list specific to the professional field, so as to filter out general stop words as well as high-frequency words that are common in the professional field but carry no key information.

For example, the preset field-specific stop word list is a stop word list for the quantum computing field. The general stop word list includes stop words without substantive meaning, such as "is" and "in", and the quantum computing stop word list includes high-frequency words that carry no key information, such as "in general".
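The two-list filtering can be sketched as a simple set lookup. Both word lists below are small hypothetical stand-ins in English; a real deployment would load the preset general and field-specific lists described above.

```python
# Hypothetical stand-ins for the preset stop word lists.
GENERAL_STOP_WORDS = {"is", "in", "the", "of", "to"}
DOMAIN_STOP_WORDS = {"generally", "usually"}

def filter_stop_words(words):
    """Drop words that appear in either the general or the field-specific stop word list."""
    stop_words = GENERAL_STOP_WORDS | DOMAIN_STOP_WORDS
    return [w for w in words if w.lower() not in stop_words]

second_word_data = filter_stop_words(
    ["Quantum", "computing", "is", "generally", "fast"]
)
```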

In this embodiment, the Porter stemming algorithm is used to perform stem extraction on the second word data, so as to reduce the words in the second word data to their stem forms and lessen the influence of morphological variation on retrieval.

S303, carrying out knowledge slicing processing on the text data according to the vocabulary data to obtain knowledge slicing data, wherein the knowledge slicing data comprises chapter data, paragraph data and semantic block data.

Optionally, knowledge slicing is performed on the plurality of text data according to the vocabulary data to obtain knowledge slice data, including:

And extracting chapter data and paragraph data in the text data according to the vocabulary data and the document analysis tool.

In this embodiment, for example, if the text data is in PDF (Portable Document Format), the PyPDF tool (a toolkit for processing PDF documents) is used to extract the chapter titles and paragraph content from the text data, and regular expressions are used to identify the format of chapter titles and determine chapter boundaries, so that the chapter data and paragraph data in the text data are extracted accurately.
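The regular-expression step for locating chapter boundaries might look like the sketch below. It assumes the text has already been extracted from the PDF into a list of lines, and that headings look like "1 Introduction", "2.1 Methods", or "Chapter 3 Results"; the pattern `HEADING` and the function `split_chapters` are hypothetical, and a real corpus would need a pattern tuned to its own heading styles.

```python
import re

# Hypothetical heading pattern: "Chapter N ..." or a dotted section number followed by text.
HEADING = re.compile(r"^(?:Chapter\s+\d+|\d+(?:\.\d+)*)\s+\S.*$")

def split_chapters(lines):
    """Group lines into chapters, starting a new chapter at every heading line."""
    chapters = []
    current = {"title": None, "paragraphs": []}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if HEADING.match(line):
            # Close the previous chapter before opening a new one.
            if current["title"] is not None or current["paragraphs"]:
                chapters.append(current)
            current = {"title": line, "paragraphs": []}
        else:
            current["paragraphs"].append(line)
    if current["title"] is not None or current["paragraphs"]:
        chapters.append(current)
    return chapters

lines = [
    "1 Introduction",
    "Quantum computing is a model of computation.",
    "",
    "2 Core Algorithms",
    "Shor's algorithm factors integers.",
    "Grover's algorithm searches unsorted data.",
]
chapters = split_chapters(lines)
```

A body line that happens to start with a number would be misread as a heading under this pattern, which is one reason the embodiment combines the regular expression with the document analysis tool's own structure information.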

And cutting the text data according to punctuation marks in the text data to obtain cut text data.

In this embodiment, preliminary segmentation is performed on the text data based on its punctuation marks, including question marks, exclamation marks, periods, semicolons, and colons, to obtain a preliminary sequence of text fragments. The segmentation result is then disambiguated and corrected: punctuation in abbreviations, numbers, dates, and version numbers is disambiguated, and sentences, titles, and symbols inside quotation marks or brackets are corrected so that each is kept as a single whole.
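A simplified sketch of punctuation-based splitting with disambiguation is given below. It splits only where a punctuation mark is followed by whitespace, which leaves version numbers and dotted dates intact, and then re-joins fragments that end in a known abbreviation. The abbreviation list is a small hypothetical sample, and the handling of quotation marks and brackets is omitted.

```python
import re

# Hypothetical sample of abbreviations whose trailing period is not a sentence boundary.
ABBREVIATIONS = {"e.g.", "i.e.", "etc.", "Fig.", "Dr.", "vs."}

def split_sentences(text):
    """Split on . ? ! ; : followed by whitespace, then merge false boundaries after abbreviations."""
    pieces = re.split(r"(?<=[.?!;:])\s+", text.strip())
    sentences = []
    buffer = ""
    for piece in pieces:
        buffer = f"{buffer} {piece}".strip()
        if not buffer:
            continue
        if buffer.split()[-1] in ABBREVIATIONS:
            continue  # Not a real boundary; keep accumulating.
        sentences.append(buffer)
        buffer = ""
    if buffer:
        sentences.append(buffer)
    return sentences

text = "Version 2.1 was released on 2024.05.01. See Fig. 3 for details; it shows the circuit."
fragments = split_sentences(text)
```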

And carrying out semantic block recognition processing on the segmented text data according to the pre-trained large language model so as to obtain a plurality of semantic text blocks.

In this embodiment, the text data after segmentation is embedded according to the pre-trained large language model, so that the text data after segmentation is mapped into a semantic space, cosine similarity clustering is adopted, and sentences with similarity higher than a preset similarity threshold value in the text data after segmentation are classified as a semantic text block, so that a plurality of different semantic text blocks are obtained.
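The similarity-threshold grouping can be approximated by a greedy pass over consecutive sentences, as sketched below. This is one possible reading of the clustering step: the embeddings are assumed to be precomputed by the language model (toy two-dimensional vectors are used here), and the function `group_semantic_blocks` and the threshold value are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def group_semantic_blocks(sentences, embeddings, threshold=0.8):
    """Append a sentence to the current block while it stays similar to the previous sentence."""
    groups = []
    prev_vec = None
    for sentence, vec in zip(sentences, embeddings):
        if prev_vec is not None and cosine(prev_vec, vec) >= threshold:
            groups[-1].append(sentence)
        else:
            groups.append([sentence])
        prev_vec = vec
    return [" ".join(group) for group in groups]

sentences = [
    "Qubits hold superposed states.",
    "Superposition lets qubits encode more.",
    "Funding rose last year.",
]
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]  # toy vectors, not real embeddings
semantic_blocks = group_semantic_blocks(sentences, embeddings)
```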

In this embodiment, the MD5 (Message-Digest Algorithm 5) hash algorithm is used to generate a unique identifier for each semantic text block, so as to support storage and retrieval of the semantic text blocks.
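Generating an MD5-based identifier is straightforward with Python's standard `hashlib` module. The sketch assumes the identifier is derived from the UTF-8 bytes of the block text alone; the function name `block_id` is hypothetical.

```python
import hashlib

def block_id(text):
    """Return the 32-character hexadecimal MD5 digest of the block text."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()

id_a = block_id("Qubits can exist in a superposition of states.")
id_b = block_id("Quantum gates act on qubits.")
```

Because identical text always yields the same digest, identifiers built this way also make the later duplicate check a simple equality comparison.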

S304, generating structured storage information according to the knowledge slice data.

Optionally, generating the structured store information from the knowledge slice data includes:

and establishing a knowledge slice mapping relation according to the chapter data, the paragraph data and the semantic block data.

In this embodiment, the chapter data, the paragraph data under the chapter data, and the semantic block data corresponding to the paragraph data are stored in an associated manner, so as to establish a knowledge slice mapping relationship, thereby improving the retrieval efficiency.

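In this embodiment, meta information is added to the knowledge slice mapping relation, the meta information including a source, a timestamp and a domain label. The source includes a document name or a website domain name; the timestamp includes the posting time and the collection time of the text data; and the domain label indicates the professional domain corresponding to the text data, for example, "quantum computation".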
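One possible in-memory shape for the chapter-paragraph-semantic block mapping with its meta information is sketched below. The class and field names are hypothetical; the embodiment does not prescribe a storage schema.

```python
from dataclasses import dataclass, field


@dataclass
class MetaInfo:
    source: str         # document name or website domain name
    timestamp: str      # posting or collection time
    domain_label: str   # professional domain, e.g. "quantum computation"


@dataclass
class SemanticBlock:
    block_id: str
    text: str


@dataclass
class Paragraph:
    text: str
    blocks: list[SemanticBlock] = field(default_factory=list)


@dataclass
class Chapter:
    title: str
    meta: MetaInfo
    paragraphs: list[Paragraph] = field(default_factory=list)


def block_ids_of(chapter: Chapter) -> list[str]:
    """Walk chapter -> paragraph -> semantic block and collect block ids."""
    return [b.block_id for p in chapter.paragraphs for b in p.blocks]
```

Storing the three levels in an associated manner lets a hit on a semantic block be traced back to its paragraph and chapter.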

S305, constructing a multi-layer index according to the structured storage information.

In this embodiment, the multi-layer index includes a text structure-based index, a text semantic-based index, and a text unit-based index.

Wherein the text structure-based index includes a chapter topic index and a paragraph topic index.

For text data with a chapter structure, the chapter topic is determined by comprehensively analyzing the chapter title, the opening paragraph of the chapter, the closing paragraph of the chapter and the high-frequency professional vocabulary, and the chapter topic is converted into a semantic vector for indexing, so that a chapter topic index is obtained. The high-frequency professional vocabulary refers to professional vocabulary whose occurrence frequency is greater than a preset frequency.

For example, the high-frequency professional vocabulary includes "qubit", "quantum gate" and "entangled state". If a chapter of the text data is titled "Core algorithms of quantum computation", the chapter is analyzed in combination with its chapter data, and the chapter topic is determined as "Principles and applications of quantum computation algorithms".

For text data without a chapter structure, the whole text data is treated as one chapter, the chapter is analyzed to determine a chapter topic, and the chapter topic is converted into a semantic vector for indexing, so that a chapter topic index is obtained.

A text ranking algorithm is used to extract the key sentence of each paragraph as the paragraph topic, and the paragraph topics are converted into semantic vectors to obtain the paragraph topic index.

For example, in a paragraph about qubits, the text ranking algorithm takes the sentence "qubits are the basic units of quantum computation and have unique superposition and entanglement characteristics" as the paragraph topic, and the paragraph topic is then converted into a semantic vector to obtain the corresponding paragraph topic index.
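The embodiment does not name the ranking algorithm used to pick the key sentence. As one possible instance, the sketch below uses a TextRank-style graph ranking over word overlap; the tokenizer, the `+ 1` smoothing in the similarity formula, the damping factor and the iteration count are all assumptions.

```python
import math
import re


def key_sentence(sentences, damping=0.85, iterations=50):
    """Return the highest-ranked sentence of a non-empty sentence list."""
    tokens = [set(re.findall(r"\w+", s.lower())) for s in sentences]
    n = len(sentences)

    def similarity(i, j):
        # Shared-word count normalised by (smoothed) log sentence lengths.
        if i == j:
            return 0.0
        denom = math.log(len(tokens[i]) + 1) + math.log(len(tokens[j]) + 1)
        return len(tokens[i] & tokens[j]) / denom if denom else 0.0

    weights = [[similarity(i, j) for j in range(n)] for i in range(n)]
    out_sum = [sum(row) for row in weights]
    scores = [1.0] * n
    for _ in range(iterations):
        # Each sentence receives score from its neighbours in proportion
        # to the edge weight (PageRank-style update on the old scores).
        scores = [
            (1 - damping) + damping * sum(
                weights[j][i] * scores[j] / out_sum[j]
                for j in range(n) if out_sum[j]
            )
            for i in range(n)
        ]
    return sentences[max(range(n), key=scores.__getitem__)]
```

A sentence that shares vocabulary with many other sentences in the paragraph accumulates the highest score and is taken as the paragraph topic.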

According to a preset large language model, a semantic topic and a semantic question are generated for each semantic text block, and the semantic topic and the semantic question are respectively converted into semantic vectors to construct the corresponding semantic topic index and semantic question index. For example, for a semantic text block related to quantum gate operation, the generated semantic topic is "the mechanism of action of quantum gates in quantum computing" and the generated semantic question is "how do quantum gates realize the state conversion of qubits"; the semantic topic and the semantic question are then respectively converted into semantic vectors to obtain the corresponding semantic topic index and semantic question index.

According to the knowledge slice mapping relation, fusing the chapter theme and the paragraph theme to the semantic text block to obtain a fused semantic text block, performing word segmentation and keyword extraction on the fused semantic text block, and determining the importance weight of the keywords by adopting a TF-IDF algorithm to establish a semantic text keyword index.

For example, for a semantic text block fused with topics related to quantum computing algorithms and qubits, the corresponding keywords are extracted, including "quantum algorithm", "qubit" and "computing efficiency". Weights are determined according to the occurrence frequency and importance of the keywords in the text data, and a preset number of keywords with the largest weights are selected as the semantic text keyword index.

In the search processing method of the query statement provided by this embodiment of the application, the obtained text data is preprocessed to obtain vocabulary data, and knowledge slicing is performed according to the vocabulary data to obtain knowledge slice data, from which structured storage information is generated and a multi-layer index is constructed. With the multi-layer index, related content can be located rapidly during search, interference from irrelevant information is reduced, deep semantic associations are captured, and the semantic matching degree of the search is improved. The structured storage information supports the integration of multi-source heterogeneous text, which improves the universality and extensibility of the system, makes it suitable for large-scale text data processing, and improves search efficiency and accuracy in complex scenarios.
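The TF-IDF weighting and top-k selection can be sketched as follows. The input is assumed to be already segmented into terms; the function name `tfidf_keywords` and the plain `tf * log(N / df)` variant are illustrative choices, since the embodiment does not state which TF-IDF formula is used.

```python
import math
from collections import Counter


def tfidf_keywords(blocks, block_index, top_k=3):
    """Return the top_k (term, weight) pairs of one block by TF-IDF weight.

    blocks: list of token lists, one per fused semantic text block.
    """
    n = len(blocks)
    # Document frequency: number of blocks that contain each term.
    df = Counter(term for block in blocks for term in set(block))
    block = blocks[block_index]
    tf = Counter(block)
    weights = {
        term: (count / len(block)) * math.log(n / df[term])
        for term, count in tf.items()
    }
    # Sort by descending weight, breaking ties alphabetically.
    return sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]
```

Terms that appear in every block receive a weight of zero under this variant, so they are never preferred over terms that distinguish a block.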

Fig. 4 is a flowchart of a search processing method of a query sentence provided in the present application. As shown in fig. 4, on the basis of the embodiment of fig. 2, performing semantic similarity search and keyword matching search according to the query sentence and the multi-layer index in the above step S203, so as to obtain a semantic search result set and a keyword search result set, includes:

S401, generating a query vector according to the query statement.

In this embodiment, a preset large language model is used to convert the query statement into a query vector.

S402, carrying out semantic similarity retrieval according to the query vector and the multi-layer index to obtain a semantic retrieval result set.

In this embodiment, a Milvus index is used to calculate the cosine similarity between the query vector and the vectors converted from the semantic text blocks, and similarity is also calculated against the corresponding semantic vectors in the multi-layer index. The calculation results are screened together, a preset number of text blocks with the highest similarity are returned, and these text blocks form the semantic search result set.
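The similarity retrieval over several index layers can be illustrated with an in-memory stand-in for the vector database. This sketch does not use Milvus; it assumes each block has one vector per index layer and scores a block by its best-matching layer, which is only one possible reading of the "comprehensive screening" described above.

```python
import heapq
import math


def cosine(a, b):
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def semantic_search(query_vec, index, top_k=2):
    """Return the top_k (block_id, score) pairs.

    index: block_id -> {layer_name: vector}, where the layer names
    (e.g. "text", "topic", "question") are hypothetical.
    """
    scored = [
        (max(cosine(query_vec, vec) for vec in layers.values()), block_id)
        for block_id, layers in index.items()
    ]
    return [(block_id, score) for score, block_id in heapq.nlargest(top_k, scored)]
```

A real deployment would delegate the nearest-neighbour search to the vector database rather than scanning every block.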

S403, generating a query keyword set according to the query statement.

In this embodiment, the query keywords in the query statement are extracted. For example, if the query statement is "What is the basic principle of quantum computation", the extracted query keywords include "quantum computation" and "basic principle". Using the query keywords, a BM25-based retriever and an inverted index are used to obtain the semantic text blocks fused with the query keywords, and the obtained semantic text blocks are assembled into a set to obtain the query keyword set.
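A BM25 retriever over an inverted index can be sketched as follows. The parameter values `k1=1.5` and `b=0.75` are common defaults rather than values from the embodiment, and the function names are illustrative.

```python
import math
from collections import Counter, defaultdict


def build_inverted_index(docs):
    """docs: block_id -> token list. Returns term -> {block_id: frequency}."""
    postings = defaultdict(dict)
    for doc_id, tokens in docs.items():
        for term, freq in Counter(tokens).items():
            postings[term][doc_id] = freq
    return postings


def bm25_search(query_terms, docs, postings, k1=1.5, b=0.75):
    """Return (block_id, score) pairs for matching blocks, best first."""
    n = len(docs)
    avg_len = sum(len(tokens) for tokens in docs.values()) / n
    scores = defaultdict(float)
    for term in query_terms:
        matches = postings.get(term, {})
        # Smoothed inverse document frequency (always non-negative).
        idf = math.log(1 + (n - len(matches) + 0.5) / (len(matches) + 0.5))
        for doc_id, freq in matches.items():
            # Term frequency saturates and is normalised by block length.
            norm = freq + k1 * (1 - b + b * len(docs[doc_id]) / avg_len)
            scores[doc_id] += idf * freq * (k1 + 1) / norm
    return sorted(scores.items(), key=lambda kv: -kv[1])
```

Only blocks listed in the postings of at least one query keyword are scored, which is what makes the inverted index efficient for keyword matching.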

S404, carrying out keyword matching search according to the query keyword set and the multi-layer index to obtain a keyword search result set.

In this embodiment, according to the query keyword set, the multi-layer index is flexibly selected and used to perform keyword matching search, so as to obtain a keyword search result set, so as to improve the search efficiency and accuracy.

Optionally, keyword matching search can be performed according to at least one index of the query keyword set and the multi-layer index to obtain corresponding keyword search results, and all the keyword search results are combined into a set to obtain a corresponding keyword search result set.

In the search processing method of the query statement provided by the application, a query vector is generated from the query statement, and semantic similarity search is performed in combination with the multi-layer index to obtain the semantic search result set; query keywords are extracted from the query statement to generate the query keyword set, and keyword matching search is performed in combination with the multi-layer index to obtain the keyword search result set. This improves the accuracy and recall rate of the search results, enhances the user experience, and ensures search efficiency and performance.

Fig. 5 is a flowchart of a search processing method for a query sentence provided in the present application. As shown in fig. 5, before the query sentence is obtained in the above step S201, this embodiment further includes building a preset corpus, where building the preset corpus includes:

S501, acquiring a corpus to be processed.

S502, preprocessing data of the corpus to be processed to obtain preprocessed corpus.

In this embodiment, the preprocessing includes word segmentation, stop word removal, stem extraction.

For example, a word segmentation tool is used to process the corpus to be processed, segmenting continuous text into meaningful word units. Stop words without substantive meaning, such as "yes" and "no", and high-frequency words common in the professional field that carry no key information are removed according to a preset general stop word list and a preset stop word list specific to the professional field. Words are then restored to their stem form by the Porter stemming algorithm, which reduces the influence of morphological changes of words on retrieval.
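The three preprocessing steps can be sketched as a small pipeline. The stop word lists are hypothetical, the regular-expression tokenizer stands in for a real word segmentation tool, and `crude_stem` is a deliberately simplified suffix stripper used in place of a full stemming algorithm.

```python
import re

# Hypothetical stop word lists for illustration only.
GENERAL_STOP_WORDS = {"the", "this", "is", "a", "of", "and", "are"}
DOMAIN_STOP_WORDS = {"paper", "study"}


def crude_stem(word: str) -> str:
    """Strip one common English suffix; a stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text: str) -> list[str]:
    """Segment into words, remove stop words, then reduce words to stems."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    stop_words = GENERAL_STOP_WORDS | DOMAIN_STOP_WORDS
    return [crude_stem(t) for t in tokens if t not in stop_words]
```

Keeping the general and domain-specific stop word lists separate allows the domain list to be swapped per professional field without touching the general one.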

S503, carrying out knowledge slicing processing on the preprocessed corpus to obtain a knowledge unit.

In this embodiment, knowledge slicing refers to the division of a large-scale unstructured document into manageable knowledge units, where the knowledge units include chapters, paragraphs, and semantic blocks.

For example, a document parsing tool is used to extract the chapter titles and paragraph content from the preprocessed corpus, and regular expressions are used to identify the format of the chapter titles to determine the chapter boundaries. The corpus to be processed is segmented according to punctuation marks into text sentences of the finest granularity. The text sentences are embedded by a preset large language model and mapped into a semantic space, clustering is performed based on semantic similarity, and text sentences whose similarity is higher than a preset similarity threshold are grouped into one semantic text block. A unique identifier is generated for each semantic text block by a hash algorithm to facilitate subsequent storage and retrieval operations.

S504, determining structured storage data according to the knowledge units, the mapping relation of the knowledge units, and the meta information of the knowledge units.

In this embodiment, a mapping relation is established for the sliced knowledge units, where the mapping relation is chapter-paragraph-semantic block. The data is stored according to the mapping relation, and meta information, including a source, a timestamp and a domain tag, is added to each piece of data, providing multi-dimensional information for subsequent screening and supporting rapid retrieval.

Further, the preset corpus constructed in this embodiment provides text data for the search processing method of the query statement. As shown in fig. 6, the overall flow consists of corpus construction, index construction, multi-path retrieval, and summarization and ranking of the retrieval results, so that user questions are searched efficiently. The corpus construction comprises data preprocessing, knowledge slicing and structured storage; the index construction comprises the text structure-based index, the text semantic-based index and the text unit-based index; the multi-path retrieval comprises semantic vector retrieval and keyword retrieval; and the summarization and ranking comprises filtering and re-ranking.
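The filtering of repeated document blocks and the subsequent ranking can be illustrated as below. This section does not restate how the ranking score is computed, so the sketch uses reciprocal rank fusion purely as an illustrative stand-in; the function name, the constant `k=60` and the cut-off `top_n` are assumptions.

```python
def fuse_results(semantic, keyword, k=60, top_n=3):
    """Merge two ranked lists of block ids, drop repeats, and re-rank.

    semantic, keyword: block ids ordered best first.
    """
    fused = {}
    for results in (semantic, keyword):
        for rank, block_id in enumerate(results, start=1):
            # A block found by both paths is kept once and accumulates
            # a contribution from each path.
            fused[block_id] = fused.get(block_id, 0.0) + 1.0 / (k + rank)
    # Sort by descending fused score, breaking ties by block id.
    ranked = sorted(fused, key=lambda block_id: (-fused[block_id], block_id))
    return ranked[:top_n]
```

Rank-based fusion avoids comparing cosine similarities with keyword scores directly, since the two are on different scales.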

Fig. 7 is a schematic structural diagram of a search processing device for a query sentence provided by the present application, and as shown in fig. 7, the search processing device for a query sentence provided in this embodiment includes:

A first obtaining module 701, configured to obtain a query statement.

And a second obtaining module 702, configured to obtain a multi-layer index, where the multi-layer index includes a text structure-based index, a text semantic-based index, and a text unit-based index.

The search module 703 is configured to perform semantic similarity search and keyword matching search according to the query sentence and the multi-layer index, so as to obtain a semantic search result set and a keyword search result set, where the semantic search result set and the keyword search result set each include a plurality of document blocks.

And the filtering module 704 is configured to perform repeated document block filtering processing on the semantic search result set and the keyword search result set to obtain a filtered search result set, where the filtered search result set includes a plurality of filtering document blocks.

And the ranking module 705 is configured to rank the filtered document blocks in the filtered search result set to obtain a target document block candidate set, where the target document block candidate set is used to provide a context hint for generating an answer sentence of the query sentence.

In one possible implementation manner, before the multi-layer index is acquired, the search processing apparatus of the query statement further includes:

and the third acquisition module is used for acquiring a plurality of text data, wherein the text data comprises academic paper data, technical report data and science popularization article data.

And the preprocessing module is used for preprocessing the text data to obtain vocabulary data.

And the knowledge slicing module is used for carrying out knowledge slicing processing on the plurality of text data according to the vocabulary data to obtain knowledge slicing data, wherein the knowledge slicing data comprises chapter data, paragraph data and semantic block data.

And the generation module is used for generating structured storage information according to the knowledge slice data.

And the construction module is used for constructing the multi-layer index according to the structured storage information.

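In one possible implementation, the preprocessing module may be further configured to:

perform word segmentation on the text data to obtain first word data;

perform word frequency filtering on the first word data to obtain second word data;

perform stem extraction on the second word data to obtain the vocabulary data.

In one possible implementation, the knowledge slicing module may be further specifically configured to:

extract the chapter data and the paragraph data from the text data according to the vocabulary data and a document parsing tool;

segment the text data according to the punctuation marks in the text data to obtain segmented text data;

perform semantic block recognition on the segmented text data according to a pre-trained large language model to obtain a plurality of semantic text blocks;

add identification information to the semantic text blocks to obtain the semantic block data.

In one possible implementation, the generating module may be further configured to:

establish a knowledge slice mapping relation according to the chapter data, the paragraph data and the semantic block data;

add meta information to the knowledge slice mapping relation to obtain the structured storage information, where the meta information includes a plurality of the following: a source, a timestamp and a domain tag.

In one possible implementation, the retrieval module 703 may be further specifically configured to:

generate a query vector according to the query statement;

perform semantic similarity search according to the query vector and the multi-layer index to obtain the semantic search result set;

generate a query keyword set according to the query statement;

perform keyword matching search according to the query keyword set and the multi-layer index to obtain the keyword search result set.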

In one possible implementation manner, after ranking the filtered document blocks in the filtered search result set to obtain the target document block candidate set, the search processing apparatus for a query sentence further includes:

And the input module is used for inputting the target document block candidate set and the query sentence into a preset large language model.

And the determining module is used for determining the answer sentence corresponding to the query sentence according to the output result of the preset large language model.
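How the target document block candidate set and the query sentence might be combined into a context prompt is sketched below. The prompt wording and the `llm` callable are hypothetical; any function that maps a prompt string to an answer string can be passed in.

```python
from typing import Callable


def build_prompt(query: str, candidate_blocks: list[str]) -> str:
    """Number the candidate blocks and place them before the question."""
    context = "\n".join(
        f"[{i}] {text}" for i, text in enumerate(candidate_blocks, start=1)
    )
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\nAnswer:"
    )


def answer(query: str, candidate_blocks: list[str], llm: Callable[[str], str]) -> str:
    """Send the assembled prompt to the model and return its output."""
    return llm(build_prompt(query, candidate_blocks))
```

Numbering the blocks in ranked order keeps the most relevant context first and lets an answer refer back to a specific block.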

The search processing device for query sentences provided in this embodiment may execute the method provided in the foregoing method embodiment, and its implementation principle and technical effects are similar, which is not described herein.

Fig. 8 is a schematic diagram of a search processing device for a query sentence according to the present application. As shown in fig. 8, the search processing device for a query sentence provided in this embodiment includes at least one processor 801 and a memory 802. Optionally, the search processing apparatus of the query sentence further includes a communication section 803. The processor 801, the memory 802, and the communication section 803 are connected via a bus 804.

In a specific implementation, at least one processor 801 executes computer-executable instructions stored in memory 802, causing the at least one processor 801 to perform the methods described above.

The specific implementation process of the processor 801 may refer to the above-mentioned method embodiment, and its implementation principle and technical effects are similar, and this embodiment will not be described herein again.

In the above embodiment, it should be understood that the processor may be a central processing unit (CPU), or may be another general purpose processor, a digital signal processor (DSP), an application specific integrated circuit (ASIC), or the like. A general purpose processor may be a microprocessor, or the processor may be any conventional processor or the like. The steps of a method disclosed in connection with the present invention may be embodied directly in a hardware processor for execution, or in a combination of hardware and software modules in a processor for execution.

The memory may include a high-speed random access memory (RAM), and may further include a non-volatile memory (NVM), such as at least one disk memory.

The bus may be an Industry Standard Architecture (ISA) bus, a Peripheral Component Interconnect (PCI) bus, or an Extended Industry Standard Architecture (EISA) bus, among others. The buses may be divided into address buses, data buses, control buses, etc. For ease of illustration, the buses in the drawings of the present application are not limited to only one bus or to one type of bus.

The application also provides a computer program product comprising a computer program which, when executed by a processor, implements the method described above.

The application also provides a computer-readable storage medium storing computer-executable instructions which, when executed by a processor, implement the method described above.

The above-described readable storage medium may be implemented by any type or combination of volatile or non-volatile memory devices, such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disk. A readable storage medium can be any available medium that can be accessed by a general purpose or special purpose computer.

An exemplary readable storage medium is coupled to the processor such that the processor can read information from, and write information to, the readable storage medium. In the alternative, the readable storage medium may be integral to the processor. The processor and the readable storage medium may reside in an application specific integrated circuit (ASIC). Alternatively, the processor and the readable storage medium may reside as discrete components in a device.

The division of units is merely a logical function division, and there may be another division manner in actual implementation, for example, a plurality of units or components may be combined or may be integrated into another system, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical or other form.

The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.

In addition, each functional unit in the embodiments of the present invention may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit.

The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention, in essence, or the part of it that contributes to the prior art, or a part of the technical solution, may be embodied in the form of a software product stored in a storage medium and comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method of the embodiments of the present invention. The storage medium includes a U disk, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or other various media capable of storing program code.

Those of ordinary skill in the art will appreciate that all or a portion of the steps of implementing the various method embodiments described above may be implemented by hardware associated with program instructions. The foregoing program may be stored in a computer readable storage medium. The program, when executed, performs the steps comprising the method embodiments described above, and the storage medium described above includes various media capable of storing program code, such as ROM, RAM, magnetic or optical disk.

Finally, it should be noted that other embodiments of the invention will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This invention is intended to cover any variations, uses, or adaptations of the invention that follow the general principles of the invention, including such departures from the present disclosure as come within known or customary practice in the art to which the invention pertains. The invention is not limited to the precise construction described above and shown in the drawings, and the scope of the invention is limited only by the appended claims.

Claims

1. A method for retrieving and processing query statements, characterized in that it includes:

Obtain the query statement;

Obtain multi-level indexes; wherein, the multi-level indexes include text structure-based indexes, text semantic-based indexes, and text unit-based indexes;

Based on the query statement and the multi-level index, semantic similarity retrieval and keyword matching retrieval are performed to obtain a semantic retrieval result set and a keyword retrieval result set; wherein, both the semantic retrieval result set and the keyword retrieval result set include multiple document blocks;

The semantic search result set and the keyword search result set are subjected to duplicate document block filtering to obtain a filtered search result set; wherein, the filtered search result set includes multiple filtered document blocks;

The filtered document blocks in the filtered search result set are ranked to obtain a candidate set of target document blocks; wherein, the candidate set of target document blocks is used to provide contextual hints for generating the answer statement of the query statement.

2. The method according to claim 1, wherein the text structure-based index includes a chapter topic index and a paragraph topic index;

The text semantic-based index includes a semantic topic index and a semantic question index;

The text unit-based index includes a semantic text keyword index and a semantic text vector index.

3. The method according to claim 2, characterized in that, before obtaining the multi-level index, it further includes:

Acquire multiple text data; wherein, the text data includes academic paper data, technical report data, and popular science article data;

The text data is preprocessed to obtain vocabulary data;

Based on the vocabulary data, the multiple text data are processed by knowledge slicing to obtain knowledge slice data; wherein, the knowledge slice data includes chapter data, paragraph data and semantic block data;

Based on the knowledge slice data, structured storage information is generated;

Based on the structured storage information, a multi-level index is constructed.

4. The method according to claim 3, characterized in that, the step of preprocessing the text data to obtain vocabulary data includes:

The text data is segmented to obtain the first word data;

The first word data is subjected to word frequency filtering to obtain the second word data;

Stem extraction is performed on the second word data to obtain vocabulary data.

5. The method according to claim 4, characterized in that, the step of performing knowledge slicing processing on the plurality of text data based on the vocabulary data to obtain knowledge slice data includes:

Based on the vocabulary data and the document parsing tool, extract the chapter data and paragraph data from the text data;

Based on the punctuation marks in the text data, the text data is segmented to obtain segmented text data;

Based on a pre-trained large language model, semantic block recognition processing is performed on the segmented text data to obtain multiple semantic text blocks;

Identification information is added to the semantic text block to obtain semantic block data.

6. The method according to claim 5, wherein generating structured storage information based on the knowledge slice data includes:

Based on the chapter data, paragraph data, and semantic block data, a knowledge slice mapping relationship is established;

Metadata information is added to the knowledge slice mapping relationship to obtain structured storage information; wherein, the metadata information includes multiple of the following: source, timestamp, and domain tag.

7. The method according to any one of claims 1 to 6, characterized in that, the step of performing semantic similarity retrieval and keyword matching retrieval based on the query statement and the multi-level index to obtain a semantic retrieval result set and a keyword retrieval result set includes:

Generate a query vector based on the query statement;

Based on the query vector and the multi-level index, semantic similarity retrieval is performed to obtain a semantic retrieval result set;

Based on the query statement, generate a set of query keywords;

Based on the set of query keywords and the multi-level index, keyword matching retrieval is performed to obtain a set of keyword retrieval results.

8. The method according to any one of claims 1 to 6, characterized in that, after ranking the filtered document blocks in the filtered retrieval result set to obtain a candidate set of target document blocks, it further includes:

The target document block candidate set and the query statement are input into a preset large language model;

Based on the output of the preset large language model, the corresponding answer statement is determined.

9. A retrieval processing device for query statements, characterized in that it comprises:

The first acquisition module is used to acquire the query statement;

The second acquisition module is used to acquire multi-level indexes; wherein, the multi-level indexes include text structure-based indexes, text semantic-based indexes, and text unit-based indexes;

The retrieval module is used to perform semantic similarity retrieval and keyword matching retrieval based on the query statement and the multi-level index to obtain a semantic retrieval result set and a keyword retrieval result set; wherein, both the semantic retrieval result set and the keyword retrieval result set include multiple document blocks;

The filtering module is used to perform duplicate document block filtering on the semantic search result set and the keyword search result set to obtain a filtered search result set; wherein, the filtered search result set includes multiple filtered document blocks;

The ranking module is used to rank the filtered document blocks in the filtered search result set to obtain a candidate set of target document blocks; wherein, the candidate set of target document blocks is used to provide contextual hints for generating the answer statement of the query statement.

10. A query statement retrieval processing device, characterized in that it comprises: a memory and a processor;

The memory stores computer-executable instructions;

The processor executes the computer-executable instructions stored in the memory, causing the processor to perform the method as described in any one of claims 1-8.

11. A computer-readable storage medium, characterized in that the computer-readable storage medium stores computer-executable instructions, which, when executed by a processor, are used to implement the method as described in any one of claims 1-8.

12. A computer program product, characterized in that it comprises a computer program that, when executed by a processor, implements the method described in any one of claims 1-8.