CN116361421B - Text retrieval method, device and storage medium

Info

Publication number: CN116361421B
Application number: CN202310618455.9A
Authority: CN
Inventors: 滕济可; 刘亚猛
Original assignee: Internet Moment Beijing Information Technology Co ltd
Current assignee: Internet Moment Beijing Information Technology Co ltd
Priority date: 2023-05-30
Filing date: 2023-05-30
Publication date: 2023-08-15
Anticipated expiration: 2043-05-30
Also published as: CN116361421A

Abstract

The application discloses a text retrieval method, a text retrieval device and a storage medium, and relates to the technical field of text retrieval. The text retrieval method comprises the steps of searching a knowledge information table corresponding to each character in a retrieval word, wherein the knowledge information table is recorded with zone bit codes of the corresponding characters, a lower zone bit code set of the zone bit codes of the corresponding characters and a level set of the corresponding characters in a text; judging whether a region bit code consistent with a region bit code in a knowledge information table corresponding to a next adjacent search word exists in a lower region bit code set in the knowledge information table corresponding to each target character in the search word; if the region bit codes consistent with the region bit codes in the knowledge information table corresponding to the next adjacent search word exist, the text content matched with the search word is found out based on the level set recorded in the knowledge information table corresponding to each character in the search word. The text retrieval method, the device and the storage medium disclosed by the application can conveniently and accurately retrieve the content to be retrieved from the text.

Description

Text retrieval method, device and storage medium

Technical Field

The application belongs to the technical field of text retrieval, and particularly relates to a text retrieval method, a text retrieval device and a storage medium.

Background

Text retrieval (Text Retrieval), also known as natural language retrieval, refers to the process of retrieving, classifying, filtering, and otherwise processing a collection of texts based on their content, such as the terms and semantics they contain.

At present, text retrieval often relies on keyword recognition: keywords are extracted from the text content to form an index, which then supports full-text retrieval of that content. The currently mainstream Elasticsearch full-text search engine, for example, uses this approach. However, only the extracted keywords can be retrieved in this way; content that was not extracted as a keyword cannot be retrieved, so the content to be retrieved cannot always be accurately retrieved from the text.

Therefore, how to provide an effective solution to accurately retrieve the content to be retrieved from the text has become a problem in the prior art.

Disclosure of Invention

The application aims to provide a text retrieval method, a text retrieval device and a storage medium, which are used for solving the problems in the prior art.

In order to achieve the above purpose, the present application adopts the following technical scheme:

In a first aspect, the present application provides a text retrieval method, including:

searching a knowledge information table corresponding to each character in the search term, wherein the knowledge information table records the zone bit code of the corresponding character, a lower-level zone bit code set of that zone bit code, and a level set of the corresponding character in the text (the set of ranks, i.e. positions, at which the character occurs), and the lower-level zone bit code set of the zone bit code of any character is the set of zone bit codes of all characters that appear in the text immediately after that character;

sequentially judging whether the lower-level region code set in the knowledge information table corresponding to each target character in the search term contains a region code consistent with the region code in the knowledge information table corresponding to the next adjacent character, wherein the target characters are all characters of the search term except the last one;

if the lower-level region code set in the knowledge information table corresponding to every target character in the search term contains a region code consistent with the region code in the knowledge information table corresponding to the next adjacent character, searching for text content matching the search term based on the level sets recorded in the knowledge information tables corresponding to the characters in the search term.

Based on the above disclosure, the application searches the knowledge information table corresponding to each character in the search term, and the knowledge information table records the zone bit codes of the corresponding characters, the lower zone bit code set of the zone bit codes of the corresponding characters and the level set of the corresponding characters in the text; then judging whether the region bit codes consistent with the region bit codes in the knowledge information table corresponding to the next adjacent search word exist in the lower region bit code set in the knowledge information table corresponding to each target character in the search word in sequence; if the lower-level region code set in the knowledge information table corresponding to each target character in the search word has region codes consistent with the region codes in the knowledge information table corresponding to the next adjacent search word, the text content matched with the search word is found out based on the level set recorded in the knowledge information table corresponding to each character in the search word. In the process, whether the region bit codes consistent with the region bit codes in the knowledge information table corresponding to the next adjacent search word exist in the lower region bit code set in the knowledge information table corresponding to each target character in the search word is judged, so that all the next characters adjacent to the previous character in the search word can be found out in the text, and further text content matched with the search word can be located and found out from the text according to the rank corresponding to each found character, any content in the text can be conveniently searched, and the search is not limited to the extracted keyword, so that the content to be searched can be accurately searched out from the text.

Through the design, the application can conveniently search any content in the text, is not limited to searching the extracted keywords, ensures that the content to be searched can be accurately searched from the text, and is convenient for practical application and popularization.

In one possible design, the searching text content matched with the search term based on the level set recorded in the knowledge information table corresponding to each character in the search term includes:

determining at least one group of rank combinations with continuous ranks from the knowledge information table corresponding to each character based on the rank set recorded in the knowledge information table corresponding to each character in the search term;

and searching at least one text content matched with the search term based on the at least one group of order combinations.

In one possible design, the searching the knowledge information table corresponding to each character in the search term includes:

calculating a storage address of a knowledge information table corresponding to each character in the search term through a hash algorithm;

and searching the knowledge information table corresponding to each character in the search word based on the storage address of the knowledge information table corresponding to each character.

In one possible design, the storage address of the knowledge information table corresponding to each character in the search term is:

$p(a_i) = \bigl((a_i \,\%\, 100 - 1) \times 94 + [a_i / 100] - 1\bigr) \times 32$, where $p(a_i)$ represents the storage position, in the storage space, of the knowledge information table corresponding to the i-th character in the search term, $a_i$ represents the area code of the i-th character in the search term, $\%$ represents the modulo operation, and $[\cdot]$ represents rounding.

In one possible design, before searching the knowledge information table corresponding to each character in the search term, the method further includes:

constructing a knowledge network, wherein the knowledge network comprises a directed connection diagram formed by knowledge information tables corresponding to characters in a text;

in the directed connection diagram, the corresponding knowledge information table corresponding to the previous character in the text points to the corresponding knowledge information table corresponding to the next character.

In one possible design, the knowledge information table further includes a superior location code set of the corresponding character and a self-association identifier for characterizing whether the character appears continuously in the text, where the superior location code set of the location code of any character is a set of the location codes corresponding to all adjacent characters that appear in the text and are located before the any character.

In one possible design, the method further comprises:

if the lower-level region code set in the knowledge information table corresponding to any one of the target characters of the search term does not contain a region code consistent with the region code in the knowledge information table corresponding to the next adjacent character, generating prompt information indicating that the search term was not found in the text.

In a second aspect, the present application provides a text retrieval apparatus comprising:

the first searching unit is used for searching a knowledge information table corresponding to each character in the search word, wherein the knowledge information table is recorded with zone bit codes of the corresponding characters, a lower zone bit code set of the zone bit codes of the corresponding characters and a level set of the corresponding characters in the text, and the lower zone bit code set of the zone bit code of any character is a set of zone bit codes corresponding to all adjacent characters which appear in the text and are positioned behind any character;

the judging unit is used for sequentially judging whether the lower-level region code set in the knowledge information table corresponding to each target character in the search term contains a region code consistent with the region code in the knowledge information table corresponding to the next adjacent character, wherein the target characters are all characters of the search term except the last one;

and the second searching unit is used for searching for text content matching the search term based on the level sets recorded in the knowledge information tables corresponding to the characters in the search term, if the lower-level region code set in the knowledge information table corresponding to every target character in the search term contains a region code consistent with the region code in the knowledge information table corresponding to the next adjacent character.

In a third aspect, the present application provides a text retrieval device comprising a memory, a processor and a transceiver in communication with each other in sequence, wherein the memory is adapted to store a computer program and the transceiver is adapted to receive and transmit messages, and the processor is adapted to read the computer program and to perform the text retrieval method as described in the first aspect.

In a fourth aspect, the present application provides a computer readable storage medium having instructions stored thereon which, when executed on a computer, perform the text retrieval method of the first aspect.

In a fifth aspect, the present application provides a computer program product comprising instructions which, when run on a computer, cause the computer to perform the text retrieval method according to the first aspect.

The beneficial effects are that:

the application can conveniently search any content in the text, is not limited to search the extracted keywords, ensures that the content to be searched can be accurately searched from the text, and is convenient for practical application and popularization.

Drawings

FIG. 1 is a flowchart of a text retrieval method according to an embodiment of the present application;

FIG. 2 is a schematic structural diagram of a text retrieval device according to an embodiment of the present application;

FIG. 3 is a schematic structural diagram of another text retrieval device according to an embodiment of the present application.

Detailed Description

In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the present application will be briefly described below with reference to the accompanying drawings and the description of the embodiments or the prior art, and it is obvious that the following description of the structure of the drawings is only some embodiments of the present application, and other drawings can be obtained according to these drawings without inventive effort to a person skilled in the art. It should be noted that the description of these examples is for aiding in understanding the present application, but is not intended to limit the present application.

It will be understood that, although the terms first, second, etc. may be used herein to describe various elements, these elements should not be limited by these terms. These terms are only used to distinguish one element from another element. For example, a first element could be termed a second element, and, similarly, a second element could be termed a first element, without departing from the scope of example embodiments of the present application.

It should be understood that the term "and/or", where it appears herein, merely describes an association between associated objects and means that three relationships are possible; for example, A and/or B may represent: A alone, B alone, and both A and B. The term "/and", where it appears herein, describes another association between objects and means that two relationships are possible; for example, A/and B may represent: A alone, and both A and B. In addition, the character "/", where it appears herein, generally indicates that the associated objects before and after it are in an "or" relationship.

In order to accurately retrieve contents to be retrieved from a text, embodiments of the present application provide a text retrieval method, apparatus, and storage medium, which can conveniently retrieve any content in a text, and ensure that the contents to be retrieved can be accurately retrieved from the text.

The text retrieval method provided by the embodiment of the application can be applied to the user terminal, and the user terminal can be, but is not limited to, a personal computer, a smart phone, a tablet computer, a laptop portable computer, a personal digital assistant (personal digital assistant, PDA) and the like. It will be appreciated that the execution body is not to be construed as limiting the embodiments of the application.

The text retrieval method provided by the embodiment of the application will be described in detail.

As shown in fig. 1, a flowchart of a text retrieval method according to the first aspect of the embodiment of the present application may include, but is not limited to, the following steps S101-S103.

Step S101, searching a knowledge information table corresponding to each character in the search term, wherein the knowledge information table is recorded with the zone bit codes of the corresponding characters, the lower zone bit code sets of the zone bit codes of the corresponding characters and the level sets of the corresponding characters in the text.

The lower-level region bit code set of the region bit code of any character is a set of region bit codes corresponding to all adjacent characters which appear in the text and are positioned behind the any character.

In the embodiment of the application, before text retrieval is performed, a knowledge network can be constructed in advance according to the content in all texts, the knowledge network comprises a directed connection diagram formed by knowledge information tables corresponding to characters in the texts, and in the directed connection diagram, the knowledge information tables corresponding to the previous characters in the texts point to the knowledge information tables corresponding to the next characters.

Specifically, the location codes of the characters in all the texts can be extracted first; the location code of the next character in the text is used as a lower-level location code of the previous character, and the address position (i.e. the rank) of each character in all the texts is recorded at the same time. When the knowledge network is constructed, a knowledge information table corresponding to each character can first be generated, and a directed connection diagram is then constructed with the knowledge information tables as nodes. In the directed connection diagram, the knowledge information table corresponding to the previous character in the text points to the knowledge information table corresponding to the next character, and each knowledge information table records the location code of the corresponding character, the lower-level location code set of the corresponding character, and the rank set of the corresponding character in the text.
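As an illustration of the extraction step, the following Python sketch derives a location code from a single character. It assumes that the location codes are GB2312 zone/position codes (the factor 94 in the storage address formula suggests a 94 × 94 code table, but the text does not name the encoding), and the function name `location_code` is hypothetical.

```python
def location_code(ch: str) -> int:
    """Return zone * 100 + position for one character (assumed GB2312 layout)."""
    raw = ch.encode("gb2312")  # raises UnicodeEncodeError outside GB2312
    if len(raw) != 2:
        raise ValueError(f"{ch!r} is not a two-byte GB2312 character")
    zone = raw[0] - 0xA0       # first byte minus 0xA0 gives the zone, 1..94
    position = raw[1] - 0xA0   # second byte minus 0xA0 gives the position, 1..94
    return zone * 100 + position


# "啊" is the first character of zone 16 in GB2312, "阿" the second.
print([location_code(c) for c in "啊阿"])  # [1601, 1602]
```

Single-byte characters such as ASCII letters have no zone/position pair under this assumption, so the sketch rejects them rather than guessing a code.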

In the embodiment of the application, the area code of the previous character in the text can be used as the upper-level area code of the next character, and the self-association identification of each character is set and used for representing whether the characters appear continuously in the text. The knowledge information table can also record information such as a superior region bit code set of the corresponding character and self-association identification of the corresponding character. The upper-level region bit code set of any character is a set of region bit codes corresponding to all adjacent characters which appear in the text and are positioned in front of the any character.
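The table contents described above can be sketched as follows. This is a minimal illustration rather than the patented implementation: the class name `KnowledgeTable`, its field names, and the use of a Python dictionary keyed by location code are all assumptions, and every character of the text is assumed to be a two-byte GB2312 character.

```python
from dataclasses import dataclass, field


@dataclass
class KnowledgeTable:
    code: int                                   # location code of the character
    lower: set = field(default_factory=set)     # codes seen immediately after it
    upper: set = field(default_factory=set)     # codes seen immediately before it
    ranks: list = field(default_factory=list)   # 1-based positions in the text
    self_linked: bool = False                   # True if it occurs twice in a row


def location_code(ch):
    # Assumed GB2312 layout; only valid for two-byte GB2312 characters.
    raw = ch.encode("gb2312")
    return (raw[0] - 0xA0) * 100 + (raw[1] - 0xA0)


def build_network(text):
    """Build one table per distinct character; 'lower' sets act as directed edges."""
    tables = {}
    prev = None
    for rank, ch in enumerate(text, start=1):
        code = location_code(ch)
        table = tables.setdefault(code, KnowledgeTable(code))
        table.ranks.append(rank)
        if prev is not None:
            tables[prev].lower.add(code)   # previous character points to this one
            table.upper.add(prev)          # and this one records its predecessor
            if prev == code:
                table.self_linked = True   # same character appears consecutively
        prev = code
    return tables


network = build_network("出版物出版")
print(network[location_code("出")].ranks)  # [1, 4]
```

Keeping the successor codes in a set makes the later adjacency test a single membership check per character pair.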

In the constructed directed connection graph, knowledge information tables corresponding to all characters can be stored in an array mode, so that the knowledge information tables corresponding to all the characters can be conveniently searched. In the embodiment of the application, the storage position of the knowledge information table can be calculated through a hash algorithm, and the knowledge information table can be stored in the storage space according to the storage position of the knowledge information table.

When searching the knowledge information table corresponding to each character in the search term, the storage address of the knowledge information table corresponding to each character can be calculated through a hash algorithm, and the knowledge information table is then looked up at that storage address. In the embodiment of the present application, the storage address of the knowledge information table corresponding to each character in the search term may be expressed as: $p(a_i) = \bigl((a_i \,\%\, 100 - 1) \times 94 + [a_i / 100] - 1\bigr) \times 32$, where $p(a_i)$ represents the storage position, in the storage space, of the knowledge information table corresponding to the i-th character in the search term, $a_i$ represents the area code of the i-th character in the search term, $\%$ represents the modulo operation, and $[\cdot]$ represents rounding.
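The address formula can be written out as a short Python function. The sketch follows the formula exactly as printed; it assumes that the rounding operator means taking the integer part (floor division), and it reads the factor 32 as a fixed stride per knowledge information table. The names `RECORD_STRIDE` and `table_offset` are hypothetical.

```python
RECORD_STRIDE = 32  # the x32 factor from the formula, read as a per-table stride (assumption)


def table_offset(code: int) -> int:
    """Storage position p(a_i) for a location code a_i, following the printed formula."""
    position = code % 100   # a_i % 100
    zone = code // 100      # [a_i / 100], assumed to be the integer part
    return ((position - 1) * 94 + zone - 1) * RECORD_STRIDE


print(table_offset(101))   # 0
print(table_offset(1601))  # 480
print(table_offset(1602))  # 3488
```

Because the offset is a pure function of the location code, the table of any character can be reached directly, without scanning the array.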

Step S102, sequentially judging whether the lower-level region bit code set in the knowledge information table corresponding to each target character in the search term contains a region bit code consistent with the region bit code in the knowledge information table corresponding to the next adjacent character.

The target characters are all characters of the search term except the last one.

Specifically, it may first be judged whether the lower-level region bit code set in the knowledge information table corresponding to the 1st character in the search term contains a region bit code consistent with the region bit code in the knowledge information table corresponding to the 2nd character in the search term; if so, it is then judged whether the lower-level region bit code set in the knowledge information table corresponding to the 2nd character contains a region bit code consistent with that of the 3rd character; if so, it is then judged whether the lower-level region bit code set corresponding to the 3rd character contains a region bit code consistent with that of the 4th character; and so on, until it is judged whether the lower-level region bit code set in the knowledge information table corresponding to the second-to-last character contains a region bit code consistent with the region bit code in the knowledge information table corresponding to the last character.
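The pairwise check of step S102 can be sketched as a single loop. The function name `first_broken_link`, the dictionary `lower_sets`, and the small integer codes in the example are placeholders chosen for illustration only.

```python
def first_broken_link(term_codes, lower_sets):
    """Return the index of the first target character whose lower-level set
    lacks the code of the next character, or None if every pair is linked."""
    for i in range(len(term_codes) - 1):        # target characters: all but the last
        current, following = term_codes[i], term_codes[i + 1]
        if following not in lower_sets.get(current, set()):
            return i
    return None


# Placeholder codes: 1 is followed by 2, 2 by 3, and 3 by 1 somewhere in the text.
lower_sets = {1: {2}, 2: {3}, 3: {1}}
print(first_broken_link([1, 2, 3], lower_sets))  # None
print(first_broken_link([1, 3], lower_sets))     # 0
```

Returning the index of the failing pair, rather than only a boolean, lets a caller stop early and report where the search term breaks off.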

Step S103, if the lower-level region code set in the knowledge information table corresponding to every target character in the search term contains a region code consistent with the region code in the knowledge information table corresponding to the next adjacent character, searching for text content matching the search term based on the rank sets recorded in the knowledge information tables corresponding to the characters in the search term.

If the lower-level region code set in the knowledge information table corresponding to any one of the target characters of the search term does not contain a region code consistent with the region code in the knowledge information table corresponding to the next adjacent character, this indicates that no text content consistent with the search term exists in the text. In that case the search can be stopped and prompt information indicating that the search term was not found in the text can be generated, so as to remind the user that no text content consistent with the search term exists in the text.

If the lower-level region code set in the knowledge information table corresponding to every target character in the search term contains a region code consistent with the region code in the knowledge information table corresponding to the next adjacent character, this indicates that text content consistent with the search term exists in the text, and the text content matching the search term can then be found based on the rank sets recorded in the knowledge information tables corresponding to the characters in the search term.

The search for text content matching the term may include, but is not limited to, the following steps S1031-S1032.

Step S1031, determining at least one group of sequential rank combinations obtained after sequentially selecting one rank from the knowledge information table corresponding to each character based on the rank set recorded in the knowledge information table corresponding to each character in the search term.

Specifically, according to the sequence of each character in the search term, one rank is sequentially selected from the knowledge information table (recorded rank set) corresponding to each character, so as to form continuous rank combinations, and at least one group of rank combinations is obtained.

For example, suppose the search term is "出版物" ("publication"), where the rank set recorded in the knowledge information table corresponding to the character "出" is {1, 23, 56, 111, 234}, the rank set recorded for the character "版" is {2, 95, 112, 250, 429}, and the rank set recorded for the character "物" is {3, 70, 113, 299}. Following the order of the characters in the search term, one rank is selected in turn from the rank set recorded in the knowledge information table corresponding to each character so as to form combinations of consecutive ranks; two such combinations, (1, 2, 3) and (111, 112, 113), are obtained in this case.
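The selection of consecutive ranks in step S1031 can be sketched in Python using the rank sets from the example, with every separator in the sets read as a comma. The function name `consecutive_rank_combinations` is hypothetical, and anchoring the search on the ranks of the first character is one possible strategy, not necessarily the one used in the embodiment.

```python
def consecutive_rank_combinations(rank_sets):
    """Given one rank set per character of the search term (in order),
    return every tuple of consecutive ranks (r, r+1, ..., r+n-1)."""
    if not rank_sets:
        return []
    later = [set(r) for r in rank_sets[1:]]
    combos = []
    for start in sorted(rank_sets[0]):
        # The k-th following character must occur exactly k positions after start.
        if all(start + k in ranks for k, ranks in enumerate(later, start=1)):
            combos.append(tuple(range(start, start + len(rank_sets))))
    return combos


rank_sets = [
    {1, 23, 56, 111, 234},   # ranks of the first character
    {2, 95, 112, 250, 429},  # ranks of the second character
    {3, 70, 113, 299},       # ranks of the third character
]
print(consecutive_rank_combinations(rank_sets))  # [(1, 2, 3), (111, 112, 113)]
```

Each candidate start costs one set lookup per remaining character, so the work grows with the number of occurrences of the first character rather than with the length of the text.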

Step S1032, searching out at least one text content matched with the search term based on the at least one group of rank combinations.

To search out the at least one text content matching the search term, the characters at the ranks recorded in each rank combination are located in the text; the characters found for one rank combination, taken together, are one text content matching the search term.

Continuing the example from step S1031, where the two rank combinations are (1, 2, 3) and (111, 112, 113), the characters at ranks 1, 2 and 3 of the text, taken in order, form one text content matching the search term, and the characters at ranks 111, 112 and 113, taken in order, form another.
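Steps S101 to S103 can be put together in one short end-to-end sketch. To keep it compact, the sketch keys its tables by the characters themselves instead of by location codes and rebuilds the index on every call; both are simplifications, and the names `build_index` and `search` are hypothetical.

```python
from collections import defaultdict


def build_index(text):
    ranks = defaultdict(list)   # character -> 1-based ranks in the text
    lower = defaultdict(set)    # character -> characters seen immediately after it
    for i, ch in enumerate(text):
        ranks[ch].append(i + 1)
        if i + 1 < len(text):
            lower[ch].add(text[i + 1])
    return ranks, lower


def search(text, term):
    """Return (start_rank, matched_text) for every occurrence of term in text."""
    ranks, lower = build_index(text)
    if not term or any(ch not in ranks for ch in term):
        return []
    # Step S102: every character except the last must be followed by its successor.
    for cur, nxt in zip(term, term[1:]):
        if nxt not in lower[cur]:
            return []
    # Step S1031: keep the start ranks that extend to a run of consecutive ranks.
    later = [set(ranks[ch]) for ch in term[1:]]
    starts = [s for s in ranks[term[0]]
              if all(s + k in r for k, r in enumerate(later, start=1))]
    # Step S1032: read the matched characters back out of the text.
    return [(s, text[s - 1:s - 1 + len(term)]) for s in starts]


text = "出版物很多，出版社出版出版物"
print(search(text, "出版物"))  # [(1, '出版物'), (12, '出版物')]
print(search(text, "物出"))    # []
```

In this sketch the rank step does the final confirmation: a term whose adjacent pairs each occur somewhere in the text, but never all in a row (for example "版社出版物" in the sample text), passes the adjacency check and still returns no hits.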

In one or more embodiments, after the text content matching the term is found, the text content matching the term may also be rendered in text, which may include, but is not limited to, font bolding, font tilting, font color adjustment, and/or font background color adjustment.

In summary, according to the text retrieval method provided by the application, the knowledge information table corresponding to each character in the retrieval word is searched, and the region code of the corresponding character, the lower region code set of the region code of the corresponding character and the level set of the corresponding character in the text are recorded in the knowledge information table; then judging whether the region bit codes consistent with the region bit codes in the knowledge information table corresponding to the next adjacent search word exist in the lower region bit code set in the knowledge information table corresponding to each target character in the search word in sequence; if the lower-level region code set in the knowledge information table corresponding to each target character in the search word has region codes consistent with the region codes in the knowledge information table corresponding to the next adjacent search word, the text content matched with the search word is found out based on the level set recorded in the knowledge information table corresponding to each character in the search word. In the process, whether the region code consistent with the region code in the knowledge information table corresponding to the next adjacent search word exists in the lower region code set in the knowledge information table corresponding to each target character in the search word is judged, so that all the next characters adjacent to the previous character in the search word can be found out in the text, further, text content matched with the search word can be located and found out from the text according to the rank corresponding to each found character, any content in the text can be conveniently searched, the search is not limited to the search of the extracted keyword, the content to be searched can be accurately searched out from the text, and practical application and popularization are facilitated.

Referring to fig. 2, a second aspect of the embodiment of the present application provides a text retrieval device, including:

The working process, working details and technical effects of the device provided in the second aspect of the present embodiment may be referred to in the first aspect of the present embodiment, and are not described herein.

As shown in fig. 3, a third aspect of the embodiment of the present application provides another text retrieval device, which includes a memory, a processor and a transceiver, which are sequentially communicatively connected, where the memory is configured to store a computer program, the transceiver is configured to send and receive a message, and the processor is configured to read the computer program, and perform the text retrieval method according to the first aspect of the embodiment.

By way of specific example, the memory may include, but is not limited to, random access memory (RAM), read-only memory (ROM), flash memory, first-in-first-out memory (FIFO), and/or first-in-last-out memory (FILO), etc.; the processor may be, but is not limited to, a microprocessor of the STM32F105 series, a processor adopting an architecture such as ARM (Advanced RISC Machines) or x86, or a processor integrating an NPU (neural-network processing unit); the transceiver may be, but is not limited to, a WiFi (wireless fidelity) wireless transceiver, a Bluetooth wireless transceiver, a General Packet Radio Service (GPRS) wireless transceiver, a ZigBee (a low-power local area network protocol based on the IEEE 802.15.4 standard) wireless transceiver, a 3G transceiver, a 4G transceiver, and/or a 5G transceiver, etc.

The working process, working details and technical effects of the device provided in the third aspect of the present embodiment may be referred to in the first aspect of the present embodiment, and are not described herein.

A fourth aspect of the present embodiment provides a computer readable storage medium storing instructions comprising the text retrieval method according to the first aspect of the present embodiment, i.e. the computer readable storage medium has instructions stored thereon, which when executed on a computer, perform the text retrieval method according to the first aspect. The computer readable storage medium refers to a carrier for storing data, and may include, but is not limited to, a floppy disk, an optical disk, a hard disk, a flash Memory, and/or a Memory Stick (Memory Stick), etc., where the computer may be a general purpose computer, a special purpose computer, a computer network, or other programmable devices.

A fifth aspect of the present embodiment provides a computer program product comprising instructions which, when run on a computer, cause the computer to perform the text retrieval method of the first aspect of the embodiment, wherein the computer may be a general purpose computer, a special purpose computer, a computer network, or other programmable apparatus.

Finally, it should be noted that: the foregoing description is only of the preferred embodiments of the application and is not intended to limit the scope of the application. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the protection scope of the present application.

Claims

1. A text retrieval method, comprising:

2. The text retrieval method according to claim 1, wherein the searching text content matching the search term based on the level set recorded in the knowledge information table corresponding to each character in the search term comprises:

3. The text retrieval method according to claim 1, wherein the searching the knowledge information table corresponding to each character in the retrieval word includes:

4. A text retrieval method according to claim 3, wherein the storage address of the knowledge information table corresponding to each character in the retrieval word is:

5. The text retrieval method according to claim 1, wherein before searching the knowledge information table corresponding to each character in the retrieval word, the method further comprises:

6. The text retrieval method according to claim 1, wherein the knowledge information table further includes a set of upper level location codes of corresponding characters and a self-associated identifier indicating whether characters appear continuously in the text, and the set of upper level location codes of the location codes of any character is a set of location codes corresponding to all adjacent characters that appear in the text and precede the any character.

7. The text retrieval method of claim 1, wherein the method further comprises:

8. A text retrieval apparatus, comprising:

9. A text retrieval device comprising a memory, a processor and a transceiver in communication with each other in sequence, wherein the memory is adapted to store a computer program and the transceiver is adapted to send and receive messages, and wherein the processor is adapted to read the computer program and to perform a text retrieval method as claimed in any one of claims 1 to 7.

10. A computer readable storage medium having instructions stored thereon which, when executed on a computer, perform the text retrieval method of any of claims 1 to 7.