CN114154480B - Information extraction method, device, equipment and storage medium - Google Patents

Info

Publication number
CN114154480B
Authority
CN
China
Prior art keywords
information
text
target
data
order data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111520172.8A
Other languages
Chinese (zh)
Other versions
CN114154480A (en
Inventor
简仁贤
李梦雄
马永宁
王海波
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Emotibot Technologies Ltd
Original Assignee
Emotibot Technologies Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Emotibot Technologies Ltd
Priority to CN202111520172.8A
Publication of CN114154480A
Application granted
Publication of CN114154480B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/226 Validation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00 Commerce
    • G06Q30/06 Buying, selling or leasing transactions
    • G06Q30/0601 Electronic shopping [e-shopping]
    • G06Q30/0633 Lists, e.g. purchase orders, compilation or processing
    • G06Q30/0635 Processing of requisition or of purchase orders

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Business, Economics & Management (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Finance (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Accounting & Taxation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Databases & Information Systems (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Development Economics (AREA)
  • Economics (AREA)
  • Marketing (AREA)
  • Strategic Management (AREA)
  • General Business, Economics & Management (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Character Discrimination (AREA)

Abstract

The present application provides an information extraction method, apparatus, device, and storage medium. The method comprises: obtaining order data corresponding to a query instruction; inputting the order data into a preset recognition model, and outputting the subject matter information in the order data; verifying the subject matter information based on a standard vocabulary to obtain verified subject matter information; and generating triple information of the order data based on the verified subject matter information. The present application combines artificial intelligence model recognition with standard vocabulary rule verification to extract order information, thereby improving extraction accuracy.

Description

Information extraction method, device, equipment and storage medium
Technical Field
The present application relates to the field of information processing technologies, and in particular, to an information extraction method, apparatus, device, and storage medium.
Background
With the development of internet technology, more and more commodities are purchased through online orders, and the order information is often delivered by mail. For example, when a user orders a batch of commodities on a platform, the order information is sent by mail.
The commodity information and the arrival date in the order information are very important commodity data, and when a user wants to check the commodity information and the arrival date of related commodities in the mail, the user often needs to open the mail to search manually, which is inconvenient for the user. Thus, automatic information extraction technology for mail contents has been developed.
Existing mail extraction methods mainly extract information by means of hand-written rules. The information they extract is limited and of low precision, and because mail content is diverse, such rules cannot meet the requirement of extracting information in arbitrary forms. How to improve the extraction precision for mail content is therefore a problem to be solved urgently.
Disclosure of Invention
The embodiment of the application aims to provide an information extraction method, device, equipment and storage medium, which simultaneously combine model identification and standard word stock rule verification to extract order information, thereby improving extraction accuracy.
A first aspect of an embodiment of the present application provides an information extraction method, including: acquiring order data corresponding to a query instruction; inputting the order data into a preset recognition model, and outputting target object information in the order data; performing verification processing on the target object information based on a standard word stock to obtain verified target object information; and generating triple information of the order data based on the verified target object information.
In an embodiment, the query instruction carries identification information of a target order, and acquiring the order data corresponding to the query instruction includes: when the query instruction is received, extracting order content corresponding to the identification information from a preset order library; and parsing the order content to obtain text data of the target order, and taking the text data as the order data.
In an embodiment, establishing the preset recognition model includes: acquiring a sample order data set; converting the sample order data set into a predetermined standard format; labeling sample target object information in the standard-format sample order data set; and training a neural network model with the labeled sample order data set to obtain the preset recognition model.
In an embodiment, the target object information includes an identification text of the target object and a text position of the identification text in the order data, and performing verification processing on the target object information based on the standard word stock to obtain the verified target object information includes: determining whether target standard data identical to the identification text exists in the standard word stock; and when the target standard data does not exist in the standard word stock, correcting the identification text based on the text position to obtain the verified target object information.
In an embodiment, before determining whether target standard data identical to the identification text exists in the standard word stock, the method further includes: detecting character information at the boundary of the identification text, and deleting non-text symbols at the boundary of the identification text to obtain a corrected identification text.
In an embodiment, correcting the identification text based on the text position to obtain the verified target object information includes: when the target standard data does not exist in the standard word stock, selecting, from the standard word stock, target candidate data whose similarity to the identification text is greater than a preset threshold; determining whether the spelling sequence of the target candidate data is the same as the spelling sequence of the segment of the order data designated by the text position; and when the two spelling sequences are the same, taking the target candidate data as the verified target object information.
In an embodiment, correcting the identification text based on the text position to obtain the verified target object information further includes: when the spelling sequence of the target candidate data differs from the spelling sequence of the segment of the order data designated by the text position, expanding the text content along the boundary of the text position in the order data until a space symbol is encountered, and taking the expanded text content and its new text position as the verified target object information.
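The correction described in the two embodiments above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function names, the use of `difflib.SequenceMatcher` as the similarity measure, the default threshold of 0.8, the reading of "spelling sequence" as a plain character-by-character comparison, and the fallback to expansion when no candidate exists at all are all assumptions made for the example.

```python
from difflib import SequenceMatcher


def expand_to_whitespace(text, start, end):
    """Grow the span [start, end) in both directions until whitespace is hit."""
    while start > 0 and not text[start - 1].isspace():
        start -= 1
    while end < len(text) and not text[end].isspace():
        end += 1
    return start, end


def correct_identifier(order_text, start, end, lexicon, threshold=0.8):
    """Return (text, start, end) after checking a recognized span against a lexicon.

    `lexicon` stands in for the standard word stock; `threshold` stands in
    for the preset similarity threshold (both hypothetical here).
    """
    recognized = order_text[start:end]
    if recognized in lexicon:
        return recognized, start, end

    # Keep lexicon entries whose similarity exceeds the threshold, best first.
    scored = [(SequenceMatcher(None, recognized, entry).ratio(), entry)
              for entry in lexicon]
    candidates = sorted((s for s in scored if s[0] > threshold), reverse=True)

    for _, candidate in candidates:
        # Assumption: "same spelling sequence" means the order text, read from
        # the recognized start position, spells out the candidate exactly.
        segment = order_text[start:start + len(candidate)]
        if segment == candidate:
            return candidate, start, start + len(candidate)

    # No candidate confirmed by the order text: expand to the nearest spaces.
    new_start, new_end = expand_to_whitespace(order_text, start, end)
    return order_text[new_start:new_end], new_start, new_end
```

For instance, if the model recognizes only `"CSSA MON"` inside `"PLS LOAD CSSA MONTE AT MONGSTAD"` and the lexicon holds `"CSSA MONTE"`, the candidate is confirmed by the order text and returned with its new end position; with an empty lexicon, a truncated span is simply widened to the surrounding spaces.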
In an embodiment, the method further includes: updating the verified target object information into the standard word stock.
In an embodiment, the target object information includes a target object identifier and date information corresponding to the target object, and generating the triple information of the order data based on the verified target object information includes: taking the target object identifier and the date information as two entities, and taking the type label of the target object and the date label as the relation between the two entities, to generate the triple information of the order data.
A second aspect of an embodiment of the present application provides an information extraction apparatus, including: an acquisition module, configured to acquire order data corresponding to a query instruction; a recognition module, configured to input the order data into a preset recognition model and output target object information in the order data; a verification module, configured to perform verification processing on the target object information based on a standard word stock to obtain verified target object information; and a generation module, configured to generate triple information of the order data based on the verified target object information.
In an embodiment, the query instruction carries identification information of a target order, and the acquisition module is configured to: when the query instruction is received, extract order content corresponding to the identification information from a preset order library; and parse the order content to obtain text data of the target order, and take the text data as the order data.
In an embodiment, the apparatus further includes an establishing module, configured to: acquire a sample order data set; convert the sample order data set into a predetermined standard format; label sample target object information in the standard-format sample order data set; and train a neural network model with the labeled sample order data set to obtain the preset recognition model.
In an embodiment, the target object information includes an identification text of the target object and a text position of the identification text in the order data, and the verification module is configured to: determine whether target standard data identical to the identification text exists in the standard word stock; and when the target standard data does not exist in the standard word stock, correct the identification text based on the text position to obtain the verified target object information.
In an embodiment, the verification module is further configured to, before determining whether target standard data identical to the identification text exists in the standard word stock: detect character information at the boundary of the identification text, and delete non-text symbols at the boundary of the identification text to obtain a corrected identification text.
In an embodiment, correcting the identification text based on the text position to obtain the verified target object information includes: when the target standard data does not exist in the standard word stock, selecting, from the standard word stock, target candidate data whose similarity to the identification text is greater than a preset threshold; determining whether the spelling sequence of the target candidate data is the same as the spelling sequence of the segment of the order data designated by the text position; and when the two spelling sequences are the same, taking the target candidate data as the verified target object information.
In an embodiment, correcting the identification text based on the text position to obtain the verified target object information further includes: when the spelling sequence of the target candidate data differs from the spelling sequence of the segment of the order data designated by the text position, expanding the text content along the boundary of the text position in the order data until a space symbol is encountered, and taking the expanded text content and its new text position as the verified target object information.
In an embodiment, the apparatus further includes an updating module, configured to update the verified target object information into the standard word stock.
In an embodiment, the target object information includes a target object identifier and date information corresponding to the target object, and the generation module is configured to: take the target object identifier and the date information as two entities, and take the type label of the target object and the date label as the relation between the two entities, to generate the triple information of the order data.
A third aspect of an embodiment of the present application provides an electronic device, including: a memory for storing a computer program; a processor configured to perform the method of the first aspect of the embodiments of the present application and any of the embodiments thereof.
A fourth aspect of an embodiment of the present application provides a non-transitory electronic device readable storage medium, comprising: a program which, when run by an electronic device, causes the electronic device to perform the method of the first aspect of the embodiments of the application and any of its embodiments.
In the information extraction method, apparatus, device, and storage medium provided by the present application, the order data corresponding to the query instruction is processed by the recognition model to obtain target object information in a unified format, the target object information output by the recognition model is verified based on the standard word stock, and triple information of the order data is generated from the verified target object information. Order information is thus extracted by combining model recognition with standard word stock rule verification, which improves information extraction precision.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the embodiments of the present application will be briefly described below, it should be understood that the following drawings only illustrate some embodiments of the present application and should not be considered as limiting the scope, and other related drawings can be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic diagram of an electronic device according to an embodiment of the application;
FIG. 2A is a flow chart of an information extraction method according to an embodiment of the application;
FIG. 2B is a schematic diagram illustrating mail content parsing according to an embodiment of the present application;
FIG. 3 is a flow chart of an information extraction method according to an embodiment of the application;
FIG. 4 is a schematic structural diagram of an information extraction device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be described below with reference to the accompanying drawings in the embodiments of the present application. In the description of the present application, the terms "first," "second," and the like are used merely to distinguish between descriptions and are not to be construed as indicating or implying relative importance.
As shown in fig. 1, the present embodiment provides an electronic apparatus 1 including: at least one processor 11 and a memory 12, one processor being exemplified in fig. 1. The processor 11 and the memory 12 are connected by a bus 10. The memory 12 stores instructions executable by the processor 11, and the instructions are executed by the processor 11, so that the electronic device 1 can execute all or part of the flow of the method in the following embodiments, so as to extract order information by combining model identification and standard word stock rule verification, and improve information extraction accuracy.
In an embodiment, the electronic device 1 may be a mobile phone, a tablet computer, a notebook computer, a desktop computer, or a large computing system composed of multiple computer devices.
Please refer to FIG. 2A, which shows an information extraction method according to an embodiment of the present application. The method may be executed by the electronic device 1 shown in FIG. 1 and may be applied to extracting information from order data carried by mail, so that order information is extracted by combining model recognition with standard word stock rule verification, improving information extraction accuracy. The method includes the following steps:
Step 201: Obtain order data corresponding to the query instruction.
In this step, the query instruction may be input by the user, for example through an interactive interface of a terminal, and may include identification information, such as a name, of the designated order data. The order data may be carried in various ways. For example, if the order data is transmitted by e-mail, the query instruction may include the name of the mail to be queried. The order data may be stored in advance in a preset order library.
In an embodiment, step 201 may include: when a query instruction is received, extracting order content corresponding to the identification information from a preset order library; and parsing the order content to obtain text data of the target order, and taking the text data as the order data.
In this step, a large amount of queryable order data is stored in advance in the preset order library, and the order data may take the form of a mail or a bill. The query instruction input by the user carries identification information of the target order, such as the mail name of the target order. The order content corresponding to the mail name is looked up in the preset order library and then parsed to obtain text data, which is used as the order data of the target order. In this way, order content in any form can be processed into unified text data, so that information extraction is not limited by the form in which the order is received.
In an embodiment, as shown in FIG. 2B, taking an order in the form of a mail as an example, the user inputs the name of the mail to be queried, and the order content includes the mail content of the target order. When the order content is parsed, the content type of the mail is determined first, and parsing is then performed according to that type. The content types of the mail may include the following:
1. If the mail contains only text, the mail content is input to a text parsing module to obtain the corresponding text content.
2. If the mail body contains a table, the table part is input to a table parsing module, and the table is converted into text format.
3. If the mail has an attachment, the type of the attachment is determined: a Word document is parsed with a Word parser; an Excel document is parsed with an Excel parser; a PDF document is parsed with a PDF parser; a picture is processed with an OCR (Optical Character Recognition) picture parsing tool to extract information; and a text file is passed directly to the text parsing module.
In this way, mail content in different formats is uniformly converted into text content, so that mail content extraction is no longer limited by the mail type, and the range of application of mail information extraction is expanded.
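The type-based dispatch described above might be organized along the following lines. This is only a sketch: the function and table names are hypothetical, attachment types are guessed from the file extension (the source does not say how the type is determined), and the Word, Excel, PDF, and OCR parsers are left as placeholders rather than tied to any particular library.

```python
from pathlib import Path


def parse_text(data: bytes) -> str:
    """Text parsing module: decode raw bytes into text."""
    return data.decode("utf-8", errors="replace")


def table_to_text(rows) -> str:
    """Table parsing module: flatten rows of cells into lines of text."""
    return "\n".join(" ".join(str(cell) for cell in row) for row in rows)


def _not_wired(kind: str):
    """Placeholder factory; a real deployment would plug in an actual parser."""
    def parser(data: bytes) -> str:
        raise NotImplementedError(f"plug a real {kind} parser in here")
    return parser


# Hypothetical mapping from file extension to parser.
ATTACHMENT_PARSERS = {
    ".txt": parse_text,
    ".docx": _not_wired("Word"),
    ".xlsx": _not_wired("Excel"),
    ".pdf": _not_wired("PDF"),
    ".png": _not_wired("OCR"),
    ".jpg": _not_wired("OCR"),
}


def mail_to_text(body: str, tables=(), attachments=()) -> str:
    """Convert body text, body tables, and attachments into one text blob."""
    parts = [body]
    parts.extend(table_to_text(table) for table in tables)
    for filename, data in attachments:
        suffix = Path(filename).suffix.lower()
        if suffix not in ATTACHMENT_PARSERS:
            raise ValueError(f"unsupported attachment type: {suffix!r}")
        parts.append(ATTACHMENT_PARSERS[suffix](data))
    return "\n".join(part for part in parts if part)
```

Keeping the parsers in a lookup table means a new attachment type can be supported by adding one entry, without touching the dispatch logic.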
Step 202: Input the order data into a preset recognition model, and output target object information in the order data.
In this step, the target object may be a commodity, a general article, or the like, and different target objects may be selected for different scenarios. In practice, a mail about a target object generally includes various attribute information of the target object, such as its name and category. If the target object is a commodity, the order data may be ordering information for the commodity, such as a mail order form for a batch of products, which generally includes the category, name, quantity, and arrival date of the commodity.
The preset recognition model may be a neural-network-based recognition model. Assuming that the target object is a commodity and taking order data in the form of a mail as an example, commodity information is strongly correlated with the arrival date, so a neural network model can be trained on the commodity information and arrival dates in past mails to learn this regularity, yielding the preset recognition model. The mail content is first unified into text format in step 201, and the text-format mail content is then input into the preset recognition model, which outputs information such as the commodity information and the arrival date in the mail.
In an embodiment, before inputting the text format mail content into the preset recognition model, the text format mail content may be normalized and converted into a predetermined standard format, so that the preset recognition model may extract the target information more accurately.
In an embodiment, before step 202, the method may further include a step of establishing the preset recognition model, including: obtaining a sample order data set; converting the sample order data set into a predetermined standard format; labeling sample target object information in the standard-format sample order data set; and training a neural network model with the labeled sample order data set to obtain the preset recognition model.
In this step, the sample order data set may be an order data set in text format. The contents of multiple purchase order mails for the target object are processed by the parsers and uniformly converted into a text-format order data set. The data set is first standardized into a predetermined standard format, for example by unifying the text layout, cleaning the data, and removing duplicates. The standard-format sample order data set is then labeled, for example by marking the target object information contained in each text and the label corresponding to that target object information, where the label may represent the category of the target object. Finally, the neural network model is trained with the labeled sample order data set to obtain the preset recognition model.
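The standardization step (unified layout, cleaning, deduplication) could look something like the sketch below. The concrete rules here, collapsing runs of whitespace and dropping empty or repeated texts, are assumptions chosen for illustration; the source does not specify what the predetermined standard format is.

```python
import re


def normalize(text: str) -> str:
    """Collapse all runs of whitespace to single spaces and trim the ends."""
    return re.sub(r"\s+", " ", text).strip()


def prepare_samples(raw_texts):
    """Normalize each sample, then drop empties and exact duplicates.

    The first occurrence of each normalized text is kept, in input order.
    """
    seen = set()
    samples = []
    for raw in raw_texts:
        text = normalize(raw)
        if text and text not in seen:
            seen.add(text)
            samples.append(text)
    return samples
```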
Taking a commodity as the target object, the input of the preset recognition model is the text and the labels of the commodity information, and the output of the preset recognition model is the text and label of the queried target object information together with the position of that text in the input text.
In an embodiment, model training may be implemented with a bidirectional LSTM + CRF network architecture. Taking commodity information as an example, the preprocessed commodity information and labels are used as input data of the network, and the data is trained on iteratively to minimize the loss function; training ends once the set F1 value is reached on the test set. At prediction time, after text is input, the model takes the label with the highest predicted probability and thereby determines whether the input text contains commodity information and, if so, its label and position.
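To make the model output concrete (text, label, and position), the sketch below converts per-token tags into labeled spans with character offsets. It assumes the sequence labeler emits BIO-style tags over whitespace-separated tokens; the source does not state the tagging scheme, and the label names `VESSEL` and `DATE` are hypothetical. The network itself is not shown.

```python
import re


def tokenize_with_offsets(text):
    """Split on whitespace, keeping each token's start offset in `text`."""
    return [(m.group(), m.start()) for m in re.finditer(r"\S+", text)]


def decode_bio(text, tags):
    """Turn BIO tags (one per whitespace token) into labeled character spans."""
    spans, current = [], None
    for (token, offset), tag in zip(tokenize_with_offsets(text), tags):
        if tag.startswith("B-"):
            if current:
                spans.append(current)
            current = {"label": tag[2:], "start": offset, "end": offset + len(token)}
        elif tag.startswith("I-") and current and current["label"] == tag[2:]:
            current["end"] = offset + len(token)  # extend the open span
        else:
            # "O" tag, or an I- tag that does not continue the open span.
            if current:
                spans.append(current)
            current = None
    if current:
        spans.append(current)
    for span in spans:
        span["text"] = text[span["start"]:span["end"]]
    return spans
```

Given the text `"LOAD CSSA MONTE LAYCAN 20-21/08"` and tags `["O", "B-VESSEL", "I-VESSEL", "O", "B-DATE"]`, this yields one `VESSEL` span covering `"CSSA MONTE"` and one `DATE` span covering `"20-21/08"`, each with its start and end offsets.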
Step 203: Perform verification processing on the target object information based on the standard word stock to obtain verified target object information.
In this step, the target object information output by the preset recognition model may be inaccurate. For example, word segmentation of the mail content may be inaccurate, so that the recognition result contains incomplete information, which affects the final extraction precision. Rule verification can therefore be applied on top of model recognition: the target object information is verified based on the standard word stock to obtain the verified target object information. The standard word stock stores standard-format information about the target object in advance. Taking a commodity as the target object, the standard word stock may include information such as the names, order quantities, and arrival dates of ordered commodities. The standard word stock may be compiled from statistics over completed order data. Verifying the model's recognition result against the standard word stock thus further ensures the accuracy of information extraction.
Step 204: Generate triple information of the order data based on the verified target object information.
In this step, the corresponding triple information is extracted from the verified target object information, so that the user can review the content of interest in the mail more clearly and, when the mail is long, does not have to read the whole mail to obtain the relevant target object information. Taking a commodity as the target object, in practice an order mail for a commodity may contain a great deal of content and even many attachments. If the user only wants to look up the name and arrival date of the commodity, reading all of the mail content wastes time and effort. With this method, the user only needs to input the name of the mail to be queried, and the triple information relating the commodity information to the arrival date in the mail is returned automatically, which greatly reduces the user's query time.
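As a rough illustration of the triple, the sketch below follows the embodiment in which the target object identifier and the date information are the two entities and the type label and date label together form the relation. The `Triple` type, the function name, and the choice to join the two labels with a hyphen are assumptions made for the example; the source does not specify how the relation is encoded.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    head: str      # target object identifier, e.g. a commodity name
    relation: str  # built from the type label and the date label
    tail: str      # date information for the target object


def build_triple(identifier: str, type_label: str,
                 date_text: str, date_label: str) -> Triple:
    """Combine a verified identifier and its date into one triple."""
    # Assumption: the relation is the two labels joined with a hyphen.
    return Triple(identifier, f"{type_label}-{date_label}", date_text)
```

A call such as `build_triple("steel bolt M8", "COMMODITY", "2021-12-20", "ARRIVAL_DATE")` returns a triple whose head is the commodity name, whose tail is the date, and whose relation is `"COMMODITY-ARRIVAL_DATE"`.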
In the above information extraction method, the mail content corresponding to the query instruction is first converted into a unified text format by the parsers. The text-format order data is then processed by the recognition model to obtain target object information in a unified format, the target object information output by the recognition model is verified based on the standard word stock, and triple information of the order data is generated from the verified target object information. Because mail content is uniformly converted into text format, mail information extraction is not limited by the type of mail content; and because model recognition is combined with standard word stock rule verification, information extraction precision is improved.
Please refer to FIG. 3, which shows an information extraction method according to an embodiment of the present application. The method may be executed by the electronic device 1 shown in FIG. 1 and may be applied to extracting information from order data carried by mail, so that order information is extracted by combining model recognition with standard word stock rule verification, improving information extraction accuracy. The method includes the following steps:
Step 301: Obtain order data corresponding to the query instruction. See the description of step 201 in the above embodiment for details.
Step 302: Input the order data into a preset recognition model, and output target object information in the order data. See the description of step 202 in the above embodiment for details.
Step 303: Determine whether target standard data identical to the identification text exists in the standard word stock. If yes, go to step 308; otherwise go to step 304.
In this step, the object information output by the preset recognition model may include: the subject matter identifies text and the text location of the identifying text in the order data. Taking the object as an example, assuming that the object identifier is an object name, the object information output by the preset recognition model further includes a text position of the object name in the input text, and the text position may be a coordinate position. When the standard word stock is used for verifying the target object information, whether the target standard data which is the same as the identification text exists in the standard word stock can be judged first. Taking the commodity as an example, the identification text is the text of the commodity name, each commodity can be provided with a standard word stock, and different standard word stocks can be maintained through tag matching. When judging, the text of the commodity name can be input into a standard word stock of commodity information, if the standard word stock has no target standard data which is completely the same as the text of the commodity name, the step 304 is entered, and if the target standard data is not the same as the text of the commodity name, the step 308 is entered.
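The routing decision of step 303 can be sketched as follows. This is a minimal illustration, not the patented implementation: the dictionary `STANDARD_WORD_STOCKS`, its contents, and the function name `next_step` are hypothetical, and the word stock is assumed to be a simple in-memory set selected by label.

```python
from typing import Dict, Set

# Hypothetical standard word stocks, one per label (contents are illustrative only).
STANDARD_WORD_STOCKS: Dict[str, Set[str]] = {
    "PRD": {"CSSA MONTE", "DELTA APOLLONIA"},
}


def next_step(identification_text: str, label: str) -> int:
    """Return 308 on an exact match in the word stock for this label, else 304."""
    stock = STANDARD_WORD_STOCKS.get(label, set())
    return 308 if identification_text in stock else 304
```

With these illustrative contents, an exact hit such as `CSSA MONTE` goes straight to triplet generation, while a truncated name such as `DELTA APOLLON` falls through to the correction steps.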
In an embodiment, before step 303, the method may further include: character information at the boundary of the identification text is detected, and non-text symbols at the boundary of the identification text are deleted, so that the corrected identification text is obtained.
In an actual scenario, the identification text recognized by the preset recognition model may not be accurate enough. For example, adjacent non-text symbols may be included as part of the identification text, which makes verification against the standard word stock inaccurate. Consider scenario 1:
The user inputs the mail name, and the mail is parsed to obtain the following mail content:
-CSSA MONTE VLCC LOAD MONGSTAD/DISCHARGE INDIA LAYCAN 20-21/08-REMARK RPLC
The analyzer first parses the mail content into order data in text format. The preset recognition model then identifies -CSSA MONTE as the commodity-name text (that is, the identification text) with text position (0, 12), and 20-21/08 as the arrival-date text with text position (57, 65). The label of -CSSA MONTE is PRD and the label of 20-21/08 is DAT, and the output of the preset recognition model is (-CSSA MONTE, PRODUCT, 0, 12) and (20-21/08, DATA, 57, 65). The commodity-name text -CSSA MONTE carries a leading non-text symbol "-". If -CSSA MONTE were matched directly against the standard word stock, no target standard data identical to -CSSA MONTE would be found, because the real commodity name recorded in the standard word stock is CSSA MONTE.
To prevent this error from propagating to the final extraction result, the output of the preset recognition model may be preprocessed before it is matched against the standard word stock: character information at the boundary of the identification text is detected based on its text position, and non-text symbols at the boundary are deleted to obtain the corrected identification text. For example, the commodity name is truncated using the position information and "-" is removed to obtain CSSA MONTE, so the final tuple for the commodity name is (CSSA MONTE, PRODUCT, 2, 12). Step 308 can then be entered, because CSSA MONTE is already in the word stock, that is, target standard data exists in the standard word stock.
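The boundary clean-up described for scenario 1 might look like the sketch below. It assumes that "non-text symbol" means any non-alphanumeric character and that positions are half-open `[start, end)` character offsets; neither convention is stated in the source, so the offsets here do not reproduce the (0, 12) and (2, 12) values quoted above. The function name `strip_boundary_symbols` is hypothetical.

```python
from typing import Tuple


def strip_boundary_symbols(text: str, start: int, end: int) -> Tuple[str, int, int]:
    """Drop non-alphanumeric characters at both ends of a recognised span.

    `start` and `end` are treated as a half-open [start, end) range, which is
    an assumption of this sketch.
    """
    left = 0
    while left < len(text) and not text[left].isalnum():
        left += 1
    right = len(text)
    while right > left and not text[right - 1].isalnum():
        right -= 1
    # Shift the recorded positions by the number of characters removed per side.
    return text[left:right], start + left, end - (len(text) - right)
```

Only the edges are trimmed, so symbols inside a name (for example a hyphen in the middle of a vessel name) are left untouched.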
Step 304: Select, from the standard word stock, target candidate data whose similarity to the identification text is greater than a preset threshold.
In this step, when the target standard data does not exist in the standard word stock, the identification text needs to be corrected based on the text position to obtain the verified target object information. Consider scenario 2:
The user inputs the name of the mail to be queried, and the corresponding mail is parsed to obtain the following content:
09-12 DELTA APOLLONIA 319 15 22.52 VADINAR 17-11 DELTA
The mail content is first parsed into order data in text format and then processed by the preset recognition model, which identifies DELTA APOLLON as the commodity-name text with text position (9, 22) and 17-11 as the arrival-date text with text position (61, 66). The label of DELTA APOLLON is PRD and the label of 17-11 is DAT, and the output of the preset recognition model is recorded as (DELTA APOLLON, PRODUCT, 9, 22) and (17-11, DATA, 61, 66). The first item of each tuple is then matched against the standard word stock. If DELTA APOLLON is not in the standard word stock, the word stock is searched further for target candidate data whose similarity to the commodity-name text DELTA APOLLON is greater than the preset threshold. Several candidates may exceed the threshold, in which case the candidate with the highest similarity is selected as the target candidate data.
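The candidate selection of step 304 can be sketched as below. The source does not name a similarity measure, so this example swaps in the ratio from Python's standard `difflib.SequenceMatcher` purely for illustration; the threshold value 0.8 and the function name `select_target_candidate` are likewise assumptions.

```python
from difflib import SequenceMatcher
from typing import Iterable, Optional


def select_target_candidate(
    identification_text: str,
    word_stock: Iterable[str],
    threshold: float = 0.8,  # assumed value; the source only says "preset threshold"
) -> Optional[str]:
    """Return the word-stock entry most similar to the text, if above the threshold."""
    best, best_score = None, threshold
    for entry in word_stock:
        score = SequenceMatcher(None, identification_text, entry).ratio()
        if score > best_score:  # strictly greater than the threshold / current best
            best, best_score = entry, score
    return best
```

Returning `None` when nothing clears the threshold corresponds to the case in which no qualifying candidate is screened out and the method falls back to boundary expansion.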
Step 305: Determine whether the spelling order of the target candidate data is the same as the spelling order within the interval specified by the text position in the order data. If yes, go to step 306; otherwise go to step 307.
In this step, although the similarity between the selected target candidate data and the commodity-name text DELTA APOLLON is greater than the preset threshold, the spelling order may still differ. For text such as English in particular, the same letters spelled in a different order can denote a different meaning. The target candidate data is therefore taken from the standard word stock, say DELTA APOLLONIA, and slid over the commodity-name text within the interval between the start position and the end position of DELTA APOLLON, and it is checked letter by letter whether the spelling order within that interval is the same as that of the target candidate data. Here the letters of DELTA APOLLON within the specified interval appear in the same order in the target candidate data DELTA APOLLONIA, so step 306 may be entered.
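One possible reading of the spelling-order check in step 305 is sketched below: the recognised span of the order data is compared letter by letter against every alignment of the candidate, and the check passes if some alignment matches in full. This interpretation, the half-open `[start, end)` offsets, and the function name `same_spelling_order` are assumptions; the source does not specify the comparison in more detail.

```python
def same_spelling_order(candidate: str, order_data: str, start: int, end: int) -> bool:
    """Check whether order_data[start:end] occurs, letter for letter, inside candidate."""
    span = order_data[start:end]
    if not span or len(span) > len(candidate):
        return False
    # Slide the span along the candidate and compare one letter at a time.
    for offset in range(len(candidate) - len(span) + 1):
        if all(candidate[offset + i] == ch for i, ch in enumerate(span)):
            return True
    return False
```

A candidate that merely contains the same letters in a different order, such as a transposed spelling, fails this check even if its similarity score is high.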
Step 306: Take the target candidate data as the verified target object information. Then go to step 308.
In this step, when the spelling order of the target candidate data is the same as that of the interval specified by the text position in the order data, as with the target candidate data DELTA APOLLONIA and the commodity-name text DELTA APOLLON in scenario 2 above, the target candidate data DELTA APOLLONIA may be taken as the verified target object information.
Step 307: Expand the text content along the boundary of the text position in the order data until a space symbol is encountered, and take the expanded text content and its new text position as the verified target object information. Then go to step 308.
In this step, when the spelling order of the target candidate data differs from the spelling order of the interval specified by the text position in the order data, or when no qualifying target candidate data was screened out of the standard word stock in step 304, the standard word stock may not yet hold standard data for the queried target object. In that case the identification text may be corrected based on its text position and the original order data; for example, an incomplete identification text output by the preset recognition model may be completed.
Taking scenario 2 as an example, assume the identification text of the target object is the commodity-name text DELTA APOLLON with text position (9, 22). The character at boundary position 22 in the original order data is located, which is N. The text content is then extended along the boundary towards the end of the text: the next position holds the letter I, so the commodity name is extended by one position, and this is repeated in a loop, using the space symbol as the delimiter and stopping when a space is encountered. The finally expanded text content DELTA APOLLONIA and its new text position (9, 24) are taken as the verified target object information, giving the verified tuple (DELTA APOLLONIA, PRODUCT, 9, 24).
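The expansion in step 307 can be illustrated with the short sketch below. It assumes half-open `[start, end)` offsets counted on the sample line exactly as reproduced in this text, so the numbers differ from the (9, 22) and (9, 24) positions quoted above; it also expands on both sides of the span, whereas the worked example only needs the trailing side. The function name `expand_to_space` is hypothetical.

```python
from typing import Tuple


def expand_to_space(order_data: str, start: int, end: int) -> Tuple[str, int, int]:
    """Grow the span [start, end) outwards until whitespace or the text edge is hit."""
    while start > 0 and not order_data[start - 1].isspace():
        start -= 1
    while end < len(order_data) and not order_data[end].isspace():
        end += 1
    return order_data[start:end], start, end


order_data = "09-12 DELTA APOLLONIA 319 15 22.52 VADINAR 17-11 DELTA"
print(expand_to_space(order_data, 6, 19))  # span "DELTA APOLLON" grows to the full name
```

Because the space is the only delimiter, a multi-word name is completed only within its last (or first) word; the words already inside the span are kept as recognised.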
In one embodiment, assume the target object information includes the target object identifier and the date information corresponding to the target object. Scenarios 1 and 2 above describe the verification process using the tuple of the commodity name, as the target object identifier, as the example; the date information corresponding to the target object can be verified in the same way. For example, the tuple corresponding to the commodity arrival date can be verified likewise. Because the original mail sentence may contain several dates, searching for the date directly by rule has only about an even chance of picking the right one within a sentence, whereas the preset recognition model can identify the approximate position of the date, and verification against the standard word stock then yields a more accurate result. Details are not repeated here.
Step 308: Take the target object identifier and the date information in the target object information as two entities, and take the type label of the target object and the date label as the relation between the two entities, to generate the triplet information of the order data.
In this step, assume that either the identification text of the target object has been matched in the standard word stock and exactly matching target standard data has been found (for example, CSSA MONTE in scenario 1 is already in the word stock), or complete target object information has been obtained through the verification processing of step 306 or step 307. The target object information includes the target object identifier and the date information corresponding to the target object. Taking a commodity as the target object, the target object information may include the commodity identifier and the commodity arrival date, and the triplet information may be (commodity identifier, commodity-date, arrival date), so that the user can see the content of interest in the mail at a glance.
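Triplet generation in step 308 amounts to joining two verified entities with a relation built from their labels. The sketch below assumes each entity is held as a small record with the short labels PRD and DAT; the `Entity` type and the function name `build_triple` are hypothetical.

```python
from typing import NamedTuple, Tuple


class Entity(NamedTuple):
    """A verified entity: text, label, and character positions in the order data."""
    text: str
    label: str
    start: int
    end: int


def build_triple(target: Entity, date: Entity) -> Tuple[str, str, str]:
    """Use the two entities as head and tail, and their joined labels as the relation."""
    return (target.text, f"{target.label}-{date.label}", date.text)
```

The positions are carried in the record only for the verification steps; they are dropped once the triplet is emitted.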
In one embodiment, the date information may be converted to a uniform predetermined date format. In scenario 1, for example, the date information is correctly identified and needs no correction; the original date 20-21/08 is simply an abbreviation of two dates, so 20-21/08 only needs to be converted to the standard format 2021-08-20/2021-08-21. The final output triplet is (CSSA MONTE, PRD-DAT, 2021-08-20/2021-08-21).
Likewise, in scenario 2 the date 17-11 is converted to the standard format 2021-11-17, and the final output triplet is (DELTA APOLLONIA, PRD-DAT, 2021-11-17).
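The two date conversions above can be reproduced with the sketch below. It handles only the two shapes that appear in the scenarios (`DD-DD/MM` and `DD-MM`) and takes the year as an explicit argument, since the source does not say how the year 2021 is determined; the function name `normalize_date` is hypothetical.

```python
import re
from datetime import date

_RANGE = re.compile(r"^(\d{1,2})-(\d{1,2})/(\d{1,2})$")  # DD-DD/MM, e.g. 20-21/08
_SINGLE = re.compile(r"^(\d{1,2})-(\d{1,2})$")           # DD-MM, e.g. 17-11


def normalize_date(raw: str, year: int) -> str:
    """Convert an abbreviated date or date range to ISO format (YYYY-MM-DD)."""
    m = _RANGE.match(raw)
    if m:
        first_day, last_day, month = (int(g) for g in m.groups())
        first = date(year, month, first_day).isoformat()
        last = date(year, month, last_day).isoformat()
        return f"{first}/{last}"
    m = _SINGLE.match(raw)
    if m:
        day, month = (int(g) for g in m.groups())
        return date(year, month, day).isoformat()
    raise ValueError(f"unsupported date format: {raw!r}")
```

Constructing `datetime.date` objects has the side effect of rejecting impossible dates such as day 32, which gives a basic sanity check on the recognised text.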
Step 309: Update the verified target object information into the standard word stock.
In this step, target object information that was not matched to any target standard data or target candidate data in the standard word stock is evidently not yet recorded in the standard word stock. To enrich the standard word stock, such target object information may be added to the corresponding standard word stock, and the updated word stock further improves the accuracy of subsequent information extraction.
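The update in step 309 can be sketched as a guarded insert into the word stock for the entity's label. The in-memory dictionary of sets and the function name `update_word_stock` are assumptions; a real word stock could equally be a database table.

```python
from typing import Dict, Set


def update_word_stock(word_stocks: Dict[str, Set[str]], label: str, verified_text: str) -> bool:
    """Add verified text to the word stock for its label; return True if it was new."""
    stock = word_stocks.setdefault(label, set())
    if verified_text in stock:
        return False
    stock.add(verified_text)
    return True
```

After such an update, a later mail containing the same name would be resolved by the exact-match check in step 303 without needing the correction steps.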
Referring to Fig. 4, an information extraction apparatus 400 according to an embodiment of the present application may be applied to the electronic device 1 shown in Fig. 1 and to an information extraction scenario in which order data is carried by mail, so that order information is extracted by combining model recognition with rule-based verification against a standard word stock, thereby improving extraction accuracy. The apparatus comprises an acquisition module 401, a recognition module 402, a verification module 403 and a generation module 404, whose relationship is as follows:
The acquisition module 401 is configured to obtain order data corresponding to the query instruction.
The recognition module 402 is configured to input the order data into a preset recognition model and output target object information in the order data.
The verification module 403 is configured to perform verification processing on the target object information based on the standard word stock to obtain verified target object information.
The generation module 404 is configured to generate triplet information of the order data based on the verified target object information.
In one embodiment, the query instruction carries identification information of the target order. The acquisition module 401 is configured to: when a query instruction is received, extract the order content corresponding to the identification information from a preset order library; and perform content analysis on the order content to obtain text data of the target order, and take the text data as the order data.
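The acquisition step might be sketched as follows, under strong assumptions: the order library is modelled as a dictionary from mail name to a raw RFC 822 style message, and the "content analysis" is reduced to extracting the plain-text body with Python's standard `email` package. The source does not describe its analyzer at this level, and the names `ORDER_LIBRARY` and `get_order_data` are hypothetical.

```python
from email import message_from_string
from email.policy import default as default_policy
from typing import Dict

# Hypothetical order library: mail name -> raw message text (illustrative content).
ORDER_LIBRARY: Dict[str, str] = {
    "order-001": (
        "Subject: order-001\n"
        "Content-Type: text/plain; charset=utf-8\n"
        "\n"
        "09-12 DELTA APOLLONIA 319 15 22.52 VADINAR 17-11 DELTA\n"
    ),
}


def get_order_data(mail_name: str, library: Dict[str, str] = ORDER_LIBRARY) -> str:
    """Look up a mail by name and return its plain-text body as order data."""
    message = message_from_string(library[mail_name], policy=default_policy)
    body = message.get_body(preferencelist=("plain",))
    if body is None:  # no plain-text part found
        return ""
    return body.get_content().strip()
```

Attachments and non-text parts, which the description mentions as a reason for converting everything to text first, would need their own converters and are outside this sketch.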
In one embodiment, the apparatus further comprises a building module 405, configured to: obtain a sample order data set; convert the sample order data set into a predetermined standard format; label the sample target object information in the sample order data set in the standard format; and train a neural network model with the labeled sample order data set to obtain the preset recognition model.
In one embodiment, the target object information includes the identification text of the target object and the text position of the identification text in the order data. The verification module 403 is configured to: determine whether target standard data identical to the identification text exists in the standard word stock; and, when the target standard data does not exist in the standard word stock, correct the identification text based on the text position to obtain the verified target object information.
In one embodiment, before determining whether target standard data identical to the identification text exists in the standard word stock, the verification module is further configured to: detect character information at the boundary of the identification text, and delete non-text symbols at the boundary of the identification text to obtain the corrected identification text.
In one embodiment, correcting the identification text based on the text position to obtain the verified target object information includes: when the target standard data does not exist in the standard word stock, selecting, from the standard word stock, target candidate data whose similarity to the identification text is greater than a preset threshold; determining whether the spelling order of the target candidate data is the same as the spelling order within the interval specified by the text position in the order data; and, when the two spelling orders are the same, taking the target candidate data as the verified target object information.
In one embodiment, correcting the identification text based on the text position to obtain the verified target object information further includes: when the spelling order of the target candidate data differs from the spelling order within the interval specified by the text position in the order data, expanding the text content along the text position boundary in the order data until a space symbol is encountered, and taking the expanded text content and its new text position as the verified target object information.
In one embodiment, the apparatus further comprises an updating module 406, configured to update the verified target object information into the standard word stock.
In one embodiment, the target object information includes the target object identifier and the date information corresponding to the target object. The generation module 404 is configured to: take the target object identifier and the date information as two entities respectively, and take the type label of the target object and the date label as the relation between the two entities, to generate the triplet information of the order data.
For a detailed description of the information extraction apparatus 400, please refer to the description of the relevant method steps in the above embodiments.
An embodiment of the invention also provides a non-transitory electronic-device-readable storage medium comprising a program which, when run on an electronic device, causes the electronic device to perform all or part of the flow of the method in the above embodiments. The storage medium may be a magnetic disk, an optical disk, a read-only memory (ROM), a random access memory (RAM), a flash memory, a hard disk drive (HDD), a solid-state drive (SSD), or the like. The storage medium may also comprise a combination of memories of the kinds described above.
Although embodiments of the present invention have been described in connection with the accompanying drawings, various modifications and variations may be made by those skilled in the art without departing from the spirit and scope of the invention, and such modifications and variations are within the scope of the invention as defined by the appended claims.

Claims (10)

1. An information extraction method, characterized by comprising:
acquiring order data corresponding to a query instruction;
inputting the order data into a preset recognition model, and outputting target object information in the order data;
performing verification processing on the target object information based on a standard word stock to obtain verified target object information;
generating triplet information of the order data based on the verified target object information;
wherein the target object information comprises: an identification text of the target object and a text position of the identification text in the order data; and the performing verification processing on the target object information based on the standard word stock to obtain verified target object information comprises:
determining whether target standard data identical to the identification text exists in the standard word stock;
when the target standard data does not exist in the standard word stock, correcting the identification text based on the text position to obtain the verified target object information;
wherein the correcting the identification text based on the text position to obtain the verified target object information comprises:
when the target standard data does not exist in the standard word stock, selecting, from the standard word stock, target candidate data whose similarity to the identification text is greater than a preset threshold;
determining whether the spelling order of the target candidate data is the same as the spelling order within the interval specified by the text position in the order data;
when the spelling order of the target candidate data is the same as the spelling order within the interval specified by the text position in the order data, taking the target candidate data as the verified target object information.
2. The method of claim 1, wherein the query instruction carries identification information of a target order, and the acquiring order data corresponding to the query instruction comprises:
when the query instruction is received, extracting order content corresponding to the identification information from a preset order library;
performing content analysis on the order content to obtain text data of the target order, and taking the text data as the order data.
3. The method of claim 1, wherein the step of establishing the preset recognition model comprises:
acquiring a sample order data set;
converting the sample order data set into a predetermined standard format;
labeling sample target object information in the sample order data set in the standard format;
training a neural network model with the labeled sample order data set to obtain the preset recognition model.
4. The method of claim 1, further comprising, prior to the determining whether target standard data identical to the identification text exists in the standard word stock:
detecting character information at the boundary of the identification text, and deleting non-text symbols at the boundary of the identification text to obtain the corrected identification text.
5. The method of claim 1, wherein the correcting the identification text based on the text position to obtain the verified target object information further comprises:
when the spelling order of the target candidate data differs from the spelling order within the interval specified by the text position in the order data, expanding text content along the text position boundary in the order data until a space symbol is encountered, and taking the expanded text content and the new text position corresponding to it as the verified target object information.
6. The method of claim 5, further comprising:
updating the verified target object information into the standard word stock.
7. The method of claim 1, wherein the target object information comprises: a target object identifier and date information corresponding to the target object; and the generating triplet information of the order data based on the verified target object information comprises:
taking the target object identifier and the date information as two entities respectively, and taking the type label of the target object and the date label as the relation between the two entities, to generate the triplet information of the order data.
8. An information extraction apparatus, characterized by comprising:
an acquisition module, configured to acquire order data corresponding to a query instruction;
a recognition module, configured to input the order data into a preset recognition model and output target object information in the order data;
a verification module, configured to perform verification processing on the target object information based on a standard word stock to obtain verified target object information;
a generation module, configured to generate triplet information of the order data based on the verified target object information;
wherein the target object information comprises: an identification text of the target object and a text position of the identification text in the order data; and the performing verification processing on the target object information based on the standard word stock to obtain verified target object information comprises:
determining whether target standard data identical to the identification text exists in the standard word stock;
when the target standard data does not exist in the standard word stock, correcting the identification text based on the text position to obtain the verified target object information;
wherein the correcting the identification text based on the text position to obtain the verified target object information comprises:
when the target standard data does not exist in the standard word stock, selecting, from the standard word stock, target candidate data whose similarity to the identification text is greater than a preset threshold;
determining whether the spelling order of the target candidate data is the same as the spelling order within the interval specified by the text position in the order data;
when the spelling order of the target candidate data is the same as the spelling order within the interval specified by the text position in the order data, taking the target candidate data as the verified target object information.
9. An electronic device, comprising:
a memory for storing a computer program;
a processor for executing the computer program to implement the method of any one of claims 1 to 7.
10. A non-transitory electronic-device-readable storage medium, comprising: a program which, when run by an electronic device, causes the electronic device to perform the method of any one of claims 1 to 7.
CN202111520172.8A 2021-12-13 2021-12-13 Information extraction method, device, equipment and storage medium Active CN114154480B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111520172.8A CN114154480B (en) 2021-12-13 2021-12-13 Information extraction method, device, equipment and storage medium

Publications (2)

Publication Number Publication Date
CN114154480A CN114154480A (en) 2022-03-08
CN114154480B true CN114154480B (en) 2024-11-19

Family

ID=80450513

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111520172.8A Active CN114154480B (en) 2021-12-13 2021-12-13 Information extraction method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114154480B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116468255B (en) * 2023-06-15 2023-09-08 国网信通亿力科技有限责任公司 A configured master data management system

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105045777A (en) * 2007-08-01 2015-11-11 金格软件有限公司 Automatic context sensitive language correction and enhancement using an internet corpus
CN113515587A (en) * 2021-06-02 2021-10-19 中国神华国际工程有限公司 Object information extraction method and device, computer equipment and storage medium

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106888234A (en) * 2015-12-15 2017-06-23 深圳市银信网银科技有限公司 A kind of data interactive processing method and device
CN107291730B (en) * 2016-03-31 2020-07-31 阿里巴巴集团控股有限公司 Method and device for providing correction suggestion for query word and probability dictionary construction method
CN108764194A (en) * 2018-06-04 2018-11-06 科大讯飞股份有限公司 A kind of text method of calibration, device, equipment and readable storage medium storing program for executing
CN110704226B (en) * 2019-09-19 2023-02-17 贝壳技术有限公司 Data verification method, device and storage medium
CN111275038A (en) * 2020-01-17 2020-06-12 平安医疗健康管理股份有限公司 Image text recognition method and device, computer equipment and computer storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant