Intelligent text analysis method and system based on natural language processing
Technical Field
The invention relates to the technical field of artificial intelligence, and in particular to an intelligent text analysis method and system based on natural language processing.
Background
As society develops, the number of documents of all kinds keeps growing, and some documents still have to be signed or checked manually, which brings a series of problems and challenges. First, manual review requires thorough knowledge of the content to be checked in each type of file, but because there are so many file types, an auditor may not be familiar with the key content of certain files and must spend a great deal of time reading and understanding them. Second, manual auditing carries a high risk of error. It is therefore important to develop a system capable of extracting key information with high accuracy.
In order to accurately identify and mark the key content to be audited in various files, and so make it easier for auditors to review them, the invention provides an intelligent text analysis method and system based on natural language processing.
Disclosure of Invention
To overcome the shortcomings of the prior art, the invention provides a simple and efficient intelligent text analysis method and system based on natural language processing.
The invention is realized by the following technical scheme:
An intelligent text analysis method based on natural language processing comprises the following steps:
Step S1, data collection
A user-specified website is selected, and a web crawler is used to crawl the files on that website to obtain the corresponding data;
Step S2, data cleaning
After the data crawling is completed, the crawled data is cleaned: text noise is removed and the data format is standardized, so as to improve the accuracy and quality of the crawled data;
In step S2, the data is cleaned as follows: first, text noise is removed; then, the data format is standardized and the acquired data is converted to plain text so that each piece of data occupies its own line of text; finally, the content is corrected by removing repeated and erroneous content from the text.
Step S3, data filling
After the data cleaning is completed, the text is filled with real data so as to improve the accuracy of entity recognition;
In step S3, the text to be input into the named entity recognition model is preprocessed, including word segmentation and stop word removal, so as to convert the text into a word sequence.
Step S4, constructing a named entity recognition model
Taking Bidirectional Encoder Representations from Transformers (BERT) as the foundation, a named entity recognition model is constructed by combining a bidirectional long short-term memory network (BiLSTM) and a conditional random field (CRF), so as to improve the accuracy of entity recognition;
In step S4, BERT serves as the pre-trained language model and is responsible for extracting deep text features from the original text; the BiLSTM is responsible for further capturing the information in the text sequence; the CRF is responsible for constraining the output tag sequence so that entity tags appear in a valid order, thereby optimizing the output of the model.
In step S4, the BiLSTM serves as the encoding layer of the named entity recognition model and is responsible for comprehensively capturing the context information of the text, so that the model can take both the preceding and the following context into account when generating the output tag sequence, which improves prediction accuracy;
The CRF is responsible for modeling the dependency relationships between the labels in the labeling sequence, so as to optimize the named entity recognition process.
Step S5, model training
After effective features are extracted from the word sequences, the featurized text fragments are mapped onto a predefined set of categories, and the named entity recognition model is trained so that it learns the correspondence between text features and category labels;
Step S6, named entity identification
The best model obtained from training is loaded, named entity recognition is performed on newly input text, and the entities and their category labels are output.
A system for implementing the intelligent text analysis method based on natural language processing comprises:
a data collection module, responsible for selecting a user-specified website and crawling the files on that website with a web crawler to obtain the corresponding data;
a data cleaning module, responsible for cleaning the crawled data after the crawling is completed, removing text noise and standardizing the data format so as to improve the accuracy and quality of the crawled data;
The data cleaning module first removes text noise, then standardizes the data format and converts the acquired data to plain text so that each piece of data occupies its own line of text, and finally corrects the content by removing repeated and erroneous content from the text.
a data filling module, responsible for filling the text with real data after the data cleaning is completed, so as to improve the accuracy of entity recognition;
The system also comprises a preprocessing module, which is responsible for preprocessing the text to be input into the named entity recognition model, including word segmentation and stop word removal, and for converting the text into a word sequence.
a named entity recognition model construction module, responsible for constructing a named entity recognition model that takes Bidirectional Encoder Representations from Transformers (BERT) as the foundation and combines a bidirectional long short-term memory network (BiLSTM) and a conditional random field (CRF), so as to improve the accuracy of entity recognition;
a model training module, responsible for extracting effective features from the word sequences, mapping the featurized text fragments onto a predefined set of categories, and training the named entity recognition model so that it learns the correspondence between text features and category labels;
and a named entity recognition module, responsible for loading the best model obtained from training, performing named entity recognition on newly input text, and outputting the entities and their category labels.
A computing device, characterized by comprising:
one or more processors, one or more memories, and one or more programs, wherein the one or more programs are stored in the one or more memories and configured to be executed by the one or more processors, the one or more programs comprising instructions for performing any of the methods described above.
A readable storage medium, characterized in that the readable storage medium has stored thereon a computer program which, when executed by a processor, implements the method described above.
The beneficial effects of the invention are as follows: the intelligent text analysis method and system based on natural language processing extract key information automatically, which greatly reduces the time and workload of processing text manually, improves processing efficiency, improves the accuracy of key information extraction, reduces the risk of human error, broadens the range of application, and improves the user experience.
Drawings
In order to illustrate the embodiments of the present invention or the technical solutions of the prior art more clearly, the drawings required for describing the embodiments or the prior art are briefly introduced below. The drawings described below show some embodiments of the present invention, and a person skilled in the art may obtain other drawings from them without inventive effort.
FIG. 1 is a schematic diagram of the intelligent text analysis method based on natural language processing.
FIG. 2 is a diagram of a named entity recognition model according to the present invention.
Detailed Description
In order to enable those skilled in the art to better understand the technical solution of the present invention, the technical solution is described clearly and completely below in combination with the embodiments of the present invention. The described embodiments are only some, not all, of the embodiments of the invention. All other embodiments obtained by those skilled in the art on the basis of these embodiments without inventive effort shall fall within the scope of the present invention.
The intelligent text analysis method based on natural language processing comprises the following steps:
Step S1, data collection
A user-specified website is selected, and a web crawler is used to crawl the files on that website to obtain the corresponding data;
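The file-discovery part of step S1 can be sketched as follows. This is a minimal illustration using only the Python standard library, not the crawler of the invention: the function name `extract_file_links`, the list of file extensions, and the `example.com` URLs are all assumptions made for the example. Downloading the page itself (for instance with `urllib.request.urlopen`) is left out so that the sketch runs offline.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class FileLinkParser(HTMLParser):
    """Collects absolute URLs of linked files that have a selected extension."""

    def __init__(self, base_url, extensions):
        super().__init__()
        self.base_url = base_url
        self.extensions = tuple(ext.lower() for ext in extensions)
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href and href.lower().endswith(self.extensions):
            url = urljoin(self.base_url, href)  # resolve relative links
            if url not in self.links:           # skip links already seen
                self.links.append(url)


def extract_file_links(page_html, base_url, extensions=(".pdf", ".docx", ".txt")):
    """Return the file URLs found in one HTML page (hypothetical helper)."""
    parser = FileLinkParser(base_url, extensions)
    parser.feed(page_html)
    return parser.links


# Stand-in for a page downloaded from the user-specified website.
sample_page = """
<html><body>
  <a href="/docs/contract_a.pdf">Contract A</a>
  <a href="notice.docx">Notice</a>
  <a href="/about.html">About</a>
  <a href="/docs/contract_a.pdf">Contract A (repeated link)</a>
</body></html>
"""
links = extract_file_links(sample_page, "https://example.com/files/")
print(links)
```

A real crawler would also need to respect the site's access rules, limit its request rate, and handle download errors, none of which is shown here.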
Step S2, data cleaning
After the data crawling is completed, because the crawled data is heterogeneous and not in a standard form, the crawled data is cleaned: text noise is removed and the data format is standardized, so as to improve the accuracy and quality of the crawled data;
In step S2, the data is cleaned as follows: first, text noise generated during crawling, such as HTML tags and garbled characters, is removed; then, the data format is standardized and the acquired data is converted to plain text so that each piece of data occupies its own line of text; finally, the content is corrected by removing repeated and erroneous content from the text.
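The cleaning sequence of step S2 can be sketched as below, assuming the crawled data arrives as a list of raw strings. The helper names `clean_record` and `clean_records` and the sample records are hypothetical. The sketch covers tag removal, garbled-character removal, format normalization, one record per line, and removal of repeated content; correcting erroneous content depends on domain rules that the source does not specify, so it is not attempted.

```python
import html
import re
import unicodedata

TAG_RE = re.compile(r"<[^>]+>")
WHITESPACE_RE = re.compile(r"\s+")


def clean_record(raw):
    """Turn one crawled fragment into a single normalized line of text."""
    text = TAG_RE.sub(" ", raw)                 # remove HTML tags
    text = html.unescape(text)                  # decode entities such as &nbsp;
    text = unicodedata.normalize("NFKC", text)  # unify full-width and compatibility forms
    text = text.replace("\ufffd", "")           # drop replacement characters left by bad decoding
    # Drop control characters but keep whitespace for the collapse below.
    text = "".join(ch for ch in text if ch.isprintable() or ch.isspace())
    return WHITESPACE_RE.sub(" ", text).strip() # collapse line breaks, tabs, repeated spaces


def clean_records(raw_records):
    """Clean every record, skip empty ones, and remove exact repeats in order."""
    seen = set()
    lines = []
    for raw in raw_records:
        line = clean_record(raw)
        if line and line not in seen:
            seen.add(line)
            lines.append(line)
    return lines


raw = [
    "<p>Party&nbsp;A:   Example Co.</p>",
    "<div>Party A: Example Co.</div>",         # same content, different markup
    "Amount:\n\t\uff11\uff10\uff10 yuan\ufffd",  # full-width digits and a garbled character
    "<br/>",                                    # markup only, no text
]
cleaned = clean_records(raw)
print("\n".join(cleaned))
```

Tags are stripped before entities are decoded so that escaped text such as `&lt;b&gt;` is kept as literal content rather than being mistaken for markup.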
Step S3, data filling
After the data cleaning is completed, the text is filled with real data so as to improve the accuracy of entity recognition;
In step S3, the text to be input into the named entity recognition model is preprocessed, including word segmentation and stop word removal, so as to convert the text into a word sequence.
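The preprocessing in step S3 can be illustrated with a short sketch. It assumes whitespace-delimited text and a tiny illustrative stop word list; both are assumptions, as are the names `STOP_WORDS` and `to_word_sequence`. Text in languages without word delimiters, such as Chinese, would need a dedicated segmenter instead of the regular expression used here, and BERT-style models usually apply their own subword tokenizer as well.

```python
import re

# Illustrative stop word list; a real list would be far larger and domain specific.
STOP_WORDS = frozenset({"the", "a", "an", "of", "and", "to", "is", "in"})
TOKEN_RE = re.compile(r"\w+")


def to_word_sequence(text, stop_words=STOP_WORDS):
    """Split text into word tokens and drop stop words (case-insensitive match)."""
    tokens = TOKEN_RE.findall(text)
    return [tok for tok in tokens if tok.lower() not in stop_words]


words = to_word_sequence(
    "The lessee is Example Co. and the term of the lease is 3 years."
)
print(words)
```

The original casing of the remaining tokens is kept because capitalization is often a useful cue for entity recognition.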
Step S4, constructing a named entity recognition model
Taking Bidirectional Encoder Representations from Transformers (BERT) as the foundation, a named entity recognition model is constructed by combining a bidirectional long short-term memory network (BiLSTM) and a conditional random field (CRF), so as to improve the accuracy of entity recognition;
In step S4, BERT serves as the pre-trained language model and is responsible for extracting deep text features from the original text; the BiLSTM is responsible for further capturing the information in the text sequence; the CRF is responsible for constraining the output tag sequence so that entity tags appear in a valid order, thereby optimizing the output of the model.
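The source gives no formulas for the three components. One common way to write a BERT-BiLSTM-CRF tagger, offered here as an assumption rather than as the formulation of the invention, is shown below. In it, x_1..x_n are the input tokens, K is the number of tags, W and b form a linear projection, and A is the K-by-K matrix of tag transition scores learned by the CRF.

```latex
\begin{aligned}
(e_1, \dots, e_n) &= \mathrm{BERT}(x_1, \dots, x_n)
  && \text{contextual token features} \\
h_t &= \bigl[\overrightarrow{\mathrm{LSTM}}(e_{1:t}) \,;\, \overleftarrow{\mathrm{LSTM}}(e_{t:n})\bigr]
  && \text{BiLSTM encoding of left and right context} \\
P_t &= W h_t + b \in \mathbb{R}^{K}
  && \text{per-token tag (emission) scores} \\
s(x, y) &= \sum_{t=1}^{n} P_{t,\,y_t} + \sum_{t=2}^{n} A_{y_{t-1},\,y_t}
  && \text{score of a whole tag sequence } y \\
p(y \mid x) &= \frac{\exp s(x, y)}{\sum_{y'} \exp s(x, y')}
  && \text{CRF normalization over all tag sequences} \\
\hat{y} &= \arg\max_{y} \, s(x, y)
  && \text{prediction, found with the Viterbi algorithm}
\end{aligned}
```

Under this formulation, training minimizes the negative log-likelihood of the annotated tag sequence, and the transition term is what lets the model penalize tag orders that never occur in the training data.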
Results show that the BERT-based model achieves a marked performance improvement on a specific named entity data set, which supports the design concept. In addition, given the wide use and strong performance of BERT in natural language processing, key entities in the text are identified accurately by fine-tuning the pre-trained BERT model on the specific text data.
In order to understand the text more deeply and identify the named entities in it accurately, in step S4 the BiLSTM serves as the encoding layer of the named entity recognition model and is responsible for comprehensively capturing the context information of the text, so that the model can take both the preceding and the following context into account when generating the output tag sequence, which improves prediction accuracy;
However, relying solely on the BiLSTM for prediction may produce inconsistent tag sequences, for example an invalid tag combination such as "B-JiaFang, O, I-JiaFang". To avoid such problems, the properties of the CRF are exploited: the CRF is responsible for modeling the dependency relationships between the labels in the labeling sequence, so as to optimize the named entity recognition process. Applying the CRF as a constraint on the output sequence yields more accurate and more consistent recognition results.
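The effect of constraining the output sequence can be shown with a small decoding sketch. It makes two simplifying assumptions: a single entity type with tags `O`, `B-JiaFang`, and `I-JiaFang`, and hard transition rules (a forbidden transition scores minus infinity, an allowed one scores zero) in place of the transition scores a trained CRF would learn. The per-token scores are made-up numbers chosen so that picking the best tag at each position independently gives an invalid sequence.

```python
import math

TAGS = ["O", "B-JiaFang", "I-JiaFang"]


def transition_allowed(prev_tag, tag):
    """BIO rule: I-X may only follow B-X or I-X of the same entity type X."""
    if not tag.startswith("I-"):
        return True
    return prev_tag in ("B-" + tag[2:], "I-" + tag[2:])


def viterbi_decode(emissions, tags=TAGS):
    """Return the highest-scoring tag sequence that obeys the BIO rule.

    emissions[t][k] is the score of tags[k] at position t.
    """
    n_tags = len(tags)
    # A sequence may not start with an I- tag.
    score = [
        emissions[0][k] if not tags[k].startswith("I-") else -math.inf
        for k in range(n_tags)
    ]
    backpointers = []
    for t in range(1, len(emissions)):
        new_score, pointers = [], []
        for k in range(n_tags):
            candidates = [
                score[j] if transition_allowed(tags[j], tags[k]) else -math.inf
                for j in range(n_tags)
            ]
            best_prev = max(range(n_tags), key=candidates.__getitem__)
            new_score.append(candidates[best_prev] + emissions[t][k])
            pointers.append(best_prev)
        score = new_score
        backpointers.append(pointers)
    # Follow the back-pointers from the best final tag to recover the path.
    best = max(range(n_tags), key=score.__getitem__)
    path = [best]
    for pointers in reversed(backpointers):
        best = pointers[best]
        path.append(best)
    return [tags[k] for k in reversed(path)]


# Made-up scores for three tokens; columns follow the order of TAGS.
emissions = [
    [0.1, 2.0, 0.3],
    [1.0, 0.2, 0.9],
    [0.2, 0.1, 1.5],
]
greedy = [TAGS[max(range(len(TAGS)), key=row.__getitem__)] for row in emissions]
decoded = viterbi_decode(emissions)
print(greedy)   # per-token argmax, ignores tag order
print(decoded)  # constrained decoding
```

Here the independent per-token choice yields `B-JiaFang, O, I-JiaFang`, while the constrained decoder returns `B-JiaFang, I-JiaFang, I-JiaFang`, because the slightly lower score at the second token is outweighed by keeping the sequence valid.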
Step S5, model training
After effective features are extracted from the word sequences, the featurized text fragments are mapped onto a predefined set of categories, and the named entity recognition model is trained so that it learns the correspondence between text features and category labels;
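Before a sequence labeling model can learn the correspondence between text and category labels, annotated entities have to be turned into one label per token. The sketch below shows one common way to do this with BIO tags. The function name `spans_to_bio`, the sample sentence, and the `Date` category are assumptions for illustration; only the `JiaFang` tag name appears in the source.

```python
def spans_to_bio(tokens, spans):
    """Convert (start, end, label) token spans, end exclusive, into BIO tags."""
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        if not 0 <= start < end <= len(tokens):
            raise ValueError(f"span out of range: {(start, end, label)}")
        if any(tag != "O" for tag in tags[start:end]):
            raise ValueError(f"overlapping span: {(start, end, label)}")
        tags[start] = f"B-{label}"       # first token of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"       # remaining tokens of the entity
    return tags


tokens = ["Party", "A", ":", "Example", "Trading", "Co", "signed", "on", "2024-01-05"]
spans = [(3, 6, "JiaFang"), (8, 9, "Date")]  # hypothetical annotations
tags = spans_to_bio(tokens, spans)

# Map each tag to an integer id, the form in which labels are usually fed to a model.
tag_to_id = {tag: i for i, tag in enumerate(sorted(set(tags)))}
tag_ids = [tag_to_id[tag] for tag in tags]
print(list(zip(tokens, tags)))
print(tag_ids)
```

In practice the tag inventory would be built from the full predefined category set rather than from a single sentence, so that ids stay stable across the whole training corpus.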
Step S6, named entity identification
The best model obtained from training is loaded, named entity recognition is performed on newly input text, and the entities and their category labels are output.
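The final output of step S6, entities together with their category labels, can be recovered from a predicted tag sequence as sketched below. The sketch assumes BIO-style tags and space-separated tokens; the function name `bio_to_entities` and the sample data are hypothetical, and loading and running the trained model is not shown.

```python
def bio_to_entities(tokens, tags):
    """Group BIO-tagged tokens into (entity text, category label) pairs."""
    entities = []
    start, label = None, None
    # A trailing "O" sentinel closes an entity that runs to the end of the text.
    for i, tag in enumerate(list(tags) + ["O"]):
        continues = tag.startswith("I-") and label == tag[2:]
        if start is not None and not continues:
            entities.append((" ".join(tokens[start:i]), label))
            start, label = None, None
        if tag.startswith("B-"):
            start, label = i, tag[2:]
    return entities


tokens = ["Party", "A", ":", "Example", "Trading", "Co", "signed", "on", "2024-01-05"]
predicted = ["O", "O", "O", "B-JiaFang", "I-JiaFang", "I-JiaFang", "O", "O", "B-Date"]
entities = bio_to_entities(tokens, predicted)
print(entities)
```

An `I-` tag that does not continue an open entity of the same type is ignored here. For text without spaces between words, the tokens would be joined with an empty string instead.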
The system for implementing the intelligent text analysis method based on natural language processing comprises:
a data collection module, responsible for selecting a user-specified website and crawling the files on that website with a web crawler to obtain the corresponding data;
a data cleaning module, responsible for cleaning the crawled data after the crawling is completed, removing text noise and standardizing the data format so as to improve the accuracy and quality of the crawled data;
The data cleaning module first removes text noise, then standardizes the data format and converts the acquired data to plain text so that each piece of data occupies its own line of text, and finally corrects the content by removing repeated and erroneous content from the text.
a data filling module, responsible for filling the text with real data after the data cleaning is completed, so as to improve the accuracy of entity recognition;
The system also comprises a preprocessing module, which is responsible for preprocessing the text to be input into the named entity recognition model, including word segmentation and stop word removal, and for converting the text into a word sequence.
a named entity recognition model construction module, responsible for constructing a named entity recognition model that takes Bidirectional Encoder Representations from Transformers (BERT) as the foundation and combines a bidirectional long short-term memory network (BiLSTM) and a conditional random field (CRF), so as to improve the accuracy of entity recognition;
a model training module, responsible for extracting effective features from the word sequences, mapping the featurized text fragments onto a predefined set of categories, and training the named entity recognition model so that it learns the correspondence between text features and category labels;
and a named entity recognition module, responsible for loading the best model obtained from training, performing named entity recognition on newly input text, and outputting the entities and their category labels.
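One way to picture how the modules cooperate at analysis time is a pipeline object that holds each module as a replaceable callable. This is a design sketch under stated assumptions, not the system's actual structure: the class name `AnalysisPipeline`, its field names, and the stub functions are all hypothetical, and the data filling, model construction, and training modules are omitted because they run before analysis rather than during it.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class AnalysisPipeline:
    """Chains the analysis-time modules; each field is a pluggable callable."""

    collect: Callable[[str], List[str]]        # data collection module
    clean: Callable[[List[str]], List[str]]    # data cleaning module
    preprocess: Callable[[str], List[str]]     # preprocessing module
    recognize: Callable[[List[str]], List[Tuple[str, str]]]  # recognition module

    def run(self, site):
        """Return (cleaned line, recognized entities) for every collected record."""
        results = []
        for line in self.clean(self.collect(site)):
            words = self.preprocess(line)
            results.append((line, self.recognize(words)))
        return results


# Trivial stand-ins so the sketch runs without a crawler or a trained model.
pipeline = AnalysisPipeline(
    collect=lambda site: ["  Party A: Example Co  ", "  Party A: Example Co  "],
    clean=lambda records: list(dict.fromkeys(r.strip() for r in records)),
    preprocess=str.split,
    recognize=lambda words: [(w, "JiaFang") for w in words if w == "Example"],
)
output = pipeline.run("https://example.com")
print(output)
```

Keeping each module behind a plain callable makes it possible to swap in a real crawler or a trained recognizer without changing the code that connects them.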
The computing device includes:
one or more processors, one or more memories, and one or more programs, wherein the one or more programs are stored in the one or more memories and configured to be executed by the one or more processors, the one or more programs comprising instructions for performing any of the methods described above.
The readable storage medium has stored thereon a computer program which, when executed by a processor, implements a method as described above.
Compared with the prior art, the intelligent text analysis method and system based on natural language processing have the following characteristics:
(1) Automatic extraction of key information greatly reduces the time and workload of processing text manually and improves processing efficiency.
(2) Natural language processing technology allows the text content to be understood more accurately, so that key information is extracted more accurately and the risk of human error is reduced.
(3) The method can be applied to text data of many types and fields and is not limited to a specific format or structure, which broadens its range of application and gives it strong generality and extensibility.
(4) For users who need to process large amounts of text, the method provides a more convenient and efficient service and improves user experience and satisfaction.
The intelligent text analysis method and system based on natural language processing in the embodiments of the invention have been described in detail above. The principles and embodiments of the present invention have been explained in this section with specific examples in order to facilitate understanding of the core concepts of the invention, and all other embodiments obtained by those skilled in the art without departing from the principles of the invention are intended to fall within the scope of the invention.