Detailed Description
It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
Because labeled corpora are scarce and manually labeling text is extremely labor-intensive, supervised learning is poorly suited to named entity recognition in professional text, such as medical text. The application therefore proposes an unsupervised entity recognition method for recognizing named entities in professional text.
Fig. 1 shows the flow of a method for professional encyclopedia named entity recognition according to an embodiment of the invention. The method may be performed by a device, and the device may be implemented in software and/or hardware. As shown in Fig. 1, the method for professional encyclopedia named entity recognition provided in this embodiment includes:
S110: representing the professional vocabulary entries in a standardized vocabulary as vectors by document embedding to form a seed word set;
S120: averaging the vectors of the entities in each entity category in the seed word set to obtain a vectorized representation of that entity category, which serves as the label vector of the entity category in the seed word set;
S130: determining the category to which a candidate professional entity belongs by cosine similarity comparison between the label vector of the candidate professional entity in the target document and the label vectors of the entity categories in the seed word set.
For convenience of description, each step of the above method is described in further detail below, taking the recognition of medical professional encyclopedia named entities as an example.
In an exemplary embodiment, step S110, in which the professional vocabulary entries in the standardized vocabulary are represented as vectors by document embedding, may further include:
S111: searching online for the professional vocabulary entries of the standardized vocabulary in a preset database (for example, an online search database such as the A+ Medical Encyclopedia or another encyclopedia);
S112: according to the search result, adding the professional vocabulary entry and the entity category returned by the result page to a seed entity list;
Specifically, as an example, the SNOMED international standard medical terminology may be adopted as the standardized vocabulary. The professional vocabulary entries in the standardized vocabulary are traversed, and each entry is searched online in an online database such as the A+ Medical Encyclopedia or another encyclopedia. If a result page is returned, the entry and its corresponding entity category are added to the seed entity list; the entity categories are those defined in the standardized vocabulary.
S113: for each entity in the seed entity list, extracting the descriptive text of a designated part from the preset database as the entity's embedded document; the designated part is generally the introduction of the corresponding encyclopedia entry in the preset database, and may also be a summary or a similar part that gives a general overview of the term.
S114: performing document embedding on the entity's embedded document to obtain a vectorized representation of each entity.
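Steps S111 to S113 can be sketched as follows. This is a minimal illustration rather than the implementation of the embodiment: the `SeedEntity` record, the `build_seed_entity_list` function, the `lookup` callable (standing in for the online database query) and the toy data are all hypothetical names and values introduced here.

```python
from typing import Callable, Dict, List, NamedTuple, Optional


class SeedEntity(NamedTuple):
    term: str
    category: str
    embedded_document: str


def build_seed_entity_list(
    vocabulary: Dict[str, str],
    lookup: Callable[[str], Optional[str]],
) -> List[SeedEntity]:
    """Collect a seed entity for every vocabulary term that has an entry.

    vocabulary maps each term to its entity category (hypothetical format).
    lookup returns the introduction text of the term's encyclopedia entry,
    or None when no result page is returned (hypothetical helper).
    """
    seeds = []
    for term, category in vocabulary.items():
        introduction = lookup(term)  # S111: search the term
        if introduction is None:     # no result page: skip the term
            continue
        # S112 and S113: record the term, its category and its description
        seeds.append(SeedEntity(term, category, introduction))
    return seeds


# Toy data standing in for the standardized vocabulary and the online database.
toy_vocabulary = {"myelofibrosis": "disease", "aspirin": "drug", "xyzzy": "disease"}
toy_entries = {
    "myelofibrosis": "A bone marrow disorder with fibrous scar tissue.",
    "aspirin": "A drug used to reduce pain and fever.",
}
seed_entities = build_seed_entity_list(toy_vocabulary, toy_entries.get)
```

In this sketch the term "xyzzy" has no entry in the toy database, so it is left out of the seed entity list.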
Document embedding may be performed on the basis of domain-specific word vectors trained on a medical corpus. As an example, the original medical text may be segmented using the Jieba word segmentation algorithm, and the vector representation of each word is then calculated on the medical corpus using the Word2Vec embedding algorithm. To train the Word2Vec model, the descriptive text of the encyclopedia entries for all terms in the standardized vocabulary is combined into a dedicated medical corpus, and the model is trained on this corpus to obtain a set of word embeddings specific to the medical field.
By design, a Word2Vec word vector captures only the contexts in which a word occurs, so using the word vector itself as the entity embedding incorporates only local semantic and grammatical information. It is desirable, however, for the embedding of each entity to also carry global semantic information and medical knowledge. For a disease, for example, the descriptive text covers the corresponding symptoms, the drugs used, the differential diagnosis and other information, and embedding the whole document integrates this information well. Thus, in this embodiment, the vectorized representation of the document describing each standard medical term is used as the embedding of the entity, so that the rich semantic information in the descriptive text is incorporated.
Assuming that each introductory text is a representative description of its corresponding entity, the vectorized representation of that text can be treated as the vectorized representation of the entity. Therefore, in this embodiment, document embedding is applied to the introduction of each entity, and the resulting document vector is regarded as the vectorized representation of the entity.
Specifically, as an example, assume that entity $E_i$ has descriptive text $T_i$, where $T_i = (w_1, w_2, \ldots, w_n)$, $w_j$ denotes the $j$-th word in the text, $1 \le j \le n$, and $w_j$ has the word vector $e_j$. Then the document embedding $e_{document,i}$ is the average of all word vectors in the document, as expressed by the following formula:
$$e_{document,i} = \frac{1}{n} \sum_{j=1}^{n} e_j$$
The vector calculated by this formula is treated as the vectorized representation of entity $E_i$.
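The averaging described above can be sketched in a few lines. This is an illustrative sketch only: the function name `embed_document` and the toy vectors are hypothetical, and skipping words that have no trained vector is an assumption made here, since the text does not say how such words are handled.

```python
from typing import Dict, List, Sequence


def embed_document(words: Sequence[str], word_vectors: Dict[str, List[float]]) -> List[float]:
    """Average the word vectors of a segmented text (mean pooling)."""
    # Assumption: words without a trained vector are skipped.
    vectors = [word_vectors[w] for w in words if w in word_vectors]
    if not vectors:
        raise ValueError("no word in the text has a word vector")
    n = len(vectors)
    dim = len(vectors[0])
    # Component-wise mean over the n word vectors.
    return [sum(v[k] for v in vectors) / n for k in range(dim)]


# Toy two-dimensional word vectors (illustrative values only).
toy_vectors = {"bone": [1.0, 0.0], "marrow": [0.0, 1.0], "fibrosis": [2.0, 2.0]}
doc_vector = embed_document(["bone", "marrow", "fibrosis"], toy_vectors)
```

With the toy vectors above, each component of `doc_vector` is the mean of the three corresponding word-vector components.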
After the seed word set is formed, the vectors of the entities within each entity category of the seed word set are averaged to obtain the vectorized representation of that entity category, which serves as its label vector in the seed word set.
Specifically, after the standard medical vocabulary has been traversed and the vectorized representations of all medical entities have been obtained (i.e., the seed word set has been formed), the vectors of all entities in each entity category are further averaged to obtain the vectorized representation of the category. As an example, assume that entity category $ET_i = (E_1, \ldots, E_m)$ contains $m$ entities, where $E_j$ denotes entity $j$, $1 \le j \le m$, and each $E_j$ has the vector $ent_j$. The vector $e_{category,i}$ of $ET_i$ is given by the following formula:
$$e_{category,i} = \frac{1}{m} \sum_{j=1}^{m} ent_j$$
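A minimal sketch of this per-category averaging is given below. The function name `category_label_vectors`, the input format (a list of category and vector pairs) and the toy values are assumptions introduced for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def category_label_vectors(
    entities: List[Tuple[str, List[float]]],
) -> Dict[str, List[float]]:
    """Average the entity vectors of each category into one label vector.

    entities is a list of (category, entity_vector) pairs (hypothetical format).
    """
    grouped = defaultdict(list)
    for category, vector in entities:
        grouped[category].append(vector)
    labels = {}
    for category, vectors in grouped.items():
        m = len(vectors)
        # zip(*vectors) iterates over the components of all m vectors in parallel.
        labels[category] = [sum(column) / m for column in zip(*vectors)]
    return labels


toy_entities = [
    ("disease", [1.0, 3.0]),
    ("disease", [3.0, 1.0]),
    ("drug", [0.0, 4.0]),
]
label_vectors = category_label_vectors(toy_entities)
```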
To determine the category of a candidate professional entity by cosine similarity comparison between its label vector and the label vectors of the entity categories in the seed word set, the label vectors of the candidate entity set in the target document must first be determined.
In an exemplary embodiment, the method for determining the label vectors of the candidate entity set in the target document includes:
S131: extracting all noun phrases in the target document as a candidate entity set by semantic dependency analysis;
S132: screening the candidate entity set for professional entities by searching in a preset search engine to obtain a candidate professional entity set;
S133: searching for the entities in the candidate professional entity set again in the preset search engine to determine the embedded documents of the candidate professional entity set;
S134: embedding the candidate professional entity set according to the embedded documents to obtain the label vectors of the candidate professional entity set.
Specifically, as an example, determining the label vectors of the candidate entity set in the target document first requires finding the medical entities in the target text. In this embodiment, the target text on which entity recognition is to be performed is segmented using the Jieba word segmentation algorithm, and the segmented text is then input into a semantic dependency analysis algorithm, which yields a part-of-speech tag for each word and the semantic dependency labels between words.
Since most medical entities are rooted in nouns, noun phrases cover most medical entities. Symptom entities usually arise from a patient describing discomfort in some body part, so they generally take the form of a noun described by an adjective, which constitutes a subject-predicate phrase. Based on these characteristics of the text, the noun phrases and subject-predicate phrases in the text can be used to screen out candidate entities. A noun phrase is rooted in a noun and connected to its other words by the attributive relation, whereas a subject-predicate phrase is rooted in an adjective and connected to its other words by the subject-predicate relation. The subject-predicate phrases and noun phrases of the target text are traversed and screened to form the candidate entity set.
Fig. 2 shows an example result of semantic dependency analysis, in which the part-of-speech tags are shown in the lower part and the semantic dependency relations in the upper part. A noun phrase is a sequence of words that is rooted in a noun, i.e., a word whose part-of-speech tag is "n", and connected by the attributive relation "ATT". In the example shown in Fig. 2, "myelofibrosis", "myeloproliferative diseases", "wuhan-shu-zhong-hospital" and so on are all noun phrases and are therefore candidate entities; whether these candidate entities are professional medical entities can then be determined by online search.
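The noun-phrase rule described above can be sketched as follows, assuming the parser output has already been converted into a list of tokens. The `Token` structure, its field names and the relation labels other than "ATT" in the toy sentence are hypothetical choices for illustration; a real parser's output format may differ.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Token:
    word: str
    pos: str   # part-of-speech tag, "n" for a noun
    head: int  # index of the head token, -1 for the sentence root
    rel: str   # label of the arc to the head, e.g. "ATT"


def noun_phrases(tokens: List[Token], sep: str = " ") -> List[str]:
    """Return the phrases rooted in a noun and connected by ATT arcs."""
    phrases = []
    for i, tok in enumerate(tokens):
        # A phrase root is a noun that is not itself an ATT modifier.
        if tok.pos != "n" or tok.rel == "ATT":
            continue
        members = {i}
        changed = True
        while changed:  # gather modifiers reachable through ATT arcs
            changed = False
            for j, other in enumerate(tokens):
                if j not in members and other.rel == "ATT" and other.head in members:
                    members.add(j)
                    changed = True
        phrases.append(sep.join(tokens[j].word for j in sorted(members)))
    return phrases


# Toy parse of "bone marrow fibrosis is disease" (labels are illustrative).
toy_tokens = [
    Token("bone marrow", "n", 1, "ATT"),
    Token("fibrosis", "n", 2, "SBV"),
    Token("is", "v", -1, "HED"),
    Token("disease", "n", 2, "VOB"),
]
candidates = noun_phrases(toy_tokens)
```

In the toy parse, "bone marrow" modifies "fibrosis" through an ATT arc, so the two are merged into one candidate, while "disease" forms a candidate on its own.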
The step of screening the candidate entity set for professional entities by searching in a preset search engine to obtain a candidate professional entity set further comprises:
searching for each entity of the candidate entity set in the preset search engine to obtain a first search result;
removing the advertisement items from the first search result, taking the search page introductions of a preset number of top-ranked results from the remaining items, and checking whether these introductions contain any of the preset professional keywords;
and if the introduction of any one of the preset number of results contains such a keyword, judging that the entity is a professional entity.
Specifically, as an example, each word in the candidate entity set is searched in a search engine such as Baidu or Google. After the advertisement items are removed, the search page introductions of the first ten remaining results are taken and checked for any of the keywords "medical science", "medicine", "disease" and "symptom". If any one of the ten results contains one of these keywords, the word is judged to be a medical entity and is retained; otherwise it is removed. The retained medical entities finally form the candidate medical entity set, i.e., the candidate professional entity set.
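The screening rule can be sketched as below. The search engine is replaced by a hypothetical `search` callable that returns result records, and the `SearchResult` structure, the `is_ad` flag and the case-insensitive matching are assumptions of this sketch, not details given in the text.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class SearchResult:
    snippet: str         # search page introduction of one result
    is_ad: bool = False  # True for advertisement items


MEDICAL_KEYWORDS = ("medical science", "medicine", "disease", "symptom")


def is_professional_entity(
    results: Sequence[SearchResult],
    keywords: Sequence[str] = MEDICAL_KEYWORDS,
    top_n: int = 10,
) -> bool:
    """True if any of the first top_n non-ad results mentions a keyword."""
    organic = [r for r in results if not r.is_ad]  # remove advertisement items
    return any(k in r.snippet.lower() for r in organic[:top_n] for k in keywords)


def screen_candidates(
    candidates: Sequence[str],
    search: Callable[[str], Sequence[SearchResult]],
) -> List[str]:
    """Keep only the candidates judged to be professional entities."""
    return [c for c in candidates if is_professional_entity(search(c))]


# Toy search results standing in for a real search engine.
toy_results = {
    "myelofibrosis": [
        SearchResult("Buy medicine online", is_ad=True),
        SearchResult("Myelofibrosis is a rare disease of the bone marrow."),
    ],
    "city park": [
        SearchResult("Cheap medicine here", is_ad=True),
        SearchResult("A public park in the city centre."),
    ],
}
kept = screen_candidates(["myelofibrosis", "city park"], lambda q: toy_results.get(q, []))
```

The advertisement for "city park" contains the keyword "medicine" but is removed before the check, so only "myelofibrosis" is retained.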
Document embedding is then performed on the candidate medical entities of the target document that are retained after screening. Since these words are extracted from free text and are not standard terms, encyclopedias may well have no entries for them. It is therefore necessary to collect a corpus for document embedding from the search engine result pages for each candidate entity. Searching for the entities in the candidate professional entity set again in the preset search engine to determine the embedded documents of the candidate professional entity set may include the following steps:
searching for the entities in the candidate professional entity set again in the preset search engine to obtain search pages;
crawling the short text descriptions of a preset number of search results of the search pages and splicing the short text descriptions to form a corpus;
and taking the corpus as the embedded document of the candidate professional entity set.
In an embodiment, each candidate entity is searched again in the search engine, and the short text description of each search result on each search page is crawled and spliced to form the corpus. Each search result contains a description of 80 to 90 words, and each search page contains 10 records. In this embodiment, the search results of the first 10 pages are taken; only the complete sentences in each description are kept, and the sentence fragments at the beginning and end are cut off before splicing. In this way, a description of around 6000 words is obtained for each candidate entity as the corpus for document embedding. Finally, the aforementioned pre-trained Word2Vec vectors may be used to perform document embedding for each candidate medical entity.
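The trimming and splicing of the short descriptions can be sketched as follows. The function names, the set of sentence terminators and the rule of always treating the text before the first terminator and after the last terminator as fragments are assumptions of this sketch; the text does not specify how fragments are detected.

```python
from typing import Iterable

# Assumed sentence terminators (Chinese and Western punctuation).
TERMINATORS = "。！？.!?"


def keep_whole_sentences(description: str) -> str:
    """Drop the leading and trailing fragments of one search description."""
    positions = [i for i, ch in enumerate(description) if ch in TERMINATORS]
    if len(positions) < 2:
        return ""  # no complete sentence lies between two terminators
    # Keep the text after the first terminator up to and including the last one.
    return description[positions[0] + 1 : positions[-1] + 1].strip()


def build_corpus(descriptions: Iterable[str], sep: str = " ") -> str:
    """Splice the trimmed descriptions into one embedded document."""
    parts = [keep_whole_sentences(d) for d in descriptions]
    return sep.join(p for p in parts if p)


toy_descriptions = [
    "of the marrow. It causes anemia. Patients often",
    "no terminator here",
]
corpus = build_corpus(toy_descriptions)
```

A rule this simple also splits on decimal points and abbreviations, so a production version would likely need a more careful sentence splitter.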
After the vectorized representation of each entity in the target document is obtained, it can be compared with the entity category label vectors derived from the standardized vocabulary and classified. Cosine similarity is used here because the objects being compared are two vectors. Specifically, the cosine similarity between the vector of the entity and each entity category label vector is calculated according to the following formula, and the category with the highest value is taken as the type of the target entity:
$$\cos(e_{candidate}, e_{category,i}) = \frac{e_{candidate} \cdot e_{category,i}}{\lVert e_{candidate} \rVert \, \lVert e_{category,i} \rVert}$$
Then, for an entity $E_j$ with the vector $e_{candidate}$, for example, its entity class label is given by:
$$\mathrm{label}(E_j) = \arg\max_{i} \cos(e_{candidate}, e_{category,i})$$
where $e_{candidate}$ and $e_{category,i}$ are the vectors of the candidate entity and of entity category $i$, respectively. All target texts are traversed in turn according to the above steps to complete the entity recognition task.
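The classification step can be sketched as follows. The function names and toy vectors are hypothetical, and returning a similarity of 0.0 for a zero vector is an assumption made here to avoid division by zero.

```python
import math
from typing import Dict, List


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # assumption: a zero vector is similar to nothing
    return dot / (norm_a * norm_b)


def classify(candidate: List[float], label_vectors: Dict[str, List[float]]) -> str:
    """Return the category whose label vector is most similar to the candidate."""
    return max(label_vectors, key=lambda c: cosine_similarity(candidate, label_vectors[c]))


toy_labels = {"disease": [1.0, 0.0], "drug": [0.0, 1.0]}
predicted = classify([0.9, 0.1], toy_labels)
```

Because cosine similarity depends only on direction, a candidate vector built from a long corpus and a label vector built from short introductions can still be compared directly.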
In an exemplary embodiment, the present invention may also be implemented by the following steps:
First, on the basis of a standardized vocabulary, the professional vocabulary entries in the vocabulary are represented as vectors by document embedding, thereby forming a seed word set.
Second, the vectors of the entities of each entity type in the seed word set are averaged to obtain the embedding of that entity type.
Then, all noun phrases in the target document are extracted by semantic dependency analysis, whether each candidate noun phrase is a medical entity is judged by searching in Baidu and Google, and each noun phrase is embedded using the descriptions in the first hundred search results.
Finally, the resulting vectors are compared with the vector of each entity category by cosine similarity, and the closest category is taken as the category to which the noun phrase belongs, thereby completing the entity recognition task.
As can be seen from the above description, in the professional encyclopedia named entity recognition method provided by the present invention, the candidate professional entities in a target document are determined by semantic dependency analysis and professional entity screening; the embedded documents of these professional entities are determined using a search engine, so that document embedding can be performed on them to obtain their vectorized representations; and entity recognition is finally performed on the basis of the vectorized representations of the professional entities in the standardized vocabulary and those of the professional entities in the target document.
Corresponding to the professional encyclopedia named entity recognition method, the invention also provides a professional encyclopedia named entity recognition system. Fig. 3 illustrates the functional modules of a professional encyclopedia named entity recognition system according to an embodiment of the present invention.
As shown in Fig. 3, the professional encyclopedia named entity recognition system 300 provided by the present invention can be installed in an electronic device. According to the functions implemented, the professional encyclopedia named entity recognition system 300 can include a seed word set acquisition unit 310, a seed word set vectorization unit 320 and a target entity recognition unit 330. A unit of the invention, which may also be referred to as a module, is a series of computer program segments that are stored in a memory of the electronic device, can be executed by a processor of the electronic device, and perform a fixed function.
In the present embodiment, the functions of the respective modules/units are as follows:
The seed word set acquisition unit 310 is configured to represent the professional vocabulary entries in the standardized vocabulary as vectors by document embedding, so as to form a seed word set.
Specifically, as an example, the seed word set acquisition unit 310 may further include:
a professional vocabulary searching subunit 311, configured to search online for the professional vocabulary entries of the standardized vocabulary in a preset database (e.g., an online search database such as the A+ Medical Encyclopedia);
a seed entity list subunit 312, configured to add, according to the search result, the professional vocabulary entry and the entity category returned by the result page to the seed entity list;
Specifically, as an example, the SNOMED international standard medical terminology may be adopted as the standard medical vocabulary. The standard medical vocabulary is traversed, and each term is searched online in an online database such as the A+ Medical Encyclopedia or another encyclopedia. If a result page is returned, the term and its entity category are added to the seed entity list.
a first embedded document extracting subunit 313, configured to extract, for each entity in the seed entity list, the descriptive text of a designated part from the preset database as the entity's embedded document;
a first vectorization subunit 314, configured to perform document embedding on the entity's embedded document to obtain a vectorized representation of each entity.
Specifically, as an example, the first vectorization subunit 314 may perform document embedding on the basis of domain-specific word vectors trained on a medical corpus. For example, the original medical text may be segmented using a word segmentation algorithm, and the vector representation of each word is then calculated on the medical corpus using the Word2Vec embedding algorithm. To train the Word2Vec model, the descriptive text of the encyclopedia entries for all terms in the standardized vocabulary is combined into a dedicated medical corpus, and the model is trained on this corpus to obtain a set of word embeddings specific to the medical field.
By design, a Word2Vec word vector captures only the contexts in which a word occurs, so using the word vector itself as the entity embedding incorporates only local semantic and grammatical information. It is desirable, however, for the embedding of each entity to also carry global semantic information and medical knowledge. For a disease, for example, the descriptive text covers the corresponding symptoms, the drugs used, the differential diagnosis and other information, and embedding the whole document integrates this information well. Thus, in this embodiment, the vectorized representation of the document describing each standard medical term is used as the embedding of the entity, so that the rich semantic information in the descriptive text is incorporated.
Assuming that each introductory text is a representative description of its corresponding entity, the vectorized representation of that text can be treated as the vectorized representation of the entity. Therefore, in this embodiment, document embedding is applied to the introduction of each entity, and the resulting document vector is regarded as the vectorized representation of the entity.
The seed word set vectorization unit 320 is configured to average the vectors of the entities in each entity category in the seed word set to obtain a vectorized representation of that entity category, which serves as the label vector of the entity category in the seed word set.
The target entity recognition unit 330 is configured to determine the category to which a candidate professional entity belongs by cosine similarity comparison between the label vector of the candidate professional entity in the target document and the label vectors of the entity categories in the seed word set.
Specifically, as an example, the target entity recognition unit 330 may further include:
a candidate entity set determining subunit 331, configured to extract all noun phrases in the target document as a candidate entity set by semantic dependency analysis;
a candidate professional entity set determining subunit 332, configured to screen the candidate entity set for professional entities by searching in a preset search engine to obtain a candidate professional entity set;
a second embedded document extracting subunit 333, configured to search again, in the preset search engine, for the entities in the candidate professional entity set determined by the candidate professional entity set determining subunit 332, so as to determine the embedded documents of the candidate professional entity set;
and a second vectorization subunit 334, configured to embed the candidate professional entity set according to the embedded documents to obtain the label vectors of the candidate professional entity set.
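The way the three units pass data to one another can be sketched as below. This is only one possible wiring: the `RecognitionSystem` class, its field names and the data formats exchanged between the units are hypothetical, and the toy callables stand in for the real units.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Vector = List[float]
SeedSet = Dict[str, List[Vector]]  # category -> vectors of its seed entities


@dataclass
class RecognitionSystem:
    """Hypothetical wiring of units 310, 320 and 330 as plain callables."""

    acquire_seed_set: Callable[[], SeedSet]                          # unit 310
    vectorize_categories: Callable[[SeedSet], Dict[str, Vector]]     # unit 320
    recognize: Callable[[str, Dict[str, Vector]], Dict[str, str]]    # unit 330

    def run(self, document: str) -> Dict[str, str]:
        seeds = self.acquire_seed_set()            # form the seed word set
        labels = self.vectorize_categories(seeds)  # one label vector per category
        return self.recognize(document, labels)    # entity -> category


# Toy callables (illustrative only) showing how the units are plugged in.
system = RecognitionSystem(
    acquire_seed_set=lambda: {"disease": [[1.0, 0.0]], "drug": [[0.0, 1.0]]},
    vectorize_categories=lambda seeds: {c: vs[0] for c, vs in seeds.items()},
    recognize=lambda doc, labels: {doc: max(labels, key=lambda c: labels[c][0])},
)
result = system.run("myelofibrosis")
```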
Since the professional encyclopedia named entity recognition system provided by this embodiment corresponds to the professional encyclopedia named entity recognition method, its specific implementation is similar, and specific examples of the system are not described in further detail here.
As can be seen from the above embodiments, the professional encyclopedia named entity recognition system provided by the present invention determines the candidate professional entities in a target document by semantic dependency analysis and professional entity screening, determines the embedded documents of these professional entities using a search engine, performs document embedding on them to obtain their vectorized representations, and performs entity recognition on the basis of the vectorized representations of the professional entities in the standardized vocabulary and those of the professional entities in the target document. The system thereby overcomes the drawbacks of existing supervised entity recognition methods, namely the scarcity of labeled professional corpora and the extremely high labor cost of manual text labeling, and effectively improves the efficiency of information extraction and entity recognition for encyclopedia text.
Fig. 4 is a schematic structural diagram of an electronic device implementing the method for identifying a professional encyclopedia named entity according to the present invention.
The electronic device 1 may comprise a processor 10, a memory 11 and a bus, and may further comprise a computer program, such as a professional encyclopedia named entity recognition program 12, stored in the memory 11 and executable on the processor 10.
The memory 11 includes at least one type of non-volatile readable storage medium, and the readable storage medium includes a flash memory, a removable hard disk, a multimedia card, a card type memory (e.g., SD or DX memory, etc.), a magnetic memory, a magnetic disk, an optical disk, and the like. The memory 11 may in some embodiments be an internal storage unit of the electronic device 1, such as a removable hard disk of the electronic device 1. The memory 11 may also be an external storage device of the electronic device 1 in other embodiments, such as a plug-in mobile hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and the like, which are provided on the electronic device 1. Further, the memory 11 may also include both an internal storage unit and an external storage device of the electronic device 1. The memory 11 may be used not only for storing application software installed in the electronic device 1 and various types of data, such as codes of a professional encyclopedia named entity recognition program, etc., but also for temporarily storing data that has been output or is to be output.
In some embodiments, the processor 10 may be composed of integrated circuits, for example a single packaged integrated circuit, or a plurality of packaged integrated circuits with the same or different functions, including one or more Central Processing Units (CPUs), microprocessors, digital processing chips, graphics processors, and combinations of various control chips. The processor 10 is the control unit of the electronic device; it connects the components of the electronic device through various interfaces and lines, and executes the functions of the electronic device 1 and processes its data by running or executing the programs or modules (e.g., the professional encyclopedia named entity recognition program) stored in the memory 11 and calling the data stored in the memory 11.
The bus may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, etc. The bus is arranged to enable connection communication between the memory 11 and at least one processor 10 or the like.
Fig. 4 shows only an electronic device with certain components. Those skilled in the art will understand that the structure shown in Fig. 4 does not limit the electronic device 1, which may comprise fewer or more components than shown, a combination of some components, or a different arrangement of components.
For example, although not shown, the electronic device 1 may further include a power supply (such as a battery) for supplying power to each component, and preferably, the power supply may be logically connected to the at least one processor 10 through a power management device, so that functions such as charge management, discharge management, and power consumption management are implemented through the power management device. The power supply may also include any component of one or more dc or ac power sources, recharging devices, power failure detection circuitry, power converters or inverters, power status indicators, and the like. The electronic device 1 may further include various sensors, a bluetooth module, a Wi-Fi module, and the like, which are not described herein again.
Further, the electronic device 1 may include a network interface. Optionally, the network interface may include a wired interface and/or a wireless interface (such as a Wi-Fi interface or a Bluetooth interface), which is generally used for establishing a communication connection between the electronic device 1 and other electronic devices.
Optionally, the electronic device 1 may further comprise a user interface, which may be a display or an input unit (such as a keyboard), and may optionally also be a standard wired interface or a wireless interface. In some embodiments, the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode) touch device, or the like. The display, which may also be referred to as a display screen or display unit, is used for displaying information processed in the electronic device 1 and for displaying a visualized user interface.
It is to be understood that the described embodiments are for purposes of illustration only and that the scope of the appended claims is not limited to such structures.
The memory 11 in the electronic device 1 is a computer-readable storage medium in which at least one instruction is stored, and the at least one instruction is executed by a processor in the electronic device to implement the professional encyclopedia named entity recognition method described above. Specifically, as an example, the professional encyclopedia named entity recognition program 12 stored in the memory 11 is a combination of a plurality of instructions which, when run by the processor 10, can implement the following steps:
S110: representing the professional vocabulary entries in a standardized vocabulary as vectors by document embedding to form a seed word set;
S120: averaging the vectors of the entities in each entity category in the seed word set to obtain a vectorized representation of that entity category, which serves as the label vector of the entity category in the seed word set;
S130: determining the category to which a candidate professional entity belongs by cosine similarity comparison between the label vector of the candidate professional entity in the target document and the label vectors of the entity categories in the seed word set.
Further, the integrated modules/units of the electronic device 1, if implemented in the form of software functional units and sold or used as separate products, may be stored in a computer readable storage medium. The computer-readable medium may include: any entity or device capable of carrying said computer program code, recording medium, U-disk, removable hard disk, magnetic disk, optical disk, computer Memory, Read-Only Memory (ROM).
In the embodiments provided in the present invention, it should be understood that the disclosed apparatus, device and method can be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the modules is only one logical functional division, and other divisions may be realized in practice.
The modules described as separate parts may or may not be physically separate, and parts displayed as modules may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment.
In addition, functional modules in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, or in a form of hardware plus a software functional module.
It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential attributes thereof.
The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference signs in the claims shall not be construed as limiting the claim concerned.
Furthermore, it is obvious that the word "comprising" does not exclude other elements or steps, and the singular does not exclude the plural. A plurality of units or means recited in the system claims may also be implemented by one unit or means in software or hardware. The terms first, second, etc. are used to denote names and do not indicate any particular order.
Finally, it should be noted that the above embodiments are only for illustrating the technical solutions of the present invention and not for limiting, and although the present invention is described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that modifications or equivalent substitutions may be made on the technical solutions of the present invention without departing from the spirit and scope of the technical solutions of the present invention.