
CN111695345B - Method and device for identifying entity in text - Google Patents

Method and device for identifying entity in text

Info

Publication number
CN111695345B
CN111695345B
Authority
CN
China
Prior art keywords
text
word
entity
vector
dictionary
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010533173.5A
Other languages
Chinese (zh)
Other versions
CN111695345A
Inventor
Wang Ming (王明)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN202010533173.5A
Publication of CN111695345A
Application granted
Publication of CN111695345B
Legal status: Active
Anticipated expiration

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

The invention provides a method and a device for identifying an entity in text, an electronic device and a computer-readable storage medium; the method comprises the following steps: performing feature extraction processing on the text to obtain a character feature vector corresponding to each character in the text; determining a dictionary vector corresponding to each character in the text according to an entity dictionary corresponding to the text; performing word segmentation processing on the text to obtain the word corresponding to each character in the text, and determining the word vector corresponding to each character; performing splicing processing on the character feature vector, the dictionary vector and the word vector corresponding to each character to obtain a spliced vector corresponding to each character; and determining the label corresponding to each character according to the spliced vector corresponding to each character, and determining the entity in the text and the type of the entity according to the label corresponding to each character. The invention can improve the efficiency and accuracy of entity identification.

Description

Method and device for identifying entity in text
Technical Field
The present invention relates to natural language processing technology in the field of artificial intelligence, and in particular, to a method and apparatus for identifying an entity in a text, an electronic device, and a computer readable storage medium.
Background
Artificial Intelligence (AI) is the theory, method, technique and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge, and use that knowledge to obtain optimal results. Natural Language Processing (NLP) is an important direction in artificial intelligence; it mainly studies various theories and methods for realizing effective communication between humans and computers in natural language.
Entity recognition is a branch of natural language processing that identifies entities with specific meaning in text, such as song names, person names, place names, and the like. In the schemes provided by the related art, features of the text to be identified are usually constructed manually, a machine learning model then labels the text with label types based on those features, and entity identification is finally realized according to the labeled types; the manual feature construction makes entity identification inefficient.
Disclosure of Invention
The embodiment of the invention provides a method, a device, electronic equipment and a computer readable storage medium for identifying entities in a text, which can improve the efficiency and accuracy of entity identification.
The technical scheme of the embodiment of the invention is realized as follows:
the embodiment of the invention provides a method for identifying entities in a text, which comprises the following steps:
extracting characteristics of the text to obtain character characteristic vectors corresponding to each character in the text;
determining a dictionary vector corresponding to each character in the text according to the entity dictionary corresponding to the text;
word segmentation is carried out on the text to obtain words corresponding to each word in the text, and word vectors corresponding to each word are determined;
performing splicing processing on the character feature vector, the dictionary vector and the word vector corresponding to each character to obtain a spliced vector corresponding to each character;
determining the label corresponding to each character according to the spliced vector corresponding to each character, and
determining the entity in the text and the type of the entity according to the label corresponding to each character.
The embodiment of the invention provides a device for identifying entities in texts, which comprises the following steps:
the feature extraction module is used for performing feature extraction processing on the text to obtain a character feature vector corresponding to each character in the text;
the dictionary module is used for determining a dictionary vector corresponding to each character in the text according to the entity dictionary corresponding to the text;
the word segmentation module is used for performing word segmentation processing on the text to obtain the word corresponding to each character in the text, and determining the word vector corresponding to each character;
the splicing module is used for carrying out splicing processing on the character feature vector, the dictionary vector and the word vector corresponding to each character so as to obtain a spliced vector corresponding to each character;
and the identification module is used for determining the label corresponding to each character according to the spliced vector corresponding to each character, and determining the entity in the text and the type of the entity according to the label corresponding to each character.
In the above scheme, the feature extraction module is further configured to query a mapping dictionary for the numeric identification corresponding to each character in the text, and to convert the numeric identification corresponding to each character into vector form to obtain the character feature vector corresponding to that character.
In the above scheme, the dictionary module is further configured to determine the type to which the text belongs, determine the entity dictionary corresponding to that type, and query the entity dictionary for the dictionary vector corresponding to each character in the text.
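The dictionary lookup described above can be sketched as follows; the text-type classifier is omitted, and the dictionaries, entity types and entries are hypothetical stand-ins rather than anything from the patent:

```python
# hypothetical entity dictionaries, keyed by text type
ENTITY_DICTS = {
    "music": {"singer": ["xxx"], "song": ["ice rain"]},
    "video": {"movie": ["titanic"]},
}

def dictionary_vector(ch, entity_dict):
    # one slot per entity type: 1.0 if the character occurs in an entry of that type
    return [1.0 if any(ch in entry for entry in entity_dict[t]) else 0.0
            for t in sorted(entity_dict)]

def dictionary_vectors(text, text_type):
    entity_dict = ENTITY_DICTS[text_type]  # dictionary chosen by the text's type
    return [dictionary_vector(ch, entity_dict) for ch in text]

vecs = dictionary_vectors("ice", "music")  # slots ordered: singer, song
```

Here every character of "ice" appears in the song entry "ice rain", so each character's dictionary vector activates only the song slot.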
In the above scheme, the word segmentation module is further configured to invoke a word vector model to perform the following operations on the text: intercepting from the text a plurality of words whose length in characters equals a preset value; encoding each intercepted word to obtain a plurality of code sequences in one-to-one correspondence with the words; and mapping the code sequence corresponding to each word to the word vector of the corresponding word.
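A minimal sketch of the word-vector operations just listed, with a toy character encoding and a deterministic stand-in for the trained word vector model (the preset length, the encoding and the code-to-vector mapping are all illustrative assumptions):

```python
def intercept(text, n=2):
    # cut every word of the preset character length n from the text
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def encode(word):
    # one integer code per character, in one-to-one correspondence
    return [ord(ch) for ch in word]

def to_vector(codes, dim=3):
    # deterministic toy mapping from a code sequence to a word vector
    return [sum(codes) % (d + 2) for d in range(dim)]

words = intercept("abcd", n=2)                  # ['ab', 'bc', 'cd']
vectors = [to_vector(encode(w)) for w in words]
```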
In the above scheme, the splicing module is further configured to determine the character feature vector, dictionary vector and word vector corresponding to the same character, concatenate the dimensions contained in the character feature vector, the dictionary vector and the word vector, and fill each concatenated dimension with the scalar of the corresponding original dimension, to obtain the spliced vector corresponding to the character.
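The splicing operation amounts to laying the three vectors end to end, with each output dimension filled by the scalar of the corresponding input dimension; a minimal sketch (the dimensions are illustrative):

```python
def splice(char_vec, dict_vec, word_vec):
    spliced = []
    for vec in (char_vec, dict_vec, word_vec):
        spliced.extend(vec)  # copy each dimension's scalar into the output
    return spliced

v = splice([0.1, 0.2], [1.0], [0.5, 0.5, 0.5])  # 2 + 1 + 3 = 6 dimensions
```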
In the above scheme, the identification module is further configured to map the spliced vector corresponding to each character to probabilities of belonging to different candidate labels, where a candidate label indicates the type of the entity to which the character belongs and the position of the character within that entity, or indicates that the character is an unrelated character; and to determine the candidate label with the maximum probability as the label corresponding to the character.
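A toy version of this probability-based labeling, assuming a softmax over hypothetical candidate-label scores (the label set and the scores are illustrative, not taken from the patent):

```python
import math

# hypothetical candidate labels: entity type + position, or 'O' for unrelated
LABELS = ["B-song", "I-song", "B-singer", "I-singer", "O"]

def softmax(scores):
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def predict_label(scores):
    # map scores to probabilities, then take the maximum-probability label
    probs = softmax(scores)
    return LABELS[probs.index(max(probs))]

label = predict_label([0.2, 0.1, 0.0, 0.0, 2.0])
```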
In the above scheme, the identification module is further configured to map the spliced vector corresponding to each character to a plurality of different candidate labels and determine a transfer score corresponding to each candidate label, where a candidate label indicates the type of the entity to which the character belongs and the position of the character within that entity, or indicates that the character is an unrelated character; and to determine the label corresponding to each character according to the plurality of candidate labels corresponding to each character and the corresponding transfer scores.
In the above scheme, the identification module is further configured to select, multiple times and following the order in which the characters appear in the text, a candidate label from the plurality of candidate labels corresponding to each character, and to combine the labels selected each time into a plurality of different candidate label sequences, where the candidate labels contained in each sequence belong to different characters and their number equals the number of characters in the text; to accumulate the transfer scores of the candidate labels in each sequence to obtain an overall transfer score; and to determine the candidate labels contained in the sequence with the maximum overall transfer score as the labels of the corresponding characters.
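The sequence selection described above can be illustrated by brute force, enumerating every candidate label sequence and accumulating its scores; real implementations typically use Viterbi decoding instead, and the labels and scores below are made up for the example:

```python
from itertools import product

LABELS = ["B-song", "I-song", "O"]  # hypothetical candidate labels

def best_sequence(emission, transition):
    """emission[i][l]: score of label l for the i-th character;
    transition[a][b]: transfer score from label a to label b."""
    best, best_score = None, float("-inf")
    for seq in product(range(len(LABELS)), repeat=len(emission)):
        score = sum(emission[i][l] for i, l in enumerate(seq))
        score += sum(transition[a][b] for a, b in zip(seq, seq[1:]))
        if score > best_score:
            best, best_score = seq, score
    return [LABELS[l] for l in best]

emission = [[2.0, 0.0, 0.1],   # first character favors B-song
            [0.0, 2.0, 0.1]]   # second character favors I-song
transition = [[0.0, 1.0, -1.0],
              [0.0, 0.0, 0.0],
              [0.0, 0.0, 0.0]]
seq = best_sequence(emission, transition)
```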
In the above scheme, the identification module is further configured to identify, as the same entity, characters whose positions in the text are continuous and whose labels indicate the same entity type, and to identify the entity type indicated by those labels as the type of that entity.
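Assuming BIO-style labels, contiguous characters whose labels indicate the same entity type can be merged into one entity as follows (the characters and labels are illustrative):

```python
def extract_entities(chars, labels):
    # merge contiguous characters whose labels indicate the same entity type
    entities, current, current_type = [], "", None
    for ch, label in zip(chars, labels):
        if label.startswith("B-"):
            if current:
                entities.append((current, current_type))
            current, current_type = ch, label[2:]
        elif label.startswith("I-") and label[2:] == current_type:
            current += ch
        else:  # unrelated character or entity-type break
            if current:
                entities.append((current, current_type))
            current, current_type = "", None
    if current:
        entities.append((current, current_type))
    return entities

ents = extract_entities("abXYZ", ["O", "O", "B-song", "I-song", "I-song"])
```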
An embodiment of the present invention provides an electronic device, including:
a memory for storing executable instructions;
and the processor is used for realizing the entity identification method in the text provided by the embodiment of the invention when executing the executable instructions stored in the memory.
The embodiment of the invention provides a computer readable storage medium which stores executable instructions for causing a processor to execute the method for identifying entities in text.
The embodiment of the invention has the following beneficial effects:
the method has the advantages that the characteristics in the text are automatically extracted, the manual construction of the characteristics is not needed, the workload of characteristic engineering is simplified, and the entity identification efficiency is improved; and splicing the character feature vector, the dictionary vector and the word vector corresponding to each word in the text, and labeling the spliced vectors, so that errors in identifying the entities and the entity types in the text are reduced, and the accuracy of entity identification is improved.
Drawings
FIG. 1 is a schematic diagram of an architecture of an entity-in-text recognition system 100 according to an embodiment of the present invention;
fig. 2 is a schematic structural diagram of an electronic device 500 according to an embodiment of the present invention;
FIG. 3 is a flowchart illustrating a method for identifying entities in text according to an embodiment of the present invention;
FIG. 4 is a flowchart illustrating a method for identifying entities in text according to an embodiment of the present invention;
FIG. 5 is a flowchart illustrating a method for identifying entities in text according to an embodiment of the present invention;
FIG. 6 is a schematic diagram of an example model input provided by an embodiment of the present invention;
FIG. 7 is a schematic diagram of a model architecture provided by an embodiment of the present invention;
fig. 8 is a schematic view of an application scenario provided in an embodiment of the present invention;
fig. 9 is a schematic diagram of a model structure according to an embodiment of the present invention.
Detailed Description
To make the objects, technical solutions and advantages of the present invention clearer, the present invention is described in further detail below with reference to the accompanying drawings. The described embodiments should not be construed as limiting the present invention; all other embodiments obtained by those skilled in the art without inventive effort fall within the scope of the present invention.
In the following description, reference is made to "some embodiments" which describe a subset of all possible embodiments, but it is to be understood that "some embodiments" can be the same subset or different subsets of all possible embodiments and can be combined with one another without conflict.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. The terminology used herein is for the purpose of describing embodiments of the invention only and is not intended to be limiting of the invention.
Before describing the embodiments of the present invention in further detail, the terms involved in the embodiments are explained; the following definitions apply to these terms.
1) Natural language processing: an important direction in the fields of computer science and artificial intelligence. It studies various theories and methods that enable effective communication between humans and computers in natural language. Natural language processing is a science that integrates linguistics, computer science, and mathematics; research in this field involves natural language, i.e., the language people use daily, so it is closely related to linguistics. Natural language processing techniques typically include text processing, semantic understanding, machine translation, robot question answering, knowledge graph techniques, and the like.
2) BIO labeling system: one way to label an element in text (or text sentence) is to label the element as "B-X", "I-X", or "O", where "B" in "B-X" indicates that the entity position of the element is the first position, "I" in "I-X" indicates that the entity position of the element is the non-first position, "X" in "B-X" and "I-X" indicates that the entity type of the element is the X type, and "O" indicates that the element is not of any type, i.e., an unrelated element. Wherein the element may be a word in a text sentence.
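As an illustration of this scheme (the query and label set are hypothetical), labeling the elements of a short query in which "XXXX" is a singer-type entity and "ice rain" a two-element song-type entity:

```python
# query elements and their BIO labels: 'B-X' marks the first element of an
# X-type entity, 'I-X' a non-first element, and 'O' an unrelated element
tokens = ["XXXX", "sings", "ice", "rain"]
labels = ["B-singer", "O", "B-song", "I-song"]

pairs = list(zip(tokens, labels))
```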
3) Short text Query (Query): refers to a request sentence input by a user that contains the user's intent, for example: "play the song Ice Rain by XXX"; "tell me the story of the foolish old man who moved the mountains"; "I want to watch a movie", etc.
4) Entity (Entity), or named entity: the basic unit of the knowledge graph, and the important language unit that carries information in text, for example: names of people, places, institutions, products, etc.
In a task dialog system, entities express important information in the user's input Query. For example, when the Query is "play the song Ice Rain by XXX", the Query itself indicates the user's intent to hear a song, and the entities are "XXX" of singer type and "Ice Rain" of song type.
5) Media asset entity: an entity of the media-information class, such as a song entity in music skills, a movie, television-series or cartoon entity in video skills, or an album entity in FM-radio skills. There is some similarity between such entities, and their contents may also intersect.
6) Entity identification (NER, named Entity Recognition): refers to identifying entities in text.
7) Entity dictionary: when a new skill intent is designed, a collection of entity instances is usually provided as a reference for the entity set related to the new skill, in order to convey the boundaries and rules of that entity set; this collection of reference entity instances is the entity dictionary.
8) Speech recognition, or Automatic Speech Recognition (ASR): aims to convert the lexical content of human speech into computer-readable input, such as key presses, binary codes or character sequences. It differs from speaker recognition and speaker verification, which attempt to identify or verify the speaker producing the speech rather than the lexical content it contains.
9) Training samples, or training data: data sets that have been preprocessed and carry relatively reliable and accurate feature descriptions, and that participate as samples in the training process of the entity recognition model.
10) Artificial intelligence cloud service (AIaaS, AI as a Service): currently the mainstream service mode of artificial intelligence platforms. Specifically, an AIaaS platform splits several common AI services and provides independent or packaged services in the cloud. This service mode is similar to an AI theme mall: all developers can access one or more of the platform's artificial intelligence services through an Application Programming Interface (API), and some senior developers can also deploy, operate and maintain their own proprietary cloud artificial intelligence services by using the AI framework and AI infrastructure provided by the platform.
According to the embodiment of the invention, training of the entity identification model can be realized by calling the artificial intelligent cloud service.
The embodiment of the invention provides a method, a device, electronic equipment and a computer readable storage medium for identifying entities in a text, which can effectively improve the efficiency and accuracy of entity identification. The following describes an exemplary application of the method for identifying an entity in a text provided by the embodiment of the present invention, where the method for identifying an entity in a text provided by the embodiment of the present invention may be implemented by various electronic devices, for example, may be implemented by a terminal, may be implemented by a server or a server cluster, or may be implemented by a terminal and a server in cooperation.
In the following, the embodiments of the present invention are described by taking an implementation in which a terminal and a server cooperate as an example. Referring to fig. 1, fig. 1 is a schematic architecture diagram of the entity-in-text identification system 100 provided by the embodiments of the present invention. The text entity recognition system 100 includes the server 200, the network 300, the terminal 400, and the client 410 running on the terminal 400, which are described separately below.
Server 200 is a background server of client 410 and is configured to receive text sent by client 410; it is also used for identifying the entities and the types of the entities in the text (the identification process will be described in detail later), retrieving corresponding resources (e.g., answer sentences, songs, movies, etc.) from a database or the network according to the identified entities and types, and sending the resources to the client 410.
The network 300 is used as a medium for communication between the server 200 and the terminal 400, and may be a wide area network or a local area network, or a combination of both.
The terminal 400 is used for running a client 410, and the client 410 is various Applications (APP) having an entity recognition function, for example, a voice assistant APP, a music APP, or a video APP. The client 410 is configured to send text to the server 200, and obtain resources of the corresponding text sent by the server 200, and present the text to a user (e.g., present an answer sentence, play a song, play a video, etc.).
It should be noted that the client 410 may determine the entities and the types of the entities in the text not only by invoking the entity identification service of the server 200, but also by invoking the entity identification service of the terminal 400.
As an example, the client 410 invokes the entity recognition service of the terminal 400 to recognize entities and types of entities in text; corresponding requests are sent to the server 200 according to the identified entities and types of entities, so that the server 200 obtains corresponding resources (e.g., answer sentences, songs, movies, etc.) from a database or network according to the requests, and receives the resources returned by the server 200 to display (e.g., present answer sentences, play songs, play videos, etc.) to the user.
In some embodiments, the server 200 may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server that provides cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDNs, and basic cloud computing services such as big data and artificial intelligence platforms. The terminal 400 may be, but is not limited to, a smart phone, a tablet computer, a notebook computer, a desktop computer, a smart speaker, a smart watch, etc. The terminal and the server may be directly or indirectly connected through wired or wireless communication, which is not limited in the embodiment of the present invention.
The embodiment of the invention can be applied to various scenes, such as a response scene, a video playing scene or a music playing scene.
Taking the answer scenario as an example, the client 410 may be a voice assistant APP. The client 410 invokes the microphone of the terminal 400 to collect the voice query sentence of the user and performs voice recognition on the voice query sentence to obtain a corresponding text; client 410 sends the text to server 200; the server 200 identifies the text, obtains the corresponding entity and entity type, queries in the knowledge graph, obtains the answer sentence, and sends the answer sentence to the client 410; the client 410 reports the answer sentence in the form of a voice report.
Alternatively, the client 410 invokes the microphone of the terminal 400 to collect the voice query sentence of the user and performs voice recognition on the voice query sentence to obtain the corresponding text; the client 410 invokes the entity recognition service of the terminal 400 to recognize the text, obtain the corresponding entity and entity type, and inquire in the knowledge graph to obtain the answer sentence, and broadcast the answer sentence in the form of voice broadcast.
Taking a music playing scenario as an example, the client 410 may be a music APP. The client 410 invokes the microphone of the terminal 400 to collect the user's voice operation instruction, for example, "play Lu Bing Hua", and performs voice recognition on the voice operation instruction to obtain the corresponding text; the client 410 sends the text to the server 200; the server 200 identifies the text to obtain the corresponding entity, i.e., "Lu Bing Hua", and the type of the entity, i.e., the song-name type; the server 200 searches the database or the network for the song "Lu Bing Hua", obtains the corresponding song resource, and sends the song resource to the client 410; the client 410 plays the song "Lu Bing Hua".
Alternatively, the client 410 invokes the microphone of the terminal 400 to collect the user's voice operation instruction, for example, "play Lu Bing Hua", and performs voice recognition on the voice operation instruction to obtain the corresponding text; the client 410 invokes the entity recognition service of the terminal 400 to recognize the text and obtain the corresponding entity, i.e., "Lu Bing Hua", and the type of the entity, i.e., the song-name type; the client 410 sends a song acquisition request to the server 200, so that the server 200 searches the database or the network for the song "Lu Bing Hua", obtains the corresponding song resource, and returns the song resource to the client 410; the client 410 plays the song "Lu Bing Hua".
Next, a structure of an electronic device for entity identification, which may be the server 200 or the terminal 400 shown in fig. 1, is described. The following describes a structure of the electronic device by taking the server 200 shown in fig. 1 as an example, referring to fig. 2, fig. 2 is a schematic structural diagram of an electronic device 500 provided in an embodiment of the present invention, and the electronic device 500 shown in fig. 2 includes: at least one processor 510, memory 540, and at least one network interface 520. The various components in the electronic device 500 are coupled together by a bus system 530. It is understood that bus system 530 is used to enable connected communication between these components. The bus system 530 includes a power bus, a control bus, and a status signal bus in addition to the data bus. But for clarity of illustration the various buses are labeled as bus system 530 in fig. 2.
The processor 510 may be an integrated circuit chip with signal processing capabilities, such as a general-purpose processor (e.g., a microprocessor or any conventional processor), a Digital Signal Processor (DSP), another programmable logic device, discrete gate or transistor logic, or discrete hardware components.
Memory 540 includes volatile memory or nonvolatile memory, and may include both volatile and nonvolatile memory. The nonvolatile memory may be a Read-Only Memory (ROM), and the volatile memory may be a Random Access Memory (RAM). The memory 540 described in the embodiments of the present invention is intended to comprise any suitable type of memory. Memory 540 optionally includes one or more storage devices physically remote from processor 510.
In some embodiments, memory 540 is capable of storing data to support various operations, examples of which include programs, modules and data structures, or subsets or supersets thereof, as exemplified below.
An operating system 541 including system programs, such as a framework layer, a core library layer, a driver layer, etc., for handling various basic system services and performing hardware-related tasks, for implementing various basic services and handling hardware-based tasks;
network communication module 542 is used to reach other computing devices via one or more (wired or wireless) network interfaces 520, exemplary network interfaces 520 include: bluetooth, wireless compatibility authentication (WiFi), and universal serial bus (USB, universal Serial Bus), etc.;
In some embodiments, the entity recognition device in text provided in the embodiments of the present invention may be implemented in software, and fig. 2 shows the entity recognition device in text 543 stored in the memory 540, which may be software in the form of a program and a plug-in, and includes the following software modules: the feature extraction module 5431, the dictionary module 5432, the word segmentation module 5433, the concatenation module 5434, and the recognition module 5435. These modules may be logical functional modules, and thus may be arbitrarily combined or further split depending on the functionality implemented. The functions of the respective modules will be described hereinafter.
In other embodiments, the entity-in-text recognition device 543 provided by the embodiments of the present invention may be implemented by a combination of hardware and software. By way of example, the device may be a processor in the form of a hardware decoding processor programmed to perform the entity-in-text recognition method provided by the embodiments of the present invention; for example, the processor in the form of a hardware decoding processor may employ one or more Application-Specific Integrated Circuits (ASICs), DSPs, Programmable Logic Devices (PLDs), Complex Programmable Logic Devices (CPLDs), Field-Programmable Gate Arrays (FPGAs), or other electronic components.
In the following, a method for identifying entities in text provided by the embodiment of the present invention implemented by the server 200 in fig. 1 is taken as an example. Referring to fig. 3, fig. 3 is a flowchart of a method for identifying an entity in text according to an embodiment of the present invention, and will be described with reference to the steps shown in fig. 3.
In step S101, feature extraction processing is performed on the text to obtain a word feature vector corresponding to each word in the text.
In some embodiments, the numeric identification (ID) corresponding to each character in the text is queried in a mapping dictionary, and the numeric identification corresponding to each character is converted into vector form to obtain the character feature vector corresponding to that character.
As an example, the server 200 includes a mapping dictionary that maps characters to numeric IDs and can convert each Chinese character in the text to the corresponding numeric ID; the numeric ID corresponding to each character in the text is queried in the mapping dictionary through a feature extraction network, and the queried numeric ID is converted into vector form to obtain the character feature vector corresponding to each character in the text. In the embodiment of the invention, the character feature vector corresponding to each character in the text is accurately extracted according to the mapping dictionary, which improves the accuracy of subsequent entity recognition.
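A toy sketch of this step, assuming a small hypothetical mapping dictionary and a random embedding table standing in for the trained feature extraction network:

```python
import numpy as np

np.random.seed(42)

# hypothetical mapping dictionary: character -> numeric ID
MAPPING_DICT = {"<unk>": 0, "来": 1, "一": 2, "首": 3}
# stand-in embedding table: one 4-dimensional row per numeric ID
EMBEDDINGS = np.random.rand(len(MAPPING_DICT), 4)

def char_feature_vectors(text):
    # query the numeric ID of each character, then convert the IDs to vectors
    ids = [MAPPING_DICT.get(ch, MAPPING_DICT["<unk>"]) for ch in text]
    return EMBEDDINGS[ids]

vecs = char_feature_vectors("来一首")  # shape: (3 characters, 4 dimensions)
```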
Here, the above-described feature extraction network can extract not only the word feature vector of each word in the text, but also a feature characterizing the type of the text (or a feature characterizing the user's intention), and the feature characterizing the type of the text can be classified to obtain a classification result (the classification process will be described in detail in step S102). For example, when the text is "play a song Ice Rain", the feature characterizing the type of the text is extracted through the feature extraction network, and classifying this feature yields the classification result that the type of the text is the music class. In this way, the entity dictionary corresponding to the type of the text can be selected according to the extracted feature, which helps determine the dictionary vector corresponding to each word in the text in step S102.
Taking as an example the text "play a song Forgetting Water", consisting of the six characters "come", "one", "first", "forget", "emotion" and "water" (a character-by-character gloss of a request to play the song "Forgetting Water"), feature extraction processing is performed on the text to obtain a word feature vector corresponding to each of "come", "one", "first", "forget", "emotion" and "water".
According to the embodiment of the invention, the word feature vector corresponding to each word can be extracted from the text automatically, without manually constructing features, realizing automated feature engineering; compared with the related art, automated feature engineering can greatly reduce the workload.
In step S102, a dictionary vector corresponding to each word in the text is determined according to the entity dictionary corresponding to the text.
Here, the entity dictionary may be an unambiguous named entity dictionary, i.e., each named entity included in the entity dictionary has only a unique meaning. The entity instances contained in the entity dictionary support user customization.
In some embodiments, the type to which the text belongs is determined, where the type to which the text belongs includes at least one of the following: a music class; a video class; a radio station class; a place name class; a person class. The entity dictionary corresponding to the type to which the text belongs is then determined, and the dictionary vector corresponding to each word in the text is queried in the entity dictionary.
As an example, the specific procedure for determining the type to which the text belongs is as follows: a feature characterizing the type of the text is extracted through the feature extraction network; the feature is mapped into probabilities of belonging to the types of different texts respectively; and the text type corresponding to the maximum probability is determined as the type to which the text belongs. In this way, the entity dictionary corresponding to the type to which the text belongs can be obtained accurately, so that the dictionary vector corresponding to each word in the text can be extracted accurately, which improves the accuracy of the subsequent classification of the extracted features.
For example, a feature characterizing the type of the text is extracted through the feature extraction network; the feature is received through the input layer of the classification network and propagated to the hidden layer of the classification network; the feature is mapped through the activation function of the hidden layer, and the vector obtained by the mapping continues to propagate forward in the hidden layer; the vector propagated by the hidden layer is received through the output layer of the classification network and mapped, through the activation function of the output layer, into the confidences of belonging to the types of different texts; and the type corresponding to the maximum confidence is determined as the type to which the text belongs.
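A minimal sketch of such a classification network is shown below. The layer sizes, the tanh activation, the random weights, and the type names are illustrative assumptions; the embodiment does not prescribe a particular architecture.

```python
import numpy as np

TEXT_TYPES = ["music", "video", "radio", "place_name", "person"]

def softmax(z):
    """Map raw output-layer scores to confidences that sum to 1."""
    e = np.exp(z - z.max())
    return e / e.sum()

def classify_text_type(feature, hidden_w, hidden_b, out_w, out_b):
    """One hidden layer (tanh) followed by a softmax output layer."""
    h = np.tanh(feature @ hidden_w + hidden_b)   # hidden-layer mapping
    probs = softmax(h @ out_w + out_b)           # confidence per text type
    return TEXT_TYPES[int(np.argmax(probs))], probs

# Hypothetical feature vector and weights, for illustration only.
rng = np.random.default_rng(1)
feature = rng.normal(size=16)
text_type, probs = classify_text_type(
    feature,
    rng.normal(size=(16, 8)), np.zeros(8),
    rng.normal(size=(8, 5)), np.zeros(5))
```

The type corresponding to the maximum confidence is then used to select the entity dictionary in step S102.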
In some embodiments, the server 200 stores a plurality of entity dictionaries of different types, for example, an entity dictionary of the music class, an entity dictionary of the video class, an entity dictionary of the radio station class, an entity dictionary of the place name class, and an entity dictionary of the person class. The entity dictionary corresponding to the text is selected from the plurality of entity dictionaries of different types; the dictionary feature corresponding to each word in the text is queried in the selected entity dictionary; and the dictionary features obtained by the query are converted into vector form to obtain the dictionary vector corresponding to each word in the text. By selecting an entity dictionary consistent with the type of the text, the embodiment of the invention can accurately extract the dictionary vector corresponding to each word in the text, which improves the accuracy of subsequent entity recognition; and because user-defined entity dictionaries are supported, the accuracy of recognizing cold (rarely seen) entities is further improved.
In step S103, word segmentation processing is performed on the text to obtain words corresponding to each word in the text, and a word vector corresponding to each word is determined.
In some embodiments, a word vector model is invoked to perform the following operations on the text: intercepting, from the text, a plurality of words each having a preset number of characters; encoding each intercepted word to obtain a plurality of code sequences in one-to-one correspondence with the words; and mapping the code sequence corresponding to each word into a word vector of the corresponding word.
Here, the word vector model may be any language model capable of converting words into corresponding word vectors, for example, a Word2vec model, a GloVe model, a BERT (Bidirectional Encoder Representations from Transformers) model, and the like. The preset value may be a user-defined value, or a value determined according to the number of characters of the text, where the number of characters of the text is proportional to the size of the preset value.
Taking the text "play a song Forgetting Water" as an example, with the preset value being 2, the characters in the text are "come", "one", "first", "forget", "emotion" and "water", and the word corresponding to each character is intercepted from the text: the word corresponding to "come" is "come one", the word corresponding to "one" is "one first", the word corresponding to "first" is "first forget", the word corresponding to "forget" is "forget emotion", the word corresponding to "emotion" is "emotion water", and the word corresponding to "water" is "water #" (here, because "water" is the last character in the text, it is grouped into a word with the wildcard "#"). Each intercepted word is encoded, for example by One-Hot encoding, to obtain the code sequences corresponding to "come one", "one first", "first forget", "forget emotion", "emotion water" and "water #" respectively; the code sequences are then mapped into the corresponding word vectors, so that the word vector corresponding to each character in the text (i.e., "come", "one", "first", "forget", "emotion" and "water") is obtained.
It should be noted that the word intercepted for each character need not consist of the character and the next adjacent character; any word of the preset length that contains the character may be intercepted as the word corresponding to that character. For example, the word corresponding to "forget" may be "first forget" or "forget emotion". This improves the diversity of the intercepted words, increases the training samples of the model, and helps avoid overfitting of the model.
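The sliding-window interception with wildcard padding described above can be sketched as follows. The single-letter placeholders are an illustrative assumption standing in for the six characters of the example text; the function itself works on any character sequence.

```python
def intercept_words(chars, n=2, pad="#"):
    """For each character, intercept the window of n characters starting at
    it; the tail is padded with a wildcard so the last character still
    forms a word of length n."""
    padded = chars + [pad] * (n - 1)
    return ["".join(padded[i:i + n]) for i in range(len(chars))]

# 'a'..'f' stand in for the six characters of "play a song Forgetting Water".
words = intercept_words(list("abcdef"))
```

With a preset value of 2 this yields one bigram per character, the last one padded with "#", mirroring the "water #" case in the example.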
In step S104, the word feature vector, the dictionary vector, and the word vector corresponding to each word are concatenated to obtain a concatenated vector corresponding to each word.
In some embodiments, the word feature vector, dictionary vector, and word vector corresponding to the same word are determined, and the three vectors are concatenated end to end to obtain the concatenated vector corresponding to that word; the dimension of the concatenated vector is the sum of the dimensions of the word feature vector, the dictionary vector, and the word vector.
As an example, the word feature vector, dictionary vector, and word vector corresponding to the same word are determined, together with the dimensions of each vector and the scalar in each dimension; the dimensions contained in the word feature vector, the dictionary vector, and the word vector are stacked end to end, and each stacked dimension is filled with the scalar of the corresponding input dimension, to obtain the concatenated vector corresponding to the word. Classifying such concatenated vectors allows the words in the text to be labeled by integrating information from multiple sources, which improves both the accuracy and the efficiency of entity recognition.
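A minimal sketch of the end-to-end concatenation; the vector dimensions below are arbitrary examples, not values prescribed by the embodiment.

```python
import numpy as np

def splice(char_vec, dict_vec, word_vec):
    """Concatenate the three vectors end to end; the result's dimension is
    the sum of the three input dimensions, and each input's scalars fill
    the corresponding slice of the result."""
    return np.concatenate([char_vec, dict_vec, word_vec])

char_vec = np.ones(8)            # e.g. an 8-dimensional word feature vector
dict_vec = np.array([0.0, 1.0])  # e.g. a 2-dimensional dictionary vector
word_vec = np.full(4, 2.0)       # e.g. a 4-dimensional word vector
spliced = splice(char_vec, dict_vec, word_vec)
```

The concatenated vector keeps each source's scalars in contiguous slices, so downstream layers can still attend to the character, dictionary, and word-level evidence separately.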
In step S105, the label corresponding to each word is determined according to the concatenated vector corresponding to each word.
Here, the label is used to indicate the type of the entity to which the word belongs and the position of the word in that entity, or to indicate that the word is an irrelevant character. The type of the entity includes at least one of: a music class; a video class; a radio station class; a place name class; a person class. For example, the label "B-song" indicates that the position of the word in the entity to which it belongs is the first character and that the type of the entity is the music class; the label "I-per" indicates that the position of the word in the entity to which it belongs is a non-first character and that the type of the entity is the person class; the label "O" indicates that the word does not belong to any type or any entity, i.e., it is an irrelevant character.
In some embodiments, referring to fig. 4, fig. 4 is a schematic flow chart of a method for identifying entities in text provided in an embodiment of the present invention, and step S105 shown in fig. 3 may be further implemented by steps S1051 to S1052.
In step S1051, the concatenated vector corresponding to each word is mapped into probabilities of belonging to different candidate labels respectively.
Here, a candidate label is used to indicate the type of the entity to which the word belongs and the position of the word in that entity, or to indicate that the word is an irrelevant character.
In some embodiments, the concatenated vector corresponding to each word is received through the input layer of the classification network and propagated to the hidden layer of the classification network; the concatenated vector corresponding to each word is mapped through the activation function of the hidden layer, and the vector obtained by the mapping continues to propagate forward in the hidden layer; the vector propagated by the hidden layer is received through the output layer of the classification network and mapped, through the activation function of the output layer, into the confidences of belonging to different candidate labels.
In step S1052, the candidate label corresponding to the maximum probability is determined as the label corresponding to the word.
In some embodiments, the candidate label corresponding to the greatest confidence is determined as the label corresponding to the word.
As an example, when the concatenated vector corresponding to "forget" in the text "play a song Forgetting Water" is mapped to the candidate label "B-song" with probability 0.5, to the candidate label "I-movie" with probability 0.3, and to the candidate label "O" with probability 0.2, the candidate label "B-song" corresponding to the maximum probability is determined as the label corresponding to "forget". In this way, the label corresponding to each word in the text can be determined: the label corresponding to "come" is "O", the label corresponding to "one" is "O", the label corresponding to "first" is "O", the label corresponding to "forget" is "B-song", the label corresponding to "emotion" is "I-song", and the label corresponding to "water" is "I-song". The entities in the text and the types of the entities can then be determined according to the label corresponding to each word.
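The maximum-probability selection in steps S1051-S1052 reduces to an argmax over candidate labels. The probabilities below repeat the assumed values from the example above.

```python
def label_for(probabilities):
    """Determine the candidate label with the maximum probability."""
    return max(probabilities, key=probabilities.get)

# Assumed mapping probabilities for the word "forget".
label = label_for({"B-song": 0.5, "I-movie": 0.3, "O": 0.2})
```

Each word is labeled independently here, which is exactly the limitation the sequence-level method of steps S1053-S1054 addresses.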
According to the embodiment of the invention, each word in the text is labeled by a classifier; the classification process is simple and fast, which improves the efficiency of entity recognition. However, because the classifier labels each word independently and does not consider the labeling of the text as a whole, invalid label sequences can occur: for example, two words belonging to the same entity may both be labeled as the first character of that entity (e.g., the labels corresponding to both "forget" and "emotion" being "B-song"), which produces errors in entity recognition. In this regard, another labeling method is provided in the embodiments of the present invention and will be described in detail below.
In other embodiments, referring to fig. 5, fig. 5 is a schematic flow chart of a method for identifying entities in text provided in an embodiment of the present invention, and step S105 shown in fig. 3 may be further implemented by steps S1053 to S1054.
In step S1053, the concatenated vector corresponding to each word is mapped to a plurality of different candidate labels respectively, and the transfer score corresponding to each candidate label is determined.
Here, the transfer score is used to characterize the degree of matching between the word and the candidate label it is mapped to; that is, the greater the transfer score, the greater the probability that the word belongs to that candidate label.
Taking the text "play a song Forgetting Water" as an example, "forget" is mapped to the candidate labels "B-song", "O", and "I-song"; the transfer score for mapping "forget" to "B-song" is 0.5, to "I-song" is 0.3, and to "O" is 0.2, indicating that the probability of "forget" belonging to the label "B-song" is the largest and the probability of it belonging to the label "O" is the smallest. In this way, the transfer scores for mapping "come", "one", "first", "forget", "emotion", and "water" to the candidate labels "B-song", "O", and "I-song" can be determined one by one.
Here, the candidate tags are not limited to "B-song", "O", and "I-song", and may include types of any entity, for example, "B-movie", "B-per", and "I-fm", etc., and the embodiments of the present invention are not limited herein.
In step S1054, the label corresponding to each word is determined according to the plurality of candidate labels corresponding to each word and the corresponding transfer scores.
In some embodiments, candidate labels are selected multiple times from the plurality of candidate labels corresponding to each word according to the order in which the words appear in the text, and the candidate labels selected each time are combined to obtain a plurality of different candidate label sequences, where the candidate labels contained in each candidate label sequence belong to different words and the number of candidate labels contained is the same as the number of words in the text. The transfer scores corresponding to the candidate labels in each candidate label sequence are accumulated to obtain an overall transfer score, and the candidate labels contained in the candidate label sequence with the maximum overall transfer score are determined as the labels corresponding to the respective words.
It should be noted that the larger the overall transfer score of a candidate label sequence, the higher the degree of matching between the candidate label sequence and the text, and the higher the degree of association between the candidate label corresponding to each word and the candidate labels corresponding to its adjacent words.
Taking the text "play a song Forgetting Water" and the candidate labels "B-song", "O", and "I-song" as an example, each of the words "come", "one", "first", "forget", "emotion", and "water" corresponds to three candidate labels. Selecting candidate labels for the 6 different words and combining the selections yields 3^6 = 729 different candidate label sequences. The overall transfer scores of the 729 candidate label sequences are calculated respectively, and the candidate label sequence with the maximum overall transfer score, namely {"O", "O", "O", "B-song", "I-song", "I-song"}, is selected; the candidate labels it contains are determined as the labels corresponding to the respective words, i.e., the label corresponding to "come" is "O", the label corresponding to "one" is "O", the label corresponding to "first" is "O", the label corresponding to "forget" is "B-song", the label corresponding to "emotion" is "I-song", and the label corresponding to "water" is "I-song".
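The exhaustive enumeration described above can be sketched as follows. The per-character transfer scores are assumed values for illustration; a production system would typically use a dynamic-programming (Viterbi-style) search rather than enumerating all 729 sequences.

```python
from itertools import product

CANDIDATES = ["B-song", "I-song", "O"]

def best_tag_sequence(transfer_scores):
    """transfer_scores[i][tag] is the score for mapping character i to tag.
    Enumerate all len(CANDIDATES) ** n candidate label sequences
    (3 ** 6 = 729 for a six-character text) and keep the one whose
    accumulated overall transfer score is largest."""
    best, best_score = None, float("-inf")
    for seq in product(CANDIDATES, repeat=len(transfer_scores)):
        score = sum(s[t] for s, t in zip(transfer_scores, seq))
        if score > best_score:
            best, best_score = list(seq), score
    return best, best_score

# Assumed scores for "come", "one", "first", "forget", "emotion", "water".
scores = [
    {"B-song": 0.1, "I-song": 0.1, "O": 0.8},  # come
    {"B-song": 0.1, "I-song": 0.1, "O": 0.8},  # one
    {"B-song": 0.2, "I-song": 0.1, "O": 0.7},  # first
    {"B-song": 0.5, "I-song": 0.3, "O": 0.2},  # forget
    {"B-song": 0.2, "I-song": 0.6, "O": 0.2},  # emotion
    {"B-song": 0.1, "I-song": 0.7, "O": 0.2},  # water
]
seq, total = best_tag_sequence(scores)
```

With these assumed scores the winning sequence is {"O", "O", "O", "B-song", "I-song", "I-song"}, matching the example in the text.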
When labeling each word in the text, the embodiment of the invention considers not only the probability that each word belongs to the corresponding label but also the overall optimum for the whole text, which avoids the problem of label bias. Specifically, because the degree of association between the label of each word and the labels of its adjacent words is considered, the situation in which a label indicating the first character of an entity appears after a label indicating a non-first character of the same entity can be avoided; and because the degree of association between each label and the word itself is also considered, the situation in which two words in the same entity are both labeled as its first character can also be avoided.
In step S106, the entities in the text and the types of the entities are determined according to the label corresponding to each word.
Here, the type of the entity includes at least one of: a music class; a video class; a radio station class; a place name class; a person class. Each of the above types is a major category, and each major category may include minor categories; for example, the music class includes singers, songs, albums, etc., and the video class includes movies, television series, actors, cartoons, etc. Step S106 can identify not only the major category to which an entity belongs but also the minor category.
In some embodiments, words whose positions in the text are consecutive and whose corresponding labels indicate the same entity type are identified as the same entity, and the entity type indicated by the labels is identified as the type of that entity.
As an example, a word whose label indicates a position at the beginning of the entity to which it belongs is determined as the first character of an entity in the text; the words belonging to the same entity type as the first character are then traversed, and when the label of a traversed word indicates a position in the middle of the entity, the traversed word is determined as a non-first character of the entity. The first character and the non-first characters are determined together as an entity, and the entity type jointly indicated by their labels is determined as the type of the entity.
Taking the text "play a song Forgetting Water" as an example, with the label corresponding to "come" being "O", the label corresponding to "one" being "O", the label corresponding to "first" being "O", the label corresponding to "forget" being "B-song", the label corresponding to "emotion" being "I-song", and the label corresponding to "water" being "I-song", it can be determined that the entity in the text is "Forgetting Water" and that the type of the entity is the song class within the music major category.
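A sketch of turning the per-character labels back into entities, as step S106 describes. The single-letter placeholders are an illustrative assumption: 'd', 'e', 'f' stand in for the three characters of "Forgetting Water".

```python
def decode_entities(chars, labels):
    """Merge consecutive characters whose labels indicate the same entity
    type (a B- label followed by matching I- labels) into one entity."""
    entities, current, current_type = [], [], None
    for ch, lab in zip(chars, labels):
        if lab.startswith("B-"):
            if current:  # close any entity already in progress
                entities.append(("".join(current), current_type))
            current, current_type = [ch], lab[2:]
        elif lab.startswith("I-") and current and lab[2:] == current_type:
            current.append(ch)
        else:  # "O" or an I- label that does not continue the open entity
            if current:
                entities.append(("".join(current), current_type))
            current, current_type = [], None
    if current:
        entities.append(("".join(current), current_type))
    return entities

chars = list("abcdef")  # placeholders for the six characters of the example
labels = ["O", "O", "O", "B-song", "I-song", "I-song"]
entities = decode_entities(chars, labels)
```

For the example label sequence this yields a single entity spanning the last three characters, tagged with the "song" type.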
According to the embodiment of the invention, the feature vector corresponding to each word in the text is extracted automatically, without manually constructing features, which improves the efficiency of entity recognition. Moreover, the word feature vector, dictionary vector, and word vector corresponding to each word in the text are concatenated and the concatenated vectors are labeled, which reduces errors in identifying the entities and entity types in the text and improves the accuracy of recognition.
Continuing with the description of the architecture of electronic device 500 in conjunction with FIG. 2, in some embodiments, as shown in FIG. 2, the software modules stored in entity recognition device 543 in the text of memory 540 may include: the feature extraction module 5431, the dictionary module 5432, the word segmentation module 5433, the concatenation module 5434, and the recognition module 5435.
The feature extraction module 5431 is configured to perform feature extraction processing on a text to obtain a word feature vector corresponding to each word in the text;
the dictionary module 5432 is configured to determine a dictionary vector corresponding to each word in the text according to the entity dictionary corresponding to the text;
the word segmentation module 5433 is configured to perform word segmentation processing on the text to obtain words corresponding to each word in the text, and determine word vectors corresponding to each word;
the concatenation module 5434 is configured to concatenate the word feature vector, the dictionary vector, and the word vector corresponding to each word to obtain a concatenated vector corresponding to each word;
the recognition module 5435 is configured to determine a label corresponding to each word according to the concatenated vector corresponding to each word, and determine the entities in the text and the types of the entities according to the label corresponding to each word.
In some embodiments, the feature extraction module 5431 is further configured to query a mapping dictionary for the numeric identifier corresponding to each word in the text, and convert the numeric identifier corresponding to each word into vector form to obtain the word feature vector corresponding to that word.
In some embodiments, the dictionary module 5432 is further configured to determine a type to which the text belongs; determining an entity dictionary corresponding to the type to which the text belongs; and querying dictionary vectors corresponding to each word in the text in the entity dictionary.
In some embodiments, the word segmentation module 5433 is further configured to invoke a word vector model to perform the following operations on the text: intercept, from the text, a plurality of words each having a preset number of characters; encode each intercepted word to obtain a plurality of code sequences in one-to-one correspondence with the words; and map the code sequence corresponding to each word into a word vector of the corresponding word.
In some embodiments, the concatenation module 5434 is further configured to determine the word feature vector, dictionary vector, and word vector corresponding to the same word, stack the dimensions of the word feature vector, the dictionary vector, and the word vector end to end, and fill each stacked dimension with the scalar of the corresponding input dimension to obtain the concatenated vector corresponding to the word.
In some embodiments, the identifying module 5435 is further configured to map the concatenated vector corresponding to each word into probabilities of belonging to different candidate labels respectively, where a candidate label is used to indicate the type of the entity to which the word belongs and the position of the word in that entity, or to indicate that the word is an irrelevant character; and determine the candidate label corresponding to the maximum probability as the label corresponding to the word.
In some embodiments, the identifying module 5435 is further configured to map the concatenated vector corresponding to each word to a plurality of different candidate labels respectively and determine the transfer score corresponding to each candidate label, where a candidate label is used to indicate the type of the entity to which the word belongs and the position of the word in that entity, or to indicate that the word is an irrelevant character; and determine the label corresponding to each word according to the plurality of candidate labels corresponding to each word and the corresponding transfer scores.
In some embodiments, the identifying module 5435 is further configured to select candidate labels multiple times from the plurality of candidate labels corresponding to each word according to the order in which the words appear in the text, and combine the candidate labels selected each time to obtain a plurality of different candidate label sequences, where the candidate labels contained in each selected candidate label sequence belong to different words and the number of candidate labels contained is the same as the number of words in the text; accumulate the transfer scores corresponding to the candidate labels in each candidate label sequence to obtain an overall transfer score; and determine the candidate labels contained in the candidate label sequence with the maximum overall transfer score as the labels corresponding to the respective words.
In some embodiments, the identifying module 5435 is further configured to identify words whose positions in the text are consecutive and whose corresponding labels indicate the same entity type as the same entity, and identify the entity type indicated by the labels as the type of that entity.
Embodiments of the present invention provide a computer-readable storage medium storing executable instructions which, when executed by a processor, cause the processor to perform the method for identifying entities in text provided by the embodiments of the present invention, for example, the method for identifying entities in text shown in fig. 3, fig. 4, or fig. 5.
In some embodiments, the computer readable storage medium may be FRAM, ROM, PROM, EPROM, EEPROM, flash memory, magnetic surface memory, optical disk, or CD-ROM; but may be a variety of devices including one or any combination of the above memories.
In some embodiments, the executable instructions may be in the form of programs, software modules, scripts, or code, written in any form of programming language (including compiled or interpreted languages, or declarative or procedural languages), and they may be deployed in any form, including as stand-alone programs or as modules, components, subroutines, or other units suitable for use in a computing environment.
As an example, the executable instructions may, but need not, correspond to files in a file system, may be stored as part of a file that holds other programs or data, e.g., in one or more scripts in a hypertext markup language document, in a single file dedicated to the program in question, or in multiple coordinated files (e.g., files that store one or more modules, sub-programs, or portions of code).
As an example, executable instructions may be deployed to be executed on one computing device or on multiple computing devices located at one site or, alternatively, distributed across multiple sites and interconnected by a communication network.
In the following, an exemplary application in an actual application scenario will be described.
For entity extraction (i.e., the entity recognition described above), the related art generally uses a conditional random field (CRF, Conditional Random Field) model. In the related art, the training corpus (i.e., training data) is organized into the model input example shown in fig. 6 and input into a CRF model for training to obtain a trained CRF model, and entity recognition is finally performed based on the trained CRF model.
Fig. 6 is a schematic diagram of a model input example provided by an embodiment of the present invention, where in fig. 6 the first column is a character feature, the second column is a two-character (Bi-gram) feature (or binary feature), the third column is a part-of-speech feature, the fourth column is an entity information feature, and the fifth column is the prediction label. It should be noted that the training data carries the prediction label of the fifth column, while the prediction data (or test data) does not, because the purpose of the prediction data is precisely to predict the label of the fifth column. Here, the labels may follow the BIO tagging scheme.
Referring to fig. 7, fig. 7 is a schematic diagram of a model architecture according to an embodiment of the present invention.
In FIG. 7, the model provided by the embodiment of the present invention includes a bidirectional Long Short-Term Memory (BiLSTM) module, which mainly uses a deep-learning bidirectional LSTM for feature extraction. Because feature capture is a strength of the front-end layers of a deep learning network, the embodiment of the present invention reduces the workload of manual feature engineering compared with the related art. However, BiLSTM has inherent limitations: for example, when the Query is very long, attention to distant words is weakened; and because of the sequential nature of the sequence, the parameters must be trained serially, so the training time of the model is long.
From the above, the related art requires the user to manually construct the feature engineering, and deciding which features to use is based on conclusions drawn from many experiments. The related art may use single-character features, two-character features, part-of-speech features, and the like; in the process of model development and tuning, feature engineering is very time-consuming and places a certain threshold requirement on the person tuning the model. The embodiment of the present invention in fig. 7 introduces BiLSTM as a feature extraction module, which solves the technical problems in the related art to a certain extent; however, due to the characteristics of LSTM itself, such as its limited feature capture capability and serial training, the improvement in recognition effect, although present, is not obvious.
To address these problems, an embodiment of the present invention further provides a method for identifying entities in text, which improves the effect of entity recognition while simplifying the workload of feature engineering.
Referring to Fig. 8, Fig. 8 is a schematic diagram of an application scenario provided by an embodiment of the present invention. In Fig. 8, when designing a skill intention, the set of entities involved in that intention (i.e., the entity dictionary described above) may be defined and imported as required; entities also support alias configuration to accommodate the diversity of entity expressions. Fig. 8 shows the definition and samples of a cartoon (sys.video.carton) entity type belonging to the Video domain (i.e., the video class described above).
Referring to fig. 9, fig. 9 is a schematic diagram of a model structure provided in an embodiment of the present invention, and will be described with reference to fig. 9.
(1) Feature extraction layer
In some embodiments, the inputs tok1, tok2 … tokn of the feature extraction layer are the numeric IDs corresponding to the Query (i.e., the text described above). The feature extraction layer holds a character-to-ID mapping dictionary that covers all characters in the training set and converts each character of the Query into its corresponding numeric ID.
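The character-to-ID mapping described above can be sketched as follows; the special symbols and the unknown-character convention are illustrative assumptions rather than details given in the patent:

```python
# Sketch of the character-to-ID mapping dictionary built from the training set.
# The special tokens ([PAD], [UNK], [CLS]) are assumptions for illustration.
def build_vocab(training_texts, specials=("[PAD]", "[UNK]", "[CLS]")):
    vocab = {tok: i for i, tok in enumerate(specials)}
    for text in training_texts:
        for ch in text:
            vocab.setdefault(ch, len(vocab))  # assign the next free ID
    return vocab

def encode(query, vocab):
    # Prepend the sentence-level Cls symbol, then map each character to its ID;
    # characters outside the training set fall back to the unknown token.
    return [vocab["[CLS]"]] + [vocab.get(ch, vocab["[UNK]"]) for ch in query]

vocab = build_vocab(["abc", "bcd"])
ids = encode("abz", vocab)  # 'z' is unseen, so it maps to [UNK]
```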
Here, the feature extraction layer can also construct a Cls symbol representing the whole sentence; its output vector is a low-dimensional representation of the whole Query (i.e., the Embedding information, which corresponds to the feature characterizing the type of text described above) and is generally used for sentence classification.
In some embodiments, the outputs of the feature extraction layer are the word vector for each character of the Query (i.e., the character feature vector described above) and the Cls part. In the entity-recognition scenario, only the per-character word vectors of the Query are needed; in the sentence-classification scenario, the output of the Cls part can be used directly.
Here, both the per-character word vector and the Cls output are 768-dimensional.
(2) Intermediate feature layer
In some embodiments, according to the custom entity dictionary, a 40-dimensional custom dictionary feature for each character (corresponding to the fourth-column feature of Fig. 6, also called the entity dictionary feature, which corresponds to the dictionary vector described above) is spliced onto the 768-dimensional word vector output by the feature extraction layer. The intermediate feature layer thus completes the splicing of the feature extraction layer's output with the custom dictionary vector.
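One hedged way to realize such a per-character dictionary feature is a dictionary-match indicator over entity types; the 40-slot layout and the type-to-slot assignment below are assumptions for illustration, not the patent's concrete encoding:

```python
import numpy as np

# Illustrative sketch of a 40-dimensional per-character dictionary feature:
# each dimension is assumed to flag one custom entity type, set for every
# character covered by a dictionary match. This encoding is an assumption.
def dict_features(query, entity_dict, dim=40):
    """entity_dict maps an entity surface string to a type-slot index in [0, dim)."""
    feats = np.zeros((len(query), dim), dtype=np.float32)
    for start in range(len(query)):
        for end in range(start + 1, len(query) + 1):
            slot = entity_dict.get(query[start:end])
            if slot is not None:
                feats[start:end, slot] = 1.0  # flag every character of the match
    return feats

f = dict_features("play frozen", {"frozen": 3})  # "frozen" occupies chars 5..10
```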
In some embodiments, a Bi-gram word vector (or Bi-gram feature) is obtained for each character of the Query based on the Word2vec algorithm; the Bi-gram word vector is 200-dimensional. The intermediate feature layer can then splice together the output of the feature extraction layer, the custom dictionary vector, and the Bi-gram word vector.
(3) CRF decoding layer
In some embodiments, the 768-dimensional word vector, the 40-dimensional dictionary vector, and the 200-dimensional Bi-gram word vector corresponding to each character of the Query are spliced to obtain a 768 + 40 + 200 = 1008-dimensional vector (i.e., the spliced vector described above); the CRF decoding layer then labels the spliced vector.
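The splicing step itself can be sketched directly, assuming per-character feature matrices of the stated dimensions (the random values stand in for real model outputs):

```python
import numpy as np

rng = np.random.default_rng(0)
n_chars = 7  # hypothetical Query length

word_vecs = rng.standard_normal((n_chars, 768)).astype(np.float32)    # feature extraction layer
dict_vecs = rng.standard_normal((n_chars, 40)).astype(np.float32)     # custom dictionary features
bigram_vecs = rng.standard_normal((n_chars, 200)).astype(np.float32)  # Word2vec Bi-gram features

# Per-character concatenation: 768 + 40 + 200 = 1008 dimensions per character.
spliced = np.concatenate([word_vecs, dict_vecs, bigram_vecs], axis=-1)
assert spliced.shape == (n_chars, 1008)
```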
Here, when performing probability labeling on the spliced vector of each character, the CRF decoding layer considers both the vector information of the character and the transition-matrix information between labels, which prevents, for example, two adjacent 'B' labels (each marking the first character of an entity) from being emitted for the same entity. In addition, because the CRF decoding layer optimizes the overall probability of the whole sentence, the label-bias problem is avoided.
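A minimal sketch of such CRF decoding is Viterbi search over per-character emission scores plus a label-transition matrix, in which an illegal transition (here, 'I' directly after 'O') is suppressed with a large negative score and the globally best tag sequence is chosen over the whole sentence; the labels and score values are illustrative assumptions:

```python
import numpy as np

labels = ["O", "B", "I"]

def viterbi(emissions, transitions):
    """emissions: (n_chars, n_labels); transitions[i, j] = score of label i -> j."""
    n, k = emissions.shape
    score = emissions[0].copy()
    back = np.zeros((n, k), dtype=int)
    for t in range(1, n):
        # total[i, j]: best score ending in label j at step t via label i at t-1.
        total = score[:, None] + transitions + emissions[t][None, :]
        back[t] = total.argmax(axis=0)
        score = total.max(axis=0)
    path = [int(score.argmax())]
    for t in range(n - 1, 0, -1):  # follow back-pointers to recover the sequence
        path.append(int(back[t][path[-1]]))
    return path[::-1]

# Forbid the illegal O -> I transition with a large negative score.
trans = np.zeros((3, 3), dtype=np.float32)
trans[0, 2] = -1e4
emis = np.array([[0.1, 2.0, 0.0],    # char 0: prefers B
                 [0.0, 0.0, 2.0],    # char 1: prefers I
                 [2.0, 0.0, 1.9]])   # char 2: O vs I nearly tied; whole-sentence
path = viterbi(emis, trans)          # score decides -> B, I, O
```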
In some embodiments, the training corpus is a crowd-tested entity corpus, and the test corpus consists of real-user log data, whose distribution matches the real user distribution.
The effect comparison between the embodiments of the present invention and the related art is described below.
Referring to Tables 1 and 2, Tables 1 and 2 are comparison tables of effects between embodiments of the present invention and the related art. In Case 1 of Table 1, the feature input to the CRF model is the word vector of each character of the Query output by the feature extraction layer; in Case 2, it is the concatenation of the word vector and the dictionary vector of each character; in Case 3, it is the concatenation of the word vector, the dictionary vector, and the Bi-gram word vector of each character.
As can be seen from Table 1, although the precision (P) of the embodiment of the present invention drops slightly, the recall (R) improves greatly and the overall F value improves significantly. The overall effect of the embodiment of the present invention is therefore still clearly better than that of the related art.
Table 1 comparative table of effects between examples of the present invention and related art
TABLE 2 comparison of effects between examples of the present invention and related arts
Compared with the related art, the embodiment of the present invention removes much of the feature-engineering workload and helps improve the efficiency of model development iteration.
Based on progress in multi-task learning, similar tasks trained together tend to improve overall performance. Following this principle, the embodiment of the present invention exploits the similarity of Query corpora across media-asset intentions: it merges the corpora of multiple intentions such as Music, Video, and radio (FM), uses the similarity of Query expressions across related intentions for language enhancement, and finds that this improves entity recognition. At the same time, increasing the amount of training corpus mitigates model over-fitting, thereby improving the accuracy of entity recognition.
In some embodiments, since deep learning requires a large amount of training data, data-enhancement techniques can also be added. For example, in the training data, the labeled entity parts can be swapped with one another to generate more samples; the corpus (Corpus) outside the entity labels can be varied with techniques such as synonym substitution; and training data can be continuously mined from user logs, adding more sample data so that the model keeps improving.
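The entity-swap idea can be sketched as a simple template-filling routine; the template, slot name, and entity pool below are hypothetical examples, not taken from the patent:

```python
import random

# Illustrative sketch of entity-swap data augmentation: the entity mention in a
# labeled training sentence is replaced by other entities of the same type to
# generate extra samples. Template and entity pool are hypothetical.
def augment(template, slot_values, n=3, seed=0):
    """template contains one '{cartoon}' slot; slot_values lists alternatives."""
    rng = random.Random(seed)  # seeded for reproducible augmentation
    return [template.format(cartoon=rng.choice(slot_values)) for _ in range(n)]

samples = augment("I want to watch {cartoon}",
                  ["Peppa Pig", "Frozen", "Doraemon"], n=3)
```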
In summary, the embodiment of the invention has the following beneficial effects:
1) The character feature vector corresponding to each character can be extracted automatically from the text without manually constructing features, realizing automatic evolution of feature engineering; compared with the related art, this greatly reduces the workload.
2) By selecting an entity dictionary consistent with the text type, the dictionary vector corresponding to each character in the text can be extracted accurately, improving the accuracy of subsequent entity recognition.
3) Classifying the spliced vectors labels the characters in the text comprehensively across multiple dimensions, which improves both the accuracy and the efficiency of entity recognition.
4) Each character in the text is labeled by the classifier; the classification process is simple and fast, improving the efficiency of entity recognition.
5) When labeling each character in the text, both the probability that the character belongs to the corresponding label and the overall optimal probability of the whole text are considered, so the label-bias problem is avoided. Specifically, the embodiment of the present invention considers the degree of association between the label of each character and the labels of adjacent characters, so that a label marking a character as the first character of an entity cannot appear after a label marking a character as a non-first character of the same entity; it also considers the degree of association between each label and its character, so that two labels both marking first characters of the same entity cannot appear in one labeling result. This reduces errors in identifying the entities and entity types in the text and improves recognition accuracy.
The foregoing is merely exemplary embodiments of the present invention and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, improvement, etc. made within the spirit and scope of the present invention are included in the protection scope of the present invention.

Claims (9)

1. A method for identifying entities in text, the method comprising:
performing feature extraction processing on the text to obtain a character feature vector corresponding to each character in the text;
determining a dictionary vector corresponding to each character in the text according to an entity dictionary corresponding to the text;
performing word segmentation processing on the text to obtain the word corresponding to each character in the text, and determining a word vector corresponding to each word;
performing splicing processing on the character feature vector, the dictionary vector, and the word vector corresponding to each character to obtain a spliced vector corresponding to each character;
mapping the spliced vector corresponding to each character to a plurality of different candidate labels, and respectively determining the transfer score of the spliced vector mapped to each candidate label;
wherein the candidate labels indicate the type of the entity to which a character belongs and the position of the character within that entity, or indicate that the character is an irrelevant character;
selecting candidate labels multiple times from the plurality of candidate labels corresponding to each character, according to the order in which the characters appear in the text, and combining the candidate labels selected each time to obtain a plurality of different candidate label sequences;
wherein the candidate labels contained in each selected candidate label sequence belong to different characters, and the number of candidate labels contained equals the number of characters in the text;
accumulating the transfer score of each candidate label in a candidate label sequence to obtain an overall transfer score;
determining the candidate labels contained in the candidate label sequence with the largest overall transfer score as the label corresponding to each character; and
determining the entities in the text and the types of the entities according to the label corresponding to each character.
2. The method according to claim 1, wherein performing feature extraction processing on the text to obtain the character feature vector corresponding to each character in the text comprises:
querying, in a mapping dictionary, the numeric identifier corresponding to each character in the text; and
converting the numeric identifier corresponding to each character into vector form to obtain the character feature vector corresponding to the character.
3. The method of claim 1, wherein determining the dictionary vector corresponding to each character in the text according to the entity dictionary corresponding to the text comprises:
determining the type to which the text belongs;
determining the entity dictionary corresponding to the type to which the text belongs; and
querying, in the entity dictionary, the dictionary vector corresponding to each character in the text.
4. The method of claim 1, wherein performing word segmentation processing on the text to obtain the word corresponding to each character in the text, and determining the word vector corresponding to each word, comprises:
intercepting, from the text, a plurality of words whose length in characters is a preset value;
encoding each intercepted word to obtain a plurality of code sequences in one-to-one correspondence with the words; and
mapping the code sequence corresponding to each word to the word vector corresponding to the word.
5. The method according to claim 1, wherein performing splicing processing on the character feature vector, the dictionary vector, and the word vector corresponding to each character to obtain the spliced vector corresponding to each character comprises:
determining the character feature vector, dictionary vector, and word vector corresponding to the same character; and
superimposing the dimensions contained in the character feature vector, the dictionary vector, and the word vector, and filling each superimposed dimension with the scalar corresponding to that dimension, to obtain the spliced vector corresponding to the character.
6. The method according to any one of claims 1 to 5, wherein determining the entities in the text and the types of the entities according to the label corresponding to each character comprises:
recognizing consecutive characters in the text whose corresponding labels indicate the same entity type as the same entity; and
identifying the entity type indicated by those labels as the type of that entity.
7. An apparatus for identifying entities in text, the apparatus comprising:
a feature extraction module configured to perform feature extraction processing on the text to obtain a character feature vector corresponding to each character in the text;
a dictionary module configured to determine a dictionary vector corresponding to each character in the text according to an entity dictionary corresponding to the text;
a word segmentation module configured to perform word segmentation processing on the text to obtain the word corresponding to each character in the text, and to determine a word vector corresponding to each word;
a splicing module configured to perform splicing processing on the character feature vector, the dictionary vector, and the word vector corresponding to each character to obtain a spliced vector corresponding to each character;
a recognition module configured to map the spliced vector corresponding to each character to a plurality of different candidate labels and respectively determine the transfer score of the spliced vector mapped to each candidate label, wherein the candidate labels indicate the type of the entity to which a character belongs and the position of the character within that entity, or indicate that the character is an irrelevant character; to select candidate labels multiple times from the plurality of candidate labels corresponding to each character, according to the order in which the characters appear in the text, and combine the candidate labels selected each time to obtain a plurality of different candidate label sequences, wherein the candidate labels contained in each selected candidate label sequence belong to different characters and their number equals the number of characters in the text; to accumulate the transfer score of each candidate label in a candidate label sequence to obtain an overall transfer score; and to determine the candidate labels contained in the candidate label sequence with the largest overall transfer score as the label corresponding to each character, and determine the entities in the text and the types of the entities according to the label corresponding to each character.
8. An electronic device, comprising:
a memory for storing executable instructions;
a processor for implementing the method for identifying entities in text of any one of claims 1 to 6 when executing executable instructions stored in said memory.
9. A computer readable storage medium storing executable instructions for causing a processor to perform the method of identifying an entity in text as claimed in any one of claims 1 to 6.
CN202010533173.5A 2020-06-12 2020-06-12 Method and device for identifying entity in text Active CN111695345B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010533173.5A CN111695345B (en) 2020-06-12 2020-06-12 Method and device for identifying entity in text

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010533173.5A CN111695345B (en) 2020-06-12 2020-06-12 Method and device for identifying entity in text

Publications (2)

Publication Number Publication Date
CN111695345A CN111695345A (en) 2020-09-22
CN111695345B true CN111695345B (en) 2024-02-23

Family

ID=72480580

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010533173.5A Active CN111695345B (en) 2020-06-12 2020-06-12 Method and device for identifying entity in text

Country Status (1)

Country Link
CN (1) CN111695345B (en)

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113536793A (en) * 2020-10-14 2021-10-22 腾讯科技(深圳)有限公司 Entity identification method, device, equipment and storage medium
CN112487813B (en) * 2020-11-24 2024-05-10 中移(杭州)信息技术有限公司 Named entity recognition method and system, electronic equipment and storage medium
CN112364656A (en) * 2021-01-12 2021-02-12 北京睿企信息科技有限公司 Named entity identification method based on multi-dataset multi-label joint training
CN112906380B (en) * 2021-02-02 2024-09-27 北京有竹居网络技术有限公司 Character recognition method and device in text, readable medium and electronic equipment
CN112906381B (en) * 2021-02-02 2024-05-28 北京有竹居网络技术有限公司 Dialog attribution identification method and device, readable medium and electronic equipment
CN113705232B (en) * 2021-03-03 2024-05-07 腾讯科技(深圳)有限公司 Text processing method and device
CN112988979B (en) * 2021-04-29 2021-10-08 腾讯科技(深圳)有限公司 Entity identification method, entity identification device, computer readable medium and electronic equipment
CN113505587B (en) * 2021-06-23 2024-04-09 科大讯飞华南人工智能研究院(广州)有限公司 Entity extraction method, related device, equipment and storage medium
CN113673249B (en) * 2021-08-25 2022-08-16 北京三快在线科技有限公司 Entity identification method, device, equipment and storage medium
CN113868419B (en) * 2021-09-29 2024-05-31 中国平安财产保险股份有限公司 Text classification method, device, equipment and medium based on artificial intelligence

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109165280A (en) * 2018-09-13 2019-01-08 安徽倍思特教育科技有限公司 A kind of information consulting system of educational training
WO2019024704A1 (en) * 2017-08-03 2019-02-07 阿里巴巴集团控股有限公司 Entity annotation method, intention recognition method and corresponding devices, and computer storage medium
CN109388795A (en) * 2017-08-07 2019-02-26 芋头科技(杭州)有限公司 A kind of name entity recognition method, language identification method and system
CN109543181A (en) * 2018-11-09 2019-03-29 中译语通科技股份有限公司 A kind of name physical model combined based on Active Learning and deep learning and system
CN110502738A (en) * 2018-05-18 2019-11-26 阿里巴巴集团控股有限公司 Chinese name entity recognition method, device, equipment and inquiry system
CN111079418A (en) * 2019-11-06 2020-04-28 科大讯飞股份有限公司 Named body recognition method and device, electronic equipment and storage medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107391485A (en) * 2017-07-18 2017-11-24 中译语通科技(北京)有限公司 Entity recognition method is named based on the Korean of maximum entropy and neural network model


Also Published As

Publication number Publication date
CN111695345A (en) 2020-09-22

Similar Documents

Publication Publication Date Title
CN111695345B (en) Method and device for identifying entity in text
CN109408622B (en) Statement processing method, device, equipment and storage medium
CN109165302B (en) Multimedia file recommendation method and device
CN115952272B (en) Method, device and equipment for generating dialogue information and readable storage medium
US20230386238A1 (en) Data processing method and apparatus, computer device, and storage medium
CN112287069B (en) Information retrieval method and device based on voice semantics and computer equipment
CN112911331B (en) Music identification method, device, equipment and storage medium for short video
WO2021135103A1 (en) Method and apparatus for semantic analysis, computer device, and storage medium
CN118692014B (en) Video tag identification method, device, equipment, medium and product
CN114911915A (en) Knowledge graph-based question and answer searching method, system, equipment and medium
CN117236340A (en) Question answering method, device, equipment and medium
CN118916442A (en) Data processing method and device and electronic equipment
CN115238708B (en) Text semantic recognition method, device, equipment, storage medium and program product
CN118520976B (en) Text dialogue generation model training method, text dialogue generation method and equipment
CN118093792B (en) Method, device, computer equipment and storage medium for searching object
CN110795547A (en) Text recognition method and related product
CN112632962B (en) Method and device for realizing natural language understanding in man-machine interaction system
CN117932049B (en) Medical record abstract generation method, device, computer equipment and medium
CN110516109B (en) Music label association method and device and storage medium
CN116956941B (en) Text recognition method, device, equipment and medium
CN117973331A (en) Document processing method, device, computer equipment and storage medium based on artificial intelligence
CN116955704A (en) Searching method, searching device, searching equipment and computer readable storage medium
CN115757469A (en) Data generation method, electronic device and storage medium for text-to-SQL tasks
CN1312898C (en) Universal mobile human interactive system and method
CN116628232A (en) Label determining method, device, equipment, storage medium and product

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant