US20230229860A1 - Method and system for hybrid entity recognition
- Publication number
- US20230229860A1 (U.S. application Ser. No. 18/152,198)
- Authority
- US
- United States
- Prior art keywords
- entities
- machine learning
- computer
- input sentence
- entity
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
- G06F40/295—Named entity recognition
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/232—Orthographic correction, e.g. spell checking or vowelisation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/253—Grammatical analysis; Style critique
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/084—Backpropagation, e.g. using gradient descent
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N5/00—Computing arrangements using knowledge-based models
- G06N5/01—Dynamic search techniques; Heuristics; Dynamic trees; Branch-and-bound
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N5/00—Computing arrangements using knowledge-based models
- G06N5/04—Inference or reasoning models
- G06N5/046—Forward inferencing; Production systems
Definitions
- the present disclosure relates in general to the field of computer software and systems, and in particular, to a system and method for hybrid entity recognition.
- unstructured data is often text-heavy (e.g., natural language). Understanding the semantics and syntax of that text is important in order to determine the various entities and their underlying linguistic structure.
- Prior systems may recognize basic level entities (e.g., names of persons, organizations, locations, expressions of times, quantities, etc.). Prior systems are not accurate because they do not determine the context, semantics, and syntax at the same time during entity recognition, and they struggle to recognize second level entities (e.g., credit amount and debit amount, which are similar entities belonging to the same class "amount").
- a computer-implemented process comprises receiving an input sentence.
- the input sentence is preprocessed to remove extraneous information, perform spelling correction, and perform grammar correction to generate a cleaned input sentence.
- a POS tagger tags parts of speech of the cleaned input sentence.
- a rules based entity recognizer module identifies first level entities in the cleaned input sentence.
- the cleaned input sentence is converted and translated into numeric vectors.
- Basic and composite entities are extracted from the cleaned input sentence using the numeric vectors.
- FIG. 1 illustrates a block diagram of an exemplary network of entities with a hybrid entity recognizer (HER), according to one embodiment.
- FIG. 2 illustrates an exemplary HER system architecture, according to one embodiment.
- FIG. 3 illustrates an exemplary HER system architecture, according to another embodiment.
- FIG. 4 illustrates an exemplary HER system process for learning, according to one embodiment.
- FIGS. 5 a and 5 b illustrate an exemplary HER system process for recognizing and extracting entities, according to one embodiment.
- FIG. 6 shows an exemplary general purpose computing device in the form of a computer, according to one embodiment.
- the present hybrid entity recognition (HER) system is useful for any Artificial Intelligence (AI) based expert system.
- the present hybrid entity recognition (HER) system efficiently searches and discovers information using entity recognition.
- the present HER system finds and implements ways to add structure to unstructured data. This entire process of information extraction and classification of extracted information into pre-determined categories (e.g., names of persons, organizations, locations, expressions of times, quantities, monetary values, percentages, etc.) is known as entity recognition.
- users may define new categories of entities using a user interface.
- the present system determines the context, semantics, and syntax of text to identify second level entities, also described as composite entities, that consist of a base entity with a linguistic pattern (e.g., from city and to city). It also uses memory based linguistic pattern recognition to differentiate between two similar entities of the same class or type (e.g., to date and from date, credit amount and debit amount, from city and to city, etc.)
- the present hybrid entity recognition (HER) system identifies the entities from a given corpus, stream of abstract raw data, or preprocessed data in text format.
- the present system is modular and flexible to be implemented with a variety of IT software solutions including semantics (behavioral) systems, question answering systems, ontology computing and opinion mining.
- the present system has the benefits of:
- FIG. 1 illustrates a block diagram of an exemplary network of entities with a hybrid entity recognizer (HER) 100 , according to one embodiment.
- Customer contact center 020 may be a call center where the HER system 010 processes queries and responses.
- Third party systems 030 may be a service desk or help desk for an enterprise organization.
- Anti-money laundering (AML) and fraud detection systems 040 work with the HER system 010 that processes natural language queries and responses that may include entity lists.
- Smart home and connected devices managers 050 work with HER system 010 to process language, domain, custom dictionaries, and corpus (e.g., a collection of written texts).
- Personal assistants (e.g., SIRI, Alexa, Google, etc.)
- HER system 010 works with manually tagged entities and extracted entities 070 .
- Concierge services 080 works with HER system 010 using natural language queries and responses.
- Administrative users 090 work with HER system 010 using decision trees, a command center, dictionaries, and themes.
- FIG. 2 illustrates an exemplary HER system architecture 200 , according to one embodiment.
- System 200 includes HER services 211 - 218 .
- the preprocessor 211 cleans and massages an input text string (e.g., a sentence), removes extraneous information (e.g., extra spaces, non-useful special symbols, etc.), and performs spelling corrections and grammar corrections. For example, if the input sentence is "My bank  account number is 70318XXXX and want bank statement from 12/11/17 to 12/01/18" (with extra spaces), then the output of the preprocessor would be "My bank account number is 70318XXXX and want bank statement from 12/11/17 to 12/01/18."
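The preprocessing step above can be sketched as follows. This is a minimal illustration, not the patent's actual implementation: the regular expressions and the dictionary-based spelling correction are assumptions for this sketch.

```python
import re

def preprocess(text, spelling_fixes=None):
    """Clean an input sentence: strip stray symbols, collapse extra
    spaces, and apply a simple dictionary-based spelling correction."""
    # Remove non-useful special symbols (keep word chars, spaces, and
    # punctuation that carries meaning for entities like $200 or 12/11/17).
    text = re.sub(r"[^\w\s/$.,?!-]", "", text)
    # Collapse runs of whitespace into single spaces.
    text = re.sub(r"\s+", " ", text).strip()
    # Hypothetical spelling correction via a lookup table.
    fixes = spelling_fixes or {}
    return " ".join(fixes.get(w, w) for w in text.split(" "))

print(preprocess("My  bank accont number is  70318XXXX",
                 {"accont": "account"}))
# My bank account number is 70318XXXX
```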
- the Parts of Speech (POS) tagger 212 assigns parts of speech to each word. There are eight parts of speech in the English language: noun, pronoun, verb, adjective, adverb, preposition, conjunction, and interjection. The part of speech indicates how the word functions in meaning as well as grammatically within the sentence. POS Tagger 212 provides tags such as “He [PRON] lives [VERB] in [Preposition] USA [NOUN].”
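A minimal lookup-based tagger illustrates the idea of assigning a part of speech to each word. The lexicon contents and the default-to-NOUN fallback are assumptions for this sketch; production POS taggers are statistical or neural.

```python
# Tiny illustrative lexicon; unknown words default to NOUN.
LEXICON = {
    "he": "PRON", "she": "PRON", "it": "PRON",
    "lives": "VERB", "works": "VERB", "want": "VERB",
    "in": "ADP", "from": "ADP", "to": "ADP",
}

def pos_tag(sentence):
    """Assign a POS tag to every word by dictionary lookup."""
    return [(word, LEXICON.get(word.lower(), "NOUN"))
            for word in sentence.split()]

print(pos_tag("He lives in USA"))
# [('He', 'PRON'), ('lives', 'VERB'), ('in', 'ADP'), ('USA', 'NOUN')]
```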
- the rules based entity recognizer 213 recognizes an entity at very first level (e.g., a first level entity) based on predefined linguistic rules and corpus/dictionaries, where the corpus contains a list of names of persons, organizations, locations, etc. For example, the rules based entity recognizer 213 identifies a word as a name when the word is identified as a noun by the POS tagger 212 that is also available in a dictionary of names.
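The rule described above (a word tagged as a noun that also appears in a dictionary of names is identified as a name) can be sketched as follows; the dictionary contents and the tuple interface are illustrative assumptions.

```python
NAME_DICTIONARY = {"john", "mary", "genpact"}  # illustrative name corpus

def recognize_first_level(tagged):
    """Mark a token as a NAME entity when it is tagged NOUN and also
    appears in the name dictionary."""
    return [(word, "NAME") for word, tag in tagged
            if tag == "NOUN" and word.lower() in NAME_DICTIONARY]

tagged = [("John", "NOUN"), ("works", "VERB"),
          ("at", "ADP"), ("Genpact", "NOUN")]
print(recognize_first_level(tagged))
# [('John', 'NAME'), ('Genpact', 'NAME')]
```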
- the memory based entity recognizer 214 recognizes composite entities such as the “to date”, “from date”, “to location”, and “from location.”
- Memory based entity recognizer 214 uses the first level entities identified by the rules based entity recognizer 213 and machine learning based entity recognizer 220 to recognize the composite entities that include a base entity with a linguistic pattern (e.g., from city and to city).
- the memory based entity recognizer module 214 has the capability to learn linguistic patterns and store the linguistic patterns, base entity information, and each keyword with its relative proximity to the base entity, in memory for future entity recognition processes.
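The keyword-plus-proximity idea above can be sketched as follows. The stored pattern shape (keyword, offset relative to the base entity, base entity label, composite label) and the token-index interface are assumptions for illustration, not the patent's stored representation.

```python
# Stored linguistic patterns: (keyword, offset of the keyword relative
# to the base entity, base entity label, composite entity label).
PATTERNS = [
    ("from", -1, "DATE", "from_date"),
    ("to",   -1, "DATE", "to_date"),
    ("from", -1, "CITY", "from_city"),
    ("to",   -1, "CITY", "to_city"),
]

def recognize_composites(tokens, base_entities):
    """base_entities maps token index -> base entity label. A stored
    keyword at the expected offset promotes the base entity to a
    composite entity (e.g., a DATE preceded by 'from' -> from_date)."""
    composites = {}
    for idx, label in base_entities.items():
        for keyword, offset, base, composite in PATTERNS:
            k = idx + offset
            if base == label and 0 <= k < len(tokens) and tokens[k].lower() == keyword:
                composites[idx] = composite
    return composites

tokens = "statement from 12/11/17 to 12/01/18".split()
print(recognize_composites(tokens, {2: "DATE", 4: "DATE"}))
# {2: 'from_date', 4: 'to_date'}
```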
- the RegEx (regular expression) based entity recognizer 217 and rules based entity recognizer 213 recognizes the entity at a first level based on a predefined word structure, and linguistic rules (e.g., USD200 or $200 or any date, etc.).
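The regular-expression recognition of amounts and dates mentioned above might look like the following sketch; the exact patterns are assumptions matching only the examples given in the text (USD200, $200, and slash-separated dates).

```python
import re

# Illustrative patterns for the examples in the text.
ENTITY_PATTERNS = {
    "AMOUNT": re.compile(r"(?:USD|\$)\s?\d+(?:\.\d+)?"),
    "DATE":   re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b"),
}

def regex_entities(text):
    """Return (matched text, entity label) pairs for every pattern hit."""
    found = []
    for label, pattern in ENTITY_PATTERNS.items():
        for match in pattern.finditer(text):
            found.append((match.group(), label))
    return found

print(regex_entities("Refund USD200 issued on 12/11/17"))
# [('USD200', 'AMOUNT'), ('12/11/17', 'DATE')]
```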
- the vectorizer 215 used in the present HER system 210 converts and translates text into numeric vectors used by the sequence classification machine learning algorithm, which is based on a back propagation neural network.
- Sequence-to-sequence prediction involves predicting an output sequence given an input sequence. Sequence prediction predicts elements of a sequence on the basis of the adjacent elements.
- the sequence-to-sequence classifier is a type of neural network that is trained using a back-propagation method that fine-tunes the weights of the neural network based on the error rate obtained in the previous epoch (e.g., iteration).
- a sequence to sequence model maps an input sequence to an output sequence, where the lengths of the input and output may differ, for example:
- the vectorizer 215 preserves the POS information of individual words in a sentence, the number of occurrences of any POS in the sentence, and the entity information recognized by the rules based entity recognizer 213 .
- the vectorizer 215 uses a hash table to assign the numeric value to every word's POS as POSbaseid.
- the vectorizer's numerical representation is given below:
- the devectorizer 216 performs the opposite process of the vectorizer 215 to reconvert the output of a machine learning model (which would be in the form of a vector) to meaningful text with a clear identification of entities in the text (e.g. names of persons, organizations, locations, expressions of times, quantities, etc.).
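The vectorizer/devectorizer round trip can be sketched as follows. The numeric tag ids are arbitrary stand-ins for the hash-table values the text leaves unspecified, and the interface is an assumption for illustration.

```python
# Hash table assigning a numeric base id to each POS/entity tag
# (illustrative values; the actual POSbaseid assignments are not given).
TAG_IDS = {"NOUN": 1, "PRON": 2, "VERB": 3, "ADP": 4, "NAME": 5, "DATE": 6}
ID_TAGS = {v: k for k, v in TAG_IDS.items()}

def vectorize(tagged):
    """[(word, tag)] -> list of numeric ids consumed by the classifier."""
    return [TAG_IDS[tag] for _, tag in tagged]

def devectorize(vector, words):
    """Reconvert model output ids back to (word, tag) pairs."""
    return list(zip(words, (ID_TAGS[i] for i in vector)))

tagged = [("He", "PRON"), ("lives", "VERB"), ("in", "ADP"), ("USA", "NOUN")]
vec = vectorize(tagged)
print(vec)  # [2, 3, 4, 1]
print(devectorize(vec, [w for w, _ in tagged]) == tagged)  # True
```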
- the corpus data contains predefined entities like Person Name, County Name, City Name, etc.
- the corpus builder 218 uses memory based learning to add new entities to the corpus over time with additional training.
- Data sources 250 may include corpus data, application data, CRM, P2P systems, SAP, and Oracle.
- the present system 200 can be connected with any of the above mentioned systems to utilize existing information, which can be used in the form of pre-defined entities.
- the present system 200 adheres to SSL level security protocols 273 .
- the present system 200 adheres to the available security protocols and requirements of the enterprise system within which it operates.
- the present system 200 exposes its capabilities through the API specifications to which it interfaces (e.g., microservices). These capabilities include being small in size, messaging-enabled, autonomous, and independently deployable.
- a microservice is not a layer within a monolithic application.
- the orchestration layer 230 controls and manages the communications between HER services 211 - 218 .
- the orchestration layer 230 contains a directory of services along with a listing of its capabilities. Based on the type of request and business logic, the orchestration layer 230 manages the communication between the HER services 211 - 218 . Communication between the HER services 211 - 218 uses JavaScript Object Notation (JSON).
- a consumer 240 - 243 sends text in the form of a request in a JSON object for entity recognition, and orchestration layer 230 returns the list of extracted entities, as shown below: def get_all_entities_service(input_sentence<JSON>, Return_all_entities_list<JSON>)
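The request/response exchange described above might look like the following sketch. The JSON field names (`input_sentence`, `entities`) and the stubbed entity list are assumptions for illustration; a real handler would invoke the HER pipeline.

```python
import json

def get_all_entities_service(request_json):
    """Hypothetical handler: accept a JSON request carrying the input
    sentence and return a JSON list of extracted entities."""
    request = json.loads(request_json)
    sentence = request["input_sentence"]
    # Stub: a real implementation would run the HER pipeline on `sentence`.
    entities = [{"value": "12/11/17", "type": "from_date"},
                {"value": "12/01/18", "type": "to_date"}]
    return json.dumps({"input_sentence": sentence, "entities": entities})

response = get_all_entities_service(
    json.dumps({"input_sentence": "statement from 12/11/17 to 12/01/18"}))
print(json.loads(response)["entities"][0]["type"])  # from_date
```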
- Example values of input and output parameters include:
- Integration connectors 280 connect the present system 200 with different data sources 250 , such as a database, CRM, P2P systems, SAP, Oracle, etc.
- Integration connectors 280 include driver libraries, such as a JDBC driver, ODBC driver, SAP JDBC driver, etc. Integration connectors 280 use these drivers to establish the connection between the HER system 210 and data source 250.
- FIG. 3 illustrates an exemplary HER system architecture 300 , according to another embodiment.
- the context engine 326 keeps the context of up to three levels to identify the appropriate entity in the sentence.
- the context is used to identify indirect entities addressed by pronouns like he, she, it, etc. For example, consider the sentence "I am working with Genpact and I want to know its last year performance numbers." In this sentence, the context engine 326 identifies that the subject is Genpact and that in the second portion of the sentence "its" means Genpact.
- the rule based entity recognizer 331 recognizes “its” as Genpact.
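The pronoun resolution described above can be sketched as a naive most-recent-entity heuristic. This is an assumption-laden illustration; the patent's context engine keeps up to three levels of context, which this sketch collapses to "the last entity seen."

```python
PRONOUNS = {"he", "she", "it", "its", "they"}

def resolve_pronouns(tokens, entities):
    """Replace each pronoun with the most recently seen known entity
    (a deliberately naive stand-in for real coreference resolution)."""
    last_entity = None
    resolved = []
    for token in tokens:
        if token in entities:
            last_entity = token
            resolved.append(token)
        elif token.lower() in PRONOUNS and last_entity:
            resolved.append(last_entity)
        else:
            resolved.append(token)
    return resolved

tokens = "I am working with Genpact and I want its performance numbers".split()
print(resolve_pronouns(tokens, {"Genpact"}))
# ['I', 'am', 'working', 'with', 'Genpact', 'and', 'I', 'want', 'Genpact', 'performance', 'numbers']
```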
- the training module 370 trains a machine learning model and memory learning model with new datasets, when a user feeds the data to the HER system 320 .
- the HER system 320 identifies the tagged entities, which are identified by the rule based entity recognizer 331 and RegEx based entity recognizer 333 .
- the user may make corrections, tag the untagged entities and then perform the training using the User Interface of the training module 370 .
- Using the training module 370 a user may introduce new types of basic entities and composite entities.
- Business layer 350 is an intermediate layer between the HER system 320 and the external source systems 360 (e.g., a legacy enterprise system).
- the business layer 350 has business logic and rules.
- a business logic or rule can be represented as:
- the business logic/rules of business layer 350 provides domain specific knowledge.
- in the banking domain, the meaning of "card" is a credit card or a debit card.
- in the electronics domain, the meaning of "card" is a PCB (Printed Circuit Board).
- the word "net" in the financial domain means the amount remaining after deductions from the gross.
- in the fishing domain, the meaning of "net" refers to a fish net.
- in the IT domain, the meaning of "net" is the Internet.
- the business layer 350 contains business rules.
- the external source systems 360 (e.g., ERP, P2P, CMS systems)
- the business rules provide data used by the external source systems 360 .
- the source systems 360 (e.g., ERP, P2P, CMS systems) are consumers of the HER system 320.
- AI system 310 can utilize the HER system 200 to capture the important information in the form of entities from Natural language/free flow text.
- AI systems artificially mimic human intelligence processes, including learning (the acquisition of information and rules for using the information), reasoning (using rules to reach approximate or definite conclusions), and self-correction.
- Particular applications of AI include expert systems, speech recognition and machine vision.
- the client 305 may be a human consumer of HER system 320, who may be a trainer, developer, or user, interacting directly with the system or through a mobile device or Interactive Voice Response (IVR) system.
- the orchestration layer 321 controls and manages the communication between all the internal HER services.
- the orchestration layer 321 contains a directory of services along with a listing of the service's capabilities. For example, some of the services may be:
- orchestration layer 321 manages communications between the HER services 322 - 334 , where according to one embodiment the communications use the JSON format.
- Predefined business logic/rules determine the sequence in which orchestration layer 321 calls HER services 322 - 334 in order to provide extracted basic and composite entities.
- a custom corpus 343 provides the flexibility to store metadata and data of user defined entities. During training of a new or custom entity, the system 320 captures that entity and stores it into the custom corpus 343.
- a business requirement may be that designations such as CTO, CDO, VP, AVP, and SM should be recognized as a designation. During learning, the HER system 320 stores all these custom words into the custom corpus 343.
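The custom corpus behavior above can be sketched as a small store of user-defined entity types mapped to their known surface forms; the class and method names here are assumptions for illustration.

```python
class CustomCorpus:
    """Minimal sketch of a custom entity corpus: user-defined entity
    types (e.g., 'designation') mapped to their known surface forms."""

    def __init__(self):
        self.entries = {}

    def train(self, entity_type, words):
        # Store new custom words under the given entity type.
        self.entries.setdefault(entity_type, set()).update(
            w.lower() for w in words)

    def lookup(self, word):
        # Return the entity type of a word, or None if unknown.
        for entity_type, words in self.entries.items():
            if word.lower() in words:
                return entity_type
        return None

corpus = CustomCorpus()
corpus.train("designation", ["CTO", "CDO", "VP", "AVP", "SM"])
print(corpus.lookup("vp"))      # designation
print(corpus.lookup("banana"))  # None
```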
- FIG. 4 illustrates an exemplary HER system process for learning 400 , according to one embodiment.
- the learning phase of the system utilizes the processing modules to process and learn sentences with the entities selected in them by the user.
- the text preprocessing module 405 cleans and massages the input text, string, or sentence 401 .
- the preprocessor 405 removes extra spaces, and other non-useful special symbols.
- the preprocessor 405 performs spelling correction and grammar corrections to the input text, string, or sentence.
- the cleansed sentence is passed to the POS Tagger 410 , as well as to the rules based entity recognizer 415 .
- the POS tagger module 410 tags the POS for each word of the input text 401 .
- the rules based entity recognizer 415 identifies the first level entities in the input text 401 .
- the output of both the POS tagger 410 and the rules based entity recognizer module 415 is passed to the vectorizer 425, which combines these outputs and creates/translates the input sentence 401 into a vector representation that is further used by the machine learning based entity recognizer module 441 for training (e.g., the machine learning model is generated based on manually tagged data and entities 443 tagged by the rules based entity recognizer 415).
- Manually tagged data is used as training data.
- Manually tagged data contains the words of an input sentence and the tags for the individual words as tagged manually.
- the machine learning based entity recognizer module learns based on the manually tagged data. For example, consider the input sentence "My bank account number is 70318XXXX and want bank statement from 12/11/17 to 12/01/18." All the entities available in this example sentence are tagged (e.g., 70318XXXX, 12/11/17 and 12/01/18).
- the vectorizer 425 and devectorizer 444 processes use hash table 430, which contains the base numeric codes of the entity tag set.
- the hash table 430 contains the tag set based on the numbering on the list of entities and parts-of-speech (e.g., a tag set may be noun, pronoun, adverb, verb, helping verb, etc.). Every entity in a tag set is given a numeric code, which is used to generate the vector or to transform the vector into tagged entities. For example:
- the vectorizer 425 preserves the POS information of individual words in a sentence, the number of occurrences of any POS in the sentence, and the entity information recognized by the rules based entity recognizer 415 .
- the vectorizer 425 uses hash table 430 to assign the numeric value to every word's POS as POSbaseid.
- the vectorizer's numerical representation is:
- the devectorizer 444 performs the opposite process of the vectorizer 425 to reconvert the output of a machine learning model (which would be in the form of a vector) to meaningful text with a clear identification of entities in the text (e.g., names of persons, organizations, locations, expressions of times, quantities, etc.), and the identified entities are stored into the extracted base entities 445 .
- the base entities extracted 445 feed into memory learning based entity recognizer 446 to recognize composite entities.
- the present system 200 uses memory based learning to create and learn computational linguistics based patterns (e.g., to date and from date, credit amount and debit amount, from city and to city, etc.)
- the RegEx (regular expression) based entity recognizer 420 recognizes the entity based on a predefined word or character level structure as a regular expression (e.g., USD200 or $200 or any date, etc.).
- This information is used as feedback to the machine learning model and memory learning based linguistic patterns module to learn new first level entities, as well as composite entities.
- After this feedback and training process, the present HER system 300 generates two models: a machine learning model 460 and a memory learning based linguistic patterns model 450. These models can be used to recognize the entities from the text that will be analyzed by the HER system 300 after the learning phase is completed.
- the system uses memory based learning to create and learn computational linguistics based patterns to recognize composite entities (e.g., to date and from date, credit amount and debit amount, from city and to city, etc.)
- FIGS. 5 a and 5 b illustrate an exemplary HER system process for recognizing and extracting entities, according to one embodiment.
- This hybrid approach uses machine learning, memory learning, and a rules based system, together with POS information, to recognize entities from the given sentences.
- the present system 300 learns new entities from the text.
- the HER entity recognition modules shown in FIG. 5 a operate as described above with reference to FIG. 4 .
- the process and modules are similar to the HER learning phase 400, except that the machine learning model 541, as shown in FIG. 5 a, recognizes the entities from new text being presented to and processed by the present HER system 300.
- HER system 300 has an entity corpus 535 that contains the metadata and data of pre-defined entities (e.g., names of persons, organizations, locations, expressions of times, quantities, etc.). HER system 300 finds and matches the predefined entities using the entity corpus 535 .
- Entity bucket 560 is the mechanism used to store the base entities temporarily for further processing.
- the composite entities are extracted using memory learning and linguistic pattern based models 565 .
- the base entities and extracted composite entities are stored into processed base entity buckets 570 and processed composite entity buckets 575 , respectively.
- the present system 300 extracts entities from any given text input(s) and learns new entities.
- the present system 300 uses a hybrid approach that leverages machine learning 334 for extracting the base entities and a memory based entity recognizer 332 for extracting the linguistic pattern based composite entities (e.g., to date and from date, credit amount and debit amount, from city and to city, etc.)
- the present HER system 300 may be used with any Artificial Intelligence (AI) system and automation system 310 .
- the following is a list of technology applications for the present HER system:
- FIG. 6 shows an exemplary general purpose computing device in the form of a computer 130 , according to one embodiment.
- a computer such as the computer 130 is suitable for use in the other figures illustrated and described herein.
- Computer 130 has one or more processors or processing units 132 and a system memory 134 .
- a system bus 136 couples various system components including the system memory 134 to the processors 132 .
- the bus 136 represents one or more of any of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, and a processor or local bus using any of a variety of bus architectures.
- such architectures include Industry Standard Architecture (ISA) bus, Micro Channel Architecture (MCA) bus, Enhanced ISA (EISA) bus, Video Electronics Standards Association (VESA) local bus, and Peripheral Component Interconnect (PCI) bus also known as a mezzanine bus.
- the computer 130 typically has at least some form of computer readable media.
- Computer readable media, which include both volatile and nonvolatile media and removable and non-removable media, may be any available medium that can be accessed by computer 130 .
- Computer readable media comprise computer storage media and communication media.
- Computer storage media include volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data.
- computer storage media include RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to store the desired information and that can accessed by computer 130 .
- Communication media typically embody computer readable instructions, data structures, program modules, or other data in a modulated data signal such as a carrier wave or other transport mechanism and include any information delivery media. Those skilled in the art are familiar with the modulated data signal, which has one or more of its characteristics set or changed in such a manner as to encode information in the signal.
- Examples of communication media include wired media, such as a wired network or direct-wired connection, and wireless media, such as acoustic, RF, infrared, and other wireless media.
- the system memory 134 includes computer storage media in the form of removable and/or non-removable, volatile and/or nonvolatile memory.
- system memory 134 includes read only memory (ROM) 138 and random access memory (RAM) 140 .
- RAM 140 typically contains data and/or program modules that are immediately accessible to and/or presently being operated on by processing unit 132 .
- FIG. 6 illustrates operating system 144 , application programs 146 , other program modules 148 , and program data 150 .
- the computer 130 may also include other removable/non-removable, volatile/nonvolatile computer storage media.
- FIG. 6 illustrates a hard disk drive 154 that reads from or writes to non-removable, nonvolatile magnetic media.
- FIG. 6 also shows a magnetic disk drive 156 that reads from or writes to a removable, nonvolatile magnetic disk 158 , and an optical disk drive 160 that reads from or writes to a removable, nonvolatile optical disk 162 such as a CD-ROM or other optical media.
- removable/non-removable, volatile/nonvolatile computer storage media that can be used in the exemplary operating environment include, but are not limited to, magnetic tape cassettes, flash memory cards, digital versatile disks, digital video tape, solid state RAM, solid state ROM, and the like.
- the hard disk drive 154, magnetic disk drive 156 and optical disk drive 160 are typically connected to the system bus 136 by a nonvolatile memory interface, such as interface 166.
- Hard disk drive 154 is illustrated as storing operating system 170 , application programs 172 , other program modules 174 , and program data 176 . Note that these components can either be the same as or different from operating system 144 , application programs 146 , other program modules 148 , and program data 150 . Operating system 170 , application programs 172 , other program modules 174 , and program data 176 are given different numbers here to illustrate that, at a minimum, they are different copies.
- a user may enter commands and information into computer 130 through input devices or user interface selection devices such as a keyboard 180 and a pointing device 182 (e.g., a mouse, trackball, pen, or touch pad).
- Other input devices may include a microphone, joystick, game pad, satellite dish, scanner, or the like.
- These and other input devices are often connected to the processing unit 132 through a user input interface 184 that is coupled to system bus 136, but may be connected by other interface and bus structures, such as a parallel port, game port, or a Universal Serial Bus (USB).
- a monitor 188 or other type of display device is also connected to system bus 136 via an interface, such as a video interface 190 .
- computers often include other peripheral output devices (not shown) such as a printer and speakers, which may be connected through an output peripheral interface (not shown).
- the computer 130 may operate in a networked environment using logical connections to one or more remote computers, such as a remote computer 194 .
- the remote computer 194 may be a personal computer, a server, a router, a network PC, a peer device or other common network node, and typically includes many or all of the elements described above relative to computer 130 .
- the logical connections depicted in FIG. 6 include a local area network (LAN) 196 and a wide area network (WAN) 198 , but may also include other networks.
- Such networking environments are commonplace in offices, enterprise-wide computer networks, intranets, and global computer networks (e.g., the Internet).
- When used in a local area networking environment, computer 130 is connected to the LAN 196 through a network interface or adapter 186. When used in a wide area networking environment, computer 130 typically includes a modem 178 or other means for establishing communications over the WAN 198, such as the Internet.
- the modem 178, which may be internal or external, is connected to system bus 136 via the user input interface 184, or other appropriate mechanism.
- program modules depicted relative to computer 130 may be stored in a remote memory storage device (not shown).
- FIG. 6 illustrates remote application programs 192 as residing on the memory device. It will be appreciated that the network connections shown are exemplary and other means of establishing a communications link between the computers may be used.
- the data processors of computer 130 are programmed using instructions stored at different times in the various computer-readable storage media of the computer.
- Programs and operating systems are typically distributed, for example, on floppy disks or CD-ROMs. From there, they are installed or loaded into the secondary memory of a computer. At execution, they are loaded at least partially into the computer's primary electronic memory.
- the invention described herein includes these and other various types of computer-readable storage media when such media contain instructions or programs for implementing the steps described below in conjunction with a microprocessor or other data processor.
- the invention also includes the computer itself when programmed according to the methods and techniques described herein.
- the invention is operational with numerous other general purpose or special purpose computing system environments or configurations.
- the computing system environment is not intended to suggest any limitation as to the scope of use or functionality of the invention.
- the computing system environment should not be interpreted as having any dependency or requirement relating to any one or combination of components illustrated in the exemplary operating environment.
- Examples of well-known computing systems, environments, and/or configurations that may be suitable for use with the invention include, but are not limited to, personal computers, server computers, hand-held or laptop devices, multiprocessor systems, microprocessor-based systems, set top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, and the like.
- the invention may be described in the general context of computer-executable instructions, such as program modules, executed by one or more computers or other devices.
- program modules include, but are not limited to, routines, programs, objects, components, and data structures that perform particular tasks or implement particular abstract data types.
- the invention may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network.
- program modules may be located in both local and remote computer storage media including memory storage devices.
Description
- This application is a continuation of U.S. application Ser. No. 16/721,452, filed on Dec. 19, 2019, which claims the benefit of and priority to U.S. Provisional Application Ser. No. 62/789,751, entitled “Method and System for Hybrid Entity Recognition”, filed on Jan. 8, 2019, the contents of all of which are incorporated herein by reference in their entirety.
- The present disclosure relates in general to the field of computer software and systems, and in particular, to a system and method for hybrid entity recognition.
- There are a number of requirements and/or preferences associated with utilizing unstructured data. Dealing with unstructured data is complex because the data has no predefined or pre-structured form, which leads to conditions that are unpredictable and unsolvable for prior systems.
- Typically, unstructured data contains a great amount of text (e.g., it is natural language heavy). It is important to understand the semantics and syntax of that text in order to determine the various entities and their underlying linguistic structure.
- Prior systems may recognize basic level entities (e.g., names of persons, organizations, locations, expressions of times, quantities, etc.). Prior systems are not accurate because they do not determine the context, semantics and syntax at the same time during entity recognition. Prior systems also struggle to recognize second level entities (e.g., credit amount and debit amount, which are similar entities that belong to the same class “amount”).
- A system and method for hybrid entity recognition are disclosed. According to one embodiment, a computer-implemented process comprises receiving an input sentence. The input sentence is preprocessed to remove extraneous information, perform spelling correction, and perform grammar correction to generate a cleaned input sentence. A POS tagger tags parts of speech of the cleaned input sentence. A rules based entity recognizer module identifies first level entities in the cleaned input sentence. The cleaned input sentence is converted and translated into numeric vectors. Basic and composite entities are extracted from the cleaned input sentence using the numeric vectors.
- The above and other preferred features, including various novel details of implementation and combination of elements, will now be more particularly described with reference to the accompanying drawings and pointed out in the claims. It will be understood that the particular methods and apparatuses are shown by way of illustration only and not as limitations. As will be understood by those skilled in the art, the principles and features explained herein may be employed in various and numerous embodiments.
- The accompanying figures, which are included as part of the present specification, illustrate the various embodiments of the presently disclosed system and method and together with the general description given above and the detailed description of the embodiments given below serve to explain and teach the principles of the present system and method.
-
FIG. 1 illustrates a block diagram of an exemplary network of entities with a hybrid entity recognizer (HER), according to one embodiment. -
FIG. 2 illustrates an exemplary HER system architecture, according to one embodiment. -
FIG. 3 illustrates an exemplary HER system architecture, according to another embodiment. -
FIG. 4 illustrates an exemplary HER system process for learning, according to one embodiment. -
FIGS. 5 a and 5 b illustrate an exemplary HER system process for recognizing and extracting entities, according to one embodiment. -
FIG. 6 shows an exemplary general purpose computing device in the form of a computer, according to one embodiment. -
- While the present disclosure is subject to various modifications and alternative forms, specific embodiments thereof have been shown by way of example in the drawings and will herein be described in detail. The present disclosure should be understood to not be limited to the particular forms disclosed, but on the contrary, the intention is to cover all modifications, equivalents, and alternatives falling within the spirit and scope of the present disclosure.
- A system and method for hybrid entity recognition are disclosed. According to one embodiment, a computer-implemented process comprises receiving an input sentence. The input sentence is preprocessed to remove extraneous information, perform spelling correction, and perform grammar correction to generate a cleaned input sentence. A POS tagger tags parts of speech of the cleaned input sentence. A rules based entity recognizer module identifies first level entities in the cleaned input sentence. The cleaned input sentence is converted and translated into numeric vectors. Basic and composite entities are extracted from the cleaned input sentence using the numeric vectors.
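The stages above can be sketched end to end. The following is a minimal illustration only; the function names (`preprocess`, `pos_tag`, `first_level_entities`, `recognize`), the toy POS lexicon and the name/country corpus are invented stand-ins for the system's actual modules:

```python
import re

def preprocess(sentence):
    # Remove extra whitespace; spelling and grammar correction would
    # plug in here in a fuller implementation.
    return re.sub(r"\s+", " ", sentence).strip()

def pos_tag(tokens):
    # Toy POS tagger: a dictionary lookup with a NOUN fallback.
    lexicon = {"lives": "VERB", "in": "ADP", "he": "PRON"}
    return [(t, lexicon.get(t.lower(), "NOUN")) for t in tokens]

def first_level_entities(tokens):
    # Rules based recognizer: match tokens against a small corpus.
    corpus = {"james": "Name", "usa": "Country"}
    return {t: corpus[t.lower()] for t in tokens if t.lower() in corpus}

def recognize(sentence):
    cleaned = preprocess(sentence)
    tokens = cleaned.split()
    tags = pos_tag(tokens)
    entities = first_level_entities(tokens)
    # Emit the "[Class] word" style output used in the examples here.
    return " ".join(f"[{entities.get(t, 'none')}] {t}" for t, _ in tags)

print(recognize("James  lives in USA"))
# -> [Name] James [none] lives [none] in [Country] USA
```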
- The following disclosure provides many different embodiments, or examples, for implementing different features of the subject matter. Specific examples of components and arrangements are described below to simplify the present disclosure. These are, of course, merely examples and are not intended to be limiting. In addition, the present disclosure may repeat reference numerals and/or letters in the various examples. This repetition is for the purpose of simplicity and clarity and does not in itself dictate a relationship between the various embodiments and/or configurations discussed.
- The present hybrid entity recognition (HER) system is useful for any Artificial Intelligence (AI) based expert system. To understand the important entities in any free flow text, an AI based expert system requires an entity recognition system, so that the system can make automatic decisions based on the important entities.
- The present hybrid entity recognition (HER) system efficiently searches and discovers information using entity recognition. The present HER system finds and implements ways to add structure to unstructured data. This entire process of information extraction and classification of extracted information into pre-determined categories (e.g., names of persons, organizations, locations, expressions of times, quantities, monetary values, percentages, etc.) is known as entity recognition. In addition to pre-determined categories, users may define new categories of entities using a user interface.
- The present system determines the context, semantics and syntax of text to identify second level entities, also described as composite entities, that consist of a base entity with a linguistic pattern (e.g., from city and to city). It also uses memory based linguistic pattern recognition to differentiate between two similar entities of the same class or type (e.g., to date and from date, credit amount and debit amount, from city and to city, etc.)
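To illustrate how a stored linguistic pattern can separate two entities of the same class, here is a hedged sketch; the pattern entries (trigger keyword, relative offset, composite label) are an assumed representation, not the patent's stored format:

```python
# Stored linguistic patterns: a trigger key word, its position relative
# to the base entity, and the composite label it implies. The entries
# and their encoding are illustrative assumptions.
PATTERNS = [
    {"keyword": "from", "offset": -1, "base": "City", "label": "from_city"},
    {"keyword": "to",   "offset": -1, "base": "City", "label": "to_city"},
]

def composite_entities(tokens, base_entities):
    # base_entities maps token index -> base entity class (e.g., "City"),
    # as produced by the first level recognizers.
    found = {}
    for idx, cls in base_entities.items():
        for p in PATTERNS:
            k = idx + p["offset"]
            if p["base"] == cls and 0 <= k < len(tokens) \
                    and tokens[k].lower() == p["keyword"]:
                found[tokens[idx]] = p["label"]
    return found

tokens = "book a flight from Paris to London".split()
print(composite_entities(tokens, {4: "City", 6: "City"}))
# -> {'Paris': 'from_city', 'London': 'to_city'}
```

The same mechanism distinguishes "to date" from "from date": the base entity (a date) is identical, and only the nearby key word decides the composite label.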
- The present hybrid entity recognition (HER) system identifies the entities from a given corpus, stream of abstract raw data, or preprocessed data in text format. The present system is modular and flexible to be implemented with a variety of IT software solutions including semantics (behavioral) systems, question answering systems, ontology computing and opinion mining. The present system has the benefits of:
-
- flexibility, scalability and compatibility with AI systems;
- fully complying with microservices architecture e.g., modular, loosely coupled services etc.;
- processing entities from text or corpus of data;
- recognizing linguistic patterns previously learnt by the system;
- learning new entities with minimal effort; and
- identifying basic and composite entities which are distinct based on linguistic patterns.
-
FIG. 1 illustrates a block diagram of an exemplary network of entities with a hybrid entity recognizer (HER) 100, according to one embodiment. Customer contact center 020 may be a call center where the HER system 010 processes queries and responses. Third party systems 030 may be a service desk or help desk for an enterprise organization. Anti-money laundering (AML) and fraud detection systems 040 work with the HER system 010 that processes natural language queries and responses that may include entity lists. Smart home and connected devices managers 050 work with HER system 010 to process language, domain, custom dictionaries, and corpus (e.g., a collection of written texts). Personal assistants (e.g., SIRI, Alexa, Google, etc.) communicate with HER system 010 to process conversational queries and natural language responses. HER system 010 works with manually tagged entities and extracted entities 070. Concierge services 080 work with HER system 010 using natural language queries and responses. Administrative users 090 work with HER system 010 using decision trees, command center, dictionaries and themes. -
FIG. 2 illustrates an exemplary HER system architecture 200, according to one embodiment. System 200 includes HER services 211-218. The preprocessor 211 cleans and massages an input text string (e.g., a sentence), removes extraneous information (e.g., extra spaces, non-useful special symbols, etc.) and performs spelling correction and grammar corrections. For example, if the input sentence is “My bank account number is 70318XXXX and want bank statement from 12/11/17 to 12/05/18” then the output of the preprocessor would be “My bank account number is 70318XXXX and want bank statement from 12/11/17 to 12/05/18.” - The Parts of Speech (POS)
tagger 212 assigns parts of speech to each word. There are eight parts of speech in the English language: noun, pronoun, verb, adjective, adverb, preposition, conjunction, and interjection. The part of speech indicates how the word functions in meaning as well as grammatically within the sentence. POS tagger 212 provides tags such as “He [PRON] lives [VERB] in [Preposition] USA [NOUN].” - The rules based
entity recognizer 213 recognizes an entity at the very first level (e.g., a first level entity) based on predefined linguistic rules and corpus/dictionaries, where the corpus contains a list of names of persons, organizations, locations, etc. For example, the rules based entity recognizer 213 identifies a word as a name when the word is identified as a noun by the POS tagger 212 and is also available in a dictionary of names. - The memory based
entity recognizer 214 recognizes composite entities such as “to date”, “from date”, “to location”, and “from location.” Memory based entity recognizer 214 uses the first level entities identified by the rules based entity recognizer 213 and machine learning based entity recognizer 220 to recognize the composite entities that include a base entity with a linguistic pattern (e.g., from city and to city). The memory based entity recognizer module 214 has the capability to learn linguistic patterns and store the linguistic patterns, base entity information, key word and its relative proximity to the base entity, in memory for future entity recognition processes. - The RegEx (regular expression) based
entity recognizer 217 and rules based entity recognizer 213 recognize the entity at a first level based on a predefined word structure and linguistic rules (e.g., USD200 or $200 or any date, etc.). - The
vectorizer 215 used in the present HER system 210 converts and translates text into numeric vectors used by the sequence classification machine learning algorithm, which is based on a back propagation neural network. Sequence-to-sequence prediction involves predicting an output sequence given an input sequence. Sequence prediction predicts elements of a sequence on the basis of the adjacent elements. The sequence-to-sequence classifier is a type of neural network that is trained using a back-propagation method that fine-tunes the weights of the neural network based on the error rate obtained in the previous epoch (e.g., iteration). A sequence-to-sequence model maps a fixed-length input to a fixed-length output, where the length of the input and output may differ, for example: -
- Step 1: Input Sentence: “James lives in USA”
- Step 2: Vectorization: convert input sentence into a numeric vector
- Step 3: Sequence-to-sequence classification: Pass numeric vectorized sequence to sequence classifier
- Step 4: Devectorization: Devectorize the numeric sequence received from the classifier
- Step 5: Output: “[Name] James [none] lives [none] in [Country] USA.”
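Steps 1 through 5 can be sketched as a round trip. The classifier below is a stand-in lookup table rather than the trained sequence-to-sequence network, and the numeric word codes are invented for illustration:

```python
# Stand-in vocabulary and tag table; in the patent the mapping from
# vector to tags is produced by a trained seq2seq classifier.
WORD_TO_CODE = {"James": 1, "lives": 2, "in": 3, "USA": 4}
CODE_TO_TAG = {1: "Name", 2: "none", 3: "none", 4: "Country"}

def vectorize(sentence):
    # Step 2: convert the input sentence into a numeric vector.
    return [WORD_TO_CODE[w] for w in sentence.split()]

def classify(vector):
    # Step 3: stand-in for the sequence-to-sequence classifier.
    return [CODE_TO_TAG[c] for c in vector]

def devectorize(sentence, tag_seq):
    # Step 4: turn the numeric/tag sequence back into tagged text.
    return " ".join(f"[{t}] {w}" for w, t in zip(sentence.split(), tag_seq))

s = "James lives in USA"
print(devectorize(s, classify(vectorize(s))))
# -> [Name] James [none] lives [none] in [Country] USA
```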
- The
vectorizer 215 preserves the POS information of individual words in a sentence, the number of occurrences of any POS in the sentence, and the entity information recognized by the rules based entity recognizer 213. The vectorizer 215 uses a hash table to assign the numeric value to every word's POS as POSbaseid. The vectorizer's numerical representation is given below: -
Vector=f(POSbaseid,Occurrence Number,Rulebase Entity Class id) - The
devectorizer 216 performs the opposite process of the vectorizer 215 to reconvert the output of a machine learning model (which would be in the form of a vector) to meaningful text with a clear identification of entities in the text (e.g., names of persons, organizations, locations, expressions of times, quantities, etc.). - The corpus data contains predefined entities like Person Name, Country Name, City Name, etc. The
corpus builder 218 uses memory based learning to add new entities to the corpus over time with additional training. -
Data sources 250 may include corpus data, application data, CRM, P2P systems, SAP, and Oracle. The corpus data contains predefined entities like Person Name, Country Name, City Name, etc. The present system 200 can be connected with any of the above mentioned systems to utilize existing information, which can be used in the form of pre-defined entities. - The
present system 200 adheres to SSL level security protocols 273. As an enterprise level application, the present system 200 adheres to the available security protocols and requirements of the enterprise system within which it operates. - The
present system 200 has capabilities that relate to API specifications to which it interfaces (e.g., microservices). These capabilities include being small in size, messaging-enabled, autonomous and being independently deployable. A microservice is not a layer within a monolithic application. Some of the benefits of microservice based APIs are: -
- Modularity: This makes the present system 200 easier to understand, develop, test, and more resilient to architecture erosion.
- Scalability: Because microservices are implemented and deployed independently of each other (e.g., they run within independent processes), they can be monitored and scaled independently.
- Integration of heterogeneous and legacy systems: microservices can be used to modernize existing monolithic software applications.
- Distributed development: Teams develop, deploy and scale their respective services to independently parallelize development. Microservices allow the architecture of an individual service to emerge through continuous refactoring. Microservice-based architectures facilitate continuous delivery and deployment.
- Multiple types of
consumers 240 can consume these microservices using orchestration 230. The orchestration layer 230 controls and manages the communications between HER services 211-218. The orchestration layer 230 contains a directory of services along with a listing of its capabilities. Based on the type of request and business logic, the orchestration layer 230 manages the communication between the HER services 211-218. Communication between the HER services 211-218 uses JavaScript Object Notation (JSON). Typically, a consumer 240-243 sends text in the form of a request in a JSON object for entity recognition and orchestration layer 230 returns the list of extracted entities, as shown below: def get_all_entities_service(input_sentence<JSON>, Return_all_entities_list<JSON>) - Example values of input and output parameters include:
-
- Input_Sentence: “James lives in USA” <pass as JSON Object>
- Return_all_entities_list: “[Name] James [none] lives [none] in [Country] USA.” <get as JSON>
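The JSON exchange might look as follows; the service body below fakes the recognition step with a lookup, and only the JSON-in/JSON-out convention and the field names above are taken from the text:

```python
import json

def get_all_entities_service(request_json):
    # Parse the JSON request, tag the sentence (stand-in lookup), and
    # return the extracted entity list as a JSON response.
    sentence = json.loads(request_json)["Input_Sentence"]
    tags = {"James": "Name", "USA": "Country"}  # illustrative corpus
    tagged = " ".join(f"[{tags.get(w, 'none')}] {w}"
                      for w in sentence.split())
    return json.dumps({"Return_all_entities_list": tagged})

req = json.dumps({"Input_Sentence": "James lives in USA"})
print(get_all_entities_service(req))
# -> {"Return_all_entities_list": "[Name] James [none] lives [none] in [Country] USA"}
```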
-
Integration connectors 280 connect the present system 200 with different data sources 250, such as a database, CRM, P2P systems, SAP, Oracle, etc. Integration connectors 280 include driver libraries, such as JDBC driver, ODBC driver, SAP JDBC driver, etc. Integration connectors 280 use these drivers to make and establish the connection between the HER system 210 and data source 250. -
FIG. 3 illustrates an exemplary HER system architecture 300, according to another embodiment. Components in FIG. 3 that are shared with the HER system 200 of FIG. 2 operate as described above. The context engine 326 keeps the context of up to three levels to identify the appropriate entity in the sentence. The context is used to identify the indirect entities addressed by pronouns like he, she, it, etc. For example, consider the sentence “I am working with Genpact and I want to know its last year performance numbers.” In this sentence the context engine 326 identifies that the subject is Genpact and that in the second portion of the sentence “its” means Genpact. The rule based entity recognizer 331 recognizes “its” as Genpact. - The
training module 370 trains a machine learning model and memory learning model with new datasets when a user feeds the data to the HER system 320. The HER system 320 identifies the tagged entities, which are identified by the rule based entity recognizer 331 and RegEx based entity recognizer 333. The user may make corrections, tag the untagged entities and then perform the training using the user interface of the training module 370. Using the training module 370, a user may introduce new types of basic entities and composite entities. - As an enterprise level application, the present HER
system 200 interacts with business layer 350. Business layer 350 is an intermediate layer between the HERsystem 320 and the external source systems 360 (e.g., a legacy enterprise system). The business layer 350 has business logic and rules. A business logic or rule can be represented as: -
- If <condition(s)> Then <consequence(s)>
- When <condition(s)> Then <imposition(s)> Otherwise <consequence(s)>
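A tiny evaluator for the two rule shapes above; the dictionary encoding of conditions and the sample domain/word values are illustrative assumptions, not the business layer's actual representation:

```python
def eval_rule(rule, facts):
    # If every condition holds, return the consequence ("Then" branch);
    # otherwise return the "Otherwise" branch when one is defined.
    if all(facts.get(k) == v for k, v in rule["if"].items()):
        return rule["then"]
    return rule.get("otherwise")

# Illustrative banking-domain rule for the word "card".
rule = {"if": {"domain": "banking", "word": "card"},
        "then": "credit card or debit card",
        "otherwise": "unresolved"}

print(eval_rule(rule, {"domain": "banking", "word": "card"}))
# -> credit card or debit card
print(eval_rule(rule, {"domain": "electronics", "word": "card"}))
# -> unresolved
```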
- In
system 300, the business logic/rules of business layer 350 provide domain specific knowledge. For example, in the banking domain, the meaning of “card” is a credit card or a debit card. In electronics engineering the meaning of “card” is a PCB (Printed Circuit Board). As another example of a business rule, the word “net” in the financial domain has the meaning of gross. However, in the fishing industry the meaning refers to a fish net. In the IT domain the meaning of “net” is Internet. - The business layer 350 contains business rules. The external source systems 360 (e.g., ERP, P2P, CMS systems) use business layer services 350 to display data, or to consume data. The business rules provide data used by the
external source systems 360. - The
source systems 360 are the consumers of HER system 320. Any consumer source system 360 (e.g., ERP, P2P, CMS systems) sends a request along with the text/sentence in JSON format to HER system 320 for entity extraction/recognition. - Artificial intelligence (AI)
system 310 can utilize the HER system 200 to capture the important information in the form of entities from natural language/free flow text. AI systems artificially mimic human intelligence processes. These processes include learning (the acquisition of information and rules for using the information), reasoning (using rules to reach approximate or definite conclusions) and self-correction. Particular applications of AI include expert systems, speech recognition and machine vision. - In
system 300, the client 305 may be a human consumer of HER system 320, who may be a trainer, developer or user, interacting directly with the system or through a mobile device or Interactive Voice Response (IVR) system. To manage the communication between the HER services 211-218, the orchestration layer 321 controls and manages the communication between all the internal HER services. The orchestration layer 321 contains a directory of services along with a listing of each service's capabilities. For example, some of the services may be: -
- def get_preprocessor_service(input_sentence<JSON>, Return_processed_sent<JSON>)
- def get_pOStagger_service(input_sentence<JSON>, Return_tagged_Sentence<JSON>)
- def get_vectorizer_service(input_sentence<JSON>, Return_vector<JSON>)
- def get_devectorizer_service(input_vector<JSON>, Return_sentence<JSON>)
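The orchestration layer's sequencing of such services can be sketched as below; the service bodies are stand-ins (a whitespace cleaner and a toy vocabulary), and only the JSON-in/JSON-out calling convention follows the definitions above:

```python
import json

def get_preprocessor_service(input_sentence):
    # Stand-in preprocessor: collapse extra whitespace.
    s = json.loads(input_sentence)["sentence"]
    return json.dumps({"sentence": " ".join(s.split())})

def get_vectorizer_service(input_sentence):
    # Stand-in vectorizer: map words to invented numeric codes.
    s = json.loads(input_sentence)["sentence"]
    codes = {"James": 1, "lives": 2, "in": 3, "USA": 4}
    return json.dumps({"vector": [codes.get(w, 0) for w in s.split()]})

def orchestrate(request):
    # The orchestration layer calls the services in a predefined order.
    cleaned = get_preprocessor_service(request)
    return get_vectorizer_service(cleaned)

print(orchestrate(json.dumps({"sentence": "James  lives in USA"})))
# -> {"vector": [1, 2, 3, 4]}
```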
- Based on the type of request and the business logic/rules,
orchestration layer 321 manages communications between the HER services 322-334, where according to one embodiment the communications use the JSON format. Predefined business logic/rules determine the sequence in which orchestration layer 321 calls HER services 322-334 in order to provide extracted basic and composite entities. - A custom corpus 343 provides the flexibility to store metadata and data of user defined entities. During new entity or custom entity training, the
system 320 captures that entity and stores the entity into custom corpus 343. For example, a business requirement may be that a designation such as CTO, CDO, VP, AVP, and SM should be recognized as a designation. Then during learning the HER system 320 stores all these custom words into custom corpus 343. -
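A custom corpus entry for the designation example might be stored and matched as follows; the storage format (a class-to-word-set mapping) is an assumption:

```python
# User defined entity class and its learned words; the representation
# as a class-to-word-set mapping is an illustrative assumption.
custom_corpus = {"designation": {"cto", "cdo", "vp", "avp", "sm"}}

def custom_entities(sentence):
    # Match each word of the sentence against every custom entity class.
    found = {}
    for word in sentence.split():
        for entity_class, words in custom_corpus.items():
            if word.lower().strip(".,") in words:
                found[word] = entity_class
    return found

print(custom_entities("She was promoted to AVP last year"))
# -> {'AVP': 'designation'}
```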
FIG. 4 illustrates an exemplary HER system process for learning 400, according to one embodiment. The learning phase of the system utilizes the processing modules to process/learn the sentences with the entities selected in them by the user. As shown in FIG. 4 the text preprocessing module 405 cleans and massages the input text, string, or sentence 401. The preprocessor 405 removes extra spaces, and other non-useful special symbols. The preprocessor 405 performs spelling correction and grammar corrections to the input text, string, or sentence. - Then the cleansed sentence is passed to the
POS Tagger 410, as well as to the rules based entity recognizer 415. The POS tagger module 410 tags the POS for each word of the input text 401. The rules based entity recognizer 415 identifies the first level entities in the input text 401. Then the output of both the POS tagger 410 and rules based entity recognizer module 415 is passed to the vectorizer 425, which combines these outputs and creates/translates the input sentence 401 into a vector representation that is further used by the machine learning based entity recognizer module 441 for training (e.g., the machine learning model is generated based on manually tagged data and entities 443 tagged by the rules based entity recognizer 415).
- The
vectorizer 425 and devectorizer 444 processes use hash table 430, which contains the base numeric code of the entity tag set. The hash table 430 contains the tag set based on the numbering of the list of entities and parts-of-speech (e.g., a tag set may be noun, pronoun, adverb, verb, helping verb, etc.). Every entity in a tag set is given a numeric code, which is used to generate the vector or to transform the vector into tagged entities. For example: - Input Sentence: He[Pro Noun] is[Helping Verb] Tom[Noun]
-
-
- POSBase id of “Noun” is 0.10 and occurrence number is 2.
- Entity id of Tom is 10 by using Hash Table: Entity Class
by using the equation:
-
Vector=f(POSbaseid,Occurrence Number,Rulebase Entity Class id)
Vector[Tom] =f(0.10,2,10)=>0.10110 - The
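The packing that yields 0.10110 is not fully spelled out in the text; concatenating the POSbase id, an occurrence index and the entity class id reproduces the worked value if the occurrence is counted zero-based (the second noun has index 1) — an assumption, sketched below:

```python
def vectorize_word(pos_base_id, occurrence_index, entity_class_id):
    # Pack the three components into one code by string concatenation.
    # Assumption: the occurrence is a zero-based index, so the second
    # noun in the sentence contributes "1" rather than "2".
    return f"{pos_base_id}{occurrence_index}{entity_class_id}"

# "Tom": POSbase id of Noun = 0.10, second noun (index 1), entity id 10
print(vectorize_word("0.10", 1, 10))
# -> 0.10110, matching the worked example above
```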
vectorizer 425 preserves the POS information of individual words in a sentence, the number of occurrences of any POS in the sentence, and the entity information recognized by the rules based entity recognizer 415. The vectorizer 425 uses hash table 430 to assign the numeric value to every word's POS as POSbaseid. The vectorizer's numerical representation is: -
Vector=f(POSbaseid,Occurrence Number,Rulebase Entity Class id) - The
devectorizer 444 performs the opposite process of the vectorizer 425 to reconvert the output of a machine learning model (which would be in the form of a vector) to meaningful text with a clear identification of entities in the text (e.g., names of persons, organizations, locations, expressions of times, quantities, etc.), and the identified entities are stored into the base entities extracted 445. The base entities extracted 445 feed into the memory learning based entity recognizer 446 to recognize composite entities. The present system 200 uses memory based learning to create and learn computational linguistics based patterns (e.g., to date and from date, credit amount and debit amount, from city and to city, etc.) - The RegEx (regular expression) based
entity recognizer 420 recognizes the entity based on a predefined word or character level structure as a regular expression (e.g., USD200 or $200 or any date, etc.). - This information is used as feedback to the machine learning model and memory learning based linguistic patterns module to learn new first level entities, as well as composite entities. After this feedback and training process the present HER
system 300 generates two models, e.g., amachine learning model 460, and a memory learning based linguisticpatterns module model 450. These models can be used to recognize the entities from the text that will be analyzed by the HERsystem 300 after the learning phase is completed. During the training process if the user wants to train the HERsystem 300 for composite entities, the system uses memory based learning to create and learn computational linguistics based patterns, to recognize composite entities (e.g., to date and from date, credit amount and debit account, from city and to city, etc.) -
FIGS. 5a and 5b illustrate an exemplary HER system process for recognizing and extracting entities, according to one embodiment. This hybrid approach uses machine learning, memory learning, and a rules based system to recognize entities from the given sentences; it also uses the POS. The present system 300 learns new entities from the text. After the learning phase is complete, the HER entity recognition modules shown in FIG. 5a operate as described above with FIG. 4. The process and modules are similar to the HER learning phase 400, except that the machine learning model 541, as shown in FIG. 5a, recognizes the entities from new text being presented to and processed by the present HER system 300. - HER
system 300 has an entity corpus 535 that contains the metadata and data of pre-defined entities (e.g., names of persons, organizations, locations, expressions of times, quantities, etc.). HER system 300 finds and matches the predefined entities using the entity corpus 535. - The extracted entities are stored in
entity bucket 560. Entity bucket 560 is the mechanism used to store the base entities temporarily for further processing. After being stored in entity bucket 560, the composite entities are extracted using memory learning and linguistic pattern based models 565. Then the base entities and extracted composite entities are stored into processed base entity buckets 570 and processed composite entity buckets 575, respectively. - Both the learning and run-time phases use Statistical Machine Learning (SML) and Memory Based Learning (MBL), which work on linguistic or lexical patterns. The
present system 300 extracts entities from any given text input(s) and learns new entities. The present system 300 uses a hybrid approach that leverages machine learning 334 for extracting the base entities and a memory based entity recognizer 332 for extracting the linguistic pattern based composite entities (e.g., to date and from date, credit amount and debit account, from city and to city, etc.). - The present HER
system 300 may be used with any Artificial Intelligence (AI) System and Automation System 310. The following is a list of technology applications for the present HER system: -
- Personal Assistants
- Expert Q&A systems
- Domain Specific Expert Assistants
- Service Desk/Help Desk
- Customer Contact Center
- Outbound for Data Gathering
- AML, Fraud (Risk) detection
- Sales of low involvement Products
- Smart Home/Connected Devices Manager
- Concierge Services
-
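The composite-entity step described earlier — memory-learned linguistic patterns that pair cue words with base entities (e.g., from city and to city, to date and from date) — can be sketched as a lookup of the word preceding each recognized base entity. The function names and the pattern table are illustrative assumptions, not the patent's actual implementation.

```python
# Assumed memory-learned linguistic patterns: a cue word immediately
# preceding a base entity class maps to a composite-entity role.
LEARNED_PATTERNS = {
    ("from", "city"): "from_city",
    ("to", "city"): "to_city",
    ("from", "date"): "from_date",
    ("to", "date"): "to_date",
}

def extract_composites(tokens, base_entities):
    """Assign composite roles to base entities using the cue word before each.

    `tokens` is the tokenized sentence; `base_entities` maps a token index
    to the base entity class produced by the base recognizers.
    """
    composites = {}
    for idx, entity_class in base_entities.items():
        cue = tokens[idx - 1].lower() if idx > 0 else None
        role = LEARNED_PATTERNS.get((cue, entity_class))
        if role:
            composites[role] = tokens[idx]
    return composites
```

In this sketch, learning a new composite pattern amounts to adding a (cue word, entity class) entry to the pattern table, which mirrors the memory-based (instance-storing) character of MBL.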
FIG. 6 shows an exemplary general purpose computing device in the form of a computer 130, according to one embodiment. A computer such as the computer 130 is suitable for use in the other figures illustrated and described herein. Computer 130 has one or more processors or processing units 132 and a system memory 134. In the illustrated embodiment, a system bus 136 couples various system components including the system memory 134 to the processors 132. The bus 136 represents one or more of any of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, and a processor or local bus using any of a variety of bus architectures. By way of example, and not limitation, such architectures include Industry Standard Architecture (ISA) bus, Micro Channel Architecture (MCA) bus, Enhanced ISA (EISA) bus, Video Electronics Standards Association (VESA) local bus, and Peripheral Component Interconnect (PCI) bus, also known as a mezzanine bus. - The
computer 130 typically has at least some form of computer readable media. Computer readable media, which include both volatile and nonvolatile media, removable and non-removable media, may be any available medium that can be accessed by computer 130. By way of example and not limitation, computer readable media comprise computer storage media and communication media. Computer storage media include volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data. For example, computer storage media include RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to store the desired information and that can be accessed by computer 130. Communication media typically embody computer readable instructions, data structures, program modules, or other data in a modulated data signal such as a carrier wave or other transport mechanism and include any information delivery media. Those skilled in the art are familiar with the modulated data signal, which has one or more of its characteristics set or changed in such a manner as to encode information in the signal. Wired media, such as a wired network or direct-wired connection, and wireless media, such as acoustic, RF, infrared, and other wireless media, are examples of communication media. Combinations of any of the above are also included within the scope of computer readable media. - The
system memory 134 includes computer storage media in the form of removable and/or non-removable, volatile and/or nonvolatile memory. In the illustrated embodiment, system memory 134 includes read only memory (ROM) 138 and random access memory (RAM) 140. A basic input/output system 142 (BIOS), containing the basic routines that help to transfer information between elements within computer 130, such as during start-up, is typically stored in ROM 138. RAM 140 typically contains data and/or program modules that are immediately accessible to and/or presently being operated on by processing unit 132. By way of example, and not limitation, FIG. 6 illustrates operating system 144, application programs 146, other program modules 148, and program data 150. - The
computer 130 may also include other removable/non-removable, volatile/nonvolatile computer storage media. For example, FIG. 6 illustrates a hard disk drive 154 that reads from or writes to non-removable, nonvolatile magnetic media. FIG. 6 also shows a magnetic disk drive 156 that reads from or writes to a removable, nonvolatile magnetic disk 158, and an optical disk drive 160 that reads from or writes to a removable, nonvolatile optical disk 162 such as a CD-ROM or other optical media. Other removable/non-removable, volatile/nonvolatile computer storage media that can be used in the exemplary operating environment include, but are not limited to, magnetic tape cassettes, flash memory cards, digital versatile disks, digital video tape, solid state RAM, solid state ROM, and the like. The hard disk drive 154, magnetic disk drive 156, and optical disk drive 160 are typically connected to the system bus 136 by a nonvolatile memory interface, such as interface 166. - The drives or other mass storage devices and their associated computer storage media discussed above, provide storage of computer readable instructions, data structures, program modules and other data for the
computer 130. Hard disk drive 154 is illustrated as storing operating system 170, application programs 172, other program modules 174, and program data 176. Note that these components can either be the same as or different from operating system 144, application programs 146, other program modules 148, and program data 150. Operating system 170, application programs 172, other program modules 174, and program data 176 are given different numbers here to illustrate that, at a minimum, they are different copies. - A user may enter commands and information into
computer 130 through input devices or user interface selection devices such as a keyboard 180 and a pointing device 182 (e.g., a mouse, trackball, pen, or touch pad). Other input devices (not shown) may include a microphone, joystick, game pad, satellite dish, scanner, or the like. These and other input devices are connected to processing unit 132 through a user input interface 184 that is coupled to system bus 136, but may be connected by other interface and bus structures, such as a parallel port, game port, or a Universal Serial Bus (USB). A monitor 188 or other type of display device is also connected to system bus 136 via an interface, such as a video interface 190. In addition to the monitor 188, computers often include other peripheral output devices (not shown), such as a printer and speakers, which may be connected through an output peripheral interface (not shown). - The
computer 130 may operate in a networked environment using logical connections to one or more remote computers, such as a remote computer 194. The remote computer 194 may be a personal computer, a server, a router, a network PC, a peer device, or other common network node, and typically includes many or all of the elements described above relative to computer 130. The logical connections depicted in FIG. 6 include a local area network (LAN) 196 and a wide area network (WAN) 198, but may also include other networks. Such networking environments are commonplace in offices, enterprise-wide computer networks, intranets, and global computer networks (e.g., the Internet). - When used in a local area networking environment,
computer 130 is connected to the LAN 196 through a network interface or adapter 186. When used in a wide area networking environment, computer 130 typically includes a modem 178 or other means for establishing communications over the WAN 198, such as the Internet. The modem 178, which may be internal or external, is connected to system bus 136 via the user input interface 184, or other appropriate mechanism. In a networked environment, program modules depicted relative to computer 130, or portions thereof, may be stored in a remote memory storage device (not shown). By way of example, and not limitation, FIG. 6 illustrates remote application programs 192 as residing on the memory device. It will be appreciated that the network connections shown are exemplary and other means of establishing a communications link between the computers may be used. - Generally, the data processors of
computer 130 are programmed using instructions stored at different times in the various computer-readable storage media of the computer. Programs and operating systems are typically distributed, for example, on floppy disks or CD-ROMs. From there, they are installed or loaded into the secondary memory of a computer. At execution, they are loaded at least partially into the computer's primary electronic memory. The invention described herein includes these and other various types of computer-readable storage media when such media contain instructions or programs for implementing the steps described below in conjunction with a microprocessor or other data processor. The invention also includes the computer itself when programmed according to the methods and techniques described herein. - For purposes of illustration, programs and other executable program components, such as the operating system, are illustrated herein as discrete blocks. It is recognized, however, that such programs and components reside at various times in different storage components of the computer, and are executed by the data processor(s) of the computer.
- Although described in connection with an exemplary computing system environment, including
computer 130, the invention is operational with numerous other general purpose or special purpose computing system environments or configurations. The computing system environment is not intended to suggest any limitation as to the scope of use or functionality of the invention. Moreover, the computing system environment should not be interpreted as having any dependency or requirement relating to any one or combination of components illustrated in the exemplary operating environment. Examples of well-known computing systems, environments, and/or configurations that may be suitable for use with the invention include, but are not limited to, personal computers, server computers, hand-held or laptop devices, multiprocessor systems, microprocessor-based systems, set top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, and the like. - The invention may be described in the general context of computer-executable instructions, such as program modules, executed by one or more computers or other devices. Generally, program modules include, but are not limited to, routines, programs, objects, components, and data structures that perform particular tasks or implement particular abstract data types. The invention may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
- While the present disclosure has been described in terms of particular embodiments and applications, in summarized form, it is not intended that these descriptions in any way limit its scope to any such embodiments and applications, and it will be understood that many substitutions, changes and variations in the described embodiments, applications, and details of the method and system illustrated herein and of their operation can be made by those skilled in the art without departing from the scope of the present disclosure.
Claims (21)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US18/152,198 US20230229860A1 (en) | 2019-01-08 | 2023-01-10 | Method and system for hybrid entity recognition |
Applications Claiming Priority (3)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US201962789751P | 2019-01-08 | 2019-01-08 | |
US16/721,452 US11580301B2 (en) | 2019-01-08 | 2019-12-19 | Method and system for hybrid entity recognition |
US18/152,198 US20230229860A1 (en) | 2019-01-08 | 2023-01-10 | Method and system for hybrid entity recognition |
Related Parent Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US16/721,452 Continuation US11580301B2 (en) | 2019-01-08 | 2019-12-19 | Method and system for hybrid entity recognition |
Publications (1)
Publication Number | Publication Date |
---|---|
US20230229860A1 true US20230229860A1 (en) | 2023-07-20 |
Family
ID=71403776
Family Applications (2)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US16/721,452 Active 2040-12-17 US11580301B2 (en) | 2019-01-08 | 2019-12-19 | Method and system for hybrid entity recognition |
US18/152,198 Pending US20230229860A1 (en) | 2019-01-08 | 2023-01-10 | Method and system for hybrid entity recognition |
Family Applications Before (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US16/721,452 Active 2040-12-17 US11580301B2 (en) | 2019-01-08 | 2019-12-19 | Method and system for hybrid entity recognition |
Country Status (1)
Country | Link |
---|---|
US (2) | US11580301B2 (en) |
Families Citing this family (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112749551A (en) * | 2020-12-31 | 2021-05-04 | 平安科技(深圳)有限公司 | Text error correction method, device and equipment and readable storage medium |
US11431472B1 (en) | 2021-11-22 | 2022-08-30 | Morgan Stanley Services Group Inc. | Automated domain language parsing and data extraction |
CN118966225B (en) * | 2024-10-12 | 2024-12-24 | 贵州大学 | Named entity recognition method based on mixed scale sentence representation |
Family Cites Families (21)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6311152B1 (en) | 1999-04-08 | 2001-10-30 | Kent Ridge Digital Labs | System for chinese tokenization and named entity recognition |
TWI256562B (en) | 2002-05-03 | 2006-06-11 | Ind Tech Res Inst | Method for named-entity recognition and verification |
US7421386B2 (en) * | 2003-10-23 | 2008-09-02 | Microsoft Corporation | Full-form lexicon with tagged data and methods of constructing and using the same |
US20060047500A1 (en) | 2004-08-31 | 2006-03-02 | Microsoft Corporation | Named entity recognition using compiler methods |
US7685201B2 (en) | 2006-09-08 | 2010-03-23 | Microsoft Corporation | Person disambiguation using name entity extraction-based clustering |
US20090070325A1 (en) * | 2007-09-12 | 2009-03-12 | Raefer Christopher Gabriel | Identifying Information Related to a Particular Entity from Electronic Sources |
US8594996B2 (en) | 2007-10-17 | 2013-11-26 | Evri Inc. | NLP-based entity recognition and disambiguation |
US8346534B2 (en) | 2008-11-06 | 2013-01-01 | University of North Texas System | Method, system and apparatus for automatic keyword extraction |
EP2488963A1 (en) * | 2009-10-15 | 2012-08-22 | Rogers Communications Inc. | System and method for phrase identification |
JP2011216071A (en) * | 2010-03-15 | 2011-10-27 | Sony Corp | Device and method for processing information and program |
JP5474704B2 (en) * | 2010-08-16 | 2014-04-16 | Kddi株式会社 | Binary relation classification program, method, and apparatus for classifying semantically similar situation pairs into binary relations |
CN103154936B (en) * | 2010-09-24 | 2016-01-06 | 新加坡国立大学 | For the method and system of robotization text correction |
US9235812B2 (en) * | 2012-12-04 | 2016-01-12 | Msc Intellectual Properties B.V. | System and method for automatic document classification in ediscovery, compliance and legacy information clean-up |
RU2665239C2 (en) * | 2014-01-15 | 2018-08-28 | Общество с ограниченной ответственностью "Аби Продакшн" | Named entities from the text automatic extraction |
US20190180196A1 (en) * | 2015-01-23 | 2019-06-13 | Conversica, Inc. | Systems and methods for generating and updating machine hybrid deep learning models |
US10210862B1 (en) * | 2016-03-21 | 2019-02-19 | Amazon Technologies, Inc. | Lattice decoding and result confirmation using recurrent neural networks |
US10498898B2 (en) * | 2017-12-13 | 2019-12-03 | Genesys Telecommunications Laboratories, Inc. | Systems and methods for chatbot generation |
US10664540B2 (en) * | 2017-12-15 | 2020-05-26 | Intuit Inc. | Domain specific natural language understanding of customer intent in self-help |
US11043214B1 (en) * | 2018-11-29 | 2021-06-22 | Amazon Technologies, Inc. | Speech recognition using dialog history |
EP3903200A4 (en) * | 2018-12-25 | 2022-07-13 | Microsoft Technology Licensing, LLC | DATE EXTRACTOR |
US10992632B2 (en) * | 2019-01-03 | 2021-04-27 | International Business Machines Corporation | Content evaluation |
-
2019
- 2019-12-19 US US16/721,452 patent/US11580301B2/en active Active
-
2023
- 2023-01-10 US US18/152,198 patent/US20230229860A1/en active Pending
Also Published As
Publication number | Publication date |
---|---|
US11580301B2 (en) | 2023-02-14 |
US20200218856A1 (en) | 2020-07-09 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Poongodi et al. | Chat-bot-based natural language interface for blogs and information networks | |
US11250033B2 (en) | Methods, systems, and computer program product for implementing real-time classification and recommendations | |
US11086601B2 (en) | Methods, systems, and computer program product for automatic generation of software application code | |
US10705796B1 (en) | Methods, systems, and computer program product for implementing real-time or near real-time classification of digital data | |
de Araújo et al. | Re-bert: automatic extraction of software requirements from app reviews using bert language model | |
US11556716B2 (en) | Intent prediction by machine learning with word and sentence features for routing user requests | |
US11501080B2 (en) | Sentence phrase generation | |
CN110337645B (en) | Adaptable processing assembly | |
US20230229860A1 (en) | Method and system for hybrid entity recognition | |
US10467122B1 (en) | Methods, systems, and computer program product for capturing and classification of real-time data and performing post-classification tasks | |
US20220245353A1 (en) | System and method for entity labeling in a natural language understanding (nlu) framework | |
US12175196B2 (en) | Operational modeling and optimization system for a natural language understanding (NLU) framework | |
CN106407211A (en) | Method and device for classifying semantic relationships among entity words | |
US20220245361A1 (en) | System and method for managing and optimizing lookup source templates in a natural language understanding (nlu) framework | |
US12197869B2 (en) | Concept system for a natural language understanding (NLU) framework | |
US20220238103A1 (en) | Domain-aware vector encoding (dave) system for a natural language understanding (nlu) framework | |
JP7297458B2 (en) | Interactive content creation support method | |
US12175193B2 (en) | System and method for lookup source segmentation scoring in a natural language understanding (NLU) framework | |
US20220229987A1 (en) | System and method for repository-aware natural language understanding (nlu) using a lookup source framework | |
WO2020242383A9 (en) | Conversational dialogue system and method | |
US20220245352A1 (en) | Ensemble scoring system for a natural language understanding (nlu) framework | |
Sharma et al. | An Optimized Approach for Sarcasm Detection Using Machine Learning Classifier | |
Sharma et al. | Weighted Ensemble LSTM Model with Word Embedding Attention for E-Commerce Product Recommendation | |
Mote | Natural language processing-a survey | |
Jolly et al. | Anatomizing lexicon with natural language Tokenizer Toolkit 3 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
AS | Assignment |
Owner name: GENPACT LUXEMBOURG S.A R.L. II, LUXEMBOURG Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:NARAYAN, RAVI;KHOKHAR, SUNIL KUMAR;MEHTA, VIKAS;AND OTHERS;SIGNING DATES FROM 20220306 TO 20220504;REEL/FRAME:062615/0850 |
|
STPP | Information on status: patent application and granting procedure in general |
Free format text: DOCKETED NEW CASE - READY FOR EXAMINATION |
|
AS | Assignment |
Owner name: GENPACT USA, INC., NEW YORK Free format text: MERGER;ASSIGNOR:GENPACT LUXEMBOURG S.A R.L. II;REEL/FRAME:066511/0683 Effective date: 20231231 |
|
AS | Assignment |
Owner name: GENPACT USA, INC., NEW YORK Free format text: CORRECTIVE ASSIGNMENT TO CORRECT THE CONVEYANCE TYPE OF MERGER PREVIOUSLY RECORDED ON REEL 66511 FRAME 683. ASSIGNOR(S) HEREBY CONFIRMS THE CONVEYANCE TYPE OF ASSIGNMENT;ASSIGNOR:GENPACT LUXEMBOURG S.A R.L. II;REEL/FRAME:067211/0020 Effective date: 20231231 |