
CN112189206B - Processing personal data using machine learning algorithms and applications thereof - Google Patents


Info

Publication number
CN112189206B
Authority
CN
China
Prior art keywords
data
task
queue
request
person
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201980024828.1A
Other languages
Chinese (zh)
Other versions
CN112189206A (en)
Inventor
Robert Raymond Lindner (罗伯特·雷蒙德·林德内尔)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Vader Data Solutions
Original Assignee
Vader Data Solutions
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from US15/948,652 (US10990900B2)
Priority claimed from US15/948,646 (US20190311372A1)
Priority claimed from US15/948,604 (US11568302B2)
Application filed by Vader Data Solutions
Publication of CN112189206A
Application granted
Publication of CN112189206B


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/33: Querying
    • G06F 16/335: Filtering based on additional data, e.g. user or group profiles
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00: Machine learning
    • G06N 20/20: Ensemble learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Medical Treatment And Welfare Office Work (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

Training data is required in order to train a model. Because personal data changes over time, the training data may become outdated, diminishing its usefulness for training the model. Embodiments address this problem by developing a database with a running log that specifies how each person's data changed over time. When data is ingested, it may not be standardized. To address this problem, embodiments clean the data to ensure that ingested data fields are standardized. Finally, the various tasks required to train models and resolve the accuracy of personal data can quickly become burdensome for computing devices. They may conflict with one another and use computing resources, such as processor power and memory capacity, inefficiently. To solve these problems, a scheduler is used to queue the various tasks involved.

Description

Processing personal data using machine learning algorithms and applications thereof
Technical Field
This field is generally related to processing information.
Background
As technology advances, more and more personal data is digitized, and as a result, more and more personal data becomes legitimately accessible. The increased accessibility of personal data has spawned new industries focused on legally mining personal data.
A personal data record may include a number of attributes. A data record representing a person may include attributes such as the person's name, city, state, and zip code. In addition to demographic information, the data record may also include information about the person's behavior. Data records from different sources may include different attributes. Systems exist for collecting information describing the characteristics or behaviors of individual persons. The collection of such personal information has many applications, including in national security, law enforcement, marketing, healthcare, and insurance.
For example, in healthcare, a healthcare provider may have inconsistent personal information (such as address information) across various data sources, including National Provider Identifier (NPI) registration, Drug Enforcement Administration (DEA) registration, public resources (e.g., internet sites such as the YELP review site), and proprietary sources such as health insurance company claim information.
As records receive more updates from different sources, they also present a greater risk of inconsistencies and data-entry errors. In these ways, data records that all describe the same person may be inconsistent or erroneous in their content. Across these various sources, a single healthcare provider may have many addresses, perhaps up to 200. The sources may not agree on the correct address. Some healthcare providers have multiple correct addresses, and the fact that a provider has a newer address does not mean that an older address is incorrect.
Some health and dental insurance companies require employees to manually call healthcare providers to determine their correct addresses. However, such manual updates are costly because a healthcare provider's address information may change frequently. In addition to address information, other demographic information related to healthcare providers, such as telephone numbers, presents similar problems.
In addition, fraudulent claims are a major problem in healthcare. It is estimated that fraudulent claims may drain more than $80 billion per year from government-operated health insurance programs alone. The sheer volume of fraud far exceeds the investigative resources of law enforcement and insurance companies.
Data-driven algorithms, known as machine learning algorithms, may be used to make predictions and conduct certain data analyses. Machine learning is a field of computer science that gives computers the ability to learn without being explicitly programmed. In data analysis, machine learning is a method for designing complex models and algorithms that can be used for prediction and estimation.
To develop these models, they must first be trained. Typically, training involves inputting a set of parameters called features, together with known correct or incorrect values for those input features. After training, the model can be applied to new features for which the appropriate solution is unknown. Applied in this way, the model can predict or estimate solutions for situations that are not yet known. These models may discover hidden insights by learning from historical relationships and trends in the database. The quality of a machine learning model may depend on the quality and quantity of the underlying training data.
Systems and methods are needed to improve the identification and prediction of correct personal information (such as demographic information and fraud propensity of healthcare providers) or data sources.
Disclosure of Invention
In an embodiment, a computer-implemented method trains a machine learning algorithm using time-varying (temporally variant) personal data. At various times, data sources are monitored to determine whether data related to a person has been updated. When the person's data has been updated, the updated data is stored in a database such that the database includes a running log specifying how the person's data changes over time. The person's data includes values for a plurality of attributes associated with the person. An indication is received that a value of a particular attribute in the person's data was verified as accurate or inaccurate at a particular time. Based on the particular time, the person's data is retrieved from the database, including the values of the plurality of attributes that were current at the particular time. Using the retrieved data and the indication, a model may be trained so that it can predict whether a value of the particular attribute is accurate for another person. In this way, the retrieved data is anchored to the particular time, preserving its meaning when training the model.
In an embodiment, a computer-implemented method correlates different demographic data about a person. In the method, a plurality of different values describing the same attribute of the person are received from a plurality of different data sources. It is determined whether any of the different values represent the same trait. When different values are determined to represent the same trait, the value that most accurately represents the trait is selected, and the values determined to represent the same trait are linked.
In an embodiment, a system schedules data ingestion and machine learning. The system includes a computing device, a database, a queue stored on the computing device, and a scheduler implemented on the computing device. The scheduler is configured to place a request to complete a job on the queue. The request includes instructions to complete at least one of a data ingest task, a training task, and a solution task. The system also includes three processes, each implemented on the computing device and monitoring the queue: a data ingest process, a trainer process, and a solver process. When the queue includes a request to complete a data ingest task, the data ingest process retrieves data related to a person from a data source and stores the retrieved data in the database. When the queue includes a request to complete a training task, the trainer process trains a model using the retrieved data in the database and an indication that a value of a particular attribute in the person's data was verified as accurate or inaccurate. The model is trained so that it can predict whether a value of the particular attribute is accurate for another person. Finally, when the queue includes a request to complete a solution task, the solver process applies the model to predict whether values of the plurality of attributes are accurate for another person.
Methods, systems, and computer program product embodiments are also disclosed.
Further embodiments, features, and advantages of the present inventions, as well as the structure and operation of the various embodiments, are described in detail below with reference to the accompanying drawings.
Drawings
The accompanying drawings, which are incorporated herein and form a part of the specification, illustrate the present disclosure and, together with the description, further serve to explain the principles of the disclosure and to enable a person skilled in the pertinent art to make and use the disclosure.
FIG. 1 is a schematic diagram illustrating training of a machine learning model with time-varying data, according to an embodiment.
FIG. 2 is a flow chart illustrating a method of ingesting data and training a model according to an embodiment.
FIG. 3 is a schematic diagram illustrating an example of ingesting data to train a model according to an embodiment.
Fig. 4 is a flow chart illustrating a method of applying a model according to an embodiment.
Fig. 5 is a schematic diagram illustrating an example of applying a model to identify addresses according to an embodiment.
FIG. 6 is a schematic diagram illustrating a method of cleaning up ingested data according to an embodiment.
Fig. 7 is a schematic diagram illustrating a method of cleaning up ingested address data according to an embodiment.
Fig. 8 is a schematic diagram illustrating a method of linking ingested data according to an embodiment.
Fig. 9 is a schematic diagram showing an example of linking ingested data according to an embodiment.
FIG. 10 is a schematic diagram illustrating a system for ingesting data, training a model based on the data, and determining a solution based on the trained model, according to an embodiment.
FIG. 11 is a schematic diagram illustrating a system for scheduling ingestion, training, and solution tasks, according to an embodiment.
The drawing in which an element first appears is generally indicated by the leftmost digit(s) in the respective reference number. In the drawings, like reference numbers may indicate identical or functionally similar elements.
Detailed Description
Machine learning algorithms can train models to predict the accuracy of personal data, but significant training data is required to train a model. Because personal data changes over time, the training data may become outdated, diminishing its usefulness for training the model. Embodiments address this problem by developing a database with a running log that specifies how each person's data changed over time. When information verifying the accuracy of a person's data becomes available to train the model, embodiments may retrieve from the database all of the person's data as it existed at the time the verification of accuracy occurred. From this retrieved information, features can be determined. The determined features are used to train the model. In this way, embodiments avoid outdated training data.
When data is ingested, it may not be standardized. For example, the same address may be listed differently in different records and data sources. The differing representations make it difficult to link these records. Machine learning algorithms and models operate more effectively when the same data is represented in the same manner. To address this problem, embodiments clean the data to ensure that ingested data fields are standardized.
The various tasks required to train models and resolve the accuracy of personal data can quickly become burdensome for computing devices. They may conflict with one another and use computing resources (e.g., processor power and memory capacity) inefficiently. To solve these problems, a scheduler is employed to queue the various tasks involved.
In the following detailed description, references to "one embodiment," "an example embodiment," etc., indicate that the embodiment described may include a particular feature, structure, or characteristic, but every embodiment may not necessarily include the particular feature, structure, or characteristic. Moreover, such phrases are not necessarily referring to the same embodiment. Further, when a particular feature, structure, or characteristic is described in connection with an embodiment, it is submitted that it is within the knowledge of one skilled in the art to effect such feature, structure, or characteristic in connection with other embodiments whether or not explicitly described.
FIG. 1 is a schematic diagram 100 illustrating training of a machine learning model with time-varying data, according to an embodiment. Schematic diagram 100 includes a timeline 120. Timeline 120 shows times 102A..N and 104A..N.
At times 102A..N, information about the individual or group of people being monitored has been updated. As described below, information may be stored in a number of different data sources. For application to healthcare providers, the data sources may include public databases and catalogs describing demographic information about individual healthcare providers, as well as proprietary databases such as internal insurance catalogs and claims databases. Updates to any data source are recorded in the change log of the historical update database 110. For example, when a new claim is added for a healthcare provider, the new claim is recorded in the historical update database 110. Similarly, when the provider's address is updated, the change is recorded in the historical update database 110, such that the historical update database 110 archives all relevant data sources for all monitored persons as changes are made. In this way, the historical update database 110 includes a running log that specifies how all relevant data about the monitored persons changes over time. From the historical update database 110, the content of all stored data at any particular time may be determined.
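A minimal sketch of such a running log, assuming a simple SQLite schema with illustrative column names (source_id, observed_at, provider_id, attribute, value); the query rebuilds a provider's data as it existed at any given moment:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE history_updates (
        source_id   INTEGER,  -- which data source produced the value
        observed_at TEXT,     -- when the update was detected (ISO 8601)
        provider_id INTEGER,  -- the monitored person
        attribute   TEXT,     -- e.g. 'address' or 'phone'
        value       TEXT
    )
""")

def data_as_of(provider_id: int, moment: str) -> dict:
    """Reconstruct a provider's data as it existed at `moment`."""
    rows = conn.execute(
        """SELECT attribute, value FROM history_updates
           WHERE provider_id = ? AND observed_at <= ?
           ORDER BY observed_at""",  # later rows overwrite earlier ones
        (provider_id, moment),
    )
    snapshot: dict = {}
    for attribute, value in rows:
        snapshot[attribute] = value
    return snapshot
```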
At times 104A..N, at least some of the information is verified as accurate or inaccurate. In the context of demographic information such as an address or telephone number, this may involve calling the healthcare provider and asking whether the address or telephone number is valid. The result is an indication of whether the address is valid or invalid and the time at which the verification occurred. These values are stored in the verification database 112. In addition to demographic information, other information about a person, including his or her behavior, may be verified or determined. For example, times 104A..N may be the times at which claims determined by investigation to be fraudulent occurred.
Using the historical update database 110 and the verification database 112, a characterization training database 114 may be determined. The historical data from the historical update database 110 may be converted into features useful for training machine learning algorithms before being input into the characterization training database 114, as described below. These features are used to train the machine learning model 116.
If the historical update database 110 included only up-to-date information, the information in the verification database 112 would soon be outdated, because information is updated at times 102A..N. Further, verifications at times 104A..N may occur independently of times 102A..N. If information from the data sources were collected only when verification data is received, time may have elapsed and the data sources may have been updated in the interim. For this reason, if the historical update database 110 included only the data that was valid when new verification data was received, the historical update database 110 would be outdated. For example, the data most relevant to predicting fraudulent claims may be the data that was valid when the claims were made. If the historical update database 110 included only up-to-date information, or only the information available when a claim is determined to be fraudulent, there may not be much relevant historical data and, therefore, the machine learning algorithm may be less effective.
FIG. 2 is a flow diagram illustrating a method 200 of ingesting data and training a model, according to an embodiment. Exemplary operations of method 200 are shown in schematic diagram 300 of FIG. 3.
The method 200 begins at step 202 by examining the various data sources to determine whether the data has been updated. To check whether data has been updated, embodiments may, for example, check the timestamp of the data, or compute a hash value of the data and compare it to the hash value generated when the data was last checked. The check of step 202 may be performed on a number of different data sources, shown, for example, in diagram 300 of FIG. 3.
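A minimal sketch of the hash-based update check (the source naming and in-memory cache are illustrative; a real embodiment would persist the hashes between checks):

```python
import hashlib

last_seen: dict[str, str] = {}  # source name -> content hash from the previous check

def source_updated(source_name: str, payload: bytes) -> bool:
    """Return True when a source's content hash differs from the last check."""
    digest = hashlib.sha256(payload).hexdigest()
    changed = last_seen.get(source_name) != digest
    last_seen[source_name] = digest
    return changed

# First sighting counts as an update; unchanged payloads do not.
assert source_updated("npi", b"123 Anywhere Street")
assert not source_updated("npi", b"123 Anywhere Street")
assert source_updated("npi", b"123 Anywhere St. Suite 100")
```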
Schematic diagram 300 illustrates various data sources: a Centers for Medicare & Medicaid Services (CMS) data source 302A, a catalog data source 302B, a DEA data source 302C, a public data source 302D, an NPI data source 302E, a registration data source 302F, and a claims data source 302G.
CMS data source 302A may be a data service provided by a government agency. The database may be distributed, and different organizations may be responsible for storing different portions of the data in CMS data source 302A. CMS data source 302A may include data about healthcare providers, such as legally available demographic information and claim information. CMS data source 302A may also allow providers to register and update their information in a Medicare provider enrollment system and to enroll in the Medicare and Medicaid Electronic Health Record (EHR) Incentive Programs.
Catalog data source 302B may be a catalog of healthcare providers. In one example, catalog data source 302B may be a proprietary catalog that matches healthcare providers with the demographics and behavioral traits that a particular customer deems authentic. Catalog data source 302B may belong, for example, to an insurance company and may be securely accessed and used only with that company's consent.
The DEA data source 302C may be a registration database maintained by a government agency such as DEA. The DEA may maintain a database of healthcare providers (including doctors, optometrists, pharmacists, dentists or veterinarians) that are allowed to prescribe or dispense medications. The DEA data source 302C may match the healthcare provider with the DEA number. In addition, the DEA data source 302C may include demographic information about the healthcare provider.
Public data source 302D may be a public data source, possibly a Web-based data source, such as an online review system. One example is the YELP online review system. These data sources may include demographic information about the healthcare provider, areas of expertise, and behavioral information (such as comments by the general public).
NPI data source 302E is a data source that matches healthcare providers with National Provider Identifiers (NPIs). The NPI is a standard created to simplify administration under the Health Insurance Portability and Accountability Act (HIPAA). The NPI is a unique identification number for covered healthcare providers. Covered healthcare providers, all health plans, and healthcare clearinghouses must use NPIs in the administrative and financial transactions adopted under HIPAA. The NPI is a 10-digit, intelligence-free numeric identifier, meaning the digits do not encode other information about the healthcare provider, such as the state in which they practice or their medical specialty. The NPI data source 302E may also include demographic information about the healthcare provider.
Registration data source 302F may include state license information. For example, a healthcare provider (such as a physician) may need to register with a state licensing board. The state licensing board may provide, in registration data source 302F, information about the healthcare provider, such as demographic information and areas of expertise, including board certification.
The claim data source 302G can be a data source with insurance claim information. Similar to the catalog data source 302B, the claims data source 302G can be a proprietary database. An insurance claim may specify the necessary information for an insurance reimbursement. For example, the claim information may include information about the healthcare provider, the services performed, and possibly the amount of the claim. The services performed may be described using a standardized code system, such as ICD-9. The information about the healthcare provider may include demographic information.
Returning to FIG. 2, at decision block 204, each data source is evaluated to determine whether an update has occurred. If an update has occurred in any of the data sources, the update is stored at step 206. The update may be stored in the historical update database 110 shown in FIG. 3. As described above with reference to FIG. 1, the historical update database 110 includes a running log that specifies how a person's data changes over time.
For example, in FIG. 3, such a running log in the historical update database 110 is shown in table 312. Table 312 has three rows and five columns: source ID, datetime, provider ID, attribute, and value. The source ID column indicates the source of the underlying data in the historical update database 110. Tracking the source of this data may be important to ensure that proprietary data is not used improperly. In table 312, the first two rows indicate that data was retrieved from NPI data source 302E, and the third row indicates that data was retrieved from claims data source 302G. The datetime column may indicate the time of the update or the time at which the update was detected. The provider ID column may be a primary-key identifier of the healthcare provider. The attribute column may be a primary-key identifier for one of several monitored attributes, such as demographics (e.g., address, phone number, name). In this case, the attribute value for each row in table 312 is one, indicating that each row relates to an update of the healthcare provider's address attribute. The value column indicates the value received from a particular source at the specified time for that attribute and provider. In table 312, the first address value retrieved from NPI data source 302E for the provider is "123 Anywhere Street", and the second address value subsequently retrieved from NPI data source 302E for the provider is "123 Anywhere St. Suite 100".
After the raw data downloaded from the data sources is stored at step 206, the data is cleaned and normalized at step 208. Different data sources sometimes use different conventions to represent the same underlying data, and some errors in the data occur frequently. As part of this step, instances where different data sources represent the same underlying data using varying conventions are identified, and errors that occur frequently or periodically are corrected. The cleaning and normalization is described in more detail below with reference to FIGS. 6-7.
Turning to FIG. 3, schematic diagram 300 shows an example of cleaning and normalization at step 314 and table 316. In table 316, it is determined that the first row and the second row represent the same underlying trait. Thus, they are linked and given a common representation. For consistency, "Street" is changed to the abbreviation "St." and the apartment number missing from the first row is added.
Returning to FIG. 2, at step 210, features representing known incorrect or correct data are captured. As described above, the attributes for which the model is built may be manually verified. For example, for a model that predicts the accuracy of a healthcare provider's address, a worker may manually call the healthcare provider and ask whether the address is correct. This solution data may be used to train the model. In addition to the solution, the input parameters required by the model must also be determined. These input parameters may be referred to as features.
The machine learning algorithm may perform better when the input parameters are facts about the attributes rather than raw data entered directly into the model. Facts may be, for example, true-or-false statements about the underlying raw data. For example, in the address model, the following features may be useful (a sketch of computing them follows the list):
Is the address updated within the last six months?
Is the address updated within the past year?
Is the provider registered with a state that matches this address?
Is there claim data for this address within the last six months?
Is there claim data within the past year?
Is the update date of the address data the same as the creation date?
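A minimal sketch of turning these questions into boolean features, assuming hypothetical record fields such as updated_at, created_at, state, registration_states, and claim_dates:

```python
from datetime import datetime, timedelta

def address_features(record: dict, now: datetime) -> dict:
    """Compute true/false facts about one address record (field names are illustrative)."""
    six_months, one_year = timedelta(days=182), timedelta(days=365)
    claims = record.get("claim_dates", [])  # dates of claims that reference this address
    return {
        "updated_last_6mo": now - record["updated_at"] <= six_months,
        "updated_last_1yr": now - record["updated_at"] <= one_year,
        "state_matches_registration": record["state"] in record["registration_states"],
        "claims_last_6mo": any(now - d <= six_months for d in claims),
        "claims_last_1yr": any(now - d <= one_year for d in claims),
        "updated_equals_created": record["updated_at"].date() == record["created_at"].date(),
    }
```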
New features may be continually added and tested to determine their efficacy in predicting whether an address is correct. To save computational resources when training and solving the model, features that have little impact can be eliminated, while new features determined to have predictive value may be added.
Turning to FIG. 3, the characterization process shown in step 318 produces the training data shown in table 320. In table 320, two rows show two different verifications that have occurred. For the provider with ID 14, address "123 Anywhere St. Suite 100" has been verified as correct. For the provider with ID 205, address "202 Nowhere St" has been verified as incorrect. Both rows have a set of features F1..FN already determined for the respective address.
Returning to FIG. 2, at step 212, the training data is used to train a plurality of machine learning models. Different types of models may have different validity for each attribute. Thus, in step 212, many different types of models are trained. Types may include, for example: logistic regression, naive Bayes, elastic net, neural networks, Bernoulli naive Bayes, multinomial naive Bayes, nearest-neighbor classifiers, and support vector machines. In some embodiments, these techniques may be combined. Given features related to an attribute, a trained model outputs a score indicating the likelihood that the attribute's value is correct.
In step 214, the best model or combination of models is selected. The best model may be the model that most accurately predicts the properties that are trained to predict. Step 214 may be implemented using a grid search. For each known correct answer, features are computed and applied to each trained model. For each trained model, an accuracy value is determined that indicates the degree to which the score output by the trained model is correct. The model with the greatest accuracy is then selected to predict the correctness of the attribute.
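A minimal sketch of steps 212 and 214 using scikit-learn; the candidate model set, five-fold cross-validation, and accuracy scoring are illustrative choices, not the patent's prescribed configuration:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import BernoulliNB, MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# X: one row of boolean features per verified address; y: 1 = correct, 0 = incorrect
CANDIDATES = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "bernoulli_nb": BernoulliNB(),
    "multinomial_nb": MultinomialNB(),
    "nearest_neighbors": KNeighborsClassifier(),
    "svm": SVC(probability=True),  # probability=True so the model can emit scores
}

def select_best_model(X, y):
    """Train every candidate and keep the one with the best cross-validated accuracy."""
    scores = {
        name: cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()
        for name, model in CANDIDATES.items()
    }
    best = max(scores, key=scores.get)
    return CANDIDATES[best].fit(X, y), best, scores[best]
```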
In this way, embodiments ingest data from multiple data sources and use the data to train a model that can predict whether a particular attribute is accurate. As shown in fig. 4 and 5, a trained model may be applied.
Fig. 4 is a flow chart illustrating a method 400 of applying a trained model according to an embodiment. The operation of method 400 is illustrated in schematic diagram 500 of FIG. 5.
The method 400 begins at step 402, where features are collected for the queried attribute. Features are collected in the same manner as they were when developing the training data for the attribute's model. For example, the data may be cleaned and normalized as described above and in detail below with respect to FIGS. 6-7. Features may be calculated from the historical update database 110 using the most up-to-date information about the attribute. In one embodiment, the features may be calculated only for the provider named in a user request. In another embodiment, the features may be calculated for every provider, or for every provider whose attribute (e.g., address) has not recently been verified and included in the training data. An example of the calculated data is shown in schematic diagram 500 of FIG. 5.
In diagram 500, table 502 shows data received from the historical update database 110 for input into the trained model. Each row represents a different value of the attribute to be predicted. The provider ID corresponds to the provider associated with the value. F1..FN are features related to the provider and the specific value. These features may be the same facts used to train the model.
Returning to FIG. 4, at step 404, the collected features are applied to the trained model. Features may be input into the model, and thus, the model may output a score indicating the likelihood that the value is accurate.
Exemplary scores are shown in step 504 and table 506 of diagram 500. Table 506 represents various possible addresses for the provider and the scores that the model has output for each address. In addition, table 506 includes a source for each address. Additional queries to the history update database 110 may be required in order to determine the source. In the example of table 506, there are four possible addresses for a particular provider: "123Anywhere St" collected from NPI data sources, "321Someplace Rd" collected from first claim data sources, "10Somewhere Ct" collected from second, different claim data sources, and "5Overthere Blvd" collected from DEA data sources. The model calculates a score for each address.
In FIG. 4, at step 406, the scores are analyzed to determine the appropriate answers. For some attributes, a provider may have more than one valid answer; for example, a provider may have more than one valid address. To determine which answers are valid, the scores may be analyzed. In one embodiment, scores greater than a threshold may be selected as correct. In another embodiment, scores below the threshold may be rejected as incorrect. In yet another embodiment, the scores may be clustered, and the cluster of high-scoring answers selected as correct.
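A minimal sketch of steps 404 and 406, reusing a trained classifier such as the one returned by select_best_model above; the 0.9 threshold is illustrative:

```python
def score_values(model, candidates):
    """candidates: list of (value, source, feature_vector) tuples for one provider."""
    return [
        (value, source, model.predict_proba([features])[0][1])  # P(value is correct)
        for value, source, features in candidates
    ]

def select_valid(scored, threshold=0.9):
    """Keep every value whose score clears the threshold; a provider may keep several."""
    return [(value, source) for value, source, score in scored if score >= threshold]
```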
Once the possible answers are determined in step 406, they are filtered based on the information source in step 408. As noted above, not all data sources are public; some are proprietary. The filtering at step 408 may ensure that values retrieved from a proprietary source are not leaked to another party without proper consent.
The answer selection and filtering described in steps 406 and 408 is shown in step 508 and list 510 of FIG. 5. In this example, three of the four possible addresses may be selected as valid addresses for the provider: "321Someplace Rd", "10Somewhere Ct", and "5Overthere Blvd". The scores of these three addresses are .95, .96, and .94, respectively. They are close to one another and above a threshold that may be .9. On the other hand, the score for the remaining address is only .10, which is below the threshold; it is therefore excluded from the possible solutions.
The three valid addresses come from three different data sources. The address "5Overthere Blvd" was collected from a public data source (the DEA data source described above), so it is included in list 510, which lists the final answers to be presented to the user. The other two addresses, "321Someplace Rd" and "10Somewhere Ct", were collected from proprietary claims databases. In the example shown in diagram 500, the user has access only to the first claims database, which contains address "321Someplace Rd", and not to the claims database containing address "10Somewhere Ct". Thus, "321Someplace Rd" is included in list 510, but "10Somewhere Ct" is not.
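A minimal sketch of the step-408 consent filter, with hypothetical source labels and a per-user set of licensed proprietary sources:

```python
PUBLIC_SOURCES = {"DEA", "NPI", "CMS"}  # illustrative labels for public sources

def filter_by_consent(answers, licensed_sources):
    """Drop answers whose source is proprietary and not licensed to the requesting user."""
    return [
        (value, source) for value, source in answers
        if source in PUBLIC_SOURCES or source in licensed_sources
    ]

final = filter_by_consent(
    [("5Overthere Blvd", "DEA"),
     ("321Someplace Rd", "claims-1"),
     ("10Somewhere Ct", "claims-2")],
    licensed_sources={"claims-1"},
)
# -> [("5Overthere Blvd", "DEA"), ("321Someplace Rd", "claims-1")]
```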
In this way, embodiments apply a trained model to solve for the valid values of various attributes of personal data.
As described above, in order to train the model and apply the collected data to the model to solve for the correct values, the data ingested from the various data sources must be cleaned up and normalized. This process is described, for example, with reference to fig. 6-7.
FIG. 6 is a schematic diagram illustrating a method 600 of cleaning up ingested data, according to an embodiment.
Method 600 begins at step 602 when a plurality of values for an attribute are received from a plurality of data sources. This data ingestion process is described above with reference to fig. 2 and 3.
At step 604, the values are analyzed to determine if any of them represent the same trait. In the context of addresses, various address values are analyzed to determine whether they are intended to represent the same underlying geographic location.
Steps 606 and 608 occur when multiple values are determined to represent the same underlying trait. At step 606, these values are analyzed to determine which best represents the underlying trait. In the context of addresses, the address that best represents the geographic location may be selected. In addition, any conventions (such as whether to abbreviate) may be applied to the address. In the context of entity names, step 606 may involve mapping various possible descriptions of the entity to a standard description consistent with state registration. For example, "DENTAL SERVICE inc" (no comma) may be mapped to "DENTAL SERVICE, inc" (comma). In the context of claims, step 606 may involve mapping the data to a common claim code system, such as ICD-9.
At step 608, the values are linked to indicate that they represent the same trait. In one embodiment, they may be set to the same value determined in step 606.
Fig. 7 is a schematic diagram illustrating a method 700 of cleaning up ingested address data, according to an embodiment.
The method 700 begins at step 702, in which each address is geocoded. Geocoding is the process of converting a postal address description into a location on the earth's surface (e.g., a spatial representation in numerical coordinates).
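A minimal sketch of steps 702 and 704 using the geopy library's Nominatim geocoder (any geocoding service would do; the coordinate-rounding tolerance is an illustrative stand-in for a proper distance check):

```python
from geopy.geocoders import Nominatim

geocoder = Nominatim(user_agent="address-dedup-sketch")

def same_location(addr_a: str, addr_b: str, places: int = 4) -> bool:
    """Geocode two postal addresses and compare coordinates at roughly 10 m precision."""
    loc_a, loc_b = geocoder.geocode(addr_a), geocoder.geocode(addr_b)
    if loc_a is None or loc_b is None:
        return False  # unresolvable address: cannot confirm a match
    return (round(loc_a.latitude, places), round(loc_a.longitude, places)) == \
           (round(loc_b.latitude, places), round(loc_b.longitude, places))
```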
In step 704, the geocoded coordinates are evaluated to determine if they represent the same geographic location. If so, the ingested address values are likely to be intended to represent the same trait.
In step 706, the apartment (suite) number is evaluated. Apartment numbers are typically represented in various ways. For example, designations other than "apartment" may be used. In addition, the apartment number may sometimes be omitted; digits are more often omitted than erroneously added. Using these observations, embodiments may choose between multiple possible apartment numbers.
For example, a healthcare provider may have various addresses with different apartment numbers: "apartment 550" and "apartment 5500". An embodiment determines whether a first string among the plurality of different values is a substring of a second string among the plurality of different values. For example, an embodiment determines that "550" is a substring of "5500". The embodiment then determines that "5500" more accurately represents the address of the healthcare provider, because digits are more often omitted than erroneously added. In addition to or instead of checking substrings, embodiments may apply fuzzy matching, e.g., comparing the Levenshtein distance between two strings to a threshold.
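A minimal sketch of the substring and fuzzy checks; the Levenshtein implementation is the standard dynamic program, and the edit-distance threshold is illustrative:

```python
def levenshtein(a: str, b: str) -> int:
    """Classic edit distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def same_suite(a: str, b: str, max_distance: int = 1) -> bool:
    """Treat two suite numbers as the same trait if one contains the other
    or they sit within a small edit distance."""
    return a in b or b in a or levenshtein(a, b) <= max_distance

def prefer(a: str, b: str) -> str:
    """Prefer the longer value when one contains the other, since digits are
    more often omitted than erroneously added."""
    return a if len(a) >= len(b) else b

assert same_suite("550", "5500") and prefer("550", "5500") == "5500"
```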
In step 708, numbers with similar appearances are evaluated. In an embodiment, a first string among the plurality of different values is determined to be similar to a second string among the plurality of different values, except for digits that look alike. When this determination is made, the string determined to most accurately represent the trait is selected.
For example, a healthcare provider may have various addresses with different apartment numbers: "apartment 6500" and "apartment 5500". The digits "5" and "6" may have similar appearances. The strings are identical except that "5" is substituted for "6". Thus, the strings may be identified as representing the same address. To determine which string is the correct apartment number, other factors may be considered, such as the apartment number presented in other sources.
Based on the analysis in steps 706 and 708, the correct address is selected in step 710.
Fig. 8 is a schematic diagram illustrating a method of linking ingested data according to an embodiment. As shown, method 800 describes matching and linking records using embodiments of the aforementioned system. The term "match" refers to determining that two or more data records correspond to the same person.
At step 830, the processor legitimately accesses at least one set of data records stored in memory. In an embodiment, the set of data records may include the data sources described above with respect to fig. 2-3. All data can be accessed and retrieved legitimately from a variety of external sources.
In some instances, the accessed data records may be received and/or stored in an undesirable format or in a format that is incompatible with contemplated methods and systems. In such embodiments, the data records are cleaned or standardized to conform to a predetermined format.
In step 832, the data for each accessed record is parsed. In an embodiment, the parsing step is implemented using control logic defining a set of dynamic rules. In an embodiment, the control logic may be trained to parse the data record and locate a first name, last name, home address, email address, telephone number, or any other demographic or personal information describing the person associated with the parsed data record. In another embodiment, the control logic may specify a set of persistent rules based on the type of data record being parsed.
At step 834, parsed data is assigned to predetermined categories within each record. For example, an embodiment may include parsing rules for finding a person's first name, last name, home address, email address, and phone number. In such an embodiment, when the processor finds a first name, last name, and so on, a temporary file may be created within the data record, with each item assigned to its respective category. In another embodiment, a new persistent file may be created to store the classification data. For example, a new record may be created as a new row in a database table or memory, with the different categories entered as column values in that row. In yet another embodiment, the processor may assign the classification data and store the assigned and classified data as metadata within the original file.
At step 836, the classification data for each record is compared to all other classification records using a pair-wise function. For example, the processor compares the classification data of the first record with the classification data of the second record. In an embodiment, the processor compares the individual categories. For example, the processor compares the address associated with the first record with the address associated with the second record to determine if they are the same. Alternatively, other possible categories may be compared, including first name, last name, email address, social security number, or any other identifying information. In another embodiment, the processor compares more than one category of data. For example, the processor may compare the first name, surname, and address associated with the first record with the first name, surname, and address of the second record to determine if they are the same. The processor may keep track of which categories match and which categories do not. Alternatively, the processor may count only the number of matching categories. It is contemplated that step 836 may include comparing more than three categories. For example, in an embodiment, the processor compares up to seven categories. In further embodiments, the processor compares between 8 and 20 categories.
In an embodiment, step 836 may use not only literal matching, but also other types of matching, such as regular expression matching or fuzzy matching. Regular expression matching can determine that two values match when they both satisfy the same regular expression. When two strings approximately (rather than completely) match a pattern, a fuzzy match may detect a match.
In an embodiment, step 836 may be implemented using multiple sets of data records. For example, data records from a first set of records may be compared to data records from a second set of records using the methods and systems described herein. In an embodiment, the first set of data records may be an input list comprising data records describing the person of interest or a list of persons of interest. The second set of data records may be personal data records from a second input list or legally stored in a database. A comparison of the multiple sets of data records is performed to determine whether the records of the first set of data records and the records of the second set of data records describe the same person.
Furthermore, in embodiments implemented using multiple sets of data records, the second set of data records may hold true identities, identities with confirmed accuracy, and/or identities that exceed a predetermined accuracy threshold. The true identity may be encoded as a serial number.
At step 838, for each data pair, a similarity score is calculated based on the data comparison. More specifically, the processor calculates a similarity score for each data pair based on which categories in the pair of records were determined to match in step 836. In an embodiment, the similarity score is calculated as a ratio. For example, in an embodiment comparing 7 categories, if the first record and the second record describe data such that 5 of the 7 categories are the same between the records, the similarity score is 5/7. In another embodiment, the similarity score is calculated as a percentage. For example, in an embodiment comparing 20 categories, if the first record and the second record describe data such that 16 of the 20 categories are the same between the records, then the similarity score is .8, or 80%.
In another embodiment, each category may be assigned a weight, and in step 838, a similarity score may be determined based on whether each category matches and the respective weights associated with the matching categories. The weights may be determined using a training set. In one example, linear programming may be used to determine weights. In other examples, a neural network or other adaptive learning algorithm may be used to determine the similarity scores for a pair of data records based on which categories in the pair match.
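A minimal sketch of the step-838 score with optional per-category weights (category names, example records, and weights are illustrative):

```python
def similarity(rec_a: dict, rec_b: dict, weights: dict | None = None) -> float:
    """Weighted fraction of shared categories whose values match between two records."""
    cats = rec_a.keys() & rec_b.keys()
    w = {c: (weights or {}).get(c, 1.0) for c in cats}  # default: equal weights (plain ratio)
    total = sum(w.values())
    matched = sum(w[c] for c in cats if rec_a[c] == rec_b[c])
    return matched / total if total else 0.0

a = {"first": "Aaron", "last": "Person", "city": "Austin", "zip": "78701", "phone": "555-0100"}
b = {"first": "Erin",  "last": "Person", "city": "Austin", "zip": "78701", "phone": "555-0100"}
print(similarity(a, b))  # 4/5 = 0.8, which clears a 3/5 threshold
```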
At step 840, a determination is made as to whether the calculated similarity score meets or exceeds a predetermined threshold. For example, in an embodiment where the similarity score threshold is 5/7 (or approximately 71.4%), the processor will determine whether the calculated similarity score meets or exceeds the 5/7 threshold. Likewise, in embodiments where the similarity score threshold is 16/20 (or 80%), the processor will determine whether the calculated score meets or exceeds the threshold.
At step 842, if the similarity score of at least two records meets or exceeds the similarity score threshold, the similar records (i.e., records meeting or exceeding the similarity score threshold) are linked or combined into a group. For example, in an embodiment, the processor performs a pair-wise comparison between the first record and all subsequent records. Any records in the first group that meet or exceed the similarity score threshold are linked and/or combined. The processor then performs a pair-wise comparison between the second record and all subsequent records. Assuming that the second record is not linked to the first record, any subsequent records in the second group that meet or exceed the similarity score threshold are linked and/or combined (when compared to the second record). Step 842 is also applicable when comparing multiple sets of data records. The similarity score is calculated for each data record in the first set of data records as they relate to the data records in the second set of data records. Any records that meet or exceed the similarity score threshold are linked and/or grouped into groups, as described above. In an embodiment, the records of a link/packet may be programmably linked while the records remain in their respective record sets.
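A minimal sketch of the step-842 grouping, reusing the similarity function from the previous sketch; this greedy seed-comparison strategy is one simple reading of the step, and the conflict handling of the following paragraphs is omitted:

```python
def group_records(records: list[dict], threshold: float = 0.6) -> list[list[dict]]:
    """Greedy grouping: each record joins the first group whose seed it matches
    at or above the threshold, else it starts a new group."""
    groups: list[list[dict]] = []
    for rec in records:
        for group in groups:
            if similarity(group[0], rec) >= threshold:  # pairwise check against the seed
                group.append(rec)
                break
        else:
            groups.append([rec])
    return groups
```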
Further, at step 842, it may be that a pair-wise comparison between the first data record and the second data record results in a similarity score that meets or exceeds a threshold. In addition, the pairwise comparison between the second record and the third record also produces a similarity score that meets or exceeds the threshold, but the pairwise comparison between the first record and the third record is not similar and does not meet the threshold. The processor may handle such conflicting packet situations in a number of ways. For example, in an embodiment, the processor may compare additional categories that are not included when performing the initial pairwise comparison. For example, if the processor compares the first name, last name, address, and telephone number during an initial comparison, the processor may include a social security number, age, and/or any other information that may help narrow the identity during a second comparison. After this second comparison of the first record, the second record, and the third record, an updated similarity score is calculated for each comparison (i.e., first record and second record, first record and third record, second record and third record) and the similarity score is measured relative to a second predetermined threshold. If the updated similarity scores meet or exceed a second predetermined threshold, they are grouped according to the foregoing embodiment. However, if the same situation still exists, i.e. the first record is similar to the second record, the second record is similar to the third record, and the first record is not similar to the third record, the second record is grouped with the first record or with the third record, depending on which of the pairings has a higher updated similarity score. If the updated similarity scores are equal, another iteration of comparing other columns will begin.
In another embodiment, the processor may handle conflicting packet situations by creating a copy of the second record. After making the copy, the processor may group the first record and the second record into group a and group the copy of the second record with the third record into group B.
In yet another embodiment, the processor may handle conflicting grouping situations by creating a group based on a pairwise comparison of the second record. For example, based on similarity scores between the first record and the second record and between the second record and the third record, all three records are grouped together based on their relationship to the second record.
In step 844, the processor determines the most prevalent identity within each set of similar records. For example, if the set of similar records contains 10 records, and 5 of the records describe an individual named James, and the remaining 5 records include the name Jim, mike, or Harry, the processor will determine James to be the most common name. In other embodiments, the processor may require additional steps to determine the most prevalent identities within each group. For example, it may occur that a similar set of records contains six records, two records describing an individual named Mike, two records describing an individual named Michael, one record describing an individual with the first initial "M", and the last record describing an individual named John. In such embodiments, the processor may determine that the most common identity is Michael based on a relationship between the names Michael and Mike. In instances where there is no explicit pervasive identity, other categories (i.e., surnames, addresses, email addresses, phone numbers, social security codes, etc.) may be referenced to determine the most pervasive identity. In embodiments where multiple sets of data records are compared, the data records in the first or second sets of data records may be modified or marked to indicate the most prevalent identity and/or record of links/groupings. More specifically, the records may be modified so that a user may determine the most prevalent identity and/or linked data records when viewing a single set of data records.
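A minimal sketch of step 844 using a plain frequency count; the nickname table stands in for the name-relationship logic implied by the Mike/Michael example:

```python
from collections import Counter

ALIASES = {"Mike": "Michael", "Jim": "James"}  # illustrative nickname folding

def most_prevalent_identity(names: list[str]) -> str:
    """Return the most common first name after folding known nicknames."""
    canonical = [ALIASES.get(n, n) for n in names]
    return Counter(canonical).most_common(1)[0][0]

print(most_prevalent_identity(["Mike", "Michael", "Mike", "M.", "John"]))  # Michael
```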
In step 846, the processor modifies the identities of the similar records to match the identity of the most common record within each set of similar records. Returning to the example provided above, a set of similar records contains six records: two records describing an individual named Mike, two records describing an individual named Michael, one record describing an individual with the initial "M", and the last record describing an individual named John. In this example, at step 846 the processor modifies each record so that the identity of each record describes an individual named "Michael". After the identity of each similarity group is modified, the record matching operation is complete. This process is further illustrated in FIG. 9.
Fig. 9 illustrates a flow chart 900 that illustrates exemplary operations that may be used to implement various embodiments of the present disclosure. As shown, this flowchart 900 illustrates an embodiment of a record matching operation using an embodiment of the foregoing system.
The flow chart 900 illustrates data that has been parsed, classified, and normalized using the parsing, assigning, and classifying steps described above, or using well-known methods. As shown, the received classification data has been assigned to rows 950a-n and columns 952a-n. Each of the rows 950a-n includes information parsed from the data records describing the individual. Each of the columns 952a-n includes classification information that has been parsed and assigned to a predetermined category.
In step 936, the processor compares the classification data for each record to all other classified records using the pair-wise function. As described above, the processor may compare a single category, or alternatively, more than one category. In the illustrated embodiment, the processor compares five categories and implements a 3/5 (or 60%) similarity score threshold.
As above, the method described in FIG. 8 may also be applicable when comparing multiple sets of data records. For example, step 936 may also be performed using multiple sets of data records. The data records from a first set of records may be compared to the data records from a second set of records. More specifically, the first set of data records may include data records describing a person of interest or a list of persons of interest, while the second set of data records may be personal data records legally stored in a database or memory.
In step 942, if the similarity score of at least two records meets or exceeds the similarity score threshold, then similar records (i.e., records meeting or exceeding the similarity score threshold) are linked or combined into a group. As shown, groups A and B have been created based on the data provided in rows 950a-n and columns 952 a-n. The number of possible groups is proportional to the number of rows being compared. As shown, group a contains three records, while group B contains two records. Each record in each group meets or exceeds a similarity score threshold ratio of 3/5 (or 60%) as compared to the other records in the group.
In step 944, the processor determines the most prevalent identity within each set of similar records. For example, in group A, the processor compares the identities "Aaron Person", "Erin Person", and "A. Person". Following the rules described above, the processor determines that "Aaron Person" is the most prevalent identity in group A. In group B, the processor compares the identities "Henry Human" and "H. Human". Also following the rules above, the processor determines "Henry Human" to be the most prevalent identity in group B.
At step 946, the processor modifies the identity of record 958 to match the identity of the most prevalent record within each similar record group. As shown, the record of group a has been modified to describe the identity of "Aaron Person", while the record of group B has been modified to describe the identity of "Henry Human".
Fig. 10 is a schematic diagram illustrating a system 1000 for ingesting data, training a model based on the data, and determining a solution based on the trained model, according to an embodiment.
The system 1000 includes a server 1050. The server 1050 includes a data ingester 1002. The data ingester 1002 is configured to retrieve data from data sources 102A..N. The data may comprise a plurality of different values describing the same attribute of a person. In particular, the data ingester 1002 repeatedly and continuously monitors the data sources to determine whether data related to any monitored person has been updated. When a person's data has been updated, the data ingester 1002 stores the updated data in the database 110. As described above, the database 110 stores a running log that specifies how the person's data changes over time.
Using database 110, server 1050 periodically or intermittently generates machine learning models 1022 to evaluate the validity of personal data. To generate model 1022, server 1050 includes six modules: querier 1004, data cleaner 1006, data linker 1010, characterizer 1012, trainer 1015, and tester 1020.
The API monitor 1003 receives an indication that the value for a particular attribute in the personal data is verified as accurate or inaccurate at a particular time. For example, the caller may manually verify the accuracy of the value and, after verification, cause the API call to be transferred to the API monitor 1003. Based on the particular time, the querier 1004 retrieves the person's data from the database 110, including the latest values for the plurality of attributes at the particular time.
Data cleaner 1006 determines whether any of a plurality of different values represent the same trait. When different values represent the same trait, the data cleaner 1006 determines which of those values most accurately represents the trait.
The data linker 1010 links those values that are determined to represent the same trait. The data linker 1010 may include a geocoder (not shown) that geocodes each of a plurality of different address values to determine a geographic location and determines whether any of the determined geographic locations are the same.
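The address-linking step could be sketched as follows. Here geocode() is a hypothetical stand-in for the geocoder of the data linker 1010, and rounding coordinates is just one illustrative way of deciding that two geocoded locations are the same.

def geocode(address):
    # Hypothetical: call any geocoding service that returns (lat, lon).
    raise NotImplementedError

def link_addresses(addresses, precision=4):
    # Group address strings that resolve to the same rounded coordinates.
    linked = {}
    for addr in addresses:
        lat, lon = geocode(addr)
        key = (round(lat, precision), round(lon, precision))
        linked.setdefault(key, []).append(addr)
    return linked  # each value lists addresses judged to name the same place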
Using the data retrieved by the querier 1004, cleaned by the data cleaner 1006, and linked by the data linker 1010, the characterizer 1012 determines a plurality of features. Each of the plurality of features describes a fact about the data as it relates to the person.
Using these features, the trainer 1015 may train the model 1022 so that the model 1022 can predict whether a value of a particular attribute is accurate for another person. In an embodiment, the trainer trains a plurality of models, each using a different type of machine learning algorithm. The tester 1020 uses the available training data to evaluate the accuracy of the plurality of models and selects a model 1022 from the plurality of models based on the evaluated accuracy.
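A sketch of the trainer/tester interplay, using scikit-learn purely for illustration (the disclosure does not name a library): several model types are trained on the same features, and the most accurate on held-out data is kept as model 1022.

from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

def train_and_select(X, y):
    # X: feature vectors from the characterizer; y: verified labels.
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
    candidates = [LogisticRegression(max_iter=1000),
                  RandomForestClassifier(),
                  GradientBoostingClassifier()]  # one per algorithm type
    scored = []
    for model in candidates:
        model.fit(X_tr, y_tr)
        scored.append((accuracy_score(y_te, model.predict(X_te)), model))
    return max(scored, key=lambda s: s[0])[1]  # keep the most accurate model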
Server 1050 can use model 1022 to predict whether records in database 110 are accurate. To generate answers for presentation to clients, server 1050 includes two modules: a scoring engine 1025 and an answer filter 1030. Scoring engine 1025 applies model 1022 to predict whether another person's values for the plurality of attributes are accurate. In an embodiment, for a particular attribute of the other person, the model is applied to each of a plurality of candidate values to determine a score for each value.
Answer filter 1030 selects at least one value from the plurality of values scored by scoring engine 1025, based on each determined score. In an embodiment, answer filter 1030 filters the answers so that proprietary information is not shared without appropriate consent.
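The scoring and filtering stages might be combined as in the sketch below. The has_consent() predicate is a hypothetical stand-in for whatever consent check governs proprietary information, and min_score is an assumed cutoff; both are illustrative, not part of this disclosure.

def score_and_filter(model, candidate_values, featurize,
                     min_score=0.5, has_consent=lambda value: True):
    # Score each candidate value with the trained model, drop values that
    # score too low or lack consent, and return the best remaining value.
    scored = [(model.predict_proba([featurize(v)])[0][1], v)  # P(value accurate)
              for v in candidate_values]
    allowed = [(s, v) for s, v in scored if s >= min_score and has_consent(v)]
    if not allowed:
        return None
    return max(allowed, key=lambda sv: sv[0])[1]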
The various modules shown in fig. 10 may conflict with each other and compete inefficiently for computing resources, such as processor power and memory capacity. To solve these problems, a scheduler is employed to queue the various tasks involved, as shown in fig. 11.
FIG. 11 is a schematic diagram illustrating a system 1100 for scheduling ingestion, training, and solution tasks, according to an embodiment. In addition to the modules of fig. 10, system 1100 includes a scheduler 1102 and a queue 1106, as well as various processes, including a data ingestion process 1108, a trainer process 1110, and a solver process 1112. Each of the individual processes runs on a separate thread of execution.
As in system 1000, system 1100 includes an API monitor 1003. As described above, the API monitor 1003 may receive an indication that a value for a particular attribute in the personal data is verified as accurate or inaccurate at a particular time. The API monitor 1003 may also receive other types of API requests. Depending on the content of an API request, the API monitor may, upon receipt of the request, place a request to complete another job specified in the API request on the queue. The API request includes instructions to complete at least one of a data ingestion task, a training task, a solution task, or a scheduling task.
Scheduler 1102 places a request to complete a job on queue 1106. The request includes instructions to complete at least one of a data ingestion task, a training task, and a solution task. In an embodiment, the scheduler 1102 places requests to complete jobs on the queue at periodic intervals. Scheduler 1102 also monitors queue 1106. When the queue 1106 includes a request to complete a scheduling task (possibly placed by the API monitor 1003), the scheduler 1102 schedules the task as specified in the API request.
The queue 1106 queues various tasks 1107. Queue 1106 can be any type of message queue used for inter-process communication (IPC) or for inter-thread communication within the same process. Such queues carry messages that pass either control or content; group communication systems provide similar functionality. The queue 1106 can be implemented, for example, using Java Message Service (JMS) or Amazon Simple Queue Service (SQS).
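In-process, the queue could be as simple as Python's standard-library queue, a minimal stand-in for brokers such as JMS or SQS; the request format (a small dict naming the task) is an assumption for illustration.

import queue

tasks = queue.Queue()  # stand-in for queue 1106

# Requests to complete jobs, each naming the task type it carries.
tasks.put({"task": "ingest"})
tasks.put({"task": "train"})
tasks.put({"task": "solve"})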
The data ingestion process 1108 includes the data ingester 1002. The data ingestion process 1108 monitors the queue 1106 for data ingestion tasks. When the queue 1106 next includes a data ingestion task, the data ingestion process 1108 executes the data ingester 1002 to retrieve data related to a person from a data source and store the retrieved data in the database.
Trainer process 1110 includes the data cleaner 1006, data linker 1010, trainer 1015, tester 1020, querier 1004, and characterizer 1012. The trainer process 1110 monitors the queue 1106 for training tasks. When the queue 1106 next includes a training task, the trainer process 1110 executes the querier 1004, data cleaner 1006, data linker 1010, characterizer 1012, trainer 1015, and tester 1020 to train the model.
Solver process 1112 includes scoring engine 1025 and answer filter 1030. Solver process 1112 monitors the queue 1106 for solution tasks. When the queue 1106 next includes a solution task, the solver process 1112 executes the scoring engine 1025, which applies the model to predict whether another person's values for the plurality of attributes are accurate, and the answer filter 1030, which determines the final solution presented to the user.
In an embodiment (not shown), the system 1100 may include a plurality of queues, each dedicated to one of the data ingestion tasks, training tasks, and solution tasks. In that embodiment, the data ingestion process 1108 monitors the queue dedicated to data ingestion tasks, the trainer process 1110 monitors the queue dedicated to training tasks, and the solver process 1112 monitors the queue dedicated to solution tasks.
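The multi-queue embodiment could be sketched as follows, with one scheduler thread placing requests at periodic intervals and one worker thread per task type blocking on its dedicated queue. The handler callables are hypothetical stand-ins for the modules of fig. 10.

import queue
import threading
import time

queues = {"ingest": queue.Queue(), "train": queue.Queue(), "solve": queue.Queue()}

def scheduler(interval=60.0):
    # Place a request to complete each kind of job at periodic intervals.
    while True:
        for name, q in queues.items():
            q.put({"task": name})
        time.sleep(interval)

def worker(name, handler):
    # Monitor the queue dedicated to this task type and execute requests.
    while True:
        request = queues[name].get()  # blocks until a request arrives
        handler(request)
        queues[name].task_done()

def run(handlers):
    threading.Thread(target=scheduler, daemon=True).start()
    for name, handler in handlers.items():
        threading.Thread(target=worker, args=(name, handler), daemon=True).start()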
Each of the above servers and modules may be implemented in software, firmware, or hardware on a computing device. The computing device may include, but is not limited to, a personal computer, a mobile device such as a mobile phone, a workstation, an embedded system, a game console, a television, a set-top box, or any other computing device. Further, computing devices may include, but are not limited to, devices having processors and memory (including non-transitory memory) for executing and storing instructions. The memory may tangibly embody data and program instructions in a non-transitory manner. The software may include one or more applications and an operating system. The hardware may include, but is not limited to, a processor, memory, and a graphical user interface display. The computing device may also have multiple processors and multiple shared or separate memory components. For example, the computing device may be part or all of a clustered or distributed computing environment or server farm.
Conclusion
Identifiers such as "(a)", "(b)", "(i)", "(ii)", etc. are sometimes used for different elements or steps. These identifiers are used for clarity and do not necessarily specify an order of elements or steps.
The invention has been described above with the aid of functional building blocks illustrating the implementation of specific functions and relationships thereof. For ease of description, the boundaries of these functional building blocks have been arbitrarily defined herein. Other boundaries may be defined so long as the specified functions and relationships thereof are appropriately performed.
The foregoing description of the specific embodiments will so fully reveal the general nature of the invention that others can, by applying knowledge within the skill of the art, readily modify and/or adapt for various applications such specific embodiments without undue experimentation, without departing from the generic concept of the present invention. Accordingly, such adaptations and modifications are intended to be within the meaning and range of equivalents of the disclosed embodiments, based on the teaching and guidance presented herein. It is to be understood that the phraseology or terminology herein is for the purpose of description and not of limitation, such that the terminology or phraseology of the present specification is to be interpreted by the skilled artisan in light of the teachings and guidance.
The breadth and scope of the present invention should not be limited by any of the above-described exemplary embodiments, but should be defined only in accordance with the following claims and their equivalents.

Claims (14)

1. A system for scheduling data intake and machine learning, comprising:
a computing device comprising a processor;
a database;
a queue stored on the computing device;
a scheduler implemented on the computing device and configured to place a request to complete a job on the queue, the request including instructions to complete at least one of a data ingestion task, a training task, and a solution task;
a data ingestion process implemented on the computing device and configured to: (i) monitor the queue, and (ii) when the queue includes a request to complete the data ingestion task, retrieve data related to a person from a data source and store the retrieved data in the database;
a trainer process implemented on the computing device and configured to: (i) monitor the queue, and (ii) when the queue includes a request to complete the training task, train a model using the retrieved data in the database and an indication in the retrieved data that a value for a particular attribute is verified as accurate or inaccurate, such that the model can predict whether a value of the particular attribute for another person is accurate;
a solver process implemented on the computing device and configured to: (i) monitor the queue, and (ii) when the queue includes a request to complete the solution task, apply the model to predict whether the value for the other person is accurate; and
an API monitor implemented on the computing device and configured to: upon receiving an API request, place a request to complete another job specified in the API request on the queue, the API request including instructions to complete at least one of the data ingestion task, the training task, the solution task, or a scheduling task.
2. The system of claim 1, further comprising:
a plurality of queues, each dedicated to one of the data ingestion task, the training task, and the solution task,
wherein the data ingestion process monitors a queue of the plurality of queues that is dedicated to the data ingestion task,
wherein the trainer process monitors a queue of the plurality of queues that is dedicated to the training task, and
wherein the solver process monitors a queue of the plurality of queues that is dedicated to the solution task.
3. The system of claim 1, wherein,
the scheduler places the request to complete the job on the queue at periodic intervals.
4. The system of claim 1, wherein the data ingestion process is configured to:
(i) monitor the data source to determine whether data related to the person has been updated; and
(ii) when the person's data has been updated, store the updated data in the database.
5. The system of claim 1, wherein,
the scheduler monitors the queue, and
when the queue includes a request to complete the scheduling task, schedules the task specified in the API request.
6. The system of claim 1, wherein the API request comprises:
(i) an indication that the value for the particular attribute in the retrieved data is verified as accurate or inaccurate at a particular time, and
(ii) instructions for completing the training task.
7. The system of claim 1, wherein the data ingestion process is configured to:
monitor the data source to determine whether data related to the person has been updated, and
when the person's data has been updated, place another request to complete the training task on the queue.
8. A computer-implemented method for scheduling data intake and machine learning, comprising:
(a) placing a request to complete a job on a queue, the request including instructions to complete at least one of a data ingestion task, a training task, and a solution task;
(b) monitoring the queue to determine whether the queue includes the request and which task is next on the queue;
(c) when the queue includes a request to complete the data ingestion task, retrieving data related to a person from a data source and storing the retrieved data in a database;
(d) when the queue includes a request to complete the training task, training a model using the retrieved data in the database and an indication in the retrieved data that a value for a particular attribute is verified as accurate or inaccurate, such that the model is able to predict whether a value of the particular attribute for another person is accurate;
(e) when the queue includes a request to complete the solution task, applying the model to predict whether the value for the other person is accurate;
(f) receiving an API request; and
(g) upon receiving the API request, placing another request to complete another job specified in the API request on the queue, the API request including instructions to complete at least one of the data ingestion task, the training task, the solution task, or a scheduling task.
9. The method of claim 8, wherein,
the monitoring (b) includes monitoring a plurality of queues, each dedicated to one of the data ingestion task, the training task, and the solution task.
10. The method of claim 8, wherein,
the placing (a) occurs at periodic intervals.
11. The method of claim 8, further comprising:
(f) monitoring the data source to determine whether data related to the person has been updated; and
(g) when the person's data has been updated, storing the updated data in the database.
12. The method of claim 8, further comprising:
(h) when the queue includes another request to complete the scheduling task, scheduling the task specified in the API request.
13. The method of claim 8, wherein the API request comprises:
(i) an indication that the value for the particular attribute in the retrieved data is verified as accurate or inaccurate at a particular time, and
(ii) instructions for completing the training task.
14. The method of claim 8, further comprising:
(f) monitoring the data source to determine whether data related to the person has been updated; and
(g) when the person's data has been updated, placing another request to complete the training task on the queue.
CN201980024828.1A 2018-04-09 2019-04-09 Processing personal data using machine learning algorithms and applications thereof Active CN112189206B (en)

Applications Claiming Priority (7)

Application Number Priority Date Filing Date Title
US15/948,652 US10990900B2 (en) 2018-04-09 2018-04-09 Scheduling machine learning tasks, and applications thereof
US15/948,646 US20190311372A1 (en) 2018-04-09 2018-04-09 Normalizing Ingested Personal Data, and Applications Thereof
US15/948,646 2018-04-09
US15/948,652 2018-04-09
US15/948,604 2018-04-09
US15/948,604 US11568302B2 (en) 2018-04-09 2018-04-09 Training machine learning algorithms with temporally variant personal data, and applications thereof
PCT/US2019/026524 WO2019199778A1 (en) 2018-04-09 2019-04-09 Processing personal data using machine learning algorithms, and applications thereof

Publications (2)

Publication Number Publication Date
CN112189206A CN112189206A (en) 2021-01-05
CN112189206B true CN112189206B (en) 2024-09-06

Family

ID=68164473

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201980024828.1A Active CN112189206B (en) 2018-04-09 2019-04-09 Processing personal data using machine learning algorithms and applications thereof

Country Status (4)

Country Link
EP (1) EP3776376A4 (en)
CN (1) CN112189206B (en)
CA (1) CA3096405A1 (en)
WO (1) WO2019199778A1 (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111159861B (en) * 2019-12-16 2022-04-05 北京航空航天大学 Data evaluation method for multi-source reliability test data of lithium battery based on data envelopment analysis
US12164489B2 (en) 2022-05-09 2024-12-10 T-Mobile Usa, Inc. Database provisioning and management systems and methods

Family Cites Families (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7167842B1 (en) * 2000-06-27 2007-01-23 Ncr Corp. Architecture and method for operational privacy in business services
CN102831012A (en) * 2011-06-16 2012-12-19 日立(中国)研究开发有限公司 Task scheduling device and task scheduling method in multimode distributive system
US20130006656A1 (en) * 2011-06-30 2013-01-03 Verizon Patent And Licensing Inc. Case management of healthcare fraud detection information
US8849730B2 (en) * 2011-12-15 2014-09-30 Microsoft Corporation Prediction of user response actions to received data
US9218470B2 (en) * 2012-12-31 2015-12-22 General Electric Company Systems and methods for non-destructive testing user profiles
CN105378699B (en) * 2013-11-27 2018-12-18 Ntt都科摩公司 Autotask classification based on machine learning
US10296843B2 (en) * 2014-09-24 2019-05-21 C3 Iot, Inc. Systems and methods for utilizing machine learning to identify non-technical loss
US10713594B2 (en) * 2015-03-20 2020-07-14 Salesforce.Com, Inc. Systems, methods, and apparatuses for implementing machine learning model training and deployment with a rollback mechanism
CN105007282B (en) * 2015-08-10 2018-08-10 济南大学 The Malware network behavior detection method and system of network-oriented service provider
CN105117477B (en) * 2015-09-09 2018-07-31 中国人民解放军国防科学技术大学 A kind of the fictitious assets anomaly system and implementation method of adaptive self feed back
US11461690B2 (en) * 2016-07-18 2022-10-04 Nantomics, Llc Distributed machine learning systems, apparatus, and methods
CN106875270A (en) * 2017-01-19 2017-06-20 上海冰鉴信息科技有限公司 A kind of method and system design for building and verifying credit scoring equation

Also Published As

Publication number Publication date
CN112189206A (en) 2021-01-05
CA3096405A1 (en) 2019-10-17
EP3776376A4 (en) 2021-12-01
WO2019199778A1 (en) 2019-10-17
EP3776376A1 (en) 2021-02-17

Similar Documents

Publication Publication Date Title
US20230409966A1 Training machine learning algorithms with temporally variant personal data, and applications thereof
US10990900B2 (en) Scheduling machine learning tasks, and applications thereof
Herland et al. Big data fraud detection using multiple medicare data sources
CN111316273B (en) Cognitive data anonymization
US9230060B2 (en) Associating records in healthcare databases with individuals
TWI556180B (en) System and method for recursively traversing the internet and other sources to identify, gather, curate, adjudicate, and qualify business identity and related data
JP5923307B2 (en) Assertion-based record linkage in a decentralized autonomous medical environment
US20030191665A1 (en) System for processing healthcare claim data
US20160117778A1 (en) Systems and Methods for Computerized Fraud Detection Using Machine Learning and Network Analysis
US20190311372A1 (en) Normalizing Ingested Personal Data, and Applications Thereof
Bauder et al. A survey of medicare data processing and integration for fraud detection
CN103688260A (en) Entity resolution
CN113724858A (en) Artificial intelligence-based disease examination item recommendation device, method and apparatus
CA3007260A1 (en) Smart clustering and cluster updating
WO2021092012A1 (en) Methods and systems for comprehensive symptom analysis
CN118016319B (en) Respiratory infectious disease outbreak prediction method and device based on social media information
CN112189206B (en) Processing personal data using machine learning algorithms and applications thereof
CN114491205A (en) User portrait generation method and device, electronic equipment and readable medium
CN115274122A (en) Health medical data management method, system, electronic device and storage medium
CN117829291A (en) Whole-process consultation knowledge integrated management system and method
US20150339602A1 (en) System and method for modeling health care costs
CN118708808A (en) Recommendation method, device, equipment and storage medium based on large model
CN111241821B (en) Method and device for determining behavior characteristics of user
US8782025B2 (en) Systems and methods for address intelligence
AU2018204673A1 (en) Smart clustering and cluster updating

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 40036596

Country of ref document: HK

GR01 Patent grant
GR01 Patent grant