CN112189206A - Processing personal data using machine learning algorithms and applications thereof
- Publication number: CN112189206A
- Application number: CN201980024828.1A
- Authority: CN (China)
- Prior art keywords: data, person, training, task, model
- Legal status: Granted (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06F16/335: Information retrieval; querying; filtering based on additional data, e.g., user or group profiles (G: Physics; G06: Computing; G06F: Electric digital data processing)
- G06N20/20: Machine learning; ensemble learning (G06N: Computing arrangements based on specific computational models)
Abstract
Training a model requires training data. Because personal data changes over time, training data may become obsolete, undermining its usefulness for training models. Embodiments address this problem with a database that maintains a running log specifying how each person's data changes over time. Ingested data may also arrive unstandardized. To address this issue, embodiments clean the data to ensure that ingested data fields are standardized. Finally, training models and performing the various tasks needed to resolve the accuracy of personal data can quickly overwhelm a computing device: the tasks may conflict with one another and compete inefficiently for computing resources, such as processor power and memory capacity. To address these issues, a scheduler is used to queue the various tasks involved.
Description
Technical Field
This field is generally related to processing information.
Background
As technology advances, more and more personal data is digitized and, as a result, more and more personal data becomes legitimately accessible. The increase in accessibility of personal data has spawned new industries that focus on the legal mining of personal data.
A personal data record may include a number of attributes. A data record representing an individual may include attributes such as the individual's name, city, state, and zip code. In addition to demographic information, data records may also include information about the individual's behavior. Data records from different sources may include different attributes. Systems exist for collecting information describing the characteristics or behavior of individuals. Collecting such personal information has many applications, including in national security, law enforcement, marketing, healthcare, and insurance.
For example, in healthcare, a healthcare provider may have inconsistent personal information (such as address information) across various data sources, including National Provider Identifier (NPI) registration, Drug Enforcement Administration (DEA) registration, public resources (e.g., internet websites such as the YELP review website), and proprietary sources such as health insurance company claim information.
As records receive more updates from different sources, the risk of inconsistencies and data-entry errors grows. Data records that all describe the same person may thus be mutually inconsistent or erroneous in their content. Across these various sources, a single healthcare provider may have many addresses, perhaps up to 200. The sources may not agree on the correct address, and some healthcare providers legitimately have multiple correct addresses. Moreover, the fact that a provider has a newer address does not mean that an older address is incorrect.
Some health and dental insurance companies require employees to manually call healthcare providers to determine their correct addresses. However, such manual updates are costly, because a healthcare provider's address information may change frequently. Beyond address information, similar problems exist with other demographic information related to healthcare providers, such as telephone numbers.
In addition, fraudulent claims are an enormous problem in healthcare. It is estimated that fraudulent claims divert over $80 billion each year from government-operated health insurance programs alone. The prevalence of fraud far exceeds the capacity of law enforcement and insurance companies to investigate.
Data-driven algorithms, known as machine learning algorithms, can be used to make predictions and to perform certain data analyses. Machine learning is the field of computer science that gives computers the ability to learn without being explicitly programmed. In data analysis, machine learning is a method for designing complex models and algorithms that can be used for prediction and estimation.
To develop these models, they must first be trained. Typically, training involves inputting a set of parameters, called features, together with known correct or incorrect values for those input features. After the model is trained, it can be applied to new features for which no answer is yet known. Applied in this manner, the model can predict or estimate solutions for situations that are otherwise unknown. Such models can uncover hidden insights by learning from historical relationships and trends in the data. The quality of these machine learning models may depend on the quality and quantity of the underlying training data.
There is a need for systems and methods that improve the identification and prediction of correct personal information (such as demographic information and fraud trends of healthcare providers) or data sources.
Disclosure of Invention
In an embodiment, a computer-implemented method trains a machine learning algorithm with time-varying (temporally variant) personal data. At various times, data sources are monitored to determine whether data relating to a person has been updated. When the person's data has been updated, the updated data is stored in a database such that the database includes a running log specifying how the person's data changes over time. The person's data includes values of a plurality of attributes related to the person. An indication is received that a value of a particular attribute in the person's data is verified as accurate or inaccurate at a particular time. Based on the particular time, the person's data is retrieved from the database, including the values of the plurality of attributes that were most recent at the particular time. With the retrieved data and the indication, a model may be trained to predict whether a value of the particular attribute is accurate for another person. In this manner, the retrieved data is anchored to the particular time, preserving its meaning for training the model.
In an embodiment, a computer-implemented method associates different demographic data related to a person. In the method, a plurality of different values describing the same attribute of the person are received from a plurality of different data sources. It is determined whether any of the plurality of different values represent the same trait. When different values are determined to represent the same trait, the value that most accurately represents the trait is selected, and the values determined to represent the same trait are linked.
In an embodiment, a system schedules data ingestion and machine learning. The system includes a computing device, a database, a queue stored on the computing device, and a scheduler implemented on the computing device. The scheduler is configured to place requests to complete jobs on the queue. A request includes instructions to complete at least one of a data ingestion task, a training task, and a solution task. The system also includes three processes: a data ingestion process, a trainer process, and a solver process, each implemented on the computing device and monitoring the queue. When the queue includes a request to complete the data ingestion task, the data ingestion process retrieves data related to the person from the data source and stores the retrieved data in the database. When the queue includes a request to complete the training task, the trainer process trains the model using the retrieved data in the database and an indication that the value of a particular attribute in the person's data has been verified as accurate or inaccurate. The model is trained so that it can predict whether a value of the particular attribute is accurate for another person. Finally, when the queue includes a request to complete a solution task, the solver process applies the model to predict whether another person's values for the plurality of attributes are accurate.
Method, system, and computer program product embodiments are also disclosed.
Further embodiments, features, and advantages of the present inventions, as well as the structure and operation of the various embodiments, are described in detail below with reference to the accompanying drawings.
Drawings
The accompanying drawings, which are incorporated herein and form a part of the specification, illustrate the present disclosure and, together with the description, further serve to explain the principles of the disclosure and to enable a person skilled in the pertinent art to make and use the disclosure.
Fig. 1 is a schematic diagram illustrating training of a machine learning model with time-varying data, according to an embodiment.
FIG. 2 is a flow diagram illustrating a method of ingesting data and training a model, according to an embodiment.
Fig. 3 is a schematic diagram illustrating an example of ingesting data to train a model, according to an embodiment.
Fig. 4 is a flow diagram illustrating a method of applying a trained model, according to an embodiment.
Fig. 5 is a diagram illustrating an example of applying a model to identify an address according to an embodiment.
Fig. 6 is a schematic diagram illustrating a method of cleaning up ingested data according to an embodiment.
Fig. 7 is a schematic diagram illustrating a method of cleaning up ingested address data according to an embodiment.
Fig. 8 is a schematic diagram illustrating a method of linking ingested data according to an embodiment.
Fig. 9 is a diagram illustrating an example of linking ingested data, according to an embodiment.
FIG. 10 is a schematic diagram illustrating a system for ingesting data, training a model based on the data, and determining a solution based on the trained model, according to an embodiment.
FIG. 11 is a schematic diagram illustrating a system for scheduling ingestion, training, and solution tasks, according to an embodiment.
The drawing in which an element first appears is typically indicated by the leftmost digit or digits in the respective reference number. In the drawings, like reference numbers may indicate identical or functionally similar elements.
Detailed Description
Machine learning algorithms may train models to predict the accuracy of personal data. However, meaningful training data is required to train a model. Because personal data changes over time, training data may become obsolete, undermining its usefulness for training models. Embodiments address this problem with a database that maintains a running log specifying how each person's data changes over time. When information verifying the accuracy of a person's data becomes available to train a model, embodiments may retrieve from the database all of the person's data as it existed at the time the accuracy was verified. From this retrieved information, features may be determined, and the determined features are used to train the model. In this way, embodiments avoid training on out-of-date data.
When data is ingested, it may not be standardized. For example, the same address may be listed differently in different records and data sources. The differing representations make it difficult to link records, and machine learning algorithms and models operate more effectively when the same data is represented the same way. To address this issue, embodiments clean the data to ensure that ingested data fields are standardized.
Training models and the various tasks required to resolve the accuracy of personal data can quickly become cumbersome for a computing device. The tasks may conflict with one another and compete inefficiently for computing resources (e.g., processor power and memory capacity). To address these issues, a scheduler is employed to queue the various tasks involved.
In the following detailed description, references to "one embodiment," "an example embodiment," etc., indicate that the embodiment described may include a particular feature, structure, or characteristic, but every embodiment may not necessarily include the particular feature, structure, or characteristic. Moreover, such phrases are not necessarily referring to the same embodiment. Further, when a particular feature, structure, or characteristic is described in connection with an embodiment, it is submitted that it is within the knowledge of one skilled in the art to effect such feature, structure, or characteristic in connection with other embodiments whether or not explicitly described.
Fig. 1 is a diagram 100 illustrating training of a machine learning model with time-varying data, according to an embodiment. The diagram 100 includes a timeline 120, which shows times 102A..N and 104A..N.
At times 102A..N, information about an individual or a group of people being monitored has been updated. As described below, the information may be stored in a number of different data sources. As applied to healthcare providers, the data sources may include public databases and catalogs that describe demographic information about the various healthcare providers, as well as proprietary databases such as internal insurance catalogs and claims databases. An update to any data source produces an entry in the change log of the historical update database 110. For example, when a new claim is added for a healthcare provider, the new claim is recorded in the historical update database 110. Similarly, when the provider's address is updated, the change is recorded in the historical update database 110, such that the historical update database 110 archives all relevant data sources for all monitored people as changes are made. In this manner, the historical update database 110 includes a running log that specifies how all relevant data relating to the monitored people changes over time. From the historical update database 110, the contents of all the data at any particular time can be determined.
At times 104A..N, a value in a person's data is verified as accurate or inaccurate. In the context of demographic information, such as an address or telephone number, this may involve calling the healthcare provider and asking whether the address or telephone number is valid. The result is an indication of whether the address is valid or invalid, along with the time at which the verification occurred. These values are stored in the verification database 112. In addition to demographic information, other information about an individual, including behavior, may be verified or determined. For example, times 104A..N may be the times at which claims determined by investigation to be fraudulent occurred.
Using the historical update database 110 and the verification database 112, a characterization training database 114 may be determined. Before being input to the characterization training database 114, the historical data from the historical update database 110 may be converted into features useful for training machine learning algorithms, as described below. These features are used to train the machine learning model 116.
If the historical update database 110 included only the most up-to-date information, the information in the verification database 112 would soon become outdated, because updates arrive at times 102A..N. Further, the verifications at times 104A..N may occur independently of times 102A..N. If information from a data source were collected only when verification data is received, time would likely have passed and the data source would have been updated. Likewise, the historical update database 110 would become outdated if it included only the data that was valid when new verification data was received. For example, the data most relevant to predicting fraudulent claims is the data that was valid at the time the claim was submitted. If the historical update database 110 included only the most recent information, or only the information available when a claim was determined to be fraudulent, little associated historical data would remain, and the machine learning algorithm would be less effective.
FIG. 2 is a flow diagram illustrating a method 200 of ingesting data to train a model, according to an embodiment. Exemplary operations of the method 200 are illustrated in diagram 300 of FIG. 3.
The method 200 begins at step 202 by examining various data sources to determine whether data has been updated. To check whether data has been updated, embodiments may, for example, check a timestamp of the data, or compute a hash value over the data and compare it to the hash value generated when the data was last checked. The check of step 202 may be performed on a plurality of different data sources, such as those shown in diagram 300 of FIG. 3.
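As an illustration only, the change check of step 202 might be sketched as follows. The HTTP fetch, the SHA-256 digest, and the idea of persisting the previous hash between checks are assumptions for the sketch, not details from the disclosure.

```python
# A minimal sketch of the step 202 change check, assuming each data source
# can be fetched over HTTP; the URL handling and hash choice are illustrative.
import hashlib
import urllib.request

def fingerprint(payload: bytes) -> str:
    """Digest of the raw payload, comparable across checks."""
    return hashlib.sha256(payload).hexdigest()

def source_has_update(url: str, last_hash: str) -> tuple:
    """Return (changed, new_hash) for one monitored data source."""
    with urllib.request.urlopen(url) as response:
        payload = response.read()
    new_hash = fingerprint(payload)
    return new_hash != last_hash, new_hash
```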
Diagram 300 illustrates various data sources: a Centers for Medicare & Medicaid Services (CMS) data source 302A, a catalog data source 302B, a DEA data source 302C, a public data source 302D, an NPI data source 302E, a registration data source 302F, and a claims data source 302G.
The CMS data source 302A may be a data service provided by a government agency. The database may be distributed, and different organizations may be responsible for storing different portions of the data in the CMS data source 302A. The CMS data source 302A may include data about healthcare providers, such as legally available demographic information and claim information. The CMS data source 302A may also allow providers to register and update their information in a provider enrollment system and to participate in the Medicare and Medicaid Electronic Health Record (EHR) incentive programs.
The catalog data source 302B may be a catalog of healthcare providers. In one example, the catalog data source 302B can be a proprietary catalog that matches healthcare providers with demographic and behavioral traits that a particular customer considers authentic. The catalog data source 302B may, for example, belong to an insurance company and may only be securely accessed and used with that company's consent.
The DEA data source 302C may be a registry database maintained by a governmental agency, such as the DEA. The DEA may maintain a database of healthcare providers (including doctors, optometrists, pharmacists, dentists or veterinarians) who are allowed to prescribe or dispense medications. The DEA data source 302C can match the healthcare provider to the DEA number. In addition, the DEA data source 302C can include demographic information about the healthcare provider.
The public data source 302D may be a publicly available, possibly Web-based, data source such as an online review system. One example is the YELP online review system. These data sources may include demographic information about the healthcare provider, areas of expertise, and behavioral information (such as reviews by the public).
The NPI data source 302E is a data source that matches a healthcare provider with a National Provider Identifier (NPI). The NPI is an administrative simplification standard of the Health Insurance Portability and Accountability Act (HIPAA), and is a unique identification number for covered healthcare providers. Covered healthcare providers and all health plans and healthcare clearinghouses must use NPIs in the administrative and financial transactions adopted under HIPAA. The NPI is a 10-digit, intelligence-free numeric identifier, meaning the number carries no other information about the healthcare provider, such as the state in which they practice or their medical specialty. The NPI data source 302E may also include demographic information about the healthcare provider.
The registration data source 302F may include state licensing information. For example, a healthcare provider such as a physician may need to register with a state licensing board. The state licensing board may provide the registration data source 302F with information about the healthcare provider, such as demographic information and areas of expertise, including board certification.
The claims data source 302G can be a data source with insurance claim information. Like the catalog data source 302B, the claims data source 302G can be a proprietary database. An insurance claim may specify the information necessary for insurance reimbursement. For example, claim information may include information about the healthcare provider, the services performed, and the amount claimed. The services performed may be described using a standardized code system, such as ICD-9. The information about the healthcare provider may include demographic information.
Returning to FIG. 2, at decision block 204, each data source is evaluated to determine whether an update has occurred. If an update has occurred in any of the data sources, the update is stored at step 206. The updates may be stored in the historical update database 110 shown in FIG. 3. As described above with reference to FIG. 1, the historical update database 110 includes a running log that specifies how a person's data changes over time.
For example, in FIG. 3, such a running log in the historical update database 110 is shown in table 312. Table 312 has three rows and five columns: source ID, time, provider ID, attribute, and value. The source ID column indicates the source of the underlying data in the historical update database 110. Tracking the source of the data may be important to ensure that proprietary data is not used improperly. In table 312, the first two rows indicate data retrieved from the NPI data source 302E, and the third row indicates data retrieved from the claims data source 302G. The time column may indicate the time of the update or the time at which the update was detected. The provider ID column may be the primary key identifier of the healthcare provider. The attribute column may be a primary key identifier for one of several monitored attributes, such as demographic data (e.g., address, phone number, name). In this case, the attribute value of each row in table 312 is 1, indicating that each row relates to an update of a healthcare provider's address attribute. The value column indicates the value received from the particular source at the specified time for that attribute and that provider. In table 312, the first address value retrieved from the NPI data source 302E for the provider is "123 Anywhere Street," and the second address value subsequently retrieved from the NPI data source 302E for the provider is "123 Anywhere St."
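To make the running log concrete, the following sketch (an assumption-laden illustration, not the patent's implementation) stores rows shaped like table 312 in SQLite and retrieves the latest value of each attribute for a provider as of a given time:

```python
# Sketch: reconstruct a provider's data "as of" a time from a table 312-style log.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE history_update (
    source_id INTEGER, time TEXT, provider_id INTEGER,
    attribute INTEGER, value TEXT)""")

def values_as_of(conn, provider_id, as_of):
    """Latest value of every attribute for one provider at the given time."""
    return conn.execute("""
        SELECT h.attribute, h.value
        FROM history_update h
        JOIN (SELECT attribute, MAX(time) AS latest_time
              FROM history_update
              WHERE provider_id = ? AND time <= ?
              GROUP BY attribute) latest
          ON h.attribute = latest.attribute AND h.time = latest.latest_time
        WHERE h.provider_id = ?""", (provider_id, as_of, provider_id)).fetchall()
```

Anchoring the query at the verification time is what keeps a training example from mixing in data that only arrived later.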
After the updated raw data downloaded from the data sources is stored at step 206, the data is cleaned and normalized at step 208. Different data sources sometimes use different conventions to represent the same underlying data, and certain errors commonly occur in the data. At step 208, instances where different data sources use varying conventions to represent the same underlying data are identified, and errors that occur frequently or systematically are corrected. This cleaning and normalization is described in more detail below with reference to FIGS. 6-7.
Turning to FIG. 3, diagram 300 illustrates an example of cleaning and normalization at step 314 and table 316. In table 316, the first row and the second row are determined to represent the same underlying trait. Thus, they are linked and given a common representation: to maintain consistency, "Street" is changed to the abbreviation "St.", and the suite number missing from the first row is added.
Returning to FIG. 2, at step 210, features representing known correct or incorrect data are captured. As described above, the attribute for which the model is built may be verified manually. For example, for a model that predicts the accuracy of a healthcare provider's address, a worker may call the healthcare provider and ask whether the address is correct. The resulting solution data may be used to train the model. In addition to this solution data, the input parameters required by the model must also be determined. These input parameters are referred to as features.
A machine learning algorithm may perform better if the input parameters are facts about the attribute, rather than raw data fed directly into the model. Facts may be, for example, true-or-false statements about the underlying raw data. For example, in the address model, the following features may be useful:
- Was the address updated within the last six months? Within the past year?
- Does the provider have a state registration that matches this address?
- Is there claim data for this address within the last six months? Within the past year?
- Is the update date of the address data the same as its creation date?
New features may be continually added and tested to determine their efficacy in predicting whether an address is correct. To conserve the computational resources needed to train and solve the model, less influential features may be removed, while new features determined to have predictive value are added.
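A sketch of computing such true-or-false facts from logged updates and claims might look like the following; the record layout and the 182-day window are illustrative assumptions.

```python
# Sketch of the step 210 characterization: turn raw history into boolean facts.
from datetime import timedelta

def address_features(updates, claims, verified_at, address):
    """Boolean facts about one candidate address at verification time.

    `updates` and `claims` are assumed to be lists of dicts with datetime
    "time" fields; `verified_at` is the datetime of the verification.
    """
    six_months_ago = verified_at - timedelta(days=182)
    addr_updates = [u for u in updates if u["value"] == address]
    return {
        "updated_last_6mo": any(u["time"] >= six_months_ago for u in addr_updates),
        "claim_at_address_last_6mo": any(
            c["address"] == address and c["time"] >= six_months_ago
            for c in claims),
        # a single log entry means the value was never revised after creation
        "update_date_equals_create_date": len(addr_updates) == 1,
    }
```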
Turning to FIG. 3, the characterization process is shown at step 318, producing the training data shown in table 320. In table 320, two rows show two different verifications that have occurred. For the provider with ID 14, the address "123 Anywhere St." has been verified as correct. For the provider with ID 205, the address "202 Nowhere St." has been verified as incorrect. Both rows have a set of features F1..FN.
Returning to FIG. 2, at step 212, the training data is used to train a plurality of machine learning models. Different types of models may have different effectiveness for each attribute, so at step 212 many different types of models are trained. Types may include, for example: logistic regression, naive Bayes, elastic net, neural networks, Bernoulli naive Bayes, multinomial naive Bayes, nearest neighbor classifiers, and support vector machines. In some embodiments, these techniques may be combined. Given input features related to an attribute, a trained model may output a score indicating the likelihood that the attribute value is correct.
At step 214, the best model or combination of models is selected. The best model is likely the one that most accurately predicts the attribute it was trained to predict. Step 214 may be implemented using a grid search. For each known correct answer, features are computed and applied to each trained model. For each trained model, an accuracy value is determined that indicates how correct the scores output by that model are. The model with the greatest accuracy is then selected to predict the correctness of the attribute.
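The disclosure does not name a library, but assuming scikit-learn for illustration, steps 212 and 214 might be sketched as training several candidate model types and keeping the most accurate one:

```python
# Sketch of steps 212-214: train several model types, keep the most accurate.
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import BernoulliNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

def select_best_model(X, y):
    """Cross-validate each candidate and fit the winner on all the data."""
    candidates = [LogisticRegression(max_iter=1000), BernoulliNB(),
                  KNeighborsClassifier(), SVC(probability=True)]
    scored = [(cross_val_score(m, X, y, scoring="accuracy").mean(), m)
              for m in candidates]
    best_accuracy, best = max(scored, key=lambda pair: pair[0])
    return best.fit(X, y)
```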
In this manner, embodiments ingest data from multiple data sources and use the data to train a model that can predict whether a particular attribute is accurate. As shown in fig. 4 and 5, a trained model may be applied.
FIG. 4 is a flow diagram illustrating a method 400 of applying a trained model, according to an embodiment. The operations of method 400 are illustrated in diagram 500 of FIG. 5.
The method 400 begins at step 402, where features are collected for the queried attribute. Features can be collected in the same manner as they were collected to develop the training data for the attribute's model. For example, the data may be cleaned and normalized as was done for the training data, as described above and in detail below with reference to FIGS. 6-7. The features may be computed from the historical update database 110 using the most up-to-date information about the attribute. In one embodiment, the features may be computed only for the provider a user requests. In another embodiment, the features may be computed for every provider, or for every provider that does not have a recently verified attribute (e.g., address) included in the training data. An example of the computed data is shown in diagram 500 of FIG. 5.
In diagram 500, table 502 shows data received from the historical update database 110 for input into the trained model. Each row represents a different value of the attribute to be predicted. The provider ID corresponds to the provider associated with the value. F1..FN are features relating to the provider and the specific value; these may be the same facts used to train the model.
Returning to FIG. 4, at step 404, the collected features are applied to the trained model. Features may be input to the model, and thus the model may output a score indicating the likelihood that the value is accurate.
Exemplary scores are shown at step 504 and table 506 of diagram 500. Table 506 presents the various possible addresses for the provider and the score the model has output for each address. In addition, table 506 includes the source of each address; determining the source may require additional queries to the historical update database 110. In the example of table 506, there are four possible addresses for the particular provider: "123 Anywhere St." collected from the NPI data source, "321 Someplace Rd." collected from a first claims data source, "10 Somewhere Ct." collected from a second, different claims data source, and "5 Overthere Blvd." collected from the DEA data source. The model computes a score for each address.
In FIG. 4, at step 406, the scores are analyzed to determine the appropriate answer. For some attributes, the provider may have more than one valid answer; for example, a provider may have more than one valid address. To determine which answers are valid, the scores may be analyzed. In one embodiment, scores greater than a threshold may be selected as correct. In another embodiment, scores below a threshold may be rejected as incorrect. In yet another embodiment, a clustering of the scores may be determined, and the answers in the top cluster selected as correct.
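Under the simplest of these strategies, a fixed threshold, the selection of step 406 reduces to a filter. The 0.9 cutoff below mirrors the example that follows and is otherwise arbitrary:

```python
# Sketch of step 406 with a fixed-threshold strategy.
def select_answers(scored_values, threshold=0.9):
    """Keep the candidate values the model scored as likely correct."""
    return [(value, score) for value, score in scored_values
            if score >= threshold]
```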
Once the possible answers are determined in step 406, they are filtered based on their information sources in step 408. As noted above, not all data sources are public; some are proprietary. The filtering at step 408 may ensure that values retrieved from a proprietary source are not revealed to another party without proper consent.
The answer selection and filtering described in steps 406 and 408 are shown at step 508 and list 510 of FIG. 5. In this example, three of the four possible addresses may be selected as valid addresses for the provider: "321 Someplace Rd.", "10 Somewhere Ct.", and "5 Overthere Blvd.". The scores for these three addresses are .95, .96, and .94, respectively; they are close to one another and above a threshold, which may be .9. The remaining address scores only .10, which is below the threshold, and it is therefore excluded from the possible solutions.
The three valid addresses come from three different data sources. The address "5 Overthere Blvd." was collected from a public data source (the DEA data source described above), so it is included in list 510, which lists the final answers to be presented to the user. The other two addresses, "321 Someplace Rd." and "10 Somewhere Ct.", were collected from proprietary claims databases. In the example shown in diagram 500, the user can access the first claims database, which contains the address "321 Someplace Rd.", but cannot access the second claims database, which contains the address "10 Somewhere Ct.". Thus, "321 Someplace Rd." is included in list 510, but "10 Somewhere Ct." is not.
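The source filtering of step 408 might be sketched as below; the shape of the answer records and of the user's entitlements is assumed for illustration, using the addresses from the example above:

```python
# Sketch of step 408: drop answers from sources the user may not access.
def filter_by_source(answers, accessible_sources):
    """An answer survives if its source is public or licensed to the user."""
    return [a for a in answers
            if a["is_public"] or a["source"] in accessible_sources]

final = filter_by_source(
    [{"value": "5 Overthere Blvd.", "source": "DEA", "is_public": True},
     {"value": "321 Someplace Rd.", "source": "claims_1", "is_public": False},
     {"value": "10 Somewhere Ct.", "source": "claims_2", "is_public": False}],
    accessible_sources={"claims_1"})
# final keeps "5 Overthere Blvd." and "321 Someplace Rd.", as in list 510
```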
In this manner, embodiments apply a trained model to solve for valid values for individual attributes of personal data.
As described above, in order to train a model and apply the collected data to the model to solve for the correct values, the data ingested from the various data sources must be cleaned up and normalized. This process is described, for example, with reference to fig. 6-7.
Fig. 6 is a schematic diagram illustrating a method 600 of cleansing ingested data, according to an embodiment.
The method 600 begins at step 602 when a plurality of values for an attribute are received from a plurality of data sources. The data ingestion process is described above with reference to fig. 2 and 3.
At step 604, the values are analyzed to determine if any of them represent the same trait. In the context of an address, various address values are analyzed to determine whether they are intended to represent the same underlying geographic location.
At step 606, when values are determined to represent the same trait, the value that most accurately represents the trait is selected. At step 608, the values are linked to indicate that they represent the same trait. In one embodiment, they may all be set to the value selected in step 606.
FIG. 7 is a schematic diagram that illustrates a method 700 of scrubbing ingested address data, according to an embodiment.
The method 700 begins at step 702. At step 702, each address is geocoded. Geocoding is the process of converting a postal address description into a location (e.g., a spatial representation in numerical coordinates) on the surface of the earth.
At step 704, the geocoded coordinates are evaluated to determine whether they represent the same geographic location. If so, the ingested address values are likely intended to represent the same trait.
At step 706, suite (apartment) numbers are evaluated. Suite numbers are often expressed in various ways: for example, designators other than "Suite" may be used. Further, digits of a suite number are sometimes omitted, and digits are more often erroneously omitted than erroneously added. Using this observation, embodiments may choose between a number of possible suite numbers.
For example, a healthcare provider may have various addresses with differing suite numbers: "Suite 550" and "Suite 5500". An embodiment determines whether a first string among the plurality of different values is a substring of a second string among those values. For example, the embodiment determines that "550" is a substring of "5500". The embodiment then determines that "5500" more accurately represents the healthcare provider's address, because digits are more often erroneously omitted than erroneously added. In addition to or instead of checking substrings, embodiments may apply fuzzy matching, e.g., comparing the Levenshtein distance between two strings to a threshold.
At step 708, digits with similar appearances are evaluated. In an embodiment, a first string among the plurality of different values is determined to be similar to a second string among those values except for differing digits that have similar appearances. When this determination occurs, the string determined to most accurately represent the value is selected.
For example, a healthcare provider may have various addresses with differing suite numbers: "Suite 6500" and "Suite 5500". The digits "5" and "6" may have similar appearances, and the strings are identical except that "6" is replaced with "5". The strings may therefore be recognized as representing the same address. To determine which string holds the correct suite number, other factors may be considered, such as the suite numbers presented in other sources.
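Steps 706 and 708 might be sketched together as follows. The look-alike digit pairs and the edit-distance cutoff are illustrative assumptions; the Levenshtein routine is the standard dynamic-programming edit distance:

```python
# Sketch of the suite-number checks in steps 706-708.
CONFUSABLE_DIGITS = {("5", "6"), ("6", "5"), ("1", "7"), ("7", "1")}

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,            # deletion
                           cur[j - 1] + 1,         # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def same_suite(a: str, b: str) -> bool:
    """Heuristically decide whether two suite numbers denote the same suite."""
    if a in b or b in a:                       # step 706: omitted digits
        return True
    if len(a) == len(b) and sum(x != y for x, y in zip(a, b)) == 1:
        diff = next((x, y) for x, y in zip(a, b) if x != y)
        return diff in CONFUSABLE_DIGITS       # step 708: look-alike digits
    return levenshtein(a, b) <= 1              # fuzzy-matching fallback
```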
Based on the analysis in steps 706 and 708, the correct address is selected in step 710.
Fig. 8 is a schematic diagram illustrating a method of linking ingested data according to an embodiment. As shown, method 800 describes an embodiment for matching and linking records using an embodiment of the foregoing system. The term "match" refers to a determination that two or more personal data records correspond to the same person.
At step 830, the processor legitimately accesses at least one set of data records stored in the memory. In an embodiment, the set of data records may include the data sources described above with respect to fig. 2-3. All data can be legally accessed and retrieved from a variety of external sources.
In some instances, the accessed data records may be received and/or stored in an undesirable format or in a format that is incompatible with contemplated methods and systems. In such embodiments, the data records are cleaned or standardized to conform to a predetermined format.
At step 832, each accessed data record is parsed. In an embodiment, the parsing step is implemented using control logic that defines a set of dynamic rules. In an embodiment, the control logic may be trained to parse the data records and locate a first name, a last name, a home address, an email address, a telephone number, or any other demographic or personal information describing the individual associated with the parsed data record. In another embodiment, the control logic may specify a set of persistent rules based on the type of data record being parsed.
At step 834, the parsed data is assigned to predetermined categories within the respective records. For example, an embodiment may include parsing rules for finding a person's first name, last name, home address, email address, and phone number. In such an embodiment, when the processor looks for a first name, a last name, etc., a temporary file may be created within the data record, wherein the first name, last name, etc. are assigned to the respective categories. In another embodiment, a new persistent file may be created to store the classification data. For example, a new record may be created as a new row in a database table or memory, and different categories may be entered as column values in the row, respectively. In yet another embodiment, the processor may assign the classification data and store the assigned and classified data as metadata within the original file.
At step 836, the classification data for each record is compared to all other classification records using a pairwise function. For example, the processor compares the classification data of the first record with the classification data of the second record. In an embodiment, the processor compares a single class. For example, the processor compares the address associated with the first record to the address associated with the second record to determine if they are the same. Alternatively, other possible categories may be compared, including first name, last name, email address, social security number, or any other identifying information. In another embodiment, the processor compares more than one category of data. For example, the processor may compare the first name, last name, and address associated with the first record to the first name, last name, and address of the second record to determine if they are the same. The processor may track which categories match and which do not. Alternatively, the processor may count only the number of matched categories. It is contemplated that step 836 may include comparing more than three categories. For example, in an embodiment, the processor compares up to seven categories. In further embodiments, the processor makes comparisons between 8 and 20 classes.
In embodiments, step 836 may use not only text matching, but other types of matching, such as regular expression matching or fuzzy matching. Regular expression matching may determine that two values match when they both satisfy the same regular expression. When two strings approximately (rather than exactly) match a pattern, a fuzzy match may detect a match.
In an embodiment, step 836 may be implemented using multiple sets of data records. For example, data records from a first set of records may be compared to data records from a second set of records using the methods and systems described herein. In an embodiment, the first set of data records may be an input list comprising data records describing the person of interest or a list of the person of interest. The second set of data records may be personal data records from a second input list or legally stored in a database. A comparison of the sets of data records is performed to determine whether the records of the first set of data records and the records of the second set of data records describe the same person.
Further, in embodiments implemented using multiple sets of data records, the second set of data records may hold true identities, identities with confirmed accuracy, and/or identities that exceed a predetermined accuracy threshold. The true identity may be encoded as a serial number.
At step 838, for each data pair, a similarity score is calculated based on the data comparison. More specifically, the processor calculates a similarity score for each pair based on which categories in the pair of records match, as determined in step 836. In an embodiment, the similarity score is calculated as a ratio. For example, in an embodiment that compares 7 categories, if the first record and the second record describe data such that 5 of the 7 categories are the same between the records, the similarity score is 5/7. In another embodiment, the similarity score is calculated as a percentage. For example, in an embodiment where 20 categories are compared, if the first record and the second record describe data such that 16 of the 20 categories are the same between the records, the similarity score is .8, or 80%.
In another embodiment, a weight may be assigned to each category, and a similarity score may be determined in step 838 based on whether each category matches and the respective weights associated with the matching categories. The weights may be determined using a training set. In one example, the weights may be determined using linear programming. In other examples, a neural network or other adaptive learning algorithm may be used to determine a similarity score for a pair of data records based on which categories in the pair match.
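A sketch of the weighted variant of step 838 follows; the category names and weights are invented for illustration, and in practice the weights would come from the training procedures described above:

```python
# Sketch of step 838 with per-category weights (values are placeholders).
WEIGHTS = {"first_name": 2.0, "last_name": 2.0, "address": 1.5,
           "email": 1.0, "phone": 1.0}

def similarity(record_a, record_b, weights=WEIGHTS):
    """Weighted fraction of matching categories between two records."""
    total = sum(weights.values())
    matched = sum(w for category, w in weights.items()
                  if record_a.get(category)
                  and record_a.get(category) == record_b.get(category))
    return matched / total
```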
At step 840, it is determined whether the calculated similarity score meets or exceeds a predetermined threshold. For example, in embodiments where the similarity score threshold is 5/7 (or approximately 71.4%), the processor will determine whether the calculated similarity score meets or exceeds the 5/7 threshold. Likewise, in embodiments where the similarity score threshold is 16/20 (or 80%), the processor will determine whether the calculated score meets or exceeds the threshold.
At step 842, if the similarity scores of at least two records meet or exceed the similarity score threshold, the similar records (i.e., those meeting or exceeding the threshold) are linked or combined into a group. For example, in an embodiment, the processor performs a pairwise comparison between the first record and all subsequent records, and any records that meet or exceed the similarity score threshold are linked and/or combined into a first group. The processor then performs a pairwise comparison between the second record and all subsequent records. Any subsequent records that meet or exceed the similarity score threshold when compared to the second record are linked and/or combined into a second group, assuming the second record was not already linked to the first record. Step 842 also applies when multiple sets of data records are compared: a similarity score is calculated for each data record in the first set as it relates to a data record in the second set, and, as described above, any records that meet or exceed the similarity score threshold are linked and/or combined into groups. In an embodiment, linked/grouped records may be programmatically linked while remaining in their respective record sets.
Further, at step 842, it may occur that the pairwise comparison between the first and second records yields a similarity score that meets or exceeds the threshold, and the pairwise comparison between the second and third records also yields a similarity score that meets or exceeds the threshold, but the pairwise comparison between the first and third records does not meet the threshold. The processor may handle such a conflicting grouping scenario in a number of ways. For example, in an embodiment, the processor may compare additional categories not included in the initial pairwise comparison. If the processor compared first name, last name, address, and phone number during the initial comparison, it may include a social security number, age, and/or any other information that may help narrow the identity during a second comparison. After this second pairwise comparison of the first, second, and third records, an updated similarity score is calculated for each comparison (i.e., first record to second record, first record to third record, second record to third record), and the similarity scores are measured against a second predetermined threshold. If the updated similarity scores meet or exceed the second predetermined threshold, the records are grouped according to the foregoing embodiment. However, if the same situation persists, i.e., the first record is similar to the second, the second is similar to the third, and the first is not similar to the third, then the second record is grouped with either the first or the third record, depending on which pairwise comparison has the higher updated similarity score. If the updated similarity scores are equal, another iteration comparing additional categories begins.
In another embodiment, the processor may handle a conflicting grouping scenario by creating a copy of the second record. After making the copy, the processor may group the first record and the second record into group A, and group the copy of the second record with the third record into group B.
In yet another embodiment, the processor may handle a conflicting grouping scenario by creating a group based on the pairwise comparisons involving the second record. For example, based on the similarity scores between the second record and each of the first and third records, all three records are grouped together through their relationship to the second record.
At step 844, the processor determines the most prevalent identity within each group of similar records. For example, if a group of similar records contains 10 records, of which 5 describe an individual named James and the remaining 5 include the names Jim, Mike, or Harry, the processor will determine that James is the most common name. In other embodiments, the processor may require additional steps to determine the most prevalent identity within each group. For example, a group of similar records may contain six records: two describing a person named Mike, two describing a person named Michael, one describing a person with the first initial "M", and the last describing a person named John. In such an embodiment, the processor may determine that the most prevalent identity is Michael based on the relationship between the names Michael and Mike. Where there is no clear prevailing identity, other categories (i.e., surname, address, email address, phone number, social security number, etc.) may be referenced to determine the most prevalent identity. In embodiments where multiple sets of data records are compared, the data records in the first or second set may be modified or marked to indicate the most prevalent identity and/or the linked/grouped records. More specifically, the records may be modified so that a user can determine the most prevalent identity and/or linked data records when viewing a single set of data records.
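For the simple plurality case, step 844 might be sketched as follows; resolving nickname relationships such as Mike/Michael would require an alias table, which is assumed away here:

```python
# Sketch of step 844: the most common first name within a linked group.
from collections import Counter

def most_prevalent_identity(group):
    """Return the first name occurring most often among the group's records."""
    names = Counter(r["first_name"] for r in group if r.get("first_name"))
    return names.most_common(1)[0][0]
```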
At step 846, the processor modifies the identities of the similar records to match the most prevalent identity within each group of similar records. Returning to the example provided above, a group of similar records contains six records: two describing a person named Mike, two describing a person named Michael, one describing a person with the first initial "M", and the last describing a person named John. In this example, the processor now modifies each record at step 846 so that the identity of each record describes a person named "Michael". After the identities in each similarity group are modified, the record matching operation is complete. This process is further illustrated in FIG. 9.
FIG. 9 is a flowchart 900 illustrating exemplary operations that may be used to implement various embodiments of the present disclosure. As shown, flowchart 900 illustrates an embodiment of a record matching operation using an embodiment of the foregoing system.
Flow chart 900 shows data that has been parsed, classified, and normalized using the parsing, assigning, and classifying steps described above or using well known methods. As shown, the received classification data has been assigned to rows 950a-n and columns 952 a-n. Each of rows 950a-n includes information parsed from a data record describing an individual. Each of the columns 952a-n includes classification information that has been parsed and assigned to a predetermined category.
At step 936, the processor compares the classification data of each record to all other classified records using a pairwise function. As described above, the processor may compare a single category or, alternatively, more than one category. In the illustrated embodiment, the processor compares five categories and implements a similarity score threshold of 3/5 (or 60%).
As above, the method described with reference to FIG. 8 may also apply when comparing multiple sets of data records. Step 936, for example, may be performed using multiple sets of data records: data records from a first set may be compared with data records from a second set. More specifically, the first set of data records may include data records describing a person of interest or a list of persons of interest, while the second set may be personal data records legitimately stored in a database or memory.
At step 942, if the similarity scores of at least two records meet or exceed a similarity score threshold, similar records (i.e., records that meet or exceed the similarity score threshold) are linked or combined into a group. As shown, groups A and B have been created based on the data provided in rows 950a-n and columns 952 a-n. The number of possible groups is proportional to the number of rows being compared. As shown, group A contains three records, while group B contains two records. Each record within a respective group meets or exceeds a similarity score threshold ratio 3/5 (or 60%) as compared to other records within the group.
At step 944, the processor determines the most prevalent identity within each group of similar records. For example, in group A, the processor compares the identities "Aaron Person", "Erin Person", and "A. Person". Following the above rules, the processor determines that "Aaron Person" is the most prevalent identity in group A. In group B, the processor compares the identities "Henry Human" and "H. Human". Also following the above rules, the processor determines "Henry Human" to be the most prevalent identity in group B.
At step 946, the processor modifies the identity of the record 958 to match the identity of the most common record within the respective group of similar records. As shown, the records of group a have been modified to describe the identity of "Aaron Person" while the records of group B have been modified to describe the identity of "Henry Human".
Fig. 10 is a schematic diagram illustrating a system 1000 for ingesting data, training a model based on the data, and determining a solution based on the trained model, according to an embodiment.
The system 1000 includes a server 1050. The server 1050 includes a data ingester 1002, which is configured to retrieve data from data sources 102A..N. The data may comprise a plurality of different values for the same attribute describing a person. In particular, the data ingester 1002 repeatedly and continuously monitors the data sources to determine whether data pertaining to anyone being monitored has been updated. When a person's data has been updated, the data ingester 1002 stores the updated data in database 110. As described above, the database 110 stores a running log that specifies how the person's data changes over time.
Using database 110, server 1050 periodically or intermittently generates a machine learning model 1022 to evaluate the validity of personal data. To generate model 1022, server 1050 includes six modules: querier 1004, data cleaner 1006, data linker 1010, characterizer 1012, trainer 1015, and tester 1020.
API monitor 1003 receives an indication that the value for a particular attribute in the personal data is verified as accurate or inaccurate at a particular time. For example, the caller may manually verify the accuracy of the value and, after verification, cause the API call to be transmitted to the API monitor 1003. Based on the particular time, querier 1004 retrieves the person's data from database 110, including the most recent values for the plurality of attributes at the particular time.
The data cleaner 1006 determines whether any of a plurality of different values represent the same trait. When different values represent the same trait, data cleaner 1006 determines which of the values determined to represent the same trait most accurately represents the trait.
The data linker 1010 links those values determined to represent the same trait. The data linker 1010 may include a geocoder (not shown) that geocodes each of a plurality of different address values to determine a geographic location, and determines whether any of the determined geographic locations are the same.
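A geocoding-based link might be sketched as follows; `geocode` is an assumed callable returning coordinates at a comparable precision, since the patent does not name a particular geocoding service.

```python
def link_same_locations(addresses: list[str], geocode) -> list[set[str]]:
    """Group address strings that geocode to the same geographic location;
    each returned set holds values linked as representing the same trait."""
    by_location: dict[tuple[float, float], set[str]] = {}
    for address in addresses:
        by_location.setdefault(geocode(address), set()).add(address)
    return [group for group in by_location.values() if len(group) > 1]
```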
Using the data retrieved by querier 1004, cleaned by data cleaner 1006, and linked by data linker 1010, characterizer 1012 determines a plurality of features. Each feature of the plurality of features describes a fact about the person's data.
Using these features, trainer 1015 may train model 1022 so that model 1022 can predict whether another person's value for a particular attribute is accurate. In an embodiment, the trainer trains a plurality of models, each using a different type of machine learning algorithm. Tester 1020 evaluates the accuracy of the plurality of models using available training data and selects model 1022 from among them based on the evaluated accuracy.
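As one hedged illustration using scikit-learn (the description elsewhere mentions algorithm families such as logistic regression, support vector machines, and neural networks, but prescribes no library), trainer 1015 and tester 1020 might be approximated like this, with a held-out split standing in for "available training data":

```python
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score

def train_and_select(X, y):
    """Train several model types on verified examples (y: 1 = accurate,
    0 = inaccurate) and return the most accurate one."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=0)
    candidates = [LogisticRegression(max_iter=1000),
                  SVC(probability=True),
                  MLPClassifier(max_iter=1000)]
    scored = []
    for model in candidates:
        model.fit(X_train, y_train)
        scored.append((accuracy_score(y_test, model.predict(X_test)), model))
    return max(scored, key=lambda pair: pair[0])[1]  # best accuracy wins
```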
The server 1050 may use the model 1022 to predict whether the records in the database 110 are accurate. To generate answers for presentation to the client, the server 1050 includes two modules: a scoring engine 1025 and an answer filter 1030. The scoring engine 1025 applies the model 1022 to predict whether another person's values for the plurality of attributes are accurate. In an embodiment, the model is applied to each of a plurality of values for a particular attribute of the other person to determine a score for each value.
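Scoring and filtering might then combine as in the sketch below; `featurize`, the `source` field, and `client_sources` are assumed names, and the rights check reflects the filtering behavior described in claim 7.

```python
def score_and_filter(model, candidate_values, featurize, client_sources):
    """Sketch of scoring engine 1025 and answer filter 1030: score each
    candidate value, drop values from sources the client lacks rights to,
    and return the best remaining (score, value) pair, or None."""
    scored = [(model.predict_proba([featurize(v)])[0][1], v)
              for v in candidate_values]
    permitted = [(score, v) for score, v in scored
                 if v["source"] in client_sources]
    return max(permitted, default=None, key=lambda pair: pair[0])
```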
The various modules shown in fig. 10 may conflict with each other and inefficiently compete for computing resources, such as processor power and memory capacity. To address these issues, a scheduler is employed to queue the various tasks involved, as shown in FIG. 11.
FIG. 11 is a schematic diagram illustrating a system 1100 for scheduling ingestion, training, and solution tasks, according to an embodiment. In addition to the modules of FIG. 10, the system 1100 includes a scheduler 1102 and a queue 1106, as well as various processes, including a data ingestion process 1108, a trainer process 1110, and a solver process 1112. Each of the various processes runs on a separate thread of execution.
As in system 1000, system 1100 includes API monitor 1003. As described above, API monitor 1003 may receive an indication that a value for a particular attribute in personal data is verified as accurate or inaccurate at a particular time. API monitor 1003 may also receive other types of API requests. Depending on the content of an API request, the API monitor may, upon its receipt, place on the queue a request to complete another job specified in the API request; the API request includes instructions to complete at least one of a data ingestion task, a training task, a solution task, or a scheduling task.
Scheduler 1102 places requests to complete jobs on queue 1106. Each request includes instructions to complete at least one of a data ingestion task, a training task, and a solution task. In an embodiment, scheduler 1102 places requests to complete a job on the queue at periodic intervals. The scheduler 1102 also monitors the queue 1106. When the queue 1106 includes a request (potentially placed by API monitor 1003) to complete a scheduling task, the scheduler 1102 schedules the task as specified in the API request.
The queue 1106 queues various tasks 1107. The queue 1106 may be any type of message queue used for inter-process communication (IPC) or for inter-thread communication within the same process; such queues pass messages carrying control or content, and group communication systems provide similar functionality. Queue 1106 can be implemented, for example, using the Java Message Service (JMS) or Amazon Simple Queue Service (SQS).
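As a rough stand-in for queue 1106 and its consumers (JMS and SQS are the services the description names; Python's standard queue module is used here only to make the pattern concrete):

```python
import queue
import threading

tasks: "queue.Queue[dict]" = queue.Queue()

def scheduler(interval_seconds: float, stop: threading.Event) -> None:
    """Cf. scheduler 1102: periodically place one job of each kind."""
    while not stop.wait(interval_seconds):
        for kind in ("ingest", "train", "solve"):
            tasks.put({"kind": kind})

def worker(kind: str, handler, stop: threading.Event) -> None:
    """Cf. processes 1108/1110/1112: each runs on its own thread, handling
    only tasks of its kind and re-queueing the rest. Dedicated per-kind
    queues (the multi-queue embodiment below) avoid the re-queueing."""
    while not stop.is_set():
        task = tasks.get()
        if task["kind"] == kind:
            handler(task)
        else:
            tasks.put(task)  # not this worker's task; return it to the queue
        tasks.task_done()
```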
The data ingest process 1108 includes a data ingester 1002. The data ingestion process 1108 monitors the data ingestion tasks in the queue 1106. When the queue 1106 next includes a data ingest task, the data ingest process 1108 executes the data ingester 1002 to retrieve data related to a person from a data source and store the retrieved data in a database.
The trainer process 1110 includes the querier 1004, data cleaner 1006, data linker 1010, characterizer 1012, trainer 1015, and tester 1020. The trainer process 1110 monitors the training tasks in the queue 1106. When the queue 1106 next includes a training task, the trainer process 1110 executes the querier 1004, data cleaner 1006, data linker 1010, characterizer 1012, trainer 1015, and tester 1020 to train the model.
Solver process 1112 includes a scoring engine 1025 and an answer filter 1030. Solver process 1112 monitors the solution tasks in queue 1106. When the queue 1106 next includes a solution task, solver process 1112 executes the scoring engine 1025 and the answer filter 1030 to apply the model, predict whether another person's values for the plurality of attributes are accurate, and determine the final solution presented to the user.
In an embodiment (not shown), the system 1100 may include a plurality of queues, each dedicated to one of a data ingestion task, a training task, and a solution task. In that embodiment, the data ingestion process 1108 monitors a queue dedicated to data ingestion tasks, the trainer process 1110 monitors a queue dedicated to training tasks, and the solver process 1112 monitors a queue dedicated to solution tasks.
Each of the above servers and modules may be implemented in software, firmware, or hardware on a computing device. The computing device may include, but is not limited to, a personal computer, a mobile device such as a mobile phone, a workstation, an embedded system, a gaming console, a television, a set-top box, or any other computing device. Further, the computing device may include, but is not limited to, a device having a processor and memory (including non-transitory memory) for executing and storing instructions. The memory may tangibly embody data and program instructions in a non-transitory manner. The software may include one or more applications and an operating system. The hardware may include, but is not limited to, a processor, memory, and a graphical user interface display. The computing device may also have multiple processors and multiple shared or separate memory components. For example, the computing device may be part or all of a clustered or distributed computing environment or server farm.
Conclusion
Identifiers such as "(a)", "(b)", "(i)", "(ii)" and the like are sometimes used for different elements or steps. These identifiers are used for clarity and do not necessarily specify the order of the elements or steps.
The invention has been described above with the aid of functional building blocks illustrating the implementation of specific functions and relationships thereof. Boundaries of these functional building blocks have been arbitrarily defined herein for the convenience of the description. Other boundaries may be defined so long as the specified functions and relationships thereof are appropriately performed.
The foregoing description of the specific embodiments will so fully reveal the general nature of the invention that others can, by applying knowledge within the skill of the art, readily modify and/or adapt for various applications such specific embodiments without undue experimentation and without departing from the general concept of the present invention. Therefore, such adaptations and modifications are intended to be within the meaning and range of equivalents of the disclosed embodiments, based on the teaching and guidance presented herein. It is to be understood that the phraseology or terminology herein is for the purpose of description and not of limitation, such that the terminology or phraseology of the present specification is to be interpreted by the skilled artisan in light of the teachings and guidance.
The breadth and scope of the present invention should not be limited by any of the above-described exemplary embodiments, but should be defined only in accordance with the following claims appended hereto and their equivalents.
Claims (60)
1. A computer-implemented method for training a machine learning algorithm with time-varying personal data, the method comprising:
(a) at a plurality of times, monitoring a data source to determine whether data relating to a person has been updated;
(b) when the person's data has been updated, storing the updated data in a database such that the database includes a log of runs that specify how the person's data changes over time, wherein the person's data includes values for a plurality of attributes related to the person;
(c) receiving an indication in the person's data that a value for a particular attribute is verified as accurate or inaccurate at a particular time;
(d) retrieving data from the database based on the particular time, the data including the person's values for the plurality of attributes that are up-to-date at the particular time; and
(e) training a model using the retrieved data and the indication to enable the model to predict whether a value of another person for the particular attribute is accurate, thereby rolling the retrieved data back to the particular time to maintain a meaning of the retrieved data in the training of the model.
2. The method of claim 1, further comprising:
(f) determining a plurality of features based on the data of the person retrieved in (d), each feature of the plurality of features describing facts about the data of the person retrieved in (d),
wherein the training (e) comprises training the model using the determined features.
3. The method of claim 2, wherein,
the determining (f) comprises: determining the features based on which of the plurality of attributes is the particular attribute.
4. The method of claim 1, wherein the training (e) comprises training a plurality of models, each model utilizing a different type of machine learning algorithm, further comprising:
(f) evaluating the accuracy of the plurality of models using available training data; and
(g) selecting a model from the plurality of models based on the evaluated accuracy.
5. The method of claim 1, further comprising:
(f) applying the model to predict whether the other person's values for the plurality of attributes are accurate.
6. The method of claim 5, wherein the applying (f) comprises:
(i) for each of a plurality of values of the particular attribute for the other person, applying the model to the respective value to determine a score; and
(ii) selecting at least one value from the plurality of values based on the respective scores determined in (i).
7. The method of claim 6, wherein the monitoring (a) comprises monitoring a plurality of data sources to determine whether data related to a person has been updated, and wherein the applying (f) further comprises:
(iii) determining from which of the plurality of data sources the at least one value selected in (ii) originated;
(iv) determining whether the client has rights to the data source determined in (iii); and
(v) if the client lacks rights to the data source determined in (iii), filtering the at least one value from the result before presenting the result to the client.
8. The method of claim 1, wherein,
the person and the other person are healthcare providers, and
the data of the person and the other person comprises demographic information.
9. The method of claim 1, wherein,
the person and the other person are healthcare providers, and
the data of the person includes an indication of whether the person is involved in fraud.
10. A non-transitory program storage device having stored thereon instructions that, when executed by at least one computing device, cause the at least one computing device to perform a method for training a machine learning algorithm with time-varying personal data, the method comprising:
(a) at a plurality of times, monitoring a data source to determine whether data relating to a person has been updated;
(b) when the person's data has been updated, storing the updated data in a database such that the database includes a log of runs that specify how the person's data changes over time, wherein the person's data includes values for a plurality of attributes related to the person;
(c) receiving an indication in the person's data that a value for a particular attribute is verified as accurate or inaccurate at a particular time;
(d) retrieving data from the database based on the particular time, the data including the person's values for the plurality of attributes that are up-to-date at the particular time; and
(e) training a model using the retrieved data and the indication to enable the model to predict whether a value of another person for the particular attribute is accurate, thereby rolling the retrieved data back to the particular time to maintain a meaning of the retrieved data in the training of the model.
11. The program storage device of claim 10, the method further comprising:
(f) determining a plurality of features based on the data of the person retrieved in (d), each feature of the plurality of features describing facts about the data of the person retrieved in (d),
wherein the training (e) comprises training the model using the determined features.
12. The program storage device of claim 11,
the determining (f) comprises: determining the features based on which of the plurality of attributes is the particular attribute.
13. The program storage device of claim 10, wherein the training (e) comprises training a plurality of models, each model utilizing a different type of machine learning algorithm, further comprising:
(f) evaluating the accuracy of the plurality of models using available training data; and
(g) selecting a model from the plurality of models based on the evaluated accuracy.
14. The program storage device of claim 10, wherein the method further comprises:
(f) applying the model to predict whether the other person's values for the plurality of attributes are accurate.
15. The program storage device of claim 14, wherein the applying (f) comprises:
(i) for each of a plurality of values of the particular attribute for the other person, applying the model to the respective value to determine a score; and
(ii) selecting at least one value from the plurality of values based on the respective scores determined in (i).
16. The program storage device of claim 15, wherein the monitoring (a) comprises monitoring a plurality of data sources to determine whether data relating to a person has been updated, and wherein the applying (f) further comprises:
(iii) determining from which of the plurality of data sources the at least one value selected in (ii) originated;
(iv) determining whether the client has rights to the data source determined in (iii); and
(v) if the client lacks rights to the data source determined in (iii), filtering the at least one value from the result before presenting the result to the client.
17. The program storage device of claim 10,
the person and the other person are healthcare providers, and
the data of the person and the other person comprises demographic information.
18. The program storage device of claim 10,
the person and the other person are healthcare providers, and
the data of the person includes an indication of whether the person is involved in fraud.
19. A system for training a machine learning algorithm with time-varying personal data, comprising:
a computing device;
a database comprising a log of runs that specifies how data for a person changes over time, wherein the data for the person comprises values for a plurality of attributes relating to the person;
a data ingestion process implemented on the computing device and configured to: (i) monitoring a data source at a plurality of times to determine if data relating to the person has been updated; and (ii) when the data of the person has been updated, storing the updated data in the database;
an API monitor implemented on the computing device and configured to: receiving an indication in the person's data that a value for a particular attribute is verified as accurate or inaccurate at a particular time;
a querier implemented on the computing device and configured to: retrieving data from the database based on the particular time, the data including the person's values for the plurality of attributes that are up-to-date at the particular time; and
a trainer implemented on the computing device and configured to: training a model using the retrieved data and the indication to enable the model to predict whether a value of another person for the particular attribute is accurate, thereby rolling the retrieved data back to the particular time to maintain a meaning of the retrieved data in the training of the model.
20. The system of claim 19, further comprising:
a characterizer configured to determine a plurality of features based on the data of the person retrieved by the querier, each feature of the plurality of features describing facts about that retrieved data,
wherein the trainer trains the model using the determined features.
21. The system of claim 19, wherein,
the model predicts whether the other person's values for the plurality of attributes are accurate.
22. The system of claim 19, wherein the trainer trains a plurality of models, each model utilizing a different type of machine learning algorithm, the system further comprising:
a grid searcher to evaluate an accuracy of the plurality of models using available training data and to select a model from the plurality of models based on the evaluated accuracy.
23. A computer-implemented method for correlating demographic data related to a person, the method comprising:
(a) receiving, from a plurality of different data sources, a plurality of different values describing a same attribute of the person;
(b) determining whether any of the plurality of different values represent the same trait;
when it is determined in (b) that different values represent the same trait:
(c) determining which of the values determined to represent the same trait most accurately represents the same trait; and
(d) linking those values determined to represent the same trait.
24. The method of claim 23, wherein,
the same attribute is the address of the person, and
each of the plurality of different values is a different address value.
25. The method of claim 24, wherein the determining (b) comprises:
(i) geocoding each of the plurality of different address values to determine a geographic location; and
(ii) determining whether any of the geographical locations determined in (i) are the same.
26. The method of claim 23, wherein,
the determining (b) includes: determining whether a first string of the plurality of different values is a substring of a second string of another value of the plurality of different values, an
Wherein the determining (c) comprises: determining that the second string represents the same trait more accurately than the first string.
27. The method of claim 23, wherein,
the determining (b) includes: determining whether a first string of the plurality of different values is similar to a second string of another value of the plurality of different values except that the first string of the plurality of different values has a different number with a similar appearance, and
wherein the determining (c) comprises: determining that the second string represents the trait more accurately than the first string.
28. The method of claim 23, wherein,
the determining step (b) includes: it is determined whether the first string is an ambiguous match to the second string.
29. The method of claim 23, wherein,
the same attribute is the entity name of the person.
30. The method of claim 23, wherein,
the same attribute is claim code, and the person is a healthcare provider.
31. The method of claim 23, further comprising:
(e) training a plurality of models, each model utilizing a different type of machine learning algorithm;
(f) evaluating the accuracy of the plurality of models using available training data; and
(g) selecting a model from the plurality of models based on the evaluated accuracy.
32. The method of claim 23, further comprising:
(e) monitoring a data source at a plurality of times to determine whether data relating to the person has been updated; and
(f) when the person's data has been updated, storing the updated data in a database such that the database includes a log of runs that specify how the person's data changes over time, wherein the person's data includes values for a plurality of attributes related to the person.
33. A non-transitory program storage device having instructions stored thereon that, when executed by at least one computing device, cause the at least one computing device to perform a method for correlating demographic data about a person, the method comprising:
(a) receiving, from a plurality of different data sources, a plurality of different values describing a same attribute of the person;
(b) determining whether any of the plurality of different values represent the same trait;
when it is determined in (b) that different values represent the same trait:
(c) determining which of the values determined to represent the same trait most accurately represents the same trait; and
(d) linking those values determined to represent the same trait.
34. The program storage device of claim 33,
the same attribute is the address of the person, and
each of the plurality of different values is a different address value.
35. The program storage device of claim 34, wherein the determining (b) comprises:
(i) geocoding each of the plurality of different address values to determine a geographic location; and
(ii) determining whether any of the geographical locations determined in (i) are the same.
36. The program storage device of claim 33,
the determining (b) includes: determining whether a first string of the plurality of different values is a substring of a second string of another value of the plurality of different values, an
Wherein the determining (c) comprises: determining that the second string represents the trait more accurately than the first string.
37. The program storage device of claim 33,
the determining (b) includes: determining whether a first string of the plurality of different values is similar to a second string of another value of the plurality of different values except that the first string of the plurality of different values has a different number with a similar appearance, and
wherein the determining (c) comprises: determining that the second string represents the trait more accurately than the first string.
38. The program storage device of claim 33,
the determining step (b) includes: it is determined whether the first string is an ambiguous match to the second string.
39. The program storage device of claim 33,
the same attribute is the entity name of the person.
40. The program storage device of claim 33,
the same attribute is claim code, and the person is a healthcare provider.
41. The program storage device of claim 33, the method further comprising:
(e) training a plurality of models, each model utilizing a different type of machine learning algorithm;
(f) evaluating the accuracy of the plurality of models using available training data; and
(g) selecting a model from the plurality of models based on the evaluated accuracy.
42. The program storage device of claim 33, the method further comprising:
(e) monitoring a data source at a plurality of times to determine whether data relating to the person has been updated; and
(f) when the person's data has been updated, storing the updated data in a database such that the database includes a log of runs that specify how the person's data changes over time, wherein the person's data includes values for a plurality of attributes related to the person.
43. A system for training a machine learning algorithm with time-varying personal data, the system comprising:
a computing device;
a data ingestion process implemented on the computing device and configured to: receiving, from a plurality of different data sources, a plurality of different values for a same attribute that describes a person;
a data cleaner implemented on the computing device and configured to: (i) determining whether any of the plurality of different values represent the same trait; and (ii) when it is determined that different values represent the same trait, determining which of the values determined to represent the same trait most accurately represents the same trait; and
a data linker implemented on the computing device and configured to: linking those values determined to represent the same trait.
44. The system of claim 43, wherein the data cleaner comprises:
a geocoder that geocodes each of a plurality of different address values to determine a geographic location, and determines whether any of the determined geographic locations are the same.
45. A system for scheduling data ingestion and machine learning, comprising:
a computing device;
a database;
a queue stored on the computing device;
a scheduler implemented on a computing device and configured to: placing a request to complete a job on the queue, the request including instructions to complete at least one of a data ingestion task, a training task, and a solution task;
a data ingestion process implemented on a computing device and configured to: (i) monitoring the queue, and (ii) when the queue includes a request to complete the data ingestion task, retrieving data related to personnel from a data source and storing the retrieved data in the database;
a trainer process implemented on the computing device and configured to: (i) monitoring the queue, and (ii) when the queue includes a request to complete the training task, training a model using the retrieved data in the database and an indication that a value for a particular attribute in the retrieved data is verified to be accurate or inaccurate, such that the model can predict whether a value of another person for the particular attribute is accurate; and
a solver process implemented on a computing device and configured to: (i) monitor the queue, and (ii) apply the model to predict whether the value of the other person is accurate when the queue includes a request to complete the solution task.
46. The system of claim 45, further comprising:
a plurality of queues, each queue dedicated to one of the data ingestion task, the training task, and the solution task,
wherein the data ingestion process monitors a queue from the plurality of queues that is dedicated to the data ingestion task,
wherein the trainer process monitors a queue from the plurality of queues that is dedicated to the training task, and
wherein the solver process monitors a queue from the plurality of queues that is dedicated to the solution task.
47. The system of claim 45, wherein,
the scheduler places requests for completion of the job on the queue at periodic intervals.
48. The system of claim 45, wherein,
the data ingestion process is configured to: (i) monitoring the data source to determine whether data relating to the person has been updated; and (ii) when the data of the person has been updated, storing the updated data in the database.
49. The system of claim 45, further comprising:
an API monitor implemented on the computing device and configured to: upon receipt of an API request, placing on the queue a request to complete another job specified in the API request,
the API request includes instructions for at least one of:
the data ingestion task, the training task, the solution task, or the scheduling task.
50. The system of claim 49, wherein,
the scheduler monitors the queue, and
when the queue includes a request to complete the scheduling task, the scheduler schedules the task as specified in the API request.
51. The system of claim 49, wherein,
the API request includes: (i) an indication in the retrieved data that the value for the particular attribute is verified as accurate or inaccurate at a particular time, and (ii) instructions for completing the training task.
52. The system of claim 45, wherein the data ingestion process is configured to:
monitoring the data source to determine whether data relating to the person has been updated, and
when the person's data has been updated, placing another request to complete the training task on the queue.
53. A computer-implemented method for scheduling data ingestion and machine learning, comprising:
(a) placing a request to complete a job on a queue, the request including instructions to complete at least one of a data ingestion task, a training task, and a solution task;
(b) monitoring the queue to determine whether the queue includes the request and what the next task on the queue is;
(c) retrieving data related to a person from a data source to store the retrieved data in a database when the queue includes the request to complete the data ingestion task;
(d) when the queue includes a request to complete the training task, training a model using the retrieved data in the database and an indication that a value for a particular attribute in the retrieved data is verified to be accurate or inaccurate, such that the model can predict whether the value of another person for the particular attribute is accurate; and
(e) applying the model to predict whether the value of the other person is accurate when the queue includes a request to complete the solution task.
54. The method of claim 53, wherein,
the monitoring (b) includes monitoring a plurality of queues, each queue dedicated to one of the data ingestion task, the training task, and the solution task.
55. The method of claim 53, wherein,
the placing (a) occurs at periodic intervals.
56. The method of claim 53, further comprising:
(f) monitoring the data source to determine if data relating to the person has been updated; and
(g) when the data of the person has been updated, storing the updated data in the database.
57. The method of claim 53, further comprising:
(f) receiving an API request;
(g) upon receipt of the API request, placing another request on the queue for completion of another job specified on the API request, the API request including instructions for completing at least one of:
the data ingestion task, the training task, the solution task, or the scheduling task.
58. The method of claim 57, further comprising:
(h) scheduling a task as specified in the API request when the queue includes the other request for completing the scheduling task.
59. The method of claim 57, wherein the API request comprises:
(i) an indication in the retrieved data that the value for the particular attribute is verified as accurate or inaccurate at a particular time, an
(ii) Instructions for completing the training task.
60. The method of claim 53, further comprising:
(f) monitoring the data source to determine whether data relating to the person has been updated; and
(g) when the person's data has been updated, placing another request to complete the training task on the queue.
Applications Claiming Priority (7)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US15/948,646 US20190311372A1 (en) | 2018-04-09 | 2018-04-09 | Normalizing Ingested Personal Data, and Applications Thereof |
US15/948,652 | 2018-04-09 | ||
US15/948,646 | 2018-04-09 | ||
US15/948,652 US10990900B2 (en) | 2018-04-09 | 2018-04-09 | Scheduling machine learning tasks, and applications thereof |
US15/948,604 US11568302B2 (en) | 2018-04-09 | 2018-04-09 | Training machine learning algorithms with temporally variant personal data, and applications thereof |
US15/948,604 | 2018-04-09 | ||
PCT/US2019/026524 WO2019199778A1 (en) | 2018-04-09 | 2019-04-09 | Processing personal data using machine learning algorithms, and applications thereof |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112189206A true CN112189206A (en) | 2021-01-05 |
CN112189206B CN112189206B (en) | 2024-09-06 |
Family
ID=68164473
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201980024828.1A Active CN112189206B (en) | 2018-04-09 | 2019-04-09 | Processing personal data using machine learning algorithms and applications thereof |
Country Status (4)
Country | Link |
---|---|
EP (1) | EP3776376A4 (en) |
CN (1) | CN112189206B (en) |
CA (1) | CA3096405A1 (en) |
WO (1) | WO2019199778A1 (en) |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111159861B (en) * | 2019-12-16 | 2022-04-05 | Beihang University | Data evaluation method for multi-source reliability test data of lithium battery based on data envelopment analysis |
US12164489B2 (en) | 2022-05-09 | 2024-12-10 | T-Mobile Usa, Inc. | Database provisioning and management systems and methods |
Citations (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7167842B1 (en) * | 2000-06-27 | 2007-01-23 | Ncr Corp. | Architecture and method for operational privacy in business services |
CN102831012A (en) * | 2011-06-16 | 2012-12-19 | 日立(中国)研究开发有限公司 | Task scheduling device and task scheduling method in multimode distributive system |
US20130006655A1 (en) * | 2011-06-30 | 2013-01-03 | Verizon Patent And Licensing Inc. | Near real-time healthcare fraud detection |
US20130159220A1 (en) * | 2011-12-15 | 2013-06-20 | Microsoft Corporation | Prediction of user response actions to received data |
CN105007282A (en) * | 2015-08-10 | 2015-10-28 | 济南大学 | Malicious software network behavior detection method specific to network service provider and system thereof |
CN105027132A (en) * | 2012-12-31 | 2015-11-04 | 通用电气公司 | Systems and methods for non-destructive testing user profiles |
CN105117477A (en) * | 2015-09-09 | 2015-12-02 | 中国人民解放军国防科学技术大学 | Self-adaptive and self-feedback system for discovering abnormity of fictitious assets and implementation method |
CN105378699A (en) * | 2013-11-27 | 2016-03-02 | Ntt都科摩公司 | Automatic task classification based upon machine learning |
US20170124487A1 (en) * | 2015-03-20 | 2017-05-04 | Salesforce.Com, Inc. | Systems, methods, and apparatuses for implementing machine learning model training and deployment with a rollback mechanism |
CN106716454A (en) * | 2014-09-24 | 2017-05-24 | C3公司 | Identifying non-technical losses using machine learning |
CN106875270A (en) * | 2017-01-19 | 2017-06-20 | 上海冰鉴信息科技有限公司 | A kind of method and system design for building and verifying credit scoring equation |
US20180018590A1 (en) * | 2016-07-18 | 2018-01-18 | NantOmics, Inc. | Distributed Machine Learning Systems, Apparatus, and Methods |
-
2019
- 2019-04-09 CA CA3096405A patent/CA3096405A1/en active Pending
- 2019-04-09 CN CN201980024828.1A patent/CN112189206B/en active Active
- 2019-04-09 WO PCT/US2019/026524 patent/WO2019199778A1/en unknown
- 2019-04-09 EP EP19786111.5A patent/EP3776376A4/en not_active Withdrawn
Also Published As
Publication number | Publication date |
---|---|
EP3776376A1 (en) | 2021-02-17 |
CN112189206B (en) | 2024-09-06 |
EP3776376A4 (en) | 2021-12-01 |
WO2019199778A1 (en) | 2019-10-17 |
CA3096405A1 (en) | 2019-10-17 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US20230409966A1 | Training machine learning algorithms with temporally variant personal data, and applications thereof | |
US11232365B2 (en) | Digital assistant platform | |
US10990900B2 (en) | Scheduling machine learning tasks, and applications thereof | |
US20220262527A1 (en) | Computer-implemented system and method for guided assessments on medication effects | |
Herland et al. | Big data fraud detection using multiple medicare data sources | |
CN113724848A (en) | Medical resource recommendation method, device, server and medium based on artificial intelligence | |
US20200242626A1 (en) | Fraud lead detection system for efficiently processing database-stored data and automatically generating natural language explanatory information of system results for display in interactive user interfaces | |
US20190311372A1 (en) | Normalizing Ingested Personal Data, and Applications Thereof | |
US9230060B2 (en) | Associating records in healthcare databases with individuals | |
WO2018156641A1 (en) | Method for determining news veracity | |
US20100076786A1 (en) | Computer System and Computer-Implemented Method for Providing Personalized Health Information for Multiple Patients and Caregivers | |
Guha et al. | Automated data cleaning can hurt fairness in machine learning-based decision making | |
US20030126156A1 (en) | Duplicate resolution system and method for data management | |
JP2012511763A (en) | Assertion-based record linkage in a decentralized autonomous medical environment | |
US20230005574A1 (en) | Methods and systems for comprehensive symptom analysis | |
CN113724858A (en) | Artificial intelligence-based disease examination item recommendation device, method and apparatus | |
US20150339602A1 (en) | System and method for modeling health care costs | |
CN112189206B (en) | Processing personal data using machine learning algorithms and applications thereof | |
US20210319891A1 (en) | Patient scheduling using predictive analytics | |
US20240095385A1 (en) | Dataset privacy management system | |
CN115668178A (en) | Methods, systems, and computer program products for retrospective data mining | |
CN118708808A (en) | Recommendation method, device, equipment and storage medium based on large model | |
HK40036596A (en) | Processing personal data using machine learning algorithms, and applications thereof | |
CN111063425B (en) | medical concierge | |
Bressler et al. | Leveraging artificial intelligence/machine learning models to identify potential palliative care beneficiaries: A systematic review |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
REG | Reference to a national code |
Ref country code: HK Ref legal event code: DE Ref document number: 40036596 Country of ref document: HK |
|
GR01 | Patent grant | ||