CN111105317B

CN111105317B - A medical insurance fraud detection method based on drug purchase records

Info

Publication number: CN111105317B
Application number: CN201911383476.7A
Authority: CN
Inventors: 孙佰清; 鲍鑫; 王天辰; 高稳; 王思霖
Original assignee: Harbin Institute of Technology Shenzhen
Current assignee: Harbin Institute of Technology Shenzhen
Priority date: 2019-12-28
Filing date: 2019-12-28
Publication date: 2023-05-12
Anticipated expiration: 2039-12-28
Also published as: CN111105317A

Abstract

The invention provides a medical insurance fraud detection method based on drug purchase records and belongs to the field of drug fraud detection methods. The method can accurately extract medical insurance fraud information, is convenient to operate, and is widely applicable. In the invention, a fraudster classification model is constructed with a machine learning algorithm; patient information and drug purchase information are input into the model and a patient-drug bipartite graph is established; a drug single-mode projection graph is established from the patient-drug bipartite graph to form drug chains; the drug chains are divided into normal chains and abnormal chains using an association chain algorithm; the similarity between normal chains and abnormal chains is calculated with the cosine similarity formula; comparison combinations of an abnormal chain and a normal chain whose similarity is not 0 are retained; within each combination, the drugs common to the abnormal chain and the normal chain are removed and the other drugs are retained; and the remaining drugs are synthesized into a fraud chain, which is output. The invention is mainly used for detecting the fraudulent behavior of fraudulent patients.

Description

A medical insurance fraud detection method based on drug purchase records

Technical Field

The invention belongs to the field of drug fraud detection methods, and in particular relates to a medical insurance fraud detection method.

Background Art

Although medical insurance fraud is uncommon, fraud incidents usually correspond to abnormal drug purchase records. Medical insurance fraud cases also tend to share the following characteristics:

(1) Uncommon: fraud incidents are rare but costly, so the numbers of normal patients and fraudsters are extremely unbalanced.

(2) Knowledge sharing: fraudsters are often influenced by their allies and contacts, and in turn influence others. Fraud knowledge is passed on and shows up in medical purchasing behavior patterns.

(3) Behavioral imitation: fraudulent patients also imitate the drug purchasing behavior of normal insured persons to conceal their fraudulent goals, trying to make their own purchasing behavior look "normal".

Therefore, there is a need for a medical insurance fraud detection method based on drug purchase records that can accurately extract medical insurance fraud information, is easy to operate, and is widely applicable.

Summary of the Invention

Existing medical insurance fraud patterns are diverse, fraud information cannot be determined accurately, and manual extraction of fraud information is cumbersome. To address these problems, the present invention provides a medical insurance fraud detection method based on drug purchase records that can accurately extract medical insurance fraud information, is easy to operate, and is widely applicable.

The technical solution of the medical insurance fraud detection method based on drug purchase records according to the present invention is as follows:

The medical insurance fraud detection method based on drug purchase records according to the present invention comprises the following steps:

Step S1: constructing a fraudster classification model using a machine learning algorithm;

Step S2: inputting patient information and drug purchase information into the model to establish a patient-drug bipartite graph, wherein the patient information covers normal patients and fraudulent patients;

Step S3: establishing a drug single-mode projection relationship from the patient-drug bipartite graph to form drug chains;

Step S4: dividing the drug chains of step S3 into normal chains and abnormal chains using an association chain algorithm;

Step S5: calculating the similarity between the normal chains and the abnormal chains using the cosine similarity formula;

Step S6: removing the normal chains with a similarity of 0, and retaining the comparison combinations of an abnormal chain and a normal chain whose similarity is not 0;

Step S7: within each combination, removing the drugs that appear in both the abnormal chain and the normal chain, and retaining the other drugs;

Step S8: synthesizing the remaining drugs into a fraud chain and outputting the fraud chain.

Further: in step S1, the patient information is integrated and a machine learning algorithm is used to extract the feature vector of the patient information. The supervised screening algorithm smbinning is used to calculate the information value (IV) of each feature in the feature vector, and the features whose IV exceeds a threshold (0.05 in the embodiments below) are input into the machine learning algorithm to obtain the fraudster classification model.

Further: in step S2, the drug purchase information of normal patients and of fraudulent patients is represented as patient nodes and drug nodes, and an undirected drug-patient bipartite graph is constructed for the fraudulent patients and another for the normal patients. A first round of derived feature extraction is performed on the patient-drug bipartite graph; the first-round features include the total number of drug types used and the total amount of drugs used, and the drug single-mode projection relationship is established on the basis of these derived features.

Further: in step S3, a second round of derived feature extraction is performed on the abnormal chain; the second-round features include the type anomaly rate, the quantity anomaly rate, and the usage rate of abnormal drugs in the abnormal chain.

Further: in step S4, the association chain algorithm is as follows: the matrix corresponding to the bipartite graph is sorted by edge weight; the drug combination with the highest edge weight is taken as the starting drug combination of the abnormal chain; the drugs connected to the drugs of that combination by the next-highest edge weight are then retrieved, and the search continues in sequence, linking the drugs into a chain. The input is the edge-weight adjacency matrix and the output is a chain.

Further: in step S5, the cosine similarity formula is [formula not reproduced in this text], where a, b, and c each denote a normal chain or an abnormal chain.
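
For reference, the standard definition of cosine similarity can be written as below. This is a sketch of the textbook formula, not a reconstruction of the patent's own expression; it assumes each chain is encoded as a vector over the m drugs (for example a_k = 1 if drug k is on chain a and 0 otherwise), which is an encoding chosen here for illustration.

```latex
\cos(\mathbf{a}, \mathbf{b})
  = \frac{\mathbf{a} \cdot \mathbf{b}}{\lVert \mathbf{a} \rVert \, \lVert \mathbf{b} \rVert}
  = \frac{\sum_{k=1}^{m} a_k b_k}
         {\sqrt{\sum_{k=1}^{m} a_k^{2}} \, \sqrt{\sum_{k=1}^{m} b_k^{2}}}
```

Under the assumed 0/1 encoding, the similarity is 0 exactly when the two chains share no drug, which is consistent with step S6 discarding comparison combinations whose similarity is 0.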

Further: in step S8, a third round of derived feature extraction is performed on the synthesized fraud chain; the third-round features include the type anomaly rate, the quantity anomaly rate, and the usage rate of abnormal drugs in the abnormal chain.

The beneficial effects of the medical insurance fraud detection method based on drug purchase records according to the present invention are as follows:

The method uses a bipartite graph and the single-mode projection relationship derived from it, and applies the association chain algorithm to extract the transfer of fraud patterns and the hidden drug purchase targets. In terms of business logic it is fast and accurate, and it is easy to apply. The extracted fraud chains can also help regulators establish rules that deter fraudulent activity and prevent malicious fraud by fraudulent patients. The method analyzes the drug purchase records in medical insurance data and uses graph-theoretic algorithms to construct effective derived features; it judges fraud with high accuracy and can effectively detect changing medical insurance fraud patterns.

Brief Description of the Drawings

FIG. 1 is a flow chart of the medical insurance fraud detection method based on drug purchase records of the present invention;

FIG. 2 is a flow chart of the medical insurance fraud detection method based on drug purchase records according to Example 2;

FIG. 3 is the drug bipartite graph of the normal patients in Example 2;

FIG. 4 is the drug bipartite graph of the fraudulent patients in Example 2.

Detailed Description

The technical solution of the present invention is further described below with reference to the examples, but is not limited to them. Any modification or equivalent replacement of the technical solution that does not depart from its spirit and scope falls within the protection scope of the present invention.

Example 1

This example is described with reference to FIG. 1. The medical insurance fraud detection method based on drug purchase records of this example comprises the following steps:

Step S1: construct a fraudster classification model using a machine learning algorithm. The patient information is integrated and a machine learning algorithm is used to extract the feature vector of the patient information; the supervised screening algorithm smbinning is used to calculate the information value (IV) of each feature, and the features whose IV exceeds the threshold are input into the machine learning algorithm to obtain the fraudster classification model. Specifically, the feature vector of each patient is combined with the patient's fraud label Y into a data set [expression not reproduced in this text]; the supervised feature screening algorithm smbinning is used to calculate the IV of each of these features; and the features whose IV is greater than 0.05 are input into the machine learning algorithm to obtain an effective classification model for fraudsters.

Step S2: input patient information and drug purchase information into the model and establish a patient-drug bipartite graph, wherein the patient information covers normal patients and fraudulent patients. The drug purchase information of normal patients and of fraudulent patients is represented as patient nodes and drug nodes, and an undirected drug-patient bipartite graph is constructed for the fraudulent patients and another for the normal patients. A first round of derived feature extraction is performed on the patient-drug bipartite graph; the first-round features include the total number of drug types used and the total amount of drugs used, and the drug single-mode projection relationship is established on the basis of these derived features.

The training data is split into patients with fraudulent behavior and patients who behave normally; each patient corresponds to multiple medication records. The medication records of each patient are collected together. Based on graph theory, patients and drugs are treated as two different types of node, and each medication record links a patient node (IDj) to a drug node (Dk), yielding one undirected drug-patient bipartite graph for the fraudulent patients and one for the normal patients. From these two graphs, the first round of features is extracted for each patient's purchasing behavior: (1) the total number of drug types used; (2) the total amount of drugs used.

The drug single-mode projection relationships corresponding to the drug-patient bipartite graph of the fraudulent patients and to that of the normal patients are derived separately. The association chain algorithm is then used to represent the purchasing behavior of fraudulent patients as abnormal chains and the purchasing behavior of normal patients as normal chains.

The single-mode projection algorithm for bipartite graphs is mainly used to study relationships between nodes of the same type. Using the property that nodes of one type in a bipartite graph are connected through nodes of the other type, it directly links and clusters nodes of the same type, producing a growing network graph that contains nodes of a single type. The single-mode projection relationship depends on the bipartite graph described above. First, the drug nodes (Dk) under study are added to the new network one by one. Taking each added drug node as the starting point, the other drug nodes connected to it through patients are searched for in the bipartite graph in order of edge weight from low to high, and the drug nodes found are connected to the starting node. This process is repeated until all drug nodes have been connected, forming the new single-mode projection relationship. The two projection relationships are processed with the association chain algorithm to obtain the abnormal chains and normal chains that correspond to patient behavior.

In step S3, a second round of derived feature extraction is performed on the abnormal chain; the second-round features include the type anomaly rate, the quantity anomaly rate, and the usage rate of abnormal drugs in the abnormal chain. Taking the abnormal chains that correspond to fraudulent patient behavior as the reference, the second round of patient behavior features is derived. For each abnormal chain, the following three features are extracted for each patient: (1) the number of drug types used by the patient that also appear in the fraud chain, divided by the total number of drug types the patient uses; (2) the total amount of fraud-chain drugs the patient uses, divided by the total amount of drugs the patient uses; (3) the total amount of fraud-chain drugs the patient uses, divided by the number of abnormal-chain drug types the patient uses.

Step S4: divide the drug chains of step S3 into normal chains and abnormal chains using the association chain algorithm. The association chain algorithm is as follows: the matrix corresponding to the bipartite graph is sorted by edge weight; the drug combination with the highest edge weight is taken as the starting drug combination of the abnormal chain; the drugs connected to the drugs of that combination by the next-highest edge weight are then retrieved, and the search continues in sequence, linking the drugs into a chain. The input is the edge-weight adjacency matrix and the output is a chain.

Step S5: calculate the similarity between the normal chains and the abnormal chains using the cosine similarity formula. The cosine similarity formula is [formula not reproduced in this text], where a, b, and c each denote a normal chain or an abnormal chain.

Step S6: remove the normal chains with a similarity of 0, and retain the comparison combinations of an abnormal chain and a normal chain whose similarity is not 0. The cosine similarity is calculated between the abnormal chain of each fraudulent patient behavior and the normal chain of each normal patient behavior, and the comparison combinations whose similarity is 0 are removed. The other comparison combinations are retained; within each of them, the drugs on the abnormal chain that also appear on the normal chain are removed, and the remaining drugs on the abnormal chain are combined into a fraud chain.

Step S8: synthesize the remaining drugs into a fraud chain and perform a third round of derived feature extraction on the synthesized fraud chain; the third-round features include the type anomaly rate, the quantity anomaly rate, and the usage rate of abnormal drugs in the abnormal chain. Output the fraud chain.

Taking the fraud chains as the reference, the third round of patient behavior features is derived. For each fraud chain, the following three features are extracted for each patient: (1) the number of drug types used by the patient that also appear in the fraud chain, divided by the total number of drug types the patient uses; (2) the total amount of fraud-chain drugs the patient uses, divided by the total amount of drugs the patient uses; (3) the total amount of fraud-chain drugs the patient uses, divided by the number of fraud-chain drug types the patient uses.

Example 2

This example is described with reference to FIGS. 2, 3 and 4 and Example 1. The medical insurance fraud detection method based on drug purchase records of this example uses the medical insurance drug purchase data published by a certain city as a case study, and establishes a patient-drug bipartite graph for the fraudulent patients and another for the normal patients.

The training data is split into patients with fraudulent behavior and patients who behave normally. Each patient corresponds to multiple medication records, for a total of 1,368,148 medication records from 15,000 people. The medication records of each patient are collected together. Based on graph theory, patients and drugs are treated as two different types of node, and each medication record links a patient node (IDj) to a drug node (Dk), yielding one undirected drug-patient bipartite graph for the fraudulent patients and one for the normal patients. In these graphs, patient nodes are connected to each other only through drug nodes, never directly, and drug nodes are likewise connected to each other only through patient nodes. Because the only relationship between a drug node and a patient node is a purchase, an undirected bipartite graph is sufficient to represent purchasing behavior. The weight of an edge in the undirected bipartite graph is the number of times the patient purchased the corresponding drug.

From the undirected drug-patient bipartite graphs constructed for the fraudulent patients and for the normal patients, the first round of features is extracted for each patient's purchasing behavior: (1) the total number of drug types used, that is, the number of drug nodes connected to patient node j; (2) the total amount of drugs used, that is, the sum of the weights of the edges that connect patient node j to drug nodes.
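
The graph construction and the two first-round features can be sketched as follows. This is a minimal illustration under stated assumptions: the record layout (one `(patient_id, drug_id)` tuple per purchase), the sample data, and the function names `build_bipartite` and `first_round_features` are all hypothetical and do not come from the patent.

```python
from collections import defaultdict

# Hypothetical purchase records: one (patient_id, drug_id) tuple per purchase.
records = [
    ("p1", "a"), ("p1", "a"), ("p1", "b"),
    ("p2", "b"), ("p2", "c"),
    ("p3", "a"), ("p3", "c"), ("p3", "c"), ("p3", "d"),
]

def build_bipartite(records):
    """Return {patient: {drug: purchase_count}}.

    Each inner entry is one weighted patient-drug edge; the weight is the
    number of times the patient bought the drug.
    """
    graph = defaultdict(lambda: defaultdict(int))
    for patient, drug in records:
        graph[patient][drug] += 1
    return {patient: dict(drugs) for patient, drugs in graph.items()}

def first_round_features(graph):
    """Per patient: (number of distinct drugs, sum of edge weights)."""
    return {
        patient: (len(drugs), sum(drugs.values()))
        for patient, drugs in graph.items()
    }

graph = build_bipartite(records)
features = first_round_features(graph)
```

In practice one such graph would be built for the fraudulent patients and another for the normal patients, as the text describes.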

The single-mode projection algorithm for bipartite graphs is mainly used to study relationships between nodes of the same type. Using the property that nodes of one type in a bipartite graph are connected through nodes of the other type, it directly links and clusters nodes of the same type, producing a growing network graph that contains nodes of a single type. The single-mode projection relationship depends on the bipartite graph described above. First, the drug nodes (Dk) under study are added to the new network one by one. Taking each added drug node as the starting point, the other drug nodes connected to it through patients are searched for in the bipartite graph in order of edge weight from low to high, and the drug nodes found are connected to the starting node. This process is repeated until all drug nodes have been connected, forming the new single-mode projection relationship. Finally, the new single-mode projection relationship is expressed as a matrix, in which the weight of the edge between two drug nodes is the number of patients who use both drugs. The matrix [not reproduced in this text] has one row and one column per drug, where m is the number of drugs and each entry is the number of patients who use both the row drug and the column drug.
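
The projection weights can be computed directly by counting, for every pair of drugs, the patients who bought both. The sketch below does that; it is not the incremental node-by-node search described above, only a compact way to reach the same edge weights. The input layout `{patient: {drug: count}}`, the sample data, and the name `project_drugs` are assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations

def project_drugs(graph):
    """Project a patient-drug graph onto drugs.

    graph: {patient: {drug: purchase_count}}
    Returns {(drug_i, drug_j): number of patients who bought both},
    with drug_i < drug_j so each unordered pair appears once.
    """
    weights = Counter()
    for drugs in graph.values():
        for pair in combinations(sorted(drugs), 2):
            weights[pair] += 1
    return dict(weights)

# Hypothetical bipartite graph of four patients and four drugs.
graph = {
    "p1": {"a": 2, "b": 1},
    "p2": {"b": 1, "c": 1},
    "p3": {"a": 1, "c": 2, "d": 1},
    "p4": {"a": 1, "b": 3},
}
projection = project_drugs(graph)
```

The dictionary is a sparse form of the m x m matrix: pairs that no patient bought together are simply absent, and purchase counts are ignored because the edge weight counts patients, not purchases.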

On this basis, the drug single-mode projection relationships corresponding to the drug-patient bipartite graph of the fraudulent patients and to that of the normal patients are obtained. The two projection relationships are processed with the association chain algorithm to obtain the abnormal chains and normal chains that correspond to patient behavior.

Take the processing of the fraudulent patients' drug-patient bipartite graph into abnormal chains as an example. First, the matrix corresponding to the graph is sorted by edge weight, and the drug combination with the highest edge weight is taken as the starting drug combination of the abnormal chain. Next, for each drug in the combination, the drug connected to it by the next-highest edge weight is retrieved. If such drugs exist, they are attached to the corresponding sides of the starting combination, and the search continues until no connecting drug of next-highest edge weight can be retrieved for the drugs at either end of the chain without repetition. These steps are repeated until all drugs in the single-mode projection relationship have been traversed, which finally yields a set of abnormal chains that carry the fraudulent patients' information and contain no repeated drugs.

The drug single-mode projection relationship of the normal patients is processed in the same way, which finally yields a set of normal chains that carry the normal patients' information and contain no repeated drugs.
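
One possible reading of the association chain algorithm is sketched below. The text leaves several details open, so the following choices are assumptions: ties are broken alphabetically, each chain is grown greedily at both ends, and a drug that cannot be attached to an end of any chain is left out. The function name `association_chains` and the sample weights are hypothetical.

```python
def association_chains(weights):
    """Build drug chains from projection weights.

    weights: {(drug_u, drug_v): edge_weight}, one entry per unordered pair.
    Returns a list of chains (lists of drugs); no drug appears twice.
    """
    # Symmetric adjacency view of the weighted projection.
    adj = {}
    for (u, v), w in weights.items():
        adj.setdefault(u, {})[v] = w
        adj.setdefault(v, {})[u] = w

    used = set()
    chains = []
    # Highest edge weight first; ties broken by drug names (assumption).
    edges = sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))
    for (u, v), _ in edges:
        if u in used or v in used:
            continue
        chain = [u, v]  # starting drug combination
        used.update(chain)
        extended = True
        while extended:
            extended = False
            for end_index in (0, -1):  # left end, then right end
                end = chain[end_index]
                candidates = [(w, n) for n, w in adj[end].items() if n not in used]
                if not candidates:
                    continue
                # Heaviest remaining edge from this end of the chain.
                _, nxt = min(candidates, key=lambda c: (-c[0], c[1]))
                if end_index == 0:
                    chain.insert(0, nxt)
                else:
                    chain.append(nxt)
                used.add(nxt)
                extended = True
        chains.append(chain)
    return chains

# Hypothetical projection weights over six drugs.
weights = {
    ("a", "b"): 5, ("b", "c"): 3, ("a", "d"): 2,
    ("c", "d"): 1, ("e", "f"): 4,
}
chains = association_chains(weights)
```

Here the pair (a, b) has the highest weight and seeds the first chain, which grows to d-a-b-c; the pair (e, f) shares no drug with it and forms a second chain. Running the same function on the normal patients' projection would give the normal chains.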

Taking the abnormal chains that correspond to fraudulent patient behavior as the reference, the second round of patient behavior features is derived. For each abnormal chain, the following three features are extracted for each patient: (1) the number of drug types used by the patient that also appear in the fraud chain, divided by the total number of drug types the patient uses; (2) the total amount of fraud-chain drugs the patient uses, divided by the total amount of drugs the patient uses; (3) the total amount of fraud-chain drugs the patient uses, divided by the number of abnormal-chain drug types the patient uses. Each patient thus obtains three derived features per abnormal chain [the symbols for the number of chains, the number of features, and the feature labels are not reproduced in this text].

The cosine similarity is calculated between the abnormal chain of each fraudulent patient behavior and the normal chain of each normal patient behavior, and the comparison combinations whose similarity is 0 are removed. The other comparison combinations are retained; within each of them, the drugs on the abnormal chain that also appear on the normal chain are removed, and the remaining drugs on the abnormal chain are combined into a fraud chain. This operation is applied to every pair of abnormal chain and normal chain whose similarity is not 0, yielding the corresponding fraud chains.
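
The comparison and subtraction described above can be sketched as follows. The sketch assumes each chain is encoded as a 0/1 indicator vector over drugs, so that the cosine similarity reduces to a set expression; it also skips pairs whose subtraction leaves no drug. Both choices, the function names, and the sample chains are illustrative assumptions, not details from the patent.

```python
import math

def cosine_similarity(chain_a, chain_b):
    """Cosine similarity of two chains under a 0/1 drug-indicator encoding.

    With binary vectors, a.b is the number of shared drugs and the norms
    are the square roots of the chain lengths.
    """
    a, b = set(chain_a), set(chain_b)
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))

def fraud_chains(abnormal_chains, normal_chains):
    """Return the drugs left on each abnormal chain after removing those
    shared with a normal chain, for every pair with non-zero similarity."""
    result = []
    for abnormal in abnormal_chains:
        for normal in normal_chains:
            if cosine_similarity(abnormal, normal) == 0:
                continue  # no shared drug: discard this comparison pair
            shared = set(normal)
            remaining = [drug for drug in abnormal if drug not in shared]
            if remaining:  # assumption: an empty remainder is not a chain
                result.append(remaining)
    return result

# Hypothetical chains.
abnormal = [["d", "a", "b", "c"], ["e", "f"]]
normal = [["a", "b"], ["g", "h"]]
fraud = fraud_chains(abnormal, normal)
```

Only the pair (d-a-b-c, a-b) shares drugs, so a single fraud chain made of d and c is produced.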

Taking the fraud chains as the reference, the third round of patient behavior features is derived. For each fraud chain, the following three features are extracted for each patient: (1) the number of drug types used by the patient that also appear in the fraud chain, divided by the total number of drug types the patient uses; (2) the total amount of fraud-chain drugs the patient uses, divided by the total amount of drugs the patient uses; (3) the total amount of fraud-chain drugs the patient uses, divided by the number of fraud-chain drug types the patient uses. If r fraud chains are obtained, each patient obtains three derived features per fraud chain [the expression for the number of features and the feature labels are not reproduced in this text].
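
The three per-chain ratios can be sketched as below. The function name `chain_features`, the input layout `{drug: purchase_count}`, the sample data, and the convention of returning 0.0 when a denominator is zero are assumptions made for illustration.

```python
def chain_features(patient_drugs, chain):
    """Three derived features of one patient with respect to one chain.

    patient_drugs: {drug: purchase_count} for a single patient.
    chain: iterable of drugs (an abnormal chain or a fraud chain).
    Returns (type_rate, quantity_rate, usage_rate).
    """
    chain_set = set(chain)
    hits = {d: n for d, n in patient_drugs.items() if d in chain_set}

    n_types = len(patient_drugs)         # drug types the patient uses
    total = sum(patient_drugs.values())  # total amount the patient uses
    hit_types = len(hits)                # chain drug types the patient uses
    hit_total = sum(hits.values())       # amount of chain drugs the patient uses

    type_rate = hit_types / n_types if n_types else 0.0
    quantity_rate = hit_total / total if total else 0.0
    usage_rate = hit_total / hit_types if hit_types else 0.0
    return type_rate, quantity_rate, usage_rate

# Hypothetical patient and chains.
patient = {"a": 1, "c": 2, "d": 1}
chains = [["d", "c"], ["e", "f"]]
feature_vector = [f for chain in chains for f in chain_features(patient, chain)]
```

Concatenating the three values for every chain gives the patient's derived feature vector, three entries per chain.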

A machine learning algorithm is then used to build the fraud detection model.

In summary, the feature vector of each patient is combined with the patient's fraud label Y into a data set [expression not reproduced in this text]. The supervised feature screening algorithm smbinning is used to calculate the information value (IV) of each feature, and the features whose IV is greater than 0.05 are input into the machine learning model; this example uses the logistic regression algorithm. smbinning is a classification (binning) method available in the R language; here it is used to grade the features of the data set by information value and to remove features whose information value with respect to the fraud label is too low. First, ten-fold cross-validation is performed on the data set. For the training data, the logistic regression algorithm is used to set up a convex optimization objective over the classification coefficients, and gradient descent is used to update the coefficients iteratively. ROC and AUC are used to evaluate model performance. The results are compared in Table 1 below, and the classification coefficient vector t with the best relative performance is obtained [definition of t not reproduced in this text].

Fold          1     2     3     4     5     6     7     8     9     10
Training AUC  0.86  0.86  0.82  0.85  0.85  0.86  0.86  0.86  0.86  0.85
Test AUC      0.84  0.79  0.8   0.78  0.82  0.8   0.78  0.78  0.81  0.86

Table 1
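
The IV screening step can be illustrated in Python as follows. This is not the smbinning algorithm: smbinning is an R package that chooses bin boundaries in a supervised way, whereas the sketch below substitutes plain quantile bins and then applies the usual information value sum over bins. The function name, the bin count, the smoothing constant `eps`, and the toy data are assumptions; only the 0.05 threshold comes from the text.

```python
import numpy as np

IV_THRESHOLD = 0.05  # threshold stated in the embodiments

def information_value(x, y, n_bins=5, eps=1e-6):
    """Information value of a numeric feature x against a binary label y.

    y = 1 marks a fraudulent patient. Quantile bins stand in for
    smbinning's supervised binning (assumption).
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=int)
    edges = np.unique(np.quantile(x, np.linspace(0, 1, n_bins + 1)))
    # Bin index of each value, using only the interior edges.
    bins = np.searchsorted(edges[1:-1], x, side="left")
    n_bad = max(y.sum(), 1)
    n_good = max((1 - y).sum(), 1)
    iv = 0.0
    for b in np.unique(bins):
        mask = bins == b
        bad = y[mask].sum() / n_bad + eps          # share of fraud in bin
        good = (1 - y[mask]).sum() / n_good + eps  # share of normal in bin
        iv += (good - bad) * np.log(good / bad)
    return float(iv)

# Hypothetical toy data.
y = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
features = {
    "drug_type_count": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],  # separates the classes
    "constant": [1] * 10,                                  # carries no information
}
selected = [name for name, col in features.items()
            if information_value(col, y) > IV_THRESHOLD]
```

Only features that pass the threshold would then be passed on to the classifier.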

According to the classification coefficient vector t, the fraud probability is [formula]. Having obtained [formula], the logistic function is used to calculate the fraud probability of each patient, which completes the determination of whether the patient is a fraudulent patient.

There are many ways to build a new credit model. To obtain a feasible credit scoring model, the attribute set of the new credit model in this embodiment is a subset of the feasible domain of the attribute set. The algorithm used by the new credit scoring model is then determined from the properties of the attribute set. Many kinds of algorithms can currently be used to generate credit scoring models, for example logistic regression, random forest, and GBDT. In this embodiment, algorithm screening also covers new algorithms obtained by fusing existing ones, and the algorithm is selected according to the following strategy.

Logistic regression studies the relationship between the probability of an event and several factors, and yields the probability that the event occurs. When the probability is greater than 0.5, the event can be considered to occur; when it is less than 0.5, it can be considered not to occur.
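As a rough illustration of the classification step, the sketch below fits a coefficient vector t by gradient descent on the logistic log-loss and applies the 0.5 decision rule. It is a simplified stand-in under stated assumptions: the standard logistic form 1 / (1 + e^(-z)) is used because the patent's own formula is not reproduced in this text, the single standardized feature and the labels are hypothetical, and the ten-fold cross-validation and AUC evaluation are omitted.

```python
import math


def logistic(z):
    """Standard logistic function (assumed form)."""
    return 1.0 / (1.0 + math.exp(-z))


def fraud_probability(t, x):
    """Probability for feature list x; t[0] is the intercept."""
    return logistic(t[0] + sum(tj * xj for tj, xj in zip(t[1:], x)))


def fit_coefficients(rows, labels, learning_rate=0.1, steps=500):
    """Plain gradient descent on the convex logistic log-loss."""
    t = [0.0] * (len(rows[0]) + 1)
    for _ in range(steps):
        grad = [0.0] * len(t)
        for x, y in zip(rows, labels):
            error = fraud_probability(t, x) - y
            grad[0] += error
            for j, xj in enumerate(x):
                grad[j + 1] += error * xj
        t = [tj - learning_rate * gj / len(rows) for tj, gj in zip(t, grad)]
    return t


def is_fraud(t, x, threshold=0.5):
    """Decision rule from the text: above 0.5 counts as fraud."""
    return fraud_probability(t, x) > threshold


# Hypothetical standardized feature values and fraud marks Y.
rows = [[-2.5], [-1.5], [-0.5], [0.5], [1.5], [2.5]]
labels = [0, 0, 0, 1, 1, 1]

t = fit_coefficients(rows, labels)
predictions = [is_fraud(t, x) for x in rows]
```

The learning rate and step count are arbitrary choices for the toy data; a real model would be tuned and validated as the description outlines.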

Association chain algorithm: the matrix corresponding to the bipartite graph is sorted by edge weight (the frequency with which a drug combination occurs). The drug combination with the highest edge weight is taken as the starting drug combination of the abnormal chain, and the drug connected to that combination by the next-highest edge weight is then retrieved. Retrieval continues in this order, linking the drug combinations into a chain. The input is the edge weight adjacency matrix and the output is one chain. In the bipartite graphs shown in Figures 3 and 4, Jia, Yi, Bing, Ding, and Wu (甲, 乙, 丙, 丁, 戊) are normal patients, Ji, Geng, Xin, Ren, and Gui (己, 庚, 辛, 壬, 癸) are abnormal patients, and a, b, c, d, e, and f are drugs.

Single-mode projection relationship

The edge weight adjacency matrix is illustrated as follows:

Head (start of the drug link)   Tail (end of the drug link)   Weight (frequency of the drug link)
Drug a                          Drug b                        2
Drug a                          Drug c                        1
Drug a                          Drug d                        0
Drug b                          Drug c                        4
Drug b                          Drug d                        1
Drug c                          Drug d                        1
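The way a patient-drug bipartite graph collapses into drug-drug edge weights of this kind can be sketched as follows. The sketch assumes that the weight of a drug pair is the number of patients who purchased both drugs, which is one reading of "frequency of the drug link". The purchase records and the function name are hypothetical and do not reproduce Figures 3 and 4.

```python
from collections import Counter
from itertools import combinations


def project_to_drug_graph(purchases):
    """Single-mode projection: count, for every drug pair, the patients who bought both."""
    weights = Counter()
    for drugs in purchases.values():
        # Sorting gives each pair a fixed (head, tail) orientation.
        for head, tail in combinations(sorted(set(drugs)), 2):
            weights[(head, tail)] += 1
    return weights


# Hypothetical patient -> purchased drugs mapping (the bipartite graph as an adjacency list).
purchases = {
    "patient_1": ["a", "b"],
    "patient_2": ["b", "c"],
    "patient_3": ["a", "b", "c"],
}

edge_weights = project_to_drug_graph(purchases)
```

Each key of `edge_weights` corresponds to one head/tail row of an edge weight adjacency table, and pairs that never co-occur are simply absent, which plays the role of a weight of 0.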

The edge weight matrix of the bipartite graph is sorted by edge weight. The drug combination frequencies are: a-b 2 times, a-c 2 times, a-d 0 times, b-c 4 times, b-d 1 time, and c-d 1 time. Starting from the drug combination with the highest edge weight, b-c is taken as the starting drug combination of the chain. The drug connected to this combination by the next-highest edge weight is then retrieved: a-b occurs 2 times and b-d occurs 1 time, so a-b is taken. Retrieval continues in this order and the drug combinations are linked together, giving the output a-b-c-d, which is a normal chain. In other words, the association chain algorithm outputs one normal chain ordered by edge weight.

Head (start of the drug link)   Tail (end of the drug link)   Weight (frequency of the drug link)
Drug b                          Drug c                        4
Drug b                          Drug e                        1
Drug b                          Drug f                        1
Drug c                          Drug e                        3
Drug c                          Drug f                        2
Drug e                          Drug f                        2

The edge weight matrix of the bipartite graph is sorted by edge weight. The drug combination frequencies are: b-c 4 times, b-e 1 time, b-f 1 time, c-e 3 times, c-f 2 times, and e-f 2 times. Starting from the drug combination with the highest edge weight, b-c is taken as the starting drug combination of the abnormal chain. The drug connected to this combination by the next-highest edge weight is then retrieved: c-e occurs 3 times and c-f occurs 2 times, so c-e is taken. Retrieval continues in this order and the drug combinations are linked together, giving the output b-c-e-f, which is an abnormal chain. In other words, the association chain algorithm outputs one abnormal chain ordered by edge weight.
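The two worked examples above can be reproduced with a short greedy routine. This is one possible reading of the association chain algorithm, not a definitive implementation: it assumes that the chain starts at the heaviest edge and is then repeatedly extended, at either end, by the heaviest remaining edge that links an end drug to a drug not yet in the chain. It also assumes that zero-weight pairs are ignored and that ties are broken alphabetically. The weights are taken from the two edge weight tables above; the function name is hypothetical.

```python
def association_chain(edge_weights):
    """Greedy chain construction from {(head, tail): weight}; returns a list of drugs."""
    ranked = sorted(
        ((w, h, t) for (h, t), w in edge_weights.items() if w > 0),
        key=lambda e: (-e[0], e[1], e[2]),  # heaviest first, alphabetical tie-break
    )
    if not ranked:
        return []
    _, head, tail = ranked[0]
    chain = [head, tail]
    while True:
        used = set(chain)
        extension = None
        for _, h, t in ranked:
            for end_index in (0, -1):  # try the front of the chain, then the back
                end = chain[end_index]
                if h == end and t not in used:
                    extension = (end_index, t)
                elif t == end and h not in used:
                    extension = (end_index, h)
                if extension:
                    break
            if extension:
                break
        if extension is None:
            return chain
        end_index, drug = extension
        if end_index == 0:
            chain.insert(0, drug)
        else:
            chain.append(drug)


normal_weights = {("a", "b"): 2, ("a", "c"): 1, ("a", "d"): 0,
                  ("b", "c"): 4, ("b", "d"): 1, ("c", "d"): 1}
abnormal_weights = {("b", "c"): 4, ("b", "e"): 1, ("b", "f"): 1,
                    ("c", "e"): 3, ("c", "f"): 2, ("e", "f"): 2}

normal_chain = association_chain(normal_weights)
abnormal_chain = association_chain(abnormal_weights)
```

With these inputs the routine yields the chains a-b-c-d and b-c-e-f given in the text. Other tie-breaking or extension rules are conceivable, and the source does not specify them.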

Because the cosine similarity of the two chains after vectorization is not 0, the drugs b and c that the abnormal chain shares with the normal chain are removed from the abnormal chain, which leaves the fraud chain e-f.

Definition of cosine similarity: cosine similarity evaluates how similar two vectors are by calculating the cosine of the angle between them. The vectors are plotted in the vector space according to their coordinate values, and the cosine of the angle between the two vectors is used as a measure of the difference between the two individuals. The closer the cosine is to 1, the closer the angle is to 0 degrees and the more similar the two vectors are; conversely, the closer the cosine is to 0, the less similar the two vectors are.

Formula: [formula], where a, b, c are the normal chains and abnormal chains;

Here the a vector is [x₁, y₁] and the b vector is [x₂, y₂]; the a vector is the vectorization of the normal chain and the b vector is the vectorization of the abnormal chain. Vectors with a similarity of 0 are thereby removed.

[formula].
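The comparison of a normal chain with an abnormal chain, and the removal of their shared drugs, can be sketched as follows. The text does not say how a chain is vectorized, so this sketch assumes a binary presence vector over the drugs that appear in either chain, and it uses the standard definition of cosine similarity (dot product divided by the product of the vector lengths). The function names are hypothetical; the two chains are the ones from the worked example.

```python
import math


def vectorize(chain, vocabulary):
    """Binary presence vector of a chain over a fixed drug vocabulary (assumed encoding)."""
    return [1 if drug in chain else 0 for drug in vocabulary]


def cosine_similarity(u, v):
    """Cosine of the angle between u and v; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def chain_similarity(normal_chain, abnormal_chain):
    vocabulary = sorted(set(normal_chain) | set(abnormal_chain))
    return cosine_similarity(vectorize(normal_chain, vocabulary),
                             vectorize(abnormal_chain, vocabulary))


def fraud_chain(normal_chain, abnormal_chain):
    """Keep the pair only if similarity is non-zero, then drop the shared drugs."""
    if chain_similarity(normal_chain, abnormal_chain) == 0:
        return []  # pairs with similarity 0 are discarded
    return [drug for drug in abnormal_chain if drug not in normal_chain]


normal_chain = ["a", "b", "c", "d"]
abnormal_chain = ["b", "c", "e", "f"]

similarity = chain_similarity(normal_chain, abnormal_chain)
result = fraud_chain(normal_chain, abnormal_chain)
```

Under this encoding the two example chains share two of four drugs each, so the similarity is non-zero and the remaining drugs e and f form the fraud chain, in line with the text.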

Claims

1. A medical insurance fraud detection method based on drug purchase records, characterized in that it includes the following steps:

Step S1, constructing a fraudster classification model through a machine learning algorithm;

Step S2, inputting patient information and drug purchase information into the model to establish a patient-drug bipartite graph, wherein the patient information includes normal patients and fraudulent patients;

Step S3: According to the patient-drug bipartite graph, a drug single-mode projection relationship is established to form a drug chain;

Step S4, using an associative chain algorithm to divide the drug chain described in step S3 into a normal chain and an abnormal chain;

Step S5, calculating the similarity of the normal chain and the abnormal chain respectively by using the cosine similarity formula;

Step S6, remove the normal chain with a similarity of 0, and retain the comparison combination of the abnormal chain and the normal chain with a similarity not equal to 0;

Step S7, remove the same products in the abnormal chain and the normal chain in the combination, and keep other drugs;

Step S8, synthesizing the remaining drugs into a fraud chain, and outputting the fraud chain;

In step S2, the drug purchase information of normal patients and the drug purchase information of fraudulent patients with fraudulent behavior are set as patient nodes and drug nodes, and a drug-patient undirected bipartite graph of fraudulent patients and a drug-patient undirected bipartite graph of normal patients are constructed respectively; the first round of derived feature extraction is performed on the patient-drug bipartite graph, and the features extracted in the first round include the total amount of types of drugs used and the total amount of drugs used, and a drug single-mode projection relationship is established based on its derived features;

In step S3, a second round of derivative feature extraction is performed on the abnormal chain, and the features extracted in the second round include the abnormal rate of type, the abnormal rate of quantity, and the abnormal drug usage rate in the abnormal chain;

In step S4, the association chain algorithm is specifically as follows: sorting the corresponding matrix of the bipartite graph by edge weight, starting from the drug combination corresponding to the highest edge weight as the starting drug combination in the abnormal chain, further searching for drugs connected to the second highest edge weight in the drug combination, searching in sequence, connecting the drug chains together, inputting the edge weight adjacency matrix, and outputting a chain;

In step S8, a third round of derivative feature extraction is performed on the synthesized fraud chain, and the features extracted in the third round include the abnormal rate of type, the abnormal rate of quantity and the abnormal drug usage rate in the abnormal chain.

2. According to claim 1, a medical insurance fraud detection method based on drug purchase records is characterized in that, in step S1, patient information is integrated, a machine learning algorithm is used to extract a feature vector of the patient information, the supervised screening algorithm smbinning is used to calculate the information volume IV of each feature of the feature vector, and the features with information volume IV greater than the IV are extracted and input into the machine learning algorithm to obtain a fraudster classification model.

3. A medical insurance fraud detection method based on drug purchase records according to claim 2, characterized in that in step S5, the cosine similarity formula is [formula]; where a, b, and c are normal chains or abnormal chains respectively.