CN116305285B - Patient information desensitization processing method and system combined with artificial intelligence
- Publication number
- CN116305285B; CN202310328830.6A
- Authority
- CN
- China
- Prior art keywords
- text
- desensitized
- medical record
- privacy
- paragraph
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F21/00—Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F21/60—Protecting data
- G06F21/62—Protecting access to data via a platform, e.g. using keys or access control rules
- G06F21/6218—Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
- G06F21/6245—Protecting personal data, e.g. for financial or medical purposes
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16H—HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
- G16H10/00—ICT specially adapted for the handling or processing of patient-related medical or healthcare data
- G16H10/60—ICT specially adapted for the handling or processing of patient-related medical or healthcare data for patient-specific data, e.g. for electronic patient records
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02A—TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
- Y02A90/00—Technologies having an indirect contribution to adaptation to climate change
- Y02A90/10—Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- General Health & Medical Sciences (AREA)
- Health & Medical Sciences (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Medical Informatics (AREA)
- Bioethics (AREA)
- Artificial Intelligence (AREA)
- Computer Hardware Design (AREA)
- Computer Security & Cryptography (AREA)
- Data Mining & Analysis (AREA)
- Software Systems (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- Epidemiology (AREA)
- Primary Health Care (AREA)
- Public Health (AREA)
- Medical Treatment And Welfare Office Work (AREA)
Abstract
Description
Technical Field
The present invention relates to the technical field of artificial intelligence and information processing, and in particular to a patient information desensitization processing method and system combined with artificial intelligence.
Background
Information desensitization means transforming sensitive information according to desensitization rules so that sensitive private data is reliably protected. Where security-related or otherwise sensitive data is involved, the real data is transformed, without violating system rules, so that it can still be put to limited use. In smart healthcare, patient information usually carries private patient details, so desensitizing patient information is essential.
Summary of the Invention
To address the technical problems in the related art, the present invention provides a patient information desensitization processing method and system combined with artificial intelligence.
In a first aspect, an embodiment of the present invention provides a patient information desensitization processing method combined with artificial intelligence, applied to an AI desensitization processing system, the method including: obtaining a medical record text to be desensitized and a desensitized medical record text;
obtaining a privacy paragraph parsing result for the medical record text to be desensitized and a privacy paragraph parsing result for the desensitized medical record text;
determining, based on the two privacy paragraph parsing results, a privacy paragraph commonality score of the medical record text to be desensitized and the desensitized medical record text;
determining text semantic features of the medical record text to be desensitized and text semantic features of the desensitized medical record text;
determining, based on the text semantic features of the two texts, a first semantic commonality score of the medical record text to be desensitized and the desensitized medical record text;
determining, based on the privacy paragraph commonality score and the first semantic commonality score, a text commonality score between the medical record text to be desensitized and the desensitized medical record text;
determining, based on the text commonality score, a target desensitized text from the desensitized medical record text, so that privacy desensitization protection can be applied to the medical record text to be desensitized based on the target desensitized text.
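The scoring and selection steps of the first aspect can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the source does not say how text semantic features are compared or how the two scores are combined, so cosine similarity, the weighted sum, the weight `alpha`, and every function name here are hypothetical.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """First semantic commonality score (assumed: cosine of two feature vectors)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def text_commonality(privacy_score: float, semantic_score: float,
                     alpha: float = 0.5) -> float:
    """Combine the privacy paragraph and semantic scores (assumed: weighted sum)."""
    return alpha * privacy_score + (1.0 - alpha) * semantic_score


def select_target(privacy_scores: Sequence[float],
                  semantic_scores: Sequence[float],
                  alpha: float = 0.5) -> int:
    """Return the index of the candidate desensitized text with the highest
    text commonality score."""
    scores = [text_commonality(p, s, alpha)
              for p, s in zip(privacy_scores, semantic_scores)]
    return max(range(len(scores)), key=scores.__getitem__)
```

Here each candidate desensitized medical record text contributes one privacy paragraph commonality score and one semantic commonality score; the candidate with the highest combined score is taken as the target desensitized text.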
In some embodiments, the privacy paragraph parsing result for the medical record text to be desensitized includes first privacy paragraph text mined from that text, and the privacy paragraph parsing result for the desensitized medical record text includes second privacy paragraph text mined from that text; determining the privacy paragraph commonality score of the two texts based on the two parsing results then includes:
determining, based on the first privacy paragraph text, first statistical data of a target privacy paragraph in the medical record text to be desensitized;
determining, based on the second privacy paragraph text, second statistical data of the target privacy paragraph in the desensitized medical record text;
determining, based on the first statistical data and the second statistical data, the sum and the difference of the numbers of target privacy paragraphs in the medical record text to be desensitized and in the desensitized medical record text;
determining, based on that sum and that difference, a privacy paragraph count commonality score between the medical record text to be desensitized and the desensitized medical record text;
determining, based on the privacy paragraph count commonality score, the privacy paragraph commonality score of the medical record text to be desensitized and the desensitized medical record text.
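One way to turn the sum and difference of the two counts into a score is sketched below. The source only says the score is derived from the sum and the difference; the specific formula `1 - |a - b| / (a + b)` and the function name are assumptions made for illustration.

```python
def count_commonality(first_count: int, second_count: int) -> float:
    """Privacy paragraph count commonality score in [0, 1].

    Assumed formula: 1 - |difference| / sum. Equal counts give 1.0,
    and a count present in only one text gives 0.0.
    """
    total = first_count + second_count
    diff = abs(first_count - second_count)
    if total == 0:
        # Neither text contains the target privacy paragraph; treating
        # this as full agreement is an assumption.
        return 1.0
    return 1.0 - diff / total
```

For example, three target privacy paragraphs in one text and one in the other give a sum of 4 and a difference of 2, hence a score of 0.5 under this formula.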
In some embodiments, obtaining the privacy paragraph parsing result for the medical record text to be desensitized and the privacy paragraph parsing result for the desensitized medical record text includes:
performing privacy paragraph mining on the medical record text to be desensitized and on the desensitized medical record text separately, to determine the first privacy paragraph text included in the medical record text to be desensitized and the second privacy paragraph text included in the desensitized medical record text;
determining, based on the first privacy paragraph text and the second privacy paragraph text, the privacy paragraph identifier to which each text unit in the medical record text to be desensitized belongs and the privacy paragraph identifier to which each text unit in the desensitized medical record text belongs;
decomposing the medical record text to be desensitized into X first text sets and the desensitized medical record text into X second text sets, where the X first text sets correspond one-to-one to the X second text sets and X is an integer not less than 1;
determining, based on the privacy paragraph identifier of each text unit in the medical record text to be desensitized, the first privacy paragraph identifier to which each text unit in each first text set belongs, as the privacy paragraph parsing result for the medical record text to be desensitized;
determining, based on the privacy paragraph identifier of each text unit in the desensitized medical record text, the second privacy paragraph identifier to which each text unit in each second text set belongs, as the privacy paragraph parsing result for the desensitized medical record text.
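The labeling and decomposition steps above can be sketched as follows. This is an illustration only: the source does not define what a text unit is or how the X text sets are formed, so treating text units as indexed tokens, representing mined privacy paragraphs as `(start, end, identifier)` spans, and splitting into contiguous near-equal sets are all assumptions, as are the function names and the default identifier `"O"`.

```python
from typing import List, Tuple


def label_units(n_units: int, spans: List[Tuple[int, int, str]],
                default: str = "O") -> List[str]:
    """Assign each text unit the identifier of the privacy paragraph that
    covers it; units outside every span keep the default identifier."""
    labels = [default] * n_units
    for start, end, ident in spans:  # end is exclusive
        for i in range(max(start, 0), min(end, n_units)):
            labels[i] = ident
    return labels


def split_into_sets(items: List[str], x: int) -> List[List[str]]:
    """Split a sequence into x contiguous sets of near-equal size."""
    if x < 1:
        raise ValueError("X must be an integer not less than 1")
    base, extra = divmod(len(items), x)
    sets, pos = [], 0
    for k in range(x):
        size = base + (1 if k < extra else 0)
        sets.append(items[pos:pos + size])
        pos += size
    return sets
```

Applying `label_units` and then `split_into_sets` with the same X to both texts yields X pairs of identifier lists, one pair per corresponding first and second text set.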
In some embodiments, determining the privacy paragraph commonality score of the medical record text to be desensitized and the desensitized medical record text based on the two privacy paragraph parsing results includes:
determining, based on the first privacy paragraph identifier of each text unit in each first text set and the second privacy paragraph identifier of each text unit in each second text set, a privacy paragraph identifier commonality score between each first text set and its corresponding second text set;
determining, based on the privacy paragraph identifier commonality scores between the first text sets and their corresponding second text sets, a privacy paragraph topic commonality score between the medical record text to be desensitized and the desensitized medical record text;
determining, based on the privacy paragraph topic commonality score, the privacy paragraph commonality score of the medical record text to be desensitized and the desensitized medical record text.
In some embodiments, the first privacy paragraph identifiers include first target privacy paragraph identifiers, the second privacy paragraph identifiers include second target privacy paragraph identifiers, the first text sets include a first target text set, and the second text sets include a second target text set; the first target text set corresponds to the second target text set, the text units in the first target text set belong to the first target privacy paragraph identifiers, and the text units in the second target text set belong to the second target privacy paragraph identifiers; determining the privacy paragraph identifier commonality score between each first text set and its corresponding second text set then includes:
determining the number of identical privacy paragraph identifiers shared by the first target privacy paragraph identifiers and the second target privacy paragraph identifiers;
determining a privacy paragraph identifier statistic of the first target privacy paragraph identifiers and the second target privacy paragraph identifiers;
determining, based on the number of identical privacy paragraph identifiers and the privacy paragraph identifier statistic, the privacy paragraph identifier commonality score of the first target text set and the second target text set.
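A possible reading of the identifier commonality computation is sketched below. The source names only two inputs, the number of identical identifiers and an identifier statistic, without giving a formula. The sketch assumes the statistic is the total number of identifiers in both sets and uses a Dice-style ratio; averaging the per-set scores into the topic commonality score is likewise an assumption, and all names are hypothetical.

```python
from collections import Counter
from typing import List


def identifier_commonality(first_ids: List[str], second_ids: List[str]) -> float:
    """Privacy paragraph identifier commonality score for one pair of text sets.

    Assumed formula: 2 * (number of identical identifiers) / (total identifiers).
    """
    same = sum((Counter(first_ids) & Counter(second_ids)).values())
    total = len(first_ids) + len(second_ids)
    if total == 0:
        return 1.0  # two empty sets are treated as identical (assumption)
    return 2.0 * same / total


def topic_commonality(first_sets: List[List[str]],
                      second_sets: List[List[str]]) -> float:
    """Privacy paragraph topic commonality score over the X corresponding
    pairs (assumed: arithmetic mean of the per-pair scores)."""
    if len(first_sets) != len(second_sets) or not first_sets:
        raise ValueError("expected X >= 1 one-to-one corresponding text sets")
    scores = [identifier_commonality(a, b)
              for a, b in zip(first_sets, second_sets)]
    return sum(scores) / len(scores)
```

The multiset intersection counts an identifier as shared once for each time it occurs in both sets, so identifier order within a text set does not affect the score.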
In some embodiments, the first privacy paragraph text includes first statistical data of a target privacy paragraph included in the medical record text to be desensitized, and the second privacy paragraph text includes second statistical data of the target privacy paragraph included in the medical record text to be desensitized; determining the privacy paragraph commonality score of the medical record text to be desensitized and the desensitized medical record text based on the privacy paragraph topic commonality score then includes:
determining, based on the first statistical data and the second statistical data, the sum and the difference of the numbers of target privacy paragraphs in the medical record text to be desensitized and in the desensitized medical record text;
determining, based on that sum and that difference, a privacy paragraph count commonality score between the medical record text to be desensitized and the desensitized medical record text;
determining, based on the privacy paragraph count commonality score and the privacy paragraph topic commonality score, the privacy paragraph commonality score of the medical record text to be desensitized and the desensitized medical record text.
In some embodiments, the medical record text to be desensitized includes a first medical record text to be desensitized and a second medical record text to be desensitized; the text commonality score between the first medical record text to be desensitized and the desensitized medical record text is a first text commonality score, and the text commonality score between the second medical record text to be desensitized and the desensitized medical record text is a second text commonality score; determining the target desensitized text from the desensitized medical record text based on the text commonality score then includes:
obtaining the text semantic features of the first medical record text to be desensitized and of the second medical record text to be desensitized;
determining, based on those text semantic features, a second semantic commonality score between the first medical record text to be desensitized and the second medical record text to be desensitized;
determining, based on the first text commonality score, the second text commonality score, and the second semantic commonality score, the target desensitized text of the first medical record text to be desensitized from the desensitized medical record text.
In some embodiments, determining the target desensitized text of the first medical record text to be desensitized from the desensitized medical record text, based on the first text commonality score, the second text commonality score, and the second semantic commonality score, includes:
determining, based on the second text commonality score and the second semantic commonality score, contribution information of the second medical record text to be desensitized to the first medical record text to be desensitized;
determining, based on that contribution information and the first text commonality score, the correlation between the first medical record text to be desensitized and the desensitized medical record text;
determining, based on that correlation, the target desensitized text of the first medical record text to be desensitized from the desensitized medical record text.
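The contribution and correlation steps can be sketched as follows. The source does not define how the contribution information or the correlation is computed, so the product and the weighted sum below, the weight `beta`, and all function names are assumptions chosen only to make the data flow concrete.

```python
from typing import List, Tuple


def contribution(second_text_score: float, second_semantic_score: float) -> float:
    """Contribution of the second text to the first (assumed: product of the
    second text commonality score and the second semantic commonality score)."""
    return second_text_score * second_semantic_score


def correlation(first_text_score: float, second_text_score: float,
                second_semantic_score: float, beta: float = 0.3) -> float:
    """Correlation between the first text and one desensitized candidate
    (assumed: weighted sum of the first text commonality score and the
    contribution of the second text)."""
    return ((1.0 - beta) * first_text_score
            + beta * contribution(second_text_score, second_semantic_score))


def pick_target(candidates: List[Tuple[float, float]],
                second_semantic_score: float, beta: float = 0.3) -> int:
    """candidates[i] = (first text commonality score, second text commonality
    score) for desensitized candidate i; return the index with the highest
    correlation."""
    scores = [correlation(f, s, second_semantic_score, beta)
              for f, s in candidates]
    return max(range(len(scores)), key=scores.__getitem__)
```

Under these assumptions, a second medical record text that is semantically close to the first one shifts the choice toward candidates that also match the second text; with `beta = 0` the selection falls back to the first text commonality score alone.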
In some embodiments, applying privacy desensitization protection to the medical record text to be desensitized based on the target desensitized text includes:
performing text matching between the target desensitized text and the medical record text to be desensitized to obtain a text matching result;
applying, in combination with the text matching result, privacy desensitization protection to the medical record text to be desensitized by means of the target desensitized text, to obtain a basic desensitized text;
optimizing the basic desensitized text with a text anonymization algorithm, thereby completing the privacy desensitization protection of the medical record text to be desensitized.
In some embodiments, before the basic desensitized text is optimized with the text anonymization algorithm, the method further includes:
obtaining a target anonymized text example;
performing privacy paragraph de-anonymization on the target anonymized text example to obtain a de-anonymized text example;
loading the de-anonymized text example into a target machine learning algorithm to obtain an anonymous predicted text;
determining a target debugging cost between the anonymous predicted text and the target anonymized text example;
updating the algorithm variables of the target machine learning algorithm according to the target debugging cost, so that the target machine learning algorithm is debugged into the text anonymization algorithm.
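The debugging procedure above resembles an ordinary supervised training loop, which the following toy sketch illustrates. It is not the patented algorithm: the source does not name a model, so a token-level logistic regression stands in for the "target machine learning algorithm", the "target debugging cost" is interpreted as a cross-entropy training loss, and the mask token, the hand-made features, and all function names are hypothetical. The de-anonymized example is supplied directly rather than derived.

```python
import math
from typing import List

MASK = "***"  # hypothetical mask token marking an anonymized text unit


def features(token: str) -> List[float]:
    """Hand-made features: bias, is-numeric, starts-with-uppercase."""
    return [1.0, float(token.isdigit()), float(token[:1].isupper())]


def predict(weights: List[float], token: str) -> float:
    """Probability that the token should be masked."""
    z = sum(w * f for w, f in zip(weights, features(token)))
    return 1.0 / (1.0 + math.exp(-z))


def debugging_cost(weights: List[float], deanonymized: List[str],
                   anonymized: List[str]) -> float:
    """Mean cross-entropy between predictions and the anonymized example."""
    cost = 0.0
    for src, tgt in zip(deanonymized, anonymized):
        y = 1.0 if tgt == MASK else 0.0
        p = min(max(predict(weights, src), 1e-12), 1.0 - 1e-12)
        cost -= y * math.log(p) + (1.0 - y) * math.log(1.0 - p)
    return cost / len(deanonymized)


def train(deanonymized: List[str], anonymized: List[str],
          epochs: int = 200, lr: float = 0.5) -> List[float]:
    """Update the algorithm variables (weights) by stochastic gradient descent."""
    weights = [0.0, 0.0, 0.0]
    for _ in range(epochs):
        for src, tgt in zip(deanonymized, anonymized):
            y = 1.0 if tgt == MASK else 0.0
            error = predict(weights, src) - y
            weights = [w - lr * error * f
                       for w, f in zip(weights, features(src))]
    return weights


def anonymize(weights: List[float], tokens: List[str]) -> List[str]:
    """Apply the trained model: mask tokens predicted to be private."""
    return [MASK if predict(weights, t) >= 0.5 else t for t in tokens]
```

In this sketch the anonymized example provides the labels (masked or not) and the de-anonymized example provides the inputs, so the model learns to reproduce the anonymization from the restored text.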
In a second aspect, the present invention further provides an AI desensitization processing system including a processor and a memory; the processor and the memory are communicatively connected, and the processor is configured to read a computer program from the memory and execute it to implement the above method.
In a third aspect, the present invention further provides a computer-readable storage medium storing a program which, when executed by a processor, implements the above method.
With the patient information desensitization processing method and system combined with artificial intelligence provided by the embodiments of the present invention, determining the target desensitized text for the medical record text to be desensitized relies on two signals. The text semantic features of the medical record text to be desensitized and of the desensitized medical record text ensure the global commonality between the target desensitized text and the medical record text to be desensitized. The privacy paragraph commonality score between the two texts ensures the commonality between the privacy paragraphs of the target desensitized text and those of the medical record text to be desensitized, that is, the similarity of the text information they contain.
The embodiments therefore safeguard both the overall text layout similarity and the similarity of privacy paragraph information between the target desensitized text and the medical record text to be desensitized. This improves the match between the two texts and, in turn, the quality and efficiency of data anonymization/desensitization when the medical record text to be desensitized is processed based on the target desensitized text.
Brief Description of the Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present invention and, together with the description, serve to explain the principles of the invention.
FIG. 1 is a flow chart of a patient information desensitization processing method combined with artificial intelligence provided by an embodiment of the present invention.
Detailed Description
Exemplary embodiments are described in detail here, with examples shown in the accompanying drawings. Where the following description refers to the drawings, the same numbers in different drawings denote the same or similar elements unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the present invention; they are merely examples of devices and methods consistent with some aspects of the invention as detailed in the appended claims.
It should be noted that the terms "first", "second", and the like in the specification, the claims, and the above drawings are used to distinguish similar objects and do not necessarily describe a specific order or sequence.
The method embodiments provided by the embodiments of the present invention may be executed in an AI desensitization processing system, a computer device, or a similar computing device. Taking execution on an AI desensitization processing system as an example, the system may include one or more processors (which may include, but are not limited to, a processing device such as a microprocessor (MCU) or a programmable logic device (FPGA)) and a memory for storing data; optionally, the system may also include a transmission device for communication functions. Those of ordinary skill in the art will understand that this structure is illustrative only and does not limit the structure of the AI desensitization processing system. For example, the system may include more or fewer components than described above, or have a different configuration.
The memory may be used to store computer programs, for example software programs and modules of application software, such as the computer program corresponding to the patient information desensitization processing method combined with artificial intelligence in the embodiments of the present invention. By running the computer program stored in the memory, the processor executes various functional applications and data processing, that is, it implements the above method. The memory may include high-speed random access memory and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory may further include memory located remotely from the processor, which may be connected to the AI desensitization processing system over a network. Examples of such networks include, but are not limited to, the Internet, an intranet, a local area network, a mobile communication network, and combinations thereof.
The transmission device is used to receive or send data over a network. Specific examples of the network may include a wireless network provided by the communication provider of the AI desensitization processing system. In one example, the transmission device includes a network adapter (Network Interface Controller, NIC), which can connect to other network devices through a base station and thereby communicate with the Internet. In another example, the transmission device may be a radio frequency (RF) module, which communicates with the Internet wirelessly.
On this basis, refer to Figure 1, which is a flow chart of a patient information desensitization processing method combined with artificial intelligence provided by an embodiment of the present invention. The method is applied to the AI desensitization processing system and may include STEP11 to STEP17.
STEP11, obtain the medical record text to be desensitized and the desensitized medical record text.
In an embodiment of the present invention, the medical record text to be desensitized is a patient medical record text that requires privacy desensitization protection. It may be an outpatient record, an inpatient record, or a surgical record. Such a text records private information about the patient, and this information must be protected if the record is later shared for research purposes. The embodiments of the present invention therefore provide privacy desensitization protection for the medical record text to be desensitized.
The desensitized medical record text is a patient medical record text whose private information has already been anonymized or desensitized. It may be obtained with a K-anonymity algorithm or with another anonymization algorithm.
Further, there may be one or more medical record texts to be desensitized, and one or more desensitized medical record texts.
STEP12, obtain the privacy paragraph parsing result of the medical record text to be desensitized and the privacy paragraph parsing result of the desensitized medical record text.
In one possible design, the privacy paragraph parsing result of the medical record text to be desensitized may include any information related to the target privacy paragraphs that can be mined from that text, such as the target privacy paragraphs themselves, their identification information, their distribution characteristics (positions) in the text, and their statistical data. There may be one or more kinds of target privacy paragraph.
STEP13, based on the privacy paragraph parsing result of the medical record text to be desensitized and that of the desensitized medical record text, determine the privacy paragraph commonality score of the two texts.
In one possible design, the privacy paragraph commonality score of the medical record text to be desensitized and the desensitized medical record text refers to the commonality score of the target privacy paragraphs contained in the two texts. It may include at least one of a paragraph-count commonality score, a topic commonality score, a distribution-characteristic commonality score, or a privacy-level commonality score of the target privacy paragraphs. Here, a commonality score can be understood as a degree of similarity.
STEP14, determine the text semantic feature of the medical record text to be desensitized and the text semantic feature of the desensitized medical record text.
In one possible design, a target machine learning algorithm performs text vector mining on the medical record text to be desensitized and on the desensitized medical record text, yielding one text semantic feature that characterizes the former as a whole and one that characterizes the latter as a whole. The target machine learning algorithm may be, for example, a deep learning model (DNN) or a residual network.
STEP15, based on the text semantic feature of the medical record text to be desensitized and the text semantic feature of the desensitized medical record text, determine the first semantic commonality score of the two texts.
In one possible design, the semantic commonality score between the text semantic feature of the medical record text to be desensitized and that of the desensitized medical record text is determined and used as the first semantic commonality score. The larger the first semantic commonality score, the better the semantic features of the two texts match (that is, the higher their semantic feature correlation).
The commonality score between the two text semantic features may be determined by cosine similarity, although other measures can also be used.
For example, if the text semantic feature of the medical record text to be desensitized is feature_u and that of the desensitized medical record text is feature_v, the semantic commonality score between the two features can be determined from the dot product of feature_u and feature_v. The larger the resulting score, the higher the macro-level (whole-text) commonality between the medical record text to be desensitized and the desensitized medical record text. Here i, the index of a medical record text to be desensitized, is an integer not less than 1 and not greater than the number of medical record texts to be desensitized.
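The dot product and cosine similarity mentioned above can be sketched as follows. This is only an illustration under stated assumptions: the feature vectors are plain lists of floats with made-up values, and the helper names `dot` and `cosine_similarity` are hypothetical rather than part of the described system.

```python
import math
from typing import Sequence


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    """Dot product of two equal-length feature vectors."""
    if len(u) != len(v):
        raise ValueError("feature vectors must have the same length")
    return sum(a * b for a, b in zip(u, v))


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Dot product normalised by both vector lengths (0.0 if either is a zero vector)."""
    norm_u = math.sqrt(dot(u, u))
    norm_v = math.sqrt(dot(v, v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot(u, v) / (norm_u * norm_v)


# Illustrative (made-up) semantic features of one text to be desensitized
# and one desensitized text.
feature_u = [0.2, 0.7, 0.1]
feature_v = [0.25, 0.6, 0.15]
first_semantic_score = cosine_similarity(feature_u, feature_v)
```

If the feature vectors are already normalised to unit length, the plain dot product and the cosine similarity coincide, which may be why the text uses both terms.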
STEP16, based on the privacy paragraph commonality score and the first semantic commonality score of the medical record text to be desensitized and the desensitized medical record text, determine the text commonality score between the two texts.
In one possible design, the privacy paragraph commonality score and the first semantic commonality score of the two texts are integrated to form the text commonality score between them. Alternatively, the privacy paragraph commonality score is multiplied by the first semantic commonality score, and the product is used as the text commonality score.
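A minimal sketch of the two combination options follows. The multiplication option comes directly from the text; the weighted sum is only one assumed reading of "integrated", and the function name, the `mode` strings, and the `weight` parameter are hypothetical.

```python
def text_commonality(privacy_score: float, semantic_score: float,
                     mode: str = "product", weight: float = 0.5) -> float:
    """Combine the privacy paragraph score and the first semantic score.

    mode="product": multiply the two scores, as the text describes.
    mode="weighted_sum": an assumed form of "integration", not specified by the text.
    """
    if mode == "product":
        return privacy_score * semantic_score
    if mode == "weighted_sum":
        return weight * privacy_score + (1.0 - weight) * semantic_score
    raise ValueError(f"unknown mode: {mode}")


score_product = text_commonality(0.5, 0.8)
score_weighted = text_commonality(0.5, 0.8, mode="weighted_sum", weight=0.25)
```

With the product, a low score on either side pulls the combined score down, whereas a weighted sum lets a strong semantic match partly offset a weak privacy paragraph match.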
The text commonality score determined in this way captures, through the first semantic commonality score, the overall relationship between the medical record text to be desensitized and the desensitized medical record text, and, through the privacy paragraph commonality score, the similarity of their local text information.
STEP17, based on the text commonality score between the medical record text to be desensitized and the desensitized medical record texts, determine a target desensitized text from among the desensitized medical record texts, so that privacy desensitization protection can be applied to the medical record text to be desensitized on the basis of the target desensitized text.
In one possible design, the target desensitized text is selected from the desensitized medical record texts according to the text commonality score between the medical record text to be desensitized and each of them. For example, the desensitized medical record text with the highest score in the text commonality score list can be taken as the target desensitized text for the medical record text to be desensitized.
For example, if there are two medical record texts to be desensitized, the privacy paragraph commonality score list List1 (the texts to be desensitized against each desensitized medical record text) and the semantic commonality score list List2 (likewise) can each be converted from one dimension to two. Multiplying List1 by List2 then yields the text commonality score of each of the two texts to be desensitized against each desensitized medical record text, from which a target desensitized text can be determined for each of the two.
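The List1 and List2 example can be sketched as below. The sketch assumes that the multiplication is element-wise, that each flat list is laid out row by row (one row per text to be desensitized, one column per desensitized text), and that the target is the column with the highest product. The function names and score values are illustrative only.

```python
from typing import List, Sequence, Tuple


def reshape(flat: Sequence[float], rows: int, cols: int) -> List[List[float]]:
    """Turn a 1-D score list into a rows x cols table (row-major order)."""
    if len(flat) != rows * cols:
        raise ValueError("list length does not match rows * cols")
    return [list(flat[r * cols:(r + 1) * cols]) for r in range(rows)]


def select_targets(list1: Sequence[float], list2: Sequence[float],
                   n_pending: int, n_desensitized: int
                   ) -> Tuple[List[List[float]], List[int]]:
    """Multiply the two score tables element-wise and pick the best column per row."""
    privacy = reshape(list1, n_pending, n_desensitized)
    semantic = reshape(list2, n_pending, n_desensitized)
    scores = [[p * s for p, s in zip(p_row, s_row)]
              for p_row, s_row in zip(privacy, semantic)]
    # Index of the desensitized text with the highest text commonality score.
    targets = [max(range(len(row)), key=row.__getitem__) for row in scores]
    return scores, targets


# Two texts to be desensitized, three desensitized texts (made-up scores).
List1 = [0.9, 0.4, 0.7, 0.3, 0.8, 0.5]  # privacy paragraph commonality
List2 = [0.8, 0.9, 0.6, 0.5, 0.7, 0.9]  # semantic commonality
scores, targets = select_targets(List1, List2, n_pending=2, n_desensitized=3)
```

In this example the first text to be desensitized is matched to desensitized text 0 and the second to desensitized text 1.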
In the embodiments of the present invention, when a target desensitized text is determined for a medical record text to be desensitized, the text semantic features of the two texts ensure their global similarity, and the privacy paragraph commonality score ensures that the privacy paragraphs in the target desensitized text resemble those in the medical record text to be desensitized, that is, that the two texts are similar in their text information (such as the number and distribution of privacy paragraphs). The embodiments can therefore ensure similarity both in overall text layout and in privacy paragraph information. This significantly improves the match between the target desensitized text and the medical record text to be desensitized, and in turn improves the quality and efficiency of anonymization/desensitization when the medical record text to be desensitized is processed on the basis of the target desensitized text.
In one possible design, once the target desensitized text of the medical record text to be desensitized has been determined, privacy desensitization protection can be completed as follows.
Text matching is performed between the target desensitized text and the medical record text to be desensitized to obtain a text matching result. The text matching can match the words and sentences of each privacy paragraph. For example, if the target desensitized text and the medical record text to be desensitized each contain two privacy paragraphs that correspond to one another, text matching can pair the names, home addresses, workplaces, daily living habits before and after illness, and similar expressions in the privacy paragraphs of the target desensitized text one by one with their counterparts in the privacy paragraphs of the medical record text to be desensitized.
Using the text matching result, the target desensitized text is applied to the medical record text to be desensitized to obtain a basic desensitized text, that is, a text that has undergone preliminary anonymization. The basic desensitized text is then optimized (given further privacy desensitization) by a debugged text anonymization algorithm, which completes the privacy desensitization protection. The two rounds of desensitization make the anonymization smooth and avoid over-anonymization, which would generalize the text so much that it could no longer be used in later medical research. In other words, the embodiments of the present invention protect privacy while keeping the desensitized medical record text as usable as possible.
Before the basic desensitized text is optimized with the debugged text anonymization algorithm, that algorithm can be debugged as follows: obtain a target anonymized text example; de-anonymize the privacy paragraphs of the target anonymized text example to obtain a de-anonymized text example; feed the de-anonymized text example into a target machine learning algorithm to obtain an anonymization-predicted text; determine the target debugging cost between the anonymization-predicted text and the target anonymized text example; and adjust the algorithm variables of the target machine learning algorithm according to the target debugging cost, so that the target machine learning algorithm is debugged into the text anonymization algorithm.
Here, a text example can be understood as a training text, and the debugging cost as the training loss of the algorithm.
In an embodiment of the present invention, the privacy paragraph parsing result of the medical record text to be desensitized may include a first privacy paragraph text mined from that text, and the privacy paragraph parsing result of the desensitized medical record text may include a second privacy paragraph text mined from that text.
The first privacy paragraph text may include statistical data on the target privacy paragraphs in the medical record text to be desensitized, and the second privacy paragraph text may include statistical data on the target privacy paragraphs in the desensitized medical record text.
In one possible design, a target privacy paragraph mining network mines the medical record text to be desensitized or the desensitized medical record text in order to determine the first privacy paragraph text or the second privacy paragraph text, respectively.
On this basis, determining the privacy paragraph commonality score includes the following steps.
STEP21, based on the first privacy paragraph text, determine first statistical data of the target privacy paragraphs in the medical record text to be desensitized.
STEP22, based on the second privacy paragraph text, determine second statistical data of the target privacy paragraphs in the desensitized medical record text.
STEP23, based on the first statistical data and the second statistical data, determine the sum and the difference of the numbers of target privacy paragraphs in the medical record text to be desensitized and in the desensitized medical record text.
Here, statistical data can be understood as a count.
STEP24, based on that sum and that difference, determine the privacy paragraph count commonality score of the medical record text to be desensitized and the desensitized medical record text.
In one possible design, the privacy paragraph count commonality score between the medical record text to be desensitized and the desensitized medical record text is determined from a weighted average of the count sum and the count difference. The specific weighting parameters can be adjusted by those skilled in the art according to actual needs.
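The text leaves the weighting open, so the following is only one assumed way to turn the count sum and count difference into a score in the range 0 to 1. The formula 1 - |n1 - n2| / (n1 + n2) and the function name are illustrative assumptions, not the method's prescribed parameters.

```python
def paragraph_count_commonality(first_count: int, second_count: int) -> float:
    """Score how close two privacy paragraph counts are, using their sum and difference.

    Assumed formula: 1 - |n1 - n2| / (n1 + n2). Equal counts give 1.0;
    a text with paragraphs compared against a text with none gives 0.0.
    """
    if first_count < 0 or second_count < 0:
        raise ValueError("counts must be non-negative")
    count_sum = first_count + second_count
    count_diff = abs(first_count - second_count)
    if count_sum == 0:
        return 1.0  # neither text contains a target privacy paragraph
    return 1.0 - count_diff / count_sum


same = paragraph_count_commonality(3, 3)
partial = paragraph_count_commonality(3, 1)
```

Dividing the difference by the sum keeps the score comparable between short and long records; other weightings of the two quantities would serve the same role.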
STEP25, based on the privacy paragraph count commonality score of the medical record text to be desensitized and the desensitized medical record text, determine the privacy paragraph commonality score of the two texts.
In one possible design, the privacy paragraph count commonality score between the two texts is used directly as their privacy paragraph commonality score. For example, the list of privacy paragraph count commonality scores between the medical record text to be desensitized and the desensitized medical record texts is used as the list of privacy paragraph commonality scores between them.
It can be seen that the embodiments of the present invention not only ensure global similarity between the target desensitized text and the medical record text to be desensitized through the semantic commonality score, but also ensure that the two texts contain the same number of target privacy paragraphs. As a result, when the medical record text to be desensitized is anonymized/desensitized using the target desensitized text, every target privacy paragraph in it has a corresponding privacy paragraph in the target desensitized text, which safeguards the anonymization/desensitization quality and efficiency for each target privacy paragraph in each medical record text to be desensitized.
In other designs, determining the privacy paragraph parsing results may include the following.
STEP31, perform privacy paragraph mining on the medical record text to be desensitized and on the desensitized medical record text, to determine the first privacy paragraph text contained in the former and the second privacy paragraph text contained in the latter.
The first privacy paragraph text may include the target privacy paragraphs present in the medical record text to be desensitized and the distribution characteristics of each of them within that text. The second privacy paragraph text may include the target privacy paragraphs present in the desensitized medical record text and the distribution characteristics of each of them within that text.
In the embodiments of the present invention, a medical record text to be desensitized or a desensitized medical record text may contain several target privacy paragraphs, and each target privacy paragraph may belong to a different privacy paragraph identifier.
In one possible design, a debugged privacy paragraph mining machine learning algorithm mines the medical record text to be desensitized or the desensitized medical record text to extract each target privacy paragraph and its distribution characteristics.
STEP32, based on the first privacy paragraph text and the second privacy paragraph text, determine the privacy paragraph identifier to which each text unit in the medical record text to be desensitized belongs and the privacy paragraph identifier to which each text unit in the desensitized medical record text belongs.
In some examples, once the distribution characteristics of each target privacy paragraph have been mined from the medical record text to be desensitized or the desensitized medical record text, each text unit in that text has in effect been assigned its own identifier field.
For example, if target privacy paragraphs with three different privacy paragraph identifiers Q, W, and E are mined from the medical record text to be desensitized, then any given text unit in that text belongs to identifier Q, identifier W, or identifier E, or possibly to some other identifier.
STEP33, decompose the medical record text to be desensitized into X first text sets and decompose the desensitized medical record text into X second text sets, where the X first text sets correspond one-to-one to the X second text sets and X is an integer not less than 1.
In one possible design, the medical record text to be desensitized or the desensitized medical record text is decomposed into X text sets according to a given decomposition rule. The embodiments of the present invention place no restriction on the decomposition method or on the number of text sets obtained, but the same decomposition method must be used for the medical record text to be desensitized and for the desensitized medical record text, so that the text sets of the former correspond one-to-one to the text sets of the latter.
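One simple decomposition rule is sketched below: split the sequence of text units into X contiguous, nearly equal sets. The rule and the function name are assumptions chosen for illustration, since the text deliberately leaves the decomposition method open.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def split_into_sets(units: Sequence[T], x: int) -> List[List[T]]:
    """Split text units into x contiguous sets whose sizes differ by at most one."""
    if x < 1:
        raise ValueError("x must be an integer not less than 1")
    size, extra = divmod(len(units), x)
    sets: List[List[T]] = []
    start = 0
    for i in range(x):
        # The first `extra` sets receive one additional unit.
        end = start + size + (1 if i < extra else 0)
        sets.append(list(units[start:end]))
        start = end
    return sets


# Applying the same rule to both texts keeps the p-th first text set
# aligned with the p-th second text set.
first_sets = split_into_sets(list("abcdefg"), 3)
second_sets = split_into_sets(list("ABCDEFG"), 3)
```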
STEP34, based on the privacy paragraph identifier to which each text unit in the medical record text to be desensitized belongs, determine the first privacy paragraph identifier to which each text unit in each first text set belongs, and use this as the privacy paragraph parsing result of the medical record text to be desensitized.
STEP35, based on the privacy paragraph identifier to which each text unit in the desensitized medical record text belongs, determine the second privacy paragraph identifier to which each text unit in each second text set belongs, and use this as the privacy paragraph parsing result of the desensitized medical record text.
In one possible design, once the first privacy paragraph identifier of each text unit in each first text set and the second privacy paragraph identifier of each text unit in each second text set have been obtained, the privacy paragraph parsing results of the medical record text to be desensitized and of the desensitized medical record text can be determined from those identifiers.
In one possible design, the first privacy paragraph identifiers in each first text set and the second privacy paragraph identifiers in each second text set are used to compute a text unit status identifier commonality score between each first text set and its corresponding second text set. The privacy paragraph commonality score between the medical record text to be desensitized and the desensitized medical record text is then determined from these per-set scores, for example by averaging them.
The text unit status identifier of a text unit is the privacy paragraph identifier field assigned to that unit. The status identifier commonality score between a first text set and its corresponding second text set can be determined as follows: count the text units whose status is the same in the first text set and the second text set; compute the ratio of that count to the total number of text units in the first text set (or the second text set); and use that ratio as the text unit status identifier commonality score of the two sets.
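The ratio and the averaging step can be sketched as follows, assuming that each text set is represented as a list of per-unit identifier labels (such as "Q", "W", "E", or "O" for other), that units are compared position by position, and that the first text set's size is the denominator. These representation choices and the function names are hypothetical.

```python
from typing import List, Sequence


def set_status_commonality(first_labels: Sequence[str],
                           second_labels: Sequence[str]) -> float:
    """Share of text units in the first set whose identifier matches the second set."""
    if not first_labels:
        return 0.0
    # zip stops at the shorter set, so unmatched trailing units count as different.
    same = sum(1 for a, b in zip(first_labels, second_labels) if a == b)
    return same / len(first_labels)


def privacy_paragraph_commonality(first_sets: Sequence[Sequence[str]],
                                  second_sets: Sequence[Sequence[str]]) -> float:
    """Average the per-set status commonality over the X aligned text sets."""
    if len(first_sets) != len(second_sets) or not first_sets:
        raise ValueError("need the same non-zero number of first and second text sets")
    scores: List[float] = [set_status_commonality(f, s)
                           for f, s in zip(first_sets, second_sets)]
    return sum(scores) / len(scores)


# Made-up labels for X = 2 text sets of each text.
first_sets = [["Q", "Q", "O"], ["W", "O"]]
second_sets = [["Q", "O", "O"], ["W", "W"]]
overall = privacy_paragraph_commonality(first_sets, second_sets)
```

Here the per-set scores are 2/3 and 1/2, so the averaged privacy paragraph commonality score is 7/12.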
在另一些可能的设计思路下,上述隐私段落共性评分的确定可包括如下内容。Under some other possible design ideas, the determination of the commonality score of the above-mentioned privacy paragraphs may include the following content.
STEP41,基于各个第一文本集中各个文本单元所属的第一隐私段落标识和各个第二文本集中各个文本单元所属的第二隐私段落标识,确定各个第一文本集与对应的第二文本集之间的隐私段落标识共性评分。STEP 41, based on the first privacy paragraph identifiers belonging to each text unit in each first text set and the second privacy paragraph identifiers belonging to each text unit in each second text set, determine the privacy paragraph identifier commonality score between each first text set and the corresponding second text set.
在一种可能的设计思路下,如果知晓各个第一文本集中各个文本单元所属的第一隐私段落标识,则能够统计各个第一文本集中包括的隐私段落标识数量。类似地,如果知晓各个第二文本集中各个文本单元所属的第二隐私段落标识,则能够统计各个第二文本集中包括的隐私段落标识数量。In a possible design concept, if the first privacy paragraph identifier to which each text unit in each first text set belongs is known, the number of privacy paragraph identifiers included in each first text set can be counted. Similarly, if the second privacy paragraph identifier to which each text unit in each second text set belongs is known, the number of privacy paragraph identifiers included in each second text set can be counted.
Generally, because the u-th medical record text to be desensitized and the v-th desensitized medical record text are decomposed in the same way, the first text sets of the former correspond one-to-one with the second text sets of the latter. The p-th first text set of the u-th medical record text to be desensitized therefore corresponds to the p-th second text set of the v-th desensitized medical record text.
In some cases, the privacy paragraph identifier commonality score of a first target text set and a second target text set can be determined as follows: determine the number of privacy paragraph identifiers that the first target privacy paragraph identifiers and the second target privacy paragraph identifiers have in common; determine the privacy paragraph identifier statistic of the first target privacy paragraph identifiers and the second target privacy paragraph identifiers; and determine the privacy paragraph identifier commonality score of the two target text sets from the number of shared identifiers and that statistic.
在一种可能的设计思路下,可以基于第一目标文本集与第二目标文本集的隐私段落标识共性评分确定方法,确定各个第一文本集与对应的第二文本集的隐私段落标识共性评分。In a possible design idea, the commonality score of the privacy paragraph identifiers of each first text set and the corresponding second text set can be determined based on the method for determining the commonality score of the privacy paragraph identifiers of the first target text set and the second target text set.
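One way to read the identifier commonality score above is as a set-overlap ratio. The sketch below assumes that the "privacy paragraph identifier statistic" is the number of distinct identifiers across both text sets, which makes the score a Jaccard index; the source does not fix this choice, so treat it as one plausible instantiation. The function name and labels are hypothetical.

```python
from typing import Iterable


def identifier_set_score(first_ids: Iterable[str], second_ids: Iterable[str]) -> float:
    """Shared identifiers divided by all distinct identifiers (Jaccard index)."""
    a, b = set(first_ids), set(second_ids)
    union = a | b
    if not union:
        # Assumption: two sets with no identifiers at all count as identical.
        return 1.0
    return len(a & b) / len(union)


# Hypothetical identifiers attached to the text units of one pair of text sets.
pair_score = identifier_set_score(["NAME", "ID", "O"], ["NAME", "ADDR", "O"])
```

In this example two identifiers are shared out of four distinct ones. Applying the same function to every pair of corresponding text sets yields the per-set scores that STEP42 then aggregates.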
STEP42,基于各个第一文本集与对应的第二文本集之间的隐私段落标识共性评分,确定待脱敏病患病历文本与病患病历脱敏文本的隐私段落主题共性评分。STEP 42, based on the privacy paragraph identification commonality scores between each first text set and the corresponding second text set, determine the privacy paragraph topic commonality scores of the patient's medical record text to be desensitized and the patient's medical record desensitized text.
在一种可能的设计思路下,可以对各个第一文本集与对应的第二文本集的隐私段落标识共性评分进行一系列运算处理来确定待脱敏病患病历文本与病患病历脱敏文本的隐私段落主题共性评分。Under one possible design idea, a series of operations can be performed on the commonality scores of the privacy paragraph identification of each first text set and the corresponding second text set to determine the commonality scores of the privacy paragraph topics of the medical record text to be desensitized and the desensitized medical record text of the patient.
STEP43,基于待脱敏病患病历文本和病患病历脱敏文本的隐私段落主题共性评分,确定待脱敏病患病历文本与病患病历脱敏文本的隐私段落共性评分。STEP43, based on the commonality scores of the topics of the private paragraphs in the medical records to be desensitized and the desensitized texts of the patients' medical records, determine the commonality scores of the private paragraphs in the medical records to be desensitized and the desensitized texts of the patients' medical records.
在一种可能的设计思路下,可以将待脱敏病患病历文本与病患病历脱敏文本的隐私段落主题共性评分,作为待脱敏病患病历文本与病患病历脱敏文本的隐私段落共性评分。比如,将待脱敏病患病历文本与病患病历脱敏文本的隐私段落主题共性评分作为待脱敏病患病历文本与病患病历脱敏文本之间的隐私段落共性评分列表。In a possible design idea, the privacy paragraph theme commonality score of the medical record text to be desensitized and the patient medical record desensitized text can be used as the privacy paragraph commonality score of the medical record text to be desensitized and the patient medical record desensitized text. For example, the privacy paragraph theme commonality score of the medical record text to be desensitized and the patient medical record desensitized text is used as the privacy paragraph commonality score list between the medical record text to be desensitized and the patient medical record desensitized text.
In one possible design, the first privacy paragraph text mined from the medical record text to be desensitized may include first statistical data for the target privacy paragraphs in that text, and the second privacy paragraph text mined from the desensitized medical record text may include second statistical data for the target privacy paragraphs in the desensitized medical record text.
在另一些设计思路下,上述隐私段落共性评分的确定可包含如下内容。Under other design ideas, the determination of the commonality score of the above-mentioned privacy paragraphs may include the following content.
STEP51,基于各个第一文本集中各个文本单元所属的第一隐私段落标识和各个第二文本集中各个文本单元所属的第二隐私段落标识,确定各个第一文本集与对应的第二文本集之间的隐私段落标识共性评分。STEP 51, based on the first privacy paragraph identifiers belonging to each text unit in each first text set and the second privacy paragraph identifiers belonging to each text unit in each second text set, determine the privacy paragraph identifier commonality score between each first text set and the corresponding second text set.
STEP52,基于各个第一文本集与对应的第二文本集之间的隐私段落标识共性评分,确定待脱敏病患病历文本与病患病历脱敏文本的隐私段落主题共性评分。STEP52, based on the privacy paragraph identification commonality scores between each first text set and the corresponding second text set, determine the privacy paragraph topic commonality scores of the patient's medical record text to be desensitized and the patient's medical record desensitized text.
STEP53,基于第一统计数据和第二统计数据,确定待脱敏病患病历文本和病患病历脱敏文本中目标隐私段落的数目求和结果与数目求差结果。STEP53, based on the first statistical data and the second statistical data, determine the sum and difference of the number of target privacy paragraphs in the medical record text to be desensitized and the desensitized medical record text of the patient.
STEP54,基于待脱敏病患病历文本和病患病历脱敏文本中目标隐私段落的数目求和结果与数目求差结果,确定待脱敏病患病历文本与病患病历脱敏文本的隐私段落数目共性评分。STEP54, based on the sum and difference of the number of target private paragraphs in the medical records to be desensitized and the desensitized text of the patient's medical records, determine the commonality score of the number of private paragraphs in the medical records to be desensitized and the desensitized text of the patient's medical records.
STEP55,基于待脱敏病患病历文本与病患病历脱敏文本的隐私段落数目共性评分和隐私段落主题共性评分,确定待脱敏病患病历文本与病患病历脱敏文本的隐私段落共性评分。STEP55, based on the commonality score of the number of private paragraphs between the medical records to be desensitized and the desensitized text of the patient's medical records and the commonality score of the topics of private paragraphs, determine the commonality score of private paragraphs between the medical records to be desensitized and the desensitized text of the patient's medical records.
在一种可能的设计思路下,可以将待脱敏病患病历文本与病患病历脱敏文本的隐私段落主题共性评分、隐私段落数目共性评分进行乘法运算或者求和运算,以确定待脱敏病患病历文本与病患病历脱敏文本的隐私段落共性评分。比如,可以将待脱敏病患病历文本与病患病历脱敏文本的隐私段落主题共性评分列表与隐私段落数目共性评分列表进行乘法运算或者求和运算获得待脱敏病患病历文本与病患病历脱敏文本之间的隐私段落共性评分列表等。In one possible design idea, the privacy paragraph theme commonality score and the privacy paragraph number commonality score of the medical record text to be desensitized and the patient medical record desensitized text can be multiplied or summed to determine the privacy paragraph commonality score of the medical record text to be desensitized and the patient medical record desensitized text. For example, the privacy paragraph theme commonality score list of the medical record text to be desensitized and the patient medical record desensitized text and the privacy paragraph number commonality score list can be multiplied or summed to obtain the privacy paragraph commonality score list between the medical record text to be desensitized and the patient medical record desensitized text.
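STEP53 to STEP55 can be illustrated with a short sketch. The source says only that the number commonality score is derived from the sum and the difference of the two paragraph counts; the formula `1 - |n1 - n2| / (n1 + n2)` used here is an assumption chosen so that equal counts give 1 and fully disjoint counts give 0. The function names and the `mode` parameter are likewise hypothetical.

```python
def count_commonality(n_first: int, n_second: int) -> float:
    """Map the count sum and count difference to a score in [0, 1]."""
    total = n_first + n_second          # sum of target privacy paragraph counts
    diff = abs(n_first - n_second)      # difference of the two counts
    if total == 0:
        # Assumption: neither text contains target privacy paragraphs.
        return 1.0
    return 1.0 - diff / total


def combine_scores(topic_score: float, count_score: float, mode: str = "product") -> float:
    """Combine the topic and count scores by multiplication or by summation."""
    if mode == "product":
        return topic_score * count_score
    if mode == "sum":
        return topic_score + count_score
    raise ValueError("mode must be 'product' or 'sum'")


count_score = count_commonality(3, 5)              # 1 - 2/8
paragraph_score = combine_scores(0.8, count_score)  # product variant
```

The product variant keeps the combined score within [0, 1] when both inputs are; the sum variant does not, so a caller comparing candidates only needs to use one variant consistently.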
本发明实施例既能通过病患病历文本的语义共性评分保障目标脱敏文本与待脱敏病患病历文本全局类似性;又能保障目标脱敏文本与待脱敏病患病历文本中的目标隐私段落数目一致,以使得通过目标脱敏文本对待脱敏病患病历文本进行数据匿名/脱敏时,可以保障目标隐私段落的数据匿名/脱敏质量和效率;还可以通过隐私段落主题共性评分保障目标脱敏文本在与第一文本集对应的分布区域,存在与第一文本集类似的隐私段落主题。The embodiments of the present invention can not only ensure the global similarity between the target desensitized text and the medical record text to be desensitized through the semantic commonality scoring of the medical record text; it can also ensure that the number of target privacy paragraphs in the target desensitized text and the medical record text to be desensitized is consistent, so that when the medical record text to be desensitized is anonymized/desensitized by the target desensitized text, the data anonymity/desensitization quality and efficiency of the target privacy paragraphs can be guaranteed; it can also ensure that the target desensitized text has privacy paragraph themes similar to the first text set in the distribution area corresponding to the first text set through the privacy paragraph theme commonality scoring.
通过本发明实施例,既能通过病患病历文本的语义共性评分保障目标脱敏文本与待脱敏病患病历文本全局类似性,又能保障目标脱敏文本与待脱敏病患病历文本中的目标隐私段落数目一致;还可以保障目标脱敏文本与目标隐私段落中的隐私段落分布一致性,即确保了目标脱敏文本与待脱敏病患病历文本在对应分布区域中的隐私段落主题的一致性。Through the embodiments of the present invention, not only can the global similarity of the target desensitized text and the medical record text to be desensitized be ensured through the semantic commonality scoring of the medical record text, but also the consistency of the target number of private paragraphs in the target desensitized text and the medical record text to be desensitized can be ensured; the consistency of the distribution of private paragraphs in the target desensitized text and the target private paragraphs can also be ensured, that is, the consistency of the private paragraph themes in the target desensitized text and the medical record text to be desensitized in the corresponding distribution areas is ensured.
In one possible design, the medical record text to be desensitized includes a first medical record text to be desensitized and a second medical record text to be desensitized. The text commonality score between the first medical record text to be desensitized and the desensitized medical record text is the first text commonality score score1v, and the text commonality score between the second medical record text to be desensitized and the desensitized medical record text is the second text commonality score score2v, where v is an integer greater than or equal to 1 and less than or equal to the number of desensitized medical record texts.
进一步地,上述目标脱敏文本的确定可包括如下内容。Furthermore, the determination of the above-mentioned target desensitized text may include the following contents.
STEP61, obtain the text semantic features of the first medical record text to be desensitized and of the second medical record text to be desensitized.
In one possible design, a target machine learning algorithm can be used to mine text vectors from the first and second medical record texts to be desensitized, yielding one text semantic feature that characterizes the first medical record text to be desensitized as a whole and one that characterizes the second medical record text to be desensitized as a whole.
The target machine learning algorithm can be, for example, a deep neural network (DNN) or a residual network.
STEP62, based on the text semantic features of the first medical record text to be desensitized and the second medical record text to be desensitized, determine the second semantic commonality score between the two texts.
在一种可能的设计思路下,可以确定第一待脱敏病患病历文本的文本语义特征与第二待脱敏病患病历文本的文本语义特征之间的共性评分作为该第二语义共性评分。In a possible design concept, a commonality score between the text semantic features of the first medical record text to be desensitized and the text semantic features of the second medical record text to be desensitized can be determined as the second semantic commonality score.
其中,可以通过余弦相似度计算第一待脱敏病患病历文本的文本语义特征与第二待脱敏病患病历文本的文本语义特征之间的共性评分。Among them, the commonality score between the text semantic features of the first medical record text to be desensitized and the text semantic features of the second medical record text to be desensitized can be calculated by cosine similarity.
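The cosine similarity mentioned above can be computed as in the following sketch. The feature vectors are made-up placeholders standing in for the outputs of whatever model extracts the text semantic features, and returning 0.0 for a zero vector is an assumption made to avoid division by zero.

```python
import math
from typing import Sequence


def cosine_similarity(x: Sequence[float], y: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length feature vectors."""
    if len(x) != len(y):
        raise ValueError("feature vectors must have the same dimension")
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0.0 or norm_y == 0.0:
        return 0.0  # assumption: a zero vector is treated as sharing nothing
    return dot / (norm_x * norm_y)


# Placeholder semantic features of the first and second texts to be desensitized.
first_feature = [1.0, 2.0, 3.0]
second_feature = [2.0, 4.0, 6.0]
second_semantic_score = cosine_similarity(first_feature, second_feature)
```

Because the second placeholder vector is a scaled copy of the first, the two point in the same direction and the score is 1 up to floating-point rounding.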
在一种可能的设计思路下,在获得第一待脱敏病患病历文本与第二待脱敏病患病历文本的语义共性评分之后,可以基于第一文本共性评分、第二文本共性评分以及第二语义共性评分,从病患病历脱敏文本中确定第一待脱敏病患病历文本的目标脱敏文本。Under one possible design idea, after obtaining the semantic commonality scores of the first medical record text to be desensitized and the second medical record text to be desensitized, the target desensitized text of the first medical record text to be desensitized can be determined from the desensitized texts of the medical records based on the first text commonality score, the second text commonality score and the second semantic commonality score.
在一种可能的设计思路下,可以采用STEP63-STEP65所示方法,以便基于第一文本共性评分、第二文本共性评分以及第二语义共性评分,从病患病历脱敏文本中确定第一待脱敏病患病历文本的目标脱敏文本。Under one possible design idea, the method shown in STEP63-STEP65 can be used to determine the target desensitized text of the first medical record text to be desensitized from the desensitized text of the medical record based on the first text commonality score, the second text commonality score and the second semantic commonality score.
STEP63,基于第二文本共性评分和第二语义共性评分确定第二待脱敏病患病历文本对第一待脱敏病患病历文本的贡献信息。STEP 63: Determine contribution information of the second medical record text to be desensitized to the first medical record text to be desensitized based on the second text commonality score and the second semantic commonality score.
在一种可能的设计思路下,可以将第二待脱敏病患病历文本与第一待脱敏病患病历文本的第二语义共性评分与第二待脱敏病患病历文本与病患病历脱敏文本的第二文本共性评分进行乘法运算或者相加,以确定第二待脱敏病患病历文本相对于第一待脱敏病患病历文本的贡献信息。Under one possible design idea, the second semantic commonality score between the second medical record text to be desensitized and the first medical record text to be desensitized can be multiplied or added with the second text commonality score between the second medical record text to be desensitized and the desensitized medical record text to determine the contribution information of the second medical record text to be desensitized relative to the first medical record text to be desensitized.
STEP64,基于第二待脱敏病患病历文本对第一待脱敏病患病历文本的贡献信息、第一文本共性评分,确定第一待脱敏病患病历文本与病患病历脱敏文本的相关性。STEP64, based on the contribution information of the second medical record text to be desensitized to the first medical record text to be desensitized and the commonality score of the first text, determine the correlation between the first medical record text to be desensitized and the desensitized text of the patient's medical record.
在一种可能的设计思路下,可以将第二待脱敏病患病历文本对第一待脱敏病患病历文本的贡献信息与第一待脱敏病患病历文本与病患病历脱敏文本的第一文本共性评分求和,以确定第一待脱敏病患病历文本与病患病历脱敏文本的相关性。Under one possible design idea, the contribution information of the second patient medical record text to the first patient medical record text to be desensitized can be summed with the first text commonality score of the first patient medical record text to be desensitized and the patient's desensitized medical record text to determine the correlation between the first patient medical record text to be desensitized and the patient's desensitized medical record text.
STEP65,基于第一待脱敏病患病历文本与病患病历脱敏文本的相关性,从病患病历脱敏文本中确定第一待脱敏病患病历文本的目标脱敏文本。STEP65, based on the correlation between the first patient medical record text to be desensitized and the patient medical record desensitized text, determine the target desensitized text of the first patient medical record text to be desensitized from the patient medical record desensitized text.
在一种可能的设计思路下,可以通过本发明实施例方法确定第一待脱敏病患病历文本与多个病患病历脱敏文本的相关性,然后从多个相关性中确定最大相关性对应的病患病历脱敏文本为第一待脱敏病患病历文本的目标脱敏文本。Under one possible design idea, the correlation between the first patient's medical record text to be desensitized and multiple patient's medical record desensitized texts can be determined through the method of the embodiment of the present invention, and then the patient's medical record desensitized text corresponding to the maximum correlation is determined from the multiple correlations as the target desensitized text of the first patient's medical record text to be desensitized.
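STEP63 to STEP65 can be sketched as a small selection routine. It follows the options named in the text (contribution by multiplication or addition, correlation by summation, target by maximum correlation); the function name, the list-based inputs, and the example numbers are assumptions for illustration. Indices here are 0-based, whereas v in the text starts at 1.

```python
from typing import List, Sequence, Tuple


def select_target(
    score1: Sequence[float],    # score1[v]: first text vs. desensitized text v
    score2: Sequence[float],    # score2[v]: second text vs. desensitized text v
    second_semantic: float,     # second semantic commonality score (first vs. second text)
    mode: str = "product",
) -> Tuple[int, List[float]]:
    """Return the index of the best desensitized text and all correlations."""
    if len(score1) != len(score2) or not score1:
        raise ValueError("score lists must be non-empty and of equal length")
    correlations = []
    for s1, s2 in zip(score1, score2):
        # STEP63: contribution of the second text to the first text.
        contribution = second_semantic * s2 if mode == "product" else second_semantic + s2
        # STEP64: correlation = contribution + first text commonality score.
        correlations.append(s1 + contribution)
    # STEP65: the desensitized text with the largest correlation is the target.
    best = max(range(len(correlations)), key=correlations.__getitem__)
    return best, correlations


best_index, correlations = select_target(
    score1=[0.60, 0.55, 0.40],
    score2=[0.20, 0.90, 0.30],
    second_semantic=0.8,
)
```

In this example the first candidate has the highest first text commonality score, but the second candidate wins once the contribution of the closely related second text is added.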
The correlation between the first medical record text to be desensitized and a desensitized medical record text therefore reflects not only the text features shared by those two texts, but also the influence of the second medical record text to be desensitized on the first, according to the degree of commonality between them.
在一种可能的设计思路下,当通过上述方法确定了待脱敏病患病历文本与各个病患病历脱敏文本的病患病历文本相关性之后,可以将病患病历文本相关性最大值对应的病患病历脱敏文本作为待脱敏病患病历文本的目标脱敏文本。Under one possible design idea, after the patient medical record text correlation between the patient medical record text to be desensitized and the desensitized texts of each patient medical record is determined by the above method, the patient medical record desensitized text corresponding to the maximum patient medical record text correlation can be used as the target desensitized text of the patient medical record text to be desensitized.
本发明实施例所提供的另一种结合人工智能的病患信息脱敏处理方法可包含如下内容。Another patient information desensitization processing method combined with artificial intelligence provided by an embodiment of the present invention may include the following content.
STEP71,获取第一待脱敏病患病历文本、第二待脱敏病患病历文本和病患病历脱敏文本。STEP 71, obtain the medical record text of the first patient to be desensitized, the medical record text of the second patient to be desensitized, and the desensitized text of the patient's medical record.
STEP72,获取第一待脱敏病患病历文本与病患病历脱敏文本之间的第一文本共性评分。STEP 72, obtain the first text commonality score between the first patient medical record text to be desensitized and the patient medical record desensitized text.
第一文本共性评分可以是第一待脱敏病患病历文本与病患病历脱敏文本之间的语义共性评分,也可以是基于第一待脱敏病患病历文本与病患病历脱敏文本之间的语义共性评分确定的其它与病患病历文本内容相关的共性评分,比如可以是基于第一待脱敏病患病历文本与病患病历脱敏文本之间的隐私段落共性评分和语义共性评分确定的共性评分。The first text commonality score can be a semantic commonality score between the first medical record text to be desensitized and the desensitized text of the patient's medical record, or it can be other commonality scores related to the content of the patient's medical record text determined based on the semantic commonality score between the first medical record text to be desensitized and the desensitized text of the patient's medical record, for example, it can be a commonality score determined based on the privacy paragraph commonality score and the semantic commonality score between the first medical record text to be desensitized and the desensitized text of the patient's medical record.
可通过如下方式确定第一待脱敏病患病历文本与病患病历脱敏文本之间的语义共性评分:确定第一待脱敏病患病历文本的文本语义特征和病患病历脱敏文本的文本语义特征;基于第一待脱敏病患病历文本的文本语义特征和病患病历脱敏文本的文本语义特征,确定第一待脱敏病患病历文本和病患病历脱敏文本的语义共性评分。The semantic commonality score between the first medical record text to be desensitized and the desensitized text of the patient's medical record can be determined in the following manner: determine the text semantic features of the first medical record text to be desensitized and the text semantic features of the desensitized text of the patient's medical record; determine the semantic commonality score between the first medical record text to be desensitized and the desensitized text of the patient's medical record based on the text semantic features of the first medical record text to be desensitized and the text semantic features of the desensitized text of the patient's medical record.
The privacy paragraph commonality score between the first medical record text to be desensitized and the desensitized medical record text can be determined as follows: mine the first privacy paragraph text from the first medical record text to be desensitized, and mine the second privacy paragraph text from the desensitized medical record text; determine, from the first privacy paragraph text, the first statistical data of the target privacy paragraphs in the first medical record text to be desensitized; determine, from the second privacy paragraph text, the second statistical data of the target privacy paragraphs in the desensitized medical record text; determine, from the first statistical data and the second statistical data, the sum and the difference of the numbers of target privacy paragraphs in the two texts; determine, from that sum and difference, the privacy paragraph number commonality score of the two texts; and determine, from the privacy paragraph number commonality score, the privacy paragraph commonality score of the first medical record text to be desensitized and the desensitized medical record text.
还可通过如下方式确定第一待脱敏病患病历文本与病患病历脱敏文本之间的隐私段落共性评分:基于各个第一文本集中各个文本单元所属的第一隐私段落标识和各个第二文本集中各个文本单元所属的第二隐私段落标识,确定各个第一文本集与对应的第二文本集之间的隐私段落标识共性评分;基于各个第一文本集与对应的第二文本集之间的隐私段落标识共性评分,确定第一待脱敏病患病历文本与病患病历脱敏文本的隐私段落主题共性评分;基于第一待脱敏病患病历文本和病患病历脱敏文本的隐私段落主题共性评分,确定第一待脱敏病患病历文本与病患病历脱敏文本的隐私段落共性评分。The privacy paragraph commonality score between the first medical record text to be desensitized and the patient medical record desensitized text can also be determined in the following manner: based on the first privacy paragraph identifier belonging to each text unit in each first text set and the second privacy paragraph identifier belonging to each text unit in each second text set, determine the privacy paragraph identifier commonality score between each first text set and the corresponding second text set; based on the privacy paragraph identifier commonality score between each first text set and the corresponding second text set, determine the privacy paragraph topic commonality score between the first medical record text to be desensitized and the patient medical record desensitized text; based on the privacy paragraph topic commonality score between the first medical record text to be desensitized and the patient medical record desensitized text, determine the privacy paragraph commonality score between the first medical record text to be desensitized and the patient medical record desensitized text.
The privacy paragraph commonality score between the first medical record text to be desensitized and the desensitized medical record text can also be determined as follows: based on the first privacy paragraph identifier to which each text unit in each first text set belongs and the second privacy paragraph identifier to which each text unit in each second text set belongs, determine the privacy paragraph identifier commonality score between each first text set and the corresponding second text set; based on these identifier commonality scores, determine the privacy paragraph topic commonality score of the first medical record text to be desensitized and the desensitized medical record text; based on the first statistical data and the second statistical data, determine the sum and the difference of the numbers of target privacy paragraphs in the two texts; based on that sum and difference, determine the privacy paragraph number commonality score of the two texts; and based on the privacy paragraph number commonality score and the privacy paragraph topic commonality score, determine the privacy paragraph commonality score of the first medical record text to be desensitized and the desensitized medical record text.
STEP73,获取第二待脱敏病患病历文本与病患病历脱敏文本之间的第二文本共性评分。STEP73, obtain the second text commonality score between the second patient's medical record text to be desensitized and the patient's medical record desensitization text.
第二文本共性评分可以是第二待脱敏病患病历文本与病患病历脱敏文本之间的语义共性评分,也可以是基于第二待脱敏病患病历文本与病患病历脱敏文本之间的语义共性评分确定的其它与病患病历文本内容相关的共性评分,比如可以是基于第二待脱敏病患病历文本与病患病历脱敏文本之间的隐私段落共性评分和语义共性评分确定的共性评分。The second text commonality score can be a semantic commonality score between the second medical record text to be desensitized and the desensitized text of the patient's medical record, or it can be other commonality scores related to the content of the patient's medical record text determined based on the semantic commonality score between the second medical record text to be desensitized and the desensitized text of the patient's medical record, for example, it can be a commonality score determined based on the privacy paragraph commonality score and the semantic commonality score between the second medical record text to be desensitized and the desensitized text of the patient's medical record.
第二待脱敏病患病历文本与病患病历脱敏文本之间的隐私段落共性评分和语义共性评分确定方法与第一待脱敏病患病历文本与病患病历脱敏文本之间的隐私段落共性评分和语义共性评分确定方法类似。The method for determining the privacy paragraph commonality score and semantic commonality score between the second medical record text to be desensitized and the patient's medical record desensitized text is similar to the method for determining the privacy paragraph commonality score and semantic commonality score between the first medical record text to be desensitized and the patient's medical record desensitized text.
STEP74, obtain the text semantic features of the first medical record text to be desensitized and of the second medical record text to be desensitized.
STEP75, based on the text semantic features of the first medical record text to be desensitized and the second medical record text to be desensitized, determine the second semantic commonality score between the two texts.
STEP76,基于第一文本共性评分、第二文本共性评分以及第二语义共性评分,从病患病历脱敏文本中确定第一待脱敏病患病历文本的目标脱敏文本,以便基于目标脱敏文本对第一待脱敏病患病历文本进行隐私脱敏保护。STEP76, based on the first text commonality score, the second text commonality score and the second semantic commonality score, determine the target desensitized text of the first medical record text to be desensitized from the desensitized text of the medical record, so as to perform privacy desensitization protection on the first medical record text to be desensitized based on the target desensitized text.
本发明实施例,不仅通过待脱敏病患病历文本与病患病历脱敏文本的文本语义特征确保了目标脱敏文本与待脱敏病患病历文本全局类似性;还通过待脱敏病患病历文本与病患病历脱敏文本之间的隐私段落共性评分确保了目标脱敏文本中的隐私段落与待脱敏病患病历文本中的隐私段落相似,即使得了目标脱敏文本与待脱敏病患病历文本中的隐私段落数目、隐私段落分布、隐私段落主题等相似。这样,本发明实施例不仅可以保障目标脱敏文本与待脱敏病患病历文本在整体层面的文本布局类似性,还可以保障目标脱敏文本与待脱敏病患病历文本中的隐私段落信息类似性,显著提升了目标脱敏文本与待脱敏病患病历文本之间的匹配性。The embodiment of the present invention not only ensures the global similarity between the target desensitized text and the medical record text to be desensitized through the text semantic features of the medical record text to be desensitized and the desensitized text of the patient's medical record; it also ensures that the privacy paragraphs in the target desensitized text are similar to the privacy paragraphs in the medical record text to be desensitized through the commonality score of the privacy paragraphs between the medical record text to be desensitized and the desensitized text of the patient's medical record, that is, the number of privacy paragraphs, privacy paragraph distribution, privacy paragraph themes, etc. in the target desensitized text and the medical record text to be desensitized are similar. In this way, the embodiment of the present invention can not only ensure the similarity of the text layout of the target desensitized text and the medical record text to be desensitized at the overall level, but also ensure the similarity of the privacy paragraph information in the target desensitized text and the medical record text to be desensitized, which significantly improves the matching between the target desensitized text and the medical record text to be desensitized.
In some independently implementable designs, after the target desensitized text has been determined from the desensitized medical record texts based on the text commonality score between the medical record text to be desensitized and the desensitized medical record texts, so that privacy desensitization protection can be applied to the medical record text to be desensitized based on the target desensitized text, the method further includes STEP18.
STEP18,响应于针对所述待脱敏病患病历文本的已脱敏病患病历文本的共享请求,对目标数字化医疗服务器进行风险检测,在所述目标数字化医疗服务器通过所述风险检测的基础上,将所述已脱敏病患病历文本共享给所述目标数字化医疗服务器。STEP18, in response to the sharing request of the desensitized medical record text for the medical record text to be desensitized, perform risk detection on the target digital medical server, and share the desensitized medical record text with the target digital medical server on the basis that the target digital medical server passes the risk detection.
可见,在共享已脱敏病患病历文本之前,还会对目标数字化医疗服务器进行针对性的风险检测,从而保障已脱敏病患病历文本共享的安全性。It can be seen that before sharing the desensitized medical records, targeted risk detection will be carried out on the target digital medical server to ensure the security of sharing the desensitized medical records.
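The gating behaviour of STEP18 can be illustrated with a short sketch. It assumes only that the risk check and the transmission are supplied as callables; `handle_share_request`, `passes_risk_check`, and `send` are hypothetical names, and the patent does not specify what happens when the check fails beyond not sharing.

```python
from typing import Callable


def handle_share_request(
    desensitized_text: str,
    server_id: str,
    passes_risk_check: Callable[[str], bool],   # hypothetical: risk detection on the target server
    send: Callable[[str, str], None],           # hypothetical: transmits the text to the server
) -> bool:
    """Share the desensitized record only if the target server passes risk detection.

    Returns True if the text was shared, False if sharing was withheld.
    """
    if not passes_risk_check(server_id):
        return False  # server failed risk detection: nothing is sent
    send(server_id, desensitized_text)
    return True
```

Passing the check as a parameter keeps the gate independent of how the risk detection index is actually computed.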
In some independently implementable designs, the risk detection on the target digital medical server described above includes the following.
STEP181: obtain the data risk detection log of the target digital medical server, load the data risk detection log into the risk assessment strategy, and extract the risk detection description vector of the log in the risk assessment strategy.
Here, the risk assessment strategy may be a decision tree model used for risk identification.
STEP182: input the risk detection description vector into both the risk decision module and the state deduction module of the risk assessment strategy. The risk decision module contains first execution configuration data oriented toward risk category decisions, and the state deduction module contains second execution configuration data oriented toward session state deduction.
Here, the execution configuration data guide the corresponding module in feature processing.
STEP183: in the risk decision module, process the risk detection description vector with the first execution configuration data to obtain a category decision vector relationship network.
STEP184: in the state deduction module, process the risk detection description vector with the second execution configuration data to obtain a state deduction vector relationship network.
STEP185: determine the risk detection index of the target digital medical server from the category decision vector relationship network and the state deduction vector relationship network. If the risk detection index does not exceed the set detection index, determine that the target digital medical server passes the risk detection; otherwise, determine that it does not pass.
In the embodiment of the present invention, performing a feature operation (for example, weighting) on the category decision vector relationship network and the state deduction vector relationship network allows the risk detection index of the target digital medical server to be calculated from two perspectives, risk category and risk state, so that a quantitative risk detection judgment can be made based on the index.
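STEP182 through STEP185 can be sketched under strong simplifying assumptions. The patent does not define the form of the execution configuration data or of the vector relationship networks; here each configuration is assumed to be a weight matrix, each relationship network is reduced to a plain output vector, and the feature operation is assumed to be a weighted sum of the two branch means. All names (`apply_config`, `risk_index`, `passes_risk_detection`, `w_category`, `w_state`) are hypothetical.

```python
from typing import List, Sequence

Matrix = Sequence[Sequence[float]]


def apply_config(config: Matrix, vec: Sequence[float]) -> List[float]:
    """Process the description vector with one module's configuration
    (assumed here to be a matrix-vector product)."""
    return [sum(w * x for w, x in zip(row, vec)) for row in config]


def risk_index(
    desc_vec: Sequence[float],
    category_config: Matrix,   # stands in for the first execution configuration data
    state_config: Matrix,      # stands in for the second execution configuration data
    w_category: float = 0.6,
    w_state: float = 0.4,
) -> float:
    """Combine the two branches by weighting (the example operation named in the text)."""
    category_out = apply_config(category_config, desc_vec)   # STEP183
    state_out = apply_config(state_config, desc_vec)         # STEP184
    category_score = sum(category_out) / len(category_out)
    state_score = sum(state_out) / len(state_out)
    return w_category * category_score + w_state * state_score  # STEP185


def passes_risk_detection(index: float, threshold: float) -> bool:
    """The server passes when the index does not exceed the set detection index."""
    return index <= threshold
```

The comparison uses `<=` because the text states that the server passes when the index "does not exceed" the set detection index.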
Further, a computer-readable storage medium is provided, on which a program is stored; when the program is executed by a processor, the method described above is implemented.
In the embodiments provided by the present invention, it should be understood that the disclosed apparatus and method may also be implemented in other ways. The apparatus and method embodiments described above are merely illustrative. For example, the flowcharts and block diagrams in the accompanying drawings show possible architectures, functions, and operations of apparatuses, methods, and computer program products according to multiple embodiments of the present invention. In this regard, each block in a flowchart or block diagram may represent a module, a program segment, or a portion of code that contains one or more executable instructions for implementing the specified logical function. It should also be noted that, in some alternative implementations, the functions marked in the blocks may occur in an order different from that marked in the drawings. For example, two consecutive blocks may in fact be executed substantially in parallel, or sometimes in the reverse order, depending on the functions involved. It should also be noted that each block in the block diagrams and/or flowcharts, and combinations of such blocks, may be implemented by a dedicated hardware-based system that performs the specified functions or actions, or by a combination of dedicated hardware and computer instructions.
In addition, the functional modules in the embodiments of the present invention may be integrated to form a single independent part, each module may exist separately, or two or more modules may be integrated to form an independent part.
If the functions are implemented in the form of software functional modules and sold or used as an independent product, they may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention in essence, or the part that contributes to the prior art, or a part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions that cause a computer device (which may be a personal computer, a network device, or the like) to perform all or part of the steps of the methods described in the embodiments of the present invention. The aforementioned storage medium includes any medium that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc. It should be noted that, in this document, the terms "comprise", "include", and any other variants thereof are intended to cover non-exclusive inclusion, so that a process, method, article, or device that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, article, or device. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of additional identical elements in the process, method, article, or device that includes the element.
The above are only preferred embodiments of the present invention and are not intended to limit it. For those skilled in the art, the present invention may be modified and varied in many ways. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.
Claims (9)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310328830.6A CN116305285B (en) | 2023-03-30 | 2023-03-30 | Patient information desensitization processing method and system combined with artificial intelligence |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310328830.6A CN116305285B (en) | 2023-03-30 | 2023-03-30 | Patient information desensitization processing method and system combined with artificial intelligence |
Publications (2)
Publication Number | Publication Date |
---|---|
CN116305285A (en) | 2023-06-23 |
CN116305285B (en) | 2024-04-05 |
Family
ID=86788565
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310328830.6A Active CN116305285B (en) | 2023-03-30 | 2023-03-30 | Patient information desensitization processing method and system combined with artificial intelligence |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116305285B (en) |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117216800B (en) * | 2023-10-31 | 2024-09-10 | 中国人民解放军总医院 | Privacy removing processing method and device for large-batch medical record data |
CN119069065B (en) * | 2024-11-01 | 2025-04-25 | 宁波芯联心医疗科技有限公司 | Intelligent medical equipment user information anonymizing method and system based on artificial intelligence |
Citations (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106845265A (en) * | 2016-12-01 | 2017-06-13 | 北京计算机技术及应用研究所 | A kind of document security level automatic identifying method |
CN108519983A (en) * | 2018-02-05 | 2018-09-11 | 中国科学院信息工程研究所 | A secure document similarity calculation method and system based on latent layer semantic analysis |
CN110135189A (en) * | 2019-04-28 | 2019-08-16 | 上海市第六人民医院 | A desensitization method for patient privacy information oriented to medical text |
EP3528150A1 (en) * | 2018-02-14 | 2019-08-21 | OneSpan NV | A system, apparatus and method for privacy preserving contextual authentication |
CN110287314A (en) * | 2019-05-20 | 2019-09-27 | 中国科学院计算技术研究所 | Long text credibility evaluation method and system based on unsupervised clustering |
CN111209373A (en) * | 2020-01-07 | 2020-05-29 | 北京启明星辰信息安全技术有限公司 | Method and device for sensitive text recognition based on natural semantics |
CN112308048A (en) * | 2020-12-03 | 2021-02-02 | 云知声智能科技股份有限公司 | Medical record integrity judging method, device and system based on small amount of labeled data |
WO2021119175A1 (en) * | 2019-12-11 | 2021-06-17 | Servicenow, Inc. | Determining semantic content of textual clusters |
CN113811866A (en) * | 2019-05-23 | 2021-12-17 | 国际商业机器公司 | Sensitive data management |
CN114580354A (en) * | 2022-05-05 | 2022-06-03 | 阿里巴巴达摩院(杭州)科技有限公司 | Synonym-based information encoding method, device, equipment and storage medium |
CN115688166A (en) * | 2022-10-10 | 2023-02-03 | 北京肿瘤医院(北京大学肿瘤医院) | Information desensitization processing method, device, computer equipment and readable storage medium |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20230061906A1 (en) * | 2021-08-09 | 2023-03-02 | Samsung Electronics Co., Ltd. | Dynamic question generation for information-gathering |
Non-Patent Citations (3)
Title |
---|
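Semantic disclosure control: semantics meets data privacy; Montserrat Batet et al.; Online Information Review; Vol. 42 (No. 3); 1-14 * |
Document segmentation method based on style feature fusion; Liu Gang et al.; Computer Applications and Software; Vol. 37 (No. 10); 200-207 * |
Technical guidelines for strengthening privacy and data protection in modern electronic healthcare environments; Stefanos Gritzalis et al.; 电子制作 (No. 13); 199-206 * |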
Also Published As
Publication number | Publication date |
---|---|
CN116305285A (en) | 2023-06-23 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN116305285B (en) | Patient information desensitization processing method and system combined with artificial intelligence | |
Zaeem et al. | The effect of the GDPR on privacy policies: Recent progress and future promise | |
Feng et al. | Modular pluralism: Pluralistic alignment via multi-llm collaboration | |
US10242310B2 (en) | Type evaluation in a question-answering system | |
EP1950684A1 (en) | Anonymity measuring device | |
CN112925914B (en) | Data security grading method, system, equipment and storage medium | |
Kreso et al. | Data mining privacy preserving: Research agenda | |
CN103733190A (en) | Protecting network entity data while preserving network properties | |
US20210150358A1 (en) | System and method for controlling confidential information | |
CN111402973A (en) | Information matching analysis method and device, computer system and readable storage medium | |
US11972023B2 (en) | Compatible anonymization of data sets of different sources | |
US20220229908A1 (en) | Methods, systems, and devices for trusted execution environments and secure data processing and storage environments | |
Ramachandranpillai et al. | FairXAI-A Taxonomy and Framework for Fairness and Explainability Synergy in Machine Learning | |
CN112287397B (en) | System and method for improving and guaranteeing safety of patient information transmission | |
Bangare et al. | Kernel interpolation-based technique for privacy protection of pluggable data in cloud computing | |
CN118761438A (en) | Knowledge forgetting method for multimodal large language models based on dual mask divergence | |
Callier | Machine learning in evolutionary studies comes of age | |
CN117421550A (en) | Policy-based data analysis method and device, electronic equipment and storage medium | |
Jiang et al. | Differential Privacy on Edge Computing | |
Kordi et al. | On the Usage of ChatGPT for Integrating CAPEC Attacks into ADVISE Meta Ontology | |
WO2013042788A1 (en) | Data partitioning apparatus, data partitioning system, data partitioning method, and program | |
Dahman | Review of data privacy techniques: Concepts, scenarios and architectures, simulations, challenges, and future directions | |
CN111784303B (en) | Nuclear protection information processing method and device, computer storage medium and electronic equipment | |
CN111858832B (en) | Dialogue method, dialogue device, electronic equipment and storage medium | |
Afzal et al. | Meaningful integration of online knowledge resources with clinical decision support system |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||