CN112613045B

CN112613045B - Method and system for embedding data watermark of target data

Info

Publication number: CN112613045B
Application number: CN202011375206.4A
Authority: CN
Inventors: 于鹏飞; 石聪聪; 陈磊
Original assignee: State Grid Smart Grid Research Institute of SGCC
Current assignee: State Grid Smart Grid Research Institute of SGCC
Priority date: 2020-11-30
Filing date: 2020-11-30
Publication date: 2023-06-06
Anticipated expiration: 2040-11-30
Also published as: CN112613045A

Abstract

The invention discloses a data watermark embedding method and system for target data. The method includes: S1, dividing the target data in which a data watermark is to be embedded into a plurality of content blocks, and embedding a data watermark in each content block; S2, using a preset data similarity evaluation model to evaluate the data entry similarity of the data entries in which the data watermark has been embedded; S3, evaluating the content block data watermark similarity based on the data entry similarities of all data entries that make up the content block, executing S4 when the data watermark similarities of all content blocks meet the first threshold range, and otherwise adjusting the embedding ratio and/or position of the data watermark in the content block and executing S2; S4, calculating the overall similarity of the target data based on the data watermark similarities of all content blocks that make up the target data. The watermarked target data is obtained by adjusting the embedding ratio and/or position of the data watermark, so that high concealment and high fidelity of data watermark embedding are achieved.

Description

A method and system for embedding data watermark of target data

Technical Field

The present invention relates to the field of data watermarking, and in particular to a method and system for embedding a data watermark in target data.

Background Art

With the continuous development of the digital economy, information exchange between different departments, regions, and data subjects has steadily increased, and data is circulated, recombined, and used ever more frequently between processing stages in the form of structured data. Because the data is used in a dynamic environment, the risk of data leakage is high. Once a leak occurs, the responsible stage must be located accurately so that the security responsibility of the relevant personnel can be traced and the security controls of the weak stages can be strengthened in a targeted manner.

Data watermarking is one of the effective technical means for tracing responsibility after such a leak. A data watermark adds extra, redundant identification information to the data content itself: the watermark closely imitates real data content while carrying identification information that is associated with, and records, the responsible stage, so that once data is leaked its source can be located from the watermark added in advance. High fidelity and high concealment are the key indicators of an effective data watermark, since they keep it from being discovered and destroyed by malicious users. Achieving them requires that the similarity of the target data before and after watermarking reach a threshold at which users are unlikely to notice the watermark. How to embed a data watermark in target data so that it is highly concealed and highly faithful after embedding is therefore a problem in urgent need of a solution.

Summary of the Invention

In order to overcome the above deficiencies of the prior art, the present invention provides a data watermark embedding method for target data, comprising:

S1: dividing the target data in which a data watermark is to be embedded into a plurality of content blocks, and embedding a data watermark in the data entries of each content block;

S2: using a preset data similarity evaluation model to evaluate the data entry similarity of each data entry after the data watermark is embedded;

S3: evaluating the content block data watermark similarity based on the data entry similarities of all data entries constituting the content block; when the data watermark similarities of all content blocks meet a first threshold range, executing S4; otherwise, adjusting the embedding ratio and/or position of the data watermark in the content blocks that do not meet the first threshold range, and executing S2;

S4: calculating the similarity of the target data as a whole based on the data watermark similarities of all content blocks constituting the target data; when the similarity of the target data as a whole meets a second threshold range, completing the embedding of the data watermark; otherwise, adjusting the embedding ratio and/or position of the data watermark in one or more content blocks, and executing S2.
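The control flow of S1 to S4 can be sketched as a feedback loop. This is a minimal illustration under stated assumptions, not the patent's implementation: `embed_block`, `entry_similarity`, and `adjust_ratio` are hypothetical callbacks supplied by the caller, both aggregation steps are assumed to be arithmetic means, only the embedding ratio (not the position) is adjusted, every content block is assumed to be non-empty, and a `max_rounds` guard is added so the loop always terminates.

```python
from statistics import mean


def in_range(value, bounds):
    """True when value lies inside the closed interval bounds = (low, high)."""
    low, high = bounds
    return low <= value <= high


def embed_with_feedback(blocks, embed_block, entry_similarity, adjust_ratio,
                        block_range, overall_range, max_rounds=20):
    """Repeat S1-S4 until both threshold ranges are met or max_rounds is hit.

    Hypothetical callbacks (not defined by the source text):
      embed_block(block, ratio)              -> watermarked copy of the block
      entry_similarity(original, marked)     -> float score for one entry
      adjust_ratio(ratio, similarity, range) -> new embedding ratio
    """
    ratios = [1.0] * len(blocks)  # assumed start: watermark every entry
    for _ in range(max_rounds):
        # S1: embed a watermark in each content block at its current ratio.
        marked = [embed_block(b, r) for b, r in zip(blocks, ratios)]
        # S2 + S3: score every entry, then aggregate per block (mean assumed).
        deltas = [mean(entry_similarity(o, w) for o, w in zip(b, m))
                  for b, m in zip(blocks, marked)]
        failing = [i for i, d in enumerate(deltas)
                   if not in_range(d, block_range)]
        if failing:
            # Only the blocks outside the first threshold range are adjusted.
            for i in failing:
                ratios[i] = adjust_ratio(ratios[i], deltas[i], block_range)
            continue  # re-embed and evaluate again
        # S4: aggregate block scores into the overall score (mean assumed).
        theta = mean(deltas)
        if in_range(theta, overall_range):
            return marked, ratios, theta
        ratios = [adjust_ratio(r, theta, overall_range) for r in ratios]
    raise RuntimeError("no embedding met both threshold ranges")
```

Passing the adjustment rule in as a callback keeps the loop independent of how the ratio is actually moved, which the text specifies separately.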

Preferably, adjusting the embedding ratio and/or position of the data watermark includes:

when the data entry contains fields of a single type, adjusting the embedding ratio of the data watermark;

when the data entry contains fields of multiple types, adjusting the embedding ratio and/or position of the data watermark.

Preferably, adjusting the embedding ratio of the data watermark includes:

when the content block data watermark similarity is greater than the maximum value of the first threshold range, reducing the proportion of data entries of the content block in which the data watermark is embedded to a preset proportion;

when the content block data watermark similarity is less than the minimum value of the first threshold range, increasing the proportion of data entries of the content block in which the data watermark is embedded to a preset proportion;

when the overall similarity of the target data is greater than the maximum value of the second threshold range, reducing the proportion of the data watermark embedded in one or more content blocks to a preset proportion;

when the overall similarity of the target data is less than the minimum value of the second threshold range, increasing the proportion of the data watermark embedded in one or more content blocks to a preset proportion.
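The four rules above share one shape: compare a similarity with a threshold range and move the embedding ratio to a preset value. A minimal sketch follows, keeping the direction of each rule exactly as the text states it. The ladder of preset ratios is an assumption (the embodiment mentions 20%, 30%, 50%, and a 100% starting point), and `adjust_ratio` is a hypothetical name.

```python
# Assumed ladder of preset ratios; the source only gives these as examples.
PRESET_RATIOS = (0.2, 0.3, 0.5, 1.0)


def adjust_ratio(current, similarity, bounds, presets=PRESET_RATIOS):
    """Move the embedding ratio one preset step, following the stated rules.

    similarity > max of the range -> step down to the next lower preset
    similarity < min of the range -> step up to the next higher preset
    otherwise                     -> keep the current ratio
    """
    low, high = bounds
    if similarity > high:
        lower = [p for p in presets if p < current]
        return max(lower) if lower else current
    if similarity < low:
        higher = [p for p in presets if p > current]
        return min(higher) if higher else current
    return current
```

The same function serves both the content block rules (first threshold range) and the overall rules (second threshold range), since only the `bounds` argument differs.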

Preferably, adjusting the embedding position of the data watermark includes:

removing the existing data watermark from the data entry, and embedding, at a preset ratio, a data watermark matching the field type into each type of field in the data entry in turn;

evaluating the data entry similarity of the data entry after each type-matched data watermark is embedded, selecting the position of the field that yields the greatest data entry similarity as the optimal position for the data watermark, and embedding the data watermark at that optimal position.

Preferably, the field types in the data entry include any one or more of the following:

numeric fields, text fields, and natural language fields.

Preferably, embedding a data watermark in the data entry includes:

when the data entry includes a numeric field, embedding a numeric data watermark in the numeric field;

when the data entry includes a text field, embedding a character-text data watermark in the text field;

when the data entry includes a natural language field, embedding a natural language data watermark in the natural language field.

Preferably, using a preset data similarity evaluation model to evaluate the data entry similarity of the data entries after the data watermark is embedded includes:

when a numeric data watermark is embedded in a numeric field of the data entry, deconstructing and segmenting the values before and after embedding, and evaluating the data entry similarity with a Euclidean distance vector data similarity evaluation model;

when a character-text data watermark is embedded in a text field of the data entry, deconstructing the text before and after embedding into ASCII code values, and evaluating the data entry similarity with a cosine vector data similarity evaluation model;

when a natural language data watermark is embedded in a natural language field of the data entry, applying a vector space model to deconstruct and segment the natural language field before and after embedding, and evaluating the data entry similarity of the segmentation results with a cosine vector data similarity evaluation model.

Preferably, the content block data watermark similarity is evaluated according to the following formula:

where δ denotes the content block data watermark similarity; N denotes the total number of data entries in the content block; and C_i denotes the data entry similarity of the i-th data entry.
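The formula itself is not reproduced in this text. One reading that is consistent with the symbol definitions above, offered as an assumption rather than as the patent's stated formula, is the arithmetic mean of the entry similarities:

```latex
\[
\delta = \frac{1}{N} \sum_{i=1}^{N} C_i
\]
```

Under this reading δ always lies between the smallest and largest C_i, so it can be compared directly with the first threshold range.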

Preferably, the similarity of the target data as a whole is evaluated according to the following formula:

where θ denotes the similarity of the target data as a whole; M denotes the total number of content blocks in the target data; and δ_i denotes the content block data watermark similarity of the i-th content block.
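The formula for θ is not reproduced in this text. Assuming it is the arithmetic mean of the block similarities, which matches the symbol definitions above but is not confirmed by the source, it would read:

```latex
\[
\theta = \frac{1}{M} \sum_{i=1}^{M} \delta_i
\]
```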

Based on the same inventive concept, the present invention also provides a data watermark embedding system for target data, comprising:

an embedding module, configured to divide the target data in which a data watermark is to be embedded into a plurality of content blocks, and to embed a data watermark in each content block;

a data entry similarity evaluation module, configured to use a preset data similarity evaluation model to evaluate the data entry similarity of each data entry after the data watermark is embedded;

a content block similarity evaluation module, configured to evaluate the content block data watermark similarity based on the data entry similarities of all data entries constituting the content block; when the data watermark similarities of all content blocks meet the first threshold range, the overall similarity evaluation module is executed; otherwise, the embedding ratio and/or position of the data watermark in the content blocks that do not meet the first threshold range is adjusted, and the data entry similarity evaluation module is executed;

an overall similarity evaluation module, configured to calculate the similarity of the target data as a whole based on the data watermark similarities of all content blocks constituting the target data; when the similarity of the target data as a whole meets the second threshold range, the embedding of the data watermark is completed; otherwise, the embedding ratio and/or position of the data watermark in one or more content blocks is adjusted, and the data entry similarity evaluation module is executed.

Preferably, the data entry similarity evaluation module is specifically configured to:

Compared with the prior art, the present invention has the following beneficial effects:

In the technical solution provided by the present invention, S1 divides the target data in which a data watermark is to be embedded into a plurality of content blocks and embeds a data watermark in the data entries of each content block; S2 uses a preset data similarity evaluation model to evaluate the data entry similarity of each data entry after the data watermark is embedded; S3 evaluates the content block data watermark similarity based on the data entry similarities of all data entries constituting the content block, executes S4 when the data watermark similarities of all content blocks meet the first threshold range, and otherwise adjusts the embedding ratio and/or position of the data watermark in the content blocks that do not meet the first threshold range and executes S2; S4 calculates the similarity of the target data as a whole based on the data watermark similarities of all content blocks constituting the target data, completes the embedding of the data watermark when the similarity of the target data as a whole meets the second threshold range, and otherwise adjusts the embedding ratio and/or position of the data watermark in one or more content blocks and executes S2. Working in turn from the data watermark similarity evaluation results for the data entries, the content blocks, and the data as a whole, the present invention dynamically adjusts the data watermark embedded in the content blocks, so as to achieve high concealment and high fidelity after the data watermark is embedded.

Brief Description of the Drawings

FIG. 1 is a flow chart of a data watermark embedding method for target data provided by the present invention;

FIG. 2 is a schematic diagram of a data watermark embedding system for target data provided by an embodiment of the present invention.

Detailed Description

For a better understanding of the present invention, its content is further described below with reference to the accompanying drawings and examples.

Embodiment 1: As shown in FIG. 1, in order to meet the urgent need identified in the prior art above, the present invention provides a data watermark embedding method for target data, comprising:

wherein adjusting the embedding ratio and/or position of the data watermark includes:

Working in turn from the data watermark similarity evaluation results for the data entries, the content blocks, and the data as a whole, the present invention dynamically adjusts the data watermark embedded in the target data so that the content block similarity and the similarity of the target data as a whole after embedding each meet their set threshold range, so as to achieve high concealment and high fidelity after the data watermark is embedded.

In this embodiment, S1, dividing the target data in which a data watermark is to be embedded into a plurality of content blocks and embedding a data watermark in the data entries of each content block, includes:

for each data entry constituting a content block, selecting and embedding a data watermark of the type corresponding to the field type in the data entry. In order to increase the information capacity of the embedded data watermark, the proportion of the target data in which the data watermark is embedded is 100%.

Specifically:

S2 uses a preset data similarity evaluation model to evaluate the data entry similarity of each data entry after the data watermark is embedded; that is, a suitable data similarity evaluation model is selected according to the data watermark embedding algorithm used, and the similarity of the watermarked entries is evaluated, including:

Similarity in this embodiment presupposes that, for data of a given type, the data type characteristics do not change after the data watermark is embedded; if they do change, the similarity evaluation result for that watermarked entry is 0.

For example, for data of the mobile phone number type, an existing mobile phone number has 11 digits, of which the first 3 digits are the network identification number, digits 4 to 7 are the area code, and digits 8 to 11 are the subscriber number. After the data watermark is embedded, the value should still conform to the data type characteristics of a mobile phone number.
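The type-preservation rule can be sketched as a gate in front of whichever similarity model is used. This is an illustrative sketch only: the check covers just the 11-digit, 3 + 4 + 4 layout described above, and `split_mobile`, `gated_similarity`, and the `score` callback are hypothetical names.

```python
import re

# 11 ASCII digits, as described in the text; no carrier-prefix validation.
MOBILE_RE = re.compile(r"[0-9]{11}")


def split_mobile(number):
    """Return (network id, area code, subscriber number), or None if malformed."""
    if not MOBILE_RE.fullmatch(number):
        return None
    return number[:3], number[3:7], number[7:]


def gated_similarity(original, watermarked, score):
    """Score two values only if both still look like mobile numbers.

    score(original, watermarked) is a hypothetical similarity callback.
    A watermarked value that no longer fits the type scores 0.
    """
    if split_mobile(original) is None or split_mobile(watermarked) is None:
        return 0.0
    return score(original, watermarked)
```

For instance, `split_mobile("13812345678")` yields the three segments `("138", "1234", "5678")`, while a watermarked value containing a letter is rejected and scores 0 regardless of the model.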

(1) After a numeric data watermark is embedded, the values before and after embedding are deconstructed and segmented, and their similarity is evaluated with the Euclidean distance vector data similarity evaluation model; the evaluation result is denoted D.

For example, for data of the mobile phone number type, let the values before and after data watermark embedding be P and P' respectively. After deconstruction and segmentation, each digit is an independent unit, that is, P = {N1, N2, ..., N11} and P' = {N'1, N'2, ..., N'11}; these are then substituted into the Euclidean data similarity evaluation model to calculate the similarity.
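The digit-wise comparison can be sketched as follows. The Euclidean distance is standard; the mapping from distance to a similarity score, 1 / (1 + d), is an assumption chosen for illustration because the formula used by the source is not reproduced in this text.

```python
import math


def digit_vector(number):
    """Split a numeric string into a vector of its individual digits."""
    return [int(ch) for ch in number]


def euclidean_distance(p, q):
    """Euclidean distance between two equal-length vectors."""
    if len(p) != len(q):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def euclidean_similarity(p, q):
    """Assumed mapping: identical vectors score 1, distant ones approach 0."""
    return 1.0 / (1.0 + euclidean_distance(p, q))


before = digit_vector("13812345678")
after = digit_vector("13812345679")  # last digit changed by the watermark
print(euclidean_distance(before, after))    # 1.0
print(euclidean_similarity(before, after))  # 0.5
```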

(2) After a character-text data watermark is embedded, the text before and after embedding is deconstructed into ASCII code values, and the similarity is evaluated with the cosine vector data similarity evaluation model; the evaluation result is denoted C.

For example, for data of the WeChat account type, let the values before and after data watermark embedding be P and P' respectively. After deconstruction into ASCII code values, the code value of each character is an independent unit, that is, P = {N1, N2, ..., Nn} and P' = {N'1, N'2, ..., N'n}; these are then substituted into the cosine data similarity evaluation model to calculate the similarity.
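A minimal sketch of the ASCII-based comparison, using the standard definition of cosine similarity (dot product divided by the product of the vector norms). The function names and the sample account strings are made up for illustration, and equal-length values are assumed, as in the P and P' vectors above.

```python
import math


def ascii_vector(text):
    """Map each character to its code value (ASCII for ASCII input)."""
    return [ord(ch) for ch in text]


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for empty or all-zero vectors
    return dot / (norm_u * norm_v)


# Hypothetical account strings differing in the final character.
score = cosine_similarity(ascii_vector("wx_abc"), ascii_vector("wx_abd"))
print(score)  # very close to, but below, 1.0
```

Because every component of a code-value vector is a large positive number, scores from this model cluster near 1, so a threshold range for text fields would likely need to be set quite tightly.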

(3) After a natural language data watermark is embedded, a vector space model is applied to deconstruct and segment the data before and after embedding, and the data similarity of the segmentation results is evaluated with the cosine vector data similarity evaluation model.

The natural language data involved in the power business has pronounced domain characteristics: address data such as maintenance addresses and capacity-expansion addresses; power-industry terminology such as operating terms and electrical quantity terms; residents' names; and so on. These can be compiled into a domain-specific word segmentation lexicon for power business natural language data.

The natural language data of the power business before and after the data watermark is added is segmented into words, giving the vector expressions O = {O1, O2, ..., On} and O' = {O'1, O'2, ..., O'n}, which are substituted into the cosine data similarity evaluation model to calculate the similarity.
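One common way to realize a vector space model is to count term frequencies over the combined vocabulary of the two texts and compare the resulting vectors by cosine similarity. The sketch below assumes the texts have already been segmented into tokens (for example by a segmenter using a domain lexicon, which is not shown); the function names and sample tokens are illustrative only.

```python
import math
from collections import Counter


def term_vectors(tokens_a, tokens_b):
    """Build term-frequency vectors over the shared vocabulary of both texts."""
    counts_a, counts_b = Counter(tokens_a), Counter(tokens_b)
    vocab = sorted(set(counts_a) | set(counts_b))
    return [counts_a[t] for t in vocab], [counts_b[t] for t in vocab]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def text_similarity(tokens_a, tokens_b):
    """Similarity of two pre-segmented texts under the term-frequency model."""
    return cosine(*term_vectors(tokens_a, tokens_b))


original = ["substation", "maintenance", "address"]
marked = ["substation", "repair", "address"]  # one token replaced
print(text_similarity(original, marked))  # 2 shared terms out of 3 -> about 0.667
```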

S3 evaluates the content block data watermark similarity based on the data entry similarities of all data entries constituting the content block; when the data watermark similarities of all content blocks meet the first threshold range, S4 is executed; otherwise, the embedding ratio and/or position of the data watermark in the content blocks that do not meet the first threshold range is adjusted and S2 is executed. This includes:

A second-level similarity, namely the content block data watermark similarity, is calculated from the data entry similarities of all data entries constituting the content block. The size of a content block is set by the user according to the specific business scenario; for example, for ease of review, the content block size can be set to 20 rows, 50 rows, or 100 rows.

Taking a content block of N rows as an example, the similarity of each watermarked data entry is evaluated by the method given in S2 and the result is denoted C; the second-level similarity δ of the content block after all of its entries are watermarked is then obtained from the entry similarities C_i of its N data entries.

Whether the data watermark similarity of each content block meets the first threshold range is then determined. When the data watermark similarities of all content blocks meet the first threshold range, S4 is executed; otherwise, the embedding ratio and/or position of the data watermark in the content blocks that do not meet the first threshold range is adjusted and S2 is executed.

This embodiment describes in detail the methods used when the content block similarity does not meet the threshold range:

Method 1: dynamically adjusting the proportion of data watermark added, including:

When the second-level similarity of a content block in which every entry is watermarked exceeds the maximum value of the first threshold range, the embedding ratio of the data watermark can be reduced to keep the second-level similarity before and after embedding within range; for example, the embedding ratio can be set to 50%, 30%, or 20%.

When the second-level similarity of a content block after the data watermark is embedded is less than the minimum value of the first threshold range, the embedding ratio of the data watermark can be raised to increase the data watermark embedding capacity as far as possible; for example, the embedding ratio can be set to 20%, 30%, or 50%.

When the data entries that make up a content block contain multiple field types, Method 2 can be used: dynamically adjusting the position at which the data watermark is added, including:

In this embodiment, adjusting the position at which the data watermark is added specifically includes the following. When the data entries in a content block contain numeric values as well as text and natural language, the data watermark originally added to the entry is first deleted. The data watermark is then added, at a fixed embedding ratio, to the numeric field, the text field, and the natural language field in turn, and the entry similarity after watermarking each of these fields is calculated by the method given in S2. The position with the greatest entry similarity is selected as the optimal position for the watermark, and the second-level similarity is calculated from the resulting entry similarities. In this way the data watermark embedding capacity is raised as far as possible while the second-level similarity after embedding is kept within the threshold.
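The position search in Method 2 can be sketched as trying each candidate field on a clean copy of the entry and keeping the one that scores highest. This is an illustrative sketch: entries are modeled as dictionaries, and `embedders` and `scorers` are hypothetical per-field callbacks standing in for the type-matched watermarking and similarity models.

```python
def best_position(entry, embedders, scorers):
    """Pick the field whose watermarked version stays most similar.

    entry:     dict mapping field name -> un-watermarked value
    embedders: dict mapping field name -> function(value) -> watermarked value
    scorers:   dict mapping field name -> function(before, after) -> similarity
    Returns (field, score, watermarked entry), or (None, -1.0, None) when no
    candidate field is present in the entry.
    """
    best_field, best_score, best_entry = None, -1.0, None
    for field, embed in embedders.items():
        if field not in entry:
            continue
        candidate = dict(entry)  # each trial starts from the clean entry
        candidate[field] = embed(entry[field])
        score = scorers[field](entry[field], candidate[field])
        if score > best_score:
            best_field, best_score, best_entry = field, score, candidate
    return best_field, best_score, best_entry
```

Starting every trial from the un-watermarked entry corresponds to removing the earlier watermark before a new position is evaluated, so the scores of different fields are comparable.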

S4 calculates the similarity of the target data as a whole based on the data watermark similarities of all content blocks constituting the target data; when the similarity of the target data as a whole meets the second threshold range, the embedding of the data watermark is completed; otherwise, the embedding ratio and/or position of the data watermark in one or more content blocks is adjusted and S2 is executed. This includes:

Once the second-level similarity after embedding meets the threshold range and the data watermark embedding capacity has been raised as far as possible, the similarity of the watermarked target data as a whole, namely the third-level similarity, is calculated from the second-level similarities of all content blocks. When the third-level similarity meets the second threshold range, the embedding of the data watermark is completed; otherwise, the embedding ratio and/or position of the data watermark in one or more content blocks is adjusted and S2 is executed.

When the third-level similarity does not meet the second threshold range, it can be adjusted dynamically by changing the ratio as follows:

That is, when the overall similarity of the target data does not meet the second threshold range, the proportion of the data watermark embedded in one or more content blocks is adjusted to a preset proportion.

When the third-level similarity does not meet the second threshold range and the data entries of a content block to be adjusted contain fields of multiple types, the position at which the data watermark is embedded in the data entries can be adjusted so that the third-level similarity meets the second threshold range, thereby completing the data watermark embedding process.

This embodiment takes as an example target data divided into M content blocks. After the data watermark is embedded in each content block, the similarity of each watermarked entry is evaluated, and the similarity of each content block is then evaluated from those entry similarities and denoted δ; the third-level similarity θ of the target data after all content blocks are watermarked is then obtained from the block similarities δ_i of its M content blocks.

If θ exceeds the maximum value of the set second threshold range, the embedding ratio and/or position of the data watermark is adjusted to raise the δ values and thereby the θ value, ultimately increasing the similarity of the data as a whole after the data watermark is embedded.

If θ is far from the minimum value of the set second threshold range, the embedding ratio and/or position of the data watermark is adjusted and the embedding ratio is raised, so that the data watermark embedding capacity is increased as far as possible while the third-level similarity after embedding is still guaranteed.

To achieve high concealment and high fidelity after the data watermark is embedded in the target data, this embodiment of the present invention selects a suitable watermark proportion and distribution strategy according to the similarity evaluation results of the different data watermark algorithms.

实施例2：基于同一发明构思，本发明还提供了一种目标数据的数据水印嵌入系统，如图2所示包括：Embodiment 2: Based on the same inventive concept, the present invention also provides a data watermark embedding system for target data, as shown in FIG2 , including:

该系统一方面通过数据相似度评估模型进行数据水印条目、数据水印嵌入内容块、数据水印嵌入数据整体的相似度评估，另一方面根据评估结果，动态的调整水印添加的比例、分布位置，以满足用户设定的流转数据的相似度阈值，整体保障数据水印嵌入的隐蔽性和高仿真性。On the one hand, the system uses a data similarity assessment model to evaluate the similarity of data watermark entries, data watermark embedded content blocks, and data watermark embedded data as a whole. On the other hand, based on the evaluation results, the system dynamically adjusts the proportion and distribution of watermark additions to meet the similarity threshold of the circulating data set by the user, thereby ensuring the concealment and high simulation of data watermark embedding as a whole.

实施例中，所述系统还包括调整模块，用于调整数据水印的嵌入比例和/或位置。In an embodiment, the system further comprises an adjustment module for adjusting the embedding ratio and/or position of the data watermark.

所述调整模块包括：The adjustment module comprises:

第一调整单元，用于当数据条目中包含单一类型字段时，调整数据水印的嵌入比例；A first adjustment unit, used for adjusting the embedding ratio of the data watermark when the data entry contains a single type of field;

第二调整单元，用于当数据条目中包含多种类型字段时，调整数据水印的嵌入比例和/或位置。The second adjustment unit is used to adjust the embedding ratio and/or position of the data watermark when the data entry contains multiple types of fields.

所述调整模块还包括：比例调整单元，具体用于：The adjustment module further includes: a ratio adjustment unit, which is specifically used to:

所述调整模块还包括：位置调整单元，具体用于：The adjustment module further includes: a position adjustment unit, specifically used for:

In an embodiment, the field types in the data entry include any one or more of the following:

In an embodiment, the embedding module is specifically configured to:

In an embodiment, the data entry similarity evaluation module is specifically configured to:

In an embodiment, the content block data watermark similarity is evaluated according to the following formula:

In an embodiment, the overall similarity of the target data is evaluated according to the following formula:

Those skilled in the art will appreciate that the embodiments of the present application may be provided as a method, a system, or a computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, the present application may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) that contain computer-usable program code.

The present application is described with reference to flowcharts and/or block diagrams of the method, device (system), and computer program product according to the embodiments of the present application. It should be understood that each process and/or block in the flowcharts and/or block diagrams, and any combination of processes and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce means for implementing the functions specified in one or more processes of the flowcharts and/or one or more blocks of the block diagrams.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing device to operate in a specific manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means that implement the functions specified in one or more processes of the flowcharts and/or one or more blocks of the block diagrams.

These computer program instructions may also be loaded onto a computer or other programmable data processing device, so that a series of operational steps is executed on the computer or other programmable device to produce a computer-implemented process, whereby the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more processes of the flowcharts and/or one or more blocks of the block diagrams.

The above are merely embodiments of the present invention and are not intended to limit it. Any modification, equivalent substitution, or improvement made within the spirit and principles of the present invention falls within the scope of the pending claims of the present invention.

Claims

1. A data watermark embedding method for target data, characterized by comprising:

S1: dividing the target data into which a data watermark is to be embedded into a plurality of content blocks, and embedding the data watermark in the data entries of each content block;

S2: using a preset data similarity evaluation model to evaluate the data entry similarity between each data entry before the data watermark is embedded and the same data entry after the data watermark is embedded;

S3: evaluating the content block data watermark similarity based on all data entries that make up the content block before the data watermark is embedded and all data entries that make up the content block after the data watermark is embedded; when the data watermark similarity of every content block meets a first threshold range, executing S4; otherwise, adjusting the embedding ratio and/or position of the data watermark in each content block that does not meet the first threshold range, and executing S2;

S4: calculating the overall similarity of the target data based on all content blocks that make up the target data before the data watermark is embedded and all content blocks that make up the target data after the data watermark is embedded; when the overall similarity of the target data meets a second threshold range, completing the embedding of the data watermark; otherwise, adjusting the embedding ratio and/or position of the data watermark in one or more content blocks, and executing S2.
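Steps S1 to S4 form a feedback loop, which can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the claimed implementation: `embed`, `entry_similarity`, and `adjust` are hypothetical caller-supplied callables, only a per-block embedding ratio is adjusted, block-level and overall similarity are both taken as arithmetic means, and `block_size` and `max_rounds` are illustrative parameters that the claim does not define.

```python
from statistics import mean
from typing import Any, Callable, List, Sequence, Tuple

Range = Tuple[float, float]  # (minimum, maximum) of a threshold range


def split_into_blocks(entries: Sequence[Any], block_size: int) -> List[List[Any]]:
    """S1, first half: cut the target data into content blocks (fixed size assumed)."""
    return [list(entries[i:i + block_size]) for i in range(0, len(entries), block_size)]


def in_range(value: float, bounds: Range) -> bool:
    return bounds[0] <= value <= bounds[1]


def embed_with_feedback(
    blocks: Sequence[Sequence[Any]],
    embed: Callable[[Any, float], Any],             # hypothetical: (entry, ratio) -> marked entry
    entry_similarity: Callable[[Any, Any], float],  # hypothetical: (before, after) -> similarity
    adjust: Callable[[float, float, Range], float], # hypothetical: (ratio, similarity, range) -> ratio
    block_range: Range,
    overall_range: Range,
    initial_ratio: float,
    max_rounds: int = 20,                           # assumption: the claim sets no iteration bound
) -> Tuple[List[List[Any]], List[float]]:
    """Repeat S1-S4 until both threshold ranges are met. Blocks must be non-empty."""
    ratios = [initial_ratio] * len(blocks)
    for _ in range(max_rounds):
        # S1, second half: embed the watermark in every entry of every block.
        marked = [[embed(e, r) for e in block] for block, r in zip(blocks, ratios)]
        # S2 + S3: per-entry similarity, aggregated per block (mean is an assumption).
        block_sims = [
            mean([entry_similarity(o, m) for o, m in zip(block, mblock)])
            for block, mblock in zip(blocks, marked)
        ]
        failing = [i for i, s in enumerate(block_sims) if not in_range(s, block_range)]
        if failing:
            # Only the blocks outside the first threshold range are adjusted.
            for i in failing:
                ratios[i] = adjust(ratios[i], block_sims[i], block_range)
            continue  # back to S2 with the new ratios
        # S4: overall similarity over all blocks (mean is an assumption).
        overall = mean(block_sims)
        if in_range(overall, overall_range):
            return marked, ratios
        # The claim allows adjusting one or more blocks; this sketch adjusts all of them.
        ratios = [adjust(r, overall, overall_range) for r in ratios]
    raise RuntimeError("no embedding met both threshold ranges within max_rounds")
```

Keeping the three callables abstract lets the same loop drive any watermarking algorithm and any similarity model; whether the loop terminates depends entirely on how `adjust` moves the ratio relative to the similarity measure in use.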

2. The method according to claim 1, wherein adjusting the embedding ratio and/or position of the data watermark comprises:

When a data entry contains a single type of field, adjust the embedding ratio of the data watermark;

When the data entry contains multiple types of fields, adjust the embedding ratio and/or position of the data watermark.

3. The method according to claim 2, wherein said adjusting the embedding ratio of the data watermark comprises:

when the content block data watermark similarity is greater than the maximum value of the first threshold range, reducing the ratio of the data watermark embedded in the data entries of the content block to a preset ratio;

when the content block data watermark similarity is less than the minimum value of the first threshold range, increasing the ratio of the data watermark embedded in the data entries of the content block to a preset ratio;

when the overall similarity of the target data is greater than the maximum value of the second threshold range, reducing the ratio of the data watermark embedded in one or more content blocks to a preset ratio;

when the overall similarity of the target data is less than the minimum value of the second threshold range, increasing the ratio of the data watermark embedded in one or more content blocks to a preset ratio.
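The rules of claims 2 and 3 can be written down directly. The sketch below is illustrative only: the function names and the two preset parameters (`lower_preset`, `higher_preset`) are hypothetical, and the direction of each adjustment simply follows the wording of claim 3.

```python
from typing import Set, Tuple


def allowed_adjustments(field_types: Set[str]) -> Set[str]:
    """Claim 2: a single field type permits ratio changes only;
    several field types permit ratio and/or position changes."""
    if not field_types:
        raise ValueError("a data entry must contain at least one field type")
    return {"ratio"} if len(field_types) == 1 else {"ratio", "position"}


def adjust_ratio(
    similarity: float,
    bounds: Tuple[float, float],  # (minimum, maximum) of the threshold range
    current_ratio: float,
    lower_preset: float,          # hypothetical preset used when reducing
    higher_preset: float,         # hypothetical preset used when increasing
) -> float:
    """Claim 3: similarity above the range -> reduce the ratio to a preset;
    similarity below the range -> increase the ratio to a preset."""
    low, high = bounds
    if similarity > high:
        return lower_preset
    if similarity < low:
        return higher_preset
    return current_ratio  # already inside the range: leave the ratio unchanged
```

The same `adjust_ratio` function serves both the content block level (first threshold range) and the overall level (second threshold range), since claim 3 states the same rule for each.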

4. The method according to claim 2, wherein said adjusting the embedding position of the data watermark comprises:

removing the original data watermark from the data entry, and embedding a data watermark matching the field type into each type of field in the data entry according to the preset ratio;

evaluating the data entry similarity after the data watermark matching each field type has been embedded, selecting the position of the field that yields the highest data entry similarity as the optimal position for embedding the data watermark, and embedding the data watermark at the optimal position.
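The position search of claim 4 amounts to trying each candidate field in turn and keeping the one that scores best. A minimal sketch, assuming an entry is a dictionary of field name to string value, that the original watermark has already been removed, and that `embedders` and `entry_similarity` are hypothetical caller-supplied callables:

```python
from typing import Callable, Dict, Tuple

Entry = Dict[str, str]


def choose_embedding_field(
    clean_entry: Entry,                                # entry with the old watermark removed
    embedders: Dict[str, Callable[[str], str]],        # hypothetical: field name -> type-matched embedder
    entry_similarity: Callable[[Entry, Entry], float], # hypothetical: (before, after) -> similarity
) -> Tuple[str, Entry]:
    """Embed into each candidate field separately and return the field
    (and resulting entry) with the highest data entry similarity."""
    best_field, best_entry, best_score = None, None, float("-inf")
    for field, embed in embedders.items():
        if field not in clean_entry:
            continue
        trial = dict(clean_entry)  # copy so the clean entry is never modified
        trial[field] = embed(clean_entry[field])
        score = entry_similarity(clean_entry, trial)
        if score > best_score:
            best_field, best_entry, best_score = field, trial, score
    if best_field is None:
        raise ValueError("no candidate field is present in the entry")
    return best_field, best_entry
```

Each trial starts from the clean entry, so the scores of different fields are directly comparable and ties resolve to the first field listed in `embedders`.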

5. The method according to claim 2 or 4, wherein the field types in the data entry include any one or more of the following:

numeric fields, text fields, and natural language fields.

6. The method according to claim 5, wherein embedding a data watermark in the data entry comprises:

when the data entry includes a numeric field, embedding a numeric data watermark in the numeric field;

when the data entry includes a text field, embedding a character text data watermark in the text field;

when the data entry includes a natural language field, embedding a natural language data watermark in the natural language field.
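Claim 6 describes a dispatch by field type. The sketch below shows one way to structure it; the classification heuristic in `field_type` (a regular expression for numbers, more than one whitespace-separated word for natural language) is an assumption made for illustration, and the watermarking functions themselves are hypothetical and left to the caller.

```python
import re
from typing import Callable, Dict

_NUMERIC = re.compile(r"[+-]?\d+(\.\d+)?")


def field_type(value: str) -> str:
    """Heuristic classification (an assumption, not taken from the claims)."""
    if _NUMERIC.fullmatch(value.strip()):
        return "numeric"
    if len(value.split()) > 1:
        return "natural_language"
    return "text"


def embed_entry(
    entry: Dict[str, str],
    watermarkers: Dict[str, Callable[[str], str]],  # hypothetical: field type -> embedder
) -> Dict[str, str]:
    """Apply the watermarker that matches each field's type.
    Fields whose type has no registered watermarker are copied unchanged."""
    marked = {}
    for name, value in entry.items():
        embed = watermarkers.get(field_type(value))
        marked[name] = embed(value) if embed else value
    return marked
```

In a real schema the field types would more likely come from column metadata than from inspecting values; the heuristic only keeps the example self-contained.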

7. The method according to claim 1, wherein using the preset data similarity evaluation model to evaluate the data entry similarity of the data entry after the data watermark is embedded comprises:

when a numeric data watermark is embedded in a numeric field of the data entry, deconstructing and segmenting the numeric values before and after the data watermark is embedded, and evaluating the data entry similarity with a Euclidean distance vector data similarity evaluation model;

when a character text data watermark is embedded in a text field of the data entry, deconstructing the ASCII code values before and after the data watermark is embedded, and evaluating the data entry similarity with a cosine vector data similarity evaluation model;

when a natural language data watermark is embedded in a natural language field of the data entry, applying a vector space model to deconstruct and segment the natural language field before and after the data watermark is embedded, and evaluating the data entry similarity of the deconstruction and segmentation results with the cosine vector data similarity evaluation model.
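The three evaluations in claim 7 can be illustrated with plain Python. Several details below are assumptions, since the claim does not fix them: numbers are deconstructed into their decimal digits, vectors of unequal length are zero-padded, the Euclidean distance d is mapped to a similarity as 1 / (1 + d), and the vector space model is a simple whitespace-tokenised term-frequency vector.

```python
import math
from collections import Counter
from typing import List, Sequence, Tuple


def _pad(a: Sequence[float], b: Sequence[float]) -> Tuple[List[float], List[float]]:
    """Zero-pad the shorter vector so both have the same length (assumption)."""
    n = max(len(a), len(b))
    return list(a) + [0] * (n - len(a)), list(b) + [0] * (n - len(b))


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    a, b = _pad(a, b)
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0 or nb == 0:
        return 1.0 if na == nb else 0.0  # two empty vectors count as identical
    return dot / (na * nb)


def numeric_similarity(before: str, after: str) -> float:
    """Digit vectors compared by Euclidean distance, mapped to (0, 1]."""
    a, b = _pad([int(c) for c in before if c in "0123456789"],
                [int(c) for c in after if c in "0123456789"])
    distance = math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return 1.0 / (1.0 + distance)  # assumption: distance-to-similarity mapping


def text_similarity(before: str, after: str) -> float:
    """Character code vectors compared by cosine similarity."""
    return cosine([ord(c) for c in before], [ord(c) for c in after])


def natural_language_similarity(before: str, after: str) -> float:
    """Term-frequency vectors over a shared vocabulary, compared by cosine."""
    ta, tb = Counter(before.lower().split()), Counter(after.lower().split())
    vocab = sorted(set(ta) | set(tb))
    return cosine([ta[w] for w in vocab], [tb[w] for w in vocab])
```

Because character codes are all large positive numbers, the cosine of two code vectors stays close to 1 even when several characters differ, so a threshold range for text fields would need to be correspondingly narrow.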

8. The method according to claim 1, wherein the content block data watermark similarity evaluation is carried out according to the formula:

In the formula: δ represents the content block data watermark similarity; N represents the total number of data entries in the content block; C_i represents the data entry similarity of the i-th data entry.

9. The method according to claim 1, wherein the overall similarity of the target data is evaluated as follows:

In the formula: θ represents the overall similarity of the target data; M represents the total number of content blocks in the target data; δ_i represents the content block data watermark similarity of the i-th content block.
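The formulas referred to in claims 8 and 9 are not reproduced in this text. One form that is consistent with the symbol definitions given (a total count and a per-item similarity at each level) is the arithmetic mean; this is an assumption offered for orientation only, not the claimed formula:

$$
\delta = \frac{1}{N}\sum_{i=1}^{N} C_i, \qquad \theta = \frac{1}{M}\sum_{i=1}^{M} \delta_i
$$

Under this assumption, θ is a mean of block means, so blocks with few data entries carry the same weight in the overall similarity as blocks with many.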

10. A data watermark embedding system for target data, comprising:

an embedding module, configured to divide the target data into which a data watermark is to be embedded into a plurality of content blocks, and to embed the data watermark in the data entries of each content block;

a data entry similarity evaluation module, configured to evaluate the data entry similarity between each data entry before the data watermark is embedded and the same data entry after the data watermark is embedded;

a content block similarity evaluation module, configured to evaluate the content block data watermark similarity based on all data entries that make up the content block before the data watermark is embedded and all data entries that make up the content block after the data watermark is embedded; when the data watermark similarity of every content block meets a first threshold range, to invoke the overall similarity evaluation module; otherwise, to adjust the embedding ratio and/or position of the data watermark in each content block that does not meet the first threshold range, and to invoke the data entry similarity evaluation module;

an overall similarity evaluation module, configured to evaluate the overall similarity of the target data based on all content blocks that make up the target data before the data watermark is embedded and all content blocks that make up the target data after the data watermark is embedded; when the overall similarity of the target data meets a second threshold range, the embedding of the data watermark is completed; otherwise, the embedding ratio and/or position of the data watermark in one or more content blocks is adjusted, and the data entry similarity evaluation module is invoked.

11. The system according to claim 10, wherein the data entry similarity evaluation module is specifically used for: