CN106549674B - Data compression and decompression method for electronic medical records

Info

Publication number: CN106549674B
Application number: CN201610961205.5A
Authority: CN
Inventors: 于海龙; 李建元; 温晓岳
Original assignee: Enjoyor Co Ltd
Current assignee: Yinjiang Technology Co.,Ltd.
Priority date: 2016-10-28
Filing date: 2016-10-28
Publication date: 2019-07-23
Anticipated expiration: 2036-10-28
Also published as: CN106549674A

Abstract

The invention relates to a data compression and decompression method for electronic medical records. Characters in the GB2312 Chinese character set, normally addressed by an area code and a position code, are instead each represented by an area number, a segment number, and a bit number, and these three numbers are identified by ASCII characters. No requirement or assumption is made about which ASCII character corresponds to which number. A single Chinese character can thus be represented by three ASCII characters, and a Chinese character string is converted into an ASCII string. Compression and decompression are then performed by counting and encoding the ASCII characters. The method targets Chinese character compression, its compression efficiency does not depend on the repetition rate of the Chinese characters, and it reduces the computing load on the server; it is convenient to use and highly reliable.

Description

A data compression and decompression method for electronic medical records

Technical Field

The invention relates to the field of medical record data management, and in particular to a data compression and decompression method for electronic medical records.

Background

As hospital informatization matures and expands, the volume of electronic medical record information has grown to an unprecedented scale. The large-scale growth of this information, together with the continuous expansion of individual patients' records, has caused problems not only in data storage and retrieval but also in transmission, in applications such as remote diagnosis and treatment and cross-regional information sharing.

Compression is one of the effective means of improving efficiency when information is stored and transmitted, and various compression methods are in use. The most representative traditional method is the Huffman algorithm, which encodes on the basis of how often each individual character occurs in the information: high-frequency characters receive short codes and low-frequency characters receive long codes. This idea remained dominant until the late 1970s. Algorithms that compress information with static or dynamic dictionaries then broke this pattern. The most representative are LZ77 and its derivatives, which compress information using a dynamic dictionary based on a sliding window; their characteristic is that the higher the repetition rate of strings, the better the compression. The algorithms used by today's mainstream compression tools are all closely related to LZ77 and its derivatives.

At present, compression is rarely applied to electronic medical record data in China's electronic medical record systems. The main reason is that uniformly compressing the entire data set of such a system is impractical, because compressed data does not suit the continual insert, delete, and update operations of an online application system. A single medical record, meanwhile, contains little information: as few as a few dozen Chinese characters and at most a few thousand. Moreover, the character repetition rate in a patient's symptoms, examinations, diagnosis, treatment, and treatment results is not high, so neither Huffman compression nor LZ77 and its derivatives can perform well.

Many experts in China are also actively exploring the compression of Chinese characters. "Research on a General and Simple Chinese Text Compression Method", published in the Journal of South China Normal University (Natural Science Edition), 2001, No. 2, offers a distinctive approach. Its core idea is to convert Chinese characters to numeric values and then exploit the fact that, when such a value is stored in two bytes, "the highest 3 binary bits (the 14th, 15th, and 16th bits) are all 0", so these three bits are discarded before storage. The highlight of that scheme is that it needs no dictionary: through a functional relationship established in advance, the first-area numeric code of a Chinese character is obtained from its area-position code, which is simple and efficient. In this document, the compression ratio is the ratio of a file's size after compression to its size before compression; for example, if a 100 MB file is 90 MB after compression, the compression ratio is 90/100 × 100% = 90%. The compression ratio of that method should therefore be 13/16 × 100%, not the 3/16 × 100% claimed in the article. By comparison, the present invention provides a better compression ratio in electronic medical record systems.

"Arithmetic Coding with an Order-0 Model over a Dynamic Alphabet for Chinese Text", published in the Journal of Chinese Information Processing (2000, No. 14, Vol. 1), builds a statistical model from the analysis of existing texts and combines it with the PPM algorithm to compress Chinese characters. The method depends on a suitable statistical model: the model is built from a large amount of text and is then used to predict and compress the characters in a text, which is very effective. In an electronic medical record system, however, records are written by many authors with different writing habits, and symptom descriptions, diagnostic methods, and treatments differ from record to record, which makes such a model hard to apply.

The "Chinese Text Compression Method" of application No. 201410614796.X uses compression based on word-frequency statistics, which is essentially a simple application of the Huffman algorithm. The key point of that method is how to segment Chinese strings, which have no delimiters such as spaces, into words, yet the application does not explain this. Moreover, as noted above, the character repetition rate in the symptoms, examinations, diagnosis, treatment, and treatment results of a single record is not high, so even with correct segmentation from an effective word segmentation algorithm the text cannot be compressed effectively.

"A Method Suitable for Compressing Mixed-Coded Document Data", application No. 089126954 (from Taiwan, China), also uses Huffman coding. Its highlight is that it splits each Chinese single-character code into a high-order part and a low-order part, then gathers statistics and encodes these separately from English characters, producing an English code dictionary, a Chinese high-order code dictionary, and a Chinese low-order code dictionary. To tell the codes apart and prevent conflicts, flags are added when the character codes are output. The method focuses on handling mixed Chinese and English text and does not consider compression efficiency carefully: adding a flag inevitably affects the decoding of the Huffman codes, and avoiding that effect requires constraints on both the flags and the Huffman codes, which seriously reduces compression efficiency.

Summary of the Invention

To overcome the above shortcomings, the present invention aims to provide a data compression method for electronic medical records. For the scenario of electronic medical record data compression, it proposes a new compression method that targets Chinese characters, whose compression efficiency does not depend on the repetition rate of the Chinese characters and which compresses effectively even when no Chinese character repeats, while also reducing the computing load on the server.

Another object of the present invention is to provide a data decompression method for electronic medical records. This decompression method is the counterpart of the compression method described above; it is convenient to use and highly reliable.

The present invention achieves the above objects through the following technical solution. A data compression method for electronic medical records comprises the following steps:

(1) Create an N1×N2×N3 three-dimensional array arry_3rd whose three dimensions are area, segment, and bit, where N1 is the maximum number of areas, N2 is the maximum number of segments, and N3 is the maximum number of bits; fill the array with characters in order;

(2) Identify the areas, segments, and bits of arry_3rd with ASCII characters;

(3) Create the character identification table dic_character from arry_3rd and the identification result of step (2);

(4) Create a sort code table. Let the i-th code in the table be i₁i₂…iₙ and the j-th code be j₁j₂…jₙ…jᵣ, where n and r are the code lengths, 1 ≤ n < r, and i₁i₂…iₙ ≠ j₁j₂…jₙ; the total number of codes is the maximum of N1, N2, and N3;

(5) Using dic_character, convert all characters in the electronic medical record into a string of ASCII characters, then count the area, segment, and bit ASCII identifiers separately and sort each in descending order;

(6) Based on the result of step (5), select a compression method and generate the compression dictionary dic_output;

(7) Encode the ASCII characters that represent the Chinese characters according to dic_output, completing the compression.

Preferably, N1×N2×N3 is 20×20×20.

Preferably, in step (1) the array arry_3rd is filled in order by first placing the Chinese characters of the GB2312 code table into it and then filling the remaining space with ASCII characters and control characters.

Preferably, the character identification table dic_character of step (3) stores the one-to-one mapping between every character in arry_3rd and the ASCII codes of that character's area, segment, and bit identifiers; dic_character is created once, used many times, and updated at predetermined intervals.
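The array and the identification table can be sketched in Python as follows. This is only an illustration under stated assumptions: the 20 identifier characters are taken to be `0`-`9` and `a`-`j`, the fill order is plain GB2312 code order for areas 16 to 87 followed by the 128 ASCII characters, and the helper names `gb2312_hanzi` and `build_dic_character` are hypothetical. The patent's own assignment (its Table 2) is not reproduced here, so the triples produced below are not expected to match those in the embodiments.

```python
import string

SYMBOLS = string.digits + "abcdefghij"   # assumed: 20 identifier characters
N1 = N2 = N3 = len(SYMBOLS)              # 20 x 20 x 20 = 8000 slots


def gb2312_hanzi():
    """Yield the GB2312 Chinese characters of areas 16-87 in code order."""
    for area in range(16, 88):
        for pos in range(1, 95):
            try:
                yield bytes([0xA0 + area, 0xA0 + pos]).decode("gb2312")
            except UnicodeDecodeError:
                continue  # unassigned code position


def build_dic_character(chars):
    """Map each character to its three-character (area, segment, bit) identifier."""
    chars = list(chars)
    if len(chars) > N1 * N2 * N3:
        raise ValueError("more characters than slots in arry_3rd")
    table = {}
    for index, ch in enumerate(chars):
        area, rest = divmod(index, N2 * N3)   # which 20x20 block
        segment, bit = divmod(rest, N3)       # row and column inside the block
        table[ch] = SYMBOLS[area] + SYMBOLS[segment] + SYMBOLS[bit]
    return table


# Chinese characters first, then ASCII characters (including control characters).
all_chars = list(gb2312_hanzi()) + [chr(c) for c in range(128)]
dic_character = build_dic_character(all_chars)
reverse_table = {code: ch for ch, code in dic_character.items()}
```

Flattening the three-dimensional array into a running index keeps the mapping one-to-one, and the reversed dictionary is what a decompressor would use to turn each triple back into a character.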

Preferably, the counting and sorting in step (5) counts how many times each ASCII identifier occurs and sorts the identifiers from the highest count to the lowest.

Preferably, the compression method used in step (6) to generate the compression dictionary dic_output is selected according to the distribution of character occurrences, as follows:

I) If occurrences are concentrated on a few characters, binary-encode the sorted area, segment, and bit ASCII characters with the Huffman algorithm to generate the compression dictionary dic_output;

II) If the distribution is close to uniform, use the codes of the sort code table: map the sorted area, segment, and bit ASCII characters one-to-one onto the codes of the sort code table in order to generate the compression dictionary dic_output.

Preferably, whether the distribution is concentrated or uniform is judged by whether the four highest-ranked characters together account for more than 50% of all characters: if they do, the Huffman algorithm is used to generate the codes; otherwise the sort code table is used.
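The selection rule can be written as a short Python function. This is a minimal sketch: the function name `choose_method` and its return labels are hypothetical, and applying the rule to one identifier stream at a time (area, segment, or bit) is an assumption about how the criterion is meant to be used.

```python
from collections import Counter


def choose_method(identifiers, top=4, threshold=0.5):
    """Pick a coding method for one identifier stream (area, segment, or bit)."""
    ranked = Counter(identifiers).most_common()          # descending by count
    top_total = sum(count for _, count in ranked[:top])  # occurrences of the top 4
    if top_total > threshold * len(identifiers):
        return "huffman"
    return "sort_code_table"


print(choose_method("9999aaa8887"))   # concentrated on a few identifiers
print(choose_method("0123456789"))    # evenly spread identifiers
```

The first call reports `huffman` because the four most frequent identifiers cover 10 of 11 occurrences; the second reports `sort_code_table` because they cover only 4 of 10.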

A decompression method that is the counterpart of the data compression method described above comprises the following steps:

1) Using the compression dictionary dic_output, extract binary strings from the compressed file in sequence: the first extraction yields the binary code of the area identifier, the second the binary code of the segment number, and the third the binary code of the bit number;

2) Convert each extracted binary code into the corresponding ASCII identifier according to dic_output and append it to the string out1;

3) Scan the string out1 from left to right, three characters at a time, and look up the Chinese character corresponding to each group of three characters in the character identification table dic_character, completing the decompression.

Preferably, the compression dictionary dic_output and the character identification table dic_character used for decompression are the same as those used for data compression.

The beneficial effects of the present invention are as follows. For the scenario of electronic medical record data compression, it proposes a new compression approach that targets Chinese characters; its compression efficiency does not depend on the repetition rate of the Chinese characters, and the more Chinese characters there are, the more efficient the compression, while the computing load on the server is also reduced. A matching decompression method is provided as well, which is convenient to use and highly reliable.

Description of Drawings

FIG. 1 is a schematic flow chart of the data compression method of the present invention.

Detailed Description

The present invention is further described below with reference to specific embodiments, but its scope of protection is not limited to them:

Embodiment 1: As shown in FIG. 1, a data compression method for electronic medical records comprises the following steps:

Step 1: Build a 20×20×20 three-dimensional array arry_3rd whose three dimensions correspond to area, segment, and bit, where N1 is the maximum number of areas, N2 is the maximum number of segments, and N3 is the maximum number of bits.

Step 2: Fill in the characters in order; the characters include Chinese characters, ASCII characters, and control characters.

Specifically, the Chinese characters can be those of the GB2312 code table, and the remaining space is filled with other required characters such as ASCII characters and control characters.

Step 3: Identify the areas, segments, and bits of arry_3rd with ASCII characters.

Step 4: Create a character identification table dic_character that stores the one-to-one mapping between every character in arry_3rd and the ASCII codes of that character's area, segment, and bit identifiers.

Step 5: Create a sort code table, as shown in Table 1. Let the i-th code in the table be i₁i₂…iₙ and the j-th code be j₁j₂…jₙ…jᵣ, where n and r are the code lengths, 1 ≤ n < r, and i₁i₂…iₙ ≠ j₁j₂…jₙ; the total number of codes is the maximum of N1, N2, and N3.

Table 1

Step 6: Using the table dic_character generated in step 4, convert all Chinese characters and other characters in the electronic medical record into a string of ASCII characters, so that every Chinese character is represented by ASCII characters.

Step 7: For all Chinese characters and other characters in the electronic medical record, count the area, segment, and bit ASCII identifiers separately, that is, count how many times each ASCII character occurs, and sort from the highest count to the lowest.

Step 8: Based on the result of step 7, either map the sorted area, segment, and bit ASCII characters one-to-one onto the codes of the sort code table in order, or binary-encode the area, segment, and bit ASCII characters with the Huffman algorithm; either way the result is the compression dictionary dic_output. The choice depends on the distribution of character occurrences: if occurrences are concentrated on a few characters, use the Huffman algorithm to generate the codes; if the distribution is close to uniform, the sort code table is recommended. A simple criterion is whether the four highest-ranked characters together account for more than 50% of all characters: if so, use the Huffman algorithm; otherwise use the sort code table.

Step 9: Encode the ASCII characters that represent the Chinese characters according to the compression dictionary dic_output generated in step 8, completing the compression.
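For the Huffman branch of steps 7 to 9, the following Python sketch builds one code table per identifier position and then encodes the ASCII string. It is an illustration under assumptions: `huffman_codes`, `build_dic_output`, and `compress` are hypothetical names, `dic_output` is represented as a list of three dictionaries, and tie-breaking between equal counts is arbitrary, so the codes need not match those in the patent's tables.

```python
import heapq
from collections import Counter
from itertools import count


def huffman_codes(symbols):
    """Return {symbol: bit string} for a Huffman code over the given symbols."""
    freq = Counter(symbols)
    if len(freq) == 1:                        # degenerate case: one distinct symbol
        return {next(iter(freq)): "0"}
    tie = count()                             # tie-breaker so dicts are never compared
    heap = [(n, next(tie), {s: ""}) for s, n in freq.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        n0, _, left = heapq.heappop(heap)     # two least frequent subtrees
        n1, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (n0 + n1, next(tie), merged))
    return heap[0][2]


def build_dic_output(ascii_string):
    """One code table each for the area, segment, and bit identifiers."""
    return [huffman_codes(ascii_string[i::3]) for i in range(3)]


def compress(ascii_string, dic_output):
    """Concatenate codes, cycling through the area, segment, and bit tables."""
    return "".join(dic_output[i % 3][ch] for i, ch in enumerate(ascii_string))


sample = "41c4f7aj6"                          # a short identifier string for illustration
tables = build_dic_output(sample)
print(tables, compress(sample, tables))
```

Keeping a separate table for each position is what lets the same ASCII character receive different codes depending on whether it marks an area, a segment, or a bit.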

The decompression method corresponding to the above compression is as follows:

Step 1: Using the compression dictionary dic_output, extract binary strings from the compressed file in sequence: the first extraction yields the binary code of the area identifier, the second the binary code of the segment number, and the third the binary code of the bit number.

In the sort code table, the i-th code is i₁i₂…iₙ and the j-th code is j₁j₂…jₙ…jᵣ, where n and r are the code lengths, 1 ≤ n < r, and i₁i₂…iₙ ≠ j₁j₂…jₙ. No binary code in the compression dictionary is therefore a prefix of any other code, so the correct code is obtained unambiguously every time.

Step 2: Convert each extracted binary code into the corresponding ASCII identifier according to the compression dictionary and append it to the string out1.

Step 3: Scan the string out1 from left to right, three characters at a time, and look up the Chinese character corresponding to each group of three characters in the dic_character table, completing the decompression.
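The decoding steps and the prefix property can be sketched in Python as below. Table 1 is not reproduced in this text, so `SORT_CODES` is a hypothetical 20-entry table (twelve 4-bit codes and eight 5-bit codes) chosen only because it satisfies the stated condition; the function names and the representation of `dic_output` as three dictionaries are likewise assumptions.

```python
from collections import Counter

# Hypothetical sort code table: twelve 4-bit codes, then eight 5-bit codes.
SORT_CODES = [format(i, "04b") for i in range(12)] + \
             [format(24 + i, "05b") for i in range(8)]


def is_prefix_free(codes):
    """True if no code is a prefix of another (adjacent check on sorted codes)."""
    ordered = sorted(codes)
    return all(not b.startswith(a) for a, b in zip(ordered, ordered[1:]))


def table_from_ranking(symbols):
    """Assign sort codes to identifiers in descending order of frequency."""
    ranked = [s for s, _ in Counter(symbols).most_common()]
    return dict(zip(ranked, SORT_CODES))


def decode_bits(bits, dic_output):
    """Steps 1-2: rebuild the ASCII identifier string out1 from the bit string."""
    inverse = [{code: sym for sym, code in t.items()} for t in dic_output]
    out1, buffer, position = [], "", 0
    for bit in bits:
        buffer += bit
        table = inverse[position % 3]         # area, segment, bit in rotation
        if buffer in table:                   # unambiguous because codes are prefix-free
            out1.append(table[buffer])
            buffer, position = "", position + 1
    if buffer:
        raise ValueError("trailing bits do not form a complete code")
    return "".join(out1)


def decompress(bits, dic_output, reverse_table):
    """Step 3: look up each group of three identifiers in the character table."""
    out1 = decode_bits(bits, dic_output)
    return "".join(reverse_table[out1[i:i + 3]] for i in range(0, len(out1), 3))
```

With this particular table every identifier costs at most 5 bits, so a character never needs more than 15 bits; whether the patent's Table 1 has the same lengths is not known from the text.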

Embodiment 2: Because of space limits, and because the invention targets Chinese character compression, this embodiment uses one of the most common kinds of sentence in electronic medical records, 20 characters long, to illustrate the compression process:

该患者于四天前无明显原因出现左下胸部不适 (The patient developed discomfort in the lower left chest four days ago for no apparent reason.)

Step 1: Start from the already constructed character identification table dic_character. The ASCII identifiers it uses are shown in Table 2; the table is generated as described above, which is not repeated here.

Table 2

For each character of the sentence above, take its area, segment, and bit identifiers in turn to generate an ASCII string. In each group of three characters, the first is the area identifier, the second the segment identifier, and the third the bit identifier:

41c 4f7 aj6 a9d 8aj 8hd 7b3 967 6fi 9ab ac2 a63 30d 9ad b9e 99d 9f9 2d4 2d0 86b

Step 2: Count the occurrences of the area, segment, and bit identifiers separately and sort each in descending order, as shown in Table 3:

Table 3
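The counting in step 2 can be reproduced from the twenty three-character groups of this embodiment with a few lines of Python. This is a sketch: the helper name `rank_identifiers` is hypothetical, and the order among identifiers with equal counts is arbitrary and may differ from Table 3.

```python
from collections import Counter

ascii_string = ("41c 4f7 aj6 a9d 8aj 8hd 7b3 967 6fi 9ab "
                "ac2 a63 30d 9ad b9e 99d 9f9 2d4 2d0 86b")
triples = ascii_string.split()


def rank_identifiers(triples, position):
    """Count the identifiers at one position and sort by descending count."""
    return Counter(t[position] for t in triples).most_common()


for name, position in (("area", 0), ("segment", 1), ("bit", 2)):
    ranked = rank_identifiers(triples, position)
    top4_share = sum(n for _, n in ranked[:4]) / len(triples)
    print(name, ranked, f"top-4 share = {top4_share:.0%}")
```

For this sentence the four most frequent identifiers cover 70% of the area positions, 60% of the segment positions, and 55% of the bit positions.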

Step 3: The four highest-ranked characters account for more than 50% of the total in each of the three rankings, so the Huffman algorithm is used to generate the binary codes of the area, segment, and bit identifiers, which form the compression dictionary dic_output, as shown in Table 4:

Table 4

Step 4: Using the compression dictionary generated in step 3, convert the area, segment, and bit identifiers in turn to complete the compression.

The example uses 20 Chinese characters. Each Chinese character occupies 2 bytes and each byte is 8 bits, so the text occupies 320 bits before compression.

From the compression dictionary of step 3 and the statistics of step 2, the compressed text occupies 207 bits in total, so the compression ratio in this example is 207 ÷ 320 × 100% = 64.6875%.

The decompression process is the reverse of the above and is not repeated here.

Embodiment 3: Because of space limits, and because the invention targets Chinese character compression, this embodiment uses one of the most common kinds of passage in electronic medical records, about 200 characters long, to illustrate the compression process:

该患者缘于四天前无明显原因出现左下胸部不适，无咳嗽、咳痰，无畏冷发热。未引起注意，未行任何治疗。六小时前因进食凉菜后再次出现左下胸部不适，无腹泻及恶心、呕吐。未引起重视，自服行军散症状无缓解。就诊当地卫生所予口服药物及肌注液体(具体欠详)左胸部不适无明显好转且出现畏冷，肢体冰冷。经我院门诊检查心肌酶谱示谷草转氨酶及乳酸脱氢酶高。为进一步治疗急诊我院收入我科。发病以来，精神差，睡眠欠佳。大小便正常。 (The patient developed discomfort in the lower left chest four days ago for no apparent reason, with no cough or sputum and no chills or fever. It was ignored and no treatment was given. Six hours ago, after eating a cold dish, the discomfort in the lower left chest returned, without diarrhea, nausea, or vomiting. It was again disregarded, and self-administered Xingjun San brought no relief. The local clinic gave oral medication and an intramuscular injection (details unknown); the left chest discomfort did not clearly improve, and chills and cold limbs appeared. Outpatient testing at our hospital showed elevated aspartate aminotransferase and lactate dehydrogenase on the myocardial enzyme panel. The patient was admitted to our department through the emergency room for further treatment. Since onset, the patient has been in low spirits and has slept poorly. Urination and bowel movements are normal.)

Step 1: Start from the already constructed character identification table dic_character. The ASCII identifiers it uses are shown in Table 2; the table is generated as described above, which is not repeated here. For each character of the passage above, take its area, segment, and bit identifiers in turn to generate an ASCII string. In each group of three characters, the first is the area identifier, the second the segment identifier, and the third the bit identifier:

41c 4f7 aj6 aca a9d 8aj 8hd 7b3 967 6fi 9ab ac2 a63 30d 9ad b9e 99d 9f9 2d4 2d0 86b 09j 967 5ef 8bf 001 5ef 8ei 09j 967 948 60h 3f5 7h1 002 945 a6e 79d b5d a57 09j 945 9f1 7h7 4bb b33 641 002 66c 9ce 856 7b3 a63 57f 858 638 2dg 4d8 ae6 337 30d 9ad b9e 99d 9f9 2d4 2d0 86b 09j 967 412 9dj 4j9 3ea 9e9 001 726 905 002 945 a6e 79d b3d 870 09j b83 3jh 9f1 5cd 7jh b0j b6b 967 4f5 56d 002 5a6 b02 36c 38b 94h 845 8dc aa7 5fd 3jh a2a 974 4j9 4ie b5d a37 8h6 09f 5b4 8h6 7ba 9bd 09g b9e 9f9 2d4 2d0 86b 967 6fi 9ab 4b0 b61 7ch 30d 9ad 948 60h 09j b18 8h6 2b8 60h 002 58e 95f acf 6d8 b02 529 2f6 9e9 4ie 6cf 788 85j 46b 2ee b61 232 6cf 4j9 7ie 8c8 910 7dd 6cf 430 002 93f 57f a38 2d2 b33 641 4ja b02 95f acf 872 7ig 95f 5ed 002 3f5 2be a4d 5i9 09j 58c 83g 2fb 09j 89i 6ee 7ba 512 002 354 9ce 2a6 b0g 2ga 002

Step 2: Count the occurrences of the area, segment, and bit identifiers separately and sort each in descending order, as shown in Table 5:

Table 5

Step 3: The four highest-ranked characters do not account for more than 50% of the total, so the binary codes of the area, segment, and bit identifiers are taken from attached Table 2 (the sort code table) to generate the compression dictionary dic_output, as shown in Table 6:

Table 6

The compressed binary string has a total length of 2490 bits, so the compression ratio is 2490/3200 × 100% = 77.8125%.
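The two reported ratios follow directly from the definition used in this document (compressed size divided by original size, with 16 bits per Chinese character before compression). A small Python check, with the hypothetical helper name `compression_ratio`:

```python
def compression_ratio(compressed_bits, char_count, bits_per_char=16):
    """Compressed size divided by original size; smaller means better compression."""
    return compressed_bits / (char_count * bits_per_char)


print(f"{compression_ratio(207, 20):.4%}")     # Embodiment 2: 64.6875%
print(f"{compression_ratio(2490, 200):.4%}")   # Embodiment 3: 77.8125%

# Average code length per identifier (three identifiers per character).
print(207 / (20 * 3), 2490 / (200 * 3))
```

The last line gives 3.45 and 4.15 bits per identifier, both under the 5-bit average that the three-identifier representation needs in order to stay below 16 bits per character.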

In summary, the aim of the present invention is to provide a data compression method. In practical use, commonly used English medical terms and units of measurement, such as HBsAg and mmHg, are placed directly into blank segments and given identifiers; other individual ASCII characters occur so rarely that they have little effect on the overall compression. To achieve a high compression effect, in practice we periodically collect the high-frequency Chinese characters in the electronic medical records and store them together in segments of dic_character, and we add the version number of dic_character to the compressed binary string to prevent character mapping errors. In addition, commonly used Chinese medical terms, such as 先天性心脏病 (congenital heart disease) and 十二指肠 (duodenum), are placed as whole strings into blank segments and given identifiers. To make full use of the limited dic_character space, characters in GB2312 that are essentially never used are removed, such as the characters of areas 04 to 09 and radical characters such as 冫, 亠, and 讠.

GB2312 partitions its characters into 94 areas of 94 positions each, giving 8836 code points. The 3755 first-level Chinese characters occupy areas 16 to 55 and the 3008 second-level Chinese characters occupy areas 56 to 87; together with 682 other characters such as Latin and Greek letters, 7445 characters are included in total. To achieve lossless compression, the printable and control characters of the ASCII table must also be added, giving about 7545 characters altogether. Applying the Huffman algorithm directly to these characters clearly cannot achieve compression.

Converting each Chinese character into a transcode of two or more symbols reduces the number of distinct symbols that can occur in the compression source, which is one effective way to limit the growth of the compressed code length. Transcoding a single Chinese character, however, also expands it: the longer the transcode, the fewer distinct symbols can occur in the source and the shorter the average dictionary code, but the more uniform the symbol frequencies become and the lower the compression efficiency. Practice shows that converting each Chinese character into a three-symbol transcode gives the best result. To limit the length of the dictionary codes, this scheme uses 20 areas, each with 20 segments, each holding 20 bits (positions), which provides 20 × 20 × 20 = 8000 slots for the 7545 characters.

The 20 area, segment and bit identifiers are then encoded separately to achieve compression. Depending on the distribution of the area, segment and bit identifier characters, either the Huffman algorithm is used to generate the compression dictionary codes or the codes of Schedule 2 are used. With the Huffman algorithm, the code for the character ranked 10th will reach or exceed 5 bits. Because each Chinese character is represented by three symbols, compression is achieved only if the binary code per symbol averages no more than 5 bits (3 × 5 = 15 bits, against the 16 bits of a two-byte GB2312 character). Encoding characters ranked after 10th therefore expands them, and a net compression gain is possible only when the characters ranked after 10th account for no more than 25% of all occurrences, so that this expansion is offset. The decision rule is accordingly: if the four highest-ranked characters account for more than 50% of all characters, use the Huffman algorithm to generate the compression dictionary codes; otherwise use the codes of Schedule 2. The Schedule 2 codes are not optimal in every case, but they guarantee compression in all cases, which is also why the scheme uses 20 areas of 20 segments of 20 bits to hold the 7545 characters.
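The capacity arithmetic and the three-symbol transcoding described above can be sketched in Python as follows. The identifier alphabet (the letters A to T) and the linear slot numbering are assumptions made for illustration; the text only requires that areas, segments and bits be labeled with ASCII characters and does not fix which characters are used.

```python
import string

N = 20  # areas, segments per area, bits per segment

# Hypothetical identifier alphabet: the text does not prescribe which
# ASCII characters label the 20 areas, segments and bits.
IDENT = string.ascii_uppercase[:N]  # 'A' .. 'T'

GB2312_CODE_POINTS = 94 * 94            # 8836 code points
GB2312_CHARS = 3755 + 3008 + 682        # 7445 characters actually defined
TOTAL_CHARS = 7545                      # approximate total with ASCII, per the text
CAPACITY = N ** 3                       # 8000 slots in the 20 x 20 x 20 array


def index_to_triple(index: int) -> str:
    """Map a slot number (0 .. 7999) to its area/segment/bit identifiers."""
    if not 0 <= index < CAPACITY:
        raise ValueError("slot index out of range")
    area, rest = divmod(index, N * N)
    segment, bit = divmod(rest, N)
    return IDENT[area] + IDENT[segment] + IDENT[bit]


def triple_to_index(triple: str) -> int:
    """Inverse of index_to_triple."""
    area, segment, bit = (IDENT.index(c) for c in triple)
    return (area * N + segment) * N + bit


print(CAPACITY >= TOTAL_CHARS)
print(index_to_triple(0), index_to_triple(CAPACITY - 1))
```

Every character occupies one slot, so a Chinese string of k characters becomes an ASCII string of 3k identifiers drawn from at most 20 distinct symbols per position.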

In addition, the codes in the coding table of the present invention are constructed as follows. Three bits of 0 and 1 give 8 possibilities: "000, 001, 010, 011, 100, 101, 110, 111". Two of them, "000" and "011", are used as the 1st and 2nd codes. Appending one bit, 0 or 1, to each of the remaining 6 gives 12 possibilities: "0010, 0011, 0100, 0101, 1000, 1001, 1010, 1011, 1100, 1101, 1110, 1111". Five of these are used as the 3rd to 7th codes. Appending one bit, 0 or 1, to each of the remaining 7 gives 14 possibilities; "10000" is removed and the remaining 13 are used as the 8th to 20th codes. This ensures that areas, segments and bits can be correctly delimited during decoding. The codes have the following property: let the i-th code be $d_1 d_2 \dots d_{n_i}$ and the j-th code be $d'_1 d'_2 \dots d'_{n_j}$, where $n_i$ and $n_j$ are the code lengths and $1 \le n_i \le n_j$ for $i < j$; then $d_1 d_2 \dots d_{n_i} \ne d'_1 d'_2 \dots d'_{r_j}$ for every $1 \le r_j \le n_j$. In other words, no code is a prefix of another code.
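The construction above can be reproduced and checked in a few lines of Python. The text fixes "000" and "011" as the first two codes and removes "10000", but it does not say which five 4-bit codes are chosen; the sketch assumes the five that appear first in the source's own listing (0010, 0011, 0100, 0101, 1001), which leaves "1000" available to produce the removed "10000".

```python
from itertools import product

three_bit = ["".join(p) for p in product("01", repeat=3)]    # 8 candidates

rank_1_2 = ["000", "011"]                                    # stated in the text
rest3 = [c for c in three_bit if c not in rank_1_2]          # 6 left over
four_bit = [c + b for c in rest3 for b in "01"]              # 12 candidates

# Assumption: the five 4-bit codes are the first five listed in the source.
rank_3_7 = ["0010", "0011", "0100", "0101", "1001"]
rest4 = [c for c in four_bit if c not in rank_3_7]           # 7 left over
five_bit = [c + b for c in rest4 for b in "01"]              # 14 candidates

rank_8_20 = [c for c in five_bit if c != "10000"]            # 13 codes kept
code_table = rank_1_2 + rank_3_7 + rank_8_20                 # 20 codes, rank order


def is_prefix_free(codes):
    """True when no code is a proper prefix of a different code."""
    return not any(a != b and b.startswith(a) for a in codes for b in codes)


print(len(code_table), is_prefix_free(code_table))
```

Because the table is prefix-free, a decoder can read the bit stream left to right and emit an identifier as soon as the bits read so far match a code, with no separators between areas, segments and bits.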

The above describes specific embodiments of the present invention and the technical principles applied. Changes made according to the concept of the present invention, whose resulting functions do not go beyond the spirit covered by the description and the accompanying drawings, shall still fall within the protection scope of the present invention.

Claims

1. A data compression method for electronic medical records, characterized by comprising the following steps:

(1) Create an N1×N2×N3 three-dimensional array arry_3rd whose three dimensions are area, segment and bit, where N1 is the maximum number of areas, N2 is the maximum number of segments and N3 is the maximum number of bits; and fill the three-dimensional array arry_3rd with characters in sequence, the characters including Chinese characters, ASCII characters and control characters;

(2) Label the areas, segments and bits of the three-dimensional array arry_3rd with ASCII characters;

(3) Create a character identification table dic_character based on the three-dimensional array arry_3rd and the labeling result of step (2);

(4) Create a sorting code table in which the i-th code is $i_1 i_2 \dots i_n$ and the j-th code is $j_1 j_2 \dots j_n \dots j_r$, where n and r are the code lengths, $1 \le n < r$, and $i_1 i_2 \dots i_n \ne j_1 j_2 \dots j_n$; the total number of codes is the maximum of N1, N2 and N3;

(5) Convert all characters in the electronic medical record, with reference to the character identification table dic_character, into a string of ASCII characters, and count the ASCII character identifiers of the areas, segments and bits separately and sort each in descending order of count;

(6) Select a compression method according to the result of step (5) and generate a compression dictionary dic_output;

(7) Encode the ASCII characters converted from the Chinese text according to the generated compression dictionary dic_output to complete the compression.

2. The data compression method for electronic medical records according to claim 1, wherein N1×N2×N3 is preferably 20×20×20.

3. The data compression method for electronic medical records according to claim 1, wherein in step (1) the three-dimensional array arry_3rd is filled in sequence by first filling it with the Chinese characters of the GB2312 code table and then filling its remaining space with ASCII characters and control characters.

4. The data compression method for electronic medical records according to claim 1, wherein the character identification table dic_character of step (3) stores a one-to-one mapping between every character in the three-dimensional array arry_3rd and the ASCII codes of the area, segment and bit identifiers corresponding to that character; the character identification table dic_character is created once and used many times, and is updated at predetermined time intervals.

5. The data compression method for electronic medical records according to claim 1, wherein the counting and descending sort of step (5) count the number of occurrences of each ASCII character identifier and sort the identifiers from most to fewest occurrences.

6. The data compression method for electronic medical records according to claim 1, wherein in step (6) the compression method used to generate the compression dictionary dic_output is selected according to the distribution of character occurrences, specifically:

I) If the occurrences are concentrated on a few characters, use the Huffman algorithm to binary-encode the sorted result of each of the area, segment and bit ASCII characters, and generate the compression dictionary dic_output;

II) If the occurrences tend toward a uniform distribution, use the codes of the sorting code table, with the sorted results of the area, segment and bit ASCII characters placed in one-to-one correspondence with the code order of the sorting code table, and generate the compression dictionary dic_output.

7. The data compression method for electronic medical records according to claim 6, wherein whether the occurrences are concentrated or uniformly distributed is judged by whether the four highest-ranked characters account for more than 50% of the total number of characters: if they do, the Huffman algorithm is used to generate the codes; otherwise, the sorting code table is used.

8. A decompression method matching the data compression method according to claim 1, characterized by comprising the following steps:

1) With reference to the compression dictionary dic_output, extract binary codes from the compressed file in sequence: the first extraction yields the binary code of the area identifier, the second the binary code of the segment identifier, and the third the binary code of the bit identifier;

2) Convert each extracted binary code into the corresponding ASCII character identifier according to the compression dictionary dic_output, and append it to the string out1;

3) Scan the string out1 from left to right three characters at a time, look up the character corresponding to each group of three in the character identification table dic_character, and complete the decompression.

9. The decompression method according to claim 8, wherein the compression dictionary dic_output and the character identification table dic_character are the same as those used for data compression.
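As an informal illustration of how the compression steps of claim 1 and the decompression steps of claim 8 fit together, the following Python sketch runs a round trip on a toy input. It is a sketch under stated assumptions, not the claimed implementation: the five-entry DIC_CHARACTER table and its identifier triples are hypothetical, only the sorting-code-table branch is shown (no Huffman branch), and the 20-entry CODES list assumes one particular choice of the five 4-bit codes.

```python
from collections import Counter

# Assumed 20-entry prefix-free sorting code table, in rank order.
CODES = ["000", "011", "0010", "0011", "0100", "0101", "1001",
         "10001", "10100", "10101", "10110", "10111", "11000", "11001",
         "11010", "11011", "11100", "11101", "11110", "11111"]

# Hypothetical miniature dic_character: character -> area/segment/bit triple.
DIC_CHARACTER = {"心": "AAA", "脏": "AAB", "病": "ABA", "人": "BAA", "。": "BBC"}
LOOKUP = {triple: ch for ch, triple in DIC_CHARACTER.items()}


def build_dic_output(ascii_text):
    """One dictionary each for area, segment and bit identifiers."""
    dic_output = []
    for pos in range(3):
        # Rank the identifiers at this position by descending count.
        ranked = [c for c, _ in Counter(ascii_text[pos::3]).most_common()]
        dic_output.append({c: CODES[i] for i, c in enumerate(ranked)})
    return dic_output


def compress(text):
    """Transcode to ASCII identifiers, then encode them as a bit string."""
    ascii_text = "".join(DIC_CHARACTER[ch] for ch in text)
    dic_output = build_dic_output(ascii_text)
    bits = "".join(dic_output[i % 3][c] for i, c in enumerate(ascii_text))
    return bits, dic_output


def decompress(bits, dic_output):
    """Read area, segment and bit codes in turn, then look up each triple."""
    decoders = [{code: c for c, code in d.items()} for d in dic_output]
    out1, buf, pos = [], "", 0
    for b in bits:
        buf += b
        if buf in decoders[pos]:          # prefix-free, so first match is final
            out1.append(decoders[pos][buf])
            buf, pos = "", (pos + 1) % 3
    out1 = "".join(out1)
    return "".join(LOOKUP[out1[i:i + 3]] for i in range(0, len(out1), 3))


text = "心脏病人。心脏病"
bits, dic_output = compress(text)
print(len(bits), decompress(bits, dic_output) == text)
```

The same dic_output and dic_character must be available on both sides, as claim 9 requires; in this sketch they are simply passed between the two functions.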