CN102073663B

CN102073663B - A Method and Device for Quickly Processing XML Compressed Data

Info

Publication number: CN102073663B
Application number: CN200910238256.5A
Authority: CN
Inventors: 王晓磊; 仇睿恒; 张磊; 王毅
Original assignee: BEIDA FANGZHENG TECHN INST Co Ltd BEIJING; Peking University; Peking University Founder Group Co Ltd
Current assignee: New Founder Holdings Development Co ltd; Peking University; Peking University Founder Research and Development Center
Priority date: 2009-11-24
Filing date: 2009-11-24
Publication date: 2013-01-30
Anticipated expiration: 2029-11-24
Also published as: CN102073663A

Abstract

The invention provides a method and an interface for rapidly reading XML (Extensible Markup Language) compressed data. The XML compressed data is converted into mixed data type objects to generate a message sequence containing those objects, where the mixed data type consists of a data type tag plus a data storage part. Correspondingly, the invention provides a method and device for rapidly processing XML compressed data. The method comprises: generating a message sequence containing mixed data type objects by converting the XML compressed data into such objects; looking up the corresponding data processing logic according to the mixed data type objects read from the message sequence; converting the data in those objects into the type required by that logic; and passing the converted data to it. With the mixed data type and the streaming read interface, the original compressed data read for XML structure information and for certain types of data is passed on directly, which saves the computation and storage costs caused by string operations.

Description

A Method and Device for Quickly Processing XML Compressed Data

Technical Field

The invention relates to the field of XML data processing, and in particular to a method and device for rapidly processing XML compressed data.

Background Art

Many existing XML data compression methods, such as XGRIND and XPRESS, are based on the structure and data characteristics of XML. On the one hand, these methods avoid representing structure with the original XML container form and store the XML structure information directly in the compressed data, so that the compressed data can be queried directly. On the other hand, they group data of the same type or with similar characteristics and compress it together to achieve a higher compression ratio. When information is read from compressed data of this kind, in which structure and data are compressed separately, the usual practice is to first decompress the original XML data and then parse it with a known XML parsing method. For related content, see US Patent No. 7,013,425 B2, "Data processing method, and encoder, decoder and xml parser for encoding and decoding an xml document", and US Patent Application Publication No. US 2004/0225754 A1, "Method of compressing xml data and method of decompressing compressed xml data".

For example, the XGRIND compression method applies different compression methods to the metadata representing the XML structure (in the form of XML tags and attributes), to enumerated attribute values, and to general element/attribute values. Specifically, for metadata, each start tag is encoded as the character "T" followed by a uniquely assigned element ID, every end tag is encoded as the character "/", and each attribute name is encoded as the character "A" followed by a uniquely assigned attribute ID. Enumerated attribute values use dictionary encoding, and general element/attribute values use Huffman encoding (see http://dsl.serc.iisc.ernet.in/pub/xgrindicde02.pdf). When the structure information in XGRIND compressed data is read according to the existing method, the program first reads the characters and IDs representing the XML structure information, converts them in some way into the string representation of start tags, end tags or attribute names, and then parses the XML data in string form with a known XML parsing method to recover the XML structure information. XML data is read in the same way: it is first decompressed into a string representation, and that string is then parsed. Over the whole reading process, the additional computer resource consumption includes the computation needed during decompression to turn XML structure information and XML data into strings that conform to XML syntax, the storage for the decompressed XML data, the computation for string comparison and lookup in the XML parsing method, and the computation for parsing XML data in string form. This resource consumption lowers the running efficiency of the program.

As another example, the XPRESS compression method applies different compression methods to tags and to different types of data values. For tags, the path of an element is encoded as an interval within [0.0, 1.0], and path expressions are evaluated using the containment relationships between these intervals. For data values: among numeric values, integers use differential encoding and floating-point numbers use f32 encoding; among string values, enumerations use dictionary encoding and text data uses Huffman encoding (see http://islab.kaist.ac.kr/chungcw/InterConfPapers/sigmod2003_jkmin.pdf). When XPRESS compressed data is read according to the existing method, there is, in addition to the problems above, a problem with floating-point data. As noted, XPRESS encodes XML data of type xsd:float with f32 encoding, whose representation is the same as the IEEE single-precision floating-point format and which occupies 4 bytes in the compressed data. When data of this type is read according to the existing method, the XML Schema specification requires xsd:float data to be written in decimal form, so decompression involves converting from the IEEE single-precision representation to a decimal string, which takes many floating-point multiplications and additions. Inside the application, however, this XML data generally corresponds to a floating-point value in the computer, and on most computer systems that value is represented in exactly the IEEE single-precision format. XML parsing therefore involves converting the decimal string back into the IEEE single-precision representation, which again takes many floating-point multiplications and additions. The whole process thus contains two redundant conversions of data representation, and on common computer architectures floating-point operations are relatively expensive.
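The two redundant conversions can be illustrated with a minimal C sketch. It assumes that `float` is IEEE 754 single precision and that the 4 stored bytes are in the host's byte order; the function names are hypothetical and are not taken from XPRESS or from this patent.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Direct path: reinterpret the 4 stored bytes as a float with one copy.
 * Assumes host byte order and an IEEE 754 binary32 float. */
static float read_f32_direct(const uint8_t bytes[4])
{
    float value;
    assert(sizeof value == 4);
    memcpy(&value, bytes, sizeof value);
    return value;
}

/* Decompress-then-parse path: the value is first written out as decimal
 * text (decompression) and then parsed back into a float (XML parsing). */
static float read_f32_via_string(const uint8_t bytes[4], char *text, size_t cap)
{
    snprintf(text, cap, "%.9g", read_f32_direct(bytes)); /* binary -> decimal text */
    return strtof(text, NULL);                           /* decimal text -> binary */
}
```

Both functions return the same value for the same input; the second one simply pays for a formatting pass and a parsing pass that the first one avoids.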

As a further example, in US Patent No. 6,883,137 B1, "System and method for schema-driven compression of extensible mark-up language (XML) documents", the compression process first splits the input XML stream or DOM tree into two substreams, a structure stream and a data stream, then encodes the structure stream in WBXML (WAP Binary XML) format and compresses the data stream with a conventional compression algorithm (for example, GZIP). When the compressed data is read, the encoded structure stream is decoded, and SAX events are generated or a DOM tree is built, according to the SAX or DOM parsing method, to obtain the XML structure information; the compressed data stream is decompressed to obtain the XML data. In this method the structure stream is encoded in a binary format and is parsed directly into a SAX message sequence or DOM tree without being converted to a string representation, so the XML structure information can be used directly without string parsing. However, because the data stream is compressed with a conventional algorithm, it must still be decompressed into a string representation and then parsed as a string when read, so the computational cost of decompressing first and parsing afterwards remains.

Summary of the Invention

To overcome the above shortcomings of existing methods for reading XML compressed data, the invention provides a method and device for rapidly processing XML compressed data, so as to reduce the computer resources consumed when reading XML compressed data and to improve program running efficiency.

To achieve the above object, the invention provides a method for rapidly reading XML compressed data. The method generates a message sequence containing mixed data type objects by converting the XML compressed data into mixed data type objects, where the mixed data type represents the XML structure information and XML data in the XML compressed data in the form "data type tag + data storage part".

In the mixed data type, the data type tag identifies the data type and the data storage part stores the data. Specifically, for numeric data the data storage part directly stores the original compressed data, and for string data it stores a pointer to the address of the string in computer memory.

Correspondingly, a method for rapidly processing XML compressed data is provided. The method comprises: generating a message sequence containing mixed data type objects by converting the XML compressed data into mixed data type objects; looking up the corresponding data processing logic according to the mixed data type objects in the message sequence that has been read; and, according to those mixed data type objects, converting the data in each object into the type required by the data processing logic and passing the converted data to that logic.

The lookup step comprises: finding the corresponding data processing logic by numeric comparison on the mixed data type objects in the messages that carry tag names and attribute names.

The conversion step comprises: using the mixed data type object in a message that carries data to determine whether the data type in the message matches the data type required by the data processing logic; for a value that the mixed data type can represent directly, the mixed data type object already holds the required data and no conversion is performed; for a value that the mixed data type cannot represent, the object holds the original string value, which is converted into the data type required by the data processing logic.

Correspondingly, a streaming read interface for rapidly reading XML compressed data is provided. The interface generates a message sequence containing mixed data type objects by converting the XML compressed data into mixed data type objects.

Correspondingly, a device for rapidly processing XML compressed data is provided. The device comprises: a streaming read interface, which generates a message sequence containing mixed data type objects by converting the XML compressed data into mixed data type objects; and a message sequence processing unit, which, according to the mixed data type objects in the message sequence read from the streaming read interface, looks up the corresponding data processing logic, converts the mixed data in each object into the data required by that logic, and passes the converted data to that logic, wherein

the message sequence processing unit performs the following operations: finding the corresponding data processing logic by numeric comparison on the mixed data type objects in the messages that carry tag names and attribute names; using the mixed data type object in a message that carries data to determine whether the data type in the message matches the data type required by the data processing logic; for a value that the mixed data type can represent directly, using the data held in the mixed data type object as is, without conversion; and for a value that the mixed data type cannot represent, converting the original string value held in the object into the data type required by the data processing logic.

Through the mixed data type, the invention provides a streaming read interface for XML compressed data. In the mixed data type representation, the original compressed data that was read is passed on directly for XML structure information and for certain types of data. This saves the conversion of XML structure information and XML data into a string representation, the parsing of that string representation, and the storage consumed by decompressed XML data. When a mixed data type message sequence is processed, the string comparison and lookup operations of the prior art are replaced by numeric comparisons alone. Together, these measures reduce the computer resources consumed by string operations.

Brief Description of the Drawings

Fig. 1A is a schematic diagram showing one implementation structure of the mixed data type in the C language according to the invention;

Fig. 1B is a schematic diagram showing how several common data types are represented with the mixed data type;

Fig. 2A is a schematic flowchart of a method for reading XML compressed data according to the prior art;

Fig. 2B is a schematic diagram of an XML SAX message sequence processing method according to the prior art;

Fig. 3A is a schematic flowchart of the streaming read method for XML compressed data according to the invention;

Fig. 3B is a schematic flowchart of the mixed data type message sequence processing method according to the invention.

Detailed Description

The invention is mainly applied to XML compressed data in which the XML structure and the data are compressed separately. Its main technical idea is to provide, on the basis of a mixed data type, a streaming way of reading XML compressed data together with the corresponding data processing, so as to reduce the computer resources consumed by string operations in the decompress-then-parse process of reading XML compressed data and to improve program running efficiency. The method and device for processing XML compressed data according to the invention are described in detail below with reference to the drawings and embodiments.

The invention defines a mixed data type that represents the XML structure information and XML data in XML compressed data in the form "data type tag + data", so that data of several types can be passed directly between program interfaces and the use of the string type is reduced. In the mixed data type, the data type tag identifies the data type and the data storage part stores the data: for numeric data the data storage part directly stores the original compressed data, and for string data it stores a pointer to the address of the string in computer memory.

Fig. 1A is a schematic diagram showing one implementation structure of the mixed data type in the C language according to the invention. As shown in Fig. 1A, the mixed-type data structure has two parts: 101 is a data type tag 1 byte long, which can distinguish up to 256 different data types; 102 is the data storage part, in which concrete data types of several different sizes share 8 bytes of storage.

Fig. 1B lists how several common data types are represented with the mixed data type. Integer data uses N as its data type tag, and the integer value occupies 4 bytes of the data storage part. Double-precision floating-point data uses D as its data type tag and occupies all 8 bytes of the data storage part. For string data, the string length is not fixed, so the data storage part stores only a pointer of 4 bytes (32-bit operating system) or 8 bytes (64-bit operating system) that points to the string data stored at another location in computer memory.
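The layout of Figs. 1A and 1B could be written in C roughly as follows. This is a minimal sketch and not the patent's own listing: the tag letters N, D and S come from Fig. 1B, while the type name `MixedValue`, the `MV_*` constants and the `mv_*` helper functions are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Tag letters follow Fig. 1B; the C identifiers are illustrative. */
enum { MV_INT = 'N', MV_DOUBLE = 'D', MV_STRING = 'S' };

typedef struct {
    uint8_t type;            /* part 101: 1-byte data type tag */
    union {                  /* part 102: storage shared by the concrete types */
        int32_t     i32;     /* uses 4 of the 8 bytes */
        double      f64;     /* uses all 8 bytes */
        const char *str;     /* 4- or 8-byte pointer to the string data */
    } data;
} MixedValue;

static MixedValue mv_int(int32_t v)
{
    MixedValue m = { MV_INT, { 0 } };
    m.data.i32 = v;
    return m;
}

static MixedValue mv_double(double v)
{
    MixedValue m = { MV_DOUBLE, { 0 } };
    m.data.f64 = v;
    return m;
}

static MixedValue mv_string(const char *s)
{
    MixedValue m = { MV_STRING, { 0 } };
    assert(s != NULL);       /* the object borrows the string; it does not copy it */
    m.data.str = s;
    return m;
}
```

The union gives the shared 8-byte storage part on typical platforms. A compiler will usually pad the whole struct beyond 9 bytes for alignment, so the in-memory size is a property of the target rather than of the scheme itself.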

As Figs. 1A and 1B show, the mixed data type is, in terms of space, a piece of data of fixed length; in terms of time, passing a value in its specific data type directly is faster than first converting it to a string representation and then passing it. The mixed data type also occupies very few computer resources at run time.

Using the mixed data type, the invention provides a streaming read method for XML compressed data. The method converts the XML compressed data into mixed data type objects and thereby generates a message sequence containing mixed data type objects.

After the message sequence containing mixed data type objects has been read from the XML compressed data, the mixed data type message sequence is processed. This comprises: looking up the corresponding data processing logic according to the mixed data type objects in the message sequence that has been read, converting the mixed data into the data required by that logic, and passing the converted data to that logic. During lookup, the corresponding data processing logic is found by numeric comparison on the mixed data type objects in the messages that carry tag names and attribute names, rather than by string comparison and lookup as in the prior art, which saves computer resources. During conversion, the mixed data type object in a message that carries data is used to determine whether the data type in the message matches the data type required by the data processing logic. For a value that the mixed data type can represent directly, the object already holds the required data and no conversion is performed. For a value that the mixed data type cannot represent, the object holds the original string value, which is converted from a string to the required data type. This saves the computer resources that data type conversion would otherwise consume.
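The lookup and conversion described above can be sketched in C as follows. This is an illustration under stated assumptions, not the patent's implementation: the struct layout, the tag letters used here, the `ProductState` type, the functions `mixed_to_float` and `on_attribute`, and the attribute ID 2 standing for a price attribute are all hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

typedef struct {
    uint8_t type;                       /* 'N' = numeric ID, 'F' = float, 'S' = string */
    union { int32_t i32; float f32; const char *str; } data;
} MixedValue;

enum { ATTR_PRICE = 2 };                /* hypothetical ID known to the application */

typedef struct { float price; int price_seen; } ProductState;

/* Conversion: use the stored value directly when the tag already matches,
 * and fall back to string parsing only for string-typed values.
 * Returns 1 on success and 0 otherwise. */
static int mixed_to_float(const MixedValue *v, float *out)
{
    char *end;
    assert(v && out);
    if (v->type == 'F') {               /* required type already present */
        *out = v->data.f32;
        return 1;
    }
    if (v->type != 'S')                 /* a type this sketch does not handle */
        return 0;
    *out = strtof(v->data.str, &end);
    return end != v->data.str && *end == '\0';
}

/* Lookup: the processing logic is selected by comparing integers. */
static int on_attribute(ProductState *st, const MixedValue *name,
                        const MixedValue *value)
{
    assert(st && name && value && name->type == 'N');
    switch (name->data.i32) {
    case ATTR_PRICE:
        st->price_seen = mixed_to_float(value, &st->price);
        return st->price_seen;
    default:
        return 0;                       /* no processing logic for this ID */
    }
}
```

A `switch` on the numeric ID is only one way to do the lookup; a table indexed by ID would serve equally well. In both cases no `strcmp` is needed to select the handler.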

In the following, to explain in detail the advantages of the invention over the prior art, an embodiment of the prior art is first described with reference to Figs. 2A and 2B, and the XML compressed data processing method according to the invention and its improvements over the prior art are then described with reference to Figs. 3A and 3B.

In Fig. 2A, 201 and 202 are the tag name dictionary and attribute name dictionary used by the XML compression method, in which a numeric ID corresponds to one tag name or attribute name in the original XML structure; 203 is the enumeration value dictionary used by the XML compression method, in which a numeric ID corresponds to one enumeration value definition in the XML; 204 is the XML compressed data, which in this embodiment uses a compression format similar to XGRIND. As shown in 204, T01 marks the start of the node with ID 01; A01 denotes the attribute with ID 01, and the data that follows it is an attribute value of string type. The value after A02 is a single-precision floating-point value written directly into the XML compressed file as 4 bytes, which is 1234.99 when converted to a decimal floating-point number. Likewise, T02 marks the start of the node with ID 02; the E01 after A03 denotes the enumeration value with ID 01; and the "/" on the following line marks the end of node T02. The string data after T03 is the text data of that XML node, and the two "/" that follow mark the end of the nodes corresponding to T03 and T01 respectively.

Note that the XML compression method used in this embodiment does not compress the text data in the XML; the method described in this embodiment applies equally when the text is compressed.

Step 205 is the XML decompression process. It takes 201, 202, 203 and 204 as input data and decompresses them into the XML data shown in 206. The specific process is:

Read T01 in 204, look up the corresponding tag name in 201, and convert it into the string "<Product>";

Read A01 in 204, look up the corresponding attribute name in 202, then read the data that follows and convert it into the string "Name="Some Product""; in the same way, the next data in 204 is converted into the string "Price="1234.99"", which involves converting a single-precision floating-point value into its decimal representation;

Lines T02 and A01 in 204 are converted in the same way as lines T01 and A01;

Read A03 E01 in 204, look up the corresponding attribute name in 202 and the corresponding enumeration value in 203, and generate the string "Country="China"";

Read the "/" in 204 and convert it into the string "</Manufacturer>"; because node T02 contains only attribute values, the end of this node can also be abbreviated as "/>";

Read line T03 in 204 and the data after it, and convert them into a node with text data, "<Details>Product introduction...";

The last two "/" in 204 correspond to the ends of nodes T03 and T01 and are converted into "</Details>" and "</Product>".
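The string building performed by step 205 can be sketched in C as follows. The dictionary contents are the names quoted in the steps above; everything else is an assumption made for illustration: the function names, the two-decimal output format, and the choice to emit "<Product" without the closing ">" so that attributes can be appended.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Dictionaries 201, 202 and 203, indexed by numeric ID (index 0 unused). */
static const char *const TAG_NAMES[]   = { NULL, "Product", "Manufacturer", "Details" };
static const char *const ATTR_NAMES[]  = { NULL, "Name", "Price", "Country" };
static const char *const ENUM_VALUES[] = { NULL, "China" };

/* T<id>: one dictionary lookup plus a string copy. */
static int expand_start_tag(int tag_id, char *out, size_t cap)
{
    assert(tag_id >= 1 && tag_id <= 3);
    return snprintf(out, cap, "<%s", TAG_NAMES[tag_id]);
}

/* A<id> followed by 4 raw float bytes: lookup plus binary-to-decimal
 * formatting. Assumes host byte order and an IEEE 754 binary32 float. */
static int expand_float_attr(int attr_id, const uint8_t raw[4],
                             char *out, size_t cap)
{
    float value;
    assert(attr_id >= 1 && attr_id <= 3);
    memcpy(&value, raw, sizeof value);
    return snprintf(out, cap, " %s=\"%.2f\"", ATTR_NAMES[attr_id], value);
}

/* A<id> E<id>: two dictionary lookups plus string formatting. */
static int expand_enum_attr(int attr_id, int enum_id, char *out, size_t cap)
{
    assert(attr_id >= 1 && attr_id <= 3 && enum_id == 1);
    return snprintf(out, cap, " %s=\"%s\"", ATTR_NAMES[attr_id],
                    ENUM_VALUES[enum_id]);
}
```

Every call allocates no memory itself but writes text that a later parser has to scan again, which is the cost the rest of the description sets out to remove.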

Step 207 uses an existing XML SAX reading method to parse 206 into the SAX message sequence shown in 208. In 208, the values in all messages are passed as strings.

Fig. 2B shows how the prior art processes the SAX message sequence.

Step 211 receives a SAX message and dispatches it, according to the message type, to steps 212 to 214. Steps 212, 213 and 214 perform string matching on tag names and attribute names to find the corresponding data processing logic. Step 215 converts the string-form values produced by step 207 into the required data type, and in step 216 the specific processing logic uses the converted data.
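For comparison, a prior-art style attribute handler might look like the following C sketch. It is not tied to any real SAX library: the callback shape, the `ProductState` type and the attribute name "Price" are assumptions for illustration.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

typedef struct { float price; int price_seen; } ProductState;

/* Prior-art style handler: both the attribute name and its value arrive
 * as strings, so selecting the logic and obtaining the value each cost
 * a string operation. */
static void on_attribute_sax(ProductState *st, const char *name,
                             const char *value)
{
    assert(st && name && value);
    if (strcmp(name, "Price") == 0) {       /* steps 212 to 214: string match */
        st->price = strtof(value, NULL);    /* step 215: string to float      */
        st->price_seen = 1;                 /* step 216: logic uses the value */
    }
}
```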

From the description of Figs. 2A and 2B it can be seen that the existing method of processing XML compressed data has the following shortcomings: the computer resources consumed by the data format conversion and string operations in step 205; the storage occupied by the decompressed data 206; the computer resources consumed by the string comparisons in 212, 213 and 214; and the computer resources consumed in 215 by converting from string form to the required data type.

To address these shortcomings, the streaming read method for XML compressed data according to the invention is described below with reference to Figs. 3A and 3B, and the improvements of the invention over the existing method are analyzed.

Fig. 3A is a schematic flowchart of the streaming read method for XML compressed data according to the invention.

As shown in Fig. 3A, the only input data is the XML compressed data shown in 204; the dictionaries shown in 201, 202 and 203 are not needed. The output data 302 is similar to the SAX message sequence shown in 208 of Fig. 2A (the existing method), except that both XML structure and values are represented with the mixed data type rather than in the string form of 208.

Step 301 is the streaming read process for XML compressed data, which reads 204 directly into the message sequence shown in 302. The specific process is as follows:

T01 in 204 is converted directly into a startElement message, whose tag name is represented by a mixed data type object of type N with the numeric value 01. Likewise, A01 in 204 and the data after it are converted into an attribute message, in which the attribute name is a mixed data type object of type N with value 01 and the attribute value is a mixed data type object of type S whose value is the string "Some Product". A02 in 204 and the data after it are likewise converted into an attribute message whose attribute value is a mixed data type object of type F holding the single-precision floating-point number 0x449A5FAE; no conversion from floating-point number to string is needed here. The enumeration value E01 in 204 is converted directly into a mixed data type object of type E with value 01. The text data of node T03 in 204 is converted into a mixed data type object of type S whose value is the memory address of that string in the computer. The last two "/" in 204 are converted into two endElement messages, whose tag names are likewise numbers represented as mixed data type objects.
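A streaming reader of this kind could be sketched in C as follows. The byte layout is hypothetical and is described in the code comment; it is not the XGRIND format or the layout of Fig. 2A. The sketch also assumes well-formed input, leaves out text nodes, and does not attach the tag ID to endElement messages. All type and function names are illustrative.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef struct {
    uint8_t type;                       /* 'N', 'S', 'F' or 'E' */
    union { int32_t i32; float f32; const char *str; } data;
} MixedValue;

typedef enum { MSG_START_ELEMENT, MSG_ATTRIBUTE, MSG_END_ELEMENT, MSG_EOF } MsgKind;

typedef struct {
    MsgKind    kind;
    MixedValue name;                    /* tag or attribute ID, type 'N' */
    MixedValue value;                   /* attribute value (MSG_ATTRIBUTE only) */
} Message;

typedef struct { const uint8_t *p, *end; } Reader;

/* Hypothetical byte layout assumed by this sketch:
 *   'T' id              start element
 *   'A' id 'S' text NUL string attribute
 *   'A' id 'F' 4 bytes  float attribute, raw bytes in host order
 *   'A' id 'E' id       enumeration attribute
 *   '/'                 end element
 */
static Message next_message(Reader *r)
{
    Message m;
    memset(&m, 0, sizeof m);
    if (r->p >= r->end) { m.kind = MSG_EOF; return m; }
    switch (*r->p++) {
    case 'T':
        m.kind = MSG_START_ELEMENT;
        m.name.type = 'N';
        m.name.data.i32 = *r->p++;      /* ID is kept as a number */
        break;
    case 'A':
        m.kind = MSG_ATTRIBUTE;
        m.name.type = 'N';
        m.name.data.i32 = *r->p++;
        m.value.type = *r->p++;
        if (m.value.type == 'S') {      /* point at the string, no copy */
            m.value.data.str = (const char *)r->p;
            r->p += strlen(m.value.data.str) + 1;
        } else if (m.value.type == 'F') {   /* raw bytes, no decimal text */
            memcpy(&m.value.data.f32, r->p, 4);
            r->p += 4;
        } else {
            assert(m.value.type == 'E');
            m.value.data.i32 = *r->p++;
        }
        break;
    case '/':
        m.kind = MSG_END_ELEMENT;
        break;
    default:
        assert(!"unknown token");
        m.kind = MSG_EOF;
        break;
    }
    return m;
}
```

A caller would initialise a `Reader` with the start and end of the compressed buffer and call `next_message` until it returns `MSG_EOF`.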

As can be seen from the above, the tag names and attribute names that represent XML structure information are not converted into a string representation. They are converted into integer mixed data type objects that retain the original compressed data, namely the original numeric ID. The type of the structure information and the compressed structure information data are thus supplied directly to subsequent processing through the corresponding messages, and subsequent processing uses the XML structure information directly without parsing an XML string.

For the basic data types in XML, such as integer, floating-point, Boolean, and date, and for data types that the XML compression method handles specially, such as enumerations, the compressed raw data is not converted into a string after it is read. It is converted directly into a mixed data type object that retains the original compressed data (for an enumeration, the original numeric code of the enumeration value). Subsequent processing uses the raw data directly, with no string parsing.

For string data in XML, and for non-predefined types that the compression method does not handle specially, the compressed raw data is decompressed into a string after it is read and then converted into a mixed data type object. Subsequent processing parses the string data itself.
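The conversion described above can be illustrated with a minimal Python sketch. The class names `Mixed` and `Message`, the helper `float_from_raw`, and the big-endian byte order are assumptions made for illustration only; the description fixes only the "type tag + stored value" form and the tags N, S, F, and E. Only the first three messages of the example are built, because the full contents of 204 are not reproduced here, and the name number 2 for A02 is inferred from its label. A Python object reference stands in for the stored memory address of a string.

```python
import struct
from dataclasses import dataclass
from typing import Optional, Union


@dataclass(frozen=True)
class Mixed:
    """Hypothetical mixed data type object: a type tag plus the stored value."""
    tag: str                        # 'N' numeric id, 'S' string, 'F' float32, 'E' enum id
    value: Union[int, float, str]   # for 'S', Python keeps a reference to the string


@dataclass(frozen=True)
class Message:
    """Hypothetical message carrying mixed data type objects."""
    kind: str                       # 'startElement', 'attribute', 'endElement', ...
    name: Optional[Mixed] = None
    data: Optional[Mixed] = None


def float_from_raw(raw: bytes) -> Mixed:
    # Reinterpret the four raw bytes as an IEEE 754 single-precision value.
    # No decimal string is built or parsed along the way.
    # Big-endian order is an assumption of this sketch.
    return Mixed('F', struct.unpack('>f', raw)[0])


messages = [
    # T01 -> startElement, tag name kept as the number 01
    Message('startElement', name=Mixed('N', 1)),
    # A01 -> attribute, name number 01, value is a string
    Message('attribute', name=Mixed('N', 1), data=Mixed('S', 'Some Product')),
    # A02 -> attribute, value taken straight from the compressed float bytes
    Message('attribute', name=Mixed('N', 2),
            data=float_from_raw(bytes.fromhex('449A5FAE'))),
]

for m in messages:
    print(m)
```

In this sketch the float attribute never passes through a textual form, which is the saving the description attributes to type F objects.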

Fig. 3B is a schematic flowchart of the method for processing a mixed data type message sequence according to the present invention. As shown in Fig. 3B:

Step 311 receives a mixed data type message and dispatches it to step 312, 313, or 314 according to the message type;

Because tag names and attribute names are represented as numbers in the messages, steps 312, 313, and 314 find the corresponding data processing logic by comparing the numbers held in the mixed data type objects of the messages that carry the tag names and attribute names;

Step 315 uses the mixed data type object in a message that carries data to determine whether the data type in that message matches the required data type. For a value that the mixed data type can represent directly, the mixed data type object already holds the required data, so this step does nothing and step 216 uses the data directly. For a value that the mixed data type cannot represent, the mixed data type object holds the original string value, which must be converted into the data type required by the data processing logic;

In step 216, the processing logic that was found uses the converted data.
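The processing flow of Fig. 3B can be sketched as follows for attribute messages. This is a minimal illustration under stated assumptions, not the patented implementation: the handler table `ATTRIBUTE_HANDLERS`, the field names `title` and `price`, and the functions `convert` and `on_attribute` are all hypothetical. The lookup is keyed on the integer name, so no string comparison takes place, and conversion happens only when the stored type differs from the one the handler requires.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple, Union


@dataclass(frozen=True)
class Mixed:
    """Hypothetical mixed data type object: type tag plus stored value."""
    tag: str
    value: Union[int, float, str]


def convert(data: Mixed, wanted: str) -> Union[int, float, str]:
    """Sketch of step 315: convert only when the types do not match."""
    if data.tag == wanted:
        return data.value              # already the required data, nothing to do
    if data.tag == 'S' and wanted == 'F':
        return float(data.value)       # fallback: parse the original string
    if data.tag == 'S' and wanted == 'N':
        return int(data.value)
    raise TypeError(f"cannot convert {data.tag} to {wanted}")


received: List[Tuple[str, object]] = []

# attribute-name number -> (required type tag, processing logic)
ATTRIBUTE_HANDLERS: Dict[int, Tuple[str, Callable[[object], None]]] = {
    1: ('S', lambda v: received.append(('title', v))),
    2: ('F', lambda v: received.append(('price', v))),
}


def on_attribute(name: Mixed, data: Mixed) -> None:
    # Find the processing logic by the numeric name (integer key lookup).
    wanted, handler = ATTRIBUTE_HANDLERS[name.value]
    # Convert if needed, then hand the data to the processing logic.
    handler(convert(data, wanted))


on_attribute(Mixed('N', 1), Mixed('S', 'Some Product'))
on_attribute(Mixed('N', 2), Mixed('F', 1234.5))    # used as-is
on_attribute(Mixed('N', 2), Mixed('S', '7.25'))    # string fallback is parsed
print(received)
```

The third call shows the fallback path: a value that arrives as a string is parsed once, at the point where the processing logic needs it.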

As the description of Fig. 3A and Fig. 3B shows, the streaming method of the present invention for reading XML compressed data improves on the existing method as follows: the data shown in 201, 202, and 203 are eliminated; the data format conversion and string operations of step 205 are eliminated; the computer storage occupied by the strings in 206 is eliminated; the string comparisons of steps 212, 213, and 214 become numeric comparisons, which saves computing resources; and for most data types, step 315 avoids the data type conversion cost that step 215 incurs.

In addition, a comparison of Fig. 3B with Fig. 2B shows that the mixed data type message sequence processing flow of the present invention is similar to the existing XML SAX message sequence processing flow, so the XML SAX message handling in an existing application is easy to adapt to the XML streaming reading method of the present invention.

Correspondingly, the present invention provides a streaming read interface. The streaming read interface produces a message sequence containing mixed data type objects by converting XML compressed data into mixed data type objects. Its operation is the same as the streaming reading method described above, so the description is not repeated.

Compared with the SAX interface of the known XML reading methods, the streaming read interface improves the reading of XML structure information and XML data in XML compressed data as follows. After the structure information in the XML compressed data is read, it is not converted into a string; the type of the structure information and the compressed structure information data are provided directly to the user of the interface, who uses the XML structure information directly without parsing any XML string. For the basic data types in XML, such as integer, floating-point, Boolean, and date, and for data types that the XML compression method handles specially, such as enumerations, the program does not convert the compressed raw data into a string after reading it. It converts the data directly into the mixed data type and provides it through the streaming read interface to the user of the interface, who uses the raw data directly without string parsing. For string data in XML, and for non-predefined types that the compression method does not handle specially, the program decompresses the compressed raw data into a string after reading it, converts it into the mixed data type, and provides it through the streaming read interface to the user of the interface, who parses the string data.

Correspondingly, the present invention provides a device for quickly processing XML compressed data. The device comprises: a streaming read interface, which produces a message sequence containing mixed data type objects by converting XML compressed data into mixed data type objects; and a message sequence processing unit, which, according to the mixed data type objects in the message sequence read from the streaming read interface, finds the corresponding data processing logic, converts the mixed data in the mixed data type object into the data required by that data processing logic, and passes the converted data to that data processing logic. The operations of the streaming read interface and the message sequence processing unit are the same as the method steps described above with reference to Fig. 3A and Fig. 3B, so the description is not repeated.

As the above description of the embodiments shows, compared with existing methods of reading XML compressed data, the mixed data type allows the raw compressed data that was read to be passed on directly for XML structure information and for certain types of data, which reduces the use of computer storage resources. The specific benefits are as follows:

Streaming reading of the XML structure information avoids converting the structure information into strings and also avoids having an XML parser parse those strings. In addition, compressed structure information is generally numeric (an identifier), and numeric comparison is very cheap, which favors an efficient XML parser implementation. This reduces the computing time spent reading XML compressed data;

Streaming reading of the XML compressed data does not require first decompressing it into the XML string representation. The storage used by the whole reading process is independent of the size of the XML data, which reduces the computer storage used when reading XML compressed data;
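The storage property described above can be illustrated with a generator that decodes one record at a time from a byte stream. The record layout used here (one opcode byte, one-byte numeric names, a one-byte type tag, a one-byte string length, big-endian floats) is entirely hypothetical and is not the compressed format of the patent or of any named compressor; it exists only to make the sketch runnable. Truncated input is not handled.

```python
import io
import struct
from typing import BinaryIO, Iterator, Tuple

# Hypothetical opcodes for this sketch only.
START, ATTR, END = 0x01, 0x02, 0x03


def read_messages(stream: BinaryIO) -> Iterator[Tuple]:
    """Yield one message per record; only the current record is held in memory."""
    while True:
        op = stream.read(1)
        if not op:
            return                                   # end of input
        code = op[0]
        if code == START:
            yield ('startElement', stream.read(1)[0])    # numeric tag name
        elif code == ATTR:
            name = stream.read(1)[0]                     # numeric attribute name
            tag = stream.read(1).decode('ascii')         # type tag: 'F' or 'S'
            if tag == 'F':
                # Raw bytes become a float directly, without a string step.
                value = struct.unpack('>f', stream.read(4))[0]
            elif tag == 'S':
                length = stream.read(1)[0]
                value = stream.read(length).decode('utf-8')
            else:
                raise ValueError(f"unknown type tag {tag!r}")
            yield ('attribute', name, tag, value)
        elif code == END:
            yield ('endElement', stream.read(1)[0])      # numeric tag name
        else:
            raise ValueError(f"unknown opcode {code:#x}")


sample = (bytes([START, 1, ATTR, 1]) + b'S' + bytes([12]) + b'Some Product'
          + bytes([ATTR, 2]) + b'F' + bytes.fromhex('449A5FAE')
          + bytes([END, 1]))

msgs = list(read_messages(io.BytesIO(sample)))
for m in msgs:
    print(m)
```

Because the generator yields each message as soon as its record is decoded, a consumer that processes messages one by one never needs the whole document, in either compressed or textual form, in memory at once.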

The reading process for the basic data types in XML (for example, xsd:float data in XPRESS) and for data types that the XML compression method handles specially avoids the computing cost of repeated conversions between data representations;

The interface of the streaming reading method for XML compressed data is used in much the same way as the SAX method of the known XML reading methods, so it is easily made compatible with existing programs that use SAX parsing. Interfaces for other known XML reading methods, such as DOM and XPath, can conveniently be implemented on top of this interface.

The present invention has been described above with reference to embodiments. However, those skilled in the art should understand that the present invention is not limited to the disclosed embodiments. Any similar modification, substitution, or variation that does not depart from the basic principle of the present invention falls within the protection scope of the present invention, which is defined by the claims.

Claims

1. A method for quickly reading XML compressed data, characterized in that

a message sequence containing mixed data type objects is produced by converting the XML compressed data into mixed data type objects,

wherein the mixed data type represents the XML structure information and the XML data in the XML compressed data in the form "data type tag + data storage part"; in the mixed data type, the data type tag indicates the data type, and the data storage part stores the data.

2. The method according to claim 1, characterized in that, for numeric data, the original compressed data is stored directly in the data storage part, and for string data, a pointer to the address of the string in computer memory is stored in the data storage part.

3. A method for quickly processing XML compressed data, comprising:

producing a message sequence containing mixed data type objects by converting the XML compressed data into mixed data type objects;

finding the corresponding data processing logic according to the mixed data type object in the message sequence that is read;

converting, according to the mixed data type object in the message sequence that is read, the data of that object into the data type required by the data processing logic, and passing the converted data to the data processing logic,

4. The method according to claim 3, characterized in that, for numeric data, the original compressed data is stored directly in the data storage part, and for string data, a pointer to the address of the string in computer memory is stored in the data storage part.

5. The method according to claim 3, characterized in that the finding step comprises:

finding the corresponding data processing logic by comparing the numbers in the mixed data type objects of the messages that carry tag names and attribute names.

6. The method according to claim 3, characterized in that the converting step comprises:

determining, from the mixed data type object in a message that carries data, whether the data type in that message matches the data type required by the data processing logic;

for a value that the mixed data type can represent directly, the mixed data type object already holds the required data, and no conversion is performed;

for a value that the mixed data type cannot represent, the mixed data type object holds the original string value, and that string value is converted into the data type required by the data processing logic.

7. A device for quickly processing XML compressed data, comprising:

a streaming read interface, which produces a message sequence containing mixed data type objects by converting the XML compressed data into mixed data type objects;

a message sequence processing unit, which, according to the mixed data type object in the message sequence read from the streaming read interface, finds the corresponding data processing logic, converts the mixed data in that mixed data type object into the data required by the data processing logic, and passes the converted data to the data processing logic

8. The device according to claim 7, characterized in that, for numeric data, the original compressed data is stored directly in the data storage part, and for string data, a pointer to the address of the string in computer memory is stored in the data storage part.

9. The device according to claim 7, characterized in that the message sequence processing unit performs the following operations:

finding the corresponding data processing logic by comparing the numbers in the mixed data type objects of the messages that carry tag names and attribute names;