GB2451479A - Compressing markup language (e.g. XML) schema - Google Patents
Compressing markup language (e.g. XML) schema
- Publication number
- GB2451479A
- Authority
- GB
- United Kingdom
- Prior art keywords
- elements
- data
- tags
- schema
- markup language
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Withdrawn
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/80—Information retrieval; Database structures therefor; File system structures therefor of semi-structured data, e.g. markup language structured data such as SGML, XML or HTML
-
- G06F17/30908—
-
- H—ELECTRICITY
- H03—ELECTRONIC CIRCUITRY
- H03M—CODING; DECODING; CODE CONVERSION IN GENERAL
- H03M7/00—Conversion of a code where information is represented by a given sequence or number of digits to a code where the same, similar or subset of information is represented by a different sequence or number of digits
- H03M7/30—Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Document Processing Apparatus (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
A data file based on a markup language schema such as XML is encoded to reduce the file size. In the markup language schema, the information and metadata are arranged in a hierarchical structure comprising at least one data group, elements within the data group, and sub-elements within the elements. A text string representation of the markup language schema is produced which lists the data group, its elements and sub-elements in order. The bracketed descriptive text strings provided in the schema as both opening (e.g. <Datagroup1>, figure 1) and closing tags are replaced by alphanumeric, alphabetic or numeric opening tags (e.g. MD01, figure 2a). Furthermore, each opening tag additionally implies the close of the preceding data group, element or sub-element, so that closing tags are not generally provided. By these measures compression is achieved and the size of the data file is reduced. Also included are embodiments concerning processes for transmitting and receiving data.
Description
IMPROVEMENTS IN OR RELATING TO A MARKUP LANGUAGE SCHEMA
The present invention relates to a text string representation of a markup language schema, to a method of encoding a data file based on a markup language schema to reduce the file size, to apparatus for encoding a data file based on a markup language schema to reduce the file size, and to processes for transmitting and receiving data.
XML, Extensible Markup Language, is a specification developed by the W3C. It is a pared-down version of SGML (Standard Generalised Markup Language), designed especially for web documents. It provides a hierarchical structure for organising and tagging a document to enable the definition, transmission, validation, and interpretation of data between applications.
XML is generally considered to be the appropriate language for presenting information and metadata to be shared across networks, but it has the disadvantage that it is verbose in nature.
Schemes have been proposed for compressing XML. There have also been proposals to map XML tags to fixed values in a table in order to reduce the size of text strings.
The present invention seeks to reduce the size of a data file based on a markup language schema.
According to a first aspect of the present invention there is provided a text string representation of a markup language schema which defines a structure for carrying information and metadata, which information and metadata is identified within the schema by appropriate tags, wherein, in the text string representation, elements of information and metadata are not identified by bracketed descriptive text strings provided as both opening and closing tags but by alphanumeric, alphabetical or numeric opening tags.
As the usual bracketed descriptive text strings have been replaced by alphanumeric, alphabetical or numeric opening tags, which are smaller in size than the bracketed descriptive text strings, the size of the data file comprised of the text string representation of the schema is reduced.
In the markup language schema, the information and metadata is arranged in a hierarchical structure comprising at least one data group, elements within the data group, and sub-elements within the elements. In a preferred embodiment, the text string representation lists the data group and its elements and sub-elements in order, and the alphanumeric, alphabetical or numerical opening tag identifying each data group, element or sub-element additionally implies the close of the preceding data group, element or sub-element.
Usual markup language schema, for example the XML structure, has bracketed descriptive text strings provided in pairs to act as both opening tags and closing tags. Where an opening tag of one element is, as in this embodiment of the invention, used to also imply the end of the preceding element, there is a reduction in file size.
It has been established that closing tags are only necessary to identify the end of the encoded data in an entire schema.
In the markup language schema, the information and metadata is arranged in a hierarchical structure comprising at least one data group, elements within the data group, and sub-elements within the elements. In one embodiment, the text string representation lists the data group and its elements and sub-elements in order, and preferably alphanumeric tags are used to identify data groups and numeric tags are used to identify elements and sub-elements.
For example, alphanumeric tags identifying data groups may comprise initials and a sequentially increasing number.
Additionally and/or alternatively, sequentially increasing unique numbers may be used to identify individual elements and sub-elements.
In a preferred embodiment, a space separator is provided between an opening tag and the identified data.
Preferably, the number of available states of each alphanumeric, alphabetical or numeric opening tag must exceed the number of states to be used.
The present invention also extends to a method of encoding a data file based on a markup language schema to reduce the file size, where the schema defines a structure for carrying information and metadata which is identified within the schema by appropriate tags, where the elements of information and metadata in the schema are identified by bracketed descriptive text strings provided in pairs as opening and closing tags, the method comprising omitting the closing tags.
As closing tags are omitted, the file size is reduced.
In an embodiment, the only closing tag retained is a closing tag identifying the end of a schema.
The method preferably comprises replacing the bracketed descriptive opening tags with alphanumeric, alphabetical or numeric opening tags.
In the markup language schema the information and metadata is arranged in a hierarchical structure comprising at least one data group, elements within the data group, and sub-elements within the elements. In an embodiment, the method comprises listing the data group and its elements and sub-elements in order, and replacing the bracketed descriptive opening tags identifying data groups with alphanumeric tags, and replacing the bracketed descriptive opening tags identifying elements and sub-elements with numeric tags.
Preferably, the method comprises incorporating a space separator between an opening tag and the identified data.
Preferably, the encoded data file is formed into a string for transmission.
The formed string may be compressed for transmission by any appropriate compression scheme.
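The text leaves the choice of compression scheme open. As a minimal sketch, assuming DEFLATE via Python's standard `zlib` module (an illustrative choice, not one named in the specification) and a hypothetical message string, the formed string could be compressed and restored as follows:

```python
import zlib

# Hypothetical encoded message string; the layout is an assumption for illustration.
message = "SchemaName MD01 1 Alice 2 2007-07-31 MD02 5 42 /SchemaName"

# Compress the formed string before transmission (assumed scheme: DEFLATE).
compressed = zlib.compress(message.encode("utf-8"))

# The receiver reverses the step to recover the formed string.
restored = zlib.decompress(compressed).decode("utf-8")
```

For very short strings a general-purpose compressor may add more overhead than it saves, so this step is likely to pay off only for larger messages.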
In an embodiment, the encoding is undertaken by a client parser and a server encoder.
The present invention also extends to apparatus for encoding a data file based on a markup language schema to reduce the file size, where the schema defines a structure for carrying information and metadata which is identified within the structure by appropriate tags, where the elements of information and metadata in the schema are identified by bracketed descriptive text strings provided in pairs as opening and closing tags, the apparatus comprising a client parser to identify the defined schema and the opening and closing tags, and a server encoder to replace the identified opening and closing tags with alphanumeric, alphabetical, or numeric opening tags.
According to a further aspect of the present invention, there is provided a process for transmitting data, the process comprising encoding the data to form a data file based on a markup language schema, encapsulating the data file for transport, and transmitting the encapsulated data file, wherein the process further comprises reducing the file size of the data file before it is encapsulated using a method as defined above.
The invention also extends to a process for receiving data, the process comprising receiving and de-encapsulating an encapsulated and encoded data file to retrieve the encoded data file, and decoding the data file to obtain the data, wherein the data file had been encoded based on a markup language schema and reduced in size using a method as defined above, and wherein the data is decoded based on the markup language schema.
Embodiments of the present invention will hereinafter be described, by way of example, by reference to the accompanying drawings, in which:
Figure 1 shows the structure of a general XML metadata schema,
Figures 2a and 2b together illustrate a partially populated version of a text string representation of the markup language schema showing the reduced size of the data file,
Figure 3 shows the encoded message string developed from the text string representation of Figures 2a and 2b, and
Figure 4 illustrates processes for transmitting and receiving data.
The invention is described further by reference to an XML metadata schema. However, the method described herein is not specific to XML language and the reduced size data files produced by the invention may be formed for any markup language having a hierarchical structure and employing opening and closing tags. Such markup languages will generally be based on the W3C rules.
Figure 1 shows the hierarchical structure of the XML schema for carrying information and metadata. Thus, the hierarchical structure comprises at least one data group 10 within which there are one or more elements 12. The elements 12 may contain one or more sub-elements 14. It will be seen that each data group 10, each element 12, and each sub-element 14 is tagged by a pair of bracketed descriptive text strings which form an opening tag and a closing tag. So, for example, the data in a data group is incorporated between opening tag <Datagroup2> and a closing tag </Datagroup2>.
Because of the size of the individual descriptive tags and because of the provision of both opening and closing tags for each element of data, the XML metadata schema shown in Figure 1 is verbose. The size of the data file can be reduced as shown in Figure 2a. Thus, Figure 2a shows a text string representation of the schema of Figure 1. In the representation of Figure 2a, the data group, the elements, and the sub-elements are listed in order. This enables closing tags to be generally omitted from the text string representation.
As also shown in Figure 2a, each bracketed descriptive text string such as <Datagroup2> has been replaced by an alphanumeric, alphabetical or numeric opening tag. Thus, the <Datagroup1> tag of the schema of Figure 1, for example, has been replaced by the opening tag 16, MD01. In Figure 2a there is also shown a closing tag 18, (/MD01).
In the example shown in Figure 2a, alphanumeric tags such as 16 are used to identify data groups. The alphanumeric tags comprise initials and a sequentially increasing number. In the example illustrated, the metadata in the first data group 10 has the opening tag MD01, and the number nn in the sequence MDnn increases sequentially. The elements and sub-elements are identified by numeric tags and, as is apparent from Figure 2a, sequentially increasing unique numbers are used to identify individual elements 12 and individual sub-elements 14.
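The tagging convention described above can be sketched as follows. This is only an illustration under stated assumptions: the function name `assign_tags`, the parameters `initials` and `width`, and the descriptive names in `schema` are hypothetical, and the sketch assumes descriptive names are unique across the schema.

```python
def assign_tags(groups, initials="MD", width=2):
    """Map descriptive names to short opening tags.

    groups: list of (group_name, [element_or_sub_element_name, ...]),
    listed in schema order. Data groups receive initials plus a
    sequentially increasing number; elements and sub-elements receive
    sequentially increasing unique numbers.
    """
    tags = {}
    next_number = 1
    for index, (group_name, members) in enumerate(groups, start=1):
        # Alphanumeric tag for the data group, e.g. MD01, MD02, ...
        tags[group_name] = f"{initials}{index:0{width}d}"
        for member in members:
            # Numeric tag, unique across the whole schema.
            tags[member] = str(next_number)
            next_number += 1
    return tags


# Hypothetical schema used only for illustration.
schema = [
    ("Datagroup1", ["Element1", "SubElement1"]),
    ("Datagroup2", ["Element2"]),
]
tags = assign_tags(schema)
```

The same mapping would have to be held at both ends of the system so that the encoder and decoder agree on what each short tag means.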
It is important that the number of variable states available for each tag is greater than the number used to allow for extensibility. Leading zeros can be utilised if necessary.
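As a worked illustration of this headroom requirement, the sketch below picks the smallest decimal tag width whose available states exceed the number in use. It assumes decimal numbering starting at 1, so a tag of `width` digits offers 10^width - 1 states; the function names are hypothetical.

```python
def tag_width(used_count):
    """Smallest number of decimal digits whose available states
    (numbering from 1, so 10**width - 1 of them) exceed used_count."""
    width = 1
    while 10 ** width - 1 <= used_count:
        width += 1
    return width


def format_group_tag(number, width, initials="MD"):
    """Render a data group tag with leading zeros, e.g. MD03."""
    return f"{initials}{number:0{width}d}"


width_for_nine = tag_width(9)      # 9 states are not more than 9 used, so 2 digits
example_tag = format_group_tag(3, width_for_nine)
```

With nine groups in use, a single digit would leave no room for extension, so the sketch moves to two digits and pads with a leading zero.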
It would be possible to use any format of alphanumeric, alphabetical and numeric tags to reduce the size of the data file. However, the use of a different format tag for data groups and for elements within each group gives robustness to the scheme as it provides the hierarchical structure. Of course, identical referencing between the metadata elements and the encoding tags must be used in a client parser and in a server encoder so that a consistently maintained schema is developed and used at both ends of the system.
Figure 2b exemplifies data 20 which is to be used to partially populate the text string representation of Figure 2a. Together Figures 2a and 2b illustrate a partially populated text string representation of the markup language schema of Figure 1.
Figure 3 shows the formation of the text string representation of Figures 2a and 2b into a message string for transmission. It will be seen here that, in the message string, the only pair of start and end tags are SchemaName 22 and /SchemaName 24, which identify the start and end of the encoded data.
Otherwise closing tags are not used and are not required. The provision of an opening tag implies the end of the preceding group or element.
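To make the size reduction concrete, the following sketch converts a small XML fragment into a message string of this general shape. The exact layout of Figure 3 is not reproduced here, so the space-separated token layout, the function name `compact`, the sample XML and the `tag_map` are all assumptions for illustration. The sketch also assumes the data contains no spaces.

```python
import xml.etree.ElementTree as ET


def compact(xml_text, tag_map):
    """Replace descriptive tag pairs with short opening tags only.

    The schema name is kept as the single start/end pair; every other
    closing tag is dropped because the next opening tag implies it.
    """
    root = ET.fromstring(xml_text)
    tokens = [root.tag]
    for node in root.iter():          # document order, parents before children
        if node is root:
            continue
        tokens.append(tag_map[node.tag])
        text = (node.text or "").strip()
        if text:
            tokens.append(text)       # space separator between tag and data
    tokens.append("/" + root.tag)
    return " ".join(tokens)


xml_text = (
    "<SchemaName><Datagroup1><Element1>Alice</Element1>"
    "<Element2>Bob</Element2></Datagroup1></SchemaName>"
)
tag_map = {"Datagroup1": "MD01", "Element1": "1", "Element2": "2"}
message = compact(xml_text, tag_map)
```

For this input the sketch yields `SchemaName MD01 1 Alice 2 Bob /SchemaName`, which is well under half the length of the XML fragment.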
Run-time encoding and decoding algorithms generally require the parser to remain in sync with the data stream and the data to be encoded in a specific order defined by the schema. However, in this case, the alphanumeric tag, for example MD02, also allows complete metadata data groups (defined by the MDnn and /MDnn tags) to be omitted from an encoded sequence. Also, elements at any level may be omitted provided that their child elements are not included, and provided that overall the parent-child relationship is maintained in the encoding and decoding.
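A decoder for such a stream might look like the sketch below. It is a simplified illustration resting on several assumptions not stated in the text: tokens are separated by single spaces, a data group tag is upper-case initials followed by digits and carries no data of its own, and every element tag is followed by exactly one data token. The name `decode_message` is hypothetical, and sub-element nesting is not modelled; it would be recovered from the shared schema.

```python
import re

# Assumed shape of a data group tag: initials followed by digits, e.g. MD01.
GROUP_TAG = re.compile(r"^[A-Z]+\d+$")


def decode_message(message, schema_name):
    """Rebuild {group_tag: {element_tag: data}} from a message string.

    Each group tag implicitly closes the previous group, so no closing
    tags are needed, and groups absent from the stream are simply skipped.
    """
    tokens = message.split(" ")
    if tokens[0] != schema_name or tokens[-1] != "/" + schema_name:
        raise ValueError("missing schema start or end tag")
    body = tokens[1:-1]
    groups = {}
    current = None
    i = 0
    while i < len(body):
        token = body[i]
        if GROUP_TAG.match(token):
            # Opening a new group implies the close of the preceding one.
            current = groups.setdefault(token, {})
            i += 1
        else:
            if current is None:
                raise ValueError("element tag before any data group")
            if i + 1 >= len(body):
                raise ValueError("element tag without data")
            current[token] = body[i + 1]
            i += 2
    return groups


# MD02 and MD03 are omitted entirely; the decoder does not lose sync.
decoded = decode_message("SchemaName MD01 1 Alice MD04 7 42 /SchemaName", "SchemaName")
```

Because each tag names what follows it, the decoder can tell which groups were sent without relying on their position in the stream.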
The closing tags shown in Figure 2a in brackets, for example /MD02, would be implied by the beginning of the next data grouping of the metadata or by the end of the schema and would not be carried in the encoded data.
Space separators 26 are necessary between encoded tags and actual data in the encoded metadata to prevent any possible problems in the parser.
Spaces are therefore not allowed in any of the tag names or strings to be encoded unless they use standard escape sequences.
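The text does not say which escape sequences are meant. As one possible reading, the sketch below assumes percent-encoding via Python's standard `urllib.parse` module, so that a space inside a data value becomes `%20` and cannot be mistaken for a separator. The helper names are hypothetical.

```python
from urllib.parse import quote, unquote


def escape_data(value):
    """Percent-encode a data value so it contains no literal spaces
    (assumed escape scheme; the source does not name one)."""
    return quote(value, safe="")


def unescape_data(token):
    """Reverse escape_data on the receiving side."""
    return unquote(token)


escaped = escape_data("Hello World")
round_trip = unescape_data(escaped)
```

Since the percent sign itself is also encoded, a value that already contains `%20` survives the round trip unchanged.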
It will be seen that, in the example of the text string representation shown in Figures 2a and 2b, the data group MD03, for example, has no sub-elements and no data. Groups or elements without data, such as the data group MD03 and the elements 4 and 10, may be omitted from the string, as also indicated by Figure 3.
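The omission of unpopulated entries can be sketched as a filtering pass before the message string is formed. The data layout, the function name `drop_empty` and the sample values are assumptions for illustration; the sketch treats each group as a flat list of elements and does not model deeper nesting.

```python
def drop_empty(groups):
    """Remove elements without data and groups left with no elements.

    groups: list of (group_tag, [(element_tag, data_or_None), ...]).
    """
    kept = []
    for group_tag, elements in groups:
        populated = [(tag, data) for tag, data in elements if data]
        if populated:
            kept.append((group_tag, populated))
    return kept


# Hypothetical partially populated representation.
groups = [
    ("MD01", [("1", "Alice"), ("4", None)]),
    ("MD03", []),
    ("MD04", [("10", None), ("11", "x")]),
]
pruned = drop_empty(groups)
```

Here element 4, element 10 and the whole of group MD03 drop out, leaving only tags that carry data.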
Figure 4 illustrates processes for transmitting and receiving data. In Figure 4 data 30 is encoded by an encoder 32 to provide a data file based on a markup language schema. This schema is stored by a store 34 for use by the encoder. The encoded data file is then encapsulated for transportation as indicated at 36 and may be transmitted by any appropriate media. At a receiver the transmitted data file is de-encapsulated as indicated at 38, and the encoded data file is then applied to a decoder 40 for decoding in accordance with the same schema 42. In this way the original data 30 is obtained.
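The encapsulation step is not specified further. Purely as an illustration, the sketch below assumes a simple frame consisting of a 4-byte big-endian length prefix followed by the encoded data file; the function names and the framing format are hypothetical and stand in for whatever transport encapsulation is actually used.

```python
import struct


def encapsulate(payload: bytes) -> bytes:
    """Wrap the encoded data file in a length-prefixed frame (assumed format)."""
    return struct.pack(">I", len(payload)) + payload


def de_encapsulate(frame: bytes) -> bytes:
    """Recover the encoded data file from a frame produced by encapsulate."""
    (length,) = struct.unpack(">I", frame[:4])
    payload = frame[4:4 + length]
    if len(payload) != length:
        raise ValueError("truncated frame")
    return payload


encoded_file = "SchemaName MD01 1 Alice /SchemaName".encode("utf-8")
frame = encapsulate(encoded_file)
received = de_encapsulate(frame).decode("utf-8")
```

Whatever framing is chosen, the smaller encoded data file means a smaller frame on the wire.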
With the present invention, the encoder 32 is controlled to encode the data not only in accordance with the markup language schema but also in accordance with the invention such that the resulting data file is reduced in size. The transmitted data file is, therefore, smaller than previously.
It will be appreciated that amendments to and variations of the embodiments specifically described and illustrated may be made within the scope of this application as set out in the accompanying claims.
Claims (19)
1. A text string representation of a markup language schema which defines a structure for carrying information and metadata, which information and metadata is identified within the schema by appropriate tags, wherein elements of information and metadata are not identified by bracketed descriptive text strings provided as both opening and closing tags but by alphanumeric, alphabetical or numeric opening tags.
2. A text string representation as claimed in Claim 1, wherein, in the markup language schema, the information and metadata is arranged in a hierarchical structure comprising at least one data group, elements within the data group, and sub-elements within the elements, the text string representation listing the data group and its elements and sub-elements in order, and wherein the alphanumeric, alphabetical or numerical opening tag identifying each data group, element or sub-element additionally implies the close of the preceding data group, element or sub-element.
3. A text string representation as claimed in Claim 1 or Claim 2, wherein the only closing tags provided identify the end of a schema.
4. A text string representation as claimed in any preceding claim, wherein, in the markup language schema, the information and metadata is arranged in a hierarchical structure comprising at least one data group, elements within the data group, and sub-elements within the elements, the text string representation listing the data group and its elements and sub-elements in order, and wherein alphanumeric tags are used to identify data groups and numeric tags are used to identify elements and sub-elements.
5. A text string representation as claimed in Claim 4, wherein alphanumeric tags identifying data groups comprise initials and a sequentially increasing number.
6. A text string representation as claimed in any preceding claim, wherein, in the markup language schema, the information and metadata is arranged in a hierarchical structure comprising at least one data group, elements within the data group, and sub-elements within the elements, the text string representation listing the data group and its elements and sub-elements in order, and wherein sequentially increasing unique numbers are used to identify individual elements and sub-elements.
7. A text string representation as claimed in any preceding claim, wherein a space separator is provided between an opening tag and the identified data.
8. A text string representation as claimed in any preceding claim, wherein the number of available states of each alphanumeric, alphabetical or numeric opening tag must exceed the number of states to be used.
9. A method of encoding a data file based on a markup language schema to reduce the file size, where the schema defines a structure for carrying information and metadata which is identified within the schema by appropriate tags, where the elements of information and metadata in the unconverted schema are identified by bracketed descriptive text strings provided in pairs as opening and closing tags, the method comprising omitting the closing tags.
10. A method of encoding a data file based on a markup language schema as claimed in Claim 9, wherein the only closing tag retained is a closing tag identifying the end of a schema.
11. A method of encoding a data file based on a markup language schema as claimed in Claim 9 or Claim 10, comprising replacing the bracketed descriptive opening tags with alphanumeric, alphabetical or numeric opening tags.
12. A method of encoding a data file based on a markup language schema as claimed in any of Claims 9 to 11, wherein, in the markup language schema, the information and metadata is arranged in a hierarchical structure comprising at least one data group, elements within the data group, and sub-elements within the elements, the method comprising listing the data group and its elements and sub-elements in order, and comprising replacing the bracketed descriptive opening tags identifying data groups with alphanumeric tags, and replacing the bracketed descriptive opening tags identifying elements and sub-elements with numeric tags.
13. A method of encoding a data file based on a markup language schema as claimed in any of Claims 9 to 12, comprising incorporating a space separator between an opening tag and the identified data.
14. A method of encoding a data file based on a markup language schema, as claimed in any of Claims 9 to 13, further comprising forming the encoded data file into a string for transmission.
15. A method of encoding a data file based on a markup language schema as claimed in Claim 14, further comprising compressing the formed string.
16. A method of encoding a data file based on a markup language schema as claimed in any of Claims 9 to 15, wherein the encoding is undertaken by a client parser and a server encoder.
17. Apparatus for encoding a data file based on a markup language schema to reduce the file size, where the schema defines a structure for carrying information and metadata which is identified within the structure by appropriate tags, where the elements of information and metadata in the schema are identified by bracketed descriptive text strings provided in pairs as opening and closing tags, the apparatus comprising a client parser to identify the defined schema and the opening and closing tags, and a server encoder to replace the identified opening and closing tags with alphanumeric, alphabetical, or numeric opening tags.
18. A process for transmitting data, the process comprising encoding the data to form a data file based on a markup language schema, encapsulating the data file for transport, and transmitting the encapsulated data file, wherein the process further comprises reducing the file size of the data file before it is encapsulated using a method as claimed in any of Claims 9 to 16.
19. A process for receiving data, the process comprising receiving and de-encapsulating an encapsulated and encoded data file to retrieve the encoded data file, and decoding the data file to obtain the data, wherein the data file had been encoded based on a markup language schema and reduced in size using a method as claimed in any of Claims 9 to 16, and wherein the data is decoded based on the markup language schema.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
GB0714906A GB2451479A (en) | 2007-07-31 | 2007-07-31 | Compressing markup language (e.g. XML) schema |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
GB0714906A GB2451479A (en) | 2007-07-31 | 2007-07-31 | Compressing markup language (e.g. XML) schema |
Publications (2)
Publication Number | Publication Date |
---|---|
GB0714906D0 GB0714906D0 (en) | 2007-09-12 |
GB2451479A true GB2451479A (en) | 2009-02-04 |
Family
ID=38529052
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
GB0714906A Withdrawn GB2451479A (en) | 2007-07-31 | 2007-07-31 | Compressing markup language (e.g. XML) schema |
Country Status (1)
Country | Link |
---|---|
GB (1) | GB2451479A (en) |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6635088B1 (en) * | 1998-11-20 | 2003-10-21 | International Business Machines Corporation | Structured document and document type definition compression |
-
2007
- 2007-07-31 GB GB0714906A patent/GB2451479A/en not_active Withdrawn
Non-Patent Citations (1)
Title |
---|
Computer Networks 33 (2000) 747-765 * |
Also Published As
Publication number | Publication date |
---|---|
GB0714906D0 (en) | 2007-09-12 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US7669120B2 (en) | Method and system for encoding a mark-up language document | |
US6825781B2 (en) | Method and system for compressing structured descriptions of documents | |
RU2285354C2 (en) | Binary format for mpeg-7 samples | |
US20080077606A1 (en) | Method and apparatus for facilitating efficient processing of extensible markup language documents | |
US20050120031A1 (en) | Structured document encoder, method for encoding structured document and program therefor | |
AU2002253002A1 (en) | Method and system for compressing structured descriptions of documents | |
CN101346689A (en) | A compressed schema representation object and method for metadata processing | |
US20110283183A1 (en) | Method for compressing/decompressing structured documents | |
US9128912B2 (en) | Efficient XML interchange schema document encoding | |
US6850948B1 (en) | Method and apparatus for compressing textual documents | |
CN1526239A (en) | A method for improving the functionality of MPEG-7 and other XML-based content description binary representations | |
US7747558B2 (en) | Method and apparatus to bind media with metadata using standard metadata headers | |
US7676742B2 (en) | System and method for processing of markup language information | |
US20120296916A1 (en) | Method, apparatus and software for processing data encoded as one or more data elements in a data format | |
JP2006519422A (en) | How to encode structured documents | |
GB2451479A (en) | Compressing markup language (e.g. XML) schema | |
US20120151330A1 (en) | Method and apparatus for encoding and decoding xml documents using path code | |
JP2007516514A (en) | Structured document compression and decompression methods | |
CN103473058B (en) | A kind of convenient coded method generating ASN1 data file | |
JP4122759B2 (en) | Document data code processing method and system | |
CN102119384B (en) | Method and device for encoding elements | |
CN1823528B (en) | Method for coding structured documents | |
Elgedawy et al. | Exploring queriability of encrypted and compressed XML data | |
Boone et al. | Text and Multimedia | |
CN116886447A (en) | Encryption transmission method and device for simplified encoding and decoding |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
WAP | Application withdrawn, taken to be withdrawn or refused ** after publication under section 16(1) |