Chapter 6
Documents: Languages & Properties
Metadata
Document Formats
Markup Languages
Text Properties
Document Preprocessing
Organizing Documents
Text Compression
Introduction
The document
denotes a single unit of information
has a syntax and structure
has a semantics, specified by the author
may have a presentation style
given by its syntax and structure
related to a specific application
specifies how to display or print the document
Introduction
[Figure: the components of a document: text plus other media, its syntax and structure, its semantics, and a presentation style.]
Introduction
The document syntax can be
implicit in its content
expressed in a simple declarative language
expressed in a programming language
the language syntax might be proprietary and specific
open and generic languages are more flexible
Introduction
Document style
defines how a document is visualized or printed
can be embedded in the document: TeX and RTF
can be complemented by macros
[Figure: the opening page of the book chapter “Document and Query Properties and Languages” (written with Gonzalo Navarro), annotated with callouts such as “Large font” and “Bold font”. The visible page text reads: “Text is the main form of communicating knowledge. Starting with hieroglyphs, the first written surfaces (stone, wood, animal skin, papyrus and rice paper) and paper, text has been created everywhere, in many forms and languages. We use the term document to denote a single unit of information, typically text in digital form, but it can also include other media. In practice, a document is loosely defined. It can be a complete logical unit, like a research article, a book or a manual. It can also be part of a larger text, such as a paragraph or a sequence of paragraphs (also called a passage)...”]
Introduction
Queries in search engines
can be considered as short pieces of text
differ from normal text
understanding them is very important
semantics often ambiguous due to polysemy
not simple to infer user intent behind a query
Metadata
Metadata
Metadata is information on the organization of the data,
the various data domains, and their relationships
metadata is data about the data
in a database, names of relations and attributes constitute
metadata
metadata is associated with most documents and text collections
Descriptive Metadata
Common forms of metadata for documents
author of the text
date of publication
source of the publication
document length
Semantic Metadata
Semantic Metadata
characterizes the subject matter within the document contents
is associated with a large number of documents
its availability is increasing
Metadata in Web Documents
The increase in Web data has led to many initiatives to
add metadata information to Web pages for various
purposes such as
cataloging and content rating
intellectual property rights and digital signatures
applications to electronic commerce
Metadata in Web Documents
RDF (Resource Description Framework) does not assume
any particular application or semantic domain
It consists of a description of nodes and attached
attribute/value pairs
Nodes can be any Web resource, that is, any Uniform Resource
Identifier (URI) including Uniform Resource Locators (URLs)
Attributes are properties of nodes and their values are text strings
or other nodes (Web resources or metadata instances)
Document Formats
Text
With the advent of the computer, it became necessary
to represent characters as binary digits, which is
done through coding schemes
ASCII (7 bits), EBCDIC (8 bits), and Unicode (16 bits)
All these coding schemes are based on characters
Text
Other text formats
Rich Text Format (RTF): for document interchange
Portable Document Format (PDF): for printing and displaying
PostScript: for printing and displaying
Multimedia
Multimedia usually stands for applications that handle
different types of digital data
Most common types of media: text, sound, images, and video
Different formats are necessary for storing each type of media
Most formats for multimedia can only be processed by a computer
Image Formats
The simplest image formats are direct representations
of a bit-mapped display such as XBM, BMP or PCX
Images of these formats have a lot of redundancy and
can be compressed efficiently
Example of a format that incorporates compression:
CompuServe’s Graphics Interchange Format (GIF)
Image Formats
To improve compression ratios, lossy compression was
developed
uncompressing a compressed image does not yield exactly the
original image
Image Formats
Another common image format is the Tagged Image
File Format (TIFF)
exchange of documents between different applications and
computers
TIFF provides for metadata, compression, and varying number of
colors
Image Formats
In 1996, a new image format was proposed for use on
the Internet: Portable Network Graphics (PNG)
De facto standard for images on the Web
Audio
Audio must be digitized to be stored properly
Most common formats for audio: AU, MIDI and WAVE
MIDI: standard format to interchange music between electronic
instruments and computers
Movies
Main format for animations is the Moving Picture
Experts Group (MPEG) family:
works by coding the changes in consecutive frames
profits from the temporal image redundancy that any video has
includes the audio signal associated with the video
specific cases for audio (MP3), video (MP4), etc.
Graphics and Virtual Reality
There are many formats for three-dimensional graphics:
Computer Graphics Metafile (CGM) and Virtual Reality
Modeling Language (VRML)
Markup Languages
Markup Languages
Markup is defined as extra syntax used to describe
formatting actions, structure information, text
semantics, and attributes
Examples of Markup Languages
SGML: Standard Generalized Markup Language
XML: eXtensible Markup Language
HTML: HyperText Markup Language
Markup Languages
SGML
SGML
SGML (ISO 8879) stands for Standard Generalized
Markup Language, i.e., a meta-language for tagging
text
it provides rules for defining a markup language based on tags
includes a description of the document structure called
document type definition
SGML
The document type definition is used to
describe and name the pieces that a document is composed of
define how those pieces relate to each other
part of the definition can be specified by an SGML document
type declaration (DTD)
SGML
Tags are denoted by angle brackets
SGML
Example of a SGML DTD for electronic messages
<!ELEMENT e-mail - - (prolog, contents) >
<!ELEMENT prolog - - (sender, address+, subject?, Cc*) >
<!ELEMENT (sender | address | subject | Cc) - O (#PCDATA) >
<!ELEMENT contents - - (par | image | audio)+ >
<!ELEMENT par - O (ref | #PCDATA)+ >
<!ELEMENT ref - O EMPTY >
<!ELEMENT (image | audio) - - (#NDATA) >
<!ATTLIST e-mail
id ID #REQUIRED
date_sent DATE #REQUIRED
status (secret | public ) public >
<!ATTLIST ref
id IDREF #REQUIRED >
<!ATTLIST (image | audio )
id ID #REQUIRED >
SGML
Example of use of previous DTD
SGML
Document description does not specify how a
document is printed
Output specifications are often added to SGML documents, such
as:
DSSSL: Document Style Semantics and Specification Language
FOSI: Formatted Output Specification Instance
These standards define mechanisms for associating style
information with SGML document instances
They allow defining that the data identified by a tag should be
typeset in some particular font
SGML
One important use of SGML is in the Text Encoding
Initiative (TEI)
includes several USA associations related to the humanities and
linguistics
provides several document formats through SGML DTDs
main goal is to generate guidelines for the preparation and
interchange of electronic texts for scholarly research, as well as
for industry
one of the most used formats is TEI Lite
Markup Languages
HTML
HTML
HTML stands for HyperText Markup Language and is
an instance of SGML
It was created in 1992 and has evolved over the
years, version 4.0 being the latest
HTML5 is under development
Most documents on the Web are stored and transmitted
in HTML
Although there is an HTML DTD, most HTML instances
do not explicitly make reference to the DTD
The HTML tags follow all the SGML conventions and
also include formatting directives
HTML
HTML documents can have other media embedded
within them, such as images or audio
HTML also has fields for metadata, which can be used
for different applications and purposes
If we also add programs (for example, in JavaScript)
inside a page, some people call it dynamic HTML
HTML
Example of an HTML document
<html><head>
<title>HTML Example</title>
<meta name=rby content="Just an example">
</head>
<body>
<h1>HTML Example</h1>
<p><hr><p>HTML has many <i>tags</i>, among them:
<ul>
<li> links to other <a href=example.html>pages</a> (a from anchor),
<li> paragraphs (p), headings (h1, h2, etc), font types (b, i),
<li> horizontal rules (hr), indented lists and items (ul, li),
<li> images (img), tables, forms, etc.
</ul>
<p><hr><p><img align=left src="at_work.gif">
This page is <b>always</b> under construction.
</body></html>
HTML
How the HTML document is seen on a browser
HTML
Because HTML does not fix a presentation style,
Cascading Style Sheets (CSS) were introduced in 1997
powerful and manageable way for authors to improve the
aesthetics of HTML pages
separate information about presentation from document content
support (for CSS) in current browsers is still modest
HTML
The evolution of HTML implies support for backward
compatibility and also for forward compatibility
HTML 4.0 has been specified in three flavors: strict,
transitional, and frameset
Strict HTML only worries about non-presentational markup,
leaving all the displaying information to CSS
Transitional HTML uses all the presentational features, for pages
that should be readable by old browsers that do not understand CSS
Frameset HTML is used when you want to partition the browser
window into two or more frames
HTML 4.0 includes support for style sheets,
internationalization, frames, richer tables and forms,
and accessibility options for people with disabilities
HTML
Typical HTML applications use a fixed small set of tags
conformance to a fixed tag set makes it much easier to build
applications
but this comes at the cost of severely limiting HTML in several
important aspects
In particular, HTML does not
allow users to specify their own tags
support the specification of nested structures needed to
represent database schemas
support the kind of language specification that allows
consuming applications to check data for structural validity on
importation
Markup Languages
XML
XML
XML, the eXtensible Markup Language
is a simplified subset of SGML
is not a markup language, like HTML, but a meta-language, like
SGML
allows human-readable semantic markup, which is also
machine-readable
makes it easier to develop and deploy new specific markup
languages
XML
XML does not have many of the restrictions of HTML
On the other hand, it imposes a more rigid syntax on the
markup:
In XML, ending tags cannot be omitted
XML also distinguishes upper and lower case
All attribute values must be between quotes
Parsing XML without a DTD is easier
The tags can be obtained while the parsing is done
XML
XML allows any user to define new tags
Extensible Stylesheet Language (XSL)
the XML counterpart of Cascading Style Sheets (CSS)
syntax defined based on XML
designed to transform and style highly-structured, data-rich
documents written in XML
For example, with XSL it would be possible to automatically
extract a table of contents from a document
XML
Extensible Linking Language (XLL)
Another extension to XML, defined using XML
defines different types of links (external and internal)
Recent uses of XML include:
Mathematical Markup Language (MathML)
Synchronized Multimedia Integration Language (SMIL)
Resource Description Framework (RDF)
Markup Languages
RDF: Resource Description Framework
Resource Description Framework
RDF (Resource Description Framework)
family of specifications originally designed as a metadata model
has become a general method for the conceptual description or
modeling of information
the de facto standard language of the Semantic Web
similar to conceptual modeling approaches such as
Entity-Relationship or Class diagrams
more naturally suited to represent certain kinds of knowledge
than other traditional models
Resource Description Framework
The RDF Schema (RDFS) is an extensible knowledge
representation language
It provides basic elements for the description of ontologies,
intended to structure RDF resources
Many RDFS components are included in the more
expressive language Web Ontology Language (OWL)
OWL is a family of knowledge representation languages for
authoring ontologies, also endorsed by the W3C
Markup Languages
HyTime
HyTime
HyTime: Hypermedia/Time-based Structuring
Language
an SGML architecture that specifies the generic hypermedia
structure of documents
its hypermedia concepts include
complex locating of document objects
relationships (hyperlinks) between document objects
numeric, measured associations between document objects
HyTime
The HyTime architecture has three parts:
the base linking and addressing architecture
the scheduling architecture (derived from the base architecture)
the rendition architecture (which is an application of the
scheduling architecture)
Text Properties
Information Theory
Information Theory
It is difficult to formally capture how much information
there is in a given text
However, the distribution of symbols is related to it
For example, a text where one symbol appears almost all the
time does not convey much information
Information theory defines a special concept, entropy, to capture
information content
Entropy
The entropy of a whole text T = t_1 t_2 . . . t_n is given by

E = \frac{1}{n} \sum_{i=1}^{n} \log_2 \frac{1}{p_i}

where p_i is the probability the model assigns to the symbol t_i
Entropy
If the model always assigns probability p_i to the alphabet
symbol s_i, the entropy can be rewritten as

E = \sum_{s_i \in \Sigma} p_i \log_2 \frac{1}{p_i}

or, equivalently,

E = - \sum_{i=1}^{\sigma} p_i \log_2 p_i

where σ is the alphabet size
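As a small illustration, the following Python sketch computes this zero-order entropy, estimating the probabilities p_i from the observed symbol frequencies (one simple choice of model; any other model's probabilities could be plugged in):

import math
from collections import Counter

def zero_order_entropy(text: str) -> float:
    # E = sum over symbols of p * log2(1/p), with p estimated by frequency
    n = len(text)
    return sum((f / n) * math.log2(n / f) for f in Counter(text).values())

# A text dominated by one symbol conveys little information:
print(zero_order_entropy("aaaaaaaaab"))  # about 0.47 bits per symbol
print(zero_order_entropy("abcdefghij"))  # log2(10), about 3.32 bits per symbol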
Text Properties
Modeling Natural Language
Modeling Natural Language
We can divide the symbols of a text in two disjoint
subsets:
symbols that separate words; and
symbols that belong to words
Modeling Natural Language
A simple model to generate text is the Binomial model
However, the probability of a symbol depends on
previous symbols
For example, in English, a letter f cannot appear after a letter c
Modeling Natural Language
The second issue is how the different words are
distributed inside each document
An approximate model is Zipf’s Law
This law states that the frequency f_i of the i-th most frequent
word is given by

f_i = \frac{f_1}{i^\alpha}

where f_1 is the frequency of the most frequent word and α is a
constant that depends on the text
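A minimal Python sketch of this prediction (the values of f_1 and α below are illustrative, not fitted to any real collection):

def zipf_frequency(i: int, f1: float, alpha: float = 1.0) -> float:
    # Predicted frequency of the i-th most frequent word: f_i = f1 / i^alpha
    return f1 / i ** alpha

# With f1 = 1000 and alpha = 1, ranks 1..5 give 1000, 500, 333.3, 250, 200:
print([round(zipf_frequency(i, 1000), 1) for i in range(1, 6)])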
Modeling Natural Language
For a text of n words with a vocabulary of V words, we have

n = \sum_{i=1}^{V} f_i = \sum_{i=1}^{V} \frac{f_1}{i^\alpha} = f_1 \times \left[ \sum_{i=1}^{V} \frac{1}{i^\alpha} \right]
Modeling Natural Language
The figure below illustrates the distribution of frequencies of
the terms in a text, with the words arranged in decreasing order
of their frequencies
[Figure: term frequency versus rank; the x-axis is labeled “Words”.]
Modeling Natural Language
Since the distribution of words is very skewed, words
that are too frequent, called stopwords, can be
disregarded
A stopword is a word which does not carry meaning in
natural language
Examples of stopwords in English: a, the, by, and
Fortunately, the most frequent words are stopwords
Therefore, half of the words appearing in a text do not need to
be considered
Modeling Natural Language
A third issue is the distribution of words in the
documents of a collection
Simple model: consider that each word appears the
same number of times in every document (not true)
Better model: use a negative binomial distribution
Fraction of documents containing a word k times is
F(k) = \binom{\alpha + k - 1}{k} \, p^k \, (1+p)^{-\alpha-k}
where p and α are parameters that depend on the word and the
document collection
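A sketch of F(k) in Python, evaluating the real-valued binomial coefficient as C(α+k−1, k) = Γ(α+k) / (Γ(k+1) Γ(α)); the parameter values in the example are illustrative only:

import math

def neg_binomial_fraction(k: int, p: float, alpha: float) -> float:
    # F(k) = C(alpha+k-1, k) * p^k * (1+p)^(-alpha-k)
    binom = math.gamma(alpha + k) / (math.gamma(k + 1) * math.gamma(alpha))
    return binom * p ** k * (1 + p) ** (-alpha - k)

print(neg_binomial_fraction(2, 0.1, 0.5))  # fraction of documents with k = 2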
Modeling Natural Language
The fourth issue is the number of distinct words in a
document (the document vocabulary)
To predict the growth of the vocabulary size in natural
language text, we use the so-called Heaps’ Law
the vocabulary of a text of n words is of size

V = K n^\beta

where K and β are parameters that depend on the text
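A sketch of the prediction; K = 10 and β = 0.5 are illustrative values that would have to be fitted to the actual collection:

def heaps_vocabulary(n: int, K: float = 10.0, beta: float = 0.5) -> int:
    # Predicted vocabulary size V = K * n^beta for a text of n words
    return round(K * n ** beta)

# Doubling the text size grows the vocabulary by much less than double:
print(heaps_vocabulary(1_000_000), heaps_vocabulary(2_000_000))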
Modeling Natural Language
The figure below illustrates that vocabulary size grows
sub-linearly with text size
Modeling Natural Language
Notice that the number of different words in a language is
bounded by a constant
However, the limit is so high that it is common to assume
that the size of the vocabulary is O(n^β) instead of O(1)
Many authors argue that the number keeps growing
anyway because of typing and spelling errors
Heaps’ law also applies to collections of documents
as the total text size grows, the predictions of the model become
more accurate
the model is also valid for the World Wide Web
Modeling Natural Language
A last issue is the average length of words
It relates the text size in words with the text size in bytes
In sub-collections of TREC-2, average word length is very close
to 5 letters
If we remove the stopwords, the average length of a word
increases to a number between 6 and 7 letters
Text Properties
Text Similarity
Text Similarity
Similarity is measured by a distance function
for strings of the same length, the distance between them is the
number of positions with different characters
for instance, the distance is 0 if they are equal
this is called the Hamming distance
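A direct Python sketch of the Hamming distance:

def hamming(s: str, t: str) -> int:
    # Defined only for strings of the same length
    assert len(s) == len(t)
    return sum(a != b for a, b in zip(s, t))

print(hamming("texts", "tests"))  # 1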
Text Similarity
Edit or Levenshtein distance
important distance function over strings
it is the minimal number of char insertions, deletions, and
substitutions needed to make two strings equal
edit distance between color and colour is 1
edit distance between survey and surgery is 2
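A standard dynamic-programming sketch of the edit distance, keeping only two rows of the matrix:

def levenshtein(s: str, t: str) -> int:
    # Minimal number of character insertions, deletions, and substitutions
    prev = list(range(len(t) + 1))
    for i, a in enumerate(s, 1):
        curr = [i]
        for j, b in enumerate(t, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (a != b)))  # substitution
        prev = curr
    return prev[-1]

print(levenshtein("color", "colour"))    # 1
print(levenshtein("survey", "surgery"))  # 2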
Text Similarity
Longest common subsequence (LCS)
obtained by deleting all non-common characters from two (or more) strings
the remaining sequence of characters is the LCS of the strings
LCS of survey and surgery is surey
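The LCS can be computed with a similar dynamic program; a simple (quadratic-space) sketch:

def lcs(s: str, t: str) -> str:
    # L[i][j] holds the LCS of the prefixes s[:i] and t[:j]
    L = [[""] * (len(t) + 1) for _ in range(len(s) + 1)]
    for i in range(1, len(s) + 1):
        for j in range(1, len(t) + 1):
            if s[i - 1] == t[j - 1]:
                L[i][j] = L[i - 1][j - 1] + s[i - 1]
            else:
                L[i][j] = max(L[i - 1][j], L[i][j - 1], key=len)
    return L[-1][-1]

print(lcs("survey", "surgery"))  # surey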
Text Similarity
Similarity can be extended to documents
Consider lines as single symbols and compute the
longest common subsequence of lines between two files
Measure used by the diff command in Unix
Problems with this approach
very time consuming
does not consider lines that are similar
but, this can be fixed by taking a weighted edit distance
between lines
Text Similarity
Resemblance measure
If W(d_j) is the set of all distinct words in document d_j, then the
resemblance between two documents d_i and d_j is defined as

R(d_i, d_j) = \frac{|W(d_i) \cap W(d_j)|}{|W(d_i) \cup W(d_j)|}

where 0 ≤ R(d_i, d_j) ≤ 1
Notice that this is a more efficient document similarity measure;
the associated distance is D(d_i, d_j) = 1 − R(d_i, d_j)
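A direct sketch, taking whitespace-separated tokens as the word sets (a simplification; a real system would first apply the preprocessing described later):

def resemblance(di: str, dj: str) -> float:
    # R = |W(di) ∩ W(dj)| / |W(di) ∪ W(dj)| over sets of distinct words
    wi, wj = set(di.split()), set(dj.split())
    return len(wi & wj) / len(wi | wj)

d1, d2 = "a rose is a rose", "a rose is a flower"
print(resemblance(d1, d2))      # 3 shared words out of 4 distinct: 0.75
print(1 - resemblance(d1, d2))  # the distance D(d1, d2) = 0.25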
Document Preprocessing
Document Preprocessing
Document preprocessing can be divided into five text
operations:
Lexical analysis of the text
Elimination of stopwords
Stemming of the remaining words
Selection of index terms or keywords
Construction of term categorization structures (thesaurus)
Logical View of a Document
[Figure: logical view of a document through the text preprocessing phases. The document (text + structure) undergoes structure recognition; then accents, spacing, etc. are normalized, stopwords are removed, noun groups are detected, stemming is applied, and the resulting terms are mapped to a controlled vocabulary, yielding the index terms.]
Lexical Analysis of the Text
Lexical analysis
Process of converting stream of chars into stream of words
Major objective: identify words in the text
Word separators:
Space: most common separator
Numbers: inherently vague, need context for disambiguation
Hyphens: break up hyphenated words
Punctuation marks: allow distinguishing x.id from xid in a
program
Case of the letters: allows distinguishing Bank from bank
Elimination of Stopwords
Stopwords
words that appear too frequently
usually, not good discriminators
normally, filtered out as potential index terms
natural candidates: articles, prepositions, conjunctions
Elimination of stopwords
reduces size of index by 40% or more
at the expense of reducing recall: not able to retrieve documents that
contain “to be or not to be”
Stemming
User specifies query word but only a variant of it is
present in a relevant document
plurals, gerund forms, and past tense suffixes
partially solved by the adoption of stems
Stem
portion of word left after removal of prefixes/suffixes
connect: stem of connected, connecting, connection,
connections
Stemming reduces size of the index
There is controversy about benefits of stemming for
retrieval
Many search engines do not adopt any stemming
Stemming
Four types of stemming strategies:
Affix removal: in which the most important part is suffix removal
The Porter algorithm is a suffix removal algorithm for English
Table lookup: look for the stem of a word in a table
Successor variety: determine morpheme boundaries and use
knowledge from structural linguistics
N-grams: identify digrams and trigrams (term clustering)
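For illustration, the Porter algorithm mentioned above is available in the NLTK package (an external dependency, assumed to be installed here):

from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()
for word in ["connected", "connecting", "connection", "connections"]:
    print(word, "->", stemmer.stem(word))  # each reduces to "connect"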
Keyword Selection
Full text representation
all words in text used as index terms (or, keywords)
Thesauri
Thesaurus
Word has Greek and Latin origins
Used as reference to a treasury of words
In its simplest form, this treasury consists of
precompiled list of important words in a knowledge domain
for each word in this list, a set of related words derived from a
synonymy relationship
Thesauri
In general, a thesaurus includes a more complex
structure
Roget’s thesaurus includes phrases
cowardly adjective
Ignobly lacking in courage: cowardly turncoats.
Syns: chicken (slang), chicken-hearted, craven, dastardly,
faint-hearted, gutless, lily-livered, pusillanimous, unmanly,
yellow (slang), yellow-bellied (slang).
Thesauri
The main purposes of a thesaurus are to provide:
a standard vocabulary for indexing and searching
a means to find terms for proper query formulation
classified hierarchies to allow broadening/narrowing queries
Thesauri
A controlled vocabulary presents important advantages:
normalization of indexing concepts
reduction of noise
identification of indexing terms with a clear semantic meaning
retrieval based on concepts rather than on words
Thesauri
For general domains
a well known body of knowledge that can be associated with the
documents in the collection might not exist
this might be because the collection is new, too large, or changes
dynamically
thus, not clear how useful a thesaurus is in the context of the Web
Many search engines simply index all the words
Thesaurus Index Terms
Terms are the indexing components of a thesaurus
a term can be composed of a word, a group of words, or a phrase
it is normally a noun (most concrete part of speech)
it usually denotes a concept
can be expressed as a combination of an adjective with a noun:
polar bear
Thesaurus Term Relationships
Synonyms and near-synonyms
Set of terms related to a given thesaurus term
Relationships can be induced by patterns of co-occurrence within
documents
Such relationships are usually of a hierarchical nature
broader (represented by BT) related terms
narrower (represented by NT) related terms
But, they can also be of a lateral or non-hierarchical nature
We simply say that the terms are related (represented by RT)
BT and NT relationships can be identified automatically
Dealing with RT relationships is much harder
On the Use of Thesauri in IR
Query formation process
User forms a query
Query terms might be erroneous and improper
Solution: reformulate the original query
Usually, this implies expanding original query with related terms
Thus, it is natural to use a thesaurus for finding related terms
On the Use of Thesauri in IR
Relationships captured in a thesaurus are frequently
not valid for the local context of a query
To solve this context problem
determine thesaurus-like relationships at query time
but, not attractive for search engines
expensive in terms of time and resources
Taxonomies
Folksonomies
A folksonomy is a collaborative flat vocabulary
terms are selected by a population of users
each term is called a tag
Text Compression
Text Compression
A representation of text using less space
Attractive option to reduce costs associated with
space requirements
input/output (I/O) overhead
communication delays
Text Compression
Our focus is on compression methods that
allow random access to text
do not require decoding the entire text
Compression Methods
Two general approaches
statistical text compression
dictionary based text compression
Statistical methods
Estimate the probability of a symbol appearing next in the text
Symbol: a character, a text word, a fixed number of chars
Alphabet: set of all possible symbols in the text
Modeling: task of estimating probabilities of a symbol
Coding or encoding: process of converting symbols into binary
digits using the estimated probabilities
Compression Methods
Dictionary methods
Identify a set of sequences that can be referenced
Sequences are often called phrases
Set of phrases is called the dictionary
Phrases in the text are replaced by pointers to dictionary entries
Statistical Methods
Defined by the combination of two tasks
the modeling task estimates a probability for each next symbol
the coding task encodes the next symbol as a function of the
probability assigned to it by the model
Statistical Methods
Golden rule
Shorter codewords should be assigned to more frequent
symbols to achieve higher compression
If the probability p_c of a symbol c is much higher than
the others, then \log_2 \frac{1}{p_c} will be small
To achieve good compression
Modeler must provide good estimation of probability p of symbol
occurrences
Encoder must assign codewords of length close to \log_2 \frac{1}{p}
Statistical Methods: Modeling
Compression models can be
adaptive, static, or semi-static
character-based or word-based
Adaptive models:
start with no information about the text
progressively learn the statistical text distribution
need only one pass over the text
store no additional information apart from the compressed text
Static models
Static models
assume an average distribution for all input texts
modeling phase is done only once for all texts
achieve poor compression ratios when data deviate from initial
statistical assumptions
a model that is adequate for English literary texts will probably
perform poorly for financial texts
Semi-static models
Semi-static models
Do not assume any distribution on the data
Learn data distribution (fixed code) in a first pass
Text compressed in a second pass using fixed code from first
pass
Information on data distribution sent to decoder before
transmitting encoded symbols
Advantage in IR contexts: direct access
Same model used at every point in compressed file
Semi-static models
Simple semi-static model: use global frequency
information
Let f_c be the frequency of symbol c in the text T = t_1 t_2 . . . t_n
The corresponding entropy is

E = \sum_{c \in \Sigma} \frac{f_c}{n} \log_2 \frac{n}{f_c}
Semi-static models
In the 2 gigabyte TREC-3 collection:
Entropy under this simple model: 4.5 bits per character
Compression ratio cannot be lower than 55%
But, state-of-the-art compressors achieve compression ratios
between 20% and 40%
Semi-static models
Order k of a model
Number of previous symbols used to estimate the probability of the next symbol
Zero-order model: computed independently of context
Compression improves with higher-order models
Model of order 3 in TREC-3 collection
· compression ratios of 30%
· handling about 1.3 million frequencies
Model of order 4 in TREC-3 collection
· compression ratio of 25%
· handling about 6 million frequencies
Word-based Modeling
Word-based modeling uses zero-order modeling over
a sequence of words
Good reasons to use word-based models in IR
Huffman Codes
The Huffman tree for a given probability distribution is
not unique
[Figure: two different Huffman trees for the text “for my rose, a rose is a rose”, with leaves rose, a, my, “,”, for, is. Both trees assign 2-bit codewords to rose and a and 3-bit codewords to the rest; the compressed text is 110 010 00 011 10 00 111 10 00 under one tree and 010 000 10 001 11 10 011 11 10 under the other.]
Canonical tree
right subtree of no node can be taller than its left subtree
can be stored very efficiently
allows faster decoding
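A minimal sketch of plain (non-canonical) word-based Huffman code construction with a priority queue; the counter tie only breaks ties between equal frequencies:

import heapq
from collections import Counter

def huffman_code(words):
    # Returns a dict mapping each word to its bit-string codeword
    heap = [(f, i, {w: ""}) for i, (w, f) in enumerate(Counter(words).items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)   # the two least frequent
        f2, _, right = heapq.heappop(heap)  # subtrees are merged
        merged = {w: "0" + c for w, c in left.items()}
        merged.update({w: "1" + c for w, c in right.items()})
        heapq.heappush(heap, (f1 + f2, tie, merged))
        tie += 1
    return heap[0][2]

text = "for my rose , a rose is a rose".split()
print(huffman_code(text))  # rose and a get the shortest codewords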
Byte-Huffman Codes
Original Huffman method leads to binary coding trees
However, we can make the code assign a sequence of
whole bytes to each symbol
As a result, the Huffman tree has degree 256 instead of 2
For the word-based model, this degrades compression ratios to
around 30%
In exchange, decompression of byte-Huffman code is much faster
than for binary Huffman code
Byte-Huffman Codes
In byte-Huffman coding, direct searching on
compressed text is simpler
To search for a word in the compressed text
first find it in the vocabulary
for TREC-3, vocabulary requires just 5 megabytes
mark the corresponding leaf in the tree
proceed over text as if decompressing, except that no symbol is
output
instead, report occurrences when visiting marked leaves
Byte-Huffman Codes
Process is simple and fast: only 30% of the I/O is
necessary
Assume we wish to search for a complex pattern
including ranges of characters or a regular expression
just apply the algorithm over the vocabulary
for each match of a whole vocabulary word, mark the word
done only on the vocabulary (much smaller than whole text)
once relevant words are marked, run simple byte-scanning
algorithm over the compressed text
Byte-Huffman Codes
All complexity of the search is encapsulated in the
vocabulary scanning
For this reason, searching the compressed text is
up to 8 times faster when complex patterns are involved
about 3 times faster when simple patterns are involved
Byte-Huffman Codes
Although the search technique is simple and uniform,
one could do better especially for single-word queries
Concatenation of two codewords might contain a third
codeword
Consider the code: A → 0, B → 10, C → 110, D → 111
DB would be coded as 11110
If we search for C, we would incorrectly report a spurious
occurrence spanning the codewords of DB
To check whether the occurrence is spurious or not, we would
have to rescan the text from the beginning
Dense Codes
An alternative coding simpler than byte-Huffman is
dense coding
Dense codes arrange the symbols in decreasing
frequency order
Codeword assigned to the i-th most frequent symbol is,
essentially, the number i − 1
number is written in a variable length sequence of bytes
7 bits of each byte are used to encode the number
highest bit is reserved to signal the last byte of the codeword
Dense Codes
Codewords of symbols ranked 1 to 128 are 0 to 127
they receive one-byte codewords
highest bit is set to 1 to indicate last byte (that is, we add 128 to
all codewords)
symbol ranked 1 receives codeword ⟨128⟩ = ⟨0 + 128⟩
symbol ranked 2 receives codeword ⟨129⟩ = ⟨1 + 128⟩
symbol ranked 128 receives codeword ⟨255⟩
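The rank-to-codeword mapping is pure arithmetic, as this sketch of the classic 128/128 split shows:

def dense_encode(rank: int) -> bytes:
    # Write rank-1 in this mixed base; the last byte (the stopper) gets +128
    i = rank - 1
    out = [128 + i % 128]
    i //= 128
    while i > 0:
        i -= 1
        out.append(i % 128)  # continuer bytes keep the high bit clear
        i //= 128
    return bytes(reversed(out))

def dense_decode(code: bytes) -> int:
    i = 0
    for b in code[:-1]:  # continuers
        i = (i + b + 1) * 128
    return i + (code[-1] - 128) + 1  # add the stopper digit, back to 1-based

print(dense_encode(1), dense_encode(128), dense_encode(129))
# b'\x80' b'\xff' b'\x00\x80', that is <128>, <255>, <0><128>
print(dense_decode(dense_encode(70000)))  # 70000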
Dense Codes
Stoppers
these are those bytes with their highest bit set
they indicate the end of the codeword
Continuers
these are the bytes other than stoppers
Dense Codes
Figure below illustrates an encoding with dense codes
Dense Codes
Highest bit signals the end of a codeword
a dense code is automatically a prefix code
Self-synchronization
Dense codes are self-synchronizing
Given any position in the compressed text, it is very easy to
determine the next or previous codeword beginning
decompression can start from any position, be it a codeword
beginning or not
Huffman encoding is not self-synchronizing
not possible to decode starting from an arbitrary position in
the compressed text
notice that it is possible to decode starting at an arbitrary
codeword beginning
Dense Codes
Self-synchronization allows faster search
To search for a single word we can
obtain its codeword
search for the codeword in the compressed text using any string
matching algorithm
Dense Codes
A spurious occurrence arises when one codeword is a suffix of
another codeword
assume we look for the codeword a b c, where c is the stopper byte
there could be a codeword d a b c in the code, so that we could
find our codeword in the text . . . e f g d a b c . . .
yet, it is sufficient to access the text position preceding the
candidate occurrence, d, and see that it is not a stopper,
revealing the occurrence as spurious
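A sketch of this check on top of a plain byte search (any string matching algorithm could supply the candidates):

def find_codeword(compressed: bytes, codeword: bytes):
    # A candidate match is real iff it starts the text or follows a stopper
    # byte (value >= 128); otherwise it is the suffix of a longer codeword
    hits, start = [], 0
    while (pos := compressed.find(codeword, start)) != -1:
        if pos == 0 or compressed[pos - 1] >= 128:
            hits.append(pos)
        start = pos + 1
    return hits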
(s, c)-Dense Codes
Dense codes assign
values 0 to 127 to continuer bytes
values 128 to 255 to stopper bytes
split is arbitrary and could be changed to c continuers and s
stoppers, for s + c = 256
Stoppers and continuers can be distinguished by
considering the numerical value of the bytes
s most frequent symbols are encoded with 1 byte
next sc most frequent symbols are encoded with 2 bytes
next sc2 most frequent symbols are encoded with 3 bytes
For the TREC collection,
optimal values for s: between 185 and 200
compression ratios very close to those of byte-Huffman codes
(s, c)-Dense Codes
Finding the optimal s and c values is not expensive
sort the V symbols by decreasing frequencies f_i
compute the cumulative frequencies F_i in a single pass:
F_0 = 0, F_i = F_{i-1} + f_i
the total compressed file length for a given (s, c) combination is
then a sum with O(log_c V) terms, which can easily be recomputed
for all 256 possible (s, c) combinations
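A sketch of the optimization; for clarity it sums the band frequencies directly instead of using the cumulative F_i values:

def compressed_length(freqs_desc, s: int) -> int:
    # Bytes used by an (s, c)-dense code, c = 256 - s: the s most frequent
    # symbols cost 1 byte each, the next s*c cost 2, the next s*c^2 cost 3...
    c = 256 - s
    total, lo, band, k = 0, 0, s, 1
    while lo < len(freqs_desc):
        hi = min(lo + band, len(freqs_desc))
        total += k * sum(freqs_desc[lo:hi])
        lo, band, k = hi, band * c, k + 1
    return total

def optimal_s(freqs_desc):
    # freqs_desc: symbol frequencies sorted in decreasing order
    return min(range(1, 256), key=lambda s: compressed_length(freqs_desc, s))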
(s, c)-Dense Codes
Dense codes compared to Huffman codes
are much simpler to handle
no coding tree has to be built nor stored — just the sorted
vocabulary suffices
simple algebraic calculation to convert word ranks into codes,
and vice versa
yield almost the same compression ratios (as byte-Huffman)
permit much faster searching
are simpler and faster to program and manipulate
have stronger synchronization properties
share all the positive features of Huffman coding
Dictionary Methods
Dictionary methods
identify groups of consecutive symbols (or phrases) that can be
referenced
occurrences of the phrases in the text are replaced by a pointer to
an entry in a dictionary
choice of phrases can be made by static, semi-adaptive, or
adaptive algorithms
Dictionary Methods
Static dictionary encoders are fast
little effort for a small amount of compression
Example: digram coding in which selected pairs of letters are
replaced with special characters
Dictionary Methods
Adaptive dictionary methods are more successful
in Ziv-Lempel compression, the dictionary is built as the text is
compressed
depending on the Ziv-Lempel compressor, different substrings
can be referenced
Ziv-Lempel methods are very popular as
general-purpose compressors
implemented in various compression programs: zip, pkzip,
gzip, winzip, arj
they do not allow access to the compressed text at random
positions
Re-Pair Methods
Much more interesting for IR is Re-Pair:
assign an integer to each symbol in the alphabet
rewrite the text as the corresponding sequence of integers
iteratively, identify the most frequent pair of consecutive integers,
insert into the dictionary a rule C → A · B, for a fresh integer C
replace all the occurrences of A · B by C
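A naive sketch of this loop (real implementations use specialized data structures to run in linear time):

from collections import Counter

def re_pair(seq):
    # seq: list of integers; returns (compressed sequence, dictionary rules)
    seq, rules, fresh = list(seq), {}, max(seq) + 1
    while True:
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs or pairs.most_common(1)[0][1] < 2:
            break  # no pair occurs twice; nothing left to gain
        pair = pairs.most_common(1)[0][0]
        rules[fresh] = pair  # rule: fresh -> A . B
        out, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
                out.append(fresh); i += 2
            else:
                out.append(seq[i]); i += 1
        seq, fresh = out, fresh + 1
    return seq, rules

print(re_pair([ord(ch) for ch in "abracadabra"]))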
Re-Pair Methods
A phrase like C can participate in new phrases like
E →D·C
dictionary can be seen as a binary hierarchy of phrases
Advantages:
Re-Pair obtains very good compression ratios
It can easily decompress the text starting at any position
Drawbacks:
compression is extremely slow
compression needs large amounts of memory
Re-Pair Methods
There is, however, a version of Re-Pair of utmost
interest for IR applications:
take the text as a sequence of words
apply the same algorithm over the resulting sequence of integers
compression ratios worsen a bit
but, all phrases in the dictionary are sequences of words
frequent phrases can be used for phrase browsing
user writes ‘United’ and sees related phrases in the text:
‘United Nations’, ‘United States’, and
‘Manchester United’
Preprocessing for Compression
Building the Burrows-Wheeler Transform of
‘mississippi$’
m i s s i s s i p p i $
i s s i s s i p p i $ m
s s i s s i p p i $ m i
s i s s i p p i $ m i s
i s s i p p i $ m i s s
s s i p p i $ m i s s i
s i p p i $ m i s s i s
i p p i $ m i s s i s s
p p i $ m i s s i s s i
p i $ m i s s i s s i p
i $ m i s s i s s i p p
$ m i s s i s s i p p i
Preprocessing for Compression
Building the Burrows-Wheeler Transform of
‘mississippi$’
F L
$ m i s s i s s i p p i
i $ m i s s i s s i p p
i p p i $ m i s s i s s
i s s i p p i $ m i s s
i s s i s s i p p i $ m
m i s s i s s i p p i $
p i $ m i s s i s s i p
p p i $ m i s s i s s i
s i p p i $ m i s s i s
s i s s i p p i $ m i s
s s i p p i $ m i s s i
s s i s s i p p i $ m i
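The construction amounts to sorting all rotations and reading the last column, as this sketch shows:

def bwt(text: str) -> str:
    # Last column L of the sorted rotation matrix; the text must end
    # with a unique sentinel such as '$'
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(row[-1] for row in rotations)

print(bwt("mississippi$"))  # ipssm$pissii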
Preprocessing for Compression
Somewhat surprisingly, it is possible to recover T from L
Why BWT should help compression
Consider a common phrase: ‘United States’
Occurrences of ‘ed States’ in T appear contiguously in the sorted rotation matrix M
In most cases, ‘ed States’ is preceded by ‘t’
L will contain mostly ‘t’s in contiguous area
Similarly, ‘ted States’, ‘ited States’, and ‘nited
States’ will produce long runs of letters ‘i’, ‘n’, and ‘U’,
respectively, in L
Comparing Techniques
Statistical methods:
Semi-static methods: bit-Huffman and byte-dense
Adaptive methods: PPM (program ppmdi)
Dictionary methods:
Semi-static method: Re-Pair
Adaptive method: LZ77 of the Ziv-Lempel family
Comparing Techniques
Comparison of the main compression techniques
Structured Text Compression
Structured text is a common form of representing
documents in many IR systems
It is possible to use the structure to achieve improved
compression
We opt for mentioning the most prominent products and
prototypes for structured text compression
XMLPPM
Relies on PPM-like modeling and arithmetic coding
Is adaptive and does not permit direct access to the
compressed file
Several models are active at the same time, and
XMLPPM switches between them
Four models are used:
one for element and attribute names
one for element structure
one for attributes
one for strings
XMLPPM
For example, consider the source text
<author><title>Prof.</title>Claude
Shannon</author>
Then,
The order-2 PPM context to model ‘Prof.’ will be ‘<author>
<title>’
The context to model ‘Claude’ will be ‘<author>’
The context to model ‘Shannon’ will be ‘<author> Claude’
SCM: Structured Context Modeling
Simplification of XMLPPM
direct access is possible
model and encode text under each tag separately from the rest
SCM
uses a word-based bit-Huffman encoder for each tag
merges tags when the distributions turn out to be similar
over TREC-3, achieves 2% – 4% improvement in compression
compared to word-based bit-Huffman
LZCS
LZCS (Lempel-Ziv to Compress Structure)
aims at very repetitive structured collections
archives of Web forms, e-commerce documents, Web service
exchange documents