Text and Multimedia
Languages and Properties
Overview
Text: main form of communicating
knowledge
Document: a single unit of information
A document has
syntax and structure: dictated by the application or by
the person who created it
semantics: specified by the author
presentation style: specifies how it should
be displayed or printed
Characteristics of a Document
[Figure: a document consists of text, structure, and other media, and is characterized by its syntax, semantics, and presentation style]
Overview
The syntax of a document can express
structure
presentation style
semantics
external actions
This syntax can be
implicit
expressed in a simple declarative language
expressed in a programming language
Metadata
Metadata, data about the data, is
information on
the organization of the data
the various data domains
the relationship between them
Metadata Examples
Database management system:
name of the relations
fields or attributes of each relation
domain of each attribute
Text:
author
date of publication
source of publication
document length
document genre
Descriptive Metadata
Descriptive Metadata: metadata that is
external to the meaning of the document
pertains to how the document was created
Example: the Dublin Core Metadata
Element Set: proposes 15 fields to
describe a document
Semantic Metadata
Semantic Metadata: metadata that
characterizes the subject matter found in the
document's contents
is associated with a wide number of
documents
is increasing in its availability
Example:
All books published in the USA are assigned
Library of Congress subject codes
Semantic Metadata
Example:
Many journals require author-assigned key terms (from a closed
vocabulary of relevant terms)
topical metadata for biomedical
articles within the MEDLINE system
covers diseases, anatomy,
pharmaceuticals, etc.
Metadata for Web Document
In the web, metadata can be used for:
cataloging: a popular format is BibTeX
content rating
intellectual property rights
digital signatures
privacy levels
applications to electronic commerce
Metadata for Web Document
Resource Description Framework (RDF):
new standard for Web metadata
provides interoperability between
applications
allows the description of Web resources to
facilitate automated processing of the
information
consists of a description of nodes and
attached attribute/value pairs
Metadata for Web Document
node: any Web resource that has a Uniform Resource
Identifier (URI)
attribute: properties of nodes
value: text strings or other nodes
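The node / attribute / value model can be pictured as a set of triples. The following is a minimal sketch in plain Python, not an RDF library; the URIs, the attribute names, and the helper `describe` are made up for illustration.

```python
# Each statement is a (node, attribute, value) triple.
# All URIs and attribute names below are hypothetical examples.
triples = [
    ("http://example.org/doc1", "creator", "http://example.org/people/ann"),
    ("http://example.org/doc1", "title", "Text and Multimedia"),
    ("http://example.org/people/ann", "name", "Ann"),
]

def describe(node, statements):
    """Collect the attribute/value pairs attached to one node."""
    return {attr: value for subj, attr, value in statements if subj == node}

doc = describe("http://example.org/doc1", triples)
# A value may itself be a node, so descriptions can be chained.
author = describe(doc["creator"], triples)
```

Here the value of `creator` is another node, so following it yields that node's own attribute/value pairs.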
Text
text is coded in binary digits for computer processing
First coding schemes: EBCDIC, ASCII
ASCII uses seven bits for each symbol (EBCDIC uses eight)
Later, ASCII was standardized to eight bits
(ISO-Latin)
accommodates several languages
including accents and diacritical marks
Unicode (ISO 10646) uses a 16-bit code
to accommodate oriental languages
Formats
In the past, IR systems converted a
document to an internal format
disadvantages:
the original application related to the document
is no longer useful
the contents of the document cannot be changed
Current IR systems use filters
filtering might not be possible with proprietary or
non-public formats
Formats
Full ASCII syntax: TeX
Binary syntax: Word, WordPerfect,
FrameMaker
Rich Text Format (RTF):
used by word processors
has ASCII syntax
developed for document interchange
Formats
Portable Document Format (PDF) and
Postscript
developed for displaying and printing
documents
Multipurpose Internet Mail Extensions
(MIME)
interchange formats
used to encode electronic mail
Formats
Compressed text:
Compress (Unix)
ARJ (PCs)
ZIP (gzip-Unix, Winzip-Windows), etc.
Conversion tools: convert binary files
(compressed text) to ASCII text for
transmission:
uuencode/uudecode
binhex
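As a small illustration of binary-to-ASCII conversion, the sketch below uuencodes a few bytes with Python's standard `binascii` module and decodes them again. The payload and the helper name `uu_line` are arbitrary choices for the example; a full uuencode file also has begin/end lines that are not shown here.

```python
import binascii

def uu_line(data: bytes) -> bytes:
    """Encode up to 45 bytes as one line of uuencoded ASCII text."""
    return binascii.b2a_uu(data)

payload = bytes([0, 255, 128, 10, 13])   # arbitrary binary data
encoded = uu_line(payload)               # printable ASCII, ends with a newline
decoded = binascii.a2b_uu(encoded)       # back to the original bytes
```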
Information Theory
the distribution of symbols related to information
(or semantics) in written text
entropy: used to capture information content (or
information uncertainty)
$\sigma$: number of symbols the alphabet has
$p_i$: probability of each symbol's appearance (the
symbol frequency over the total number of symbols)
$E$: the entropy of this text
$$E = -\sum_{i=1}^{\sigma} p_i \log_2 p_i$$
Entropy
when the symbols of the alphabet are coded in
binary, the entropy is measured in bits
example: for $\sigma = 2$,
the entropy is 1 if both symbols appear the same
number of times
the entropy is 0 if only one symbol appears
the text model determines the probabilities $p_i$ and hence the
amount of information in a text
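The entropy definition can be checked on tiny texts. This is a minimal sketch that assumes the simplest text model, in which each probability is just the observed relative frequency of the symbol; the function name `entropy` is illustrative.

```python
import math
from collections import Counter

def entropy(text: str) -> float:
    """Entropy in bits per symbol, using observed symbol frequencies as probabilities."""
    counts = Counter(text)
    total = len(text)
    return sum(-(c / total) * math.log2(c / total) for c in counts.values())

balanced = entropy("abab")   # two symbols, equally frequent
single = entropy("aaaa")     # only one symbol appears
```

With a two-symbol alphabet, `balanced` is 1 bit and `single` is 0 bits, matching the example in the slide.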
Modeling Natural Language
text is composed of symbols from a finite
alphabet
symbols can be divided into two subsets
symbols that separate words
symbols that belong to words
A simple model to generate text is the
binomial model
In natural language, these symbols are not
uniformly distributed: each symbol depends
on the previous symbol
Modeling Natural Language
a finite-context or Markovian model can
be used to compute this dependency
more complex models: finite-state
models and grammar models
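A finite-context model of order one can be estimated by counting symbol pairs. The sketch below is one simple way to do that and is not taken from the slides; the function name and the sample string are illustrative.

```python
from collections import Counter, defaultdict

def first_order_model(text: str) -> dict:
    """Estimate P(next symbol | previous symbol) from symbol-pair counts."""
    pair_counts = defaultdict(Counter)
    for prev, nxt in zip(text, text[1:]):
        pair_counts[prev][nxt] += 1
    model = {}
    for prev, counter in pair_counts.items():
        total = sum(counter.values())
        model[prev] = {sym: c / total for sym, c in counter.items()}
    return model

model = first_order_model("abababac")
```

In this sample, `a` is followed by `b` three times out of four, so `model["a"]["b"]` is 0.75, while `b` is always followed by `a`.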
Distribution of the Frequencies
Zipf's Law is used to model the distribution of
word frequencies in the text
the frequency of the i-th most frequent word is $1/i^{\theta}$
times that of the most frequent word
in a text of n words with a vocabulary of V words,
the i-th most frequent word appears $n / (i^{\theta} H_V(\theta))$
times
$H_V(\theta)$ is the harmonic number of order $\theta$ of V
$$H_V(\theta) = \sum_{j=1}^{V} \frac{1}{j^{\theta}}$$
$\theta$ depends on the text, usually > 1 (1.5-2.0)
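To make the Zipf formula concrete, the sketch below computes the predicted number of occurrences of the i-th most frequent word, writing the exponent of the law as `theta`. The values n = 100000, V = 1000 and theta = 1.5 are arbitrary assumptions chosen for illustration.

```python
def harmonic(V: int, theta: float) -> float:
    """Harmonic number of order theta of V: sum of 1 / j**theta for j = 1..V."""
    return sum(1.0 / j ** theta for j in range(1, V + 1))

def zipf_count(i: int, n: int, V: int, theta: float) -> float:
    """Predicted occurrences of the i-th most frequent word in a text of n words."""
    return n / (i ** theta * harmonic(V, theta))

n, V, theta = 100_000, 1_000, 1.5          # illustrative values only
top = zipf_count(1, n, V, theta)
second = zipf_count(2, n, V, theta)
total = sum(zipf_count(i, n, V, theta) for i in range(1, V + 1))
```

The harmonic number acts as a normalizer: the predicted counts over the whole vocabulary add up to n, and `second / top` equals `1 / 2**theta`.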
Distribution of Words
A simple model: assume each word appears
the same number of times in every document
A better model: a negative binomial distribution
the fraction of documents containing a word
k times is
$$F(k) = \binom{\alpha + k - 1}{k} p^{k} (1 + p)^{-\alpha - k}$$
where p and $\alpha$ are parameters (depend on the
word and the document collection)
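The negative binomial model can be evaluated numerically. The sketch below assumes the standard form of the distribution, with the second parameter written as `alpha`, and uses log-gamma so that `alpha` need not be an integer. The parameter values are arbitrary.

```python
import math

def neg_binomial_fraction(k: int, p: float, alpha: float) -> float:
    """Fraction of documents containing a word exactly k times:
    C(alpha + k - 1, k) * p**k * (1 + p)**(-alpha - k)."""
    log_binom = math.lgamma(alpha + k) - math.lgamma(k + 1) - math.lgamma(alpha)
    return math.exp(log_binom) * p ** k * (1 + p) ** (-alpha - k)

p, alpha = 0.5, 2.0                      # illustrative parameters only
absent = neg_binomial_fraction(0, p, alpha)
mass = sum(neg_binomial_fraction(k, p, alpha) for k in range(200))
```

Since F(k) is a fraction of documents, the values over all k sum to 1; `absent` is the fraction of documents in which the word does not occur at all.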
Document Vocabulary
Heaps' Law is used to predict the
growth of the vocabulary size in natural
language text
$$V = K n^{\beta} = O(n^{\beta})$$
V: vocabulary size of a text of n words
K, $\beta$: free parameters that depend on the text
$10 \le K \le 100$; $0 < \beta < 1$
See Figure 6.2
[Figure 6.2: vocabulary size (Words) plotted against Text size]
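A short sketch of Heaps' Law follows, writing the exponent as `beta`. The defaults K = 50 and beta = 0.5 are assumptions picked for illustration. The second function measures the actual vocabulary growth of a token list, which is the curve the law is meant to approximate.

```python
def heaps_vocabulary(n: int, K: float = 50.0, beta: float = 0.5) -> float:
    """Predicted vocabulary size V = K * n**beta for a text of n words."""
    return K * n ** beta

def vocabulary_growth(words):
    """Observed vocabulary size after each word of the text."""
    seen = set()
    sizes = []
    for word in words:
        seen.add(word)
        sizes.append(len(seen))
    return sizes

predicted = heaps_vocabulary(10_000)
observed = vocabulary_growth("a b a c".split())
```

With beta = 0.5, multiplying the text size by 100 multiplies the predicted vocabulary by only 10, which is the sublinear growth the law describes.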
Average Length of Words
Heaps' law implies:
the length of the words in the vocabulary
increases logarithmically with the text size
In practice:
the average length of the words in the text is constant
Finite-state model:
the space character has probability close to 0.2
the space character cannot appear twice in a row
there are 26 letters
Similarity Models
similarity is measured by
a distance function, for example:
Hamming distance
edit or Levenshtein distance
longest common subsequence (LCS)
a distance function should
be symmetric: the order of the arguments does not matter
satisfy the triangle inequality:
distance(a,c) ≤ distance(a,b) + distance(b,c)
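The two distance functions named in this slide can be sketched as follows. These are textbook formulations: Hamming distance counts mismatched positions in equal-length strings, and Levenshtein distance is computed with the usual dynamic-programming recurrence, with unit cost for insertion, deletion, and substitution.

```python
def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(x != y for x, y in zip(a, b))

def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions, and substitutions turning a into b."""
    prev = list(range(len(b) + 1))          # distances from the empty prefix of a
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1,               # delete x
                            curr[j - 1] + 1,           # insert y
                            prev[j - 1] + (x != y)))   # substitute or match
        prev = curr
    return prev[-1]
```

Both functions are symmetric in their arguments, and the edit distance satisfies the triangle inequality because any edit script from a to b followed by one from b to c is a valid script from a to c.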
Similarity Models
extending similarity to documents is done by
considering lines as single symbols and computing the
longest common subsequence of lines between two
files (diff command in Unix)
problems:
time consuming
does not consider lines that are similar
The second problem can be fixed by
taking a weighted edit distance between lines
computing the LCS over all the characters
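Treating each line as a single symbol, the longest common subsequence can be computed with a standard dynamic program. This sketch returns only the length and illustrates the idea rather than how the Unix diff command is actually implemented; the two sample files are made up.

```python
def lcs_length(xs, ys) -> int:
    """Length of the longest common subsequence of two sequences."""
    prev = [0] * (len(ys) + 1)
    for x in xs:
        curr = [0]
        for j, y in enumerate(ys, 1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

old_file = "a\nb\nc\nd".splitlines()     # hypothetical file contents
new_file = "a\nc\nd\ne".splitlines()
common_lines = lcs_length(old_file, new_file)
```

The same function works on characters if strings are passed instead of lists of lines, which is the more expensive option mentioned in the slide. The running time is proportional to the product of the two lengths, hence the "time consuming" remark.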
Document Similarity
Other solutions include
extract fingerprints of the documents and
compare them, or find large repeated pieces
use visual tools to see document similarity:
Dotplot draws a rectangular map where
both coordinates are file lines
the entry for each coordinate is a gray pixel that
depends on the edit distance between the
associated lines
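A dotplot can be sketched as a matrix of gray levels, one per pair of lines. In this sketch the gray level is the edit distance divided by the length of the longer line. That normalization is an assumption made here so that values fall between 0 and 1; the slides do not specify how distance maps to gray.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance with unit costs."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = curr
    return prev[-1]

def dotplot(lines_a, lines_b):
    """Matrix of gray levels: 0.0 for identical lines, 1.0 for entirely different ones."""
    grid = []
    for la in lines_a:
        row = []
        for lb in lines_b:
            longest = max(len(la), len(lb)) or 1   # avoid dividing by zero
            row.append(edit_distance(la, lb) / longest)
        grid.append(row)
    return grid

grid = dotplot(["ab", "cd"], ["ab", "cx"])
```

Dark diagonals in such a matrix correspond to runs of identical or nearly identical lines in the two files.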
Markup Languages
Markup: extra textual syntax used to describe
formatting actions
structure information
text semantics
attributes, etc.
ex. the formatting commands of TeX
formal markup languages are much more
structured
the marks are called tags; marked text has the form initial tag + text + ending tag
Sample markup languages: SGML, HTML, XML
SGML
Standard Generalized Markup Language
(ISO 8879): a metalanguage for tagging text
developed by a group led by Goldfarb
based on earlier work done at IBM
provides the rules for defining a markup language
based on tags
an SGML document is defined by
a description of the structure of the document
the text marked with tags describing the structure
SGML
each instance of SGML includes a description
of the document structure called a document
type definition
the document type definition is used to
describe and name the pieces that a document is
composed of
define how those pieces relate to each other
part of the definition can be specified by an
SGML document type declaration (DTD)
SGML
SGML cannot formally express
semantics of elements
attributes
application conventions
these can only be described informally (in comments)
SGML tags are denoted by angle
brackets <>
<tagname> text </tagname>
TEI
One important use of SGML is in TEI
(Text Encoding Initiative), a cooperative
project started in 1987
to generate guidelines for the preparation
and interchange of electronic texts for
scholarly research and for industry
one of the most used formats is TEI Lite
HTML
HyperText Markup Language (HTML):
is an instance of SGML
created in 1992, the latest version is 4.0
is being extended to solve its limitations
HTML tags follow SGML conventions
HTML tags include format directives
other media can be embedded in HTML
documents
HTML
HyperText Markup Language (HTML):
supports backward and forward
compatibility
Cascading Style Sheets (CSS)
offer a powerful and manageable way to
create visual effects of HTML pages
HTML 4.0
specified in strict, transitional and
frameset
Strict: only worries about non-presentational markup, leaving all the
display information to CSS
Transitional: uses all the presentation
features for pages
Frameset: used when the page uses frames
HTML Limitations
HTML does not
allow users to specify their own tags or
attributes
support the specification of nested
structures
support the kind of language specification
that allows consuming applications to
check data for structural validity on
importation
XML
eXtensible Markup Language (XML)
is a simplified subset of SGML
is not a markup language
is a metalanguage capable of containing
markup languages
allows a human-readable semantic markup
(also machine-readable)
makes it easier to develop and deploy new
specific markup
XML
eXtensible Markup Language (XML)
enables automatic authoring, parsing, and
processing of networked data
does not have many of the restrictions imposed
by HTML
imposes a more rigid syntax on the markup
distinguishes upper and lower case
is easier to parse without knowledge of
the tags (all attribute values must be
between quotes)
XML
eXtensible Markup Language (XML)
allows users to define new tags, more
complex structures
has data validation capabilities
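The properties listed for XML (user-defined tags, rigid syntax, case sensitivity, quoted attribute values) can be observed with Python's standard `xml.etree.ElementTree` parser. The `recipe` vocabulary below is invented for the example. The parser checks only well-formedness, so validation against a DTD is outside this sketch.

```python
import xml.etree.ElementTree as ET

# A document using made-up tags; no predefined tag set is required.
doc = ('<recipe servings="2"><title>Tea</title>'
       '<step>Boil water</step><step>Steep leaves</step></recipe>')

root = ET.fromstring(doc)
steps = [step.text for step in root.findall("step")]
servings = root.get("servings")

def is_well_formed(markup: str) -> bool:
    """True if the markup parses as XML, False if the parser rejects it."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False
```

Markup such as `<A></a>` (mismatched case), `<a x=1/>` (unquoted attribute value), or `<a><b></a></b>` (improper nesting) is rejected by `is_well_formed`, whereas HTML browsers typically tolerate such input.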
Recent Uses of XML
Mathematical Markup Language
(MathML): two sets of tags
for presentation of formulas
for the meaning of mathematical
expressions
Recent Uses of XML
Synchronized Multimedia Integration
Language (SMIL):
- a declarative language for scheduling
multimedia presentations in the Web
- the position and activation time of different
objects can be specified
Resource Description Framework (RDF):
used to give metadata information for XML
Multimedia
Multimedia: applications that handle
different types of digital data originating
from distinct types of media
Most common types of media are
- text, sound, images, video (animated
sequence of images)
The differences among these media types
- volume, format, processing requirements
Image Formats
Several formats for images:
direct representations of a bit-mapped display
- consume too much space: XBM, BMP, PCX
compressed:
Graphics Interchange Format (GIF)
Joint Photographic Experts Group (JPEG)
Tagged Image File Format (TIFF):
exchange documents between different
applications and different computer platforms
has fields for metadata and supports compression
Image Formats
Several formats for images:
Truevision Targa image file (TGA):
associated with video game boards
Other formats:
fax (bi-level image formats): JBIG
fingerprints (highly accurate and compressed):
WSQ
satellite (large resolution and full-color images)
Portable Network Graphics (PNG)
Audio Formats
Several formats for small pieces of digital audio:
AU: created by Sun Microsystems, one of the
most common formats on the Web
MIDI: standard format to interchange music
between electronic instruments and computers
WAVE: the native sound format within the
Windows environment, one of the most common
on the Web
Formats for audio libraries
RealAudio or CD formats
Animation Formats
for animations or moving images:
Moving Picture Experts Group (MPEG):
related to JPEG
AVI: includes compression (Cinepak)
FLI: originally developed by Autodesk, Inc.,
plays back faster than MPEG for computer-generated
animations at 640x480
QuickTime: developed by Apple
Textual Images
Very important in office systems
images of documents that contain mainly
typed or typeset text
obtained by scanning the documents
usually for archiving purposes
Large portion of a textual image is text
can be used for retrieval purposes
allow efficient compression
Textual Images
further compression can be achieved by
extracting the different text symbols or
marks from the image
building a library of symbols
representing each one by a position in the
library
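The symbol-library idea can be sketched as follows. Each extracted mark is stored once, and the image is then represented by a list of library positions. The sketch assumes marks are matched exactly and are given as small tuples of strings; a real system would need approximate matching of scanned marks and would also record where each mark sits on the page.

```python
def build_symbol_library(marks):
    """Replace each mark by its position in a library of distinct marks."""
    library = []      # distinct marks, in order of first appearance
    index_of = {}     # mark -> position in the library
    positions = []    # one library position per mark in the image
    for mark in marks:
        if mark not in index_of:
            index_of[mark] = len(library)
            library.append(mark)
        positions.append(index_of[mark])
    return library, positions

# Hypothetical 2x2 bitmaps standing in for extracted marks.
A = ("#.", "##")
B = ("##", ".#")
library, positions = build_symbol_library([A, B, A, A])
restored = [library[p] for p in positions]
```

The saving comes from repeated marks: four marks are stored as two bitmaps plus four small integers, and `restored` reproduces the original sequence.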
Retrieval of Textual Images
associate a set of keywords with each image at creation
time or when it is added to the database
use OCR to extract the text of the
image
use the symbols extracted from the
images as basic units to combine image
retrieval techniques with sequence
retrieval techniques
Graphics and Virtual Reality
For three-dimensional graphics
Computer Graphics Metafile (CGM)
standard (ISO 8632):
defined for the open interchange of
structured graphical objects and associated
attributes
specifies a two-dimensional data
interchange standard
Graphics and Virtual Reality
allows graphical data to be stored and
exchanged between graphics devices,
applications, and computer systems
(device-independent)
can represent vector graphics and raster
formats
supports a collection of elements, called a
metafile
specifies which elements are allowed to
occur in which positions in a metafile
Graphics and Virtual Reality
For three-dimensional graphics
Virtual Reality Modeling Language (VRML,
ISO/IEC 14772-1):
file format for describing interactive 3D objects and
worlds
is a subset of the Silicon Graphics OpenInventor
file format
intended to be a universal interchange format for
integrated 3D graphics and multimedia
HyTime
The Hypermedia/Time-based Structuring
Language (HyTime) is a standard
(ISO/IEC 10744)
defined for multimedia document markup
is an SGML architecture that specifies the
generic hypermedia structure of documents
Allows DTDs to be written for individual
document models
HyTime
The hypermedia concepts directly
represented by HyTime include
complex locating of document objects
relationships (hyperlinks) between
document objects
numeric, measured associations between
document objects
HyTime
The HyTime architecture has three parts:
The base linking and addressing
architecture:
addresses the syntax and semantics of hyperlinks
The scheduling architecture (derived
from the base architecture):
defines the abstract representation of complex
hypermedia structures (including music and
interactive presentations)
HyTime
The rendition architecture (an application
of the scheduling architecture):
defines a general mechanism for defining the
creation of new schedules from existing schedules
(by applying special rendition rules of different
types)
Applications of HyTime
Standard Music Description Language (SMDL)
an architecture for the representation of music
information
supporting multimedia time sequencing
information
Metafile for Interactive Documents (MID)
a common interchange structure
based on SGML and HyTime
takes data from various authoring systems and
structures it for display on different presentation
systems (with minimal human intervention)
Trends and Research Issues
The main trend is the convergence and
integration of the different efforts (the Web is
the main application)
ODA (Open Document Architecture):
designed to share documents electronically without
losing control over the content, structure, and layout
of those documents
defines a logical structure, a layout and the content
an ODA file can be formatted, processable, or
formatted processable
Trends and Research Issues
Formatted files
cannot be edited
have information about content and layout
Processable files
can be edited
have content and logical information
Formatted processable files
have everything
Trends and Research Issues
Recent developments include:
the document object model (DOM)
integration between VRML and Dynamic
HTML
integration between the Standard
Exchange for Product Data format (STEP,
ISO 10303) and SGML
efforts to convert MARC to SGML (by
defining a DTD) and MARC to XML
Trends and Research Issues
Recent developments include:
CGM: developing a new encoding which
can be parsed by XML
Several new proposals such as
SDML (Signed Document Markup Language)
VML (Vector Markup Language)
PGML (Precision Graphics Markup Language)
Taxonomy of Web Languages
[Figure: taxonomy of Web languages. Metalanguages: SGML, HyTime, XML. Languages: TEI Lite, HTML, RDF, MathML, SMIL, next-generation HTML. Style sheets: DSSSL, XSL, CSS.]