Chapter 5: Data Representation
1- Decimal Numbers
In the decimal number system each of the ten digits 0 through 9, represents a certain quantity. These
ten symbols (digits) don’t limit you to expressing only ten different quantities, because you use the
various digits in appropriate positions within a number to indicate the magnitude of the quantity.
The position of each digit in a decimal number indicates the magnitude of the quantity represented, and
can be assigned a weight. The decimal number system is said to be of base, or radix, 10 because it uses
10 digits and the coefficients are multiplied by powers of 10. In general, a number with a decimal point
is represented by a series of coefficients:
$a_4 a_3 a_2 a_1 a_0 \,.\, a_{-1} a_{-2} a_{-3}$
The coefficients $a_j$ are any of the 10 digits (0, 1, 2, ..., 9), and the subscript value $j$ gives the place value
and, hence, the power of 10 by which the coefficient must be multiplied.
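For example, the decimal number 724.5 can be expressed as such a weighted sum:
$(724.5)_{10} = 7 \times 10^2 + 2 \times 10^1 + 4 \times 10^0 + 5 \times 10^{-1} = 700 + 20 + 4 + 0.5$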
2- Binary Numbers
This is another way to represent quantities. The binary system is less complicated than the decimal
system because it has only two digits; it is a base-2 system.
A binary digit, called a bit, has two values, 0 and 1. Each coefficient $a_j$ is multiplied by $2^j$, and the results
are added to obtain the decimal equivalent of the number. In general, for a number of base $r$:
$a_n r^n + a_{n-1} r^{n-1} + \cdots + a_2 r^2 + a_1 r^1 + a_0 r^0 + a_{-1} r^{-1} + a_{-2} r^{-2} + \cdots + a_{-m} r^{-m}$
With four bits there are 16 values: 0000, 0001, 0010, 0011, 0100, 0101, 0110, 0111, 1000, 1001, 1010,
1011, 1100, 1101, 1110, 1111; and so on for larger widths. The binary numbers 1 through 14, with
column weights 8 4 2 1, are:
Decimal   Binary (weights 8 4 2 1)
1         0001
2         0010
3         0011
4         0100
5         0101
6         0110
7         0111
8         1000
9         1001
10        1010
11        1011
12        1100
13        1101
14        1110
When n = 5, you can count from 0 to 31 (32 values); the maximum number is (31)10, i.e. $2^5 - 1$.
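These counts are easy to check with a short Python sketch (the variable names are illustrative):

    # With n bits there are 2**n distinct values, from 0 to 2**n - 1.
    n = 5
    values = [format(i, '0{}b'.format(n)) for i in range(2 ** n)]
    print(len(values))            # 32
    print(values[0], values[-1])  # 00000 11111
    print(2 ** n - 1)             # 31, the largest representable value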
3- Hexadecimal Numbers
The hexadecimal system has 16 digits and is used primarily as a compact way of displaying or writing
binary numbers; it is very easy to convert between binary and hexadecimal numbers.
Decimal   Binary   Hexadecimal
0         0000     0
1         0001     1
2         0010     2
3         0011     3
4         0100     4
5         0101     5
6         0110     6
7         0111     7
8         1000     8
9         1001     9
10        1010     A
11        1011     B
12        1100     C
13        1101     D
14        1110     E
15        1111     F
How do you count in hexadecimal once you get to F? Simply start over with another column and
continue as follows:
10 11 12 13 14 15 16 17 18 19 1A 1B 1C 1D 1E 1F 20 21 22 23 24 25 26 27 28 29 2A 2B 2C 2D ...
With two hexadecimal digits, you can count up to FF, which is 255 in decimal.
(100)16 = (256)10
(101)16 = (257)10
(FFF)16 = (4095)10
(FFFF)16 = (65535)10
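These values are easy to verify in Python, whose built-in int() accepts a base argument:

    print(int('FF', 16))    # 255
    print(int('100', 16))   # 256
    print(int('FFF', 16))   # 4095
    print(int('FFFF', 16))  # 65535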
4- Octal Numbers
This system provides a convenient way to express binary numbers and codes (like the hexadecimal
number system), although it is used less frequently than hexadecimal. The octal number system is
composed of eight digits, which are:
0, 1, 2, 3, 4, 5, 6, 7
Counting above 7 requires another column, just as in the other systems:
10 11 12 13 14 15 16 17 20 21 22 23 24 25 26 27 30 31 32 33 34 35 36 37 ...
The decimal value of any binary number can be found by adding the weights of all bits that are 1 and
discarding the weights of all bits that are 0.
Example: Convert (1101101)2 to decimal.
Solution:
$2^6 \times 1 + 2^5 \times 1 + 2^4 \times 0 + 2^3 \times 1 + 2^2 \times 1 + 2^1 \times 0 + 2^0 \times 1$
$= 64 + 32 + 0 + 8 + 4 + 0 + 1 = (109)_{10}$
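The sum-of-weights rule translates directly into a short Python sketch (the function name is illustrative):

    def binary_to_decimal(bits):
        """Add the weight 2**j for every bit position j that holds a 1."""
        total = 0
        for j, bit in enumerate(reversed(bits)):
            if bit == '1':
                total += 2 ** j
        return total

    print(binary_to_decimal('1101101'))  # 109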
Example: Convert (9)10 to binary.
Solution:
Sum-of-weights method: determine the set of binary weights whose sum is equal to the decimal
number. Here 9 = 8 + 1, so the weights 8 (that is, $2^3$) and 1 (that is, $2^0$) are present:
1001
So (9)10 = (1001)2
Example: Convert 25 and 82 to binary.
25 = 16 + 8 + 1, so (25)10 = (11001)2
82 = 64 + 16 + 2, so (82)10 = (1010010)2
A decimal number can also be converted to binary by repeated division by 2; the remainders, read from
last to first, form the binary number. For example, for (6)10:
6/2 = 3 remainder = 0
3/2 = 1 remainder = 1
1/2 = 0 remainder = 1
So (6)10 = (110)2.
Example: Convert (45)10 to binary.
Solution:
45/2 = 22 remainder = 1
22/2 = 11 remainder = 0
11/2 = 5 remainder = 1
5/2 = 2 remainder = 1
2/2 = 1 remainder = 0
1/2 = 0 remainder = 1
Reading the remainders from last to first: (45)10 = (101101)2
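The repeated-division method can be sketched in Python as follows (illustrative names):

    def decimal_to_binary(n):
        """Divide by 2 repeatedly; the remainders, read last-to-first, are the bits."""
        if n == 0:
            return '0'
        remainders = []
        while n > 0:
            n, r = divmod(n, 2)
            remainders.append(str(r))
        return ''.join(reversed(remainders))

    print(decimal_to_binary(45))  # 101101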
A decimal fraction is converted by repeated multiplication by 2; the carries (integer parts) form the
binary fraction. For example, for (0.625)10:
0.625 × 2 = 1.25   carry 1
0.25 × 2 = 0.5     carry 0
0.5 × 2 = 1.0      carry 1
So (0.625)10 = (0.101)2
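The fractional method can be sketched the same way; the max_bits parameter is an illustrative cutoff for
fractions that do not terminate in binary:

    def fraction_to_binary(f, max_bits=8):
        """Multiply by 2 repeatedly; the carries (integer parts) are the fraction bits."""
        bits = []
        while f > 0 and len(bits) < max_bits:
            f *= 2
            carry = int(f)          # 0 or 1
            bits.append(str(carry))
            f -= carry
        return '0.' + ''.join(bits)

    print(fraction_to_binary(0.625))  # 0.101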
To convert binary to hexadecimal, simply break the binary number into 4-bit groups, starting at the
rightmost bit, and replace each 4-bit group with the equivalent hexadecimal symbol.
Example: (1100101001010111)2 = 1100 1010 0101 0111 = C A 5 7
So (1100101001010111)2 = (CA57)16
To convert hexadecimal to binary, reverse the process and replace each hexadecimal symbol with the
appropriate four bits.
Example: (10AF)16 = 1 0 A F = 0001 0000 1010 1111
So (10AF)16 = (0001000010101111)2
Repeated division of a decimal number by 16 will produce the equivalent hexadecimal number, formed
by the remainders of the divisions. The first remainder produced is the least significant digit (LSD). Each
successive division by 16 yields a remainder that becomes a digit in the equivalent hexadecimal number.
Example
(650)10 = ( ? )16
650/16 = 40 remainder = 10 = A
40/16 = 2 remainder = 8 = 8
2/16 = 0 remainder = 2 = 2
So (650)10 = (28A)16
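The same repeated-division idea, sketched in Python for base 16 (the helper names are illustrative):

    HEX_DIGITS = '0123456789ABCDEF'

    def decimal_to_hex(n):
        """Divide by 16 repeatedly; the first remainder is the least significant digit."""
        digits = []
        while n > 0:
            n, r = divmod(n, 16)
            digits.append(HEX_DIGITS[r])
        return ''.join(reversed(digits)) or '0'

    print(decimal_to_hex(650))  # 28A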
An octal number is converted to decimal by the same weighted sum, using powers of 8:
$(2374)_8 = 2 \times 8^3 + 3 \times 8^2 + 7 \times 8^1 + 4 \times 8^0 = 1024 + 192 + 56 + 4 = (1276)_{10}$
Decimal-to-octal conversion uses repeated division by 8:
(359)10 = ( ? )8
359/8 = 44 remainder = 7
44/8 = 5 remainder = 4
5/8 = 0 remainder = 5
So (359)10 = (547)8
Because each octal digit corresponds to exactly three bits, octal-to-binary conversion simply replaces
each octal digit with its 3-bit equivalent, and binary-to-octal conversion groups the bits in threes starting
at the rightmost bit.
Example: (7526)8 = 7 5 2 6 = 111 101 010 110
So (7526)8 = (111101010110)2
Example: (11010000100)2 = 011 010 000 100 = 3 2 0 4
So (11010000100)2 = (3204)8
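A Python sketch of the 3-bit grouping method (the helper names are illustrative):

    def octal_to_binary(octal):
        """Replace each octal digit with its 3-bit equivalent."""
        return ''.join(format(int(d, 8), '03b') for d in octal)

    def binary_to_octal(bits):
        """Pad to a multiple of 3 bits, then read off the 3-bit groups."""
        width = -(-len(bits) // 3) * 3
        bits = bits.zfill(width)
        return ''.join(str(int(bits[i:i + 3], 2)) for i in range(0, len(bits), 3))

    print(octal_to_binary('7526'))         # 111101010110
    print(binary_to_octal('11010000100'))  # 3204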
Logic Gates
THE INVERTER
The inverter (NOT circuit) performs the operation called inversion or complementation. The inverter
changes one logic level to the opposite level.
Standard logic symbols for the inverter are shown in the figure, which gives the distinctive-shape symbol.
The negation indicator is a "bubble" (o) that indicates inversion or complementation when it appears on
the input or output of any logic element. Generally, the input is on the left of a logic symbol and the
output is on the right. When appearing on the input, the bubble means that a 0 is the active level, and
the input is called an active-LOW input. When appearing on the output, the bubble means that a 0 is the
active level, and the output is called an active-LOW output.
When a HIGH level is applied to an inverter input, a LOW level will appear on its output. When a LOW
level is applied to its input, a HIGH will appear on its output. This operation is summarized in Table 3-1,
which shows the output for each possible input in terms of levels and corresponding bits. A table such as
this is called a truth table.
The AND Gate
The term gate is used to describe a circuit that performs a basic logic operation. The AND gate is
composed of two or more inputs and a single output, as indicated by the standard logic symbols shown
in Fig. Inputs are on the left, and the output is on the right in each symbol. Gates with two inputs are
shown; however, an AND gate can have any number of inputs greater than one.
An AND gate produces a HIGH output only when all of the inputs are HIGH. When any of the inputs is
LOW, the output is LOW. Therefore, the basic purpose of an AND gate is to determine when certain
conditions are simultaneously true, as indicated by HIGH levels on all of its inputs, and to produce a
HIGH on its output to indicate that all these conditions are true.
The inputs of the 2-input AND gate in Figure below are labeled A, B and the output is labeled X. The gate
operation can be stated as follows:
For a 2-input AND gate, output X is HIGH only when inputs A and B are HIGH; X is LOW when either A
or B is LOW, or when both A and B are LOW.
The OR Gate
An OR gate can have more than two inputs. The OR gate is another of the basic gates from which all
logic functions are constructed. An OR gate can have two or more inputs and performs what is known as
logical addition.
An OR gate has two or more inputs and one output, as indicated by the standard logic symbol in Figure
below , where OR gates with two inputs are illustrated. An OR gate can have any number of inputs
greater than one.
An OR gate produces a HIGH on the output when any of the inputs is HIGH. The output is LOW only
when all of the inputs are LOW. The inputs of the 2-input OR gate in Figure above are labeled A, B and
the output is labeled X. The operation of the gate can be stated as follows:
For a 2-input OR gate, output X is HIGH when either input A or input B is HIGH, or when both A and B
are HIGH; X is LOW only when both A and B are LOW.
The HIGH level is the active or asserted output level for the OR gate. Figure below illustrates the
operation for a 2-input OR gate for all four possible input combinations.
The operation of a 2-input OR gate is described in the table below. This truth table can be expanded for
any number of inputs: the output is HIGH when one or more of the inputs are HIGH. The Boolean
expression for a 2-input OR gate is
X = A + B
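As a cross-check, the truth tables of the inverter, AND, and OR can be generated with a short illustrative
Python sketch, using bitwise operators on 0/1 values:

    from itertools import product

    # Inverter, AND, and OR on single-bit values.
    for a, b in product((0, 1), repeat=2):
        print('A={} B={}  NOT A={}  AND={}  OR={}'.format(a, b, 1 - a, a & b, a | b))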
The NAND Gate
The NAND gate is a popular logic element because it can be used as a universal gate; that is, NAND gates
can be used in combination to perform the AND, OR, and inverter operations.
The term NAND is a contraction of NOT-AND and implies an AND function with a complemented
(inverted) output. The standard logic symbol for a 2-input NAND gate and its equivalency to an AND gate
followed by an inverter are shown in Fig (a), where the symbol ≡ means equivalent to. A rectangular
outline symbol is shown in part (b).
For a 2-input NAND gate, output X is LOW only when inputs A and B are HIGH; X is HIGH when either A
or B is LOW, or when both A and B are LOW.
Note that this operation is opposite that of the AND in terms of the output level. In a NAND gate, the
LOW level (0) is the active or asserted output level, as indicated by the bubble on the output. Fig below
illustrates the operation of a 2-input NAND gate for all four input combinations, and Table is the truth
table summarizing the logical operation of the 2-input NAND gate.
The NOR Gate
The NOR gate, like the NAND gate, is a useful logic element because it can also be used as a universal
gate; that is, NOR gates can be used in combination to perform the AND, OR, and inverter operations.
A NOR gate produces a LOW output when any of its inputs is HIGH. Only when all of its inputs are LOW is
the output HIGH. For the specific case of a 2-input NOR gate, as shown in Fig above with the inputs
labeled A, B and the output labeled X, the operation can be stated as follows:
For a 2-input NOR gate, output X is LOW when either input A or input B is HIGH, or when both A and B
are HIGH; X is HIGH only when both A and B are LOW.
This operation results in an output level opposite that of the OR gate. In a NOR gate, the LOW output is
the active or asserted output level as indicated by the bubble on the output.
Fig below illustrates the operation of a 2-input NOR gate for all four possible input combinations, and
Table is the truth table for a 2-input NOR gate.
Exclusive-OR and Exclusive-NOR Gates
Exclusive-OR and exclusive-NOR gates are formed by a combination of other gates already discussed.
However, because of their fundamental importance in many applications, these gates are often treated
as basic logic elements with their own unique symbols.
The standard symbol for an exclusive-OR (XOR for short) gate is shown in the figure below. The XOR gate
has only two inputs.
For an exclusive-OR gate, output X is HIGH when input A is LOW and input B is HIGH, or when input A
is HIGH and input B is LOW: X is LOW when A and B are both HIGH or both LOW.
The four possible input combinations and the resulting outputs for an XOR gate are illustrated in Fig
below. The HIGH level is the active or asserted output level and occurs only when the inputs are at
opposite levels. The operation of an XOR gate is summarized in the table shown in Table.
Standard symbols for an exclusive-NOR (XNOR) gate are shown in Fig below. Like the XOR gate, an XNOR
has only two inputs. The bubble on the output of the XNOR symbol indicates that its output is opposite
that of the XOR gate. When the two input logic levels are opposite, the output of the exclusive-NOR gate
is LOW. The operation can be stated as follows (A and B are inputs, X is the output):
For an exclusive-NOR gate, output X is LOW when input A is LOW and input B is HIGH, or when A is
HIGH and B is LOW; X is HIGH when A and B are both HIGH or both LOW.
The four possible input combinations and the resulting outputs for an XNOR gate are shown in Fig
below. The operation of an XNOR gate is summarized in Table. Notice that the output is HIGH when the
same level is on both inputs.
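The four remaining gates can be tabulated the same way; note that each XNOR output is the
complement of the corresponding XOR output (an illustrative sketch):

    from itertools import product

    for a, b in product((0, 1), repeat=2):
        print('A={} B={}  NAND={}  NOR={}  XOR={}  XNOR={}'.format(
            a, b, 1 - (a & b), 1 - (a | b), a ^ b, 1 - (a ^ b)))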
Data Management
Because information is so important in most organizations, computer scientists have developed a large
body of concepts and techniques for managing data.
Although Data Processing and Data Management Systems both refer to functions that take raw data and
transform it into usable information, the usage of the terms is very different. Data Processing is the
term generally used to describe what was done by large mainframe computers from the late 1940's until
the early 1980's (and which continues to be done in most large organizations to a greater or lesser
extent even today): large volumes of raw transaction data fed into programs that update a master file,
with fixed-format reports written to paper.
The term Data Management Systems refers to an expansion of this concept, where the raw data,
previously copied manually from paper to punched cards, and later into data-entry terminals, is now fed
into the system from a variety of sources, including ATMs, EFT, and direct customer entry through the
Internet. The master file concept has been largely displaced by database management systems, and
static reporting replaced or augmented by ad-hoc reporting and direct inquiry, including downloading of
data by customers. The ubiquity of the Internet and the personal computer has been the driving force in
the transformation of Data Processing to the more global concept of Data Management Systems.
The earliest business computer systems were used to process business records and produce
information. They were generally faster and more accurate than equivalent manual systems. These
systems stored groups of records in separate files, and so they were called file processing systems. In a
typical file processing system, each department has its own files, designed specifically for those
applications. The department itself, working with the data processing staff, sets policies or standards for
the format and maintenance of its files.
Programs are dependent on the files and vice-versa; that is, when the physical format of the file is
changed, the program has also to be changed. Although the traditional file oriented approach to
information processing is still widely used, it does have some very important disadvantages.
Consider a savings bank as an example. System programmers wrote its application programs to meet
the needs of the bank. New application
programs are added to the system as the need arises. For example, suppose that the savings bank
decides to offer checking accounts. As a result, the bank creates new permanent files that contain
information about all the checking accounts maintained in the bank, and it may have to write new
application programs to deal with situations that do not arise in savings accounts, such as overdrafts.
Thus, as time goes by, the system acquires more files and more application programs.
This typical file-processing system is supported by a conventional operating system. The system stores
permanent records in various files, and it needs different application programs to extract records from,
and add records to, the appropriate files. Before database management systems (DBMSs) came along,
organizations usually stored information in such systems.
Keeping organizational information in such a file-processing system has a number of major
disadvantages:
Data redundancy and inconsistency: Since different programmers create the files and
application programs over a long period, the various files are likely to have different formats and
the programs may be written in several programming languages. Moreover, the same
information may be duplicated in several places (files). For example, the address and telephone
number of a particular customer may appear in a file that consists of savings-account records
and in a file that consists of checking-account records. This redundancy leads to higher storage
and access cost. In addition, it may lead to data inconsistency; that is, the various copies of the
same data may no longer agree.
Difficulty in accessing data: Conventional file-processing environments do not allow needed
data to be retrieved in a convenient and efficient manner. More responsive data-retrieval
systems are required for general use.
Data isolation: Because data are scattered in various files, and files may be in different formats,
writing new application programs to retrieve the appropriate data is difficult.
Integrity problems: The data values stored in the database must satisfy certain types of
consistency constraints. For example, the balance of a bank account may never fall below a
prescribed amount (say, $25). Developers enforce these constraints in the system by adding
appropriate code in the various application programs. However, when new constraints are
added, it is difficult to change the programs to enforce them.
These difficulties, among others, prompted the development of database systems. In what follows, we
shall see the concepts and algorithms that enable database systems to solve the problems with file-
processing systems. In most of what follows, we use a bank enterprise as a running example of a typical
data-processing application found in a corporation.
Characteristics of Database
The database approach has some very characteristic features which are discussed in detail below:
Concurrent Use
A database system allows several users to access the database concurrently. Answering different
questions from different users with the same (base) data is a central aspect of an information system.
Such concurrent use of data increases the economy of a system.
An example for concurrent use is the travel database of a bigger travel agency. The employees of
different branches can access the database concurrently and book journeys for their clients. Each travel
agent sees on his interface if there are still seats available for a specific journey or if it is already fully
booked.
Structured and Described Data
A fundamental feature of the database approach is that the database system does not only contain the
data but also the complete definition and description of these data. These descriptions are basically
details about the extent, the structure, the type and the format of all data and, additionally, the
relationship between the data. This kind of stored data is called metadata ("data about data").
Separation of Data and Applications
As described under Structured and Described Data, the structure of a database is described through metadata
which is also stored in the database. Application software does not need any knowledge about the
physical data storage like encoding, format, storage place, etc. It only communicates with the
management system of a database (DBMS) via a standardized interface with the help of a standardized
language like SQL. The access to the data and the metadata is entirely done by the DBMS. In this way all
the applications can be totally separated from the data. Therefore database internal reorganizations or
improvement of efficiency do not have any influence on the application software.
Data Integrity
Data integrity is a byword for the quality and the reliability of the data of a database system. In a
broader sense data integrity includes also the protection of the database from unauthorized access
(confidentiality) and unauthorized changes. Data reflect facts of the real world.
Transactions
A transaction is a bundle of actions which are done within a database to bring it from one consistent
state to a new consistent state. In between, the data are inevitably inconsistent. A transaction is atomic,
which means that it cannot be divided up any further: within a transaction, all or none of the actions
must be carried out. Doing only a part of the actions would lead to an inconsistent database state.
One example of a transaction is the transfer of an amount of money from one bank account to another.
The debit of the money from one account and the credit of it to another account make together a
consistent transaction. This transaction is also atomic.
Data Persistence
Data persistence means that in a DBMS all data is maintained as long as it is not deleted explicitly. The
life span of data needs to be determined directly or indirectly by the user and must not be dependent
on system features. Additionally, data once stored in a database must not be lost: changes made by a
completed transaction must persist, even across system restarts or failures.
Disadvantages of a DBMS
Danger of overkill: For small and simple applications for single users, a database system is
often not advisable.
Complexity: A database system creates additional complexity and requirements. The supply and
operation of a database management system with several users and databases is quite costly
and demanding.
Qualified Personnel: The professional operation of a database system requires appropriately
trained staff. Without a qualified database administrator nothing will work for long.
Costs: Through the use of a database system new costs are generated for the system itself but
also for additional hardware and the more complex handling of the system.
Instances and Schemas
Databases change over time as information is inserted and deleted. The collection of information stored
in the database at a particular moment is called an instance of the database. The overall design of the
database is called the database schema. Schemas are changed infrequently, if at all.
The concept of database schemas and instances can be understood by analogy to a program written in a
programming language. A database schema corresponds to the variable declarations (along with
associated type definitions) in a program. Each variable has a particular value at a given instant. The
values of the variables in a program at a point in time correspond to an instance of a database schema.
Database systems have several schemas, partitioned according to the levels of abstraction.
The physical schema describes the database design at the physical level, while the logical schema
describes the database design at the logical level. A database may also have several schemas at the view
level, sometimes called sub-schemas that describe different views of the database.
Of these, the logical schema is by far the most important, in terms of its effect on application programs,
since programmers construct applications by using the logical schema. The physical schema is hidden
beneath the logical schema, and can usually be changed easily without affecting application programs.
Application programs are said to exhibit physical data independence if they do not depend on the
physical schema, and thus need not be rewritten if the physical schema changes.
We study languages for describing schemas, after introducing the notion of data models in the next
section.
Data Models
Underlying the structure of a database is the data model: a collection of conceptual tools for describing
data, data relationships, data semantics, and consistency constraints.
To illustrate the concept of a data model, we outline two data models in this section: the entity-
relationship model and the relational model. Both provide a way to describe the design of a database at
the logical level.
The Entity-Relationship Model
The entity-relationship (E-R) data model is based on a perception of a real world that consists of a
collection of basic objects, called entities, and of relationships among these objects. An entity is a
"thing" or "object" in the real world that is distinguishable from other objects.
Entities are described in a database by a set of attributes. For example, the attributes account-number
and balance may describe one particular account in a bank, and they form attributes of the account
entity set. Similarly, attributes customer-name, customer-street address and customer-city may describe
a customer entity.
An extra attribute customer-id is used to uniquely identify customers (since it may be possible to have
two customers with the same name, street address, and city).
A unique customer identifier must be assigned to each customer. In the United States, many enterprises
use the social-security number of a person (a unique number the U.S. government assigns to every
person in the United States) as a customer identifier. A relationship is an association among several
entities. For example, a depositor relationship associates a customer with each account that she has.
The set of all entities of the same type and the set of all relationships of the same type are termed an
entity set and relationship set, respectively.
The overall logical structure (schema) of a database can be expressed graphically by an E-R diagram.
The ER model views the real world as a construct of entities and associations between entities.
Entities
Entities are the principal data object about which information is to be collected. Entities are usually
recognizable concepts, either concrete or abstract, such as person, places, things, or events which have
relevance to the database. Some specific examples of entities are EMPLOYEES, PROJECTS and INVOICES.
An entity is analogous to a table in the relational model.
Entities are classified as independent or dependent (in some methodologies, the terms used are strong
and weak, respectively). An independent entity is one that does not rely on another for identification. A
dependent entity is one that relies on another for identification.
Relationships
A relationship represents an association between two or more entities; for example, employees are
assigned to projects.
Attributes
Attributes describe the entity with which they are associated. A particular instance of an attribute is a
value. For example, "Jane R. Hathaway" is one value of the attribute Name. The domain of an attribute
is the collection of all possible values an attribute can have. The domain of Name is a character string.
Attributes can be classified as identifiers or descriptors. Identifiers, more commonly called keys,
uniquely identify an instance of an entity. A descriptor describes a non-unique characteristic of an entity
instance.
The Relational Model
The relational model uses a collection of tables to represent both data and the relationships among
those data. Each table has multiple columns, and each column has a unique name.
The data is arranged in a relation which is visually represented in a two-dimensional table. The data is
inserted into the table in the form of tuples (which are simply rows). A tuple is formed by one or more
attributes, which are used as basic building blocks in the formation of various expressions that are used
to derive meaningful information. There can be any number of tuples in the table, but every tuple
contains the same fixed set of attributes, with varying values. The relational model is implemented in a
database where a relation is represented by a table, a tuple by a row, and an attribute by a column.
We shall represent a relation as a table with columns and rows. Each column of the table has a name, or
attribute. Each row is called a tuple.
• Attribute: name of a column in a particular table (all data is stored in tables). Each attribute $A_i$
must have a domain, dom($A_i$).
• Relational Schema: The design of one table, containing the name of the table (i.e. the name of the
relation), and the names of all the columns, or attributes.
Relational keys
There are two kinds of keys in relations. The first are identifying keys: the primary key is the main
concept, while two other keys – super key and candidate key – are related concepts. The second kind is
the foreign key.
Super Keys
A super key is a set of attributes whose values can be used to uniquely identify a tuple within a relation.
A relation may have more than one super key, but it always has at least one: the set of all attributes that
make up the relation.
Candidate Keys
A candidate key is a super key that is minimal; that is, there is no proper subset of it that is itself a super key.
A relation may have more than one candidate key, and the different candidate keys may have a different
number of attributes. In other words, you should not interpret 'minimal' to mean the super key with the
fewest attributes.
Primary Key
The primary key of a relation is a candidate key especially selected to be the key for the relation. In
other words, it is a choice, and there can be only one candidate key designated to be the primary key.
Foreign Keys
A foreign key is an attribute (or set of attributes) within one relation that matches a candidate key of
another relation. A relation may have several foreign keys, associated with different target relations.
Foreign keys allow users to link information in one relation to information in another relation. Without
FKs, a database would be a collection of unrelated tables.
The Object-Oriented Model
The object-oriented model can be seen as extending the E-R model with notions of encapsulation,
methods (functions), and object identity.
Object oriented databases are also called Object Database Management Systems (ODBMS). Object
databases store objects rather than data such as integers, strings or real numbers. Objects are used in
object oriented languages such as Smalltalk, C++, Java, and others. Objects basically consist of the
following:
• Attributes: data that defines the characteristics of an object.
• Methods: code that defines the behavior of an object.
Therefore objects contain both executable code and data. There are other characteristics of objects such
as whether methods or data can be accessed from outside the object. We don't consider this here, to
keep the definition simple and to apply it to what an object database is. One other term worth
mentioning is classes. Classes are used in object oriented programming to define the data and methods
the object will contain. The class is like a template to the object. The class does not itself contain data or
methods but defines the data and methods contained in the object. The class is used to create
(instantiate) the object. Classes may be used in object databases to recreate parts of the object that may
not actually be stored in the database. Methods may not be stored in the database and may be
recreated by using a class.
The object-relational data model combines features of the object-oriented data model and relational
data model. Semi-structured data models permit the specification of data where individual data items of
the same type may have different sets of attributes. This is in contrast with the data models mentioned
earlier, where every data item of a particular type must have the same set of attributes. The extensible
markup language (XML) is widely used to represent semi-structured data.
The Network Model
A network database model is a database model that allows multiple records to be linked to the same
owner file. The model can be seen as an upside-down tree where the branches are the member
information linked to the owner, which is the bottom of the tree. The multiple linkages that this model
allows make the network database model very flexible. In addition, the relationship that the
information has in the network database model is defined as many-to-many relationship because one
owner file can be linked to many member files and vice versa.
The network model allows each record to have multiple parent and child records, forming a generalized
graph structure. This property applies at two levels: the schema is a generalized graph of record types
connected by relationship types (called "set types" in CODASYL), and the database itself is a generalized
graph of record occurrences connected by relationships (CODASYL "sets"). Cycles are permitted at both
levels. The chief argument in favor of the network model, in comparison to the hierarchic model, was
that it allowed a more natural modeling of relationships between entities.
The Hierarchical Model
A hierarchical database is a design that uses a one-to-many relationship for data elements. Hierarchical
database models use a tree structure that links a number of disparate elements to one "owner," or
"parent," primary record.
Hierarchical databases were popular in early database design, in the era of mainframe computers. While
some IBM and Microsoft models are still in use, many other types of business databases use more
flexible models to accommodate more sophisticated types of data management. Hierarchical models
make the most sense where the primary focus of information gathering is on a concrete hierarchy such
as a list of business departments, assets or people that will all be associated with specific higher-level
primary data elements.
A hierarchical database model is a data model in which the data is organized into a tree-like structure.
The data is stored as records which are connected to one another through links. A record is a collection
of fields, with each field containing only one value. The entity type of a record defines which fields the
record contains.
A record in the hierarchical database model corresponds to a row (or tuple) in the relational database
model and an entity type corresponds to a table (or relation).
The hierarchical database model mandates that each child record has only one parent, whereas each
parent record can have one or more child records. In order to retrieve data from a hierarchical database
the whole tree needs to be traversed starting from the root node. This model is recognized as the first
database model; it was created by IBM in the 1960s.
Database Languages
A database system provides a data definition language to specify the database schema and a data
manipulation language to express database queries and updates. In practice, the data definition and
data manipulation languages are not two separate languages; instead they simply form parts of a single
database language, such as the widely used SQL language.
We specify a database schema by a set of definitions expressed by a special language called a data-
definition language (DDL).
For instance, the following statement in the SQL language defines the account table:
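A sketch of such a statement, based on the account-number and balance attributes described earlier
(the hyphenated names follow this chapter's notation; an actual SQL system would typically require
underscores, e.g. account_number):

    create table account
        (account-number   char(10),
         balance          integer)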
Execution of the above DDL statement creates the account table. In addition, it updates a special set of
tables called the data dictionary or data directory.
A data dictionary contains metadata—that is, data about data. The schema of a table is an example of
metadata. A database system consults the data dictionary before reading or modifying actual data.
We specify the storage structure and access methods used by the database system by a set of
statements in a special type of DDL called a data storage and definition language. These statements
define the implementation details of the database schemas, which are usually hidden from the users.
The data values stored in the database must satisfy certain consistency constraints. For example,
suppose the balance on an account should not fall below $100. The DDL provides facilities to specify
such constraints. The database systems check these constraints every time the database is updated.
Data manipulation is the retrieval of information stored in the database, the insertion of new
information, the deletion of information, and the modification of existing information.
A data-manipulation language (DML) is a language that enables users to access or manipulate data as
organized by the appropriate data model. There are basically two types: procedural DMLs, which
require a user to specify what data are needed and how to get those data, and declarative
(nonprocedural) DMLs, which require a user to specify what data are needed without specifying how to
get those data.
Declarative DMLs are usually easier to learn and use than are procedural DMLs. However, since a user
does not have to specify how to get the data, the database system has to figure out an efficient means
of accessing data. The DML component of the SQL language is nonprocedural.
A query is a statement requesting the retrieval of information. The portion of a DML that involves
information retrieval is called a query language. Although technically incorrect, it is common practice to
use the terms query language and data manipulation language synonymously.
This query in the SQL language finds the name of the customer whose customer-id is 192-83-7465:
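A sketch of such a query, assuming a customer table with customer-name and customer-id attributes
(again using the chapter's hyphenated naming):

    select customer-name
    from customer
    where customer-id = '192-83-7465'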
The Data Dictionary
We can define a data dictionary as a DBMS component that stores the definition of data characteristics
and relationships. You may recall that such "data about data" were labeled metadata. The DBMS data
dictionary provides the DBMS with its self-describing characteristic. In effect, the data dictionary
resembles an X-ray of the company's entire data set, and is a crucial element in the data administration
function.
Two main types of data dictionary exist: integrated and stand-alone. An integrated data dictionary is
included with the DBMS. For example, all relational DBMSs include a built-in data dictionary or system
catalog that is frequently accessed and updated by the RDBMS. Other DBMSs, especially older types, do
not have a built-in data dictionary; instead the DBA may use third-party stand-alone data dictionary
systems.
Data dictionaries can also be classified as active or passive. An active data dictionary is automatically
updated by the DBMS with every database access, thereby keeping its access information up-to-date. A
passive data dictionary is not updated automatically and usually requires a batch process to be run. Data
dictionary access information is normally used by the DBMS for query optimization purposes.
The Three-Schema Architecture
The goal of the three-schema architecture, illustrated in Figure 1.1, is to separate the user applications
and the physical database. In this architecture, schemas can be defined at the following three levels:
1. The internal level has an internal schema, which describes the physical storage structure of the
database. The internal schema uses a physical data model and describes the complete details of
data storage and access paths for the database.
2. The conceptual level has a conceptual schema, which describes the structure of the whole
database for a community of users. The conceptual schema hides the details of physical storage
structures and concentrates on describing entities, data types, relationships, user operations,
and constraints. A high-level data model or an implementation data model can be used at this
level.
3. The external or view level includes a number of external schemas or user views. Each
external schema describes the part of the database that a particular user group is interested in
and hides the rest of the database from that user group. A high-level data model or an
implementation data model can be used at this level.
The three-schema architecture is a convenient tool for the user to visualize the schema levels in
a database system. Most DBMSs do not separate the three levels completely, but support the
three-schema architecture to some extent. Some DBMSs may include physical-level details in
the conceptual schema. In most DBMSs that support user views, external schemas are specified
in the same data model that describes the conceptual-level information. Some DBMSs allow
different data models to be used at the conceptual and external levels.
Notice that the three schemas are only descriptions of data; the only data that actually exists is
at the physical level. In a DBMS based on the three-schema architecture, each user group refers
only to its own external schema. Hence, the DBMS must transform a request specified on an
external schema into a request against the conceptual schema, and then into a request on the
internal schema for processing over the stored database. If the request is database retrieval, the
data extracted from the stored database must be reformatted to match the user’s external
view. The processes of transforming requests and results between levels are called mappings.
These mappings may be time-consuming, so some DBMSs—especially those that are meant to
support small databases—do not support external views. Even in such systems, however, a
certain amount of mapping is necessary to transform requests between the conceptual and
internal levels.
Centralized Database
A centralized database is stored and maintained at a single location. There are a number of ways to set
up a centralized database. Multiple programming languages are
well suited to database building and companies can also purchase database software rather than
developing their own. Users may have a number of ways to access material, and the database can be set
up with varying security levels to allow for more access controls. Information technology staffs maintain
the database with various operations to keep it orderly and address early signs of problems like viral
infections. They can also change access levels on request and administer the security system.
One advantage of the centralized database is the ability to access all the information in one location.
Searches of the database can be fast because the search engine does not need to check multiple
locations to return results. Information may also be easier to organize in a single location. In a database
upgrade to handle more information, servers can be added to the database site easily, and the company
will not have to balance the needs of a distributed database.
A centralized database can also be easier to physically secure. It can be enclosed in a variety of ways to
protect it from theft, sabotage, fire, and other issues. It is also possible to set up an extremely
robust computer security system to prevent unauthorized access. For extremely sensitive databases, the
computers may not be connected to a network, and users will have to physically enter the database
location to pull information. This may be used with some government computers that contain high-
security information.
There can also be disadvantages. A centralized database tends to create bottlenecks if multiple users
need to access it and their needs are substantial. It can also be very vulnerable if something happens to
it and a backup has not been performed or the existing backup is outdated. One advantage of
distributed databases is the redundancy factor, which can allow the system to function even if an
individual database is down.
Distributed Database
Database design typically includes the physical layout of hardware and software devices that manage a
company's data storage. There are multiple techniques that can be applied when designing a database.
A distributed database is a database that is split over multiple hardware devices but managed by a
central database controller. This distributed approach typically provides better performance and
reliability.
Dividing a database into separate physical units has many benefits. This approach provides better
control over specific data. It also distributes the load on the computer hardware and network devices.
A distributed database is normally separated by business units, companies, or geographical regions. This
approach provides for faster response times for users because the database is local to each business
unit within the organization. The business unit is typically smaller than the entire organization, which
reduces the overall load on each server.
Most large companies have separate business units for specific functions. Some examples include
accounting, human resources, and sales departments. A distributed database is designed to serve
specific business units throughout the organization, while maintaining control from a central server. This
technique enables the separation of hardware and data throughout the company, which provides for
better control and overall performance.
A distributed database design provides the benefits of central access by corporate headquarters, while
enabling local access for specific business units. This is a good design for companies that are dispersed
throughout the world. It is also recommended for organizations that support multiple portfolios. Some
examples of industries that would benefit from this design include manufacturing, hospitality, and
banking.
A distributed database might also be used in an accounting operation. A global organization would
typically include a distributed database designed to serve each country. This geographical distribution
approach would enable the local country to query data faster. The central database would access each
country's data without impacting each local accounting application.
Distributed databases provide better flexibility for a business. With the data divided between multiple
servers, it can easily be replicated onto new hardware throughout the organization. This reduces the risk
of unavailable data due to hardware failure.
There are some drawbacks to a distributed database design. The most prevalent concerns are database
integrity and concurrency. At times the distributed data may become unavailable to the central server.
This is typically due to network issues within the computer system. While the database will remain
available to the local business units, it may become outdated at the central headquarters of the
organization until the network issue is repaired.
The Client–Server Model
The client–server model of computing is a distributed application structure that partitions tasks or
workloads between the providers of a resource or service, called servers, and service requesters,
called clients. Often clients and servers communicate over a computer network on separate hardware,
but both client and server may reside in the same system. A server host runs one or more server
programs which share their resources with clients. A client does not share any of its resources, but
requests a server's content or service function. Clients therefore initiate communication sessions with
servers which await incoming requests.
The data processing is split into distinct parts: a part is either a requester (client) or a provider (server).
During processing, the client sends one or more requests to the servers to perform specified tasks.
The server part provides services for the clients.
This basic structure is called 2-tier structure. The client and server parts may reside on the same node or
on different nodes. A part can play the roles of a server of a service and a client of another service at the
same time. A client can be connected to several servers.
Data Mining
Data Mining, also popularly known as Knowledge Discovery in Databases (KDD), refers to the nontrivial
extraction of implicit, previously unknown and potentially useful information from data in databases.
While data mining and knowledge discovery in databases (or KDD) are frequently treated as synonyms,
data mining is actually part of the knowledge discovery process. The following figure shows data mining
as a step in an iterative knowledge discovery process.
Data cleaning: also known as data cleansing, it is a phase in which noise data and irrelevant data
are removed from the collection.
Data integration: at this stage, multiple data sources, often heterogeneous, may be combined in
a common source.
Data selection: at this step, the data relevant to the analysis is decided on and retrieved from
the data collection.
Data transformation: also known as data consolidation, it is a phase in which the selected data
is transformed into forms appropriate for the mining procedure.
Data mining: it is the crucial step in which clever techniques are applied to extract potentially
useful patterns.
Pattern evaluation: in this step, strictly interesting patterns representing knowledge are
identified based on given measures.
Knowledge representation: is the final phase in which the discovered knowledge is visually
represented to the user. This essential step uses visualization techniques to help users
understand and interpret the data mining results.
Data mining is applied in many domains. For example:
Sales/Marketing: Data mining enables businesses to understand the hidden patterns inside
historical purchasing transaction data, thus helping in planning and launching new marketing
campaigns in a prompt and cost-effective way.
Banking/Finance: Several data mining techniques, e.g., distributed data mining, have been
researched, modeled, and developed to help credit card fraud detection. Data mining is also
used to identify customers' loyalty by analyzing the data of customers' purchasing activities,
such as the frequency and value of their purchases.
Computational Biology
Broadly speaking, computational biology is the application of computer science, statistics, and
mathematics to problems in biology. Computational biology spans a wide range of fields within biology,
including genomics/genetics, biophysics, cell biology, biochemistry, and evolution. Likewise, it makes
use of tools and techniques from many different quantitative fields, including algorithm design, machine
learning, Bayesian and frequentist statistics, and statistical physics.
Much of computational biology is concerned with the analysis of molecular data, such as bio-sequences
(DNA, RNA, or protein sequences), three-dimensional protein structures, gene expression data, or
molecular biological networks (metabolic pathways, protein-protein interaction networks, or gene
regulatory networks). A wide variety of problems can be addressed using these data, such as the
identification of disease-causing genes, the reconstruction of the evolutionary histories of species, and
the unlocking of the complex regulatory codes that turn genes on and off. Computational biology can
also be concerned with non-molecular data, such as clinical or ecological data.
The terms computational biology and bioinformatics are often used interchangeably. However,
computational biology sometimes connotes the development of algorithms, mathematical models, and
methods for statistical inference, while bioinformatics is more associated with the development of
software tools, databases, and visualization methods.
Computational biologists use a wide range of software, from command-line programs to graphical and
web-based programs.
Computational Nanoscience
Nanoscience and nanotechnology will change the nature of almost every human-made object in this
century. Advances in the field of nanoscience empower us with new tools for producing electronic
devices at ever-decreasing scale. Many people have projected that nanometer-scale devices will
continue this trend, bringing control of matter to unprecedented scales. This includes scale reduction
not only in microelectronics, but also in fields such as quantum-switch-based computing in the shorter
term. These
advances have the potential to change the way we engineer our environment, construct and control
systems, and interact in society. Computational science, which has emerged as a third way of doing
research, one that complements theory and experiment, plays a key role in developing our
understanding of materials at the nanometer scale and in the development “by-design” of new
nanoscale materials and devices. Hence, modeling and simulation are now integral components of
scientific research.