INTRODUCTION TO Dr Nimisha Singh
nimisha_singh@scmhrd.e
DATABASES du
QUICK CHECK
History of database
1960s
1970s
1980s
1990s
QUICK CHECK
History of database
1960s
1970s
1980s
1990s
Nearly 60 years of history since early 1960s.
Early stage- simply saving and processing data files
to a rich comprehensive discipline with data modeling and DBMS
To internet and data mining
The world around us
WHAT HAPPENS IN ONE
MINUTE
Twitter users sent 473,400 tweets
Instagram users posted 49,380 pictures
16 million text messages
156 million emails sent
54,200 calls on Skype
Google processes more than 40,000 searches every second, or 3.5
billion searches a day.
1.5 billion people are active on Facebook every day. That’s one-fifth
of the world’s population.
Two-thirds of the world’s population now own a mobile phone.
CLOUD ADOPTION GROWING
IOT GROWING
Digitization everywhere
in Governance, Healthcare, Education ……….
Need a database
This data needs to be
stored securely
Available
Error free……..
FOR DESIGNING DATABASES
Need to understand
Type of data that will be maintained in the database
Relationship among the data
How the data will be used
Organizational changes to manage the data
CLASS ACTIVITY- DOMINO’S
PIZZA (10 MINUTES)
Identify the sources of data
Label as structured, unstructured
Relevance of the data generated and collected
Why and how long that data should be stored?
(these are the questions we will be dealing with in subsequent
sessions)
CATEGORIES OF
ENTERPRISE DATA
Transactional data Interaction data
Machine
generated
• OLTP systems • Emails, Documents • Click streams
• ERP, CRM systems • Social Media Feeds, • System logs
• DWH, Marts Blogs • Scanners, Readers
• Master Data and • IVR, Call center data • RFID tags
Reference data • RSS/News Feeds • GPS
• External data feeds • Internet/Web • Geospatial data
• Structured/semi- crawlers • Cameras
structured, Well • Competition info • Monitoring devices
controlled and • Un-structured No • Semi-structured
managed Predictable predefined metadata Complex metadata
volumes and scale All Data volumes are not Volumes can be
data is relevant for known Part of the estimated All data is
analysis data is relevant relevant for analysis
CLASS ACTIVITY- DOMINO’S
PIZZA
Categorize data as transactional, interaction and machine
generated
What other sources of data can Domino’s use
IMPORTANT TERMS
Database technology- technology used for data management.
Data- raw records that have not been processed.
Database- a large collection of organized and shareable data stored
in the computer for a long time
Database management system (DBMS)- system software located
between the user and the operating system that can organize and
store data, access and maintain data efficiently
Source: Database Principles and
Technologies
https://doi.org/10.1007/978-981-19-3
DATA
Data can be a pile of magazines, a stack of newspapers, minutes of
a meeting, or a copy of medical records.
Data includes numbers, text, graphics, image, audio, video, and
many other forms
Data also involves semantics, i.e., the meaning and
implications of the data.
What do you infer from 50
DATA
Data can be a pile of magazines, a stack of newspapers, minutes of a
meeting, or a copy of medical records.
Data includes numbers, text, graphics, image, audio, video, and many
other forms
Data also involves semantics, i.e., the meaning and implications of the
data. Data and the semantics of data are closely related.
50
can mean the total number of employees in a department
Or a student's score in a certain subject
Or the price of a product
Or a person's weight
DATABASE
Database is a large collection of organized and shareable data
stored in the computer, with the following three characteristics
Long-term storage: The database should provide a reliable mechanism to
support long-term data storage, so that data recovery is feasible upon the
system failure.
Organization: Data should be organized, described and stored in a certain data
model. Model-based storage endows data with less redundancy, higher
independence and easy scalability.
Shareability: The data in the database is shared and used by all types of users,
not exclusive to a single user.
DBMS
The DBMS provides data definition language (DDL), through which
the user can easily define the composition and structure of data
objects in the database.
Data organization, storage and management. The DBMS is
responsible for organizing, storing and managing data in a
classified manner, involving data dictionaries, user data, data
access paths, etc.
Data manipulation. DBMS also provides data manipulation
language (DML), with which users can manipulate data to achieve
such basic operations as query, insert, deletion and modification on
data.
EMERGENCE AND DEVELOPMENT
OF DATABASE TECHNOLOGY
Source:
TRADITIONAL FILE
ENVIRONMENT
File organization concepts
Field- e.g. person’s name
Record- A group of related fields
File- A group of records of same type
Database- A group of related files make up a database
Entity- Record describes and entity e.g. person, place
Attribute- Each characteristic or quality describing a particular entity e.g.
student id
TRADITIONAL FILE
ENVIRONMENT
Problems with the traditional file environment
Data redundancy- Presence of duplicate data in multiple files stored in more than one
location
Data inconsistency- Data redundancy leads to data inconsistency where the same
attribute may have different values. This happens because the same information stored in
different locations not getting updated at the same time.
Program-data dependence- data stored in files dependent on the programs controlling it.
Lack of flexibility-Deliver routine scheduled reports but cannot deliver ad hoc or
unanticipated requests in a timely fashion.
Poor security- Access and dissemination of information not reliable
Lack of data sharing and availability- Pieces of information available in different files
with different departments, information sharing impossible.
Student record is managed in three excel sheets- Students personal details, student
academic record, student attendance record. List the problems encountered in such
record management
TRADITIONAL FILE
PROCESSING
DATABASE APPROACH TO
DATA MANAGEMENT
Database with multiple views
A COMPARISON
Manual File system Database system
management
Degree of data No shareability, Poor shareability, High shareability,
sharing very high high redundancy low redundancy
redundancy
Data independence No independence, Poor independence Physical and logical
complete independence
dependence on
programs
Data type Structured (mainly Structured Structured and
numerical) (numerical, text) unstructured
Data control Application control Application control Data security and
integrity
guaranteed by the
DBMS, providing
BENEFITS OF DATABASE
Data definition for the whole organization
High level of data sharing and easy expansion
Strong data independence.
Unified management and control.
Consideration for developing databases- computer technology, application
and data models
DATA MODELS
Relational data models- consists of a set of relations, each with a
normalized two-dimensional table as its data structure
Object oriented data models- uses semantic data model and object
oriented methods
XML data models- extensible markup language for semi structured
and unstructured data sources.
CLASS ACTIVITY
Student record in the college
List all the data elements required about a student
Group the records according to purpose
HIERARCHY DATABASE
Hierarchical model- presents a tree-like data structure
Features of hierarchy model-
(1) There is one and only one node without “two parents” nodes,
which is called the root node.
(2) Each of the nodes other than the root node has one and only one
“two parents” node, and this hierarchical model is often used in
common organizational structures.
NETWORK DATABASE
Many to many relations which allows a record to have more than
one parent segment e.g. students choosing courses and teachers
teaching them
•An owner record that is the same as the parent in the hierarchical
model.
•A member record that is the same as a child in the hierarchical
model.
DATABASES AND THE WEB
The databases are made available to users by linking the internal databases to the Web through software
programs that provide a connection to the database.
A database server, which is a special dedicated computer, maintains the DBMS.
A software program, called an application server, processes the transactions and offers data access.
A user making an inquiry through the Web server can connect to the organization’s database and receive
information in the form of a Web page.
CLASS ACTIVITY- CASE
STUDY- DIRT BIKES (30 MINUTES)
Dirt Bikes sells primarily through its distributors. It maintains a small customer database with the following data:
customer’s name, address, telephone number, model purchased, date of purchase, and distributor. These data are
collected by its distributors when they make a sale and forwarded to Dirt Bikes. Dirt Bikes would like to be able
to market more aggressively to its customers. The Marketing Department would like to be able to send customers
e-mail notices of special racing events and of sales on parts. It would also like to learn more about customers’
interests and tastes: their ages, years of schooling, another sport in which they are interested, and whether they
attend dirt bike racing events. Additionally Dirt Bikes would like to know whether customers own more than one
motorcycle. (Some Dirt Bikes customers own two or three motorcycles from Dirt Bikes U.S.A. or other
manufacturers.) If a motorcycle was purchased from Dirt Bikes, the company would like to know the date of
purchase, model purchased, and distributor. If the customer owns a non-Dirt Bikes motorcycle, the company
would like to know the manufacturer and model of the other motorcycle (or motorcycles), and the distributor
from whom the customer purchased that motorcycle.
1. Redesign Dirt Bikes’s customer database so that it can store and provide the information needed for
marketing. You will need to develop a design for the new customer database (Consider using multiple
tables in your new design)
2. What kind of reports would be of interest to Dirt Bikes’s marketing and sales department
DATABASE MANAGEMENT
Database management- management and maintenance of the DBMS
core goal to ensure the stability of the database, security and data consistency, high performance of the
system.
Stability- high availability of the database. Master-slave, multi-master, distributed and
other highly available architectures are used to ensure the availability and stability of
the database system.
Security- the security of the content stored in the database to avoid illegal access and
use of data content.
Data consistency- the database itself will provide many functions to ensure data
consistency, such as foreign key constraints and non-null constraints on table, etc.
Data consistency here is about the use of synchronization technology, replication tools,
etc. provided by the database when building master-standby systems, master-slave
systems and other multi-master systems to ensure data consistency among multiple
databases. The high performance of the system mainly involves the optimization,
monitor ing, troubleshooting and other work inside the database management.
DBA
Database Administrator (DBA) is a collective term for the personnel
involved in managing and maintaining the DBMS, not a particular
person, but a role
DATABASE OBJECT
MANAGEMENT
A database object is anything that is used to store or manipulate
data e.g. tables, indexes, sequences etc. Database objects are
meant to help users access the information in a database more
logically.
Database object management is the management of the data in
the database, including physical design and implementation work.
Physical design work refers to understanding the features and
functions provided by different database objects, and transforming
the data model in conceptual design and logical design into
physical database objects on the basis of reasonable relational
design principles.
Physical implementation work refers to the creation, deletion,
modification and optimization of database objects.
DATABASE SECURITY
MANAGEMENT
Database security management refers to the prevention of unauthorized
access to avoid leakage of protected information, prevention of security
breaches and improper data modifications to ensure that data is available
only to authorized users.
Database security management work includes the management of system
security, data security and network security.
Enterprise database security strategy includes
laying a solid foundation with authentication, authorization, access, control, recovery,
classification, and batch management
encryption and desensitization to protect critical information and data privacy without
affecting program application functions
creating database intrusion detection with audit monitoring and vulnerability
assessment, etc.,
developing security policies and standards to ensure role separation and availability.
BACKUP AND RECOVERY
MANAGEMENT
To develop a reasonable backup strategy to achieve regular backup
function to ensure the fastest recovery and minimum loss for the
database system in case of disaster.
DATABASE PERFORMANCE
MANAGEMENT
Database performance management- monitoring and optimizing the factors that affect
database performance and optimizing the resources that the database can use to increase
system throughput to handle the maximum possible workload.
The factors involved in database performance optimization include workload, throughput,
resources, and contention.
Work load for the database is the user-submitted usage requirements, expressed in different
forms such as online transactions, batch jobs, analytical queries, instant queries.
The workload of the database varies from time to time, and the overall workload has a great
impact on the database performance.
Throughput refers to the overall processing capacity of the database software, i.e., the number
of queries and transactions that can be processed per unit of time.
Resources include CPU, I/O, network, storage, processes, threads, and all other hardware and
software objects
Contention occurs when multiple processes try to access the same resource simultaneously. This
conflict arises because the number of resources is less than the demand of the workload.
OBJECT MANAGEMENT
Naming convention
Table_customer, t_customer_orders, t_customer, t_001
acct_amt, t_cust_orders, T_Customer_Orders
choose meaningful, easy to remember, descriptive, short and unique words
Use the name dictionary to develop some common abbreviations on a
project basis, such as “amt” for “amount”.
Databases may have set limits on number of characters
Too long names are not easy to remember and communicate, nor easy for
SQL code writing.
BACKUP AND RECOVERY
MANAGEMENT
There are many possible reasons for data loss, mainly storage
media failure, user’s operation error, server failure, virus invasion,
natural disasters, etc. Backup database is used to additionally store
the data in the database and the relevant information to ensure the
normal operation of the database system and to restore the
database upon the system failure.
The database backup includes data and data related database
objects, users, permissions, database environment (profiles, timing
tasks), etc.
Data recovery is the activity of restoring a database system from a
failed or paralyzed state to one that is operational and capable of
restoring data to an acceptable state.
DISASTER BACKUP
Disaster backup refers to the process of backing up data, data processing systems,
network systems, infrastructure, specialized technical information and operational
management information for the purpose of recovery after a disaster occurs.
Disaster backup has two objectives
1. Recovery time objective (RTO)- the time limit within which recovery must be completed
after a disaster has stopped an information system or business function.
2. Recovery point objective (RPO)- the requirement for the time point to which the system
and data are recovered to after a disaster.
For example, if the RPO requirement is one day, then the system and data must be
recovered to the state 24 h before the failure caused by the disaster, and the possibility
of data loss within 24 h is allowed in this case. However, if the data can be restored to the
state only two days ago, that is, 48 h ago, the requirement of RPO as1day is not satisfied.
The RTO emphasizes the availability of the service, and the smaller the RTO, the less the
loss of service. The RPO targets data loss, and the smaller the RPO, the less the data loss.
DISASTER RECOVERY
PLANNING
Frequency
On-site/ off-site
Time to recover
Data availability
Data loss
Scope of recovery
CHOOSING DISASTER
RECOVERY PLAN
The data backup system is required to guarantee data backup at least once a week, and the
backup media can be stored off-site, with no specific requirements for the backup data
processing system and backup network system.
Alternate site support- to equip part of the data processing equipment required for disaster
recovery, or to deploy the required data processing equipment to the backup site within a
predetermined time after a disaster
Electronic transmission and equipment support- to conduct at least one full data backup
every day, and the backup media is stored off-site, while using communication equipment to
transfer critical data to the backup site in batches at regular intervals several times a day
Electronic transfer and complete device support- to configure all data processing equipment,
communication lines and corresponding network equipment required for disaster recovery,
which must be in ready or operational status.
Real-time data transfer and complete device support- at least one full data backup per day
and backup media stored off-site, it also requires the use of remote data replication
technology to replicate critical data to the backup site in real time through the
communication network.
Zero data loss and remote cluster support- to realize remote real-time backup with zero data
loss; the backup data processing system should have the same processing capability as the
production data processing system, and the application software should be “clustered” and
DETERMINING APPROPRIATE
DISASTER RECOVERY LEVEL
Reasonable disaster recovery level needs to be determined for
service systems based on the cost-risk balance principle (i.e.,
balancing the cost of disaster recovery resources against the
potential loss due to risk).
The higher the disaster recovery level, the better the protection of
the information system, but this also means a sharp increase in
cost.
BACKUP STRATEGIES
Backup strategies are divided into
- full backup, differential backup and incremental backup according
to the scope of data collection
- hot backup, warm backup and cold backup according to whether
the database is deactivated
- physical backup and logical backup according to the backup
content.
Full backup, also called complete backup,
refers to the complete backup of all data and corresponding structures at a specified point
in time.
most secure backup type, whose backup and recovery time increases significantly with the
increase in data volume. While important, full backup also comes at a cost in time and
expenses, and is prone to a performance impact on the entire system.
Differential backup is a backup of data that has changed since the last full
backup. backup involves a small volume of data but the availability of
system data should be guaranteed. It only needs the data from the last full
backup and the most recent differential backup.
Incremental backup is a backup of the data that has changed after the
previous backup. Incremental backup involves a small volume of data and
requires very little time, but the reliability of each backup must be
guaranteed.
CLASS ASSIGNMENT
XYZ corporation takes full back up on Sunday and incremental
backup for rest of the week.
A system failure occurred in the early hours of Thursday morning
and the system needs to be restored, how will the data be restored
Incremental backup?
Differential backup?
CLASS ASSIGNMENT
XYZ corporation takes full back up on Sunday and incremental backup
for rest of the week.
A system failure occurred in the early hours of Thursday morning and
the system needs to be restored, how will the data be restored
In case of incremental backup- the full backup on Sunday, the
incremental backup on Monday, the incremental backup on Tuesday,
and the incremental backup on Wednesday must all be prepared and
restored in chronological order. If Tuesday’s incremental backup file is
corrupted, then Wednesday’s incremental backup will also fail, so that
only the data state at 12:00 PM on Monday can be restored.
In case of differential backup, simply prepare the full backup on Sunday
and the differential backup on Wednesday.
Hot backup is performed when the database is running normally,
where read/ write operations can be performed on the database
during the backup period.
Warm backup means that only database read operations can be
performed during the backup period, and no write operation is
allowed, where the database availability is weaker than hot backup.
Cold backup means that read/write operations are not available
during the backup period, and the backup data is the most reliable.
QUIZ
In the case that the database application does not allow the service
to stop, a ---------- backup solution must be used (hot/ cold)
In the case where the read/write service of the application can be
stopped and the accuracy of the backup data is required, the --------
backup solution is preferred. (hot/ cold)
QUIZ
In the case that the database application does not allow the service
to stop, a hot backup solution must be used
In the case where the read/write service of the application can be
stopped and the accuracy of the backup data is required, the cold
backup solution is preferred.
The hot backup solution should be used as much as possible for
routine daily backups, while a cold backup solution is
recommended in the case of system migration, so as to ensure
data accuracy.
A physical backup is a direct backup of the data files corresponding
to the database or even the entire disk.
Logical backup refers to exporting data from the database and
archiving the exported data for backup.
SECURITY MANAGEMENT
Network security. The main technologies for maintaining network
security are encryption technology, digital signature technology,
firewall technology, and intrusion detection technology. The security at
the network level focuses on the encryption of transmission contents.
Operating system security. Encryption aiming at securing the operating
system refers to the encryption of data files stored in the operating
system, the core of which is to ensure the security of the server, mainly
in terms of the server’s user accounts, passwords, access rights, etc.
DBMS security. The encryption aimed at DBMS security refers to the
encryp tion and decryption of data in the process of reading and writing
data by means of custom functions or built-in system functions,
involving database encryption, data access control, security auditing,
and data backup.
PERFORMANCE
MANAGEMENT
There are upper limits on the processing capacity of resources. For
example, the disk space is limited, and there are also upper limits
on CPU frequency, memory size and network bandwidth. Resources
are divided into supply resources and concurrency control
resources. The supply resources, also called basic resources, are
the resources corresponding to computer hardware, including the
resources managed by the operating system, whose processing
capacity is ordered as “CPU > memory >>disk network”.
Concurrency control resources include but are not limited to locks,
queues, caches, mutually exclusive signals, etc., which are also
resources managed by the database system. The basic principle of
performance management is to make full use of resources and not
to waste them.
CLASS ASSIGNMENT
For a dental clinic record, devise the backup strategy and backup
plan
PROBE QUESTION
What considerations will you keep for preparing the backup
strategy
While planning data backup recovery, and security, take Indian
govt norms into consideration. For example, RBI and Sebi have
mandates regarding maintaining confidentiality and privacy of user
data.
India Data privacy Act 2023 is a legislation to safeguard privacy of
individuals.