
Data Visualization and Overall Perspective
What is OLAP?

• Online Analytical Processing (OLAP) is a category of software that allows users to analyze information from multiple database systems at the same time. It is a technology that enables analysts to extract and view business data from different points of view.
• Analysts frequently need to group, aggregate, and join data. These OLAP operations are resource intensive; with OLAP, data can be pre-calculated and pre-aggregated, making analysis faster.
• OLAP databases are divided into one or more cubes. The cubes are designed so that creating and viewing reports becomes easy.
Basic analytical operations of OLAP
• The four types of analytical OLAP operations are:
• Roll-up
• Drill-down
• Slice and dice
• Pivot (rotate)
1) Roll-up:
• Roll-up is also known as “consolidation” or “aggregation.” The roll-up operation can be performed in two ways:
• Reducing dimensions
• Climbing up a concept hierarchy. A concept hierarchy is a system of grouping things based on their order or level.
• In this example, the cities New Jersey and Los Angeles are rolled up into the country USA.
• The sales figures of New Jersey and Los Angeles are 440 and 1560 respectively. They become 2000 after roll-up.
• In this aggregation process, data in the location hierarchy moves up from city to country.
• In the roll-up process at least one dimension is removed. In this example, the Cities dimension is removed; see the sketch below.
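• A minimal roll-up sketch in Python with pandas (a tooling assumption; the slides do not prescribe any library), reproducing the figures above: the City level is dropped and sales are aggregated at the Country level.

```python
import pandas as pd

# City-level facts from the example above.
sales = pd.DataFrame({
    "country": ["USA", "USA"],
    "city":    ["New Jersey", "Los Angeles"],
    "sales":   [440, 1560],
})

# Roll-up: climb the location hierarchy (city -> country) by dropping
# the City dimension and aggregating at the country level.
rolled_up = sales.groupby("country", as_index=False)["sales"].sum()
print(rolled_up)  # country=USA, sales=2000
```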
2) Drill-down
• In drill-down, data is fragmented into smaller parts. It is the opposite of the roll-up process. It can be done via:
• Moving down the concept hierarchy
• Increasing a dimension
• In the example, quarter Q1 is drilled down to the months January, February, and March, and the corresponding sales are also recorded.
• In this example, the Months dimension is added; a small sketch follows below.
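• A minimal drill-down sketch under the same pandas assumption (the monthly figures are hypothetical, since the slides only name the months):

```python
import pandas as pd

# Month-level facts; the sales values here are invented for illustration.
monthly = pd.DataFrame({
    "quarter": ["Q1", "Q1", "Q1"],
    "month":   ["January", "February", "March"],
    "sales":   [120, 150, 130],
})

# The coarse, quarter-level view.
quarterly = monthly.groupby("quarter", as_index=False)["sales"].sum()

# Drill-down: move down the concept hierarchy by re-introducing the
# finer-grained Months dimension.
drilled = monthly.groupby(["quarter", "month"], as_index=False)["sales"].sum()
```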
3) Slice:
• Here, one dimension is selected, and a new sub-cube is created.
• For example, the Time dimension is sliced with Q1 as the filter, and a new cube is created altogether.
Dice:
• This operation is similar to a slice. The difference is that in dice you select two or more dimensions, resulting in the creation of a sub-cube.
4) Pivot
• In pivot, you rotate the data axes to provide an alternative presentation of the data.
• In the following example, the pivot is based on item types; a combined sketch of slice, dice, and pivot follows below.
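• A combined sketch of slice, dice, and pivot on a toy cube held as a pandas DataFrame (all dimension values and figures here are hypothetical):

```python
import pandas as pd

# A tiny cube stored as a flat fact table.
cube = pd.DataFrame({
    "time":     ["Q1", "Q1", "Q2", "Q2"],
    "location": ["USA", "India", "USA", "India"],
    "item":     ["Mobile", "Modem", "Mobile", "Modem"],
    "sales":    [605, 825, 680, 952],
})

# Slice: select one dimension value (Time = Q1) to form a sub-cube.
slice_q1 = cube[cube["time"] == "Q1"]

# Dice: constrain two or more dimensions at once.
diced = cube[cube["time"].isin(["Q1", "Q2"]) & (cube["location"] == "USA")]

# Pivot: rotate the axes so item types become columns.
pivoted = cube.pivot_table(index="location", columns="item",
                           values="sales", aggfunc="sum")
```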
Types of OLAP systems
ROLAP
• ROLAP works with data that exists in a relational database. Fact and dimension tables are stored as relational tables. It also allows multidimensional analysis of data and is the fastest-growing type of OLAP.
• Advantages of the ROLAP model:
• High data efficiency. It offers high data efficiency because query performance and the access language are optimized particularly for multidimensional data analysis.
• Scalability. This type of OLAP system offers scalability for managing large volumes of data, even when the data is steadily increasing.
Relational OLAP (ROLAP) Server
• These are intermediate servers which stand in between a relational
back-end server and user frontend tools.
• They use a relational or extended-relational DBMS to save and handle
warehouse data, and OLAP middleware to provide missing pieces.
• ROLAP servers contain optimization for each DBMS back end,
implementation of aggregation navigation logic, and additional tools
and services.
• ROLAP technology tends to have higher scalability than MOLAP
technology.
• ROLAP systems work primarily from the data that resides in a
relational database, where the base data and dimension tables are
stored as relational tables. This model permits the multidimensional
analysis of data.
• This technique relies on manipulating the data stored in the relational database to give the appearance of traditional OLAP's slicing and dicing functionality. In essence, each method of slicing and dicing is equivalent to adding a "WHERE" clause to the SQL statement, as in the sketch below.
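• A sketch of this mapping with an in-memory SQLite database (the fact table name, columns, and figures are assumptions): slicing on Time = Q1 reduces to a WHERE clause, and dicing would simply add more predicates.

```python
import sqlite3

# An in-memory fact table standing in for the relational back end.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales_fact (quarter TEXT, location TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales_fact VALUES (?, ?, ?)",
    [("Q1", "USA", 440.0), ("Q1", "India", 825.0), ("Q2", "USA", 680.0)],
)

# The slice on Time = Q1 is just a WHERE clause over the fact table.
rows = conn.execute(
    "SELECT location, SUM(amount) FROM sales_fact"
    " WHERE quarter = 'Q1' GROUP BY location"
).fetchall()
```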
MOLAP
• Multidimensional OLAP (MOLAP) is a classical OLAP that facilitates data analysis by using a multidimensional data cube. Data is pre-computed, pre-summarized, and stored in a MOLAP cube (a major difference from ROLAP). Using MOLAP, a user can view multidimensional data from different facets.
• Multidimensional data analysis is also possible if a relational database is used, but that would require querying data from multiple tables. By contrast, MOLAP has all possible combinations of data already stored in a multidimensional array and can access this data directly. Hence, MOLAP is faster compared to Relational Online Analytical Processing (ROLAP).
• Multidimensional OLAP (MOLAP) Server
• A MOLAP system is based on a native logical model that directly supports multidimensional data and operations. Data are stored physically in multidimensional arrays, and positional techniques are used to access them.
• One of the significant distinctions of MOLAP from ROLAP is that data are summarized and stored in an optimized format in a multidimensional cube, instead of in a relational database. In the MOLAP model, data are structured into proprietary formats according to the client's reporting requirements, with the calculations pre-generated on the cubes.
• Key Points in MOLAP
• In MOLAP, operations are called processing.
• MOLAP tools process information with the same amount of
response time irrespective of the level of summarizing.
• MOLAP tools remove complexities of designing a relational
database to store data for analysis.
• A MOLAP server implements two levels of storage representation to manage dense and sparse data sets.
• Storage utilization can be low if the data set is sparse.
• Facts are stored in a multidimensional array, and dimensions are used to index into it, as in the sketch below.
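• A sketch of this positional access with NumPy (member names and figures are assumptions): facts live in a dense, pre-computed array and lookups are pure indexing, with no joins.

```python
import numpy as np

# Dimensions map member names to array positions.
quarters  = {"Q1": 0, "Q2": 1}
locations = {"USA": 0, "India": 1}

# Facts live in a dense, pre-computed multidimensional array.
cube = np.array([[440.0, 825.0],
                 [680.0, 952.0]])

# Positional lookup: no joins, just indexing. Note that a mostly-empty
# (sparse) cube would waste this dense storage, as the slide points out.
q1_usa = cube[quarters["Q1"], locations["USA"]]
```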
MOLAP Advantages
• MOLAP can manage, analyze, and store considerable amounts of multidimensional data.
• Fast Query Performance due to optimized storage, indexing, and
caching.
• Smaller sizes of data as compared to the relational database.
• Automated computation of higher level of aggregates data.
• Help users to analyze larger, less-defined data.
• MOLAP is easier for the user, which makes it a suitable model for inexperienced users.
• MOLAP cubes are built for fast data retrieval and are optimal for
slicing and dicing operations.
• All calculations are pre-generated when the cube is created.
Hybrid OLAP
• Hybrid OLAP (HOLAP) is a mixture of both ROLAP and MOLAP. It offers the fast computation of MOLAP and the higher scalability of ROLAP. HOLAP uses two databases:
• Aggregated or computed data is stored in a multidimensional
OLAP cube
• Detailed information is stored in a relational database.
• Benefits of Hybrid OLAP:
• This kind of OLAP helps to economize disk space, and it also remains compact, which helps to avoid issues related to access speed and convenience.
• HOLAP uses cube technology, which allows faster performance for all types of data.
• HOLAP incorporates the best features of MOLAP and ROLAP into a single architecture. HOLAP systems keep the more substantial quantities of detailed data in relational tables, while the aggregations are stored in pre-calculated cubes. HOLAP can also drill through from the cube down to the relational tables for detailed data. Microsoft SQL Server 2000 provides a hybrid OLAP server.
• Advantages of HOLAP
• HOLAP provides the benefits of both MOLAP and ROLAP.
• It provides fast access at all levels of aggregation.
• HOLAP balances the disk space requirement, as it only stores the aggregate information on the OLAP server while the detail records remain in the relational database, so no duplicate copy of the detail records is maintained.
1. Backup
Backup refers to storing a copy of the original data which can be used in case of data loss. Backup is considered one of the approaches to data protection. Important data of the organization needs to be backed up efficiently to protect it. Backup can be achieved by storing a copy of the original data or database separately on storage devices. Various types of backup are available, such as full backup, incremental backup, local backup, mirror backup, etc.
2. Recovery
Recovery refers to restoring lost data by following some process. Even if the data was backed up, data that is subsequently lost can be recovered by implementing recovery techniques. When a database fails for any reason there is a chance of data loss, so in that case the recovery process helps in improving the reliability of the database.
Backup vs. Recovery
• Backup refers to storing a copy of the original data separately; recovery refers to restoring the lost data in case of failure.
• In simple terms, backup is the replication of data; recovery is the process of restoring the database.
• Backup helps in improving data protection; recovery helps in improving the reliability of the database.
• Backup makes the recovery process easier; recovery has no role in data backup.
• The cost of backup is affordable; the cost of recovery is expensive.
• Backup's production usage is very common; recovery's production usage is very rare.
Data Warehouse Tuning
• A data warehouse keeps evolving, and it is unpredictable what query the user is going to post in the future. Therefore it becomes more difficult to tune a data warehouse system. In this unit, we will discuss how to tune the different aspects of a data warehouse such as performance, data load, queries, etc.
Difficulties in Data Warehouse Tuning
• Tuning a data warehouse is a difficult procedure due to the following reasons −
• Data warehouse is dynamic; it never remains constant.
• It is very difficult to predict what query the user is
going to post in the future.
• Business requirements change with time.
• Users and their profiles keep changing.
• The user can switch from one group to another.
• The data load on the warehouse also changes with
time.
Data Load Tuning
• Data load is a critical part of overnight processing. Nothing else can run until the data load is complete. This is the entry point into the system.
• There are various approaches to tuning the data load, discussed below −
• The most common approach is to insert data using the SQL layer. In this approach, normal checks and constraints need to be performed. When the data is inserted into the table, code runs to check that there is enough space to insert the data. If sufficient space is not available, more space may have to be allocated to these tables. These checks take time to perform and are costly in CPU.
• The second approach is to bypass all these checks and constraints and place the data directly into preformatted blocks. These blocks are later written to the database. It is faster than the first approach, but it can work only with whole blocks of data, which can lead to some space wastage.
• The third approach is that while loading data into a table that already contains data, we can maintain the indexes. A rough sketch contrasting the first two approaches follows below.
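• Using SQLite for illustration (the slides do not name a DBMS): per-row inserts go through the normal check path, while a batched load in one transaction roughly approximates writing whole blocks.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE staging (id INTEGER PRIMARY KEY, value REAL)")
rows = [(i, float(i)) for i in range(10_000)]

# SQL-layer load: every row goes through the normal parse/check path,
# which is the per-row CPU cost described above.
for row in rows[:100]:
    conn.execute("INSERT INTO staging VALUES (?, ?)", row)

# Batched load in a single transaction: a rough analogue of writing
# whole preformatted blocks, trading per-row checks for throughput.
with conn:
    conn.executemany("INSERT INTO staging VALUES (?, ?)", rows[100:])
```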
Integrity Checks
• Integrity checking highly affects the performance of the load. Following are the points to remember −
• Integrity checks need to be limited because they require heavy processing power.
• Integrity checks should be applied on the source system to avoid performance degradation of the data load.
Tuning Queries
• We have two kinds of queries in a data warehouse −
• Fixed queries
• Ad hoc queries
Fixed Queries
• Fixed queries are well defined. Following are examples of fixed queries −
• Regular reports
• Canned queries
• Common aggregations
Testing in a Data Warehouse
• A data warehouse stores a huge amount of data, which is typically collected from multiple heterogeneous sources like files, DBMSs, etc., to produce statistical results that help in decision making.
• Testing is very important for data warehouse systems for data validation and to make them work correctly and efficiently. There are three basic levels of testing performed on a data warehouse, which are as follows:
• Unit Testing –
This type of testing is performed at the developer’s end. In unit testing, each unit/component of the modules is separately tested. Each module of the whole data warehouse, i.e. program, SQL script, procedure, Unix shell script, is validated and tested.
• Integration Testing –
In this type of testing the various individual units/modules of the application are brought together or combined and then tested against a number of inputs. It is performed to detect faults in integrated modules and to test whether the various components are performing well after integration.
• System Testing –
System testing is the form of testing that validates and tests the whole data warehouse application. This type of testing is performed by the technical testing team. This test is conducted after the developers' team performs unit testing, and its main purpose is to check whether the entire system works correctly as a whole or not.
• Challenges of data warehouse testing are:
• Data selection from multiple sources, and the analysis that follows, pose a great challenge.
• Due to the volume and complexity of the data, certain testing strategies are time consuming.
• ETL testing requires Hive SQL skills, so it poses challenges for testers who have limited SQL skills.
• Redundant data in the data warehouse.
• Inconsistent and inaccurate reports.
• ETL testing is performed in five stages (a toy pipeline mirroring them is sketched after this list):
• Identifying data sources and requirements.
• Data acquisition.
• Implementing business logic and dimensional modeling.
• Building and populating data.
• Building reports.
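• The following toy pipeline maps each function to one of the stages above (all names and the CSV source file are hypothetical):

```python
import csv
import sqlite3

def extract(path):
    # Stages 1-2: identify the source and acquire the data.
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def transform(records):
    # Stage 3: apply business logic (here: type casting and filtering).
    return [(r["region"], float(r["amount"])) for r in records if r["amount"]]

def load(rows, conn):
    # Stage 4: build and populate the warehouse table.
    conn.execute("CREATE TABLE IF NOT EXISTS sales (region TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)

def report(conn):
    # Stage 5: build a report over the loaded data.
    return conn.execute(
        "SELECT region, SUM(amount) FROM sales GROUP BY region").fetchall()

conn = sqlite3.connect(":memory:")
load(transform(extract("sales.csv")), conn)  # "sales.csv" is a made-up source
print(report(conn))
```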
The best applications of Data Warehousing
Web Mining
• Web mining is the application of data mining techniques to automatically discover and extract information from web documents and services. The main purpose of web mining is discovering useful information from the World Wide Web and its usage patterns.
• Applications of Web Mining:
• Web mining helps to improve the power of web search engines by classifying web documents and identifying web pages.
• It is used for web search (e.g., Google, Yahoo) and vertical search (e.g., FatLens, Become).
• Web mining is used to predict user behavior.
• Web mining is very useful for improving a particular website or e-service, e.g., landing page optimization.
• Web Content Mining:
Web content mining is the application of extracting useful information from the content of web documents. Web content consists of several types of data – text, image, audio, video, etc. Content data is the group of facts that a web page is designed to convey. It can provide effective and interesting patterns about user needs. Mining text documents draws on text mining, machine learning, and natural language processing, so this kind of mining is also known as text mining. It performs scanning and mining of the text, images, and groups of web pages according to the content of the input.
• Web Structure Mining:
Web structure mining is the application of discovering structure information from the web. The structure of the web graph consists of web pages as nodes and hyperlinks as edges connecting related pages. Structure mining basically shows the structured summary of a particular website. It identifies relationships between web pages linked by information or direct link connections. Web structure mining can be very useful, for example, to determine the connection between two commercial websites; a toy PageRank sketch follows below.
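• A toy power-iteration PageRank over a made-up link graph, illustrating how structure mining operates on pages-as-nodes and hyperlinks-as-edges:

```python
# A tiny link graph: pages are nodes, hyperlinks are edges.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
pages = list(links)
d = 0.85                                    # damping factor
rank = {p: 1.0 / len(pages) for p in pages}

for _ in range(50):                         # power iteration
    new = {p: (1.0 - d) / len(pages) for p in pages}
    for src, outs in links.items():
        for dst in outs:
            # Each page shares its rank equally among its out-links.
            new[dst] += d * rank[src] / len(outs)
    rank = new

print(sorted(rank.items(), key=lambda kv: -kv[1]))
```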
• Web Usage Mining:
Web usage mining is the application of identifying or discovering interesting usage patterns from large data sets, and these patterns enable you to understand user behavior. In web usage mining, users access data on the web, and those accesses are collected in the form of logs. So web usage mining is also called log mining; a small log-counting sketch follows below.
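• The following sketch counts page hits and per-visitor activity from invented Common Log Format lines, a minimal form of usage mining:

```python
import re
from collections import Counter

# Invented sample lines in Common Log Format.
log_lines = [
    '10.0.0.1 - - [01/Jan/2024:10:00:00 +0000] "GET /home HTTP/1.1" 200 512',
    '10.0.0.1 - - [01/Jan/2024:10:00:05 +0000] "GET /products HTTP/1.1" 200 734',
    '10.0.0.2 - - [01/Jan/2024:10:01:00 +0000] "GET /home HTTP/1.1" 200 512',
]

pattern = re.compile(r'^(\S+) \S+ \S+ \[[^\]]+\] "(\S+) (\S+)')
page_hits = Counter()
visitors = Counter()
for line in log_lines:
    m = pattern.match(line)
    if m:
        ip, method, path = m.groups()
        page_hits[path] += 1    # which pages are popular
        visitors[ip] += 1       # how active each visitor is
```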
Data Mining vs. Web Mining
• Definition: Data mining is the process that attempts to discover patterns and hidden knowledge in large data sets in any system; web mining is the application of data mining techniques to automatically discover and extract information from web documents.
• Application: Data mining is very useful for web page analysis; web mining is very useful for a particular website and e-service.
• Target Users: Data mining targets data scientists and data engineers; web mining targets data scientists along with data analysts.
• Access: Data mining accesses data privately; web mining accesses data publicly.
• Structure: Data mining gets information from an explicit structure; web mining gets information from structured, unstructured, and semi-structured web pages.
• Problem Type: Data mining covers clustering, classification, regression, prediction, optimization, and control; web mining covers web content mining and web structure mining.
• Tools: Data mining includes tools like machine-learning algorithms; special tools for web mining are Scrapy, PageRank, and Apache logs.
1. Spatial Data Mining:
Spatial data mining is the process of discovering interesting and previously unknown, but potentially useful, patterns from spatial databases. In spatial data mining, analysts use geographical or spatial information to produce business intelligence or other results. Challenges involved in spatial data mining include identifying patterns or finding objects that are relevant to the research project.
2. Temporal Data Mining:
Temporal data mining refers to the extraction of implicit, non-trivial, and potentially useful abstract information from large collections of temporal data. It is concerned with the analysis of temporal data and with finding temporal patterns and regularities in sets of temporal data. The tasks of temporal data mining are (a small trend-analysis sketch follows the list):
• Data Characterization and Comparison
• Cluster Analysis
• Classification
• Association Rules
• Prediction and Trend Analysis
• Pattern Analysis
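• A sketch of the Prediction and Trend Analysis task using pandas resampling over a synthetic daily series (the data and window sizes are assumptions):

```python
import pandas as pd

# A synthetic daily series standing in for temporal data.
idx = pd.date_range("2024-01-01", periods=90, freq="D")
series = pd.Series(range(90), index=idx, dtype=float)

monthly_trend = series.resample("MS").mean()   # month-level trend
smoothed = series.rolling(window=7).mean()     # 7-day moving average
```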
Spatial Data Mining vs. Temporal Data Mining
• Spatial data mining requires space; temporal data mining requires time.
• Spatial data mining deals with spatial (location, geo-referenced) data; temporal data mining deals with implicit or explicit temporal content drawn from large quantities of data.
• Spatial databases store spatial objects represented by spatial data types and spatial associations among such objects; temporal data mining comprises the subject as well as its utilization in the modification of fields.
• Spatial data mining includes finding characteristic rules, discriminant rules, association rules, evaluation rules, etc.; temporal data mining aims at mining new and unknown knowledge which takes the temporal aspects of data into account.
• Spatial data mining is the method of identifying unusual and unexplored, but useful, models from spatial databases; temporal data mining extracts useful knowledge from temporal data.
• Spatial mining is the extraction of knowledge/spatial relationships and interesting measures that are not explicitly stored in the spatial database; temporal mining is the extraction of knowledge about the occurrence of an event, whether it follows cyclic, random, or seasonal variations, etc.
