Data Warehousing and Data Mining Assignment 1
Name: Tanya Maheshwari
Enrollment No. 02613702022
Submitted to: Ms. Ruchika
1. Explain the three-tier architecture of data warehousing.
Answer:
The three-tier architecture in data warehousing organizes the system into layers for better
performance, scalability, and maintenance. The tiers are:
Bottom Tier (Data Source Layer):
This layer consists of relational databases, flat files, and external data sources. It uses
ETL (Extract, Transform, Load) processes to gather data from different sources, clean it,
and load it into the warehouse.
Middle Tier (Data Warehouse Layer):
This is the core layer, where the actual data warehouse resides. It stores integrated,
historical, and subject-oriented data in a structured format, often using OLAP (Online
Analytical Processing) servers. It enables fast query processing and multidimensional
analysis.
Top Tier (Presentation Layer):
This layer interacts with end-users via reporting tools, dashboards, data mining, and
business intelligence applications. It allows users to perform queries, generate reports, and
visualize data insights.
2. What are the most common transformations in ETL processes?
Answer:
In ETL (Extract, Transform, Load) processes, transformation is the critical phase where raw data is
converted into a structured and usable format. Common transformations include:
Data Cleaning: Handling missing values, correcting errors, and removing duplicates.
Data Filtering: Selecting only relevant records based on specific conditions.
Data Aggregation: Summarizing data, e.g., calculating averages or totals.
Data Mapping: Converting source data to the target schema by applying mapping rules.
Data Encoding: Converting categorical data into numerical formats (e.g., one-hot encoding).
Data Splitting: Breaking one column into multiple columns (e.g., splitting full name into first
and last names).
Data Merging: Combining data from multiple sources or tables into a single dataset.
Normalization/Standardization: Scaling numerical data for consistency across datasets.
These transformations ensure that data is clean, consistent, and correctly structured before loading
into the data warehouse, improving the accuracy of analysis and decision-making.
3. What are the various components of data warehouse? Explain their functionality in detail.
Answer:
A data warehouse consists of several components, each with specific roles to support storage,
processing, and analysis of data:
Data Sources: These are external or internal databases, flat files, APIs, etc., from which data
is extracted.
ETL Tools (Extract, Transform, Load): These tools extract data from sources, clean and
transform it, and load it into the warehouse. Tools include Informatica, Talend, and SSIS.
Data Staging Area: A temporary storage location where data is cleansed and transformed
before loading. It ensures that only high-quality data is loaded.
Data Warehouse Database: The central repository where processed, structured, and
historical data is stored. It supports multidimensional analysis and querying.
Metadata Repository: Stores data about the data (i.e., metadata) including source
information, schema definitions, data lineage, and transformation rules.
OLAP Engine: Enables complex analytical queries and multidimensional views of data, such
as slicing, dicing, roll-up, and drill-down.
Front-End Tools: These are used for reporting, querying, visualization, and dashboard
creation. They help users extract insights and make informed decisions.
Together, these components ensure seamless data flow, storage, management, and analysis.
4. Differentiate between Data Warehouse, Database, Data Marts, and Data Repository.
Answer:
Aspect Database Data Warehouse Data Mart Data Repository
Manages real-
Stores integrated,
time Department-level General term for centralized
Purpose historical analytical
transactional analytical data data storage
data
data
Current, Subset of data Any type
Data Type Historical, analytical
operational warehouse (structured/unstructured)
Daily operations
Business intelligence Focused reporting Storing and managing
Usage (CRUD
and decision-making for specific teams diverse data
operations)
Narrow,
Wide, organization- Narrow, subject- Broad, not limited to
Scope application-
wide specific structured data
specific
MySQL for e-
Enterprise data Marketing data Big data repository like
Example commerce
warehouse (EDW) mart Hadoop
orders