Unit 5 Notes
Data mining – Data warehousing – Data mining vs Data warehouse – Machine learning –
Supervised learning – Unsupervised learning – Business Intelligence – Cloud computing.
1.DATA MINING
Data mining is a multidisciplinary field that involves extracting valuable insights from large
datasets. It combines techniques from statistics, machine learning, artificial intelligence, and
database systems to identify patterns, relationships, and trends that are not immediately obvious.
This knowledge can be used to make predictions, improve decision-making, and uncover hidden
insights within vast amounts of data.
The process of data mining can be broken down into the following phases:
1. Data Collection:
The first step in any data mining project is gathering data. This could come from various
sources, including databases, data warehouses, data lakes, and external data sources. It's
important to have access to large, clean, and relevant data.
2. Data Preprocessing:
This phase involves cleaning and preparing the data for analysis. It includes:
o Data Cleaning: Removing or correcting errors, missing values, and
inconsistencies in the data.
o Data Transformation: Normalizing or scaling data to ensure all variables are on
the same scale (especially important in algorithms like k-means clustering).
o Data Integration: Combining data from different sources and ensuring
compatibility.
o Data Reduction: Reducing the complexity of the data without losing important
information (e.g., dimensionality reduction).
3. Data Exploration:
Here, data scientists perform exploratory data analysis (EDA) to better understand the
data's structure, patterns, and relationships. Visualization techniques and summary
statistics are often used.
4. Data Mining:
This is the core phase, where various data mining techniques are applied to the data.
These techniques help to uncover patterns, trends, and relationships. The primary
methods include:
o Classification: Predicting a categorical label for new data based on historical
data. For example, classifying an email as spam or not.
o Clustering: Grouping similar data points together based on certain attributes. For
instance, clustering customers based on buying behavior.
o Regression: Predicting a continuous numeric value based on input features. For
example, predicting house prices based on factors like location, size, and age.
o Association Rule Mining: Discovering interesting relationships between
variables, such as finding that customers who buy diapers are also likely to buy
baby wipes.
o Anomaly Detection: Identifying rare or unusual patterns, often used for fraud
detection or fault detection.
5. Model Evaluation:
After building the model, it’s crucial to assess its performance. Evaluation metrics
depend on the type of problem:
o For classification, metrics such as accuracy, precision, recall, and F1 score are
used.
o For regression, mean squared error (MSE) or R-squared can be used.
o Cross-validation is often applied to check how well the model generalizes to new,
unseen data (a brief code sketch follows this list).
6. Deployment and Monitoring:
After evaluation, the model can be deployed into production to start making predictions
or identifying patterns in real-time data. Continuous monitoring and updates are required
to maintain model accuracy as new data is collected.
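As a minimal illustration of phases 4 and 5, the sketch below trains a classifier on a small synthetic dataset with scikit-learn and evaluates it with accuracy, F1 score, and cross-validation. The dataset is generated for the example; any labeled historical data would work the same way.

# Minimal sketch of the data mining and model evaluation phases (scikit-learn).
# The synthetic dataset stands in for real historical data.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, f1_score

# Toy dataset: 500 rows, 6 features, binary label (e.g., spam / not spam).
X, y = make_classification(n_samples=500, n_features=6, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Phase 4: apply a mining technique (here, classification with a decision tree).
model = DecisionTreeClassifier(max_depth=4, random_state=42)
model.fit(X_train, y_train)

# Phase 5: evaluate on unseen data and with cross-validation.
pred = model.predict(X_test)
print("accuracy:", accuracy_score(y_test, pred))
print("F1 score:", f1_score(y_test, pred))
print("5-fold CV accuracy:", cross_val_score(model, X, y, cv=5).mean())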
Tools and Technologies in Data Mining
There are many tools and technologies available for data mining. Some of the most commonly
used ones include:
Programming Languages:
o Python: Libraries like Pandas, Scikit-learn, TensorFlow, and Keras are widely
used in data mining for data manipulation, analysis, and machine learning (a short
Pandas example follows this list).
o R: Another popular language for statistical analysis and data mining, with
packages like caret, randomForest, and ggplot2 for visualization.
Data Mining Software:
o RapidMiner: A user-friendly tool that provides a wide array of data mining and
machine learning algorithms without much programming.
o KNIME: Open-source data analytics platform with a graphical interface to build
data pipelines.
o Weka: A collection of machine learning algorithms for data mining tasks, with an
easy-to-use interface.
Databases:
o SQL Databases: For managing structured data.
o NoSQL Databases: Used for unstructured or semi-structured data, like
MongoDB or Cassandra.
o Data Warehouses: Specialized databases for reporting and analysis (e.g.,
Amazon Redshift, Google BigQuery).
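As a small illustration of the preprocessing work these libraries are used for, the sketch below cleans, encodes, and scales a toy table with Pandas and scikit-learn. The column names and values are invented for the example.

# Hypothetical toy table; in practice this would come from a database or CSV file.
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({
    "age": [25, 32, None, 45, 32],
    "income": [30000, 52000, 41000, None, 52000],
    "segment": ["A", "B", "B", "A", "B"],
})

df = df.drop_duplicates()                      # data cleaning: remove duplicate rows
df = df.fillna(df.mean(numeric_only=True))     # data cleaning: fill missing numeric values
df = pd.get_dummies(df, columns=["segment"])   # data transformation: encode categories

# Data transformation: scale numeric columns to a common range.
df[["age", "income"]] = StandardScaler().fit_transform(df[["age", "income"]])
print(df.head())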
2.DATA WAREHOUSING
Data warehousing is the process of collecting, storing, and managing large volumes of data
from multiple sources into a centralized repository designed for reporting and analysis. It enables
businesses to consolidate data from disparate sources and provides a platform for decision
support and business intelligence (BI). The data warehouse is structured to support query
processing and decision-making activities efficiently.
A typical data warehousing architecture consists of four layers. The data source layer is
the foundation: it consists of all the external data sources that provide the raw data for the
data warehouse. These sources could include:
Operational Databases: These are transactional systems (e.g., CRM, ERP) that manage
day-to-day operations and generate transactional data.
Flat Files: Data can also come from files like CSV, Excel, or log files.
External Data: This could include third-party data sources or data from public APIs.
Social Media, IoT Devices, and Web Scraping: Additional sources of data that can feed
into the data warehouse.
Once data is collected from the data sources, it is moved to the data staging layer. This is a
temporary area where data is processed before it is loaded into the data warehouse. In this layer,
the data undergoes ETL (Extract, Transform, Load) operations.
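A minimal ETL sketch is shown below, assuming a hypothetical sales.csv export and a local SQLite database standing in for the staging area and warehouse; the column names are illustrative, and a real pipeline would use a dedicated ETL tool and a production warehouse.

# Minimal ETL sketch: extract from a (hypothetical) CSV source, transform, load into SQLite.
import sqlite3
import pandas as pd

# Extract: read raw data from an operational export (file name is illustrative).
raw = pd.read_csv("sales.csv")

# Transform: clean and reshape the data for analysis.
raw = raw.dropna(subset=["order_id"])                  # drop rows without a key
raw["order_date"] = pd.to_datetime(raw["order_date"])  # standardize the date format
raw["revenue"] = raw["quantity"] * raw["unit_price"]   # derive an analytical measure

# Load: write the cleaned data into the warehouse table.
with sqlite3.connect("warehouse.db") as conn:
    raw.to_sql("fact_sales", conn, if_exists="append", index=False)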
The data warehouse layer is the central repository where data is stored after the ETL process.
This layer is designed to support reporting, querying, and analytical tasks. It is the most
important part of the architecture and serves as the source of truth for all business intelligence
activities.
The data presentation layer is where end users interact with the data warehouse. This layer
involves tools and interfaces that allow users to visualize, analyze, and report on the data stored
in the warehouse. Business intelligence (BI) tools are commonly used in this layer.
Business Intelligence (BI) Tools: Tools like Tableau, Power BI, QlikView, and
Looker allow users to create reports, dashboards, and visualizations that provide insights.
OLAP (Online Analytical Processing) Tools: Used alongside BI tools to analyze
multidimensional data and generate reports and dashboards.
3.DATA MINING VS DATA WAREHOUSE
1. Definition
Data Warehousing:
Data warehousing is the process of collecting, storing, and managing large volumes of
historical data from various sources into a centralized repository known as a data
warehouse. This structured repository is designed to support query and reporting
processes, enabling business intelligence (BI) and decision-making.
Data Mining:
Data mining refers to the process of analyzing large datasets to uncover hidden patterns,
correlations, and useful insights. It involves using algorithms and statistical techniques to
extract valuable knowledge from the data stored in databases or data warehouses. Data
mining is more focused on discovering insights rather than storing or organizing data.
2. Primary Focus
Data Warehousing:
The primary focus of data warehousing is on data storage, management, and
consolidation. It involves organizing and storing data from various operational systems
into a centralized location, typically optimized for read-heavy analytical processing.
Data Mining:
Data mining is focused on the analysis of data to find patterns, trends, or relationships.
The goal is to extract actionable insights that can drive decision-making or predict future
outcomes.
3. Purpose
Data Warehousing:
The main purpose of a data warehouse is to store and consolidate data for reporting,
querying, and analysis. A data warehouse supports Business Intelligence (BI) activities
by providing a single source of truth from which companies can draw data for reporting
and decision-making.
Data Mining:
The primary purpose of data mining is to discover patterns and trends in the data. It
uses advanced analytical techniques, such as machine learning, clustering, classification,
and regression, to extract insights that were previously hidden in large datasets.
4. Key Technologies and Techniques
Data Warehousing:
o ETL (Extract, Transform, Load): Used to extract data from multiple sources,
clean and transform it, and load it into the data warehouse.
o OLAP (Online Analytical Processing): Used to analyze multidimensional data
and generate reports and dashboards (a small aggregation sketch follows this list).
o Relational Database Management Systems (RDBMS): For storing structured
data in a way that is optimized for reporting and analysis.
Data Mining:
o Classification: Assigning data into predefined categories or classes.
o Clustering: Grouping data into similar categories.
o Regression: Predicting a continuous value.
o Association Rule Mining: Identifying relationships between variables (e.g., in
market basket analysis).
o Anomaly Detection: Identifying outliers or rare events.
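To make the OLAP idea concrete, the sketch below performs a small OLAP-style roll-up with a Pandas pivot table; the region, product, and amount columns are invented, and a real OLAP engine would run similar aggregations inside the warehouse.

# OLAP-style roll-up sketch: aggregate a measure across two dimensions with pandas.
import pandas as pd

sales = pd.DataFrame({
    "region":  ["North", "North", "South", "South", "South"],
    "product": ["A", "B", "A", "A", "B"],
    "amount":  [120, 80, 200, 150, 90],
})

# Rows = region, columns = product, values = total sales (a simple cube slice with totals).
cube = sales.pivot_table(index="region", columns="product",
                         values="amount", aggfunc="sum", margins=True)
print(cube)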
Example Applications
Data Warehousing:
o Business Reporting: A retail chain may have a data warehouse containing sales
data from multiple stores. Analysts can query the warehouse to generate sales
reports, customer behavior insights, and inventory analysis.
o Historical Trend Analysis: A financial organization might use a data warehouse
to store transaction data and track long-term trends, such as stock performance or
client portfolios over time.
Data Mining:
o Customer Segmentation: A company can use data mining to analyze customer
behavior and segment customers into groups based on purchasing patterns,
demographics, and preferences.
o Fraud Detection: In banking, data mining techniques can be applied to detect
unusual transaction patterns that could indicate fraudulent activity (see the sketch
after this list).
o Recommendation Systems: Data mining algorithms are used by e-commerce
sites (like Amazon or Netflix) to recommend products or movies based on users’
past behavior and preferences.
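As a hedged sketch of the fraud detection use case mentioned above, the code below flags unusual transactions with scikit-learn's IsolationForest on a made-up set of transaction amounts; real systems would use many more features and domain-specific rules.

# Anomaly detection sketch: flag unusual transaction amounts with IsolationForest.
import numpy as np
from sklearn.ensemble import IsolationForest

# Mostly ordinary transaction amounts, plus a few extreme values.
amounts = np.array([[25], [40], [31], [28], [35], [5000], [27], [33], [7200], [30]])

detector = IsolationForest(contamination=0.2, random_state=0)
labels = detector.fit_predict(amounts)   # -1 = anomaly, 1 = normal

for amount, label in zip(amounts.ravel(), labels):
    flag = "ANOMALY" if label == -1 else "ok"
    print(f"{amount:>7.0f}  {flag}")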
Key Difference: Data warehousing is concerned with storing and organizing data in a centralized
repository, whereas data mining is concerned with analyzing that stored data to discover patterns
and insights.
4.MACHINE LEARNING
Machine Learning (ML) is a branch of artificial intelligence (AI) that focuses on developing
algorithms and statistical models that allow computers to learn from and make predictions or
decisions based on data, without being explicitly programmed. The goal of machine learning is
to enable computers to automatically improve their performance on a task through experience.
A typical machine learning workflow involves the following steps:
1. Data Collection:
Collect relevant data from various sources such as databases, sensors, websites, or logs.
The quality and quantity of the data are crucial for building an effective machine learning
model.
2. Data Preprocessing:
Raw data often requires cleaning and transformation before it can be used in machine
learning models. This involves:
o Handling missing values
o Removing duplicates
o Encoding categorical data
o Normalizing or scaling numerical data
3. Feature Selection and Engineering:
Selecting the most important features (variables) and transforming them into a format
suitable for the model. Feature engineering involves creating new features based on
domain knowledge to improve the model's performance.
4. Model Training:
Train the machine learning model on the prepared data. During training, the algorithm
learns to recognize patterns in the data that correlate with the target output.
5. Model Evaluation:
Once the model is trained, evaluate its performance using testing data that it has not seen
before. Common evaluation metrics include accuracy, precision, recall, F1-score, and
mean squared error (MSE).
6. Model Tuning:
Adjust the hyperparameters (e.g., learning rate, number of layers in a neural network) to
improve the model's performance. Techniques such as cross-validation, grid search, and
random search are commonly used for hyperparameter tuning (a small grid search sketch
follows this list).
7. Model Deployment:
Once the model is trained and tuned, it is deployed into production where it can make
predictions on real-world data.
8. Model Monitoring and Maintenance:
After deployment, it’s important to monitor the model’s performance over time and
retrain it if necessary, especially if the data changes or if the model starts to degrade.
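As a small illustration of the model tuning step (item 6 above), the sketch below uses grid search with cross-validation in scikit-learn to pick hyperparameters for a decision tree; the parameter grid and synthetic dataset are just examples.

# Hyperparameter tuning sketch: grid search with cross-validation (scikit-learn).
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, n_features=8, random_state=0)

# Candidate hyperparameter values to try (illustrative grid).
param_grid = {"max_depth": [2, 4, 6, None], "min_samples_leaf": [1, 5, 10]}

search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                      param_grid, cv=5, scoring="accuracy")
search.fit(X, y)

print("best hyperparameters:", search.best_params_)
print("best cross-validated accuracy:", search.best_score_)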
Types of Machine Learning
1. Supervised Learning:
In supervised learning, the algorithm is trained using a labeled dataset, where both the
input data (features) and the correct output (labels) are provided. The goal is to learn a
mapping from inputs to outputs, so the model can predict the label for new, unseen data.
o Example: Predicting house prices based on features such as the number of rooms,
location, etc.
o Algorithms: Linear Regression, Logistic Regression, Decision Trees, Support
Vector Machines (SVM), Neural Networks.
2. Unsupervised Learning:
In unsupervised learning, the algorithm is provided with input data but no labels
(outputs). The goal is to identify underlying patterns or groupings in the data. It is often
used for clustering, anomaly detection, or association.
o Example: Grouping customers into segments based on purchasing behavior.
o Algorithms: K-Means Clustering, Hierarchical Clustering, Principal Component
Analysis (PCA), Association Rule Learning.
3. Reinforcement Learning:
In reinforcement learning, an agent interacts with an environment and learns by receiving
rewards or penalties for actions taken. The agent's goal is to maximize the cumulative
reward by choosing the best actions over time.
o Example: Training a robot to navigate a maze by rewarding it for getting closer to
the goal and penalizing it for making wrong moves.
o Algorithms: Q-Learning, Deep Q-Networks (DQN), Policy Gradient Methods.
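The sketch below is a minimal tabular Q-learning example on a made-up five-state corridor, where the agent is rewarded only for reaching the rightmost state; real reinforcement learning problems involve far richer environments and reward structures.

# Tabular Q-learning sketch on a toy corridor: states 0..4, goal at state 4.
# Actions: 0 = move left, 1 = move right.
import numpy as np

n_states, n_actions = 5, 2
Q = np.zeros((n_states, n_actions))
alpha, gamma, epsilon = 0.1, 0.9, 0.2
rng = np.random.default_rng(0)

def step(state, action):
    next_state = min(state + 1, n_states - 1) if action == 1 else max(state - 1, 0)
    reward = 1.0 if next_state == n_states - 1 else 0.0
    done = next_state == n_states - 1
    return next_state, reward, done

for episode in range(500):
    state, done = 0, False
    while not done:
        # Epsilon-greedy action selection: mostly exploit, sometimes explore.
        action = rng.integers(n_actions) if rng.random() < epsilon else int(np.argmax(Q[state]))
        next_state, reward, done = step(state, action)
        # Q-learning update: move Q(s, a) toward reward + gamma * max_a' Q(s', a').
        Q[state, action] += alpha * (reward + gamma * np.max(Q[next_state]) - Q[state, action])
        state = next_state

print(np.argmax(Q, axis=1))  # learned policy: mostly "right" (1) on non-terminal states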
Machine learning has a wide variety of applications across different industries, such as:
Healthcare:
o Predicting disease outbreaks, diagnosing illnesses, and personalizing treatment
plans.
o Machine learning models can analyze medical images, such as X-rays and MRIs,
to assist doctors in diagnosis.
Finance:
o Fraud detection, credit scoring, algorithmic trading, and risk management.
o Predicting stock prices and market movements based on historical data.
Retail and E-commerce:
o Recommender systems that suggest products based on customer preferences.
o Predicting customer behavior, optimizing inventory, and demand forecasting.
Autonomous Vehicles:
o Machine learning is used to train self-driving cars to recognize objects, navigate
roads, and make decisions in real-time.
Marketing:
o Customer segmentation, targeted advertising, and social media analysis to
personalize marketing efforts.
Natural Language Processing (NLP):
o Sentiment analysis, machine translation, chatbots, and speech recognition.
Image and Video Processing:
o Image recognition, facial recognition, and object detection in security systems,
social media, and autonomous vehicles.
5.SUPERVISED LEARNING
The working of supervised learning can be easily understood from the following example:
Suppose we have a dataset of different types of shapes, which includes squares, rectangles, triangles,
and polygons. The first step is to train the model on each shape:
o If the given shape has four sides, and all the sides are equal, then it will be labelled as
a Square.
o If the given shape has three sides, then it will be labelled as a triangle.
o If the given shape has six equal sides, then it will be labelled as a hexagon.
Now, after training, we test our model using the test set, and the task of the model is to identify
the shape.
The machine is already trained on all types of shapes, and when it finds a new shape, it classifies
the shape on the basis of the number of sides and predicts the output.
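A hedged sketch of this shape example is shown below, using a decision tree trained on two simple hand-crafted features (number of sides and whether all sides are equal); the lesson example is conceptual, so this encoding is just one possible choice.

# Sketch of the shape example: classify shapes from simple hand-crafted features.
from sklearn.tree import DecisionTreeClassifier

# Features: [number of sides, all sides equal (1/0)] -- an illustrative encoding.
X_train = [[4, 1], [4, 0], [3, 0], [3, 1], [6, 1]]
y_train = ["square", "rectangle", "triangle", "triangle", "hexagon"]

model = DecisionTreeClassifier().fit(X_train, y_train)

# Test: a new shape with four equal sides should be labelled as a square.
print(model.predict([[4, 1]]))   # expected: ['square']
print(model.predict([[3, 0]]))   # expected: ['triangle']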
1. Regression
Regression algorithms are used when there is a relationship between the input variables and a
continuous output variable. They are used for the prediction of continuous values, such as weather
forecasting, market trends, etc. Some popular regression algorithms used in supervised learning
are listed below (a brief sketch follows the list):
o Linear Regression
o Regression Trees
o Non-Linear Regression
o Bayesian Linear Regression
o Polynomial Regression
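As a brief sketch of regression, the code below fits a linear regression model to a tiny invented house price dataset with scikit-learn; a real model would use many more examples and features.

# Regression sketch: predict a continuous value (house price) from numeric features.
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Invented data: [size in square feet, number of rooms] -> price
X = [[1000, 2], [1500, 3], [1800, 3], [2400, 4], [3000, 5]]
y = [200000, 280000, 310000, 400000, 500000]

model = LinearRegression().fit(X, y)

pred = model.predict([[2000, 3]])
print("predicted price for a 2000 sq ft, 3-room house:", round(pred[0]))
print("training MSE:", mean_squared_error(y, model.predict(X)))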
2. Classification
Classification algorithms are used when the output variable is categorical, meaning there are
discrete classes such as Yes/No, Male/Female, or True/False. Spam filtering is a common example.
Popular classification algorithms used in supervised learning include:
o Random Forest
o Decision Trees
o Logistic Regression
o Support Vector Machines
Advantages of Supervised Learning:
o With the help of supervised learning, the model can predict the output on the basis of
prior experience.
o In supervised learning, we can have an exact idea about the classes of objects.
o Supervised learning models help us solve various real-world problems, such as fraud
detection and spam filtering.
Disadvantages of Supervised Learning:
o Supervised learning models are not suitable for handling very complex tasks.
o Supervised learning cannot predict the correct output if the test data is very different from
the training dataset.
o Training requires a lot of computation time.
o In supervised learning, we need enough knowledge about the classes of objects.
6.UNSUPERVISED LEARNING
Unsupervised learning is a type of machine learning in which models are trained using an
unlabeled dataset and are allowed to act on that data without any supervision. Unsupervised
learning problems fall into two main types:
o Clustering: Clustering is a method of grouping objects into clusters such that objects
with the most similarities remain in one group and have few or no similarities with the
objects of another group. Cluster analysis finds the commonalities between data objects
and categorizes them according to the presence or absence of those commonalities (a small
clustering sketch follows the algorithm list below).
o Association: An association rule is an unsupervised learning method used for finding
relationships between variables in a large database. It determines the sets of items that
occur together in the dataset. Association rules make marketing strategies more effective;
for example, people who buy item X (say, bread) also tend to purchase item Y (butter or
jam). A typical example of association rule mining is Market Basket Analysis.
Below is a list of some popular unsupervised learning algorithms:
o K-means clustering
o KNN (k-nearest neighbors)
o Hierarchical clustering
o Anomaly detection
o Neural Networks
o Principal Component Analysis
o Independent Component Analysis
o Apriori algorithm
o Singular Value Decomposition
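As a minimal sketch of clustering (referenced above), the code below groups made-up customers into segments with K-means based on two invented features; real customer segmentation would use many more behavioural attributes.

# Clustering sketch: segment customers with K-means on two invented features.
import numpy as np
from sklearn.cluster import KMeans

# Columns: [annual spend, number of purchases] -- toy values for illustration.
customers = np.array([
    [200, 4], [250, 5], [220, 3],      # low spenders
    [900, 20], [950, 25], [880, 22],   # frequent, high spenders
])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(customers)
print("segment labels:", kmeans.labels_)
print("segment centres:", kmeans.cluster_centers_)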
7.BUSINESS INTELLIGENCE
Business Intelligence (BI) refers to the technologies, processes, and practices used to collect,
analyze, present, and interpret business data to help organizations make informed decisions. BI
helps companies gain insights into their operations, understand market trends, and optimize
strategies for competitive advantage. It involves using various tools, systems, and methodologies
to turn raw data into meaningful and actionable insights. Key challenges in implementing BI
include:
1. Data Quality:
o BI is only as good as the data it analyzes. Poor-quality or incomplete data can lead
to incorrect insights and decisions.
2. Data Security and Privacy:
o Storing and analyzing sensitive data presents security and privacy challenges.
Organizations need to implement strong data protection measures to ensure
compliance with regulations (e.g., GDPR, HIPAA).
3. Integration Issues:
o Integrating data from different systems, especially legacy systems, can be
complex and time-consuming.
4. Cost:
o Implementing and maintaining BI tools and systems can be costly, especially for
small and medium-sized enterprises (SMEs).
5. User Adoption:
o Getting stakeholders to adopt BI tools and fully leverage them for decision-
making can be challenging, particularly in organizations with limited data
literacy.
8.CLOUD COMPUTING
What is cloud computing?
Cloud computing is the on-demand delivery of IT resources over the Internet with pay-as-you-go
pricing. Instead of buying, owning, and maintaining physical data centers and servers, you can
access technology services, such as computing power, storage, and databases, on an as-needed
basis from a cloud provider like Amazon Web Services (AWS).
Organizations of every type, size, and industry are using the cloud for a wide variety of use
cases, such as data backup, disaster recovery, email, virtual desktops, software development and
testing, big data analytics, and customer-facing web applications. For example, healthcare
companies are using the cloud to develop more personalized treatments for patients. Financial
services companies are using the cloud to power real-time fraud detection and prevention. And
video game makers are using the cloud to deliver online games to millions of players around the
world.
Key Characteristics of Cloud Computing
1. On-Demand Self-Service:
o Users can provision computing resources (such as storage or processing power)
automatically, without needing to interact with service providers.
2. Broad Network Access:
o Cloud services are accessible from various devices and locations, as long as
there's internet access, enabling flexible and remote work.
3. Resource Pooling:
o The cloud provider’s resources (e.g., processing power, storage) are pooled
together and distributed to multiple users based on demand. This is achieved
through multi-tenant models.
4. Rapid Elasticity:
o Cloud resources can be scaled up or down quickly, providing flexibility to
accommodate fluctuating workloads.
5. Measured Service:
o Cloud services are metered, meaning users only pay for what they use, based on
usage or resource consumption.
6. Security and Privacy:
o Cloud providers invest in high levels of security for data storage and transfer,
though users also need to implement their own security practices.
Deployment Models
1. Public Cloud:
o Services are delivered over the internet and shared among multiple organizations
(tenants). Examples: AWS, Google Cloud, Microsoft Azure.
2. Private Cloud:
o A dedicated cloud infrastructure for a single organization. This is used when there
are strict security, compliance, or data privacy requirements. It can be hosted
internally or externally by a third-party provider.
3. Hybrid Cloud:
o A mix of public and private clouds that work together, allowing data and
applications to be shared between them. This model provides more flexibility and
optimization.
Common Cloud Services
1. Cloud Storage:
o Services that allow you to store and retrieve data online, such as Dropbox, Google
Drive, and Amazon S3 (a minimal upload sketch follows this list).
2. Cloud Databases:
o Databases provided as a service, such as Amazon RDS, Google Cloud SQL, and
Azure SQL Database.
3. Cloud Hosting:
o Hosting services for websites and applications, like AWS EC2, Google Compute
Engine, or DigitalOcean.
4. Cloud Analytics and AI:
o Cloud-based tools for processing and analyzing data, including services for
machine learning, data analytics, and artificial intelligence (e.g., Google AI, AWS
SageMaker).
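As a hedged sketch of using cloud storage programmatically, the code below uploads and downloads a file with Amazon S3 via the boto3 library; the bucket name and file paths are placeholders, and valid AWS credentials are assumed to be configured.

# Cloud storage sketch: upload and retrieve a file from Amazon S3 with boto3.
# Assumes AWS credentials are already configured (e.g., via environment variables).
import boto3

s3 = boto3.client("s3")
bucket = "example-bucket-name"   # placeholder bucket name

# Upload a local file to the bucket under a chosen key.
s3.upload_file("report.csv", bucket, "reports/report.csv")

# Download it again to a new local path.
s3.download_file(bucket, "reports/report.csv", "report_copy.csv")
print("transfer complete")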
Emerging Trends in Cloud Computing
1. Edge Computing:
o Processing data closer to where it's generated (e.g., on IoT devices) rather than
relying on centralized cloud servers. This reduces latency and bandwidth usage.
2. Serverless Computing:
o Serverless computing, where developers focus on writing code without managing
the underlying infrastructure, is growing rapidly (a minimal handler sketch appears
at the end of this list).
3. AI and Cloud Integration:
o More AI and machine learning services are being integrated into cloud platforms,
making it easier for businesses to leverage advanced analytics.
4. Quantum Computing:
o Cloud-based quantum computing services are beginning to emerge, offering
potential advancements in computing power for specialized tasks.
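A minimal serverless function sketch is shown below in the style of an AWS Lambda handler; the event fields and the response format (suited to an API Gateway trigger) are illustrative assumptions.

# Minimal serverless function sketch (AWS Lambda-style handler).
import json

def lambda_handler(event, context):
    # The event structure depends on the trigger; a query parameter is assumed here.
    name = (event.get("queryStringParameters") or {}).get("name", "world")
    return {
        "statusCode": 200,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps({"message": f"Hello, {name}!"}),
    }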