
CHINHOYI UNIVERSITY OF TECHNOLOGY

MODULE TEMPLATE

SCHOOL/INSTITUTE: SCHOOL OF ENGINEERING SCIENCES AND TECHNOLOGY
DEPARTMENT/CENTER: ICT AND ELECTRONICS
PROGRAMME: BACHELOR OF SCIENCE (HONS) DEGREE IN INFORMATION TECHNOLOGY (BSIT)
LEVEL: 1.1
MODULE TITLE: DATA SCIENCE
MODULE CODE: CUITM217
FACILITATOR: Dr. Gideon T. Mazambani


MODULE OVERVIEW: Data was formerly considered the new oil, but it is now considered the new soil, because it is what businesses use to grow. Data is the core driver of the knowledge economy and, unlike oil, it can be leveraged to extract value for businesses several times over. This course serves as an introduction to the field of data science. You will learn data gathering, representation, storage, analysis, and visualization techniques. You will also learn about the impact of business analytics and big data on business performance. The course teaches you how to combine technical and statistical knowledge, analytical thinking, and commercial knowledge.

Laboratory: Create reports, connect to data sources (Extract, Transform, Load), and build models using appropriate algorithms depending on the business case.

MODULE AIM: To provide students with a foundational understanding of the principles and practices of data science, including data collection, cleaning, analysis, visualization, and communication. This module will introduce students to the key concepts and tools of data science, and provide them with the opportunity to gain hands-on experience with data science projects. Students will learn how to collect, clean, analyze, and visualize data, and how to communicate their findings effectively.

OBJECTIVES: By the end of the module, students should be able to:

 Define data science and explain its key concepts and principles.
 Identify the different types of data and how to collect and store them.
 Understand the data science lifecycle, including data cleaning, preparation, analysis, visualization, and communication.
 Apply basic statistical and machine learning techniques to analyze data.
 Use data visualization tools to create informative and engaging visualizations.
 Communicate data science findings effectively to a variety of audiences.

LEARNING OUTCOMES: Upon completion of this module, students will be able to:

 Define data science and explain its key concepts and principles.
 Identify the different types of data and how to collect and store them.
 Understand the data science lifecycle, including data cleaning, preparation, analysis, visualization, and communication.
 Apply basic statistical and machine learning techniques to analyze data.
 Use data visualization tools to create informative and engaging visualizations.
 Communicate data science findings effectively to a variety of audiences.

RESOURCES/TECHNOLOGY/TECHNICAL SUPPORT: PPT, BBB, IDE, videos.

ASSESSMENT:
Practicum
Group tasks and presentations on group assignment
Individual in-class activities and programming assignments
Examination
Quiz

UNIT 1 – INTRODUCTION TO DATA AND DATA SCIENCE

1.0 Introduction

Data is a collection of information that can be analyzed to gain
insights and knowledge. Data science is the field that deals with the
extraction of knowledge from data. It involves the use of various
techniques, such as statistical analysis, machine learning, and data
visualization, to uncover patterns and trends in data. Data science is
used in various industries, from finance to healthcare, to improve
decision-making and drive innovation. It is a rapidly growing field
and has become an essential part of many businesses and
organizations.

1.1 Learning outcomes

 Define data and data science.
 Identify the different types of data and their characteristics.
 Explain the data science lifecycle.
 Describe the different tools and technologies used in data science.
 Understand the ethical implications of data science.

1.2 Key Terms/ Definition of Terms

 Data: Data is a collection of facts, such as numbers, words, images, or sounds. Data can be structured, unstructured, or semi-structured.
 Structured data: Structured data is data that is organized in a fixed format, such as a database table.
 Unstructured data: Unstructured data is data that does not have a fixed format, such as text, images, or videos.
 Semi-structured data: Semi-structured data is data that has some structure, but not as much as structured data. For example, a JSON file is a type of semi-structured data.
 Data science: Data science is the process of extracting knowledge from data. Data scientists use a variety of tools and techniques to collect, clean, prepare, analyze, visualize, and communicate data.
 Data science lifecycle: The data science lifecycle is the process of using data to solve problems.
 Machine learning: Machine learning is a type of artificial intelligence that allows computers to learn from data without being explicitly programmed.
 Statistical analysis: Statistical analysis is the process of using statistical methods to analyze data.
 Data visualization: Data visualization is the process of creating visual representations of data to communicate information clearly and concisely.

1.2.1 Abbreviations and Acronyms

 AI: Artificial intelligence
 API: Application programming interface
 CSV: Comma-separated values
 DB: Database
 EDA: Exploratory data analysis
 ETL: Extract, transform, and load
 ID: Identifier
 JSON: JavaScript Object Notation
 ML: Machine learning
 NLP: Natural language processing
 NoSQL: Non-relational database
 SQL: Structured Query Language
 URL: Uniform Resource Locator

1.3 Introduction to Data and Data Science

What is data?

Data is a collection of facts, such as numbers, words, images, or sounds. Data can be structured, unstructured, or semi-structured.

 Structured data is data that is organized in a fixed format, such as a database table.
 Unstructured data is data that does not have a fixed format, such as text, images, or videos.
 Semi-structured data is data that has some structure, but not as much as structured data. For example, a JSON file is a type of semi-structured data (see the short example below).
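To make the distinction concrete, here is a small Python sketch (illustrative only; the record contents are made up) that parses a JSON record, a common semi-structured format:

    import json

    # A JSON record has structure (keys and values), but records need not
    # share a fixed schema the way rows in a database table do.
    record = json.loads('{"name": "Jane", "age": 21, "courses": ["CUITM217"]}')
    print(record["name"])     # access a field by key
    print(record["courses"])  # values can themselves be lists or nested objects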

What is data science?

Data science is the process of extracting knowledge from data. Data scientists use a variety of tools and techniques to collect, clean, prepare, analyze, visualize, and communicate data.

The data science lifecycle

The data science lifecycle is the process of using data to solve problems.
It consists of the following stages:
1. Data collection: This stage involves collecting data from a
variety of sources, such as databases, sensors, and social media.
2. Data cleaning: This stage involves removing errors and
inconsistencies from the data.
3. Data preparation: This stage involves transforming the data into
a format that can be easily analyzed.
4. Data analysis: This stage involves using statistical and machine
learning techniques to extract knowledge from the data.
5. Data visualization: This stage involves creating visualizations to
communicate the findings of the data analysis to others.
6. Data communication: This stage involves communicating the
findings of the data analysis to others clearly and concisely.
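As a minimal illustration of stages 1-5 in Python with pandas (a sketch only; the file name and column names are hypothetical):

    import pandas as pd
    import matplotlib.pyplot as plt

    # 1. Data collection: load data from a source (here, a hypothetical CSV file)
    df = pd.read_csv("sales.csv")

    # 2. Data cleaning: remove duplicate rows and rows with missing values
    df = df.drop_duplicates().dropna()

    # 3. Data preparation: convert the date column to a proper datetime type
    df["date"] = pd.to_datetime(df["date"])

    # 4. Data analysis: summarize revenue per month
    monthly = df.groupby(df["date"].dt.to_period("M"))["revenue"].sum()

    # 5. Data visualization: plot the monthly totals
    monthly.plot(kind="bar", title="Monthly revenue")
    plt.show()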

Machine learning

Machine learning is a type of artificial intelligence that allows computers to learn from data without being explicitly programmed.

Statistical analysis

Statistical analysis is the process of using statistical methods to analyze data.

Data visualization

Data visualization is the process of creating visual representations of data to communicate information clearly and concisely.

Why is data science important?

Data science is important because it allows us to extract knowledge from data that we can use to make better decisions. Data science is used in a wide variety of industries, including healthcare, finance, and marketing.

Getting started with data science

If you are interested in getting started with data science, there are a few things you can do:

 Learn a programming language, such as Python or R.
 Learn about data visualization tools, such as Matplotlib or Seaborn.
 Learn about statistical analysis methods.
 Take a data science course or tutorial.
 Start working on data science projects.

There are many resources available to help you learn data science. You can find online courses, tutorials, and articles on a variety of data science topics.

Activity 1.1

Class discussion: What data science projects have you heard of? How are they being used to solve real-world problems?

1.6 The role of data in business

Data is everywhere in business. From customer purchase history to website traffic to employee productivity, businesses generate vast amounts of data on a daily basis. This data can be used to improve decision-making, streamline operations, and gain a competitive advantage.

Here are some of the key roles of data in business:

 Improved decision-making: Data can help businesses make better decisions by providing insights into customer behavior, market trends, and operational performance. For example, a retailer can use data to identify which products are most popular, which discounts are most effective, and how to improve the customer experience.
 Streamlined operations: Data can help businesses streamline their operations by automating tasks, identifying bottlenecks, and improving efficiency. For example, a manufacturing company can use data to optimize its production schedule, reduce inventory waste, and improve product quality.
 Competitive advantage: Data can help businesses gain a competitive advantage by providing insights into customer needs, market opportunities, and competitor strategies. For example, a technology company can use data to identify new product features that customers are demanding, develop strategies to enter new markets, and improve its marketing campaigns.

Here are some examples of how businesses are using data today:

 E-commerce companies: E-commerce companies use data to personalize the shopping experience for customers, recommend products, and optimize their supply chains.
 Financial services companies: Financial services companies use data to assess risk, detect fraud, and develop new financial products and services.
 Healthcare companies: Healthcare companies use data to improve patient care, develop new treatments, and reduce costs.
 Manufacturing companies: Manufacturing companies use data to optimize production, improve quality, and predict demand.
 Retail companies: Retail companies use data to track customer behavior, optimize inventory, and develop targeted marketing campaigns.

How to get started on working with data in business

1. Identify your business goals. What are you hoping to achieve by using data? Once you know your goals, you can start to collect and analyze the data that will help you achieve them.
2. Collect the right data. Not all data is created equal. Make sure you're collecting the data that is most relevant to your business goals.
3. Analyze the data. Once you have collected the data, you need to analyze it to identify trends and patterns. This can be done using a variety of data analysis tools and techniques.
4. Take action. Once you have analyzed the data, you need to take action based on your findings. This may involve making changes to your business strategy, operations, or marketing campaigns.

Here are some tips for using data effectively in business:

1. Start with a clear goal in mind. What are you hoping to achieve by using data? Once you know your goal, you can start to collect and analyze the data that will help you achieve it.
2. Use the right tools and techniques. There are a variety of data analysis tools and techniques available. Choose the ones that are most appropriate for your needs and the data that you have.
3. Be skeptical of the data. Not all data is accurate or reliable. Be sure to verify the data before you use it to make any decisions.
4. Communicate the findings effectively. Once you have analyzed the data, you need to communicate the findings to your team and stakeholders in a way that is clear and concise.

1.7 Summary

Data and data science are crucial components of today's world. Data refers to the information that is collected, stored, and analyzed by individuals and organizations. Data science, on the other hand, is the process of extracting insights from this data using various techniques such as statistical analysis, machine learning, and artificial intelligence. Data science helps businesses and individuals to make informed decisions, improve efficiency, and gain a competitive advantage. It involves various tasks such as data cleaning, data analysis, data visualization, and data interpretation. In summary, data and data science play a vital role in today's world and help individuals and organizations to succeed in their respective fields.

1.8 References/Reading sources/links

Provost, F., & Fawcett, T. (2013). Data science and its relationship to big data and data-driven decision making. Big Data, 1(1), 51-59.

van der Aalst, W. (2016). Data Science in Action (pp. 3-23). Springer Berlin Heidelberg.

Dhar, V. (2013). Data science and prediction. Communications of the ACM, 56(12), 64-73.

UNIT 2 – INTRODUCTION TO POWER BI/EXCEL

2.0 Introduction

Power BI is a powerful tool for data science that allows you to create interactive visualizations, reports, and dashboards. It is designed to help you make sense of large amounts of data and turn that data into actionable insights. With Power BI, you can easily connect to various data sources, clean and transform data, and create compelling reports and visualizations. Whether you are a data scientist, analyst, or business user, Power BI can help you explore, analyze, and communicate your data more effectively and efficiently.

2.1 Learning outcomes

 Define Power BI and explain its key features.
 Identify the different types of data that can be imported into Power BI.
 Use Power BI to clean and prepare data for analysis.
 Apply statistical and machine learning functions to data in Power BI.
 Create data visualizations in Power BI to communicate findings to others.
 Share Power BI dashboards and reports with others.

2.2 Key Terms/ Definition of Terms

 Power BI: Power BI is a cloud-based business intelligence (BI) and data visualization tool that provides a variety of features for data scientists, including data import and cleaning, data analysis, data visualization, and data sharing and collaboration.
 Data model: A data model is a conceptual representation of data that defines how data is structured and related. Power BI uses data models to represent the data that is imported into the tool.
 Statistical function: A statistical function is a mathematical function that is used to analyze data. Power BI provides a variety of statistical functions that can be applied to data.
 Machine learning function: A machine learning function is a function that is used to train and deploy machine learning models. Power BI provides a variety of machine learning functions that can be applied to data.
 Data visualization: Data visualization is the process of creating visual representations of data to communicate information clearly and concisely. Power BI provides a variety of data visualization tools that can be used to create charts, graphs, and maps.
 Dashboard: A dashboard is a collection of data visualizations that are displayed on a single page. Dashboards can be used to monitor key performance indicators (KPIs), track trends, and identify patterns in data.
 Report: A report is a collection of data visualizations and text that is used to communicate findings to others. Reports can be used to share the results of data analysis, to document research findings, or to present business proposals.

2.2.1 Abbreviations and Acronyms

 BI: Business intelligence
 D3: Data-driven documents
 DAX: Data analysis expressions
 KPI: Key performance indicator
 ML: Machine learning
 PBIX: Power BI Desktop file format
 PBI: Power BI
 SQL: Structured Query Language
 URL: Uniform Resource Locator
 UX: User experience
 VBA: Visual Basic for Applications
 XML: Extensible Markup Language

2.3 Power BI

Power BI is a business intelligence (BI) and data visualization tool that can be used by data scientists to analyze and visualize data. It is a cloud-based service that provides a variety of features for data scientists, including:

 Data import and cleaning: Power BI can import data from a variety of sources, including databases, spreadsheets, and cloud storage services. It also provides a variety of tools for cleaning and preparing data for analysis.
 Data analysis: Power BI provides a variety of tools for data analysis, including statistical functions, machine learning algorithms, and data mining techniques.
 Data visualization: Power BI provides a variety of tools for data visualization, including charts, graphs, and maps.
 Data sharing and collaboration: Power BI allows data scientists to share their data and visualizations with others, and to collaborate with others on data analysis projects.

How data scientists can use Power BI

Data scientists can use Power BI for a variety of tasks, including:

 Data exploration: Power BI can be used to explore and visualize data to identify trends and patterns.
 Data analysis: Power BI can be used to perform data analysis using statistical functions, machine learning algorithms, and data mining techniques.
 Data storytelling: Power BI can be used to create data stories to communicate the findings of data analysis to others.
 Data sharing and collaboration: Power BI can be used to share data and visualizations with others, and to collaborate with others on data analysis projects.

Here are some examples of how data scientists can use Power BI:

 A data scientist working in the healthcare industry could use Power BI to analyze data on patient outcomes to identify trends and patterns.
 A data scientist working in the retail industry could use Power BI to analyze data on sales and customer behavior to identify trends and patterns.
 A data scientist working in the financial industry could use Power BI to analyze data on stock prices and economic indicators to identify trends and patterns.

Power BI is a powerful tool that can be used by data scientists to analyze and visualize data. It provides a variety of features that can help data scientists explore data, perform data analysis, create data stories, and share data with others.

Here are some additional tips for data scientists using Power BI:

 Use Power BI to explore data before performing data analysis. This will help you to identify trends and patterns in the data, and to develop hypotheses that you can test using data analysis techniques.
 Use Power BI to create data visualizations that communicate the findings of your data analysis to others. Data visualizations can be a powerful way to communicate complex data clearly and concisely.
 Share your data and visualizations with others using Power BI. This will allow you to collaborate with others on data analysis projects, and to get feedback on your findings.

Activity 2.1

Class discussion: Design a piece of code to determine which letter grade a student has obtained based on a given final mark.
2.4 Data Exploration with Power BI

Power BI is a business intelligence (BI) and data visualization tool that can be used to explore and visualize data. It provides a variety of features that can help users to identify trends, patterns, and outliers in data.

How to explore data with Power BI

To explore data with Power BI, you can follow these steps:

1. Import your data into Power BI. Power BI can import data from
a variety of sources, including databases, spreadsheets, and cloud
storage services.
2. Clean and prepare your data. Before you can explore your data,
you need to clean it and prepare it for analysis. This may involve
removing errors and inconsistencies from the data and
transforming the data into a format that is easy to analyze.
3. Create data visualizations. Power BI provides a variety of data
visualization tools that can be used to explore data. You can
create charts, graphs, and maps to visualize your data and
identify trends, patterns, and outliers.
4. Interact with your data visualizations. Power BI allows you to
interact with your data visualizations to explore your data
further. For example, you can filter and slice your data to focus
on specific subsets of data. You can also drill down into your
data to get more detailed information.
5. Share your findings. Once you have explored your data and
identified key findings, you can share your findings with others
using Power BI. You can share your data visualizations or create
reports to communicate your findings.

Here are some tips for exploring data with Power BI:

 Start by understanding your data. Before you start exploring your data, take some time to understand the different variables in your data set and how they are related to each other. This will help you to identify the most relevant data visualizations to use.
 Use a variety of data visualizations. Power BI provides a variety of data visualization tools, each with its own strengths and weaknesses. Use a variety of data visualizations to get different perspectives on your data.
 Interact with your data visualizations. Power BI allows you to interact with your data visualizations to explore your data further. Use the filtering, slicing, and drill-down features to get more detailed insights from your data.
 Share your findings. Once you have explored your data and identified key findings, share your findings with others using Power BI. This will help you to collaborate with others and get feedback on your work.

Here are some examples of how you can use Power BI to explore data:

 Identify trends: You can use Power BI to identify trends in your data, such as trends in sales, customer behavior, or website traffic.
 Identify patterns: You can use Power BI to identify patterns in your data, such as patterns in customer churn or patterns in product demand.
 Identify outliers: You can use Power BI to identify outliers in your data, such as customers who have spent an unusually large amount of money or products that have an unusually high number of returns.

2.5 Data Analysis with Power BI

Power BI is a business intelligence (BI) and data visualization tool that can be used to analyze data. It provides a variety of features that can help users perform statistical analysis, machine learning, and data mining on data.

How to analyze data with Power BI

To analyze data with Power BI, you can follow these steps:

1. Import your data into Power BI. Power BI can import data from
a variety of sources, including databases, spreadsheets, and cloud
storage services.
2. Clean and prepare your data. Before you can analyze your data,
you need to clean it and prepare it for analysis. This may involve
removing errors and inconsistencies from the data, and
transforming the data into a format that is easy to analyze.
3. Create data models. Power BI uses data models to represent the
data that is imported into the tool. Data models define the
relationships between different variables in the data set.
4. Apply statistical and machine learning functions. Power BI
provides a variety of statistical and machine learning functions
that can be applied to data. You can use these functions to
perform data analysis, such as calculating averages, correlations,
and regressions.
5. Create data visualizations. Power BI provides a variety of data
visualization tools that can be used to communicate the findings
of your data analysis. You can create charts, graphs, and maps to
visualize your data and identify trends, patterns, and outliers.
6. Share your findings. Once you have analyzed your data and identified key findings, you can share your findings with others using Power BI. You can share your data visualizations or create reports to communicate your findings.

Here are some tips for analyzing data with Power BI:

1. Start by understanding your data. Before you start analyzing your data, take some time to understand the different variables in your data set and how they are related to each other. This will help you to identify the most relevant statistical and machine learning functions to use.
2. Use a variety of data analysis techniques. Power BI provides a variety of statistical and machine learning functions. Use a variety of data analysis techniques to get different perspectives on your data.
3. Use data visualizations to communicate your findings. Data visualizations can be a powerful way to communicate the findings of your data analysis to others. Use Power BI's data visualization tools to create charts, graphs, and maps that are clear, concise, and easy to understand.
4. Share your findings. Once you have analyzed your data and identified key findings, share your findings with others using Power BI. This will help you to collaborate with others and get feedback on your work.

Here are some examples of how you can use Power BI to analyze data:

1. Perform statistical analysis: You can use Power BI to perform statistical analysis on your data, such as calculating averages, correlations, and regressions.
2. Perform machine learning: You can use Power BI to perform machine learning on your data, such as building and deploying predictive models.
3. Perform data mining: You can use Power BI to perform data mining on your data, such as identifying patterns and outliers in the data.

Activity 2.2

Lab activity
Objective: To learn how to use Power BI to explore and visualize data.

Materials:

 A computer with Power BI Desktop installed
 A dataset to explore

Instructions:

1. Open Power BI Desktop and create a new file.
2. Import your dataset into Power BI.
3. Clean and prepare your data.
4. Create data visualizations to explore your data.
5. Interact with your data visualizations.
6. Share your findings.

Summary

Power BI is a powerful tool used in data science to create interactive and visually appealing reports and dashboards from various data sources. It enables data analysts to explore data, identify patterns, and make data-driven decisions through its robust data visualization and analysis features. Power BI is widely used in industries such as finance, marketing, healthcare, and retail to gain insights and make informed decisions. With its ability to connect to multiple data sources, Power BI simplifies data processing and analysis, making it a valuable tool in the field of data science.

Further reading

Krishnan, V. (2017). Research data analysis with Power BI.

Aspin, A. (2016). Pro Power BI Desktop. Apress.

UNIT 3 – DATA PREPARATION

3.0 Introduction

Data preparation is a crucial step in data science that involves cleaning, removing duplicates, filling in missing values, and converting data types. Data transformation is also necessary to normalize and scale the data for analysis. Lastly, feature engineering is the process of selecting the most relevant features for analysis. Overall, data preparation plays a significant role in ensuring accurate and reliable analysis in data science.

3.1 Learning outcomes

 Define data preparation and explain its importance in the data science process.
 Identify the different steps involved in the data preparation process.
 Describe the different types of data quality issues and how to address them.
 Explain how to use data preparation tools to clean, transform, and enrich data.
 Discuss the challenges of data preparation and how to overcome them.
 Apply the principles and techniques of data preparation to a real-world data science problem.

3.2 Key Terms/ Definition of Terms

 Data preparation: The process of cleaning and transforming raw data into a format that is suitable for analysis.
 Data cleaning: The process of identifying and correcting errors and inconsistencies in the data.
 Data structuring: The process of organizing the data into a consistent format.
 Data transformation and enrichment: The process of converting the data into a format that is suitable for the intended analysis.
 Data validation and publishing: The process of checking the quality of the prepared data and making it available for analysis.
 Data quality: The degree to which the data is accurate, complete, consistent, and relevant for the intended use.
 Data profiling: The process of understanding the structure and content of the data.
 Data lineage: The history of how the data was collected, transformed, and processed.
 Data governance: The policies and procedures that ensure the quality, security, and compliance of the data.
 Machine learning: A type of artificial intelligence that allows software applications to become more accurate in predicting outcomes without being explicitly programmed to do so.
 Self-service data preparation: A data preparation approach that allows business users to prepare their own data without having to rely on data scientists or analysts.

3.2.1 Abbreviations and Acronyms

 API: Application Programming Interface
 CRISP-DM: Cross-Industry Standard Process for Data Mining
 CSV: Comma-Separated Values
 DBMS: Database Management System
 DPM: Data Preparation Management
 DQ: Data Quality
 ETL: Extract, Transform, Load
 JSON: JavaScript Object Notation
 ML: Machine Learning
 NLP: Natural Language Processing
 OLAP: Online Analytical Processing
 SQL: Structured Query Language
 XML: Extensible Markup Language

3.3 Data Preparation

Data preparation is the process of cleaning and transforming raw data into a format that is suitable for analysis. It is an essential step in any data science project, as the quality of the output will be heavily dependent on the quality of the input data.

Purposes of data preparation

There are several reasons why data preparation is important:

 To improve the accuracy and reliability of the analysis.
 To make the data more consistent and easier to work with.
 To reduce the amount of time and effort required for analysis.
 To identify and address any biases in the data.
 To prepare the data for specific machine learning algorithms.

Benefits of data preparation

Data preparation can lead to several benefits, including:

 More accurate and reliable results from data analysis.
 Faster and more efficient data analysis.
 Easier identification and understanding of trends and patterns in the data.
 Reduced risk of bias in the analysis.
 Increased confidence in the results of the analysis.

Steps in the data preparation process

The data preparation process typically involves the following steps:

1. Data collection: This involves gathering the data from all relevant sources.
2. Data discovery and profiling: This involves understanding the structure and content of the data.
3. Data cleaning: This involves identifying and correcting errors and inconsistencies in the data.
4. Data structuring: This involves organizing the data into a consistent format.
5. Data transformation and enrichment: This involves converting the data into a format that is suitable for the intended analysis.
6. Data validation and publishing: This involves checking the quality of the prepared data and making it available for analysis.

Challenges of data preparation

Data preparation can be a challenging task, especially when dealing with large and complex datasets. Some of the challenges include:

1. Data quality issues: Raw data is often incomplete, inaccurate, or inconsistent.
2. Data diversity: Data may come from a variety of sources with different formats and structures.
3. Data complexity: Large datasets can be complex and difficult to understand.
4. Resource constraints: Data preparation can be time-consuming and resource-intensive.

Data preparation tools

A variety of data preparation tools are available to help data scientists and analysts with the data preparation process. Some popular tools include:

 Alteryx: A self-service data preparation platform that offers a variety of tools for data cleaning, blending, and transformation.
 Dataiku: A collaborative data science platform that includes a variety of data preparation tools, as well as tools for machine learning and data visualization.
 Knime: An open-source data analytics platform that includes a variety of data preparation tools.
 OpenRefine: An open-source data cleaning and transformation tool.
 Pandas: A Python library that provides a variety of data manipulation and analysis tools (see the short sketch below).
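As a small taste of what these tools do, the pandas sketch below (file and column names are hypothetical) fixes three common data quality issues: duplicate rows, missing values, and wrong data types:

    import pandas as pd

    df = pd.read_csv("customers.csv")                       # raw input data

    df = df.drop_duplicates()                               # remove duplicate rows
    df["age"] = df["age"].fillna(df["age"].median())        # fill missing ages
    df["signup_date"] = pd.to_datetime(df["signup_date"])   # fix the data type
    df["country"] = df["country"].str.strip().str.title()   # standardize text

    df.to_csv("customers_clean.csv", index=False)           # publish cleaned data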

Data preparation trends

The field of data preparation is constantly evolving, as new technologies and techniques are developed. Some of the current trends in data preparation include:

 The use of machine learning: Machine learning can be used to automate many of the tasks involved in data preparation, such as identifying and correcting errors, and transforming data into different formats.
 The rise of self-service data preparation: Self-service data preparation tools allow business users to prepare their own data without having to rely on data scientists or analysts.
 The focus on data quality: Data quality is becoming increasingly important as organizations rely more and more on data-driven decision-making. As a result, there is a growing focus on developing new tools and techniques for improving data quality.

Activity 3.1

Lab activity:
Objective: To practice identifying and correcting common data quality issues.

Materials:

o A dataset with common data quality issues (e.g., missing values, duplicate rows, inconsistent formatting, etc.)
o A data preparation tool (e.g., OpenRefine, Pandas, etc.)

Instructions:

1. Load the dataset into your data preparation tool.
2. Identify the different data quality issues in the dataset.
3. Use the data preparation tool to correct the data quality issues.
4. Save the cleaned dataset.

Discussion:

1. What were the most common data quality issues in the dataset?
2. What challenges did you face while cleaning the data?

Summary

Data preparation is the process of transforming and cleaning raw data into a format that can be easily analyzed and used for insights and decision-making. The purpose of data preparation is to ensure that data is accurate, complete, and relevant to the problem at hand.

There are many benefits to data preparation, including improving data quality, reducing errors, and increasing the accuracy of analysis. In addition, data preparation can help to save time and resources by streamlining the data analysis process.

The steps in the data preparation process include data acquisition, data cleaning, data transformation, and data integration. Each step is important in ensuring that the data is of high quality and ready for analysis.

However, there are also challenges associated with data preparation, including dealing with missing or incomplete data, identifying outliers and anomalies, and managing data from multiple sources.

To help with the data preparation process, there are various tools available such as data profiling tools, data cleaning tools, and data transformation tools. These tools can help to automate and streamline the data preparation process.

Finally, some trends in data preparation include the use of artificial intelligence and machine learning to automate the data preparation process, as well as the use of big data technologies to handle large volumes of data.

Further Reading

Soh, J., & Singh, P. (2020). Data preparation and data engineering basics. In Data Science Solutions on Azure: Tools and Techniques Using Databricks and MLOps (pp. 65-115).

Zhang, S., Zhang, C., & Yang, Q. (2003). Data preparation for data mining. Applied Artificial Intelligence, 17(5-6), 375-381.

UNIT 4 – DATA ANALYSIS

4.0 Introduction

In data analysis for data science, there are three main areas of focus:
exploratory data analysis (EDA), descriptive data analysis, and
predictive data analysis. EDA involves identifying patterns and
relationships in data, while descriptive data analysis involves
summarizing data using statistical measures. Predictive data analysis
utilizes statistical models to make predictions based on historical data.
Overall, these techniques provide a comprehensive approach to
analyzing data in data science.

4.1 Learning outcomes

 Understand the different types of data analysis and how to choose the right type for a specific task.
 Be able to use data visualization techniques to explore and understand data.
 Be able to use statistical methods to summarize and describe data.
 Be able to identify patterns and trends in data.
 Be able to develop hypotheses about the underlying relationships between variables.
 Be able to use statistical and machine learning techniques to develop models that can predict future outcomes.
 Be able to interpret and communicate the results of data analysis to others clearly and concisely.

4.2 Key Terms/ Definition of Terms

 Data analysis: The process of collecting, cleaning, transforming, and visualizing data to extract meaningful insights.
 Exploratory data analysis (EDA): The process of using data visualization techniques to explore and understand data without making any assumptions about it.
 Descriptive data analysis: The process of using statistical methods to summarize and describe data.
 Predictive data analysis: The process of using statistical and machine learning techniques to develop models that can predict future outcomes.
 Data visualization: The process of creating graphical representations of data to make it easier to understand and interpret.
 Statistical methods: Mathematical techniques used to collect, analyze, interpret, and present data.
 Machine learning: A type of artificial intelligence that allows software applications to become more accurate in predicting outcomes without being explicitly programmed to do so.
 Pattern recognition: The identification of patterns and regularities in data.
 Trend analysis: The identification of patterns and changes in data over time.
 Correlation: A statistical measure of the relationship between two variables.
 Causation: A relationship between two variables in which one variable causes the other to change.
 Outlier: A data point that falls significantly outside of the normal range of values.

4.2.1 Abbreviations and Acronyms

 API: Application Programming Interface
 BI: Business Intelligence
 CRISP-DM: Cross-Industry Standard Process for Data Mining
 CSV: Comma-Separated Values
 DBMS: Database Management System
 DPM: Data Preparation Management
 DQ: Data Quality
 ETL: Extract, Transform, Load
 JSON: JavaScript Object Notation
 ML: Machine Learning
 NLP: Natural Language Processing
 OLAP: Online Analytical Processing
 SQL: Structured Query Language
 XML: Extensible Markup Language

4.3 Data Analysis

Exploratory data analysis (EDA)

EDA is the process of using data visualization and other techniques to explore and understand data without making any assumptions about it. The goal of EDA is to identify patterns and trends in the data and to develop hypotheses about the underlying relationships between variables.

Common EDA techniques:

 Data visualization: Histograms, bar charts, scatter plots, box plots, line charts, heatmaps, etc.
 Statistical summaries: Measures of central tendency (mean, median, mode) and dispersion (standard deviation, range), correlation, covariance, etc.
 Outlier detection: Identifying and understanding outliers in the data.
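A minimal EDA sketch in Python (the dataset and column names are hypothetical) that combines a statistical summary with two of the visualizations listed above:

    import pandas as pd
    import matplotlib.pyplot as plt

    df = pd.read_csv("orders.csv")            # hypothetical dataset

    print(df.describe())                      # central tendency and dispersion
    print(df[["price", "quantity"]].corr())   # correlation between two variables

    df["price"].plot(kind="hist", title="Price distribution")  # histogram
    plt.show()

    df.plot(kind="scatter", x="price", y="quantity")            # scatter plot
    plt.show()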

Benefits of EDA:

 Helps to identify patterns and trends in the data that may not
be immediately obvious
 Helps to develop hypotheses about the underlying
relationships between variables
 Helps to identify outliers and anomalies in the data
 Can be used to inform the design of further data analysis and
modelling

Descriptive data analysis

Descriptive data analysis is the process of using statistical methods to summarize and describe the data. The goal of descriptive data analysis is to provide a clear and concise overview of the data, and to identify any key features or trends.

Common descriptive data analysis techniques:

 Measures of central tendency: Mean, median, mode
 Measures of dispersion: Standard deviation, range, interquartile range
 Correlation and covariance
 Frequency distributions
 Cross-tabulations
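For example, pandas computes most of these descriptive measures in a line each (the marks below are a toy sample):

    import pandas as pd

    marks = pd.Series([45, 52, 60, 60, 67, 73, 88])

    print(marks.mean())             # mean
    print(marks.median())           # median
    print(marks.mode().iloc[0])     # mode
    print(marks.std())              # standard deviation
    print(marks.max() - marks.min())                     # range
    print(marks.quantile(0.75) - marks.quantile(0.25))   # interquartile range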

Benefits of descriptive data analysis:

 Provides a clear and concise overview of the data
 Helps to identify key features and trends in the data
 Can be used to communicate the findings of a data analysis project to others

Predictive data analysis

Predictive data analysis is the process of using statistical and machine learning techniques to develop models that can predict future outcomes. The goal of predictive data analysis is to enable businesses to make better decisions by understanding what is likely to happen in the future.

Common predictive data analysis techniques:

 Regression analysis
 Classification algorithms
 Clustering algorithms
 Time series analysis

Benefits of predictive data analysis:

 Enables businesses to make better decisions by understanding what is likely to happen in the future
 Can be used for a variety of purposes, such as forecasting sales, predicting customer churn, and detecting fraud

Example

Suppose a data scientist is working for a retail company. The company wants to understand the factors that influence customer spending. The data scientist could use the following steps:

1. EDA: The data scientist would first use EDA to explore the data and identify patterns and trends. This might involve creating histograms and scatter plots to examine the relationships between variables such as customer demographics, purchase history, and product type.
2. Descriptive data analysis: The data scientist would then use descriptive data analysis to summarize and describe the data. This might involve calculating the average spending per customer, the most popular product categories, and the correlation between customer demographics and spending.
3. Predictive data analysis: Finally, the data scientist would use predictive data analysis to develop a model that can predict customer spending. This model could then be used by the company to identify customers who are likely to spend more money and to develop targeted marketing campaigns.
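A hedged sketch of step 3 using scikit-learn (the feature names and file are hypothetical; a real project would use the company's own data):

    import pandas as pd
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import train_test_split

    df = pd.read_csv("customers.csv")                     # hypothetical data
    X = df[["age", "visits_per_month", "basket_size"]]    # predictor variables
    y = df["annual_spend"]                                # target variable

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=0)

    model = LinearRegression().fit(X_train, y_train)
    print(model.score(X_test, y_test))      # R-squared on held-out data
    print(model.predict(X_test.head()))     # predicted spending for new customers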

Activity 4.1

Lab Activity: Analyze a Public Dataset

Objective: To practice using EDA, descriptive data analysis, and predictive data analysis techniques on a real-world dataset.

Materials:

 A public dataset (e.g., from Kaggle, UCI Machine Learning Repository, etc.)
 A data analysis tool (e.g., Python, R, Power BI, etc.)

Instructions:

1. Choose a public dataset that you are interested in analyzing.
2. Load the dataset into your data analysis tool.
3. Perform EDA on the dataset to identify patterns and trends.
4. Perform descriptive data analysis on the dataset to summarize and describe the data.
5. Perform predictive data analysis on the dataset to develop a model that can predict a future outcome.

Discussion:

1. What were the most important patterns and trends that you identified in the data?
2. What were the key features and trends that you identified in the data?
3. What predictive model did you develop? How well does it perform on the test data?
4. How could you improve your data analysis and modeling?

Summary

Data analysis plays a crucial role in data science. It involves various techniques and methods to extract useful insights from a vast amount of data. Exploratory data analysis (EDA) is the first step in data analysis, where data is examined to understand its properties, patterns, and relationships. Descriptive data analysis is used to summarize and interpret data, while predictive data analysis uses statistical models to make predictions about future outcomes based on past data. All three approaches are vital in data science, as they help to identify trends, patterns, and relationships in data, which can be used to make informed decisions and drive business growth.

Further Reading

Clark, D. (2017). Beginning Power BI: A Practical Guide to Self-Service Data Analytics with Excel 2016 and Power BI Desktop. Apress.

Metre, K. V., Mathur, A., Dahake, R. P., Bhapkar, Y., Ghadge, J., Jain, P., & Gore, S. (2024). An introduction to Power BI for data analysis. International Journal of Intelligent Systems and Applications in Engineering, 12(1s), 142-147.

UNIT 5 – DATA VISUALISATION

5.0 Introduction

Data visualization in data science is the process of presenting data in a graphical or pictorial format. It involves the use of charts, graphs, and other visual aids to present complex data sets in a clear and concise manner. Data visualization is an important part of data science as it helps analysts and stakeholders understand the data better and make informed decisions based on the findings. It is used in various fields including finance, healthcare, marketing, and others, to identify trends, patterns, and relationships in data sets. Data visualization tools such as Tableau, Power BI, and QlikView are widely used in the industry to create interactive and engaging visualizations. Overall, data visualization plays a crucial role in data science as it helps to communicate complex information in a simple and meaningful way.

5.1 Learning outcomes

 Understand the purpose of data visualization and its importance in data science.
 Be able to identify the different types of data visualizations and choose the right type for a specific task.
 Be able to use data visualization tools to create effective visualizations.
 Be able to communicate the findings of data analysis using data visualizations.
 Be able to evaluate the effectiveness of data visualizations.

5.2 Key Terms/ Definition of Terms

 Data visualization: The process of creating graphical representations of data to make it easier to understand and interpret.
 Data visualization tool: A software application that is used to create data visualizations. Some popular data visualization tools include Python, R, Tableau, and Power BI.
 Data visualization type: The specific type of graphical representation that is used to visualize data. Some common data visualization types include bar charts, line charts, pie charts, histograms, and scatter plots.
 Chart: A graphical representation of data that shows the relationship between two or more variables.
 Graph: A type of chart that uses lines to show the relationship between two or more variables.
 Plot: A type of chart that uses symbols to show the relationship between two or more variables.
 Axis: A line that is used to represent a variable in a chart or graph.

5.2.1 Abbreviations and Acronyms

 API: Application Programming Interface
 BI: Business Intelligence
 D3: Data-Driven Documents
 DPM: Data Preparation Management
 DQ: Data Quality
 ETL: Extract, Transform, Load
 JSON: JavaScript Object Notation
 ML: Machine Learning
 NLP: Natural Language Processing
 OLAP: Online Analytical Processing
 SQL: Structured Query Language
 XML: Extensible Markup Language

5.3 Data Visualisation

What is data visualization?

Data visualization is the process of creating graphical representations of data to make it easier to understand and interpret. It is an essential tool in data science, as it allows data scientists to communicate their findings to others clearly and concisely.

Graphical basics

There are a few basic elements that are common to all data visualizations:

 X-axis: The horizontal axis of a graph, which typically represents the independent variable.
 Y-axis: The vertical axis of a graph, which typically represents the dependent variable.
 Data points: The individual data points that are plotted on the graph.
 Trend lines: Lines that are used to show the overall trend of the data.
 Labels: Text annotations that are used to identify the different elements of the graph.
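The matplotlib sketch below (toy numbers) puts these basic elements together: labelled axes, plotted data points, and a simple trend line:

    import numpy as np
    import matplotlib.pyplot as plt

    x = np.array([1, 2, 3, 4, 5])             # independent variable (x-axis)
    y = np.array([2.1, 3.9, 6.2, 8.0, 9.8])   # dependent variable (y-axis)

    plt.scatter(x, y, label="Data points")
    slope, intercept = np.polyfit(x, y, 1)    # fit a straight trend line
    plt.plot(x, slope * x + intercept, label="Trend line")

    plt.xlabel("Week")                        # x-axis label
    plt.ylabel("Sales (units)")               # y-axis label
    plt.title("Weekly sales")                 # clear, concise title
    plt.legend()
    plt.show()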

Making visualizations more understandable

There are a few things that can be done to make data visualizations more understandable:

 Use clear and concise titles and labels. Make sure that the title of your visualization accurately reflects what it shows, and that the labels for the axes and data points are easy to read and understand.
 Choose appropriate colors. Choose colors that are visually appealing and that can be easily distinguished from each other. Avoid using too many colors, as this can make your visualization cluttered and confusing.
 Use a consistent design. Use the same fonts, colors, and symbols throughout your visualization to create a consistent look and feel.
 Highlight important features. Use visual cues such as bold text, different colors, or larger data points to highlight the most important features of your visualization.
 Provide context. Add a brief explanation to your visualization that provides context for the data and helps the viewer to understand what they are looking at.

Here are some additional tips for creating effective data visualizations:

 Use the right type of visualization for your data. There are many
different types of data visualizations, each with its strengths and
weaknesses. Choose the type of visualization that is most
appropriate for the type of data you are visualizing and the
message you want to communicate.
 Keep it simple. Don't try to cram too much information into a
single visualization. Focus on communicating the most important
insights from your data.
 Tell a story. Use your visualization to tell a story about your
data. What insights can be drawn from the data? What
implications do the insights have?
 Get feedback. Once you have created a visualization, share it
with others and get their feedback. This will help you to identify
any areas where the visualization can be improved.

Activity 5.1

Lab Activity: Create a Data Visualization

Objective: To practice creating a data visualization that is both informative and visually appealing.

Materials:

 A dataset (e.g., from Kaggle, UCI Machine Learning Repository, etc.)
 A data visualization tool (e.g., Python, R, Tableau, etc.)

Instructions:

1. Choose a dataset that you are interested in visualizing.
2. Clean and prepare the data for visualization.
3. Choose a data visualization type that is appropriate for the data and the message you want to communicate.
4. Create the visualization using your chosen data visualization tool.
5. Evaluate your visualization and make any necessary changes.
6. Share your visualization with others and get their feedback.

Discussion:

1. What type of data visualization did you choose? Why?
2. What challenges did you face in creating your visualization?
3. How could you improve your visualization?

Summary

Data visualization is a fundamental aspect of data science that involves representing complex data in a graphical format. Graphical basics such as color, size, and shape play a significant role in creating effective visualizations. The key to making visualizations more understandable is to ensure that they are simple, concise, and easy to interpret. By using visual aids, data analysis becomes much more accessible, and patterns and trends can be easily identified. The primary purpose of data visualization is to communicate complex data in a way that is both accessible and understandable. It helps to make data analysis more efficient and effective by providing a clear and concise representation of the information. In summary, data visualization is an essential tool that plays a critical role in data science.

Further Reading

Lyon, W. (2019). Microsoft Power BI Desktop: A free and user-friendly software program for data visualisations in the Social Sciences. Historia, 64(1), 166-171.

Wright, C. Y., & Wernecke, B. (2020). Using Microsoft Power BI to visualise Rustenburg Local Municipality's air quality data. Clean Air Journal, 30(1), 1-5.

UNIT 6 – REGRESSION ALGORITHMS

6.0 Introduction

Regression algorithms are a powerful tool in data science that can provide insight into the relationship between variables. These algorithms are used to predict a continuous value based on one or more input variables. They are commonly used in fields such as finance, economics, and medicine to forecast future trends and make informed decisions. Some popular regression algorithms include linear regression, polynomial regression, and logistic regression. Each algorithm has its own strengths and weaknesses, so it's important to choose the right one for the particular problem at hand. Overall, regression algorithms are an essential component of any data scientist's toolkit.

6.1 Learning outcomes

 Understand the basics of regression algorithms and how they are used in data science.
 Be able to identify the different types of regression algorithms and choose the right algorithm for a specific task.
 Be able to train and evaluate regression models.
 Be able to interpret and communicate the results of regression models.
 Be able to apply regression algorithms to solve real-world problems.

6.2 Key Terms/ Definition of Terms

 Regression algorithm: A machine learning algorithm that is used to predict a continuous target variable from one or more predictor variables.
 Linear regression: A regression algorithm that models the relationship between the target variable and the predictor variables using a linear function.
 Logistic regression: A regression algorithm that models the probability of a binary outcome (e.g., yes/no, churn/not churn) from one or more predictor variables.
 Ridge regression: A regression algorithm that addresses the problem of overfitting in linear regression by adding a penalty term to the cost function.
 Lasso regression: A regression algorithm that addresses the problem of overfitting and feature selection in linear regression by adding a penalty term to the cost function.
 Overfitting: A problem that occurs when a regression model learns the training data too well and is unable to generalize to new data.
 Underfitting: A problem that occurs when a regression model does not learn the training data well enough and is unable to make accurate predictions.
 R-squared: A metric that measures how well a regression model fits the training data.
 Adjusted R-squared: A metric that penalizes regression models for adding unnecessary predictor variables.
 Mean squared error: A metric that measures the average squared difference between the predicted values and the actual values.
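These metrics are straightforward to compute directly; the sketch below (toy numbers) calculates MSE and R-squared from first principles:

    import numpy as np

    y_true = np.array([3.0, 5.0, 7.0, 9.0])   # actual values
    y_pred = np.array([2.8, 5.3, 6.6, 9.4])   # model predictions

    mse = np.mean((y_true - y_pred) ** 2)           # mean squared error
    ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
    r2 = 1 - ss_res / ss_tot                        # R-squared

    print(mse, r2)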

6.2.1 Abbreviations and Acronyms

 LR: Linear Regression
 GLM: Generalized Linear Model
 OLS: Ordinary Least Squares
 WLS: Weighted Least Squares
 RLS: Recursive Least Squares
 LASSO: Least Absolute Shrinkage and Selection Operator
 Ridge: Ridge Regression
 SVM: Support Vector Machines
 RF: Random Forest
 GBM: Gradient Boosting Machine
 XGBoost: eXtreme Gradient Boosting
 CV: Cross-Validation
 MSE: Mean Squared Error
 MAE: Mean Absolute Error
 RMSE: Root Mean Squared Error
 R-squared: Coefficient of Determination
 Adj. R-squared: Adjusted R-squared

6.3 Regression Algorithms

Regression algorithms are a type of machine learning algorithm that is


used to predict a continuous target variable from one or more predictor
variables. They are supervised learning algorithms, which means that
they are trained on a set of labelled data, where each data point has a
known target value.

Regression algorithms are one of the most important tools in data science, and they are used in a wide variety of applications, including:

 Predicting customer churn
 Forecasting sales
 Determining the optimal price for a product
 Predicting the risk of a loan default
 Predicting the likelihood of a patient developing a disease

Different types of regression algorithms

There are many different types of regression algorithms, each with its
own strengths and weaknesses. Some of the most common regression
algorithms include:

 Linear regression: Linear regression is a simple but powerful regression algorithm that models the relationship between the target variable and the predictor variables using a linear function.
 Logistic regression: Logistic regression is a regression algorithm
that is used to predict the probability of a binary outcome (e.g.,
yes/no, churn/not churn) from one or more predictor variables.
 Ridge regression: Ridge regression is a regression algorithm that
addresses the problem of overfitting in linear regression by
adding a penalty term to the cost function.
 Lasso regression: Lasso regression is a regression algorithm that
addresses the problem of overfitting and feature selection in
linear regression by adding a penalty term to the cost function.
 Decision trees: Decision trees are a type of machine learning
algorithm that can be used for both classification and regression
tasks. Regression decision trees model the relationship between
the target variable and the predictor variables by building a tree-
like structure.
 Random forests: Random forests are an ensemble learning
algorithm that combines multiple decision trees to produce more
accurate predictions.
 Gradient boosting machines: Gradient boosting machines are
another type of ensemble learning algorithm that combines
multiple regression models to produce more accurate predictions.
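
As an illustration of how the penalty term in ridge and lasso regression shrinks coefficients, here is a minimal sketch using scikit-learn; the synthetic data and the alpha values are arbitrary choices for demonstration:

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge, Lasso

# Synthetic regression data: 10 features, only 4 of which are informative
X, y = make_regression(n_samples=200, n_features=10, n_informative=4,
                       noise=10.0, random_state=42)

# alpha controls the strength of the penalty term
models = {"Linear": LinearRegression(),
          "Ridge": Ridge(alpha=10.0),
          "Lasso": Lasso(alpha=5.0)}

for name, model in models.items():
    model.fit(X, y)
    # Lasso tends to drive uninformative coefficients exactly to zero;
    # ridge shrinks them towards zero without eliminating them
    print(name, np.round(model.coef_, 1))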

Training and evaluating regression models

To train a regression model, you need a set of labelled data, where each
data point has a known target value. You can then use a variety of
machine learning libraries to train a regression model on your data.

Once the model is trained, you can evaluate its performance on a held-
out test set. This will help you to assess how well the model generalizes
to unseen data.
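
A minimal sketch of this train-and-evaluate workflow with scikit-learn is shown below; the diabetes dataset bundled with the library is used purely as an example:

from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Load a small example dataset that ships with scikit-learn
X, y = load_diabetes(return_X_y=True)

# Hold out 20% of the data as a test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Train the model on the training set only
model = LinearRegression().fit(X_train, y_train)

# Evaluate on the held-out test set to estimate generalization
y_pred = model.predict(X_test)
print("MSE:", mean_squared_error(y_test, y_pred))
print("R-squared:", r2_score(y_test, y_pred))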

Interpreting and communicating the results of regression models

Once you have trained and evaluated a regression model, you can use it
to make predictions on new data points. However, it is important to
interpret the results of the model carefully.

For example, if you are using a regression model to predict customer
churn, you need to understand the relationship between the predictor
variables and the target variable. This will help you to identify the
factors that are most likely to lead to customer churn.

You should also communicate the results of the model to stakeholders in
a clear and concise way. Avoid using jargon and technical terms that
your audience may not understand.

Here are some additional tips for using regression algorithms in data
science:

 Choose the right algorithm for the task at hand. There is no one-
size-fits-all regression algorithm. The best algorithm for a
particular task will depend on the specific characteristics of the
data and the problem you are trying to solve.
 Prepare your data carefully. Before you train a regression model,
it is important to clean and prepare your data. This includes
removing outliers and missing values.
 Tune the hyperparameters. Most regression algorithms have a
number of hyperparameters that can be tuned to improve the
performance of the model. It is important to tune the
hyperparameters for your specific problem.
 Evaluate the model carefully. Once you have trained a regression
model, it is important to evaluate its performance on a held-out
test set. This will help you to assess how well the model
generalizes to unseen data.
 Use the model responsibly. Regression models can be used to
make predictions, but it is important to use them responsibly. Be
aware of the limitations of the model and do not rely on it to
make important decisions without consulting with other experts.
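
As a sketch of the hyperparameter-tuning tip above, the example below uses scikit-learn's GridSearchCV to search over ridge regression alpha values with 5-fold cross-validation; the grid values are arbitrary demonstration choices:

from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

X, y = load_diabetes(return_X_y=True)

# Candidate alpha values to try (arbitrary demonstration grid)
param_grid = {"alpha": [0.01, 0.1, 1.0, 10.0, 100.0]}

# 5-fold cross-validation over the grid, scored by negative MSE
search = GridSearchCV(Ridge(), param_grid,
                      scoring="neg_mean_squared_error", cv=5)
search.fit(X, y)

print("Best alpha:", search.best_params_["alpha"])
print("Best cross-validation score (negative MSE):", search.best_score_)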

By following these tips, you can use regression algorithms to build and deploy effective machine learning models that can solve real-world problems.
Activity 6.1

Lab Activity: Regression Algorithms


Activity:

 Choose a regression algorithm. There are many different regression algorithms available, so
choose one that is appropriate for your data and
problem. For example, linear regression is a
good choice for continuous target variables,
while logistic regression is a good choice for
binary target variables.
 Find a dataset. You can find many public
datasets online that can be used for machine
learning. Choose a dataset that is relevant to
your interests and that has a target variable that
you are interested in predicting.
 Prepare the data. Before you train a regression
model, it is important to clean and prepare your
data. This may involve removing outliers,
handling missing values, and scaling the data.
 Train and evaluate the model. Once your data
is prepared, you can train a regression model
on your data. You can then evaluate the
performance of the model on a held-out test set.
 Make predictions. Once you have trained and
evaluated a regression model, you can use it to
make predictions on new data points.
Variations:

1. Use different regression algorithms. Try training different regression algorithms on your
data and see which one performs the best.
2. Use different datasets. Try training different
regression algorithms on different datasets and
see how they perform.
3. Use different features. Try using different
features in your data to train your regression
algorithms.
4. Use different hyperparameters. Most regression
algorithms have a number of hyperparameters
that can be tuned to improve the performance
of the model. Try tuning the hyperparameters
for your specific problem.

Conclusion:
This learning activity will help you to understand how
to use regression algorithms to solve real-world
problems. By experimenting with different regression
algorithms, datasets, features, and hyperparameters,
you can develop the skills you need to build and
deploy effective machine learning models.

Here is an example of how you could apply this
learning activity to a real-world problem:

Problem: A company wants to predict customer churn.

Solution:

 Choose a regression algorithm. Logistic
regression is a good choice for this problem
because the target variable is binary (churn or
not churn).
 Find a dataset. There are many public datasets
available online that contain customer data.
Choose a dataset that is relevant to your
industry and that contains a churn variable.
 Prepare the data. Clean and prepare the data by
removing outliers, handling missing values,
and scaling the data.
 Train and evaluate the model. Train a logistic
regression model on the prepared data and
evaluate its performance on a held-out test set.
 Make predictions. Once you have a trained
model, you can use it to predict whether new
customers are likely to churn.
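
A compact sketch of this churn workflow is given below. Because the company's dataset is hypothetical, synthetic customer features are generated in its place; in practice you would load the real churn data:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for customer data: features plus a binary churn label
X, y = make_classification(n_samples=1000, n_features=8, random_state=0)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Scale the features, as recommended in the preparation step
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

# Logistic regression models the probability of churn
model = LogisticRegression().fit(X_train, y_train)

# predict_proba gives churn probabilities for new customers
print("Test accuracy:", accuracy_score(y_test, model.predict(X_test)))
print("Churn probability of first test customer:",
      model.predict_proba(X_test)[0, 1])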

Summary

Regression algorithms are a powerful tool that can be used to solve a
wide range of problems in data science. By understanding the different
types of regression algorithms and how to use them, you can develop the
skills you need to build and deploy effective machine learning models.

Further Reading

James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An introduction to statistical learning: with applications in R. Springer-Verlag.
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B.,
Grisel, O., ... & Vanderplas, J. (2011). Scikit-learn: Machine learning in
Python. Journal of Machine Learning Research, 12, 2825-2830.

Friedman, J. H. (2001). Greedy function approximation: A gradient boosting machine. Annals of Statistics, 29(5), 1189-1232.
7: UNSUPERVISED LEARNING

7.0 Introduction

Unsupervised learning is a type of machine learning where the algorithm
is not given any labelled data to learn from. Instead, it is left to find
patterns and structures in the data on its own. This makes it useful in
situations where labelled data is scarce or expensive to obtain.
Clustering and dimensionality reduction are common techniques used in
unsupervised learning. It is a powerful tool in data science that can help
uncover hidden insights and relationships in data.

7.1 Learning outcomes


 Understand the basics of unsupervised learning and how it is
used in data science.
 Be able to identify the different types of unsupervised learning
algorithms and choose the right algorithm for a specific task.
 Be able to train and evaluate unsupervised learning models.
 Be able to interpret and communicate the results of unsupervised
learning models.
 Be able to apply unsupervised learning algorithms to solve real-
world problems.
7.2 Key Terms/ Definition of Terms

 Unsupervised learning: A type of machine learning that learns from unlabelled data. This means that the data does not have any
known target values.
 Clustering: An unsupervised learning algorithm that groups data
points into clusters based on their similarity.
 Association rule learning: An unsupervised learning algorithm
that discovers relationships between items in a dataset.
 Dimensionality reduction: An unsupervised learning algorithm
that reduces the number of features in a dataset while preserving
as much information as possible.
 Cluster dendrogram: A tree diagram that shows the relationships
between clusters in a dataset.
 Association rule: A rule that describes a relationship between
two or more items in a dataset.
 Principal component: A new feature that is created by combining
two or more existing features in a dataset.

7.2.1 Abbreviations and Acronyms


 UL: Unsupervised Learning
 CL: Clustering
 ARL: Association Rule Learning
 DR: Dimensionality Reduction
 K-Means: A popular clustering algorithm
 DBSCAN: Another popular clustering algorithm
 Apriori: A popular association rule learning algorithm
 PCA: Principal Component Analysis, a popular dimensionality
reduction algorithm
 MDS: Multidimensional Scaling, another popular dimensionality
reduction algorithm
 SVD: Singular Value Decomposition, a technique that can be
used for dimensionality reduction
 CV: Cross-Validation, a technique that can be used to evaluate
the performance of unsupervised learning models

7.3 Unsupervised Learning

Unsupervised learning is a type of machine learning that learns from unlabelled data. This means that the data does not have any known
target values. Unsupervised learning algorithms are used to discover
patterns and relationships in data without any prior knowledge of the
data.

Unsupervised learning is used in a wide variety of applications, including:

 Customer segmentation: Unsupervised learning can be used to
segment customers into different groups based on their purchase
history, demographics, or other factors. This information can
then be used to target customers with personalized marketing
campaigns.
 Market basket analysis: Unsupervised learning can be used to
discover relationships between items that are frequently
purchased together. This information can then be used to
improve product placement, recommend products to customers,
and design promotions.
 Anomaly detection: Unsupervised learning can be used to detect
anomalies in data, such as fraudulent transactions or network
intrusions. This information can then be used to prevent fraud
and protect systems from attack.

Different types of unsupervised learning algorithms

There are many different types of unsupervised learning algorithms,
each with its own strengths and weaknesses. Some of the most common
unsupervised learning algorithms include:

 Clustering: Clustering algorithms group data points into clusters
based on their similarity. Clustering algorithms can be used to
segment customers, identify groups of patients with similar
symptoms, or detect groups of fraudulent transactions.
 Association rule learning: Association rule learning algorithms
discover relationships between items in a dataset. Association
rule learning algorithms can be used to identify market basket
relationships, recommend products to customers, and design
promotions.
 Dimensionality reduction: Dimensionality reduction algorithms
reduce the number of features in a dataset while preserving as
much information as possible. Dimensionality reduction
algorithms can be used to improve the performance of machine
learning models and make data easier to visualize.
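
To illustrate two of these algorithm families together, the sketch below uses scikit-learn's bundled iris measurements as stand-in data: PCA first reduces the features to two principal components, then k-means groups the reduced points into clusters:

from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

# Example data bundled with scikit-learn; the labels are deliberately ignored
X, _ = load_iris(return_X_y=True)

# Dimensionality reduction: compress 4 features into 2 principal components
X_2d = PCA(n_components=2).fit_transform(X)

# Clustering: group the reduced data points into 3 clusters
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_2d)

print("Cluster sizes:", [int((kmeans.labels_ == k).sum()) for k in range(3)])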

Training and evaluating unsupervised learning models

To train an unsupervised learning model, you need a dataset of
unlabelled data. You can then use a variety of machine learning libraries
to train an unsupervised learning model on your data.

Once the model is trained, you can evaluate its performance on a held-
out test set. This will help you to assess how well the model generalizes
to unseen data.

Interpreting and communicating the results of unsupervised learning models

Once you have trained and evaluated an unsupervised learning model,
you can use it to make predictions on new data points. However, it is
important to interpret the results of the model carefully.

Unsupervised learning models can be used to discover patterns and
relationships in data, but they cannot explain why these patterns and
relationships exist. It is important to use your domain knowledge to
interpret the results of unsupervised learning models and to identify
meaningful patterns and relationships.

Here are some additional tips for using unsupervised learning:

 Choose the right algorithm for the task at hand. There is no one-
size-fits-all unsupervised learning algorithm. The best algorithm
for a particular task will depend on the specific characteristics of
the data and the problem you are trying to solve.
 Prepare your data carefully. Before you train an unsupervised
learning model, it is important to clean and prepare your data.
This may involve removing outliers and missing values.
 Tune the hyperparameters. Most unsupervised learning
algorithms have a number of hyperparameters that can be tuned
to improve the performance of the model. It is important to tune
the hyperparameters for your specific problem.
 Evaluate the model carefully. Once you have trained an
unsupervised learning model, it is important to evaluate its
performance on a held-out test set. This will help you to assess
how well the model generalizes to unseen data.
 Use the model responsibly. Unsupervised learning models can be
used to discover patterns and relationships in data, but they
cannot explain why these patterns and relationships exist. It is
important to use your domain knowledge to interpret the results
of unsupervised learning models and to identify meaningful
patterns and relationships.

Activity 7.1

Lab Activity:

1. Choose an unsupervised learning algorithm.
There are many different unsupervised learning
algorithms available, so choose one that is
appropriate for your data and problem. For
example, k-means clustering is a good choice
for grouping data points into clusters, while
association rule learning is a good choice for
discovering relationships between items in a
dataset.
2. Find a dataset. There are many public datasets
available online that can be used for
unsupervised learning. Choose a dataset that is
relevant to your interests and that contains the
type of data that you want to analyze.
3. Prepare the data. Before you train an
unsupervised learning model, it is important to
clean and prepare your data. This may involve
removing outliers, handling missing values,
and scaling the data.
4. Train and evaluate the model. Once your data
is prepared, you can train an unsupervised
learning model on your data. You can then
evaluate the performance of the model on a
held-out test set. This will help you to assess
how well the model generalizes to unseen data.
5. Interpret and communicate the results. Once
you have trained and evaluated an
unsupervised learning model, you can use it to
make predictions on new data points. You
should also interpret the results of the model
and communicate the findings to your audience
in a clear and concise way.

Variations:

 Use different unsupervised learning algorithms.
Try training different unsupervised learning
algorithms on your data and see which one
performs the best.
 Use different datasets. Try training different
unsupervised learning algorithms on different
datasets and see how they perform.
 Use different features. Try using different
features in your data to train your unsupervised
learning algorithms.
 Tune the hyperparameters. Most unsupervised
learning algorithms have a number of
hyperparameters that can be tuned to improve
the performance of the model. Try tuning the
hyperparameters for your specific problem.

Conclusion:

This learning activity will help you to understand how
to use unsupervised learning algorithms to solve real-
world problems. By experimenting with different
unsupervised learning algorithms, datasets, features,
and hyperparameters, you can develop the skills you
need to build and deploy effective machine learning
models.

Here is an example of how you could apply this
learning activity to a real-world problem:

Problem: A company wants to segment its customers
into different groups based on their purchase history.
Solution:

1. Choose an unsupervised learning algorithm. K-means clustering is a good choice for this
problem because it is a simple and effective
algorithm for grouping data points into
clusters.
2. Find a dataset. The company can use its
customer purchase history data to train the k-
means clustering model.
3. Prepare the data. The company should clean
and prepare the data by removing outliers and
handling missing values.
4. Train and evaluate the model. The company
can train the k-means clustering model on the
prepared data and evaluate its performance on a
held-out test set. This will help the company to
assess how well the model generalizes to
unseen data.
5. Interpret and communicate the results. Once
the k-means clustering model is trained and
evaluated, the company can use it to segment
its customers into different groups based on
their purchase history. The company can then
communicate the findings to its marketing team
so that they can develop more targeted
marketing campaigns.
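
A minimal sketch of this segmentation pipeline appears below. The purchase-history features are synthetic stand-ins, and the silhouette score is one common way to judge cluster quality when no labels exist:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for purchase history: [orders per year, average spend]
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal([5, 20], [2, 5], size=(100, 2)),     # occasional, low spend
    rng.normal([40, 150], [8, 30], size=(100, 2)),  # frequent, high spend
])

# Scale the features so neither dominates the distance calculation
X_scaled = StandardScaler().fit_transform(X)

# Segment the customers into 2 groups
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_scaled)

# Silhouette score ranges from -1 to 1; higher means better-separated clusters
print("Silhouette score:", silhouette_score(X_scaled, kmeans.labels_))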

Summary

Unsupervised learning is a powerful tool that can be used to discover
patterns and relationships in data without any prior knowledge of the
data. Unsupervised learning algorithms are used in a wide variety of
applications, including customer segmentation, market basket analysis,
and anomaly detection.

By understanding the different types of unsupervised
algorithms and how to use them, you can develop the skills you need to
solve a wide range of problems in data science.

Further Reading

Cunningham, P., & Bailey, S. (2014). Unsupervised learning and data mining: A conceptual overview. In Machine learning (pp. 483-512). Springer.
Ester, M., Kriegel, H.-P., Sander, J., & Xu, X. (1996). A density-based
algorithm for discovering clusters in large spatial databases with noise.
Kdd, 96(34), 226-231.

Agrawal, R., Imieliński, T., & Swami, A. (1993). Mining association rules between sets of items in large databases. ACM SIGMOD Record, 22(2), 207-216.

8: RECOMMENDER SYSTEMS

8.0 Introduction

Recommender systems are a type of algorithm that suggests items to
users based on their preferences and behaviour. They are commonly
used in e-commerce websites, streaming services, and social networks.
By analysing user data such as past purchases, ratings, and browsing
history, recommender systems can make personalized recommendations
to help users find products or content they are likely to enjoy. This can
improve user engagement, increase sales, and enhance the user
experience. However, designing effective recommender systems
requires careful consideration of factors such as data privacy, bias, and
evaluation metrics.

8.1 Learning outcomes

 Define recommender systems and explain their importance.
 Understand the different types of recommender systems and their
underlying algorithms.
 Be able to evaluate the performance of recommender systems.
 Identify the challenges in building and deploying recommender
systems.
 Apply recommender systems to solve real-world problems.

8.2 Key Terms/ Definition of Terms

 Recommender system: A system that recommends items to users
based on their past behaviour and preferences.
 Content-based recommender system: A recommender system
that recommends items to users based on the similarity between
the items and the user's past behaviour.
 Collaborative filtering recommender system: A recommender
system that recommends items to users based on the ratings and
preferences of other users with similar tastes.
 Hybrid recommender system: A recommender system that
combines content-based and collaborative filtering techniques.
 K-nearest neighbours (KNN) algorithm: A recommender system
algorithm that recommends items to users based on the similarity
between the users and their k-nearest neighbours.
 Matrix factorization algorithm: A recommender system
algorithm that decomposes a user-item rating matrix into two
latent factor matrices, which represent the users and items in a
latent space.
 Deep learning-based recommender system: A recommender
system that uses deep learning algorithms to recommend items to
users.
 Cold start problem: The challenge of recommending items to
new users or items that have not been rated by many users.
 Data sparsity: The challenge of dealing with recommender
system datasets that are often sparse, meaning that many users
have not rated many items.
 Scalability: The challenge of deploying recommender systems to
large user bases.

8.2.1 Abbreviations and Acronyms

 RS: Recommender System
 CB: Content-based Recommender System
 CF: Collaborative Filtering Recommender System
 HRS: Hybrid Recommender System
 KNN: K-Nearest Neighbors
 MF: Matrix Factorization
 DLRS: Deep Learning-based Recommender System
 CSP: Cold Start Problem

8.3 Recommender Systems

Recommender systems are a type of machine learning system that
recommends items to users based on their past behaviour and
preferences. Recommender systems are used in a wide variety of
applications, including product recommendation, movie
recommendation, music recommendation, and news recommendation.

There are two main types of recommender systems: content-based and
collaborative filtering.

1. Content-based recommender systems recommend items to users
based on the similarity between the items and the user's past
behaviour. For example, a content-based recommender system
for movies might recommend movies to users based on the
genres of movies that the users have watched in the past.
2. Collaborative filtering recommender systems recommend items
to users based on the ratings and preferences of other users with
similar tastes. For example, a collaborative filtering
recommender system for products might recommend products to
users based on the products that other users with similar
purchase histories have purchased.

Recommender systems are typically built using a variety of machine
learning algorithms, such as k-nearest neighbours, matrix factorization,
and deep learning.
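
To make the collaborative filtering idea concrete, here is a small sketch of item-based collaborative filtering using cosine similarity; the rating matrix is invented for the example, with 0 meaning "not rated":

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Invented user-item rating matrix (rows = users, columns = items, 0 = unrated)
ratings = np.array([
    [5, 4, 0, 1],
    [4, 5, 1, 0],
    [1, 0, 5, 4],
    [0, 1, 4, 5],
])

# Similarity between items, computed from the columns of the rating matrix
item_sim = cosine_similarity(ratings.T)

# Score items for user 0 as a similarity-weighted sum of their own ratings
user = ratings[0]
scores = item_sim @ user
scores[user > 0] = -np.inf  # do not re-recommend items already rated

print("Recommend item:", int(np.argmax(scores)))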

Here are some examples of how recommender systems are used in the
real world:

 Product recommendation: Recommender systems are used by e-commerce companies to recommend products to customers based on their purchase history and browsing behaviour.
 Movie recommendation: Recommender systems are used by
streaming services to recommend movies and TV shows to users
based on their watch history and ratings.
 Music recommendation: Recommender systems are used by
music streaming services to recommend songs and albums to
users based on their listening history and preferences.
 News recommendation: Recommender systems are used by news
websites to recommend articles to users based on their reading
history and interests.

Challenges in building and deploying recommender systems

There are a number of challenges in building and deploying
recommender systems. Some of the most common challenges include:

 Cold start problem: The challenge of recommending items to
new users or items that have not been rated by many users.
 Data sparsity: The challenge of dealing with recommender
system datasets that are often sparse, meaning that many users
have not rated many items.
 Scalability: The challenge of deploying recommender systems to
large user bases.

Despite these challenges, recommender systems are a powerful tool that
can be used to improve the user experience in a wide variety of
applications.

Here are some tips for building and deploying effective recommender
systems:

 Use high-quality data: Recommender systems are only as good
as the data that they are trained on. It is important to use high-
quality data that is representative of the users and items that the
recommender system will be used for.
 Choose the right algorithm: There are a variety of different
machine learning algorithms that can be used to build
recommender systems. It is important to choose the right
algorithm for the specific application.
 Evaluate the performance of the recommender system: It is
important to evaluate the performance of the recommender
system on a held-out test set before deploying it to production.
This will help to identify any potential problems with the
recommender system.
 Monitor the recommender system in production: It is important
to monitor the performance of the recommender system in
production to ensure that it is working as expected.

Activity 8.1

Lab Activity:

1. Choose a recommender system dataset. There
are many public datasets available online that
can be used for recommender systems. Choose
a dataset that is relevant to your interests and
that contains the type of data that you want to
use to build your recommender system.
2. Prepare the data. Before you train a
recommender system model, it is important to
clean and prepare your data. This may involve
removing outliers, handling missing values,
and scaling the data.
3. Choose a recommender system algorithm.
There are many different recommender system
algorithms available. Choose an algorithm that
is appropriate for your dataset and the problem
you are trying to solve.
4. Train and evaluate the model. Once your data
is prepared and you have chosen an algorithm,
you can train your recommender system model.
Once the model is trained, you can evaluate its
performance on a held-out test set.
5. Make recommendations. Once you have
trained and evaluated your model, you can use
it to make recommendations to users.
Variations:

 Try different recommender system algorithms.
Train different recommender system
algorithms on your dataset and see which one
performs the best.
 Use different features. Try using different
features in your data to train your recommender
system models.
 Tune the hyperparameters. Most recommender
system algorithms have a number of
hyperparameters that can be tuned to improve
the performance of the model. Try tuning the
hyperparameters for your specific problem.
 Deploy your recommender system. Once you
have trained and evaluated a recommender
system model, you can deploy it to production
so that it can be used to make
recommendations to users.

Conclusion:

This learning activity will help you to understand how
to build and deploy recommender systems in data
science. By experimenting with different recommender
system algorithms, datasets, features, and
hyperparameters, you can develop the skills you need
to build and deploy effective recommender systems
that can improve the user experience in a wide variety
of applications.

Here is an example of how you could apply this
learning activity to a real-world problem:

Problem: A company wants to build a recommender
system to recommend products to its customers.

Solution:

1. Choose a recommender system dataset. The
company could use a public dataset of product
reviews or a dataset of its own customer
purchase history.
2. Prepare the data. The company would need to
clean and prepare the data by removing
outliers, handling missing values, and scaling
the data.
3. Choose a recommender system algorithm. The
company could use a collaborative filtering
algorithm or a content-based algorithm,
depending on the type of data that is available.
4. Train and evaluate the model. The company
would need to train the recommender system
model on the prepared data and evaluate its
performance on a held-out test set.
5. Make recommendations. Once the model is
trained and evaluated, the company could use it
to make recommendations to customers based
on their purchase history or browsing behaviour.
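
As a sketch of the matrix factorization approach mentioned earlier in this unit, the example below uses scikit-learn's TruncatedSVD to decompose an invented rating matrix into latent factors and reconstruct predicted ratings; a production system would use real ratings and more latent factors:

import numpy as np
from sklearn.decomposition import TruncatedSVD

# Invented user-item rating matrix (0 = unrated)
ratings = np.array([
    [5.0, 4.0, 0.0, 1.0],
    [4.0, 5.0, 1.0, 0.0],
    [1.0, 0.0, 5.0, 4.0],
    [0.0, 1.0, 4.0, 5.0],
])

# Decompose into 2 latent factors: users and items in a shared latent space
svd = TruncatedSVD(n_components=2, random_state=0)
user_factors = svd.fit_transform(ratings)  # users in the latent space
item_factors = svd.components_             # items in the latent space

# The reconstruction approximates the ratings, filling in unrated cells
predicted = user_factors @ item_factors
print(np.round(predicted, 1))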

Summary

Recommender systems in data science are algorithms that make
recommendations to users based on their past behaviour and
preferences. There are different types of recommender systems, such as
content-based and collaborative filtering, and they can be challenging to
build effectively due to the cold-start problem and the need for
personalized recommendations. By understanding the key concepts and
techniques involved, one can successfully build effective recommender
systems in data science.

Further Reading

Ricci, F., Rokach, L., Shapira, B., & Kantor, P. (2011). Recommender
systems handbook. Springer Science & Business Media.

Burke, R. (2002). Hybrid recommender systems: Survey and experiments. User Modeling and User-Adapted Interaction, 12(4), 331-370.

Shani, G., & Gunawardana, A. (2011). Evaluating recommender systems. In Recommender systems handbook (pp. 257-297). Springer Science & Business Media.
