09 Handout 1
• Prescriptive. You may need a model that makes the required decisions and adjusts its parameters
based on the data set or question. To do this, you use prescriptive analytics. This form of
analytics is about providing the right information to make an informed decision. You can also use
this type of analytics to predict a range of outcomes and the actions prescribed for each. An example of
this type of analytics is a self-driving car. You can run numerous algorithms (an algorithm is a procedure
or formula for solving a problem through a sequence of specified actions) on the data collected from
the cars and use the results to make the car more intelligent. This makes it easier for the car to make
the right decisions to turn, slow down, speed up, or identify the direction to take.
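The prescriptive idea can be sketched in a few lines of Python: predict an outcome, then prescribe an action based on that prediction. This is only a toy illustration. The function names, the braking formula constants, and the decision thresholds are assumptions made for the example and do not come from any real self-driving system.

```python
# Hypothetical toy example: prescribe a driving action from a predicted outcome.

def stopping_distance(speed_mps, deceleration=6.0, reaction_time=1.0):
    """Predict stopping distance in metres: reaction distance plus braking distance v^2 / (2a)."""
    return speed_mps * reaction_time + speed_mps ** 2 / (2 * deceleration)


def prescribe_action(speed_mps, gap_m):
    """Prescribe an action by comparing the predicted stopping distance with the gap ahead."""
    needed = stopping_distance(speed_mps)
    if needed > gap_m:
        return "slow down"   # the car could not stop within the available gap
    if needed < 0.5 * gap_m:
        return "speed up"    # assumed rule: plenty of room ahead
    return "hold speed"


print(prescribe_action(20.0, 40.0))   # slow down
print(prescribe_action(20.0, 200.0))  # speed up
```

The prediction step (stopping_distance) and the decision step (prescribe_action) are kept separate on purpose: prescriptive analytics adds the second step on top of a predictive model.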
Data science and machine learning are both popular buzzwords today. These two (2) terms are often
thrown together but should not be used interchangeably. Although data science includes machine
learning, data science is a much broader field with many different tools.
Machine Learning
Machine learning is a group of computational algorithms that perform pattern recognition, classification,
and prediction by learning from existing data.
• Make predictions. Numerous machine-learning algorithms allow you to make predictions using
unstructured, semi-structured, and structured data sets. Let us assume you work for a finance company
and you have the transactional data available. You need to develop a model to determine the trend of
future transactions. To perform this analysis, you need to use a supervised machine-learning algorithm.
Such algorithms train a model on an existing, labeled data set. You can also use supervised
machine-learning algorithms to develop and train a model that detects future fraud based on historical
information.
• Pattern discovery. Not every data set has labeled variables you can use to make the necessary
predictions. Such data sets can still contain hidden patterns, and you need to find those patterns to
make the required predictions. To do this, you need to use an unsupervised model since you do not
have any pre-defined labels in the data set with which to group the variables. One (1) of the
most common algorithms used to identify patterns is clustering. Let us assume you work for a phone
company, and you are tasked with identifying where to set up cell towers in an area to establish a
network. You can use a clustering algorithm to identify where to set up towers so that
every user in the area receives the optimum signal strength.
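The cell tower example can be sketched with k-means, one common clustering algorithm. This is a minimal pure-Python sketch under stated assumptions: the user coordinates are invented, the centroids are seeded with the first k points to keep the run repeatable, and distance to the nearest tower stands in for signal strength. A real project would more likely use a library implementation.

```python
from math import dist


def k_means(points, k, iterations=20):
    """Group 2-D points into k clusters and return (centroids, labels)."""
    # Assumption: seed the centroids with the first k points so the result is deterministic.
    centroids = [tuple(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [min(range(k), key=lambda c: dist(p, centroids[c])) for p in points]
        # Update step: move each centroid to the mean of the points assigned to it.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return centroids, labels


# Hypothetical user locations (x, y in km) forming two neighbourhoods.
users = [(1.0, 1.0), (9.0, 9.0), (1.0, 2.0), (2.0, 1.0), (8.0, 9.0), (9.0, 8.0)]
towers, assignment = k_means(users, k=2)
print(towers)      # one candidate tower position per cluster
print(assignment)  # [0, 1, 0, 0, 1, 1]
```

No labels are supplied anywhere in the input. The algorithm discovers the two groups on its own, which is what makes it unsupervised.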
Why Use Data Science? (Campbell, 2021)
In the past, organizations managed small volumes of data. It was easy to analyze and understand the data
and the relationships within the data set using business intelligence tools. Most traditional business
intelligence tools only worked on structured data sets, but most of the data collected today are
semi-structured or unstructured.
Simple business intelligence tools cannot process this type of data, especially since large volumes of data
are collected from different instruments. For this reason, we need to develop advanced and complex
analytical algorithms and tools to process, analyze, and draw some insights from the data.
This is not the only reason data science has gained popularity. Let us look at how data science is
used in different domains:
• Customer Service. How great would it be if you could know exactly what your customers want? You
can use existing data, such as purchase history, browsing history, income, and age, to learn more about
your customers. You may have had this data in the past as well, but with the right
mathematical and statistical models, you can now work effectively with large volumes of data and
identify the right products to recommend to your customers. This is a great way to bring more business
to your firm.
• Self-Driving Cars. How would you feel if your car could drive you home? Numerous companies are
trying to develop and improve the workings of a self-driving car. The cars collect live information from
various sensors, such as lasers, radars, and cameras, to create a map of the surrounding environment.
The algorithm in the car uses this data to decide whether to speed up, slow down, park, stop, or overtake.
These algorithms are often machine learning algorithms.
• Predictions. Let us now consider how you can use data science in predictive analytics. Consider
weather forecasting. The algorithms collect and analyze data from aircraft, satellites, radars, ships, and
other sources. This helps you build the required models. You can use these models
to predict the occurrence of natural calamities. Using this information, you can take the necessary
measures to save lives.
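As an illustration of the customer service case, the sketch below recommends products from purchase history. It is a deliberately simple co-purchase heuristic, not a production recommender. The customer names, the baskets, and the scoring rule (the number of shared purchases) are assumptions made for the example.

```python
from collections import Counter


def recommend(history, customer, top_n=2):
    """Suggest products bought by customers who share purchases with `customer`."""
    owned = history[customer]
    scores = Counter()
    for other, basket in history.items():
        if other == customer:
            continue
        overlap = len(owned & basket)   # shared purchases act as a similarity score
        for product in basket - owned:  # only consider products the customer lacks
            scores[product] += overlap
    # Keep products with a positive score, best first (ties broken alphabetically).
    ranked = sorted((p for p in scores if scores[p] > 0), key=lambda p: (-scores[p], p))
    return ranked[:top_n]


# Hypothetical purchase histories.
purchases = {
    "ana": {"phone", "case"},
    "ben": {"phone", "case", "charger"},
    "cara": {"phone", "earbuds"},
    "dan": {"laptop", "mouse"},
}
print(recommend(purchases, "ana"))  # ['charger', 'earbuds']
```

A customer with no purchases in common with anyone else, such as "dan" here, gets an empty list, which shows why the approach depends on having enough data.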
Who is a Data Scientist? (Campbell, 2021)
If you search for "data scientist" on the Internet, you may come across numerous definitions. A data scientist
uses data science to answer business questions and address business concerns. The term data scientist was
coined when people recognized that such a person uses data, various mathematical or statistical functions
and operations, and other scientific fields and applications to make sense of the data in a database.
Functions Performed by Data Scientists
Data scientists crack various data problems using their expertise in specific scientific disciplines. They work
with different mathematical, statistical, and computer science elements, but they do not necessarily have to
be experts in these fields. They use technologies and tools to develop the right solutions
and reach conclusions crucial for the organization's development and growth. A data scientist finds a way
to present the data in a more useful form than the raw data available in the data set. They work with both
structured and unstructured data.
Differences between Data Science and Business Intelligence (Campbell, 2021)
Before we look at the differences between data science and business intelligence, let us understand these
terms better.
Using business intelligence (BI), an organization can find insight and hindsight in the existing data set to
describe various trends in the data. Through BI, businesses can take data from internal and external
sources, prepare that data, and run queries on the data set to obtain the required information. They can
then create the required dashboards to answer different questions or identify solutions to various
business problems. BI can also help businesses evaluate certain future events. Data science, on the other
hand, is a different approach to looking at data. You take a forward-looking approach and explain
any information or insight in the data set. Using data science, you analyze current or past data
to predict future outcomes. This is one (1) way organizations do their best to make
informed decisions.
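The contrast can be made concrete with a small sketch: a BI-style summary describes what already happened, while a data-science-style model extrapolates forward. The sales figures are invented, and the straight-line trend fitted by ordinary least squares is an assumption chosen for simplicity.

```python
def summarize(sales):
    """BI-style hindsight: describe what already happened."""
    return {
        "total": sum(sales),
        "average": sum(sales) / len(sales),
        "best_month": sales.index(max(sales)) + 1,  # months are numbered from 1
    }


def fit_trend(sales):
    """Fit an ordinary least-squares line through (month, sales) points."""
    n = len(sales)
    months = range(1, n + 1)
    mean_x = sum(months) / n
    mean_y = sum(sales) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(months, sales))
    variance = sum((x - mean_x) ** 2 for x in months)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept


monthly_sales = [100, 110, 120, 130]  # hypothetical figures for months 1 to 4
report = summarize(monthly_sales)     # looks backward
slope, intercept = fit_trend(monthly_sales)
forecast = slope * 5 + intercept      # looks forward: predicted sales for month 5
print(report)
print(round(forecast, 2))             # 140.0
```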
Now that you have an idea of what data science is, let us look at the lifecycle of data science. Most people rush
into using the models they develop on the data sets without understanding the basics of data science.
You need to understand these basics and assess the business requirements before you rush into using the
model. Make sure to follow the data science life cycle phases to ensure your results are accurate.
Lifecycle
This section gives you a brief overview of the phases in the data science lifecycle.
• Phase One: Discovery. Before you work on the project, you need to understand the following: business
requirements, specifications, required or approved budget, and priorities. If you want to pursue a
career in data science, you need to possess the ability to ask important questions. You need to assess
if you have the right resources, people, technology, data, and time to support the work done on the
project. This phase involves framing the problem and identifying the initial hypothesis you want to test.
• Phase Two: Data Preparation. Once you have identified the resources needed to work on the
analysis, you need to develop or identify an analytical sandbox where you can perform the testing and
analysis of the data. Before modeling the data, you need to process, explore, and condition it. You also
need to perform the following operations to move the data into the sandbox environment: extract,
transform, load, and transform. Programming languages can be used to clean, transform, and visualize the
data used in the analysis. These programming languages help you identify the outliers in the data. You
can also use this information to identify relationships between variables. Once the data is
cleaned and prepared, you can perform different types of analysis on the data.
• Phase Three: Plan the Model. During this phase, you need to identify the techniques and methods that
help you find the relationships between the different variables in the data set. These relationships will
help you determine the algorithms you can use in the next phase of the lifecycle. To do this, you need
to apply exploratory data analysis methods and tools using various formulas and visualization
methods. Let us look at some tools used for this below:
o R: This programming language has various modeling capabilities. It is also a good platform to
use and develop the right models if you are a beginner.
o SQL: This provides a set of methods to perform analysis within the database using different
predictive models and mining functions.
o SAS/ACCESS: This tool can be used to access data from various storage platforms, like
Hadoop, and use that data to create a reusable and repeatable model.
The market has numerous tools for developing models, but R is commonly used. At the end
of this phase, you will have the required insights into your data that will help you determine the algorithm
to use. The next phase is where you apply this algorithm and develop the model.
• Phase Four: Build the Model. Now that you have decided which algorithm to use, you must split the
data set into training and testing data sets. In this phase, you need to consider the existing tools and
determine if they are sufficient for building a model. Make sure you identify a robust environment to
run the models. To develop the model, you need to analyze different techniques, such as clustering,
classification, and association.
• Phase Five: Operate the Model. In this phase, you run the data through the model and deliver the
reports and necessary technical documents. You may also need to run the model in the
production environment to test whether it works as intended. This gives you an idea of how the model
performs on real-time data. You can also determine any constraints in the model.
• Phase Six: Communicate the Results. It is important to evaluate whether the model has given you the needed
results. You can do this by revisiting your hypotheses. This is the last phase of the data science lifecycle
and is where you identify the key findings and communicate them to the organization. You can
judge the results of the model against the criteria you identified in the first phase.
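The phases above can be sketched end to end on a toy fraud detection problem. Everything here is an assumption made for illustration: the transactions, the target accuracy, the deterministic split, and the model itself, which is a one-feature nearest-mean classifier rather than a method prescribed by the lifecycle.

```python
def prepare(records):
    """Phase Two: drop records with missing values (a small stand-in for data cleaning)."""
    return [r for r in records if None not in r]


def split(records, test_every=4):
    """Phase Four: hold out every `test_every`-th record for testing (deterministic split)."""
    train = [r for i, r in enumerate(records) if (i + 1) % test_every != 0]
    test = [r for i, r in enumerate(records) if (i + 1) % test_every == 0]
    return train, test


def build(train):
    """Phase Four: nearest-mean classifier; the threshold is the midpoint of the class means."""
    normal = [amount for amount, fraud in train if not fraud]
    flagged = [amount for amount, fraud in train if fraud]
    return (sum(normal) / len(normal) + sum(flagged) / len(flagged)) / 2


def accuracy(threshold, test):
    """Phase Five: run held-out data through the model and score its predictions."""
    hits = sum((amount > threshold) == fraud for amount, fraud in test)
    return hits / len(test)


# Phase One: success criterion agreed up front (assumed value).
TARGET_ACCURACY = 0.9

# Hypothetical transactions as (amount, is_fraud); None marks a missing value.
raw = [(20, False), (35, False), (25, False), (900, True), (None, False),
       (1100, True), (40, False), (950, True), (30, False)]

clean = prepare(raw)
train, test = split(clean)
threshold = build(train)
score = accuracy(threshold, test)

# Phase Six: compare the result with the Phase One criterion before reporting it.
meets_target = score >= TARGET_ACCURACY
print(threshold, score, meets_target)
```

Phase Three is not shown as code because, on this toy data, planning amounts to noticing that fraudulent amounts are much larger than normal ones, which is what suggests a threshold model in the first place.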
References:
Campbell, A. (2021). Data science for beginners: Comprehensive guide to most important basics in data science. Alex Campbell.
Stobierski, T. (2021). What's the difference between data analytics & data science? https://online.hbs.edu/blog/post/data-analytics-vs-data-science