NOTES OF Python
UNIT -1
Python Basics:
Introduction to Python :-
What is Python:-
Python is a popular programming language. It was created by Guido van Rossum, and
released in 1991.
It is used for:
web development (server-side),
software development,
mathematics,
system scripting.
What can Python do:-
Python can be used on a server to create web applications.
Python can be used alongside software to create workflows.
Python can connect to database systems. It can also read and modify files.
Python can be used to handle big data and perform complex mathematics.
Python can be used for rapid prototyping, or for production-ready software
development.
Why Python:-
Python works on different platforms (Windows, Mac, Linux, Raspberry Pi,
etc).
Python has a simple syntax similar to the English language.
Python has syntax that allows developers to write programs with fewer lines
than some other programming languages.
Python runs on an interpreter system, meaning that code can be executed as
soon as it is written. This means that prototyping can be very quick.
Python can be treated in a procedural way, an object-oriented way or a
functional way.
Python Features
Python Syntax
Python keywords
Keywords in Python are reserved words that have special meanings, for example if,
else, while, etc. They cannot be used as identifiers (variable names). Below is the list of keywords in
Python.
False await else import pass
None break except in raise
True class finally is return
and continue for lambda try
as def from nonlocal while
assert del global not with
async elif if or yield
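A quick way to see the keyword list for the interpreter you are running, and to check whether a given name is reserved, is the built-in keyword module; a minimal sketch:
import keyword
# print every reserved word recognised by this Python version
print(keyword.kwlist)
# check whether specific names are keywords
print(keyword.iskeyword("while"))   # True
print(keyword.iskeyword("count"))   # False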
Python Variables
Python defines four types of variables: local, global, instance, and class variables. Local
variables are created within functions and can only be accessed there. Global variables
are defined outside of any function and can be used throughout the program.
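A minimal sketch illustrating the four kinds of variables (the names below are made up purely for illustration):
count = 0                      # global variable, defined at module level
class Counter:
    total = 0                  # class variable, shared by all Counter objects
    def __init__(self, start):
        self.value = start     # instance variable, unique to each object
    def increment(self):
        step = 1               # local variable, exists only inside this method
        self.value += step
c = Counter(10)
c.increment()
print(count, Counter.total, c.value)   # 0 0 11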
If condition
The if statement runs its indented block only when the condition evaluates to True. For example, assuming a and b hold values for which a < b is True:
if a < b:
    print("a and b are not really numbers")
# Output
>>> a and b are not really numbers
For more on the if condition, refer
to: https://docs.python.org/3/tutorial/controlflow.html
Exercise
Given a = 10, b = 10, and c = 10.00, use if condition to print the following:
if a is equal to b and b is equal to c, print the message "a, b and c are all similar"
if location of a is same as location of b is same as the location of c, print
message "a, b and c are all referring to the same object"
# data
a = 10
b = 10
c = 10.00
Solution code
if a==b and b==c: # two conditions using and operator
    print("a, b and c are all similar")
if id(a)==id(b)==id(c): # two conditions without using and operator
    print("a, b and c are all referring to the same object")
If-elif-else
An advanced approach to decision making is when we chain multiple if conditions,
so as to perform different operations based on different conditions being satisfied.
Here we use an 'if...elif...else' construct. 'elif' is short for else-if. It is written as:
Syntax:
if condition:
    statement(s)
elif condition:
    statement(s)
.
.
.
else:
    statement(s)
Ref: https://docs.python.org/3/reference/compound_stmts.html#if
Exercise
Given variable a, which can take any integral value, write multiple conditions to
determine whether a is:
an even number
a negative number
or 1
if not any of the above, a positive odd number other than 1
a=1
Solution code
if a == 1:
    print("a is equal to 1")
elif (a%2) == 0:
    print("a is an even number")
elif a < 0:
    print("a is a negative number")
else:
    print("a is a positive odd number other than 1")
Loops
In general, statements are executed sequentially: The first statement in a function is
executed first, followed by the second, and so on.
There may be a situation when you need to execute a block of code several number
of times. Programming languages provide various control structures that allow for
more complicated execution paths.
A loop statement allows us to execute a statement or group of statements multiple
times.
The Python programming language provides the following types of loops to handle looping
requirements.
While Loop
Repeats a statement or group of statements while a given condition is TRUE. It tests
the condition before executing the loop body.
Note that Python works with indentation and no braces are necessary. The next line
of code following the loops or conditions or function blocks should have an
indentation of two spaces or four spaces.
Syntax:
while condition:
    statements
Example:
i = 1
while i < 5:
    i += 1
Example:
a = 1
# iterating 9 times, printing the integers 2-10
while a < 10:
    a += 1
    print(a)
# Output
>>> 2
3
4
5
6
7
8
9
10
For Loop
Executes a sequence of statements multiple times and abbreviates the code that
manages the loop variable.
The for loop works in a similar way to the while loop. Pay particular attention to the
syntax here: the 'in' operator is used to drive the loop.
Syntax:
range(1, 10) -- generates the integers from 1 to 9 (the stop value is excluded)
for i in range(1, 10):
    statements
Example:
# iterating 10 times to print the integers 1-10
for i in range(1, 11):
    print(i)
# Output
>>> 1
2
3
4
5
6
7
8
9
10
Ref: https://docs.python.org/3/tutorial/controlflow.html
Exercise
Given:
array = [1,'a',2,'b','3',4]
array2 = []
1. Use a for loop to add the values of array to array2 in reverse order
2. Empty out array
3. Use a while loop to add the values from array2 back to array in reverse order
# Write Your solution here
array = [1,'a',2,'b','3',4]
array2 = []
Solution Code
array = [1,'a',2,'b','3',4]
array2 = []
for i in range(len(array)-1, -1, -1):
    array2.append(array[i])
print(array2)
array = []
i = len(array2)-1
while 0 <= i:
    array.append(array2[i])
    i -= 1
print(array)
Comprehension
Comprehension expressions are described as one-line loops in Python. These
comprehensions help in reducing the verbosity of simple loops.
Syntax
a = [do_something(i) for i in array if condition]   # the if condition is optional
Example
a = [1,2,3]
# Adding 1 to each element in a
a = [i+1 for i in a]
# Output
>>> [2, 3, 4]
Exercise
arr = ['1', 2, '4', '5', 65,'100']
sum_var = 0
Convert all the string elements to integers, sum them up, and store the result in sum_var.
sum() is used to add all the elements in the array.
# Write your Solution here
arr = ['1', 2, '4', '5', 65,'100']
sum_var = 0
Solution
arr = ['1', 2, '4', '5', 65,'100']
sum_var = 0
sum_var = sum([int(i) for i in arr if type(i) is str])
sum_var
Loop Control
Loop control statements change execution from its normal sequence. When execution
leaves a scope, all automatic objects that were created in that scope are destroyed.
In Python there are ways to restrict/limit the number of iterations a for/while
loop goes through. Essentially, these statements are designed to manipulate the control flow of
a loop.
Break
Terminates the loop statement and transfers execution to the statement immediately
following the loop.
Syntax:
while condition:
    if condition:
        break
for i in range(...):
    if condition:
        break
Example:
# The loop breaks when it reaches 'h'
for letter in 'Python': # First Example
    if letter == 'h':
        break
    print('Current Letter:', letter)
# Output
>>> Current Letter: P
Current Letter: y
Current Letter: t
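The code that produced the second output shown below is missing from the notes; a minimal reconstruction, assuming a countdown while loop that breaks at 5, is:
var = 10
# the loop breaks once var reaches 5
while var > 0:
    print('Current variable value:', var)
    var = var - 1
    if var == 5:
        break
print("Good bye!")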
# Output
>>> Current variable value: 10
Current variable value: 9
Current variable value: 8
Current variable value: 7
Current variable value: 6
Good bye!
Continue
Causes the loop to skip the remainder of its body and immediately retest its condition
prior to reiterating.
Syntax:
while condition:
    if condition:
        continue
for i in range(...):
    if condition:
        continue
Example:
# The loop skips 'h'
for letter in 'Python': # First Example
    if letter == 'h':
        continue
    print(letter)
# Output
>>> P
y
t
o
n
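As above, the code for the while-loop output below is not in the notes; a minimal reconstruction, assuming a countdown that skips the value 5, is:
var = 10
# the value 5 is skipped because of continue
while var > 0:
    var = var - 1
    if var == 5:
        continue
    print('Current variable value:', var)
print("Good bye!")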
# Output
>>> Current variable value: 9
Current variable value: 8
Current variable value: 7
Current variable value: 6
Current variable value: 4
Current variable value: 3
Current variable value: 2
Current variable value: 1
Current variable value: 0
Good bye!
Pass
The pass statement in Python is used when a statement is required syntactically but
you do not want any command or code to execute.
# Prints the sentence 'This is pass block' when it reaches 'h'
for letter in 'Python':
    if letter == 'h':
        pass
        print('This is pass block')
    print('Current Letter :', letter)
print("Good bye!")
Python Data Structure
Python data structures are essentially containers for different kinds of data. The four
main types are lists, sets, tuples and dictionaries.
Lists
Lists are a type of data structure containing an ordered collection of items. They are
crucial to executing projects in Python.
Every item contained within a list has an inherent order used to identify them, which
remains consistent throughout the life of the list. Lists are mutable, allowing elements
to be searched, added, moved and deleted after creation. Lists can also be nested,
allowing them to contain any object, including other lists and sublists.
Tuples
A tuple provides much of the same functionality as a list, albeit with limited
functionality. The primary difference between the two is that a tuple is immutable,
meaning it cannot be modified after creation. Tuples are best when a user intends to keep
an object intact throughout its lifetime to prevent the modification or addition of data.
Sets
A set is a collection of unique elements with no defined order, which are utilized when
an object only needs to exist within a collection of objects and its order or number of
appearances are not important.
Dictionaries
Dictionaries are mutable objects that consist of key-value pairs and are
accessible through unique keys in the dictionary.
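A short sketch contrasting the four structures (the values are made up for illustration):
fruits = ["apple", "banana", "cherry"]   # list: ordered and mutable
fruits.append("mango")
point = (3, 4)                           # tuple: ordered but immutable
# point[0] = 5 would raise a TypeError
unique_ids = {1, 2, 2, 3}                # set: unordered, duplicates removed
print(unique_ids)                        # {1, 2, 3}
ages = {"John": 36, "Anna": 28}          # dictionary: key-value pairs
print(ages["John"])                      # 36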
Here are a few examples to help understand the datetime module in Python properly.
Let’s begin.
Example 1
Here, this example shows how to get the current date and time using datetime in Python:
# importing the datetime class
from datetime import datetime
# calling the now() function of the datetime class
Now = datetime.now()
print("Now the date and time are", Now)
The output is the current date and time.
Example 2
Here is the second example. The aim is to count the difference between two different
datetimes.
#Importing the datetime class
from datetime import datetime
#Initializing the first date and time
time1 = datetime(year=2020, month=5, day=9, hour=4, minute=33, second=6)
#Initializing the second date and time
time2 = datetime(year=2021, month=7, day=4, hour=7, minute=55, second=4)
#Calculating and printing the time difference between two given date and times
time_difference = time2 - time1
print("The time difference between the two times is", time_difference)
And the output is:
The time difference between the two times is 421 days, 3:21:58
Function or Method Overloading:
Two or more methods have the same name but different numbers of parameters or
different types of parameters, or both. These methods are called overloaded methods
and this is called method overloading.
Unlike some other languages (for example, C++, which supports method overloading),
Python does not support method overloading by default. But there are different ways to achieve
method overloading in Python.
The problem with method overloading in Python is that we may overload the methods
but can only use the latest defined method.
Example :-
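The example code itself is missing from the notes; a minimal sketch of the usual demonstration (the product() function is illustrative) that yields the output 100 shown below:
def product(a, b):
    p = a * b
    print(p)
def product(a, b, c):   # this definition replaces the previous one
    p = a * b * c
    print(p)
# product(4, 5) would now fail; only the latest, 3-argument version exists
product(4, 5, 5)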
Output :- 100
Operator Overloading in Python
The same operator behaves differently with different types: + adds two integers but concatenates two strings, and * multiplies integers but repeats a string.
print(1 + 2)
print("Geeks" + "For")
print(3 * 4)
print("Geeks" * 4)
Output
3
GeeksFor
12
GeeksGeeksGeeksGeeks
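Operator overloading also works for user-defined classes by implementing special methods such as __add__; a minimal sketch (the Point class is made up for illustration):
class Point:
    def __init__(self, x, y):
        self.x = x
        self.y = y
    def __add__(self, other):
        # '+' applied to two Point objects now returns a new Point
        return Point(self.x + other.x, self.y + other.y)
p = Point(1, 2) + Point(3, 4)
print(p.x, p.y)   # 4 6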
Python Classes/Objects
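The class definition itself is not shown in the notes; a minimal sketch, assuming MyClass simply holds a property x:
class MyClass:
    x = 5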
Create Object
Now we can use the class named MyClass to create objects:
Example
Create an object named p1, and print the value of x:
p1 = MyClass()
print(p1.x)
Example
Create a class named Person, use the __init__() function to assign values for name
and age:
class Person:
    def __init__(self, name, age):
        self.name = name
        self.age = age

p1 = Person("John", 36)
print(p1.name)
print(p1.age)
UNIT -2
Working with Data in Python
Open File
Reading files with Open
Example
f = open("demofile.txt", "r")
print(f.read())
Python provides inbuilt functions for creating, writing and reading files. There are
two types of files that can be handled in python, normal text files and binary files
(written in binary language, 0s and 1s).
Text files: In this type of file, Each line of text is terminated with a special
character called EOL (End of Line), which is the new line character (‘\n’) in
python by default.
Binary files: In this type of file, there is no terminator for a line and the data
is stored after converting it into machine-understandable binary language.
Access mode
Access modes govern the type of operations possible in the opened file. It refers to
how the file will be used once it’s opened. These modes also define the location of
the File Handle in the file. File handle is like a cursor, which defines from where the
data has to be read or written in the file. Some of the different access modes for opening a file are
–
1. Write Only (‘w’) : Open the file for writing. For an existing file, the data is
truncated and over-written. The handle is positioned at the beginning of the
file. Creates the file if the file does not exist.
2. Write and Read (‘w+’) : Open the file for reading and writing. For an existing
file, data is truncated and over-written. The handle is positioned at the
beginning of the file.
3. Append Only (‘a’) : Open the file for writing. The file is created if it does not
exist. The handle is positioned at the end of the file. The data being written
will be inserted at the end, after the existing data.
Opening a File
It is done using the open() function. No module is required to be imported for this
function. Syntax:
file_object = open("file_name", "access_mode")
Reading and writing CSV files is commonly done with the pandas library instead:
(1) Read a csv file into a pandas dataframe 'mydf':
<<your code comes here>> = pd.read_csv("<<your csv file name comes here>>", index_col=0)
(2) Use describe() function of pandas dataframe to see the data in this 'mydf' dataframe.
(3) Write a dataframe back to a csv file:
df.to_csv('your_file_name.csv', index=False)
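A self-contained sketch of the same pandas workflow (the file names are placeholders, not files from the notes):
import pandas as pd
# read a CSV file into a dataframe, using the first column as the index
mydf = pd.read_csv("my_data.csv", index_col=0)
# summary statistics (count, mean, std, min, quartiles, max) of the numeric columns
print(mydf.describe())
# write the dataframe back out without the index column
mydf.to_csv("my_data_copy.csv", index=False)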
Array oriented Programming with Numpy
Array Programming provides a powerful, compact and expressive syntax for
accessing, manipulating and operating on data in vectors, matrices and higher-
dimensional arrays. NumPy is the primary array programming library for the
Python language.
NumPy stands for Numerical Python. It is a Python library used for working with an
array. In Python, we use the list for the array but it’s slow to process. NumPy array is
a powerful N-dimensional array object and is used in linear algebra, Fourier
transform, and random number capabilities. It provides an array object much faster
than traditional Python lists.
Types of Array:
1. One Dimensional Array
2. Multi-Dimensional Array
One Dimensional Array:
A one-dimensional array is a type of linear array.
Example:
# importing numpy and creating a list
import numpy as np
list1 = [1, 2, 3, 4]
# converting the list into a one-dimensional numpy array
arr = np.array(list1)
print(arr)
Output:
[1 2 3 4]
Cleaning and preparing data is a crucial part of data analysis. The goal is to transform
raw data into a format suitable for analysis. This often involves identifying and
resolving missing values, outliers, and data inconsistencies.
The following Python example illustrates how data can be cleaned and prepared.
import numpy as np
import pandas as pd
# the raw sales data is assumed to have been loaded into a dataframe, e.g.
# df = pd.read_csv("sales_data.csv")
# handling missing values: fill missing quantities with the column median
median_qty = df["quantity"].median()
df["quantity"].fillna(median_qty, inplace=True)
df.head()
3. Handling Outliers
Outliers are values that are significantly different from the rest of the data. Outliers
can negatively affect analyses, and they should be identified and addressed. A
common method of handling outliers is to remove them, but in some cases, you may
want to keep them if they represent legitimate data points.
Assume that the "revenue" column contains some outliers which we would like to
remove using the interquartile range (IQR).
Q1 = df["revenue"].quantile(0.25)
Q3 = df["revenue"].quantile(0.75)
IQR = Q3 - Q1
df = df[~((df["revenue"] < (Q1 - 1.5 * IQR)) | (df["revenue"] > (Q3 + 1.5 * IQR)))]
Suppose the "state" column contains some inconsistencies that we wish to resolve by
mapping different values to the same category.
state_map = {
"New York": "NY",
"California": "CA",
"Texas": "TX"
}
df["state"] = df["state"].replace(state_map)
df.to_csv("/Users/rafael/Desktop/General/cleaned_sales_data.csv", index=False)
Python provides various libraries for visualizing data. These libraries come with
different features and can support various types of graphs. We use the following four
such libraries.
Matplotlib
Seaborn
Bokeh
Plotly
Matplotlib vs Seaborn:
Matplotlib mainly works with datasets and arrays; it is more customizable and pairs well
with Pandas and Numpy for Exploratory Data Analysis.
Seaborn works with entire datasets; it has more inbuilt themes and is mainly used for
statistical analysis.
Example :-
Scatter Plot
Scatter plots are used to observe relationships between variables and use dots to
represent the relationship between them. The scatter() method in the matplotlib library
is used to draw a scatter plot.
Example:
# import modules
import pandas as pd
import matplotlib.pyplot as plt
# a small illustrative dataset (values made up for this example)
df = pd.DataFrame({"hours": [1, 2, 3, 4, 5],
                   "marks": [35, 48, 55, 70, 82]})
# display dataset
print(df)
# scatter plot of the relationship between the two columns
plt.scatter(df["hours"], df["marks"])
plt.xlabel("hours")
plt.ylabel("marks")
plt.show()
Output:
A scatter plot with one dot per row, with marks increasing as hours increase.
UNIT -3
Machine Learning and Cognitive Intelligence
Classification
Classification deals with predicting categorical target variables, which represent
discrete classes or labels. For instance, classifying emails as spam or not spam, or
predicting whether a patient has a high risk of heart disease. Classification algorithms
learn to map the input features to one of the predefined classes.
Here are some classification algorithms:
Logistic Regression
Support Vector Machine
Random Forest
Decision Tree
K-Nearest Neighbors (KNN)
Naive Bayes
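A minimal scikit-learn sketch of training one of these classifiers (a decision tree) on the built-in iris dataset; the dataset and the 80/20 split are chosen only for illustration:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
clf = DecisionTreeClassifier(random_state=0)   # learns to map features to one of the classes
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))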
Regression
Regression, on the other hand, deals with predicting continuous target variables,
which represent numerical values. For example, predicting the price of a house based
on its size, location, and amenities, or forecasting the sales of a product. Regression
algorithms learn to map the input features to a continuous numerical value.
Here are some regression algorithms:
Linear Regression
Polynomial Regression
Ridge Regression
Lasso Regression
Decision tree
Random Forest
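A minimal sketch of fitting a linear regression to predict a continuous value, using made-up house-size data:
from sklearn.linear_model import LinearRegression
# made-up training data: house size in square feet -> price
sizes = [[650], [800], [1200], [1500], [2000]]
prices = [70000, 85000, 120000, 150000, 195000]
model = LinearRegression()
model.fit(sizes, prices)
# predict the price of a 1000 sq ft house
print(model.predict([[1000]]))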
Advantages of Supervised Machine Learning
Supervised Learning models can have high accuracy as they are trained
on labelled data.
The process of decision-making in supervised learning models is often
interpretable.
It can often be used in pre-trained models which saves time and resources
when developing new models from scratch.
Disadvantages of Supervised Machine Learning
It has limitations in knowing patterns and may struggle with unseen or
unexpected patterns that are not present in the training data.
It can be time-consuming and costly as it relies on labeled data only.
It may lead to poor generalizations based on new data.
Applications of Supervised Learning
Supervised learning is used in a wide variety of applications, including:
Image classification: Identify objects, faces, and other features in images.
Natural language processing: Extract information from text, such as
sentiment, entities, and relationships.
Speech recognition: Convert spoken language into text.
Recommendation systems: Make personalized recommendations to users.
Predictive analytics: Predict outcomes, such as sales, customer churn, and
stock prices.
Medical diagnosis: Detect diseases and other medical conditions.
Fraud detection: Identify fraudulent transactions.
Autonomous vehicles: Recognize and respond to objects in the environment.
Email spam detection: Classify emails as spam or not spam.
Quality control in manufacturing: Inspect products for defects.
Credit scoring: Assess the risk of a borrower defaulting on a loan.
Gaming: Recognize characters, analyze player behavior, and create NPCs.
Customer support: Automate customer support tasks.
Weather forecasting: Make predictions for temperature, precipitation, and
other meteorological parameters.
Sports analytics: Analyze player performance, make game predictions, and
optimize strategies.
2. Unsupervised Machine Learning
Unsupervised learning is a type of machine learning
technique in which an algorithm discovers patterns and relationships using unlabeled
data. Unlike supervised learning, unsupervised learning doesn’t involve providing the
algorithm with labeled target outputs. The primary goal of Unsupervised learning is
often to discover hidden patterns, similarities, or clusters within the data, which can
then be used for various purposes, such as data exploration, visualization,
dimensionality reduction, and more.
There are two main categories of unsupervised learning that are mentioned below:
Clustering
Association
Clustering
Clustering is the process of grouping data points into clusters based on their similarity.
This technique is useful for identifying patterns and relationships in data without the
need for labeled examples.
Here are some clustering algorithms:
K-Means Clustering algorithm
Mean-shift algorithm
DBSCAN Algorithm
Principal Component Analysis
Independent Component Analysis
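A minimal K-Means sketch on a tiny made-up dataset with two obvious groups:
import numpy as np
from sklearn.cluster import KMeans
X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0]])
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
kmeans.fit(X)
print(kmeans.labels_)           # cluster assigned to each point
print(kmeans.cluster_centers_)  # coordinates of the two centroids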
Association
Association rule learning is a technique for discovering relationships between items
in a dataset. It identifies rules that indicate the presence of one item implies the
presence of another item with a specific probability.
Here are some association rule learning algorithms:
Apriori Algorithm
Eclat
FP-growth Algorithm
Advantages of Unsupervised Machine Learning
It helps to discover hidden patterns and various relationships between the data.
Used for tasks such as customer segmentation, anomaly detection, and data
exploration.
It does not require labeled data and reduces the effort of data labeling.
Disadvantages of Unsupervised Machine Learning
Without using labels, it may be difficult to predict the quality of the model’s
output.
Cluster Interpretability may not be clear and may not have meaningful
interpretations.
It has techniques such as autoencoders and dimensionality reduction that can
be used to extract meaningful features from raw data.
Applications of Unsupervised Learning
Here are some common applications of unsupervised learning:
Clustering: Group similar data points into clusters.
Anomaly detection: Identify outliers or anomalies in data.
Dimensionality reduction: Reduce the dimensionality of data while
preserving its essential information.
Recommendation systems: Suggest products, movies, or content to users
based on their historical behavior or preferences.
Topic modeling: Discover latent topics within a collection of documents.
Density estimation: Estimate the probability density function of data.
Image and video compression: Reduce the amount of storage required for
multimedia content.
Data preprocessing: Help with data preprocessing tasks such as data cleaning,
imputation of missing values, and data scaling.
Market basket analysis: Discover associations between products.
Genomic data analysis: Identify patterns or group genes with similar
expression profiles.
Image segmentation: Segment images into meaningful regions.
Community detection in social networks: Identify communities or groups of
individuals with similar interests or connections.
Customer behavior analysis: Uncover patterns and insights for better
marketing and product recommendations.
Content recommendation: Classify and tag content to make it easier to
recommend similar items to users.
Exploratory data analysis (EDA): Explore data and gain insights before
defining specific tasks.
3. Reinforcement Machine Learning
Reinforcement machine learning algorithm is a learning method that interacts with
the environment by producing actions and discovering errors. Trial, error, and
delay are the most relevant characteristics of reinforcement learning. In this
technique, the model keeps on increasing its performance using Reward Feedback to
learn the behavior or pattern. These algorithms are specific to a particular problem
e.g. Google's self-driving car, or AlphaGo, where a bot competes with humans and even
itself to get better and better at the game of Go. Each time we feed in data, they
learn and add the data to their knowledge, which becomes training data. So, the more it learns
the better it gets trained and hence the more experienced it becomes.
Here are some of most common reinforcement learning algorithms:
Q-learning: Q-learning is a model-free RL algorithm that learns a Q-function,
which maps states to actions. The Q-function estimates the expected reward of
taking a particular action in a given state.
SARSA (State-Action-Reward-State-Action): SARSA is another model-
free RL algorithm that learns a Q-function. However, unlike Q-learning,
SARSA updates the Q-function for the action that was actually taken, rather
than the optimal action.
Deep Q-learning: Deep Q-learning is a combination of Q-learning and deep
learning. Deep Q-learning uses a neural network to represent the Q-function,
which allows it to learn complex relationships between states and actions.
The overall process of finding and interpreting patterns from data involves the repeated
application of the following steps:
1. Developing an understanding of
o the application domain
o the relevant prior knowledge
o the goals of the end-user
2. Creating a target data set: selecting a data set, or focusing on a subset of
variables, or data samples, on which discovery is to be performed.
3. Data cleaning and preprocessing.
o Removal of noise or outliers.
o Collecting necessary information to model or account for noise.
o Strategies for handling missing data fields.
o Accounting for time sequence information and known changes.
4. Data reduction and projection.
o Finding useful features to represent the data depending on the goal of
the task.
o Using dimensionality reduction or transformation methods to reduce
the effective number of variables under consideration or to find
invariant representations for the data.
5. Choosing the data mining task.
o Deciding whether the goal of the KDD process is classification,
regression, clustering, etc.
6. Choosing the data mining algorithm(s).
o Selecting method(s) to be used for searching for patterns in the data.
o Deciding which models and parameters may be appropriate.
o Matching a particular data mining method with the overall criteria of
the KDD process.
7. Data mining.
o Searching for patterns of interest in a particular representational form
or a set of such representations as classification rules or trees,
regression, clustering, and so forth.
8. Interpreting mined patterns.
9. Consolidating discovered knowledge.
CRISP-DM
CRISP-DM stands for CRoss-Industry Standard Process for Data Mining. The CRISP-DM
methodology provides a structured approach to planning a data mining project. It is a
robust and well-proven methodology. We do not claim any ownership over it. We did
not invent it. We are however evangelists of its powerful practicality, its flexibility
and its usefulness when using analytics to solve thorny business issues. It is the golden
thread that runs through almost every client engagement. The phases of the CRISP-DM model are
listed below.
This model is an idealised sequence of events. In practice many of the tasks can be
performed in a different order and it will often be necessary to backtrack to previous
tasks and repeat certain actions. The model does not try to capture all possible routes
through the data mining process.
You can jump to more information about each phase of the process here:
1. Business understanding
2. Data understanding
3. Data preparation
4. Modeling
5. Evaluation
6. Deployment
What are the 6 CRISP-DM Phases?
I. Business Understanding
Any good project starts with a deep understanding of the customer’s needs. Data
mining projects are no exception and CRISP-DM recognizes this.
The Business Understanding phase focuses on understanding the objectives and
requirements of the project. Aside from the third task, the three other tasks in this
phase are foundational project management activities that are universal to most
projects:
1. Determine business objectives: You should first “thoroughly understand,
from a business perspective, what the customer really wants to accomplish.”
(CRISP-DM Guide) and then define business success criteria.
2. Assess situation: Determine resources availability, project requirements,
assess risks and contingencies, and conduct a cost-benefit analysis.
3. Determine data mining goals: In addition to defining the business objectives,
you should also define what success looks like from a technical data mining
perspective.
4. Produce project plan: Select technologies and tools and define detailed plans
for each project phase.
While many teams hurry through this phase, establishing a strong business
understanding is like building the foundation of a house – absolutely essential.
II. Data Understanding
Next is the Data Understanding phase. Adding to the foundation of Business
Understanding, it drives the focus to identify, collect, and analyze the data sets that
can help you accomplish the project goals. This phase also has four tasks:
1. Collect initial data: Acquire the necessary data and (if necessary) load it into
your analysis tool.
2. Describe data: Examine the data and document its surface properties like data
format, number of records, or field identities.
3. Explore data: Dig deeper into the data. Query it, visualize it, and identify
relationships among the data.
4. Verify data quality: How clean/dirty is the data? Document any quality
issues.
III. Data Preparation
A common rule of thumb is that 80% of the project is data preparation.
This phase, which is often referred to as “data munging”, prepares the final data set(s)
for modeling. It has five tasks:
1. Select data: Determine which data sets will be used and document reasons for
inclusion/exclusion.
2. Clean data: Often this is the lengthiest task. Without it, you’ll likely fall
victim to garbage-in, garbage-out. A common practice during this task is to
correct, impute, or remove erroneous values.
3. Construct data: Derive new attributes that will be helpful. For example,
derive someone’s body mass index from height and weight fields.
4. Integrate data: Create new data sets by combining data from multiple
sources.
5. Format data: Re-format data as necessary. For example, you might convert
string values that store numbers to numeric values so that you can perform
mathematical operations.
IV. Modeling
What is widely regarded as data science’s most exciting work is also often the shortest
phase of the project. Here you’ll likely build and assess various models based on
several different modeling techniques. This phase has four tasks:
1. Select modeling techniques: Determine which algorithms to try (e.g.
regression, neural net).
2. Generate test design: Depending on your modeling approach, you might need to
split the data into training, test, and validation sets.
3. Build model: As glamorous as this might sound, this might just be executing
a few lines of code like “reg = LinearRegression().fit(X, y)”.
4. Assess model: Generally, multiple models are competing against each other,
and the data scientist needs to interpret the model results based on domain
knowledge, the pre-defined success criteria, and the test design.
Although the CRISP-DM Guide suggests to “iterate model building and assessment
until you strongly believe that you have found the best model(s)”, in practice teams
should continue iterating until they find a “good enough” model, proceed through the
CRISP-DM lifecycle, then further improve the model in future iterations.
V. Evaluation
Whereas the Assess Model task of the Modeling phase focuses on technical model
assessment, the Evaluation phase looks more broadly at which model best meets the
business needs and what to do next. This phase has three tasks:
1. Evaluate results: Do the models meet the business success criteria? Which
one(s) should we approve for the business?
2. Review process: Review the work accomplished. Was anything overlooked?
Were all steps properly executed? Summarize findings and correct anything if
needed.
3. Determine next steps: Based on the previous three tasks, determine whether
to proceed to deployment, iterate further, or initiate new projects.
VI. Deployment
“Depending on the requirements, the deployment phase can be as simple as
generating a report or as complex as implementing a repeatable data mining process
across the enterprise.”
-CRISP-DM Guide
A model is not particularly useful unless the customer can access its results. The
complexity of this phase varies widely. This final phase has four tasks:
1. Plan deployment: Develop and document a plan for deploying the model
2. Plan monitoring and maintenance: Develop a thorough monitoring and
maintenance plan to avoid issues during the operational phase (or post-project
phase) of a model
3. Produce final report: The project team documents a summary of the project
which might include a final presentation of data mining results.
4. Review project: Conduct a project retrospective about what went well, what
could have been better, and how to improve in the future.
SEMMA
Why SEMMA?
Businesses use the SEMMA methodology on their data mining and machine learning
projects to achieve a competitive advantage, improve performance, and deliver more
useful services to customers. The data we collect about our surroundings serve as the
foundation for hypotheses and models of the world we live in.
Ultimately, data is accumulated to help in collecting knowledge. That means the data
is not worth much until it is studied and analyzed. But hoarding vast volumes of data
is not equivalent to gathering valuable knowledge. It is only when data is sorted and
evaluated that we learn anything from it.
Thus, SEMMA is designed as a data science methodology to help practitioners
convert data into knowledge.
The 5 Stages Of SEMMA
SEMMA is promoted by SAS as an organized, functional toolset associated with their
SAS Enterprise Miner initiative. While it is true that the SEMMA process is less familiar
to those not using that tool, most regard it as a general data mining methodology rather
than a specific tool.
The process breaks down into its own set of stages. These include:
Sample: This step entails choosing a subset of the appropriate volume dataset
from a vast dataset that has been given for the model’s construction. The goal
of this initial stage of the process is to identify variables or factors (both
dependent and independent) influencing the process. The collected
information is then sorted into preparation and validation categories.
Explore: During this step, univariate and multivariate analysis is conducted in
order to study interconnected relationships between data elements and to
identify gaps in the data. While the multivariate analysis studies the
relationship between variables, the univariate one looks at each factor
individually to understand its part in the overall scheme. All of the influencing
factors that may influence the study’s outcome are analyzed, with heavy
reliance on data visualization.
Modify: In this step, lessons learned in the exploration phase from the data
collected in the sample phase are derived with the application of business
logic. In other words, the data is parsed and cleaned, being then passed onto
the modeling stage, and explored if the data requires refinement and
transformation.
Model: With the variables refined and data cleaned, the modeling step applies
a variety of data mining techniques in order to produce a projected model of
how this data achieves the final, desired outcome of the process.
Assess: In this final SEMMA stage, the model is evaluated for how useful and
reliable it is for the studied topic. The data can now be tested and used to
estimate the efficacy of its performance.
1) Matplotlib
Matplotlib is an interactive, cross-platform library for two-dimensional plotting. It
can produce high-quality graphs, charts and plots in several hardcopy formats.
Advantages:
Flexible usage: supports both Python and IPython shells, Python scripts,
Jupyter Notebook, web application servers and many GUI toolkits (GTK+,
Tkinter, Qt, and wxPython).
Optionally provides a MATLAB-like interface for simple plotting.
The object-oriented interface gives complete control of axes properties, font
properties, line styles, etc.
Compatible with several graphics backends and operating systems.
Matplotlib is frequently incorporated in other libraries, such as Pandas.
2)
Natural Language Toolkit (NLTK)
NLTK is a framework and suite of libraries for developing both symbolic and
statistical Natural Language Processing (NLP) in Python. It is the standard tool for
NLP in Python.
Advantages:
3) Pandas
4) Scikit-learn
The Python library, Scikit-Learn, is built on top of the matplotlib, NumPy, and SciPy
libraries. This Python ML library has several tools for data analysis and data mining
tasks.
Advantages:
5) Seaborn
7) Keras
Keras is a very popular ML library for Python, providing a high-level neural network API
capable of running on top of TensorFlow, CNTK, or Theano.
Advantages:
8) SciPy
SciPy is a very popular ML library with different modules for optimization, linear
algebra, integration and statistics.
Advantages:
9) Pytorch
Contains tools and libraries that support Computer Vision, NLP , Deep
Learning, and many other ML programs.
Developers can perform computations on Tensors with GPU acceleration.
Helps in creating computational graphs.
Modeling process is simple and transparent.
The default “define-by-run” mode is more like traditional programming.
Uses common debugging tools such as pdb, ipdb or PyCharm debugger.
Uses a lot of pre-trained models and modular parts that are easy to combine.
10) TensorFlow
Numpy
Scipy
Scikit-learn
Theano
TensorFlow
Keras
PyTorch
Pandas
Matplotlib
Introduction to classification
Before diving into the classification concept, we will first understand the difference
between the two types of learners in classification: lazy and eager learners. Then we
will clarify the misconception between classification and regression.
Lazy Learners Vs. Eager Learners
There are two types of learners in machine learning classification: lazy and eager
learners.
Eager learners are machine learning algorithms that first build a model from the
training dataset before making any prediction on future datasets. They spend more
time during the training process because they learn the model weights up front to achieve
better generalization, but they require less time to make predictions.
Most machine learning algorithms are eager learners, and below are some examples:
Logistic Regression.
Support Vector Machine.
Decision Trees.
Artificial Neural Networks.
Lazy learners or instance-based learners, on the other hand, do not create any
model immediately from the training data, and this is where the lazy aspect comes
from. They just memorize the training data, and each time there is a need to make a
prediction, they search for the nearest neighbor from the whole training data, which
makes them very slow during prediction. Some examples of this kind are:
K-Nearest Neighbor.
Case-based reasoning.
Regression Metrics
Machine learning is an effective tool for predicting numerical values, and
regression is one of its key applications. In the arena of regression analysis,
accurate estimation is crucial for measuring the overall performance of
predictive models. This is where the famous machine learning library
Python Scikit-Learn comes in. Scikit-Learn gives a complete set of
regression metrics to evaluate the quality of regression models.
In this section, we explore the basics of regression metrics in
scikit-learn, discuss the steps needed to use them effectively, provide some
examples, and show the desired output for each metric.
Regression
Regression models are algorithms used to predict continuous numerical
values based on input features. In scikit-learn, we can use
numerous regression algorithms, such as Linear Regression, Decision Trees,
Random Forests, and Support Vector Machines (SVM), amongst others.
Before learning about specific metrics, let’s familiarize ourselves with a few
essential concepts related to regression metrics:
1. True Values and Predicted Values:
In regression, we have two sets of values to compare: the actual target
values (true values) and the values predicted by our model (predicted
values). The performance of the model is assessed by
measuring the similarity between these sets.
2. Evaluation Metrics:
Regression metrics are quantitative measures used to evaluate the quality of a
regression model. Scikit-learn provides several metrics, each with its own
strengths and limitations, to assess how well a model fits the data.
Types of Regression Metrics
Some common regression metrics in scikit-learn with examples
Mean Absolute Error (MAE)
Mean Squared Error (MSE)
R-squared (R²) Score
Root Mean Squared Error (RMSE)
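A minimal sketch computing these four metrics with scikit-learn on made-up true and predicted values:
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
y_true = [3.0, -0.5, 2.0, 7.0]   # actual target values (made up)
y_pred = [2.5, 0.0, 2.0, 8.0]    # values predicted by some model
mae = mean_absolute_error(y_true, y_pred)
mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)              # RMSE is simply the square root of MSE
r2 = r2_score(y_true, y_pred)
print("MAE:", mae, "MSE:", mse, "RMSE:", rmse, "R2:", r2)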
Multivariate regression
Multivariate Multiple Regression is a method of modeling multiple
responses, or dependent variables, with a single set of predictor variables.
For example, we might want to model both math and reading SAT scores
as a function of gender, race, parent income, and so forth.
Multivariate Regression
The goal of any data analysis is to extract accurate estimates from raw
information. One of the most important and common questions is whether
there is a statistical relationship between a response variable (Y) and
explanatory variables (Xi). One option to answer this question is to
employ regression analysis in order to model the relationship. Further, the
model can be used to predict the response variable for any arbitrary set of
explanatory variables.
The Problem:
Multivariate Regression is one of the simplest machine learning
algorithms. It comes under the class of Supervised Learning algorithms,
i.e., we are provided with a training dataset. Some of the problems that
can be solved using this model are:
A researcher has collected data on three psychological variables, four
academic variables (standardized test scores), and the type of educational
program the student is in for 600 high school students. She is interested in
how the set of psychological variables is related to the academic variables
and the type of program the student is in.
A doctor has collected data on cholesterol, blood pressure, and
weight. She also collected data on the eating habits of the subjects (e.g.,
how many ounces of red meat, fish, dairy products, and chocolate
consumed per week). She wants to investigate the relationship between
the three measures of health and eating habits.
A property dealer wants to set housing prices based on various
factors such as the size of the house, the number of bedrooms, the age of the house, etc. We shall
discuss the algorithm further using this example.
The Solution:
The solution is divided into various parts.
Selecting the features: Finding the features on which a response variable
depends (or not) is one of the most important steps in Multivariate
Regression. To make our analysis simple, we assume that the features on
which the response variable is dependent are already selected.
Normalizing the features: The features are then scaled in order to bring
them into the range (0, 1) for better analysis. This can be done by
rescaling each feature value, for example as (x − min) / (max − min).
Selecting Hypothesis and Cost function: A hypothesis is a predicted value
of the response variable represented by h(x). Cost function defines the cost
for wrongly predicting hypothesis. It should be as small as possible. We
choose hypothesis function as linear combination of features X.
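Written out in the usual notation (added here for clarity, with x1, ..., xn as the selected features and θ0, ..., θn as the parameters to be learned over m training examples):
h(x) = θ0 + θ1*x1 + θ2*x2 + ... + θn*xn        (hypothesis: linear combination of the features)
J(θ) = (1/2m) * Σ ( h(x(i)) − y(i) )²          (cost: squared error, to be made as small as possible)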
Non-Linear Regression
Non-linear regression models the relationship between the dependent and independent
variables with a non-linear function. The method rests on several assumptions:
Functional Form: The chosen nonlinear model correctly represents the true
relationship between the dependent and independent variables.
Independence: Observations are assumed to be independent of each other.
Homoscedasticity: The variance of the residuals (the differences between
observed and predicted values) is constant across all levels of the
independent variable.
Normality: Residuals are assumed to be normally distributed.
Multicollinearity: Independent variables are not perfectly correlated.
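One common way to fit a non-linear relationship in practice is to expand the features polynomially and then fit a linear model on top; a minimal scikit-learn sketch on made-up, roughly quadratic data:
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
X = np.array([[1], [2], [3], [4], [5]])
y = np.array([1.1, 4.2, 8.8, 16.1, 24.9])
model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
model.fit(X, y)
print(model.predict([[6]]))   # extrapolate along the fitted curve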
Types of Non-Linear Regression
There are two main types of Non Linear regression in Machine Learning:
K-Nearest Neighbour
Now, given another set of data points (also called testing data), allocate these
points to a group by analyzing the training set. Note that the unclassified points
are marked as ‘White’.
Intuition Behind KNN Algorithm
If we plot these points on a graph, we may be able to locate some clusters or
groups. Now, given an unclassified point, we can assign it to a group by
observing what group its nearest neighbors belong to. This means a point close
to a cluster of points classified as ‘Red’ has a higher probability of getting
classified as ‘Red’.
Intuitively, we can see that the first point (2.5, 7) should be classified as ‘Green’,
and the second point (5.5, 4.5) should be classified as ‘Red’.
Why do we need a KNN algorithm?
The K-Nearest Neighbors (K-NN) algorithm is a versatile and widely used machine learning algorithm that
is primarily used for its simplicity and ease of implementation. It does not require
any assumptions about the underlying data distribution. It can also handle both
numerical and categorical data, making it a flexible choice for various types of
datasets in classification and regression tasks. It is a non-parametric method that
makes predictions based on the similarity of data points in a given dataset. K-NN
is less sensitive to outliers compared to other algorithms.
The K-NN algorithm works by finding the K nearest neighbors to a given data
point based on a distance metric, such as Euclidean distance. The class or value
of the data point is then determined by the majority vote or average of the K
neighbors. This approach allows the algorithm to adapt to different patterns and
make predictions based on the local structure of the data.
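A minimal scikit-learn sketch of K-NN classification; the data points are made up so that the two test points from the text fall near the 'Green' and 'Red' groups respectively:
from sklearn.neighbors import KNeighborsClassifier
# made-up training points and their classes: one group at low x / high y, one at high x / low y
X_train = [[1, 7], [2, 8], [3, 6], [6, 3], [7, 4], [8, 2]]
y_train = ["Green", "Green", "Green", "Red", "Red", "Red"]
knn = KNeighborsClassifier(n_neighbors=3)   # look at the 3 nearest neighbours
knn.fit(X_train, y_train)
# classify two new, unlabeled points
print(knn.predict([[2.5, 7], [5.5, 4.5]]))  # expected: ['Green' 'Red']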
Decision Trees
Logistic Regression
Logistic regression is a supervised machine learning algorithm that accomplishes
binary classification tasks by predicting the probability of an outcome, event, or
observation. The model delivers a binary or dichotomous outcome limited to two
possible outcomes: yes/no, 0/1, or true/false.
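A minimal sketch showing how a fitted logistic regression returns a probability that is then thresholded into a binary outcome (the study-hours data is made up):
from sklearn.linear_model import LogisticRegression
# hours studied -> passed the exam (0 = no, 1 = yes)
X = [[1], [2], [3], [4], [5], [6], [7], [8]]
y = [0, 0, 0, 0, 1, 1, 1, 1]
clf = LogisticRegression()
clf.fit(X, y)
print(clf.predict_proba([[4.5]]))   # probability of each of the two classes
print(clf.predict([[4.5]]))         # the dichotomous 0/1 outcome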
Clustering
What is Clustering
The task of grouping data points based on their similarity with each other is called
Clustering or Cluster Analysis. This method is defined under the branch
of Unsupervised Learning, which aims at gaining insights from unlabelled data
points, that is, unlike supervised learning we don’t have a target variable.
Clustering aims at forming groups of homogeneous data points from a
heterogeneous dataset. It evaluates the similarity based on a metric like Euclidean
distance, Cosine similarity, Manhattan distance, etc. and then group the points
with highest similarity score together.
Types of Clustering
Broadly speaking, there are 2 types of clustering that can be performed to group
similar data points:
Hard Clustering: In this type of clustering, each data point belongs to a
cluster completely or not. For example, let’s say there are 4 data points and we
have to cluster them into 2 clusters. So each data point will either belong to
cluster 1 or cluster 2.
Soft Clustering: In this type of clustering, instead of assigning each data point
into a separate cluster, a probability or likelihood of that point being that cluster
is evaluated. For example, let’s say there are 4 data points and we have to
cluster them into 2 clusters. So we will be evaluating a probability of a data
point belonging to both clusters. This probability is calculated for all data
points.
Uses of Clustering
Now before we begin with types of clustering algorithms, we will go through the
use cases of Clustering algorithms. Clustering algorithms are majorly used for:
Market Segmentation – Businesses use clustering to group their customers and
use targeted advertisements to attract more audience.
Market Basket Analysis – Shop owners analyze their sales and figure out
which items are majorly bought together by the customers. For example, in the
USA, according to a study, diapers and beer were usually bought together by
fathers.
Social Network Analysis – Social media sites use your data to understand your
browsing behaviour and provide you with targeted friend recommendations or
content recommendations.
Medical Imaging – Doctors use Clustering to find out diseased areas in
diagnostic images like X-rays.
Anomaly Detection – To find outliers in a stream of real-time dataset or
forecasting fraudulent transactions we can use clustering to identify them.
Simplify working with large datasets – Each cluster is given a cluster ID after
clustering is complete. Now, you may reduce an observation's whole feature set
into its cluster ID. Clustering is effective when it can represent a complicated
case with a straightforward cluster ID. Using the same principle, clustering
data can make complex datasets simpler.
Hierarchical clustering
Hierarchical clustering, also known as hierarchical cluster analysis, is an
algorithm that groups similar objects into groups called clusters. The endpoint is
a set of clusters, where each cluster is distinct from each other cluster, and the
objects within each cluster are broadly similar to each other.
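A minimal sketch of agglomerative (bottom-up) hierarchical clustering with scikit-learn on made-up points:
import numpy as np
from sklearn.cluster import AgglomerativeClustering
X = np.array([[1, 1], [1, 2], [2, 1],
              [8, 8], [8, 9], [9, 8]])   # two well-separated groups (made up)
hc = AgglomerativeClustering(n_clusters=2, linkage="ward")
labels = hc.fit_predict(X)
print(labels)   # cluster id assigned to each point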
Partitioning Clustering:-
It is a type of clustering that divides the data into non-hierarchical groups. It is
also known as the centroid-based method. The most common example of
partitioning clustering is the K-Means Clustering algorithm.
In this type, the dataset is divided into a set of k groups, where K is used to define
the number of pre-defined groups. The cluster center is created in such a way that
the distance between the data points of one cluster is minimum as compared to
another cluster centroid.
Unsupervised Learning
In artificial intelligence, machine learning that takes place in the absence of
human supervision is known as unsupervised machine learning. Unsupervised
machine learning models, in contrast to supervised learning, are given unlabeled
data and allowed to discover patterns and insights on their own—without explicit
direction or instruction.
Unsupervised machine learning analyzes and clusters unlabeled datasets using
machine learning algorithms. These algorithms find hidden patterns and data
without any human intervention, i.e., we don’t give output to our model. The
training model has only input parameter values and discovers the groups or
patterns on its own.
Some applications are-