
NOTES OF

305 BA - Machine Learning & Cognitive intelligence


using Python

Prepared By Dr. Rajesh Kumar Hukre

UNIT -1
Python Basics:

Introduction to Python :-
What is Python:-
Python is a popular programming language. It was created by Guido van Rossum, and
released in 1991.
It is used for:
 web development (server-side),
 software development,
 mathematics,
 system scripting.
What can Python do:-
 Python can be used on a server to create web applications.
 Python can be used alongside software to create workflows.
 Python can connect to database systems. It can also read and modify files.
 Python can be used to handle big data and perform complex mathematics.
 Python can be used for rapid prototyping, or for production-ready software
development.
Why Python:-
 Python works on different platforms (Windows, Mac, Linux, Raspberry Pi,
etc).
 Python has a simple syntax similar to the English language.
 Python has syntax that allows developers to write programs with fewer lines
than some other programming languages.
 Python runs on an interpreter system, meaning that code can be executed as
soon as it is written. This means that prototyping can be very quick.
 Python can be treated in a procedural way, an object-oriented way or a
functional way.
Python Features

Python is a dynamic, high-level, free open source, and interpreted programming


language. It supports object-oriented programming as well as procedural-oriented
programming. In Python, we don’t need to declare the type of variable because it is a
dynamically typed language. For example, after x = 10, the name x can later be bound to a
value of any other type, such as a string. In this section we will see what characteristics
describe the Python programming language.
Features in Python
In this section we will look at the features of the Python programming language:
1. Free and Open Source
The Python language is freely available on the official website (python.org). Since it is
open source, the source code is also available to the public, so you can download it, use
it, and share it.
2. Easy to code
Python is a high-level programming language that is very easy to learn compared to
other languages like C, C#, JavaScript, and Java. It is very easy to code in Python, and
anybody can learn the basics in a few hours or days. It is also a developer-friendly
language.
3. Easy to Read
As you will see, learning Python is quite simple. As was already established, Python’s
syntax is really straightforward. The code block is defined by the indentations rather
than by semicolons or brackets.
4. Object-Oriented Language
One of the key features of Python is Object-Oriented programming. Python supports
object-oriented language and concepts of classes, object encapsulation, etc.
5. GUI Programming Support
Graphical User interfaces can be made using a module such as PyQt5, PyQt4,
wxPython, or Tk in Python. PyQt5 is the most popular option for creating graphical
apps with Python.
6. High-Level Language
Python is a high-level language. When we write programs in Python, we do not need
to remember the system architecture, nor do we need to manage the memory.
7. Large Community Support
Python has gained popularity over the years. Our questions are constantly answered
by the enormous StackOverflow community. These websites have already provided
answers to many questions about Python, so Python users can consult them as needed.
8. Easy to Debug
Python provides excellent information for tracing mistakes. Once you understand how to
interpret Python's error traces, you will be able to quickly identify and correct the
majority of your program's issues. Often, simply by glancing at the code, you can
determine what it is designed to do.
9. Python is a Portable language
Python language is also a portable language. For example, if we have Python code for
Windows and if we want to run this code on other platforms such as Linux, Unix, and
Mac then we do not need to change it, we can run this code on any platform.
10. Python is an Integrated language
Python is also an Integrated language because we can easily integrate Python with
other languages like C, C++, etc.
11. Interpreted Language:
Python is an interpreted language because Python code is executed line by line. Unlike
languages such as C, C++, and Java, there is no separate compilation step, which makes it
easier to debug our code. The Python source code is converted into an intermediate form
called bytecode.
12. Large Standard Library
Python has a large standard library that provides a rich set of modules and functions
so you do not have to write your own code for every single thing. There are many
libraries present in Python such as regular expressions, unit-testing, web browsers, etc.
13. Dynamically Typed Language
Python is a dynamically typed language. That means the type of a variable (for example
int, float, or str) is decided at run time, not in advance; because of this feature we
don't need to specify the type of a variable.
14. Frontend and backend development
With the PyScript project, you can write and run Python code in HTML with the
help of some simple tags such as <py-script> and <py-env>. This helps you do frontend
development work in Python, much like JavaScript. Backend development is Python's strong
forte; it is used extensively for this work because of frameworks like Django and Flask.
15. Allocating Memory Dynamically
In Python, the variable data type does not need to be specified. Memory is
automatically allocated to a variable at runtime when it is given a value. Developers
do not need to write int y = 18 to store the integer value 18 in y; you may simply type
y = 18.
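As a quick illustration of points 13 and 15, the short sketch below (the variable name is just an example) shows how the same name can be bound to values of different types without any declaration:

y = 18          # y is created and bound to an int object; no type declaration is needed
print(type(y))  # <class 'int'>

y = "eighteen"  # the same name can later be bound to a str object
print(type(y))  # <class 'str'>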

Python Syntax

What is Python Syntax?


Python syntax refers to the set of rules that defines the combinations of symbols
that are considered to be correctly structured programs in the Python language.
These rules govern how Python programs must be structured and
formatted, ensuring that the Python interpreter can understand and execute
them correctly. Here are some aspects of Python syntax:
Key Concepts
 Value (data types)
o String
 Text value
o Int
 Number value (no decimal places)

o Float
 Number value (has decimal places)
o Boolean
 True or false value
 Variable
o Used to store values
o Has a unique name that you choose
 Operator
o Used for assigning a value to a variable
o Used for math between two values or variables
 Function
o Pre-written code that can be used repeatedly
 Print
o Function that displays a string in the console
 Error
o Text displayed in the console when Python detects a mistake
o Shows the location (line number) of the mistake
 Statement
o A single line of code that does something
o Like a sentence, every statement has a subject and a verb
 Comment
o Not interpreted by Python
o Used to explain what the code does
 Console
o Where you see the program’s output (print results and errors)
Syntax
 Values
o Use quotation marks (" or ') around a string
o Use decimal points (.) to turn an int into a float
o Booleans can only be True or False
 Functions
o Use parentheses (()) after the name to use a function
o Add the parameter between the parentheses if needed (like in print)
 Comments
o Use an octothorpe (#) to start a single-line comment
o Use triple quotes (""") around a multi-line comment
 Some common operators
o A=B
 Set variable A to the value of B
o A+B
 Add A and B
o A-B
 Subtract B from A
o A*B
 Multiply A by B
o A/B
 Divide A by B
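To tie the syntax rules above together, here is a minimal sketch (the values are arbitrary) that uses variables, the assignment and arithmetic operators, a comment, and the print() function:

a = 7           # the assignment operator stores the int 7 in variable a
b = 2.5         # a float value (note the decimal point)
print(a + b)    # 9.5  (addition)
print(a - b)    # 4.5  (subtraction)
print(a * b)    # 17.5 (multiplication)
print(a / b)    # 2.8  (division)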

Python keywords
Keywords in Python are reserved words that have special meanings, for example if,
else, and while. They cannot be used as identifiers (variable, function, or class names).
Below is the list of keywords in Python; a short example of listing them programmatically
follows the table.
False await else import pass
None break except in raise
True class finally is return
and continue for lambda try
as def from nonlocal while
assert del global not with
async elif if or yield
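The exact set of keywords depends on the Python version. As a small sketch, the standard-library keyword module can print the list for the running interpreter:

# Listing the reserved words of the running interpreter
import keyword

print(keyword.kwlist)        # the full list of keywords
print(len(keyword.kwlist))   # e.g. 35 in recent Python 3 releases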

Python Variables

Python defines four types of variables: local, global, instance, and class variables. Local
variables are created within functions and can only be accessed there. Global variables
are defined outside of any function and can be used throughout the program.
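A minimal sketch of local versus global variables (the names count and step are only examples):

count = 0            # global variable, defined outside any function

def increment():
    step = 1         # local variable, visible only inside this function
    return count + step

print(increment())   # 1  (the function can read the global variable)
# print(step)        # would raise a NameError, because step is local to increment()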

Data Types in Python

Some built-in Python data types are:


 Numeric data types: int, float, complex.
 String data types: str.
 Sequence types: list, tuple, range.
 Binary types: bytes, bytearray, memoryview.
 Mapping data type: dict.
 Boolean type: bool.
 Set data types: set, frozenset.
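A short sketch that checks several of these built-in types with the type() function:

print(type(10))              # <class 'int'>
print(type(3.14))            # <class 'float'>
print(type(2 + 3j))          # <class 'complex'>
print(type("hello"))         # <class 'str'>
print(type([1, 2, 3]))       # <class 'list'>
print(type((1, 2, 3)))       # <class 'tuple'>
print(type(range(5)))        # <class 'range'>
print(type(b"bytes"))        # <class 'bytes'>
print(type({"key": 1}))      # <class 'dict'>
print(type(True))            # <class 'bool'>
print(type({1, 2, 3}))       # <class 'set'>
print(type(frozenset([1])))  # <class 'frozenset'>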
Operators in Python

Decision Making Loops in Python

Decision statements and Loops in Python


As a continuation of the previous lessons, here we introduce decision statements and
loops.
Decision Making
Decision making means executing a set of statements only when a given condition is
satisfied. Decision making in Python involves an 'if' condition, written as:
Syntax:
if condition:
    statements
A Python example:
a = '10'
b = '5'

if a < b:
    print("a and b are not really numbers")

# Output
>>> a and b are not really numbers
For more on the if condition, refer
to: https://docs.python.org/3/tutorial/controlflow.html
Exercise
Given a = 10, b = 10, and c = 10.00, use if condition to print the following:
 if a is equal to b and b is equal to c, print the message "a, b and c are all similar"
 if location of a is same as location of b is same as the location of c, print
message "a, b and c are all referring to the same object"
# data

a = 10
b = 10
c = 10.00
Solution code
if a==b and b==c: # two conditions using and operator
    print("a, b and c are all similar")
if id(a)==id(b)==id(c): # two conditions without using and operator
    print("a, b and c are all referring to the same object")
If-elif-else
An advanced approach to decision making is when we chain multiple if conditions,
so as to perform different operations based on different conditions being satisfied.
Here we use an 'if...elif...else' construct. 'elif' is short for else-if. It is written as:
Syntax:
if condition:
    statement(s)
elif condition:
    statement(s)
...
else:
    statement(s)
Ref: https://docs.python.org/3/reference/compound_stmts.html#if
Exercise
Given variable a, which can take any integral value, write multiple conditions to
determine whether a is:
 an even number
 a negative number
 or 1
 if not any of the above, a positive odd number other than 1
a=1
Solution code
if a == 1:
    print("a is equal to 1")
elif (a % 2) == 0:
    print("a is an even number")
elif a < 0:
    print("a is a negative number")
else:
    print("a is a positive odd number other than 1")
Loops
In general, statements are executed sequentially: The first statement in a function is
executed first, followed by the second, and so on.
There may be a situation when you need to execute a block of code several times.
Programming languages provide various control structures that allow for
more complicated execution paths.
A loop statement allows us to execute a statement or group of statements multiple
times.
The Python programming language provides the following types of loops to handle looping
requirements.
While Loop
Repeats a statement or group of statements while a given condition is TRUE. It tests
the condition before executing the loop body.

Note that Python works with indentation and no braces are necessary. The lines of code
inside a loop, condition, or function block must be indented consistently, conventionally
by four spaces.
Syntax:
while condition:
    statements
Example:
i = 1
while i < 5:
    i += 1
Example:
a = 1
# Iterating 9 times and printing the ints 2-10
while a < 10:
    a += 1
    print(a)

# Output
>>> 2
3
4
5
6
7
8
9
10
For Loop
Executes a sequence of statements multiple times and abbreviates the code that
manages the loop variable.
The for loop works in a similar way to the while loop. Pay close attention to the
syntax here: the loop uses the 'in' operator.
Syntax:
range(1, 10) -- generates the numbers from 1 to 9 (the stop value is excluded)
for i in range(1, 10):
    statements
Example:
# iterating 10 times and then printing the ints 1-10
for i in range(1, 11):
    print(i)

# Output
>>> 1
2
3
4
5
6
7
8
9
10
Ref: https://docs.python.org/3/tutorial/controlflow.html
Exercise
Given:
array = [1,'a',2,'b','3',4]
array2 = []
1. Use a for loop to add the values of array to array2 in reverse order
2. Empty out array
3. Use a while loop to add the values from array2 back to array in reverse order
# Write Your solution here
array = [1,'a',2,'b','3',4]
array2 = []
Solution Code
array = [1,'a',2,'b','3',4]
array2 = []
for i in range(len(array)-1, -1, -1):
    array2.append(array[i])
print(array2)

array = []
i = len(array2) - 1
while 0 <= i:
    array.append(array2[i])
    i -= 1
print(array)
Comprehension
Comprehension expressions can be described as one-line loops in Python. These
comprehensions help in reducing the verbosity of simple loops.
Syntax
a = [do_something(i) for i in array if condition]   # the if clause is optional
Example
a = [1,2,3]
# Adding 1 to each element in a
a = [i+1 for i in a]

# Output
>>> [2, 3, 4]
Exercise
arr = ['1', 2, '4', '5', 65,'100']
sum_var = 0
Convert all the string elements to integers, then sum them up and store the result in sum_var.
sum() is used to add all the elements in the array.
# Write your Solution here
arr = ['1', 2, '4', '5', 65,'100']
sum_var = 0
Solution
arr = ['1', 2, '4', '5', 65,'100']
sum_var = 0
sum_var = sum([int(i) for i in arr if type(i) is str])
sum_var
Loop Control
Loop control statements change execution from its normal sequence. When execution
leaves a scope, all automatic objects that were created in that scope are destroyed.
In Python there are certain statements to restrict/limit the number of iterations a for/while
loop can go through. Essentially, these are designed to manipulate the control flow of
a loop.
Break
Terminates the loop statement and transfers execution to the statement immediately
following the loop.
Syntax:
while condition:
    if condition:
        break

for i in range(n):
    if condition:
        break
Example:
# The loop breaks when it reaches 'h'
for letter in 'Python':   # First Example
    if letter == 'h':
        break
    print('Current Letter:', letter)

# Output
>>> Current Letter: P
Current Letter: y
Current Letter: t

# The loop breaks when var == 5

var = 10
while var > 0:
    print('Current variable value: ', var)
    var = var - 1
    if var == 5:
        break
print('Good bye!')

# Output
>>> Current variable value: 10
Current variable value: 9
Current variable value: 8
Current variable value: 7
Current variable value: 6
Good bye!
Continue
Causes the loop to skip the remainder of its body and immediately retest its condition
prior to reiterating.
Syntax:
while condition:
    if condition:
        continue

for i in range(n):
    if condition:
        continue
Example:
# The loop skips the letter 'h'
for letter in 'Python':   # First Example
    if letter == 'h':
        continue
    print(letter)

# Output
>>> P
y
t
o
n

# Prints all numbers except 5


var = 10   # Second Example
while var > 0:
    var = var - 1
    if var == 5:
        continue
    print('Current variable value:', var)
print("Good bye!")

# Output
>>> Current variable value: 9
Current variable value: 8
Current variable value: 7
Current variable value: 6
Current variable value: 4
Current variable value: 3
Current variable value: 2
Current variable value: 1
Current variable value: 0
Good bye!
Pass
The pass statement in Python is used when a statement is required syntactically but
you do not want any command or code to execute.
# Prints the sentence 'This is pass block' when it reaches 'h'
for letter in 'Python':
    if letter == 'h':
        pass
        print('This is pass block')
    print('Current Letter :', letter)
print("Good bye!")

>>> Current Letter : P


Current Letter : y
Current Letter : t
This is pass block
Current Letter : h
Current Letter : o
Current Letter : n
Good bye!
Exercise
arr = [100, -220, 113, -50, 65, -70, -10, 15, 65]
Add the positive ints until the sum becomes larger than 250
# Write your Solution here
arr = [100, -220, 113, -50, 65, -70, -10, 15, 65]
sum_var = 0
arr = [100, -220, 113, -50, 65, -70, -10, 15, 65]
sum_var = 0
for i in arr:
    if i < 0:
        continue
    if sum_var > 250:
        break
    sum_var += i
Python Data Structure

Python data structures are essentially containers for different kinds of data. The four
main types are lists, sets, tuples and dictionaries.

4 Built-In Python Data Structures

The four primary data structures utilized in Python are lists, sets, tuples and
dictionaries.

Lists
Lists are a type of data structure containing an ordered collection of items. They are
crucial to executing projects in Python.
Every item contained within a list has an inherent order used to identify them, which
remains consistent throughout the life of the list. Lists are mutable, allowing elements
to be searched, added, moved and deleted after creation. Lists can also be nested,
allowing them to contain any object, including other lists and sublists.

Tuples
A tuple contains much of the same functionality as a list, albeit with limited
functionality. The primary difference between the two is that a tuple is immutable,
meaning it cannot be modified after creation. Tuples are best when a user intends to keep
an object intact throughout its lifetime to prevent the modification or addition of data.

Sets
A set is a collection of unique elements with no defined order, which are utilized when
an object only needs to exist within a collection of objects and its order or number of
appearances are not important.

Dictionaries
Dictionaries are mutable objects that consist of key-value pairs; each key in a dictionary
is unique and immutable, and values are accessed through their keys.
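The short sketch below (the variable names and values are made up for illustration) shows the four data structures side by side:

marks = [78, 92, 92, 65]        # list  - ordered and mutable
point = (10.5, 20.0)            # tuple - ordered but immutable
unique_marks = {78, 92, 65}     # set   - unordered, unique elements only
student = {"name": "Asha",      # dict  - key-value pairs with unique keys
           "marks": marks}

marks.append(88)                # lists can be modified in place
print(unique_marks)             # the duplicate 92 is stored only once
print(student["name"])          # dictionary values are looked up by key
# point[0] = 1.0                # would raise a TypeError, because tuples are immutable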

Date & time, Functions

What is Datetime in Python?


In the Python programming language, datetime is a single module. This means that it
is not two separate data types. You can import this datetime module in such a way
that they work with dates and times. datetime is a built-in Python module. Developers
don’t need to install it separately. It’s right there.
With the datetime module, you get to work with different classes that work perfectly
with date and time. Within these classes, you can get a range of functions that deal
with dates, times, and different time intervals.

Examples of Datetime in Python

Here are a few examples to help in the understanding of datetime in Python properly.
Let’s begin.
Example 1
Here, this example shows you how you can get the current date and time using datetime in Python:
# importing the datetime class
from datetime import datetime
# calling the now() function of the datetime class
Now = datetime.now()
print("Now the date and time are", Now)
The output shows the current date and time.

Example 2
Here is the second example. The aim is to count the difference between two different
datetimes.
#Importing the datetime class
from datetime import datetime
#Initializing the first date and time
time1 = datetime(year=2020, month=5, day=9, hour=4, minute=33, second=6)
#Initializing the second date and time
time2 = datetime(year=2021, month=7, day=4, hour=7, minute=55, second=4)
#Calculating and printing the time difference between two given date and times
time_difference = time2 - time1
print("The time difference between the two times is", time_difference)
And the output is:
>>> The time difference between the two times is 421 days, 3:21:58
Function or Method Overloading:

Two or more methods have the same name but different numbers of parameters or
different types of parameters, or both. These methods are called overloaded methods
and this is called method overloading.
Unlike other languages (for example, C++), Python does not support method
overloading by default, but there are different ways to achieve a similar effect in
Python.
The problem with method overloading in Python is that we may overload the methods
but can only use the latest defined method.

Example :-

# First product method.
# Takes two arguments and prints their product

def product(a, b):
    p = a * b
    print(p)

# Second product method.
# Takes three arguments and prints their product

def product(a, b, c):
    p = a * b * c
    print(p)

# Uncommenting the below line shows an error,
# because the two-argument version has been replaced

# product(4, 5)

# This line will call the second product method

product(4, 5, 5)

Output :- 100
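One common workaround (a sketch of just one of the possible approaches, not the only way) is to give the extra parameter a default value, so a single product() function can handle both the two-argument and the three-argument call:

# A single product method that accepts two or three arguments
def product(a, b, c=1):
    p = a * b * c      # with two arguments, c defaults to 1
    print(p)

product(4, 5)          # 20
product(4, 5, 5)       # 100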
Operator Overloading in Python

Operator overloading means giving operators an extended meaning beyond their predefined
operational meaning. For example, the operator + is used to add two integers as well as
join two strings and merge two lists. This is achievable because the '+' operator is
overloaded by the int class and the str class. You might have noticed that the same
built-in operator or function shows different behavior for objects of different classes;
this is called operator overloading.
Example

# Python program to show use of


# + operator for different purposes.

print(1 + 2)

# concatenate two strings


print("Geeks"+"For")

# Multiply two numbers


print(3 * 4)

# Repeat the String


print("Geeks"*4)

Output
3
GeeksFor
12
GeeksGeeksGeeksGeeks
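Operator overloading also works for user-defined classes by defining special methods such as __add__(). The Point class below is a hypothetical example written only for illustration:

# Overloading the + operator for a user-defined class
class Point:
    def __init__(self, x, y):
        self.x = x
        self.y = y

    def __add__(self, other):
        # called when two Point objects are combined with +
        return Point(self.x + other.x, self.y + other.y)

p = Point(1, 2) + Point(3, 4)
print(p.x, p.y)   # 4 6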

Python Classes/Objects

Python is an object oriented programming language.


Almost everything in Python is an object, with its properties and methods.
A Class is like an object constructor, or a "blueprint" for creating objects.
Create a Class
To create a class, use the keyword class:
Example
Create a class named MyClass, with a property named x:
class MyClass:
    x = 5

Create Object
Now we can use the class named MyClass to create objects:
Example
Create an object named p1, and print the value of x:
p1 = MyClass()
print(p1.x)

Example :-

The __init__() Function


The examples above are classes and objects in their simplest form, and are not really
useful in real life applications.
To understand the meaning of classes we have to understand the built-in __init__()
function.
All classes have a function called __init__(), which is always executed when the class
is being initiated.
Use the __init__() function to assign values to object properties, or other operations
that are necessary to do when the object is being created:

Example
Create a class named Person, use the __init__() function to assign values for name
and age:
class Person:
    def __init__(self, name, age):
        self.name = name
        self.age = age

p1 = Person("John", 36)

print(p1.name)
print(p1.age)
UNIT -2
Working with Data in Python

Open File
Reading files with Open

To open the file, use the built-in open() function.


The open() function returns a file object, which has a read() method for reading the
content of the file:

Example
f = open("demofile.txt", "r")
print(f.read())

writing files with Open

Python provides inbuilt functions for creating, writing and reading files. There are
two types of files that can be handled in python, normal text files and binary files
(written in binary language, 0s and 1s).
 Text files: In this type of file, Each line of text is terminated with a special
character called EOL (End of Line), which is the new line character (‘\n’) in
python by default.
 Binary files: In this type of file, there is no terminator for a line and the data
is stored after converting it into machine-understandable binary language.

Access mode
Access modes govern the type of operations possible in the opened file. It refers to
how the file will be used once it’s opened. These modes also define the location of
the File Handle in the file. File handle is like a cursor, which defines from where the
data has to be read or written in the file. Different access modes for writing to a file are:

1. Write Only (‘w’) : Open the file for writing. For an existing file, the data is
truncated and over-written. The handle is positioned at the beginning of the
file. Creates the file if the file does not exist.
2. Write and Read (‘w+’) : Open the file for reading and writing. For an existing
file, data is truncated and over-written. The handle is positioned at the
beginning of the file.
3. Append Only (‘a’) : Open the file for writing. The file is created if it does not
exist. The handle is positioned at the end of the file. The data being written
will be inserted at the end, after the existing data.
Opening a File
It is done using the open() function. No module is required to be imported for this
function. Syntax:

File_object = open(r"File_Name", "Access_Mode")

# Open function to open the file "MyFile1.txt"
# (same directory) in write mode and
# store its reference in the variable file1
file1 = open("MyFile1.txt", "w")

# and open "MyFile2.txt" in D:\Text in
# write-and-read mode, storing it in file2
file2 = open(r"D:\Text\MyFile2.txt", "w+")
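As a small end-to-end sketch (the file name demo.txt is just an example), the code below writes two lines to a file and reads them back; the with statement closes the file automatically when the block ends:

# Writing to a text file and reading it back
with open("demo.txt", "w") as f:     # 'w' truncates the file or creates it
    f.write("First line\n")
    f.write("Second line\n")

with open("demo.txt", "r") as f:     # 'r' opens the file for reading
    print(f.read())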

loading data with Pandas

Pandas - DataFrame - Loading the dataset from various data sources


A dataset can be loaded from various data sources using relevant Pandas constructs
(functions) as mentioned below:
 CSV file - read_csv() function
 JSON file - read_json() function
 Excel file - read_excel() function
 Database table - read_sql() function
All the above functions return a dataframe object and most of these functions have a
parameter called 'chunksize'.
e.g. to load a JSON data file (myfile.json) you can use the below code
my_df = pd.read_json("myfile.json")
Here, my_df is a pandas dataframe object.
chunksize - It is the number of rows(records) of the dataset (csv, excel, json, table,
etc.) which you want to be returned in each chunk.
When you use this parameter - chunksize, these functions (read_csv(), read_sql(),
etc.) return you an iterator which enable you to traverse through these chunks of data,
where each chunk is of size as specified by chunksize parameter.
This 'chunksize' parameter is very useful when you are dealing with (loading) a large
dataset and you have very limited memory (RAM) available on your machine. If
'chunksize' parameter is specified, only a chunk of data will be read into the dataframe
at a time. Hence, if your specified chunksize is within your memory (RAM) limits,
you can easily load large datasets using these constructs/functions of Pandas.
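A minimal sketch of chunked loading (the file name bigfile.csv and the column name revenue are placeholders used only for illustration):

import pandas as pd

total = 0
# read_csv() with chunksize returns an iterator of DataFrames,
# each holding at most 10,000 rows
for chunk in pd.read_csv("bigfile.csv", chunksize=10000):
    total += chunk["revenue"].sum()
print(total)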
Loading dataset from a CSV file

(1) Please load the data from /cxldata/datasets/project/housing_short.csv file by


passing it to the read_csv() function of Pandas library and store the returned dataframe
in a variable called 'mydf'

<<your code comes here>> = pd.read_csv("<<your csv file name comes here>>",
index_col=0)
(2) Use describe() function of pandas dataframe to see the data in this 'mydf'
dataframe.

mydf.<<your code comes here>>

working with Pandas

pandas: How to Read and Write Files


1. Installing pandas.
2. Preparing Data.
3. Using the pandas read_csv() and .to_csv() Functions. Write a CSV File. ...
4. Using pandas to Write and Read Excel Files. Write an Excel File. ...
5. Understanding the pandas IO API. Write Files. ...
6. Working With Different File Types. CSV Files. ...
7. Working With Big Data. ...
8. Conclusion.

saving with Pandas

How to Save a Pandas DataFrame to CSV


To save a Pandas DataFrame as a CSV file, use the DataFrame.to_csv() method, where df is
your DataFrame and index=False prevents an index column from being added:

df.to_csv('your_file_name.csv', index=False)
Array oriented Programming with Numpy
Array Programming provides a powerful, compact and expressive syntax for
accessing, manipulating and operating on data in vectors, matrices and higher-
dimensional arrays. NumPy is the primary array programming library for the
Python language.

NumPy stands for Numerical Python. It is a Python library used for working with an
array. In Python, we use the list for the array but it’s slow to process. NumPy array is
a powerful N-dimensional array object and is used in linear algebra, Fourier
transform, and random number capabilities. It provides an array object much faster
than traditional Python lists.
Types of Array:
1. One Dimensional Array
2. Multi-Dimensional Array
One Dimensional Array:
A one-dimensional array is a type of linear array.

Example:

# importing numpy module


import numpy as np

# creating list
list = [1, 2, 3, 4]

# creating numpy array


sample_array = np.array(list)

print("List in python : ", list)

print("Numpy Array in python :",


sample_array)

Output:

List in python : [1, 2, 3, 4]


Numpy Array in python : [1 2 3 4]
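A multi-dimensional array is created in the same way, by passing a nested list to np.array(). A small sketch:

# importing numpy module
import numpy as np

# creating a 2-D numpy array from a nested list
matrix = np.array([[1, 2, 3],
                   [4, 5, 6]])

print("Numpy multi-dimensional array in python :\n", matrix)
print("Shape :", matrix.shape)   # (2, 3) - 2 rows and 3 columns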
Data Cleaning and Preparation in Python

Cleaning and preparing data is a crucial part of data analysis. The goal is to transform
raw data into a format suitable for analysis. This often involves identifying and
resolving missing values, outliers, and data inconsistencies.

The following Python example illustrates how data can be cleaned and prepared.

1. Generating Sample Data


Let's begin by generating some sample data using the NumPy package.This example
will create a data frame with columns for "id", "date", "product", "quantity",
"revenue", and "state".

The Python code is as follows:

import numpy as np
import pandas as pd

# set random seed for reproducibility
np.random.seed(123)

# generate sample data
n = 1000
dates = pd.date_range(start='2022-01-01', end='2022-12-31')
products = ['A', 'B', 'C', 'D']
states = ['NY', 'CA', 'TX']
df = pd.DataFrame({
    'id': np.arange(n),
    'date': np.random.choice(dates, size=n),
    'product': np.random.choice(products, size=n),
    'quantity': np.random.randint(low=1, high=10, size=n),
    'revenue': np.random.uniform(low=10, high=100, size=n),
    'state': np.random.choice(states, size=n)
})
2. Dealing with Missing Values
Missing values are one of the most common problems in data cleansing. The missing
values can be handled in several ways:
 Dropping the rows
 Filling them with a default or calculated value
Let's assume that the "quantity" column has some missing values which we wish to
fill with the median value.

median_qty = df["quantity"].median()
df["quantity"].fillna(median_qty, inplace=True)
df.head()

3. Handling Outliers
Outliers are values that are significantly different from the rest of the data. Outliers
can negatively affect analyses, and they should be identified and addressed. A
common method of handling outliers is to remove them, but in some cases, you may
want to keep them if they represent legitimate data points.

Assume that the "revenue" column contains some outliers which we would like to
remove using the interquartile range (IQR).

Q1 = df["revenue"].quantile(0.25)
Q3 = df["revenue"].quantile(0.75)
IQR = Q3 - Q1

df = df[~((df["revenue"] < (Q1 - 1.5 * IQR)) | (df["revenue"] > (Q3 + 1.5 * IQR)))]

4. Handling Data Inconsistencies


When the same data point is represented in multiple ways, there is an inconsistency
in the data. "New York" and "NY", for example, may refer to the same location.

Suppose the "state" column contains some inconsistencies that we wish to resolve by
mapping different values to the same category.

state_map = {
    "New York": "NY",
    "California": "CA",
    "Texas": "TX"
}

df["state"] = df["state"].replace(state_map)

5. Exporting the Cleaned Data


Finally, once you have cleaned and prepared the data, you can export it to a new file
using the to_csv method.

df.to_csv("/Users/rafael/Desktop/General/cleaned_sales_data.csv", index=False)

Plotting and Visualization

Python provides various libraries that come with different features for visualizing
data. All these libraries come with different features and can support various types of
graphs. We use following four such libraries.
 Matplotlib
 Seaborn
 Bokeh
 Plotly

What is Data Visualization?


Data visualization is a field in data analysis that deals with visual representation of
data. It graphically plots data and is an effective way to communicate inferences from
data.
Using data visualization, we can get a visual summary of our data. With pictures,
maps and graphs, the human mind has an easier time processing and understanding
any given data. Data visualization plays a significant role in the representation of both
small and large data sets, but it is especially useful when we have large data sets, in
which it is impossible to see all of our data, let alone process and understand it
manually.
Data Visualization in Python
Python offers several plotting libraries, namely Matplotlib, Seaborn and many other
such data visualization packages with different features for creating informative,
customized, and appealing plots to present data in the most simple and effective way.

Matplotlib and Seaborn


Matplotlib and Seaborn are python libraries that are used for data visualization. They
have inbuilt modules for plotting different graphs. While Matplotlib is used to embed
graphs into applications, Seaborn is primarily used for statistical graphs.
But when should we use either of the two? Let’s understand this with the help of a
comparative analysis. The table below provides comparison between Python’s two
well-known visualization packages Matplotlib and Seaborn.
Matplotlib vs Seaborn:
 Matplotlib is used for basic graph plotting like line charts, bar graphs, etc., whereas
Seaborn is mainly used for statistics visualization and can perform complex visualizations
with fewer commands.
 Matplotlib mainly works with datasets and arrays, whereas Seaborn works with entire
datasets.
 Matplotlib acts productively with data arrays and frames and regards the axes and
figures as objects, whereas Seaborn is considerably more organized and functional than
Matplotlib and treats the entire dataset as a solitary unit.
 Matplotlib is more customizable and pairs well with Pandas and NumPy for Exploratory
Data Analysis, whereas Seaborn has more inbuilt themes and is mainly used for statistical
analysis.

Example :-

Scatter Plot
Scatter plots are used to observe relationships between variables and uses dots to
represent the relationship between them. The scatter() method in the matplotlib library
is used to draw a scatter plot.

Example:

import pandas as pd
import matplotlib.pyplot as plt

# reading the database


data = pd.read_csv("tips.csv")

# Scatter plot with day against tip


plt.scatter(data['day'], data['tip'])
# Adding Title to the Plot
plt.title("Scatter Plot")
# Setting the X and Y labels
plt.xlabel('Day')
plt.ylabel('Tip')

plt.show()

Output: a scatter plot of the tip values plotted against the days is displayed.

Grouping and Aggregating with Pandas


In this section we look at grouping and aggregating using pandas. Grouping and
aggregating help us carry out data analysis easily using various functions. These
methods help us group and summarize our data and make complex analysis
comparatively easy. A short grouping example follows the sample dataset below.
Creating a sample dataset of marks of various subjects.

# import module
import pandas as pd

# Creating our dataset


df = pd.DataFrame([[9, 4, 8, 9],
[8, 10, 7, 6],
[7, 6, 8, 5]],
columns=['Maths', 'English',
'Science', 'History'])

# display dataset
print(df)
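The dataset above has no column to group on, so the sketch below first adds a hypothetical 'Class' column purely for illustration, and then groups by it and aggregates the marks with groupby() and agg():

# add an illustrative grouping column (one class label per row)
df['Class'] = ['A', 'B', 'A']

# mean and maximum marks per class for every subject
grouped = df.groupby('Class').agg(['mean', 'max'])
print(grouped)

# a simple aggregate over the whole dataset, without grouping
print(df[['Maths', 'English', 'Science', 'History']].mean())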
UNIT -3
Machine Learning and Cognitive Intelligence

Introduction to Machine Learning- History and Evolution

What is machine learning


Machine learning is an application of AI that includes algorithms that parse data, learn
from that data, and then apply what they’ve learned to make informed decisions.
An easy example of a machine learning algorithm is an on-demand music streaming
service like Spotify.
For Spotify to make a decision about which new songs or artists to recommend to
you, machine learning algorithms associate your preferences with other listeners who
have a similar musical taste. This technique, which is often simply touted as AI, is
used in many services that offer automated recommendations.
Machine learning fuels all sorts of tasks that span across multiple industries, from
data security firms that hunt down malware to finance professionals who want alerts
for favorable trades. The AI algorithms are programmed to keep learning constantly, in
a way that simulates a virtual personal assistant, something that they do quite
well.
The early days
Machine learning history starts in 1943 with the first mathematical model of neural
networks presented in the scientific paper "A logical calculus of the ideas immanent
in nervous activity" by Walter Pitts and Warren McCulloch.
Then, in 1949, the book The Organization of Behavior by Donald Hebb is published.
The book had theories on how behavior relates to neural networks and brain activity
and would go on to become one of the monumental pillars of machine learning
development.
In 1950 Alan Turing created the Turing Test to determine if a computer has real
intelligence. To pass the test, a computer must be able to fool a human into believing
it is also human. He presented the principle in his paper Computing Machinery and
Intelligence while working at the University of Manchester. It opens with the words:
"I propose to consider the question, 'Can machines think?'"
Playing games and plotting routes
The first ever computer learning program was written in 1952 by Arthur Samuel. The
program was the game of checkers, and the IBM computer improved at the game the
more it played, studying which moves made up winning strategies and incorporating
those moves into its program.
Then in 1957 Frank Rosenblatt designed the first neural network for computers - the
perceptron - which simulated the thought processes of the human brain.
The next significant step forward in ML wasn’t until 1967 when the “nearest
neighbor” algorithm was written, allowing computers to begin using very basic
pattern recognition. This could be used to map a route for traveling salesmen, starting
at a random city but ensuring they visit all cities during a short tour.
Twelve years later, in 1979 students at Stanford University invent the ‘Stanford Cart’
which could navigate obstacles in a room on its own. And in 1981, Gerald Dejong
introduced the concept of Explanation Based Learning (EBL), where a computer
analyses training data and creates a general rule it can follow by discarding
unimportant data.
Big steps forward
In the 1990s work on machine learning shifted from a knowledge-driven approach to
a data-driven approach. Scientists began creating programs for computers to analyze
large amounts of data and draw conclusions — or “learn” — from the results.
And in 1997, IBM’s Deep Blue shocked the world by beating the world champion at
chess.
The term “deep learning” was coined in 2006 by Geoffrey Hinton to explain new
algorithms that let computers “see” and distinguish objects and text in images and
videos.
Four years later, in 2010 Microsoft revealed their Kinect technology could track 20
human features at a rate of 30 times per second, allowing people to interact with the
computer via movements and gestures. The following year IBM's Watson beat its human
competitors at Jeopardy.
Google Brain was developed in 2011 and its deep neural network could learn to
discover and categorize objects much the way a cat does. The following year, the tech
giant’s X Lab developed a machine learning algorithm that is able to autonomously
browse YouTube videos to identify the videos that contain cats.
In 2014, Facebook developed DeepFace, a software algorithm that is able to recognize
or verify individuals on photos to the same level as humans can.
2015 - Present day
Amazon launched its own machine learning platform in 2015. Microsoft also created
the Distributed Machine Learning Toolkit, which enabled the efficient distribution of
machine learning problems across multiple computers.
Then more than 3,000 AI and Robotics researchers, endorsed by Stephen Hawking, Elon
Musk and Steve Wozniak (among many others), signed an open letter warning of the
danger of autonomous weapons which select and engage targets without human
intervention.
In 2016 Google’s artificial intelligence algorithm beat a professional player at the
Chinese board game Go, which is considered the world’s most complex board game
and is many times harder than chess. The AlphaGo algorithm developed by Google
DeepMind managed to win five games out of five in the Go competition.
Waymo started testing autonomous cars in the US in 2017 with backup drivers only
at the back of the car. Later the same year they introduced completely autonomous
taxis in the city of Phoenix.
In 2020, while the rest of the world was in the grips of the pandemic, OpenAI
announced a ground-breaking natural language processing algorithm, GPT-3, with a
remarkable ability to generate human-like text when given a prompt. Today, GPT-3
is considered the largest and most advanced language model in the world, using 175
billion parameters and Microsoft Azure’s AI supercomputer for training.
The future of machine learning
Improvements in unsupervised learning algorithms
In the future, we’ll see more effort dedicated to improving unsupervised machine
learning algorithms to help to make predictions from unlabeled data sets. This
function is going to become increasingly important as it allows algorithms to discover
interesting hidden patterns or groupings within data sets and help businesses
understand their market or customers better.
The rise of quantum computing
One of the major machine learning trends lies in quantum computing, which
could transform the future of this field. Quantum computers lead to faster
processing of data, enhancing the algorithm’s ability to analyze and draw meaningful
insights from data sets.
Focus on cognitive services
Software applications will become more interactive and intelligent thanks to
cognitive services driven by machine learning. Features such as visual recognition,
speech detection, and speech understanding will be easier to implement. We’re going
to see more intelligent applications using cognitive services appear on the market.

Types/Categories of Machine Learning

Machine learning is the branch of Artificial Intelligence that focuses on developing


models and algorithms that let computers learn from data and improve from previous
experience without being explicitly programmed for every task. In simple words, ML
teaches the systems to think and understand like humans by learning from the data.
In this article, we will explore the various types of machine learning
algorithms that are important for future requirements. Machine learning is
generally about training a system to learn from past experience and improve its performance
over time. Machine learning helps make predictions from massive amounts of data and
delivers fast and accurate results that can uncover profitable opportunities.
Types of Machine Learning
There are several types of machine learning, each with special characteristics and
applications. Some of the main types of machine learning algorithms are as follows:
1. Supervised Machine Learning
2. Unsupervised Machine Learning
3. Semi-Supervised Machine Learning
4. Reinforcement Learning
1. Supervised Machine Learning
Supervised learning is when a model is trained on a "Labelled
Dataset". Labelled datasets have both input and output parameters. In supervised
learning, algorithms learn to map inputs to the correct outputs, and both the training
and validation datasets are labelled.
There are two main categories of supervised learning that are mentioned below:
 Classification
 Regression

Classification
Classification deals with predicting categorical target variables, which represent
discrete classes or labels. For instance, classifying emails as spam or not spam, or
predicting whether a patient has a high risk of heart disease. Classification algorithms
learn to map the input features to one of the predefined classes.
Here are some classification algorithms:
 Logistic Regression
 Support Vector Machine
 Random Forest
 Decision Tree
 K-Nearest Neighbors (KNN)
 Naive Bayes
Regression
Regression, on the other hand, deals with predicting continuous target variables,
which represent numerical values. For example, predicting the price of a house based
on its size, location, and amenities, or forecasting the sales of a product. Regression
algorithms learn to map the input features to a continuous numerical value.
Here are some regression algorithms (a minimal code sketch follows this list):
 Linear Regression
 Polynomial Regression
 Ridge Regression
 Lasso Regression
 Decision tree
 Random Forest
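A minimal regression sketch using the scikit-learn library (scikit-learn is an external library not covered elsewhere in these notes and must be installed separately; the tiny dataset below is invented purely for illustration):

from sklearn.linear_model import LinearRegression

# labelled data: house size in square metres (input) -> price in thousands (output)
X = [[50], [60], [80], [100], [120]]
y = [25, 30, 40, 50, 60]

model = LinearRegression()
model.fit(X, y)                  # learn the mapping from inputs to outputs
print(model.predict([[90]]))     # expected to print approximately [45.]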
Advantages of Supervised Machine Learning
 Supervised Learning models can have high accuracy as they are trained
on labelled data.
 The process of decision-making in supervised learning models is often
interpretable.
 It can often be used in pre-trained models which saves time and resources
when developing new models from scratch.
Disadvantages of Supervised Machine Learning
 It has limitations in knowing patterns and may struggle with unseen or
unexpected patterns that are not present in the training data.
 It can be time-consuming and costly as it relies on labeled data only.
 It may lead to poor generalizations based on new data.
Applications of Supervised Learning
Supervised learning is used in a wide variety of applications, including:
 Image classification: Identify objects, faces, and other features in images.
 Natural language processing: Extract information from text, such as
sentiment, entities, and relationships.
 Speech recognition: Convert spoken language into text.
 Recommendation systems: Make personalized recommendations to users.
 Predictive analytics: Predict outcomes, such as sales, customer churn, and
stock prices.
 Medical diagnosis: Detect diseases and other medical conditions.
 Fraud detection: Identify fraudulent transactions.
 Autonomous vehicles: Recognize and respond to objects in the environment.
 Email spam detection: Classify emails as spam or not spam.
 Quality control in manufacturing: Inspect products for defects.
 Credit scoring: Assess the risk of a borrower defaulting on a loan.
 Gaming: Recognize characters, analyze player behavior, and create NPCs.
 Customer support: Automate customer support tasks.
 Weather forecasting: Make predictions for temperature, precipitation, and
other meteorological parameters.
 Sports analytics: Analyze player performance, make game predictions, and
optimize strategies.
2. Unsupervised Machine Learning
Unsupervised learning is a type of machine learning
technique in which an algorithm discovers patterns and relationships using unlabeled
data. Unlike supervised learning, unsupervised learning doesn't involve providing the
algorithm with labeled target outputs. The primary goal of unsupervised learning is
often to discover hidden patterns, similarities, or clusters within the data, which can
then be used for various purposes, such as data exploration, visualization,
dimensionality reduction, and more.

There are two main categories of unsupervised learning that are mentioned below:
 Clustering
 Association
Clustering
Clustering is the process of grouping data points into clusters based on their similarity.
This technique is useful for identifying patterns and relationships in data without the
need for labeled examples.
Here are some clustering algorithms (a small clustering sketch follows this list):
 K-Means Clustering algorithm
 Mean-shift algorithm
 DBSCAN Algorithm
 Principal Component Analysis (strictly a dimensionality-reduction technique, often used alongside clustering)
 Independent Component Analysis (likewise a dimensionality-reduction / signal-separation technique)
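A small clustering sketch with scikit-learn's KMeans (again, scikit-learn is an external library, and the 2-D points below are made up for illustration):

from sklearn.cluster import KMeans

points = [[1, 1], [1.5, 2], [1, 1.5],   # one group of nearby points
          [8, 8], [8.5, 9], [9, 8]]     # a second, well-separated group

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print(kmeans.labels_)            # cluster index assigned to each point
print(kmeans.cluster_centers_)   # coordinates of the two cluster centres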
Association
Association rule learning is a technique for discovering relationships between items
in a dataset. It identifies rules that indicate the presence of one item implies the
presence of another item with a specific probability.
Here are some association rule learning algorithms:
 Apriori Algorithm
 Eclat
 FP-growth Algorithm
Advantages of Unsupervised Machine Learning
 It helps to discover hidden patterns and various relationships between the data.
 Used for tasks such as customer segmentation, anomaly detection, and data
exploration.
 It does not require labeled data and reduces the effort of data labeling.
Disadvantages of Unsupervised Machine Learning
 Without using labels, it may be difficult to predict the quality of the model’s
output.
 Cluster Interpretability may not be clear and may not have meaningful
interpretations.
 It has techniques such as autoencoders and dimensionality reduction that can
be used to extract meaningful features from raw data.
Applications of Unsupervised Learning
Here are some common applications of unsupervised learning:
 Clustering: Group similar data points into clusters.
 Anomaly detection: Identify outliers or anomalies in data.
 Dimensionality reduction: Reduce the dimensionality of data while
preserving its essential information.
 Recommendation systems: Suggest products, movies, or content to users
based on their historical behavior or preferences.
 Topic modeling: Discover latent topics within a collection of documents.
 Density estimation: Estimate the probability density function of data.
 Image and video compression: Reduce the amount of storage required for
multimedia content.
 Data preprocessing: Help with data preprocessing tasks such as data cleaning,
imputation of missing values, and data scaling.
 Market basket analysis: Discover associations between products.
 Genomic data analysis: Identify patterns or group genes with similar
expression profiles.
 Image segmentation: Segment images into meaningful regions.
 Community detection in social networks: Identify communities or groups of
individuals with similar interests or connections.
 Customer behavior analysis: Uncover patterns and insights for better
marketing and product recommendations.
 Content recommendation: Classify and tag content to make it easier to
recommend similar items to users.
 Exploratory data analysis (EDA): Explore data and gain insights before
defining specific tasks.
3. Reinforcement Machine Learning
Reinforcement learning is a learning method in which an agent interacts with
the environment by producing actions and discovering errors. Trial and error and
delayed reward are the most relevant characteristics of reinforcement learning. In this
technique, the model keeps improving its performance using reward feedback to
learn the behavior or pattern. These algorithms are tailored to a particular problem,
e.g. the Google self-driving car, or AlphaGo, where a bot competes with humans and even
with itself to become a better and better Go player. Each time we feed in data, the agent
learns and adds it to its knowledge, which serves as training data; the more it learns,
the better trained and more experienced it becomes.
Here are some of the most common reinforcement learning algorithms (a small Q-learning sketch follows this list):
 Q-learning: Q-learning is a model-free RL algorithm that learns a Q-function,
which maps states to actions. The Q-function estimates the expected reward of
taking a particular action in a given state.
 SARSA (State-Action-Reward-State-Action): SARSA is another model-
free RL algorithm that learns a Q-function. However, unlike Q-learning,
SARSA updates the Q-function for the action that was actually taken, rather
than the optimal action.
 Deep Q-learning: Deep Q-learning is a combination of Q-learning and deep
learning. Deep Q-learning uses a neural network to represent the Q-function,
which allows it to learn complex relationships between states and actions.
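The sketch below shows only the core Q-learning update rule applied to a tiny, made-up Q-table; it is not a complete agent or environment:

# Q(s, a) <- Q(s, a) + alpha * (r + gamma * max over a' of Q(s', a') - Q(s, a))
alpha, gamma = 0.1, 0.9                  # learning rate and discount factor

Q = {                                    # hypothetical Q-table: (state, action) -> value
    ("s0", "left"): 0.0, ("s0", "right"): 0.0,
    ("s1", "left"): 0.0, ("s1", "right"): 1.0,
}

state, action, reward, next_state = "s0", "right", 0.5, "s1"
best_next = max(Q[(next_state, a)] for a in ("left", "right"))
Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
print(Q[(state, action)])                # approximately 0.14 after one update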

Let’s understand it with the help of examples.


Example: Consider that you are training an AI agent to play a game like chess. The
agent explores different moves and receives positive or negative feedback based on
the outcome. Reinforcement Learning also finds applications in which they learn to
perform tasks by interacting with their surroundings.
Types of Reinforcement Machine Learning
There are two main types of reinforcement learning:
Positive reinforcement
 Rewards the agent for taking a desired action.
 Encourages the agent to repeat the behavior.
 Examples: Giving a treat to a dog for sitting, providing a point in a game for a
correct answer.
Negative reinforcement
 Removes an undesirable stimulus to encourage a desired behavior.
 Discourages the agent from repeating the behavior.
 Examples: Turning off a loud buzzer when a lever is pressed, avoiding a
penalty by completing a task.
Advantages of Reinforcement Machine Learning
 Its autonomous decision-making is well suited to tasks that require learning to
make a sequence of decisions, like robotics and game playing.
 This technique is preferred for achieving long-term results that are otherwise very
difficult to achieve.
 It is used to solve complex problems that cannot be solved by conventional
techniques.
Disadvantages of Reinforcement Machine Learning
 Training Reinforcement Learning agents can be computationally expensive
and time-consuming.
 Reinforcement learning is not preferable to solving simple problems.
 It needs a lot of data and a lot of computation, which makes it impractical and
costly.
Applications of Reinforcement Machine Learning
Here are some applications of reinforcement learning:
 Game Playing: RL can teach agents to play games, even complex ones.
 Robotics: RL can teach robots to perform tasks autonomously.
 Autonomous Vehicles: RL can help self-driving cars navigate and make
decisions.
 Recommendation Systems: RL can enhance recommendation algorithms by
learning user preferences.
 Healthcare: RL can be used to optimize treatment plans and drug discovery.
 Natural Language Processing (NLP): RL can be used in dialogue systems
and chatbots.
 Finance and Trading: RL can be used for algorithmic trading.
 Supply Chain and Inventory Management: RL can be used to optimize
supply chain operations.
 Energy Management: RL can be used to optimize energy consumption.
 Game AI: RL can be used to create more intelligent and adaptive NPCs in
video games.
 Adaptive Personal Assistants: RL can be used to improve personal assistants.
 Virtual Reality (VR) and Augmented Reality (AR): RL can be used to
create immersive and interactive experiences.
 Industrial Control: RL can be used to optimize industrial processes.
 Education: RL can be used to create adaptive learning systems.
 Agriculture: RL can be used to optimize agricultural operations.
What is a machine learning framework:-
Machine learning (ML) frameworks are interfaces that allow data scientists and
developers to build and deploy machine learning models faster and easier. Machine
learning is used in almost every industry, notably finance, insurance, healthcare, and
marketing. Using these tools, businesses can scale their machine learning efforts
while maintaining an efficient ML lifecycle.
Companies can choose to build their own custom machine learning framework, but
most organizations choose an existing framework that fits their needs. In this article,
we’ll show key considerations for selecting the right machine learning framework for
your project and briefly review four popular ML frameworks.

What are the top machine learning frameworks


Apache Singa
Apache Singa is a general distributed deep learning platform for training big deep
learning models over large datasets. It is designed with an intuitive programming
model based on the layer abstraction. A variety of popular deep learning models are
supported, namely feed-forward models including convolutional neural networks
(CNN), energy models like restricted Boltzmann machine (RBM), and recurrent
neural networks (RNN). Many built-in layers are provided for users.
Amazon Machine Learning
Amazon Machine Learning is a service that makes it easy for developers of all skill
levels to use machine learning technology. Amazon Machine Learning provides
visualization tools and wizards that guide you through the process of creating machine
learning (ML) models without having to learn complex ML algorithms and
technology. It connects to data stored in Amazon S3, Redshift, or RDS, and can run
binary classification, multiclass categorization, or regression on said data to create a
model.
Azure ML Studio
Azure ML Studio allows Microsoft Azure users to create and train models, then turn
them into APIs that can be consumed by other services. Users get up to 10GB of
storage per account for model data, although you can also connect your own Azure
storage to the service for larger models. A wide range of algorithms are available,
courtesy of both Microsoft and third parties. You don’t even need an account to try
out the service; you can log in anonymously and use Azure ML Studio for up to eight
hours.
Caffe
Caffe is a deep learning framework made with expression, speed, and modularity in
mind. It is developed by the Berkeley Vision and Learning Center (BVLC) and by
community contributors. Yangqing Jia created the project during his PhD at UC
Berkeley. Caffe is released under the BSD 2-Clause license. Models and optimization
are defined by configuration without hard-coding & user can switch between CPU
and GPU. Speed makes Caffe perfect for research experiments and industry
deployment. Caffe can process over 60M images per day with a single NVIDIA K40
GPU.
H2O
H2O makes it possible for anyone to easily apply math and predictive analytics to
solve today’s most challenging business problems. It intelligently combines unique
features not currently found in other machine learning platforms including: Best of
Breed Open Source Technology, Easy-to-use WebUI and Familiar Interfaces, Data
Agnostic Support for all Common Database and File Types. With H2O, you can work
with your existing languages and tools. Further, you can extend the platform
seamlessly into your Hadoop environments.
Massive Online Analysis
Massive Online Analysis (MOA) is the most popular open source framework for data
stream mining, with a very active growing community. It includes a collection of
machine learning algorithms (classification, regression, clustering, outlier
detection, concept drift detection and recommender systems) and tools for evaluation.
Related to the WEKA project, MOA is also written in Java, while scaling to more
demanding problems.
KDD process model

What is the KDD Process?
The term Knowledge Discovery in Databases, or KDD for short, refers to the broad
process of finding knowledge in data, and emphasizes the "high-level" application of
particular data mining methods. It is of interest to researchers in machine learning,
pattern recognition, databases, statistics, artificial intelligence, knowledge acquisition
for expert systems, and data visualization.
The unifying goal of the KDD process is to extract knowledge from data in the context
of large databases.
It does this by using data mining methods (algorithms) to extract (identify) what is
deemed knowledge, according to the specifications of measures and thresholds, using a
database along with any required preprocessing, subsampling, and transformations of
that database.

An Outline of the Steps of the KDD Process

The overall process of finding and interpreting patterns from data involves the repeated
application of the following steps:
1. Developing an understanding of
o the application domain
o the relevant prior knowledge
o the goals of the end-user
2. Creating a target data set: selecting a data set, or focusing on a subset of
variables, or data samples, on which discovery is to be performed.
3. Data cleaning and preprocessing.
o Removal of noise or outliers.
o Collecting necessary information to model or account for noise.
o Strategies for handling missing data fields.
o Accounting for time sequence information and known changes.
4. Data reduction and projection.
o Finding useful features to represent the data depending on the goal of
the task.
o Using dimensionality reduction or transformation methods to reduce
the effective number of variables under consideration or to find
invariant representations for the data.
5. Choosing the data mining task.
o Deciding whether the goal of the KDD process is classification,
regression, clustering, etc.
6. Choosing the data mining algorithm(s).
o Selecting method(s) to be used for searching for patterns in the data.
o Deciding which models and parameters may be appropriate.
o Matching a particular data mining method with the overall criteria of
the KDD process.
7. Data mining.
o Searching for patterns of interest in a particular representational form
or a set of such representations as classification rules or trees,
regression, clustering, and so forth.
8. Interpreting mined patterns.
9. Consolidating discovered knowledge.
CRISP-DM
CRISP-DM stands for cross-industry process for data mining. The CRISP-DM
methodology provides a structured approach to planning a data mining project. It is a
robust and well-proven methodology, valued for its powerful practicality, its flexibility
and its usefulness when using analytics to solve thorny business issues. It is the golden
thread that runs through almost every data mining engagement. The six phases of the
CRISP-DM model are outlined below.
This model is an idealised sequence of events. In practice many of the tasks can be
performed in a different order and it will often be necessary to backtrack to previous
tasks and repeat certain actions. The model does not try to capture all possible routes
through the data mining process.
You can jump to more information about each phase of the process here:
1. Business understanding
2. Data understanding
3. Data preparation
4. Modeling
5. Evaluation
6. Deployment
What are the 6 CRISP-DM Phases?
I. Business Understanding
Any good project starts with a deep understanding of the customer’s needs. Data
mining projects are no exception and CRISP-DM recognizes this.
The Business Understanding phase focuses on understanding the objectives and
requirements of the project. Aside from the third task, the three other tasks in this
phase are foundational project management activities that are universal to most
projects:
1. Determine business objectives: You should first “thoroughly understand,
from a business perspective, what the customer really wants to accomplish.”
(CRISP-DM Guide) and then define business success criteria.
2. Assess situation: Determine resources availability, project requirements,
assess risks and contingencies, and conduct a cost-benefit analysis.
3. Determine data mining goals: In addition to defining the business objectives,
you should also define what success looks like from a technical data mining
perspective.
4. Produce project plan: Select technologies and tools and define detailed plans
for each project phase.
While many teams hurry through this phase, establishing a strong business
understanding is like building the foundation of a house – absolutely essential.
II. Data Understanding
Next is the Data Understanding phase. Adding to the foundation of Business
Understanding, it drives the focus to identify, collect, and analyze the data sets that
can help you accomplish the project goals. This phase also has four tasks:
1. Collect initial data: Acquire the necessary data and (if necessary) load it into
your analysis tool.
2. Describe data: Examine the data and document its surface properties like data
format, number of records, or field identities.
3. Explore data: Dig deeper into the data. Query it, visualize it, and identify
relationships among the data.
4. Verify data quality: How clean/dirty is the data? Document any quality
issues.
III. Data Preparation
A common rule of thumb is that 80% of the project is data preparation.
This phase, which is often referred to as “data munging”, prepares the final data set(s)
for modeling. It has five tasks:
1. Select data: Determine which data sets will be used and document reasons for
inclusion/exclusion.
2. Clean data: Often this is the lengthiest task. Without it, you’ll likely fall
victim to garbage-in, garbage-out. A common practice during this task is to
correct, impute, or remove erroneous values.
3. Construct data: Derive new attributes that will be helpful. For example,
derive someone’s body mass index from height and weight fields.
4. Integrate data: Create new data sets by combining data from multiple
sources.
5. Format data: Re-format data as necessary. For example, you might convert
string values that store numbers to numeric values so that you can perform
mathematical operations.
IV. Modeling
What is widely regarded as data science’s most exciting work is also often the shortest
phase of the project. Here you’ll likely build and assess various models based on
several different modeling techniques. This phase has four tasks:
1. Select modeling techniques: Determine which algorithms to try (e.g.
regression, neural net).
2. Generate test design: Pending your modeling approach, you might need to
split the data into training, test, and validation sets.
3. Build model: As glamorous as this might sound, this might just be executing
a few lines of code like “reg = LinearRegression().fit(X, y)”.
4. Assess model: Generally, multiple models are competing against each other,
and the data scientist needs to interpret the model results based on domain
knowledge, the pre-defined success criteria, and the test design.
Although the CRISP-DM Guide suggests to “iterate model building and assessment
until you strongly believe that you have found the best model(s)”, in practice teams
should continue iterating until they find a “good enough” model, proceed through the
CRISP-DM lifecycle, then further improve the model in future iterations.
V. Evaluation
Whereas the Assess Model task of the Modeling phase focuses on technical model
assessment, the Evaluation phase looks more broadly at which model best meets the
business and what to do next. This phase has three tasks:
1. Evaluate results: Do the models meet the business success criteria? Which
one(s) should we approve for the business?
2. Review process: Review the work accomplished. Was anything overlooked?
Were all steps properly executed? Summarize findings and correct anything if
needed.
3. Determine next steps: Based on the previous three tasks, determine whether
to proceed to deployment, iterate further, or initiate new projects.
VI. Deployment
“Depending on the requirements, the deployment phase can be as simple as
generating a report or as complex as implementing a repeatable data mining process
across the enterprise.”
-CRISP-DM Guide
A model is not particularly useful unless the customer can access its results. The
complexity of this phase varies widely. This final phase has four tasks:
1. Plan deployment: Develop and document a plan for deploying the model
2. Plan monitoring and maintenance: Develop a thorough monitoring and
maintenance plan to avoid issues during the operational phase (or post-project
phase) of a model
3. Produce final report: The project team documents a summary of the project
which might include a final presentation of data mining results.
4. Review project: Conduct a project retrospective about what went well, what
could have been better, and how to improve in the future.

SEMMA

SEMMA describes the process of data mining in five steps
(Sample, Explore, Modify, Model, and Assess), which give the methodology its acronym.
You can use the SEMMA data mining methodology to solve a wide range of business
problems, including fraud identification, customer retention and turnover, database
marketing, customer loyalty, bankruptcy forecasting, market segmentation, as well as
risk, affinity, and portfolio analysis.

Why SEMMA?
Businesses use the SEMMA methodology on their data mining and machine learning
projects to achieve a competitive advantage, improve performance, and deliver more
useful services to customers. The data we collect about our surroundings serve as the
foundation for hypotheses and models of the world we live in.
Ultimately, data is accumulated to help in collecting knowledge. That means the data
is not worth much until it is studied and analyzed. But hoarding vast volumes of data
is not equivalent to gathering valuable knowledge. It is only when data is sorted and
evaluated that we learn anything from it.
Thus, SEMMA is designed as a data science methodology to help practitioners
convert data into knowledge.
The 5 Stages Of SEMMA
SEMMA was developed by SAS and is closely associated with its SAS Enterprise Miner
tool. Although the process can seem less concrete to those not using that tool, most
practitioners regard SEMMA as a general data mining methodology rather than a
tool-specific workflow.
The process breaks down into its own set of stages. These include:
 Sample: This step entails choosing a subset of the appropriate volume dataset
from a vast dataset that has been given for the model’s construction. The goal
of this initial stage of the process is to identify variables or factors (both
dependent and independent) influencing the process. The collected
information is then sorted into preparation and validation categories.
 Explore: During this step, univariate and multivariate analysis is conducted in
order to study interconnected relationships between data elements and to
identify gaps in the data. While the multivariate analysis studies the
relationship between variables, the univariate one looks at each factor
individually to understand its part in the overall scheme. All of the influencing
factors that may influence the study’s outcome are analyzed, with heavy
reliance on data visualization.
 Modify: In this step, lessons learned in the exploration phase from the data
collected in the sample phase are derived with the application of business
logic. In other words, the data is parsed and cleaned, being then passed onto
the modeling stage, and explored if the data requires refinement and
transformation.
 Model: With the variables refined and data cleaned, the modeling step applies
a variety of data mining techniques in order to produce a projected model of
how this data achieves the final, desired outcome of the process.
 Assess: In this final SEMMA stage, the model is evaluated for how useful and
reliable it is for the studied topic. The data can now be tested and used to
estimate the efficacy of its performance.

Machine Learning packages :-


Top 10 Python Packages for Machine Learning
Top 10 ML Frameworks and Libraries
A Python framework is an interface or tool that allows developers to build ML
models easily, without getting into the depth of the underlying algorithms.
Python libraries are specific files containing pre-written code that can be imported
into your code base by using Python’s import feature. This increases your code
reusability.
A Python framework can be a collection of libraries intended to build a model (e.g.,
machine learning) easily, without having to know the details of the underlying
algorithms. An ML developer, however, must at least know how the algorithms work
in order to know what results to expect, as well as how to validate them.

1) Matplotlib
Matplotlib is an interactive, cross-platform library for two-dimensional plotting. It
can produce high-quality graphs, charts and plots in several hardcopy formats.
Advantages:

 Flexible usage: supports both Python and IPython shells, Python scripts,
Jupyter Notebook, web application servers and many GUI toolkits (GTK+,
Tkinter, Qt, and wxPython).
 Optionally provides a MATLAB-like interface for simple plotting.
 The object-oriented interface gives complete control of axes properties, font
properties, line styles, etc.
 Compatible with several graphics backends and operating systems.
 Matplotlib is frequently incorporated in other libraries, such as Pandas.
2) Natural Language Toolkit (NLTK)
NLTK is a framework and suite of libraries for developing both symbolic and
statistical Natural Language Processing (NLP) in Python. It is the standard tool for
NLP in Python.
Advantages:

 The Python library contains graphical examples, as well as sample data.


 Includes a book and cookbook, making it easier for beginners to pick up.
 Provides support for different ML operations like classification, parsing, and
tokenization functionalities, etc.
 Acts as a platform for prototyping and building research systems.
 Compatible with several languages.

3) Pandas

Pandas is a Python library providing high-performance, easy-to-use data
structures and data analysis tools for the Python programming language.
Advantages:

 Expressive, fast, and flexible data structures.


 Supports aggregations, concatenations, iteration, re-indexing, and
visualizations operations.
 Very flexible usage in conjunction with other Python libraries.
 Intuitive data manipulation using minimal commands.
 Supports a wide range of commercial and academic domains.
 Optimized for performance.

4) Scikit-learn

The Python library, Scikit-Learn, is built on top of the matplotlib, NumPy, and SciPy
libraries. This Python ML library has several tools for data analysis and data mining
tasks.
Advantages:

 Simple, easy to use, and effective.


 In rapid development, and constantly being improved.
 Wide range of algorithms, including clustering, factor analysis, principal
component analysis, and more.
 Can extract data from images and text.
 Can be used for NLP.

5) Seaborn

Seaborn is a library for making statistical graphs in Python. It is built on top of
matplotlib and is also integrated with pandas data structures.
Advantages:

 Gives more attractive graphs than matplotlib.


 Has built-in plots that matplotlib lacks.
 Uses less code to visualize graphs.
 Smooth integration with Pandas: data visualization and analysis combined!
6) NumPy

NumPy adds multi-dimensional array and matrix processing to Python, as well as a
large collection of high-level mathematical functions. It is commonly used for
scientific computing and hence is one of the most used Python packages for machine
learning.
Advantages:

 Intuitive and interactive.


 Offers Fourier transforms, random number capabilities, and other tools for
integrating computing languages like C/C++ and Fortran.
 Versatility – other ML libraries like scikit-learn and TensorFlow use NumPy
arrays as input; data manipulation packages like Pandas use NumPy under the
hood.
 Has terrific open-source community support/contributions.
 Simplifies complex mathematical implementations.

7) Keras

Keras is a very popular ML library for Python, providing a high-level neural network API
capable of running on top of TensorFlow, CNTK, or Theano.
Advantages:

 Great for experimentation and quick prototyping.


 Portable.
 Offers easy expression of neural networks.
 Great for use in modeling and visualization.
8) SciPy

SciPy is a very popular ML library with different modules for optimization, linear
algebra, integration and statistics.
Advantages:

 Great for image manipulation.


 Provides easy handling of mathematical operations.
 Offers efficient numerical routines, including numerical integration and
optimization.
 Supports signal processing.

9) Pytorch

PyTorch is a popular ML library for Python based on Torch, an ML library
implemented in C and wrapped in Lua. It was originally developed by
Facebook, but is now used by Twitter, Salesforce, and many other major
organizations and businesses.
Advantages:

 Contains tools and libraries that support Computer Vision, NLP , Deep
Learning, and many other ML programs.
 Developers can perform computations on Tensors with GPU acceleration.
 Helps in creating computational graphs.
 Modeling process is simple and transparent.
 The default “define-by-run” mode is more like traditional programming.
 Uses common debugging tools such as pdb, ipdb or PyCharm debugger.
 Uses a lot of pre-trained models and modular parts that are easy to combine.
10) TensorFlow

Originally developed by Google, TensorFlow is an open-source library for
high-performance numerical computation using data flow graphs.
Under the hood, it is a framework for creating and running computations
involving tensors. The principal application for TensorFlow is in neural
networks, and especially deep learning, where it is widely used. That makes it one
of the most important Python packages for machine learning.
Advantages:

 Supports reinforcement learning and other algorithms.


 Provides computational graph abstraction.
 Offers a very large community.
 Provides TensorBoard, which is a tool for visualizing ML models directly in
the browser.
 Production ready.
 Can be deployed on multiple CPUs and GPUs.

Python libraries for Machine Learning:-

Machine Learning, as the name suggests, is the science of programming a
computer so that it is able to learn from different kinds of data. A more
general definition given by Arthur Samuel is – “Machine Learning is the field of
study that gives computers the ability to learn without being explicitly
programmed.” It is typically used to solve various types of real-life problems.
In the older days, people used to perform Machine Learning tasks by manually
coding all the algorithms and mathematical and statistical formulas. This made
the process time-consuming, tedious, and inefficient. Today, the work is far
easier and more efficient thanks to the many Python libraries, frameworks, and
modules available. Python is now one of the most popular programming languages
for this task and has replaced many languages in the industry, one of the reasons
being its vast collection of libraries. Python libraries that are used in Machine
Learning are:

 Numpy
 Scipy
 Scikit-learn
 Theano
 TensorFlow
 Keras
 PyTorch
 Pandas
 Matplotlib

Introduction to Cognitive Intelligence :-


Cognitive intelligence is referred to as human mental ability and understanding
developed through thinking, experiences and senses. It is the ability to generate
knowledge by using existing information. It also includes other intellectual
functions such as attention, learning, memory, judgment and reasoning.
Cognitive intelligence is the ability of the human brain to digest information and
form intelligence and meaning. Hence, measuring cognitive intelligence is
crucial for organizations undertaking recruitment, as it determines whether an
applicant has the aptitude to perform well at work that requires significant
cognitive ability. It is said that cognitive intelligence uses existing knowledge
that grows with practice and different experiences.

Cognitive intelligence is an advanced stage of intelligence science. It aims to


conduct in-depth mechanism research and computer simulation on human natural
language, knowledge expression, logical reasoning, autonomous learning, and
other abilities, so as to promote machines to have similar human intelligence and
even have the knowledge accumulation and application ability of human experts
in various fields.
As the most basic tool for humans to express and exchange ideas, natural
language exists everywhere in human social activities. Natural language
processing (NLP) is the theory and technology of processing human language by
computer. As an important high-level research direction of language information
processing technology, NLP has always been the core topic in the field of
artificial intelligence. NLP is also one of the most difficult problems due to its
polysemy, being context related, fuzziness, nonsystematicness, close correlation
with the environment, and wide range of knowledge involved.
Features of Cognitive Intelligence :-

What is Cognitive Intelligence:


Cognitive computing represents self-learning systems that utilize machine
learning models to mimic the way the brain works. Eventually, this technology
will facilitate the creation of automated IT models that are capable of
solving problems without human assistance.

FEATURES OF COGNITIVE INTELLIGENCE :-


Cognitive Intelligence or Cognitive Computing consortium has
recommended the following features for the computing systems –
1. Adaptive
This is the first step in making a machine learning based cognitive system.
The solutions should mimic the ability of human brain to learn and adapt
from the surroundings. The systems can’t be programmed for an isolated
task. It needs to be dynamic in data gathering, understanding goals, and
requirements.
2. Interactive
Similar to brain the cognitive solution must interact with all elements in the
system – processor, devices, cloud services and user. Cognitive systems
should interact bi-directionally. It should understand human input and
provide relevant results using natural language processing and deep
learning. Some skilled intelligent chatbots such as Mitsuku have already
achieved this feature.
3. Iterative and stateful
The system should “remember” previous interactions in a process and
return information that is suitable for the specific application at that point
in time. It should be able to define the problem by asking questions or
finding an additional source. This feature requires careful application of data
quality and validation methodologies to ensure that the system is always
provided with enough information and that the data sources it operates on
deliver reliable and up-to-date input.
4. Contextual
They must understand, identify, and extract contextual elements such as
meaning, syntax, time, location, appropriate domain, regulations, user’s
profile, process, task, and goal. They may draw on multiple sources of
information, including both structured and unstructured digital
information, as well as sensory inputs (visual, gestural, auditory, or sensor-
provided).
UNIT -4
Supervised Learning

Introduction to classification

What is Classification in Machine Learning


Classification is a supervised machine learning method where the model tries to
predict the correct label of a given input data. In classification, the model is fully
trained using the training data, and then it is evaluated on test data before being used
to perform prediction on new unseen data.
For instance, an algorithm can learn to predict whether a given email is spam or ham
(not spam).

Before diving into the classification concept, we will first understand the difference
between the two types of learners in classification: lazy and eager learners. Then we
will clarify the misconception between classification and regression.
Lazy Learners Vs. Eager Learners
There are two types of learners in machine learning classification: lazy and eager
learners.
Eager learners are machine learning algorithms that first build a model from the
training dataset before making any prediction on future datasets. They spend more
time during the training process because of their eagerness to have a better
generalization during the training from learning the weights, but they require less time
to make predictions.
Most machine learning algorithms are eager learners, and below are some examples:
 Logistic Regression.
 Support Vector Machine.
 Decision Trees.
 Artificial Neural Networks.
Lazy learners or instance-based learners, on the other hand, do not create any
model immediately from the training data, and this is where the lazy aspect comes
from. They just memorize the training data, and each time there is a need to make a
prediction, they search for the nearest neighbor from the whole training data, which
makes them very slow during prediction. Some examples of this kind are:
 K-Nearest Neighbor.
 Case-based reasoning.

Machine Learning Classification Vs. Regression


There are four main categories of Machine Learning algorithms: supervised,
unsupervised, semi-supervised, and reinforcement learning.
Even though classification and regression are both from the category of
supervised learning, they are not the same.
 The prediction task is a classification when the target variable is discrete. An
application is the identification of the underlying sentiment of a piece of text.
 The prediction task is a regression when the target variable is continuous. An
example can be the prediction of the salary of a person given their education
degree, previous work experience, geographical location, and level of
seniority.
If you are interested in knowing more about classification, courses
on Supervised Learning with scikit-learn and Supervised Learning in R
might be helpful. They provide you with a better understanding of how each
algorithm approaches tasks and the Python and R functions required to
implement them.
Regarding regression, Introduction to Regression in R and Introduction to
Regression with statsmodels in Python will help you explore different types of
regression models as well as their implementation in R and Python.
Examples of Machine Learning Classification in Real Life
Supervised Machine Learning Classification has different applications in
multiple domains of our day-to-day life. Below are some examples.
Healthcare
Training a machine learning model on historical patient data can help healthcare
specialists accurately analyze their diagnoses:
 During the COVID-19 pandemic, machine learning models were implemented
to efficiently predict whether a person had COVID-19 or not.
 Researchers can use machine learning models to predict new diseases that are
more likely to emerge in the future.
Education
Education is one of the domains dealing with the most textual, video, and audio
data. This unstructured information can be analyzed with the help of Natural
Language technologies to perform different tasks such as:
 The classification of documents per category.
 Automatic identification of the underlying language of students' documents
during their application.
 Analysis of students’ feedback sentiments about a Professor.
Transportation
Transportation is the key component of many countries' economic development.
As a result, industries are using machine and deep learning models:
 To predict which geographical location will have a rise in traffic volume.
 Predict potential issues that may occur in specific locations due to weather
conditions.
Sustainable agriculture
Agriculture is one of the most valuable pillars of human survival. Introducing
sustainability can help improve farmers' productivity at a different level without
damaging the environment:
 By using classification models to predict which type of land is suitable for a
given type of seed.
 Predict the weather to help them take proper preventive measures.

What is Linear Regression:-


Linear regression is a type of supervised machine learning algorithm that
computes the linear relationship between the dependent variable and one or more
independent features by fitting a linear equation to observed data.
When there is only one independent feature, it is known as Simple Linear
Regression, and when there are more than one feature, it is known as Multiple
Linear Regression.
Similarly, when there is only one dependent variable, it is considered Univariate
Linear Regression, while when there is more than one dependent variable, it is
known as Multivariate Regression.
Why Linear Regression is Important?
The interpretability of linear regression is a notable strength. The model’s
equation provides clear coefficients that elucidate the impact of each independent
variable on the dependent variable, facilitating a deeper understanding of the
underlying dynamics. Its simplicity is a virtue, as linear regression is transparent,
easy to implement, and serves as a foundational concept for more complex
algorithms.
Linear regression is not merely a predictive tool; it forms the basis for various
advanced models. Techniques like regularization and support vector machines
draw inspiration from linear regression, expanding its utility. Additionally, linear
regression is a cornerstone in assumption testing, enabling researchers to validate
key assumptions about the data.
Types of Linear Regression
There are two main types of linear regression:
Simple Linear Regression
This is the simplest form of linear regression, and it involves only one
independent variable and one dependent variable. The equation for simple linear
regression is:
Y = β0 + β1X
where:
 Y is the dependent variable
 X is the independent variable
 β0 is the intercept
 β1 is the slope
Multiple Linear Regression
This involves more than one independent variable and one dependent variable.
The equation for multiple linear regression is:
Y = β0 + β1X1 + β2X2 + … + βnXn
where:
 Y is the dependent variable
 X1, X2, …, Xn are the independent variables
 β0 is the intercept
 β1, β2, …, βn are the slopes
The goal of the algorithm is to find the best Fit Line equation that can
predict the values based on the independent variables.
In regression, a set of records with X and Y values is available, and these values are
used to learn a function; if you then want to predict Y for an unseen X, this
learned function can be used. Since regression has to find the value of Y, a
function is required that predicts a continuous Y given X as independent features.

Metrics for evaluating linear model:-

Regression Metrics
Machine learning is an effective tool for predicting numerical values, and
regression is one of its key applications. In the arena of regression analysis,
accurate estimation is crucial for measuring the overall performance of
predictive models. This is where the famous machine learning library
Python Scikit-Learn comes in. Scikit-Learn gives a complete set of
regression metrics to evaluate the quality of regression models.
In this article, we will explore the basics of regression metrics in
scikit-learn, discuss the steps needed to use them effectively, provide some
examples, and show the expected output for each metric.
Regression
Regression models are algorithms used to predict continuous numerical
values based on input features. In scikit-learn, we can use
numerous regression algorithms, such as Linear Regression, Decision Trees,
Random Forests, and Support Vector Machines (SVM), among others.
Before learning about specific metrics, let’s familiarize ourselves with a few
essential concepts related to regression metrics:
1. True Values and Predicted Values:
In regression, we have two sets of values to compare: the actual target
values (true values) and the values predicted by our model (predicted
values). The performance of the model is assessed by measuring
the similarity between these sets.
2. Evaluation Metrics:
Regression metrics are quantitative measures used to evaluate the quality of a
regression model. Scikit-learn provides several metrics, each with its own
strengths and limitations, to assess how well a model fits the data.
Types of Regression Metrics
Some common regression metrics in scikit-learn with examples
 Mean Absolute Error (MAE)
 Mean Squared Error (MSE)
 R-squared (R²) Score
 Root Mean Squared Error (RMSE)

1. Mean Absolute Error (MAE)


In the fields of statistics and machine learning, the Mean Absolute Error
(MAE) is a frequently employed metric. It measures the average
absolute difference between a dataset’s actual values and its predicted
values.
Mathematical Formula
The formula to calculate MAE for a data with “n” data points is:
MAE = (1/n) Σ (i = 1 to n) |xi – yi|
Where:
 xi represents the actual or observed values for the i-th data point.
 yi represents the predicted value for the i-th data point.

2. Mean Squared Error (MSE)
A popular metric in statistics and machine learning is the Mean
Squared Error (MSE). It measures the average of the squared
differences between a dataset’s actual values and its predicted
values. MSE is frequently used in regression problems
to assess how well predictive models work.
Mathematical Formula
For a dataset containing ‘n’ data points, the MSE calculation
formula is:
MSE = (1/n) Σ (i = 1 to n) (xi – yi)²
where:
 xi represents the actual or observed value for the i-th data point.
 yi represents the predicted value for the i-th data point.
3. R-squared (R²) Score
A statistical metric frequently used to assess the goodness of fit of a
regression model is the R-squared (R²) score, also referred to as the
coefficient of determination. It quantifies the proportion of the
variation in the dependent variable that is explained by the model’s
independent variables. R² is a useful statistic for evaluating the
overall effectiveness and explanatory power of a regression model.
Mathematical Formula
The formula to calculate the R-squared score is as follows:
R² = 1 – (SSR / SST)
Where:
 R2 is the R-Squared.
 SSR represents the sum of squared residuals between the predicted values
and actual values.
 SST represents the total sum of squares, which measures the total
variance in the dependent variable.

4. Root Mean Squared Error (RMSE)


RMSE stands for Root Mean Squared Error. It is a commonly used metric in
regression analysis and machine learning to measure the accuracy or
goodness of fit of a predictive model, especially when the predictions are
continuous numerical values.
The RMSE quantifies how well the predicted values from a model align
with the actual observed values in the dataset. Here’s how it works:
1. Calculate the Squared Differences: For each data point, subtract the
predicted value from the actual (observed) value, square the result, and
sum up these squared differences.
2. Compute the Mean: Divide the sum of squared differences by the number
of data points to get the mean squared error (MSE).
3. Take the Square Root: To obtain the RMSE, simply take the square root
of the MSE.
Mathematical Formula
The formula for RMSE for a data with ‘n’ data points is as follows:
RMSE = √[ (1/n) Σ (i = 1 to n) (xi – yi)² ]
Where:
 RMSE is the Root Mean Squared Error.
 xi represents the actual or observed value for the i-th data point.
 yi represents the predicted value for the i-th data point.

Multivariate regression
Multivariate Multiple Regression is a method of modeling multiple
responses, or dependent variables, with a single set of predictor variables.
For example, we might want to model both math and reading SAT scores
as a function of gender, race, parent income, and so forth.

Multivariate Regression
The goal of any data analysis is to extract accurate estimates from raw
information. One of the most important and common questions is whether
there is a statistical relationship between a response variable (Y) and
explanatory variables (Xi). One option for answering this question is to
employ regression analysis in order to model the relationship. The fitted
model can then be used to predict the response variable for any arbitrary set of
explanatory variables.
The Problem:
Multivariate Regression is one of the simplest machine learning
algorithms. It comes under the class of supervised learning algorithms,
i.e., we are provided with a labeled training dataset. Some of the problems that
can be solved using this model are:
 A researcher has collected data on three psychological variables, four
academic variables (standardized test scores), and the type of educational
program the student is in for 600 high school students. She is interested in
how the set of psychological variables is related to the academic variables
and the type of program the student is in.
 A doctor has collected data on cholesterol, blood pressure, and
weight. She also collected data on the eating habits of the subjects (e.g.,
how many ounces of red meat, fish, dairy products, and chocolate
consumed per week). She wants to investigate the relationship between
the three measures of health and eating habits.
 A property dealer wants to set housing prices based on various
factors like the size of the house, the number of bedrooms, the age of the house,
etc. We shall discuss the algorithm further using this example.
The Solution:
The solution is divided into various parts.
 Selecting the features: Finding the features on which a response variable
depends (or not) is one of the most important steps in Multivariate
Regression. To make our analysis simple, we assume that the features on
which the response variable is dependent are already selected.
 Normalizing the features: The features are then scaled to bring them into
the range (0,1) for better analysis. This can be done by rescaling the value of
each feature, for example with min-max normalization.
 Selecting Hypothesis and Cost function: A hypothesis is a predicted value
of the response variable, represented by h(x). The cost function defines the
penalty for a wrong prediction of the hypothesis and should be as small as
possible. We choose the hypothesis function as a linear combination of the
features X, as in the sketch below.
Non-Linear Regression

Understanding Nonlinear Regression with Examples


Here, we will see some examples of non-linear regression in machine learning.
Non-linear models are widely used in regression analysis because most
real-world data follow highly complex, non-linear relationships
between the dependent and independent variables.

Table of Content

Non-linear regression in Machine Learning


Assumptions in NonLinear Regression
Types of Non-Linear Regression
Applications of Non-Linear Regression
Advantages & Disadvantages of Non-Linear Regression
Frequently Asked Questions (FAQs) on Non-Linear Regression
Non-linear regression in Machine Learning
Nonlinear regression refers to a broader category of regression models
where the relationship between the dependent variable and the independent
variables is not assumed to be linear. If the underlying pattern in the data
exhibits a curve, whether it’s exponential growth, decay, logarithmic, or any
other non-linear form, fitting a nonlinear regression model can provide a
more accurate representation of the relationship. This is because in linear
regression it is pre-assumed that the data is linear.

A nonlinear regression model can be expressed as:
Y = f(X, β) + ε
Where:
 f(X, β): the regression function.
 X: the vector of independent variables, which are used to predict the
dependent variable.
 β: the vector of parameters that the model aims to estimate. These
parameters determine the shape and characteristics of the regression
function.
 ε: the error term.
Many different regressions exist and can be used to fit whatever the dataset
looks like such as quadratic, cubic regression, and so on to infinite degrees
according to our requirement.
Assumptions in NonLinear Regression
These assumptions are similar to those in linear regression but may have
nuanced interpretations due to the nonlinearity of the model. Here are the
key assumptions in nonlinear regression:

 Functional Form: The chosen nonlinear model correctly represents the true
relationship between the dependent and independent variables.
 Independence: Observations are assumed to be independent of each other.
 Homoscedasticity: The variance of the residuals (the differences between
observed and predicted values) is constant across all levels of the
independent variable.
 Normality: Residuals are assumed to be normally distributed.
 No multicollinearity: Independent variables are not perfectly correlated.
Types of Non-Linear Regression
There are two main types of Non Linear regression in Machine Learning:

Parametric non-linear regression assumes that the relationship between the


dependent and independent variables can be modeled using a specific
mathematical function. For example, the relationship between the
population of a country and time can be modeled using an exponential
function. Some common parametric non-linear regression models include:
Polynomial regression, Logistic regression, Exponential regression, Power
regression etc.
Non-parametric non-linear regression does not assume that the relationship
between the dependent and independent variables can be modeled using a
specific mathematical function. Instead, it uses machine learning algorithms
to learn the relationship from the data. Some common non-parametric non-
linear regression algorithms include: Kernel smoothing, Local polynomial
regression, Nearest neighbor regression etc.

Applications of Non-Linear Regression


As we know that most of the real-world data is non-linear and hence non-linear
regression techniques are far better than linear regression techniques. Non-Linear
regression techniques help to get a robust model whose predictions are reliable
and as per the trend followed by the data in history. Tasks related to exponential
growth or decay of a population, financial forecasting, and logistic pricing model
were all successfully accomplished by the Non-Linear Regression techniques.
1. The insurance industry makes use of it. Its application is seen, for instance, in
the IBNR reserve computation.
2. In the field of agricultural research, it is crucial. Considering that nonlinear
models more accurately represent numerous crops and soil dynamics than
linear ones.
3. There are uses for nonlinear models in forestry research because the majority
of biological processes are inherently nonlinear. An example would be a
straightforward power function that relates a tree’s weight or volume to its
height or diameter.
4. It is employed in the framing of the problem and the derivation of statistical
solutions to the calibration problem in research and development.
5. One example from the world of chemistry is the development of a wide-range
colorless gas, HCFC-22 formulation, using a nonlinear model.

Advantages of Non-Linear Regression


1. Non-linear regression can model relationships that are not linear in nature.
2. Non-linear regression can be used to make predictions about the dependent
variable based on the values of the independent variables.
3. Non-linear regression can be used to identify the factors that influence the
dependent variable.

K-Nearest Neighbour

The k-nearest neighbors (KNN) algorithm is a non-parametric, supervised


learning classifier, which uses proximity to make classifications or predictions
about the grouping of an individual data point. It is one of the popular and
simplest classification and regression classifiers used in machine learning today.

K-Nearest Neighbor(KNN) Algorithm:-


The K-Nearest Neighbors (KNN) algorithm is a supervised machine learning
method employed to tackle classification and regression problems. Evelyn Fix
and Joseph Hodges developed this algorithm in 1951, which was subsequently
expanded by Thomas Cover. The article explores the fundamentals, workings,
and implementation of the KNN algorithm.
What is the K-Nearest Neighbors Algorithm
KNN is one of the most basic yet essential classification algorithms in machine
learning. It belongs to the supervised learning domain and finds intense
application in pattern recognition, data mining, and intrusion detection.
It is widely applicable in real-life scenarios since it is non-parametric, meaning
it does not make any underlying assumptions about the distribution of data (as
opposed to other algorithms such as GMM, which assume a Gaussian
distribution of the given data). We are given some prior data (also called training
data), which classifies coordinates into groups identified by an attribute.
As an example, consider a set of training data points containing two features,
where each point is labelled with a colour class (say, ‘Red’ or ‘Green’).
Now, given another set of data points (also called testing data), allocate these
points to a group by analyzing the training set. Note that the unclassified points
are marked as ‘White’.
Intuition Behind KNN Algorithm
If we plot these points on a graph, we may be able to locate some clusters or
groups. Now, given an unclassified point, we can assign it to a group by
observing what group its nearest neighbors belong to. This means a point close
to a cluster of points classified as ‘Red’ has a higher probability of getting
classified as ‘Red’.
Intuitively, we can see that the first point (2.5, 7) should be classified as ‘Green’,
and the second point (5.5, 4.5) should be classified as ‘Red’.
Why do we need a KNN algorithm?
The K-Nearest Neighbors (K-NN) algorithm is a versatile and widely used machine learning algorithm that
is primarily used for its simplicity and ease of implementation. It does not require
any assumptions about the underlying data distribution. It can also handle both
numerical and categorical data, making it a flexible choice for various types of
datasets in classification and regression tasks. It is a non-parametric method that
makes predictions based on the similarity of data points in a given dataset. K-NN
is less sensitive to outliers compared to other algorithms.
The K-NN algorithm works by finding the K nearest neighbors to a given data
point based on a distance metric, such as Euclidean distance. The class or value
of the data point is then determined by the majority vote or average of the K
neighbors. This approach allows the algorithm to adapt to different patterns and
make predictions based on the local structure of the data.

Decision Trees

Decision Trees (DTs) are a non-parametric supervised learning method used


for classification and regression. The goal is to create a model that predicts the
value of a target variable by learning simple decision rules inferred from the data
features. A tree can be seen as a piecewise constant approximation.
For instance, in the example below, decision trees learn from data to approximate
a sine curve with a set of if-then-else decision rules. The deeper the tree, the more
complex the decision rules and the fitter the model.

Some advantages of decision trees are:


 Simple to understand and to interpret. Trees can be visualized.
 Requires little data preparation. Other techniques often require data
normalization, dummy variables need to be created and blank values to be
removed. Some tree and algorithm combinations support missing values.
 The cost of using the tree (i.e., predicting data) is logarithmic in the number of
data points used to train the tree.
 Able to handle both numerical and categorical data. However, the scikit-learn
implementation does not support categorical variables for now. Other
techniques are usually specialized in analyzing datasets that have only one type
of variable. See algorithms for more information.
 Able to handle multi-output problems.
 Uses a white box model. If a given situation is observable in a model, the
explanation for the condition is easily explained by boolean logic. By contrast,
in a black box model (e.g., in an artificial neural network), results may be more
difficult to interpret.
 Possible to validate a model using statistical tests. That makes it possible to
account for the reliability of the model.
 Performs well even if its assumptions are somewhat violated by the true model
from which the data were generated.

Logistic Regression
Logistic regression is a supervised machine learning algorithm that accomplishes
binary classification tasks by predicting the probability of an outcome, event, or
observation. The model delivers a binary or dichotomous outcome limited to two
possible outcomes: yes/no, 0/1, or true/false.

Logistic Regression in Machine Learning


Logistic regression is a supervised machine learning algorithm used
for classification tasks where the goal is to predict the probability that an
instance belongs to a given class or not. Logistic regression is a statistical
algorithm which analyze the relationship between two data factors. The article
explores the fundamentals of logistic regression, it’s types and implementations.
Table of Content
 What is Logistic Regression?
 Logistic Function – Sigmoid Function
 Types of Logistic Regression

What is Logistic Regression?


Logistic regression is used for binary classification where we use sigmoid
function, that takes input as independent variables and produces a probability
value between 0 and 1.
For example, we have two classes Class 0 and Class 1 if the value of the logistic
function for an input is greater than 0.5 (threshold value) then it belongs to Class
1 otherwise it belongs to Class 0. It’s referred to as regression because it is the
extension of linear regression but is mainly used for classification problems.
Key Points:
 Logistic regression predicts the output of a categorical dependent variable.
Therefore, the outcome must be a categorical or discrete value.
 It can be either Yes or No, 0 or 1, true or False, etc. but instead of giving the
exact value as 0 and 1, it gives the probabilistic values which lie between 0
and 1.
 In Logistic regression, instead of fitting a regression line, we fit an “S” shaped
logistic function, which predicts two maximum values (0 or 1).
Logistic Function – Sigmoid Function
 The sigmoid function is a mathematical function used to map the predicted
values to probabilities.
 It maps any real value into another value within a range of 0 and 1. The value
of the logistic regression must be between 0 and 1, which cannot go beyond
this limit, so it forms a curve like the “S” form.
 The S-form curve is called the Sigmoid function or the logistic function.
 In logistic regression, we use the concept of the threshold value, which defines
the probability of either 0 or 1. Such as values above the threshold value tends
to 1, and a value below the threshold values tends to 0.
Types of Logistic Regression
On the basis of the categories, Logistic Regression can be classified into three
types:
1. Binomial: In binomial Logistic regression, there can be only two possible
types of the dependent variables, such as 0 or 1, Pass or Fail, etc.
2. Multinomial: In multinomial Logistic regression, there can be 3 or more
possible unordered types of the dependent variable, such as “cat”, “dogs”, or
“sheep”
3. Ordinal: In ordinal Logistic regression, there can be 3 or more possible
ordered types of dependent variables, such as “low”, “Medium”, or “High”.

Support Vector Machines


Support vector machines (SVMs) are a set of supervised learning methods used
for classification, regression and outliers detection.
The advantages of support vector machines are:
 Effective in high dimensional spaces.
 Still effective in cases where number of dimensions is greater than the number
of samples.
 Uses a subset of training points in the decision function (called support
vectors), so it is also memory efficient.
 Versatile: different Kernel functions can be specified for the decision function.
Common kernels are provided, but it is also possible to specify custom kernels

What Is Model Evaluation


Model evaluation in machine learning is the process of determining a model’s
performance via a metrics-driven analysis. It can be performed in two ways:
 Offline: The model is evaluated after training during experimentation
or continuous retraining.
 Online: The model is evaluated in production as part of model monitoring.
The metrics selection for the analysis varies depending on the data, algorithm, and
use case.
For supervised learning, the metrics are categorized with respect to classification
and regression. Classification metrics are based on the confusion matrix, such
as accuracy, precision, recall, and f1-score; regression metrics are based on errors,
such as mean absolute error (MAE) and root mean squared errors (RMSE).
For unsupervised learning, the metrics aim to define the cohesion, separation,
confidence, and error in the output. For example, the silhouette measure is used
for clustering in order to measure how similar a data point is to its own cluster
relative to its similarity to other clusters.
For both learning approaches, and necessarily for the latter, model evaluation
metrics are extended during experimentation with visualizations and manual
analysis of (groups of) data points. Domain experts are often required to support
this evaluation.

Applications of supervised learning in multiple domains


Real-Life Applications of Supervised Learning
 Spam Filtering. Supervised learning is extensively used in email spam filtering
systems. ...
 Image Classification. Image classification tasks involve categorizing images
into predefined classes or categories. ...
 Medical Diagnosis. ...
 Fraud Detection. ...
 Natural Language Processing.
Application of Supervised Learning
There are many applications across the industry, since it provides the best
algorithms for finding accurate results.
1. Fraud Detection in Banking and Finance Sector: It helps in identifying
whether the transactions made by the users are genuine.
2. Spam detection: With the help of specific keywords and different content,
Supervised Learning can easily detect emails if it is spam. It recognizes certain
keywords and sends them into the spam category.
3. Bioinformatics: The biggest application is to store the biological information
of human beings. This could be information related to fingertips, eyes, swabs,
iris textures, and a lot more.
4. Object recognition: Another application is reCAPTCHA (“prove you are not a
robot”). Here, you choose multiple images to confirm you are a human. You can
access certain information only if you identify the images correctly; if not, you
keep trying until you get the right identifications.
UNIT - 5
Unsupervised Learning

Clustering

Clustering is an unsupervised machine learning technique designed to group


unlabeled examples based on their similarity to each other. (If the examples are
labeled, this kind of grouping is called classification.)

What is Clustering
The task of grouping data points based on their similarity with each other is called
Clustering or Cluster Analysis. This method is defined under the branch
of Unsupervised Learning, which aims at gaining insights from unlabelled data
points, that is, unlike supervised learning we don’t have a target variable.
Clustering aims at forming groups of homogeneous data points from a
heterogeneous dataset. It evaluates similarity based on a metric such as Euclidean
distance, cosine similarity, or Manhattan distance, and then groups the points
with the highest similarity scores together.
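As a small illustration (an assumption-laden sketch rather than part of the original notes), the distance metrics mentioned above can be computed with SciPy as follows; the two points are arbitrary.

# A minimal sketch, assuming SciPy; the two points are illustrative only.
from scipy.spatial.distance import euclidean, cityblock, cosine

a = [1.0, 2.0, 3.0]
b = [2.0, 4.0, 6.0]

print("Euclidean distance:", euclidean(a, b))
print("Manhattan distance:", cityblock(a, b))   # cityblock is the Manhattan metric
print("Cosine similarity :", 1 - cosine(a, b))  # SciPy returns cosine *distance*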

Types of Clustering
Broadly speaking, there are 2 types of clustering that can be performed to group
similar data points:
 Hard Clustering: In this type of clustering, each data point either belongs to a cluster completely or does not belong to it at all. For example, say there are 4 data points and we have to cluster them into 2 clusters: each data point will then belong to either cluster 1 or cluster 2.
 Soft Clustering: In this type of clustering, instead of assigning each data point to a single cluster, a probability or likelihood of that point belonging to each cluster is evaluated. For the same 4 data points and 2 clusters, we would evaluate, for every data point, the probability of it belonging to each of the two clusters (see the sketch after this list).
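The sketch below is an illustrative assumption (not from the original notes): it uses scikit-learn's KMeans for hard assignments and a Gaussian mixture model for soft, probabilistic assignments on a tiny synthetic dataset.

# A minimal sketch, assuming scikit-learn; the 2-D points are illustrative.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

X = np.array([[1.0, 1.0], [1.2, 0.8], [8.0, 8.0], [8.2, 7.9]])

# Hard clustering: every point receives exactly one cluster label
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print("Hard labels:", kmeans.labels_)

# Soft clustering: every point receives a probability for each cluster
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
print("Soft memberships:\n", gmm.predict_proba(X))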
Uses of Clustering
Now before we begin with types of clustering algorithms, we will go through the
use cases of Clustering algorithms. Clustering algorithms are majorly used for:
 Market Segmentation – Businesses use clustering to group their customers and
use targeted advertisements to attract more audience.
 Market Basket Analysis – Shop owners analyze their sales and figure out which items are frequently bought together by customers. For example, according to a widely cited study in the USA, diapers and beer were often bought together by fathers.
 Social Network Analysis – Social media sites use your data to understand your
browsing behaviour and provide you with targeted friend recommendations or
content recommendations.
 Medical Imaging – Doctors use Clustering to find out diseased areas in
diagnostic images like X-rays.
 Anomaly Detection – Clustering can be used to find outliers in a real-time data stream or to flag potentially fraudulent transactions for further review.
 Simplify working with large datasets – Each cluster is given a cluster ID after clustering is complete. An example's entire feature set can then be condensed into its cluster ID. Clustering is effective when a complicated case can be represented by a simple cluster ID, and by the same principle it can make complex datasets easier to work with.

Hierarchical clustering
Hierarchical clustering, also known as hierarchical cluster analysis, is an
algorithm that groups similar objects into groups called clusters. The endpoint is
a set of clusters, where each cluster is distinct from each other cluster, and the
objects within each cluster are broadly similar to each other.

Hierarchical Clustering in Machine Learning


Hierarchical clustering is another unsupervised machine learning algorithm, used
to group unlabeled datasets into clusters; it is also known as hierarchical cluster
analysis (HCA).
In this algorithm, we develop the hierarchy of clusters in the form of a tree, and
this tree-shaped structure is known as the dendrogram.
Sometimes the results of K-means clustering and hierarchical clustering may look
similar, but the two differ in how they work. In particular, with hierarchical
clustering there is no requirement to predetermine the number of clusters, as there
was in the K-means algorithm.
The hierarchical clustering technique has two approaches:

1. Agglomerative: Agglomerative clustering is a bottom-up approach, in which the
algorithm starts by treating every data point as its own cluster and keeps merging
clusters until only one cluster is left.
2. Divisive: The divisive algorithm is the reverse of the agglomerative algorithm,
as it is a top-down approach: it starts from one cluster containing all the points
and repeatedly splits it.
Why hierarchical clustering?
Since we already have other clustering algorithms such as K-means, why do we need
hierarchical clustering? As we have seen, K-means clustering has some challenges:
it requires a predetermined number of clusters, and it tends to create clusters of
roughly the same size. To address these two challenges, we can opt for the
hierarchical clustering algorithm, because it does not require any prior knowledge
of the number of clusters.
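As a hedged illustration (assumed libraries: scikit-learn, SciPy, and matplotlib; the toy points are made up), the sketch below performs agglomerative clustering and draws the dendrogram, from which the number of clusters can be chosen afterwards by cutting the tree.

# A minimal sketch, assuming scikit-learn, SciPy and matplotlib are installed.
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.cluster import AgglomerativeClustering

X = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]])

# Bottom-up (agglomerative) clustering into 2 groups
model = AgglomerativeClustering(n_clusters=2, linkage="ward").fit(X)
print("Cluster labels:", model.labels_)

# The dendrogram records the full merge hierarchy, so the number of clusters
# does not have to be fixed in advance; it can be read off by cutting the tree.
Z = linkage(X, method="ward")
dendrogram(Z)
plt.show()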

Partitioning Clustering:-
It is a type of clustering that divides the data into non-hierarchical groups. It is
also known as the centroid-based method. The most common example of
partitioning clustering is the K-Means Clustering algorithm.
In this type, the dataset is divided into a set of K groups, where K defines the
number of pre-defined groups. The cluster centers are chosen so that each data
point is closer to its own cluster centroid than to the centroid of any other
cluster.

Partitional clustering (or partitioning clustering) refers to clustering methods
used to classify the observations within a data set into multiple groups based
on their similarity. These algorithms require the analyst to specify the number
of clusters to be generated.
This section describes the commonly used partitional clustering methods, including:
 K-means clustering (MacQueen 1967), in which each cluster is represented by the center or mean of the data points belonging to the cluster. The K-means method is sensitive to anomalous data points and outliers.
 K-medoids clustering or PAM (Partitioning Around Medoids, Kaufman & Rousseeuw, 1990), in which each cluster is represented by one of the objects in the cluster. PAM is less sensitive to outliers than K-means.
 CLARA algorithm (Clustering Large Applications), which is an extension
to PAM adapted for large data sets.
For each of these methods, we provide:
 the basic idea and the key mathematical concepts
 the clustering algorithm and implementation in R software
 R lab sections with many examples for cluster analysis and visualization
The following R packages will be used to compute and visualize partitioning
clustering:
 stats package for computing K-means
 cluster package for computing PAM and CLARA algorithms
 factoextra for beautiful visualization of clusters

What is K-means Clustering:-


Unsupervised Machine Learning is the process of teaching a computer to
use unlabeled, unclassified data and enabling the algorithm to operate on
that data without supervision. Without any previous data training, the
machine’s job in this case is to organize unsorted data according to parallels,
patterns, and variations.
K-means clustering assigns data points to one of K clusters depending on their
distance from the centers of the clusters. It starts by randomly placing the
cluster centroids in the space. Each data point is then assigned to one of the
clusters based on its distance from the cluster centroids. After every point has
been assigned, new cluster centroids are computed. This process runs iteratively
until the clusters stop changing. In this analysis we assume that the number of
clusters is given in advance and we have to put the points into one of the groups.
In some cases, K is not clearly defined, and we have to think about the optimal
value of K. K-means performs best when the data are well separated; when data
points overlap, this kind of clustering is not suitable. K-means is faster
compared to other clustering techniques and provides strong coupling between the
data points, but it does not give clear information about the quality of the
clusters. Different initial assignments of the cluster centroids may lead to
different clusters, the algorithm is sensitive to noise, and it may get stuck in
local minima.
What is the objective of k-means clustering?
The goal of clustering is to divide the population or set of data points into a
number of groups so that the data points within each group are more similar to
one another than to the data points in the other groups. It is essentially a
grouping of things based on how similar and different they are to one another.
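To make the procedure above concrete, here is a hedged Python sketch (the synthetic blobs, the choice of K = 3, and all parameter values are assumptions for the demo) that fits K-means with scikit-learn and checks the result with the silhouette score.

# A minimal sketch, assuming scikit-learn; data and K are illustrative only.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Well-separated toy data with 3 underlying groups
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# n_init repeats the random centroid initialization several times, which
# mitigates the sensitivity to initial assignments noted above.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)

print("Centroids:\n", kmeans.cluster_centers_)
print("First 10 labels:", kmeans.labels_[:10])
print("Silhouette score:", silhouette_score(X, kmeans.labels_))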

Applications of unsupervised learning in multiple domains.

Unsupervised Learning
In artificial intelligence, machine learning that takes place in the absence of
human supervision is known as unsupervised machine learning. Unsupervised
machine learning models, in contrast to supervised learning, are given unlabeled
data and allowed to discover patterns and insights on their own, without explicit
direction or instruction.
Unsupervised machine learning analyzes and clusters unlabeled datasets using
machine learning algorithms. These algorithms find hidden patterns in data
without any human intervention, i.e., we don't give an output to our model. The
training model has only input parameter values and discovers the groups or
patterns on its own.
Some applications are:
 Market Segmentation
 Anomaly Detection
 Recommendation Systems
 Image and Document Clustering
 Genomics and Bioinformatics
 Neuroscience
 Natural Language Processing (NLP)

Applications of Unsupervised learning


 Customer segmentation: Unsupervised learning can be used to segment
customers into groups based on their demographics, behavior, or
preferences. This can help businesses to better understand their customers and
target them with more relevant marketing campaigns.
 Fraud detection: Unsupervised learning can be used to detect fraud in
financial data by identifying transactions that deviate from the expected
patterns. This can help to prevent fraud by flagging these transactions for
further investigation.
 Recommendation systems: Unsupervised learning can be used to
recommend items to users based on their past behavior or preferences. For
example, a recommendation system might use unsupervised learning to
identify users who have similar taste in movies, and then recommend movies
that those users have enjoyed.
 Natural language processing (NLP): Unsupervised learning is used in a
variety of NLP tasks, including topic modeling, document clustering, and part-
of-speech tagging.
 Image analysis: Unsupervised learning is used in a variety of image analysis
tasks, including image segmentation, object detection, and image pattern
recognition.
