Fundamentals of Machine Learning

TBC- 603
Syllabus
UNIT 4: Data Analysis and Machine Learning with Python: NumPy, SciPy, Matplotlib,
Pandas, Scikit-Learn.

NumPy Basics: a multidimensional array object (ndarray), creating ndarrays,
data types for ndarrays, basic indexing and slicing.

Getting started with pandas: Series, DataFrame and Index objects; reindexing; indexing,
selection and filtering; sorting and ranking; loading from CSV and other structured text
formats; normalizing data; dealing with missing data; data manipulation (alignment,
aggregation, and summarization); group-based operations: split-apply-combine; statistical
analysis; date and time series analysis with Pandas; visualizing data.

UNIT 5 : Validation Techniques: Hold out, K-Fold Cross Validation, Leave one out,
Bootstrapping.

Supervised Learning Algorithms: Linear Regression, Logistic Regression, Decision Trees,


Support Vector Machine, K-Nearest Neighbours, CN2 Algorithm, Naive Bayes, Artificial Neural
Networks.
SciPy (Scientific Python)

SciPy is an open-source Python library built on top of NumPy, designed for
advanced scientific and engineering applications.

It provides modules for optimization, integration, signal processing, linear algebra,
image processing, differential equations, and more. SciPy makes it easy to
implement complex mathematical algorithms with simple function calls, making it
suitable for scientific research, engineering simulations, and academic projects. It
also offers sparse matrix support and interpolation methods. With its extensive
functionality and close integration with NumPy, SciPy is a powerful tool for solving
real-world scientific problems across multiple disciplines like physics, chemistry,
biology, and mechanical engineering.
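
A minimal sketch of how SciPy is typically called (the quadratic objective and the integration bounds below are only illustrative assumptions; optimize.minimize and integrate.quad are standard SciPy functions):

import numpy as np
from scipy import optimize, integrate

# Minimize the illustrative function f(x) = (x - 3)^2, starting from x = 0
result = optimize.minimize(lambda x: (x - 3) ** 2, x0=0.0)
print(result.x)                      # approximately [3.]

# Numerically integrate sin(x) from 0 to pi (exact value is 2)
value, error = integrate.quad(np.sin, 0, np.pi)
print(value)                         # approximately 2.0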
Matplotlib

Matplotlib is a versatile data visualization library in Python, widely used for
generating static, interactive, and animated plots. It supports a variety of plot
types including line charts, bar graphs, scatter plots, histograms, and heatmaps.

The pyplot interface provides a MATLAB-like experience, making it easy for new
users. Matplotlib is highly customizable, allowing fine control over plot elements
such as labels, legends, colors, and annotations. It is often used in conjunction
with NumPy and Pandas for exploratory data analysis and reporting. Whether for
publication-quality graphs or basic data visualizations, Matplotlib remains a staple
in the scientific Python ecosystem.
Matplotlib is a comprehensive and widely used Python library for creating
visualizations, especially two-dimensional plots and graphs. It is a vital tool in the
data science and scientific computing ecosystem, enabling users to generate
high-quality static, interactive, and animated charts.

The library is particularly known for its pyplot module, which provides a
MATLAB-like interface and makes it easy to produce plots with just a few lines of
code. Matplotlib supports various types of visualizations such as line plots, bar
charts, scatter plots, histograms, pie charts, and more.

One of its biggest strengths is its high level of customizability—users can control
every aspect of a figure, including fonts, colors, line styles, and layout. It
integrates smoothly with other Python libraries like NumPy and Pandas, and
works well in Jupyter Notebooks, making it a go-to choice for data analysis and
exploration. Additionally, Matplotlib allows exporting plots in multiple formats,
such as PNG, PDF, and SVG, which is useful for both web and print publishing.
Because of its flexibility, ease of use, and broad community support, Matplotlib
remains one of the most essential libraries for data visualization in Python.
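
A small example of the pyplot interface described above (the data and the output file name are illustrative only):

import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 2 * np.pi, 100)

plt.plot(x, np.sin(x), label='sin(x)')                      # line plot
plt.scatter(x[::10], np.cos(x[::10]), label='cos samples')  # scatter plot
plt.title('A simple Matplotlib figure')
plt.xlabel('x')
plt.ylabel('y')
plt.legend()
plt.savefig('example_plot.png')                             # export as PNG
plt.show()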
Scikit-Learn

Scikit-Learn is one of the most popular machine learning libraries in Python. It
provides a wide range of supervised and unsupervised learning algorithms including
linear regression, decision trees, support vector machines, k-means clustering, and
principal component analysis. Scikit-Learn also includes tools for model evaluation,
hyperparameter tuning, preprocessing, and feature selection. Built on top of NumPy,
SciPy, and Matplotlib, it integrates well with the broader Python ecosystem. Its
consistent API and user-friendly documentation make it suitable for both beginners
and experienced practitioners. Scikit-Learn is used in academia, industry, and real-
world applications ranging from recommendation systems to fraud detection.
Scikit-Learn is a powerful, open-source Python library used for machine
learning and data mining. It provides a simple and consistent interface for a
wide range of machine learning algorithms, making it easy for both beginners
and professionals to build predictive models. It is built on top of core scientific
libraries like NumPy, SciPy, and Matplotlib, which provide the computational
and visualization backbones.
Key Features:
Supervised learning: Includes algorithms like linear regression, logistic
regression, decision trees, random forests, support vector machines (SVM), and
more.
Unsupervised learning: Supports clustering algorithms like K-means, DBSCAN,
and dimensionality reduction techniques like PCA (Principal Component
Analysis).
Model selection and evaluation: Tools for cross-validation, hyperparameter
tuning (GridSearchCV, RandomizedSearchCV), and performance metrics
(accuracy, precision, recall, F1-score).
Preprocessing: Provides data transformation tools such as standardization,
normalization, encoding categorical variables, and handling missing data.
Pipelines: Allows chaining of preprocessing steps and models to streamline
workflows and prevent data leakage.

Why Use Scikit-Learn?


Easy to use and well-documented API.
Works seamlessly with Pandas and NumPy.
Ideal for both prototyping and production-ready machine learning solutions.
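
A minimal sketch of the typical Scikit-Learn workflow (the Iris dataset and logistic regression are chosen here purely for illustration; the API calls are standard Scikit-Learn):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Load a small built-in dataset
X, y = load_iris(return_X_y=True)

# Hold-out split for evaluation
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Pipeline: standardize the features, then fit a logistic regression classifier
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=200))
model.fit(X_train, y_train)

# Evaluate on the held-out test set
print(accuracy_score(y_test, model.predict(X_test)))

The pipeline keeps preprocessing and the model together, which also helps prevent data leakage as noted above.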
NumPy (Numerical Python)

NumPy is a fundamental Python library used for
numerical and scientific computing. It introduces a powerful N-dimensional
array object called ndarray, which allows for efficient storage and manipulation
of large datasets. NumPy supports a wide range of mathematical operations
like linear algebra, statistical computations, and Fourier transforms. Its
performance is significantly faster than Python lists due to its implementation in
C. NumPy also includes broadcasting and vectorization features that make
code cleaner and more efficient. It serves as the base for many other libraries,
including Pandas, SciPy, and Scikit-Learn, making it essential for data science
and machine learning.
NumPy Basics

A Multidimensional Array Object: ndarray. At the core of NumPy is the ndarray
(N-dimensional array), which is a fast, flexible container for large datasets in
Python. Unlike regular Python lists, ndarrays are homogeneous (all elements are
of the same data type) and support vectorized operations, which are more
efficient.
Key properties of an ndarray:
.ndim: Number of dimensions (axes)
.shape: Dimensions of the array (rows, columns, etc.)
.size: Total number of elements
.dtype: Data type of the elements
.itemsize: Size in bytes of each element
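
A small sketch illustrating these properties on a 2-by-3 array (the output comments assume a 64-bit integer dtype, which is platform dependent):

import numpy as np

arr = np.array([[1, 2, 3], [4, 5, 6]])

print(arr.ndim)      # 2  (two axes: rows and columns)
print(arr.shape)     # (2, 3)
print(arr.size)      # 6
print(arr.dtype)     # e.g. int64
print(arr.itemsize)  # 8 bytes per element for int64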
We can create arrays in several ways:

import numpy as np
# From a Python list
arr1 = np.array([1, 2, 3])

# 2D array from nested lists


arr2 = np.array([[1, 2], [3, 4]])

# Array of zeros
zeros = np.zeros((2, 3))

# Array of ones
ones = np.ones((2, 2))

# Array with a range of values


range_arr = np.arange(0, 10, 2)

# Array with random values


rand_arr = np.random.rand(3, 2)
Basic Indexing and Slicing
arr = np.array([[10, 20, 30], [40, 50, 60]])

# Indexing
print(arr[0, 1]) # Output: 20

# Slicing
print(arr[:, 1]) # Output: [20 50] (second column)
print(arr[1, :]) # Output: [40 50 60] (second row)

# Negative indexing
print(arr[-1, -1]) # Output: 60 (last element)

# Modifying values
arr[0, 0] = 100
Pandas

Pandas is a powerful open-source library in Python used for data
manipulation and analysis. It provides data structures
like Series and DataFrame that allow for easy handling of structured data,
making tasks such as data cleaning, merging, and visualization much easier.
Pandas is a Python library used for working with data sets.

The name "Pandas" has a reference to both "Panel Data", and "Python
Data Analysis" and was created by Wes McKinney in 2008.
Introduction to Pandas

• Pandas is a Python package that provides fast, flexible,
and expressive data structures designed to make working
with structured (tabular, multidimensional, potentially
heterogeneous) and time series data both easy and
intuitive.

• Pandas is a high-level data manipulation tool developed
by Wes McKinney. It is built on the NumPy package and its
key data structure is called the DataFrame.

• Through pandas, you get acquainted with your data by
cleaning, transforming, and analyzing it.
Pandas is especially useful for:

Handling tabular data (like spreadsheets and SQL tables)

Cleaning and preparing data for analysis

Performing aggregation, grouping, and statistical operations

Reading from and writing to different file formats (CSV, Excel, JSON, SQL, etc.)
Core Data Structures in Pandas

There are two primary data structures in Pandas:

• Series – one-dimensional labeled array (like a column in a spreadsheet)

• DataFrame – two-dimensional labeled table (like a spreadsheet or SQL table)

import pandas as pd

# Create a simple DataFrame

x = {'Name': ['AMIT', 'SUMIT', 'RAJAT'],
     'Age': [22, 21, 22],
     'City': ['Delhi', 'Bombay', 'Dehradun']}

df = pd.DataFrame(x)

print(df)
The Pandas Series

A Series is a one-dimensional labeled array that can hold any
data type—integers, strings, floats, Python objects, etc.

It is similar to a column in Excel or a single list in Python, but
with labels (called the index) attached to it.
We can also assign custom labels to the index.

Operations on Series

We can perform any type of vectorized operation on a
Series, as shown in the sketch below.
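
A minimal sketch of a Series with a custom index and a few vectorized operations (the values are illustrative):

import pandas as pd

# Series with custom index labels
s = pd.Series([10, 20, 30], index=['a', 'b', 'c'])
print(s.index)       # Index(['a', 'b', 'c'], dtype='object')
print(s['b'])        # 20, access by label

# Vectorized operations apply element-wise
print(s * 2)         # values become 20, 40, 60
print(s + pd.Series([1, 1, 1], index=['a', 'b', 'c']))  # aligned by index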
The Pandas DataFrame
• A DataFrame is a two-dimensional labeled data structure with rows and
columns.
• It’s the most commonly used object in Pandas and resembles a SQL table
or Excel spreadsheet.
• Each column in a DataFrame is actually a Series.
Series


Constructing a Series from a dictionary
with an Index specified

If the keys of the dictionary match the given Index values, the Index has
no effect and the values are taken from the dictionary.
If the Index values do not match any of the keys, the Series is first built from the
dictionary keys and then reindexed with the given Index values, so we get all NaN as a result.
To create a Series with a custom index and
view the index labels, see the sketch below.
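
A short sketch of both cases (the dictionary values are illustrative):

import pandas as pd

d = {'a': 1, 'b': 2, 'c': 3}

# Index labels match the dictionary keys: values are taken from the dictionary
s1 = pd.Series(d, index=['a', 'b', 'c'])
print(s1)            # a 1, b 2, c 3

# Index labels do not match any key: the Series is reindexed, so all values are NaN
s2 = pd.Series(d, index=['x', 'y', 'z'])
print(s2)            # x NaN, y NaN, z NaN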
DataFrame
Two-dimensional, size-mutable, potentially
heterogeneous tabular data.
DataFrame([data, index, columns, dtype, copy])

Constructing a DataFrame from a dictionary.
Printing a specific column
Printing multiple columns
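
A small sketch using the DataFrame created earlier:

import pandas as pd

df = pd.DataFrame({'Name': ['AMIT', 'SUMIT', 'RAJAT'],
                   'Age': [22, 21, 22],
                   'City': ['Delhi', 'Bombay', 'Dehradun']})

# Printing a specific column (returns a Series)
print(df['Name'])

# Printing multiple columns (pass a list of names; returns a DataFrame)
print(df[['Name', 'City']])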
Reading Data with Pandas

One of the most powerful features of Pandas is its
ability to seamlessly read data from various file
formats and write processed data back to files.

• Reading Data from CSV

• Reading Data from JSON
Reading from .csv Files
What if we have a different delimiter?

import pandas as pd
df = pd.read_csv('sample.csv', sep='\t',
                 engine='python')

print(df)

If we don't specify the sep parameter, it uses the default value of
the comma (,).

sep='\t': This argument specifies that the tab character (\t)
is used as the delimiter between fields in the file.

The engine='python' argument specifies that the Python
parsing engine should be used, which might be necessary
for more complex CSV files or when dealing with non-standard separators.
In case of a CSV file with multiple types of delimiters,
use a regular expression:

import pandas as pd

df = pd.read_csv('sample.csv', sep='[:, |_]',
                 engine='python')

print(df)
Reading Specific rows from csv file

import pandas as pd

df = pd.read_csv("F:/varun/NDIM/sample.csv", skiprows = [0, 2,


5])
print(df )

Skip n rows from the end

df = pd.read_csv("students.csv",
                 skipfooter=5, engine='python')

We can also use DictReader (from Python's csv module) to read CSV files. The results are
interpreted as a dictionary where the header row provides the keys, and
the other rows are the values.
Writing to a .csv file
import pandas as pd
my_df = {'student': ['Ritu', 'Akhil'], 'course': [1,
2], 'credits': [20, 19]}

df = pd.DataFrame(my_df)

print('DataFrame:\n', df)

s_data = df.to_csv('F:/newdata.csv', index = True)

print(s_data)  # to_csv returns None when a file path is given

saves a Pandas DataFrame to a CSV file named


"newdata.csv" in the F drive. The index=True
argument ensures that the DataFrame's index is
also written to the CSV file as a column.
If we wish not to include the index, then assign False to index
parameter
s_data = df.to_csv('F:/newdata.csv', index = False)

If we wish not to include the header, assign False to header


parameter.
s_data = df.to_csv('F:/newdata.csv', header = False)
To Export selected columns of Data frame

import pandas as pd

my_df = {'student': ['R', 'A'], 'course': [1, 2], 'credits': [20, 19]}

df = pd.DataFrame(my_df)

print('DataFrame:\n', df)

s_data = df.to_csv('F:/newdata.csv', columns = ['student'])

print( s_data)
dtype in Python refers to the data type of elements within a
NumPy array or Pandas Series.

To check a column's type: df['credits'].dtype

Check the types of all the columns: df.dtypes

df.dtypes is a Pandas command that displays the data type of each
column in a DataFrame.

Check all the column names: df.columns

For example, if a DataFrame df is created with columns 'A' and 'B',
then df.columns will return Index(['A', 'B'], dtype='object').
Reading from JSON files

A JSON file is a text-based file format that uses JavaScript
Object Notation to store and transmit data. JSON is
often used when data is sent from a server to a web
page.
CSV Files (Comma-Separated Values)
CSV files are comma-separated files. Pandas allows users to load tabular
data into a DataFrame, which is a powerful structure for data manipulation
and analysis. To access data from a CSV file, we use the function
read_csv() from Pandas, which returns the data in the form of a DataFrame.

import pandas as pd

df_csv = pd.read_csv('sampledata.csv')

Table: A table is a structured arrangement where
information is organized into rows and columns. Each row represents a distinct
record or entry, while each column corresponds to a specific attribute or
characteristic.

The read_table() function in pandas facilitates the conversion of tabular data
from a text file into a pandas DataFrame.

df = pd.read_table('data.txt', sep='\t')
Loading CSV Data from a URL

url = "https://xyz.comg/wp-ontent/uploads/20241121154629307916/people_data.csv"

x = pd.read_csv(url)

# importing pandas module
import pandas as pd

# making data frame
df = pd.read_csv("https://media.geeksforgeeks.org/wp-content/uploads/nba.csv")

df.head(10)
Excel Files
(read_excel())
Excel is one of the most common data storage formats. Excel files can be quickly
loaded into a pandas DataFrame with the read_excel() function. It smoothly
loads various Excel file formats, from xls to xlsx.

df_excel = pd.read_excel('sample.xlsx', sheet_name='Sheet1')

JSON (read_json())
JSONs (JavaScript Object Notation) find extensive use as a file format for
storing and exchanging data.

df_json = pd.read_json('data.json')
df.head() # View first few rows

df.describe() # Summary statistics

df['Age'] # Access a column

df[df['Age'] > 30] # Filter rows

df.sort_values('Age') # Sort by a column


Writing to JSON files

json_str = '{"Name": ["A", "B"], "Age": [25, 30]}‘

df = pd.read_json(json_str)

df.to_json('output.json', orient='records')
How to look at data dimensionality?
print(df.shape)

(10, 3)
The output is (number_of_rows, number_of_columns). It shows the number of rows and
columns present in the DataFrame. For instance, an output of (3, 2) indicates
that the DataFrame has 3 rows and 2 columns.

df.info(): The df.info() method provides a quick
summary of a DataFrame, like a snapshot of its
structure.
Handling Duplicates

import pandas as pd

df = pd.read_csv("F:/varun/NDIM/sample.csv")
print(df.shape)

# df.append() was removed in recent pandas versions; use pd.concat instead
tempdf = pd.concat([df, df])
print(tempdf.shape)
tempdf = tempdf.drop_duplicates()
How to Handle missing Values in
data
There are two options in dealing with nulls:

1. Get rid of rows or columns with nulls


2. Replace nulls with non-null values, a technique known as
imputation
How to check whether we have Null values in data or not

df = pd.read_csv("F:/varun/NDIM/sample1.csv")

df.isnull()
Count the number of Null Values in data

df = pd.read_csv("F:/varun/NDIM/sample1.csv")

df.isnull().sum()
Removing null data is only suggested if you have a small amount of
missing data.

df.dropna()
Other than just dropping rows, we can also drop columns with null
values by setting axis=1:

df.dropna(axis=1)
Replacing Null Values with the Mean or
Median
Passing inplace=True makes the change in the original DataFrame itself.

Similarly, the mode() and median() of a column can be used for imputation,
as shown in the sketch below.
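
A minimal sketch of mean imputation (the small DataFrame below is only an illustration; the 'credits' column name follows the earlier example):

import pandas as pd
import numpy as np

df = pd.DataFrame({'student': ['Ritu', 'Akhil'], 'credits': [20, np.nan]})

# Replace missing credits with the column mean
df['credits'] = df['credits'].fillna(df['credits'].mean())
print(df)

# The median or mode of the column can be used the same way:
# df['credits'] = df['credits'].fillna(df['credits'].median())
# df['credits'] = df['credits'].fillna(df['credits'].mode()[0])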
Using Groupby

import pandas as pd

data = {'Name': ['Parker', 'Smith', 'John', 'William'],
        'Percentage': [82, 98, 91, 87],
        'Course': ['B.Sc', 'B.Ed', 'M.Phill', 'BA']}  # each list must have the same length
df = pd.DataFrame(data)
s_data = df.to_csv('F:/newdata1.csv', index=True)
df = pd.read_csv("F:/newdata1.csv")
grouped = df.groupby('Course')
print(grouped['Percentage'].agg('mean'))
Constructing DataFrame
from numpy ndarray:

>>> df2 = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]),


... columns=['a', 'b', 'c'])
>>> df2
a b c
0 1 2 3
1 4 5 6
2 7 8 9
Constructing DataFrame from
Series/DataFrame:

>>> ser = pd.Series([1, 2, 3], index=["a", "b", "c"])

>>> df = pd.DataFrame(data=ser, index=["a", "c"])

>>> df
0
a 1
c 3

>>> df1 = pd.DataFrame([1, 2, 3], index=["a", "b", "c"], columns=["x"])


>>> df2 = pd.DataFrame(data=df1, index=["a", "c"])
>>> df2
x
a 1
c 3
Reindexing
Reindexing in Pandas is used to change the row or column labels of a
DataFrame to match a new set of indices.

If we reindex the DataFrame by adding an extra
row (e.g., index 3) which wasn't present in the
original data, Pandas fills the
missing values with NaN.
DataFrame.reindex(labels=None, index=None, columns=None, axis=None,
method=None, copy=True, fill_value=NaN)

Labels: New labels/indexes to conform to.


index/columns: New row/column labels.
fill_value: Value to use for filling missing entries (default is NaN).
method: Method for filling holes (ffill, bfill, etc.).

Reindexing the Rows: We can
reindex a single row or multiple
rows by using the reindex() method.
Values in the new index
that are not present in the
dataframe are assigned NaN.

Reindexing the Columns

We can change the order of the columns and add a new
column (e.g., GP). Since no data exists for column GP,
it would be filled with NaN unless we supply a fill_value
(e.g., 80), as shown in the sketch below.
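
A short sketch of both kinds of reindexing (the data values are illustrative):

import pandas as pd

df = pd.DataFrame({'Name': ['AMIT', 'SUMIT', 'RAJAT'],
                   'Age': [22, 21, 22]})

# Reindexing the rows: index 3 is new, so its values become NaN
print(df.reindex([0, 1, 2, 3]))

# Reindexing the columns: 'GP' is new; fill_value=80 is used instead of NaN
print(df.reindex(columns=['Age', 'Name', 'GP'], fill_value=80))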
Indexing (Selection by Label or Position)
Row Selection by label (loc) or integer position (iloc):
Filtering (Boolean Indexing)
Sorting
Sort by Column:
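
A minimal sketch of these selection, filtering, and sorting operations on a small DataFrame (the data is illustrative):

import pandas as pd

df = pd.DataFrame({'Name': ['AMIT', 'SUMIT', 'RAJAT'],
                   'Age': [22, 21, 22],
                   'City': ['Delhi', 'Bombay', 'Dehradun']})

# Selection by label vs. integer position
print(df.loc[0, 'Name'])              # row label 0, column 'Name'
print(df.iloc[0, 0])                  # first row, first column

# Filtering with a boolean condition
print(df[df['Age'] > 21])

# Sorting by a column
print(df.sort_values('Age', ascending=False))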
Ranking

In pandas, ranking refers to assigning ranks to elements in
a Series or DataFrame, with the help of the .rank() function.

method:
average - Default. Assigns the average rank to tied elements.
min - Lowest rank for all ties (duplicate values).
max - Highest rank for all ties.
first - Assigns ranks in the order the values appear.
dense - Like 'min', but ranks increase by 1 between groups.
ascending: Sort in ascending or descending order before ranking.
True (default): Lowest value gets rank 1.
False: Highest value gets rank 1.
import pandas as pd
s = pd.Series([100, 200, 100, 300]) # Default ranking (average method)
print(s.rank())
# Output:
# 0 1.5
# 1 3.0
# 2 1.5
# 3 4.0
# dtype: float64

print(s.rank(method='min')) # Using method='min'


# Output:
# 0 1.0
# 1 3.0
# 2 1.0
# 3 4.0

print(s.rank(method='dense')) # Using method='dense'


# Output:
# 0 1.0
# 1 2.0
# 2 1.0
# 3 3.0
rank() is particularly useful in statistics, for percentile rankings, or handling ties.
Operation: Method(s)
Column Access: df['col'], df[['col1', 'col2']]
Row Access: df.loc[], df.iloc[]
Reindex: df.reindex()
Filtering: df[condition]
Sorting: df.sort_values(), df.sort_index()
Ranking: df.rank()
NumPy Arrays

NumPy is used to work with arrays. The array object in NumPy is
called ndarray.

We can create a NumPy ndarray object by using the array()
function.

NumPy arrays are great alternatives to Python lists.

Some of the key advantages of NumPy arrays are that they are fast,
easy to work with, and give users the opportunity to perform
calculations across entire arrays.
import numpy as np
a = np.array([1, 2, 3])
print(a)
print(type(a))

import numpy as np
a = np.array([4.1,6.2, 3.3])
print(a)
print(type(a))
2-Dimensional Arrays
You can add a dimension by separating rows with a comma.

Note that each row has to be within brackets [].

Array of complex numbers

A = np.array([[1, 2, 3], [3, 4, 5]], dtype=complex)
print(A)
numpy.ones
Return a new array of given shape and type, filled with ones.

Create an array with random values:
np.random.random((2, 2))

Create an empty array:
np.empty((3, 2))

Create an array of evenly-spaced values:
np.arange(10, 25, 5)
Addition of Matrices

import numpy as np

A = np.array([[2, 4], [5, -6]])
B = np.array([[9, -3], [3, 6]])
C = A + B
print(C)
Multiplication of Matrices
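
A short sketch, reusing the matrices from the addition example (note the difference between element-wise and matrix multiplication):

import numpy as np

A = np.array([[2, 4], [5, -6]])
B = np.array([[9, -3], [3, 6]])

# Element-wise multiplication
print(A * B)

# Matrix (dot) product
print(A @ B)          # equivalent to np.dot(A, B)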
Data manipulation (alignment,
aggregation, and summarization), Group-
based operations: split-apply-combine
Statistical analysis, Date and time series
analysis with Pandas, Visualizing data
Data Manipulation

Data manipulation involves transforming and restructuring data to make it
more suitable for analysis. Techniques include:

Alignment: Ensuring data is properly aligned, especially in time series data.

Aggregation: Combining multiple values into a single summary value (e.g.,
sum, mean, count).

Summarization: Creating concise summaries of data.

To summarize data using Python Pandas, df.describe() produces
summary statistics for numerical columns (mean, standard
deviation, quartiles, etc.):

print(df.describe())  # Summary for numerical columns
Aggregation in Pandas

Aggregation means applying a mathematical function to summarize data. It can be
used to get a summary of columns in our dataset, such as the sum, minimum, or
maximum of a particular column. The function used for
aggregation is agg(); its parameter is the function we want to apply. Some
functions used in aggregation are:

sum() : Compute sum of column values


min() : Compute min of column values
max() : Compute max of column values
mean() : Compute mean of column
size() : Compute column sizes
describe() : Generates descriptive statistics
first() : Compute first of group values
last() : Compute last of group values
count() : Compute count of column values
std() : Standard deviation of column
var() : Compute variance of column
sem() : Standard error of the mean of column
The agg() function in Python is commonly used with pandas, a data analysis
library. It stands for aggregate and allows you to apply one or more
aggregation functions to a DataFrame or Series.
Using agg() on a DataFrame is shown in the sketch below.
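
A minimal sketch of agg() on a DataFrame (the column names and values are illustrative):

import pandas as pd

df = pd.DataFrame({'Maths': [85, 90, 78],
                   'Science': [88, 76, 95]})

# A single aggregation applied to every column
print(df.agg('sum'))

# Several aggregations at once
print(df.agg(['min', 'max', 'mean']))

# Different aggregations per column
print(df.agg({'Maths': 'mean', 'Science': ['min', 'max']}))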
Date and Time Series Analysis with
Pandas, Visualizing Data
date_range() - This function generates a list of dates automatically.
The pd.date_range function in Pandas is used to generate a sequence of dates or
times. It's especially useful when working with time series data.

pd.date_range(start=None, end=None, periods=None, freq='D', tz=None,
              normalize=False, name=None, inclusive='both')

Parameters:
start: The starting date of the range (string or datetime-like).
end: The ending date of the range (string or datetime-like).
periods: Number of periods to generate.
freq: Frequency string (default is 'D' for daily). Other examples:
'H' = hourly
'T' or 'min' = minute
'S' = second
'M' = month end
'MS' = month start
'B' = business day
tz: Time zone.
normalize: Normalize start/end dates to midnight.
inclusive: Whether to include start/end. Options: 'both' (default), 'neither', 'left',
'right'.
Daily Range Between Two Dates:

Hourly Range with Periods:
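
A short sketch of both cases (the dates are illustrative):

import pandas as pd

# Daily range between two dates (freq='D' is the default)
print(pd.date_range(start='2025-05-01', end='2025-05-07'))

# Hourly range using a number of periods instead of an end date
print(pd.date_range(start='2025-05-01', periods=5, freq='H'))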


np.random.randint(10, 100, 5) generates an array of 5 random integers, where each
integer is between 10 and 100.
Plotting Time Series Data
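
A minimal sketch of plotting a time series built from a date_range index and random values (the data is illustrative):

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

dates = pd.date_range(start='2025-05-01', periods=5, freq='D')
values = np.random.randint(10, 100, 5)     # 5 random integers between 10 and 100

ts = pd.Series(values, index=dates)
ts.plot(title='Random daily values')       # Pandas plotting uses Matplotlib underneath
plt.xlabel('Date')
plt.ylabel('Value')
plt.show()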
pd.to_datetime()

It converts a string representing a date ("2025-05-01") into a Pandas Timestamp
object: a special data type that understands dates (and optionally times).
It can also convert a list of date strings, as shown in the sketch below.
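
A small sketch (the extra dates in the list are illustrative):

import pandas as pd

# Convert a single date string into a Timestamp
print(pd.to_datetime('2025-05-01'))

# Convert a list of date strings into a DatetimeIndex
dates = pd.to_datetime(['2025-05-01', '2025-06-15', '2025-07-30'])
print(dates)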
Group-Based Operations
Group-Based Operations: Split-Apply-Combine (with a Statistical Analysis Example)

In data analysis, the split-apply-combine strategy is a powerful pattern
used for group-based operations. It involves three steps: splitting the data into
groups based on some criteria (e.g., categories or labels), applying a function
(such as sum, mean, or custom logic) to each group independently, and then
combining the results into a single data structure.

This approach is especially useful in statistical analysis where we want to


compute aggregate metrics like the mean, median, or standard deviation for
different categories. Pandas makes this process intuitive with its groupby()
function. For example, suppose we have a dataset of students and their scores
across different classes. Using groupby('Class')['Score'].mean() allows us to
compute the average score for each class. This helps in understanding group-
level trends and making data-driven decisions, such as identifying which class is
performing the best. Overall, split-apply-combine is a fundamental pattern in
exploratory data analysis and reporting.
import pandas as pd

# Sample data
data = {
    'Class': ['A', 'A', 'B', 'B', 'C', 'C'],
    'Student': ['John', 'Alice', 'Bob', 'Eva', 'Mike', 'Sara'],
    'Score': [85, 90, 78, 82, 88, 91]
}

df = pd.DataFrame(data)

# Group by 'Class' and calculate mean score
average_scores = df.groupby('Class')['Score'].mean()

print(average_scores)

Output:
Class
A    87.5
B    80.0
C    89.5
Name: Score, dtype: float64
CN2
The CN2 algorithm is a simple yet powerful machine learning technique that
helps computers learn how to make decisions based on examples. It works by
creating easy-to-understand IF-THEN rules, much like how a person might
reason through a problem. For example, CN2 might learn a rule like “IF the
weather is sunny AND the temperature is hot, THEN do not play outside.”
These rules are generated from past data, where the outcomes are already
known, and can be used to predict new outcomes in the future.

CN2 is especially useful when we need a model that not only makes accurate
predictions but also explains its reasoning clearly. This makes it very popular in
fields like medicine, finance, and education, where understanding the logic
behind a decision is just as important as the decision itself.

CN2 builds its rules step by step. First, it looks for patterns in the data that
strongly indicate a certain result. It uses a method called beam search to
explore many possible rules but only keeps the most promising ones to avoid
wasting time on bad options.
Once a good rule is found, it removes the examples it explains and moves on to
find more rules until most of the data is covered. For example, if it finds a rule
that perfectly describes all the times someone decided not to play outside, it will
remove those and focus on the remaining ones. This way, it keeps building a list
of useful rules, each one adding more to the overall understanding.
Merits of CN2 Algorithm
• Human-Readable Rules
CN2 produces IF-THEN rules that are easy to interpret, making it ideal for
domains where transparency is critical (e.g., medicine, law).
• Handles Noisy Data Well
CN2 includes mechanisms like rule pruning and heuristic evaluation to
prevent overfitting and deal with noisy or imperfect data.
• Efficient Search via Beam Search
Instead of examining all possible rule combinations, CN2 uses beam search,
which makes the rule induction process faster and more scalable.
• Works with Categorical and Numerical Data
While originally designed for symbolic data, it can also handle numerical
attributes (usually with discretization).
• Supports Incremental Learning
CN2 builds rules step-by-step, allowing for easier updates when new data is
introduced (compared to tree-based methods that might need complete
rebuilding).
Principal Component Analysis (PCA)
Principal Component Analysis (PCA) is a popular technique in data science and
machine learning used to reduce the dimensionality of large datasets while
preserving as much information as possible. It transforms data into a new set of
variables called principal components, which are linear combinations of the
original features. These components capture the maximum variance in the data,
with the first component capturing the most variance, the second capturing the next,
and so on. PCA is especially useful when working with high-dimensional data, as it
simplifies analysis and visualization.

The process begins by standardizing the data to ensure all features have the same
scale, as PCA is sensitive to the scale of the data. Next, the covariance matrix is
calculated to capture the relationships between features. From this matrix,
eigenvectors and eigenvalues are derived. Eigenvectors define the directions of
the principal components, while eigenvalues represent the variance captured by
each component. The components are ranked based on their eigenvalues, and the
top components are selected based on the proportion of variance they explain.
The selected components form a new coordinate system, and the data is
projected onto these components, reducing its dimensionality. The result is a
dataset that retains the most important features but with fewer dimensions,
making it easier to analyze and visualize.

PCA is widely used for data visualization, feature extraction, and noise
reduction, especially when dealing with highly correlated data. It helps in
speeding up computations and improving model performance by reducing
overfitting. However, PCA has limitations: it assumes linear relationships
between features and can be sensitive to outliers. Despite these challenges,
PCA remains a powerful and essential tool for simplifying complex datasets while
retaining critical information.
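
A minimal sketch of PCA with Scikit-Learn, following the steps described above (the Iris dataset and the choice of 2 components are illustrative assumptions):

from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Load a small 4-feature dataset and standardize it (PCA is scale-sensitive)
X, _ = load_iris(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

# Project the data onto the top 2 principal components
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X_scaled)

print(X_reduced.shape)                   # (150, 2)
print(pca.explained_variance_ratio_)     # variance captured by each component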
