Unit 4
TBC-603
Syllabus
UNIT 4: Data Analysis and Machine Learning with Python: NumPy, SciPy, Matplotlib,
Pandas, Scikit-Learn.
Getting Started with pandas; Series, DataFrame and Index objects; Reindexing;
Indexing, Selection and Filtering; Sorting and Ranking; Loading from CSV and other
structured text formats; Normalizing data; Dealing with missing data; Data
manipulation (alignment, aggregation, and summarization); Group-based operations:
split-apply-combine; Statistical analysis; Date and time series analysis with
Pandas; Visualizing data.
UNIT 5: Validation Techniques: Hold-out, K-Fold Cross-Validation, Leave One Out,
Bootstrapping.
Matplotlib
The pyplot interface provides a MATLAB-like experience, making it easy for new
users. Matplotlib is highly customizable, allowing fine control over plot elements
such as labels, legends, colors, and annotations. It is often used in conjunction
with NumPy and Pandas for exploratory data analysis and reporting. Whether for
publication-quality graphs or basic data visualizations, Matplotlib remains a staple
in the scientific Python ecosystem.
Matplotlib is a comprehensive and widely used Python library for creating
visualizations, especially two-dimensional plots and graphs. It is a vital tool in the
data science and scientific computing ecosystem, enabling users to generate
high-quality static, interactive, and animated charts.
The library is particularly known for its pyplot module, which provides a
MATLAB-like interface and makes it easy to produce plots with just a few lines of
code. Matplotlib supports various types of visualizations such as line plots, bar
charts, scatter plots, histograms, pie charts, and more.
One of its biggest strengths is its high level of customizability—users can control
every aspect of a figure, including fonts, colors, line styles, and layout. It
integrates smoothly with other Python libraries like NumPy and Pandas, and
works well in Jupyter Notebooks, making it a go-to choice for data analysis and
exploration. Additionally, Matplotlib allows exporting plots in multiple formats,
such as PNG, PDF, and SVG, which is useful for both web and print publishing.
Because of its flexibility, ease of use, and broad community support, Matplotlib
remains one of the most essential libraries for data visualization in Python.
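As a quick illustration, the following sketch uses the pyplot interface to draw a
labelled line plot and export it as a PNG; the data values and file name are made
up for the example.
import numpy as np
import matplotlib.pyplot as plt

# Sample data (made up for illustration)
x = np.linspace(0, 2 * np.pi, 100)
y = np.sin(x)

plt.plot(x, y, color='green', linestyle='--', label='sin(x)')  # line style and colour
plt.title('A simple line plot')
plt.xlabel('x')
plt.ylabel('sin(x)')
plt.legend()
plt.savefig('sine.png')   # export as PNG (PDF and SVG are also supported)
plt.show()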
Scikit-Learn
NumPy Arrays
import numpy as np
# Sample 2-D array used in the examples below
arr = np.array([[10, 20, 30],
                [40, 50, 60]])
# Array of zeros
zeros = np.zeros((2, 3))
# Array of ones
ones = np.ones((2, 2))
# Indexing
print(arr[0, 1])    # Output: 20
# Slicing
print(arr[:, 1])    # Output: [20 50] (second column)
print(arr[1, :])    # Output: [40 50 60] (second row)
# Negative indexing
print(arr[-1, -1])  # Output: 60 (last element)
# Modifying values
arr[0, 0] = 100
Pandas
The name "Pandas" has a reference to both "Panel Data", and "Python
Data Analysis" and was created by Wes McKinney in 2008.
Introduction to Pandas
Pandas supports reading from and writing to different file formats (CSV, Excel, JSON, SQL, etc.).
Core Data Structures in Pandas
import pandas as pd
# Sample data: a dictionary of columns (any dict, list, or ndarray also works)
x = {'name': ['A', 'B', 'C'], 'marks': [85, 90, 78]}
df = pd.DataFrame(x)
print(df)
The Pandas Series
When a Series is built from a dictionary and the keys of the dictionary match the
given Index values, the Index values have no effect.
Note that the Index is first built with the keys from the dictionary. After this the
Series is reindexed with the given Index values; hence we get all NaN as a result
when none of the labels match.
To create a Series with a custom index and view the index labels:
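A small sketch illustrating the three cases described above (the keys, labels, and
values are made up for illustration):
import pandas as pd

d = {'a': 10, 'b': 20, 'c': 30}

# Index labels match the dictionary keys: the index has no effect
s1 = pd.Series(d, index=['a', 'b', 'c'])
print(s1)          # a 10, b 20, c 30

# Index labels do not match any key: the Series is reindexed, giving all NaN
s2 = pd.Series(d, index=['x', 'y', 'z'])
print(s2)          # x NaN, y NaN, z NaN

# Series with a custom index; view the index labels
s3 = pd.Series([1, 2, 3], index=['p', 'q', 'r'])
print(s3.index)    # Index(['p', 'q', 'r'], dtype='object')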
Data Frame
Two-dimensional, size-mutable, potentially
heterogeneous tabular data.
DataFrame([data, index, columns, dtype, copy])
import pandas as pd
df = pd.read_csv('sample.csv', sep='\t', engine='python')
print(df)
Reading specific rows from a CSV file
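The code for this slide is not included in the source; a minimal sketch using
read_csv's nrows and skiprows parameters (the file name is assumed) could be:
import pandas as pd

# Read only the first 5 rows of the file
df_first = pd.read_csv('sample.csv', nrows=5)

# Skip the first 3 data rows, keeping the header row
df_skip = pd.read_csv('sample.csv', skiprows=range(1, 4))

print(df_first)
print(df_skip)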
Creating a DataFrame from a dictionary:
import pandas as pd
my_df = {'student': ['R', 'A'], 'course': [1, 2], 'credits': [20, 19]}
df = pd.DataFrame(my_df)
print('DataFrame:\n', df)
dtype in Python refers to the data type of elements within a
NumPy array or Pandas Series.
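For example, reusing the small student DataFrame shown above (the column names are
taken from that example):
import pandas as pd

my_df = {'student': ['R', 'A'], 'course': [1, 2], 'credits': [20, 19]}
df = pd.DataFrame(my_df)

print(df['credits'].dtype)              # int64
print(df.dtypes)                        # dtype of every column
print(df['credits'].astype('float64'))  # convert to another dtype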
import pandas as pd
df_csv = pd.read_csv('sampledata.csv')    # read a comma-separated file
df = pd.read_table('data.txt', sep='\t')  # read a tab-separated text file
Loading CSV Data from a URL
url = "https://xyz.comg/wp-ontent/uploads/20241121154629307916/people_data.csv"
df = pd.read_csv(url)   # read the CSV directly from the URL
df.head(10)             # view the first 10 rows
Excel Files (read_excel())
Excel is one of the most common data storage formats. Excel files can be quickly
loaded into a pandas DataFrame with the read_excel() function, which smoothly
handles the various Excel file formats, from .xls to .xlsx.
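A minimal sketch (the file and sheet names are assumptions; reading .xlsx files
additionally requires an engine package such as openpyxl):
import pandas as pd

# Load one worksheet of an Excel workbook into a DataFrame
df_excel = pd.read_excel('sampledata.xlsx', sheet_name='Sheet1')
print(df_excel.head())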
JSON (read_json())
JSON (JavaScript Object Notation) is widely used as a file format for storing and
exchanging data.
df_json = pd.read_json('data.json')
df_json.head()                             # view the first few rows
# Reading from a JSON string and writing back out to a file
json_str = '[{"name": "A", "marks": 85}]'  # sample JSON string
df = pd.read_json(json_str)
df.to_json('output.json', orient='records')
How do we look at data dimensionality?
print(df.shape)
# (10, 3)
df.shape returns a tuple (number_of_rows, number_of_columns), showing the number
of rows and columns present in the DataFrame. For instance, an output of (3, 2)
indicates that the DataFrame has 3 rows and 2 columns.
import pandas as pd
df = pd.read_csv("F:/varun/NDIM/sample.csv")
print(df.shape)
# Stack the DataFrame on top of itself to create duplicate rows
# (DataFrame.append was removed in pandas 2.0, so pd.concat is used here)
tempdf = pd.concat([df, df])
print(tempdf.shape)
# Remove the duplicate rows again
tempdf = tempdf.drop_duplicates()
print(tempdf.shape)
How to Handle Missing Values in Data
There are two options for dealing with nulls: remove the rows or columns that
contain them, or replace them with a suitable value (such as the mean or median).
df = pd.read_csv("F:/varun/NDIM/sample1.csv")
df.isnull()
Count the number of Null Values in data
df = pd.read_csv("F:/varun/NDIM/sample1.csv")
df.isnull().sum()
Removing null data is only suggested if you have a small amount of
missing data.
df.dropna()
Other than just dropping rows, we can also drop columns with null
values by setting axis=1:
df.dropna(axis=1)
Replacing Null Values with the Mean or Median
Passing inplace=True makes the change on the original DataFrame instead of
returning a modified copy.
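A minimal sketch of filling nulls with the mean (the column name and values here
are made up for illustration):
import pandas as pd
import numpy as np

df = pd.DataFrame({'Percentage': [82, np.nan, 91, np.nan]})

# Replace missing values with the column mean
df['Percentage'] = df['Percentage'].fillna(df['Percentage'].mean())

# Or replace them with the median instead
# df['Percentage'].fillna(df['Percentage'].median(), inplace=True)
print(df)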
Group-based operations: split-apply-combine
import pandas as pd
# Note: a fifth student name is assumed below so that all three lists have the
# same length (the original slide listed only four names but five values).
data = {'Name': ['Parker', 'Smith', 'John', 'William', 'Davis'],
        'Percentage': [82, 98, 91, 87, 88],
        'Course': ['B.Sc', 'B.Ed', 'M.Phil', 'BA', 'BA']}
df = pd.DataFrame(data)
df.to_csv('F:/newdata1.csv', index=True)
df = pd.read_csv("F:/newdata1.csv")
grouped = df.groupby('Course')
print(grouped['Percentage'].mean())   # average Percentage for each Course
Constructing a DataFrame from a NumPy ndarray (one possible construction that
reproduces the output shown):
>>> import numpy as np
>>> import pandas as pd
>>> df = pd.DataFrame(np.array([[1], [2], [3]]), index=['a', 'b', 'c'])
>>> df
   0
a  1
b  2
c  3
The rank() method
method: how tied values (duplicates) are ranked.
'average' (default): assigns the average rank to tied elements.
'min': lowest rank for all ties.
'max': highest rank for all ties.
'first': assigns ranks in the order the values appear.
'dense': like 'min', but ranks increase by 1 between groups.
ascending: sort in ascending or descending order before ranking.
True (default): the lowest value gets rank 1.
False: the highest value gets rank 1.
import pandas as pd
s = pd.Series([100, 200, 100, 300]) # Default ranking (average method)
print(s.rank())
# Output:
# 0 1.5
# 1 3.0
# 2 1.5
# 3 4.0
# dtype: float64
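To illustrate the other tie-handling methods listed above, using the same Series:
import pandas as pd

s = pd.Series([100, 200, 100, 300])

print(s.rank(method='min'))     # the tied 100s both get rank 1.0
print(s.rank(method='dense'))   # ranks: 1, 2, 1, 3
print(s.rank(ascending=False))  # 300 gets rank 1.0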
NumPy
Some of the key advantages of NumPy arrays are that they are fast, easy to work
with, and give users the opportunity to perform calculations across entire arrays.
import numpy as np
a = np.array([1, 2, 3])
print(a)
print(type(a))
import numpy as np
a = np.array([4.1,6.2, 3.3])
print(a)
print(type(a))
2-Dimensional Arrays
A 2-D array can be created from nested lists (rows separated by commas), or by
passing a shape tuple such as (2, 2) to the array-creation functions:
np.random.random((2, 2))   # 2x2 array of random values in [0, 1)
np.empty((3, 2))           # empty (uninitialized) 3x2 array
np.arange(10, 25, 5)       # evenly spaced values: [10 15 20]
Addition of Matrices
import numpy as np
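# The rest of this slide's code is missing; a minimal sketch of element-wise
# matrix addition (the matrix values below are made up for illustration):
A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])
print(A + B)         # element-wise sum: [[ 6  8] [10 12]]
print(np.add(A, B))  # equivalent using np.add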
pd.date_range(): generating a sequence of dates
Parameters:
start: the starting date of the range (string or datetime-like).
end: the ending date of the range (string or datetime-like).
periods: number of periods to generate.
freq: frequency string (default is 'D' for daily). Other examples:
'H' = hourly
'T' or 'min' = minute
'S' = second
'M' = month end
'MS' = month start
'B' = business day
tz: time zone.
normalize: normalize start/end dates to midnight.
inclusive: whether to include the start/end boundaries. Options: 'both' (default),
'neither', 'left', 'right'.
Daily Range Between Two Dates:
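The example code for this slide is not included in the source; a minimal sketch
(the dates are made up) could be:
import pandas as pd

# Daily dates from 1 January to 7 January 2024 (both ends included)
dates = pd.date_range(start='2024-01-01', end='2024-01-07', freq='D')
print(dates)

# The same idea using a fixed number of periods instead of an end date
print(pd.date_range(start='2024-01-01', periods=7, freq='D'))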
CN2
The CN2 algorithm is a simple yet powerful machine learning technique that
helps computers learn how to make decisions based on examples. It works by
creating easy-to-understand IF-THEN rules, much like how a person might
reason through a problem. For example, CN2 might learn a rule like “IF the
weather is sunny AND the temperature is hot, THEN do not play outside.”
These rules are generated from past data, where the outcomes are already
known, and can be used to predict new outcomes in the future.
CN2 is especially useful when we need a model that not only makes accurate
predictions but also explains its reasoning clearly. This makes it very popular in
fields like medicine, finance, and education, where understanding the logic
behind a decision is just as important as the decision itself.
CN2 builds its rules step by step. First, it looks for patterns in the data that
strongly indicate a certain result. It uses a method called beam search to
explore many possible rules but only keeps the most promising ones to avoid
wasting time on bad options.
Once a good rule is found, it removes the examples it explains and moves on to
find more rules until most of the data is covered. For example, if it finds a rule
that perfectly describes all the times someone decided not to play outside, it will
remove those and focus on the remaining ones. This way, it keeps building a list
of useful rules, each one adding more to the overall understanding.
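The following is an illustrative sketch only, not the full CN2 algorithm: it follows
the same separate-and-conquer idea (find a good rule, remove the examples it covers,
repeat), but the rule search is simplified to single-condition rules chosen by
accuracy, whereas real CN2 performs a beam search over conjunctions of conditions
and applies statistical significance tests. All data and names are made up.
from collections import Counter

# Toy training data: each example is (attributes, outcome)
examples = [
    ({'weather': 'sunny', 'temp': 'hot'},   'stay in'),
    ({'weather': 'sunny', 'temp': 'mild'},  'play'),
    ({'weather': 'rainy', 'temp': 'mild'},  'stay in'),
    ({'weather': 'sunny', 'temp': 'hot'},   'stay in'),
    ({'weather': 'cloudy', 'temp': 'mild'}, 'play'),
]

def best_single_condition_rule(data):
    """Pick the (attribute == value) test whose covered examples are purest."""
    best = None
    for attrs, _ in data:
        for attr, value in attrs.items():
            covered = [out for a, out in data if a.get(attr) == value]
            outcome, count = Counter(covered).most_common(1)[0]
            accuracy = count / len(covered)
            score = (accuracy, len(covered))
            if best is None or score > best[0]:
                best = (score, (attr, value, outcome))
    return best[1]

def simple_rule_induction(data):
    """Separate-and-conquer loop: learn a rule, remove covered examples, repeat."""
    rules = []
    remaining = list(data)
    while remaining:
        attr, value, outcome = best_single_condition_rule(remaining)
        rules.append((attr, value, outcome))
        remaining = [(a, o) for a, o in remaining if a.get(attr) != value]
    return rules

for attr, value, outcome in simple_rule_induction(examples):
    print(f"IF {attr} = {value} THEN {outcome}")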
Merits of CN2 Algorithm
• Human-Readable Rules
CN2 produces IF-THEN rules that are easy to interpret, making it ideal for
domains where transparency is critical (e.g., medicine, law).
• Handles Noisy Data Well
CN2 includes mechanisms like rule pruning and heuristic evaluation to
prevent overfitting and deal with noisy or imperfect data.
• Efficient Search via Beam Search
Instead of examining all possible rule combinations, CN2 uses beam search,
which makes the rule induction process faster and more scalable.
• Works with Categorical and Numerical Data
While originally designed for symbolic data, it can also handle numerical
attributes (usually with discretization).
• Supports Incremental Learning
CN2 builds rules step-by-step, allowing for easier updates when new data is
introduced (compared to tree-based methods that might need complete
rebuilding).
Principal Component Analysis (PCA)
Principal Component Analysis (PCA) is a popular technique in data science and
machine learning used to reduce the dimensionality of large datasets while
preserving as much information as possible. It transforms data into a new set of
variables called principal components, which are linear combinations of the
original features. These components capture the maximum variance in the data,
with the first component capturing the most variance, the second capturing the next,
and so on. PCA is especially useful when working with high-dimensional data, as it
simplifies analysis and visualization.
The process begins by standardizing the data to ensure all features have the same
scale, as PCA is sensitive to the scale of the data. Next, the covariance matrix is
calculated to capture the relationships between features. From this matrix,
eigenvectors and eigenvalues are derived. Eigenvectors define the directions of
the principal components, while eigenvalues represent the variance captured by
each component. The components are ranked based on their eigenvalues, and the
top components are selected based on the proportion of variance they explain.
The selected components form a new coordinate system, and the data is
projected onto these components, reducing its dimensionality. The result is a
dataset that retains the most important features but with fewer dimensions,
making it easier to analyze and visualize.
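A compact sketch of these steps using NumPy (the data matrix is made up for
illustration; in practice, scikit-learn's sklearn.decomposition.PCA wraps the same
procedure, with standardization usually done separately, e.g. via StandardScaler):
import numpy as np

# Made-up data: 6 samples with 3 correlated features
X = np.array([[2.5, 2.4, 1.0],
              [0.5, 0.7, 0.3],
              [2.2, 2.9, 1.1],
              [1.9, 2.2, 0.9],
              [3.1, 3.0, 1.4],
              [2.3, 2.7, 1.2]])

# 1. Standardize the features (zero mean, unit variance)
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# 2. Covariance matrix of the standardized data
cov = np.cov(X_std, rowvar=False)

# 3. Eigenvectors (directions) and eigenvalues (variance captured)
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# 4. Rank components by eigenvalue, largest first, and keep the top 2
order = np.argsort(eigenvalues)[::-1]
components = eigenvectors[:, order[:2]]
explained = eigenvalues[order] / eigenvalues.sum()

# 5. Project the data onto the selected components
X_reduced = X_std @ components

print("Explained variance ratio:", explained[:2])
print("Reduced data shape:", X_reduced.shape)   # (6, 2)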
PCA is widely used for data visualization, feature extraction, and noise
reduction, especially when dealing with highly correlated data. It helps in
speeding up computations and improving model performance by reducing
overfitting. However, PCA has limitations: it assumes linear relationships
between features and can be sensitive to outliers. Despite these challenges,
PCA remains a powerful and essential tool for simplifying complex datasets while
retaining critical information.