Unit 4 Fod
UNIT IV
PYTHON LIBRARIES FOR DATA WRANGLING
LIST OF IMPORTANT QUESTIONS
UNIT IV
PYTHON LIBRARIES FOR DATA WRANGLING
PART – B
1. Explain in detail about Data Wrangling in Python.
2. Explain the two main ways to carry out boolean masking.
3. Explain in detail about Aggregation in Pandas.
4. Pandas DataFrame - transform() function
5. Explain in detail about the pivot table using python.
PART – A
6. Write short note on Data Manipulation using Pandas.
• Dropping columns in the data.
• Dropping rows in the data.
• Renaming a column in the dataset.
• Select columns with specific data types.
• Slicing the dataset.
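These operations can be sketched on a small made-up DataFrame (the column names below are illustrative only, not from any real dataset):

```python
import pandas as pd

# a small made-up DataFrame for illustration
df = pd.DataFrame({
    'name': ['Ann', 'Bob', 'Cara'],
    'age': [25, 32, 29],
    'city': ['Pune', 'Delhi', 'Chennai']})

dropped_col = df.drop(columns=['city'])       # dropping a column
dropped_row = df.drop(index=[0])              # dropping a row
renamed = df.rename(columns={'age': 'Age'})   # renaming a column
numeric = df.select_dtypes(include='number')  # columns with a numeric dtype
sliced = df.iloc[0:2]                         # slicing the first two rows
```

Each call returns a new DataFrame; the original df is left unchanged unless it is reassigned.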
7. How can Pandas get missing data?
In order to check missing values in a Pandas DataFrame, we use the functions isnull()
and notnull(). Both functions help in checking whether a value is NaN or not. These functions
can also be used on a Pandas Series in order to find null values in a series.
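A short sketch of both functions on a toy Series:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0, np.nan])

mask_null = s.isnull()          # True where the value is NaN
mask_not_null = s.notnull()     # True where the value is present
missing_count = s.isnull().sum()  # number of missing values
print(missing_count)  # 2
```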
8. How do you treat missing data in Python?
There are several different methods to handle them:
1. Drop rows or columns that have a missing value.
2. Drop rows or columns that only have missing values.
3. Drop rows or columns based on a threshold value.
4. Drop based on a particular subset of columns.
5. Fill with a constant value.
6. Fill with an aggregated value.
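All six cases map onto dropna() and fillna(); a short sketch on a made-up DataFrame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1.0, np.nan, 3.0],
                   'B': [np.nan, np.nan, 6.0],
                   'C': [7.0, 8.0, 9.0]})

drop_any = df.dropna()                # 1. drop rows with any missing value
drop_all = df.dropna(how='all')       # 2. drop rows that are entirely missing
drop_thr = df.dropna(thresh=2)        # 3. keep rows with at least 2 non-NaN values
drop_sub = df.dropna(subset=['A'])    # 4. drop rows where column 'A' is missing
fill_zero = df.fillna(0)              # 5. fill with a constant value
fill_mean = df.fillna(df.mean())      # 6. fill with an aggregated value (column mean)
```

Passing axis=1 to dropna() applies the same rules to columns instead of rows.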
9. How to use Hierarchical Indexes with Pandas?
# importing pandas library as alias pd
import pandas as pd
# calling the pandas read_csv() function
# and storing the result in DataFrame df
df = pd.read_csv('homelessness.csv')
# setting a hierarchical index (MultiIndex) on two columns;
# 'region' and 'state' are assumed to be columns in the file
df = df.set_index(['region', 'state'])
print(df.head())
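A self-contained sketch that builds a hierarchical index without needing a CSV file (the column names and values below are invented for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    'region': ['East', 'East', 'West', 'West'],
    'state': ['NY', 'PA', 'CA', 'WA'],
    'pop': [19, 12, 39, 7]})

# set_index with a list of columns creates a MultiIndex
hdf = df.set_index(['region', 'state'])

print(hdf.loc['East'])          # all rows in the 'East' region
print(hdf.loc[('West', 'CA')])  # a single (region, state) row
```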
10. List some of the Aggregation functions in Pandas.
Pandas provide us with a variety of aggregate functions. These functions help to perform
various activities on the datasets. The functions are:
• .count(): This gives a count of the data in a column.
• .sum(): This gives the sum of the data in a column.
• .min() and .max(): These help to find the minimum value and the maximum value in a
column, respectively.
• .mean() and .median(): These help to find the mean and the median of the values in a
column, respectively.
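A quick illustration of these functions on a toy Series:

```python
import pandas as pd

s = pd.Series([4, 8, 15, 16, 23])

print(s.count())         # 5
print(s.sum())           # 66
print(s.min(), s.max())  # 4 23
print(s.mean())          # 13.2
print(s.median())        # 15.0
```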
11. What is grouping in pandas?
Pandas groupby is used for grouping the data according to categories and
applying a function to each category. It also helps to aggregate data efficiently. The Pandas
DataFrame.groupby() function is used to split the data into groups based on some criteria.
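A minimal groupby() sketch on made-up data:

```python
import pandas as pd

df = pd.DataFrame({'team': ['A', 'A', 'B', 'B'],
                   'points': [10, 20, 5, 15]})

# split the rows into groups by 'team', then aggregate each group
totals = df.groupby('team')['points'].sum()
print(totals['A'], totals['B'])  # 30 20
```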
PART – B
1. Explain in detail about Data Wrangling in Python.
Data wrangling involves processing data in various ways, such as merging, grouping and
concatenating, for the purpose of analysing it or getting it ready to be used with another
set of data. Python has built-in features to apply these wrangling methods to various data sets
to achieve the analytical goal. In this chapter we will look at a few examples describing these
methods.
Merging Data
The Pandas library in python provides a single function, merge, as the entry point for all standard
database join operations between DataFrame objects −
Let us now create two different DataFrames and perform the merging operations on it.
# import the pandas library
import pandas as pd
left = pd.DataFrame({
'id':[1,2,3,4,5],
'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
'subject_id':['sub1','sub2','sub4','sub6','sub5']})
right = pd.DataFrame(
{'id':[1,2,3,4,5],
'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
'subject_id':['sub2','sub4','sub3','sub6','sub5']})
print(left)
print(right)
Name id subject_id
0 Alex 1 sub1
1 Amy 2 sub2
2 Allen 3 sub4
3 Alice 4 sub6
4 Ayoung 5 sub5
Name id subject_id
0 Billy 1 sub2
1 Brian 2 sub4
2 Bran 3 sub3
3 Bryce 4 sub6
4 Betty 5 sub5
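The merge call itself can then be sketched as follows (the two frames are repeated so the snippet runs on its own):

```python
import pandas as pd

left = pd.DataFrame({
    'id': [1, 2, 3, 4, 5],
    'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
    'subject_id': ['sub1', 'sub2', 'sub4', 'sub6', 'sub5']})
right = pd.DataFrame({
    'id': [1, 2, 3, 4, 5],
    'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
    'subject_id': ['sub2', 'sub4', 'sub3', 'sub6', 'sub5']})

# inner join on the common 'id' column; overlapping column
# names get the default _x / _y suffixes
result = pd.merge(left, right, on='id')
print(result)

# joining on 'subject_id' instead keeps only the subjects
# that appear in both frames
by_subject = pd.merge(left, right, on='subject_id')
print(by_subject)
```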
Grouping Data
Grouping data sets is a frequent need in data analysis, where we need the result in terms of
the various groups present in the data set. Pandas has in-built methods which can roll the data into
various groups.
In the below example we group the data by year and then get the result for a specific year.
# import the pandas library
import pandas as pd
# a small sample DataFrame with a 'Year' column
df = pd.DataFrame({'Team': ['Riders', 'Kings', 'Riders', 'Kings'],
 'Year': [2014, 2014, 2015, 2015],
 'Points': [876, 741, 789, 812]})
grouped = df.groupby('Year')
print(grouped.get_group(2014))
Concatenating Data
Pandas provides various facilities for easily combining Series and DataFrame
objects. In the example below the concat function performs concatenation operations
along an axis. Let us create different objects and do concatenation.
import pandas as pd
one = pd.DataFrame({
'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
'subject_id':['sub1','sub2','sub4','sub6','sub5'],
'Marks_scored':[98,90,87,69,78]},
index=[1,2,3,4,5])
two = pd.DataFrame({
'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
'subject_id':['sub2','sub4','sub3','sub6','sub5'],
'Marks_scored':[89,80,79,97,88]},
index=[1,2,3,4,5])
print(pd.concat([one,two]))
2. Explain the two main ways to carry out boolean masking.
The NumPy library in Python is a popular library for working with arrays. Boolean masking, also
called boolean indexing, is a feature in NumPy that allows for the filtering of values
in NumPy arrays.
The first method returns an array with the required results. In this method, we pass a condition in
the indexing brackets, [ ], of an array. The condition can be any comparison, like arr > 5, for the
array arr.
Syntax
arr[arr > 5]
Return value
This method returns a NumPy array, ndarray, with values that satisfy the given condition. The line
in the example given above will return all the values in arr that are greater than 5.
Example
# importing NumPy
import numpy as np
# Creating a NumPy array
arr = np.arange(15)
# Printing our array to observe
print(arr)
# Using boolean masking to filter elements greater than or equal to 8
print(arr[arr >= 8])
# Using boolean masking to filter elements equal to 12
print(arr[arr == 12])
The second method returns a boolean array that has the same size as the array it represents.
A boolean array only contains the boolean values of either True or False. This boolean array is
also called a mask array, or simply a mask. We'll discuss boolean arrays in more detail in the
"Return value" section.
Syntax
The code snippet given below shows us how to use this method:
mask = arr > 5
Return value
• Return an array with the same size and dimensions as arr. This array will only contain the
values True and False. All the True values represent elements in the same position
in arr that satisfy our condition, and all the False values represent elements in the same
position in arr that do not satisfy our condition.
• Store this boolean array in a mask array.
The mask array can be passed in the index brackets of arr to return the values that satisfy our
condition. We will see how this works in our coding example.
Example
# importing NumPy
import numpy as np
# Creating a NumPy array
arr = np.array([[ 0, 9, 0],
[ 0, 7, 8],
[ 6, 0, 1]])
# Printing our array to observe
print(arr)
# Creating a mask array
mask = arr > 5
# Printing the mask array
print(mask)
# Printing the filtered array using both methods
print(arr[mask])
print(arr[arr > 5])
3. Explain in detail about Aggregation in Pandas.
Pandas provide us with a variety of aggregate functions. These functions help to perform
various activities on the datasets. The functions are:
• .count(): This gives a count of the data in a column.
• .sum(): This gives the sum of the data in a column.
• .min() and .max(): These help to find the minimum value and the maximum value in a
column, respectively.
• .mean() and .median(): These help to find the mean and the median of the values in a
column, respectively.
Parameters:
func: the function (or functions) used for aggregating the data. A function must either work
when passed a DataFrame or when passed to DataFrame.apply(). A dict can also be passed,
if the keys are the column names.
Returns:
scalar: when Series.agg is called with a single function.
Series: when DataFrame.agg is called with a single function.
DataFrame: when DataFrame.agg is called with several functions.
Example:
import pandas as pd
import numpy as np
info = pd.DataFrame([[1,5,7],[10,12,15],[18,21,24],[np.nan,np.nan,np.nan]],
 columns=['X','Y','Z'])
info.agg(['sum','min'])
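As noted above, a dict can also be passed to agg(), with the keys naming the columns; a short sketch:

```python
import pandas as pd

info = pd.DataFrame([[1, 5, 7], [10, 12, 15], [18, 21, 24]],
                    columns=['X', 'Y', 'Z'])

# the dict keys are column names, the values are the aggregations;
# with one function per column, a Series is returned
result = info.agg({'X': 'sum', 'Y': 'min'})
print(result['X'], result['Y'])  # 29 5
```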
4. Pandas DataFrame - transform() function
The transform() function is used to call a function (func) on self, producing a DataFrame with
transformed values that has the same axis length as self.
Syntax:
DataFrame.transform(self, func, axis=0, *args, **kwargs)
Parameters:
func: Function to use for transforming the data. If a function, it must either work when
passed a DataFrame or when passed to DataFrame.apply.
Accepted combinations are:
• function
• string function name
• list of functions and/or function names, e.g. [np.exp, 'sqrt']
• dict of axis labels -> functions, function names or lists of such.
(Type: function, str, list or dict. Required.)
Returns: DataFrame
A DataFrame that must have the same length as self.
Raises: ValueError - If the returned DataFrame has a different length than self.
Example:
In [1]:
import numpy as np
import pandas as pd
In [2]:
df = pd.DataFrame({'X': range(4), 'Y': range(2, 6)})
df
Out[2]:
X Y
0 0 2
1 1 3
2 2 4
3 3 5
In [3]:
df.transform(lambda x: x + 1)
Out[3]:
X Y
0 1 3
1 2 4
2 3 5
3 4 6
Even though the resulting DataFrame must have the same length as the input DataFrame,
it is possible to provide several input functions:
In [4]:
s = pd.Series(range(4))
s
Out[4]:
0 0
1 1
2 2
3 3
dtype: int64
In [5]:
s.transform([np.sqrt, np.exp])
Out[5]:
sqrt exp
0 0.000000 1.000000
1 1.000000 2.718282
2 1.414214 7.389056
3 1.732051 20.085537
5. Explain in detail about the pivot table using python.
Most people likely have experience with pivot tables in Excel. Pandas provides a similar function
called (appropriately enough) pivot_table. While it is exceedingly useful, I frequently find myself
struggling to remember how to use the syntax to format the output for my needs. This article will
focus on explaining the pandas pivot_table function and how to use it for your data analysis.
As an added bonus, I've created a simple cheat sheet that summarizes the pivot_table. You can
find it at the end of this post and I hope it serves as a useful reference. Let me know if it
is helpful.
The Data
One of the challenges with using the pandas pivot_table is making sure we understand our data
and what questions we are trying to answer with the pivot table. It is a seemingly simple function
but can produce very powerful analysis very quickly.
In this scenario, I’m going to be tracking a sales pipeline (also called funnel). The basic problem
is that some sales cycles are very long (think “enterprise software”, capital equipment, etc.) and
management wants to understand it in more detail throughout the year.
Many companies will have CRM tools or other software that sales uses to track the process. While
they may have useful tools for analyzing the data, inevitably someone will export the data to Excel
and use a PivotTable to summarize the data.
import pandas as pd
import numpy as np
Version Warning
The pivot_table API has changed over time so please make sure you have a recent version of
pandas ( > 0.15) installed for this example to work. This example also uses the category data
type, which requires a recent version as well.
Read in our sales funnel data into our DataFrame
df = pd.read_excel("../in/sales-funnel.xlsx")
df.head()
For convenience's sake, let's define the status column as a category and set the order we want
to view.
This isn’t strictly required but helps us keep the order we want as we work through analyzing
the data.
df["Status"] = df["Status"].astype("category")
df["Status"].cat.set_categories(["won","pending","presented","declined"],inplace=True)
As we build up the pivot table, I think it’s easiest to take it one step at a time. Add items and check
each step to verify we are getting the results we expect. Don’t be afraid to play with the order and
the variables to see what presentation makes the most sense for our needs.
The simplest pivot table must have a dataframe and an index . In this case, let’s use the Name
as our index.
pd.pivot_table(df,index=["Name"])
We can have multiple indexes as well. In fact, most of the pivot_table args can take multiple
values via a list.
pd.pivot_table(df,index=["Name","Rep","Manager"])
This is interesting but not particularly useful. What we probably want to do is look at this by
Manager and Rep. It’s easy enough to do by changing the index .
pd.pivot_table(df,index=["Manager","Rep"])
We can see that the pivot table is smart enough to start aggregating the data and summarizing it
by grouping the reps with their managers. Now we start to get a glimpse of what a pivot table can
do for us.
For this purpose, the Account and Quantity columns aren't really useful. Let's remove them by
explicitly defining the columns we care about using the values field.
pd.pivot_table(df,index=["Manager","Rep"],values=["Price"])
The price column automatically averages the data but we can do a count or a sum. Adding them
is simple using aggfunc and np.sum .
pd.pivot_table(df,index=["Manager","Rep"],values=["Price"],aggfunc=np.sum)
aggfunc can take a list of functions. Let’s try a mean using the numpy mean function and len to
get a count.
pd.pivot_table(df,index=["Manager","Rep"],values=["Price"],aggfunc=[np.mean,len])
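Since the sales-funnel spreadsheet isn't included here, the same calls can be sketched on a small made-up frame (the names and prices below are invented for illustration; string aggfunc names behave like np.mean and len):

```python
import pandas as pd

# made-up stand-in for the sales funnel data
df = pd.DataFrame({
    'Manager': ['Debra', 'Debra', 'Fred', 'Fred'],
    'Rep': ['Craig', 'Craig', 'Wendy', 'Dan'],
    'Price': [30000, 10000, 65000, 40000]})

# default aggregation is the mean of each numeric column
means = pd.pivot_table(df, index=['Manager', 'Rep'], values=['Price'])
print(means)

# multiple aggregations: a sum and a row count per group
table = pd.pivot_table(df, index=['Manager', 'Rep'], values=['Price'],
                       aggfunc=['sum', 'count'])
print(table)
```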