Pandas Questions

Pandas Questions Assignment

Question 1 ) What is Pandas, and what is its primary purpose in Python data analysis?
Answer )

Pandas :
Pandas is an open-source Python package that is widely used for data science, data analysis,
and machine learning tasks.

It is built on top of another package named NumPy, which provides support for multi-dimensional
arrays. As one of the most popular data wrangling packages, Pandas works well with many other
data science modules in the Python ecosystem. It is not part of the Python standard library, but
it is bundled with many data-science distributions, including commercial vendor distributions
like ActiveState’s ActivePython.

Purpose :
Pandas makes it simple to do many of the time-consuming, repetitive tasks associated with working
with data, including:

* Data cleansing
* Filling in missing data
* Data normalization
* Merges and joins
* Data visualization
* Statistical analysis
* Data inspection
* Loading and saving data
* And much more

The purpose of Pandas in data analysis :

Data analysis is the technique of collecting, transforming, and organizing data to make future
predictions and informed, data-driven decisions. It also helps to find possible solutions to a
business problem.

Data analysis involves several steps, and Pandas plays an important role in each of them:

* Prepare or collect data
* Clean and process
* Analyze
* Share
* Act or report

Question 2 ) How do you install Pandas in your Python environment?
Answer )

Pandas requires Python 3. The minimum supported Python version depends on the Pandas release:
older releases ran on Python 3.6 to 3.8, while Pandas 2.1 and later require Python 3.9 or newer.

* Install Pandas using pip :

Step 1 : Launch Command Prompt (or any terminal).

Step 2 : Run the command

pip install pandas

This starts the pip installation. Once pip has downloaded and installed the necessary files,
Pandas is ready to use on your computer.
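
A quick way to confirm that the installation worked is to import the package and print its version. This is only a minimal check; the version printed depends on what pip installed.

```python
import pandas as pd

# If the import succeeds, Pandas is installed in this environment.
installed_version = pd.__version__
print(installed_version)
```

If the import fails with ModuleNotFoundError, pip most likely installed Pandas into a different Python environment than the one running the script.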
Question 3 ) What are the two primary data structures in Pandas, and how do they differ?
Answer )

Pandas provides two essential data structures: Series and DataFrame.

1 . Series :
A Pandas Series is a one-dimensional array-like object that can hold data of any
type (integer, float, string, etc.). It is labelled, meaning each element is paired with a
label, and together these labels form the index. Labels are usually unique, but Pandas does
not require it.

Series are a fundamental data structure in Pandas and are commonly used for data
manipulation and analysis tasks. They can be created from lists, arrays, dictionaries, and
existing Series objects.

Creating a Series from a list, from a dictionary, and with a custom index :

import pandas as pd

# Initializing a Series from a list (default index 0, 1, 2, ...)
data = [1, 2, 3, 4, 5]
series_from_list = pd.Series(data)
print(series_from_list)

# Initializing a Series from a dictionary (the keys become the index)
data = {'a': 1, 'b': 2, 'c': 3}
series_from_dict = pd.Series(data)
print(series_from_dict)

# Initializing a Series with a custom index
data = [1, 2, 3, 4, 5]
index = ['a', 'b', 'c', 'd', 'e']
series_custom_index = pd.Series(data, index=index)
print(series_custom_index)

2 . DataFrame :
A Pandas DataFrame is a two-dimensional, tabular data structure with rows and columns.

A DataFrame has three main components: the data, which is stored in rows and columns;
the row labels, which form the index; and the column labels, which name each column of
data.

Indexing:
A DataFrame provides flexible indexing options, allowing access to rows,
columns, or individual elements based on labels or integer positions.

import pandas as pd

# Initializing a DataFrame from a dictionary (keys become column names)
data = {'Name': ['John', 'Alice', 'Bob'],
        'Age': [25, 30, 35],
        'City': ['New York', 'Los Angeles', 'Chicago']}
df = pd.DataFrame(data)
print(df)

# Initializing a DataFrame from a list of lists (one inner list per row)
data = [['John', 25, 'New York'],
        ['Alice', 30, 'Los Angeles'],
        ['Bob', 35, 'Chicago']]
columns = ['Name', 'Age', 'City']
df = pd.DataFrame(data, columns=columns)
print(df)
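
The three components described above (data, row labels, column labels) can be inspected directly. The sketch below is illustrative only; the row labels r1 to r3 are hypothetical and were chosen so that labels and positions are easy to tell apart.

```python
import pandas as pd

df = pd.DataFrame(
    {"Name": ["John", "Alice", "Bob"], "Age": [25, 30, 35]},
    index=["r1", "r2", "r3"],  # hypothetical row labels
)

row_labels = list(df.index)       # the index: labels of the rows
column_labels = list(df.columns)  # the labels of the columns
ages = df["Age"]                  # a single column comes back as a Series

alice_age_by_label = df.loc["r2", "Age"]  # access by row and column label
alice_age_by_position = df.iloc[1, 1]     # access by integer positions

print(row_labels, column_labels)
print(alice_age_by_label, alice_age_by_position)
```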

Series                                                | DataFrame
One-dimensional                                       | Two-dimensional
One dtype for all elements (mixed values use object)  | Each column can have its own dtype
Usually described as size-immutable                   | Size-mutable (columns can be added or dropped)
Element-wise computations                             | Element-wise and column-wise computations
Fewer operations (a single column of data)            | More operations (joins, pivots, multi-column work)
Aligns on its index                                   | Aligns on both index and columns
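
A small sketch to check the dtype and alignment behaviour in practice. It shows that a Series holds a single dtype, that each DataFrame column keeps its own dtype, and that arithmetic between two Series matches values by index label rather than by position. The sample values are made up for illustration.

```python
import pandas as pd

# A Series has one dtype; mixing numbers and strings falls back to object.
mixed = pd.Series([1, "two", 3.0])

# Each DataFrame column keeps its own dtype.
df = pd.DataFrame({"Name": ["John", "Alice"], "Age": [25, 30]})

# Series arithmetic aligns on index labels.
s1 = pd.Series([1, 2, 3], index=["a", "b", "c"])
s2 = pd.Series([10, 20, 30], index=["b", "c", "d"])
total = s1 + s2  # labels found in only one Series ("a", "d") become NaN

print(mixed.dtype)
print(df.dtypes)
print(total)
```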


Question 4 ) How can you read a CSV file into a Pandas DataFrame? Provide an example.
Answer )

CSV files (comma-separated values) :

We can read a CSV file with the Pandas read_csv() function. For example:

import pandas as pd
df = pd.read_csv('data.csv')
print(df)

1. First, we import the pandas library.
2. We call read_csv() to read the CSV file and store its contents in a DataFrame.
3. We print the DataFrame.
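
The same steps can be tried without a file on disk. In the sketch below the CSV text is made-up sample data wrapped in io.StringIO, which read_csv() accepts in place of a file path.

```python
import io

import pandas as pd

# Made-up CSV content standing in for data.csv.
csv_text = "Name,Age,City\nJohn,25,New York\nAlice,30,Los Angeles\n"

# read_csv accepts a path or any file-like object.
df = pd.read_csv(io.StringIO(csv_text))

print(df.head())  # first rows of the DataFrame
print(df.shape)   # (number of rows, number of columns)
```

The first line of the CSV is used as the column names by default.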

Question 5 ) What is the difference between the loc[] and iloc[] methods when selecting data
from a Pandas DataFrame?

Answer )

LOC :
loc is a label-based indexer. It takes up to two arguments inside square brackets: the first
selects rows by label and the second selects columns by label. A colon (:) selects all rows
or all columns. loc also accepts a Boolean expression, so rows can be filtered by a condition.

1. Syntax :
Dataframe.loc[specific rows, specific columns]

2. Selecting a subset of rows and columns :

subset = df.loc[df['Department'] == 'Marketing', ['Name', 'Salary']]

Here are a few advantages of loc:

* It supports label-based indexing, which is easy to read and understand.
* It can be used with Boolean arrays to filter rows.
* It works on both single-level and multi-level indexes.

ILOC :
iloc is a position-based indexer. Instead of labels, it selects rows and columns by their
integer positions, counting from 0.

iloc can select whole rows and columns, slices of them, or a single value in a DataFrame.
It accepts a Boolean array as a mask, but not a Boolean Series such as the result of
df['Department'] == 'Marketing'; convert the Series to an array first (for example with
.to_numpy()).

Syntax :
df.iloc[row_index_value, column_index_value]

loc                                                      | iloc
Selects rows and columns by labels                       | Selects rows and columns by integer positions
Slicing with labels (end label included)                 | Slicing with integer positions (end position excluded)
Accepts Boolean arrays and Boolean Series                | Accepts Boolean arrays, but not a Boolean Series
Label-based indexing                                     | Position-based indexing
Syntax : Dataframe.loc[specific rows, specific columns]  | Syntax : Dataframe.iloc[row_index_value, column_index_value]
Example : df.loc[2, 'Salary']                            | Example : df.iloc[2, 2]
Example : df.loc[df['Department'] == 'Marketing', ['Name', 'Salary']]  | Example : df.iloc[1:3, :2]

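
The difference is easiest to see on a DataFrame whose row labels are not 0, 1, 2. The sketch below uses made-up employee data with the hypothetical row labels 10, 20, 30, and 40, so that a label and a position can never be confused.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Name": ["Asha", "Ben", "Chen", "Dina"],
        "Department": ["Sales", "Marketing", "Marketing", "IT"],
        "Salary": [50000, 60000, 65000, 70000],
    },
    index=[10, 20, 30, 40],  # hypothetical row labels
)

by_label = df.loc[20, "Salary"]  # row labelled 20, column "Salary"
by_position = df.iloc[1, 2]      # second row, third column

label_slice = df.loc[10:30]      # label slice: end label is included
position_slice = df.iloc[0:2]    # position slice: end position is excluded

# Boolean filtering: loc takes the Boolean Series directly,
# iloc needs it converted to a plain array first.
mask = df["Department"] == "Marketing"
marketing = df.loc[mask, ["Name", "Salary"]]
marketing_by_position = df.iloc[mask.to_numpy(), [0, 2]]

print(marketing)
```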
Question 6 ) How do you handle missing data (NaN or None) in a Pandas DataFrame?
Answer )
In Pandas, missing values are represented by None or NaN. They can occur when data was not
collected or when entries are incomplete.

To identify missing values, Pandas provides two useful functions:
isnull() and notnull()

1 ) isnull() : returns a DataFrame of Boolean values, where True represents missing data (NaN).
missing_values = df.isnull()

2 ) notnull() : returns a DataFrame of Boolean values, where True indicates non-missing data.
present_values = df.notnull()

Handling Missing Values :

The fillna() and replace() functions are commonly used to fill NaN values. Both return a new
DataFrame and leave df unchanged unless the result is assigned back.

(1) fillna() : The fillna() function replaces missing values (NaN) with a
specified value. For example, you can fill missing values with 0.
Syntax : df.fillna(0)

(2) replace() : Use replace() to replace NaN values with a specific value. This needs
NumPy (import numpy as np) for np.nan.

Syntax : df.replace(to_replace=np.nan, value=-99)

( Replaces NaN with -99 )
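
The functions above can be tried together on a small DataFrame. The values below are made up, and filling with the column mean is shown only as one common choice, not as a recommendation from the original answer.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1.0, np.nan, 3.0], "B": [np.nan, 5.0, 6.0]})

missing_mask = df.isnull()              # True where a value is missing
missing_per_column = df.isnull().sum()  # number of missing values per column

filled_zero = df.fillna(0)              # replace NaN with a constant
filled_mean = df.fillna(df.mean())      # replace NaN with each column's mean
replaced = df.replace(to_replace=np.nan, value=-99)

print(missing_per_column)
print(filled_mean)
```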

Question 7 ) What is the function of the groupby() method in Pandas, and how is it typically
used in data analysis?
Answer )

Pandas groupby() splits the records of a data set into groups so that you can analyze the
data group by group. When you call .groupby() on a categorical column of a DataFrame, it
returns a GroupBy object, on which you can call other methods to aggregate or summarize
each group.

In the real world, you’ll usually work with large amounts of data and need to apply the same
operation to different groups of data. groupby() handles these cases concisely, which makes
it a must-know function in data analysis.

For example, we can get the number of groups, the size of each group, or aggregates such as
the sum, count, or mean per group:

1. ) Number of Groups :
To find out how many groups the data will be divided into, use the nunique() function
on the grouping column. It returns the number of unique values in that column, and the
data is divided into exactly that many groups.
Eg : df.Product_Category.nunique()

2. ) Group Sizes :
The number of rows in each group of a GroupBy object can be obtained
using the function .size().
Eg: df.groupby("Product_Category").size()

3. ) Aggregate Multiple Columns :

Applying an aggregate function to columns within each group is one of the most widely used
practices. After grouping the data by Product_Category, suppose you want to see the
average unit price and quantity in each product category:

# Create a groupby object
df_group = df.groupby("Product_Category")

# Select only the required columns
df_columns = df_group[["UnitPrice(USD)", "Quantity"]]

# Apply the aggregate function
df_columns.mean()
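
The three steps can be run end to end on a small data set. The rows below are made up; only the column names are taken from the examples above. The last statement sketches the sum and count per group using named aggregation, where total_quantity and orders are hypothetical output column names.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Product_Category": ["Toys", "Books", "Toys", "Books", "Games"],
        "UnitPrice(USD)": [10.0, 20.0, 30.0, 40.0, 50.0],
        "Quantity": [1, 2, 3, 4, 5],
    }
)

# 1. Number of groups
n_groups = df["Product_Category"].nunique()

# 2. Rows per group
sizes = df.groupby("Product_Category").size()

# 3. Mean of selected columns per group
means = df.groupby("Product_Category")[["UnitPrice(USD)", "Quantity"]].mean()

# Sum and count per group, with hypothetical output column names
summary = df.groupby("Product_Category").agg(
    total_quantity=("Quantity", "sum"),
    orders=("Quantity", "count"),
)

print(sizes)
print(means)
print(summary)
```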

Question 8 ) How can you merge or join two Pandas DataFrames based on a common column or
Key?
Answer )

Merge :
The merge function in Pandas combines two DataFrames based on a common column or
index. By default it performs an inner join, so only keys present in both DataFrames are kept.

import pandas as pd

# Creating DataFrame 1
df1 = pd.DataFrame({
    'Name': ['Raju', 'Rani', 'Geeta', 'Sita', 'Sohit'],
    'Marks': [80, 90, 75, 88, 59]
})

# Creating DataFrame 2
df2 = pd.DataFrame({
    'Name': ['Raju', 'Divya', 'Geeta', 'Sita'],
    'Grade': ['A', 'A', 'B', 'A'],
    'Rank': [3, 1, 4, 2],
    'Gender': ['Male', 'Female', 'Female', 'Female']
})

# Display DataFrames
print("DataFrame 1:")
print(df1)
print("\nDataFrame 2:")
print(df2)

# Merging 2 dataframes
df_merged = df1.merge(df2[['Name', 'Grade', 'Rank']], on='Name')
print("\nMerged DataFrame:")
print(df_merged)
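
The example above relies on the default inner join. As a sketch of the other join types, the how parameter of merge() controls which rows are kept; the shortened DataFrames below are illustrative and reuse a few of the names from the example.

```python
import pandas as pd

df1 = pd.DataFrame({"Name": ["Raju", "Rani", "Geeta"], "Marks": [80, 90, 75]})
df2 = pd.DataFrame({"Name": ["Raju", "Divya", "Geeta"], "Grade": ["A", "A", "B"]})

inner = df1.merge(df2, on="Name")               # default: names in both frames
left = df1.merge(df2, on="Name", how="left")    # every row of df1
outer = df1.merge(df2, on="Name", how="outer")  # names from either frame

print(inner)
print(left)   # Rani has no match in df2, so her Grade is NaN
print(outer)
```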

Question 9 ) Explain the concept of pivot tables in Pandas and how they can be created
Answer )

The pivot_table() function takes a DataFrame and parameters describing the shape you want the
data to take. It then outputs summarized data in the form of a pivot table.

Pivot tables in Pandas are a very effective way to analyze and summarize data.

How to Create a Pandas Pivot Table

A Pandas pivot table has three main elements:

Index : This specifies the row-level grouping.
Columns : This specifies the column-level grouping.
Values : These are the numerical values you are looking to summarize.

* Pivot tables can be multi-level. We can use multiple index and column-level
groupings to create more powerful summaries of a data set.

Eg )
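
A minimal sketch of the three elements in use, on made-up sales data. The column names Region, Product, and Sales are hypothetical, and aggfunc="sum" is chosen for illustration (when aggfunc is omitted, pivot_table() averages the values).

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Region": ["East", "East", "West", "West", "East"],
        "Product": ["A", "B", "A", "B", "A"],
        "Sales": [100, 150, 200, 250, 300],
    }
)

# Rows grouped by Region, columns grouped by Product, Sales summed per cell.
table = pd.pivot_table(
    df,
    index="Region",
    columns="Product",
    values="Sales",
    aggfunc="sum",
)

print(table)
```

Passing a list such as index=["Region", "Product"] would give the multi-level form mentioned above.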

Question 10 )What is the purpose of the apply() function in Pandas, and when might you use it in
data transformation?
Answer )

The apply() method is one of the most commonly used methods in data preprocessing. It applies
a function to each element of a Pandas Series, or to each row or column of a Pandas
DataFrame.

Series.apply() lets the user pass a function and apply it to every single value of the
Series. This is useful for transforming or labelling values according to custom conditions,
which is why it is widely used in data science and machine learning.

Syntax :
s.apply(func, convert_dtype=True, args=())

func: the function to apply to every value of the Series.
convert_dtype: try to find a better dtype for the results (deprecated since Pandas 2.1).
args=(): additional positional arguments passed to func after the Series value.
Return type: a Pandas Series holding the result of func for each value.

EG :
import pandas as pd

# Read a single-column CSV and squeeze it into a Series.
# (The squeeze=True argument of read_csv was removed in Pandas 2.0.)
s = pd.read_csv("stock.csv").squeeze("columns")

# adding 5 to each value
new = s.apply(lambda num: num + 5)
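
The same idea can be tried without stock.csv. The sketch below uses made-up values and shows apply() on a Series, the args parameter, and apply() on DataFrame rows with axis=1; the helper add_bonus and the column names are hypothetical.

```python
import pandas as pd

s = pd.Series([10, 20, 30])

# Apply a lambda to every value of the Series.
plus_five = s.apply(lambda num: num + 5)

# Extra positional arguments are passed through args.
def add_bonus(value, bonus):
    return value + bonus

with_bonus = s.apply(add_bonus, args=(100,))

# On a DataFrame, axis=1 passes each row to the function.
df = pd.DataFrame({"price": [10.0, 20.0], "quantity": [3, 4]})
df["total"] = df.apply(lambda row: row["price"] * row["quantity"], axis=1)

print(plus_five.tolist())
print(with_bonus.tolist())
print(df)
```

For simple arithmetic like this, the vectorized forms s + 5 and df["price"] * df["quantity"] give the same result and are usually faster; apply() is most useful when the logic cannot be written as a column operation.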
