Pandas Questions
Question 1 ) What is Pandas, and what is its primary purpose in Python data analysis?
Answer )
Pandas :
Pandas is an open-source Python package that is widely used for data science, data analysis,
and machine learning tasks.
It is built on top of another package, NumPy, which provides support for multi-dimensional
arrays. As one of the most popular data wrangling packages, Pandas works well with many other
data science modules in the Python ecosystem. It is not part of the Python standard library, but
it is bundled with many data-oriented Python distributions, including commercial vendor
distributions like ActiveState’s ActivePython.
Purpose :
Pandas makes it simple to do many of the time-consuming, repetitive tasks associated with working
with data, including:
* Data cleansing
* Data fill
* Data normalization
* Merges and joins
* Data visualization
* Statistical analysis
* Data inspection
* Loading and saving data
* And much more
Pandas plays an important role in several steps of data analysis.
Pandas requires Python 3 as a prerequisite for installation: Python 3.6, 3.7, 3.8, or later,
depending on the Pandas release.
Running "pip install pandas" from a terminal starts the pip installation. After the necessary
files are downloaded, Pandas is ready to use on your computer.
Question 3 ) What are the two primary data structures in Pandas, and how do they differ?
Answer )
1 . Series :
A Pandas Series is a one-dimensional array-like object that can hold data of any
type (integer, float, string, etc.). It is labelled, meaning each element is associated with a
label; together these labels form the index. Labels are usually unique, but they do not have to be.
Series are a fundamental data structure in Pandas and are commonly used for data
manipulation and analysis tasks. They can be created from lists, arrays, dictionaries, and
existing Series objects.
Creating a Series data structure from a list, dictionary, and custom index
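A minimal sketch of the three ways of building a Series named above. The values, keys, and index labels are made up for illustration.

```python
import pandas as pd

# From a list: pandas assigns a default integer index 0, 1, 2.
from_list = pd.Series([10, 20, 30])

# From a dictionary: the keys become the index labels.
from_dict = pd.Series({"a": 1, "b": 2, "c": 3})

# From a list with a custom index supplied explicitly.
custom = pd.Series([1.5, 2.5, 3.5], index=["x", "y", "z"])

print(from_list)
print(from_dict)
print(custom["y"])  # label-based lookup of a single element
```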
2 . DataFrame :
A Pandas DataFrame is a two-dimensional, tabular data structure with rows and columns.
A DataFrame has three main components: the data, which is stored in rows and columns;
the rows, which are labeled by an index; and the columns, which are labeled by column names
and hold the actual values.
Indexing:
DataFrame provides flexible indexing options, allowing access to rows,
columns, or individual elements based on labels or integer positions.
import pandas as pd

# Initializing a DataFrame from a dictionary: keys become column names
data = {'Name': ['John', 'Alice', 'Bob'],
        'Age': [25, 30, 35],
        'City': ['New York', 'Los Angeles', 'Chicago']}
df = pd.DataFrame(data)
print(df)
Series                                 DataFrame
Immutable (size cannot be changed).    Mutable (size can be changed).
import pandas as pd
df = pd.read_csv('data.csv')
print(df)
Question 5 ) What is the difference between the loc[] and iloc[] methods when selecting data
from a Pandas DataFrame?
Answer )
LOC :
loc selects data by label. It takes up to two arguments: the first is the row label (or labels)
and the second is the column label (or labels). A colon (:) selects all rows or all columns.
loc also accepts boolean expressions, such as a condition on a column, to filter rows.
Syntax :
DataFrame.loc[specific rows, specific columns]
ILOC :
iloc selects rows and columns by integer position, counting from 0. It can also access a
single value in a DataFrame. It accepts single positions, lists of positions, and slices. Unlike
loc, it does not accept a boolean Series as a mask; the mask must first be converted to a plain
boolean array.
Syntax :
df.iloc[row_index_value, column_index_value]
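The difference between label-based and position-based selection can be sketched as follows. The row labels "r1" to "r3" are hypothetical, chosen so that labels and positions are easy to tell apart.

```python
import pandas as pd

df = pd.DataFrame(
    {"Name": ["John", "Alice", "Bob"], "Age": [25, 30, 35]},
    index=["r1", "r2", "r3"],  # hypothetical row labels
)

# loc: select by row label and column label.
by_label = df.loc["r2", "Age"]

# iloc: select by integer position (second row, second column).
by_position = df.iloc[1, 1]

# Label slices include the end label; position slices exclude the end position.
label_slice = df.loc["r1":"r2", :]
pos_slice = df.iloc[0:2, :]

# loc accepts a boolean condition as the row selector.
older = df.loc[df["Age"] > 26, "Name"]

print(by_label, by_position)
print(older)
```

Both slices return the same two rows here, even though one is written with labels and the other with positions.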
To identify missing values, Pandas provides two useful functions:
isnull() and notnull()
1 ) isnull() : returns a DataFrame of Boolean values, where True represents missing data (NaN).
missing_values = df.isnull()
2 ) notnull() : returns a DataFrame of Boolean values, where True indicates non-missing data.
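To handle missing values once they are found, two functions are commonly used:
(1) fillna() : The fillna() function is used to replace missing values (NaN) with a
specified value. For example, you can fill missing values with 0.
Syntax : df.fillna(0)
(2) replace() : Use replace() to replace NaN values with a specific value.
Syntax : df.replace(np.nan, 0), where np is NumPy (import numpy as np)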
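A short sketch of finding and filling missing values with the functions above. The column names "A" and "B" and the fill values are assumptions made for the example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1.0, np.nan, 3.0], "B": ["x", None, "z"]})

# Boolean masks: True marks missing cells (isnull) or present cells (notnull).
missing_mask = df.isnull()
present_mask = df.notnull()

# Summing a boolean mask counts the missing values in each column.
missing_per_column = df.isnull().sum()

# fillna() with a dictionary fills each column with its own value.
filled = df.fillna({"A": 0, "B": "unknown"})

# replace() can target NaN explicitly.
replaced = df["A"].replace(np.nan, -1)

print(missing_per_column)
print(filled)
```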
Question 7 ) What is the function of the groupby() method in Pandas, and how is it typically
used in data analysis?
Answer )
Pandas groupby splits the records in your data set into categories or groups so that
you can analyze the data group by group. When you call .groupby() on a categorical
column of a DataFrame, it returns a GroupBy object, on which you can call further methods to
aggregate or inspect each group.
In the real world, you’ll usually work with large amounts of data and need to do similar operations
over different groups of data. Pandas groupby() is handy in all those scenarios, which makes it
a must-know function in data analysis.
For example, if you need the sum and the count for each group, groupby() can compute both.
1. ) Number of Groups :
To find out how many groups your data is divided into, use the nunique() function
on the grouping column, which returns the number of unique values in that column.
The data is divided into as many groups as there are unique values in the column.
Eg : df.Product_Category.nunique()
2. ) Group Sizes :
The number of rows in each group of a GroupBy object can be easily obtained
using the function .size().
Eg: df.groupby("Product_Category").size()
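The group count, group sizes, and a per-group sum and count can be sketched together. Product_Category follows the examples above; the Sales column and all values are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "Product_Category": ["Toys", "Books", "Toys", "Books", "Games"],
    "Sales": [100, 50, 150, 70, 30],  # hypothetical column
})

# Number of groups = number of unique values in the grouping column.
n_groups = df["Product_Category"].nunique()

# Number of rows in each group.
sizes = df.groupby("Product_Category").size()

# Sum and count of Sales for each group in one call.
summary = df.groupby("Product_Category")["Sales"].agg(["sum", "count"])

print(n_groups)
print(sizes)
print(summary)
```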
Question 8 ) How can you merge or join two Pandas DataFrames based on a common column or
Key?
Answer )
Merge :
The merge function in Pandas is used to combine two DataFrames based on a common column or
index. By default it performs an inner join, keeping only the keys present in both DataFrames.
import pandas as pd
# Creating DataFrame 1
df1 = pd.DataFrame({
    'Name': ['Raju', 'Rani', 'Geeta', 'Sita', 'Sohit'],
    'Marks': [80, 90, 75, 88, 59]
})
# Creating DataFrame 2
df2 = pd.DataFrame({
    'Name': ['Raju', 'Divya', 'Geeta', 'Sita'],
    'Grade': ['A', 'A', 'B', 'A'],
    'Rank': [3, 1, 4, 2],
    'Gender': ['Male', 'Female', 'Female', 'Female']
})
# Display DataFrames
print("DataFrame 1:")
print(df1)
print("\nDataFrame 2:")
print(df2)
# Merging the two DataFrames on 'Name' (inner join by default; 'Gender' is left out)
df_merged = df1.merge(df2[['Name', 'Grade', 'Rank']], on='Name')
print("\nMerged DataFrame:")
print(df_merged)
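The merge above keeps only names found in both DataFrames. As a sketch of the other join types, the how parameter of merge controls which keys survive. The data here is a shortened version of the example, used only for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({"Name": ["Raju", "Rani", "Geeta"], "Marks": [80, 90, 75]})
df2 = pd.DataFrame({"Name": ["Raju", "Divya", "Geeta"], "Grade": ["A", "A", "B"]})

# inner: only names present in both DataFrames.
inner = df1.merge(df2, on="Name", how="inner")

# left: every row of df1; Grade is NaN where df2 has no match.
left = df1.merge(df2, on="Name", how="left")

# outer: every name from either DataFrame.
outer = df1.merge(df2, on="Name", how="outer")

print(inner)
print(left)
print(outer)
```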
Question 9 ) Explain the concept of pivot tables in Pandas and how they can be created.
Answer )
The pivot_table() function takes a DataFrame and parameters describing the shape you want the
data to take (the index, the columns, the values, and an aggregation function). It then outputs
summarized data in the form of a pivot table.
Pivot tables in Pandas are a very effective way to analyze and summarize data.
* Pivot tables can be multi-level. We can use multiple indexes and column
level groupings to create more powerful summaries of a data set
Eg )
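A minimal sketch of building a pivot table with pd.pivot_table. The sales data and its column names (Region, Product, Amount) are hypothetical.

```python
import pandas as pd

sales = pd.DataFrame({
    "Region": ["East", "East", "West", "West", "East"],
    "Product": ["A", "B", "A", "B", "A"],
    "Amount": [10, 20, 30, 40, 50],
})

# Rows are regions, columns are products, cells hold the summed Amount.
table = pd.pivot_table(
    sales,
    values="Amount",
    index="Region",
    columns="Product",
    aggfunc="sum",
)

print(table)
```

Passing a list to index or columns, for example index=["Region", "Product"], produces the multi-level grouping mentioned above.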
Question 10 ) What is the purpose of the apply() function in Pandas, and when might you use it in
data transformation?
Answer )
The apply() method is one of the most common methods of data preprocessing. It simplifies
applying a function to each element in a Pandas Series and to each row or column in a Pandas
DataFrame.
Series.apply allows users to pass a function and apply it to every single value of a Pandas
Series. This helps to transform or categorize data according to custom conditions, which is why
it is widely used in data science and machine learning.
Syntax :
s.apply(func, convert_dtype=True, args=())
func: The function that apply calls on every value of the Pandas Series.
convert_dtype: Convert dtype as per the function’s operation (deprecated in recent Pandas releases).
args=(): Additional positional arguments passed to func after the Series value.
Return Type: Pandas Series after the function/operation has been applied.
EG :
import pandas as pd
# read_csv no longer accepts squeeze=True (removed in pandas 2.0); squeeze the result instead
s = pd.read_csv("stock.csv").squeeze("columns")
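Since stock.csv is not available here, the following self-contained sketch uses made-up prices instead. The helper label_price and its threshold argument are hypothetical names chosen for the example.

```python
import pandas as pd

prices = pd.Series([95.0, 120.0, 150.0], name="price")

def label_price(value, threshold):
    """Classify one price relative to a threshold (hypothetical helper)."""
    return "high" if value > threshold else "low"

# args passes extra positional arguments to the function after each value.
labels = prices.apply(label_price, args=(100,))

# On a DataFrame, axis=1 applies the function to each row.
df = pd.DataFrame({"open": [10, 20], "close": [12, 18]})
change = df.apply(lambda row: row["close"] - row["open"], axis=1)

print(labels)
print(change)
```

For simple arithmetic such as the row-wise difference, a vectorized expression like df["close"] - df["open"] is usually faster; apply is most useful when the logic cannot be expressed with built-in column operations.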