EDA
Exploratory Data Analysis (EDA) is a crucial step in the data analysis process.
In [ ]: ====================== Data Analysis =======================
1.Pandas ------- Dataframe read and write operations
2.Numpy ------- Numerical python Math operations
3.Matplotlib ------- plots , graphs , visualization
4.Seaborn ------- plots
5.Plotly ------- plots
6.Bokeh ------- plots
====================== Machine Learning =====================
7.Scikit-learn (sklearn) ------ Model development
8.stats packages ------ Linear Regression
====================== Web scraping and Database connection ======
9.Sqlite ------ SQL Connection
10.Beautiful Soup ------ scrape the data
11.websocket ------ scrape the data
====================== Deep Learning ==========================
12.Tensorflow ------ Deep learning models development(google)
13.keras
14.pytorch ------ develop by
15.Opencv ------ computer vision(reading and writing images)
16.Pillow ------ reading images
====================== NLP ======================================
17.NLTK ----- Natural language tool kit
18.SpaCy ----- NLP Models
19.wordcloud -----
====================== Web development - API ======================
20.Flask
21.Django
22.FastAPI
23.Gradio
====================== Apps creation ==============================
24.Streamlit
====================== Transformers BERT (NLP models) ==============
25.Transformers ------ Hugging Face (BERT itself is from Google)
====================== DL: Pretrained Models, Object Detection =======
26.vgg16
27.Mobilenet
28.Yolo ----- Ultralytics
====================== NLP pretrained Models ========================
29.Word2Vec ----- Google
30.GloVe ----- Stanford University
====================== Model save ==================================
31.Pickle
32.Joblib
====================== GenAI LLM ====================================
33.Azure openAI
34.Google Gemini
35.Amazon BedRock
36.LLAMA Meta
37.Langchain Framework
====================== Model Deployment ================================
38.MLFlow
====================== Cloud Services ==================================
39.Azure ML Related packages
40.GCP vertex ai packages
41.Amazon sagemaker packages
====================== Allen NLP ======================================
42. Allen NLP packages
====================== ML using Pyspark ================================
43.MLlib package
====================== Small packages ==================================
44.random
45.math
46.time
47.logging
Step-1 : Import Packages
In [1]: import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
Step-2 : Create a DataFrame using List
In [7]: import pandas as pd
pd.DataFrame
Out[7]: pandas.core.frame.DataFrame
In [9]: import pandas as pd
pd.DataFrame()
Out[9]:
In [13]: import pandas as pd
data=pd.DataFrame()
data
# we created a DataFrame
# But no data (no rows and no columns)
# we saved our DataFrame with a name 'data'
Out[13]:
Step-3 : Provide The Data
In [16]: name=['Navya','Sneha','Yamu']
pd.DataFrame()
Out[16]:
In [18]: name=['Navya','Sneha','Yamu']
pd.DataFrame(name)
Out[18]: 0
0 Navya
1 Sneha
2 Yamu
In [20]: name=['Navya','Sneha','Yamu']
age=[20,21,22]
pd.DataFrame(zip(name,age))
Out[20]: 0 1
0 Navya 20
1 Sneha 21
2 Yamu 22
In [22]: name=['Navya','Sneha','Yamu']
age=[20,21,22]
city=['Hyd','Delhi','Pune']
pd.DataFrame(zip(name,age,city))
Out[22]: 0 1 2
0 Navya 20 Hyd
1 Sneha 21 Delhi
2 Yamu 22 Pune
In [24]: name=['Navya','Sneha','Yamu']
age=[20,21,22]
city=['Hyd','Delhi','Pune']
data=[name,age,city]
pd.DataFrame(data)
Out[24]: 0 1 2
0 Navya Sneha Yamu
1 20 21 22
2 Hyd Delhi Pune
In [26]: name=['Navya','Sneha','Yamu']
age=[20,21,22]
city=['Hyd','Delhi','Pune']
df=pd.DataFrame(zip(name,age,city))
df
Out[26]: 0 1 2
0 Navya 20 Hyd
1 Sneha 21 Delhi
2 Yamu 22 Pune
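The In [24] and In [26] cells give different layouts from the same three lists. A minimal sketch of why, assuming the same lists as above: a list of lists is read row by row, while zip groups the lists so that each one becomes a column. The names by_rows, by_zip and transposed are only illustrative.

```python
import pandas as pd

name = ['Navya', 'Sneha', 'Yamu']
age = [20, 21, 22]
city = ['Hyd', 'Delhi', 'Pune']

# A list of lists: each inner list becomes one ROW.
by_rows = pd.DataFrame([name, age, city])

# zip groups the i-th items together, so each list becomes one COLUMN.
by_zip = pd.DataFrame(list(zip(name, age, city)))

# Transposing the row-wise frame gives the same layout as the zip version.
transposed = by_rows.T

print(by_zip.values.tolist() == transposed.values.tolist())  # True
```

The zip form is usually the one we want here, because each list (name, age, city) describes one column of the table.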
Step-4 : Provide The Columns
Column names must be provided as a list
The number of names must exactly match the number of columns in the data
Here we have 3 columns , so we need a list with 3 names
In [30]: name=['Navya','Sneha','Yamu']
age=[20,21,22]
city=['Hyd','Delhi','Pune']
cols=['Names','Age','City']
df=pd.DataFrame(zip(name,age,city),columns=cols)
df
Out[30]: Names Age City
0 Navya 20 Hyd
1 Sneha 21 Delhi
2 Yamu 22 Pune
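The rule that the number of column names must match the data can be checked directly. A small sketch, assuming the same three lists: passing only two names for three columns of data is rejected by pandas. The variable mismatch_error is only illustrative.

```python
import pandas as pd

name = ['Navya', 'Sneha', 'Yamu']
age = [20, 21, 22]
city = ['Hyd', 'Delhi', 'Pune']

# Three columns of data but only two column names: pandas rejects this.
mismatch_error = None
try:
    pd.DataFrame(list(zip(name, age, city)), columns=['Names', 'Age'])
except ValueError as exc:
    mismatch_error = exc

print(type(mismatch_error).__name__)
print(mismatch_error)

# With three names the same call succeeds.
df = pd.DataFrame(list(zip(name, age, city)), columns=['Names', 'Age', 'City'])
print(df.shape)  # (3, 3)
```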
Step-5 : Provide the Index
In [33]: name=['Navya','Sneha','Yamu']
age=[20,21,22]
city=['Hyd','Delhi','Pune']
cols=['Names','Age','City']
id=[1,2,3]
df=pd.DataFrame(zip(name,age,city),index=id,columns=cols)
df
Out[33]: Names Age City
1 Navya 20 Hyd
2 Sneha 21 Delhi
3 Yamu 22 Pune
In [35]: name=['Navya','Sneha','Yamu']
age=[20,21,22]
city=['Hyd','Delhi','Pune']
cols=['Names','Age','City']
id=['A','B','C']
df=pd.DataFrame(zip(name,age,city),index=id,columns=cols)
df
Out[35]: Names Age City
A Navya 20 Hyd
B Sneha 21 Delhi
C Yamu 22 Pune
Step-6 : How to add a New Column to an existing dataframe
Here we already have a dataframe named df
It has 3 columns
Now we want to add a new column Marks
We need to create a new array or list
The length of the list must equal the number of rows
Here we have 3 rows , so the new list must also have 3 values
In [ ]: # df['<new column name>']=<list>
In [38]: marks=[100,200,300]
df['Marks']=marks
df
Out[38]: Names Age City Marks
A Navya 20 Hyd 100
B Sneha 21 Delhi 200
C Yamu 22 Pune 300
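A short sketch of the length rule for new columns, assuming a small dataframe like the one above. A list that is too short raises a ValueError, while a single scalar value is repeated for every row. The 'Country' column and the variable length_error are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {'Names': ['Navya', 'Sneha', 'Yamu'], 'Age': [20, 21, 22]},
    index=['A', 'B', 'C'],
)

# A list with 2 values cannot fill 3 rows.
length_error = None
try:
    df['Marks'] = [100, 200]
except ValueError as exc:
    length_error = exc
print(length_error)

# A list with exactly 3 values works.
df['Marks'] = [100, 200, 300]

# A single scalar is broadcast to every row (hypothetical column).
df['Country'] = 'India'

print(df['Country'].tolist())  # ['India', 'India', 'India']
```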
Step-7 : Create a DataFrame starting from an empty DataFrame
In the steps above we passed the lists directly to pd.DataFrame
Here we first create an empty dataframe and then add the lists one column at a time
In [41]: df1=pd.DataFrame()
df1
Out[41]:
In [43]: name=['Navya','Sneha','Yamu']
age=[20,21,22]
city=['Hyd','Delhi','Pune']
df1['Name']=name
df1['Age']=age
df1['City']=city
df1
Out[43]: Name Age City
0 Navya 20 Hyd
1 Sneha 21 Delhi
2 Yamu 22 Pune
Step-8 : Create a DataFrame using Dictionary
In [50]: dict1={'Names':['Navya','Sneha','Yamu'],'Age':[20,21,22],'City':['Hyd','Delhi','Pune']}
dict1
Out[50]: {'Names': ['Navya', 'Sneha', 'Yamu'],
'Age': [20, 21, 22],
'City': ['Hyd', 'Delhi', 'Pune']}
In [52]: df2=pd.DataFrame(dict1)
df2
Out[52]: Names Age City
0 Navya 20 Hyd
1 Sneha 21 Delhi
2 Yamu 22 Pune
In [54]: df2=pd.DataFrame(dict1,index=['A','B','C'])
df2
Out[54]: Names Age City
A Navya 20 Hyd
B Sneha 21 Delhi
C Yamu 22 Pune
Keys behave as column names
Values behave as the data of each column (one value per row)
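The key/value mapping can be inspected on the dataframe itself. A minimal sketch, assuming the same dict1 as above:

```python
import pandas as pd

dict1 = {
    'Names': ['Navya', 'Sneha', 'Yamu'],
    'Age': [20, 21, 22],
    'City': ['Hyd', 'Delhi', 'Pune'],
}
df2 = pd.DataFrame(dict1)

# The dictionary keys become the column names.
print(list(df2.columns))    # ['Names', 'Age', 'City']

# Each dictionary value becomes the contents of its column.
print(df2['Age'].tolist())  # [20, 21, 22]

# 3 items per list -> 3 rows, 3 keys -> 3 columns.
print(df2.shape)            # (3, 3)
```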
In [57]: dict2={'Name':'Navya','Age':20,'City':'Hyd'}
dict2
Out[57]: {'Name': 'Navya', 'Age': 20, 'City': 'Hyd'}
In [61]: df3=pd.DataFrame(dict2)
df3
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Cell In[61], line 1
----> 1 df3=pd.DataFrame(dict2)
2 df3
File ~\anaconda3\Lib\site-packages\pandas\core\frame.py:778, in DataFrame.__init__(self, data, index, columns, dtype, copy)
772 mgr = self._init_mgr(
773 data, axes={"index": index, "columns": columns}, dtype=dtype, copy=copy
774 )
776 elif isinstance(data, dict):
777 # GH#38939 de facto copy defaults to False only in non-dict cases
--> 778 mgr = dict_to_mgr(data, index, columns, dtype=dtype, copy=copy, typ=manager)
779 elif isinstance(data, ma.MaskedArray):
780 from numpy.ma import mrecords
File ~\anaconda3\Lib\site-packages\pandas\core\internals\construction.py:503, in dict_to_mgr(data, index, columns, dtype, typ, copy)
499 else:
500 # dtype check to exclude e.g. range objects, scalars
501 arrays = [x.copy() if hasattr(x, "dtype") else x for x in arrays]
--> 503 return arrays_to_mgr(arrays, columns, index, dtype=dtype, typ=typ, consolidate=copy)
File ~\anaconda3\Lib\site-packages\pandas\core\internals\construction.py:114, in arrays_to_mgr(arrays, columns, index, dtype, verify_integrity, typ, consolidate)
111 if verify_integrity:
112 # figure out the index, if necessary
113 if index is None:
--> 114 index = _extract_index(arrays)
115 else:
116 index = ensure_index(index)
File ~\anaconda3\Lib\site-packages\pandas\core\internals\construction.py:667, in _extract_index(data)
664 raise ValueError("Per-column arrays must each be 1-dimensional")
666 if not indexes and not raw_lengths:
--> 667 raise ValueError("If using all scalar values, you must pass an index")
669 if have_series:
670 index = union_indexes(indexes)
ValueError: If using all scalar values, you must pass an index
In [63]: dict2={'Name':'Navya','Age':20,'City':'Hyd'}
pd.DataFrame(dict2,index=[1])
# If using all scalar values, you must pass an index
Out[63]: Name Age City
1 Navya 20 Hyd
In [65]: dict2={'Name':'Navya','Age':20,'City':'Hyd'}
pd.DataFrame(dict2,index=[1,2])
Out[65]: Name Age City
1 Navya 20 Hyd
2 Navya 20 Hyd
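Passing an index is one way around the scalar-values error. Another option, sketched here under the assumption that each dictionary describes one row, is to wrap the dictionary in a list. pandas then treats every dictionary in the list as a record. The second record is made up for illustration.

```python
import pandas as pd

dict2 = {'Name': 'Navya', 'Age': 20, 'City': 'Hyd'}

# A list with one dictionary -> a dataframe with one row (index 0).
one_row = pd.DataFrame([dict2])
print(one_row.shape)  # (1, 3)

# A list with several dictionaries -> one row per dictionary.
records = [dict2, {'Name': 'Sneha', 'Age': 21, 'City': 'Delhi'}]
two_rows = pd.DataFrame(records)
print(two_rows.shape)  # (2, 3)
```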
Array-like data can be represented in 3 ways :
list : plain Python list
numpy : NumPy array
tensor : TensorFlow tensor
In [68]: l1=[1,2,3]
import numpy as np
np.array(l1)
Out[68]: array([1, 2, 3])
In [70]: l1=[1,2,3]
l2=[11,12,13]
l1+l2
Out[70]: [1, 2, 3, 11, 12, 13]
In [72]: import numpy as np
np.array(l1)
np.array(l2)
np.array(l1+l2)
Out[72]: array([ 1, 2, 3, 11, 12, 13])
In [74]: l1=[1,2,3]
a=np.array(l1)
l2=[11,12,13]
b=np.array(l2)
a+b
Out[74]: array([12, 14, 16])
In [76]: l1=[1,2,3]
a=np.array(l1)
l2=[11,12,13]
b=np.array(l2)
a*b
Out[76]: array([11, 24, 39])
In [78]: l1=[1,2,3]
a=np.array(l1)
l2=[11,12,13]
b=np.array(l2)
a+b,a*b
Out[78]: (array([12, 14, 16]), array([11, 24, 39]))
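The cells above show that + joins two lists but adds two NumPy arrays element by element. The same difference appears with * and a number. A small sketch using the same l1:

```python
import numpy as np

l1 = [1, 2, 3]
a = np.array(l1)

# On a list, * repeats the list.
repeated = l1 * 2
print(repeated)  # [1, 2, 3, 1, 2, 3]

# On a NumPy array, * multiplies every element.
doubled = a * 2
print(doubled)   # [2 4 6]
```

This is why NumPy arrays, not plain lists, are used for math operations on columns of numbers.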
Step-9 : Drop the column
To drop a column (or a row) we use the drop method
DataFrame methods are called the same way as string methods : <name>.<method>()
It mainly takes 3 arguments
1.Column name (or row label) to drop
2.axis
axis = 1 represents columns
axis = 0 represents rows
3.inplace
By default (inplace=False) drop returns a modified copy , the original dataframe is not changed
The modified dataframe can be saved under the same name or a different name
If you want to change the same dataframe directly then pass inplace=True
In [ ]: # create a dataframe and drop any column
In [81]: df4=pd.DataFrame()
df4
Out[81]:
In [ ]: # df4.drop(<column name or row label>, axis=<0 or 1>, inplace=<True or False>)
In [87]: name=['Navya','Sneha','Yamu']
age=[20,21,22]
city=['Hyd','Delhi','Pune']
cols=['Names','Age','City']
id=['A','B','C']
df4=pd.DataFrame(zip(name,age,city),index=id,columns=cols)
df4
Out[87]: Names Age City
A Navya 20 Hyd
B Sneha 21 Delhi
C Yamu 22 Pune
In [97]: name=['Navya','Sneha','Yamu']
age=[20,21,22]
city=['Hyd','Delhi','Pune']
cols=['Names','Age','City']
id=['A','B','C']
df4=pd.DataFrame(zip(name,age,city),index=id,columns=cols)
df4.drop('City',axis=1)
Out[97]: Names Age
A Navya 20
B Sneha 21
C Yamu 22
In [103… name=['Navya','Sneha','Yamu']
age=[20,21,22]
city=['Hyd','Delhi','Pune']
cols=['Names','Age','City']
id=['A','B','C']
df4=pd.DataFrame(zip(name,age,city),index=id,columns=cols)
df4.drop('A',axis=0)
Out[103… Names Age City
B Sneha 21 Delhi
C Yamu 22 Pune
In [107… name=['Navya','Sneha','Yamu']
age=[20,21,22]
city=['Hyd','Delhi','Pune']
cols=['Names','Age','City']
id=['A','B','C']
df4=pd.DataFrame(zip(name,age,city),index=id,columns=cols)
df4.drop('A',axis=0,inplace=True)
In [109… df4
Out[109… Names Age City
B Sneha 21 Delhi
C Yamu 22 Pune
In [4]: name=['Navya','Sneha','Yamu']
age=[20,21,22]
city=['Hyd','Delhi','Pune']
cols=['Names','Age','City']
id=['A','B','C']
df4=pd.DataFrame(zip(name,age,city),index=id,columns=cols)
df4.drop('A',axis=0,inplace=False)
Out[4]: Names Age City
B Sneha 21 Delhi
C Yamu 22 Pune
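A compact sketch of the drop behaviour shown in the cells above, assuming the same df4. Without inplace=True the original dataframe is untouched, so the result has to be assigned to a name. The columns= and index= keywords are an alternative to passing axis. The result names are only illustrative.

```python
import pandas as pd

name = ['Navya', 'Sneha', 'Yamu']
age = [20, 21, 22]
city = ['Hyd', 'Delhi', 'Pune']
df4 = pd.DataFrame(
    list(zip(name, age, city)),
    index=['A', 'B', 'C'],
    columns=['Names', 'Age', 'City'],
)

# drop returns a new dataframe; df4 itself still has the City column.
without_city = df4.drop('City', axis=1)

# Same result using the columns= keyword instead of axis=1.
same_result = df4.drop(columns='City')

# Dropping a row by its index label with the index= keyword.
without_a = df4.drop(index='A')

print(list(df4.columns))           # ['Names', 'Age', 'City']
print(list(without_city.columns))  # ['Names', 'Age']
print(list(without_a.index))       # ['B', 'C']
```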
In [ ]: # create two dataframes df1 and df2
# combine those dataframes row-wise
############# df1 ###########
Names Age City
Ramesh 20 Hyd
############ df2 ###########
Names Age City
Suresh 21 Blr
############ expected result ###########
Names Age City
Ramesh 20 Hyd
Suresh 21 Blr
# candidate methods : append , concat , join
In [34]: dict1={'Name':'Ramesh','Age':20,'City':'Hyd'}
df5=pd.DataFrame(dict1,index=[1])
dict2={'Name':'Suresh','Age':21,'City':'Blr'}
df6=pd.DataFrame(dict2,index=[2])
result=pd.concat([df5,df6],ignore_index=True)
print(result)
Name Age City
0 Ramesh 20 Hyd
1 Suresh 21 Blr
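A short sketch of what ignore_index changes in pd.concat, assuming the same two one-row dataframes. Note that DataFrame.append was removed in pandas 2.0, so concat is the method to use for stacking rows. The names kept and renumbered are only illustrative.

```python
import pandas as pd

df5 = pd.DataFrame({'Name': 'Ramesh', 'Age': 20, 'City': 'Hyd'}, index=[1])
df6 = pd.DataFrame({'Name': 'Suresh', 'Age': 21, 'City': 'Blr'}, index=[2])

# Default: each row keeps the index label it had before.
kept = pd.concat([df5, df6])
print(list(kept.index))        # [1, 2]

# ignore_index=True: the rows are renumbered from 0.
renumbered = pd.concat([df5, df6], ignore_index=True)
print(list(renumbered.index))  # [0, 1]
```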
Step-10 : How to overwrite an existing column
We already have a dataframe
Now we want to replace all the values of a specific column with new values
First create a list with the new values
Then assign it to the column , the same way we create a new column
df[new col]=data , to create a new column
df[old col]=new data , to overwrite the old column
In [8]: df4['Age']=[33,44,34]
df4
Out[8]: Names Age City
A Navya 33 Hyd
B Sneha 44 Delhi
C Yamu 34 Pune
In [48]: df4['Names']=['anshu','chinni','adya']
df4
Out[48]: Names Age City
A anshu 33 Hyd
B chinni 44 Delhi
C adya 34 Pune
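The new values do not have to come from a hand-written list. A minimal sketch, assuming a small dataframe like df4: the column can also be overwritten with values computed from the column itself, and no extra column is created as long as the name matches an existing column exactly.

```python
import pandas as pd

df = pd.DataFrame(
    {'Names': ['Navya', 'Sneha', 'Yamu'], 'Age': [20, 21, 22]},
    index=['A', 'B', 'C'],
)
columns_before = list(df.columns)

# Overwrite Age with a value computed from the old Age column.
df['Age'] = df['Age'] + 1

print(df['Age'].tolist())                  # [21, 22, 23]
print(list(df.columns) == columns_before)  # True
```

A misspelled column name (for example 'Name' instead of 'Names') would not overwrite anything. It would add a new column instead.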
Step-11 : How to save the DataFrame
We can save the dataframe in 2 formats
csv : comma separated values
excel
For csv : to_csv extension = .csv
For excel : to_excel extension = .xlsx
In [51]: # create a dataframe
name=['Navya','Sneha','Yamu']
age=[20,21,22]
city=['Hyd','Delhi','Pune']
cols=['Names','Age','City']
id=['A','B','C']
df=pd.DataFrame(zip(name,age,city),index=id,columns=cols)
df
Out[51]: Names Age City
A Navya 20 Hyd
B Sneha 21 Delhi
C Yamu 22 Pune
Csv Format
In [56]: # DataFramename.methodname
# where you want to save
# in what name you want to save
df.to_csv('data12.csv')
Excel sheet
In [61]: df.to_excel('data13.xlsx')
Step-12 : Read the data
read_csv
read_excel
both are available in pandas
In [65]: pd.read_csv('data12.csv')
Out[65]: Unnamed: 0 Names Age City
0 A Navya 20 Hyd
1 B Sneha 21 Delhi
2 C Yamu 22 Pune
In [69]: pd.read_excel('data13.xlsx')
Out[69]: Unnamed: 0 Names Age City
0 A Navya 20 Hyd
1 B Sneha 21 Delhi
2 C Yamu 22 Pune
Step-13 : How to avoid extra column
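The extra column 'Unnamed: 0' is the index that was saved along with the data
While saving the data , we have an argument named index
Pass index=False so that the index is not written to the file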
In [74]: # Give the different name , provide index=False
df.to_csv('data21.csv',index=False)
pd.read_csv('data21.csv')
Out[74]: Names Age City
0 Navya 20 Hyd
1 Sneha 21 Delhi
2 Yamu 22 Pune
In [76]: df.to_excel('data31.xlsx',index=False)
pd.read_excel('data31.xlsx')
Out[76]: Names Age City
0 Navya 20 Hyd
1 Sneha 21 Delhi
2 Yamu 22 Pune
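index=False throws the index away. If the index labels (A, B, C) should survive the round trip, one alternative is to save the index as usual and tell read_csv which column holds it with index_col=0. A sketch under those assumptions, writing to a temporary folder with a hypothetical file name so no file is left behind:

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame(
    {'Names': ['Navya', 'Sneha', 'Yamu'],
     'Age': [20, 21, 22],
     'City': ['Hyd', 'Delhi', 'Pune']},
    index=['A', 'B', 'C'],
)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'data_with_index.csv')  # hypothetical file name

    # Save with the index (the default behaviour of to_csv).
    df.to_csv(path)

    # index_col=0 uses the first column of the file as the index again.
    restored = pd.read_csv(path, index_col=0)

print(list(restored.index))    # ['A', 'B', 'C']
print(list(restored.columns))  # ['Names', 'Age', 'City']
```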