data analysis

The document provides an overview of data analysis using Python, detailing essential libraries such as NumPy, Pandas, and Matplotlib. It covers data importing, preprocessing, normalization, and exploratory data analysis (EDA) techniques, including descriptive statistics and correlation. Additionally, it includes practical exercises for applying these concepts with a sample dataset.

Uploaded by

piyush dwivedi
Data Analysis With Python

Dr. Amar Singh


Professor
Lovely Professional University
Libraries in Python
• Scientific Computing Libraries
• NumPy
• Pandas
• SciPy
• Visualization Libraries
• Matplotlib
• Seaborn
• Algorithmic Libraries
• Scikit-learn
Importing Data in Python
• Importing is the process of loading or reading data from different sources.
• The data may be in different formats:
• .csv, .json, .xlsx
• The path of the dataset can be given as below:
• C:\\mydata\\data.csv
• To read a CSV file, we can use the following command (with pandas imported as pd):
• pd.read_csv("C:\\mydata\\data.csv")
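A minimal sketch of the import step. The CSV text and its column names are hypothetical stand-ins for a real file, and an in-memory buffer is used so the example runs without anything on disk.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a file on disk.
csv_text = "Name,BasePay\nAsha,50000\nRavi,62000\nMeera,58000\n"

# read_csv accepts a file path or any file-like object.
df = pd.read_csv(io.StringIO(csv_text))

# With a real file, pass the path instead. A raw string avoids
# having to double the backslashes in a Windows path:
# df = pd.read_csv(r"C:\mydata\data.csv")

print(df.shape)  # (number of rows, number of columns)
```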
Libraries
Check data type
• df.dtypes
Printing Dataframe
• df["BasePay"]  # selects only the BasePay column
• df.head(n)  # shows the first n rows of the data frame
• df.tail(n)  # shows the last n rows of the data frame
• df.dtypes  # shows the data type of each column
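The inspection calls above can be tried on a small data frame. The rows below are made up for illustration; only the BasePay column name comes from the slides.

```python
import pandas as pd

# Hypothetical data; "EmployeeName" is an assumed column name.
df = pd.DataFrame({
    "EmployeeName": ["A", "B", "C", "D", "E", "F"],
    "BasePay": [50000.0, 62000.0, 58000.0, 45000.0, 71000.0, 39000.0],
})

base_pay = df["BasePay"]  # a single column (a pandas Series)
first_two = df.head(2)    # first 2 rows
last_two = df.tail(2)     # last 2 rows

print(first_two)
print(last_two)
print(df.dtypes)          # data type of each column
```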
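Dataframe.describe()
• Returns a statistical summary of the numeric columns (count, mean, std, min, quartiles, max).
Dataframe.describe(include="all")
• Also summarizes non-numeric columns (unique, top, freq).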
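As an illustration of the difference between the two calls, here is a short sketch on made-up data. The column names and values are assumptions, not taken from the course dataset.

```python
import pandas as pd

# Hypothetical data with one categorical and one numeric column.
df = pd.DataFrame({
    "JobTitle": ["Clerk", "Clerk", "Manager", "Analyst"],
    "BasePay": [30000, 32000, 60000, 45000],
})

# Default: only numeric columns are summarized.
numeric_summary = df.describe()

# include="all": categorical columns get unique/top/freq rows as well.
full_summary = df.describe(include="all")

print(numeric_summary)
print(full_summary)
```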
Data Pre-processing
• Pre-Processing is used to convert raw data into another format for
further data analysis.
• Also known as data cleaning or data wrangling.
Data-Preprocessing
• Deal with missing values
• Data Formatting
• Data Normalization
• Converting Categorical Values to Numerical Values
Missing Values
• A value is missing when no value is stored for a column in an observation.
• It could be represented as ?, NA, or a blank cell.
How to deal with missing data
• Drop missing values
• Drop the variable
• Replace missing values with the average (numeric data) or the most frequent value (categorical data).
• Leave it as missing data.
How to drop missing values in Python
• Use dataframe.dropna()
How to replace a missing value with a new value?
• df.replace(missing_value, new_value)
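A small sketch of both options, dropping rows and replacing a missing value with the column mean. The data is hypothetical, and the missing value is assumed to be stored as NaN already.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing Income value.
df = pd.DataFrame({
    "Age": [20, 25, 37, 41],
    "Income": [20000.0, np.nan, 28000.0, 45000.0],
})

# Option 1: drop every row that contains a missing value.
dropped = df.dropna()

# Option 2: replace the missing value with the column mean.
# mean() skips NaN by default.
mean_income = df["Income"].mean()
filled = df.copy()
filled["Income"] = filled["Income"].replace(np.nan, mean_income)

print(dropped)
print(filled)
```

If the file marks missing values with a placeholder such as "?", that placeholder would first have to be converted to NaN, for example with df.replace("?", np.nan).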
Data Formatting
• Data are usually collected from different sources and stored in different formats.
• Bringing data into a common standard of expression allows users to make meaningful comparisons.
Incorrect Data Types
• Sometimes the wrong data type is assigned to a column.
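One common case is a numeric column that was read in as text. The sketch below assumes a hypothetical price column and uses astype to convert it.

```python
import pandas as pd

# Hypothetical column where numbers were stored as strings.
df = pd.DataFrame({"price": ["13495", "16500", "18920"]})

# Convert the column to 64-bit integers.
df["price"] = df["price"].astype("int64")

print(df.dtypes)
```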
Apply calculations to an entire column
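A calculation written once on a column is applied to every row. The sketch below assumes, purely for illustration, that a length column is measured in inches and converts it to centimetres.

```python
import pandas as pd

# Hypothetical lengths, assumed to be in inches.
df = pd.DataFrame({"length": [168.8, 171.2, 176.6]})

# One expression converts every row (1 inch = 2.54 cm).
df["length_cm"] = df["length"] * 2.54

print(df)
```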
Data Normalization
• Normalization is the process of transforming the values of several variables into a similar range.
• Typical values range from 0 to 1.
Normalization

Age    Income
20     20000
25     45000
37     28000

• Age and income are in different ranges.
• Hard to compare.
• “Income” will influence the results more.
Methods for normalization
Simple feature scaling
• df['length'] = df['length']/df['length'].max()
• df['width'] = df['width']/df['width'].max()
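Simple feature scaling can be tried on the Age and Income values from the normalization slide. This is a minimal sketch that divides each column by its own maximum, so the largest value in each column becomes 1.

```python
import pandas as pd

# Age and Income values from the normalization slide.
df = pd.DataFrame({
    "Age": [20, 25, 37],
    "Income": [20000, 45000, 28000],
})

# Divide every column by its own maximum value.
scaled = df / df.max()

print(scaled)
```

After scaling, both columns lie between 0 and 1, so neither dominates because of its units.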
Categorical
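The slide content for this step is not available here. One common way to convert categorical values to numerical values is one-hot encoding with pd.get_dummies, sketched below; the fuel column and its values are hypothetical, and this may not be the method the original slides showed.

```python
import pandas as pd

# Hypothetical categorical column.
df = pd.DataFrame({"fuel": ["gas", "diesel", "gas"]})

# One indicator column per category; dtype=int gives 0/1 values.
dummies = pd.get_dummies(df["fuel"], dtype=int)

print(dummies)
```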
Exploratory Data Analysis (EDA)
• A preliminary step in data analysis.
• Get a better understanding of the data set.
• Summarize the main characteristics of the data set.
• Uncover relationships between different variables.
• Extract important variables.
Descriptive Statistics
• Describe basic features of the data.
• Give short summaries of the sample and measures of the data set.
Descriptive Statistics
• df.describe()
• df.value_counts()
• Summarizes categorical data
• Example: df["drive-wheels"].value_counts()
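A short sketch of value_counts on a categorical column. The drive-wheels column name comes from the slide; the rows are made up.

```python
import pandas as pd

# Hypothetical rows for the drive-wheels column.
df = pd.DataFrame({
    "drive-wheels": ["fwd", "rwd", "fwd", "4wd", "fwd", "rwd"],
})

# Count how often each category occurs, most frequent first.
counts = df["drive-wheels"].value_counts()

print(counts)
```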
Scatterplot
• Represents the relationships between variables
• Predictor variable on x-axis.
• Target variable on y-axis.
Scatterplot : Example
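A minimal Matplotlib sketch of a scatterplot, with the predictor on the x-axis and the target on the y-axis. The engine-size and price columns and their values are hypothetical.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend, so no display is needed

import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical predictor (engine-size) and target (price) values.
df = pd.DataFrame({
    "engine-size": [97, 109, 130, 152, 181],
    "price": [7000, 9500, 13500, 16500, 22000],
})

fig, ax = plt.subplots()
ax.scatter(df["engine-size"], df["price"])
ax.set_xlabel("engine-size")  # predictor variable
ax.set_ylabel("price")        # target variable
ax.set_title("Price vs. engine size")
```

In an interactive session, the backend line can be dropped and plt.show() called to display the figure.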
Grouping Data
• The groupby method is used to group the data.
• Can be applied to categorical variables.
• Groups the data into categories.
• Example:
• test_Data1 = test_Data.groupby('JobTitle')
• test_Data1.mean(numeric_only=True)
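The grouping example can be run on a small made-up table. The JobTitle and BasePay column names follow the slides; the values are assumptions.

```python
import pandas as pd

# Hypothetical pay data.
test_Data = pd.DataFrame({
    "JobTitle": ["Clerk", "Manager", "Clerk", "Manager"],
    "BasePay": [30000.0, 60000.0, 34000.0, 64000.0],
})

# Group rows by job title, then average BasePay within each group.
test_Data1 = test_Data.groupby("JobTitle")
mean_pay = test_Data1["BasePay"].mean()

print(mean_pay)
```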
Correlation
• A measure of the extent of interdependence between variables.
• 1: Total positive linear correlation.
• 0: No linear correlation; the two variables show no linear relationship (a nonlinear one may still exist).
• -1: Total negative linear correlation.
• df.corr()
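A small sketch showing the two extreme cases. The columns are constructed so that one rises exactly with x and the other falls exactly with x; the names are hypothetical.

```python
import pandas as pd

# y_pos increases linearly with x, y_neg decreases linearly with x.
df = pd.DataFrame({
    "x": [1, 2, 3, 4],
    "y_pos": [2, 4, 6, 8],
    "y_neg": [8, 6, 4, 2],
})

# Pairwise Pearson correlation between all numeric columns.
corr = df.corr()

print(corr)
```

If the data frame also contains non-numeric columns, recent pandas versions may require df.corr(numeric_only=True).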
Correlation using scatter plot
Exercise
• Import SocialAds.csv dataset.
• Show first five rows of the dataset.
• Show last five rows of the dataset.
• Give the statistical description of the dataset.
• Count the number of males and females in the dataset.