The document provides an overview of data analysis using Python, detailing essential libraries such as NumPy, Pandas, and Matplotlib. It covers data importing, preprocessing, normalization, and exploratory data analysis (EDA) techniques, including descriptive statistics and correlation. Additionally, it includes practical exercises for applying these concepts with a sample dataset.
Data Analysis With Python
Dr. Amar Singh
Professor, Lovely Professional University

Libraries in Python
• Scientific Computing Libraries
  • NumPy
  • Pandas
  • SciPy
• Visualization Libraries
  • Matplotlib
  • Seaborn
• Algorithmic Libraries
  • Scikit-learn

Importing Data in Python
• Importing is the process of loading or reading data from different sources.
• The data may be in different formats: .csv, .json, .xlsx
• The path of the dataset can be given as: C:\\mydata\\data.csv
• To read a CSV file, use: pd.read_csv("C:\\mydata\\data.csv")

Check Data Types
• dataframe.dtypes

Printing a Dataframe
• df["BasePay"]  # prints only the BasePay column
• df.head(n)  # shows the first n rows of the data frame
• df.tail(n)  # shows the last n rows of the data frame
• df.dtypes  # checks the data type of each column

dataframe.describe()
• Returns a statistical summary of the numeric columns.
• For a full summary that also covers non-numeric columns, use dataframe.describe(include="all").

Data Pre-processing
• Pre-processing converts raw data into another format for further data analysis.
• Also known as data cleaning or data wrangling.

Data Pre-processing Steps
• Deal with missing values
• Data formatting
• Data normalization
• Converting categorical values to numerical values

Missing Values
• A value is missing when nothing is stored for a column in an observation.
• It may be represented as ?, NA, or a blank cell.

How to Deal with Missing Data
• Drop the missing values.
• Drop the variable.
• Replace missing values with the average or the most frequent value.
• Leave it as missing data.

How to Drop Missing Values in Python
• Use dataframe.dropna()

How to Replace a Missing Value with a New Value
• df.replace(missing_value, new_value)

Data Formatting
• Data are usually collected from different sources and stored in different formats.
• Bringing data into a common standard of expression allows users to make meaningful comparisons.

Incorrect Data Types
• Sometimes the wrong data type is assigned to a column.
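The loading, inspection, and missing-value steps above can be sketched as follows. This is a minimal sketch rather than the slides' own code: the in-memory CSV text, the Year column, and the choice to fill the gap with the column mean are assumptions made for illustration (BasePay and JobTitle are column names that appear in the slides). `pd.to_numeric` is used here as one possible way to correct a wrongly typed column.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical sample data standing in for a file such as data.csv;
# "?" marks a missing value, as described in the slides.
CSV_TEXT = """JobTitle,BasePay,Year
Clerk,30000,2012
Engineer,?,2013
Clerk,32000,2013
Engineer,60000,2014
"""

# read_csv accepts a file path or any file-like object.
df = pd.read_csv(io.StringIO(CSV_TEXT))

print(df.head(2))   # first two rows
print(df.tail(2))   # last two rows
print(df.dtypes)    # BasePay is not numeric yet because of the "?" entry

# Replace the "?" placeholder with NaN so pandas treats it as missing.
df = df.replace("?", np.nan)

# Correct the data type now that the placeholder is gone.
df["BasePay"] = pd.to_numeric(df["BasePay"])

# Option 1: drop every row that contains a missing value.
dropped = df.dropna()

# Option 2: replace the missing value with the column average.
mean_pay = df["BasePay"].mean()
filled = df.fillna({"BasePay": mean_pay})

# Full summary, including the non-numeric JobTitle column.
print(df.describe(include="all"))
```

Dropping rows loses the whole observation, while filling with the mean keeps the row at the cost of an estimated value; which is preferable depends on how much data is missing.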
Apply Calculations to an Entire Column

Data Normalization
• Normalization is the process of transforming the values of several variables into a similar range.
• Typical values range from 0 to 1.

Normalization

Age    Income
20     20000
25     45000
37     28000

• Age and income are in different ranges.
• They are hard to compare.
• "Income" will influence the results more.

Methods for Normalization: Simple Feature Scaling
• df['length'] = df['length'] / df['length'].max()
• df['width'] = df['width'] / df['width'].max()

Categorical

Exploratory Data Analysis (EDA)
• A preliminary step in data analysis.
• Get a better understanding of the data set.
• Summarize the main characteristics of the data set.
• Uncover relationships between different variables.
• Extract important variables.

Descriptive Statistics
• Describe the basic features of the data.
• Give short summaries of the sample and measures of the data set.
• df.describe()
• df.value_counts()
• Summarizing categorical data
• Example: df["drive-wheels"].value_counts()

Scatterplot
• Represents the relationship between two variables.
• Predictor variable on the x-axis.
• Target variable on the y-axis.

Scatterplot: Example

Grouping Data
• The groupby method is used to group the data.
• It can be applied to categorical variables.
• It groups the data into categories.
• Example:
• test_Data1 = test_Data.groupby('JobTitle')
• test_Data1.mean(numeric_only=True)  # averages only the numeric columns

Correlation
• A measure of the extent of interdependence between variables.
• 1: Total positive linear correlation.
• 0: No linear correlation; the two variables have no linear relationship, although a nonlinear one may still exist.
• -1: Total negative linear correlation.
• df.corr()  # in pandas 1.5 and later, pass numeric_only=True if the data frame has non-numeric columns

Correlation using scatter plot

Exercise
• Import the SocialAds.csv dataset.
• Show the first five rows of the dataset.
• Show the last five rows of the dataset.
• Give the statistical description of the dataset.
• Count the number of males and females in the dataset.
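The simple feature scaling, descriptive statistics, grouping, and correlation steps covered earlier can be tried together on the Age/Income table from the normalization slide. This is a minimal sketch under stated assumptions: the Group column and its labels are hypothetical, added only so that there is a categorical variable to count and group by. It is not a solution to the exercise, since the columns of SocialAds.csv are not given here.

```python
import pandas as pd

# Age and Income values come from the slide's example table;
# the Group column is a hypothetical categorical variable for illustration.
df = pd.DataFrame(
    {
        "Age": [20, 25, 37],
        "Income": [20000, 45000, 28000],
        "Group": ["A", "B", "A"],
    }
)

numeric_cols = ["Age", "Income"]

# Simple feature scaling: divide each column by its own maximum,
# so the largest value in every column becomes 1.
scaled = df[numeric_cols] / df[numeric_cols].max()

# Descriptive statistics for the numeric columns.
summary = df.describe()

# Count how often each category occurs.
group_counts = df["Group"].value_counts()

# Group by the categorical column and average the numeric columns.
group_means = df.groupby("Group")[numeric_cols].mean()

# Pearson correlation between the numeric columns (values lie in [-1, 1]).
corr = df[numeric_cols].corr()

print(scaled)
print(summary)
print(group_counts)
print(group_means)
print(corr)
```

After scaling, Age and Income both lie between 0 and 1, so neither column dominates a comparison purely because of its units.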