DATA SCIENCE
UNIT – 2
INTRODUCTION TO PROGRAMMING TOOLS FOR DATA SCIENCE
Ques 1. Explain different tool kits used in Python.
Python has a rich ecosystem of toolkits, libraries, and frameworks that serve a wide
range of applications, from web development and data analysis to scientific
computing and artificial intelligence.
1. Web Development
Django: A high-level framework that encourages rapid development and
clean, pragmatic design. It’s built by experienced developers and handles
much of the hassle of web development.
Flask: A micro web framework for Python. It is classified as a
microframework because it does not require particular tools or libraries. It
has no database abstraction layer, form validation, or any other components
where pre-existing third-party libraries provide common functions.
2. Data Analysis and Data Science
Pandas: An open-source library providing high-performance, easy-to-use
data structures and data analysis tools. It's particularly suited for data
manipulation and analysis.
NumPy: A library for the Python programming language, adding support for
large, multi-dimensional arrays and matrices, along with a large collection of
high-level mathematical functions to operate on these arrays.
3. Machine Learning and Deep Learning
Scikit-learn: A simple and efficient tool for predictive data analysis. It is
accessible to everybody and reusable in various contexts.
TensorFlow: An end-to-end open-source platform for machine learning
designed by Google. It has a comprehensive, flexible ecosystem of tools,
libraries, and community resources that lets researchers push the state-of-
the-art in ML, and developers easily build and deploy ML-powered
applications.
PyTorch: An open-source machine learning library based on the Torch
library, used for applications such as computer vision and natural language
processing.
4. Scientific Computing
SciPy: A Python-based ecosystem of open-source software for mathematics,
science, and engineering. In particular, these are a series of Python modules
for optimization, linear algebra, integration, interpolation, and special
functions.
SymPy: A Python library for symbolic mathematics. It aims to become a full-
featured computer algebra system (CAS) while keeping the code as simple as
possible.
5. Visualization
Matplotlib: Matplotlib is a comprehensive plotting library for Python. It
provides a variety of plot types and styles, as well as interactive features for
exploring data. You can generate plots, histograms, power spectra, bar
charts, error charts, scatterplots, etc.
Seaborn: Based on matplotlib, it provides a high-level interface for drawing
attractive and informative statistical graphics.
6. Game Development
Pygame: A set of Python modules designed for writing video games. It adds
functionality on top of the excellent SDL library.
Manipulating Data
Data manipulation refers to the process of adjusting data to make it organised and
easier to read. Data Manipulation Language, or DML, is a family of database
languages used to adjust data by inserting, deleting, and modifying records, for
example to cleanse or map the data.
Data Manipulation is one of the initial processes done in Data Analysis. It involves
arranging or rearranging data points to make it easier for users and data analysts to
derive insights or act on business directives. Data Manipulation encompasses
a broad range of tools and languages, including both coding and non-coding
techniques. It is used extensively not only by Data Analysts but also by business
people and accountants, for example to view the budget of a certain project.
It also has its own language, DML (Data Manipulation Language), which is
used to alter data in databases.
Data manipulation is a fundamental step in data analysis, data mining, and data
preparation for machine learning and is essential for making informed decisions
and drawing conclusions from raw data.
To make use of these data points, we perform data manipulation. It involves:
Creating a database
SQL for structured data manipulation
NoSQL databases such as MongoDB for unstructured data manipulation.
Steps Required to Perform Data Manipulation
The steps we perform in Data Manipulation are:
Mine the data and create a database: The data is first mined from the
internet, either with API requests or Web Scraping, and these data points are
structured into a database for further processing.
Perform data preprocessing: The data acquired from mining is still rough
and may have incorrect values, missing values, and outliers. In
this step, all these problems are taken care of, either by deleting the affected
rows or by imputing missing values with a statistic such as the mean.
Arrange the data: After the data has been preprocessed, it is arranged
accordingly to make analysis of data easier.
Transform the data: The data in question is transformed, either by
changing datatypes or transposing data in some cases.
Perform Data Analysis: Work with the data to view the result. Create
visualizations or an output column to view the output.
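As a rough illustration, these steps can be sketched in a few lines of pandas. The file name sales.csv and its columns region and revenue are made up for the example:
import pandas as pd
# Steps 1-2: load the mined data and handle missing values (hypothetical file)
df = pd.read_csv("sales.csv")
df["revenue"] = df["revenue"].fillna(df["revenue"].mean())
# Step 3: arrange the data to make analysis easier
df = df.sort_values("revenue", ascending=False)
# Step 4: transform the data (change a datatype)
df["revenue"] = df["revenue"].astype(float)
# Step 5: analyse the result, e.g. total revenue per region
print(df.groupby("region")["revenue"].sum())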
Tools Used in Data Manipulation
Many tools are used in Data Manipulation. Some of the most popular tools,
offering both no-code and code-based data manipulation functionality, are:
MS Excel – MS Excel is one of the most popular tools used for data
manipulation.
Power BI – It is a tool used to create interactive dashboards easily.
Tableau – Tableau has a similar functionality as Power BI, but it is also a
data analysis tool where you can manipulate data to create stunning
visualizations.
Operations of Data Manipulation
Data Manipulation follows four main operations, known as CRUD (Create, Read,
Update, and Delete). These operations are used across many industries to improve
the overall output.
In most DML, there is some version of the CRUD operations where:
Create: To create a new data point or database.
Read: Read the data to understand where we need to perform data
manipulation.
Update: Replace missing or wrong data points with the correct values so that
the data stays consistent.
Delete: Remove rows that contain missing, erroneous, or misclassified data.
For example, the Read operation can be performed with pandas:
import pandas as pd
# Load the dataset and print the last five rows
df = pd.read_csv("Iris.csv")
print(df.tail())
This code reads the Iris dataset and prints the last five rows of the table.
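The remaining CRUD operations can be sketched in the same illustrative way; pandas is not a DML itself, but the four ideas map naturally onto DataFrame operations on a small made-up table:
import pandas as pd
# Create: build a new table of data points
df = pd.DataFrame({"name": ["A", "B", "C"], "score": [10, None, 30]})
# Read: inspect the data to see where manipulation is needed
print(df)
# Update: replace the missing score with the column mean
df["score"] = df["score"].fillna(df["score"].mean())
# Delete: drop any rows that still contain missing values
df = df.dropna()
print(df)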
Rescaling –
Rescaling is a fundamental data preprocessing technique used in data
science to standardize the range of features of data. It is especially important
when features vary widely in magnitude, units, and range because most
machine learning algorithms perform better or converge faster when
features are on a relatively similar scale and close to normally distributed.
Here are some of the most common rescaling techniques:
1. Min-Max Scaling (Normalization) - One of the simplest methods of
rescaling is known as min-max scaling. A feature’s minimum and maximum
values are used to rescale data to fit inside a predetermined range when
using min-max scaling.
Min-Max Scaling rescales the feature to a fixed range, usually 0 to 1, or -1 to
1 (depending on the distribution). The formula for calculating the Min-Max
Scaling of a value x in feature X is:
X’ = X – min(X)/ max(X)-min(X)
where min(X) and max(X) are the minimum and maximum values of feature
X, respectively. This scaling is useful when you need a bounded range and are
working with algorithms sensitive to the scale of input data, like neural networks.
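As a small illustration (the numbers are arbitrary), the formula can be applied directly with NumPy:
import numpy as np
X = np.array([10.0, 20.0, 30.0, 40.0, 50.0])
# Min-Max scaling: (X - min) / (max - min) maps the values into [0, 1]
X_scaled = (X - X.min()) / (X.max() - X.min())
print(X_scaled)   # [0.   0.25 0.5  0.75 1.  ]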
2. Standardization (Z-score Normalization)
Standardization rescales data so that it has a mean of 0 and a standard deviation of
1. The formula is:
X′ = (X − μ) / σ
where μ is the mean and σ is the standard deviation of feature X. This method is
less affected by outliers than min-max scaling and does not bound values to a fixed
range. It is highly effective for algorithms that assume data is normally
distributed, such as Support Vector Machines and Linear Regression.
3. MaxAbs Scaling
MaxAbs Scaling scales each feature by its maximum absolute value to transform
the data within the range [-1, 1]. The formula for MaxAbs Scaling is:
X′ = X / max(|X|)
This scaler is meant for data that is already centered at zero without outliers.
4. Robust Scaling
Robust Scaling uses the median and the interquartile range for scaling, thus making
it robust to outliers. The formula is:
X′ = (X − Median(X)) / IQR(X)
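Each of these techniques is also available as a ready-made transformer in scikit-learn. The following sketch applies all four scalers to a single made-up feature column, in which 100 acts as an outlier:
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler, MaxAbsScaler, RobustScaler
# One feature with five sample values, shaped as a column as scikit-learn expects
X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])
print(MinMaxScaler().fit_transform(X).ravel())    # bounded to [0, 1]
print(StandardScaler().fit_transform(X).ravel())  # mean 0, standard deviation 1
print(MaxAbsScaler().fit_transform(X).ravel())    # divided by the maximum absolute value
print(RobustScaler().fit_transform(X).ravel())    # uses median and IQR, robust to the outlier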
Ques. How to scrape the web with respect to data science?
Web scraping is a crucial technique in data science for gathering structured data
from the internet. This process typically involves several key steps that enable data
scientists to extract, clean, and utilize web data effectively. Below is a detailed
explanation of each step involved in the web scraping process:
1. Choose a Target Website
Before you begin scraping, you need to choose which website(s) you will extract
data from. This decision is usually driven by the specific data needs of your project.
For instance, if you are interested in sentiment analysis, you might target social
media sites, forums, or review platforms.
2. Select the Data to Scrape
After choosing a website, the next step is to specify the data you need. This might
include text, images, links, or any specific pieces of information embedded in the
website’s HTML. Identifying the right HTML elements (like IDs and classes) that
contain your target data is crucial. Tools like the browser’s developer tools can be
used to inspect the HTML elements.
3. Generate Scraping Code
This step involves writing the scripts or programs that will fetch the web pages and
extract the desired data. Python, with libraries such as BeautifulSoup, Scrapy, or
Selenium, is a popular choice due to its readability and powerful libraries:
Code (install the libraries first with pip install requests beautifulsoup4, then run the script):
import requests
from bs4 import BeautifulSoup
# Define the URL of the site
url = 'http://example-news.com'
# Send a GET request to the website
response = requests.get(url)
# Parse the HTML content of the page with BeautifulSoup
soup = BeautifulSoup(response.text, 'html.parser')
# Find all headlines, assuming they sit in <h2> tags within <article> tags
articles = soup.find_all('article')
headlines = [article.h2.text.strip() for article in articles if article.h2]
urls = [article.h2.a['href'] for article in articles if article.h2 and article.h2.a]
# Print each headline with its corresponding URL
for headline, link in zip(headlines, urls):
    print(headline, link)
4. Execute the Scraping Code
Run the Python script. This script sends a request to http://example-news.com,
parses the HTML to extract headlines and URLs, and prints each headline with its
corresponding URL.
5. Clean and Validate the Data
Extracted data often comes with noise or unwanted markup. Cleaning this data
involves:
Removing HTML tags, unnecessary whitespaces, and correcting encoding
issues.
Validating the accuracy of the data, such as checking whether a scraped date or
numerical value falls within an expected range (a short cleaning sketch is given below).
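A minimal sketch of this step, continuing from the scraping code above (the headlines and urls lists are assumed to already exist), might look like this:
# Clean whitespace and keep only pairs whose URL looks absolute, so the two lists stay aligned
pairs = [(" ".join(h.split()), u) for h, u in zip(headlines, urls) if u.startswith("http")]
cleaned_headlines = [h for h, _ in pairs]
validated_urls = [u for _, u in pairs]
This also defines the cleaned_headlines and validated_urls lists used in the next step.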
6. Store or Analyze the Data
Save the cleaned data for further analysis or store it in a database for future use,
depending on your project requirements. Here's a simple example of saving the
data to a CSV file using Python's csv module.
import csv
# cleaned_headlines and validated_urls come from the cleaning step above
# Saving to CSV
with open('scraped_news_headlines.csv', 'w', newline='', encoding='utf-8') as file:
    writer = csv.writer(file)
    writer.writerow(['Headline', 'URL'])
    for headline, url in zip(cleaned_headlines, validated_urls):
        writer.writerow([headline, url])
print("Data has been saved to 'scraped_news_headlines.csv'")
Ques. How to generate a bar chart? Also describe, step by step, the generation
of a scatter plot.
A bar graph, also known as a bar chart, is a graphical display of data using bars of
different heights or lengths. It is used to compare quantities across different
categories. Each bar represents a category of data, and the size of the bar
represents the value or frequency of the category it corresponds to. Bar graphs can
be drawn horizontally or vertically
Bar Graph Definition
Bar graph is a visual representation of data in statistics that uses bars to compare
different categories or groups. Each bar in a bar graph represents a category or
group, and the length or height of the bar corresponds to the value or frequency of
that category.
Key elements of a bar graph include:
1. Axes: Most bar graphs have two axes. The x-axis typically lists the categories
being compared, and the y-axis represents the measured values.
2. Bars: Each bar’s length or height varies according to the data it represents.
3. Labels: Categories and values are labeled to make the data easy to
understand.
4. Title: The title of the bar graph provides an overview of what the data
represents.
It is also called a bar chart. Bar graphs are drawn with vertical or horizontal
rectangular bars, where the length of each bar represents the magnitude of the data
it corresponds to.
Generating a Bar Graph –
Step 1: Install matplotlib
First, install Matplotlib, a Python library for creating static, animated, and
interactive visualizations.
Code-
pip install matplotlib
Step 2: Prepare Your Data
To create a bar chart, you typically need two sets of data:
Categories: These are typically textual labels (e.g., names of items, months,
etc.).
Values: Numerical data associated with each category.
Code-
categories = ['Red', 'Blue', 'Green', 'Yellow']
values = [10, 15, 7, 12]
Step 3: Write the Code to Generate a Bar Chart
Here’s how you can create a simple vertical bar chart using the data above:
Code –
import matplotlib.pyplot as plt
# Data
categories = ['Red', 'Blue', 'Green', 'Yellow']
values = [10, 15, 7, 12]
# Creating the bar chart
plt.bar(categories, values, color='blue') # You can customize the color of the bars
# Adding title and labels
plt.title('Example of Bar Chart')
plt.xlabel('Categories')
plt.ylabel('Values')
# Displaying the chart
plt.show()
Step-by-step generation of a scatter plot –
Scatter Plot-
A scatter plot is a type of data visualization that shows the relationship between two
continuous variables by plotting data points on a graph. Each point represents an observation
or instance in a dataset, the x-axis represents one variable, and the y-axis represents the other
variable. Scatter plots are useful for identifying patterns, clusters, or correlations between the
variables.
Creating a scatter plot is an effective way to visualize the relationship between two
variables and observe potential correlation patterns. Using Python and the
matplotlib library, you can easily generate a scatter plot. Here’s a step-by-step
guide:
Step 1: Install matplotlib
First, make sure you have the matplotlib library installed. If it's not already
installed, you can install it using pip:
pip install matplotlib
Step 2: Prepare Your Data
For a scatter plot, you need two sets of numerical data, typically one set for the x-
axis and another for the y-axis.
Example Data:
x-axis data (e.g., hours studied)
y-axis data (e.g., exam scores)
Code-
x = [5, 20, 40, 60, 80] # Example x-axis data
y = [2, 3, 5, 7, 9] # Example y-axis data
Step 3: Write the Code to Generate a Scatter Plot
import matplotlib.pyplot as plt
# Data
x = [5, 20, 40, 60, 80] # Example x-axis data
y = [2, 3, 5, 7, 9] # Example y-axis data
# Creating the scatter plot
plt.scatter(x, y, color='red') # You can customize the color of the dots
# Adding title and labels
plt.title('Scatter Plot Example')
plt.xlabel('Hours Studied')
plt.ylabel('Exam Scores')
# Displaying the chart
plt.show()
This script will create a scatter plot with red dots placed at the coordinates defined
by the x and y lists.
Dimensionality Reduction
Dimensionality reduction is a technique used to reduce the number of features in
a dataset while retaining as much of the important information as possible. In
other words, it is a process of transforming high-dimensional data into a lower-
dimensional space that still preserves the essence of the original data.
In machine learning, high-dimensional data refers to data with a large number of
features or variables. The curse of dimensionality is a common problem in machine
learning, where the performance of the model deteriorates as the number of
features increases. This is because the complexity of the model increases with the
number of features, and it becomes more difficult to find a good solution. In
addition, high-dimensional data can also lead to overfitting, where the model fits
the training data too closely and does not generalize well to new data.
Dimensionality reduction can help to mitigate these problems by reducing the
complexity of the model and improving its generalization performance. There are
two main approaches to dimensionality reduction: feature selection and feature
extraction.
Feature Selection –
Feature selection involves selecting a subset of the original features that are
most relevant to the problem at hand.
The goal is to reduce the dimensionality of the dataset while retaining the
most important features.
There are several methods for feature selection, including filter methods,
wrapper methods, and embedded methods.
Filter methods rank the features based on their relevance to the target
variable, wrapper methods use the model performance as the criteria for
selecting features, and embedded methods combine feature selection with
the model training process.
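As a small illustration of a filter-style method, scikit-learn's SelectKBest ranks features with a univariate score and keeps only the top k, shown here on the built-in Iris dataset:
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif
X, y = load_iris(return_X_y=True)   # 150 samples, 4 original features
# Keep the 2 features with the highest ANOVA F-score against the target
selector = SelectKBest(score_func=f_classif, k=2)
X_selected = selector.fit_transform(X, y)
print(X.shape, X_selected.shape)    # (150, 4) (150, 2)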
Feature Extraction
Feature extraction involves creating new features by combining or
transforming the original features.
The goal is to create a set of features that captures the essence of the
original data in a lower-dimensional space.
There are several methods for feature extraction, including principal
component analysis (PCA), linear discriminant analysis (LDA), and t-
distributed stochastic neighbor embedding (t-SNE).
PCA is a popular technique that projects the original features onto a lower-
dimensional space while preserving as much of the variance as possible.
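A minimal PCA sketch with scikit-learn, again on the Iris data, reduces the four original features to two components:
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
X, y = load_iris(return_X_y=True)
# Project the 4-dimensional data onto its 2 principal components
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)
print(X_reduced.shape)                 # (150, 2)
print(pca.explained_variance_ratio_)   # share of variance retained by each component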
Advantages of Dimensionality Reduction
It helps in data compression and hence reduces the required storage space.
It reduces computation time.
It also helps remove redundant features, if any.
Improved Visualization: High-dimensional data is difficult to visualize, and
dimensionality reduction techniques can help in visualizing the data in 2D
or 3D.
Overfitting Prevention: High-dimensional data may lead to overfitting in
machine learning models, which results in poor performance on new data.
Improved Performance: Models trained on fewer, more informative features are
often faster to train and generalize better.
Cleaning and Munging –
In data science, data cleaning and munging, also known as data wrangling, is the
process of converting raw data into a format that is more suitable for downstream
uses, such as analytics. The process involves cleaning, organizing, and enriching the
data to make it readable.
Data Cleaning
Data cleaning involves fixing or removing incorrect, corrupted, incorrectly
formatted, duplicate, or incomplete data within a dataset. Key activities include:
Removing Duplicates: Deleting repeated data entries to prevent skewed
analysis results.
Handling Missing Values: Imputing missing data with statistical measures
like mean or median, or removing rows/columns with too many missing
values.
Correcting Errors: Fixing data entry mistakes or outliers that may affect the
accuracy of the final analysis.
Standardizing Formats: Ensuring that all data follows the same format (e.g.,
dates in YYYY-MM-DD, all text in the same case).
Filtering Noise: Identifying and eliminating random variance from data.
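A brief pandas sketch of a few of these cleaning activities on a small made-up table:
import pandas as pd
# Made-up raw data with a duplicate row, a missing price, and inconsistent text case
df = pd.DataFrame({
    "city": ["Delhi", "delhi", "Mumbai", "Mumbai"],
    "price": [10.0, 10.0, None, 25.0],
})
df["city"] = df["city"].str.title()                     # standardize text case
df = df.drop_duplicates()                               # remove duplicate entries
df["price"] = df["price"].fillna(df["price"].median())  # impute the missing value with the median
print(df)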
Data Munging (or Data Wrangling)
Data munging involves transforming and mapping data from one "raw" form into
another format that allows for more convenient consumption and analysis. It
encompasses:
Merging Sources: Combining data from multiple sources to create a
comprehensive dataset.
Converting Data Types: Changing data fields to the correct data type (e.g.,
converting strings to numeric values, parsing dates).
Creating New Variables: Deriving new meaningful features from existing
data which are more suitable for analysis.
Reshaping Data: Pivoting tables, transposing rows into columns, or vice
versa to suit the needs of analysis.
Normalizing Data: Scaling numeric data from different units or scales to a
common scale.
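And a similarly small munging sketch covering merging, type conversion, a derived variable, and reshaping (all column names are made up):
import pandas as pd
sales = pd.DataFrame({"store": ["A", "A", "B"],
                      "month": ["Jan", "Feb", "Jan"],
                      "units": ["10", "12", "7"]})
prices = pd.DataFrame({"store": ["A", "B"], "unit_price": [5.0, 6.0]})
merged = sales.merge(prices, on="store")                                # merge two sources
merged["units"] = merged["units"].astype(int)                           # convert strings to numbers
merged["revenue"] = merged["units"] * merged["unit_price"]              # create a new variable
wide = merged.pivot(index="store", columns="month", values="revenue")   # reshape rows into columns
print(wide)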
Both data cleaning and munging are vital because they directly impact the accuracy
and reliability of data analysis and model building.
Line Chart –
A line chart, or line graph, is a data visualization technique that displays data as
points, or "markers," connected by straight lines.
Line charts are similar to scatter plots, but the points are ordered and connected
by straight lines. Line charts can help users identify patterns, trends, and direction
in data over time. For example, a line chart can show if data is trending up or down,
if one category is performing better than others, or if a goal is likely to be met.
Line charts have an x and y-axis, with both axes containing numerical values that
represent the data.
Parts of Line Graph
Parts of the line graph include the following:
Title: It is nothing but the title of the graph drawn.
Axes: The line graph contains two axes i.e. X-axis and Y-axis.
Labels: The name given to the x-axis and y-axis.
Line: It is the line segment that is used to connect two or more data points.
Point: It is nothing but a point given at each segment.
How to make a line graph?
To make a line graph we need to use the following steps:
1. Determine the variables: The first and foremost step to creating a line graph
is to identify the variables you want to plot on the X-axis and Y-axis.
2. Choose appropriate scales: Based on your data, determine the appropriate
scale.
3. Plot the points: Plot the individual data points on the graph according to the
given data.
4. Connect the points: After plotting the points, you have to connect those
points with a line.
5. Label the axes: Add labels to the X-axis and Y-axis. You can also include the
unit of measurement.
6. Add Title: After completing the graph you should provide a suitable title.
Steps to make a line chart –
Step 1: Install Matplotlib
pip install matplotlib
Step 2: Import Matplotlib
import matplotlib.pyplot as plt
Step 3: Prepare Your Data
x = [1, 2, 3, 4, 5] # Example x values
y = [2, 4, 6, 8, 10] # Example y values corresponding to x values
Step 4: Create and Display the Plot
import matplotlib.pyplot as plt
# Data
x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]
# Creating the plot
plt.plot(x, y)
# Adding title and labels
plt.title('Simple Line Chart')
plt.xlabel('X Axis Label')
plt.ylabel('Y Axis Label')
# Displaying the plot
plt.show()
Matplotlib
Matplotlib is a comprehensive library for creating static, interactive, and animated
visualizations in Python. It is one of the most widely used data visualization libraries
in Python and is known for its ability to produce publication-quality figures in a
variety of formats across platforms.
Matplotlib was created by John D. Hunter in 2002. It was inspired by MATLAB.
Matplotlib is a popular choice for data visualization in Python. It is used by
scientists, engineers, and data analysts to create a wide variety of plots, including
line plots, bar charts, histograms, scatter plots, and heatmaps. Matplotlib can also
be used to create more complex visualizations, such as 3D plots, contour plots, and
animations.
Different Types of Plots in Matplotlib
Matplotlib offers a wide range of plot types to suit various data visualization needs.
Here are some of the most commonly used types of plots in Matplotlib:
Line Graph
Stem Plot
Bar chart
Histograms
Scatter Plot
Stack Plot
Box Plot
Pie Chart
Error Plot
Violin Plot
3D Plots
Advantages –
1. Versatility: It supports a wide variety of plots and visualizations, suitable for
many different types of data analysis needs.
2. Customization: Users can customize every aspect of a plot, enabling the
creation of publication-quality figures.
3. Ease of Use: Works well with Pandas and has strong community support.
4. Platform Friendly: Runs on different platforms and environments.
Numpy
NumPy (Numerical Python) is an essential library in the Python programming
ecosystem, widely used for scientific computing. It provides support for
large, multi-dimensional arrays and matrices, along with a large collection of
high-level mathematical functions to operate on these arrays.
NumPy is a Python library used for working with arrays. It also has functions
for working in the domains of linear algebra, Fourier transforms, and
matrices. NumPy was created in 2005 by Travis Oliphant. It is an open-source
project and you can use it freely.
Features of NumPy
NumPy has various features including these important ones:
Supports large and multi-dimensional arrays for complex data operations.
Allows operations on arrays of different sizes without explicit loops.
NumPy can work together with code written in the programming languages
C, C++, and Fortran.
Offers built-in support for linear algebra, Fourier transforms, and random
number generation, making it versatile for scientific computing.
Here are some of the benefits of using NumPy:
It can speed up your code by providing fast and efficient ways to manipulate
arrays.
It can make your code more readable and maintainable by providing a
consistent syntax for working with arrays.
It can give you access to a wide range of mathematical functions that can be
used to analyze and manipulate data.
Installation:
NumPy can be installed using pip, a Python package installer. The typical
command used in the terminal or command line is:
pip install numpy
Example Code:
Here’s a simple example of how to use NumPy to create an array and perform
a mathematical operation:
import numpy as np
# Create a one-dimensional array
a = np.array([1, 2, 3, 4, 5])
# Calculate the mean of the array
mean = np.mean(a)
# Sort the array (returns a sorted copy)
sorted_array = np.sort(a)
print(mean, sorted_array)   # 3.0 [1 2 3 4 5]
Scikit-learn
Scikit-Learn, also known as sklearn, is a Python library used to implement machine
learning models and statistical modeling.
Scikit-learn is a popular Python library used extensively in the field of
machine learning. It provides simple and efficient tools for data mining and
data analysis, and it is built on NumPy, SciPy, and Matplotlib.
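A minimal sketch of the typical scikit-learn workflow (load data, split, fit, evaluate); the k-nearest-neighbours classifier is just an arbitrary choice of model for illustration:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
# Load the data and split it into training and test sets
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Fit a simple classifier and evaluate it on the held-out test data
model = KNeighborsClassifier(n_neighbors=3)
model.fit(X_train, y_train)
print(accuracy_score(y_test, model.predict(X_test)))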