FDS Record Last
1. Python:
Python is an easy-to-learn, powerful programming language. It has efficient high-level data structures and a simple but effective approach to object-oriented programming. Python’s elegant syntax and dynamic typing, together with its interpreted nature, make it an ideal language for scripting and rapid application development in many areas on most platforms.
The Python interpreter and the extensive standard library are freely available in source or binary form for all major platforms from the Python web site, https://www.python.org/, and may be freely distributed.
Installation Commands:
Step 1: Download the Python Installer binaries. Open the official Python website in
your web browser https://www.python.org/downloads/
Step 2: Run the Executable Installer. Once the installer is downloaded, run the
Python installer. ...
Step 3: Add Python to the environment variables. ...
Step 4: Verify the Python Installation.
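As a complement to Step 4, the installation can also be checked from inside Python itself. The sketch below uses only the standard library; the check for a Python 3 interpreter is an assumption about which version was installed.

```python
import sys

# Print the interpreter version and the path of the running executable
print("Python version:", sys.version.split()[0])
print("Executable:", sys.executable)

# Assumption: the record expects a Python 3 interpreter
is_python3 = sys.version_info.major == 3
print("Running Python 3:", is_python3)
```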
2. NumPy:
NumPy stands for Numerical Python and it is a core scientific computing library
in Python. It provides efficient multi-dimensional array objects and various operations
to work with these array objects.
The package installer for Python (pip) is needed to install NumPy on your computer.
Installation Commands:
1. Command Prompt: py -m pip --version
2. Command Prompt: py -m pip install numpy
3. SciPy:
SciPy stands for Scientific Python. It is a scientific computation library that uses NumPy underneath and provides more utility functions for optimization, statistics and signal processing. Like NumPy, SciPy is open source, so we can use it freely.
Installation Commands:
Command Prompt: py -m pip install scipy
4. Matplotlib:
Matplotlib is a comprehensive library for creating static, animated, and interactive
visualizations in Python. Matplotlib makes easy things easy and hard things possible.
Create publication quality plots.
Make interactive figures that can zoom, pan, update.
Customize visual style and layout.
Export to many file formats.
Embed in JupyterLab and Graphical User Interfaces.
Use a rich array of third-party packages built on Matplotlib.
Installation Commands:
Command Prompt: py -m pip install matplotlib
5. Pandas:
pandas is a fast, powerful, flexible and easy-to-use open source data analysis and manipulation tool, built on top of the Python programming language.
Installation Commands:
Command Prompt: py -m pip install pandas
6. Jupyter:
The Jupyter Notebook is the original web application for creating and sharing
computational documents. It offers a simple, streamlined, document-centric experience.
Installation Commands:
Command Prompt: py -m pip install notebook
7. Seaborn:
Seaborn is a Python data visualization library based on matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics.
Installation Commands:
Command Prompt: py -m pip install seaborn
8. Plotly:
Plotly is a technical computing company headquartered in Montreal, Quebec, that
develops online data analytics and visualization tools. Plotly provides online graphing,
analytics, and statistics tools for individuals and collaboration, as well as scientific graphing
libraries for Python, R, MATLAB, Perl, Julia, Arduino, and REST.
Installation Commands:
Command Prompt: py -m pip install plotly
9. Bokeh:
Bokeh is a Python library for creating interactive visualizations for modern web browsers. It
helps you build beautiful graphics, ranging from simple plots to complex dashboards with
streaming datasets.
Installation Commands:
Command Prompt: py -m pip install bokeh
3. Draw a line in a diagram with the x-axis ranging from 0 to 6 and the y-axis ranging from 0 to 250, using Matplotlib
import matplotlib.pyplot as plt
import numpy as np
xpoints = np.array([0, 6])
ypoints = np.array([0, 250])
plt.plot(xpoints, ypoints)
plt.show()
3. Draw a line in a diagram from position (0, 0) to position (6, 250) using Matplotlib
NUMPY CONCEPTS:
Create NumPy ndarray Object
numpy.zeros()
numpy.ones()
numpy.empty()
numpy.linspace()
numpy.arange()
numpy.array()
Check Number of Dimensions
Dimensions in Arrays
Higher Dimensional Arrays
NumPy Array Indexing
NumPy Array Slicing
Two-Dimensional Arrays
NumPy Array Shape
Reshaping of Arrays
Aggregations
• Mean
• Median
• Mode
• Standard deviation
NumPy concepts:
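import numpy as np
# Values inferred from the results printed later in this section (sum 15, mean 3.0)
arr = np.array([1, 2, 3, 4, 5])
print(arr)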
2. numpy.zeros()
• Definition: This function creates a new array of the given shape and type, filled with zeros.
• Syntax: np.zeros(shape, dtype=float)
• Example:
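The original example is missing from this record. A minimal sketch of np.zeros() follows; the shapes and variable names are illustrative assumptions.

```python
import numpy as np

# A 2x3 array of zeros; the default dtype is float64
zeros_arr = np.zeros((2, 3))
print(zeros_arr)

# A one-dimensional integer array of four zeros
zeros_int = np.zeros(4, dtype=int)
print(zeros_int)
```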
3. numpy.ones()
• Definition: This function creates a new array of the given shape and type, filled with ones.
• Syntax: np.ones(shape, dtype=float)
• Example:
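The original example is missing here as well. A possible illustration of np.ones(), with shapes chosen arbitrarily for the sketch:

```python
import numpy as np

# A 3x2 array of ones; the default dtype is float64
ones_arr = np.ones((3, 2))
print(ones_arr)

# A one-dimensional integer array of five ones
ones_int = np.ones(5, dtype=int)
print(ones_int)
```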
5. numpy.linspace()
• Definition: This function returns an array of evenly spaced numbers over a specified range.
• Syntax: np.linspace(start, stop, num=50, endpoint=True)
• Example:
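As an illustration of the syntax above (the range and the number of points are assumptions made for this sketch), the effect of the endpoint parameter can be seen by calling np.linspace() twice:

```python
import numpy as np

# Five evenly spaced values from 0 to 1, including the stop value
evenly_spaced = np.linspace(0, 1, num=5)
print(evenly_spaced)

# With endpoint=False the stop value is excluded and the step becomes 0.2
without_endpoint = np.linspace(0, 1, num=5, endpoint=False)
print(without_endpoint)
```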
6. numpy.arange()
• Definition: This function returns an array with values spaced regularly within a given interval.
• Syntax: np.arange([start, ]stop, [step, ])
• Example:
arr = np.arange(0, 10, 2)  # illustrative values; the original definition was not preserved
print(arr)
7. numpy.array()
• Definition: This function is used to create arrays from lists, tuples, or other sequences.
• Syntax: np.array(object, dtype=None)
• Example:
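A short sketch of np.array() applied to a list, a tuple and a nested list; the input values are illustrative.

```python
import numpy as np

# From a Python list
from_list = np.array([1, 2, 3])

# From a tuple, forcing a floating-point dtype
from_tuple = np.array((4, 5, 6), dtype=float)

# From a nested list, which gives a two-dimensional array
from_nested = np.array([[1, 2], [3, 4]])

print(from_list, from_tuple, from_nested, sep="\n")
```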
8. Check Number of Dimensions
• Definition: You can use the .ndim attribute to check the number of dimensions of an array.
• Syntax: array.ndim
• Example:
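A possible illustration of the .ndim attribute, using arrays invented for this sketch:

```python
import numpy as np

scalar = np.array(42)                       # zero-dimensional array
vector = np.array([1, 2, 3])                # one-dimensional array
matrix = np.array([[1, 2, 3], [4, 5, 6]])   # two-dimensional array

print(scalar.ndim, vector.ndim, matrix.ndim)
```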
9. Dimensions in Arrays
• Definition: NumPy arrays can be 1D, 2D, 3D, or more. The number of dimensions is
the rank of the array.
• Example:
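One way to see how nesting depth determines the number of dimensions is to build 1D, 2D and 3D arrays side by side. The values below are assumptions for the sketch.

```python
import numpy as np

arr_1d = np.array([1, 2, 3])
arr_2d = np.array([[1, 2, 3], [4, 5, 6]])
arr_3d = np.array([[[1, 2], [3, 4]], [[5, 6], [7, 8]]])

# Each extra level of nesting adds one dimension
for name, a in [("1D", arr_1d), ("2D", arr_2d), ("3D", arr_3d)]:
    print(name, "ndim =", a.ndim, "shape =", a.shape)
```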
10. Higher Dimensional Arrays
• Definition: You can create arrays with more than two dimensions, such as 3D or 4D arrays.
• Example:
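Two common ways of obtaining higher-dimensional arrays are sketched below: reshaping a flat range, and the ndmin argument of np.array(). The sizes are illustrative.

```python
import numpy as np

# Reshape 24 consecutive integers into a 4D array
arr_4d = np.arange(24).reshape(1, 2, 3, 4)
print(arr_4d.ndim, arr_4d.shape)

# ndmin pads the shape with leading axes of length 1
arr_5d = np.array([1, 2, 3], ndmin=5)
print(arr_5d.ndim, arr_5d.shape)
```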
11. NumPy Array Indexing
• Definition: NumPy arrays support indexing, similar to Python lists, allowing access to individual elements.
• Syntax: array[index]
• Example:
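A minimal sketch of indexing in one and two dimensions; the arrays are invented for the illustration.

```python
import numpy as np

arr = np.array([10, 20, 30, 40, 50])
first = arr[0]     # indexing starts at 0
last = arr[-1]     # negative indices count from the end
print(first, last)

arr_2d = np.array([[1, 2, 3], [4, 5, 6]])
element = arr_2d[1, 2]   # row 1, column 2
print(element)
```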
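13. Two-Dimensional Arrays
• Example:
arr_2d = np.array([[1, 2, 3], [4, 5, 6]])  # illustrative values; the original definition was not preserved
print(arr_2d)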
14. NumPy Array Shape
• Definition: The .shape attribute returns a tuple representing the dimensions of the array.
• Syntax: array.shape
• Example:
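A possible illustration of .shape, assuming the five-element array used elsewhere in this section and an invented 2x3 array:

```python
import numpy as np

arr = np.array([1, 2, 3, 4, 5])
arr_2d = np.array([[1, 2, 3], [4, 5, 6]])

print(arr.shape)      # one entry per dimension
print(arr_2d.shape)

# The tuple can be unpacked into row and column counts
rows, cols = arr_2d.shape
```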
15. Reshaping of Arrays
• Example:
reshaped_arr = arr.reshape(5, 1)  # 5 rows, 1 column; arr must hold exactly 5 elements
print(reshaped_arr)
16. Aggregations
• Definition: NumPy provides functions for aggregation operations such as sum, mean, median, and standard deviation.
• Example:
print(arr.sum()) # Sum
print(arr.mean()) # Mean
print(np.median(arr)) # Median
print(arr.std()) # Standard deviation
• Definition: You can join multiple arrays into one using np.concatenate().
• Syntax: np.concatenate((arr1, arr2))
• Example:
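joined_arr = np.concatenate((np.array([1, 2, 3]), np.array([4, 5, 6])))  # inputs inferred from the printed result
print(joined_arr)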
19. Splitting NumPy Arrays
• Definition: You can split an array into multiple sub-arrays using np.split().
• Syntax: np.split(array, sections)
• Example:
split_arr = np.split(joined_arr, 2)  # the six-element joined array divides evenly into two parts
print(split_arr)
def add_five(x):
    return x + 5
# Wrap the Python function as a ufunc with 1 input and 1 output
ufunc_add_five = np.frompyfunc(add_five, 1, 1)
print(ufunc_add_five(np.array([1, 2, 3])))
22. Comparisons
• Definition: You can create masks for filtering arrays based on conditions.
• Example:
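The code for this example is missing from the record. The sketch below is a reconstruction, assuming arr holds the values 1 to 5, which is what the printed results in this section suggest.

```python
import numpy as np

arr = np.array([1, 2, 3, 4, 5])

# An element-wise comparison gives a Boolean array
greater_than_3 = arr > 3
print("Comparison: arr > 3:", greater_than_3)

# Combine conditions with & (and); each condition needs its own parentheses
mask = (arr > 2) & (arr < 5)
print("Mask for elements > 2 and < 5:", mask)

# Boolean indexing keeps only the elements where the mask is True
print("Elements satisfying the mask:", arr[mask])
```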
Sum of arr: 15
Mean of arr: 3.0
Median of arr: 3.0
Standard Deviation of arr: 1.4142135623730951
Joined Arrays: [1 2 3 4 5 6]
Splitting the joined array into two parts: [array([1, 2, 3]), array([4, 5, 6])]
Applying custom ufunc (add 5) on arr: [ 6 7 8 9 10]
Sorted arr: [1 2 3 4 5]
Comparison: arr > 3: [False False False True True]
Mask for elements > 2 and < 5: [False False True True False]
Elements satisfying the mask: [3 4]
Pandas Series:
Pandas DataFrame:
• Grouping data by a specific column and applying aggregation functions (e.g., sum,
mean).
• Syntax: df.groupby('Column').agg(function)
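A minimal sketch of grouping and aggregation follows. The DataFrame, its column names ("Department", "Salary") and its values are hypothetical and serve only to illustrate the syntax above.

```python
import pandas as pd

# Hypothetical data for illustration
df = pd.DataFrame({
    "Department": ["Sales", "Sales", "HR", "HR", "IT"],
    "Salary": [50000, 60000, 45000, 55000, 70000],
})

# Mean salary per department
mean_salary = df.groupby("Department")["Salary"].mean()
print(mean_salary)

# Several aggregation functions at once
summary = df.groupby("Department").agg({"Salary": ["sum", "mean"]})
print(summary)
```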
Descriptive Statistics:
PROGRAM:
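import pandas as pd
# data_dict was not preserved in this record; the values below are illustrative
data_dict = {'A': [1, 2, 3], 'B': [4.0, 5.0, 6.0]}
df = pd.DataFrame(data_dict)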
Descriptive Analysis:
Descriptive analysis, also known as descriptive analytics or descriptive statistics, is the process of using statistical techniques to describe or summarize a set of data.
Data Loading:
• Loads data from an Excel file. Additional comments show how to load data from a
text file or URL.
Basic Information:
• Displays dataset structure, head (first five rows), basic statistics, and any missing
values.
Class Distribution:
Distributions:
Pair Plot:
• A pair plot helps visualize relationships between all feature pairs, with species
differentiated by color.
Correlation Heatmap:
Grouped Statistics:
• Aggregates statistics like mean, standard deviation, minimum, and maximum for each
species group.
PROGRAM:
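import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# Load the Iris dataset from an Excel file
excel_data = pd.read_excel('/mnt/data/IRIS.xls')
# text_data = pd.read_csv('path_to_text_file.txt')
# url = 'http://example.com/iris.csv'
# web_data = pd.read_csv(url)
print("Dataset Information:")
excel_data.info()  # info() prints its report directly and returns None
# Display first few rows
print(excel_data.head())
print("\nDescriptive Statistics:")
print(excel_data.describe())
# Count missing values in each column
print(excel_data.isnull().sum())
print("\nClass distribution:")
print(excel_data['species'].value_counts())
# Distribution of each numeric feature
# (the plotting calls were lost from this record; the seaborn calls below are a reconstruction)
numeric_columns = excel_data.select_dtypes(include=np.number).columns
for column in numeric_columns:
    plt.figure(figsize=(8, 4))
    sns.histplot(excel_data[column], kde=True)
    plt.title(f'Distribution of {column}')
    plt.show()
# Pair plot of all feature pairs, colored by species
sns.pairplot(excel_data, hue='species')
plt.show()
# Correlation heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(excel_data[numeric_columns].corr(), annot=True, cmap='coolwarm')
plt.title("Correlation Heatmap")
plt.show()
# Grouped statistics for each species
grouped_stats = excel_data.groupby('species').agg(['mean', 'std', 'min', 'max'])
print(grouped_stats)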
OUTPUT:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 sepal_length 150 non-null float64
1 sepal_width 150 non-null float64
2 petal_length 150 non-null float64
3 petal_width 150 non-null float64
4 species 150 non-null object
dtypes: float64(4), object(1)
memory usage: 6.0+ KB
sepal_length sepal_width petal_length petal_width species
0 5.1 3.5 1.4 0.2 setosa
1 4.9 3.0 1.4 0.2 setosa
2 4.7 3.2 1.3 0.2 setosa
3 4.6 3.1 1.5 0.2 setosa
4 5.0 3.6 1.4 0.2 setosa
sepal_length sepal_width petal_length petal_width
count 150.000000 150.000000 150.000000 150.000000
mean 5.843333 3.057333 3.758000 1.199333
std 0.828066 0.435866 1.765298 0.762238
min 4.300000 2.000000 1.000000 0.100000
max 7.900000 4.400000 6.900000 2.500000
sepal_length 0
sepal_width 0
petal_length 0
petal_width 0
species 0
dtype: int64
setosa 50
versicolor 50
virginica 50
Name: species, dtype: int64
sepal_length sepal_width ...
mean std min max mean std min max
species
setosa 5.006000 0.352490 4.300 5.800 3.428000 0.379064 2.300 4.400
versicolor 5.936000 0.516171 4.900 7.000 2.770000 0.313798 2.000 3.400
virginica 6.588000 0.635880 4.900 7.900 2.974000 0.322497 2.200 3.800
Importing Libraries:
• Specified file paths for UCI Wine Quality and Pima Indians Diabetes datasets.
• Used try-except for loading datasets to handle potential FileNotFoundError
gracefully.
Univariate Analysis:
• For both datasets, calculated statistical summaries: mean, median, mode, variance,
standard deviation, skewness, and kurtosis.
• Utilized describe() for overall summary statistics.
• Displayed mean values for each column in the Wine Quality dataset for additional
detail.
• Used LinearRegression to fit a linear model for predicting wine quality based on
various features.
• Split data into training and testing sets with train_test_split.
• Calculated predictions and computed Mean Squared Error (MSE) for model
evaluation.
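The three steps above (split, fit, evaluate) can be sketched on synthetic data so that the example runs without the wine quality file. The coefficients, noise level, test_size and random_state below are assumptions made for the illustration, not values from this record.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data: three features and a linear target with a little noise
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
true_coef = np.array([1.5, -2.0, 0.5])
y = X @ true_coef + 3.0 + rng.normal(scale=0.1, size=200)

# Hold out 20% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit on the training set, predict on the test set
model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Mean Squared Error: average squared difference between truth and prediction
mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error: {mse:.4f}")
```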
• For both linear and logistic regression, used statsmodels to generate detailed summary
reports.
• Added a constant term to the feature set (intercept) before fitting the model.
• Used OLS for linear regression and Logit for logistic regression to view statistical
significance of features, coefficients, and other model diagnostics.
PROGRAM:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.metrics import mean_squared_error, accuracy_score
import statsmodels.api as sm
# Define file paths
uci_file_path = r'C:\Users\hp\Downloads\Datasets for FDS\wine_quality.csv'  # Adjust the name as necessary
pima_file_path = r'C:\Users\hp\Downloads\Datasets for FDS\diabetes.csv'  # Update with actual file name
# Debugging print statements
print("UCI File Path:", uci_file_path)
print("Pima File Path:", pima_file_path)
# Load the UCI dataset (Wine Quality)
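try:
    uci_data = pd.read_csv(uci_file_path, delimiter=';')  # Using the correct delimiter
    print("UCI Data Loaded Successfully.")
except FileNotFoundError as e:
    print("FileNotFoundError:", e)
    exit(1)  # Exit if the file is not found
# The model-fitting step was not preserved in this record. The lines below are a
# reconstruction that assumes the target column is named 'quality' and an 80/20 split.
X = uci_data.drop(columns='quality')
y = uci_data['quality']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
linear_model = LinearRegression().fit(X_train, y_train)
# Make predictions
y_pred = linear_model.predict(X_test)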
# Calculate Mean Squared Error
mse = mean_squared_error(y_test, y_pred)
print(f"\nMean Squared Error (Wine Quality): {mse:.2f}")
# Load the Pima Indians Diabetes dataset
try:
    pima_data = pd.read_csv(pima_file_path)
    print("Pima Data Loaded Successfully.")
except FileNotFoundError as e:
    print("FileNotFoundError:", e)
    exit(1)  # Exit if the file is not found
# Univariate Analysis for Pima Dataset
pima_desc = pima_data.describe()
pima_mean = pima_data.mean()
pima_median = pima_data.median()
pima_mode = pima_data.mode().iloc[0]
pima_var = pima_data.var()
pima_std = pima_data.std()
pima_skew = pima_data.skew()
pima_kurt = pima_data.kurt()
print("\nPima Indians Diabetes Dataset Univariate Analysis")
print("Mean:", pima_mean)
print("Median:", pima_median)
print("Mode:", pima_mode)
print("Variance:", pima_var)
print("Standard Deviation:", pima_std)
print("Skewness:", pima_skew)
print("Kurtosis:", pima_kurt)
1. Normal Curve
Definition:
A normal curve, or Gaussian distribution, is a symmetric, bell-shaped curve that shows the
probability distribution of a continuous random variable. The curve is defined by the mean
(center) and the standard deviation (spread).
Syntax:
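import numpy as np
import matplotlib.pyplot as plt
# Example data: a normal curve with mean 0 and standard deviation 1 (illustrative values)
mean, std_dev = 0, 1
x = np.linspace(mean - 3*std_dev, mean + 3*std_dev, 100)
y = (1 / (std_dev * np.sqrt(2 * np.pi))) * np.exp(-0.5 * ((x - mean) / std_dev) ** 2)
# Plotting
plt.plot(x, y)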
plt.title("Normal Distribution Curve")
plt.xlabel("Values")
plt.ylabel("Probability Density")
plt.show()
2. Scatter Plot
Definition:
A scatter plot is used to show the relationship between two numerical variables by plotting
data points on an XY axis. Each point represents a pair of values for the two variables.
Syntax:
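import matplotlib.pyplot as plt
# Example data (illustrative values)
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 6]
# Plotting
plt.scatter(x, y)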
plt.title("Scatter Plot")
plt.xlabel("X-axis")
plt.ylabel("Y-axis")
plt.show()
3. Histogram
Definition:
A histogram is a graphical representation of the distribution of numerical data. It groups data
into bins (ranges) and plots the frequency of each bin as bars, showing the shape of the data
distribution.
Syntax:
import matplotlib.pyplot as plt
# Example data
data = [10, 20, 20, 30, 30, 40, 40, 40, 50, 50, 50, 50]
# Plotting
plt.hist(data, bins=5, color='skyblue', edgecolor='black')
plt.title("Histogram")
plt.xlabel("Value Ranges")
plt.ylabel("Frequency")
plt.show()
4. Contour Plot
Definition:
A contour plot is a two-dimensional representation of a 3D surface, showing lines where a
particular z-value is constant. It is often used to show the density of data points in two
dimensions.
Syntax:
import numpy as np
import matplotlib.pyplot as plt
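# Example data: a Gaussian-shaped surface on a regular grid (illustrative values)
x = np.linspace(-3, 3, 100)
y = np.linspace(-3, 3, 100)
X, Y = np.meshgrid(x, y)
Z = np.exp(-(X**2 + Y**2) / 2)
# Plotting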
plt.contour(X, Y, Z, levels=10, cmap="viridis")
plt.title("Contour Plot")
plt.xlabel("X-axis")
plt.ylabel("Y-axis")
plt.show()
5. Density Plot
Definition:
A density plot represents the distribution of a numerical variable using kernel density
estimation (KDE). It smooths out data points to give an estimated continuous probability
density function.
Syntax:
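The syntax for this plot is missing from the record. To show what kernel density estimation does, the sketch below computes a Gaussian KDE by hand with NumPy. The helper gaussian_kde_1d, the sample data and the choice of Scott's rule for the bandwidth are all assumptions made for this illustration.

```python
import numpy as np

def gaussian_kde_1d(samples, grid, bandwidth):
    """Estimate a density on `grid` by averaging Gaussian bumps centred on each sample."""
    samples = np.asarray(samples, dtype=float)
    grid = np.asarray(grid, dtype=float)
    # One row per grid point, one column per sample
    z = (grid[:, None] - samples[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z**2) / np.sqrt(2 * np.pi)
    return kernels.mean(axis=1) / bandwidth

rng = np.random.default_rng(0)
samples = rng.normal(loc=50, scale=15, size=200)  # illustrative data

# Scott's rule of thumb for the bandwidth in one dimension
bandwidth = samples.std(ddof=1) * len(samples) ** (-1 / 5)

grid = np.linspace(samples.min() - 3 * bandwidth, samples.max() + 3 * bandwidth, 400)
density = gaussian_kde_1d(samples, grid, bandwidth)

# A valid density integrates to approximately 1
area = np.sum(density) * (grid[1] - grid[0])
print("Approximate area under the curve:", round(area, 2))
```

The arrays grid and density can then be drawn with plt.plot(grid, density). For a pandas Series, series.plot(kind='density') produces a comparable curve in one call.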
6. 3D Plot
Definition:
A 3D plot is a graphical representation that shows data points in three dimensions, typically
using x, y, and z coordinates. It can be a line, scatter plot, or surface plot.
Syntax:
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
# Example data
x = np.random.rand(50)
y = np.random.rand(50)
z = np.random.rand(50)
# Plotting
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.scatter(x, y, z)
ax.set_title("3D Scatter Plot")
ax.set_xlabel("X-axis")
ax.set_ylabel("Y-axis")
ax.set_zlabel("Z-axis")
plt.show()
PROGRAM:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
# Load data from CSV file (use a sample data if hou_all.csv is unavailable)
# Replace 'hou_all.csv' with your file path
# df = pd.read_csv('C:\\Users\\Admin\\Downloads\\hou_all.csv')
# For demonstration, create a sample DataFrame
np.random.seed(0)
df = pd.DataFrame({
'set': np.random.normal(loc=50, scale=15, size=100),
'value': np.random.normal(loc=30, scale=10, size=100)
})
# 1. Normal Curve Plot
mean = df['set'].mean()
std_dev = df['set'].std()
x = np.linspace(mean - 3*std_dev, mean + 3*std_dev, 100)
y = (1 / (std_dev * np.sqrt(2 * np.pi))) * np.exp(-0.5 * ((x - mean) / std_dev) ** 2)
plt.plot(x, y)
plt.title("Normal Distribution Curve")
plt.xlabel("Set")
plt.ylabel("Probability Density")
plt.show()
# 2. Scatter Plot
plt.scatter(df['set'], df['value'])
plt.title("Scatter Plot")
plt.xlabel("Set")
plt.ylabel("Value")
plt.show()
# 3. Histogram
plt.hist(df['set'], bins=10, color='skyblue', edgecolor='black')
plt.title("Histogram of Set")
plt.xlabel("Set")
plt.ylabel("Frequency")
plt.show()
# 4. Contour Plot
# Creating a 2D grid of values for contour plot
X, Y = np.meshgrid(np.linspace(df['set'].min(), df['set'].max(), 100),
                   np.linspace(df['value'].min(), df['value'].max(), 100))
# Gaussian-shaped surface centered on the mean of each column
Z = np.exp(-((X - mean)**2 + (Y - df['value'].mean())**2) / (2 * std_dev**2))
plt.contour(X, Y, Z, levels=10, cmap="viridis")
plt.title("Contour Plot")
plt.xlabel("Set")
plt.ylabel("Value")
plt.show()
# 5. Density Plot
df['set'].plot(kind='density')
plt.title("Density Plot of Set")
plt.xlabel("Set")
plt.show()
# 6. 3D Plot
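fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
# The plotting call was lost from this record. As an assumption, plot the
# number of points falling in each (set, value) bin as the z coordinate.
hist, xedges, yedges = np.histogram2d(df['set'], df['value'], bins=10)
xpos, ypos = np.meshgrid(xedges[:-1], yedges[:-1], indexing='ij')
ax.scatter(xpos.ravel(), ypos.ravel(), hist.ravel())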
ax.set_xlabel("Set")
ax.set_ylabel("Value")
ax.set_zlabel("Frequency")
plt.show()
Basemap
Basemap is a matplotlib extension used to create and visualize geographical maps in Python.
Installing Basemap:
pip install basemap
Importing Basemap and matplotlib libraries:
from mpl_toolkits.basemap import Basemap
import matplotlib.pyplot as plt
Political boundaries:
• drawcountries(): Draw country boundaries
• drawstates(): Draw US state boundaries
• drawcounties(): Draw US county boundaries
Map features:
• drawgreatcircle(): Draw a great circle between two points
• drawparallels(): Draw lines of constant latitude
• drawmeridians(): Draw lines of constant longitude
• drawmapscale(): Draw a linear scale on the map
Whole-globe images:
• bluemarble(): Project NASA's blue marble image onto the map
• shadedrelief(): Project a shaded relief image onto the map
PROGRAM :
# Step 1: Import Basemap and matplotlib
from mpl_toolkits.basemap import Basemap
import matplotlib.pyplot as plt
# Step 2: Create a Basemap instance for the world map (orthographic projection centered at (0, 0))
m = Basemap(projection='ortho', lat_0=0, lon_0=0)
import numpy as np
import pandas as pd
# The DataFrame definition was not preserved in this record. The sample below is
# illustrative: it only needs columns 'A', 'B', 'C' (categories) and 'D', 'E' (numbers).
df = pd.DataFrame({
    'A': ['foo', 'foo', 'foo', 'foo', 'foo', 'bar', 'bar', 'bar', 'bar'],
    'B': ['one', 'one', 'one', 'two', 'two', 'one', 'one', 'two', 'two'],
    'C': ['small', 'large', 'large', 'small', 'small', 'large', 'small', 'small', 'large'],
    'D': [1, 2, 2, 3, 3, 4, 5, 6, 7],
    'E': [2, 4, 5, 5, 6, 6, 8, 9, 9],
})
# Creating a pivot table with the sum of 'D' values grouped by 'A' and 'B', with columns based on 'C'
table = pd.pivot_table(df, values='D', index=['A', 'B'], columns=['C'], aggfunc='sum')
print("\nPivot Table with sum of 'D' values:")
print(table)
# Creating a pivot table with the sum of 'D' values, and fill NaN values with 0
table = pd.pivot_table(df, values='D', index=['A', 'B'], columns=['C'], aggfunc='sum',
                       fill_value=0)
print("\nPivot Table with sum of 'D' values and fill_value=0:")
print(table)
# Pivot table with mean of 'D' and 'E' values, grouped by 'A' and 'C'
table = pd.pivot_table(df, values=['D', 'E'], index=['A', 'C'],
                       aggfunc={'D': 'mean', 'E': 'mean'})
print("\nPivot Table with mean of 'D' and 'E' values:")
print(table)
# Pivot table with mean of 'D' and multiple aggregations for 'E'
table = pd.pivot_table(df, values=['D', 'E'], index=['A', 'C'],
                       aggfunc={'D': 'mean', 'E': ['min', 'max', 'mean']})
print("\nPivot Table with mean of 'D' and multiple aggregations for 'E':")
print(table)
OUTPUT:
PROGRAM:
# Assumes wnba is a pandas DataFrame with a 'Height' column, loaded earlier in the record
freq_dis_height = wnba['Height'].value_counts()
print("Frequency Distribution for Height:\n", freq_dis_height, "\n")
freq_dis_height_sorted = wnba['Height'].value_counts().sort_index(ascending=False)
print("Sorted Frequency Distribution for Height (Descending):\n", freq_dis_height_sorted,
"\n")
# Mode example
from statistics import mode
data1 = (2, 3, 3, 4, 5, 5, 5, 5, 6, 6, 6, 7)
print("Mode of data1:", mode(data1), "\n")
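# Scatter plot for the first two features in the iris dataset
# (the loading step was not preserved; scikit-learn's bundled iris data is assumed)
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
iris = load_iris()
features = iris.data.T  # one row per feature
plt.scatter(features[0], features[1], alpha=0.2, s=100 * features[3], c=iris.target,
            cmap='viridis')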
plt.xlabel(iris.feature_names[0])
plt.ylabel(iris.feature_names[1])
plt.title("Scatter Plot of Iris Features")
plt.show()