
EX: 1 Reg No: 2022510023

Date: 7/8/2024

Experimenting with Data Analysis Packages and Statistical Operations
Aim:

To explore and utilize data analysis packages like NumPy, SciPy, Jupyter, Statsmodels, and Pandas for
data manipulation and statistical analysis on a chosen dataset, focusing on descriptive analytics and key
statistical measures.

1. Exploring NumPy:

NumPy module
The NumPy module in Python enables fast numerical computation, such as matrix multiplication and inversion, by storing data in homogeneous arrays known as NumPy arrays.
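
As a minimal sketch of the matrix operations mentioned above (illustrative 2x2 values; np.linalg.inv assumes a non-singular matrix):

import numpy as np

A = np.array([[1.0, 2.0], [3.0, 4.0]])
B = np.array([[5.0, 6.0], [7.0, 8.0]])

print("Product:\n", A @ B)                  # matrix multiplication
print("Inverse of A:\n", np.linalg.inv(A))  # matrix inversion (A is non-singular here)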

Code:

import numpy as np
#Creating an array
arr = np.array([12,26,27,28,30])
print(arr)
print(arr.dtype)

Output:

Code:

#Creating Multidimensional Arrays


arr1 = np.array([[[1, 2, 3], [4, 5, 6]], [[1, 2, 3], [4, 5, 6]]])
print(arr1)
print("Dimension of the Array:",arr1.ndim)

Output:

Code:
# Example array
array = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])

# Descriptive statistics
sum_value = np.sum(array)
min_value = np.min(array)
max_value = np.max(array)
range_value = np.ptp(array)
cumsum_value = np.cumsum(array)
cumprod_value = np.cumprod(array)

# Print results
print("Sum:", sum_value)
print("Min:", min_value)
print("Max:", max_value)
print("Range:", range_value)
print("Cumulative Sum:", cumsum_value)
print("Cumulative Product:", cumprod_value)

Output:

Code:

# Mean
data = np.array([10, 20, 30, 40, 50])
mean = np.mean(data)
print("Mean:", mean)

# Median
median = np.median(data)
print("Median:", median)

# Standard Deviation
std_dev = np.std(data)
print("Standard Deviation:", std_dev)

# Variance
variance = np.var(data)
print("Variance:", variance)
# Percentile
percentile_25 = np.percentile(data, 25)
print("25th Percentile:", percentile_25)

# Correlation Coefficient Matrix


x = np.array([1, 2, 3, 4, 5])
y = np.array([5, 4, 3, 2, 1])
correlation_matrix = np.corrcoef(x, y)
print("Correlation Coefficient Matrix:\n", correlation_matrix)

# Covariance Matrix
cov_matrix = np.cov(x, y)
print("Covariance Matrix:\n", cov_matrix)

# Histogram
hist, bin_edges = np.histogram(data, bins=5)
print("Histogram:", hist)
print("Bin Edges:", bin_edges)

# Unique Elements
unique_elements = np.unique(data)
print("Unique Elements:", unique_elements)

# Check for NaN values


nan_data = np.array([1, np.nan, 3, 4])
nan_check = np.isnan(nan_data)
print("NaN Check:", nan_check)

# Check for Finite Values


finite_data = np.array([1, np.inf, -np.inf, 3])
finite_check = np.isfinite(finite_data)
print("Finite Check:", finite_check)

# Dot Product
a = np.array([1, 2, 3])
b = np.array([4, 5, 6])
dot_product = np.dot(a, b)
print("Dot Product:", dot_product)

# Random Data
np.random.seed(0)
random_data = np.random.rand(5)
print("Random Data:", random_data)

Output:

Code:

# Example matrix
matrix = np.array([[1, 2, 3],
                   [4, 5, 6],
                   [7, 8, 9]])

# Transpose
transpose_matrix = np.transpose(matrix)
# or using matrix.T
transpose_matrix_alt = matrix.T

# Inverse
# Note: Matrix must be square and non-singular for inversion
try:
    inverse_matrix = np.linalg.inv(matrix)
except np.linalg.LinAlgError:
    inverse_matrix = "Matrix is singular or not square"

# Determinant
determinant = np.linalg.det(matrix)

# Eigenvalues and Eigenvectors
eigenvalues, eigenvectors = np.linalg.eig(matrix)

# Singular Value Decomposition
U, S, Vt = np.linalg.svd(matrix)

# Matrix Product
matrix_product = np.matmul(matrix, matrix)
# or using the @ operator
matrix_product_alt = matrix @ matrix

# Trace
trace = np.trace(matrix)

print("Transpose:\n", transpose_matrix)
print("Transpose (alternative method):\n", transpose_matrix_alt)
print("Inverse:\n", inverse_matrix)
print("Determinant:", determinant)
print("Eigenvalues:", eigenvalues)
print("Eigenvectors:\n", eigenvectors)
print("SVD U:\n", U)
print("SVD S:", S)
print("SVD Vt:\n", Vt)
print("Matrix Product:\n", matrix_product)
print("Matrix Product (alternative method):\n", matrix_product_alt)
print("Trace:", trace)

Output:

2. Exploring SciPy:

SciPy module
The SciPy module builds on NumPy arrays and provides efficient implementations of mathematical algorithms, including routines for statistics, integration, linear algebra, interpolation, and optimization.

Code:

import numpy as np
from scipy import stats, integrate, linalg, interpolate, optimize


# Creating data
data = np.array([1, 2, 2, 3, 4, 5])
x = np.linspace(0, 10, 10)
y = np.sin(x)

# Mean and Standard Deviation


mean = stats.tmean(data)
std_dev = stats.tstd(data)
print("Mean (scipy.stats):", mean)
print("Standard Deviation (scipy.stats):", std_dev)

# Pearson Correlation Coefficient


corr_coefficient, _ = stats.pearsonr(x, y)
print("Pearson Correlation Coefficient:", corr_coefficient)

# Spearman Rank Correlation Coefficient


spearman_corr, _ = stats.spearmanr(x, y)
print("Spearman Rank Correlation Coefficient:", spearman_corr)

# Linear Regression
slope, intercept, r_value, p_value, std_err = stats.linregress(x, y)
print("Linear Regression - Slope:", slope)
print("Linear Regression - Intercept:", intercept)

# Integration
integral, error = integrate.quad(lambda x: x**2, 0, 1)
print("Integral of x^2 from 0 to 1:", integral)

# Solve Linear System


A = np.array([[3, 2], [1, 2]])
b = np.array([5, 6])
solution = linalg.solve(A, b)
print("Solution of linear system:", solution)

# Eigenvalues and Eigenvectors


A = np.array([[1, 2], [3, 4]])
eigenvalues, eigenvectors = linalg.eig(A)
print("Eigenvalues:", eigenvalues)
print("Eigenvectors:\n", eigenvectors)

# Optimization
def objective_function(params):
    return np.sum((y - (params[0] * x + params[1]))**2)

initial_guess = [1, 0]
result = optimize.minimize(objective_function, initial_guess)
print("Optimization result:", result.x)

# Descriptive Statistics
desc_stats = stats.describe(data)
print("Descriptive Statistics:", desc_stats)

# Interquartile Range
iqr = stats.iqr(data)
print("Interquartile Range:", iqr)

# Z-score
z_scores = stats.zscore(data)
print("Z-scores:", z_scores)

Output:
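
Note that scipy.interpolate is imported above but never exercised. A minimal, self-contained sketch of cubic-spline interpolation over the same kind of data (illustrative values only):

import numpy as np
from scipy.interpolate import CubicSpline

x = np.linspace(0, 10, 10)
y = np.sin(x)
cs = CubicSpline(x, y)                    # fit a cubic spline through the sample points
print("Estimate of sin(2.5):", cs(2.5))   # interpolated value between the samples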

3. Exploring Statsmodels:

Statsmodels module
The statsmodels module provides classes for estimating statistical models, such as ordinary least squares and logistic regression, and for summarizing the results of a fitted model.

Code:
import statsmodels.api as sm
import statsmodels.formula.api as smf
from sklearn.datasets import load_iris
import pandas as pd
# Load Iris dataset
iris = load_iris()
iris_df = pd.DataFrame(data=iris.data, columns=iris.feature_names)
iris_df['species'] = iris.target

iris_df['species'] = iris_df['species'].map({i: species for i, species in enumerate(iris.target_names)})

# Linear Regression (predicting petal length from sepal length)


X = iris_df[['sepal length (cm)']]
X = sm.add_constant(X) # Adds a constant term to the predictor
y = iris_df['petal length (cm)']
linear_model = sm.OLS(y, X).fit()
print("Linear Regression Summary:\n", linear_model.summary())

# Logistic Regression (predicting species from sepal length and petal length)
# Converting species to a binary outcome for simplicity
iris_df['species_binary'] = (iris_df['species'] == 'versicolor').astype(int)

# Logistic Regression Model


X_logit = sm.add_constant(iris_df[['sepal length (cm)', 'petal length (cm)']])
logit_model = sm.Logit(iris_df['species_binary'], X_logit).fit()
print("Logistic Regression Summary:\n", logit_model.summary())

Output:
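
The formula interface imported above as smf is never used in the code. A minimal sketch of the same linear regression fitted via a formula, reusing iris_df from the code above (columns are renamed here because formula terms cannot contain spaces):

iris_formula_df = iris_df.rename(columns={
    'sepal length (cm)': 'sepal_length',
    'petal length (cm)': 'petal_length'
})
formula_model = smf.ols('petal_length ~ sepal_length', data=iris_formula_df).fit()
print(formula_model.summary())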

4. Exploring Pandas:

Pandas module

This module provides the core data structures and functions for working with tabular datasets, making it especially useful to data scientists for analysing data.
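
Pandas is organised around two core structures, the one-dimensional Series and the two-dimensional DataFrame; a minimal sketch with illustrative values:

import pandas as pd

s = pd.Series([10, 20, 30], name='score')                      # 1-D labelled array
frame = pd.DataFrame({'score': s, 'grade': ['A', 'B', 'C']})   # 2-D labelled table
print(frame)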

Code:

import pandas as pd
import numpy as np

# Sample DataFrame creation


data = {
'A': [1, 2, np.nan, 4, 5],
'B': ['a', 'b', 'a', 'b', 'a'],
'C': [10, 20, 10, 20, 10],
'D': [100, 200, 100, 200, 100]
}
df = pd.DataFrame(data)

# Display the DataFrame


print("Original DataFrame:")
print(df)

# 1. DataFrame Creation
df_created = pd.DataFrame(data)
print("\nDataFrame Created from Dictionary:")
print(df_created)

# 2. Display First and Last Rows


print("\nFirst 3 Rows of DataFrame:")
print(df.head(3))
print("\nLast 3 Rows of DataFrame:")
print(df.tail(3))

# 3. Summary Information
print("\nDataFrame Info:")
print(df.info())

# 4. Sampling
print("\nRandom Sample of 2 Rows:")
print(df.sample(2))

# 5. Handling Missing Values


numeric_cols = df.select_dtypes(include=[np.number]).columns
df_filled = df.fillna(df[numeric_cols].mean())
print("\nDataFrame with Missing Values Filled:")
print(df_filled)

df_dropped = df.dropna()

print("\nDataFrame with Missing Values Dropped:")
print(df_dropped)

# 6. Data Aggregation
mean_values = df.groupby('B').mean()
print("\nMean Values Grouped by 'B':")
print(mean_values)

# 7. Merging DataFrames
df2 = pd.DataFrame({'B': ['a', 'b'], 'E': [1, 2]})
merged_df = pd.merge(df, df2, on='B', how='left')
print("\nMerged DataFrame:")
print(merged_df)

# 8. Sorting
sorted_df = df.sort_values(by='A', ascending=False)
print("\nDataFrame Sorted by 'A':")
print(sorted_df)

# 9. Filtering
filtered_df = df[df['A'] > 2]
print("\nFiltered DataFrame (A > 2):")
print(filtered_df)

# 10. Applying Functions


df['A_squared'] = df['A'].apply(lambda x: x**2)
print("\nDataFrame with 'A_squared':")
print(df)

# 11. Pivot Tables


pivot_table = pd.pivot_table(df, values='D', index='B', columns='C', aggfunc='mean')
print("\nPivot Table:")
print(pivot_table)

# 12. Statistical Summary


summary = df.describe()
print("\nStatistical Summary:")
print(summary)

# 13. Exporting and Importing Data


df.to_csv('example.csv', index=False)
print("\nDataFrame saved to 'example.csv'")
df_read = pd.read_csv('example.csv')
print("\nDataFrame read from 'example.csv':")
print(df_read)

# 14. Renaming Columns

df_renamed = df.rename(columns={'A': 'Column_A', 'B': 'Column_B'})
print("\nDataFrame with Renamed Columns:")
print(df_renamed)

# 15. DataFrame Shape


shape = df.shape
print("\nDataFrame Shape:")
print(shape)

# 16. Dropping Columns


df_dropped_col = df.drop(columns=['D'])
print("\nDataFrame with Column 'D' Dropped:")
print(df_dropped_col)

# 17. Value Counts


value_counts = df['B'].value_counts()
print("\nValue Counts in Column 'B':")
print(value_counts)

# 18. Reshaping Data


pivot_table = pd.pivot_table(df, values='D', index='B', columns='C', aggfunc='mean')
print("\nPivot Table (Reshaped DataFrame):")
print(pivot_table)

melted_df = pd.melt(df, id_vars=['B'], value_vars=['A', 'D'])


print("\nMelted DataFrame:")
print(melted_df)

# 19. Querying Data


query_result = df.query('A > 2')
print("\nQuery Result (A > 2):")
print(query_result)

Output:

Reading from Text File, CSV File, Excel File and Web File:
example1 = "/content/Data Analytic Lab.txt"
with open(example1, "r") as file:
    FileContent = file.read()
print(FileContent)

Output:

Code:

import pandas as pd
df = pd.read_csv("/content/mxmh_survey_results.csv")
print(df.head())
df.info()

Output:

Code:

#EXCEL
df1 = pd.read_excel("/content/DAEX1.xlsx")
df1.head()

Output:

Code:

#WEB FILE
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/blood-transfusion/transfusion.data'
df2 = pd.read_csv(url)
df2.head()

Output:

Code:

import pandas as pd
file_path = '/content/mxmh_survey_results.csv'
df = pd.read_csv(file_path)
df.head()

Output:

Code:

df.describe()
Output:

Code:

print("\nDescriptive Statistics (Categorical):")


print(df.describe(include=[object]))

Output:

Code:

# Check for missing values in the dataset


missing_values = df.isnull().sum()

# Columns with missing values
missing_values[missing_values > 0]
Output:

Code:

# Age Distribution
import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(10, 6))
sns.histplot(df['Age'].dropna(), kde=True, bins=20)
plt.title('Age Distribution')
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.show()

# Distribution of Primary Streaming Service


plt.figure(figsize=(12, 6))
sns.countplot(data=df, x='Primary streaming service',
              order=df['Primary streaming service'].value_counts().index)
plt.title('Primary Streaming Service Distribution')
plt.xlabel('Streaming Service')
plt.ylabel('Count')
plt.xticks(rotation=45)
plt.show()

Output:

Code:

# Frequency of Listening to Music Genres


freq_cols = [col for col in df.columns if 'Frequency' in col]
df_melted = df.melt(id_vars=[], value_vars=freq_cols, var_name='Genre', value_name='Frequency')
plt.figure(figsize=(14, 8))
sns.countplot(data=df_melted, x='Genre', hue='Frequency', order=freq_cols)
plt.title('Frequency of Listening to Various Music Genres')
plt.xticks(rotation=45)
plt.ylabel('Count')
plt.xlabel('Music Genre')
plt.show()

Output:

Code:
import seaborn as sns
import matplotlib.pyplot as plt
numeric_df = df.select_dtypes(include=['float64', 'int64'])
correlation_matrix = numeric_df.corr()
plt.figure(figsize=(12, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', center=0)
plt.show()

Output:

Result:

Successfully explored data analysis packages and applied statistical operations on the chosen dataset,
calculating descriptive measures such as mean, median, and standard deviation. Identified data insights
through interpretation of variance, skewness, and kurtosis.
