Data Science in Society Cat

Exploratory Data Analysis (EDA) is an open-ended approach used to discover patterns and insights in data without a predetermined hypothesis, while Confirmatory Data Analysis (CDA) is hypothesis-driven, focusing on validating specific theories through statistical tests. Key Python packages for data science include Pandas for data manipulation, NumPy for numerical computing, Matplotlib for data visualization, and TensorFlow for machine learning. Data science can significantly improve health services in Kenya by analyzing health data for predictive insights, optimizing resources, and enhancing patient care.

Uploaded by genesiskalya

a) Differentiate between Exploratory Data Analysis (EDA) and Confirmatory Data Analysis (CDA).

Exploratory Data Analysis (EDA) is the initial step of analyzing data to uncover patterns,
anomalies, or insights, without a predetermined hypothesis. It helps to summarize the
main characteristics of data, often using visual methods, and guides further data analysis.
EDA is open-ended, enabling analysts to make observations about data distribution,
relationships, and outliers.

Confirmatory Data Analysis (CDA), on the other hand, is hypothesis-driven and used to
confirm or refute specific hypotheses using statistical tests and models. CDA is focused on
assessing if observed data patterns align with pre-defined theories or assumptions and
involves statistical testing to validate findings.
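
The contrast can be sketched in a few lines. This is a minimal illustration, not a prescribed method: the two score arrays are made-up values, and the permutation test is just one of several statistical tests that could play the confirmatory role.

```python
import numpy as np

# Hypothetical exam scores for two study groups (made-up values for illustration).
group_a = np.array([62, 70, 68, 75, 80, 66, 73, 71])
group_b = np.array([58, 64, 61, 69, 72, 60, 65, 63])

# EDA: summarize the data with no hypothesis in mind.
for name, scores in (("A", group_a), ("B", group_b)):
    print(name, "mean:", scores.mean(), "median:", np.median(scores),
          "min:", scores.min(), "max:", scores.max())

# CDA: test a hypothesis stated in advance.
# H0: both groups have the same mean score.
observed_diff = group_a.mean() - group_b.mean()

# Permutation test: shuffle the group labels many times and count how often
# a difference at least as large as the observed one arises by chance.
rng = np.random.default_rng(seed=0)
pooled = np.concatenate([group_a, group_b])
n_permutations = 10_000
count = 0
for _ in range(n_permutations):
    shuffled = rng.permutation(pooled)
    diff = shuffled[:len(group_a)].mean() - shuffled[len(group_a):].mean()
    if abs(diff) >= abs(observed_diff):
        count += 1

p_value = count / n_permutations
print("observed difference:", observed_diff, "p-value:", p_value)
```

The first half only describes what is in the data; the second half commits to a null hypothesis and reports how surprising the observed difference would be if that hypothesis were true.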

b) Explain the following Python packages used in data science in society.

i. Pandas: A data manipulation and analysis library that provides data structures like
DataFrames and Series for handling and analyzing structured data efficiently.

ii. NumPy: A library for numerical computing that supports multi-dimensional arrays and
mathematical functions to operate on them, widely used for scientific computing and data
analysis.

iii. Matplotlib: A plotting library for creating static, interactive, and
animated visualizations in Python. It is used to chart data so that results are easier to
interpret and communicate.

iv. TensorFlow: An open-source machine learning framework developed by Google,
primarily used for deep learning tasks. It helps build, train, and deploy machine learning
models, especially neural networks.

c) Assuming height and weight are already defined as lists in Python, write a code that
imports numpy as np and stores both the height and weight of your classmates as numpy
arrays.

import numpy as np

# Assuming height and weight are predefined lists
height_array = np.array(height)
weight_array = np.array(weight)
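
One reason to convert the lists is that NumPy arrays support element-wise arithmetic, which plain Python lists do not. The sketch below assumes heights in metres and weights in kilograms; the sample values and the BMI threshold of 23 are made up for illustration.

```python
import numpy as np

# Hypothetical measurements: heights in metres, weights in kilograms.
height = [1.60, 1.75, 1.82, 1.68]
weight = [55.0, 70.0, 82.0, 60.0]

height_array = np.array(height)
weight_array = np.array(weight)

# Element-wise arithmetic: one BMI value per classmate, no explicit loop.
bmi = weight_array / height_array ** 2
print(bmi.round(1))

# Boolean indexing: heights of classmates whose BMI is above 23 (arbitrary threshold).
print(height_array[bmi > 23])
```

Trying `weight / height ** 2` on the original lists raises a `TypeError`, which is why the conversion to arrays comes first.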

d) Using Pandas:

i. Create a simple Pandas DataFrame and print its values.

import pandas as pd

# Creating a simple DataFrame
data = {'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [24, 27, 22]}
df = pd.DataFrame(data)
print(df)

ii. Create your own DataFrame from a dictionary of arrays/lists.

# Creating a DataFrame from a dictionary of lists (requires: import pandas as pd)
data_dict = {'Student': ['John', 'Mary', 'Emma'], 'Score': [88, 92, 95]}
df_from_dict = pd.DataFrame(data_dict)
print(df_from_dict)

iii. Perform appending, slicing, addition, and deletion of rows with a Pandas DataFrame.

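# Adding a new row under index label 3
df.loc[3] = ['David', 25]

# Appending another DataFrame
# (DataFrame.append was removed in pandas 2.0, so pd.concat is used instead)
new_data = pd.DataFrame({'Name': ['Eve'], 'Age': [30]})
df = pd.concat([df, new_data], ignore_index=True)

# Slicing rows at positions 1 and 2
print(df.iloc[1:3])

# Deleting the row with index label 3
df = df.drop(3)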

e) Using Pandas:

i. Create a DataFrame with a list of dictionaries, row indices, and column indices.

# A list of dictionaries: each dictionary becomes one row
data = [{'Name': 'Alice', 'Age': 24}, {'Name': 'Bob', 'Age': 27}]
df = pd.DataFrame(data, index=['row1', 'row2'], columns=['Name', 'Age'])
print(df)

ii. Use index label to delete or drop rows from a Pandas DataFrame.

# Dropping a row by index label
df = df.drop('row1')
print(df)

f) Nearly 80% of data analysis is spent on cleaning and preparing data. Explain.

Data cleaning and preparation take up a large share of the time in data analysis because
raw data is often incomplete, inconsistent, and noisy. The process involves handling
missing values, correcting inconsistencies, converting data types, and removing duplicates
so that the data is ready for analysis. Proper data preparation is crucial for accurate and
meaningful results, because well-cleaned data reduces bias and errors in subsequent analysis
and model training.
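
The cleaning steps named above can be sketched with Pandas as follows. The records, column names, and the choice of median and "Unknown" as fill values are all assumptions made for illustration; real projects pick these rules per dataset.

```python
import pandas as pd

# Hypothetical raw records with typical problems: stray spaces, inconsistent
# capitalization, numbers stored as text, a duplicate row, and missing values.
raw = pd.DataFrame({
    "name": [" alice", "Bob ", "Bob ", "charlie", "Dana"],
    "age": ["24", "27", "27", None, "abc"],
    "city": ["Nairobi", "nairobi", "nairobi", "Kisumu", None],
})

clean = raw.copy()

# Correct inconsistencies in text columns.
clean["name"] = clean["name"].str.strip().str.title()
clean["city"] = clean["city"].str.strip().str.title()

# Convert data types: unparseable ages such as "abc" become NaN.
clean["age"] = pd.to_numeric(clean["age"], errors="coerce")

# Remove exact duplicate rows.
clean = clean.drop_duplicates().reset_index(drop=True)

# Handle missing values.
clean["age"] = clean["age"].fillna(clean["age"].median())
clean["city"] = clean["city"].fillna("Unknown")

print(clean)
```

Even this five-row table needs four separate passes before it is usable, which gives a sense of why preparation dominates the schedule on larger, messier datasets.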

g) Distinguish between data visualization and data formatting in big data analytics.

Data visualization is the graphical representation of data to help understand patterns,
trends, and insights. It involves creating charts, graphs, and dashboards for clearer data
interpretation.

Data formatting, however, refers to structuring or transforming data into a specific format
or layout to make it suitable for analysis. This may involve data conversion, reorganization,
or cleanup to ensure compatibility with analytical tools or visualization frameworks.
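
A short sketch can make the two steps concrete. The sales table, its column names, and the output file name are hypothetical; the example assumes Pandas and Matplotlib are installed and uses a non-interactive backend so it runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is required
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical monthly sales, stored as text in a wide layout.
wide = pd.DataFrame({
    "region": ["North", "South"],
    "2024-01": ["1,200", "950"],
    "2024-02": ["1,350", "1,010"],
})

# Data formatting: reshape to one row per (region, month) and fix the types.
tidy = wide.melt(id_vars="region", var_name="month", value_name="sales")
tidy["month"] = pd.to_datetime(tidy["month"], format="%Y-%m")
tidy["sales"] = tidy["sales"].str.replace(",", "").astype(int)

# Data visualization: draw one line per region from the formatted data.
fig, ax = plt.subplots()
for region, part in tidy.groupby("region"):
    ax.plot(part["month"], part["sales"], marker="o", label=region)
ax.set_xlabel("Month")
ax.set_ylabel("Sales")
ax.legend()
fig.savefig("sales_by_region.png")
```

The formatting step changes how the data is stored without adding any insight; the plotting step changes nothing in the data and only affects how it is read.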

h) Discuss how data science can be used to improve health services in Kenya.

Data science can enhance healthcare in Kenya by analyzing health data to predict
outbreaks, optimize resources, and improve patient care. Predictive analytics can help in
identifying disease patterns, which aids in early intervention. Data science also enables
efficient hospital resource allocation, monitoring public health trends, and improving
patient diagnostics, leading to a better, data-informed healthcare system.
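
As a rough illustration of early warning from health data, the sketch below flags weeks whose case count rises well above the recent baseline. The case counts, the four-week window, and the two-standard-deviation threshold are all assumptions chosen for the example, not figures from any Kenyan dataset or an established surveillance standard.

```python
import numpy as np

# Hypothetical weekly case counts for one clinic (made-up numbers).
weekly_cases = np.array([12, 15, 11, 14, 13, 16, 12, 30, 45, 41])

window = 4        # number of previous weeks used as the baseline
threshold = 2.0   # flag a week above baseline mean + 2 standard deviations

alerts = []
for week in range(window, len(weekly_cases)):
    baseline = weekly_cases[week - window:week]
    limit = baseline.mean() + threshold * baseline.std()
    if weekly_cases[week] > limit:
        alerts.append(week)

print("weeks flagged:", alerts)  # → weeks flagged: [7, 8]
```

Week 9 is not flagged even though its count is high, because the baseline has already absorbed the surge by then. A production system would need a more careful baseline and validation against real outbreak records.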

i) Discuss the data analytics project phases.

Problem Definition: Clearly outline the question or problem to address and define project
objectives.

Data Collection: Gather relevant data from multiple sources, such as databases, APIs, or
public datasets.

Data Cleaning and Preparation: Clean and preprocess data, handling missing values,
outliers, and inconsistencies.

Data Exploration and Analysis: Perform EDA to understand data characteristics and
generate insights.

Modeling: Develop and train models on the data to find patterns, predictions, or solutions
to the problem.

Evaluation and Deployment: Assess model accuracy, refine as needed, and deploy the
solution into a real-world environment for end-users or stakeholders.
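
The phases above can be sketched end to end on a toy problem. Everything here is assumed for illustration: the question (predicting weight from height), the synthetic data that stands in for a real source, and the straight-line model. Deployment is omitted because it depends on the target environment.

```python
import numpy as np

# 1. Problem definition (hypothetical): predict weight (kg) from height (cm).

# 2. Data collection: synthetic data stands in for a database or API.
rng = np.random.default_rng(seed=42)
height = rng.uniform(150, 190, size=50)
weight = 0.9 * height - 90 + rng.normal(0, 3, size=50)
height[5] = np.nan  # simulate a missing measurement

# 3. Data cleaning and preparation: drop records with a missing height.
mask = ~np.isnan(height)
height, weight = height[mask], weight[mask]

# 4. Data exploration: check how strongly the two variables are related.
print("correlation:", np.corrcoef(height, weight)[0, 1])

# 5. Modeling: fit a straight line on the first 40 records.
train_h, test_h = height[:40], height[40:]
train_w, test_w = weight[:40], weight[40:]
slope, intercept = np.polyfit(train_h, train_w, deg=1)

# 6. Evaluation: measure the error on the held-out records.
pred = slope * test_h + intercept
rmse = np.sqrt(np.mean((pred - test_w) ** 2))
print("slope:", slope, "intercept:", intercept, "RMSE:", rmse)
```

Holding out the last records for evaluation mirrors the final phase: the model is judged on data it did not see during training before anyone relies on it.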
