Data Science in Society Cat
(CDA).
Exploratory Data Analysis (EDA) is the initial step of analyzing data to uncover patterns,
anomalies, or insights, without a predetermined hypothesis. It helps to summarize the
main characteristics of data, often using visual methods, and guides further data analysis.
EDA is open-ended, enabling analysts to make observations about data distribution,
relationships, and outliers.
Confirmatory Data Analysis (CDA), on the other hand, is hypothesis-driven and used to
confirm or refute specific hypotheses using statistical tests and models. CDA is focused on
assessing if observed data patterns align with pre-defined theories or assumptions and
involves statistical testing to validate findings.
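The contrast can be illustrated with a minimal sketch. The measurements below are made up for illustration, and the choice of a two-sample t-test (via `scipy.stats.ttest_ind`) is an assumption; any suitable statistical test could play the confirmatory role.

```python
import pandas as pd
from scipy import stats

# Made-up measurements for two groups (illustrative only).
df = pd.DataFrame({
    "group": ["a"] * 5 + ["b"] * 5,
    "value": [5.1, 4.9, 5.3, 5.0, 5.2, 5.8, 6.0, 5.9, 6.1, 5.7],
})

# EDA: summarize each group with no hypothesis in mind.
summary = df.groupby("group")["value"].describe()
print(summary)

# CDA: test the specific hypothesis that the two group means are equal.
group_a = df.loc[df["group"] == "a", "value"]
group_b = df.loc[df["group"] == "b", "value"]
result = stats.ttest_ind(group_a, group_b)
print(f"t = {result.statistic:.2f}, p = {result.pvalue:.5f}")
```

The summary table is the exploratory step: it suggests that group b tends to be higher. The t-test is the confirmatory step: it checks whether that difference is statistically significant.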
i. Pandas: A data manipulation and analysis library that provides data structures like
DataFrames and Series for handling and analyzing structured data efficiently.
ii. NumPy: A library for numerical computing that supports multi-dimensional arrays and
mathematical functions to operate on them, widely used for scientific computing and data
analysis.
iii. Matplotlib: A plotting library for creating static, interactive, and animated
visualizations in Python. It is used to visualize data so that results are easier to
interpret.
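A short sketch of how the three libraries typically work together is given here. The values are arbitrary, and the non-interactive `Agg` backend is an assumption made so the script runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend (assumption: no display available)
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# NumPy: create an array and apply a vectorized operation to it.
x = np.arange(5)
y = x ** 2  # element-wise square

# Pandas: hold the arrays in a labelled DataFrame and summarize them.
df = pd.DataFrame({"x": x, "y": y})
print(df.describe())

# Matplotlib: plot one column against the other.
fig, ax = plt.subplots()
ax.plot(df["x"], df["y"], marker="o")
ax.set_xlabel("x")
ax.set_ylabel("y")
```

To view the figure, call `fig.savefig(...)` with a file name, or switch to an interactive backend and call `plt.show()`.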
import numpy as np
# height and weight are assumed to be existing Python lists
height_array = np.array(height)
weight_array = np.array(weight)
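A self-contained version of the conversion is sketched here. The sample values and the BMI calculation are assumptions added for illustration, since the original lists are not shown.

```python
import numpy as np

# Hypothetical sample data; the original lists are not shown.
height = [1.73, 1.68, 1.71, 1.89, 1.79]  # metres
weight = [65.4, 59.2, 63.6, 88.4, 68.7]  # kilograms

height_array = np.array(height)
weight_array = np.array(weight)

# Element-wise arithmetic on whole arrays: no explicit loop is needed.
bmi = weight_array / height_array ** 2
print(bmi.round(2))
```

The same expression on plain Python lists would raise a `TypeError`, which is the main reason for converting to NumPy arrays first.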
d) Using Pandas:
import pandas as pd
# data (a list) and data_dict (a dictionary) are assumed to be defined beforehand
df = pd.DataFrame(data)
print(df)
df_from_dict = pd.DataFrame(data_dict)
print(df_from_dict)
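A runnable sketch of both constructions follows. The names and values in `data` and `data_dict` are hypothetical, chosen only to show the shape each input is expected to have.

```python
import pandas as pd

# Hypothetical inputs; names and values are for illustration only.
data = [["Alice", 25], ["Bob", 30], ["Carol", 35]]                    # list of rows
data_dict = {"name": ["Alice", "Bob", "Carol"], "age": [25, 30, 35]}  # column -> values

# From a list: column names must be supplied, otherwise they default to 0, 1, ...
df = pd.DataFrame(data, columns=["name", "age"])
print(df)

# From a dictionary: the keys become the column names.
df_from_dict = pd.DataFrame(data_dict)
print(df_from_dict)
```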
iii. Perform appending, slicing, addition, and deletion of rows with a Pandas DataFrame.
# Slicing rows (positions 1 and 2; the end position is excluded)
print(df.iloc[1:3])
# Deleting the row with index label 3
df = df.drop(3)
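All four operations can be sketched together as follows. The sample DataFrame is hypothetical. Appending uses `pd.concat`, because `DataFrame.append` was removed in pandas 2.0.

```python
import pandas as pd

# Hypothetical starting data.
df = pd.DataFrame({"name": ["Alice", "Bob", "Carol", "Dan"],
                   "age": [25, 30, 35, 40]})

# Appending: concatenate another DataFrame and renumber the index.
new_rows = pd.DataFrame({"name": ["Eve"], "age": [45]})
df = pd.concat([df, new_rows], ignore_index=True)

# Addition: assign a single row at a new index label.
df.loc[len(df)] = ["Frank", 50]

# Slicing: rows at positions 1 and 2 (the end position is excluded).
print(df.iloc[1:3])

# Deletion: drop the row whose index label is 3.
df = df.drop(3)
print(df)
```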
e) Using Pandas:
i. Create a DataFrame with a list of dictionaries, row indices, and column indices.
print(df)
ii. Use index label to delete or drop rows from a Pandas DataFrame.
df = df.drop('row1')
print(df)
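A complete sketch of parts i and ii is given here. The records and the labels `row1` to `row3` are hypothetical, chosen to match the label used in the `drop` call.

```python
import pandas as pd

# Hypothetical records; each dictionary becomes one row.
records = [
    {"name": "Alice", "age": 25},
    {"name": "Bob", "age": 30},
    {"name": "Carol", "age": 35},
]

# index sets the row labels; columns sets the column order.
df = pd.DataFrame(records, index=["row1", "row2", "row3"], columns=["name", "age"])
print(df)

# Drop by index label; drop() returns a new DataFrame, so reassign it.
df = df.drop("row1")
print(df)
```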
Explain.
Data cleaning and preparation take a significant portion of time in data analysis because
raw data is often incomplete, inconsistent, and noisy. This process involves handling
missing values, correcting inconsistencies, converting data types, and removing duplicates
so that the data is ready for analysis. Proper data preparation is crucial for accurate and
meaningful analysis, because errors left in the data carry through as bias and error in
subsequent analysis and model training.
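A minimal sketch of typical cleaning steps follows. The raw table is made up, and the choice of median and mean imputation is an assumption; the appropriate strategy depends on the data.

```python
import numpy as np
import pandas as pd

# Made-up raw data with stray whitespace, mixed case, text-typed numbers,
# missing values, and a hidden duplicate row.
raw = pd.DataFrame({
    "city": [" Nairobi", "nairobi", "Mombasa", "Kisumu ", "mombasa "],
    "age": ["34", "29", None, "41", None],
    "income": [52000.0, np.nan, 61000.0, 48000.0, 61000.0],
})

clean = raw.copy()

# Correct inconsistencies: trim whitespace and normalize capitalization.
clean["city"] = clean["city"].str.strip().str.title()

# Transform data types: text to numbers (unparseable values become NaN).
clean["age"] = pd.to_numeric(clean["age"], errors="coerce")

# Remove rows that are exact duplicates once the text is normalized.
clean = clean.drop_duplicates()

# Handle missing values: impute with the column median or mean.
clean["age"] = clean["age"].fillna(clean["age"].median())
clean["income"] = clean["income"].fillna(clean["income"].mean())

print(clean)
```

Even for five rows, four separate steps are needed before the table is usable, which illustrates why preparation dominates the time spent on real datasets.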
g) Distinguish between data visualization and data formatting in big data analytics.
Data visualization is the graphical representation of data to help understand patterns,
trends, and insights. It involves creating charts, graphs, and dashboards for clearer data
interpretation.
Data formatting, however, refers to structuring or transforming data into a specific format
or layout to make it suitable for analysis. This may involve data conversion, reorganization,
or cleanup to ensure compatibility with analytical tools or visualization frameworks.
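The two activities can be seen side by side in a small sketch. The clinic visit counts are made up, and the `Agg` backend is an assumption made so the script runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend (assumption: no display available)
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical wide-format table: one column per clinic, dates stored as text.
wide = pd.DataFrame({
    "month": ["2024-01", "2024-02", "2024-03"],
    "clinic_a": [120, 135, 150],
    "clinic_b": [80, 95, 90],
})

# Data formatting: parse the dates and reshape from wide to long layout.
wide["month"] = pd.to_datetime(wide["month"], format="%Y-%m")
long = wide.melt(id_vars="month", var_name="clinic", value_name="visits")
print(long)

# Data visualization: draw one line per clinic from the formatted data.
fig, ax = plt.subplots()
for clinic, group in long.groupby("clinic"):
    ax.plot(group["month"], group["visits"], marker="o", label=clinic)
ax.set_xlabel("month")
ax.set_ylabel("visits")
ax.legend()
```

Formatting changes how the data is stored and produces no picture; visualization changes nothing in the data and produces only a picture.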
h) Discuss how data science can be used to improve health services in Kenya.
Data science can enhance healthcare in Kenya by analyzing health data to predict
outbreaks, optimize resources, and improve patient care. Predictive analytics can help in
identifying disease patterns, which aids in early intervention. Data science also enables
efficient hospital resource allocation, monitoring public health trends, and improving
patient diagnostics, leading to a better, data-informed healthcare system.
Problem Definition: Clearly outline the question or problem to address and define project
objectives.
Data Collection: Gather relevant data from multiple sources, such as databases, APIs, or
public datasets.
Data Cleaning and Preparation: Clean and preprocess data, handling missing values,
outliers, and inconsistencies.
Data Exploration and Analysis: Perform EDA to understand data characteristics and
generate insights.
Modeling: Develop and train models on the data to find patterns, predictions, or solutions
to the problem.
Evaluation and Deployment: Assess model accuracy, refine as needed, and deploy the
solution into a real-world environment for end-users or stakeholders.
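The stages above can be sketched end to end on a toy problem. The dataset is made up, and the straight-line model fitted with `np.polyfit` is an assumption standing in for whatever model a real project would use. For brevity the error is measured on the training data; a real project would evaluate on held-out data.

```python
import numpy as np
import pandas as pd

# 1. Problem definition: predict y from x (hypothetical toy problem).

# 2. Data collection: a small made-up dataset stands in for a real source.
raw = pd.DataFrame({
    "x": [1, 2, 3, 4, 5, 6],
    "y": [2.1, 3.9, None, 8.2, 9.8, 12.1],
})

# 3. Data cleaning and preparation: drop the row with a missing value.
data = raw.dropna()

# 4. Data exploration and analysis: summary statistics and correlation.
print(data.describe())
print(data.corr())

# 5. Modeling: fit y = slope * x + intercept by least squares.
slope, intercept = np.polyfit(data["x"], data["y"], deg=1)

# 6. Evaluation: root-mean-square error of the fitted line.
predicted = slope * data["x"] + intercept
rmse = float(np.sqrt(np.mean((data["y"] - predicted) ** 2)))
print(f"slope = {slope:.3f}, intercept = {intercept:.3f}, rmse = {rmse:.3f}")
```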