Module 2 – Data Collection and Preparation

Module 2 focuses on the critical aspects of data collection and preparation, emphasizing the importance of accurate data gathering, ethical considerations, and quality assessment. It covers practical skills such as importing data into Excel, cleaning and preprocessing datasets, and handling missing data and outliers. The module aims to equip learners with the necessary tools and techniques to ensure data integrity and reliability for analysis.


Module 2 – Data Collection and Preparation

Objectives
1. Explain the importance of accurate data collection and preparation.
2. Assess data quality using key indicators.
3. Identify and address ethical issues in data collection.
4. Import data into Excel from various file types and external sources.
5. Clean and preprocess raw datasets using Excel tools.
6. Handle missing data and detect or treat outliers in a dataset.
Module 2 – Data Collection and Preparation

Topics
1. Importance of Data Collection and Preparation
2. Basic Data Quality Assessment
3. Ethical Considerations in Data Collection
4. Importing Data from Various Sources
5. Data Cleaning and Preprocessing in Excel
6. Handling Missing Data and Outliers
   6.1 Types of Missing Data
   6.2 Techniques for Handling Missing Data
   6.3 Identifying Outliers
Data Collection

Data collection refers to the process of gathering and accumulating information or data from various sources. This data can come from a wide range of places, including surveys, sensors, databases, websites, social media, and more.

Source: https://www.globalpatron.com/blog/data-collection-methods/

The main objectives of data collection are:

➢ Acquiring Relevant Information: Collecting data that is pertinent to the problem or question at hand.
➢ Ensuring Data Accuracy: Ensuring that the data is accurate, reliable, and free from errors or bias.
➢ Maintaining Data Consistency: Keeping the data consistent in terms of format, units, and structure.
➢ Preserving Data Integrity: Preventing unauthorized access, loss, or corruption of data.
Data Preparation

Data preparation, also known as data preprocessing, is the process of cleaning, transforming, and structuring raw data into a format that is suitable for analysis. This step is essential because real-world data is often messy, inconsistent, and may contain missing values.

Source: https://www.linkedin.com/pulse/data-collection-preprocessing-dr-john-martin-5kj3f

The key tasks in data preparation include (see the sketch after this list):

Data Cleaning: Identifying and correcting errors, inconsistencies, and outliers in the data.
Data Transformation: Converting data into a more suitable format or scale. This may involve normalization, standardization, or encoding categorical variables.
Handling Missing Data: Dealing with missing values by imputing them or removing incomplete records.
Feature Engineering: Creating new features or variables that might be more informative for analysis.
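The four tasks above map onto a few pandas operations. Here is a minimal Python sketch, assuming a small hypothetical dataset with price, city, and signup_date columns (pandas is used for illustration; it is not part of the module's Excel toolkit):

```python
# A minimal pandas sketch of the four preparation tasks above.
# The column names (price, city, signup_date) are hypothetical.
import pandas as pd

df = pd.DataFrame({
    "price": [120.0, None, 95.5, 4000.0],            # has a missing value
    "city": ["Manila", "manila ", "Cebu", "Cebu"],   # inconsistent casing/spacing
    "signup_date": ["01/15/2024", "02/03/2024", "03/20/2024", "04/11/2024"],
})

# Data cleaning: fix inconsistent text values.
df["city"] = df["city"].str.strip().str.title()

# Handling missing data: impute the missing price with the column median.
df["price"] = df["price"].fillna(df["price"].median())

# Data transformation: min-max normalize price to the [0, 1] range.
df["price_norm"] = (df["price"] - df["price"].min()) / (df["price"].max() - df["price"].min())

# Feature engineering: derive a signup month from the raw date string.
df["signup_month"] = pd.to_datetime(df["signup_date"], format="%m/%d/%Y").dt.month

print(df)
```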
Importance of Data Collection

 The foundation of any analysis depends on the quality of data collected.
 Inaccurate or poorly prepared data can lead to biased, misleading results.
 Good data preparation enhances model performance and analytical accuracy.

Examples:
 A survey with missing responses vs. a well-completed dataset.
 Raw sales data with inconsistent formats (e.g., date formats, product names).
Basic Data Quality Assessment

• Common quality dimensions: accuracy, completeness, consistency, timeliness, conformity, validity, integrity, relevance, uniqueness.
• Supporting practices for assessing quality: documentation, data profiling, data visualization, data sampling, data quality metrics, data validation rules, user feedback.
• Indicators of poor data quality (e.g., duplicates, missing fields, inconsistent units).

Examples (checked in the sketch below):
Multiple records for the same customer with slight name variations.
Sales records with mismatched units (e.g., PHP vs. USD).
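These indicators can also be checked programmatically. A minimal pandas sketch, assuming a hypothetical sales.csv with customer, amount, and currency columns:

```python
# A minimal sketch of checking the poor-quality indicators above with pandas.
# The sales.csv file and its columns (customer, amount, currency) are hypothetical.
import pandas as pd

df = pd.read_csv("sales.csv")

# Duplicates: multiple records for the same customer.
dup_count = df.duplicated(subset=["customer"]).sum()

# Missing fields: count of blank cells per column.
missing_per_column = df.isna().sum()

# Inconsistent units: more than one currency suggests mismatched units (e.g., PHP vs. USD).
currencies = df["currency"].unique()

print(f"Duplicate customer rows: {dup_count}")
print("Missing values per column:\n", missing_per_column)
print("Currencies present:", currencies)
```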
ETHICAL CONSIDERATIONS IN DATA COLLECTION

Ethical considerations in data collection are essential to ensure that data is collected, used, and managed in a responsible and socially acceptable manner. Ethical data collection practices help protect individuals' privacy, prevent discrimination, and maintain trust in data-driven processes.
ETHICAL CONSIDERATIONS IN DATA COLLECTION

Here are some key ethical considerations in data collection:

Informed Consent: Obtain informed consent from individuals before collecting their data. Clearly explain the purpose of data collection, how the data will be used, and any potential risks involved. Individuals should have the option to opt in or opt out.

Privacy and Anonymity: Protect individuals' privacy by anonymizing or de-identifying data whenever possible. Remove or encrypt personally identifiable information (PII) to prevent the identification of individuals from the data.

Data Minimization: Collect only the data that is necessary for the intended purpose. Avoid collecting excessive or irrelevant information that could infringe on individuals' privacy.

Transparency: Be transparent about data collection practices and data handling procedures. Clearly communicate how data will be stored, processed, and shared.

Data Security: Implement robust data security measures to safeguard collected data from unauthorized access, breaches, or theft. Encryption, access controls, and regular security audits are important components of data security.

Bias and Fairness: Be vigilant about bias in data collection. Biased data can lead to biased results and discriminatory outcomes. Ensure that data collection methods and sources are free from biases that could disproportionately affect certain groups.

Data Ownership and Control: Clearly define data ownership and control. Individuals should have the right to access their data, correct inaccuracies, and request the deletion of their data when applicable.
ETHICAL CONSIDERATIONS IN DATA COLLECTION

Here are some key ethical considerations in data collection (continued):

Sensitive Data: Handle sensitive data (e.g., health records, financial information) with extra care. Follow industry-specific regulations and best practices for collecting and storing sensitive data.

Data Use: Use collected data only for the purposes explicitly stated during data collection. Avoid using data for purposes that individuals did not consent to or could not reasonably anticipate.

Third-Party Data Sources: If using data from third-party sources, ensure that the data was collected ethically and in compliance with relevant laws and regulations. Verify that the third party has obtained informed consent and adhered to privacy standards.

Children's Data: Special considerations apply when collecting data from children. Comply with laws like the Children's Online Privacy Protection Act (COPPA) and obtain parental consent when necessary.

Data Retention: Establish clear policies for data retention and deletion. Do not retain data longer than necessary for the intended purpose.

Ethical Review: In some cases, particularly in research involving human subjects, ethical review boards may be required to assess the ethics of data collection methods.

Bias Mitigation: When working with machine learning and AI algorithms, be aware of the potential for algorithmic bias. Regularly audit and test algorithms for fairness and bias, and take steps to mitigate any identified biases.

Compliance with Regulations: Ensure that data collection practices comply with relevant local, national, and international laws and regulations, such as the General Data Protection Regulation (GDPR) in Europe, the Health Insurance Portability and Accountability Act (HIPAA) in the United States, and the Data Privacy Act of 2012 (Republic Act 10173) of the Philippines.
IMPORTING DATA FROM VARIOUS SOURCES

Importing data from various sources is a common task in data analysis, data science, and database management. The process involves retrieving data from different origins, such as databases, files, APIs, and web sources, and making it available for analysis or storage.

 File types: .xlsx, .csv, .txt, .xml, .json.
 External data sources: databases, websites, APIs.
 Excel features: Get & Transform Data, From Text/CSV, From Web.

Examples (see the sketch below):
o Importing a .csv file with sales data.
o Connecting Excel to a web-based data source.

[Figure: Excel's Get Data menu]
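For comparison, Excel's From Text/CSV and From Web imports have close Python analogues. A minimal sketch; the file name and URL are hypothetical placeholders:

```python
# Analogue of Excel's From Text/CSV and From Web imports, using pandas.
# The sales.csv file and the URL are hypothetical placeholders.
import pandas as pd

# Importing a .csv file with sales data (analogue of Data > From Text/CSV).
sales = pd.read_csv("sales.csv")

# Connecting to a web-based data source (analogue of Data > From Web);
# read_html returns a list of all tables found on the page.
tables = pd.read_html("https://example.com/prices.html")

print(sales.head())
print(f"Tables found on page: {len(tables)}")
```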
IMPORTING DATA FROM VARIOUS SOURCES

Here are steps and considerations for importing data from various sources:
1. Identify Data Sources: Determine the sources from which you need to collect data. These sources could include databases (SQL,
NoSQL), spreadsheets, CSV files, JSON files, web APIs, websites, sensor data, and more.

2. Access Permissions: Ensure that you have the necessary permissions and access rights to retrieve data from the selected sources. In
some cases, you may need credentials or API keys.

3. Data Retrieval Methods: Different sources require different methods for data retrieval:
➢ Databases: Use SQL queries for relational databases (e.g., MySQL, PostgreSQL), or appropriate drivers and libraries for NoSQL
databases (e.g., MongoDB).
➢ Files: Use file I/O operations to read data from formats like CSV, Excel, JSON, XML, etc.
➢ APIs: Interact with APIs using HTTP requests (GET, POST, etc.) and libraries like requests in Python (see the sketch after this list).
➢ Web Scraping: Extract data from websites using web scraping libraries such as BeautifulSoup or Scrapy (respecting website terms
of service and robots.txt).
➢ Sensor Data: Use hardware interfaces or communication protocols to collect data from sensors or IoT devices.
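The API retrieval method above can be illustrated with the requests library mentioned in the list. A minimal sketch; the endpoint URL, API key, and query parameters are hypothetical:

```python
# A minimal sketch of API data retrieval with the requests library.
# The endpoint URL, API key, and parameters are hypothetical placeholders.
import requests

url = "https://api.example.com/v1/sales"
headers = {"Authorization": "Bearer YOUR_API_KEY"}  # access permissions (step 2)

response = requests.get(url, headers=headers, params={"year": 2024}, timeout=30)
response.raise_for_status()  # fail loudly on HTTP errors

records = response.json()    # parse the JSON payload into Python objects
print(f"Retrieved {len(records)} records")
```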
IMPORTING DATA FROM VARIOUS SOURCES

4. Data Extraction: Retrieve the data from the source using appropriate methods and techniques. This may involve executing SQL queries, reading files line by line, or making API requests.

5. Data Transformation: Once you have the raw data, perform necessary transformations to clean and structure it. This may include data cleaning, data type conversion, and handling missing values.

6. Data Integration: If you're collecting data from multiple sources, integrate it into a single dataset or database if needed. This may involve merging tables, joining datasets, or creating relationships between data entities.

7. Data Validation: Verify the quality and integrity of the imported data. Check for errors, inconsistencies, or missing data. Implement validation checks to ensure the data adheres to predefined rules and standards.

8. Automate the Process: Consider automating the data import process, especially if you need to retrieve data regularly or in real time. Automation can save time and reduce the risk of human errors.

9. Error Handling: Implement error-handling mechanisms to gracefully deal with issues that may arise during data import, such as connection failures or unexpected data formats.

10. Logging and Monitoring: Set up logging and monitoring to keep track of the data import process. This helps in identifying and resolving issues quickly. (Steps 6-10 are combined in the sketch below.)

11. Data Security: Ensure that sensitive data is handled securely during the import process. Use encryption and secure channels when transferring data.
IMPORTING DATA FROM VARIOUS SOURCES

12. Documentation: Document the data import process thoroughly, including source details, data extraction methods, transformation steps, and any
scripts or code used. This documentation is valuable for future reference and troubleshooting.

13. Compliance and Ethics: Ensure that data import practices adhere to legal and ethical standards, especially when dealing with personally identifiable
information (PII) and sensitive data.

14. Performance Optimization: Optimize the data import process for efficiency, especially when dealing with large datasets. Consider batch processing,
parallel processing, and indexing for databases.

15. Testing and Validation: Test the entire data import process end-to-end to ensure that it functions as expected and delivers accurate results.

16. Backup and Recovery: Implement data backup and recovery procedures to safeguard against data loss or corruption during the import process.
DATA CLEANING AND PROCESSING IN EXCEL

Steps: Remove duplicates, correct inconsistent values, standardize formats, use filters.
Tools: Text to Columns, Flash Fill, Find & Replace, Conditional Formatting.

Examples:
Merging first and last names using Flash Fill.
Standardizing dates from MM/DD/YYYY to YYYY-MM-DD.

[Figure: merging text in Excel]
DATA CLEANING AND PROCESSING IN EXCEL

Data cleaning and processing in Excel involve preparing your data for analysis by identifying and addressing issues such as missing values, inconsistencies, and formatting problems. Excel provides a range of tools and functions to help with these tasks.
DATA CLEANING AND PROCESSING IN EXCEL

Here's a step-by-step guide on how to clean and process data in Excel:
1. Open Your Data in Excel - "File" > "Open"

2. Understand Your Data - its structure, columns, and potential issues

3. Handling Missing Data - Fill Missing Values, Use functions like IF, ISBLANK, and VLOOKUP to fill missing values

4. Dealing with Duplicates - Remove duplicates by going to "Data" > "Remove Duplicates" and selecting the columns to check for duplicates.

5. Correcting Data Inconsistencies - Use Excel's "Find and Replace" feature to correct inconsistent data

6. Data Formatting - Use the "Format Cells" option (right-click > Format Cells) to format date and time

7. Data Transformation - Use the "Text to Columns" feature (Data > Text to Columns) to split data in a single column into multiple columns based on delimiters (e.g., commas, spaces).
Combine data from multiple columns into one using the CONCATENATE or & operator.

8. Calculations and Derived Columns - Create Derived Columns, Add new columns for calculations or derived information using Excel formulas

9. Filtering and Sorting - Use the "Filter" option (Data > Filter) to filter data based on specific criteria. "Sort A to Z" or "Sort Z to A."

10. Removing Irrelevant Data – remove unneeded rows or columns

11. Data Validation - Use Excel's data validation feature to enforce rules and constraints on data entry

12. Data Visualization - Use Excel's charting tools

13. Save Your Cleaned Data - Save your cleaned and processed data as a new Excel file to preserve the original dataset. (A pandas version of this workflow is sketched below.)
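For comparison, here is the same workflow condensed into a minimal pandas sketch; the customers.csv file and its column names are hypothetical:

```python
# The Excel guide above, condensed into a pandas sketch for comparison.
# The customers.csv file and its columns are hypothetical placeholders.
import pandas as pd

df = pd.read_csv("customers.csv")

# Step 4: remove duplicates (Data > Remove Duplicates).
df = df.drop_duplicates()

# Step 5: correct inconsistencies (Find and Replace).
df["region"] = df["region"].replace({"N.C.R.": "NCR"})

# Step 6: standardize dates from MM/DD/YYYY to YYYY-MM-DD (Format Cells).
df["order_date"] = pd.to_datetime(df["order_date"], format="%m/%d/%Y").dt.strftime("%Y-%m-%d")

# Step 7: combine columns (CONCATENATE / & operator).
df["full_name"] = df["first_name"] + " " + df["last_name"]

# Step 13: save the cleaned data as a new file, preserving the original.
df.to_csv("customers_clean.csv", index=False)
```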
HANDLING MISSING DATA AND OUTLIERS

Handling missing data and outliers is a critical step in data analysis to ensure the
accuracy and reliability of your results. Missing data are values that are not
recorded or are incomplete, while outliers are data points that significantly
deviate from the rest of the data.
HANDLING MISSING DATA AND OUTLIERS

Handling Missing Data:

1. Identify Missing Data: Begin by identifying missing values in your dataset. These can appear as blank cells, "N/A," "NaN," or other placeholders.

2. Understand the Missingness Pattern: Determine if the missing data are missing completely at random (MCAR), missing at random (MAR), or missing not at random (MNAR). This helps in choosing appropriate handling methods.

3. Imputation: Imputation involves replacing missing values with estimated or predicted values. Common imputation methods include:
 Mean/Median Imputation: Replace missing values with the mean (for continuous data) or median (for ordinal data) of the available values in the column.
 Mode Imputation: Replace missing values with the mode (most frequent value) for categorical data.
 Regression Imputation: Predict missing values using regression models based on other variables.
 K-Nearest Neighbors (KNN) Imputation: Replace missing values with values from the K nearest neighbors in the dataset.
 Time Series Imputation: For time series data, consider using interpolation or forward/backward filling based on the time order of the data.

4. Create an Indicator Variable: Sometimes, it's valuable to create an indicator variable that flags missing values. This way, you retain information about which data points were missing.

5. Data Removal: In some cases, if the missing data are substantial and cannot be imputed accurately, you may consider removing rows or columns with missing values. However, this should be done with caution, as it can lead to loss of information. (Methods 1, 3, and 4 are illustrated in the sketch below.)
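A minimal pandas sketch of methods 1, 3, and 4, assuming hypothetical numeric (age), categorical (segment), and time series (demand) columns:

```python
# A minimal sketch of identifying, flagging, and imputing missing values.
# The columns (age, segment, demand) are hypothetical.
import pandas as pd

df = pd.DataFrame({
    "age": [25, None, 31, 40, None],
    "segment": ["A", "B", None, "B", "B"],
    "demand": [10.0, None, 14.0, None, 18.0],
})

# 1. Identify missing data.
print(df.isna().sum())

# 4. Create indicator variables before imputing, so missingness is preserved.
df["age_missing"] = df["age"].isna()

# 3. Imputation: median for the numeric column, mode for the categorical one,
#    and time-order interpolation for the series.
df["age"] = df["age"].fillna(df["age"].median())
df["segment"] = df["segment"].fillna(df["segment"].mode()[0])
df["demand"] = df["demand"].interpolate()

print(df)
```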
HANDLING MISSING DATA AND OUTLIERS

Handling Outliers:
1. Identify Outliers: Visualize your data using box plots, histograms, scatter plots, or other
graphical techniques to identify potential outliers. Calculate summary statistics (e.g., mean,
standard deviation) and use them to identify data points that fall significantly outside the
expected range.

2. Understand the Nature of Outliers: Determine if outliers are genuine data points
(representing true extreme values) or errors/noise. Consult domain experts if necessary.
HANDLING MISSING DATA AND OUTLIERS

Handling Outliers (continued):

3. Transformation: Consider transforming the data to make it more robust against outliers. Common transformations include logarithmic, square root, or winsorization (capping or flooring extreme values).

4. Data Removal: In some cases, you may decide to remove outliers if they are identified as errors or if they significantly distort your analysis. Be cautious when doing this and document the reasons for removal.

5. Robust Statistical Methods: Use robust statistical techniques that are less sensitive to outliers. For example, use the median instead of the mean for central tendency measurements.

6. Model-Based Approaches: Some machine learning algorithms are less affected by outliers. Consider using algorithms like Random Forests or Support Vector Machines that can handle noisy data.

7. Binning or Categorization: Convert continuous data into categorical data by binning or categorizing it to minimize the impact of outliers.

8. Winsorization: Replace extreme values with values within a specified range (e.g., replace values above the 95th percentile with the 95th percentile value).

9. Visualization and Reporting: When reporting your results, consider showing both the analysis with and without outliers to provide a balanced perspective. (Outlier identification and winsorization are illustrated in the sketch below.)
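A minimal pandas sketch of IQR-based outlier identification and winsorization, using a small hypothetical sales sample:

```python
# A minimal sketch of IQR-based outlier identification and winsorization.
# The sales values are a small hypothetical sample.
import pandas as pd

sales = pd.Series([120, 135, 128, 142, 131, 2900])  # 2900 is a likely outlier

# Identify outliers with the 1.5 * IQR rule.
q1, q3 = sales.quantile(0.25), sales.quantile(0.75)
iqr = q3 - q1
outliers = sales[(sales < q1 - 1.5 * iqr) | (sales > q3 + 1.5 * iqr)]
print("Outliers:\n", outliers)

# Winsorize: cap values at the 5th and 95th percentiles.
capped = sales.clip(lower=sales.quantile(0.05), upper=sales.quantile(0.95))
print("Winsorized:\n", capped)
```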
Perform Laboratory Activity No. 2.
Prepare for Quiz No. 2.
