Updated-Module 4-Sampling and Data Preparation
MODULE 4: SAMPLING AND DATA PREPARATION
Types of Sampling

Probability Sampling:
- Simple Random Sampling
- Stratified Sampling
- Cluster Sampling
- Systematic Sampling

Non-probability Sampling:
- Convenience Sampling
- Judgmental (or Purposive) Sampling
- Quota Sampling
- Snowball Sampling
Under what circumstances would you recommend: (a) a probability sample, (b) a non-probability
sample, (c) a stratified sample, (d) a cluster sample?
Sample Type: Probability Sample
When to Use:
- When you need to make inferences about the population from the sample.
- When the population is large and heterogeneous.
- When you need to minimize bias.
Example: A researcher wants to estimate the percentage of adults in the U.S. who support a particular policy, using a probability sample to ensure representativeness.

Sample Type: Non-Probability Sample
When to Use:
- When you need to collect data quickly and cheaply.
- When the population is small or difficult to access.
- When you are conducting exploratory research.
Example: A company wants feedback on a new product, using a non-probability sample such as customers who visit its website or sign up for its email list.

Sample Type: Stratified Sample
When to Use:
- When the population is heterogeneous and you want to ensure that all subgroups are represented.
- When you want to make comparisons between subgroups.
- When you have prior knowledge of important subgroups in the population.
Example: A researcher wants to compare the attitudes of men and women on an issue, using a stratified sample to ensure equal representation of both genders.

Sample Type: Cluster Sample
When to Use:
- When the population is geographically dispersed and sampling the entire population is difficult and expensive.
- When you can identify natural clusters.
- When clusters represent the population as a whole.
Example: A researcher surveys students in a large school district by selecting a few schools at random and surveying all students in those schools.
Census and Sampling:
Sample Unit:
It is the basic observation entity, or the unit, that can be selected for the sample. For instance, in a
study, a sample unit could be an individual or an organization.

Sample Element:
It refers to the individual unit within the sample unit, which provides the data for the research.
Sample Size:
In 1936, The Literary Digest conducted a poll with a very large sample of
about 2.4 million people and predicted that Alfred Landon would win the
presidential election. However, the sample was not representative of the
electorate, and the prediction was wrong: Franklin D. Roosevelt won. The
case shows that a large sample does not compensate for an unrepresentative one.
Determination of Sample Size:
Thus, to achieve a 95% confidence level and a 5% margin of error, the company needs to survey
approximately 385 customers (as sample size should be a whole number).
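The calculation behind this figure is not reproduced in the text. A minimal sketch, assuming the standard formula for estimating a proportion, n0 = z^2 * p * (1 - p) / e^2, with z = 1.96 for 95% confidence and the conservative choice p = 0.5 (the formula, z, and p are assumptions, not values stated above):

```python
import math

def sample_size_for_proportion(z: float, p: float, e: float) -> int:
    """Return n0 = z^2 * p * (1 - p) / e^2, rounded up to a whole respondent.

    z: critical value for the confidence level (assumed 1.96 for 95%)
    p: expected proportion (assumed 0.5, which maximizes p * (1 - p))
    e: margin of error as a fraction (0.05 for 5%)
    """
    n0 = (z ** 2) * p * (1 - p) / (e ** 2)
    return math.ceil(n0)

n = sample_size_for_proportion(z=1.96, p=0.5, e=0.05)
print(n)  # 384.16 rounded up to 385
```

Rounding up rather than to the nearest integer keeps the margin of error at or below the target.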
Adjusting for Finite Population:
If the population size (N) is known and finite, you can further refine the
sample size using the finite population correction (FPC) formula:
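The formula itself is missing from the text. A commonly used form of the correction is n = n0 / (1 + (n0 - 1) / N), where n0 is the uncorrected sample size. The sketch below assumes that form; the population size N = 2,000 is a hypothetical value chosen only for illustration.

```python
import math

def apply_fpc(n0: float, population_size: int) -> int:
    """Adjust an uncorrected sample size n0 for a finite population of size N.

    Uses n = n0 / (1 + (n0 - 1) / N), rounded up to a whole respondent.
    """
    n = n0 / (1 + (n0 - 1) / population_size)
    return math.ceil(n)

# n0 = 384.16 corresponds to 95% confidence and a 5% margin of error with p = 0.5.
print(apply_fpc(384.16, 2000))  # hypothetical N = 2,000
```

The correction matters most when the uncorrected sample is a sizeable fraction of the population; for very large N the adjusted size approaches n0.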
Sampling vs. Non-Sampling Error
1. Sampling Error
Sampling error arises when the sample selected from the population is not perfectly representative of the
population from which it was drawn. This type of error is inherent to the sampling process because we're
working with a subset rather than the whole population. It can be reduced (but not entirely eliminated) by
increasing the sample size.
Cause: Chance
Description: Even with a perfect sampling process, random samples can differ due to natural variability.
Example: A company samples 100 out of 10,000 customers for a new beverage flavor. The 100 chosen may have an unusually high or low preference.

Cause: Sample Size
Description: Smaller samples are more prone to sampling error.
Example: Sampling only 100 customers may not represent the entire population of 10,000, leading to inaccuracies in preferences.

Cause: Sampling Technique
Description: Improper application of sampling techniques can introduce bias and distort the results.
Example: Poorly applied sampling techniques might not accurately represent the target market, leading to skewed results.

Cause: Market Research Example
Description: A sampling error in market research can occur if the sample chosen doesn't reflect the broader population's preference.
Example: Coca-Cola's New Coke (1985): Research showed preference for the new formula, but the larger population didn't share this sentiment.
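The effect of chance and sample size can be illustrated with a small simulation. This is a sketch under stated assumptions: the population of 10,000 customers and the 40% preference rate are hypothetical, and the spread of repeated sample estimates is used as an empirical stand-in for sampling error.

```python
import random
import statistics

def spread_of_sample_proportions(population, sample_size, repetitions, rng):
    """Draw repeated simple random samples and return the standard deviation
    of the sample proportions (an empirical measure of sampling error)."""
    proportions = []
    for _ in range(repetitions):
        sample = rng.sample(population, sample_size)
        proportions.append(sum(sample) / sample_size)
    return statistics.pstdev(proportions)

rng = random.Random(42)  # fixed seed so the run is repeatable
# Hypothetical population: 10,000 customers, 40% of whom like the new flavor (1 = likes it).
population = [1] * 4000 + [0] * 6000

spread_100 = spread_of_sample_proportions(population, 100, 500, rng)
spread_1000 = spread_of_sample_proportions(population, 1000, 500, rng)
print(f"n=100:  spread of estimates = {spread_100:.3f}")
print(f"n=1000: spread of estimates = {spread_1000:.3f}")
```

The larger samples give estimates that cluster more tightly around the true proportion, but the spread never reaches zero, which matches the point that sampling error can be reduced but not eliminated.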
Reasons for Sampling Error:
Sampling Technique:
The method of drawing samples can introduce bias if not done properly.
2. Non-Sampling Error
Non-sampling errors are all the errors that are not related to the act of selecting a sample from the population.
They can occur at any stage of the research process and can sometimes be controlled or eliminated.
These errors can be present even if a complete census (i.e., surveying the entire population) is taken.
Category: Data Collection
Cause: Mistakes, misunderstandings, or misinterpretations during data gathering.

Category: Non-response Error
Example: Only customers with very positive or very negative experiences respond, leading to a biased representation of satisfaction.
Effect: Satisfaction results reflect only extreme cases and not the entire sample.

Category: Measurement Error
Example: A question asks about the excitement of a sale event rather than satisfaction with the purchase, failing to capture the intended information.
Effect: Asking about sale excitement instead of purchase satisfaction leads to inaccurate measurement.

Category: Data Processing Error
Example: A data analyst miscodes "neutral" responses as "very satisfied," skewing the final analysis results.
Effect: Misclassifying neutral responses as very satisfied during data coding causes skewed results.
Data Preparation
Data Preparation refers to the process of cleaning, structuring, and
enriching raw data into a desired format for better decision-making in
less time. It's a vital step before data analysis can occur.
Importance of Data Preparation:
Data Coding: Categorize textual feedback into "Positive" (1), "Neutral" (2), and "Negative" (3).
Data Transformation: Calculate the average satisfaction score for customers in different age brackets.
Data Cleaning: Flag surveys with missing textual feedback for further review.
Data Imputation: Use the average age from the dataset to fill in missing age values.
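The four steps listed above can be sketched on a tiny dataset. The records, field names, and the two age brackets are hypothetical and chosen only to keep the example short.

```python
import statistics

# Hypothetical survey records; None marks a missing value.
responses = [
    {"age": 24, "score": 4, "feedback": "Positive"},
    {"age": None, "score": 3, "feedback": "Neutral"},
    {"age": 41, "score": 2, "feedback": "Negative"},
    {"age": 37, "score": 5, "feedback": None},
]

# Data coding: map text categories to the numeric codes given above.
FEEDBACK_CODES = {"Positive": 1, "Neutral": 2, "Negative": 3}

# Data imputation: the mean of the observed ages is used for missing ages.
observed_ages = [r["age"] for r in responses if r["age"] is not None]
mean_age = statistics.mean(observed_ages)

for r in responses:
    # Data cleaning: flag records with missing textual feedback for review.
    r["needs_review"] = r["feedback"] is None
    r["feedback_code"] = FEEDBACK_CODES.get(r["feedback"])
    if r["age"] is None:
        r["age"] = mean_age

# Data transformation: average satisfaction score per (assumed) age bracket.
def age_bracket(age):
    return "under 35" if age < 35 else "35 and over"

scores_by_bracket = {}
for r in responses:
    scores_by_bracket.setdefault(age_bracket(r["age"]), []).append(r["score"])
average_by_bracket = {b: statistics.mean(s) for b, s in scores_by_bracket.items()}
print(average_by_bracket)
```

Mean imputation is the approach named in the list; it keeps every record usable but understates the variability of the imputed field.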
Field Validation
Definition:
Pilot Testing: Administer the instrument to a small subset of the target audience to understand their responses.
Feedback Collection: Collect feedback regarding clarity, length, and format from participants after pilot testing.
Data Analysis: Analyze pilot test data to check if responses align with research objectives.
Revision: Revise the instrument based on feedback and data analysis. This may require several iterations.
Full-Scale Administration: Administer the validated instrument to the larger target population for the actual research.
Example: Case
Checking for Missing Data
Description: Identify any unanswered questions or gaps in the data.
Example: A survey participant might have skipped a question inadvertently.

Standardization of Responses
Description: Ensure that data is consistent in terms of units, scale, or format.
Example: Ensuring all currency responses are in dollars and not a mix of dollars and euros.

Clarification of Ambiguous Answers
Description: Look for any responses that are unclear or can be interpreted in multiple ways.
Example: In open-ended questions, a respondent might give a vague answer that requires further clarification.

Correction of Data Entry Errors
Description: Spot and rectify any mistakes made during data input.
Example: A typo where a respondent's age is entered as "255" instead of "25" should be corrected.

Verification Against Original Sources
Description: If possible, cross-check edited data against the original source, especially if the data seems dubious.
Example: Cross-checking survey data against recorded interview responses, if available.
Example:
A company has conducted a survey to understand consumer preferences regarding its new range of
skincare products. The dataset includes responses related to age, skin type, preferred product type,
and feedback.
Data Editing Process:
Screening for Inconsistencies
Description: A response indicating a participant uses 100 products daily is flagged as improbable.
Example: A participant stating they use an improbable number of products daily (e.g., 100 products) would be flagged.

Checking for Missing Data
Description: Some participants didn't specify their skin type; these entries are flagged.
Example: Entries where participants didn't answer specific questions (e.g., skin type) are flagged as incomplete.

Standardization of Responses
Description: The feedback received had a mix of uppercase and lowercase text. The feedback is standardized to have consistent sentence case formatting.
Example: Standardizing text formatting so feedback is consistent in sentence case (capitalization).

Clarification of Ambiguous Answers
Description: A respondent mentioned "the blue one" when referring to their preferred product. This response is flagged for ambiguity, as multiple products have blue packaging.
Example: Responses like "the blue one," which refer to ambiguous products, are flagged for further clarification.

Correction of Data Entry Errors
Description: Some ages were entered as over 100; these are cross-checked and corrected.
Example: Data entry errors, such as ages over 100, are cross-checked and corrected.

Verification Against Original Sources
Description: Some dubious data entries are checked against the original survey forms to ensure accuracy.
Example: Dubious entries are cross-checked with the original survey forms to ensure data accuracy.
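The editing checks in this example can be sketched as a single pass over the records. The records, field names, cut-off values, and the list of ambiguous phrases are all hypothetical. The sketch only flags suspect ages, since correcting them requires the original survey forms.

```python
# Hypothetical skincare survey records; None marks a skipped question.
records = [
    {"id": 1, "age": 29, "skin_type": "oily", "products_daily": 3, "feedback": "LOVED THE SERUM"},
    {"id": 2, "age": 255, "skin_type": None, "products_daily": 2, "feedback": "the blue one"},
    {"id": 3, "age": 41, "skin_type": "dry", "products_daily": 100, "feedback": "too greasy for me"},
]

MAX_PLAUSIBLE_AGE = 100       # assumed cut-off, matching the "ages over 100" check
MAX_PLAUSIBLE_PRODUCTS = 20   # assumed cut-off for an improbable daily product count
AMBIGUOUS_PHRASES = ("the blue one",)  # assumed list of phrases needing follow-up

def edit_record(record):
    """Return (cleaned copy of the record, list of flags raised for it)."""
    cleaned = dict(record)
    flags = []
    if cleaned["products_daily"] > MAX_PLAUSIBLE_PRODUCTS:
        flags.append("improbable product count")
    if cleaned["skin_type"] is None:
        flags.append("missing skin type")
    if cleaned["age"] > MAX_PLAUSIBLE_AGE:
        flags.append("age to verify against original form")
    # Standardize free text to sentence case.
    cleaned["feedback"] = cleaned["feedback"].strip().capitalize()
    if cleaned["feedback"].lower() in AMBIGUOUS_PHRASES:
        flags.append("ambiguous answer")
    return cleaned, flags

for record in records:
    cleaned, flags = edit_record(record)
    print(cleaned["id"], cleaned["feedback"], flags)
```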
Data Coding
Example (Gender variable): The code for "Gender" might assign "1" for "Male" and "2" for "Female."

Step: Developing a Codebook
Description: The research team predicts common responses and creates a preliminary codebook. They anticipate responses related to environmental concerns, price, and quality.

Step: Categorizing Responses
Description: After collecting surveys, they find some customers mentioned "brand loyalty" as a reason. This is added as a new category.

Step: Assigning Numerical Values
Description: The categories are coded as: "001" for environmental concerns, "002" for price, "003" for quality, and "004" for brand loyalty.

Step: Entering Data
Description: Responses are entered into a database using their respective codes.

Step: Review and Verification
Description: A subset of the coded data is reviewed to ensure accuracy and consistency.
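The coding steps can be sketched with a dictionary acting as the codebook. The codes come from the example above; the list of categorized responses, including the uncoded "packaging" entry, is hypothetical.

```python
# Preliminary codebook built before data collection.
codebook = {
    "environmental concerns": "001",
    "price": "002",
    "quality": "003",
}
# A category discovered after data collection is added to the codebook.
codebook["brand loyalty"] = "004"

# Hypothetical responses, already categorized by a reviewer.
categorized_responses = ["price", "quality", "brand loyalty", "price", "packaging"]

def encode(responses, codebook):
    """Map each categorized response to its code; unknown categories become None."""
    return [codebook.get(r) for r in responses]

coded = encode(categorized_responses, codebook)
print(coded)

# Review and verification: list anything the codebook could not code.
uncoded = [r for r, c in zip(categorized_responses, coded) if c is None]
print(uncoded)
```

Storing codes as strings preserves the leading zeros used in the example.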
Content Analysis
Content Analysis is a systematic and objective technique for studying and analyzing
information contained in textual, visual, or audio communication in order to categorize
content in terms of predefined criteria. It helps in interpreting the context and themes of the
content.
Sampling: 50 articles from various tech news outlets were selected for analysis.
Develop Categories: Categories include "positive features," "negative features," "pricing feedback," and "overall sentiment."
Coding: Each article is reviewed, and content is assigned to the relevant categories.
Data Analysis: 70% of the articles praise the camera quality (positive feature), while 60% criticize the product for being overpriced (pricing feedback).
Interpretation and Reporting: The company concludes that while the smartphone has standout features, the pricing strategy needs reconsideration.
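The coding and analysis steps can be sketched with simple keyword matching. This is a simplification: the keyword lists and article snippets are hypothetical, and keyword matching stands in for the manual review described above.

```python
from collections import Counter

# Hypothetical keyword lists per category; a real study would use a tested coding scheme.
CATEGORY_KEYWORDS = {
    "positive features": ("great camera", "fast", "bright display"),
    "negative features": ("battery drains", "overheats"),
    "pricing feedback": ("overpriced", "expensive"),
}

# Hypothetical article snippets standing in for the sampled articles.
articles = [
    "Great camera and a bright display, but overpriced.",
    "The phone overheats and the battery drains quickly.",
    "Fast and fun to use, though expensive for what it offers.",
    "Great camera. Nothing else stands out.",
]

def code_article(text):
    """Return the set of categories whose keywords appear in the text."""
    lowered = text.lower()
    return {cat for cat, words in CATEGORY_KEYWORDS.items()
            if any(w in lowered for w in words)}

# Count each category at most once per article, then convert to percentages.
counts = Counter(cat for article in articles for cat in code_article(article))
shares = {cat: 100 * counts[cat] / len(articles) for cat in CATEGORY_KEYWORDS}
print(shares)
```

Because an article can fall into several categories, the percentages need not sum to 100, as in the 70% and 60% figures above.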
Classification and Tabulation of Data
Aspect: Definition
Details: Classification of Data: Arranging data into meaningful categories based on nature, type, or specific characteristics.

Aspect: Representation
Details: The table is clearly labeled, with distinct rows and columns, and a title: "Sales by Product Type and Region."
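A table of this kind can be built by classifying records on two characteristics and summing within each cell. The sales records below are hypothetical; only the table title comes from the text.

```python
from collections import defaultdict

# Hypothetical sales records: (product type, region, sales amount).
sales = [
    ("Electronics", "North", 120),
    ("Electronics", "South", 80),
    ("Apparel", "North", 60),
    ("Apparel", "South", 90),
    ("Electronics", "North", 30),
]

# Classification: group records by product type (rows) and region (columns).
table = defaultdict(lambda: defaultdict(int))
for product, region, amount in sales:
    table[product][region] += amount

# Tabulation: print a labeled table with a title, column headers, and row labels.
regions = sorted({region for _, region, _ in sales})
print("Sales by Product Type and Region")
print(f"{'Product Type':<14}" + "".join(f"{r:>8}" for r in regions))
for product in sorted(table):
    print(f"{product:<14}" + "".join(f"{table[product][r]:>8}" for r in regions))
```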
Data Transformation
Definition:
Data Transformation refers to the process of converting data from one format or structure to
another. This can involve simple changes (like unit conversions) or more complex computations (like
normalization) to make data suitable for analysis.
Importance of Data Transformation:
Standardization: Sales data from various regions are standardized to compare performance relative to the overall mean and variance.
Normalization: Sales figures are normalized to a scale of [0, 1] to understand the relative performance of products.
Log Transformation: Sales data are log-transformed so that exponential growth in sales appears as a linear trend.
Binning: Sales figures are categorized into "Low," "Medium," and "High" for simpler understanding of performance tiers.
Dummy Variables: Product types are converted into dummy variables:
Electronics: [1, 0, 0]
Apparel: [0, 1, 0]
Home & Kitchen: [0, 0, 1]
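The five transformations can be sketched with the standard library alone. The sales figures and the bin cut-offs (150 and 500) are hypothetical; the dummy-variable order follows the list above.

```python
import math
import statistics

sales = [100.0, 200.0, 400.0, 800.0]  # hypothetical figures that double each period

# Standardization (z-scores): subtract the mean, divide by the standard deviation.
mean, sd = statistics.mean(sales), statistics.pstdev(sales)
z_scores = [(x - mean) / sd for x in sales]

# Normalization (min-max): rescale to the interval [0, 1].
low, high = min(sales), max(sales)
normalized = [(x - low) / (high - low) for x in sales]

# Log transformation: exponential growth becomes equal steps on the log scale.
logged = [math.log(x) for x in sales]

# Binning: assumed cut-offs of 150 and 500 for the Low / Medium / High tiers.
def tier(x):
    return "Low" if x < 150 else "Medium" if x < 500 else "High"
tiers = [tier(x) for x in sales]

# Dummy variables: one-hot vectors in the category order used above.
CATEGORIES = ["Electronics", "Apparel", "Home & Kitchen"]
def dummies(product_type):
    return [1 if product_type == c else 0 for c in CATEGORIES]

print(normalized, tiers, dummies("Apparel"))
```

In regression models one dummy column is usually dropped to avoid perfect collinearity with the intercept; the full three-column form is kept here to match the example.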
Sample Design
Sample Design: Points a Researcher Should Consider
in Developing a Sample Design for a Research Project
Sample design is a process of selecting a sample from a population. It is a systematic approach to ensuring
that the sample is representative of the population and that the findings of the study can be generalized to
the population.
There are two main types of sampling designs:
Probability sampling:
In probability sampling, every member of the population has a known, non-zero chance
of being selected for the sample. This makes it possible to draw a sample that is
representative of the population and to estimate the sampling error.

Non-probability sampling:
In non-probability sampling, selection does not rely on random draws, so members'
chances of selection are unknown and some members may have no chance of being selected.
This type of sampling is often used in exploratory research or when it
is difficult or expensive to obtain a probability sample.
When developing a sample design, researchers
should consider the following points:
Research Question: The research question should be clearly defined before the sample design is developed. This helps determine the appropriate population and sample size.
Population: The entire group of individuals or objects that the researcher is interested in studying. It should be clearly defined before developing the sample design.
Sample Size: The number of individuals or objects included in the sample. It should be large enough for statistically reliable results, but not so large that it becomes impractical.
Sampling Frame: A list of all members of the population. The researcher should ensure that all members have a chance of being selected for the sample.
Sampling Method: The procedure used to select the sample from the sampling frame. Different methods include simple random sampling, stratified sampling, and cluster sampling.
Custom Considerations: There is no one-size-fits-all sample design; the best design depends on the research question, population, and resources available.
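The sampling methods named above, plus systematic sampling, can be sketched as four small functions drawing from a sampling frame. The frame of 12 students and its "school" and "grade" fields are hypothetical.

```python
import random

rng = random.Random(7)  # fixed seed so the run is repeatable

# Hypothetical sampling frame: 12 students, each tagged with a school (cluster)
# and a grade (stratum).
frame = [{"id": i, "school": f"S{i % 3}", "grade": "junior" if i % 2 else "senior"}
         for i in range(12)]

def simple_random_sample(frame, n):
    """Every subset of size n is equally likely."""
    return rng.sample(frame, n)

def systematic_sample(frame, step):
    """Pick a random start, then every step-th member of the frame."""
    start = rng.randrange(step)
    return frame[start::step]

def stratified_sample(frame, key, n_per_stratum):
    """Draw a simple random sample of fixed size within each stratum."""
    strata = {}
    for member in frame:
        strata.setdefault(member[key], []).append(member)
    return [m for group in strata.values() for m in rng.sample(group, n_per_stratum)]

def cluster_sample(frame, key, n_clusters):
    """Pick whole clusters at random and keep every member of the chosen clusters."""
    clusters = sorted({member[key] for member in frame})
    chosen = set(rng.sample(clusters, n_clusters))
    return [m for m in frame if m[key] in chosen]

print(len(simple_random_sample(frame, 4)))
print([m["id"] for m in systematic_sample(frame, 4)])
print([m["grade"] for m in stratified_sample(frame, "grade", 2)])
print(sorted({m["school"] for m in cluster_sample(frame, "school", 1)}))
```

All four methods need a usable sampling frame; they differ in how much they depend on extra information (strata or cluster labels) being recorded in it.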
Factors that Influence the Determination of Sample Size in a Research Study
There are a number of factors that influence the determination of
sample size in a research study. These include:
Factor: Desired Level of Precision
Description: The degree of accuracy the researcher wants to achieve in their results.
Impact on Sample Size: Higher precision requires a larger sample size.

Factor: Confidence Level
Description: The probability that an interval estimate computed from the sample contains the true population value.
Impact on Sample Size: Higher confidence level requires a larger sample size.

Factor: Variability of the Population
Description: The degree to which population values vary.
Impact on Sample Size: More variability requires a larger sample size.

Factor: Effect Size
Description: The magnitude of the effect the researcher aims to detect.
Impact on Sample Size: Smaller effect size requires a larger sample size.

Factor: Resources Available
Description: The time and money available for conducting the research.
Impact on Sample Size: Limited resources may restrict sample size.
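The direction of the first three effects can be checked numerically. This sketch assumes the standard sample size formula for a proportion, n = z^2 * p * (1 - p) / e^2, with p * (1 - p) standing in for population variability; effect size and resources are not modeled.

```python
import math
from statistics import NormalDist

def required_n(confidence, margin, p=0.5):
    """n = z^2 * p * (1 - p) / margin^2 for estimating a proportion (assumed formula)."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # two-sided critical value
    return math.ceil(z ** 2 * p * (1 - p) / margin ** 2)

baseline = required_n(confidence=0.95, margin=0.05)
print("baseline (95%, margin 5%, p = 0.5):", baseline)
print("higher precision (margin 3%):", required_n(0.95, 0.03))
print("higher confidence (99%):", required_n(0.99, 0.05))
print("less variability (p = 0.1):", required_n(0.95, 0.05, p=0.1))
```

Because the margin of error enters the formula squared, halving it roughly quadruples the required sample, which is why precision is often the costliest factor to tighten.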