Module 1 Part 1 - Introduction To Statistics & Data Analysis
INTRODUCTION TO STATISTICS & DATA ANALYSIS
NCE 3108: Engineering Data Analysis
Module 1 Part 1
29/08/2024
I. Role of Statistics and the Data Analysis Process
1.1. Introduction
Definition
Taxonomy of Statistics
1.2. Uncertainty and Variability
Uncertainty
Variability
• Concept: Variability refers to the natural differences or changes that occur in data
or measurements. It’s the range of different possible values.
• Variability is about the natural differences in the data you do have.
• Example: If you measure the height of everyone in a city, you'll find that not
everyone is the same height. Some people are taller, some are shorter, and this
spread in height is variability.
• In statistics, variability is quantified with measures such as the range, the
variance, or the standard deviation.
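The three measures of variability named above can be computed with Python's standard library. This is only a minimal sketch: the height values are hypothetical and chosen for illustration, and the sample (n - 1) form of the variance is assumed.

```python
import statistics

# Hypothetical heights (cm) of ten people; illustrative values only.
heights_cm = [150, 155, 160, 162, 165, 168, 170, 172, 178, 180]

# Range: distance between the largest and smallest observation.
data_range = max(heights_cm) - min(heights_cm)

# Sample variance: average squared deviation from the mean (n - 1 divisor).
sample_variance = statistics.variance(heights_cm)

# Sample standard deviation: square root of the sample variance.
sample_std = statistics.stdev(heights_cm)

print(f"range = {data_range} cm")
print(f"variance = {sample_variance:.2f} cm^2")
print(f"standard deviation = {sample_std:.2f} cm")
```

The standard deviation is often the easiest of the three to interpret because it is expressed in the same unit as the data.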
Uncertainty vs Variability
Another Example:
• Uncertainty: You estimate that the average load on the bridge will be around
10,000 kg with a margin of error of ±500 kg because you can't measure every
single load scenario.
• Variability: On the bridge, the weight varies from light cars of 1,000 kg to heavy
trucks of 20,000 kg. This spread of different weights is the variability.
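The difference between the two ideas can be sketched numerically. The load values below are hypothetical, and the margin of error uses a rough 95% normal approximation (1.96 standard errors), which is an assumption made here for illustration and not a method prescribed in this module.

```python
import math
import statistics

# Hypothetical vehicle loads (kg) observed on a bridge; illustrative values only.
loads_kg = [1000, 1500, 2000, 3500, 8000, 12000, 15000, 18000, 20000, 19000]

mean_load = statistics.mean(loads_kg)
std_load = statistics.stdev(loads_kg)

# Variability: how much individual loads differ from one another.
load_range = max(loads_kg) - min(loads_kg)

# Uncertainty: how precisely the sample mean pins down the true average load.
# Assumption: rough 95% margin of error from the normal approximation.
margin_of_error = 1.96 * std_load / math.sqrt(len(loads_kg))

print(f"variability: loads span {load_range} kg")
print(f"uncertainty: mean load = {mean_load} kg, margin of error = {margin_of_error:.0f} kg")
```

Collecting more observations would typically shrink the margin of error, while the spread of individual loads would remain.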
Sources of Uncertainty
Aleatory Uncertainty
Epistemic Uncertainty
To understand what a set of data reveals, statistics and data analysis must be
applied. The data analysis process proceeds as follows (Peck et al., 2019):
1. Understanding the nature of the problem. Effective data analysis requires an
understanding of the research problem. We must know the goal of the research
and what questions we hope to answer. It is important to have a clear direction
before gathering data to ensure that we will be able to answer the questions of
interest using the data collected.
2. Deciding what to measure and how to measure it. The next step in the
process is deciding what information is needed to answer the questions of
interest. In some cases, the choice is obvious.
3. Data collection. The data collection step is very important. The researcher
must first decide whether an existing data source is adequate or whether new
data must be collected. If a decision is made to use existing data, it is important
to understand how the data were collected and for what purpose, so that any
resulting limitations are also fully understood. If new data are to be collected, a
careful plan must be developed, because the type of analysis that is appropriate
and the conclusions that can be drawn depend on how the data are collected.
4. Data summarization and preliminary analysis. After the data are collected,
the next step is usually a preliminary analysis that includes summarizing the data
graphically and numerically. This initial analysis provides insight into important
characteristics of the data and provides guidance in selecting appropriate
methods for further analysis.
5. Formal data analysis. The data analysis step requires the researcher to select
appropriate statistical methods.
6. Interpretation of results. The interpretation step often leads to the
formulation of new research questions. These new questions lead back to the first
step. In this way, good data analysis is often an iterative process.
Population vs Sample
• In statistics, data are usually obtained from samples, and the corresponding
values for the entire population are estimated from these sample data.
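The idea of estimating a population value from a sample can be sketched as follows. The population here is simulated (hypothetical concrete strengths drawn from a normal distribution), and the sample size of 50 is an arbitrary choice for illustration.

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch is reproducible

# Hypothetical population: compressive strengths (MPa) of 1,000 concrete cylinders.
population = [random.gauss(30, 4) for _ in range(1000)]

# Draw a simple random sample of 50 cylinders without replacement.
sample = random.sample(population, 50)

# Population value (usually unknown in practice) and its sample estimate.
population_mean = statistics.mean(population)
sample_mean = statistics.mean(sample)

print(f"population mean = {population_mean:.2f} MPa")
print(f"sample estimate = {sample_mean:.2f} MPa")
```

In a real study only the sample would be available, so the sample mean stands in for the unknown population mean.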
Classification of Data
Data may be classified depending on its nature and on the number of variables involved.
Depending on its nature, data may be classified as either:
• Categorical (qualitative): if the individual observations are categorical
responses; or
• Numerical (quantitative): if the individual observations are expressed as
numbers.
Measurement
• In order to collect data for statistical analysis, the manner in which the variables
are to be observed should be decided upon, depending on the type of data
involved.
• The process of determining the value (for numerical data) or label (for categorical
data) of the variable based on what has been observed is called measurement.
Nominal Scale
• Definition: The nominal level is the most basic level of measurement. It involves
categorizing data without any quantitative value. The categories are distinct and have no
inherent order.
• Examples: Gender (male, female), eye color (blue, green, brown), types of engineering
disciplines (civil, mechanical, electrical).
• Key Characteristics:
o Data is classified into distinct groups or categories.
o There is no ranking or order to the categories.
o The only meaningful analysis is counting the frequency of each category and identifying the mode.
ENGINEERING DATA ANALYSIS LECTURES by Engr. Marc Daniel Laurina
Ordinal Scale
• Definition: The ordinal level involves categorizing data with a meaningful order or
ranking, but the intervals between the categories are not necessarily equal or known.
• Examples: Education level (high school, bachelor's, master's, Ph.D.), satisfaction ratings
(satisfied, neutral, dissatisfied), ranks in a competition (1st, 2nd, 3rd).
• Key Characteristics:
o Data is ranked or ordered.
o The distance between the ranks is not uniform or specified.
o Analysis can include the median and mode, but arithmetic operations like addition or
subtraction are not meaningful.
Interval Scale
• Definition: The interval level includes data that is ordered, with equal intervals between
values. However, there is no true zero point, meaning that ratios or comparisons of
absolute magnitude are not meaningful.
• Examples: Temperature in Celsius or Fahrenheit, dates on a calendar, IQ scores.
• Key Characteristics:
o Data is ordered with equal intervals.
o There is no true zero (e.g., 0 degrees Celsius does not mean "no temperature").
o Arithmetic operations like addition and subtraction are meaningful, so differences can be
compared; however, multiplication and division are not, because without an absolute zero
ratios have no meaning.
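A short sketch of why ratios fail on an interval scale: converting the same two temperatures to Kelvin (a scale with a true zero, brought in here only for comparison) keeps their difference but changes their ratio. The temperature values are arbitrary.

```python
def celsius_to_kelvin(temp_c: float) -> float:
    """Convert Celsius (interval scale) to Kelvin (ratio scale, true zero)."""
    return temp_c + 273.15

t1_c, t2_c = 10.0, 20.0

# Differences are meaningful on an interval scale and survive the unit change.
diff_c = t2_c - t1_c
diff_k = celsius_to_kelvin(t2_c) - celsius_to_kelvin(t1_c)

# Ratios are not: "20 is twice 10" holds only for the Celsius numbers.
ratio_c = t2_c / t1_c
ratio_k = celsius_to_kelvin(t2_c) / celsius_to_kelvin(t1_c)

print(f"difference: {diff_c:.2f} C vs {diff_k:.2f} K")
print(f"ratio: {ratio_c:.3f} (Celsius) vs {ratio_k:.3f} (Kelvin)")
```

Because the ratio depends on where the zero is placed, 20 degrees Celsius cannot be described as "twice as hot" as 10 degrees Celsius.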
Ratio Scale
• Definition: The ratio level is the highest level of measurement, where data is ordered,
with equal intervals, and a true zero point exists. This allows for a full range of arithmetic
operations.
• Examples: Weight, height, age, income, length, time.
• Key Characteristics:
o Data is ordered with equal intervals and a true zero point.
o Both arithmetic operations and comparisons of ratios are meaningful (e.g., a weight
of 10 kg is twice as heavy as 5 kg).
o All statistical methods, including mean, median, mode, and geometric mean, are
applicable.
Measurement
• Summary:
Nominal: Categories without order (e.g., gender, colors).
Ordinal: Ordered categories without equal intervals (e.g., rankings).
Interval: Ordered with equal intervals, but no true zero (e.g., temperature in Celsius).
Ratio: Ordered with equal intervals and a true zero (e.g., weight, income).
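The summary above can be illustrated by applying, to each level, a statistic that is meaningful for it. All responses below are hypothetical; the categories reuse the examples given earlier (eye color, satisfaction ratings, temperature, weight).

```python
import statistics

# Hypothetical observations, one list per measurement level.
eye_colors = ["brown", "blue", "brown", "green", "brown"]                          # nominal
satisfaction_order = ["dissatisfied", "neutral", "satisfied"]                      # rank order
responses = ["satisfied", "neutral", "satisfied", "dissatisfied", "neutral"]       # ordinal
temps_c = [28.0, 31.0, 30.0, 29.0, 32.0]                                           # interval
weights_kg = [5.0, 10.0]                                                           # ratio

# Nominal: only counts and the mode are meaningful.
nominal_mode = statistics.mode(eye_colors)

# Ordinal: the median is meaningful once responses are mapped to their ranks.
ranks = sorted(satisfaction_order.index(r) for r in responses)
ordinal_median = satisfaction_order[ranks[len(ranks) // 2]]

# Interval: means and differences are meaningful.
mean_temp = statistics.mean(temps_c)

# Ratio: ratios are meaningful as well.
weight_ratio = weights_kg[1] / weights_kg[0]

print(nominal_mode, ordinal_median, mean_temp, weight_ratio)
```

Each level also supports the statistics of the levels before it, so the mode could be reported for all four lists.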
https://www.youtube.com/watch?v=OXTdii-b9Co&t=1s
Level of Measurement
• Instructions: Complete the following survey by answering each question. For each item,
identify the level of measurement (Nominal, Ordinal, Interval, or Ratio).
____6. How would you rate your understanding of soil classification methods (e.g., USCS,
AASHTO)?
Options: Excellent, Good, Fair, Poor
____7. On a scale from 1 to 10, how confident are you in your ability to perform a load
calculation?
Options: (1 to 10 scale)
____8. How many floors does the building you are designing have?
Options: (Fill in the number of floors)
____9. Which type of foundation is most commonly used in your current projects?
Options: Shallow Foundation, Deep Foundation, Pile Foundation, Mat Foundation
____10. How many engineering courses are you taking this semester?
Options: (Fill in the number of courses)
• Data may be collected, depending on the type of study, using any of the following
methods:
1. Use of documented data: Available research data may be used, such as government
data and data from related studies. However, caution must be exercised when using
documented data, especially secondary data (i.e., data documented by entities
other than the actual data collectors).
2. Surveys: A survey is a method of collecting data on the variable of interest by asking
people to answer a set of carefully written questions called a questionnaire. A survey
comprising an entire population is called a census while a survey comprising only a
sample of the population is called a sample survey. Surveys are usually performed if
the study involves human behavior such as consumer studies and election surveys.
3. Experiments: An experiment is a method of collecting data where there is direct
human intervention on the conditions that may affect the values of the variable of
interest. Variables that are directly manipulated are called independent variables,
while variables that are not manipulated directly but whose values change in
response are called dependent variables. Most scientific studies with multivariate
data involve experimentation.
4. Observations: An observation is a method of collecting data on the phenomenon of
interest by recording the observations made about the phenomenon as it actually
happens. Examples of studies involving observations include weather and climate,
earthquake, and astronomical studies.
1.6. Sampling
Sampling Bias
• Sampling bias occurs when the sample selected for analysis does not accurately
represent the entire population, leading to skewed or invalid results. In engineering
data analysis, this bias can significantly affect the conclusions drawn from the data,
potentially leading to poor decision-making and flawed engineering solutions.
1. Non-Random Sampling:
• When samples are selected based on convenience, judgment, or any non-random
method, certain segments of the population may be overrepresented or
underrepresented.
• Example: In a material testing study, if an engineer only selects samples from the
top of a batch, they might miss variations that occur throughout the batch, leading
to biased conclusions about material strength.
2. Undercoverage:
• This occurs when some parts of the population are not included or are
underrepresented in the sample.
• Example: If a survey of construction practices only includes responses from large
firms and excludes small contractors, the results may not reflect the practices of
the entire industry.
3. Non-Response Bias:
• When individuals or items selected for the sample do not respond or participate,
and their absence is related to the study’s focus, the results can be biased.
• Example: In a survey on the adoption of new engineering software, if only tech-
savvy engineers respond, the results may overestimate the overall adoption rate.
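The non-response example can be sketched with a small made-up population. The counts below are hypothetical and chosen only so that tech-savvy engineers adopt the software more often; the assumption that only tech-savvy engineers respond is taken from the example above.

```python
# Hypothetical population of 100 engineers: (is_tech_savvy, has_adopted_software).
engineers = (
    [(True, True)] * 30      # tech-savvy adopters
    + [(True, False)] * 10   # tech-savvy non-adopters
    + [(False, True)] * 10   # less tech-savvy adopters
    + [(False, False)] * 50  # less tech-savvy non-adopters
)

def adoption_rate(group):
    """Fraction of the group that has adopted the software."""
    return sum(adopted for _, adopted in group) / len(group)

# Adoption rate in the whole population.
true_rate = adoption_rate(engineers)

# Assumption: only tech-savvy engineers answer the survey.
respondents = [e for e in engineers if e[0]]
observed_rate = adoption_rate(respondents)

print(f"true adoption rate     = {true_rate:.2f}")
print(f"observed in the survey = {observed_rate:.2f}")
```

With these made-up counts the survey reports a much higher adoption rate than the population actually has, which is the overestimate described in the example.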
Consequences of sampling bias:
1. Inaccurate Data Analysis: The data may not truly reflect the population, leading to
incorrect conclusions.
2. Faulty Designs and Decisions: Engineering solutions based on biased data may be
ineffective or unsafe, potentially leading to project failures or increased costs.
3. Reduced Reliability: Results that suffer from sampling bias lack credibility and may
not be applicable to broader contexts, reducing their usefulness in guiding
engineering practices.
Sampling Methods
2. Stratified Sampling:
• Definition: The population is divided into distinct subgroups (strata) based on a
specific characteristic, and random samples are taken from each subgroup. This
ensures representation from all key segments of the population.
• Application in Engineering: When analyzing the strength of concrete from
multiple suppliers, engineers might stratify the samples by supplier and then
randomly select from each group to ensure each supplier's product is tested.
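A minimal sketch of stratified sampling for the supplier example. The supplier names, specimen IDs, and the choice of three specimens per supplier are all hypothetical; taking an equal number from each stratum is one option among several (proportional allocation is another).

```python
import random

random.seed(1)  # fixed seed so the sketch is reproducible

# Hypothetical test specimens grouped by supplier (the strata).
specimens_by_supplier = {
    "Supplier A": [f"A-{i}" for i in range(1, 21)],
    "Supplier B": [f"B-{i}" for i in range(1, 31)],
    "Supplier C": [f"C-{i}" for i in range(1, 11)],
}

def stratified_sample(strata, per_stratum):
    """Draw a simple random sample of fixed size from every stratum."""
    return {name: random.sample(units, per_stratum) for name, units in strata.items()}

selected = stratified_sample(specimens_by_supplier, per_stratum=3)

for supplier, specimens in selected.items():
    print(supplier, specimens)
```

Every supplier appears in the result, which a single random draw from the pooled specimens would not guarantee.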
3. Systematic Sampling:
• Definition: Every nth member of the population is selected after a random starting
point. This method is easier to administer than simple random sampling and can be
used when the population is ordered.
• Application in Engineering: For quality control in manufacturing, an engineer might
inspect every 10th item coming off a production line to monitor consistency.
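A minimal sketch of systematic sampling for the production-line example. The line of 100 numbered items is hypothetical, and the helper name `systematic_sample` is introduced here only for illustration.

```python
import random

def systematic_sample(items, step):
    """Select every `step`-th item after a random start within the first `step` items."""
    start = random.randrange(step)  # random starting index: 0 .. step - 1
    return items[start::step]

random.seed(7)  # fixed seed so the sketch is reproducible

# Hypothetical production line: items numbered 1 to 100 in order of production.
production_line = list(range(1, 101))

inspected = systematic_sample(production_line, step=10)
print(inspected)
```

Only the starting point is random; once it is chosen, the rest of the sample is fully determined by the step.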
4. Cluster Sampling:
• Definition: The population is divided into clusters, often based on geography or
another natural grouping, and entire clusters are randomly selected. This method is
cost-effective when dealing with large, dispersed populations.
• Application in Engineering: In a large construction project spread over different sites,
engineers might select a few sites (clusters) and then test all the samples within those
sites.
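A minimal sketch of cluster sampling for the construction-site example. The site names and specimen IDs are hypothetical, and choosing two of four sites is an arbitrary illustration.

```python
import random

random.seed(3)  # fixed seed so the sketch is reproducible

# Hypothetical construction sites (clusters) and the specimens available at each.
sites = {
    "Site A": ["A-1", "A-2", "A-3"],
    "Site B": ["B-1", "B-2"],
    "Site C": ["C-1", "C-2", "C-3", "C-4"],
    "Site D": ["D-1", "D-2", "D-3"],
}

def cluster_sample(clusters, n_clusters):
    """Randomly choose whole clusters and keep every unit inside them."""
    chosen = random.sample(sorted(clusters), n_clusters)
    return {name: list(clusters[name]) for name in chosen}

tested = cluster_sample(sites, n_clusters=2)

for site, specimens in tested.items():
    print(site, specimens)
```

Unlike stratified sampling, some groups are left out entirely, while the chosen groups are tested in full.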
5. Convenience Sampling:
• Definition: Samples are selected based on their availability and ease of access. While
this method is the least rigorous and can introduce bias, it is sometimes used in
exploratory research or when other methods are impractical.
• Application in Engineering: When an engineer needs to quickly assess the
performance of a material and selects samples from the nearest available source, this
is convenience sampling.
6. Multistage Sampling:
• Definition: Combines several sampling methods. The population is divided into
groups (like in cluster sampling), and then a sample is taken from within each
selected group, often using another method like simple random sampling or
stratified sampling.
• Application in Engineering: In large infrastructure projects, an engineer might first
choose specific regions (clusters), then within each region, select specific
construction sites (stratified), and finally, within those sites, randomly choose
samples for testing.
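A minimal sketch of multistage sampling for the infrastructure example. The regions, sites, and specimen IDs are hypothetical, and for simplicity every stage uses simple random sampling; the example above uses a stratified choice at the second stage, which this sketch does not reproduce.

```python
import random

random.seed(5)  # fixed seed so the sketch is reproducible

# Hypothetical hierarchy: region -> construction site -> specimen IDs.
project = {
    region: {
        f"{region}-site{s}": [f"{region}-site{s}-spec{k}" for k in range(1, 9)]
        for s in range(1, 5)
    }
    for region in ["North", "South", "East", "West"]
}

def multistage_sample(hierarchy, n_regions, n_sites, n_specimens):
    """Sample regions, then sites within each region, then specimens within each site."""
    picked = []
    for region in random.sample(sorted(hierarchy), n_regions):        # stage 1
        sites = hierarchy[region]
        for site in random.sample(sorted(sites), n_sites):            # stage 2
            picked.extend(random.sample(sites[site], n_specimens))    # stage 3
    return picked

specimens = multistage_sample(project, n_regions=2, n_sites=2, n_specimens=3)
print(specimens)
```

Here 2 regions, 2 sites per region, and 3 specimens per site give 12 specimens to test out of the 128 available.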
Video
https://www.youtube.com/watch?v=PdXDLNNXPik&t=510s
1.7. Introduction to Design of Experiments (DOE)
Variables
• An experiment is a study in which one or more explanatory variables are manipulated in order to
observe the effect on a response variable.
• Explanatory variables are independent variables or factors, those that have values that are
controlled by the experimenter, while response variables are dependent variables, those that are
thought to be related to the explanatory variables in an experiment.
• Response variables are measured as part of the experiment, but not controlled by the
experimenter. An experimental study involves several set-ups, called experimental conditions or
treatments, to observe the relationship between the independent and the dependent variables.
4. Randomization:
• Randomization involves randomly assigning treatments to experimental units to
minimize the effects of uncontrolled variables and reduce bias.
• Example in Engineering: In a study testing the durability of different concrete mixes,
randomization might involve randomly assigning various concrete formulas to
different batches. This ensures that any differences in durability are due to the mix
itself and not to variations in the testing conditions (like temperature or humidity)
that could affect the results.
5. Replication:
• Replication means repeating the experiment under the same conditions to ensure
that the results are consistent and reliable.
• Example in Engineering: If an engineer is testing the tensile strength of a new steel
alloy, they might replicate the test by pulling multiple samples of the steel under the
same conditions. This helps confirm that the observed strength is consistent across
different samples and not due to random variation.
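Randomization and replication can be sketched together for the concrete-mix example. The mix labels, the number of replicates, and the batch names are hypothetical.

```python
import random

random.seed(11)  # fixed seed so the sketch is reproducible

mixes = ["Mix A", "Mix B", "Mix C"]   # treatments (hypothetical labels)
replicates = 4                        # replication: each mix is tested on 4 batches
batches = [f"batch-{i}" for i in range(1, len(mixes) * replicates + 1)]

# Randomization: shuffle the treatment labels, then pair them with the batches.
labels = mixes * replicates
random.shuffle(labels)
assignment = dict(zip(batches, labels))

for batch, mix in assignment.items():
    print(batch, "->", mix)
```

Each mix still appears exactly four times, but which batches receive it is left to chance, so no mix is systematically paired with particular testing conditions.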
6. Blocking:
• Blocking is used to control for variables that are not of primary interest but may
influence the response, by grouping similar experimental units together.
• Example in Engineering: When testing the effectiveness of a new pavement material,
an engineer might block the experiment by weather condition (e.g., sunny, rainy).
Within each block, the material is tested under otherwise comparable conditions,
which separates the effect of the weather from the material's performance. Differences
in performance can then be attributed to the material itself rather than to the varying
weather conditions.
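A minimal sketch of blocking, laid out as a randomized block design. The comparison against a current pavement, the section names, and the two weather blocks are assumptions added for illustration; the example above mentions only the new material.

```python
import random

random.seed(13)  # fixed seed so the sketch is reproducible

materials = ["Current pavement", "New pavement"]   # treatments (hypothetical)

# Blocks: test sections grouped by the weather condition they are exposed to.
blocks = {
    "sunny": ["section-1", "section-2"],
    "rainy": ["section-3", "section-4"],
}

def randomized_block_design(blocks, treatments):
    """Within each block, assign every treatment once, in random order."""
    design = {}
    for block, units in blocks.items():
        shuffled = random.sample(treatments, len(treatments))
        for unit, treatment in zip(units, shuffled):
            design[unit] = (block, treatment)
    return design

design = randomized_block_design(blocks, materials)

for section, (weather, material) in design.items():
    print(section, weather, material)
```

Because every material is tested under every weather condition, comparisons between materials are made within a block, where the weather is the same.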
Benefits of DOE in engineering:
• Efficiency: DOE helps identify the most significant factors affecting a process with the
fewest number of experiments, saving time and resources.
• Optimization: By understanding the interactions between variables, engineers can
optimize processes for better performance and quality.
• Problem Solving: DOE provides a structured approach to troubleshooting and improving
engineering systems, leading to more robust and reliable designs.
Video:
https://www.youtube.com/watch?v=DaBq0naj0YY&t=524s
Instructions:
• Present an example of the application of the data analysis process in civil engineering. You may search
Google Scholar (or any other credible website) for papers or design experiments that show how statistics is
applied in understanding a civil engineering problem. Compare the process followed in the paper or design
experiment to the data analysis process discussed in pages 22-24.
• Submit this as a typewritten document on short-size paper, in PDF format. Provide a title page.
• Use 12-pt Arial Narrow for font type and size, with 1.5-space line spacing. Text must be in justified
alignment. Images, if any, must be centered and with a caption below.
• References used for this assignment should be put at the end of the document. Use APA format for citation
and listing. Photo credits (list of sources where photos come from) should also be put right after
references.