01 Engineering Data Analysis
(Week 2-3)
Introduction
Engineering Data Analysis (EDA) is an indispensable tool that lets an industrial
engineering team analyze processes, integration, and yield (conversion rate)
effectively in order to enhance the competitiveness of the company.
Learning Outcome
1. Know the methods of data collection
2. Apply planning and conducting experiments
3. Interpret planning and conducting surveys
Learning Content
Stratified sampling - This involves dividing the population into non-overlapping
groups (strata) and taking a sample from each group. For instance, the manufacturer of a
light bulb wishes to investigate the lifetime of their bulbs. If 25-watt, 60-watt, and
100-watt bulbs were produced, a separate sample could be selected from each of the three
bulb sizes. This would result in information on all three bulb sizes.
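The light bulb example can be sketched in Python as follows. This is only an illustration: the serial numbers, the sample size of 5 per stratum, and the function name `stratified_sample` are assumptions, not part of the module.

```python
import random

def stratified_sample(strata, per_stratum, seed=0):
    """Draw a simple random sample of size per_stratum from each stratum."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    return {name: rng.sample(units, per_stratum) for name, units in strata.items()}

# Hypothetical bulb serial numbers, grouped by wattage (the strata).
bulbs = {
    "25-watt": list(range(100, 150)),
    "60-watt": list(range(200, 280)),
    "100-watt": list(range(300, 340)),
}

sample = stratified_sample(bulbs, per_stratum=5)
for wattage, chosen in sample.items():
    print(wattage, sorted(chosen))
```

Because every stratum contributes its own sample, each bulb size is guaranteed to be represented.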
FREQUENCY DISTRIBUTIONS
The organization of data in tabular form yields frequency distributions. Data in frequency
distributions may be grouped or ungrouped.
Raw data are collected data that have not been organized numerically. An
arrangement of raw data in ascending or descending order of magnitude is an array. In
an array, any value may appear several times. The number of times a value appears in
the listing is its frequency. The relative frequency of any observation is obtained by
dividing the actual frequency of the observation by the total frequency.
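A minimal Python sketch of these definitions, using a small made-up data set (the values in `raw` and the function name `frequency_table` are illustrative assumptions):

```python
from collections import Counter

def frequency_table(values):
    """Return (value, frequency, relative frequency) rows in ascending order."""
    counts = Counter(values)
    total = len(values)  # total frequency
    return [(v, f, f / total) for v, f in sorted(counts.items())]

raw = [3, 1, 2, 3, 2, 3, 5, 1, 3, 2]  # hypothetical raw data

print("array:", sorted(raw))
for value, f, rel in frequency_table(raw):
    print(f"{value:>5} {f:>3} {rel:.2f}")
```

The relative frequencies always add up to 1, which is a quick check on the table.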
Classification of Data:
UNGROUPED DATA - When the data set is small (n ≤ 30) or when there are few distinct
values, the data may be organized without grouping.
EXAMPLE 1.1
GROUPED DATA - Statistical data gathered in large masses (n ≥ 30) can be assessed by
grouping the data into different classes.
The following are suggested steps in forming a frequency distribution from raw data:
1. Find the range (R). The range is the difference between the largest and smallest
value.
2. Decide on a suitable number of classes. This will depend upon what information the
table is supposed to present. Sturges suggested the number of classes (m) as
m = 1 + 3.3 log n, where n = number of cases
3. Find the class size (c) by dividing the range by the number of classes: c = R/m.
The class size (c) may be rounded off to the same place value as the data.
4. Find the number of observations in each class. This is the class frequency (f).
The following are data on the observed compressive strength in psi of 50 samples of
concrete interlocking blocks.
R = H - L
R = 148 - 82 = 66
m = 1 + 3.3 log 50 = 6.6 ≈ 7 classes
c = 66/7 = 9.4
Use c = 10 since the data values are given to the nearest ones.
The lowest value is 82. It is convenient to start with 80 as the lower limit of the
first class. 80 + 10 = 90 is the lower limit of the second class.
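The range, number of classes, and class size can be checked with a short Python sketch. The function name `class_plan` and its `round_to` parameter are assumptions made for illustration. Rounding the number of classes and the class size upward is also an assumption, chosen because it reproduces the values used in the worked example.

```python
import math

def class_plan(lowest, highest, n, round_to=1):
    """Return (range, number of classes, class size) for a grouped table."""
    r = highest - lowest
    # Sturges' suggestion with a base-10 logarithm, rounded up to a whole class.
    m = math.ceil(1 + 3.3 * math.log10(n))
    # Round the class size up to the next multiple of round_to.
    c = math.ceil((r / m) / round_to) * round_to
    return r, m, c

# Compressive strength data: lowest 82, highest 148, n = 50.
print(class_plan(82, 148, 50))
```

With these inputs the sketch returns a range of 66, 7 classes, and a class size of 10, matching the hand calculation.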
The number of observed values tallied in each is the class frequency. The relative
frequency of each class is also obtained and presented in Table 1.2.
Compressive Strength (psi)   Tally   No. of blocks (frequency, f)   Relative Frequency
80-89                        II      2                              0.04
∑                                    50                             1.00
R = H - L = 25.7 - 22.8 = 2.9
m = 1 + 3.3 log 40 = 6.29 ≈ 7 classes
c = R/m = 2.9/7 = 0.41; use c = 0.5
The lowest value is 22.8; therefore, 22.5 may be the lower limit of the 1st class. 22.5 + 0.5 =
23.0 is the lower limit of the 2nd class.
∑ 40 1.00
TABLE 1.4 CLASS LIMITS, CLASS BOUNDARIES AND CLASS MARKS FOR FREQUENCY
DISTRIBUTION PRESENTED IN TABLE 1.2
Table 1.5 Class limits, class boundaries and class marks of frequency distribution presented in
Table 1.3
MEASURES OF CENTRAL TENDENCY: MEAN, MEDIAN AND MODE
MEAN
The arithmetic mean or simply the mean is the overall average.
If the data represent the entire population the mean of the values is referred to as the
population mean, μ. This mean is a quantitative measure describing the characteristic of a
population and therefore, it is a parameter. If the data constitute a sample drawn from a
population, the mean is referred to as the sample mean, $\bar{x}$, which is a statistic.
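If there are n observations with numerical values x1, x2, …, xn, then the sample mean is given
by
$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$
For data in a frequency distribution with k distinct values (or class marks) $x_i$, each
occurring with frequency $f_i$, and $n = \sum_{i=1}^{k} f_i$, the mean is
$$\bar{x} = \frac{\sum_{i=1}^{k} f_i x_i}{n}$$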
Example 1.5
The following data represent the time in seconds for 9 glued samples to dry and attain
their bond strength: 3.6, 2.5, 3.1, 4.3, 2.4, 2.9, 2.5, 4.1 and 3.4. Calculate the mean.
SOLUTION:
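The mean is the sum of the nine drying times divided by 9. A minimal Python sketch of that calculation (the variable names are illustrative):

```python
# Drying times in seconds, as listed in Example 1.5.
times = [3.6, 2.5, 3.1, 4.3, 2.4, 2.9, 2.5, 4.1, 3.4]

# Sample mean: sum of the observations divided by the number of observations.
mean_time = sum(times) / len(times)
print(round(mean_time, 2), "seconds")
```

The times add up to 28.8 seconds, so the mean is 28.8/9 = 3.2 seconds.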
xi      fi    fi xi
1.46    6     8.76
1.48    4     5.92
1.49    5     7.45
1.50    6     9.00
1.52    9     13.68
∑       30    44.81
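The table above can be turned into a mean by dividing the total of the last column by the total frequency. A short Python sketch, assuming the first column holds the values and the second their frequencies (the function name `frequency_mean` is illustrative):

```python
def frequency_mean(values, freqs):
    """Mean of a frequency distribution: sum(f * x) / sum(f)."""
    n = sum(freqs)
    return sum(f * x for x, f in zip(values, freqs)) / n

x = [1.46, 1.48, 1.49, 1.50, 1.52]  # values from the table
f = [6, 4, 5, 6, 9]                 # corresponding frequencies

print(round(frequency_mean(x, f), 3))
```

This is 44.81/30, or about 1.494.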
Solution:
Classes fi Xi fi Xi
∑ 40 972
Median
The median of a set of numbers in an array is either the middle value or the
arithmetic mean of the two middle values.
The sample median $\tilde{x}$ is used to estimate the population median $\tilde{\mu}$.
Example 1.8
For the set of numbers 1, 3, 3, 5, 6, 8, 9, 9, 10
Solution:
$\tilde{x} = x_5 = 6$
Example 1.9
For the set of numbers 4, 4, 7, 9, 11, 12, 15, 18
Solution:
$\tilde{x} = \frac{x_4 + x_5}{2} = \frac{9 + 11}{2} = 10$
Example 1.10
Find the median of the data in Example 1.5.
Solution:
Arrange the data in ascending order of magnitude: 2.4, 2.5, 2.5, 2.9, 3.1, 3.4, 3.6, 4.1, 4.3
$\tilde{x} = x_5 = 3.1$ seconds
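The rule for odd and even counts can be sketched in Python and checked against Examples 1.8 to 1.10. The function name `median` is illustrative, and the last data set is taken as listed in Example 1.5.

```python
def median(values):
    """Middle value of the array, or the mean of the two middle values."""
    arr = sorted(values)          # form the array first
    n = len(arr)
    mid = n // 2
    if n % 2 == 1:                # odd count: single middle value
        return arr[mid]
    return (arr[mid - 1] + arr[mid]) / 2  # even count: average the middle pair

ex_1_8 = [1, 3, 3, 5, 6, 8, 9, 9, 10]
ex_1_9 = [4, 4, 7, 9, 11, 12, 15, 18]
ex_1_5 = [3.6, 2.5, 3.1, 4.3, 2.4, 2.9, 2.5, 4.1, 3.4]

print(median(ex_1_8), median(ex_1_9), median(ex_1_5))
```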
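For grouped data, the median is computed as
$$\tilde{x} = L_m + \left( \frac{\frac{n}{2} - (\sum f)_L}{f_m} \right) C$$
Where:
$L_m$ = lower class boundary of the median class
$n$ = total number of observations
$(\sum f)_L$ = sum of the frequencies of all classes below the median class
$f_m$ = frequency of the median class
$C$ = size of the median class interval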
Solution:
n/2 = 40/2 = 20
$$\tilde{x} = 23.95 + \frac{(20 - 12)(0.5)}{8} = 24.45$$
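The substitution above can be verified with a small Python sketch. The function and parameter names are illustrative; the numbers are the ones used in the solution (lower boundary 23.95, n = 40, cumulative frequency 12 below the median class, median class frequency 8, class size 0.5).

```python
def grouped_median(lower_boundary, n, cum_freq_below, median_freq, class_size):
    """Median of grouped data by interpolation within the median class."""
    return lower_boundary + (n / 2 - cum_freq_below) / median_freq * class_size

result = grouped_median(23.95, 40, 12, 8, 0.5)
print(round(result, 2))
```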
MODE
Mode is the value which occurs with the greatest frequency. The sample mode is
designated as $\hat{x}$ and the population mode by $\hat{\mu}$.
Example 1.12
For the set of numbers 3, 3, 5, 7, 9, 10, 11, 10, 11, 12, 9, 18, 9
Solution:
$\hat{x} = 9$ (unimodal)
Example 1.13
The set of numbers 6, 7, 9, 10, 12 has no mode.
Example 1.14
For the set of values 2.2, 3.1, 4.1, 4.1, 5.4, 5.4, 5.4, 6.2, 7.7, 7.7, 8.5, 8.5, 8.5, 9.3
Solution:
$\hat{x} = 5.4$ and $8.5$ (bimodal)
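Examples 1.12 to 1.14 can be reproduced with a short Python sketch. The function name `modes` is illustrative. It follows the convention used in Example 1.13, reporting no mode when no value repeats, and it assumes a non-empty data set.

```python
from collections import Counter

def modes(values):
    """Return all values with the greatest frequency, or [] if nothing repeats."""
    counts = Counter(values)
    top = max(counts.values())
    if top == 1:                  # every value occurs once: no mode
        return []
    return sorted(v for v, f in counts.items() if f == top)

ex_1_12 = [3, 3, 5, 7, 9, 10, 11, 10, 11, 12, 9, 18, 9]
ex_1_13 = [6, 7, 9, 10, 12]
ex_1_14 = [2.2, 3.1, 4.1, 4.1, 5.4, 5.4, 5.4, 6.2, 7.7, 7.7, 8.5, 8.5, 8.5, 9.3]

print(modes(ex_1_12), modes(ex_1_13), modes(ex_1_14))
```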
For grouped data, the mode is computed as
$$\hat{x} = L_{mo} + \left( \frac{d_1}{d_1 + d_2} \right) c$$
where:
$L_{mo}$ = lower class boundary of the modal class
$d_1$ = excess of modal frequency over frequency of the next lower class
$d_2$ = excess of modal frequency over frequency of the next higher class
$c$ = size of the modal class interval
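A Python sketch of the grouped-mode formula follows, taking d1 as the excess over the next lower class and d2 as the excess over the next higher class. The function name and all of the numbers in the call are hypothetical, chosen only to show the mechanics.

```python
def grouped_mode(lower_boundary, modal_freq, freq_below, freq_above, class_size):
    """Mode of grouped data: L_mo + d1 / (d1 + d2) * c."""
    d1 = modal_freq - freq_below   # excess over the next lower class
    d2 = modal_freq - freq_above   # excess over the next higher class
    return lower_boundary + d1 / (d1 + d2) * class_size

# Hypothetical modal class: boundary 23.95, frequency 12,
# neighbouring class frequencies 8 (below) and 9 (above), class size 0.5.
mode_estimate = grouped_mode(23.95, 12, 8, 9, 0.5)
print(round(mode_estimate, 2))
```

The estimate always falls inside the modal class, closer to the neighbouring class with the larger frequency.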
Conducting a Survey
There are various methods for administering a survey. It can be done as a face-to-face
interview or a phone interview where the researcher is questioning the subject. A different
option is to have a self-administered survey where the subject can complete a survey on paper
and mail it back, or complete the survey online. There are advantages and disadvantages to
each of these methods.
The advantages of face-to-face interviews include fewer misunderstood questions, fewer
incomplete responses, higher response rates, and greater control over the environment in
which the survey is administered; also, the researcher can collect additional information if any
of the respondents’ answers need clarifying. The disadvantages of face-to-face interviews are
that they can be expensive and time-consuming and may require a large staff of trained
interviewers. In addition, the response can be biased by the appearance or attitude of the
interviewer.
The advantages of self-administered surveys are that they are less expensive than interviews,
do not require a large staff of experienced interviewers and can be administered in large
numbers. In addition, anonymity and privacy encourage more candid and honest responses,
and there is less pressure on respondents. The disadvantages of self-administered surveys
are that respondents are more likely to stop participating mid-way through the survey and
the researcher cannot ask them to clarify their answers. In addition, there are lower response
rates than in personal interviews, and often the respondents who bother to return surveys
represent extremes of the population: those people who care about the issue strongly,
whichever way their opinion leans.
Designing a Survey
Surveys can take different forms. They can be used to ask only one question or they can ask a
series of questions. We can use surveys to test out people’s opinions or to test a hypothesis.
When designing a survey, the following steps are useful:
1. Determine the goal of your survey: What question do you want to answer?
2. Identify the sample population: Whom will you interview?
3. Choose an interviewing method: face-to-face interview, phone interview, self-
administered paper survey, or internet survey.
4. Decide what questions you will ask in what order, and how to phrase them. (This is
important if there is more than one piece of information you are looking for.)
5. Conduct the interview and collect the information.
6. Analyze the results by making graphs and drawing conclusions.
Practice Using Stem-and-Leaf Plots
Try your own stem-and-leaf plot with the following temperatures for June. Then, determine
the median for the temperatures:
77 80 82 68 65 59 61
57 50 62 61 70 69 64
67 70 62 65 65 73 76
87 80 82 83 79 79 71
80 77
Once you've sorted the data by value and grouped them by the tens digit, put them into a
graph called "Temperatures." Label the left column (the stem) as "Tens" and the right column
as "Ones," then fill in the corresponding temperatures as they occur above.
Now that you've had a chance to try this problem on your own, read on to see an example of
the correct way to format this data set as a stem-and-leaf plot graph.
Temperatures
Tens | Ones
5    | 079
6    | 11224555789
7    | 001367799
8    | 0002237
You should always begin with the lowest number, or in this case temperature: 50. Since 50
was the lowest temperature of the month, enter a 5 in the tens column and a 0 in the ones
column, then observe the data set for the next lowest temperature: 57. As before, write a 7 in
the ones column to indicate that one instance of 57 occurred, then proceed to the next-lowest
temperature of 59 and write a 9 in the ones column.
Find all of the temperatures that were in the 60s, 70s, and 80s and write each temperature's
corresponding ones value in the ones column. If you've done it correctly, it should yield a stem-
and-leaf plot graph that looks like the one in this section.
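The same grouping can be sketched in Python: sort the temperatures, use the tens digit as the stem and the ones digit as the leaf, then read the median from the sorted list. The function name `stem_and_leaf` is illustrative.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Group two-digit values by tens digit (stem), keeping ones digits (leaves) in order."""
    plot = defaultdict(list)
    for v in sorted(values):
        plot[v // 10].append(v % 10)
    return dict(plot)

temps = [77, 80, 82, 68, 65, 59, 61,
         57, 50, 62, 61, 70, 69, 64,
         67, 70, 62, 65, 65, 73, 76,
         87, 80, 82, 83, 79, 79, 71,
         80, 77]

plot = stem_and_leaf(temps)
for stem, leaves in plot.items():
    print(stem, "|", "".join(str(d) for d in leaves))

ordered = sorted(temps)
mid = len(ordered) // 2
# 30 values is an even count, so average the 15th and 16th values.
median_temp = (ordered[mid - 1] + ordered[mid]) / 2
print("median:", median_temp)
```

With 30 readings, the median is the average of the 15th and 16th values in order. Both are 70, so the median temperature is 70.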
Constructing a Survey
1. Martha wants to construct a survey that shows which sports students at her school like to
play the most.
a) List the goal of the survey.
The goal of the survey is to find the answer to the question: “Which sports do students at
Martha’s school like to play the most?”
b) What population sample should she interview?
A sample of the population would include a random sample of the student population in
Martha’s school. A good strategy would be to randomly select students (using dice or a
random number generator) as they walk into an all-school assembly.
c) How should she administer the survey?
Face-to-face interviews are a good choice in this case. Interviews will be easy to conduct since
the survey consists of only one question which can be quickly answered and recorded, and
asking the question face to face will help eliminate non-response bias.
d) Create a data collection sheet that she can use to record her results.
In order to collect the data to this simple survey Martha can design a data collection sheet
such as the one below:
Sport Tally
Baseball
Basketball
Football
Soccer
Volleyball
Swimming
9th grade
10th grade
11th grade
12th grade
This data collection sheet allows Raoul to write down the actual numbers of hours worked per
week by students as opposed to just collecting tally marks for several categories.
Display, Analyze, and Interpret Statistical Survey Data
In the previous section we considered two examples of surveys you might conduct in your
school. The first one was designed to find the sport that students like to play the most. The
second survey was designed to find out how many hours per week students worked.
For the first survey, students’ choices fit neatly into separate categories. Appropriate ways to
display the data might be a pie chart or a bar graph. Let’s revisit this example.
In Example A Martha interviewed 112 students and obtained the following results.
Sport Tally
Gymnastics ||| 3
Fencing || 2
Total: 112
a) Make a bar graph of the results showing the percentage of students in each category.
To make a bar graph, we list the sport categories on the x−axis and let the percentage of
students be represented by the y−axis.
To find the percentage of students in each category, we divide the number of students in each
category by the total number of students surveyed:
Sport        Percentage
Baseball     31/112 = 0.28 = 28%
Basketball   17/112 = 0.15 = 15%
Football     14/112 = 0.125 = 12.5%
Soccer       28/112 = 0.25 = 25%
Volleyball   9/112 = 0.08 = 8%
Swimming     8/112 = 0.07 = 7%
Gymnastics   3/112 = 0.027 = 2.7%
Fencing      2/112 = 0.02 = 2%
Now we can make a graph where the height of each bar represents the percentage of students
in each category:
b) Make a pie chart of the collected information, showing the percentage of students in each
category.
To make a pie chart, we find the percentage of students in each category by dividing the
number of students in each category by the total number of students surveyed, as in part a.
The central angle of each slice of the pie is found by multiplying the percentage of students
in each category by 360 degrees (the total number of degrees in a circle). To draw a pie chart
by hand, you can use a protractor to measure the central angles that you find for each category.
Here is the pie-chart that represents the percentage of students in each category:
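The percentages and central angles can be computed with a short Python sketch. The counts are the numerators from the percentage table in part a; the variable names are illustrative.

```python
# Number of students choosing each sport (112 students in total).
counts = {"Baseball": 31, "Basketball": 17, "Football": 14, "Soccer": 28,
          "Volleyball": 9, "Swimming": 8, "Gymnastics": 3, "Fencing": 2}

total = sum(counts.values())
shares = {sport: c / total for sport, c in counts.items()}   # fraction of students
angles = {sport: p * 360 for sport, p in shares.items()}     # central angle in degrees

for sport in counts:
    print(f"{sport:<11}{shares[sport]:6.1%}{angles[sport]:8.1f} deg")
```

For example, soccer accounts for 28/112 = 25% of the students, so its slice spans 0.25 × 360 = 90 degrees. The angles of all slices add up to 360 degrees.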
For the second survey, actual numerical data can be collected from each student. In this case
we can display the data using a stem-and-leaf plot, a frequency table, a histogram, or a box-
and-whisker plot.
Design of experiment (DOE) is a body of knowledge, based upon statistical and other scientific
disciplines, for efficient and effective planning of experiments and for making sound inferences
from experimental data.
In an experiment, we deliberately change one or more process variables (or factors) in order to
observe the effect the changes have on one or more response variables. The (statistical)
design of experiments (DOE) is an efficient procedure for planning experiments so that the
data obtained can be analyzed to yield valid and objective conclusions.
DOE begins with determining the objectives of an experiment and selecting the process
factors for the study. An Experimental Design is the laying out of a detailed experimental plan
in advance of doing the experiment. Well-chosen experimental designs maximize the amount
of “information” that can be obtained for a given amount of experimental effort.
DOE is used to evaluate which process inputs have a significant impact on the process output
and what the target level of those inputs should be to achieve a desired result (output).
Design of experiment defines the:
Population to be studied
Randomisation process
Administration of treatments
Sample size requirement
Method of statistical analysis
The process of DOE may seem too cumbersome and extensive to comprehend at first.
However, it is necessary to understand the use of design of experiments in product and process
research and development to achieve product excellence. DOE rests on three basic principles:
Randomisation
Replication
Local control
With these principles in mind, one can readily comprehend the idea of DOE and implement it in
product and process research.
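Randomisation and replication can be illustrated with a minimal Python sketch of a completely randomised run order. The treatment labels, the number of replicates, and the function name `randomise` are hypothetical. Local control (blocking) is not shown here.

```python
import random

def randomise(treatments, replicates, seed=None):
    """Return a random run order in which each treatment appears `replicates` times."""
    runs = [t for t in treatments for _ in range(replicates)]  # replication
    random.Random(seed).shuffle(runs)                          # randomisation
    return runs

# Hypothetical experiment: three treatments, four replicates each.
plan = randomise(["A", "B", "C"], replicates=4, seed=1)
for unit, treatment in enumerate(plan, start=1):
    print(f"unit {unit}: treatment {treatment}")
```

Assigning treatments to experimental units in random order helps keep unknown sources of variation from favouring any one treatment.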
1. Create an ungrouped frequency distribution table from the data of a survey conducted
among university students, who were asked how many books they read per year. Arrange the
data in a frequency table.
7 3 1 7 8 5 4 4 5 6 6 3 3 4 5 1 8 3
2. The highest flow recorded each year was determined from the flow data of a gaging station
on a certain river. The following observations are the highest annual flows (in m³/s) for 50
years. Construct a frequency distribution for these data.
55 43 60 94 37 56 91 30 65 68
42 75 33 71 60 65 76 52 69 58
45 48 39 61 35 78 56 39 44 65
71 60 61 77 61 59 47 49 74 69
83 69 40 64 31 27 36 87 62 66
** Start with “26” as the lower limit of the first class.
** Use a class size (c) of “10”.
3. Find the mean number of books read per year from the data in question no. 1.
4. Find the mean of the highest annual flows in question no. 2.
TEST II
1. Samuel conducted a survey to answer the following question: “What are the favorite
subjects of your classmates during high school?” He collected the following information by
asking his high school classmates. (Choose 5 subjects from your high school.)
a) Make a pie chart of the results showing the percentage of people in each category.
b) Make a bar graph of the results.
2. Melissa conducted a survey to answer the question “What sport do high school students like
to watch on TV the most?” She collected the following information on her data collection sheet.
a) Make a pie-chart of the results showing the percentage of people in each category.
b) Make a bar-graph of the results.
3. Pedro conducted a survey to find how many hours of TV teenagers watch each week in
Isabela. He collaborated with three friends who live in a city or municipality of Isabela and
found the following information: (Choose 3 municipalities/cities only.)
a) Make a stem-and-leaf plot of the data.
b) Decide on an appropriate bin size and construct a frequency table.
c) Make a histogram of the results.
Flexible Teaching Learning Modality (FTLM) adopted
Example:
Online (synchronous)
//Edmodo, google classroom, moodle, schoology, Podcast etc..
Remote (asynchronous)
//module, case study, exercises, problems sets, etc…