Chemometrics: A University Course Using the R Language
The content of the text and its data, in their form, correctness and reliability, are the sole responsibility of the authors and do not necessarily represent the official position of Atena Editora. Downloading and sharing the work are permitted, provided that credit is given to the authors, but without the possibility of altering it in any way or using it for commercial purposes.
Atena Editora is committed to ensuring editorial integrity at all stages of the publication
process, avoiding plagiarism, fraudulent data or results and preventing financial
interests from compromising the publication's ethical standards. Suspected cases of scientific misconduct will be investigated with the highest standards of academic and ethical rigor.
Editorial Board
Exact and earth sciences and engineering
Prof. Dr. Adélio Alcino Sampaio Castro Machado – Universidade do Porto
Profª Drª Alana Maria Cerqueira de Oliveira – Instituto Federal do Acre
Profª Drª Ana Grasielle Dionísio Corrêa – Universidade Presbiteriana Mackenzie
Profª Drª Ana Paula Florêncio Aires – Universidade de Trás-os-Montes e Alto Douro
Prof. Dr. Carlos Eduardo Sanches de Andrade – Universidade Federal de Goiás
Profª Drª Carmen Lúcia Voigt – Universidade Norte do Paraná
Prof. Dr. Cleiseano Emanuel da Silva Paniagua – Colégio Militar Dr. José Aluisio da
Silva Luz / Colégio Santa Cruz de Araguaina/TO
Profª Drª Cristina Aledi Felsemburgh – Universidade Federal do Oeste do Pará
Prof. Dr. Diogo Peixoto Cordova – Universidade Federal do Pampa, Campus Caçapava
do Sul
Prof. Dr. Douglas Gonçalves da Silva – Universidade Estadual do Sudoeste da Bahia
Prof. Dr. Eloi Rufato Junior – Universidade Tecnológica Federal do Paraná
Profª Drª Érica de Melo Azevedo – Instituto Federal do Rio de Janeiro
Prof. Dr. Fabrício Menezes Ramos – Instituto Federal do Pará
Prof. Dr. Fabrício Moraes de Almeida – Universidade Federal de Rondônia
Profª Drª Glécilla Colombelli de Souza Nunes – Universidade Estadual de Maringá
Prof. Dr. Hauster Maximiler Campos de Paula – Universidade Federal de Viçosa
Profª Drª Iara Margolis Ribeiro – Universidade Federal de Pernambuco
Profª Drª Jéssica Barbosa da Silva do Nascimento – Universidade Estadual de Santa
Cruz
Profª Drª Jéssica Verger Nardeli – Universidade Estadual Paulista Júlio de Mesquita
Filho
Prof. Dr. Juliano Bitencourt Campos – Universidade do Extremo Sul Catarinense
Prof. Dr. Juliano Carlo Rufino de Freitas – Universidade Federal de Campina Grande
Prof. Dr. Leonardo França da Silva – Universidade Federal de Viçosa
Profª Drª Luciana do Nascimento Mendes – Instituto Federal de Educação, Ciência e
Tecnologia do Rio Grande do Norte
Prof. Dr. Marcelo Marques – Universidade Estadual de Maringá
Prof. Dr. Marco Aurélio Kistemann Junior – Universidade Federal de Juiz de Fora
Prof. Dr. Marcos Vinicius Winckler Caldeira – Universidade Federal do Espírito Santo
Profª Drª Maria Iaponeide Fernandes Macêdo – Universidade do Estado do Rio de
Janeiro
Profª Drª Maria José de Holanda Leite – Universidade Federal de Alagoas
Profª Drª Mariana Natale Fiorelli Fabiche – Universidade Estadual de Maringá
Prof. Dr. Miguel Adriano Inácio – Instituto Nacional de Pesquisas Espaciais
Prof. Dr. Milson dos Santos Barbosa – Universidade Tiradentes
Profª Drª Natiéli Piovesan – Instituto Federal do Rio Grande do Norte
Profª Drª Neiva Maria de Almeida – Universidade Federal da Paraíba
Prof. Dr. Nilzo Ivo Ladwig – Universidade do Extremo Sul Catarinense
Profª Drª Priscila Natasha Kinas – Universidade do Estado de Santa Catarina
Profª Drª Priscila Tessmer Scaglioni – Universidade Federal de Pelotas
Prof. Dr. Rafael Pacheco dos Santos – Universidade do Estado de Santa Catarina
Prof. Dr. Ramiro Picoli Nippes – Universidade Estadual de Maringá
Profª Drª Regina Célia da Silva Barros Allil – Universidade Federal do Rio de Janeiro
Prof. Dr. Sidney Gonçalo de Lima – Universidade Federal do Piauí
Prof. Dr. Takeshy Tachizawa – Faculdade de Campo Limpo Paulista
Chemometrics: a university course using the R language
Format: PDF
System requirements: Adobe Acrobat Reader
Access mode: World Wide Web
Includes bibliography
ISBN 978-65-258-2700-1
DOI: https://doi.org/10.22533/at.ed.001241508
Atena Editora
Ponta Grossa – Paraná – Brasil
Telefone: +55 (42) 3323-5493
www.atenaeditora.com.br
contato@atenaeditora.com.br
AUTHORS' DECLARATION
The authors of this work: 1. Attest that they do not have any commercial interest that
constitutes a conflict of interest in relation to the published scientific article; 2. They
declare that they actively participated in the construction of their manuscripts,
preferably in: a) Study design, and/or data acquisition, and/or data analysis and
interpretation; b) Elaboration of the article or revision in order to make the material
intellectually relevant; c) Final approval of the manuscript for submission; 3. They certify
that published scientific articles are completely free from fraudulent data and/or
results; 4. Confirm correct citation and reference of all data and interpretations of data
from other research; 5. They acknowledge having informed all sources of funding
received for carrying out the research; 6. Authorize the publication of the work, which
includes the catalog records, ISBN (International Standard Book Number), D.O.I.
(Digital Object Identifier) and other indexes, visual design and cover creation, interior
layout, as well as the release and dissemination according to Atena Editora's criteria.
PUBLISHER'S DECLARATION
Atena Editora declares, for all legal purposes, that: 1. This publication constitutes only
a temporary transfer of copyright, right to publication, and does not constitute joint and
several liability in the creation of published manuscripts, under the terms provided for
in the Copyright Law (Law 9610/98), in art. 184 of the Penal Code and in art.
927 of the Civil Code; 2. Authorizes and encourages authors to sign contracts with
institutional repositories, with the exclusive purpose of disseminating the work,
provided that with due acknowledgment of authorship and editing and without any
commercial purpose; 3. All e-books are open access, so it does not sell them on its
website, partner sites, e-commerce platforms, or any other virtual or physical means,
therefore, it is exempt from copyright transfers to authors; 4. All members of the
editorial board are PhDs and linked to public higher education institutions, as
recommended by CAPES for obtaining the Qualis book; 5. It does not transfer,
commercialize or authorize the use of the authors' names and e-mails, as well as any
other data from them, for any purpose other than the scope of dissemination of this
work.
PREFACE
This book originated when the authors saw the need to write theoretical and experimental material for the Chemometrics discipline of the undergraduate and postgraduate Chemistry courses at UFRN/Brazil. The R programming language was chosen for this book because it provides a free, easily accessible working environment, rich in a variety of packages and statistical tools capable of exploring different functionalities through different means and formats. Furthermore, with the versatility of working either through the command line or through menus pre-defined by the software itself, it is very accessible and practical for exploring the main concepts of Chemometrics.
The main topics of Chemometrics explored in the undergraduate and postgraduate courses at UFRN include Descriptive Statistics and its properties (Chapter 1), Design and Optimization of Experiments (Chapter
2), Pattern Recognition (Chapter 3), Higher-order Multivariate Classification
(Chapter 4), Higher-order Multivariate Regression (Chapter 5) and Digital Images
(Chapter 6). All chapters contain descriptions of the fundamentals and examples on simulated or real data through scripts in the R language. In addition, the book presents an appendix that brings together the main R-language scripts for importing, pre-processing and organizing the matrices that are necessary for the
development of multivariate models. Finally, to encourage learning, proposed
exercises were created so that students can try to answer them based on theory
and solved examples in each chapter.
The authors
ACKNOWLEDGEMENTS
The support, collaboration and encouragement of several people were important for this textbook, and we would like to give them our special thanks. Firstly,
we would like to thank our dear Professor Luiz Seixas das Neves (Institute
of Chemistry/UFRN) for the learning, for the opportunity to share ideas and
knowledge, and above all, for the consideration and trust placed in our work
throughout our academic career. Secondly, we want to thank Professor Hélio
Scatena Júnior (in memoriam) for teaching the Contemporary Chemistry subject
offered in Chemistry courses at UFRN for some years. We also want to thank the
Research Group in Biological Chemistry and Chemometrics at UFRN, especially
professors Edgar Perin Moraes, Davi Serradella Vieira, Fabrício Gava Menezes
and Luiz Henrique da Silva Gasparotto who enriched the Chemometrics
environment with discussions and applications in studies published over more than a decade. We would also like to thank all the students and collaborators who passed through and were part of the world of Chemometrics built here at UFRN
and in various parts of the world, especially the esteemed Prof. Francis L. Martin
(United Kingdom).
I, Professor Kássio Lima, now want to thank my wife Katarina, who was undoubtedly a great pillar during the writing and preparation of this work. Always helpful, understanding, and the person who always believed in my work. For all her love and friendship, I can only thank her for understanding the number of hours put into this book, especially regarding the education of our beloved children, João Pedro and Maria Luísa. These beloved children, two graces achieved, were and will continue to be my greatest inspirations for this book.
I, Professor Camilo Morais, would like to thank my friends and family, who have always believed in and supported my work. In particular, I want to thank my wife and partner, Julyana, for showing me her love, dedication and friendship at all times during the writing and preparation of this book. Finally, I would like to thank the other authors, Professors Kássio Lima and Ana Carolina, for the invitation, the discussions, the understanding and the hours dedicated to preparing this book with such care.
I, Professor Ana Carolina, would like to especially thank those who have
always been my biggest supporters throughout the journey that brought me here:
my parents, Gladson and Maria da Conceição, and Fabrício, a great friend and
life partner. Without your support, it wouldn't have been possible. I would also like
to thank the other authors, professors Kássio and Camilo, for the invitation, for
the rich exchange of knowledge and for the friendship.
SUMMARY
CHAPTER 1 - DESCRIPTIVE STATISTICS AND ITS PROPERTIES .............. 1
1.6 t-test ...................................................................................... 11
Proposed exercises ..................................................................... 29
Proposed exercises ..................................................................... 65
UNSUPERVISED ANALYSIS ......................................................... 68
3.1 Cluster Analysis ..................................................................... 68
3.3 K-means ................................................................................. 75
SUPERVISED ANALYSIS .............................................................. 88
Proposed exercises ................................................................... 109
4.9 PARAFAC-LDA ...................................................................... 163
Proposed exercises ................................................................... 174
5.7 MCR-ALS .............................................................................. 212
5.8 UPLS/RBL ............................................................................. 218
5.9 N-PLS .................................................................................... 223
Proposed exercises ................................................................... 261
APPENDIX A ............................................................................... 262
CHAPTER 1
Descriptive statistics and its properties
CHAPTER IDEA
We need statistical calculations to make decisions about the quality of our experimental
measurements. In this chapter, some important concepts for this decision making will be
presented, including: normal distribution, parametric and non-parametric statistical tests.
Upon completing the chapter, you should be able to:
a) Explain what a normal distribution is and when its application is appropriate.
The main descriptive statistical parameters used in this chapter are summarized below:
Population mean (µ)
Sample mean (x̄)
Median: the central value in a dataset that has been arranged in order of magnitude.
Population standard deviation (σ)
Sample standard deviation (s)
Combined (pooled) standard deviation
Population variance (σ²)
Sample variance (s²)
Standard error of the mean
Coefficient of Variation (CV)
Relative Standard Deviation (RSD)
Spread or range (f): the difference between the highest value and the lowest value in the set.
Mode (m0): the value (or values) of maximum frequency.
Absolute frequency (fi): the number of times a given value (xi) is observed.
Relative frequency or proportion (p'i): the quotient between the absolute frequency of a value and the total number of observations.
Maximum: the largest value in a set of n observations.
Minimum: the smallest value in a set of n observations.
Quartiles: values that divide a data sample into four equal parts. 1st quartile: the value that leaves 25% of the data below it and 75% above it. 2nd quartile: the value that leaves 50% of the data below it and 50% above it (the median). 3rd quartile: the value that leaves 75% of the data below it and 25% above it.
Pearson's coefficient of skewness (Ap): if |Ap| < 0.15, the distribution is symmetric; if 0.15 ≤ |Ap| ≤ 1, moderately asymmetric; if |Ap| > 1, strongly asymmetric.
Amplitude (R): the difference between the largest and the smallest value in a set of n observations.
Each of these statistical parameters will be calculated and discussed below, through
Example 1.
Example 1: The following example was carried out by students of the Quantitative Analytical Chemistry course at the Federal University of Rio Grande do Norte (semester 2023.1), and presents the experimental results of the calibration of a 10 mL graduated pipette. Using the data in the table below, calculate and discuss all the parameters and graphs generated by this experiment.
Replicate    Volume (mL)
1 9.988
2 9.993
3 9.986
4 9.980
5 9.975
6 9.982
7 9.986
8 9.982
9 9.981
10 9.990
install.packages ("dplyr")
library ( dplyr )
install.packages ("psych")
library ( psych )
pipette = c( 9.988, 9.993, 9.986, 9.980, 9.975, 9.982, 9.986, 9.982, 9.981, 9.990)
data <- data.frame (pipette) # create dataset
View (data)
glimpse (data)
# Amplitude
range( pipette )
hist( pipette )
summary( pipette )
boxplot (pipette)
# Absolute frequencies:
table (pipette)
# Relative frequencies:
The two important parameters of the Gaussian distribution are the mean (µ), which describes where the experimental values are centered, and the variance (σ²), which describes their degree of dispersion. Depending on the parameters (µ and σ), we will have different normal distributions. Furthermore, the variable X is a continuous variable defined on -∞ < x < +∞, and the area under the curve is equal to 1. This area is what we call probability. The probability density function, Equation 1, is described as:

f(x) = (1 / (σ√(2π))) exp[-(x - µ)² / (2σ²)]    (Eq. 1)
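As a quick check of Equation 1 (an illustrative addition, not part of the original scripts), the sketch below evaluates the Gaussian density manually and compares it with R's built-in dnorm(); the mean 3.4 and standard deviation 0.1 are the values assumed later in this chapter.

# Evaluating Eq. 1 manually and comparing with dnorm()
mu    <- 3.4    # assumed mean
sigma <- 0.1    # assumed standard deviation
x     <- seq(3.0, 3.8, by = 0.01)
f_manual <- (1 / (sigma * sqrt(2 * pi))) * exp(-(x - mu)^2 / (2 * sigma^2))
f_dnorm  <- dnorm(x, mean = mu, sd = sigma)
all.equal(f_manual, f_dnorm)      # should return TRUE
plot(x, f_dnorm, type = "l")      # bell-shaped curve centred at mu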
R Script
# Normal distribution in R
install.packages("dplyr")
library(dplyr)
install.packages("psych")
library(psych)
x <- rnorm(1000, mean = 3.4, sd = 0.1)   # simulated sample (values assumed; the original data lines were lost)
# Sample histogram
hist(x)
# Sample statistics
describe(x)
R Script
# Probability calculations
# X ~ N(3.4, 0.1^2)
# P(X < 3.3) = ?
pnorm(3.3, mean = 3.4, sd = 0.1)
# P(3.1 < X < 3.7) = ?
pnorm(3.7, mean = 3.4, sd = 0.1) - pnorm(3.1, mean = 3.4, sd = 0.1)
# sequence of x values used to shade the area under the curve
install.packages ("dplyr")
library( dplyr )
install.packages ("RVAideMemoire")
library( RVAideMemoire )
View (data)
glimpse (data)
shapiro.test(data$Height)
## Loading packages
install.packages("car")
library(car)
## Loading data
# Levene's test
## Loading packages
install.packages("rstatix")
library(rstatix)
# Z ~ N(0, 1)
# phi(-1.64) = P(Z < -1.64) = to be determined
pnorm(-1.64)
qnorm(0.05)
qnorm(0.025)
abs(qnorm(0.025))
When σ is known, the confidence interval for the population mean is given by:

µ = x̄ ± z σ / √N    (Eq. 2)

Equation 2 is only applied in experiments with no systematic errors and in which the values of s are a good approximation of the values of σ.
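The short sketch below is an illustrative addition (the assumed value of σ is invented) showing how Equation 2 can be computed in R for the pipette data of Example 1.

# 95% confidence interval for the mean, assuming sigma is known (Eq. 2)
pipette <- c(9.988, 9.993, 9.986, 9.980, 9.975, 9.982, 9.986, 9.982, 9.981, 9.990)
sigma   <- 0.005                     # assumed known population standard deviation
N       <- length(pipette)
z       <- abs(qnorm(0.025))         # 1.96 for a 95% confidence level
xbar    <- mean(pipette)
c(lower = xbar - z * sigma / sqrt(N),
  upper = xbar + z * sigma / sqrt(N))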
1.6 t-test
Often in Chemometrics, we have a limitation in time or in the number of samples that prevents us from considering s a good estimate of σ. The determination of the confidence interval when σ is unknown was originally proposed by the English statistician W. S. Gosset, in 1908, in a classic article published in the journal Biometrika [4]. A curious fact: to avoid disclosing any commercial secret of his employer (he had been hired by the Guinness brewery to statistically analyze the results of alcohol content determinations), Gosset published the work under the pseudonym "Student".
a. Single-sample t-test: We use this test when we want to compare the mean of a single sample with a known population mean (reference value). After testing, there are two possibilities: either the null hypothesis is not rejected (p > 0.05), indicating that the sample mean does not differ from the reference value, or it is rejected (p < 0.05), indicating a significant difference.
Example 7: We will use the following R script to demonstrate how the single-sample t-test should be employed. We will discuss its results and present some graphs that confirm our observations.
R Script
## Loading packages
install.packages("rstatix")
library(rstatix)
## t-test
# Observation:
# The two-tailed test is the default; if you want a one-tailed test, you need to include:
# alternative = "greater" or alternative = "less"
# Example: t.test(data$Height, mu = 167, alternative = "greater")
# In this case, the test checks whether the sample mean is greater than the tested mean
# Visualization of data distribution
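Because the dataset used in the original example is not reproduced here, the following self-contained sketch (simulated data; the column name Height and the reference value 167 are taken from the comments above, while the numbers are invented) illustrates the complete call.

# Single-sample t-test on simulated data
set.seed(123)
data <- data.frame(Height = rnorm(30, mean = 168, sd = 5))
t.test(data$Height, mu = 167)                              # two-tailed (default)
t.test(data$Height, mu = 167, alternative = "greater")     # one-tailed variant
hist(data$Height)                                          # visualizing the distribution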
b. t-test for independent samples: We use this test when we want to compare the means of two independent samples.
Example 8: Use the following R script to calculate the independent-samples t-test. We will discuss the results and evaluate its assumptions.
R Script
## Installing packages, if not done previously
install.packages("dplyr")
library(dplyr)
install.packages("RVAideMemoire")
library(RVAideMemoire)
install.packages("car")
library(car)
# Observation:
# var.equal = FALSE does not treat the two variances as equal between the two groups
# The two-tailed test is the default; if you want a one-tailed test, you need to include:
# alternative = "greater" or alternative = "less"
# Example: t.test(Grade_Biology ~ Room_Position, data, var.equal = TRUE, alternative = "greater")
# In this case, the test checks whether the mean of the first group is greater than the mean of the second
# R is considering "Front" as the first group
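Since the original dataset is not reproduced here, the sketch below (simulated data; the names Grade_Biology, Room_Position and the groups "Front"/"Back" come from the comments above, while the numbers are invented) shows a complete, runnable version of the test.

# Independent-samples t-test on simulated data
library(car)      # for leveneTest()
set.seed(123)
data <- data.frame(
  Room_Position = factor(rep(c("Front", "Back"), each = 20)),
  Grade_Biology = c(rnorm(20, mean = 7.5, sd = 1), rnorm(20, mean = 6.8, sd = 1))
)
leveneTest(Grade_Biology ~ Room_Position, data = data)    # homogeneity of variances
t.test(Grade_Biology ~ Room_Position, data = data, var.equal = TRUE)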
c. t-test for dependent (paired) samples: The t-test for dependent samples is used when you want to compare the means of two samples that are dependent, that is, when the two sets of measurements are obtained from the same subjects (for example, before and after a treatment).
Example 9: Using the following R script, we will show how the paired t-test is performed and how its results are interpreted.
R Script
## Load the packages that will be used
install.packages("dplyr")
library(dplyr)
install.packages("psych")
library(psych)
data <- read.csv('Database 4.csv', sep = ';', dec = ',', fileEncoding = "latin1")
View(data)
glimpse(data)
summary(data$Seizures_PT)
summary(data$Seizures_S1)
describe(data$Seizures_PT)
describe(data$Seizures_S1)
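The paired t.test() call itself does not appear in the extracted script, so the sketch below (simulated counts; the column names follow the script above, the values are invented) shows how it would be run.

# Paired t-test on simulated data
set.seed(123)
data <- data.frame(Seizures_PT = rpois(20, lambda = 8))
data$Seizures_S1 <- pmax(data$Seizures_PT - rpois(20, lambda = 2), 0)
t.test(data$Seizures_PT, data$Seizures_S1, paired = TRUE)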
The one-way ANOVA has the following requirements:
I. the populations must present normal distributions with the same variance;
II. the samples must be random and mutually independent;
III. the different samples are obtained from populations classified according to just one category (factor).
Once these requirements are met, we need to declare the null (H0) and alternative (H1) hypotheses for the one-way ANOVA. H0 assumes that the means of all populations are equal (µ1 = µ2 = µ3 = ... = µk) and that the variance between groups (variation due to the factor or treatment) is not significantly larger than the variance within groups (variation due to chance). On the other hand, H1 assumes that at least one population mean is different, that is, there is an effect of the factor or treatment.
Therefore,
H0: there are no differences between the group means → p > 0.05
H1: there are differences between the group means → p < 0.05
The basic idea of ANOVA also consists of the partition of variability, in which one part represents the variability between groups and the other represents the variability due to other factors (within groups). A common practice for testing H0 and H1 is to summarize the results of the one-way ANOVA test in the form of a table, as shown below:
Source of variation | Sum of Squares (SS) | Degrees of Freedom (DF) | Mean Square (MS) | F
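The sketch below is an illustrative addition (simulated data, invented group names) showing how summary(aov()) produces exactly this table in R.

# One-way ANOVA on simulated data: three groups, one response
set.seed(123)
data <- data.frame(
  Group    = rep(c("A", "B", "C"), each = 10),
  Response = c(rnorm(10, mean = 10), rnorm(10, mean = 12), rnorm(10, mean = 10.5))
)
model <- aov(Response ~ Group, data = data)
summary(model)   # prints SS, DF, MS, F and the p-value for the Group factor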
Example 10: Let's look at an R script of a 1-way ANOVA with repeated measures.
We will take it step by step and discuss its main results.
R Script
## Loading packages
install.packages("psych")
library(psych)
install.packages("ggplot2")
library(ggplot2)
# Shows boxplots of the classifiers, side by side, in relation to the PCC performance response
summary(data.anova)
install.packages("easyanova")
library(easyanova)
R Script
install.packages ("dplyr")
library( dplyr )
install.packages ("car")
library(car)
install.packages ("rstatix")
library( rstatix )
install.packages ("emmeans")
library( emmeans )
install.packages ("ggplot2")
library(ggplot2)
data <- read.csv( 'Database 6.csv', sep = ';', dec = ',', fileEncoding = "latin1")
# Here, all the observations follow a normal distribution and the assumption has
been met for ANOVA
# By default, the test performed by the car package is based on the median ( median )
# Mean-based testing is more robust
# Changed to be average based
install.packages ("easyanova")
library(easyanova)
shapiro.test ( model$residuals )
boxplot ( model$residuals )
data$residuals <- model$residuals
data %>% group_by( Gender, Alcohol )%>%
identify_outliers( residuals )
data %>% identify_outliers ( residuals )
## Verification of homogeneity of variances - Levene's test ( car package )
## Model creation:
# type III = sum of squares of residuals and does not take into account the order
of factors
# type I = depends on the order of factors at insertion time
# H0 p > 0.05
# Ha p < 0.05
# Interaction plot (ggplot2 package)
## With genders with different colors
In the example above, the two-way ANOVA showed that there is an effect of alcohol, F(2,42) = 20.06, p < 0.001, and of the interaction between alcohol and gender, F(2,42) = 11.91, p < 0.001, on memory. Subsequent analyses (estimated marginal means, with Bonferroni correction) showed that alcohol consumption did not affect the memory of female individuals, but the consumption of 4 mugs decreased the memory score of male individuals when compared with individuals of the same gender who did not consume alcohol or consumed only 2 mugs.
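A sketch of how such estimated marginal means could be obtained with the emmeans package is given below; it is an illustrative addition in which the data (variable names Memory, Alcohol and Gender, group sizes and effect sizes) are simulated rather than taken from the original study.

# Estimated marginal means with Bonferroni correction for a two-way ANOVA
library(emmeans)
set.seed(123)
data <- data.frame(
  Gender  = factor(rep(c("Female", "Male"), each = 24)),
  Alcohol = factor(rep(rep(c("None", "2 mugs", "4 mugs"), each = 8), times = 2))
)
data$Memory <- round(rnorm(48, mean = 25, sd = 3)) -
  ifelse(data$Gender == "Male" & data$Alcohol == "4 mugs", 6, 0)
model <- aov(Memory ~ Alcohol * Gender, data = data)
pairs(emmeans(model, ~ Alcohol | Gender), adjust = "bonferroni")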
R Script
# Tukey test
### Who differs from whom? To answer this, we need to compare the means.
### We will perform the Tukey test.
### For p-values < α we can say that the means differ
### at a significance level of 5% (α = 0.05).
### When p > α it is not possible to say that the means differ.
### The same result can be expressed by the test graph
# tk_test <- TukeyHSD(model)   # (assumed definition; the line creating tk_test is missing from the extracted script)
plot(tk_test)
boxplot(data)
In R we can also use a statistical package called DescTools to perform post-hoc analysis.
install.packages("DescTools")
library(DescTools)
# PostHocTest
# Using Duncan
# Using TukeyHSD
# Using Bonferroni
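The calls corresponding to these comments are not present in the extracted script; the sketch below (simulated data, invented group names) shows one way the three post-hoc procedures could be run.

# Post-hoc comparisons with DescTools on a simulated one-way ANOVA
library(DescTools)
set.seed(123)
data  <- data.frame(Group = rep(c("A", "B", "C"), each = 10),
                    y     = c(rnorm(10, 10), rnorm(10, 12), rnorm(10, 10.5)))
model <- aov(y ~ Group, data = data)
PostHocTest(model, method = "duncan")       # Duncan
TukeyHSD(model)                             # Tukey HSD (base R)
PostHocTest(model, method = "bonferroni")   # Bonferroni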
a) Friedman
The Friedman test is a non-parametric statistical test developed by Milton Friedman [6-8] that is similar to ANOVA. It is used to detect differences in treatments across multiple test attempts (blocks). The procedure involves ranking the values within each row (or block) and then analyzing the rank values by column.
R Script
## Loading Packages
install.packages("dplyr")
library(dplyr)
install.packages("rstatix")
library(rstatix)
install.packages("reshape")
library(reshape)
install.packages("PMCMRplus")
library(PMCMRplus)
install.packages("ggplot2")
library(ggplot2)
View(data)
glimpse(data)
View(datal)
glimpse(datal)
## Another option:
## friedman.test(datal$Grade, datal$Professor, datal$ID)
# Post-hoc testing
### Others:
frdAllPairsNemenyiTest(datal$Grade, datal$Teacher,
                       datal$ID, p.adjust.method = "bonferroni")
frdAllPairsConoverTest(datal$Grade, datal$Teacher,
                       datal$ID, p.adjust.method = "bonferroni")
# Data visualization
# Distribution analysis
par(mfrow = c(2, 2))
hist(datal$Grade[datal$Teacher == "Professor1"],
     ylab = "Frequency", xlab = "Grades", main = "Teacher 1")
hist(datal$Grade[datal$Teacher == "Professor2"],
     ylab = "Frequency", xlab = "Grades", main = "Teacher 2")
hist(datal$Grade[datal$Teacher == "Professor3"],
     ylab = "Frequency", xlab = "Grades", main = "Teacher 3")
hist(datal$Grade[datal$Teacher == "Professor4"],
     ylab = "Frequency", xlab = "Grades", main = "Teacher 4")
b) Wilcoxon
This non-parametric test, developed by F. Wilcoxon in 1945 [9], is used as an alternative to the paired Student's t-test when the samples do not follow a normal distribution. Thus, the Wilcoxon test is used to test whether the sample medians are equal in cases where the normality hypothesis is not accepted or when it is not possible to check this assumption. There are two important assumptions in this non-parametric test: i) the dependent variable must be numeric or ordinal categorical; ii) the independent variable must be composed of two dependent (paired) groups.
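A minimal illustration of the paired Wilcoxon test (simulated, deliberately non-normal data; the object names are invented) is sketched below.

# Paired Wilcoxon signed-rank test on simulated data
set.seed(123)
before <- rexp(20, rate = 0.2)          # skewed, non-normal measurements
after  <- before - runif(20, 0, 3)      # same subjects after an intervention
wilcox.test(before, after, paired = TRUE)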
Example 14: Below there is an R script to check the Wilcoxon test on a given dataset.
R Script
install.packages ("dplyr")
library (dplyr)
install.packages ("rstatix")
library (rstatix)
View (data)
glimpse (data)
# Wilcoxon test
# Observation:
# The two-tailed test is the default; If you want one-tailed, you need to include:
# alternative = "greater" or alternative = "less"
# Example: wilcox.test( data$Convulsoes_PT , data$Convulsoes_S1,
# paired = TRUE, alternative ="greater")
# In this case, the test will check whether the median of Seizures_PT is greater
than the median of Seizures_S1
c) Mann-Whitney
R Script
# Load the packages that will be used
install.packages("dplyr")
library(dplyr)
install.packages("rstatix")
library(rstatix)
# Observation:
# The two-tailed test is the default; if you want a one-tailed test, you need to include:
# alternative = "greater" or alternative = "less"
# Example: wilcox.test(Grade_History ~ Room_Position, data = data, alternative = "greater")
# In this case, the test checks whether the median of the first group is greater than the median of the second
# R is considering "Front" as the first group
# Parametric data
# Distribution preview
par(mfrow = c(1, 2))
hist(data$Grade_Biology[data$Room_Position == "Front"],
     ylab = "Frequency", xlab = "Grade", main = "Front Group")
hist(data$Grade_Biology[data$Room_Position == "Back"],
     ylab = "Frequency", xlab = "Grade", main = "Back Group")
d) Kruskal-Wallis test
The non-parametric Kruskal-Wallis test, named after William Kruskal and W. Allen
Wallis [11] in 1952, is a method used to test whether samples originate from the same
distribution. We typically use it to compare two or more independent samples of equal or
different sizes. The parametric equivalent of this test is the F test used in one-way ANOVA.
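Before the extracted script below, the sketch that follows (simulated data; the group labels mirror those used later in the script, while the numbers are invented) shows a complete Kruskal-Wallis test followed by Dunn's post-hoc test from the rstatix package.

# Kruskal-Wallis test and Dunn's post-hoc test on simulated data
library(rstatix)
set.seed(123)
data <- data.frame(Group = rep(c("Placebo", "AH New", "AH Default"), each = 12),
                   BC    = c(rexp(12, 0.10), rexp(12, 0.12), rexp(12, 0.15)))
kruskal.test(BC ~ Group, data = data)
dunn_test(data, BC ~ Group, p.adjust.method = "bonferroni")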
R Script
install.packages("dplyr")
library(dplyr)
install.packages ("rstatix")
library(rstatix)
install.packages("ggplot2")
library(ggplot2)
View (data)
glimpse (data)
# Kruskal-Wallis test
# Post-hoc testing
# Dunn's test with p-value adjustment
# Data visualization
par(mfrow=c(1,2))
boxplot(BC ~ Group, data = data)
boxplot(Pressure ~ Group, data = data)
# Distribution analysis
par(mfrow=c(1,3))
hist(data$BC[data$Group=="Placebo"],
ylab="Frequency",xlab="bps",main="Placebo")
hist(data$BC[data$Group == "AH New"],
ylab = "Frequency", xlab = "bps", main="AH New")
hist(data$BC[data$Group == "AH Default"],
ylab = "Frequency", xlab = "bps", main="AH Default")
par(mfrow=c(1,3))
hist(data$Pressure[ data$Group == "Placebo"],
ylab ="Frequency", xlab ="bps", main ="Placebo")
hist(data$Pressure[ data$Group == "AH New"],
ylab ="Frequency", xlab ="bps", main ="AH New")
hist(data$Pressure[ data$Group == "AH Default"],
ylab ="Frequency", xlab ="bps", main ="AH Default")
Below there is a table that summarizes the main statistical concepts discussed
throughout Chapter 1 for comparison purposes.
PROPOSED EXERCISES
01 – Propose a chemical experiment that contains replicas and, based on the
observations, present the statistical parameters described using an R script.
02 – There is a repository of univariate and multivariate datasets on the internet (https://archive.ics.uci.edu/) from which you must choose one or more datasets, readable in the R language, to carry out a detailed descriptive statistical study. Present your results.
03 – Propose an experiment or use a dataset from a public database (preferably) to
apply the t-test for a single sample. Build an R script to demonstrate your hypotheses and
main conclusions.
04 – Propose an experiment or use a dataset from a public database (preferably) to
apply the t-test for independent samples. Build an R script to demonstrate your hypotheses
and main conclusions.
05 – Propose an experiment or use a dataset from a public database (preferably)
to apply the paired t-test. Build an R script to demonstrate your hypotheses and main
conclusions.
REFERENCES
[1] Shapiro, S. S.; Wilk, M. B. (1965). An analysis of variance test for normality (complete samples). Biometrika. 52 (3-4): 591-611.
[2] Levene, Howard (1960). Robust tests for equality of variances. In: Ingram Olkin; Harold Hotelling. Contributions to Probability and Statistics: Essays in Honor of Harold Hotelling. [S.l.]: Stanford University Press. pp. 278-292.
[3] Neyman, J. (1937). Outline of a theory of statistical estimation based on the classical theory of probability. Philosophical Transactions of the Royal Society. 236: 333-380.
[4] Pearson, E. S. (1939). William Sealy Gosset, "Student", as statistician. Biometrika. 30 (3-4): 210-250.
[5] Tukey, John (1949). Comparing individual means in the analysis of variance. Biometrics. 5 (2): 99-114.
[6] Friedman, Milton (1937). The use of ranks to avoid the assumption of normality implicit in the analysis of variance. Journal of the American Statistical Association. 32 (200): 675-701.
[7] Friedman, Milton (1939). A correction: The use of ranks to avoid the assumption of normality implicit in the analysis of variance. Journal of the American Statistical Association. 34 (205): 109.
[8] Friedman, Milton (1940). A comparison of alternative tests of significance for the problem of m rankings. The Annals of Mathematical Statistics. 11 (1): 86-92.
[9] Wilcoxon, Frank (1945). Individual comparisons by ranking methods. Biometrics Bulletin. 1 (6): 80-83.
[10] Mann, H. B.; Whitney, D. R. (1947). On a test of whether one of two random variables is stochastically larger than the other. Annals of Mathematical Statistics. 18: 50-60.
[11] Kruskal, William H.; Wallis, W. Allen (1952). Use of ranks in one-criterion variance analysis. Journal of the American Statistical Association. 47 (260): 583-621.
CHAPTER 2
Design of experiments
CHAPTER IDEA
Design of Experiments (DOE) or experimental design is an ideal technique for
studying the effect of a set of several factors on a response variable of interest. The main
objectives of experimental planning are to reduce process time, reduce the number of tests
without compromising the quality of information, reduce operational costs and increase the
quality and yield of processes.
In this chapter, after analyzing the data presented in the examples, some practical
conclusions from the experiment will be detailed through graphs to present the results and
some confirmation tests. Upon completing the chapter, you should be able to:
a) Understand the main terms (response variable, levels, factors or treatments,
randomization and blocks) used in planning;
e) Carry out a selection of variables that influence the process with a reduced number
of tests;
f) Build new scripts in R language for decision making using experiment design;
a) Recognition of the problem;
e) Formulation of a hypothesis;
f) Analysis of results;
Basically, there are three techniques used to define the tests in experimental planning: i) use of replicates; ii) randomization; and iii) blocking.
Replication consists of repeating a test under pre-established conditions, providing an estimate of how the experimental error may affect the results of the experiments and of whether these results are statistically different. Randomization is a purely statistical experimental design technique in which the sequence of tests is random, and the choice of the materials that will be used in these tests is also random. Experimental blocking is a statistical technique that consists of organizing the experimental units into groups (blocks) that are similar to each other. The purpose of blocking is to contain and evaluate the variability produced by the nuisance (disturbing) factors of the experiment. Blocking allows you to create a more homogeneous experiment and to increase the precision of the responses that are analyzed. The central idea of blocking is to make the experimental units (EUs) homogeneous within the blocks.
Now, some experimental designs will be presented (full factorial, fractional factorial,
Box-Behnken, mixtures, multi-levels) through examples written in the R language.
If factor 1 has n1 levels, factor 2 has n2 levels, ..., and factor k has nk levels, the design will be an n1 x n2 x ... x nk factorial. Note that this does not necessarily mean that only n1 x n2 x ... x nk experiments will be carried out; this is the minimum number required for a complete factorial design. Factorial planning is indicated for starting the experimental procedure when there is a need to identify the influencing variables and to study their effects on the chosen response variable.
R Script
# levels
levels = c(-1, 1)
# planning
plan = expand.grid(levels, levels)
plan = rbind(plan, plan)   # assumed step (lost in extraction): two replicates of the 2^2 design, matching the 8 responses below
# column names
colnames(plan) = c("x1", "x2")
# Response
y = c(5.1, 8.6, 7.7, 13.5, 5.0, 8.5, 8.2, 13.9)
#############################################################
# Step-by-step analysis
# Planning (model) matrix
x = model.matrix(~ x1 * x2, data = plan)
x
# Effects
effects = crossprod(x, y) / (2 * 2^2 / 2)
effects
# Coefficients
coef = effects / 2
coef
# Fitted values
fitted = x %*% coef
fitted
# Residuals
resi = y - fitted
resi
## Sums of squares
N = length(y)   # total number of runs (definition assumed; the original line was lost)
# Residual (error) SS
SSE = sum(resi^2)
# Total SS
SST = sum(y^2) - sum(y)^2 / N
## Degrees of freedom
# of errors (runs minus the number of estimated coefficients; the original definition of r was lost, so ncol(x) is used here)
DFE = N - ncol(x)
# total
DFT = N - 1
## Mean squares
# of errors
MSE = SSE / DFE
# total
MST = SST / DFT
# t calculated
t0 = coef / sqrt(MSE / N)
t0
# t critical
t_critical = qt(0.025, df = DFE, lower.tail = F)
t_critical
# p-value
pvalue = 2 * pt(abs(t0), df = DFE, lower.tail = F)
pvalue
ttest = data.frame(coef, rep(sqrt(MSE / N), 4), t0, pvalue)
colnames(ttest) = c("coef", "SE_coef", "t0", "p-value")
## ANOVA of the effects
# Sum of squares of the effects (reconstructed assumption: in a two-level design, SS_effect = N * effect^2 / 4, each with 1 degree of freedom)
SS_x = N * effects[-1]^2 / 4
MS_x = SS_x
# F calculated
F0 = MS_x / MSE
F0
# p-value
p = pf(F0, 1, DFE, lower.tail = F)
p
#############################################################
# Least squares in matrix form
# Multiplying X transposed by X
t(x) %*% x
# Multiplying X transposed by y
t(x) %*% y
# Obtaining the coefficients
beta_mat = solve(t(x) %*% x) %*% t(x) %*% y
beta_mat
R Script
### Factors
# x1 = pH
# x2 = flow rate, vz ( mL /min)
# x3 = eluent concentration, cE (mol/L)
### Response
# y = normalized instrumental peak height
###### Planning
install.packages("FrF2")
library(FrF2)
# planning
plan = FrF2(nruns = 16,
            nfactors = 3,
            factor.names = list(pH = c(4, 8),
                                vz = c(2, 8),      # mL/min
                                CE = c(0.7, 2)),   # mol/L
            randomize = F)
# randomize = F because the experiment has already been carried out; otherwise use T
summary(plan)
# adding response
# add.response command from FrF2 package
plan = add.response(plan, y)
summary( plan )
anova2 = aov(lm2)
summary(lm2)
# model comparison
anova(lm1, lm2) #best complete lm2
# confidence interval for coefficients
confint(lm2)
#### Assumptions
#normality
shapiro.test(lm2$residuals)
par(mfrow = c(2,2))
plot(lm2)
par(mfrow = c( 1,1))
# homoscedasticity
install.packages("olsrr")
library(olsrr) # another package option for the Breusch-Pagan test
ols_test_breusch_pagan(lm2, rhs = T, multiple = T)
###### Graphics
# via FrF2
MEPlot(lm2)
IAPlot(lm2)
install.packages("ggpubr")
library(ggpubr)
# effects graph
p1 = ggline( data = plan ,
x = "pH", y = "y",
add = c("mean_se","jitter"),
color = "blue") + theme_bw ( )
p1
p3
ggarrange(p1,p2,p12)
ggarrange(p1,p2,p12,pPar)
# Contour plots
library(ggplot2)
plan2 = rbind(plan2,plan2)
colnames(plan2) = c("pH", "vz", "CE")
plan2$y = y
# model decoded via expand.grid
lm3 = lm(y ~ pH * vz * CE, plan2)
lm3 # never test significance of decoded coefficients
# mesh
grid = expand.grid(pH = seq(4,8, length = 40),
vz = seq( 2,8, length = 40),
CE = seq( 0.7,2, length = 40))
# contour plots
ggarrange(cp1,cp2,cp3)
###### Prediction
lm3$fitted.values
2.5 Unreplicated Factorial Design
Example 3: This example was taken from the study [2] in which the influence of
four factors on flexural strength (response) was investigated using an unreplicated factorial
design. In this design, we will use the unrepx package of R software.
R Script
## Factors
# dc = density of the compound
# vf = fiber fraction volume
# tp = pyrolysis temperature
# ta = softening temperature of the synthesized methyl-polycarbosilane
## response
# FS = flexural strength
# https://doi.org/10.1007/s12034-017-1535-5
######
# planning
library(FrF2)
#####
# Calculating the effects
N = dim(X)[1]
effects = crossprod(X,FS)/(N/2)
effects
#######################
### Working with the unrepx package for unreplicated factorial
# Package authored by Professor Russell V. Lenth
install.packages("unrepx")
library(unrepx)
# Effects
effects2 = yates(FS) # responses must be in standard order
effects2
attr(effects2,"mean")
Table 2: Fractional factorial design matrix 2^(4-1)
Example 4: This example was taken from the study [3], in which the adsorption efficiency of thiomethoxam was investigated using oxidized multi-walled carbon nanotubes (MwCNTs) as adsorbents through a fractional factorial design (2^(5-1)). In this design, we will use the FrF2 and unrepx packages of the R software.
R Script
## via FrF2
library(FrF2)
# design - 2^(5-1)
frac_p = FrF2(nruns = 16,   # (assumed reconstruction of the call's first line: 2^(5-1) = 16 runs)
              nfactors = 5,
              factor.names = c("x1", "x2", "x3", "x4", "x5"),
              randomize = F,
              alias.info = 3)   # 3rd-order alias information
summary(frac_p)
# complete model
lm1 = lm(y ~ .^5, data = frac)
summary(lm1)
# 2nd-order model to avoid confounding (interactions from 3rd to 5th order removed)
lm2 = lm(y ~ .^2, data = frac)
summary(lm2)
# graphics
MEPlot(lm2)
IAPlot(lm2)
# confounding (alias) structure of lm2
aliases(lm2)
#t critical
t_critical = qt(0.025, df.residual(lm3), lower.tail = F) # t critical
# t calculated
MSE = deviance(lm3)/df.residual(lm3) # MSE, obs = deviance = sum(lm3$residuals^2)
SE_coef = sqrt(MSE/16) #standard error of coefficients
t0 = lm3$coefficients/SE_coef # t0
# data frame for t0
t_0 = data.frame(names(coef(lm3)),abs(t0))
colnames(t_0) = c("term","t0")
#####
## Via Lenth's pseudo standard error
library(unrepx)
#design matrix
X = model.matrix(~x1*x2*x3*x4*x5, data = frac[,-6])
X
# effects
effects = crossprod(X, qty)/(16/2)   # here some effects are confounded (aliased)
effects = effects[2:16]              # only main effects and 2nd-order interactions
names(effects) = c("x1", "x2", "x3", "x4", "x5", "x1x2", "x1x3", "x1x4", "x1x5",
"x2x3", "x2x4", "x2x5", "x3x4", "x3x5", "x4x5")
effects
#graphics
hnplot(effects, method = "Lenth", ID = ME(effects))
parplot(effects, method = "Lenth")
Example 5: The following example, also using a fractional factorial design (2^(5-1)), is taken from study [4] on the wire-cut electrical discharge machining (EDM) of HSLA (high-strength low-alloy) steel. In this design, we will use the FrF2 and unrepx packages of the R software.
R Script
### via FrF2
library(FrF2)
# design - 2^(5-1)
# complete model
lm1 = lm(y ~ .^5, data = frac2)
summary(lm1)
# 2nd-order model to avoid confounding (interactions from 3rd to 5th order removed)
lm2 = lm(y ~ .^2, data = frac2)
summary(lm2)
lm5 = step(lm4, direction = "backward")
summary(lm5)
#############################################
## Via Lenth 's pseudo standard error (Hypothesis test)
library(unrepx)
# design matrix
X = model.matrix(~x1*x2*x3*x4*x5, data = frac2[,-6])
X
# effects
effects = crossprod(X, Ra)/(16/2)   # here some effects are confounded (aliased)
effects = effects[2:16] # only main effects and 2nd order interactions
names(effects) = c("x1", "x2", "x3", "x4", "x5", "x1x2", "x1x3", "x1x4", "x1x5",
"x2x3", "x2x4", "x2x5", "x3x4", "x3x5", "x4x5")
effects
#graphics
hnplot(effects, method = "Lenth", ID = ME(effects))
parplot(effects, method = "Lenth")
Table 3: Central composite design matrix for 3 factors
The value of α is a new parameter that must be declared and lies between 1 and √k, where k represents the number of factors investigated. The value of α is based on the distance of the axial points in relation to the central points and is related to the shape and size of the central composite design domain, which is spherical when α = √k or cubic when α = 1, with an impact on the rotatability of the design.
As in other experimental designs, the model coefficients are obtained by regression, and their significance is assessed through ANOVA, hypothesis tests or other statistical operations used to extract information from the design. Finally, the mathematical model generates a response surface with the aim of optimizing the system and obtaining a graphical visualization of the optimum point.
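As an illustrative addition (the number of centre points and the choice of α below are assumptions, not values from the original examples), the sketch shows how a central composite design for three coded factors can be generated with the rsm package.

# Central composite design for k = 3 coded factors (rsm package)
library(rsm)
ccd3 <- ccd(basis = ~ x1 + x2 + x3,   # three coded factors
            n0 = c(3, 3),             # centre points in the cube and star portions
            alpha = sqrt(3),          # alpha = sqrt(k): spherical design domain
            randomize = FALSE)
ccd3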
Example 6: The following example consists of the optimization of a factorial design
with a central point for determining the copper content in different water samples by ICP-
OES, extracted from the study [1]. In this design, we will use the FrF2 and rsm packages
of R software.
R Script
# Factors
# x1: pH
# x2: flow rate - vz(mL/min)
# Response
# y: relative analytical signal (resulting from normalized instrumental peak measurements)
##
# via FrF2
library(FrF2)
plan.ctpt = FrF2(nruns = 4,
nfactors = 2,
ncenter = 3,
factor.names = c("x1", "x2"),
randomize = F)
summary(plan.ctpt)
plan.ctpt$y = y
summary(plan.ctpt)
# ANOVA
summary(aov(lm1))
################################################
## Performing design with a central point using the rsm package
install.packages("rsm")
library(rsm)
plan.ctpt2 = cube(basis = ~x1+x2,
n0 = 3,
randomize = F)
plan.ctpt2$y = y
plan.ctpt2
#####################
# Graphics for main effects and interaction
library(ggpubr)
p1 <- ggline(plan.ctpt ,
x = "x1",
y = "y",
add = c("mean"),
color = "blue") + theme_bw()
p2 <- ggline(plan.ctpt,
x = "x2",
y = "y",
add = c("mean"),
color = "green") + theme_bw()
plan.plot = plan.ctpt
plan.plot$x1 = as.factor(plan.ctpt$x1)
plan.plot$x2 = as.factor(plan.ctpt$x2)
ggarrange(p1,p2,p12)
The resulting Box-Behnken design is very economical and efficient, as it generates a small number of tests; in addition, it has rotational properties and constitutes an alternative to the central composite design.
The construction of the design matrix can be exemplified by the scheme in Table 4, which considers three blocks in the composition of the tests for three factors studied. Each β symbol, in each of the blocks, is replaced by the two-level coded column of the corresponding factor, extracted from the matrix of a 2^2 design, and the x is filled with a column of zeros. The procedure is repeated for each block and, at the end, at least three tests are added at the central point, resulting in 15 tests, as shown in Table 5.
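An illustrative sketch of this 15-run design, generated with the bbd() function of the rsm package (the choice of three centre points and of no randomization are assumptions made here), is shown below.

# Box-Behnken design for three factors: 12 edge-midpoint runs + 3 centre points = 15 tests
library(rsm)
bbd3 <- bbd(k = 3, n0 = 3, randomize = FALSE)
bbd3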
R Script
#####
install.packages("rsm")
library(rsm)
# width deviation
width = c(5.3533, 5.2615, 5.0008, 4.2712, 4.5840, 2.7470, 3.8086, 3.9839, 4.3630,
3.5519, 4.0534,
4.0031, 5.1495, 4.5581, 4.1959, 3.5946, 5.1642, 4.0103, 3.6354, 4.2529,
3.5171, 4.4485,
5.3879, 3.4132, 3.8905, 4.3263, 4.2263, 3.9451, 3.9024)
design$y = width
###############################################
# Analysis
rsm.bbd = rsm(y ~ SO(x1,x2,x3,x4), data = design)
summary(rsm.bbd)
###############################################
# Normality
shapiro.test(rsm.bbd$residuals)
###############################################
# Graphics
# Contour plots
par(mfrow = c(2,3))
contour(rsm.bbd , ~x1 + x2, image = TRUE)
contour(rsm.bbd , ~x1 + x3, image = TRUE)
contour(rsm.bbd , ~x1 + x4, image = TRUE)
contour(rsm.bbd , ~x2 + x3, image = TRUE)
contour(rsm.bbd , ~x2 + x4, image = TRUE)
contour(rsm.bbd , ~x3 + x4, image = TRUE)
# Perspective plots
persp(rsm.bbd , ~x1 + x2, zlab = "y [%]", col = rainbow(50), contours = "colors")
persp(rsm.bbd , ~x1 + x3, zlab = "y [%]", col = rainbow(50), contours = "colors")
persp(rsm.bbd , ~x1 + x4, zlab = "y [%]", col = rainbow(50), contours = "colors")
persp(rsm.bbd , ~x2 + x3, zlab = "y [%]", col = rainbow(50), contours = "colors")
persp(rsm.bbd , ~x2 + x4, zlab = "y [%]", col = rainbow(50), contours = "colors")
persp(rsm.bbd , ~x3 + x4, zlab = "y [%]", col = rainbow(50), contours = "colors")
###############################################################
# Restricted optimization
optimal = steepest(rsm.bbd , dist = seq(0, sqrt(2), by = .1), descent = T)
par(mfrow = c(2,3))
contour(rsm.bbd , ~x1+x2, col = "black", decode = F, at = x_)
points(x2 ~ x1, data = optimal , col = "blue", pch = "*")
Example 8: The following example consists of the use of a multi-level factorial design applied to the turning of ABNT 1045 steel, extracted from study [6]. In this design, we will use the DoE.base and phia packages of the R software, together with a restricted optimization of the design.
R Script
library(DoE.base)
# Planning encoded
# Cutting force - Fc
design$Fc = Fc
design2$Fc = Fc
################################################ #
## Analysis####
# ANOVA for Fc
res.Fc = aov(Fc ~ CB*f* vc , data = design)
summary(res.Fc )
lm.Fc = lm(Fc ~ CB*f* vc , data = design2)
summary(lm.Fc )
####### Assumptions
# Normality
shapiro.test(res.Fc $residuals )
par(mfrow = c(2,2))
plot(res.Fc )
# Homoscedasticity
library(car)
leveneTest(Fc ~ CB*f* vc , data = design)
library(emmeans)
Tukey.Fc = emmeans(res.Fc, ~ CB | f)   # Tukey-style comparisons via emmeans
plot(Tukey.Fc)
install.packages("ScottKnott")
library(ScottKnott)
sk2 = with(design,
SK(x = res.Fc, # Scott-Knott
y = Fc,
model = 'Fc ~ CB*f',
which = 'f:CB',
fl1 = 2))
sk3 = with(design,
SK(x = res.Fc, # Scott-Knott
y = Fc,
model = 'Fc ~ CB*f',
which = 'f:CB',
fl1 = 3))
summary(sk1) # equal characters, equal statistical averages
summary(sk2)
summary(sk3)
par(mfrow=c(1,3))
plot(sk1)
plot(sk2)
plot(sk3)
install.packages("phia")
library(phia)
IM = interactionMeans(res.Fc)
IM
plot(IM)
library(ggpubr)
p1 <- ggline(design,
x = "CB",
y = "Fc",
add = c("mean", "jitter"),
color = "blue") + theme_bw()
p2 <- ggline(design,
x = "f",
y = "Fc",
add = c("mean", "jitter"),
color = "red") + theme_bw()
p3 <- ggline(design,
x = "vc",
y = "Fc",
add = c("mean", "jitter"),
color = "green") + theme_bw()
ggarrange(p1,p2,p3)
add = c("mean", "jitter"),
color = "CB") + theme_bw()
ggarrange(p12,p13,p23)
Example 9: The following example consists of the non-linear optimization of the response surface for the components of a catalyst, taken from study [7]. In this design, we will use the rsm and phia packages of the R software and a ridge analysis of the design.
R Script
#######
library(rsm)
y = c(88.36, 93.40, 89.22, 92.02, 91.28, 92.02, 89.62, 88.92, # factorials points
85.98, 89.72, 91.43, 88.53, 95.66, 94.63, # axial points
94.38, 94.53, 94.08) # central points
plan$y = y
################################################ ########################
## Analysis via RSM
####### Assumptions
# Normality
shapiro.test(res.rsm$residuals )
#### Graphics
# Contour and Surface
par(mfrow = c(2,3))
contour(res.rsm , ~x1 + x2, image = TRUE)
contour(res.rsm , ~x1 + x3, image = TRUE)
contour(res.rsm , ~x2 + x3, image = TRUE)
persp (res.rsm , ~x1 + x2, zlab = "y", col = rainbow(50), contours = ("colors"))
persp (res.rsm , ~x1 + x3, zlab = "y", col = rainbow(50), contours = ("colors"))
persp (res.rsm , ~x2 + x3, zlab = "y", col = rainbow(50), contours = ("colors"))
##### Optimization
# Ridge analysis (steepest ascent restricted to a sphere of the given radius)
optimum = steepest(res.rsm , dist = seq (0, radius, by =.1), descent = F)
# maximize response
optimum
par(mfrow = c(1,3))
contour(res.rsm , ~x1 + x2, col = "black", decode = F)
points(c(-1, 1,- 1,1,-radius,radius,0,0,0), c(-1,-1,1,1,0,0,-radius,radius,0), col
= "blue", pch = 19)
points(x2 ~ x1, data = optimum , col = "magenta", pch = "*")
Eq. 1
Fig. 2.1: Simplex-lattice design for three components.
The simplex-lattice design is an experimental design in which the points are located on the edges or boundaries of the simplex. For three components and degree two it is represented by six points, with the pure components at the vertices of the triangle and the binary mixtures at the midpoints of the edges. The simplex-lattice design was presented by Scheffé (1958) at the beginning of the studies on experiments with mixtures.
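A short sketch of this {3, 2} simplex-lattice, generated with the SLD() function of the mixexp package (the function arguments below are assumptions chosen to reproduce the six-point lattice just described), is given below.

# {3, 2} simplex-lattice design: 3 pure components + 3 binary (50:50) blends
library(mixexp)
plan.simplex <- SLD(fac = 3, lev = 2)   # proportions of x1, x2, x3
plan.simplex
DesignPoints(plan.simplex)              # plots the six points on the ternary diagram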
Example 10: The following example consists of using simplex-lattice mixture design
to optimize the natural fiber composition of a fiber-reinforced composite in 3 factors (F1 –
sisal fiber, F2 – jute fiber, F3 – coconut fiber), extracted from the study [8]. In this design, we
will use the mixexp and NlcOptim package of R software.
R Script
install.packages("mixexp")
library(mixexp)
# Design planning
DesignPoints((plan.simplex))
# response - specific breaking stress (SBS)
y = c(28.56, 21.73, 26.38, 33.71, 24.22, 22.93,
29.58, 20.98, 25.9, 32.98, 23.98, 21.79,
29.26, 21.23, 26.65, 34, 23.15, 22.17)
plan.simplex$SBS = y
############ Analysis
# complete model
###### Graphics
ModelPlot(model = res.composite,
dimensions = list( x1="x1", x2="x2", x3="x3"),
contour = T,
fill = T,
axislabs = c("sisal", "juta", "coconut"),
color.palette = cm.colors ,
colorkey = T)
# Effects chart
ModelEff(nfac = 3,
mod = 2,
dir = 2,
nproc = 0,
ufunc = res.composite )
### Assumptions
#normality
shapiro.test(res.composite$residuals )
par(mfrow = c(2,2))
plot(res.composite)
install.packages("NlcOptim")
library(NlcOptim)
# equality constraint
cons_eq = function(x){
g = x[1] + x[2] + x[3] -1
return(list( ceq = g,c = NULL))
}
# initial x
obj(x0)
cons_eq(x0)
# optimization
opt = solnl(X = x0, objfun = obj, confun = cons_eq, lb = rep (0,3), ub = rep(1,3))
# Optimal proportions
x_optim = opt$par
x_optim
# Optimal response
y_ = opt$fn
y_
2.12 Simplex-centroid Design
The simplex-centroid is a design in which the points are located at the vertices and on the edges or borders of the simplex, with the addition of the central point (centroid). The simplex-centroid allows you to reduce the number of coefficients in a model and is an alternative to the simplex-lattice design. The difference between the two designs is that the simplex-centroid creates additional points always aligned with the centroid, while in the simplex-lattice the points cover the entire internal space.
R Script
install.packages("mixexp")
library(mixexp)
# Simplex-centroide q = 3
plan.centroide = SCD(3)
# Drawing planning
DesignPoints((plan.centroide))
################# Analysis
# complete model
###### Graphics
ModelPlot(model = res.centroide ,
dimensions = list( x1="x1", x2="x2", x3="x3"),
contour = T,
fill = T,
axislabs = c( "oleo_soy", "tallow_beef", "fat_birds"),
color.palette = terrain.colors ,
colorkey = T)
# Effects chart
ModelEff(nfac = 3,
mod = 4,
dir = 2,
nproc = 0,
ufunc = res.centroide )
### Assumptions
# Normality
shapiro.test(res.centroide$residuals)
library(NlcOptim)
# equality constraint
cons_eq = function(x){
g = x[1] + x[2] + x[3] -1
return( list( ceq = g,c = NULL))
}
# initial x
# test objective function and constraint
obj(x0)
cons_eq(x0)
# optimization
opt = solnl(X = x0, objfun = obj, confun = cons_eq, lb = rep (0,3), ub = rep(1,3))
# Optimal proportions
x_optim = opt$par
x_optim
# Optimal response
y_ = opt$fn
y_
PROPOSED EXERCISES
01 – Propose a complete factorial design (2 levels) using a script in R language and
present your main conclusions.
02 – There are multivariate data repositories on the internet (web of science, Science
Direct, and others) in which you must choose a full 3-level factorial design to perform a
statistical study in detail. Present the main results and conclusions.
03 – Present a fractional factorial design through a real example containing an R
script to demonstrate your hypotheses and main conclusions.
04 – Propose a Box-Behnken experiment and perform a statistical interpretation
presenting its main conclusions.
05 – Present a simplex-lattice design of mixtures and your main conclusions. Also, perform a non-linear optimization or a ridge analysis of the experiment.
06 – In the same way as the previous exercise, present a simplex-centroid design of mixtures and your main conclusions. Also, perform a non-linear optimization or a ridge analysis of the experiment.
07 – Present a multi-level factorial design and its main conclusions using the R
language.
REFERENCES
1 – Escudero, L. A.; Cerutti, S.; Olsina, R. A.; Salonia, J. A.; Gasquez, J. A. (2010). Factorial design optimization of experimental variables in the on-line separation/preconcentration of copper in water samples using solid phase extraction and ICP-OES determination. Journal of Hazardous Materials. 183 (1-3): 218-223.
2 – Kumar, S.; Bablu, M.; Janghela, S. et al. (2018). Factorial design, processing, characterization and microstructure analysis of PIP-based C/SiC composites. Bull Mater Sci. 41 (17): 1-13.
3 – Hu, X.; Xu, J.; Wu, C. et al. (2017). Ethylenediamine grafted to graphene oxide@Fe3O4 for chromium(VI) decontamination: Performance, modeling, and fractional factorial design. PLoS One. 12 (10): 1-14.
4 – Azam, M.; Jahanzaib, M.; Abbasi, J. A. et al. (2016). Parametric analysis of recast layer formation in wire-cut EDM of HSLA steel. Int J Adv Manuf Technol. 87: 713-722.
5 – Azhikannickal, Elizabeth; Uhrin, Aaron (2019). Dimensional Stability of 3D Printed Parts: Effects of Process Parameters. The Ohio Journal of Science. 119 (2): 9-16.
6 – Pereira, R. B. D.; Braga, D. U.; Nevez, F. O. et al. (2013). Analysis of surface roughness and cutting force when turning AISI 1045 steel with grooved tools through Scott-Knott method. Int J Adv Manuf Technol. 69: 1431-1441.
7 – Li, S.; Qin, X.; Zhang, G. et al. (2020). Optimization of content of components over activated carbon catalyst on CO2 reforming of methane using multi-response surface methodology. International Journal of Hydrogen Energy. 45 (16): 9695-9709.
9 – Orives, J. R.; Galvan, D.; Coppo, R. L. et al. (2014). Multiresponse optimization on biodiesel obtained through a ternary mixture of vegetable oil and animal fat: Simplex-centroid mixture design application. Energy Conversion and Management. 398-404.
CHAPTER 3
Pattern recognition
" There is no short cut to truth, no way to gain a knowledge of the
universe except through the gateway of scientific method " Karl
Pearson (1857-1936)
CHAPTER IDEA
The term pattern can be defined as the opposite of chaos, or as a loosely defined entity that can be given a name. Formally, pattern recognition is the area of science that aims to classify objects (patterns) into a number of categories or classes. Thus, for a given set of c classes, ω1, ω2, ..., ωc, and an unknown pattern x, a pattern recognizer is a system that, aided by pre-processing, feature extraction and feature selection, associates x with the label i of a class ωi.
We can find pattern recognition techniques in several applications, such as: i)
analysis of genome sequences, in microarray applications and technology (bioinformatics);
ii) data mining; iii) medical diagnosis; iv) biometric recognition; v) remote sensing using
multispectral images; and, vi) speech recognition.
Basically, pattern recognition techniques are based on three main steps: i)
data acquisition for extraction and selection of the most informative features; ii) data
representation; and, iii) construction of a classifier or descriptor for decision making.
Generally, the classifier used in pattern recognition learns how to map the feature
data space onto the set of class labels. Thus, pattern classification techniques can be
grouped, according to the type of learning, into two forms: i) unsupervised analysis (the
patterns have no defined class labels); and ii) supervised analysis (the patterns belong to
pre-defined classes).
Upon completing the chapter, you should be able to:
a) Apply the main unsupervised analysis algorithms (HCA, K-means and PCA) to a
set of real or simulated data, seeking to identify patterns in the analyzed data;
b) Apply the main supervised analysis algorithms (LDA, QDA, KNN, SVM and
decision trees) to real or simulated data sets in the construction of classification
models;
c) Understand the stages of building multivariate classification models and evaluating
the models (confusion matrix and ROC curve);
e) Build new scripts in R language for decision making using the pattern recognition
technique;
f) Propose new applications in chemistry or related areas of pattern recognition
techniques.
UNSUPERVISED ANALYSIS
In HCA, metric distances are calculated between the samples (objects) that form
the data set, and these are grouped according to the degree of similarity presented. HCA
comprises agglomerative and divisive ways of forming clusters. In agglomerative procedures
(most common), we start with the instances forming disjoint unitary groups (singletons),
that is, each of the n instances in the data set will be assigned to a group (cluster). In
divisive hierarchical analysis, the process occurs in the opposite order to the agglomerative
one. The results provided by HCA are called dendrograms (Figure 3.1), which graphically
express the distance (similarity) between the samples.
Several criteria can be adopted to choose the number of clusters. The desired
number of clusters may be known in advance, or a predetermined distance value may be
used as a cut-off to separate the clusters. Alternatively, the number of clusters is chosen by
inspecting the dendrogram in light of prior knowledge of the data.
The cluster (linkage) criterion, that is, how the distance between an object (or cluster)
and cluster k is computed, can be defined in several ways (a short R sketch comparing
these criteria follows this list):
I. Single linkage – based on the shortest distance between objects in opposite
clusters. This procedure is also known as the nearest-neighbour method;
II. Complete linkage – based on the greatest distance between objects in
opposite clusters. In general, small, compact, spherical and well-separated
clusters tend to form;
III. Centroid linkage – a centroid is calculated as the average of the objects in a
cluster, and the distance between clusters is the distance between their
centroids. Spatial distortion of the grouping is avoided, and small groups tend
to be preserved in relation to larger ones;
IV. Ward's method – The clusters are aggregated in such a way as to minimize
the sum of squares of the deviations of each centroid in relation to the group
itself.
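As a minimal sketch (on a small simulated matrix, not data from this book), the four criteria above can be compared directly with the hclust function, which implements them as the methods "single", "complete", "centroid" and "ward.D2":
# Minimal sketch: comparing linkage criteria on simulated data
set.seed(1)
X <- matrix(rnorm(40), ncol = 2)               # 20 objects, 2 variables (assumed example data)
d <- dist(X, method = "euclidean")             # distance matrix between objects
hc.single   <- hclust(d, method = "single")    # single linkage (nearest neighbour)
hc.complete <- hclust(d, method = "complete")  # complete linkage
hc.centroid <- hclust(d, method = "centroid")  # centroid linkage (squared distances are usually recommended)
hc.ward     <- hclust(d, method = "ward.D2")   # Ward's method
par(mfrow = c(2, 2))
plot(hc.single); plot(hc.complete)
plot(hc.centroid); plot(hc.ward)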
Example 1: In this example, an HCA is performed for a simulated dataset using the
dist function in the R language.
R Script
# Simulated data (illustrative values for the two variables A and B)
A = rnorm(20, mean = 0)
B = rnorm(20, mean = 3)
data = cbind(A, B)
# Distance calculation
?dist
d = dist(data, method = "euclidean")
# Dendrogram
dendo = hclust(d)
plot(dendo)
Example 2: In this example, an HCA is performed for an ICP-OES dataset using the
R packages factoextra and NbClust .
R Script
# Loading the dataset
D1= read.table("ICPOES.txt")
D2= D1[,-1]
# Analyzing variance
# Variables with different scales and different variances can distort the analysis
apply(D2, 2, var)
#Standardizing variables
DP <- scale(D2)
apply(DP, 2, mean)
apply(DP, 2, var)
#---------------------------------------------------------------- --------------
# 2. Selection of grouping criteria
# Database rows must represent observations (samples)
# that you want to group.
# Columns must be formed by variables.
#---------------------------------------------------------------- --------------
# Select the similarity (or dissimilarity) criterion that will determine
# which observations are similar, and should be grouped into a given group,
# and which are not similar, and must be in different groups.
# For a similarity measure, the lower its value, the more similar two observations are.
# Calculating Euclidean distance
d.eucl <- dist(DP, method = "euclidean")
#---------------------------------------------------------------- ---------------
# 3. Selection of clustering algorithm
# Hierarchical x Non-hierarchical
#---------------------------------------------------------------- ----------------
#---------------------------------------------------------------- ----------------
# 4. Defining the number of clusters
#---------------------------------------------------------------- ----------------
#Loading the factoextra package
install.packages("factoextra")
library(factoextra)
# Hierarchical model using Ward's criterion (assumed, consistent with the ward.D2 method used below)
hc.m <- hclust(d.eucl, method = "ward.D2")
# Obtaining the dendrogram
fviz_dend(hc.m, cex = 0.5)
#------------------------
#Loading the NbClust package and computing all cluster-number indices
install.packages("NbClust")
library(NbClust)
nb.c <- NbClust(DP, distance = "euclidean", min.nc = 2,
                max.nc = 10, method = "ward.D2", index = "all")
#For just one indicator, use the help
?NbClust
fviz_nbclust(nb.c)
#-----------------------
#For pseudo-f('ch')
nb.i <- NbClust(DP, distance = "euclidean", min.nc = 2,
max.nc = 10, method = "ward.D2", index = "ch")
fviz_nbclust(nb.i)
#----------------------------------------------------------------------------------
#5. Interpretation and validation of groupings
#----------------------------------------------------------------------------------
#Getting the groupings
g <- cutree(hc.m,k =3)
#Number of members in each group
table(g)
R Script
# load the TP dataset (89 x 9)
D1 = read.table("TP.txt")
D2 = D1[,-1]
#Standardizing variables
DP <- scale(D2)
apply(DP,2,mean)
apply(DP,2,var)
#---------------------------------------------------------------- --------------
# 2. Selection of similarity or dissimilarity criterion
# For a similarity measure, the lower its value, the more similar two observations are.
# Calculating Euclidean distance
d.eucl <- dist(DP, method = "euclidean")
#---------------------------------------------------------------- ---------------
# 3. Selection of clustering algorithm
# Hierarchical x Non-hierarchical
#---------------------------------------------------------------- ----------------
#---------------------------------------------------------------- ----------------
# 4. Defining the number of clusters
#---------------------------------------------------------------- ----------------
#Loading the factoextra package
install.packages("factoextra")
library(factoextra)
#------------------------
#Loading the NbClust package and computing all cluster-number indices
install.packages("NbClust")
library(NbClust)
nb.c <- NbClust(DP, distance = "euclidean", min.nc = 2,
                max.nc = 10, method = "ward.D2", index = "all")
#For just one indicator, use the help
?NbClust
fviz_nbclust(nb.c)
#-----------------------
#For pseudo-f('ch')
nb.i <- NbClust(DP, distance = "euclidean", min.nc = 2,
max.nc = 10, method = "ward.D2", index = "ch")
fviz_nbclust(nb.i)
#----------------------------------------------------------------------------------
#5. Interpretation and validation of groupings.
#----------------------------------------------------------------------------------
#Getting the groupings
# Hierarchical model using Ward's criterion (assumed, as in the previous example)
hc.m <- hclust(d.eucl, method = "ward.D2")
g <- cutree(hc.m, k = 4)
#Number of members in each group
table(g)
3.3 K-means
K-means is a partitional (non-hierarchical), center-based technique; that is, the groups
formed by this technique are represented by a centroid (a central point of the group).
K-means was proposed in a pioneering work by S. Lloyd in 1957 but was only published in
1982 [2]. For Lloyd, the centroid c was chosen as the point that minimizes the sum of the
squared Euclidean distances, dE, between itself and each point xi in the set, according to
the equation:
c = arg min_c Σi dE(xi, c)² Eq. 1
The aim of the K-means algorithm is to minimize the sum of squared errors (SSE) over all k
groups:
SSE = Σ_{j=1}^{k} Σ_{xi ∈ Cj} || xi − cj ||² Eq. 2
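As a brief illustration of Eq. 2 (a sketch on simulated data, not part of the examples that follow), the quantity minimized by K-means is returned by kmeans() as tot.withinss and can be checked against a manual sum of squared distances to the assigned centroids:
# Minimal sketch: the K-means objective of Eq. 2 on simulated data
set.seed(7)
X <- matrix(rnorm(60), ncol = 2)            # 30 simulated objects
km <- kmeans(X, centers = 3, nstart = 25)
# manual computation of the within-cluster sum of squares
sse <- sum((X - km$centers[km$cluster, ])^2)
sse
km$tot.withinss                             # same value reported by kmeans()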
R Script
# loading data
D1= read.table ("ICPOES.txt")
D2=D1[,-1]
DP <- scale(D2)
km.res <- kmeans(DP, 3, nstart = 25)
print(km.res)
# visualizing the clusters
install.packages("cluster")
library(cluster)
clusplot (DP, km.res$cluster , main='Two-dimensional cluster representation',
color=TRUE, shade=TRUE,
labels=2, lines=0)
R Script
install.packages("FactorMineR")
install.packages("factoextra")
install.packages("cluster")
install.packages("ggplot2")
library(FactoMineR)
library(factoextra)
library(cluster)
library(ggplot2)
# importing data (assumed: the same ICP-OES dataset used in the previous example)
D1 = read.table("ICPOES.txt")
D2 = D1[,-1]
#Boxplot
boxplot(D2$Ca)
#Scaling
DP = scale(D2)
#Generate kmeans
data_kmeans = kmeans(DP, 5)
#visualizing the clusters
fviz_cluster(data_kmeans, data = DP, ellipse.type = "t")
list = data_kmeans$cluster
#Grouping data into a table
general_data = cbind(D2, list)
general_data
3.4 Principal Component Analysis (PCA)
R Script
#loading the ICP-OES dataset (54 x 7)
D1 = read.table("ICPOES.txt")
D2 = D1[,- 1]
#---------------------------------------------------------------- --
#Calculating the correlation matrix (X)
cor.c <- cor(D2)
#---------------------------------------------------------------- ---
#Calculating eigenvalues and eigenvectors for cor.c
ev <- eigen(cor.c)
################################################ #######
#Principal component analysis.
install.packages(c("FactoMineR", #for analysis
"factoextra" #to plot the principal components
))
library(FactoMineR)
library(factoextra)
# PCA model on the autoscaled data (assumed construction with the PCA() function)
res.pca <- PCA(D2, scale.unit = TRUE, graph = FALSE)
print(res.pca)
#Correlation circle
fviz_pca_var ( res.pca , #name of the object that saved the results
col.var = "black" ,
repel = TRUE, # Prevent text from overlapping
title = "Correlation circle, variables x PC" #title
)
install.packages("corrplot")
library(corrplot)
#Biplot graph (scores and loadings)
#---------------------------------------------------------------- ----------
#Variables that are correlated with PC1 (Dim1) and PC2 (Dim2)
head(res.pca$var$contrib, 4)
corrplot(res.pca$var$contrib, is.corr=FALSE)
#---------------------------------------------------------------- ---------
#To export the results (principal component scores), we use
library(FactoMineR)
cp = res.pca$ind$coord
write.infile(cp, "cp.csv", sep = ";")
Example 7: In this example, the PCA algorithm is applied to the Iris dataset (150 x 5)
using the FactoMineR and factoextra packages. Note that PCA is performed both for the
original dataset and for the same dataset after pre-processing (done using the "scale" function).
R Script
# loading dataset
D = iris
D1 = data.frame(D[,c(1:4)]) # original dataset
D2 = scale(D1) # pre-processed dataset
#-------------------------------------------
#calculating the var-cov matrix (X)
cov.c = cov(D1)
cov.c2 = cov(D2)
#------------------------------------------------
#calculating the correlation matrix
cor.c = cor(D1)
cor.c2 = cor(D2)
#calculating the eigenvalues and the percentage explained by each component (Yi = ei1x1 + ei2x2 + ... + eipxp)
c.values = eigen(cov.c)$values
c.values2 = eigen(cov.c2)$values
var.total = sum(c.values)
var.total2 = sum(c.values2)
per.var = c((c.values/var.total)*100)
per.var
per.var2 = c((c.values2/var.total2)*100)
per.var2
#########################################################################
#Principal component analysis
library(FactoMineR)
library(factoextra)
library(ggplot2)
# PCA models for the original and the pre-processed data (assumed construction)
res.pca = PCA(D1, scale.unit = FALSE, graph = FALSE)
res.pca2 = PCA(data.frame(D2), scale.unit = FALSE, graph = FALSE)
eig.val = get_eigenvalue(res.pca)
eig.val2 = get_eigenvalue(res.pca2)
#---------------------------------------------------------------- ---
#Correlation circle
fviz_pca_var ( res.pca ,
col.var = "black",
repel = TRUE,
title = "Correlation circle, variables X PC") #title
fviz_pca_var ( res.pca2 ,
col.var = "black",
repel = TRUE,
title = "Correlation circle, variables X PC") #title
install.packages("corrplot")
library(corrplot)
#----------------------------------------------------------------------------------
#Variables that are correlated with PC1 (Dim1) and PC2 (Dim2)
head(res.pca$var$contrib,4)
corrplot(res.pca$var$contrib , is.corr = FALSE)
head(res.pca2$var$contrib,4)
corrplot(res.pca2$var$contrib, is.corr = FALSE)
fviz_contrib (res.pca2, choice = "var", axes = 1, top = 10,
title ="contribution of variables to Dim1")
cp = res.pca$ind$coord
cp2 = res.pca2$ind$coord
library(FactoMineR)
3.5 Multivariate Curve Resolution with Alternating Least Squares (MCR-ALS)
In 1971, Lawton and Sylvestre [4] presented to the scientific community the
methodologies known as curve resolution in the study entitled "Self Modeling Curve
Resolution". Basically, these researchers presented a method for determining the shapes
of two overlapping functions f1(x) and f2(x) from an observed set of additive mixtures,
{αi f1(x) + βi f2(x); i = 1, ..., n}, of the two functions.
From this initial study, there was an evolution of Multivariate Curve Resolution
(MCR) methods for analytical signal processing whose goal is to resolve mixtures of non-
selective signals originating from an instrument (D) into real contributions from the pure
components in the system (represented by the concentration profiles in C and spectral
profiles in ST), as exemplified in the equation below:
D = C Sᵀ Eq. 3
C = D S (SᵀS)⁻¹ Eq. 4
Sᵀ = (CᵀC)⁻¹ Cᵀ D Eq. 5
In this process, a matrix D reconstructed from the product CSᵀ, in which C or Sᵀ comes
from the initial estimate, is calculated and compared with the original matrix D. The iterative
optimization continues until the convergence criterion is met. Figure 3.2 shows a graphical
representation of the MCR-ALS method.
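A minimal sketch of the alternating least-squares updates of Eqs. 4 and 5, assuming a simulated data matrix D and a rough initial estimate of the spectral profiles S (the non-negativity and other constraints applied in practice are omitted):
# Minimal sketch: unconstrained ALS updates on simulated data
set.seed(3)
C.true <- cbind(seq(1, 0, length = 20), seq(0, 1, length = 20))        # concentration profiles
S.true <- cbind(dnorm(1:50, 15, 4), dnorm(1:50, 35, 4))                # spectral profiles
D <- C.true %*% t(S.true) + matrix(rnorm(20 * 50, sd = 1e-4), 20, 50)  # simulated data
S <- S.true + matrix(rnorm(50 * 2, sd = 0.01), 50, 2)                  # rough initial estimate of S
for (it in 1:50) {
  C <- D %*% S %*% solve(t(S) %*% S)        # Eq. 4: C = D S (S'S)^-1
  S <- t(solve(t(C) %*% C) %*% t(C) %*% D)  # Eq. 5: S' = (C'C)^-1 C'D
}
# relative lack of fit between D and the reconstructed C S'
sqrt(sum((D - C %*% t(S))^2) / sum(D^2))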
Figure 3.2: Graphical representation of the MCR-ALS method.
R Script
## Loading Packages
install.packages("ALS")
library(ALS)
data(multiex)
## MCR-ALS
plotS(mcr$S,x2)
## comparing the mass spectra of the known standards S with the recovered profiles mcr$S (match factor)
matchFactor(S[,1], mcr$S[,1])
matchFactor(S[,2], mcr$S[,2])
SUPERVISED ANALYSIS
Supervised pattern recognition is a machine learning method that uses techniques to
identify similarities and differences between different types of samples. Pattern recognition
techniques are based on the following assumptions: i) Samples of the same type are similar;
and, ii) The classes are defined by the system designer.
In general, supervised algorithms present a classifier, also called a model, that will be
able to predict the class of a new set of data. The classifier produced can also be described
as a function f, which takes a given x and provides a prediction y.
The KNN (K-Nearest Neighbours) methodology is based on three fundamental steps:
i) determining the distance between a new sample and the samples in the training set;
ii) identifying the K closest samples, that is, those with the most similar characteristics; and
iii) assigning to the unknown element the most frequent class among its k nearest neighbours
(a minimal sketch of these steps is given below).
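A minimal sketch of these three steps using the knn function of the class package on the iris dataset (an assumed example; the leafshape dataset is used in Example 9 below):
# Minimal sketch: KNN with k = 5 on the iris data
library(class)
set.seed(10)
idx   <- sample(nrow(iris), 100)            # training indices
train <- scale(iris[idx, 1:4])              # step i) distances are computed on scaled features
test  <- scale(iris[-idx, 1:4],
               center = attr(train, "scaled:center"),
               scale  = attr(train, "scaled:scale"))
# steps ii) and iii): find the k = 5 nearest neighbours and assign the majority class
pred <- knn(train, test, cl = iris$Species[idx], k = 5)
table(pred, iris$Species[-idx])             # confusion matrix for the test samples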
Example 9: Analysis of a data set (leafshape, from the DAAG package) to perform KNN
using the DAAG, caret and class packages.
R Script
# reading data
install.packages("DAAG")
library(DAAG)
data("leafshape")
?leafshape
data = leafshape
data = na.omit(data)
data = data[,-c(1:3,9)]
set.seed(2)
install.packages("caret")
library(caret)
# plotting
pairs(data, col = rainbow(2)[as.factor(data$arch)])
################################################ #########################
# separating training and testing data
set.seed(13)
tr = round(0.5* nrow (data))
training = sample(nrow(data), tr, replace = F)
data.training = data[training,]
data.test = data[-training,]
x.training = data.training[,-5]
x.test = data.test[,-5]
y.training = data.training[,5]
y.test = data.test[,5]
################################################################
# knn for k = 5 neighbors
library(class)
knn5 = knn(train = x.training, test = x.test, cl = y.training, k = 5)
# confusion matrix
table(knn5, y.test)
mean(knn5 == y.test)
### Preview
### Training
ggplot ( ) +
geom_raster ( aes (x= grid$logwid , y = grid$logpet , fill = grid$class ),
alpha = 0.3, interpolate = T) +
geom_point ( aes (x = data.training$logwid , y = data.training$logpet ,
color = as.factor ( data.training$arch ),
shape = as.factor ( data.training$arch )), size = 2) +
labs( x = "logwid", y = "logpet",
col = "arch", shape = "arch", fill = "arch") + theme_bw ( )
#### Test
ggplot ( ) +
geom_raster ( aes (x= grid$logwid , y = grid$logpet , fill = grid$class ),
alpha = 0.3, interpolate = T) +
geom_point ( aes (x = data.test$logwid , y = data.test$logpet ,
color = as.factor ( data.test$arch ),
shape = as.factor ( data.test$arch )), size = 2) +
labs( x = "logwid", y = "logpet",
col = "arch", shape = "arch", fill = "arch") + theme_bw ( )
l(x) = f1(x) / f2(x) Eq. 6
Assuming that f1 and f2 are normal distribution densities, we can rewrite the probability
ratio as:
Eq. 7
The quality of the discrimination will depend on the degree of overlap between the two
densities. The functions l(x) and −2 log l(x) are therefore called discriminant functions and
have the properties exemplified in Table 1 below:
For the multivariate case, in which we have p variables, with X ~ N(µ1, S1) for class
1 and X ~ N(µ2, S2) for class 2, we can rewrite the discriminant function as:
Eq. 8
Eq. 9
Thus, we classify x in class 1 if −2 log l(x) < 0 and in class 2 if −2 log l(x) > 0. The Fisher
discriminant function corresponds to the particular case in which S1 = S2 = S, for which
we have:
Eq. 10
A sample element with observation vector x would be classified in class 1 if fd(x) > 0
and would be classified in class 2 if fd(x) < 0.
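A minimal sketch of this rule (Eq. 10) for two simulated classes, assuming equal covariance matrices estimated by a pooled matrix S:
# Minimal sketch: Fisher discriminant rule on simulated data
set.seed(5)
n  <- 50
X1 <- matrix(rnorm(2 * n, mean = 0), ncol = 2)   # class 1
X2 <- matrix(rnorm(2 * n, mean = 2), ncol = 2)   # class 2
m1 <- colMeans(X1); m2 <- colMeans(X2)
S  <- ((n - 1) * cov(X1) + (n - 1) * cov(X2)) / (2 * n - 2)   # pooled covariance matrix
Sinv <- solve(S)
# fd(x) = (m1 - m2)' S^-1 x - 0.5 (m1 - m2)' S^-1 (m1 + m2)
fd <- function(x) drop(t(m1 - m2) %*% Sinv %*% x - 0.5 * t(m1 - m2) %*% Sinv %*% (m1 + m2))
x.new <- c(1, 1)
if (fd(x.new) > 0) "class 1" else "class 2"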
Example 10: Analysis of a data set to perform linear discriminant analysis (LDA)
using the R packages datasetsICR, dplyr, GGally, DFA.CANCOR, heplots, MVN and
MASS.
R Script
# Loading packages and data
install.packages("datasetsICR")
library(datasetsICR)
data("seeds")
?seeds
data = seeds
head(data)
levels(data$variety)
library(dplyr)
glimpse(data)
################################################ ########################
# plotting
pairs(data, col = rainbow(3)[data$variety])
library(ggplot2)
install.packages("GGally")
library(GGally)
################################################ #########################
set.seed(1)
tr = round(0.7* nrow(data))
training = sample(nrow(data), tr, replace = F)
training
data.training = data[training,]
data.test = data[-training,]
################################################ ########################
#Assumptions
install.packages("DFA.CANCOR")
library(DFA.CANCOR)
HOMOGENEITY(data.training,group='variety',
variables = c('area','perimeter'))
HOMOGENEITY(data.test,group='variety',
variables = c('area', 'perimeter', 'compactness','length of kernel',
'width of kernel',
'asymmetry coefficient', 'length of kernel groove'))
# H0 p>0.05
# H1 p<0.05
# Multivariate normality
library(MVN)
mvn(data.training, subset = "variety")
################################################ ########################
library(MASS)
#Model 1
fit.lda1 = lda(variety ~ compactness+perimeter, data.training)
fit.lda1
#Graphics
plot(fit.lda1)
pairs(fit.lda1)
# prediction for the training data
lda.pred1 = predict(fit.lda1, data.training)
ldahist(lda.pred1$x[,1], g = lda.pred1$class)
# Confusion matrix
cm1 = table(data.training$variety, lda.pred1$class)
cm1
#Prediction ability
mean(data.training$variety == lda.pred1$class)
library(ggplot2)
# prediction grid over the two variables used in the model
grid = expand.grid(compactness = seq(min(data$compactness), max(data$compactness), length = 200),
                   perimeter = seq(min(data$perimeter), max(data$perimeter), length = 200))
grid$class = predict(fit.lda1, grid)$class
ggplot() +
geom_raster(aes(x=grid$compactness, y = grid$perimeter, fill = grid$class),
alpha = 0.3, interpolate = T) +
geom_point(aes(x = data.training$compactness, y = data.training$perimeter,
color = as.factor(data.training$variety),
shape = as.factor(data.training$variety)), size = 2) +
labs(x = "compactness", y = "perimeter",
col = "variety", shape = "variety", fill = "variety") + theme_bw()
d.plot = data.frame(Class = data.training$variety, lda = lda.pred1$x)
ggplot(d.plot, aes(lda.LD1, lda.LD2, group = Class)) +
geom_point(aes(col = Class), size = 2.5) +
stat_ellipse(aes(fill = Class), geom = "polygon", alpha = .3) +
theme_bw()
# prediction for the test data
lda.pred1.t = predict(fit.lda1, data.test)
# confusion matrix
cm1.t = table(data.test$variety, lda.pred1.t$class)
cm1.t
# prediction ability
mean(data.test$variety == lda.pred1.t$class)
ggplot() +
geom_raster(aes(x=grid$compactness, y = grid$perimeter, fill = grid$class),
alpha = 0.3, interpolate = T) +
geom_point(aes(x = data.test$compactness, y = data.test$perimeter,
color = as.factor(data.test$variety),
shape = as.factor(data.test$variety)), size = 2) +
labs(x = "compactness", y = "perimeter",
col = "variety", shape = "variety", fill = "variety") + theme_bw()
################################################ ##########################
# Training model 2
# Model 2
fit.lda2 = lda( variety ~., data.training )
fit.lda2
plot(fit.lda2)
pairs(fit.lda2)
# prediction for the training data
lda.pred2 = predict(fit.lda2, data.training)
# Confusion matrix
cm2 = table(data.training$variety, lda.pred2$class)
cm2
#Prediction ability
mean(data.training$variety == lda.pred2$class)
library(ggplot2)
d.plot2 = data.frame(Class = data.training$variety, lda = lda.pred2$x)
ggplot(d.plot2, aes(lda.LD1, lda.LD2, group = Class)) +
geom_point(aes(col = Class), size = 2.5) +
stat_ellipse(aes(fill = Class), geom = "polygon", alpha = .3) +
theme_bw()
################################################ ##################
install.packages ("klaR")
library (klaR)
data.training2 = data.training
colnames(data.training2) = c("x1", "x2", "x3", "x4", "x5", "x6", "x7", "variety")
dev.new ( )
partimat(variety ~., data = data.training2, method = "lda")
################################################ #########################
# Test model 2
# prediction for the test data
lda.pred2.t = predict(fit.lda2, data.test)
# confusion matrix
cm2.t = table(data.test$variety, lda.pred2.t$class)
cm2.t
# prediction ability
mean(data.test$variety == lda.pred2.t$class)
# prediction
head(lda.pred2.t$class,10)
# linear discriminants
head(lda.pred2.t$x,10)
################################################ #######################
3.7 Quadratic Discriminant Analysis (QDA)
In quadratic discriminant analysis (QDA), there is no assumption that classes have
equal covariance matrices. As in linear discriminant analysis, an observation is classified
into the group with the smallest squared distance. However, the squared distance does not
result in a linear function, hence the name quadratic discriminant analysis.
In QDA, for each of the classes y, the covariance matrix is given by:
Eq. 11
Taking the log for both sides of the above equation, the quadratic discriminant
function will be given by:
Eq. 12
A sample element with observation vector x would be classified in class 1 if fd(x) > 0
and would be classified in class 2 if fd(x) < 0.
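A minimal sketch of the quadratic rule for two simulated classes with different covariance matrices, assuming equal prior probabilities (constant terms common to both classes are omitted):
# Minimal sketch: quadratic discriminant rule on simulated data
library(MASS)   # for mvrnorm
set.seed(8)
X1 <- mvrnorm(60, mu = c(0, 0), Sigma = matrix(c(1, 0.3, 0.3, 1), 2))
X2 <- mvrnorm(60, mu = c(2, 2), Sigma = matrix(c(2, -0.5, -0.5, 1), 2))
m1 <- colMeans(X1); m2 <- colMeans(X2)
S1 <- cov(X1);      S2 <- cov(X2)          # one covariance matrix per class
# quadratic discriminant score for each class (log-density up to a constant)
q <- function(x, m, S) -0.5 * log(det(S)) - 0.5 * t(x - m) %*% solve(S) %*% (x - m)
x.new <- c(1, 1)
if (q(x.new, m1, S1) > q(x.new, m2, S2)) "class 1" else "class 2"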
Example 11: Analysis of a data set to perform quadratic discriminant analysis (QDA)
using the R packages datasetsICR, dplyr, GGally, DFA.CANCOR, heplots, MVN and
MASS.
R Script
install.packages("datasetsICR")
library(datasetsICR)
install.packages("klaR")
library(klaR)
# Dataset - seeds
data("seeds")
?seeds
data = seeds
head(data)
levels(data$variety)
library(dplyr)
glimpse (data)
################################################ ########################
# plotting
pairs(data, col = rainbow(3)[data$variety])
library(ggplot2)
install.packages ("GGally")
library(GGally)
################################################ ########################
set.seed(1)
# separating training and testing data (assumed 70/30 split, as in the previous example)
tr = round(0.7*nrow(data))
training = sample(nrow(data), tr, replace = F)
data.training = data[training,]
data.test = data[-training,]
################################################ ########################
#Assumptions
install.packages("DFA.CANCOR")
library(DFA.CANCOR)
HOMOGENEITY(data.training,group='variety',
variables = c('area','perimeter'))
HOMOGENEITY(data.training,group='variety',
variables = c('area', 'perimeter', 'compactness','length of kernel',
'width of kernel',
'asymmetry coefficient', 'length of kernel groove'))
# H0 p>0.05
# H1 p<0.05
# Multivariate normality
library(MVN)
mvn(data.training, subset = "variety")
################################################ ########################
library(MASS)
#training model 1
fit.qda1 = qda(variety~compactness+perimeter, data.training )
fit.qda1
# prediction for the training data
qda.pred1 = predict(fit.qda1, data.training)
# Confusion matrix
cm1 = table(data.training$variety, qda.pred1$class)
cm1
# prediction ability
mean(data.training$variety == qda.pred1$class)
library(ggplot2)
# prediction grid over the two variables used in the model
grid = expand.grid(compactness = seq(min(data$compactness), max(data$compactness), length = 200),
                   perimeter = seq(min(data$perimeter), max(data$perimeter), length = 200))
grid$class = predict(fit.qda1, grid)$class
ggplot() +
geom_raster(aes(x=grid$compactness, y = grid$perimeter, fill = grid$class),
alpha = 0.3, interpolate = T) +
geom_point(aes(x = data.training$compactness, y = data.training$perimeter,
color = as.factor(data.training$variety),
shape = as.factor(data.training$variety)), size = 2) +
labs(x = "compactness", y = "perimeter",
col = "variety", shape = "variety", fill = "variety") + theme_bw()
################################
# Test model 1
# prediction for test data
qda.pred1.t = predict(fit.qda1, data.test)
# confusion matrix
cm1.t = table(data.test$variety, qda.pred1.t$class)
cm1.t
# prediction ability
mean(data.test$variety == qda.pred1.t$class)
ggplot() +
geom_raster(aes(x=grid$compactness, y = grid$perimeter, fill = grid$class),
alpha = 0.3, interpolate = T) +
geom_point(aes(x = data.test$compactness, y = data.test$perimeter,
color = as.factor(data.test$variety),
shape = as.factor(data.test$variety)), size = 2) +
labs(x = "compactness", y = "perimeter",
col = "variety", shape = "variety", fill = "variety") + theme_bw()
################################################ #######################
Eq. 13
Eq. 14
Example 12: Analysis of a data set to perform SVM using R packages called mvtnorm
and e1071.
R Script
#### Simulating data
### Parameters to simulate data
library(mvtnorm)
set.seed(14)
m= c(0,0) # vector of means
S= matrix(c(1, 0.2, 0.2,1),2) # covariance matrix
# Simulating data
x = rmvnorm(1000, mean = m, sigma = S)
# class label and data frame (assumed rule: the two classes are separated, with noise, by the line x1 + x2 = 0)
y = factor(ifelse(x[,1] + x[,2] + rnorm(1000, sd = 0.5) > 0, "A", "B"))
data = data.frame(x1 = x[,1], x2 = x[,2], y = y)
#plotting data
library(ggplot2)
ggplot(data, aes(x = x1, y = x2, group = y)) +
  geom_point(aes(color = y)) + theme_bw()
################################################ ################
#Separating training and testing data
set.seed(1)
tr = round(0.7*nrow(data))   # assumed 70/30 split, as in the previous examples
training = sample(nrow(data), tr, replace = F)
data.training = data[training,]
data.test = data[-training,]
################################################ #############
#Library for SVM and other ML methods
library(e1071)
svc1= svm(y~., data = data.training, kernel = "linear", cost = 10, scale = FALSE)
plot(svc1,data.training)
svc1$index # support vectors
summary(svc1) # analysis summary
svc1 = svm( y~ ., data = data.training, kernel = "linear", cost = 100, scale = FALSE)
##################################
### Preview
grid = expand.grid (x1 = seq(min(c(data.training$x1, data.test$x1)),
max(c(data.training$x1, data.test$x1)), length = 200),
x2 = seq(min(c(data.training$x2, data.test$x2)),
max(c(data.training$x2, data.test$x2)), length = 200))
grid$class = predict(svc1, grid)
##Training
ggplot() +
geom_raster(aes(x=grid$x1, y=grid$x2, fill=grid$class),
alpha = 0.3, interpolate = T) +
scale_fill_brewer(palette = "Dark2", drop = FALSE)+
geom_point(aes(x = data.training$x1, y = data.training$x2,
color = data.training$y,
shape = data.training$y), size = 2) +
scale_color_brewer(palette = "Dark2") +
labs(x = "x1", y = "x2", col = "class",
shape = "class", fill = "class") + theme_bw()
## Test
ggplot() +
geom_raster(aes(x=grid$x1, y=grid$x2, fill=grid$class),
alpha = 0.3, interpolate = T) +
scale_fill_brewer(palette = "Dark2", drop = FALSE)+
geom_point(aes(x = data.test$x1, y = data.test$x2,
color = data.test$y,
shape = data.test$y), size = 2) +
scale_color_brewer(palette = "Dark2") +
labs(x = "x1", y = "x2", col = "class",
shape = "class", fill = "class") + theme_bw()
################################################ #################
# SVM with a non-linear kernel (assumed radial basis specification for the svm1 model)
svm1 = svm(y~., data = data.training, kernel = "radial", cost = 10, scale = FALSE)
plot(svm1, data.training)
svm1$index
summary(svm1)
##################################
### Preview
grid = expand.grid (x1 = seq (min(c(data.training$x1, data.test$x1)),
max ( c(data.training$x1, data.test$x1)), length = 200),
x2 = seq ( min(c(data.training$x2, data.test$x2)),
max ( c(data.training$x2, data.test$x2)), length = 200))
grid$class = predict(svm1, grid)
##Training
ggplot() +
geom_raster(aes(x=grid$x1, y=grid$x2, fill=grid$class),
alpha = 0.3, interpolate = T) +
scale_fill_brewer(palette = "Set1", drop = FALSE)+
geom_point(aes(x = data.training$x1, y = data.training$x2,
color = data.training$y,
shape = data.training$y), size = 2) +
scale_color_brewer(palette = "Set1") +
labs(x = "x1", y = "x2", col = "class",
shape = "class", fill = "class") + theme_bw()
## Test
ggplot() +
geom_raster(aes(x=grid$x1, y=grid$x2, fill=grid$class),
alpha = 0.3, interpolate = T) +
scale_fill_brewer(palette = "Set1", drop = FALSE)+
geom_point(aes(x = data.test$x1, y = data.test$x2,
color = data.test$y,
shape = data.test$y), size = 2) +
scale_color_brewer(palette = "Set1") +
labs(x = "x1", y = "x2", col = "class",
shape = "class", fill = "class") + theme_bw()
################################################ ############
# ROC curve
install.packages("ROCR")
library(ROCR)
plot( perf1,
avg = 'vertical',
lwd = 3, main = "ROC curve model scv1 - test data",
col = 'blue')
plot ( perf2,
avg = 'vertical',
lwd = 3, main = "ROC curve model svm1 - test data" ,
col = 'green3')
################################################ #################
##################################
### Preview
grid = expand.grid (x1 = seq (min(c(data.training$x1, data.test$x1)),
max(c(data.training$x1, data.test$x1)), length = 200),
x2 = seq (min(c(data.training$x2, data.test$x2)),
max(c(data.training$x2, data.test$x2)), length = 200))
##Training
ggplot() +
geom_raster(aes(x=grid$x1, y=grid$x2, fill=grid$class),
alpha = 0.3, interpolate = T) +
scale_fill_brewer(palette = "Set1", drop = FALSE)+
geom_point(aes(x = data.training$x1, y = data.training$x2,
color = data.training$y,
shape = data.training$y), size = 2) +
scale_color_brewer(palette = "Set1") +
labs(x = "x1", y = "x2", col = "class",
shape = "class", fill = "class") + theme_bw()
## Test
ggplot() +
geom_raster(aes(x=grid$x1, y=grid$x2, fill=grid$class),
alpha = 0.3, interpolate = T) +
scale_fill_brewer(palette = "Set1", drop = FALSE)+
geom_point(aes(x = data.test$x1, y = data.test$x2,
color = data.test$y,
shape = data.test$y), size = 2) +
scale_color_brewer(palette = "Set1") +
labs(x = "x1", y = "x2", col = "class",
shape = "class", fill = "class") + theme_bw()
################################################ ############
plot ( perf3,
avg = 'vertical',
lwd = 3, main = "ROC curve model svm2 - test data" ,
col = 'green3')
################################################ #################
H(S) = − Σi pi log2(pi) Eq. 15
For C classes, the entropy ranges from 0 (a pure node) to log2(C), reaching its maximum
value when all classes are equally probable. The goal when growing a decision tree is to
choose splits that achieve the lowest possible entropy.
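A minimal sketch of the entropy calculation used when evaluating a split (a simple two-class example, assuming base-2 logarithms):
# Minimal sketch: entropy of a vector of class labels
entropy <- function(y) {
  p <- table(y) / length(y)
  -sum(p * log2(p))
}
y <- c(rep("yes", 5), rep("no", 5))
entropy(y)                 # 1: maximum for two equally probable classes
entropy(rep("yes", 10))    # 0: a pure node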
Example 13: Analysis of a data set to build a decision tree using the R packages
rpart, rpart.plot, caret, Amelia and pROC.
R Script
# Loading and Separating training and testing data
train <- read.csv("trainTitanic.csv", header = T)
test <- read.csv("testTitanic.csv", header = T)
# Variables
# Survived : 0 = No, 1 = Yes
# SibSp : Number of siblings/spouses on board
# Parch : Number of parents/children on board
# Fare: Fare
# Embarked : Port of embarkation C = Cherbourg , Q = Queenstown , S = Southampton
# Pclass : Ship class
# Loading packages
install.packages("rpart.plot")
install.packages("caret")
install.packages("Amelia")
install.packages("pROC")
library(rpart); library(rpart.plot)
library(caret)
library(Amelia)
library(pROC)
# Building and plotting the classification tree (assumed formula using the variables listed above)
fit.tree = rpart(Survived ~ Pclass + SibSp + Parch + Fare + Embarked, data = train, method = "class")
rpart.plot(fit.tree)
PROPOSED EXERCISES
01 – Propose the application of the HCA algorithm through a script in the R language
on an experimental data set, presenting your hypotheses and conclusions.
02 – Choose a dataset from one of the multivariate data repositories available on the
internet (Web of Science, ScienceDirect and others), apply the K-means algorithm in the R
language, and carry out a detailed statistical study. Present the main results and conclusions.
03 – Present a script in the R language for the PCA algorithm for a dataset and
demonstrate your hypotheses and main conclusions.
04 – Propose a script in the R language for the KNN algorithm using a given dataset
and perform a statistical interpretation presenting its main conclusions.
05 – Present a script for the LDA and QDA algorithms in the R language for a given
dataset and present your main results.
06 – Like the previous exercise, present a script in the R language to build a decision
tree algorithm for a dataset. Present your conclusions.
REFERENCES
2 – Lloyd, Stuart P. (1982). Least squares quantization in PCM. IEEE Transactions on Information Theory.
28(2): 129–137.
3 – Pearson, K. (1901). On Lines and Planes of Closest Fit to Systems of Points in Space. Philosophical
Magazine. 2(6):559–572.
4 – Lawton, WH; Sylvestre, EA; (1971). Self modeling curve resolution. Technometrics , 13: 617-633.
5 – Cover, T.; Hart, P. (1967). Nearest neighbor pattern classification. IEEE Transactions on Information
Theory , 13(1): 21-27.
6 – Fisher, RA (1936). The Use of Multiple Measurements in Taxonomic Problems. Annals of Eugenics.
7 (2): 179–188.
7 – Rao, CR (1948) The Use of Multiple Measurements in Problems of Biological Classification. Journal
of the Royal Statistical Society: Series B, 10, 159-203.
CHAPTER 4
Multivariate classification
CHAPTER IDEA
According to the amount of information generated per sample, analytical data can be
categorized into 0th order, 1st order, 2nd order, 3rd order and 4th order: for each sample
analyzed we obtain, respectively, a scalar (0th order), a vector (1st order), a matrix (2nd
order), a three-way array (3rd order) or a four-way array (4th order). In this chapter, we will explore some multivariate
classification algorithms that are typically employed on 1st and 2nd order data in real
datasets. In addition, some methods for selecting samples and variables in multivariate
classification will be presented.
Upon completing the chapter, you should be able to:
a) Apply the main 1st order multivariate classification algorithms coupled with variable
selection methods (PCA-LDA, SPA-LDA, GA-LDA) to a set of real data seeking to
build models and analyze them.
d) Apply the main 2nd order multivariate classification algorithms (Tucker3 and
PARAFAC) using the sample selection algorithms (KS and MLM) and the two
classifiers (LDA and QDA) on real or simulated data sets seeking to build models
and analyze them.
f) Build new scripts in R language for decision making using 1st and 2nd order
multivariate classification.
First order (2-way) data consist of a vector of information per sample, so that the full
dataset forms a two-dimensional matrix X. In this type of arrangement we have the 1st order
advantage, which is the ability to build multivariate models in the presence of interferents
as long as they are present in the training samples. The algorithms used for 1st order data
are based on bilinear models, which interpret an instrumental response function X(r1, r2)
as the product of two independent functions X1(r1) and X2(r2). Bilinearity holds when the
mathematical rank is equal to the chemical rank.
In 2nd order or 3-way data, we have as a response a matrix (second order tensor)
of data for each sample and when the data matrices are placed side by side they generate
a parallelepiped. Here we have the 2nd order advantage which consists of the possibility of
quantifying the analyte in the presence of interferers even if these interferers are not present
in the training set. Furthermore, 2nd order algorithms use a reduced number of samples
in the training set since potential interferents do not need to be modeled. The algorithms
used in 2nd order models use the concept of trilinearity, which consists of a generalization
of bilinearity for a three-way arrangement (IxJxK). Rank deficiency due to the presence of
highly correlated spectral profiles can be a source of trilinearity breakdown .
d(xp, xq) = [ Σj (xpj − xqj)² ]^(1/2) Eq. 1
Where j indexes the covariates and N corresponds to the sample size (p, q = 1, ..., N).
To ensure the uniformity of distribution of each subset throughout the instrumental
response space, KS follows a stepwise procedure, in which a new selection is made in
regions of space far from the already selected samples. With each subsequent iteration, the
algorithm selects the sample that exhibits the greatest minimum distance from an already
selected sample. This procedure is repeated until the number of samples specified by the
analyst is reached.
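A minimal sketch of this stepwise selection using the kenStone function of the prospectr package (the same package used in the examples of this chapter), on an assumed simulated matrix X:
# Minimal sketch: Kennard-Stone sample selection with prospectr
library(prospectr)
set.seed(4)
X <- matrix(rnorm(100 * 6), ncol = 6)            # 100 samples, 6 variables (assumed)
ks <- kenStone(X, k = 70, metric = "euclid")     # select 70 training samples
train <- X[ks$model, ]                           # samples chosen by KS
test  <- X[ks$test, ]                            # remaining samples
length(ks$model); length(ks$test)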
The Morais-Lima-Martin (MLM) algorithm for sample selection was proposed in
2018 [3] and consists of a modification of the Kennard-Stone (KS) algorithm, in which a
random mutation factor is inserted into the latter. That is, after executing the KS algorithm,
some training samples are randomly selected and transferred to the validation set and vice
versa. Usually, this number of transferred samples is 20%, that is, 80% of the validation
set remains with the samples selected by KS, and the rest are samples randomly chosen
from the training set. MLM proved to be superior to the KS algorithm, showing that there is
a synergistic effect when combining the KS deterministic process with a small randomness
when selecting training and validation samples.
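A minimal sketch of the MLM exchange step, assuming the KS split obtained in the previous sketch (index vectors ks$model and ks$test) and a 20% random exchange between the sets:
# Minimal sketch: MLM random mutation applied to a KS split
set.seed(9)
frac <- 0.20                                          # fraction of samples exchanged (assumed 20%)
n.swap <- round(frac * length(ks$test))
out.of.train <- sample(ks$model, n.swap)              # training samples moved to validation
out.of.test  <- sample(ks$test,  n.swap)              # validation samples moved to training
train.mlm <- c(setdiff(ks$model, out.of.train), out.of.test)
test.mlm  <- c(setdiff(ks$test,  out.of.test),  out.of.train)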
Table 4.1: Confusion matrix for a two-class problem (A and B).
                            true class
                            A          B
predicted class      A      TP         FP
                     B      FN         TN
On the other hand, there are other performance metrics that indicate the efficiency of
classification models, shown in Table 4.2: correct classification rate, sensitivity (sens),
specificity (spec), accuracy and the F-score (FS), all computed from the TP, TN, FP and FN
counts defined above.
In addition to these metrics already described, there is a statistical tool that allows
evaluating the performance of a classification system: the ROC curve (Receiver Operating
Characteristic). A ROC curve is a two-dimensional line graph that represents the relationship
between the sensitivity and specificity of a classification model. The index that evaluates the
accuracy of these graphs is the area under the curve (AUC), and the larger the area, the
greater the performance of the system in question. An ideal test is one whose area under
the ROC curve is equal to 1.
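A minimal sketch relating the confusion matrix, the usual definitions of sensitivity, specificity and accuracy (assumed here as the standard formulas computed from TP, TN, FP and FN), and the AUC computed with the pROC package on simulated labels and scores:
# Minimal sketch: classification metrics and ROC curve on simulated results
install.packages("pROC")
library(pROC)
set.seed(11)
truth <- factor(rep(c("A", "B"), each = 50))
score <- c(rnorm(50, 1), rnorm(50, 0))            # higher scores favour class A
pred  <- factor(ifelse(score > 0.5, "A", "B"), levels = c("A", "B"))
cm <- table(predicted = pred, true = truth)
TP <- cm["A", "A"]; FP <- cm["A", "B"]; FN <- cm["B", "A"]; TN <- cm["B", "B"]
sens <- TP / (TP + FN)                            # sensitivity
spec <- TN / (TN + FP)                            # specificity
acc  <- (TP + TN) / sum(cm)                       # accuracy
roc.obj <- roc(truth, score, levels = c("B", "A"))
auc(roc.obj)                                      # area under the ROC curve
plot(roc.obj)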
4.5 – PCA-LDA
Mathematically, the PCA-LDA (Principal Component Analysis with Linear Discriminant
Analysis) multivariate classification algorithm is built in six main steps:
Step 2: Xtrain is decomposed into k PCs, which are defined by the product between
the training scores vector (ttrain) and the loadings vector transposed (lTtrain):
Step 4: The discrimination (Di) of the score vector ti related to each principal
component is determined and ranked in descending order of discrimination:
Di = Sbi / Swi Eq. 4
Where Sbi and Swi correspond to the inter and intra class dispersions for the score
vector ti, respectively. Intra-class dispersion Swi is defined as:
Swi = Σ_{j=1}^{C} Sij Eq. 5
Where C is the number of classes in the data set and Sij is the dispersion of ti in class
j and expressed as:
Eq. 6
Where tin corresponds to the value of the score vector ti for the nth object of class j, and mij
corresponds to the average value of ti in class j, calculated as:
Eq. 7
Eq. 8
Where mi is the average of all training objects for the score vector ti.
Step 5: The training set scores are used as input variables for building the LDA
model. The optimal number of scores is chosen based on the smallest error obtained
through cross-validation.
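A minimal sketch of the PCA-LDA steps on a simulated data matrix X with a class vector y (PCA computed with prcomp; the discrimination ranking of Step 4 is omitted and the first scores are used directly):
# Minimal sketch: PCA scores as input variables for LDA with cross-validation
library(MASS)
set.seed(12)
X <- rbind(matrix(rnorm(40 * 10, 0), ncol = 10),
           matrix(rnorm(40 * 10, 1), ncol = 10))     # 80 samples, 10 variables (assumed)
y <- factor(rep(c("class1", "class2"), each = 40))
pca <- prcomp(X, center = TRUE, scale. = TRUE)       # Step 2: decomposition into PCs
scores <- pca$x[, 1:3]                               # first k = 3 score vectors
model <- lda(scores, grouping = y, CV = TRUE)        # Step 5: LDA with leave-one-out cross-validation
mean(model$class == y)                               # cross-validated classification rate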
Example 1: In this example we describe a script in the R language for building
and validating multivariate classification models using the PCA-LDA and Kennard-Stone
algorithms for two classes (healthy and dengue, ATR-FTIR data) through the prospectr, MASS
and ggplot2 packages.
R Script
# installing packages
install.packages ("prospectr")
install.packages ("MASS")
# reading packages
library(prospectr)
library(MASS)
library(ggplot2)
class1t = t(class1)
class2t = t(class2)
cmt = t(cm)
dev.new()
matplot(cmt, class1t, type ="l", col ="blue", xlab ="Wavenumber (cm-1)", ylab
="Absorbance", main ="Raw spectra")
matplot(cmt,class2t, type ="l", col ="red", xlab ="Wavenumber (cm-1)", ylab
="Absorbance", main ="Raw spectra", add =TRUE)
dev.new()
matplot(cmt, colMeans (class1), type ="l", col ="blue", xlab ="Wavenumber (cm-1)",
ylab ="Absorbance", main ="Average spectra")
matplot(cmt,colMeans (class2), type ="l", col ="red", xlab ="Wavenumber (cm-1)",
ylab ="Absorbance", main ="Mean spectra", add =TRUE )
# PCA Model
# PCA Variances
par(mfrow = c( 2,2))
barplot(data.vars[1:10], main="Variance", names.arg = paste("PC", 1:10))
barplot(log( data.vars[1:10]), main="Log(variance)", names.arg = paste("PC", 1:10))
barplot(data.relvars[1:10], main="Relative Variances", names.arg = paste("PC",
1:10))
barplot(cumsum (100*data.relvars[1:10]), main="Cumulative Variance(%)", names.arg
= paste("PC", 1:10), ylim = c(0,100))
# LDA model
# figures of merit
# training
dim_train1 = dim(train1)
dim_train2 = dim(train2)
# cross-validation
# test
dim_test1 = dim(test1)
dim_test2 = dim(test2)
dev.new()
matplot(cmt,data.loadings[,1], type ="l", col = "blue", xlab ="Wavenumber (cm-1)",
ylab ='Loadings', main ='PCA loadings')
par(new=TRUE)
matplot(cmt,data.loadings[,2], type ="l", col = "red", xlab = "Wavenumber (cm-1)",
ylab ='Loadings',main ='PCA loadings')
dev.new()
matplot(pred_train$posterior[1:dim_train1[1],1 ],pch =19,col="blue",xlab="Samples",
ylab ="LD1", main ="Posterior Probability - Training") # training class 1
points(pred_train$posterior[(dim_train1[ 1]+ 1):(dim_train1[1]+dim_train2[1]),1],
pch=19,col="red",xlab="Samples", ylab ="LD1 ", main ="Posterior Probability -
Training") # training class 2
dev.new()
matplot(model_lda_cv$posterior[1:dim_train1[1],1 ],pch
=19,col="blue",xlab="Samples", ylab ="LD1", main ="Posterior Probability - Cross-
Validation") # class 1 cross-validation
points(pred_train$posterior[(dim_train1[ 1]+ 1):(dim_train1[1]+dim_
train2[1]),1],pch=19,col="red",xlab="Samples", ylab ="LD1 ", main ="Posterior
Probability - Cross-validation") # cross-validation class 2
dev.new()
matplot(pred_test$posterior[1:dim_test1[1],1],pch=19,col="blue",xlab="Samples",
ylab ="LD1", main=" Posterior Probability - Test") # test class 1
points(pred_train$posterior[(dim_test1[ 1]+ 1):(dim_test1[1]+dim_
test2[1]),1],pch=19,col="red",xlab="Samples", ylab ="LD1", main="Posterior
Probability - Test") # test class 2
R Script
# installing packages
install.packages ("prospectr")
install.packages ("MASS")
# reading packages
library(prospectr)
library(MASS)
library(ggplot2)
class1t = t(class1)
class2t = t(class2)
cmt = t(cm)
dev.new()
matplot(cmt, class1t, type ="l", col ="blue", xlab ="Wavenumber (cm-1)", ylab
="Absorbance", main ="Raw spectra")
matplot(cmt,class2t, type ="l", col ="red", xlab ="Wavenumber (cm-1)", ylab
="Absorbance", main ="Raw spectra", add =TRUE)
# PCA Model
# PCA Variances
par(mfrow = c( 2,2))
barplot(data.vars[1:10], main="Variance", names.arg = paste("PC", 1:10))
barplot(log( data.vars[1:10]), main="Log(variance)", names.arg = paste("PC", 1:10))
barplot(data.relvars[1:10], main="Relative Variances", names.arg = paste("PC",
1:10))
barplot(cumsum (100*data.relvars[1:10]), main="Cumulative Variance(%)", names.arg
= paste("PC", 1:10), ylim = c(0,100))
p1 = ceiling(prob * dim_test1[1])
p2 = ceiling(prob * dim_test2[1])
t1 = 1:dim_train1[1]
t2 = 1:dim_train2[1]
v1 = 1:dim_test1[1]
v2 = 1:dim_test2[1]
train1_new = rbind(train1[-train1_sub,],test1[test1_sub,])
train2_new = rbind(train2[-train2_sub,],test2[test2_sub,])
test1_new = rbind(test1[-test1_sub,],train1[train1_sub,])
test2_new = rbind(test2[-test2_sub,],train2[train2_sub,])
# LDA model
# figures of merit
# training
dim_train1 = dim(train1)
dim_train2 = dim(train2)
# cross-validation
# test
dim_test1 = dim(test1)
dim_test2 = dim(test2)
dev.new()
group12 = rbind(matrix(group1), matrix(group2))
matplot(data.scores[group12==1,1],data.
scores[group12==1,2],pch=19,col='blue',xlab='PC1',ylab='PC2',main= 'PCA scores')
points(data.scores[group12==2,1],data.
scores[group12==2,2],pch=19,col='red',xlab='PC1',ylab='PC2',main= 'PCA scores')
dev.new()
matplot(cmt,data.loadings[,1], type ="l", col = "blue", xlab ="Wavenumber (cm-1)",
ylab ='Loadings', main ='PCA loadings ')
par(new=TRUE)
matplot(cmt,data.loadings[,2], type ="l", col = "red", xlab = "Wavenumber (cm-1)",
ylab ='Loadings',main ='PCA loadings')
dev.new()
matplot(model_lda_cv$posterior[1:dim_train1[1],1 ],pch
=19,col="blue",xlab="Samples", ylab ="LD1", main ="Posterior Probability - Cross-
Validation") # class 1 cross-validation
points(pred_train$posterior[(dim_train1[ 1]+ 1):(dim_train1[1]+dim_
train2[1]),1],pch=19,col="red",xlab="Samples", ylab ="LD1 ", main ="Posterior
Probability - Cross-validation") # cross-validation class 2
dev.new()
matplot(pred_test$posterior[1:dim_test1[1],1],pch=19,col="blue",xlab="Samples",
ylab ="LD1", main=" Posterior Probability - Test") # test class 1
points(pred_train$posterior[(dim_test1[ 1]+ 1):(dim_test1[1]+dim_
test2[1]),1],pch=19,col="red",xlab="Samples", ylab ="LD1", main="Posterior
Probability - Test") # test class 2
4.6 SPA-LDA
In 2001, Araújo et al. [4] proposed a variable selection technique called the
Successive Projections Algorithm (SPA). This technique basically uses simple operations
in a vector space to minimize collinearity problems and has good efficiency in the context
of multivariate calibration, specifically when applied to Multiple Linear Regression (MLR).
In 2005, Pontes et al. [5] adapted SPA, originally proposed for selecting spectral
variables in MLR models, to be used in classification problems using the LDA classifier,
resulting in SPA-LDA. SPA-LDA uses a cost function that calculates the average risk G of
an incorrect classification by LDA based on the validation set as per the equation below:
G = (1/Kv) Σ_{k=1}^{Kv} gk , with gk = r²(xk, μIk) / min_{Ij ≠ Ik} r²(xk, μIj) Eq. 9
Where the numerator r2(xk, μIk) consists of the square of the Mahalanobis distance
between the object xk (with class index Ik) and the mean of its class (μIk). The denominator
of the same equation corresponds to the square of the Mahalanobis distance between the
object xk and the center of the nearest wrong class. This distance is calculated according to
the equation below:
r²(xk, μIj) = (xk − μIj)ᵀ R⁻¹ (xk − μIj)
where the sample mean (μIk) and the covariance matrix R are calculated on the training set.
As desired, the value of gk should be as small as possible; that is, the object xk should be
close to the center of its true class and far from the centers of the other classes.
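A minimal sketch of the risk gk of Eq. 9 for one validation object, assuming class means and a pooled covariance matrix R estimated on simulated training data and using the mahalanobis function:
# Minimal sketch: SPA-LDA cost (gk) for a validation object
set.seed(13)
X1 <- matrix(rnorm(30 * 5, 0), ncol = 5)               # simulated training data, class 1
X2 <- matrix(rnorm(30 * 5, 1.5), ncol = 5)             # simulated training data, class 2
R  <- (29 * cov(X1) + 29 * cov(X2)) / 58               # pooled covariance matrix
mu <- list(colMeans(X1), colMeans(X2))                 # class means
gk <- function(xk, true.class) {
  d2 <- sapply(mu, function(m) mahalanobis(xk, m, R))  # squared Mahalanobis distances
  d2[true.class] / min(d2[-true.class])                # own class / nearest wrong class (Eq. 9)
}
xk <- X1[1, ]                # a validation object known to belong to class 1
gk(xk, true.class = 1)
# averaging gk over the validation set gives the cost G minimized by SPA-LDA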
Example 3: In this example, we present the SPA-LDA algorithm together with the
sample selection algorithm (KS) for building multivariate classification models for two
classes (healthy and dengue, ATR-FTIR) through the prospectr, MASS, ggplot2 and lintools
packages.
R Script
# installing packages
install.packages("prospectr")
install.packages("MASS")
install.packages("lintools")
# reading packages
library(prospectr)
library(MASS)
library(ggplot2)
library (lintools)
class1t = t(class1)
class2t = t(class2)
cmt = t(cm)
dev.new()
matplot(cmt, class1t, type ="l", col ="blue", xlab ="Wavenumber (cm-1)", ylab
="Absorbance", main ="Raw spectra")
matplot(cmt,class2t, type ="l", col ="red", xlab ="Wavenumber (cm-1)", ylab
="Absorbance", main ="Raw spectra", add =TRUE)
dev.new()
matplot(cmt, colMeans (class1), type ="l", col ="blue", xlab ="Wavenumber (cm-1)",
ylab ="Absorbance", main ="Average spectra")
matplot(cmt,colMeans (class2), type ="l", col ="red", xlab ="Wavenumber (cm-1)",
ylab ="Absorbance", main ="Mean spectra", add =TRUE )
# PCA Model
# SPA model
dev.new()
matplot(cmt,m,xlab="Wavenumber (cm-1)", type="l", ylab="Absorbance", main ="Average
spectrum with selected variables")
points(cm[variables],m[variables], pch =19)
# figures of merit
# training
dim_train1 = dim(train1)
dim_train2 = dim(train2)
# cross-validation
# test
dim_test1 = dim(test1)
dim_test2 = dim(test2)
dev.new()
matplot(pred_train$posterior[1:dim_train1[1],1 ],pch
=19,col="blue",xlab="Samples", ylab ="LD1", main ="Posterior Probability -
Training") # training class 1
points(pred_train$posterior[(dim_train1[ 1]+ 1):(dim_train1[1]+dim_
train2[1]),1],pch=19,col="red",xlab="Samples", ylab ="LD1 ", main ="Posterior
Probability - Training") # training class 2
dev.new()
matplot(model_lda_cv$posterior[1:dim_train1[1],1 ],pch
=19,col="blue",xlab="Samples", ylab ="LD1", main ="Posterior Probability - Cross-
Validation") # class 1 cross-validation
points(pred_train$posterior[(dim_train1[ 1]+ 1):(dim_train1[1]+dim_
train2[1]),1],pch=19,col="red",xlab="Samples", ylab ="LD1 ", main ="Posterior
Probability - Cross-validation") # cross-validation class 2
dev.new()
matplot(pred_test$posterior[1:dim_test1[1],1],pch=19,col="blue",xlab="Samples",
ylab ="LD1", main=" Posterior Probability - Test") # test class 1
points(pred_train$posterior[(dim_test1[ 1]+ 1):(dim_test1[1]+dim_
test2[1]),1],pch=19,col="red",xlab="Samples", ylab ="LD1", main="Posterior
Probability - Test") # test class 2
R Script
# installing packages
install.packages("prospectr")
install.packages("MASS")
install.packages("lintools")
# reading packages
library(prospectr)
library(MASS)
library(ggplot2)
library (lintools)
class1t = t(class1)
class2t = t(class2)
cmt = t(cm)
dev.new()
matplot(cmt, class1t, type ="l", col ="blue", xlab ="Wavenumber (cm-1)", ylab
="Absorbance", main ="Raw spectra")
matplot(cmt,class2t, type ="l", col ="red", xlab ="Wavenumber (cm-1)", ylab
="Absorbance", main ="Raw spectra", add =TRUE)
dev.new()
matplot(cmt, colMeans (class1), type ="l", col ="blue", xlab ="Wavenumber (cm-1)",
ylab ="Absorbance", main ="Average spectra")
matplot(cmt,colMeans (class2), type ="l", col ="red", xlab ="Wavenumber (cm-1)",
ylab ="Absorbance", main ="Mean spectra", add =TRUE )
# PCA Model
# SPA model
dev.new()
matplot(cmt,m,xlab="Wavenumber (cm-1)", type="l", ylab="Absorbance", main ="Average
spectrum with selected variables")
points(cm[variables],m[variables], pch =19)
p1 = ceiling(prob * dim_test1[1])
p2 = ceiling(prob * dim_test2[1])
t1 = 1:dim_train1[1]
t2 = 1:dim_train2[1]
v1 = 1:dim_test1[1]
v2 = 1:dim_test2[1]
train1_new = rbind(train1[-train1_sub,],test1[test1_sub,])
train2_new = rbind(train2[-train2_sub,],test2[test2_sub,])
test1_new = rbind(test1[-test1_sub,],train1[train1_sub,])
test2_new = rbind(test2[-test2_sub,],train2[train2_sub,])
train = rbind(train1_new,train2_new)
test = rbind(test1_new,test2_new )
# LDA model
# figures of merit
# training
dim_train1 = dim(train1)
dim_train2 = dim(train2)
# cross-validation
# test
dev.new()
matplot(pred_train$posterior[1:dim_train1[1],1 ],pch
=19,col="blue",xlab="Samples", ylab ="LD1", main ="Posterior Probability -
Training") # training class 1
points(pred_train$posterior[(dim_train1[ 1]+ 1):(dim_train1[1]+dim_
train2[1]),1],pch=19,col="red",xlab="Samples", ylab ="LD1 ", main ="Posterior
Probability - Training") # training class 2
dev.new()
matplot(model_lda_cv$posterior[1:dim_train1[1],1 ],pch
=19,col="blue",xlab="Samples", ylab ="LD1", main ="Posterior Probability - Cross-
Validation") # class 1 cross-validation
points(pred_train$posterior[(dim_train1[ 1]+ 1):(dim_train1[1]+dim_
train2[1]),1],pch=19,col="red",xlab="Samples", ylab ="LD1 ", main ="Posterior
Probability - Cross-validation") # cross-validation class 2
dev.new()
matplot(pred_test$posterior[1:dim_test1[1],1],pch=19,col="blue",xlab="Samples",
ylab ="LD1", main=" Posterior Probability - Test") # test class 1
points(pred_train$posterior[(dim_test1[ 1]+ 1):(dim_test1[1]+dim_
test2[1]),1],pch=19,col="red",xlab="Samples", ylab ="LD1", main="Posterior
Probability - Test") # test class 2
4.7 GA-LDA
The genetic algorithm (GA) is a bio-inspired optimization method, created by John
Henry Holland in 1975 [6], developed to solve optimization and machine learning problems.
It is also considered a stochastic variable selection method. GA simulates the natural
processes of survival and reproduction of populations, based especially on the theory of
evolution of species proposed by Darwin: artificial chromosomes (encoded, for example, as
binary or real-valued strings) are constructed and a natural evolutionary process is simulated.
Basically, GA encodes subsets of variables as strings of binary values (chromosomes),
each position on the chromosome (gene) being associated with one of the candidate
variables. A population is generated from a random set of individuals, and during the
evolution process each individual receives an index (fitness) that reflects its ability to adapt.
The fittest individuals are selected for reproduction and the least fit are eliminated. The
process is repeated until a satisfactory solution is reached or the maximum number of
generations is attained.
In spectroscopic studies, for example, binary chromosomes with a length equal to
the number of wavelengths in a spectrum are used. In multivariate classification, the GA-LDA
algorithm uses a cost function calculated as the inverse of the risk G described in equation
4.9, evaluated on the wavelengths encoded in the chromosome. Crossover and mutation
operators are normally applied with given probabilities, together with a fixed size of the
population at each generation.
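A minimal sketch of binary-encoded variable selection with the ga function of the GA package, assuming simulated data and a fitness defined here, for brevity, as the cross-validated LDA classification rate (the text above describes the inverse of the risk G as the cost function actually used in GA-LDA):
# Minimal sketch: GA-based variable selection with a binary chromosome
library(GA)
library(MASS)
set.seed(15)
X <- cbind(matrix(rnorm(60 * 5), ncol = 5) + rep(c(0, 1), each = 30),   # 5 informative variables
           matrix(rnorm(60 * 15), ncol = 15))                            # 15 noise variables
y <- factor(rep(c("A", "B"), each = 30))
fitness <- function(chrom) {                      # chromosome: one bit per variable
  sel <- which(chrom == 1)
  if (length(sel) < 2) return(0)
  cv <- lda(X[, sel, drop = FALSE], grouping = y, CV = TRUE)
  mean(cv$class == y)                             # cross-validated classification rate
}
model.ga <- ga(type = "binary", fitness = fitness, nBits = ncol(X),
               popSize = 30, maxiter = 20, run = 10)
which(model.ga@solution[1, ] == 1)                # selected variables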
Example 5: In this example, we present the GA-LDA algorithm along with the sample
selection algorithm (KS) to build multivariate classification models for two classes (healthy
and dengue, ATR-FTIR) through the prospectr, MASS, ggplot2, lintools, caret, dplyr, lattice
and GA packages.
R Script
# installing packages
install.packages("prospectr")
install.packages("MASS")
install.packages("caret")
install.packages("GA")
# reading packages
library(prospectr)
library(MASS)
library(ggplot2)
library(lintools)
library(caret)
library(dplyr)
library(lattice)
library(GA)
class1t = t(class1)
class2t = t(class2)
cmt = t(cm)
dev.new()
matplot(cmt, class1t, type ="l", col ="blue", xlab ="Wavenumber (cm-1)", ylab
="Absorbance", main ="Raw spectra")
matplot(cmt,class2t, type ="l", col =" red ", xlab ="Wavenumber (cm-1)", ylab
="Absorbance", main ="Raw spectra", add =TRUE)
# GA algorithm
# Population initialization function (assumed name): random binary chromosomes,
# each with exactly k selected variables (k must be defined beforehand)
population <- function(GA){
  m <- matrix(0, ncol = GA@nBits, nrow = GA@popSize)
  for(i in seq_len(GA@popSize))
    m[i, sample(GA@nBits, k)] <- 1
  m
}
# Crossover Function
# Mutation Function
parent
}
# Adjustment Function
# GA Model
ind = which(model_GA@solution[1,] == 1)
if (length(ind) > 22){
indmax = 22 # maximum number of variables selected
ind = ind[1:indmax]
}
datam_ga = datam[,ind]
m = colMeans(datam)
variables = ind
# figures of merit
# training
dim_train1 = dim(train1)
dim_train2 = dim(train2)
# cross-validation
# test
dim_test1 = dim(test1)
dim_test2 = dim(test2)
dev.new()
matplot(pred_train$posterior[1:dim_train1[1],1 ],pch
=19,col="blue",xlab="Samples", ylab ="LD1", main ="Posterior Probability -
Training") # training class 1
points(pred_train$posterior[(dim_train1[ 1]+ 1):(dim_train1[1]+dim_
train2[1]),1],pch=19,col="red",xlab="Samples", ylab ="LD1", main ="Posterior
Probability - Training") # training class 2
dev.new()
matplot(model_lda_cv$posterior[1:dim_train1[1],1 ],pch
=19,col="blue",xlab="Samples", ylab ="LD1", main ="Posterior Probability - Cross-
Validation") # class 1 cross-validation
points(pred_train$posterior[(dim_train1[ 1]+ 1):(dim_train1[1]+dim_
train2[1]),1],pch=19,col="red",xlab="Samples", ylab ="LD1", main ="Posterior
Probability - Cross-validation") # cross-validation class 2
dev.new()
matplot(pred_test$posterior[1:dim_test1[1],1],pch=19,col="blue",xlab="Samples",
ylab ="LD1", main=" Posterior Probability - Test") # test class 1
points(pred_train$posterior[(dim_test1[ 1]+ 1):(dim_test1[1]+dim_
test2[1]),1],pch=19,col="red",xlab="Samples", ylab ="LD1", main="Posterior
Probability - Test") # test class 2
R Script
# installing packages
install.packages("prospectr")
install.packages("MASS")
install.packages("caret")
install.packages("GA")
# reading packages
library(prospectr)
library(MASS)
library(ggplot2)
library(lintools)
library(caret)
library(dplyr)
library(lattice)
library(GA)
class1t = t(class1)
class2t = t(class2)
cmt = t(cm)
dev.new()
matplot(cmt, class1t, type ="l", col ="blue", xlab ="Wavenumber (cm-1)", ylab
="Absorbance", main ="Raw spectra")
matplot(cmt,class2t, type ="l", col ="red", xlab ="Wavenumber (cm-1)", ylab
="Absorbance", main ="Raw spectra", add =TRUE)
dev.new()
matplot(cmt, colMeans (class1), type ="l", col ="blue", xlab ="Wavenumber (cm-1)",
ylab ="Absorbance", main ="Average spectra")
matplot(cmt,colMeans (class2), type ="l", col ="red", xlab ="Wavenumber (cm-1)",
ylab ="Absorbance", main ="Mean spectra", add =TRUE )
# GA algorithm
# Population initialization function (assumed name): random binary chromosomes,
# each with exactly k selected variables (k must be defined beforehand)
population <- function(GA){
  m <- matrix(0, ncol = GA@nBits, nrow = GA@popSize)
  for(i in seq_len(GA@popSize))
    m[i, sample(GA@nBits, k)] <- 1
  m
}
# Crossover Function
# Mutation Function
parent
}
# Adjustment Function
# GA Model
# selected variables
ind = which(model_GA@solution[1,] == 1)
if (length(ind) > 22){
indmax = 22 # maximum number of variables selected
ind = ind[1:indmax]
}
datam_ga = datam[,ind]
m = colMeans(datam)
variables = ind
p1 = ceiling(prob * dim_test1[1])
p2 = ceiling(prob * dim_test2[1])
t1 = 1:dim_train1[1]
t2 = 1:dim_train2[1]
v1 = 1:dim_test1[1]
v2 = 1:dim_test2[1]
train1_new = rbind(train1[-train1_sub,],test1[test1_sub,])
train2_new = rbind(train2[-train2_sub,],test2[test2_sub,])
test1_new = rbind(test1[-test1_sub,],train1[train1_sub,])
test2_new = rbind(test2[-test2_sub,],train2[train2_sub,])
train = rbind(train1_new,train2_new)
test = rbind(test1_new,test2_new)
# LDA model
# figures of merit
dim_train1 = dim(train1)
dim_train2 = dim(train2)
# cross-validation
# test
dim_test1 = dim(test1)
dim_test2 = dim(test2)
dev.new()
matplot(pred_train$posterior[1:dim_train1[1], 1], pch = 19, col = "blue",
        xlab = "Samples", ylab = "Posterior probability",
        main = "Posterior Probability - Training") # training class 1
points(pred_train$posterior[(dim_train1[1] + 1):(dim_train1[1] + dim_train2[1]), 1],
       pch = 19, col = "red") # training class 2
dev.new()
matplot(model_lda_cv$posterior[1:dim_train1[1], 1], pch = 19, col = "blue",
        xlab = "Samples", ylab = "Posterior probability",
        main = "Posterior Probability - Cross-validation") # cross-validation class 1
points(model_lda_cv$posterior[(dim_train1[1] + 1):(dim_train1[1] + dim_train2[1]), 1],
       pch = 19, col = "red") # cross-validation class 2
dev.new()
matplot(pred_test$posterior[1:dim_test1[1], 1], pch = 19, col = "blue",
        xlab = "Samples", ylab = "Posterior probability",
        main = "Posterior Probability - Test") # test class 1
points(pred_test$posterior[(dim_test1[1] + 1):(dim_test1[1] + dim_test2[1]), 1],
       pch = 19, col = "red") # test class 2
4.8 TUCKER3–LDA
In 1966, the psychometrician Ledyard R. Tucker developed a family of multiway
data processing methods currently known as Tucker 1, Tucker 2 and Tucker 3 [7]. Tucker 1
consists of unfolding the data array of dimensions I x J x K into a matrix of dimensions
I x JK and applying an ordinary PCA to the unfolded matrix, as shown in Figure 4.2; the
unfolding can also be carried out in the other two directions (J x KI or K x IJ). As can be
seen, Tucker's proposal is to set aside the trilinear structure of the data (in which the
instrumental response is represented by the product of three independent vectors) by
decomposing them with a bilinear method.
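As a quick illustration of these unfoldings, the minimal sketch below (with a simulated array, purely for demonstration) shows how a three-way array of dimensions I x J x K can be matricized in the three possible directions in R.
set.seed(1)
X <- array(rnorm(5 * 4 * 3), dim = c(5, 4, 3))            # I = 5, J = 4, K = 3
X_IxJK <- matrix(X, nrow = dim(X)[1])                     # I x JK unfolding
X_JxKI <- matrix(aperm(X, c(2, 3, 1)), nrow = dim(X)[2])  # J x KI unfolding
X_KxIJ <- matrix(aperm(X, c(3, 1, 2)), nrow = dim(X)[3])  # K x IJ unfolding
dim(X_IxJK); dim(X_JxKI); dim(X_KxIJ)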
The Tucker 3 model, for example, can be represented by the following matrix
equation:
X(I×JK) = A G (C ⊗ B)ᵀ + E      Eq. 12
or, written element by element,
xijk = Σp Σq Σr aip bjq ckr gpqr + eijk      Eq. 13
where xijk is an element of the data cube X; aip, bjq and ckr are the elements of the
weight matrices A, B and C obtained for modes I, J and K; gpqr is the corresponding
element of the core array G; and eijk is the error associated with the approximation of xijk.
Therefore, when we have 2nd order chemical data and want to apply the LDA
algorithm, a prior data decomposition step is needed. A viable alternative is to use the
scores from the Tucker 3 method as input variables for the LDA algorithm, creating
a new classifier called Tucker3-LDA.
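A minimal sketch of this idea is given below, assuming X is the three-way data array (samples x emission x excitation) and Y the class vector loaded as in the script that follows; the number of factors per mode (2, 2, 2) is only an assumption.
library(multiway)
library(MASS)
model_t3 <- tucker(X, nfac = c(2, 2, 2), nstart = 1)  # Tucker3 decomposition
scores_A <- model_t3$A                                # sample-mode scores (I x 2)
model_lda <- lda(scores_A, grouping = factor(Y))      # LDA on the Tucker3 scores
pred <- predict(model_lda, scores_A)
table(Predicted = pred$class, Reference = Y)          # training confusion matrix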
R Script
## Loading Packages
install.packages("multiway")
library(multiway)
install.packages("ThreeWay")
library(ThreeWay)
install.packages("R.matlab")
library(R.matlab)
install.packages("plot3D")
library(plot3D)
install.packages("plotly")
library(plotly)
install.packages("prospectr")
install.packages("MASS")
install.packages("caret")
library(prospectr)
library(MASS)
library(caret)
X <- data$Xcomb
Y <- data$Y
nmEX <- data$nmEX
nmEM <- data$nmEM
# matrix dimensions
X1 = data$Xnormal
X2 = data$Xcancer
dev.new()
filled.contour(X1m,color.palette = terrain.colors,main = "Class 1 - Normal")
dev.new()
filled.contour(X2m,color.palette = terrain.colors,main = "Class 2 - Cancer")
dev.new()
matplot(t(nmEM), X1m, type="l", xlab = "Emission wavelength (nm)", ylab =
"Intensity", main = "Class 1 - Normal")
dev.new()
matplot(t(nmEX), t(X1m), type ="l", xlab ="Excitation Wavelength (nm)", ylab
="Intensity", main = "Class 1 - Normal")
dev.new()
matplot(t(nmEM), X2m, type="l", xlab ="Emission wavelength (nm)", ylab
="Intensity", main ="Class 2 - Cancer")
dev.new()
matplot(t(nmEX), t(X2m), type ="l", xlab ="Excitation Wavelength (nm)", ylab
="Intensity", main ="Class 2 - Cancer")
# R2 Adjustment
dev.new ()
matplot(model$A[Y==1,1], model$A[Y==1,2], pch="o", col="blue", xlab="Factor 1",
ylab="Factor 2", main="Scores (o: normal, x: cancer)")
points(model$A[Y==2,1], model$A[Y==2,2], pch="x", col="red", xlab="Factor 1",
ylab="Factor 2", main="Scores (o: normal, x: cancer)")
dev.new ()
matplot(t(nmEM), model$B, type ="l", xlab = "Emission Wavelength (nm)", ylab =
"Intensity")
dev.new()
matplot(t(nmEX), model$C, type ="l", xlab = "Excitation Wavelength (nm)", ylab =
"Intensity")
dim_class1 = dim(X1)
dim_class2 = dim(X2)
dim_data = dim(X)
# LDA model
# figures of merit
# training
dim_train1 = dim(train1)
dim_train2 = dim(train2)
# cross-validation
# test
dim_test1 = dim(test1)
dim_test2 = dim(test2)
dev.new()
matplot(pred_train$posterior[1:dim_train1[1], 1], pch = 19, col = "blue",
        xlab = "Samples", ylab = "Posterior probability",
        main = "Posterior Probability - Training") # training class 1
points(pred_train$posterior[(dim_train1[1] + 1):(dim_train1[1] + dim_train2[1]), 1],
       pch = 19, col = "red") # training class 2
dev.new()
matplot(model_lda_cv$posterior[1:dim_train1[1], 1], pch = 19, col = "blue",
        xlab = "Samples", ylab = "Posterior probability",
        main = "Posterior Probability - Cross-validation") # cross-validation class 1
points(model_lda_cv$posterior[(dim_train1[1] + 1):(dim_train1[1] + dim_train2[1]), 1],
       pch = 19, col = "red") # cross-validation class 2
Example 9: Here, the Tucker3-LDA algorithm is described using the MLM algorithm
for sample selection in 2nd order multivariate classification models of normal subjects vs.
patients with colorectal cancer (CRC), obtained by molecular fluorescence spectrometry
in blood plasma. Data can be obtained here: https://ucphchemometrics.com/datasets/ .
The Tucker3-LDA algorithm in this example needs the following R packages: multiway,
ThreeWay, R.matlab, plot3D, plotly, prospectr, MASS and caret.
R Script
## Loading Packages
install.packages("multiway")
library(multiway)
install.packages("ThreeWay")
library(ThreeWay)
install.packages("R.matlab")
library(R.matlab)
install.packages("plot3D")
library(plot3D)
install.packages("plotly")
library(plotly)
install.packages("prospectr")
install.packages("MASS")
install.packages("caret")
library(prospectr)
library(MASS)
library(caret)
## Loading Data – Establish the working directory containing the CRC.mat data
## In RStudio, go to Session > Set Working Directory > Choose Directory
# matrix dimensions
X1 = data$Xnormal
X2 = data$Xcancer
dev.new()
filled.contour(X1m,color.palette = terrain.colors,main = "Class 1 - Normal")
dev.new()
filled.contour(X2m,color.palette = terrain.colors,main = "Class 2 - Cancer")
dev.new()
matplot(t(nmEM), X1m, type="l", xlab = "Emission wavelength (nm)", ylab = "Intensity",
main = "Class 1 - Normal")
dev.new()
matplot(t(nmEX), t(X1m), type ="l", xlab ="Excitation Wavelength (nm)", ylab
="Intensity", main = "Class 1 - Normal")
dev.new()
matplot(t(nmEM), X2m, type="l", xlab ="Emission wavelength (nm)", ylab ="Intensity",
main ="Class 2 - Cancer")
dev.new()
matplot(t(nmEX), t(X2m), type ="l", xlab ="Excitation Wavelength (nm)", ylab
="Intensity", main ="Class 2 - Cancer")
dev.new ()
matplot(model$A[Y==1,1], model$A[Y==1,2], pch="o", col="blue", xlab="Factor 1",
ylab="Factor 2", main="Scores (o: normal, x: cancer)")
points(model$A[Y==2,1], model$A[Y==2,2], pch="x", col="red", xlab="Factor 1",
ylab="Factor 2", main="Scores (o: normal, x: cancer)")
dev.new ()
matplot(t(nmEM), model$B, type ="l", xlab = "Emission Wavelength (nm)", ylab =
"Intensity")
dev.new()
matplot(t(nmEX), model$C, type ="l", xlab = "Excitation Wavelength (nm)", ylab =
"Intensity")
dim_class1 = dim(X1)
dim_class2 = dim(X2)
dim_data = dim(X)
p1 = ceiling(prob * dim_test1[1])
p2 = ceiling(prob * dim_test2[1])
t1 = 1:dim_train1[1]
t2 = 1:dim_train2[1]
v1 = 1:dim_test1[1]
v2 = 1:dim_test2[1]
train1_new = rbind(train1[-train1_sub,],test1[test1_sub,])
train2_new = rbind(train2[-train2_sub,],test2[test2_sub,])
test1_new = rbind(test1[-test1_sub,],train1[train1_sub,])
test2_new = rbind(test2[-test2_sub,],train2[train2_sub,])
train = rbind(train1_new,train2_new)
test = rbind(test1_new,test2_new)
# LDA model
# training
dim_train1 = dim(train1)
dim_train2 = dim(train2)
# cross-validation
# test
dim_test1 = dim(test1)
dim_test2 = dim(test2)
dev.new()
matplot(pred_train$posterior[1:dim_train1[1], 1], pch = 19, col = "blue",
        xlab = "Samples", ylab = "Posterior probability",
        main = "Posterior Probability - Training") # training class 1
points(pred_train$posterior[(dim_train1[1] + 1):(dim_train1[1] + dim_train2[1]), 1],
       pch = 19, col = "red") # training class 2
dev.new()
matplot(model_lda_cv$posterior[1:dim_train1[1], 1], pch = 19, col = "blue",
        xlab = "Samples", ylab = "Posterior probability",
        main = "Posterior Probability - Cross-validation") # cross-validation class 1
points(model_lda_cv$posterior[(dim_train1[1] + 1):(dim_train1[1] + dim_train2[1]), 1],
       pch = 19, col = "red") # cross-validation class 2
dev.new()
matplot(pred_test$posterior[1:dim_test1[1], 1], pch = 19, col = "blue",
        xlab = "Samples", ylab = "Posterior probability",
        main = "Posterior Probability - Test") # test class 1
points(pred_test$posterior[(dim_test1[1] + 1):(dim_test1[1] + dim_test2[1]), 1],
       pch = 19, col = "red") # test class 2
4.9 PARAFAC–LDA
PARAFAC ("PARAllel Factor analysis") consists of a trilinear method of higher order
data decomposition proposed by Professor Rasmus Bro in 1998 [9]. From a mathematical
point of view, PARAFAC can be considered as a generalization of PCA, or as a restricted
case of the Tucker-3 method. The PARAFAC model is formed by two weight matrices (B
and C) and one of scores (A), in a mathematical representation very similar to the Tucker-3
method, as we can see in equation 4.14:
Eq. 14
Therefore, analogously to Turkey-3, we will apply the LDA algorithm after obtaining
the scores from the PARAFAC method as an input variable, creating a new classifier, called
PARAFAC-LDA.
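Before the full script, a minimal sketch of the PARAFAC-LDA idea, assuming X and Y are loaded as in the script below and using 2 factors (an assumption):
library(multiway)
library(MASS)
model_pf <- parafac(X, nfac = 2, nstart = 1)        # PARAFAC decomposition
scores_A <- model_pf$A                              # sample-mode scores (I x 2)
model_lda <- lda(scores_A, grouping = factor(Y))    # LDA on the PARAFAC scores
pred <- predict(model_lda, scores_A)
table(Predicted = pred$class, Reference = Y)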
R Script
## Loading Packages
install.packages("multiway")
library(multiway)
install.packages("ThreeWay")
library(ThreeWay)
install.packages("R.matlab")
library(R.matlab)
install.packages("plot3D")
library(plot3D)
install.packages("plotly")
library(plotly)
library(prospectr)
library(MASS)
library(caret)
## Loading Data – Establish the working directory containing the CRC.mat data
## In RStudio, go to Session > Set Working Directory > Choose Directory
X <- data$Xcomb
Y <- data$Y
nmEX <- data$nmEX
nmEM <- data$nmEM
# matrix dimensions
X1 = data$Xnormal
X2 = data$Xcancer
dev.new()
filled.contour(X1m,color.palette = terrain.colors,main = "Class 1 - Normal")
dev.new()
filled.contour(X2m,color.palette = terrain.colors,main = "Class 2 - Cancer")
dev.new()
matplot(t(nmEM), X1m, type="l", xlab = "Emission wavelength (nm)", ylab = "Intensity",
main = "Class 1 - Normal")
dev.new()
matplot(t(nmEX), t(X1m), type ="l", xlab ="Excitation Wavelength (nm)", ylab
="Intensity", main = "Class 1 - Normal")
dev.new()
matplot(t(nmEM), X2m, type="l", xlab ="Emission wavelength (nm)", ylab ="Intensity",
main ="Class 2 - Cancer")
model = parafac(X, nfac = nf, nstart = 1, maxit = 500, ctol = 10^-4, parallel = FALSE,
                cl = NULL, output = c("best", "all"))
# R2 Adjustment
dev.new ()
matplot(model$A[Y==1,1], model$A[Y==1,2], pch="o", col="blue", xlab="Factor 1",
ylab="Factor 2", main="Scores (o: normal, x: cancer)")
points(model$A[Y==2,1], model$A[Y==2,2], pch="x", col="red", xlab="Factor 1",
ylab="Factor 2", main="Scores (o: normal, x: cancer)")
dev.new ()
matplot(t(nmEM), model$B, type ="l", xlab = "Emission Wavelength (nm)", ylab =
"Intensity")
dev.new()
matplot(t(nmEX), model$C, type ="l", xlab = "Excitation Wavelength (nm)", ylab =
"Intensity")
dim_class1 = dim(X1)
dim_class2 = dim(X2)
dim_data = dim(X)
# LDA model
# figures of merit
# training
dim_train1 = dim(train1)
dim_train2 = dim(train2)
# cross-validation
dim_test1 = dim(test1)
dim_test2 = dim(test2)
dev.new()
matplot(pred_train$posterior[1:dim_train1[1], 1], pch = 19, col = "blue",
        xlab = "Samples", ylab = "Posterior probability",
        main = "Posterior Probability - Training") # training class 1
points(pred_train$posterior[(dim_train1[1] + 1):(dim_train1[1] + dim_train2[1]), 1],
       pch = 19, col = "red") # training class 2
dev.new()
matplot(model_lda_cv$posterior[1:dim_train1[1], 1], pch = 19, col = "blue",
        xlab = "Samples", ylab = "Posterior probability",
        main = "Posterior Probability - Cross-validation") # cross-validation class 1
points(model_lda_cv$posterior[(dim_train1[1] + 1):(dim_train1[1] + dim_train2[1]), 1],
       pch = 19, col = "red") # cross-validation class 2
dev.new()
matplot(pred_test$posterior[1:dim_test1[1], 1], pch = 19, col = "blue",
        xlab = "Samples", ylab = "Posterior probability",
        main = "Posterior Probability - Test") # test class 1
points(pred_test$posterior[(dim_test1[1] + 1):(dim_test1[1] + dim_test2[1]), 1],
       pch = 19, col = "red") # test class 2
Example 11: Here, the PARAFAC-LDA algorithm is described using the MLM
algorithm for sample selection in 2nd order multivariate classification models of normal
subjects vs. patients with colorectal cancer (CRC), obtained by molecular fluorescence
spectrometry in blood plasma. Data can be obtained here: https://ucphchemometrics.com/datasets/ .
The PARAFAC-LDA algorithm in this example needs the following R packages: multiway,
ThreeWay, R.matlab, plot3D, plotly, prospectr, MASS and caret.
R Script
## Loading Packages
install.packages("multiway")
library(multiway)
install.packages("ThreeWay")
library(ThreeWay)
install.packages("R.matlab")
library(R.matlab)
install.packages("plot3D")
library(plot3D)
install.packages("plotly")
library(plotly)
install.packages("prospectr")
install.packages("MASS")
install.packages("caret")
library(prospectr)
library(MASS)
library(caret)
## Loading Data – Establish the working directory containing the CRC.mat data
## In RStudio, go to Session > Set Working Directory > Choose Directory
# matrix dimensions
X1 = data$Xnormal
X2 = data$Xcancer
dev.new()
filled.contour(X1m,color.palette = terrain.colors,main = "Class 1 - Normal")
dev.new()
filled.contour(X2m,color.palette = terrain.colors,main = "Class 2 - Cancer")
dev.new()
matplot(t(nmEM), X1m, type="l", xlab = "Emission wavelength (nm)", ylab = "Intensity",
main = "Class 1 - Normal")
dev.new()
matplot(t(nmEX), t(X1m), type ="l", xlab ="Excitation Wavelength (nm)", ylab
="Intensity", main = "Class 1 - Normal")
dev.new()
matplot(t(nmEM), X2m, type="l", xlab ="Emission wavelength (nm)", ylab ="Intensity",
main ="Class 2 - Cancer")
dev.new()
matplot(t(nmEX), t(X2m), type ="l", xlab ="Excitation Wavelength (nm)", ylab
="Intensity", main ="Class 2 - Cancer")
model = parafac(X, nfac = nf, nstart = 1, maxit = 500, ctol = 10^-4, parallel = FALSE,
                cl = NULL, output = c("best", "all"))
dev.new ()
matplot(model$A[Y==1,1], model$A[Y==1,2], pch="o", col="blue", xlab="Factor 1",
ylab="Factor 2", main="Scores (o: normal, x: cancer)")
points(model$A[Y==2,1], model$A[Y==2,2], pch="x", col="red", xlab="Factor 1",
ylab="Factor 2", main="Scores (o: normal, x: cancer)")
dev.new ()
matplot(t(nmEM), model$B, type ="l", xlab = "Emission Wavelength (nm)", ylab =
"Intensity")
dev.new()
matplot(t(nmEX), model$C, type ="l", xlab = "Excitation Wavelength (nm)", ylab =
"Intensity")
dim_class1 = dim(X1)
dim_class2 = dim(X2)
dim_data = dim(X)
p1 = ceiling(prob * dim_test1[1])
p2 = ceiling(prob * dim_test2[1])
t1 = 1:dim_train1[1]
t2 = 1:dim_train2[1]
v1 = 1:dim_test1[1]
v2 = 1:dim_test2[1]
train1_new = rbind(train1[-train1_sub,],test1[test1_sub,])
train2_new = rbind(train2[-train2_sub,],test2[test2_sub,])
test1_new = rbind(test1[-test1_sub,],train1[train1_sub,])
test2_new = rbind(test2[-test2_sub,],train2[train2_sub,])
train = rbind(train1_new,train2_new)
test = rbind(test1_new,test2_new)
# LDA model
# figures of merit
dim_train1 = dim(train1)
dim_train2 = dim(train2)
# cross-validation
# test
dim_test1 = dim(test1)
dim_test2 = dim(test2)
dev.new()
matplot(model_lda_cv$posterior[1:dim_train1[1], 1], pch = 19, col = "blue",
        xlab = "Samples", ylab = "Posterior probability",
        main = "Posterior Probability - Cross-validation") # cross-validation class 1
points(model_lda_cv$posterior[(dim_train1[1] + 1):(dim_train1[1] + dim_train2[1]), 1],
       pch = 19, col = "red") # cross-validation class 2
dev.new()
matplot(pred_test$posterior[1:dim_test1[1], 1], pch = 19, col = "blue",
        xlab = "Samples", ylab = "Posterior probability",
        main = "Posterior Probability - Test") # test class 1
points(pred_test$posterior[(dim_test1[1] + 1):(dim_test1[1] + dim_test2[1]), 1],
       pch = 19, col = "red") # test class 2
PROPOSED EXERCISES
01 – Propose the application of the PCA-LDA algorithm using some sample selection
method (KS or MLM) through a script in the R language on an experimental data set,
presenting your hypotheses and conclusions.
02 – Propose the application of the SPA-LDA algorithm using some sample selection
method (KS or MLM) through a script in the R language on an experimental data set,
presenting your hypotheses and conclusions.
03 – Propose the application of the GA-LDA algorithm using some sample selection
method (KS or MLM) through an R script on an experimental data set, presenting your
hypotheses and conclusions.
04 – Build multivariate classification models for the PCA-QDA, SPA-QDA and GA-
QDA algorithms using an R script from a data set and present your conclusions.
05 – Perform a comparison between the performance of PCA-LDA and PCA-QDA
models using an R script for a given data set. Choose a sample selection method.
06 – Perform a comparison between the performance of SPA-LDA and SPA-QDA
models using an R script for a given data set. Choose a sample selection method.
07 – Perform a comparison between the performance of GA-LDA and GA-QDA
models using an R script for a given data set. Choose a sample selection method.
REFERENCES
1 – Escandar, G.M.; Olivieri, A.C. (2014). Practical Three-Way Calibration. Elsevier.
2 – Kennard, R.W.; Stone, L.A. (1969). Computer Aided Design of Experiments. Technometrics, 11(1): 137–148.
3 – Morais, C.L.M.; Lima, K.M.G.; Martin, F.L. (2018). A computational protocol for sample selection in biological-derived infrared spectroscopy datasets using Morais-Lima-Martin (MLM) algorithms. Protocol Exchange.
4 – Araújo, M.C.U.; Saldanha, T.C.B.; Galvão, R.K.H.; Yoneyama, T.; Chame, H.C.; Visani, V. (2001). The successive projections algorithm for variable selection in spectroscopic multicomponent analysis. Chemometrics and Intelligent Laboratory Systems, 57: 65–73.
5 – Pontes, M.J.C.; Galvão, R.K.H.; Araújo, M.C.U.; Moreira, P.N.T.; Neto, O.D.P.; José, G.E.; Saldanha, T.C.B. (2005). The successive projections algorithm for spectral variable selection in classification problems. Chemometrics and Intelligent Laboratory Systems, 78: 11–18.
6 – Holland, J.H. (1975). Adaptation in Natural and Artificial Systems. The University of Michigan Press, Ann Arbor, MI.
7 – Tucker, L.R. (1966). Some mathematical notes on three-mode factor analysis. Psychometrika, 31: 279–311.
8 – Graham, A. (1981). Kronecker Products and Matrix Calculus with Applications. Wiley: New York.
9 – Bro, R. (1998). Multi-way Analysis in the Food Industry: Models, Algorithms, and Applications. Doctoral Thesis, University of Amsterdam, Netherlands.
CHAPTER IDEA
Calibration is, in general, an experimental procedure that builds a mathematical model
between the values indicated by a given measuring instrument and the values represented
by a reference standard or property of interest. Prediction is a process that uses the model
built in the calibration or training stage to predict the property of interest of a sample based
on instrumental information. The general process of a calibration basically consists of the
modeling (training) stage, which establishes a mathematical relationship between the
instrumental variables and the response in the calibration set; and, the validation stage,
which seeks to optimize the relationship to find a better description of the analyte(s) of
interest.
In this chapter, you will learn about the advantages of multivariate calibration (1st and
2nd order) over univariate models (0th order) and several examples guided by algorithms in
the R language on real or simulated data sets. Furthermore, some multivariate calibration
algorithms coupled to variable selection algorithms will be presented.
Upon completing the chapter, you should be able to:
a) Build and evaluate figures of merit for univariate calibration models (0th order)
using R scripts;
b) Build and evaluate figures of merit for higher order calibration models (1st and 2nd
order) using R scripts;
c) Couple variable selection algorithms with multivariate calibration and evaluate
their performances;
d) Understand the sample and variable selection algorithms used in higher order
multivariate calibration models;
e) Investigate the stages for building multivariate calibration models when applied
to variable selection algorithms and evaluate their performances (figures of merit);
f) Build new scripts in the R language for decision making using 1st and 2nd order
multivariate calibration;
g) Propose new applications in chemistry or related areas of multivariate calibration
techniques.
Eq. 1
Eq. 2
Eq. 3
Eq. 4
where ŷ is the estimated response, b0 and b1 are the regression coefficients and ε is
the random error of the system.
In practice, equation 4 is obtained using calibration samples or certified reference
materials, standards containing one or several components, or synthetic standard materials,
that is, samples with known concentrations and high precision and accuracy.
Thus, the sum of the squared residuals of this model is given by:
Eq. 5
where: Eq. 6
so: Eq. 7
If we take the partial derivatives of the sum of squared residuals with respect to b0 and b1
and set them equal to zero, we obtain the values of the regression coefficients, as shown in the table below:
Eq. 8
Figure 5.1 presents different least squares models that can be obtained after
minimizing the sum of squared residuals.
Figure 5.1: Different least squares fits. B̂y, B̂x, Ây and Âx are the estimates of By, Bx, Ay and Ax.
Eq. 9
Eq. 10
Eq. 11
Eq. 12
Eq. 13
Eq. 14
The value of r always lies between -1 and 1. If r = 1, there is a perfect positive
correlation between the two variables; if r = -1, there is a perfect negative correlation
between the two variables; and if r = 0, the two variables do not depend linearly on each other.
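A quick numerical illustration of these three situations with the cor() function:
x <- c(1, 2, 3, 4, 5)
cor(x, 2 * x + 1)        # perfect positive linear relation: r = 1
cor(x, -3 * x + 10)      # perfect negative linear relation: r = -1
set.seed(1)
cor(x, rnorm(5))         # close to 0 for linearly unrelated variables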
When applying calibration models, it is necessary to test whether the
assumptions of these models hold. Usually, tests are carried out for: i)
linearity and ii) homoscedasticity.
Linearity is assessed by inspecting the model residuals, which must be random, as
shown in Figure 5.2a. If the errors present systematic deviations (Figure 5.2b), the linear
model is clearly inadequate. Linearity can also be assessed by examining the
model parameters (regression coefficient, intercept and correlation coefficient).
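A simple way to inspect this in R, assuming a fitted linear model such as lm1 from the scripts below, is a residuals-versus-fitted plot; a random scatter around zero supports linearity, whereas a systematic pattern indicates an inadequate model:
plot(fitted(lm1), residuals(lm1), pch = 19, col = "steelblue",
     xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2, col = "red")   # reference line at zero residual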
After an analytical method is constructed, it must be validated to ensure that the
results obtained with it have the required performance under the conditions in which it will
be applied. Normally, this validation of the model is carried out through the determination
of several parameters that characterize the performance of the method, which are called
figures of merit. In addition to the parameters discussed for univariate calibration
models (linearity and homoscedasticity), other figures of merit can be found in higher order
calibration, such as:
Eq. 18
where lv represents the number of samples in the validation set, and yi and ŷi correspond
to the reference values and the values predicted by the model, respectively.
v) Precision – the degree of agreement between the results of a series of
measurements of the same sample. In multivariate calibration, precision can be calculated
from the equation below:
Eq. 19
Eq. 20
SEN =      Eq. 21
viii) Limit of detection (LD) and quantification (LQ) – these parameters express
the smallest quantities of the species of interest that can be detected and determined
quantitatively, respectively. In multivariate calibration, these parameters are calculated
according to the following equations:
Eq. 22
Eq. 23
R Script
install.packages("AER")
library(AER)
# Data
# ensure reproducibility after training and val partition
set.seed (9)
data("USConsump1993", package = "AER")
plot(USConsump1993, plot.type = "single", col =1:2)
################################################ ########################
# Univariate Linear Regression #########################
#### Training
# plotting
plot (expenditure ~ income, data = consumption.tr, col = "indianred", pch = 20, cex
= 1.5)
abline (lm1, col = "palegreen", lwd = 2)
grid (lwd = 2)
##### Test
# Test data
consumption.te = consumption[-training,]
# Chart
plot(expenditure ~ income, data = consumption.te , col = "slateblue", pch = 20, cex
= 1.5)
abline(lm1, col = "khaki4", lwd = 2)
grid(lwd = 2)
ggplot() +
geom_point(aes(x = consumption.te$income, y = consumption.te$expenditure),
color = 'slateblue', lwd = 2) +
geom_smooth(method = "lm", formula = y ~ x,
aes(x = consumption.tr$income, y = consumption.tr$expenditure), col
= "khaki4") +
ggtitle ("consumption vs income (test data)") +
xlab("income") +
ylab("consumption") + theme_bw()
# Normality test
shapiro.test(residuals(lm1))
# Homoscedasticity test
install.packages("olsrr")
library(olsrr)
ols_test_breusch_pagan(lm1)
Example 2: The example we will describe here consists of the construction, validation
and comparison of univariate regression models using different degrees of the polynomial
through the MASS, ggplot2, caret, lmtest and olsrr packages.
R Script
library(MASS)
set.seed (9)
rehab = wtloss
? wtloss
library(ggplot2)
ggplot(rehab.training, aes( x = Days, y = Weight)) +
geom_point(color = 'orange', lwd = 2)+
geom_smooth(method = "lm", formula = y ~ x, col = "seagreen")+
ggtitle("Weight vs Days (training data)") +
xlab("Days") +
ylab("Weight") + theme_bw()
# Plot
ggplot() +
geom_point(aes(x = rehab.test$Days, y = rehab.test$Weight),
color = 'orangered', lwd = 2) +
geom_smooth(method = "lm", formula = y ~ x,
aes(x = rehab.training$Days, y = rehab.training$Weight)) +
ggtitle ("Weight vs days (test data)") +
xlab("days") +
ylab("weight") + theme_bw()
# Normality test
shapiro.test(residuals(lm1))
# Homoscedasticity test
install.packages("olsrr")
library(olsrr)
ols_test_breusch_pagan(lm1)
par(mfrow = c(1,1))
plot(Weight ~ Days, rehab.training, col = "deepskyblue3", pch = 20, cex = 1.5,
main = "Weight vs Days (training data)")
x1 = seq(0,250, by = 0.1)
lines(x1,predict(lm2, newdata = data.frame (Days = x1)), col = "mediumvioletred",
lwd = 2)
grid(lwd = 2)
library(ggplot2)
ggplot(rehab.training, aes( x = Days, y = Weight)) +
geom_point(color = 'deepskyblue3', lwd = 2)+
geom_smooth(method = "lm", formula = y ~ x + I(x^2), col = "mediumvioletred")+
ggtitle("Weight vs Days (training data)") +
xlab("Days") +
ylab("Weight") + theme_bw()
#### Test
# Model suitability
# residuals graph
par(mfrow = c(2,2))
plot(lm2)
# Normality test
shapiro.test(residuals(lm2))
# Homoscedasticity test
ols_test_breusch_pagan(lm2)
order = rep(c(1:3,5),2)
set = c(rep("training", 4), rep("test",4))
rmse = c( RMSE_training,RMSE_test )
select = data.frame(order,set,rmse)
# Cross-validation
Y = Xb + e      Eq. 24
b = (XᵀX)⁻¹Xᵀy      Eq. 25
The analysis of equation 25 above points to the main limitations of the MLR
technique: i) the inverse of XᵀX may not exist; ii) all information (significant variance)
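A minimal sketch of equation 25 with simulated data is shown below; when the columns of X are strongly collinear, t(X) %*% X becomes (nearly) singular and solve() fails, which is precisely the first limitation above.
set.seed(1)
X <- cbind(1, matrix(rnorm(30 * 3), ncol = 3))                  # intercept column + 3 predictors
y <- as.numeric(X %*% c(2, 1, -1, 0.5) + rnorm(30, sd = 0.1))   # simulated response
b <- solve(t(X) %*% X) %*% t(X) %*% y                           # normal equations (Eq. 25)
b
coef(lm(y ~ X[, -1]))                                           # same coefficients via lm()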
Example 3: The example we will describe here consists of the construction, validation
and comparison of MLR multivariate regression models using the Stat2Data, ggplot2, caret,
rsm, lmtest and olsrr packages.
R Script
install.packages("Stat2Data")
library(Stat2Data)
data("ThreeCars2017")
pairs(ThreeCars2017,
col = viridis :: viridis(3)[ThreeCars2017$CarType],
pch = c(15:17)[ThreeCars2017$CarType])
pairs(ThreeCars2017[,1:4],
col = viridis :: viridis (3)[ThreeCars2017$CarType],
pch = c(15:17)[ThreeCars2017$CarType])
# Preview
plot (Price ~ Mileage , cars.training , col = "palegreen2", pch = 20, cex = 1.5,
main = "Price vs Mileage (training data)")
abline (model1, col = "steelblue", lwd = 2)
grid (lwd = 2)
library(ggplot2)
ggplot ( cars.training , aes (x = Mileage, y = Price)) +
geom_point ( color = "palegreen", lwd = 2)+
geom_smooth (method = "lm", formula = y ~ x)+
ggtitle ("Price vs Mileage (training data)") +
xlab ("Mileage") +
ylab ("Price") + theme_bw ()
### Test
# Test data
cars.test = ThreeCars2017[-training,]
# Preview
plot(Price ~ Mileage, cars.test, col = "mediumorchid", pch = 20, cex = 1.5,
main = "Price vs Mileage (test data)")
abline (model1, col = "olivedrab3", lwd = 2)
grid( lwd = 2)
library(ggplot2)
ggplot () +
geom_point ( aes(x = cars.test$Mileage , y = cars.test$Price ),
color = "mediumorchid", lwd = 2) +
geom_smooth (method = "lm", formula = y ~ x,
aes(x = cars.training$Mileage , y = cars.training$Price ), col =
"olivedrab3") +
ggtitle ("Price vs Mileage (test data)") +
xlab ("Mileage") +
ylab ("Price") + theme_bw ()
## Preview
install.packages("rsm")
library(rsm)
persp(model2, Age ~ Mileage, col = rainbow(50), contours = "colors", theta = 50,
phi = 30)
#####
xs = seq (min( cars.training$Mileage ), max( cars.training$Mileage ), length = 20)
ys = seq (min( cars.training$Age ), max( cars.training$Age ), length = 20)
xys = expand.grid ( xs,ys )
colnames ( xys ) = c("Mileage", "Age")
zs = matrix(predict(model2, xys ), nrow = length( xs ))
n.cols = 100
palette = colorRampPalette (c("lightseagreen", "mediumvioletred"))( n.cols )
zfacet = zs[-1,-1] + zs[-1, -20] + zs[-20, -1] + zs[-20, -20]
facetcol = cut( zfacet , n.cols )
### Test
# Prediction with test data
res.test2 = predict(model2, newdata = data.frame ( cars.test [c(2,4)]))
#### Model 3
model3 = lm (Price ~ Mileage*Age, cars.training )
summary(model3)
# Preview
persp (model3, Age ~ Mileage, col = rainbow(50), contours = "colors")
contour(model3, Age ~ Mileage, image = TRUE)
# Preview
persp(model4, Age ~ Mileage, col = rainbow(50), contours = "colors", theta = 60,
phi = 30,)
contour(model4, Age ~ Mileage, image = TRUE)
# Preview
persp(model5, Age ~ Mileage, col = rainbow(50), contours = "colors", theta = 60,
phi = 30)
contour(model5, Age ~ Mileage, image = TRUE)
model = rep(1:6,2)
set = c(rep("training",6), rep("test",6))
rmse = c( rmse_training,rmse_test )
#### Cross-validation
# Suitability model 6
par(mfrow = c(2,2))
plot(model6)
R Script
## Loading packages
install.packages ("R.matlab")
library(R.matlab)
install.packages("prospectr")
install.packages("MASS")
install.packages("lintools")
install.packages("Stat2Data")
library(Stat2Data)
install.packages("Metrics")
library(Metrics)
## Loading data
x = data$data # spectra
y = data$concentration # concentration
cm = data$cm # wave number
cmt = t(cm)
## data plot
dev.new()
matplot(t(cm),t(x), type ='l', xlab ="Wavenumber (cm-1)", ylab ="Absorbance")
# PCA Model
# SPA model
model_spa = project (x= x.loadings [,1], A= xm , b=y, neq =0) # spa model
size = 1:dim_x[1]
## MLR Model
dev.new ()
plot (ycal,ycal_calc,pch ='o', col ="blue", xlab ="Measured Concentration (mg/L)",
ylab ="Predicted Concentration (mg/L)", main ="Calibration")
lines(ycal,ycal,col ='red')
dev.new ()
plot (ycal,ycal_calc,pch ='o', col ="blue", xlab ="Measured Concentration (mg/L)",
ylab ="Predicted Concentration (mg/L)", main ="Calibration (blue) and Prediction
(red)")
points(ypred,ypred_calc,pch =15,col="red")
lines(y,y,col ="red")
## Figures of merit
# Calibration
# Prediction
The second stage of the PCR technique consists of using the MLR technique to
establish a mathematical relationship between the matrix of T scores (the new block of
independent variables) and the block of dependent variables (matrix Y). Thus the MLR
equation can be written:
Y = TB + E      Eq. 26
and the solution for the regression coefficients is:
B = (TᵀT)⁻¹TᵀY      Eq. 27
Note that inverting TᵀT will not cause problems due to the mutual orthogonality of the
scores, correcting the collinearity problem. However, it is important to highlight that a crucial
step in the PCR technique is choosing the number of principal components due to the risk of
loss of information. Another issue is that PCR ignores all the information contained in the Y
matrix in the initial step. The data from the Y matrix are only used in the second step, when
the number of components has already been determined.
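The two PCR steps can be sketched "by hand" with simulated data, as below: PCA is applied to X, the first scores are kept, and an ordinary regression is then fitted between those scores and y (equations 26-27); the script that follows does the same thing with pls::pcr.
set.seed(2)
X <- matrix(rnorm(40 * 10), ncol = 10)
y <- X[, 1] - 0.5 * X[, 2] + rnorm(40, sd = 0.2)
pca <- prcomp(X, center = TRUE, scale. = TRUE)   # step 1: PCA of X
T2  <- pca$x[, 1:2]                              # scores of the first 2 components
pcr_fit <- lm(y ~ T2)                            # step 2: MLR between the scores and y
summary(pcr_fit)$r.squared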
install.packages("DEoptimR")
library(DEoptimR)
install.packages("Ecfun")
library(Ecfun)
install.packages("Ecdat")
library(Ecdat)
#data
data = ManufCost
? ManufCost
# Correlation
library ( corrplot )
r = cor(data)
round(r,2)
# preview
corrplot :: corrplot ( r,method = "color",
type = "upper",
order= "hclust",
addCoef.col = "black", tl.srt = 45,
diag = FALSE)
################################################ ####
#### Training and Testing Data
set.seed (33)
data.training = data[training,]
data.test = data[-training,]
par(mfrow = c(1,1))
# plotting two correlated variables
plot(scale( sl )~scale(pl), asp = 1, data, pch = 20, cex = 1.5, col = "mediumseagreen")
#eigenvalues of cm
e = eigen(cm)
# Principal axes
abline (a=0, b=s1, col = "blue", lwd = 2)
abline (a=0, b=s2, col = "lightblue", lwd = 2)
# Test
pred.lm1 = predict(lm1, newdata = data.test)
# metrics
library("caret")
test.lm = data.frame ( data.test$cost , pred.lm1)
colnames ( test.lm ) = c("obs", "pred")
defaultSummary ( test.lm )
################################################ #####################
### PCR
install.packages("pls")
library(pls)
#cross validation
ncomp.onesigma = selectNcomp (pcr1, method = "onesigma", plot = TRUE )
ncomp.permut = selectNcomp (pcr1, method = "randomization", plot = TRUE)
# preview
plot(RMSEP(pcr1), legendpos = "topright")
#model test
pred.pcr1 = predict(pcr1, ncomp = 2, newdata = data.test )
#metrics
test.pcr = data.frame ( data.test$cost , pred.pcr1)
colnames ( test.pcr ) = c("obs", "pred")
defaultSummary ( test.pcr )
where T and U are the score matrices of matrices X and Y, respectively. P and Q
are the loading matrices of matrices X and Y, respectively; and E and F are the residuals.
A representation of this first step of the PLS algorithm is shown in Figure 5.5 below:
The second step of the PLS algorithm consists of calculating the linear correlation
between the scores in matrix Y and the scores in matrix X, as described in the following
equation:
uh = bh·th      Eq. 30
for "h" latent variables, in which the values of bh are grouped in the diagonal matrix
B that contains the regression coefficients between the scores matrix U of Y and the scores
matrix T of X.
A geometric illustration of the PLS algorithm in this second stage is shown in Figure
5.6 below:
Figure 5.6 : Geometric representation of the PLS model using one latent variable modeling each block
(X and Y).
R Script
install.packages("Ecdat")
library(Ecdat)
library(caret)
install.packages("pls")
library(pls)
#data
data = ManufCost
? ManufCost
data = data.frame ( ManufCost )
data = na.omit (data)
################################################ #####
### Visualizing correlation and principal components between two variables
#eigenvalues of cm
e = eigen (cm)
# Principal axes
abline (a=0, b=s1, col = "blue", lwd = 2)
abline (a=0, b=s2, col = "lightblue", lwd = 2)
### PLS
# PLS training model
pls1 = plsr ( cost ~., ncomp = 8, data = data.training , validation = "LOO", scale
= T)
summary(pls1)
#cross validation
ncomp.onesigma = selectNcomp (pls1, method = "onesigma", plot = TRUE )
ncomp.permut = selectNcomp (pls1, method = "randomization", plot = TRUE)
# preview
plot(RMSEP(pls1), legendpos = "topright")
#model test
pred.pls1 = predict (pls1,ncomp = 2, newdata = data.test )
#metrics
test.pls = data.frame ( data.test$cost , pred.pls1)
colnames ( test.pls ) = c("obs", "pred")
defaultSummary ( test.pls )
X = Σf af ⊗ bf ⊗ cf + E,  f = 1, …, F      Eq. 31
where F is the number of factors or components of the model; af, bf and cf are vectors; and ⊗
is the outer product.
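A minimal numerical sketch of equation 31, building a trilinear array from simulated profiles as a sum of outer products (here with two factors):
set.seed(3)
I <- 5; J <- 20; K <- 15; nF <- 2
A <- matrix(runif(I * nF), I, nF)    # sample mode (relative concentrations)
B <- matrix(runif(J * nF), J, nF)    # e.g. emission profiles
C <- matrix(runif(K * nF), K, nF)    # e.g. excitation profiles
X <- array(0, dim = c(I, J, K))
for (f in 1:nF) X <- X + outer(outer(A[, f], B[, f]), C[, f])
dim(X)    # I x J x K trilinear array (no noise added)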
For its validity, equation 31 above assumes that: i) the analytical signals of each source of
variation contribute additively to the analytical signal of the sample; ii) the magnitudes of the signals are
proportional to the concentration of the analyte to be calibrated; and iii) the analytical signals of each analyte
are common to all samples.
In addition to these assumptions, some aspects must be taken into account when applying the
PARAFAC algorithm for multivariate calibration purposes: i) the suitability of the data structure; ii) the
algorithm initialization method, the restrictions imposed on the model and the convergence criterion;
iii) the number of factors or components; iv) the identification of the profiles estimated by the model in
each factor with the species of interest and the interferents present; v) the construction and validation
of the regression model to estimate the concentration of the species of interest.
Thus, the PARAFAC algorithm performs the decomposition of a three-way tensor into three
weight matrices A, B and C, referred to as modes A, B and C. Normally, equation 31 is solved by alternating
least squares (ALS), in which the ALS algorithm iteratively fixes two modes to estimate the third,
until some convergence criterion is met or the previously defined number of iterations is reached.
Several methods can be used to obtain the initial estimates of these profiles, such as singular value
decomposition (SVD) or direct trilinear decomposition (DTLD).
Once an initial estimate of, for example, B and C has been obtained, an initial estimate of A can
be obtained by least squares as follows:
Eq. 32
Eq. 33
where w contains the regression coefficients between the scores a and the concentrations y.
The interesting aspect of equation 33 above is that, to determine a sample of unknown
composition, we have a data cube formed by Ic calibration samples (with known concentration of the species
of interest) and a sample of unknown composition that may contain interferents, present or not in the
calibration samples. This property of 2nd order calibration models is known as the second order
advantage. Thus, the concentration of an unknown sample is obtained according to
equation 34, provided the signal of interest is contained in a single column of A:
Eq. 34
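In practice, this step reduces to a pseudo-univariate regression between one column of the score matrix A and the known concentrations. The sketch below uses hypothetical names (cal_index and test_index for the calibration and unknown samples, and column 1 of model$A as the factor assigned to the analyte):
a_analyte <- model$A[, 1]                             # scores of the analyte factor
calib <- data.frame(score = a_analyte[cal_index], conc = ycal)
reg <- lm(conc ~ score, data = calib)                 # pseudo-univariate regression
y_unknown <- predict(reg, newdata = data.frame(score = a_analyte[test_index]))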
R Script
## Loading Packages
install.packages ("multiway")
library(multiway)
install.packages ("ThreeWay")
library(ThreeWay)
install.packages ("R.matlab")
library(R.matlab)
install.packages ("plot3D")
library(plot3D)
install.packages ("prospectr")
install.packages ("MASS")
install.packages ("caret")
library(prospectr)
library(MASS)
library(caret)
install.packages ("Stat2Data")
library(Stat2Data)
install.packages ("Metrics")
library(Metrics)
install.packages ("lintools")
library(lintools)
## Loading Data
x <- data$data
y <- data$concentration
nmEX <- data$nmEX
nmEM <- data$nmEM
# matrix dimensions
# average matrix
xm = colMeans (x)
dev.new ()
filled.contour ( xm, color.palette = terrain.colors )
dev.new ()
matplot (t( nmEM ), xm,type ="l", xlab ="Emission Wavelength (nm)", ylab ="Intensity")
# R2 Adjustment
dev.new ()
matplot ( model$A [,1], model$A [,2], pch ="o", col ="blue", xlab ="Factor
1",ylab="Factor 2",main="PARAFAC Scores" )
dev.new ()
matplot (t( nmEM ), model$B,type ="l", xlab ="Emission Wavelength (nm)", ylab
="Intensity")
dev.new ()
matplot (t( nmEX ), model$C,type ="l", xlab ="Excitation Wavelength (nm)", ylab
="Intensity")
## Regression model
ycal_calc = xcal [,1] * coef [2] + xcal [,2] * coef [3] + coef [1] # predicted
concentration calibration
ypred_calc = xpred [,1] * coef [2] + xpred [,2] * coef [3] + coef [1] # predicted
concentration prediction
dev.new ()
plot (ycal,ycal_calc,pch ='o', col ="blue", xlab ="Measured Concentration (mg/L)",
ylab ="Predicted Concentration (mg/L)", main ="Calibration")
lines(ycal,ycal,col ='red')
dev.new ()
plot (ypred,ypred_calc,pch =15,col="red", xlab = "Measured Concentration (mg/L)",
ylab = "Predicted Concentration (mg/L)", main = "Predicted")
lines(ypred,ypred,col ='red')
dev.new ()
plot (ycal,ycal_calc,pch ='o', col ="blue", xlab ="Measured Concentration (mg/L)",
ylab ="Predicted Concentration (mg/L)", main ="Calibration (blue) and Prediction
(red)")
points(ypred,ypred_calc,pch =15,col="red")
lines(y,y,col ="red")
## Figures of merit
# Calibration
# Prediction
5.7 MCR-ALS
As discussed in the previous chapter, the MCR-ALS algorithm is the alternating least
squares multivariate curve resolution algorithm whose main purpose is to deconvolve curves
in bilinear data. In other words, it reassembles the original data matrix through individual
contributions from a given number of recovered profiles plus a random noise. However,
MCR can be viewed as a signal resolution method that can also be used for quantitative
purposes.
The following equation describes the decomposition of a data matrix into a bilinear
model that makes it possible to estimate concentration and spectral profiles individually,
maintaining the best data variance:
D = CSᵀ + E      Eq. 35
ii) Unimodality – requires that the calculated profiles have only one maximum. This
restriction can be imposed on either the C or the Sᵀ matrix.
iii) Closure – the sum of the relative concentrations remains constant during the
optimization process. This restriction is applied only to the concentration profiles.
iv) Trilinearity – requires that the spectral and concentration profiles do not vary
from one sample to another, leading to a unique solution.
Here we list some mathematical operations used in the MCR-ALS calibration models:
i) Determination of the number of components (n).
viii) Construction of the least squares regression model between the concentration
profile of the species of interest obtained in step vii and the vector of reference
concentrations of the calibration samples. A pseudo-univariate regression is therefore
performed between the recovered concentration profiles (through their area or norm) and the
analytical concentration of the samples of interest.
The MCR-ALS algorithm is successfully used on 2nd order data through data
matrices augmented by rows and/or columns, as can be seen in Figure 5.7:
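A minimal sketch of this augmentation step, assuming a hypothetical three-way array X of dimensions samples x J x K: each sample's landscape is stacked below the previous one so that a single bilinear D = CSᵀ model can be fitted to the augmented matrix.
D_aug <- do.call(rbind, lapply(seq_len(dim(X)[1]), function(i) X[i, , ]))
dim(D_aug)   # (samples * J) x K augmented matrix; the spectral dimension K is shared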
install.packages ("R.matlab")
library(R.matlab)
install.packages ("prospectr")
install.packages ("MASS")
install.packages ("lintools")
library(prospectr)
library(MASS)
library(ggplot2)
library(lintools)
install.packages ("Stat2Data")
library(Stat2Data)
install.packages ("Metrics")
library(Metrics)
install.packages ("ALS")
library (ALS)
## Loading data
## data plot
dev.new ()
matplot (t(cm),t(xm), type ='l', xlab ="Wavenumber (cm-1)", ylab ="Absorbance")
## MCR-ALS
plotS (mcr$S,cm)
print("R2:")
print(cor(ym [,1],Copt1)^2) # R2 value between Copt and real concentration
dim_xm=dim (xm)
size=1:dim_xm[1]
xcal=Copt1[sel,] # calibration
ycal = matrix (ym [sel,1]) # concentration calibration
dev.new ()
plot (ycal,ycal_calc,pch='o', col="blue", xlab="Measured Concentration (mg/L)",
ylab="Predicted Concentration (mg/L)", main="Calibration")
lines(ycal,ycal,col="red")
dev.new ()
plot (ypred,ypred_calc,pch=15,col="red", xlab="Measured Concentration (mg/L)",
ylab="Predicted Concentration (mg/L)", main="Predicted")
lines(ypred,ypred,col='red')
dev.new ()
plot (ycal,ycal_calc,pch='o', col="blue", xlab="Measured Concentration (mg/L)",
ylab="Predicted Concentration (mg/L)", main="Calibration (blue) and prediction
(red)")
points(ypred,ypred_calc,pch=15,col="red")
lines(y,y,col="red")
## Figures of merit
# Calibration
# Prediction
MAPEP=mean(abs((ypred-ypred_calc)/ypred))*100
R2pred=cor(ypred,ypred_calc)^2
RMSEP=rmse(ypred,ypred_calc)
5.8 UPLS/RBL
The PLS algorithm is probably the most used and described first-order calibration model
in the literature. The Unfolded Partial Least Squares (UPLS) algorithm is an extended
version of partial least squares (PLS) regression, proposed by Wold and Bro [4,5], used
for processing multilinear data by unfolding the data array into vectors. However, unlike
PARAFAC and MCR-ALS, this algorithm cannot by itself achieve the second-order advantage when
interferents are present in the test sample but absent from the calibration set. To overcome
this limitation and achieve full second-order data analysis, a mathematical procedure called
residual bilinearization (RBL) [4] was developed, which models the residuals of the test
sample as a sum of the bilinear contributions of unexpected components. Therefore, the
function of RBL is to model the residuals and produce results for test samples free of
interferences, adjusting the values of the score matrix with the information of the interferent
modeled separately in the RBL step.
In the first calibration step of the UPLS-RBL algorithm, a data tensor with dimensions
(i x j x k) is unfolded into a two-dimensional matrix (i × jk), where i corresponds to the
number of samples, j is the dimension corresponding to the excitation spectra, and k is the
dimension of the emission spectra, in the case of molecular fluorescence data. With all the
calibration data unfolded, a new matrix is constructed by arranging them adjacent to each
other for the application of UPLS regression, as represented in Figure 5.8 below:
Figure 5.8: Representation to show the breakdown of a sample into a vector (a) and a set of samples
into vectors (b) for building a UPLS model.
Initially, the UPLS-RBL algorithm unfolds the calibration data matrix (Xcal), and a PLS
model is built in the traditional way, together with a cross-validation step to define the number
of latent variables. The weight (P) and score (t) matrices, originating from the unfolded
matrix, are estimated iteratively by maximizing the variance of Xcal and its covariance with
the response vector ycal. Then, with the calculated t values, a regression model is built
between the nominal concentration vector ycal and t to estimate the regression vector v that
minimizes the error ey, as shown in the following equation:
ycal = Tv + ey      Eq. 36
If there is a UPLS residue in the test sample greater than the calibration one, we have
the presence of some uncalibrated constituent in the residue or the presence of interferents
in the sample. If this occurs, an RBL post-calibration step is necessary to remove the
contribution of this interference from the test sample's score value.
In the RBL step, the UPLS residual vector is reassembled into a matrix with the
original dimensions, and an SVD is performed on the reassembled test-sample residual
matrix (EP). When the residual of the test sample approaches the calibration level, the
correct number of RBL factors to be used in the calibration model has been found. Equation
37 shows the equation of the test sample of a classic PLS model plus the contribution of the
interferent found in the RBL step plus the unfolded RBL residual (eu).
xu = Ptu + BGCᵀ + eu      Eq. 37
where xu represents the unfolded test sample; Ptu represents the analyte signal; BGCᵀ
represents the interferent signal; and eu represents the random error after the RBL step.
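The unfolding step described above can be sketched with a simulated calibration tensor; once unfolded, an ordinary PLS model (for example with pls::plsr, as in the script below) can be fitted to the i x (j·k) matrix.
set.seed(4)
Xcal <- array(rnorm(10 * 8 * 6), dim = c(10, 8, 6))   # 10 samples, 8 x 6 landscapes
Xcal_unf <- matrix(Xcal, nrow = dim(Xcal)[1])         # unfolded 10 x 48 matrix
dim(Xcal_unf)
library(pls)
ycal <- rnorm(10)                                     # simulated concentrations
model_upls <- plsr(ycal ~ Xcal_unf, ncomp = 3, validation = "LOO")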
R Script
# Loading packages
install.packages ("R.matlab")
library(R.matlab)
install.packages ("pls")
library(pls)
install.packages ("Stat2Data")
library(Stat2Data)
install.packages ("Metrics")
library (Metrics)
## Loading Data
x <- data$data
y <- data$concentration
nmEX <- data$nmEX
nmEM <- data$nmEM
# matrix dimensions
xm = colMeans (x)
dev.new ()
filled.contour (xm,color.palette = terrain.colors)
dev.new ()
matplot (t(nmEM), xm,type="l", xlab="Emission Wavelength (nm)", ylab="Intensity")
dev.new ()
matplot (t(nmEX),t(xm), type="l", xlab="Excitation Wavelength (nm)",
ylab="Intensity")
## UPLS Model
validationplot (model_pls)
validationplot (model_pls,val.type="MSEP")
validationplot (model_pls,val.type="R2")
dev.new ()
plot (ycal,ycal_calc,pch ='o', col="blue", xlab="Measured Concentration (mg/L)",
ylab="Predicted Concentration (mg/L)", main ="Calibration")
lines(ycal,ycal,col='red')
dev.new ()
plot (ypred,ypred_calc,pch=15,col="red", xlab="Measured Concentration (mg/L)",
ylab="Predicted Concentration (mg/L)", main="Predicted")
lines(ypred,ypred,col='red')
dev.new ()
plot (ycal,ycal_calc,pch='o', col="blue", xlab="Measured Concentration (mg/L)",
ylab="Predicted Concentration (mg/L)", main="Calibration (blue) and prediction
(red)")
points(ypred,ypred_calc,pch=15,col="red")
lines(y,y,col="red")
## Figures of merit
# Calibration
# Prediction
MAPEP = mean(abs((ypred-ypred_calc)/ypred))*100
R2pred = cor (ypred,ypred_calc)^2
RMSEP = rmse (ypred,ypred_calc)
5.9 N-PLS
The N-way Partial Least Squares (N-PLS) algorithm, proposed by Bro [5], is also an
extension of the PLS model. Basically, the N-PLS algorithm decomposes the three-dimensional
array X (I × J × K) (the cube of independent data) and the vector of reference concentrations
y (I × 1) (or of a physicochemical property) into a set of triads, seeking the maximum covariance
between the scores of X and y.
The construction of N-PLS models, analogously to traditional PLS, is carried out in
two stages: calibration and prediction. Each triad is equivalent to a latent variable, as in the PLS
model, and in the calibration stage the array X is decomposed into scores (tn) and
weights (wj and wk), as exemplified in Figure 5.9 below.
In Figure 5.9, X is the cube of independent data, T (I, F) is the score matrix, and
the matrices wj (J, F) and wk (K, F) are weight matrices containing information about the variables.
The array E (I × J × K) represents the part not explained by the model (residuals), and
F represents the number of latent variables. The matrix Y is also decomposed into scores
and weights, as represented in Figure 5.9, where Y is the matrix containing the property of
interest, U is the matrix containing the scores of Y, Q is the matrix containing the loadings
of Y, and F is the matrix of residuals of Y that, as in the array X, cannot be explained
by the model.
The decomposition of the three-dimensional array X can be represented by the
following equation:
X(I×JK) = T (wk |⊗| wj)ᵀ + E      Eq. 38
where the symbol "|⊗|" represents the Khatri–Rao product, an operator used with
higher order matrices.
For the concentration vector or physicochemical parameter of interest (y), the
decomposition is written by the equation below:
y = Tb + e      Eq. 39
where T is the score matrix, whose columns consist of the individual score vectors of
each component, and b contains the regression coefficients.
Finally, the predicted concentration of samples with unknown concentration (y*)
can be estimated from the new scores (T*) according to the following equation:
y* = T*b      Eq. 40
install.packages ("R.matlab")
library(R.matlab)
install.packages ("pls")
library(pls)
install.packages ("Stat2Data")
library(Stat2Data)
install.packages ("Metrics")
library(Metrics)
install.packages ("sNPLS")
library (sNPLS)
## Loading Data
x <- data$data
y <- data$concentration
nmEX <- data$nmEX
nmEM <- data$nmEM
# matrix dimensions
# average matrix
xm = colMeans (x)
dev.new ()
filled.contour (xm,color.palette=terrain.colors)
dev.new ()
matplot (t(nmEM), xm,type="l", xlab="Emission Wavelength (nm)", ylab="Intensity")
dev.new ()
matplot (t(nmEX),t(xm), type="l", xlab="Excitation Wavelength (nm)",
ylab="Intensity")
## nPLS Model
cv = cv_snpls(xcal, ycal, ncomp = 1:3, keepJ = 1:2, keepK = 1:2, sample = 10,
              parallel = FALSE) # cross-validation - takes a long time
# Prediction
ycal_calc=predict(model_npls, xcal)
ypred_calc = predict(model_npls, xpred)
dev.new ()
plot (ycal,ycal_calc,pch='o', col="blue", xlab="Measured Concentration (mg/L)",
ylab="Predicted Concentration (mg/L)", main="Calibration")
lines(ycal,ycal,col='red')
dev.new ()
plot (ypred,ypred_calc,pch=15,col="red", xlab="Measured Concentration (mg/L)",
ylab="Predicted Concentration (mg/L)", main="Predicted")
lines(ypred,ypred,col='red')
dev.new ()
plot (ycal,ycal_calc,pch='o', col="blue", xlab="Measured Concentration (mg/L)",
ylab="Predicted Concentration (mg/L)", main="Calibration (blue) and Prediction
(red)")
points(ypred,ypred_calc,pch =15,col="red")
lines(y,y,col="red")
# Calibration
# Prediction
MAPEP = mean(abs((ypred-ypred_calc)/ypred))*100
R2pred = cor(ypred,ypred_calc)^2
RMSEP = rmse (ypred,ypred_calc)
PROPOSED EXERCISES
01 – Compare the performance of a univariate calibration model with several
polynomial degrees in a data set through a script in the R language, presenting the models'
performance, error and conclusions.
02 – Propose an application of the MLR algorithm on a data set with more than one
independent variable and, using a script in the R language, present your results, graphs
and conclusions.
03 – From the previous exercise, use the MLR-SPA algorithm through a script in the
R language and compare the results of the calibration models with and without the variable
selection algorithm.
04 – Build multivariate calibration models for the PCR and PLS algorithms on a given
1st order data set using an R script and present your results, figures of merit for both models
and your main conclusions.
REFERENCES
1 – Legendre, A.M. (1805). Nouvelles Méthodes pour la Détermination des Orbites des Comètes. Firmin Didot, Paris; second edition, Courcier, Paris.
3 – Vandeginste, B.G.M.; Sielhorst, C.; Gerritsen, M. (1988). The NIPALS algorithm for the calculation of the principal components of a matrix. Trends Anal. Chem., 7: 286–287.
4 – Wold, S.; Geladi, P.; Esbensen, K.; Öhman, J. (1987). Multi-way principal components- and PLS-analysis. J. Chemom., 1: 41.
CHAPTER IDEA
A digital image can be interpreted as a representation of a scene through a set
of discrete elements of finite size, known as pixels, organized in a two-dimensional
arrangement. Commonly, the acquisition of digital images occurs through electronic devices
(photo cameras, webcams, drones, for example) in a process known as optoelectronic
transduction, which involves a reduction in the dimensionality of the scene through a sensor
(Charge Coupled Device, for example ).
In this chapter you will find a theoretical foundation of digital images, color models and
some chemometric studies using digital images in classification and multivariate calibration
models. Examples guided by multivariate algorithms in the R language using digital images
will be found throughout the chapter, as well as details of the models developed.
Upon completing the chapter, you should be able to:
a) Understand the stages of acquiring and importing digital images, their pre-
processing and construction of calibration and multivariate classification models
using scripts in the R language;
c) Build and validate multivariate classification models using digital images with R
scripts;
d) Build and validate multivariate calibration models using digital images with R
scripts;
f) Propose new applications in chemistry or related areas using digital images in one
of the areas of Chemometrics.
Figure 6.1: Representation of a grayscale digital image as a matrix of pixels organized in rows (m) and columns (n), with gray levels ranging from black to white.
The main purpose of the RGB color model is the detection, representation
and display of images in electronic systems such as televisions and computers. The
frequency distribution of the values that the pixels of an image take is called a histogram:
it shows how many times each color value (0-255) appears in the image. To illustrate
this concept, Figure 6.2 presents a color digital image and the histograms of the
frequency distribution of all possible pixel values in the red, green, blue and gray levels.
Figure 6.2: Histograms in the red, green, blue and gray channels resulting from a color digital image.
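The histograms of Figure 6.2 can be reproduced with the imager package, as sketched below (assuming an RGB image file such as the bird.jpg used later in this chapter; note that imager rescales intensities to the 0-1 range):
library(imager)
im <- load.image("bird.jpg")
par(mfrow = c(2, 2))
hist(R(im), main = "Red channel",   xlab = "Pixel value")
hist(G(im), main = "Green channel", xlab = "Pixel value")
hist(B(im), main = "Blue channel",  xlab = "Pixel value")
hist(grayscale(im), main = "Gray levels", xlab = "Pixel value")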
We can represent the RGB model by a cube on the R, G and B axes, each taking
256 levels (values from 0 to 255). Each color channel is encoded with 8 bits, resulting in
an image with about 16.7 million different colors. As shown in Figure 6.3, the vertices on
the R, G and B axes of the cube correspond to the primary colors of the RGB model, and
the vertices lying in the GB, BR and RG planes correspond to the secondary colors (cyan,
magenta and yellow), formed from the combination of two primary colors. Black corresponds
to the origin of the cube, white corresponds to the vertex furthest from the origin, and the
gray levels correspond to the diagonal between these two points.
Figure 6.3: RGB color cube, with black at the origin, white at the opposite vertex, and the primary (red, green, blue) and secondary (cyan, magenta, yellow) colors at the remaining vertices.
In other words, zero intensity for each component gives the darkest color (without
light, considered black) and the total intensity of each one results in white. When the
intensities of all components are the same, the result is a shade of gray, darker or lighter,
depending on the intensity. When the intensities are different, the result is a colorful hue,
more or less saturated, depending on the difference between the strongest and weakest
intensities of the primary colors used [2].
Another color model, based on polar coordinates, represented by an inverted six-
sided pyramid, frequently used in computer graphics is called HSV, as shown in Figure 6.4.
Figure 6.4: HSV color model represented as an inverted six-sided pyramid: the hue (H) is given by the angle around the vertical axis (e.g. blue at 240°), the saturation (S) by the distance from the axis, and the value (V) by the height, with black at the apex (0, 0).
R Script
## Loading packages
install.packages("imager")
library(imager)
library(purrr)
## Loading image
im_rgb = load.image("bird.jpg")
## viewing image
plot (im_rgb)
## converting to HSV
RGBtoHSV(im_rgb) %>% imsplit("c") %>%
  modify_at(2, ~ . / 2) %>% imappend("c") %>%
  HSVtoRGB %>% plot(rescale = FALSE)
## converting to grayscale
grayscale(im_rgb) %>% plot(colourscale = gray, rescale = FALSE)
R Script
## Loading packages
install.packages("imager")
library(imager)
library(ggplot2)
library(dplyr)
library(prospectr)
library(MASS)
install.packages("plot3D")
library(plot3D)
install.packages("plotly")
library(plotly)
install.packages("factoextra")
library(factoextra)
## loading images
## generating histograms
## image 2
## image 3
## image 4
## image 5
## image 6
## image 7
## image 8
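# (sketch, not from the original) the image-loading code is omitted in this excerpt; the
# vectors im1v, ..., im8v used below could be obtained, for example, by unfolding each
# image into a column vector (file names hypothetical):
# im1  <- load.image("image1.jpg")
# im1v <- matrix(as.vector(grayscale(im1)), ncol = 1)   # pixel values as a column vector
# (repeat for images 2 to 8)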
data = rbind(t(im1v),t(im2v),t(im3v),t(im4v),t(im5v),t(im6v),t(im7v),t(im8v))
## performing PCA
# scaling data
# PCA Model
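# (sketch, not from the original) the PCA objects used below could be obtained with prcomp:
# data_scal     <- scale(data)                   # autoscaled data matrix
# pca           <- prcomp(data_scal)
# data.scores   <- pca$x                         # scores
# data.loadings <- pca$rotation                  # loadings
# data.vars     <- pca$sdev^2                    # variance explained by each PC
# data.relvars  <- data.vars / sum(data.vars)    # relative (fractional) variance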
# PCA Variances
par(mfrow = c(2, 2))
barplot(data.vars[1:10], main = "Variance", names.arg = paste("PC", 1:10))
barplot(log(data.vars[1:10]), main = "Log(variance)", names.arg = paste("PC", 1:10))
barplot(data.relvars[1:10], main = "Relative variance", names.arg = paste("PC", 1:10))
barplot(cumsum(100 * data.relvars[1:10]), main = "Cumulative variance (%)", names.arg = paste("PC", 1:10), ylim = c(0, 100))
y <- c(0.03125, 0.0625, 0.125, 0.25, 0.5, 1, 2, 4)
ys <- c("1", "2", "3", "4", "5", "6", "7", "8")
dev.new()
plot(data.scores[, 1], data.scores[, 2], pch = ys, xlab = "PC1", ylab = "PC2", main = "PCA scores - numbers for each image")
dev.new()
matplot(data.loadings[, 1], type = "l", col = "blue", xlab = "Pixels", ylab = "Loadings", main = "PCA loadings")
lines(data.loadings[, 2], type = "l", col = "red")
albumin_data = rbind(t(im1v), t(im2v), t(im3v), t(im4v), t(im5v), t(im6v), t(im7v), t(im8v))
## performing PCA
# scaling data
# PCA Model
# PCA Variances
par(mfrow = c(2, 2))
barplot(data.vars[1:10], main = "Variance", names.arg = paste("PC", 1:10))
barplot(log(data.vars[1:10]), main = "Log(variance)", names.arg = paste("PC", 1:10))
barplot(data.relvars[1:10], main = "Relative variance", names.arg = paste("PC", 1:10))
barplot(cumsum(100 * data.relvars[1:10]), main = "Cumulative variance (%)", names.arg = paste("PC", 1:10), ylim = c(0, 100))
col = c("blue","blue","blue","blue","blue","blue","blue","blue","red","red","red","red","red","red","red","red","red")
dev.new()
plot(data.scores[, 1], data.scores[, 2], pch = 19, col = col, xlab = "PC1", ylab = "PC2", main = "PCA scores - blue = creatinine, red = albumin")
dev.new()
matplot(data.loadings[, 1], type = "l", col = "blue", xlab = "Pixels", ylab = "Loadings", main = "PCA loadings")
lines(data.loadings[, 2], type = "l", col = "red")
# viewing dendrogram
# samples are grouped into 2 clusters - high concentration (on the left) and low concentration (on the right)
set.seed(123)
km.res <- kmeans(data_scal, 2, nstart = 1)
# viewing results
kmeans_out = km.res$cluster
groupConc = c(2, 2, 2, 1, 1, 1, 1, 1, 2, 2, 2, 1, 1, 1, 1, 1)
ac = sum(kmeans_out == groupConc, na.rm = TRUE) / length(groupConc) * 100  # agreement with the reference grouping (%)
ac
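# (optional sketch, not from the original) the clusters can also be visualised with the
# factoextra package loaded above:
# fviz_cluster(km.res, data = data_scal, main = "K-means clusters")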
R Script
## Loading packages
install.packages("imager")
library(imager)
library(ggplot2)
library(dplyr)
library(prospectr)
library(MASS)
install.packages("plot3D")
library(plot3D)
install.packages("plotly")
library(plotly)
install.packages("factoextra")
library(factoextra)
install.packages("lintools")
library(lintools)
install.packages("caret")
install.packages("GA")
library(caret)
library(GA)
creatinine_data = rbind(t(im1v), t(im2v), t(im3v), t(im4v), t(im5v), t(im6v), t(im7v), t(im8v))
albumin_data = rbind(t(im1v), t(im2v), t(im3v), t(im4v), t(im5v), t(im6v), t(im7v), t(im8v))
# PCA Model
# PCA Variances
par(mfrow = c(2, 2))
barplot(data.vars[1:10], main = "Variance", names.arg = paste("PC", 1:10))
barplot(log(data.vars[1:10]), main = "Log(variance)", names.arg = paste("PC", 1:10))
barplot(data.relvars[1:10], main = "Relative variance", names.arg = paste("PC", 1:10))
barplot(cumsum(100 * data.relvars[1:10]), main = "Cumulative variance (%)", names.arg = paste("PC", 1:10), ylim = c(0, 100))
# LDA Model
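# (sketch, not from the original) the LDA step omitted here could be based on MASS::lda,
# with the data split into training and test sets (object names assumed):
# model_lda    <- lda(x = train, grouping = group_train)             # training model
# pred_train   <- predict(model_lda, train)                          # training predictions
# model_lda_cv <- lda(x = train, grouping = group_train, CV = TRUE)  # leave-one-out cross-validation
# pred_test    <- predict(model_lda, test)                           # test-set predictions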
# figures of merit
# training
dim_train1 = dim(train1)
dim_train2 = dim(train2)
# test
dim_test1 = dim(test1)
dim_test2 = dim(test2)
col = c("blue","blue","blue","blue","blue","blue","blue","blue","red","red","red","red","red","red","red","red","red")
dev.new()
plot(data.scores[, 1], data.scores[, 2], pch = 19, col = col, xlab = "PC1", ylab = "PC2", main = "PCA scores - blue = creatinine, red = albumin")
dev.new()
matplot(data.loadings[, 1], type = "l", col = "blue", xlab = "Pixels", ylab = "Loadings", main = "PCA loadings")
lines(data.loadings[, 2], type = "l", col = "red")
col = c("blue","blue","blue","blue","blue","blue","red","red","red","red","red","red","red")
dev.new()
plot(pred_train$posterior, pch = 19, col = col, xlab = "LD1", ylab = "LD2", main = "Training - blue = creatinine, red = albumin")
col = c("blue","blue","blue","blue","blue","blue","red","red","red","red","red","red","red")
dev.new()
plot(model_lda_cv$posterior, pch = 19, col = col, xlab = "LD1", ylab = "LD2", main = "Cross Validation - blue = creatinine, red = albumin")
col = c("blue","blue","red","red")
dev.new()
plot(pred_test$posterior, pch = 19, col = col, xlab = "LD1", ylab = "LD2", main = "Test - blue = creatinine, red = albumin")
# SPA model
spa_model = project(x = data.loadings[, 1], A = datam, b = group12, neq = 0)  # spa model
# LDA Model
# figures of merit
# training
dim_train1 = dim(train1)
dim_train2 = dim(train2)
# cross-validation
# test
dim_test1 = dim(test1)
dim_test2 = dim(test2)
col = c("blue","blue","blue","blue","blue","blue","red","red","red","red","red","red","red")
dev.new()
plot(pred_train$posterior, pch = 19, col = col, xlab = "LD1", ylab = "LD2", main = "Training - blue = creatinine, red = albumin")
col = c("blue","blue","blue","blue","blue","blue","red","red","red","red","red","red","red")
dev.new()
plot(model_lda_cv$posterior, pch = 19, col = col, xlab = "LD1", ylab = "LD2", main = "Cross Validation - blue = creatinine, red = albumin")
col = c("blue","blue","red","red")
dev.new()
plot(pred_test$posterior, pch = 19, col = col, xlab = "LD1", ylab = "LD2", main = "Test - blue = creatinine, red = albumin")
# GA algorithm
# custom initial-population function: each of the GA@popSize chromosomes selects
# k variables at random (the default k is an assumed value)
myInit <- function(GA, k = 10) {
  m <- matrix(0, nrow = GA@popSize, ncol = GA@nBits)
  for (i in seq_len(GA@popSize))
    m[i, sample(GA@nBits, k)] <- 1
  m
}
# Mutation Function
# Fitness Function
# GA Model
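# (sketch, not from the original) the binary GA for variable selection could be set up with
# the GA package as follows (fit_fun, myMutation and datam are placeholders for the objects
# defined in the omitted code; myInit is the population function defined above):
# ga_model <- ga(type = "binary", fitness = fit_fun, nBits = ncol(datam),
#                popSize = 50, maxiter = 100,
#                population = myInit, mutation = myMutation)
# summary(ga_model)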
# selected variables
# LDA Model
# figures of merit
# training
dim_train1 = dim(train1)
dim_train2 = dim(train2)
# cross-validation
# test
col = c("blue","blue","blue","blue","blue","blue","red","red","red","red","red","red","red")
dev.new()
plot(pred_train$posterior, pch = 19, col = col, xlab = "LD1", ylab = "LD2", main = "Training - blue = creatinine, red = albumin")
col = c("blue","blue","blue","blue","blue","blue","red","red","red","red","red","red","red")
dev.new()
plot(model_lda_cv$posterior, pch = 19, col = col, xlab = "LD1", ylab = "LD2", main = "Cross Validation - blue = creatinine, red = albumin")
col = c("blue","blue","red","red")
dev.new()
plot(pred_test$posterior, pch = 19, col = col, xlab = "LD1", ylab = "LD2", main = "Test - blue = creatinine, red = albumin")
R Script
# Loading Packages
install.packages("imager")
library(imager)
library(ggplot2)
library(dplyr)
library(prospectr)
library(MASS)
install.packages("plot3D")
library(plot3D)
install.packages("plotly")
library(plotly)
install.packages("factoextra")
library(factoextra)
install.packages("Stat2Data")
library(Stat2Data)
install.packages("Metrics")
library(Metrics)
install.packages("pls")
library(pls)
data = rbind(t(im1v),t(im2v),t(im3v),t(im4v),t(im5v),t(im6v),t(im7v),t(im8v))
dim_data = dim(data)
y <- c(0.03125,0.0625,0.125,0.25,0.5,1,2,4)
# RGB intensities
# RGB absorbances
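# (sketch, not from the original) the mean channel intensities can be converted into
# pseudo-absorbances with A = -log10(I / I0), taking I0 = 255 as the white reference
# (object names hypothetical):
# im1abs <- -log10(im1rgb / 255)   # im1rgb: mean R, G and B intensities of image 1
# (repeat for images 2 to 8)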
x = rbind(im1abs, im2abs, im3abs, im4abs, im5abs, im6abs, im7abs, im8abs)
dim_x = dim(x)
size = 1:dim_x[1]
## MLR Model
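# (sketch, not from the original) after splitting x and y into calibration (xcal, ycal) and
# prediction (xpred, ypred) sets, the MLR coefficients could be estimated with lm:
# model_mlr <- lm(ycal ~ xcal)   # intercept plus one coefficient per color channel
# coef <- coef(model_mlr)        # coef[1] = intercept, coef[2:4] = R, G and B coefficients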
ycal_calc = xcal %*% coef[2:4] + coef[1]   # predicted concentrations - calibration set
ypred_calc = xpred %*% coef[2:4] + coef[1] # predicted concentrations - prediction set
dev.new()
plot(ycal, ycal_calc, pch = 'o', col = "blue", xlab = "Measured concentration (mg/dL)", ylab = "Predicted concentration (mg/dL)", main = "Calibration")
lines(ycal, ycal, col = "red")
dev.new()
plot(ycal, ycal_calc, pch = 'o', col = "blue", xlab = "Measured concentration (mg/dL)", ylab = "Predicted concentration (mg/dL)", main = "Calibration (blue) and Prediction (red)")
points(ypred, ypred_calc, pch = 15, col = "red")
lines(y, y, col = "red")
## Figures of merit
# Calibration
# Prediction
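# (sketch, not from the original) typical figures of merit could be computed with the
# Metrics package loaded above:
# rmsec  <- rmse(ycal, ycal_calc)     # root mean square error of calibration
# rmsep  <- rmse(ypred, ypred_calc)   # root mean square error of prediction
# r2cal  <- cor(ycal, ycal_calc)^2    # determination coefficient, calibration
# r2pred <- cor(ypred, ypred_calc)^2  # determination coefficient, prediction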
# PCA Variances
par(mfrow = c(2, 2))
barplot(data.vars[1:10], main = "Variance", names.arg = paste("PC", 1:10))
barplot(log(data.vars[1:10]), main = "Log(variance)", names.arg = paste("PC", 1:10))
barplot(data.relvars[1:10], main = "Relative variance", names.arg = paste("PC", 1:10))
barplot(cumsum(100 * data.relvars[1:10]), main = "Cumulative variance (%)", names.arg = paste("PC", 1:10), ylim = c(0, 100))
x = scores
dim_x = dim(x)
size = 1:dim_x[1]
## PCR model
xcal_df = data.frame(xcal)
xpred_df = data.frame(xpred)
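# (sketch, not from the original) with the pls package a PCR model could be fitted on the
# calibration scores and applied to both sets (number of components assumed):
# model_pcr  <- pcr(ycal ~ ., data = cbind.data.frame(ycal, xcal_df), ncomp = 3, validation = "LOO")
# ycal_calc  <- predict(model_pcr, xcal_df, ncomp = 3)
# ypred_calc <- predict(model_pcr, xpred_df, ncomp = 3)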
dev.new()
plot(ycal, ycal_calc, pch = 'o', col = "blue", xlab = "Measured concentration (mg/dL)", ylab = "Predicted concentration (mg/dL)", main = "Calibration")
lines(ycal, ycal, col = "red")
dev.new()
plot(ypred, ypred_calc, pch = 15, col = "red", xlab = "Measured concentration (mg/dL)", ylab = "Predicted concentration (mg/dL)", main = "Prediction")
lines(ypred, ypred, col = "red")
dev.new()
plot(ycal, ycal_calc, pch = 'o', col = "blue", xlab = "Measured concentration (mg/dL)", ylab = "Predicted concentration (mg/dL)", main = "Calibration (blue) and Prediction (red)")
points(ypred, ypred_calc, pch = 15, col = "red")
lines(y, y, col = "red")
## Figures of merit
# Calibration
# Prediction
## PLS Model
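# (sketch, not from the original) the PLS model validated below could be fitted as
# (data object and number of components assumed):
# model_pls <- plsr(ycal ~ ., data = cbind.data.frame(ycal, xcal_df), ncomp = 3, validation = "LOO")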
validationplot(model_pls)
validationplot(model_pls, val.type = "MSEP")
validationplot(model_pls, val.type = "R2")
## prediction
dev.new()
plot(ycal, ycal_calc, pch = 'o', col = "blue", xlab = "Measured concentration (mg/dL)", ylab = "Predicted concentration (mg/dL)", main = "Calibration")
lines(ycal, ycal, col = "red")
dev.new()
plot(ypred, ypred_calc, pch = 15, col = "red", xlab = "Measured concentration (mg/dL)", ylab = "Predicted concentration (mg/dL)", main = "Prediction")
lines(ypred, ypred, col = "red")
dev.new()
plot(ycal, ycal_calc, pch = 'o', col = "blue", xlab = "Measured concentration (mg/dL)", ylab = "Predicted concentration (mg/dL)", main = "Calibration (blue) and Prediction (red)")
points(ypred, ypred_calc, pch = 15, col = "red")
lines(y, y, col = "red")
## Figures of merit
# Calibration
# Prediction
PROPOSED EXERCISES
01 – Using a digital image dataset available in repositories or through images
collected with some equipment (cell phone, scanner), perform an exploratory analysis using
the main algorithms in the R language (PCA, HCA and K-means). Present your results and
main conclusions.
02 – Using a digital image dataset available in repositories or through images
collected with some equipment (cell phone, scanner), perform a multivariate classification
using algorithms in the R language (PCA-LDA, SPA-LDA and GA-LDA). Present your results
and main conclusions.
03 – From the previous exercise, write multivariate classification scripts in R (PCA-
QDA, SPA-QDA and GA-QDA) for a given dataset (simulated or experimental). Finally,
present your results and conclusions when compared to the multivariate classification
models used in the LDA function.
04 – Using a digital image dataset available in repositories or through images
collected with some equipment (cell phone, scanner), perform a multivariate regression
using algorithms in the R language (MLR, PCR and PLS). Present your results and main
conclusions.
REFERENCES
1 – Solomon, C.; Breckon, T. (2011). Fundamentals of Digital Image Processing – A Practical Approach with Examples in Matlab. 1st Edition. USA: John Wiley & Sons Ltd.
2 – Plataniotis, K. N.; Venetsanopoulos, A. N. (2000). Color Image Processing and Applications. Berlin; Heidelberg; New York: Springer.
3 – Gonzalez, R. C.; Woods, R. E. (2000). Digital Image Processing. 1st Ed. São Paulo: Editora Blucher.
R Script
install.packages("data.table")
library(data.table)
## visualize data
View(data)
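# (line assumed, not in the original excerpt)
# d <- dim(data)   # table dimensions, used to index the spectra below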
x = data[2:16,4:d[2]] # spectrum = rows 2-16 and columns 4 to the end of the table
## plotting spectra
A2 – PREPROCESSING SPECTRAL DATA IN R
Example A2: Applying different preprocessing methods to near-infrared (NIR) spectroscopy data using the data.table, pracma, pls and hyperSpec packages. Data extracted from: https://doi.org/10.1016/j.dib.2020.106647
R Script
## packages
library(data.table)
# Smoothing: Savitzky-Golay
install.packages("pracma")
library(pracma)
#MSC
install.packages("pls")
library(pls)
# Baseline correction
install.packages("hyperSpec")
library(hyperSpec)
data = fread("dataset/spectra_standard_cells.csv")
## visualize data
View(data)
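# (lines assumed, not in the original excerpt)
# d  <- dim(data)         # table dimensions
# nm <- data[1, 4:d[2]]   # wavelength axis, assumed to be stored in the first row of the table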
x = data[2:16,4:d[2]] # spectrum = rows 2-16 and columns 4 to the end of the table
## plotting spectra
matplot(t(nm), t(x), type = "l", xlab = "Wavelength (nm)", ylab = "Absorbance", main = "Raw data")
## ==================== Pre-processing ====================
matplot(t(nm), t(x_sg), type = "l", xlab = "Wavelength (nm)", ylab = "Absorbance", main = "Smoothed data - Savitzky-Golay")
# smoothing function
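# (sketch, not from the original) x_sg could be obtained by applying pracma::savgol to each
# spectrum; the filter length and polynomial order below are assumed values:
# x_sg <- t(apply(x, 1, function(s) savgol(as.numeric(s), fl = 11, forder = 4)))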
# plotting preprocessed data
matplot(t(nm), t(x_sw), type = "l", xlab = "Wavelength (nm)", ylab = "Absorbance", main = "Smoothed data - Moving Window")
# SNV function
SNV <- function(spectra) {
  spectra <- as.matrix(spectra)
  spectrat <- t(spectra)
  spectrat_snv <- scale(spectrat, center = TRUE, scale = TRUE)
  spectra_snv <- t(spectrat_snv)
  return(spectra_snv)
}
# applying SNV
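# (line assumed, not in the original excerpt)
# x_snv <- SNV(x)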
matplot(t(nm), t(x_snv), type = "l", xlab = "Wavelength (nm)", ylab = "Absorbance", main = "SNV")
# applying MSC
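# (sketch, not from the original) the pls package loaded above provides an msc() function:
# x_msc <- msc(as.matrix(x))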
matplot(t(nm), t(x_msc), type = "l", xlab = "Wavelength (nm)", ylab = "Absorbance", main = "MSC")
# pseudo-image object
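# (sketch, not from the original) the spectra can be wrapped in a hyperSpec object and a
# polynomial baseline fitted below the spectra and subtracted:
# spc_obj  <- new("hyperSpec", spc = as.matrix(x), wavelength = as.numeric(nm))
# baseline <- spc.fit.poly.below(spc_obj)
# x_base   <- (spc_obj - baseline)$spc   # baseline-corrected spectra as a matrix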
matplot(t(nm), t(x_base), type = "l", xlab = "Wavelength (nm)", ylab = "Absorbance", main = "Baseline Correction")
x_1d = x_1d[, w:(dim(x)[2] - w)]
nm_dm = data.matrix(nm)
nm_1d = nm_dm[, w:(dim(x)[2] - w)]
x_2d = x_2d[, w:(dim(x)[2] - w)]
nm_dm = data.matrix(nm)
nm_2d = nm_dm[, w:(dim(x)[2] - w)]
A3 – LOADING MOLECULAR FLUORESCENCE DATA: EXCITATION-EMISSION
MATRIX (EEM)
Example A3: Loading raw molecular fluorescence data (excitation-emission matrix
– EEM). Data extracted from: https://doi.org/10.3390/data8050081
R Script
## loading data
# sample 1
## visualize data
View(data)
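# (line assumed, not in the original excerpt)
# dim_data <- dim(data)   # table dimensions, used below to index the columns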
# comments:
# columns V1 to V50 are the excitation wavelengths
# for each excitation wavelength there is an emission wavelength and the associated intensity
# the emission wavelength values are constant, only the intensity varies
# excitation wavelength
nm_ex = data[seq(1, dim_data[2], 2)]  # extracting every 2nd column of data, starting from column 1
nm_ex = nm_ex[1,]                     # extracting only the 1st row
# emission wavelength
# intensities
eem = data[seq(2, dim_data[2], 2)]  # extracting every 2nd column of data, starting from column 2
eem = eem[3:253, 1:length(eem)]     # extracting rows 3 to 253 to match the emission wavelengths
eem = data.matrix(eem)              # converting to numeric values
# comments:
# in the matrix, each row corresponds to an emission wavelength and each column to an excitation wavelength
## viewing the eem matrix
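# (sketch, not from the original) a quick visualisation of the fluorescence landscape, e.g.:
# filled.contour(eem, main = "EEM - sample 1")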
## comments:
# sample 2
# eem2 - intensities
eem2 = data2[seq(2, dim_data2[2], 2)]  # extracting every 2nd column of data2, starting from column 2
eem2 = eem2[3:253, 1:length(eem2)]     # extracting rows 3 to 253 to match the emission wavelengths
eem2 = data.matrix(eem2)               # converting to numeric values
# 3D tensor
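# (sketch, not from the original) the two samples can be stacked into a 3-way array (tensor):
# eem_tensor <- array(c(eem, eem2), dim = c(nrow(eem), ncol(eem), 2))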
# EEM data in other formats (from other devices) can be loaded using the EEM or eemR packages
# Examples:
## loading packages
# install.packages ("EEM")
# library (EEM)
## loading sample
# drawEEM(data, n=1)
## loading packages
# install.packages ("eemR")
# library (eemR)
## loading sample
# the eemR package performs pre-processing on the loaded data through specific routines - see the tutorials at the links below
# EEM: https://cran.r-project.org/web/packages/EEM/vignettes/vignette.html
# eemR: https://cran.r-project.org/web/packages/eemR/eemR.pdf