A Machine Learning Framework for Olive Farms Profit Prediction
Figure 1. ML algorithms performance estimations after cross-validation.
Figure 2. ML algorithms sensitivity analysis on cross-validation.
Figure 3. ML algorithms sensitivity analysis on repeated cross-validation.
Figure 4. Correlation between LOOCV and RCV for all algorithms.
Figure 5. ML algorithms accuracy scores after optimized RCV.
1. Introduction
2. Methodology
2.1. Classification vs. Regression
2.2. Causal Analysis and One-Hot Encoding Variables
2.2.1. One-Hot Encoding
2.2.2. The Dummy Variable Trap
2.3. The Problem of Overfitting. Bias and Variance
- The model makes wrong assumptions. We are not referring to the obvious case, i.e., diagnosing a patient as ill when in fact he is healthy. Consider instead the case in which we predict that a loan candidate is not trustworthy just because her name is Jane. These examples derive from classification problems, but the key concept, which is also valid for regression, is that such a case reveals the failure of the algorithm to capture the relations between the predictors and the dependent variable or, maybe more importantly, exposes a bad choice of predictors [47,48]. This is commonly referred to as underfitting [49]. Underfitting usually characterizes overly simple models. Simplicity refers not only to non-complex models but also to the omission of the data manipulation and preparation steps discussed in our proposed framework. Additionally, mishandling data includes leaving outlier values untreated and failing to remove useless or highly correlated features, which do not add value to the prediction of the output variable.
- The model is very sensitive to fluctuations in the observations used during training. Inconsistency among the data may be due to noise or outliers, but it can also reflect rare but anticipated behavior. The algorithm can become overly complex trying to capture the noise and all the inconsistencies. The fact that it succeeds during training does not mean that it will be effective on new, unseen data. It becomes apparent, especially after multiple training executions, that the model fails to generalize to new data [50]: its estimated values are widely spread around the actual observations. This is referred to as overfitting.
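The two failure modes above can be illustrated with a minimal sketch. It is not the setup used in this study: the data are synthetic (a noisy sine curve, chosen only for illustration) and the model is a hand-written k-nearest-neighbors regressor. With k = 1 the model memorizes the training set (overfitting), and with k equal to the training set size it always predicts the global mean (underfitting).

```python
import math
import random

def knn_predict(train, x, k):
    """Average the targets of the k training points nearest to x."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    return sum(y for _, y in nearest) / k

def rmse(train, points, k):
    """Root mean squared error of the k-NN model on the given points."""
    errors = [(knn_predict(train, x, k) - y) ** 2 for x, y in points]
    return math.sqrt(sum(errors) / len(errors))

rng = random.Random(0)

def sample(n):
    # Synthetic data (an assumption for illustration): y = sin(x) + noise.
    xs = [rng.uniform(0.0, 6.0) for _ in range(n)]
    return [(x, math.sin(x) + rng.gauss(0.0, 0.3)) for x in xs]

train, test = sample(60), sample(60)

results = {}
for k in (1, 7, len(train)):
    # k = 1: very flexible model; k = len(train): constant (mean) model.
    results[k] = (rmse(train, train, k), rmse(train, test, k))
    print(f"k={k:2d}  train RMSE={results[k][0]:.3f}  test RMSE={results[k][1]:.3f}")
```

The k = 1 model reproduces every training observation exactly, yet its error on the unseen points is clearly larger, which is the signature of overfitting. The constant model is equally poor on both sets, which is the signature of underfitting.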
2.4. Dataset Splitting. Training and Test Sets
2.4.1. Training and Test Sets
2.4.2. Splitting Strategy
2.4.3. Splitting Timing
2.5. Exploratory Data Analysis
- Descriptive Analysis. During this step, characteristics of the dataset were examined, such as dimensions, types of variables, and statistical summaries to get a view of the data.
- Visualizations. Plotting single and multiple variables values led to a better understanding of each feature and the relations among them.
- Cleaning. This involved duplicates removal, locating missing values, and techniques to fill in for missing data.
- Transforms. Data could be further processed or massaged without altering the quality or the patterns they convey. Rescaling the features, examining their distributions, and readjusting them were ways to better match the structure of our data to the requirements of the algorithms [61].
- Standardizing values was extremely useful because it provided a convenient way to compare values that belonged to different distributions. A dataset is standardized when the input features are transformed to have a mean close to zero and a standard deviation close to 1. The effect was that the data were centered and rescaled to the location and spread of a standard normal distribution. Standardization helps Machine Learning algorithms like k-nearest neighbors, linear regression, and support vector machines to build more robust models [46]. Standardization was performed by subtracting the mean (μ) of the feature from each observation (χ) and dividing the result by the standard deviation (σ) of the feature, i.e., (χ − μ)/σ [62].
- Scaling maps the values of a feature onto a specific range, usually [0, 1] [62]. Because the range is defined by the minimum and maximum values, the presence of outliers affects the scaling process [47]. Scaling is most useful when the input variables differ widely in magnitude. Transforming them to a common range can improve the performance of Machine Learning algorithms [63].
- Regularization is an approach to treat poor performance caused by high collinearity among input variables [46]. Regularization methods penalize increasing complexity during the modeling process, thus preventing overfitting. Preprocessing such as scaling and standardization was highly important for regularization because it brought the values of the variables to comparable scales and ranges. Approaches include L1 regularization, L2 regularization, and dropout. Regularization is embedded in algorithms like Lasso, Ridge regression, and Elastic Net. Because of the nature of their penalizing mechanisms, Lasso and Elastic Net can also be considered methods that perform automatic feature selection, as described below.
- Feature Engineering and Selection. Having many feature variables participate in the training process of a model is not always a road to success [57]. It may seem logical that the more inputs we possess (no matter the number of observations), the better the prediction we can achieve. This is a misconception that requires attention in Machine Learning projects. Not every feature at our disposal contributes to the predictive value of a model. The mere fact that we have historical information on a feature does not make it useful. On the contrary, it may have a negative impact by introducing, for example, unnecessary bias. Moreover, collinearity among features was very important [57]. Collinear predictors have a negative impact on the modeling process most of the time [52]. Therefore, we (a) checked for predictors which did not contribute to the predictive power, (b) eliminated predictors which were highly correlated, and (c) constructed new appropriate ones if needed. A high association among variables indicates increased dependency, and both the strength and the direction of the association matter. There are methods for calculating dependency for both discrete and continuous variables. Pearson’s χ2 statistic, Cramer’s V statistic, and the contingency coefficient C [41] are very popular for examining discrete variables. For continuous variables, Pearson’s correlation coefficient is computed from the mean and the standard deviation [41]. Therefore, the samples needed to have a Gaussian or near-Gaussian distribution [64]. This is where transforms played a major role in preparing the data for analysis. Transforms were expected to help in the proposed framework because the input variables describe different physical measures, which are quite dissimilar in ranges and values.
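The transforms and the correlation check described in this list can be sketched in a few lines of plain Python. This is an illustrative sketch only: the function names and the sample values are hypothetical, and the population standard deviation is assumed for standardization.

```python
import statistics

def standardize(values):
    """Subtract the mean and divide by the standard deviation."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)  # population std (an assumption)
    return [(x - mu) / sigma for x in values]

def min_max_scale(values):
    """Map the values linearly onto the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(x - lo) / (hi - lo) for x in values]

def pearson(xs, ys):
    """Pearson correlation: mean product of the standardized values."""
    zx, zy = standardize(xs), standardize(ys)
    return statistics.fmean([a * b for a, b in zip(zx, zy)])

# Hypothetical feature values, not taken from the study's dataset.
feature = [50.0, 75.0, 100.0, 125.0]
with_outlier = feature + [1000.0]

print(standardize(feature))         # centered on 0 with unit spread
print(min_max_scale(feature))       # evenly spread over [0, 1]
print(min_max_scale(with_outlier))  # the outlier squeezes the rest near 0
print(pearson(feature, [2.0 * x + 1.0 for x in feature]))  # perfectly collinear pair
```

The third call shows why outliers matter for scaling: a single extreme value defines the upper end of the range and compresses all other observations into a narrow band. A pair of features with a Pearson coefficient near 1 or −1, as in the last call, is a candidate for elimination under step (b).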
2.6. Resampling
2.6.1. Measuring Error
2.6.2. Resampling for Model Assessment
2.6.3. K-Fold Cross-Validation (CV)
2.6.4. Repeated K-Fold Cross-Validation (RCV)
2.6.5. Nested Cross-Validation (NCV)
2.6.6. Leave-One-Out Cross-Validation (LOOCV)
2.7. Machine Learning Algorithms
2.7.1. Algorithms in General
2.7.2. Parametric vs. Non-Parametric Modelling
- Linear Regression
- Bayes Ridge Regression
- Ridge Regression
- LASSO regression
- K-nearest Neighbors
- Regression Trees
- Support Vector Machines Regression
2.8. Hyperparameter Tuning
2.9. Ensembling
- Extreme Gradient Boosting
- Gradient Boosting
- Random Forests
- Extra Trees
2.10. Performance Metrics
- Mean Absolute Error and Mean Squared Error. The mean absolute error (MAE) is a computationally simple regression error metric. The residual is the difference between the predicted value and the observed value; MAE is the average of the absolute values of the residuals over all observations, while the mean squared error (MSE) is the average of the squared residuals [42,80].
- Root Mean Squared Error (RMSE). It is the square root of the MSE [42,80]. It represents the sample standard deviation of the residuals. Practically, it reveals how spread out the residuals are. It is often preferred over MSE because its units are the same as those of the output variable [80].
- R2 or coefficient of determination. R Squared and Adjusted R Squared indicate how well the model fits the data [82,83]. They provide great insight when evaluating the training process but are also useful during the testing phase. The R2 value tends to increase as the number of input features increases [84], which poses a challenge when the number of features in a modeling process is high [85]. Adjusted R2 improves on R2 in that it compensates for this effect and therefore better reflects the avoidance of overfitting. R Squared values typically range from 0 to 1, although the score can become negative on a test set when the model predicts worse than the mean of the observations. Values approaching 1 indicate a better fit [46]. R2 is computed as one minus the ratio of the residual sum of squares to the total sum of squares [46].
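The metrics in this list can be computed directly from their definitions. The following is a minimal sketch using the standard textbook formulas. The function name, the sample values, and the `n_features` argument (needed only for Adjusted R2) are illustrative assumptions.

```python
import math

def regression_metrics(y_true, y_pred, n_features):
    """Return MAE, MSE, RMSE, R2 and Adjusted R2 for paired observations."""
    n = len(y_true)
    residuals = [pred - obs for obs, pred in zip(y_true, y_pred)]
    mae = sum(abs(r) for r in residuals) / n
    mse = sum(r * r for r in residuals) / n
    rmse = math.sqrt(mse)
    mean_obs = sum(y_true) / n
    ss_res = sum(r * r for r in residuals)               # residual sum of squares
    ss_tot = sum((obs - mean_obs) ** 2 for obs in y_true)  # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    # Adjusted R2 penalizes the number of features used by the model.
    adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - n_features - 1)
    return {"mae": mae, "mse": mse, "rmse": rmse, "r2": r2, "adj_r2": adj_r2}

# Hypothetical observations and predictions, for illustration only.
observed = [1.0, 2.0, 3.0, 4.0, 5.0]
predicted = [1.5, 2.0, 2.5, 4.0, 5.5]
good = regression_metrics(observed, predicted, n_features=1)
print(good)

# A model that is worse than predicting the mean yields a negative R2.
bad = regression_metrics(observed, [10.0] * 5, n_features=1)
print(bad["r2"])
```

The second call shows how R2 can fall below zero when the predictions are worse than simply predicting the mean of the observations.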
3. Case Study
3.1. Area of Study
- Management Practices
- Soil types
- Precipitation
- Relative irrigation (percentage of the optimum irrigation amount estimated by the hydrological models [28])
- Reduction in the number of irrigation trips
3.2. Classification, Regression and Binning Predictors
- It could produce a model with lower performance.
- There would be a loss in prediction precision because the outcome would be restricted to a fixed set of possible values.
- The number of false positives could increase.
3.3. One-Hot and Label Encoding
- Management practice with values of PH and PL (heavy pruning & light pruning)
- The soil type with values of CL, SL, and LS (Clay, Sandy Loam, Loamy Sand)
- Precipitation with values of Dry, Normal, and Wet
- management.M1
- soil_type.CL
- soil_type.SL
- precipitation
- relative_irrigation
- number_of_trips_reduction
- relative_profit_percentage
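The encoded columns listed above are consistent with one-hot encoding in which one level per categorical variable is dropped to avoid the dummy variable trap, combined with label encoding for precipitation. A minimal sketch of that idea follows. The helper names are hypothetical, and the integer codes chosen for Dry, Normal, and Wet are an assumption, since the actual codes are not stated here.

```python
def one_hot_drop_last(value, levels, prefix):
    """One-hot encode `value`, omitting the last level as the reference.

    Dropping one level avoids the dummy variable trap: the omitted level
    is implied when all remaining indicator columns are zero.
    """
    if value not in levels:
        raise ValueError(f"unknown level {value!r} for {prefix}")
    return {f"{prefix}.{level}": int(value == level) for level in levels[:-1]}

# Assumed integer codes for the label-encoded variable.
PRECIPITATION_CODES = {"Dry": 0, "Normal": 1, "Wet": 2}

def encode(record):
    """Encode the soil type (one-hot) and precipitation (label) of a record."""
    row = one_hot_drop_last(record["soil_type"], ["CL", "SL", "LS"], "soil_type")
    row["precipitation"] = PRECIPITATION_CODES[record["precipitation"]]
    return row

print(encode({"soil_type": "SL", "precipitation": "Normal"}))
print(encode({"soil_type": "LS", "precipitation": "Wet"}))
```

With LS as the reference level, only soil_type.CL and soil_type.SL are needed: a row with zeros in both columns is a Loamy Sand observation.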
3.4. Splitting
3.5. Data Analysis
3.6. Resampling
3.6.1. Choosing the Appropriate Resampling Method
3.6.2. Cross-Validation Parameters
3.6.3. Sensitivity Analysis on K
- The median value of the cross-validation results. If the mean accuracy and the median are close, it is a strong indication that the specific execution reflects the central tendency well and without skewness [97].
- The standard error. It measures the deviation of the sample mean from the population mean. It was useful because it reflected how accurately the mean value represented the data [80,98]. We observed the minimum and maximum values across the executions of the experiments, and if the preferred scenario also had a low standard error, it was chosen as the recommended one.
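The quantities used in the sensitivity analysis (mean, median, and standard error of the fold scores, and the distance from the LOOCV baseline) can be sketched as follows. The data are synthetic and the model is a simple least-squares line, both chosen only to keep the example self-contained. They are assumptions and not the algorithms or the dataset of this study.

```python
import math
import random
import statistics

def fit_line(points):
    """Ordinary least squares for y = intercept + slope * x (closed form)."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def fold_rmses(points, k):
    """RMSE on each of k held-out folds; k == len(points) gives LOOCV."""
    scores = []
    for i in range(k):
        held_out = points[i::k]
        training = [p for j, p in enumerate(points) if j % k != i]
        intercept, slope = fit_line(training)
        squared = [(intercept + slope * x - y) ** 2 for x, y in held_out]
        scores.append(math.sqrt(sum(squared) / len(squared)))
    return scores

def summarize(scores):
    """Mean, median and standard error of a list of fold scores."""
    mean = statistics.fmean(scores)
    median = statistics.median(scores)
    std_error = statistics.stdev(scores) / math.sqrt(len(scores))
    return mean, median, std_error

rng = random.Random(42)
# Synthetic data (an assumption): a noisy straight line with 40 observations.
points = [(x, 0.5 * x + rng.gauss(0.0, 0.2))
          for x in (rng.uniform(0.0, 10.0) for _ in range(40))]

loocv_mean, _, _ = summarize(fold_rmses(points, len(points)))
for k in (2, 5, 10, 20):
    mean, median, std_error = summarize(fold_rmses(points, k))
    print(f"k={k:2d} mean={mean:.4f} median={median:.4f} "
          f"se={std_error:.4f} gap_to_loocv={abs(mean - loocv_mean):.4f}")
```

A value of k would be preferred when its mean lies close to the LOOCV mean, its mean and median agree, and its standard error is low, mirroring the criteria listed above.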
3.7. Machine Learning Algorithms
- Linear Regression
- Bayes Ridge Regression
- Ridge Regression
- LASSO Regression
- K-Nearest Neighbors
- CART Decision Trees
- Support Vector Machines Regression (SVMR)
- Extreme Gradient Boosting
- Gradient Boosting
- Random Forests
- Extra Trees
3.8. Hyperparameter Tuning
3.9. Performance Metrics Comparison and Choice
4. Results
- 1. Fit the models based on the group of ML algorithms we chose. No data transforms, algorithm tuning, or other tweaking are performed. The performance results after evaluating the algorithms on the test set are given in Table 1.
- 2. The same process is repeated with stratified continuous splitting, and the results are shown in Table 2.
- 3. Exploration of the nature of the data reveals feature correlations and distributions, which in turn point to feature extractions and data transforms. Table 3 displays the Pearson correlation values between feature pairs.
- 4. Scaling, standardization, and a power transform (Box-Cox) are also applied to the dataset features to help the algorithms’ execution. The results are presented in Table 5.
- 5. In this and the following step, cross-validation experiments are performed. Standardization is applied to the data before assessing cross-validation performance. Initially, cross-validation is executed on the training dataset with a default value of 10 as the number of folds. The results are sets of root mean squared errors. The boxplots in Figure 1 display the cross-validation performance assessment for the algorithms under testing.
- 6. Sensitivity analysis on the value of k for cross-validation is executed. Tests were performed on values of k in the integer range [2, 40]. As described in Section 3.6.3, the red lines display the optimal (baseline) results given by the LOOCV. In each execution, we observe (a) the blue lines, which display the cross-validation root mean squared error (mean accuracy) and must be close to the red line, and (b) the yellow lines, which show the median. Blue segments below the red line (ideal case) are considered pessimistic estimates and those above the line optimistic estimates. The second case means overfitting [110]. The characteristics which point to the optimal k are (a) a small distance from the LOOCV mean and (b) a low standard error. From the execution table values and the line plots collected in Figure 2, the optimal value for k is 38, with a mean accuracy score of 0.06605177. The difference from the LOOCV median is 0.01193512. The standard error ranges from 0.00203894 to 0.00398050 throughout the experiments. For k = 38 the mean standard error is 0.00264908, a value near the minimum of the set of executions.
- 7. Support Vector Machines Regressor (SVMR) is the best-performing algorithm, as shown in the earlier steps. Repeated cross-validation will help to improve its performance through hyperparameter tuning. An exhaustive grid search will be used to achieve that, exploiting the optimal values of repeated cross-validation. The exploration of values will be executed inside nested cross-validation of 10 folds. Standardization is applied on both the validation and the test sets. SVMR has three major hyperparameters for tweaking [111,112]. (A) The kernel. The kernel types that we will test are linear, poly, rbf, and sigmoid. (B) The tolerance for the stopping criterion. The set that will be tested is: [0.000001, 0.00001, 0.0001, 0.001, 0.01]. (C) The C regularization parameter. It represents how strict the algorithm will be when there are errors in fitting. The values to test are: [1, 1.5, 2, 2.5, 3].
- 8. In this step, we will experiment with ensemble methods. Extreme Gradient Boosting, Gradient Boosting, Random Forests, and Extra Trees are evaluated on the standardized dataset with repeated cross-validation (folds = 15, repeats = 15). The results are presented in Table 8.
- 9. Finally, hyperparameter tuning will be executed on Gradient Boosting to investigate whether its performance can be further enhanced. The number of estimators (N estimators) is the hyperparameter to adjust, and its default value is 100 [113]. Usually, if this number is increased, so is performance (at a computational cost) [113]. We will test the range [50, 200]. This experiment, using nested cross-validation, showed that 50 was the best value, with a mean accuracy of 0.051271 and a standard error of 0.006015, improving on the scores obtained with the default values.
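To give a sense of the size of the exhaustive grid search over the SVMR hyperparameters, the candidate settings can be enumerated as in the sketch below. The values are those listed for the kernel, the tolerance, and C. The dictionary keys `kernel`, `tol`, and `C` are assumed names that mirror common library conventions, and the helper function is hypothetical.

```python
from itertools import product

# Hyperparameter values as listed in the text; key names are assumptions.
param_grid = {
    "kernel": ["linear", "poly", "rbf", "sigmoid"],
    "tol": [0.000001, 0.00001, 0.0001, 0.001, 0.01],
    "C": [1, 1.5, 2, 2.5, 3],
}

def iter_candidates(grid):
    """Yield every combination of the grid values as a dict."""
    keys = list(grid)
    for values in product(*(grid[key] for key in keys)):
        yield dict(zip(keys, values))

candidates = list(iter_candidates(param_grid))
print(len(candidates), "candidate settings; first:", candidates[0])
```

Each candidate setting is then scored by cross-validation, so the cost of the search grows with the product of the grid size, the number of folds, and the number of repeats.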
5. Discussion
6. Conclusions
Appendix A
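- Bagging: The term comes from Bootstrap Aggregating because those two techniques are combined. It works by drawing random bootstrapped subsamples (with replacement) from the initial dataset and training a sub-model on each of them. Afterwards, the predictions of the sub-models are aggregated, by averaging for regression or voting for classification, to form the final predictor.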
- Random Forests: They resemble Bootstrap Aggregation. Bagging is somewhat predictable in behavior because, at every split, the algorithm has all the predictors at hand. In a random forest, multiple trees are produced, but each split considers only a random subset of the predictors. This contributes to greater independence among the base models and leads to powerful aggregation and accurate predictions. The bottom line is that the base models must be as efficient and as diverse as possible.
- Boosting: The principle in boosting algorithms is to combine multiple weak learners into a stronger one. Weights are assigned to the learners depending on their comparative performance, and higher weights are assigned to incorrectly predicted cases. In the end, the weighted sum is used for the final prediction. Boosting differs from bagging in that it trains the learners sequentially, each one on the reweighted data.
- Stacking: In stacking, weak base models are exploited, but they are processed in parallel. A weakness of this approach is that each base model contributes equally to the ensemble regardless of how well it performs. The base learners are often heterogeneous, meaning that the group consists of different algorithms.
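As an illustration of bootstrap aggregating for regression, the sketch below trains a very simple base learner on each bootstrap resample and averages the predictions. The base learner (a one-nearest-neighbor regressor), the function names, and the data points are assumptions made for the example and are not part of this study.

```python
import random

def bootstrap_sample(data, rng):
    """Draw len(data) observations from data with replacement."""
    return [rng.choice(data) for _ in data]

def nearest_neighbor(train, x):
    """Base learner: return the target of the closest training point."""
    return min(train, key=lambda p: abs(p[0] - x))[1]

def bagged_predict(data, x, n_models=25, seed=0):
    """Average the base learner's predictions over bootstrap resamples."""
    rng = random.Random(seed)
    predictions = [nearest_neighbor(bootstrap_sample(data, rng), x)
                   for _ in range(n_models)]
    return sum(predictions) / n_models

# Hypothetical (x, y) observations, for illustration only.
data = [(1.0, 1.1), (2.0, 1.9), (3.0, 3.2), (4.0, 3.9), (5.0, 5.1)]
print(bagged_predict(data, 2.6))
```

Because each resample omits some observations, the individual base learners disagree, and averaging them smooths the prediction compared with a single nearest-neighbor model.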
ML Algorithm | R2 Train Score | R2 Test Score | RMSE Test Score |
Linear Regression | 0.5470 | 0.4754 | 0.00432 |
Bayes Ridge Regression | 0.54686 | 0.47553 | 0.00432 |
Ridge Regression | 0.54698 | 0.47548 | 0.00432 |
LASSO Regression | 0 | −0.0016 | 0.00825 |
K-Nearest Neighbors | 0.66823 | 0.39674 | 0.00497 |
CART | 0.8056 | 0.51325 | 0.00401 |
Support Vector Machines Regression | 0.64731 | 0.56272 | 0.0036 |
ML Algorithm | R2 Train Score | R2 Test Score | RMSE Test Score |
Linear Regression | 0.49999 | 0.60878 | 0.00362 |
Bayes Ridge Regression | 0.49978 | 0.60663 | 0.00364 |
Ridge Regression | 0.49997 | 0.60816 | 0.00363 |
LASSO Regression | 0.64847 | −0.0014 | 0.00927 |
K-Nearest Neighbors | 0.6485 | 0.42249 | 0.00535 |
CART | 0.79909 | 0.58683 | 0.00382 |
Support Vector Machines Regression | 0.60912 | 0.62566 | 0.00347 |
Feature | Management.M1 | Soil_Type.CL | Soil_Type.SL | Precipitation | Relative_Irrigation | Number_of_Trips_Reduction |
management.M1 | --- | 7.071077 × 10−1 | 2.272757 × 10−16 | 1.029597 × 10−16 | 1.007215 × 10−16 | 9.568540 × 10−17 |
soil_type.CL | --- | --- | 5 × 10−1 | 1.724138 × 10−16 | 5.771856 × 10−16 | 2.890379 × 10−16 |
soil_type.SL | --- | --- | --- | 4.659030 × 10−17 | 4.062555 × 10−16 | 4.985459 × 10−17 |
precipitation | --- | --- | --- | --- | 3.69276 × 10−17 | 2.604227 × 10−17 |
relative_irrigation | --- | --- | --- | --- | --- | 8.9092 × 10−18 |
number_of_trips_reduction | --- | --- | --- | --- | --- | --- |
ML Algorithm | R2 Train Score | R2 Test Score | RMSE Test Score |
Linear Regression | 0.47358 | 0.58799 | 0.00381 |
Bayes Ridge Regression | 0.47341 | 0.58531 | 0.00384 |
Ridge Regression | 0.47357 | 0.58718 | 0.00382 |
LASSO Regression | 0 | −0.0014 | 0.00927 |
K-Nearest Neighbors | 0.64755 | 0.41146 | 0.00545 |
CART | 0.79909 | 0.58683 | 0.00382 |
Support Vector Machines Regression | 0.60679 | 0.62352 | 0.00349 |
ML Algorithm | R2 Train/Test (Scaling) | R2 Train/Test (Standardization) | R2 Train/Test (Power Transform) | RMSE (Scaling) | RMSE (Standardization) | RMSE (Power Transform) |
Linear Regression | 0.49999/0.60878 | 0.49999/0.60878 | 0.49615/0.61141 | 0.00362 | 0.00362 | 0.0036 |
Bayes Ridge Regression | 0.49971/0.60841 | 0.498789/0.60872 | 0.49579/0.61105 | 0.00363 | 0.00362 | 0.0036 |
Ridge Regression | 0.49998/0.60878 | 0.49999/0.60885 | 0.49613/0.61141 | 0.00362 | 0.00362 | 0.0036 |
LASSO Regression | 0/−0.0014 | 0/−0.0014 | 0/−0.0014 | 0.00927 | 0.00927 | 0.00927 |
K-Nearest Neighbors | 0.74781/0.62416 | 0.72762/0.66027 | 0.71500/0.55056 | 0.00348 | 0.00315 | 0.00416 |
CART | 0.79909/0.58683 | 0.79909/0.58683 | 0.79909/0.58683 | 0.00382 | 0.00382 | 0.00382 |
Support Vector Machines Regression | 0.63347/0.65531 | 0.63978/0.66826 | 0.63010/0.64965 | 0.00319 | 0.00307 | 0.00324 |
ML Algorithm | Mean Accuracy—(Standard Error) For Repeated Cross-Validation after Sensitivity Analysis |
Linear Regression | 0.066925—(0.008849) |
Bayes Ridge Regression | 0.067004—(0.008873) |
Ridge Regression | 0.067029—(0.008846) |
LASSO Regression | 0.093284—(0.010321) |
K-Nearest Neighbors | 0.059908—(0.007890) |
CART | 0.058579—(0.006713) |
Support Vector Machines Regression | 0.057669—(0.006979) |
Model | R2 Training Score | R2 Test Score | RMSE Test Score |
SVMR—Default hyperparameters | 0.63978 | 0.66826 | 0.00307 |
SVMR—Tuned hyperparameters | 0.63961 | 0.66833 | 0.00307 |
Ensemble Method | Mean Accuracy—(Standard Error) For Repeated Cross-Validation |
Extreme Gradient Boosting | 0.057299—(0.0066) |
Gradient Boosting | 0.050332—(0.005987) |
Random Forests | 0.056302—(0.006642) |
Extra Trees | 0.058541—(0.006728) |
Ensemble Method | R2 Training Score | R2 Test Score | RMSE Test Score |
Extreme Gradient Boosting | 0.79745 | 0.62404 | 0.00348 |
Gradient Boosting | 0.77405 | 0.73282 | 0.00247 |
Gradient Boosting (Tuned) | 0.74533 | 0.72741 | 0.00252 |
Random Forests | 0.7971 | 0.62462 | 0.00348 |
Extra Trees | 0.79909 | 0.58924 | 0.0038 |
Model | R2 Training Score | R2 Test Score | RMSE Test Score |
Support Vector Machines Regression (Experiment 1) | 0.64731 | 0.56272 | 0.0036 |
Gradient Boosting (Experiment 9) | 0.74533 | 0.72741 | 0.00252 |
Performance improvement on the test set (%) | | 29.27% | 42.88% |
