Tabular Data Models for Predicting Art Auction Results
Figure: Bar chart comparing the sMAPE metric of the KNN model across the datasets.
Figure: Bar chart comparing the sMAPE metric of the DecisionTree model across the datasets.
Figure: Bar chart comparing the sMAPE metric of the RandomForest model across the datasets.
Figure: Bar chart comparing the sMAPE metric of the XGBoost model across the datasets.
Figure: Bar chart comparing the sMAPE metric of the CatBoost model across the datasets.
Figure: Bar chart comparing the sMAPE metric of the LightGBM model across the datasets.
Figure: Bar chart comparing the sMAPE metric of the MLP model across the datasets.
Figure: Bar chart comparing the sMAPE metric of the VIME model across the datasets.
Figure: Bar chart comparing the sMAPE metric of the DeepFM model across the datasets.
Figure: Bar chart comparing the sMAPE metric of the SAINT model across the datasets.
Figure: Feature importance plots of the top-performing XGBoost and RandomForest models. (a) Feature importance plot for XGBoost on the NoImg dataset. (b) Feature importance plot for RandomForest on the ColorSVD dataset.
Featured Application
Abstract
1. Introduction
2. Materials and Methods
2.1. Datasets
- ARTIST: Name of the artist.
- TECHNIQUE: Medium or method used (e.g., lithograph, etching).
- SIGNATURE: Whether the artwork is hand signed, plate signed or unsigned.
- CONDITION: Physical state of the artwork.
- TOTAL DIMENSIONS: Area of the artwork.
- YEAR: Year of creation.
- PRICE: Final auction price.
- AuctionResultsNoImg (NoImg) Dataset contains only the core features without any image-related features.
- AuctionResultsColor (Color) Dataset includes an additional Colorfulness Score [18], a measure of color intensity and variety of the image of the artwork.
- AuctionResultsSVD (SVD) Dataset adds the SVD Entropy [19] of the artwork image, which quantifies the complexity of the artwork’s visual representation, to the core features; it does not include the Colorfulness Score.
- AuctionResultsColorSVD (ColorSVD) Dataset adds both the Colorfulness Score and the SVD Entropy to the core features.
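The two image-derived features can be illustrated with a minimal sketch. The colorfulness function follows the widely used Hasler and Süsstrunk formulation cited as [18]; the SVD entropy function assumes a Shannon entropy (base 2) over normalized singular values, which may differ in detail from the definition in [19]. The function names, the `eps` threshold, and the choice of logarithm base are assumptions made for illustration, not the authors' implementation.

```python
import numpy as np


def colorfulness_score(rgb):
    """Hasler-Susstrunk colorfulness of an H x W x 3 RGB array."""
    rgb = np.asarray(rgb, dtype=np.float64)
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    rg = r - g                   # red-green opponent channel
    yb = 0.5 * (r + g) - b       # yellow-blue opponent channel
    std_root = np.hypot(rg.std(), yb.std())
    mean_root = np.hypot(rg.mean(), yb.mean())
    return float(std_root + 0.3 * mean_root)


def svd_entropy(gray, eps=1e-12):
    """Shannon entropy (bits) of the normalized singular values.

    Assumption: `gray` is a 2-D grayscale image; singular values whose
    normalized weight is below `eps` are treated as zero.
    """
    gray = np.asarray(gray, dtype=np.float64)
    s = np.linalg.svd(gray, compute_uv=False)
    total = s.sum()
    if total <= eps:
        return 0.0
    p = s / total
    p = p[p > eps]
    return float(-(p * np.log2(p)).sum())
```

Under these assumptions, a pure gray image has a colorfulness of zero, and a rank-one image (for example, a flat field) has an SVD entropy of zero, while images with energy spread over many singular values score higher.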
2.2. Data Preprocessing
2.3. Dataset Description
2.4. Model Selection and Training
2.5. Reproducibility
3. Results
3.1. sMAPE Score Analysis Across Datasets
3.2. LinearModel (Linear Regression)
3.3. K-Nearest Neighbors
3.4. DecisionTree
3.5. RandomForest
3.6. XGBoost
3.7. CatBoost
3.8. LightGBM
3.9. Multi-Layer Perceptron (MLP)
3.10. VIME
3.11. ModelTree
3.12. DeepGBM
3.13. DeepFM
3.14. SAINT
3.15. Feature Importance Plots
3.16. Summary of the Results
4. Discussion
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
Appendix A. Data Preprocessing
Appendix A.1. Ensuring Data Structure
- Column Elimination: Non-essential columns are identified and removed according to a predefined column structure. This step reduces noise and ensures that only relevant variables are retained.
- Date Formatting and Validation: The AUCTION DATE column is standardized to the datetime format. Rows containing invalid or missing dates are excluded from the dataset to maintain temporal accuracy (invalid ‘AUCTION DATE’ rows removed: 4).
- Object Type Filtering: The dataset is filtered to include only entries where the OBJECT is classified as “Print.” This restriction narrows the scope of the analysis to graphic, non-unique works (non-print rows removed: 1016).
- Artist Name Standardization:
  a. Accent Removal: Accented characters in artist names are normalized using the unidecode function.
  b. Replacements and Normalization: The script applies a series of regular expression-based replacements to unify various forms of artist names. This includes removing life-span year ranges and special characters, and standardizing prefixes such as “after”.
  c. Name Sorting: Artist names are further normalized by sorting the characters alphabetically, thereby making the order of names and surnames insignificant.
  d. Artist Filtering: Rows that are empty or contain erroneous classifications (e.g., “print” in the artist name) are systematically removed. This step enhances the integrity of the artist data (rows removed due to empty ‘ARTIST’ or erroneous classifications: 10,454).
- Handling Missing Values:
  a. Years are extracted from the YEAR or PERIOD columns using regular expressions, and entries with unresolved or invalid year values are removed. This ensures chronological accuracy within the dataset (rows removed due to invalid ‘YEAR’: 52).
  b. Technique Standardization: The TECHNIQUE column is cleaned and standardized to match a predefined list of accepted techniques. Particular care was taken to exclude reproductions and retain only original artworks. Entries not meeting these criteria are removed (rows removed due to unwanted ‘TECHNIQUE’: 17,966).
  c. Poster Exclusion: Descriptions containing terms related to posters are excluded to avoid irrelevant data.
- Dimensional and Price Validation: The TOTAL DIMENSIONS column is standardized to a consistent unit (centimeters), and outliers or erroneous entries (e.g., dimensions equal to zero) are excluded. The PRICE column is converted to a numeric format, with non-numeric values discarded (rows removed due to invalid ‘TOTAL DIMENSIONS’: 1069).
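The artist-name and year steps above can be sketched as follows. This is an illustration under stated assumptions, not the authors' script: the source uses the unidecode function for accent removal, whereas this sketch substitutes the standard-library `unicodedata` module to stay dependency-free; the regular expressions, the function names, and the accepted year range are hypothetical; and the standardization of prefixes such as “after” is omitted.

```python
import re
import unicodedata


def strip_accents(text):
    """Drop combining marks after NFKD decomposition (stand-in for unidecode)."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def normalize_artist(raw):
    """Return an order-insensitive key for an artist name."""
    name = strip_accents(raw).lower()
    # Hypothetical pattern: remove life-span ranges such as "(1893-1983)".
    name = re.sub(r"\(?\b\d{4}\s*[-–]\s*\d{4}\b\)?", " ", name)
    # Remove everything that is not a letter or whitespace.
    name = re.sub(r"[^a-z\s]", " ", name)
    # Sort the remaining characters so that name order does not matter.
    letters = re.sub(r"\s+", "", name)
    return "".join(sorted(letters))


def extract_year(text):
    """Return the first standalone four-digit year (1800-2099), or None."""
    match = re.search(r"\b(1[89]\d{2}|20\d{2})\b", text)
    return int(match.group(1)) if match else None
```

With this key, "Joan Miró (1893-1983)" and "MIRO, Joan" map to the same string. Sorting characters also maps any two names that are anagrams of each other to the same key, so a real pipeline may want to check for such collisions.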
Appendix A.2. Filtering Data
- Dimension Filtering: Entries with TOTAL DIMENSIONS outside the range of 10 to 10,000 cm² were excluded to remove records that are unlikely to occur naturally, which could otherwise skew the analysis (rows removed due to ‘TOTAL DIMENSIONS’ being outside of range: 753).
- Price Filtering: Entries with PRICE values exceeding 10,000 were removed to concentrate on transactions within a typical range for prints and to avoid artwork category misclassification (rows removed due to ‘PRICE’ being outside of range: 3).
- Year-Based Filtering: Artworks created before 1900 are excluded to limit the analysis to more contemporary works, aligning with the study’s temporal focus (rows removed due to ‘YEAR’ being earlier than 1900: 2305).
- Artist Frequency Filtering: Artists with fewer than 10 occurrences in the dataset were excluded. This step ensures sufficient representation of each artist, thereby improving the robustness of the analysis (rows removed due to artists with fewer than 10 occurrences: 5266).
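The four filters above can be sketched in plain Python as shown below. The function name, the row representation (a list of dictionaries keyed by column name), and the treatment of the bounds as inclusive are assumptions; the inclusive bounds are at least consistent with the reported descriptive statistics (minimum year 1900, maximum price 10,000, maximum area 10,000). The sketch also assumes that artist frequency is counted after the other filters have been applied, following the order of the list.

```python
from collections import Counter


def apply_filters(rows, min_area=10, max_area=10_000, max_price=10_000,
                  min_year=1900, min_artist_count=10):
    """Apply the dimension, price, year, and artist-frequency filters."""
    kept = [
        row for row in rows
        if min_area <= row["TOTAL DIMENSIONS"] <= max_area
        and row["PRICE"] <= max_price
        and row["YEAR"] >= min_year
    ]
    # Count artists only among rows that survived the value filters.
    counts = Counter(row["ARTIST"] for row in kept)
    return [row for row in kept if counts[row["ARTIST"]] >= min_artist_count]
```

Counting after the value filters means an artist can drop below the threshold because some of their records were removed earlier; counting before would retain more artists.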
Appendix A.3. Data Encoding
Appendix B
Model Name | AuctionResultsNoImg hyperparameters | AuctionResultsColor hyperparameters |
---|---|---|
KNN | n_neighbors: 9 | n_neighbors: 11 |
DecisionTree | max_depth: 11 | max_depth: 9 |
RandomForest | max_depth: 9 | max_depth: 6 |
| | n_estimators: 37 | n_estimators: 19 |
XGBoost | max_depth: 8 | max_depth: 8 |
| | alpha: 1.506511545817065e-06 | alpha: 1.3322275996207066e-06 |
| | lambda: 0.002960933875866748 | lambda: 0.777925934273516 |
| | eta: 0.014895069860775214 | eta: 0.03275817414837939 |
| | learning_rate: 0.06612242253192645 | learning_rate: 0.11347994198907171 |
CatBoost | max_depth: 8 | max_depth: 10 |
| | l2_leaf_reg: 5.625945091561714 | l2_leaf_reg: 18.349484825036434 |
LightGBM | num_leaves: 3749 | num_leaves: 628 |
| | lambda_l1: 0.024938755036347845 | lambda_l1: 0.00015355388929401932 |
| | lambda_l2: 6.603948300896136e-07 | lambda_l2: 3.3513689033734044 |
| | learning_rate: 0.23346094966403183 | learning_rate: 0.025584699772417396 |
MLP | hidden_dim: 27 | hidden_dim: 90 |
| | n_layers: 4 | n_layers: 4 |
| | learning_rate: 0.0009817907858545537 | learning_rate: 0.0006126544377845088 |
VIME | p_m: 0.3171774936756223 | p_m: 0.3551020199178091 |
| | alpha: 3.1518685147640513 | alpha: 5.796587835827752 |
| | K: 15 | K: 3 |
| | beta: 5.777807461440739 | beta: 2.026360008336819 |
ModelTree | criterion: gradient | criterion: gradient-renorm-z |
| | max_depth: 3 | max_depth: 2 |
DeepGBM | n_trees: 100 | n_trees: 200 |
| | maxleaf: 64 | maxleaf: 64 |
| | loss_de: 3 | loss_de: 5 |
| | loss_dr: 0.7 | loss_dr: 0.9 |
DeepFM | dnn_dropout: 0.5645466529889359 | dnn_dropout: 0.8953496968333118 |
SAINT | dim: 64 | dim: 256 |
| | depth: 2 | depth: 3 |
| | heads: 2 | heads: 8 |
| | dropout: 0 | dropout: 0.5 |

Model Name | AuctionResultsSVD hyperparameters | AuctionResultsColorSVD hyperparameters |
---|---|---|
KNN | n_neighbors: 5 | n_neighbors: 19 |
DecisionTree | max_depth: 8 | max_depth: 10 |
RandomForest | max_depth: 11 | max_depth: 12 |
| | n_estimators: 77 | n_estimators: 5 |
XGBoost | max_depth: 5 | max_depth: 6 |
| | alpha: 5.857949969161431e-08 | alpha: 9.235891162903211e-07 |
| | lambda: 0.34515471928125674 | lambda: 1.6580418495949973e-07 |
| | eta: 0.025202041962014803 | eta: 0.03645561176974997 |
| | learning_rate: 0.028646710508839136 | learning_rate: 0.2496359003501268 |
CatBoost | max_depth: 10 | max_depth: 9 |
| | l2_leaf_reg: 13.293942606581755 | l2_leaf_reg: 22.680333733819474 |
LightGBM | num_leaves: 962 | num_leaves: 259 |
| | lambda_l1: 0.002341475216791483 | lambda_l1: 5.115908866062836e-07 |
| | lambda_l2: 1.5483601387539192 | lambda_l2: 1.7999234961427344e-06 |
| | learning_rate: 0.03971767779263651 | learning_rate: 0.0919724612802419 |
MLP | hidden_dim: 78 | hidden_dim: 89 |
| | n_layers: 2 | n_layers: 5 |
| | learning_rate: 0.0007311385951868368 | learning_rate: 0.0007298057023245878 |
VIME | p_m: 0.19870045327747068 | p_m: 0.3565456402937748 |
| | alpha: 3.491869884196773 | alpha: 3.2829661088245 |
| | K: 5 | K: 15 |
| | beta: 2.2079468853402586 | beta: 0.626661210126865 |
ModelTree | criterion: gradient | criterion: gradient-renorm-z |
| | max_depth: 3 | max_depth: 3 |
DeepGBM | n_trees: 100 | n_trees: 200 |
| | maxleaf: 64 | maxleaf: 64 |
| | loss_de: 3 | loss_de: 4 |
| | loss_dr: 0.7 | loss_dr: 0.9 |
DeepFM | dnn_dropout: 0.64790221084881 | dnn_dropout: 0.6160828071241174 |
SAINT | dim: 64 | dim: 128 |
| | depth: 6 | depth: 2 |
| | heads: 8 | heads: 4 |
| | dropout: 0.2 | dropout: 0.6 |
References
1. Bailey, J. Can machine learning predict the price of art at auction? Harv. Data Sci. Rev. 2020, 2, 2–8.
2. Goetzmann, W.; Renneboog, L.; Spaenjers, C. Art and Money: Risk, Return, and the Art Market as an Asset Class. In Handbook of the Economics of Art and Culture; Ginsburgh, V., Throsby, D., Eds.; Elsevier: Amsterdam, The Netherlands, 2013; Volume 2, pp. 253–283.
3. Schapire, R.E.; Stone, P.; McAllester, D.; Littman, M.L.; Csirik, J. Modeling auction price uncertainty using boosting-based conditional density estimation. In Proceedings of the 19th International Conference on Machine Learning (ICML 2002), Sydney, Australia, 8–12 July 2002; pp. 546–553. Available online: https://www.cs.utexas.edu/~pstone/Papers/bib2html-links/ICML02-tac.pdf (accessed on 22 August 2024).
4. Powell, L.; Gelich, A.; Ras, Z.W. Developing artwork pricing models for online art sales using text analytics. In Proceedings of the Rough Sets: International Joint Conference, IJCRS 2019, Debrecen, Hungary, 17–21 June 2019; Mihálydeák, T., Min, F., Wang, G., Banerjee, M., Düntsch, I., Suraj, Z., Ciucci, D., Eds.; Lecture Notes in Computer Science. Springer: Cham, Switzerland, 2019; Volume 11499, pp. 480–494.
5. Chen, T.; Guestrin, C. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016; ACM: New York, NY, USA, 2016.
6. Prokhorenkova, L.; Gusev, G.; Vorobev, A.; Dorogush, A.V.; Gulin, A. CatBoost: Unbiased Boosting with Categorical Features. arXiv 2018, arXiv:1706.09516.
7. Ke, G.; Xu, Z.; Zhang, J.; Bian, J.; Liu, T.Y. DeepGBM: A deep learning framework distilled by GBDT for online prediction tasks. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, Anchorage, AK, USA, 4–8 August 2019; ACM: New York, NY, USA, 2019; pp. 384–394.
8. Yoon, J.; Jarrett, D.; van der Schaar, M. VIME: Extending the Success of Self- and Semi-Supervised Learning to Tabular Domain. In Proceedings of the 35th Conference on Neural Information Processing Systems (NeurIPS 2021), Virtual, 6–14 December 2021.
9. Xu, J.; Zhang, H.; Wu, Y. RLN: A Residual Learning Network for Time Series Forecasting. In Proceedings of the 28th International Joint Conference on Artificial Intelligence, Macao, China, 10–16 August 2019; pp. 3342–3348.
10. Quinlan, J.R. Learning with continuous classes. In Proceedings of the 5th Australian Joint Conference on Artificial Intelligence, Hobart, Australia, 16–18 November 1992; pp. 343–348.
11. Guo, H.; Tang, R.; Ye, Y.; Li, Z.; He, X. DeepFM: A Factorization-Machine based Neural Network for CTR Prediction. In Proceedings of the 26th International Joint Conference on Artificial Intelligence (IJCAI-17), Melbourne, Australia, 19–25 August 2017; pp. 1725–1731. Available online: https://arxiv.org/abs/1703.04247 (accessed on 24 November 2024).
12. Somepalli, G.; Goldblum, M.; Shrivastava, A.; Goldstein, T. SAINT: Improved Neural Networks for Tabular Data via Row Attention and Contrastive Pre-Training. In Proceedings of the 37th International Conference on Machine Learning (ICML 2020), Vienna, Austria, 13–18 July 2020.
13. Zehtab-Salmasi, A.; Feizi-Derakhshi, A.R.; Nikzad-Khasmakhi, N.; Asgari-Chenaghlu, M.; Nabipour, S. Multimodal price prediction. Ann. Data Sci. 2021, 10, 619–635.
14. Ma, M.X.; Noussair, C.N.; Renneboog, L. Colors, emotions, and the auction value of paintings. Eur. Econ. Rev. 2022, 142, 104004.
15. Liu, C. Prediction and Analysis of Artwork Price Based on Deep Neural Network. Sci. Program. 2022, 2022, 7133910.
16. Aubry, M.; Kraeussl, R.; Manso, G.; Spaenjers, C. Biased Auctioneers. J. Financ. 2022, 78, 795–833.
17. Smith, J.D.; Johnson, A.B. Improving Predictive Accuracy in Art Market Models Using Ensemble Methods. J. Art Artif. Intell. 2020, 15, 102–118.
18. Hasler, D.; Susstrunk, S. Measuring colorfulness in natural images. In Proceedings of the Human Vision and Electronic Imaging VIII, Santa Clara, CA, USA, 21–24 January 2003; International Society for Optics and Photonics: Bellingham, WA, USA, 2003; Volume 5007, pp. 87–95. Available online: https://infoscience.epfl.ch/record/33994/files/HaslerS03.pdf (accessed on 22 August 2024).
19. Gómez, S.; Tascon, M.; Martínez, J.; Elad, M. SVD entropy: An image quality measure based on singular value decomposition. Signal Process. Image Commun. 2020, 81, 49–53.
20. Pace, R.K.; Barry, R. Sparse Spatial Autoregressions. Stat. Probab. Lett. 1997, 33, 291–297.
21. Akiba, T.; Sano, S.; Yanase, T.; Ohta, T.; Koyama, M. Optuna: A next-generation hyperparameter optimization framework. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, Anchorage, AK, USA, 4–8 August 2019; ACM: New York, NY, USA, 2019; pp. 2623–2631.
22. Borisov, V.; Leemann, T.; Seßler, K.; Haug, J.; Pawelczyk, M.; Kasneci, G. Deep neural networks and tabular data: A survey. IEEE Trans. Neural Netw. Learn. Syst. 2022, 35, 7499–7519.
Calculated Field | ARTIST | TECHNIQUE | TOTAL-DIMENSIONS | YEAR | PRICE |
---|---|---|---|---|---|
Unique Values | 395 | 10 | 2534 | 123 | 795 |
Unique Values/Total Count (%) * | 1.55 | 0.04 | 9.97 | 0.48 | 3.13 |
Calculated Field | TOTAL-DIMENSIONS | YEAR | PRICE |
---|---|---|---|
Count | 25,408 | 25,408 | 25,408 |
Mean | 2259.03 | 1973.6 | 225.68 |
Standard deviation | 1674.0 | 21.7 | 505.40 |
Minimum value | 10.64 | 1900 | 1 |
25% Quantile | 875 | 1963 | 50 |
50% Quantile | 1750 | 1974 | 99 |
75% Quantile | 3401 | 1985 | 200 |
Maximum value | 10,000 | 2023 | 10,000 |
Method | AuctionResultsNoImg | AuctionResultsColor | AuctionResultsSVD | AuctionResultsColorSVD | CaliforniaHousing |
---|---|---|---|---|---|
LinearModel | 101.33 ± 0.78 | 101.01 ± 0.84 | 102.75 ± 1.16 | 101.13 ± 0.93 | 28.70 ± 0.41 |
KNN | 61.38 ± 0.35 | 62.68 ± 0.32 | 60.18 ± 0.65 | 60.93 ± 0.53 | 22.75 ± 0.51 |
DecisionTree | 58.92 ± 0.62 | 58.39 ± 0.69 | 56.70 ± 0.78 | 59.21 ± 0.69 | 21.70 ± 0.56 |
RandomForest | 61.20 ± 0.60 | 55.95 ± 0.29 | 56.94 ± 0.30 | 55.51 ± 0.80 | 17.50 ± 0.36 |
XGBoost | 55.11 ± 0.60 | 57.50 ± 0.61 | 54.83 ± 0.59 | 58.76 ± 1.07 | 14.84 ± 0.25 |
CatBoost | 58.91 ± 0.79 | 60.45 ± 1.02 | 58.20 ± 0.64 | 58.25 ± 0.48 | 14.92 ± 0.46 |
LightGBM | 57.53 ± 2.32 | 58.29 ± 2.34 | 58.04 ± 0.59 | 57.52 ± 1.22 | 14.71 ± 0.31 |
MLP | 62.98 ± 0.89 | 61.86 ± 1.14 | 63.68 ± 0.32 | 65.02 ± 0.75 | 17.52 ± 0.63 |
VIME | 75.29 ± 3.37 | 66.29 ± 1.59 | 86.44 ± 2.28 | 74.09 ± 2.91 | 19.40 ± 1.69 |
ModelTree | 68.50 ± 0.37 | 66.59 ± 0.66 | 71.68 ± 3.62 | 68.70 ± 1.35 | 23.86 ± 0.33 |
DeepGBM | 76.92 ± 7.66 | 77.14 ± 7.40 | 87.96 ± 10.62 | 76.18 ± 4.23 | 35.13 ± 2.11 |
DeepFM | 63.10 ± 0.76 | 63.45 ± 0.87 | 63.28 ± 0.25 | 63.06 ± 0.81 | 17.75 ± 0.36 |
SAINT | 60.57 ± 0.95 | 60.75 ± 0.78 | 62.41 ± 1.19 | 60.63 ± 1.43 | 16.64 ± 0.30 |
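For readers unfamiliar with the metric reported in the table above, the following is a minimal sketch of sMAPE. The exact variant used in the study is not stated in this section, so the formula here is an assumption: the common definition bounded between 0 and 200, which is at least compatible with the LinearModel scores above 100. The function name and the handling of pairs where both values are zero are illustrative choices.

```python
def smape(y_true, y_pred):
    """Symmetric mean absolute percentage error on a 0-200 scale.

    Assumed definition: 100/n * sum(2*|y - yhat| / (|y| + |yhat|)).
    Pairs where both values are zero contribute zero error.
    """
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    total = 0.0
    for actual, predicted in zip(y_true, y_pred):
        denominator = abs(actual) + abs(predicted)
        if denominator == 0:
            continue
        total += 2.0 * abs(actual - predicted) / denominator
    return 100.0 * total / len(y_true)
```

For example, predicting 300 for an artwork that sold for 100 yields 2 × 200 / 400 = 1, that is, an sMAPE of 100 for that single record, while a perfect prediction yields 0.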
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Mauer, P.; Paszkiel, S. Tabular Data Models for Predicting Art Auction Results. Appl. Sci. 2024, 14, 11006. https://doi.org/10.3390/app142311006