1 Introduction

Small load carriers (SLCs) are reusable containers of different shapes and sizes commonly used in industries such as automotive, manufacturing, and electronics. SLCs are reusable, stackable, of standardized dimensions for each SLC type, and protect the goods transported in them [5]. The reusable nature of SLCs makes them a sustainable solution. Due to the ever-increasing global exchange of goods, the logistics industry is continuously growing and the need for standardized reusable containers such as SLCs and pallets is increasing [9]. Companies in the logistics industry have been paying increasing attention to container management and status management [22]. SLC quality assessment, which is part of container and status management, is an important step in determining whether SLCs are usable or contain defects that might damage customer goods during transport, which can be costly. Besides directly affecting the quality of the customer's goods, defects can also reduce handling efficiency (e.g., breakage in the outer wall hinders proper SLC movement) [32]. Despite the critical nature of SLC inspection and evaluation, logistics companies still rely largely on manual human evaluation, which is not scalable [23]. Automating SLC quality control can reduce the risk of human errors caused by fatigue and the repetitive nature of sorting and checking SLCs [18]. Human personnel can be reallocated to other, more important processes in the company, such as supervising the system, which may improve the overall workflow. A reliable SLC quality control system also removes the subjective bias of different human operators when evaluating containers with different degrees of defects [16].

Automated SLC quality control is not commonly implemented because it requires trained models both for classifying the SLC type and for identifying the condition of the SLC (normal or anomalous), which in turn require a large amount of SLC data, usually in the form of an image dataset. To our knowledge, there is no automated system for large-scale SLC data collection for model training purposes. Another gap between research and industrial practice is that anomaly detection (AD) models are mostly tested on large controlled datasets such as MVTec AD [4] that might not represent the defect characteristics of SLCs, so AD models trained on benchmark datasets cannot be expected to provide the same performance for SLC quality control [36]. To address these gaps, this study raises the following research questions: How can we implement a computer vision system for classifying the different types of SLCs and for assessing their condition? Which of the different state-of-the-art (SOTA) methods for anomaly detection can be applied to the evaluation of SLCs? What are the limitations of the respective methods? How do the hyperparameters affect the output of the respective models? This research aims to fill these gaps by designing a scalable system for SLC data collection that can be used to train models for SLC classification and anomaly detection, as shown in Fig. 1. The main contributions of this research are as follows:

  • Development of a camera portal: We designed a modular, cost-effective camera portal using standard components to capture the key areas of the SLC surfaces. The camera portal can be installed in industrial production lines without disrupting the production flow.

  • Creation of an extensive labeled dataset: We compiled a dataset of 17,530 images for 34 different SLC types. Each image is annotated for both classification (type) and defect detection (OK/NOK). A subset of 9,260 images covering 18 SLC types, which does not contain customer credentials, is made publicly available.

  • Effectiveness of classification model on the dataset: We demonstrate that fine-tuning an off-the-shelf deep convolutional neural network provides highly accurate results for the classification of the SLC type.

  • Comparison of SOTA anomaly detection methods: We conduct a performance comparison of eight state-of-the-art anomaly detection methods for the SLC defect detection task, and identify the two most promising candidates for further evaluation.

  • Analysis of optimized models: We provide both quantitative and qualitative analyses of hyperparameter-tuned versions of these candidates. The quantitative analysis evaluates various performance metrics. The models' predicted image outputs are assessed to determine their effectiveness in accurately identifying anomalous sections within the images.

1.1 Organization of the paper

The structure of this paper is as follows: Sect. 2 discusses prior and related work on building the data acquisition system and the models for classifying SLC types and identifying SLC defects. Section 3 describes the design decisions for the data acquisition system, i.e., the camera portal. Section 4 describes the dataset collected with the designed system and the preprocessing steps taken before model training. Section 5 discusses the classification algorithm used for the SLC dataset. Section 6 discusses the different SOTA algorithms for anomaly detection that are commonly used in industrial applications. An exploratory analysis of the SOTA algorithms is conducted along with hyperparameter tuning to adapt the models to the present dataset, and qualitative and quantitative analyses are conducted on the best models. Section 7 discusses the results of the experiments and explores aspects of the experiments that can be improved.

Fig. 1

General implementation of the cleaning, classification, and anomaly detection of the different types of SLCs. The dotted rectangle on the right side of the figure shows the deployment of the trained models for SLC type classification and the anomaly detection pipeline. Dirty SLCs are first collected and transported on conveyor belts to automated washing machines. The SLCs are cleaned and further transported to the camera portal. Images of all sides of the SLCs are taken and evaluated using neural networks to predict the type and the condition of the SLCs (OK or NOK). The explanation for NOK SLCs can be given as a heatmap highlighting the potential location of the anomaly in the respective images. SLCs with different conditions are later separated using an automated conveyor system for further transport and evaluation. The predicted SLC types are later stored in a database

2 Prior and related work

Prior and related work of this study is divided into three main parts: data acquisition and evaluation systems in industry (Sect. 2.1), image classification (Sect. 2.2), and image-based anomaly detection for industrial objects (Sect. 2.3). Each part discusses what has been done in other studies on the respective topic.

2.1 Data acquisition and evaluation system in industry

A reliable system for data acquisition of the SLCs is critical in developing a consistent object detection and anomaly detection system. Considerations such as the physical features of the SLCs (e.g., whether the surface is reflective under illumination), on-site hard constraints such as the maximum dimensions of the system, the number of cameras needed, the movement of the SLCs during production, the on-site lighting conditions, and the computing power needed are all important details that must be considered before designing the system [48]. Careful planning of the camera setup and the lighting conditions for image acquisition is crucial, as background noise or other objects that are not relevant to the quality control process can affect the evaluation of the SLC [46]. Poss et al. [36] describe the characteristics of the SLCs they use in the production line as well as hard constraints such as the cycle time of the line, which must be kept as efficient as possible. The evaluation system must conform to the hard constraints of the line to avoid slowing down production.

Sobottka et al. [45] use CCD (Charge-Coupled Device) cameras in their experiment for detecting dust and other types of debris on SLCs. They emphasize the importance of having multiple cameras at different angles and a stable light source, preferably in a fully enclosed tunnel-like structure, to avoid uncontrollable external lighting. Their use of conveyor belts and switch plates to separate different SLC conditions reflects the general use case of our system. However, Sobottka et al. evaluated the setup only in a simulation model layout, which can differ from real use scenarios. Bohm et al. [6], Xu et al. [54], and Noceti et al. [34] emphasize the importance of multiple camera setups (with Noceti et al. [34] stressing the importance of a top-view camera position). In environments with a lot of movement, it is more practical to capture overlapping information from multiple views than to risk view insufficiency.

Pierer et al. [35] experimented with a scalable multi-camera inspection approach (13 monochrome cameras) for industrial press line applications, with parts moving on a conveyor belt and four high-powered LED illumination bars. Zhang et al. [57] conducted object detection experiments on apples using modified YOLO (You Only Look Once) models under different illumination. Different illuminations (front, side, and backlight) change the appearance of the apples, which affects the ability of the model to correctly identify them in the image. Most of the prior experiments emphasize the benefit of a multi-camera setup with controlled lighting conditions to acquire the best possible image quality of the object.

2.2 Image classification

SLCs are divided into different classes based on their physical properties (colors, dimensions, and geometry). The images collected with the SLC data acquisition system are labeled with the respective SLC type, creating a labeled dataset of normal instances. This can be used to train neural networks for common classification problems. Different model families can be used, such as vision transformers (ViT) [15] and convolutional neural networks (CNNs). ViTs generally perform better on complicated problems when more data is available. However, for this particular use case of classifying SLCs with a limited dataset, CNNs are easier to train and easier to fine-tune for better classification results. The ConvNeXt [30] model is a modernization of ResNet (Residual Network) inspired by the design of Vision Transformers while maintaining the simplicity and efficiency of CNNs. The architectural choices that differentiate ConvNeXt from traditional CNNs include: substituting the ResNet stem with a simpler "patchify" layer as in ViT (4x4 non-overlapping convolution); depthwise convolution (a special grouped convolution where the number of groups equals the number of channels, popularized by MobileNet [24]); an inverted bottleneck block, popularized by MobileNetV2 [42]; a 7x7 depthwise convolution in each block; substitution of the Rectified Linear Unit (ReLU) [33] with the Gaussian Error Linear Unit (GELU) [21], using a single GELU per block; and replacing batch normalization [26] with layer normalization [2]. A minimal sketch of such a block is shown below.
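The following is a minimal, illustrative PyTorch sketch of a ConvNeXt-style block reflecting the design choices listed above (7x7 depthwise convolution, layer normalization, inverted bottleneck, single GELU); layer scale and stochastic depth from the original implementation are omitted for brevity.

```python
import torch
from torch import nn

class ConvNeXtBlock(nn.Module):
    """Simplified ConvNeXt block: depthwise 7x7 conv -> LayerNorm -> inverted bottleneck."""
    def __init__(self, dim: int):
        super().__init__()
        self.dwconv = nn.Conv2d(dim, dim, kernel_size=7, padding=3, groups=dim)  # depthwise 7x7
        self.norm = nn.LayerNorm(dim)                 # layer norm instead of batch norm
        self.pwconv1 = nn.Linear(dim, 4 * dim)        # inverted bottleneck: expand by 4x
        self.act = nn.GELU()                          # single GELU per block
        self.pwconv2 = nn.Linear(4 * dim, dim)        # project back to the input width

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        residual = x
        x = self.dwconv(x)
        x = x.permute(0, 2, 3, 1)                     # (N, C, H, W) -> (N, H, W, C)
        x = self.pwconv2(self.act(self.pwconv1(self.norm(x))))
        x = x.permute(0, 3, 1, 2)                     # back to (N, C, H, W)
        return residual + x
```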

2.3 Image-based anomaly detection for industrial objects

Anomaly detection (AD) refers to the detection of patterns in data that do not conform to expected behavior; it is widely applied in various fields, such as fraud detection in banking, identifying spam in emails, fault detection in industrial systems, and detecting abnormalities in medical images [8]. The key challenge in AD lies in accurately distinguishing between normal variations and genuine anomalies, especially in complex datasets. To mention a few examples, Huang et al. [25] implemented visual language models for AD in medical images that can detect anomalies from different kinds of medical images. AD is also pivotal in the predictive maintenance of machinery, where algorithms analyze sensor data or machine logs to identify unusual patterns indicative of potential faults; for example, Wang et al. [52] used a reinforcement learning-based method for fault estimation of the actuators. In the field of networks, anomaly detection is used to identify malformed packets [7].

This study focuses specifically on anomaly detection for image-based data in the industrial domain. It is a complex problem, as anomalous instances in industry can be diverse, unpredictable, rare, and irregular. The rare nature of anomalies makes it impractical and costly to intentionally produce them in industry. We next provide a brief overview of anomaly detection for industrial images related to the present setup; for a broader discussion of the field we refer to the recent survey by Liu et al. [29]. Bai et al. [3] discuss methods for handling imbalanced datasets that arise from the rare nature of anomalies in the industrial field.

Ruff et al. [41] explain that anomaly detection approaches can be classified based on the feature maps and models. Feature maps can be differentiated as deep or shallow methods. Models can be classified into classification, probabilistic, and reconstruction-based. The categorization of the anomaly detection approaches is not discrete as some approaches lie between different models. Other anomaly detection approaches can rely purely on distance-based methods.

Several approaches have previously been implemented in industrial settings. Liang et al. [28] use an industrial camera to capture the inkjet-printed production codes containing information (expiry date, batch number) on plastic beverage containers and check for defects such as missing letters. A ShuffleNet V2 network is used to classify whether the printed code is defective or normal. Singh et al. [44] use a pre-trained ResNet-101 for feature extraction and multi-class support vector machines (SVMs) to detect surface defects of tapered rollers. He et al. [20] implemented an end-to-end steel surface defect detection approach using a fusion of multiple hierarchical features for potential real-time detection. Dlamini et al. [14] use the YOLOv4 model for textile industry quality control. An enhanced feature-extraction YOLO algorithm for industrial small-object detection [47] could also be applied to detect specific anomaly types in close-up SLC images. Würschinger et al. [53] developed a low-cost machine vision system for piston rod quality evaluation using transfer learning.

Most implementations of AD for industrial objects target newly produced objects, for which the appearance of a normal object is largely fixed. Reusable objects such as SLCs are rarely in perfect condition when returned for inspection. This introduces another level of complexity, as the surfaces of normal SLCs can differ slightly while still being considered normal. Truong et al. [49] use a camera to evaluate reusable food packaging (namely plastic cups) for defects and contamination. They implemented background removal using U-Net [39] to separate the object from the background and trained an auto-encoder-based framework for anomaly detection. Despite the similar reusable nature of the objects, our setting differs in some respects: SLCs come in different sizes and colors, rather than a single object type. As SLCs are not transparent, all inner and outer surfaces must be evaluated, which requires more than one view to obtain sufficient image coverage.

3 Design of the camera portal

The design of the camera portal is divided into three parts: requirements (Sect. 3.1), geometrical calculation (Sect. 3.2), and the final build of the camera portal and hardware used (Sect. 3.3). The requirements section discusses the basic requirements for the camera portal to function and collect data properly. The geometrical calculation elaborates on the positioning of the cameras and how it affects the images taken. The final build of the camera portal and the specific materials used for building the prototype are elaborated in the last part.

3.1 Requirements

The data acquisition system needs to solve both the classification of the SLC type and the determination of whether the SLC is defective. While the first task (classification) can be solved with a single camera view, the second task (defect detection) requires a complete view of all surfaces (inside and outside) of the SLC, as a defect can be anywhere on the surface. The bottom surface is excluded, as it is rarely defective and such defects usually have little negative impact on the usability of the SLC. The system design is subject to several constraints: it must not slow down the SLC cleaning process, it needs to fit into the facilities, and it should be economical while maintaining quality. For cost and simplicity reasons, a setup based on a single camera moving around the object was ruled out, as the mechanism would be very complicated. The geometry of the containers suggests that at least three (static) cameras are needed: two cameras on the diagonals of the SLC and one camera for the top view. The three-camera setup is cost-effective for capturing images, but a preliminary test revealed that the diagonally placed cameras caused occlusion for higher containers, exhibited notable perspective distortion, and were relatively sensitive to shifts of the SLC position. Adding two more cameras and placing four cameras at the parallel sides of the SLCs and one camera above the SLC mitigated these issues. Implementing two additional cameras means more cost but is still justifiable. To summarize, the camera portal uses a five-camera setup where four cameras are placed at the parallel sides of the SLCs and one at the top of the portal frame (providing the top view).

3.2 Geometrical calculation

Choosing the appropriate camera characteristics and selecting the proper position and orientation of the cameras for SLC evaluation is crucial for capturing important details of the object. Geometrical calculations help identify potential blind spots that might affect the performance of the models. Higher camera resolutions are considered to capture more detailed images for better model performance. The spatial resolution needed to capture anomalous details can be set to 1 pix/mm, since anomalous regions in SLCs are relatively large. Furthermore, considering an additional safety factor of three, the required spatial resolution is 3 pix/mm. For choosing the camera resolution, the calculation must consider the largest SLC, which measures 600 mm (length) × 400 mm (width) × 280 mm (height). The calculated minimum sensor dimension for the given portal distance is 1800 × 840 pixels for the longer sides of the SLCs. This work uses the U3-3880CP-C-HQ Rev. 2.2 from IDS Imaging Development Systems GmbH, an RGB camera with 6.41 MP resolution (3088 × 2076 pixels), a frame rate of 59.0 fps, and a C-mount. As the camera positions are static and incoming SLCs arrive at approximately the same spot (with little variation in the exact positioning), the initial camera setup must be chosen to avoid lens and perspective distortion. The calculation of the field of view (FOV) depends on the camera sensor width or height and the focal length used. The determined FOV also affects the working distance of the camera. When SLCs are placed closer than the working distance, part of the SLC might be cut off. The working distance is related to the focal length, where a longer focal length means a longer working distance. The size constraints of the camera portal setup are 1.5 × 1.5 m (width × length). If the SLC is placed in the center of the camera portal, the longer side of the largest SLC allows a maximum working distance of 450 mm, and the shorter side allows a maximum working distance of 550 mm. Due to these physical size constraints, the cameras need to be mounted higher to compensate for the limited working distance. A higher camera position reduces visibility of the SLC side walls and captures more background that does not contribute to model performance. Considering all the constraints, a focal length of 8 mm is used for the cameras, mounted 500 mm above the edge of the highest SLC (780 mm above the rollers of the conveyor). The inclination of the cameras is set to approximately 47.5°. A short numerical sketch of these calculations is given below.
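The following Python sketch reproduces the simple resolution and field-of-view checks described above; the required spatial resolution and SLC dimensions are taken from the text, whereas the sensor width used in the FOV example is an illustrative assumption rather than the exact specification of the camera.

```python
def min_sensor_pixels(object_mm: float, required_pix_per_mm: float = 3.0) -> int:
    """Minimum number of pixels along one axis to resolve the object at 3 pix/mm."""
    return int(object_mm * required_pix_per_mm)

def fov_mm(sensor_mm: float, focal_length_mm: float, working_distance_mm: float) -> float:
    """Scene width captured at the working distance (thin-lens / pinhole approximation)."""
    return sensor_mm * working_distance_mm / focal_length_mm

# Largest SLC: 600 mm x 400 mm x 280 mm (L x W x H)
print(min_sensor_pixels(600))  # 1800 px needed along the long side
print(min_sensor_pixels(280))  # 840 px needed along the height

# FOV check with the 8 mm lens at the 550 mm working distance;
# the 7.1 mm sensor width is an assumed placeholder value.
print(round(fov_mm(7.1, 8.0, 550)))  # ~488 mm of scene width
```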

Table 1 The hardware details used to build the camera portal for SLC data acquisition

3.3 Final build of the camera portal and hardware used

The final prototype of the camera portal was built using aluminum structural beams with mountings for the cameras; underneath the portal is an industrial roller conveyor for moving the SLCs. Diffused lights were installed at the top of the camera portal to reduce shadowing on the SLCs. Opaque plastic sheets cover the camera portal to provide stable lighting conditions (an enclosed tunnel-like structure). Additional tube lights are installed on the horizontal aluminum structural beams to provide extra lighting. All cameras are triggered simultaneously to provide uniform conditions across all images. The movement of the SLC on the conveyor is controlled by a Programmable Logic Controller (PLC). Sensors and the PLC are used to position the SLCs at the center of the camera portal and to control the movement after the images are taken. Figure 1 (Camera Portal) shows the setup of the camera portal prototype. The hardware used is detailed in Table 1.

4 Dataset acquisition and preprocessing

This section explains the details of the dataset collected and used for training the AD models. In total, 17,530 images of 34 SLC types were collected, with an unbalanced number of images per SLC class. The SLCs can be further differentiated into different shapes such as covers, inlays, and boxes. The SLCs come in different colors, and a single SLC box can combine two colors, such as orange with yellow stripes.

The SLCs were labeled as normal or anomalous by an expert familiar with SLC defects. Each SLC corresponds to five images, and anomalies may be present on only one of the sides. Thus, the dataset is further preprocessed by manually choosing the OK views from the anomalous SLCs and moving them to the set of normal views. Approximately 17.9% of the total number of recorded images were labeled as anomalous after this additional preprocessing. Anomalies range from scratches, cracks, and broken parts on the edges of the SLCs to remainders of old stickers and foreign objects inside the SLCs. For model training, the original images are resized from \(3088 \times 2076\) to \(256 \times 256.\) For some AD training pipelines, an additional background removal step is applied to remove unwanted background noise, allowing a comparison of model performance with and without background removal. The method used for background removal is \(U^2\)-Net [37], a pre-trained model that maps the input image to an output image containing only the SLC without the background. The images are normalized using the ImageNet mean and standard deviation; a short preprocessing sketch is given below.
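A minimal preprocessing sketch consistent with the steps above (resize to 256 × 256 and ImageNet normalization), assuming torchvision is used; the background-removal step via \(U^2\)-Net is omitted here, and the file name is a placeholder.

```python
from torchvision import transforms
from PIL import Image

preprocess = transforms.Compose([
    transforms.Resize((256, 256)),                       # downscale from 3088x2076
    transforms.ToTensor(),                               # HWC uint8 -> CHW float in [0, 1]
    transforms.Normalize(mean=[0.485, 0.456, 0.406],     # ImageNet statistics
                         std=[0.229, 0.224, 0.225]),
])

img = Image.open("slc_example.png").convert("RGB")       # placeholder file name
tensor = preprocess(img)                                 # shape: (3, 256, 256)
```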

For our experiments, we use the following four variants of the dataset: all SLCs with background removal, all SLCs without background removal, sampled SLCs without background removal, and an inference dataset. The sampled SLC dataset is a subset of the whole dataset drawn with stratified sampling (a sketch follows below). This sampled dataset contains 7430 images in total, with 2970 images as normal training images. Approximately 42.3% of the sampled SLC dataset is considered anomalous. The sampled dataset was built for hyperparameter tuning due to time and resource constraints. The inference dataset is a custom-made dataset containing a mixture of normal and abnormal views of the SLCs. Each of the 34 SLC types is included in this dataset. The total number of images is 265, of which 155 are normal and 110 are anomalous. Eleven of the 34 SLC types do not have defective samples.
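The sketch below illustrates stratified subsampling with scikit-learn as we understand it from the text; the file lists, labels, and sampling fraction are placeholders, not the exact values used to build the sampled SLC dataset.

```python
from sklearn.model_selection import train_test_split

# Dummy stand-ins: in practice these are the image paths and SLC-type labels.
image_paths = [f"img_{i:04d}.png" for i in range(1000)]
slc_types = [i % 18 for i in range(1000)]            # e.g., 18 SLC types

sampled_paths, _, sampled_types, _ = train_test_split(
    image_paths, slc_types,
    train_size=0.5,            # illustrative fraction, not the paper's exact value
    stratify=slc_types,        # preserve the per-type proportions of the full dataset
    random_state=4,
)
```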

5 Classification of SLC type

The experiments start with collecting the dataset as explained in Sect. 4. The collected SLC images are used as training images for the classification model to differentiate the different SLC types. The model used for the classification problem is ConvNeXt [30], a modernization of ResNet towards the design of Transformers. Details of the training parameters and additional processing are discussed in Sect. 5.1. Further analysis of the model's inference times to evaluate its performance is given in Sect. 5.2.

5.1 ConvNeXt training and additional preprocessing

The collected SLC dataset is divided into train, test, and validation sets using a 60:20:20 split. The dataset used for training, testing, and validation of the classification model contains a mixture of OK and NOK images of each SLC type. Training used the following hyperparameters: the base ConvNeXt model, a learning rate of \(4\cdot 10^{-3}\) with the AdamW optimizer [31], and a batch size of 64. Training is executed for 250 epochs. The normalization parameters for the images (mean and standard deviation of the RGB (red, green, blue) values) are taken from ImageNet. An illustrative fine-tuning sketch is given below.
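The following is an illustrative fine-tuning sketch consistent with the hyperparameters above (ConvNeXt-Base, AdamW, learning rate 4e-3, batch size 64). The paper uses mmpretrain; this torchvision-based loop is a simplified stand-in, and the number of classes and the data loader are placeholders.

```python
import torch
from torch import nn
from torchvision import models

num_classes = 34                                   # number of SLC types
model = models.convnext_base(weights=models.ConvNeXt_Base_Weights.IMAGENET1K_V1)
model.classifier[2] = nn.Linear(model.classifier[2].in_features, num_classes)

optimizer = torch.optim.AdamW(model.parameters(), lr=4e-3)
criterion = nn.CrossEntropyLoss()

def train_one_epoch(loader, device="cuda"):
    """One pass over a DataLoader yielding (B, 3, 256, 256) image batches and labels."""
    model.train().to(device)
    for images, labels in loader:
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
```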

5.2 ConvNeXt inference and analysis

Table 2 Inference testing for the classification model using two different datasets: the inference dataset, which contains 265 images as explained in Sect. 4, and a 5-image inference set, a folder containing five different views of the same SLC

Inference tests for the trained ConvNeXt are shown in Table 2. Inference is performed on the same hardware as described later in Sect. 6.2 and additionally on a laptop with an Intel i5-1335U CPU. The tests use the inference dataset explained in Sect. 4 as well as a 5-image set to replicate the real-life use case where inference is performed on the five images from the camera portal. Each run is repeated five times for the same task and the average time of the five runs is reported (see the timing sketch below). As of now, the recognized SLC types are stored locally. In a later stage, they will be sent to a central database, which will allow analysis of the throughput and the stock in an enterprise resource planning (ERP) system. The integration into the ERP system will be reported elsewhere.
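A minimal sketch of the timing protocol described above: each inference task is repeated five times and the mean wall-clock time is reported. The run_inference argument is a placeholder for the actual ConvNeXt inference call.

```python
import time
import statistics

def timed_runs(run_inference, repeats: int = 5) -> float:
    """Average wall-clock time over `repeats` executions of the inference task."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_inference()                      # e.g., classify the 265-image inference set
        durations.append(time.perf_counter() - start)
    return statistics.mean(durations)
```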

6 Defect detection

Next, we investigate camera-based defect detection. Different kinds of defects can occur, such as residues of old stickers, rust stains, oil spillage, surface wear, cracks, and breakage. Currently, the defect type is of low importance. The number of defective SLCs in real-world scenarios is very low (a single-digit percentage or less of the total number of SLCs evaluated). Thus, modeling this task as visual anomaly detection is reasonable. Defect detection is divided into three subsections: exploration of SOTA AD methods and hyperparameter tuning (Sect. 6.1), experimental setup (Sect. 6.2), and experimental results (Sect. 6.3).

6.1 Explorative analysis of SOTA anomaly detection methods and hyperparameter tuning

Table 3 SOTA anomaly detection models based on the taxonomy, methods, loss functions, and the usage of pre-trained networks

To keep the implementation effort manageable, this study focuses on anomaly detection methods that can be invoked in a standardized manner using the Anomalib Python library [1]. Table 3 categorizes SOTA deep anomaly detection models based on their taxonomy, loss function, and the pre-trained model used, if any. The table format and explanations are inspired by the work of Liu et al. [29]. Student-teacher approaches depend on pre-trained models such as ResNet [19] and VGG [43] as the teacher network. The teacher networks are trained on large datasets such as ImageNet [13] for normal feature extraction. A particular layer of the pre-trained network is selected as a parameter for the teacher network. The teacher network then teaches a simpler student network to extract normal features. Anomalies are detected when the extracted features differ substantially between the teacher and student networks; an anomaly score is generated by comparing the features of the two networks. Example methods of the student-teacher taxonomy in AD include Reverse Distillation [12] and Student-Teacher Feature Pyramid Matching (STFPM) [51]. Reverse Distillation introduces a one-class bottleneck embedding (OCBE) as the input for the student network instead of raw images. The different architectures of the student and teacher networks help improve anomaly detection. STFPM is an extension of the vanilla student-teacher network with feature pyramid matching (a hierarchical structure). Several lower layer groups of the model are used for feature comparison between the student and teacher networks; differences between the generated features are considered anomalies.

Reconstruction-based models, unlike student-teacher approaches, do not rely on robust pre-trained models. They contain encoders and decoders that are trained to replicate the original normal images. Anomalies are detected when the reconstructed image deviates from the original image according to a comparison model. Reconstruction-based models perform worse in image-level anomaly detection as they cannot extract high-level semantic features; however, they excel in pixel-level anomaly detection [29]. An example of a reconstruction-based AD model is DRAEM [56]. DRAEM is composed of a reconstructive sub-network (encoder–decoder architecture) and a discriminative sub-network (U-Net-like architecture). The reconstructive sub-network is trained to reconstruct the original image from an artificially corrupted version produced by a simulator (using a Perlin noise generator). The output of the reconstructive sub-network is concatenated with the original image and used as input for the discriminative sub-network. The appearance of the reconstructed and the original image differ significantly in anomalous cases.

Distribution-map approaches require a robust pre-trained network to extract features from normal images. The extracted features are mapped to a distribution, usually a Gaussian. Anomalies have features that deviate from the normal feature distribution. Examples of distribution-map AD models are FastFlow [55] and CFlow-AD [17]. FastFlow consists of a feature extractor (either a Vision Transformer (ViT) or a ResNet) and the FastFlow model. It is inspired by Normalizing Flow (NF) [38] techniques. During inference, anomalous data should be out of distribution and have a lower likelihood than normal images. CFlow-AD consists of a discriminative pre-trained encoder and multi-scale generative decoders for estimating the likelihood of the encoded features. Features extracted by the encoder are processed by sets of decoders that are independent for each scale k. The decoder is a conditional normalizing flow network with feature input and conditional input.

Memory-bank approaches require robust pre-trained networks for feature extraction from normal images. The extracted features are stored in a memory bank and used during the inference step. Memory-bank approaches require large memory space to store the features but little training time. Anomalies are detected when features are far away from the normal features stored in the memory bank. Examples of memory-bank AD models are PaDiM [11], PatchCore [40], and CFA [27]. PaDiM consists of a pre-trained CNN for feature extraction of patches from different semantic levels. The features are embedded, and the distribution of normal training images is assumed to be a multivariate Gaussian. Mahalanobis distances are used to determine whether a particular patch is anomalous, where high scores indicate anomalous areas. PatchCore uses networks pre-trained on ImageNet, such as ResNet and WideResNet, for feature extraction. The extracted features are stored in a memory bank, which is subsampled using greedy coreset selection. A sample is considered anomalous when its features are far from their nearest neighbors in the memory bank. CFA uses a pre-trained ResNet to extract features; features from different spatial resolutions are interpolated and concatenated to form patch features. The extracted features are stored in the memory bank, and nearest-neighbor searches are used to differentiate normal from abnormal features. Features outside the hypersphere of memorized normal features are considered anomalies. A minimal sketch of the memory-bank idea is given below.
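A minimal PyTorch sketch of the memory-bank idea described above: normal patch features are stored, and at test time the anomaly score of a patch is its distance to the nearest stored feature. Feature extraction, coreset subsampling, and the exact PatchCore scoring scheme are omitted; tensor names are placeholders.

```python
import torch

def build_memory_bank(normal_features: torch.Tensor) -> torch.Tensor:
    """normal_features: (N, D) patch embeddings from a pre-trained backbone."""
    return normal_features

def image_anomaly_score(test_features: torch.Tensor, memory_bank: torch.Tensor) -> torch.Tensor:
    """test_features: (M, D) patch embeddings of one test image."""
    distances = torch.cdist(test_features, memory_bank)   # (M, N) pairwise Euclidean distances
    nearest = distances.min(dim=1).values                 # distance to the closest normal patch
    return nearest.max()                                  # image-level score: worst patch
```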

To choose a proper AD model, we first looked at the performance of the models on the MVTec anomaly detection dataset (MVTec AD) [4], which was generated for benchmarking anomaly detection methods with a focus on industrial inspection. The experiments are implemented using the Anomalib Python library [1] for a uniform implementation of the models. From the Anomalib library, eight different anomaly detection models are trained using the default hyperparameters of the library. Default hyperparameters are changed only when there are limitations such as limited GPU VRAM for training a particular model. All models are implemented with early stopping, with AUROC as the main evaluation metric. All models are trained on two datasets (all SLCs with and without background removal) as explained in Sect. 4. Based on the training results of the anomaly detection models, different metrics of the trained models are compared (further explained in Sect. 6.2). The two models that performed best on these metrics were further explored with hyperparameter tuning. The chosen and tuned hyperparameters are explained in Sect. 6.3.2. Note that the chosen best models can have different hyperparameters from the other models.

6.2 Experimental setup

The experimental setup covers the hardware used for the whole experiment and the metrics used for choosing the two best models based on the results of the explorative analysis of the SOTA anomaly detection models.

The hardware used for data preprocessing and model training is a desktop PC with the following specifications: AMD Ryzen Threadripper 3970X 32-core CPU, NVIDIA GeForce RTX 3090 with 24 GB VRAM, and 72 GB RAM. The code is written in Python. The Python libraries used are Anomalib (v0.7.0) and mmpretrain for ConvNeXt (classification model) [10]. The NVIDIA CUDA Toolkit version is 12.2.

The anomaly detection methods are compared based on quantitative and qualitative analysis. The quantitative analysis of the models is based on the Area Under the Receiver Operating Characteristic curve (AUROC), F1 score, precision, recall, and accuracy (a metric computation sketch is given below). AUROC is the area under the ROC curve, which plots the true positive rate against the false positive rate at various threshold settings. An AUROC value of 0.5 indicates a classifier with no discriminative power, equivalent to random guessing, while an AUROC value of 1.0 corresponds to a perfect classifier. The models with the two highest AUROC values in the initial explorative analysis were chosen for hyperparameter fine-tuning. The qualitative analysis is applied when comparing the top two anomaly detection methods to see how well the models can highlight the anomaly in a given image. Models that highlight the anomalous parts more precisely during the qualitative analysis are considered better.
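An illustrative sketch of the image-level metric computation described above, using scikit-learn; the labels, anomaly scores, and decision threshold shown are placeholder values.

```python
from sklearn.metrics import (roc_auc_score, f1_score, precision_score,
                             recall_score, accuracy_score)

y_true = [0, 0, 1, 1, 1, 0]                 # ground truth: 1 = anomalous (NOK), 0 = normal (OK)
y_score = [0.1, 0.4, 0.8, 0.35, 0.9, 0.2]   # model anomaly scores
threshold = 0.5                              # placeholder decision threshold
y_pred = [int(s >= threshold) for s in y_score]

print("AUROC    :", roc_auc_score(y_true, y_score))
print("F1       :", f1_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))
print("Accuracy :", accuracy_score(y_true, y_pred))
```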

6.3 Experimental results

The experimental results are divided into three parts: results of the explorative analysis of SOTA anomaly detection models (Sect. 6.3.1), hyperparameter tuning of the top candidates on the sampled SLC dataset (Sect. 6.3.2), and results for the tuned hyperparameters on the whole SLC dataset (Sect. 6.3.3).

6.3.1 Results of the explorative analysis of SOTA anomaly detection models

Table 4 Explorative analysis of SOTA AD models for the whole SLC dataset

Table 4 shows the results of the explorative analysis of the SOTA anomaly detection models on the whole SLC dataset. Methods marked with '(nob)' were trained on the dataset preprocessed with the background removal pipeline. The training time is given in hh:mm:ss. The Reverse Distillation method is abbreviated as 'RD'. Methods such as PatchCore and PaDiM are memory-bank models that only need one epoch for feature extraction. Based on the AUROC metric, PatchCore performs best (0.786), followed by DRAEM (0.773) on the original dataset. On the dataset with the background removal pipeline, the best performers are PatchCore (0.781), followed by RD (0.726). The F1 score reflects similar results for the best-performing models. Based on this explorative analysis, the experiments without the background removal pipeline perform consistently better; subsequent training is therefore conducted without background removal.

6.3.2 Hyperparameter tuning of top candidates on sampled SLC dataset

Quantitative analysis of top candidates on the sampled SLC dataset For time efficiency, hyperparameter tuning of the selected best models is done using only the sampled SLC dataset. All hyperparameter tuning runs use global seed 4 to ensure reproducibility.

Table 5 The result of hyperparameter tuning for PatchCore model on sampled SLC dataset

Hyperparameter tuning for PatchCore is conducted by changing the backbone of the feature extractor (ResNet-18 or WideResNet-50), the layer configuration, the Limit Train Batches value, the coreset subsampling percentage, and the number of neighbors. The layer configuration can be layer 2, layer 3, or layer 2+3. The Limit Train Batches value can be lowered to account for limited VRAM; the possible choices are 0.2, 0.4, 0.8, and 1.0, where 1.0 means using the whole training dataset during training. Table 5 shows the results of the hyperparameter tuning of the PatchCore model. Based on these results, the WideResNet-50 backbone slightly improves the metrics compared to the ResNet-18 backbone. Choosing only layer 3 yields worse metrics than the other configurations but is significantly faster to train, which can be useful for rapid model training or prototyping. Increasing the number of neighbors above 9 does not improve performance, and decreasing it slightly reduces performance while maintaining approximately the same training time. Choosing a smaller coreset subsampling percentage sacrifices a little performance but significantly improves the training time (coreset 0.05 to 0.01). An illustrative configuration sketch is given below.
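The sketch below shows how a PatchCore model with hyperparameters of this kind could be instantiated via the Anomalib library. The class and argument names follow the Anomalib API as we understand it, but exact signatures may differ between versions, so this should be treated as an assumption rather than a verified configuration.

```python
from anomalib.models import Patchcore

# Assumed constructor arguments; the values correspond to one tuning configuration.
model = Patchcore(
    backbone="wide_resnet50_2",     # WideResNet-50 feature extractor
    layers=["layer2", "layer3"],    # layer 2+3 configuration
    coreset_sampling_ratio=0.05,    # coreset subsampling percentage
    num_neighbors=9,                # neighbors for the nearest-neighbor search
)
```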

Table 6 The result of the hyperparameter tuning for DRAEM model on the sampled SLC dataset

Table 6 shows the results of the hyperparameter tuning of the DRAEM model. The experiments are conducted by changing the lambda value, the learning rate, and the batch size. The lambda value is a loss-balancing hyperparameter that controls how strongly the SSIM (Structural Similarity Index Measure) loss between neighboring patches contributes to the overall loss function (see the sketch below).
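A hedged sketch of the loss balancing described above: the SSIM term is weighted by lambda relative to the pixel-wise L2 term. The ssim_loss function is assumed to be provided elsewhere (e.g., a patch-based SSIM implementation), and the function name and signature are illustrative rather than the exact DRAEM implementation.

```python
import torch
import torch.nn.functional as F

def reconstruction_loss(reconstructed: torch.Tensor, original: torch.Tensor,
                        ssim_loss, lam: float = 1.0) -> torch.Tensor:
    """L2 reconstruction loss plus a lambda-weighted SSIM loss term."""
    l2 = F.mse_loss(reconstructed, original)
    return l2 + lam * ssim_loss(reconstructed, original)
```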

Fig. 2

Qualitative analysis of the two tuned PatchCore and DRAEM models selected in the explorative analysis of SOTA anomaly detection. Heatmaps from PatchCore isolate the anomalies better than those of the DRAEM models. PatchCore with layer 2 produces heatmaps that are well localized at the true anomalous areas, while PatchCore with layer 3 produces more spread-out anomaly scores near the anomalous areas. Layer 2 also shows darker red spots, which indicate a high anomaly score and thus a high likelihood of an anomaly in that region. DRAEM performs best on surface-based anomalies such as dirt on the SLC surface. DRAEM with batch size 8 performs better than batch size 4 in most cases

Fig. 3

Qualitative analysis of the two tuned models from PatchCore and DRAEM models. This covers the rest of the anomaly types that are not covered in Fig. 2. Despite the high confidence of the DRAEM model on major breakage and breakage anomalies, the model fails to highlight the area of the breakage anomaly properly. Patchcore layer 2 consistently highlights the main point of anomaly as in breakage defect and major breakage SLCs

Qualitative analysis of top selected models on the sampled SLC dataset Figures 2 and 3 show the qualitative analysis of the two tuned PatchCore and DRAEM models, selected based on the AUROC metric, on different types of defects. The left image is the original image with the respective defect. The defects included are foreign objects, stickers, dirt on the SLC surface, oily surfaces, breakage, and liquid spillage. Every two columns show the predicted heatmap of a model and its confidence that the SLC is anomalous or normal. Regions with higher anomaly scores appear as darker red spots, while lower anomaly scores appear in lighter yellow. The PatchCore models generally perform best for most anomaly types; in Figs. 2 and 3 they highlight the most likely location of the anomaly better than the tuned DRAEM models. PatchCore with layer 2 generally produces tighter groupings of the anomalous region, as shown for the foreign object (big) anomaly, whereas PatchCore with layer 3 produces more spread-out groupings. PatchCore performs slightly worse on oily surface defects. The DRAEM models work best on surface anomalies such as dirt and oil on the surface. DRAEM heatmaps perform best on surface anomalies and do not highlight the breakage of the SLC walls, as seen for breakage, breakage and liquid, and major breakage in Fig. 3. Despite having high confidence in predicting major breakage, the heatmap does not give a proper visualization of which part of the SLC is anomalous. This is, however, expected, as DRAEM models are built for pixel-level detection. Based on this qualitative experiment, the PatchCore model performs consistently better than the other models and produces better heatmaps for localizing the defects.

Fig. 4

Qualitative analysis and comparison of the two best models with default parameters and hyperparameter-tuned models on the whole SLC dataset. PatchCore, both with default and hyperparameter-tuned settings, performs better at highlighting the position of the anomaly than DRAEM (as seen in the left and right column heatmaps). The DRAEM models can mostly classify the SLC as defective but fail to properly highlight the defect location in the heatmaps. The blue SLC cover (left column) has breakage on the top left side of the image. All models accurately classify the image as anomalous; however, only the default PatchCore properly pinpoints the anomalous spot. The dark blue SLC (right column) is a simpler defect case where the breakage on the SLC surface is easily observed with the naked eye. Despite the high confidence scores of the DRAEM models, the highlighted regions do not represent the anomalous part of the SLC. The yellow SLC (center) exhibits a more complex structure, which makes defect detection considerably more difficult. The defect of the yellow SLC is on the top left corner walls (breakage). Both PatchCore and DRAEM fail to identify the anomalous part in the yellow SLC

Fig. 5

Additional qualitative analysis and comparison of the two best models with default parameters and hyperparameter-tuned models on the whole SLC dataset. The blue SLC cover (left column) contains cracks on its upper left part. The default DRAEM model fails to classify this as anomalous. Both PatchCore models classify it as anomalous and partly highlight the region of the cracks. The black SLC (center column) has a defect (deformed plastic) in its left outer wall. All models successfully classify the SLC as defective; however, none of them properly highlights the defective left wall. The yellow SLC (right column) contains two defects: a breakage on the top right corner of the SLC and a foreign object in the center of the SLC. The PatchCore models perform consistently better than the other models, with good highlighting of the defect locations. The default DRAEM model misclassifies the yellow SLC, as it fails to detect the foreign object and the breakage of the upper right walls. The weakness of the hyperparameter-tuned PatchCore model is visible here: the layer used (layer 3) performs worse at highlighting the defect in the yellow SLC

Table 7 Quantitative comparison between the default (noted by def on the end of the model name) and hyperparameter-tuned models on the whole SLC dataset

6.3.3 Results for tuned hyperparameters on whole SLC dataset

Based on the quantitative and qualitative analyses conducted during hyperparameter tuning on the sampled SLC dataset, we use the best combination of hyperparameters to train the models on the whole SLC dataset without background removal. The total number of images in the whole SLC dataset is about twice that of the sampled SLC dataset, which prevents us from directly reusing the same hyperparameters due to GPU VRAM limitations. Changes to specific parameters are specified during the quantitative and qualitative analysis of the models. Table 7 shows the quantitative comparison of the hyperparameter-tuned models with the default-parameter models. The hyperparameter-tuned PatchCore model performs better and is faster than the default PatchCore. It might perform even better without the GPU VRAM limitation, as the configuration used here is the second model from Table 5 with a lower Limit Train Batches value (0.7). When more GPU VRAM is available, using the PatchCore layer 2 model with Limit Train Batches of 1.0 may increase the performance further. The hyperparameter-tuned DRAEM performs better than the default-parameter DRAEM, and the training time needed for the tuned model is shorter. The inference time on the inference dataset (265 images of normal and anomalous SLCs) is on average the same.

Figures 4 and 5 show the qualitative analysis and comparison between the models trained with tuned hyperparameters on the whole SLC dataset and the default-parameter models. Figure 4 shows the first three defects on different colors and types of SLCs. The first two columns show breakage on the top left side of the blue SLC. The next two columns show a yellow SLC with a breakage defect on the top left corner walls; this represents one of the most challenging cases, as the defect is difficult to detect with the naked eye. The final two columns show one of the simpler defect cases (a large hole in the center of the SLC). The hyperparameter-tuned PatchCore models are more confident on the larger defects, with red heatmap highlights at the location of the defect. However, the PatchCore models (both default and hyperparameter-tuned) still struggle to accurately classify the yellow SLC as defective and to highlight the location of the defect. The DRAEM models have high anomaly scores on obvious defects such as the dark blue SLC defect in Fig. 4; however, the highlighted anomaly locations are not properly shown or are misleading, especially for the dark blue SLC. This may be unfavorable in industrial settings, as the defect classification cannot be backed by a proper indication of the defect location. The tuned DRAEM model misclassifies the yellow SLC as normal. Figure 5 shows additional qualitative analysis of the models on another three differently colored and shaped SLCs. The first two columns show a blue cover SLC with cracks on the upper left part of the cover. All models except the default DRAEM model successfully identify the cover SLC as defective. The difference between the default PatchCore (layer 2) and the tuned PatchCore (layer 3) can be seen on the blue cover SLC, where PatchCore with layer 2 has a higher anomaly score in the region of the anomaly, which increases the confidence of the prediction. The next two columns show a black SLC with a deformed left outer wall. All models correctly classify the black SLC as defective, but none of them successfully highlights the location of the deformed left outer wall. The final two columns show a breakage on the top right corner of the yellow SLC, with the broken-off part lying in the center of the SLC. This is another harder defect case where the defect location is not obvious. Both PatchCore models successfully highlight the location of the breakage and the foreign object, with the default model producing a better heatmap due to the layer 2 configuration. The default DRAEM model fails to highlight anything significant, hence the misclassification of the yellow SLC. The tuned DRAEM model highlights the breakage on the wall in dark red but fails to highlight the foreign object in the center of the yellow SLC. Based on the overall qualitative analysis, the PatchCore models perform more consistently than the DRAEM models and can more reliably highlight the exact location of the defects.

7 Discussion and conclusion

The goal of this work was to create a computer vision system that can recognize the SLC type for inventory management and perform defect detection automatically, bridging the gap between research and industrial practice for anomaly detection systems. The camera portal was built based on careful consideration of the geometry of the SLCs while keeping space requirements and costs to a minimum. The whole setup consists of the camera portal with five cameras angled at approximately 47.5\(^\circ \), placed 500 mm above the edge of the highest SLC, and a PLC-controlled conveyor belt.

This work applied SOTA classification and anomaly detection models, which are commonly benchmarked on controlled datasets such as MVTec AD, to real SLC datasets.

The accuracy of the fine-tuned ConvNeXt classification model is 100% on the SLC test dataset. This exceptionally high accuracy can be attributed to the fact that the SLCs are standardized products with easily distinguishable features such as color, size, and geometry. The classification of the SLC type can therefore be considered a solved problem.

Anomaly detection is considerably more challenging than SLC-type classification. The top two performers with default hyperparameters in the explorative experiments are PatchCore and DRAEM. Despite the high metric scores of DRAEM, it performs qualitatively worse than PatchCore in image-level anomaly detection. PatchCore is better at localizing the defects in the heatmaps, which helps to explain why an object is classified as defective. Besides this, PatchCore has consistently lower training times than DRAEM. Uncertainties regarding the deployment of the classification and anomaly detection models in real-life production lines include the stability of the code running both models. As work in progress, the proposed computer vision system is being tested in a longer continuous run.

The limitations of this study can be divided into theoretical and practical ones. We identify three theoretical limitations: the implementation of the background removal algorithm, the exclusion of pixel-level anomaly annotations, and the exclusion of potential dependencies between images. The background removal method used in our study is \(U^2\)-Net, which might not be sufficient for our use case. The exclusion of pixel-level anomaly annotations hinders the performance of some models such as DRAEM. On the practical side, training was done locally on an RTX 3090, which has limited VRAM. Data collection for our dataset is currently done in a controlled manner (no abnormal conditions such as sudden changes in lighting or wetness of the SLC surface), which may affect the performance of the classification and AD models. The current dataset labels each SLC as defective or normal without detailing what and where the anomalies are in the image; this can be addressed with future annotations of the dataset. The final limitation of this study is the possibility of slower on-site inference times, which can be affected by external factors.

The limitations of our study provide opportunities for future work. A potential novel implementation may combine AD and classification into one model. Besides this, more detailed annotation of the anomaly locations in the images could help improve model training. The VRAM limitation during training can be tackled in two ways. First, training can be done on clusters with more VRAM. Second, a practical solution is to train a separate anomaly detection model for each SLC type. This means that the whole dataset for a particular SLC type can be used for training without hitting the VRAM limitation, as each SLC type is only a subset of the whole dataset. However, this adds another layer of operations during inference in a real-life implementation, as the system needs to know which SLC is detected by the camera and then run inference with the model for that particular SLC type. Potential problems include that the anomaly detection model must wait for the classification model's output, which can affect the production cycle time, and that a wrong classification output causes the wrong anomaly detection model to be queried. The current practical implementation of the inference is done per image. Each SLC has exactly five images from the five cameras, and an SLC is considered defective if any of these views is classified as defective. This method has its limitations, and multi-view AD methods are an interesting direction for future work. Techniques such as image stacking, anomaly score aggregation, or multi-view convolutional models could be explored to enhance performance and provide a more accurate analysis of the SLCs.