1 Introduction

Industrial robots, with their capacity for controlled, precise movement, have emerged at the forefront of many industries, including manufacturing, aerospace, and healthcare. Advances in accessibility through the popularization of collaborative robots, with their ease of programming, increased safety, and lower initial cost, have made entry into these fields much simpler [1, 2]. Despite these improvements, one of the largest challenges still remains out of reach without a significant dedication of resources: computer vision [3]. Vision is key to enhanced robotic intelligence and productivity, giving a robot the ability to dynamically assign and complete tasks. Intelligent robots that can determine ideal grasp positions for pick-and-place [4], localize themselves using odometry [5, 6], or evaluate the position and pose of humans [7] have opportunities to enter other industries or to expand their role in already prevalent ones [8]. All of these capabilities rely heavily on vision and, in most cases, machine learning. Machine learning not only requires an experienced programmer to implement, it also requires training via one of several methods [9]. This makes adoption in industrial settings cumbersome at best and unfeasible at worst. Machine learning is pivotal to many industries, but it raises several security concerns, ranging from information privacy to adversarial attacks [10, 11]. The ability to create a consistent segmentation algorithm without the need for machine learning therefore has value to users sharing these concerns.

3D data is incredibly powerful, supplying a system with ample environmental information that, when combined with intuitive segmentation, enables specific decisions based on the identified features in a way that 2D data cannot replicate [12, 13]. The depth of this information comes at a cost: computational complexity. One of the biggest drivers of machine learning in computer vision is the complexity of 3D segmentation. While it is very easy for a human to identify features such as a table or a chair in a pointcloud, it is challenging for a computer to make that same distinction; hence the reliance on machine learning algorithms [14]. While 2D segmentation is also dominated by machine learning [15], traditional computer vision still has a place, and it is typically much faster and more efficient than deep learning where algorithms need not be class specific [16]. If 3D segmentation were as approachable and efficient as 2D segmentation, then, coupled with the correct applications, it would bypass the need for a machine learning algorithm. This manuscript discusses a technique that combines the ease of 2D segmentation with the richness of 3D data to identify a robot’s workspace. The implementation uses a time-of-flight (ToF) camera to capture a 16-bit depth image, applies traditional 2D computer vision techniques to segment the subject, masks all pixels outside the region of interest, and converts only the subject to 3D data. From there, the 3D data is much easier to denoise and the specific coordinates of the workspace surface can be identified. This contrasts with the typical usage of ToF cameras, where the 2D image is converted directly into a 3D pointcloud, requiring more complex segmentation.

The specific use case for this work is the automation of post-processing for metal additive manufacturing (AM). Processes like laser powder bed fusion (LPBF) use fine metal powders that pose a respiratory risk to humans [17], so automation allows the human to be safely removed from the environment. While some machines allow for partial automation, such as powder cleanup in LPBF systems [18], these options add cost to each machine and cannot handle other post-processing needs [19]. A robot with sufficient vision capabilities can maintain one or more machines and, with a wide array of end effectors enabled by tool changers, can capably handle many aspects of post-processing, including machine cleanup, build plate extraction, and support removal, while significantly reducing the danger to humans. Additionally, a single camera mounted to the robot allows for a simplified system, where the robot can scan from multiple angles in an environment where that would otherwise be difficult [20]. Due to the variance of the AM process itself, a robust vision system is needed, not only to localize the robot within the workspace, but also to identify the size, position, and orientation of printed parts.

This paper discusses literature with similar motivations or implementations and the techniques we used to combine 2D and 3D image processing. Additionally, we showcase the results of the workspace mapping and locating algorithms as well as their accuracy and discuss the impact that the flexibility of this system can have.

Fig. 1 UR5e with ToF camera and high-temperature end effector

2 Related works

2.1 Simultaneous localization and mapping

Simultaneous localization and mapping, or SLAM, is used when an agent scans unknown surroundings (mapping) while determining where it lies within the resulting map (localization) [5]. There are several ways of doing this with different sensors [3, 21], the most relevant of which uses time-of-flight sensors; it has been shown that for navigating human environments, ToF sensors offer greater environmental resistance, less dependence on consistent lighting, and good accuracy at long range [22]. Robotic arms equipped with the ability to scan and localize themselves within their environment can serve as valuable frameworks for intelligent automation. One such implementation uses a hand-mounted camera to acquire object information and guide navigation for grasping. The transformations that relate the camera pose to the robot world frame are acquired using the built-in forward kinematics [23], similar to how the pose is estimated in our work. This allows for simple pose correction without requiring external sensors or a more complex 3D processing workflow. It has been shown for some robots that relying only on the forward kinematics can result in deformed scans [20], but that research focuses on stitching multiple scans into a cohesive image, whereas our work uses only a single scan with a single robot pose, minimizing the effects of hysteresis in the forward kinematics. Additionally, the repeatability of the robot, and thus of the forward kinematics, has been externally qualified and shown to be very accurate, even at high speeds and with larger payloads, neither of which extremes applies to this work [24]. Those tests were performed on a UR5, which is a different model than the one used here, a UR5e, but the two models are nearly identical, with the major differences being upgraded software, integrated force and torque sensors, and an even higher repeatability [25, 26].

2.2 Alternatives to machine learning for 3D segmentation

When considering options for segmentation of 3D data, the state of the art is dominated by machine learning. That does not mean there are no alternatives: there are several families of algorithms that do not incorporate machine learning, including edge-based, region-based, model-based, and graph-based methods [27]. For our needs, we found region-based methods to be the most comparable. This technique segments the cloud into connected components, from which the table can be classified and processed. Region-based methods compare the surface properties of points and group them if they are sufficiently close, typically using an octree or similar structure [28]. One challenge of region-growing methods is that the segmentation quality relies on proper seed points. One way to mitigate this is to define a minimum distance between components; components that are close together are merged into one contiguous component rather than being left distinct. As this comparison approach does not use any automated decision-making, a graphical user interface (GUI)-based implementation with the features necessary for comparison was used. This software can be used to import a cloud, label connected components, clean the relevant cloud, and ultimately provide any information concerning a scanned workspace [29].
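As an illustration of this style of distance-based component grouping, the sketch below uses Open3D's DBSCAN clustering as a stand-in for the octree-based implementation in the GUI software [29]; the input file name, distance threshold, and minimum cluster size are placeholders rather than parameters taken from that software.

```python
# Minimal sketch of distance-based component grouping, analogous to the
# region/connected-component segmentation described above. Open3D's DBSCAN
# clustering stands in for the octree-based GUI implementation; the file
# name and thresholds are illustrative only.
import numpy as np
import open3d as o3d

pcd = o3d.io.read_point_cloud("workspace_scan.ply")  # hypothetical input file

# Group points whose neighbors lie within 10 mm (units assumed to be mm);
# eps acts as the "minimum distance between components" described above.
labels = np.asarray(pcd.cluster_dbscan(eps=10.0, min_points=20))

# Keep the largest component, assumed here to correspond to the table.
valid = labels >= 0                      # label -1 marks noise points
largest = np.bincount(labels[valid]).argmax()
table = pcd.select_by_index(np.where(labels == largest)[0])

o3d.visualization.draw_geometries([table])
```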

2.3 Combined 2D and 3D processing

It has been observed previously that using 2D computer vision in tandem with 3D processing greatly simplifies the computational demands of segmentation. One such implementation, Roboscan, uses a 2D RGB camera and a 3D laser slit sensor to determine the location of objects for pick-and-place applications [30]. This approach is comparable to the technique proposed in this manuscript, particularly in its system workflow, which describes a transfer of data between processes, using 2D preprocessing to aid the 3D classification. Additionally, the resulting 3D segmentation is able to inform the robot of the specific positions of objects within the scene and guide movement. Despite the similar workflow, there are many differences that make the two systems distinct. The first is the sensors used: Roboscan uses two separate sensors and uses the information from the 2D sensor to give the 3D scanner an idea of the number of objects in the scene as well as their sizes, shapes, and positions, whereas the proposed technique uses a single ToF sensor, with preprocessing of the collected depth image prior to 3D conversion to reduce noisy data. This marks a significant dissimilarity in the way the two techniques use 2D data to ease 3D processing: Roboscan searches for a set number of objects based on the information from the 2D image, with estimated sizes, shapes, and centers, while the proposed method looks for a single region, the workspace, identifying its contour within a 2D image and filtering out information outside the detected region. This also exposes a key difference between the two techniques: the motivation. Roboscan aims to locate objects within a known, fixed workspace, whereas this technique identifies the workspace itself.

3 Methods

3.1 Description of vision system

The system used is shown in Fig. 1, consisting of a robotic arm with a custom end effector, a vision system, and a control system. The robotic arm used is a Universal Robots UR5e [26], and the end effector base is a Robotiq 2F-140 [31] with custom high-temperature ceramic grippers [32]. The end effector is connected through the built-in tool I/O port on the UR5e. The vision system is a Vzense DCAM560CPro ToF 3D camera [33]. The camera is governed by a modified version of the Python wrapper SDK. The system is controlled by a mini-PC, which is connected to the UR5e and the DCAM560 via an Ethernet interface. The entire system is mounted to a mobile cart with locking wheels.

The system is being developed using a test frame constructed of t-slot aluminum extrusion with a rectangular wooden top. The test frame is approximately 80 cm tall, with a top surface of size 60 cm \(\times \) 50 cm. For testing purposes, the cart is located very close to the test frame, such that the end effector can remain normal to the surface across the entire test frame.

3.2 Image processing workflow

The image processing workflow, broken into distinct tasks, is shown below in Fig. 2. The tasks consist of the following: camera positioning, 2D image capture, image segmentation, mask application, pointcloud conversion, and 3D segmentation.

Fig. 2 Image processing workflow

The images and clouds are captured with respect to the camera reference frame. They must be transformed into the robot reference frame so that the locations generated by the workflow are useful to the robot. The offset in position and orientation is determined in the camera positioning task, which provides the tool center point (TCP) location of the camera, with position \([x, y, z]\) in mm and roll-pitch-yaw (RPY) orientation \([r_x, r_y, r_z]\) in radians.

Once the robot has been manipulated to the correct position, a 2D IR image is captured. The image is segmented to determine the location of the workspace via the methodology presented in Section 3.4. The result is used to mask a depth image, removing sections of the image that belong to the background. The system then converts the masked depth image into a 3D cloud. During the conversion process, a rotation matrix is generated from the location of the camera to transform each point into the appropriate coordinate system, with respect to the robot base frame, using the methodology discussed in Section 3.5. With this simplified 3D cloud, segmentation is much easier: the data is broken into clusters and all but the largest is removed, outliers are filtered out, and a bounding box is fit to the remaining data. The vertices of the bounding box are output in \([x, y, z]\) format to define the location of the table with respect to the robot.

3.3 Camera positioning

The first part of this task is positioning the camera such that the workspace is within the frame of the camera. This is done by observing the output of the camera and adjusting the location of the TCP using freedrive mode on the teach pendant (TP) of the robot. Once complete, a script is executed on the TP that outputs the \([x, y, z]\) position as well as the orientation of the end effector in RPY format. The offset of the TCP from the flange was calculated from the dimensions of the mounting bracket as designed in CAD software and from the location of the camera reference frame origin with respect to the flange, taken from the technical manual for the camera [33]. This TCP data was assumed to be correct, as any deviations in the results should be negligible compared to other error sources. This allows the system to use the generated position and orientation data to transform the collected data from the camera frame to the robot base frame, so that the output positions are usable by the robot.

3.4 2D image segmentation

Once the workspace is within the frame of the camera, an infrared (IR) image is captured. An example image is shown in Fig. 3c. The image is segmented using the border following algorithm for topological structural analysis proposed by Suzuki and Abe [34]. The contour with the maximal area is isolated, and an approximating polygon is fit to it using the Douglas-Peucker algorithm [35]. The contour is used to create a mask, which is then applied to a 16-bit depth image to remove all data not within the workspace. The IR and depth sensors share the same coordinate system, so they can be used interchangeably for processing; however, only the depth image can be converted into 3D.
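A minimal sketch of this 2D stage is shown below, assuming OpenCV, whose findContours and approxPolyDP functions implement the border following and Douglas-Peucker algorithms, respectively; the Otsu threshold, epsilon value, and file names are illustrative choices rather than the tuned parameters of our implementation.

```python
# Sketch of the 2D segmentation step, assuming OpenCV. findContours implements
# the Suzuki-Abe border following algorithm and approxPolyDP the Douglas-Peucker
# simplification; thresholds and file names are placeholders.
import cv2
import numpy as np

ir = cv2.imread("ir_frame.png", cv2.IMREAD_GRAYSCALE)        # hypothetical IR capture
depth = cv2.imread("depth_frame.png", cv2.IMREAD_UNCHANGED)  # 16-bit depth image

# Binarize the IR image so the workspace separates from the background.
_, binary = cv2.threshold(ir, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

# Border following; keep the contour with the maximal area.
contours, _ = cv2.findContours(binary, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
largest = max(contours, key=cv2.contourArea)

# Approximate the contour with a polygon (Douglas-Peucker).
epsilon = 0.01 * cv2.arcLength(largest, True)
polygon = cv2.approxPolyDP(largest, epsilon, True)

# Build a mask from the polygon and zero out all depth pixels outside it.
mask = np.zeros(depth.shape, dtype=np.uint8)
cv2.fillPoly(mask, [polygon], 255)
masked_depth = np.where(mask == 255, depth, 0).astype(depth.dtype)
```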

Fig. 3 2D segmentation

3.5 3D cloud generation and segmentation

The masked depth image is then converted into a 3D pointcloud by leveraging the geometric properties of the camera and the view plane. Via the intrinsic focal lengths \(f_x\) and \(f_y\), the optical center \((c_x, c_y)\), and the scale of the image, the depth at every pixel, notated \((u, v, w)\), can be converted to a real position of the form \((x, y, z)\) using Eqs. 1, 2, and 3; the RPY rotation between the camera and robot frames, \((r_x, r_y, r_z)\), collected from the robot as described in Section 3.3, is then used to transform the converted points. During this process, any converted point at the maximum or minimum of the range is removed, as such points are erroneous, created either by truncation of the range of the depth camera or by pixels masked during 2D preprocessing.

$$\begin{aligned} z = \frac{w}{\sqrt{1 + \left(\frac{c_x - u}{f_x}\right)^2 + \left(\frac{c_y - v}{f_y}\right)^2}} \end{aligned}$$
(1)
$$\begin{aligned} x = z\frac{c_x - u}{f_x} \end{aligned}$$
(2)
$$\begin{aligned} y = z\frac{c_y - v}{f_y} \end{aligned}$$
(3)
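The sketch below illustrates Eqs. 1, 2, and 3 with numpy; the intrinsic values shown in the usage comment are placeholders, as the real values come from the camera's calibration data.

```python
# Sketch of the depth-to-3D conversion in Eqs. 1-3 using numpy. Masked pixels
# (depth 0) and saturated pixels are discarded, as described above.
import numpy as np

def depth_to_points(depth_mm, fx, fy, cx, cy):
    """Convert a masked 16-bit depth image (mm) to an (N, 3) array of points."""
    h, w = depth_mm.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))

    # Normalized ray components for each pixel (Eqs. 2 and 3 without z).
    a = (cx - u) / fx
    b = (cy - v) / fy

    # Eq. 1: recover z from the range value w reported by the ToF sensor.
    z = depth_mm / np.sqrt(1.0 + a**2 + b**2)
    x = z * a
    y = z * b

    pts = np.stack([x, y, z], axis=-1).reshape(-1, 3)

    # Drop out-of-range points (masked pixels and range truncation artifacts).
    flat = depth_mm.reshape(-1)
    valid = (flat > 0) & (flat < np.iinfo(np.uint16).max)
    return pts[valid]

# Example usage with placeholder intrinsics:
# points = depth_to_points(masked_depth.astype(np.float64),
#                          fx=460.0, fy=460.0, cx=320.0, cy=240.0)
```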
Fig. 4 3D segmentation

Directly after the points are converted to 3D, they are transformed from the camera reference frame to the robot reference frame by multiplying by \(M_z\), \(M_y\), and then \(M_x\), which are calculated from Eqs. 4, 5, and 6, respectively. From here, the segmentation begins. First, a radial outlier filter is applied, where any point with fewer than 5 neighbors within a 7 mm radius is removed; this targets the smaller outliers around the edges of the table. Next, a statistical outlier filter, defined by Eq. 7, is applied, where any point farther than a distance \(d_{max}\) from its 45 nearest neighbors is removed, with \(d_{max}\) being the mean distance \(\mu\) to those neighbors plus two standard deviations \(\sigma\) of that distance. This helps remove the small clusters left behind by the radial filter.

$$\begin{aligned} M_x = \begin{pmatrix} 1 & 0 & 0 \\ 0 & \cos {r_x} & -\sin {r_x} \\ 0 & \sin {r_x} & \cos {r_x} \end{pmatrix} \end{aligned}$$
(4)
$$\begin{aligned} M_y = \begin{pmatrix} \cos {r_y} & 0 & \sin {r_y} \\ 0 & 1 & 0 \\ -\sin {r_y} & 0 & \cos {r_y} \end{pmatrix} \end{aligned}$$
(5)
$$\begin{aligned} M_z = \begin{pmatrix} \cos {r_z} & -\sin {r_z} & 0\\ \sin {r_z} & \cos {r_z} & 0 \\ 0 & 0 & 1 \end{pmatrix} \end{aligned}$$
(6)
$$\begin{aligned} d_{max} = \mu + 2 \sigma \end{aligned}$$
(7)
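The sketch below illustrates this stage, composing the rotation from Eqs. 4, 5, and 6 and applying the two filters with Open3D; note that Open3D's statistical filter evaluates \(\mu\) and \(\sigma\) over the whole cloud, so it approximates rather than exactly reproduces the per-point criterion of Eq. 7, and the function names are illustrative.

```python
# Sketch of the frame transformation (Eqs. 4-6) and the two outlier filters,
# assuming Open3D. Open3D's statistical filter uses cloud-wide statistics, so
# it is an approximation of Eq. 7 rather than an exact reproduction.
import numpy as np
import open3d as o3d

def rotation_from_rpy(rx, ry, rz):
    """Compose M_x @ M_y @ M_z from Eqs. 4-6 (apply M_z, then M_y, then M_x)."""
    Mx = np.array([[1, 0, 0],
                   [0, np.cos(rx), -np.sin(rx)],
                   [0, np.sin(rx),  np.cos(rx)]])
    My = np.array([[ np.cos(ry), 0, np.sin(ry)],
                   [0, 1, 0],
                   [-np.sin(ry), 0, np.cos(ry)]])
    Mz = np.array([[np.cos(rz), -np.sin(rz), 0],
                   [np.sin(rz),  np.cos(rz), 0],
                   [0, 0, 1]])
    return Mx @ My @ Mz

def filter_cloud(points, rx, ry, rz):
    """points: (N, 3) array from the conversion step; rx, ry, rz: TCP RPY (rad)."""
    robot_frame_pts = points @ rotation_from_rpy(rx, ry, rz).T
    pcd = o3d.geometry.PointCloud(o3d.utility.Vector3dVector(robot_frame_pts))

    # Radial filter: drop points with fewer than 5 neighbors within 7 mm.
    pcd, _ = pcd.remove_radius_outlier(nb_points=5, radius=7.0)

    # Statistical filter: 45 neighbors, threshold at mu + 2*sigma (Eq. 7).
    pcd, _ = pcd.remove_statistical_outlier(nb_neighbors=45, std_ratio=2.0)
    return pcd
```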

With the cloud cleaned, the extents of the table are identified via a bounding box based on the convex hull. The four bounding-box vertices with the largest z coordinates represent the estimated top plane of the workspace, which is the plane of interest.
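A small sketch of this final step is shown below; Open3D's oriented bounding box is used here as a stand-in for the convex-hull-based box described above.

```python
# Fit a bounding box to the cleaned cloud and keep the four box corners with
# the largest z values as the estimated top plane of the workspace.
import numpy as np
import open3d as o3d

def top_plane_vertices(pcd: o3d.geometry.PointCloud) -> np.ndarray:
    box = pcd.get_oriented_bounding_box()
    corners = np.asarray(box.get_box_points())   # eight (x, y, z) corners

    # Sort corners by z and keep the four highest: the estimated table top.
    return corners[np.argsort(corners[:, 2])[-4:]]
```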

4 Experimental results

4.1 Performance of 2D segmentation

The major steps of 2D segmentation can be seen in Fig. 3. Figure 3a shows a raw RGB image of the table that the robot is to identify within its workspace, along with some miscellaneous background objects. These background objects are meant to highlight the differences between the various methods of image capture. The corresponding depth image captured by the ToF camera is shown in Fig. 3b; it is unclear exactly where the table ends and the background begins, which would be problematic for the border following algorithm. This contrasts with the IR image shown in Fig. 3c, which provides a much clearer boundary between the workspace and the background. The algorithm discussed in Section 3.4 is applied to this IR image to find the largest continuous contour. This contour is converted into a mask and applied to Fig. 3b, removing everything outside of the contour. The resulting image can be seen in Fig. 3d.

Initially, the algorithm made use of the depth image rather than the IR image. Some inconsistency was found, specifically when viewing the workspace from a skewed orientation. The depth image tended to have rough edges on the far side, leading to a mask that was slightly too aggressive at best and a complete misfit to the workspace at worst. Additionally, the range of the IR image was more conducive to the scan distances; the depth image captured much more of the background than the IR image did, which increased the errors seen in 2D segmentation. The algorithm was able to make use of the IR image due to the specific camera used: both the depth and IR images originate from the same location, allowing them to be used interchangeably for 2D operations.

4.2 Performance of 3D segmentation

The major steps of 3D segmentation can be seen in Fig. 4. Figure 4a shows an unprocessed point cloud, converted directly from Fig. 3c. This point cloud has had all points at maximum or minimum distance (i.e., out of range) removed to retain only the useful data. Additionally, as part of the conversion process, the point cloud has been reoriented to the robot's reference frame. Figure 4b shows the result of the 3D segmentation; this point cloud has undergone the radial and statistical filtering described in Section 3.5.

To quantify the effectiveness of this algorithm, we conducted a comparison against a GUI-based pointcloud processing software [29]. The GUI software uses an octree region-based clustering algorithm to separate the different parts of the pointcloud. A human operator then selected the appropriate cluster representing the workspace and manually identified the vertices. This comparison was chosen as it was one of the only examples found of 3D segmentation that does not require machine learning.

Table 1 Measured, experimental, and GUI-calculated vertices of the workspace
Fig. 5 Results plotted in 3D

Fig. 6 Filtered point cloud

For comparison, the parameters of interest were the resulting vertices, the distance between each real vertex and the corresponding identified vertex, and the time each method took. The vertices are defined as labeled in Fig. 3a, with BL being the back left, BR the back right, TL the top left, and TR the top right. The real vertices were identified by moving the TCP of the robot to each corner of the table, and each vertex is recorded in Table 1 as Meas. For the other methods, a scan of the workspace was conducted and two clouds were collected. The first cloud was simply converted to the robot base frame using the equations in Section 3.5, while the second cloud also had the 2D preprocessing algorithm applied to it. The first cloud was analyzed and segmented manually in a GUI-based software, using a statistical filter, a radial filter, and then automatic region-based segmentation. The cluster corresponding to the table was isolated and the vertices were selected by the human operator; these vertices are recorded in Table 1 as GUI. The time from importing the cloud into the software to calculation of the vertices was approximately 90 s, although some of this time was spent navigating menus within the software.

The second cloud was analyzed using the methodology detailed in Section 3.5, and the identified vertices are recorded in Table 1 as Exp. The compute time, from image capture to calculation of the table vertices, was 7.25 ms. For the purpose of statistical comparison, the vertices of the table were measured and recorded (Fig. 5).

With the exception of the top left corner, every point calculated by the algorithm was closer to the real point than the GUI-determined vertex. The average error was 24.32 mm for the experimental method and 26.90 mm for the GUI-based software, for a percent difference of 10.07 %. While a minor degree of error may lie in human operation, that seems an unlikely explanation for a percent difference of this size. Additionally, the algorithm was generally more accurate at estimating heights, although neither method is consistent across the entire surface. In the XY direction, both the GUI and the algorithm are fairly accurate at estimating the vertices and are generally closer to each other than either is to the measured point. This indicates that there may be some issues in the pointcloud capture itself: even though the GUI draws from the raw pointcloud and the 3D segmentation draws from a preprocessed pointcloud, they arrive at similar results. If there were an accuracy issue with the designed algorithm, it would likely show a different result compared to the GUI-based method. Thus, it is likely that this error is attributable not to either processing method, but rather to the depth sensor itself. This can be observed in Fig. 6. Figure 6a is a processed scan segmented with the developed algorithm, with many visible raised sections. While these could be artifacts of the algorithm, comparison with Fig. 6b, a raw scan from the camera, supports that they are artifacts of the camera itself. There seems to be little pattern to the variance of the artifacts, so points could have their height either raised or lowered. This is supported by Table 2, where the mean percent error (MPE) for each method was calculated. The MPEs in x and y are comparable between methods and are also quite low, all below 5 %, whereas the MPE in z is significantly higher. This is also visible in Fig. 5, where all vertices are fairly close in the top view, but the side and back views reveal how far apart the points really are, due to the large disparity in z.
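For clarity, the error metrics referenced above can be reproduced as follows, assuming Euclidean vertex error and the standard percent-difference formula; only the averages quoted in the text are used, with the per-vertex data remaining in Table 1.

```python
# Sketch of the error metrics used above, assuming Euclidean vertex error and
# the standard percent-difference formula.
import numpy as np

def mean_vertex_error(estimated, measured):
    """Mean Euclidean distance (mm) between matching (N, 3) vertex arrays."""
    diff = np.asarray(estimated) - np.asarray(measured)
    return float(np.mean(np.linalg.norm(diff, axis=1)))

def percent_difference(a, b):
    return abs(a - b) / ((a + b) / 2.0) * 100.0

# Using the average errors quoted above:
print(round(percent_difference(24.32, 26.90), 2))  # -> 10.07, as reported
```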

Table 2 Mean percent error of all vertices in Table 1

Despite this, with the experimental and GUI results so close, the fidelity of the algorithm is demonstrated: with no machine learning and no human guidance, the program was able to identify the vertices as well as a human could from GUI-based software, in a fraction of a percent of the time.

4.3 Repeatability at distances and angles

To determine the variance of the algorithm with differing parameters, several additional captures were performed. With a static base frame, captures were taken at different robot poses and compared to determine the effect of both angle and distance on the accuracy of the algorithm. A single vertex, the top left corner (TL), was selected to act as the node for comparison; TL was selected as it is the most skewed of the points and thus should test the algorithm's robustness. The set of poses selected is described in Table 3. The poses chosen represent variations on the "normal" view of the workspace, as well as several of the extreme positions, limited either by the joints of the robot or by the orientation of the camera. The robot was moved into each pose using the freedrive mode of the UR5e, and the current pose for each configuration was recorded from the joint angles listed on the teach pendant of the robot. The robot was then programmed to move to each pose sequentially and wait until data collection was confirmed and completed by the operator.

Table 3 Poses used to determine repeatability of algorithm with regards to distance and angle

At each of the poses listed in Table 3, the algorithm was run to identify the best-fit bounding box, and the TL corner was identified in each. The results can be seen in Table 4. The overall average was calculated, and each point was compared to it to quantify the consistency of the algorithm. Overall, the algorithm appears very consistent, with an average distance from the mean of 16.1 mm. Looking at the specific points affirms this: Pose 3 is a major outlier in x, almost 30 mm from the average, but every other value falls within a small range, for a standard deviation of 14.23 mm. The y values are consistent across the board, with a standard deviation of only 7.95 mm. The z values are less consistent, with values on both sides of the range, but remain reasonably consistent with a standard deviation of 9.06 mm.

Given the scanning error observed in Section 4.2, it seems entirely reasonable that some (or all) of this deviation can be attributed to inconsistency from the camera. The mean distance from the average point in the repeatability experiment is lower than any of the error values observed in Table 1. Additionally, the error far exceeds the repeatability of the robot used, a UR5e [26]. However, there may still be some small errors associated with robot positioning between the cases: the robot fixture was designed to be mobile through the use of caster wheels, so while each position is accurate to the desired pose, the entire robot may have shifted slightly as a result of the movement. Beyond that, the results are still very similar to each other. While some variation can certainly be attributed to the algorithm, its performance remains quite good. The TL vertex was consistently the least accurate of all the vertices, so a standard deviation within 1 cm is excellent, and it is likely that the other vertices would be more accurate.

Table 4 Experimental variance of vertex TL with captures taken at the poses listed in Table 3

5 Conclusion

This work presented an alternative method for 3D vision that does not require machine learning, aiming to maximize accuracy and precision while minimizing time. It allows vision systems to be implemented with minimal setup time, as no training data needs to be created or annotated. The performance of this system has been characterized through a series of comparisons, looking at the accuracy relative to a GUI-based software, the precision across differing robot poses, and an examination of the 2D preprocessing itself.

All of these characterization tests show that the system is faster and more accurate than a GUI-based method. However, the two methods produced similar results, and neither is particularly close to the measured locations, so the inaccuracy was attributed to the sensor. This seems likely, as human error should be less than 20 mm. This conclusion could also explain some of the variance seen in the precision test. Additionally, the fidelity of the 3D data processed using the 2D information was high: even without any statistical or positional filtering, the preprocessed 3D data was clean and trivial to segment.

While these results are certainly promising, there is still significant development to be done. The current control scheme has only been tested on rectangular shapes, and prior to meaningful implementation, the algorithm needs to be tested on a wider array of shapes, including those with rounded edges. The camera used has also been shown to be slightly inaccurate at the distances measured, resulting in sub-optimal precision. Finally, the system is not yet able to use the data to operate the robot in a meaningful way; the path planning algorithm is under development but has yet to be fully implemented.