CN108734120B - Method, device and equipment for labeling image and computer readable storage medium
- Publication number: CN108734120B (application number CN201810464372.8A)
- Authority: CN (China)
- Prior art keywords: image, points, attribute, annotation, module configured
- Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06V20/38—Outdoor scenes (under G06V20/00—Scenes; scene-specific elements; G06V20/35—Categorising the entire scene, e.g. birthday party or wedding scene)
- G06T7/10—Segmentation; Edge detection (under G06T7/00—Image analysis)
- G06T7/50—Depth or shape recovery (under G06T7/00—Image analysis)
- G06V20/56—Context or environment of the image exterior to a vehicle by using sensors mounted on the vehicle (under G06V20/50—Context or environment of the image)
- G06T2207/10028—Range image; Depth image; 3D point clouds (under G06T2207/10—Image acquisition modality)
- G06T2207/20081—Training; Learning (under G06T2207/20—Special algorithmic details)
- G06T2207/30252—Vehicle exterior; Vicinity of vehicle (under G06T2207/30248—Vehicle exterior or interior; G06T2207/30—Subject of image)
Landscapes
- Engineering & Computer Science; Physics & Mathematics; General Physics & Mathematics; Theoretical Computer Science; Multimedia; Computer Vision & Pattern Recognition; Processing Or Creating Images; Image Analysis
Abstract
Embodiments of the present disclosure provide methods, apparatuses, devices, and computer-readable storage media for annotating images. A method for annotating an image includes determining, in a point cloud describing three-dimensional information of a scene, a set of points associated with an object in the scene. The method also includes obtaining an annotation for an attribute of the object. The method also includes determining a portion, corresponding to the set of points, of an image of the scene to be annotated. The method also includes generating an annotated image by applying the annotation to the portion. In this way, the annotation efficiency of images can be improved, so that large-scale annotated images can be obtained.
Description
Technical Field
Embodiments of the present disclosure relate generally to the field of image processing, and more particularly, to a method, an apparatus, a device, and a computer-readable storage medium for annotating images.
Background
In recent years, with the rapid development of artificial intelligence, large-scale training data sets have played an increasingly important role in the accuracy of artificial intelligence algorithms. Acquiring high-quality labeled training data efficiently is therefore a prerequisite for advancing such algorithms. However, in the field of unmanned driving, no scheme for efficiently annotating images currently exists, so there is no large-scale annotated outdoor image data set, and good data support cannot be provided for unmanned driving and its related algorithms.
Disclosure of Invention
According to an embodiment of the present disclosure, a scheme for labeling an image based on a point cloud is provided.
In a first aspect of the disclosure, a method for annotating an image is provided. The method comprises: determining, in a point cloud describing three-dimensional information of a scene, a set of points associated with an object in the scene; obtaining an annotation for an attribute of the object; determining a portion corresponding to the set of points in an image of the scene to be annotated; and generating an annotated image by applying the annotation to the portion.
In a second aspect of the present disclosure, an apparatus for annotating an image is provided. The apparatus includes: a point set determination module configured to determine, in a point cloud describing three-dimensional information of a scene, a set of points associated with an object in the scene; an annotation acquisition module configured to acquire an annotation for an attribute of the object; a corresponding portion determination module configured to determine a portion corresponding to the set of points in an image of the scene to be annotated; and an annotation image acquisition module configured to generate an annotated image by applying the annotation to the portion.
In a third aspect of the disclosure, an electronic device is provided. The electronic device includes: one or more processors; and memory for storing one or more programs that, when executed by the one or more processors, cause an electronic device to implement a method in accordance with the first aspect of the disclosure.
In a fourth aspect of the present disclosure, a computer-readable medium is provided, on which a computer program is stored which, when executed by a processor, implements a method according to the first aspect of the present disclosure.
It should be understood that this Summary is not intended to identify key or essential features of the embodiments of the present disclosure, nor is it intended to limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The above and other features, advantages and aspects of various embodiments of the present disclosure will become more apparent by referring to the following detailed description when taken in conjunction with the accompanying drawings. In the drawings, like or similar reference characters designate like or similar elements, and wherein:
FIG. 1 illustrates a schematic diagram of an exemplary environment in which embodiments of the present disclosure can be implemented;
FIG. 2 shows a flow diagram of a method for annotating an image according to an embodiment of the present disclosure;
FIG. 3 shows a flow diagram of a method for obtaining annotations for a property of an object, according to an embodiment of the present disclosure;
FIG. 4 shows a flow diagram of a method for obtaining annotations for a property of an object, in accordance with an embodiment of the present disclosure;
FIG. 5 shows a flow diagram of a method for generating a depth image according to an embodiment of the present disclosure;
FIG. 6 shows a block diagram of an apparatus for annotating images according to an embodiment of the present disclosure; and
FIG. 7 illustrates a block diagram of an electronic device capable of implementing embodiments of the present disclosure.
Detailed Description
Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While certain embodiments of the present disclosure are shown in the drawings, it is to be understood that the present disclosure may be embodied in various forms and should not be construed as limited to the embodiments set forth herein, but rather are provided for a more thorough and complete understanding of the present disclosure. It should be understood that the drawings and embodiments of the disclosure are for illustration purposes only and are not intended to limit the scope of the disclosure.
As mentioned above, in the field of unmanned driving, large-scale labeled outdoor image sets are required to develop and optimize related algorithms. Traditionally, an RGB image acquired outdoors is semantically labeled as follows: object segmentation is performed on the RGB image to obtain approximate object boundaries and boundary control points; a boundary enclosing the object to be labeled is manually selected on the RGB image with a polygon box and then refined by an algorithm; and the object boundary is further corrected by manually dragging the boundary control points, so that an object with pixel-level boundary precision is obtained and labeled with the corresponding category.
However, the above process has a number of disadvantages. For example, object segmentation on an RGB image relies on the assumption that pixels on either side of an object boundary differ greatly in RGB color space while pixels belonging to the same object differ little. In practice, this assumption is often difficult to satisfy. For example, in a scene where a person wearing white clothing stands in front of a white wall, it is difficult to separate the person from the wall using existing methods, and a large amount of manual correction is required at a later stage.
On the other hand, manually selecting object boundaries not only increases labor cost, but some outdoor objects with irregular shapes and spatial distributions (such as tree leaves) also pose great challenges to manual selection. Correcting the object boundary by manually dragging boundary control points consumes yet more labor.
These disadvantages result in that it usually takes a long time to label an outdoor RGB image to obtain semantic labeling at a pixel level, thereby limiting the size of the labeled outdoor RGB image set. This is also the reason why large-scale annotated outdoor image sets are lacking in the field of unmanned driving.
In view of the above, embodiments of the present disclosure provide a solution for annotating an image (e.g., an outdoor RGB image). The solution performs object segmentation on a three-dimensional point cloud and labels attributes of the objects in the point cloud. The labeled point cloud is then projected onto the image using the correspondence between points (hereinafter also referred to as voxels) in the point cloud and pixels in the image to be annotated, so as to obtain an image carrying the annotation information.
Since the object is relatively isolated in the point cloud compared to the image, segmenting the object in the point cloud enables a higher accuracy to be achieved, so that subsequently only a small amount of work has to be spent on adjusting the boundary of the object. Meanwhile, the marked point cloud can be projected onto a plurality of images by utilizing the corresponding relation between the point cloud and the images of the same scene, so that a large-scale marked image set is obtained. Compared with the prior art, the scheme of the embodiment of the disclosure has significant advantages in the aspects of labeling cost and labeling efficiency, so that a large-scale semantic data set can be provided for an outdoor scene understanding algorithm.
Embodiments of the present disclosure will be specifically described below with reference to fig. 1 to 7.
FIG. 1 illustrates a schematic diagram of an exemplary environment 100 in which embodiments of the present disclosure can be implemented. In environment 100, an acquisition entity 112 (e.g., vehicle 112) mounted with a lidar and camera (not shown in fig. 1) is operated on a roadway 114 to acquire data relating to a scene 110 in which the acquisition entity 112 is located.
In the context of the present disclosure, the term "acquisition entity" refers to an entity capable of acquiring point clouds, images, and/or other suitable data, such as a vehicle, a person, or another device capable of movement. The lidar may be a single-line lidar, a multi-line lidar, a 3D lidar, or the like. The camera may be a high-precision camera, a panoramic camera, a monocular camera, or the like. It should be understood that the above examples are for illustrative purposes only and are not intended to limit the scope of the embodiments of the present disclosure.
During the movement of the acquisition entity 112, the lidar acquires a point cloud describing three-dimensional information of the scene 110 in which the acquisition entity 112 is located. As shown in fig. 1, the scene 110 includes a road surface 114 on which the collecting entity 112 is currently located, trees 116 on both sides of the collecting entity 112, and a building 118 in front of the collecting entity 112. The collected point cloud data may describe three-dimensional information, such as spatial coordinates, of points on the road surface 114, trees 116, and buildings 118.
At the same time, a camera disposed in association with the lidar may capture images of the same scene 110. In embodiments of the present disclosure, the term "image" refers to a two-dimensional image, such as an RGB image, captured by a camera. The camera may capture a large number of images of the same scene 110 by adjusting imaging parameters (e.g., focal length, position, angle, etc.). According to the camera imaging principle, the imaging parameters of the camera can be used to determine which pixel in the image corresponds to a point in the point cloud, i.e. can be used to determine the correspondence between the point in the point cloud and the pixel in the image.
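This voxel-to-pixel correspondence can be made concrete with a standard pinhole projection. The sketch below is an illustration under assumptions rather than part of the patent text: the intrinsic matrix K, rotation R, and translation t stand in for the imaging parameters (focal length, position, angle) mentioned above, and the numeric values used in the example are made up.

```python
import numpy as np

def project_points(points, K, R, t):
    """Project Nx3 world points into pixel coordinates with a pinhole model.

    points: (N, 3) array of 3D coordinates from the point cloud.
    K:      (3, 3) camera intrinsic matrix (assumed known from calibration).
    R, t:   camera rotation (3, 3) and translation (3,) at capture time.
    Returns integer pixel coordinates (N, 2) and camera-frame depths (N,).
    """
    cam = points @ R.T + t             # world -> camera coordinates
    depths = cam[:, 2]                 # distance along the optical axis
    uvw = cam @ K.T                    # camera -> homogeneous pixel coordinates
    pixels = uvw[:, :2] / uvw[:, 2:3]  # perspective division
    return np.round(pixels).astype(int), depths

# Example: one voxel center projected with an assumed 1000-pixel focal length.
K = np.array([[1000.0, 0.0, 640.0],
              [0.0, 1000.0, 360.0],
              [0.0, 0.0, 1.0]])
pixels, depths = project_points(np.array([[2.0, 1.0, 10.0]]), K, np.eye(3), np.zeros(3))
```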
The acquisition entity 112 uploads the point cloud acquired during the movement, the image and the imaging parameters of the camera at the time of acquiring the image to the cloud storage device 120 in an associated manner. Those skilled in the art will appreciate that the point cloud, image, and imaging parameters may also be stored on other storage devices, and are not limited to being stored to the cloud storage device 120. The point cloud, image and imaging parameters may also be stored in conventional memory such as a hard disk, for example.
The computing device 130 downloads from the cloud storage 120 a point cloud describing the three-dimensional information of the scene 110, multiple images of the scene 110, and imaging parameters of the camera when capturing the images. The computing device 130 partitions the points describing the objects in the scene 110 in a point cloud. For example, the computing device 130 may divide the points in the point cloud into a set of points associated with the building 118 and a set of points associated with each of the trees 116. In the point cloud, the point representing the building 118 and the point representing the tree 116 have a certain isolation, so that the object segmentation based on the point cloud can have high accuracy.
The computing device 130 presents a set of points in the point cloud associated with the object to the user 140, such as presenting a set of points associated with the building 118 to the user 140. The user 140 may label the presented set of points, i.e. the attributes of the objects associated with the set of points, e.g. label the presented set of points as "building". The user 140 may also make adjustments to the presented set of points, and the computing device 130 may update the set of points associated with the object based on the adjustments to make the segmentation of the object more accurate.
The computing device 130 receives annotations of the attributes of the object and adjustments to the set of points by the user 140. The computing device 130 may update the set of points associated with the object based on the user's 140 adjustment of the set of points. For each of the multiple images of the scene 110, the computing device 130 determines portions in the image that correspond to a set of points in the point cloud associated with the object based on imaging parameters of the camera at the time the image was taken and applies annotations obtained from the user 140 to the attributes of the object. In this way, the annotation of the multiple images of the scene is completed.
In an embodiment of the present disclosure, object segmentation with high accuracy is achieved in a point cloud, and an annotation for an attribute of the segmented object (e.g., a category of the object, a name of the object) is acquired from the user 140, and then the acquired annotation is projected to a corresponding portion in the image using a correspondence of the point cloud and the image. In this way, after the point cloud in a scene is labeled, the labeling of a large number of images of the scene can be obtained, so that a large-scale labeled image can be obtained.
The computing device 130 can upload the annotated image to the cloud storage device 120 for storage to form an annotated image dataset. A computing device 150 executing a machine learning algorithm (e.g., a machine learning algorithm associated with unmanned driving) may download such labeled image datasets from the cloud storage device 120 to enable optimization of the machine learning algorithm. It should be understood that although computing device 130 and computing device 150 are shown in fig. 1 as being separate, in other embodiments, computing device 130 and computing device 150 may be integrated together.
The computing device 130 may also train a predictive model based on the already labeled point clouds to predict attributes of objects in the point clouds and present to the user 140 to facilitate labeling by the user 140. In this way, embodiments of the present disclosure can make full use of the a priori knowledge that has already been labeled to further speed up the labeling process of the image.
It should be understood that the number, configuration, connections, and arrangement of the components shown in FIG. 1 are exemplary rather than limiting, and some of the components may be optional. Those skilled in the art may adjust the number, structure, connection relationships, and layout within the scope of the present disclosure.
FIG. 2 shows a flow diagram of a method 200 for annotating images according to an embodiment of the present disclosure. The method 200 may be performed by the computing device 130 shown in fig. 1. As previously described, to create a large scale annotated outdoor image dataset, the computing device 130 previously downloads from the cloud storage device 120 a point cloud describing the three-dimensional information of the scene 110, a large number of images (e.g., outdoor RGB images) of the same scene 110, and imaging parameters of the camera when capturing these images.
At block 202, the computing device 130 determines a set of points associated with objects in a scene in a point cloud of three-dimensional information describing the scene. That is, the computing device 130 enables segmentation of objects in the scene in the point cloud. The three-dimensional information of the scene may include spatial coordinates of various points on objects in the scene. The set of points in the point cloud associated with the object may be a collection of points in the point cloud (hereinafter also referred to as "voxels") describing the object. In the point cloud, since the object itself has a certain isolation, object segmentation with higher accuracy can be realized in the point cloud than in the image.
Although objects in the point cloud are relatively isolated from one another, almost all of them are connected to some predetermined object. In the example shown in FIG. 1, the building 118 and the trees 116 are associated with the ground 114. To this end, in some embodiments of the present disclosure, to achieve better object segmentation in the point cloud, the computing device 130 may remove a set of predetermined object points associated with the predetermined object from the point cloud. For example, the computing device 130 may perform an erosion-dilation process on the point cloud to remove the set of points associated with a predetermined object (e.g., the ground 114), so that the other objects in the point cloud (e.g., the buildings 118 and trees 116) become more isolated from one another, which facilitates their segmentation.
The computing device 130 may segment a plurality of point sets associated with a plurality of objects in the scene in the point cloud after removing the predetermined object point set. For example, taking fig. 1 as an example, the computing device 130 may segment a set of points associated with the building 118 and a set of points associated with each of the two trees 116 from the point cloud from which the set of road surface points is removed.
In embodiments of the present disclosure, to achieve more accurate object segmentation, the computing device 130 may divide the point cloud into multiple sets based on similarities between points in the point cloud. For example, the computing device 130 may calculate Euclidean distances between points in the point cloud and group points whose Euclidean distances are less than a predetermined threshold into one set. Thereafter, the computing device 130 may cluster the multiple sets using the concave-convex relationships between them, thereby obtaining multiple sets of points associated with multiple objects in the scene. In particular, if all points on a line connecting any two points in the union of two sets belong to that union, the union of the two sets is a convex set, and the two sets can be clustered together. In some embodiments, the computing device 130 may aggregate the point cloud into voxels and achieve the clustering by analyzing the concave-convex relationship between adjacent voxels.
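As a rough illustration of the distance-based grouping described above, the following sketch (not from the disclosure) removes a crude ground estimate and groups the remaining points by Euclidean proximity. The height threshold here is only a stand-in for the erosion-dilation ground removal, the radius is an assumed value, and the concave-convex merging step is not reproduced.

```python
import numpy as np
from scipy.spatial import cKDTree

def segment_point_sets(points, ground_z=0.2, radius=0.5):
    """Group points into candidate object point sets by Euclidean proximity.

    points:   (N, 3) point cloud coordinates.
    ground_z: assumed height below which points are treated as ground and dropped.
    radius:   points closer than this distance are placed in the same set.
    Returns a list of index arrays into `points`, one per candidate object.
    """
    keep = np.where(points[:, 2] > ground_z)[0]   # crude stand-in for ground removal
    tree = cKDTree(points[keep])
    parent = list(range(len(keep)))

    def find(i):                                  # union-find over nearby point pairs
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in tree.query_pairs(radius):
        parent[find(i)] = find(j)

    clusters = {}
    for local_idx, original_idx in enumerate(keep):
        clusters.setdefault(find(local_idx), []).append(original_idx)
    return [np.array(idx) for idx in clusters.values()]
```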
In embodiments of the present disclosure, the computing device 130 may select a set of points from the segmented plurality of sets of points for presentation to the user 140 for the user 140 to annotate properties of objects associated with the selected set of points.
At block 204, the computing device 130 obtains annotations for the properties of the object. In an embodiment of the present disclosure, the attribute of the object may include at least one of a category of the object, a name of the object, and depth information of the object.
In embodiments of the present disclosure, the computing device 130 may obtain annotations for the attributes of the object from the user 140. In some embodiments, to further reduce manual effort, the computing device 130 can also utilize an attribute prediction model to predict the attributes of the object and determine the predicted attributes as the attribute annotations for the object. In some embodiments, the attribute prediction model may also be used by other devices to predict the attributes of the object, and the computing device 130 may obtain the predicted attributes from those devices and determine them as the attribute annotations for the object. A specific process for obtaining annotations for the attributes of the object will be described later with reference to FIGS. 3 and 4.
At block 206, the computing device 130 determines a portion, corresponding to the set of points, of the image of the scene described by the point cloud. The image is an image to be annotated, for example an outdoor RGB image. The set of points may include one or more voxels in the point cloud, and the corresponding portion of the image may include one or more pixels. The computing device 130 may obtain (e.g., download) from the cloud storage device 120 the imaging parameters of the camera when the image was captured, such as the spatial position, angle, and focal length of the camera at that time. Using camera imaging principles, the computing device 130 may determine the portion of the image corresponding to the set of points based on the acquired imaging parameters. In an embodiment of the present disclosure, the computing device 130 may determine the corresponding portion by determining, based on the imaging parameters of the camera, which pixel in the image the spatial point represented by each voxel in the point set maps to after being imaged by the camera.
At block 208, the computing device 130 generates an annotated image by applying the obtained annotation for the attribute of the object to the corresponding portion determined at block 206. The computing device 130 has obtained a label for the attributes of the object in block 204, i.e., label information for each voxel in the set of points associated with the object in the point cloud, e.g., whether the voxel represents a point on the building 118 or a point on the tree 116, has been obtained. The computing device 130 may apply the labeling information for each voxel in the set of points associated with the object in the point cloud to the pixel in the image corresponding to the voxel to obtain labeling information for the pixel.
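Putting blocks 206 and 208 together, a per-pixel label map could be filled in roughly as follows. This sketch reuses the hypothetical project_points helper from the earlier sketch and assumes an integer class id per object; it illustrates the idea rather than the disclosed implementation.

```python
import numpy as np

def apply_annotation(label_map, object_points, class_id, K, R, t):
    """Write an object's class id into the pixels that its voxels project to.

    label_map:     (H, W) integer array, 0 meaning "unlabeled".
    object_points: (M, 3) coordinates of the set of points for one object.
    class_id:      integer code for the annotated attribute (e.g. "building").
    """
    # project_points: pinhole projection helper from the earlier sketch.
    pixels, depths = project_points(object_points, K, R, t)
    h, w = label_map.shape
    for (u, v), d in zip(pixels, depths):
        # Keep only points that lie in front of the camera and inside the frame.
        if d > 0 and 0 <= v < h and 0 <= u < w:
            label_map[v, u] = class_id
    return label_map
```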
In some embodiments, the computing device 130 may perform the process of the method 200 separately for each segmented object, thereby assigning annotation information to corresponding pixels in the image. In this way, annotation at the pixel level for an image can be achieved.
For a large number of images of the same scene, the method 200 applies point cloud-based labeling to the images by performing object segmentation and attribute labeling in the point cloud and using imaging parameters of the camera when capturing the images. Compared with the existing image labeling method, the method 200 realizes the segmentation of the object on the point cloud instead of directly performing the object segmentation on the image, so that the segmentation precision of the method 200 is higher. Meanwhile, the method 200 maps the attribute labels acquired based on the point clouds to the images by using the corresponding relation between the point clouds and a large number of images in the same scene, and manual labeling is not needed to be carried out on each image, so that the labeling efficiency is improved. Experiments have shown that the method 200 according to embodiments of the present disclosure can be about 30 times faster than existing methods of directly annotating images. For outdoor images, a large scale annotated data set can be obtained by the method 200.
FIG. 3 shows a flow diagram of a method 300 for obtaining annotations for a property of an object, according to an embodiment of the present disclosure. The method 300 may be performed by the computing device 130 shown in fig. 1.
At block 302, the computing device 130 may present the set of points associated with the object to the user 140. To facilitate the user 140 in labeling attributes of the object, the computing device 130 may render the point cloud on an RGB image for presentation to the user 140, thereby facilitating the user 140 in labeling scattered points in the point set. The user 140 can label the presented set of points, e.g., the user 140 can label a category, name, etc. of an object associated with the set of points.
When computing device 130 presents a set of points associated with an object to user 140, user 140 may find the three-dimensional boundaries of the object less than ideal. To obtain more accurate object segmentation and, thus, more accurate annotation images, the computing device 130 may allow the user 140 to adjust the rendered set of points, e.g., to adjust the boundaries of the segmented object. To do so, at block 304, the computing device 130 may determine whether the user 140 adjusted the boundary of the object by adjusting the set of points. If the user 140 has adjusted the presented set of points, the method 300 proceeds to block 306. If user 140 does not adjust the set of points, method 300 proceeds to block 308.
At block 306, the computing device 130 may update the set of points associated with the object in response to the user 140 adjusting the set of points, so that the updated set of points reflects the adjustment made by the user 140. Updating the set of points causes a corresponding change to the portion determined at block 206 of method 200, so that a more accurate annotated image can be obtained.
At block 308, the computing device 130 may receive the annotation input by the user 140 for the property of the object and associate the annotation with the set of points. Taking FIG. 1 as an example, user 140 may label building 118 as a category of "buildings" based on the set of points presented in association with building 118. The computing device 130 may label all points in the point cloud associated with the building 118 as belonging to the category "building".
The method 300 describes a process for obtaining annotations for attributes of an object from the user 140. To further reduce manual effort, embodiments of the present disclosure may also obtain annotations for attributes of objects using machine learning. This process is described below in conjunction with fig. 4. FIG. 4 shows a flow diagram of a method 400 for obtaining annotations for a property of an object, according to an embodiment of the present disclosure. The method 400 may be performed by the computing device 130 shown in fig. 1.
After the user 140 has manually labeled objects in point clouds according to the method 300, so that a large number of point clouds with labeled objects are available, the computing device 130 may use machine learning to train an attribute prediction model on the labeled point clouds for predicting the attributes of objects in subsequent point clouds. At block 402, the computing device 130 may predict attributes of the object using the attribute prediction model, which is trained on point clouds in which objects have already been labeled. For example, the computing device 130 may train the attribute prediction model based on a 3D-CNN to obtain the parameters of the model.
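The disclosure does not specify a particular network, so the following is only a hedged sketch, in PyTorch, of what a small voxel-grid attribute classifier might look like; the 32-voxel grid size, channel widths, number of classes, and optimizer settings are all assumptions.

```python
import torch
import torch.nn as nn

class VoxelAttributeNet(nn.Module):
    """Tiny 3D CNN that maps a 32x32x32 occupancy grid to attribute logits."""

    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d(2),                      # 32 -> 16
            nn.Conv3d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d(2),                      # 16 -> 8
        )
        self.classifier = nn.Linear(32 * 8 * 8 * 8, num_classes)

    def forward(self, x):                         # x: (batch, 1, 32, 32, 32)
        x = self.features(x)
        return self.classifier(x.flatten(1))

# One training step on a labeled voxel grid (labels come from the manual annotations).
model = VoxelAttributeNet()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss = nn.functional.cross_entropy(model(torch.zeros(4, 1, 32, 32, 32)),
                                   torch.randint(0, 10, (4,)))
loss.backward()
optimizer.step()
```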
At block 404, the computing device 130 may present the set of points associated with the object and the predicted attributes of the object to the user 140. Since some prediction errors inevitably exist in the method of machine learning, in order to avoid that such prediction errors affect the annotation accuracy of the image, in the embodiment of the present disclosure, the computing device 130 may provide the predicted attributes to the user 140 to judge whether the predicted attributes are correct or not by the user 140.
At block 406, the computing device 130 may determine whether the user 140 has modified the predicted attribute. If the user 140 does not modify the attribute, the computing device 130 may determine the predicted attribute as the annotation for the attribute of the object at block 408. Alternatively, the user 140 may directly confirm that the predicted attribute is correct; in this case, the computing device 130 may likewise determine the predicted attribute as the annotation for the attribute of the object.
On the other hand, if the user 140 determines that the predicted attribute is incorrect, the user 140 may modify the attribute. If the user 140 modifies the attribute, the computing device 130 may determine the modified attribute as the annotation for the attribute of the object at block 410. In this way, prior knowledge improves labeling efficiency while labeling accuracy is preserved.
At block 412, the computing device 130 may update the attribute prediction model with the newly labeled point cloud. With the addition of the labeled point clouds, the computing device 130 may retrain the attribute prediction model with more labeled point clouds, thereby updating the attribute prediction model to continuously improve the prediction accuracy of the attribute prediction model. In this way, the method 400 makes full use of the prior knowledge, reducing the burden on the user, thereby further improving the annotation efficiency.
According to embodiments of the present disclosure, not only may the image be semantically labeled based on the point cloud (e.g., labeling the category or name of an object in the image), but depth information of the object may also be labeled. The depth information indicates the distance between the object and the spatial position at which the camera was located when the image was captured, more specifically the distance between the spatial position of a point on the object and the spatial position of the camera. The computing device 130 may associate the distance with the portion of the image corresponding to the object, thereby generating a depth image corresponding to the image.
In the embodiment of the present disclosure, the depth image corresponding to the image may also be determined without performing object segmentation. Fig. 5 shows a flow diagram of a method 500 for generating a depth image according to an embodiment of the present disclosure. The method 500 may be performed by the computing device 130 shown in fig. 1.
At block 502, the computing device 130 acquires a point cloud of three-dimensional information describing a scene and an image of the same scene. As previously described, the three-dimensional information of the scene may include spatial coordinates of points on objects in the scene. For example, the computing device 130 may download the point cloud and image from the cloud storage device 120.
At block 504, the computing device 130 obtains a correspondence between points (i.e., voxels) in the point cloud and pixels in the image. As described earlier with reference to fig. 1, while acquiring the point cloud and the image, the acquisition entity 112 also records the imaging parameters of the camera when the image was taken, such as the spatial position of the camera, the angle of the camera, and the focal length of the camera, and uploads the point cloud, the image, and the imaging parameters to the cloud storage device 120 in association with one another. The computing device 130 may therefore download the imaging parameters from the cloud storage device 120 and determine, based on them, to which pixel in the image the spatial point represented by each voxel in the point cloud corresponds after being imaged by the camera, thereby determining the correspondence between the voxels in the point cloud and the pixels in the image.
At block 506, the computing device 130 determines a distance between the spatial point represented by a voxel in the point cloud and the spatial position of the camera. For example, the computing device 130 may compute a Euclidean distance, a Manhattan distance, a Chebyshev distance, or the like between the three-dimensional coordinates represented by the voxel and the three-dimensional coordinates of the position at which the camera captured the image. It should be understood that the above examples of distances are for illustrative purposes only and are not intended to limit the scope of the present disclosure. In other embodiments according to the present disclosure, any suitable distance may be employed to represent depth information.
At block 508, the computing device 130 may associate the determined distances with corresponding pixels in the image, thereby generating a depth image corresponding to the image. The value of a pixel in the depth image represents the distance between the point in space represented by the corresponding pixel in the image and the position of the camera at the time the image was taken. Computing device 130 may apply the determined distance between the spatial point represented by the voxel in the point cloud and the spatial location of the camera to the pixel in the image corresponding to the voxel to obtain a value for the corresponding pixel in the depth image based on the correspondence determined in block 504.
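Along the lines of method 500, a depth image could be assembled as sketched below; it again leans on the hypothetical project_points helper from the earlier sketch, uses the Euclidean distance to the camera position as the pixel value, and assumes camera_pos is recovered from the stored imaging parameters.

```python
import numpy as np

def build_depth_image(points, camera_pos, K, R, t, height, width):
    """Fill an (H, W) depth image with point-to-camera distances (blocks 506 and 508).

    points:     (N, 3) point cloud coordinates.
    camera_pos: (3,) spatial position of the camera when the image was taken.
    Pixels that no point projects to keep the value 0 ("no depth known").
    """
    depth_image = np.zeros((height, width), dtype=np.float32)
    pixels, cam_depths = project_points(points, K, R, t)       # helper from the earlier sketch
    distances = np.linalg.norm(points - camera_pos, axis=1)    # Euclidean distance per voxel
    for (u, v), z, dist in zip(pixels, cam_depths, distances):
        if z > 0 and 0 <= v < height and 0 <= u < width:
            # When several points land on the same pixel, keep the nearest one.
            if depth_image[v, u] == 0 or dist < depth_image[v, u]:
                depth_image[v, u] = dist
    return depth_image
```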
With the method 500, dense outdoor depth images can be generated and the annotated image dataset can be diversified, filling the current gap in outdoor depth image datasets.
FIG. 6 shows a block diagram of an apparatus 600 for annotating images according to an embodiment of the present disclosure. The apparatus 600 may be included in the computing device 130 of fig. 1 or implemented as the computing device 130. As shown in fig. 6, the apparatus 600 includes a point set determination module 610 configured to determine a set of points associated with objects in a scene in a point cloud of three-dimensional information describing the scene. The apparatus 600 further comprises an annotation acquisition module 620 configured to acquire an annotation for a property of the object. The apparatus 600 further comprises a corresponding portion determining module 630 configured to determine a portion of the image of the scene to be annotated corresponding to the set of points. Further, the apparatus 600 includes an annotation image acquisition module 640 configured to generate an annotated image by applying an annotation to the portion.
In some embodiments, the point set determination module 610 may include: a removal module configured to remove a set of predetermined object points associated with a predetermined object from the point cloud; a segmentation module configured to segment a plurality of point sets associated with a plurality of objects in a scene in a point cloud from which a predetermined object point set is removed; and a point set selection module configured to select a set of points associated with the object from the plurality of sets of points.
In some embodiments, the point set determination module 610 may include: a partitioning module configured to partition the point cloud into a plurality of sets based on similarities between points in the point cloud; a clustering module configured to cluster the sets based on concave-convex relationships among the sets to obtain a plurality of point sets associated with a plurality of objects in the scene; and a point set selection module configured to select the point set associated with the object from the plurality of point sets.
In some embodiments, the annotation acquisition module 620 can include: a presentation module configured to present a set of points associated with an object to a user; and an annotation receiving module configured to receive user-entered annotations for the attributes of the object.
In some embodiments, the apparatus 600 may further comprise: an adjustment receiving module configured to receive an adjustment of a set of points by a user; and an update module configured to update the set of points based on the adjustment.
In some embodiments, the annotation acquisition module 620 can include: a prediction module configured to predict attributes of the object using an attribute prediction model, the attribute prediction model being trained based on a point cloud to which the object has been tagged; an annotation determination module configured to determine the predicted attribute as an annotation for the attribute of the object.
In some embodiments, the annotation acquisition module 620 can include: a prediction module configured to predict attributes of the object using an attribute prediction model, the attribute prediction model being trained based on a point cloud to which the object has been tagged; a providing module configured to provide the predicted attribute to a user; and an annotation determination module configured to: in response to receiving a user confirmation of the predicted attribute, determining the predicted attribute as an annotation for the attribute of the object; and in response to receiving a user modification of the predicted property, determining the modified property as an annotation for the property of the object.
In some embodiments, the apparatus 600 may further comprise: a prediction model update module configured to update the attribute prediction model with the point cloud and a label for an attribute of the object.
In some embodiments, the corresponding portion determining module 630 may include: a camera parameter acquisition module configured to acquire parameters of a camera when capturing an image; and a determination module configured to determine a portion corresponding to the set of points based on the parameter.
In some embodiments, the attributes of the object include at least one of: category, name, and depth information of the object.
In some embodiments, the annotation acquisition module 620 can include: a depth information determination module configured to determine depth information of the object, the depth information indicating a distance between the object and a spatial position at which the camera is located when the image is captured; an annotation determination module configured to determine the depth information as an annotation for the property of the object.
In some embodiments, apparatus 600 may comprise: a depth image generation module configured to generate a depth image corresponding to the image by associating the distance with the corresponding portion.
FIG. 7 illustrates a schematic block diagram of an electronic device 700 that may be used to implement embodiments of the present disclosure. Device 700 may be used to implement computing device 130 of fig. 1. As shown, device 700 includes a Central Processing Unit (CPU) 701 that may perform various appropriate actions and processes in accordance with computer program instructions stored in a Read Only Memory (ROM) 702 or computer program instructions loaded from a storage unit 708 into a Random Access Memory (RAM) 703. In the RAM 703, various programs and data required for the operation of the device 700 can also be stored. The CPU 701, the ROM 702, and the RAM 703 are connected to each other via a bus 704. An input/output (I/O) interface 705 is also connected to bus 704.
Various components in the device 700 are connected to the I/O interface 705, including: an input unit 706 such as a keyboard, a mouse, or the like; an output unit 707 such as various types of displays, speakers, and the like; a storage unit 708 such as a magnetic disk, optical disk, or the like; and a communication unit 709 such as a network card, modem, wireless communication transceiver, etc. The communication unit 709 allows the device 700 to exchange information/data with other devices via a computer network, such as the internet, and/or various telecommunication networks.
The processing unit 701 performs the various methods and processes described above, such as the methods 200, 300, 400, 500. For example, in some embodiments, the methods 200, 300, 400, 500 may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as the storage unit 708. In some embodiments, part or all of a computer program may be loaded onto and/or installed onto device 700 via ROM 702 and/or communications unit 709. When the computer program is loaded into the RAM 703 and executed by the CPU 701, one or more steps of the methods 200, 300, 400, 500 described above may be performed. Alternatively, in other embodiments, the CPU 701 may be configured to perform the methods 200, 300, 400, 500 in any other suitable manner (e.g., by way of firmware).
The functions described herein above may be performed, at least in part, by one or more hardware logic components. For example, without limitation, exemplary types of hardware logic components that may be used include: a Field Programmable Gate Array (FPGA), an Application Specific Integrated Circuit (ASIC), an Application Specific Standard Product (ASSP), a System on a Chip (SOC), a Complex Programmable Logic Device (CPLD), and the like.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program code, when executed by the processor or controller, causes the functions/acts specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
Further, while operations are depicted in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. Under certain circumstances, multitasking and parallel processing may be advantageous. Likewise, while several specific implementation details are included in the above discussion, these should not be construed as limitations on the scope of the disclosure. Certain features that are described in the context of separate embodiments can also be implemented in combination in a single implementation. Conversely, various features that are described in the context of a single implementation can also be implemented in multiple implementations separately or in any suitable subcombination.
Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are disclosed as example forms of implementing the claims.
Claims (24)
1. A method for annotating an image, comprising:
determining, in a point cloud of three-dimensional information describing a scene, a set of points associated with an object in the scene;
obtaining an annotation for an attribute of the object;
determining a correspondence of spatial points represented by voxels in the set of points to pixels in the image based on parameters relating to the capturing of the image;
determining, based on the correspondence, a portion corresponding to the set of points in an image of the scene to be annotated; and
generating an annotated image by applying the annotation to the portion;
wherein the attribute of the object comprises depth information indicating a distance between the object and a spatial position at which the camera was located when the image was taken, the method further comprising:
associating the distances with respective pixels in the image to generate a depth image corresponding to the image.
2. The method of claim 1, wherein determining a set of points in the point cloud associated with an object in the scene comprises:
removing a set of predetermined object points associated with a predetermined object from the point cloud;
segmenting a plurality of point sets associated with a plurality of objects in the scene in the point cloud from which the predetermined object point set is removed; and
selecting the set of points associated with the object from the plurality of sets of points.
3. The method of claim 1, wherein determining a set of points in the point cloud associated with an object in the scene comprises:
dividing the point cloud into a plurality of sets based on similarities between points in the point cloud;
clustering the sets based on concave-convex relations among the sets to obtain a plurality of point sets associated with a plurality of objects in the scene; and
selecting the set of points associated with the object from the plurality of sets of points.
4. The method of claim 1, wherein obtaining annotations for properties of the object comprises:
presenting the set of points associated with the object to a user; and
receiving the user-input label for the attribute of the object.
5. The method of claim 4, further comprising:
receiving an adjustment to the point set by the user; and
updating the set of points based on the adjustment.
6. The method of claim 1, wherein obtaining annotations for properties of the object comprises:
predicting attributes of the object using an attribute prediction model trained based on a point cloud to which the object has been tagged; and
determining the predicted attribute as an annotation for an attribute of the object.
7. The method of claim 1, wherein obtaining annotations for properties of the object comprises:
predicting attributes of the object using an attribute prediction model trained based on a point cloud to which the object has been tagged;
providing the predicted attributes to a user;
in response to receiving a user confirmation of the predicted attribute, determining the predicted attribute as an annotation for an attribute of the object; and
in response to receiving the user's modification of the predicted attribute, determining the modified attribute as an annotation for the object's attribute.
8. The method of claim 6 or 7, further comprising:
updating the attribute prediction model with the point cloud and a label for an attribute of the object.
9. The method of claim 1, wherein determining a portion of an image of the scene to be annotated that corresponds to the set of points comprises:
acquiring parameters of a camera when the camera shoots the image; and
determining the portion corresponding to the set of points based on the parameter.
10. The method of claim 1, wherein the attributes of the object further comprise at least one of: the class and name of the object.
11. The method of claim 1, wherein obtaining annotations for properties of the object comprises:
determining depth information of the object; and
determining the depth information as an annotation for a property of the object.
12. An apparatus for annotating an image, comprising:
a point set determination module configured to determine a set of points associated with an object in a scene in a point cloud of three-dimensional information describing the scene;
an annotation acquisition module configured to acquire an annotation for a property of the object;
a correspondence determination module configured to determine a correspondence of spatial points represented by voxels in the set of points to pixels in the image based on parameters relating to a capture of the image;
a corresponding portion determination module configured to determine, based on the correspondence, a portion corresponding to the set of points in an image of the scene to be annotated; and
an annotation image generation module configured to generate an annotated image by applying the annotation to the portion;
wherein the property of the object includes depth information of a distance between the object and a spatial position at which the camera is located when the image is taken, the apparatus further comprising:
a depth image generation module configured to associate the distances with respective pixels in the image to generate a depth image corresponding to the image.
13. The apparatus of claim 12, wherein the point set determination module comprises:
a removal module configured to remove a set of predetermined object points associated with a predetermined object from the point cloud;
a segmentation module configured to segment a plurality of point sets associated with a plurality of objects in the scene in the point cloud from which the predetermined object point set is removed; and
a point set selection module configured to select the set of points associated with the object from the plurality of point sets.
14. The apparatus of claim 12, wherein the point set determination module comprises:
a partitioning module configured to partition the point cloud into a plurality of sets based on similarities between points in the point cloud;
a clustering module configured to cluster the plurality of sets based on a concave-convex relationship between the plurality of sets to obtain a plurality of point sets associated with a plurality of objects in the scene; and
a point set selection module configured to select the set of points associated with the object from the plurality of point sets.
15. The apparatus of claim 12, wherein the annotation acquisition module comprises:
a presentation module configured to present the set of points associated with the object to a user; and
an annotation receiving module configured to receive an annotation for a property of the object input by the user.
16. The apparatus of claim 15, further comprising:
an adjustment receiving module configured to receive an adjustment of the set of points by the user; and
an update module configured to update the set of points based on the adjustment.
17. The apparatus of claim 12, wherein the annotation acquisition module comprises:
a prediction module configured to predict attributes of the object using an attribute prediction model trained based on a point cloud to which the object has been tagged; and
an annotation determination module configured to determine the predicted attribute as an annotation for an attribute of the object.
18. The apparatus of claim 12, wherein the annotation acquisition module comprises:
a prediction module configured to predict attributes of the object using an attribute prediction model trained based on a point cloud to which the object has been tagged;
a providing module configured to provide the predicted attribute to a user; and
an annotation determination module configured to:
in response to receiving a user confirmation of the predicted attribute, determining the predicted attribute as an annotation for an attribute of the object; and
in response to receiving the user's modification of the predicted attribute, determining the modified attribute as an annotation for the object's attribute.
19. The apparatus of claim 17 or 18, further comprising:
a prediction model update module configured to update the attribute prediction model with the point cloud and a label for an attribute of the object.
20. The apparatus of claim 12, wherein the corresponding portion determining module comprises:
a camera parameter acquisition module configured to acquire parameters of a camera when the camera takes the image; and
a determination module configured to determine the portion corresponding to the set of points based on the parameter.
21. The apparatus of claim 12, wherein the properties of the object further comprise at least one of: the class and name of the object.
22. The apparatus of claim 12, wherein the annotation acquisition module comprises:
a depth information determination module configured to determine depth information of the object; and
an annotation determination module configured to determine the depth information as an annotation for a property of the object.
23. An electronic device, the electronic device comprising:
one or more processors; and
memory storing one or more programs that, when executed by the one or more processors, cause the electronic device to implement the method of any of claims 1-11.
24. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the method according to any one of claims 1-11.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810464372.8A CN108734120B (en) | 2018-05-15 | 2018-05-15 | Method, device and equipment for labeling image and computer readable storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108734120A CN108734120A (en) | 2018-11-02 |
CN108734120B (en) | 2022-05-10 |
Family
ID=63938286
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810464372.8A Active CN108734120B (en) | 2018-05-15 | 2018-05-15 | Method, device and equipment for labeling image and computer readable storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108734120B (en) |
Families Citing this family (18)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110869974B (en) * | 2018-11-19 | 2024-06-11 | 深圳市卓驭科技有限公司 | Point cloud processing method, equipment and storage medium |
CN109740487B (en) * | 2018-12-27 | 2021-06-15 | 广州文远知行科技有限公司 | Point cloud labeling method and device, computer equipment and storage medium |
CN109978955B (en) * | 2019-03-11 | 2021-03-19 | 武汉环宇智行科技有限公司 | Efficient marking method combining laser point cloud and image |
US11308639B2 (en) * | 2019-03-12 | 2022-04-19 | Volvo Car Corporation | Tool and method for annotating a human pose in 3D point cloud data |
CN110070575A (en) * | 2019-03-29 | 2019-07-30 | 东软睿驰汽车技术(沈阳)有限公司 | A kind of method and device to label |
CN112182122A (en) * | 2019-07-05 | 2021-01-05 | 科沃斯商用机器人有限公司 | Method and device for acquiring navigation map of working environment of mobile robot |
CN110598743A (en) * | 2019-08-12 | 2019-12-20 | 北京三快在线科技有限公司 | Target object labeling method and device |
CN110689026B (en) * | 2019-09-27 | 2022-06-28 | 联想(北京)有限公司 | Method and device for labeling object in image and electronic equipment |
CN112950785B (en) * | 2019-12-11 | 2023-05-30 | 杭州海康威视数字技术股份有限公司 | Point cloud labeling method, device and system |
CN111191582B (en) * | 2019-12-27 | 2022-11-01 | 深圳市越疆科技有限公司 | Three-dimensional target detection method, detection device, terminal device and computer readable storage medium |
CN111526287A (en) * | 2020-04-27 | 2020-08-11 | 北京达佳互联信息技术有限公司 | Image shooting method, image shooting device, electronic equipment, server, image shooting system and storage medium |
CN112258610B (en) * | 2020-10-10 | 2023-12-01 | 万物镜像(北京)计算机系统有限公司 | Image labeling method and device, storage medium and electronic equipment |
CN112509135B (en) * | 2020-12-22 | 2023-09-29 | 北京百度网讯科技有限公司 | Element labeling method, element labeling device, element labeling equipment, element labeling storage medium and element labeling computer program product |
CN112686947B (en) * | 2020-12-30 | 2024-04-16 | 大唐融合通信股份有限公司 | Method and device for labeling objects in virtual space and electronic equipment |
CN112785714B (en) * | 2021-01-29 | 2024-11-05 | 北京百度网讯科技有限公司 | Point cloud instance annotation method and device, electronic device and medium |
CN112907760B (en) * | 2021-02-09 | 2023-03-24 | 浙江商汤科技开发有限公司 | Three-dimensional object labeling method and device, tool, electronic equipment and storage medium |
CN113591580B (en) * | 2021-06-30 | 2022-10-14 | 北京百度网讯科技有限公司 | Image annotation method and device, electronic equipment and storage medium |
GB2620925B (en) * | 2022-07-22 | 2024-10-02 | Oxa Autonomy Ltd | A computer-implemented method of generating training data for training a machine learning model |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103093191A (en) * | 2012-12-28 | 2013-05-08 | 中电科信息产业有限公司 | Object recognition method with three-dimensional point cloud data and digital image data combined |
CN104143194A (en) * | 2014-08-20 | 2014-11-12 | 清华大学 | A point cloud segmentation method and device |
CN107463933A (en) * | 2017-07-24 | 2017-12-12 | 宗晖(上海)机器人有限公司 | A kind of image object detection method |
CN107818293A (en) * | 2016-09-14 | 2018-03-20 | 北京百度网讯科技有限公司 | Method and apparatus for handling cloud data |
CN107871129A (en) * | 2016-09-27 | 2018-04-03 | 北京百度网讯科技有限公司 | Method and apparatus for handling cloud data |
Family Cites Families (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
IT1395344B1 (en) * | 2009-05-28 | 2012-09-14 | Geosoft S R L | METHOD OF RETURN PHOTOGRAMMETRIC ASSISTED BY CLOUD OF POINTS AND ITS APPARATUS. |
CN104134234B (en) * | 2014-07-16 | 2017-07-25 | 中国科学技术大学 | A fully automatic 3D scene construction method based on a single image |
CN107093210B (en) * | 2017-04-20 | 2021-07-16 | 北京图森智途科技有限公司 | A kind of laser point cloud labeling method and device |
CN106971403B (en) * | 2017-04-27 | 2020-04-03 | 武汉数文科技有限公司 | Point cloud image processing method and device |
Also Published As
Publication number | Publication date |
---|---|
CN108734120A (en) | 2018-11-02 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108734120B (en) | Method, device and equipment for labeling image and computer readable storage medium | |
US20240378827A1 (en) | Positional tracking within an object-based coordinate system | |
JP6745328B2 (en) | Method and apparatus for recovering point cloud data | |
US10755112B2 (en) | Systems and methods for reducing data storage in machine learning | |
EP4009231A1 (en) | Video frame information labeling method, device and apparatus, and storage medium | |
WO2019161813A1 (en) | Dynamic scene three-dimensional reconstruction method, apparatus and system, server, and medium | |
US9942535B2 (en) | Method for 3D scene structure modeling and camera registration from single image | |
WO2015135323A1 (en) | Camera tracking method and device | |
CN110570352B (en) | Image labeling method, device and system and cell labeling method | |
CN110570435B (en) | Method and device for carrying out damage segmentation on vehicle damage image | |
WO2022089143A1 (en) | Method for generating analog image, and electronic device and storage medium | |
CN110796135B (en) | Target positioning method and device, computer equipment and computer storage medium | |
JP7156515B2 (en) | Point cloud annotation device, method and program | |
CN115330940B (en) | Three-dimensional reconstruction method, device, equipment and medium | |
WO2024088071A1 (en) | Three-dimensional scene reconstruction method and apparatus, device and storage medium | |
CN111415364A (en) | A method, system and storage medium for converting image segmentation samples in computer vision | |
CN113706562A (en) | Image segmentation method, device and system and cell segmentation method | |
CN113487741B (en) | Dense three-dimensional map updating method and device | |
CN110390724A (en) | A kind of SLAM method with example segmentation | |
CN113705304B (en) | Image processing method, device, storage medium and computer equipment | |
CN112132845B (en) | Method, device, electronic equipment and readable medium for singulating three-dimensional model | |
CN113781653A (en) | Object model generation method and device, electronic equipment and storage medium | |
CN115063759B (en) | Three-dimensional lane line detection method, device, vehicle and storage medium | |
US12051135B2 (en) | System and method for a precise semantic segmentation | |
US20210224652A1 (en) | Methods and systems for performing tasks on media using attribute specific joint learning |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||