
WO2019210978A1 - Image processing apparatus and method for an advanced driver assistance system - Google Patents


Info

Publication number
WO2019210978A1
Authority
WO
WIPO (PCT)
Prior art keywords
image
feature points
image portion
processing apparatus
scene
Application number
PCT/EP2018/061608
Other languages
French (fr)
Inventor
Onay URFALIOGLU
Claudiu CAMPEANU
Fahd BOUZARAA
Original Assignee
Huawei Technologies Co., Ltd.
Application filed by Huawei Technologies Co., Ltd.
Priority to CN201880092690.4A (CN112005243B)
Priority to PCT/EP2018/061608 (WO2019210978A1)
Publication of WO2019210978A1

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 - Scenes; Scene-specific elements
    • G06V20/50 - Context or environment of the image
    • G06V20/56 - Context or environment of the image exterior to a vehicle by using sensors mounted on the vehicle
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/40 - Extraction of image or video features
    • G06V10/46 - Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
    • G06V10/462 - Salient features, e.g. scale invariant feature transforms [SIFT]

Abstract

An image processing apparatus (100) for generating a map of a scene on the basis of a plurality of images of the scene is proposed. The image processing apparatus (100) comprises a processing circuitry (101) configured to iteratively generate the map by processing the plurality of images by: (a) partitioning a first image of the plurality of images into a plurality of image portions; (b) extracting from each image portion a plurality of feature points and classifying at least one feature point of the plurality of feature points as at least one successful feature point of the respective image portion, in case the at least one feature point is associated with a static background of the scene; (c) determining for each image portion of the first image a confidence value on the basis of the at least one successful feature point; and (d) repeating (a) to (c) for a further image of the plurality of images, wherein in (b) the number of the plurality of feature points to be extracted from a respective image portion of the further image depends on the confidence value of the respective image portion of the first image. An image processing method is also disclosed.

Description

IMAGE PROCESSING APPARATUS AND METHOD FOR AN ADVANCED DRIVER ASSISTANCE SYSTEM
TECHNICAL FIELD
The present invention relates to the field of image processing or computer vision. More specifically, the invention relates to an image processing apparatus and a method for an advanced driver assistance system.

BACKGROUND
Advanced driver assistance systems (ADASs) can alert the driver in dangerous situations and/or take an active part in the driving. One of the main challenges for an advanced driver assistance system (ADAS) is mapping of the environment of a vehicle. Generally, mapping involves an estimation of camera trajectories and the structure (e.g., 3D point cloud) of an environment, which is to be used for localization tasks. Mapping relies on visual input, usually in form of video input from one or more cameras, and requires detecting a sufficient number of feature points from a static background in a scene.
Simultaneous localization and mapping (SLAM) is the computational problem of constructing or updating a map of an unknown environment while simultaneously keeping track of the vehicle's location within it. Techniques for combining SLAM with semantic information, also referred to as semantic mapping and localization, are disclosed, for instance, in CN 105989586 A, US 9574883 B2 and US 9758305 B2.
In conventional mapping techniques, moving objects can distort the mapping results and cause the mapping to fail. In some cases, the traffic scene contains many moving objects (e.g., cars, pedestrians, and the like). In other cases, not enough feature points are found due to a lack of unique scene points, image blur, bad lighting conditions and the like. Conventional techniques mainly rely on points, lines or edges for detecting unique feature points in the scene. Thus, conventional mapping or localization techniques can fail when there are too many moving objects or not sufficiently many good feature points. Sometimes, even though mapping works, not sufficiently many good feature points are captured in the map (many of the extracted points are outliers or points without correspondence) to enable an accurate and robust localization. Moreover, extracting many feature points generally requires a large computational effort.
In light of the above, there is a need for improved image processing devices and methods which allow robust and efficient mapping and localization.
SUMMARY
Embodiments of the invention are defined by the features of the independent claims, and further advantageous implementations of the embodiments by the features of the dependent claims.
In order to describe embodiments of the invention in detail, the following terms, abbreviations and notation will be used:
Scene: The surrounding environment with respect to a reference. For instance, the scene of a camera is the part of the environment which is visible by the camera.
ADAS: Advanced Driver Assistance System.
2D image: A normal 2-dimensional image or picture (RGB or chrominance-luminance) acquired with one camera.
Texture: Area within an image which depicts content having significant variation in the (color) intensities.
3D Point Cloud: A collection of points in 3D space.
2D feature point: A location in image coordinates representing a unique point in the scene.
3D feature point: A unique point in a 3D scene.
Mapping: Creating a 3D structure/3D point cloud within a global coordinate system of some environment, including location support (e.g., coordinates).
Localization: Estimating the current location of an entity (e.g., camera) with respect to the global coordinate system of a provided map.
Semantic Segmentation: A method to segment an image into different regions according to a semantic context. For instance, pixels depicting a car are all in red color, pixels depicting the road are all in blue color, and the like.
Object Instance: Single object within a group of objects of the same class.
Instance Level Semantic Segmentation: A method to segment an image into different regions and object instances according to their semantic class. Single objects are identified and are separable from each other.
Label: An identifier (e.g., an integer) to determine the class type of an item/entity.
Dynamic Objects: Objects in the scene which typically move or change their location.
Static Background: All parts of the scene which remain static, e.g., buildings, trees, road, and the like.
Global Coordinate System: Coordinate System with respect to a common global reference.
Local Coordinate System: Coordinate System with respect to a selected reference within a global reference.
Mapping Loop: Typically, a specific vehicle route is selected for the environment to be mapped. This route can be traversed multiple times (multiple loops) in order to improve the final map accuracy and consistency.
Inlier: Corresponding pair of Image Feature Points (from two image frames), where each point is pointing to the same static background 3D point in the scene.
Outlier: Corresponding pair of Image Feature Points (from two image frames), which are pointing to two different 3D points in the scene.

Generally, embodiments of the invention are based on the idea of providing robust and efficient mapping and localization by increasing the number of extracted feature points that are successful (e.g., inlier feature points, or "inliers" for short) and correspond to the static background of a scene.
More specifically, according to a first aspect the invention relates to an image processing apparatus for generating a map of a scene on the basis of a plurality of images of the scene, each image comprising a plurality of pixels, wherein the image processing apparatus comprises a processing circuitry configured to iteratively generate the map by processing one-by-one the plurality of images by:
(a) partitioning a first image of the plurality of images into a plurality of image portions;
(b) extracting from each image portion a plurality of feature points and classifying at least one feature point of the plurality of feature points as at least one target feature point, i.e. an inlier of the respective image portion, in case the at least one feature point is associated with a static background of the scene;
(c) determining for each image portion of the first image a confidence value on the basis of the at least one target feature point; and
(d) repeating (a) to (c) for a further image of the plurality of images, wherein in (b) the number of the plurality of feature points to be extracted from a respective image portion of the further image depends on the confidence value of the respective image portion of the first image.
The image processing apparatus according to the first aspect of the invention allows increasing the chance that useful target feature points, i.e. feature points associated with a static background of a scene, are extracted and used in the mapping and localization process. Thus a robust and efficient apparatus for generating a map of a scene is provided.
In a further possible implementation form of the first aspect, the processing circuitry is configured to partition the first image and the further image of the plurality of images into a plurality of rectangular, in particular quadratic image portions.

In a further possible implementation form of the first aspect, the rectangular image portions have the same size.
In a further possible implementation form of the first aspect, the processing circuitry is configured to determine the confidence value for each image portion as the ratio of the number of target feature points to the total number of feature points of the respective image portion.
In a further possible implementation form of the first aspect, the processing circuitry is configured to determine the confidence value for each image portion as the product of the ratio of the number of target feature points to the total number of feature points of the respective image portion and the confidence value of the respective image portion of a previously processed image.
In a further possible implementation form of the first aspect, the map is a semantic map of the scene, including semantic information for at least some of the plurality of feature points.
In a further possible implementation form of the first aspect, the processing circuitry is further configured to assign to each of the plurality of feature points a semantic class C and to determine for each image portion a respective primary semantic class C having the most feature points.
In a further possible implementation form of the first aspect, the processing circuitry is configured to determine the confidence value for each image portion as the ratio of the number of target feature points to the total number of feature points of the respective image portion weighted by a first weighting factor, if the primary semantic class C of the respective image portion of the image is equal to the primary semantic class of the respective image portion of the previously processed image, or by a second weighting factor, if the primary semantic class C of the respective image portion of the image and the primary semantic class of the respective image portion of the previously processed image are different, wherein the first weighting factor is larger than the second weighting factor.

In a further possible implementation form of the first aspect, the processing circuitry is configured to iteratively generate the map on the basis of a simultaneous localization and mapping, SLAM, algorithm.
In a further possible implementation form of the first aspect, the number of the plurality of feature points to be extracted from a respective image portion of the further image is directly proportional to the confidence value of the respective image portion of the first image.
In a further possible implementation form of the first aspect, the image processing apparatus further comprises an image capturing device, in particular a camera, for capturing the plurality of images of the scene.
According to a second aspect the invention relates to an advanced driver assistance system for a vehicle, wherein the advanced driver assistance system comprises an image processing apparatus according to the first aspect of the invention or any one of its implementation forms.
According to a third aspect the invention relates to a corresponding image processing method for generating a map of a scene on the basis of a plurality of images of the scene, wherein the image processing method comprises the steps of:
(a) partitioning a first image of the plurality of images into a plurality of image portions;
(b) extracting from each image portion a plurality of feature points and classifying at least one feature point of the plurality of feature points as at least one target feature point, i.e. an inlier feature point of the respective image portion, in case the at least one feature point is associated with a static background of the scene;
(c) determining for each image portion of the first image a confidence value on the basis of the at least one target feature point; and
(d) repeating steps (a) to (c) for a further image of the plurality of images, wherein in step (b) the number of the plurality of feature points to be extracted from a respective image portion of the further image depends on the confidence value of the respective image portion of the first image. Thus a robust and efficient method for generating a map of a scene is provided.
The image processing method according to the third aspect of the invention can be performed by the image processing apparatus according to the first aspect of the invention. Further features of the image processing method according to the third aspect of the invention result directly from the functionality of the image processing apparatus according to the first aspect and its different implementation forms described above and below.

According to a fourth aspect the invention relates to a computer program product comprising program code for performing the method according to the third aspect of the invention when executed on a computer.
Details of one or more embodiments are set forth in the accompanying drawings and the description below. Other features, objects, and advantages will be apparent from the description, drawings, and claims.
BRIEF DESCRIPTION OF THE DRAWINGS
In the following, embodiments of the invention are described in more detail with reference to the attached figures and drawings, in which:
Fig. 1 is a block diagram showing an example of an image processing apparatus according to an embodiment of the invention;
Fig. 2 is a schematic diagram showing an example of an image with a plurality of image portions for processing by the image processing apparatus of Fig. 1;
Fig. 3 is a flow diagram showing an example of processing steps implemented in the image processing apparatus of Fig. 1; and
Fig. 4 is a flow diagram showing another example of processing steps implemented in the image processing apparatus of Fig. 1.

In the following, identical reference signs refer to identical or at least functionally equivalent features.

DETAILED DESCRIPTION OF THE EMBODIMENTS
In the following description, reference is made to the accompanying figures which show, by way of illustration, specific aspects of embodiments of the invention or specific aspects in which embodiments of the invention may be used. It is understood that embodiments of the invention may be used in other aspects and comprise structural or logical changes not depicted in the figures. The following detailed description, therefore, is not to be taken in a limiting sense, and the scope of the invention is defined by the appended claims. For instance, it is understood that a disclosure in connection with a described method may also hold true for a corresponding device or system configured to perform the method and vice versa. For example, if one or a plurality of specific method steps are described, a corresponding device may include one or a plurality of units, e.g., functional units, to perform the described one or plurality of method steps (e.g., one unit performing the one or plurality of steps, or a plurality of units each performing one or more of the plurality of steps), even if such one or more units are not explicitly described or illustrated in the figures. On the other hand, for example, if a specific apparatus is described based on one or a plurality of units, e.g., functional units, a corresponding method may include one step to perform the functionality of the one or plurality of units (e.g., one step performing the functionality of the one or plurality of units, or a plurality of steps each performing the functionality of one or more of the plurality of units), even if such one or plurality of steps are not explicitly described or illustrated in the figures. Further, it is understood that the features of the various exemplary embodiments and/or aspects described herein may be combined with each other, unless specifically noted otherwise.
Fig. 1 is a block diagram showing an example of an image processing apparatus 100 according to an embodiment of the invention. In an embodiment, the image processing apparatus 100 further comprises an image capturing device 103, in particular a camera, for capturing a plurality of images of a scene. In an embodiment, the image processing apparatus 100 is implemented as part of or interacting with an advanced driver assistance system (ADAS) of a vehicle.
As will be described in more detail below, the image processing apparatus 100 is configured to generate a map of a scene on the basis of a plurality of images of the scene. To this end, the image processing apparatus 100 comprises processing circuitry 101 configured to iteratively generate the map by processing one-by-one the plurality of images by:
(a) partitioning a first image of the plurality of images into a plurality of image portions;
(b) extracting from each image portion a plurality of feature points and classifying at least one feature point of the plurality of feature points as at least one target feature point, i.e. an inlier of the respective image portion, in case the at least one feature point is associated with a static background of the scene;
(c) determining for each image portion of the first image a confidence value P on the basis of the at least one target feature point; and
(d) repeating (a) to (c) for a further image of the plurality of images, wherein in (b) the number of the plurality of feature points to be extracted from a respective image portion of the further image depends on the confidence value of the respective image portion of the first image.
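For illustration only, and not as part of the published application, the processing of steps (a) to (d) could be organized as in the following Python sketch. The helper callables detect_features, is_static and update_map are hypothetical placeholders for a concrete feature detector, a static-background test (e.g., based on semantic segmentation) and the mapping back end; the grid size, the total feature budget and the uniform initialization of the confidence values are likewise assumptions.

```python
import numpy as np

def partition(image, rows, cols):
    """Step (a): split an image (H x W [x C] array) into a rows x cols grid of portions."""
    h, w = image.shape[:2]
    return {(m, n): image[m * h // rows:(m + 1) * h // rows,
                          n * w // cols:(n + 1) * w // cols]
            for m in range(rows) for n in range(cols)}

def generate_map(images, detect_features, is_static, update_map,
                 rows=3, cols=4, total_features=1000):
    """Iterate steps (a) to (d) over the plurality of images."""
    # Uniform start; the confidence values P(m,n) are kept normalized so they sum to 1.
    P = np.full((rows, cols), 1.0 / (rows * cols))
    scene_map = None
    for image in images:
        new_P = np.zeros_like(P)
        for (m, n), portion in partition(image, rows, cols).items():
            budget = max(1, int(round(P[m, n] * total_features)))       # P(m,n) * K points
            points = detect_features(portion, budget)                    # step (b): extract
            targets = [p for p in points if is_static(p)]                # target points (inliers)
            new_P[m, n] = P[m, n] * len(targets) / max(len(points), 1)   # step (c)
            scene_map = update_map(scene_map, targets)
        P = new_P / max(new_P.sum(), 1e-9)   # step (d): next image uses the updated confidences
    return scene_map
```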
Fig. 2 is a schematic diagram showing an example of an image 200 with a plurality of image portions identified by a pair of indices (m,n) for processing by the image processing apparatus 100 of Fig. 1. As can be taken from the exemplary image 200 shown in Fig. 2, in an embodiment the processing circuitry 101 is configured to partition the plurality of images, such as the image 200, into a plurality of rectangular, in particular quadratic image portions. In an embodiment, the rectangular image portions have the same size.
Fig. 3 is a flow diagram showing an example of a plurality of processing steps 300 implemented in the image processing apparatus 100 of Fig. 1. The plurality of processing steps 300 comprises the following steps.
301: Capture an image of the plurality of images for further processing.
303: Partition the image of the plurality of images into an NxM rectangular grid of image portions.
305: Let K be the total number of features to be detected across the entire image. Set up the processing circuitry 101 to detect or extract P(m,n) * K feature points in the image region (m,n). Initially, set all P(m,n) = 1. Let S(m,n) be the number of target feature points (inliers) in the image region (m,n) and T(m,n) the total number of feature points detected in the image region (m,n). Thus, in an embodiment, the number of the plurality of feature points to be extracted from a respective image portion of the image is directly proportional to the confidence value P(m,n) of the respective image portion of the previously processed image.
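Purely as an illustration of how the per-region budget P(m,n) * K might be realized (not part of the application; the use of OpenCV's ORB detector is an assumption, the method does not prescribe a particular detector):

```python
import cv2

def detect_in_portion(portion, p_mn, K):
    """Detect roughly P(m,n) * K feature points within one grid portion (m,n)."""
    budget = max(1, int(round(p_mn * K)))      # this portion's share of the K features
    orb = cv2.ORB_create(nfeatures=budget)     # any keypoint detector could be substituted
    keypoints = orb.detect(portion, None)      # 2D feature points within the portion
    return keypoints[:budget]
```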
307: Semantic Segmentation assigns each pixel in the image a class or label C depicting its semantic class. The assigned class indicates to which semantic class (e.g., car, road, building, and the like) the pixel belongs. In case a pixel cannot be classified, it can be assumed to be associated with a dynamic feature and, thus, can be defined as an outlier.
309: Each feature point has a location in pixel coordinates (eventually sub-pixel precision). Therefore, every feature point can be associated with its nearest pixel. If the nearest pixel's semantic class is a dynamic object (car, pedestrian, truck, bicycle, and the like), then this feature point is removed from the set of detected feature points, i.e. it is not a target feature point.
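A minimal sketch of this filtering step (for illustration only; the set of dynamic class names and the id-to-name mapping are assumptions, not part of the application):

```python
DYNAMIC_CLASSES = {"car", "pedestrian", "truck", "bicycle"}   # assumed dynamic-object labels

def keep_target_points(keypoints, segmentation, id_to_name):
    """Keep only feature points whose nearest pixel belongs to the static background."""
    h, w = segmentation.shape[:2]
    targets = []
    for kp in keypoints:
        x, y = kp.pt                            # sub-pixel feature location
        u = min(max(int(round(x)), 0), w - 1)   # nearest pixel column
        v = min(max(int(round(y)), 0), h - 1)   # nearest pixel row
        name = id_to_name.get(int(segmentation[v, u]))
        # Unclassified pixels are treated as dynamic (outliers), as described in step 307.
        if name is not None and name not in DYNAMIC_CLASSES:
            targets.append(kp)                  # static background -> target feature point
    return targets
```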
311: The confidence value P of each image portion (m,n) is updated as:

P(m,n) ← P(m,n) · S(m,n) / T(m,n)
Thus, in an embodiment the processing circuitry 101 of the image processing apparatus 100 is configured to determine the confidence value P(m,n) for each image portion as the ratio of the number of target feature points S(m,n) to the total number of feature points T(m,n) of the respective image portion. Moreover, in an embodiment, the processing circuitry 101 is configured to determine the confidence value P(m,n) for each image portion as the product of the ratio of the number of target feature points S(m,n) to the total number of feature points T(m,n) of the respective image portion and the confidence value of the respective image portion of a previously processed image.
A few exemplary confidence values P(m,n) are shown for the different image portions (m,n) of the image 200 shown in Fig. 2. As will be appreciated, the processing circuitry will extract most of the feature points in the image portions (1,2), (1,3) and (2,3), since these have the highest confidence value P. The sum of the confidence values P of all image regions of an image should yield 1, i.e.:

Σ_m Σ_n P(m,n) = 1
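For illustration only (not part of the application), the update and the normalization could be written as a small numpy routine, where P, S and T hold P(m,n), S(m,n) and T(m,n) for all image portions:

```python
import numpy as np

def update_confidence(P, S, T):
    """P(m,n) <- P(m,n) * S(m,n) / T(m,n), renormalized so that the values sum to 1."""
    P_new = P * S / np.maximum(T, 1)     # ratio of target (inlier) points to all points
    total = P_new.sum()
    # If no target points were found anywhere, fall back to a uniform distribution.
    return P_new / total if total > 0 else np.full_like(P, 1.0 / P.size)
```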
313: Update the map using the labelled feature points. In an embodiment, the semantic map is the map in the SLAM (Simultaneous localization and mapping) process, which is used to conduct vehicle localization. Updating the semantic map means that the map is updated according to an SLAM algorithm, but with the additional information coming from semantic segmentation (step 307). In this case, it additionally contains for each feature point its corresponding semantic class C. Thus, in an embodiment, the processing circuitry 101 of the image processing apparatus 100 is configured to iteratively generate the map on the basis of a simultaneous localization and mapping, SLAM, algorithm.
In the mapping process, the map contains the calculated 3D point locations of the image feature points and the camera position and orientation. As described above, the map can be updated at every new image, i.e. the mapping process is iterative. For example, new points may be added, or the current camera position and orientation may be added (e.g., like a node in a graph). From time to time, some larger update may be done, e.g., going back several nodes in time (this is called a bundle adjustment process), where the camera position and/or orientation and/or the 3D points are fine-tuned to further improve the estimation accuracy. This is an optimization process overall.
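As an illustration of the kind of data such a map may hold (a sketch only; the field names are assumptions and a real SLAM back end would use its own structures):

```python
from dataclasses import dataclass, field

@dataclass
class SemanticMapPoint:
    xyz: tuple           # triangulated 3D location of the feature point
    semantic_class: str  # semantic class C assigned to the feature point

@dataclass
class SemanticMap:
    points: list = field(default_factory=list)  # SemanticMapPoint entries
    poses: list = field(default_factory=list)   # camera position/orientation per image

    def add_frame(self, pose, new_points):
        self.poses.append(pose)          # like adding a node to a graph
        self.points.extend(new_points)   # newly triangulated 3D feature points
```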
315: Output the (updated) semantic map.
Fig. 4 is a flow diagram showing another example of the plurality of processing steps 300 implemented in the image processing apparatus 100 of Fig. 1. In comparison to the processing steps 300 shown in Fig. 3, the plurality of processing steps 300 additionally incorporates the semantic information about each image portion into the computation and update of the confidence value P by comprising an additional step 310 and a modified step 311.
More specifically, the plurality of processing steps 300 shown in Fig. 4 takes into account the primary semantic class C of the respective image portion of the image that is processed. The primary semantic class of a respective image portion is defined as that semantic class having the largest number of pixels. As in the case of the processing steps 300 shown in Fig. 3, initially the confidence values for all image regions of a currently processed image should be normalized, i.e. P(m,n) = 1. As in the case of the processing steps 300 shown in Fig. 3, S(m,n) denotes the number of target feature points (i.e. inliers) of the image region (m,n) and T(m,n) denotes the total number of feature points detected in the image region (m,n).
310: Determine the primary semantic class for each image region. C(m,n) denotes the primary semantic class in the image region (m,n). As already mentioned above, this means that the majority of the pixels belong to the class C(m,n).
311: The confidence value P of each image region (m,n) is updated by the processing circuitry 101 on the basis of the following equations:

P(m,n) ← D(m,n) · P(m,n) · S(m,n) / T(m,n)     (1)

wherein

D(m,n) = 1.00, if the primary semantic class C(m,n) of the current image is equal to the primary semantic class C(m,n) of the previously processed image, and

D(m,n) = 0.75, if the primary semantic class C(m,n) of the current image differs from the primary semantic class C(m,n) of the previously processed image.

As in the case of Fig. 3, the updated confidence values are normalized such that their sum over all image regions yields 1.
Here, the weight D is a measure of how often the primary semantic class of each image region is changing over time. More frequent changes decrease the image region's reliability of containing useful target feature points. This is reflected by the introduction of the weight D in Eqn. (1) above. Thus, the higher the change frequency of the primary semantic class, the smaller the average weight D over time.
Thus, in an embodiment, the processing circuitry 101 is configured to assign to each of the plurality of feature points a semantic class C and to determine for each image portion a respective primary semantic class C having the most feature points. Moreover, in an embodiment, the processing circuitry 101 is configured to determine the confidence value P(m,n) for each image portion as the ratio of the number of target feature points S(m,n) to the total number of feature points T(m,n) of the respective image portion weighted by a first weighting factor, e.g., D=1 , if the primary semantic class C of the respective image portion of the image is equal to the primary semantic class of the respective image portion of the previously processed image, or by a second weighting factor, e.g., D=0.75, if the primary semantic class C of the respective image portion of the image and the primary semantic class of the respective image portion of the previously processed image are different.
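A minimal sketch of this semantically weighted update (illustrative only, not part of the application; C_now and C_prev are assumed to be arrays holding the primary semantic class of each image portion in the current and the previously processed image):

```python
import numpy as np

def update_confidence_semantic(P, S, T, C_now, C_prev):
    """Weighted update per Eqn. (1): D = 1.00 if the primary class is unchanged, else 0.75."""
    D = np.where(C_now == C_prev, 1.00, 0.75)    # per-portion weight D(m,n)
    P_new = D * P * S / np.maximum(T, 1)
    return P_new / max(P_new.sum(), 1e-9)        # renormalize so the confidences sum to 1
```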
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus, and method may be implemented in other manners. For example, the described apparatus embodiment is merely exemplary. For example, the unit division is merely logical function division and may be other division in actual implementation. For example, a plurality of units or components may be combined or integrated into another system, or some features may be ignored or not performed. In addition, the displayed or discussed mutual couplings or direct couplings or communication connections may be implemented by using some interfaces. The indirect couplings or communication connections between the apparatuses or units may be implemented in electronic, mechanical, or other forms.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one position, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
In addition, functional units in the embodiments of the invention may be integrated into one processing unit, or each of the units may exist alone physically, or two or more units are integrated into one unit.
Embodiments of the invention may further comprise an apparatus, which comprises a processing circuitry configured to perform any of the methods and/or processes described herein.

Claims

1. An image processing apparatus (100) for generating a map of a scene on the basis of a plurality of images of the scene, wherein the image processing apparatus (100) comprises a processing circuitry (101) configured to generate the map by processing the plurality of images by:
(a) partitioning a first image of the plurality of images into a plurality of image portions;
(b) extracting from each image portion a plurality of feature points and classifying at least one feature point of the plurality of feature points as at least one target feature point of the respective image portion, in case the at least one feature point is associated with a static background of the scene;
(c) determining for each image portion of the first image a confidence value on the basis of the at least one target feature point; and
(d) repeating (a) to (c) for a further image of the plurality of images, wherein in (b) the number of the plurality of feature points to be extracted from a respective image portion of the further image depends on the confidence value of the respective image portion of the first image.
2. The image processing apparatus (100) of claim 1, wherein the processing circuitry (101) is configured to partition the first image and the further image of the plurality of images into a plurality of rectangular image portions.
3. The image processing apparatus (100) of claim 2, wherein the rectangular image portions have the same size.
4. The image processing apparatus (100) of any one of the preceding claims, wherein the processing circuitry (101) is configured to determine the confidence value for each image portion as the ratio of the number of target feature points to the number of feature points of the respective image portion.
5. The image processing apparatus (100) of any one of the preceding claims, wherein the processing circuitry (101) is configured to determine the confidence value for each image portion as the product of the ratio of the number of target feature points to the number of feature points of the respective image portion and the confidence value of the respective image portion of a previously processed image.
6. The image processing apparatus (100) of any one of the preceding claims, wherein the map is a semantic map of the scene, the map including semantic information for at least some of the feature points.
7. The image processing apparatus (100) of claim 6, wherein the processing circuitry (101) is further configured to assign to each of the plurality of feature points a semantic class C and to determine for each image portion a respective primary semantic class C having the most feature points.
8. The image processing apparatus (100) of claim 7, wherein the processing circuitry (101) is configured to determine the confidence value for each image portion as the ratio of the number of target feature points to the number of feature points of the respective image portion weighted by a first weighting factor, if the primary semantic class C of the respective image portion of the image is equal to the primary semantic class of the respective image portion of a previously processed image, or by a second weighting factor, if the primary semantic class C of the respective image portion of the image and the primary semantic class of the respective image portion of a previously processed image are different, wherein the first weighting factor is larger than the second weighting factor.
9. The image processing apparatus (100) of any one of claims 6 to 8, wherein the processing circuitry (101) is configured to generate the map on the basis of a simultaneous localization and mapping, SLAM, algorithm.
10. The image processing apparatus (100) of any one of the preceding claims, wherein the number of the plurality of feature points to be extracted from a respective image portion of the further image is proportional to the confidence value of the respective image portion of the first image.
11. The image processing apparatus (100) of any one of the preceding claims, wherein the image processing apparatus (100) further comprises an image capturing device (103), in particular a camera, for capturing the plurality of images of the scene.
12. Advanced driver assistance system for a vehicle, wherein the advanced driver assistance system comprises an image processing apparatus (100) according to any one of the preceding claims.
13. An image processing method for generating a map of a scene on the basis of a plurality of images of the scene, wherein the image processing method (200) comprises the steps of:
(a) partitioning a first image of the plurality of images into a plurality of image portions;
(b) extracting from each image portion a plurality of feature points and classifying at least one feature point of the plurality of feature points as at least one target feature point of the respective image portion, in case the at least one feature point is associated with a static background of the scene;
(c) determining for each image portion of the first image a confidence value on the basis of the at least one target feature point; and
(d) repeating steps (a) to (c) for a further image of the plurality of images, wherein in step (b) the number of the plurality of feature points to be extracted from a respective image portion of the further image depends on the confidence value of the respective image portion of the first image.
14. A computer program product comprising program code for performing the method of claim 13, when executed on a computer or a processor.
PCT/EP2018/061608 2018-05-04 2018-05-04 Image processing apparatus and method for an advanced driver assistance system WO2019210978A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201880092690.4A CN112005243B (en) 2018-05-04 2018-05-04 Image processing device and method for advanced driver assistance system
PCT/EP2018/061608 WO2019210978A1 (en) 2018-05-04 2018-05-04 Image processing apparatus and method for an advanced driver assistance system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/EP2018/061608 WO2019210978A1 (en) 2018-05-04 2018-05-04 Image processing apparatus and method for an advanced driver assistance system

Publications (1)

Publication Number Publication Date
WO2019210978A1 true WO2019210978A1 (en) 2019-11-07

Family

ID=62152537

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/EP2018/061608 WO2019210978A1 (en) 2018-05-04 2018-05-04 Image processing apparatus and method for an advanced driver assistance system

Country Status (2)

Country Link
CN (1) CN112005243B (en)
WO (1) WO2019210978A1 (en)

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8791996B2 (en) * 2010-03-31 2014-07-29 Aisin Aw Co., Ltd. Image processing system and position measurement system
CN107437258B (en) * 2016-05-27 2020-11-06 株式会社理光 Feature extraction method, motion state estimation method, and motion state estimation device
JP2018036901A (en) * 2016-08-31 2018-03-08 富士通株式会社 Image processing apparatus, image processing method, and image processing program
CN106778767B (en) * 2016-11-15 2020-08-11 电子科技大学 Visual image feature extraction and matching method based on ORB and active vision
CN107689048B (en) * 2017-09-04 2022-05-31 联想(北京)有限公司 A method for detecting image feature points and a server cluster

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130216098A1 (en) * 2010-09-17 2013-08-22 Tokyo Institute Of Technology Map generation apparatus, map generation method, moving method for moving body, and robot apparatus
CN105989586A (en) 2015-03-04 2016-10-05 北京雷动云合智能技术有限公司 SLAM method based on semantic bundle adjustment method
US9574883B2 (en) 2015-03-24 2017-02-21 X Development Llc Associating semantic location data with automated environment mapping
US9758305B2 (en) 2015-07-31 2017-09-12 Locus Robotics Corp. Robotic navigation utilizing semantic mapping
US20180012085A1 (en) * 2016-07-07 2018-01-11 Ants Technology (Hk) Limited. Computer Vision Based Driver Assistance Devices, Systems, Methods and Associated Computer Executable Code

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
"Simultaneous Localization Mapping and Tracking of Moving Objects", 1 August 2011, KASSEL UNIVERSITY PRESS GMBH, ISBN: 978-3-86219-062-1, article GEORGIOS LIDORIS: "Simultaneous Localization Mapping and Tracking of Moving Objects", pages: 8 - 30, XP055533139 *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112902966A (en) * 2021-01-28 2021-06-04 开放智能机器(上海)有限公司 Fusion positioning system and method

Also Published As

Publication number Publication date
CN112005243A (en) 2020-11-27
CN112005243B (en) 2024-12-10


Legal Events

121 - Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 18724182; Country of ref document: EP; Kind code of ref document: A1)

NENP - Non-entry into the national phase (Ref country code: DE)

122 - Ep: pct application non-entry in european phase (Ref document number: 18724182; Country of ref document: EP; Kind code of ref document: A1)