Applications of Non-Metric Vision to some Visually Guided Robotics Tasks

Luc Robert, Cyril Zeller, Olivier Faugeras and Martial Hébert

INRIA Rapport de recherche n° 2584, Programme 4 — Robotique, image et vision, Projet Robotvis, Juin 1995, 54 pages. HAL Id: inria-00074099, https://hal.inria.fr/inria-00074099 (submitted on 24 May 2006).

Abstract: We usually think of the physical space as being embedded in a three-dimensional Euclidean space where measurements of lengths and angles make sense. It turns out that for artificial systems, such as robots, this is not a mandatory viewpoint and that it is sometimes sufficient to think of the physical space as being embedded in an affine or even projective space. The question then arises of how to relate these geometric models to image measurements and to geometric properties of sets of cameras. We first consider that the world is modelled as a projective space and determine how projective invariant information can be recovered from the images and used in applications. Next we consider that the world is an affine space and determine how affine invariant information can be recovered from the images and used in applications. Finally, we do not move all the way to the Euclidean layer, the layer in which everybody else has been working from the early days on, but to an intermediate level between the affine and Euclidean ones. For each of the three layers we explain various calibration procedures, from fully automatic ones to ones that use some a priori information. The calibration increases in difficulty from the projective to the Euclidean layer, at the same time as the information that can be recovered from the images becomes more and more specific and detailed. The two main applications that we consider are the detection of obstacles and the navigation of a robot vehicle.

Key-words: projective, affine, Euclidean geometry, stereo, motion, self-calibration, robot navigation, obstacle avoidance

This work was partially funded by the EEC under Esprit Projects 6448 (VIVA) and 8878 (REALISE).
Email: {lucr,faugeras,zeller}@sophia.inria.fr, hebert@cs.cmu.edu

1 Introduction

Many visual tasks require recovering 3-D information from sequences of images. This chapter takes the natural point of view that, depending on the task at hand, some geometric information is relevant and some is not. Therefore, the questions of exactly what kind of information is necessary for a given task, how it can be computed from the data, and after which preprocessing steps, are central to our discussion. Since we are dealing with geometric information, a very natural question that arises from the previous ones is that of the invariance of this information under various transformations. An obvious example is viewpoint invariance, which is of course of direct concern to us. It turns out that viewpoint invariance can be separated into three components: invariance to changes of the internal parameters of the cameras, i.e., to some changes of coordinates in the images; invariance to some transformations of space; and invariance to perspective projection via the imaging process. Thus, the question of viewpoint invariance is mainly concerned with the invariance of geometric information to certain two- and three-dimensional transformations. It turns out that a neat way to classify geometric transformations is by considering the projective, affine, and Euclidean groups of transformations.
These three groups are subgroups of each other and each one can be thought of as determining an action on geometric configurations. For example, applying a rigid displacement to a camera does not change the distances between points in the scene but in general changes their distances in the images. These actions determine three natural layers, or strata, in the processing of visual information. This has the advantages of 1) clearly identifying the information that needs to be collected from the images in order to "calibrate" the vision system with respect to each of the three strata, and 2) identifying the 3-D information that can thereafter be recovered from those images. Point 1) can be considered as the definition of the preprocessing which is necessary in order to be able to recover 3-D geometric information which is invariant to transformations of the given subgroup. Point 2) is the study of how such information can effectively be recovered from the images. This viewpoint has been adopted in [4]. In this chapter we follow the same track and enrich it considerably on two counts. From the theoretical viewpoint, the analysis is broadened to include a detailed study of the relations between the images and a number of 3-D planes, which are then used in the development of the second viewpoint (absent in [4]), the viewpoint of the applications.

To summarize, we will first consider that the world is modeled as a projective space and determine how projective invariant information can be recovered from the images and used in applications. Next we will consider that the world is an affine space and determine how affine invariant information can be recovered from the images and used in applications. Finally, we will not move all the way to the Euclidean layer, the layer in which everybody else has been working from the early days on, but to an intermediate level between the affine and Euclidean ones. For each of the three layers we explain various calibration procedures, from fully automatic ones to ones that use some a priori information. Clearly, the calibration increases in difficulty from the projective to the Euclidean layer, at the same time as the information that can be recovered from the images becomes more and more specific and detailed. The two main applications that we consider are the detection of obstacles and the navigation of a robot vehicle.

Section (2) describes the model used for the camera and its relation to the three-dimensional scene. After deriving from this model a number of relations between two views, we analyze the links between the partial knowledge of the model's parameters and the invariant properties of the scene reconstructed from those two views. Section (3) describes techniques to compute some of the model's parameters without assuming full calibration of the cameras. Section (4) describes the technique of rectification with respect to a plane. This technique, which does not require full calibration of the cameras either, allows us to compute information on the structure of the scene and is at the basis of all the remaining sections. Section (5) shows how to locate 3-D points with respect to a plane. Section (6) shows how to compute local surface orientations. Lastly, section (7) presents several obstacle avoidance and navigation applications based on a partially calibrated stereo rig.
2 Stratification of the reconstruction process

In this section we investigate the relations between the three-dimensional structure of the scene and its images taken by one or several cameras. We define three types of three-dimensional reconstructions that can be obtained from such views. These reconstructions are obtained modulo the action of one of the three groups, Euclidean, affine, and projective, considered as acting on the scene. For example, to say that we have obtained a projective (resp. affine, Euclidean) reconstruction of the scene means that the real scene can be obtained from this reconstruction by applying to it an unknown projective (resp. affine, Euclidean) transformation. Therefore the only properties of the scene that can be recovered from this reconstruction are those which are invariant under the group of projective (resp. affine, Euclidean) transformations. A detailed analysis of this stratification can be found in [4]. We also relate the possibility of obtaining such reconstructions to the amount of information that needs to be known about the set of cameras in a quantitative manner, through a set of geometric parameters such as the fundamental matrix [5] of a pair of cameras, the collineation of the plane at infinity, and the intrinsic and extrinsic parameters of the cameras.

We first recall some properties of the classical pinhole camera model, which we will use in the remainder of the chapter. Then, we analyze the dissimilarity (disparity) between two pinhole images of a scene, and its relation to three-dimensional structure.

2.1 Notations

We assume that the reader has some familiarity with projective geometry at the level of [6, 23] and with some of its basic applications to computer vision such as the use of the fundamental matrix [5]. We will be using the following notations. Geometric entities such as points, lines, planes, etc. are represented by normal latin or greek letters; upper-case letters usually represent 3-D objects, lower-case letters 2-D (image based) objects. When these geometric entities are represented by vectors or matrices, they appear in boldface. For example, $m$ represents a pixel and $\mathbf{m}$ its coordinate vector; $M$ represents a 3-D point and $\mathbf{M}$ its coordinate vector. The line going through $m_1$ and $m_2$ is represented by $\langle m_1, m_2 \rangle$. For a three-dimensional vector $\mathbf{x}$, we note $[\mathbf{x}]_\times$ the $3 \times 3$ antisymmetric matrix such that $\mathbf{x} \times \mathbf{y} = [\mathbf{x}]_\times\,\mathbf{y}$ for all vectors $\mathbf{y}$, where $\times$ indicates the cross-product. $\mathbf{I}_3$ represents the $3 \times 3$ identity matrix, $\mathbf{0}_3$ the $3 \times 1$ null vector.

We will also be using projective, affine and Euclidean coordinate frames. They are denoted by the letter $\mathcal{F}$, usually indexed by a point, as in $\mathcal{F}_C$. This notation means that the frame $\mathcal{F}_C$ is either an affine or a Euclidean frame of origin $C$. To indicate that the coordinates of a vector $\mathbf{M}$ are expressed in the frame $\mathcal{F}$, we write $\mathbf{M}_{/\mathcal{F}}$. Given two coordinate frames $\mathcal{F}_1$ and $\mathcal{F}_2$, we note $\mathbf{K}_{\mathcal{F}_1}^{\mathcal{F}_2}$ the matrix of change of coordinates from frame $\mathcal{F}_1$ to frame $\mathcal{F}_2$, i.e. $\mathbf{M}_{/\mathcal{F}_2} = \mathbf{K}_{\mathcal{F}_1}^{\mathcal{F}_2}\,\mathbf{M}_{/\mathcal{F}_1}$. Note that $\mathbf{K}_{\mathcal{F}_2}^{\mathcal{F}_1} = (\mathbf{K}_{\mathcal{F}_1}^{\mathcal{F}_2})^{-1}$.

2.2 The camera

Figure 1: The pinhole model.

The camera model that we use is the classical pinhole model. Widely used in computer vision, it captures quite accurately the actual geometry of many real imaging devices. It is also very general, and
encompasses many camera models used in computer vision, such as perspective, weak-perspective, paraperspective, affine, parallel or orthographic projection. It can be described mathematically as follows: the object space is considered to be the three-dimensional Euclidean space embedded in the usual way in the three-dimensional projective space $\mathcal{P}^3$, and the image space to be the two-dimensional Euclidean space embedded in the usual way in the two-dimensional projective space $\mathcal{P}^2$ (see [6]); the camera is then described as a linear projective application from $\mathcal{P}^3$ to $\mathcal{P}^2$. We can write the projection matrix in any object frame $\mathcal{F}_o$ of $\mathcal{P}^3$:

$$\mathbf{P} = \mathbf{A}\,\mathbf{P}_0\,\mathbf{K}_{\mathcal{F}_o}^{\mathcal{F}_C}, \qquad \mathbf{P}_0 = [\mathbf{I}_3 \;\; \mathbf{0}_3] \qquad (1)$$

where $\mathbf{A}$ is the matrix of the intrinsic parameters and $C$ the optical center (see figure 1). The special frame in which the projection matrix of the camera is equal to the matrix $\mathbf{P}_0$ is called the normalized camera frame.

In particular, the projection equation, relating a point $M$ not in the focal plane, with coordinates $\mathbf{M}_{/\mathcal{F}_C} = [X_C, Y_C, Z_C, T_C]^T$ expressed in the normalized camera frame, to its projection $\mathbf{m}_{/\mathcal{F}_I} = [x, y, 1]^T$, expressed in the image frame and written $\mathbf{m}$ for simplicity, is

$$Z_C\,\mathbf{m} = \mathbf{A}\,\mathbf{P}_0\,\mathbf{M}_{/\mathcal{F}_C} \qquad (2)$$

2.3 Disparity between two views

We now consider two views of the scene, obtained from either two cameras or one camera in motion. If the two images have not been acquired simultaneously, we make the further assumption that no object of the scene has moved in the meantime.

The optical centers corresponding to the views are denoted by $C$ for the first and $C'$ for the second, the intrinsic parameter matrices by $\mathbf{A}$ and $\mathbf{A}'$ respectively, and the normalized camera frames by $\mathcal{F}_C$ and $\mathcal{F}_{C'}$. The matrix of change of frame from $\mathcal{F}_C$ to $\mathcal{F}_{C'}$ is a matrix of displacement defined by a rotation matrix $\mathbf{R}$ and a translation vector $\mathbf{t}$:

$$\mathbf{K}_{\mathcal{F}_C}^{\mathcal{F}_{C'}} = \begin{bmatrix} \mathbf{R} & \mathbf{t} \\ \mathbf{0}_3^T & 1 \end{bmatrix} \qquad (3)$$

More precisely, given a point $M$ of an object, we are interested in establishing the disparity equation of $M$ for the two views, that is the equation relating the projection $\mathbf{m}'$ of $M$ in the second view to its projection $\mathbf{m}$ in the first view.

2.3.1 The general case

Assuming that $M$ is not in either one of the two focal planes corresponding to the first and second views, we have, from equations (2) and (3):

$$Z'_{C'}\,\mathbf{m}' = \mathbf{A}'\,\mathbf{P}_0\,\mathbf{K}_{\mathcal{F}_C}^{\mathcal{F}_{C'}}\,\mathbf{M}_{/\mathcal{F}_C} = Z_C\,\mathbf{A}'\mathbf{R}\mathbf{A}^{-1}\,\mathbf{m} + T_C\,\mathbf{A}'\,\mathbf{t}$$

This is the general disparity equation relating $\mathbf{m}'$ to $\mathbf{m}$, which we rewrite as

$$Z'_{C'}\,\mathbf{m}' = Z_C\,\mathbf{H}_\infty\,\mathbf{m} + T_C\,\mathbf{e}' \qquad (4)$$

where we have introduced the following notations:

$$\mathbf{H}_\infty = \mathbf{A}'\,\mathbf{R}\,\mathbf{A}^{-1} \qquad \text{and} \qquad \mathbf{e}' = \mathbf{A}'\,\mathbf{t} \qquad (5)$$

$\mathbf{H}_\infty$ represents the collineation of the plane at infinity, as will become clear below in section (2.3.3). $\mathbf{e}'$ is a vector representing the epipole in the second view, that is, the image of $C$ in the second view. Indeed, this image is $\mathbf{A}'\,\mathbf{P}_0\,\mathbf{K}_{\mathcal{F}_C}^{\mathcal{F}_{C'}}\,\mathbf{C}_{/\mathcal{F}_C} = \mathbf{A}'\,\mathbf{t}$, since $\mathbf{C}_{/\mathcal{F}_C} = [0, 0, 0, 1]^T$. Similarly,

$$\mathbf{e} = \mathbf{A}\,\mathbf{P}_0\,\mathbf{C}'_{/\mathcal{F}_C} = -\,\mathbf{A}\,\mathbf{R}^T\,\mathbf{t} \qquad (6)$$

is a vector representing the epipole in the first view.

Equation (4) means that $\mathbf{m}'$ lies on the line going through $\mathbf{e}'$ and the point represented by $\mathbf{H}_\infty\,\mathbf{m}$, which is the epipolar line of $\mathbf{m}$. This line is represented by the vector

$$\mathbf{l}'_m = \mathbf{F}\,\mathbf{m} \qquad (7)$$

where

$$\mathbf{F} = [\mathbf{e}']_\times\,\mathbf{H}_\infty \qquad (8)$$

or equivalently, up to a nonzero scalar factor,(1)

$$\mathbf{F} = \mathbf{A}'^{-T}\,[\mathbf{t}]_\times\,\mathbf{R}\,\mathbf{A}^{-1} \qquad (9)$$

$\mathbf{F}$ is the fundamental matrix, which describes the correspondence between an image point in the first view and its epipolar line in the second (see [5]).

(1) using the algebraic equation $[\mathbf{M}\mathbf{x}]_\times = \det(\mathbf{M})\,\mathbf{M}^{-T}\,[\mathbf{x}]_\times\,\mathbf{M}^{-1}$ (valid if $\det(\mathbf{M}) \neq 0$), where $\det(\mathbf{M})\,\mathbf{M}^{-1}$ is the adjoint matrix of $\mathbf{M}$.
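These relations are easy to check numerically. The sketch below (a minimal illustration, with intrinsic and extrinsic values that are not taken from the paper) builds $\mathbf{H}_\infty$, $\mathbf{e}'$ and $\mathbf{F}$ from $\mathbf{A}$, $\mathbf{A}'$, $\mathbf{R}$ and $\mathbf{t}$, and verifies the disparity equation (4) and the epipolar constraint on one synthetic point.

```python
import numpy as np

def skew(x):
    """[x]_x: antisymmetric matrix such that np.cross(x, y) == skew(x) @ y."""
    return np.array([[0.0, -x[2], x[1]],
                     [x[2], 0.0, -x[0]],
                     [-x[1], x[0], 0.0]])

A = np.array([[800.0, 0.0, 320.0],
              [0.0, 800.0, 240.0],
              [0.0, 0.0, 1.0]])                   # illustrative intrinsic matrix
Ap = A.copy()                                     # intrinsic matrix of the second view
R = np.array([[0.9962, 0.0, 0.0872],
              [0.0, 1.0, 0.0],
              [-0.0872, 0.0, 0.9962]])            # approx. 5 degree rotation about y
t = np.array([0.2, 0.0, 0.02])                    # baseline of the rig

H_inf = Ap @ R @ np.linalg.inv(A)                 # collineation of the plane at infinity, eq. (5)
e_p = Ap @ t                                      # epipole in the second view, eq. (5)
F = skew(e_p) @ H_inf                             # fundamental matrix, eq. (8)

# Check the disparity equation (4) on one point M with M_{/F_C} = [X, Y, Z, T].
M = np.array([0.5, -0.3, 4.0, 1.0])
m = A @ M[:3] / M[2]                              # projection in the first view, eq. (2)
XYZp = R @ M[:3] + M[3] * t                       # coordinates in the second camera frame
mp = Ap @ XYZp / XYZp[2]                          # projection in the second view
lhs = XYZp[2] * mp                                # Z'_{C'} m'
rhs = M[2] * H_inf @ m + M[3] * e_p               # Z_C H_inf m + T_C e'
assert np.allclose(lhs, rhs)                      # eq. (4)
assert abs(mp @ F @ m) / np.abs(F).max() < 1e-6   # epipolar constraint m'^T F m = 0
```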
2.3.2 The case of coplanar points

Let us now consider the special case of points lying in a plane $\Pi$. The plane is represented in $\mathcal{F}_C$ by the vector $\mathbf{\Pi} = [\mathbf{n}^T, -d]^T$, where $\mathbf{n}$ is its unit normal in $\mathcal{F}_C$ and $d$ the distance of $C$ to the plane. Its equation is $\mathbf{\Pi}^T\,\mathbf{M}_{/\mathcal{F}_C} = 0$, which can be written, using equation (2),

$$Z_C\,\mathbf{n}^T\mathbf{A}^{-1}\,\mathbf{m} - d\,T_C = 0 \qquad (10)$$

If we first assume that $d \neq 0$, that is, the plane does not go through $C$, we obtain the new form of the disparity equation:

$$Z'_{C'}\,\mathbf{m}' = Z_C\,\mathbf{H}\,\mathbf{m} \qquad (11)$$

where

$$\mathbf{H} = \mathbf{H}_\infty + \frac{1}{d}\,\mathbf{e}'\,\mathbf{n}^T\mathbf{A}^{-1} \qquad (12)$$

This equation defines the projective linear mapping, represented by $\mathbf{H}$, the H-matrix of the plane, relating the images of the points of the plane in the first view to their images in the second. It is at the basis of the idea which consists of segmenting the scene into planar structures given by their respective H-matrices and, using this segmentation, computing motion and structure (see [7] or [29]).

If the plane does not go through $C'$ either, its H-matrix represents a collineation ($\det(\mathbf{H}) \neq 0$) and its inverse is given by

$$\mathbf{H}^{-1} = \mathbf{H}_\infty^{-1} + \frac{1}{d'}\,\mathbf{e}\,\mathbf{n}'^T\mathbf{A}'^{-1} \qquad (13)$$

where $\mathbf{n}'$ is the unit normal in $\mathcal{F}_{C'}$ and $d'$ the distance from the plane to $C'$.

If the plane goes through only one of the two points $C$ or $C'$, its H-matrix is still defined by the one of the two equations (12) or (13) which remains valid, but it is no longer a collineation; equation (10) shows that the plane then projects in one of the two views into a line represented by the vector

$$\mathbf{A}^{-T}\,\mathbf{n} \qquad \text{or} \qquad \mathbf{A}'^{-T}\,\mathbf{n}' \qquad (14)$$

If the plane is an epipolar plane, i.e. goes through both $C$ and $C'$, its H-matrix is undefined. Finally, equations (5) and (6) show that $\mathbf{e}$ and $\mathbf{e}'$ always verify equation (11), as expected, since $e$ and $e'$ are the images of the intersection of the line $\langle C, C' \rangle$ with the plane.

2.3.3 The case of points at infinity

For the points of the plane at infinity, represented by $[x, y, z, 0]^T$, thus of equation $T_C = 0$, the disparity equation becomes

$$Z'_{C'}\,\mathbf{m}' = Z_C\,\mathbf{H}_\infty\,\mathbf{m} \qquad (15)$$

Thus, $\mathbf{H}_\infty$ is indeed the H-matrix of the plane at infinity. Equation (15) is also the limit of equation (11) when $d \to \infty$, which is compatible with the fact that the points at infinity correspond to the remote points of the scene.

2.4 Reconstruction

Reconstruction is the process of computing three-dimensional structure from two-dimensional image measurements. The three-dimensional structure of the scene can be captured only up to a group of transformations in space, related to the degree of knowledge of the imaging parameters. For instance, with a calibrated stereo rig (i.e., one for which intrinsic and extrinsic parameters are known), it is well known that structure can be captured up to a rigid displacement in space. This has been used for a long time in photogrammetry. It has been shown more recently [14] that with non-calibrated affine cameras (i.e. cameras that perform orthographic projection), structure can be recovered only up to an affine transformation. Then, the case of uncalibrated projective cameras has been addressed [2, 11, 28] and it has been shown that in this case, three-dimensional structure can be recovered only up to a projective transformation.
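A similar numerical check can be made for the H-matrix of a plane: under equation (12), the first-view image of any point of the plane is mapped onto its second-view image. The values of $\mathbf{A}$, $\mathbf{A}'$, $\mathbf{R}$, $\mathbf{t}$, $\mathbf{n}$ and $d$ in this sketch are again only illustrative.

```python
import numpy as np

A = np.array([[800.0, 0.0, 320.0],
              [0.0, 800.0, 240.0],
              [0.0, 0.0, 1.0]])
Ap = A.copy()
R = np.eye(3)
t = np.array([0.3, 0.0, 0.0])                          # rig translated along x
n, d = np.array([0.0, 0.0, 1.0]), 5.0                  # plane Z = 5 in F_C (unit normal, distance)

H_inf = Ap @ R @ np.linalg.inv(A)
e_p = Ap @ t
H = H_inf + np.outer(e_p, n) @ np.linalg.inv(A) / d    # H-matrix of the plane, eq. (12)

# A point of the plane and its two images.
M = np.array([1.0, -0.5, 5.0])                         # satisfies n . M = d
m = A @ M / M[2]
mp = Ap @ (R @ M + t) / (R @ M + t)[2]
Hm = H @ m / (H @ m)[2]
assert np.allclose(mp, Hm)                             # eq. (11): images of coplanar points agree
```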
We will now use the formalism introduced above to describe these three cases in more detail.

2.4.1 Euclidean reconstruction

Here we suppose that we know the intrinsic parameters of the cameras $\mathbf{A}$ and $\mathbf{A}'$, and the extrinsic parameters of the rig, $\mathbf{R}$ and $\mathbf{t}$. This is the case when the cameras have been calibrated. For clarity we call it the strong calibration case. Through equation (5) we can compute $\mathbf{H}_\infty$ and $\mathbf{e}'$. Taking the cross-product of both sides of equation (4) with $\mathbf{m}'$ gives us

$$Z_C\,(\mathbf{m}' \times \mathbf{H}_\infty\,\mathbf{m}) = -\,T_C\,(\mathbf{m}' \times \mathbf{e}')$$

which determines the ratio $Z_C / T_C$, and equation (2) then gives

$$[X_C, Y_C, Z_C]^T = Z_C\,\mathbf{A}^{-1}\,\mathbf{m}$$

Thus, we have computed the coordinates of $M$ with respect to $\mathcal{F}_C$. The projection matrices for the first and second views, expressed in their respective image frames and in $\mathcal{F}_C$, are then written

$$\mathbf{A}\,[\mathbf{I}_3 \;\; \mathbf{0}_3] \qquad \text{and} \qquad \mathbf{A}'\,[\mathbf{R} \;\; \mathbf{t}]$$

These matrices and the coordinates of $M$ are thus known up to an unknown displacement $\mathbf{K}_{\mathcal{F}_C}^{\mathcal{F}}$ corresponding to an arbitrary change of the Euclidean reference frame.

2.4.2 Affine reconstruction

We now assume that both the fundamental matrix and the homography of the plane at infinity are known, but the intrinsic parameters of the cameras are unknown. We show below that by applying an affine transformation of space, i.e., a transformation of $\mathcal{P}^3$ which leaves the plane at infinity invariant, we can compensate for the unknown parameters of the camera system. The guiding idea is to choose the affine transformation in such a way that the projection matrix of the first camera is equal to $\mathbf{P}_0$, as in [21]. We can then use the same reconstruction equations as in the Euclidean case (strong calibration). Since structure is known up to this unknown affine transformation, we call this case the affine calibration case. Let us now describe the operations in detail.

Suppose then that we have estimated $\mathbf{H}_\infty$ (see section 3.4); thus we know it up to an unknown scale factor. Let us denote by $\hat{\mathbf{H}}_\infty$ one of the possible representations of $\mathbf{H}_\infty$:

$$\hat{\mathbf{H}}_\infty = \alpha\,\mathbf{H}_\infty$$

where $\alpha$ is an unknown nonzero scalar. Suppose also that we have estimated the fundamental matrix (see section 3.2), which is of rank 2, i.e. its null-space is of dimension 1. Equation (8) shows that $\mathbf{e}'$ is in the null-space of $\mathbf{F}^T$, hence $\mathbf{e}'$ is known up to a nonzero scalar and we write, in analogy with the previous case:

$$\hat{\mathbf{e}}' = \beta\,\mathbf{e}' \qquad (16)$$

Neither equation (2) nor equation (4) is usable since $\mathbf{A}$, $\mathbf{H}_\infty$ and $\mathbf{e}'$ are unknown. Both equations can be rewritten in another frame $\mathcal{F}_A$ defined by the matrix of change of frame

$$\mathbf{K}_{\mathcal{F}_A}^{\mathcal{F}_C} = \begin{bmatrix} \mathbf{A}^{-1} & \mathbf{0}_3 \\ \mathbf{0}_3^T & \mu \end{bmatrix} \qquad (17)$$

where $\mu$ is a nonzero scalar. Writing $\mathbf{M}_{/\mathcal{F}_A} = [X_A, Y_A, Z_A, T_A]^T$, we have $[X_A, Y_A, Z_A]^T = Z_C\,\mathbf{m}$ and $T_A = T_C/\mu$, hence equation (2) becomes

$$Z_A\,\mathbf{m} = \mathbf{P}_0\,\mathbf{M}_{/\mathcal{F}_A}$$

i.e. the projection matrix of the first camera in $\mathcal{F}_A$ is $\mathbf{P}_0$, as required. Choosing $\mu = \beta/\alpha$, equation (4), written in frame $\mathcal{F}_A$ with the estimated quantities, becomes

$$\alpha\,Z'_{C'}\,\mathbf{m}' = Z_A\,\hat{\mathbf{H}}_\infty\,\mathbf{m} + T_A\,\hat{\mathbf{e}}' \qquad (18)$$

As in the Euclidean case, taking the cross-product of both sides with $\mathbf{m}'$ yields

$$Z_A\,(\mathbf{m}' \times \hat{\mathbf{H}}_\infty\,\mathbf{m}) = -\,T_A\,(\mathbf{m}' \times \hat{\mathbf{e}}')$$

Thus, we have computed the coordinates of $M$ with respect to $\mathcal{F}_A$.

It is easy to verify that the projection matrices for the first and second views, expressed in their respective image frames and in $\mathcal{F}_A$, are then written

$$[\mathbf{I}_3 \;\; \mathbf{0}_3] = \mathbf{P}_0 \qquad \text{and} \qquad [\hat{\mathbf{H}}_\infty \;\; \hat{\mathbf{e}}']$$

They are thus known up to the unknown transformation $\mathbf{K}_{\mathcal{F}_A}^{\mathcal{F}}$, corresponding to an arbitrary change of the affine reference frame (it is affine because it does not change the plane at infinity).
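In both the strong and the affine calibration cases, the reconstruction step reduces to the cross-product relation just derived. A minimal sketch, assuming the correspondence $(\mathbf{m}, \mathbf{m}')$ and the quantities $\mathbf{H}_\infty$ (or its estimate) and $\mathbf{e}'$ are already available:

```python
import numpy as np

def depth_ratio(m, mp, H, e_p):
    """Return Z / T from  Z (m' x H m) = -T (m' x e')  in the least-squares sense."""
    a = np.cross(mp, H @ m)        # coefficient of the depth Z
    b = np.cross(mp, e_p)          # coefficient of T
    return -float(a @ b) / float(a @ a)

# With T = 1 and strong calibration, equation (2) then gives the point in F_C:
#   XYZ = depth_ratio(m, mp, H_inf, e_p) * np.linalg.inv(A) @ m
```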
2.4.3 Projective reconstruction

We now address the case when only the fundamental matrix is known. This is known as the weak calibration case. The representation $\hat{\mathbf{e}}'$ of the epipole $\mathbf{e}'$ is also known up to a nonzero scalar factor, as belonging to the null-space of $\mathbf{F}^T$. Neither equation (2) nor equation (4) is usable since $\mathbf{A}$, $\mathbf{H}_\infty$ and $\mathbf{e}'$ are unknown. As in the previous paragraph, we eliminate the unknown parameters by applying a projective transformation of space. Here, the plane at infinity is not (necessarily) left invariant by the transformation: it is mapped to an arbitrary plane. Let us now go into more detail.

Let us first assume that we know, up to a nonzero scalar factor $\gamma$, the H-matrix of a plane $\Pi$ not going through the optical center $C$ of the first camera, as defined in section (2.3.2):

$$\hat{\mathbf{H}} = \gamma\,\Bigl(\mathbf{H}_\infty + \frac{1}{d}\,\mathbf{e}'\,\mathbf{n}^T\mathbf{A}^{-1}\Bigr) \qquad (19)$$

where $\mathbf{n}$ is the unit normal of the plane expressed in $\mathcal{F}_C$ and $d$, with $d \neq 0$, the distance of $C$ to the plane. We define a frame $\mathcal{F}_P$ by the matrix of change of frame

$$\mathbf{K}_{\mathcal{F}_P}^{\mathcal{F}_C} = \begin{bmatrix} \mathbf{A}^{-1} & \mathbf{0}_3 \\ \frac{1}{d}\,\mathbf{n}^T\mathbf{A}^{-1} & \nu \end{bmatrix} \qquad (20)$$

where $\nu$ is a nonzero scalar, chosen below as $\nu = \beta/\gamma$. With this choice of frame, $[0, 0, 0, 1]^T$ is the vector representing the plane $\Pi$ in $\mathcal{F}_P$, i.e. $\Pi$ plays in $\mathcal{F}_P$ the role of the plane at infinity. Writing $\mathbf{M}_{/\mathcal{F}_P} = [X_P, Y_P, Z_P, T_P]^T$, we have $[X_P, Y_P, Z_P]^T = Z_C\,\mathbf{m}$ and, using equation (2),

$$T_P = \frac{1}{\nu}\Bigl(T_C - \frac{Z_C}{d}\,\mathbf{n}^T\mathbf{A}^{-1}\,\mathbf{m}\Bigr) \qquad (21)$$

Eliminating $T_C$ from equation (4) with the help of (21) gives

$$Z'_{C'}\,\mathbf{m}' = Z_C\,\Bigl(\mathbf{H}_\infty + \frac{1}{d}\,\mathbf{e}'\,\mathbf{n}^T\mathbf{A}^{-1}\Bigr)\,\mathbf{m} + \nu\,T_P\,\mathbf{e}' \qquad (22)$$

Equation (4) is thus written in $\mathcal{F}_P$, with the estimated quantities and $\nu = \beta/\gamma$, as

$$\gamma\,Z'_{C'}\,\mathbf{m}' = Z_P\,\hat{\mathbf{H}}\,\mathbf{m} + T_P\,\hat{\mathbf{e}}' \qquad (23)$$

As for equation (2), it is written in $\mathcal{F}_P$ as

$$Z_P\,\mathbf{m} = \mathbf{P}_0\,\mathbf{M}_{/\mathcal{F}_P} \qquad (24)$$

Equation (23) then gives us, by taking the cross-product of both sides with $\mathbf{m}'$,

$$Z_P\,(\mathbf{m}' \times \hat{\mathbf{H}}\,\mathbf{m}) = -\,T_P\,(\mathbf{m}' \times \hat{\mathbf{e}}')$$

and equation (24) gives the remaining coordinates. Thus, we have computed the coordinates of $M$ with respect to the frame $\mathcal{F}_P$. The projection matrices for the first and second views, expressed in their respective image frames and in $\mathcal{F}_P$, are then written

$$[\mathbf{I}_3 \;\; \mathbf{0}_3] \qquad \text{and} \qquad [\hat{\mathbf{H}} \;\; \hat{\mathbf{e}}']$$

Indeed, the projection matrix for the second view is

$$\mathbf{A}'\,\mathbf{P}_0\,\mathbf{K}_{\mathcal{F}_C}^{\mathcal{F}_{C'}}\,\mathbf{K}_{\mathcal{F}_P}^{\mathcal{F}_C} = [\mathbf{A}'\mathbf{R} \;\; \mathbf{A}'\mathbf{t}] \begin{bmatrix} \mathbf{A}^{-1} & \mathbf{0}_3 \\ \frac{1}{d}\,\mathbf{n}^T\mathbf{A}^{-1} & \nu \end{bmatrix} \simeq [\hat{\mathbf{H}} \;\; \hat{\mathbf{e}}']$$

and it is actually of rank 3 as the product of a $3 \times 4$ matrix of rank 3 and a $4 \times 4$ matrix of rank 4. Both projection matrices and the coordinates of $M$ are thus known up to the unknown collineation $\mathbf{K}_{\mathcal{F}_P}^{\mathcal{F}}$, corresponding to an arbitrary change of the projective reference frame. This result had already been found in a quite different manner in [2, 12].

The reconstruction described above is possible as soon as the H-matrix of a plane which does not go through $C$ is known. In particular, when $\mathbf{F}$ is known, one is always available, as suggested by equations (8) and (22). It is defined by

$$\hat{\mathbf{H}} = [\hat{\mathbf{e}}']_\times\,\mathbf{F} \qquad (25)$$

which gives, using equation (8),(2)

$$\hat{\mathbf{H}} \simeq \bigl(\hat{\mathbf{e}}'\,\hat{\mathbf{e}}'^T - \|\hat{\mathbf{e}}'\|^2\,\mathbf{I}_3\bigr)\,\mathbf{H}_\infty$$

which is indeed of the form (19). The equation, expressed in $\mathcal{F}_C$, of the corresponding plane is, using equations (25) and (4), $\hat{\mathbf{e}}'^T\bigl(Z_C\,\mathbf{H}_\infty\,\mathbf{m} + T_C\,\mathbf{e}'\bigr) = 0$, i.e. $\hat{\mathbf{e}}'^T\,\mathbf{m}' = 0$, which shows, using equation (2), that this plane is the plane going through $C'$ which projects, in the second view, onto the line represented by $\hat{\mathbf{e}}'$, as already noticed in [21].

(2) using the algebraic equation $[\mathbf{x}]_\times\,[\mathbf{x}]_\times = \mathbf{x}\,\mathbf{x}^T - \|\mathbf{x}\|^2\,\mathbf{I}_3$.
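A compact way to carry out this weak-calibration reconstruction, given only an estimate of $\mathbf{F}$ and a pair of corresponding points, is sketched below. It uses $\hat{\mathbf{H}} = [\hat{\mathbf{e}}']_\times\mathbf{F}$ of equation (25) together with a standard linear triangulation; this is one possible implementation of the reconstruction described above, not necessarily the authors' own.

```python
import numpy as np

def skew(x):
    return np.array([[0.0, -x[2], x[1]], [x[2], 0.0, -x[0]], [-x[1], x[0], 0.0]])

def projective_reconstruction(F, m, mp):
    """Return homogeneous coordinates of M in the projective frame F_P (up to scale)."""
    # e' spans the null space of F^T (cf. equation 16).
    _, _, Vt = np.linalg.svd(F.T)
    e_p = Vt[-1]
    H = skew(e_p) @ F                            # eq. (25): H-matrix of a plane not through C
    P1 = np.hstack([np.eye(3), np.zeros((3, 1))])
    P2 = np.hstack([H, e_p.reshape(3, 1)])
    # Linear triangulation: m x (P1 M) = 0 and m' x (P2 M) = 0.
    Amat = np.vstack([skew(m) @ P1, skew(mp) @ P2])
    _, _, Vt = np.linalg.svd(Amat)
    return Vt[-1]
```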
3 Computing the geometric parameters

Now that we have established which parameters are necessary to deduce information on the structure of the scene, we describe methods to compute these parameters from real images. If no a priori knowledge is assumed, the only source of information is the images themselves and the correspondences established between them. After showing how accurate and reliable point correspondences can be obtained in general from the images, we describe how they can be used for estimating the fundamental matrix on the one hand, and plane collineations on the other hand.

3.1 Finding correspondences

Matching is done using the image intensity function $I(x, y)$. A criterion, usually depending on the local values of $I(x, y)$ in both images, is chosen to decide whether a point $m_1$ of the first image and a point $m_2$ of the second are the images of the same point of the scene. It is generally based on a physical model of the scene. A classical measure of similarity between the two images within a given area is the cross-correlation coefficient

$$C(m_1, m_2) = \frac{(\mathbf{I}_1 - \bar{I}_1)\cdot(\mathbf{I}_2 - \bar{I}_2)}{\|\mathbf{I}_1 - \bar{I}_1\|\;\|\mathbf{I}_2 - \bar{I}_2\|}$$

where $\mathbf{I}_k$ is the vector of the image intensity values in a neighborhood around the point $m_k$ and $\bar{I}_k$ its mean over this neighborhood.

The context in which the views have been taken plays a significant role. Two main cases have to be considered: the case where the views are very similar and the opposite case. The first case usually corresponds to consecutive views of a sequence taken by one camera, the second to views taken by a stereo rig with a large baseline.

In the first case, the distance between the images of a point in two consecutive frames is small. This allows us to limit the search space when trying to find point correspondences. Below, we briefly describe a simple point tracker which, relying on this property, provides robust correspondences at a relatively low computational cost. In the second case, corresponding points may have quite different positions in the two images. Thus, point matching requires more sophisticated techniques. This is the price to pay if we want to manipulate pairs of images taken simultaneously from different viewpoints, which allow general reconstruction of the scene without worrying about the motion of the observed objects, as mentioned in section (2.3).

In both cases, the criterion that we use for estimating the similarity between image points is not computed for all of them, but only for points of interest. These points are usually intensity corners in the image, obtained as the maxima of some operator applied to $I(x, y)$. Indeed, they are the most likely to be invariant to view changes for these operators, since they usually correspond to object markings.

The corner detector. The operator that we use is the one presented in [10], which is a slightly modified version of the Plessey corner detector:

$$\det(\tilde{\mathbf{C}}) - k\,\bigl(\mathrm{trace}(\tilde{\mathbf{C}})\bigr)^2$$
It works as follows: First, corners are extracted in both images. Then, for a given corner of the first image, the following operation is performed: its neighborhood is searched for corners of the second image; the criterion ✬ is computed for each pair of the corner of the first image and one of the possible matches in the second image; The pair with the best score is retained as a correspondence if the score is above a fixed threshold. Then, for each corner of the second image for which a corresponding point in the first image has been found, the preceding operation is applied from the second image to the first. If the corresponding point found by this operation is the same as the previous one, it is then definitely taken as valid. The stereo points matcher. The method described in the previous section no longer works as soon as the views are quite different. More precisely, the correlation criterion is not selective enough: there are, for a given point of an image, several points of the other image that lead to a good correlation score, without the best of them being the real correspondent point searched. To achieve correspondence matching, the process must then keep all those potentially good but conflicting correspondences and invokes global techniques to decide between them: a classical relaxation technique is used to converge towards a globally coherent system of point correspondences, given some constraints of uniqueness and continuity (see [30]). 3.2 The fundamental matrix Once some image point correspondences, represented in the image frame by ✼ ✁ ✄ ✞✹✁ ✄ ✾ , have been found, the fundamental matrix is computed, up to a nonzero scalar factor, as the unique solution of the system of equations, derived from the disparity equations, ✚ ✁ ✄ ✚ ✰ ✁ ✄ ✙ (26) ✘ This system can be solved as soon as seven such correspondences are available: only eight coefficients of need to be computed, since is defined up to a nonzero scalar factor, while equation (26) supplies one scalar equation per correspondence and ☞✍✌✏✎ ✼ ✾ ✙ , the eighth. If there are more correspondences available, which are not exact, as it is the case in practice, the goal of the computation is to find the matrix which best approximates the solution of this system according to a given least squares criterion. ✚ ✚ ✚ ✘ INRIA 15 Applications of non-metric vision to some visually guided robotics tasks A study of the computation of the fundamental matrix from image point correspondences can be found in [20]. Here, we just mention our particular implementation, which consists, on the one hand, of a direct computation considering that all the correspondences are valid and in the other hand, of a method for rejecting some possible outliers among the correspondences. The direct computation computes which minimizes the following criterion: ✚ ✁ ✚ ✄ ✌ ✁ ✍✳ ✄ ✩ ✏ ✌✚ ✁ ✏ ✌✚ ✰ ✍✳ ✄ ✁ ✁ ✄ ✍✳ ✩ ✏ ✌✚ ✰ ✁ ✄ ✍ ✳✄✂ ✼✁ ✄ ✰✚ ✁ ✄ ✾ ✳ ✁ which is the sum of the squares of the distance of to the epipolar line of and the distance of to the epipolar line of . Minimization is performed with the classical Levenberg-Marquardt method (see [26]). In order to take in account both its definition up to a scale factor and the fact that it is of rank ☎ , a parametrization of with seven parameters is used, which parametrizes all the ✒ ✓ ✒ -matrices of rank strictly less than 3. 
These parameters are computed from the following way: a line ✆ (respectively, a column ) of is chosen and written as a linear combination of the other two lines (respectively, columns); the four entries of of these two combinations are taken as parameters; among the four coefficients not belonging to ✆ and , the three smallest, in absolute value, are divided by the biggest and taken as the last three parameters. ✆ and are chosen in order to maximize the rank of the derivative of F with respect to the parameters. Denoting the parameters by ✝ ✱ , ✝ ✳ , ✝ ✣ , ✝ , ✝✟✞ , ✝✡✠ and ✝✟☛ and assuming, for instance, ✆ and equals to ✩ and the bottom right coefficient being the normalized coefficient, leads to the following matrix: ✂ ✄ ✄ ✄ ✄ ✚ ✚ ✚ ✚ ✂ ✎✍ ✝✟✠ ✼☞✝ ✝ ✱ ✏✌✝ ✞ ✝ ✝✡✠✎✝ ✝ ✠ ✂ ✣ ✾ ✏✌✝✡☛ ✼✍✝ ✝ ✳ ✌ ✏ ✝✞✾ ✱ ✏✏✝✡☛✑✝ ✳ ✝ ✣ ✏✌✝ ☛ ✂ ✝ ✝ ✱ ✏✌✝ ✞ ✝ ✣ ✝ ✱ ✝ ✣ ✂ ✚ ✢✜ ✝ ✝ ✳ ✏✌✝ ✞ ✝ ✳ ✩ ✂ During the minimization process, the parametrization of can change: the parametrization chosen for the matrix at the beginning of the process is not necessarily the most suitable for the final matrix. The outliers rejection method used is a classical least median of squares method. It is described in detail in [30]. 3.3 The ✒ -matrix of a plane If we have at our disposal correspondences, represented in the image frames by ✼ ✁ ✞✹✁ ✾ , of points belonging to a plane, the -matrix of this plane is computed, up to a nonzero scalar factor, as the unique solution of the system of equations (11), ☛ ✒ ✄ ✄ ✳ ☎✫ ✄ ✁ ✎✹✙ ✳✘✫ ✒ ✄ ✁ ✄ This system can be solved as soon as four such correspondences are available: only eight coefficients of need to be computed, since is defined up to a nonzero scalar factor, while equation (11) supplies two scalar equation for each correspondence. If there are more correspondences available, which are not exact, as it is the case in practice, the goal of the computation is to find the matrix ✒ RR n˚2584 ✒ 16 Luc Robert, Cyril Zeller, Olivier Faugeras and Martial Hébert which approximates at best the solution of this system according to a given criterion: a study of the computation of plane -matrices from image point correspondences can be found in [6]. Since ✙ and ✙ verify equation (11), three point correspondences are in general sufficient for defining . In fact, this is true as long as the homography is defined, i.e., when three points are not aligned in either image (a proof can be found in [3]). If the plane is defined by one point and a line , given by its projections ✼ ✽✞ ✭ ✾ , so that ✙ does not belong to and ✙★ does not belong to ✭ , its -matrix is computable the same way, as soon as we know the fundamental matrix. Indeed, the projections of ✂ two other points and ✁ of the plane are given by choosing two points and ☎ on , which amounts ✂ to choosing and ✁ on : the corresponding points ✎ and ☎ are then given by intersecting with the epipolar line of and the epipolar line of ☎ , given by the fundamental matrix. As an application of this idea, we have a purely image-based way of solving the following pro✂ , and a line correspondence blem: given a point correspondence ✼✝✟✞ ✾ defining a 3-D point ✂ , find the -matrix of the plane going through and . In particular, ✼ ✽✞ ✾ defining a 3-D line if is at infinity, it defines a direction of plane (all planes going through are parallel) and we can ✂ find the -matrix of the plane going through and parallel to that direction. This will be used in section 6.2. 
✂ Given the -matrix ✒ of a plane ✂ and the correspondences ✼✝✟✞✽ ✾ and ✼✝☎ ✞ ☎ ✾ of two points and ✁ , it is possible to directly compute in the images the correspondences ✼☎✄✰✞✆✄ ✾ of the intersection ✂ of the line ✆ ✞✝✁ ✡ with ✂ . Indeed, ✄✳ belongs both to ✆ ✕ ✞✜☎ ✡ and the image of ✆✝✟✞ ☎ ✡ by ✒ , so: ☛ ☛ ✆ ✆ ✆ ✆ ☛ ✆ ✆ ✆ ✆ ☛ ☛ ☛ ✞ ✙✻✼ ✁ ✓ ✂ ✾ ✓ ✼ ✒ ✁ ✓ ✒ ✂ ✾ (see [27]) Similarly, given two planes ✂✖✱ and ✂ ✳ represented by their -matrices ✒ ✱ and ✒ ✳ , it is possible to directly compute in the images the correspondences of the intersection of ✂ ✱ with ✂ ✳ . Indeed, the correspondences of two points of are computed, for example, as intersections of two lines ✱ and ✳ of ✂ ✱ with ✂ ✳ ; the correspondences of such lines are obtained by choosing two lines in the first image representing by the vectors ✞✽✱ and ✞ ✳ , their corresponding lines in the second image being ✱ ✱ ✞ ✱ and ✒ ✱ ✿ ✞ ✳ . given by ✒ ✱ ✿ ☛ ✰ ✰ 3.4 The homography of the plane at infinity To compute the homography of the plane at infinity ✒ ✔ , we can no longer use the disparity equation (4) with correspondences of points not at infinity, even if we know the fundamental matrix, since ✞✰✁ ✾ ✫☎✄ , ✘✫ and ✫ are not known. We must, thus, know correspondences of points at infinity ✼ ✁ and compute ✒ ✔ like any other plane -matrices, as described in section 3.3. The only way to obtain correspondences of points at infinity is to assume some additional knowledge. First, we can assume that we have some additional knowledge of the observed scene that allows to identify, in the images, some projections of points at infinity, like, for instance, the vanishing points of parallel lines of the scene, or the images of some points on the horizon, which provide sufficiently good approximations to points at infinity. Another way to proceed is to assume that we have an additional pair of views. More precisely, if ✂ this second pair differs from the first only by a translation of the rig, any pair ✼ ✞✟✁ ✾ of stationary ✄ ✳ ✳ ✵ ✄ ☛ INRIA 17 Applications of non-metric vision to some visually guided robotics tasks ✆ ✩ ☎ ✖ ☎ ✁ ✁ ✝ ✂ ✂ ☎ ✩ ✞ ✄ ☎ ✄ ✩ Figure 2: Determining the projections of points at infinity (see section 3.4). object points (see figure 2), seen in the first views as ✼ ✁ ✱ ✞✹✁ ✱✛✾ and ✼ ✱ ✞ ✱✤✾ , and in the second as ✼ ✁ ✳ ✞✹✁ ✳ ✞ ✳ ✾ and ✼ ✳ ✾ , gives us the images ✼✟✞ ✱✤✞ ✞ ✱ ✾ and ✼✟✞✝✳ ✞ ✞ ✳ ✾ in the four images of the intersection ✂ of the line ✆ ✞✝✁ ✡ with the plane at infinity. Indeed, on one hand, since is at infinity and the ✂ and ✁ implies the stationarity of , we have, from equations (15) and (4), stationarity of ✂ ✂ ✂ ✂ ✳ ✒ ✔ ✞ ✫ ✶ ✹✳ ✒✕✔ ✒ ✔ ✱ ✳☎✞ ✱ ✫ ✷ ✳✴✙ and ✳ ✄✞ ✫ ✶ ✳ ✳ ✄✒ ✔ ✙ ✞ ✱ ✳ ✫ ✷ ✱ ✱ ✳ (respectively, where ✱✽✳ ) is the homography of the plane at infinity between the first (respectively, second) view of the first pair and its corresponding view in the second pair. In the case ✳ , ✱✽✳ ✙ where the two pairs of views differ only by a translation, ✱ ✙ ✢ ✣ , ✢ ✣ ✱ ✙ ✳ , ✱✽✳ ✙ and we have, by equation (5), ✮ ✆ ✮ ✮ ✆ ✮ and ✒ ✔ ✒✕✔ ✄ and ✄✪ ✵✄ ✄✵ . On the other hand, as lies on ✁ , ✄ lies on , ✄✵ , on ✂ ✌ and ✄✵ , on ✂ ✌ . 
Consequently, ✄ and ✄✪ are obtained with and of ✠ ✌ with ✠ ✌ , respectively: ✱ ✳✴✙ which implies that ✄ ✱ ✙ ✄ ✳ ✙ ✆ ✱✤✞✜☎☛✱✤✡ , ✄ ✳ , on ✆ ✳ ✞✜☎ ✳ ✡ ✱ as the intersections of ✆✝✟✱✤✞✜☎☛✱✤✡ ✞ ✙✻✼ ✁ ✱ ✓ ✂ ✱ ✾ ✙ ✱ ✆✝ ✳ ✳ ✱ ✳ ✂ ✢✛✣ ✆ ✳ ✳ ✞ ☎ ✳ ✡ ✓ ✙ ✙ ✱ ✞✜☎ ✱ ✡ ✆✝ ✓✟✼ ✁ ✢❀✣ ✳ ✾ and ✞ ✆✝ ✳ ✞✠☎ ✳ ✡ ✆✝ ✱ ✞✜☎ ✱ ✡ ✙✻✼ ✁ ✱ ✓ ✆✝ ✂ ✱ ✾ ✂ ✞ ✡ ✱ ✳ ✞✜☎ ✳ ✡ ✓✟✼ ✁ ✳ ✓ ✂ ✳ ✾ ✒ ✔ Once has been obtained, the ratio of the lengths of any two aligned segments of the scene ✂ ✂ ✂ can be computed directly in the images. Indeed, given three points ✱ , ✳ and ✣ on a line, as in figure 3, from their images ✼ ✱ ✞✽ ✱✤✾ , ✼✝ ✳ ✞ ✳ ✾ and ✼ ✣ ✞ ✣ ✾ , we can compute the images ✼ ✟✞ ✾ of the intersection of this line with the plane at infinity, using , as explained in section 3.3. We can compute in each image the cross-ratio of those four points. As a projective invariant, this cross-ratio ✂ RR n˚2584 ✂ ✂ ✒ ✔ ✕ 18 Luc Robert, Cyril Zeller, Olivier Faugeras and Martial Hébert ✞✝ ☎✆ ✂✁ ☎✄ ☞✡✝ ☛ ✁ ☛ ✄ ☛ ✆ ☛ ✠☛ ✆ ☛ ✠ ☛ ✠✄ ☛ ✠✁ ✟✡✠ ✟ Figure 3: Determining ratios of lengths in affine calibration (see text). is then exactly equal to the ratio of ✂ , ✂ ✂ ✱ ✂ ✂ ✳ ✱ ✂ ✣ ✱ ✣ ✙ ✂ ✂ ✂ ✂ and ✌ ✂ ✣ ✂ . More precisely: ✂ ✳ ✣ ✱ ✣ ✙ ✂ ✂ ✱ ✳ ✣ ✱ ✔ ✳ ✌ ✳ ✣ ✳ ✔ INRIA 19 Applications of non-metric vision to some visually guided robotics tasks 4 The rectification with respect to a plane In this section, we assume that we know the epipolar geometry. This allows us to rectify the images with respect to a plane of the scene. This process explained below, allows not only to compute a map of image point correspondences, but also to assign to each of them a scalar that represents a measure of the disparity between the two projections of the correspondence. 4.1 The process of rectification ✚ ✖ Like in section 2.4.3, we assume that we know, up to nonzero scale factors, , thus , given by of a plane given by equation (19). Let us then choose two equation (16), and the -matrix and , such that homographies, represented by the matrices ✒ ☛ ✒ ✒ ✖ ✒ ✒ ✂ ✑ ✏ ✑ ✏ where ✑ ✙ ✘ ✌ ✩ ✞ ✘ ✞ ✒ ✒ ✑ (27) ✍ ✰ ✑ ✙ (28) is any nonzero scalar. Equation (23) can then be rewritten ☎✄ ✑ ✳ where ✁ ✑ ✙✚✌ ✝ ✑ ✞ ✞ ,✁ ✑ ✞ ✩ ✍✷✰ ✑ ✙ ✌ ✝ ✑ ✝✞ ✞ ✫ ✑ ✑ ✑ ✁ ✙ ✏ ✏ ✵ ✌ ✩ ✞ ✘ ✞ ✘ (29) ✍ ✰ and ✑ ✞✤✩ ✍✷✰ ✒ ✑ ✑ ✁ ✁ ✳ ✙ ✑ ✑ and ✁ ✑ ✁ ✙ ✒ ✑ (30) ✁ The rectification with respect to a plane consists of applying such matrices, called the rectification matrices, to the second image and to the first. in the second rectified image of a point Equation (29) shows that the corresponding point of the first rectified image lies on the line parallel to the ✝ -axis and going through . Applying a correlation criterion to and each point of this line thus allows to determine , if the image is not too distorted through the process of rectification. Equations (27) and (28) do not completely determine and : This indetermination is used to minimize the distortion of the images, as explained in section 4.3. has been determined, a measure of the disparity between and with respect to this Once ✂ ✝ plane is given by ✝ . If belongs to , it is equal to zero since ✵ then vanishes as shown by equations (10) and (21); otherwise, its interpretation depends on the information available for the model, as explained in section 5. 
✒ ✒ ✑ ✑ ✎ ✑ ✑ ✑ ✑ ✒ ✑ ✒ ✑ ✑ ✂ ✎ ✑ ✑ ✑ ☎ ✑ ✑ ✂ 4.2 Geometric interpretation of the rectification From the ✁✄✂ -decomposition [9] any nonsingular matrix ✒ RR n˚2584 ✙ ✆✆☎ ✒ decomposed as: (31) 20 Luc Robert, Cyril Zeller, Olivier Faugeras and Martial Hébert ✆ ✱ where is a rotation matrix and , a nonsingular upper triangular matrix. Decomposing ✒ ✿ like in equation (31), inverting it, and noticing that the inverse of an upper triangular matrix is also an upper triangular matrix, we see that ✒ can also be decomposed as: ☎ ✆ ✒ ✙ ✆ ☎ (32) where is a rotation matrix and , a nonsingular upper triangular matrix. To give a geometric interpretation of the rectification, we decompose ✒ and ✒ the following way: By applying equation (32) to the non-singular matrices ✒ ✮ and ✒ ✮✬ , there exist two scalars ✆ ✆ and , two rotation matrices and and two upper triangular matrices ✮ and ✮✬ of the same form as ✮ in equation (1), such that ☎ ✑ ✑ ✑ ✑ ✁ ✑ ✑ ✑ ✑ ✑ ✑ ✁ ✒ ✙ ✑ ✑ ✆ ✑ ✁ ✮ ✿ ✱ ✑ ✮ ✒ and ✙ ✑ ✑ ✆ ✑ ✁ ✮ ✿ ✱ ✑ ✮ (33) Then, we study how the constraints on ✒ and ✒ given by equations (27) and (28) propagate to , ✆ ✆ , and . On one hand, from equations (27), (33), (5) and (16), we have ✑ ✑ ✑ ✁ ✑ ✑ ✑ ✁ ✆ ✏ ✝ ✙ ✑ ✘ ✘ ✿ ✱ ✌✩ ✞ ✞ ✍✰ ✑ ✮ ✑ ✁ ✂ ✑ and we define such that ✆ ✝ ✙ ✌ ✞ ✞ ✍✰ ✑ ✘ ✑ ✘ (34) On the other hand, from equations (28), (33), (19), (5) and (34), we have then ✢ ✣ ✙ ✒✁✄✂ ✒ ✒ ✿ ✱ ✆ ✱ ✼✮ ✆ ✮ ✿ ✱ ✏ ✮ ✝ ✷✮✬ ✿ ✑ ✢✣ ✙ ✑ ✁ ✑ ✁ ✮ ✑ ✆ ✆ ✆ ✑ ✙ ✑ ✰ ✟ ✟ ☎ ✟ ✁✄✂ ☎ ✱ ✄✬✿ ✑ ✟ ☎ ☎ ✆ ✆☎ ✑ ✑ ✰ ✮ ✿ ✱ ✆ ✞ is an upper-triangular matrix. Since it is a rotation matrix, this ✰ and ✮ ✑ ✿ ✱ ✏ ✑ ✮ ✙ ✢✣ ✑ ✰ ✙ ✑ ✁ ✝☎ ✮ ✑ ✮ We then also deduce that ✳ ✆ ✞ ✑ ✑ ✑ ☎ ✌ ✞ ✘ ✞ ✘ ✍✷✰ ✑ ✮ ✆ ✆ ✆ ✢✣ ✙ ✆ ✿ ✱ ✾ ✱☎ ✮ ✑ ✆ ✆ ✆ from which we deduce that means that ✑ ✁ (35) ✑ (36) ✁ ✌ ✞ ✞ ✍✰ ✑ ✑ ✮ ✘ ✘ ✂ ✆ ✰ ✑ ✰ ✑ ✮ ✿ ✱ (37) ✆ We are now able to interpret the equations (30). From equation (27) and (20), we have ✡ ✫ ✙ . Using equation (36), we can then define ✳ by ✟✞ ✟ ✠✞ ✙ ✳ ✑ ✷✳ ✑ ✁ ✫☎✄ ✙ ✘✫ ✑ ✳ ✑ ✁ ✑ ✳ ✫☎✄ ✙ (38) INRIA 21 Applications of non-metric vision to some visually guided robotics tasks ✝ ✆ ✂✁ ✂ ✂ ☎ ✂ ✞✄ ✝ ✆ ✄ ☎ ✁ ✂ ☎✄ ✆✄ ✆ ✄✁ ☎✄ ✁ ✆ ✬ ✬ ✁ ✬ Figure 4: The rectification with respect to a plane . ✂ RR n˚2584 ✆✄ ✬ 22 Luc Robert, Cyril Zeller, Olivier Faugeras and Martial Hébert so that the equations (30) are written ✳✞✁ ✙ ✳✫✮ ✆ ✮ ✑ ✑ ✿ ✱✁ ✑ ✑ ✳ ✞ ✁ ✙ ✳ ✫☎✄ ✮ ✆ ✮ ✑ and ✬ ✑ ✿ ✱✁ ✑ ✑ ✪✫ ✁ ✁ They are interpreted as the disparity equations of two pairs of views (see figure 4): The first pair is composed of the view of optical center , camera frame , retinal plane and intrinsic parameters , retinal plane and intrinsic matrix and its rectified view of optical center , camera frame parameters matrix ; similarly, the second pair is composed of the view of optical center , camera frame , retinal plane and intrinsic parameters matrix and its rectified view of optical center , camera frame , retinal plane and intrinsic parameters matrix . The basis of is the by the rotation of matrix . Similarly, the basis of is the image of the image of the basis of by the rotation of matrix . Furthermore, according to equations (35) and (34), we basis of have ✮ ✪ ✫☎✄ ✬ ✪✫✄ ✮ ✁ ✁ ✑ ✪✫✄ ✪✫ ✑ ✬ ✆ ✑ ✆ ✪✫ ✑ ✮ ✑ ✰ ✧✣ ✧ ✜ ✰✣ ✩✂✟ ✧ ✰✣ ✩ ✟ ✞ ✆ ✑ ✘✩✘✛✘ ✪✫ ✑ ✑ ✯✯✬☎ ✫✫ ✑ ✮✬ ✪ ✫☎✄ ✬ ✑ ✑ ✯✵ ✯✲☎☎ ✫✫ ✄ ✙ ✵ ✯✯ ☎ ✫✫ ✄✄ ✵ ✯✯✬✫✫ ✄ ✵ ✞ ✆ ✙ ✧ ✰✣ ✧ ✩ ✣ ✟ ✍✎ ✘✩ ✘✛✘✘ ✙ ✘✩✘✩ ✩ ✑ ✞ ✆ ✘✘ ✢ ✝ ✑ ✩ ✁ ✁ ✪ ✫☎✄ have the same basis ✠✞ and that the ✝ -axis of this basis is parallel to ✱ of the plane at infinity is ✒ ✔✮✙ ✮ ✮ ✿ , the ✬ ✬ . 
Lastly, for the two rectified views, the✘ homography ✘ epipole of the second view is ✖ ✙ ✮ ✌ ✞ ✞ ✷✍ ✰ , so that, according to equation (19), the homography of is ✢❀✣ . ✁ ✪✫ ✑ which shows that ☎✏☎ ✔ ✑ and ✂ ✑ ✑ ✑ ✑ ✑ ✑ ✂ ✝ ✁ ✁ ✁ ✑ In summary, the process of rectification consists of projecting the first image onto a retinal plane and the second image onto a retinal plane such that and are parallel and choosing the rectified image frames such that the -axis of the two rectified images are parallel and the homography of for the two rectified images is the identity. ✑ ✑ ✂ 4.3 Minimizing image distortion We now examine the distortion caused by ✑ ✒ ✑ and ✒ ✑ to the images. 4.3.1 How many degrees of freedom are left ? ✒ ✒ ✑ ✒ ✑ being known, equation (28) shows that is completely determined as soon as is. So, all the degrees of freedom left are concentrated in . Only eight coefficients of need to be computed, is defined up to a nonzero scale factor, and equation (27) supplies two scalar equations: Six since degrees of freedom remain, but how many of them are really involved in the distortion ? ✒ ✑ ✒ ✑ ✒ ✑ INRIA 23 Applications of non-metric vision to some visually guided robotics tasks ✑ ✿✱ To answer this question, we propose two approaches: The first one decomposes and the second . In each case, we propose a method for computing the values of the parameters which one, minimize the image distorsion. ✑ ✒ ✒ ✑ 4.3.2 The decomposition of . ✒ According to equation (32), there exist two matrices, ✙ ✑ ✒ ☎ and ☎ such that ✆ ✆ is an upper triangular matrix and , a rotation matrix. If we decompose rotations around the ✝ - ✞ - and -axis, we can write ☎ ✆ ✑ ✒ ✙ ✞ ☎ ✣ ☎ ✓✌☎ ✳ ✁ ✧ ✰✣ ✤✦✥ ✂ ✳ ✞ ✆ ✁ ✟ ✳ ✧✳ ✧ ✰✣ ✤✦✥ ✩ ☎✄ ✧ ✣ ✆ ✟ ✙ ✧ ✆ ✁ ✆ ✳ ✳ ✧ ✰✣ ✞ ☎ ✆ ✆ ☎ ✁ ✟ ✆ ☎✖✓✌☎ ✳ ✆ ✳ ✳ ☎ ✆ ✳ ✙ ✑ ✒ ✞ ✆ ✣ ✳ ✧✳ ✧ ✰✣ ✤✦✥ ✩ ✄ ✆ ✞ ✟ ☎ ✧ ✣ ✧ ✣✰ ✳ ✂ ✳ ✳ ✆ ☎ ✳✰ ✆ ✤✦✥ ✄ where from ✆ ✁ where is a upper triangular matrix, ,a rotation matrix, can be rewritten as where Now, according to equation (31), , an upper triangular matrix and we can write ☎ as a product of three ✆ ✆ ✁ ✟ a vector and , a scalar. is a rotation matrix and ✁ ✳ ✆ ✆ ✧ ✁ ✄ is a rotation around the -axis and , an upper triangular matrix. Lastly, if we extract the translation and scaling components, we have ✆ ☎ ✞✝ ✘ ✝ ✘ ✘ ✘ ✝ ☎ ✍✎ ✙ ✑ ✒ ✁ ✆ ✝ ✁ ✕✗✖ ✜✢✱✍✎ ✚ ✖ ✘✩ ✘ ✩ ✝ ✘✩ ✘ ✜✢ ✁ ✘ ✩ ✆ (39) ✆ ✁ ✑ Based on equation (39), is chosen such as to cancel out the third coordinate of , involved in , such as to cancel out its second coordinate equation (27), (making the epipolar lines parallel) and ✕ ✖ ✚ ✖ and , are not involved (making the epipolar lines parallel to the ✝ -axis). The translation terms, in the distortion, four degrees of freedom are left, given by the two scaling factors, and , the and the rotation angle in . skew ✆ ✁ ✒ ✖ ✝ ✆ ✝ ✝ ✝ ✁ ✆ ✁ Minimizing distorsion using a criterion based on areas. The criterion to be minimized is the ratio of the area of the rectangle with sides parallel to the ✝ - and ✞ -axes circumscribing the rectified image to the area of the rectified image (see figure 5). This criterion is valid as soon as these areas are not infinite, that is, as soon as the line (resp. ), ), to the line at infinity, does not go through any point of the first which is mapped by (resp. (resp. second) image. If (resp. ) does not lie in the first (resp. second) image, (resp. 
) can ✑ ✆ ✑ ✒ ✆ ✒ ✑ ✙ RR n˚2584 ✙ ✒ ✒ ✑ 24 Luc Robert, Cyril Zeller, Olivier Faugeras and Martial Hébert -axis -axis ✁ Figure 5: The area-based criterion: Minimizing the relative area of the filled region. be chosen to verify this constraint, since equation (27) (resp. (28)) show that (resp. ✳ ), which is represented by the last row of ✒ (resp. ✒ ), is only constrained to go through ✙ (resp. ✙★ ). ✒ is decomposed as explained in the paragraph above so that the criterion is a scalar function of ✆ , , and the angle ✂ of . Since the criterion is non-linear and its derivatives are not easily computable, a direction-set method is used, namely, the Powell’s method. ✕✗✖ ✚ ✖ and are initialized and ✂ , to 0. At the end of the minimization, , , and are adjusted so that the to 1 and rectified image is of the same size and at the same position in the plane as the initial image. ✆ ✑ ✆ ✑ ✑ ✝ ✝ ✝ ✁ ✁ ✝ ✝ ✁ ✝ ✝ ✁ ✝ ✁ 4.3.3 The decomposition of ✒ ✑ ✱ ✿ . Here we present another approach in which a particular parametrization of ✒ allows us to isolate the parameters responsible for image distorsion, and estimate their values so as to minimize distorsion. For simplicity, we express image points coordinates with respect to a normalized coordinate system, in which the image occupies the unit square. Using homogeneous coordinates, we denote by ✙ ✙ ✙ the coordinates of the epipole ✙ . We now describe a parametrization of ✒ that explicitly introduces two free rectification parameters. The other parameters correspond to two scaling factors (one horizontal and one vertical), and one horizontal translation which can be applied to both rectified images. These parameters can be set arbitrarily, and represent the magnification and clipping of the rectified images. Let us now see how, using the mapping of four particular points, we define a parameterization for . ✒ ✑ ✱ ✿ ✑ ✱ ✝ ✌ ✁ ✞ ✞ ✍ ✿ ✑ ✱ ✿ ✘ ✘ ☎✄ onto the epipole. This is the condition for the epipolar lines to be maps point 1. ✒ horizontal in the rectified images. ✑ ✱ ✌ ✩ ✞ ✞ ✍ ✿ INRIA 25 Applications of non-metric vision to some visually guided robotics tasks ✿ ✱ ✌ ✞ ✞✤✩✛✍ ✙ 2. We impose that the origin of the rectified image be mapped onto the origin of the image. This sets two translation parameters in the rectified image plane). In other words, ✒ . ✑ ✘ ✁ ✌ ✞ ✞✤✩ ✍ ✘ ✘ ✄ ✘ ✿✱ ✄ maps horizontal lines onto epipolar lines, we impose that the top-right corner of the 3. Since ✒ ✙ ✙ ✙ of the image, intersection of the epipolar line of image be mapped onto point ✁ the left corner with the right edge of the image (Figure 6). This sets the horizontal scale factor on the rectified image coordinates. ✑ ✙ ✌ ✞ ✞ ✍ ✁ ✞ 4. Fourth, we impose that the low-lefthand corner of the rectified image be mapped onto the epipolar line of the low-lefthand corner of the image. This sets the vertical scale factor of the rectified image coordinates. ✒ From the first three points, we infer that matrix ✒ ✿✱✙ ✑ ✙ ✑ ✿✱ is of the form: ✂ ✍✎ ✙ ✜✢ ✘ ✂ ✙ ✘ ✙ ✂ ✁ ✝ ✙ ☎ ✝ ✼ ✒ ✿ ✱ ✌ ✞✤✩ ✞✤✩ ✍ ✾ ✔✼ ✖ ✓ ✌ ✞✤✩ ✞ ✩ ✍ ✾ ✙ . In other words, ✒ ✿ ✱ ✌ ✞✤✩ ✞✤✩✛✍ ✌ ✞ ✩ ✞✤✩✛✍ , so there exist ✞ such that ✑ From the fourth point, we have is a linear combination of e and ✒ ✘ ✄ ✘ ✄ ✘ ✄ ✄ ✏ ☎✄ ✄ ✿✱✙ ✑ ✑ ✘ ✘ ✙ ✍✎ ✙ ✏ ✙ ✏ ✙ ✁ ✜✢ ✘ ✙ ✹✏✆✄ ✙ ✆ ✏ ✄ ✏ ✙ ✘ ✁ ✏ ✝ ✝ ✝ ☎ ✙ ✙ ✙ ☎ ✝ ✏ Assuming that the rectification plane is known (homography fication matrix for the two images. ), any choice of ☛ ✞ defines a recti- ☎✄ ✞ ✏ Minimizing distorsion using orthogonality. 
We choose ☎✄ so as to introduce as little image distortion as possible. Since there is no absolute measure of global distortion for images, the criterion that we use is based on the following remark: In the rectified images, epipolar lines are orthogonal to pixel columns. If the rectification transformation induced no deformation, it would preserve orthoof lines along pixel columns would be orthogonal to epipolar lines. gonality, so the image by ✒ Let us now consider one scanline ✁ of the rectified image, and two points ✞✝ which are respectively the top, bottom points of a vertical line of the rectified image. The epipolar line ✁ corresponds to epipolar lines in the initial images. The two lines ✁ and are orthogonal. Assuming that the rectification transformation preserves orthogonality, lines ✒ ✒ ☎✝ and ☎ ✝ should be orthogonal (see Figure 7), as well as lines ✒ and . ✒ For a given value of parameters ☎✄ , we define the residual ✿✱ ✑ ✆ ✡ ✆ ✡✑✞ ✆ ✡ ✆ ✡ ✆ ✄ ✆ ✄ ✞ ✼ ✞ ✾✻ ✙ ✼✹✼ ✑ ✑ ✄ ✏ ☎✄ ✆ ✆ ✏ ✂ ✆ ✡ ✆ ✑ ✑ ✆ ✍ ✆✱ ✡ ✆ ✿ ✼ ✾✞ ✿ ✱✼ ✾✡ ✱ ✱ ✆ ✿ ✼ ✾✞ ✿ ✼ ✾✡ ✆ ✡ ✼ ✒ ✿ ✱ ✝ ✾ ✓✟✼ ✒ ✿ ✱ ✾✹✾ ✳ ✆ ✄ ✆ ✞ ✆✡ ✆ ✠✟☛✡ ✄ ✾ ✑ ✰ ✟ ✑ ✌☞ Since cameras are in a horizontal configuration, this intersection point exists. However, the same method can be easily adapted to other cases. RR n˚2584 26 Luc Robert, Cyril Zeller, Olivier Faugeras and Martial Hébert ✖✖ ✁ ✂ ✂ ✱ ✖ ☎✝ ✁ ✱✄✂ ✂ ✱ ☎✞✝ ✖✖ ✖✖ ✁ ✱✄✂ ✂ ✁ ✂ ✂ ✱ ✆☎✞✝ ✖ ✖ left image ✘✘ ✱ ☎✞✝ ☎✌✝ rectified image ✑ ☎✝ ✍✎✑✏✓✒ ✁ ✂ ✱✔✂ ✱ ✁ ✂ ✱✄✂ ✱ ☎✝ ✁ ✟✡✠ ✂ ☞✟ ☛ ✂ ✟✆✠ Figure 6: ✿ maps three corners of the rectified image onto particular points of the image, and the point at infinity ✌ ✩ ✞ ✞ ✍☎✄ onto the epipole . ✒ ✖ with ✞ ✟ ✏ ✙ ✘ ✛✘ ✘✘ ✩ ✩ ✟ ✑ ✱ ✑ ✱ This term is the dot-product of the directions of the two lines ✆ ✿ ✼ ✾ ✞ ✿ ✼☎✝ ✾ ✡ and ✆ ✡ . An analo✱ ✙ gous term ✼ ✞☎✄ ✾ can be defined in the right image, with the rectification transformation ✿ ✱ ✱ ✿ ✿ . To determine rectification transformations, we compute ✞ ✄ which minimize the sum ✼ ✼ ✞☎✄ ✾ ☞ ✼ ✞☎✄ ✾✰✾ computed for both the left and the right columns of the rectified image, i.e. and having respective values ✌ ✞ ✞✤✩ ✍ and ✌ ✞✤✩ ✞✤✩ ✍ on the one hand, ✌ ✩ ✞ ✞✤✩ ✍ and ✌ ✩ ✞✤✩ ✞✤✩ ✍ on the other hand. One can see easily that the resulting expression is the square of a linear expression in ✞ ✄ . It can be minimized very efficiently using standard linear least-squares techniques. The two epipolar lines ✆ ✡ ✞✤✆ ✡ are chosen arbitrarily, so as to represent an “average direction” of the epipolar lines in the images. In practice, the pair of epipolar lines defined by the center of the left image provides satisfactory results. ✂ ✂ ✒ ✏ ✄ ✒ ✑ ✒ ✘✘ ✰ ✘ ✰ ✏ ✆ ✒ ✒ ✘ ✰ ✂ ✰ ✏ ✏ ✑ ✏ ✝ ✄ ✆ ✆ ✄ INRIA 27 Applications of non-metric vision to some visually guided robotics tasks ✂✁☎✄✝✆ ☎ ✷ ✟ ✄ ✡ ✎✍☎✌ ✡ ☞☛✂✌ ✞ ? ✞ ✡✑✏ ✌ ✞ ☎ ✷ ✟ rectified image ✁✟✞✠✆ left image Figure 7: Lines involved in the determination of the rectification transformation (see text). RR n˚2584 28 Luc Robert, Cyril Zeller, Olivier Faugeras and Martial Hébert 5 Positioning points with respect to a plane Measuring the positions of points with respect to a reference plane is essential for robot navigation. We will show in sections 6 and 7 several applications which are based on this measurement. In this section we study how to compare distances of points to a reference plane under minimal calibration assumptions. 
5 Positioning points with respect to a plane

Measuring the positions of points with respect to a reference plane is essential for robot navigation. We will show in sections 6 and 7 several applications which are based on this measurement. In this section we study how to compare distances of points to a reference plane under minimal calibration assumptions.

5.1 Comparing point distances to a reference plane

For convenience we adopt the terminology of the particular application of section 7.3, in which the reference plane is the ground plane and the robot needs to estimate the relative heights of visible points, i.e., their relative distances to the ground plane.

The distance of a point $M$ to the reference plane is related to the location of $M$ along the direction orthogonal to the plane. This notion can clearly not be captured at the projective level. Let us now see under which calibration assumptions we are able to compare point heights.

Let us introduce an (arbitrary) reference point $Q$ which does not belong to the ground plane. In practice this point is defined by its two image projections $(q, q')$, chosen arbitrarily so as to satisfy the following constraints:

- $q$ and $q'$ satisfy the epipolar constraint,
- both $q$ and $q'$ lie outside of the images,
- $q$ and $q'$ do not satisfy the homographic relation of the reference plane.

This guarantees that the point $Q$ does not lie on the plane and is different from any observable point.

We now consider the line $D$ orthogonal to the reference plane and passing through $Q$, and we denote by $Q_0$ the intersection of $D$ with the ground plane. The height of $M$ is then equal to the signed distance $\overline{Q_0 M'}$, where $M'$ is obtained by projecting $M$ onto $D$ parallel to the reference plane; by simple affine geometric properties, this parallel projection preserves the ratios of distances to the plane. Thus, if we take the reference point $Q$ to be at height one above the plane, the height of $M$ in terms of this unit can be expressed as (see Figure 8):

$\lambda_M = \overline{Q_0 M'} / \overline{Q_0 Q}$    (40)

Affine projection: In practice, we cannot directly compute distances between 3D points; we can only compute their projections onto the image planes. Ratios of distances of aligned points are affine invariants, so if we assume that, say, the right camera performs an affine projection, we can write

$\lambda_M = \overline{q'_0 m''} / \overline{q'_0 q'} ,$

where $q'_0$, $m''$ and $q'$ are the projections in the right image of $Q_0$, $M'$ and $Q$. Under affine viewing, this definition of height is exact, in the sense that $\lambda_M$ is proportional to the distance between $M$ and the reference plane. Otherwise, this formula is only an approximation. At any rate, it turns out to be accurate enough for some navigation applications, in which the points whose heights have to be compared are at relatively long range from the camera and within a relatively shallow depth of field (cf. Section 7.3).

Figure 8: Computation of relative heights with respect to the unit point $Q$ under affine projection (see text).

Perspective projection: If the affine approximation is not valid, we need to relate relative heights to projective invariants. Instead of considering ratios of distances, we consider cross-ratios, which are invariant under projection onto the images.

Let us assume that we know the homography of a plane $\Pi'$ parallel to the ground plane (this is in fact equivalent to knowing the line at infinity of the ground plane). Intersecting with $\Pi'$ the line through $M$ orthogonal to the ground plane (resp. the line $D$) defines a new point aligned with $M$ and its foot on the ground plane (resp. aligned with $Q$, $M'$ and $Q_0$).
On each of these two lines we can then form a cross-ratio of four aligned points, and, by simple affine properties (the two planes being parallel), the ratio of these two cross-ratios is equal to $\lambda_M$ as defined in Equation (40). Since cross-ratios are invariant under projection onto the images, each of them can be computed directly from the image projections of the corresponding aligned points, which yields $\lambda_M$.

We remark that the ratio of heights with respect to a plane can thus be captured at a calibration level which is intermediate between projective and affine: knowing the plane at infinity is not necessary, one only needs to know the line at infinity of the reference plane (a small numerical illustration is given at the end of this section).

5.2 Interpreting disparities

In this section we assume that the images have been rectified with respect to the reference plane, and we relate the positions of points relative to the plane to image disparities. The disparity $\delta$ assigned to a point correspondence after rectification with respect to a plane and correlation along the epipolar lines, as described in section 4, is in turn related to the position of the corresponding scene point with respect to the plane. Indeed, with the notations of section 4, $\delta$ is the offset between the rectified abscissae of the two corresponding points, so that, according to equations (29) and (38), it can be expressed in terms of signed distances of the scene point to fixed planes and lines (equation (41)); the rest of this section makes this interpretation precise.

In order to interpret $\delta$, we introduce the signed distance $d(M, \Pi)$ of a point $M$ to a plane $\Pi$ defined by its unit normal $n$ and its distance $d_\Pi$ to the origin: for $M = (x, y, z, t)^T$ in homogeneous coordinates,

$d(M, \Pi) = (n^T (x, y, z)^T - d_\Pi\, t) / t .$

The sign of $d(M, \Pi)$ is the same for all the points located on the same side of $\Pi$, and $|d(M, \Pi)|$ is equal to the distance of $M$ to $\Pi$. Similarly, we introduce the signed distance $d(m, l)$ of an image point $m = (x, y, 1)^T$ to a line $l$ defined by its unit normal $\nu$ and its distance $d_l$ to the origin:

$d(m, l) = \nu^T (x, y)^T - d_l .$

The sign of $d(m, l)$ is the same for all the points located on the same side of $l$, and $|d(m, l)|$ is equal to the distance of $m$ to $l$.

Then, according to equations (21) and (3), the quantities appearing in equation (41) can be rewritten using these signed distances, where $\Pi_2$ denotes the focal plane of the second view, $\bar\Pi$ the focal plane of the rectified views (see figure (4)), and $l_\infty$ the line of the second image whose image by the rectifying transformation is the line at infinity. This allows us to write $\delta$, up to factors that do not depend on $M$, in three different ways: in terms of the ratio $d(M, \Pi) / d(M, \Pi_2)$ (equation (42)); in terms of a similar expression in which the image-based signed distance of the corresponding point of the second image to $l_\infty$ also appears (equation (43)); and in terms of the ratio $d(M, \Pi) / d(M, \bar\Pi)$ (equation (44)). Since the remaining factors do not depend on $M$, we deduce from these equations the following three interpretations:

- From equation (42), we deduce that, if $M$ is a visible point (so that $d(M, \Pi_2)$ has a fixed sign), the sign of $\delta$ gives its position with respect to $\Pi$. Furthermore, $\delta$ is proportional to the ratio of the distance of $M$ to $\Pi$ to the distance of $M$ to $\Pi_2$.

- From equation (43), we deduce that, if $M$ is a visible point, the sign of $\delta$ usually gives its position with respect to $\Pi$. Indeed, $l_\infty$ usually does not go through any point of the image, so that the sign of the corresponding signed distance is usually the same for all the points considered. In fact, $l_\infty$ is usually far away from the image, so that this signed distance hardly varies over the points considered, and $\delta$ is, for these points, approximately proportional to the same ratio as in the preceding interpretation.
- From equation (44), we deduce that the sign of $\delta$ gives the position of $M$ with respect to $\Pi$, and that $\delta$ is proportional to the ratio of the distance of $M$ to $\Pi$ to the distance of $M$ to $\bar\Pi$.

According to equation (27), $l_\infty$ is an epipolar line; it is thus the image in the second view of an epipolar plane $\Pi_l$, and since $l_\infty$ is usually far away from the image, $\Pi_l$ and the rectified focal plane $\bar\Pi$ are approximately parallel around the image. For the points considered, the distance to $\Pi_l$ may therefore be approximated by the distance to $\bar\Pi$, and we are back to the preceding interpretation.

When $\Pi$ is the plane at infinity, equation (21) shows that the dependence on $d(M, \Pi)$ disappears, so that $\delta$ is inversely proportional to the distance of $M$ to $\Pi_2$ (equation (42)), approximately inversely proportional to this same distance (equation (43)), and inversely proportional to the distance of $M$ to $\bar\Pi$ (equation (44)).
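To illustrate numerically the constructions of Section 5.1 (the sketch and the names are ours, not part of the original system): the signed ratio of three aligned image points gives the relative height directly under the affine approximation of Equation (40), and the cross-ratio of four aligned points is the projective invariant used in the perspective case. Points are assumed to be given as 2-D pixel coordinates lying (approximately) on a common image line.

```python
import numpy as np

def _abscissa(p, origin, direction):
    """Signed abscissa of p along the line through `origin` with unit `direction`."""
    return float(np.dot(np.asarray(p, float) - origin, direction))

def affine_height_ratio(q0, m_prime, q):
    """Relative height  lambda = Q0M'/Q0Q  measured on aligned image points,
    valid when the camera is well approximated by an affine projection."""
    q0 = np.asarray(q0, float)
    d = np.asarray(q, float) - q0
    d /= np.linalg.norm(d)
    return _abscissa(m_prime, q0, d) / _abscissa(q, q0, d)

def cross_ratio(a, b, c, d):
    """Cross-ratio {a, b; c, d} of four aligned points (one common convention;
    other texts permute the roles of the points). Invariant under projection."""
    a = np.asarray(a, float)
    u = np.asarray(b, float) - a
    u /= np.linalg.norm(u)
    xa, xb, xc, xd = (_abscissa(p, a, u) for p in (a, b, c, d))
    return ((xc - xa) / (xc - xb)) / ((xd - xa) / (xd - xb))
```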
6 Computing local terrain orientations using collineations

Once a point correspondence is obtained through the process of rectification and correlation along the epipolar lines described in section 4, it is possible to estimate, in addition to a measure of the disparity, a measure of the local surface orientation, by using the image intensity function.

The traditional approach to computing such surface properties is to first build a metric model of the observed surfaces from the stereo matches, and then to compute local surface properties using standard tools from Euclidean geometry. This approach has two major drawbacks. First, reconstructing the geometry of the observed surfaces can be expensive, because it requires not only applying geometric transformations to the image pixels and their disparities in order to recover three-dimensional coordinates, but also interpolating a sparse 3D map in space to get dense three-dimensional information. Second, reconstructing the metric surface requires full knowledge of the geometry of the camera system through exact calibration. In addition, surface properties such as slope are particularly sensitive to the calibration parameters, thus putting more demand on the quality of the calibration.

Here, we investigate algorithms for evaluating terrain orientations from pairs of stereo images using limited calibration information. More precisely, we want to obtain an image in which the value of each pixel is a measure of the difference in orientation relative to some reference orientation, e.g., the orientation of the ground plane, assuming that the only accurate calibration information is the epipolar geometry of the cameras. We investigate two approaches based on these geometric tools. In the first approach (Section 6.2), we compute the Sum of Squared Differences (SSD) at a pixel for all the possible skewing configurations of the correlation windows; the skewing parameters of the window which produce the minimum of the SSD correspond to the most likely orientation at that point. This approach uses only knowledge of the epipolar geometry, but does not allow the full recovery of the slopes. Rather, it permits the comparison of the slope at every pixel with a reference slope, e.g., the orientation of the ground plane for a mobile robot. The second approach (Section 6.3) involves relating the actual orientations in space to the window skewing parameters. Specifically, we parameterize the space of all possible windows at a given pixel by the corresponding directions on the unit sphere. This provides more information than in the previous case, but requires additional calibration information, namely the knowledge of the approximate intrinsic parameters of one of the cameras and of point correspondences in the plane at infinity.

6.1 The principle

The guiding principle of this section is the following: the collineation that represents a planar surface is the one that best warps the first image onto the second one. This principle is used implicitly in all area-based stereo techniques in which the images are rectified and the scene is supposed to be locally fronto-parallel (i.e., parallel to the cameras) at each point [8, 25, 22]. In this case, the homographies are simple translations: a rectangular window in image 1 maps onto a similar window in image 2, whose horizontal offset (the disparity) characterizes the position of the plane in space. The pixel-to-pixel mapping defined by the homography allows us to compute a similarity measure on the pixel intensities inside the windows, based on a cross-correlation of the intensity vectors or an SSD of the pixel intensities.

Another example of this concept is the use of windows of varying shapes in area-based stereo, compensating for the effects of foreshortening due to the orientation of the plane with respect to the camera. In the TELEOS system [24], for example, several window shapes are used in the computation of the disparity.

We use this principle for choosing, among all the homographies that represent planes of various orientations passing through a surface point $M$, the one that best represents the surface at $M$. We use a standard window-based correlation algorithm to establish the correspondence $(m, m')$ between the images of $M$. Since the methods presented in this section are sensitive to the disparity estimates, we also use a simple sub-pixel disparity estimator.
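In the fronto-parallel case just described, the matching score reduces to comparing a fixed rectangular window in image 1 with a horizontally translated window in image 2 along the same rectified scanline. The following sketch is a simplified stand-in for the correlation step of section 4; the array-based image representation, the brute-force search and the parameter names are assumptions of ours.

```python
import numpy as np

def ssd_disparity(rect_left, rect_right, x, y, half, d_min, d_max):
    """Brute-force SSD search along a rectified scanline.

    rect_left, rect_right : rectified images as 2-D float arrays
    (x, y)                : pixel in the left rectified image (window assumed
                            to lie inside both images)
    half                  : half-size of the square correlation window
    [d_min, d_max]        : disparity search range, in pixels
    Returns the disparity with the smallest SSD score and that score."""
    ref = rect_left[y - half:y + half + 1, x - half:x + half + 1]
    best_d, best_ssd = None, np.inf
    for d in range(d_min, d_max + 1):
        xs = x + d
        cand = rect_right[y - half:y + half + 1, xs - half:xs + half + 1]
        if cand.shape != ref.shape:      # window falls outside the image
            continue
        ssd = float(np.sum((ref - cand) ** 2))
        if ssd < best_ssd:
            best_d, best_ssd = d, ssd
    return best_d, best_ssd
```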
6.2 Window-based representation

When applying a homography to a rectangular window in image 1, one obtains in general a skewed window in image 2. Since the homographies that we study map $m$ onto $m'$, two other pairs of points are sufficient for describing them (see section 3.3). This allows us to introduce a description of the plane orientation by two parameters measured directly in the images.

The standard configuration In order to simplify the presentation, we first describe the relations in the case of cameras in standard configuration, i.e., cameras whose optical axes are parallel and such that the axes of the image planes are also aligned. We also assume that the camera intrinsic parameters are known; considering metric coordinates in the retinal plane instead of pixel coordinates, we end up with identity intrinsic matrices (using the same notations as in Section 2). Choosing the frame attached to the first camera as the reference frame, the translation between the cameras is assumed to be along the horizontal axis. Although we describe the approach in this simplified case, the principle remains the same in the general case, though interpreting the equations is more complicated.

In the current case, Equation (12) shows that the homography $H$ of a plane of normal $n$ and distance $d$ to the origin differs from the identity matrix only in its first row (equation (45)), the entries of that row depending on $n$, $d$ and the baseline. The distance parameter $d$ can be obtained from the image points as follows: the images $(m, m')$ of the three-dimensional point $M$ are known and related by a horizontal disparity $\Delta$; the projection geometry (equation (46)) then gives a simple expression of $d$ as a function of the known variables $m$, $\Delta$ and $n$ (equation (47)). Substituting (47) into (45), $H$ can be expressed as a function of $m$, $\Delta$ and $n$ (equation (48)). The disparity $\Delta$ only has a translational effect, so it has no influence on the resulting window shape: two parameters, which we denote $\alpha$ and $\beta$, fully characterize the effect of $H$ on the window shape. Geometrically, $\alpha$ is the horizontal displacement of the center points of the left and right edges of the correlation window, and $\beta$ is the displacement of the centers of the top and bottom edges of the window (see Figure 9 (b)).

Figure 9: Parametrization of the window deformation; (a): left image; (b): right image if the cameras are aligned; (c): right window in a general camera configuration.

The general case To simplify the equations, this discussion of window skewing was presented in the case in which the cameras are aligned. The property remains essentially the same when the cameras are in a general configuration: in that case, the window shape can still be described by two parameters $(\alpha, \beta)$ that represent the displacements of the centers of the window edges along the epipolar lines, instead of along the image rows as in the case of aligned cameras (see Figure 9 (c)). Also, we have not included the intrinsic parameter matrices in the computations; it can easily be seen that including these matrices does not change the general form of equation (48), it changes only the relation between $(\alpha, \beta)$ and the orientation of the plane.

We have shown how to parametrize the window shape from the corresponding homography. It is important to note that the reasoning can be reversed, in that a given arbitrary value of $(\alpha, \beta)$ corresponds to a unique homography at $(m, m')$, which itself corresponds to a unique plane at $M$.

Slope Computation Using Window Shape: The parameterization of window shape as a function of planar orientations suggests a simple algorithm for finding the slope at $M$. First, choose a set of values for $\alpha$ and $\beta$. Then, compute a measure (correlation or SSD) for each possible $(\alpha, \beta)$, and select the best slope $(\alpha^*, \beta^*)$ as the one corresponding to the minimum of the measure. This is very similar to the approach investigated in [1].
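A possible implementation of this search warps the right-image window for each candidate $(\alpha, \beta)$ and keeps the pair with the smallest SSD. The sketch below models the skew as a horizontal offset varying linearly with the window coordinates, which is one way of realizing the edge-centre displacements of Figure 9; this particular warp and the candidate grid are assumptions of ours rather than the exact form of equation (48).

```python
import numpy as np

def skewed_ssd(left, right, x, y, d, alpha, beta, half):
    """SSD between a square window in the left image and a skewed window in
    the right image (aligned cameras assumed, no border handling).

    The right-image sample for window coordinates (i, j), centered on the
    pixel, is taken at
        x' = x + i + d + alpha * i/(2*half) + beta * j/(2*half),  y' = y + j,
    so alpha shifts the centres of the left/right window edges and beta the
    centres of the top/bottom edges, as in Figure 9."""
    ssd = 0.0
    for j in range(-half, half + 1):
        for i in range(-half, half + 1):
            xr = x + i + d + alpha * i / (2.0 * half) + beta * j / (2.0 * half)
            x0 = int(np.floor(xr))
            f = xr - x0
            # bilinear interpolation along the row of the right image
            val = (1.0 - f) * right[y + j, x0] + f * right[y + j, x0 + 1]
            diff = float(left[y + j, x + i]) - val
            ssd += diff * diff
    return ssd

def best_skew(left, right, x, y, d, candidates, half=7):
    """Return the (alpha, beta) candidate with minimal skewed SSD."""
    return min(candidates,
               key=lambda ab: skewed_ssd(left, right, x, y, d, ab[0], ab[1], half))
```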
If all the parameters of the cameras were known, $(\alpha^*, \beta^*)$ could be converted to the Euclidean description of the corresponding plane in space. Otherwise, it is still possible to compare the computed orientation with the orientation of a reference plane $\Pi_0$ whose line at infinity is known. Indeed, the homography of the plane passing through $M$ and parallel to $\Pi_0$ can be computed (Section 3.3), and its parameters $(\alpha_0, \beta_0)$ derived; the distance between $(\alpha^*, \beta^*)$ and $(\alpha_0, \beta_0)$ is a measure of the difference between the slope of the terrain at $M$ and the reference slope. Figure 10 shows an example in the case of two single-pixel correspondences selected in a pair of images. The SSD is computed from the Laplacian of the images rather than from the images themselves, in order to eliminate the offset between the intensities of the two images.

Figure 10: Left: selected pixels in image 1. Center (resp. right): SSD error as a function of $(\alpha, \beta)$ at the road (resp. tree) pixel. The reference orientation is approximately the orientation of the road and is represented by the vertical dotted lines. Both surfaces have a sharp minimum, but only one (the road point) is located at the reference orientation $(\alpha_0, \beta_0)$.

In practice, $\Pi_0$ can be defined by three point correspondences in a pair of training images, two of which lie far enough from the camera to be considered as lying at infinity, thus defining the line at infinity of the plane. Another way to proceed would be to estimate the plane from any three non-aligned point correspondences, and to compute its intersection with the plane at infinity estimated from correspondences of remote points.

Limitations: There are two problems with this representation of plane directions. First, depending on the position of the point in space, the discretization of the parameters may lead to very different results, due to the non-uniform distribution of the plane directions in $(\alpha, \beta)$-space. In particular, the discrimination between different plane directions becomes poorer as the range to the surface increases. This problem becomes particularly severe when the surfaces are at a relatively long range from the camera and when the variation of range across the image is significant. The second problem is that it is difficult to interpret consistently the distance between $(\alpha, \beta)$ and $(\alpha_0, \beta_0)$ across the image. Specifically, a particular value of this distance corresponds to different angular distances between planes depending on the disparity, but also on the position of the point in the image.

6.3 Normal-based representation

Based on the limitations identified above, we now develop an alternate parameterization, which consists of discretizing the set of all possible orientations in space, and then of evaluating the corresponding homographies. This assumes that the intrinsic parameters of the first camera are known, as well as the collineation of the plane at infinity.

Figure 11: Distribution of the SSD at the two selected pixels of Figure 10 (left: road; right: trees). The SSD distribution is plotted with respect to two coordinates of the 3D normal. The reference orientation (close to the road) is represented as a plain surface patch. On the left diagram, the computed SSD values are low in the neighborhood of the reference orientation. On the right one, the low SSD values are further from the reference orientation.

Slope computation using normal representation: Assuming that we know the intrinsic matrix $A$ of the first camera and the homography $H_\infty$ of the plane at infinity, we can compute the homography of a plane defined by a given pair of corresponding points $(m, m')$ and a normal vector $n$ in space.
Indeed, the plane that is orthogonal to $n$ and contains the optical center $C$ projects in the first image onto a line $l_n$, and Equation (14) tells us that

$l_n = A^{-T}\, n .$    (49)

This line is the projection in the first image of any line of that plane which does not contain $C$; so, unless the optical center is at infinity, it is in particular the image of the line at infinity of that plane, and thus of any parallel plane, since parallel planes share the same line at infinity. Given the homography $H_\infty$ of the plane at infinity, we can then compute the corresponding line $l'_n = H_\infty^{-T}\, l_n$ in the second image. Finally, according to section 3.3, $(m, m')$, $l_n$, $l'_n$ and the knowledge of the epipolar geometry allow us to compute the homography of the plane of normal $n$ passing through $M$.

By sampling the set of possible orientations in a uniform manner, we generate a set of homographies that represent planes of well-distributed orientations at a given point $M$. The algorithm of the previous section is then used directly to evaluate each orientation $n$. In the current implementation, we sample the sphere of unit normals into 40 different orientations using a regular subdivision of the icosahedron.

Figure 11 shows the SSD distributions in the case of the two pixels studied in Figure 10. Though correct, the results point out one remaining problem of this approach: the SSD distribution may be flat because of the lack of signal variation in the window. This is a problem with any area-based technique. In Section 6.4, we present a probabilistic framework which enables us to address this problem.

It is important to note that the orientation of the cameras does not need to be known, and that the coordinate system in which the orientations are expressed is unimportant. In fact, we express all the orientations in a reference frame attached to the first camera, which is sufficient since all we need is to compare orientations, and this does not require the use of a specific reference frame. Consequently, it is not necessary to use the complete metric calibration of the cameras.

A-priori geometric knowledge: In practice, $H_\infty$ is estimated as described at the end of the previous section. Since this only gives an approximation, the lines $l_n$, $l'_n$ that we compute from a given orientation do not really represent a line at infinity. Thus, the planes that correspond to this orientation rotate around a fixed line in space instead of being parallel. For practical purposes, this line is far enough away that the discrepancy does not introduce significant errors. The matrix $A$ represents the intrinsic parameters of the first image. Since we are interested in the slopes in the image relative to some reference plane, it is not necessary to know $A$ precisely: an error in $A$ introduces a consistent error in the computation of the homographies which is the same for the reference plane and for an arbitrary plane, and does not affect the output of the algorithm dramatically. We finally remark that if $A$ is modified by changing the scale in the first image, the results remain unchanged. This geometric property, observed by Koenderink [16], implies that only the aspect ratio, the angle of the pixel axes and the principal point of the first camera need to be known.
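In code, the normal-based evaluation needs a set of well-distributed unit normals and, for each of them, the line $l_n$ of Equation (49). The sketch below uses a simple golden-angle sampling of the hemisphere facing the camera as a stand-in for the icosahedron subdivision mentioned above; the transfer to the second image assumes `H_inf` is available, and the final homography computation (from $(m, m')$, the two lines and the epipolar geometry) is only indicated by a hypothetical function name.

```python
import numpy as np

def sample_hemisphere(n_dirs=40):
    """Roughly uniform unit normals on the hemisphere facing the camera
    (z < 0, assuming the camera looks down the positive z-axis); a simple
    stand-in for the regular icosahedron subdivision used in the text."""
    k = np.arange(n_dirs) + 0.5
    phi = np.pi * (1.0 + 5 ** 0.5) * k      # golden-angle increments
    z = -k / n_dirs                          # only the half facing the camera
    r = np.sqrt(1.0 - z * z)
    return np.stack([r * np.cos(phi), r * np.sin(phi), z], axis=1)

def line_of_normal(A, n):
    """Image line l_n = A^{-T} n of the plane through the optical centre
    orthogonal to n (Equation (49))."""
    return np.linalg.solve(A.T, np.asarray(n, float))

# Hypothetical usage:
# for n in sample_hemisphere():
#     l1 = line_of_normal(A, n)
#     l2 = np.linalg.solve(H_inf.T, l1)          # same line seen in image 2
#     H_n = homography_from_point_and_line(m, m_prime, l1, l2, F)  # section 3.3
```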
6.4 Application to estimation of terrain traversability

Although the accuracy of the slopes computed using the algorithms of the previous section is not sufficient to, for example, reconstruct the terrain shape, it provides a valuable indication of the traversability of the terrain. Specifically, we define the traversability at a point as the probability that the angle between a reference vertical direction and the normal to the terrain surface is lower than a given angular threshold. The term traversability comes from mobile robot navigation, in which the angular threshold controls the range of slopes that the robot can navigate safely.

Estimating traversability involves converting the distribution of SSD values $S(n)$ at a pixel $m$ into a function $\rho(n)$ which can be interpreted as the likelihood that $n$ corresponds to the terrain orientation at $m$. We then define the traversability measure $T(m)$ as the probability that this orientation is within a neighbourhood $N$ around the direction $n_0$ of the reference plane:

$T(m) = \sum_{n \in N} \rho(n) .$

We use a formalism similar to the one presented in [22] in order to define $\rho$. Assuming that the pixel values in both images are normally distributed with standard deviation $\sigma$, the distribution of $n$ is given by:

$\rho(n) = \frac{1}{K} \exp\left(-\frac{S(n)}{2\sigma^2}\right)$    (50)

where $K$ is a normalizing factor. This definition of $\rho$ has two advantages. First, it integrates the confidence values computed for all the slope estimates into one traversability measure. In particular, if the distribution of $S(n)$ is relatively flat, $T(m)$ has a low value, reflecting the fact that the confidence in the position of the minimum of $S(n)$ is low. This situation occurs when there is not enough signal variation in the images, or when $m$ is the projection of a scene point that is far from the cameras. The second advantage of this definition of traversability is that the sensitivity of the algorithm can be adjusted in a natural way: for example, if $N$ is defined as the set of plane orientations which are at an angle less than $\theta$ from $n_0$, the sensitivity of $T(m)$ increases as $\theta$ decreases.

Figure 12: Examples of traversability maps computed on two pairs of images (see text).

Figure 12 shows the results on two pairs of images of outdoor scenes. The first image of each pair is displayed on the left. The center images show the complete traversability maps. Once again, the influence of the signal is noticeable: in the top example, a large part of the road has a rather low traversability because there is little signal in the images, whereas the values corresponding to the highly textured sidewalk are very high. The right image shows the regions that have a probability greater than the value we would obtain if there were no signal in the images, i.e., the regions that can be considered as traversable. In both cases, the obstacles have low traversability values.

Figure 13 shows the result of evaluating $T(m)$ for three different values of $\theta$. Only the traversable regions are shown. As $\theta$ increases, the influence of the signal becomes less noticeable, and the likelihood of a region to be traversable increases. The measure of traversability can be easily integrated into navigation systems such as the one presented in section 7.

Figure 13: Traversability maps from the distribution of slopes on real data. Left: small admissible region; center: medium admissible region; right: large admissible region.
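Under the Gaussian form of Equation (50), the traversability measure can be computed directly from the SSD scores of the sampled orientations. The sketch below is a direct transcription, with variable names of our own, assuming the discrete orientation samples of Section 6.3.

```python
import numpy as np

def traversability(ssd, normals, n_ref, theta_max, sigma):
    """Probability that the local surface normal lies within an angle
    theta_max of the reference direction n_ref (Section 6.4).

    ssd     : array of SSD scores, one per sampled orientation
    normals : (N, 3) array of the corresponding unit normals
    n_ref   : unit reference normal (e.g. the ground-plane direction)
    """
    ssd = np.asarray(ssd, float)
    rho = np.exp(-ssd / (2.0 * sigma ** 2))   # Equation (50), up to the factor K
    rho /= rho.sum()                           # normalisation (the factor K)
    cosines = normals @ np.asarray(n_ref, float)
    in_neighbourhood = cosines >= np.cos(theta_max)
    return float(rho[in_neighbourhood].sum())
```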
7 Navigating

In this section we show three robotic applications of the geometric properties presented in the above sections. In the first one, stereo is used to detect close obstacles. In the second one, the robot uses affine geometry to follow the middle of a corridor. In the third one, relative heights and orientations with respect to the ground plane are used for trajectory planning.

7.1 Detecting obstacles

This section describes how to use the previous results to provide a robot with the ability to detect obstacles. The only requirement is that the robot is equipped with a stereo rig which can be very simply calibrated, as explained next. The calibration consists of the following steps:

- as described in Section 3.1, some correspondences between two views taken by the cameras are found;
- these correspondences are used to compute the fundamental matrix, as described in Section 3.2;
- three particular correspondences are given to the system; they correspond to three object points defining a virtual plane $\Pi$ in front of the robot;
- the H-matrix of $\Pi$ is computed as described in Section 3.3.

The fundamental matrix, as well as the plane H-matrix, remain the same for any other pair of views taken by the system, as long as the intrinsic parameters of both cameras and the attitude of one camera with respect to the other do not change.

According to Section 5, by repeatedly performing rectifications with respect to $\Pi$, the robot knows whether there are points between itself and $\Pi$ by looking at the sign of their disparity, and it can act in consequence. If the distance of the robot to $\Pi$ is known, the robot may, for example, move forward by this distance. Furthermore, if $\Pi$ and the focal plane intersect sufficiently far away from the cameras, the robot can detect whether the points are moving away from it or towards it: $\Pi$ and the focal plane may then be considered as parallel around the images, so that, according to equation (42), the disparity of the points considered is approximately a monotonic function of their distance to $\Pi$. Finally, since we are only interested in the points near the plane, which have a disparity close to zero, we can limit the search for the correspondent of any point along its epipolar line to an interval around the position predicted by the plane homography, which significantly reduces the computation time.

An example is given in Figures 14, 15, 16 and 17. Figure 14 shows, as dark square boxes, the points used to define the plane, together with the image of a fist taken by the left camera. Figure 15 shows the left and right images once rectified with respect to this plane. Figure 16 shows the disparity map obtained by correlation. Figure 17 shows the segmentation of the disparity map into two parts: on the left side, points with negative disparities, i.e., points located in front of the reference plane, are shown, with the intensity encoding closeness to the camera; the right side of the figure shows the points with positive disparities, i.e., the points located beyond the reference plane.

Figure 14: The paper used to define the plane, and the left image of the fist taken as an example.

Figure 15: The left and right rectified images of the fist.

Figure 16: The disparity map obtained from the rectified images of Figure 15.

Figure 17: The absolute value of the negative disparities on the left, showing that the fist and a portion of the arm are between the robot and the plane of rectification, and the positive disparities on the right, corresponding to the points located beyond the plane.
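In code, the obstacle test reduces to looking at the sign of the disparities obtained after rectification with respect to the virtual plane $\Pi$, as in Figure 17. The sketch below assumes a dense disparity map and a validity mask as inputs (hypothetical), and adds a small dead-band around zero to absorb matching noise.

```python
import numpy as np

def detect_obstacles(disparity, valid, margin=0.5):
    """Segment a disparity map computed after rectification w.r.t. a plane.

    disparity : dense disparity map (pixels), after plane rectification
    valid     : boolean mask of pixels where a correspondence was found
    margin    : dead-band around zero disparity, in pixels

    Returns two boolean masks: points in front of the plane (negative
    disparity, potential obstacles) and points beyond the plane."""
    disparity = np.asarray(disparity, float)
    in_front = valid & (disparity < -margin)
    beyond = valid & (disparity > margin)
    return in_front, beyond
```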
Figure 18: Determining a point at the middle of a corridor (see Section 7.2).

7.2 Navigating along the middle of a corridor

If, during the calibration stage, we add to the computation of the fundamental matrix the computation of the homography of the plane at infinity, using the method described in Section 3.4, the robot becomes able to compute ratios of three aligned points, and thus, for example, the middle of a corridor, and to do visual servoing. Indeed, let us represent the projections of the two sides of the corridor by $d_1$ and $d_2$ in the first image and $d'_1$ and $d'_2$ in the second image (see Figure 18), and let us choose any point $m_1$ of $d_1$ and any point $n_1$ of $d_2$, projections in the first image of a point $M$ of the first side and a point $N$ of the second side. The corresponding points $m_2$ and $n_2$ in the second image are computed as the intersections of the epipolar line of $m_1$ with $d'_1$ and of the epipolar line of $n_1$ with $d'_2$, respectively. Having $(m_1, m_2)$ and $(n_1, n_2)$ allows us to compute the projections $(i_1, i_2)$ of the midpoint $I$ of $M$ and $N$ (a minimal numerical sketch of one way of doing this is given below). If we consider the two sides as locally parallel, then $I$ lies on the local middle line of the corridor, and computing the projections of another point of this line in the same way gives the projections of the middle line in the two images.

Figure 19 shows some real sequences used to perform the affine calibration of a stereoscopic system. Six strong correspondences between the four images have been extracted, from which fifteen correspondences of points at infinity have been computed, to finally obtain the homography of the plane at infinity. Figure 20 shows some midpoints obtained once the system is calibrated: the endpoints are represented as black squares and the midpoints as black crosses. Figure 21 shows the midline of a corridor obtained from another affinely calibrated system: the endpoints are represented as numbered oblique dark crosses, the midpoints as black crosses and the midline as a black line.
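One way to carry out the midpoint construction numerically is sketched below; it is a construction of our own based on the harmonic-conjugate property of the midpoint, not necessarily the exact computation used in the original implementation. Given the image lines of the corridor sides, the fundamental matrix $F$ (with the convention $m_2^T F m_1 = 0$) and the homography $H_\infty$ of the plane at infinity, the image of the point at infinity of the segment $MN$ is first located, and the image of the midpoint is then its harmonic conjugate with respect to the images of $M$ and $N$. All inputs are assumed to be available from the affine calibration described above.

```python
import numpy as np

def corridor_midpoint(m1, n1, d2_left, d2_right, F, H_inf):
    """Image (in view 1) of the midpoint of two corridor-side points.

    m1, n1            : homogeneous points on the two side lines in image 1
    d2_left, d2_right : homogeneous lines, images of the two sides in image 2
    F, H_inf          : fundamental matrix and plane-at-infinity homography
    """
    m1 = np.asarray(m1, float); n1 = np.asarray(n1, float)
    # Correspondences of m1 and n1: intersection of their epipolar lines
    # (F m1, F n1) with the side lines in image 2.
    m2 = np.cross(F @ m1, np.asarray(d2_left, float))
    n2 = np.cross(F @ n1, np.asarray(d2_right, float))
    # Image in view 1 of the point at infinity of the 3-D line MN:
    # it lies on the line (m1, n1) and, once transferred by H_inf,
    # on the line (m2, n2).
    v1 = np.cross(np.cross(m1, n1), H_inf.T @ np.cross(m2, n2))
    # Write v1 = a*m1 + b*n1; the harmonic conjugate a*m1 - b*n1 is the image
    # of the midpoint (midpoint and point at infinity are harmonic w.r.t. M, N).
    coeffs, *_ = np.linalg.lstsq(np.stack([m1, n1], axis=1), v1, rcond=None)
    a, b = coeffs
    return a * m1 - b * n1
```

The last step uses the fact that the midpoint of a segment and the point at infinity of its supporting line are harmonic conjugates with respect to the endpoints, and that cross-ratios are preserved by projection onto the images.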
Figure 19: The top images correspond to a first pair of views taken by a stereoscopic system, and the bottom images to a second pair taken by exactly the same system after a translation. Among the 297 corners detected in the top left image and the 276 detected in the top right image, 157 point correspondences have been found by stereo point matching (see Section 3.1), among which 7 outliers have been rejected when computing the fundamental matrix (see Section 3.2). The top-to-bottom correspondences have been obtained by tracking (see Section 3.1).

Figure 20: Midpoints obtained after affine calibration (see Section 3.4).

Figure 21: Midline of a corridor obtained after affine calibration (see Section 3.4).

7.3 Trajectory evaluation using relative elevation

A limitation of the conventional approach to stereo driving is that it relies on precise metric calibration with respect to an external calibration target in order to convert matches to 3-D points. From a practical standpoint, this is a serious limitation in scenarios in which the sensing hardware cannot be physically accessed, such as in the case of planetary exploration. In particular, this limitation implies that the vision system must remain perfectly calibrated over the course of an entire mission. Nevertheless, navigation should not require the precise knowledge of the 3-D positions of points in the scene: what is important is how much a point deviates from the reference ground plane, not its exact position. Based on these observations, we have developed an approach which relies on the measure of relative height with respect to a ground plane (see Section 5.1).

7.3.1 The driving approach

We give only an overview of the approach, since a detailed description of the driving system is outside the scope of this book. A detailed description of the stereo driving system can be found in [17]; the autonomous navigation architecture is described in [19] and [13].

In autonomous driving, the problem is to use the data in the stereo images to compute the best steering command for the vehicle, and to update the steering command every time a new image is taken. Our basic approach is to evaluate a set of possible steering directions based on the relative heights computed at a set of points which project onto a regular grid in the image. Specifically, a given steering radius corresponds to an arc which can be projected into a curve in the image; this curve traces the trajectory that the vehicle would follow in the image if it used this steering radius. Given the points of the measurement grid and the set of steering radii, we compute for every arc and every point of the grid a vote which reflects how drivable the arc is. The computed value lies between -1 (high obstacle) and +1 (no obstacle) (Figure 22). For a given steering radius, the votes from all the grid points are aggregated into a single vote by taking the minimum of the votes computed from the individual grid points. The output is, therefore, a distribution of votes between -1 and +1, -1 being inhibitory, over the set of possible steering arcs. This distribution is then sent to an external module which combines the distribution of votes from stereo with distributions from other modules in order to generate the steering command corresponding to the highest combined vote. Figure 23 shows examples of vote distributions computed in front of visible obstacles.

Characterization of the reference ground plane: The homography of the reference ground plane is estimated from a number of point correspondences related to scene points on this plane. These point correspondences are obtained by selecting points in the first image; their corresponding points in the second image are computed automatically through the process of rectification and correlation along the epipolar lines described in Section 4.
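The ground-plane homography can be estimated from the selected point correspondences by a standard linear (DLT-style) method. The following sketch is a generic illustration, not necessarily the estimator used in the system, and assumes at least four correspondences of ground-plane points in general position.

```python
import numpy as np

def homography_from_correspondences(pts1, pts2):
    """Least-squares homography H such that pts2 ~ H pts1, with pts1 and pts2
    given as (N, 2) arrays of pixel coordinates of ground-plane points.
    Plain DLT, without normalisation or robust outlier rejection, so it is
    intended only as an illustration."""
    rows = []
    for (x, y), (xp, yp) in zip(pts1, pts2):
        rows.append([-x, -y, -1, 0, 0, 0, xp * x, xp * y, xp])
        rows.append([0, 0, 0, -x, -y, -1, yp * x, yp * y, yp])
    A = np.asarray(rows, float)
    _, _, vt = np.linalg.svd(A)
    H = vt[-1].reshape(3, 3)   # null vector of the design matrix
    return H / H[2, 2]
```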
Measuring obstacle heights: Let us consider one point of the grid in the first image, for which a corresponding point in the second image has been found by the stereo process. Based on the results of Section 5.1, we can compute its height with respect to the ground. The unit height is defined by a point of the scene, selected manually in one of the two images at the beginning of the experiment and matched automatically in the other image.

This measurement is not sufficient, since we aim at measuring heights along trajectories which are estimated in the ground plane. So, after determining the elevation of a point selected in one image, we determine the point of the ground plane to which this elevation has to be assigned, by projecting the measured 3D point onto the (horizontal) ground plane along the vertical direction. This means computing the intersection between the ground plane and the vertical line passing through the 3D point. To apply the method of Section 3.3, we need to compute the images of the vertical line passing through the observed point. For this, we compute the images of the point at infinity in the vertical direction (also called the vertical vanishing point) in both images. First, we select manually four points representing two vertical lines in the left image. Matching two of these points, we obtain one of the two corresponding lines in the right image. The left vertical vanishing point is obtained by intersecting the two lines in the left image; computing the intersection of its epipolar line with the line in the right image, we obtain the right vertical vanishing point.

Computing image trajectories: This approach assumes that a transformation is available for projecting the steering arcs onto the image plane. Such a transformation can be computed from features in sequences of images, using an approach related to the estimation algorithm described above for computing the homography induced by the ground plane.

We first introduce a system of coordinates in the ground plane, attached to the rover, which we call "rover coordinates". At each time instant, we know in rover coordinates the trajectory which will be followed by the rover for each potential steering command. Furthermore, for a given motion/steering command sent to the robot, we know from the mechanical design the expected change of rover coordinates from the final position to the initial one; we can even estimate the actual motion using dead reckoning. Since this transformation is a change of coordinates in the plane, it can be represented by a $3 \times 3$ matrix operating on homogeneous coordinates.

The transformation which we compute is the homography $T$ which maps pixel coordinates in the left image onto rover coordinates. The inverse of this matrix then allows us to map potential rover trajectories onto the left image plane. The computation of $T$ is done by tracking points across the left images taken at various rover positions. Let us consider two images acquired at positions 1 and 2, with a known rover motion $D_{12}$. Given a point correspondence $(m_1, m_2)$, we have the following equation (up to a scale factor):

$T\, m_1 \simeq D_{12}\, T\, m_2 ,$

where the only unknown is the matrix $T$. This can also be written

$(T\, m_1) \times (D_{12}\, T\, m_2) = 0 ,$

which yields a system of two independent quadratic equations in the coefficients of $T$. Given a set of displacements and point coordinates, we can write a large system of such equations, which we solve in the least-squares sense using the Levenberg-Marquardt technique.
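A minimal version of this estimation, using SciPy's Levenberg-Marquardt implementation, is sketched below. The data format (a list of tracked correspondences with the associated rover displacements $D_{12}$) and the choice of fixing the last entry of $T$ to 1 are assumptions of ours.

```python
import numpy as np
from scipy.optimize import least_squares

def estimate_T(observations, T0=np.eye(3)):
    """Estimate the homography T mapping left-image pixels to rover coordinates.

    observations : list of (m1, m2, D12), with m1, m2 homogeneous image points
                   of the same ground point seen from poses 1 and 2, and D12
                   the 3x3 change of rover coordinates from pose 2 to pose 1.
    Uses the constraint (T m1) x (D12 T m2) = 0, which is quadratic in T.
    """
    def residuals(params):
        T = np.append(params, 1.0).reshape(3, 3)   # fix T[2, 2] = 1 (assumed)
        res = []
        for m1, m2, D12 in observations:
            r = np.cross(T @ np.asarray(m1, float),
                         D12 @ (T @ np.asarray(m2, float)))
            res.extend(r[:2])        # only two equations are independent
        return np.asarray(res)

    sol = least_squares(residuals, T0.ravel()[:8], method='lm')
    return np.append(sol.x, 1.0).reshape(3, 3)
```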
Using heights to speed up stereo matching: The relative height is also used for limiting the search in the stereo matching. More precisely, we define an interval of heights which we anticipate in a typical terrain. This interval is converted at each pixel into a disparity range. This is an effective way of limiting the search, since only disparities that are physically meaningful at each pixel are explored.

Figure 22: Evaluating steering at individual points; (top) three points selected in a stereo pair and their projections; (bottom) corresponding votes for all steering directions.

7.3.2 Experimental results

This algorithm has been successfully used for arc evaluation in initial experiments on the CMU HMMWV [13], a converted truck for autonomous navigation. In this case, a grid of 400 points was used. The combination of stereo computation and arc evaluation was done at an average of 0.5 s on a Sparc-10 workstation, and new steering directions were issued to the vehicle at that rate. This update rate is comparable to what can be achieved using a laser range finder [19].

An important aspect of the system is that we are able to navigate even though a relatively small number of points is processed in the stereo images. This is in contrast with the more conventional approach in which a dense elevation map is necessary, thus dramatically increasing the computation time. Using such a small number of points is justified because it has been shown that the set of points needed for driving is a small fraction of the entire data set, independently of the sensor used [15], and because we have designed our stereo matcher to compute matches at specific points.

Figure 23: Evaluating steering directions in the case of a large obstacle (left) and a small obstacle (right). (Top): regular grid of points (dots) and corresponding projections (squares); (bottom): distribution of votes (see text).

8 Conclusion

In this chapter we have pushed a little further the idea that only the information necessary to solve a given visual task needs to be recovered from the images, and that this attitude pays off by considerably simplifying the complexity of the processing. Our guiding light has been to exploit the natural mathematical idea of invariance under a group of transformations. This has led us to consider the three usual groups of transformations of 3-D space, the projective, affine and Euclidean groups, which determine a three-layer stratification of that space in which we found it convenient to think about and solve a number of vision problems related to robotics applications. We believe that this path, even though it may look a bit arduous from a non-mathematically-inclined reader's point of view, offers enough practical advantages to make it worth investigating further.
In particular we are convinced that, apart from the robotics applications that have been described in the paper and for which we believe our ideas have been successful, the approach can be used in other areas such as the representation and retrieval of images from digital libraries.

References

[1] Frédéric Devernay and Olivier Faugeras. Computing differential properties of 3-D shapes from stereoscopic images without 3-D models. In Proceedings of the International Conference on Computer Vision and Pattern Recognition, pages 208–213, Seattle, WA, June 1994. IEEE.

[2] O.D. Faugeras. What can be seen in three dimensions with an uncalibrated stereo rig? In G. Sandini, editor, Proceedings of the 2nd European Conference on Computer Vision, pages 563–578, Santa Margherita Ligure, Italy, May 1992. Springer-Verlag.

[3] O.D. Faugeras and L. Robert. What can two images tell us about a third one? In J-O. Eklundh, editor, Proceedings of the 3rd European Conference on Computer Vision, pages 485–492, Stockholm, Sweden, May 1994. Springer-Verlag.

[4] Olivier Faugeras. Stratification of 3-D vision: projective, affine, and metric representations. Journal of the Optical Society of America A, 12(3):465–484, March 1995.

[5] Olivier Faugeras, Tuan Luong, and Steven Maybank. Camera self-calibration: theory and experiments. In G. Sandini, editor, Proceedings of the 2nd European Conference on Computer Vision, volume 588 of Lecture Notes in Computer Science, pages 321–334, Santa Margherita Ligure, Italy, May 1992. Springer-Verlag.

[6] Olivier D. Faugeras. Three-Dimensional Computer Vision: a Geometric Viewpoint. MIT Press, 1993.

[7] Olivier D. Faugeras and Francis Lustman. Let us suppose that the world is piecewise planar. In O. D. Faugeras and Georges Giralt, editors, Robotics Research, The Third International Symposium, pages 33–40. MIT Press, 1986.

[8] P. Fua. Combining stereo and monocular information to compute dense depth maps that preserve depth discontinuities. In Proceedings of the 12th International Joint Conference on Artificial Intelligence, Sydney, Australia, August 1991.

[9] Gene H. Golub and Charles F. Van Loan. Matrix Computations. The Johns Hopkins University Press, Baltimore, Maryland, 1983.

[10] C. Harris and M. Stephens. A combined corner and edge detector. In Proceedings of the 4th Alvey Vision Conference, pages 147–151, Manchester, August 1988.

[11] R. Hartley, R. Gupta, and T. Chang. Stereo from uncalibrated cameras. In Proceedings of the International Conference on Computer Vision and Pattern Recognition, pages 761–764, Urbana-Champaign, IL, June 1992. IEEE.

[12] Richard Hartley, Rajiv Gupta, and Tom Chang. Stereo from uncalibrated cameras. In Proceedings of CVPR92, Champaign, Illinois, pages 761–764, June 1992.

[13] M. Hebert, D. Pomerleau, A. Stentz, and C. Thorpe. A behavior-based approach to autonomous navigation systems: the CMU UGV project. To appear in IEEE Expert, 1994.

[14] J. J. Koenderink and A. J. Van Doorn. Affine structure from motion. Journal of the Optical Society of America A, 8(2):377–385, 1992.

[15] A. Kelly. A partial analysis of the high speed autonomous navigation problem. Technical Report CMU-RI-TR-94-16, The Robotics Institute, Carnegie Mellon, 1994.

[16] J.J. Koenderink and A.J. Van Doorn. Geometry of binocular vision and a model for stereopsis. Biological Cybernetics, 21:29–35, 1976.

[17] E. Krotkov, M. Hebert, M. Buffa, F.G. Cozman, and L. Robert. Stereo driving and position estimation for autonomous planetary rovers. In 2nd International Workshop on Robotics in Space, pages 320–328, Montreal, Quebec, July 1994.
[18] L. S. Shapiro, H. Wang, and J. M. Brady. A matching and tracking strategy for independently-moving, non-rigid objects. In Proceedings of the British Machine Vision Conference, 1992.

[19] D. Langer, J. Rosenblatt, and M. Hebert. A reactive system for autonomous navigation in unstructured environments. In Proceedings of the International Conference on Robotics and Automation, San Diego, 1994.

[20] Q.-T. Luong, R. Deriche, O.D. Faugeras, and T. Papadopoulo. On determining the fundamental matrix: analysis of different methods and experimental results. Technical Report RR-1894, INRIA, 1993.

[21] Q.-T. Luong and T. Viéville. Canonic representations for the geometries of multiple projective views. Technical Report UCB/CSD-93-772, University of California at Berkeley, September 1993.

[22] L. Matthies. Stereo vision for planetary rovers: stochastic modeling to near real-time implementation. The International Journal of Computer Vision, 1(8), July 1992.

[23] J. L. Mundy and A. Zisserman, editors. Geometric Invariance in Computer Vision. MIT Press, 1992.

[24] H.K. Nishihara. RTVS-3: real-time binocular stereo and optical flow measurement system (system description manuscript). Technical report, Teleos, Palo Alto, CA, July 1990.

[25] M. Okutomi and T. Kanade. A multiple-baseline stereo. In Proceedings of the International Conference on Computer Vision and Pattern Recognition, pages 63–69, Lahaina, Hawaii, June 1991. IEEE.

[26] W. H. Press, B. P. Flannery, S.A. Teukolsky, and W. T. Vetterling. Numerical Recipes in C. Cambridge University Press, 1988.

[27] L. Robert and O.D. Faugeras. Relative 3D positioning and 3D convex hull computation from a weakly calibrated stereo pair. Image and Vision Computing, 13(3):189–197, 1995.

[28] A. Shashua. Projective structure from two uncalibrated images: structure from motion and recognition. Technical Report A.I. Memo No. 1363, MIT, September 1992.

[29] T. Viéville, C. Zeller, and L. Robert. Recovering motion and structure from a set of planar patches in an uncalibrated image sequence. In Proceedings of ICPR94, Jerusalem, Israel, October 1994.

[30] Zhengyou Zhang, Rachid Deriche, Olivier Faugeras, and Quang-Tuan Luong. A robust technique for matching two uncalibrated images through the recovery of the unknown epipolar geometry. Artificial Intelligence Journal, 1994. To appear. Also INRIA Research Report No. 2273, May 1994.