
CN118985007A - Keyframe downsampling for reducing memory usage in SLAM - Google Patents


Info

Publication number
CN118985007A
Authority
CN
China
Prior art keywords
landmark
descriptor
camera
image
descriptors
Prior art date
Legal status
Pending
Application number
CN202280094476.9A
Other languages
Chinese (zh)
Inventor
全在春
夏友杰
Current Assignee
Innopeak Technology Inc
Original Assignee
Innopeak Technology Inc
Priority date
Filing date
Publication date
Application filed by Innopeak Technology Inc
Publication of CN118985007A


Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/70 Determining position or orientation of objects or cameras
    • G06T7/73 Determining position or orientation of objects or cameras using feature-based methods
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/10 Image acquisition modality
    • G06T2207/10016 Video; image sequence
    • G06T2207/30 Subject of image; context of image processing
    • G06T2207/30244 Camera pose

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Processing Or Creating Images (AREA)
  • Image Analysis (AREA)

Abstract


The present application is directed to simultaneous localization and mapping (SLAM) in extended reality. An electronic system has a camera and obtains image data including a plurality of images captured by the camera in a scene. Each of the images includes a first landmark. The electronic system generates a plurality of landmark descriptors of the first landmark from the image data, and identifies a plurality of camera poses for the landmark descriptors. Each landmark descriptor is generated from a different image that includes the first landmark and is captured at a different camera pose. In accordance with a determination that two of the camera poses satisfy a descriptor elimination criterion, the electronic system selects a first landmark descriptor, corresponding to a first of the two camera poses, to map the first landmark in the scene.

Description

Keyframe downsampling for reducing memory usage in SLAM
Technical Field
The present application relates generally to data processing techniques, including but not limited to methods, systems, and non-transitory computer readable media for selectively storing keyframes for locating cameras and mapping environments in augmented reality applications.
Background
Simultaneous localization and mapping (SLAM) is widely used in virtual reality (VR), augmented reality (AR), autonomous driving, and navigation. In SLAM, high-frequency pose estimation is achieved through sensor fusion. Asynchronous time warping (ATW) is commonly applied in AR systems along with SLAM to warp an image before it is sent to the display, correcting for head movements that occur after the image is rendered. In SLAM and ATW, the related image data and inertial sensor data are synchronized and used to estimate and predict camera poses. For localization, many SLAM systems suppress the accumulated error of inertial sensor data by detecting corner points and extracting image descriptors on keyframe images. Each keyframe image is associated with a descriptor derived using a bag-of-visual-words data structure, and the corresponding descriptor of a new image is compared with the descriptors of existing keyframe images to locate the new image. The computational cost increases with the number of existing keyframe images and becomes unmanageable when that number reaches a very large or global scale, e.g., 10 billion.
Conventional solutions use the optical-flow length between two consecutive images to reduce the number of keyframe images. When the average length of the detected optical flows is short, the conventional solution waits for a new image to accumulate more optical-flow length and sets the next keyframe when it is sufficient. In addition, when the camera has significant rotational motion relative to its translational motion, that rotational motion cannot be triangulated for 3D computation, and the associated keyframes are invalidated and removed. Small six-degrees-of-freedom (6DOF) motions help identify close-range objects but do not facilitate keyframe downsampling. It would be beneficial to establish a more efficient SLAM mechanism that uses fewer keyframes than is currently done.
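The conventional optical-flow criterion described above can be sketched as follows. The flow representation (a list of per-feature displacement vectors) and the threshold value are illustrative assumptions, not values taken from the application.

```python
import math

def should_create_keyframe(flow_vectors, min_avg_length=20.0):
    """Decide whether to set the next keyframe based on the average
    optical-flow length between two consecutive images. The 20-pixel
    threshold is a hypothetical tuning parameter."""
    if not flow_vectors:
        return False  # no tracked features yet: keep waiting for new images
    avg = sum(math.hypot(dx, dy) for dx, dy in flow_vectors) / len(flow_vectors)
    return avg >= min_avg_length
```

When the average flow is short, the caller simply skips the current image and re-evaluates on the next one, mirroring the "wait for more optical flow" behavior above.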
Disclosure of Invention
Various embodiments of the present application relate to SLAM techniques that map a virtual space with descriptors. A camera is localized while it captures images in a scene, and the scene is mapped to a virtual space that includes a plurality of landmarks. Each image (also referred to as a keyframe) corresponds to a camera pose (i.e., a camera position and a camera orientation) and records a set of landmarks in the scene. Each keyframe is processed to provide a set of descriptors for the set of landmarks, and the set of descriptors is associated with the camera pose. From the perspective of each landmark, the respective landmark is recorded in one or more keyframes that provide one or more descriptors, associated with one or more respective camera poses, to describe the respective landmark. As the number of keyframes captured by the camera increases, memory usage increases. In some embodiments, the number of keyframes, or the number of associated landmarks in each keyframe, is downsampled to reduce the number of descriptors that SLAM needs to store. For example, when a camera pose has multiple descriptors, only a portion of these descriptors is selected to map landmarks, because the unselected descriptors are substantially similar to descriptors of other keyframes. When the selected portion of descriptors is small, the camera pose contributes little to the overall accuracy of the mapping data of the virtual space, and the camera pose and corresponding keyframe are barred from providing any descriptors, including the selected portion, to map the set of landmarks. Such a keyframe downsampling mechanism may be applied in a variety of SLAM-based products, such as AR glasses, robotic systems, autonomous vehicles, drones, or mobile devices implementing AR applications.
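The keyframe prohibition described above, where a keyframe whose selected descriptor portion is small is barred entirely, might be sketched as a simple fraction test. The threshold is a hypothetical parameter; the application does not specify one.

```python
def prune_keyframe(selected, total, min_fraction=0.3):
    """Hypothetical image elimination criterion: if only a small fraction
    of a keyframe's descriptors were selected for mapping, the keyframe
    contributes little and is removed (barred from providing any
    descriptors). `min_fraction` is an assumed tuning parameter."""
    return (selected / total) < min_fraction
```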
In one aspect, a method is implemented in an electronic system having a camera. The method includes obtaining image data, the image data including a plurality of images captured by a camera in a scene, and each image of the plurality of images including a first landmark. The method also includes generating a plurality of landmark descriptors for the first landmark from the image data, and identifying a plurality of camera poses of the plurality of landmark descriptors. Each landmark descriptor is generated from a different image including the first landmark and captured with a different camera pose. The method further includes, in accordance with a determination that two of the plurality of camera poses satisfy the descriptor elimination criteria, selecting a first landmark descriptor corresponding to a first one of the two of the plurality of camera poses to map a first landmark in the scene.
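One plausible reading of the descriptor elimination criterion is that two camera poses are redundant when they are close in both position and viewing direction, so a single descriptor suffices for both. The pose representation (a 3-D position plus a single yaw angle in degrees) and the thresholds below are illustrative assumptions, not the application's definition.

```python
import math

def poses_satisfy_elimination(pose_a, pose_b,
                              max_translation=0.1, max_angle_deg=5.0):
    """Hypothetical descriptor elimination criterion: two camera poses
    satisfy it when their positions and viewing directions are nearly
    identical. Each pose is (x, y, z, yaw_deg); thresholds are assumed."""
    (xa, ya, za, yaw_a), (xb, yb, zb, yaw_b) = pose_a, pose_b
    dist = math.dist((xa, ya, za), (xb, yb, zb))   # translational separation
    dang = abs(yaw_a - yaw_b)                      # angular separation
    return dist <= max_translation and dang <= max_angle_deg
```

When two poses satisfy the criterion, the method keeps the descriptor of the first pose and drops the other, as in the selection step above.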
In some embodiments, the method further includes, in accordance with a determination that two of the plurality of camera poses satisfy the descriptor elimination criteria, deselecting a different, second landmark descriptor corresponding to a second of the two camera poses for mapping the first landmark in the scene. In some embodiments, the method further includes mapping the first landmark with a subset of the plurality of landmark descriptors. The subset of landmark descriptors corresponds to a subset of the different camera poses at which a subset of the plurality of images was captured. The subset of landmark descriptors is selected based on a determination that no two camera poses in the subset of different camera poses meet the descriptor elimination criteria.
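The subset selection described above, in which no two kept camera poses satisfy the elimination criteria, can be sketched as a greedy filter. The callback-based structure is an assumption made for illustration; the application does not prescribe an algorithm.

```python
def select_descriptor_subset(descriptors_with_poses, satisfy):
    """Greedy sketch of descriptor subset selection: keep a descriptor only
    if its camera pose does not satisfy the elimination criterion (tested
    by the `satisfy` callback) with any already-kept pose, so no two kept
    poses are mutually redundant."""
    kept = []
    for desc, pose in descriptors_with_poses:
        if not any(satisfy(pose, p) for _, p in kept):
            kept.append((desc, pose))
    return kept
```

For example, with a toy criterion that treats scalar poses within 1.0 of each other as redundant, descriptors at poses 0.0 and 0.5 collapse to one, while a descriptor at pose 2.0 survives.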
In another aspect, some embodiments include an electronic system comprising one or more processors and a memory having instructions stored thereon that, when executed by the one or more processors, cause the processors to perform any of the methods described above.
In yet another aspect, some embodiments include a non-transitory computer-readable medium having instructions stored thereon that, when executed by one or more processors, cause the processors to perform any of the methods described above.
These exemplary embodiments and implementations are mentioned not to limit or define the disclosure, but to provide examples that aid understanding of it. Additional embodiments are discussed, and further description is provided, in the detailed description.
Drawings
For a better understanding of the various embodiments described, reference should be made to the following detailed description, taken in conjunction with the accompanying drawings, in which like reference numerals identify corresponding parts throughout.
FIG. 1 is an example data processing environment having one or more servers communicatively coupled to one or more client devices, according to some embodiments.
Fig. 2 is a block diagram illustrating an electronic system, according to some embodiments.
FIG. 3 is a flow chart of a process for processing inertial sensor data and image data of an electronic system (e.g., a server, a client device, or a combination of both) using SLAM modules, according to some embodiments.
Fig. 4A-4C are three simplified diagrams of virtual space with multiple landmarks captured by different keyframes, according to some embodiments.
Fig. 5A is a diagram of a virtual space mapped with a plurality of landmarks associated with a first set of keyframes, and fig. 5B is another diagram of a virtual space mapped with a plurality of landmarks associated with a second set of keyframes, according to some embodiments.
FIG. 6 is a flow chart of a method of simultaneous localization and mapping (SLAM) according to some embodiments.
Like reference numerals designate corresponding parts throughout the several views of the drawings.
Detailed Description
Reference will now be made in detail to specific embodiments, examples of which are illustrated in the accompanying drawings. In the following detailed description, numerous non-limiting specific details are set forth in order to provide an understanding of the subject matter presented herein. It will be apparent, however, to one skilled in the art that various alternatives may be used, and that the subject matter may be practiced without these specific details and without departing from the scope of the claims. For example, it will be apparent to one of ordinary skill in the art that the subject matter presented herein may be implemented on many types of electronic systems with digital video capabilities.
The present application relates to localizing a camera and mapping a scene for rendering extended reality content (e.g., virtual, augmented, or mixed reality content) on an electronic device. In the prior art, comparing the current image with existing keyframes to identify the camera location can be very inefficient, because a large number of keyframes must be created to map the scene. In various embodiments of the present application, the number of keyframes, or the number of associated landmarks in each keyframe, is downsampled to reduce the number of descriptors that SLAM needs to store. For example, when a camera pose has multiple descriptors, only a portion of these descriptors is selected to map landmarks, because the unselected descriptors are substantially similar to descriptors of other keyframes. When the selected portion of descriptors is small, the camera pose contributes little to the overall accuracy of the mapping data of the virtual space, and the camera pose and corresponding keyframe are barred from providing any descriptors, including the selected portion, to map the set of landmarks.
FIG. 1 is an example data processing environment 100 having one or more servers 102 communicatively coupled to one or more client devices 104, according to some embodiments. The one or more client devices 104 may be, for example, a notebook computer 104A, a tablet computer 104B, a mobile phone 104C, or a smart multi-sensing networked home device (e.g., a surveillance camera 104E, a smart television device, a drone). In some implementations, one or more client devices 104 include a head-mounted display 104D for rendering the augmented reality content. Each client device 104 may collect data or user input, execute a user application, and present output on a user interface of the client device. The collected data or user input may be processed locally on the client device 104 and/or remotely by the server 102. One or more servers 102 provide system data (e.g., boot files, operating system images, and user applications) to client devices 104, and in some embodiments, process data and user inputs received from client devices 104 when the user applications are executed on client devices 104. In some embodiments, the data processing environment 100 also includes a memory 106 for storing data related to the server 102, the client device 104, and applications executing on the client device 104. For example, the memory 106 may store video content (including visual and audio content), static visual content, and/or inertial sensor data.
The one or more servers 102 may be in real-time data communication with client devices 104 that are remote from each other or from the one or more servers 102. Further, in some embodiments, one or more servers 102 may perform data processing tasks that client device 104 cannot or preferably does not perform locally. For example, the client device 104 includes a game console (e.g., formed by the head mounted display 104D) executing an interactive online game application. The game console receives the user instructions and sends the user instructions along with the user data to the game server 102. The game server 102 generates a video data stream based on the user instructions and user data and provides the video data stream for display on the game console and other client devices participating in the same game session as the game console.
One or more servers 102, one or more client devices 104, and memory 106 are communicatively coupled to one another via one or more communication networks 108, the communication networks 108 being media used to provide communication links between these devices within the data processing environment 100 and computers connected together. The one or more communication networks 108 may include connections, such as wire, wireless communication links, or fiber optic cables. Examples of the one or more communication networks 108 include a local area network (LAN), a wide area network (WAN) such as the internet, or a combination thereof. One or more of the communication networks 108 may alternatively be implemented using any known network protocol, including various wired or wireless protocols such as Ethernet, universal serial bus (USB), FireWire, long term evolution (LTE), global system for mobile communications (GSM), enhanced data GSM environment (EDGE), code division multiple access (CDMA), time division multiple access (TDMA), Bluetooth, Wi-Fi, voice over Internet protocol (VoIP), WiMAX, or any other suitable communication protocol. Connections to one or more communication networks 108 may be established directly (e.g., using 3G/4G connections to wireless carriers), or through a network interface 110 (e.g., a router, switch, gateway, hub, or intelligent private home control node), or through any combination thereof. As such, the one or more communication networks 108 may represent the internet, a worldwide collection of networks and gateways that use the transmission control protocol/Internet protocol (TCP/IP) suite of protocols to communicate with one another.
At the heart of the Internet is a backbone of high-speed data communication lines between major nodes or host computers, consisting of thousands of commercial, government, educational and other electronic systems that route data and messages.
The head mounted display 104D (also referred to as AR glasses 104D) includes one or more cameras (e.g., visible light cameras, depth cameras), microphones, speakers, one or more inertial sensors (e.g., gyroscopes, accelerometers), and a display. The camera and microphone are used to capture video and audio data from the scene in which the AR glasses 104D are located, while the one or more inertial sensors are used to capture inertial sensor data. In some cases, the camera captures gestures of a user wearing the AR glasses 104D. In some cases, the microphone records ambient sounds, including voice commands of the user. In some cases, video or still visual data captured by a visible light camera and inertial sensor data measured by the one or more inertial sensors are both used to determine and predict the device pose. Video, still images, audio, or inertial sensor data captured by the AR glasses 104D are processed by the AR glasses 104D, the server 102, or both to identify the device pose. The device pose is used to control the AR glasses 104D itself, or to interact with an application (e.g., a gaming application) executed by the AR glasses 104D. In some embodiments, the display of the AR glasses 104D displays a user interface, and the identified or predicted device pose is used to interact with user-selectable display items on the user interface or to render virtual objects with high fidelity.
In some embodiments, SLAM techniques are applied in the data processing environment 100 to process video data, still image data, or depth data captured by the AR glasses 104D together with inertial sensor data. The device pose is identified and predicted, and the scene in which the AR glasses 104D are located is mapped and updated. The SLAM techniques are optionally implemented by the AR glasses 104D independently, or jointly by the server 102 and the AR glasses 104D.
Fig. 2 is a block diagram illustrating an electronic system 200 for processing content data (e.g., image data), in accordance with some embodiments. The electronic system 200 includes a server 102, a client device 104 (e.g., AR glasses 104D in fig. 1), a memory 106, or a combination thereof. Electronic system 200 typically includes one or more processing units (CPUs) 202, one or more network interfaces 204, memory 206, and one or more communication buses 208 for interconnecting these components (sometimes called a chipset). The electronic system 200 includes one or more input devices 210 that facilitate user input, such as a keyboard, mouse, voice command input unit or microphone, touch screen display, touch sensitive tablet, gesture capture camera, or other input buttons or controls. Furthermore, in some embodiments, the client device 104 of the electronic system 200 supplements or replaces the keyboard with a microphone for voice recognition or a camera 260 for gesture recognition. In some embodiments, the client device 104 includes one or more optical cameras 260 (e.g., RGB cameras), scanners, or light sensor units for capturing images, such as images of a graphic serial code printed on an electronic device. The electronic system 200 also includes one or more output devices 212 capable of presenting user interfaces and displaying content, including one or more speakers and/or one or more visual displays.
Optionally, the client device 104 includes a location detection device, such as a GPS (global positioning system) or other geographic location receiver, for determining the location of the client device 104. Optionally, the client device 104 includes an inertial measurement unit (inertial measurement unit, IMU) 280, the inertial measurement unit 280 integrating sensor data captured by the multi-axis inertial sensor to provide an estimate of the position and orientation of the client device 104 in space. The one or more inertial sensors of IMU 280 include, but are not limited to, gyroscopes, accelerometers, magnetometers, and inclinometers.
Memory 206 includes high-speed random access memory, such as DRAM, SRAM, DDR RAM, or other random access solid state memory devices; and optionally nonvolatile memory such as one or more magnetic disk storage devices, one or more optical disk storage devices, one or more flash memory devices, or one or more other nonvolatile solid state storage devices. Memory 206 optionally includes one or more storage devices located remotely from the one or more processing units 202. The memory 206 or alternatively, the non-volatile memory within the memory 206 includes a non-transitory computer-readable storage medium. In some embodiments, memory 206 or a non-transitory computer readable storage medium of memory 206 stores the following programs, modules, and data structures, or subsets or supersets of the foregoing:
an operating system 214 including processes for handling various basic system services and for performing hardware-related tasks;
A network communication module 216 for connecting each server 102 or client device 104 to other devices (e.g., server 102, client device 104, or memory 106) via one or more network interfaces 204 (wired or wireless) and one or more communication networks 108 (such as the internet, other wide area networks, local area networks, metropolitan area networks, etc.);
a user interface module 218 for enabling presentation of information (e.g., a graphical user interface of an application 224, a widget (widget), a website and web pages of the website, and/or games, audio and/or video content, text, etc.) at each client device 104 via one or more output devices 212 (e.g., a display, speakers, etc.);
an input processing module 220 for detecting one or more user inputs or interactions from one of the one or more input devices 210 and interpreting the detected inputs or interactions;
A web browser module 222 for navigating, requesting (e.g., via HTTP), and displaying a website and web pages of the website, including a web interface for logging into a user account associated with the client device 104 or another electronic device, controlling the client or electronic device if associated with the user account, and editing and viewing settings and data associated with the user account;
One or more user applications 224 (e.g., games, social networking applications, smart home applications, and/or other web-based or non-web-based applications for controlling another electronic device and viewing data captured by such devices) executed by electronic system 200;
model training module 226 for receiving training data and building a data processing model for processing content data (e.g., video, image, audio, or text data) to be collected or obtained by client device 104;
a data processing module 228 for processing the content data using the data processing model 250 to identify information contained in the content data, match the content data with other data, classify the content data, or synthesize related content data, wherein in some embodiments the data processing module 228 is associated with one of the user applications 224 to process the content data in response to user instructions received from the user application 224;
A pose determination and prediction module 230 for determining and predicting a pose of the client device 104 (e.g., AR glasses 104D), wherein in some embodiments the pose is jointly determined and predicted by the pose determination and prediction module 230 and the data processing module 228, and in some embodiments the module 230 further comprises a SLAM module 232 for mapping a scene in which the client device 104 is located and identifying the pose of the client device 104 within the scene using image or IMU sensor data;
A pose-based rendering module 238 for rendering a virtual object in the field of view of a camera 260 of the client device 104, or creating mixed, virtual, or augmented reality content using images captured by the camera 260, wherein the virtual objects are rendered, and the mixed, virtual, or augmented reality content is created, based on the camera pose of the camera 260 and from the perspective of the camera 260; and
One or more databases 240 for storing data including at least one or more of:
o device settings 242, including generic device settings (e.g., service layer, device model, storage capacity, processing power, communication capability, etc.) for one or more of the server 102 or client device 104;
o user account information 244 for one or more user applications 224, such as user names, security questions, account history data, user preferences, and preset account settings;
o network parameters 246 of the one or more communication networks 108, such as IP address, subnet mask, default gateway, DNS server, and host name;
o training data 248 for training one or more data processing models 250;
o data processing models 250 for processing content data (e.g., video, image, audio, or text data) using deep learning techniques;
o a pose data database 252 for storing pose data of the camera 260, wherein in some embodiments, descriptors and associated camera poses are compressed according to descriptor or image elimination criteria and stored in association with landmarks in the scene; and
o content data and results 254, the content data being obtained by the client device 104 of the electronic system 200 and the content results being output to the client device 104 of the electronic system 200, wherein the content data is processed locally at the client device 104 or remotely at the server 102 by the data processing model 250 to provide associated results to be presented on the client device 104, and the content data includes candidate images.
Optionally, one or more databases 240 are stored in one of the server 102, the client device 104, and the memory 106 of the electronic system 200. Optionally, one or more databases 240 are distributed among multiple ones of the server 102, client devices 104, and memory 106 of the electronic system 200. In some embodiments, multiple copies of the data are stored in different devices, e.g., two copies of the data processing model 250 are stored in the server 102 and the memory 106, respectively.
Each of the above identified elements may be stored in one or more of the aforementioned memory devices and correspond to a set of instructions for performing the above described functions. The above identified modules or programs (i.e., sets of instructions) need not be implemented as separate software programs, procedures, modules, or data structures, and thus various subsets of these modules may be combined or otherwise rearranged in various embodiments. In some embodiments, memory 206 optionally stores a subset of the modules and data structures identified above. Further, memory 206 may optionally store additional modules and data structures not described above.
FIG. 3 is a flow diagram of a process 300 for processing inertial sensor data and image data of an electronic system (e.g., server 102, client device 104, or a combination of both) using a visual-inertial SLAM module 232, according to some embodiments. Process 300 includes measurement preprocessing 302, initialization 304, local visual-inertial odometry (VIO) with relocalization 306, and global pose graph optimization 308. In measurement preprocessing 302, the RGB camera 260 captures image data of a scene at an image rate (e.g., 30 FPS), and features are detected and tracked (310) from the image data. The IMU 280 measures inertial sensor data at a sampling frequency (e.g., 1000 Hz) while the RGB camera 260 captures the image data, and the inertial sensor data is pre-integrated (312) to provide data on changes in the device pose 340. In initialization 304, the image data captured by the RGB camera 260 and the inertial sensor data measured by the IMU 280 are temporally aligned (314). A vision-only structure-from-motion (SfM) technique is applied (316) to couple the image data and inertial sensor data, estimate three-dimensional structure, and map the scene observed by the RGB camera 260.
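The pre-integration step (312) can be illustrated with a toy one-dimensional example that integrates a run of acceleration samples into a velocity and position change between two camera frames. Real VIO pre-integration additionally handles 3-D rotation, gravity, and sensor biases, all of which are omitted here.

```python
def preintegrate(accels, dt):
    """Toy 1-D pre-integration: accumulate acceleration samples (m/s^2)
    measured at interval dt (s) into the change in velocity and position
    since the last keyframe. Rotation, gravity, and bias are ignored."""
    v = 0.0  # accumulated velocity change
    p = 0.0  # accumulated position change
    for a in accels:
        v += a * dt          # integrate acceleration into velocity
        p += v * dt          # integrate velocity into position
    return v, p
```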
After initialization 304 and during relocalization 306, the VIO is optimized (322) using a sliding window 318 and the associated states 320 from loop closure. When the VIO corresponds (324) to a keyframe of a smooth video transition and a corresponding loop is detected (326), features are retrieved (328) and used to generate the associated states from loop closure 320. In global pose graph optimization 308, a multi-degree-of-freedom (multi-DOF) pose graph is optimized (330) based on the states from loop closure 320, and the keyframe database 332 is updated with keyframes associated with the VIO. Specifically, in some embodiments, each keyframe includes a set of landmarks and is processed to generate landmark descriptors that map the landmarks. The descriptors are optionally compressed according to descriptor elimination criteria, and the corresponding keyframes mapping the landmarks are optionally eliminated according to image elimination criteria.
In addition, the detected and tracked (310) features are used to monitor (334) the motion of objects in the image data and to estimate an image-based pose 336 (e.g., at the image rate). In some embodiments, the pre-integrated (312) inertial sensor data is propagated (338) based on the motion of the objects and used to estimate an inertia-based pose 340 (e.g., at the sampling frequency of the IMU 280). The image-based pose 336 and the inertia-based pose 340 are stored in the database 240 and used by the module 230 to estimate and predict poses used by the real-time video rendering system 234. Optionally, in some embodiments, the module 232 receives the inertial sensor data measured by the IMU 280 and obtains the image-based pose 336 to estimate and predict further poses 340 for further use by the real-time video rendering system 234.
In SLAM, high-frequency pose estimation is achieved through sensor fusion, which relies on data synchronization between the imaging sensor and the IMU 280. Imaging sensors (e.g., RGB camera 260, LiDAR scanner) provide the image data required for pose estimation and typically operate at a lower frequency (e.g., 30 frames per second) and longer latency (e.g., 30 milliseconds) than the IMU 280. In contrast, the IMU 280 measures inertial sensor data at a very high frequency (e.g., 1000 samples per second) with negligible latency (e.g., <0.1 milliseconds). Asynchronous time warping (ATW) is commonly applied in AR systems to warp an image before it is sent to a display, correcting for head movements and pose changes that occur after image rendering. The ATW algorithm reduces image latency, increases or maintains the frame rate, and reduces judder caused by missing images. In SLAM and ATW, relevant image data and inertial sensor data are stored locally, such that these data can be synchronized and used for pose estimation and prediction. In some embodiments, the image and inertial sensor data are stored in one of a plurality of C++ Standard Template Library (STL) containers, such as std::vector, std::queue, std::list, etc., or in other custom containers. These containers are generally convenient to use. The image and inertial sensor data are stored in the STL containers along with timestamps for data search, data insertion, and data organization.
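The timestamp-based synchronization described above can be sketched as follows. The patent stores the data in C++ STL containers; this is a minimal Python analogue, and the `TimestampedBuffer` class, its method names, and the sample data are illustrative assumptions rather than the disclosed implementation.

```python
# A sketch of a timestamped sensor buffer: samples are kept sorted by
# timestamp so that an image captured at ~30 FPS can be aligned against
# IMU samples arriving at ~1000 Hz. All names here are illustrative.
from bisect import bisect_left
from collections import namedtuple

Sample = namedtuple("Sample", ["t", "data"])

class TimestampedBuffer:
    """Keeps samples sorted by timestamp for search, insertion, and sync."""
    def __init__(self):
        self._ts = []       # parallel list of timestamps for binary search
        self._samples = []

    def insert(self, t, data):
        i = bisect_left(self._ts, t)
        self._ts.insert(i, t)
        self._samples.insert(i, Sample(t, data))

    def nearest(self, t):
        """Return the stored sample whose timestamp is closest to t."""
        i = bisect_left(self._ts, t)
        candidates = self._samples[max(0, i - 1):i + 1]
        return min(candidates, key=lambda s: abs(s.t - t))

# IMU at ~1000 Hz: align an image timestamp to the nearest IMU sample.
imu = TimestampedBuffer()
for k in range(10):
    imu.insert(k * 0.001, {"gyro": (0.0, 0.0, 0.1 * k)})
print(imu.nearest(0.0042).t)  # -> 0.004
```

Keeping the container sorted by timestamp allows binary-search lookups when aligning the lower-rate image stream against the high-rate IMU stream.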
FIGS. 4A-4C are three simplified diagrams of a virtual space 400 having a plurality of landmarks 402 captured in different keyframes 404, according to some embodiments. An electronic device has a camera 260 and is located in a scene. The camera 260 captures a plurality of images (also referred to as keyframes) 404, and each keyframe 404 is captured while the camera 260 has a respective camera pose 406 (i.e., is at a camera position with a camera orientation). The scene includes a plurality of objects associated with the plurality of landmarks 402. Each keyframe 404 records a portion of the scene that includes a subset of the objects and, thus, a subset of the landmarks 402. For each landmark in the subset of landmarks 402, the respective keyframe 404 is processed by a data processing model (e.g., a convolutional neural network (CNN)) to extract a landmark descriptor for the respective landmark 402. For example, referring to FIG. 4A, a first keyframe 404A includes three landmarks 402A-402C and is processed to extract three respective landmark descriptors of the three landmarks 402A-402C. Similarly, each of a second keyframe 404B and a third keyframe 404C includes the three landmarks 402A-402C and is processed to extract three respective landmark descriptors of the three landmarks 402A-402C. In this way, the keyframes 404A-404C provide nine landmark descriptors associated with the three landmarks 402A-402C.
The first landmark 402A is associated with at least three landmark descriptors determined from the first keyframe 404A, the second keyframe 404B, and the third keyframe 404C. The first landmark 402A is present in each of the keyframes 404A-404C, each keyframe captured in a different one of the camera poses 406A-406C. The descriptor elimination criteria are applied to determine whether each of the landmark descriptors determined from the keyframes 404A-404C is selected to map the first landmark 402A. For example, referring to FIG. 4B, the first landmark descriptor determined from the first keyframe 404A, rather than the second landmark descriptor determined from the second keyframe 404B, is selected to map the first landmark 402A in the scene according to the descriptor elimination criteria. The first camera pose 406A associated with the first keyframe 404A includes a first camera position and a first camera orientation, and the second camera pose 406B associated with the second keyframe 404B includes a second camera position and a second camera orientation. For the first landmark 402A, a first image ray 408A connects the first camera position to the first landmark 402A, and a second image ray 408B connects the second camera position to the first landmark 402A. The first image ray 408A and the second image ray 408B, connecting the first camera position and the second camera position to the first landmark 402A, form a ray angle 410. The descriptor elimination criteria define a ray angle threshold (e.g., 5-10 degrees). If the ray angle 410 is less than the ray angle threshold, the first and second image rays 408A and 408B and the first and second camera positions meet the descriptor elimination criteria. That is, according to the descriptor elimination criteria, one of the landmark descriptors associated with the first and second keyframes 404A and 404B needs to be eliminated from mapping the first landmark 402A in the scene.
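The ray-angle portion of the descriptor elimination criteria can be sketched as follows; the function names, the example coordinates, and the 5-degree threshold are illustrative assumptions, not the disclosed implementation.

```python
# A sketch of the ray-angle test: two image rays connect two camera
# positions to the same landmark; if the angle between them falls below a
# threshold (e.g., 5-10 degrees), the pair meets the descriptor elimination
# criteria and one of the two descriptors is dropped.
import math

def ray_angle_deg(cam_a, cam_b, landmark):
    """Angle at the landmark between rays landmark->cam_a and landmark->cam_b."""
    ra = [a - l for a, l in zip(cam_a, landmark)]
    rb = [b - l for b, l in zip(cam_b, landmark)]
    dot = sum(x * y for x, y in zip(ra, rb))
    na = math.sqrt(sum(x * x for x in ra))
    nb = math.sqrt(sum(x * x for x in rb))
    cos_t = max(-1.0, min(1.0, dot / (na * nb)))  # clamp for acos
    return math.degrees(math.acos(cos_t))

def meets_elimination_criteria(cam_a, cam_b, landmark, threshold_deg=5.0):
    return ray_angle_deg(cam_a, cam_b, landmark) < threshold_deg

# Two nearby viewpoints observing a distant landmark -> small ray angle.
print(meets_elimination_criteria((0, 0, 0), (0.1, 0, 0), (0, 0, 10)))  # True
# A wide-baseline pair -> large ray angle; both descriptors are kept.
print(meets_elimination_criteria((0, 0, 0), (5, 0, 0), (0, 0, 10)))    # False
```

A small ray angle indicates the two viewpoints are nearly redundant for this landmark, which is why one of the two descriptors can be dropped with little loss of triangulation accuracy.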
Assume that the first keyframe 404A is captured earlier in time than the second keyframe 404B. In some embodiments illustrated in FIG. 4B, when one of the landmark descriptors associated with the first keyframe 404A and the second keyframe 404B needs to be eliminated from mapping the first landmark 402A based on the descriptor elimination criteria, the first landmark descriptor associated with the first keyframe 404A captured earlier in time is selected, and the second landmark descriptor associated with the second keyframe 404B captured later in time is disabled from mapping the first landmark 402A. In other words, based on the descriptor elimination criteria, the second landmark descriptor associated with the second keyframe 404B captured later in time is eliminated with a higher priority than the first landmark descriptor. Conversely, in some embodiments not shown in FIG. 4B, the first landmark descriptor associated with the first keyframe 404A captured earlier in time is eliminated with a higher priority than the second landmark descriptor based on the descriptor elimination criteria. The second landmark descriptor associated with the second keyframe 404B captured later in time is selected, and the first landmark descriptor associated with the first keyframe 404A captured earlier in time is disabled from landmark mapping.
Similarly, for the second landmark 402B, a third image ray 408C connects the second camera position to the second landmark 402B, and a fourth image ray 408D connects the first camera position to the second landmark 402B. The third image ray 408C and the fourth image ray 408D, connecting the second camera position and the first camera position to the second landmark 402B, form a ray angle 412. If the ray angle 412 is less than the ray angle threshold, the third and fourth image rays 408C and 408D and the first and second camera positions meet the descriptor elimination criteria. One of the landmark descriptors associated with the first keyframe 404A and the second keyframe 404B then needs to be eliminated from mapping the second landmark 402B in the scene. In some embodiments shown in FIG. 4B, the landmark descriptor associated with the first keyframe 404A captured earlier in time is selected to map the second landmark 402B in the scene, and the other landmark descriptor associated with the second keyframe 404B captured later in time is disabled from landmark mapping. Conversely, in some embodiments not shown in FIG. 4B, the landmark descriptor associated with the second keyframe 404B captured later in time is selected to map the second landmark 402B in the scene, and the landmark descriptor associated with the first keyframe 404A captured earlier in time is disabled from landmark mapping.
In some embodiments, after the descriptor elimination criteria are applied, the first landmark 402A, the second landmark 402B, and the third landmark 402C are mapped with a set of landmark descriptors determined from a set of keyframes 404 including the first keyframe 404A, the second keyframe 404B, and the third keyframe 404C. The set of landmark descriptors includes three landmark descriptors provided by the first keyframe 404A and three landmark descriptors provided by the third keyframe 404C, but only one landmark descriptor provided by the second keyframe 404B. No two remaining landmark descriptors of the same landmark 402 meet the descriptor elimination criteria. In other words, the set of landmark descriptors is selected in accordance with a determination that no two camera poses in the subset of different camera poses meet the descriptor elimination criteria. In an example, any two image rays connected to the same landmark 402 form a ray angle that is greater than the ray angle threshold defined by the descriptor elimination criteria.
In some embodiments, the descriptor elimination criteria eliminate one of two landmark descriptors, determined from two keyframes 404 of the same landmark 402, based on the distance between the landmark 402 and each of the two camera positions corresponding to the two keyframes 404. For example, for the first landmark 402A, suppose the ray angle 410 is less than the ray angle threshold. In accordance with a determination that a first distance, between the first landmark 402A and the first camera position of the first keyframe 404A, is greater than a second distance, between the first landmark 402A and the second camera position of the second keyframe 404B, the second landmark descriptor determined from the second keyframe 404B is eliminated. Conversely, in accordance with a determination that the ray angle 410 is less than the ray angle threshold and the first distance is less than the second distance, the first landmark descriptor determined from the first keyframe 404A is eliminated, and the second landmark descriptor determined from the second keyframe 404B is selected.
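A minimal sketch of the distance-based tie-break above, under the reading that once the ray-angle test is met, the descriptor from the camera closer to the landmark is eliminated (i.e., the farther camera's descriptor is kept); the function names, coordinates, and this keep-farther policy are assumptions for illustration.

```python
# A sketch of the distance tie-break between two camera poses whose rays to
# the same landmark meet the descriptor elimination criteria. The descriptor
# from the closer camera is dropped; the farther camera's descriptor is kept.
import math

def dist(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def select_descriptor(cam_1, cam_2, landmark):
    """Return 1 or 2: the index of the keyframe whose descriptor is kept."""
    d1, d2 = dist(cam_1, landmark), dist(cam_2, landmark)
    return 1 if d1 > d2 else 2  # keep the farther camera's descriptor

landmark = (0.0, 0.0, 10.0)
# Camera 1 is 10 units away, camera 2 only 6 -> camera 1's descriptor is kept.
print(select_descriptor((0.0, 0.0, 0.0), (0.0, 0.0, 4.0), landmark))  # -> 1
```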
Referring to FIG. 4C, in some embodiments, a target keyframe 404 (e.g., the second keyframe 404B) meets image elimination criteria and is eliminated entirely from mapping the plurality of landmarks 402 in the scene. The target keyframe 404 is captured to include a first number of landmarks 402 (e.g., 100 landmarks) corresponding to a first number of landmark descriptors. After the descriptor elimination criteria are applied, a second number of landmark descriptors corresponding to a subset of the landmarks are disabled from mapping the subset of the landmarks. In accordance with a determination that the first number or the second number meets the image elimination criteria, the target keyframe 404 is eliminated entirely, and none of the first number of landmark descriptors associated with the target keyframe 404 is used to map the plurality of landmarks 402 in the scene. Further, in some embodiments, the image elimination criteria require that the ratio of the second number to the first number exceeds a predetermined threshold (e.g., 90%). In other words, if a large portion of the landmark descriptors provided by the target keyframe 404 are eliminated, the target keyframe 404 cannot be used effectively to map the scene and needs to be eliminated, freeing the memory used to store information of the keyframe 404 for more efficient use.
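The image elimination criteria above reduce to a ratio test, which can be sketched as follows; the function name and the 90% default threshold are illustrative assumptions.

```python
# A sketch of the image elimination criteria: a keyframe that provided
# first_count descriptors, of which eliminated_count were dropped by the
# descriptor elimination criteria, is culled entirely when the eliminated
# fraction exceeds a threshold (e.g., 90%).
def keyframe_is_eliminated(first_count, eliminated_count, ratio_threshold=0.9):
    if first_count == 0:
        return True  # a keyframe describing no landmarks carries no map information
    return eliminated_count / first_count > ratio_threshold

# 95 of 100 descriptors dropped -> the whole keyframe is removed.
print(keyframe_is_eliminated(100, 95))  # True
# 40 of 100 dropped -> the keyframe survives.
print(keyframe_is_eliminated(100, 40))  # False
```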
In addition, when the target keyframe 404 is eliminated, any landmark descriptor associated with the target keyframe that had been selected, based on the descriptor elimination criteria, to map one of the plurality of landmarks is deselected from mapping that landmark.
In an example, the target keyframe 404 is the second keyframe 404B, which is processed to provide three landmark descriptors for the three landmarks 402A-402C. Based on the descriptor elimination criteria, two of the three landmark descriptors determined from the second keyframe 404B are eliminated, and only one of the three landmark descriptors remains, associated with the third landmark 402C. Given that two of the three landmark descriptors provided by the keyframe 404B are eliminated, the second keyframe 404B meets the image elimination criteria in this example and is eliminated entirely from mapping the scene. Although one of the three landmark descriptors is not eliminated by the descriptor elimination criteria, that landmark descriptor is still eliminated from mapping the third landmark 402C along with the second keyframe 404B.
FIG. 5A is a diagram of a virtual space 500 mapped with a plurality of landmarks 402 associated with a first set of keyframes 520, and FIG. 5B is another diagram of the virtual space 500 mapped with the plurality of landmarks 402 associated with a second set of keyframes 540, according to some embodiments. The second set of keyframes 540 is downsampled from the first set of keyframes 520, and the first set of keyframes 520 includes the second set of keyframes 540. The first set of keyframes 520 has one or more additional keyframes beyond the second set of keyframes 540. The one or more additional keyframes are eliminated, unselected, or deselected based on a combination of the descriptor elimination criteria and the image elimination criteria. In this way, the landmark descriptors determined from the second set of keyframes 540 can adequately map the scene and identify the current camera pose of a current frame 504C, while requiring less memory to store the information of the second set of keyframes 540, including the landmark descriptors determined from the second set of keyframes 540.
The camera 260 is placed in a scene and captures a plurality of images. The plurality of images are applied as keyframes 404 to map the scene to the virtual space 500. The scene includes a plurality of objects 502 associated with the plurality of landmarks 402. For example, each landmark 402 is associated with a corner or edge of a respective object 502. Each keyframe 404 records a portion of the scene that includes a subset of the objects 502 and, thus, a subset of the landmarks 402. Each keyframe 404 is captured while the camera 260 has a camera pose 406 (i.e., is at a camera position with a camera orientation). The subset of landmarks 402 in each keyframe 404 is determined by the camera pose 406 corresponding to the respective keyframe 404. For each landmark in the subset of landmarks 402, the respective keyframe 404 is processed to extract a landmark descriptor for the respective landmark 402. From the perspective of each landmark 402, the respective landmark 402 is recorded in one or more keyframes 404, and each keyframe 404 provides a landmark descriptor, associated with the respective camera pose 406, to map the respective landmark in the virtual space 500.
The first set of keyframes 520 provides a plurality of landmark descriptors. Each keyframe in the first set of keyframes 520 provides one or more respective landmark descriptors to map one or more respective landmarks 402. Applying the descriptor elimination criteria eliminates a subset of the plurality of landmark descriptors. In some embodiments, a first subset of keyframes 404-1 is eliminated entirely, because all of the one or more respective landmark descriptors provided by each keyframe 404-1 are disabled from landmark mapping based on the descriptor elimination criteria. In some embodiments, a second subset of keyframes 404-2 is not eliminated by the descriptor elimination criteria. Each keyframe 404-2 provides a plurality of respective landmark descriptors, and at least one of the respective landmark descriptors provided by each keyframe 404-2 is disabled from landmark mapping based on the descriptor elimination criteria. In some embodiments, a third subset of keyframes 404-3 is not affected by the descriptor elimination criteria, and all of the one or more respective landmark descriptors provided by each keyframe 404-3 are selected to map the respective landmarks 402.
In some embodiments, the image elimination criteria are further applied to determine whether each keyframe 404 in the second subset of keyframes 404-2 is eliminated. The image elimination criteria do not affect the third subset of keyframes 404-3, which does not correspond to any landmark descriptor eliminated by the descriptor elimination criteria. Each keyframe in the second subset of keyframes 404-2 provides a respective first number of landmark descriptors for a respective first number of landmarks 402, and a respective second number of landmark descriptors are unselected due to the descriptor elimination criteria. In some embodiments, the second subset of keyframes 404-2 includes one or more keyframes 404-2A. For each keyframe 404-2A, the respective first number or second number meets the image elimination criteria (e.g., the criteria require that the ratio of the respective second number to the first number exceeds a predetermined threshold), and the respective keyframe 404-2A is eliminated and therefore not shown in the second set of keyframes 540 in FIG. 5B. In some embodiments, the second subset of keyframes 404-2 includes one or more keyframes 404-2B. For each keyframe 404-2B, the respective first or second number does not meet the image elimination criteria, and the respective keyframe 404-2B is not eliminated, as shown in the second set of keyframes 540 in FIG. 5B.
In some embodiments, the second set of keyframes 540 includes at least one keyframe 404-2B and at least one keyframe 404-3. Optionally, in some embodiments, not shown, the second set of keyframes 540 does not include any keyframes 404-2B, and any keyframes 404-2 having unselected landmark descriptors are eliminated after the image elimination criteria are applied. Optionally, in some embodiments not shown, the second set of keyframes 540 does not include any keyframes 404-3, and all keyframes 404 are affected by the descriptor elimination criteria. The first set of keyframes 520 is reduced to a second set of keyframes 540 based on the descriptor or image elimination criteria. For each landmark 402 in the scene, one or more landmark descriptors are selected in accordance with a determination that any two of the one or more associated camera poses do not meet the descriptor elimination criteria. Mapping data for the scene is generated based on information of a second set of keyframes 540, the second set of keyframes 540 including one or more landmark descriptors corresponding to each landmark 402 in the scene and one or more associated camera poses 406. Each landmark descriptor corresponds to a respective keyframe in the second set of keyframes 540 and a respective camera pose 406. The first set of keyframes 520 is compressed into a second set of keyframes 540, and the second set of keyframes 540 includes fewer keyframes than the first set of keyframes 520, thereby saving the memory space required to store keyframe related information.
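The two-stage compression of the first set of keyframes 520 into the second set 540 can be sketched as follows; the data structures (`observations` mapping keyframe ids to the landmark ids they describe, `eliminated` marking descriptors dropped in stage one) and the 60% demo threshold are assumptions for illustration.

```python
# A compact sketch of the two-stage compression: the descriptor elimination
# criteria first prune per-landmark descriptors, then the image elimination
# criteria cull keyframes whose eliminated-descriptor fraction is too high.
def compress_keyframes(observations, eliminated, ratio_threshold=0.9):
    kept = {}
    for kf, landmarks in observations.items():  # landmarks assumed non-empty
        dropped = {lm for lm in landmarks if (kf, lm) in eliminated}
        if len(dropped) / len(landmarks) > ratio_threshold:
            continue                    # cull the keyframe entirely
        kept[kf] = landmarks - dropped  # keep only surviving descriptors
    return kept

obs = {"kf1": {"L1", "L2", "L3"}, "kf2": {"L1", "L2", "L3"}}
elim = {("kf2", "L1"), ("kf2", "L2")}   # stage one dropped 2 of kf2's 3 descriptors
# With a 60% threshold, kf2's 2/3 eliminated fraction culls it entirely.
print(sorted(compress_keyframes(obs, elim, ratio_threshold=0.6)))  # -> ['kf1']
```

Note that a keyframe that survives the ratio test keeps only its non-eliminated descriptors, while a culled keyframe loses even the descriptors the first stage had kept, mirroring the behavior described for the target keyframe above.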
Referring to fig. 5B, after creating the mapping data of the scene, the current frame 504C is used for SLAM, i.e., to identify the current camera pose of the camera 260 and update the mapping of the scene in a synchronized manner. After the electronic device (e.g., AR glasses 104D) obtains the current frame 504C, the electronic device extracts a plurality of feature points 506 (e.g., based on CNN) from the current frame. Each feature point of the plurality of feature points 506 corresponds to an image descriptor determined from the current frame 504C. For each feature point of the plurality of feature points 506, the image descriptor is compared to the mapping data to identify a matching landmark 402 based on the second set of keyframes 540. The current camera pose that captures the current frame 504C is determined based on the corresponding camera pose corresponding to the matching landmark 402.
In some embodiments, the current camera pose is interpolated from two or more camera poses corresponding to two or more keyframes 404 in the second set of keyframes 540. The plurality of feature points 506 of the current frame 504C includes a first feature point 506A. The electronic device determines that a first image descriptor of the first feature point 506A matches a combination of two landmark descriptors of the same landmark 402. The two matching landmark descriptors are determined from two different keyframes 404. The current camera pose of the current frame 504C is determined based on the two camera poses corresponding to the two different keyframes 404 from which the two landmark descriptors are determined. For example, the first image descriptor of the first feature point 506A in the current frame 504C matches a combination of two landmark descriptors of the first landmark 402A. The two landmark descriptors of the first landmark 402A are determined from two keyframes 508 and 510. The current camera pose of the current frame 504C is determined based on the two camera poses corresponding to the two keyframes 508 and 510. For example, the current camera pose of the current frame 504C is equal to a weighted average of the two camera poses of the two keyframes 508 and 510.
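The weighted-average interpolation above can be sketched as follows. For brevity only the camera positions are averaged; a full camera pose would also require interpolating the orientation (e.g., with quaternion slerp). The weights and coordinates are illustrative assumptions.

```python
# A sketch of interpolating the current camera position as a weighted
# average of the positions of two keyframes (e.g., keyframes 508 and 510).
def interpolate_position(pos_a, pos_b, w_a=0.5):
    """Weighted average of two 3D camera positions; w_a + w_b = 1."""
    w_b = 1.0 - w_a
    return tuple(w_a * a + w_b * b for a, b in zip(pos_a, pos_b))

kf_a = (0.0, 0.0, 0.0)   # position of the first keyframe
kf_b = (2.0, 0.0, 4.0)   # position of the second keyframe
print(interpolate_position(kf_a, kf_b, w_a=0.25))  # -> (1.5, 0.0, 3.0)
```

In practice the weights could reflect descriptor matching scores, so the keyframe whose descriptor matches the current feature more strongly contributes more to the estimated pose.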
Additionally, in some embodiments, a second feature point 506B is extracted from the current frame 504C and corresponds to a second image descriptor determined from the current frame 504C. In accordance with a determination that the second image descriptor does not match any landmark descriptor in the mapping data, the corresponding landmark cannot be located from the second set of keyframes 540, and the mapping data is updated by associating the second landmark 402B with the second image descriptor of the second feature point 506B and the current camera pose associated with the current frame 504C. Conversely, in accordance with a determination that the second image descriptor matches a subset of the landmark descriptors in the mapping data, the mapping data is not updated with information related to the second image descriptor.
FIG. 6 is a flow chart of a method 600 of simultaneous localization and mapping (SLAM) according to some embodiments. In some embodiments, the method is applied in AR glasses 104D, robotic systems, or autonomous vehicles. For convenience, the method 600 is described as being implemented by the electronic system 200 (e.g., the client device 104). In an example, the application method 600 determines and predicts pose, maps scenes, and renders virtual and real content simultaneously in augmented reality (e.g., VR, AR). Method 600 is optionally governed by instructions stored in a non-transitory computer readable storage medium and executed by one or more processors of an electronic system. Each of the operations shown in fig. 6 may correspond to instructions stored in a computer memory or a non-transitory computer readable storage medium (e.g., memory 206 of electronic system 200 in fig. 2). The computer-readable storage medium may include a magnetic or optical disk storage device, a solid state storage device such as flash memory, or other non-volatile memory device. The instructions stored on the computer-readable storage medium may include one or more of the following: source code, assembly language code, object code, or other instruction format that is interpreted by one or more processors. Some operations in method 600 may be combined and/or the order of some operations may be changed.
The electronic system has a camera 260 and obtains 602 image data comprising a plurality of images (also referred to as keyframes) 404 captured by the camera 260 in a scene. Each image of the plurality of images 404 includes a first landmark 402A. The electronic system generates (604) a plurality of landmark descriptors for the first landmark 402A from the image data and identifies (606) a plurality of camera poses 406 of the plurality of landmark descriptors. Each landmark descriptor is generated from a different image 404 including the first landmark 402A and captured in a different camera pose 406 (608). In accordance with a determination that two of the plurality of camera poses 406 satisfy the descriptor elimination criteria (e.g., are substantially proximate to each other), the electronic system selects (610) a first landmark descriptor corresponding to a first one of the two of the plurality of camera poses 406 to map a first landmark 402A in the scene.
In some embodiments, in accordance with a determination that two of the plurality of camera poses 406 satisfy the descriptor elimination criteria, the electronic system does not select (i.e., disable) (612) a second, different landmark descriptor corresponding to a second of the two of the plurality of camera poses 406 from mapping the first landmark 402A in the scene. Further, in some embodiments, the first landmark descriptor is generated (614) from the first image 404A, the first image 404A is captured earlier than the second image 404B that generated the second different landmark descriptor, and the second different landmark descriptor is eliminated with a higher priority than the first landmark descriptor based on the descriptor elimination criteria.
In some embodiments, the electronic system maps the first landmark 402A with a subset of the plurality of landmark descriptors. The subset of the plurality of landmark descriptors corresponds to a subset of different camera poses 406 that capture the subset of the plurality of images 404. A subset of the plurality of landmark descriptors is selected based on a determination that any two camera poses in the subset of different camera poses 406 do not meet the descriptor elimination criteria.
In some embodiments, the plurality of images includes (616) a target image having a first number of landmarks corresponding to a first number of landmark descriptors. For the target image, based on the descriptor elimination criteria, the electronic system determines (618) that a second number of landmark descriptors corresponding to a subset of the landmarks are disabled from mapping the subset of the landmarks. In accordance with a determination that the first number or the second number meets the image elimination criteria, the electronic system eliminates (i.e., ceases to select) (620) the first number of landmark descriptors associated with the target image from mapping the plurality of landmarks in the scene. Further, in some embodiments, the image elimination criteria require (622) that a ratio of the second number to the first number exceeds a predetermined threshold. In some embodiments, the target keyframe corresponds to the first of the two camera poses of the plurality of camera poses, and the subset of landmarks is different from the first landmark. Eliminating the first number of landmark descriptors includes ceasing to select the first landmark descriptor for mapping the first landmark 402A. In an example, if 80% or more of the descriptors of the target image are eliminated and disabled from mapping landmarks in the scene, then all descriptors associated with the target image are eliminated. None of the target image, the corresponding camera pose, or the landmark descriptors is stored for SLAM.
In some embodiments, each different camera pose includes a respective camera position and a respective camera orientation. For each landmark descriptor of the first landmark 402A, the electronic system identifies a respective image ray 408 connecting the respective camera position to the first landmark 402A, identifies a ray angle (e.g., 410 or 412 in FIG. 4B) between two image rays connecting two camera positions of the plurality of camera poses 406 to the first landmark 402A, and determines whether the ray angle is less than a ray angle threshold. In accordance with a determination that the ray angle is less than the ray angle threshold, it is determined that the two camera poses of the plurality of camera poses 406 satisfy the descriptor elimination criteria. Furthermore, in some embodiments, the ray angle threshold is in the range of [5°, 10°].
In some embodiments, mapping data for a scene is generated (626). The mapping data includes one or more landmark descriptors corresponding to each landmark in the scene and one or more associated camera poses 406. Each landmark descriptor corresponds to a respective image and a respective camera pose. Further, in some embodiments, for each landmark in the scene, one or more landmark descriptors are selected in accordance with a determination that any two of the one or more associated camera poses 406 do not meet the descriptor elimination criteria.
Referring to fig. 5B, in some embodiments, the electronic system obtains a current frame 504C and extracts a plurality of feature points 506 from the current frame. Each feature point of the plurality of feature points 506 corresponds to an image descriptor determined from the current frame 504C. For each feature point of the plurality of feature points 506, the electronic system compares the image descriptor with the mapping data to identify a matching landmark 402 from the plurality of landmarks 402 and determines a current camera pose for capturing the current frame based on a respective camera pose corresponding to the matching landmark.
In addition, in some embodiments, the plurality of feature points 506 includes the first feature point 506A. The electronic device determines that the first image descriptor of the first feature point 506A matches a combination of two landmark descriptors of the same landmark 402, and determines the current camera pose of the current frame based on the two camera poses 406 corresponding to the two images (e.g., 508 and 510 in FIG. 5B) from which the two landmark descriptors are determined.
Further, in some embodiments, the electronic system extracts the second feature point 506B from the current frame 504C. The second feature point 506B corresponds to a second image descriptor determined from the current frame 504C and, in accordance with a determination that the second image descriptor does not match a landmark descriptor in the mapping data, the mapping data is updated by associating the second landmark 402B with the second image descriptor and the current camera pose.
It should be understood that the particular order of the operations in FIG. 6 that has been described is merely exemplary and is not intended to indicate that the described order is the only order in which the operations may be performed. Those of ordinary skill in the art will recognize various ways to use descriptors for SLAM and image rendering as described herein. In addition, it should be noted that the details of the other processes described above with respect to FIGS. 4A-4C and 5A-5B also apply in a similar manner to the method 600 described above with respect to FIG. 6. For brevity, these details are not repeated here.
The present application relates to a landmark descriptor downsampling method. The camera pose 406 capturing a keyframe 404 is connected with a landmark 402 identified in the keyframe 404 by an image ray 408. Between the image rays connecting a respective landmark 402 to a plurality of camera poses 406, a ray angle is formed at each landmark 402. For each small ray angle (e.g., ray angles 410 and 412), one of the two image rays forming the respective small ray angle is disconnected, and the camera pose 406 connected by that image ray is disabled from providing a landmark descriptor to map the respective landmark 402. The camera pose 406 connected by the other of the two image rays is selected to provide a landmark descriptor to map the respective landmark 402 in the scene. In some embodiments, the scene is further divided into a plurality of map tiles, and the landmark descriptors are further archived according to the plurality of map tiles. For further details regarding the organization of map data using map tile data structures, reference is made to International Application No. PCT/CN2021/076578, entitled "Positioning Method, Electronic Device, and Storage Medium", filed on February 10, 2021, which is incorporated herein by reference. Accordingly, the present application relates to downsampling keyframe-related information (e.g., camera poses, descriptors) archived in a map tile data structure.
In some embodiments, after image rays and descriptors are discarded based on the descriptor elimination criteria, a subset of the camera poses (e.g., the second camera pose 406B in fig. 4B) retain only a small number of indexes pointing to particular landmarks 402 for which their image ray and descriptor information is still used. These camera poses are considered unimportant for improving the 3D depth accuracy of the landmarks 402 and may be replaced by other key frames. Image elimination criteria are further applied to remove the subset of camera poses whose image rays were largely discarded under the descriptor elimination criteria. In these ways, applying the descriptor and image elimination criteria reduces the number of key frames or the number of landmark descriptors stored for accurately mapping a scene.
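The image elimination criterion can be illustrated with a short sketch. The dictionary fields and the 0.7 ratio threshold are assumptions for illustration; the application only requires that the ratio of discarded to total descriptors exceed a predetermined threshold (claim 6).

```python
def prune_keyframes(keyframes, discard_ratio_threshold=0.7):
    """Drop a keyframe when too many of its landmark descriptors were
    already discarded by the descriptor elimination criterion.

    Each keyframe is a dict with `total`, the number of landmark
    descriptors it originally contributed, and `discarded`, the number
    removed by the descriptor elimination criterion."""
    kept = []
    for kf in keyframes:
        if kf["total"] == 0:
            continue  # keyframe no longer contributes any descriptors
        if kf["discarded"] / kf["total"] > discard_ratio_threshold:
            continue  # unimportant for 3D depth accuracy; replaceable
        kept.append(kf)
    return kept
```

A keyframe that lost 9 of its 10 descriptors to the descriptor criterion would be pruned under this sketch, while one that lost only 2 of 10 survives.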
Alternatively, the number of key frames or the number of landmark descriptors stored for mapping the scene is downsampled based on an amount of camera motion, which is compared against a camera motion threshold. The descriptor elimination criteria require that one of two camera poses 406 be eliminated if the distance between the two camera poses 406 is less than the camera motion threshold. In some embodiments, the camera motion threshold varies with the distance between the respective landmark 402 and the two camera poses 406, e.g., the shorter the distance between the landmark 402 and the midpoint of the two camera poses 406, the smaller the camera motion threshold.
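A minimal sketch of this motion-based criterion follows. The linear dependence on landmark depth and the 0.1 scale factor are assumptions for illustration; the application states only that the threshold shrinks as the landmark approaches the midpoint of the two camera positions.

```python
import numpy as np

def motion_threshold(landmark, pose_a, pose_b, scale=0.1):
    """Camera motion threshold that shrinks as the landmark gets
    closer to the midpoint of the two camera positions."""
    midpoint = (np.asarray(pose_a, dtype=float) + np.asarray(pose_b, dtype=float)) / 2.0
    depth = np.linalg.norm(np.asarray(landmark, dtype=float) - midpoint)
    return scale * depth

def should_eliminate(landmark, pose_a, pose_b, scale=0.1):
    """Motion-based descriptor elimination criterion: eliminate one of
    the two poses when their separation (baseline) is below the
    depth-dependent camera motion threshold."""
    baseline = np.linalg.norm(np.asarray(pose_a, dtype=float) - np.asarray(pose_b, dtype=float))
    return baseline < motion_threshold(landmark, pose_a, pose_b, scale)
```

Under this sketch, two poses 0.5 m apart viewing a landmark 10 m away are redundant (baseline below threshold), whereas poses 3 m apart are both kept; a nearby landmark tolerates a smaller baseline before the pair is considered redundant.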
The terminology used in the description of the various embodiments described herein is for the purpose of describing particular embodiments only and is not intended to be limiting. As used in the description of the various embodiments described and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. In addition, it will be understood that, although the terms "first," "second," etc. may be used herein to describe various elements, these elements should not be limited by these terms. These terms are only used to distinguish one element from another element.
As used herein, the term "if" is optionally interpreted to mean "when … …" or "at … …" or "in response to a determination" or "in response to detection" or "in accordance with … …" depending on the context. Similarly, the phrase "if a determination" or "if a [ stated condition or event ] is detected" is optionally interpreted depending on the context to mean "upon determination … …" or "in response to a determination" or "upon detection of a [ stated condition or event ]" or "in response to detection of a [ stated condition or event ]" or "in accordance with a determination of a [ stated condition or event ] detected".
The foregoing description has, for purposes of explanation, been presented with reference to specific embodiments. However, the illustrative discussions above are not intended to be exhaustive or to limit the claims to the precise forms disclosed. Many modifications and variations are possible in light of the above teachings. The embodiments were chosen and described in order to best explain the principles of operation and their practical applications, thereby enabling others skilled in the art to practice them.
Although the various figures show a plurality of logic stages in a particular order, stages that are not order dependent may be reordered, and other stages may be combined or split apart. While some reordering or other groupings are specifically mentioned, other orderings and groupings will be apparent to those of ordinary skill in the art, and thus the orderings and groupings presented herein are not an exhaustive list of alternatives. Furthermore, it should be appreciated that these stages may be implemented in hardware, firmware, software, or any combination thereof.

Claims (16)

1. A method implemented at an electronic system having a camera, comprising:
obtaining image data, the image data comprising a plurality of images of a scene captured by the camera, each image of the plurality of images including a first landmark;
generating a plurality of landmark descriptors of the first landmark from the image data;
identifying a plurality of camera poses for the plurality of landmark descriptors, wherein each landmark descriptor is generated from a distinct image that includes the first landmark and is captured at a distinct camera pose; and
in accordance with a determination that two camera poses of the plurality of camera poses satisfy a descriptor elimination criterion, selecting a first landmark descriptor corresponding to a first camera pose of the two camera poses of the plurality of camera poses to map the first landmark in the scene.

2. The method of claim 1, further comprising:
in accordance with the determination that the two camera poses of the plurality of camera poses satisfy the descriptor elimination criterion, disabling a second distinct landmark descriptor corresponding to a second camera pose of the two camera poses of the plurality of camera poses from mapping the first landmark in the scene.

3. The method of claim 2, wherein the first landmark descriptor is generated from a first image that is captured earlier than a second image from which the second distinct landmark descriptor is generated, and based on the descriptor elimination criterion, the second distinct landmark descriptor is eliminated with a higher priority than the first landmark descriptor.

4. The method of any of the preceding claims, further comprising:
mapping the first landmark with a subset of the plurality of landmark descriptors;
wherein the subset of the plurality of landmark descriptors corresponds to a subset of the distinct camera poses that capture a subset of the plurality of images; and
wherein the subset of the plurality of landmark descriptors is selected in accordance with a determination that no two camera poses in the subset of the distinct camera poses satisfy the descriptor elimination criterion.

5. The method of any of the preceding claims, wherein the plurality of images includes a target image having a first number of landmarks corresponding to the first number of landmark descriptors, the method further comprising, for the target image:
determining, based on the descriptor elimination criterion, that a second number of landmark descriptors corresponding to a subset of the landmarks are disabled from mapping the subset of the landmarks; and
in accordance with a determination that the first number or the second number satisfies an image elimination criterion, eliminating the first number of landmark descriptors associated with the target image from mapping the plurality of landmarks in the scene.

6. The method of claim 5, wherein the image elimination criterion requires that a ratio of the second number to the first number exceed a predetermined threshold.

7. The method of claim 6, wherein the target image corresponds to the first camera pose of the two camera poses of the plurality of camera poses, and the subset of the landmarks is distinct from the first landmark, and wherein eliminating the first number of landmark descriptors from mapping the plurality of landmarks further comprises:
ceasing to select the first landmark descriptor to map the first landmark.

8. The method of any of the preceding claims, wherein each distinct camera pose includes a respective camera position and a respective camera orientation, the method further comprising:
for each landmark descriptor of the first landmark, identifying a respective image ray connecting the respective camera position to the first landmark;
identifying a ray angle between two image rays connecting two camera positions of the plurality of camera poses to the first landmark; and
determining that the ray angle is less than a ray angle threshold, wherein the two camera poses of the plurality of camera poses are determined to satisfy the descriptor elimination criterion in accordance with the determination that the ray angle is less than the ray angle threshold.

9. The method of claim 8, wherein the ray angle threshold is in a range of [5°-10°].

10. The method of any of the preceding claims, further comprising:
identifying a plurality of landmarks in the scene, the plurality of landmarks including the first landmark; and
generating mapping data of the scene, the mapping data including, for each landmark in the scene, one or more landmark descriptors and one or more associated camera poses, wherein each landmark descriptor corresponds to a respective image and a respective camera pose.

11. The method of claim 10, wherein, for each landmark in the scene, the one or more landmark descriptors are selected in accordance with a determination that no two camera poses of the one or more associated camera poses satisfy the descriptor elimination criterion.

12. The method of claim 10, further comprising:
obtaining a current frame;
extracting a plurality of feature points from the current frame, wherein each feature point of the plurality of feature points corresponds to an image descriptor determined from the current frame;
for each feature point of the plurality of feature points, comparing the image descriptor with the mapping data to identify a matching landmark from the plurality of landmarks; and
determining a current camera pose at which the current frame is captured based on the respective camera poses corresponding to the matching landmarks.

13. The method of claim 12, wherein the plurality of feature points includes a first feature point, the method further comprising:
determining that a first image descriptor of the first feature point is a combination of two landmark descriptors of the matching landmark; and
determining the current camera pose of the current frame based on two camera poses corresponding to the two images from which the two landmark descriptors are determined.

14. The method of claim 12, further comprising:
extracting a second feature point from the current frame, wherein the second feature point corresponds to a second image descriptor determined from the current frame; and
in accordance with a determination that the second image descriptor does not match the landmark descriptors in the mapping data, updating the mapping data by associating a second landmark with the second image descriptor of the second landmark and the current camera pose of the current frame.

15. An electronic system, comprising:
one or more processors; and
memory having instructions stored thereon, which when executed by the one or more processors cause the processors to perform the method of any of claims 1-14.

16. A non-transitory computer-readable medium having instructions stored thereon, which when executed by one or more processors cause the processors to perform the method of any of claims 1-14.
CN202280094476.9A 2022-04-06 2022-04-06 Keyframe downsampling for reducing memory usage in SLAM Pending CN118985007A (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/US2022/023667 WO2023195982A1 (en) 2022-04-06 2022-04-06 Keyframe downsampling for memory usage reduction in slam

Publications (1)

Publication Number Publication Date
CN118985007A true CN118985007A (en) 2024-11-19

Family

ID=88243339

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202280094476.9A Pending CN118985007A (en) 2022-04-06 2022-04-06 Keyframe downsampling for reducing memory usage in SLAM

Country Status (2)

Country Link
CN (1) CN118985007A (en)
WO (1) WO2023195982A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2025263989A1 (en) * 2024-06-18 2025-12-26 Samsung Electronics Co., Ltd. System and method for a re-localization of a head-mounted display device

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
AU2011305154B2 (en) * 2010-09-24 2015-02-05 Irobot Corporation Systems and methods for VSLAM optimization
US10546385B2 (en) * 2016-02-25 2020-01-28 Technion Research & Development Foundation Limited System and method for image capture device pose estimation
US11313684B2 (en) * 2016-03-28 2022-04-26 Sri International Collaborative navigation and mapping
US11276201B1 (en) * 2020-06-01 2022-03-15 Snap Inc. Localizing an augmented reality device

Also Published As

Publication number Publication date
WO2023195982A1 (en) 2023-10-12

Similar Documents

Publication Publication Date Title
US11165959B2 (en) Connecting and using building data acquired from mobile devices
US11632516B2 (en) Capture, analysis and use of building data from mobile devices
CN113811920B (en) Distributed pose estimation
CN109891189B (en) Planned photogrammetry
WO2021178980A1 (en) Data synchronization and pose prediction in extended reality
CA3069813C (en) Capturing, connecting and using building interior data from mobile devices
US12401774B2 (en) Matching segments of video for virtual display of a space
WO2023091131A1 (en) Methods and systems for retrieving images based on semantic plane features
CN118985007A (en) Keyframe downsampling for reducing memory usage in SLAM
CN119365892A (en) Depth estimation for SLAM systems using a monocular camera
CN119563195A (en) Closed-loop pose detection and mapping in SLAM
WO2023069591A1 (en) Object-based dual cursor input and guiding system
WO2023101662A1 (en) Methods and systems for implementing visual-inertial odometry based on parallel simd processing
WO2023219612A1 (en) Adaptive resizing of manipulatable and readable objects
WO2023219615A1 (en) Tracking of multiple extended reality devices
WO2023075765A1 (en) Depth image based slam
WO2023191810A1 (en) Map tile based slam
CN120167034A (en) Light estimation using ambient light sensors for augmented reality applications
US20250322603A1 (en) Bundle adjustment in simultaneous localization and mapping
CN119183554A (en) Context-based gesture recognition
WO2023277903A1 (en) Dual camera based monocular slam architecture
WO2023063937A1 (en) Methods and systems for detecting planar regions using predicted depth
WO2024232882A1 (en) Systems and methods for multi-view depth estimation using simultaneous localization and mapping
WO2023091129A1 (en) Plane-based camera localization
WO2023027691A1 (en) Automatic data-driven human skeleton labelling

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination