WO2024186746A1 - Methods, systems, and computer readable media for colonoscopic blind spot detection - Google Patents
- Publication number
- WO2024186746A1 (PCT/US2024/018372)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- colonic
- model
- colonoscopy
- colon
- camera
- Prior art date
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16H—HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
- G16H30/00—ICT specially adapted for the handling or processing of medical images
- G16H30/20—ICT specially adapted for the handling or processing of medical images for handling medical images, e.g. DICOM, HL7 or PACS
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/20—Analysis of motion
- G06T7/246—Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/50—Depth or shape recovery
- G06T7/55—Depth or shape recovery from multiple images
- G06T7/579—Depth or shape recovery from multiple images from motion
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16H—HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
- G16H30/00—ICT specially adapted for the handling or processing of medical images
- G16H30/40—ICT specially adapted for the handling or processing of medical images for processing medical images, e.g. editing
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16H—HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
- G16H50/00—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
- G16H50/20—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for computer-aided diagnosis, e.g. based on medical expert systems
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16H—HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
- G16H50/00—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
- G16H50/50—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for simulation or modelling of medical disorders
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16H—HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
- G16H50/00—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
- G16H50/70—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for mining of medical data, e.g. analysing previous cases of other patients
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/10—Image acquisition modality
- G06T2207/10068—Endoscopic image
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20081—Training; Learning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20084—Artificial neural networks [ANN]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/30—Subject of image; Context of image processing
- G06T2207/30004—Biomedical image processing
- G06T2207/30028—Colon; Small intestine
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/30—Subject of image; Context of image processing
- G06T2207/30004—Biomedical image processing
- G06T2207/30096—Tumor; Lesion
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/30—Subject of image; Context of image processing
- G06T2207/30244—Camera pose
Definitions
- a colonoscopy procedure involves inserting a colonoscopy probe, which includes a video camera, into the colon of a patient, capturing video frames of the surfaces of the colon, including any polyps, and removing the polyps with a tool that is also part of the colonoscopy probe.
- One problem with colonoscopy procedures is failure to capture views of some regions of the colon. These uncaptured regions are referred to as blind spots. Blind spots can occur because of lighting conditions, or because colonic structures, blood, or other material occludes the view of the camera.
- a system for colonoscopic blind spot detection includes a frame classifier for receiving, as input, a video stream captured by a colonoscopic camera during a colonoscopy procedure and for selecting, as output, video frames from the video stream containing information for generating a model of colonic surfaces.
- the system further includes a pixel depth and surface normal identifier for identifying pixel depths and surface normals of the colonic surfaces.
- the system further includes a normal refinement module for refining the surface normals using scene illumination information.
- the system further includes a normal-depth refinement module for refining the pixel depths using the refined surface normals.
- the system further includes a camera pose estimator for estimating a camera pose for each of the video frames using the refined surface normals and refined depths.
- the system further includes a fusion module for generating the model of the colonic surfaces using the camera poses.
- the system further includes a blind spot detector for identifying blind spots in the model.
- the system further includes a display for displaying indications of the blind spots.
- the frame classifier is configured to exclude from the output video frames in which the information for generating the model of colonic surfaces is occluded by water, blood, or close occluders.
- the system further includes a frame lighting consistency neural network for adjusting lighting in the video frames output from the frame classifier such that the lighting is consistent across frames.
- the camera pose estimator utilizes simultaneous localization and mapping (SLAM) to estimate the camera poses.
- the camera pose estimator utilizes direct sparse odometry (DSO) SLAM to estimate the camera poses.
- the fusion module utilizes surfel meshing to generate the model of the colonic surfaces.
- the blind spot detector detects the blind spots by constructing a cylinder from a reconstructed section of the colon from the model, flattening the cylinder, and identifying the blind spots by analyzing pixel intensities of pixels on the flattened cylinder.
- the system further includes a haustral ridge identifier for uniquely identifying haustral ridges in the model.
- the display is configured to display the indication of the blind spots in real time during the colonoscopy procedure.
- a method for colonoscopic blind spot detection includes receiving, as input, a video stream captured by a colonoscopic camera during a colonoscopy procedure and selecting, as output, video frames from the colonoscopic video stream containing information for generating a model of colonic surfaces.
- the method further includes identifying pixel depths and surface normals of the colonic surfaces.
- the method further includes refining the surface normals using scene illumination information.
- the method further includes refining the pixel depths using the refined surface normals.
- the method further includes estimating a camera pose for each of the video frames using the refined surface normals and refined depths.
- the method further includes generating the model of the colonic surfaces using the camera poses.
- the method further includes identifying blind spots in the model.
- the method further includes displaying indications of the blind spots.
- the method further includes excluding from the output video frames in which the information for generating the model of colonic surfaces is occluded by water, blood, or close occluders.
- the method includes adjusting lighting in the video frames output from the frame classifier such that the lighting is consistent across frames.
- estimating the camera pose includes utilizing simultaneous localization and mapping (SLAM).
- utilizing SLAM to estimate the camera poses includes utilizing direct sparse odometry (DSO) SLAM to estimate the camera poses.
- generating the model includes using surfel meshing.
- detecting the blind spots includes constructing a cylinder from a reconstructed section of the colon from the model, flattening the cylinder, and identifying the blind spots by analyzing pixel intensities of pixels on the flattened cylinder.
- the method further includes uniquely identifying haustral ridges in the model.
- displaying the indications of the blind spots includes displaying the indications in real time during the colonoscopy procedure.
- a non-transitory computer readable medium having stored thereon executable instructions that when executed by a processor of a computer control the computer to perform steps is provided. The steps include receiving, as input, a video stream captured by a colonoscopic camera during a colonoscopy procedure and selecting, as output, video frames from the video stream containing information for generating a model of colonic surfaces. The steps further include identifying pixel depths and surface normals of the colonic surfaces.
- the steps further include refining the surface normals using scene illumination information.
- the steps further include refining the pixel depths using the refined surface normals.
- the steps further include estimating a camera pose for each of the video frames using the refined surface normals and refined depths.
- the steps further include generating the model of the colonic surfaces using the camera poses.
- the steps further include identifying blind spots in the model.
- the steps further include displaying indications of the blind spots.
- the subject matter described herein can be implemented using a non-transitory computer readable medium having stored thereon computer executable instructions that when executed by the processor of a computer control the computer to perform steps.
- Exemplary computer readable media suitable for implementing the subject matter described herein include non-transitory computer-readable media, such as disk memory devices, chip memory devices, programmable logic devices, and application specific integrated circuits.
- a computer readable medium that implements the subject matter described herein may be located on a single device or computing platform or may be distributed across multiple devices or computing platforms.
- Figure 1 illustrates an outside view of a length of colon reconstructed in real time from streaming colonoscopy video;
- Figure 2 illustrates (left-hand image) a 2D video frame from the same colonoscopy that produced Figure 1.
- Figure 2 (right-hand image) is a 3D view of the reconstruction of the video frames forming Figure 1, rotated by 90° and viewed from the same level as the 2D video frame;
- Figure 3 illustrates a reconstructed image of a colon with less than 70% of the colonic surface surveyed (colon external surface darkened to increase contrast);
- Figure 4 illustrates a reconstruction of a short segment of a colonoscopy video using structure from motion (SfM) and shape from shading (SfS);
- Figure 5 is a block diagram of the current system for reconstructing colonic surfaces from colonoscopic video frames and for blind spot detection;
- Figure 6 is a sparse point cloud of a chunk of colonoscopy video produced by a standard DSO SLAM pipeline;
- Figure 7 illustrates the real time image buildup of a colon reconstructed from streaming colonoscopy video frames;
- Figures 8A and 8B illustrate colon surfaces projected onto the 2D plane, which are examples of flattening of the cylinder by the blind spot detector;
- Figure 9 illustrates a video frame containing a haustral fold that blocks and outshines most of the rest of the information in the frame;
- Figure 10A illustrates an example of a colon surface seen en face;
- Figure 10B illustrates a poor quality reconstruction from video frames taken oblique to the axis of the colon;
- Figure 11A illustrates a ground truth image from a model of the cecum;
- Figure 11B illustrates an image reconstructed using a generic method;
- Figure 11C illustrates an image reconstructed using ColDE alone;
- Figure 11D illustrates an image reconstructed using ColDE-NR [Wang 2022];
- Figure 12 illustrates haustral pouches seen on x-ray and folds seen during colonoscopy;
- Figure 13 illustrates examples of scribble annotations drawn using the scribble annotation tool described herein.
- Figure 14 illustrates images in which the shaded areas indicate detected folds;
- Figure 15 illustrates curvatures of the colon, modified from [Feher 2018]
- Figure 16 illustrates colonoscopic video frames at different times
- Figure 17 illustrates reconstructed colonoscopic images with false blind spots indicated by arrows
- Figure 18 illustrates tracing surfels back to their generating video frames;
- Figure 19 is a flow chart illustrating an exemplary process for colonoscopic blind spot detection.
- Figure 20 illustrates the reconstruction of a 3D mesh from a colonoscopy video in real time according to the predicted depth and camera pose, allowing holes in the mesh to alert the physician to unsurveyed regions on the colon surface;
- Figures 21A and 21B illustrate our two-fold framework of frame-wise depth estimation.
- Figure 21A illustrates that the normal-based depth initialization network is trained with self-supervised surface normal consistency loss to produce depth map and camera pose initialization.
- Figure 21B illustrates that the normal-depth refinement framework utilizes the relation between illumination and surface geometry to refine depth and normal predictions;
- Figure 22 illustrates an example depth RMSE from C3VD with brighter colors denoting higher error. For the ground truth depth map, darker colors denote more distant depths.
- Figure 23 illustrates example reconstructed sequences from C3VD using various methods of initialization for SLAM pipeline.
- the more planar shapes observed in the ND init+n×NR variant, compared to the NormDepth variations, are closer to the ground truth reconstruction, while the noisy reconstructions using Monodepth2 and flat init+4×NR are farther from the ground truth.
- Figures 24A-24C illustrate reconstruction results on clinical colonoscopy data.
- Figure 25 illustrates box plots of accuracy of the method for identifying haustral folds
- Figure 26 illustrates graphs of accuracy of the method for identifying haustral folds at training time with depth inputs (left graph) and without depth inputs (right graph)
- Figure 27 illustrates sample images with network outputs superimposed on the images.
- One of the labeled sets of boxes shows regions where our method correctly marks folds which were missed by FoldIt.
- the other labeled set of boxes shows regions where FoldIt incorrectly marked folds that our method correctly marked as not folds.
- Figures 29A and 29B are system diagrams at training time (Figure 29A) and at inference time (Figure 29B) of a network used to detect haustral folds.
- the subject matter described herein includes a method of real time blind spot detection that works sufficiently well to show that blind spots are not uncommon, that they can be detected during colonoscopy in real time, and that the colonoscopist can thus revisit initially unsurveyed colonic surface if desired.
- Key challenges are reconstruction of the colonic surface, detection of unsurveyed regions, and virtual tagging of colonic locations.
- One object of the subject matter described herein is to provide a means for guidance back to colonoscopic blind spots with reconstruction. It is also desirable to implement a method to compute image distances in a fixed unit of measurement, such as millimeters, rather than in pixels.
- Another object of the subject matter described herein is to detect colonoscopic blind spots across flexures and other areas of high curvature.
- Another object of the subject matter described herein is to integrate reconstruction across sequences in which the colonoscope is rapidly turning, for example, from axial to en face (i.e., facing the surface of the colon). Another object of the subject matter described herein is to improve speed, stability, and display quality through pipelining and improved algorithms, such as a better reflectivity model and advanced shape description methods. Another object of the subject matter described herein is to handle gaps in reconstructed video, perform virtual tagging, and colon-to-colon registration. For example, the subject matter described herein may be used to track changes in the colon of a subject by registering a colonic surface construction from one colonoscopy procedure with a colonic surface reconstruction from a subsequent colonoscopy procedure.
- Another object of the subject matter described herein is to handle gaps that occur when the colonoscope moves while the colonic surface is obscured. Another object of the subject matter described herein is to virtually tag consequential colon locations for future applications, including cancers, blind spots, and polyps detected, but not acted upon, on insertion of the colonoscope. Another object of the subject matter described herein is to transfer virtual tags in serially conducted colonoscopies via colon-colon registration to allow easy localization of areas of previous interest (sites of resected polyps, disease progression).
- Validation of the subject matter described herein may include demonstrating that the rate of successful reconstruction of informative video frames is at least 90%, that blind spot detection produces no more than one false positive per examination, that the system runs in real time for time critical functions, and that guidance back to a discovered blind spot is accurate (see Table 2). These and other objects may be achieved in whole or in part by the subject matter described herein.
- SIGNIFICANCE Preventing colorectal cancer: The American Cancer Society estimates that there were more than 150,000 new cases of colorectal cancer (CRC) in the United States in 2022 with more than 50,000 deaths. [Siegel 2022]. CRC is second in cancer mortality behind only lung cancer for both sexes combined; and third, behind lung and breast cancer, for females.
- Colonoscopy is an effective method of both detecting and removing pre-malignant adenomatous polyps (adenomas) and can significantly reduce the incidence of post-colonoscopy CRCs [Nishihara 2013, Bretthauer 2022], but post-colonoscopy CRCs remain common at 3–8% of all new CRCs [Benedict 2015].
- The major cause of post-colonoscopy CRCs is failure to remove all adenomas at colonoscopy [Le Clercq 2014]. In a meta-analysis of six studies, using two immediate consecutive standard colonoscopies, on average 1 in every 5 adenomas was missed (pooled miss-rate 22%) [van Rijn 2006]. Studies since then show this rate has not significantly decreased [Lee 2017, Vemulapalli 2022].
- the adenoma detection rate (ADR) is defined as the proportion of colonoscopies in which an adenoma is found.
- Adenomas can only be missed if the colonic mucosal surface was not completely surveyed or if the adenomas were indeed seen but not recognized.
- Most of the current AI effort to improve ADRs has focused on polyp recognition.
- Use of artificial intelligence (AI) methods has proved to be a successful strategy for improving ADRs [Lee 2020] and is now FDA approved.
- Ten randomized controlled trials have now found that colonoscopy augmented by polyp recognition improves the ADR [Huang 2022]. We base our work on the proposition that failure to survey the entire colonic surface is a major cause of missed adenomas not yet addressed by current AI.
- PROPOSED SYSTEM Overview The subject matter described herein includes an AI-based system that can detect blind spots in real time and guide the colonoscopist back to revisit them if needed. In the process of system development, we will solve many new technical problems and also provide medical insights into such diverse areas as colonoscopy techniques, re-growth of resected adenomas and even colon anatomy (see V).
- Figure 1 illustrates an outside view of a length of colon reconstructed in real time from streaming colonoscopy video. This view would be impossible to see during the colonoscopy.
- Figure 2 illustrates (left-hand image) a 2D video frame from the same colonoscopy that produced Figure 1.
- the right-hand image in Figure 2 is a 3D view of the reconstruction of the video frames forming Figure 1, rotated by 90° and viewed from the same level as the 2D video frame.
- The two images, one a 2D video frame and the other a 3D reconstruction, look remarkably similar (as they must); slight differences are due to scaling, temporal sampling, and spatial resolution.
- Figure 3 illustrates reconstruction of another colon, an extreme case, with less than 70% of the colonic surface surveyed (colon external surface darkened to increase contrast.) A huge adenoma certainly could have been missed.
- the following aspects are features of the subject matter described herein that have been achieved or are under development: (1) Our system of colon depth estimation and normal refinement (ColDE-NR) followed by a pose estimate by SLAM is designed to continuously reconstruct the colonic surface looking for unsurveyed areas (blind spots).
- the colonoscopist will be offered a guide-back to re-visit the problematic area; (2) Improvement of the system’s ability to detect blind spots on colons that contain moderate curvature; (3) Reporting the size of the blind spots in millimeters to aid in judgement of whether re-visitation is necessary; (4) Development of haustral counting and more advanced methods (including re-reconstruction) to accurately guide the clinician back to blind spots; (5) Development of a new approach to reconstruct in the environment of severe curvature or flexures using the technique of “point cloud virtual straightening.” (6) Development of methods to handle rapid turning of the colonoscope and integration of subsequent video frames generated from off-axis views into the axial-view reconstructions; (7) Development of haustral ridge identification methods and their use to recognize ridges and reconstructions; (8) Development of methods to recognize and survey gaps due to obscuration of the colonic mucosa; (9) Virtual tagging to allow for transfer of information to present and future colonoscopies in the same patient;
- Chunk A short and only slightly curved section of the colon that has no abrupt changes in surface normals or depths.
- CNN A convolutional neural network is a class of neural network architectures, popularly used in deep learning.
- Deep learning A machine learning artificial intelligence (AI) technique using multi-layer neural networks for prediction tasks that we employ to enable enormous speedup over traditional geometric methods in colon reconstruction.
- En face The colonoscope camera is looking straight at the colon wall, i.e., perpendicular to the long axis of the colon.
- Feature A location or direction in a frame or surface, centered in a small region that contains recognizable attributes.
- Informative video frame A frame that shows enough area of the colon wall with clarity to support 3D reconstruction.
- Keyframe A video frame having non-negligible pose changes from previous frames accepted for fusion into a chunk.
- Normal Refinement (NR) A method to improve colonic reconstruction by incorporating illumination-aware photometric losses to improve both frame- wise depth and surface normal estimation, and thus 3D surface reconstruction.
- Pose The location and orientation of the camera. All of our poses are relative to the previous keyframe (i.e., relative poses).
- SLAM “Simultaneous localization and mapping”, a class of algorithms constructing or updating a 3D map of an unknown environment while simultaneously estimating the pose of the camera capturing the frames.
- Video frame A single 2D image in a video sequence; there are approximately 10,000 informative frames during a colonoscopy.
- System Description: Architecture
- Figure 5 is a block diagram of the current system for reconstructing colonic surfaces from colonoscopic video frames and for blind spot detection.
- the system includes a frame classifier 100 for receiving, as input, a video stream captured by a colonoscopic camera during a colonoscopy procedure and for selecting, as output, video frames from the colonoscopic video stream containing information for generating a model of colonic surfaces.
- Frame classifier 100 may identify informative and non-informative frames by culling video frames that are not usable for reconstruction.
- the next component of the system in Figure 5 is a lighting enhancement module 102 that enhances the video frames to have consistent lighting. Because the lighting inconsistency of colonoscopy videos can cause the colonoscopic reconstruction system to fail, we improve the lighting consistency using a CNN-based correction that adapts to the intensity distribution of recent video frames [Zhang 2021b].
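- As an illustration of adapting to the intensity distribution of recent frames, the following is a minimal Python sketch; it is a simple statistics-matching stand-in, not the CNN-based correction of [Zhang 2021b], and the window size and affine correction are assumptions.

```python
import numpy as np
from collections import deque

class LightingNormalizer:
    """Match each frame's brightness statistics to the running
    distribution of recent frames (illustrative stand-in for the
    CNN-based correction described above)."""

    def __init__(self, window=30):
        self.history = deque(maxlen=window)  # (mean, std) of recent frames

    def __call__(self, frame):
        # frame: float32 RGB image in [0, 1], shape (H, W, 3).
        luma = frame.mean(axis=2)
        mu, sigma = float(luma.mean()), float(luma.std()) + 1e-6
        self.history.append((mu, sigma))
        target_mu = np.mean([m for m, _ in self.history])
        target_sigma = np.mean([s for _, s in self.history])
        # Affine intensity correction toward the recent-frame statistics.
        corrected = (frame - mu) * (target_sigma / sigma) + target_mu
        return np.clip(corrected, 0.0, 1.0)
```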
- the next components of the system in Figure 5 are pixel depth and surface normal identifier 104, which identifies pixel depths and surface normals of the colonic surfaces depicted in the video frames, and normal refinement module 106, which refines the surface normals using scene illumination information.
- Pixel depth and surface normal identifier 104 is implemented using CNNs.
- CNNs have been used to produce good results in estimating depth for well-textured images and, through learned priors, also achieve satisfactory results for poorly textured areas.
- Our system utilizes CNNs and SLAM in sequence and incorporates the recently understood importance of using illumination-aware refinement-prediction of surface normal as pioneered by Sengupta [Lichy 2022].
- Our CNN exploits temporal consistency across frames and illumination variations as the light attached to the camera moves through the colon.
- A description of Colon Depth Estimation (ColDE) is found in [Zhang 2021a]. [Wang 2022] describes a method for image segmentation without annotations.
- The next component of the system in Figure 5 is camera pose estimator 108, which estimates the camera pose for each video frame using the refined surface normals.
- camera pose estimator 108 is implemented using simultaneous localization and mapping (SLAM) software, which is a tracking module that uses visual clues to predict camera pose for each incoming frame.
- the local mapping module refines the poses provided by the CNNs using bundle adjustment (a joint optimization of the camera poses and visible 3D point positions); a minimal sketch follows.
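- The following is a minimal bundle-adjustment sketch in Python using SciPy, not the system's actual SLAM code; the pinhole intrinsics K, the 6-vector pose parameterization, and the observation format are assumptions made for illustration.

```python
import numpy as np
from scipy.optimize import least_squares
from scipy.spatial.transform import Rotation

def project(points, rvec, tvec, K):
    """Project world points into a camera with pose (rvec, tvec)."""
    Xc = Rotation.from_rotvec(rvec).apply(points) + tvec
    uv = Xc @ K.T
    return uv[:, :2] / uv[:, 2:3]

def residuals(params, n_cams, n_pts, observations, K):
    poses = params[:n_cams * 6].reshape(n_cams, 6)   # [rvec | tvec] per camera
    pts = params[n_cams * 6:].reshape(n_pts, 3)
    res = []
    for cam, pt, uv in observations:                 # (cam_idx, pt_idx, 2D uv)
        pred = project(pts[pt:pt + 1], poses[cam, :3], poses[cam, 3:], K)[0]
        res.extend(pred - uv)                        # reprojection error
    return np.asarray(res)

def bundle_adjust(poses0, pts0, observations, K):
    """Jointly refine camera poses and 3D point positions by least squares."""
    x0 = np.concatenate([poses0.ravel(), pts0.ravel()])
    sol = least_squares(residuals, x0,
                        args=(len(poses0), len(pts0), observations, K))
    n = len(poses0) * 6
    return sol.x[:n].reshape(-1, 6), sol.x[n:].reshape(-1, 3)
```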
- Drift: A special concern was controlling scale and camera pose drift.
- In scale drift, the size of the scene being reconstructed is continuously changing; in camera pose drift, the camera positions increasingly deviate from the true path as the camera moves.
- Our current system, ColDE, solves this issue, as shown in Figure 6.
- the left hand image in Figure 6 is a sparse point cloud of a chunk of colonoscopy video produced by a standard DSO SLAM pipeline.
- the diameters of the DSO result are dramatically decreasing (scale drift) as the colonoscope is withdrawn (right to left).
- the sparse cloud from ColDE in the right-hand image of Figure 6 shows no scale drift.
- An example SLAM algorithm suitable for reconstructing colonic surfaces from video frames is described in [Ma 2019].
- fusion module 110 implements surfelmeshing (surface element meshing), which is designed to use a calibrated camera and poses provided externally from a SLAM [Schöps 2018]. The output is a dense surfel cloud which is then used to reconstruct a surface mesh rendered in MeshLab [https://www.meshlab.net/].
- Figure 7 illustrates the real time image buildup of a colon reconstructed from streaming colonoscopy video frames.
- The next component of the system in Figure 5 is blind spot detector 112, which identifies blind spots in the model. Blind spot detector 112 starts with chunks mapped into a point cloud.
- To detect holes on the reconstructed surfaces, blind spot detector 112 first computes a centerline of the model of a section of the colon and then computes cross sections orthogonal to the centerline to construct a cylinder that will contain most of the points. Blind spot detector 112 then flattens that cylinder, denoises it, and, using mathematical morphology methods, finds the holes. For more details see [Zhang 2020]. A sketch of this hole-finding step follows.
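- A minimal sketch of the hole-finding step, assuming the reconstructed section has already been flattened into a 2D occupancy image; the structuring element and minimum-area threshold are illustrative choices, not the parameters of [Zhang 2020].

```python
import numpy as np
from scipy import ndimage

def find_blind_spots(flat, min_area=50):
    """flat: 2D array, nonzero where projected surface points landed."""
    # Denoise: morphological closing fills pinholes left by sparse sampling.
    surface = ndimage.binary_closing(flat > 0, structure=np.ones((5, 5)))
    labels, n = ndimage.label(~surface)          # connected empty regions
    blind_spots = []
    for i in range(1, n + 1):
        region = labels == i
        ys, xs = np.nonzero(region)
        # Regions touching the image border lie outside the reconstructed
        # section; only interior holes of meaningful size are blind spots.
        touches_border = (ys.min() == 0 or xs.min() == 0 or
                          ys.max() == flat.shape[0] - 1 or
                          xs.max() == flat.shape[1] - 1)
        if region.sum() >= min_area and not touches_border:
            blind_spots.append(region)
    return blind_spots
```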
- the result of blind spot detection is a blind spot annotated mesh 114, which can be used to add a visual representation of the blind spot to the model of the colon.
- the final component of the system in Figure 5 is an interactive display 116, which displays the model including the detected blind spots, to the colonoscopist.
- the system illustrated in Figure 5 may further include a haustral ridge identifier 118 that uniquely identifies haustral ridges, which can be used to guide the colonoscopist back to a blind spot or a polyp based on the location of the blind spot or the polyp with respect to the haustral ridge.
- An exemplary method and system for haustral ridge identification is described below in the section entitled, “Scribble-Supervised Semantic Segmentation for Haustral Fold Detection”. It is understood that the components of the system illustrated in Figure 5 can be implemented in software, in combination with hardware and/or firmware.
- the components illustrated in Figure 5 can be implemented using computer executable instructions stored in a memory 120 and executed by one or more processors.
- Figures 8A and 8B illustrate colon surfaces projected onto the 2D plane, which are examples of flattening of the cylinder by blind spot detector 112.
- Figures 8C and 8D illustrate six “holes” found by blind spot detector 112 using mathematical morphology. Holes #3 and #4 are interior and so are likely to be blind spots. Table 1 lists the major causes of failure to reconstruct video frames in the sequences (3–8 cm lengths) used to test ColDE. Table 1 shows that blind spots are not uncommon and that most of them are real, rather than being false positives. In addition to non-informative frames (which had already been removed for the above study), “close occluders” were found to be a cause of failure to reconstruct.
- Figure 9 illustrates a video frame containing a haustral fold that blocks and outshines most of the rest of the information in the frame.
- the frame classifier detects frames containing “close occluders” using a CNN, treats them as non-informative, and removes them from further processing to generate the model.
- Another cause of the failure to reconstruct video frames is that some frames are obtained when the camera is not pointing down the axis of the colon, but rather is oblique to it, or even perpendicular to the mucosal surface (en face position). Such views are often of low contrast and contain few recognizable features.
- Figure 10A illustrates an example of a colon surface seen en face.
- Figure 10B illustrates a poor quality reconstruction from video frames taken oblique to the axis of the colon.
- Figure 11A illustrates a ground truth image from a model of the cecum.
- Figure 11B illustrates an image reconstructed using a generic method.
- Figure 11C illustrates an image reconstructed using ColDE alone.
- Figure 11D illustrates an image reconstructed using ColDE-NR [Wang 2022].
- Planned Approach and Improvements Reconstruction consists of keyframe-by-keyframe reconstruction into depths and interframe pose changes and fusion of these into an integrated surface.
- The two improvements needed to achieve adequately accurate reconstruction are (1) further training of ColDE-NR using self-supervision on clinically collected frames, which carry no ground truth (an approach that has very recently shown encouraging results), and (2) improvement of the pose determination when the pose changes quickly, for example, at colonic regions of high curvature or when the colonoscope is swung from pointing down the long axis of the colon to the perpendicular en face pose.
- Blind spot detection and measurement improvements include more accurate positioning of the non-blind-spot surface elements onto the colon understood as a generalized cylinder where the colon is notably curved and measuring blind spot dimensions in millimeters.
- Detecting and counting haustral folds is perhaps the simplest method of determining where you are with respect to a given point in the colon.
- Figure 12 illustrates haustral pouches seen on x-ray and folds seen during colonoscopy. The folds lie in between the pouches and separate them.
- FoldIt: FoldIt [Mathew 2021b] is based on CycleGAN [Zhu 2020]. CycleGAN relies on strong priors, which often produce false positives.
- Figure 13 illustrates examples of scribble annotations drawn using the scribble annotation tool described herein, which are used for AI training. Scribbles indicating “FOLD” and “NOT FOLD” are labeled.
- Figure 14 illustrates images in which the shaded areas indicate detected folds. Some of the detected folds are labeled.
- Extend the system with significant science and methodologies.
- IV. Guidance back with advanced methods including re-reconstruction registration. A. Were we able to embed at least an approximate distance measure within reconstructed colon sections, that alone might be adequate to establish our location. If we knew the velocity v_i for each frame sequence and the angle θ_i between the camera velocity and the axis of the colon, the distance along the colon could be computed as Σ_i v_i Δt_i cos(θ_i), summing each frame interval's axial displacement. A short sketch of this computation follows.
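- A short sketch of the distance accumulation, assuming per-frame speed, angle, and frame interval are available from the pose estimates (the names here are illustrative).

```python
import numpy as np

def axial_distance(v, theta, dt):
    """Return the summed axial displacement: sum_i v[i] * dt[i] * cos(theta[i]).

    v: camera speeds per frame; theta: angles between the camera velocity
    and the colon axis; dt: time between frames (all 1D array-likes)."""
    v, theta, dt = map(np.asarray, (v, theta, dt))
    return float(np.sum(v * dt * np.cos(theta)))
```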
- B. A likely more accurate but computationally expensive approach is to begin re-reconstruction of the colon following the blind spot alert and subsequent movement of colonoscope toward the blind spot. Our aim is to continuously compute a deformable registration between the reconstruction and re-reconstruction as the colonoscope moves to determine the current colonoscope position with respect to the original reconstruction. Our challenge is to train a CNN that will accomplish this task.
- Intra-haustral distances vary and have been reported to range between 32 and 46 mm [Huizinga 2021].
- Virtual tagging is a process similar to that of a person “dropping a pin” on a Google map. One virtually marks a spot for future reference; important examples might be marking the location of blind spots, marking the location of a polyp for later inspection after spotting it during insertion, marking the beginning of what is likely to become a gap because of cleansing, and the transferring of important information to a later colonoscopy via reconstructed colon-to-reconstructed colon registration.
- A. False positives: All implied surfels that make up blind spots of any size in the reconstructed colonic surface will be traced back to their generating video frames (Figure 17). This is accomplished by using our code that detects blind spot regions before the small ones are filtered out and that maps regions sparse in surfels, that is, blind spots of any size, onto a local cylindrical region. This allows one to map the blind spot regions back to the original frames using the already computed pose information.
- a false positive blind spot occurs when most of the points within a candidate blind spot map back to within the video frame.
- a true blind spot occurs when most of the points within a blind spot of clinically important size map to outside the video frames. A sketch of this test follows.
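- A sketch of this test, assuming a pinhole camera with intrinsics K and a world-to-camera pose (R, t) from the already computed pose information; the 0.5 threshold is an illustrative choice.

```python
import numpy as np

def fraction_inside_frame(points, R, t, K, width, height):
    """points: (N, 3) candidate blind-spot points in world coordinates."""
    Xc = points @ R.T + t                       # world -> camera coordinates
    z = np.clip(Xc[:, 2:3], 1e-9, None)         # guard the division
    uv = (Xc @ K.T)[:, :2] / z                  # pinhole projection
    inside = ((Xc[:, 2] > 0) & (uv[:, 0] >= 0) & (uv[:, 0] < width)
              & (uv[:, 1] >= 0) & (uv[:, 1] < height))
    return float(inside.mean())

def is_false_positive(points, R, t, K, width, height, thresh=0.5):
    # A candidate is a false positive when most of its points map back
    # to within the generating frame.
    return fraction_inside_frame(points, R, t, K, width, height) > thresh
```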
- Figure 17 illustrates an example of a false blind spot (left-hand image, surrounded by arrows), reconstructed using ColDE without normal refinement.
- In Table 1 we found few clinically important false positive blind spots.
- Figure 18 illustrates tracing surfels back to their generating video frames.
- B. False negatives: Real blind spots that are not detected (false negatives) do not directly harm the patient, but they are of no help either. Because a potential cause is inaccuracy in the positioning of the reconstructed points on the colon wall, many false negatives may point to a deeper system malfunction. Using our software to trace back reconstructed points to their generating video frames, we will measure how many points, if any, appear to originate from outside all video frames or are inconsistently located in adjacent frames.
- XIV. Determining that guidance back to a blind spot is accurate: Guiding the colonoscopist back to unsurveyed surface accurately is a critical part of the system.
- Guidance accuracy means that when the colonoscope has been moved back to the position judged to be the blind spot, the reconstruction of the environment of the blind spot, as now re-reconstructed, must adequately register with the one that generated the blind spot.
- Table 2: Targeted measures of achieving objects
- System Descriptions – Project Summary and Core Technologies: We propose to develop and improve a variety of novel technologies that together will form the ability to detect and repair blind spots during the colonoscopy examination. These technologies are from computer vision, deep learning, geometric processing, and a knowledge of colonoscopy. While these technologies (Table 3) are targeted to special properties of colons, many are novel contributions to the aforementioned subfields of computer science and could be applied to endoscopies of many other anatomic areas of the body.
- Table 3: Core technologies
- Figure 19 is a flow chart illustrating an exemplary process for colonoscopic blind spot detection.
- the process includes receiving, as input, a video stream captured by a colonoscopic camera during a colonoscopy procedure and selecting, as output, video frames from the video stream containing information for generating a model of colonic surfaces.
- frame classifier 100 illustrated in Figure 5 may receive a video stream captured by a colonoscopic camera during a colonoscopy procedure and may discard frames that do not include useful information for colonic surface reconstruction. Examples of frames that may be discarded include those with close occluders, specular reflections, blood, etc.
- the process includes identifying pixel depths and surface normals of the colonic surfaces.
- pixel depth and surface normal identifier 104 illustrated in Figure 5 may identify normals (i.e., lines normal to colonic surfaces) and pixel depths of the colonic surfaces.
- the process includes refining the surface normals using scene illumination information.
- normal refinement module 106 illustrated in Figure 5 may refine (i.e., produce more accurate estimates of) the surface normals identified by the CNNs using the process described below in the section entitled, “A Surface-normal Based Neural Framework for Colonoscopy Reconstruction”.
- the process further includes refining the pixel depths using the refined surface normals.
- normal refinement module 106 may refine the estimates of the pixel depths using the surface normals according to the process below in the section entitled, “Surface-normal Based Neural Framework for Colonoscopy Reconstruction”.
- the process further includes estimating a camera pose for each of the video frames using the refined surface normals and refined depths.
- camera pose estimator 108 illustrated in Figure 5 may use SLAM or another algorithm to estimate camera pose based on changes in the images captured by successive video frames.
- the process further includes generating the model of the colonic surfaces using the pixel depths, surface normals, and camera poses.
- fusion module 110 illustrated in Figure 5 may generate meshes representing the points on the colonic surfaces and fuse the meshes together using surfelmeshing, as described in [Schöps 2018].
- the process further includes identifying blind spots in the model of the colonic surfaces.
- blind spot detector 112 illustrated in Figure 5 may identify blind spots in the reconstructed surfaces by identifying an axis of a section of the colonic surface model, constructing a cylinder that includes the surfaces of the section of the colon, flattening the cylinder, and detecting blind spots in the flattened cylinder using mathematical morphology methods, as described in [Zhang 2020].
- the process further includes displaying indications of the blind spots.
- a computer display device associated with a colonoscope may display the reconstructed colonic surface model to the colonoscopist in real time during the colonoscopy procedure.
- the blind spots may be indicated by holes, i.e., pixels that are of a different color than the reconstructed colonic surfaces.
- the normal-based depth initialization network trained with self-supervised normal consistency loss provides depth map initialization to the normal-depth refinement module, which utilizes the relationship between illumination and surface normals to refine the frame-wise normal and depth predictions recursively.
- Our framework's depth accuracy performance on phantom colonoscopy data demonstrates the value of exploiting the surface normals in colonoscopy reconstruction, especially on en face views. Due to its low depth error, the prediction result from our framework will require limited post-processing to be clinically applicable for real-time colonoscopy reconstruction.
- Introduction: Reconstructing the 3D mesh from a colonoscopy video in real time according to the predicted depth and camera pose allows holes in the mesh to alert the physician to unsurveyed regions on the colon surface.
- Reconstructing the 3D model of colon surfaces concurrently during colonoscopy improves the polyp (lesion) detection rate by lowering the percentage of the colon surface that is missed during examination [Hong 2017]. Often surface regions are missed due to oblique camera orientations or occlusion by the colon folds.
- the unsurveyed part can be reported to the physician as holes in the 3D surface (as in Figure 20). This approach makes it possible to guide the physician back and examine the missing region without delay.
- a dense depth map and camera position need to be predicted from each frame. Previous work [Liu 2019a, Ma 2019] trained deep neural networks to predict the needed information in real time.
- Figure 21A illustrates details of pixel depth and surface normal identifier 104 and normal refinement module 106 illustrated in Figure 5.
- pixel depth and surface normal identifier 104 comprises a depth and pose estimator that receives a pair of RGB video frames 200 and outputs per-pixel depths and normal directions for each RGB image as well as the relative pose change between the two frames.
- Pixel depth and surface normal identifier 104 is implemented using a neural network that uses an encoder/decoder architecture, where the size of the data first shrinks (using linear algebra manipulations) as it passes through the encoder portion of the network (represented by encoder 202), distilling into a representation of the essential information about the input (encoding).
- the encoding is passed to the decoder portion of the network (represented by normal decoder 204 and depth decoder 206) where the data expands (using linear algebra manipulations) to produce values corresponding to depth for each pixel (depth decoder 206) and normal direction for each pixel (normal decoder 204).
- the PoseNet portion (represented by PoseNet 208) uses an encoder structure (where data shrinks in size via linear algebra manipulations) taking as input both frames simultaneously and producing a vector representing the change in the camera pose between the frames.
- Figure 21B illustrates details of normal refinement module 106 illustrated in Figure 5.
- This process begins with a light field processing module 210 that takes a pixel-wise depth prediction, fixed light parameters, and an RGB image to estimate the amount of light received at each point on the visible surface and the normal directions at each point.
- the combination of the estimated light field, normal directions, and RGB image passes through a normal refinement module 212 (implemented as a neural network) to produce an updated normal direction map.
- This updated map is taken as input to another neural network, depth-from-normal integration net 214, that estimates depths from normal directions, approximating integration with better speed and robustness against noise.
- The resulting updated depth maps can be used as input to restart the cycle from light field processing (recursive iterations), or used as the final output of the framework, serving as initialization for a SLAM-based pipeline that fuses the frame-wise output into a 3D mesh following Ma et al. [Ma 2021b]. A sketch of this loop follows.
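- A sketch of this refinement loop, with the three modules passed in as hypothetical callables standing for light field processing module 210, normal refinement network 212, and depth-from-normal integration net 214.

```python
def refine(rgb, depth, light_params, light_field, refine_normals,
           integrate_depth, n_iters=4):
    """Recursively refine depth and normal maps (illustrative loop)."""
    for _ in range(n_iters):
        # Estimate per-point received light and normals from current depths.
        light, normals = light_field(depth, light_params, rgb)
        # Refine the normal map using the illumination cues.
        normals = refine_normals(light, normals, rgb)
        # Recover depths from the refined normals (approximate integration).
        depth = integrate_depth(normals)
    # Final frame-wise output, e.g., initialization for the SLAM pipeline.
    return depth, normals
```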
- Our two-fold framework of colonoscopy reconstruction: a) Normal-based depth initialization network is trained with self-supervised surface normal consistency loss to produce depth map and camera pose initialization.
- b) Normal-depth refinement framework utilizes the relation between illumination and surface geometry to refine depth and normal predictions.
- A pixel p_t in a target view can be projected into the source view according to the predicted depth map D_t and the relative camera transformation T_{t→s}.
- This process yields the pixel's homogeneous coordinates p_s and its projected depth D_s(p_s) in the source view, as in the following equation: p_s D_s(p_s) = K T_{t→s} D_t(p_t) K^{-1} p_t, where K is the camera intrinsic matrix.
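- A minimal sketch of this projection step, assuming a pinhole camera with intrinsic matrix K and a 4x4 relative transformation T from target to source.

```python
import numpy as np

def project_to_source(pt_uv, depth_t, K, T):
    """Warp a target pixel into the source view given its predicted depth."""
    p_t = np.array([pt_uv[0], pt_uv[1], 1.0])     # homogeneous target pixel
    X_t = depth_t * (np.linalg.inv(K) @ p_t)      # backprojected 3D point
    X_s = T[:3, :3] @ X_t + T[:3, 3]              # transform to source frame
    uv = K @ X_s                                  # project into source view
    depth_s = uv[2]                               # projected depth
    return uv[:2] / depth_s, depth_s
```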
- Normal Consistency Objective: Surface normals can be sensitive to the error and noise on the predicted surface. Therefore, when the surface normal information is appropriately connected with the network's primary predictions, i.e., the depth and camera pose, utilizing surface normal consistency during training can further correct the predictions and improve the shape consistency.
- Let n_t be the object's surface normals in the target coordinate system.
- Our normal-depth refinement framework uses a combination of the color image, scene illumination as represented by the light field, and an initial surface normal map as input. We use both supervised and self-supervised consistency losses to simultaneously enforce improved normal map refinement and consistent performance across varying scene illumination scenarios.
- Light Field Computation: We use the light field to approximate the amount of light each point on the viewed surface receives from the light source. As in Lichy et al. [Lichy 2022], we parameterize our light source by its position relative to the camera, light direction, and angular attenuation. In the endoscopic environment, the light source and camera are effectively co-located, so we take the light source position to be fixed at the origin and the light direction to be parallel to the optical axis.
- Given the angular attenuation μ and the depth map D, we define the point-wise light field L and the point-wise attenuation A at each point on the viewed surface. For our model input, we concatenate the RGB image, L, A, and the normal map n (computed from the gradient of the depth map) along the channel dimension.
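- A sketch of this computation under the stated assumptions (light source at the camera origin, light direction along the optical axis, angular attenuation mu); the exact attenuation form follows the spirit of [Lichy 2022] and is an assumption here.

```python
import numpy as np

def point_light_field(depth, K, mu):
    """Per-pixel light direction L and attenuation A from a depth map."""
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).astype(np.float64)
    X = depth[..., None] * (pix @ np.linalg.inv(K).T)   # (H, W, 3) 3D points
    dist = np.linalg.norm(X, axis=-1, keepdims=True)
    L = X / (dist + 1e-8)                    # light direction from the origin
    axis = np.array([0.0, 0.0, 1.0])         # optical axis
    # Angular falloff raised to mu, combined with inverse-square distance.
    A = (L @ axis).clip(min=0.0)[..., None] ** mu / (dist ** 2 + 1e-8)
    return L, A
```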
- Training Overview In order to use illumination in colonoscopy reconstruction, we adapt our depth-normal refinement model from Lichy et al. [Lichy 2022] with additional consistency losses and modified initialization. We use repeated iterations for refinement; in order to reduce introduced noise, we use a multi-scale network as in many works in neural photometric stereo [Li 2018, Lichy 2022, Lichy 2021].
- C3VD: Colonoscopy 3D Video Dataset
- In colonoscopy, the endoscopist uses an optical endoscope to visually examine the surface of the patient's colon. A wire loop inserted through a hollow channel in the endoscope is used to remove adenomatous colon polyps, the cause of almost all colorectal cancers [Levin 2006, Ahn 2012]. But colonoscopy still misses between 6 and 27 percent of polyps, many due to the endoscopist not surveying some regions of the colon surface [Zhao 2016].
- Detection of blind spots: Ma et al. [Ma 2019] developed a method of detecting these blind spots (missed regions of the colon surface) through real-time 3D reconstruction of the colon surface. Once a blind spot is detected, the endoscopist may wish to be guided back to it in order to survey the region. This requires localization of the endoscope tip within the colon, which is made difficult by the discontinuous nature of the 3D reconstructed segments, as well as the non-rigid nature of the colon. In our situation, the only external information available is the video feed from the endoscope itself, which consists of a sequence of frames.
- Our task is to use this information, and information derived from it in real time, to determine the location of the endoscope tip in the colon relative to detected blind spots.
- Once the endoscope tip is within 10–15 cm of the blind spot, we need to be able to show its position in real time (along with the blind spot) on an onscreen visualization. This will allow the endoscopist to precisely maneuver back to inspect the blind spot. To do this, we need to be able to associate what is currently being observed with what was observed in the same region when the blind spot was missed. While a pose computed via SLAM will be available, it will not necessarily be known relative to the blind spot. Thus we need a direct localization method based on current observations. Previous work on the problem of localization during colonoscopy is sparse.
- Each scribble has one class, and the pixels under the scribble are assigned that class in the label. All other pixels are marked as "unknown" class. Although only a small fraction, usually less than 5%, of pixels are labeled, it has been found that scribble supervision can still produce results comparable to fully-supervised approaches [Tang 2018a, Tang 2018b].
- FoldIt: FoldIt [Mathew 2021b], a method based on CycleGAN [Zhu 2017], is the best previous method of fold detection to our knowledge. It formulates the problem of fold detection as a problem of translating between image domains.
- Domain A: images from optical colonoscopy;
- Domain B: images from virtual colonoscopy, that is, grayscale images without texture taken inside 3D models of colons derived from CT scans;
- Domain C: images inside 3D colon models where folds have been marked in red using the mathematical algorithm from [Nadeem 2017].
- FoldIt uses unpaired data; there is no correspondence between specific images in separate domains.
- Haustral folds are geometric features, but they express themselves in images as variations in the color, i.e., as photometric features. By using RGB with depths, we take advantage of both photometric and geometric features of haustral folds for better inference.
- Preliminaries: One approach to scribble-supervised semantic segmentation is known as the "two-stage" approach. Scribbles are first used to generate pseudo-labels for every pixel, which are then used to train the network as in fully-supervised learning. However, we chose the alternative, "one-stage" approach because it is less computationally intensive to train and has fewer hyperparameters to tune.
- The subset S ⊆ Ω is the set of pixels with scribble labels, and the corresponding ground-truth labels are y_p ∈ {0,1} for all p ∈ S. The training loss has the form L(s) = Σ_{p∈S} ℓ_CE(s_p, y_p) + Σ_{k=1..K} λ_k R_k(s; x). (1)
- The first term in this equation is the binary cross-entropy loss computed over all labeled pixels.
- The second term is the total regularization loss, a weighted sum of K regularization losses R_1, ..., R_K.
- Each R_k(s; x) is defined in terms of the affinity of the segmentation outputs at pairs of pixels p, q, as well as of x, the input.
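- A minimal PyTorch sketch of the loss in equation (1), with the regularizers passed in as hypothetical (weight, function) pairs standing for λ_k and R_k.

```python
import torch
import torch.nn.functional as F

def scribble_loss(logits, scribbles, image, regularizers):
    """logits: (N, 1, H, W); scribbles: (N, H, W) with 0/1 class labels
    and -1 for the "unknown" class (unlabeled pixels)."""
    labeled = scribbles >= 0
    # Binary cross-entropy over the scribble-labeled pixels only.
    ce = F.binary_cross_entropy_with_logits(
        logits.squeeze(1)[labeled], scribbles[labeled].float())
    # Weighted sum of regularization losses R_1..R_K over the soft output.
    reg = sum(w * fn(torch.sigmoid(logits), image) for w, fn in regularizers)
    return ce + reg
```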
- Training Set Our training dataset was obtained by first taking every fifth frame, sequentially, of each sequence in our research group’s library of colonoscopy frames from approximately 78 patients, for a total of 2556 frames in the training set. Scribble annotations were then applied to 1828 of these frames; the remainder lacked sufficient detail to make annotations with high confidence. We then processed these 1828 frames using [Zhang 2021], as explained above.
- Scribble annotations were applied to images using a tool we wrote in Python; example annotations are shown in Figure 13.
- The user has the option to draw color-coded scribbles for the two classes, "fold" and "not fold", and can also erase and change between frames using the keyboard. Scribbles were applied only to pixels where the expert annotator had high confidence that the class label was correct. For "fold" scribbles, the scribbles were to follow the curve of the fold, close to its crest but not including any pixels behind it. On each frame, the annotator's goal was to put a scribble in connected components with the proper class label, without making low-confidence annotations.
- Test Set In addition to the training set, a test set was prepared and annotated. This set consisted of 64 video frames and their predicted depth maps. This set was drawn at random from the same library of video frames as the training set, but excluding the frames in the training set so that the test set is unseen by the network. Each of these frames was scribble-annotated, this time attempting to place scribbles on as many connected components of the full segmentation as possible in addition to the criteria used for annotating the training set. Across all 64 images, there were 203,170 labeled pixels, with 101,604 "fold” and 101,566 "not fold". As explained above, the test set does not contain many "en face" frames.
- the RGB channels initialize to their respective pretrained weights, while the depth channel initializes to the R channel weights. This was done in order to initialize the depth channel with pretrained weights instead of random. All weights in all other layers initialize to the pretrained MobileNetV2 weights, as they are unaffected by the change.
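- A sketch of this initialization using torchvision; the helper name and weight-enum string are assumptions, but the channel-copying scheme matches the description above.

```python
import torch
from torchvision.models import mobilenet_v2

def make_rgbd_mobilenet():
    """MobileNetV2 with a 4-channel (RGB-D) first convolution."""
    net = mobilenet_v2(weights="IMAGENET1K_V1")
    old = net.features[0][0]                  # pretrained first conv (3-in)
    new = torch.nn.Conv2d(4, old.out_channels, kernel_size=old.kernel_size,
                          stride=old.stride, padding=old.padding, bias=False)
    with torch.no_grad():
        new.weight[:, :3] = old.weight        # RGB channels: pretrained weights
        new.weight[:, 3] = old.weight[:, 0]   # depth channel: R channel weights
    net.features[0][0] = new
    return net
```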
- Loss Functions While the cross-entropy loss term alone from (1) is enough to obtain decent performance, it has been found [Tang 2018a, Tang 2018b] that adding regularization losses R_k can improve performance. In this work, we experiment with two regularization losses: the dense conditional random field loss R_CRF and the normalized cut loss R_NC, both from [Tang 2018b] (defined below).
- the dense conditional random field regularization loss [Tang 2018b] computes the affinity between every pair of pixels and is low when pixels with high affinity for each other are labeled in the same class.
- let S^k be the vector of the k-th components of every S_p, i.e., the network's soft segmentation for class k.
- let W(X) be a function of the input image X giving a weight W_pq to every pair of pixels p, q ∈ Ω based on their similarity to each other. The dense CRF loss is then R_CRF(S) = Σ_k (S^k)ᵀ W (1 − S^k).
- the normalized cut loss is an extension of R_CRF.
- the degree vector d is equal to W·1. This allows us to simplify the loss by removing a constant term, so we have R_NC(S) = Σ_k (S^k)ᵀ W (1 − S^k) / (dᵀ S^k). W assigns the weight every pixel applies to the loss value at a given pixel. Since this weight is very small for far-away pixels, we follow [Tang 2018b] and find W by applying a bilateral Gaussian kernel over RGB and pixel-position (XY) space.
- in the kernel, ‖·‖ denotes the L2 vector norm.
- W_pq is a function of the distances and differences in intensities (i.e., the distance in RGB space) between a pixel and all other pixels.
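The sketch below renders both regularizers explicitly, following the matrix forms above. It builds the dense affinity W as an explicit N × N matrix, which is only feasible on heavily downsampled images; the bandwidths sigma_rgb and sigma_xy are assumed values, and practical implementations of [Tang 2018b] use fast bilateral filtering instead of this dense matrix.

```python
import torch

def bilateral_affinity(img, sigma_rgb=0.1, sigma_xy=6.0):
    """img: (3, H, W), values in [0, 1]. Returns the dense (N, N) affinity W."""
    _, H, W = img.shape
    rgb = img.reshape(3, -1).t()                               # (N, 3) colors
    ys, xs = torch.meshgrid(torch.arange(H, dtype=torch.float32),
                            torch.arange(W, dtype=torch.float32),
                            indexing="ij")
    xy = torch.stack([ys, xs], dim=-1).reshape(-1, 2)          # (N, 2) positions
    return torch.exp(-torch.cdist(rgb, rgb) ** 2 / (2 * sigma_rgb ** 2)
                     - torch.cdist(xy, xy) ** 2 / (2 * sigma_xy ** 2))

def crf_loss(probs, W):
    """R_CRF = sum_k (S^k)^T W (1 - S^k); probs is (K, N), one row per class."""
    return sum(s @ (W @ (1 - s)) for s in probs)

def ncut_loss(probs, W):
    """R_NC = sum_k (S^k)^T W (1 - S^k) / (d^T S^k), with degrees d = W 1."""
    d = W.sum(dim=1)
    return sum((s @ (W @ (1 - s))) / (d @ s) for s in probs)
```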
- TP, TN, FP, FN are standard quantities in the machine learning literature for evaluating classifier performance. Defining TP, TN, FP, FN to be the total numbers of true positive, true negative, false positive, and false negative pixels respectively for a class, we have Precision = TP / (TP + FP), the fraction of positive classifications that are correct, and Recall = TP / (TP + FN), the fraction of positives that are classified as such correctly. These are computed across all 203,170 labeled pixels in all 64 labeled images, so variance cannot be easily presented. The next section describes the second set of experiments we conducted, which attempt to understand the variation in performance between frames.
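As a concrete illustration of this evaluation protocol, the sketch below accumulates TP, FP, and FN over the scribble-labeled pixels of every test image and computes the two metrics once over the totals (rather than per image, which is why no variance is reported). Array shapes and the use of -1 for unlabeled pixels are assumptions.

```python
import numpy as np

def precision_recall(pred_masks, label_masks):
    """pred_masks, label_masks: iterables of flat integer arrays;
    labels are 1 = "fold", 0 = "not fold", -1 = unlabeled."""
    tp = fp = fn = 0
    for pred, lab in zip(pred_masks, label_masks):
        m = lab >= 0                         # only scribble-labeled pixels count
        p, y = pred[m], lab[m]
        tp += int(np.sum((p == 1) & (y == 1)))
        fp += int(np.sum((p == 1) & (y == 0)))
        fn += int(np.sum((p == 0) & (y == 1)))
    return tp / (tp + fp), tp / (tp + fn)    # precision, recall
```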
- Figure 27 illustrates sample images with network outputs superimposed on the images.
- Green boxes show regions where our method correctly marked folds that were missed by FoldIt.
- Red boxes show regions where FoldIt incorrectly marked folds that our method correctly marked as not folds.
- the purple oval in the second row of the second column marks a region where our method gave an inconsistent result.
- the fourth row shows the ground-truth label scribbles.
- Comparison of our method's performance versus FoldIt on the test set of 64 annotated images.
- Lichy, D., Wu, J., Sengupta, S., Jacobs, D.W. Shape and material capture at home.
Abstract
A method for colonoscopic blind spot detection includes receiving, as input, a video stream captured by a colonoscopic camera during a colonoscopy procedure and selecting, as output, video frames from the video stream containing information for generating a model of colonic surfaces. The method further includes identifying pixel depths and surface normals of the colonic surfaces. The method further includes refining the surface normals using scene illumination information. The method further includes refining the pixel depths using the refined surface normals. The method further includes estimating a camera pose for each of the video frames using the refined surface normals and refined depths. The method further includes generating the model of the colonic surfaces using the camera poses. The method further includes identifying blind spots in the model. The method further includes displaying indications of the blind spots.
Description
METHODS, SYSTEMS, AND COMPUTER READABLE MEDIA FOR COLONOSCOPIC BLIND SPOT DETECTION
PRIORITY CLAIM This application claims the priority benefit of U.S. Provisional Patent Application Serial No. 63/449,837, filed March 3, 2023, the disclosure of which is incorporated herein by reference in its entirety.
TECHNICAL FIELD The subject matter described herein relates to processing colonoscopic video frames to improve the accuracy of colonoscopy procedures. More particularly, the subject matter described herein relates to methods, systems, and computer readable media for colonoscopic blind spot detection.
BACKGROUND A colonoscopy procedure involves the insertion of a colonoscopy probe including a video camera into the colon of a patient, capturing video frames of the surfaces of the colon, including any polyps, and removing the polyps with a tool that is also part of the colonoscopy probe. One problem with colonoscopy procedures is failure to capture views of some regions of the colon. These uncaptured regions are referred to as blind spots. Blind spots can occur because of lighting, colonic structures or blood occluding the view of the camera, etc. It is desirable during a colonoscopy procedure to reconstruct a model of the colonic surfaces from the colonoscopic video, identify blind spots in the model, and guide the colonoscopist back to the blind spots in the real colon to determine whether further treatment is needed. However, generating an accurate model from colonoscopic video in real time that is useful to the colonoscopist during a colonoscopy procedure is a challenging problem due to the amount of data to be processed, the dynamic nature of the colon, lighting issues, occlusions, etc. In light of these and other difficulties, there exists a need for improved methods, systems, and computer readable media for colonoscopic blind spot detection.
SUMMARY A system for colonoscopic blind spot detection includes a frame classifier for receiving, as input, a video stream captured by a colonoscopic camera during a colonoscopy procedure and for selecting, as output, video frames from the video stream containing information for generating a model of colonic surfaces. The system further includes a pixel depth and surface normal identifier for identifying pixel depths and surface normals of the colonic surfaces. The system further includes a normal refinement module for refining the surface normals using scene illumination information. The system further includes a normal-depth refinement module for refining the pixel depths using the refined surface normals. The system further includes a camera pose estimator for estimating a camera pose for each of the video frames using the refined surface normals and refined depths. The system further includes a fusion module for generating the model of the colonic surfaces using the camera poses. The system further includes a blind spot detector for identifying blind spots in the model. The system further includes a display for displaying indications of the blind spots. According to another aspect of the subject matter described herein, the frame classifier is configured to exclude from the output video frames in which the information for generating the model of colonic surfaces is occluded by water, blood, or close occluders. According to another aspect of the subject matter described herein, the system further includes a frame lighting consistency neural network for adjusting lighting in the video frames output from the frame classifier such that the lighting is consistent across frames. According to another aspect of the subject matter described herein, the camera pose estimator utilizes simultaneous localization and mapping (SLAM) to estimate the camera poses. According to another aspect of the subject matter described herein, the camera pose estimator utilizes direct sparse odometry (DSO) SLAM to estimate the camera poses.
According to another aspect of the subject matter described herein, the fusion module utilizes surfel meshing to generate the model of the colonic surfaces. According to another aspect of the subject matter described herein, the blind spot detector detects the blind spots by constructing a cylinder from a reconstructed section of the colon from the model, flattening the cylinder, and identifying the blind spots by analyzing pixel intensities of pixels on the flattened cylinder. According to another aspect of the subject matter described herein, the system further includes a haustral ridge identifier for uniquely identifying haustral ridges in the model. According to another aspect of the subject matter described herein, the display is configured to display the indications of the blind spots in real time during the colonoscopy procedure. According to another aspect of the subject matter described herein, a method for colonoscopic blind spot detection includes receiving, as input, a video stream captured by a colonoscopic camera during a colonoscopy procedure and selecting, as output, video frames from the colonoscopic video stream containing information for generating a model of colonic surfaces. The method further includes identifying pixel depths and surface normals of the colonic surfaces. The method further includes refining the surface normals using scene illumination information. The method further includes refining the pixel depths using the refined surface normals. The method further includes estimating a camera pose for each of the video frames using the refined surface normals and refined depths. The method further includes generating the model of the colonic surfaces using the camera poses. The method further includes identifying blind spots in the model. The method further includes displaying indications of the blind spots. According to another aspect of the subject matter described herein, the method further includes excluding from the output video frames in which the information for generating the model of colonic surfaces is occluded by water, blood, or close occluders.
According to another aspect of the subject matter described herein, the method includes adjusting lighting in the video frames output from the frame classifier such that the lighting is consistent across frames. According to another aspect of the subject matter described herein, estimating the camera pose includes utilizing simultaneous localization and mapping (SLAM). According to another aspect of the subject matter described herein, utilizing SLAM to estimate the camera poses includes utilizing direct sparse odometry (DSO) SLAM to estimate the camera poses. According to another aspect of the subject matter described herein, generating the model includes using surfel meshing. According to another aspect of the subject matter described herein, detecting the blind spots includes constructing a cylinder from a reconstructed section of the colon from the model, flattening the cylinder, and identifying the blind spots by analyzing pixel intensities of pixels on the flattened cylinder. According to another aspect of the subject matter described herein, the method further includes uniquely identifying haustral ridges in the model. According to another aspect of the subject matter described herein, displaying the indications of the blind spots includes displaying the indications in real time during the colonoscopy procedure. According to another aspect of the subject matter described herein, a non-transitory computer readable medium having stored thereon executable instructions that when executed by a processor of a computer control the computer to perform steps is provided. The steps include receiving, as input, a video stream captured by a colonoscopic camera during a colonoscopy procedure and selecting, as output, video frames from the video stream containing information for generating a model of colonic surfaces. The steps further include identifying pixel depths and surface normals of the colonic surfaces. The steps further include refining the surface normals using scene illumination information. The steps further include refining the pixel depths using the refined surface normals. The steps further include estimating a camera pose for each of the video frames using the refined surface normals and refined depths. The steps further include generating the model of the colonic surfaces
using the camera poses. The steps further include identifying blind spots in the model. The steps further include displaying indications of the blind spots. The subject matter described herein can be implemented in software in combination with hardware and/or firmware. For example, the subject matter described herein can be implemented in software executed by a processor. In one exemplary implementation, the subject matter described herein can be implemented using a non-transitory computer readable medium having stored thereon computer executable instructions that when executed by the processor of a computer control the computer to perform steps. Exemplary computer readable media suitable for implementing the subject matter described herein include non-transitory computer-readable media, such as disk memory devices, chip memory devices, programmable logic devices, and application specific integrated circuits. In addition, a computer readable medium that implements the subject matter described herein may be located on a single device or computing platform or may be distributed across multiple devices or computing platforms.
BRIEF DESCRIPTION OF THE DRAWINGS Figure 1 illustrates an outside view of a length of colon reconstructed in real time from streaming colonoscopy; Figure 2 illustrates (left-hand image) a 2D video frame from the same colonoscopy that produced Figure 1. Figure 2 (right-hand image) is a 3D view of the reconstruction of the video frames forming Figure 1, rotated by 90° and viewed from the same level as the 2D video frame; Figure 3 illustrates a reconstructed image of a colon with less than 70% of the colonic surface surveyed (colon external surface darkened to increase contrast); Figure 4 illustrates a reconstruction of a short segment of a colonoscopy video using structure from motion (SfM) and shape from shading (SfS); Figure 5 is a block diagram of the current system for reconstructing colonic surfaces from colonoscopic video frames and for blind spot detection;
Figure 6 is a sparse point cloud of a chunk of colonoscopy video produced by a standard DSO SLAM pipeline; Figure 7 illustrates the real time image buildup of a colon reconstructed from streaming colonoscopy video frames; Figures 8A and 8B illustrate colon surfaces projected onto the 2D plane, which are examples of flattening of the cylinder by the blind spot detector; Figures 8C and 8D illustrate six "holes" found by the blind spot detector using mathematical morphology. Holes #3 and #4 are interior and so are likely to be blind spots; Figure 9 illustrates a video frame containing a haustral fold that blocks and outshines most of the rest of the information in the frame; Figure 10A illustrates an example of a colon surface seen en face; Figure 10B illustrates a poor quality reconstruction from video frames taken oblique to the axis of the colon; Figure 11A illustrates a ground truth image from a model of the cecum; Figure 11B illustrates an image reconstructed using a generic method; Figure 11C illustrates an image reconstructed using ColDE alone; Figure 11D illustrates an image reconstructed using ColDE-NR [Wang 2022]; Figure 12 illustrates haustral pouches seen on x-ray and folds seen during colonoscopy; Figure 13 illustrates examples of scribble annotations drawn using the scribble annotation tool described herein. Scribbles indicating "FOLD" and "NOT FOLD" are labeled; Figure 14 illustrates images in which the shaded areas indicate detected folds; Figure 15 illustrates curvatures of the colon, modified from [Feher 2018]; Figure 16 illustrates colonoscopic video frames at different times; Figure 17 illustrates reconstructed colonoscopic images with false blind spots indicated by arrows; Figure 18 illustrates tracing surfels back to their generating video frames;
Figure 19 is a flow chart illustrating an exemplary process for colonoscopic blind spot detection; Figure 20 illustrates the reconstruction of a 3D mesh from a colonoscopy video in real time according to the predicted depth and camera pose, allowing holes in the mesh to alert the physician to unsurveyed regions on the colon surface; Figures 21A and 21B illustrate our two-fold framework of frame-wise depth estimation. Figure 21A illustrates that the normal-based depth initialization network is trained with self-supervised surface normal consistency loss to produce depth map and camera pose initialization. Figure 21B illustrates that the normal-depth refinement framework utilizes the relation between illumination and surface geometry to refine depth and normal predictions; Figure 22 illustrates an example depth RMSE from C3VD with brighter colors denoting higher error. For the ground truth depth map, darker colors denote more distant depths. Some areas of improvement are highlighted in boxes; Figure 23 illustrates example reconstructed sequences from C3VD using various methods of initialization for the SLAM pipeline. The more planar shapes observed in the ND init+n×NR compared to NormDepth variations are closer to the ground truth reconstruction, while the noisy reconstructions using Monodepth2 and flat init+4×NR are farther from the ground truth. Select areas of improvement are highlighted with arrows; Figures 24A-24C illustrate reconstruction results on clinical colonoscopy data. The combined system described herein can handle both "down-the-barrel" and en face views, outperforming the photometric baseline model Monodepth2; Figure 25 illustrates box plots of accuracy of the method for identifying haustral folds; Figure 26 illustrates graphs of accuracy of the method for identifying haustral folds at training time with depth inputs (left graph) and without depth inputs (right graph);
Figure 27 illustrates sample images with network outputs superimposed on the images. One of the labeled sets of boxes shows regions where our method correctly marks folds which were missed by FoldIt. The other labeled set of boxes show regions where FoldIt incorrectly marked folds that our method correctly marked as not folds. The oval in the second row in the second column marks a region where our method gave an inconsistent result; Figure 28 illustrates the results of haustral fold segmentation on 5 consecutive video frames; and Figures 29A and 29B are system diagrams at training time (Figure 29A) and at inference time (Figure 29B) of a network used to detect haustral folds. DETAILED DESCRIPTION Introduction: More than 50,000 Americans die each year from colorectal cancers (CRCs). As nearly all CRCs begin as polyps (specifically adenomatous or sessile serrated types — the word “polyp” means any growth) most of these deaths would have been prevented with prior complete polypectomy at colonoscopy. However, many studies have shown that the rate of missed polyps at colonoscopy is about 20% and that post-colonoscopy cancers, those that arise a few years following the procedure, are very often due to missed polyps. Reducing the rate of missed polyps would also reduce the number of post-colonoscopy CRCs, which currently account for 3–8% of all CRCs. One reason for failure to visualize all polyps is that the colonic surface was not completely surveyed. Until now, it has been difficult to confirm this hypothesis as there was no way to directly detect non-surveyed colon surface (blind spots). Using the latest computer vision and artificial intelligence (AI) techniques, the subject matter described herein includes method of real time blind spot detection that works sufficiently well to show that blind spots are not uncommon and can be detected during colonoscopy in real time and making it possible for the colonoscopist to revisit initially unsurveyed colonic surface if desired. Key challenges are reconstruction of the colonic surface, detection of unsurveyed regions, and virtual tagging of colonic locations. One object of
the subject matter described herein is to provide a means for guidance back to colonoscopic blind spots with reconstruction. It is also desirable to implement a method to compute image distances in a fixed unit of measurement, such as millimeters, rather than in pixels. Another object of the subject matter described herein is to detect colonoscopic blind spots across flexures and other areas of high curvature. Another object of the subject matter described herein is to integrate reconstruction across sequences in which the colonoscope is rapidly turning, for example, from axial to en face (i.e., facing the surface of the colon). Another object of the subject matter described herein is to improve speed, stability, and display quality through pipelining and improved algorithms, such as a better reflectivity model and advanced shape description methods. Another object of the subject matter described herein is to handle gaps in reconstructed video and to perform virtual tagging and colon-to-colon registration. For example, the subject matter described herein may be used to track changes in the colon of a subject by registering a colonic surface reconstruction from one colonoscopy procedure with a colonic surface reconstruction from a subsequent colonoscopy procedure. Another object of the subject matter described herein is to handle gaps that occur when the colonoscope moves while the colonic surface is obscured. Another object of the subject matter described herein is to virtually tag consequential colon locations for future applications, including cancers, blind spots, and polyps detected, but not acted upon, on insertion of the colonoscope. Another object of the subject matter described herein is to transfer virtual tags in serially conducted colonoscopies via colon-colon registration to allow easy localization of areas of previous interest (sites of resected polyps, disease progression). Validation of the subject matter described herein may include demonstrating that the rate of successful reconstruction of informative video frames is ≥ 90%, that blind spot detection produces no more than one false positive per examination, that the system runs in real time for time critical functions, and that guidance back to a discovered blind spot is accurate (see Table 2). These and other objects may be achieved in whole or in part by the subject matter described herein.
SIGNIFICANCE Preventing colorectal cancer: The American Cancer Society estimates that there were more than 150,000 new cases of colorectal cancer (CRC) in the United States in 2022, with more than 50,000 deaths [Siegel 2022]. CRC is second in cancer mortality behind only lung cancer for both sexes combined, and third, behind lung and breast cancer, for females. An estimated 15 million colonoscopies were performed in 2019 [Gastroenterology Health Partners 2022] at an average cost of $2750 per procedure [Corso 2022], for a total cost exceeding $40 billion annually in the US. The justification for this effort and expense is that colonoscopy is an effective method of both detecting and removing pre-malignant adenomatous polyps (adenomas) and can significantly reduce the incidence of post-colonoscopy CRCs [Nishihara 2013, Bretthauer 2022], but these still remain common at 3–8% of all new CRCs [Benedict 2015]. The cause of post-colonoscopy CRCs: The major cause of post-colonoscopy CRCs is failure to remove all adenomas at colonoscopy [Le Clercq 2014]. In a meta-analysis of six studies, using two immediate consecutive standard colonoscopies, on average 1 in every 5 adenomas was missed (pooled miss rate 22%) [van Rijn 2006]. Studies since then show this rate has not significantly decreased [Lee 2017, Vemulapalli 2022]. The adenoma detection rate (ADR) is defined as the proportion of colonoscopies in which an adenoma is found. Patients who have colonoscopy with endoscopists who have high ADRs have half the risk of developing post-colonoscopy colorectal cancer and dying from CRC compared with those who have the procedure with low-ADR endoscopists [Corley 2014, Kaminski 2017]. Thus, strategies to improve ADR and colonoscopy quality are critically important. Strategies that increase the ADR include technical improvements [Dik 2014], physician training [Brown 2005], the presence of a second observer during the procedure [Lee 2011, Aslanian 2013], improved protocols [Hancock 2016], improved bowel preparation regimens [Avalos 2017], and the use of dyes and optical filters applied to the colonoscope [Buchner 2017] or field of view [Moon 2023]. Why adenomas are missed at colonoscopy: Adenomas can only be missed if the colonic mucosal surface was not completely surveyed or if the
adenomas were indeed seen but not recognized. Most of the current AI effort to improve ADRs has focused on polyp recognition. The use of artificial intelligence (AI) methods has proved to be a successful strategy for improving ADRs [Lee 2020] and is now FDA approved. Ten randomized controlled trials have now found that colonoscopy augmented by polyp recognition improves the ADR [Huang 2022]. We base our work on the proposition that failure to survey the entire colonic surface is a major cause of missed adenomas not yet addressed by current AI. Evidence for this claim came first from [Pickard 2004] who, using a CT taken just after the colonoscopy, found that even large (>10 mm) adenomas could be missed if they were on the backside of a haustral fold, an area known to be difficult to completely survey at colonoscopy. Two clinical studies since then showed that the use of much wider field-of-view experimental colonoscopes resulted in a significantly increased ADR compared to conventional colonoscopy [Leufkens 2012, Gralnek 2014]. Finally, our own preliminary studies have directly shown that non-surveyed colonic mucosa is a common occurrence. Other workers are now recognizing that inadvertently non-surveyed areas of the colonic surface (blind spots) lead to numerous missed polyps [Chang 2012, Freedman 2020, Mathew 2021a, Hackner 2022]. We are far ahead in developing a complete approach to detecting and allowing for revisiting of blind spots [Ma 2021b, Wang 2022].
PROPOSED SYSTEM Overview: The subject matter described herein includes an AI-based system that can detect blind spots in real time and guide the colonoscopist back to revisit them if needed. In the process of system development, we will solve many new technical problems and also provide medical insights into such diverse areas as colonoscopy techniques, re-growth of resected adenomas, and even colon anatomy (see V).
Figure 1 illustrates an outside view of a length of colon reconstructed in real time from streaming colonoscopy. This view would be impossible to see during the colonoscopy itself. Figure 2 illustrates (left-hand image) a 2D video frame from the same colonoscopy that produced Figure 1. The right-hand image in Figure 2 is a 3D view of the reconstruction of the video frames forming Figure 1, rotated by 90° and viewed from the same level as the 2D video frame. The two images, one a 2D video frame and the other a 3D reconstruction, look remarkably similar (as they must); slight differences are due to scaling, temporal sampling, and spatial resolution. Figure 3 illustrates the reconstruction of another colon, an extreme case, with less than 70% of the colonic surface surveyed (colon external surface darkened to increase contrast). A huge adenoma certainly could have been missed. The following aspects are features of the subject matter described herein that have been achieved or are under development: ⑴ Our system of colon depth estimation and normal refinement (ColDE-NR) followed by a pose estimate by SLAM is designed to continuously reconstruct the colonic surface looking for unsurveyed areas (blind spots). If blind spots are found, the colonoscopist will be offered a guide-back to re-visit the problematic area; ⑵ Improvement of the system's ability to detect blind spots on colons that contain moderate curvature; ⑶ Reporting the size of the blind spots in millimeters to aid in judgement of whether re-visitation is necessary; ⑷ Development of haustral counting and more advanced methods (including re-reconstruction) to accurately guide the clinician back to blind spots; ⑸ Development of a new approach to reconstruct in the environment of severe curvature or flexures using the technique of "point cloud virtual straightening."
⑹ Development of methods to handle rapid turning of the colonoscope and integration of subsequent video frames generated from off-axis views into the axial-view reconstructions; ⑺ Development of haustral ridge identification methods and their use to recognize ridges and reconstructions; ⑻ Development of methods to recognize and survey gaps due to obscuration of the colonic mucosa; ⑼ Virtual tagging to allow for transfer of information to present and future colonoscopies in the same patient; and ⑽ Further application of methods of projecting pixels in the reconstructed colon back to their generating video frames for quality assurance, for example, detecting blind spots. Efforts by other groups: A recent review article [Tavanapong 2022] gives an up-to-date comprehensive review of 16 papers (not counting ours) published since 2012. Five papers detailed work on designing post-procedure colonoscopy quality metrics, three concerned predictive coverage, five concerned partial or simulated reconstructions, two analyzed how reconstruction could be done using a pre-procedure CT, and one paper developed a method of colon centerline determination. None of these efforts claimed real time detection of blind spots or ways to remedy them during the procedure. The surveyed papers likewise fail to achieve many of the innovations listed above, including the architecture of ColDE-NR itself, guidance back to blind spots with haustral or frame counting, virtual tagging, and back projecting blind spots to eliminate false positives.
APPROACH History of our system development: Our research group began work to develop a system to reconstruct the colon to detect unsurveyed colonic surface and thus increase polyp detection about 2019, building on previous work reconstructing the oral cavity from oropharyngeal endoscopy for radiation treatment planning (1 R01 CA158925-01A1). At first, we adopted the paradigm developed in that work, that of using structure-from-motion (SfM)
and shape-from-shading (SfS) [Rosenman 2017] for colonic reconstruction, which was successful. Figure 4 illustrates a reconstruction of a short segment of a colonoscopy video using SfM and SfS. The 3D textured surface can be viewed interactively. To our knowledge this is the first fully reconstructed global view directly from a colonoscopy video (about 2019). However, to be clinically useful for the detection and revisiting of non-surveyed colonic mucosa (blind spots), the reconstruction must be done continuously in real time. SfM and SfS proved to be 1000 times too slow for this purpose. Our current deep-learning-based system is nearly fast enough for clinical use and can accurately reconstruct a limited number of video frames, but it faces many situation-specific challenges in computer vision techniques that must be overcome to make it into a clinically useful tool. Nevertheless, based on current successes, it is believed that these challenges can be overcome. Note: Pose-tracking hardware for colonoscopes is not available; our system assumes colonoscopes without such hardware. If future hardware changes do provide it, our system can be modified to use it to advantage. System Description — Technical terms: Chunk: A short and only slightly curved section of the colon that has no abrupt changes in surface normals or depths. CNN: A convolutional neural network, a class of neural network architectures popularly used in deep learning. Deep learning: A machine learning artificial intelligence (AI) technique using multi-layer neural networks for prediction tasks, which we employ to enable enormous speedup over traditional geometric methods in colon reconstruction. En face: The colonoscope camera is looking straight at the colon wall, i.e., perpendicular to the long axis of the colon. Feature: A location or direction in a frame or surface, centered in a small region that contains recognizable attributes.
Informative video frame: A frame that shows enough area of the colon wall with clarity to support 3D reconstruction. Keyframe: A video frame having non-negligible pose changes from previous frames, accepted for fusion into a chunk. Normal Refinement (NR): A method to improve colonic reconstruction by incorporating illumination-aware photometric losses to improve both frame-wise depth and surface normal estimation, and thus 3D surface reconstruction. Pose: The location and orientation of the camera. All of our poses are relative to the previous keyframe (i.e., relative poses). SLAM: "Simultaneous localization and mapping", a class of algorithms constructing or updating a 3D map of an unknown environment while simultaneously estimating the pose of the camera capturing the frames. Video frame: A single 2D image in a video sequence; there are ~10,000 informative frames during a colonoscopy. System Description — Architecture Figure 5 is a block diagram of the current system for reconstructing colonic surfaces from colonoscopic video frames and for blind spot detection. In Figure 5, the system includes a frame classifier 100 for receiving, as input, a video stream captured by a colonoscopic camera during a colonoscopy procedure and for selecting, as output, video frames from the colonoscopic video stream containing information for generating a model of colonic surfaces. Frame classifier 100 may identify informative and non-informative frames by culling video frames that are not usable for reconstruction. Typically, 50% of frames are in this category, which includes frames in which water or blood obscures the mucosa, frames with close occluders (Figure 9), and redundant frames. The next component of the system in Figure 5 is a lighting enhancement module 102 that enhances the video frames to have consistent lighting. Because the lighting inconsistency of colonoscopy videos can cause the colonoscopic reconstruction system to fail, we improve the lighting consistency using a CNN-based correction that adapts to the intensity distribution of recent video frames [Zhang 2021b].
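As a purely illustrative sketch of how frame classifier 100 might sit at the head of this pipeline, the generator below drops frames a classifier scores as non-informative; `classify_frame` is a hypothetical stand-in for the CNN, and the 0.5 threshold is an assumption.

```python
def informative_frames(stream, classify_frame, threshold=0.5):
    """Yield only frames judged usable for reconstruction.

    stream:          iterable of decoded video frames.
    classify_frame:  hypothetical CNN returning P(frame is informative),
                     trained to reject water/blood obscuration, close
                     occluders, and redundant frames.
    """
    for frame in stream:
        if classify_frame(frame) >= threshold:
            yield frame   # passed on to lighting correction and depth estimation
```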
The next components of the system in Figure 5 are pixel depth and surface normal identifier 104, which identifies pixel depths and surface normals of the colonic surfaces depicted in the video frames, and normal refinement module 106, which refines the surface normals using scene illumination information. Pixel depth and surface normal identifier 104 is implemented using CNNs. CNNs have been used to produce good results in estimating depth for well-textured images. CNNs also achieve satisfactory results for poorly textured areas, through learned priors. Our system utilizes CNNs and SLAM in sequence and incorporates the recently understood importance of using illumination-aware refinement-prediction of surface normals as pioneered by Sengupta [Lichy 2022]. Our CNN exploits temporal consistency across frames and illumination variations as the light attached to the camera moves through the colon. A description of ColDE (Colon Depth Estimation) is found in [Zhang 2021a]; [Wang 2022] describes a method for image segmentation without annotations. Exemplary components of normal refinement module 106 and their operations are described in the section below entitled "A Surface-normal Based Neural Network for Colonoscopy Reconstruction". The next component of the system in Figure 5 is camera pose estimator 108, which estimates a camera pose for each video frame using the refined surface normals. In the illustrated example, camera pose estimator 108 is implemented using simultaneous localization and mapping (SLAM) software, which is a tracking module that uses visual cues to predict the camera pose for each incoming frame. The local mapping module refines the poses provided by the CNNs using bundle adjustment (a joint optimization of their poses and visible 3D point positions). We are currently using the SLAM method of [Engel 2018] known as direct sparse odometry (DSO), as it seems to work best in a low feature environment like the colonic surface, but we are investigating alternatives [Teed 2020]. The reason for the dual use of a CNN and SLAM is that it solves the SLAM low-feature problem of being sensitive to intensity changes (common
during colonoscopy because the light moves with the camera) and geometric distortion of the colon, also common during colonoscopy. In the absence of a CNN, SLAM struggles with the colon's low-feature environment, whereas a CNN can effectively utilize learned priors about colon surfaces. Drift: A special concern was that of controlling scale and camera pose drift. In scale drift, the size of the scene being reconstructed is continuously changing; in camera pose drift the camera positions increasingly deviate from the true path as the camera moves. Our current system, ColDE, solves this issue, as shown in Figure 6. The left-hand image in Figure 6 is a sparse point cloud of a chunk of colonoscopy video produced by a standard DSO SLAM pipeline. The diameters of the DSO result are dramatically decreasing (scale drift) as the colonoscope is withdrawn (right to left). The sparse cloud from ColDE in the right-hand image of Figure 6 shows no scale drift. An example SLAM algorithm suitable for reconstructing colonic surfaces from video frames is described in [Ma 2019]. The next component of the system illustrated in Figure 5 is fusion module 110. In this example, fusion module 110 implements surfelmeshing (surface element meshing), which is designed to use a calibrated camera and poses provided externally from a SLAM [Schöps 2018]. The output is a dense surfel cloud, which is then used to reconstruct a surface mesh rendered in MeshLab [https://www.meshlab.net/]. Figure 7 illustrates the real time image buildup of a colon reconstructed from streaming colonoscopy video frames. The next component of the system illustrated in Figure 5 is blind spot detector 112, which identifies blind spots in the model. Blind spot detector 112 starts with chunks mapped into a point cloud. To detect holes on the reconstructed surfaces, blind spot detector 112 first computes a centerline of the model of a section of the colon and then computes cross sections orthogonal to the centerline to construct a cylinder that will contain most of the points. Blind spot detector 112 then flattens that cylinder, denoises it, and, using mathematical morphology methods, finds the holes. For more details see [Zhang 2020]. The result of blind spot detection is a blind spot annotated mesh 114, which can be used to add a visual representation of the blind spot to the model
of the colon. The final component of the system in Figure 5 is an interactive display 116, which displays the model, including the detected blind spots, to the colonoscopist. The system illustrated in Figure 5 may further include a haustral ridge identifier 118 that uniquely identifies haustral ridges, which can be used to guide the colonoscopist back to a blind spot or a polyp based on the location of the blind spot or the polyp with respect to the haustral ridge. An exemplary method and system for haustral ridge identification is described below in the section entitled "Scribble-Supervised Semantic Segmentation for Haustral Fold Detection". It is understood that the components of the system illustrated in Figure 5 can be implemented in software, in combination with hardware and/or firmware. For example, the components illustrated in Figure 5 can be implemented using computer executable instructions stored in a memory 120 and executed by one or more processors 122 of a computer. Figures 8A and 8B illustrate colon surfaces projected onto the 2D plane, which are examples of flattening of the cylinder by blind spot detector 112. Figures 8C and 8D illustrate six "holes" found by blind spot detector 112 using mathematical morphology. Holes #3 and #4 are interior and so are likely to be blind spots.
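A minimal sketch of the hole-finding step just described follows. It assumes the flattened cylinder has already been rasterized into a binary occupancy image (1 = surfel coverage, 0 = empty); morphological closing plays the role of denoising, and connected-component labeling extracts candidate blind spots. The structuring-element size and area threshold are illustrative values, and the centerline fitting and flattening steps are omitted.

```python
import numpy as np
from scipy import ndimage

def find_holes(occupancy, min_area=50):
    """occupancy: 2D array from the flattened cylinder; returns hole masks."""
    covered = ndimage.binary_closing(occupancy.astype(bool),
                                     structure=np.ones((5, 5)))   # denoise
    labels, n = ndimage.label(~covered)        # connected empty regions
    holes = []
    for i in range(1, n + 1):
        region = labels == i
        if region.sum() >= min_area:           # drop tiny artifacts
            holes.append(region)               # candidate blind spot
    return holes
```

Interior holes (like #3 and #4 in Figures 8C and 8D) are the blind spot candidates; regions touching the image boundary generally correspond to the ends of the reconstructed chunk rather than unsurveyed surface.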
Major causes of failure to reconstruct video frames: Table 1 summarizes results on the sequences (3–8 cm lengths) used to test ColDE. Table 1 shows that blind spots are not uncommon and that most of them are real, rather than being false positives.
In addition to non-informative frames (which had already been removed for the above study), "close occluders" were found to be a cause of failure to reconstruct. Figure 9 illustrates a video frame containing a haustral fold that blocks and outshines most of the rest of the information in the frame. The frame classifier detects frames containing "close occluders" using a CNN, treats them as non-informative, and removes them from further processing to generate the model. Another cause of the failure to reconstruct video frames is that some frames are obtained when the camera is not pointing down the axis of the colon, but rather is oblique to it, or even perpendicular to the mucosal surface (en face position). Such views are often of low contrast and contain few recognizable features. Figure 10A illustrates an example of a colon surface seen en face. Figure 10B illustrates a poor quality reconstruction from video frames taken oblique to the axis of the colon. Many of the blind spots are false positive artifacts. Non-axial views are an important part of the surveillance of the colon surface, but the use of only photometric and simple depth consistency objectives in training CNNs is not sufficient to reconstruct them properly. Recent work in computer vision has shown surface normals to be useful for enforcing additional geometric constraints in refining depth predictions, while the relationship between surface normals and scene illumination has also been exploited in photometric stereo. The success of utilizing surface normals in complex scene reconstruction was the basis for us to explore this feature in the endoscopic reconstruction process. Normal refinement (NR): These observations have led to the insertion of a new CNN beyond ColDE just before SLAM (Figure 5). This module 1) regularizes normals and computes new depth maps directly from them [Wang 2022], 2) is based on lighting models rather than photometric texture or geometry, and 3) was trained from a plastic phantom with ground truth. Our current effort is to learn the best way to train the model and whether advanced techniques, such as improving the light reflectance model or shape descriptions, are needed (see section IX below).
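For intuition about the depth-normal relationship that NR exploits, the sketch below derives per-pixel surface normals from a depth map by back-projecting pixels through assumed pinhole intrinsics (fx, fy, cx, cy) and crossing the local tangent vectors. This is the classical geometric construction, not the patent's learned NR module.

```python
import numpy as np

def normals_from_depth(depth, fx, fy, cx, cy):
    """depth: (H, W) depth map. Returns (H, W, 3) unit surface normals."""
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    X = (u - cx) / fx * depth              # back-project to camera coordinates
    Y = (v - cy) / fy * depth
    P = np.stack([X, Y, depth], axis=-1)   # (H, W, 3) point cloud
    du = np.gradient(P, axis=1)            # tangent along image x
    dv = np.gradient(P, axis=0)            # tangent along image y
    n = np.cross(du, dv)                   # normal = cross product of tangents
    return n / (np.linalg.norm(n, axis=-1, keepdims=True) + 1e-8)
```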
Figure 11A illustrates a ground truth image from a model of the cecum. Figure 11B illustrates an image reconstructed using a generic method. Figure 11C illustrates an image reconstructed using ColDE alone. Figure 11D illustrates an image reconstructed using ColDE-NR [Wang 2022]. The improvement from ColDE to ColDE-NR is better seen in 3D, but its improved resemblance to the ground truth can be seen by the lowering of the false large brown "lip" (tip of arrow) in the ColDE reconstruction and the increased prominence of the true Y-shaped ridge. There are three overarching methodologies to detect and remediate inadvertently non-surveyed areas of the colonic mucosa, which can occur as regions surrounded by surveyed areas (blind spots) or as gaps due to temporary obscuration of the mucosa. For the blind spots we must reconstruct the video frames in real time into a 3D surface, then detect and measure any blind spots. Finally, for both blind spots and gaps we must guide the colonoscopist back to re-survey them. Planned Approach and Improvements Reconstruction consists of keyframe-by-keyframe reconstruction into depths and interframe pose changes and fusion of these into an integrated surface. The two improvements needed to achieve adequately accurate reconstruction are the further training of ColDE-NR using self-supervision on clinically collected frames, which carry no ground truth (an approach that has very recently shown encouraging results), and improvement in the pose determination when the pose changes quickly, for example, at colonic regions of high curvature or when the colonoscope is swung from down the long axis of the colon to the perpendicular en face pose. Blind spot detection and measurement improvements include more accurate positioning of the non-blind-spot surface elements onto the colon, understood as a generalized cylinder where the colon is notably curved, and measuring blind spot dimensions in millimeters. Guidance back to blind spots and gaps is necessary because both are regions that can hide polyps and thus require re-visiting. While the simpler method of haustral ridge counting might work for some blind spots, the more
sophisticated method of guidance back with re-reconstruction will be needed for gaps and blind spots as well. The following improvements to existing technology are described herein: I. ColDE-NR needs improvement in reconstruction to meet our goal of reconstructing 80–90% of all informative frames. We have a 78% reconstructable rate but only on highly selected axial frames (Table 1). A. On axial views our reconstruction rate is >70%; the majority of failures are off axis, which ColDE does not currently reconstruct well (Figure 10B). B. We will extend the development of normal refinement (NR) to clinical data with self-supervised learning, with the specific target of reconstruction of non-axial views at an accuracy rate of ≥ 80%. A better reflectance model is being developed for improvements in normal refinement. II. Improve the ability of the system to correctly detect blind spots in colonic reconstructions whose axes contain mild curvature. At present we ignore curvature at the risk of not properly detecting and measuring blind spots. Our proposed solution is to replace our present method of fitting straight cylindrical pieces to colonic chunks with an approach that fits a mildly curving generalized cylinder. Handling more significant curvature may require further development. III. Continue to develop haustral fold counting methods to guide the clinician back to blind spots. Detecting and counting haustral folds (the ridges, formed by circumferential contraction of the inner muscular layer of the colon, that separate the small, regular outpouchings of bowel) is perhaps the simplest method of determining where you are with respect to a given point in the colon. Figure 12 illustrates haustral pouches seen on x-ray and folds seen during colonoscopy. The folds lie in between the pouches and separate them. The most reliable method of detecting folds before our work was that of FoldIt [Mathew 2021b], based on CycleGAN [Zhu 2020]. CycleGAN relies on strong priors, which often produce false positives.
Our now-completed first version method [Ehrenstein 2023] uses scribble annotations as labels for weakly-supervised AI learning. Traditionally, semantic segmentation is learned from a dataset of images where every pixel is annotated with a class label. However, many pixels are ambiguous and cannot be reasonably assigned to either "fold" or "not fold" classes. Although only a small fraction, usually less than 5%, of pixels are annotated, scribble supervision can still produce results comparable to fully supervised approaches and, in our initial testing, produced pixel-labeling accuracies of 90%, far better than FoldIt. The inclusion of normals predicted by ColDE-NR in this CNN's input can be expected to improve counting accuracy. Figure 13 illustrates examples of scribble annotations drawn using the scribble annotation tool described herein, which are used for AI training. Scribbles indicating "FOLD" and "NOT FOLD" are labeled. Figure 14 illustrates images in which the shaded areas indicate detected folds. Some of the detected folds are labeled. Extend the system with significant science and methodologies. IV. Guidance back with advanced methods including re-reconstruction registration. A. Were we able to embed at least an approximate distance measure within reconstructed colon sections, that alone might be adequate to establish our location. If we knew the velocity v_i for each frame sequence and the angle θ_i between the camera velocity and the axis of the colon, the distance could be computed as d ≈ Σ_i v_i Δt cos(θ_i). A very approximate distance can be computed by using the dimensionless velocity ratio v = #keyframes/#frames. B. A likely more accurate but computationally expensive approach is to begin re-reconstruction of the colon following the blind spot alert and subsequent movement of the colonoscope toward the blind spot. Our aim is to continuously compute a deformable registration between the reconstruction and re-reconstruction as the colonoscope moves, to determine the current colonoscope position with respect to the original reconstruction. Our challenge is to train a CNN that will accomplish this task.
C. Summary: Guidance back to a blind spot or other point of interest might be accomplished by the simplest method, fold counting; by the more complex keyframe/frame counting; or by the even more complex methods of re-reconstruction. Experimental work will determine the best approach. V. Implement a method to compute image distances in millimeters rather than in surfels. A scale will allow us to characterize the size of a polyp that could be hidden by a discovered blind spot, as polyps < 5 mm are considered of minimal danger [Ponugoti 2017]. (Colonoscopes do not contain rulers.) Known measurements relevant to scale include the following: A. Cadaveric colons average 160 ± 30 cm in length depending on the patient's weight, but not height or age, with women averaging only a 7% shorter length than men [Hounnou 2002]. (Cadaver colons may be somewhat relaxed.) The average diameters of the cecum, transverse colon, and descending colon are 9 cm, 6 cm, and 3 cm, respectively [Stauffer 2022]. Intra-haustral distances vary and have been reported to range between 32 and 46 mm [Huizinga 2021]. B. Average velocity of withdrawal. The recommended withdrawal time from the colon is 10-15 minutes [Rex 2007, Wong 2020] over 80-110 cm (the length of colon traversed). At 30 frames/second this will generate, on average, 19,000 frames, consistent with our experience. About half of these frames are informative, and only half of the remaining will be keyframes. So, in a typical colonoscopy, we re-construct ~3500 keyframes over 160 cm at an average rate of 2 keyframes/mm. Summary and conclusion: Our plan is to build a distance measure (in millimeters) that will be stable, accurately measure blind spot sizes, and guide the user back to a blind spot. Our measure will involve known average measurements, deduced average inter-haustral distances by colon section, and detected colonoscope-along-axis velocity.
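The frame-budget arithmetic above can be made concrete with a back-of-the-envelope sketch; the withdrawal time plugged in is an assumed average within the stated 10-15 minute range, and the halving factors follow the rough fractions quoted above.

```python
WITHDRAWAL_MIN = 10.5   # assumed average withdrawal time, minutes
FPS = 30                # video frame rate
COLON_MM = 1600         # ~160 cm of colon

frames = WITHDRAWAL_MIN * 60 * FPS      # ~19,000 frames during withdrawal
informative = frames / 2                # about half are informative
keyframes = informative / 2             # about half of those become keyframes
# ~4,700 keyframes, ~3 per mm: the same order as the ~3,500 keyframes
# and ~2 keyframes/mm cited above.
print(round(keyframes), round(keyframes / COLON_MM, 1))
```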
VI. Regions of high curvature. Almost none of the colon consists of long straight passages; it typically contains more than 30 curved lengths [Laframboise 2019] ranging from mild to almost 180° (see Figure 15). Handling moderate curvature is one of the objects described above. The effect of high curvature is to further limit the line of sight and make blind spot detection more difficult, because cylinder generalizations will not be sufficient to map surfels onto a plane. Figure 15 illustrates curvatures of the colon, modified from [Feher 2018]. ColDE delivers the relative pose between successive frames, so it can identify areas of high curvature. With that identification we can straighten the colon point cloud, allowing calculation of the colon centerline on which our method of blind spot detection is based. This can be accomplished by adapting Ma's method for colon straightening [Ma 2019]. Alternatively, we can adapt Hu's CNN [Hu 2020] for mapping curved point clouds onto planes. VII. The video frames from a rapidly turning colonoscope are far apart, and our software can fail to detect interframe pixel correspondences at a distance. As a result, we are unable to calculate relative pose changes of the camera when the colonoscope goes from axial to oblique views, resulting in loss of tracking. Two approaches would avoid this problem: 1) Use the pose calculations directly from ColDE-NR, excluding SLAM. 2) Adapt the CNN called DROID-SLAM [Teed 2022], already under our investigation. Alternatively, we will consider two CNN schemes that can achieve pixel correspondences at a distance: 1) Our current SLAM model uses a small set of keypoints that are tracked over a small window of frames. This is sufficient when the camera moves slowly. For fast turning cameras, we propose to use a larger set of keypoints and use a wider window for tracking. We will avoid a significant increase in the computational complexity by designing efficient algorithms. 2) We also plan to explore different neural network strategies that have recently shown success in finding correlations between image patches at a distance. We will first try using atrous convolutions [Wang 2019], with which we have experience in our ridge detection work [Ehrenstein 2023], to obtain features with wider receptive fields. Then, as proven effective by vision transformers in image classification tasks [Dosovitskiy 2021], we will use self- and cross-attention layers to find correlations between the extracted features from these patches at a distance. Training of transformers usually requires millions of
images, but there are now newer methods that might make this more practical [Cao 2022]. VIII. Adequate speed requires pipelining and possibly distributed computing and/or improved algorithms. Study and validation of speed issues will be delayed until much of the system is completed. For example, the rendering of the blind spots is currently a slow process that needs speeding up. IX. Two improvements are needed: A better reflectance model is needed to improve the normal-refinement process. In addition, there is likely additional value in developing a general colon shape description that can provide effective guidance to our machine-learning-based prediction algorithms, which currently use generic losses agnostic to features seen on colon surfaces. Our expertise in shape representation [Pizer 2022] and reflectance models for computer vision [Lichy 2022] will be combined to produce a more effective model than we have presently. We also aim to add completely new capabilities to our system, such as handling gaps, significant deformations, virtual tagging, and reconstructed-colon-to-reconstructed-colon registration. All of the above items will be solved using a technique that we will develop for haustral ridge identification and matching. [Mathew 2021a] have shown that ridges can be distinguished by their geometric features. We have already produced a method for identifying ridge pixels in a frame. We will associate these pixels into separated ridges using an adaptation of the FreeSOLO CNN [Wang X 2022]. Then we will develop a CNN producing geometry-based signatures of folds using the ideas of [Srivastava 2016]. These signatures will allow us to identify any ridge appearing in a frame or in a reconstructed colon with any other in a stored catalogue. X. Gaps in the reconstruction occur when there is movement of the colonoscope during a time when the colonic surface and identifying landmarks are obscured, due to cleansing or bleeding, for example. A gap should be evident from the sudden appearance of a large series of non-informative frames. When the obscuring agent clears, the location of the colonoscope tip (camera) will need to be re-established.
If Frame #00 in Figure 16 is the last frame seen before obscuration and Frame #90 is the first frame seen after the scene clears, our method will be made able to recognize the region between the present colonoscope position and the last reconstructed position and then reconstruct the intervening region. Should that reconstruction yield a blind spot, it will be reported in the normal fashion. However, at some point, for example at frame #150, the most recently reconstructed position can no longer be recognized. In that case, the system must guide the colonoscopist back to the most recently reconstructed position in order to recommence examination at that point. This process will require the ridge identification approach discussed earlier.
XI. Virtual tagging: We see virtual tagging (VT) as a process similar to a person "dropping a pin" on a Google map. One virtually marks a spot for future reference; important examples include marking the location of blind spots, marking the location of a polyp for later inspection after spotting it during insertion, marking the beginning of what is likely to become a gap because of cleansing, and transferring important information to a later colonoscopy via reconstructed-colon-to-reconstructed-colon registration. The objective of being guided back to a marked location can be achieved by applying the method of haustral ridge identification and matching described above.
XII. Reconstructed-colon-to-reconstructed-colon registration: The obvious value here is to carry past information over to the present. For example, a clinician would like to inspect an area where an adenoma had previously been removed to determine whether it has re-grown at all. Colon-to-colon registration on CT has been accomplished both without AI [Hampton 2013] and with AI [van Eijnatten 2021]. The registration methods that we will develop will be based on the method of haustral ridge identification and matching described above.
To validate all important aspects of our system: with IRB permission, we have archived approximately 200 complete colonoscopy videos. These are representative of screening colonoscopies at
the University of North Carolina, Division of GI Medicine. In our experience this number is sufficient for training all of the CNNs proposed in this research; moreover, if more are needed, we have IRB permission to obtain them. A subset of these videos, not used in CNN training, will be used for our validation studies, described below. When our consulting statistician (Moore) has done power studies for each of our validation experiments, we will be able to determine the number of colonoscopy videos needed. We will build a validation software system to test the various components of our system that could impact its clinical performance. These tests include identifying speed bottlenecks that have the potential to slow down the procedure. Validation must also include verification of blind spots and of the accuracy of guidance back to them. Our formal evaluation procedures are preceded by frequent qualitative visual examinations of reconstructions, which have proven very informative as to weaknesses in the reconstruction. This includes studying our compiled data, as shown in Table 1, to measure the rate of successful frame reconstruction. The formal evaluations focus on determining the correctness of blind spot detection and of guidance back to the blind spots. A false positive blind spot is one that appears in the display but does not exist. False positive alerts would be disruptive to the colonoscopist, who might then stop the procedure and go back to survey surface already seen. If these alerts are too frequent, the colonoscopist may start ignoring them. Dr. McGill, the colonoscopist on our research team, and others we consulted feel that more than one false positive blind spot per examination would be disruptive.
XIII. Analyzing for false positive and negative blind spots: A. False positives: All implied surfels that make up blind spots of any size in the reconstructed colonic surface will be traced back to their generating video frames (Figure 17). This is accomplished using our code that detects blind spot regions before the small ones are filtered out and that maps regions sparse in surfels, that is, blind spots of any size, onto a local cylindrical region. This allows one to map the blind spot regions back to the original frames using the already computed pose information.
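The following is a minimal sketch of this tracing step under a pinhole camera model: blind-spot surfels are projected into a candidate frame using that frame's estimated pose and the camera intrinsics, and the fraction landing inside the image bounds is computed. Function names and the decision rule are illustrative assumptions, not the system's actual code.

```python
# Sketch: project blind-spot surfels into a frame with pose (R, t) and
# intrinsics K; a high inside-image fraction suggests a false positive.
import numpy as np

def project_points(points_world, R, t, K):
    """Project Nx3 world points into pixel coordinates; drop points behind the camera."""
    cam = R @ points_world.T + t[:, None]   # world -> camera coordinates, (3, N)
    in_front = cam[2] > 0
    px = K @ cam[:, in_front]
    px = px[:2] / px[2]                     # perspective divide
    return px.T, in_front

def fraction_inside(points_world, R, t, K, width, height):
    px, _ = project_points(points_world, R, t, K)
    inside = (px[:, 0] >= 0) & (px[:, 0] < width) & (px[:, 1] >= 0) & (px[:, 1] < height)
    # points behind the camera count as "outside" via the denominator
    return inside.sum() / max(len(points_world), 1)
```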
A false positive blind spot occurs when most of the points within a candidate blind spot map back to within the video frame. A true blind spot occurs when most of the points within a blind spot of clinically important size map to outside the video frame. Also, when small blind spots map inside video frames, we have an indication of weaknesses in the reconstruction process; if they consistently map outside video frames, we have evidence of sub-optimal colonoscopy technique. Figure 17 illustrates an example of a false blind spot (left-hand image, surrounded by arrows). In the right-hand reconstruction, there is clearly a discontinuity in the reconstruction of the colon to the right of and above the arrows. In our previous evaluation of ColDE (without normal refinement) on highly selected, mostly axial frame sequences (Table 1), we found few false positive clinically important blind spots. However, we must repeat this study on a broader range of sequences, with the reconstruction and blind spot detection computed using the technique improvements that we have developed since that study and will further develop in the proposed research. Figure 18 illustrates tracing surfels back to a generating video frame.
B. False negatives: Real blind spots that are not detected, i.e., false negatives, do not directly harm the patient, but they are of no help either. Because a potential cause is inaccuracy in the positioning of the reconstructed points on the colon wall, many false negatives may point to a deeper system malfunction. Using our software to trace reconstructed points back to their generating video frames, we will measure how many points, if any, appear to originate from outside all video frames or are inconsistently located in adjacent frames.
XIV. Determining that guidance back to a blind spot is accurate: Guiding the colonoscopist back to unsurveyed surface accurately is a critical part of the system. Guidance accuracy means that when the colonoscope has been moved back to the position judged to be the blind spot, the reconstruction of the environment of the blind spot, as now re-reconstructed, must adequately register with the one that generated the blind spot.
XV. Testing the system for speed bottlenecks: Our system consists of multiple neural nets and multiple pieces of algorithmic code. We will test that
each of these meets the clinical requirement that it not significantly slow down the whole examination. The entire system, with its multiple feedback loops and connections, will be pipelined, and the speed of that pipeline will also be tested against the clinical requirements.
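As a minimal sketch of such per-stage speed testing (stage names are illustrative; timing GPU stages would additionally require device synchronization):

```python
# Sketch: accumulate mean per-frame wall-clock time for each pipeline stage
# to locate bottlenecks. Illustrative only.
import time
from collections import defaultdict

def profile_pipeline(frames, stages):
    """stages: ordered list of (name, fn) pairs; each fn transforms the previous output."""
    totals = defaultdict(float)
    for frame in frames:
        x = frame
        for name, fn in stages:
            t0 = time.perf_counter()
            x = fn(x)
            totals[name] += time.perf_counter() - t0
    n = max(len(frames), 1)
    return {name: t / n for name, t in totals.items()}  # mean seconds/frame
```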
Table 2 indicates the measures of success that we plan to reach for each of the objects described above.
Table 2: Targeted measures of achieving objects
System Descriptions – Project Summary and Core Technologies
In summary, we propose to develop and improve a variety of novel technologies that together will provide the ability to detect and repair blind spots during the colonoscopy examination. These technologies draw on computer vision, deep learning, geometric processing, and knowledge of colonoscopy. While these technologies (Table 3) are targeted at special properties of colons, many are novel contributions to the aforementioned subfields of computer science and could be applied to endoscopies of many other anatomic areas of the body.
Table 3: Core technologies
Figure 19 is a flow chart illustrating an exemplary process for colonoscopic blind spot detection. Referring to Figure 19, in step 1900, the process includes receiving, as input, a video stream captured by a colonoscopic camera during a colonoscopy procedure and selecting, as output, video frames from the video stream containing information for generating a model of colonic surfaces. For example, frame classifier 102 illustrated in Figure 5 may receive a video stream captured by a colonoscopic camera during a colonoscopy procedure and may discard frames that do not include useful information for colonic surface reconstruction. Examples of frames that may be discarded include those with close occluders, specular reflections, blood, etc. In step 1902, the process includes identifying pixel depths and surface normals of the colonic surfaces. For example, pixel depth and surface normal identifier 104 illustrated in Figure 5 may identify normals (i.e., lines normal to the colonic surfaces) and pixel depths of the colonic surfaces. In step 1904, the process includes refining the surface normals using scene illumination information. For example, normal refinement module 106 illustrated in Figure 5 may refine (i.e., produce more accurate estimates of) the surface normals identified by the CNNs using the process described below in the section entitled "A Surface-normal Based Neural Framework for Colonoscopy Reconstruction". In step 1906, the process further includes refining the pixel depths using the refined surface normals. For example, normal refinement module 106 may refine the estimates of the pixel depths using the surface normals according to the process described in the same section. In step 1908, the process further includes estimating a camera pose for each of the video frames using the refined surface normals and refined depths. For example, camera pose estimator 108 illustrated in Figure 5 may use SLAM or another algorithm to estimate camera pose from changes in the images captured by successive video frames. In step 1910, the process further includes generating the model of the colonic surfaces using the pixel depths, surface normals, and camera poses.
For example, fusion module 110 illustrated in Figure 5 may generate meshes representing the points on the colonic surfaces and fuse the meshes together using SurfelMeshing, as described in [Schöps 2018]. In step 1912, the process further includes identifying blind spots in the model of the colonic surfaces. For example, blind spot detector 112 illustrated in Figure 5 may identify blind spots in the reconstructed surfaces by identifying an axis of a section of the colonic surface model, constructing a cylinder that includes the surfaces of the section of the colon, flattening the cylinder, and detecting blind spots in the flattened cylinder using mathematical morphology methods, as described in [Zhang 2020]. In step 1914, the process further includes displaying indications of the blind spots. For example, a computer display device associated with a colonoscope may display the reconstructed colonic surface model to the colonoscopist in real time during the colonoscopy procedure. The blind spots may be indicated by holes, i.e., pixels that are of a different color than the reconstructed colonic surfaces.
A Surface-normal Based Neural Framework for Colonoscopy Reconstruction
Reconstructing a 3D surface from colonoscopy video is challenging due to illumination and reflectivity variation in the video frames that can cause defective shape predictions. Aiming to overcome this challenge, we utilize the characteristics of surface normal vectors and develop a two-step neural framework that significantly improves colonoscopy reconstruction quality. The normal-based depth initialization network, trained with a self-supervised normal consistency loss, provides a depth map initialization to the normal-depth refinement module, which exploits the relationship between illumination and surface normals to refine the frame-wise normal and depth predictions recursively. Our framework's depth accuracy on phantom colonoscopy data demonstrates the value of exploiting surface normals in colonoscopy reconstruction, especially on en face views. Due to its low depth error, the prediction result from our framework requires limited post-processing to be clinically applicable for real-time colonoscopy reconstruction.
Introduction
Our goal is to reconstruct the 3D mesh from a colonoscopy video in real time according to the predicted depth and camera pose, allowing holes in the mesh to alert the physician to unsurveyed regions on the colon surface. Reconstructing the 3D model of colon surfaces concurrently during colonoscopy improves the polyp (lesion) detection rate by lowering the percentage of the colon surface that is missed during examination [Hong 2017]. Often surface regions are missed due to oblique camera orientations or occlusion by the colon folds. By reconstructing the surveyed region, the unsurveyed part can be reported to the physician as holes in the 3D surface (as in Figure 20). This approach makes it possible to guide the physician back to examine the missing region without delay. To reconstruct colon surfaces from the colonoscopy video, a dense depth map and camera position need to be predicted from each frame. Previous work [Liu 2019a, Ma 2019] trained deep neural networks to predict the needed information in real time. With proper help from post-processing [Liu 2019b, Ma 2021b], these methods often are able to reconstruct frames with abundant photometric and geometric features, such as "down-the-barrel" (axial) views where the optical axis is aligned with the organ axis. However, they often fail to reconstruct from frames where the optical axis is perpendicular to the surface ("en face" views). We address the problem of reconstruction from these en face views. In our target colonoscopy application, the geometry of scenes in these two viewpoints is significantly different, manifesting as a difference in depth ranges. In particular, the en face views have near-planar geometry, resulting in limited geometric structure to inform the photometric cues. As a result, dense depth estimation is challenging using photometric cues alone. However, the characteristics of the endoscopic environment (with a co-located light source and camera located in close proximity to the highly reflective mucus layer coating the colon) mean that illumination is a strong cue for understanding depth and surface geometry. We capitalize upon this signal to improve reconstruction in en face views. We also aim to yield the reconstruction from frame-wise predictions
with minimal post-integration to achieve near real-time execution, which requires strong geometric awareness from the network. In this work we build a neural framework that fully exploits surface normal information for colonoscopy reconstruction. Our approach is two-fold: 1) normal-based depth initialization, followed by 2) normal-depth refinement. Trained with a large amount of clinical data, the normal-based depth initialization network alone can already provide good-quality reconstructions of "down-the-barrel" video segments. To improve performance on en face views, we introduce the normal-depth refinement module to refine the depth prediction. We find that incorporating surface normal-aware losses improves both frame-wise depth estimation and 3D surface reconstruction on the C3VD [Bobrow 2022] and clinical datasets, as indicated by both measurements and visualization.
Background
Here we describe prior work on 3D reconstruction from endoscopic video, focusing particularly on colonoscopic applications. These methods usually start with a neural module that provides frame-wise depth and camera pose estimation, followed by an integration step that combines features across a video sequence to generate a 3D surface. With no ground truth from clinical data to supervise the frame-wise estimation network training, some methods transferred priors learned from synthetic data to real data [Cheng 2021, Mahmood 2016, Nadeem 2020], while others utilized the self-consistent nature of video frames to conduct unsupervised training [Liu 2019a]. In order to incorporate optimization-based methods to calibrate the results from learning-based methods, Ma et al. [Ma 2019, Ma 2021b] introduced a system with a SLAM component [Engel 2017] and a post-averaging step to correct potential camera pose errors; Bae et al. [Bae 2020] and Liu et al. [Liu II 2019] integrated Structure-from-Motion [Schonberger 2016] with the network, trading time efficiency for better dense depth quality. When trained with the widely applied photometric and simple depth consistency objectives [Bian 2020, Zhou 2017], networks frequently fail to predict high-quality, temporally consistent results due to the low geometric texture of endoscopic surfaces and time-varying lighting [Zhang 2021]. The
corresponding reconstructions produced by these methods have misaligned or unrealistic shapes as a result. Meanwhile, recent work in computer vision has shown surface normals to be useful for enforcing additional geometric constraints when refining depth predictions [Li 2021, Yang 2018, Yu 2022], while the relationship between surface normals and scene illumination has been exploited in photometric stereo [Liu 2018, Lichy 2022, Lichy 2021, Xie 2019]. The success of utilizing surface normals in complex scene reconstruction inspires us to explore this property in the endoscopic environment.
Methods
Surface normal maps describe the orientation of the 3D surface and reflect local shape knowledge. We incorporate this information in two ways: first, to enhance unsupervised consistency losses in our normal-based depth initialization (Figure 21A), and second, to allow us to use illumination information in our normal-depth refinement (Figure 21B). Figures 21A and 21B illustrate details of pixel depth and surface normal identifier 104 and normal refinement module 106 illustrated in Figure 5. In Figure 21A, pixel depth and surface normal identifier 104 comprises a depth and pose estimator that receives a pair of RGB video frames 200 and outputs per-pixel depths and normal directions for each RGB image as well as the relative pose change between the two frames. Pixel depth and surface normal identifier 104 is implemented using a neural network with an encoder/decoder architecture, in which the size of the data first shrinks (via linear algebra manipulations) as it passes through the encoder portion of the network (represented by depth encoder 202), distilling into a representation of the essential information about the input (the encoding). The encoding is passed to the decoder portion of the network (represented by normal decoder 204 and depth decoder 206), where the data expands (via linear algebra manipulations) to produce values corresponding to a depth for each pixel (depth decoder 206) and a normal direction for each pixel (normal decoder 204). The PoseNet portion (represented by PoseNet 208) uses an encoder structure (where data shrinks in size via linear algebra manipulations), taking as input both frames
simultaneously and producing a vector representing the change in the camera pose between the frames. Figure 21B illustrates details of normal refinement module 106 illustrated in Figure 5. In Figure 21B, we repeatedly update the normal directions using information about the environment lighting. This process begins with a light field processing module 210 that takes a pixel-wise depth prediction, fixed light parameters, and an RGB image and estimates the amount of light received at each point on the visible surface and the normal direction at each point. The combination of the estimated light field, normal directions, and RGB image passes through a normal refinement module 212 (implemented as a neural network) to produce an updated normal direction map. This updated map is taken as input by another neural network, depth-from-normal integration net 214, that estimates depths from normal directions, approximating integration with better speed and robustness against noise. The resulting updated depth maps can be used as input to restart the cycle from light field processing (recursive iterations) or as the final output of the framework, serving as initialization for a SLAM-based pipeline that fuses the frame-wise output into a 3D mesh following Ma et al. [Ma 2021b]. Figures 21A and 21B thus show our two-fold framework for colonoscopy reconstruction: a) the normal-based depth initialization network is trained with a self-supervised surface normal consistency loss to produce depth map and camera pose initialization; b) the normal-depth refinement framework utilizes the relation between illumination and surface geometry to refine depth and normal predictions.
Normal-based Depth Initialization
In order to fully utilize the large amount of unlabeled clinical data, our initialization network is trained with self-supervision signals based on the scene consistency of frames from the same video. We particularly exploit surface normal consistency in training to deal with the challenges of complicated colon topology, in addition to applying the commonly used photometric consistency losses [Bian 2020, Godard 2019, Zhou 2017], which are less reliable due to lighting complexity in our application. Trained with the scheme described below, this network produces good depth and camera pose
initialization for later reconstruction. We refer to this model herein as "NormDepth" or "ND".
Background - projection
The self-supervised training losses discussed in this section are built upon the pinhole camera model and the projection relation between a source view $s$ and a target view $t$ [Zhou 2017]. Given the camera intrinsics $K$, a pixel $p_t$ in the target view can be projected into the source view according to the predicted depth map $\hat{D}_t$ and the relative camera transformation $\hat{T}_{t \to s}$. This process yields the pixel's homogeneous coordinates $\hat{p}_s$ and its projected depth $\hat{d}^{s}_{t}$ in the source view, as in the following equation:

$$\hat{d}^{s}_{t}\, \hat{p}_s = K\, \hat{T}_{t \to s}\, \hat{D}_t(p_t)\, K^{-1} p_t$$
Normal Consistency Objective
As the derivative of the vertices' 3D positions, surface normals can be sensitive to error and noise on the predicted surface. Therefore, when the surface normal information is appropriately connected with the network's primary predictions, i.e., the depth and camera pose, utilizing surface normal consistency during training can further correct the predictions and improve shape consistency. Let $\hat{N}_t$ be the object's surface normals in the target coordinate system. In the source view's coordinate system, the direction of those vectors depends on the relative camera rotation $\hat{R}_{t \to s}$ (the rotation component of $\hat{T}_{t \to s}$) and should agree with the source view's own normal prediction $\hat{N}_s$; using this correspondence we form the normal consistency objective as

$$\mathcal{L}_{norm} = \sum_{p_t} \left\| \hat{R}_{t \to s}\, \hat{N}_t(p_t) - \hat{N}_s(\hat{p}_s) \right\|_1$$
Here, we use the numerical difference between the two vectors (L1 loss) as the error; in practice, we find that using the angular difference gives similar performance.
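A minimal PyTorch sketch of this objective, assuming the source-view normals have already been sampled at the projected pixel locations (tensor names and shapes are illustrative):

```python
# Sketch: rotate target-view normals into the source view and take the L1
# difference against the source-view prediction. Illustrative only.
import torch

def normal_consistency_loss(n_t, n_s_sampled, R_t_to_s):
    """n_t: (B, 3, H, W) target normals; n_s_sampled: (B, 3, H, W) source normals
    sampled at the projected pixels; R_t_to_s: (B, 3, 3) relative rotation."""
    b, _, h, w = n_t.shape
    n_rot = torch.bmm(R_t_to_s, n_t.reshape(b, 3, -1)).reshape(b, 3, h, w)
    return (n_rot - n_s_sampled).abs().mean()  # mean L1 numerical difference
```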
Surface Normal Prediction
We found that when training with colonoscopy data, computing normals directly from depths, as in some previous work [Yang 2018, Yang II 2018], is less stable and tends to produce unrealistic shapes. Instead, we built the network to output the initial surface normal information separately and trained it in consensus with the depth prediction using an orthogonality loss $\mathcal{L}_{orth}$:

$$\mathcal{L}_{orth} = \sum_{p} \left| \hat{N}(p) \cdot v(p) \right|$$

where $v(p)$ is the approximate surface vector around $p$, computed from the back-projected 3D positions of $p_a$ and $p_b$, $p$'s nearby pixels. In practice, we apply two pairs of $(p_a, p_b)$ position combinations, i.e., $p$'s top-left/bottom-right and top-right/bottom-left neighboring pixels. This orthogonality constraint bridges the surface normal and depth outputs so that the geometric consistency constraint on the normals will in turn regularize the depth prediction.
Training Overview
We adapt our depth initialization network from Godard et al. [Godard 2019] with an additional decoder to produce per-pixel normal vectors besides depths, and we apply their implementations of the photometric consistency loss $\mathcal{L}_{photo}$ and the depth smoothness loss $\mathcal{L}_{sm}$. Besides the surface normal consistency, we also enforce the prediction's geometric consistency by minimizing the difference between the predicted depths of the same scene in different frames, as in:

$$\mathcal{L}_{geo} = \sum_{p_t} \frac{\left| \hat{d}^{s}_{t} - \hat{D}_s(\hat{p}_s) \right|}{\hat{d}^{s}_{t} + \hat{D}_s(\hat{p}_s)}$$
With a per-pixel mask $M$ used to mask out stationary, invalidly projected, or specular pixels, the final training loss to supervise this initialization network is the weighted sum of the above elements, where $\lambda_1, \ldots, \lambda_4$ are hyper-parameters:

$$\mathcal{L} = \mathcal{L}_{photo} + \lambda_1 \mathcal{L}_{sm} + \lambda_2 \mathcal{L}_{geo} + \lambda_3 \mathcal{L}_{norm} + \lambda_4 \mathcal{L}_{orth}$$
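As a concrete illustration of the orthogonality term above, here is a hedged PyTorch sketch that back-projects the two diagonal neighbor pairs, forms approximate surface vectors, and penalizes their inner product with the predicted normals; names and the exact weighting are illustrative assumptions.

```python
# Sketch of an L_orth-style term: predicted normals should be orthogonal to
# approximate surface vectors built from back-projected neighbor pixels.
import torch

def orthogonality_loss(depth, normals, K_inv):
    """depth: (B, 1, H, W); normals: (B, 3, H, W); K_inv: (3, 3) inverse intrinsics."""
    b, _, h, w = depth.shape
    ys, xs = torch.meshgrid(torch.arange(h, dtype=torch.float32),
                            torch.arange(w, dtype=torch.float32), indexing="ij")
    pix = torch.stack([xs, ys, torch.ones_like(xs)], 0).reshape(3, -1)
    pts = (K_inv @ pix).reshape(1, 3, h, w) * depth   # back-projected 3D points
    v1 = pts[:, :, 2:, 2:] - pts[:, :, :-2, :-2]      # top-left/bottom-right pair
    v2 = pts[:, :, 2:, :-2] - pts[:, :, :-2, 2:]      # top-right/bottom-left pair
    n = normals[:, :, 1:-1, 1:-1]
    return ((v1 * n).sum(1).abs() + (v2 * n).sum(1).abs()).mean()
```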
Normal-Depth Refinement
In the endoscopic environment, there is a strong correlation between the scene illumination from the point light source and the scene geometry characterized by the surface normals. Our normal-depth refinement framework takes as input a combination of the color image, the scene illumination as represented by the light field, and an initial surface normal map. We use both supervised and self-supervised consistency losses to simultaneously enforce improved normal map refinement and consistent performance across varying scene illumination scenarios.
Light Field Computation
We use the light field to approximate the amount of light each point on the viewed surface receives from the light source. As in Lichy et al. [Lichy 2022], we parameterize our light source by its position relative to the camera, its light direction, and its angular attenuation. In the endoscopic environment, the light source and camera are effectively co-located, so we take the light source position and the light direction to be fixed at the origin $o$ and parallel to the optical axis $z$, respectively. Thus, for angular attenuation $\mu$ and depth map $\hat{D}$, we define the point-wise light field $\ell_p$ and the point-wise attenuation $a_p$ at each back-projected surface point $x_p$ as

$$\ell_p = \frac{x_p - o}{\left\| x_p - o \right\|}, \qquad a_p = \frac{\left( \ell_p \cdot z \right)^{\mu}}{\left\| x_p - o \right\|^2}$$
For our model input, we concatenate the RGB image, $\ell_p$, $a_p$, and the normal map $\hat{N}$ (computed from the gradient of the depth map) along the channel dimension.
Training Overview
In order to use illumination in colonoscopy reconstruction, we adapt our depth-normal refinement model from Lichy et al. [Lichy 2022] with additional consistency losses and modified initialization. We use repeated iterations for refinement; in order to reduce introduced noise, we use a multi-scale network, as in many works in neural photometric stereo [Li 2018, Lichy 2022, Lichy
2021]. After each recursive iteration, we upsample the depth map to compute the normal refinement at a higher resolution. We denote $n$ iterations by "$n \times$NR". We compute the following losses at each scale, rescaling the ground truth where necessary to match the model output. For the supervised loss $\mathcal{L}_{gt}$ at iteration $i$, we minimize the L1 loss between the normal refinement module output $\hat{N}_i$ and the matching ground truth normal map $N$, as well as the L1 loss between the depth-from-normal model output $\hat{D}_i$ and the matching ground truth depth map $D$. We define a scaling factor to account for the scale ambiguity between the predicted and ground truth depths.
For the depth-from-normal integration module, we compute a normal map $\hat{N}'_i$ from its depth output and minimize the L1 loss between it and the input normal map $\hat{N}_i$; this has the effect of imposing the orthogonality constraint between the depth and surface normal maps.
We use a multi-phase training regime for stability. Within an iteration, we first train the normal refinement module while substituting an analytical depth-from-normal integration method. For the second phase, we freeze the normal refinement module and train only the depth-from-normal integration module. For the third and final phase of training, we use the normal refinement module together with neural integration, optimizing a weighted sum of all of the losses above with hyper-parameters $\lambda_1$ and $\lambda_2$; the loss for each phase comprises the terms relevant to the modules being trained in that phase.
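Before turning to the experiments, here is a hedged sketch of the refinement loop as a whole; the two neural modules are passed in as callables (refine_net and integrate_net are hypothetical stand-ins), and the per-iteration upsampling of the multi-scale scheme is omitted for brevity.

```python
# Sketch: recursive normal-depth refinement driven by a co-located point
# light model. Module interfaces and channel layout are illustrative.
import torch

def point_light_field(depth, K_inv, mu=2.0):
    """Per-pixel light direction and attenuation for a source at the origin
    pointing along the optical axis (co-located source and camera)."""
    b, _, h, w = depth.shape
    ys, xs = torch.meshgrid(torch.arange(h, dtype=torch.float32),
                            torch.arange(w, dtype=torch.float32), indexing="ij")
    pix = torch.stack([xs, ys, torch.ones_like(xs)], 0).reshape(3, -1)
    pts = (K_inv @ pix).reshape(1, 3, h, w) * depth
    dist = pts.norm(dim=1, keepdim=True).clamp(min=1e-6)
    l = pts / dist                                    # light direction at each point
    a = l[:, 2:3].clamp(min=0).pow(mu) / dist.pow(2)  # angular falloff + inverse square
    return l, a

def normals_from_depth(depth):
    """Coarse finite-difference normals from a depth map."""
    pad = torch.nn.functional.pad
    dzdx = pad(depth[..., :, 1:] - depth[..., :, :-1], (0, 1, 0, 0))
    dzdy = pad(depth[..., 1:, :] - depth[..., :-1, :], (0, 0, 0, 1))
    n = torch.cat([-dzdx, -dzdy, torch.ones_like(depth)], dim=1)
    return n / n.norm(dim=1, keepdim=True).clamp(min=1e-6)

def refine(rgb, depth, K_inv, refine_net, integrate_net, iters=2):
    for _ in range(iters):
        l, a = point_light_field(depth, K_inv)
        x = torch.cat([rgb, l, a, normals_from_depth(depth)], dim=1)
        normals = refine_net(x)          # updated normal map
        depth = integrate_net(normals)   # approximate depth-from-normal integration
    return depth, normals
```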
Experiments
In our experiments, we demonstrate that incorporating surface normal information improves both frame-wise depth estimation and 3D surface reconstruction. We describe the frame-wise depth map improvement over
baseline and the effect of various ablations in the description of the ablation studies below. To evaluate the effect of frame-wise depth estimation on surface reconstruction, we compare the reconstructions obtained by initializing the SLAM pipeline [Ma 2021b] with the outputs of the various frame-wise depth estimation methods against initialization with ground truth depth maps. We provide a comparison of Chamfer distance [Zhou 2018] on aligned mesh reconstructions in Table 4 and a qualitative comparison in the C3VD Reconstruction subsection below. We also provide a qualitative comparison of the surfaces reconstructed from clinical video in the Clinical Reconstructions subsection.
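For reference, the following is a toy sketch of a symmetric Chamfer distance between two sampled point sets; practical implementations operate on meshes and use spatial data structures rather than the dense distance matrix shown here.

```python
# Sketch: symmetric Chamfer distance between point sets a (N, 3) and b (M, 3).
import torch

def chamfer_distance(a, b):
    d = torch.cdist(a, b)  # (N, M) pairwise Euclidean distances
    return d.min(dim=1).values.mean() + d.min(dim=0).values.mean()
```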
Table 4: Error averaged over 5-fold cross-validation test sets of C3VD, ± standard deviation. Best performance in bold. "NormDepth" and "ND" stand for normal-based depth initialization, and "n×NR" stands for normal-depth refinement with n iterations. "NormDepth −Lnorm" denotes NormDepth trained without Lnorm. "flat init" denotes refinement initialized with planar depth rather than NormDepth output.
Dataset
To train the normal-based depth initialization network, as well as the self-supervised baseline Monodepth2 [Godard 2019], we collected videos from 85 clinical procedures and randomly sampled 185k frames as the training set and another 5k for validation. We used the Colonoscopy 3D Video Dataset (C3VD) [Bobrow 2022] for the normal-depth refinement module; it provides ground truth depth maps and camera poses from colonoscopy of a silicone colon phantom. We divide this dataset into 5 randomly-drawn cross-
validation partitions, each with 20 training and 3 testing sequences, such that the test sequences do not overlap. The results reported below are the methods' average performance across all folds.
Frame-wise Depth Evaluation
Figure 22 shows example depth predictions and RMSE maps from C3VD; in the depth maps, darker colors denote more distant depths, in the RMSE maps, brighter colors denote higher error, and some areas of improvement are highlighted in boxes. We compared our method's depth prediction with several ablations and the baseline against the ground truth in C3VD. Following the practice of Godard et al. [Godard 2019], we rescaled our depth output to match the median of the ground truth and report 4 pixel-wise aggregated error metrics in Table 4. Comparing depth prediction errors (Figure 22), both models using our two-stage method significantly outperform the photometric-based baseline Monodepth2, demonstrating the merit of emphasizing geometric features (specifically surface normals) in colonoscopic depth estimation. Meanwhile, although each individual stage of our two-stage method (NormDepth and flat init + NR) already produces relatively good performance, the combined system performs even better and generates the best quantitative result on this dataset (from ND init + 1×NR). Note that although the normal consistency loss Lnorm and multiple iterations of normal refinement do not boost quantitative performance here, owing to the nature of the C3VD dataset, they are critical for generating better 3D reconstruction shapes.
C3VD Reconstruction
In this section, we demonstrate the improvement in reconstructions of the C3VD data using our normal-aware methods. In particular, we examine the effects of initializing our SLAM pipeline with the various depth and pose estimation methods. Although C3VD provides a digital model of the phantom, here we compare against the reconstruction produced by using the ground truth depths and poses as initialization to our SLAM pipeline (and refer to this
as the ground truth below). In this way, we control for the impact of the SLAM pipeline in our reconstruction comparison. In Table 4, we measure the Chamfer distance from the ground truth to the reconstructed mesh after ordinary Procrustes alignment and optimization of the scaling factor for the Chamfer distance. Overall, we find that the performance improvements observed in the frame-wise depth estimation are reflected in the reconstructions as well. Similarly, the weaknesses observed in the frame-wise inference also transfer to the reconstructions. In particular, we note that where significant noise is present in the frame-wise depth estimation for ND init + 1×NR and reduced for ND init + 4×NR, the corresponding reconstructions reflect the difference in noise as well. In Figure 23, we visualize the reconstructions corresponding to example video sequences. In these sequences, we observe that our normal-aware methods significantly outperform the baseline Monodepth2 in qualitative similarity to the ground truth. In addition, we notice that the high curvature of the surface observed in the NormDepth and NormDepth −Lnorm reconstructions is reduced after refinement, bringing the overall reconstructed result closer to the ground truth.
Clinical Reconstructions
We tested our trained depth estimation models on clinical colonoscopy sequences and generated 3D reconstructions with the SLAM pipeline. Figure 24 shows the reconstructed meshes of two "down-the-barrel" segments (Figures 24A and 24B) and an en face segment (Figure 24C). The reconstruction quality from the two stages of our method ("ND" and "ND + n×NR") significantly outperforms the photometric baseline Monodepth2. For "down-the-barrel" sequences, where features are relatively rich, we expect a generalized cylinder shape with limited sparsity. For sequence (a), we expect two large blind spots due to occlusion by ridges and a slightly curved centerline. For sequence (b), we also expect two large blind spots due to the camera position but fairly dense surface coverage elsewhere. For these sequences, our predictions' shapes are more cylindrical and have
surface coverage that more accurately reflects the quantity of surface surveyed compared to the reconstruction produced using Monodepth2. The results also indicate that when trained without the normal consistency loss (−Lnorm), NormDepth tends to predict more artifacts, such as the skirt-shaped outlier in sequence (a). This demonstrates the benefit of surface normal information in network training for improved consistency between frames. Meanwhile, using multi-scale iterations of normal-depth refinement reduces the noise and sparsity of the reconstructed meshes compared to a single iteration. For the en face sequence (c), we expect a nearly planar surface. Similar to the observations made in reconstructing sequences from C3VD, the high surface curvature produced by the initialization network is reduced after refinement, resulting in a more realistic reconstruction.
Conclusion
In this work we introduced the use of surface normal information to improve frame-wise depth and camera pose estimation in colonoscopy video and found that this in turn improves our ability to reconstruct 3D surfaces from videos with low geometric texture. We used a combination of supervised and unsupervised losses to train our multi-stage framework and found significant performance improvements over methods that do not consider surface geometry. We have also shown that incorporating normal-aware losses allows us to reconstruct clinical videos of low-texture en face views.
Limitations and Future Work
In this work, we have treated "down-the-barrel" and en face views separately. In practice, colonoscopy videos transition between these two view types, so constructing a framework that can also transition between view types would have significant clinical application; we leave this investigation to future work.
Scribble-Supervised Semantic Segmentation for Haustral Fold Detection
Purpose
Colonoscopy is considered the "gold standard" screening procedure for colorectal cancers, but its effectiveness is limited by the fact that endoscopists sometimes fail to view all parts of the colon surface. We seek to build a system for detecting these missed areas during the procedure, alerting the endoscopist, and providing means to guide him or her back to view the missed areas.
Methods
As part of a system for detecting these missed areas and ameliorating the misses, it is useful to provide a means for the computer to comprehend the colon geometry viewed in a video frame. To that end, we introduce a new deep learning method for semantic segmentation of colonoscopy video frames to detect haustral folds. Our method is based on the DeepLabV3+ neural network architecture and takes as input frame colors and per-frame depth maps produced by a reconstruction method. It is trained using scribble supervision, a type of weakly-supervised learning.
Results
We show that our method achieves good results and outperforms the state of the art for haustral fold detection from video frames, with a pixel accuracy of 90% compared to 66% for the state of the art. In addition, we demonstrate that our method produces consistent segmentations over colonoscopic video sequences.
Conclusion
Our method of using scribble supervision to train a neural network for detecting haustral folds outperforms the state of the art for this task and achieves high accuracy and consistent results over consecutive frames. Thus our method can potentially be used for localization within the colon.
Introduction
Colonoscopy is a standard procedure for detecting and preventing colorectal cancer. It is performed by a physician, who inserts an endoscope into the patient's colon and visually inspects it for cancerous or precancerous growths. However, colonoscopy's effectiveness is limited by the fact that some areas of the colon surface are unintentionally not surveyed. Methods have been proposed for detecting these missed areas, or blind spots, but once a blind spot is detected, a solution is needed to guide the endoscopist back to survey it. One approach is to use the ridges on the colon, known as haustral folds, as reference features to determine the location of the endoscope and the blind spot. Using haustral folds for navigation first requires being able to reliably detect them in the video feed from the endoscope. The most reliable method in the literature for this detection is based on CycleGAN [Zhu 2017]. In this section, we propose a more effective method utilizing a training approach known as scribble supervision.
Cancer and its detection
Colorectal cancer (CRC) is the third-most common cause of cancer death in the United States [US 2021]. Colonoscopy is the standard screening procedure for CRC and is recommended for all adults when they reach age 45 and every 10 years thereafter. In the year 2000, 20% of adults had undergone colonoscopy in the past 10 years; by 2018, the figure was 61%, and this increase is believed to be responsible for significant reductions in CRC morbidity and mortality over this time period [Siegel 2020, Levin 2018]. In colonoscopy, the endoscopist uses an optical endoscope to visually examine the surface of the patient's colon. A wire loop inserted through a hollow channel in the endoscope is used to remove adenomatous colon polyps, the cause of almost all colorectal cancers [Levin 2006, Ahn 2012]. But colonoscopy still misses between 6 and 27 percent of polyps, many due to the endoscopist not surveying some regions of the colon surface [Zhao 2016].
Detection of blind spots
Ma et al. [Ma 2019] developed a method of detecting these blind spots (missed regions of the colon surface) through real-time 3D reconstruction of the colon surface. Once a blind spot is detected, the endoscopist may wish to be guided back to it in order to survey the region. This requires localization of the endoscope tip within the colon, which is made difficult by the discontinuous nature of the 3D reconstructed segments, as well as the non-rigid nature of the colon. In our situation, the only external information available is the video feed from the endoscope itself, which consists of a sequence of frames. Our task is to use this information, and information derived from it in real time, to determine the location of the endoscope tip in the colon relative to detected blind spots. In particular, when the endoscope tip is within 10-15 cm of the blind spot, we need to be able to show its position in real time (along with the blind spot) on an onscreen visualization. This will allow the endoscopist to precisely maneuver back to inspect the blind spot. To do this, we need to be able to associate what is currently being observed with what was observed in the same region when the blind spot was missed. While a pose computed via SLAM will be available, it will not necessarily be known relative to the blind spot. Thus we need a direct localization method based on current observations. Previous work on the problem of localization during colonoscopy is sparse. One approach used photometric features with place-recognition methods taken from the field of robotics [Ma 2021a]. However, since the visual appearance of the colon can change with deformation, an alternative is to consider geometric features of the colon. The most prominent of these are the haustral folds, which appear as ridges on the colon's surface and are thought to be created by contractions of the smooth muscle surrounding the colon [Huizinga 2021]. The first step in using haustral folds for navigation is to be able to reliably detect them in a video frame, which can be done by image segmentation. The most reliable method until now is known as FoldIt [Mathew 2021b]. It is based on CycleGAN [Zhu 2017] and treats the problem of fold detection as one of domain translation. While FoldIt does not require images
from optical colonoscopy to be labeled, its accuracy degrades for small folds, and its generative approach leads to strong priors which cause false positives. Instead of using a generative approach, we treat the problem as classification of pixels, i.e., semantic segmentation.
Our Contributions
In this work, we introduce an alternative deep learning-based method of haustral fold detection. Instead of training on unpaired data, we chose to use scribble annotations as labels for weakly-supervised learning [Lin 2016]. Traditionally, semantic segmentation is learned from a dataset of images in which every pixel is given a class label. The downside to this approach is that fully labeling an image is very time-consuming. In addition, many pixels are ambiguous and can be reasonably assigned to either the "fold" or the "not fold" class. Thus, we chose to use scribbles to label only a selection of pixels whose classes are clearly evident. As this method is designed to be used in the same product as [Ma 2016], it is possible to take advantage of the geometry models therein. Specifically, one model, called ColDE [Zhang 2021], predicts the distance from the camera, or depth, at each pixel in every frame. We thus combine these depth maps with the color frames, creating a 4-channel RGB-D input for our model. The contributions of this section can be summarized as follows:
• We propose and evaluate an approach which outperforms FoldIt at detecting haustral folds over a variety of metrics. In addition, our approach learns a one-to-one mapping, whereas FoldIt learns a many-to-many mapping. Thus our approach is more flexible in terms of training data, as continuous video sequences are not required.
• To our knowledge, we are the first to combine scribble supervision with monocular depth information, as well as the first to formulate the problem of haustral fold detection as one of discriminative semantic segmentation.
Related Work
Scribble Supervision
The goal of scribble-supervised semantic segmentation is to assign a class label to each pixel in an image while learning only from a set of images with a small fraction of their pixels labeled. Specifically, the labels are "scribbles", that is, curves drawn on the image (see Figure 13). Each scribble has one class, and the pixels under the scribble are assigned that class in the label. All other pixels are marked as "unknown" class. Although only a small fraction of pixels, usually less than 5%, are labeled, it has been found that scribble supervision can still produce results comparable to fully-supervised approaches [Tang 2018a, Tang 2018b].
FoldIt
FoldIt [Mathew 2021b], a method based on CycleGAN [Zhu 2017], is the best previous method of fold detection to our knowledge. It formulates fold detection as a problem of translating between image domains. The authors use three domains: domain A, images from optical colonoscopy; domain B, images from virtual colonoscopy, that is, grayscale images without texture taken inside 3D models of colons derived from CT scans; and domain C, images inside 3D colon models where folds have been marked in red using the mathematical algorithm from [Nadeem 2017]¹. Like CycleGAN, FoldIt uses unpaired data; there is no correspondence between specific images in separate domains. The authors released two pre-trained FoldIt models with the same architecture trained on different datasets: one trained on a subset of the same data we used, and another trained on a public dataset of frames from optical colonoscopy. While the FoldIt authors trained on the same data as this work for the figures in their paper, and they claim that there is no significant difference from FoldIt trained on public data, we have found that the latter always outperforms the former on the metrics we used for evaluation in the
section entitled "Segmentation Performance vs. FoldIt", at least for optical colonoscopy studies. We will refer to this version simply as FoldIt hereinafter.
¹ We did not consider this method in this section because it is a geometric method operating on CT scans of the entire colon, which are not available during optical colonoscopy.
Finding normals and depths
In addition to the RGB color values, we also use the depth values produced by ColDE [Zhang 2021], which provide geometric information. Haustral folds are geometric features, but they express themselves in images as variations in color, i.e., as photometric features. By using RGB together with depths, we take advantage of both the photometric and the geometric features of haustral folds for better inference.
Preliminaries
One approach to scribble-supervised semantic segmentation is known as the "two-stage" approach: scribbles are first used to generate pseudo-labels for every pixel, which are then used to train the network as in fully-supervised learning. However, we chose the alternative, "one-stage" approach because it is less computationally intensive to train and has fewer hyper-parameters to tune. Instead of generating pseudo-labels, we directly train the network on two loss functions: a supervised loss function over the annotated pixels, and a loss function which is the weighted sum over a set $\{R_1, \ldots, R_R\}$ of regularization loss functions over all pixels. For an image $I$ of $h$ by $w$ pixels, the set of all pixels is $\Omega$, and $|\Omega| = hw$. For an RGB-D image, each pixel $p \in \Omega$ is a vector $x_p \in \mathbb{R}^4$. Given a network input $X$, the output $S(X)$ is a set of $|\Omega|$ indicator vectors $S_p \in [0,1]^2$ corresponding to the softmax outputs for the two classes. The subset $\Omega_L \subset \Omega$ is the set of pixels with scribble labels, and the corresponding ground truth labels are $y_p \in \{0,1\}$ for all $p \in \Omega_L$. We now define the total segmentation loss $\mathcal{L}(S, y)$ to be

$$\mathcal{L}(S, y) = \sum_{p \in \Omega_L} \mathrm{BCE}(S_p, y_p) + \sum_{r=1}^{R} \lambda_r R_r(S; X)$$
The first term in this equation is the binary cross-entropy loss computed over all labeled pixels. The second term is the total regularization loss, a weighted sum of $R$ regularization losses $R_1, \ldots, R_R$. Each $R_r(S_p, S_q; X)$ is defined in terms of the affinity of the segmentation outputs at two pixels $p, q$, as well as of $X$, the input. We use the $(\cdot\,; X)$ notation to emphasize that each function $R_r$ is parameterized by the entire input $X$; we explain the $R_r$ in further detail below.
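A minimal PyTorch sketch of this one-stage objective follows; the names are illustrative, and the regularizers are passed in as weight/function pairs.

```python
# Sketch: binary cross-entropy on scribble-labeled pixels plus weighted
# regularization losses over all pixels. Illustrative only.
import torch
import torch.nn.functional as F

def scribble_loss(S, labels, mask, regularizers, X):
    """S: (P, 2) softmax outputs; labels: (P,) in {0, 1}; mask: (P,) bool flags
    for scribble-annotated pixels; X: network input passed to each regularizer."""
    ce = F.binary_cross_entropy(S[mask, 1], labels[mask].float())
    reg = sum(w * r(S, X) for w, r in regularizers)
    return ce + reg
```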
Data
We prepared training and testing datasets for training and evaluating our methods, respectively, by uniformly sampling our research group's library of colonoscopy video sequences. We then processed these sampled frames using ColDE [Zhang 2021] to predict a dense pixel depth map D, which was concatenated with the image to form an RGB-D image. To these we applied scribble annotations. Most frames in both the training and test sets had a small angle between the camera and the longitudinal axis of the colon (known as "down the barrel" frames). While this is representative of the population of frames in video sequences that our group can presently successfully reconstruct for blind spot detection, it does mean that there are relatively few frames in the training and test sets with a large angle between camera and colon ("en face" frames).
Training Set
Our training dataset was obtained by first taking every fifth frame, sequentially, of each sequence in our research group's library of colonoscopy frames from approximately 78 patients, for a total of 2556 frames in the training set. Scribble annotations were then applied to 1828 of these frames; the remainder lacked sufficient detail to make annotations with high confidence. We then processed these 1828 frames using [Zhang 2021], as explained above.
Annotations
Scribble annotations were applied to images using a tool we wrote in Python; example annotations are shown in Figure 13. The user has the option to draw color-coded scribbles for the two classes, "fold" and "not fold", and can also erase and change frames using the keyboard. Scribbles were applied only to pixels where the expert annotator had high confidence that the class label was correct. For "fold" scribbles, the scribbles were to
follow the curve of the fold, close to its crest but not including any pixels behind it. On each frame, the annotator's goal was to put a scribble in each connected component with the proper class label, without making low-confidence annotations. Some of the selected frames could not be annotated and were excluded from the training set because they did not contain any regions that could be labeled with adequate confidence.
Data Augmentation
Data augmentation was used to increase the diversity of the training set. Images were flipped horizontally at random with 50% probability and randomly cropped in order to increase the range of physical scales represented in the data. To introduce more labeled pixels, we also exploited the fact that scribbles were drawn well away from the expected boundaries between "fold" and "not fold" regions and increased the stroke width of the annotations. These thicker scribbles were manually examined on a sample of training images to confirm that no pixels were being mislabeled.
Test Set
In addition to the training set, a test set was prepared and annotated. This set consisted of 64 video frames and their predicted depth maps. It was drawn at random from the same library of video frames as the training set, but excluding the frames in the training set so that the test set is unseen by the network. Each of these frames was scribble-annotated, this time attempting to place scribbles on as many connected components of the full segmentation as possible, in addition to following the criteria used for annotating the training set. Across all 64 images, there were 203,170 labeled pixels, with 101,604 "fold" and 101,566 "not fold". As explained above, the test set does not contain many "en face" frames. As we expect the depth map inputs to be most useful on these frames, this test set cannot fully evaluate the advantage of depth map inputs. We discuss this further below.
Methods
Network Architecture
Our network architecture, illustrated in Figures 29A and 29B, is similar to the DeepLabV3+ version used in [Tang 2018b], with two notable changes. First, instead of the ResNet-101 backbone, we use MobileNetV2 [Sandler 2018] as our backbone. This is a network designed to run on mobile devices, so it runs faster than ResNet-101 on the same hardware. Second, we modified the first layer of MobileNetV2 to accept a 4-channel input while retaining the pretrained weights, via the method used in the timm Python package. The RGB channels are initialized to their respective pretrained weights, while the depth channel is initialized to the R channel weights. This was done in order to initialize the depth channel with pretrained weights instead of random ones. The weights in all other layers are initialized to the pretrained MobileNetV2 weights, as they are unaffected by the change.
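The following is a minimal stand-alone sketch of this first-layer modification; the timm package implements a more general version, and the helper name here is illustrative.

```python
# Sketch: extend a pretrained 3-channel first convolution to 4 (RGB-D) input
# channels; RGB weights are kept and the depth channel copies the R weights.
import torch
import torch.nn as nn

def inflate_first_conv(conv3: nn.Conv2d) -> nn.Conv2d:
    conv4 = nn.Conv2d(4, conv3.out_channels, conv3.kernel_size,
                      stride=conv3.stride, padding=conv3.padding,
                      bias=conv3.bias is not None)
    with torch.no_grad():
        conv4.weight[:, :3] = conv3.weight           # copy pretrained RGB weights
        conv4.weight[:, 3:4] = conv3.weight[:, 0:1]  # depth channel <- R channel
        if conv3.bias is not None:
            conv4.bias.copy_(conv3.bias)
    return conv4
```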
Loss Functions
While the cross-entropy loss term alone from the total segmentation loss above is enough to obtain decent performance, it has been found [Tang 2018a, Tang 2018b] that adding regularization losses can improve performance. In this work, we experiment with two regularization losses, both from [Tang 2018b] and defined below: the dense conditional random field loss $\mathcal{L}_{CRF}$ and the normalized cut loss $\mathcal{L}_{NC}$.
Dense CRF Loss
The dense conditional random field (dense CRF) regularization loss [Tang 2018b] computes the affinity between every pair of pixels and is low when pixels with high affinity for each other are given the same class label. Let $S^k \in [0,1]^{|\Omega|}$ be the vector of the $k$th components of every $S_p$, and let the affinity matrix $W \in \mathbb{R}^{|\Omega| \times |\Omega|}$ be a function of the input image giving a weight to every pair of pixels $p, q \in \Omega$ based on their similarity to each other. We detail the affinity matrix below. We can write the dense CRF loss as

$$\mathcal{L}_{CRF}(S) = \sum_{k} S^{k\top} W \left( \mathbf{1} - S^{k} \right)$$

Normalized cut loss
The normalized cut loss is an extension of $\mathcal{L}_{CRF}$:

$$\mathcal{L}_{NC}(S) = \sum_{k} \frac{S^{k\top} W \left( \mathbf{1} - S^{k} \right)}{d^{\top} S^{k}}$$
The degree vector $d$ is equal to $W\mathbf{1}$. This allows us to simplify the loss by removing a constant term, so we have

$$\mathcal{L}_{NC}(S) = -\sum_{k} \frac{S^{k\top} W S^{k}}{d^{\top} S^{k}}$$
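As a toy illustration of both regularizers, the sketch below materializes the dense affinity matrix W explicitly for a small image; practical implementations, including the one we build on, use fast bilateral filtering instead.

```python
# Sketch: dense CRF and (simplified) normalized cut losses over softmax
# outputs S of shape (P, K) with a dense affinity matrix W of shape (P, P).
import torch

def crf_and_nc_losses(S, W):
    d = W.sum(dim=1)  # degree vector d = W 1
    crf, nc = 0.0, 0.0
    for k in range(S.shape[1]):
        sk = S[:, k]
        crf = crf + sk @ (W @ (1.0 - sk))                     # dense CRF relaxation
        nc = nc - (sk @ (W @ sk)) / (d @ sk).clamp(min=1e-8)  # simplified normalized cut
    return crf, nc
```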
The affinity matrix $W$, with entries $W_{pq}$ for $p, q \in \Omega$, assigns the weight that every pixel applies to the loss value at a given pixel. Since this weight is very small for far-away pixels, we follow [Tang 2018b] and compute the products $W S^k$ by applying a bilateral Gaussian (RGB vs. XY) filter to the RGB channels of the input $X$, so that

$$W_{pq} = \frac{1}{Z_p}\, g\!\left( \left\| p - q \right\|_2 ;\, \sigma_{XY} \right) g\!\left( \left\| I_p - I_q \right\|_2 ;\, \sigma_{RGB} \right)$$
where $Z_p$ is a normalization constant, $g(\cdot\,; \sigma)$ is the 1-dimensional Gaussian centered at 0 with standard deviation $\sigma$ (for hyper-parameters $\sigma_{XY}$, $\sigma_{RGB}$), and $\| \cdot \|_2$ is the L2 vector norm. Thus $W_{pq}$ is a function of the distance and the difference in intensities (i.e., the distance in RGB space) between a pixel and every other pixel.
Implementation details
All training and testing were performed on a computer running Ubuntu 20.04 LTS with an NVIDIA GeForce Titan X GPU. Our code is built on that of [Tang 2018b], which provided an implementation of $\mathcal{L}_{CRF}$; we used their C++ bilateral filter implementation to implement $\mathcal{L}_{NC}$.
Results
We performed our experiments using the test set described above. Evaluation metrics were calculated only on the labeled pixels. We conducted two types of experiments, one studying the performance of each
class overall and the other studying the performance per frame. In this section we demonstrate the superior performance of our method against FoldIt, the consistency of our method in identifying folds across consecutive video frames, and the results of ablation studies.
Experiment descriptions
Per-class experiments
For the first set of experiments, the metrics used are per-class precision², recall³, and F1 score, as well as overall accuracy for each model. These were chosen because the objective of these experiments is to study overall classification performance, the result of which is one predicted class label per pixel in each image in the test set. They are standard metrics in the machine learning literature for evaluating classifier performance. Defining TP, TN, FP, FN to be the total numbers of true positive, true negative, false positive, and false negative pixels respectively for a class, we have

$$\text{Precision} = \frac{\text{TP}}{\text{TP} + \text{FP}}, \text{ the fraction of positive classifications that are correct}$$

$$\text{Recall} = \frac{\text{TP}}{\text{TP} + \text{FN}}, \text{ the fraction of positives that are correctly classified as such}$$
These are computed across all 203,170 labeled pixels in all 64 labeled images, so variance cannot easily be presented. The next subsection describes the second set of experiments we conducted, which attempts to understand the variation in performance between frames.
Per-frame experiments
For the second set of experiments, we computed the accuracy on labeled pixels of our two best configurations (no regularizations, with and without depth inputs) on each of our 64 test frames. We did the same for FoldIt for comparison, in order to study the variation of these methods on different
² Also known as positive predictive value (PPV).
³ Also known as sensitivity or true positive rate (TPR).
types of frames. As these experiments evaluate accuracy over the population of frames, which is a continuous-valued quantity, it is not possible to use the metrics from the first set of experiments. Instead, we consider the distribution of accuracies over the population of frames, presented in Figure 25.
Segmentation Performance vs. FoldIt
Performance versus FoldIt is shown in Figures 25 and 26 and Table 5. These results are from a network trained for 96 epochs on RGB-D data with no regularization loss terms. Below, we present and discuss the results of ablation studies on regularization terms and depth inputs.
Per-Class Performance
Figure 27 illustrates sample images with the network outputs superimposed in green; the fourth row shows the ground-truth label scribbles. Green boxes show regions where our method correctly marks folds that were missed by FoldIt. Red boxes show regions where FoldIt incorrectly marked folds that our method correctly marked as not-folds. The purple oval in the second row of the second column marks a region where our method gave an inconsistent result.
Table 5: Comparison of our method's performance versus FoldIt on the test set of 64 annotated images. Each result is presented as an ordered pair of the form (our result, FoldIt result).

As seen in Figure 27, our method produces better segmentations in all samples. It is clear in the first three columns that our method correctly segments folds that are entirely missed by FoldIt. In the fourth and sixth columns, FoldIt marks more area as "fold" than it should (red boxes). In the fifth column, FoldIt marks the entire dark area in the lower-left corner as "fold" even though it is not; this can also be seen in the first and fourth columns. Our method does not run into this problem, as it is trained on numerous examples where this part of the image, corresponding to the farthest visible part of the colon wall, is labeled as "not fold". This is in agreement with the results in Table 5, where our method outperforms FoldIt on every metric used for evaluation. Additionally, four videos showing the superior performance of our method versus FoldIt are available at the following link: https://www.youtube.com/playlist?list=PLkQ2N98wMK6gAYF9HtOUjL-oGf0OYv7j74

In both Figure 27 and the videos, FoldIt has a tendency to mark as "fold" an approximately circular region near the center of the frame in "down the barrel" views. The colon is never perfectly straight, meaning that even "down the barrel" camera views will be facing the colon wall at a finite distance.
However, our method still has room for improvement. We can see in the second column of Figure 27 that although it picks up more folds than FoldIt, our method is inconsistent in how it classifies pixels on the sides of folds (circled in purple). The circled region should either be marked entirely as "fold" or contain two parallel but disconnected "fold" regions. In addition, our method sometimes misclassifies pixels close to maximum intensity as "fold"; such pixels are common in colonoscopy as a result of specular reflection off of wet surfaces.

Per-frame Performance

We see clearly in Figure 25 that our method, trained with or without depth inputs, produces a higher median accuracy on the frames of the test set, with lower variance, than FoldIt. This is significant because if this method is to be used for localization, it needs to perform consistently in all parts of the colon.

Timing

As our neural network is fully feedforward and deterministic, the time to process each frame is essentially constant; the exception is the first frame processed, which takes much longer because of the way PyTorch allocates memory. In clinical use, the model will be instantiated once and then used for the duration of the procedure, so the first frame can be ignored in calculating the time per frame. We report 0.017 ± 0.0025 sec/frame over 63 of the 64 frames in the test set, corresponding to 59.9 frames per second.
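A timing harness along the following lines (a sketch; the model and the frame tensors are placeholders, not our exact benchmark code) discards the warm-up frame before averaging:

```python
import time
import torch

@torch.no_grad()
def seconds_per_frame(model, frames, device="cuda"):
    """Mean and std of per-frame inference time, discarding the first
    (warm-up) frame, whose cost is dominated by memory allocation."""
    model.eval().to(device)
    times = []
    for i, frame in enumerate(frames):
        x = frame.to(device)
        if device == "cuda":
            torch.cuda.synchronize()          # start from an idle GPU
        start = time.perf_counter()
        _ = model(x)
        if device == "cuda":
            torch.cuda.synchronize()          # wait for kernels to finish
        if i > 0:                             # skip the warm-up frame
            times.append(time.perf_counter() - start)
    t = torch.tensor(times)
    return t.mean().item(), t.std().item()
```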
Ablation Studies

As part of this work, we conducted ablation studies on the two regularization losses (normalized cut and dense CRF) and on the use of predicted depth maps. The results, shown as validation accuracies during training, are in Figure 26. We found that the best results occurred with no regularization losses, and that there was no statistically significant difference with or without depth input, although in Figure 25 we see that the model trained with depths has a slightly higher median accuracy. The lack of a significant difference with or without depths is likely due to the test set not having many "en face" frames. In en face frames, we expect depths to provide more of an advantage because the height of the folds would be clearly seen. By contrast, "down the barrel" frames present each fold and its adjacent pockets at a similar distance to the camera, so the depths do not provide much information for detecting folds. For "down the barrel" frames, we expect the surface normals and curvatures, the derivatives of the depths, to provide a significant advantage over RGB-only input, as these features vary more between folds and the rest of the colon. Evidence for this can be found in [Nadeem 2017], in which a proposed method for detecting folds uses information derived from curvature.

The reduced performance with regularization losses compared to the supervised loss alone is likely a result of the regularizations being designed for general-purpose segmentation. The regularization losses, taken from [Tang 2018a] and [Tang 2018b], were tested on the PASCAL VOC 2012 dataset [Everingham 2012]. This dataset consists of images taken by the general public, which tend to have much greater diversity in colors, shapes, and textures than colonoscopic video. As a result, these regularization losses are not calibrated to the domain of colon video, so they do not provide an improvement. A regularization tailored to the colon (e.g., using the fact that fold regions must have a convex principal curvature) would be expected to provide an improvement over the supervised loss alone.
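For reference, the soft normalized-cut regularizer of [Tang 2018a] has the form sum_k S_k^T W (1 - S_k) / (d^T S_k), where S_k is the soft assignment of pixels to class k, W is a pixel-affinity matrix, and d = W * 1. Below is a minimal PyTorch sketch for a single small (downsampled) image; the Gaussian affinity parameters are illustrative assumptions, and [Tang 2018a] uses a more efficient kernel formulation than this dense construction.

```python
import torch

def soft_ncut_loss(soft_seg, image, sigma_rgb=0.15, sigma_xy=6.0):
    """Soft normalized-cut regularizer for one small/downsampled image.

    soft_seg: (K, N) softmax class probabilities over N pixels.
    image:    (N, 3) RGB features in [0, 1]; N must be a perfect square
    here, since pixel positions are derived from the flattened index.
    Builds the dense N x N Gaussian affinity, so N must be small.
    """
    image = image.float()
    n = image.shape[0]
    side = int(n ** 0.5)
    ys, xs = torch.meshgrid(torch.arange(side), torch.arange(side),
                            indexing="ij")
    pos = torch.stack([ys.flatten(), xs.flatten()], dim=1).float()
    # Gaussian affinity over color distance and spatial distance.
    d_rgb = torch.cdist(image, image) ** 2
    d_xy = torch.cdist(pos, pos) ** 2
    w = torch.exp(-d_rgb / (2 * sigma_rgb ** 2) - d_xy / (2 * sigma_xy ** 2))
    deg = w.sum(dim=1)                       # d = W * 1
    loss = 0.0
    for s_k in soft_seg:                     # one soft indicator per class
        cut = s_k @ (w @ (1.0 - s_k))        # S_k^T W (1 - S_k)
        assoc = deg @ s_k                    # d^T S_k
        loss = loss + cut / assoc.clamp_min(1e-6)
    return loss
```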
Figure 26: Validation accuracy by training epoch with (left) and without (right) depth inputs. The configuration with no regularization losses performs best, followed by the configurations with either the NC loss or the dense CRF loss; both losses together perform worst.

Feature Consistency

Figure 28: Results on a sequence of 5 consecutive video frames. These are frames 41-45 shown at the following link: https://www.youtube.com/playlist?list=PLkQ2N98wMK6gAYF9HtOUjL-oGf0OYv7j7

We see qualitatively in Figure 28 that our method detects the same folds and produces consistent segmentations over a sequence of 5 consecutive video frames.

Limitations

One limitation of our method is that it does not distinguish between adjacent folds. This is not "incorrect" behavior per se, as the network is trained to mark "fold" pixels, but when the haustrum (pocket) between two folds is not visible, the network will mark a single "fold" region. In the future, we intend to address this problem by modifying this work to perform panoptic or instance segmentation, that is, marking each "fold" pixel as belonging to a specific fold. In addition, the fact that folds often have strong specular reflections has led to a problem in which areas of specular reflection are sometimes marked as folds.

Future Work

In addition to per-pixel depths, ColDE also produces estimates of the surface normal vector at each pixel. In the future, we plan to incorporate these features into our method, as mentioned above in the section entitled "Ablation Studies". In addition, we plan to train our method and evaluate it against FoldIt on a set of only "en face" frames, in order to better study the effect of depth input. Some of these frames already exist in the training set, but we also plan to add more "en face" frames to the training set and retrain, to avoid the possibility of underfitting to "down the barrel" frames only.

The disclosure of each of the following references is incorporated herein by reference in its entirety.
REFERENCES

[Abrahams 2021] Abrahams G, Hervé A, Bernth JE, Yvon M, Hayee B, Liu H. Detecting Blind Spots in Colonoscopy by Modelling Curvature. 2021 IEEE International Conference on Robotics and Automation (ICRA), pp. 12508-12514. doi: 10.1109/ICRA48506.2021.9561966. https://ieeexplore.ieee.org/document/9561966 (2021)

[André 2004] André T, Boni C, Mounedji-Boudiaf L, Navarro M, Tabernero J, Hickish T, Topham C, Zaninelli M, Clingan P, Bridgewater J, Tabah-Fisch I, de Gramont A; Multicenter International Study of Oxaliplatin/5-Fluorouracil/Leucovorin in the Adjuvant Treatment of Colon Cancer (MOSAIC) Investigators. Oxaliplatin, fluorouracil, and leucovorin as adjuvant treatment for colon cancer. N Engl J Med. 2004 Jun 3;350(23):2343-51. doi: 10.1056/NEJMoa032709. PMID: 15175436 (2004)

[Ahn 2012] Ahn SB, Han DS, Bae JH, Byun TJ, Kim JP, Eun CS. The miss rate for colorectal adenoma determined by quality-adjusted, back-to-back colonoscopies. Gut and Liver 6(1):64 (2012)

[ASCO Cancer.Net Editorial Board 2022] ASCO Cancer.Net Editorial Board. https://www.cancer.net/cancer-types/colorectal-cancer/statistics (2022)

[Aslanian 2013] Aslanian HR, Shieh FK, Chan FW, Ciarleglio MM, Deng Y, Rogart JN, Jamidar PA, Siddiqui UD. Nurse observation during colonoscopy increases polyp detection: a randomized prospective study. Am J Gastroenterol. 2013 Feb;108(2):166-72. doi: 10.1038/ajg.2012.237. PMID: 23381064 (2013)

[Avalos 2017] Avalos DJ, Sussman DA, Lara LF, Sarkis FS, Castro FJ. Effect of Diet Liberalization on Bowel Preparation. South Med J. 2017 Jun;110(6):399-407. doi: 10.14423/SMJ.0000000000000662. PMID: 28575897 (2017)

[Bae 2020] Bae G, Budvytis I, Yeung CK, Cipolla R. Deep multi-view stereo for dense 3D reconstruction from monocular endoscopic video. In: International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 774-783. Springer (2020)

[Benedict 2015] Benedict M, Galvao Neto A, Zhang X. Post-colonoscopy colorectal carcinoma: An unsolved debate. World J Gastroenterol. 2015 Dec
7;21(45):12735-41. doi: 10.3748/wjg.v21.i45.12735. PMID: 26668498; PMCID: PMC4671029 (2015)

[Bian 2020] Bian J, Li Z, Wang N, Zhan H, Shen C, Cheng MM, Reid I. Unsupervised scale-consistent depth and ego-motion learning from monocular video. Advances in Neural Information Processing Systems 32, 35-45 (2019)

[Bobrow 2022] Bobrow TL, Golhar M, Vijayan R, Akshintala VS, Garcia JR, Durr NJ. Colonoscopy 3D video dataset with paired depth from 2D-3D registration. arXiv preprint arXiv:2206.08903 (2022)

[Bretthauer 2022] Bretthauer M, Løberg M, Wieszczy P, Kalager M, Emilsson L, Garborg K, Rupinski M, Dekker E, Spaander M, Bugajski M, Holme Ø, Zauber AG, Pilonis ND, Mroz A, Kuipers EJ, Shi J, Hernán MA, Adami HO, Regula J, Hoff G, Kaminski MF; NordICC Study Group. Effect of Colonoscopy Screening on Risks of Colorectal Cancer and Related Death. N Engl J Med. 2022 Oct 27;387(17):1547-1556. doi: 10.1056/NEJMoa2208375. Epub 2022 Oct 9. PMID: 36214590 (2022)

[Brown 2005] Brown GJ, Saunders BP. Advances in colonic imaging: technical improvements in colonoscopy. Eur J Gastroenterol Hepatol. 2005 Aug;17(8):785-92. PMID: 16003125 (2005)

[Buchner 2017] Buchner AM. The Role of Chromoendoscopy in Evaluating Colorectal Dysplasia. Gastroenterol Hepatol (NY). 2017 Jun;13(6):336-347. PMID: 28690450; PMCID: PMC5495038 (2017)

[Cao 2022] Cao YH, Yu H, Wu J. Training Vision Transformers with Only 2040 Images. arXiv:2201.10728v1 [cs.CV] 26 Jan 2022. https://arxiv.org/pdf/2201.10728.pdf (2022)

[CDC and Prevention Website] Centers for Disease Control and Prevention. https://www.cdc.gov/chronicdisease/programs-impact/pop/colorectalCancer.htm#:~:text=The%20average%20per%2Dpatient%20costs,continuing%20care%20phase%20(%246%2C200)

[Chang 2012] Chang CK, Hong D. 3D colon segment and endoscope motion reconstruction from colonoscopy video. https://www.semanticscholar.org/paper/3d-colon-segment-and-endoscope-
motion-from-video-Chang-Hong/1910e150d6a02322a34a8a1c65f0ad6024634b8a (2012)

[Cheng 2021] Cheng K, Ma Y, Sun B, Li Y, Chen X. Depth estimation for colonoscopy images with self-supervised learning from videos. In: International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 119-128. Springer (2021)

[Corley 2014] Corley DA, Jensen CD, Marks AR, Zhao WK, Lee JK, Doubeni CA, Zauber AG, de Boer J, Fireman BH, Schottinger JE, Quinn VP, Ghai NR, Levin TR, Quesenberry CP. Adenoma detection rate and risk of colorectal cancer and death. N Engl J Med. 2014 Apr 3;370(14):1298-306. doi: 10.1056/NEJMoa1309086. PMID: 24693890; PMCID: PMC4036494 (2014)

[Corso 2022] Corso A. How Much Does a Colonoscopy Cost in 2022? https://www.talktomira.com/post/how-much-a-colonoscopy-costs (2022)

[Dik 2014] Dik VK, Moons LMG, Siersema PD. Endoscopic innovations to increase the adenoma detection rate during colonoscopy. World J Gastroenterol. 2014 Mar 7;20(9):2200-2211. doi: 10.3748/wjg.v20.i9.2200. PMID: 24605019; PMCID: PMC3942825 (2014)

[Dosovitskiy 2021] Dosovitskiy A, Beyer L, Kolesnikov A, Weissenborn D, Zhai X, Unterthiner T, Dehghani M, Minderer M, Heigold G, Gelly S, Uszkoreit J, Houlsby N. An Image Is Worth 16x16 Words: Transformers for Image Recognition at Scale. arXiv:2010.11929v2 [cs.CV] 3 Jun 2021. https://arxiv.org/pdf/2010.11929.pdf (2021)

[Ehrenstein 2023] Ehrenstein S. Scribble-supervised semantic segmentation for haustral fold detection (2023)

[Engel 2017] Engel J, Koltun V, Cremers D. Direct sparse odometry. IEEE Transactions on Pattern Analysis and Machine Intelligence 40(3), 611-625 (2017)

[Engel 2018] Engel J, Koltun V, Cremers D. Direct sparse odometry. IEEE Transactions on Pattern Analysis and Machine Intelligence. See https://vision.in.tum.de/research/vslam/dso (2018)

[Everingham 2012] Everingham M, Van Gool L, Williams CKI, Winn J, Zisserman A. The PASCAL Visual Object Classes Challenge 2012
(VOC2012) Results. http://www.pascal-network.org/challenges/VOC/voc2012/workshop/index.html (2012)

[Feher 2018] Feher J. Intestinal and Colonic Chemoreception and Motility. In: Quantitative Human Physiology, Second Edition. Figure available at: https://ars.els-cdn.com/content/image/3-s2.0-B9780128032305000026-f02-07-9780128032305.jpg (2018)

[Freedman 2020] Freedman D, Blau Y, Katzir L, Aides A, Shimshoni I, Veikherman D, Golany T, Gordon A, Corrado G, Matias Y, Rivlin E. Detecting Deficient Coverage in Colonoscopies. IEEE Transactions on Medical Imaging 39(11), 3451-3462. https://doi.org/10.1109/TMI.2020.2994221 (2020)

[Gastroenterology Health Partners 2022] Gastroenterology Health Partners. The Colonoscopy: A Historical Timeline. https://www.gastrohealthpartners.com/the-colonoscopy-a-historical-timeline/ (2022)

[Godard 2019] Godard C, Mac Aodha O, Firman M, Brostow GJ. Digging into self-supervised monocular depth estimation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 3828-3838 (2019)

[Gralnek 2014] Gralnek IM, Siersema PD, Halpern Z, Segol O, Melhem A, Suissa A, Santo E, Sloyer A, Fenster J, Moons LM, Dik VK, D'Agostino RB Jr, Rex DK. Standard forward-viewing colonoscopy versus full spectrum endoscopy: an international, multicentre, randomised, tandem colonoscopy trial. Lancet Oncol. 2014 Mar;15(3):353-60. doi: 10.1016/S1470-2045(14)70020-8. PMID: 24560453; PMCID: PMC4062184 (2014)

[Hackner 2022] Hackner R, Eixelberger T, Schmidle M, Bruns V, Lehmann E, Geissler U, Wittenberg T. 3D Reconstruction of the Colon from Monocular Sequences: Evaluation by 3D-printed Phantom Data. In: Maier-Hein K, Deserno TM, Handels H, Maier A, Palm C, Tolxdorff T (eds) Bildverarbeitung für die Medizin 2022. Informatik aktuell. Springer Vieweg, Wiesbaden. https://doi.org/10.1007/978-3-658-36932-3_31 (2022)

[Hampshire 2013] Hampshire T, Roth HR, Helbren E, Plumb A, Boone D, Slabaugh G, Halligan S, Hawkes DJ. Endoluminal surface registration for CT
colonography using haustral fold matching. Med Image Anal. 2013 Dec;17(8):946-58. doi: 10.1016/j.media.2013.04.006. Epub 2013 Apr 27. PMID: 23845949; PMCID: PMC3807796 (2013)

[Hancock 2016] Hancock KS, Mascarenhas R, Lieberman D. What can we do to optimize colonoscopy and how effective can we be? Curr Gastroenterol Rep. 2016 Jun;18(6):27. doi: 10.1007/s11894-016-0500-6. PMID: 27098814 (2016)

[Harrison 2021] Harrison L. FDA Approves First AI Device to Detect Colon Lesions. https://www.medscape.com/viewarticle/949081?ecd=ppc_google_rlsa-traf_mscp_news-perspectives_t1onc_us&gclid=CjwKCAiAy_CcBhBeEiwAcoMRHNTk5icBUtwai_26nplXNIkrOh9YRkqVDveOxVakSaNI9HoRQ-zEIRoC7jIQAvD_BwE (2021)

[Hong 2017] Hong W, Wang J, Qiu F, Kaufman A, Anderson J. Colonoscopy simulation. In: Medical Imaging 2007: Physiology, Function, and Structure from Medical Images, vol. 6511, p. 65110R. International Society for Optics and Photonics (2007)

[Hounnou 2002] Hounnou G, Destrieux C, Desmé J, Bertrand P, Velut S. Anatomical study of the length of the human intestine. Surg Radiol Anat. 2002 Dec;24(5):290-4. doi: 10.1007/s00276-002-0057-y. Epub 2002 Oct 10. PMID: 12497219 (2002)

[Hu 2020] Hu H, Zhong L, Xiaogang J, Zhigang D, Minhong C, Shen Y. Curve Skeleton Extraction From 3D Point Clouds Through Hybrid Feature Point Shifting and Clustering. Computer Graphics Forum, Volume 00 (2020), number 00, pp. 1-22. http://graphics.cs.uh.edu/wp-content/papers/2020/2020-CGF-SkeletonExtraction.pdf (2020)

[Huang 2022] Huang D, Shen J, Hong J, Zhang Y, Dai S, Du N, Zhang M, Guo D. Effect of artificial intelligence-aided colonoscopy for adenoma and polyp detection: a meta-analysis of randomized clinical trials. Int J Colorectal Dis. 2022 Mar;37(3):495-506. doi: 10.1007/s00384-021-04062-x. Epub 2021 Nov 11. PMID: 34762157 (2022)
[Huizinga 2021] Huizinga JD, Pervez M, Nirmalathasan S, Chen JH. Characterization of haustral activity in the human colon. Am J Physiol Gastrointest Liver Physiol. 2021 Jun 1;320(6):G1067-G1080. doi: 10.1152/ajpgi.00063.2021. Epub 2021 Apr 28. PMID: 33909507 (2021)

[Kaminski 2017] Kaminski MF, Wieszczy P, Rupinski M, Wojciechowska U, Didkowska J, Kraszewska E, Kobiela J, Franczyk R, Rupinska M, Kocot B, Chaber-Ciopinska A, Pachlewski J, Polkowski M, Regula J. Increased Rate of Adenoma Detection Associates with Reduced Risk of Colorectal Cancer and Death. Gastroenterology. 2017 Jul;153(1):98-105. doi: 10.1053/j.gastro.2017.04.006. Epub 2017 Apr 17. PMID: 28428142 (2017)

[Laframboise 2019] Laframboise J, Ungi T, Lasso A, Asselin M, Holden MS, Tan P, Hookey L, Fichtinger G. Analyzing the curvature of the colon in different patient positions. Available at: http://perk.cs.queensu.ca/sites/perkd7.cs.queensu.ca/files/Laframboise2019a.pdf (2019)

[Le Clercq 2014] Le Clercq CM, Sanduleanu S. Post-colonoscopy colorectal cancers: what and why. Curr Gastroenterol Rep. 2014 Mar;16(3):375. doi: 10.1007/s11894-014-0375-3. PMID: 24532192 (2014)

[Lee 2011] Lee CK, Park DI, Lee SH, Hwangbo Y, Eun CS, Han DS, Cha JM, Lee BI, Shin JE. Participation by experienced endoscopy nurses increases the detection rate of colon polyps during a screening colonoscopy: a multicenter, prospective, randomized study. Gastrointest Endosc. 2011 Nov;74(5):1094-102. doi: 10.1016/j.gie.2011.06.033. PMID: 21889137 (2011)

[Lee 2017] Lee YM, Huh KC. Clinical and Biological Features of Post-colonoscopy Colorectal Cancer. Clin Endosc. 2017;50(3):254-260. doi: 10.5946/ce.2016.115 (2017)

[Lee 2020] Lee JY, Jeong J, Song EM, Ha C, Lee HJ, Koo JE, Yang DH, Kim N, Byeon JS. Real-time detection of colon polyps during colonoscopy using deep learning: systematic validation with four independent datasets. Sci Rep. 2020 May 20;10(1):8379. doi: 10.1038/s41598-020-65387-1. PMID: 32433506; PMCID: PMC7239848 (2020)

[Leufkens 2012] Leufkens AM, van Oijen MG, Vleggaar FP, Siersema PD. Factors influencing the miss rate of polyps in a back-to-back colonoscopy
study. Endoscopy. 2012 May;44(5):470-5. doi: 10.1055/s-0031-1291666. PMID: 22441756 (2012)

[Levin 2006] Levine JS, Ahnen DJ. Adenomatous polyps of the colon. N Engl J Med 355(24):2551-2557. https://doi.org/10.1056/NEJMcp063038. PMID: 17167138 (2006)

[Levin 2018] Levin TR, Corley DA, Jensen CD, Schottinger JE, Quinn VP, Zauber AG, Lee JK, Zhao WK, Udaltsova N, Ghai NR, et al. Effects of organized colorectal cancer screening on cancer incidence and mortality in a large community-based population. Gastroenterology 155(5):1383-1391 (2018)

[Li 2018] Li Z, Xu Z, Ramamoorthi R, Sunkavalli K, Chandraker M. Learning to reconstruct shape and spatially-varying reflectance from a single image. In: SIGGRAPH Asia 2018 Technical Papers, p. 269. ACM (2018)

[Li 2021] Li B, Huang Y, Liu Z, Zou D, Yu W. StructDepth: Leveraging the structural regularities for self-supervised indoor depth estimation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 12663-12673 (2021)

[Lichy 2021] Lichy D, Wu J, Sengupta S, Jacobs DW. Shape and material capture at home. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2021)

[Lichy 2022] Lichy D, Sengupta S, Jacobs D. Fast Light-Weight Near-Field Photometric Stereo. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2022. https://openaccess.thecvf.com/content/CVPR2022/papers/Lichy_Fast_Light-Weight_Near-Field_Photometric_Stereo_CVPR_2022_paper.pdf (2022)

[Lin 2016] Lin D, Dai J, Jia J, He K, Sun J. ScribbleSup: Scribble-supervised convolutional networks for semantic segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3159-3167 (2016)

[Liu 2019a] Liu X, Sinha A, Ishii M, Hager GD, Reiter A, Taylor RH, Unberath M. Dense depth estimation in monocular endoscopy with self-supervised learning methods. IEEE Transactions on Medical Imaging 39(5), 1438-1447 (2019)
[Liu 2019b] Liu X, Stiber M, Huang J, Ishii M, Hager GD, Taylor RH, Unberath M. Reconstructing sinus anatomy from endoscopic video - towards a radiation-free approach for quantitative longitudinal assessment. In: International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 3-13. Springer (2020)

[Ma 2018] Ma R, Zhao Q, Wang R, Damon JN, Rosenman JG, Pizer SM. Skeleton-based Generalized Cylinder Deformation Under the Relative Curvature Condition. In: Pacific Graphics Short Papers, pp. 37-40. ISBN: 978-3-03868-073-4. Available at: https://diglib.eg.org/bitstream/handle/10.2312/pg20181275/037-040.pdf?sequence=1&isAllowed=y (2018)

[Ma 2019] Ma R, Wang R, Pizer S, Rosenman J, McGill SK, Frahm JM. Real-time 3D reconstruction of colonoscopic surfaces for determining missing regions. In: Medical Image Computing and Computer Assisted Intervention - MICCAI 2019. Lecture Notes in Computer Science, vol. 11768, pp. 573-582. Springer, Cham. https://doi.org/10.1007/978-3-030-32254-0_64 (2019)

[Ma 2021a] Ma R, McGill SK, Wang R, Rosenman JG, Frahm JM, Zhang Y, Pizer SM. Colon10K: A benchmark for place recognition in colonoscopy. 2021 IEEE 18th International Symposium on Biomedical Imaging (ISBI), pp. 1279-1283 (2021)

[Ma 2021b] Ma R, Wang R, Zhang Y, Pizer S, McGill SK, Rosenman J, Frahm JM. RNNSLAM: Reconstructing the 3D colon to visualize missing regions during a colonoscopy. Medical Image Analysis 72, 102100 (2021)

[Mahmood 2016] Mahmood F, Chen R, Durr NJ. Unsupervised reverse domain adaptation for synthetic medical images via adversarial training. IEEE Transactions on Medical Imaging 37(12), 2572-2581 (2018)

[Mathew 2021a] Mathew S, Nadeem S, Kaufman A. Visualizing Missing Surfaces in Colonoscopy Videos Using Shared Latent Space Representations. 2021 IEEE 18th International Symposium on Biomedical
Imaging (ISBI), pp. 329-333. doi: 10.1109/ISBI48211.2021.9433982. https://arxiv.org/pdf/2101.07280.pdf (2021)

[Mathew 2021b] Mathew S, Nadeem S, Kaufman A. FoldIt: Haustral Folds Detection and Segmentation in Colonoscopy Videos. Med Image Comput Comput Assist Interv. 2021 Sep-Oct;12903:221-230. doi: 10.1007/978-3-030-87199-4_21. Epub 2021 Sep 21. PMID: 35403172; PMCID: PMC8993167 (2021)

[McGill 2021] McGill SK, Rosenman J, Wang R, Ma R, Frahm JM, Pizer S. Artificial intelligence identifies and quantifies colonoscopy blind spots. Endoscopy. 2021 Dec;53(12):1284-1286. doi: 10.1055/a-1346-7455. Epub 2021 Feb 4. PMID: 33540438 (2021)

[Moon 2023] Moon SY, Lee JY, Lee JH. Comparison of adenoma detection rate between high-definition colonoscopes with different fields of view: 170 degrees versus 140 degrees. Medicine (Baltimore). 2023 Jan 13;102(2):e32675. doi: 10.1097/MD.0000000000032675. PMID: 36637919; PMCID: PMC9839301 (2023)

[Nadeem 2017] Nadeem S, Marino J, Gu X, Kaufman AE. Corresponding supine and prone colon visualization using eigenfunction analysis and fold modeling. IEEE Transactions on Visualization and Computer Graphics 23:751-760 (2017)

[Nadeem 2020] Mathew S, Nadeem S, Kumari S, Kaufman A. Augmenting colonoscopy using extended and directional CycleGAN for lossy image translation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4696-4705 (2020)

[Nishihara 2013] Nishihara R, Wu K, Lochhead P, Morikawa T, Liao X, Qian ZR, Inamura K, Kim SA, Kuchiba A, Yamauchi M, Imamura Y, Willett WC, Rosner BA, Fuchs CS, Giovannucci E, Ogino S, Chan AT. Long-term colorectal-cancer incidence and mortality after lower endoscopy. N Engl J Med. 2013 Sep 19;369(12):1095-105. doi: 10.1056/NEJMoa1301969. PMID: 24047059; PMCID: PMC3840160 (2013)

[Pickhart 2004] Pickhardt PJ, Nugent PA, Mysliwiec PA, Choi JR, Schindler WR. Location of adenomas missed by optical colonoscopy. Ann Intern Med. 2004 Sep 7;141(5):352-9. doi: 10.7326/0003-4819-141-5-200409070-00009. PMID: 15353426 (2004)
[Pizer 2022] Pizer SM, Marron JS, Damon JN, Vicory J, Krishna A, Liu Z, Taheri M. Skeletons, Object Shape, Statistics. Front. Comput. Sci. 4:842637. doi: 10.3389/fcomp.2022.842637. https://www.frontiersin.org/articles/10.3389/fcomp.2022.842637/full (2022)

[Ponugoti 2017] Ponugoti PL, Cummings OW, Rex DK. Risk of cancer in small and diminutive colorectal polyps. Dig Liver Dis. 2017 Jan;49(1):34-37. doi: 10.1016/j.dld.2016.06.025. Epub 2016 Jun 28. PMID: 27443490 (2017)

[Rex 2007] Rex DK. Colonoscopy withdrawal times and adenoma detection rates. Gastroenterol Hepatol (N Y). 2007 Aug;3(8):609-10. PMID: 21960871; PMCID: PMC3099297 (2007)

[Rosenman 2017] Rosenman J, Zhao Q, Price T, Wang R, Hong J, Niethammer M, Alterovitz R, Frahm JM, Chera B, Pizer S. Registration of Nasopharynoscope Videos with Radiation Treatment Planning CT Scans. Canc Therapy & Oncol Int J. 2017;5(1):555652. doi: 10.19080/CTOIJ.2017.05.555652 (2017)

[Samadder 2014] Samadder NJ, Curtin K, Tuohy TM, Pappas L, Boucher K, Provenzale D, Rowe KG, Mineau GP, Smith K, Pimentel R, Kirchhoff AC, Burt RW. Characteristics of missed or post-colonoscopy colorectal cancer and patient survival: a population-based study. Gastroenterology. 2014 Apr;146(4):950-60. doi: 10.1053/j.gastro.2014.01.013. PMID: 24417818 (2014)

[Sandler 2018] Sandler M, Howard A, Zhu M, Zhmoginov A, Chen LC. MobileNetV2: Inverted residuals and linear bottlenecks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4510-4520 (2018)

[Schonberger 2016] Schonberger JL, Frahm JM. Structure-from-motion revisited. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4104-4113 (2016)

[Schöps 2018] Schöps T, Sattler T, Pollefeys M. SurfelMeshing: Online Surfel-Based Mesh Reconstruction. https://arxiv.org/pdf/1810.00729.pdf (2018)
[Siegel 2020] Siegel RL, Miller KD, Goding Sauer A, Fedewa SA, Butterly LF, Anderson JC, Cercek A, Smith RA, Jemal A. Colorectal cancer statistics, 2020. CA: A Cancer Journal for Clinicians 70(3):145-164. https://doi.org/10.3322/caac.21601 (2020)

[Siegel 2022] Siegel RL, Miller KD, Fuchs HE, Jemal A. Cancer statistics, 2022. CA Cancer J Clin. 2022 Jan;72(1):7-33. doi: 10.3322/caac.21708. Epub 2022 Jan 12. PMID: 35020204 (2022)

[Srivastava 2016] Srivastava A, Klassen EP. Functional and Shape Data Analysis. Springer (2016)

[Stauffer 2022] Stauffer CM, Pfeifer C. Colonoscopy. In: StatPearls. Available at https://www.ncbi.nlm.nih.gov/books/NBK559274/ (2022)

[Tang 2018a] Tang M, Djelouah A, Perazzi F, Boykov Y, Schroers C. Normalized cut loss for weakly-supervised CNN segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1818-1827 (2018)

[Tang 2018b] Tang M, Perazzi F, Djelouah A, Ben Ayed I, Schroers C, Boykov Y. On regularized losses for weakly-supervised CNN segmentation. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 507-522 (2018)

[Tavanapong 2022] Tavanapong W, Oh J, Riegler MA, Khaleel M, Mittal B, de Groen PC. Artificial Intelligence for Colonoscopy: Past, Present, and Future. IEEE J Biomed Health Inform. 2022 Aug;26(8):3950-3965. doi: 10.1109/JBHI.2022.3160098. Epub 2022 Aug 11. PMID: 35316197; PMCID: PMC9478992 (2022)

[Teed 2022] Teed Z, Deng J. DROID-SLAM: Deep Visual SLAM for Monocular, Stereo, and RGB-D Cameras. arXiv:2108.10869v2 [cs.CV] 2 Feb 2022. https://arxiv.org/pdf/2108.10869.pdf (2022)

[US 2021] US Preventive Services Task Force. Screening for Colorectal Cancer: US Preventive Services Task Force Recommendation Statement. JAMA 325(19):1965-1977. https://doi.org/10.1001/jama.2021.6238 (2021)

[van Eijnatten 2021] van Eijnatten M, Rundo L, Batenburg KJ, Lucka F, Beddowes E, Caldas C, Gallagher FA, Sala E, Schönlieb CB, Woitek R. 3D deformable registration of longitudinal abdominopelvic CT images using
unsupervised deep learning. Comput Methods Programs Biomed. 2021 Sep;208:106261. doi: 10.1016/j.cmpb.2021.106261. Epub 2021 Jul 8. PMID: 34289437 (2021)

[van Rijn 2006] van Rijn JC, Reitsma JB, Stoker J, Bossuyt PM, van Deventer SJ, Dekker E. Polyp miss rate determined by tandem colonoscopy: a systematic review. Am J Gastroenterol 101:343-350. doi: 10.1111/j.1572-0241.2006.00390.x. PMID: 16454841 (2006)

[Vemulapalli 2022] Vemulapalli KC, Lahr RE, Rex DK. Most large colorectal polyps missed by gastroenterology fellows at colonoscopy are sessile serrated lesions. Endosc Int Open. 2022 May 13;10(5):E659-E663. doi: 10.1055/a-1784-0959. PMID: 35571477; PMCID: PMC9106434 (2022)

[Wang 2018] Wang R, Frahm JM, Pizer SM. Recurrent neural network for learning dense depth and ego-motion from video. CoRR abs/1805.06558. http://arxiv.org/abs/1805.06558 (2018)

[Wang 2019] Wang Z, Ji S. Smoothed Dilated Convolutions for Improved Dense Prediction. IEEE Transactions on Pattern Analysis and Machine Intelligence, April 2019. https://arxiv.org/pdf/1808.08931.pdf (2019)

[Wang 2022] Wang S, Zhang Y, McGill S, Rosenman J, Frahm JM, Sengupta S, Pizer S. A Surface-normal Based Neural Framework for Colonoscopy Reconstruction. Submitted to IPMI (2023)

[Wang X 2022] Wang X, Yu Z, De Mello S, Kautz J, Anandkumar A, Shen C, Alvarez JM. FreeSOLO: Learning to Segment Objects without Annotations. In: 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, pp. 14156-14166. doi: 10.1109/CVPR52688.2022.01378. https://arxiv.org/pdf/2202.12181.pdf (2022)

[Wong 2020] Wong WJ, Arafat Y, Wang S, Hawes S, Hung K. Colonoscopy withdrawal time and polyp/adenoma detection rate: a single-site retrospective study in regional Queensland. ANZ J Surg. 2020 Mar;90(3):314-316. doi: 10.1111/ans.15652. Epub 2020 Jan 20. PMID: 31957200 (2020)

[Xie 2019] Xie W, Nie Y, Song Z, Wang CCL. Mesh-based computation for solving photometric stereo with near point lighting. IEEE Computer
Graphics and Applications 39(3), 73-85. https://doi.org/10.1109/MCG.2019.2909360 (2019)

[Yang 2018] Yang Z, Wang P, Wang Y, Xu W, Nevatia R. LEGO: Learning edge with geometry all at once by watching videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 225-234 (2018)

[Yang II 2018] Yang Z, Wang P, Xu W, Zhao L, Nevatia R. Unsupervised learning of geometry from videos with edge-aware depth-normal consistency. In: Thirty-Second AAAI Conference on Artificial Intelligence (2018)

[Yu 2022] Yu Z, Peng S, Niemeyer M, Sattler T, Geiger A. MonoSDF: Exploring monocular geometric cues for neural implicit surface reconstruction. Advances in Neural Information Processing Systems (2022)

[Zhang 2020] Zhang Y, Ma R, McGill SK, Rosenman J, Pizer S. Technical Report: Missing Surface Detection Pipeline, July 2020. Available at: https://github.com/zhangybzbo/ColonHoleDetect/blob/master/HoleDetect%20technical%20report.pdf (2020)

[Zhang 2021a] Zhang Y, Frahm JM, Ehrenstein S, McGill SK, Rosenman JG, Wang S, Pizer SM. ColDE: A Depth Estimation Framework for Colonoscopy Reconstruction. 19 Nov 2021. https://arxiv.org/pdf/2111.10371.pdf (2021)

[Zhang 2021b] Zhang Y, Wang S, Ma R, McGill SK, Rosenman JG, Pizer SM. Lighting Enhancement Aids Reconstruction of Colonoscopic Surfaces. https://arxiv.org/pdf/2103.10310.pdf or http://midag.cs.unc.edu/MIDAG_FS.html (2021)

[Zhao 2006] Zhao L, Botha CP, Bescos JO, Truyen R, Vos FM, Post FH. Lines of curvature for polyp detection in virtual colonoscopy. IEEE Trans Vis Comput Graph. 2006 Sep-Oct;12(5):885-92. doi: 10.1109/TVCG.2006.158. PMID: 17080813 (2006)

[Zhao 2016] Zhao S, Wang S, Pan P, Xia T, Chang X, Yang X, Guo L, Meng Q, Yang F, Qian W, et al. Magnitude, risk factors, and factors associated with adenoma miss rate of tandem colonoscopy: a systematic review and meta-analysis. Gastroenterology 156(6):1661-1674 (2019)
[Zhou 2017] Zhou T, Brown M, Snavely N, Lowe DG. Unsupervised learning of depth and ego-motion from video. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1851-1858 (2017)

[Zhou 2018] Zhou QY, Park J, Koltun V. Open3D: A modern library for 3D data processing. arXiv:1801.09847 (2018)

[Zhu 2017] Zhu JY, Park T, Isola P, Efros AA. Unpaired image-to-image translation using cycle-consistent adversarial networks. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 2223-2232 (2017)

[Zhu 2020] Zhu JY, Park T, Isola P, Efros AA. Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks. arXiv:1703.10593 (2020)

It will be understood that various details of the subject matter described herein may be changed without departing from the scope of the subject matter described herein. Furthermore, the foregoing description is for the purpose of illustration only, and not for the purpose of limitation, as the subject matter described herein is defined by the claims as set forth hereinafter.
Claims
CLAIMS What is claimed is: 1. A system for colonoscopic blind spot detection, the system comprising: a frame classifier for receiving, as input, a video stream captured by a colonoscopic camera during a colonoscopy procedure and for selecting, as output, video frames from the video stream containing information for generating a model of colonic surfaces; a pixel depth and surface normal identifier for identifying pixel depths and surface normals of the colonic surfaces; a normal refinement module for refining the surface normals using scene illumination information; a normal-depth refinement module for refining the pixel depths using the refined surface normals; a camera pose estimator for estimating a camera pose for each of the video frames using the refined surface normals and refined depths; a fusion module for generating the model of the colonic surfaces using the camera poses; a blind spot detector for identifying blind spots in the model; and a display for displaying indications of the blind spots.
2. The system of claim 1 wherein the frame classifier is configured to exclude, from the output, video frames in which the information for generating the model of colonic surfaces is occluded by water, blood, or close occluders.
3. The system of claim 1 comprising a frame lighting consistency neural network for adjusting lighting in the video frames output from the frame classifier such that the lighting is consistent across frames.
4. The system of claim 1 wherein the camera pose estimator utilizes simultaneous localization and mapping (SLAM) to estimate the camera poses.
5. The system of claim 4 wherein the camera pose estimator utilizes direct sparse odometry (DSO) SLAM to estimate the camera poses.
6. The system of claim 1 wherein the fusion module utilizes surfel meshing to generate the model of the colonic surfaces.
7. The system of claim 1 wherein the blind spot detector detects the blind spots by constructing a cylinder from a reconstructed section of the colon from the model, flattening the cylinder, and identifying the blind spots by analyzing pixel intensities of pixels on the flattened cylinder.
8. The system of claim 1 comprising a haustral ridge identifier for uniquely identifying haustral ridges in the model.
9. The system of claim 1 wherein the display is configured to display the indication of the blind spots in real time during the colonoscopy procedure.
10. A method for colonoscopic blind spot detection, the method comprising: receiving, as input, a video stream captured by a colonoscopic camera during a colonoscopy procedure and selecting, as output, video frames from the colonoscopic video stream containing information for generating a model of colonic surfaces; identifying pixel depths and surface normals of the colonic surfaces; refining the surface normals using scene illumination information; refining the pixel depths using the refined surface normals; estimating a camera pose for each of the video frames using the refined surface normals and refined depths; generating the model of the colonic surfaces using the camera poses; identifying blind spots in the model of the colonic surfaces; and displaying indications of the blind spots.
11. The method of claim 10 comprising excluding from the output, video frames in which the information for generating the model of colonic surfaces is occluded by water, blood, or close occluders.
12. The method of claim 10 comprising adjusting lighting in the video frames output from the frame classifier such that the lighting is consistent across frames.
13. The method of claim 10 wherein estimating the camera pose includes utilizing simultaneous localization and mapping (SLAM).
14. The method of claim 13 wherein utilizing SLAM to estimate the camera poses includes utilizing direct sparse odometry (DSO) SLAM to estimate the camera poses.
15. The method of claim 10 wherein generating the model includes using surfel meshing.
16. The method of claim 10 wherein identifying the blind spots includes constructing a cylinder from a reconstructed section of the colon from the model, flattening the cylinder, and identifying the blind spots by analyzing pixel intensities of pixels on the flattened cylinder.
17. The method of claim 10 comprising uniquely identifying haustral ridges in the model.
18. The method of claim 10 wherein displaying the indications of the blind spots includes displaying the indications in real time during the colonoscopy procedure.
19. A non-transitory computer readable medium having stored thereon executable instructions that when executed by a processor of a computer control the computer to perform steps comprising: receiving, as input, a video stream captured by a colonoscopic camera during a colonoscopy procedure and selecting, as output, video frames from the video stream containing information for generating a model of colonic surfaces; identifying pixel depths and surface normals of the colonic surfaces; refining the surface normals using scene illumination information; refining the pixel depths using the refined surface normals; estimating a camera pose for each of the video frames using the refined surface normals and refined depths;
generating the model of the colonic surfaces using the camera poses; identifying blind spots in the model; and displaying indications of the blind spots.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US202363449837P | 2023-03-03 | 2023-03-03 | |
US63/449,837 | 2023-03-03 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2024186746A1 (en) | 2024-09-12 |
Family
ID=92675481
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/US2024/018372 WO2024186746A1 (en) | Methods, systems, and computer readable media for colonoscopic blind spot detection | 2023-03-03 | 2024-03-04 |
Country Status (1)
Country | Link |
---|---|
WO (1) | WO2024186746A1 (en) |
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10682108B1 (en) * | 2019-07-16 | 2020-06-16 | The University Of North Carolina At Chapel Hill | Methods, systems, and computer readable media for three-dimensional (3D) reconstruction of colonoscopic surfaces for determining missing regions |
US20220071711A1 (en) * | 2020-09-04 | 2022-03-10 | Karl Storz Se & Co. Kg | Devices, systems, and methods for identifying unexamined regions during a medical procedure |
Non-Patent Citations (1)
Title |
---|
FREEDMAN, D ET AL.: "Detecting Deficient Coverage in Colonoscopies", IEEE TRANSACTIONS ON MEDICAL IMAGING, vol. 39, November 2020 (2020-11-01), pages 3451-3462, XP011816689, Retrieved from the Internet <URL:https://ieeexplore.ieee.org/abstract/document/9097918> [retrieved on 20240503], DOI: 10.1109/TMI.2020.2994221 * |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN119693267A (en) * | 2024-12-09 | 2025-03-25 | 华南理工大学 | A temporally stable blind degraded video restoration method |
Similar Documents
Publication | Title |
---|---|
Freedman et al. | Detecting deficient coverage in colonoscopies | |
US10733745B2 (en) | Methods, systems, and computer readable media for deriving a three-dimensional (3D) textured surface from endoscopic video | |
US10682108B1 (en) | Methods, systems, and computer readable media for three-dimensional (3D) reconstruction of colonoscopic surfaces for determining missing regions | |
Frid-Adar et al. | COVID-19 in CXR: From detection and severity scoring to patient disease monitoring | |
Ali | Where do we stand in AI for endoscopic image analysis? Deciphering gaps and future directions | |
Puyal et al. | Polyp detection on video colonoscopy using a hybrid 2D/3D CNN | |
US20200364859A1 (en) | Automated Anatomic And Regional Location Of Disease Features In Colonoscopy Videos | |
Meng et al. | Dual consistency enabled weakly and semi-supervised optic disc and cup segmentation with dual adaptive graph convolutional networks | |
Rau et al. | Bimodal camera pose prediction for endoscopy | |
Roth et al. | Registration of the endoluminal surfaces of the colon derived from prone and supine CT colonography | |
Lin et al. | Efficient vessel feature detection for endoscopic image analysis | |
Yang et al. | A geometry-aware deep network for depth estimation in monocular endoscopy | |
Liu et al. | Self-supervised monocular depth estimation for gastrointestinal endoscopy | |
Xu et al. | Deep reconstruction-recoding network for unsupervised domain adaptation and multi-center generalization in colonoscopy polyp detection | |
Yu et al. | An end-to-end tracking method for polyp detectors in colonoscopy videos | |
WO2024186746A1 (en) | Methods, systems, and computer readable media for colonoscopic blind spot detection | |
Itoh et al. | Unsupervised colonoscopic depth estimation by domain translations with a Lambertian-reflection keeping auxiliary task | |
US11410309B2 (en) | Method, device, and computer program product for deep lesion tracker for monitoring lesions in four-dimensional longitudinal imaging | |
JP7673392B2 (en) | Fusing deep learning with geometric constraints for image-based localization, computer-implemented method, program, and computer-implemented system | |
Nadeem et al. | Depth reconstruction and computer-aided polyp detection in optical colonoscopy video frames | |
Nadeem et al. | Computer-aided detection of polyps in optical colonoscopy images | |
Liu et al. | A robust method to track colonoscopy videos with non-informative images | |
Raposo et al. | Piecewise-planar stereoscan: Sequential structure and motion using plane primitives | |
Jin et al. | A self-supervised approach for detecting the edges of haustral folds in colonoscopy video | |
Guo et al. | Med-Query: Steerable parsing of 9-DoF medical anatomies with query embedding |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 24767690 Country of ref document: EP Kind code of ref document: A1 |