The present application claims priority from U.S. provisional patent application No. 63/378,648, filed on June 10, 2022, the disclosure of which is incorporated herein by reference in its entirety.
Detailed Description
Example methods, apparatus, and systems are described herein. It should be understood that the words "example" and "exemplary" are used herein to mean "serving as an example, instance, or illustration." Any embodiment or feature described herein as an "example" or as "exemplary" is not necessarily to be construed as preferred or advantageous over other embodiments or features unless so indicated. Other embodiments may be utilized and other changes may be made without departing from the scope of the subject matter presented herein.
Accordingly, the example embodiments described herein are not intended to be limiting. It will be readily understood that the aspects of the present disclosure, as generally described herein, and illustrated in the figures, could be arranged, substituted, combined, separated, and designed in a wide variety of different configurations.
Throughout this specification the articles "a" or "an" are used to introduce elements of example embodiments. Any reference to "a" or "an" means "at least one," and any reference to "the" means "the at least one," unless specified otherwise, or unless the context clearly dictates otherwise. The use of the conjunction "or" within a descriptive list of at least two items is intended to indicate any one of the listed items or any combination of the listed items.
Ordinal terms such as "first," "second," "third," etc., are used to distinguish between corresponding elements and not necessarily to indicate a particular order of such elements. For the purposes of this specification, the terms "multiple" and "plurality" refer to "two or more" or "more than one."
Furthermore, the features shown in each figure may be used in combination with each other unless the context suggests otherwise. Thus, the drawings should generally be regarded as constituent aspects of one or more overall embodiments, but it should be understood that not all illustrated features are necessary for each embodiment. In the drawings, like numerals generally identify like components unless context dictates otherwise. Moreover, unless otherwise indicated, the drawings are not to scale and are for illustrative purposes only. In addition, the figures are merely representative and not all components are shown. For example, additional structures or constraining components may not be shown.
In addition, any recitation of elements, blocks, or steps in the description or claims is for clarity. Thus, such enumeration should not be interpreted as requiring or implying that such elements, blocks or steps follow a particular arrangement or be performed in a particular order.
I. Overview of the Invention
To capture images using an image capture device including a digital camera, smart phone, laptop computer, etc., a user may power up the device and initiate an image sensor (e.g., camera) start-up sequence. The user may initiate the start-up sequence by selecting an application or simply opening the device. The start-up sequence typically involves iterative optics and software setting adjustment procedures (e.g., auto-focus, auto-exposure, auto-white balance). After the start-up sequence is completed, the image capture device may then capture an image. Ideally, an image capture device may have the ability to accurately focus and/or apply various autofocus processes to an image such that the captured image and/or a preview of the captured image contains a focused view of any object of interest in the image.
However, determining which regions of the environment to focus on (e.g., which regions of the environment contain objects of interest) may be complicated by the various objects in the environment, many of which may be potential regions to focus on. Furthermore, the object on which the image capture device is focused may move or be moved to a different location and/or beyond the frame of the image capture device. Thus, it may be important that the image capture device be able to quickly switch to another region for focusing. Further, for various objects in the environment, the image capture device may associate various regions with similar levels of interest, and the image capture device may fluctuate between focusing on one region and focusing on another region with a similar level of interest.
Techniques for an image capture device to automatically focus on regions of an image frame associated with high saliency, while reducing instability and reducing regions incorrectly determined to be salient, are described herein. In some examples, by utilizing machine learning techniques, an image capture device may detect one or more visual saliency regions within an image frame, generate one or more bounding boxes surrounding the one or more visual saliency regions, determine which visual saliency region to focus on, and apply one or more autofocus processes to that visual saliency region in the image frame.
In some examples, an image capture device may determine a primary region of interest (ROI) and a more stable filtered ROI based on the primary ROI. For each image frame, the image capture device may initially set the filtered ROI to be the same as the previously filtered ROI, and may update the filtered ROI confidence value to the average of the saliency values at the pixels of the filtered ROI in the updated heat map. Based on the amount of overlap and/or the relative confidence of the primary and previously filtered ROIs, the image capture device may determine whether to match the filtered ROI to the primary ROI or to keep the filtered ROI the same as the previously filtered ROI. In some examples, the image capture device may determine that an amount of overlap between the primary ROI and the previously filtered ROI does not exceed a threshold, and based on that determination, the image capture device may update the filtered ROI to be the primary ROI. If the image capture device determines that the amount of overlap does exceed the threshold, the image capture device may maintain the filtered ROI; in this case, the image capture device may set the filtered ROI to be the same as the previously filtered ROI and may update the filtered ROI confidence value to the average of the saliency values at the pixels of the filtered ROI in the updated heat map. Additionally and/or alternatively, the computing system may determine a saliency difference between the previously filtered ROI and the primary ROI, and based on the saliency difference exceeding a threshold, the computing system may update the filtered ROI to be the primary ROI. If the computing system determines that the saliency difference does not exceed the threshold, the computing system may retain the filtered ROI as the previously filtered ROI.
Updating the filtered ROI based on both the primary ROI and the previously filtered ROI may result in the filtered ROI being updated to the primary ROI only when the previously filtered ROI is no longer significant, because the filtered ROI may remain the previously filtered ROI, with a confidence value refreshed from the updated heat map, until that region is no longer significant. Allowing the filtered ROI to remain the previously filtered ROI may help stability when the image frames vary slightly from frame to frame and when the frames contain multiple objects of similar significance.
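For illustration only, the following Python sketch shows one way the per-frame update just described might look; the helper names, the overlap measure, and the threshold values are assumptions rather than part of this disclosure. The filtered ROI starts each frame as the previously filtered ROI with a confidence value refreshed from the current heat map, and is replaced by the primary ROI only when the overlap is small or the saliency difference is large.

```python
import numpy as np

def mean_saliency(heat_map, roi):
    """Average saliency of the pixels inside roi = (x0, y0, x1, y1)."""
    x0, y0, x1, y1 = roi
    return float(heat_map[y0:y1, x0:x1].mean())

def overlap_fraction(a, b):
    """Intersection area divided by the area of the smaller box."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix1 - ix0) * max(0, iy1 - iy0)
    area = min((a[2] - a[0]) * (a[3] - a[1]), (b[2] - b[0]) * (b[3] - b[1]))
    return inter / area if area > 0 else 0.0

def update_filtered_roi(prev_filtered, primary, heat_map,
                        overlap_thresh=0.3, saliency_diff_thresh=0.2):
    """Low-pass filtered ROI update (illustrative thresholds only)."""
    if prev_filtered is None:
        return primary, mean_saliency(heat_map, primary)
    # Start by keeping the previous ROI with a refreshed confidence value.
    filtered, confidence = prev_filtered, mean_saliency(heat_map, prev_filtered)
    primary_conf = mean_saliency(heat_map, primary)
    small_overlap = overlap_fraction(prev_filtered, primary) < overlap_thresh
    large_diff = (primary_conf - confidence) > saliency_diff_thresh
    # Switch only when the primary ROI is clearly elsewhere or clearly more salient.
    if small_overlap or large_diff:
        filtered, confidence = primary, primary_conf
    return filtered, confidence
```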
In a further example, the image capture device may determine both a primary region of interest (ROI) and a secondary ROI. The secondary ROI may be determined such that the secondary ROI does not overlap with the primary ROI or with a forbidden region around the primary ROI. Because the computing device uses at least two ROIs of the current image frame to determine the location on which to focus, the computing device may simultaneously consider two salient objects and/or regions, potentially making the switch from focusing on one region to another faster and more seamless. Furthermore, the computing device may apply a low-pass filter to the primary ROI, potentially making the filtered ROI more stable and more power efficient, possibly by applying a threshold to the saliency difference and/or the amount of overlap before switching the filtered ROI to the primary ROI.
In a further example, the computing device may select between the filtered ROI, the primary ROI, and the secondary ROI to determine a region in the image frame on which to focus using one or more autofocus processes. To address the potential problem that saliency detection may exhibit instability over time, a finite state machine (FSM) may be used. The FSM may require several consecutive frames with consistent saliency detection before the autofocus commits to a salient ROI, and/or several consecutive frames lacking consistent saliency detection before the autofocus discards a salient ROI. In this context, consistent detection refers to detected bounding boxes that overlap between consecutive frames and/or to a high confidence of those detections.
An additional potential challenge is that saliency detection may sometimes report areas that are not actually salient. This is known as a false positive, and such errors can be particularly detrimental to the user experience with the camera. Imagine, for example, that the camera tries to focus on a reflective object in the background of a scene. To address this challenge, the FSM is helpful because it handles transient false positives: it is less likely that the FSM will observe consecutive false positives than a false positive in a single frame. In addition, to prevent false positives, the application of the salient autofocus process may be limited based on global on-device signals. For example, certain requirements (e.g., minimum brightness values in the scene, zoom ratios within a certain range, and/or no device motion) may be imposed before autofocus is allowed. Furthermore, certain salient regions may be discarded when the estimated depth is out of range, when the estimated depth differs too much from the estimated distance of the current ROI, and/or when the location of the bounding box is too far off-center.
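The gating described above could be pictured, purely as an illustrative sketch, with two predicate functions; every signal name and numeric limit below is an invented placeholder rather than a value taken from this disclosure. Coordinates and offsets are assumed to be normalized to the range [0, 1].

```python
def allow_salient_autofocus(scene_brightness, zoom_ratio, device_motion,
                            min_brightness=0.1, zoom_range=(1.0, 2.0),
                            max_motion=0.05):
    """Global on-device checks before salient autofocus is enabled."""
    return (scene_brightness >= min_brightness
            and zoom_range[0] <= zoom_ratio <= zoom_range[1]
            and device_motion <= max_motion)

def accept_salient_region(est_depth, current_roi_depth, box_center, frame_center,
                          depth_range=(0.2, 5.0), max_depth_jump=1.0,
                          max_offcenter=0.4):
    """Per-region checks; regions failing any check may be discarded."""
    in_range = depth_range[0] <= est_depth <= depth_range[1]
    small_jump = abs(est_depth - current_roi_depth) <= max_depth_jump
    # Manhattan distance of the box center from the frame center (normalized units).
    offcenter = (abs(box_center[0] - frame_center[0]) +
                 abs(box_center[1] - frame_center[1]))
    return in_range and small_jump and offcenter <= max_offcenter
```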
II. Example Systems and Methods
FIG. 1 illustrates an example computing device 100. The computing device 100 is shown in the form factor of a mobile phone. However, the computing device 100 may alternatively be implemented as a laptop computer, a tablet computer, and/or a wearable computing device, among others. Computing device 100 may include various elements such as a body 102, a display 106, and buttons 108 and 110. The computing device 100 may further include one or more cameras, such as a front camera 104 and one or more rear cameras 112. Each of the rear cameras may have a different field of view. For example, the rear cameras may include a wide-angle camera, a main camera, and a telephoto camera. The wide-angle camera may capture a greater portion of the environment than the main camera and the telephoto camera, and the telephoto camera may capture a more detailed image of a smaller portion of the environment than the main camera and the wide-angle camera.
The front camera 104 may be positioned on a side of the body 102 that generally faces the user when in operation (e.g., on the same side as the display 106). The rear camera 112 may be positioned on a side of the body 102 opposite the front camera 104. Referring to the cameras as front and rear is arbitrary, and the computing device 100 may include multiple cameras positioned on each side of the body 102.
Display 106 may represent a Cathode Ray Tube (CRT) display, a Light Emitting Diode (LED) display, a Liquid Crystal Display (LCD), a plasma display, an Organic Light Emitting Diode (OLED) display, or any other type of display known in the art. In some examples, the display 106 may display a digital representation of the current image captured by the front camera 104 and/or the rear camera 112, images that may be captured by one or more of these cameras, images recently captured by one or more of these cameras, and/or modified versions of one or more of these images. Thus, the display 106 may be used as a viewfinder for a camera. The display 106 may also support touch screen functionality capable of adjusting settings and/or configurations of one or more aspects of the computing device 100.
The front camera 104 may include an image sensor and associated optical elements, such as a lens. The front camera 104 may provide zoom capability or may have a fixed focal length. In other examples, interchangeable lenses may be used with the front camera 104. The front camera 104 may have a variable mechanical aperture and a mechanical and/or electronic shutter. The front camera 104 may also be configured to capture still images, video images, or both. Further, the front camera 104 may represent, for example, a monoscopic, stereoscopic, or multi-field camera. The rear camera 112 may be similarly or differently arranged. Additionally, one or more of the front cameras 104 and/or the rear cameras 112 may be an array of one or more cameras.
One or more of the front camera 104 and/or the rear camera 112 may include or be associated with an illumination component that provides a light field for illuminating a target object. For example, the lighting assembly may provide flash or constant illumination for the target object. The lighting assembly may also be configured to provide a light field comprising one or more of structured light, polarized light, and light having a specific spectral content. Other types of light fields known and used to recover a three-dimensional (3D) model from an object are also possible in the context of the examples herein.
Computing device 100 may also include an ambient light sensor that may continuously or from time to time determine the ambient brightness of a scene that cameras 104 and/or 112 may capture. In some implementations, an ambient light sensor may be used to adjust the display brightness of the display 106. Additionally, an ambient light sensor may be used to determine the exposure length of one or more of the cameras 104 or 112, or to facilitate such determination.
The computing device 100 may be configured to capture an image of a target object using the display 106 and the front camera 104 and/or the rear camera 112. The captured images may be a plurality of still images or video streams. Image capture may be triggered by activating button 108, pressing a soft key on display 106, or by some other mechanism. Depending on the implementation, images may be captured automatically at specific time intervals, for example, after pressing button 108, after appropriate lighting conditions of the target object, after a predetermined distance from mobile computing device 100, or according to a predetermined capture schedule.
Fig. 2 is a simplified block diagram illustrating some components of an example computing system 200. By way of example and not limitation, computing system 200 may be a cellular mobile telephone (e.g., a smart phone), a computer (such as a desktop computer, a notebook computer, a tablet computer, a server, or a handheld computer), a home automation component, a Digital Video Recorder (DVR), a digital television, a remote control, a wearable computing device, a game console, a robotic device, a vehicle, or some other type of device. Computing system 200 may represent, for example, aspects of computing device 100.
As shown in FIG. 2, computing system 200 may include a communication interface 202, a user interface 204, a processor 206, a data store 208, and a camera component 224, all of which may be communicatively linked together by a system bus, network, or other connection mechanism 210. Computing system 200 may be equipped with at least some image capturing and/or image processing capabilities. It should be appreciated that computing system 200 may represent a physical image processing system, a particular physical hardware platform on which image sensing and/or processing applications operate in software, or other combination of hardware and software configured to perform image capturing and/or processing functions.
The communication interface 202 may allow the computing system 200 to communicate with other devices, access networks, and/or transport networks using analog or digital modulation. Thus, the communication interface 202 may facilitate circuit-switched and/or packet-switched communications, such as Plain Old Telephone Service (POTS) communications and/or Internet Protocol (IP) or other packetized communications. For example, the communication interface 202 may include a chipset and an antenna arranged for wireless communication with a radio access network or access point. Also, the communication interface 202 may take the form of or include a wired interface, such as an Ethernet, Universal Serial Bus (USB), or High-Definition Multimedia Interface (HDMI) port, or the like. The communication interface 202 may also take the form of or include a wireless interface, such as Wi-Fi, BLUETOOTH, Global Positioning System (GPS), or a wide area wireless interface (e.g., WiMAX or 3GPP Long Term Evolution (LTE)), among others. However, other forms of physical layer interfaces and other types of standard or proprietary communication protocols may be used on communication interface 202. Further, the communication interface 202 may include a plurality of physical communication interfaces (e.g., a Wi-Fi interface, a BLUETOOTH interface, and a wide area wireless interface).
The user interface 204 may be used to allow the computing system 200 to interact with a human or non-human user, such as to receive input from a user and provide output to the user. Thus, the user interface 204 may include input components such as a keypad, keyboard, touch sensitive pad, computer mouse, trackball, joystick, microphone, and the like. The user interface 204 may also include one or more output components, such as a display screen, which may be combined with a touch sensitive panel, for example. The display screen may be based on CRT, LCD, LED and/or OLED technology, or other technologies now known or later developed. The user interface 204 may also be configured to generate audible output via speakers, speaker jacks, audio output ports, audio output devices, headphones, and/or other similar devices. The user interface 204 may also be configured to receive and/or capture audible utterances, noise, and/or signals through a microphone and/or other similar device.
In some examples, the user interface 204 may include a display that serves as a viewfinder for still camera and/or video camera functions supported by the computing system 200. In addition, the user interface 204 may include one or more buttons, switches, knobs, and/or dials that facilitate configuration and focusing of camera functions and capture of images. Some or all of these buttons, switches, knobs and/or dials may be implemented by touch sensitive pads.
The processor 206 may include one or more general-purpose processors (e.g., microprocessors) and/or one or more special-purpose processors (e.g., Digital Signal Processors (DSPs), Graphics Processing Units (GPUs), Floating Point Units (FPUs), network processors, or Application Specific Integrated Circuits (ASICs)). In some cases, a special-purpose processor is capable of image processing, image alignment, image merging, and the like. The data store 208 can include one or more volatile and/or nonvolatile storage components, such as magnetic, optical, flash, or organic storage devices, and can be integrated in whole or in part with the processor 206. The data store 208 can include removable and/or non-removable components.
The processor 206 is capable of executing program instructions 218 (e.g., compiled or non-compiled program logic and/or machine code) stored in the data store 208 to perform the various functions described herein. Accordingly, the data store 208 may include a non-transitory computer-readable medium having stored thereon program instructions that, when executed by the computing system 200, cause the computing system 200 to perform any of the methods, processes, or operations disclosed in the present specification and/or figures. Execution of program instructions 218 by processor 206 may result in processor 206 using data 212.
By way of example, the program instructions 218 may include an operating system 222 (e.g., an operating system kernel, device drivers, and/or other modules) and one or more application programs 220 (e.g., camera functions, address books, email, web browsing, social networking, audio-to-text functions, text translation functions, and/or gaming applications) installed on the computing system 200. Similarly, data 212 may include operating system data 216 and application data 214. Operating system data 216 may be primarily accessed by operating system 222, while application data 214 may be primarily accessed by one or more of application programs 220. The application data 214 may be arranged in a file system that is visible or hidden from a user of the computing system 200.
Applications 220 may communicate with operating system 222 through one or more Application Programming Interfaces (APIs). These APIs may facilitate, for example, application programs 220 to read and/or write application data 214, send or receive information via communication interface 202, receive and/or display information on user interface 204, and the like.
In some cases, the application 220 may be referred to simply as an "app". In addition, the application 220 may be downloaded to the computing system 200 through one or more online application stores or application markets. However, applications may also be installed on computing system 200 in other ways, such as via a web browser or through a physical interface (e.g., a USB port) on computing system 200.
The camera assembly 224 may include, but is not limited to, an aperture, a shutter, a recording surface (e.g., photographic film and/or an image sensor), a lens, a shutter button, an infrared projector, and/or a visible light projector. The camera component 224 may include components configured to capture images in the visible spectrum (e.g., electromagnetic radiation having wavelengths of 380-700 nanometers) and/or components configured to capture images in the infrared spectrum (e.g., electromagnetic radiation having wavelengths of 701 nanometers-1 millimeter), and so forth. The camera component 224 may be controlled at least in part by software executed by the processor 206.
Fig. 3 illustrates a diagram 300 illustrating a training phase 302 and an inference phase 304 of a trained machine learning model 332, according to an example embodiment. Some machine learning techniques involve training one or more machine learning algorithms on an input training data set to identify patterns in the training data and to provide output inferences and/or predictions about (patterns in) the training data. The resulting trained machine learning algorithm may be referred to as a trained machine learning model. For example, Fig. 3 illustrates a training phase 302 in which one or more machine learning algorithms 320 are trained on training data 310 to become a trained machine learning model 332. Generating the trained machine learning model 332 during the training phase 302 may involve determining one or more hyperparameters, such as one or more stride values of one or more layers of the machine learning model, as described herein. Then, during the inference phase 304, the trained machine learning model 332 can receive the input data 330 and one or more inference/prediction requests 340 (possibly as part of the input data 330) and responsively provide one or more inferences and/or predictions 350 as output. The one or more inferences and/or predictions 350 can be based in part on one or more learned hyperparameters, such as one or more learned stride values for one or more layers of the machine learning model, as described herein.
Thus, the trained machine learning model 332 may include one or more models of one or more machine learning algorithms 320. The machine learning algorithms 320 may include, but are not limited to, an artificial neural network (e.g., a convolutional neural network or a recurrent neural network), a Bayesian network, a hidden Markov model, a Markov decision process, a logistic regression function, a support vector machine, a suitable statistical machine learning algorithm, and/or a heuristic machine learning system described herein. The machine learning algorithm 320 may be supervised or unsupervised and may implement any suitable combination of online and offline learning.
In some examples, the machine learning algorithm 320 and/or the trained machine learning model 332 may be accelerated using on-device coprocessors such as Graphics Processing Units (GPUs), Tensor Processing Units (TPUs), Digital Signal Processors (DSPs), and/or Application Specific Integrated Circuits (ASICs). Such on-device coprocessors may be used to accelerate the machine learning algorithm 320 and/or the trained machine learning model 332. In some examples, the trained machine learning model 332 can be trained, resident, and executed on a particular computing device to provide inferences on that computing device, and/or can otherwise make inferences for that particular computing device.
During the training phase 302, the machine learning algorithm 320 may be trained using unsupervised, supervised, semi-supervised, and/or reinforcement learning techniques by providing at least the training data 310 as a training input. Unsupervised learning involves providing a portion (or all) of training data 310 to machine learning algorithm 320, and machine learning algorithm 320 determines one or more output inferences based on the provided portion (or all) of training data 310. Supervised learning involves providing a portion of training data 310 to machine learning algorithm 320, where machine learning algorithm 320 determines one or more output inferences based on the provided portion of training data 310 and accepts or corrects the output inferences based on the correct results associated with training data 310. In some examples, supervised learning of the machine learning algorithm 320 may be managed by a set of rules and/or labels for training inputs, and the set of rules and/or labels may be used to correct reasoning of the machine learning algorithm 320.
Semi-supervised learning involves obtaining correct results for some, but not all, of the training data 310. During semi-supervised learning, supervised learning is used for a portion of the training data 310 having correct results, and unsupervised learning is used for a portion of the training data 310 not having correct results.
Reinforcement learning involves the machine learning algorithm 320 receiving a reward signal regarding a previous inference, where the reward signal may be a numerical value. During reinforcement learning, the machine learning algorithm 320 may output an inference and receive the reward signal in response, wherein the machine learning algorithm 320 is configured to attempt to maximize the value of the reward signal. In some examples, reinforcement learning also utilizes a value function that provides a value representing an expected sum of the values provided by the reward signal over time. In some examples, the machine learning algorithm 320 and/or the trained machine learning model 332 may be trained using other machine learning techniques, including, but not limited to, incremental learning and curriculum learning.
In some examples, the machine learning algorithm 320 and/or the trained machine learning model 332 may use transfer learning techniques. For example, a transfer learning technique may involve the trained machine learning model 332 being pre-trained on one data set and additionally trained using the training data 310. More specifically, the machine learning algorithm 320 may be pre-trained on data from one or more computing devices, and the resulting trained machine learning model provided to a computing device CD1, where CD1 is intended to execute the trained machine learning model during the inference phase 304. The training data 310 may then additionally be used to train the pre-trained machine learning model during the training phase 302. Such further training of the machine learning algorithm 320 and/or the pre-trained machine learning model using the training data 310 (e.g., data from CD1) may be performed using supervised learning or unsupervised learning. Once the machine learning algorithm 320 and/or the pre-trained machine learning model has been trained on at least the training data 310, the training phase 302 may be completed. The resulting trained machine learning model may be used as at least one of the trained machine learning models 332.
In particular, once the training phase 302 has been completed, the trained machine learning model 332 may be provided to the computing device (if not already on the computing device). After the trained machine learning model 332 is provided to the computing device CD1, the inference phase 304 may begin.
During the inference phase 304, the trained machine learning model 332 may receive input data 330 and generate and output one or more corresponding inferences and/or predictions 350 regarding the input data 330. Thus, the input data 330 may be used as input to the trained machine learning model 332 to provide corresponding inferences and/or predictions 350. For example, the trained machine learning model 332 may generate inferences and/or predictions 350 in response to one or more inference/prediction requests 340. In some examples, the trained machine learning model 332 may be executed as part of other software. For example, the trained machine learning model 332 can be executed by an inference or prediction daemon so as to be readily available to provide inferences and/or predictions upon request. The input data 330 may include data from the computing device CD1 executing the trained machine learning model 332 and/or input data from one or more computing devices other than CD1.
Example image capture devices described herein may include one or more cameras and sensors, among other components. The image capture device may be a smart phone, tablet computer, laptop computer, or digital camera, as well as other types of computing devices that may perform the operations described herein.
By way of example, a computing device may include one or more processors with logic to execute instructions, at least one built-in or peripheral image sensor (e.g., a camera), and an input/output device (e.g., a display panel) to display a user interface. The computing device may further include a Computer Readable Medium (CRM). The CRM may include any suitable memory or storage device, such as Random Access Memory (RAM), static RAM (SRAM), dynamic RAM (DRAM), non-volatile RAM (NVRAM), Read Only Memory (ROM), or flash memory. The computing device stores device data (e.g., user data, multimedia data, applications, and/or an operating system of the device) on the CRM. The device data may include executable instructions for an auto-scaling process. The auto-scaling process may be part of an operating system executing on the image capture device, or may be a separate component executing within an application environment (e.g., a camera application) or a "framework" provided by the operating system.
The computing device may implement machine learning techniques ("visual saliency models"). The visual saliency model may be implemented as one or more of a Support Vector Machine (SVM), a Recurrent Neural Network (RNN), a Convolutional Neural Network (CNN), a Dense Neural Network (DNN), one or more heuristics, other machine learning techniques, combinations thereof, and the like. The visual saliency model may be trained iteratively off-device by exposing it to training scenes, sequences, and/or events. For example, training may involve exposing the visual saliency model to an image (e.g., a digital photograph) that includes a user-drawn bounding box containing a region of visual saliency (e.g., a region in which one or more objects of particular interest to the user may reside). In some examples, these images may include bounding boxes or heat maps generated by tracking the annotators' eyes while they view the images, to determine which regions of the images are most prominent. Further, in some examples, the visual saliency model may be trained using a heat map, which may be generated by tracking the locations within the images viewed by the annotators. For example, if five annotators view a first region and ten annotators view a second region, the first region in the image may be determined to be half as salient as the second region in the image. Exposing the model to images including user-drawn bounding boxes may facilitate training of the visual saliency model to identify visual saliency regions within an image. As a result of the training, the visual saliency model may generate a visual saliency heat map for a given image. The computing device may then generate various bounding boxes based on the visual saliency heat map, including a bounding box that encloses the region with the greatest visual saliency score. In this way, the visual saliency model may predict the visual saliency regions within the image. After sufficient training, model compression using distillation can be applied to the visual saliency model, enabling selection of the best model architecture based on model latency and power consumption. The visual saliency model may then be deployed as a stand-alone module to the CRM of the computing device or implemented into the auto-scaling process.
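As a purely illustrative sketch of how such an annotator-derived heat map could be formed (the fixation format and the normalization below are assumptions, not details of this disclosure), consider:

```python
import numpy as np

def fixation_heat_map(height, width, fixations):
    """Build a saliency heat map from annotator fixations.

    fixations: list of (row, col) pixel locations viewed by annotators.
    The value at each pixel is proportional to the number of fixations
    landing there, so a region viewed by ten annotators scores twice as
    high as a region viewed by five.
    """
    heat = np.zeros((height, width), dtype=np.float32)
    for r, c in fixations:
        heat[r, c] += 1.0
    if heat.max() > 0:
        heat /= heat.max()  # scale to [0, 1]: white = high saliency, black = low
    return heat

# Example: ten fixations on one region, five on another.
fixes = [(10, 10)] * 10 + [(40, 40)] * 5
hm = fixation_heat_map(64, 64, fixes)
print(hm[10, 10], hm[40, 40])  # 1.0 and 0.5
```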
The computing device may perform the auto-scaling process, which may be initiated automatically or in response to a received trigger signal such as a user-performed gesture (e.g., a tap or press) on the input/output device. The computing device may receive one or more captured images from the image sensor.
The computing device may utilize the visual saliency model to generate a visual saliency heat map using one or more captured images.
For example, Fig. 4A is an image 400 according to an example embodiment, and Fig. 4B is a heat map 450 according to an example embodiment. The computing device may utilize the visual saliency model to generate a visual saliency heat map of the captured image, as shown in Figs. 4A and 4B. The one or more processors may calculate the visual saliency heat map in a background operation of the device. In some examples, the image capture device may not display the visual saliency heat map to the user. As shown, the visual saliency heat map depicts the magnitude of the visual saliency probability on a black-to-white scale, where white indicates a high probability of saliency and black indicates a low probability of saliency. The computing device may apply a pre-trained machine learning model to the image frame such that the pre-trained machine learning model outputs a saliency index for each pixel within the visual saliency heat map; each pixel within the visual saliency heat map may thus be assigned a saliency index that represents a degree of saliency of the region represented by that pixel.
The visual saliency model may produce a bounding box that encloses the region with the greatest visual saliency probability. Fig. 5 illustrates a heat map 500 with a bounding box 502 in accordance with an example embodiment.
As shown in fig. 5, the visual saliency heat map includes a bounding box 502 that encloses a region within the image that contains the greatest visual saliency probability. The visual saliency model may be trained to output a heat map that the computing device may use to determine one or more objects of interest in the captured image and generate one or more bounding boxes around the objects of interest. These generated bounding boxes may include one or more regions that the computing device predicts as significant.
In some examples, the computing device may evaluate anchor boxes of various sizes and at various locations, such as those described in Faster R-CNN by Ren et al. (2016). In particular, as described herein, the anchor boxes may be considered in order to determine a region of interest, such as a region having a high or highest average saliency value. For example, Fig. 6A illustrates anchor bounding box sizes 602, 604, and 606 according to an example embodiment. As shown in Fig. 6A, the computing device may evaluate anchor bounding boxes of different aspect ratios (e.g., 1:2, 1:1, and 2:1), and the computing device may evaluate anchor bounding boxes of various sizes for each of the different aspect ratios. The computing device may determine an average saliency value for each aspect ratio and size of anchor bounding box at the various anchor bounding box positions. In particular, each pixel in the heat map may be associated with a saliency value, and the computing device may determine the average of the saliency values of the pixels within an anchor bounding box to determine the average saliency value for that box.
Fig. 6B illustrates anchor bounding box positions according to an example embodiment. As shown in Fig. 6B, the computing device may evaluate the average saliency values of anchor bounding boxes of various sizes and aspect ratios every few pixels. The centers of the anchor bounding boxes may be evenly distributed based on a stride value, and at each location, the computing device may evaluate various sizes and aspect ratios of anchor bounding boxes. For example, the computing device may determine the average saliency values of nine anchor bounding boxes (the sizes 602, 604, and 606 at each of the aspect ratios 1:2, 1:1, and 2:1) at location 650, location 652, location 654, and other locations separated by a stride of three pixels. Based on the average saliency values of the anchor bounding boxes, the computing device may determine one or more regions of interest (ROIs).
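A minimal sketch of this anchor-box evaluation follows; the stride, box sizes, aspect ratios, and scoring below are illustrative assumptions (the reference numerals 602, 604, and 606 appear only in a comment to tie the sketch back to Fig. 6A). The optional exclude argument is reused in a later sketch when searching for a secondary ROI outside an exclusion zone.

```python
import numpy as np

def average_saliency(heat, box):
    """Mean saliency of the pixels inside box = (x0, y0, x1, y1)."""
    x0, y0, x1, y1 = box
    region = heat[y0:y1, x0:x1]
    return float(region.mean()) if region.size else 0.0

def boxes_overlap(a, b):
    """True if two (x0, y0, x1, y1) boxes intersect."""
    return not (a[2] <= b[0] or b[2] <= a[0] or a[3] <= b[1] or b[3] <= a[1])

def best_anchor_box(heat, stride=3,
                    base_sizes=(16, 24, 32),               # e.g., sizes 602/604/606
                    aspect_ratios=((1, 2), (1, 1), (2, 1)),
                    exclude=None):
    """Score anchor boxes centered on a regular grid; return (best box, score)."""
    h, w = heat.shape
    best_box, best_score = None, -1.0
    for cy in range(0, h, stride):
        for cx in range(0, w, stride):
            for size in base_sizes:
                for ar_w, ar_h in aspect_ratios:
                    bw = max(1, int(size * ar_w / max(ar_w, ar_h)))
                    bh = max(1, int(size * ar_h / max(ar_w, ar_h)))
                    x0, y0 = max(0, cx - bw // 2), max(0, cy - bh // 2)
                    x1, y1 = min(w, x0 + bw), min(h, y0 + bh)
                    box = (x0, y0, x1, y1)
                    if exclude is not None and boxes_overlap(box, exclude):
                        continue  # skip boxes touching an exclusion zone
                    score = average_saliency(heat, box)
                    if score > best_score:
                        best_box, best_score = box, score
    return best_box, best_score
```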
Fig. 7 illustrates a salient ROI according to an example embodiment. In an example process, the computing device may determine the saliency heat map 700 according to the process described above, possibly using a machine learning model or other algorithm that predicts a saliency index at each pixel of an image frame. Based on the saliency heat map 700, the computing device may determine the primary ROI with the greatest average saliency, possibly by computing the average saliency values of pixels within the various anchor bounding boxes, as described above in the context of fig. 6A-6B.
Next, based on the primary ROI, the computing device may determine a secondary ROI, which may be an ROI having a smaller average saliency value than the primary ROI. The secondary ROI may need to be at least a threshold distance from the primary ROI. As shown in image 704, the computing device may determine the secondary ROI based on an exclusion zone 714. The exclusion zone 714 may be a region around the primary ROI 712 in which the secondary ROI may not be located. The computing device may determine the secondary ROI 716 such that the secondary ROI 716 does not overlap with the primary ROI 712 or the exclusion zone 714. In some examples, the computing device may apply the anchor bounding box method described above to pixels outside of the primary ROI 712 and the exclusion zone 714 to determine the secondary ROI 716. In particular, the primary ROI 712 may be the region with the greatest average saliency and the secondary ROI 716 may be the region with the second greatest average saliency, subject to the constraints described above.
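Continuing the sketch above (and reusing its hypothetical best_anchor_box helper), the exclusion-zone logic might look as follows; the margin value is an assumption made only for illustration.

```python
def expand_box(box, margin, width, height):
    """Exclusion zone: the primary ROI grown by a margin on every side."""
    x0, y0, x1, y1 = box
    return (max(0, x0 - margin), max(0, y0 - margin),
            min(width, x1 + margin), min(height, y1 + margin))

def primary_and_secondary_roi(heat, margin=8):
    """Primary ROI is the most salient anchor box; the secondary ROI is the
    most salient anchor box outside the primary ROI's exclusion zone."""
    h, w = heat.shape
    primary, _ = best_anchor_box(heat)                  # sketch from Figs. 6A-6B above
    zone = expand_box(primary, margin, w, h)
    secondary, _ = best_anchor_box(heat, exclude=zone)  # may be None if nothing fits
    return primary, secondary
```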
Based on the primary 712 and secondary 716 ROIs, the computing system may then determine a more stable filtered ROI, which may then potentially be used for one or more auto-focusing processes. For example, fig. 8 illustrates an image 800 with ROIs 802, 804, and 806 according to example embodiments. Image 800 may include a primary ROI 806, a secondary ROI 802, and a previously filtered ROI 804.
The primary ROI 806 may include the most prominent region in the image frame, which the computing device may determine using the anchor bounding box method described above. The secondary ROI 802 may include less significant areas in the image frame than the primary ROI 806. The previously filtered ROI 804 may be a region in the previous image frame that was previously determined by the computing device based on the previous primary ROI.
Based on the primary ROI 806 and the previously filtered ROI 804, the computing device may determine a new filtered ROI for a region in the image frame. The filtered ROI may then potentially be used to apply one or more autofocus processes. For example, the computing device may determine a confidence value for the previously filtered ROI 804 and a confidence value for the primary ROI 806. In particular, although the previously filtered ROI 804 is associated with a previous image frame, the computing device may determine the confidence value for the previously filtered ROI 804 based on the average saliency value within the previously filtered ROI, as calculated from the saliency heat map of the current image frame. The confidence value of the primary ROI may similarly be based on the average saliency value within the primary ROI, as calculated from the saliency heat map generated from the image frame. The computing device may determine that the saliency difference between the previously filtered ROI 804 and the primary ROI 806 exceeds a threshold (e.g., the confidence value of the primary ROI 806 exceeds the confidence value of the previously filtered ROI 804 by a threshold). Based on this determination, the computing device may update the filtered ROI from the previously filtered ROI 804 to the primary ROI 806. Additionally and/or alternatively, if the computing device determines that the saliency difference between the previously filtered ROI 804 and the primary ROI 806 does not exceed the threshold, the computing device may maintain the filtered ROI as the previously filtered ROI 804.
In some alternative examples, the computing device may also consider the secondary ROI 802 and determine that the difference in significance between the previously filtered ROI 804 and the secondary ROI 802 exceeds a threshold (e.g., the confidence value of the secondary ROI 802 exceeds the confidence value of the previously filtered ROI 804 by a threshold), and the computing device may update the filtered ROI from the previously filtered ROI 804 to the secondary ROI 802. Other factors may also be considered, such as stability of the primary ROI, secondary ROI, and/or previously filtered ROIs.
In some examples, the computing device may determine the filtered ROI based on an amount of overlap of the primary ROI 806 and/or the secondary ROI 802 with the previously filtered ROI 804. In particular, the computing system may determine a first amount of overlap of the previously filtered ROI 804 with the primary ROI 806 and a second amount of overlap of the previously filtered ROI 804 with the secondary ROI 802. If both the first and second amounts of overlap do not exceed the threshold, the computing device may update the filtered ROI to be either the primary ROI 806 or the secondary ROI 802, based on which ROI is associated with the greater average saliency value. A smaller overlap (e.g., an overlap that does not exceed a threshold) may indicate that updating the filtered ROI may have a greater impact, while a larger overlap (e.g., an overlap that exceeds a threshold) may indicate that updating the filtered ROI may have a negligible impact. Updating the filtered ROI to another location when the filtered ROI highly overlaps that location may result in a poor user experience because the region of focus may change rapidly over time. In a further example, both the saliency difference and the amount of overlap may be considered in determining whether to update the filtered ROI.
In some examples, the computing system may determine whether the filtered ROI is updated to the primary ROI or to the secondary ROI based on whether the previously filtered ROI overlaps with the primary ROI or with the secondary ROI. For example, the computing device may update the filtered ROI to the secondary ROI when the saliency difference between the primary and secondary ROIs does not exceed a threshold and when the secondary ROI overlaps the previously filtered ROI by at least a threshold amount. Otherwise, when the saliency difference between the primary and secondary ROIs exceeds the threshold, the computing device may determine to update the filtered ROI to the primary ROI.
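Drawing the overlap and saliency-difference rules of the last few paragraphs together, one hypothetical selection routine is sketched below; it reuses the mean_saliency and overlap_fraction helpers from the earlier sketch, and the thresholds and tie-breaking order are assumptions rather than details of this disclosure.

```python
def choose_filtered_roi(prev_filtered, primary, secondary, heat,
                        overlap_thresh=0.3, saliency_diff_thresh=0.2):
    """Pick the new filtered ROI from the previous, primary, and secondary ROIs."""
    prev_conf = mean_saliency(heat, prev_filtered)
    prim_conf = mean_saliency(heat, primary)
    sec_conf = mean_saliency(heat, secondary) if secondary else -1.0

    # The strongest candidate in the current frame.
    if prim_conf >= sec_conf:
        best, best_conf = primary, prim_conf
    else:
        best, best_conf = secondary, sec_conf

    # Keep the previous ROI when the best candidate already overlaps it heavily
    # or is not significantly more salient (low-pass behavior).
    if (overlap_fraction(prev_filtered, best) >= overlap_thresh
            or best_conf - prev_conf <= saliency_diff_thresh):
        return prev_filtered, prev_conf

    # Primary and secondary have similar saliency: prefer the one overlapping
    # the previously filtered ROI, for stability.
    if (secondary is not None
            and abs(prim_conf - sec_conf) <= saliency_diff_thresh
            and overlap_fraction(prev_filtered, secondary) >= overlap_thresh):
        return secondary, sec_conf
    return best, best_conf
```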
After the filtered ROI has been determined, the computing device may apply one or more autofocus processes to the image frame. In some examples, applying the one or more autofocus processes may involve adjusting a camera lens such that the lens is focused on the filtered ROI. In a further example, the computing device may apply blurring to regions of the image frame that are outside the filtered ROI, possibly as a way of artificially blurring the background of the image frame, thereby making the region within the filtered ROI the focus of the image frame.
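As one merely illustrative way to realize the background-blurring option, a blurred copy of the frame can be composited with the sharp ROI; OpenCV's Gaussian blur is used here only as an example, and the kernel size is a placeholder.

```python
import cv2

def blur_outside_roi(frame, roi, ksize=(31, 31)):
    """Return the frame with everything outside roi = (x0, y0, x1, y1) blurred."""
    x0, y0, x1, y1 = roi
    blurred = cv2.GaussianBlur(frame, ksize, 0)   # blurred copy of the whole frame
    out = blurred.copy()
    out[y0:y1, x0:x1] = frame[y0:y1, x0:x1]       # keep the filtered ROI sharp
    return out
```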
As described above, in some examples, the saliency detection may be unstable over time and may report significant areas that are not actually significant. For example, the computing system may alternate between two regions of approximately the same saliency, which may periodically and/or randomly defocus the image frame. Out-of-focus image frames may make it difficult to run additional algorithms (e.g., classifiers to detect objects in the image, possibly to make the image more easily searchable) and result in a worse user experience.
To facilitate determining a focused image frame, the computing device may determine whether to apply one or more autofocus processes based on the filtered ROI associated with a particular state of the finite state machine. Fig. 9 illustrates a finite state machine 900 according to an example embodiment. The finite state machine 900 may help to avoid transient false positives.
As shown in Fig. 9, the finite state machine 900 includes a committed state 902, a pending state 904, a standby state 906, and a probation state 908. The committed state 902 may indicate that a filtered ROI is available, the pending state 904 may indicate that a filtered ROI is waiting for stability verification, the probation state 908 may indicate that a filtered ROI is available but is failing stability verification, and the standby state 906 may indicate that a filtered ROI is not available. In the standby state 906, the computing device may choose not to continue applying the one or more ROI-based autofocus processes, while in the committed state 902 or the probation state 908, the computing device may choose to continue applying the one or more autofocus processes. In the pending state 904, the computing device may choose not to continue applying the one or more autofocus processes, possibly until the ROI has been verified as stable or unstable.
For an image frame comprising a primary ROI, a secondary ROI, and a filtered ROI, the computing device may assign a state to each of the ROIs, and that state may be updated each time a new primary ROI, secondary ROI, and/or filtered ROI is determined. For example, if the primary ROI, the secondary ROI, or the filtered ROI is associated with a confidence measure (e.g., an average saliency value) that does not exceed a threshold, the computing device may update the respective ROI's state from the pending state 904 to the standby state 906, from the committed state 902 to the probation state 908, and/or from the probation state 908 to the standby state 906.
Further, if the primary ROI, secondary ROI, or filtered ROI is inconsistent with and/or does not match any of the previous primary ROI, previous secondary ROI, and/or previously filtered ROI, the computing device may update the state of the respective ROI from the pending state 904 to the standby state 906, from the committed state 902 to the probation state 908, and/or from the probation state 908 to the standby state 906. In some examples, consistency of the primary ROI (or the secondary ROI or the filtered ROI) may be defined as having at least a threshold amount of overlap with the previous primary ROI and not having too abrupt a change in depth.
Further, if the confidence value (e.g., the determined average saliency value of pixels within the ROI) is above a threshold and the consistency is also sufficient (e.g., the amount of overlap is above a threshold, possibly among other factors), the computing device may update the state of the respective ROI from the pending state 904 to the committed state 902 or from the probation state 908 to the committed state 902.
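The transitions just described might be sketched as follows; the state names mirror Fig. 9, while the thresholds, the consistency test, and the standby-to-pending promotion are assumptions added only to make the sketch complete (the latter transition is not described above). The overlap_fraction helper from the earlier sketch is reused.

```python
COMMITTED, PENDING, PROBATION, STANDBY = "committed", "pending", "probation", "standby"

def is_consistent(roi, prev_roi, depth, prev_depth,
                  overlap_thresh=0.3, max_depth_jump=1.0):
    """Consistency: enough overlap with the previous ROI and no abrupt depth change."""
    if roi is None or prev_roi is None:
        return False
    return (overlap_fraction(roi, prev_roi) >= overlap_thresh
            and abs(depth - prev_depth) <= max_depth_jump)

def next_state(state, confidence, consistent, conf_thresh=0.5):
    """One FSM update for a single ROI, following Fig. 9."""
    if confidence <= conf_thresh or not consistent:
        # Low confidence or inconsistent detection demotes the ROI.
        return {PENDING: STANDBY, COMMITTED: PROBATION, PROBATION: STANDBY}.get(state, state)
    # Confident, consistent detections promote the ROI toward the committed state.
    # (The standby -> pending promotion is an assumed starting transition.)
    return {PENDING: COMMITTED, PROBATION: COMMITTED, STANDBY: PENDING}.get(state, state)

def autofocus_allowed(state):
    """ROI-based autofocus may proceed in the committed or probation states."""
    return state in (COMMITTED, PROBATION)
```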
In some examples, the filtered ROI, the primary ROI, and the secondary ROI may act like object detectors, with each of the ROIs indicating one or more objects. However, the filtered ROI, primary ROI, and/or secondary ROI do not necessarily detect objects, but rather detect the most salient locations in the image frame. For example, the filtered ROI, primary ROI, and/or secondary ROI may track an off-center object beside a textured wall or a group of people in the background, as long as those subjects are the most salient. Thus, a computing system performing the methods described herein may output a focus region on an off-center object beside a textured wall or on a group of people in the background, such that the focus region moves but does not necessarily follow a single object.
Furthermore, a computing system performing the methods described herein may be able to quickly switch from focusing on one region indicated by an ROI to focusing on another region indicated by another ROI, because the computing system identifies multiple ROIs. Thus, if the object located at the filtered ROI is no longer present and the filtered ROI is no longer very salient, the computing system may quickly switch to focusing on the primary or secondary ROI. Furthermore, if an image frame contains two objects with similar saliency in the primary and secondary ROIs, a computing device performing the methods described herein may remain stably focused on the primary ROI for a number of image frames before possibly switching to the secondary ROI, rather than continuously switching between the primary and secondary ROIs.
Fig. 10 illustrates a finite state machine manager 1000 according to an example embodiment. The finite state machine manager 1000 includes a salient ROI state machine 1002 and a salient ROI state machine 1004. The finite state machine manager 1000 may prepare a candidate ROI (e.g., a primary ROI, a secondary ROI, a previously filtered ROI, or a filtered ROI) by verifying several factors before inputting the candidate ROI into the salient ROI state machine 1002 or the salient ROI state machine 1004. For example, if a candidate ROI is already associated with one salient ROI state machine, the candidate ROI may not be entered into the other salient ROI state machine. If the object within the candidate ROI is not within a valid distance range, the candidate ROI may be discarded. The candidate ROI may also be discarded if it is not within a valid window near the center of the image.
Having two salient ROI state machines may help facilitate switching between applying one or more autofocus algorithms to one region of an image frame and another region of the image frame. For example, if an object focused in an image frame disappears within a short time frame and the object is associated with the salient ROI state machine 1002, the computing device may check whether the state of the candidate ROI region associated with the salient ROI state machine 1004 is an acceptable state. If the state is acceptable, the computing device may quickly switch to focus on the candidate ROI.
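A rough, assumed sketch of this manager follows, reusing the FSM sketch above; the verification limits, the 0.5 overlap test, and the class layout are placeholders rather than details of this disclosure.

```python
class SalientRoiStateMachine:
    """Tracks one salient region using the FSM sketched above (Fig. 9)."""

    def __init__(self):
        self.state = STANDBY
        self.roi = None

    def update(self, roi, confidence, consistent):
        self.roi = roi
        self.state = next_state(self.state, confidence, consistent)


def prepare_candidate(candidate_roi, depth, center_offset, other_machine,
                      depth_range=(0.2, 5.0), center_window=0.6):
    """Verify a candidate ROI before feeding it into one of the two state machines."""
    # A candidate already associated with the other state machine is not re-entered.
    if other_machine.roi is not None and overlap_fraction(candidate_roi, other_machine.roi) > 0.5:
        return False
    # Discard candidates whose object is outside the valid distance range.
    if not (depth_range[0] <= depth <= depth_range[1]):
        return False
    # Discard candidates outside a valid window near the center of the image.
    return abs(center_offset) <= center_window
```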
Fig. 11 is a flowchart of a method 1100 according to an example embodiment. Method 1100 may be performed by one or more computing systems (e.g., computing system 200 of Fig. 2) and/or one or more processors (e.g., processor 206 of Fig. 2). Method 1100 may be performed on a computing device, such as computing device 100 of Fig. 1.
At block 1102, the method 1100 includes receiving an image frame captured by an image capture device.
At block 1104, the method 1100 includes determining a saliency heat map representing saliency of pixels in an image frame.
At block 1106, the method 1100 includes determining a primary ROI and a secondary ROI of the image frame based on the saliency heat map.
At block 1108, the method 1100 includes determining a filtered ROI of the image frame. The filtered ROI is updated from the previously filtered ROI to the primary ROI based on a difference in significance between the previously filtered ROI and the primary ROI exceeding a first threshold.
At block 1110, the method 1100 includes applying one or more autofocus processes based on at least one of the filtered ROI, primary ROI, or secondary ROI.
In some examples, determining the primary and secondary ROIs is based on the primary ROI having a greater average significance than the secondary ROI.
In some examples, updating the filtered ROI from the previously filtered ROI to the primary ROI is further based on both a first amount of overlap of the previously filtered ROI with the primary ROI and a second amount of overlap of the previously filtered ROI with the secondary ROI not exceeding a second threshold.
In some examples, when the filtered ROI is set as the previously filtered ROI, the filtered ROI is associated with an updated average saliency value based on the saliency heat map.
In some examples, determining the primary ROI and the secondary ROI based on the saliency heat map includes determining a plurality of candidate anchor boxes distributed over the saliency heat map, wherein each of the candidate anchor boxes is associated with a saliency metric, wherein determining the primary ROI and the secondary ROI is based on the saliency metric of each of the candidate anchor boxes.
In some examples, the plurality of candidate anchor frames includes a plurality of anchor frames having a plurality of different aspect ratios at a given location in the image frame.
In some examples, the candidate anchor boxes are evenly distributed over the saliency heat map.
In some examples, the plurality of candidate anchor frames includes a plurality of anchor frames having a plurality of sizes at a given location in the image frame.
In some examples, the primary ROI is associated with a primary confidence metric and the previously filtered ROI is associated with a filtered confidence metric based on the saliency heat map, wherein the saliency difference is based on the primary confidence metric and the filtered confidence metric.
In some examples, the primary confidence metric is based on an average of one or more saliency values at one or more pixels within the primary ROI, and the filtered confidence metric is based on an average of one or more saliency values at one or more pixels within the previously filtered ROI.
In some examples, determining the primary and secondary ROIs is based on the primary and secondary ROIs being at least a threshold distance apart.
In some examples, determining the primary and secondary ROIs includes determining the primary ROI based on a saliency heat map, determining a forbidden region around the primary ROI, and determining the secondary ROI based on the forbidden region around the primary ROI and the saliency heat map such that the secondary ROI is not within the primary ROI or the forbidden region around the primary ROI.
In some examples, the previously filtered ROI is based on a previous image frame captured prior to the image frame.
In some examples, determining a saliency heat map representing saliency of each pixel in the image frame includes applying a pre-trained machine learning model to the image frame to determine the saliency heat map.
In some examples, applying one or more autofocus processes includes causing a camera lens to adjust focus to the filtered ROI.
In some examples, applying one or more autofocus processes includes applying blur to regions of the image frame outside the filtered ROI.
In some examples, the method 1100 further includes applying a finite state machine to the filtered ROI, wherein applying one or more autofocus processes is based on the filtered ROI associated with a particular state of the finite state machine.
In some examples, the finite state machine includes a committed state indicating that the filtered ROI is available, a pending state indicating that the filtered ROI is waiting for stability verification, a probation state indicating that the filtered ROI is available but is failing stability verification, and a standby state indicating that the filtered ROI is unavailable.
In some examples, method 1100 further includes updating a state associated with the filtered ROI, wherein updating the state associated with the filtered ROI includes updating the state from a pending state to a standby state, from a committed state to a review state, or from a review state to a standby state based on determining that the confidence metric associated with the filtered ROI does not exceed the second threshold.
In some examples, method 1100 further includes updating a state associated with the filtered ROI, wherein updating the state associated with the filtered ROI includes updating the state from a pending state to a standby state, from a committed state to a review state, or from a review state to a standby state based on determining that the filtered ROI does not overlap with a previously filtered ROI.
In some examples, the particular state of the finite state machine is a committed state.
In some examples, the method 1100 further includes applying a finite state machine to each of the filtered ROI, the primary ROI, and the secondary ROI, wherein applying one or more autofocus processes is based on respective states of the finite state machine associated with each of the filtered ROI, the primary ROI, and the secondary ROI.
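The state machine behavior recited in the preceding examples might be sketched in Python as follows. The demotion transitions and the gating of autofocus on the committed state are taken from the examples above; the promotion path toward the committed state is an added assumption for completeness:

from enum import Enum, auto

class RoiState(Enum):
    COMMITTED = auto()  # filtered ROI is available
    PENDING = auto()    # filtered ROI is awaiting stability verification
    REVIEW = auto()     # filtered ROI is available but its stability is in question
    STANDBY = auto()    # filtered ROI is unavailable

_DEMOTE = {RoiState.PENDING: RoiState.STANDBY,
           RoiState.COMMITTED: RoiState.REVIEW,
           RoiState.REVIEW: RoiState.STANDBY}

_PROMOTE = {RoiState.STANDBY: RoiState.PENDING,   # assumed promotion path
            RoiState.PENDING: RoiState.COMMITTED,
            RoiState.REVIEW: RoiState.COMMITTED}

def next_state(state, confident, overlaps_previous):
    """Demote when the confidence metric is too low or the filtered ROI no longer
    overlaps the previously filtered ROI; otherwise (assumption) promote."""
    if not confident or not overlaps_previous:
        return _DEMOTE.get(state, state)
    return _PROMOTE.get(state, state)

def should_apply_autofocus(state):
    """Autofocus is applied only while the filtered ROI is in the committed state."""
    return state is RoiState.COMMITTED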
In some examples, the method 1100 is performed by an image capture device comprising a camera and a control system configured to perform the steps of the method 1100.
In some such examples, the image capture device is a mobile device, wherein the image frame is captured by the camera of the mobile device.
In some examples, a method may include receiving an image frame captured by an image capture device. The method may further include determining a saliency heat map representing saliency of pixels in the image frame. The method may further include determining a primary ROI based on the saliency heat map. The method may additionally include determining a forbidden region around the primary ROI. The method may further include determining a secondary ROI based on the saliency heat map such that the secondary ROI does not overlap with the primary ROI and does not overlap with the forbidden region. The method may further include controlling one or more autofocus processes based on the primary ROI and the secondary ROI.
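An end-to-end toy sketch of this method in Python is given below; it substitutes a synthetic saliency map for the image capture and model steps, and the grid search, box size, stride, and margin are all assumptions rather than the disclosed method:

import numpy as np

def mean_saliency(heat_map, box):
    x, y, w, h = box
    return float(heat_map[y:y + h, x:x + w].mean())

def run_pipeline(heat_map, box_size=40, stride=20, margin=20):
    """Determine a primary ROI, a forbidden region around it, and a secondary
    ROI outside that region, all from the saliency heat map."""
    h, w = heat_map.shape
    candidates = [(x, y, box_size, box_size)
                  for y in range(0, h - box_size, stride)
                  for x in range(0, w - box_size, stride)]
    primary = max(candidates, key=lambda b: mean_saliency(heat_map, b))
    px, py, pw, ph = primary
    fx, fy = max(0, px - margin), max(0, py - margin)
    fw, fh = pw + 2 * margin, ph + 2 * margin

    def outside_forbidden(b):
        bx, by, bw, bh = b
        return bx + bw <= fx or fx + fw <= bx or by + bh <= fy or fy + fh <= by

    allowed = [b for b in candidates if outside_forbidden(b)]
    secondary = max(allowed, key=lambda b: mean_saliency(heat_map, b)) if allowed else None
    return primary, secondary

# Toy usage with a synthetic heat map containing two bright regions; the
# autofocus processes would then be controlled based on the returned ROIs.
heat_map = np.zeros((240, 320), dtype=np.float32)
heat_map[40:80, 60:100] = 1.0
heat_map[150:190, 220:260] = 0.8
print(run_pipeline(heat_map))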
III. Conclusion
The present disclosure is not to be limited in terms of the particular embodiments described in this disclosure, which are intended as illustrations of various aspects. As will be apparent to those skilled in the art, many modifications and variations are possible without departing from the scope thereof. Functionally equivalent methods and apparatus within the scope of the disclosure, in addition to those described herein, will be apparent to those skilled in the art from the foregoing description. Such modifications and variations are intended to fall within the scope of the appended claims.
The above detailed description describes various features and operations of the disclosed systems, devices, and methods with reference to the accompanying drawings. In the drawings, like numerals generally identify like components unless context dictates otherwise. The example embodiments described herein and in the drawings are not intended to be limiting. Other embodiments may be utilized and other changes may be made without departing from the scope of the subject matter presented herein. It will be readily understood that the aspects of the present disclosure, as generally described herein, and illustrated in the figures, could be arranged, substituted, combined, separated, and designed in a wide variety of different configurations.
With respect to any or all of the message flow diagrams, scenarios, and flowcharts in the accompanying figures and as discussed herein, each step, block, and/or communication can represent processing of information and/or transmission of information in accordance with an example embodiment. Alternative embodiments are included within the scope of these example embodiments. In these alternative embodiments, the operations described as steps, blocks, transmissions, communications, requests, responses, and/or messages, for example, may be performed out of the order shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved. Further, more or fewer blocks and/or operations may be used with any of the message flow diagrams, scenarios, and flowcharts discussed herein, and these message flow diagrams, scenarios, and flowcharts may be partially or fully combined with one another.
The steps or blocks representing processing of information may correspond to circuitry which may be configured to perform specific logical functions of the methods or techniques described herein. Alternatively or additionally, blocks representing processing of information may correspond to modules, segments, or portions of program code (including related data). Program code may include one or more instructions executable by a processor for performing specific logical operations or acts in a method or technique. The program code and/or related data may be stored on any type of computer-readable medium, such as a storage device including Random Access Memory (RAM), a disk drive, a solid state drive, or another storage medium.
The computer-readable medium may also include non-transitory computer-readable media, such as computer-readable media that store data for short periods of time, like register memory, processor cache, and RAM. The computer-readable medium may also include a non-transitory computer-readable medium that stores program code and/or data for a long period of time. Thus, the computer-readable medium may include secondary or persistent long-term storage, such as read-only memory (ROM), optical or magnetic disks, solid-state drives, or compact disc read-only memory (CD-ROM), for example. The computer-readable medium may also be any other volatile or non-volatile memory system. A computer-readable medium may be considered, for example, a computer-readable storage medium or a tangible storage device.
In addition, steps or blocks representing one or more information transmissions may correspond to information transmissions between software and/or hardware modules in the same physical device. However, other information transmissions may be between software and/or hardware modules in different physical devices.
The particular arrangements shown in the drawings should not be construed as limiting. It is to be understood that other embodiments may include more or fewer of each of the elements shown in a given figure. Further, some of the illustrated elements may be combined or omitted. Moreover, example embodiments may include elements not shown in the figures.
While various aspects and embodiments have been disclosed herein, other aspects and embodiments will be apparent to those skilled in the art. The various aspects and embodiments disclosed herein are for purposes of illustration and are not intended to be limiting, with the true scope being indicated by the following claims.