CN121419862A - Controlling an agent by tracking points in an image - Google Patents
Controlling an agent by tracking points in an image
- Publication number
- CN121419862A CN202480043748.1A
- Authority
- CN
- China
- Prior art keywords
- points
- task
- image
- point
- target
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- B—PERFORMING OPERATIONS; TRANSPORTING
- B25—HAND TOOLS; PORTABLE POWER-DRIVEN TOOLS; MANIPULATORS
- B25J—MANIPULATORS; CHAMBERS PROVIDED WITH MANIPULATION DEVICES
- B25J9/00—Programme-controlled manipulators
- B25J9/16—Programme controls
- B25J9/1628—Programme controls characterised by the control loop
- B25J9/163—Programme controls characterised by the control loop learning, adaptive, model based, rule based expert control
-
- B—PERFORMING OPERATIONS; TRANSPORTING
- B25—HAND TOOLS; PORTABLE POWER-DRIVEN TOOLS; MANIPULATORS
- B25J—MANIPULATORS; CHAMBERS PROVIDED WITH MANIPULATION DEVICES
- B25J9/00—Programme-controlled manipulators
- B25J9/16—Programme controls
- B25J9/1694—Programme controls characterised by use of sensors other than normal servo-feedback from position, speed or acceleration sensors, perception control, multi-sensor controlled systems, sensor fusion
- B25J9/1697—Vision controlled systems
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
-
- G—PHYSICS
- G05—CONTROLLING; REGULATING
- G05B—CONTROL OR REGULATING SYSTEMS IN GENERAL; FUNCTIONAL ELEMENTS OF SUCH SYSTEMS; MONITORING OR TESTING ARRANGEMENTS FOR SUCH SYSTEMS OR ELEMENTS
- G05B2219/00—Program-control systems
- G05B2219/30—Nc systems
- G05B2219/36—Nc in input of data, input key till input tape
- G05B2219/36442—Automatically teaching, teach by showing
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Evolutionary Computation (AREA)
- Computing Systems (AREA)
- Health & Medical Sciences (AREA)
- Mechanical Engineering (AREA)
- Artificial Intelligence (AREA)
- Robotics (AREA)
- Software Systems (AREA)
- General Health & Medical Sciences (AREA)
- General Physics & Mathematics (AREA)
- Biophysics (AREA)
- General Engineering & Computer Science (AREA)
- Molecular Biology (AREA)
- Mathematical Physics (AREA)
- Data Mining & Analysis (AREA)
- Computational Linguistics (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Databases & Information Systems (AREA)
- Medical Informatics (AREA)
- Multimedia (AREA)
- Biomedical Technology (AREA)
- Life Sciences & Earth Sciences (AREA)
- Image Analysis (AREA)
Abstract
Systems and methods for controlling an agent using tracked points in images. For example, a mechanical agent interacting in a real-world environment is controlled by selecting actions to be performed by the agent to carry out an instance of a task, using images captured while the agent performs the instance of the task.
Description
Cross Reference to Related Applications
The present application claims priority to U.S. Provisional Application No. 63/535,568, filed August 30, 2023. The disclosure of the prior application is considered part of the disclosure of the present application and is incorporated herein by reference.
Background
The present description relates to the use of neural networks to control agents.
A neural network is a machine learning model that employs one or more layers of nonlinear units to predict an output for a received input. In addition to the output layer, some neural networks include one or more hidden layers. The output of each hidden layer is used as an input to the next layer in the network (i.e., the next hidden layer or the output layer). Each layer of the network generates an output from the received input in accordance with current values of a respective set of parameters.
Disclosure of Invention
The specification describes a system implemented as a computer program on one or more computers at one or more locations that controls an agent (e.g., a robot) by selecting actions to be performed by the agent to interact in an environment and then causing the agent to perform the actions.
Specifically, the system uses images captured while the agent is executing the instance of the task to control the agent to execute the instance of the task.
The subject matter described in this specification can be implemented in specific embodiments to realize one or more of the following advantages.
Demonstration learning, i.e., learning how to perform a task from a set of demonstrations of the task being performed, enables an agent to autonomously perform a new instance of the task after learning from the demonstrations. That is, rather than being manually programmed with the ability to perform the task, the agent learns from expert demonstrations, i.e., demonstrations by an agent controlled by, for example, a human, a fixed policy for the task, or a learned policy that has already been trained for the task.
Current approaches to demonstration learning typically require task-specific engineering or excessive amounts of demonstration data, preventing demonstration learning from being completed in an amount of time that enables viable use. For example, imitation learning (i.e., training a first system to emulate actions demonstrated by a second, different system, e.g., behavioral cloning) and inverse reinforcement learning for image-guided robotic agents are powerful but data- and time-intensive ways of training robotic agents to perform tasks, as they may require hundreds to thousands of task demonstrations across various environments to teach agents to process images to robustly perform tasks.
One reason for the large data and time requirements is that the input for demonstration learning is typically the raw images associated with performing the task. Because the demonstrations must cover a wide range of environments and scenarios, an agent may need a large number of demonstrations (and thus a large amount of data and training time) to learn the internal representation required for generalized task execution.
This specification, on the other hand, describes using tracked points in images as input to allow faster and more general learning from demonstrations. By using points in images as the input for demonstration learning, as described in this specification, the number of demonstrations required to teach an agent to perform a task is reduced by orders of magnitude (and thus the amount of data required and the training time are reduced by orders of magnitude), while still enabling the agent to generalize task performance.
By tracking points in images during a demonstrated task, the system can automatically extract individual motions, the relevant points for each motion, and the target locations for those points, and generate a plan that can be executed by an agent for a new instance of the task, without any action supervision, task-specific training, or neural network fine-tuning.
The details of one or more embodiments of the subject matter of this specification are set forth in the accompanying drawings and the description below.
Other features, aspects, and advantages of the subject matter will become apparent from the description, the drawings, and the claims.
Drawings
FIG. 1 illustrates an example agent control system.
FIG. 2 illustrates an example agent control system.
FIG. 3 is a flow chart of an example process for determining relevant points of a task segment.
FIG. 4 shows an example image sequence depicting an example process for determining relevant points of a task segment involving a robot having a gripper with a camera mounted thereto.
FIG. 5 is a flow chart of an example process for controlling an agent using an agent control system.
FIG. 6 is a flow chart of an example process for executing a new instance of a task.
Fig. 7 shows an example of tasks performed by a robot with a camera-mounted gripper using the described techniques.
Like reference numbers and designations in the various drawings indicate like elements.
Detailed Description
FIG. 1 illustrates an example agent control system 100. The agent control system 100 is an example of a system implemented as a computer program on one or more computers in one or more locations, in which the systems, components, and techniques described below may be implemented.
The agent control system 100 is a system that controls an agent (e.g., a robot) that is to interact in an environment (e.g., a real world environment) by selecting an action to be performed by the agent and then causing the agent to perform the action. Specifically, the system 100 uses images captured while an agent is performing an instance of a task to control the agent to perform the instance of the task. That is, the system 100 receives the image 106 and generates an action 118 for the agent to perform the task.
More specifically, the system 100 processes the plurality of demonstration image sequences 102 to generate data representing a plurality of task segments of the task (e.g., the first task segment 104A and the second task segment 104B) and, for each task segment, data representing a plurality of relevant points of the task segment (e.g., the relevant points 108 of the first task segment 104A). The system 100 then processes the image 106 and the data generated from the demonstrations 102 to generate an action 118 to be performed by the agent.
The images that make up the received image 106 and the plurality of demonstration image sequences 102 may be captured by a camera sensor of the robot or by a camera sensor located in the environment. The robot may be a mechanical robot operating in a real-world environment. The camera sensor may capture images of the robot as the robot performs tasks in the environment (e.g., in the real-world environment). As a specific example, the robot may be a robot comprising a gripper for gripping and moving objects in the environment, and the camera sensor may be positioned on the gripper (or mounted at a fixed position and orientation relative to the gripper), i.e., such that the gripper and any object gripped in the gripper do not move significantly relative to the camera sensor unless the gripper is opened or closed.
Each of the demonstration image sequences 103A-C is a sequence of images of an agent performing a respective instance of the task, e.g., while the agent is controlled by a human, a fixed policy for the task, or a learned policy that has already been trained for the task. Different instances of the task may have different configurations, e.g., of objects in the environment, but the same goal, e.g., moving similar objects to the same location, arranging similar objects in a particular configuration, etc.
Although only three demonstration image sequences, i.e., demonstration image sequences 103A-C, are shown in fig. 1, in practice the system 100 may process any number of demonstration image sequences.
The system 100 operates in two modes, an extraction mode and an action mode.
During the extraction mode, the system 100 processes the plurality of demonstration image sequences 102 to generate data representing a plurality of task segments of the task and, for each task segment, data representing a plurality of relevant points for the task segment.
Each of the task segments corresponds to a respective portion of each of the plurality of demonstration image sequences 102.
Thus, each task segment has a corresponding portion in each of the demonstration image sequences 102, where a "portion" of a demonstration image sequence includes only a proper subset of the images in the sequence.
In some implementations, the portion of each of the plurality of demonstration image sequences 102 associated with a task segment may contain a varying number of images.
The system 100 may use any of a variety of methods to determine the task segments.
For example, the system 100 may generate the task segments using a set of rules (e.g., a predefined number of images) based on time.
As another example, the system 100 may use events or actions to create task segments, e.g., segmentation based on robot pose information (e.g., gripper actions and forces). That is, the system 100 may divide each demonstration image sequence into portions based on the position of a designated component of the robot, the force applied to the robot, or both.
As a specific example, where a robotic arm agent moves a grasped block to perform an insertion task, the system 100 may use the gripper open position, recording the points at which the position passes a selected threshold, to extract gripper actuation events. These points in time mark the beginning or end of a grasp and can be used to determine the beginning and end points of a task segment.
As another specific example, where the robotic arm agent moves the grasped block to perform the insertion task, the system 100 may extract the beginning or end of a force phase. To extract these, the system 100 tracks the vertical force measured by a torque sensor, smooths the signal, and converts it to a normalized force signal. The system 100 then uses a selected threshold to determine the occurrence of force events and uses these points to determine the task segments.
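The threshold-based event extraction described in the two examples above can be sketched as follows. The specification gives no code, so this is an illustrative sketch: the function names, smoothing window, and threshold value are assumptions, not part of the specification.

```python
import numpy as np

def extract_segment_boundaries(signal, threshold):
    """Return the indices where a 1-D signal crosses the threshold.

    Each crossing marks the start or end of a gripper actuation
    (or force) event and hence a candidate task-segment boundary.
    """
    signal = np.asarray(signal, dtype=float)
    above = signal > threshold
    # A crossing occurs wherever consecutive samples disagree.
    crossings = np.flatnonzero(above[1:] != above[:-1]) + 1
    return crossings.tolist()

def smooth_and_normalize(force, window=5):
    """Moving-average smoothing followed by peak normalization,
    as described for the force-phase extraction."""
    force = np.asarray(force, dtype=float)
    kernel = np.ones(window) / window
    smoothed = np.convolve(force, kernel, mode="same")
    peak = np.max(np.abs(smoothed))
    return smoothed / peak if peak > 0 else smoothed
```

For a gripper-open signal that starts closed, opens past the threshold, and closes again, the two crossings bracket the actuation event and can serve as the beginning and end points of a task segment.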
Although only two task segments, namely task segments 104A and 104B, are shown in FIG. 1, in practice system 100 may generate any number of task segments.
As used in this specification, a "point" is a point in a corresponding image, i.e., a point that specifies a respective spatial location (i.e., a respective pixel) in the corresponding image. Each pixel may have one or more associated values or attributes (e.g., intensity values). For example, each pixel may include one or more intensity values, each representing the intensity (e.g., RGB values) of the corresponding color. Thus, the values of the pixels in the image may represent features of the image.
The system 100 uses a point tracker to track a set of randomly selected points across all of the demonstrations 102 and generates tracking data. Based on the tracking data, the system 100 selects a subset of points associated with each task segment as the one or more relevant points of the task segment. Points may be tracked across a sequence of images by determining the corresponding locations (points) within the images that each relate to the same feature (e.g., the same section or portion of the scene or environment shown in the images) across the sequence. For example, each point tracked across the sequence may represent the same location on the surface of an object in the environment. As the camera and the object move relative to one another, the positions of the points within the images may change.
For example, for tasks involving manipulation of objects, the system 100 may select the relevant points as those points on the relevant object being manipulated.
As another example, the system 100 may select points according to a set of rules, e.g., that the relevant points must exhibit a degree of motion, or that the relevant points must end at a common position across the demonstrations.
Further details of determining relevant points of a task segment are described below with reference to fig. 3 and 4.
The system 100 also maintains respective point tracking data for each of the relevant points of each of the task segments that identifies a respective spatial location of the point in at least some of the images corresponding to the task segments, e.g., the point tracking data 112 for the relevant point 108 of the first task segment 104A.
Thus, the point tracking data may identify spatial locations representing the same point but in different images. The point tracking data may also include an occlusion score representing the likelihood of occlusion, and further optionally an uncertainty score representing the uncertainty in the predicted spatial location. If the spatial position of a point is different across different images, the point has moved relative to the camera between the different images.
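One way to represent the maintained point tracking data, with its spatial locations, occlusion scores, and optional uncertainty scores, is sketched below. The class and field names are illustrative assumptions; the specification does not prescribe a data layout.

```python
from dataclasses import dataclass
from typing import Optional

import numpy as np

@dataclass
class PointTracks:
    """Tracks for P points across T images of one demonstration.

    positions:   (T, P, 2) array of (x, y) pixel locations.
    occlusion:   (T, P) scores; higher means more likely occluded.
    uncertainty: optional (T, P) scores for the predicted locations.
    """
    positions: np.ndarray
    occlusion: np.ndarray
    uncertainty: Optional[np.ndarray] = None

    def moved(self, point_idx, t0, t1, eps=1.0):
        """True if the point moved relative to the camera between
        images t0 and t1 by more than eps pixels."""
        d = self.positions[t1, point_idx] - self.positions[t0, point_idx]
        return float(np.linalg.norm(d)) > eps
```

The `moved` helper captures the observation in the text: if a point's spatial position differs across images, it has moved relative to the camera between those images.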
During the action mode, the system 100 uses the maintained data to perform a new instance of the task, i.e., by using (i) the images captured when the agent performed the new instance of the task and (ii) the relevant points of the task segment.
More specifically, the system 100 may perform the following at each of a plurality of time steps during execution of a new instance of a task.
At each of a plurality of time steps, the system 100 obtains images of an agent (e.g., a robot) at that time step (e.g., captured by a camera sensor of the robot at that time step). As a specific example, the obtained image 106 belongs to a time step t.
The system 100 identifies the current task segment at that time step.
For example, as described above, the task segments may be determined based on the position of a component of the robot in the demonstration sequences, the force applied to the robot, or both, i.e., such that each task segment begins when the corresponding component is in a first position or a particular force has been applied to the robot (or, for the first task segment, at the beginning of the instance of the task) and continues until the corresponding component is in a second position or another particular force has been applied to the robot. When the agent is a robot with a gripper, the segmentation may be based on the force applied to the gripper or on the position of the gripper, i.e., open and closed positions representing the extent to which the gripper is open or closed.
The system 100 may then identify the current task segment by identifying whether the criteria for terminating the task segment for the previous time step have been met. If the corresponding criterion has not been met, the system 100 sets the current task segment to the task segment of the previous time step, and if the corresponding criterion has been met, the system 100 sets the current task segment to the next task segment after the task segment of the previous time step.
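The segment-advancement rule just described can be sketched as a small helper. The `segment_done` list of per-segment termination predicates is a hypothetical interface introduced for illustration.

```python
def current_task_segment(prev_segment, segment_done, observation):
    """Return the index of the task segment for the current time step.

    prev_segment: index of the segment active at the previous step.
    segment_done: one predicate per segment; each takes the current
        observation and returns True once that segment's termination
        criterion is met (e.g., the gripper has closed).
    """
    if segment_done[prev_segment](observation):
        # Criterion met: advance to the next task segment.
        return prev_segment + 1
    # Criterion not yet met: stay in the previous step's segment.
    return prev_segment
```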
The system 100 determines one or more target points from the relevant points of the current task segment and determines a respective target predicted position of each of the target points in the future image from the point tracking data of the task segment.
As a specific example, for the image 106 of time step t, the system 100 determines one or more target points 110 from the relevant points 108 of the current task segment 104A and determines a respective target predicted position 116 of each of the target points 110 in the future image 114 from the point tracking data 112 of the task segment 104A. Each target point 110 may be a location (e.g., a pixel) within the image 106 that represents (e.g., shows) a feature (e.g., an object or a section of the image) within the image 106 that corresponds to one of the relevant points 108 of the current task segment 104A.
The "future image" 114 is an image from one of the demonstration image sequences 102 that the system 100 aims to replicate; the point tracking data 112 identifies the corresponding spatial locations of the relevant points in that image.
The system 100 may determine the future image 114 as the image, across all of the maintained demonstration image sequences, whose corresponding relevant points are most similar in position (relative to the image frame) to the target points 110 in the image 106, and may then take the relevant points of the future image 114 as the target predicted positions 116.
In general, the target predicted positions 116 are "where" the system 100 intends to move the target points 110 in order to replicate the future image. That is, the target points 110 identify "what" points are relevant in the current image 106, the target predicted positions 116 determine "where" these points should be located, and the generated action 118 determines "how" the target points 110 will be brought to the target predicted positions 116.
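Selecting the future image by similarity of relevant-point positions might look like the following sketch. The mean-Euclidean-distance metric is an assumption; the specification only requires "most similar in position".

```python
import numpy as np

def select_future_image(target_points, demo_tracks):
    """Pick the demonstration image whose relevant-point locations are
    closest to the current target points; return its index together
    with those locations as the target predicted positions.

    target_points: (P, 2) current locations of the relevant points.
    demo_tracks:   (T, P, 2) relevant-point locations in each of the T
                   demonstration images of the current task segment.
    """
    target_points = np.asarray(target_points, dtype=float)
    demo_tracks = np.asarray(demo_tracks, dtype=float)
    # Mean Euclidean distance between corresponding points, per image.
    dists = np.linalg.norm(demo_tracks - target_points, axis=-1).mean(axis=-1)
    best = int(np.argmin(dists))
    return best, demo_tracks[best]
```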
The system 100 then causes the agent to perform an action 118 that is predicted to move the target point 110 to the target predicted location 116.
For example, the system 100 may apply a controller (e.g., a visual servo controller or other robotic controller) to process the one or more target points 110 and the respective target predicted positions 116 of each of the target points 110 in the future image 114 to determine actions 118 predicted to move the target point 110 to the corresponding target predicted position 116, and then cause the agent to perform the determined actions 118, e.g., by applying control inputs to one or more controllable elements (e.g., joints, actuators, etc.) of the agent.
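A minimal image-space servoing step is sketched below, assuming for simplicity that the commanded velocity axes align with the image axes; a real visual servoing controller would map the image-space error through an image Jacobian and the robot kinematics to joint or end-effector commands. The gain and speed limit are illustrative parameters.

```python
import numpy as np

def servo_action(target_points, goal_points, gain=0.1, max_speed=1.0):
    """Proportional control in image space: command a velocity that
    moves the tracked target points toward their goal positions.

    target_points, goal_points: (P, 2) current and goal pixel
    locations of the relevant points. Returns a 2-D velocity command.
    """
    err = np.asarray(goal_points, float) - np.asarray(target_points, float)
    v = gain * err.mean(axis=0)      # proportional to the mean point error
    speed = np.linalg.norm(v)
    if speed > max_speed:            # clip the command for safety
        v *= max_speed / speed
    return v
```

Applied at every time step, the command shrinks the error between the target points and the target predicted positions, which is the "how" of bringing the target points to those positions.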
Further details of updating the agent control system 100 and performing new instances of tasks are described below with reference to fig. 5 and 6, respectively.
Fig. 2 illustrates an example agent control system 200. The agent control system 200 is an example of a system implemented as a computer program on one or more computers in one or more locations, in which the systems, components, and techniques described below may be implemented.
The agent control system 200 is a system that controls a robot that includes a gripper for gripping and moving objects in an environment and a camera sensor positioned on the gripper, i.e., such that the gripper and any object gripped in the gripper do not move significantly relative to the camera sensor unless the gripper is opened or closed. Specifically, the system 200 uses images captured while the agent performs an instance of a task that involves moving an object to control the robot to perform the instance of the task. That is, the system 200 receives the image 206 and generates an action 218 for the robot to perform the object-moving task.
During the extraction mode, the system 200 processes three demonstration image sequences 203A-C, which make up a set of demonstration image sequences 202. These demonstrations are of a task of grasping an L-shaped block and placing it on an elliptical target, and fig. 2 depicts the sequence of images for each demonstration in top-down order. Although three demonstration image sequences are processed in the example of fig. 2, it should be understood that the number of demonstration image sequences may vary. Similarly, the number of images within each demonstration image sequence may vary.
Fig. 2 also depicts the generation of the first and second task segments 204A, 204B, illustrating the respective portions of each of the plurality of demonstration image sequences 202 that make up the first and second task segments 204A, 204B. The system 200 divides each demonstration image sequence into portions based on the force applied to the gripper and on the position of the gripper; more specifically, for each demonstration image sequence, the first three images of the sequence constitute the first task segment 204A, the fourth image corresponds to a motor primitive (e.g., the gripper closing and moving upward), and the next three images correspond to the second task segment 204B.
All task segments of a task and motor primitives of the gripper are depicted under the heading "motion plan" in fig. 2.
The system 200 generates point tracking data using a point tracker called TAPIR ("Tracking Any Point with per-frame Initialization and temporal Refinement"), a method for accurately tracking specific points across an image sequence, as described in arXiv:2306.08637. The method has two stages: 1) a matching stage that independently locates a suitable candidate point match for each query point in each other image, and 2) a refinement stage that updates the trajectory based on local correlations across the images. Although the system 200 uses TAPIR in the example of fig. 2, more generally the system 200 may use any other suitable point tracker capable of generating the desired output, e.g., BootsTAP, described in arXiv:2402.00847, or TAP-Net, described in arXiv:2211.03726. That is, any general point tracking method may be used to identify the relative motion of points across the images (frames) in a sequence. Point tracking may determine two pixels in two different images that each represent the same section or portion of the scene or environment shown in the images (e.g., they are projections of the same point on the same physical surface in the environment). For each illustrated task segment, fig. 2 illustrates the tracked points by connecting lines across the images of the demonstrations; e.g., the first task segment 204A shows tracked points associated with the relevant object of the task segment.
The system 200 then selects the relevant points for each task segment and the corresponding point tracking data. Fig. 2 depicts the point tracking data 212, represented as three sets of connected lines (one set per demonstration) within a single image, for a relevant point 208 (denoted qt) of the first task segment 204A. Specifically, the system 200 uses object discovery to select the L-shaped block as the relevant object and the corresponding points on the relevant object as the relevant points, which are determined across the demonstrations for the first task segment 204A.
During the action mode, the system 200 obtains an image 206 of time step t.
The system 200 determines that the current task segment for the image 206 at time step t is the first task segment 204A, because the gripper has not yet started to close for the first time and the criterion that the L-shaped block be positioned below the gripper has not yet been met, as can be seen from the image below the heading "current frame". For the image 206 at time step t, the system 200 uses a point tracker (e.g., an online version of TAPIR) to determine the target points 210 from the relevant points 208. The system 200 then selects the future image 214 as the image whose corresponding relevant points in the point tracking data 212 are most similar in location to the target points 210, and takes the relevant points of the future image 214 as the target predicted positions 216.
The system 200 then uses the visual servoing controller to process the target point 210 and the target predicted position 216 to generate an action 218 for the robotic agent.
A visual servoing controller generally refers to a control system that uses vision data (e.g., images from camera sensors) to control the actions of another system (e.g., a robotic agent) in real time by continuously processing the vision data. For example, in the case of a robotic agent, the visual servoing controller determines a velocity for moving one or more components of the robot such that the target points 210 will move toward the target predicted positions 216; in the case of a robot with a gripper, it determines a velocity for moving the gripper such that the target points 210 move toward the target predicted positions 216.
The system 200 may generally use any suitable visual servoing controller to select the action to be performed by the agent at any given time step. Examples of visual servoing are described in DOI 10.1109/70.954764; in Chen, Hanzhi, et al., "TexPose: Neural texture learning for self-supervised 6D object pose estimation," IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023; and in Hill, John, "Real time control of a robot with a mobile camera," Ninth International Symposium on Industrial Robots, 1979.
FIG. 3 is a flow diagram of an example process 300 for determining relevant points of a task segment. For convenience, process 300 will be described as being performed by a system of one or more computers located in one or more locations. For example, an agent control system (e.g., agent control system 100 of fig. 1) suitably programmed in accordance with the present description may perform process 300.
The system selects one or more points as initial relevant points (step 302).
For example, the system may select the initial relevant points based on at least two criteria: (i) the degree of proximity of the points to one another in the last image of the task segment in each of the demonstration image sequences, according to the point tracking data, and (ii) the degree of stationarity of the points during the task segment, according to the point tracking data.
Criterion (i) refers to selecting points that, across all demonstrations, end at a common image location. That is, the system may compute the positional variance of each point's final position across the demonstrations and select those points whose positional variance is below a threshold, i.e., points having similar final positions across the demonstration image sequences.
For example, for a task segment corresponding to an insertion, i.e., placing an object into a correspondingly shaped hole, the points associated with the object being placed will most likely satisfy criterion (i): across the demonstrations, the object may begin at various locations in the environment but always ends in the shaped hole at the end of the task segment. The points associated with the object therefore also begin at various locations but end near a common location at the end of the task segment.
For criterion (ii), stationarity refers to tracked points whose overall motion during the task segment is below a threshold. That is, a measure of point movement is defined relative to the frame of the image, rather than relative to a third-person view of the scene, and points whose associated motion measure across the demonstrations is above the threshold are selected (i.e., stationary points are excluded).
For example, when the camera is mounted on the robot's end effector for an insertion task, the points associated with the end effector within the image will not be selected, because the end effector does not move within the frame of the image.
As a specific example of determining relevant points from the point tracking data based on criteria (i) and (ii), the system may sequentially select points according to a sequence of parameterized rules corresponding to the criteria. For example, the system may first select the points satisfying a first rule corresponding to criterion (ii), e.g., "select points whose positional variance during the task segment is greater than or equal to a particular parameter value across all demonstrations." Then, evaluating only the points that satisfy criterion (ii), the system may select the points satisfying a second rule corresponding to criterion (i), e.g., "select points whose variance of position in the final frame is less than a parameter value across all demonstrations."
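The two sequential parameterized rules can be sketched with NumPy as follows. The variance thresholds and the exact variance formulas are illustrative assumptions; the specification only requires thresholded variance rules.

```python
import numpy as np

def select_initial_points(tracks, motion_var_min, final_var_max):
    """Apply rules for criteria (ii) then (i) to candidate points.

    tracks: (D, T, P, 2) point locations across D demonstrations,
    T images of the task segment, and P candidate points.

    Criterion (ii) rule: keep points whose positional variance over the
    segment (summed over x/y, averaged over demonstrations) is at least
    motion_var_min, i.e., exclude stationary points.
    Criterion (i) rule: of those, keep points whose variance of
    final-image position across demonstrations is below final_var_max,
    i.e., keep points that end at a common location.
    """
    tracks = np.asarray(tracks, dtype=float)
    motion_var = tracks.var(axis=1).sum(axis=-1).mean(axis=0)  # (P,)
    moving = motion_var >= motion_var_min
    final_var = tracks[:, -1].var(axis=0).sum(axis=-1)         # (P,)
    consistent = final_var < final_var_max
    return np.flatnonzero(moving & consistent).tolist()
```

In the test below, point 0 moves during the segment and ends at the same place in both demonstrations, so it survives both rules, while point 1 never moves and is excluded by the criterion (ii) rule.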
In some implementations, the system also considers other criteria when selecting the initial point.
For example, in addition to (i) and (ii) above, the system may select one or more points based on (iii) whether the one or more points are visible at the last image in the task segment in each of the sequence of presentation images according to the point tracking data.
Criterion (iii) excludes points that are not guaranteed to be visible at the end of a task segment, e.g., due to occlusion, sensor failure, or inconsistencies caused by inaccurate presentations.
For example, for a task segment corresponding to an insertion task, a square block to be moved may have a unique mark on a single face, with a corresponding point that is particularly easy to track while that face is visible. But because the uniquely marked block face may not always be visible (e.g., when the marked face is turned down or away from the camera), the corresponding point will not be selected under criterion (iii).
As a specific example of determining relevant points from the point tracking data based on criteria (i), (ii), and (iii), the system may sequentially select points according to a sequence of parameterized rules corresponding to the criteria. For example, after sequentially selecting points that meet criteria (i) and (ii) according to the parameterized rules as described above, the system may then select points from the remaining set that meet a third rule corresponding to criterion (iii), such as "select points whose average visibility (i.e., average occlusion score) is higher than the parameter value across the entire task segment".
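The sequential rule-based selection described above may be sketched as follows (an illustrative sketch only: the array layout, the threshold parameters, and the particular variance and visibility measures are assumptions, not specified by this description):

```python
import numpy as np

def select_initial_points(tracks, occlusion, var_min, final_var_max, vis_min):
    """Sequentially filter tracked points by criteria (ii), (i), (iii).

    tracks: array [num_demos, num_frames, num_points, 2] of 2D positions.
    occlusion: array [num_demos, num_frames, num_points] of occlusion
        probabilities (0 = fully visible).
    Threshold parameters are illustrative, not from the description.
    """
    # Criterion (ii): keep non-stationary points, i.e. points whose
    # positional variance over the task segment exceeds var_min.
    motion_var = tracks.var(axis=1).sum(axis=-1).mean(axis=0)  # [num_points]
    keep = motion_var >= var_min

    # Criterion (i): keep points whose final positions agree across
    # presentations, i.e. low variance of the last-frame position.
    final_var = tracks[:, -1].var(axis=0).sum(axis=-1)  # [num_points]
    keep &= final_var < final_var_max

    # Criterion (iii): keep points that are, on average, visible
    # (mean visibility = 1 - mean occlusion probability).
    visibility = 1.0 - occlusion.mean(axis=(0, 1))  # [num_points]
    keep &= visibility > vis_min
    return np.flatnonzero(keep)
```

In this sketch, a point survives only if it passes all three rules, mirroring the sequential evaluation described above.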
In some implementations, the system can then use the initial correlation points to generate correlation points for the task segments.
For example, the system may select as the correlation point the initial correlation point that was previously selected to satisfy criteria (i), (ii), and (iii).
The system clusters the plurality of points using the point tracking data to determine a plurality of clusters (step 304). The plurality of points refers to all tracking points, not just those points designated as initial correlation points from step 302.
The system may use any of a variety of methods to cluster multiple points.
For example, the system may use a "3D motion estimation and re-projection" approach to clustering. That is, the system assumes that all points belong to one of several approximately rigid objects in the scene, so that their motion can be explained by a set of 3D motions followed by a re-projection (i.e., a projection of each point's 3D position to a 2D position). The system parameterizes the 3D position of each point and a 3D transformation per object for each image in each presentation, and minimizes a re-projection error function to determine how many clusters there are and to which cluster each point belongs. For a gripper performing an insertion task, the re-projection error function may be, for example:
$$\mathcal{L}_{\text{reproj}} = \sum_{i,t} \left(1 - o_{i,t}\right) \min_{k \in \{1,\dots,K\}} \big\| R\!\left(T_{t,k}\, x_{i,k}\right) - p_{i,t} \big\|$$

where $\hat p_{i,t} = R(T_{t,k}\, x_{i,k})$ is the predicted position of point $i$ at time $t$ in the presentation (for simplicity, $t$ indexes both time and presentation), $p_{i,t}$ is the tracked 2D position of point $i$, $o_{i,t}$ is the probability of occlusion, $K$ refers to the number of rigid objects in the scene, $x_{i,k}$ is the 3D position of the $i$-th point in the $k$-th object, $T_{t,k}$ is a rigid 3D transformation of each object at each time, and $R(x) = [\,x[0]/x[2],\; x[1]/x[2]\,]$ is the re-projection function that projects a 3D point onto the 2D plane, where $x[0], x[1], x[2]$ are the $x$, $y$, $z$ coordinates in 3D and $x[0]/x[2], x[1]/x[2]$ are the normalized $x$, $y$ coordinates in the 2D plane.
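The re-projection function and an occlusion-weighted re-projection error of this general form can be sketched as follows (the array shapes are illustrative assumptions, and the sketch evaluates a single cluster assignment rather than the minimum over clusters):

```python
import numpy as np

def reproject(x):
    """R(x) = [x[0]/x[2], x[1]/x[2]]: pinhole projection of a 3D point."""
    return np.array([x[0] / x[2], x[1] / x[2]])

def reprojection_error(points_3d, transforms, tracks_2d, visible):
    """Occlusion-weighted re-projection error for one cluster assignment.

    points_3d: [num_points, 3]     canonical 3D point positions x_i.
    transforms: [num_frames, 3, 4] rigid transform (R|t) of the object per frame.
    tracks_2d: [num_frames, num_points, 2] tracked 2D positions p_{i,t}.
    visible: [num_frames, num_points] weights (1 - occlusion probability).
    """
    total = 0.0
    for t, T in enumerate(transforms):
        R, trans = T[:, :3], T[:, 3]
        for i, x in enumerate(points_3d):
            # Rigidly transform the 3D point, then project it to 2D.
            pred = reproject(R @ x + trans)
            total += visible[t, i] * np.sum((pred - tracks_2d[t, i]) ** 2)
    return total
```

Minimizing a loss of this shape over the 3D positions and transforms, with the additional minimum over cluster assignments, yields the clustering described above.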
With respect to the previous example, both $x_{i,k}$ and $T_{t,k}$ may be parameterized using neural networks that aim to capture a generic prior, namely that points that are nearby in 2D space, and frames that are nearby in time, should have similar 3D configurations. In particular, $x_{i,k} = X_\theta(d_i)_k$, where $X_\theta$ is a neural network with parameters $\theta$ that outputs a matrix of 3D positions given a descriptor $d_i$ of the $i$-th point, and $T_{t,k} = T_\phi(e_t)_k$, where $e_t$ is a time-smoothed learned descriptor of image $t$ and $\phi$ are the parameters of a neural network $T_\phi$ that outputs a tensor representing the rigid transformations.
Furthermore, for the previous example, the optimal number of rigid objects may be determined by a 'recursive splitting' method. To achieve this, note that only the two neural networks $X_\theta$ and $T_\phi$ depend on the number of clusters $k$, so their cluster-dependent parameters can be written as a weight matrix $W \in \mathbb{R}^{k \times c}$ for some number of channels $c$. For each such weight matrix, the system creates two new weight matrices $W_1^j$ and $W_2^j$ whose $j$-th rows parameterize new clusters, such that the $j$-th row of $W$ is split into two different clusters; this is called a 'bifurcation' of the original weight matrix. The system calculates the loss for each possible split and optimizes the split with the least loss. Mathematically, $\bar W^j$ denotes a new matrix in which the $j$-th row of $W$ has been removed and the $j$-th rows of $W_1^j$ and $W_2^j$ have been appended. The system can use $\bar W^j$ to compute two new sets of 3D positions and 3D transformations $x^j_{i,k}$ and $T^j_{t,k}$. The following loss is then minimized:

$$\min_{j} \; \mathcal{L}_{\text{reproj}}\big(x^{j}, T^{j}\big)$$

Here, $\mathcal{L}_{\text{reproj}}$ is the re-projection error function of the previous example, and $x^{j}$ and $T^{j}$ are parameterized by neural networks whose parameters include the two 'bifurcation' variables $W_1^j$ and $W_2^j$. After multiple (e.g., fifty, one hundred, five hundred, or more generally, several hundred) optimization steps, the system replaces $W$ with the best $\bar W^j$ and creates a new 'bifurcation' initialized from $\bar W^j$. The system starts the recursive bifurcation procedure from a single cluster and repeats it until the desired number of objects is reached or the loss no longer decreases.
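The 'bifurcation' of a cluster weight matrix can be sketched as follows (the perturbation-based initialization and the noise scale are illustrative assumptions; in practice the two new rows would be optimized further):

```python
import numpy as np

def bifurcate(W, j, noise=1e-3, rng=None):
    """Split row j of a [k, c] cluster weight matrix into two rows.

    Returns a [k+1, c] matrix: row j is removed, and two slightly
    perturbed copies of it are appended, so the cluster it
    parameterized is split into two candidate clusters.
    """
    rng = np.random.default_rng(0) if rng is None else rng
    kept = np.delete(W, j, axis=0)
    # Initialize the two new clusters near the original row.
    w1 = W[j] + noise * rng.standard_normal(W.shape[1])
    w2 = W[j] - noise * rng.standard_normal(W.shape[1])
    return np.vstack([kept, w1, w2])
```

Trying each candidate row $j$, measuring the resulting re-projection loss, and keeping the best split reproduces one step of the recursive procedure described above.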
The system uses the clusters and the initial correlation points to select correlation points (step 306).
For example, given the clusters, each initial relevant point may vote for (i.e., be assigned to) a cluster, and the clusters with the largest numbers of votes are merged repeatedly until one or more criteria are met, e.g., criteria on the number of clusters or the number of points in a cluster. The relevant points may then be selected as those belonging to the clusters that meet one or more criteria, such as the clusters with the largest number of initial points, or the clusters whose initial points undergo the largest movement.
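The voting step described above may be sketched as follows (the data layout, the single-winner selection, and the tie-breaking behavior are illustrative assumptions):

```python
from collections import Counter

def select_relevant_points(point_cluster, initial_points, num_keep=1):
    """Vote-based selection: each initial relevant point votes for its
    cluster; all points in the top-voted cluster(s) become relevant.

    point_cluster: dict mapping point id -> cluster id (for all points).
    initial_points: iterable of point ids selected by criteria (i)-(iii).
    """
    votes = Counter(point_cluster[p] for p in initial_points)
    winning = {c for c, _ in votes.most_common(num_keep)}
    return sorted(p for p, c in point_cluster.items() if c in winning)
```

Note that points that did not themselves satisfy the initial criteria are still selected if they belong to a winning cluster, matching the cluster-level selection described above.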
As another example, the system may perform motion-based object segmentation using the generated motion-based cluster data, select a related object from the segmentation, and select an initial related point on the selected object as a related point. That is, the system may assume that there are k objects and parameterize step 304 so that it produces k clusters corresponding to the k objects. More specifically, when the re-projection error is minimized, a constraint of generating k clusters corresponding to k objects may be implemented by fixing the number of clusters to k, or by merging or splitting clusters until k clusters are obtained when voting and merging clusters using an initial point.
The system may use one or more criteria to select a related object from k clusters, such as a cluster with the most points that also satisfy criteria (i), (ii), or (iii) above, or any combination of these criteria.
The system may select a point on the selected related object being manipulated as a plurality of related points, e.g., a point on the related object that satisfies criteria (i), (ii), (iii) and has an occlusion score exceeding a particular threshold.
FIG. 4 shows an example image sequence 400 depicting an example process for determining relevant points for a task segment involving a robot having a gripper with a camera mounted thereon. More specifically, each image in the example image sequence 400 illustrates tracking points overlaid onto the last image of a task segment that moves the gripper onto a cylindrical block.
An image 402 labeled 'input' illustrates all tracking points throughout the task segment. An image 404 labeled 'low cross-presentation variance' illustrates tracking points ending at a common image location across all presentations, an image 406 labeled 'non-stationary' illustrates tracking points whose overall motion during the task segment is greater than a threshold, and an image 408 labeled 'motion cluster' illustrates tracking points clustered according to the objects in the image to which they belong, using the "3D motion estimation and re-projection" method described previously for K objects. The tracking points present in images 404-408 are those tracking points present in image 402 that meet the respective criteria for images 404-408. An image 410 labeled 'output' illustrates the determined relevant points as the intersection of the tracking points present in images 404-408.
Fig. 5 is a flow chart of an example process 500 for controlling an agent using an agent control system. For convenience, process 500 will be described as being performed by a system of one or more computers located in one or more locations. For example, an agent control system (e.g., agent control system 100 of fig. 1) suitably programmed in accordance with the present description may perform process 500.
The system obtains a plurality of presentation image sequences, each presentation image sequence being an image sequence of a respective instance of an agent performing a task (step 502). For example, the system may obtain a sequence of presentation images from a camera mounted directly on the agent (e.g., a robot) to provide a first person view of the task, or if the mounted camera is a 360 degree camera, a complete view of the agent's surroundings during the task.
As another example, the system may obtain a sequence of presentation images from a camera mounted on an end effector (e.g., gripper) of the robotic agent, thereby providing an end effector view of the task.
In some cases, the system may obtain a sequence of presentation images from a camera associated with the environment instead of the agent. For example, the system may obtain images from an overhead camera, providing a bird's-eye view of the task.
As another example, the system may obtain a sequence of presentation images that provides a third-person view of the task. For example, the system may obtain a sequence of presentation images from an environmentally mounted (e.g., wall-mounted or tripod-mounted) camera, thereby providing a fixed third-person view of the task.
The system generates data that divides the task into a plurality of task segments, each task segment including a respective portion of each of the sequence of presentation images (step 504).
The system may use any of a variety of methods to generate the task segments. Generally, these methods involve aligning multiple presentation image sequences, i.e., synchronizing the presentation image sequences such that corresponding events or actions across the sequences occur at matching or nearly matching images (e.g., at matching or nearly matching time or frame numbers), and then segmenting each of the presentation image sequences into an equal number of task segments.
As an example of aligning the presentation, the system may align the presentation using a neural network trained using a self-supervised representation learning method. An example of such a method is described in ArXiv: 1904.07846.
As another example, the system may use events or actions to align the presentation, such as aligning according to robot pose information (e.g., gripper actions and forces).
As an example of segmenting a presentation, once the system has aligned the presentation, the system may create a task segment from a fixed number of images. That is, each of the presentations may be divided into task segments such that each task segment contains an equal number of images.
As another example, the system may segment the presentation using scene information contained in the image. That is, one or more neural networks that learn the embedded representation of the sequence of presentation images may be used to process the presentation and generate task segments from significant changes in the scene.
As another example, the system may use events or actions to create task segments, such as segmenting based on robot pose information (e.g., gripper actions and forces). That is, the system may divide each sequence of presentation images into portions based on one or both of the position of a designated component of the robot and the force applied to the robot.
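The simplest of the segmentation approaches above, dividing already-aligned presentations into an equal number of task segments, can be sketched as follows (a minimal sketch; the list-of-lists data layout is an assumption):

```python
def segment_aligned_demos(demos, num_segments):
    """Split each aligned presentation into num_segments equal parts.

    demos: list of image sequences (lists), assumed already aligned so
    that corresponding events fall at matching frame indices.
    Returns a list of task segments, each holding the corresponding
    portion of every presentation.
    """
    segments = []
    for s in range(num_segments):
        segment = []
        for demo in demos:
            n = len(demo)
            start = s * n // num_segments
            end = (s + 1) * n // num_segments
            segment.append(demo[start:end])
        segments.append(segment)
    return segments
```

Event-based or scene-based segmentation would replace the fixed index arithmetic with boundaries derived from robot pose or learned embeddings, but produces the same segment structure.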
For each task segment, the system applies a point tracker to each of a plurality of points in the image in the corresponding portion of the image sequence to generate point tracking data for the task segment (step 506).
The system may use any of a variety of point tracking methods as the point tracker. For example, the system may use a keypoint-based method, i.e., a method that involves defining a small set of unique "keypoints" for an object class and identifying the keypoints in each image, for tracking random points that happen to coincide with keypoints. Examples of such methods are described in ArXiv: 2112.04910 and ArXiv: 1806.08756.
As another example, the system may track points using an optical flow method (i.e., a method that involves tracking points based on changes in pixel intensity between two consecutive images).
As another example, the point tracker may be a neural network-based point tracker that, for each of a plurality of points, may extract query features of the points, generate respective initial point tracking predictions of the points in each of the images in the task segment using the query features of the points and respective visual features of each of a plurality of spatial locations in the images in the task segment, and refine the respective initial point tracking predictions using a time refinement sub-network to generate the point tracking data.
Examples of neural network based point trackers are TAPNet described in ArXiv:2211.03726, TAPIR described in ArXiv: 2306.08637, and BootsTap described in ArXiv: 2402.00847.
Using any suitable point tracker, the system may use the point tracker to generate point tracking data for each task segment, the point tracking data including, for each tracking point and for each image in the task segment, (i) a predicted location of the point in the image and (ii) a predicted occlusion score for the point that indicates a likelihood that the point is occluded in the image.
The system uses the point tracking data for each task segment to determine a plurality of relevant points for the task segment (step 508). For example, the system may determine the relevant points, as described with reference to fig. 3.
The system receives a request to execute a new instance of a task (step 510). For example, the system may receive a request from a user or from an external system.
The system uses (i) the image captured while the agent is executing the new instance of the task and (ii) the relevant points of the task segment to control the agent to execute the new instance of the task (step 512).
For example, at each time step during execution of a new instance of the task, the system may receive an image captured at the time step, identify the current task segment corresponding to the image, determine target points from the relevant points of the current task segment, determine target predicted locations associated with completing the task, and cause the agent to perform an action predicted to move the target points to the target predicted locations. After the action is performed, the system receives and processes a new current image, repeating the previous steps until the task is completed.
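The per-step loop described above may be sketched as follows (all helper interfaces, `is_done`, `target_points`, `target_positions`, and `controller.act`, are hypothetical placeholders, not an API from this description):

```python
def run_task(get_image, controller, segments, max_steps=1000):
    """High-level control loop over an ordered list of task segments.

    Each segment object is assumed to expose is_done(image),
    target_points(image), and target_positions(points).
    Returns True if all segments complete within max_steps.
    """
    seg_idx = 0
    for _ in range(max_steps):
        image = get_image()
        # Advance to the next task segment once the current one terminates.
        if segments[seg_idx].is_done(image):
            seg_idx += 1
            if seg_idx == len(segments):
                return True  # task complete
        targets = segments[seg_idx].target_points(image)
        goals = segments[seg_idx].target_positions(targets)
        # Action predicted to move the target points toward the goals.
        controller.act(targets, goals)
    return False
```

This mirrors the repeat-until-complete structure of process 600 described below, with the segment-termination check performed once per time step.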
Further details of executing a new instance of a task are described below with reference to FIG. 6.
FIG. 6 is a flow chart of an example process 600 for executing a new instance of a task by controlling an agent at each time step in a sequence of time steps during the task. For convenience, process 600 will be described as being performed by a system of one or more computers located in one or more locations. For example, an agent control system (e.g., agent control system 100 of fig. 1) suitably programmed in accordance with the present description may perform process 600.
The system may perform iterations of process 600 at each time step during execution of a new instance of the task. That is, the system may continue to perform iterations of process 600 until a new instance of the task is completed.
The system obtains images of the agent at time steps (step 602). For example, the images may be from the same source as the sequence of presentation images, i.e., a camera associated with the agent performing the task, as previously described with reference to fig. 5.
The system identifies the current task segment of the time step (step 604).
The system may identify the current task segment by determining whether criteria for terminating the task segment of the previous time step have been met. If the corresponding criteria have not been met, the system may set the current task segment to the task segment of the previous time step; if the corresponding criteria have been met, the system may set the current task segment to the next task segment after the task segment of the previous time step.
For example, the system may identify the current task segment by identifying whether criteria (e.g., occurrence of an event or action, e.g., robot pose information, e.g., gripper action and force) for the task segment and the time step value have been met.
For a first time step, the current task segment may be set to a default task segment, such as the first task segment created by the system during the "extraction mode".
In other cases, for the first time step, the system determines the current task segment by processing the first received image and determining with which task segment the image is most likely to be associated.
The system determines one or more target points from the relevant points of the current task segment (step 606). To determine the target point from the relevant point, the system may use the previous point tracker in an online manner, i.e. identify the target point as the tracking point of the relevant point.
For example, to use a neural network based point tracker in an online manner, the neural network based point tracker is modified to be causal. That is, the neural network-based point tracker is modified to determine the target point from the relevant points and only from images associated with the current time step and all previous time steps of the task segment.
As a specific example, the TAPIR point tracker may be used in an online manner by modifying the temporal refinement sub-network of the TAPIR point tracker to apply causal convolutions instead of ordinary temporal convolutions, as described in ArXiv: 2308.15975. More specifically, the temporal refinement of the TAPIR model uses a depthwise convolution module, in which the query point features, x and y locations, occlusion and uncertainty estimates, and score maps for each image are concatenated into a single sequence, and the convolution model outputs updates for the locations and occlusions. The causal variant replaces each depthwise layer in the original model with a causal depthwise convolution, so the resulting model has the same number of parameters as the original TAPIR model, with all hidden layers having the same shape.
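The substitution of causal for ordinary depthwise temporal convolution can be illustrated with a minimal sketch (a NumPy toy, not the TAPIR implementation; left zero-padding is what enforces causality):

```python
import numpy as np

def causal_depthwise_conv(x, kernels):
    """Causal depthwise 1D convolution: the output at time t depends
    only on inputs at times <= t, via left-padding with
    kernel_size - 1 zeros.

    x: [time, channels]; kernels: [kernel_size, channels]
    (one filter per channel, as in a depthwise layer).
    """
    k, c = kernels.shape
    padded = np.concatenate([np.zeros((k - 1, c)), x], axis=0)
    out = np.empty_like(x, dtype=float)
    for t in range(x.shape[0]):
        # Window of inputs ending at the current time step (inclusive).
        out[t] = (padded[t:t + k] * kernels).sum(axis=0)
    return out
```

Because each output depends only on the current and earlier frames, the tracker can be run online, frame by frame, without access to future images.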
The system determines a respective target predicted position in the future image at the future time step for each of the target points from the point tracking data of the current task segment (step 608).
For example, the system may measure the Euclidean distances between the target points and the relevant points for each image of each sequence of presentation images, and select the image with the lowest distance (e.g., the lowest average distance) as the future image. In this case, the target predicted position of each of the target points is the corresponding relevant point in the future image.
In some cases, for the previous example, there is a threshold distance value such that the future image is selected as the image that follows, in the corresponding sequence of presentation images, the image with the lowest Euclidean distance.
Further, in some cases, for the same previous example, the target predicted position of each of the target points is determined as the average, across presentations, of the relevant points associated with the future image.
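The nearest-frame goal selection described above may be sketched as follows (a single-presentation layout and a one-frame lookahead parameter are illustrative assumptions):

```python
import numpy as np

def select_future_image(target_points, demo_relevant_points, lookahead=1):
    """Pick a goal frame: find the presentation frame whose relevant
    points are closest (mean Euclidean distance) to the current target
    points, then look ahead by `lookahead` frames for goal positions.

    demo_relevant_points: [num_frames, num_points, 2] relevant-point
    positions through one presentation; target_points: [num_points, 2].
    """
    dists = np.linalg.norm(
        demo_relevant_points - target_points, axis=-1).mean(axis=-1)
    nearest = int(np.argmin(dists))
    # Look ahead so the goal pulls the agent forward along the task.
    future = min(nearest + lookahead, len(demo_relevant_points) - 1)
    return future, demo_relevant_points[future]
```

Averaging the returned goal positions over several presentations yields the averaged variant mentioned above.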
The system causes the agent to perform an action predicted to move one or more target points to corresponding target predicted locations (step 610). The system may use any of a variety of suitable controllers, i.e., algorithms that process images and points to generate actions for the agent such that the target point becomes more aligned with the target predicted position.
In general, any location-based visual servoing technique (i.e., a technique that includes converting image data into 3D pose data of the real world to determine an action) or image-based visual servoing (i.e., a technique that includes converting image data into image features to determine an action) may be used to cause an agent to perform an action predicted to move one or more target points to corresponding target predicted locations.
For example, for the case where the robotic arm agent moves a gripped block to perform an insertion task, the action corresponding to the movement speed of the arm may be determined by a controller that calculates a Jacobian matrix (i.e., an estimate of how the positions of the target points change with the movement speed of the robotic arm) and uses the Jacobian matrix to calculate an action corresponding to the movement speed that minimizes the squared error between the target points and the target predicted positions under a linear approximation.
As a more specific example, for the case where the robotic arm agent moves a gripped block to perform an insertion task, given a set of target predicted positions $p^{*}$ and corresponding target points $p_t$, the controller computes the action that minimizes the error under a linear approximation of the function mapping actions to changes in $p_t$. The system then causes the agent to perform the computed action. The process can be summarized as:

$$v_t = \arg\min_{v} \big\| J\,v - (p^{*} - p_t) \big\|^2 = J^{+}\,(p^{*} - p_t)$$

where $t$ denotes the time step, $v_t$ is the gripper velocity (i.e., the action), $J$ is the image Jacobian matrix, and $J^{+}$ is its pseudoinverse.
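The least-squares action computation may be sketched as follows (a minimal sketch using the Moore-Penrose pseudoinverse; the `gain` parameter is an assumption, not part of this description):

```python
import numpy as np

def servo_action(jacobian, target_points, goal_points, gain=1.0):
    """Least-squares velocity command: the action v minimizing
    || J v - (goal - target) ||^2, i.e. v = gain * J^+ (goal - target).

    jacobian: [2 * num_points, dof] image Jacobian; the point arrays
    are [num_points, 2].
    """
    # Stack per-point 2D errors into a single error vector.
    error = (goal_points - target_points).reshape(-1)
    return gain * np.linalg.pinv(jacobian) @ error
```

Executing the returned velocity and re-measuring the target points closes the visual-servoing loop described above.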
Fig. 7 shows an example 700 of tasks performed by a robot with a camera-mounted gripper using the described techniques.
In particular, example 700 illustrates that the described techniques may be used to control a robot to perform any of a variety of tasks, even when relatively few demonstrations of the successfully performed tasks are available.
For example, example 700 shows that the system may perform a "four object stack" with only four presentations, a great improvement over tens to hundreds of presentations that may be required using other methods.
Some examples of environments with which an agent may interact, and of how the agent may be embodied, follow.
In some implementations, the environment is a real-world environment and the agent is a mechanical agent that interacts with the real-world environment. For example, the agent may be a robot that interacts with the environment to accomplish a goal, such as locating an object of interest in the environment, moving an object of interest to a specified location in the environment, physically manipulating an object of interest in the environment in a specified manner, or navigating to a specified destination in the environment, or an autonomous or semi-autonomous land, air, or marine vehicle navigating through the environment to a specified destination in the environment.
The action may be a control input for controlling the robot, e.g. a torque or a higher level control command for a joint of the robot, or may be a control input for controlling an autonomous or semi-autonomous land or air or sea vehicle, e.g. a torque or a higher level control command to a control surface or other control element of the vehicle.
In other words, the actions may include, for example, position, speed or force/torque/acceleration data of one or more joints of the robot or portions of another mechanical agent. The actions may additionally or alternatively include electronic control data, such as motor control data, or more generally data for controlling one or more electronic devices within the environment, the control of which has an effect on the observed state of the environment. For example, in the case of autonomous or semi-autonomous land, air, or marine vehicles, the actions may include actions for controlling navigation (e.g., steering) and movement (e.g., braking and/or acceleration of the vehicle).
In some implementations, the environment is a simulated environment, and the agent is implemented as one or more computer programs that interact with the simulated environment. For example, the environment may be a computer simulation of a real world environment, and the agent may be a simulated mechanical agent navigating through the computer simulation.
For example, the simulated environment may be a motion simulation environment, such as a driving simulation or a flight simulation, and the agent may be a simulated vehicle navigating through the motion simulation. In these implementations, the action may be a control input that controls the simulated user or the simulated vehicle. As another example, the simulated environment may be a computer simulation of a real world environment, and the agent may be a simulated robot interacting with the computer simulation.
In general, when the environment is a simulated environment, the actions may include simulated versions of one or more of the actions or action types previously described.
While the present description generally describes the input as an image, in some cases the input may include additional data in addition to or in lieu of image data, such as proprioceptive data and/or force data characterizing the agent or other data captured by other sensors of the agent.
The term "configured" is used in this specification in connection with systems and computer program components. For a system of one or more computers to be configured to perform particular operations or actions, it is meant that the system has installed thereon software, firmware, hardware, or a combination thereof that, in operation, causes the system to perform the operations or actions. For one or more computer programs configured to perform a particular operation or action, it is meant that the one or more programs include instructions that, when executed by a data processing apparatus, cause the apparatus to perform the operation or action.
Embodiments of the subject matter and the functional operations described in this specification can be implemented in digital electronic circuitry, in tangibly embodied computer software or firmware, in computer hardware (including the structures disclosed in this specification and their structural equivalents), or in combinations of one or more of them. Embodiments of the subject matter described in this specification can be implemented as one or more computer programs, i.e., one or more modules of computer program instructions encoded on a tangible, non-transitory storage medium, for execution by, or to control the operation of, data processing apparatus. The computer storage medium may be a machine-readable storage device, a machine-readable storage substrate, a random or serial access memory device, or a combination of one or more of them. Alternatively or additionally, the program instructions may be encoded on an artificially generated propagated signal (e.g., a machine-generated electrical, optical, or electromagnetic signal) that is generated to encode information for transmission to suitable receiver apparatus for execution by data processing apparatus.
The term "data processing apparatus" refers to data processing hardware and encompasses all kinds of apparatus, devices, and machines for processing data, including by way of example a programmable processor, a computer, or multiple processors or computers. The device may also be or further comprise dedicated logic circuitry, e.g. an FPGA (field programmable gate array) or an ASIC (application specific integrated circuit). In addition to hardware, the apparatus may optionally include code that creates an execution environment for the computer program, e.g., code that constitutes processor firmware, a protocol stack, a database management system, an operating system, or a combination of one or more of them.
A computer program (which may also be referred to or described as a program, software application, app, module, software module, script, or code) can be written in any form of programming language, including compiled or interpreted languages, or declarative or procedural languages, and it can be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment. The program may, but need not, correspond to a file in a file system. A program can be stored in a portion of a file that holds other programs or data (e.g., one or more scripts stored in a markup language document), in a single file dedicated to the program in question, or in multiple coordinated files (e.g., files that store one or more modules, sub-programs, or portions of code). A computer program can be deployed to be executed on one computer or on multiple computers at one site or distributed across multiple sites and interconnected by a data communication network.
In this specification, the term "database" is used broadly to refer to any collection of data that need not be structured in any particular way, or structured at all, and that may be stored on storage in one or more locations. Thus, for example, an index database may include multiple data sets, each of which may be organized and accessed differently.
Similarly, in this specification, the term "engine" is used broadly to refer to a software-based system, subsystem, or process that is programmed to perform one or more particular functions. Typically, the engine will be implemented as one or more software modules or components installed on one or more computers in one or more locations. In some cases one or more computers will be dedicated to a particular engine, in other cases multiple engines may be installed and run on the same computer or computers.
The processes and logic flows described in this specification can be performed by one or more programmable computers executing one or more computer programs to perform functions by operating on input data and generating output. The processes and logic flows can also be performed by, or in combination with, special purpose logic circuitry (e.g., an FPGA or ASIC) and one or more programmed computers.
A computer suitable for executing a computer program may be based on a general-purpose or special-purpose microprocessor or both, or any other kind of central processing unit. Typically, the central processing unit will receive instructions and data from a read only memory or a random access memory or both. The essential elements of a computer are a central processing unit for performing or executing instructions and one or more memory devices for storing instructions and data. The central processing unit and the memory may be supplemented by, or incorporated in, special purpose logic circuitry. Generally, a computer will also include, or be operatively coupled to receive data from or transfer data to, or both, one or more mass storage devices for storing data, e.g., magnetic, magneto-optical disks, or optical disks. However, a computer need not have such devices. In addition, the computer may be embedded in another device, such as a mobile phone, a Personal Digital Assistant (PDA), a mobile audio or video player, a game console, a Global Positioning System (GPS) receiver, or a portable storage device, such as a Universal Serial Bus (USB) flash drive, to name a few.
Computer readable media suitable for storing computer program instructions and data include all forms of non-volatile memory, media and memory devices including, for example, semiconductor memory devices (e.g., EPROM, EEPROM, and flash memory devices), magnetic disks (e.g., an internal hard disk or removable disk), magneto-optical disks, and CD ROM and DVD-ROM disks.
To provide for interaction with a user, embodiments of the subject matter described in this specification can be implemented on a computer having a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to the user and a keyboard and a pointing device (e.g., a mouse or a trackball) by which the user can provide input to the computer. Other types of devices may also be used to provide for interaction with the user; for example, feedback provided to the user may be any form of sensory feedback, e.g., visual feedback, auditory feedback, or tactile feedback, and input from the user may be received in any form, including acoustic, speech, or tactile input. Further, the computer may interact with the user by sending documents to and receiving documents from a device used by the user, e.g., by sending web pages to a web browser on the user's device in response to requests received from the web browser. In addition, the computer may interact with the user by sending a text message or other form of message to a personal device (e.g., a smartphone running a messaging application) and, in response, receiving a response message from the user.
The data processing apparatus for implementing the machine learning model may also include, for example, dedicated hardware accelerator units for processing the common and computationally intensive portions of machine learning training or production (i.e., inference) workloads.
The machine learning model may be implemented and deployed using a machine learning framework (e.g., a TensorFlow framework or a JAX framework).
Embodiments of the subject matter described in this specification can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a client computer having a graphical user interface, a web browser, or an app through which a user can interact with an implementation of the subject matter described in this specification), or any combination of one or more such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include a Local Area Network (LAN) and a Wide Area Network (WAN), such as the internet.
The computing system may include clients and servers. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs executing on the respective computers and having a client-server relationship to each other. In some embodiments, the server transmits data (e.g., HTML pages) to the user device, e.g., for the purpose of displaying data to and receiving user input from a user interacting with the device acting as a client. Data generated at the user device, e.g., results of the user interaction, may be received at the server from the device.
While this specification contains many specific implementation details, these should not be construed as limitations on the scope of any invention or of what may be claimed, but rather as descriptions of features that may be specific to particular embodiments of particular inventions. Certain features that are described in this specification in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination. Furthermore, although features may be described above as acting in certain combinations and even initially claimed as such, one or more features from a claimed combination can in some cases be excised from the combination, and the claimed combination may be directed to a subcombination or variation of a subcombination.
Similarly, although operations are depicted in the drawings and described in a particular order in the claims, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. In some cases, multitasking and parallel processing may be advantageous. Moreover, the separation of various system modules and components in the embodiments described above should not be understood as requiring such separation in all embodiments, and it should be understood that the described program components and systems can generally be integrated together in a single software product or packaged into multiple software products.
Specific embodiments of the present subject matter have been described. Other embodiments are within the scope of the following claims. For example, the actions recited in the claims can be performed in a different order and still achieve desirable results. As one example, the processes depicted in the accompanying drawings do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some cases, multitasking and parallel processing may be advantageous.
Claims (18)
1. A method performed by one or more computers, the method comprising:
obtaining a plurality of sequences of presentation images, each sequence of presentation images being a sequence of images of a respective instance of an agent performing a task;
generating data dividing the task into a plurality of task segments, each task segment comprising a respective portion of each of the sequences of presentation images;
for each task segment, applying a point tracker to each of a plurality of points in the images in the respective portion of each sequence of presentation images to generate point tracking data for the task segment;
for each task segment, determining a plurality of relevant points for the task segment using the point tracking data for the task segment;
receiving a request to perform a new instance of the task; and
controlling the agent to perform the new instance of the task using (i) images captured while the agent is performing the new instance of the task and (ii) the relevant points of the task segments.
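The pipeline recited in claim 1 can be illustrated with a short, hypothetical sketch: split a demonstration into segments, run a point tracker over each segment, and keep points that behave as "relevant" (here, approximated as points that barely move during the segment). The class names, the stubbed tracker, and the motion threshold are illustrative assumptions, not the patented implementation.

```python
# Hypothetical sketch of the claim-1 pipeline. The point tracker is a stub
# that records a fixed position per point; a real tracker would predict
# per-frame positions from the images.
from dataclasses import dataclass, field

@dataclass
class Segment:
    frames: list                                  # images (opaque here)
    tracks: dict = field(default_factory=dict)    # point id -> [(x, y), ...]
    relevant: list = field(default_factory=list)  # ids of relevant points

def segment_task(frames, boundaries):
    """Divide one demonstration into task segments at the given frame indices."""
    cuts = [0] + sorted(boundaries) + [len(frames)]
    return [Segment(frames[a:b]) for a, b in zip(cuts, cuts[1:])]

def track_points(segment, points):
    """Stub point tracker: record each point's (fixed) position in every frame."""
    segment.tracks = {pid: [xy] * len(segment.frames) for pid, xy in points.items()}

def select_relevant(segment, max_motion=1.0):
    """Keep points whose total motion over the segment stays below a threshold."""
    for pid, track in segment.tracks.items():
        motion = sum(abs(a[0] - b[0]) + abs(a[1] - b[1])
                     for a, b in zip(track, track[1:]))
        if motion <= max_motion:
            segment.relevant.append(pid)

frames = [f"img{i}" for i in range(6)]
segments = segment_task(frames, boundaries=[3])
for seg in segments:
    track_points(seg, {0: (10.0, 20.0), 1: (5.0, 5.0)})
    select_relevant(seg)
```

In this toy run, both tracked points are stationary, so both are kept as relevant in each of the two segments.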
2. The method of claim 1, wherein the agent is a robot.
3. The method of claim 2, wherein each image in each sequence of presentation images is captured by a camera of the robot.
4. A method according to claim 2 or claim 3, wherein generating data dividing the task into a plurality of task segments comprises:
dividing each sequence of presentation images into portions based on a position of a designated component of the robot, based on a force applied to the robot, or both.
5. The method of claim 4, wherein the robot has a gripper, and wherein dividing each sequence of presentation images into portions comprises:
dividing each sequence of presentation images into portions based on a force applied to the gripper, based on a position of the gripper, or both.
6. A method according to claim 5 when dependent on claim 3, wherein the camera is positioned on the gripper of the robot.
7. A method according to any preceding claim, wherein the point tracking data for each task segment comprises, for each point of the plurality of points and for each image in the task segment:
(i) a predicted position of the point in the image; and
(ii) a predicted occlusion score for the point indicating a likelihood that the point is occluded in the image.
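A minimal data layout consistent with claim 7 pairs, for each point and each image, a predicted (x, y) position with an occlusion score. The `PointObservation` type, the dictionary layout, and the 0.5 visibility threshold are illustrative assumptions.

```python
# Illustrative per-segment point tracking data: point id -> one observation
# per frame, each carrying a predicted position and an occlusion score.
from typing import NamedTuple

class PointObservation(NamedTuple):
    x: float
    y: float
    occlusion: float  # likelihood the point is occluded in this frame

def visible_positions(tracking, point_id, threshold=0.5):
    """Positions of a point in the frames where it is likely visible."""
    return [(o.x, o.y) for o in tracking[point_id] if o.occlusion < threshold]

tracking = {
    7: [PointObservation(3.0, 4.0, 0.1),
        PointObservation(3.5, 4.2, 0.9),   # likely occluded in this frame
        PointObservation(4.0, 4.5, 0.2)],
}
```

Filtering on the occlusion score lets downstream steps (e.g., relevant-point selection) ignore frames where a point was not reliably observed.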
8. The method of any preceding claim, wherein controlling the agent to perform the new instance of the task using (i) an image captured while the agent is performing the new instance of the task and (ii) the relevant points of the task segment comprises, at each of a plurality of time steps:
obtaining an image of the agent at the time step;
identifying a current task segment for the time step;
determining one or more target points from the relevant points of the current task segment;
determining, from the point tracking data of the current task segment, a respective target predicted position in a future image for each of the target points at a future time step; and
causing the agent to perform an action predicted to move the one or more target points to the corresponding target predicted positions.
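One way to picture the per-time-step control loop of claim 8 is as goal-chasing in image space: read each target point's position a few demonstration frames ahead and act to close the gap. The function name, the lookahead parameter, and the displacement-style "action" are assumptions for illustration only; the claim does not fix a particular action space.

```python
# Sketch of one control step: compare live point positions against their
# positions a few frames ahead in the demonstration's point tracks, and
# return a per-point displacement toward those goal positions.
def control_step(current_positions, demo_tracks, demo_index, lookahead=3):
    """Return a displacement toward each target point's future demo position."""
    track_len = len(next(iter(demo_tracks.values())))
    goal_index = min(demo_index + lookahead, track_len - 1)  # clamp at demo end
    action = {}
    for pid, (x, y) in current_positions.items():
        gx, gy = demo_tracks[pid][goal_index]
        action[pid] = (gx - x, gy - y)  # move each target point toward its goal
    return action

demo_tracks = {0: [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0), (4.0, 0.0)]}
act = control_step({0: (0.5, 0.0)}, demo_tracks, demo_index=0)
```

With the point at (0.5, 0.0) and the goal three demo frames ahead at (3.0, 0.0), the sketch returns the displacement (2.5, 0.0) for point 0.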
9. The method of claim 8, wherein determining, from the point tracking data, a respective target predicted position in a future image of each of the target points at a future time step comprises, at a first time step:
applying the point tracker to the image of the agent at the first time step to generate a predicted position for each of the one or more target points in the image;
selecting a target image from a particular portion of a particular sequence of presentation images in the current task segment based on the predicted positions of the one or more target points in the image; and
selecting, as the respective target predicted positions for the one or more target points, predicted positions from the point tracking data for the one or more target points in a subsequent image that follows the target image in the particular sequence of presentation images.
10. The method of claim 9, wherein determining, from the point tracking data, a respective target predicted position in a future image of each of the target points at a future time step comprises, at a second time step:
applying the point tracker to the image of the agent at the second time step to generate a predicted position for each of the one or more target points in the image; and
determining, based at least on a distance between the predicted positions of the one or more target points in the image and the predicted positions from the point tracking data for the one or more target points in the subsequent image, whether to update the target image to the subsequent image and to select a new target predicted position for each of the target points.
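Claims 9 and 10 describe advancing a "target image" through the demonstration once the live points get close enough to the demo's tracked positions. A hedged sketch of that distance-based update, with an illustrative tolerance and function name:

```python
# Advance the target image index when every live target point is within
# `tol` of its tracked position at the current target image; otherwise
# keep chasing the same target image.
import math

def maybe_advance(target_idx, live_points, demo_tracks, tol=0.5):
    """Return the (possibly advanced) target image index."""
    track_len = len(next(iter(demo_tracks.values())))
    if target_idx + 1 >= track_len:
        return target_idx  # already at the last demo frame
    close = all(
        math.dist(live_points[pid], demo_tracks[pid][target_idx]) <= tol
        for pid in live_points
    )
    return target_idx + 1 if close else target_idx
```

A point at (0.1, 0.0) is within tolerance of a demo position at (0.0, 0.0), so the target advances; a point at (5.0, 0.0) is not, so the target stays.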
11. The method of any preceding claim, wherein determining, for each task segment, the plurality of relevant points for the task segment using the point tracking data for the task segment comprises:
determining, using the point tracking data, a relevant object being manipulated during the task segment; and
selecting points on the relevant object being manipulated as the plurality of relevant points.
12. The method of any preceding claim, wherein determining, for each task segment, the plurality of relevant points for the task segment using the point tracking data for the task segment comprises:
selecting one or more points as initial relevant points based at least on (i) a proximity of the one or more points to each other at a last image of the task segment in each of the sequences of presentation images according to the point tracking data and (ii) a degree of stationarity of the one or more points during the task segment according to the point tracking data; and
generating the relevant points using the initial relevant points.
13. The method of claim 12, wherein selecting the one or more points as initial relevant points comprises selecting the one or more points based at least on (i) a proximity of the one or more points to each other at a last image of the task segment in each of the sequences of presentation images according to the point tracking data, (ii) a degree of stationarity of the one or more points during the task segment according to the point tracking data, and (iii) whether the one or more points are visible at the last image of the task segment in each of the sequences of presentation images according to the point tracking data.
14. The method of claim 12 or 13, wherein generating the relevant points using the initial relevant points comprises:
clustering the plurality of points using the point tracking data to determine a plurality of clusters; and
selecting the relevant points using the clusters and the initial relevant points.
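Claims 12-14 combine last-frame proximity, stationarity, and clustering to pick relevant points. The sketch below approximates this with simple thresholds and grid-cell clustering; every threshold, cell size, and function name is an illustrative assumption rather than the claimed criteria.

```python
# Stage 1 (claims 12-13): seed "initial relevant points" from points that
# end the segment near at least one other point and barely move overall.
def initial_relevant(tracks, prox_radius=2.0, max_motion=1.0):
    """Points that end close to another point and are nearly stationary."""
    last = {pid: t[-1] for pid, t in tracks.items()}
    chosen = []
    for pid, (x, y) in last.items():
        near = sum(1 for q, (qx, qy) in last.items()
                   if q != pid and abs(qx - x) + abs(qy - y) <= prox_radius)
        motion = sum(abs(a[0] - b[0]) + abs(a[1] - b[1])
                     for a, b in zip(tracks[pid], tracks[pid][1:]))
        if near >= 1 and motion <= max_motion:
            chosen.append(pid)
    return chosen

# Stage 2 (claim 14): cluster last-frame positions into grid cells and keep
# every point that shares a cell with a seed point.
def expand_by_cluster(tracks, seeds, cell=4.0):
    """Grid-cluster last-frame positions; keep points sharing a cell with a seed."""
    cell_of = {pid: (int(t[-1][0] // cell), int(t[-1][1] // cell))
               for pid, t in tracks.items()}
    seed_cells = {cell_of[pid] for pid in seeds}
    return sorted(pid for pid, c in cell_of.items() if c in seed_cells)

tracks = {0: [(0.0, 0.0), (0.0, 0.0)],
          1: [(1.0, 0.0), (1.0, 0.0)],
          2: [(10.0, 10.0), (20.0, 20.0)]}   # fast-moving, far away
seeds = initial_relevant(tracks)
relevant = expand_by_cluster(tracks, seeds)
```

Here points 0 and 1 end close together and never move, so they seed the selection; the fast-moving, isolated point 2 is excluded at both stages.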
15. The method of any preceding claim, wherein the point tracker is a neural network-based point tracker configured to, for each of the plurality of points:
extracting a query feature for the point;
generating a respective initial point tracking prediction for the point in each of the images in the task segment using the query feature for the point and a respective visual feature for each of a plurality of spatial locations in the images in the task segment; and
refining the respective initial point tracking predictions using a temporal refinement sub-network to generate the point tracking data.
16. The method of claim 15, wherein the temporal refinement sub-network applies causal temporal convolution to refine the respective initial point tracking predictions for the points.
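The defining property of the causal temporal convolution in claim 16 is that each refined prediction depends only on the current and earlier frames, never on future ones. A minimal one-dimensional sketch of that property (the kernel values are arbitrary illustrations, not learned weights):

```python
# Causal 1-D convolution: y[t] = sum_k kernel[k] * x[t - k], with
# out-of-range past samples treated as zero. No future frame is read.
def causal_conv1d(signal, kernel):
    """Refine a per-frame scalar sequence using only current and past values."""
    out = []
    for t in range(len(signal)):
        acc = 0.0
        for k, w in enumerate(kernel):
            if t - k >= 0:
                acc += w * signal[t - k]
        out.append(acc)
    return out

# Smoothing a toy per-frame coordinate trace with a two-tap causal kernel.
smoothed = causal_conv1d([0.0, 1.0, 2.0, 3.0], kernel=[0.5, 0.5])
```

Because the sum only reaches backward in time, this refinement can run online while the agent acts, which is the practical appeal of a causal (rather than bidirectional) temporal sub-network.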
17. A system comprising one or more computers and one or more storage devices storing instructions that, when executed by the one or more computers, cause the one or more computers to perform the operations of the respective method of any one of claims 1 to 16.
18. One or more computer storage media storing instructions that, when executed by one or more computers, cause the one or more computers to perform the operations of the respective method of any one of claims 1 to 16.
Applications Claiming Priority (3)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US202363535568P | 2023-08-30 | 2023-08-30 | |
| US 63/535,568 | 2023-08-30 | | |
| PCT/EP2024/074176 (WO2025046003A1) | 2023-08-30 | 2024-08-29 | Controlling agents by tracking points in images |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| CN121419862A | 2026-01-27 |
Family
ID=92632857
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202480043748.1A | Controlling an agent by tracking points in an image | 2023-08-30 | 2024-08-29 |
Country Status (2)
| Country | Link |
|---|---|
| CN (1) | CN121419862A (en) |
| WO (1) | WO2025046003A1 (en) |
Family Cites Families (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CA2928645C | 2013-10-25 | 2021-10-26 | Aleksandar VAKANSKI | Image-based robot trajectory planning approach |
Also Published As
| Publication number | Publication date |
|---|---|
| WO2025046003A1 (en) | 2025-03-06 |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |