
CN112734803B - Single target tracking method, device, equipment and storage medium based on text description - Google Patents


Info

Publication number
CN112734803B
Authority
CN
China
Prior art keywords
feature
visual
text
features
updated
Prior art date
Legal status
Active
Application number
CN202011642602.9A
Other languages
Chinese (zh)
Other versions
CN112734803A (en)
Inventor
张伟
吴爽
陈佳铭
宋然
Current Assignee
Shandong University
Original Assignee
Shandong University
Priority date
Filing date
Publication date
Application filed by Shandong University
Priority to CN202011642602.9A
Publication of CN112734803A
Application granted
Publication of CN112734803B
Legal status: Active (current)
Anticipated expiration

Classifications

    • G06T 7/246: Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G06F 18/253: Fusion techniques of extracted features
    • G06F 40/284: Lexical analysis, e.g. tokenisation or collocates
    • G06N 3/044: Recurrent networks, e.g. Hopfield networks
    • G06N 3/045: Combinations of networks
    • G06N 3/08: Learning methods
    • G06V 10/40: Extraction of image or video features
    • G06T 2207/10016: Video; Image sequence

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Mathematical Physics (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a single target tracking method, device, equipment and storage medium based on text description. The single target tracking method comprises the following steps: evenly dividing a video to be tracked into a plurality of video packets according to a set number of frames; extracting first, second and third text features from the text description; extracting first, second and third visual features from the n-th sampled frame of each video packet; updating the first, second and third text features based on the first, second and third visual features of the n-th sampled frame of each video packet, respectively, to obtain updated first, second and third text features; extracting fourth, fifth and sixth visual features from the template image of the target to be tracked; extracting seventh, eighth and ninth visual features from the search region image; fusing the updated first, second and third text feature vectors with the fourth to ninth visual features, respectively, to obtain fused features; and obtaining, according to the fused features, the target tracking result of each frame in the current video packet of the video to be tracked.

Description

Single target tracking method, device, equipment and storage medium based on text description

Technical Field

The present application relates to the technical fields of machine vision and natural language processing, and in particular to a single target tracking method, device, equipment and storage medium based on text description.

Background

The statements in this section merely provide background information related to the present application and do not necessarily constitute prior art.

Single target tracking is a classic and long-studied topic in the field of machine vision. Traditional single target tracking methods usually rely on a manually annotated box around the target to be tracked in a frame of the video. In recent years, topics that combine machine vision and natural language processing, such as image/video captioning and visual question answering, have made great progress, and single target tracking based on text descriptions has also received increasing attention. Given a text annotation, tracking the target described by the text enables the algorithm to better handle many complex scenarios, such as occlusion, bounding-box drift, target deformation and blurring, because the semantic information provided by the natural language description helps the tracking algorithm mitigate the impact of these complex scenes.

However, single target tracking based on text descriptions poses a special problem. Natural language can describe the appearance and motion state of the target in the first frame, or describe the motion of the target over the entire video, but annotating every frame of the video with text is not feasible. In the commonly used single target tracking datasets with natural language annotations, the text annotation usually describes the overall content of the video, and no dataset annotates every frame. Since the position and appearance of the target change continuously throughout the video, in most scenarios the natural language annotation cannot accurately describe the position or motion of the target. Although previous related works have achieved good performance, they merely treat the text annotation as a global constraint.

Summary of the Invention

In order to overcome the deficiencies of the prior art, the present application provides a single target tracking method, device, equipment and storage medium based on text description.

In a first aspect, the present application provides a single target visual tracking method based on text description.

The single target visual tracking method based on text description includes:

obtaining a template image of the target to be tracked; obtaining the video to be tracked and a text description related to the target to be tracked; evenly dividing the video to be tracked into several video packets according to a set number of frames;

extracting first, second and third text features from the text description;

extracting first, second and third visual features from the n-th sampled frame of each video packet, where n is a positive integer and the upper limit of n is a specified value; updating the first, second and third text features based on the first, second and third visual features of the n-th sampled frame of each video packet, respectively, to obtain updated first, second and third text features; extracting fourth, fifth and sixth visual features from the template image of the target to be tracked, where the template image of the target to be tracked refers to the first frame of the video to be tracked; extracting seventh, eighth and ninth visual features from the search region image, where the search region image refers to all images in the current video packet;

fusing the updated first, second and third text feature vectors with the fourth, fifth, sixth, seventh, eighth and ninth visual features, respectively, to obtain six fused features;

obtaining, according to the six fused features, the target tracking result of each frame in the current video packet of the video to be tracked.

In a second aspect, the present application provides a single target visual tracking device based on text description.

The single target visual tracking device based on text description includes:

a video packet division module configured to: obtain a template image of the target to be tracked; obtain the video to be tracked and a text description related to the target to be tracked; and evenly divide the video to be tracked into several video packets according to a set number of frames;

a text feature extraction module configured to extract first, second and third text features from the text description;

a visual feature extraction module configured to: extract first, second and third visual features from the n-th sampled frame of each video packet, where n is a positive integer and the upper limit of n is a specified value; update the first, second and third text features based on the first, second and third visual features of the n-th sampled frame of each video packet, respectively, to obtain updated first, second and third text features; extract fourth, fifth and sixth visual features from the template image of the target to be tracked, where the template image of the target to be tracked refers to the first frame of the video to be tracked; and extract seventh, eighth and ninth visual features from the search region image, where the search region image refers to all images in the current video packet;

a feature fusion module configured to fuse the updated first, second and third text feature vectors with the fourth, fifth, sixth, seventh, eighth and ninth visual features, respectively, to obtain six fused features;

an output module configured to obtain, according to the six fused features, the target tracking result of each frame in the current video packet of the video to be tracked.

In a third aspect, the present application further provides an electronic device, including one or more processors, one or more memories and one or more computer programs, wherein the processor is connected to the memory, the one or more computer programs are stored in the memory, and when the electronic device runs, the processor executes the one or more computer programs stored in the memory, so that the electronic device performs the method described in the first aspect.

In a fourth aspect, the present application further provides a computer-readable storage medium for storing computer instructions which, when executed by a processor, carry out the method described in the first aspect.

In a fifth aspect, the present application further provides a computer program (product), including a computer program which, when run on one or more processors, implements the method of any one of the foregoing first aspects.

Compared with the prior art, the beneficial effects of the present application are:

It is proposed to update the deep features of the text description with the deep visual features of the search region generated during tracking, so that the deep text features can change as the target in the video changes, improving the accuracy of the single target tracking algorithm.

Advantages of additional aspects of the invention will be set forth in part in the following description, and in part will become apparent from the description or may be learned by practice of the invention.

Brief Description of the Drawings

The accompanying drawings, which constitute a part of the present application, are used to provide a further understanding of the present application; the illustrative embodiments of the present application and their descriptions are used to explain the present application and do not constitute an undue limitation of the present application.

FIG. 1 is a flowchart of the method of the first embodiment;

FIG. 2 is a flowchart of the method of the first embodiment;

FIG. 3(a) to FIG. 3(g) are schematic diagrams of the effect of the first embodiment.

Detailed Description

It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the present application. Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which the present application belongs.

It should be noted that the terminology used here is only for describing specific embodiments and is not intended to limit the exemplary embodiments according to the present application. As used herein, unless the context clearly indicates otherwise, the singular forms are also intended to include the plural forms. In addition, it should be understood that the terms "comprising" and "having" and any variations thereof are intended to cover non-exclusive inclusion; for example, a process, method, system, product or device comprising a series of steps or units is not necessarily limited to those steps or units explicitly listed, but may include other steps or units that are not explicitly listed or that are inherent to such a process, method, product or device.

Where there is no conflict, the embodiments of the present invention and the features in the embodiments may be combined with one another.

Embodiment 1

This embodiment provides a single target visual tracking method based on text description.

The single target visual tracking method based on text description includes:

S101: obtaining a template image of the target to be tracked; obtaining the video to be tracked and a text description related to the target to be tracked; evenly dividing the video to be tracked into several video packets according to a set number of frames;

S102: extracting first, second and third text features from the text description;

S103: extracting first, second and third visual features from the n-th sampled frame of each video packet, where n is a positive integer and the upper limit of n is a specified value;

updating the first, second and third text features based on the first, second and third visual features of the n-th sampled frame of each video packet, respectively, to obtain updated first, second and third text features;

extracting fourth, fifth and sixth visual features from the template image of the target to be tracked, where the template image of the target to be tracked refers to the first frame of the video to be tracked;

extracting seventh, eighth and ninth visual features from the search region image, where the search region image refers to all images in the current video packet;

S104: fusing the updated first, second and third text feature vectors with the fourth, fifth, sixth, seventh, eighth and ninth visual features, respectively, to obtain six fused features;

S105: obtaining, according to the six fused features, the target tracking result of each frame in the current video packet of the video to be tracked.

Exemplarily, the video to be tracked is evenly divided into several video packets according to a set number of frames; for example, a 1000-frame video to be tracked is evenly divided into 10 video packets of 100 frames each; as another example, a 100-frame video to be tracked is evenly divided into 10 video packets of 10 frames each.
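
As an illustration only (not part of the original disclosure), the division into video packets can be sketched as follows; the function name is hypothetical and the concrete frame counts are just the examples quoted above:

```python
# Illustrative sketch: evenly split a video's frame indices into fixed-size packets.
def split_into_packets(num_frames: int, frames_per_packet: int):
    """Return one list of frame indices per video packet."""
    return [list(range(start, min(start + frames_per_packet, num_frames)))
            for start in range(0, num_frames, frames_per_packet)]

packets = split_into_packets(num_frames=1000, frames_per_packet=100)
assert len(packets) == 10 and len(packets[0]) == 100
```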

As one or more embodiments, S102, extracting first, second and third text features from the text description, specifically includes:

using the BERT method to extract the first, second and third text features from the text description.
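
As an illustration only (not part of the original disclosure), the BERT encoding step could be sketched with the Hugging Face transformers library as follows. The checkpoint name, the use of the pooled output and the way a single sentence vector would be shared by the three text-feature branches are assumptions; the 768-to-512 fully connected mapping follows the description given later in the specification:

```python
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")  # checkpoint name is an assumption
bert = BertModel.from_pretrained("bert-base-uncased")
fc = nn.Linear(768, 512)  # maps the 768-d BERT feature to 512-d, as described later

description = "a red car moving along the road"  # illustrative text annotation
tokens = tokenizer(description, return_tensors="pt")
with torch.no_grad():
    sentence_feat = bert(**tokens).pooler_output   # (1, 768) sentence-level feature
text_feature = fc(sentence_feat)                   # (1, 512), later used to initialise an LSTM hidden state
```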

As one or more embodiments, S103, extracting first, second and third visual features from the n-th sampled frame of each video packet, where n is a positive integer and the upper limit of n is a specified value, specifically includes:

using ResNet-50 to extract visual features from the n-th sampled frame of each video packet, where

the convolutional layer Conv2_3 outputs the first visual feature;

the convolutional layer Conv3_4 outputs the second visual feature;

the convolutional layer Conv5_3 outputs the third visual feature.
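
As an illustration only (not part of the original disclosure), multi-level feature extraction with a torchvision ResNet-50 could look roughly as follows; mapping the patent's Conv2_3, Conv3_4 and Conv5_3 outputs onto torchvision's layer1, layer2 and layer4 outputs is an assumption:

```python
import torch
import torchvision

backbone = torchvision.models.resnet50()  # weights omitted for brevity

def extract_features(x: torch.Tensor):
    x = backbone.conv1(x)
    x = backbone.bn1(x)
    x = backbone.relu(x)
    x = backbone.maxpool(x)
    f1 = backbone.layer1(x)                     # assumed counterpart of Conv2_3: first visual feature
    f2 = backbone.layer2(f1)                    # assumed counterpart of Conv3_4: second visual feature
    f3 = backbone.layer4(backbone.layer3(f2))   # assumed counterpart of Conv5_3: third visual feature
    return f1, f2, f3

f1, f2, f3 = extract_features(torch.randn(1, 3, 255, 255))
```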

As one or more embodiments, S103, updating the first, second and third text features based on the first, second and third visual features of the n-th sampled frame of each video packet, respectively, to obtain updated first, second and third text features, specifically includes:

the first visual feature is processed by global average pooling (GAP) to obtain a first sub visual feature; the first text feature is used as the initial hidden state of a first LSTM model; at a set time t, the first sub visual feature is input into the first LSTM model, and the first LSTM model outputs the updated first text feature; in the first LSTM model, the forget gate is used to decide whether the hidden state at the current time should be discarded, and the input gate is used to decide whether the value of the input visual feature should be written;

the second visual feature is processed by global average pooling to obtain a second sub visual feature; the second text feature is used as the initial hidden state of a second LSTM model; at the set time t, the second sub visual feature is input into the second LSTM model, and the second LSTM model outputs the updated second text feature;

the third visual feature is processed by global average pooling to obtain a third sub visual feature; the third text feature is used as the initial hidden state of a third LSTM model; at the set time t, the third sub visual feature is input into the third LSTM model, and the third LSTM model outputs the updated third text feature.
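
As an illustration only (not part of the original disclosure), one of the three parallel update branches could be sketched as follows; the 512-dimensional text feature follows the specification, while the zero-initialised cell state and the class name are assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextFeatureUpdater(nn.Module):
    """One branch: the text feature initialises the hidden state, pooled visual features drive the update."""
    def __init__(self, visual_dim: int, text_dim: int = 512):
        super().__init__()
        self.cell = nn.LSTMCell(input_size=visual_dim, hidden_size=text_dim)

    def forward(self, text_feat: torch.Tensor, visual_map: torch.Tensor) -> torch.Tensor:
        # visual_map: (B, C, H, W) -> global average pooling -> (B, C)
        pooled = F.adaptive_avg_pool2d(visual_map, 1).flatten(1)
        h, c = text_feat, torch.zeros_like(text_feat)   # text feature as the initial hidden state
        h, c = self.cell(pooled, (h, c))                # forget/input gates decide what is kept or written
        return h                                        # updated text feature

updater = TextFeatureUpdater(visual_dim=256)
updated_text = updater(torch.randn(2, 512), torch.randn(2, 256, 31, 31))
```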

As one or more embodiments, S103, extracting fourth, fifth and sixth visual features from the template image of the target to be tracked (the template image of the target to be tracked refers to the first frame of the video to be tracked) and extracting seventh, eighth and ninth visual features from the search region image (the search region image refers to all images in the current video packet), specifically includes:

using ResNet-50 to extract visual features from the template image of the target to be tracked, where

the convolutional layer Conv2_3 of ResNet-50 outputs the fourth visual feature;

the convolutional layer Conv3_4 of ResNet-50 outputs the fifth visual feature;

the convolutional layer Conv5_3 of ResNet-50 outputs the sixth visual feature;

and using ResNet-50 to extract visual features from the search region image of the target to be tracked, where

the convolutional layer Conv2_3 of ResNet-50 outputs the seventh visual feature;

the convolutional layer Conv3_4 of ResNet-50 outputs the eighth visual feature;

the convolutional layer Conv5_3 of ResNet-50 outputs the ninth visual feature.

As one or more embodiments, S104, fusing the updated first, second and third text feature vectors with the fourth, fifth, sixth, seventh, eighth and ninth visual features, respectively, to obtain six fused features, specifically includes:

concatenating the updated first text feature vector with the fourth visual feature to obtain a first fused feature;

concatenating the updated second text feature vector with the fifth visual feature to obtain a second fused feature;

concatenating the updated third text feature vector with the sixth visual feature to obtain a third fused feature;

concatenating the updated first text feature vector with the seventh visual feature to obtain a fourth fused feature;

concatenating the updated second text feature vector with the eighth visual feature to obtain a fifth fused feature;

concatenating the updated third text feature vector with the ninth visual feature to obtain a sixth fused feature.

As one or more embodiments, S105, obtaining, according to the six fused features, the target tracking result of each frame in the current video packet of the video to be tracked, specifically includes:

inputting the first fused feature into a first convolutional neural network (CNN), and inputting both the output of the first convolutional neural network and the output of a fourth convolutional neural network into a first classification network to obtain a first classification result;

inputting the fourth fused feature into the fourth convolutional neural network, and inputting both the output of the fourth convolutional neural network and the output of the first convolutional neural network into a first regression network to obtain a first regression result;

inputting the second fused feature into a second convolutional neural network, and inputting both the output of the second convolutional neural network and the output of a fifth convolutional neural network into a second classification network to obtain a second classification result;

inputting the fifth fused feature into the fifth convolutional neural network, and inputting both the output of the fifth convolutional neural network and the output of the second convolutional neural network into a second regression network to obtain a second regression result;

inputting the third fused feature into a third convolutional neural network, and inputting both the output of the third convolutional neural network and the output of a sixth convolutional neural network into a third classification network to obtain a third classification result;

inputting the sixth fused feature into the sixth convolutional neural network, and inputting both the output of the sixth convolutional neural network and the output of the third convolutional neural network into a third regression network to obtain a third regression result;

fusing the first, second and third classification results to obtain the final classification result;

fusing the first, second and third regression results to obtain the final regression result;

obtaining, according to the final classification result and the final regression result, the target tracking result of each frame in the current video packet of the video to be tracked.
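
As an illustration only (not part of the original disclosure), one template/search branch pair feeding a classification head and a regression head could be sketched as follows. The specification does not state how the two branch outputs are combined inside each head; depth-wise cross-correlation in the style of SiamRPN++ is assumed here, and all channel sizes and the anchor count are illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def depthwise_xcorr(search: torch.Tensor, template: torch.Tensor) -> torch.Tensor:
    # search: (B, C, Hs, Ws), template: (B, C, Ht, Wt); per-channel correlation (assumed combination step)
    b, c, h, w = search.shape
    out = F.conv2d(search.reshape(1, b * c, h, w),
                   template.reshape(b * c, 1, template.shape[2], template.shape[3]),
                   groups=b * c)
    return out.reshape(b, c, out.shape[-2], out.shape[-1])

class BranchPairHead(nn.Module):
    def __init__(self, in_ch: int, num_anchors: int = 5):
        super().__init__()
        self.template_cnn = nn.Conv2d(in_ch, 256, 3)        # CNN on the template-side fused feature
        self.search_cnn = nn.Conv2d(in_ch, 256, 3)          # CNN on the search-side fused feature
        self.cls_head = nn.Conv2d(256, 2 * num_anchors, 1)  # foreground/background scores
        self.reg_head = nn.Conv2d(256, 4 * num_anchors, 1)  # box offsets

    def forward(self, fused_template, fused_search):
        zt = self.template_cnn(fused_template)
        xs = self.search_cnn(fused_search)
        corr = depthwise_xcorr(xs, zt)
        return self.cls_head(corr), self.reg_head(corr)

head = BranchPairHead(in_ch=512)
cls_out, reg_out = head(torch.randn(1, 512, 7, 7), torch.randn(1, 512, 31, 31))
```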

The core of the method proposed in this application is a text feature update module based on the Long Short-Term Memory (LSTM) network. The text feature update module uses the deep feature of the initial text description as the initial hidden state and, every set number of frames, feeds in the deep feature of the current frame to update the text feature held as the hidden state, so that the deep text feature changes accordingly when the target moves or its appearance changes in the video. The updated deep text feature is then fused with the deep visual features of the next set number of frames. This application uses the SiamRPN method to detect the target in each frame based on the fused features.

Previous tracking algorithms usually adopt detection or matching strategies and randomly select positive and negative samples from the dataset during training. To update the deep text features, however, the temporal order must be taken into account. Therefore, this application trains the feature update module with a sequential training method, dividing each video into the same number of segments, where the number of frames in each segment may differ.

The main contributions of this application are as follows: a text feature update module is proposed to narrow the gap between the text expression and visual information such as the position and appearance of the target, and a sequential training method is proposed to train the text feature update module so that the deep text features are updated as intended.

Single target tracking with a manually annotated target box has long been a challenge in machine vision, and researchers have proposed many single target tracking algorithms, among which the representative ones are algorithms based on correlation filters (CF) and algorithms based on recurrent neural networks (RNN). In recent years, Siamese structures based on matching networks, such as SiamFC, SiamRPN, SiamRPN++ and SiamMask, have attracted increasing attention owing to their accuracy and efficiency.

In recent years, research on single target tracking algorithms based on text descriptions has received increasing attention, but most algorithms regard the text description as a global constraint of the single target tracking task and ignore the limitations of the text description.

Given a video and a text annotation related to the tracked target, the purpose of this application is to track that target in the video. The main challenge in most scenarios is that the text annotation cannot accurately describe the position and appearance changes of the tracked target in different frames. To address this problem, this application proposes a tracking algorithm comprising two modules, a feature update module and a tracking module, whose details are described below.

Feature update module: the purpose of the feature update module is to reduce the limitations of the text description in the single target tracking task and to make the updated deep text features better reflect the state of the tracked target. The feature update module proposed in this application accomplishes the feature update task with a group of LSTM networks.

The feature update module contains three parallel LSTM units. First, the BERT (Bidirectional Encoder Representations from Transformers) method encodes the text into a 768-dimensional feature vector; a fully connected network then maps the text feature vector to 512 dimensions, and the text feature is used as the initial hidden state of the LSTM unit at the initial time. At a specific time t, the LSTM updates the hidden state as follows:

$$f_t = \sigma(\omega_f[l_{t-1}, v_t] + b_f)$$

$$i_t = \sigma(\omega_i[l_{t-1}, v_t] + b_i)$$

$$l_t = f_t \odot l_{t-1} + i_t \odot \tanh(\omega_l v_t + b_l)$$

where $l_t$ and $v_t$ denote the hidden state of the LSTM initialized with the text feature and the visual feature input to the LSTM, respectively, and $f_t$ and $i_t$ denote the forget gate and the input gate of the LSTM unit. The forget gate decides whether the value of the hidden state at the current time should be discarded, and the input gate decides whether the value of the deep visual feature input at the current time should be written. $\omega$ and $b$ denote the trainable weight and bias parameters of the gate operations. $\sigma$ and $\odot$ denote the sigmoid activation function and the element-wise (Hadamard) product.

At time t, the LSTM takes the deep visual feature as input to process the hidden state $l_{t-1}$. By initializing the LSTM hidden state with the text feature and updating the hidden state through the input-gate and forget-gate operations, the deep text feature can change as the position and appearance of the tracked target change.

The three parallel LSTM networks proposed in this application take sequential deep visual features as input to update the hidden states initialized with the text features, so that the text features change with the position and appearance of the tracked target. The visual information of the video can efficiently expand and enrich the deep text features.

In the Siamese network structure, both the template image of the target and the search region image of the current frame are passed through the ResNet50 network, which outputs visual features at three different depths; the three visual features are globally pooled and then input into the three parallel LSTM networks, and the deep text features updated by the LSTM networks change along with the visual features.

Tracking module: the tracking module proposed in this application takes the template image containing the target and the search region image as input and finds, in the search region image, the region most similar to the template image as the result of the tracking algorithm. Unlike traditional Siamese networks, which pre-crop and pad the search region image, this application does not crop the original image but pads the original image to the standard input size. In most scenarios, keeping the size of the original image preserves the association between the position information of the target and the text annotation. The template images used during training come from the manual annotations of the dataset, and in the test phase the template image is obtained with a visual grounding method.
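
As an illustration only (not part of the original disclosure), padding a frame to the fixed network input size instead of cropping a search region could be sketched as follows; the 255x255 size follows the implementation details given later, while centring and zero filling are assumptions:

```python
import torch
import torch.nn.functional as F

def pad_to_input_size(frame: torch.Tensor, size: int = 255) -> torch.Tensor:
    """Pad a (C, H, W) frame (with H, W <= size) to size x size without cropping it."""
    _, h, w = frame.shape
    top = (size - h) // 2
    left = (size - w) // 2
    return F.pad(frame, (left, size - w - left, top, size - h - top))

padded = pad_to_input_size(torch.randn(3, 180, 240))
assert padded.shape == (3, 255, 255)
```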

As shown in FIG. 1, similarly to the update module, the deep visual features of the template image and the search region image are extracted by the same ResNet50 network, and the deep features of the template image and the search region image are then fused with the updated text features. The updated text feature is fully connected to a 256-dimensional feature vector, and the 1×1×256 one-dimensional feature vector is stacked to 7×7×256 and 31×31×256 (the 7×7 spatial size matches that of the template image feature, and the 31×31 spatial size matches that of the search region image feature); the text features and the visual features are then concatenated for fusion. The fused feature uses visual information to further reduce the ambiguity of the language description and can improve the target-awareness of the visual features. Next, a convolutional neural network (CNN) processes the fused features. Finally, the fused features are input into the region proposal network of the Siamese network structure to detect the tracked target. The outputs of the classification branch and the regression branch of the region proposal network are the foreground/background classification of the detection boxes and the regression of the target box. As in traditional Siamese networks, we use a binary cross-entropy loss and a smooth L1 loss.
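
As an illustration only (not part of the original disclosure), the text-visual fusion described above could be sketched as follows; the 256-dimensional projection and the 7x7 / 31x31 spatial sizes follow the text, while the tiling-then-concatenation implementation details are assumptions:

```python
import torch
import torch.nn as nn

fc = nn.Linear(512, 256)   # fully connects the updated 512-d text feature to 256-d

def fuse(text_feat: torch.Tensor, visual_feat: torch.Tensor) -> torch.Tensor:
    # text_feat: (B, 512); visual_feat: (B, C, H, W)
    b, _, h, w = visual_feat.shape
    tiled = fc(text_feat).view(b, 256, 1, 1).expand(b, 256, h, w)  # stack the 1x1x256 vector to HxWx256
    return torch.cat([visual_feat, tiled], dim=1)                  # channel-wise concatenation

template_fused = fuse(torch.randn(2, 512), torch.randn(2, 256, 7, 7))    # (2, 512, 7, 7)
search_fused = fuse(torch.randn(2, 512), torch.randn(2, 256, 31, 31))    # (2, 512, 31, 31)
```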

The classification loss is as follows:

$$L_{cls} = -\frac{1}{N}\sum_{i=1}^{N}\left[y_i^{*}\log y_i + (1 - y_i^{*})\log(1 - y_i)\right]$$

where $y_i$ denotes the foreground/background prediction of the $i$-th of the $N$ candidate regions and $y_i^{*}$ the corresponding ground-truth label.

$A_x$, $A_y$, $A_w$ and $A_h$ denote the x and y coordinates of the center point, the width and the height of the candidate box, respectively, and $T_x$, $T_y$, $T_w$ and $T_h$ denote the coordinates, width and height of the ground-truth bounding box of the tracked target. The four-dimensional normalized distance is defined as follows:

$$\delta[0] = \frac{T_x - A_x}{A_w},\qquad \delta[1] = \frac{T_y - A_y}{A_h},\qquad \delta[2] = \ln\frac{T_w}{A_w},\qquad \delta[3] = \ln\frac{T_h}{A_h}$$

The regression loss is then:

$$L_{reg} = \sum_{i=0}^{3} \mathrm{smooth}_{L1}(\delta[i], \sigma)$$

$$\mathrm{smooth}_{L1}(x, \sigma) = \begin{cases} 0.5\,\sigma^2 x^2, & |x| < \dfrac{1}{\sigma^2} \\ |x| - \dfrac{1}{2\sigma^2}, & |x| \ge \dfrac{1}{\sigma^2} \end{cases}$$

The total loss is $L_{total} = L_{cls} + \lambda L_{reg}$, where $\lambda$ is a hyperparameter balancing the classification and regression losses.
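
As an illustration only (not part of the original disclosure), the combined training loss could be sketched as follows; only the loss arithmetic follows the description above, while the anchor sampling and tensor layouts are assumptions:

```python
import torch
import torch.nn.functional as F

def tracking_loss(cls_logits, cls_labels, reg_pred, reg_target, lam: float = 1.0):
    # cls_logits / cls_labels: (N,) foreground-background scores and labels
    # reg_pred / reg_target: (N, 4) normalised box offsets for the positive candidates
    l_cls = F.binary_cross_entropy_with_logits(cls_logits, cls_labels.float())
    l_reg = F.smooth_l1_loss(reg_pred, reg_target)
    return l_cls + lam * l_reg   # L_total = L_cls + lambda * L_reg

loss = tracking_loss(torch.randn(8), torch.randint(0, 2, (8,)),
                     torch.randn(8, 4), torch.randn(8, 4), lam=1.0)
```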

Implementation details: during training, each video is divided into 50 segments, forming packets together with the natural language annotation; the frames in each packet are resized to 255×255 and used by the update module and as the search region images of the Siamese network. Compared with the cropping of the search region image in traditional Siamese networks, using the original image better preserves the consistency between the target in the image and the natural language annotation. Meanwhile, the template image containing the tracked target is fed into the Siamese network as an exemplar.

The text annotation is encoded into a feature vector by BERT and a fully connected network and used to initialize the hidden state of the LSTM network. The update module then updates the text feature held as the hidden state according to the deep visual features, improving its ability to perceive the target in the search image sequence. The updated deep text feature is fused with the features of the template image and the search region image, and the fused features are finally used by the Siamese network to predict the position of the tracked target.

This application uses a modified ResNet50 network pre-trained on the ImageNet dataset. The model is trained with a momentum optimizer, with a decay rate of 1×10⁻⁴ and a momentum of 0.9; the initial learning rate is 5×10⁻³ and decreases by 1×10⁻⁴ per training epoch, and the training batch size is 32. Each video is cut into 50 video segments, i.e. the deep text features are updated 50 times per video. The model is trained for 5, 10, 15 and 20 epochs, respectively, and tested.
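
As an illustration only (not part of the original disclosure), the quoted optimiser settings could be set up as follows; interpreting the "decay rate" as weight decay and applying the per-epoch learning-rate decrement manually are assumptions:

```python
import torch

model = torch.nn.Linear(10, 10)   # stand-in for the full tracking network
optimizer = torch.optim.SGD(model.parameters(), lr=5e-3,
                            momentum=0.9, weight_decay=1e-4)

for epoch in range(20):
    # ... one training epoch over batches of size 32 ...
    for group in optimizer.param_groups:
        group["lr"] = max(group["lr"] - 1e-4, 0.0)   # decrease the learning rate by 1e-4 per epoch
```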

In application, this application uses the visual grounding method to generate the template image. Visual grounding predicts, from the text annotation, a box in the image corresponding to the text content. Once the box of the tracked target is available, the template image can be cut out of the first frame of the video using this box. The visual grounding method is also used to recover the tracking result after the tracking algorithm loses the target.

Experimental results: the experimental results are presented and analyzed next. First, the datasets and evaluation criteria used in the experiments, as well as some implementation details, are presented; then the comparison with traditional methods is shown. This application also analyzes the model under different settings and tries to explain its performance and working principle. The experiments in this application were run on an Intel Xeon CPU E5-2687W v3 3.10 GHz and an NVIDIA Tesla V100 GPU.

The datasets used in the experiments are the LaSOT dataset and the Lingual OTB99 dataset, because every video in these two datasets carries a text annotation. The LaSOT dataset is a large-scale benchmark for single target tracking containing 1400 video sequences; each video has a natural language annotation and every frame has a target box. The dataset provides 1120 videos for training and 280 videos for testing. Because the main purpose of the text annotations in the LaSOT dataset is to assist the tracking process and the text annotations do not describe the targets precisely enough, this application modifies some text annotations to reduce their ambiguity. The Lingual OTB99 dataset is an extended version of the OTB100 dataset in which each video is annotated with a sentence; it contains 51 training videos and 48 testing videos.

As with traditional tracking algorithms, this application uses precision and success rate as the evaluation criteria of the tracking algorithm. Precision denotes the percentage of frames in which the coincidence between the predicted target box and the ground-truth target box exceeds a given threshold. The success rate denotes the percentage of frames in which the intersection-over-union of the predicted target box and the ground-truth target box is higher than a certain threshold.
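
As an illustration only (not part of the original disclosure), the two measures could be computed as follows; the centre-distance criterion for precision and the 20-pixel / 0.5 thresholds are the commonly used conventions and are assumptions here:

```python
import numpy as np

def iou(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    # a, b: (N, 4) boxes given as (x, y, w, h)
    x1, y1 = np.maximum(a[:, 0], b[:, 0]), np.maximum(a[:, 1], b[:, 1])
    x2 = np.minimum(a[:, 0] + a[:, 2], b[:, 0] + b[:, 2])
    y2 = np.minimum(a[:, 1] + a[:, 3], b[:, 1] + b[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    return inter / (a[:, 2] * a[:, 3] + b[:, 2] * b[:, 3] - inter)

def precision(pred: np.ndarray, gt: np.ndarray, thresh: float = 20.0) -> float:
    centers_p = pred[:, :2] + pred[:, 2:] / 2
    centers_g = gt[:, :2] + gt[:, 2:] / 2
    return float(np.mean(np.linalg.norm(centers_p - centers_g, axis=1) <= thresh))

def success(pred: np.ndarray, gt: np.ndarray, thresh: float = 0.5) -> float:
    return float(np.mean(iou(pred, gt) >= thresh))
```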

Compared with single target tracking algorithms that use a text description, this application is evaluated against these algorithms under two initialization conditions: one uses only the given text annotation for single target tracking, and the other uses both the target box of the first frame and the text annotation for initialization. As shown in the table, under both initialization methods the algorithm of this application outperforms the traditional algorithms on LaSOT and Lingual OTB99.

Part of the tracking results are shown in FIG. 2. With the assistance of the tracking module, this model performs better than many algorithms initialized with the text annotation or with the target box of the first frame; its performance under disturbances such as occlusion and box drift is robust, and it can recover to tracking the correct target after the target has left the field of view or a wrong target has been tracked. This application also compares this model with other tracking algorithms initialized only with the target box of the first frame. As shown in Table 1, when initialized only with the text annotation, this model achieves results competitive with algorithms initialized with the target box of the first frame; when initialized with both the target box and the text annotation, this model performs better than the tracking algorithms initialized with the target box.

Conclusion: in the single target tracking task, a concise text annotation of a video usually describes the state of the target in the first frame or the motion of the target over the entire video rather than its exact position and appearance in every frame, because these attributes of the target may keep changing from frame to frame. This application proposes a new feature update module for single target visual tracking based on text description; an LSTM network is used to update the deep text features, and the updated deep text features are fused with the deep visual features to improve the performance of the single target tracking algorithm. The experimental results show that the text description can help improve the single target tracking algorithm and achieve good single target tracking performance. FIG. 3(a) to FIG. 3(g) are schematic diagrams of the effect of the first embodiment.

Table 1. Comparison of experimental results


Embodiment 2

This embodiment provides a single target visual tracking device based on text description.

The single target visual tracking device based on text description includes:

a video packet division module configured to: obtain a template image of the target to be tracked; obtain the video to be tracked and a text description related to the target to be tracked; and evenly divide the video to be tracked into several video packets according to a set number of frames;

a text feature extraction module configured to extract first, second and third text features from the text description;

a visual feature extraction module configured to: extract first, second and third visual features from the n-th sampled frame of each video packet, where n is a positive integer and the upper limit of n is a specified value; update the first, second and third text features based on the first, second and third visual features of the n-th sampled frame of each video packet, respectively, to obtain updated first, second and third text features; extract fourth, fifth and sixth visual features from the template image of the target to be tracked, where the template image of the target to be tracked refers to the first frame of the video to be tracked; and extract seventh, eighth and ninth visual features from the search region image, where the search region image refers to all images in the current video packet;

a feature fusion module configured to fuse the updated first, second and third text feature vectors with the fourth, fifth, sixth, seventh, eighth and ninth visual features, respectively, to obtain six fused features;

an output module configured to obtain, according to the six fused features, the target tracking result of each frame in the current video packet of the video to be tracked.

It should be noted here that the video packet division module, the text feature extraction module, the visual feature extraction module, the feature fusion module and the output module correspond to steps S101 to S105 in Embodiment 1; the examples and application scenarios implemented by these modules are the same as those of the corresponding steps, but are not limited to the content disclosed in Embodiment 1. It should be noted that, as part of the system, the above modules may be executed in a computer system such as a set of computer-executable instructions.

The descriptions of the above embodiments each have their own emphasis; for parts not described in detail in one embodiment, reference may be made to the related descriptions of the other embodiments.

The proposed system may be implemented in other ways. For example, the system embodiments described above are merely illustrative; the division into the above modules is only a division of logical functions, and in actual implementation there may be other division methods, for example multiple modules may be combined or integrated into another system, or some features may be omitted or not performed.

Embodiment 3

This embodiment further provides an electronic device, including one or more processors, one or more memories and one or more computer programs, wherein the processor is connected to the memory, the one or more computer programs are stored in the memory, and when the electronic device runs, the processor executes the one or more computer programs stored in the memory, so that the electronic device performs the method described in Embodiment 1.

It should be understood that in this embodiment the processor may be a central processing unit (CPU), another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor.

The memory may include a read-only memory and a random access memory and provides instructions and data to the processor; a part of the memory may also include a non-volatile random access memory. For example, the memory may also store information about the device type.

In the implementation process, each step of the above method may be completed by an integrated logic circuit of hardware in the processor or by instructions in the form of software.

The method of Embodiment 1 may be directly embodied as being executed by a hardware processor, or executed by a combination of hardware and software modules in the processor. The software module may be located in a storage medium mature in the art, such as a random access memory, a flash memory, a read-only memory, a programmable read-only memory, an electrically erasable programmable memory or a register. The storage medium is located in the memory, and the processor reads the information in the memory and completes the steps of the above method in combination with its hardware. To avoid repetition, details are not described here again.

Those of ordinary skill in the art will appreciate that the units and algorithm steps of the examples described in this embodiment can be implemented by electronic hardware or by a combination of computer software and electronic hardware. Whether these functions are performed by hardware or software depends on the specific application and design constraints of the technical solution. Skilled artisans may use different methods to implement the described functions for each specific application, but such implementation should not be considered to go beyond the scope of the present application.

Embodiment 4

This embodiment further provides a computer-readable storage medium for storing computer instructions which, when executed by a processor, carry out the method described in Embodiment 1.

The above are only preferred embodiments of the present application and are not intended to limit the present application; for those skilled in the art, various modifications and changes may be made to the present application. Any modification, equivalent replacement, improvement, etc. made within the spirit and principles of the present application shall be included within the protection scope of the present application.

Claims (8)

1. A single-target visual tracking method based on a text description, characterized by comprising:
obtaining a template image of a target to be tracked; obtaining a video to be tracked and a text description related to the target to be tracked; dividing the video to be tracked evenly into several video packets according to a set number of frames;
extracting a first, a second and a third text feature from the text description;
extracting a first, a second and a third visual feature from the n-th sampled frame of each video packet, n being a positive integer whose upper limit is a specified value; updating the first, second and third text features based on the first, second and third visual features of the n-th sampled frame of each video packet, to obtain updated first, second and third text features; extracting a fourth, a fifth and a sixth visual feature from the template image of the target to be tracked, the template image being the first frame of the video to be tracked; extracting a seventh, an eighth and a ninth visual feature from the search-region images, the search-region images being all images in the current video packet;
wherein updating the first, second and third text features based on the first, second and third visual features of the n-th sampled frame of each video packet to obtain the updated first, second and third text features specifically comprises:
applying global average pooling to the first visual feature to obtain a first sub-visual feature; taking the first text feature as the initial hidden state of a first LSTM model; at a set time t, feeding the first sub-visual feature into the first LSTM model, the first LSTM model outputting the updated first text feature; in the first LSTM model, the forget gate decides whether the hidden state at the current time should be discarded, and the input gate decides whether the value of the input visual feature should be written;
applying global average pooling to the second visual feature to obtain a second sub-visual feature; taking the second text feature as the initial hidden state of a second LSTM model; at the set time t, feeding the second sub-visual feature into the second LSTM model, the second LSTM model outputting the updated second text feature;
applying global average pooling to the third visual feature to obtain a third sub-visual feature; taking the third text feature as the initial hidden state of a third LSTM model; at the set time t, feeding the third sub-visual feature into the third LSTM model, the third LSTM model outputting the updated third text feature;
fusing the updated first, second and third text feature vectors with the fourth, fifth, sixth, seventh, eighth and ninth visual features, respectively, to obtain six fused features;
wherein fusing the updated first, second and third text feature vectors with the fourth, fifth, sixth, seventh, eighth and ninth visual features to obtain the six fused features specifically comprises:
concatenating the updated first text feature vector with the fourth visual feature to obtain a first fused feature; concatenating the updated second text feature vector with the fifth visual feature to obtain a second fused feature; concatenating the updated third text feature vector with the sixth visual feature to obtain a third fused feature; concatenating the updated first text feature vector with the seventh visual feature to obtain a fourth fused feature; concatenating the updated second text feature vector with the eighth visual feature to obtain a fifth fused feature; concatenating the updated third text feature vector with the ninth visual feature to obtain a sixth fused feature; and
obtaining, from the six fused features, a target tracking result for every frame in the current video packet of the video to be tracked.

2. The single-target visual tracking method based on a text description according to claim 1, characterized in that extracting the first, second and third text features from the text description specifically comprises:
extracting the first, second and third text features from the text description with the BERT method.

3. The single-target visual tracking method based on a text description according to claim 1, characterized in that extracting the first, second and third visual features from the n-th sampled frame of each video packet, n being a positive integer whose upper limit is a specified value, specifically comprises:
performing visual feature extraction on the n-th sampled frame of each video packet with a ResNet-50, in which convolutional layer Conv2_3 outputs the first visual feature, convolutional layer Conv3_4 outputs the second visual feature, and convolutional layer Conv5_3 outputs the third visual feature.

4. The single-target visual tracking method based on a text description according to claim 1, characterized in that extracting the fourth, fifth and sixth visual features from the template image of the target to be tracked, the template image being the first frame of the video to be tracked, and extracting the seventh, eighth and ninth visual features from the search-region images, the search-region images being all images in the current video packet, specifically comprises:
performing visual feature extraction on the template image of the target to be tracked with a ResNet-50, in which convolutional layer Conv2_3 outputs the fourth visual feature, convolutional layer Conv3_4 outputs the fifth visual feature, and convolutional layer Conv5_3 outputs the sixth visual feature; and
performing visual feature extraction on the search-region images of the target to be tracked with the ResNet-50, in which convolutional layer Conv2_3 outputs the seventh visual feature, convolutional layer Conv3_4 outputs the eighth visual feature, and convolutional layer Conv5_3 outputs the ninth visual feature.

5. The single-target visual tracking method based on a text description according to claim 1, characterized in that obtaining, from the six fused features, the target tracking result for every frame in the current video packet of the video to be tracked specifically comprises:
feeding the first fused feature into a first convolutional neural network (CNN), and feeding the output of the first CNN and the output of a fourth CNN into a first classification network to obtain a first classification result;
feeding the fourth fused feature into the fourth CNN, and feeding the output of the fourth CNN and the output of the first CNN into a first regression network to obtain a first regression result;
feeding the second fused feature into a second CNN, and feeding the output of the second CNN and the output of a fifth CNN into a second classification network to obtain a second classification result;
feeding the fifth fused feature into the fifth CNN, and feeding the output of the fifth CNN and the output of the second CNN into a second regression network to obtain a second regression result;
feeding the third fused feature into a third CNN, and feeding the output of the third CNN and the output of a sixth CNN into a third classification network to obtain a third classification result;
feeding the sixth fused feature into the sixth CNN, and feeding the output of the sixth CNN and the output of the third CNN into a third regression network to obtain a third regression result;
fusing the first, second and third classification results to obtain a final classification result;
fusing the first, second and third regression results to obtain a final regression result; and
obtaining, from the final classification result and the final regression result, the target tracking result for every frame in the current video packet of the video to be tracked.

6. A single-target visual tracking device based on a text description, characterized by comprising:
a video packet division module configured to obtain a template image of a target to be tracked, obtain a video to be tracked and a text description related to the target to be tracked, and divide the video to be tracked evenly into several video packets according to a set number of frames;
a text feature extraction module configured to extract a first, a second and a third text feature from the text description;
a visual feature extraction module configured to extract a first, a second and a third visual feature from the n-th sampled frame of each video packet, n being a positive integer whose upper limit is a specified value; update the first, second and third text features based on the first, second and third visual features of the n-th sampled frame of each video packet to obtain updated first, second and third text features; extract a fourth, a fifth and a sixth visual feature from the template image of the target to be tracked, the template image being the first frame of the video to be tracked; and extract a seventh, an eighth and a ninth visual feature from the search-region images, the search-region images being all images in the current video packet;
wherein updating the first, second and third text features based on the first, second and third visual features of the n-th sampled frame of each video packet to obtain the updated first, second and third text features specifically comprises:
applying global average pooling to the first visual feature to obtain a first sub-visual feature; taking the first text feature as the initial hidden state of a first LSTM model; at a set time t, feeding the first sub-visual feature into the first LSTM model, the first LSTM model outputting the updated first text feature; in the first LSTM model, the forget gate decides whether the hidden state at the current time should be discarded, and the input gate decides whether the value of the input visual feature should be written;
applying global average pooling to the second visual feature to obtain a second sub-visual feature; taking the second text feature as the initial hidden state of a second LSTM model; at the set time t, feeding the second sub-visual feature into the second LSTM model, the second LSTM model outputting the updated second text feature;
applying global average pooling to the third visual feature to obtain a third sub-visual feature; taking the third text feature as the initial hidden state of a third LSTM model; at the set time t, feeding the third sub-visual feature into the third LSTM model, the third LSTM model outputting the updated third text feature;
a feature fusion module configured to fuse the updated first, second and third text feature vectors with the fourth, fifth, sixth, seventh, eighth and ninth visual features, respectively, to obtain six fused features;
wherein fusing the updated first, second and third text feature vectors with the fourth, fifth, sixth, seventh, eighth and ninth visual features to obtain the six fused features specifically comprises:
concatenating the updated first text feature vector with the fourth visual feature to obtain a first fused feature; concatenating the updated second text feature vector with the fifth visual feature to obtain a second fused feature; concatenating the updated third text feature vector with the sixth visual feature to obtain a third fused feature; concatenating the updated first text feature vector with the seventh visual feature to obtain a fourth fused feature; concatenating the updated second text feature vector with the eighth visual feature to obtain a fifth fused feature; concatenating the updated third text feature vector with the ninth visual feature to obtain a sixth fused feature; and
an output module configured to obtain, from the six fused features, a target tracking result for every frame in the current video packet of the video to be tracked.

7. An electronic device, characterized by comprising one or more processors, one or more memories, and one or more computer programs, wherein the processor is connected to the memory, the one or more computer programs are stored in the memory, and when the electronic device runs, the processor executes the one or more computer programs stored in the memory, so that the electronic device performs the method according to any one of claims 1-5.

8. A computer-readable storage medium, characterized by being used to store computer instructions which, when executed by a processor, complete the method according to any one of claims 1-5.
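For readers who want a concrete picture of the text-feature update and concatenation fusion recited in claim 1 (and mirrored in claim 6), the following is a minimal PyTorch-style sketch. It is not the patented implementation: the use of nn.LSTMCell, the tensor dimensions, the looping over sampled frames, and the spatial broadcast of the updated text vector before concatenation are all illustrative assumptions rather than details fixed by the claims.

```python
import torch
import torch.nn as nn


class TextFeatureUpdater(nn.Module):
    """One branch of the update step (sketch): visual feature maps from the
    sampled frames of a video packet are globally average-pooled and fed, one
    step at a time, into an LSTM whose initial hidden state is the text
    feature; the final hidden state is taken as the updated text feature."""

    def __init__(self, visual_dim: int, text_dim: int):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)            # global average pooling
        self.cell = nn.LSTMCell(visual_dim, text_dim)  # forget/input gates decide what to discard / write

    def forward(self, visual_maps, text_feat):
        # visual_maps: list of (B, C, H, W) feature maps, one per sampled frame
        # text_feat:   (B, text_dim), e.g. a BERT sentence embedding
        h = text_feat                                  # text feature as initial hidden state
        c = torch.zeros_like(text_feat)
        for fmap in visual_maps:
            v = self.pool(fmap).flatten(1)             # (B, C) sub-visual feature
            h, c = self.cell(v, (h, c))
        return h                                       # updated text feature


def fuse(text_feat, visual_map):
    """Concatenation fusion (sketch): tile the updated text vector over the
    spatial grid of a template / search-region feature map and concatenate."""
    b, _, hh, ww = visual_map.shape
    t = text_feat[:, :, None, None].expand(b, text_feat.shape[1], hh, ww)
    return torch.cat([visual_map, t], dim=1)


if __name__ == "__main__":
    B, C = 2, 256
    updater = TextFeatureUpdater(visual_dim=C, text_dim=768)
    frames = [torch.randn(B, C, 32, 32) for _ in range(4)]   # 4 sampled frames (illustrative)
    text = torch.randn(B, 768)
    updated = updater(frames, text)                          # (B, 768)
    fused = fuse(updated, torch.randn(B, C, 16, 16))         # (B, 256 + 768, 16, 16)
    print(updated.shape, fused.shape)
```

In this reading, one such updater would be instantiated per feature level (three in total), and each updated text feature is concatenated with both the template-branch and the search-branch map of its level, giving the six fused features named in claim 1.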
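Similarly, the multi-level backbone features recited in claims 3 and 4 can be sketched with torchvision. This snippet is only an assumption-laden illustration: it reads the labels Conv2_3, Conv3_4 and Conv5_3 as the outputs of layer1, layer2 and layer4 of torchvision's ResNet-50, and the input crop sizes are arbitrary examples.

```python
import torch
from torchvision.models import resnet50


class MultiLevelBackbone(torch.nn.Module):
    """Return three feature maps from one shared ResNet-50 (sketch): the
    Conv2_3 / Conv3_4 / Conv5_3 labels are interpreted here as the last
    blocks of conv2_x, conv3_x and conv5_x, i.e. layer1, layer2, layer4."""

    def __init__(self):
        super().__init__()
        net = resnet50()   # randomly initialised; load pretrained weights as needed
        self.stem = torch.nn.Sequential(net.conv1, net.bn1, net.relu, net.maxpool)
        self.layer1, self.layer2 = net.layer1, net.layer2
        self.layer3, self.layer4 = net.layer3, net.layer4

    def forward(self, x):
        x = self.stem(x)
        f_low = self.layer1(x)                    # "Conv2_3"-level feature
        f_mid = self.layer2(f_low)                # "Conv3_4"-level feature
        f_high = self.layer4(self.layer3(f_mid))  # "Conv5_3"-level feature
        return f_low, f_mid, f_high


if __name__ == "__main__":
    backbone = MultiLevelBackbone().eval()
    template = torch.randn(1, 3, 127, 127)        # template image (first-frame crop, size assumed)
    search = torch.randn(1, 3, 255, 255)          # search-region image (size assumed)
    with torch.no_grad():
        for name, img in [("template", template), ("search", search)]:
            f1, f2, f3 = backbone(img)
            print(name, f1.shape, f2.shape, f3.shape)
```

Running the same backbone on the template image yields the fourth to sixth visual features and on the search-region images the seventh to ninth, matching the pairing used in the fusion step above.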
CN202011642602.9A 2020-12-31 2020-12-31 Single target tracking method, device, equipment and storage medium based on character description Active CN112734803B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011642602.9A CN112734803B (en) 2020-12-31 2020-12-31 Single target tracking method, device, equipment and storage medium based on character description

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011642602.9A CN112734803B (en) 2020-12-31 2020-12-31 Single target tracking method, device, equipment and storage medium based on character description

Publications (2)

Publication Number Publication Date
CN112734803A CN112734803A (en) 2021-04-30
CN112734803B true CN112734803B (en) 2023-03-24

Family

ID=75609164

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011642602.9A Active CN112734803B (en) 2020-12-31 2020-12-31 Single target tracking method, device, equipment and storage medium based on character description

Country Status (1)

Country Link
CN (1) CN112734803B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113298142B (en) * 2021-05-24 2023-11-17 南京邮电大学 Target tracking method based on depth space-time twin network
CN114241586B (en) * 2022-02-21 2022-05-27 飞狐信息技术(天津)有限公司 Face detection method and device, storage medium and electronic equipment
CN115496975B (en) * 2022-08-29 2023-08-18 锋睿领创(珠海)科技有限公司 Auxiliary weighted data fusion method, device, equipment and storage medium
CN116128926B (en) * 2023-02-15 2024-10-11 中国人民解放军战略支援部队航天工程大学 A satellite video single target tracking method, system, device and storage medium
CN116091551B (en) * 2023-03-14 2023-06-20 中南大学 A method and system for target retrieval and tracking based on multimodal fusion

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110781951A (en) * 2019-10-23 2020-02-11 中国科学院自动化研究所 Visual tracking method based on thalamus dynamic allocation and based on multi-visual cortex information fusion

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110060276B (en) * 2019-04-18 2023-05-16 腾讯科技(深圳)有限公司 Object tracking method, tracking processing method, corresponding device and electronic equipment
CN110569723A (en) * 2019-08-02 2019-12-13 西安工业大学 A Target Tracking Method Combining Feature Fusion and Model Update

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110781951A (en) * 2019-10-23 2020-02-11 中国科学院自动化研究所 Visual tracking method based on thalamus dynamic allocation and based on multi-visual cortex information fusion

Also Published As

Publication number Publication date
CN112734803A (en) 2021-04-30

Similar Documents

Publication Publication Date Title
CN112734803B (en) Single target tracking method, device, equipment and storage medium based on character description
Zhao et al. Jsnet: Joint instance and semantic segmentation of 3d point clouds
Liu et al. Curved scene text detection via transverse and longitudinal sequence connection
Fan et al. Multi-level contextual rnns with attention model for scene labeling
CN110188817A (en) A kind of real-time high-performance street view image semantic segmentation method based on deep learning
CN111274994B (en) Cartoon face detection method and device, electronic equipment and computer readable medium
Xia et al. Loop closure detection for visual SLAM using PCANet features
CN113628244B (en) Target tracking method, system, terminal and medium based on label-free video training
Chen et al. Learning linear regression via single-convolutional layer for visual object tracking
CN114998601B (en) On-line update target tracking method and system based on Transformer
CN113902991A (en) Twin network target tracking method based on cascade characteristic fusion
CN115512169B (en) Weak supervision semantic segmentation method and device based on gradient and region affinity optimization
CN113963304B (en) Cross-modal video timing action localization method and system based on timing-spatial graph
CN115761393B (en) An anchor-free target tracking method based on template online learning
CN110347853B (en) Image hash code generation method based on recurrent neural network
Aliakbarian et al. Deep action-and context-aware sequence learning for activity recognition and anticipation
CN115204171A (en) Document-level event extraction method and system based on hypergraph neural network
CN114743045A (en) A Small-Sample Object Detection Method Based on Dual-branch Region Proposal Network
Li A deep learning-based text detection and recognition approach for natural scenes
Huang et al. Bidirectional tracking scheme for visual object tracking based on recursive orthogonal least squares
CN111091198A (en) Data processing method and device
Wang et al. Compressed holistic convolutional neural network-based descriptors for scene recognition
Lin Fast and Accurate Object Detection on Asymmetrical Receptive Field
CN115731600A (en) Small target detection method based on semi-supervision and feature fusion
Hui et al. A multilevel single stage network for face detection

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant