
CN116994337A - Appearance detection action recognition device and method for dense objects in narrow space - Google Patents

Appearance detection action recognition device and method for dense objects in narrow space

Info

Publication number
CN116994337A
CN116994337A
Authority
CN
China
Prior art keywords
frame
detection
shared memory
action
action recognition
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310995299.8A
Other languages
Chinese (zh)
Inventor
张亮
黄继来
李洪升
朱光明
张晨洋
韩冰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xidian University
Original Assignee
Xidian University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xidian University filed Critical Xidian University
Priority to CN202310995299.8A priority Critical patent/CN116994337A/en
Publication of CN116994337A publication Critical patent/CN116994337A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20Movements or behaviour, e.g. gesture recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/54Interprogram communication
    • G06F9/544Buffers; Shared memory; Pipes
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/54Interprogram communication
    • G06F9/546Message passing systems or structures, e.g. queues
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/20Analysis of motion
    • G06T7/246Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/20Analysis of motion
    • G06T7/277Analysis of motion involving stochastic approaches, e.g. using Kalman filters
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/54Extraction of image or video features relating to texture
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/62Extraction of image or video features relating to a temporal dimension, e.g. time-based feature extraction; Pattern tracking
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/94Hardware or software architectures specially adapted for image or video understanding
    • G06V10/95Hardware or software architectures specially adapted for image or video understanding structured as a network, e.g. client-server architectures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/96Management of image or video recognition tasks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/46Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/50Context or environment of the image
    • G06V20/52Surveillance or monitoring of activities, e.g. for recognising suspicious objects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F2209/00Indexing scheme relating to G06F9/00
    • G06F2209/54Indexing scheme relating to G06F9/54
    • G06F2209/548Queue
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10016Video; Image sequence
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/30Subject of image; Context of image processing
    • G06T2207/30196Human being; Person
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/30Subject of image; Context of image processing
    • G06T2207/30232Surveillance
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/30Subject of image; Context of image processing
    • G06T2207/30241Trajectory
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07Target detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Software Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • General Engineering & Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Social Psychology (AREA)
  • Psychiatry (AREA)
  • Image Analysis (AREA)

Abstract

本发明提供了一种面向狭小空间密集对象的外观检测动作识别装置和方法,首次地提出了不同计算机视觉算法模型依据时序顺序对存储在共享内存的原始编码视频帧图像数据进行循环读取处理,使用多目标跟踪算法将密集对象追踪的id与该对象的外观检测行为识别结果相对应,定期将识别结果加入数据集进行模型的自训练,有效减小了系统间的耦合度和数据读取时延,大幅提高了视频处理的实时性、鲁棒性和效率。解决现有方法对密集对象外观检测效率低、主观性强的问题。

The present invention provides an appearance detection action recognition device and method for dense objects in a small space. For the first time, it proposes that different computer vision algorithm models cyclically read and process the original encoded video frame image data stored in shared memory in chronological order. A multi-target tracking algorithm is used to associate the tracking id of each dense object with the recognition result of that object's appearance detection behavior, and the recognition results are regularly added to the data set for self-training of the model. This effectively reduces the coupling between subsystems and the data read latency, and greatly improves the real-time performance, robustness and efficiency of video processing, solving the problems of low efficiency and strong subjectivity in existing methods for appearance detection of dense objects.

Description

面向狭小空间密集对象的外观检测动作识别装置和方法Appearance detection action recognition device and method for dense objects in small space

技术领域Technical field

本发明属于人工智能与计算机视觉技术领域,具体涉及一种面向狭小空间密集对象的外观检测动作识别装置和方法。The invention belongs to the technical fields of artificial intelligence and computer vision, and specifically relates to an appearance detection action recognition device and method for dense objects in a small space.

背景技术Background technique

近年来，在产业智能化升级的趋势下，越来越多的企业正试图通过机器人和人工智能等技术，打造智慧工厂。工厂监控相机每天可以产生数T级别有效的工厂视频数据，而这些视频中的绝大部分仅仅用作监控工人生产。实际上，这些工厂视频数据中包含大量的工人、机器的操作行为，以及蕴含在其中的生产操作模式等，可以进一步用于动作识别、工作流挖掘、异常事件监测等方面。然而就目前来看，对于产线工人或机械装置在某些生产操作行为的工作质量评估仍然依靠人工检查或随机抽查的方式，耗费了大量的人力物力财力，没有将监控视频与人工智能技术有效的结合起来。In recent years, under the trend of intelligent industrial upgrading, more and more companies are trying to build smart factories through technologies such as robotics and artificial intelligence. Factory surveillance cameras can produce several terabytes of useful factory video data every day, yet most of these videos are only used to monitor worker production. In fact, this factory video data contains a large number of worker and machine operating behaviors, as well as the production operation patterns embedded in them, which can further be used for action recognition, workflow mining, abnormal event monitoring and so on. At present, however, the assessment of the work quality of production-line workers or mechanical devices in certain production operations still relies on manual inspection or random sampling, which consumes a great deal of manpower, material and financial resources and fails to effectively combine surveillance video with artificial intelligence technology.

据了解目前很少有研究试图自动对狭小空间密集对象外观检测的行为是否合规进行判断,大多数方法依旧依靠人工监测。但是这种人工检查方式受限于人力以及主观因素的影响,当遇到工厂产能扩大等情况时,对密集对象外观检测的行为是否合规进行判断的难度有大幅的增加。It is understood that there are currently very few studies that attempt to automatically determine whether the behavior of appearance detection of dense objects in a small space is compliant, and most methods still rely on manual monitoring. However, this manual inspection method is limited by manpower and subjective factors. When factory production capacity is expanded, it becomes more difficult to judge whether the appearance inspection of intensive objects is compliant.

以往，作为对人物的动作进行识别的技术，大多是对人物当前所做动作进行整体识别，例如，在专利文献一中：发明公开CN115410268A动作识别方法以及基于动作骨骼点数据的动作识别方法，只是将动作关键点数据输入模型进行识别并进行结果输出。而在工厂狭小空间场景下，需要对多个对象进行多次外观检测动作，现有动作识别方法无法直接对监控视频数据进行动作识别并有效将识别结果与密集对象正确绑定。In the past, most technologies for recognizing human actions recognized the action a person is currently performing as a whole. For example, Patent Document 1 (published invention CN115410268A, an action recognition method and an action recognition method based on skeleton keypoint data) simply feeds the action keypoint data into a model for recognition and outputs the result. In the confined space of a factory, however, multiple appearance detection actions need to be performed on multiple objects, and existing action recognition methods cannot directly perform action recognition on surveillance video data or effectively bind the recognition results to the correct dense objects.

发明内容Contents of the invention

为了克服以上现有技术存在的问题，本发明的目的在于提供一种面向狭小空间密集对象的外观检测动作识别装置和方法，首次提出了不同计算机视觉算法模型依据时序顺序对存储在共享内存的原始编码视频帧图像数据进行循环读取处理，使用多目标跟踪算法将密集对象追踪的id与该对象的外观检测行为识别结果相对应，定期将识别结果加入数据集进行模型的自训练，有效减小了系统间的耦合度和数据读取时延，大幅提高了视频处理的实时性、鲁棒性和效率。解决现有方法对密集对象外观检测效率低、主观性强的问题。In order to overcome the above problems in the prior art, the purpose of the present invention is to provide an appearance detection action recognition device and method for dense objects in a small space. It proposes for the first time that different computer vision algorithm models cyclically read and process the original encoded video frame image data stored in shared memory in chronological order, and uses a multi-target tracking algorithm to associate the tracking id of each dense object with the recognition result of that object's appearance detection behavior. The recognition results are regularly added to the data set for self-training of the model, which effectively reduces the coupling between subsystems and the data read latency and greatly improves the real-time performance, robustness and efficiency of video processing. This solves the problems of low efficiency and strong subjectivity in existing methods for appearance detection of dense objects.

为了实现上述目的,本发明采用的技术方案是:In order to achieve the above objects, the technical solution adopted by the present invention is:

一种面向狭小空间密集对象的外观检测动作识别方法,包括以下步骤:An appearance detection action recognition method for dense objects in a small space, including the following steps:

S1:相机通过组播方式向边缘设备和后台显示系统发送原始编码视频,向共享内存中实时写入帧图像数据;若写入位置到达共享内存队尾,则从队头开始重新写入,覆盖原帧图像数据,即共享内存的循环写入;S1: The camera sends the original encoded video to the edge device and the background display system through multicast mode, and writes the frame image data to the shared memory in real time; if the writing position reaches the end of the shared memory queue, it will be written again from the head of the queue, overwriting Original frame image data, that is, cyclic writing of shared memory;

S2:根据相机生成并传输的帧序列信息从所述共享内存中循环读取，获得对应帧图像数据作为输入，该帧序列信息指定了帧图像数据在共享内存中的具体存储位置；使用damo-yolo目标检测模型对从共享内存中取出的帧图像数据进行目标检测，并向目标追踪模型输出检测框坐标，对象类别等检测结果；S2: Cyclically read from the shared memory according to the frame sequence information generated and transmitted by the camera to obtain the corresponding frame image data as input; the frame sequence information specifies the exact storage location of the frame image data in the shared memory. Use the damo-yolo object detection model to perform object detection on the frame image data retrieved from the shared memory, and output detection results such as detection box coordinates and object categories to the object tracking model;

S3:根据所述damo-yolo目标检测模型传递的帧序列信息从共享内存中读取帧图像数据,结合检测结果,融合对象实时位置及纹理变化进行多目标跟踪处理并分配对象id,输出密集对象追踪坐标及id的追踪结果;S3: Read the frame image data from the shared memory according to the frame sequence information passed by the damo-yolo target detection model, combine the detection results, fuse the real-time position and texture changes of the object for multi-target tracking processing and assign object ids, and output dense objects Tracking coordinates and ID tracking results;

S4:利用动作分割模型进行外观检测动作行为切割,将共享内存中获取的缓存长视频切分为孤立动作片段,标记片段的开始帧序号start和结束帧序号end,通过消息转发至动作识别模型;S4: Use the action segmentation model to perform appearance detection action behavior segmentation, segment the cached long video obtained in the shared memory into isolated action segments, mark the start frame number start and end frame number end of the segments, and forward them to the action recognition model through messages;

S5:动作识别模型根据消息转发获取的开始帧序号start和结束帧序号end,从共享内存获取该孤立动作片段图像数据,使用时空区域注意力模型和大卷积核进行视频序列的时空特征提取,进而完成动作分类,输出识别结果并对结果进行消息转发;S5: The action recognition model obtains the image data of the isolated action clip from the shared memory based on the start frame number start and the end frame number end obtained by message forwarding, and uses the spatiotemporal regional attention model and large convolution kernel to extract spatiotemporal features of the video sequence. Then complete the action classification, output the recognition results and forward the results message;

S6:后台显示系统通过socket通信,接收前端边缘设备的检测、追踪、行为分割和动作识别结果,并在实时视频上进行结果叠加展示;S6: The background display system receives the detection, tracking, behavior segmentation and action recognition results of the front-end edge device through socket communication, and overlays the results on the real-time video display;

S7:识别结果自动缓存至存储设备中，并与原训练数据融合生成新的数据集，由定时任务触发，使用增量后的数据集进行动作识别模型训练，并将最新的动作识别模型文件自动进行替换。S7: The recognition results are automatically cached in the storage device and merged with the original training data to generate a new data set. Triggered by a scheduled task, the incremented data set is used to train the action recognition model, and the latest action recognition model file automatically replaces the old one.

进一步地，所述S1中，所述原始编码视频为I={Is}T，Is表示第s帧，s表示每一帧的索引值，T表示第s帧的时间戳；相机获取第s张大小为M的实时图像Is，并将实时图像写入预先申请的大小为N的共享内存中，其中该共享内存总共可写入n张图像，即N=nM。Further, in S1, the original encoded video is I = {Is}T, where Is denotes the s-th frame, s denotes the index of each frame, and T denotes the timestamp of the s-th frame. The camera acquires the s-th real-time image Is of size M and writes it into a pre-allocated shared memory of size N, where the shared memory can hold n images in total, i.e., N = nM.

进一步地,所述S2中,所述共享内存中循环读取指目标检测模型在相机写入共享内存后,根据队列传输的帧图片数据的帧序列信息,从共享内存所在位置读取相应图片数据。Further, in S2, the cyclic reading in the shared memory means that after the camera writes the shared memory, the target detection model reads the corresponding picture data from the location of the shared memory according to the frame sequence information of the frame picture data transmitted by the queue. .

进一步地,所述S2中,所述帧图像数据的对象检测结果计算,如下式所示:Further, in the S2, the object detection result of the frame image data is calculated as follows:

Dt = F(It)

式中，Dt表示当前帧It的对象检测结果，具体表示为[c,w,h,x,y,thresh]，其中c表示对象类别，w表示检测框宽，h表示检测框高，x表示检测框中心点x轴坐标，y表示检测框中心点y轴坐标，thresh表示检测框的置信度；t表示每一帧的图像数据在共享内存的索引值；函数F表示当前帧It经damo-yolo目标检测模型的运算过程。In the formula, Dt denotes the object detection result of the current frame It, expressed as [c, w, h, x, y, thresh], where c is the object category, w is the width of the detection box, h is the height of the detection box, x is the x-axis coordinate of the detection box center, y is the y-axis coordinate of the detection box center, and thresh is the confidence of the detection box; t is the index of the frame's image data in the shared memory; the function F denotes the operation of the damo-yolo object detection model on the current frame It.

进一步地,所述S3中,所述融合对象实时位置及纹理变化进行多目标跟踪处理的计算过程,如下式所示:Further, in S3, the calculation process of multi-target tracking processing of the real-time position and texture changes of the fusion object is as follows:

Xt = G(It, Dt)

式中，Xt表示当前帧It的对象追踪结果，具体表现为[id,c,w,h,x,y,thresh]，其中id表示追踪算法分配的对象id，c表示对象类别，w表示追踪检测框的宽，h表示追踪检测框的高，x表示追踪检测框中心点x轴坐标，y表示追踪检测框中心点y轴坐标，thresh表示追踪检测框的置信度；t表示每一帧的图像数据在共享内存的索引值；函数G表示当前帧It以及该帧的检测结果Dt经目标追踪算法的运算过程。In the formula, Xt denotes the object tracking result of the current frame It, expressed as [id, c, w, h, x, y, thresh], where id is the object id assigned by the tracking algorithm, c is the object category, w is the width of the tracking box, h is the height of the tracking box, x is the x-axis coordinate of the tracking box center, y is the y-axis coordinate of the tracking box center, and thresh is the confidence of the tracking box; t is the index of the frame's image data in the shared memory; the function G denotes the operation of the object tracking algorithm on the current frame It and its detection result Dt.

进一步地,所述S4中,所述的动作分割模型进行外观检测动作行为切割的计算过程,如下式所示:Further, in the S4, the action segmentation model performs the calculation process of appearance detection action behavior segmentation, as shown in the following formula:

{L1, …, Ln} = trim(Vm)

(Istart, Iend) = P({L1, …, Ln})

式中,Vm表示从共享内存中获取的缓存长视频数据;函数trim表示当前缓存视频Vm经动作分割算法的运算过程;Li表示缓存长视频每一帧所输出的动作类别,i表示系统设定的类别数量;函数P表示根据帧数据动作类别Li获得孤立动作开始帧Istart和结束帧Iend的运算过程。In the formula, V m represents the cached long video data obtained from the shared memory; the function trim represents the operation process of the current cached video V m through the action segmentation algorithm; L i represents the action category output by each frame of the cached long video, and i represents The number of categories set by the system; function P represents the operation process of obtaining the isolated action start frame I start and end frame I end based on the frame data action category Li .

进一步地,所述S5中,所述使用时空区域注意力模型和大卷积核进行视频序列的时空特征提取,进而完成动作分类的计算过程,如下式所示:Further, in S5, the spatiotemporal regional attention model and large convolution kernel are used to extract spatiotemporal features of the video sequence, and then complete the calculation process of action classification, as shown in the following formula:

Y = W(Istart, …, Iend)

式中,Y表示动作识别预测结果,Istart,…,Iend表示输入图像为以动作分割算法界定的开始点和结束点的所有图片数据,从共享内存中获取;函数W表示当前输入帧序列经动作识别算法的运算过程。In the formula, Y represents the action recognition prediction result, I start ,...,I end represents all picture data of the input image as the start point and end point defined by the action segmentation algorithm, obtained from the shared memory; the function W represents the current input frame sequence The operation process of the action recognition algorithm.

进一步地,所述S6中,后台显示系统通过socket通信,接收前端边缘设备的检测、追踪、行为分割和动作识别结果,同时从共享内存中获取图片,将图片时间戳与检测信息时间戳对齐,在实时视频上进行结果叠加展示。Further, in S6, the background display system receives the detection, tracking, behavior segmentation and action recognition results of the front-end edge device through socket communication, obtains pictures from the shared memory at the same time, and aligns the picture timestamp with the detection information timestamp. Overlay results on live video.

进一步地,所述S7中,系统将识别结果自动缓存至存储设备中,并与原训练数据融合生成新的数据集,由定时任务触发,使用增量后的数据集进行动作识别模型训练,并将最新的动作识别模型文件自动进行替换。Further, in S7, the system automatically caches the recognition results to the storage device, and merges them with the original training data to generate a new data set, which is triggered by the scheduled task, and uses the incremental data set for action recognition model training, and Automatically replace the latest motion recognition model files.

本发明的另一发明目的,在于提供一种装置,包括处理器、通信接口、存储器、显示器和通信总线,其中,处理器,通信接口,存储器,显示器通过通信总线完成相互间的通信;Another object of the present invention is to provide a device that includes a processor, a communication interface, a memory, a display, and a communication bus, wherein the processor, the communication interface, the memory, and the display complete communication with each other through the communication bus;

存储器,用于存放计算机程序;Memory, used to store computer programs;

显示器,用于显示处理器的执行结果和实时图像;A display used to display the execution results and real-time images of the processor;

处理器,用于执行存储器上所存放的程序时,实现上述方法步骤。The processor is used to implement the above method steps when executing the program stored in the memory.

本发明的有益效果:Beneficial effects of the present invention:

(1)本发明视频输入帧的存取方式，将视频图像存储至共享内存中，相机采样以及各个算法模型按照时间顺序访问共享内存。相机向共享内存循环写入，各算法模型有序的从共享内存中读取帧图像数据并进行相应的处理，各模型间只传输帧图像处理结果，如帧在共享内存的存储位置、时间戳等少量数据。有效的减小了各个模型因图像等大数据传输、数据读取造成的延迟问题。(1) In the video input frame access scheme of the present invention, video images are stored in shared memory, and the camera sampling process and each algorithm model access the shared memory in chronological order. The camera writes to the shared memory cyclically, and each algorithm model reads frame image data from the shared memory in an orderly manner and processes it accordingly; only a small amount of data is transmitted between models, such as frame processing results, the storage location of the frame in the shared memory and its timestamp. This effectively reduces the delays that large data transfers and reads, such as of images, would otherwise cause for each model.

(2)本发明共享内存中循环读取指各算法模型在相机写入共享内存后,根据图片帧在共享内存的存储位置读取图片。若读取位置到达共享内存队尾,则从队头开始重新读取。目标检测模型从共享内存中读取图片处理,通过队列传输帧在共享内存的存储位置、检测结果等少量信息给目标追踪模型。目标追踪模型根据帧在共享内存的存储位置从共享内存中读取图片,并结合检测结果进行处理,通过队列传输帧在共享内存的存储位置、追踪结果等少量信息。动作分割模型从共享内存中获得缓存长视频数据进行处理,通过队列传输孤立动作开始帧Istart和结束帧Iend。动作识别模型根据传输的孤立动作开始帧序号start和结束帧序号end从共享内存中读取图片序列作为模型输入。(2) Loop reading in the shared memory of the present invention means that each algorithm model reads the picture from the storage location of the shared memory according to the picture frame after the camera writes it into the shared memory. If the read position reaches the end of the shared memory queue, read again from the head of the queue. The target detection model reads images from the shared memory for processing, and transmits a small amount of information such as the storage location of the frame in the shared memory and detection results to the target tracking model through the queue. The target tracking model reads pictures from the shared memory according to the storage location of the frame in the shared memory, and processes it based on the detection results. It transmits a small amount of information such as the storage location of the frame in the shared memory and tracking results through the queue. The action segmentation model obtains the cached long video data from the shared memory for processing, and transmits the isolated action start frame I start and end frame I end through the queue. The action recognition model reads the image sequence from the shared memory as model input based on the transmitted isolated action start frame number start and end frame number end.

(3)本发明引入目标追踪模型,将对象目标追踪的id与该对象的外观检测行为识别结果相对应,有效的解决了狭小空间密集对象在外观检测后多个结果难以对应的问题,做到了密集对象与检测结果的动态绑定。(3) The present invention introduces a target tracking model, and corresponds the ID of the object target tracking with the appearance detection behavior recognition result of the object, effectively solving the problem that multiple results of dense objects in a small space are difficult to correspond after appearance detection, and achieves Dynamic binding of dense objects and detection results.

(4)本发明通过缓存识别数据对数据集进行更新，并使用定时任务，在新数据集的基础上对模型进行迭代训练，并对系统中的动作识别模型文件进行更新，有效提高了模型的精准性和鲁棒性。(4) The present invention updates the data set by caching the recognition data, uses scheduled tasks to iteratively train the model on the basis of the new data set, and updates the action recognition model file in the system, effectively improving the accuracy and robustness of the model.

(5)本发明相比现有技术，在不增加额外硬件设备，使用现有2D相机对狭小空间密集对象外观检测行为进行采样的基础上，利用计算机视觉算法自动判定当前检测行为是否合规，有效的减少了人力物力，大幅增加了工作效率和检验的客观性。(5) Compared with the prior art, the present invention samples the appearance detection behavior for dense objects in a small space with existing 2D cameras, without adding extra hardware, and uses computer vision algorithms to automatically determine whether the current detection behavior is compliant, effectively reducing manpower and material costs and greatly increasing work efficiency and the objectivity of inspection.

附图说明Description of the drawings

图1是本发明实施例面向狭小空间密集对象的外观检测动作识别系统框架。Figure 1 is the framework of an appearance detection action recognition system for dense objects in a small space according to an embodiment of the present invention.

图2是本发明实施例系统数据交互示意图。Figure 2 is a schematic diagram of system data interaction according to an embodiment of the present invention.

图3是本发明实施例构建的系统框架部署示意图。Figure 3 is a schematic diagram of the deployment of the system framework constructed according to the embodiment of the present invention.

图4是本发明实施例物体外观检测逻辑处理流程图。Figure 4 is a flow chart of object appearance detection logic processing according to the embodiment of the present invention.

具体实施方式Detailed ways

下面结合附图和实施例对本发明作进一步详细说明。The present invention will be further described in detail below in conjunction with the accompanying drawings and examples.

如图1至图3所示，本发明的实施例提供了一种面向狭小空间密集对象的外观检测动作识别装置和方法，相机、目标检测模型、目标追踪模型、动作分割模型、动作识别模型、界面展示按照时间顺序串行排列执行，并行访问共享内存数据。As shown in Figures 1 to 3, embodiments of the present invention provide an appearance detection action recognition device and method for dense objects in a small space, in which the camera, object detection model, object tracking model, action segmentation model, action recognition model and interface display are executed serially in chronological order while accessing the shared-memory data in parallel.

具体包括以下步骤:Specifically, it includes the following steps:

步骤1,相机通过组播方式向边缘设备和后台显示系统发送原始编码视频,向共享内存中实时写入帧图像数据,便于后续算法模型根据帧序列信息有序获取对应图像数据。相机仅向目标检测模型通过队列传输图像在共享内存的位置和图像时间戳,从而减少模型间传输图像数据带来的延时时间。边缘设备为模型运行、训练所需的服务器,后台显示系统包括可视化界面以及显示器。Step 1: The camera sends the original encoded video to the edge device and the background display system through multicast, and writes the frame image data to the shared memory in real time, so that the subsequent algorithm model can obtain the corresponding image data in an orderly manner based on the frame sequence information. The camera only transmits the location of the image in the shared memory and the image timestamp to the target detection model through the queue, thereby reducing the delay time caused by transmitting image data between models. The edge device is the server required for model running and training, and the background display system includes a visual interface and a monitor.
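
The circular shared-memory write described in this step might look like the following minimal sketch; the shared-memory name "frame_ring", the buffer of 64 slots and the frame shape are illustrative assumptions, not values fixed by the invention.

```python
# Hypothetical sketch: the camera process writes frames into a ring buffer in shared
# memory and passes only (slot index, timestamp) to the detection model's queue.
import time
import numpy as np
from multiprocessing import shared_memory, Queue

FRAME_SHAPE = (1080, 1920, 3)             # assumed frame size M
N_SLOTS = 64                              # n frames kept in the ring buffer
FRAME_BYTES = int(np.prod(FRAME_SHAPE))   # bytes per uint8 frame

shm = shared_memory.SharedMemory(create=True, size=N_SLOTS * FRAME_BYTES,
                                 name="frame_ring")
ring = np.ndarray((N_SLOTS, *FRAME_SHAPE), dtype=np.uint8, buffer=shm.buf)
frame_queue: Queue = Queue()

def write_frame(frame: np.ndarray, frame_idx: int) -> None:
    """Write one frame; wrap back to the head of the queue when the tail is reached."""
    slot = frame_idx % N_SLOTS             # circular write position
    ring[slot] = frame                     # overwrite the oldest frame in place
    frame_queue.put((slot, time.time()))   # only metadata crosses the queue
```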

步骤2,根据相机生成并传输的帧序列信息从共享内存中循环读取,获得对应帧图像数据作为输入,使用damo-yolo目标检测模型对从共享内存中取出的帧图像数据进行目标检测,并向目标追踪模型输出检测框坐标,对象类别等检测结果。Step 2: Read cyclically from the shared memory according to the frame sequence information generated and transmitted by the camera, obtain the corresponding frame image data as input, use the damo-yolo target detection model to perform target detection on the frame image data taken out from the shared memory, and Output detection results such as detection frame coordinates and object categories to the target tracking model.
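
A companion sketch of the cyclic read on the detection side, under the same assumptions as above; `run_damo_yolo` is a hypothetical placeholder for the actual damo-yolo inference call and is not part of the original text.

```python
# Hypothetical sketch: the detection process attaches to the same ring buffer,
# reads each frame by the slot index received from the queue, and forwards only
# light-weight metadata plus detection results to the tracking model.
import numpy as np
from multiprocessing import shared_memory, Queue

N_SLOTS, FRAME_SHAPE = 64, (1080, 1920, 3)   # must match the writer's assumptions

def detection_loop(frame_queue: Queue, track_queue: Queue) -> None:
    shm = shared_memory.SharedMemory(name="frame_ring")   # attach, no copy
    ring = np.ndarray((N_SLOTS, *FRAME_SHAPE), dtype=np.uint8, buffer=shm.buf)
    while True:
        slot, timestamp = frame_queue.get()                # frame sequence info only
        frame = ring[slot]                                 # zero-copy view of the frame
        detections = run_damo_yolo(frame)                  # hypothetical inference call
        track_queue.put({"slot": slot, "timestamp": timestamp,
                         "detections": detections})        # light-weight metadata only
```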

步骤21,将原始编码视频数据I={It}T输入到damo-yolo目标检测模型的前置处理中,It表示第t帧,t表示每一帧的索引值,T表示时间戳。对于输入的静态图片进行缩放、裁剪、降噪等处理,以适应模型的输入要求。Step 21, input the original encoded video data I = {I t } T into the pre-processing of the damo-yolo target detection model, where I t represents the t-th frame, t represents the index value of each frame, and T represents the timestamp. The input static images are scaled, cropped, denoised, etc. to adapt to the input requirements of the model.

步骤22,将裁剪后的每个小块图片输入到卷积神经网络进行特征提取,将每个特征图连接形成一个整张图片的特征图。Step 22: Input each cropped small image into a convolutional neural network for feature extraction, and connect each feature map to form a feature map of the entire image.

步骤23,对大的特征图应用一个全连接层,以输出一个13x13x(5+K)的张量,其中K是类别数,5是边界框参数(中心坐标、宽度、高度和置信度)。对每个13x13的单元格,根据边界框参数和类别置信度,生成一个预测边界框。对预测边界框进行阈值过滤和非极大值抑制,以得到最终的检测结果。Step 23, apply a fully connected layer to the large feature map to output a 13x13x (5+K) tensor, where K is the number of categories and 5 is the bounding box parameters (center coordinates, width, height and confidence). For each 13x13 cell, a predicted bounding box is generated based on the bounding box parameters and class confidence. Perform threshold filtering and non-maximum suppression on the predicted bounding box to obtain the final detection result.
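
The threshold filtering and non-maximum suppression mentioned here can be sketched as follows; the confidence and IoU thresholds are illustrative assumptions.

```python
# Hypothetical sketch: confidence filtering followed by greedy non-maximum suppression.
import numpy as np

def filter_and_nms(boxes: np.ndarray, scores: np.ndarray,
                   conf_thresh: float = 0.25, iou_thresh: float = 0.45) -> list[int]:
    """boxes: (N, 4) as [x1, y1, x2, y2]; scores: (N,). Returns indices of kept boxes."""
    order = np.where(scores >= conf_thresh)[0]            # threshold filtering
    order = order[scores[order].argsort()[::-1]]          # highest confidence first
    kept = []
    while order.size > 0:
        i, rest = order[0], order[1:]
        kept.append(int(i))
        xx1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        yy1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        xx2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        yy2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.clip(xx2 - xx1, 0, None) * np.clip(yy2 - yy1, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_r = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_i + area_r - inter + 1e-9)
        order = rest[iou <= iou_thresh]                    # suppress overlapping boxes
    return kept
```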

步骤3,根据damo-yolo目标检测模型传递的帧序列信息从共享内存中循环读取帧图像数据,结合检测结果,融合对象实时位置及纹理变化进行多目标跟踪处理并分配对象id,输出密集对象追踪坐标及id等追踪结果Step 3: Cyclically read frame image data from the shared memory according to the frame sequence information passed by the damo-yolo target detection model, combine the detection results, fuse the real-time position and texture changes of the object for multi-target tracking processing and assign object ids, and output dense objects Tracking results such as coordinates and IDs

步骤31,根据目标检测模型传输的帧序列信息从共享内存中获取原始编码视频I={It}T输入追踪模型处理。Step 31: Obtain the original encoded video I={I t } T from the shared memory according to the frame sequence information transmitted by the target detection model and input it to the tracking model for processing.

步骤32,根据damo-yolo目标检测模型传输过来的检测框等信息,使用特征提取网络对每个边框内的图像进行特征提取,得到每个目标的特征向量。Step 32: Based on the detection frame and other information transmitted from the damo-yolo target detection model, use the feature extraction network to extract features from the images in each frame to obtain the feature vector of each target.

步骤33,使用卡尔曼滤波器(Kalman Filter)对每个目标的状态(位置和速度)进行预测,使用8维的状态空间向量表示为x=[cx,cy,w,h,vx,vy,vr,vh],各个速度值初始化为0。作为物体状态的直接观测模型,其中(cx,cy)代表检测框的中心点坐标,宽w,高h以及各自对应在图像坐标上的相对速度。预测即基于track在t-1时刻的状态来预测其在t时刻的状态,计算公式如下:Step 33: Use Kalman Filter to predict the state (position and velocity) of each target, using an 8-dimensional state space vector expressed as x=[cx,cy,w,h,vx,vy, vr, vh], each speed value is initialized to 0. As a direct observation model of the object state, (cx, cy) represents the center point coordinates of the detection frame, the width w, the height h and the relative speed corresponding to the image coordinates. Prediction is to predict the state of the track at time t based on its state at time t-1. The calculation formula is as follows:

x′ = F x    (1)

P′ = F P F^T + Q    (2)

在公式1中,x为track在t-1时刻的均值,F称为状态转移矩阵,该公式预测t时刻的x':In Formula 1, x is the mean value of track at time t-1, and F is called the state transition matrix. This formula predicts x' at time t:

矩阵F中的dt是当前帧和前一帧之间的差。dt in matrix F is the difference between the current frame and the previous frame.

在公式2中,P为track在t-1时刻的协方差,Q为系统的噪声矩阵,代表整个系统的可靠程度,一般初始化为很小的值,该公式预测t时刻的P'。In Formula 2, P is the covariance of track at time t-1, Q is the noise matrix of the system, which represents the reliability of the entire system. It is generally initialized to a very small value. This formula predicts P' at time t.

步骤34,使用匈牙利算法(Hungarian Algorithm)对预测的状态和检测的边界框进行匹配,根据匹配结果更新追踪器的状态。Step 34: Use the Hungarian Algorithm to match the predicted state and the detected bounding box, and update the tracker state according to the matching result.

步骤35,对于未匹配的边界框,使用余弦距离(Cosine Distance)计算它们与已有追踪器的特征向量之间的相似度,如果相似度高于某个阈值,则将其分配给对应的追踪器,否则创建新的追踪器。对于未匹配的追踪器,如果它们连续多帧没有匹配到边界框,则将其删除。Step 35: For unmatched bounding boxes, use cosine distance (Cosine Distance) to calculate the similarity between them and the feature vector of the existing tracker. If the similarity is higher than a certain threshold, it is assigned to the corresponding tracking tracker, otherwise create a new tracker. For unmatched trackers, if they do not match a bounding box for multiple consecutive frames, they are deleted.
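
A minimal sketch of this matching step, using cosine distance as the cost and the Hungarian algorithm for assignment; SciPy availability and the cost threshold are assumptions.

```python
# Hypothetical sketch: match detections to existing tracks by appearance similarity.
import numpy as np
from scipy.optimize import linear_sum_assignment

def match(track_feats: np.ndarray, det_feats: np.ndarray, max_cost: float = 0.4):
    """track_feats: (T, D), det_feats: (N, D) appearance vectors."""
    t = track_feats / np.linalg.norm(track_feats, axis=1, keepdims=True)
    d = det_feats / np.linalg.norm(det_feats, axis=1, keepdims=True)
    cost = 1.0 - t @ d.T                          # cosine distance matrix (T, N)
    rows, cols = linear_sum_assignment(cost)      # Hungarian assignment
    matches = [(r, c) for r, c in zip(rows, cols) if cost[r, c] <= max_cost]
    unmatched_dets = set(range(det_feats.shape[0])) - {c for _, c in matches}
    unmatched_tracks = set(range(track_feats.shape[0])) - {r for r, _ in matches}
    return matches, unmatched_dets, unmatched_tracks   # new/deleted trackers handled by caller
```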

步骤36,使用卡尔曼滤波进行位置更新。更新即为基于t时刻检测到的detection,校正与其关联的track的状态,得到一个更精确的结果,计算公式如下:Step 36, use Kalman filter for position update. The update is based on the detection detected at time t, correcting the status of the track associated with it, and obtaining a more accurate result. The calculation formula is as follows:

y = z - Hx′    (3)

S = H P′ H^T + R    (4)

K = P′ H^T S^(-1)    (5)

x = x′ + Ky    (6)

P = (I - KH) P′    (7)

在公式3中,z为detection的均值向量,不包含速度变化值,即z=[cx,cy,w,h],H称为测量矩阵,它将track的均值向量x'映射到检测空间,该公式计算detection和track的均值误差;In Formula 3, z is the mean vector of detection, which does not include the speed change value, that is, z = [cx, cy, w, h]. H is called the measurement matrix, which maps the mean vector x' of track to the detection space. This formula calculates the mean error of detection and track;

在公式4中，R为检测器的噪声矩阵，它是一个4x4的对角矩阵，对角线上的值分别为中心点两个坐标以及宽高的噪声，以任意值初始化，一般设置宽高的噪声大于中心点的噪声，该公式先将协方差矩阵P'映射到检测空间，然后再加上噪声矩阵R；In Formula 4, R is the noise matrix of the detector. It is a 4x4 diagonal matrix whose diagonal values are the noise of the two center-point coordinates and of the width and height. It can be initialized with arbitrary values, and the width and height noise is generally set larger than the center-point noise. This formula first maps the covariance matrix P′ to the detection space and then adds the noise matrix R;

公式5计算卡尔曼增益K,卡尔曼增益用于估计误差的重要程度;Formula 5 calculates the Kalman gain K, which is used to estimate the importance of the error;

公式6和公式7得到更新后的均值向量x和协方差矩阵P。Formula 6 and Formula 7 obtain the updated mean vector x and covariance matrix P.
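
The prediction and update cycle of formulas (1) through (7) can be written compactly as in the sketch below; the concrete values of F, H, Q and R follow the description above, but the specific numbers are assumptions.

```python
# Hypothetical sketch of the Kalman predict/update cycle for the 8-dimensional
# state described in step 33 (centre, size and their velocities).
import numpy as np

def make_matrices(dt: float = 1.0):
    F = np.eye(8)
    F[:4, 4:] = dt * np.eye(4)            # constant-velocity state transition
    H = np.eye(4, 8)                      # measurement matrix: observe [cx, cy, w, h]
    Q = 1e-2 * np.eye(8)                  # small process noise (system reliability)
    R = np.diag([1.0, 1.0, 10.0, 10.0])   # width/height noise larger than centre noise
    return F, H, Q, R

def predict(x, P, F, Q):
    x = F @ x                             # (1)  x' = F x
    P = F @ P @ F.T + Q                   # (2)  P' = F P F^T + Q
    return x, P

def update(x, P, z, H, R):
    y = z - H @ x                         # (3)  innovation
    S = H @ P @ H.T + R                   # (4)  innovation covariance
    K = P @ H.T @ np.linalg.inv(S)        # (5)  Kalman gain
    x = x + K @ y                         # (6)  corrected mean
    P = (np.eye(P.shape[0]) - K @ H) @ P  # (7)  corrected covariance
    return x, P
```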

步骤4,利用动作分割模型进行外观检测动作行为切割,将缓存长视频切分为孤立动作片段,标记片段的开始帧序号start和结束帧序号end,通过消息转发至动作识别模型。Step 4: Use the action segmentation model to perform appearance detection action behavior segmentation, segment the cached long video into isolated action segments, mark the start frame number start and end frame number end of the segments, and forward them to the action recognition model through messages.

步骤41，从共享内存中获取缓存长视频，使用一个预测生成阶段(Prediction Generation Stage)来生成初始的动作分割预测。这个阶段使用了一个双重扩张层(Dual Dilated Layer, DDL)，可以同时捕获大范围和小范围的时序信息；Step 41: Obtain the cached long video from the shared memory and use a Prediction Generation Stage to generate the initial action segmentation prediction. This stage uses a Dual Dilated Layer (DDL), which can capture both long-range and short-range temporal information;

步骤42，使用多个精炼阶段(Refinement Stage)来逐步改进动作分割预测。每个精炼阶段使用了一个自监督时序卷积网络(Self-Supervised Temporal Convolutional Network, SS-TCN)，可以利用前后帧的信息来修正当前帧的预测；Step 42: Use multiple Refinement Stages to progressively improve the action segmentation prediction. Each refinement stage uses a Self-Supervised Temporal Convolutional Network (SS-TCN), which can use information from the preceding and following frames to correct the prediction for the current frame;

步骤43,将所有阶段的输出进行平均,得到最终的动作分割结果,为每一帧图片分配动作类别标签。根据动作类别标签进行动作分割,标记孤立动作片段的开始帧序号start和结束帧序号end。Step 43: average the outputs of all stages to obtain the final action segmentation result, and assign an action category label to each frame of picture. Action segmentation is performed based on the action category label, and the start frame number start and the end frame number end of the isolated action fragment are marked.

步骤44,通过消息转发,将开始帧序号start和结束帧序号end传送到动作识别模型。Step 44: Send the start frame number start and the end frame number end to the action recognition model through message forwarding.
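
Turning the per-frame action labels of step 43 into isolated segments with start and end frame numbers might look like the sketch below; the background label value is an assumption.

```python
# Hypothetical sketch: convert per-frame action labels into isolated action segments,
# each marked with its start and end frame numbers, ready for message forwarding.
def labels_to_segments(frame_labels: list[int], background: int = 0):
    """frame_labels[i] is the action class of frame i; returns (label, start, end) tuples."""
    segments, start = [], None
    for i, label in enumerate(frame_labels):
        prev = frame_labels[i - 1] if i > 0 else background
        if label != background and label != prev:
            if start is not None:                        # close the previous segment
                segments.append((prev, start, i - 1))
            start = i                                    # open a new segment
        elif label == background and start is not None:
            segments.append((prev, start, i - 1))
            start = None
    if start is not None:                                # segment running to the last frame
        segments.append((frame_labels[-1], start, len(frame_labels) - 1))
    return segments
```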

步骤5,动作识别模型根据消息转发获取的开始帧序号start和结束帧序号end,从共享内存获取该孤立动作片段图像数据,使用时空区域注意力模型和大卷积核进行视频序列的时空特征提取,进而完成动作分类,输出识别结果并对结果进行消息转发。Step 5: The action recognition model obtains the image data of the isolated action clip from the shared memory based on the start frame number start and the end frame number end obtained by message forwarding, and uses the spatiotemporal regional attention model and large convolution kernel to extract spatiotemporal features of the video sequence. , and then complete the action classification, output the recognition results and forward the results message.

步骤51，基于得到的开始帧序号start和结束帧序号end，从共享内存中获取该动作片段的所有图像数据，按照模型设定的NUM_FRAMES对图像进行平均采样，采样完成后对图片进行随机缩放和裁剪。Step 51: Based on the obtained start frame number start and end frame number end, obtain all the image data of the action clip from the shared memory, uniformly sample the images according to the NUM_FRAMES set by the model, and after sampling randomly scale and crop the images.
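
The uniform sampling of NUM_FRAMES frames from a segment referred to here might be implemented roughly as follows; NUM_FRAMES = 32 is an assumed value.

```python
# Hypothetical sketch: evenly sample NUM_FRAMES frame indices between start and end.
import numpy as np

NUM_FRAMES = 32  # assumed value set by the model configuration

def sample_indices(start: int, end: int, num_frames: int = NUM_FRAMES) -> list[int]:
    """Return num_frames indices spread uniformly over [start, end] (inclusive)."""
    return np.linspace(start, end, num_frames).round().astype(int).tolist()
```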

步骤52,对每个片段,从慢路径和快路径分别采样不同数量的帧(例如8帧和32帧)。对每个路径,使用3D卷积神经网络(例如ResNet-50)对采样的帧进行特征提取,获得特征图。Step 52: For each segment, sample different numbers of frames (for example, 8 frames and 32 frames) from the slow path and the fast path respectively. For each path, a 3D convolutional neural network (such as ResNet-50) is used to extract features from the sampled frames to obtain a feature map.

步骤53,使用空间注意力机制(Spatial attention)实现特征图中显著性区域选择。它通过对输入特征图计算注意力分数,并将该分数作为输入特征图不同通道在同一像素位置的值的权重系数,生成只包含显著性区域的输出特征图。Spatial attention注意力分数计算公式如下:Step 53: Use spatial attention mechanism (Spatial attention) to select salient areas in the feature map. It generates an output feature map that only contains salient areas by calculating the attention score for the input feature map and using the score as the weight coefficient of the values of different channels of the input feature map at the same pixel position. The calculation formula of Spatial attention attention score is as follows:

Ms(F) = f(convk(concatenation(MaxPooling(F), AveragePooling(F))))

其中，F为输入特征图，尺寸为(B,C,T,H,W)，其中B指的是批处理尺寸，C指的是通道数，T指的是时间维数，H指的是高度，W指的是宽度。上式中的MaxPooling(·)和AveragePooling(·)分别指的是对F在通道维度上使用最大池化和平均池化，从而得到两个尺寸为(B,1,T,H,W)的张量，再经过concatenation(·)操作得到尺寸为(B,2,T,H,W)的张量。上式中的convk(·)指卷积核的宽和高，具体实现为k*k的三维卷积操作，k作为超参数可以根据实际情况调节。上式中f(·)可以是任意的激活函数，在本发明中，选取的是sigmoid激活函数。由此获得空间权重Ms(F)。Ms(F)中每一个像素的值代表输入特征F中对应位置的重要性，在模型中应用该模块能够起到抑制无关背景信息的作用。Here, F is the input feature map of size (B, C, T, H, W), where B is the batch size, C is the number of channels, T is the temporal dimension, H is the height and W is the width. MaxPooling(·) and AveragePooling(·) in the above formula denote max pooling and average pooling applied to F along the channel dimension, yielding two tensors of size (B, 1, T, H, W), which are then combined by the concatenation(·) operation into a tensor of size (B, 2, T, H, W). convk(·) in the above formula refers to the width and height of the convolution kernel and is implemented as a k*k three-dimensional convolution; k is a hyperparameter that can be adjusted according to the actual situation. f(·) can be any activation function; in the present invention the sigmoid activation function is chosen. The spatial weight Ms(F) is thus obtained. The value of each pixel in Ms(F) represents the importance of the corresponding position in the input feature F, and applying this module in the model helps suppress irrelevant background information.
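
A sketch of the spatial attention weight defined by the formula above, written with PyTorch as an assumed framework; k = 7 and the purely spatial (1, k, k) kernel are illustrative choices.

```python
# Hypothetical PyTorch sketch of the spatial attention weight Ms(F):
# channel-wise max and average pooling, concatenation, a k*k convolution, then sigmoid.
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    def __init__(self, k: int = 7):
        super().__init__()
        # 2 input channels (max + avg maps) -> 1 output channel; kernel acts on H and W
        self.conv = nn.Conv3d(2, 1, kernel_size=(1, k, k), padding=(0, k // 2, k // 2))

    def forward(self, F: torch.Tensor) -> torch.Tensor:
        # F: (B, C, T, H, W)
        max_map, _ = F.max(dim=1, keepdim=True)    # (B, 1, T, H, W)
        avg_map = F.mean(dim=1, keepdim=True)      # (B, 1, T, H, W)
        m = torch.cat([max_map, avg_map], dim=1)   # (B, 2, T, H, W)
        return torch.sigmoid(self.conv(m))         # spatial weight Ms(F), (B, 1, T, H, W)
```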

步骤54,使用空间权重Ms(F)对输入特征F做掩膜处理,具体公式如下:Step 54: Use the spatial weight M s (F) to mask the input feature F. The specific formula is as follows:
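
Under the standard spatial-attention formulation, a sketch of this masking step (an assumed form) is the element-wise product

F′ = Ms(F) ⊗ F

where the spatial weight Ms(F) is broadcast across the channel dimension of the input feature F.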

经掩膜处理后,输入特征的通道数不会发生改变。After masking, the number of channels of the input features will not change.

步骤55，在特定的层次上，使用双向通道连接(Bidirectional Channel-wise Connections)将慢路径和快路径的特征图进行融合。对每个路径，使用全局平均池化(Global Average Pooling)将特征压缩为一维向量。将两个路径的特征向量拼接起来，得到最终的视频表示。Step 55: At specific levels, use Bidirectional Channel-wise Connections to fuse the feature maps of the slow path and the fast path. For each path, global average pooling is used to compress the features into a one-dimensional vector. The feature vectors of the two paths are concatenated to obtain the final video representation.

步骤56,使用全连接层(Fully Connected Layer)和Softmax层对视频表示进行分类,得到动作识别的结果。Step 56: Use the fully connected layer (Fully Connected Layer) and the softmax layer to classify the video representation to obtain action recognition results.

如图2所示,数据交互示意图详细交代了各模型间的数据传输方式以及前后端数据通信方法。As shown in Figure 2, the data interaction schematic diagram explains in detail the data transmission methods between each model and the front-end and back-end data communication methods.

步骤6,后台显示系统通过socket通信,接收前端边缘设备的检测、追踪、行为分割和动作识别结果,并在实时视频上进行结果叠加展示。Step 6: The background display system receives the detection, tracking, behavior segmentation and action recognition results of the front-end edge device through socket communication, and overlays the results on the real-time video for display.

步骤61,相机采用组播的方式向后台显示系统推送视频数据,后台调用相机驱动获取帧图像数据并写入共享内存。Step 61: The camera uses multicast to push video data to the background display system, and the background calls the camera driver to obtain the frame image data and write it into the shared memory.

步骤62,检测、追踪、动作识别等结果共用一个socket连接,通过结果类型字段区分传输数据并做相关处理。Step 62: Detection, tracking, action recognition and other results share a socket connection, and the transmission data is distinguished through the result type field and related processing is performed.
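
One way the shared socket connection with a result-type field could be organised is sketched below; the field names, JSON encoding and newline framing are assumptions for illustration.

```python
# Hypothetical sketch: every result kind travels over one socket; a "type" field
# tells the backend which handler to apply (detection, tracking, segmentation, action).
import json
import socket

def send_result(sock: socket.socket, result_type: str, payload: dict) -> None:
    """Frontend side: tag the payload and send it as a newline-delimited JSON frame."""
    message = {"type": result_type, **payload}             # e.g. result_type = "track"
    sock.sendall(json.dumps(message).encode("utf-8") + b"\n")

def dispatch(line: bytes, handlers: dict) -> None:
    """Backend side: route one received line to the handler registered for its type."""
    message = json.loads(line)
    handlers[message["type"]](message)                     # handlers supplied by the UI layer
```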

步骤63,根据追踪结果的帧序列信息从共享内存中获取帧图像数据,并获取该帧中所有对象的检测框坐标、追踪id、日志时间戳等信息。根据动作识别结果获取该对象检测动作的识别结果以及对应追踪id信息。Step 63: Obtain the frame image data from the shared memory according to the frame sequence information of the tracking result, and obtain the detection frame coordinates, tracking id, log timestamp and other information of all objects in the frame. Obtain the recognition result of the object's detection action and the corresponding tracking ID information based on the action recognition result.

步骤64，根据追踪id进行检测结果和对象位置坐标的绑定；将追踪结果时间戳与图片时间戳进行对齐，对齐后根据检测框信息进行画框等处理，并将处理后的图像在系统界面显示。Step 64: Bind the detection results to the object position coordinates according to the tracking id; align the tracking-result timestamp with the image timestamp, then draw boxes and perform other processing based on the detection box information, and display the processed image on the system interface.

步骤7,系统将识别结果自动缓存至存储设备中,并与原训练数据融合生成新的数据集,由定时任务触发,使用增量后的数据集进行动作识别模型训练,并将最新的动作识别模型文件自动进行替换。Step 7: The system automatically caches the recognition results to the storage device and merges them with the original training data to generate a new data set, which is triggered by a scheduled task. The incremental data set is used for action recognition model training, and the latest action recognition Model files are automatically replaced.
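
The dataset merge and timed retraining of step 7 could be scripted roughly as follows; the paths, file format, scheduler and the `train_action_model` entry point are illustrative assumptions.

```python
# Hypothetical sketch of the step-7 self-training cycle: merge cached recognition
# results into the training set, retrain, then swap in the new model file.
import shutil
from pathlib import Path

CACHE_DIR = Path("/data/recognition_cache")          # assumed location of cached results
DATASET_DIR = Path("/data/action_dataset")           # assumed training-set location
MODEL_FILE = Path("/models/action_recognition.pth")  # assumed deployed model path

def nightly_retrain() -> None:                        # invoked by a scheduler such as cron
    for sample in CACHE_DIR.glob("*.npz"):            # merge new samples into the dataset
        shutil.move(str(sample), DATASET_DIR / sample.name)
    new_model = train_action_model(DATASET_DIR)       # hypothetical training entry point
    shutil.copy(new_model, MODEL_FILE)                 # replace the deployed model file
```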

如图4所示,本发明的实施例提供了一种物体外观检测逻辑处理流程图,介绍了检测结果为OK或NG时系统的处理逻辑,其中,结果OK指物体外观检测符合外观检测标准,结果NG指物体外观检测行为存在不合格情况,此时需要操作员或机械臂对该物体再次进行外观检查,即物体复检。具体流程如下:As shown in Figure 4, the embodiment of the present invention provides an object appearance detection logic processing flow chart, which introduces the processing logic of the system when the detection result is OK or NG, where the result OK means that the object appearance detection meets the appearance detection standard, The result NG means that the object's appearance inspection behavior is unqualified. At this time, the operator or the robot arm needs to conduct another appearance inspection of the object, that is, the object is re-inspected. The specific process is as follows:

相机获取实时监控视频,传入前端边缘设备的目标检测模型进行目标检测处理输出检测结果,输出至目标追踪模型。目标追踪模型根据当前轨迹状态进行判断:若为NG物体复检,此时目标追踪模型所维护的轨迹状态不是稳定连续递增,出现跳增现象。系统将轨迹复位,优先从当前维护的非连续轨迹中找出缺失id进行再分配,补全后id稳定连续递增;若为新物体检测,此时当前系统维护的轨迹状态稳定递增无跳增现象,则递增生成新轨迹并分配新id。动作分割模型将连续长视频划分为孤立动作片段,传入动作识别模型进行外观检测合规性判断,并将识别结果传入后台显示系统,后台接收前端边缘设备的检测、追踪、行为分割和动作识别结果,并在实时视频上进行结果叠加展示。操作员或机械设备根据结果将NG结果的物体进行复检,重复以上操作流程直至识别结果为OK。The camera acquires real-time surveillance video, passes it into the target detection model of the front-end edge device, performs target detection processing, and outputs the detection results, which are output to the target tracking model. The target tracking model makes judgments based on the current trajectory status: If it is an NG object re-inspection, the trajectory status maintained by the target tracking model does not increase steadily and continuously, and a jump phenomenon occurs. The system resets the trajectory and gives priority to finding the missing id from the non-continuous trajectories currently maintained for redistribution. After completion, the id increases steadily and continuously; if it is a new object detection, the trajectory status currently maintained by the system increases steadily without jumping. , then a new trajectory is incrementally generated and assigned a new id. The action segmentation model divides continuous long videos into isolated action segments, and the action recognition model is passed in to determine the appearance detection compliance, and the recognition results are passed to the background display system. The background receives the detection, tracking, behavior segmentation and actions of the front-end edge device. Recognize the results and display the results overlay on the real-time video. The operator or mechanical equipment re-inspects the objects with NG results based on the results, and repeats the above operation process until the recognition result is OK.

本发明通过对各模型间的数据存取进行耗时分析,首次地提出了将视频图像存储至共享内存中,相机采样以及各个算法模型按照时间顺序,有序的从共享内存中读取帧图像数据并进行相应的处理,各模型间只传输帧序列信息,其中包含帧在共享内存中的位置、时间戳等少量数据。这种方法有效的解决了各个模型因图像等大数据传输、数据读取造成的延迟问题。By conducting time-consuming analysis of data access between models, this invention proposes for the first time to store video images in shared memory, and camera sampling and each algorithm model read frame images from the shared memory in an orderly manner in chronological order. The data is processed accordingly. Only frame sequence information is transmitted between each model, which includes a small amount of data such as the position of the frame in the shared memory, timestamp, etc. This method effectively solves the delay problem of each model caused by big data transmission and data reading such as images.

在狭小空间密集对象检测结果匹配阶段，本发明开创性的引入了目标追踪模型，提出了将目标追踪的id与该对象的外观检测行为识别结果绑定，有效的解决了狭小空间密集对象在外观检测后多个结果难以对应的问题。In the matching stage for detection results of dense objects in a small space, the present invention pioneers the introduction of an object tracking model and proposes binding the tracking id of an object to the recognition result of that object's appearance detection behavior, effectively solving the problem that, for dense objects in a small space, it is difficult to associate multiple results with the right objects after appearance detection.

本发明通过缓存识别数据对数据集进行更新，并使用定时任务，在新数据集的基础上对模型进行迭代训练，并对系统中的动作识别模型文件进行更新，有效提高了模型的精准性和鲁棒性。The present invention updates the data set by caching the recognition data, uses scheduled tasks to iteratively train the model on the basis of the new data set, and updates the action recognition model file in the system, effectively improving the accuracy and robustness of the model.

本发明相比现有技术，在不增加额外硬件设备，使用现有2D相机对密集对象外观检测行为进行采样的基础上，利用计算机视觉算法自动判定当前检测行为是否合规，有效的减少了人力物力，大幅增加了工作效率和检验的客观性。Compared with the prior art, the present invention samples dense-object appearance detection behavior with existing 2D cameras, without adding extra hardware, and uses computer vision algorithms to automatically determine whether the current detection behavior is compliant, effectively reducing manpower and material costs and greatly increasing work efficiency and the objectivity of inspection.

应该理解,上述的实施例仅是示意。本发明描述的实施例可在硬件、软件、固件、中间件、微码或者其任意组合中实现。对于硬件实现,处理单元可以在一个或者多个特定用途集成电路(ASIC)、数字信号处理器(DSP)、数字信号处理设备(DSPD)、可编程逻辑器件(PLD)、现场可编程门阵列(FPGA)、处理器、控制器、微控制器、微处理器和/或设计为执行本发明所述功能的其它电子单元或者其结合内实现。It should be understood that the above-described embodiments are only illustrative. The described embodiments of the invention may be implemented in hardware, software, firmware, middleware, microcode, or any combination thereof. For hardware implementation, the processing unit may be in one or more application specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), field programmable gate arrays ( FPGA), processor, controller, microcontroller, microprocessor and/or other electronic unit designed to perform the functions described in the present invention, or a combination thereof.

本说明书中的各个实施例均采用相关的方式描述,各个实施例之间相同相似的部分互相参见即可,每个实施例重点说明的都是与其他实施例的不同之处。尤其,对于系统实施例而言,由于其基本相似于方法实施例,所以描述的比较简单,相关之处参见方法实施例的部分说明即可。Each embodiment in this specification is described in a related manner. The same and similar parts between the various embodiments can be referred to each other. Each embodiment focuses on its differences from other embodiments. In particular, for the system embodiment, since it is basically similar to the method embodiment, the description is relatively simple. For relevant details, please refer to the partial description of the method embodiment.

以上所述仅为本发明的较佳实施例而已,并非用于限定本发明的保护范围。凡在本发明的精神和原则之内所作的任何修改、等同替换、改进等,均包含在本发明的保护范围内。The above descriptions are only preferred embodiments of the present invention and are not intended to limit the scope of the present invention. Any modifications, equivalent substitutions, improvements, etc. made within the spirit and principles of the present invention are included in the protection scope of the present invention.

Claims (10)

1.一种面向狭小空间密集对象的外观检测动作识别方法,其特征在于,包括以下步骤:1. An appearance detection action recognition method for dense objects in a small space, which is characterized by including the following steps: S1:相机通过组播方式向边缘设备和后台显示系统发送原始编码视频,向共享内存中实时写入帧图像数据;若写入位置到达共享内存队尾,则从队头开始重新写入,覆盖原帧图像数据,即共享内存的循环写入;S1: The camera sends the original encoded video to the edge device and the background display system through multicast mode, and writes the frame image data to the shared memory in real time; if the writing position reaches the end of the shared memory queue, it will be written again from the head of the queue, overwriting Original frame image data, that is, cyclic writing of shared memory; S2:根据相机生成并传输的帧序列信息从所述共享内存中循环读取,获得对应帧图像数据作为输入,该帧序列信息指定了帧图像数据在共享内存中的具体存储位置;使用damo-yolo目标检测模型对从共享内存中取出的帧图像数据进行目标检测,并向目标追踪模型输出检测框坐标,对象类别检测结果;S2: Cyclically read from the shared memory according to the frame sequence information generated and transmitted by the camera, and obtain the corresponding frame image data as input. The frame sequence information specifies the specific storage location of the frame image data in the shared memory; use damo- The yolo target detection model performs target detection on the frame image data retrieved from the shared memory, and outputs the detection frame coordinates and object category detection results to the target tracking model; S3:根据所述damo-yolo目标检测模型传递的帧序列信息从共享内存中读取帧图像数据,结合检测结果,融合对象实时位置及纹理变化进行多目标跟踪处理并分配对象id,输出密集对象追踪坐标及id的追踪结果;S3: Read the frame image data from the shared memory according to the frame sequence information passed by the damo-yolo target detection model, combine the detection results, fuse the real-time position and texture changes of the object for multi-target tracking processing and assign object ids, and output dense objects Tracking coordinates and ID tracking results; S4:利用动作分割模型进行外观检测动作行为切割,将共享内存中获取的缓存长视频切分为孤立动作片段,标记片段的开始帧序号start和结束帧序号end,通过消息转发至动作识别模型;S4: Use the action segmentation model to perform appearance detection action behavior segmentation, segment the cached long video obtained in the shared memory into isolated action segments, mark the start frame number start and end frame number end of the segments, and forward them to the action recognition model through messages; S5:动作识别模型根据消息转发获取的开始帧序号start和结束帧序号end,从共享内存获取该孤立动作片段图像数据,使用时空区域注意力模型和大卷积核进行视频序列的时空特征提取,进而完成动作分类,输出识别结果并对结果进行消息转发;S5: The action recognition model obtains the image data of the isolated action clip from the shared memory based on the start frame number start and the end frame number end obtained by message forwarding, and uses the spatiotemporal regional attention model and large convolution kernel to extract spatiotemporal features of the video sequence. Then complete the action classification, output the recognition results and forward the results message; S6:后台显示系统通过socket通信,接收前端边缘设备的检测、追踪、行为分割和动作识别结果,并在实时视频上进行结果叠加展示;S6: The background display system receives the detection, tracking, behavior segmentation and action recognition results of the front-end edge device through socket communication, and overlays the results on the real-time video display; S7:识别结果自动缓存至存储设备中,并与原训练数据融合生成新的数据集,由定时任务触发,使用增量后的数据集进行动作识别模型训练,并将最新的动作识别模型文件自动进行替换。S7: The recognition results are automatically cached in the storage device and merged with the original training data to generate a new data set, which is triggered by a scheduled task. The incremental data set is used for action recognition model training, and the latest action recognition model file is automatically Make a substitution. 
2. The appearance detection action recognition method for dense objects in a narrow space according to claim 1, characterized in that, in S1, the original encoded video is I = {I_s}_T, where I_s denotes the s-th frame, s denotes the index value of each frame, and T denotes the timestamp of the s-th frame; the camera acquires the s-th real-time image I_s of size M and writes it into a pre-allocated shared memory of size N, where the shared memory can hold n images in total, i.e. N = n·M.

3. The appearance detection action recognition method for dense objects in a narrow space according to claim 1, characterized in that, in S2, cyclic reading from the shared memory means that, after the camera has written into the shared memory, the object detection model reads the corresponding picture data from its location in the shared memory according to the frame sequence information of the frame picture data transmitted through the queue.

4. The appearance detection action recognition method for dense objects in a narrow space according to claim 1, characterized in that, in S2, the object detection result of the frame image data is computed as:

D_t = F(I_t)

where D_t denotes the object detection result of the current frame I_t, expressed as [c, w, h, x, y, thresh], in which c denotes the object category, w the width of the detection box, h the height of the detection box, x the x-axis coordinate of the detection box center, y the y-axis coordinate of the detection box center, and thresh the confidence of the detection box; t denotes the index of each frame's image data in the shared memory; the function F denotes the computation performed on the current frame I_t by the damo-yolo object detection model.
5. The appearance detection action recognition method for dense objects in a narrow space according to claim 1, characterized in that, in S3, the multi-object tracking that fuses the real-time position and texture changes of objects is computed as:

X_t = G(I_t, D_t)

where X_t denotes the object tracking result of the current frame I_t, expressed as [id, c, w, h, x, y, thresh], in which id denotes the object id assigned by the tracking algorithm, c the object category, w the width of the tracking box, h the height of the tracking box, x the x-axis coordinate of the tracking box center, y the y-axis coordinate of the tracking box center, and thresh the confidence of the tracking box; t denotes the index of each frame's image data in the shared memory; the function G denotes the computation performed on the current frame I_t and its detection result D_t by the object tracking algorithm.

6. The appearance detection action recognition method for dense objects in a narrow space according to claim 1, characterized in that, in S4, the action segmentation model cuts appearance-detection action behavior as follows:

{L_1, ..., L_n} = trim(V_m)
(I_start, I_end) = P({L_1, ..., L_n})

where V_m denotes the cached long video data obtained from the shared memory; the function trim denotes the computation performed on the currently cached video V_m by the action segmentation algorithm; L_i denotes the action category output for each frame of the cached long video, and i denotes the number of categories set by the system; the function P denotes the computation that obtains the start frame I_start and end frame I_end of an isolated action from the per-frame action categories L_i.

7. The appearance detection action recognition method for dense objects in a narrow space according to claim 1, characterized in that, in S5, the spatio-temporal features of the video sequence are extracted using the spatio-temporal region attention model and large convolution kernels, and action classification is then completed as follows:

Y = W(I_start, ..., I_end)

where Y denotes the action recognition prediction result; I_start, ..., I_end denote all the picture data of the input images between the start point and end point delimited by the action segmentation algorithm, obtained from the shared memory; the function W denotes the computation performed on the current input frame sequence by the action recognition algorithm, whose specific process is given in step 5 of the detailed description.
8. The appearance detection action recognition method for dense objects in a narrow space according to claim 1, characterized in that, in S6, the background display system receives the detection, tracking, behavior segmentation, and action recognition results of the front-end edge device via socket communication, simultaneously obtains pictures from the shared memory, aligns the picture timestamps with the detection information timestamps, and overlays the results on the real-time video for display.

9. The appearance detection action recognition method for dense objects in a narrow space according to claim 1, characterized in that, in S7, the system automatically caches the recognition results in the storage device and merges them with the original training data to generate a new data set; triggered by a scheduled task, the augmented data set is used to train the action recognition model, and the latest action recognition model file automatically replaces the previous one.

10. A device based on the method of any one of claims 1-9, characterized in that it comprises a processor, a communication interface, a memory, a display, and a communication bus, wherein the processor, the communication interface, the memory, and the display communicate with one another through the communication bus;
the memory is used to store a computer program;
the display is used to display the execution results of the processor and real-time images;
the processor is used to implement the method steps of claims 1-9 when executing the program stored in the memory.
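
The cyclic shared-memory buffer defined in steps S1-S2 of claim 1 and in claims 2-3 can be illustrated with a minimal sketch. It assumes a fixed frame geometry, a slot layout of N = n·M bytes with slot index t mod n, Python's multiprocessing.shared_memory for the buffer, and a multiprocessing.Queue carrying the frame sequence information; none of these choices are prescribed by the claims.

```python
# Minimal sketch of the cyclic shared-memory frame buffer (claims 2-3).
# Assumptions: fixed frame shape, slot index = frame_index % n, and a
# multiprocessing.Queue carrying the frame sequence information.
import numpy as np
from multiprocessing import shared_memory, Queue

H, W, C = 1080, 1920, 3          # assumed frame geometry
M = H * W * C                    # bytes per frame (uint8)
n = 64                           # number of slots
N = n * M                        # total shared-memory size, N = n * M

def create_buffer(name="frame_ring"):
    return shared_memory.SharedMemory(name=name, create=True, size=N)

def write_frame(shm, frame_index, frame, seq_queue: Queue):
    """Camera side: cyclic write, overwriting the oldest slot when full."""
    slot = frame_index % n
    view = np.ndarray((H, W, C), dtype=np.uint8,
                      buffer=shm.buf[slot * M:(slot + 1) * M])
    view[:] = frame
    # Frame sequence information: tells readers where the frame lives.
    seq_queue.put({"frame_index": frame_index, "slot": slot})

def read_frame(shm, seq_info):
    """Detector side: locate the frame from the sequence information."""
    slot = seq_info["slot"]
    return np.ndarray((H, W, C), dtype=np.uint8,
                      buffer=shm.buf[slot * M:(slot + 1) * M]).copy()
```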
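
The per-frame data flow of claims 4 and 5, D_t = F(I_t) followed by X_t = G(I_t, D_t), amounts to a simple loop over frames. In the sketch below, detector.detect() stands in for F and tracker.update() for G; both are hypothetical wrappers, not the actual damo-yolo or tracking APIs.

```python
# Sketch of the detection-then-tracking loop of claims 4 and 5.
# `detector` and `tracker` are hypothetical wrappers: detector.detect()
# stands in for F, tracker.update() for G; they are not real library calls.
from dataclasses import dataclass
from typing import List

@dataclass
class Detection:          # one D_t entry: [c, w, h, x, y, thresh]
    c: int
    w: float
    h: float
    x: float
    y: float
    thresh: float

@dataclass
class Track(Detection):   # one X_t entry adds the id assigned by the tracker
    id: int = -1

def process_frame(frame, detector, tracker) -> List[Track]:
    detections: List[Detection] = detector.detect(frame)      # D_t = F(I_t)
    tracks: List[Track] = tracker.update(frame, detections)   # X_t = G(I_t, D_t)
    return tracks
```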
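
Claims 6 and 7 chain per-frame segmentation labels, boundary extraction, and clip-level recognition. The sketch below assumes a single background label and treats trim, P, and W as the callables segment_model, boundaries, and recognize; the boundary rule is one plausible reading, not the patent's definition of P.

```python
# Sketch of claims 6-7: per-frame labels -> isolated clips -> recognition.
# `segment_model` (trim), `boundaries` (P) and `recognize` (W) are
# illustrative stand-ins, not the patent's actual models.
from typing import List, Tuple

def boundaries(labels: List[int], background: int = 0) -> List[Tuple[int, int]]:
    """P: turn per-frame action labels into (start, end) frame indices."""
    clips, start = [], None
    for t, lab in enumerate(labels):
        if lab != background and start is None:
            start = t                         # clip opens
        elif lab == background and start is not None:
            clips.append((start, t - 1))      # clip closes
            start = None
    if start is not None:
        clips.append((start, len(labels) - 1))
    return clips

def run_segmentation_and_recognition(cached_video, segment_model, recognize):
    labels = segment_model(cached_video)           # {L_1..L_n} = trim(V_m)
    for start, end in boundaries(labels):          # (I_start, I_end) = P(...)
        clip = cached_video[start:end + 1]
        yield (start, end, recognize(clip))        # Y = W(I_start..I_end)
```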
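
Claim 8 describes the display side: results arrive over a socket and are drawn on the frame whose timestamp matches. A sketch under assumed conventions (newline-delimited JSON messages with timestamp and tracks fields, OpenCV for drawing) is given below.

```python
# Sketch of claim 8: the display side receives results over a socket and
# overlays them on frames whose timestamps match. JSON-over-TCP framing,
# the message fields, and the OpenCV drawing are assumptions.
import json, socket, cv2

def overlay(frame, tracks):
    for t in tracks:                          # t: dict with id, c, w, h, x, y
        x1, y1 = int(t["x"] - t["w"] / 2), int(t["y"] - t["h"] / 2)
        x2, y2 = int(t["x"] + t["w"] / 2), int(t["y"] + t["h"] / 2)
        cv2.rectangle(frame, (x1, y1), (x2, y2), (0, 255, 0), 2)
        cv2.putText(frame, f'{t["id"]}:{t["c"]}', (x1, y1 - 4),
                    cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 255, 0), 1)
    return frame

def display_loop(host, port, frames_by_ts):
    """frames_by_ts: timestamp -> frame, fetched from the shared memory."""
    with socket.create_connection((host, port)) as conn:
        for line in conn.makefile("r"):       # one JSON message per line
            msg = json.loads(line)
            frame = frames_by_ts.get(msg["timestamp"])
            if frame is not None:             # align timestamps, then draw
                cv2.imshow("results", overlay(frame, msg["tracks"]))
                cv2.waitKey(1)
```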
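
Claim 9 describes incremental model updating driven by a scheduled task. The sketch below assumes a simple directory layout and a placeholder train() callable; the actual trigger (e.g. cron or a task scheduler), the paths, and the training procedure are not specified by the claim.

```python
# Sketch of claim 9: a scheduled task merges cached recognition results with
# the original training data, retrains, and swaps in the new weights.
# The directory layout, file names, and train() call are assumptions.
import shutil
from pathlib import Path

RESULTS_DIR = Path("/data/recognition_cache")   # assumed paths
TRAIN_DIR   = Path("/data/train_set")
MODEL_FILE  = Path("/models/action_recog.pt")

def incremental_update(train):
    """Intended to be triggered periodically, e.g. by cron or APScheduler."""
    # 1. merge newly cached samples into the training set
    for sample in RESULTS_DIR.glob("*"):
        shutil.move(str(sample), TRAIN_DIR / sample.name)
    # 2. retrain on the augmented data set (train() is a placeholder)
    new_weights = train(TRAIN_DIR)
    # 3. replace the serving model file with the freshly trained one
    tmp = MODEL_FILE.with_suffix(".tmp")
    shutil.copy(new_weights, tmp)
    tmp.replace(MODEL_FILE)
```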
CN202310995299.8A 2023-08-09 2023-08-09 Appearance detection action recognition device and method for dense objects in narrow space Pending CN116994337A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310995299.8A CN116994337A (en) 2023-08-09 2023-08-09 Appearance detection action recognition device and method for dense objects in narrow space

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310995299.8A CN116994337A (en) 2023-08-09 2023-08-09 Appearance detection action recognition device and method for dense objects in narrow space

Publications (1)

Publication Number Publication Date
CN116994337A true CN116994337A (en) 2023-11-03

Family

ID=88524518

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310995299.8A Pending CN116994337A (en) 2023-08-09 2023-08-09 Appearance detection action recognition device and method for dense objects in narrow space

Country Status (1)

Country Link
CN (1) CN116994337A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination