
CN118379664A - Video identification method and system based on artificial intelligence - Google Patents

Video identification method and system based on artificial intelligence

Info

Publication number
CN118379664A
CN118379664A (application CN202410570103.5A)
Authority
CN
China
Prior art keywords
video
network
feature
frame
behavior
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202410570103.5A
Other languages
Chinese (zh)
Other versions
CN118379664B (en)
Inventor
宋运锋
张锦龙
王小敏
邹志光
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong Xunke Ruisheng Technology Co ltd
Original Assignee
Guangdong Xunke Ruisheng Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangdong Xunke Ruisheng Technology Co ltd filed Critical Guangdong Xunke Ruisheng Technology Co ltd
Priority to CN202410570103.5A priority Critical patent/CN118379664B/en
Publication of CN118379664A publication Critical patent/CN118379664A/en
Application granted granted Critical
Publication of CN118379664B publication Critical patent/CN118379664B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G06V 20/40 Scenes; Scene-specific elements in video content
    • G06N 3/044 Recurrent networks, e.g. Hopfield networks
    • G06N 3/045 Combinations of networks
    • G06N 3/0464 Convolutional networks [CNN, ConvNet]
    • G06N 3/0475 Generative networks
    • G06N 3/084 Backpropagation, e.g. using gradient descent
    • G06N 3/088 Non-supervised learning, e.g. competitive learning
    • G06N 3/094 Adversarial learning
    • G06T 3/04 Context-preserving transformations, e.g. by using an importance map
    • G06V 10/20 Image preprocessing
    • G06V 10/30 Noise filtering
    • G06V 10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G06V 10/467 Encoded features or binary features, e.g. local binary patterns [LBP]
    • G06V 10/54 Extraction of image or video features relating to texture
    • G06V 10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • Y02T 10/40 Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Image Analysis (AREA)

Abstract

The present invention provides a video recognition method and system based on artificial intelligence. The method comprises: constructing an adaptive framework based on video content analysis that dynamically adjusts the model structure and parameters according to the complexity of real-time video frames; applying, according to those model parameters, efficient image preprocessing techniques and deep learning algorithms to extract features from the video frames; processing the extracted features with a generative adversarial network to stylize key frames into stylized video frames that reinforce visual behavioral features; and, based on the stylized video frames, designing a hybrid neural network for behavior recognition and for behavior prediction based on historical data. The invention significantly improves the accuracy and efficiency of video recognition, strengthens the system's real-time processing capability and the foresight of its behavior prediction, and effectively addresses the core problems of the prior art.

Description

A video recognition method and system based on artificial intelligence

Technical Field

The present invention belongs to the technical field of artificial intelligence, and in particular relates to a video recognition method and system based on artificial intelligence.

Background Art

In the current technical environment, video recognition systems are widely used in fields such as security monitoring, content recommendation, and traffic management. Their core function is to automatically parse and identify video content through computer vision technology and deep learning algorithms. Traditional video recognition systems usually rely on convolutional neural networks (CNNs) or recurrent neural networks (RNNs) to process video frames and identify the objects, actions, or events in them. These systems can achieve high accuracy in standardized environments, but they still have several shortcomings in complex and changing practical application scenarios. For example, the recognition accuracy of traditional models drops significantly when handling behaviors against very complex or dynamically changing backgrounds. In addition, most existing systems rely on large amounts of labeled data for training, which is not only costly but, when annotation quality is inconsistent, further impairs the generalization ability and practicality of the model. Moreover, existing video recognition technology still faces technical challenges in real-time processing, adaptive model adjustment, and prediction of future behavior.

Therefore, the adaptability of existing technologies to complex scenarios, their processing efficiency, and their ability to learn from unlabeled data all need to be improved.

Summary of the Invention

The object of the present invention is to design a video recognition method and system based on artificial intelligence that integrates multiple new technologies for intelligent video recognition and prediction, so as to solve the above problems.

To achieve the above object, a first aspect of the present invention provides a video recognition method based on artificial intelligence, the method comprising the following steps:

S1. Constructing an adaptive framework based on video content analysis, which dynamically adjusts the model structure and parameters according to the complexity of real-time video frames;

S2. According to the model parameters, applying efficient image preprocessing techniques and deep learning algorithms to extract features from the video frames, wherein the image preprocessing techniques include illumination correction and noise filtering; extracting features from the video frames specifically comprises using an improved Sobel operator to extract edge information from the video frame and then using a local binary pattern operator to analyze the image texture features, specifically:

where Gx and Gy denote the horizontal and vertical gradients respectively, and G denotes the final edge strength;

where gc denotes the center pixel value, gp denotes a neighborhood pixel value, and P denotes the number of pixels in the neighborhood; LBP(x,y) denotes the LBP feature, reflecting the relative intensity of the current pixel with respect to its surrounding neighborhood, where (x,y) are the horizontal and vertical coordinates of the current pixel; LBP(x,y) is the local binary pattern value computed at image coordinate (x,y);
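A conventional formulation consistent with these symbol definitions, assuming the standard Sobel gradient magnitude and the standard LBP operator (the exact "improved" Sobel variant is not spelled out here), is:

$$G = \sqrt{G_x^2 + G_y^2}$$

$$\mathrm{LBP}(x,y) = \sum_{p=0}^{P-1} s\bigl(g_p - g_c\bigr)\,2^{p}, \qquad s(u) = \begin{cases} 1, & u \ge 0 \\ 0, & u < 0 \end{cases}$$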

S3. Using a generative adversarial network to process the features extracted from the video frames and stylizing key frames to obtain stylized video frames that reinforce the visual behavioral features, comprising:

S302. Designing a feature-emphasis loss term and forming a total loss function to train the generative network, while adding a perception layer to the discriminative network; the feature-emphasis loss term is expressed as follows:

where Fk denotes the feature extraction function for key feature k, K denotes the predefined set of key features, λk denotes the weight associated with feature k and adjusts the contribution of each feature to the total loss, I denotes the original frame, and G(z) denotes the output of the generator G when the input is the latent variable z;

The total loss function is expressed as follows:

where β is a hyperparameter controlling the influence of the feature-emphasis loss term, and D denotes the discriminator network;
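A hedged reconstruction consistent with the definitions above, assuming a weighted feature-matching form for the emphasis term and the standard adversarial objective, is:

$$\mathcal{L}_{\mathrm{feat}} = \sum_{k \in K} \lambda_k \,\bigl\| F_k(I) - F_k\bigl(G(z)\bigr) \bigr\|^2$$

$$\mathcal{L}_{\mathrm{total}} = \mathbb{E}_{I}\bigl[\log D(I)\bigr] + \mathbb{E}_{z}\bigl[\log\bigl(1 - D(G(z))\bigr)\bigr] + \beta\, \mathcal{L}_{\mathrm{feat}}$$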

S303. Training the generative adversarial network with a progressive training method;

S4. Based on the stylized video frames, designing a hybrid neural network for behavior recognition and for behavior prediction based on historical data, wherein the hybrid neural network includes a CNN network and an RNN network;

The method further comprises at least one of the following steps:

A. Integrating an unsupervised learning algorithm and dynamically adjusting the models and parameters of S1-S4 based on real-time feedback;

B. Establishing a feedback mechanism that dynamically adjusts strategies according to user and performance feedback, continuously optimizing strategy execution.

Further, S1 specifically comprises:

S101. Inputting a video frame sequence, where each video frame is first processed by a dedicated feature extraction sub-network;

S102. Evaluating the degree of scene change using the difference between the feature vectors of two consecutive frames:

Dt = ||Vt - Vt-1||2

where Dt denotes the magnitude of content change between the frames at times t and t-1, Vt denotes the per-frame video feature vector at time t, and Vt-1 denotes the per-frame video feature vector at time t-1;

The average of these differences over a fixed-size sliding window W gives the scene complexity Ct:
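Assuming a simple arithmetic mean over the window W (with Di denoting the frame content change at time i, as defined later in the embodiments), the scene complexity can be written as:

$$C_t = \frac{1}{|W|} \sum_{i \in W} D_i$$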

S103. Automatically adjusting the network configuration according to the comparison between Ct and a preset threshold θ;

S104. In combination with real-time performance feedback, adjusting the network weights and structural parameters using the backpropagation algorithm.

Further, the illumination correction applies automatic white balance and exposure compensation to the input video frame, expressed as follows:

where I(x,y) is the original pixel value, Icorr(x,y) is the corrected pixel value, and Imin and Imax are the minimum and maximum pixel values in the frame, respectively;
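A min-max normalization consistent with these definitions (assumed here as the concrete form of the correction) is:

$$I_{\mathrm{corr}}(x,y) = \frac{I(x,y) - I_{\min}}{I_{\max} - I_{\min}}$$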

The noise filtering applies a bilateral filter to remove potential image noise while preserving edge sharpness, expressed as follows:

where Wp is the normalization coefficient, fr and fs are Gaussian functions over intensity and spatial proximity respectively, which ensure that only nearby pixels with similar intensity influence the current pixel value; (x,y) are the coordinates of the pixel, (x',y') are the coordinates of the neighboring pixels involved in the filtering, and Ifiltered(x,y) is the pixel value at coordinate (x,y) of the bilaterally filtered image.
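The classical bilateral filter matching these symbol definitions (assumed here, with Ω introduced to denote the neighborhood window) is:

$$I_{\mathrm{filtered}}(x,y) = \frac{1}{W_p} \sum_{(x',y') \in \Omega} I(x',y')\, f_r\bigl(\lvert I(x',y') - I(x,y)\rvert\bigr)\, f_s\bigl(\lVert (x',y') - (x,y) \rVert\bigr),
\qquad W_p = \sum_{(x',y') \in \Omega} f_r(\cdot)\, f_s(\cdot)$$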

Further, the input of the generative network is the per-frame video feature vector at time t; multiple convolutional layers are then used, each followed by batch normalization and a ReLU activation, and the last layer uses a tanh activation to output the stylized image, expressed as follows:

Istyled = tanh(Conv(BN(ReLU(Conv(... Vt ...)))))

The inputs of the discriminative network include Istyled and the original frame I; multiple convolutional layers are used for feature extraction, each followed by normalization and a LeakyReLU activation, and the last layer uses a sigmoid function to output the probability, expressed as follows:

Preal = σ(Conv(LReLU(BN(Conv(... Istyled, I ...)))))

where σ denotes the sigmoid activation function.

Further, in S303, the progressive training method means training the network first at low resolution and gradually transitioning to high resolution.

Further, the hybrid neural network comprises a feature fusion layer, a temporal fusion layer, and a behavior recognition and prediction output layer; the feature fusion layer uses a deep convolutional network to extract the spatial features of each frame, expressed as follows:

Fspatial = CNN(Istyled)

where Fspatial denotes the spatial feature vector extracted from the stylized image;

The temporal fusion layer takes the feature vectors Fspatial of consecutive frames as input and processes temporal dependencies and dynamic changes through a recurrent neural network module, expressed as follows:

St = RNN(Fspatial, St-1)

where St denotes the hidden state at time t, carrying the accumulated information of past video frames, and St-1 denotes the hidden state at time t-1;

The behavior recognition and prediction output layer combines Fspatial and St to generate the final behavior recognition and behavior prediction outputs, expressed as follows:

Yaction = Softmax(Dense(St))

Ypredict = Softmax(Dense(St, Fspatial))

where Yaction is the behavior recognition result for the current frame, and Ypredict is the prediction of future behavior based on current and past information.

Further, a cross-entropy loss function is used to jointly optimize the accuracy of behavior recognition and prediction, expressed as follows:

where yc is the one-hot encoding of the true behavior label, ŷc is the probability distribution predicted by the model, and C is the number of behavior categories.
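The standard multi-class cross-entropy consistent with these definitions (assumed here to apply to both the recognition and the prediction outputs) is:

$$\mathcal{L}_{\mathrm{CE}} = -\sum_{c=1}^{C} y_c \log \hat{y}_c$$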

Further, step A specifically comprises the following steps:

constructing a feature transformation network F' using a reversible feature transformation technique, to extract and transform features from the original video frames;

designing reconstruction and anomaly scoring with an autoencoder-based anomaly detection technique, to detect abnormal behaviors or salient events in the video frames;

defining the total loss function as a weighted sum of the reconstruction error and the behavior recognition error, introducing an additional regularization term to enhance the model's generalization ability and adaptability, and introducing a feedback-based dynamic weight adjustment mechanism to adjust the weights;

integrating dynamic elements to adapt to changing video content and optimize the feature transformation network;

introducing a model self-adaptive adjustment strategy that automatically adjusts the network architecture or parameters based on the currently detected behavior patterns and historical data.

Further, step B specifically comprises the following steps:

periodically collecting feedback data and processing the collected data using statistical analysis and machine learning methods;

based on the feedback data, defining a parameter adjustment strategy and dynamically adjusting the learning rate according to changes in model performance;

periodically performing a comprehensive evaluation and updating the model's operating parameters according to the evaluation results, which are fed back to the operating-parameter adjustment function and the learning-rate update strategy.

In a second aspect, the present invention provides a video recognition system based on artificial intelligence, the system comprising:

an adaptive framework construction module, configured to construct an adaptive framework based on video content analysis and dynamically adjust the model structure and parameters according to the complexity of real-time video frames;

an algorithm design module, configured to apply, according to the model parameters, efficient image preprocessing techniques and deep learning algorithms to extract features from the video frames, wherein the image preprocessing techniques include illumination correction and noise filtering; extracting features from the video frames specifically comprises using an improved Sobel operator to extract edge information from the video frame and then using a local binary pattern operator to analyze the image texture features, specifically:

where Gx and Gy denote the horizontal and vertical gradients respectively, and G denotes the final edge strength;

where gc denotes the center pixel value, gp denotes a neighborhood pixel value, and P denotes the number of pixels in the neighborhood; LBP(x,y) denotes the LBP feature, reflecting the relative intensity of the current pixel with respect to its surrounding neighborhood, where (x,y) are the horizontal and vertical coordinates of the current pixel; LBP(x,y) is the local binary pattern value computed at image coordinate (x,y);

a feature enhancement module, configured to process the features extracted from the video frames using a generative adversarial network and stylize key frames to obtain stylized video frames that reinforce the visual behavioral features, comprising:

S301. designing a generative adversarial network, including a generative network and a discriminative network;

S302. designing a feature-emphasis loss term and forming a total loss function to train the generative network, while adding a perception layer to the discriminative network; the feature-emphasis loss term is expressed as follows:

where Fk denotes the feature extraction function for key feature k, K denotes the predefined set of key features, λk denotes the weight associated with feature k and adjusts the contribution of each feature to the total loss, I denotes the original frame, and G(z) denotes the output of the generator G when the input is the latent variable z;

The total loss function is expressed as follows:

where β is a hyperparameter controlling the influence of the feature-emphasis loss term, and D denotes the discriminator network;

S303. training the generative adversarial network with a progressive training method;

a behavior prediction module, configured to design, based on the stylized video frames, a hybrid neural network for behavior recognition and for behavior prediction based on historical data, wherein the hybrid neural network includes a CNN network and an RNN network;

The system further comprises at least one of the following modules:

an unsupervised learning algorithm module, configured to integrate an unsupervised learning algorithm and dynamically adjust the models and parameters of S1-S4 based on real-time feedback;

a feedback mechanism optimization module, configured to establish a feedback mechanism, dynamically adjust strategies according to user and performance feedback, and continuously optimize strategy execution.

The beneficial technical effects of the present invention include at least the following:

The video recognition method and system based on artificial intelligence of the present invention provide an intelligent video recognition and prediction system that integrates multiple new technologies. First, an adaptive learning model with dynamic adjustment is introduced; it automatically adjusts the structure and parameters of the deep learning model according to the complexity of the input video, which is crucial for improving the adaptability and efficiency of the system in changing environments. Second, through the stylization of video frames based on generative adversarial networks (GANs), the system improves the accuracy and robustness of action recognition from visual information alone, without adding extra sensors, and performs particularly well on visually complex or low-quality video data. In addition, by combining time-series analysis with video-to-video translation technology, the system can not only identify current behavior but also predict the behavior sequences that may follow, which is especially valuable in security monitoring and emergency response systems. Finally, the use of unsupervised learning allows the system to continuously learn and optimize the model without labeled data, solving the prior art's dependence on large amounts of annotated data. The combination of these technologies not only significantly improves the accuracy and efficiency of video recognition, but also strengthens the system's real-time processing capability and the foresight of its behavior prediction, effectively addressing the core problems of the prior art.

Brief Description of the Drawings

The present invention is further described with reference to the accompanying drawings, but the embodiments in the drawings do not constitute any limitation of the present invention. A person of ordinary skill in the art can obtain other drawings from the following drawings without creative effort.

FIG. 1 is a flowchart of a video recognition method based on artificial intelligence according to an embodiment of the present invention.

FIG. 2 is a framework diagram of a video recognition system based on artificial intelligence according to an embodiment of the present invention.

Detailed Description of the Embodiments

Embodiments of the present invention are described in detail below; examples of the embodiments are shown in the accompanying drawings, in which the same or similar reference numerals throughout denote the same or similar elements or elements having the same or similar functions. The embodiments described below with reference to the drawings are exemplary; they are only intended to explain the present invention and are not to be construed as limiting it.

In one or more embodiments, as shown in FIG. 1, a video recognition method based on artificial intelligence is disclosed, the method comprising steps S1 to S6:

S1. Constructing an adaptive framework based on video content analysis, which dynamically adjusts the model structure and parameters according to the complexity of real-time video frames, comprising:

S101. Inputting a video frame sequence, where each video frame is first processed by a dedicated feature extraction sub-network;

Specifically, a video frame sequence is input, and each frame is first processed by a dedicated feature extraction sub-network. A simplified, lightweight ResNet model designed for real-time data streams is used. Through this network, each video frame is converted into a feature vector Vt of fixed dimension containing the frame's key visual information, such as edges, texture, and color distribution.

S102. Evaluating the degree of scene change using the difference between the feature vectors of two consecutive frames:

Dt = ||Vt - Vt-1||2

where Dt denotes the magnitude of content change between the frames at times t and t-1, Vt denotes the per-frame video feature vector at time t, and Vt-1 denotes the per-frame video feature vector at time t-1;

The average of these differences over a fixed-size sliding window W gives the scene complexity Ct:

where i denotes the i-th time instant and Di denotes the frame content change at the i-th time instant;

S103. Automatically adjusting the network configuration according to the comparison between Ct and a preset threshold θ;

Specifically, the system automatically adjusts the network configuration according to the comparison between Ct and the preset threshold θ. If Ct exceeds the threshold θ, the scene is changing substantially and a more complex model is needed to capture the details, so the system may increase the depth of the convolutional layers or the size of the convolution kernels; conversely, if Ct falls below θ, the network structure is simplified, for example by reducing the number of convolutional layers or using smaller kernels, to increase processing speed and reduce the consumption of computing resources.
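A minimal sketch of the S101-S103 complexity logic in Python/NumPy is given below; the window size, threshold value and the configuration dictionaries are illustrative assumptions and not part of the patent:

```python
import numpy as np
from collections import deque

WINDOW_SIZE = 16          # size of the sliding window W (assumed value)
THETA = 0.35              # preset complexity threshold θ (assumed value)

feature_history = deque(maxlen=WINDOW_SIZE)   # stores the recent D_t values
prev_vector = None

def scene_complexity(frame_vector):
    """Update D_t = ||V_t - V_{t-1}||_2 and return the windowed mean C_t."""
    global prev_vector
    if prev_vector is None:
        prev_vector = frame_vector
        return 0.0
    d_t = float(np.linalg.norm(frame_vector - prev_vector))
    prev_vector = frame_vector
    feature_history.append(d_t)
    return float(np.mean(feature_history))

def choose_config(c_t):
    """S103: pick a heavier or lighter network configuration from C_t."""
    if c_t > THETA:
        # complex scene: deeper network, larger kernels
        return {"num_conv_layers": 8, "kernel_size": 5}
    # simple scene: shallower network, smaller kernels
    return {"num_conv_layers": 4, "kernel_size": 3}

# usage: V_t would come from the lightweight ResNet feature extractor (S101)
V_t = np.random.rand(256).astype(np.float32)   # placeholder feature vector
config = choose_config(scene_complexity(V_t))
```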

S104. In combination with real-time performance feedback, adjusting the network weights and structural parameters using the backpropagation algorithm;

In this embodiment, combined with real-time performance feedback, the system uses the backpropagation algorithm to adjust the network weights and structural parameters to match the requirements of the current video content. This includes adjusting the learning rate, optimizing the gradient descent strategy, and so on, ensuring that every adjustment is based on the latest performance evaluation data, thereby minimizing error and optimizing overall performance.

S2. According to the model parameters, applying efficient image preprocessing techniques and deep learning algorithms to extract features from the video frames, wherein the image preprocessing techniques include illumination correction and noise filtering;

Here, illumination correction applies automatic white balance and exposure compensation to the input video frame to standardize the lighting conditions, using the formula:

where I(x,y) is the original pixel value, Icorr(x,y) is the corrected pixel value, and Imin and Imax are the minimum and maximum pixel values in the frame, respectively.

Noise filtering applies a bilateral filter to remove potential image noise while preserving edge sharpness; the new pixel value is computed by the following formula:

where Wp is the normalization coefficient, fr and fs are Gaussian functions over intensity and spatial proximity respectively, which ensure that only nearby pixels with similar intensity influence the current pixel value; (x,y) are the coordinates of the pixel, (x',y') are the coordinates of the neighboring pixels involved in the filtering, and Ifiltered(x,y) is the pixel value at coordinate (x,y) of the bilaterally filtered image.

Extracting features from the video frames specifically comprises using an improved Sobel operator to extract the edge information of the video frame, and then using a local binary pattern operator to analyze the image texture features, specifically:

where Gx and Gy denote the horizontal and vertical gradients respectively, and G denotes the final edge strength;

where gc denotes the center pixel value, gp denotes a neighborhood pixel value, and P denotes the number of pixels in the neighborhood; LBP(x,y) denotes the LBP feature, reflecting the relative intensity of the current pixel with respect to its surrounding neighborhood, where (x,y) are the horizontal and vertical coordinates of the current pixel; LBP(x,y) is the local binary pattern value computed at image coordinate (x,y).
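A compact NumPy sketch of the edge and texture features described above, assuming the classical Sobel kernels and an 8-neighbor LBP (the "improved" Sobel variant is not specified, so the standard kernels are used here):

```python
import numpy as np
from scipy.ndimage import convolve

SOBEL_X = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=np.float32)
SOBEL_Y = SOBEL_X.T

def edge_strength(gray):
    """G = sqrt(Gx^2 + Gy^2) with the classical Sobel kernels."""
    gx = convolve(gray.astype(np.float32), SOBEL_X)
    gy = convolve(gray.astype(np.float32), SOBEL_Y)
    return np.sqrt(gx ** 2 + gy ** 2)

def lbp_8(gray):
    """8-neighbor LBP: threshold each neighbor g_p against the center g_c."""
    g = gray.astype(np.float32)
    center = g[1:-1, 1:-1]
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    code = np.zeros_like(center, dtype=np.uint8)
    for p, (dy, dx) in enumerate(offsets):
        neighbour = g[1 + dy: g.shape[0] - 1 + dy, 1 + dx: g.shape[1] - 1 + dx]
        code |= ((neighbour >= center).astype(np.uint8) << p)
    return code

# usage on a single preprocessed (illumination-corrected, denoised) frame
frame = np.random.randint(0, 256, (240, 320), dtype=np.uint8)
G = edge_strength(frame)
texture = lbp_8(frame)
```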

As an embodiment of the present invention, through the above steps, video frame preprocessing standardizes the input data and reduces interference from environmental variables, while feature extraction focuses on obtaining the visual information that is critical to subsequent behavior analysis. In addition, the physically grounded methods for illumination correction and noise filtering keep the processing broadly applicable and efficient, while edge detection and texture analysis provide the detailed visual description needed for video content analysis, laying a solid foundation for subsequent in-depth analysis. This processing pipeline ensures that the video surveillance system operates efficiently and accurately under a wide range of environmental conditions and meets the demands of real-time monitoring.

S3. Using a generative adversarial network to process the features extracted from the video frames and stylizing key frames to obtain stylized video frames that reinforce the visual behavioral features, comprising:

S301. Designing a generative adversarial network, including a generative network and a discriminative network;

Specifically, the input Vt of the generative network (G) is the per-frame feature vector obtained in step S2. Network structure: multiple convolutional layers are used, each followed by batch normalization and a ReLU activation, and the last layer uses a tanh activation to output the stylized image. The goal of the generative network is to generate visually striking images from the input feature vector, highlighting the key content, expressed as follows:

Istyled = tanh(Conv(BN(ReLU(Conv(... Vt ...)))))    (5)

The inputs of the discriminative network (D) are Istyled and the original frame I. Multiple convolutional layers are used for feature extraction, each followed by normalization and a LeakyReLU activation, and the last layer uses a sigmoid function to output the probability that the image is real rather than generated by the generative network, expressed as follows:

Preal = σ(Conv(LReLU(BN(Conv(... Istyled, I ...)))))    (6)

where σ denotes the sigmoid activation function, which converts the output into a probability value.
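A minimal PyTorch-style sketch of the generator and discriminator structure described in S301; the layer counts, channel sizes and the 64x64 output resolution are illustrative assumptions:

```python
import torch
import torch.nn as nn

class Generator(nn.Module):
    """Maps the per-frame feature vector V_t to a stylized 3x64x64 image."""
    def __init__(self, feat_dim=256):
        super().__init__()
        self.fc = nn.Linear(feat_dim, 256 * 4 * 4)      # project V_t to a 4x4 map
        self.net = nn.Sequential(
            nn.ConvTranspose2d(256, 128, 4, 2, 1), nn.BatchNorm2d(128), nn.ReLU(True),
            nn.ConvTranspose2d(128, 64, 4, 2, 1), nn.BatchNorm2d(64), nn.ReLU(True),
            nn.ConvTranspose2d(64, 32, 4, 2, 1), nn.BatchNorm2d(32), nn.ReLU(True),
            nn.ConvTranspose2d(32, 3, 4, 2, 1), nn.Tanh(),   # tanh output layer
        )

    def forward(self, v_t):
        x = self.fc(v_t).view(-1, 256, 4, 4)
        return self.net(x)

class Discriminator(nn.Module):
    """Scores a (stylized, original) frame pair as real or generated."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(6, 64, 4, 2, 1), nn.LeakyReLU(0.2, True),
            nn.Conv2d(64, 128, 4, 2, 1), nn.BatchNorm2d(128), nn.LeakyReLU(0.2, True),
            nn.Conv2d(128, 256, 4, 2, 1), nn.BatchNorm2d(256), nn.LeakyReLU(0.2, True),
            nn.Conv2d(256, 1, 8), nn.Sigmoid(),              # probability P_real
        )

    def forward(self, i_styled, i_orig):
        x = torch.cat([i_styled, i_orig], dim=1)             # channel concatenation
        return self.net(x).view(-1)

# usage: one feature vector in, one stylized frame out, then scored by D
v_t = torch.randn(1, 256)
g, d = Generator(), Discriminator()
i_styled = g(v_t)                                            # shape (1, 3, 64, 64)
p_real = d(i_styled, torch.randn(1, 3, 64, 64))
```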

S302. Designing a feature-emphasis loss term and forming a total loss function to train the generative network, while adding a perception layer to the discriminative network; the feature-emphasis loss term is expressed as follows:

where Fk denotes the feature extraction function for key feature k, K denotes the predefined set of key features, λk denotes the weight associated with feature k and adjusts the contribution of each feature to the total loss, I denotes the original frame, and G(z) denotes the output of the generator G when the input is the latent variable z;

The total loss function is expressed as follows:

where β is a hyperparameter controlling the influence of the feature-emphasis loss term, and D denotes the discriminator network;

S303. Training the generative adversarial network with a progressive training method;

Specifically, a progressive training method is adopted: the network is first trained at low resolution and gradually transitioned to high resolution. This helps the network capture large-scale behavior patterns at an early stage and then progressively refine important small-scale details, so that complex or blurred video scenes are handled more effectively.

S4. Based on the stylized video frames, designing a hybrid neural network for behavior recognition and for behavior prediction based on historical data, wherein the hybrid neural network includes a CNN network and an RNN network, specifically expressed as follows:

The hybrid neural network comprises a feature fusion layer, a temporal fusion layer, and a behavior recognition and prediction output layer;

The feature fusion layer uses a deep convolutional network to extract the spatial features of each frame, expressed as follows:

Fspatial = CNN(Istyled)    (9)

where Fspatial denotes the spatial feature vector extracted from the stylized image;

The temporal fusion layer takes the feature vectors Fspatial of consecutive frames as input and processes temporal dependencies and dynamic changes through a recurrent neural network module, expressed as follows:

St = RNN(Fspatial, St-1)    (10)

where St denotes the hidden state at time t, carrying the accumulated information of past video frames, and St-1 denotes the hidden state at time t-1;

The behavior recognition and prediction output layer combines Fspatial and St to generate the final behavior recognition and behavior prediction outputs, expressed as follows:

Yaction = Softmax(Dense(St))    (11)

Ypredict = Softmax(Dense(St, Fspatial))    (12)

where Yaction is the behavior recognition result for the current frame, and Ypredict is the prediction of future behavior based on current and past information.

A cross-entropy loss function is used to jointly optimize the accuracy of behavior recognition and prediction, expressed as follows:

where yc is the one-hot encoding of the true behavior label, ŷc is the probability distribution predicted by the model, and C is the number of behavior categories.
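A condensed PyTorch sketch of the hybrid network in S4; the CNN backbone, the GRU cell, the hidden sizes and the class count are illustrative assumptions:

```python
import torch
import torch.nn as nn

class HybridBehaviorNet(nn.Module):
    """CNN spatial features + RNN temporal state feeding recognition and prediction heads."""
    def __init__(self, num_classes=10, feat_dim=128, hidden_dim=128):
        super().__init__()
        self.cnn = nn.Sequential(                       # feature fusion layer
            nn.Conv2d(3, 32, 3, 2, 1), nn.ReLU(True),
            nn.Conv2d(32, 64, 3, 2, 1), nn.ReLU(True),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, feat_dim),
        )
        self.rnn = nn.GRUCell(feat_dim, hidden_dim)     # temporal fusion layer
        self.action_head = nn.Linear(hidden_dim, num_classes)               # Y_action
        self.predict_head = nn.Linear(hidden_dim + feat_dim, num_classes)   # Y_predict

    def forward(self, stylized_frames):
        """stylized_frames: (T, 3, H, W) sequence of stylized video frames."""
        s_t = torch.zeros(1, self.rnn.hidden_size)
        for frame in stylized_frames:
            f_spatial = self.cnn(frame.unsqueeze(0))    # F_spatial = CNN(I_styled)
            s_t = self.rnn(f_spatial, s_t)              # S_t = RNN(F_spatial, S_{t-1})
        y_action = torch.softmax(self.action_head(s_t), dim=-1)
        y_predict = torch.softmax(
            self.predict_head(torch.cat([s_t, f_spatial], dim=-1)), dim=-1)
        return y_action, y_predict

# usage with a short stylized clip; cross-entropy on both heads would train it
clip = torch.randn(8, 3, 64, 64)
model = HybridBehaviorNet()
y_action, y_predict = model(clip)
```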

S5. Integrating an unsupervised learning algorithm and dynamically adjusting the models and parameters of S1-S4 based on real-time feedback, including:

Constructing a feature transformation network F' using a reversible feature transformation technique, to extract and transform features from the original video frames.

In this embodiment, a small convolutional neural network is constructed whose goal is to extract and transform features from the original video frames so as to better meet the model's recognition requirements. The network uses a reversible feature transformation technique that not only extracts features but also strengthens adaptability to key variables such as illumination changes and motion blur, expressed as follows:

F' = FTN(Iframe)    (13)

where F' is the transformed feature vector and Iframe is the input video frame.

Designing reconstruction and anomaly scoring with an autoencoder-based anomaly detection technique, to detect abnormal behaviors or salient events in the video frames.

Specifically, abnormal behaviors or salient events in the video frames are detected using an autoencoder-based anomaly detection technique: the autoencoder attempts to reconstruct the input features, and the reconstruction error is used to judge anomalies. The reconstruction and anomaly score are expressed as follows:

where a high anomaly score indicates that the current frame may contain abnormal or previously unlearned behavior features.
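A standard autoencoder reconstruction and anomaly score is assumed here, since only its role is described above:

$$\hat{F}' = \mathrm{Dec}\bigl(\mathrm{Enc}(F')\bigr), \qquad \mathrm{score} = \bigl\| F' - \hat{F}' \bigr\|^2$$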

Defining the total loss function as a weighted sum of the reconstruction error and the behavior recognition error, introducing an additional regularization term to enhance the model's generalization ability and adaptability, and introducing a feedback-based dynamic weight adjustment mechanism to adjust the weights.

Specifically, the total loss function is defined as a weighted sum of the reconstruction error and the behavior recognition error, with an additional regularization term introduced to enhance the model's generalization ability and adaptability, expressed as follows:

where the reconstruction term measures the accuracy of reconstruction after the feature transformation, the regularization term of the model (for example, L2 regularization) is used to avoid overfitting, and α, β, γ are weight parameters that are adjusted according to the importance of the different tasks.
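One plausible form of the weighted total loss consistent with the description above (assumed, since only the weights and terms are named) is:

$$\mathcal{L}_{\mathrm{total}} = \alpha\, \mathcal{L}_{\mathrm{rec}} + \beta\, \mathcal{L}_{\mathrm{cls}} + \gamma\, \mathcal{L}_{\mathrm{reg}}$$

where $\mathcal{L}_{\mathrm{rec}}$, $\mathcal{L}_{\mathrm{cls}}$ and $\mathcal{L}_{\mathrm{reg}}$ denote the reconstruction, behavior recognition and regularization terms, respectively.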

A feedback-based dynamic weight adjustment mechanism is then introduced. The values of α, β and γ are adjusted automatically according to the detection frequency of abnormal behaviors in the real-time monitoring data, in order to optimize the model's performance in a changing environment, expressed as follows:

where f, g and h are functions that adjust the weights according to historical performance feedback.

Integrating dynamic elements to adapt to changing video content and optimize the feature transformation network.

In this embodiment, the feature transformation network not only performs feature encoding and decoding, but also integrates dynamic elements to adapt to continually changing video content. For example, time-dependent components such as a small RNN or an attention mechanism are introduced to strengthen the understanding of the time series and the feature extraction capability:

F'' = Attention(F, context)    (17)

where context is the contextual information extracted from the video sequence, which enhances the feature vector's ability to describe the current scene.

In this embodiment, a model self-adaptive adjustment strategy is introduced that automatically adjusts the network architecture or parameters based on the currently detected behavior patterns and historical data. For example, model pruning or neural architecture search (NAS) is used to automatically adjust the complexity and structure of the network to suit different behavior recognition requirements.

By jointly considering the reconstruction error, the behavior recognition accuracy, and the dynamic optimization of model complexity, this step not only improves the accuracy of behavior recognition but also strengthens the system's ability to adapt to new environments and unknown behavior patterns, making it suitable for complex and dynamically changing monitoring scenarios.

S6. Establishing a feedback mechanism that dynamically adjusts strategies according to user and performance feedback and continuously optimizes strategy execution, including:

Periodically collecting feedback data and processing the collected data using statistical analysis and machine learning methods.

Specifically, the system periodically collects data including user feedback (error reports, annotation corrections) and system-generated performance metrics (recognition accuracy, processing latency). Statistical analysis and machine learning methods are used to process the collected data, identify common problems and performance bottlenecks, and output performance improvement reports and adjustment suggestions.

Based on the feedback data, defining a parameter adjustment strategy and dynamically adjusting the learning rate according to changes in model performance, including:

automatically adjusting the learning rate, raising or lowering it, according to the performance evaluation results;

when new behavior patterns are detected to occur frequently, adjusting the network structure (for example, increasing the number of layers or adjusting the filter sizes) to learn these features better;

adjusting the weights of the terms in the loss function according to the requirements of the specific task, with the formulaic adjustment:

where θ denotes the model parameters, ηt is the dynamically adjusted learning rate, the composite loss function takes the feedback into account, Rt is the adjustment suggestion derived from the feedback, and data includes the performance data and the user feedback;
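A gradient-step form consistent with these symbol definitions (assumed here as the shape of Eq. (18)) is:

$$\theta_{t+1} = \theta_t - \eta_t \,\nabla_{\theta}\, \mathcal{L}\bigl(\theta_t;\ \mathrm{data},\ R_t\bigr)$$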

The learning rate is then dynamically adjusted according to changes in model performance: when model performance improves, the learning rate is gradually decreased to refine learning; when model performance degrades, the learning rate is increased to adapt quickly to the new situation. The update strategy is:

ηt+1 = ηt · (1 + δ · sign(ΔPerft))    (19)

where δ is an adjustment factor, ΔPerft is the rate of performance change, and the sign function returns the corresponding sign depending on whether performance improved or deteriorated.
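A direct transcription of the update rule in Eq. (19); the adjustment factor δ = 0.1 is an illustrative value, and the sign convention of ΔPerft is left to the surrounding strategy:

```python
def update_learning_rate(eta_t, delta_perf, delta=0.1):
    """eta_{t+1} = eta_t * (1 + delta * sign(delta_perf))   (Eq. 19)."""
    sign = (delta_perf > 0) - (delta_perf < 0)   # sign() of the performance change
    return eta_t * (1 + delta * sign)

# usage: the caller supplies the measured performance change for the last period
eta = update_learning_rate(1e-3, delta_perf=0.05)
```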

周期性进行全面评估并根据评估结果更新模型的运行参数,反馈给运行参数调整函数和学习率更新策略。Perform comprehensive evaluations periodically and update the model's operating parameters based on the evaluation results, which are then fed back to the operating parameter adjustment function and the learning rate update strategy.

在本发明实施例中,周期性对系统进行全面评估,包括精度、召回率和F1分数,并与之前的性能数据进行比较。根据评估结果更新系统的运行参数,反馈给参数调整函数和学习率更新策略。In the embodiment of the present invention, the system is periodically comprehensively evaluated, including precision, recall rate and F1 score, and compared with previous performance data. The operating parameters of the system are updated according to the evaluation results and fed back to the parameter adjustment function and the learning rate update strategy.

在一个或多个实施方式中,如图2所示,本发明公开了一种基于人工智能的视频识别系统,包括:In one or more embodiments, as shown in FIG2 , the present invention discloses a video recognition system based on artificial intelligence, comprising:

自适应框架构建模块101,用于构建一个基于视频内容分析的自适应框架,根据实时视频帧的复杂性动态调整模型结构和参数;An adaptive framework building module 101 is used to build an adaptive framework based on video content analysis, and dynamically adjust the model structure and parameters according to the complexity of real-time video frames;

算法设计模块102,用于根据模型参数,应用高效的图像预处理技术和深度学习算法,从视频帧中提取特征;其中,所述图像预处理技术包括光照校正和噪声淲除;所述从视频帧中提取特征具体为使用改进的Sobel算子提取视频帧的边缘信息,再利用局部二值模式算子分析图像纹理特征,具体为:The algorithm design module 102 is used to extract features from the video frame by applying efficient image preprocessing technology and deep learning algorithm according to the model parameters; wherein the image preprocessing technology includes illumination correction and noise removal; the feature extraction from the video frame is specifically to extract the edge information of the video frame by using the improved Sobel operator, and then analyze the image texture features by using the local binary pattern operator, specifically:

其中,Gx和Gy分别表示水平和垂直梯度,G表示最终的边缘强度;Among them, G x and G y represent the horizontal and vertical gradients respectively, and G represents the final edge strength;

其中,gc表示中心像素值,gp表示邻域像素值,P表示邻域中像素的数量;LBP(x,y)表示LBP特征,反映当前像素与其周围邻域的相对强度;(x,y)表示当前像素的水平和垂直的坐标,P表示邻域中像素的总数;LBP(x,y)是在图像坐标(x,y)处计算得到的局部二值模式值;Where gc represents the central pixel value, gp represents the neighborhood pixel value, and P represents the number of pixels in the neighborhood; LBP(x,y) represents the LBP feature, which reflects the relative strength of the current pixel and its surrounding neighborhood; (x,y) represents the horizontal and vertical coordinates of the current pixel, and P represents the total number of pixels in the neighborhood; LBP(x,y) is the local binary pattern value calculated at the image coordinate (x,y);

强化特征模块103,用于使用生成对抗网络处理视频帧中提取的特征,对关键帧进行风格化得到风格化视频帧,强化视觉上的行为特征,包括:The feature enhancement module 103 is used to process the features extracted from the video frames using a generative adversarial network, stylize the key frames to obtain stylized video frames, and enhance the visual behavioral features, including:

S301、分别设计生成对抗网络,包括生成网络和判别网络;S301, respectively designing a generative adversarial network, including a generative network and a discriminative network;

S302、设计特征强调的损失项生成总损失函数对生成网络进行训练,同时在判别网络加入感知层;其中,所述特征强调的损失项表示如下:S302, designing a feature-emphasized loss term to generate a total loss function to train the generative network, and adding a perception layer to the discriminative network; wherein the feature-emphasized loss term It is expressed as follows:

其中,Fk表示针对关键特征k的特征提取函数,K表示预定义的关键特征集合,λk表示与特征k相关的权重,用于调节各特征在总损失中的贡献,I表示原始帧;Where Fk represents the feature extraction function for key feature k, K represents the predefined key feature set, λk represents the weight associated with feature k, which is used to adjust the contribution of each feature in the total loss, and I represents the original frame;

所述总损失函数表示如下:The total loss function is expressed as follows:

where β denotes a hyperparameter controlling the influence of the feature-emphasis loss term;

S303, training the generative adversarial network using a progressive training method;
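
As referenced in S302, a non-limiting sketch of how the feature-emphasis loss term and the total generator loss can be combined is given below; the example key features, their weights λk, the binary cross-entropy adversarial term, and the value of β are assumptions chosen only to illustrate the structure of the loss.

```python
import torch
import torch.nn.functional as F

def feature_emphasis_loss(stylized, original, feature_fns, weights):
    """Sum over key features k of λk * || Fk(stylized) - Fk(original) ||^2."""
    loss = stylized.new_tensor(0.0)
    for name, f_k in feature_fns.items():
        loss = loss + weights[name] * F.mse_loss(f_k(stylized), f_k(original))
    return loss

def generator_total_loss(discriminator, stylized, original, feature_fns, weights, beta=0.1):
    """Adversarial term plus beta times the feature-emphasis term (structure only)."""
    d_out = discriminator(stylized)  # a single-input discriminator is assumed here for brevity
    adversarial = F.binary_cross_entropy(d_out, torch.ones_like(d_out))
    return adversarial + beta * feature_emphasis_loss(stylized, original, feature_fns, weights)

# Example (illustrative only): emphasize mean intensity and horizontal differences as key features.
feature_fns = {
    "intensity": lambda x: x.mean(dim=1, keepdim=True),
    "dx": lambda x: x[..., :, 1:] - x[..., :, :-1],
}
weights = {"intensity": 1.0, "dx": 0.5}
```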

A behavior prediction module 104, configured to design a hybrid neural network for behavior recognition and history-based behavior prediction from the stylized video frames; the hybrid neural network includes a CNN network and an RNN network;
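
A non-limiting sketch of one way such a hybrid network can be organized is shown below: a CNN extracts per-frame spatial features from the stylized frames, a recurrent module fuses them over time, and two heads produce the recognition and prediction outputs. The layer sizes, the GRU choice, and the head design are illustrative assumptions.

```python
import torch
import torch.nn as nn

class HybridBehaviorNet(nn.Module):
    """CNN for per-frame spatial features, RNN for temporal fusion, two output heads."""

    def __init__(self, num_classes, feat_dim=128, hidden_dim=256):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.BatchNorm2d(32), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, feat_dim),
        )
        self.rnn = nn.GRU(feat_dim, hidden_dim, batch_first=True)
        self.action_head = nn.Linear(hidden_dim, num_classes)
        self.predict_head = nn.Linear(hidden_dim + feat_dim, num_classes)

    def forward(self, frames):
        # frames: (batch, time, channels, height, width) stylized video frames
        b, t, c, h, w = frames.shape
        f_spatial = self.cnn(frames.reshape(b * t, c, h, w)).reshape(b, t, -1)
        states, _ = self.rnn(f_spatial)                 # hidden states S_t for each step
        s_t, f_t = states[:, -1], f_spatial[:, -1]      # last hidden state and last feature
        y_action = torch.softmax(self.action_head(s_t), dim=-1)
        y_predict = torch.softmax(self.predict_head(torch.cat([s_t, f_t], dim=-1)), dim=-1)
        return y_action, y_predict
```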

The system optionally further comprises the following modules:

An unsupervised learning algorithm module 105, configured to integrate an unsupervised learning algorithm and dynamically adjust the models and parameters of S1-S4 based on real-time feedback;

A feedback mechanism optimization module 106, configured to establish a feedback mechanism, dynamically adjust strategies according to user and performance feedback, and continuously optimize strategy execution.

In summary, the present invention initializes an adaptive learning model framework to set the foundation of the system; this framework can automatically adjust its network structure and parameters according to the complexity of the real-time video content. This adaptive capability is the key to applying video surveillance in changing environments: it ensures that no matter how complex the environment is, the system can adjust itself to achieve optimal processing performance and recognition accuracy. On top of this framework, real-time video frames are preprocessed and features are extracted, ensuring that the most critical information is obtained from each frame. This step executes efficiently because the adaptive framework has already supplied the model parameters and structure best suited to the current video content, optimizing the feature extraction process and improving the speed and accuracy of data processing. The extracted features are then stylized using generative adversarial networks (GANs), which enhances the visual representation of key behaviors in the video frames and makes subsequent behavior recognition more accurate. Because stylization is performed on the optimized features, it highlights behavioral features more effectively, especially in scenes where the visual information is complex or of low quality. A hybrid neural network then performs behavior recognition and future-behavior prediction on the stylized video frames; this step is effective because the stylized frames have already strengthened the necessary behavioral features, allowing the network to learn and predict more accurately. An unsupervised online learning mechanism is integrated into the system, allowing the model to self-optimize during actual operation and to continuously adjust its parameters and strategies through real-time feedback. This mechanism enables the system to adapt to long-term changes in the operating environment, reduces dependence on manual intervention, and improves the autonomy and reliability of the system. Finally, a system feedback mechanism continuously collects user and performance data to dynamically optimize the whole system; this closed-loop feedback ensures that the system keeps improving and adjusts its strategies in real time to meet new challenges and demands.

The present invention jointly solves the problems identified in the background art: insufficient recognition accuracy, poor adaptability to complex environments, low data-processing efficiency, and reliance on large amounts of labeled data. The introduction of the adaptive learning model lets the system automatically optimize its processing strategy for different video content; the application of stylization and the hybrid neural network greatly improves the accuracy of behavior recognition and its predictive ability; and the combination of unsupervised learning and the feedback mechanism guarantees the long-term effectiveness and autonomous optimization capability of the system. This design is not only technically innovative but also provides significant performance improvements in practical applications, fully meeting the needs of modern intelligent video surveillance systems.

The above are merely some preferred embodiments of the present invention and certainly cannot be used to limit its scope of rights; a person of ordinary skill in the art will understand that implementations of all or part of the processes of the above embodiments, and equivalent changes made according to the claims of the present invention, still fall within the scope covered by the invention.

Claims (10)

1. A video recognition method based on artificial intelligence, the method comprising the steps of:
s1, constructing an adaptive framework based on video content analysis, and dynamically adjusting a model structure and parameters according to the complexity of a real-time video frame;
S2, applying efficient image preprocessing techniques and a deep learning algorithm according to the model parameters to extract features from the video frames; wherein the image preprocessing techniques include illumination correction and noise removal; the extracting of features from the video frames specifically comprises extracting edge information of the video frame using an improved Sobel operator and analyzing image texture features using a local binary pattern operator, specifically:
wherein Gx and Gy represent the horizontal and vertical gradients, respectively, and G represents the final edge strength;
where gc denotes the center pixel value, gp denotes a neighborhood pixel value, and P denotes the total number of pixels in the neighborhood; LBP(x, y) denotes the LBP feature, reflecting the relative intensity of the current pixel with respect to its surrounding neighborhood; (x, y) are the horizontal and vertical coordinates of the current pixel, and LBP(x, y) is the local binary pattern value computed at the image coordinates (x, y);
S3, processing the features extracted from the video frames using a generative adversarial network and stylizing the key frames to obtain stylized video frames, strengthening the visual behavior features, comprising the following steps:
S301, designing a generative adversarial network, comprising a generator network and a discriminator network;
S302, designing a feature-emphasis loss term to form a total loss function for training the generator network, while adding a perception layer to the discriminator network; wherein the feature-emphasis loss term is expressed as follows:
wherein Fk denotes the feature extraction function for key feature k, K denotes the predefined set of key features, λk denotes the weight associated with feature k, used to adjust each feature's contribution to the total loss, I denotes the original frame, and G(z) denotes the output of generator G for latent-variable input z;
the total loss function is expressed as follows:
wherein β represents a hyperparameter controlling the influence of the feature-emphasis loss term, and D represents the discriminator network;
S303, training the generative adversarial network using a progressive training method;
S4, designing a hybrid neural network to perform behavior recognition and history-based behavior prediction from the stylized video frames; wherein the hybrid neural network comprises a CNN network and an RNN network;
The method optionally further comprises the following steps:
A. integrating an unsupervised learning algorithm and dynamically adjusting the models and parameters of S1-S4 based on real-time feedback;
B. setting up a feedback mechanism, dynamically adjusting strategies according to user and performance feedback, and continuously optimizing strategy execution.
2. The video recognition method based on artificial intelligence according to claim 1, wherein S1 specifically comprises:
S101, inputting a video frame sequence, wherein each frame of video is processed through a specific feature extraction sub-network;
S102, evaluating the degree of change of the scene by using the feature vector difference between two consecutive frames, comprising:
Dt=||Vt-Vt-1||2
Wherein D t represents the content variation amplitude of the frame between time t and t-1, V t represents each frame of video feature vector at time t, and V t-1 represents each frame of video feature vector at time t-1;
The average of these differences is calculated over a sliding window W of fixed size, resulting in the scene complexity Ct:
S103, automatically adjusting the configuration of the network according to the comparison result of Ct and a preset threshold θ;
S104, combining real-time performance feedback and adjusting the network weights and structural parameters by using a back-propagation algorithm.
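
By way of non-limiting illustration, the scene-complexity computation of this claim (frame-to-frame feature distance Dt, its sliding-window average Ct, and the comparison against a preset threshold θ) can be sketched as follows; the window size and threshold values are assumptions.

```python
import numpy as np
from collections import deque

class SceneComplexityMonitor:
    """Tracks Dt = ||Vt - Vt-1||2 and its sliding-window mean Ct."""

    def __init__(self, window_size=30, threshold=0.5):
        self.window = deque(maxlen=window_size)   # fixed-size sliding window W
        self.threshold = threshold                # preset threshold θ
        self.prev = None

    def update(self, feature_vector):
        v_t = np.asarray(feature_vector, dtype=np.float64)
        if self.prev is not None:
            d_t = float(np.linalg.norm(v_t - self.prev))   # content change magnitude Dt
            self.window.append(d_t)
        self.prev = v_t
        c_t = float(np.mean(self.window)) if self.window else 0.0
        reconfigure = c_t > self.threshold        # trigger a network configuration change
        return c_t, reconfigure
```
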
3. The artificial intelligence based video recognition method of claim 1, wherein the illumination correction is to apply automatic white balance and exposure compensation to the input video frames, as follows:
where I(x, y) is the original pixel value, Icorr(x, y) is the corrected pixel value, and Imin and Imax are the minimum and maximum pixel values in the frame, respectively;
The noise removal applies a bilateral filter to remove potential image noise while maintaining edge definition, expressed as follows:
where Wp is a normalization coefficient, fr and fs are Gaussian functions based on intensity and spatial proximity, respectively, ensuring that only neighboring pixels with similar intensity affect the current pixel value, (x, y) are the coordinates of the pixel, (x', y') are the coordinates of the neighboring pixels involved in the computation when the filter is applied, and Ifiltered(x, y) is the pixel value of the image at coordinates (x, y) after bilateral filtering.
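
A non-limiting sketch of this preprocessing step, with min-max illumination normalization followed by a naive (unoptimized) bilateral filter, is given below; the kernel radius and the widths of the two Gaussians fs and fr are illustrative parameters.

```python
import numpy as np

def illumination_correct(frame):
    """Stretch pixel values to [0, 1] using the frame's Imin and Imax."""
    i_min, i_max = float(frame.min()), float(frame.max())
    return (frame - i_min) / (i_max - i_min + 1e-8)

def bilateral_filter(gray, radius=2, sigma_s=2.0, sigma_r=0.1):
    """Naive bilateral filter: spatial Gaussian fs times intensity Gaussian fr,
    normalized by Wp, so only nearby pixels with similar intensity contribute."""
    h, w = gray.shape
    out = np.zeros((h, w), dtype=np.float64)
    ys, xs = np.mgrid[-radius:radius + 1, -radius:radius + 1]
    spatial = np.exp(-(xs ** 2 + ys ** 2) / (2 * sigma_s ** 2))             # fs
    padded = np.pad(gray, radius, mode="edge")
    for i in range(h):
        for j in range(w):
            patch = padded[i:i + 2 * radius + 1, j:j + 2 * radius + 1]
            rng = np.exp(-((patch - gray[i, j]) ** 2) / (2 * sigma_r ** 2))  # fr
            weights = spatial * rng
            out[i, j] = np.sum(weights * patch) / np.sum(weights)            # divide by Wp
    return out
```
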
4. The method of claim 1, wherein the generator network takes the feature vector of each video frame at time t as input, then applies multiple convolution layers, each followed by batch normalization and a ReLU activation function, and the last layer outputs a stylized image using a tanh activation function, expressed as follows:
Istyled=tanh(Conv(BN(ReLU(Conv(...Vt...)))))
the input of the discriminator network comprises Istyled and the original frame I; feature extraction is carried out using multiple convolution layers, each followed by batch normalization and a LeakyReLU activation function, and the last layer outputs a probability using a sigmoid function, expressed as follows:
Preal=σ(Conv(LReLU(BN(Conv(...Istyled,I...)))))
where σ represents the sigmoid activation function.
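
A compact, non-limiting sketch of the generator/discriminator pattern described in this claim (convolution stacks with batch normalization, ReLU or LeakyReLU activations, a tanh output for the stylized frame and a sigmoid output for the real/fake probability) is shown below; the channel counts, kernel sizes and input layout are illustrative assumptions.

```python
import torch
import torch.nn as nn

class Generator(nn.Module):
    """Conv -> BN -> ReLU stack ending in a tanh layer that emits the stylized frame."""
    def __init__(self, in_channels=64, out_channels=3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, 64, 3, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
            nn.Conv2d(64, 64, 3, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
            nn.Conv2d(64, out_channels, 3, padding=1), nn.Tanh(),
        )

    def forward(self, v_t):
        # v_t: per-frame features at time t, treated here as a 64-channel map (assumption)
        return self.net(v_t)          # Istyled

class Discriminator(nn.Module):
    """Conv -> BN -> LeakyReLU stack; sigmoid output is the probability the input is real."""
    def __init__(self, in_channels=6):   # stylized frame and original frame concatenated
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.BatchNorm2d(128), nn.LeakyReLU(0.2),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(128, 1), nn.Sigmoid(),
        )

    def forward(self, i_styled, i_original):
        return self.net(torch.cat([i_styled, i_original], dim=1))   # Preal
```
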
5. The artificial intelligence based video recognition method according to claim 4, wherein the progressive training method in S303 is to train the network at a low resolution first, and gradually transition to a high resolution.
6. The video recognition method based on artificial intelligence according to claim 1, wherein the hybrid neural network comprises a feature fusion layer, a temporal fusion layer and a behavior recognition and prediction output layer;
The feature fusion layer adopts a deep convolutional network to extract the spatial features of each frame image, expressed as follows:
Fspatial=CNN(Istyled)
wherein Fspatial denotes the spatial feature vector extracted from the stylized image;
The temporal fusion layer takes the feature vectors Fspatial of consecutive frames as input and models temporal dependence and dynamic change through a recurrent neural network module, expressed as follows:
St=RNN(Fspatial,St-1)
where St represents the hidden state at time t, carrying the accumulated information of past video frames; St-1 represents the hidden state at time t-1;
The behavior recognition and prediction output layer combines Fspatial and St to generate the final behavior recognition and behavior prediction outputs, expressed as follows:
Yaction=Softmax(Dense(St))
Ypredict=Softmax(Dense(St,Fspatial))
where Yaction is the behavior recognition result for the current frame and Ypredict is the result of predicting future behavior based on current and past information.
7. The artificial intelligence based video recognition method of claim 6, wherein a cross-entropy loss function is used to simultaneously optimize the accuracy of behavior recognition and prediction, as follows:
wherein yc is the one-hot encoding of the true behavior label, its counterpart in the loss is the probability distribution predicted by the model, and C is the number of behavior categories.
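
A short, non-limiting sketch of this joint cross-entropy objective, applied to both the recognition output and the prediction output, is given below; the equal weighting of the two terms is an assumption.

```python
import torch.nn.functional as F

def joint_behavior_loss(action_logits, predict_logits, action_labels, future_labels):
    """Cross-entropy over C behavior classes for current recognition and future prediction."""
    recognition_loss = F.cross_entropy(action_logits, action_labels)
    prediction_loss = F.cross_entropy(predict_logits, future_labels)
    return recognition_loss + prediction_loss
```

Note that F.cross_entropy expects unnormalized logits and integer class labels, so in this sketch the softmax is applied implicitly inside the loss rather than at the network output.
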
8. The method for identifying video based on artificial intelligence according to claim 1, wherein the step a specifically comprises the following steps:
constructing a feature transformation network F' by using a reversible feature transformation technology, and extracting and transforming features from an original video frame;
designing an autoencoder-based anomaly detection technique for reconstruction and anomaly scoring, to detect anomalous behavior or salient events in the video frames;
defining a total loss function combining the reconstruction error and the behavior recognition error, introducing additional regularization terms to enhance the generalization capability and adaptability of the model, and introducing a feedback-based dynamic weight adjustment mechanism to adjust the weights;
optimizing the feature transformation network by integrating it with the video content so that it adapts to dynamically changing elements;
introducing a model self-adaptive adjustment strategy that automatically adjusts the network architecture or parameters based on the currently detected behavior patterns and historical data.
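
A non-limiting sketch of the autoencoder-based reconstruction and anomaly-scoring step is given below; the architecture and the use of mean squared reconstruction error as the anomaly score are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class FeatureAutoencoder(nn.Module):
    """Compresses a per-frame feature vector and reconstructs it; a large reconstruction
    error marks the frame as a candidate anomalous behavior or salient event."""

    def __init__(self, feat_dim=128, latent_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(feat_dim, 64), nn.ReLU(), nn.Linear(64, latent_dim))
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 64), nn.ReLU(), nn.Linear(64, feat_dim))

    def forward(self, features):
        return self.decoder(self.encoder(features))

def anomaly_scores(model, features):
    """Per-frame anomaly score = mean squared reconstruction error."""
    with torch.no_grad():
        reconstruction = model(features)
    return ((features - reconstruction) ** 2).mean(dim=-1)
```
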
9. The method for identifying video based on artificial intelligence according to claim 1, wherein the step B specifically comprises the steps of:
Periodically collecting feedback data and processing the collected data using statistical analysis and machine learning methods;
defining a parameter adjustment strategy and dynamically adjusting the learning rate according to the performance change of the model based on the feedback data;
periodically performing an overall evaluation, updating the operating parameters of the model according to the evaluation results, and feeding them back to the parameter adjustment function and the learning-rate update strategy.
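
A minimal, non-limiting sketch of the feedback-driven learning-rate update: the rate is reduced when the monitored performance metric degrades and gently increased otherwise; the adjustment factors and bounds are assumptions.

```python
def adjust_learning_rate(learning_rate, current_metric, previous_metric,
                         decay=0.7, boost=1.05, min_lr=1e-6, max_lr=1e-2):
    """Lower the learning rate when performance drops, raise it slightly otherwise."""
    if current_metric < previous_metric:
        learning_rate *= decay
    else:
        learning_rate *= boost
    return min(max(learning_rate, min_lr), max_lr)
```
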
10. An artificial intelligence based video recognition system, the system comprising:
The adaptive framework construction module is used for constructing an adaptive framework based on video content analysis and dynamically adjusting the model structure and parameters according to the complexity of the real-time video frames;
The algorithm design module is used for extracting features from the video frames by applying efficient image preprocessing techniques and a deep learning algorithm according to the model parameters; wherein the image preprocessing techniques include illumination correction and noise removal; the extracting of features from the video frames specifically comprises extracting edge information of the video frame using an improved Sobel operator and analyzing image texture features using a local binary pattern operator, specifically:
wherein Gx and Gy represent the horizontal and vertical gradients, respectively, and G represents the final edge strength;
where gc denotes the center pixel value, gp denotes a neighborhood pixel value, and P denotes the total number of pixels in the neighborhood; LBP(x, y) denotes the LBP feature, reflecting the relative intensity of the current pixel with respect to its surrounding neighborhood; (x, y) are the horizontal and vertical coordinates of the current pixel, and LBP(x, y) is the local binary pattern value computed at the image coordinates (x, y);
The feature enhancement module is used for processing the features extracted from the video frames with a generative adversarial network and stylizing the key frames to obtain stylized video frames, strengthening the visual behavior features, comprising:
S301, designing a generative adversarial network, comprising a generator network and a discriminator network;
S302, designing a feature-emphasis loss term to form a total loss function for training the generator network, while adding a perception layer to the discriminator network; wherein the feature-emphasis loss term is expressed as follows:
wherein Fk denotes the feature extraction function for key feature k, K denotes the predefined set of key features, λk denotes the weight associated with feature k, used to adjust each feature's contribution to the total loss, I denotes the original frame, and G(z) denotes the output of generator G for latent-variable input z;
the total loss function is expressed as follows:
wherein β represents a hyperparameter controlling the influence of the feature-emphasis loss term, and D represents the discriminator network;
S303, training the generative adversarial network using a progressive training method;
The behavior prediction module is used for designing a hybrid neural network to perform behavior recognition and history-based behavior prediction from the stylized video frames; wherein the hybrid neural network comprises a CNN network and an RNN network;
The system optionally further comprises the following modules:
an unsupervised learning algorithm module, used for integrating an unsupervised learning algorithm and dynamically adjusting the models and parameters of S1-S4 based on real-time feedback;
and a feedback mechanism optimization module, used for setting up a feedback mechanism, dynamically adjusting strategies according to user and performance feedback, and continuously optimizing strategy execution.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410570103.5A CN118379664B (en) 2024-05-09 2024-05-09 Video identification method and system based on artificial intelligence

Publications (2)

Publication Number Publication Date
CN118379664A true CN118379664A (en) 2024-07-23
CN118379664B CN118379664B (en) 2025-03-11

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118570708A (en) * 2024-08-02 2024-08-30 淄博思创信息科技有限公司 Intelligent video analysis method and system
CN119418254A (en) * 2025-01-08 2025-02-11 小元感知(葫芦岛)科技有限公司 Real-time video quality optimization and enhancement method based on deep learning
CN119520841A (en) * 2025-01-20 2025-02-25 北京时代奥视科技有限公司 A real-time enhancement method for broadcast images

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111429341A (en) * 2020-03-27 2020-07-17 咪咕文化科技有限公司 Video processing method, video processing equipment and computer readable storage medium
US20210360179A1 (en) * 2020-05-13 2021-11-18 Qualcomm Incorporated Machine learning based image adjustment
CN117541991A (en) * 2023-11-22 2024-02-09 无锡科棒安智能科技有限公司 Intelligent recognition method and system for abnormal behaviors based on security robot
CN117934354A (en) * 2024-03-21 2024-04-26 共幸科技(深圳)有限公司 Image processing method based on AI algorithm
CN117972610A (en) * 2024-02-26 2024-05-03 广州佰锐网络科技有限公司 AI-based deep pseudo detection method and system

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
LI Guofa; CHEN Yaoyu; LYU Chen; TAO Da; CAO Dongpu; CHENG Bo: "Key technologies of driving behavior semantic parsing in intelligent vehicle decision-making", Journal of Automotive Safety and Energy, no. 04, 15 December 2019 (2019-12-15) *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant