
CN114005079B - Multimedia stream processing method and device - Google Patents


Info

Publication number
CN114005079B
CN114005079B (application CN202111666523.6A)
Authority
CN
China
Prior art keywords
information
sub
text information
text
segment
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN202111666523.6A
Other languages
Chinese (zh)
Other versions
CN114005079A
Inventor
赵悦汐
程红兵
鞠剑伟
昝晨辉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Jinmao Education Technology Co ltd
Original Assignee
Beijing Jinmao Education Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Jinmao Education Technology Co ltd filed Critical Beijing Jinmao Education Technology Co ltd
Priority to CN202111666523.6A priority Critical patent/CN114005079B/en
Publication of CN114005079A publication Critical patent/CN114005079A/en
Application granted granted Critical
Publication of CN114005079B publication Critical patent/CN114005079B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/40Information retrieval; Database structures therefor; File system structures therefor of multimedia data, e.g. slideshows comprising image and additional audio data
    • G06F16/44Browsing; Visualisation therefor

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application provides a multimedia stream processing method and device. The method comprises the following steps: acquiring a multimedia stream segment; decoding it to obtain a video stream sub-segment and an audio stream sub-segment; analyzing the video stream sub-segment to generate scene information and first text information; analyzing the audio stream sub-segment to generate second text information; and processing the scene information, the first text information, and the second text information to form an analysis summary of the multimedia stream. By disassembling the multimedia stream file, various independent AI modules can be combined effectively to identify the content of a multimedia file in a complex scene, which effectively improves the recognition efficiency of existing stand-alone AI techniques in such scenes.

Description

Multimedia stream processing method and device

Technical Field

The present application relates to the technical field of multimedia information recognition, and in particular to a multimedia stream processing method and device.

Background Art

With the continuous development and popularization of AI technology, many mature AI modules have appeared on the market, such as Alibaba Multimedia AI, which can be used to process information streams in media, for example the video stream, the audio stream, or a combined video-and-audio stream of a multimedia file. When processing these multimedia streams, the corresponding content of the acquired multimedia file can be recognized by the corresponding AI module.

In the course of implementing the prior art, the inventors found that:

The recognition mode of common AI modules is currently single-purpose. Faced with a complex scene to be recognized, a single AI module cannot analyze it, which reduces the recognition efficiency for multimedia files.

Therefore, a multimedia stream processing method and device are needed to solve the technical problem of the low recognition efficiency of existing stand-alone AI techniques in complex scenes.

Summary of the Invention

Embodiments of the present application provide a multimedia stream processing method and device to solve the technical problem of the low recognition efficiency of existing stand-alone AI techniques in complex scenes.

Specifically, a multimedia stream processing method comprises the following steps:

acquiring a multimedia stream segment;

decoding it to obtain a video stream sub-segment and an audio stream sub-segment;

analyzing the video stream sub-segment to generate scene information and first text information;

analyzing the audio stream sub-segment to generate second text information;

processing the scene information, the first text information, and the second text information to form an analysis summary of the multimedia stream.

Further, analyzing the video stream sub-segment to generate scene information specifically comprises:

analyzing the video stream sub-segment to generate identity-recognition information for each object and description information of the object's actions.

Further, analyzing the video stream sub-segment to generate first text information specifically comprises:

analyzing the video stream sub-segment to generate the first text information pointed to by an object's actions.

Further, analyzing the video stream sub-segment to generate the first text information pointed to by an object's actions specifically comprises:

analyzing the video stream sub-segment to obtain an image that persists for a preset duration;

recognizing the image with OCR to generate the first text information.

Further, the first text information comprises at least one of teaching-phase information and knowledge-point information.

Further, the second text information comprises at least one of text error-correction information, keyword information, question information, and emotion description information.

Further, processing the scene information, the first text information, and the second text information to form an analysis summary of the multimedia stream specifically comprises:

cross-validating the scene information, the first text information, and the second text information to form the analysis summary of the multimedia stream.

An embodiment of the present application further provides a multimedia stream processing device.

Specifically, a multimedia stream processing device comprises:

an acquisition module for acquiring a multimedia stream segment;

a decoding module for decoding the segment to obtain a video stream sub-segment and an audio stream sub-segment;

a video analysis module for analyzing the video stream sub-segment to generate scene information and first text information;

an audio analysis module for analyzing the audio stream sub-segment to generate second text information;

an analysis summary generation module for processing the scene information, the first text information, and the second text information to form an analysis summary of the multimedia stream.

Further, the video analysis module, when analyzing the video stream sub-segment to generate scene information, is specifically configured to:

analyze the video stream sub-segment to generate identity-recognition information for each object and description information of the object's actions.

Further, the video analysis module, when analyzing the video stream sub-segment to generate first text information, is specifically configured to:

analyze the video stream sub-segment to generate the first text information pointed to by an object's actions.

The technical solutions provided by the embodiments of the application have at least the following beneficial effects:

By disassembling the multimedia stream file, various independent AI modules can be combined effectively to recognize the content of a multimedia file in a complex scene, which effectively improves the recognition efficiency of existing stand-alone AI techniques in such scenes.

Brief Description of the Drawings

The drawings described here provide a further understanding of the present application and constitute a part of it. The illustrative embodiments of the present application and their descriptions explain the application and do not unduly limit it. In the drawings:

FIG. 1 is a flowchart of a multimedia stream processing method provided by an embodiment of the present application.

FIG. 2 is a schematic structural diagram of a multimedia stream processing device provided by an embodiment of the present application.

100 multimedia stream processing device

11 acquisition module

12 decoding module

13 video analysis module

14 audio analysis module

15 analysis summary generation module.

Detailed Description of the Embodiments

To make the objectives, technical solutions, and advantages of the present application clearer, the technical solutions are described below clearly and completely with reference to specific embodiments of the present application and the corresponding drawings. Obviously, the described embodiments are only some of the embodiments of the present application, not all of them. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present application without creative effort fall within the scope of protection of the present application.

It can be understood that a multimedia stream file records video stream information and audio stream information. The video stream information mainly corresponds to a series of consecutive image frames in the multimedia file; the audio stream information corresponds to the collection of voice information in the multimedia file. Accordingly, the video stream information records scene information and text information related to the environment, while the audio stream information records the voice information corresponding to that environment. Scene information here can be understood as the object information, related to the presented objects, that is recorded in each image frame; text information here can be understood as the character-related symbol information recorded in each image frame.

Through a single AI module, the scene information or text information in the video stream, or the audio stream information, can be recognized individually, identifying the object behavior or text present in a video file or the speech content of an audio file. However, a multimedia file for a complex scene usually contains both video information and audio information. If recognition continues through a single AI module, the video stream information and the audio stream information recorded in the multimedia file cannot be fully identified, so a certain error arises between the recognized content and the real content recorded in the multimedia file. Although several different single-purpose AI modules could be run simultaneously to recognize the recorded content of a complex scene, each such module carries a large computational workload. This reduces the recognition speed for multimedia files and hinders the structured merging of the related recognition results.

An embodiment of the present application provides a multimedia stream processing method mainly used for processing multimedia files of complex scenes. In a specific implementation provided by this application, the method can be used to process a multimedia file that records the complex scene of a classroom teaching session. Specifically, referring to FIG. 1, a multimedia stream processing method comprises the following steps:

S100: Acquire a multimedia stream segment.

A multimedia stream segment here can be understood as a file that records media information of the corresponding scene, such as text, graphics, images, animation, sound, and video. In a specific implementation provided by this application, the acquired multimedia stream segment is a multimedia file of a certain duration in which a classroom teaching scene is recorded. The segment can be captured by a suitable video recording device, so that the live classroom scene is filmed and a multimedia file is obtained that records the sound, text, pictures, people, and other information of the teaching session.

S200: Decode the segment to obtain a video stream sub-segment and an audio stream sub-segment.

The video stream sub-segment here can be understood as the image information of the multimedia segment, and the audio stream sub-segment as its sound information. Decoding the acquired multimedia stream segment means extracting the image information and the sound information from the multimedia file of a certain duration and converting them into consecutive image frames and continuous audio in a preset file format, thereby obtaining the video stream sub-segment and the audio stream sub-segment corresponding to the multimedia stream segment.

When the acquired multimedia stream segment is a multimedia file of a certain duration that records a classroom teaching scene, decoding yields a video stream sub-segment recording the text, pictures, people, and other information of the teaching session, and an audio stream sub-segment recording its sound information.
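The patent does not name a demuxing tool; as one common way to realize this step, the splitting could be delegated to FFmpeg. The sketch below only builds the two command lines (all file names are illustrative) so the stream-selection flags are visible:

```python
# Sketch: build FFmpeg invocations that split one multimedia segment into a
# video-only sub-segment and an audio-only sub-segment.
# "-an" drops audio, "-vn" drops video; the audio is re-encoded to 16 kHz PCM,
# a format speech-recognition modules commonly accept (an assumption here).
def demux_commands(src: str, video_out: str, audio_out: str) -> list:
    extract_video = ["ffmpeg", "-i", src, "-an", "-c:v", "copy", video_out]
    extract_audio = ["ffmpeg", "-i", src, "-vn",
                     "-acodec", "pcm_s16le", "-ar", "16000", audio_out]
    return [extract_video, extract_audio]
```

Each command list could then be executed with `subprocess.run(cmd, check=True)`.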

S310: Analyze the video stream sub-segment to generate scene information and first text information.

It can be understood that the consecutive image frames of a multimedia stream segment constitute the video stream sub-segment, and each frame records corresponding scene information. In a classroom teaching scene, the video stream sub-segment consists of consecutive frames recording the text, pictures, people, and other information of the teaching session. Analysis by AI modules with the corresponding functions yields the specific scene information corresponding to the people in the current video stream sub-segment, as well as the specific text information corresponding to its text and pictures.

Specifically, by recognizing the people in the current video stream sub-segment, the specific action category of each person can be determined, which helps to determine the specific classroom teaching scene to which the sub-segment corresponds. By recognizing the specific text information corresponding to the text and pictures in the current video stream sub-segment, the specific text type or description content of that text and picture information can be determined, so that the first text information corresponding to the sub-segment can be generated. The first text information here can be understood as text information generated from the video stream sub-segment.

S320: Analyze the audio stream sub-segment to generate second text information.

It can be understood that the audio stream sub-segment is generated from the voice information of the multimedia stream segment. In a classroom teaching scene, the audio stream sub-segment is a file that records the sound information of the teaching session. Analysis by an AI module with the corresponding function determines the specific spoken content of the audio stream sub-segment and yields the second text information corresponding to that content. The second text information here can be understood as text information generated from the audio stream sub-segment.

S400: Process the scene information, the first text information, and the second text information to form an analysis summary of the multimedia stream.

The analysis summary here can be understood as an overview of the specific real-time scene corresponding to the currently processed multimedia stream segment. Processing the scene information, the first text information, and the second text information mainly means identifying the key data most closely correlated with that real-time scene. By integrating the identified target data, the specific teaching process corresponding to the currently processed segment is obtained. Splitting the multimedia stream segment into a video stream sub-segment and an audio stream sub-segment, each analyzed by an AI module with the corresponding function, effectively reduces the amount of data each single-function AI module must process and allows the analysis module with the right capability to be selected accurately, thereby improving the recognition efficiency for multimedia stream segments.
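Steps S100–S400 can be sketched as a small pipeline with pluggable analyzers. Everything below — the container class, the parameter names, the idea of passing the AI modules in as callables — is illustrative and not specified by the patent:

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class AnalysisSummary:
    """Hypothetical container for the three result sets of one segment."""
    scene_info: List[str]
    first_text: List[str]   # text recovered from the video sub-segment
    second_text: List[str]  # text recovered from the audio sub-segment

def process_stream(segment: bytes,
                   decode: Callable,         # segment -> (video_sub, audio_sub)
                   analyze_video: Callable,  # video_sub -> (scene_info, first_text)
                   analyze_audio: Callable,  # audio_sub -> second_text
                   ) -> AnalysisSummary:
    """Run the claimed steps over one multimedia stream segment."""
    video_sub, audio_sub = decode(segment)             # S200: decode
    scene_info, first_text = analyze_video(video_sub)  # S310
    second_text = analyze_audio(audio_sub)             # S320
    # S400: combine the three result sets into the analysis summary
    return AnalysisSummary(scene_info, first_text, second_text)
```

Because each stage is an injected callable, each single-function AI module only ever sees its own sub-segment, matching the data-reduction argument above.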

Further, in a preferred implementation provided by this application, analyzing the video stream sub-segment to generate scene information specifically comprises: analyzing the video stream sub-segment to generate identity-recognition information for each object and description information of the object's actions.

The identity-recognition information of an object here can be understood as the object's facial feature information. It can be understood that, given an image containing a person's face, recognizing the facial features with a pretrained recognition algorithm determines the person's specific identity, for example the name and student number of a student, or the name and staff number of a teacher.

The description information of an object's actions here can be understood as the specific action category of the person in the current video stream sub-segment. Given an image containing a person's body movement, a pretrained recognition algorithm can determine the specific action category, for example whether the person is currently writing, standing up, or writing on the blackboard.

By recognizing the specific identity and behavior information of the objects involved in the video stream sub-segment, the teacher and student behavior in the teaching scene corresponding to the sub-segment can be determined, which improves the accuracy of the multimedia stream analysis summary.
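Pairing the two recognizers described above could look like the sketch below. The pretrained face-identification and action-classification models are assumed to exist and are passed in as callables; their names and output labels are hypothetical:

```python
from typing import Callable, Dict, List

def scene_info(frames: List[object],
               identify: Callable[[object], str],        # face -> identity label
               classify_action: Callable[[object], str]  # frame -> action label
               ) -> List[Dict[str, str]]:
    """For each frame, pair the recognized identity with the action category,
    yielding the per-frame scene information described above."""
    return [{"who": identify(f), "action": classify_action(f)} for f in frames]
```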

Further, in a preferred implementation provided by this application, analyzing the video stream sub-segment to generate first text information specifically comprises: analyzing the video stream sub-segment to generate the first text information pointed to by an object's actions.

The first text information pointed to by an object's actions can be understood as text information that has a certain degree of correlation with the person's specific behavior. The first text information generated from the video stream sub-segment includes the specific text types or description content corresponding to the text and picture information of the classroom teaching scene, for example the slides presented in class or the blackboard bulletin in the classroom background. In a classroom teaching scene, the text on the presented slides is text information related to the person's actions; the blackboard bulletin, however, is background information in the video stream sub-segment that has nothing to do with the people's actions in the current scene, and so does not belong to the first text information.

By analyzing, in a targeted way, only the first text information related to people's actions in the video stream sub-segment, the data processing load of the corresponding functional module is reduced and its recognition accuracy is increased, which effectively improves the analysis efficiency for the first text information.

Further, in a preferred implementation provided by this application, analyzing the video stream sub-segment to generate the first text information pointed to by an object's actions specifically comprises: analyzing the video stream sub-segment to obtain an image that persists for a preset duration, and recognizing the image with OCR to generate the first text information.

It can be understood that the first text information pointed to by an object's actions is text information that, in the scene corresponding to the current video stream sub-segment, has a certain degree of correlation with the object's behavior. In a classroom teaching scene, such text can be understood as text related to the teaching content, for example the teacher's blackboard writing or the text on the presented slides. It should be pointed out that, in actual teaching, if the text information pointed to by an action is relatively important, the person will dwell on it for a certain period of time, so an image containing important text information stays on screen longer; text of low importance or no value correspondingly has a shorter duration, so an image containing unimportant text stays on screen only briefly. The importance of the text in an image can therefore be judged from the image's duration, and the duration threshold can be preset from practical conditions or empirical values. If a frame persists long enough to satisfy the preset condition, the subsequent text recognition process is started; if it does not, its text is of low importance and no recognition needs to be carried out. In this way, the recognition efficiency for the first text can be increased while its recognition accuracy is preserved.
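The duration test above could be realized as a simple run-length filter over timestamped frames. This sketch is an assumption: it treats two frames as "the same image" when they carry the same content identifier, whereas a real system would use perceptual similarity between frames:

```python
def select_stable_frames(frames, min_duration):
    """frames: list of (timestamp, frame_id) pairs sorted by time.
    Return the frame_ids whose content stays on screen for at least
    min_duration, i.e. the frames worth sending to OCR."""
    selected, start = [], 0
    for i in range(1, len(frames) + 1):
        # A run ends at the list boundary or when the content changes.
        if i == len(frames) or frames[i][1] != frames[start][1]:
            if frames[i - 1][0] - frames[start][0] >= min_duration:
                selected.append(frames[start][1])
            start = i
    return selected
```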

Specifically, the text information in an image that satisfies the preset duration can be recognized by OCR; that is, optical character recognition is used to recognize the character information in those frames of the video stream sub-segment that satisfy the preset condition. For example, OCR identifies as first text information the relevant content of a slide page that satisfies the preset condition, or the specifically marked text on the classroom blackboard.
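The patent does not fix an OCR engine, so the sketch below takes the engine as a callable — in practice this could be, for example, `pytesseract.image_to_string` — and keeps only non-empty results as first text information:

```python
from typing import Callable, List

def first_text_from_frames(frames: List[object],
                           ocr: Callable[[object], str]) -> List[str]:
    """Apply the supplied OCR engine to each retained frame and collect
    the non-empty recognition results as first text information."""
    texts = []
    for frame in frames:
        text = ocr(frame).strip()
        if text:  # frames with no recoverable characters are dropped
            texts.append(text)
    return texts
```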

Further, in a preferred implementation provided by this application, the first text information comprises at least one of teaching-phase information and knowledge-point information.

The teaching-phase information here can be understood as the phase corresponding to the current image among those frames of the video stream sub-segment that satisfy the preset duration, for example the specific heading to which the current slide corresponds. The knowledge-point information here can be understood as the specific annotated content recorded in the current image among those frames.

It is understandable that the blackboard and slide material of a teaching session inevitably contains some information unrelated to the teaching content, referred to here as worthless information. Recognizing this worthless information increases both the amount of computation and the proportion of meaningless content in the first text. Therefore, by recognizing, in a targeted way, the teaching-phase information or knowledge-point information in the video stream sub-segment, more accurate first text information is obtained and its redundancy is reduced, which facilitates the determination of the multimedia analysis summary.

Further, in a preferred implementation provided by this application, the second text information comprises at least one of text error-correction information, keyword information, question information, and emotion description information.

It can be understood that the second text information corresponds to text information generated from the audio stream sub-segment, which in a classroom teaching scene records the sound information of the teaching session. In practice, recognizing the audio stream sub-segment as the second text can be used to produce subtitle information synchronized with the video stream sub-segment. A natural language processing model can correct errors in the subtitle text and also extract keyword information, question information, and emotion description information, all of which can be understood as elements that form the multimedia stream analysis summary. The second text information obtained by analysis therefore comprises at least one of text error-correction information, keyword information, question information, and emotion description information, which guarantees the accuracy of the multimedia stream analysis summary.
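As a toy stand-in for the NLP model mentioned above, the sketch below splits transcript sentences into the "question" category versus ordinary statements using a punctuation heuristic; real keyword, error-correction, and emotion extraction would require an actual language model and is not shown:

```python
def classify_sentences(sentences):
    """Split transcript sentences into questions and statements.
    A sentence ending in '?' is treated as a question (a deliberately
    crude heuristic standing in for an NLP model)."""
    questions = [s for s in sentences if s.rstrip().endswith("?")]
    statements = [s for s in sentences if not s.rstrip().endswith("?")]
    return {"questions": questions, "statements": statements}
```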

进一步的,在本申请提供的一种优选实施方式中,处理所述场景信息、所述第一文本信息和所述第二文本信息,形成所述多媒体流的分析摘要,具体包括:对场景信息、第一文本信息和第二文本信息进行交叉验证,形成所述多媒体流的分析摘要。Further, in a preferred embodiment provided by this application, processing the scene information, the first text information and the second text information to form an analysis summary of the multimedia stream specifically includes: cross-validating the scene information, the first text information and the second text information to form the analysis summary of the multimedia stream.

这里所述的交叉验证可以理解为将分析得到的场景信息、第一文本信息和第二文本信息等各类数据归总到不同的课程结构中。可以理解的是,进行场景信息、第一文本信息的分析,是根据视频流子片段完成的。进行第二文本信息的识别,是根据音频流子片段完成的。所述视频流子片段与音频流子片段均为从多媒体流片段中提取得到的。若仅根据场景信息、第一文本信息和第二文本信息中的某一信息进行多媒体流的分析摘要,无法得到具有较高准确性的多媒体流分析摘要。因此,需要综合考虑场景信息、第一文本信息和第二文本信息,并将其根据课程结构进行相应的归类汇总,从而得到准确性较高的多媒体流分析摘要。这一过程,也可理解为根据获取的多媒体视频内容,完成视频内容数据的拆解,并将拆解数据重新归类关联到相应的课堂环节。The cross-validation described here can be understood as grouping the scene information, the first text information, the second text information and other data obtained by analysis into different course structures. It can be understood that the analysis of the scene information and the first text information is performed on the video stream sub-segments, while the recognition of the second text information is performed on the audio stream sub-segments; both are extracted from the multimedia stream segment. If the analysis summary were produced from only one of the scene information, the first text information and the second text information, a highly accurate multimedia stream analysis summary could not be obtained. Therefore, the scene information, the first text information and the second text information need to be considered together, classified and summarized according to the course structure, so as to obtain a more accurate multimedia stream analysis summary. This process can also be understood as disassembling the video content data of the acquired multimedia video and re-associating the disassembled data with the corresponding classroom links.
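The grouping step above can be sketched as a time-based join, assuming each analyzed item and each classroom link carries a timestamp. The data shapes below are assumptions; the patent does not specify them.

```python
from dataclasses import dataclass, field

# Hypothetical data shapes for illustration only.
@dataclass
class Item:
    start: float        # seconds from the beginning of the segment
    kind: str           # "scene", "text1" (video OCR) or "text2" (audio)
    content: str

@dataclass
class CourseLink:
    name: str
    start: float
    end: float
    items: list = field(default_factory=list)

def cross_validate(items, links):
    """Group analyzed items into the course link whose time span covers them."""
    for item in items:
        for link in links:
            if link.start <= item.start < link.end:
                link.items.append(item)
                break
    return links
```

For example, a scene item at t=10s falls into a "导入" link spanning 0 to 60 seconds, while an audio keyword at t=120s lands in the following "讲授" link, so each course structure accumulates the evidence from all three information sources.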

在本申请提供的一种具体实施方式中,可以将课堂教学场景中的课程内容拆解为教学内容、师生行为、师生语言三大类。其中,教学内容可以主要体现在教学PPT及老师讲课的语音内容中;师生行为主要体现在动作肢体的行为变化上;师生语言主要体现在语言交流上。因此,针对课程内容,利用OCR识别技术进行PPT/黑板上的特定标记的文字内容的识别。这时,完成了课程总体结构的第一次划分,即将课堂环节进行了区分。通过人脸识别技术并结合对动作行为的识别,能够将老师授课、学生板书、举手、起立、阅读、书写等一系列行为动作进行行为划分。最后,再将课堂中的语音进行实时字幕翻译,并通过自然语言处理能力,能够将字幕中的文字信息纠错,并且还能提取其中与关键词信息、提问信息、情感描述信息等相关的语境信息。这样,得到了相关的场景信息、第一文本信息与第二文本信息。将得到的场景信息、第一文本信息与第二文本信息等各类数据归总到不同的课程结构中,即可完成对课堂教学视频内容的拆解,并将拆解得到的相关数据重新归类关联到相应的课堂环节。In a specific embodiment provided in this application, the course content of a classroom teaching scenario can be disassembled into three categories: teaching content, teacher-student behavior, and teacher-student language. The teaching content is mainly reflected in the teaching PPT and the voice content of the teacher's lecture; teacher-student behavior is mainly reflected in changes of body movements; teacher-student language is mainly reflected in verbal communication. Therefore, for the course content, OCR technology is used to recognize the specifically marked text content on the PPT/blackboard. At this point, the first division of the overall course structure is completed, that is, the classroom links are distinguished. Through face recognition combined with action recognition, a series of behaviors such as the teacher lecturing and students writing on the blackboard, raising hands, standing up, reading, and writing can be classified. Finally, the classroom speech is translated into real-time subtitles, and through natural language processing the text in the subtitles can be corrected and context information related to keywords, questions, and emotion descriptions can be extracted. In this way, the relevant scene information, first text information and second text information are obtained. By grouping the obtained scene information, first text information, second text information and other data into different course structures, the disassembly of the classroom teaching video content is completed, and the disassembled data is re-classified and associated with the corresponding classroom links.

本申请实施例还提供一种多媒体流处理装置100,主要用于处理复杂场景下的多媒体文件。在本申请提供的一种具体实施方式中,所述多媒体流处理装置100可以用于处理记录有课堂教学过程这一复杂场景的多媒体文件。具体的,请参照图2,一种多媒体流处理装置,包括:Embodiments of the present application further provide a multimedia stream processing apparatus 100, which is mainly used for processing multimedia files in complex scenarios. In a specific implementation manner provided by this application, the multimedia stream processing apparatus 100 may be used to process a multimedia file that records a complex scene of a classroom teaching process. Specifically, please refer to FIG. 2, a multimedia stream processing device, comprising:

获取模块11,用于获取多媒体流片段;an acquisition module 11, for acquiring multimedia stream segments;

解码模块12,用于解码获取视频流子片段和音频流子片段;The decoding module 12 is used to decode and obtain the sub-segment of the video stream and the sub-segment of the audio stream;

视频分析模块13,用于分析所述视频流子片段生成场景信息、第一文本信息;a video analysis module 13, configured to analyze the sub-segments of the video stream to generate scene information and first text information;

音频分析模块14,用于分析所述音频流子片段生成第二文本信息;an audio analysis module 14, configured to analyze the sub-segments of the audio stream to generate second text information;

分析摘要生成模块15,用于处理所述场景信息、所述第一文本信息和所述第二文本信息,形成所述多媒体流的分析摘要。The analysis summary generating module 15 is configured to process the scene information, the first text information and the second text information to form an analysis summary of the multimedia stream.
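The five modules above can be wired together as a simple pipeline. The class and method names below are illustrative assumptions; the patent names the modules but not their interfaces.

```python
# A minimal pipeline sketch of the apparatus described above.
class MultimediaStreamProcessor:
    def __init__(self, decoder, video_analyzer, audio_analyzer, summarizer):
        self.decoder = decoder                  # module 12
        self.video_analyzer = video_analyzer    # module 13
        self.audio_analyzer = audio_analyzer    # module 14
        self.summarizer = summarizer            # module 15

    def process(self, segment):
        """Run one acquired multimedia stream segment through all modules."""
        video_sub, audio_sub = self.decoder(segment)
        scene, text1 = self.video_analyzer(video_sub)
        text2 = self.audio_analyzer(audio_sub)
        return self.summarizer(scene, text1, text2)
```

With stub callables standing in for the real AI modules, the data flow can be exercised end to end:

```python
proc = MultimediaStreamProcessor(
    decoder=lambda seg: (seg + ":v", seg + ":a"),
    video_analyzer=lambda v: ("scene(" + v + ")", "text1(" + v + ")"),
    audio_analyzer=lambda a: "text2(" + a + ")",
    summarizer=lambda s, t1, t2: [s, t1, t2],
)
```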

获取模块11,用于获取多媒体流片段。这里的多媒体流片段可以理解为记录有相应场景的文字、图形、影像、动画、声音及视频等媒体信息的文件。在本申请提供的一种具体实施方式中,获取的多媒体流片段为具有一定时长,且记录有课堂教学场景的多媒体文件。所述多媒体流片段可以通过相应的视频拍摄设备拍摄得到。这样,能够对课堂实时场景进行拍摄,从而得到记录有课堂教学过程中的声音、文字、图片、人员对象等信息的多媒体文件。The acquiring module 11 is used for acquiring multimedia stream segments. The multimedia stream segment here can be understood as a file recording media information of the corresponding scene, such as text, graphics, images, animation, sound and video. In a specific embodiment provided by the present application, the acquired multimedia stream segment is a multimedia file of a certain duration that records a classroom teaching scene. The multimedia stream segment can be captured by a corresponding video shooting device. In this way, the real-time classroom scene can be filmed, yielding a multimedia file that records information such as sound, text, pictures and personnel objects in the classroom teaching process.

解码模块12,用于解码获取视频流子片段和音频流子片段。这里的视频流子片段可以理解为多媒体片段中的图像信息。这里的音频流子片段可以理解为多媒体片段中的声音信息。对获取的多媒体流片段进行解码,即将具有一定时长的多媒体文件中的图像信息以及声音信息提取出来,并转换为预设文件格式的连续若干帧图像以及连续音频,从而得到多媒体流片段对应的视频流子片段和音频流子片段。The decoding module 12 is used to decode and obtain the video stream sub-segment and the audio stream sub-segment. The video stream sub-segment here can be understood as the image information in the multimedia segment, and the audio stream sub-segment as the sound information in it. Decoding the acquired multimedia stream segment means extracting the image information and sound information from a multimedia file of a certain duration and converting them into several consecutive frames of images and continuous audio in a preset file format, thereby obtaining the video stream sub-segment and audio stream sub-segment corresponding to the multimedia stream segment.
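The decoding step could, for instance, be driven by the ffmpeg command-line tool; the patent does not name a specific decoder, so the tool choice, sampling rate and output layout below are assumptions. The helper only builds the argument lists; running them requires ffmpeg to be installed.

```python
# Sketch: split one multimedia segment into a frame sequence (video
# sub-segment) and a continuous audio track (audio sub-segment).
def build_decode_commands(segment_path, fps=1, out_dir="out"):
    """Return ffmpeg argument lists for frame and audio extraction."""
    frames_cmd = [
        "ffmpeg", "-i", segment_path,
        "-vf", f"fps={fps}",               # sample N frames per second
        f"{out_dir}/frame_%05d.png",
    ]
    audio_cmd = [
        "ffmpeg", "-i", segment_path,
        "-vn",                             # drop the video track
        "-acodec", "pcm_s16le", "-ar", "16000", "-ac", "1",
        f"{out_dir}/audio.wav",            # 16 kHz mono, typical ASR input
    ]
    return frames_cmd, audio_cmd
```

The two commands would then be executed (e.g. via `subprocess.run`) to produce the consecutive frames and the continuous audio described above.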

当获取的多媒体流片段为具有一定时长,且记录有课堂教学场景的多媒体文件时,经解码,对应得到记录有课堂教学过程中文字、图片、人员对象等信息的视频流子片段,以及记录有课堂教学过程中声音信息的音频流子片段。When the acquired multimedia stream segment is a multimedia file of a certain duration recording a classroom teaching scene, decoding correspondingly yields a video stream sub-segment recording information such as text, pictures and personnel objects in the classroom teaching process, and an audio stream sub-segment recording the sound information of the classroom teaching process.

视频分析模块13,用于分析所述视频流子片段生成场景信息、第一文本信息。可以理解的是,多媒体流片段中的连续若干帧图像即可构成视频流子片段。而每一帧图像均记录有相应的场景信息。在课堂教学场景中,所述视频流子片段为记录有课堂教学过程中文字、图片、人员对象等信息连续若干帧图像。经具有相应功能的AI模块分析,即可得到当前视频流子片段中人员对象对应的具体场景信息,以及当前视频流子片段中文字、图片对应的具体文本信息。The video analysis module 13 is configured to analyze the video stream sub-segment to generate the scene information and the first text information. It can be understood that several consecutive frames of images in a multimedia stream segment constitute a video stream sub-segment, and each frame records corresponding scene information. In a classroom teaching scenario, the video stream sub-segment is a sequence of consecutive frames recording information such as text, pictures and personnel objects in the classroom teaching process. After analysis by an AI module with the corresponding function, the specific scene information corresponding to the personnel objects in the current video stream sub-segment, and the specific text information corresponding to its text and pictures, can be obtained.

具体的,通过对当前视频流子片段中人员对象的识别,能够确定当前人员对象的具体动作类别,从而便于确定当前视频流子片段对应的具体课堂教学场景。通过对当前视频流子片段中文字、图片对应的具体文本信息的识别,能够确定当前视频流子片段对应的文字、图片信息对应的具体文本种类或描述内容,从而能够生成当前视频流子片段对应的第一文本信息。这里的第一文本信息可以理解为根据视频流子片段生成的文本信息。Specifically, by recognizing the personnel objects in the current video stream sub-segment, the specific action category of the current personnel object can be determined, which facilitates determining the specific classroom teaching scene of the current video stream sub-segment. By recognizing the specific text information corresponding to the text and pictures in the current video stream sub-segment, the specific text type or description content of that text and picture information can be determined, so that the first text information corresponding to the current video stream sub-segment can be generated. The first text information here can be understood as text information generated from the video stream sub-segment.

音频分析模块14,用于分析所述音频流子片段生成第二文本信息。可以理解的是,根据多媒体流片段中的语音信息即可生成音频流子片段。在课堂教学场景中,所述音频流子片段为记录有课堂教学过程中声音信息相关的文件。经具有相应功能的AI模块分析,即可确定所述音频流子片段的具体讲述内容,并得到与所述音频流子片段讲述内容相对应的第二文本信息。这里的第二文本信息可以理解为根据音频流子片段生成的文本信息。The audio analysis module 14 is configured to analyze the audio stream sub-segment to generate the second text information. It can be understood that the audio stream sub-segment is generated from the voice information in the multimedia stream segment. In a classroom teaching scenario, the audio stream sub-segment is a file recording the sound information of the classroom teaching process. After analysis by an AI module with the corresponding function, the specific narrated content of the audio stream sub-segment can be determined, and the second text information corresponding to that content obtained. The second text information here can be understood as text information generated from the audio stream sub-segment.

分析摘要生成模块15,用于处理所述场景信息、所述第一文本信息和所述第二文本信息,形成所述多媒体流的分析摘要。这里的分析摘要可以理解为当前处理的多媒体流片段对应的具体实时场景的概述。对场景信息、第一文本信息和第二文本信息进行处理,主要是识别其中与多媒体流片段对应的实时场景关联度较高的重点数据。通过对识别到的目标数据进行整合,即可得到当前处理的多媒体流片段对应的具体教学过程。将多媒体流片段拆分为视频流子片段以及音频流子片段,并通过具有相应功能的AI模块分析,能够有效减小功能单一的AI模块的处理数据量,并能够准确选取具有相应分析功能的分析模块,从而提高了多媒体流片段的识别效率。The analysis summary generating module 15 is configured to process the scene information, the first text information and the second text information to form the analysis summary of the multimedia stream. The analysis summary here can be understood as an overview of the specific real-time scene corresponding to the currently processed multimedia stream segment. Processing the scene information, the first text information and the second text information mainly means identifying the key data that is highly correlated with the real-time scene of the multimedia stream segment. By integrating the identified target data, the specific teaching process corresponding to the currently processed multimedia stream segment can be obtained. Splitting the multimedia stream segment into video stream sub-segments and audio stream sub-segments, each analyzed by an AI module with the corresponding function, effectively reduces the amount of data each single-function AI module must process and allows the analysis module with the appropriate function to be selected accurately, thereby improving the recognition efficiency of multimedia stream segments.

进一步的,在本申请提供的一种优选实施方式中,所述视频分析模块13用于分析所述视频流子片段生成场景信息,具体用于:分析视频流子片段,生成面向对象的身份特征识别信息和对对象动作行为的描述信息。Further, in a preferred embodiment provided by this application, the video analysis module 13 is configured to analyze the video stream sub-segment to generate the scene information, and is specifically configured to: analyze the video stream sub-segment to generate object-oriented identity feature identification information and description information of the object's action behavior.

这里对象的身份特征识别信息可以理解为对象的脸部特征信息。可以理解的是,获取带有某一人员对象脸部的图像,通过经预训练的识别算法对其脸部特征进行识别,即可确定该人员对象的具体身份信息。例如,确定某一学生的姓名、学号等信息,或某一教师的姓名、工号等信息。Here, the identity feature identification information of the object can be understood as the facial feature information of the object. It can be understood that the specific identity information of the person object can be determined by acquiring an image with the face of a certain person object and recognizing the facial features of the person object through a pre-trained recognition algorithm. For example, determine information such as the name and student ID of a certain student, or information such as the name and job ID of a certain teacher.

这里对对象动作行为的描述信息可以理解为当前视频流子片段中,人员对象的具体动作类别。可以理解的是,获取带有某一人员对象躯体动作的图像,经预训练的识别算法对该图像进行识别,即可确定该人员对象的具体动作类别。例如,确定该人员对象当前动作行为为书写行为、起立行为或板书行为等。The description information of the action behavior of the object here can be understood as the specific action category of the person object in the sub-segment of the current video stream. It can be understood that, by acquiring an image with a body movement of a certain human object, and recognizing the image by a pre-trained recognition algorithm, the specific action category of the human object can be determined. For example, it is determined that the current action behavior of the person object is writing behavior, standing up behavior, or writing on the blackboard.

通过识别视频流子片段中涉及对象的具体身份信息以及行为信息,可以确定视频流子片段对应教学场景中的师生行为,从而提升了多媒体流分析摘要的准确率。By identifying the specific identity information and behavior information of the objects involved in the sub-segments of the video stream, the behavior of teachers and students in the teaching scene corresponding to the sub-segments of the video stream can be determined, thereby improving the accuracy of the multimedia stream analysis summary.

进一步的,在本申请提供的一种优选实施方式中,所述视频分析模块13用于分析所述视频流子片段生成第一文本信息,具体用于:分析视频流子片段,生成对象动作行为指向的第一文本信息。Further, in a preferred embodiment provided by this application, the video analysis module 13 is configured to analyze the video stream sub-segment to generate the first text information, and is specifically configured to: analyze the video stream sub-segment to generate the first text information pointed to by the object's action behavior.

这里的对象动作行为指向的第一文本信息,可以理解为与人员对象的具体行为具有一定关联度的文本信息。可以理解的是,根据视频流子片段生成的第一文本信息中包括课堂教学场景下的文字、图片信息对应的具体文本种类或描述内容。例如,课堂中的PPT展示页、教室背景相关的黑板报信息。在课堂教学场景中,所述课堂中的PPT展示页中的文字信息,即为与人员对象动作行为相关的文本信息。但是,黑板报信息为视频流子片段中的背景信息,与当前所处场景下的人员动作行为无关,则不属于第一文本信息。The first text information pointed to by the object's action behavior here can be understood as text information having a certain degree of correlation with the specific behavior of the personnel object. It can be understood that the first text information generated from the video stream sub-segment includes the specific text types or description content corresponding to the text and picture information in the classroom teaching scenario, for example, the PPT display page in the classroom or the blackboard bulletin in the classroom background. In the classroom teaching scenario, the text on the PPT display page is text information related to the personnel object's action behavior. The blackboard bulletin, however, is background information in the video stream sub-segment, unrelated to the persons' actions in the current scene, and therefore does not belong to the first text information.
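One plausible way to separate action-related text from background text like the blackboard bulletin is to keep only OCR regions that overlap the region the person is acting on. This is a sketch under assumptions: the patent does not specify the mechanism, and the box format `(x1, y1, x2, y2)` and IoU threshold are illustrative.

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def action_related_text(ocr_items, action_box, threshold=0.1):
    """Keep (box, text) OCR items whose box overlaps the action region."""
    return [text for box, text in ocr_items if iou(box, action_box) > threshold]
```

Text detected far from the region the teacher is writing on (e.g. a bulletin board on the back wall) scores near-zero overlap and is dropped from the first text information.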

通过针对性分析视频流子片段中与人员对象动作相关的第一文本信息,可以减少相应功能模块的数据处理量,同时增加了相应功能模块识别的识别准确度,从而有效提升了第一文本信息的分析效率。By specifically analyzing the first text information related to personnel-object actions in the video stream sub-segment, the data processing load of the corresponding functional module can be reduced while its recognition accuracy is increased, thereby effectively improving the analysis efficiency of the first text information.

进一步的,在本申请提供的一种优选实施方式中,所述视频分析模块13用于分析视频流子片段,生成对象动作行为指向的第一文本信息,具体用于:分析视频流子片段,获取持续预设时长的图像;使用OCR对所述图像进行识别,生成第一文本信息。Further, in a preferred embodiment provided by this application, the video analysis module 13 is configured to analyze the video stream sub-segment and generate the first text information pointed to by the object's action behavior, and is specifically configured to: analyze the video stream sub-segment to acquire an image lasting a preset duration, and recognize the image using OCR to generate the first text information.

可以理解的是,对象动作行为指向的第一文本信息为与当前视频流子片段对应场景下,与对象行为具有一定关联度的文本信息。在课堂教学场景下,与对象行为具有一定关联度的文本信息,可以理解为教学内容相关的文本信息。例如,教师书写的板书、PPT展示页中的文字信息等。需要指出的是,在实际教学过程中,若对象动作行为指向的文本信息较为重要,则人员对象就相应文本信息展开相关行为时,会持续一定的时长。即,具有重要文本信息的图像停留时间较长。而对象动作行为指向的文本信息重要度较低或为无价值的文本信息时,则相应具有较短的时长。即,具有非重要文本信息的图像停留时间较短。因此,可以根据图像持续时长判断图像中文本的重要程度。这里的图像持续时长可以根据实际情况或者经验值预设。若某帧图像持续时长满足预设条件,即可展开后续的文本识别过程。对应的,若某帧图像持续时长不满足预设条件,说明其对应的文本重要性较低,无需展开相应的识别。这样,能够在保证第一文本识别准确度的基础上,增加第一文本的识别效率。It can be understood that the first text information pointed to by the object's action behavior is text information that has a certain degree of correlation with the object's behavior in the scene corresponding to the current video stream sub-segment. In a classroom teaching scenario, such text information can be understood as text related to the teaching content, for example the blackboard writing by the teacher or the text on a PPT display page. It should be pointed out that in actual teaching, if the text information pointed to by the object's action behavior is important, the personnel object will carry out the related behavior for a certain duration; that is, images with important text information stay on screen longer, while images whose text is of low importance or worthless have a correspondingly shorter duration. Therefore, the importance of the text in an image can be judged from the image's duration, and the duration threshold can be preset according to actual conditions or empirical values. If the duration of a frame satisfies the preset condition, the subsequent text recognition process is started; correspondingly, if it does not, the corresponding text is of low importance and no recognition needs to be carried out. In this way, the recognition efficiency of the first text can be increased while ensuring its recognition accuracy.
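The dwell-time rule above can be sketched as a run-length filter over frame states. A "state" here is a simplified stand-in for a frame (e.g. a perceptual hash); how frames are compared for similarity is assumed away.

```python
def select_stable_frames(frame_states, fps, min_seconds):
    """Return states whose consecutive run lasts at least min_seconds."""
    selected, run_state, run_len = [], None, 0
    for state in frame_states + [object()]:   # sentinel flushes the last run
        if state == run_state:
            run_len += 1
        else:
            if run_state is not None and run_len / fps >= min_seconds:
                selected.append(run_state)    # long dwell: worth OCR
            run_state, run_len = state, 1
    return selected
```

Only the surviving states would then be passed to OCR, matching the text's point that short-lived images are skipped to save computation.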

具体的,对满足持续预设时长的图像进行文本信息识别时,可以通过OCR识别方式。即,利用光学字符识别技术,识别视频流子片段中满足预设条件的图像中的字符信息。例如,通过OCR识别方式将课堂展示PPT中满足预设条件的PPT页中的相关内容识别为第一文本信息。或者,通过OCR识别方式将课堂黑板上特定标记的文字内容识别为第一文本信息。Specifically, when recognizing text information in images that satisfy the preset duration, OCR can be used. That is, optical character recognition technology is used to recognize the character information in the images of the video stream sub-segment that satisfy the preset condition. For example, OCR can recognize the relevant content of a PPT page that satisfies the preset condition as the first text information, or recognize the specifically marked text content on the classroom blackboard as the first text information.

进一步的,在本申请提供的一种优选实施方式中,所述第一文本信息至少包括教学环节信息和知识点信息其中之一。Further, in a preferred embodiment provided by this application, the first text information includes at least one of teaching link information and knowledge point information.

这里的教学环节信息可以理解为视频流子片段中满足预设时长的图像中,记录的当前图像对应的环节。例如,PPT中当前页对应的具体标题。这里的知识点信息可以理解为视频流子片段中满足预设时长的图像中,记录的当前图像特定的标注内容。The teaching link information here can be understood as the link corresponding to the current image among the images in the video stream sub-segment that satisfy the preset duration, for example, the specific title corresponding to the current page of the PPT. The knowledge point information here can be understood as the specifically annotated content recorded in the current image among those images.

可以理解的是,课堂教学过程中的相关板书资料,不可避免会存在一些与教学内容无关的信息。这里将其称为无价值信息。这些无价值信息的识别,在增加计算量的同时,还会增加第一文本中无意义内容的占比。因此,通过针对性地识别视频流子片段中教学环节信息或知识点信息,便于得到更为精准的第一文本标识信息,降低了第一文本信息的冗余度,从而便于进行多媒体分析摘要的确定。It is understandable that the blackboard-writing materials of a classroom teaching session will inevitably contain some information unrelated to the teaching content, referred to here as worthless information. Recognizing this worthless information increases the amount of computation and also raises the proportion of meaningless content in the first text. Therefore, by specifically identifying the teaching link information or knowledge point information in the video stream sub-segments, more accurate first text information can be obtained and its redundancy reduced, thereby facilitating the determination of the multimedia analysis summary.

进一步的,在本申请提供的一种优选实施方式中,所述第二文本信息至少具体包括文本纠错信息、关键词信息、提问信息、感情描述信息其中之一。Further, in a preferred embodiment provided by this application, the second text information specifically includes at least one of text error correction information, keyword information, question information, and emotion description information.

可以理解的是,第二文本信息对应为根据音频流子片段生成的文本信息。在课堂教学场景中,所述音频流子片段为记录有课堂教学过程中与声音信息相关的文件。在实际应用中,将音频流子片段识别为第二文本,可以用于制作与视频流子片段同步的字幕信息。通过自然语言处理模型,能够对字幕中的文字信息进行纠错,并且还能进行关键词信息、提问信息、感情描述信息的提取。这些信息均可以理解为多媒体流分析摘要的形成要素。因此,经分析得到的第二文本信息至少具体包括文本纠错信息、关键词信息、提问信息、感情描述信息其中之一。这样,能够保证多媒体流分析摘要的准确性。It can be understood that the second text information corresponds to the text information generated from the audio stream sub-segment. In a classroom teaching scenario, the audio stream sub-segment is a file recording the sound information of the classroom teaching process. In practical applications, recognizing the audio stream sub-segment as the second text can be used to produce subtitle information synchronized with the video stream sub-segment. Through a natural language processing model, the text in the subtitles can be corrected, and keyword information, question information, and emotion description information can be extracted. All of this information can be understood as forming elements of the multimedia stream analysis summary. Therefore, the second text information obtained by analysis specifically includes at least one of text error correction information, keyword information, question information, and emotion description information. In this way, the accuracy of the multimedia stream analysis summary can be guaranteed.

进一步的,在本申请提供的一种优选实施方式中,所述分析摘要生成模块15用于处理所述场景信息、所述第一文本信息和所述第二文本信息,形成所述多媒体流的分析摘要,具体用于:对场景信息、第一文本信息和第二文本信息进行交叉验证,形成所述多媒体流的分析摘要。Further, in a preferred embodiment provided by this application, the analysis summary generation module 15 is configured to process the scene information, the first text information and the second text information to form the analysis summary of the multimedia stream, and is specifically configured to: cross-validate the scene information, the first text information and the second text information to form the analysis summary of the multimedia stream.

这里所述的交叉验证可以理解为将分析得到的场景信息、第一文本信息和第二文本信息等各类数据归总到不同的课程结构中。可以理解的是,进行场景信息、第一文本信息的分析,是根据视频流子片段完成的。进行第二文本信息的识别,是根据音频流子片段完成的。所述视频流子片段与音频流子片段均为从多媒体流片段中提取得到的。若仅根据场景信息、第一文本信息和第二文本信息中的某一信息进行多媒体流的分析摘要,无法得到具有较高准确性的多媒体流分析摘要。因此,需要综合考虑场景信息、第一文本信息和第二文本信息,并将其根据课程结构进行相应的归类汇总,从而得到准确性较高的多媒体流分析摘要。这一过程,也可理解为根据获取的多媒体视频内容,完成视频内容数据的拆解,并将拆解数据重新归类关联到相应的课堂环节。The cross-validation described here can be understood as grouping the scene information, the first text information, the second text information and other data obtained by analysis into different course structures. It can be understood that the analysis of the scene information and the first text information is performed on the video stream sub-segments, while the recognition of the second text information is performed on the audio stream sub-segments; both are extracted from the multimedia stream segment. If the analysis summary were produced from only one of the scene information, the first text information and the second text information, a highly accurate multimedia stream analysis summary could not be obtained. Therefore, the scene information, the first text information and the second text information need to be considered together, classified and summarized according to the course structure, so as to obtain a more accurate multimedia stream analysis summary. This process can also be understood as disassembling the video content data of the acquired multimedia video and re-associating the disassembled data with the corresponding classroom links.

需要说明的是,术语“包括”、“包含”或者其任何其他变体意在涵盖非排他性的包含,从而使得包括一系列要素的过程、方法、商品或者设备不仅包括那些要素,而且还包括没有明确列出的其他要素,或者是还包括为这种过程、方法、商品或者设备所固有的要素。在没有更多限制的情况下,由语句“包括一个……”限定的要素,并不排除在包括所述要素的过程、方法、商品或者设备中还存在另外的相同要素。It should be noted that the terms "comprising", "including" or any other variant thereof are intended to cover a non-exclusive inclusion, so that a process, method, article or device comprising a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article or device. Without further limitation, an element qualified by the phrase "comprising a ..." does not preclude the presence of additional identical elements in the process, method, article or device that includes the element.

以上所述仅为本申请的实施例而已,并不用于限制本申请。对于本领域技术人员来说,本申请可以有各种更改和变化。凡在本申请的精神和原理之内所作的任何修改、等同替换、改进等,均应包含在本申请的权利要求范围之内。The above descriptions are merely examples of the present application, and are not intended to limit the present application. Various modifications and variations of this application are possible for those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of this application shall be included within the scope of the claims of this application.

Claims (3)

1. A method for processing a multimedia stream, comprising the steps of:
acquiring a multimedia stream fragment;
decoding to obtain a video stream sub-segment and an audio stream sub-segment;
analyzing the video stream sub-segment to generate object-oriented identity characteristic identification information, description information of object action behaviors and first text information pointed by the object action behaviors;
analyzing the audio stream sub-segment to generate second text information, wherein the second text information at least specifically comprises one of text error correction information, keyword information, question information and emotion description information;
carrying out cross validation on object-oriented identity characteristic identification information, description information of object action behaviors, first text information and second text information to form an analysis abstract of the multimedia stream;
the first text information at least comprises one of teaching link information and knowledge point information.
2. The method for processing multimedia stream according to claim 1, wherein analyzing the video stream sub-segment to generate the first text information pointed by the object action behavior, specifically comprises:
analyzing the video stream sub-segments to obtain images lasting for a preset time;
and recognizing the image by using OCR to generate first text information.
3. A multimedia stream processing apparatus, comprising:
the acquisition module is used for acquiring the multimedia stream fragments;
the decoding module is used for decoding and acquiring the video stream sub-segment and the audio stream sub-segment;
the video analysis module is used for analyzing the video stream sub-segments to generate object-oriented identity feature identification information, description information of object action behaviors and first text information pointed by the object action behaviors;
the audio analysis module is used for analyzing the audio stream sub-segments to generate second text information, wherein the second text information at least specifically comprises one of text error correction information, keyword information, question information and emotion description information;
the analysis abstract generating module is used for performing cross validation on the object-oriented identity characteristic identification information, the description information of the object action behavior, the first text information and the second text information to form an analysis abstract of the multimedia stream;
the first text information at least comprises one of teaching link information and knowledge point information.
CN202111666523.6A 2021-12-31 2021-12-31 Multimedia stream processing method and device Expired - Fee Related CN114005079B (en)

Publications (2)

Publication Number Publication Date
CN114005079A CN114005079A (en) 2022-02-01
CN114005079B true CN114005079B (en) 2022-04-19

CN112818906B (en) * 2021-02-22 2023-07-11 浙江传媒学院 An intelligent cataloging method for all-media news based on multi-modal information fusion understanding
CN112995696B (en) * 2021-04-20 2022-01-25 共道网络科技有限公司 Live broadcast room violation detection method and device
CN113111837B (en) * 2021-04-25 2022-05-13 山东省人工智能研究院 Intelligent monitoring video early warning method based on multimedia semantic analysis

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020215966A1 (en) * 2019-04-26 2020-10-29 北京大米科技有限公司 Remote teaching interaction method, server, terminal and system
US10978077B1 (en) * 2019-10-31 2021-04-13 Wisdom Garden Hong Kong Limited Knowledge point mark generation system and method thereof

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
"Multimedia summary": video, audio, text information; Lalitha Agnihotri et al.; Proceedings of the 7th ACM SIGMM International Workshop on Multimedia Information Retrieval; 2005-11-11; pp. 81-88 *
Research on content-based static summarization of news video; Ji Xu; China Master's Theses Full-text Database, Information Science and Technology; 2009-07-15 (No. 07); pp. I138-1095 *

Also Published As

Publication number Publication date
CN114005079A (en) 2022-02-01

Similar Documents

Publication Publication Date Title
CN108648757B (en) Analysis method based on multi-dimensional classroom information
Hong et al. Dynamic captioning: video accessibility enhancement for hearing impairment
CN107920280A (en) The accurate matched method and system of video, teaching materials PPT and voice content
TWI707296B (en) Smart teaching consultant generation method, system, equipment and storage medium
CN111785275A (en) Speech recognition method and device
CN109275046A (en) Teaching data annotation method based on double video acquisition
WO2023050650A1 (en) Animation video generation method and apparatus, and device and storage medium
CN113920534B (en) A video highlight segment extraction method, system and storage medium
CN111833861A (en) Artificial intelligence based event evaluation report generation
TW202008293A (en) System and method for monitoring qualities of teaching and learning
Yadav et al. Content-driven multi-modal techniques for non-linear video navigation
CN110310528A (en) A kind of paper cloud interaction language teaching system and method
CN116050892A (en) Intelligent education evaluation supervision method based on artificial intelligence
CN114281948B (en) A method for determining minutes and related equipment
CN118413708B (en) Non-business interactive live broadcast data intelligent analysis system
Krishnamoorthy et al. E-Learning Platform for Hearing Impaired Students
CN114005079B (en) Multimedia stream processing method and device
Okada et al. Predicting performance of collaborative storytelling using multimodal analysis
KR20190068841A (en) System for training and evaluation of english pronunciation using artificial intelligence speech recognition application programming interface
Chen et al. VAST: vivify your talking avatar via zero-shot expressive facial style transfer
CN111488058A (en) Training method and device based on AR and VR
JP3930402B2 (en) ONLINE EDUCATION SYSTEM, INFORMATION PROCESSING DEVICE, INFORMATION PROVIDING METHOD, AND PROGRAM
CN114972716A (en) Lesson content recording method, related device and medium
CN114298570A (en) Data processing method, electronic device and storage medium for sparring dialogue scene
TWI684964B (en) Knowledge point mark generation system and method thereof

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20220419