Disclosure of Invention
In view of the foregoing, embodiments of the present invention are directed to providing a video conference processing method, apparatus, electronic device, and storage medium that overcome or at least partially solve the foregoing problems.
In a first aspect of an embodiment of the present invention, a video conference processing method is disclosed, where the method is applied to a storage service system in a video network, and includes:
obtaining a video file recorded for the video conference and a text file corresponding to the audio streams in the video conference during the recording period, wherein the text file comprises a plurality of pieces of text information, and each piece of text information comprises the start playing time, in the video file, of the audio stream corresponding to that piece of text information;
playing the video file when receiving a user's playing request for the video file;
extracting, during playback of the video file, at least one piece of text information corresponding to the current playing time from the text file according to the current playing time and each start playing time;
and displaying the at least one piece of text information.
Optionally, the video file carries a plurality of marks, and playing the video file includes:
displaying the plurality of marks on a playing progress bar for playing the video file, wherein each mark on the playing progress bar indicates a playing time in the video file;
and extracting at least one piece of text information corresponding to the current playing time from the text file according to the current playing time and each start playing time includes:
when a triggering operation on one of the plurality of marks is detected, determining a first playing time of the triggered mark in the video file, and switching the current playing time to the first playing time so as to play the video file from the first playing time;
and extracting, from the text file, at least one piece of first text information whose start playing time is within a first preset time range of the first playing time.
Optionally, the video file carries a plurality of marks, and playing the video file includes:
displaying the plurality of marks on a playing progress bar for playing the video file, wherein each mark on the playing progress bar indicates a playing time in the video file;
and extracting at least one piece of text information corresponding to the current playing time from the text file according to the current playing time and each start playing time includes:
determining a progress position corresponding to the current playing time, and determining, from the plurality of marks, a target mark within a preset distance of the progress position;
determining a second playing time in the video file from the target mark;
and extracting, from the text file, at least one piece of second text information whose start playing time is within a preset time range of the second playing time.
Optionally, extracting at least one piece of text information corresponding to the current playing time from the text file according to the current playing time and each start playing time includes:
determining, from the start playing times, at least one start playing time within a second preset time range of the current playing time;
and acquiring, from the text file, at least one piece of third text information corresponding to the at least one start playing time.
Optionally, the video network is provided with a conference control terminal and an autonomous server, and before obtaining the video file recorded for the video conference, the method further includes:
receiving a conference recording instruction sent by the autonomous server, wherein the conference recording instruction is generated by the conference control terminal and sent to the autonomous server;
recording, in response to the conference recording instruction, the audio stream and the video stream currently generated in the video conference, and caching the audio stream, wherein the audio stream is the audio stream collected by the terminal currently speaking in the video conference;
when the total playing duration of a cached section of the audio stream reaches a preset duration, recognizing the cached section to obtain the text information corresponding to that section, and taking the start time of that section within the recording as its start playing time;
and when recording of the current video conference ends, storing the plurality of pieces of text information corresponding to the plurality of cached sections of the audio stream as a text file.
Optionally, after storing the plurality of pieces of text information corresponding to the plurality of cached sections of the audio stream as a text file when recording of the current video conference ends, the method further includes:
obtaining conference data of the video conference when the video conference ends;
generating a conference summary based on the conference data and the text file, and publishing the conference summary;
and when receiving a user's request to download the published conference summary, sending the conference summary to the user.
In a second aspect of the embodiments of the present invention, there is provided a video conference processing apparatus applied to a storage service system in a video network, the apparatus including:
a file obtaining module, configured to obtain a video file recorded for the video conference and a text file corresponding to the audio streams in the video conference during the recording period, wherein the text file comprises a plurality of pieces of text information, and each piece of text information comprises the start playing time, in the video file, of the audio stream corresponding to that piece of text information;
a video playing module, configured to play the video file when receiving a user's playing request for the video file;
an information extraction module, configured to extract, during playback of the video file, at least one piece of text information corresponding to the current playing time from the text file according to the current playing time and each start playing time;
and an information display module, configured to display the at least one piece of text information.
Optionally, the video network is provided with a conference control terminal and an autonomous server, and the apparatus further includes:
an instruction receiving module, configured to receive a conference recording instruction sent by the autonomous server, wherein the conference recording instruction is generated by the conference control terminal and sent to the autonomous server;
a video recording module, configured to record, in response to the conference recording instruction, the audio stream and the video stream currently generated in the video conference, and to cache the audio stream, wherein the audio stream is the audio stream collected by the terminal currently speaking in the video conference;
an audio recognition module, configured to recognize a cached section of the audio stream when its total playing duration reaches a preset duration, obtain the text information corresponding to that section, and take the start time of that section within the recording as its start playing time;
and a file obtaining module, configured to store, when recording of the current video conference ends, the plurality of pieces of text information corresponding to the plurality of cached sections of the audio stream as a text file.
Optionally, the video file carries a plurality of marks, and the video playing module is specifically configured to display the plurality of marks on a playing progress bar for playing the video file, where each mark on the playing progress bar indicates a playing time in the video file;
The information extraction module includes:
a first determining unit, configured to determine a first playing time of the triggered mark in the video file when a triggering operation on one of the plurality of marks is detected;
a progress adjusting unit, configured to switch the current playing time to the first playing time so as to play the video file from the first playing time;
and a first extraction unit, configured to extract, from the text file, at least one piece of first text information whose start playing time is within a first preset time range of the first playing time.
Optionally, the video file carries a plurality of marks, and the video playing module is specifically configured to display the plurality of marks on a playing progress bar for playing the video file, where each mark on the playing progress bar indicates a playing time in the video file;
The information extraction module includes:
a second determining unit, configured to determine a progress position corresponding to the current playing time, and determine, from the plurality of marks, a target mark within a preset distance of the progress position;
a third determining unit, configured to determine a second playing time in the video file from the target mark;
and a second extraction unit, configured to extract, from the text file, at least one piece of second text information whose start playing time is within a preset time range of the second playing time.
Optionally, the information extraction module includes:
a fourth determining unit, configured to determine, from the start playing times, at least one start playing time within a second preset time range of the current playing time;
and a third extraction unit, configured to acquire, from the text file, at least one piece of third text information corresponding to the at least one start playing time.
Optionally, the apparatus further comprises:
a conference data acquisition module, configured to acquire conference data of the video conference when the video conference ends;
a conference summary generation module, configured to generate a conference summary based on the conference data and the text file, and to publish the conference summary;
and a conference summary sending module, configured to send the conference summary to the user when receiving the user's request to download the published conference summary.
In a third aspect of the embodiments of the present invention, there is also provided an electronic device, including:
one or more processors; and
one or more machine-readable media having instructions stored thereon which, when executed by the one or more processors, cause the device to perform the video conference processing method described in the embodiments of the present invention.
In a fourth aspect of the embodiments of the present invention, there is also provided a computer-readable storage medium storing a computer program for causing a processor to execute the video conference processing method according to the embodiments of the present invention.
Embodiments of the present invention have the following advantages:
In the embodiments of the present invention, the storage service system can obtain a video file recorded for the video conference and a text file corresponding to the audio streams in the video conference, and, while the video file is playing, display the text information corresponding to the current playing time according to the current playing time and the start playing time of each piece of text information in the text file. On the one hand, while watching the recorded video conference, the user can synchronously obtain, through the displayed text information, the conference content of the period currently being played, which improves both the accuracy with which the user understands the conference content and the efficiency with which the user obtains it. On the other hand, because the storage service system can obtain a text file corresponding to the audio in the video conference, the user does not need to record the conference content manually, which improves the efficiency of recording the video conference.
Detailed Description
In order that the above objects, features, and advantages of the present invention may become more readily apparent, the invention is described in further detail below with reference to the accompanying drawings and specific embodiments.
The video network is an important milestone in network development. It is a real-time network that can realize the real-time transmission of high-definition video, pushing numerous Internet applications toward high-definition, face-to-face interaction.
The video network adopts real-time high-definition video switching technology and can integrate dozens of required services, such as high-definition video conferencing, video surveillance, intelligent monitoring and analysis, emergency command, digital broadcast television, time-shifted television, network teaching, live broadcast, video on demand (VOD), television mail, personal video recording (PVR), intranet (self-run) channels, intelligent video playback control, and information publishing, into a single system platform, realizing high-definition video playback through a television or computer.
For a better understanding of the embodiments of the present invention, the video network is described below for those skilled in the art.
Some of the technologies applied in the video network are as follows:
Network technology (Network Technology)
The network technology innovation of the video network improves on traditional Ethernet to handle the potentially huge video traffic on the network. Unlike pure network packet switching (Packet Switching) or network circuit switching (Circuit Switching), the video network technology employs packet switching to meet streaming requirements. The video network technology has the flexibility, simplicity, and low cost of packet switching while also offering the quality and security guarantees of circuit switching, thereby realizing network-wide seamless connection of switched virtual circuits and of the data format.
Switching technology (Switching Technology)
The video network adopts the two advantages of Ethernet, asynchrony and packet switching, and eliminates Ethernet's defects on the premise of full compatibility. It offers seamless end-to-end connection across the whole network, connects directly to user terminals, and directly carries IP data packets; user data requires no format conversion anywhere in the network. The video network is a higher-level form of Ethernet and a real-time switching platform that can realize the network-wide, large-scale, real-time transmission of high-definition video that the current Internet cannot, pushing numerous network video applications toward high definition and unification.
Server technology (Server Technology)
The server technology of the video network and the unified video platform differs from servers in the traditional sense: its streaming media transmission is built on a connection-oriented basis, its data processing capability is independent of traffic and communication time, and a single network layer can carry both signaling and data transmission. For voice and video services, streaming media processing on the video network and unified video platform is far simpler than data processing, and efficiency is improved more than a hundredfold over a traditional server.
Storage technology (Storage Technology)
To accommodate ultra-large-capacity, ultra-high-traffic media content, the ultra-high-speed storage technology of the unified video platform adopts the most advanced real-time operating system. Program information in a server instruction is mapped to specific hard disk space, so media content no longer passes through the server but is delivered instantly and directly to the user terminal, with user waiting time generally under 0.2 seconds. Optimized sector allocation greatly reduces the mechanical seek motion of the hard disk heads; resource consumption is only 20% of that of an IP Internet system of the same grade, yet concurrent throughput is 3 times that of a traditional hard disk array, and overall efficiency is improved more than tenfold.
Network security technology (Network Security Technology)
The structural design of the video network thoroughly and structurally solves the network security problems that plague the Internet, through measures such as independent authorization for each service and complete isolation of devices and user data. It generally requires no antivirus programs or firewalls, eliminates attacks by hackers and viruses, and provides users with a structurally worry-free, secure network.
Service innovation technology (Service Innovation Technology)
The unified video platform fuses services with transmission: whether for a single user, a private-network user, or an entire network, connections are established automatically on demand. User terminals, set-top boxes, or PCs connect directly to the unified video platform to obtain a variety of multimedia video services. The unified video platform adopts a menu-style configuration table to replace traditional complex application programming, so that complex applications can be realized with very little code, enabling 'unlimited' new service innovation.
The networking of the video network is as follows:
The video network has a centrally controlled network structure, which may be a tree, star, ring, or other topology, but on this basis a centralized control node is required in the network to control the whole network.
As shown in fig. 1, the video network is divided into an access network and a metropolitan area network.
The devices of the access network part can be mainly divided into 3 types: node servers, access switches, and terminals (including various set-top boxes, encoding boards, storage devices, and the like). A node server is connected to access switches, and an access switch can be connected to multiple terminals and to an Ethernet network.
The node server is a node with a centralized control function in the access network and can control the access switches and terminals. The node server may be directly connected to the access switches or directly connected to the terminals.
Similarly, the devices of the metropolitan area network part can be divided into 3 categories: metropolitan area servers, node switches, and node servers. A metropolitan area server is connected to node switches, and a node switch can be connected to multiple node servers.
The node server here is the node server of the access network part; that is, the node server belongs to both the access network part and the metropolitan area network part.
The metropolitan area server is a node with a centralized control function in the metropolitan area network and can control the node switches and node servers. The metropolitan area server may be directly connected to the node switches or directly connected to the node servers.
The whole video network is thus a hierarchical, centrally controlled network structure, and the networks controlled by the node servers and the metropolitan area servers can have various topologies such as tree, star, and ring.
The access network part can figuratively be called a unified video platform (the part within the dotted circle). Multiple unified video platforms can form a video network, and the unified video platforms can interconnect and intercommunicate through the metropolitan area and wide area video networks.
1. Video network device classification
1.1 The devices in the video network of the embodiments of the present invention can be mainly classified into 3 types: servers, switches (including Ethernet protocol conversion gateways), and terminals (including various set-top boxes, encoding boards, storage devices, and the like). The video network as a whole can be divided into a metropolitan area network (or a national network, a global network, etc.) and an access network.
1.2 The devices of the access network part can be mainly classified into 3 categories: node servers, access switches (including Ethernet protocol conversion gateways), and terminals (including various set-top boxes, encoding boards, storage devices, etc.).
The specific hardware structure of each access network device is as follows:
Node server:
As shown in fig. 2, the node server mainly comprises a network interface module 201, a switching engine module 202, a CPU module 203, and a disk array module 204.
Packets arriving from the network interface module 201, the CPU module 203, and the disk array module 204 all enter the switching engine module 202. The switching engine module 202 looks up the address table 205 for each incoming packet to obtain its routing information, and stores the packet into the corresponding queue of the packet buffer 206 according to that routing information; if the queue of the packet buffer 206 is nearly full, the packet is discarded. The switching engine module 202 polls all packet buffer queues and forwards a packet if: 1) the port send buffer is not full; and 2) the queue packet counter is greater than zero. The disk array module 204 mainly controls the hard disks, including initialization and read/write operations; the CPU module 203 is mainly responsible for protocol processing with the access switches and terminals (not shown), for configuration of the address table 205 (including the downlink protocol packet address table, the uplink protocol packet address table, and the data packet address table), and for configuration of the disk array module 204.
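Purely as an illustrative aside (not part of the original disclosure), the look-up, queue, and poll behaviour described above can be summarized in a short Python sketch; the packet and port interfaces are hypothetical stand-ins for the hardware modules:

```python
from collections import deque

class SwitchingEngine:
    """Minimal sketch of the switching engine behaviour described above."""

    def __init__(self, address_table, queue_capacity=1024):
        self.address_table = address_table  # hypothetical: destination address -> output port id
        self.queues = {}                    # per-port queues of the packet buffer
        self.queue_capacity = queue_capacity

    def receive(self, packet):
        # Look up the address table to obtain the packet's routing information.
        port_id = self.address_table.get(packet.da)
        if port_id is None:
            return                          # no routing information: drop
        queue = self.queues.setdefault(port_id, deque())
        if len(queue) >= self.queue_capacity:
            return                          # queue nearly full: discard the packet
        queue.append(packet)

    def poll(self, ports):
        # Forward only if 1) the port send buffer is not full and
        # 2) the queue packet counter is greater than zero.
        for port_id, queue in self.queues.items():
            port = ports[port_id]           # hypothetical port object
            while queue and not port.send_buffer_full():
                port.send(queue.popleft())
```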
Access switch:
As shown in fig. 3, the access switch mainly includes a network interface module (a downstream network interface module 301 and an upstream network interface module 302), a switching engine module 303, and a CPU module 304.
The downstream network interface module 301 sends incoming packets (uplink data) to the packet detection module 305. The packet detection module 305 detects whether the destination address (DA), source address (SA), packet type, and packet length of each packet meet requirements; if so, it allocates a corresponding stream identifier (stream-id) and passes the packet to the switching engine module 303, otherwise it discards the packet. The upstream network interface module 302 sends incoming packets (downlink data) to the switching engine module 303, and the CPU module 304 likewise sends its outgoing packets to the switching engine module 303. The switching engine module 303 looks up the address table 306 for each incoming packet to obtain its routing information. If a packet entering the switching engine module 303 is going from the downstream network interface to the upstream network interface, it is stored in the corresponding queue of the packet buffer 307 in combination with its stream identifier (stream-id), and is discarded if that queue is nearly full; if the packet is not going from the downstream network interface to the upstream network interface, it is stored in the corresponding queue of the packet buffer 307 according to its routing information, and is likewise discarded if that queue is nearly full.
The switching engine module 303 polls all packet buffer queues, distinguishing two cases:
If the queue is going from the downstream network interface to the upstream network interface, a packet is forwarded if: 1) the port send buffer is not full; 2) the queue packet counter is greater than zero; and 3) a token generated by the rate control module is obtained;
if the queue is not going from the downstream network interface to the upstream network interface, a packet is forwarded if: 1) the port send buffer is not full; and 2) the queue packet counter is greater than zero.
The rate control module 308 is configured by the CPU module 304 and generates tokens at programmable intervals for all packet buffer queues going from the downstream network interface to the upstream network interface, so as to control the rate of uplink forwarding.
The CPU module 304 is mainly responsible for protocol processing with the node server, for configuration of the address table 306, and for configuration of the rate control module 308.
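Again purely as an illustrative sketch, and not as the actual firmware, the three forwarding conditions and the token mechanism configured by the CPU module might be modelled as follows (all names are hypothetical):

```python
import time

class RateControl:
    """One token per programmable interval for each uplink queue (sketch)."""

    def __init__(self, interval_s):
        self.interval_s = interval_s   # programmable interval set by the CPU module
        self.last_grant = {}

    def take_token(self, queue_id):
        now = time.monotonic()
        if now - self.last_grant.get(queue_id, 0.0) >= self.interval_s:
            self.last_grant[queue_id] = now
            return True
        return False

def poll_queue(queue, port, rate_control, queue_id, is_uplink):
    # Conditions 1 and 2 apply to every queue; uplink queues
    # (downstream -> upstream) additionally require a token (condition 3).
    while queue and not port.send_buffer_full():
        if is_uplink and not rate_control.take_token(queue_id):
            break
        port.send(queue.popleft())
```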
Ethernet protocol conversion gateway:
As shown in fig. 4, the Ethernet protocol conversion gateway mainly includes a network interface module (a downstream network interface module 401 and an upstream network interface module 402), a switching engine module 403, a CPU module 404, a packet detection module 405, a rate control module 408, an address table 406, a packet buffer 407, a MAC adding module 409, and a MAC deleting module 410.
The packet detection module 405 detects whether the Ethernet MAC DA, Ethernet MAC SA, Ethernet length or frame type, video network destination address (DA), video network source address (SA), video network packet type, and packet length of each incoming packet meet requirements; if so, it allocates a corresponding stream identifier (stream-id), the MAC deleting module 410 strips the MAC DA, MAC SA, and length or frame type (2 bytes), and the packet enters the corresponding receive buffer; otherwise, the packet is discarded.
The downstream network interface module 401 checks the send buffer of the port; if packets are present, it obtains the Ethernet MAC DA of the corresponding terminal according to the video network destination address (DA) of each packet, prepends the terminal's Ethernet MAC DA, the Ethernet protocol conversion gateway's MAC SA, and the Ethernet length or frame type, and sends the packet.
The other modules in the Ethernet protocol conversion gateway function similarly to those of the access switch.
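The MAC deleting and MAC adding behaviour amounts to stripping and prepending a standard 14-byte Ethernet header. A minimal sketch, assuming big-endian packing and raw byte-string frames:

```python
import struct

ETH_HEADER = struct.Struct("!6s6sH")  # MAC DA, MAC SA, length/frame type (2 bytes)

def strip_ethernet_header(frame: bytes) -> bytes:
    """Uplink: remove MAC DA, MAC SA and length/frame type, leaving the bare
    video network packet (the role of the MAC deleting module 410)."""
    return frame[ETH_HEADER.size:]

def add_ethernet_header(packet: bytes, terminal_mac: bytes,
                        gateway_mac: bytes, eth_type: int) -> bytes:
    """Downlink: prepend the terminal's MAC DA, the gateway's MAC SA and the
    Ethernet length/frame type (the role of the MAC adding module 409)."""
    return ETH_HEADER.pack(terminal_mac, gateway_mac, eth_type) + packet
```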
Terminal:
A terminal mainly comprises a network interface module, a service processing module, and a CPU module. For example, a set-top box mainly comprises a network interface module, a video/audio encoding and decoding engine module, and a CPU module; an encoding board mainly comprises a network interface module, a video/audio encoding engine module, and a CPU module; a storage device mainly comprises a network interface module, a CPU module, and a disk array module.
1.3 The devices of the metropolitan area network part can be mainly classified into 3 categories: node servers, node switches, and metropolitan area servers. A node switch mainly comprises a network interface module, a switching engine module, and a CPU module; a metropolitan area server mainly comprises a network interface module, a switching engine module, and a CPU module.
2. Video network data packet definition
2.1 Access network packet definition
The data packet of the access network mainly comprises a Destination Address (DA), a Source Address (SA), reserved bytes, payload (PDU) and CRC.
As shown in the following table, the data packet of the access network mainly includes the following parts:
DA | SA | Reserved | Payload | CRC
Wherein:
the destination address (DA) consists of 8 bytes: the first byte indicates the packet type (such as the various protocol packets, multicast data packets, unicast data packets, etc.), allowing at most 256 possibilities; the second through sixth bytes form the metropolitan area network address; and the seventh and eighth bytes form the access network address;
the source address (SA) also consists of 8 bytes and is defined identically to the destination address (DA);
the reserved field consists of 2 bytes;
the length of the payload varies with the type of datagram: 64 bytes for the various protocol packets and 32 + 1024 = 1056 bytes for unicast data packets, although it is certainly not limited to these 2 types;
the CRC consists of 4 bytes, and its calculation follows the standard Ethernet CRC algorithm.
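For illustration only, the layout above can be parsed with a few lines of Python; the byte order of the CRC field and the use of zlib's CRC-32 for the "standard Ethernet CRC" are assumptions:

```python
import struct
import zlib

# DA (8 bytes) | SA (8 bytes) | Reserved (2 bytes) | Payload | CRC (4 bytes)
HEADER = struct.Struct("!8s8s2s")

def parse_access_packet(data: bytes) -> dict:
    da, sa, _reserved = HEADER.unpack_from(data, 0)
    payload = data[HEADER.size:-4]
    crc, = struct.unpack("!I", data[-4:])
    if crc != zlib.crc32(data[:-4]):   # assumed: standard Ethernet CRC-32
        raise ValueError("CRC check failed")
    return {
        "packet_type": da[0],          # 1st DA byte: packet type (up to 256 kinds)
        "metro_address": da[1:6],      # 2nd to 6th bytes: metropolitan area network address
        "access_address": da[6:8],     # 7th and 8th bytes: access network address
        "source_address": sa,
        "payload": payload,            # 64 B for protocol packets, 1056 B for unicast
    }
```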
2.2 Metropolitan area network packet definition
The topology of the metropolitan area network is a graph, and there may be 2 or even more connections between two devices; that is, there may be more than 2 connections between a node switch and a node server, between a node switch and another node switch, or between a node switch and a metropolitan area server. However, the metropolitan area network address of a metropolitan area network device is unique, so in order to describe the connection relationships between metropolitan area network devices accurately, a parameter, namely the label, is introduced in the embodiment of the present invention to uniquely describe a metropolitan area network device.
In this specification, the definition of a label is similar to that of an MPLS (Multi-Protocol Label Switching) label: assuming there are two connections between device A and device B, a packet going from device A to device B has 2 labels, and a packet going from device B to device A also has 2 labels. A label is divided into an in-label and an out-label: assuming the label of a packet entering device A (the in-label) is 0x0000, the label of the packet when it leaves device A (the out-label) may become 0x0001. The network access process of the metropolitan area network is carried out under centralized control; that is, address allocation and label allocation for the metropolitan area network are both dominated by the metropolitan area server, while the node switches and node servers execute passively. This differs from MPLS, where label allocation is the result of mutual negotiation between switch and server.
As shown in the following table, the data packet of the metropolitan area network mainly includes the following parts:
DA | SA | Reserved | Label | Payload | CRC
That is: destination address (DA), source address (SA), reserved bytes (Reserved), label, payload (PDU), and CRC. The format of the label may be defined as follows: the label is 32 bits long, with the high 16 bits reserved and only the low 16 bits used; the label sits between the reserved bytes and the payload of the packet.
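A hypothetical sketch of that label field: 32 bits between the reserved bytes and the payload, only the low 16 bits used, with an in-label swapped for an out-label as the packet traverses a device (as in the 0x0000 to 0x0001 example above):

```python
import struct

LABEL_OFFSET = 8 + 8 + 2   # after the 8-byte DA, 8-byte SA and 2 reserved bytes

def pack_label(low16: int) -> bytes:
    """32-bit label: high 16 bits reserved (zero), only the low 16 bits used."""
    return struct.pack("!I", low16 & 0xFFFF)

def swap_label(packet: bytes, out_label: int) -> bytes:
    """Replace the in-label with the out-label assigned by the metropolitan
    area server, e.g. in-label 0x0000 may leave device A as out-label 0x0001."""
    return packet[:LABEL_OFFSET] + pack_label(out_label) + packet[LABEL_OFFSET + 4:]
```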
Based on the above characteristics of the video network, one of the core concepts of the present invention is proposed: the storage server can obtain a video file recorded for the video conference and a text file corresponding to the audio in the video conference; when a user plays the video file, multiple pieces of text information near the current playing time can be extracted from the text file and displayed, making it convenient for the user to grasp the conference content accurately and quickly.
Referring to fig. 5, an implementation environment diagram of a video conference processing method according to an embodiment of the present invention is shown. As shown in fig. 5, the implementation environment includes a conference control terminal, a conference management server, an autonomous server, a storage service system, and a plurality of terminals. The autonomous server, the storage service system, and the plurality of terminals are deployed in the video network, while the conference management server and the conference control terminal are deployed in the Internet. The autonomous server communicates with the conference management server using the Internet protocol, and communicates with the storage service system and the terminals using the video network protocol.
In this embodiment, a terminal may be, but is not limited to, a mobile phone, a computer, a set-top box, or a video network terminal. In a video conference conducted in the video network, a plurality of terminals act as participants. The conference control terminal is used to control the terminals participating in the video conference, issue conference control instructions, and the like. The autonomous server handles the mutual transmission of audio/video streams, conference text data, and the like within the video network, and the storage service system records the received video streams and audio streams so as to store the conference content of the video conference.
As shown in fig. 5, in a specific implementation, the storage service system may include a web front end and a back end, where the back end of the storage service system is a storage server, and the web front end supports Internet protocol communication so that users in the Internet can conveniently download files related to the video conference from a web page.
The following example describes how a video conference proceeds in the video network:
As shown in fig. 5, in a video conference in the video network, the scheduling of audio and video streams may be triggered by the conference control terminal, and the autonomous server performs the actual scheduling of the audio and video streams according to the schedule triggered by the conference control terminal. For example, at a certain moment the conference control terminal sets terminal 1 as the speaker and the other terminals 2, 3, and 4 as ordinary participants; the conference control terminal thereby triggers an audio/video stream schedule, and the autonomous server sends the audio and video streams collected by terminal 1 to terminals 2, 3, and 4 accordingly. For another example, at another moment the conference control terminal sets terminal 2 as the speaker and the other terminals 1, 3, and 4 as ordinary participants; the conference control terminal thereby triggers a new audio/video stream schedule, and the autonomous server sends the audio and video streams collected by terminal 2 to terminals 1, 3, and 4 accordingly.
In practice, the conference control terminal may change the speaker repeatedly during the conference as required, that is, repeatedly change the audio/video stream schedule, and the autonomous server forwards the audio streams collected by the corresponding terminal to the terminals of the other participants according to the schedule determined by the conference control terminal.
Next, the video conference processing method of the present application will be described with reference to the implementation environment shown in fig. 5, taking the storage service system as an execution subject.
In an embodiment, a manner of recording a video conference is provided. In this embodiment, while the video conference is in progress, the storage service system may record the conference to obtain a corresponding video file and text file. Referring to fig. 6, a flowchart of the steps by which the storage service system records a video conference to obtain a video file and a text file in an embodiment is shown; the process may specifically include the following steps:
Step S601, receiving a conference recording instruction sent by the autonomous server.
The conference recording instruction is generated by the conference control terminal and is sent to the autonomous server.
In this embodiment, the storage service system may record the video conference based on a conference recording instruction issued by the conference control terminal. Specifically, the conference recording instruction may be generated by the conference control terminal, sent by the conference control terminal to the autonomous server, and then sent by the autonomous server to the back end of the storage service system.
In practice, as shown in fig. 5, since the conference control terminal is located in the Internet and communicates with the conference management server in the Internet, the conference control terminal generates a conference recording instruction conforming to the Internet protocol and sends it to the conference management server over the Internet. After the conference management server converts this instruction into one conforming to the video network protocol, it sends the converted instruction to the autonomous server, which forwards it to the back end of the storage service system in the video network. The conference recording instruction received by the storage server is therefore an instruction conforming to the video network protocol.
Step S602, recording, in response to the conference recording instruction, the audio stream and the video stream currently generated in the video conference, and caching the audio stream, wherein the audio stream is the audio stream collected by the terminal currently speaking in the video conference.
In this embodiment, the back end of the storage service system may respond to the conference recording instruction by recording the audio and video streams generated in the video conference, which may be sent to it by the autonomous server after the autonomous server responds to the conference recording instruction. That is, when the autonomous server forwards the conference recording instruction to the storage service system, an audio/video stream transmission channel between the autonomous server and the storage service system is also established, over which the video streams and audio streams sent by all terminals in the video conference are delivered to the storage service system.
In a specific implementation, as shown in fig. 5, the storage service system caches the audio stream in a voice analysis system, so that the voice analysis system can recognize the cached audio stream.
Step S603, when it is detected that the total playing duration of a cached section of the audio stream reaches a preset duration, recognizing the cached section to obtain the text information corresponding to that section, and taking the start time of that section within the recording as its start playing time.
In this embodiment, when the total playing duration of the cached audio reaches the preset duration, the storage service system may recognize that section of the audio stream with the voice analysis system to obtain the corresponding text information. Meanwhile, the cached section can be deleted and caching continues with the next section; when the total playing duration of the next cached section reaches the preset duration, the text information corresponding to it is obtained in turn, so that multiple pieces of text information are accumulated.
The start time of a section of the audio stream within the recording refers to the moment, measured from the start of recording, at which that section began to be cached; this moment is taken as the start playing time of the text information corresponding to that section.
For example, suppose that, timed from the moment the video conference begins to be recorded, the audio stream sent by terminal 2 begins to be cached at 20 minutes 15 seconds. When the playing duration of the cached audio reaches 5 seconds, text information A is generated from it and its start playing time is determined to be 20 minutes 15 seconds; in the recorded video file, terminal 2 starts speaking at 20 minutes 15 seconds. During this period terminal 2 continues to send its audio stream, caching begins again at 20 minutes 21 seconds, and when the playing duration of the newly cached audio reaches 5 seconds, text information B is generated, whose start playing time is determined to be 20 minutes 21 seconds.
In one embodiment, the storage service system may further determine which terminal collected the currently cached audio stream and record the name of that terminal, or the name of the user using it, into the text information, so that each piece of text information carries speaker information.
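The caching and recognition loop of steps S602 to S603 can be sketched as follows; this is only an illustration, and the audio source (yielding chunks with a duration and a speaker) and the recognizer's transcribe call are hypothetical interfaces, not APIs of the patent's voice analysis system:

```python
import dataclasses

@dataclasses.dataclass
class TextPiece:
    start_playing_time: float   # seconds from the start of recording
    speaker: str                # terminal (or user) that was speaking
    text: str

def segment_and_recognize(audio_source, recognizer, preset_duration=5.0):
    """Cache the incoming audio; whenever the cached section reaches the preset
    duration, recognize it and timestamp the text with the moment, relative to
    the start of recording, at which the section began to be cached."""
    pieces, cache = [], []
    elapsed = cache_start = 0.0
    for chunk in audio_source:          # hypothetical: chunk.duration, chunk.speaker
        if not cache:
            cache_start = elapsed       # start time of this section in the recording
        cache.append(chunk)
        elapsed += chunk.duration
        if elapsed - cache_start >= preset_duration:
            text = recognizer.transcribe(cache)   # speech-to-text on the cached section
            pieces.append(TextPiece(cache_start, chunk.speaker, text))
            cache = []                  # delete the cached section and keep caching
    return pieces
```

With the example above, the piece produced from the audio cached at 20 minutes 15 seconds would carry a start playing time of 1215.0 seconds and the speaker "terminal 2".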
Step S604, when recording of the current video conference ends, storing the multiple pieces of text information corresponding to the multiple cached sections of the audio stream as a text file.
In this embodiment, the storage service system may end the recording of the video conference according to a recording-end instruction sent by the conference control terminal; the process by which the storage service system receives the recording-end instruction is the same as the process by which it receives the conference recording instruction, and is not repeated here.
In practice, the storage service system may also end the recording when the duration since recording began reaches a preset duration threshold, or, of course, when it detects that the video conference has ended.
In a specific implementation, because the audio is recognized section by section during recording, multiple pieces of text information are obtained; when recording ends, the storage service system can combine them into one text file, in which each piece of text information has its own start playing time.
When recording ends, the storage service system can publish the recorded video file and text file to its web front end. Specifically, it can publish the names of the video file and the text file so that users may download them, thereby obtaining the conference content of the video conference and putting the video file and text file to use.
In this embodiment, while the user plays the recorded video file, the storage service system may display the text information in the text file to the user, so that the user obtains the conference content corresponding to the video clip being watched.
Specifically, referring to fig. 7, a flowchart of the steps by which the storage service system displays text information in the text file to the user in a video conference processing method according to an embodiment is shown; the method may specifically include the following steps:
Step S701, obtaining a video file recorded for the video conference and a text file corresponding to the audio streams in the video conference during the recording period.
The text file comprises a plurality of pieces of text information, and each piece of text information carries the start playing time, in the video file, of the audio stream corresponding to it.
In this embodiment, the storage service system may obtain the recorded video file and text file when the video conference ends, or may obtain the currently recorded video file and text file while the conference is still in progress, so that recording and playback proceed simultaneously. The video file can reproduce the scene of the video conference; a video file recorded for the video conference may refer to a recording of the audio and video obtained after mixing the video streams and audio streams collected by each terminal in the conference.
The text file is obtained by recognizing the audio streams generated in the video conference while the video file is being recorded. Taking fig. 5 as an example, if terminal 1 and terminal 2 speak in succession while the video file is being recorded, the text file may contain text information corresponding to the audio collected by terminal 1 and text information corresponding to the audio collected by terminal 2.
In practice, each piece of text information may have a start playing time, which may be the time difference between the moment the video file started being recorded and the moment the text information was recognized; this time difference serves as the relative playing time, within the video file, of the audio stream corresponding to the text information. The specific process of recording the video file and obtaining the text file may be as described in steps S601 to S604.
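By way of illustration only (the patent does not prescribe a file format), the text file could be as simple as a JSON list, one entry per piece of text information, matching the example of text information A and B above:

```python
import json

# start_playing_time: offset in seconds of the corresponding audio in the video file
text_file = [
    {"start_playing_time": 1215, "speaker": "terminal 2", "text": "..."},  # 20 min 15 s
    {"start_playing_time": 1221, "speaker": "terminal 2", "text": "..."},  # 20 min 21 s
]

with open("conference_text.json", "w", encoding="utf-8") as f:
    json.dump(text_file, f, ensure_ascii=False, indent=2)
```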
Step S702, when receiving a playing request of a user for the video file, playing the video file.
In this embodiment, the storage service system may publish the name of the obtained video file to the web front end so that the user can play the published video file. Specifically, when the web front end of the storage service system detects the user's click on the name of the video file, it can send a playing request to the back end of the storage service system; the back end then pushes the video file to the web front end, which loads and plays it.
In a specific implementation, when the back end of the storage service system pushes the video file to the web front end, it may push the text file along with it.
Step S703, during playback of the video file, extracting at least one piece of text information corresponding to the current playing time from the text file according to the current playing time and each start playing time.
Step S704, displaying the at least one piece of text information.
In this embodiment, while the recorded video conference is playing, the web front end of the storage service system may compare the current playing time with the start playing time carried by each piece of text information in the text file, determine the start playing times near the current playing time, and extract from the text file at least one piece of text information corresponding to those start playing times.
Because the start playing time of each piece of text information is the relative playing time, within the video file, of the corresponding audio stream, in practice, when the video file reaches a clip in which a speaker is talking, the text information corresponding to that speaker can be displayed synchronously. This makes it convenient for the user to follow the speaker's content in text form and to understand the content of the video conference accurately.
In a specific implementation, the web front end of the storage service system may display the extracted text information on the video playing window of the video file. In practice, while the text information is displayed, the user can save the speaker's content by capturing a screenshot of the video playing window. Specifically, when the web front end detects a screen capture operation on the video playing window, it can capture the current video playing window to obtain a picture of the displayed page, which may include the at least one piece of text information. The saved picture is then sent to the user currently logged in to the storage service system, so that the user can keep a record of important conference content through pictures without taking notes manually, improving the efficiency of conference recording.
In one embodiment, the text information can carry speaker information, so that the speaker information is displayed along with the text information; the user can thus see which speaker the currently displayed text information belongs to, which improves the user experience.
In the embodiments of the present invention, while the video file is playing, the storage service system can display the text information corresponding to the current playing time according to the current playing time and the start playing time of each piece of text information in the text file. On the one hand, while watching the recorded video conference, the user can synchronously obtain the conference content of the current video clip through the displayed text information, which ensures that the user obtains the conference content accurately and improves the efficiency with which the user obtains it. On the other hand, because the storage service system can obtain a text file corresponding to the audio in the video conference, the user does not need to record the conference content manually, which improves the efficiency of recording the video conference.
In combination with the above embodiment, an implementation of displaying text information during playback of a video file is provided. In this implementation, the video file may carry a plurality of marks, and playing the video file may proceed as follows:
Step S702', displaying the plurality of marks on a playing progress bar for playing the video file.
Each mark on the playing progress bar indicates a playing time in the video file.
In this embodiment, the plurality of marks carried in the video file may refer to marks that flag a plurality of playing time points of the video file. When the video file is played, the marks can be displayed at different positions on the playing progress bar according to the playing time points they flag.
The time interval between the playing time points flagged by every two adjacent marks may be the same. For example, if the total playing duration of the video file is 20 minutes, 5 marks can be set at equal time intervals, so that there is one mark on the playing progress bar every 4 minutes.
Accordingly, displaying the at least one piece of text information may specifically include the following steps:
Step S7031, when a triggering operation on one of the plurality of marks is detected, determining a first playing time of the triggered mark in the video file, and switching the current playing time to the first playing time so as to play the video file from the first playing time.
In this embodiment, the triggering operation on a mark may be a click or touch performed by the user on that mark. When the user's triggering operation on a mark is detected, the playing time point flagged by the mark can be obtained; this playing time point is the first playing time. Meanwhile, the current playing time can be switched to the first playing time, that is, the current position on the playing progress bar is moved to the position of the mark, so that the video file plays from the first playing time.
Step S7032, extracting, from the text file, at least one piece of first text information whose start playing time is within a first preset time range of the first playing time.
In this embodiment, once the first playing time flagged by the triggered mark is obtained, the start playing times within the first preset time range of the first playing time can be selected from the plurality of start playing times, and the text information corresponding to those start playing times, that is, the first text information, can be extracted from the text file.
The first preset time range can be set as required.
By way of example, referring to fig. 8, a schematic diagram of an interface displaying text information while playing a video file is shown. Suppose the total playing duration of the video file is 20 minutes and 4 marks are set at equal time intervals. When the user clicks mark 1, the video file plays from mark 1; the playing time of mark 1 is 4 minutes 5 seconds, so the text information whose start playing time lies between 3 minutes 55 seconds and 4 minutes 15 seconds can be extracted from the text file and displayed on the playing interface. The text information between 3 minutes 55 seconds and 4 minutes 15 seconds may be displayed on the progress bar or to the left or right of the video picture; the display position can be chosen as required.
Through this implementation, the user can trigger a mark on the playing progress bar to obtain the conference content of the video conference in the nearby time period.
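A minimal sketch of step S7032, using the JSON-style text file shown earlier; the 10-second value for the first preset time range is only an assumption chosen to match the 3 min 55 s to 4 min 15 s example above:

```python
def extract_first_text(text_file, first_playing_time, first_range_s=10):
    """Return every piece whose start playing time lies within the first preset
    time range of the triggered mark's playing time."""
    return [piece for piece in text_file
            if abs(piece["start_playing_time"] - first_playing_time) <= first_range_s]

# Mark 1 flags 4 min 5 s = 245 s; pieces starting between 235 s (3 min 55 s)
# and 255 s (4 min 15 s) are extracted and displayed.
first_text = extract_first_text(text_file, first_playing_time=245)
```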
In combination with the above embodiment, a further implementation of displaying text information during playback of a video file is provided. In this implementation, the video file may likewise carry a plurality of marks, and playing the video file may proceed as in step S702'.
In this embodiment, displaying at least one piece of text information may specifically include the following steps:
Step S7031', determining a progress position corresponding to the current playing time, and determining, from the plurality of marks, a target mark within a preset distance of the progress position.
In this embodiment, the progress position corresponding to the current playing time is the position of the current playing time point on the playing progress bar. Since the plurality of marks are displayed on the playing progress bar, each mark has its own position there; the distance between the position of the current playing time point and the position of each mark can be determined, and the mark whose distance is within the preset distance is taken as the target mark.
In practice, the target mark may be the mark nearest to the current playing time point.
Step S7032', determining a second playing time in the video file from the target mark.
In this embodiment, the playing time point flagged by the target mark can be obtained; this playing time point is the second playing time.
Step S7033', extracting, from the text file, at least one piece of second text information whose start playing time is within a preset time range of the second playing time.
The process of step S7033' is similar to that of step S7032 above; for details, refer to the description of step S7032, which is not repeated here. The preset time range may coincide with the first preset time range.
With this implementation, the storage service system can automatically extract and display text information according to the distance between the current playing time and the playing time points flagged by the marks, without affecting the normal playback of the video file, which improves the user experience.
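Illustratively, selecting the target mark can be a nearest-neighbour search over the marks' positions on the progress bar; the position units (e.g. pixels) are an assumption:

```python
def find_target_mark(marks, progress_pos, preset_distance):
    """marks: list of (position_on_progress_bar, playing_time_s) pairs.
    Return the playing time (the second playing time) of the mark nearest to
    the current progress position, provided it lies within the preset distance;
    otherwise None."""
    candidates = [(abs(pos - progress_pos), playing_time)
                  for pos, playing_time in marks
                  if abs(pos - progress_pos) <= preset_distance]
    return min(candidates)[1] if candidates else None
```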
In combination with the above embodiment, there is provided still another embodiment of displaying text information during playing of a video file, in which at least one piece of text information in the vicinity of the current playing may be displayed, and specifically the method may include the steps of:
step S7031 "determining at least one start playing time within a second preset time range from the current playing time.
In this embodiment, the start playing time corresponding to each piece of text information in the text file may be traversed, so as to screen at least one start playing time within the second preset time range with the current playing time.
The second preset time range may be set according to actual requirements, and in particular, the second preset time range may be different from the first preset time range.
Step S7032 "obtaining at least one piece of third text information corresponding to the at least one start playing time from the text file.
In this embodiment, the text information corresponding to each of the screened initial playing times may be extracted; each such piece is a piece of third text information, which is then displayed.
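As a non-limiting illustration of this real-time variant, the following Python sketch screens initial playing times around the current playing time on each playback tick (the player object, its hooks, and the 5-second range are hypothetical):

```python
# Sketch of real-time display of text near the current playing time
# (the player object and its hooks are hypothetical).

def text_near(entries, current_s, second_range_s):
    """Screen initial playing times within second_range_s of the
    current playing time; return the matching third text information."""
    return [txt for (t, txt) in entries
            if abs(t - current_s) <= second_range_s]

def on_playback_tick(player, entries, second_range_s=5.0):
    # Called periodically during playback (hypothetical hook).
    for piece in text_near(entries, player.current_time_s,
                           second_range_s):
        player.display_caption(piece)  # hypothetical display call
```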
With this embodiment, the storage service system can display, in real time during playback, text information synchronized with the playing progress, so that the user can follow the conference content in step with the video file while watching it, further improving the user experience.
In one embodiment, the storage service system may further generate a complete conference summary after the video conference ends; specifically, the following steps may be performed:
Step S605: obtaining conference data of the video conference when the video conference ends.
In this embodiment, when the video conference ends, the storage service system may obtain conference data from the conference control terminal. Specifically, the storage service system may generate a conference data request instruction and send it to the autonomous server, so that the autonomous server forwards it to the conference control terminal. The conference control terminal may then, in response to the conference data request instruction, return the conference data of the video conference to the storage service system along the reverse path (conference control terminal to autonomous server to storage service system).
The conference data may include information such as the name of the video conference, the number of participants, and the conference subject.
Step S606: generating a conference summary based on the conference data and the text file, and publishing the conference summary.
After the conference data is obtained, the storage service system can write the conference data into the text file to obtain a conference summary; in practice, the storage service system may store the conference summary at its back end.
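As a non-limiting illustration of step S606, the following Python sketch composes a summary from the conference data and the parsed text file (the field names and summary layout are assumptions):

```python
# Sketch of step S606: composing a conference summary from the
# conference data and the text file (field names and layout are
# illustrative assumptions).
from dataclasses import dataclass

@dataclass
class ConferenceData:
    name: str
    participant_count: int
    subject: str

def build_summary(data: ConferenceData, entries) -> str:
    # entries: list of (start_seconds, text) parsed from the text file.
    header = (f"Conference: {data.name}\n"
              f"Subject: {data.subject}\n"
              f"Participants: {data.participant_count}\n")
    body = "\n".join(f"[{int(t) // 60:02d}:{int(t) % 60:02d}] {txt}"
                     for t, txt in entries)
    return header + "\n" + body
```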
Step S607: sending the conference summary to the user when a request for downloading the published conference summary is received from the user.
In this embodiment, the storage service system may publish the name of the conference summary on its web end so that the user can view and download the conference summary; the name of the conference summary may be consistent with the name of the video conference. Specifically, when the web end of the storage service system detects the user's download operation for the conference summary, it can generate a download request and send it to the back end of the storage service system, and the back end sends the conference summary to the user currently logged in to the storage service system based on the download request.
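One possible wiring of the web-end download flow is sketched below using Flask (the route, storage path, and file naming are assumptions and not part of the disclosure):

```python
# Hypothetical web-end download handler for a published summary;
# route, storage path, and file layout are assumed for illustration.
import os
from flask import Flask, abort, send_file

app = Flask(__name__)
SUMMARY_DIR = "/data/summaries"  # assumed back-end storage location

@app.route("/summaries/<name>/download")
def download_summary(name: str):
    # The summary name is assumed to match the video conference name.
    path = os.path.join(SUMMARY_DIR, f"{name}.txt")
    if not os.path.isfile(path):
        abort(404)
    return send_file(path, as_attachment=True)
```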
It should be noted that the above example is only one specific embodiment of the present invention. In practice, recording the video file and the text file is not limited to the above manner: for example, the video file and the text file may be recorded by the autonomous server, in which case the storage service system obtains them from the autonomous server; or they may be recorded by the conference control terminal, in which case the storage service system obtains them from the conference control terminal.
It should be noted that, for simplicity of description, the method embodiments are shown as a series of acts, but it should be understood by those skilled in the art that the embodiments are not limited by the order of acts, as some steps may occur in other orders or concurrently in accordance with the embodiments. Further, those skilled in the art will appreciate that the embodiments described in the specification are presently preferred embodiments, and that the acts are not necessarily required by the embodiments of the invention.
Referring to FIG. 9, a schematic structural diagram of a video conference processing apparatus of this embodiment is shown. The apparatus may be applied to a storage service system in a video network and may specifically include the following modules:
The file obtaining module 901 may be configured to obtain a video file recording the video conference and a text file corresponding to an audio stream in the video conference during a recording period, where the text file includes a plurality of pieces of text information, and each piece of text information includes the initial playing time, in the video file, of the audio stream corresponding to that text information;
the video playing module 902 may be configured to play the video file when a playing request of a user for the video file is received;
The information extraction module 903 may be configured to extract, during playing of the video file, at least one piece of text information corresponding to the current playing time from the text file according to the current playing time and each initial playing time;
the information display module 904 may be configured to display the at least one piece of text information.
Optionally, a conference control terminal and an autonomous server may be deployed in the video network, and the apparatus may further include the following modules:
The instruction receiving module can be used for receiving a conference recording instruction sent by the autonomous server, wherein the conference recording instruction is generated by the conference control terminal and is sent to the autonomous server;
The video recording module can be used to record, in response to the conference recording instruction, the audio stream and the video stream currently generated in the video conference and to cache the audio stream, where the audio stream is the audio stream collected by the terminal currently speaking in the video conference;
The audio identification module can be used to identify a cached section of audio stream when the total playing time of that section reaches a preset time length, obtain text information corresponding to the section of audio stream, and take the starting time of the section in the recording process as the initial playing time;
The file obtaining module can be used to store, as a text file, the plurality of pieces of text information corresponding to the plurality of cached sections of audio stream when the recording of the current video conference ends; a sketch of how these modules could cooperate is given below.
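As a non-limiting illustration, the following Python sketch caches audio, recognizes each cached section once it reaches an assumed preset length, and stamps it with its start time in the recording (recognize() and the audio source are stubs, not a real API):

```python
# Sketch of the record -> cache -> recognize pipeline; recognize()
# and the audio source are stubs, and the preset segment length is
# an assumed value.

PRESET_SEGMENT_S = 10.0  # assumed preset time length per segment

def recognize(audio_bytes) -> str:
    raise NotImplementedError  # plug in a real speech recognizer here

def record_conference(audio_chunks):
    """audio_chunks: iterable of (duration_s, chunk_bytes) captured
    from the terminal currently speaking. Returns text-file entries."""
    entries, buf, buf_start, buf_len, clock = [], [], 0.0, 0.0, 0.0
    for duration_s, chunk in audio_chunks:
        buf.append(chunk)
        buf_len += duration_s
        clock += duration_s
        if buf_len >= PRESET_SEGMENT_S:
            # The section's start time in the recording becomes the
            # initial playing time of its text information.
            entries.append({"start": buf_start,
                            "text": recognize(b"".join(buf))})
            buf, buf_start, buf_len = [], clock, 0.0
    return entries

# When recording ends, the entries are stored as the text file,
# e.g. serialized with json.dump.
```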
Optionally, a plurality of marks may be carried in the video file, and the video playing module 902 may be specifically configured to display the plurality of marks on a playing progress bar for playing the video file, where each mark on the playing progress bar is used to indicate the playing time of the mark in the video file;
The information extraction module 903 may specifically include the following units:
The first determining unit may be configured to determine a first play time of the triggered mark in the video file when a trigger operation on any of the plurality of marks is detected;
The progress adjusting unit can be used for switching the current playing time to the first playing time so as to play the video file from the first playing time;
The first extracting unit may be configured to extract, from the text file, at least one piece of first text information whose corresponding initial playing time is within the first preset time range from the first playing time.
Optionally, the video file may carry a plurality of marks, and the video playing module 902 may be specifically configured to display the plurality of marks on a playing progress bar for playing the video file, where each mark on the playing progress bar is used to indicate the playing time of the mark in the video file;
The information extraction module 903 may include the following units:
The second determining unit can be used for determining a progress position corresponding to the current playing time, and determining a target mark which is within a preset distance from the progress position from the plurality of marks;
The third determining unit may be configured to determine a second play time in the video file from the target mark;
The second extracting unit may be configured to extract, from the text file, at least one piece of second text information whose initial playing time is within a preset time range from the second playing time.
Optionally, the information extraction module 903 may include the following units:
The fourth determining unit may be configured to determine, from each initial playing time, at least one initial playing time within a second preset time range from the current playing time;
The third extracting unit may be configured to acquire, from the text file, at least one piece of third text information corresponding to the at least one initial playing time.
Optionally, the apparatus may specifically further include the following modules:
the conference data obtaining module can be used for obtaining conference data of the video conference when the video conference is finished;
The conference summary generation module can be used to generate a conference summary based on the conference data and the text file and to publish the conference summary;
The conference summary sending module can be used to send the conference summary to the user when a request for downloading the published conference summary is received from the user.
Since the embodiment of the video conference processing apparatus is substantially similar to the embodiment of the video conference processing method, its description is relatively brief; for relevant details, refer to the description of the method embodiment.
An embodiment of the present invention further provides an electronic device, including:
One or more processors, and
one or more machine-readable media having instructions stored thereon that, when executed by the one or more processors, cause the electronic device to perform the video conference processing method described in the embodiments of the present invention.
An embodiment of the present invention further provides a computer-readable storage medium storing a computer program that causes a processor to perform the video conference processing method according to the embodiments of the present invention.
In this specification, each embodiment is described in a progressive manner, and each embodiment is mainly described by differences from other embodiments, and identical and similar parts between the embodiments are all enough to be referred to each other.
It will be apparent to those skilled in the art that embodiments of the present invention may be provided as a method, apparatus, or computer program product. Accordingly, embodiments of the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, embodiments of the invention may take the form of a computer program product on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) having computer-usable program code embodied therein.
Embodiments of the present invention are described with reference to flowchart illustrations and/or block diagrams of methods, terminal devices (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing terminal device to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing terminal device, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present invention have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. It is therefore intended that the following claims be interpreted as including the preferred embodiment and all such alterations and modifications as fall within the scope of the embodiments of the invention.
Finally, it is further noted that relational terms such as first and second are used solely to distinguish one entity or action from another, without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or terminal that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or terminal. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article, or terminal device that includes the element.
The foregoing describes the principles and embodiments of the present invention in detail using specific examples to help understand the method and core idea of the present invention. Meanwhile, for those of ordinary skill in the art, there may be changes in the specific implementation and application scope according to the idea of the present invention. In summary, the contents of this specification should not be construed as limiting the present invention.