CN112929731B

CN112929731B - Multimedia switch system

Info

Publication number: CN112929731B
Application number: CN202110508270.3A
Authority: CN
Inventors: 张新华; 陈华锋; 李兵
Original assignee: Zhejiang Lancoo Technology Co ltd
Current assignee: Guangzhou Blue Pigeon Software Co ltd
Priority date: 2021-05-11
Filing date: 2021-05-11
Publication date: 2021-07-30
Anticipated expiration: 2041-05-11
Also published as: CN112929731A

Abstract

The application relates to the field of multimedia switches and discloses a multimedia switch system comprising a multimedia switch and N audio acquisition ends. The multimedia switch is configured to: calculate an energy ratio D before and after speech enhancement of the audio signal in each audio data frame, and calculate a signal-to-noise ratio from the audio data frames whose D value is smaller than a preset threshold; and send exchange information including the signal-to-noise ratio to the N audio acquisition ends, wherein the signal-to-noise ratio is estimated from the audio signals received by the multimedia switch from the same audio acquisition end in the previous period. Each audio acquisition end is configured to: receive the exchange information sent by the multimedia switch and, if the signal-to-noise ratio in the exchange information is smaller than a preset threshold, increase the amplitude of the signal in a preset frequency band so as to strengthen the voice component; and send the audio signal to the multimedia switch. The multimedia switch system reduces the transmission delay of audio data, improves the multi-channel audio acquisition effect and saves hardware cost.

Description

Multimedia switch system

Technical Field

The application relates to the field of multimedia switches, in particular to a multimedia switch system.

Background

In order to support teaching scenarios such as routine multimedia teaching, recorded and broadcast teaching, online classrooms and remote interactive teaching in a traditional classroom, a large amount of equipment has to be installed. The resulting system structure is complex and does not lend itself to centralized processing and management of the various data in the classroom.

In addition, the existing approaches to multi-channel audio acquisition and audio data transmission each have drawbacks. When analog microphones are used and the signals are carried over audio cables, the transmission delay is low, but the channels have different acquisition delays and the audio signal attenuates quickly over distance, which hinders subsequent processing such as multi-channel mixing, enhancement and audio-video synthesis. When digital microphones are used and the data are carried over Ethernet, the transmission distance is long and line deployment is simple, but Ethernet transmission suffers from network congestion, large delay jitter and similar problems, so it is not suitable for applications with high real-time requirements such as local sound amplification.

Disclosure of Invention

The application provides a multimedia switch system. Its first purpose is to improve the sound mixing effect across multiple channels and to avoid sound breaking.

The second purpose is to address network congestion, high noise and delay jitter in multi-channel audio transmission, thereby improving the transmission efficiency and quality of the audio data and reducing the transmission delay.

The application provides a multimedia switch system, including:

a multimedia switch and N audio acquisition ends, wherein N is an integer greater than or equal to 2;

the audio acquisition end is configured to acquire an audio signal through a microphone;

the multimedia switch is configured to:

acquiring an audio signal from the audio acquisition end in the form of an audio data frame, and performing voice enhancement and coding to obtain an audio stream;

calculating an energy ratio D before and after audio signal speech enhancement in each audio data frame, and calculating a signal-to-noise ratio according to the audio data frames with the D value smaller than a preset threshold;

sending exchange information including the signal-to-noise ratio to the N audio acquisition ends, wherein the signal-to-noise ratio is estimated according to the audio signals from the same audio acquisition end received by the multimedia switch in the previous period;

the audio collection end is configured to:

receiving the exchange information sent by the multimedia switch, and if the signal-to-noise ratio in the exchange information is smaller than a preset threshold, increasing the amplitude of the signal in a preset frequency band so as to strengthen the voice component;

and sending the audio signal to the multimedia switch.

In one embodiment, the system further comprises M video acquisition ends, wherein M is an integer greater than or equal to 1;

the multimedia switch is further configured to:

in K time slices of the same operation period, respectively sending exchange information comprising clock synchronization information and signal-to-noise ratio to the N paths of audio acquisition ends;

receiving and coding the video signals from the M paths of video acquisition ends to obtain video streams;

encapsulating the audio stream and the video stream and time-stamping the audio stream and the video stream to ensure synchronicity;

the video acquisition terminal is configured to acquire the video signal and transmit the video signal into the multimedia switch.

In one embodiment, the multimedia switch is configured to:

deleting audio data frames whose Zn is smaller than a preset first threshold and whose Mn is smaller than a preset second threshold, wherein Zn and Mn are given by:

[formulas for Zn and Mn not reproduced in this text]

outputting a mixed sound signal given by:

[formula for the mixed sound signal not reproduced in this text]

wherein Sin represents an audio signal, sgn is the sign function, j is the index of the audio acquisition end, i is the index of a sample within an audio data frame, H is the number of samples in an audio data frame, and η is a preset compensation factor.

In one embodiment, the energy ratio D before and after speech enhancement of the audio signal is calculated by:

[formula for D not reproduced in this text]

wherein S(i) represents the original signal of the i-th frame at the audio acquisition end, and So(i) represents the signal output after the i-th frame has been transmitted to the multimedia switch and speech-enhanced;

if D is larger than a preset threshold, the energy change before and after the voice enhancement processing exceeds the threshold and the i-th frame is a silent segment; if D is smaller than the preset threshold, the energy change before and after the voice enhancement processing is below the threshold and the i-th frame is a voice segment;

after the voice segments are determined, F frames of voice segment data are taken as the analysis sample and the signal-to-noise ratio is calculated based on the following formula:

[formula for SNR not reproduced in this text]

where SNR represents the signal-to-noise ratio.

In one embodiment, the enhancement processing of the received audio signal by the multimedia switch comprises: noise reduction, echo cancellation, howling suppression, and automatic gain.

In one embodiment, the operation of the N-way audio capturing end further comprises:

storing the acquired audio data into an input storage area of each path of audio acquisition end;

when the amount of data stored in the input storage area reaches a set first threshold value, transferring the audio data in the input storage area to an output storage area;

and sending the audio data in the output storage area to the multimedia switch as the audio signal.
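The three storage steps above can be sketched as a small two-area buffer. This is only an illustration under stated assumptions: the class name `DoubleBuffer`, the use of Python lists as storage areas, and the threshold value used in the usage note are hypothetical and do not come from the application.

```python
from typing import List, Optional


class DoubleBuffer:
    """Input and output storage areas of one audio acquisition end (illustrative)."""

    def __init__(self, first_threshold: int) -> None:
        # first_threshold: number of samples that triggers a transfer (assumed unit).
        self.first_threshold = first_threshold
        self.input_area: List[int] = []
        self.output_area: List[int] = []

    def store(self, samples: List[int]) -> None:
        """Store newly acquired samples in the input storage area."""
        self.input_area.extend(samples)
        if len(self.input_area) >= self.first_threshold:
            # Threshold reached: move the accumulated data to the output area.
            self.output_area.extend(self.input_area)
            self.input_area.clear()

    def send(self) -> Optional[List[int]]:
        """Return the output area as the audio signal to send, or None if empty."""
        if not self.output_area:
            return None
        payload, self.output_area = self.output_area, []
        return payload
```

With a threshold of 4 samples, storing three samples leaves nothing to send; storing two more moves all five into the output area, and the next `send()` returns them.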

In one embodiment, the N audio acquisition ends calculate the time difference between the clock of the acquisition end and the synchronous clock according to the time interval of receiving the exchange information between adjacent periods;

and when the time difference is larger than a set second threshold value, calibrating the clock of the acquisition end into a synchronous clock.
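One possible reading of this calibration rule is sketched below. It assumes that each piece of exchange information carries the switch's clock value, and that the time difference is taken as the gap between the locally measured interval and the interval implied by two consecutive switch clock values. The class name, the microsecond unit and the offset-based correction are all hypothetical.

```python
from typing import Optional


class AcquisitionClock:
    """Local clock of one acquisition end with threshold-based calibration."""

    def __init__(self, second_threshold_us: float) -> None:
        self.second_threshold_us = second_threshold_us
        self.offset_us = 0.0  # correction added to the raw local clock
        self._prev_local: Optional[float] = None
        self._prev_sync: Optional[float] = None

    def now(self, raw_local_us: float) -> float:
        """Calibrated local time for a raw local clock reading."""
        return raw_local_us + self.offset_us

    def on_exchange_info(self, raw_local_us: float, sync_us: float) -> bool:
        """Process one exchange message; return True if the clock was calibrated."""
        local_us = self.now(raw_local_us)
        calibrated = False
        if self._prev_local is not None and self._prev_sync is not None:
            local_interval = local_us - self._prev_local
            sync_interval = sync_us - self._prev_sync
            time_diff = abs(local_interval - sync_interval)
            if time_diff > self.second_threshold_us:
                # Drift too large: align the local clock with the synchronous clock.
                self.offset_us = sync_us - raw_local_us
                local_us = sync_us
                calibrated = True
        self._prev_local = local_us
        self._prev_sync = sync_us
        return calibrated
```

Small deviations are tolerated so that the clock is not rewritten on every period; only a deviation above the second threshold triggers a calibration.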

In one embodiment, the M-channel video acquisition terminal is connected to the multimedia switch through an HDMI interface, and the multimedia switch acquires pixel data in YUV format and encodes the pixel data into the video stream by using an encoder.

In one embodiment, the multimedia switch encapsulates the audio stream and the video stream, further comprising:

performing FLV format encapsulation and pushing the encapsulated stream to a server through the RTMP protocol.

In one embodiment, the multimedia switch comprises:

a processor module comprising an audio acquisition and output interface, a video acquisition and output interface, a network interface and an external equipment interface;

an FPGA audio acquisition module connected with the processor module;

an audio codec module comprising an analog audio acquisition equipment interface, a digital audio acquisition equipment interface and an audio output interface, the audio codec module being connected with the audio acquisition module;

an audio processing module coupled with the audio acquisition module.

In the embodiment of the application, compared with the prior art, the sound mixing effect can be improved, and the sound breaking is avoided.

In addition, the audio acquisition ends take turns uploading audio data to the switch in an orderly manner, and each audio acquisition end calibrates its local clock according to the exchange information. This reduces the transmission delay of the audio acquisition ends and avoids transmission conflicts between them. It also provides the basis for the multimedia switch and the audio acquisition ends to cooperate in performing voice enhancement at the acquisition end: the multi-channel audio signals reach the multimedia switch with high quality, the damage that noise-reduction filtering does to the original sound during later processing in the switch is reduced, and the multi-channel audio acquisition effect is improved. Compared with performing voice enhancement only at the audio acquisition end, this cooperation means that the audio acquisition end only needs to perform voice enhancement according to the calculation result received from the multimedia switch, while most of the computation involved in voice enhancement is carried out at the switch. This reduces the transmission delay of the audio data, improves the multi-channel audio acquisition effect and saves hardware cost.

Furthermore, audio and video are encapsulated in the switch, which ensures audio-video synchronization.

The present specification describes a number of technical features distributed across the various technical solutions. Listing every possible combination of these technical features (that is, every technical solution) would make the description excessively long. To avoid this problem, the technical features disclosed in the above summary of the invention, the technical features disclosed in the following embodiments and examples, and the technical features disclosed in the drawings may be freely combined with each other to constitute various new technical solutions (all of which are considered to have been described in the present specification), unless such a combination of technical features is technically infeasible. For example, suppose one example discloses features A + B + C and another example discloses features A + B + D + E, where features C and D are equivalent technical means serving the same purpose, so that only one of them can be used at a time, and where feature E can technically be combined with feature C. Then the solution A + B + C + D should not be considered as having been described, because it is technically infeasible, whereas the solution A + B + C + E should be considered as having been described.

Drawings

Fig. 1 is a schematic diagram of a basic structure of a multimedia switch system according to an embodiment of the present application;

Fig. 2 is a schematic diagram of a multi-channel audio acquisition network architecture (hand-in-hand connection) according to an embodiment of the present application;

Fig. 3 is a schematic diagram of audio-video synthesis editing according to an embodiment of the present application;

Fig. 4 is a schematic diagram of the modules of a multimedia switch according to an embodiment of the present application.

Detailed Description

In the following description, numerous technical details are set forth in order to provide a better understanding of the present application. However, it will be understood by those skilled in the art that the technical solutions claimed in the present application may be implemented without these technical details and with various changes and modifications based on the following embodiments.

The present application relates to some of the following terms:

HDMI: High Definition Multimedia Interface

PCM: Pulse Code Modulation

AAC: Advanced Audio Coding

To make the objects, technical solutions and advantages of the present application more clear, embodiments of the present application will be described in further detail below with reference to the accompanying drawings.

The embodiment of the present application relates to a multimedia switch system, including:

the system comprises a multimedia switch and N paths of audio acquisition ends, wherein N is an integer greater than or equal to 2;

the audio acquisition end is configured to acquire an audio signal through the microphone.

The multimedia switch is configured to:

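acquire the audio signal from the audio acquisition end in the form of audio data frames;

delete audio data frames whose Zn is smaller than a preset first threshold and whose Mn is smaller than a preset second threshold, wherein Zn and Mn are given by:

[formulas for Zn and Mn not reproduced in this text]

output a mixed sound signal given by:

[formula for the mixed sound signal not reproduced in this text]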

Optionally, in an embodiment, the multimedia switch calculates the average amplitude of the signal acquired on each audio channel in each of a plurality of time slices, takes the channel with the largest average amplitude as the output for that time slice, and finally synthesizes the audio data of the plurality of time slices and outputs it as the mixed sound. This step is performed by the FPGA module in the multimedia switch.
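The selection rule in this embodiment can be illustrated in software as follows. This is a minimal sketch, not the FPGA implementation: the function names, the fixed slice length in samples and the use of floating-point samples are assumptions made for the example.

```python
from typing import List, Sequence


def average_amplitude(samples: Sequence[float]) -> float:
    """Mean absolute amplitude of one time slice."""
    return sum(abs(x) for x in samples) / len(samples)


def mix_by_loudest_channel(channels: Sequence[Sequence[float]],
                           slice_len: int) -> List[float]:
    """For each time slice, output the channel with the largest average amplitude."""
    if not channels:
        return []
    length = min(len(ch) for ch in channels)  # use the common length only
    mixed: List[float] = []
    for start in range(0, length, slice_len):
        end = min(start + slice_len, length)
        slices = [ch[start:end] for ch in channels]
        loudest = max(slices, key=average_amplitude)
        mixed.extend(loudest)
    return mixed
```

For two channels `[0.1, -0.1, 0.9, -0.8]` and `[0.5, -0.6, 0.0, 0.1]` with a slice length of 2, the second channel wins the first slice and the first channel wins the second, giving `[0.5, -0.6, 0.9, -0.8]`.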

Optionally, in an embodiment, the system further includes M video acquisition ends, where M is an integer greater than or equal to 1;

the video acquisition end is configured to acquire a video signal and transmit the video signal to the multimedia switch. The M video acquisition ends are connected to the multimedia switch through HDMI, and the multimedia switch then acquires pixel data in YUV format and encodes the pixel data into a video stream by using an encoder.

The audio acquisition end is configured to perform the following operations after receiving the exchange information sent by the multimedia switch:

sending an audio signal to the multimedia switch, calibrating the clock of the acquisition end according to the clock synchronization information in the exchange information, and performing voice enhancement according to the signal-to-noise ratio in the exchange information.

The multimedia switch is further configured to:

First, in K time slices of the same operation period, sending exchange information comprising clock synchronization information and a signal-to-noise ratio to the N audio acquisition ends respectively, wherein the signal-to-noise ratio is estimated from the audio signals received by the multimedia switch from the same audio acquisition end in the previous period.

Second, receiving the audio signals from the N audio acquisition ends, and performing voice enhancement and coding to obtain audio streams. For example, the enhancement processing performed on the audio signal may include one or any combination of the following: noise reduction, echo cancellation, howling suppression, and automatic gain.

Third, receiving and coding the video signals from the M video acquisition ends to obtain video streams.

Fourth, encapsulating the audio stream and the video stream, and time-stamping the audio stream and the video stream to ensure synchronization.

Optionally, in an embodiment, the specific steps of synthesis and encapsulation are as follows, as shown in the audio-video synthesis editing schematic of Fig. 3:

(1) Multi-channel video acquisition and coding: first, a PC desktop screen signal is fed into the multimedia switch through an HDMI interface to obtain pixel data in YUV format, and the YUV data is encoded into an H264 video stream by using the libx264 video encoder; second, the video data of a camera is fed into the multimedia switch through an Ethernet port to obtain an H264-coded RTSP video stream.

(2) Audio acquisition and coding: first, an audio codec performs PCM coding on the audio signal; second, the mixed and voice-enhanced audio is transmitted to the CPU module for AAC coding.

(3) Finally, the obtained H264 video stream and AAC audio stream are encapsulated in FLV format and pushed to a server through the RTMP protocol. The audio data and the video data are each time-stamped, which ensures audio-video time synchronization.

When the CPU receives a video data frame or audio PCM data, it applies a time stamp, calculated as follows:

Video time stamp: pts = inc++ * (1000 / fps), where inc is a static variable with an initial value of 0 that is incremented by 1 each time a time stamp is generated, and fps is the frame rate.

Audio time stamp: pts = inc++ * (frame_size * 1000 / sample_rate), where frame_size is the frame length and sample_rate is the sample rate.
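The two time stamp rules can be written as generators that play the role of the static counter inc. This is a sketch for illustration only: the generator names are hypothetical, and the values 25 fps, 1024 samples per frame and 48000 Hz in the usage note are example figures, not values stated in the application.

```python
from itertools import count
from typing import Iterator


def video_pts(fps: float) -> Iterator[float]:
    """Yield video time stamps in milliseconds: inc * (1000 / fps)."""
    for inc in count():  # inc starts at 0 and grows by 1 per time stamp
        yield inc * (1000 / fps)


def audio_pts(frame_size: int, sample_rate: int) -> Iterator[float]:
    """Yield audio time stamps in milliseconds: inc * (frame_size * 1000 / sample_rate)."""
    for inc in count():
        yield inc * (frame_size * 1000 / sample_rate)
```

At 25 fps the video time stamps advance in steps of 40 ms (0, 40, 80, ...). With 1024 samples per frame at 48000 Hz, each audio frame advances the time stamp by about 21.33 ms. Because both streams are expressed on the same millisecond time base, a player can align them after encapsulation.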

Optionally, in an embodiment, as shown in fig. 2, the specific multi-channel audio capturing step includes:

First, the multimedia switch sends the exchange information at regular intervals: the multimedia switch assigns a port number to each audio acquisition end in the network, periodically generates exchange information carrying its local clock, and sends the exchange information to the corresponding audio acquisition ends in order.

Second, the N audio acquisition ends calculate the time difference between the clock of the acquisition end and the synchronous clock according to the time interval between receiving the exchange information in adjacent periods; when the time difference is greater than a set second threshold value, the clock of the acquisition end is calibrated to the synchronous clock. Optionally, the audio acquisition end calibrates its local clock according to the exchange information: each audio acquisition end obtains the clock message from the received exchange information, calculates the time deviation between the synchronous clock and the local clock, and calibrates the local clock.

Third, the audio acquisition end acquires and caches data: each audio acquisition end acquires voice data through a high-sensitivity microphone pickup and, after A/D conversion, caches the voice data in a local buffer.

Fourth, the audio acquisition end sends data to the multimedia switch: when an audio acquisition end receives the exchange information sent by the multimedia switch, it loads the audio data in the buffer into an uplink data packet and sends the packet to the multimedia switch.

Optionally, in an embodiment, the audio acquisition end stores the acquired audio data in an input storage area of that audio acquisition end; when the amount of data stored in the input storage area reaches a set first threshold value, the audio data in the input storage area is transferred to an output storage area; and the audio data in the output storage area is sent to the multimedia switch as the audio signal.

Fifth, the multimedia switch caches the audio data: the multimedia switch provides an independent data buffer for each port, and after receiving the uplink data packet sent by each audio acquisition end, it stores the packet in the corresponding data buffer.
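The polling cycle formed by the five steps above can be sketched as a single-process simulation. Everything named here is hypothetical: the `ExchangeInfo` fields, the assumption of one time slice per acquisition end, and the direct function call standing in for the network link.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ExchangeInfo:
    port: int            # port number assigned by the switch
    sync_clock_us: int   # switch clock carried in the message (assumed unit)
    snr_db: float        # signal-to-noise ratio estimated in the previous period


@dataclass
class AcquisitionEnd:
    cache: List[int] = field(default_factory=list)

    def on_exchange_info(self, info: ExchangeInfo) -> List[int]:
        """Reply to exchange information with the cached audio as an uplink packet."""
        packet, self.cache = self.cache, []
        return packet


def run_period(ends: Dict[int, AcquisitionEnd],
               port_buffers: Dict[int, List[int]],
               snr_by_port: Dict[int, float],
               clock_us: int,
               slice_us: int) -> None:
    """Poll every acquisition end once, in port order, one time slice each."""
    for slot, port in enumerate(sorted(ends)):
        info = ExchangeInfo(port, clock_us + slot * slice_us,
                            snr_by_port.get(port, 0.0))
        packet = ends[port].on_exchange_info(info)
        # Each port has its own independent data buffer on the switch side.
        port_buffers.setdefault(port, []).extend(packet)
```

Because an acquisition end transmits only in reply to the exchange information addressed to it, two ends never transmit in the same time slice, which is how the ordered polling avoids transmission conflicts.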

Optionally, in an embodiment, each audio acquisition end is responsible for acquiring, buffering and sending the original audio signal; the multimedia switch is responsible for clock synchronization control of every audio acquisition end in the network and for receiving, caching and processing the audio and video data (this part is performed by the FPGA module in the multimedia switch). The multimedia switch and the audio acquisition ends are connected by 100M/1000M synchronous Ethernet.

Optionally, in an embodiment, the system applies ATM technology and establishes a channel for each audio acquisition end by time division multiplexing, which ensures that every audio acquisition end in the network transmits data efficiently and in order; in addition, clock network synchronization is adopted to keep the clock frequencies of the audio acquisition ends and the receiving end consistent, which greatly reduces delay jitter and ensures accurate, error-free data acquisition and transmission. The basic structure of the multimedia switch system is shown in Fig. 1.

Optionally, in an embodiment, the multi-channel audio acquisition network adopts a hand-in-hand structure, as shown in Fig. 2. The multimedia switch and the audio acquisition ends communicate over a bus, the audio acquisition ends are connected to one another hand in hand, and multiple acquisition points are attached to the multimedia switch through a single port. The communication rate is 10M, clock synchronization is accurate, the delay is low, and there are no bit errors.

Optionally, in one embodiment, the multimedia switch supports multiple stereo audio interfaces and multiple digital audio inputs over RJ45 network interfaces. The stereo input sources include line-in, gooseneck microphones, wireless microphones and the like; they are processed by an audio codec (mainly filtering, amplification and A/D conversion) and connected to the central processing unit (here a programmable logic device, the FPGA) through an I2S interface. The digital microphone inputs pass through a physical layer transceiver (PHY) module and are then connected to the central processing unit through a media independent interface (MII).

Optionally, in an embodiment, each audio acquisition end performs pre-enhancement processing on the front-end audio signal according to the signal-to-noise ratio of each audio signal fed back by the multimedia switch. Combined with the back-end speech enhancement processing of the multimedia switch, this better compensates for losses in the sound field and in line transmission, improves the quality of the original signal, reduces the damage that back-end noise-reduction filtering does to the original sound, and improves the speech enhancement effect. The specific steps are as follows:

First, the multimedia switch estimates the signal-to-noise ratio of the original signal: based on the data uploaded by each audio acquisition end, the multimedia switch estimates the signal-to-noise ratio of each audio signal by applying a Wiener filtering method to the original audio signal, and feeds the value back to the audio acquisition end through a downlink synchronization cell. This step is performed by the FPGA module in the multimedia switch.

Second, the audio acquisition end performs voice enhancement: it obtains the signal-to-noise ratio (s/n) from the synchronization cell sent by the multimedia switch and, when the signal-to-noise ratio is smaller than a preset reference value, increases the amplitude of the signal in a preset frequency band so as to strengthen the human-voice component s of the signal and raise the signal-to-noise ratio.

The energy ratio D of the audio signal before and after speech enhancement is calculated by:

[formula for D not reproduced in this text]

wherein S(i) represents the original signal of the i-th frame at the audio acquisition end, and So(i) represents the signal output after the i-th frame has been transmitted to the multimedia switch and speech-enhanced.

If D is larger than the preset threshold, the energy change before and after the voice enhancement processing exceeds the threshold and the i-th frame is a silent segment; if D is smaller than the preset threshold, the energy change before and after the voice enhancement processing is below the threshold and the i-th frame is a voice segment.

After the voice segments are determined, F frames of voice segment data are taken as the analysis sample and the signal-to-noise ratio is calculated based on the following formula:

[formula for SNR not reproduced in this text]

where SNR represents the signal-to-noise ratio.

When the signal-to-noise ratio in the exchange information is smaller than a set third threshold value, the N audio acquisition ends adjust the signal in the preset frequency band to a preset amplitude in subsequent acquisition.

Third, the multimedia switch performs voice enhancement: noise reduction, echo cancellation, howling suppression and automatic gain processing (general-purpose methods, performed by the DSP module in the multimedia switch) are applied to the mixed audio data to achieve the voice enhancement effect. The processed audio output can be used for audio-video synthesis editing or local sound amplification.
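The screening of frames by the energy ratio D and the subsequent signal-to-noise estimate can be sketched as follows. Since the formulas for D and SNR are not reproduced in this text, the definitions used here are assumptions chosen only to match the described behaviour: D is taken as the energy before enhancement divided by the energy after enhancement, and the noise is taken as the difference between the original and the enhanced frame. All function names are hypothetical.

```python
import math
from typing import List, Optional, Sequence


def energy(frame: Sequence[float]) -> float:
    """Sum of squared samples in one frame."""
    return sum(x * x for x in frame)


def energy_ratio(original: Sequence[float], enhanced: Sequence[float],
                 eps: float = 1e-12) -> float:
    """Assumed definition of D: energy before enhancement / energy after."""
    return energy(original) / (energy(enhanced) + eps)


def estimate_snr_db(originals: Sequence[Sequence[float]],
                    enhanceds: Sequence[Sequence[float]],
                    d_threshold: float,
                    eps: float = 1e-12) -> Optional[float]:
    """Estimate SNR in dB over frames whose D is below the threshold (voice frames)."""
    signal = 0.0
    noise = 0.0
    voiced = 0
    for s, so in zip(originals, enhanceds):
        if energy_ratio(s, so) < d_threshold:
            voiced += 1
            signal += energy(so)
            # Assumption: what enhancement removed is treated as noise.
            residual: List[float] = [a - b for a, b in zip(s, so)]
            noise += energy(residual)
    if voiced == 0:
        return None  # no voice frames in the analysis sample
    return 10 * math.log10((signal + eps) / (noise + eps))
```

Under these assumptions, a frame that is mostly noise loses most of its energy during enhancement, so its D is large and it is treated as a silent segment; a voice frame keeps most of its energy, so its D stays small and it contributes to the estimate.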

Optionally, in an embodiment, the multimedia switch includes:

an FPGA audio acquisition module connected with the processor module;

and an audio processing module connected with the audio acquisition module.

Optionally, in an embodiment, the multimedia switch includes a CPU processor, a DSP chip, an FPGA chip, an audio codec module, various data interfaces and other main modules, so as to implement multi-channel audio and video acquisition, audio mixing, speech enhancement, and audio-video synthesis processing. A block diagram of the multimedia switch is shown in Fig. 4.

It is noted that, in the present patent application, relational terms such as first and second are used solely to distinguish one entity or action from another, without necessarily requiring or implying any actual relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a" does not exclude the presence of additional identical elements in the process, method, article, or apparatus that comprises the element. In the present patent application, if it is mentioned that an action is executed according to a certain element, this means that the action is executed according to at least that element, which covers two cases: executing the action based only on the element, and executing the action based on the element and other elements. Expressions such as "a plurality of" and "multiple" cover two and more than two.

All documents mentioned in this application are considered to be incorporated in their entirety into the disclosure of this application, so that they can serve as a basis for amendment where necessary. It should be understood that the above description presents only preferred embodiments of the present disclosure and is not intended to limit its scope of protection. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of one or more embodiments of the present disclosure shall fall within the scope of protection of one or more embodiments of the present disclosure.

Claims

1. A multimedia switch system, comprising:

the multimedia switch is configured to:

acquiring the audio signal from the audio acquisition end in the form of an audio data frame, and performing voice enhancement and coding to obtain an audio stream;

calculating an energy ratio D before and after the audio signal speech enhancement in each audio data frame, and calculating a signal-to-noise ratio according to the audio data frame with the D value smaller than a preset threshold value;

the energy ratio D of the audio signal before and after the voice enhancement is calculated by the following method:

[formula for D not reproduced in this text]

wherein S(i) represents the original signal of the i-th frame at the audio acquisition end, So(i) represents the signal output after the i-th frame has been transmitted to the multimedia switch and speech-enhanced, and H is the number of samples in an audio data frame;

[formula for SNR not reproduced in this text]

wherein SNR represents a signal-to-noise ratio;

the audio collection end is configured to:

and sending the audio signal to the multimedia switch.

2. The multimedia switch system of claim 1, further comprising M video capture ports, wherein M is an integer greater than or equal to 1;

the multimedia switch is further configured to:

in K time slices of the same operation period, respectively sending exchange information comprising clock synchronization information and the signal-to-noise ratio to the N paths of audio acquisition ends;

3. The multimedia switch system of claim 1, wherein the multimedia switch is further configured to:

[formulas for Zn and Mn not reproduced in this text]

outputting a mixed sound signal given by:

[formula for the mixed sound signal not reproduced in this text]

4. The multimedia switch system of claim 1, wherein the multimedia switch performs enhancement processing on the received audio signal comprising one or any combination of: noise reduction, echo cancellation, howling suppression, and automatic gain.

5. The multimedia switch system of claim 1, wherein the operations of the N-way audio capture port further comprise:

6. The multimedia switch system of claim 2, wherein the N-way audio acquisition end calculates a time difference between the own acquisition end clock and the synchronous clock according to a time interval of receiving the switching information between adjacent periods;

7. The multimedia switch system of claim 2, wherein the M-way video capture ports access the multimedia switch via an HDMI interface, and the multimedia switch obtains pixel data in YUV format and encodes the pixel data into the video stream using an encoder.

8. The multimedia switch system of claim 2, wherein the multimedia switch encapsulates the audio stream and the video stream, further comprising:

9. The multimedia switch system of claim 2, wherein the multimedia switch comprises:

an FPGA audio acquisition module connected with the processor module;

an audio processing module coupled with the audio acquisition module.