Detailed Description
The application solves the technical problem of low efficiency of multi-track audio-video synchronization in the prior art by providing a video collaborative editing flow optimization method for multi-track audio synchronization.
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present application. It will be apparent that the described embodiments are only some, but not all, embodiments of the application. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
It should be noted that the terms "comprises" and "comprising" are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, system, article, or apparatus.
An embodiment, as shown in fig. 1, provides a video collaborative editing flow optimization method for multi-track audio synchronization, where the method includes:
A video acquisition device and a plurality of independent recording devices for audio and video acquisition of a target scene are determined.
When audio and video acquisition is carried out on the target scene, a suitable video acquisition device and a plurality of independent recording devices are determined according to the scene requirements, so as to ensure that high-quality pictures and audio data are acquired. The video acquisition device may be a high-definition camera, a film camera, a mobile phone camera, or the like, and whether an anti-shake function, a wide-angle lens, or high-frame-rate support is needed is determined according to the scene requirements; for example, a static scene may use a fixed camera, while a dynamic scene requires a device that supports stabilized shooting. The independent recording devices are used for collecting different audio tracks: for example, a directional microphone picks up sound from a specific direction, a lavalier microphone records dialogue, an ambient recording device captures background sound, and a multi-track recorder can record several audio tracks simultaneously.
Synchronous data acquisition is carried out through the video acquisition device and the plurality of independent recording devices, generating initial audio-video data and a plurality of independent audio track data.
The video acquisition device is responsible for recording the picture of the target scene together with its own audio track, while the plurality of independent recording devices respectively capture high-quality audio track data from different sound sources, such as dialogue, ambience, or specific sound sources. Data acquisition is carried out simultaneously on the video acquisition device and the plurality of independent recording devices to generate the initial audio-video data and the plurality of independent audio track data: the initial audio-video data refers to the data captured by the video acquisition device, which carries the video images and simultaneously records background sound, while each independent audio track data refers to the audio captured by the corresponding independent recording device. During data acquisition, all devices record synchronously against a unified time code, ensuring that the generated initial audio-video data and each independent audio track data remain consistent in time. This synchronous acquisition effectively avoids the track-picture desynchronization caused by time deviations between different devices in traditional audio-video recording, and lays the foundation for the subsequent time code alignment, track processing, and audio-video fusion.
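As a minimal illustration of recording against a unified time code, the following Python sketch stamps each captured buffer with a shared session clock; the device identifiers, buffer payloads, and the SyncedRecorder helper are hypothetical stand-ins, not part of any real capture API:

```python
# Illustrative sketch: stamping capture buffers against a shared reference
# clock so that all devices record on a unified time code.
import time
from dataclasses import dataclass, field

@dataclass
class TimedBuffer:
    device_id: str
    timecode: float        # seconds since the shared session start
    samples: bytes         # raw audio/video payload for this buffer

@dataclass
class SyncedRecorder:
    session_start: float = field(default_factory=time.monotonic)

    def stamp(self, device_id: str, samples: bytes) -> TimedBuffer:
        # Every device stamps its buffer with the same monotonic reference,
        # so later alignment only needs the timecode field.
        return TimedBuffer(device_id, time.monotonic() - self.session_start, samples)

recorder = SyncedRecorder()
video_buf = recorder.stamp("camera-1", b"<frame bytes>")
mic_buf = recorder.stamp("lavalier-1", b"<pcm bytes>")
print(video_buf.timecode, mic_buf.timecode)
```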
A background environment requirement of the target scene is received.
The processing requirements for background sound differ between target scenes; for example, ambient atmosphere may need to be enhanced, or dialogue may need to be highlighted. Background requirement information of the target scene is therefore received to obtain the specific requirements for audio processing and video editing. Background sound requirements can vary significantly from scene to scene. In an interview scene, to ensure clarity of the speech content, it is often necessary to remove all ambient noise, including background wind and other noise, to create a quiet auditory environment. In a life-style video, to preserve the realism and atmosphere of the scene, environmental sounds such as street background, natural wind, or light conversation should be appropriately preserved, while harsh or disturbing noise is reduced.
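For illustration only, the received background environment requirement could be represented by a small data structure such as the following; the field names and default values are assumptions, not part of the described method:

```python
# Hypothetical representation of a background environment requirement:
# either remove all ambience, or keep selected ambient components while
# suppressing harsh noise.
from dataclasses import dataclass, field

@dataclass
class BackgroundRequirement:
    remove_all_ambience: bool                                  # True for interview-style scenes
    keep_components: list[str] = field(default_factory=list)   # e.g. wind, street sounds
    max_ambience_level_db: float = -30.0                       # ceiling for retained background

interview = BackgroundRequirement(remove_all_ambience=True)
vlog = BackgroundRequirement(False, ["street ambience", "natural wind"], -24.0)
```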
The picture and the audio of the initial audio-video data are separated and alignment-mapped according to the time code, generating original picture data and original audio data.
An audio-video separation tool or algorithm (such as FFmpeg or other professional video processing software) is used to extract the audio track and the video picture from the initial audio-video data. The time code information embedded in the initial audio-video data is then read, and the separated audio data and picture data are alignment-mapped according to the time codes. During alignment, the starting points of the time codes are kept consistent, and any potential frame differences or device delays are corrected so that the audio data and the picture data are strictly synchronized on the time axis. After the time code alignment is completed, the resulting original picture data and original audio data are output, ensuring that the audio and picture data can be accurately matched in the subsequent audio correction, processing, and audio-video fusion, thereby achieving a high-quality synchronization effect.
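Since the text names FFmpeg as one possible separation tool, the following is a minimal sketch of the demultiplexing step using the FFmpeg command line; the file names are illustrative:

```python
# A minimal sketch of the separation step via the FFmpeg CLI.
import subprocess

def separate_av(src: str, video_out: str, audio_out: str) -> None:
    # Extract the picture stream without re-encoding; drop the audio (-an).
    subprocess.run(["ffmpeg", "-y", "-i", src, "-an", "-c:v", "copy", video_out],
                   check=True)
    # Extract the audio track as uncompressed PCM; drop the video (-vn).
    subprocess.run(["ffmpeg", "-y", "-i", src, "-vn", "-acodec", "pcm_s16le", audio_out],
                   check=True)

separate_av("initial_av.mp4", "original_picture.mp4", "original_audio.wav")
```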
After time code alignment correction is performed on the original audio data and the plurality of independent audio track data, environmental sound processing is performed on the corrected original audio data based on the background environment requirement, and the result is then fused with the plurality of corrected independent audio track data to generate fused audio track data.
The time code information embedded in the original audio data and the plurality of independent audio track data is extracted, and the time axes of the audio data are compared using an audio editing tool (such as Adobe Audition or professional time code calibration software). Where time codes are inconsistent or a recording device introduces delay, precise time code correction is applied to the affected track by adjusting the starting point of its time axis, scaling the time axis proportion, or inserting a compensation segment, ensuring that all audio data are completely synchronized on the time axis.
After the time code alignment correction is completed, environmental sound processing is performed on the corrected original audio data according to the received background environment requirement. For scenes such as interviews, noise reduction algorithms (such as spectral noise reduction or AI noise reduction) can be used to completely remove ambient noise and enhance the clarity of the human voice. For life-style videos where a sense of the real scene should be preserved, the ambience is optimized by adjusting the audio equalization, reducing high-frequency noise, or appropriately enhancing the natural portion of the background sound. The original audio data subjected to environmental sound processing is then fused with the plurality of time-code-corrected independent audio track data to generate fused audio track data, so that the fused audio track data meets the background sound requirement of the target scene, the synchronicity and auditory quality between tracks are ensured, and a high-quality audio foundation is laid for the subsequent synchronous fusion with the picture data.
Further, performing time code alignment correction on the original audio data and the plurality of independent audio track data comprises:
Acquiring the time codes of the initial sampling points in the original audio data and the plurality of independent audio track data, and calculating the time code offset of the plurality of independent audio track data relative to the original audio data; and performing time code offset correction on the original audio data and the plurality of independent audio track data according to the time code offset, and generating corrected original audio data and a plurality of corrected independent audio track data.
Preferably, the initial sampling point time codes in the original audio data and the plurality of independent audio track data are collected: an audio processing tool is used to parse the time code information of each track, and the initial sampling point time code of the original audio data and of each independent audio track data is recorded. The time code offset of each independent audio track relative to the original audio data is then calculated by comparing the initial time codes of the tracks. Time code offset correction is performed on the original audio data and the plurality of independent audio track data according to the calculated offset values, specifically by inserting blank audio segments, cutting out the excess portion, or resetting the time code starting points; during correction, tracks with larger offset values undergo time axis shifting or alignment adjustment so that the data starting points of all tracks are consistent with the time code of the original audio data. After the time code offset correction is completed, corrected original audio data and a plurality of corrected independent track data are generated. The correction ensures that all track data are completely synchronized in the time dimension and provides an accurate time reference for subsequent track fusion and audio-video synchronization, effectively avoiding audio desynchronization caused by recording device differences or operational errors and improving the quality and efficiency of audio-video processing.
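A simplified sketch of the offset correction, assuming each track is available as an array of samples plus the time code of its first sample; inserting silence and trimming the excess correspond to the blank-segment insertion and cutting described above:

```python
# Offset correction sketch: shift each track so its first sample lines up
# with the reference time code of the original audio.
import numpy as np

def correct_offset(track: np.ndarray, track_start: float,
                   reference_start: float, sample_rate: int) -> np.ndarray:
    offset_samples = round((track_start - reference_start) * sample_rate)
    if offset_samples > 0:
        # Track starts late: insert a blank (silent) segment at the front.
        return np.concatenate([np.zeros(offset_samples, dtype=track.dtype), track])
    # Track starts early: cut the portion that precedes the reference start.
    return track[-offset_samples:]

sr = 48_000
mic = np.random.randn(sr * 5).astype(np.float32)   # 5 s of stand-in audio
aligned = correct_offset(mic, track_start=10.25, reference_start=10.00, sample_rate=sr)
print(len(aligned) - len(mic))  # 0.25 s * 48 kHz = 12000 extra leading samples
```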
Further, as shown in fig. 2, performing the environmental sound processing on the corrected original audio data based on the background environment requirement and fusing it with the plurality of corrected independent audio track data to generate the fused audio track data includes:
Judging, based on the background environment requirement, whether all background environment sounds are to be removed; if not, extracting background sound retention features, establishing a pure noise fingerprint based on the background sound retention features, constructing a filtering module from the pure noise fingerprint, denoising the corrected original audio data and the plurality of corrected independent audio track data through the filtering module, and performing track synchronization and volume balance processing on the denoised corrected original audio data and the plurality of corrected independent audio track data to generate the fused audio track data.
Specifically, whether the target scene requires complete removal of the background environment sound is judged according to the background environment requirement. For a scene that needs to highlight the human voice, such as an interview, all background sound is removed; for a scene that needs to preserve the real atmosphere, such as a life-style video, part of the background sound is preserved. If the background environment sound does not need to be completely removed, the specific features to be retained in the background sound are analyzed and extracted; the extraction can be based on spectral features, energy distribution, or timbre features, and the background sound portions to be preserved (such as natural wind or slight ambient tone) are marked for subsequent processing. Based on the extracted background sound retention features, the pure noise portions to be removed from the audio (such as mechanical noise, current noise, or irregular noise) are identified, and a pure noise fingerprint is generated through spectral analysis. A filtering module (such as an adaptive filter or a frequency-domain filter) is then constructed from the fingerprint to ensure that noise can be effectively removed during filtering without affecting the preserved background sound. The corrected original audio data and the plurality of corrected independent audio track data are denoised by this filtering module, after which track synchronization and volume balance processing are applied: the volume of each track is adjusted in proportion to the background environment requirement so that the track owner's voice and the retained ambience remain in the required balance. Finally, the processed corrected original audio data is integrated with the plurality of corrected independent audio track data to generate the final fused audio track data.
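The following is a much-simplified sketch of the noise-fingerprint filtering idea: the fingerprint is taken as the mean magnitude spectrum of a noise-only clip, and a spectral gate subtracts it frame by frame. The frame length and over-subtraction factor are illustrative assumptions, not parameters given by the method:

```python
# Noise-fingerprint spectral gating sketch: subtract the noise profile per
# frame; clipping at zero keeps retained background components from inverting.
import numpy as np

def noise_fingerprint(noise_clip: np.ndarray, n_fft: int = 1024) -> np.ndarray:
    frames = noise_clip[: len(noise_clip) // n_fft * n_fft].reshape(-1, n_fft)
    return np.abs(np.fft.rfft(frames, axis=1)).mean(axis=0)

def spectral_gate(audio: np.ndarray, fingerprint: np.ndarray,
                  n_fft: int = 1024, strength: float = 1.5) -> np.ndarray:
    frames = audio[: len(audio) // n_fft * n_fft].reshape(-1, n_fft)
    spec = np.fft.rfft(frames, axis=1)
    mag, phase = np.abs(spec), np.angle(spec)
    cleaned = np.maximum(mag - strength * fingerprint, 0.0)
    return np.fft.irfft(cleaned * np.exp(1j * phase), n=n_fft, axis=1).ravel()

rng = np.random.default_rng(0)
noise = rng.normal(0, 0.05, 48_000)                       # noise-only reference clip
speech = np.sin(2 * np.pi * 220 * np.arange(96_000) / 48_000) + rng.normal(0, 0.05, 96_000)
denoised = spectral_gate(speech, noise_fingerprint(noise))
```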
Further, performing track synchronization and volume balancing processing on the denoised corrected original audio data and the plurality of corrected independent track data to generate the fused track data, including:
Volume and frequency imbalance adjustment is performed on the plurality of corrected independent audio track data to generate a plurality of independent audio track adjustment data; the track owner corresponding to each independent recording device is determined, and a dynamic volume adjustment correlation curve between the track owner's voice and the environment sound is constructed; based on the dynamic volume adjustment correlation curve, the denoised corrected original audio data and the plurality of independent audio track adjustment data are fused with time code synchronization to generate the fused audio track data.
An audio processing tool is used to analyze the plurality of independent audio track data one by one, detecting the volume peaks and frequency distribution ranges and identifying volume and frequency imbalances between tracks. Based on the analysis results, an equalizer is used to adjust the frequency distribution of each track and eliminate abnormal frequency components, and a dynamic range compressor is used to optimize the volume peaks, generating independent audio track adjustment data with balanced volume and frequency. Next, the recording device and its collection object, i.e., the track owner (such as the speaker or interviewee in an interview), corresponding to each track is determined, and the characteristic parameters of the track owner's voice, including volume dynamic range, dominant frequency range, and clarity requirement, are extracted. In combination with the background environment requirement, a dynamic volume adjustment correlation curve between the track owner's voice and the environment sound is constructed; the curve defines the priority of the track owner's voice and its volume proportion relative to the background sound in different time periods, ensuring harmony between the highlighted owner track and the background sound. Then, based on the time code information embedded in the audio data, time code alignment verification is performed on the denoised corrected original audio data and the plurality of independent audio track adjustment data, ensuring that every track is completely synchronized on the time axis; any residual time deviation is corrected by comparing and adjusting the starting point and rate of the time codes. On this basis, the track volume in each time period is dynamically adjusted according to the constructed curve while the clarity requirement of the track owner's voice is maintained, ensuring the balance and auditory quality of the final fusion. Finally, the synchronized and dynamically adjusted corrected original audio data is integrated with the adjusted independent audio track data to generate the fused audio track data; the output audio retains the clarity of the track owner's voice while blending harmoniously with the background sound, laying a high-quality audio foundation for subsequent audio-video synchronous editing.
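As a rough illustration of the volume balance and curve-driven mixing, the sketch below peak-normalizes two tracks and mixes the owner track against the ambience with a time-varying gain curve; the piecewise curve values are invented for the example:

```python
# Volume balance sketch: normalize each track, then apply a per-sample gain
# curve giving the owner voice a larger share during speech segments.
import numpy as np

def normalize_peak(track: np.ndarray, target_peak: float = 0.9) -> np.ndarray:
    peak = np.max(np.abs(track))
    return track if peak == 0 else track * (target_peak / peak)

def mix_with_curve(owner: np.ndarray, ambience: np.ndarray,
                   owner_gain_curve: np.ndarray) -> np.ndarray:
    # owner_gain_curve holds, per sample, the share given to the owner voice;
    # the ambience receives the complementary share.
    n = min(len(owner), len(ambience), len(owner_gain_curve))
    return owner[:n] * owner_gain_curve[:n] + ambience[:n] * (1.0 - owner_gain_curve[:n])

sr = 48_000
owner = normalize_peak(np.random.randn(sr * 4).astype(np.float32))
ambience = normalize_peak(np.random.randn(sr * 4).astype(np.float32))
# Example curve: owner dominant (0.8) during speech, relaxed to 0.5 elsewhere.
curve = np.full(sr * 4, 0.5, dtype=np.float32)
curve[sr : sr * 3] = 0.8
fused = mix_with_curve(owner, ambience, curve)
```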
Further, determining the track owner corresponding to each of the plurality of independent recording devices and constructing the dynamic volume adjustment correlation curve between the track owner's voice and the environment sound includes:
Historical audio fusion records and corresponding historical fusion effect evaluation parameters are collected; the historical audio fusion records are analyzed to generate contrast data between the fused track owner data and the environmental noise; based on the contrast data and the historical fusion effect evaluation parameters, the optimal contrast corresponding to the best evaluation parameters is screened out, and the dynamic volume adjustment correlation curve is constructed according to the optimal contrast.
Historical audio fusion records and corresponding historical fusion effect evaluation parameters are collected; the evaluation parameters may include audio clarity scores, degree of environmental noise interference, user satisfaction feedback, and the like, reflecting a comprehensive evaluation of each historical fusion. Data analysis is performed on the collected records to extract the fused track owner data (i.e., characteristic data of the target voice in the main track, such as voice signal strength and dominant frequency distribution) and the environmental noise data, and the contrast between the two is calculated, producing a set of owner-to-noise contrast values, each corresponding to a fusion effect. Based on the contrast data and the corresponding evaluation parameters, a screening algorithm (such as multivariate optimization or correlation analysis) matches the evaluation parameters to the contrast data and screens out the optimal contrast, i.e., the contrast corresponding to the best historical fusion effect. Taking the screened optimal contrast as the reference and evaluation standard, and combining the requirements of the target scene, the dynamic volume adjustment correlation curve between the track owner's voice and the environment sound is constructed, ensuring that the dynamic balance between the main voice and the background sound can be accurately regulated in different time periods and scenes.
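A minimal sketch of the screening step, under the assumption that each historical fusion record pairs one owner-to-noise contrast value with one scalar effect score; picking the best-scoring record then reduces to a simple maximization:

```python
# Screening sketch: select the contrast whose historical fusion received the
# best evaluation score. Record fields and values are illustrative.
from dataclasses import dataclass

@dataclass
class FusionRecord:
    contrast_db: float   # owner level minus ambience level, in dB
    score: float         # historical fusion-effect evaluation (higher is better)

history = [FusionRecord(6.0, 0.71), FusionRecord(10.0, 0.88), FusionRecord(14.0, 0.79)]
optimal = max(history, key=lambda r: r.score)
print(f"optimal contrast: {optimal.contrast_db} dB")  # -> 10.0 dB
```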
The fused audio track data and the original picture data are synchronously fused to generate a target fused video file.
Specifically, the fused audio track data and the original picture data are synchronously fused to ensure perfect synchronization between audio and picture, generating the final target fused video file.
Further, synchronously fusing the fused audio track data and the original picture data to generate the target fused video file includes:
A lip-sound matching analysis network is configured, the network being constructed based on lip feature samples and audio feature samples having a correspondence relationship; based on the lip-sound matching analysis network, matching analysis is performed on the audio data and picture data belonging to the same time zone in the fused audio track data and the original picture data; if the matching is successful, time code synchronous fusion is performed on the fused audio track data and the original picture data to generate the target fused video file.
Specifically, a lip-sound matching analysis network is first configured. The network is trained on lip feature samples and audio feature samples having a correspondence relationship; by learning from a large amount of annotated audio and lip feature data, it can accurately identify the correspondence between audio and lip movements. The configured network is then used to perform frame-by-frame matching analysis on the audio and picture data belonging to the same time period in the fused audio track data and the original picture data, judging whether the two match by analyzing the lip movement features in the picture and the pronunciation features of the corresponding audio, so as to ensure audio-picture synchronization. If the matching analysis indicates that the audio and picture match successfully in the current time period, time code synchronous fusion is performed on the fused audio track data and the original picture data using the matching result as the reference, and possible synchronization errors are eliminated through time code correction and binding. Finally, the matched and synchronized data is integrated and output to generate the target fused video file. In this way, the audio tracks and pictures in the generated video file are completely aligned in time, and high-precision matching between voice and lip movement is achieved, significantly improving the overall quality of audio-video fusion and the user experience.
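Schematically, the per-time-zone match decision could look like the sketch below, where the embedding functions stand in for the trained lip-sound matching analysis network (whose architecture the text does not specify) and cosine similarity against a threshold is one plausible decision rule:

```python
# Per-time-zone match decision sketch: compare audio and lip embeddings
# produced by the (assumed) network via cosine similarity.
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def match_time_zone(audio_feat: np.ndarray, lip_feat: np.ndarray,
                    threshold: float = 0.8) -> bool:
    # audio_feat / lip_feat: embeddings for the audio and lip motion of the
    # same time zone; threshold 0.8 is an illustrative choice.
    return cosine(audio_feat, lip_feat) >= threshold

rng = np.random.default_rng(1)
audio_emb = rng.normal(size=128)
lip_emb = audio_emb + rng.normal(scale=0.1, size=128)  # nearly aligned pair
print(match_time_zone(audio_emb, lip_emb))             # True for this pair
```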
Further, performing matching analysis on the audio data and picture data belonging to the same time zone in the fused audio track data and the original picture data based on the lip-sound matching analysis network further comprises:
Voiceprint feature analysis is performed on the audio data belonging to the same time zone to establish speaker identity information; based on the speaker identity information, lip feature extraction is performed on the corresponding speaker in the picture data of the same time zone, and the lip feature extraction result and the audio data are input into the lip-sound matching analysis network for matching analysis.
Specifically, voiceprint feature analysis is performed on the audio data within the same time zone: voiceprint features are extracted using voiceprint recognition technology, and the speaker's identity information is recognized and established to determine the specific sound source corresponding to the current audio. Based on the voiceprint identity information, the picture region corresponding to the speaker is located in the picture data of the same time zone, and an image processing algorithm is used to extract the speaker's lip features, including the movement trajectory, degree of opening and closing, and other key-point changes of the lips, forming complete lip feature data. The extracted lip feature data and the corresponding audio data are then input into the configured lip-sound matching analysis network, whose matching analysis function judges whether the voiceprint features of the audio data are consistent with the lip features in the picture data.
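A toy sketch of the speaker-resolution step follows: the speaker is identified by the nearest enrolled voiceprint, and that speaker's lip key points are selected from the frame. The voiceprint vectors, enrollment table, and key-point dictionary are hypothetical stand-ins for real voiceprint and facial-landmark models:

```python
# Speaker resolution sketch: nearest-voiceprint identification, then lip
# key-point lookup for the identified speaker.
import numpy as np

enrolled = {"speaker_a": np.array([0.9, 0.1, 0.3]),
            "speaker_b": np.array([0.2, 0.8, 0.5])}

def identify_speaker(voiceprint: np.ndarray) -> str:
    # Pick the enrolled identity whose voiceprint is closest to the query.
    return min(enrolled, key=lambda sid: np.linalg.norm(enrolled[sid] - voiceprint))

def lip_features_for(frame_keypoints: dict, speaker_id: str) -> np.ndarray:
    # frame_keypoints maps each tracked speaker to lip key points (x, y pairs).
    return frame_keypoints[speaker_id]

voiceprint = np.array([0.85, 0.15, 0.25])          # stand-in for a real extraction
frame_keypoints = {"speaker_a": np.zeros((20, 2)), "speaker_b": np.ones((20, 2))}
sid = identify_speaker(voiceprint)                  # -> "speaker_a"
lips = lip_features_for(frame_keypoints, sid)
```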
Further, performing matching analysis on the audio data and picture data belonging to the same time zone in the fused audio track data and the original picture data based on the lip-sound matching analysis network further comprises:
If the matching fails, re-matching is performed on the forward and backward time zones of the failed time zone according to a preset step length; if the re-matching succeeds, offset correction is performed on the fused audio track data and the original picture data using the re-matching result to generate the target fused video file; if the re-matching also fails, the original matching-failure time zone is sent to the editing management terminal as a reminder.
When matching analysis is performed on the audio data and picture data belonging to the same time zone in the fused audio track data and the original picture data based on the lip-sound matching analysis network, if matching fails, the forward and backward time zones adjacent to the failed time zone are re-matched according to a preset time step. During re-matching, the matching range is gradually enlarged, and the audio data and picture data in the adjacent time zones are analyzed in an attempt to find a segment in which the voiceprint features of the audio match the lip features of the picture. If the re-matching succeeds, time code offset correction is performed on the fused audio track data and the original picture data using the re-matching result as the reference, ensuring the consistency of audio and picture on the time axis, and the final target fused video file is then generated. If the re-matching still fails, the originally failed time zone is recorded and sent to the editing management terminal, reminding the user that the audio-video matching of that time zone is abnormal and facilitating manual intervention and further processing. This approach provides a reliable remedy when automatic matching fails, maximizing the matching success rate, effectively reducing errors and omissions in audio-video synchronous editing, and improving the overall quality and efficiency of audio-video fusion.
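The fallback search could be sketched as follows: starting from the failed time zone, candidate zones are tried alternately forward and backward at the preset step, widening until a match is found or the search range is exhausted; match_at stands in for the network's per-zone match test:

```python
# Re-matching sketch: alternate forward/backward probes at a preset step.
from typing import Callable, Optional

def rematch(failed_t: float, step: float, max_offset: float,
            match_at: Callable[[float], bool]) -> Optional[float]:
    k = 1
    while k * step <= max_offset:
        for candidate in (failed_t + k * step, failed_t - k * step):
            if match_at(candidate):
                return candidate - failed_t   # time-code offset to correct by
        k += 1
    return None  # still failed: report this zone to the editing terminal

# Toy example: pretend the true alignment sits 0.4 s ahead of the failed zone.
offset = rematch(12.0, step=0.2, max_offset=1.0,
                 match_at=lambda t: abs(t - 12.4) < 1e-9)
print(offset)  # -> 0.4 (within floating-point tolerance)
```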
In summary, the embodiment of the application has at least the following technical effects:
First, a video acquisition device and a plurality of independent recording devices for capturing audio and video of a target scene are determined. Then, synchronous data acquisition is carried out through the video acquisition device and the plurality of independent recording devices, generating initial audio-video data and a plurality of independent audio track data. Meanwhile, the background environment requirement of the target scene is received. Next, the picture and the audio of the initial audio-video data are separated and alignment-mapped according to the time code, generating original picture data and original audio data. Further, after time code alignment correction is performed on the original audio data and the plurality of independent audio track data, environmental sound processing is performed on the corrected original audio data based on the background environment requirement, and the result is fused with the plurality of corrected independent audio track data to generate fused audio track data. Finally, the fused audio track data and the original picture data are synchronously fused to generate the target fused video file. The technical problem of low multi-track audio-video synchronization efficiency in the prior art is thereby solved, and the technical effect of improving audio-video editing efficiency is achieved.
It should be noted that the sequence of the embodiments of the present application is merely for description and does not represent the relative merits of the embodiments. The foregoing describes specific embodiments of this specification. The processes depicted in the accompanying drawings do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.
The foregoing description of the preferred embodiments is not intended to limit the application to the precise form disclosed; any modifications, equivalents, and alternatives falling within the spirit and scope of the application are intended to be included within the scope of the application.
The specification and figures are merely exemplary illustrations of the present application, and the application is intended to cover any modifications, variations, combinations, or equivalents that fall within its scope. It will be apparent to those skilled in the art that various modifications and variations can be made to the present application without departing from its scope; the present application is thus intended to include such modifications and alterations insofar as they come within the scope of the application or its equivalents.