Disclosure of Invention
In view of the above technical drawbacks, the present application provides a method and an apparatus for generating a personalized song file, a music performance device, a computer device, and a computer-readable storage medium, so that a personalized song can be made conveniently and intelligently for a user.
A personalized song file generation method, comprising:
acquiring a music file from an external sound source, preprocessing the music file and storing the preprocessed music file;
separating voice and accompaniment from the music file to obtain corresponding voice information and accompaniment information containing various musical instruments;
acquiring attribute information of the music file, and searching image content matched with the music file in a material library according to the attribute information and a preset matching rule;
and acquiring lyric information of the music file, and generating a personalized song file according to the lyric information, the accompaniment information and the image content.
In one embodiment, obtaining a music file from an external audio source comprises:
receiving audio data packets from at least one audio source, the audio data packets being encapsulated based on a preset private protocol;
parsing the audio data packets according to the private protocol to obtain audio data;
and acquiring a music file according to the audio data.
In one embodiment, the receiving audio data packets from at least one audio source comprises:
establishing communication connection with a plurality of music data sources;
receiving, in a time-division working mode, the audio data packets that are sent by each device terminal and encapsulated based on the private protocol;
and parsing each audio data packet according to the private protocol to obtain a plurality of music files.
In one embodiment, the music file is preprocessed, including:
performing parallel operations on each music file to obtain attribute information of the music file, wherein the attribute information comprises a sampling rate, a format and a channel bit number;
and judging, according to the attribute information, whether the music file is a high-sound-quality music file, and if not, reshaping the sound quality of the music file to obtain high-sound-quality music.
In one embodiment, the music file is preprocessed, including:
the method comprises the steps of extracting accompaniment contents played by different instruments of the same music from a plurality of music files respectively, and combining the accompaniment contents to obtain one accompaniment file.
In one embodiment, the music file is preprocessed, including:
extracting a music climax fragment from each of the music files, and splicing the climax fragments into one medley file.
In one embodiment, separating the vocal and accompaniment of the music file comprises:
performing STFT spectrum analysis on an input music file, and performing logarithmic Mel spectrum conversion to obtain a Mel spectrum;
analyzing the Mel spectrum by using a pre-trained human-voice network, accompaniment network and musical-instrument network to obtain corresponding spectrograms; calculating, from the spectrograms, the proportion of the human-voice spectrum and the proportion of the accompaniment spectrum in the whole music spectrum; and multiplying each proportion by the music spectrum to obtain a human-voice spectrum and an accompaniment spectrum;
performing ISTFT analysis on the human-voice spectrum and the accompaniment spectrum to convert them into human-voice information and accompaniment information respectively;
calculating the music signal spectrum of each instrument in the accompaniment information and storing it as a single-instrument accompaniment.
In one embodiment, the method for generating a personalized song file further includes:
respectively constructing a human-voice network, an accompaniment network and a musical-instrument network by using the human voices, the accompaniments and the corresponding musical-instrument timbre library prestored in the song library;
respectively performing STFT analysis and logarithmic Mel spectrum conversion on four types of material: music with accompaniment, pure human voice, pure accompaniment and pure musical-instrument timbre;
performing blind source separation on the music and inputting the results into the human-voice network, the accompaniment network and the musical-instrument network respectively to obtain a pure-human-voice spectrogram, a pure-accompaniment spectrogram and a pure-musical-instrument-timbre spectrogram;
comparing the pure-human-voice, pure-accompaniment and pure-musical-instrument-timbre spectrograms with the pure-human-voice track, pure-accompaniment track and pure-musical-instrument track amplitude spectra respectively, calculating their similarity by using the Manhattan distance, obtaining a loss function by taking the mean, and optimizing and adjusting the blind-source-separation parameters accordingly.
In one embodiment, performing sound-quality reshaping on the music file to obtain high-sound-quality music comprises:
separating the music file into vocal information and initial accompaniment information;
identifying musical instruments in the initial accompaniment information;
and performing timbre compensation on each instrument in the initial accompaniment information by using the high-quality timbres of the corresponding instruments recorded in a pre-established instrument timbre library, to obtain high-definition accompaniment information.
In one embodiment, separating the vocal and accompaniment of the music file comprises:
acquiring other sound files of the sounding object of the human voice in the music file;
extracting voiceprint information of the sound production object according to the other sound files;
and generating a voice filter by using the voiceprint information, filtering the music file by using the voice filter to obtain voice information, and extracting accompaniment information.
In one embodiment, the method for generating a personalized song file further includes:
acquiring audio and video files of sounding objects of a plurality of music files;
extracting voiceprint information from the audio and video files, and labeling it with the sounding object;
and storing the voiceprint information of the sounding object and its label in a voiceprint library.
In one embodiment, the attribute information includes: one or more of singer name, album, lyrics, accompaniment, musical instrument, song title, song style;
the image content includes: a music video (MV), singer pictures, and/or digital animations of various styles.
In one embodiment, searching the image content matched with the music file in a material library according to the attribute information and a preset matching rule includes:
performing an attribute-information ID search in a material library according to the attribute information to obtain an MV candidate set containing a plurality of MVs, wherein the material library stores a plurality of MVs and the attribute-information IDs correspondingly labeling the MVs;
and performing similarity calculation between the MV lyric attribute information of the MV candidate set and the lyric text of the music file, and determining the matched MV according to the similarity.
In one embodiment, obtaining lyric information of the music file comprises:
searching attribute information ID in a song library according to the attribute information to obtain first lyrics of the music file;
performing speech recognition on the human-voice information to obtain second lyrics of the music file;
obtaining third lyrics of the music file from an open lyric website in a network crawling manner;
and correcting the first lyrics and the third lyrics according to the refrain position information determined from the second lyrics, to obtain the lyric information of the music file.
In one embodiment, generating a personalized song file according to the lyric information, the accompaniment information and the image content comprises:
adding the accompaniment information of the music file with the MV as the background, and adding the lyric information in accordance with the accompaniment information;
and combining the MV, the accompaniment information and the lyric information to obtain a personalized MV file.
In one embodiment, the method for generating a personalized song file further includes: storing the human-voice information in association with the personalized MV file; acquiring the singing voice recorded while a user sings with the personalized MV file, comparing the singing voice against the human-voice information, and scoring the user according to the comparison similarity.
In one embodiment, the method for generating a personalized song file further includes:
generating a corresponding staff according to the accompaniment information, marking the staff with an ID and storing it in a personalized song library;
and, when the personalized song file is played, aligning the staff with the lyric information and displaying it synchronously with the personalized MV file in a semi-transparent form, or constructing, from the notes and rhythm of the staff, a rendered animation matching the chord progression of the staff and displaying the lyric information fused with the animation.
A music performance device comprising: the system comprises a main board, and a sound system and a display device which are respectively connected with the main board; the main board is also connected with a microphone;
the mainboard is used for executing the steps of the personalized song file generation method;
the sound system is used for playing audio data;
the microphone is used for picking up singing voice of a user;
the display device is used for displaying image content when a song file is sung.
A computer apparatus, comprising:
one or more processors;
a memory;
one or more applications, wherein the one or more applications are stored in the memory and configured to be executed by the one or more processors to perform the personalized song file generation method described above.
A computer-readable storage medium having stored thereon at least one instruction, at least one program, a set of codes, or a set of instructions that is loaded by a processor to execute the personalized song file generation method described above.
The personalized song file generation method and apparatus, the music performance device, the computer device and the computer-readable storage medium acquire a music file from an external sound source; separate the vocals and accompaniment of the music file to obtain corresponding human-voice information and accompaniment information containing various instruments; search a material library for image content matching the music file according to the attribute information of the music file and preset matching rules; and generate a personalized song file from the lyric information, the accompaniment information and the image content. This technical scheme establishes an intelligent song-file generation process: the terminal device can accept music files from any sound source and produce a personalized song file to the user's liking, so that users can freely and conveniently make personalized songs, their personalized needs are met, and their application experience is improved.
Additional aspects and advantages of the present application will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the present application.
Detailed Description
Reference will now be made in detail to embodiments of the present application, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the drawings are exemplary only for the purpose of explaining the present application and are not to be construed as limiting the present application.
As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, or operations, but do not preclude the presence or addition of one or more other features, integers, steps, operations, or groups thereof.
The personalized song file generation scheme can be applied to a terminal device, which may be a speaker device, KTV equipment, a smartphone, a tablet, a personal computer, or the like. The terminal device can be connected to a background server through a network, and can be connected to sound sources 1-n (n ≥ 1) through WIFI, Bluetooth or a data network; a sound source may be a smartphone, a tablet, a personal computer, a storage medium, a network device, or the like. The personalized song file generation method provided by the application can be deployed on the terminal device in software form, deployed on the server, or split between the terminal device and the server by functional module; as shown in fig. 1, fig. 1 is an exemplary network topology of the personalized song file generation scheme. The technical scheme of the application provides users with a convenient, intelligent personalized-song-file generation function: a personalized song library can be established for each user, who can store the music files transmitted during song-file generation, as well as the recognized vocal files, accompaniment files and the like, into the library for continued use; meanwhile, a user can share the song files of his or her personalized song library with other users.
With reference to fig. 1 as an example, the following embodiments will be described by taking an example in which a user transmits a piece of music from a smartphone to a terminal device and generates a personalized MV file; referring to fig. 2, fig. 2 is a flowchart of a personalized song file generation method, which mainly includes the following steps:
step S10: and acquiring a music file from an external sound source, preprocessing the music file and storing the preprocessed music file.
In this step, the user can transmit the music file of the external sound source to the terminal device, and then the terminal device processes the music file and stores the processed music file into the personalized song library.
Based on the technical scheme of the application, a personalized song library management system can be constructed on the terminal device to accommodate diversified music signal sources. A user can freely add music files to the song library for intelligent creation and singing, choosing from multiple input sources, such as Bluetooth, local transmission (CD, Blu-ray, DVD, USB disk, etc.), a network disk, WIFI (MIRACAST/AIRPLAY/DLNA), coaxial or optical fiber, and FM, according to personal preference.
After the music file is transmitted, the terminal device may obtain attribute information of the music file, such as encoding format, bit rate, sampling rate, bit depth, channel count, song title, singer name, lyrics, song data, file size, song duration and song style, and simultaneously determine whether the quality of the music file is lossy (below 320 kbps, e.g., MP3, WMA, OGG) or lossless (WAVE, FLAC, AIFF, APE, WAV, WAVPACK, LPAC, TTK, etc.).
In one embodiment, in order to improve the transmission efficiency, the music file may be transmitted from a plurality of sound sources simultaneously, and accordingly, the method for obtaining the music file from the external sound source may include:
s101, receiving audio data packets from at least one audio source; and the audio data packet is encapsulated based on a preset private protocol.
Specifically, taking Bluetooth as an example: because Bluetooth is a standard communication protocol, in order to transmit music files better, the scheme of this embodiment may layer a private protocol on top of the Bluetooth protocol for music-file data. The smartphone end uses its transmission module to encapsulate the audio data of the music file based on the private protocol, and then transmits it to the terminal device over Bluetooth.
s102, parsing the audio data packets according to the private protocol to obtain audio data; specifically, after receiving the audio data packets, the terminal device parses them using the private protocol to obtain the audio data.
And s103, acquiring a music file according to the audio data.
According to the technical scheme of this embodiment, different sound sources, such as a smartphone, a tablet computer or a personal computer, can transmit music files to the terminal device; a user can even transmit a piece of independently recorded audio data, and thus generate personalized song files through a convenient operation.
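As an illustration of the encapsulation and parsing in steps s101-s103, the following is a minimal sketch in Python. The packet layout (magic word, type byte, length field, checksum) is an assumption made purely for illustration; the actual private protocol of this application is not disclosed here.

```python
import struct

# Hypothetical private-protocol layout (an illustrative assumption):
# 2-byte magic | 1-byte packet type | 4-byte payload length | payload | 1-byte checksum
MAGIC = 0xA55A

def pack_audio_packet(payload: bytes, pkt_type: int = 1) -> bytes:
    """Encapsulate raw audio data into a private-protocol packet (s101, sender side)."""
    header = struct.pack(">HBI", MAGIC, pkt_type, len(payload))
    checksum = (sum(header) + sum(payload)) & 0xFF  # simple 8-bit checksum
    return header + payload + bytes([checksum])

def parse_audio_packet(packet: bytes) -> bytes:
    """Parse a private-protocol packet and return the audio payload (s102)."""
    magic, _pkt_type, length = struct.unpack(">HBI", packet[:7])
    if magic != MAGIC:
        raise ValueError("not a private-protocol packet")
    payload = packet[7:7 + length]
    if (sum(packet[:7]) + sum(payload)) & 0xFF != packet[7 + length]:
        raise ValueError("checksum mismatch")
    return payload

# Round trip: the terminal device would append each parsed payload to a music file (s103).
chunk = parse_audio_packet(pack_audio_packet(b"\x00\x01" * 512))
assert chunk == b"\x00\x01" * 512
```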
As an embodiment, the receiving audio data packets from at least one audio source in step s101 includes:
Step a, establishing communication connections with a plurality of music data sources; specifically, communication connections between multiple users' smartphones and the terminal device are established through Bluetooth or WiFi transmission.
And b, receiving the audio data packet which is sent by each equipment terminal and is packaged based on the private protocol by adopting a time division working mode.
In the transmission process, a time-division working mode is adopted to carry data from multiple music data sources, so that audio data can be received while subsequent processing proceeds, improving processing efficiency and fluency.
And c, analyzing each audio data packet according to the private protocol to obtain a plurality of music files.
Specifically, after receiving the audio data packets, the terminal device parses each packet to obtain a plurality of music files and stores them in a cache, from which the subsequent processing flow reads and processes them in real time.
In an embodiment, the above technical solution for preprocessing the music file may adopt any of the following schemes:
the first scheme is as follows: respectively carrying out parallel operation on each music file to obtain attribute information of the music file; wherein the attribute parameters comprise sampling rate, format and channel bit number; and according to the attribute information, whether the music file is a high-sound-quality music file or not is judged, and if not, the sound quality of the music file is remodeled to obtain high-sound-quality music.
The second scheme: extracting, from a plurality of music files, the accompaniment contents played by different instruments for the same piece of music, and combining the accompaniment contents to obtain one accompaniment file.
Illustratively, multi-instrument synthesis of music files transmitted from multiple signal sources may proceed as follows:
(1) Judging, through singer-name, song-title and song-similarity calculation, that multiple signal sources have transmitted the same piece of music; for example, signal source 1 transmits <Blue and White Porcelain - piano>, signal source 2 transmits <Blue and White Porcelain - cello>, and signal source 3 transmits <Blue and White Porcelain - guzheng>.
(2) Separating the vocals and accompaniment of each music file to obtain the corresponding instrument accompaniments, e.g., the piano accompaniment from signal source 1, the cello accompaniment from signal source 2 and the guzheng accompaniment from signal source 3.
(3) Calculating paragraph similarity on the vocal parts by dynamic programming (DP) to obtain the song-paragraph deviation of each piece of music, and selecting the highest-quality file among the 3 music files as the main synthesis object; the deviation is calculated for each paragraph and used to correct the accompaniment from the corresponding signal source.
(4) Mixing the main synthesis object with the music files of the other signal sources, and passing the synthesized file through a low-pass filter to reduce high-frequency noise generated during synthesis, finally obtaining a piano-cello-guzheng ensemble version of Blue and White Porcelain.
According to this technical scheme, in order to obtain a high-quality accompaniment, accompaniment contents played by different instruments for the same piece of music can be transmitted to the terminal device from multiple network users, and all the accompaniment contents are then combined into one accompaniment file, enabling a network ensemble function; a sketch of the merge step follows.
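A simplified sketch of step (4), mixing time-aligned instrument accompaniments and low-pass filtering the result with SciPy; the cutoff frequency and filter order are illustrative assumptions.

```python
import numpy as np
from scipy.signal import butter, filtfilt

def merge_accompaniments(tracks, sr=44100, cutoff_hz=16000.0):
    """Mix instrument accompaniments of the same song and low-pass the result.

    `tracks` are mono float arrays assumed already time-aligned (in the full
    scheme, the per-paragraph deviations from the vocal DP alignment would be
    applied first).
    """
    n = min(len(t) for t in tracks)
    mix = np.sum([t[:n] for t in tracks], axis=0) / len(tracks)  # average to avoid clipping
    b, a = butter(4, cutoff_hz / (sr / 2), btype="low")          # 4th-order Butterworth
    return filtfilt(b, a, mix)                                   # suppress synthesis noise above cutoff

# e.g., piano, cello and guzheng renditions of the same song (stand-in signals):
piano, cello, zheng = (np.random.randn(44100) * 0.1 for _ in range(3))
ensemble = merge_accompaniments([piano, cello, zheng])
```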
The third scheme: extracting a music climax fragment from each of the music files, and splicing the climax fragments into one medley file.
For example, if the music files from multiple signal sources are to be made into a song medley, the steps may be as follows:
(1) To better link the interesting part of each piece of music, the medley types are divided into verse medleys and chorus (refrain) medleys.
(2) The pure vocal part and the accompaniment part are first obtained by vocal separation; then, combining the vocal part with melody analysis, a hidden Markov model (HMM) with chroma features and Viterbi decoding is used to obtain the verse and chorus sections of the corresponding music structure, and the similarity between sections is calculated through the HMM to accurately obtain the boundary of each section.
(3) The rhythm, melody contour curve and emotion of each section are calculated from the accompaniment information to judge the verse and chorus parts, locating them through this multi-faceted judgment.
(4) After the verse and chorus sections of each song are extracted, they are merged and spliced into medleys; depending on whether verse medleys or chorus medleys are selected, the junction between every two songs uses a power-exponent fade-in/fade-out algorithm to keep the volume transition natural, achieving consistent loudness across the medley (see the crossfade sketch below).
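The power-exponent fade-in/fade-out at each junction can be sketched as follows; the overlap length and exponent p are illustrative assumptions.

```python
import numpy as np

def crossfade(clip_a, clip_b, sr=44100, overlap_s=2.0, p=2.0):
    """Splice two climax clips with a power-exponent fade-out/fade-in.

    The fade curves (1-t)**p and t**p (p is the power exponent) keep the
    transition's loudness natural, as in the medley scheme above.
    """
    n = int(sr * overlap_s)
    t = np.linspace(0.0, 1.0, n)
    fade_out, fade_in = (1.0 - t) ** p, t ** p
    overlap = clip_a[-n:] * fade_out + clip_b[:n] * fade_in
    return np.concatenate([clip_a[:-n], overlap, clip_b[n:]])

chorus_1 = np.random.randn(10 * 44100) * 0.1  # stand-ins for extracted climax fragments
chorus_2 = np.random.randn(10 * 44100) * 0.1
medley = crossfade(chorus_1, chorus_2)
```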
In the scheme of this embodiment, multiple users can each transmit a piece of music to the terminal device through a smartphone, and the terminal device extracts the climax fragments and splices them into one medley file, improving the users' application experience. After transmission, the music file can be saved in a sound-quality format (dts, wav, APE, FLAC, mp3, aac, mp4, avi, mkv, mpg audio/video MTV music, etc.), both locally and on the server side, and the terminal device can then proceed in one of three ways: first, if the user directly plays the music file to an effector, a speaker device connected to a power amplifier, or headphones, the processed music file is played; second, where the performance of the terminal device allows, the file is stored locally and processed in real time on the device; third, the terminal device uploads the file to the server, and the server drives the terminal device to execute the audio processing operations.
Step S20: and separating the voice and the accompaniment of the music file to obtain corresponding voice information and accompaniment information containing various musical instruments.
In this step, the terminal device performs voice and accompaniment separation on the music file to obtain voice information and accompaniment information corresponding to various musical instruments in the music file.
The separation mode can be selected according to the quality of the music file: for a lossless file, vocal-separation technology can directly extract the human-voice information, accompaniment information, instrument track information and the like; for a lossy file, sound-quality enhancement technology can be used to restore it, as far as possible, to high-quality accompaniment music.
As an embodiment, in order to enrich the user's personalized song library, when a user transmits a song to the terminal device via Bluetooth, the terminal device wakes up the vocal-separation function and adopts a multi-track separation technology based on blind source separation, separating out each singer's vocal part and the accompaniment part with the help of the instrument timbre library, and further separating each instrument into an independent part, which facilitates the user's subsequent personalized creation.
Referring to fig. 3, fig. 3 is a schematic view illustrating a process of separating human voice from accompaniment; accordingly, the technical solution for separating the voice and the accompaniment of the music file may include the following steps:
s211, performing STFT (Short-Time Fourier Transform) spectrum analysis on the input music file, and performing logarithmic Mel spectrum conversion to obtain a Mel spectrum.
Illustratively, an STFT spectrum analysis is performed on the input music signal, and a logarithmic mel spectrum conversion is performed to obtain the same speech signal characteristics as those obtained during model training.
s212, analyzing the Mel spectrum by using the pre-trained human-voice network, accompaniment network and musical-instrument network to obtain corresponding spectrograms; calculating, from the spectrograms, the proportion of the human-voice spectrum and the proportion of the accompaniment spectrum in the whole music spectrum; and multiplying each proportion by the music spectrum to obtain the human-voice spectrum and the accompaniment spectrum.
Illustratively, the Mel spectrum is fed to the human-voice network, the accompaniment network and the musical-instrument network respectively to obtain the corresponding spectrograms; the proportion occupied by the human-voice spectrum in the whole music spectrum and the proportion occupied by the accompaniment are calculated from this spectral information, and multiplying each proportion by the music spectrum yields the human-voice spectrum and the accompaniment spectrum.
s213, performing ISTFT (inverse short-time Fourier transform) analysis on the human-voice spectrum and the accompaniment spectrum, converting them into human-voice information and accompaniment information respectively.
Illustratively, the human-voice spectrum and the accompaniment spectrum are converted back into a human-voice signal and an accompaniment signal by ISTFT, and saved in corresponding formats such as wav, providing accompaniment and vocal material for user creation.
s214, calculating the music signal spectrum of each instrument in the accompaniment information, and storing it as a single-instrument accompaniment.
Illustratively, in a manner similar to the above, each music signal spectrum present in the accompaniment is calculated and saved as a file in a corresponding format such as wav or midi, providing a separate instrumental accompaniment for the user's song creation.
This vocal-separation technology can separate multiple vocal tracks, a pure accompaniment track and the track information of each instrument; a user can design a multi-person chorus mode or select a combination of certain instruments, achieving diversified creation. A sketch of the mask-based separation pipeline follows.
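A minimal sketch of steps s211-s214 using librosa, with the pre-trained networks stubbed as callables that map a log-Mel spectrogram to a magnitude ratio (mask) on the STFT grid; the network architecture itself is not specified here.

```python
import numpy as np
import librosa

def separate(y, sr, vocal_net, accomp_net):
    """Mask-based vocal/accompaniment split following steps s211-s214.

    `vocal_net` / `accomp_net` stand in for the pre-trained networks; each is
    assumed to return a nonnegative ratio mask shaped like the STFT matrix.
    """
    stft = librosa.stft(y, n_fft=2048, hop_length=512)            # s211: STFT analysis
    mel = librosa.feature.melspectrogram(S=np.abs(stft) ** 2, sr=sr)
    log_mel = librosa.power_to_db(mel)                            # s211: log-Mel conversion

    v_ratio = vocal_net(log_mel)                                  # s212: proportion of vocals
    a_ratio = accomp_net(log_mel)                                 # s212: proportion of accompaniment
    total = v_ratio + a_ratio + 1e-8
    vocal_spec = stft * (v_ratio / total)                         # s212: proportion x music spectrum
    accomp_spec = stft * (a_ratio / total)

    vocals = librosa.istft(vocal_spec, hop_length=512)            # s213: ISTFT back to signals
    accomp = librosa.istft(accomp_spec, hop_length=512)
    return vocals, accomp                                         # s214 repeats this per instrument
```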
For the pre-trained human-voice, accompaniment and musical-instrument networks, the existing song files in the song library can be used for model training: the human voices, the accompaniments and the corresponding instrument timbre library in the song library are used to construct the human-voice network, the accompaniment network and the musical-instrument network respectively (supporting the performance of 128 instrument sounds).
In one embodiment, it can be trained using the following method, including the steps of:
(1) Respectively constructing a human-voice network, an accompaniment network and a musical-instrument network by using the human voices, the accompaniments and the corresponding instrument timbre library prestored in the song library.
Specifically, STFT analysis is carried out on four types of material: music with accompaniment, pure human voice, pure accompaniment and pure instrument timbre. Because the human voice exhibits its characteristics better in the Mel bands, each type of material is converted by a logarithmic Mel filter bank to highlight the voice features, reducing the loss incurred in vocal separation.
(2) STFT analysis and logarithmic Mel spectrum transformation are respectively carried out on the four types of material: music with accompaniment, pure human voice, pure accompaniment and pure instrument timbre.
Wavelet analysis is also carried out on the four types of material (music with accompaniment, pure human voice, pure accompaniment and pure instrument timbre) to compensate for the deficiency of the STFT in analyzing non-stationary signals, filter out irrelevant noise, and enhance the fundamental-frequency sparsity of each type, improving voice feature extraction and the quality of sound restoration.
ADSR analysis is performed on the instrument timbre library. Each instrument in the timbre library has a different sounding envelope and hence a different timbre; the timbre resides mainly in the ADSR (attack, decay, sustain, release) envelope, which forms its principal components. These features can also be used to analyze the instrument types in the accompaniment and to further separate the instruments, enabling users to create songs with personalized instrument backgrounds. The envelope extraction formula is as follows:
where env represents the envelope, framelen represents the frame length, and x_i represents the i-th sample of the time series.
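Since the formula body is not reproduced here, the following sketch assumes one common reading of env over framelen samples, namely a frame-wise RMS envelope; treat it as an assumption, not the patented formula.

```python
import numpy as np

def frame_envelope(x, framelen=1024, hop=512):
    """Frame-wise RMS envelope of a time series x, one value per frame.

    Assumed reading of the env/framelen/x_i formula: the envelope of a frame
    is the root-mean-square of its framelen samples.
    """
    n_frames = 1 + max(0, (len(x) - framelen) // hop)
    return np.array([
        np.sqrt(np.mean(x[i * hop:i * hop + framelen] ** 2))
        for i in range(n_frames)
    ])

# ADSR segments can then be read off the envelope: the rise to the peak is the
# attack, the fall to a plateau the decay, the plateau the sustain, the tail the release.
```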
(3) Carrying out blind source separation on the music and inputting the results into the human-voice network, the accompaniment network and the musical-instrument network respectively, to obtain a pure-human-voice spectrogram, a pure-accompaniment spectrogram and a pure-instrument-timbre spectrogram.
(4) Comparing the pure-human-voice, pure-accompaniment and pure-instrument-timbre spectrograms with the pure-human-voice track, pure-accompaniment track and pure-instrument track amplitude spectra respectively; calculating their similarity using the Manhattan distance; obtaining a loss function by taking the mean; and optimizing and adjusting the blind-source-separation parameters accordingly.
Specifically, the spectrograms of the music with accompaniment are compared with the pure-human-voice, pure-accompaniment and pure-instrument-timbre spectrograms extracted by the blind-separation network; the Manhattan distance is used to calculate the similarity between them, the mean is taken as the loss function, and the blind-separation parameters are continuously optimized and adjusted:
Distance = Σ|x_i − y_i|
mean = 0.5 × (Σ(a_i × Distance_i) + Σ(a_j × Distance_j))
where mean represents the loss function and Distance represents the Manhattan distance.
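A direct transcription of these two formulas in Python, with the per-term weights a_k treated as one list for simplicity (the text's two index sets i and j are collapsed here):

```python
import numpy as np

def manhattan(x, y):
    """Distance = sum |x_i - y_i| between two magnitude spectra."""
    return np.sum(np.abs(x - y))

def separation_loss(sep_specs, ref_specs, weights=None):
    """mean = 0.5 * sum_k a_k * Distance_k over the separated/reference pairs
    (e.g., vocals, accompaniment, instrument timbre)."""
    weights = weights if weights is not None else [1.0] * len(sep_specs)
    dists = [manhattan(s, r) for s, r in zip(sep_specs, ref_specs)]
    return 0.5 * sum(a * d for a, d in zip(weights, dists))
```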
As another embodiment, in order to obtain high-definition accompaniment music, for music files acquired from various sound sources whose quality is lossy, high-definition sound-quality reshaping may be performed, with the result automatically stored in the user's personalized song library.
Referring to fig. 4, fig. 4 is a flowchart of reshaping a lossy-quality music file; accordingly, the technical solution for reshaping the sound quality of the music file to obtain high-sound-quality music may include the following steps:
s221, the music file is separated into vocal information and initial accompaniment information.
s222, identifying the instruments in the initial accompaniment information.
s223, performing timbre compensation on each instrument in the initial accompaniment information by using the high-quality timbres of the corresponding instruments recorded in the pre-established instrument timbre library, to obtain high-definition accompaniment information.
The accompaniment-reshaping technology for lossy music can apply, to lossy-quality music files, original-accompaniment instrument reconstruction, multi-style instrument reconstruction, and EQ-compensation reconstruction of the original accompaniment instruments; wavetable synthesis can be used for the first two, and an automatic EQ compensation technique for the third.
Illustratively, a corresponding midi file can be generated from features of the whole spectrum of the lossy music, such as pitch, rhythm/tempo, notes and note durations; the most similar instrument is then looked up in the instrument timbre library to regenerate the high-definition accompaniment music, and the song's audio segments are compensated by automatic equalization according to the instrument information in the song.
For example, if the instrument is a bass, the playing of the song is simulated by combining the song's rhythm, note and pitch information to automatically generate the corresponding accompaniment; the accompaniment is restored to the original music by recombining it with the vocals, and files in wav, flac, ape, wave, aiff and similar formats are generated at 24-bit depth with a 96 kHz or 192 kHz sampling rate.
Illustratively, each musical accompaniment is a combination of one or more instruments. A corresponding frequency-response curve can be formed by series combination from one or more high-quality instrument timbre libraries; the missing timbre components are identified by comparing frequency-response curves, EQ compensation is applied at the corresponding frequencies, and a genetic algorithm computes the optimal Q value, frequency value f and number of EQ bands, thereby reshaping the song accompaniment. A simplified sketch of the band-matching idea follows.
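A crude sketch of the band-matching idea, substituting a simple FFT band-gain match for the genetic-algorithm search over Q, f and band count described above; the band count is an illustrative assumption.

```python
import numpy as np

def eq_compensate(signal, reference, n_bands=10):
    """Crude automatic EQ: match the lossy accompaniment's band energies to a
    high-quality timbre-library reference (a stand-in for the genetic-algorithm
    search over Q, f and the number of EQ bands)."""
    spec = np.fft.rfft(signal)
    ref_spec = np.fft.rfft(reference, n=len(signal))
    edges = np.linspace(0, len(spec), n_bands + 1, dtype=int)
    for lo, hi in zip(edges[:-1], edges[1:]):
        e_sig = np.sqrt(np.mean(np.abs(spec[lo:hi]) ** 2) + 1e-12)
        e_ref = np.sqrt(np.mean(np.abs(ref_spec[lo:hi]) ** 2) + 1e-12)
        spec[lo:hi] *= e_ref / e_sig   # boost/cut the band toward the reference
    return np.fft.irfft(spec, n=len(signal))
```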
In addition, according to the instrument information in the song (pitch, rhythm, note duration) and the instrument timbre library (e.g., flute, drum kit, bass, guitar, piano), accompaniment music in various styles (rock, country, lyrical) can be provided and stored as high-definition 24-bit, 192 kHz audio files; the user may also choose to apply no signal processing and output the original sound quality.
All three types of reshaped accompaniment information can be stored in the local and server-side personalized song libraries, providing optimal recommendations for the user's next use.
As another embodiment, in order to separate the vocal sounds and the accompaniment more quickly and conveniently, the above technical solution for separating the vocal sounds and the accompaniment from the music file may include the following steps:
s231, acquiring other sound files of the sounding object of the human voice in the music file.
s232, extracting the voiceprint information of the sounding object from the other sound files.
s233, generating a human-voice filter from the voiceprint information, filtering the music file with the filter to obtain the human-voice information, and extracting the accompaniment information.
For the voiceprint information, in one embodiment, it may be collected in advance and saved to a voiceprint library for later retrieval; the method further comprises:
acquiring audio and video files of the sounding objects of a plurality of music files; extracting voiceprint information from the audio and video files and labeling it with the sounding object; and storing the voiceprint information of each sounding object and its label in the voiceprint library. A sketch of this enrollment flow follows.
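A minimal sketch of the enrollment flow, using an MFCC-statistics vector as a stand-in voiceprint (a real system would use a dedicated speaker-embedding model); the file names are hypothetical.

```python
import numpy as np
import librosa

voiceprint_library = {}  # label -> voiceprint embedding

def extract_voiceprint(y, sr):
    """A minimal voiceprint: mean and std of MFCCs over the recording.

    This only illustrates the store-and-label flow, not the actual
    voiceprint features used by the application.
    """
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20)
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])

def enroll(label, audio_files, sr=16000):
    """Extract voiceprints from a sounding object's files and label them."""
    prints = [extract_voiceprint(librosa.load(f, sr=sr)[0], sr) for f in audio_files]
    voiceprint_library[label] = np.mean(prints, axis=0)

# enroll("singer_A", ["interview.wav", "live_take.wav"])  # hypothetical files
```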
In the solution of the foregoing embodiment, if the performance of the terminal device allows, real-time vocal/accompaniment separation is performed locally on the terminal device, and the separated original vocal track (vocal id), accompaniment track (accompaniment id) and instrument track ids are each stored in the terminal device's local song library, providing the user with switching between original vocals and accompaniment; the corresponding song information, such as singer name (singer id), lyrics (lyrics id) and file length (filelen), is stored as well. The original vocals and the accompaniment can be stored as a single track or dual tracks according to the user's needs, keeping both a mono and a stereo original-plus-accompaniment track.
Step S30: acquiring the attribute information of the music file, and searching a material library for image content matching the music file according to the attribute information and a preset matching rule.
In this step, the attribute information may be the singer name, album, lyrics, accompaniment, instruments, song title, song style, and so on; this attribute information may be extracted from the music file itself, or obtained by separating the music file into human-voice information and accompaniment information. The matched image content may be an MV, singer pictures, digital animations of various styles, and the like.
As an embodiment, when matching an MV, in order to improve matching efficiency, refer to fig. 5, a schematic flowchart of MV matching. For the matching process: first, the singer name/song title/album id acquired from the music file can be used to directly search for and match the corresponding singer MV or album; second, lyric information can be obtained from the music file and MV-lyric similarity calculated, matching the corresponding MV with a neural-network MV matching algorithm; moreover, feature information such as tone, emotion or rhythm can be computed by feature analysis of the music file, and the Euclidean distances between this feature information and the MVs calculated to match the corresponding MV.
According to the scheme of this embodiment, the song type/lyrics are automatically searched and matched to MVs of the corresponding style, recommending the most suitable MV to the user. The MV library is first coarsely searched by ID using the singer, song title, album, style and similar information of the music file; if no corresponding ID exists, MV files matching the emotion and rhythm are coarsely selected according to the musical features (e.g., tone, emotion, rhythm) of the transmitted file. MV-lyric similarity is then calculated from the acquired lyrics to identify the best-matching MV, meeting the user's personalized needs, improving the MV matching degree, greatly reducing the difficulty of creation and increasing the user's satisfaction.
For example, the method for calculating the similarity of the lyrics may include the following steps:
s301, searching the material library by attribute-information ID according to the attribute information to obtain an MV candidate set containing a plurality of MVs; the material library stores a plurality of MVs and the attribute-information IDs correspondingly labeling the MVs.
s302, performing similarity calculation between the MV lyric attribute information of the MV candidate set and the lyric text of the music file, and determining the matched MV according to the similarity. A minimal sketch of this step follows.
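One simple realization of s302, assuming plain lyric text and TF-IDF cosine similarity (the application also mentions a neural matching algorithm, not shown here); the candidate names and lyric snippets are hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def match_mv(song_lyrics: str, mv_candidates: dict) -> str:
    """Pick the MV whose lyric attribute text is most similar to the song's lyrics."""
    names = list(mv_candidates)
    docs = [song_lyrics] + [mv_candidates[n] for n in names]
    tfidf = TfidfVectorizer().fit_transform(docs)
    sims = cosine_similarity(tfidf[0:1], tfidf[1:]).ravel()
    return names[int(sims.argmax())]

candidates = {"mv_001": "rain falls on the porcelain town ...",
              "mv_002": "city lights and late night drives ..."}
best = match_mv("rain falls on the blue and white porcelain ...", candidates)
```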
In addition, based on the scheme of the above embodiment, an MV matching algorithm based on a convolutional neural network, an ID-based fast search method, and a music-feature analysis method can be designed to perform the MV matching process jointly, avoiding the dilemma of the overall model falling into a local optimum; accordingly, the matching process may be as follows:
(1) First, coarse ID matching is performed to find the most likely MVs, reducing the complexity of algorithm matching.
Features are extracted from the MVs stored in the MV song library, and the high-dimensional semantics of the music lyrics are reduced to low-dimensional semantics; the deviation distance between a single piece of music and the whole MV set is determined using the dispersion-average sum of the extracted features, and the corresponding posterior probability is obtained from the distance (refer to fig. 6, a schematic diagram of the deep-neural-network MV matching algorithm). The higher the probability, the higher the matching degree, allowing the MVs to be coarsely selected. The calculation process is as follows:
L_1 = W_1 · X
L_i = f(W_i · L_{i−1} + b_i), i = 2, …, N−1
y = f(W_N · L_{N−1} + b_N)
where W_i represents the weight matrix of the i-th layer and b_i the bias term of the i-th layer.
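A NumPy transcription of these layer formulas, with illustrative layer sizes; the actual network dimensions and activation f are not specified in the text, so tanh is assumed here.

```python
import numpy as np

def forward(X, weights, biases, f=np.tanh):
    """Layer-wise pass matching the formulas above:
    L_1 = W_1 X;  L_i = f(W_i L_{i-1} + b_i);  y = f(W_N L_{N-1} + b_N)."""
    L = weights[0] @ X                       # first layer: no bias, per the text
    for W, b in zip(weights[1:], biases[1:]):
        L = f(W @ L + b)                     # hidden layers and output
    return L

rng = np.random.default_rng(0)
dims = [64, 32, 16, 1]                       # input -> hidden -> hidden -> posterior score
Ws = [rng.standard_normal((dims[i + 1], dims[i])) * 0.1 for i in range(3)]
bs = [np.zeros((d, 1)) for d in dims[1:]]
posterior = forward(rng.standard_normal((64, 1)), Ws, bs)
```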
(2) If no ID corresponds to an MV, a local search is performed according to the music feature information (tone, emotion and rhythm analysis).
Using the information ids of all MVs in the song library (as shown in fig. 7, a schematic diagram of MV information ids), the final MV is locked through a reverse search of song title → singer name → lyrics, using the lyrics id, album id, singer-name id and instrument id obtained after vocal separation.
(3) According to the specific lyric feature information of each MV, text similarity with the music lyrics is calculated, and the MV with the highest similarity is recommended as the optimal MV. Beat points are obtained through a Markov model using a multi-feature analysis method (hfc, rms, melflux, infogain) and converted into BPM (beats per minute); the cosine similarity of each MV is calculated; harmonic-component analysis and envelope calculation are combined with the ADSR of each instrument's timbre for further analysis; and the emotion of the music (sadness, joy, cheerfulness, anger, etc.) is predicted with a trained emotion model.
This method can provide an MV closer to the user's self-created music, effectively balance the limitations of each algorithm so that the search escapes local optima and finds the optimal MV, greatly improve MV matching accuracy, raise the user's enthusiasm for creation, and reduce algorithm complexity.
Step S40: acquiring the lyric information of the music file, and generating a personalized song file according to the lyric information, the accompaniment information and the image content.
Specifically, both self-contained lyrics and lyric identification/acquisition can be used to improve the precision of the output lyric information. If the music file received by the terminal device carries its own lyric information, those lyrics are selected directly as the MV lyric id during pairing and stored in the personalized lyric library, ensuring the accuracy and personalization of each user's music-file lyrics and the integrity of the user's self-created songs.
As an embodiment, if the music file received by the terminal device contains no lyric information of its own, the lyrics are acquired jointly from several sources to improve output accuracy; refer to fig. 8, a flowchart of outputting lyric information. Accordingly, the technical solution for acquiring the lyric information of the music file may include the following steps:
s401, searching a lyric library by attribute-information ID according to the attribute information to obtain first lyrics of the music file.
Specifically, a lyric-ID search is performed first as a coarse lyric search: the lyric library supplies the lyrics corresponding to song information such as song title, singer and album, which reduces algorithm complexity.
s402, performing speech recognition on the human-voice information to obtain second lyrics of the music file; specifically, the corresponding lyrics are obtained by recognizing the separated vocals.
s403, obtaining third lyrics of the music file from public lyric websites by network crawling; specifically, legally published lyric websites are crawled for the corresponding song information (song title, singer, album, etc.), from which the lyrics are obtained.
s404, correcting the first lyrics and the third lyrics according to the refrain position information determined from the second lyrics, to obtain the lyric information of the music file.
Because repeated-position errors in the refrain easily arise during lyric acquisition, the erroneous lyrics are corrected by a lyric-text recognition method: passages whose similarity to the speech-recognition result is low are identified as errors and corrected, improving lyric accuracy. A sketch of this correction step follows.
In order to let the user play along with the staff over the personalized MV file, or to provide a staff-learning experience, in an embodiment the technical solution of the present application may further include a staff display function; accordingly, in step S40, the personalized song file generation method of the present application may further include:
generating a corresponding staff according to the accompaniment information, marking the staff with an ID and storing it in the personalized song library; when the personalized song file is played and sung, the staff can be displayed at the same time.
For example, the display mode of the staff may include the following:
the first method is as follows: and aligning the staff with the lyric information and synchronously displaying the staff and the personalized MV file in a semitransparent form.
According to the scheme of this embodiment, the corresponding staff is generated in real time from the accompaniment, and the staff id is stored independently in the local song library. When the user plays the song, the staff is aligned with the lyrics and displayed synchronously on the large screen in a semi-transparent form, so that the user can see the staff clearly without spoiling the appearance of the MV.
The second method comprises the following steps: and constructing rendering animation matched with the chord trend of the staff and fusing and displaying the lyric information according to the notes and the rhythm of the staff.
Specifically, the terminal device automatically pairs the corresponding MV (MV id) or pictures (photo id) according to the corresponding lyrics, singer or song information; if the current song has no corresponding MV or picture, a digital dynamic rendering scene (Dynamic scene id) is automatically generated from the song's rhythm information and lyric information, combined with the chord progression of the staff to create a more dazzling dynamic scene, and stored independently in the local song library.
According to the scheme of this embodiment, for a personalized MV file without a corresponding MV, a rendered animation matching the chord progression of the staff is fused with the lyric animation according to the notes and rhythm of the staff, creating a dazzling animated scene. The staff also supports later midi creation: a user can improvise with a midi keyboard over the existing staff, and the self-built timbre library offers hundreds of instruments for multi-instrument ensemble playing.
In one embodiment, for the staff generation method, refer to fig. 9, a schematic flowchart of staff generation. Through audio-signal feature analysis (notes, note durations, speed, beat, key signature, etc.), a corresponding midi file can be compiled according to the midi file format and stored on the server. From the audio-signal features, a beat-recognition algorithm yields the corresponding speed, and a fundamental-pitch discrimination method (fundamental frequency via STFT) yields the corresponding notes and note durations; for the duration, the calculation formula is as follows:
where Y(w, t − jP) is the energy value at the fundamental-frequency point w, from which the duration of the note is determined. All pitches of the full spectrum are placed into a B matrix (also called a chroma matrix) according to the 12 scale degrees (C, C#, D, D#, E, F, F#, G, G#, A, A# and B); if a scale accounts for the maximum number of degrees and the pitches fitting it are comparatively numerous, that scale is taken as the key signature of the song. According to the common time signatures (4/4, 6/8, 3/8) and the calculated beat duration, one beat is converted to a duration value of 1000, and several beat combinations are allowed: 750+250, 500+250+250, 333+333+333, 333+666, 165+165+165+165 (eighth notes), 165+165+333+333 (sixteenth notes), 165+165+666; if a calculated duration value falls outside this value range, it is mapped into the range. A sketch of the key-signature and duration-mapping steps follows.
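A sketch of the chroma-based key-signature rule and the duration mapping, assuming a major-scale template and the duration values listed above; the template matching is an illustrative simplification of the counting rule.

```python
import numpy as np

NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]
MAJOR_SCALE = np.array([1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1])  # scale-degree template

def estimate_key(B):
    """Pick the key whose major-scale template covers the most mass of the
    12 x frames chroma matrix B, echoing the scale-counting rule above."""
    energy = B.sum(axis=1)                       # total energy per pitch class
    scores = [np.dot(np.roll(MAJOR_SCALE, k), energy) for k in range(12)]
    return NOTE_NAMES[int(np.argmax(scores))] + " major"

def quantize_duration(value, allowed=(1000, 750, 666, 500, 333, 250, 165)):
    """Map a measured note-duration value onto the nearest allowed value,
    per the beat-combination table (1000 = one beat)."""
    return min(allowed, key=lambda v: abs(v - value))
```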
In one embodiment, the process of generating the personalized song file according to the lyric information, the accompaniment information and the image content in step S40 includes the following steps:
adding the accompaniment information of the music file with the MV as the background, and adding the lyric information in accordance with the accompaniment information; then combining the MV, the accompaniment information and the lyric information to obtain a personalized MV file.
In an embodiment, the personalized song file generation method of the present application may further store the human-voice information in association with the personalized MV file. Thus, when the personalized MV file is used for singing, the singing voice recorded against it is acquired, compared with the stored human-voice information, and the user is scored according to the comparison similarity.
For example, after a user generates a personalized MV file and uses it for karaoke entertainment, the voice recorded while the user sings is compared against the human-voice information obtained during vocal/accompaniment separation, improving on the existing scoring system and reflecting more accurately the similarity between the user's voice and the original vocals.
Compared with the conventional standard-voice recognition approach, scoring accuracy is enhanced and the user experience improved. A sketch of such a similarity-based scorer follows.
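A minimal sketch of such a scorer, comparing MFCC sequences of the user's recording and the separated original vocal frame by frame; the MFCC choice and the 0-100 mapping are illustrative assumptions.

```python
import numpy as np
import librosa

def score_singing(user, original, sr=22050):
    """Score a performance against the separated original vocal by comparing
    MFCC sequences frame by frame (a simple stand-in for the scoring system)."""
    n = min(len(user), len(original))
    m_user = librosa.feature.mfcc(y=user[:n], sr=sr, n_mfcc=13)
    m_orig = librosa.feature.mfcc(y=original[:n], sr=sr, n_mfcc=13)
    # cosine similarity per frame, averaged and mapped to a 0-100 score
    num = np.sum(m_user * m_orig, axis=0)
    den = np.linalg.norm(m_user, axis=0) * np.linalg.norm(m_orig, axis=0) + 1e-8
    return float(np.clip(np.mean(num / den), 0, 1) * 100)
```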
Integrating the schemes of the above embodiments, a new personalized song-file library can be designed: a management system combining sound-source input, output, entertainment, editing, learning and creation, creating a private, internet-connected song-library management system dedicated to each person. It receives new melodies through multiple sound sources; it can transmit audio to speakers or headphones in real time; vocal/accompaniment separation can run locally in real time or on the server side, separating accompaniment and voice in real time; MV backgrounds, singer pictures and digital animations of various styles are then matched and rendered automatically and intelligently; and a staff can be generated at the same time and saved automatically to the private personalized song library, providing a visual and auditory experience for advanced players and ordinary users alike.
When a user connects a smartphone to a Bluetooth speaker to play or sing music, generally only the lyrics or MV can be displayed on the phone screen for real-time singing, which makes for a poor experience. With the present scheme, the user can instead create a song file of his or her own, automatically stored in the personalized song library of a private account or on local and network devices (refer to fig. 10, a schematic diagram of the attribute-information structure of a song file); the song file can also be shared with friends.
Referring to FIG. 11, FIG. 11 is a schematic view of a user behavior flow; user behavior may include the following:
(I) When a user initiates a song request, the player calls the song id from the local/server repository and, according to the song id, supplies the corresponding lyrics, accompaniment, scene and staff, integrated into a complete musical work for the user to sing.
(II) The user can initiate sharing: the created music is shared with friends through smartphone operation or the interactive large-screen display. After obtaining the link, friends can request the song, or save it in their own local personalized song libraries to request, play and sing it directly in the future. Shared songs can receive friends' likes and comments, improving the interactive experience between friends.
(III) Real-time staff:
(1) For advanced players, the staff generated from songs can be combined with a midi keyboard according to the player's level; music is created with the instrument timbre library, and the generated music is automatically saved to the personal account.
(2) For ordinary players, playing can be learned through visual and auditory experience, so that the user becomes familiar, more quickly and easily, with the note represented by each key of a midi keyboard, offering ordinary users an easy entry into music learning.
(IV) When the user initiates a storage request, storage can be performed via Bluetooth, WIFI, local transmission (CD, Blu-ray, DVD, USB disk) or a network disk; the terminal device supports re-encoding storage of songs or songs + mv and supports multiple encoding formats (dts, wav, APE, FLAC, mp3, aac, mp4, avi, mkv, mpg audio/video MTV music).
(V) The user can initiate an instrument-learning request: based on the transmitted song, the score of the corresponding instrument can be studied with the instrument itself or a midi keyboard, and the terminal device generates the fingering for the corresponding instrument from the score, synchronizing the song file with the fingering.
An embodiment of the personalized song file generation apparatus is set forth below.
Referring to fig. 12, fig. 12 is a schematic structural diagram of a personalized song file generation apparatus according to an embodiment, including:
the music transmission module 10 is used for acquiring music files from an external sound source, preprocessing the music files and storing the preprocessed music files;
a voice separating module 20, configured to separate voice from accompaniment of the music file to obtain corresponding voice information and accompaniment information including various musical instruments;
the image matching module 30 is configured to obtain attribute information of the music file, and search for image content matched with the music file in a material library according to the attribute information and a preset matching rule;
and the song merging module 40 is configured to acquire lyric information of the music file, and generate a personalized song file according to the lyric information, the accompaniment information, and the image content.
The personalized song file generation apparatus of this embodiment can execute the personalized song file generation method provided in the embodiments of the present disclosure, and its implementation principles are similar: the actions executed by each module of the apparatus correspond to the steps of the method in the embodiments above, and for a detailed functional description of each module, reference may be made to the description of the corresponding method shown above, which is not repeated here.
An embodiment of the music performance apparatus is set forth below.
The application provides a music performance device capable of music playback, image playback, music singing and similar functions. Specifically, its structure may include a main board, and a sound system and a display device respectively connected to the main board; the main board is also connected with a microphone.
In use, the main board executes the steps of the personalized song file generation method; the sound system plays audio data, such as music files and the user's singing; the microphone picks up the user's singing voice; and the display device displays the image content while a song file is sung, for which a large touch-screen system may be adopted.
An embodiment of a computer device of the present application is set forth below, the computer device comprising:
one or more processors;
a memory;
one or more applications, wherein the one or more applications are stored in the memory and configured to be executed by the one or more processors to perform the personalized song file generation method of any of the embodiments described above.
Embodiments of a computer-readable storage medium of the present application are set forth below: the storage medium has stored thereon at least one instruction, at least one program, a set of codes, or a set of instructions that is loaded by a processor to perform the personalized song file generation method of any of the embodiments described above.
According to the technical schemes of the personalized song file generation apparatus, the music performance device, the computer device and the computer-readable storage medium, an intelligent song-file generation process is constructed: the terminal device can accept music files from any sound source and produce a personalized song file to the user's liking, so that users can freely and conveniently make personalized songs, their personalized needs are met, and their application experience is improved.
The foregoing describes only some embodiments of the present application. It should be noted that those skilled in the art can make several modifications and refinements without departing from the principle of the present application, and such modifications and refinements shall also fall within the protection scope of the present application.