CN118447867B - Music separation method, music separation device, electronic apparatus, and storage medium - Google Patents
- Publication number
- CN118447867B CN202311855849.2A
- Authority
- CN
- China
- Prior art keywords
- audio
- current
- diffusion model
- features
- mixed
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0272—Voice signal separating
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
- G10L25/30—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
Landscapes
- Engineering & Computer Science (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- Signal Processing (AREA)
- Health & Medical Sciences (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Evolutionary Computation (AREA)
- Artificial Intelligence (AREA)
- Quality & Reliability (AREA)
- Auxiliary Devices For Music (AREA)
Abstract
The application discloses a music separation method, a music separation device, an electronic apparatus, and a storage medium. The method includes: obtaining a first audio feature of a current audio; and processing the first audio feature and a current mixed audio through a target diffusion model to obtain the current audio. The audio source types of the current mixed audio include the audio source type of the current audio, and the target diffusion model is obtained by training an initial diffusion model based on a historical mixed audio and a plurality of second audio features of the historical mixed audio, the plurality of second audio features including at least the first audio feature. In this method, the current audio is separated from the current mixed audio according to the first audio feature by the target diffusion model. Because the initial diffusion model learns the characteristics of the audio in the mixed audio from the audio features during training, the target diffusion model can perform music separation on different mixed audios according to audio features, which simplifies the operation of music separation.
Description
Technical Field
The present application relates to the field of audio processing technologies, and in particular, to a music separation method, a music separation device, an electronic apparatus, and a storage medium.
Background
Music is a complex audio signal formed by mixing and superposing instrument sounds and singing voices; it may contain the accompaniment of various instruments and the singing voices of different people. With the continuous development of computer signal processing and internet technologies, the separation of vocals from music and the separation of individual instruments have attracted increasing attention, and such separation can be widely applied in fields such as music mixing, music information retrieval, and music education.
At present, music separation typically trains a neural network (Neural Networks, NN) in advance on historical mixed audio to obtain a trained NN, and then uses the trained NN to perform music separation on the current mixed audio. The audio source types of the current mixed audio are the same as those of the historical mixed audio.
However, if the audio source types of the current mixed audio differ from those of the historical mixed audio, the NN must first be retrained on mixed audio samples whose audio source types match the current mixed audio before the trained NN can separate the current mixed audio, which makes the operation cumbersome.
Disclosure of Invention
In view of the above, the embodiments of the present application provide a music separation method, a music separation device, an electronic apparatus and a storage medium, so as to overcome the above problems in the prior art.
In a first aspect, an embodiment of the present application provides a music separation method, including:
Acquiring a first audio feature of the current audio;
And processing the first audio feature and the current mixed audio through a target diffusion model to obtain the current audio, wherein the audio source type of the current mixed audio comprises the audio source type of the current audio, and the target diffusion model is obtained by training an initial diffusion model based on a historical mixed audio and a plurality of second audio features of the historical mixed audio, wherein the plurality of second audio features at least comprise the first audio feature.
According to the solution provided by the application, the current audio is separated from the current mixed audio by the target diffusion model during music separation. Because the initial diffusion model learns the characteristics of the audio in the mixed audio from the audio features during training, and because diffusion models share parameters, the target diffusion model can perform music separation on different mixed audios according to audio features; there is no need to train the initial diffusion model on sample audio of the same audio source type as each mixed audio, which simplifies the operation of music separation. In addition, the audio source types of the current mixed audio include the audio source type of the current audio, and the audio data distribution of the current audio is the same as the audio data distribution corresponding to the current audio source type in the current mixed audio, which ensures that the music separation process has high separation quality.
Wherein, in some optional embodiments, the acquiring the first audio feature of the current audio includes:
Acquiring a current audio source type of current audio;
and acquiring the first audio feature from an audio feature library according to the current audio source type.
In the solution provided by this embodiment, the current audio source type is the audio source that the user wants to separate from the current mixed audio. This ensures that the audio subsequently separated from the current mixed audio according to the first audio feature is the audio source the user wants, which improves the user's music separation experience.
Wherein in some optional embodiments, before the obtaining the first audio feature from the audio feature library according to the current audio source type, the music separation method further includes:
acquiring a plurality of audio sources, wherein each audio source corresponds to an audio source type;
extracting audio features from the plurality of audio sources through an audio feature extraction model to obtain a plurality of third audio features, wherein each third audio feature corresponds to one audio source, and the plurality of third audio features comprise the first audio features;
an audio feature library is generated from the plurality of third audio features.
The solution provided by this embodiment constructs a high-dimensional audio feature library based on the plurality of third audio features extracted by the audio feature extraction model, which simplifies the extraction of the plurality of third audio features and thus simplifies the construction of the audio feature library.
Wherein, in some alternative embodiments, the audio feature extraction model is any one of a convolutional neural network, a deep neural network, or a long short-term memory network.
Wherein in some alternative embodiments, the first audio feature, the second audio feature, and the third audio feature are all high-dimensional audio features.
Wherein in some optional embodiments, the current audio includes a plurality of sub-audio, the first audio feature includes a plurality of sub-audio features, and the processing, by the target diffusion model, the first audio feature and the current mixed audio to obtain the current audio includes:
and processing the plurality of sub-audio features and the current mixed audio through the target diffusion model to obtain a plurality of sub-audio.
The solution provided by this embodiment separates the plurality of sub-audios from the current mixed audio according to the plurality of sub-audio features by means of the target diffusion model, which improves the user's music separation experience.
Wherein, in some optional embodiments, before the processing the first audio feature and the current mixed audio through the target diffusion model to obtain the current audio, the music separation method further includes:
acquiring a plurality of second audio features of the historical mixed audio;
And training the initial diffusion model according to the historical mixed audio and the plurality of second audio features to obtain a target diffusion model.
The solution provided by this embodiment trains the initial diffusion model based on the historical mixed audio and the plurality of second audio features to obtain the target diffusion model; because the diffusion model has the generative capacity to adapt to different mixed audios, the target diffusion model is guaranteed to have a strongly generalized music separation capability.
Wherein, in some optional embodiments, before the training of the initial diffusion model according to the historical mixed audio and the plurality of second audio features to obtain the target diffusion model, the music separation method further includes:
acquiring noise audio;
Training the initial diffusion model according to the historical mixed audio and the plurality of second audio features to obtain a target diffusion model, wherein the training comprises the following steps:
and training the initial diffusion model according to the noise audio, the historical mixed audio and the plurality of second audio features to obtain a target diffusion model.
According to the scheme provided by the embodiment, the initial diffusion model is trained based on the noise audio, the historical mixed audio and the plurality of second audio features to obtain the target diffusion model, so that the target diffusion model further learns the capability of generating the noiseless audio from the noise audio, and the generalized music separation capability of the target diffusion model is further improved.
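As a non-authoritative illustration of such a training step, the sketch below assumes a DDPM-style noise-prediction objective in which the initial diffusion model is conditioned on the historical mixed audio and a second audio feature. The model signature, the noise schedule, and all identifiers are assumptions made for illustration, not the patent's specified implementation.

```python
import torch

def training_step(model, optimizer, source_audio, mixed_audio, audio_feature, num_steps=1000):
    """One DDPM-style training step (illustrative sketch): the model learns to predict the
    Gaussian noise added to a target source, conditioned on the mixed audio and on the
    source's audio feature. Shapes are assumed to be (batch, channels, samples)."""
    betas = torch.linspace(1e-4, 0.02, num_steps)            # assumed linear noise schedule
    alphas_bar = torch.cumprod(1.0 - betas, dim=0)

    t = torch.randint(0, num_steps, (source_audio.shape[0],))  # random diffusion step per sample
    noise = torch.randn_like(source_audio)                      # Gaussian noise added in forward diffusion
    a_bar = alphas_bar[t].view(-1, 1, 1)
    noisy_source = a_bar.sqrt() * source_audio + (1 - a_bar).sqrt() * noise

    # The (hypothetical) model is conditioned on the historical mixed audio and the second audio feature.
    predicted_noise = model(noisy_source, t, mixed_audio, audio_feature)

    loss = torch.nn.functional.mse_loss(predicted_noise, noise)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

In this sketch the "noise audio" of the embodiment corresponds to the Gaussian noise injected during forward diffusion; other ways of incorporating a separate noise-audio signal are equally compatible with the description above.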
Wherein, in some alternative embodiments, the target diffusion model is derived based on any one of a Unet network, a noise conditional score network (NCSN), or NCSN++.
Wherein, in some alternative embodiments, the target diffusion model is formed based on any one of a denoising diffusion probability model algorithm, a denoising diffusion implicit model algorithm, a random differential equation algorithm, or a score-based generation model algorithm.
In a second aspect, an embodiment of the present application provides a music separation apparatus including:
the feature acquisition module is used for acquiring a first audio feature of the current audio;
The processing module is used for processing the first audio feature and the current mixed audio through a target diffusion model to obtain the current audio, the audio source type of the current mixed audio comprises the audio source type of the current audio, the target diffusion model is obtained by training an initial diffusion model based on the historical mixed audio and a plurality of second audio features of the historical mixed audio, and the plurality of second audio features at least comprise the first audio features.
In a third aspect, an embodiment of the present application provides an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, causes the electronic device to perform the music separation method provided in the first aspect.
In a fourth aspect, an embodiment of the present application provides a chip for use in an electronic device, the chip comprising a processor for reading and executing a computer program stored in a memory, the electronic device being capable of performing the music separation method as provided in the first aspect above when the computer program is executed by the processor.
Wherein in some alternative embodiments, the chip further comprises a memory, the memory being connected to the processor by a circuit or wire.
Wherein in some alternative embodiments the chip further comprises a communication interface.
In a fifth aspect, an embodiment of the present application provides a computer readable storage medium having stored therein program code that is callable by an electronic device to perform the music separation method as provided in the first aspect above.
In a sixth aspect, embodiments of the present application provide a computer program product which, when run on an electronic device, causes the electronic device to perform the music separation method as provided in the first aspect above.
It will be appreciated that the advantages of the second to sixth aspects may be found in the relevant description of the first aspect, and are not described here again.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings used in the embodiments or the description of the prior art will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 shows a schematic structural diagram of a software system of an electronic device according to an embodiment of the present application.
Fig. 2 shows a schematic flow chart of a music separation method according to an embodiment of the present application.
Fig. 3 is a schematic view of an application scenario of a diffusion model in a music separation method according to an embodiment of the present application.
Fig. 4 is a schematic diagram of an application scenario in which a diffusion model processes an audio sample in a music separation method according to an embodiment of the present application.
Fig. 5 shows a schematic diagram of an application scenario of a current mixed audio in a music separation method according to an embodiment of the present application.
Fig. 6 is a schematic diagram of an application scenario of a plurality of current audio obtained by separating current mixed audio from music in the music separation method according to the embodiment of the present application.
Fig. 7 is a schematic diagram of an application scenario of forward diffusion processing and backward sampling processing of a target diffusion model in a music separation method according to an embodiment of the present application.
Fig. 8 is a schematic flow chart of another music separation method according to an embodiment of the present application.
Fig. 9 shows a schematic flow chart of a music separation method according to an embodiment of the present application.
Fig. 10 is a block diagram showing a construction of a music separation apparatus according to an embodiment of the present application.
Fig. 11 shows a schematic hardware structure of an electronic device according to an embodiment of the present application.
Fig. 12 shows a functional block diagram of an electronic device according to an embodiment of the present application.
Detailed Description
Embodiments of the present application are described in detail below, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to like or similar elements or elements having like or similar functions throughout. The embodiments described below by referring to the drawings are exemplary only for explaining the present application and are not to be construed as limiting the present application.
The following disclosure provides many different embodiments, or examples, for implementing different features of the application. In order to simplify the present disclosure, components and arrangements of specific examples are described below. They are, of course, merely examples and are not intended to limit the application. Furthermore, the present application may repeat reference numerals and/or letters in the various examples, which are for the purpose of brevity and clarity, and which do not themselves indicate the relationship between the various embodiments and/or arrangements discussed.
Music is a complex audio signal formed by mixing and superposing instrument sounds and singing voices; it may contain the accompaniment of various instruments and the singing voices of different people. With the continuous development of computer signal processing and internet technologies, the separation of vocals from music and the separation of individual instruments have attracted increasing attention, and such separation can be widely applied in fields such as music mixing, music information retrieval, and music education.
At present, music separation typically trains a neural network (Neural Networks, NN) in advance on historical mixed audio to obtain a trained NN, and then uses the trained NN to perform music separation on the current mixed audio. The audio source types of the current mixed audio are the same as those of the historical mixed audio.
However, if the audio source types of the current mixed audio differ from those of the historical mixed audio, the NN must first be retrained on mixed audio samples whose audio source types match the current mixed audio before the trained NN can separate the current mixed audio, which makes the operation cumbersome. Moreover, the audio separated by the trained NN from the historical mixed audio may still contain other mixed audio, or/and parts of the separated audio data may be missing, which reduces the separation quality of the music separation.
In view of the above problems, embodiments of the present application provide a music separation method, a music separation device, an electronic apparatus, and a storage medium. The music separation method obtains a first audio feature of the current audio and processes the first audio feature and the current mixed audio through a target diffusion model to obtain the current audio. The audio source types of the current mixed audio include the audio source type of the current audio, and the target diffusion model is obtained by training an initial diffusion model based on the historical mixed audio and a plurality of second audio features of the historical mixed audio, the plurality of second audio features including at least the first audio feature. In this way, the current audio is separated from the current mixed audio by the target diffusion model during music separation. Furthermore, because the audio source types of the current mixed audio include the audio source type of the current audio, and the audio data distribution of the current audio is the same as the audio data distribution corresponding to the current audio source type in the current mixed audio, the music separation process is ensured to have high separation quality.
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present application.
The music separation method provided by the embodiments of the application can be applied to an electronic device. The electronic device may include various terminal devices, which may also be called terminals, user equipment (UE), mobile stations (MS), mobile terminals (MT), and the like.
The terminal device may be a mobile phone, a floor-sweeping robot, an unmanned aerial vehicle, a smart TV, a wearable device, a personal digital assistant (Personal Digital Assistant, PDA), a computer with a wireless transceiving function, a virtual reality (VR) terminal device, an augmented reality (Augmented Reality, AR) terminal device, a wireless terminal in industrial control (Industrial Control), a wireless terminal in self-driving (Self-Driving), a wireless terminal in remote medical surgery (Remote Medical Surgery), a wireless terminal in a smart grid (Smart Grid), a wireless terminal in transportation safety (Transportation Safety), a wireless terminal in a smart city (Smart City), a wireless terminal in a smart home (Smart Home), or the like. The type of the terminal device is not limited here, and may be specifically set according to actual requirements.
The software system of the electronic device may employ a layered architecture, an event driven architecture, a microkernel architecture, a microservice architecture, or a cloud architecture. The embodiment of the application takes an Android (Android) system with a layered architecture as an example, and illustrates the software structure of electronic equipment.
Referring to fig. 1, a schematic structural diagram of a software system of an electronic device according to an embodiment of the application is shown. The software system comprises a plurality of layers, each layer has clear roles and division of work, and the layers are communicated through software interfaces. In some embodiments, the Android system is divided into four layers, namely an application layer, an application framework layer, a system library and a kernel layer from top to bottom.
The application layer may include a series of applications, for example, a camera application, a gallery application, a conversation application, a wireless local area network (Wireless Local Area Networks, WLAN) application, a music separation application, a video application, a media provider (MediaProvider) application, a FUSE (Filesystem in Userspace) file system application, and the like.
Wherein the media provider is used to create multimedia files in the FUSE file system or to access multimedia files in the FUSE file system. Each application in the application layer may create a multimedia file in the FUSE file system through the media provider MediaProvider or access the multimedia file in the FUSE file system through the media provider MediaProvider.
The FUSE file system is used to store multimedia files created by a media provider. Of course, in other embodiments, the FUSE file system may also be used to store other data.
The application framework layer provides an application programming interface (Application Programming Interface, API) and a programming framework for the applications of the application layer. The application framework layer includes a number of predefined functions. For example, the application framework layer may include a window manager, a content provider, a resource manager, a view system, a package management service (Package Manager Service, PMS), an activity management service (Activity Manager Service, AMS), and the like.
The window manager is used for managing window programs. The window manager can acquire the size of the display screen, judge whether a status bar exists, lock the screen, intercept the screen and the like.
The content provider is used to store and retrieve data and make such data accessible to applications. The data may include video, images, audio, calls made and received, browsing history and bookmarks, phonebooks, etc.
The view system includes visual controls, such as controls to display text, controls to display pictures, and the like. The view system may be used to build applications. The display interface may be composed of one or more views. For example, a display interface including a text message notification icon may include a view displaying text and a view displaying a picture.
The resource manager provides various resources for the application program, such as localization strings, icons, pictures, layout files, video files, and the like.
The package management service, as the package manager service, is mainly responsible for installing, managing, and uninstalling applications on Android devices. It scans designated directories in the system for files ending in APK, parses those files to obtain all the information of the application, and stores that information.
When a new application is installed, the package administration service will identify all components of the application (e.g., activity, service and Broadcast Receiver, etc.) and assign corresponding rights to those components. Meanwhile, the package management service also monitors the state of the installed application program, and ensures the integrity and safety of the application program.
The package management service is also used to manage the DE (Device Encrypted) data and CE (Credential Encrypted) data of applications. The key for DE data is available only after the system has completed verified boot. CE directories hold data encrypted with a key associated with user authentication (e.g., a pattern or password), which can be obtained only after the user has authenticated.
The CE catalog of the application may include the original uid of the application. The package management service is used to execute DE data and CE data of the music separation application in the present embodiment.
The activity management service, as an activity manager service, is primarily responsible for managing and tracking the activity tasks and lifecycles of all applications. When an application is opened, the campaign management service will start the application's process and allocate processor resources and memory to the application. When an application is no longer in the foreground or background, or when system memory is insufficient, the activity management service may terminate or kill the application's process.
For example, an activity management service may be responsible for managing and tracking activity tasks and lifecycles of music separation applications. When the music separation application is opened, the activity management service initiates the process of the music separation application and allocates processor resources and memory to the music separation application. When the music separation application is no longer in the foreground or background, or when the system memory is insufficient, the activity management service may terminate or kill the progress of the music separation application.
The system libraries may include a surface manager (Surface Manager), media libraries (Media Libraries), the Android Runtime, and the like.
The Android Runtime includes a core library and a virtual machine, and is responsible for scheduling and managing the Android system. The core library comprises two parts: the functions that the Java language needs to call, and the Android core libraries. The application layer and the application framework layer run in the virtual machine. The virtual machine executes the Java files of the application layer and the application framework layer as binary files, and is used to perform functions such as object life-cycle management, stack management, thread management, security and exception management, and garbage collection.
The surface manager is used to manage the display subsystem and provides a fusion of 2D and 3D layers for multiple applications.
The media libraries support playback and recording of a variety of commonly used audio and video formats, as well as still image files and the like. They may support a variety of audio and video encoding formats, such as MPEG4, H.264, MP3, AAC, AMR, JPG, and PNG.
The kernel layer may include modules for audio drivers, display drivers, wi-Fi drivers, bluetooth drivers, sensor drivers, etc.
It will be appreciated that the layers and components contained in the layers in the software architecture shown in fig. 1 do not constitute a specific limitation on the electronic device. In other embodiments of the application, the electronic device may include more or fewer layers than shown, and more or fewer components may be included in each layer, as the application is not limited.
Although the embodiments of the application are described taking the Android system as an example, the basic principles are also applicable to electronic devices based on other operating systems, such as HarmonyOS (Harmony).
Referring to fig. 2, a flowchart of a music separation method according to an embodiment of the application is shown. In a specific embodiment, the music separation method may be applied to an electronic device, and the flow shown in fig. 2 will be described in detail below by taking the electronic device as an example, and the music separation method may include the following steps S110 to S120.
Step S110, acquiring a first audio feature of the current audio.
In the embodiment of the application, when the user needs to perform music separation, the user may send a separation instruction to the electronic device, and the electronic device receives and responds to the separation instruction to acquire the first audio feature of the current audio. The first audio feature may represent the target music source that the user wants to separate out, and may be a high-dimensional or a low-dimensional audio feature, which is not limited here.
In some embodiments, when a user needs to perform music separation, a separation instruction may be sent to the electronic device, the electronic device receives and responds to the separation instruction, obtains a current audio source type of current audio, matches the current audio source type with a plurality of audio features stored in advance in an audio feature library, obtains a matching degree, and determines an audio feature with the matching degree greater than or equal to a matching degree threshold as a first audio feature.
The current audio source type identifies the audio source that the user wants to separate out, and may include a bass music source, a piano music source, a guitar music source, a singing-voice music source, a violin music source, and the like, which is not limited here.
The matching degree threshold can be used for representing the minimum matching degree when the current audio source type is matched with the audio features in the audio feature library, and can be preset by a user, or can be automatically generated by the electronic equipment according to the process of separating music for multiple times, and the like, and the setting mode of the matching degree threshold is not limited, and can be specifically set according to actual requirements.
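To make the matching step above concrete, the following is a minimal sketch of one possible implementation. It assumes the feature library stores one embedding per audio source type, that the matching degree is cosine similarity computed in a shared embedding space, and that the helper embed_source_type (which maps a source-type label into that space) exists; all names and the threshold value are illustrative only.

```python
import numpy as np

def select_first_audio_feature(current_source_type, feature_library, match_threshold=0.8):
    """Return the library features whose matching degree with the requested audio source
    type meets the threshold. feature_library maps a source type (e.g. "piano") to a
    stored feature vector; embed_source_type is a hypothetical type-embedding helper."""
    query = embed_source_type(current_source_type)
    matches = []
    for source_type, feature in feature_library.items():
        key = embed_source_type(source_type)
        # cosine similarity as the matching degree
        degree = float(np.dot(query, key) /
                       (np.linalg.norm(query) * np.linalg.norm(key) + 1e-8))
        if degree >= match_threshold:
            matches.append(feature)
    return matches
```

Any other similarity measure (or a direct lookup by source-type label) would serve the same purpose; the point is only that features whose matching degree reaches the threshold are selected as the first audio feature.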
As an implementation manner, the electronic device stores the current audio source type of the current audio in advance, when the user needs to perform music separation, a separation instruction can be sent to the electronic device, and the electronic device receives and responds to the separation instruction to read the pre-stored current audio source type.
As another implementation, when the user needs to perform music separation, the user may send a separation instruction to the electronic device. The electronic device receives and responds to the separation instruction by sending a first acquisition instruction to a server over the network; the server receives and responds to the first acquisition instruction, reads the current audio source type of the current audio that it stores in advance, and sends the current audio source type back to the electronic device over the network, and the electronic device receives the current audio source type returned by the server.
The server is connected to the electronic equipment through a network and performs data interaction with the electronic equipment through the network. The server may be an independent physical server, may be a server cluster or a distributed system formed by a plurality of physical servers, or may be any one of cloud servers that provide cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, content delivery networks (Content Delivery Network, CDNs), and basic cloud computing services such as big data or artificial intelligence platforms, where the type of the server is not limited, and may be specifically set according to actual requirements.
The network may be any one of a ZigBee network, a Bluetooth (BT) network, a Wi-Fi (Wireless Fidelity) network, a home internet-of-things communication protocol (Thread) network, a Long Range Radio (LoRa) network, a Low-Power Wide-Area Network (LPWAN), an infrared network, a Narrowband Internet of Things (NB-IoT) network, a Controller Area Network (CAN), a Digital Living Network Alliance (DLNA) network, a Wide Area Network (WAN), a Local Area Network (LAN), a Metropolitan Area Network (MAN), a Wireless Personal Area Network (WPAN), or the like; the type of network is not limited here and may be specifically set according to actual requirements.
In one embodiment, when the user needs to perform music separation, a separation instruction may be sent to the electronic device, the electronic device receives and responds to the separation instruction, generates first uploading prompt information, and receives a current audio source type of the current audio uploaded by the user according to the first uploading prompt information.
The first uploading prompt information may be used to prompt a user to upload a current audio source type of the current audio to the electronic device according to the first uploading prompt information. The first uploading prompt information can be at least any one of voice prompt information, text prompt information or lamplight prompt information, and the type of the first uploading prompt information is not limited here, and the first uploading prompt information can be specifically set according to actual requirements.
In some embodiments, when the user needs to perform music separation, a separation instruction carrying the first audio feature of the current audio may be sent to the electronic device, and the electronic device receives and responds to the separation instruction, and obtains the first audio feature according to the separation instruction.
In some embodiments, the electronic device may be provided with an input panel, and when the user needs to perform music separation, a separation instruction may be input to the input panel of the electronic device, for example, a separation instruction is input to the input panel of the electronic device by handwriting, or a separation instruction is input to an input panel key of the electronic device, and the electronic device receives the separation instruction through the input panel.
In some embodiments, the electronic device may be provided with a voice recognition module, where when the user needs to perform music separation, voice information may be sent within a voice collection range of the voice recognition module, where the voice recognition module collects voice information sent by the user and performs voice recognition on the collected voice information to obtain a voice recognition result, and when it is determined that the voice recognition result includes a keyword for instructing the electronic device to perform music separation, for example, the keyword is "music separation", and for example, the keyword is "music" and "separation", and it is determined that a separation instruction is received.
As one example, the voice information sent by the user is that music is separated, the voice recognition result of voice recognition comprises keywords of music and separation, and it is determined that a separation instruction is received.
In some embodiments, when the user needs to perform music separation, a separation instruction may be sent to the client, the client receives and responds to the separation instruction, forwards the separation instruction to the electronic device through the network, and the electronic device receives the separation instruction forwarded by the client.
The client is connected to the electronic device through a network and exchanges data with the electronic device through the network. The client may be a mobile client (for example, a mobile phone client, a personal digital assistant (Personal Digital Assistant, PDA) client, a tablet (Tablet Personal Computer) client, a notebook computer client, a smart watch client, a smart bracelet client, a wearable client, or the like) or a fixed client (for example, a desktop computer client, a smart panel client, or the like); the type of client is not limited here and may be specifically set according to actual requirements.
And step S120, processing the first audio feature and the current mixed audio through the target diffusion model to obtain the current audio.
In the embodiment of the application, after acquiring the first audio feature of the current audio, the electronic device may process the first audio feature and the current mixed audio through the target diffusion model to obtain the current audio. The audio source types of the current mixed audio include the audio source type of the current audio, and the target diffusion model is obtained by training the initial diffusion model based on the historical mixed audio and a plurality of second audio features of the historical mixed audio, the plurality of second audio features including at least the first audio feature. In this way, during music separation the current audio is separated from the current mixed audio according to the first audio feature by the target diffusion model. Because the initial diffusion model learns the characteristics of the audio in the mixed audio from the audio features during training, and because diffusion models share parameters, the target diffusion model can perform music separation on different mixed audios according to audio features; there is no need to train the initial diffusion model on sample audio of the same audio source type as each mixed audio, which simplifies the operation of music separation.
The audio source types of the current mixed audio include the audio source type of the current audio, and the audio data distribution of the current audio is the same as the audio data distribution corresponding to the current audio source type in the current mixed audio, which ensures that the music separation process has high separation quality. For example, if the current audio source type is a piano music source, the current mixed audio contains piano audio, the current audio is piano audio, and the audio data distribution of the piano source in the mixed audio is the same as the audio data distribution of the piano source in the current audio.
The second audio feature may be a high-dimensional audio feature or a low-dimensional audio feature, which is not limited herein.
A diffusion model is a generative model that, given sample data drawn independently and identically distributed from an unknown data distribution, learns an approximation of that distribution. The diffusion model mainly comprises a forward diffusion process and a reverse sampling (inference) process.
The forward diffusion process slowly and progressively adds Gaussian noise to a sample, and the reverse sampling process learns to recover a clean sample from the noisy sample, so that the diffusion model learns the reverse sampling process while learning the forward diffusion process. As a high-quality generative model, the diffusion model generates speech signals of high quality that sound natural.
As an example, as shown in Fig. 3, the sample x_0 is a noise-free sample; x_{t-1} is the noise sample obtained by applying t-1 forward diffusion steps of the diffusion model to x_0, x_t is the noise sample obtained after t forward diffusion steps, and x_T is the noise sample obtained after T forward diffusion steps.
Applying T reverse sampling steps to the noise sample x_T recovers the noise-free sample x_0; applying T-t reverse sampling steps to x_T yields the noise sample x_t, and applying T-t+1 reverse sampling steps yields the noise sample x_{t-1}.
q_data(x_0) denotes the data distribution of the sample x_0, q(x_t | x_{t-1}) is the noise transition applied at the t-th forward diffusion step, and q_latent(x_T) = N(0, 1) indicates that the latent distribution q_latent(x_T) of the noise sample x_T satisfies the standard normal distribution N(0, 1). p_θ(x_{t-1} | x_t) is the learned transition applied at each reverse sampling step from x_t to x_{t-1}.
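For reference, under the standard denoising diffusion probabilistic model (DDPM) formulation — an assumption made here for illustration, since the description above does not fix a particular parameterization — these transitions are commonly written as:

```latex
q(x_t \mid x_{t-1}) = \mathcal{N}\big(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t \mathbf{I}\big)
q(x_t \mid x_0) = \mathcal{N}\big(x_t;\ \sqrt{\bar{\alpha}_t}\,x_0,\ (1-\bar{\alpha}_t)\,\mathbf{I}\big),
    \qquad \bar{\alpha}_t = \prod_{s=1}^{t}(1-\beta_s)
p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\big(x_{t-1};\ \mu_\theta(x_t, t),\ \Sigma_\theta(x_t, t)\big)
```

Here β_t is the noise schedule of the t-th forward step, and μ_θ and Σ_θ are produced by the trained network; reverse sampling then draws x_{t-1} from p_θ(x_{t-1} | x_t) step by step, starting from x_T ~ N(0, I).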
In an application scenario, as shown in Fig. 4, the audio sample X_0 is a noise-free audio sample; subjecting X_0 to the forward diffusion process of the diffusion model, i.e., gradually adding Gaussian noise, yields a noise audio sample X_T. Subjecting the noise audio sample X_T to the reverse sampling process of the diffusion model recovers the noise-free audio sample X_0.
In some embodiments, the electronic device may store the target diffusion model in advance, the electronic device may input the current mixed audio and the first audio feature to the target diffusion model after acquiring the first audio feature of the current audio, the target diffusion model receives and responds to the current mixed audio and the first audio feature, generates the current audio, and outputs the current audio to the electronic device, and the electronic device receives the current audio output by the target diffusion model.
In some embodiments, the target diffusion model is stored in the server in advance, the electronic device may send the current mixed audio and the first audio feature to the server through the network after obtaining the first audio feature of the current audio, the server receives and responds to the current mixed audio and the first audio feature, inputs the current mixed audio and the first audio feature to the target diffusion model, the target diffusion model receives and responds to the current mixed audio and the first audio feature, generates the current audio, and outputs the current audio to the server, the server receives the current audio output by the target diffusion model, and sends the current audio to the electronic device through the network, and the electronic device receives the current audio returned by the server.
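To illustrate how such conditional generation might look in practice, the sketch below runs the reverse sampling of the target diffusion model while conditioning every denoising step on the current mixed audio and the first audio feature. It assumes the DDPM sampling rule and a noise-predicting model; every identifier and the noise schedule are hypothetical, not taken from the patent.

```python
import torch

@torch.no_grad()
def separate_current_audio(model, mixed_audio, audio_feature, num_steps=1000):
    """Reverse sampling of the target diffusion model, conditioned on the current mixed
    audio and the first audio feature, to generate the current (separated) audio.
    Illustrative sketch assuming a DDPM noise-prediction parameterization."""
    betas = torch.linspace(1e-4, 0.02, num_steps)
    alphas = 1.0 - betas
    alphas_bar = torch.cumprod(alphas, dim=0)

    x = torch.randn_like(mixed_audio)                        # start from pure Gaussian noise
    for t in reversed(range(num_steps)):
        t_batch = torch.full((x.shape[0],), t, dtype=torch.long)
        eps = model(x, t_batch, mixed_audio, audio_feature)  # predicted noise, conditioned
        coef = (1 - alphas[t]) / (1 - alphas_bar[t]).sqrt()
        x = (x - coef * eps) / alphas[t].sqrt()              # DDPM posterior mean estimate
        if t > 0:
            x = x + betas[t].sqrt() * torch.randn_like(x)    # add sampling noise except at the last step
    return x                                                  # the separated current audio
```

Whether this loop runs on the device or on a server, as in the two embodiments above, does not change the sampling procedure itself.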
In some implementations, the first audio feature may include a plurality of sub-audio features, indicating that the user wants multiple target music sources to be separated. The current audio may include a plurality of sub-audios, representing the plurality of target music sources separated by the target diffusion model.
After the electronic device acquires the plurality of sub-audio features, it can process the plurality of sub-audio features and the current mixed audio through the target diffusion model to obtain a plurality of sub-audios, so that the sub-audios are separated from the current mixed audio according to the sub-audio features by the target diffusion model, which improves the user's music separation experience.
As an embodiment, the electronic device may store the target diffusion model in advance, after acquiring the plurality of sub-audio features of the plurality of sub-audio, the electronic device may input the current mixed audio and the plurality of sub-audio features to the target diffusion model, the target diffusion model receives and responds to the current mixed audio and the plurality of sub-audio features, generates one sub-audio according to each sub-audio feature and the current mixed audio, and outputs the plurality of sub-audio to the electronic device, and the electronic device receives the plurality of sub-audio output by the target diffusion model.
As one embodiment, the target diffusion model is stored in the server in advance, after the electronic device obtains the plurality of sub-audio features of the plurality of sub-audio, the electronic device may send the current mixed audio and the plurality of sub-audio features to the server through the network, the server receives and responds to the current mixed audio and the plurality of sub-audio features, inputs the current mixed audio and the plurality of sub-audio features to the target diffusion model, the target diffusion model receives and responds to the current mixed audio and the plurality of sub-audio features, generates one sub-audio according to each sub-audio feature and the current mixed audio, and outputs the plurality of sub-audio to the server, the server receives the plurality of sub-audio output by the target diffusion model, and sends the plurality of sub-audio to the electronic device through the network, and the electronic device receives the plurality of sub-audio returned by the server.
In an application scenario, as shown in fig. 5, the current mixed audio may include singing audio, bass audio, drumbeat audio, guitar audio, and the like, and fig. 5 illustrates frequency distribution of singing audio, frequency distribution of bass audio, frequency distribution of drumbeat audio, frequency distribution of guitar audio, and the like corresponding to different sampling durations.
The plurality of sub-audio features may include singing audio features, bass audio features, drum audio features, guitar audio features, etc., and the plurality of sub-audio may include current singing audio, current bass audio, current drum audio, current guitar audio, etc.
The target diffusion model may generate a current singing audio from the singing audio features and the current mixed audio, generate a current bass audio from the bass audio features and the current mixed audio, generate a current drumbeat audio from the drumbeat audio features and the current mixed audio, and generate a current guitar audio from the guitar audio features and the current mixed audio, as shown in fig. 6.
The frequency distribution in the current singing audio is the same as the frequency distribution of the singing audio in the current mixed audio, the frequency distribution in the current bass audio is the same as the frequency distribution of the bass audio in the current mixed audio, the frequency distribution in the current drumbeat audio is the same as the frequency distribution of the drumbeat audio in the current mixed audio, and the frequency distribution in the current guitar audio is the same as the frequency distribution of the guitar audio in the current mixed audio.
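Under the same assumptions as the separation sketch above, the scenario in Figs. 5 and 6 could be realized by invoking that routine once per sub-audio feature. The variables below (the model and the per-stem features) are hypothetical placeholders standing in for whatever the feature library and trained model provide.

```python
# Hypothetical per-stem separation using separate_current_audio from the sketch above.
# target_diffusion_model, current_mixed_audio, and the *_audio_feature tensors are placeholders.
stem_features = {
    "vocals": singing_audio_feature,
    "bass":   bass_audio_feature,
    "drums":  drum_audio_feature,
    "guitar": guitar_audio_feature,
}
separated = {
    name: separate_current_audio(target_diffusion_model, current_mixed_audio, feature)
    for name, feature in stem_features.items()
}
```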
In some embodiments, the target diffusion model may be built based on any one of a Unet network, a noise conditional score network (Noise Conditional Score Network, NCSN), NCSN++, or the like; that is, the target diffusion model may select any one of a Unet network, NCSN, NCSN++, or the like as its model architecture, which is not limited here.
In some embodiments, the target diffusion model may be formed based on any one of a denoising diffusion probability model (Denoising Diffusion Probabilistic Model, DDPM) algorithm, a denoising diffusion implicit model (Denoising Diffusion Implicit Models, DDIM) algorithm, a random differential equation (Stochastic Differential Equation, SDE) algorithm, a Score-based generation model (Score-based Generative Model, SGM) algorithm, or the like, i.e., the target diffusion model may select any one of DDPM algorithm, DDIM algorithm, SDE algorithm, SGM algorithm, or the like as a model framework, without limitation herein.
In one application scenario, as shown in fig. 7, the current audio may include current piano audio, current guitar audio, and current drum audio, and the current mixed audio may include singing audio, bass audio, drum audio, and guitar audio.
The current mixed audio is subjected to reverse sampling processing by the target diffusion model, so that the current piano audio, the current guitar audio and the current drum sound audio can be obtained.
According to the above solution, the first audio feature of the current audio is acquired, and the first audio feature and the current mixed audio are processed through the target diffusion model to obtain the current audio. The audio source types of the current mixed audio include the audio source type of the current audio, and the target diffusion model is obtained by training the initial diffusion model based on the historical mixed audio and a plurality of second audio features of the historical mixed audio, the plurality of second audio features including at least the first audio feature. In this way, the current audio is separated from the current mixed audio by the target diffusion model during music separation. Because the initial diffusion model learns the characteristics of the audio in the mixed audio from the audio features during training, and because diffusion models share parameters, the target diffusion model can perform music separation on different mixed audios according to audio features; there is no need to train the initial diffusion model on sample audio of the same audio source type as each mixed audio, which simplifies the operation of music separation. Moreover, the audio source types of the current mixed audio include the audio source type of the current audio, and the audio data distribution of the current audio is the same as the audio data distribution corresponding to the current audio source type in the current mixed audio, which ensures that the music separation process has high separation quality.
Referring to fig. 8, a flowchart of a music separation method according to another embodiment of the application is shown. In a specific embodiment, the music separation method may be applied to an electronic device, and the flow shown in fig. 8 will be described in detail below by taking the electronic device as an example, and the music separation method may include the following steps S210 to S260.
Step S210, a plurality of audio sources are acquired.
In this embodiment, when the user needs to perform music separation, a separation instruction may be sent to the electronic device, and the electronic device receives and responds to the separation instruction to obtain a plurality of audio sources.
Wherein each audio source may correspond to an audio source type. The audio source may include bass, piano, guitar, singer, violin, etc., and the type of the audio source is not limited herein, and may be specifically set according to actual needs.
In some embodiments, the electronic device stores a plurality of audio sources in advance, and when the user needs to perform music separation, a separation instruction can be sent to the electronic device, and the electronic device receives and responds to the separation instruction to read the prestored plurality of audio sources.
In some embodiments, when the user needs to perform music separation, the separation instruction may be sent to the electronic device; the electronic device receives and responds to the separation instruction and sends a second acquisition instruction to the server through the network; the server receives and responds to the second acquisition instruction, reads the plurality of audio sources stored in advance by the server, and sends the plurality of audio sources to the electronic device through the network; and the electronic device receives the plurality of audio sources returned by the server.
In some embodiments, when the user needs to perform music separation, a separation instruction may be sent to the electronic device, and the electronic device receives and responds to the separation instruction, generates the second uploading prompt information, and receives a plurality of audio sources uploaded by the user according to the second uploading prompt information.
The second uploading prompt message may be used to prompt the user to upload the plurality of audio sources to the electronic device according to the second uploading prompt message. The second uploading prompt information can be at least any one of voice prompt information, text prompt information or lamplight prompt information, and the type of the second uploading prompt information is not limited here, and the second uploading prompt information can be specifically set according to actual requirements.
Step S220, audio feature extraction is performed on the plurality of audio sources through the audio feature extraction model to obtain a plurality of third audio features.
In this embodiment, after the electronic device obtains the plurality of audio sources, the audio feature extraction model may be used to perform audio feature extraction on the plurality of audio sources to obtain a plurality of third audio features, so that an audio high-dimensional feature library can be constructed based on the plurality of third audio features extracted by the audio feature extraction model. The user does not need to manually perform audio feature extraction on the audio sources, and since manually extracting the third audio features is a relatively complex process, the audio feature extraction process is thereby simplified.
The audio feature extraction model may be any one of a convolutional neural network (Convolutional Neural Networks, CNN), a deep neural network (Deep Neural Networks, DNN), a Long Short-Term Memory (LSTM) network, etc., which is not limited herein, and may be specifically set according to actual requirements.
Each of the third audio features corresponds to an audio source, and the third audio features may include the first audio feature, and the third audio features may be high-dimensional audio features or low-dimensional audio features, and the like, which are not limited herein.
Specifically, after the electronic device acquires a plurality of audio sources, the electronic device may input the plurality of audio sources into a pre-trained audio feature extraction model, the audio feature extraction model receives and responds to the plurality of audio sources, performs audio feature extraction on each audio source to obtain a third audio feature, and outputs each third audio feature to the electronic device, and the electronic device receives the plurality of third audio features output by the audio feature extraction model.
In some embodiments, the audio feature extraction model may be a CNN, and after the electronic device acquires the plurality of audio sources, the electronic device may input the plurality of audio sources to the pre-trained CNN, where the CNN receives and responds to the plurality of audio sources, performs audio feature extraction on each audio source to obtain a third audio feature, and outputs each third audio feature to the electronic device, where the electronic device receives the plurality of third audio features output by the CNN.
In some embodiments, the audio feature extraction model may be DNN, and after the electronic device acquires the plurality of audio sources, the electronic device may input the plurality of audio sources to the pre-trained DNN, the DNN receives and responds to the plurality of audio sources, performs audio feature extraction on each audio source to obtain a third audio feature, and outputs each third audio feature to the electronic device, where the electronic device receives the plurality of third audio features output by the DNN.
In some embodiments, the audio feature extraction model may be an LSTM network, after the electronic device acquires the plurality of audio sources, the electronic device may input the plurality of audio sources into the pre-trained LSTM network, the LSTM network receives and responds to the plurality of audio sources, performs audio feature extraction on each audio source to obtain a third audio feature, and outputs each third audio feature to the electronic device, where the electronic device receives the plurality of third audio features output by the LSTM network.
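As a hedged illustration of step S220 (not the actual model architecture used in the embodiments), the sketch below shows a minimal 1-D CNN encoder that maps a mono waveform to a fixed-length third audio feature; the layer sizes, the pooling strategy, and the feature dimension are assumptions introduced for illustration.

```python
import torch
import torch.nn as nn

class AudioFeatureExtractor(nn.Module):
    """Minimal 1-D CNN encoder sketch: maps a mono waveform of shape
    (batch, 1, samples) to a fixed-length high-dimensional feature vector.
    The layer sizes are illustrative assumptions, not the architecture
    described in the embodiments."""
    def __init__(self, feature_dim=512):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(1, 32, kernel_size=9, stride=4, padding=4), nn.ReLU(),
            nn.Conv1d(32, 64, kernel_size=9, stride=4, padding=4), nn.ReLU(),
            nn.Conv1d(64, 128, kernel_size=9, stride=4, padding=4), nn.ReLU(),
        )
        self.proj = nn.Linear(128, feature_dim)

    def forward(self, waveform):
        h = self.conv(waveform)        # (batch, 128, frames)
        h = h.mean(dim=-1)             # global average pooling over time
        return self.proj(h)            # (batch, feature_dim) third audio feature
```

A DNN or LSTM variant would only swap the encoder body; the output in all cases is one feature vector per audio source.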
Step S230, an audio feature library is generated according to the plurality of third audio features.
In this embodiment, after performing audio feature extraction on each of the plurality of audio sources through the audio feature extraction model to obtain a plurality of third audio features, the electronic device may generate an audio feature library according to the plurality of third audio features, so as to implement construction of the audio feature library based on the plurality of third audio features extracted by the audio feature extraction model, and simplify a construction process of the audio feature library.
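A minimal sketch of steps S230 and S250 follows, assuming (as an illustration only) that the audio sources arrive as a mapping from audio source type to waveform tensor; the helper names and the dictionary-based library are hypothetical choices, not the construction prescribed by the embodiments.

```python
def build_audio_feature_library(extractor, audio_sources):
    """Step S230 sketch: `audio_sources` is assumed to map an audio source
    type (e.g. "piano") to a waveform tensor of shape (1, samples).
    Each source is encoded once and stored keyed by its type."""
    library = {}
    for source_type, waveform in audio_sources.items():
        library[source_type] = extractor(waveform.unsqueeze(0)).squeeze(0)
    return library

def lookup_first_audio_feature(library, current_source_type):
    """Step S250 analogue: retrieve the stored feature for the requested type."""
    return library[current_source_type]
```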
Step S240, obtaining the current audio source type of the current audio.
Step S250, according to the current audio source type, acquiring a first audio feature from an audio feature library.
Step S260, the first audio feature and the current mixed audio are processed through the target diffusion model to obtain the current audio.
In this embodiment, the steps S240, S250 and S260 may refer to the content of the corresponding steps in the foregoing embodiments, which is not described herein.
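For readability only, the short fragment below strings the earlier sketches together for steps S240 to S260; every object it references (`library`, `eps_model`, `current_mixed_audio`) is an assumption carried over from those sketches rather than an element defined by the embodiments.

```python
# Illustrative end-to-end use of steps S240-S260 (all objects are assumptions):
# the current audio source type selects the first audio feature from the
# library, which then conditions the diffusion sampler on the current mixed audio.
current_source_type = "guitar"
first_audio_feature = lookup_first_audio_feature(library, current_source_type)
current_audio = ddpm_separate(eps_model, current_mixed_audio,
                              first_audio_feature.unsqueeze(0))
```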
According to the scheme provided by this embodiment, a plurality of audio sources are obtained; audio feature extraction is performed on the plurality of audio sources through the audio feature extraction model to obtain a plurality of third audio features; an audio feature library is generated according to the plurality of third audio features; the current audio source type of the current audio is obtained; the first audio feature is obtained from the audio feature library according to the current audio source type; and the first audio feature and the current mixed audio are processed through the target diffusion model to obtain the current audio. Since the audio source types of the current mixed audio include the audio source type of the current audio, the audio data distribution of the current audio is the same as the audio data distribution corresponding to the current audio source type in the current mixed audio, which ensures that the music separation process has higher separation quality.
Further, an audio feature library is constructed based on the plurality of third audio features extracted by the audio feature extraction model, and the construction process of the audio feature library is simplified.
Referring to fig. 9, a flowchart of a music separation method according to still another embodiment of the present application is shown. In a specific embodiment, the music separation method may be applied to an electronic device, and the flow shown in fig. 9 will be described in detail below by taking the electronic device as an example, and the music separation method may include the following steps S310 to S340.
Step S310, acquiring a plurality of second audio features of the history mixed audio.
In this embodiment, when the user needs to perform music separation, a separation instruction may be sent to the electronic device, and the electronic device receives and responds to the separation instruction to obtain the history mixed audio and a plurality of second audio features of the history mixed audio.
In some embodiments, the electronic device stores the historical mixed audio and the plurality of second audio features of the historical mixed audio in advance, and when the user needs to perform music separation, a separation instruction may be sent to the electronic device, and the electronic device receives and responds to the separation instruction, and reads the pre-stored historical mixed audio and the plurality of second audio features of the historical mixed audio.
In some embodiments, when the user needs to perform music separation, a separation instruction may be sent to the electronic device; the electronic device receives and responds to the separation instruction and sends a third acquisition instruction to the server through the network; the server receives and responds to the third acquisition instruction, reads the historical mixed audio and the plurality of second audio features of the historical mixed audio stored in advance by the server, and sends them to the electronic device through the network; and the electronic device receives the historical mixed audio and the plurality of second audio features of the historical mixed audio returned by the server.
In some embodiments, when the user needs to perform music separation, a separation instruction may be sent to the electronic device, the electronic device receives and responds to the separation instruction, generates third uploading prompt information, and receives the historical mixed audio and the plurality of second audio features of the historical mixed audio uploaded by the user according to the third uploading prompt information.
The third uploading prompt information may be used to prompt the user to upload the historical mixed audio and the plurality of second audio features of the historical mixed audio to the electronic device according to the third uploading prompt information. The third uploading prompt information can be at least any one of voice prompt information, text prompt information or lamplight prompt information, and the type of the third uploading prompt information is not limited here, and the third uploading prompt information can be specifically set according to actual requirements.
Step S320, the initial diffusion model is trained according to the historical mixed audio and the plurality of second audio features to obtain the target diffusion model.
In this embodiment, after the electronic device obtains the historical mixed audio and the plurality of second audio features of the historical mixed audio, the historical mixed audio and the plurality of second audio features may be input to the initial diffusion model for training to obtain the target diffusion model. Training of the initial diffusion model based on the historical mixed audio and the plurality of second audio features is thus realized; since the diffusion model itself has the generative capability to adapt to different mixed audio, the target diffusion model is ensured to have a stronger generalized music separation capability.
In some embodiments, after the electronic device obtains the historical mixed audio and the plurality of second audio features of the historical mixed audio, the electronic device may further obtain noise audio, and input the noise audio, the historical mixed audio, and the plurality of second audio features to the initial diffusion model for training to obtain the target diffusion model. In this way, the target diffusion model additionally learns the capability of generating noiseless audio from noise audio, which further improves its generalized music separation capability.
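The following training-step sketch is one hedged way such training could be realized: it assumes, purely for illustration, that the historical mixed audio is available together with the isolated source track that each second audio feature describes, and trains a noise-prediction network with a standard diffusion objective. Function names, tensor shapes, and the pairing assumption are not taken from the embodiments.

```python
import torch
import torch.nn.functional as F

def train_step(eps_model, optimizer, source_audio, mixed_audio, audio_feature,
               alpha_bars):
    """One illustrative training step for the initial diffusion model:
    a random timestep is drawn, Gaussian noise is added to the target source
    audio according to the schedule, and the model is trained to predict that
    noise given the noisy source, the historical mixed audio, and the matching
    second audio feature. Shapes assumed: audio (batch, 1, samples)."""
    t = torch.randint(0, alpha_bars.shape[0], (source_audio.shape[0],))
    noise = torch.randn_like(source_audio)
    a_bar = alpha_bars[t].view(-1, 1, 1)
    noisy = torch.sqrt(a_bar) * source_audio + torch.sqrt(1.0 - a_bar) * noise

    pred = eps_model(noisy, t, mixed_audio, audio_feature)
    loss = F.mse_loss(pred, noise)      # standard noise-prediction objective

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```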
Step S330, a first audio feature of the current audio is acquired.
And step S340, processing the first audio feature and the current mixed audio through the target diffusion model to obtain the current audio.
In this embodiment, the step S330 and the step S340 may refer to the content of the corresponding steps in the foregoing embodiments, which is not described herein.
According to the scheme provided by this embodiment, the historical mixed audio and the plurality of second audio features of the historical mixed audio are obtained; the initial diffusion model is trained according to the historical mixed audio and the plurality of second audio features to obtain the target diffusion model; the first audio feature of the current audio is obtained; and the first audio feature and the current mixed audio are processed through the target diffusion model to obtain the current audio. In the music separation process, the current audio is thus separated from the current mixed audio according to the first audio feature based on the target diffusion model. Because the initial diffusion model learns the audio characteristics in the mixed audio according to the audio features during training, and the diffusion model has the characteristic of parameter sharing, the target diffusion model can perform music separation on different mixed audio according to the audio features; there is no need to train the initial diffusion model with sample audio of the same audio source type as each mixed audio, which simplifies the operation process of music separation. In addition, since the audio source types of the current mixed audio include the audio source type of the current audio, the audio data distribution of the current audio is the same as the audio data distribution corresponding to the current audio source type in the current mixed audio, which ensures that the music separation process has higher separation quality.
Further, the initial diffusion model is trained based on the historical mixed audio and the plurality of second audio features of the historical mixed audio to obtain the target diffusion model; since the diffusion model has the generative capability to adapt to different mixed audio, the target diffusion model is guaranteed to have a stronger generalized music separation capability.
Referring to fig. 10, which illustrates a music separation apparatus 300 according to an embodiment of the present application, the music separation apparatus 300 may be applied to an electronic device, and the music separation apparatus 300 illustrated in fig. 10 will be described in detail below by taking the electronic device as an example, and the music separation apparatus 300 may include a feature acquisition module 310 and a processing module 320.
The feature obtaining module 310 may be configured to obtain a first audio feature of the current audio, the processing module 320 may be configured to process the first audio feature and the current mixed audio through a target diffusion model to obtain the current audio, an audio source type of the current mixed audio may include an audio source type of the current audio, the target diffusion model may be obtained by training an initial diffusion model based on a plurality of second audio features of the historical mixed audio and the historical mixed audio, and the plurality of second audio features may include at least the first audio feature.
In some implementations, the feature acquisition module 310 may include a first acquisition unit and a second acquisition unit.
The first obtaining unit may be configured to obtain a current audio source type of the current audio, and the second obtaining unit may be configured to obtain the first audio feature from the audio feature library according to the current audio source type.
In some embodiments, the music separation apparatus 300 may further include an audio source acquisition module, an extraction module, and a generation module.
The audio source obtaining module may be configured to obtain a plurality of audio sources before the second obtaining unit obtains the first audio feature from the audio feature library according to the current audio source type, where each audio source may correspond to one audio source type, the extracting module may be configured to extract audio features from the plurality of audio sources through the audio feature extraction model to obtain a plurality of third audio features, each third audio feature may correspond to one audio source, and the plurality of third audio features may include the first audio feature, and the generating module may be configured to generate the audio feature library according to the plurality of third audio features.
In some implementations, the audio feature extraction model may be any of a convolutional neural network, a deep neural network, a long short-term memory network, or the like.
In some implementations, the first audio feature, the second audio feature, and the third audio feature can each be high-dimensional audio features.
In some implementations, the current audio can include a plurality of sub-audio, the first audio feature can include a plurality of sub-audio features, and the processing module 320 can include a processing unit.
The processing unit may be configured to process the plurality of sub-audio features and the current mixed audio through the target diffusion model to obtain a plurality of sub-audio.
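Continuing the earlier sketches, one hedged way to obtain the plurality of sub-audio in a single call is to batch the sub-audio features; all objects below (`library`, `eps_model`, `current_mixed_audio`, `ddpm_separate`) are the assumed ones introduced in those sketches, and the source-type keys are hypothetical.

```python
# Sketch: separating several sub-audio tracks in one batched call by stacking
# their sub-audio features and repeating the current mixed audio accordingly.
sub_features = torch.stack([library["piano"], library["guitar"], library["drums"]])
mixed_batch = current_mixed_audio.expand(sub_features.shape[0], -1, -1)
sub_audios = ddpm_separate(eps_model, mixed_batch, sub_features)
```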
In some implementations, the music separation device 300 may also include a history mixed audio acquisition module and a training module.
The historical mixed audio obtaining module may be used to obtain the historical mixed audio and a plurality of second audio features of the historical mixed audio before the processing module 320 processes the first audio feature and the current mixed audio through the target diffusion model to obtain the current audio, and the training module may be used to train the initial diffusion model according to the historical mixed audio and the plurality of second audio features to obtain the target diffusion model.
In some implementations, the music separation device 300 may also include a noise audio acquisition module.
The noise audio acquisition module may be configured to acquire noise audio before the training module trains the initial diffusion model according to the historical mixed audio and the plurality of second audio features to obtain the target diffusion model.
In some embodiments, the training module may include a training unit.
The training unit may be configured to train the initial diffusion model according to the noise audio, the historical mixed audio, and the plurality of second audio features to obtain a target diffusion model.
In some implementations, the target diffusion model may be derived based on any one of a U-Net network, a noise conditional score network (NCSN), NCSN++, or the like.
In some embodiments, the target diffusion model may be formed based on any one of a denoising diffusion probability model algorithm, a denoising diffusion implicit model algorithm, a random differential equation algorithm, a score-based generation model algorithm, or the like.
According to the scheme provided by this embodiment, the first audio feature of the current audio is obtained, and the first audio feature and the current mixed audio are processed through the target diffusion model to obtain the current audio. The audio source types of the current mixed audio include the audio source type of the current audio, the target diffusion model is obtained by training the initial diffusion model based on the historical mixed audio and a plurality of second audio features of the historical mixed audio, and the plurality of second audio features include at least the first audio feature. In the music separation process, the current audio is thus separated from the current mixed audio according to the first audio feature based on the target diffusion model. Because the initial diffusion model learns the audio characteristics in the mixed audio according to the audio features during training, and the diffusion model has the characteristic of parameter sharing, the target diffusion model can perform music separation on different mixed audio according to the audio features; there is no need to train the initial diffusion model with sample audio of the same audio source type as each mixed audio, which simplifies the operation process of music separation. In addition, since the audio source types of the current mixed audio include the audio source type of the current audio, the audio data distribution of the current audio is the same as the audio data distribution corresponding to the current audio source type in the current mixed audio, which ensures that the music separation process has higher separation quality.
It should be noted that, in this specification, the embodiments are described in a progressive manner, and each embodiment focuses on its differences from the other embodiments; for identical or similar parts between the embodiments, reference may be made to one another. As for the apparatus embodiments, since they are substantially similar to the method embodiments, the description is relatively simple, and reference may be made to the description of the method embodiments for relevant details. Any processing manner described in the method embodiments may be implemented by a corresponding processing module in the apparatus embodiments, and is not described in detail in the apparatus embodiments.
In addition, each functional module in each embodiment of the present application may be integrated into one processing module, or each module may exist alone physically, or two or more modules may be integrated into one module. The integrated modules may be implemented in hardware or in software functional modules.
Referring to fig. 11, a schematic hardware structure of an electronic device 600 according to an embodiment of the application is shown. As shown in fig. 11, the electronic device 600 may include a processor 610, an external memory interface 620, an internal memory 621, a universal serial bus (Universal Serial Bus, USB) interface 630, a charge management module 640, a power management module 641, a battery 642, an antenna 1, an antenna 2, a mobile communication module 650, a wireless communication module 660, an audio module 670, a speaker 670A, a receiver 670B, a microphone 670C, an earphone interface 670D, a sensor module 680, keys 690, a motor 691, an indicator 692, a camera 693, a display 694, a subscriber identity module (Subscriber Identification Module, SIM) card interface 695, and the like. Among other things, the sensor module 680 may include a pressure sensor 680A, a gyroscope sensor 680B, a barometric pressure sensor 680C, a magnetic sensor 680D, an acceleration sensor 680E, a distance sensor 680F, a proximity light sensor 680G, a fingerprint sensor 680H, a temperature sensor 680J, a touch sensor 680K, an ambient light sensor 680L, a bone conduction sensor 680M, and the like.
It should be understood that the illustrated structure of the embodiment of the present application does not constitute a specific limitation on the electronic device 600. In other embodiments of the application, electronic device 600 may include more or fewer components than shown, or certain components may be combined, or certain components may be split, or different arrangements of components. The illustrated components may be implemented in hardware, software, or a combination of software and hardware.
By way of example, the processor 610 shown in fig. 11 may include one or more processing units. For example, the processor 610 may include an application processor (Application Processor, AP), a modem processor, a graphics processing unit (Graphics Processing Unit, GPU), an image signal processor (Image Signal Processor, ISP), a controller, a memory, a video codec, a digital signal processor (Digital Signal Processor, DSP), a baseband processor, and/or a neural network processing unit (Neural-Network Processing Unit, NPU), etc. The different processing units may be separate devices or may be integrated in one or more processors.
Wherein the AP may be used to control and manage the music separation application, for example, the AP may control the music separation application to separate music.
The controller may be a neural hub and command center of the electronic device 600. The controller can generate operation control signals according to the instruction operation codes and the time sequence signals to finish the control of instruction fetching and instruction execution.
A memory may also be provided in the processor 610 for storing instructions and data. In some embodiments, the memory in the processor 610 is a cache memory. The memory may hold instructions or data that the processor 610 has just used or recycled. If the processor 610 needs to reuse the instruction or data, it may be called directly from memory. Repeated accesses are avoided, reducing the latency of the processor 610 and thus improving the efficiency of the system.
In some embodiments, the processor 610 may include one or more interfaces. The interfaces may include an inter-integrated circuit (Inter-Integrated Circuit, I2C) interface, an inter-integrated circuit sound (Inter-Integrated Circuit Sound, I2S) interface, a pulse code modulation (Pulse Code Modulation, PCM) interface, a universal asynchronous receiver/transmitter (Universal Asynchronous Receiver/Transmitter, UART) interface, a mobile industry processor interface (Mobile Industry Processor Interface, MIPI), a general-purpose input/output (General-Purpose Input/Output, GPIO) interface, a subscriber identity module (Subscriber Identity Module, SIM) interface, and/or a universal serial bus (Universal Serial Bus, USB) interface, among others.
In some embodiments, the I2C interface is a bidirectional synchronous serial bus including a serial data line (Serial Data Line, SDA) and a serial clock line (Serial Clock Line, SCL). The processor 610 may contain multiple sets of I2C buses. The processor 610 may be coupled to the touch sensor 680K, a charger, a flash, the camera 693, etc., respectively, through different I2C bus interfaces. For example, the processor 610 may be coupled to the touch sensor 680K through an I2C interface, so that the processor 610 communicates with the touch sensor 680K through the I2C bus interface to implement the touch function of the electronic device 600.
In some embodiments, the I2S interface may be used for audio communication. The processor 610 may contain multiple sets of I2S buses. The processor 610 may be coupled to the audio module 670 via an I2S bus to enable communication between the processor 610 and the audio module 670.
In some embodiments, the audio module 670 may communicate audio signals to the wireless communication module 660 via the I2S interface to enable phone answering via a bluetooth headset.
In some embodiments, the PCM interface may also be used for audio communication, sampling, quantizing, and encoding analog signals. The audio module 670 and the wireless communication module 660 may be coupled by a PCM bus interface.
In some embodiments, the audio module 670 may also transmit audio signals to the wireless communication module 660 via the PCM interface to enable phone answering via the bluetooth headset. It should be appreciated that both the I2S interface and the PCM interface may be used for audio communication.
Electronic device 600 may implement audio functionality through audio module 670, speaker 670A, receiver 670B, microphone 670C, headphone interface 670D, an application processor, and the like. Such as music playing, recording, etc.
The audio module 670 may be used to convert digital audio information to an analog audio signal output, and may also be used to convert an analog audio input to a digital audio signal. The audio module 670 may also be used to encode and decode audio signals. In some embodiments, the audio module 670 may be disposed in the processor 610, or some of the functional modules of the audio module 670 may be disposed in the processor 610.
Speaker 670A, also known as a "horn," is used to convert audio electrical signals into sound signals. The electronic device 600 may listen to music, or to hands-free conversations, through the speaker 670A.
A receiver 670B, also known as an "earpiece", is used to convert the audio electrical signal into a sound signal. When the electronic device 600 is answering a telephone call or a voice message, voice may be received by placing the receiver 670B close to the human ear.
Microphone 670C, also referred to as a "mike" or a "mic", is used to convert sound signals into electrical signals. When making a call or sending voice information, the user can speak close to the microphone 670C to input a sound signal into the microphone 670C. The electronic device 600 may be provided with at least one microphone 670C. In other embodiments, the electronic device 600 may be provided with two microphones 670C, which can implement a noise reduction function in addition to collecting sound signals. In other embodiments, the electronic device 600 may also be provided with three, four, or more microphones 670C to implement sound signal collection, noise reduction, sound source identification, directional recording functions, and the like.
The touch sensor 680K, also referred to as a "touch panel". The touch sensor 680K may be disposed on the display 694, and the touch sensor 680K and the display 694 form a touch screen, which is also referred to as a "touch screen". The touch sensor 680K is used to detect a touch operation acting on or near it. Touch sensor 680K can communicate detected touch operations to application processor 610 to determine a touch event type. Visual output related to touch operations may be provided through the display 694. In other embodiments, the touch sensor 680K may also be disposed on a surface of the electronic device 600 at a different location than the display 694.
The keys 690 include a power on key, a volume key, etc. The keys 690 may be mechanical keys. Or may be a touch key. The electronic device 600 may receive key inputs, generate key signal inputs related to user settings and function controls of the electronic device 600.
Referring to fig. 12, a functional block diagram of an electronic device 600 according to an embodiment of the application is shown. As shown in fig. 12, the electronic device 600 includes at least one processor 610 (only one processor is shown in fig. 12), a memory 696, and a computer program 697 stored in the memory 696 and executable on the at least one processor 610, which when executed by the processor 610, causes the electronic device 600 to perform the steps of any of the methods described above.
It will be appreciated by those skilled in the art that fig. 12 is merely an example of an electronic device 600 and is not intended to limit the electronic device 600, and in practice the electronic device 600 may include more or less components than those illustrated, or may combine certain components, or different components, such as may also include input-output devices, network access devices, etc.
The processor 610 may be a central processing unit (Central Processing Unit, CPU), another general-purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application-specific integrated circuit (Application-Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like.
Memory 696 may be an internal storage unit of the electronic device 600 in some embodiments, such as a hard disk or memory of the electronic device 600. In other embodiments, memory 696 may also be an external storage device of the electronic device 600, such as a plug-in hard disk, a smart media card (Smart Media Card, SMC), a secure digital (Secure Digital, SD) card, or a flash card (Flash Card) provided on the electronic device 600. Optionally, memory 696 may also include both an internal storage unit and an external storage device of the electronic device 600. Memory 696 is used to store an operating system, application programs, a boot loader, data, and other programs, such as the program code of a computer program. Memory 696 may also be used to temporarily store data that has been output or is to be output.
It should be noted that, because the content of information interaction and execution process between the above devices/units is based on the same concept as the method embodiment of the present application, specific functions and technical effects thereof may be referred to in the method embodiment section, and will not be described herein.
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-described division of the functional units and modules is illustrated, and in practical application, the above-described functional distribution may be performed by different functional units and modules according to needs, i.e. the internal structure of the apparatus is divided into different functional units or modules to perform all or part of the above-described functions. The functional units and modules in the embodiment may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit, where the integrated units may be implemented in a form of hardware or a form of a software functional unit. In addition, the specific names of the functional units and modules are only for distinguishing from each other, and are not used for limiting the protection scope of the present application. The specific working process of the units and modules in the above system may refer to the corresponding process in the foregoing method embodiment, which is not described herein again.
The embodiment of the application also provides a computer readable storage medium, and the computer readable storage medium stores a computer program, and when the computer program is called by electronic equipment, the method described in each method embodiment can be realized.
Embodiments of the present application also provide a computer program product having a program code stored on a computer readable storage medium for causing an electronic device to perform the above-mentioned related steps when the computer program product is run on the electronic device, so as to implement the methods described in the above-mentioned respective method embodiments.
The integrated units described above, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer readable storage medium. Based on such understanding, the present application may implement all or part of the flow of the methods of the above-described embodiments by instructing related hardware through a computer program, which may be stored in a computer readable storage medium; when the computer program is executed by a processor, the steps of the method embodiments described above may be implemented. The computer program comprises computer program code, which may be in the form of source code, object code, an executable file, some intermediate form, or the like. The computer readable medium may include at least any entity or device capable of carrying the computer program code to a camera device/electronic apparatus, a recording medium, a computer memory, a read-only memory (Read-Only Memory, ROM), a random access memory (Random Access Memory, RAM), an electrical carrier signal, a telecommunications signal, and a software distribution medium, such as a USB flash drive, a removable hard disk, a magnetic disk, or an optical disk. In some jurisdictions, in accordance with legislation and patent practice, computer readable media may not include electrical carrier signals and telecommunications signals.
In the foregoing embodiments, the descriptions of the embodiments are emphasized, and in part, not described or illustrated in any particular embodiment, reference is made to the related descriptions of other embodiments.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus/device and method may be implemented in other manners. For example, the apparatus/device embodiments described above are merely illustrative, e.g., the division of modules or elements is merely a logical functional division, and there may be additional divisions when actually implemented, e.g., multiple elements or components may be combined or integrated into another system, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection via interfaces, devices or units, which may be in electrical, mechanical or other forms.
The units described above as separate components may or may not be physically separate, and components shown as units may or may not be physical units, may be located in one place, or may be distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
It should be understood that the terms "comprises" and/or "comprising," when used in this specification and the appended claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It should also be understood that the term "and/or" as used in the present specification and the appended claims refers to any and all possible combinations of one or more of the associated listed items, and includes such combinations.
Furthermore, the terms "first," "second," "third," and the like in the description of the present specification and in the appended claims, are used for distinguishing between descriptions and not necessarily for indicating or implying a relative importance.
Reference in the specification to "one embodiment" or "some embodiments" or the like means that a particular feature, structure, or characteristic described in connection with the embodiment is included in one or more embodiments of the application. Thus, appearances of the phrases "in one embodiment," "in some embodiments," "in other embodiments," and the like in various places throughout this specification are not necessarily all referring to the same embodiment, but mean "one or more, but not all, embodiments" unless expressly specified otherwise. The terms "comprising," "including," "having," and variations thereof mean "including but not limited to," unless expressly specified otherwise.
It should be noted that the above-mentioned embodiments are merely intended to illustrate the technical solutions of the present application, not to limit them. Although the present application has been described in detail with reference to the above-mentioned embodiments, it will be understood by those skilled in the art that the technical solutions described in the above-mentioned embodiments may still be modified, or some technical features may be equivalently replaced, and such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present application.
Claims (11)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311855849.2A CN118447867B (en) | 2023-12-28 | 2023-12-28 | Music separation method, music separation device, electronic apparatus, and storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311855849.2A CN118447867B (en) | 2023-12-28 | 2023-12-28 | Music separation method, music separation device, electronic apparatus, and storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN118447867A CN118447867A (en) | 2024-08-06 |
CN118447867B true CN118447867B (en) | 2025-01-03 |
Family
ID=92318411
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202311855849.2A Active CN118447867B (en) | 2023-12-28 | 2023-12-28 | Music separation method, music separation device, electronic apparatus, and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN118447867B (en) |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105336325A (en) * | 2015-09-25 | 2016-02-17 | 百度在线网络技术(北京)有限公司 | Speech signal recognition and processing method and device |
CN111370019A (en) * | 2020-03-02 | 2020-07-03 | 字节跳动有限公司 | Sound source separation method and device, and model training method and device of neural network |
Family Cites Families (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
GB201204324D0 (en) * | 2012-03-12 | 2012-04-25 | Jaguar Cars | Audio system |
US11307825B1 (en) * | 2021-02-28 | 2022-04-19 | International Business Machines Corporation | Recording a separated sound from a sound stream mixture on a personal device |
CN114520005A (en) * | 2022-02-21 | 2022-05-20 | Oppo广东移动通信有限公司 | Audio processing method, device, equipment and computer readable storage medium |
CN116884391B (en) * | 2023-09-06 | 2023-12-01 | 中国科学院自动化研究所 | Multi-modal fusion audio generation method and device based on diffusion model |
Also Published As
Publication number | Publication date |
---|---|
CN118447867A (en) | 2024-08-06 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant | ||
CP03 | Change of name, title or address | ||
CP03 | Change of name, title or address |
Address after: Unit 3401, unit a, building 6, Shenye Zhongcheng, No. 8089, Hongli West Road, Donghai community, Xiangmihu street, Futian District, Shenzhen, Guangdong 518040 Patentee after: Honor Terminal Co.,Ltd. Country or region after: China Address before: 3401, unit a, building 6, Shenye Zhongcheng, No. 8089, Hongli West Road, Donghai community, Xiangmihu street, Futian District, Shenzhen, Guangdong Patentee before: Honor Device Co.,Ltd. Country or region before: China |