CN110070885A

CN110070885A - Audio originates point detecting method and device

Info

Publication number: CN110070885A
Application number: CN201910151018.4A
Authority: CN
Inventors: 李为; 黄传增; 李琰
Original assignee: Beijing ByteDance Network Technology Co Ltd
Current assignee: Douyin Vision Co Ltd; Douyin Vision Beijing Co Ltd
Priority date: 2019-02-28
Filing date: 2019-02-28
Publication date: 2019-07-30
Anticipated expiration: 2039-02-28
Also published as: CN110070885B

Abstract

The present disclosure discloses a kind of audio starting point detecting method, device, electronic equipment and computer readable storage mediums.Wherein audio starting point detecting method includes: to determine corresponding first voice spectrum parameters of each frequency range according to frequency-region signal corresponding with the audio signal of audio；For each frequency range, the second voice spectrum parameters of current frequency range are determined according to the first voice spectrum parameters of the first voice spectrum parameters of current frequency range and the predetermined number frequency range chosen from remaining frequency range；One or more initial point positions of the note and syllable in the audio are determined according to corresponding second voice spectrum parameters of each frequency range.The embodiment of the present disclosure is due to having references to corresponding first voice spectrum parameters of multiple frequency ranges when determining the second voice spectrum parameters, so that the second voice spectrum parameters determined are more accurate, the starting point for going out the note and syllable in audio so as to accurate detection, the occurrence of reducing erroneous detection and missing inspection.

Description

Audio starting point detection method and device

Technical Field

The present disclosure relates to the field of image processing technologies, and in particular, to a method and an apparatus for detecting an audio starting point, an electronic device, and a computer-readable storage medium.

Background

Audio onset detection is an information extraction algorithm applied to audio signals with the goal of accurately detecting the locations of the onset of notes and syllables. Wherein the note (note) refers to a music signal; syllable (phone) is specifically used for voice and human voice signals. Audio start point detection has many important uses and application prospects in the field of signal processing, for example, as follows: the method is used for automatically segmenting and labeling human voice and music audio, extracting information, compressing in a segmented mode and playing interactive entertainment. Fig. 1a and 1b show start point detection, where fig. 1a is an audio signal and fig. 1b is the detected start point position.

In the prior art, a voice spectrum parameter curve corresponding to an audio signal is usually calculated, a local maximum point of the curve is determined according to the voice spectrum parameter curve, a voice spectrum parameter corresponding to a point change is compared with a set threshold value, and if the voice spectrum parameter is greater than the threshold value, a position corresponding to the point is determined as a starting point position.

However, the above algorithm is mainly suitable for audio signals with clear boundaries and relatively single rhythm (e.g., fast-rhythm music with clear note boundaries and relatively single rhythm), and for some audio signals with complex rhythm but poor feeling (e.g., music mixed by multiple musical instruments, music with slow rhythm, and human voice), the above detection algorithm cannot accurately detect the boundaries, and frequent false detection and missed detection occur.

Disclosure of Invention

In a first aspect, an embodiment of the present disclosure provides an audio starting point detection method, including:

determining a first voice spectrum parameter corresponding to each frequency band according to a frequency domain signal corresponding to an audio signal of the audio;

aiming at each frequency band, determining a second voice spectrum parameter of the current frequency band according to a first voice spectrum parameter of the current frequency band and a first voice spectrum parameter of a preset number of frequency bands selected from the rest frequency bands;

and determining one or more starting point positions of notes and syllables in the audio according to the second voice spectrum parameters corresponding to the frequency bands.

Further, the determining, for each frequency band, a second speech spectrum parameter of the current frequency band according to the first speech spectrum parameter of the current frequency band and the first speech spectrum parameters of a preset number of frequency bands selected from the remaining frequency bands includes:

and aiming at each frequency band, determining the mean value of the first voice spectrum parameters according to the first voice spectrum parameters of the current frequency band and the first voice spectrum parameters of a preset number of frequency bands selected from the rest frequency bands, and taking the mean value as the second voice spectrum parameters of the current frequency band.

aiming at each frequency band, determining the mean value of the first voice spectrum parameters according to the first voice spectrum parameters of the current frequency band and the first voice spectrum parameters of a preset number of frequency bands selected from the rest frequency bands;

and determining a second voice spectrum parameter of the current frequency band according to the first voice spectrum parameter of the current frequency band and the average value.

Further, the determining the second speech spectrum parameter of the current frequency band according to the first speech spectrum parameter of the current frequency band and the average value includes:

calculating the difference value between the first voice spectrum parameter of the current frequency band and the average value;

and determining a second voice spectrum parameter of the current frequency band according to the difference value.

Further, the determining the second speech spectrum parameter of the current frequency band according to the difference value includes:

and determining the mean value of the difference values according to the difference values corresponding to the current frequency band and the difference values corresponding to the preset number of frequency bands selected from the rest frequency bands, and taking the mean value of the difference values as a second voice spectrum parameter of the current frequency band.

Further, the remaining frequency bands are all frequency bands located before the current frequency band according to a time sequence.

Further, the determining the positions of one or more start points of notes and syllables in the audio according to the second speech spectrum parameters corresponding to the frequency bands includes:

drawing a voice spectrum parameter curve according to the second voice spectrum parameters corresponding to the frequency bands;

and determining a local highest point according to the voice spectrum parameter curve, and determining one or more starting point positions of notes and syllables in the audio frequency according to a second voice spectrum parameter corresponding to the local highest point.

Further, the determining the first speech spectrum parameter corresponding to each frequency band according to the frequency domain signal corresponding to the audio signal of the audio includes:

segmenting the audio signal of the audio frequency into a plurality of sub audio signals, and converting each sub audio signal into a frequency domain signal, wherein each sub audio signal corresponds to one frequency band;

and determining first voice spectrum parameters corresponding to each frequency band.

In a second aspect, an embodiment of the present disclosure provides an audio starting point detecting apparatus, including:

the first parameter determining module is used for determining first voice spectrum parameters corresponding to each frequency band according to frequency domain signals corresponding to audio signals of the audio frequency;

the second parameter determination module is used for determining a second voice spectrum parameter of the current frequency band according to the first voice spectrum parameter of the current frequency band and the first voice spectrum parameters of a preset number of frequency bands selected from the rest frequency bands aiming at each frequency band;

and the starting point determining module is used for determining one or more starting point positions of notes and syllables in the audio according to the second voice spectrum parameters corresponding to the frequency bands.

Further, the second parameter determining module is specifically configured to: and aiming at each frequency band, determining the mean value of the first voice spectrum parameters according to the first voice spectrum parameters of the current frequency band and the first voice spectrum parameters of a preset number of frequency bands selected from the rest frequency bands, and taking the mean value as the second voice spectrum parameters of the current frequency band.

Further, the second parameter determination module comprises:

the mean value determining unit is used for determining the mean value of the first voice spectrum parameters according to the first voice spectrum parameters of the current frequency band and the first voice spectrum parameters of the preset number of frequency bands selected from the rest frequency bands aiming at each frequency band;

and the second parameter determining unit is used for determining the second voice spectrum parameter of the current frequency band according to the first voice spectrum parameter of the current frequency band and the average value.

Further, the second parameter determining unit is specifically configured to: calculating the difference value between the first voice spectrum parameter of the current frequency band and the average value; and determining a second voice spectrum parameter of the current frequency band according to the difference value.

Further, the second parameter determining unit is specifically configured to: and determining the mean value of the difference values according to the difference values corresponding to the current frequency band and the difference values corresponding to the preset number of frequency bands selected from the rest frequency bands, and taking the mean value of the difference values as a second voice spectrum parameter of the current frequency band.

Further, the starting point determining module is specifically configured to: drawing a voice spectrum parameter curve according to the second voice spectrum parameters corresponding to the frequency bands; and determining a local highest point according to the voice spectrum parameter curve, and determining one or more starting point positions of notes and syllables in the audio frequency according to a second voice spectrum parameter corresponding to the local highest point.

Further, the device first parameter determining module is specifically configured to: segmenting the audio signal of the audio frequency into a plurality of sub audio signals, and converting each sub audio signal into a frequency domain signal, wherein each sub audio signal corresponds to one frequency band; and determining first voice spectrum parameters corresponding to each frequency band.

In a third aspect, an embodiment of the present disclosure provides an electronic device, including: at least one processor; and the number of the first and second groups,

a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the audio onset detection method of any of the preceding first aspects.

In a fourth aspect, the disclosed embodiments provide a non-transitory computer-readable storage medium, wherein the non-transitory computer-readable storage medium stores computer instructions for causing a computer to execute any one of the audio onset detection methods in the foregoing first aspect.

According to the method and the device, the first voice spectrum parameters corresponding to each frequency band are determined according to the frequency domain signals corresponding to the audio signals of the audio, the second voice spectrum parameters of the current frequency band are determined according to the first voice spectrum parameters of the current frequency band and the first voice spectrum parameters of the preset number of frequency bands selected from the rest frequency bands aiming at each frequency band, one or more starting point positions of notes and syllables in the audio are determined according to the second voice spectrum parameters corresponding to each frequency band, and the determined second voice spectrum parameters are more accurate due to the fact that the first voice spectrum parameters corresponding to the plurality of frequency bands are referred when the second voice spectrum parameters are determined, so that the starting points of the notes and the syllables in the audio can be accurately detected, and the situations of false detection and false detection are reduced.

The foregoing is a summary of the present disclosure, and for the purposes of promoting a clear understanding of the technical means of the present disclosure, the present disclosure may be embodied in other specific forms without departing from the spirit or essential attributes thereof.

Drawings

In order to more clearly illustrate the embodiments of the present disclosure or the technical solutions in the prior art, the drawings needed to be used in the description of the embodiments or the prior art will be briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present disclosure, and other drawings can be obtained according to the drawings without creative efforts for those skilled in the art.

FIG. 1a is a schematic diagram of an audio signal provided by the prior art;

FIG. 1b is a diagram illustrating a detection result of an audio start point according to the prior art;

fig. 2a is a flowchart of an audio starting point detection method according to an embodiment of the disclosure;

fig. 2b is a schematic diagram of an audio signal in an audio starting point detection method according to an embodiment of the disclosure;

fig. 2c is a speech frequency spectrum diagram of an audio signal in an audio starting point detection method according to an embodiment of the disclosure;

fig. 3 is a flowchart of an audio starting point detection method according to a second embodiment of the disclosure;

fig. 4a is a flowchart of an audio starting point detection method according to a third embodiment of the disclosure;

FIG. 4b is a graph of the speech spectral parameter composition in the audio starting point detection method provided by the prior art;

fig. 4c is a graph of the first speech spectrum parameter composition in the audio starting point detection method according to the third embodiment of the disclosure;

fig. 4d is a graph of second speech spectrum parameter composition in the audio starting point detection method according to the third embodiment of the disclosure;

fig. 4e is a schematic diagram of an audio signal in the audio starting point detection method according to the third embodiment of the disclosure;

fig. 4f is a schematic diagram of an audio signal detection result shown in fig. 4e obtained by using a conventional starting point detection method according to a third embodiment of the present disclosure;

fig. 4g is a schematic diagram of the detection result of the audio signal shown in fig. 4e obtained by the audio starting point detection method according to the third embodiment of the disclosure;

fig. 5 is a schematic structural diagram of an audio starting point detection apparatus according to a fourth embodiment of the disclosure;

fig. 6 is a schematic structural diagram of an electronic device according to a fifth embodiment of the present disclosure.

Detailed Description

The embodiments of the present disclosure are described below with specific examples, and other advantages and effects of the present disclosure will be readily apparent to those skilled in the art from the disclosure in the specification. It is to be understood that the described embodiments are merely illustrative of some, and not restrictive, of the embodiments of the disclosure. The disclosure may be embodied or carried out in various other specific embodiments, and various modifications and changes may be made in the details within the description without departing from the spirit of the disclosure. It is to be noted that the features in the following embodiments and examples may be combined with each other without conflict. All other embodiments, which can be derived by a person skilled in the art from the embodiments disclosed herein without making any creative effort, shall fall within the protection scope of the present disclosure.

It is noted that various aspects of the embodiments are described below within the scope of the appended claims. It should be apparent that the aspects described herein may be embodied in a wide variety of forms and that any specific structure and/or function described herein is merely illustrative. Based on the disclosure, one skilled in the art should appreciate that one aspect described herein may be implemented independently of any other aspects and that two or more of these aspects may be combined in various ways. For example, an apparatus may be implemented and/or a method practiced using any number of the aspects set forth herein. Additionally, such an apparatus may be implemented and/or such a method may be practiced using other structure and/or functionality in addition to one or more of the aspects set forth herein.

It should be noted that the drawings provided in the following embodiments are only for illustrating the basic idea of the present disclosure, and the drawings only show the components related to the present disclosure rather than the number, shape and size of the components in actual implementation, and the type, amount and ratio of the components in actual implementation may be changed arbitrarily, and the layout of the components may be more complicated.

In addition, in the following description, specific details are provided to facilitate a thorough understanding of the examples. However, it will be understood by those skilled in the art that the aspects may be practiced without these specific details.

Example one

Fig. 2a is a flowchart of an audio starting point detection method according to an embodiment of the present disclosure, where the audio starting point detection method according to this embodiment may be executed by an audio starting point detection apparatus, and the audio starting point detection apparatus may be implemented as software, or implemented as a combination of software and hardware, and the audio starting point detection apparatus may be integrated in a certain device in an audio starting point detection system, such as an audio starting point detection server or an audio starting point detection terminal device. The embodiment can be applied to some scenes with less complex audio with weak rhythm (such as music mixed by multiple instruments, music with slower rhythm and human voice). As shown in fig. 2a, the method comprises the steps of:

step S21: and determining a first voice spectrum parameter corresponding to each frequency band according to the frequency domain signal corresponding to the audio signal of the audio.

The audio signal may be a piece of music or voice, and the corresponding frequency domain signal is obtained by converting the audio signal of the time domain into the frequency domain.

Here, in order to distinguish different speech spectral parameters appearing herein, the first speech spectral parameter and the second speech spectral parameter will be referred to as a first speech spectral parameter and a second speech spectral parameter, respectively, according to the order of appearance.

Wherein the first speech spectral parameter may be determined from the spectral magnitude and phase.

In an optional embodiment, step S21 specifically includes:

step S211: the audio signal of the audio is segmented into a plurality of sub audio signals, and each sub audio signal is converted into a frequency domain signal, wherein each sub audio signal corresponds to one frequency band.

Step S212: and determining first voice spectrum parameters corresponding to each frequency band.

Specifically, the audio signal is a one-dimensional series of discrete times, which can be expressed as: x ═ X₁,x₂…x_NWhere N is the total number of discrete sample points. Although the audio signal is a time-non-periodic signal, the audio signal can approximately exhibit a stationary (approximately periodic) characteristic in a short-time range (usually the short time is defined as 10-40 ms), so that the audio signal can be divided into short-time speech segments of equal length, i.e. sub-audio signals, for analysis. For example, as shown in fig. 2b, for an audio signal with a sampling rate of 16000Hz, 512 sample points can be selected as one sub-audio signal, which corresponds to a speech length of 32 ms.

Here, the fourier transform may be used to convert the audio signal in the time domain into the audio signal in the frequency domain, and the frequency information that changes with time is called a spectrogram, as shown in fig. 2c, the energy changes of the sub-audio signals in different frequency bands can be clearly seen, and it can be seen that at the starting point position, the frequency spectrum has obvious step changes.

Wherein, the corresponding frequency domain signal can be expressed as:where n denotes an nth sub audio signal, L denotes a length of the sub audio signal, and k denotes a kth frequency band.

Accordingly, when the audio signal is divided into a plurality of sub-audio signals, the first speech spectral parameter may specifically be a synthesized weighting of spectral amplitudes and phases of different sub-audio signals, for example, a formula Is calculated to obtain wherein_n(k) L is the amplitude of the kth frequency band, whereIs a second order phase difference of a k-th frequency band, wherein WhereinIs a first order phase difference of a k-th frequency band, wherein WhereinThe phase of the k-th band. The second order difference of the phase is adopted in the embodiment to better represent the starting point information.

Step S22: and aiming at each frequency band, determining a second voice spectrum parameter of the current frequency band according to the first voice spectrum parameter of the current frequency band and the first voice spectrum parameters of the preset number of frequency bands selected from the rest frequency bands.

The preset number can be set by user.

In order to ensure real-time performance of the starting point detection, the remaining frequency bands may be all frequency bands located before the current frequency band according to a time sequence.

Specifically, in determining the second speech spectrum parameters of each frequency band, firstly, any frequency band is selected as the current frequency band, then, the second speech spectrum parameters of the current frequency band are determined according to the first speech spectrum parameters of the current frequency band and the first speech spectrum parameters of a preset number of frequency bands selected from the remaining frequency bands, then, any frequency band is selected from the remaining frequency bands as the current frequency band, and the above operations are repeatedly executed until the second speech spectrum parameters of all frequency bands are determined.

Step S23: and determining one or more starting point positions of notes and syllables in the audio according to the second speech frequency spectrum parameters corresponding to each frequency band.

In an alternative embodiment, step S23 includes:

s231: and drawing a voice spectrum parameter curve according to the second voice spectrum parameters corresponding to each frequency band.

S232: and determining a local highest point according to the voice spectrum parameter curve, and determining one or more starting point positions of notes and syllables in the audio according to a second voice spectrum parameter corresponding to the local highest point.

According to the method and the device, the first voice spectrum parameters corresponding to each frequency band are determined according to the frequency domain signals corresponding to the audio signals of the audio, the second voice spectrum parameters of the current frequency band are determined according to the first voice spectrum parameters of the current frequency band and the first voice spectrum parameters of the preset number of frequency bands selected from the rest frequency bands aiming at each frequency band, one or more starting point positions of notes and syllables in the audio are determined according to the second voice spectrum parameters corresponding to each frequency band, and the determined second voice spectrum parameters are more accurate due to the fact that the first voice spectrum parameters corresponding to the frequency bands are referred when the second voice spectrum parameters are determined, so that the starting points of the notes and the syllables in the audio can be accurately detected, and the situations of false detection and false detection are reduced.

Example two

Fig. 3 is a flowchart of an audio starting point detection method according to a second embodiment of the present disclosure, where in this embodiment, a second speech spectrum parameter of a current frequency band is determined according to a first speech spectrum parameter of the current frequency band and a first speech spectrum parameter of a preset number of frequency bands selected from remaining frequency bands to perform further optimization, and this embodiment is applicable to some scenes of audio with a poor complex rhythm (e.g., music mixed by multiple musical instruments, music with a slow rhythm, and human voice). As shown in fig. 3, the method specifically includes:

step S31: and determining a first voice spectrum parameter corresponding to each frequency band according to the frequency domain signal corresponding to the audio signal of the audio.

Step S32: and aiming at each frequency band, determining the mean value of the first voice spectrum parameters according to the first voice spectrum parameters of the current frequency band and the first voice spectrum parameters of the preset number of frequency bands selected from the rest frequency bands, and taking the mean value as the second voice spectrum parameters of the current frequency band.

Step S33: and determining one or more starting point positions of notes and syllables in the audio according to the second speech frequency spectrum parameters corresponding to each frequency band.

According to the embodiment of the disclosure, the first voice spectrum parameters corresponding to the multiple frequency bands are referred to when the second voice spectrum parameters are determined, so that the determined second voice spectrum parameters are more accurate, the starting points of the notes and the syllables in the audio frequency can be accurately detected, the occurrence of false detection and missing detection is reduced, and the burr phenomenon in a curve graph formed by the mean values can be improved by determining the mean values of the frequency bands and determining the starting point positions of the notes and the syllables in the audio frequency according to the mean values corresponding to the frequency bands, so that the accuracy of the starting point detection is further improved.

EXAMPLE III

Fig. 4a is a flowchart of an audio starting point detection method provided in the third embodiment of the present disclosure, and in this embodiment, based on the above embodiments, the second speech spectral parameter of the current frequency band is determined according to the first speech spectral parameter of the current frequency band and the first speech spectral parameters of the preset number of frequency bands selected from the remaining frequency bands to perform further optimization. As shown in fig. 4a, the method specifically includes:

step S41: and determining a first voice spectrum parameter corresponding to each frequency band according to the frequency domain signal corresponding to the audio signal of the audio.

Step S42: and aiming at each frequency band, determining the mean value of the first voice spectrum parameters according to the first voice spectrum parameters of the current frequency band and the first voice spectrum parameters of the preset number of frequency bands selected from the rest frequency bands.

Step S43: and determining a second voice spectrum parameter of the current frequency band according to the first voice spectrum parameter and the average value of the current frequency band.

In an alternative embodiment, step S43 includes:

step S431: and calculating the difference value between the first voice spectrum parameter of the current frequency band and the mean value.

Step S432: and determining a second voice spectrum parameter of the current frequency band according to the difference value.

Further, the step S432 may have the following two embodiments: in the first embodiment, the difference is used as the second speech spectral parameter, and then when the starting point is determined according to the second speech spectral parameter, the signal offset phenomenon in the curve formed by the speech spectral parameters in the prior art can be reduced, as shown in fig. 4b, the curve formed by the speech spectral parameters in the prior art is shown, as shown in fig. 4c, the curve formed by the speech spectral parameters in the present scheme is shown; in the second embodiment, the mean value of the difference values is determined according to the difference value corresponding to the current frequency band and the difference values corresponding to the preset number of frequency bands selected from the remaining frequency bands, the mean value of the difference values is used as the second speech spectrum parameter of the current frequency band, and then when the starting point is determined according to the second speech spectrum parameter, the glitch signal in the curve of fig. 4c can be subtracted, as shown in fig. 4d, the curve is formed by the speech spectrum parameters in the scheme. For example, as shown in fig. 4e, it is a schematic diagram of an audio signal, fig. 4f is a schematic diagram of a detection result of a starting point obtained by detecting the audio signal of fig. 4e by using a method in the prior art, and fig. 4g is a schematic diagram of a detection result of a starting point obtained by detecting the audio signal of fig. 4e by using the method in this embodiment.

Step S44: and determining one or more starting point positions of notes and syllables in the audio according to the second speech frequency spectrum parameters corresponding to each frequency band.

According to the embodiment of the disclosure, the first voice spectrum parameters corresponding to the multiple frequency bands are referred to when the second voice spectrum parameters are determined, so that the determined second voice spectrum parameters are more accurate, the starting points of the musical notes and the syllables in the audio frequency can be accurately detected, the occurrence of false detection and missing detection is reduced, the starting points of the musical notes and the syllables in the audio frequency are determined according to the differences corresponding to the frequency bands by determining the mean value and the difference value of each frequency band, the burr phenomenon and the signal deviation in the existing curve graph can be improved, and the accuracy rate of the starting point detection is further improved.

Example four

Fig. 5 is a schematic structural diagram of an audio starting point detection apparatus according to a fourth embodiment of the present disclosure, where the audio starting point detection apparatus may be implemented as software, or implemented as a combination of software and hardware, and the audio starting point detection apparatus may be integrated in a certain device in an audio starting point detection system, such as an audio starting point detection server or an audio starting point detection terminal device. The embodiment can be applied to some scenes with less complex audio with weak rhythm (such as music mixed by multiple instruments, music with slower rhythm and human voice). As shown in fig. 5, the apparatus includes: a first parameter determination module 51, a second parameter determination module 52 and a starting point determination module 53; wherein,

the first parameter determining module 51 is configured to determine a first speech spectrum parameter corresponding to each frequency band according to a frequency domain signal corresponding to an audio signal of an audio;

the second parameter determining module 52 is configured to determine, for each frequency band, a second speech spectrum parameter of the current frequency band according to the first speech spectrum parameter of the current frequency band and the first speech spectrum parameters of a preset number of frequency bands selected from the remaining frequency bands;

the starting point determining module 53 is configured to determine one or more starting point positions of the notes and the syllables in the audio according to the second speech spectrum parameters corresponding to the frequency bands.

Further, the second parameter determination module 52 includes: an average value determining unit 521 and a second parameter determining unit 522; wherein,

the mean value determining unit 521 is configured to determine, for each frequency band, a mean value of the first speech spectrum parameter according to the first speech spectrum parameter of the current frequency band and the first speech spectrum parameters of a preset number of frequency bands selected from the remaining frequency bands;

the second parameter determining unit 522 is configured to determine a second speech spectrum parameter of the current frequency band according to the first speech spectrum parameter of the current frequency band and the average value.

Further, the second parameter determining unit 522 is specifically configured to: calculating the difference value between the first voice spectrum parameter of the current frequency band and the average value; and determining a second voice spectrum parameter of the current frequency band according to the difference value.

Further, the second parameter determining unit 522 is specifically configured to: and determining the mean value of the difference values according to the difference values corresponding to the current frequency band and the difference values corresponding to the preset number of frequency bands selected from the rest frequency bands, and taking the mean value of the difference values as a second voice spectrum parameter of the current frequency band.

Further, the starting point determining module 53 is specifically configured to: drawing a voice spectrum parameter curve according to the second voice spectrum parameters corresponding to the frequency bands; and determining a local highest point according to the voice spectrum parameter curve, and determining one or more starting point positions of notes and syllables in the audio frequency according to a second voice spectrum parameter corresponding to the local highest point.

Further, the first parameter determining module 51 is specifically configured to: segmenting the audio signal of the audio frequency into a plurality of sub audio signals, and converting each sub audio signal into a frequency domain signal, wherein each sub audio signal corresponds to one frequency band; and determining first voice spectrum parameters corresponding to each frequency band.

For detailed descriptions of the working principle, the implemented technical effect, and the like of the embodiment of the audio starting point detection apparatus, reference may be made to the related descriptions in the foregoing embodiment of the audio starting point detection method, and further description is omitted here.

EXAMPLE five

Referring now to FIG. 6, shown is a schematic diagram of an electronic device suitable for use in implementing embodiments of the present disclosure. The electronic devices in the embodiments of the present disclosure may include, but are not limited to, mobile terminals such as mobile phones, notebook computers, digital broadcast receivers, PDAs (personal digital assistants), PADs (tablet computers), PMPs (portable multimedia players), in-vehicle terminals (e.g., car navigation terminals), and the like, and fixed terminals such as digital TVs, desktop computers, and the like. The electronic device shown in fig. 6 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present disclosure.

As shown in fig. 6, the electronic device may include a processing means (e.g., a central processing unit, a graphics processor, etc.) 601, which may perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM)602 or a program loaded from a storage means 608 into a Random Access Memory (RAM) 603. In the RAM 603, various programs and data necessary for the operation of the electronic apparatus are also stored. The processing device 601, the ROM 602, and the RAM 603 are connected to each other via a bus 604. An input/output (I/O) interface 605 is also connected to bus 604.

Generally, the following devices may be connected to the I/O interface 605: input devices 606 including, for example, a touch screen, touch pad, keyboard, mouse, image sensor, microphone, accelerometer, gyroscope, etc.; output devices 607 including, for example, a Liquid Crystal Display (LCD), a speaker, a vibrator, and the like; storage 608 including, for example, tape, hard disk, etc.; and a communication device 609. The communication means 609 may allow the electronic device to communicate with other devices wirelessly or by wire to exchange data. While fig. 6 illustrates an electronic device having various means, it is to be understood that not all illustrated means are required to be implemented or provided. More or fewer devices may alternatively be implemented or provided.

In particular, according to an embodiment of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network via the communication means 309, or installed from the storage means 608, or installed from the ROM 602. The computer program, when executed by the processing device 601, performs the above-described functions defined in the methods of the embodiments of the present disclosure.

It should be noted that the computer readable medium in the present disclosure can be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In contrast, in the present disclosure, a computer readable signal medium may comprise a propagated data signal with computer readable program code embodied therein, either in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, optical cables, RF (radio frequency), etc., or any suitable combination of the foregoing.

The computer readable medium may be embodied in the electronic device; or may exist separately without being assembled into the electronic device.

The computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to: determining a first voice spectrum parameter corresponding to each frequency band according to a frequency domain signal corresponding to the audio signal; determining a second voice spectrum parameter of the current frequency band according to the first voice spectrum parameter of the current frequency band and the first voice spectrum parameters of a preset number of frequency bands selected from the rest frequency bands; and determining the position of the starting point according to the second voice spectrum parameters corresponding to the frequency bands.

Computer program code for carrying out operations for aspects of the present disclosure may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C + +, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).

The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.

The units described in the embodiments of the present disclosure may be implemented by software or hardware. Where the name of a cell does not in some cases constitute a definition of the cell itself, for example, the drag point determination module may also be described as a "module for determining a drag point on a template image".

The foregoing description is only exemplary of the preferred embodiments of the disclosure and is illustrative of the principles of the technology employed. It will be appreciated by those skilled in the art that the scope of the disclosure herein is not limited to the particular combination of features described above, but also encompasses other embodiments in which any combination of the features described above or their equivalents does not depart from the spirit of the disclosure. For example, the above features and (but not limited to) the features disclosed in this disclosure having similar functions are replaced with each other to form the technical solution.

Claims

1. A method for audio origin detection, comprising:

2. The method for detecting an audio starting point according to claim 1, wherein the determining, for each frequency band, the second speech spectrum parameter of the current frequency band according to the first speech spectrum parameter of the current frequency band and the first speech spectrum parameters of a predetermined number of frequency bands selected from the remaining frequency bands comprises:

3. The method for detecting an audio starting point according to claim 1, wherein the determining, for each frequency band, the second speech spectrum parameter of the current frequency band according to the first speech spectrum parameter of the current frequency band and the first speech spectrum parameters of a predetermined number of frequency bands selected from the remaining frequency bands comprises:

4. The method for detecting an audio starting point according to claim 3, wherein the determining the second speech spectral parameter of the current band according to the first speech spectral parameter of the current band and the average value comprises:

5. The method for detecting an audio starting point according to claim 4, wherein the determining the second speech spectrum parameter of the current band according to the difference comprises:

6. The audio starting point detecting method according to any one of claims 1 to 5, wherein said remaining frequency bands are all frequency bands located chronologically before said current frequency band.

7. The method for detecting audio starting point according to any one of claims 1-5, wherein the determining the position of one or more starting points of notes and syllables in the audio according to the second speech spectrum parameters corresponding to the frequency bands comprises:

8. The method for detecting an audio starting point according to any one of claims 1 to 5, wherein the determining the first speech spectral parameters corresponding to each frequency band according to the frequency domain signal corresponding to the audio signal of the audio comprises:

9. An audio starting point detecting apparatus, comprising:

10. An electronic device, comprising:

a memory for storing non-transitory computer readable instructions; and

a processor for executing the computer readable instructions such that the processor when executing performs the audio onset detection method according to any of claims 1-8.

11. A computer-readable storage medium storing non-transitory computer-readable instructions which, when executed by a computer, cause the computer to perform the audio onset detection method of any one of claims 1-8.