CN106409301A - Digital audio signal processing method - Google Patents
- Publication number: CN106409301A
- Application number: CN201510447092.2A
- Authority: CN (China)
- Legal status: Pending (the status is an assumption by Google and is not a legal conclusion; Google has not performed a legal analysis)
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L19/00—Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
- G10L19/018—Audio watermarking, i.e. embedding inaudible data in the audio signal
Abstract
The digital audio signal processing method disclosed by the present invention embeds additional content of a fixed format in a digital audio signal, so that digital information can be transmitted covertly. The method mainly utilizes the masking effect of the human auditory system to make the digital audio signal carry predetermined data. With the method of the present invention, the data to be transmitted can be embedded at appropriate positions of the digital audio signal; when the digital audio signal is played, the audio signal representing the relevant data at the embedded positions is masked, so that human ears cannot perceive it, yet it can be received by a device possessing audio signal processing capability.
Description
Technical Field
The present invention relates to a digital audio signal processing technology, and more particularly, to a method for processing a digital audio signal using a masking effect based on psychoacoustics.
Background
The use of digital audio signals to carry information is a technology of great interest to the industry, into which considerable research effort and money have been invested. With such a technique, a user with a device having audio signal processing capability, such as a mobile communication terminal, can acquire the data information carried in music or a television program while listening or watching normally. An important criterion for assessing whether this technology is mature and suitable for use is that it must ensure both that the carried data can be accurately transmitted and collected, and that the digital audio signal itself can be played without producing disturbing sounds or noise perceptible to human listeners.
Chinese patent application 201410301832.7 discloses such a technique: the digital information to be transmitted is coded and modulated to form a sound coding signal, which is then mixed with the audio signal of a preselected audio-video program and output. Although this technique allows the digital information to be added to the normal sound by mixing, the "digital information to be transmitted" is unpredictable, so in a considerable number of cases the sound coding signal formed from it is heard as noise in the audio; in other cases it may produce other sounds that interfere with the normally played sound. To avoid such problems, the following improvement is proposed in the description of the above-mentioned patent application:
"The digital information to be transmitted is coded and modulated to form a sound coding signal. The sound coding signal can be written into a digital sound signal file, or converted into an analog sound signal through a digital-to-analog converter. The frequency of the analog sound signal can be selected in the band above 18kHz and below 20kHz, which is difficult for human ears to perceive and does not affect the normal playing of the original television soundtrack or music signal. Since in subsequent steps the digital information to be transmitted needs to be received and extracted by a receiving device local to the user, the sound-coded information must have the characteristic that its signal energy is distributed only within a certain frequency range, above 18kHz and below 20kHz."
Obviously, in order to prevent human ears from perceiving the sound coding formed from the "digital information to be transmitted", the above-mentioned scheme must confine the energy distribution of the sound-coded information to the frequency range of 18kHz to 20kHz.
As is well known, the full range of sound audible to the human ear is 20Hz to 20kHz. Adults with good hearing can usually hear frequencies between 30Hz and 16kHz; elderly people with poor hearing often hear only between 50Hz and 10kHz. Children, however, can generally hear higher frequencies. The sound in the 18kHz-to-20kHz range adopted in the above technical scheme can therefore be heard by many children. Thus, even if the energy of the sound-coded information is confined to the 18kHz-to-20kHz range, a considerable number of people, especially children, can still hear it; these people remain plagued by noise or interfering sounds when listening to television or radio programs that use this technology.
On the other hand, although the energy of the sound-coded information could instead be distributed outside the audible range (20Hz to 20kHz), the frequency response characteristics of most audio devices are designed and manufactured according to the audible range, and signals outside 20Hz to 20kHz are generally filtered out as noise. Therefore, even if the sound-coded information were mixed into normal audio signals, it could not be played by the audio devices and hence could not be acquired by the receiving devices.
In summary, the various techniques described above are clearly not mature and therefore cannot be widely used.
Disclosure of Invention
The invention aims to provide a digital audio signal processing method which processes the digital audio signal according to psychoacoustic principles, embedding the information to be transmitted into the digital audio signal as specific target data, so that when the digital audio signal is played by sound equipment, the embedded target data is played along with it and can be received and extracted by equipment with audio signal processing capability, without being perceived by human ears.
The above purpose of the invention is realized by adopting the following technical scheme:
framing the first digital audio signal into a plurality of audio frame data and performing windowing processing; performing a discrete Fourier transform on the plurality of audio frame data to obtain a plurality of first spectrum data corresponding to the plurality of audio frame data;
mapping the plurality of first spectrum data to the auditory critical bands (Bark domain), and calculating the masking threshold of each sub-band in the auditory critical bands, the masking thresholds corresponding one-to-one to the sub-bands;
selecting, from the plurality of first spectrum data, a frequency point below the masking threshold as an embedding position;
performing quantization processing on target data by using a quantizer capable of performing blind detection on a quantization result, and assigning the discrete Fourier coefficient at the embedding position by using the result of the quantization processing, thereby obtaining a plurality of second spectrum data corresponding to the plurality of first spectrum data;
and performing an inverse discrete Fourier transform on the plurality of second spectrum data to obtain a second digital audio signal.
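The embedding steps above can be sketched in code as follows. This is a minimal illustration only: the psychoacoustic masking analysis is replaced by a fixed, hypothetical bin index, the windowing/overlap handling is omitted, and the frame length and quantization step are arbitrary example values, not the patented parameters.

```python
import numpy as np

FRAME = 256    # samples per frame (a power of 2, as the description suggests)
BIN = 20       # hypothetical embedding position (would come from the masking model)
DELTA = 0.8    # hypothetical quantization step

def qim(value, bit, delta=DELTA):
    """Quantization index modulation: map a magnitude onto the lattice
    for `bit`, so the bit is recoverable without the original carrier."""
    return delta * round((value - bit * delta / 2) / delta) + bit * delta / 2

def embed(signal, bits):
    """Frame -> DFT -> quantize one coefficient per frame -> inverse DFT."""
    out = np.array(signal, dtype=float)
    for i, bit in enumerate(bits):
        spec = np.fft.rfft(out[i * FRAME:(i + 1) * FRAME])
        mag, phase = np.abs(spec[BIN]), np.angle(spec[BIN])
        # assign (replace) the coefficient at the embedding position
        spec[BIN] = qim(mag, bit) * np.exp(1j * phase)
        out[i * FRAME:(i + 1) * FRAME] = np.fft.irfft(spec, FRAME)
    return out

def extract(signal, n_bits):
    """Blind detection: re-transform and pick the nearer lattice per frame."""
    bits = []
    for i in range(n_bits):
        spec = np.fft.rfft(signal[i * FRAME:(i + 1) * FRAME])
        mag = np.abs(spec[BIN])
        bits.append(0 if abs(mag - qim(mag, 0)) <= abs(mag - qim(mag, 1)) else 1)
    return bits
```

The sketch shows why blind detection works: extraction needs only the shared lattice parameters, never the original carrier signal.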
By adopting the method of the invention, the target data to be transmitted can be embedded in the proper position of the first digital audio signal according to the psychoacoustic principle. When the first digital audio signal is played, the embedded signals at the embedding location that represent the relevant target data can be masked from being perceived by the human ear, but these embedded signals can be intercepted and restored by a device having audio signal processing capabilities.
It is another object of the present invention to provide a method of extracting data from a digital audio signal; by using the method, when the digital audio signal is played by the sound equipment, the received digital audio signal can be processed, and the target data embedded in the digital audio signal can be extracted by using the psychoacoustic principle.
Framing the received first digital audio signal into a plurality of audio frame data and performing windowing processing; performing a discrete Fourier transform on the plurality of audio frame data to obtain a plurality of first spectrum data respectively corresponding to the plurality of audio frame data;
mapping the plurality of first spectrum data to the auditory critical bands, and calculating the masking threshold of each sub-band in the auditory critical bands, the masking thresholds corresponding one-to-one to the sub-bands;
selecting frequency points smaller than corresponding masking thresholds in the plurality of first spectrum data as embedding positions;
performing inverse quantization processing on the discrete Fourier coefficients at the embedding positions, using a quantizer capable of blind detection of the quantization result, to obtain the target data sequence embedded in the first digital audio signal; wherein the target data sequence is formed by one or more specific audio data and/or coded data arranged in series in a predetermined order, each specific audio datum corresponding to a specific loudness and/or a specific pitch and/or timbre.
According to the above method, when the first digital audio signal is received, the target data sequence that it carries by means of the masking effect can be extracted from it using psychoacoustic principles, and the corresponding target data can be further recovered; throughout this process the embedded target data sequence is played out by the audio device together with the digital audio signal, yet is not perceived by the human ear.
Detailed Description
In the first class of embodiments of the present invention, some target data needs to be embedded into the target digital audio signal.
In order to embed the target data in a digital audio signal, the digital audio signal is first framed into a plurality of audio frame data, and each audio frame is windowed. A discrete Fourier transform is then performed on each windowed audio frame, yielding a plurality of first spectrum data in one-to-one correspondence with the audio frames.
After the plurality of first spectrum data are obtained, each is mapped to the auditory critical bands, and the masking threshold of each sub-band in the auditory critical bands is calculated; the masking thresholds correspond one-to-one to the sub-bands of the auditory critical bands.
Frequency points below the masking threshold are selected from the plurality of first spectrum data as the embedding positions of the target data. Then a quantizer capable of blind detection of the quantization result is used to quantize the target data, and the discrete Fourier coefficients at the embedding positions are assigned (replaced) with the quantization result, yielding second spectrum data corresponding to each first spectrum data.
the plurality of second spectrum data are inverse discrete fourier transformed to obtain a second digital audio signal. This newly acquired second digital audio signal has the above-mentioned target data embedded therein.
It should be noted that when the first digital audio signal is framed and windowed, the length of each audio frame and the size of the window can be determined by the relevant technicians according to specific design requirements, and at least two schemes are available. One scheme is similar to speech recognition techniques, i.e., adjacent frames overlap; in this case the window length is typically 25-35 ms and the frame shift is about 10 ms (it may be somewhat more or less). The other scheme uses non-overlapping frames, with the window length directly specified as a number of time-domain sampling points, generally a power of 2 (2^N, N a positive integer), such as 256 or 512 samples per window.
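As a small arithmetic sketch of the overlapped scheme, the number of frames follows directly from the window length and frame shift (the 16 kHz sampling rate is an assumed example value, not taken from the description):

```python
# Overlapped framing as in speech recognition: 25 ms window, 10 ms shift.
sample_rate = 16000
win_len = int(0.025 * sample_rate)   # 400 samples per window
shift = int(0.010 * sample_rate)     # 160 samples frame shift

def num_frames(n_samples, win, hop):
    """Number of complete windows that fit with the given frame shift."""
    if n_samples < win:
        return 0
    return 1 + (n_samples - win) // hop

print(num_frames(sample_rate, win_len, shift))  # frames in 1 s of audio: 98
```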
In addition, the aforementioned "mapping" specifically means: converting the linear frequency into Bark domain frequency; for example, one useful conversion formula is as follows:
z = 13·arctan(0.00076f) + 3.5·arctan[(f/7500)²]

where f is the linear frequency in Hz, and z is the Bark-band index.
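As an illustration, the quoted conversion formula can be computed directly; a useful sanity check is the well-known property that 1 kHz lies near Bark band 8.5:

```python
import math

def hz_to_bark(f):
    """Map a linear frequency in Hz to the Bark (critical-band) scale,
    using the formula quoted in the description."""
    return 13.0 * math.atan(0.00076 * f) + 3.5 * math.atan((f / 7500.0) ** 2)

print(hz_to_bark(1000))   # approximately 8.5
```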
Regarding the correspondence between linear Hz frequency and the Bark domain, reference can be made to: Zwicker, E. (1961), "Subdivision of the Audible Frequency Range into Critical Bands (Frequenzgruppen)", The Journal of the Acoustical Society of America, 33(2), p. 248; and Traunmüller, H. (1990), "Analytical expressions for the tonotopic sensory scale", The Journal of the Acoustical Society of America, 88(1), pp. 97-100.
It is known that when a signal x passes through a quantizer Q, it is quantized to a quantization level y, i.e., y = Q(x); conversely, obtaining a signal x′ from the quantization level y is inverse quantization, i.e., x′ = Q⁻¹(y). Due to quantization error, the signal x and the signal x′ may not coincide exactly.
In the present invention, such an ordinary quantizer cannot be used by itself. The quantizer used in the present invention has an adaptive step size and is capable of blind detection of the quantization result. "Blind detection" here refers to the blind detection of steganographic information: after a secret data sequence quantized by such a quantizer is written into the carrier, the written (embedded) data can be extracted from the stego signal by the same kind of quantizer, without the original carrier data participating in the extraction (decoding) stage. It is obvious to those skilled in the art that any quantizer capable of blind detection of the quantization result can be used to achieve this effect.
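A minimal scalar sketch of such blind-detectable quantization (the scheme shown is the common quantization-index-modulation construction with two interleaved lattices; the step size of 1.0 is an arbitrary illustration, not a parameter from the invention):

```python
DELTA = 1.0  # illustrative base step size

def quantize(x, bit, delta=DELTA):
    """Q_bit(x): round x onto the lattice associated with `bit`.
    Lattice for bit 0: multiples of delta; for bit 1: shifted by delta/2."""
    offset = bit * delta / 2
    return delta * round((x - offset) / delta) + offset

def detect(y, delta=DELTA):
    """Blind detection: the embedded bit is whichever lattice lies
    nearest to y, requiring no knowledge of the original value x."""
    return min((0, 1), key=lambda b: abs(y - quantize(y, b, delta)))

# Embedding bit 1 into x = 3.9 moves it to the nearest '1'-lattice point:
print(quantize(3.9, 1), detect(quantize(3.9, 1)))  # 3.5 1
```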
In the above-described embodiment of the present invention, the above operations are performed for each audio frame of the first digital audio signal, so that the data information to be transmitted can be embedded throughout a first digital audio signal of a given duration.
The modifications and additions described below may be combined with one another arbitrarily on the basis of this first embodiment, forming different specific technical solutions according to different design requirements.
In the above-mentioned embodiments of the present invention, a preferable way of quantizing the target data with a quantizer capable of blind detection of the quantization result, and of assigning (replacing) the discrete Fourier coefficients at the embedding positions with the quantization result, is:
calculating an embedding intensity coefficient at each embedding position according to the energy value or power spectrum parameter of the audio frame data at that position; the embedding intensity coefficient determines the amount of target data that can be embedded in the corresponding audio frame data;
and, according to the embedding intensity coefficient calculated above, quantizing the target data with a quantizer capable of blind detection of the quantization result, and assigning (replacing) the discrete Fourier coefficients at the embedding positions with the quantization result.
The benefit of this preferred scheme is that the amount of embedded data can be automatically adjusted according to the specific signal conditions of the audio frame data at different embedding positions. For example, in audio signals with more audio content and higher energy, the amount of embedded data can be increased as much as possible while the masking effect is preserved; in audio signals with less content and lower energy (e.g., during quiet passages), the amount of embedded data can be reduced accordingly to ensure the masking effect.
The process of calculating the embedding intensity coefficient from the energy value or power spectrum of the audio frame data is essentially the calculation of the quantization step size. In the invention, in order to better preserve the imperceptibility of the stego audio under auditory masking, a non-uniform quantization step size can be adopted: the step size adapts to the masking threshold of each frame, ensuring that the steganographic information cannot be heard. In one class of embodiments, the quantization step representing the embedding strength can be calculated with the following formula:
Δ′ = Δ + lb·LT_min/50

where Δ′ is the quantization step of the embedding strength, Δ is the base quantization step, and LT_min is the masking threshold of the audio frame into which the stego information is to be embedded. Obviously, the larger the masking threshold, the larger the quantization step that can be obtained. lb is a scaling factor for the quantization step increment, taking a value between 0 and 1, typically 1.
Although the embedding positions of the target data are all located at frequency points below the corresponding masking thresholds, the masking thresholds of the sub-bands of a critical band usually differ from one another. Therefore, in order to mask the embedded target data as completely as possible from human hearing, a preferred class of embodiments, building on the first embodiment of the present invention, selects the frequency point corresponding to the minimum masking threshold in each sub-band as the embedding position, and embeds the target data there.
It is known that for humans the full audio frequency range is 20Hz to 20kHz; in fact, not everyone can hear sound signals across this entire range. Therefore, when designing and manufacturing audio playing devices or systems, the industry often weakens or even filters out high-frequency audio signals and enhances middle- and low-frequency signals, in order to reduce the data transmission load and improve the performance of the device or system. Consequently, if the target data were embedded in a high-frequency-band signal under the technical solution of the first embodiment, then when the corresponding audio signal is played by such systems or devices, the target data embedded in the high band may be difficult to extract and recover, and sometimes may not be received at all. To solve this problem and ensure the robustness of the solution according to the present invention, frequency points in the middle and low frequency bands may be preferred as the embedding positions of the target data in the various embodiments.
Specifically, in the invention the low frequency band is 30-150 Hz, the middle-low frequency band is 30-500 Hz, and the middle-high frequency band is 500-5000 Hz; in general, 30-4000 Hz is the most preferable frequency range for embedding the target data. Of course, those skilled in the art may select other frequency bands as the embedding range according to specific design requirements.
Although the foregoing general objects of the invention can be attained using the various schemes described above, in some cases the following measure is also required to further optimize the solution of the invention. The essence of the technical scheme of the invention is to embed specific target data in the original digital audio signal; the embedded target data can be regarded as a noise component of the new digital audio signal obtained after embedding. It is known that when the intensity of this noise component is large enough, the quality of the new digital audio signal is affected, and the transmission and extraction of the target data are affected as well. It is therefore necessary to evaluate the quality of the new digital audio signal obtained after embedding the target data before deciding whether to use and output it.
Accordingly, when the second digital audio signal is obtained according to any of the above-described embodiments of the present invention, its signal-to-noise ratio may be further calculated, and the quality of the second digital audio signal after embedding the target data evaluated on that basis. If the calculated signal-to-noise ratio is less than a predetermined threshold (which may be set by the skilled person according to specific design requirements, for example 17dB, 20dB, or 23dB), the quality of the second digital audio signal does not meet the predetermined requirement. In that case, parameters such as the embedding positions and the Fourier coefficients of the target data may be re-determined according to the above-mentioned scheme, and the steps of the foregoing embodiments re-executed, until the signal-to-noise ratio of the resulting second digital audio signal reaches the predetermined requirement; the second digital audio signal meeting the requirement is then output.
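The signal-to-noise check described above can be sketched as follows, treating the embedded data as the noise component (the 20 dB threshold is one of the example values mentioned; the function names are illustrative):

```python
import math

def snr_db(original, embedded):
    """SNR of the stego signal, treating (embedded - original) as noise."""
    sig = sum(x * x for x in original)
    noise = sum((y - x) ** 2 for x, y in zip(original, embedded))
    return float('inf') if noise == 0 else 10.0 * math.log10(sig / noise)

def acceptable(original, embedded, threshold_db=20.0):
    """Accept the second digital audio signal only if it meets the
    predetermined signal-to-noise requirement."""
    return snr_db(original, embedded) >= threshold_db
```

In a full implementation, a rejected signal would trigger re-selection of the embedding positions and re-execution of the embedding steps, as the description states.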
In all the above embodiments of the present invention, the embedded target data is actually a target data sequence in which one or more specific audio data and/or encoded data are arranged in series in a predetermined order. Specifically, a specific audio datum corresponds to a specific loudness and/or a specific pitch and/or timbre, while an encoded datum is a number expressed in a computer-readable form. A given target data sequence may consist simply of one or more specific audio data arranged in series in a predetermined order, or simply of one or more specific encoded data so arranged; it may also have a structure in which specific audio data and specific encoded data are interleaved and arranged in series in a predetermined order according to a predetermined rule.
In fact, the advantage of a target data sequence consisting simply of encoded data arranged in series is that the target data can be embedded, received, and extracted at high speed, which suits occasions requiring frequent and fast data transmission, such as live broadcast interaction.
In cases that are not sensitive to the real-time performance and speed of data transmission but require a larger data volume, a target data sequence consisting simply of specific audio data arranged in series is more appropriate.
In the specific embodiments of the invention, the preferable scheme is that any particular audio datum corresponds to a particular loudness and/or a particular pitch and/or timbre. Loudness, also called volume, is the strength of sound as felt by human ears; it is a subjective perception of the magnitude of sound, whose objective measure is the amplitude of the sound. Pitch is the height of a sound, determined by and proportional to the frequency of vibration. Timbre, also called tone quality, is the character of a sound as sensed by hearing; it is mainly determined by the spectrum of the sound, i.e., the composition of the fundamental and its harmonics.
In the above-described embodiments of the invention, one target data sequence may contain a prescribed number of specific audio data. Since any specific audio datum can be determined by the loudness, pitch, and timbre mentioned above, every target data sequence composed of a specified number of specific audio data can be associated with an entry of an information codebook, allowing data drawn from a large information codebook to be delivered.
For example, different pitches have different frequency values; assume n different frequency values are chosen, and denote the n pitches by A, B, C, D, ... . Different loudness levels have different sound intensity values; assume m different intensity values are chosen, and denote the m loudness levels by a, b, c, d, ... . Different timbres have different sound spectra; assume k different sound spectra are chosen, and denote them by 1, 2, 3, ..., k. On this basis, any one audio datum can be described in the following form:
(X, Y, Z), wherein X is a pitch, of which there are n; Y is a loudness, of which there are m; and Z is a timbre, of which there are k;
therefore, the codebook capacity W of a single audio datum in the present invention can be calculated by the following equation:
W=n×m×k
Suppose that in a target data sequence of the invention a unit audio group is formed simply of 5 audio data; then the codebook capacity of any unit audio group is calculated by the following equation:

W = (n×m×k)^5
When n = 10, m = 8, and k = 8, the value of W is:

W = (10×8×8)^5 = 2^30×10^5 ≈ 1.07×10^14 > 10^14
Of course, n, m, and k are natural numbers whose values can be selected or determined by the relevant technical personnel, according to the required capacity of the information codebook, when implementing the invention.
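The codebook-capacity arithmetic above can be verified directly (n, m, k and the group length of 5 are the example values from the text):

```python
def codebook_capacity(n, m, k, group_len=5):
    """W = (n*m*k)**group_len: the number of distinct unit audio groups
    expressible with n pitches, m loudness levels, and k timbres."""
    return (n * m * k) ** group_len

w = codebook_capacity(10, 8, 8)
print(w > 10 ** 14)   # True: (10*8*8)**5 = 2**30 * 10**5 ≈ 1.07e14
```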
As described above, in the various embodiments of the present invention a target data sequence can be constructed from a single kind of target data, for example from audio data alone or from encoded data alone. In some cases, however, it may be desirable to construct a target data sequence that mixes audio data and encoded data. So that the receiving side can extract the data from the first digital audio signal by the correct means, a predetermined identification data sequence is inserted at a predetermined position of the target data sequence; after analyzing and recognizing the identification data sequence, the receiving device extracts the corresponding data with the scheme that the identification data sequence indicates, for example a pattern recognition scheme for identifying the audio data in the target data sequence.
Of course, even if a target data sequence mixes audio data and coded data, any target data sequence can be constructed by prior agreement, without inserting any identification data sequence, as long as it is used in a completely closed information system; in an open information system, by contrast, identification data sequences are almost essential. Whether an identification data sequence is employed should therefore be decided by the skilled person during the design of the relevant system, according to its specific requirements.
In the various embodiments of the present invention described above, if an identification data sequence is employed, it is preferably constructed from coded data. However, the skilled person may instead use audio data, or a combination of audio data and encoded data, to form the identification data sequence according to specific design requirements.
In summary, an important advantage of the present invention is that, since the target data sequence is inserted at positions below the masking threshold of the digital audio signal, when the digital audio signal carrying the inserted target data sequence is played, the inserted audio signal sequence is not perceived by the human ear, owing to the masking effect.
In addition, because multi-dimensional audio signals (loudness, pitch and timbre) are used to form the audio data sequence, the capacity of the information codebook is given ample headroom, and sufficient information can be transmitted with a limited set of audio data.
In order to receive and acquire a target data sequence embedded in a digital audio signal by adopting the above various schemes of the invention, the invention also provides the following technical schemes:
when a device (such as a mobile phone or another smart device with a microphone and audio processing capability) receives a digital audio signal embedded with an audio signal sequence, framing the received digital audio signal into a plurality of audio frame data and performing windowing processing; performing a frequency-domain discrete Fourier transform on the plurality of audio frame data to obtain a plurality of spectrum data respectively corresponding to the plurality of audio frame data;
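The framing, windowing and transform steps can be sketched as follows; the frame length, hop size and Hann window are illustrative assumptions, since the invention does not fix these parameters:

```python
import numpy as np

def frames_to_spectra(signal: np.ndarray, frame_len: int = 1024,
                      hop: int = 512) -> np.ndarray:
    """Split a mono signal into overlapping frames, window each frame,
    and return the DFT spectrum of every frame (one row per frame)."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    spectra = np.empty((n_frames, frame_len), dtype=complex)
    for i in range(n_frames):
        frame = signal[i * hop:i * hop + frame_len]
        spectra[i] = np.fft.fft(frame * window)
    return spectra

# e.g. one second of a 440 Hz tone sampled at 16 kHz:
t = np.arange(16000) / 16000.0
spectra = frames_to_spectra(np.sin(2 * np.pi * 440 * t))
```

Each row of `spectra` then serves as one item of "spectrum data" for the masking-threshold analysis described next.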
mapping the spectrum data to the auditory critical bands (Bark domain), and calculating the masking threshold of each sub-band in the auditory critical bands; the masking thresholds are in one-to-one correspondence with the sub-bands;
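Mapping DFT bins to the Bark-domain critical bands can be sketched with Traunmüller's approximation of the Bark scale; the masking-threshold computation itself depends on the chosen psychoacoustic model and is not shown here:

```python
import numpy as np

def hz_to_bark(f_hz):
    """Traunmüller's approximation of the Bark (critical-band) scale."""
    f = np.asarray(f_hz, dtype=float)
    return 26.81 * f / (1960.0 + f) - 0.53

def bin_to_band(frame_len: int = 1024, sample_rate: int = 16000):
    """Assign each DFT bin (up to Nyquist) to its critical-band index."""
    freqs = np.fft.rfftfreq(frame_len, d=1.0 / sample_rate)
    return np.floor(hz_to_bark(freqs)).astype(int).clip(min=0)

bands = bin_to_band()  # e.g. bands[i] is the Bark sub-band of bin i
```

A per-sub-band masking threshold would then be computed over the bins sharing each band index, giving the one-to-one threshold/sub-band correspondence the description requires.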
selecting frequency points smaller than the masking threshold from the plurality of spectrum data as embedding positions; performing inverse quantization processing on the discrete Fourier coefficients of the embedding positions by means of a quantizer capable of realizing blind detection on the quantization result, to obtain the one-dimensional data sequence embedded in the digital audio signal. Referring to the embodiments of digital audio signal processing of the present invention, the aforementioned target data sequence is formed by serially arranging more than one item of specific audio data and/or encoded data in a predetermined order, wherein an item of specific audio data corresponds to a specific loudness and/or a specific pitch and/or a specific timbre.
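A quantizer that "realizes blind detection on the quantization result" is typically a quantization-index-modulation (QIM) scheme: embedding rounds a coefficient onto one of two interleaved lattices, and extraction recovers the bit from whichever lattice the received coefficient lies nearest, without access to the original signal. A minimal sketch, with the step size `delta` standing in for the embedding strength (both the QIM choice and the parameter values are assumptions, not specified by the patent):

```python
def qim_embed(coef: float, bit: int, delta: float = 1.0) -> float:
    """Quantize a DFT coefficient onto the lattice selected by `bit`."""
    offset = delta / 2.0 if bit else 0.0
    return round((coef - offset) / delta) * delta + offset

def qim_extract(coef: float, delta: float = 1.0) -> int:
    """Blind detection: pick the lattice nearest to the received value."""
    d0 = abs(coef - round(coef / delta) * delta)
    offset = delta / 2.0
    d1 = abs(coef - (round((coef - offset) / delta) * delta + offset))
    return 0 if d0 <= d1 else 1

# Round-trip survives perturbations smaller than delta / 4:
assert qim_extract(qim_embed(3.37, 1) + 0.1) == 1
assert qim_extract(qim_embed(3.37, 0) - 0.1) == 0
```

Because detection only compares distances to the two lattices, the receiver needs the step size but not the host coefficients, which is what makes the detection "blind".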
By adopting the above embodiment for extracting data from a digital audio signal, a corresponding one-dimensional data sequence can be extracted from the digital audio signal in which the target data sequence is embedded. However, as previously mentioned, when the one-dimensional data sequence is composed of audio data or a mixture of audio data and encoded data, or when the digital audio signal is transmitted in an open information system, it is necessary to search for a predetermined identification data sequence in the extracted one-dimensional data sequence and, according to the indication of that identification data sequence, perform pattern recognition on the audio data at the related positions in the extracted sequence, finally obtaining the corresponding target data sequence.
In some cases, obtaining the target data sequence means obtaining the actual information, for example when the target data sequence consists only of encoded data. In other cases, for example when the target data sequence is composed of audio data or a mixture of audio data and encoded data, even after the target data sequence has been extracted by pattern recognition according to the indication of the identification data sequence, it may still be necessary to transform the target data sequence by means of a predetermined encoding table in order to finally obtain the target data embedded in the digital audio signal.
Of course, in the present invention, after the aforementioned one-dimensional data sequence or target data sequence has been obtained, the receiving device (for example, a mobile phone or another smart device with a microphone and audio processing capability) may send the one-dimensional data sequence or target data sequence to a server; the server then searches for the predetermined identification data sequence, extracts the target data sequence by pattern recognition according to the indication of the identification data sequence, and transforms the target data sequence using the predetermined coding table, finally obtaining the target data embedded in the digital audio signal.
One specific application example is as follows: after the target data sequence embedded in the digital audio signal has been extracted by the above embodiments, if the target data sequence is composed simply of audio data, the specific audio data and combinations thereof in the target data sequence can be matched by code, that is, the data information corresponding to the audio signal sequence can be looked up in a predetermined encoding table.
The predetermined coding table usually contains at least the following items in one-to-one correspondence: an audio data sequence and the specific information corresponding to it. For example, following the earlier example of an audio data sequence composed of loudness, pitch and timbre, an audio data sequence of a specified length may correspond to the letter "a", the word "energy", the phrase "spectral data", an item object "cell phone", a web page link address "www.baidu.com", and so on. This way of conveying information is somewhat similar to the way telegram codes convey it; however, as described above, if the codebook capacity is sufficiently large, the present invention can transmit information in a manner different from the telegram-code manner, and data can be transmitted directly.
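Such a coding table amounts to a lookup from audio-data sequences to information. The sketch below uses hypothetical (pitch, loudness, timbre) triples; the keys and values are illustrative, borrowing only the example entries from the description:

```python
# Hypothetical coding table: each key is a fixed-length sequence of
# (pitch, loudness, timbre) triples; each value is the information it
# stands for.  The values mirror the examples in the description.
coding_table = {
    (("A", "a", 1), ("B", "b", 2), ("C", "c", 3), ("D", "d", 4), ("E", "e", 5)): "a",
    (("F", "f", 6), ("G", "g", 7), ("H", "h", 8), ("A", "b", 1), ("B", "c", 2)): "www.baidu.com",
}

def decode(sequence):
    """Look up an extracted unit audio group; None if it is not coded."""
    return coding_table.get(tuple(sequence))

assert decode([("A", "a", 1), ("B", "b", 2), ("C", "c", 3),
               ("D", "d", 4), ("E", "e", 5)]) == "a"
```

With the codebook capacity computed earlier (over 10¹⁴ entries for 5-item groups), such a table can be made large enough to carry words, phrases or links directly rather than telegram-style codes.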
Finally, it should be noted that: the above embodiments are only used to illustrate the technical solution of the present invention, and not to limit the same; while the invention has been described in detail and with reference to the foregoing embodiments, it will be understood by those skilled in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present invention.
Claims (6)
1. A method of digital audio signal processing, comprising:
framing the first digital audio signal into a plurality of audio frame data and performing windowing processing; performing frequency domain discrete Fourier transform on the plurality of audio frame data to obtain a plurality of first frequency spectrum data respectively corresponding to the plurality of audio frame data;
mapping the plurality of first spectrum data to auditory critical bands, and calculating a masking threshold of each sub-band in the auditory critical bands; the number of masking thresholds corresponds to the number of sub-bands;
selecting a frequency point smaller than the masking threshold value in the plurality of first spectrum data as an embedding position;
performing quantization processing on target data by using a quantizer capable of realizing blind detection on a quantization result, and assigning discrete Fourier coefficients of the embedding position by using the quantization processing result to obtain a plurality of second spectrum data corresponding to the plurality of first spectrum data;
and performing inverse discrete Fourier transform on the plurality of second spectrum data to obtain a second digital audio signal.
2. The method of claim 1, wherein the target data is obtained by:
acquiring more than one specific audio data and/or coded data, and serially arranging the more than one specific audio data and/or coded data into a target data sequence according to a predetermined sequence; or,
acquiring more than one specific audio data and/or coded data, and serially arranging the more than one specific audio data and/or coded data into a target data sequence according to a predetermined sequence; and inserting a predetermined identification data sequence at a predetermined position of the target data sequence; the identification data sequence is formed by arranging predetermined coded data according to the appointed length and sequence;
wherein the specific audio data corresponds to a specific loudness and/or a specific pitch and/or timbre.
3. The method according to claim 1 or 2, wherein the quantizing the target data by using a quantizer capable of blind detection on the quantization result, and assigning the discrete Fourier coefficients of the embedding positions by using the quantization result, comprises:
calculating corresponding embedding strength according to the masking threshold of the audio frame data based on the embedding position so as to determine the data quantity embedded in the corresponding audio frame data;
and according to the embedding strength, carrying out quantization processing on target data by adopting a quantizer capable of realizing blind detection on a quantization result, and assigning the discrete Fourier coefficient of the embedding position by using the result of the quantization processing.
4. The method of claim 1 or 2, further comprising:
when the first spectrum data corresponding to the frequency point is smaller than a minimum masking threshold; and/or when the frequency point is located in the middle and low frequency bands of the audio, taking the frequency point as an embedding position, the middle and low frequency range being 30 Hz to 4 kHz; and/or,
and calculating the signal-to-noise ratio of the second digital audio signal, and outputting the second digital audio signal when the signal-to-noise ratio of the second digital audio signal is higher than a preset threshold range.
5. A method of extracting data from a digital audio signal, comprising:
framing the first digital audio signal into a plurality of audio frame data and performing windowing processing; performing frequency domain discrete Fourier transform on the plurality of audio frame data to obtain a plurality of first frequency spectrum data respectively corresponding to the plurality of audio frame data;
mapping the plurality of first spectrum data to auditory critical bands, and calculating a masking threshold of each sub-band in the auditory critical bands; the number of masking thresholds corresponds to the number of sub-bands;
selecting a frequency point smaller than the masking threshold value in the plurality of first spectrum data as an embedding position;
performing inverse quantization processing on the discrete Fourier coefficients of the embedding position by adopting a quantizer capable of realizing blind detection on a quantization result, to obtain a target data sequence embedded in the first digital audio signal; wherein the target data sequence is formed by more than one specific audio data and/or coded data arranged in series according to a predetermined sequence; the specific audio data corresponds to a specific loudness and/or a specific pitch and/or timbre.
6. The method of claim 5, further comprising:
searching a preset identification data sequence in the target data sequence, and carrying out pattern recognition on the target data sequence according to the identification data sequence to obtain a corresponding target data sequence; or,
and searching a preset identification data sequence in the target data sequence, carrying out pattern recognition on the target data sequence according to the identification data sequence to obtain a corresponding target data sequence, and transforming the target data sequence by utilizing a preset coding table to obtain target data embedded into the first digital audio signal.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510447092.2A CN106409301A (en) | 2015-07-27 | 2015-07-27 | Digital audio signal processing method |
PCT/CN2016/087445 WO2017016363A1 (en) | 2015-07-27 | 2016-06-28 | Method for processing digital audio signal |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510447092.2A CN106409301A (en) | 2015-07-27 | 2015-07-27 | Digital audio signal processing method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN106409301A true CN106409301A (en) | 2017-02-15 |
Family
ID=57884085
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201510447092.2A Pending CN106409301A (en) | 2015-07-27 | 2015-07-27 | Digital audio signal processing method |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN106409301A (en) |
WO (1) | WO2017016363A1 (en) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108281152A (en) * | 2018-01-18 | 2018-07-13 | 腾讯音乐娱乐科技(深圳)有限公司 | Audio-frequency processing method, device and storage medium |
CN109257688A (en) * | 2018-07-23 | 2019-01-22 | 东软集团股份有限公司 | Audio distinguishes method, apparatus, storage medium and electronic equipment |
CN110447071A (en) * | 2017-03-28 | 2019-11-12 | 索尼公司 | Information processing unit, information processing method and program |
CN113362835A (en) * | 2020-03-05 | 2021-09-07 | 杭州网易云音乐科技有限公司 | Audio watermark processing method and device, electronic equipment and storage medium |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102142255A (en) * | 2010-07-08 | 2011-08-03 | 北京三信时代信息公司 | Method for embedding and extracting digital watermark in audio signal |
CN102959622A (en) * | 2010-02-26 | 2013-03-06 | 弗兰霍菲尔运输应用研究公司 | Watermark signal provision and watermark embedding |
CN104505096A (en) * | 2014-05-30 | 2015-04-08 | 华南理工大学 | Method and device using music to transmit hidden information |
Family Cites Families (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
KR100560429B1 (en) * | 2003-12-17 | 2006-03-13 | 한국전자통신연구원 | Watermarking apparatus and method using nonlinear quantization |
CN101101754B (en) * | 2007-06-25 | 2011-09-21 | 中山大学 | A Robust Audio Watermarking Method Based on Fourier Discrete Logarithmic Coordinate Transform |
CN101345054B (en) * | 2008-08-25 | 2011-11-23 | 苏州大学 | Digital watermark production and recognition method for audio files |
EP2362383A1 (en) * | 2010-02-26 | 2011-08-31 | Fraunhofer-Gesellschaft zur Förderung der Angewandten Forschung e.V. | Watermark decoder and method for providing binary message data |
JP6316288B2 (en) * | 2013-06-11 | 2018-04-25 | 株式会社東芝 | Digital watermark embedding device, digital watermark detection device, digital watermark embedding method, digital watermark detection method, digital watermark embedding program, and digital watermark detection program |
CN104795071A (en) * | 2015-04-18 | 2015-07-22 | 广东石油化工学院 | Blind audio watermark embedding and watermark extraction processing method |
- 2015-07-27: CN application CN201510447092.2A filed (published as CN106409301A, status: Pending)
- 2016-06-28: PCT application PCT/CN2016/087445 filed (published as WO2017016363A1)
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102959622A (en) * | 2010-02-26 | 2013-03-06 | 弗兰霍菲尔运输应用研究公司 | Watermark signal provision and watermark embedding |
CN102142255A (en) * | 2010-07-08 | 2011-08-03 | 北京三信时代信息公司 | Method for embedding and extracting digital watermark in audio signal |
CN104505096A (en) * | 2014-05-30 | 2015-04-08 | 华南理工大学 | Method and device using music to transmit hidden information |
Non-Patent Citations (5)
Title |
---|
智西湖: "Fundamentals of Computer Applications" (《计算机应用基础》), Beijing University of Posts and Telecommunications Press, 31 August 2010 *
李伟 et al.: "A Survey of Digital Audio Watermarking Technology" ("数字音频水印技术综述"), Journal on Communications (《通信学报》) *
王丽娜 et al.: "Information Hiding Technology and Applications" (《信息隐藏技术与应用》), Wuhan University Press, 31 May 2012 *
王俊杰 et al.: "Research on Digital Watermarking and Information Security Technology" (《数字水印与信息安全技术研究》), Intellectual Property Publishing House, 31 August 2014 *
陶智 et al.: "Digital Audio Watermarking Based on the Psychoacoustic Model and Critical-Band Wavelet Transform" ("基于心理声学模型和临界频带子波变换的数字声频水印"), Acta Acustica (《声学学报》) *
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110447071A (en) * | 2017-03-28 | 2019-11-12 | 索尼公司 | Information processing unit, information processing method and program |
CN110447071B (en) * | 2017-03-28 | 2024-04-26 | 索尼公司 | Information processing apparatus, information processing method, and removable medium recording program |
CN108281152A (en) * | 2018-01-18 | 2018-07-13 | 腾讯音乐娱乐科技(深圳)有限公司 | Audio-frequency processing method, device and storage medium |
CN109257688A (en) * | 2018-07-23 | 2019-01-22 | 东软集团股份有限公司 | Audio distinguishes method, apparatus, storage medium and electronic equipment |
CN113362835A (en) * | 2020-03-05 | 2021-09-07 | 杭州网易云音乐科技有限公司 | Audio watermark processing method and device, electronic equipment and storage medium |
CN113362835B (en) * | 2020-03-05 | 2024-06-07 | 杭州网易云音乐科技有限公司 | Audio watermarking method, device, electronic equipment and storage medium |
Also Published As
Publication number | Publication date |
---|---|
WO2017016363A1 (en) | 2017-02-02 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US11961527B2 (en) | Methods and apparatus to perform audio watermarking and watermark detection and extraction | |
US11557304B2 (en) | Methods and apparatus for performing variable block length watermarking of media | |
AU2001251274B2 (en) | System and method for adding an inaudible code to an audio signal and method and apparatus for reading a code signal from an audio signal | |
CN101933086B (en) | Method and apparatus for processing audio signal | |
CN102982806B (en) | Methods and apparatus to perform audio signal decoding | |
AU2001251274A1 (en) | System and method for adding an inaudible code to an audio signal and method and apparatus for reading a code signal from an audio signal | |
CN1808568B (en) | Audio encoding/decoding apparatus having watermark insertion/abstraction function and method using the same | |
EP2787503A1 (en) | Method and system of audio signal watermarking | |
CN106409301A (en) | Digital audio signal processing method | |
Chen et al. | Telephony speech enhancement by data hiding | |
AU2012241085B2 (en) | Methods and apparatus to perform audio watermarking and watermark detection and extraction | |
Adib | A high capacity quantization-based audio watermarking technique using the DWPT | |
KR960003627B1 (en) | Decoding Method for Hearing Aids of Subband Coded Audio Signals | |
HK1055033A (en) | Multi-band spectral audio encoding | |
AU2008201526A1 (en) | System and method for adding an inaudible code to an audio signal and method and apparatus for reading a code signal from an audio signal |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication |
Application publication date: 20170215 |