CN106409301A - Digital audio signal processing method - Google Patents
- Publication number: CN106409301A
- Application number: CN201510447092.2A
- Authority: CN (China)
- Legal status: Pending (the status is an assumption by Google and is not a legal conclusion; Google has not performed a legal analysis)
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L19/00—Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
- G10L19/018—Audio watermarking, i.e. embedding inaudible data in the audio signal
Abstract
The digital audio signal processing method disclosed by the present invention embeds additional content of a fixed format in a digital audio signal, so that digital information can be transmitted covertly. The method mainly utilizes the masking effect of the human auditory system to make the digital audio signal carry predetermined data. With the method of the present invention, the data to be transmitted can be embedded at appropriate positions of the digital audio signal; when the digital audio signal is played, the audio signal representing the relevant data at the embedded positions is masked, so that human ears cannot perceive it, yet it can be received by a device possessing audio signal processing capability.
Description
Technical Field
The present invention relates to a digital audio signal processing technology, and more particularly, to a method for processing a digital audio signal using a masking effect based on psychoacoustics.
Background
The use of digital audio signals to carry information is a technology of great interest to the industry, into which considerable research effort and money have been invested. With such a technique, a user with a device having audio signal processing capability, such as a mobile communication terminal, can acquire the data information carried in music or a television program while listening or watching normally. An important criterion for assessing whether this technology is mature and suitable for use is that it must ensure both that the carried data can be accurately transmitted and collected, and that the digital audio signal itself can be played without producing disturbing sounds or noise perceptible to human listeners.
Chinese patent application 201410301832.7 discloses such a technique: the digital information to be transmitted is coded and modulated to form a sound coding signal, which is then mixed with the audio signal of a preselected audio-video program and output. Although this technique allows the digital information to be added to the normal sound by mixing, the "digital information to be transmitted" is unpredictable, so in a considerable number of cases the sound coding signal formed from it is heard as noise in the audio; in other cases it may produce other sounds that interfere with the normally played sound. To avoid such problems, the following improvement is proposed in the description of the above-mentioned patent application:
"The digital information to be transmitted is coded and modulated to form a sound coding signal. The sound coding signal can be written into a digital sound signal file, or converted into an analog sound signal through a digital-to-analog converter. The frequency of the analog sound signal can be selected in the band above 18kHz and below 20kHz, which is difficult for human ears to perceive and does not affect the normal playing of the original television soundtrack or music signal. Since in subsequent steps the digital information to be transmitted needs to be received and extracted by a receiving device local to the user, the sound-coded information must have the characteristic that its signal energy is distributed only within a certain frequency range, above 18kHz and below 20kHz."
Obviously, in order to prevent human ears from perceiving the sound coding formed from the "digital information to be transmitted", the above-mentioned scheme must confine the energy distribution of the sound-coded information to the frequency range of 18kHz to 20kHz.
As is well known, the full range of sound audible to the human ear is 20Hz to 20kHz. Adults with good hearing can usually hear frequencies between 30Hz and 16kHz; elderly people with poor hearing often hear only between 50Hz and 10kHz. Children, however, can generally hear higher frequencies. The sound in the 18kHz-to-20kHz range adopted in the above technical scheme can therefore be heard by many children. Thus, even if the energy of the sound-coded information is confined to the 18kHz-to-20kHz range, a considerable number of people, especially children, can still hear it; these people remain plagued by noise or interfering sounds when listening to television or radio programs that use this technology.
On the other hand, although the energy of the sound-coded information could instead be distributed outside the audible range (20Hz to 20kHz), the frequency response characteristics of most audio devices are designed and manufactured according to the audible range, and signals outside 20Hz to 20kHz are generally filtered out as noise. Therefore, even if the sound-coded information were mixed into normal audio signals, it could not be played by the audio devices and hence could not be acquired by the receiving devices.
In summary, the various techniques described above are clearly not mature and therefore cannot be widely used.
Disclosure of Invention
The invention aims to provide a digital audio signal processing method which processes the digital audio signal according to psychoacoustic principles, embedding the information to be transmitted into the digital audio signal as specific target data, so that when the digital audio signal is played by sound equipment, the embedded target data is played along with it and can be received and extracted by equipment with audio signal processing capability, without being perceived by human ears.
The above purpose of the invention is realized by adopting the following technical scheme:
framing the first digital audio signal into a plurality of audio frame data and performing windowing processing; performing a discrete Fourier transform on the plurality of audio frame data to obtain a plurality of first spectrum data corresponding to the plurality of audio frame data;
mapping the plurality of first spectrum data to the auditory critical bands (Bark domain), and calculating the masking threshold of each sub-band in the auditory critical bands, the masking thresholds corresponding one-to-one to the sub-bands;
selecting, from the plurality of first spectrum data, a frequency point below the masking threshold as an embedding position;
performing quantization processing on target data by using a quantizer capable of performing blind detection on a quantization result, and assigning the discrete Fourier coefficient at the embedding position by using the result of the quantization processing, thereby obtaining a plurality of second spectrum data corresponding to the plurality of first spectrum data;
and performing an inverse discrete Fourier transform on the plurality of second spectrum data to obtain a second digital audio signal.
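The embedding steps above can be sketched in code as follows. This is a minimal illustration only: the psychoacoustic masking analysis is replaced by a fixed, hypothetical bin index, the windowing/overlap handling is omitted, and the frame length and quantization step are arbitrary example values, not the patented parameters.

```python
import numpy as np

FRAME = 256    # samples per frame (a power of 2, as the description suggests)
BIN = 20       # hypothetical embedding position (would come from the masking model)
DELTA = 0.8    # hypothetical quantization step

def qim(value, bit, delta=DELTA):
    """Quantization index modulation: map a magnitude onto the lattice
    for `bit`, so the bit is recoverable without the original carrier."""
    return delta * round((value - bit * delta / 2) / delta) + bit * delta / 2

def embed(signal, bits):
    """Frame -> DFT -> quantize one coefficient per frame -> inverse DFT."""
    out = np.array(signal, dtype=float)
    for i, bit in enumerate(bits):
        spec = np.fft.rfft(out[i * FRAME:(i + 1) * FRAME])
        mag, phase = np.abs(spec[BIN]), np.angle(spec[BIN])
        # assign (replace) the coefficient at the embedding position
        spec[BIN] = qim(mag, bit) * np.exp(1j * phase)
        out[i * FRAME:(i + 1) * FRAME] = np.fft.irfft(spec, FRAME)
    return out

def extract(signal, n_bits):
    """Blind detection: re-transform and pick the nearer lattice per frame."""
    bits = []
    for i in range(n_bits):
        spec = np.fft.rfft(signal[i * FRAME:(i + 1) * FRAME])
        mag = np.abs(spec[BIN])
        bits.append(0 if abs(mag - qim(mag, 0)) <= abs(mag - qim(mag, 1)) else 1)
    return bits
```

The sketch shows why blind detection works: extraction needs only the shared lattice parameters, never the original carrier signal.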
By adopting the method of the invention, the target data to be transmitted can be embedded in the proper position of the first digital audio signal according to the psychoacoustic principle. When the first digital audio signal is played, the embedded signals at the embedding location that represent the relevant target data can be masked from being perceived by the human ear, but these embedded signals can be intercepted and restored by a device having audio signal processing capabilities.
It is another object of the present invention to provide a method of extracting data from a digital audio signal; by using the method, when the digital audio signal is played by the sound equipment, the received digital audio signal can be processed, and the target data embedded in the digital audio signal can be extracted by using the psychoacoustic principle.
Framing the received first digital audio signal into a plurality of audio frame data and performing windowing processing; performing a discrete Fourier transform on the plurality of audio frame data to obtain a plurality of first spectrum data respectively corresponding to the plurality of audio frame data;
mapping the plurality of first spectrum data to the auditory critical bands, and calculating the masking threshold of each sub-band in the auditory critical bands, the masking thresholds corresponding one-to-one to the sub-bands;
selecting frequency points smaller than corresponding masking thresholds in the plurality of first spectrum data as embedding positions;
performing inverse quantization processing on the discrete Fourier coefficients at the embedding positions, using a quantizer capable of blind detection of the quantization result, to obtain the target data sequence embedded in the first digital audio signal; wherein the target data sequence is formed by one or more specific audio data and/or coded data arranged in series in a predetermined order, each specific audio datum corresponding to a specific loudness and/or a specific pitch and/or timbre.
According to the above method, when the first digital audio signal is received, the target data sequence that it carries by means of the masking effect can be extracted from it using psychoacoustic principles, and the corresponding target data can be further recovered; throughout this process the embedded target data sequence is played out by the audio device together with the digital audio signal, yet is not perceived by the human ear.
Detailed Description
In the first class of embodiments of the present invention, some target data needs to be embedded into the target digital audio signal.
In order to embed the target data in a digital audio signal, the digital audio signal is first framed into a plurality of audio frame data, and each audio frame is windowed. A discrete Fourier transform is then performed on each windowed audio frame, yielding a plurality of first spectrum data in one-to-one correspondence with the audio frames.
After the plurality of first spectrum data are obtained, each is mapped to the auditory critical bands, and the masking threshold of each sub-band in the auditory critical bands is calculated; the masking thresholds correspond one-to-one to the sub-bands of the auditory critical bands.
Frequency points below the masking threshold are selected from the plurality of first spectrum data as the embedding positions of the target data. Then a quantizer capable of blind detection of the quantization result is used to quantize the target data, and the discrete Fourier coefficients at the embedding positions are assigned (replaced) with the quantization result, yielding second spectrum data corresponding to each first spectrum data.
the plurality of second spectrum data are inverse discrete fourier transformed to obtain a second digital audio signal. This newly acquired second digital audio signal has the above-mentioned target data embedded therein.
It should be noted that when the first digital audio signal is framed and windowed, the length of each audio frame and the size of the window can be determined by the relevant technicians according to specific design requirements, and at least two schemes are available. One scheme is similar to speech recognition techniques, i.e., adjacent frames overlap; in this case the window length is typically 25-35 ms and the frame shift is about 10 ms (it may be somewhat more or less). The other scheme uses non-overlapping frames, with the window length directly specified as a number of time-domain sampling points, generally a power of 2 (2^N, N a positive integer), such as 256 or 512 samples per window.
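As a small arithmetic sketch of the overlapped scheme, the number of frames follows directly from the window length and frame shift (the 16 kHz sampling rate is an assumed example value, not taken from the description):

```python
# Overlapped framing as in speech recognition: 25 ms window, 10 ms shift.
sample_rate = 16000
win_len = int(0.025 * sample_rate)   # 400 samples per window
shift = int(0.010 * sample_rate)     # 160 samples frame shift

def num_frames(n_samples, win, hop):
    """Number of complete windows that fit with the given frame shift."""
    if n_samples < win:
        return 0
    return 1 + (n_samples - win) // hop

print(num_frames(sample_rate, win_len, shift))  # frames in 1 s of audio: 98
```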
In addition, the aforementioned "mapping" specifically means: converting the linear frequency into Bark domain frequency; for example, one useful conversion formula is as follows:
z = 13·arctan(0.00076f) + 3.5·arctan[(f/7500)²]

where f is the linear frequency in Hz, and z is the Bark-band index.
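As an illustration, the quoted conversion formula can be computed directly; a useful sanity check is the well-known property that 1 kHz lies near Bark band 8.5:

```python
import math

def hz_to_bark(f):
    """Map a linear frequency in Hz to the Bark (critical-band) scale,
    using the formula quoted in the description."""
    return 13.0 * math.atan(0.00076 * f) + 3.5 * math.atan((f / 7500.0) ** 2)

print(hz_to_bark(1000))   # approximately 8.5
```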
Regarding the correspondence between linear Hz frequency and the Bark domain, reference can be made to: Zwicker, E. (1961), "Subdivision of the Audible Frequency Range into Critical Bands (Frequenzgruppen)", The Journal of the Acoustical Society of America, 33(2), p. 248; and Traunmüller, H. (1990), "Analytical expressions for the tonotopic sensory scale", The Journal of the Acoustical Society of America, 88(1), pp. 97-100.
It is known that when a signal x passes through a quantizer Q, it is quantized to a quantization level y, i.e., y = Q(x); conversely, obtaining a signal x′ from the quantization level y is inverse quantization, i.e., x′ = Q⁻¹(y). Due to quantization error, the signal x and the signal x′ may not coincide exactly.
In the present invention, such an ordinary quantizer cannot be used by itself. The quantizer used in the present invention has an adaptive step size and is capable of blind detection of the quantization result. "Blind detection" here refers to the blind detection of steganographic information: after a secret data sequence quantized by such a quantizer is written into the carrier, the written (embedded) data can be extracted from the stego signal by the same kind of quantizer, without the original carrier data participating in the extraction (decoding) stage. It is obvious to those skilled in the art that any quantizer capable of blind detection of the quantization result can be used to achieve this effect.
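A minimal scalar sketch of such blind-detectable quantization (the scheme shown is the common quantization-index-modulation construction with two interleaved lattices; the step size of 1.0 is an arbitrary illustration, not a parameter from the invention):

```python
DELTA = 1.0  # illustrative base step size

def quantize(x, bit, delta=DELTA):
    """Q_bit(x): round x onto the lattice associated with `bit`.
    Lattice for bit 0: multiples of delta; for bit 1: shifted by delta/2."""
    offset = bit * delta / 2
    return delta * round((x - offset) / delta) + offset

def detect(y, delta=DELTA):
    """Blind detection: the embedded bit is whichever lattice lies
    nearest to y, requiring no knowledge of the original value x."""
    return min((0, 1), key=lambda b: abs(y - quantize(y, b, delta)))

# Embedding bit 1 into x = 3.9 moves it to the nearest '1'-lattice point:
print(quantize(3.9, 1), detect(quantize(3.9, 1)))  # 3.5 1
```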
In the above-described embodiment of the present invention, the above operations are performed for each audio frame of the first digital audio signal, so that the data information to be transmitted can be embedded throughout a first digital audio signal of a given duration.
The modifications and additions described below may be combined with one another arbitrarily on the basis of this first embodiment, forming different specific technical solutions according to different design requirements.
In the above-mentioned embodiments of the present invention, a preferable way of quantizing the target data with a quantizer capable of blind detection of the quantization result, and of assigning (replacing) the discrete Fourier coefficients at the embedding positions with the quantization result, is:
calculating an embedding intensity coefficient at each embedding position according to the energy value or power spectrum parameter of the audio frame data at that position; the embedding intensity coefficient determines the amount of target data that can be embedded in the corresponding audio frame data;
and, according to the embedding intensity coefficient calculated above, quantizing the target data with a quantizer capable of blind detection of the quantization result, and assigning (replacing) the discrete Fourier coefficients at the embedding positions with the quantization result.
The benefit of this preferred scheme is that the amount of embedded data can be automatically adjusted according to the specific signal conditions of the audio frame data at different embedding positions. For example, in audio signals with more audio content and higher energy, the amount of embedded data can be increased as much as possible while the masking effect is preserved; in audio signals with less content and lower energy (e.g., during quiet passages), the amount of embedded data can be reduced accordingly to ensure the masking effect.
The process of calculating the embedding intensity coefficient from the energy value or power spectrum of the audio frame data is essentially the calculation of the quantization step size. In the invention, in order to better preserve the imperceptibility of the stego audio under auditory masking, a non-uniform quantization step size can be adopted: the step size adapts to the masking threshold of each frame, ensuring that the steganographic information cannot be heard. In one class of embodiments, the quantization step representing the embedding strength can be calculated with the following formula:
Δ′ = Δ + lb·LT_min/50

where Δ′ is the quantization step of the embedding strength, Δ is the base quantization step, and LT_min is the masking threshold of the audio frame into which the stego information is to be embedded. Obviously, the larger the masking threshold, the larger the quantization step that can be obtained. lb is a scaling factor for the quantization step increment, taking a value between 0 and 1, typically 1.
Although the embedding positions of the target data are all located at frequency points below the corresponding masking thresholds, the masking thresholds of the sub-bands of a critical band usually differ from one another. Therefore, in order to mask the embedded target data as completely as possible from human hearing, a preferred class of embodiments, building on the first embodiment of the present invention, selects the frequency point corresponding to the minimum masking threshold in each sub-band as the embedding position, and embeds the target data there.
It is known that for humans the full audio frequency range is 20Hz to 20kHz; in fact, not everyone can hear sound signals across this entire range. Therefore, when designing and manufacturing audio playing devices or systems, the industry often weakens or even filters out high-frequency audio signals and enhances middle- and low-frequency signals, in order to reduce the data transmission load and improve the performance of the device or system. Consequently, if the target data were embedded in a high-frequency-band signal under the technical solution of the first embodiment, then when the corresponding audio signal is played by such systems or devices, the target data embedded in the high band may be difficult to extract and recover, and sometimes may not be received at all. To solve this problem and ensure the robustness of the solution according to the present invention, frequency points in the middle and low frequency bands may be preferred as the embedding positions of the target data in the various embodiments.
Specifically, in the invention the low frequency band is 30-150 Hz, the middle-low frequency band is 30-500 Hz, and the middle-high frequency band is 500-5000 Hz; in general, 30-4000 Hz is the most preferable frequency range for embedding the target data. Of course, those skilled in the art may select other frequency bands as the embedding range according to specific design requirements.
Although the foregoing general objects of the invention can be attained using the various schemes described above, in some cases the following measure is also required to further optimize the solution of the invention. The essence of the technical scheme of the invention is to embed specific target data in the original digital audio signal; the embedded target data can be regarded as a noise component of the new digital audio signal obtained after embedding. It is known that when the intensity of this noise component is large enough, the quality of the new digital audio signal is affected, and the transmission and extraction of the target data are affected as well. It is therefore necessary to evaluate the quality of the new digital audio signal obtained after embedding the target data before deciding whether to use and output it.
Accordingly, when the second digital audio signal is obtained according to any of the above-described embodiments of the present invention, its signal-to-noise ratio may be further calculated, and the quality of the second digital audio signal after embedding the target data evaluated on that basis. If the calculated signal-to-noise ratio is less than a predetermined threshold (which may be set by the skilled person according to specific design requirements, for example 17dB, 20dB, or 23dB), the quality of the second digital audio signal does not meet the predetermined requirement. In that case, parameters such as the embedding positions and the Fourier coefficients of the target data may be re-determined according to the above-mentioned scheme, and the steps of the foregoing embodiments re-executed, until the signal-to-noise ratio of the resulting second digital audio signal reaches the predetermined requirement; the second digital audio signal meeting the requirement is then output.
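The signal-to-noise check described above can be sketched as follows, treating the embedded data as the noise component (the 20 dB threshold is one of the example values mentioned; the function names are illustrative):

```python
import math

def snr_db(original, embedded):
    """SNR of the stego signal, treating (embedded - original) as noise."""
    sig = sum(x * x for x in original)
    noise = sum((y - x) ** 2 for x, y in zip(original, embedded))
    return float('inf') if noise == 0 else 10.0 * math.log10(sig / noise)

def acceptable(original, embedded, threshold_db=20.0):
    """Accept the second digital audio signal only if it meets the
    predetermined signal-to-noise requirement."""
    return snr_db(original, embedded) >= threshold_db
```

In a full implementation, a rejected signal would trigger re-selection of the embedding positions and re-execution of the embedding steps, as the description states.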
In all the above embodiments of the present invention, the embedded target data is actually a target data sequence in which one or more specific audio data and/or encoded data are arranged in series in a predetermined order. Specifically, a specific audio datum corresponds to a specific loudness and/or a specific pitch and/or timbre, while an encoded datum is a number expressed in a computer-readable form. A given target data sequence may consist simply of one or more specific audio data arranged in series in a predetermined order, or simply of one or more specific encoded data so arranged; it may also have a structure in which specific audio data and specific encoded data are interleaved and arranged in series in a predetermined order according to a predetermined rule.
In fact, the advantage of a target data sequence consisting simply of encoded data arranged in series is that the target data can be embedded, received, and extracted at high speed, which suits occasions requiring frequent and fast data transmission, such as live broadcast interaction.
In cases that are not sensitive to the real-time performance and speed of data transmission but require a larger data volume, a target data sequence consisting simply of specific audio data arranged in series is more appropriate.
In the specific embodiments of the invention, the preferable scheme is that any particular audio datum corresponds to a particular loudness and/or a particular pitch and/or timbre. Loudness, also called volume, is the strength of sound as felt by human ears; it is a subjective perception of the magnitude of sound, whose objective measure is the amplitude of the sound. Pitch is the height of a sound, determined by and proportional to the frequency of vibration. Timbre, also called tone quality, is the character of a sound as sensed by hearing; it is mainly determined by the spectrum of the sound, i.e., the composition of the fundamental and its harmonics.
In the above-described embodiments of the invention, one target data sequence may contain a prescribed number of specific audio data. Since any specific audio datum can be determined by the loudness, pitch, and timbre mentioned above, every target data sequence composed of a specified number of specific audio data can be associated with an entry of an information codebook, allowing data drawn from a large information codebook to be delivered.
For example, different pitches have different frequency values; assume n different frequency values are chosen, and denote the n pitches by A, B, C, D, ... . Different loudness levels have different sound intensity values; assume m different intensity values are chosen, and denote the m loudness levels by a, b, c, d, ... . Different timbres have different sound spectra; assume k different sound spectra are chosen, and denote them by 1, 2, 3, ..., k. On this basis, any one audio datum can be described in the following form:
(X, Y, Z), wherein X is a pitch, of which there are n; Y is a loudness, of which there are m; and Z is a timbre, of which there are k;
therefore, the codebook capacity W of a single audio datum in the present invention can be calculated by the following equation:
W=n×m×k
Suppose that in a target data sequence of the invention a unit audio group is formed simply of 5 audio data; then the codebook capacity of any unit audio group is calculated by the following equation:

W = (n×m×k)^5
When n = 10, m = 8, and k = 8, the value of W is:

W = (10×8×8)^5 = 2^30×10^5 ≈ 1.07×10^14 > 10^14
Of course, n, m, and k are natural numbers whose values can be selected or determined by the relevant technical personnel, according to the required capacity of the information codebook, when implementing the invention.
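The codebook-capacity arithmetic above can be verified directly (n, m, k and the group length of 5 are the example values from the text):

```python
def codebook_capacity(n, m, k, group_len=5):
    """W = (n*m*k)**group_len: the number of distinct unit audio groups
    expressible with n pitches, m loudness levels, and k timbres."""
    return (n * m * k) ** group_len

w = codebook_capacity(10, 8, 8)
print(w > 10 ** 14)   # True: (10*8*8)**5 = 2**30 * 10**5 ≈ 1.07e14
```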
As described above, in the various embodiments of the present invention a target data sequence can be constructed from a single kind of target data, for example from audio data alone or from encoded data alone. In some cases, however, it may be desirable to construct a target data sequence that mixes audio data and encoded data. So that the receiving side can extract the data from the first digital audio signal by the correct means, a predetermined identification data sequence is inserted at a predetermined position of the target data sequence; after analyzing and recognizing the identification data sequence, the receiving device extracts the corresponding data with the scheme that the identification data sequence indicates, for example a pattern recognition scheme for identifying the audio data in the target data sequence.
Of course, even if a target data sequence mixes audio data and coded data, any target data sequence can be constructed by prior agreement, without inserting any identification data sequence, as long as it is used in a completely closed information system; in an open information system, by contrast, identification data sequences are almost essential. Whether an identification data sequence is employed should therefore be decided by the skilled person during the design of the relevant system, according to its specific requirements.
In the various embodiments of the present invention described above, if an identification data sequence is employed, it is preferably constructed from coded data. However, the skilled person may instead use audio data, or a combination of audio data and encoded data, to form the identification data sequence according to specific design requirements.
In summary, an important advantage of the present invention is that, since the target data sequence is inserted at positions below the masking threshold of the digital audio signal, when the digital audio signal carrying the inserted target data sequence is played, the inserted audio signal sequence is not perceived by the human ear, owing to the masking effect.
In addition, because multi-dimensional audio signals (loudness, pitch and timbre) are used to form the audio data sequence, the capacity of the information codebook is given ample headroom, and sufficient information can be transmitted with a limited set of audio data.
In order to receive and acquire a target data sequence embedded in a digital audio signal by adopting the above various schemes of the invention, the invention also provides the following technical schemes:
when a device (such as a mobile phone or another smart device with a microphone and audio processing capability) receives a digital audio signal embedded with an audio signal sequence, framing the received digital audio signal into a plurality of audio frame data and performing windowing processing; performing a frequency-domain discrete Fourier transform on the plurality of audio frame data to obtain a plurality of spectrum data respectively corresponding to the plurality of audio frame data;
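The framing, windowing and transform steps can be sketched as follows; the frame length, hop size and Hann window are illustrative assumptions, since the invention does not fix these parameters:

```python
import numpy as np

def frames_to_spectra(signal: np.ndarray, frame_len: int = 1024,
                      hop: int = 512) -> np.ndarray:
    """Split a mono signal into overlapping frames, window each frame,
    and return the DFT spectrum of every frame (one row per frame)."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    spectra = np.empty((n_frames, frame_len), dtype=complex)
    for i in range(n_frames):
        frame = signal[i * hop:i * hop + frame_len]
        spectra[i] = np.fft.fft(frame * window)
    return spectra

# e.g. one second of a 440 Hz tone sampled at 16 kHz:
t = np.arange(16000) / 16000.0
spectra = frames_to_spectra(np.sin(2 * np.pi * 440 * t))
```

Each row of `spectra` then serves as one item of "spectrum data" for the masking-threshold analysis described next.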
mapping the spectrum data to the auditory critical bands (Bark domain), and calculating the masking threshold of each sub-band in the auditory critical bands; the masking thresholds are in one-to-one correspondence with the sub-bands;
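Mapping DFT bins to the Bark-domain critical bands can be sketched with Traunmüller's approximation of the Bark scale; the masking-threshold computation itself depends on the chosen psychoacoustic model and is not shown here:

```python
import numpy as np

def hz_to_bark(f_hz):
    """Traunmüller's approximation of the Bark (critical-band) scale."""
    f = np.asarray(f_hz, dtype=float)
    return 26.81 * f / (1960.0 + f) - 0.53

def bin_to_band(frame_len: int = 1024, sample_rate: int = 16000):
    """Assign each DFT bin (up to Nyquist) to its critical-band index."""
    freqs = np.fft.rfftfreq(frame_len, d=1.0 / sample_rate)
    return np.floor(hz_to_bark(freqs)).astype(int).clip(min=0)

bands = bin_to_band()  # e.g. bands[i] is the Bark sub-band of bin i
```

A per-sub-band masking threshold would then be computed over the bins sharing each band index, giving the one-to-one threshold/sub-band correspondence the description requires.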
selecting frequency points smaller than the masking threshold from the plurality of spectrum data as embedding positions; performing inverse quantization processing on the discrete Fourier coefficients of the embedding positions by means of a quantizer capable of realizing blind detection on the quantization result, to obtain the one-dimensional data sequence embedded in the digital audio signal. Referring to the embodiments of digital audio signal processing of the present invention, the aforementioned target data sequence is formed by serially arranging more than one item of specific audio data and/or encoded data in a predetermined order, wherein an item of specific audio data corresponds to a specific loudness and/or a specific pitch and/or a specific timbre.
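A quantizer that "realizes blind detection on the quantization result" is typically a quantization-index-modulation (QIM) scheme: embedding rounds a coefficient onto one of two interleaved lattices, and extraction recovers the bit from whichever lattice the received coefficient lies nearest, without access to the original signal. A minimal sketch, with the step size `delta` standing in for the embedding strength (both the QIM choice and the parameter values are assumptions, not specified by the patent):

```python
def qim_embed(coef: float, bit: int, delta: float = 1.0) -> float:
    """Quantize a DFT coefficient onto the lattice selected by `bit`."""
    offset = delta / 2.0 if bit else 0.0
    return round((coef - offset) / delta) * delta + offset

def qim_extract(coef: float, delta: float = 1.0) -> int:
    """Blind detection: pick the lattice nearest to the received value."""
    d0 = abs(coef - round(coef / delta) * delta)
    offset = delta / 2.0
    d1 = abs(coef - (round((coef - offset) / delta) * delta + offset))
    return 0 if d0 <= d1 else 1

# Round-trip survives perturbations smaller than delta / 4:
assert qim_extract(qim_embed(3.37, 1) + 0.1) == 1
assert qim_extract(qim_embed(3.37, 0) - 0.1) == 0
```

Because detection only compares distances to the two lattices, the receiver needs the step size but not the host coefficients, which is what makes the detection "blind".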
By adopting the above embodiment for extracting data from a digital audio signal, a corresponding one-dimensional data sequence can be extracted from the digital audio signal in which the target data sequence is embedded. However, as previously mentioned, when the one-dimensional data sequence is composed of audio data or a mixture of audio data and encoded data, or when the digital audio signal is transmitted in an open information system, it is necessary to search for a predetermined identification data sequence in the extracted one-dimensional data sequence and, according to the indication of that identification data sequence, perform pattern recognition on the audio data at the related positions in the extracted sequence, finally obtaining the corresponding target data sequence.
In some cases, obtaining the target data sequence means obtaining the actual information, for example when the target data sequence consists only of encoded data. In other cases, for example when the target data sequence is composed of audio data or a mixture of audio data and encoded data, even after the target data sequence has been extracted by pattern recognition according to the indication of the identification data sequence, it may still be necessary to transform the target data sequence by means of a predetermined encoding table in order to finally obtain the target data embedded in the digital audio signal.
Of course, in the present invention, after the aforementioned one-dimensional data sequence or target data sequence has been obtained, the receiving device (for example, a mobile phone or another smart device with a microphone and audio processing capability) may send the one-dimensional data sequence or target data sequence to a server; the server then searches for the predetermined identification data sequence, extracts the target data sequence by pattern recognition according to the indication of the identification data sequence, and transforms the target data sequence using the predetermined coding table, finally obtaining the target data embedded in the digital audio signal.
One specific application example is as follows: after the target data sequence embedded in the digital audio signal has been extracted by the above embodiments, if the target data sequence is composed simply of audio data, the specific audio data and combinations thereof in the target data sequence can be matched by code, that is, the data information corresponding to the audio signal sequence can be looked up in a predetermined encoding table.
The predetermined coding table usually contains at least the following items in one-to-one correspondence: an audio data sequence and the specific information corresponding to it. For example, following the earlier example of an audio data sequence composed of loudness, pitch and timbre, an audio data sequence of a specified length may correspond to the letter "a", the word "energy", the phrase "spectral data", an item object "cell phone", a web page link address "www.baidu.com", and so on. This way of conveying information is somewhat similar to the way telegram codes convey it; however, as described above, if the codebook capacity is sufficiently large, the present invention can transmit information in a manner different from the telegram-code manner, and data can be transmitted directly.
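Such a coding table amounts to a lookup from audio-data sequences to information. The sketch below uses hypothetical (pitch, loudness, timbre) triples; the keys and values are illustrative, borrowing only the example entries from the description:

```python
# Hypothetical coding table: each key is a fixed-length sequence of
# (pitch, loudness, timbre) triples; each value is the information it
# stands for.  The values mirror the examples in the description.
coding_table = {
    (("A", "a", 1), ("B", "b", 2), ("C", "c", 3), ("D", "d", 4), ("E", "e", 5)): "a",
    (("F", "f", 6), ("G", "g", 7), ("H", "h", 8), ("A", "b", 1), ("B", "c", 2)): "www.baidu.com",
}

def decode(sequence):
    """Look up an extracted unit audio group; None if it is not coded."""
    return coding_table.get(tuple(sequence))

assert decode([("A", "a", 1), ("B", "b", 2), ("C", "c", 3),
               ("D", "d", 4), ("E", "e", 5)]) == "a"
```

With the codebook capacity computed earlier (over 10¹⁴ entries for 5-item groups), such a table can be made large enough to carry words, phrases or links directly rather than telegram-style codes.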
Finally, it should be noted that: the above embodiments are only used to illustrate the technical solution of the present invention, and not to limit the same; while the invention has been described in detail and with reference to the foregoing embodiments, it will be understood by those skilled in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present invention.
Claims (6)
1. A method of digital audio signal processing, comprising:
framing the first digital audio signal into a plurality of audio frame data and performing windowing processing; performing frequency domain discrete Fourier transform on the plurality of audio frame data to obtain a plurality of first frequency spectrum data respectively corresponding to the plurality of audio frame data;
mapping the plurality of first spectrum data to auditory critical bands, and calculating a masking threshold of each sub-band in the auditory critical bands; the number of masking thresholds corresponds to the number of sub-bands;
selecting a frequency point smaller than the masking threshold value in the plurality of first spectrum data as an embedding position;
performing quantization processing on target data by using a quantizer capable of realizing blind detection on a quantization result, and assigning discrete Fourier coefficients of the embedding position by using the quantization processing result to obtain a plurality of second spectrum data corresponding to the plurality of first spectrum data;
and performing inverse discrete Fourier transform on the plurality of second spectrum data to obtain a second digital audio signal.
2. The method of claim 1, wherein the target data is obtained by:
acquiring more than one specific audio data and/or coded data, and serially arranging the more than one specific audio data and/or coded data into a target data sequence according to a predetermined sequence; or,
acquiring more than one specific audio data and/or coded data, and serially arranging the more than one specific audio data and/or coded data into a target data sequence according to a predetermined sequence; and inserting a predetermined identification data sequence at a predetermined position of the target data sequence; the identification data sequence is formed by arranging predetermined coded data according to the appointed length and sequence;
wherein the specific audio data corresponds to a specific loudness and/or a specific pitch and/or timbre.
3. The method according to claim 1 or 2, wherein the quantizing the target data by using a quantizer capable of blind detection on the quantization result, and assigning the discrete Fourier coefficients of the embedding positions by using the quantization result, comprises:
calculating corresponding embedding strength according to the masking threshold of the audio frame data based on the embedding position so as to determine the data quantity embedded in the corresponding audio frame data;
and according to the embedding strength, carrying out quantization processing on target data by adopting a quantizer capable of realizing blind detection on a quantization result, and assigning the discrete Fourier coefficient of the embedding position by using the result of the quantization processing.
4. The method of claim 1 or 2, further comprising:
when the first spectrum data corresponding to the frequency point is smaller than a minimum masking threshold; and/or when the frequency point is located in the middle and low frequency bands of the audio, taking the frequency point as an embedding position, the middle and low frequency range being 30 Hz to 4 kHz; and/or,
and calculating the signal-to-noise ratio of the second digital audio signal, and outputting the second digital audio signal when the signal-to-noise ratio of the second digital audio signal is higher than a preset threshold range.
5. A method of extracting data from a digital audio signal, comprising:
framing the first digital audio signal into a plurality of audio frame data and performing windowing processing; performing frequency domain discrete Fourier transform on the plurality of audio frame data to obtain a plurality of first frequency spectrum data respectively corresponding to the plurality of audio frame data;
mapping the plurality of first spectrum data to auditory critical bands, and calculating a masking threshold of each sub-band in the auditory critical bands; the number of masking thresholds corresponds to the number of sub-bands;
selecting a frequency point smaller than the masking threshold value in the plurality of first spectrum data as an embedding position;
performing inverse quantization processing on the discrete Fourier coefficients of the embedding position by adopting a quantizer capable of realizing blind detection on a quantization result, to obtain a target data sequence embedded in the first digital audio signal; wherein the target data sequence is formed by more than one specific audio data and/or coded data arranged in series according to a predetermined sequence; the specific audio data corresponds to a specific loudness and/or a specific pitch and/or timbre.
6. The method of claim 5, further comprising:
searching a preset identification data sequence in the target data sequence, and carrying out pattern recognition on the target data sequence according to the identification data sequence to obtain a corresponding target data sequence; or,
and searching a preset identification data sequence in the target data sequence, carrying out pattern recognition on the target data sequence according to the identification data sequence to obtain a corresponding target data sequence, and transforming the target data sequence by utilizing a preset coding table to obtain target data embedded into the first digital audio signal.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510447092.2A CN106409301A (en) | 2015-07-27 | 2015-07-27 | Digital audio signal processing method |
PCT/CN2016/087445 WO2017016363A1 (en) | 2015-07-27 | 2016-06-28 | Method for processing digital audio signal |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510447092.2A CN106409301A (en) | 2015-07-27 | 2015-07-27 | Digital audio signal processing method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN106409301A true CN106409301A (en) | 2017-02-15 |
Family
ID=57884085
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201510447092.2A Pending CN106409301A (en) | 2015-07-27 | 2015-07-27 | Digital audio signal processing method |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN106409301A (en) |
WO (1) | WO2017016363A1 (en) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108281152A (en) * | 2018-01-18 | 2018-07-13 | 腾讯音乐娱乐科技(深圳)有限公司 | Audio-frequency processing method, device and storage medium |
CN109257688A (en) * | 2018-07-23 | 2019-01-22 | 东软集团股份有限公司 | Audio distinguishes method, apparatus, storage medium and electronic equipment |
CN110447071A (en) * | 2017-03-28 | 2019-11-12 | 索尼公司 | Information processing unit, information processing method and program |
CN113362835A (en) * | 2020-03-05 | 2021-09-07 | 杭州网易云音乐科技有限公司 | Audio watermark processing method and device, electronic equipment and storage medium |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102142255A (en) * | 2010-07-08 | 2011-08-03 | 北京三信时代信息公司 | Method for embedding and extracting digital watermark in audio signal |
CN102959622A (en) * | 2010-02-26 | 2013-03-06 | 弗兰霍菲尔运输应用研究公司 | Watermark signal provision and watermark embedding |
CN104505096A (en) * | 2014-05-30 | 2015-04-08 | 华南理工大学 | Method and device using music to transmit hidden information |
Family Cites Families (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
KR100560429B1 (en) * | 2003-12-17 | 2006-03-13 | 한국전자통신연구원 | Watermarking apparatus and method using nonlinear quantization |
CN101101754B (en) * | 2007-06-25 | 2011-09-21 | 中山大学 | A Robust Audio Watermarking Method Based on Fourier Discrete Logarithmic Coordinate Transform |
CN101345054B (en) * | 2008-08-25 | 2011-11-23 | 苏州大学 | Digital watermark production and recognition method for audio files |
EP2362383A1 (en) * | 2010-02-26 | 2011-08-31 | Fraunhofer-Gesellschaft zur Förderung der Angewandten Forschung e.V. | Watermark decoder and method for providing binary message data |
JP6316288B2 (en) * | 2013-06-11 | 2018-04-25 | 株式会社東芝 | Digital watermark embedding device, digital watermark detection device, digital watermark embedding method, digital watermark detection method, digital watermark embedding program, and digital watermark detection program |
CN104795071A (en) * | 2015-04-18 | 2015-07-22 | 广东石油化工学院 | Blind audio watermark embedding and watermark extraction processing method |
- 2015-07-27: CN application CN201510447092.2A filed (published as CN106409301A, status: Pending)
- 2016-06-28: PCT application PCT/CN2016/087445 filed (published as WO2017016363A1)
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102959622A (en) * | 2010-02-26 | 2013-03-06 | 弗兰霍菲尔运输应用研究公司 | Watermark signal provision and watermark embedding |
CN102142255A (en) * | 2010-07-08 | 2011-08-03 | 北京三信时代信息公司 | Method for embedding and extracting digital watermark in audio signal |
CN104505096A (en) * | 2014-05-30 | 2015-04-08 | 华南理工大学 | Method and device using music to transmit hidden information |
Non-Patent Citations (5)
Title |
---|
智西湖: "Fundamentals of Computer Applications" (《计算机应用基础》), Beijing University of Posts and Telecommunications Press, 31 August 2010 *
李伟 et al.: "A Survey of Digital Audio Watermarking Technology" ("数字音频水印技术综述"), Journal on Communications (《通信学报》) *
王丽娜 et al.: "Information Hiding Technology and Applications" (《信息隐藏技术与应用》), Wuhan University Press, 31 May 2012 *
王俊杰 et al.: "Research on Digital Watermarking and Information Security Technology" (《数字水印与信息安全技术研究》), Intellectual Property Publishing House, 31 August 2014 *
陶智 et al.: "Digital Audio Watermarking Based on the Psychoacoustic Model and Critical-Band Wavelet Transform" ("基于心理声学模型和临界频带子波变换的数字声频水印"), Acta Acustica (《声学学报》) *
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110447071A (en) * | 2017-03-28 | 2019-11-12 | 索尼公司 | Information processing unit, information processing method and program |
CN110447071B (en) * | 2017-03-28 | 2024-04-26 | 索尼公司 | Information processing apparatus, information processing method, and removable medium recording program |
CN108281152A (en) * | 2018-01-18 | 2018-07-13 | 腾讯音乐娱乐科技(深圳)有限公司 | Audio-frequency processing method, device and storage medium |
CN109257688A (en) * | 2018-07-23 | 2019-01-22 | 东软集团股份有限公司 | Audio distinguishes method, apparatus, storage medium and electronic equipment |
CN113362835A (en) * | 2020-03-05 | 2021-09-07 | 杭州网易云音乐科技有限公司 | Audio watermark processing method and device, electronic equipment and storage medium |
CN113362835B (en) * | 2020-03-05 | 2024-06-07 | 杭州网易云音乐科技有限公司 | Audio watermarking method, device, electronic equipment and storage medium |
Also Published As
Publication number | Publication date |
---|---|
WO2017016363A1 (en) | 2017-02-02 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US11961527B2 (en) | Methods and apparatus to perform audio watermarking and watermark detection and extraction | |
US11557304B2 (en) | Methods and apparatus for performing variable block length watermarking of media | |
AU2001251274B2 (en) | System and method for adding an inaudible code to an audio signal and method and apparatus for reading a code signal from an audio signal | |
CN101933086B (en) | Method and apparatus for processing audio signal | |
CN102982806B (en) | Methods and apparatus to perform audio signal decoding | |
AU2001251274A1 (en) | System and method for adding an inaudible code to an audio signal and method and apparatus for reading a code signal from an audio signal | |
CN1808568B (en) | Audio encoding/decoding apparatus having watermark insertion/abstraction function and method using the same | |
EP2787503A1 (en) | Method and system of audio signal watermarking | |
CN106409301A (en) | Digital audio signal processing method | |
Chen et al. | Telephony speech enhancement by data hiding | |
AU2012241085B2 (en) | Methods and apparatus to perform audio watermarking and watermark detection and extraction | |
Adib | A high capacity quantization-based audio watermarking technique using the DWPT | |
KR960003627B1 (en) | Decoding Method for Hearing Aids of Subband Coded Audio Signals | |
HK1055033A (en) | Multi-band spectral audio encoding | |
AU2008201526A1 (en) | System and method for adding an inaudible code to an audio signal and method and apparatus for reading a code signal from an audio signal |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication |
Application publication date: 20170215 |