
CN116524940B - Auditory non-sense analog watermark embedding method in voice generating process - Google Patents


Info

Publication number
CN116524940B
CN116524940B (application CN202310806107.4A)
Authority
CN
China
Prior art keywords
voice
frequency
watermark
frequency domain
traceable
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310806107.4A
Other languages
Chinese (zh)
Other versions
CN116524940A (en)
Inventor
田野
汤跃忠
陈骁
陈云坤
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Third Research Institute Of China Electronics Technology Group Corp
Beijing Zhongdian Huisheng Technology Co ltd
Original Assignee
Third Research Institute Of China Electronics Technology Group Corp
Beijing Zhongdian Huisheng Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Third Research Institute Of China Electronics Technology Group Corp, Beijing Zhongdian Huisheng Technology Co ltd filed Critical Third Research Institute Of China Electronics Technology Group Corp
Priority to CN202310806107.4A priority Critical patent/CN116524940B/en
Publication of CN116524940A publication Critical patent/CN116524940A/en
Application granted granted Critical
Publication of CN116524940B publication Critical patent/CN116524940B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 19/00: Speech or audio signal analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; coding or decoding of speech or audio signals using source-filter models or psychoacoustic analysis
    • G10L 19/018: Audio watermarking, i.e. embedding inaudible data in the audio signal
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00: Energy-efficient computing, e.g. low-power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Storage Device Security (AREA)

Abstract

The invention discloses an auditorily imperceptible analog watermark embedding method for the speech generation process, comprising the following steps: construct a speech time-frequency spectrum from the text to be synthesized; represent each character of the traceable watermark content by a string of N-bit binary code values; select N aim-point frequencies within an auditorily imperceptible frequency range; map the N-bit binary code value of each character to the N aim-point frequencies in order, forming a frequency-domain mask, where a code value of 1 (or 0) sets the mask to 1 (or 0) over the bandwidth of the corresponding aim-point frequency ± a Hz; segment the speech time-frequency spectrum and put the segments in correspondence with the characters of the traceable watermark content; apply frequency-domain masking to each segment using its character's mask, leaving the energy of a frequency band unchanged where the mask is 1 and setting it to zero where the mask is 0; and apply an inverse Fourier transform to the masked speech time-frequency spectrum to produce the speech. The invention ensures accurate tracing and effective management of generated speech.

Description

Auditory non-sense analog watermark embedding method in voice generating process
Technical Field
The invention relates to the technical field of speech generation, and in particular to an auditorily imperceptible analog watermark embedding method for the speech generation process.
Background
With the rapid development of deep-forgery technology, the realism, naturalness, and similarity to a target speaker's timbre of generated speech have greatly improved, to the point of being indistinguishable from genuine speech. Intelligent speech generation brings convenience to interactive applications and devices, but also threatens information integrity, social security, and other interests. In recent years, speech generation software based mainly on speech synthesis and voice conversion has spread widely on the internet, lowering the technical threshold and cost of producing speech while raising its quality. Criminals increasingly use generated speech in scenarios such as phishing, hiding their tracks or impersonating a victim's relatives to commit fraud, causing many victims and huge economic losses; this has become a prominent criminal problem affecting social stability.
Patent CN112712809A provides a method for detecting genuine and fake speech. It detects both the authenticity and the source of speech using a speech-type model with per-type speech-source models, addressing the result bias of existing speech-detection techniques and improving detection accuracy.
Patent CN112992126A provides a method for verifying the authenticity of speech. By training a speech-feature-extraction network and a classification model aimed at speech liveness detection, and classifying with a model of high discrimination and low confusion, it improves the accuracy of speech classification.
Patent CN115083422A provides a method for speech tracing and forensics, which extracts an algorithm fingerprint to predict the generation algorithm of the speech under test, yielding the generation source of the fake audio as the tracing result.
As the background of the above patents shows, speech anti-spoofing technology for generated-speech detection is still immature: most research focuses on detecting whether speech is genuine or fake, while tracing the generation source is difficult to realize. Moreover, because speech generation methods change rapidly and new ones keep emerging, anti-spoofing struggles in practical application, making accurate tracing and forensics hard to achieve and making it difficult to trace the company or individual that generated the speech.
Disclosure of Invention
The embodiments of the invention provide an auditorily imperceptible analog watermark embedding method for the speech generation process, to solve the problem of poor traceability in the prior art.
The auditorily imperceptible analog watermark embedding method in the speech generation process comprises the following steps:
acquiring the text to be synthesized and the traceable watermark content;
constructing a speech time-frequency spectrum based on the text to be synthesized;
representing each character in the traceable watermark content by a string of N-bit binary code values, the characters being Chinese characters, Arabic numerals, English letters, or special symbols;
selecting N aim-point frequencies within an auditorily imperceptible frequency range;
mapping the N-bit binary code value of each character to the N aim-point frequencies according to a preset correspondence rule to form the character's frequency-domain mask; if a binary code value is 1, the frequency-domain mask over the bandwidth of the corresponding aim-point frequency ± a Hz is 1; if it is 0, the mask over that bandwidth is 0;
segmenting the speech time-frequency spectrum into multiple segments and putting the segments in correspondence with the characters of the traceable watermark content, the number of segments being greater than or equal to the number of characters in the traceable watermark content;
performing frequency-domain masking on each corresponding speech time-frequency-spectrum segment based on the frequency-domain mask of each character in the traceable watermark content: where the mask is 1, the energy of the corresponding frequency band in the segment is unchanged; where the mask is 0, the energy of the corresponding frequency band is set to zero;
performing an inverse Fourier transform on the masked speech time-frequency spectrum to generate the speech.
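The mask-construction step described above can be sketched in code. This is a minimal, hypothetical illustration: the FFT size (2048), sample rate (16 kHz), bandwidth a = 10 Hz, and linearly spaced aim-point frequencies are placeholder choices for the example, not values fixed by the invention (which leaves N and a configurable and spaces the aim points on an inverse Mel scale).

```python
import numpy as np

def char_to_mask(ch, aim_freqs, freqs, a_hz=10.0):
    """Build the frequency-domain mask of one character: bit k of its
    16-bit Unicode code value controls the band aim_freqs[k] +/- a_hz."""
    bits = format(ord(ch), "016b")
    mask = np.ones_like(freqs)                     # untouched bins keep energy
    for bit, f0 in zip(bits, aim_freqs):
        mask[np.abs(freqs - f0) <= a_hz] = float(bit)
    return mask

freqs = np.fft.rfftfreq(2048, d=1.0 / 16000)       # FFT bin frequencies (placeholder)
aim = np.linspace(4000, 5500, 16)                  # placeholder aim-point spacing
mask = char_to_mask("A", aim, freqs)
```

Multiplying a spectrum frame by such a mask zeroes the bands whose bit is 0 and leaves the rest of the spectrum intact.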
The embodiments of the invention also provide a speech tracing method based on auditorily imperceptible analog watermark embedding during speech generation, comprising the following steps:
acquiring the speech to be traced, the speech having been generated by the auditorily imperceptible analog watermark embedding method described above;
counting the energy distribution over the frequency bands of the speech to be traced, and decoding the traceable watermark content carried by the speech in combination with the watermark encoding rules;
tracing the speech based on the traceable watermark content.
The embodiments of the invention also provide a speech generation device, comprising: a memory, a processor, and a computer program stored in the memory and executable on the processor; when executed by the processor, the program performs the steps of the method described above.
The embodiments of the invention also provide a computer-readable storage medium storing a program which, when executed by a processor, implements the steps of the method described above.
With the embodiments of the invention, "in-advance", accompanying embedding of the generated-speech identifier can be realized, ensuring accurate tracing and effective management of generated speech.
The foregoing is only an overview of the technical solution of the invention. So that the technical means of the invention can be understood more clearly and implemented according to the specification, and so that the above and other objects, features, and advantages become more apparent, specific embodiments of the invention are set forth below.
Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the invention. In the drawings:
FIG. 1 is a flowchart of the speech tracing method based on auditorily imperceptible analog watermark embedding during speech generation, in an embodiment of the invention;
FIG. 2 is a flowchart of the auditorily imperceptible analog watermark embedding method for the speech generation process, in an embodiment of the invention;
FIG. 3 is a schematic diagram of a Mel filter bank distribution in an embodiment of the present invention;
fig. 4 is a schematic diagram of an inverse mel filter bank distribution in an embodiment of the present invention.
Detailed Description
Exemplary embodiments of the present invention will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present invention are shown in the drawings, it should be understood that the present invention may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the invention to those skilled in the art. In other instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description.
The auditorily imperceptible analog watermark embedding method in the speech generation process comprises the following steps:
acquiring the text to be synthesized and the traceable watermark content; the traceable watermark content (also simply called watermark content) is used to trace the speech generated from the text to be synthesized.
Constructing a voice time spectrum based on the text to be synthesized;
Representing each character in the traceable watermark content by a string of N-bit binary code values; the characters are Chinese characters, Arabic numerals, English letters, or special symbols. It can be understood that the traceable watermark content may be built from one or more of Chinese characters, Arabic numerals, English letters, and special symbols, and that each such character corresponds to a string of N-bit binary code values.
Selecting N aim-point frequencies within an auditorily imperceptible frequency range; the "auditorily imperceptible frequency range" is understood here to mean that frequencies within this range are not easily perceived by the human ear. Here, N is a positive integer.
Mapping the N-bit binary code value of each character to the N aim-point frequencies according to a preset correspondence rule to form the character's frequency-domain mask; if a binary code value is 1, the frequency-domain mask over the bandwidth of the corresponding aim-point frequency ± a Hz is 1; if it is 0, the mask over that bandwidth is 0.
It will be appreciated that the N-bit frequency-domain mask of a character is itself a string of N binary values equal to the N-bit code value used to represent the character, except that each of its bits corresponds to a frequency band, namely the bandwidth of an aim-point frequency ± a Hz. Here, a is a positive number that may be set according to practice; for example, a may be set to 10.
Segmenting the speech time-frequency spectrum into multiple segments and putting the segments in correspondence with the characters of the traceable watermark content, the number of segments being greater than or equal to the number of characters. For example, when the number of segments equals the number of characters, the segments may correspond one-to-one, in order, with the characters. When there are more segments than characters, the characters of the traceable watermark content may be cycled across the segments.
Performing frequency domain masking processing on the corresponding voice time spectrum segment based on the frequency domain mask corresponding to each character in the traceable watermark content; when the frequency domain mask is 1, the energy of the frequency band corresponding to the frequency domain mask in the voice frequency spectrum segment is unchanged; when the frequency domain mask is 0, setting the energy of the frequency band corresponding to the frequency domain mask in the voice frequency spectrum segment to zero.
And performing inverse Fourier transform on the masked voice time spectrum to generate voice carrying the traceable watermark content.
With the embodiments of the invention, the watermark information is embedded during the intelligent speech generation process and has the properties of concealment, re-recording resistance, and auditory imperceptibility; it realizes "in-advance", accompanying embedding of the generated-speech identifier, thereby ensuring accurate tracing and effective management of generated speech.
On the basis of the above-described embodiments, various modified embodiments are further proposed, and it is to be noted here that only the differences from the above-described embodiments are described in the various modified embodiments for the sake of brevity of description.
In some embodiments of the present invention, the characterizing each character in the traceable watermark content with a string of binary encoded values comprises:
each character in the traceable watermark content is characterized by a string of 16-bit binary coded values based on Unicode coding technology.
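The Unicode step can be made concrete with a small helper; `char_to_bits` is an illustrative name, and the sketch assumes characters from the Basic Multilingual Plane so that one character fits in 16 bits.

```python
def char_to_bits(ch: str) -> str:
    """Return the 16-bit binary Unicode code value of a BMP character."""
    code = ord(ch)
    assert code < 0x10000, "the scheme assumes Basic Multilingual Plane characters"
    return format(code, "016b")

# The two characters of "慧声" ("Huisheng"):
print(char_to_bits("慧"))  # 0110000101100111  (U+6167)
print(char_to_bits("声"))  # 0101100011110000  (U+58F0)
```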
Further, the selecting N aiming frequencies in the frequency range of hearing-free includes:
16 aiming point frequencies are selected within the range of 4000 Hz-5500 Hz.
In the embodiments of the invention, the frequency band for aim-point selection is placed at 4000 Hz to 5500 Hz mainly in consideration of the auditory imperceptibility of the watermark.
Further, the 16 aim-point frequencies are distributed on an inverse Mel scale.
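The text does not spell out the exact inverse-Mel spacing rule, so the sketch below makes an assumption: it mirrors a Mel-spaced grid about the centre of the 4000-5500 Hz band, a common way to obtain an inverse-Mel distribution that is denser toward high frequencies. `inverse_mel_points` is an illustrative helper, not a reproduction of the patent's Table 1.

```python
import math

def hz_to_mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def inverse_mel_points(f_lo=4000.0, f_hi=5500.0, n=16):
    """n frequencies in [f_lo, f_hi], denser toward the high end."""
    m_lo, m_hi = hz_to_mel(f_lo), hz_to_mel(f_hi)
    # Mel-spaced grid (denser toward the low end) ...
    mel_grid = [mel_to_hz(m_lo + (m_hi - m_lo) * k / (n - 1)) for k in range(n)]
    # ... reflected about the band centre, so points crowd the high end.
    return sorted(f_lo + f_hi - f for f in mel_grid)

pts = inverse_mel_points()
```

The reflection preserves the band endpoints while reversing where the grid is dense, matching the "inverse Mel filter bank samples the high-frequency part more densely" behaviour described below.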
In some embodiments of the present invention, the associating the N-bit binary coded value corresponding to each of the characters with the N aiming frequencies based on a preset correspondence rule includes:
Sequentially mapping the N-bit binary code value of each character to the N aim-point frequencies. It can be understood that the bit order and the aim-point frequency order are kept consistent during the correspondence: the k-th bit corresponds to the k-th aim-point frequency.
According to some embodiments of the invention, the constructing a speech time spectrum based on the text to be synthesized includes:
acquiring a plurality of samples, wherein each sample comprises a text sample and a corresponding voice time spectrum sample thereof;
constructing a voice generator and training the voice generator based on the plurality of samples;
and inputting the text to be synthesized into a voice generator which completes training so as to output a voice time spectrum.
Unlike current speech generators trained on paired text-speech data, the speech generator in the embodiments of the invention is trained on paired text and speech time-frequency-spectrum data. At inference time, given the text to be synthesized, the generator outputs the corresponding speech time-frequency spectrum.
According to some embodiments of the invention, the traceable watermark content includes user information corresponding to the text to be synthesized, voice generation technology development company information, and processing device MAC addresses.
Further, the traceable watermark content also includes start-stop identifiers; for example, "!" may serve as the start-stop identifier of the traceable watermark content.
The embodiments of the invention also provide a speech tracing method based on auditorily imperceptible analog watermark embedding during speech generation, comprising the following steps:
acquiring the speech to be traced, the speech having been generated by the auditorily imperceptible analog watermark embedding method described above;
counting energy distribution information on a frequency band corresponding to the voice to be traced, and decoding tracing watermark content carried by the voice to be traced by combining watermark coding rules;
and tracing the voice to be traced based on the tracing watermark content.
The following describes in detail a speech generation process auditory non-perceptual analog watermark embedding method and a speech tracing method according to the present invention in a specific embodiment with reference to the accompanying drawings. It is to be understood that the following description is exemplary only and is not to be taken as limiting the invention in any way.
To ensure that the generated-speech identifier is not easily erased in later use, the invention provides a speech tracing method based on auditorily imperceptible analog watermark embedding during speech generation. Watermark information (including the speech-generation technology developer, user account, MAC address, and the like) is embedded during intelligent speech generation; it is concealed, re-recording resistant, and auditorily imperceptible, realizing "in-advance", accompanying embedding of the generated-speech identifier and thereby ensuring accurate tracing and effective management of generated speech.
Referring to fig. 1 and 2, the auditorily imperceptible analog watermark embedding method in the speech generation process and the corresponding speech tracing method according to an embodiment of the invention comprise the following steps:
S1, for any speech generation task, simultaneously acquire the text to be synthesized and its traceable watermark content, as input by the user;
the traceable watermark content mainly consists of Chinese characters, Arabic numerals, and English letters, and includes, but is not limited to, the user account of the current service request (such as the user name and ID number), the speech-generation technology developer, the MAC address, and so on.
S2, taking the text to be synthesized as the input of a voice generator, and outputting a generated voice time spectrum; taking the traceable watermark content as the input of a watermark encoder, and outputting the generated frequency domain mask;
Unlike current speech generators trained on paired text-speech data, the speech generator of the invention is trained on paired text and speech time-frequency-spectrum data. At inference time, given the text to be synthesized, the generator outputs the corresponding speech time-frequency spectrum.
S3, carrying out frequency domain masking processing on the voice time spectrum based on the frequency domain mask, and outputting the masked voice time spectrum;
S4, perform an inverse Fourier transform on the masked speech time-frequency spectrum, and finally output speech embedded with the traceability watermark;
S5, count the energy distribution over the frequency bands of the speech to be traced according to the watermark encoding rules, decode the traceable watermark content, and realize tracing of the generated speech.
Taking the traceable watermark content as the input of the watermark encoder and outputting the generated frequency-domain mask specifically comprises the following steps:
(1) Unicode-encode the text content to obtain the 16-bit binary code value of each Chinese character, numeral, or letter; for example, the 16-bit binary Unicode encodings of the two characters "慧声" ("Huisheng") are "0110000101100111" and "0101100011110000";
since the watermark content includes user accounts (such as user names and ID numbers), the speech-generation technology developer, MAC addresses, and the like, involving text such as Chinese characters, Arabic numerals, and English letters, the embodiment of the invention uses Unicode encoding to convert the watermark text content into 16-bit binary codes.
(2) Select 16 aim-point frequencies in the 4000-5500 Hz range, with the frequencies distributed on an inverse Mel scale; the correspondence between aim points and frequencies is shown in Table 1;
in the embodiment of the invention, the aim-point frequency band is placed at 4000-5500 Hz mainly in consideration of the auditory imperceptibility of the watermark. Research shows that human perception of frequency is not linear and that sensitivity differs across frequencies: people generally perceive low-frequency signals more sensitively than high-frequency ones. Typically, the difference between 500 Hz and 1000 Hz is easy to hear, while the difference between 7500 Hz and 8000 Hz is much harder to notice. For this reason researchers proposed the Mel scale, which converts linear frequency into Mel frequency, and designed the Mel filter bank shown in Fig. 3; the Mel filters sample the low-frequency part more finely, amplifying the low-frequency signals to which human hearing is more sensitive. The mapping from linear frequency f (in Hz) to Mel frequency is Mel(f) = 2595 × log10(1 + f / 700).
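A common form of this linear-to-Mel mapping, m(f) = 2595 × log10(1 + f/700), can be written as a pair of small helpers; the inverse, `mel_to_hz`, is what one would use when laying out inverse-Mel aim points.

```python
import math

def hz_to_mel(f_hz: float) -> float:
    """Linear frequency (Hz) to Mel frequency: 2595 * log10(1 + f/700)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(m: float) -> float:
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
```

Equal Mel steps correspond to small Hz steps at low frequencies and large Hz steps at high frequencies, which is exactly the "finer sampling of the low-frequency part" property described above.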
in the embodiment of the invention, in order to reduce the perception of human hearing on watermark information, the selected aiming frequency should be a high-frequency part insensitive to human hearing. Therefore, the frequency band range selected by aiming points is designed on the frequency 4000-5500Hz, and 16 aiming point frequencies are selected according to the reverse Meyer spectrum scale, and the distribution of the filter bank is shown in figure 4. It can be seen that the inverse mel filter bank is more densely sampled in the high frequency part than the mel filter bank.
The aiming point frequency added by the watermark is arranged in a high-frequency part with less sensitive human hearing, and an inverse Mel filter bank is adopted, so that the hearing perceptibility of the human being to the watermark can be effectively reduced.
(3) Map the 16-bit binary code value of each Chinese character, numeral, or letter to the 16 aim-point frequencies in order. If a code value is 1, the frequency-domain mask over the ±10 Hz bandwidth of the corresponding aim-point frequency is 1; if it is 0, that mask is 0. This generates the frequency-domain mask of each Chinese character, numeral, or letter in the traceable watermark content;
(4) Repeat steps (1)-(3) to obtain the frequency-domain masks of the entire traceable watermark content; the "frequency-band aim-point example" column of Table 1 takes the two characters "慧声" as an example and shows the distribution of the corresponding 0/1 band masks;
Table 1. Aim-point frequency correspondence and frequency-band mask examples
(5) Use "!" as the start-stop identifier: obtain its frequency-domain mask by the operations of steps (1)-(3), and concatenate it, in order, with the frequency-domain masks of the traceable watermark content obtained in step (4) to obtain the final frequency-domain mask.
Performing frequency-domain masking on the speech time-frequency spectrum based on the frequency-domain mask and outputting the masked spectrum specifically comprises the following steps:
(1) Segment the speech time-frequency spectrum; the segment length may be 5 ms, 10 ms, etc., according to the actual situation. In a specific application, each speech segment expresses the 16-bit binary code of one Chinese character, numeral, or letter; for example, if the watermark content contains 10 Chinese characters (including the start-stop identifiers) and the segment length is 5 ms, 50 ms of speech is needed to express the watermark content completely.
(2) Multiply the spectrum of each speech segment element-wise by the corresponding 16-bit frequency-domain mask of the traceable watermark content: where the mask is 1, the spectral energy of the speech in the corresponding band is unchanged; where the mask is 0, the spectral energy in that band is set to zero. This realizes the frequency-domain masking.
(3) If the speech time-frequency spectrum is longer than the watermark, the traceable watermark content is marked cyclically over the spectrum. In a specific application scenario, the number of cycles is determined by the duration of the generated spectrum and the length of the watermark content. Cyclic marking prevents loss of the traceable watermark through editing such as cutting the speech.
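Steps (2) and (3) amount to an element-wise product with cyclic reuse of the per-character masks, sketched below on a toy spectrum; the array shapes and values are illustrative assumptions, not the patent's data.

```python
import numpy as np

def apply_watermark(spec, char_masks):
    """spec: (n_segments, n_bins) magnitude time-spectrum, one row per segment;
    char_masks: list of (n_bins,) 0/1 masks, one per watermark character.
    The watermark cycles when there are more segments than characters."""
    out = np.empty_like(spec)
    for i, seg in enumerate(spec):
        out[i] = seg * char_masks[i % len(char_masks)]  # mask 1 keeps, 0 zeroes
    return out

spec = np.ones((5, 4))                        # toy spectrum: 5 segments, 4 bins
masks = [np.array([1., 0., 1., 1.]), np.array([0., 1., 1., 1.])]
wm = apply_watermark(spec, masks)
```

With 5 segments and 2 masks, segments 0, 2, 4 carry the first character and segments 1, 3 the second, so cutting away part of the speech still leaves complete copies of the watermark.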
Counting the energy distribution over the frequency bands of the speech to be traced according to the watermark encoding rules, decoding the traceable watermark content, and realizing tracing of the generated speech comprises the following steps:
(1) According to the watermark encoding rule, count the energy of the speech to be traced on the 16 aim-point frequency bands: if the energy is 0, the watermark code value is 0; if it is greater than 0, the code value is 1. This yields the binary code values of the traceable watermark. In a specific application, considering noise and other interference introduced during re-recording, the decision threshold may be set to a value close to zero rather than exactly 0.
(2) From the recovered binary code values, invert the Unicode encoding to obtain the text before encoding, and search for the watermark start-stop identifiers "!" to extract the traceable watermark content.
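The decoding side can be sketched as follows; the aim-point frequencies, FFT layout, and near-zero threshold `eps` are illustrative assumptions rather than values fixed by the method. The round trip below embeds and recovers the character "慧".

```python
import numpy as np

def decode_char(segment, aim_freqs, freqs, a_hz=10.0, eps=1e-6):
    """Recover one character from one segment's spectrum by thresholding the
    energy in each aim-point band (eps ~ 0 tolerates re-recording noise)."""
    bits = ""
    for f0 in aim_freqs:
        band = np.abs(freqs - f0) <= a_hz
        bits += "1" if float(np.sum(segment[band] ** 2)) > eps else "0"
    return chr(int(bits, 2))

# Build a synthetic segment carrying '慧' (energy only in the "1" bands):
freqs = np.fft.rfftfreq(2048, d=1.0 / 16000)
aim = np.linspace(4000, 5500, 16)              # placeholder aim-point spacing
seg = np.zeros_like(freqs)
for bit, f0 in zip(format(ord("慧"), "016b"), aim):
    if bit == "1":
        seg[np.abs(freqs - f0) <= 10.0] = 1.0
```

In a full decoder this is applied segment by segment, and the recovered character stream is then scanned for the "!" start-stop identifiers.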
Compared with "after-the-fact" detection technologies represented by speech anti-spoofing, the embodiments of the invention achieve more accurate tracing and forensics, providing a more effective and reliable method for managing generated intelligent-speech services.
In addition, on the one hand, the embodiment of the present invention selects aiming-point frequencies according to the inverse Mel scale within 4000-5500 Hz and adds watermark information on a bandwidth of ±10 Hz around each aiming-point frequency, effectively ensuring the concealment and auditory imperceptibility of the watermark information; on the other hand, by adding the watermark at the analog-signal level, the watermark information is still retained after the voice signal is replayed and re-recorded, avoiding loss of the traceability identifier of the generated audio during secondary processing such as re-recording, and effectively ensuring the re-recording resistance of the traceable watermark information.
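One way to place 16 aiming-point frequencies on an inverse Mel scale within 4000-5500 Hz can be sketched as follows. The patent does not publish its exact spacing formula, so this sketch assumes the standard HTK Mel formula and interprets "inverse Mel scale" as Mel spacing mirrored so that points grow denser toward the high end of the band.

```python
import numpy as np

def mel(f):
    """Standard HTK Mel formula (an assumption; the patent's formula is unspecified)."""
    return 2595.0 * np.log10(1.0 + f / 700.0)

def inv_mel(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def aiming_frequencies(f_lo=4000.0, f_hi=5500.0, n=16):
    """n frequencies on an inverse Mel scale in [f_lo, f_hi]: Mel-spaced points
    (denser near f_lo) are mirrored about the band so spacing tightens toward f_hi."""
    mels = np.linspace(mel(f_lo), mel(f_hi), n)
    hz = inv_mel(mels)                 # plain Mel spacing: denser near f_lo
    return f_lo + f_hi - hz[::-1]      # mirror across the band: denser near f_hi
```

Each returned frequency would then anchor a ±10 Hz band in which the corresponding mask bit is applied.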
It should be noted that the above description covers only preferred embodiments of the present invention and is not intended to limit the present invention; various modifications and changes will be apparent to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention shall be included in the protection scope of the present invention.
The embodiment of the invention also provides a voice generating device, comprising: a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the computer program, when executed by the processor, implements the auditory non-perceptual analog watermark embedding in the voice generation process and/or the voice tracing method based on auditory non-perceptual analog watermark embedding in the voice generation process.
The embodiment of the invention also provides a computer readable storage medium on which an information transmission program is stored, the program, when executed by a processor, implementing the auditory non-perceptual analog watermark embedding in the voice generation process and/or the voice tracing method based on auditory non-perceptual analog watermark embedding in the voice generation process.
The computer readable storage medium of the present embodiment includes, but is not limited to: ROM, RAM, magnetic disks, optical disks, etc. The above-mentioned device may be a mobile phone, a computer, a server, an air conditioner, a network device, or the like.
The terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
Any reference signs placed between parentheses shall not be construed as limiting the claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The use of the words first, second, third, etc. are used to distinguish between similar objects and not to indicate any order. These words may be interpreted as names.
"and/or", describes an association relationship of an association object, and indicates that there may be three relationships, for example, a and/or B, and may indicate: a exists alone, A and B exist together, and B exists alone. The character "/" generally indicates that the context-dependent object is an "or" relationship.
In addition, what is not described in detail in the present specification belongs to the known technology of those skilled in the art.

Claims (10)

1. A method for auditory non-perceptual analog watermark embedding in a speech generation process, comprising:
acquiring a text to be synthesized and the content of a traceable watermark;
constructing a voice time spectrum based on the text to be synthesized;
characterizing each character in the traceable watermark content by a string of N-bit binary code values, wherein the characters are Chinese characters, Arabic numerals, English letters, or special symbols;
selecting N aiming point frequencies in a frequency range without hearing;
corresponding the N-bit binary code value of each character to the N aiming point frequencies based on a preset correspondence rule so as to form a frequency domain mask corresponding to the character; if a binary code value is 1, the frequency domain mask on the bandwidth of ±a Hz around the corresponding aiming point frequency is 1, and if the binary code value is 0, the frequency domain mask on the bandwidth of ±a Hz around the corresponding aiming point frequency is 0;
segmenting the voice time spectrum to obtain multi-segment voice time spectrum segments, and corresponding the multi-segment voice time spectrum segments with a plurality of characters in the traceable watermark content, wherein the number of segments of the multi-segment voice time spectrum segments is greater than or equal to the number of characters in the traceable watermark content;
performing frequency domain masking processing on the corresponding voice time spectrum segment based on the frequency domain mask corresponding to each character in the traceable watermark content: when the frequency domain mask is 1, the energy of the frequency band corresponding to the frequency domain mask in the voice time spectrum segment is unchanged; when the frequency domain mask is 0, the energy of the frequency band corresponding to the frequency domain mask in the voice time spectrum segment is set to zero; and performing an inverse Fourier transform on the masked voice time spectrum to generate voice.
2. The method of claim 1, wherein said characterizing each character in the traceable watermark content with a string of binary coded values comprises:
each character in the traceable watermark content is characterized by a string of 16-bit binary coded values based on Unicode coding technology.
3. The method of claim 2, wherein selecting N aiming frequencies in the range of frequencies where hearing is imperceptible comprises:
selecting 16 aiming point frequencies within the range of 4000 Hz to 5500 Hz.
4. A method according to claim 3, wherein the 16 aiming point frequencies are distributed on an inverse Mel scale.
5. The method of claim 1, wherein the associating the N-bit binary coded value for each of the characters with the N aiming frequencies based on a preset correspondence rule comprises:
and sequentially corresponding N-bit binary coded values corresponding to each character to the N aiming point frequencies.
6. The method of claim 1, wherein constructing a speech time spectrum based on the text to be synthesized comprises:
acquiring a plurality of samples, wherein each sample comprises a text sample and a corresponding voice time spectrum sample thereof;
constructing a voice generator and training the voice generator based on the plurality of samples;
and inputting the text to be synthesized into a voice generator which completes training so as to output a voice time spectrum.
7. The method of claim 1, wherein the traceable watermark content includes user information corresponding to text to be synthesized, speech generation technology development company information, and processing device MAC addresses.
8. A voice tracing method based on auditory non-perceptual analog watermark embedding in a voice generation process, characterized by comprising the following steps:
acquiring a voice to be traced, wherein the voice to be traced is generated based on the auditory non-perceptual analog watermark embedding method in the voice generation process according to any one of claims 1 to 7;
counting energy distribution information on a frequency band corresponding to the voice to be traced, and decoding tracing watermark content carried by the voice to be traced by combining watermark coding rules;
and tracing the voice to be traced based on the tracing watermark content.
9. A speech generating device, comprising: memory, a processor and a computer program stored on the memory and executable on the processor, which when executed by the processor, performs the steps of the method according to any one of claims 1 to 8.
10. A computer readable storage medium, characterized in that it has stored thereon a program for realizing information transfer, which program, when executed by a processor, realizes the steps of the method according to any of claims 1 to 8.
CN202310806107.4A 2023-07-04 2023-07-04 Auditory non-sense analog watermark embedding method in voice generating process Active CN116524940B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310806107.4A CN116524940B (en) 2023-07-04 2023-07-04 Auditory non-sense analog watermark embedding method in voice generating process


Publications (2)

Publication Number Publication Date
CN116524940A CN116524940A (en) 2023-08-01
CN116524940B true CN116524940B (en) 2023-09-08

Family

ID=87394473

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310806107.4A Active CN116524940B (en) 2023-07-04 2023-07-04 Auditory non-sense analog watermark embedding method in voice generating process

Country Status (1)

Country Link
CN (1) CN116524940B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1529246A (en) * 2003-09-28 2004-09-15 王向阳 Digital audio-frequency water-print inlaying and detecting method based on auditory characteristic and integer lift ripple
CN104361890A (en) * 2014-11-10 2015-02-18 江苏梦之音科技有限公司 Method for embedding and recognizing broadcast audio watermark
CN108198563A (en) * 2017-12-14 2018-06-22 安徽新华传媒股份有限公司 A kind of Multifunctional audio guard method of digital copyright protection and content authentication
CN113782041A (en) * 2021-09-14 2021-12-10 随锐科技集团股份有限公司 Method for embedding and positioning watermark based on audio frequency-to-frequency domain
CN114974270A (en) * 2022-04-15 2022-08-30 北京邮电大学 Audio information self-adaptive hiding method

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9626977B2 (en) * 2015-07-24 2017-04-18 Tls Corp. Inserting watermarks into audio signals that have speech-like properties
US20220351425A1 (en) * 2021-04-30 2022-11-03 Mobeus Industries, Inc. Integrating overlaid digital content into data via processing circuitry using an audio buffer


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Advances in multimedia steganography; Zhang Weiming et al.; Journal of Image and Graphics (Issue 6); full text *


Similar Documents

Publication Publication Date Title
Roman et al. Proactive detection of voice cloning with localized watermarking
US6674861B1 (en) Digital audio watermarking using content-adaptive, multiple echo hopping
US20040254793A1 (en) System and method for providing an audio challenge to distinguish a human from a computer
CN100550723C (en) Camouflage communication method based on speech recognition
Liu et al. Detecting voice cloning attacks via timbre watermarking
Lu et al. Self-supervised audio spatialization with correspondence classifier
CN110444225B (en) Sound source target identification method based on feature fusion network
CN111510765A (en) Audio label intelligent labeling method and device based on teaching video
Dhar A blind audio watermarking method based on lifting wavelet transform and QR decomposition
CN116524940B (en) Auditory non-sense analog watermark embedding method in voice generating process
Bhattacharyya Hiding data in text through changing in alphabet letter patterns (calp)
Fayyad-Kazan et al. Verifying the audio evidence to assist forensic investigation
Datta et al. Robust multi layer audio steganography
CN111199746B (en) Information hiding method and hidden information extracting method
CN113571048A (en) Audio data detection method, device, equipment and readable storage medium
Wang et al. Tampering Detection Scheme for Speech Signals using Formant Enhancement based Watermarking.
Epple et al. Watermarking Training Data of Music Generation Models
Singh et al. Enhancement of LSB based steganography for hiding image in audio
Karnjana et al. Tampering detection in speech signals by semi-fragile watermarking based on singular-spectrum analysis
Dutta et al. Blind watermarking in audio signals using biometric features in wavelet domain
Ji et al. Speech watermarking with discrete intermediate representations
Singh et al. LOCKEY: A novel approach to model authentication and deepfake tracking
Tegendal Watermarking in audio using deep learning
Jain et al. Advancements in forensic voice analysis: Legal frameworks and technology integration
Jiang et al. Scanning dial: the instantaneous audio classification transformer

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant