WO2022102105A1 - Conversion device, conversion method, and conversion program - Google Patents
Conversion device, conversion method, and conversion program
- Publication number
- WO2022102105A1 (PCT/JP2020/042528)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- subjective evaluation
- evaluation value
- conversion
- audio signal
- voice
- Prior art date
Images
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/003—Changing voice quality, e.g. pitch or formants
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- G10L25/51—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
- G10L25/60—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for measuring the quality of voice signals
Definitions
- the present invention relates to a conversion device, a conversion method, and a conversion program.
- conventional voice conversion methods target explicit manipulation of parameters or the characteristics of the destination voice, so the converted voice is not always subjectively easy for the listener to hear.
- the present invention has been made in view of the above, and an object of the present invention is to provide a conversion device, a conversion method, and a conversion program capable of converting an input voice into a voice that is subjectively easy for the listener to hear.
- the conversion device is characterized by having an evaluation unit that estimates, from the input voice signal, which value is taken by a subjective evaluation value that quantifies how easily the content of the voice is conveyed as perceived by a person, and a conversion unit that converts the input voice signal, based on the subjective evaluation value estimated by the evaluation unit, so that the subjective evaluation value becomes a predetermined value.
- the input voice can be converted into a voice that is subjectively easy for the listener to hear.
- FIG. 1 is a diagram showing an example of the configuration of the conversion device according to the first embodiment.
- FIG. 2 is a flowchart showing a processing procedure of the conversion processing according to the first embodiment.
- FIG. 3 is a diagram showing an example of the configuration of the conversion device according to the second embodiment.
- FIG. 4 is a flowchart showing a processing procedure of the conversion processing according to the second embodiment.
- FIG. 5 is a diagram showing an example of a computer in which a conversion device is realized by executing a program.
- the conversion device according to the first embodiment converts a voice signal by utilizing the subjective evaluation tendency for voice.
- the conversion device according to the first embodiment converts the input voice, based on a subjective evaluation value that quantifies how easily the content of the voice is conveyed as perceived by a person, into a voice that is, for example, subjectively easier for the listener to hear.
- FIG. 1 is a diagram showing an example of the configuration of the conversion device according to the first embodiment.
- the conversion device 10 according to the first embodiment is realized, for example, by reading a predetermined program into a computer or the like including a ROM (Read Only Memory), a RAM (Random Access Memory), and a CPU (Central Processing Unit), and having the CPU execute the predetermined program.
- the conversion device 10 has an evaluation unit 11 and a conversion unit 12.
- a speaker's voice signal is input to the conversion device 10.
- the conversion device 10 converts the input audio signal into, for example, an audio signal that is subjectively easy for the listener to hear and outputs the signal.
- the evaluation unit 11 estimates, from the input audio signal, which value the subjective evaluation value takes.
- the subjective evaluation value quantifies how easily the content of the voice is conveyed as perceived by a person.
- the subjective evaluation value is, for example, a numerical value for items such as ease of understanding, naturalness of the voice, comprehensibility of the content, appropriateness of pauses, skill of speaking, or degree of impression.
- the subjective evaluation value is obtained, for example, by having one or more people rate the audio signal on an N-stage scale (for example, 5 stages) for each item, and the values rated for the plurality of subjective evaluation items are represented as a vector.
- when the subjective evaluation value is given by a plurality of people, for example, the value obtained by averaging their subjective evaluation values is used for each item.
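- as an illustration of this averaging, the following minimal sketch (not part of the patent text; the item names and the 5-stage scale are assumptions made for the example) builds the subjective evaluation vector from per-item ratings given by several raters.

```python
import numpy as np

# Assumed subjective evaluation items (illustrative; the patent lists these as examples).
ITEMS = ["ease_of_understanding", "naturalness", "clarity_of_content",
         "appropriateness_of_pauses", "speaking_skill", "impression"]

def subjective_evaluation_vector(ratings_per_rater):
    """Average N-stage ratings (e.g., 1-5) from one or more raters, item by item.

    ratings_per_rater: list of dicts, one per rater, mapping item name -> rating.
    Returns one averaged value per item as a vector.
    """
    matrix = np.array([[rater[item] for item in ITEMS] for rater in ratings_per_rater],
                      dtype=float)
    return matrix.mean(axis=0)

# Example: two raters scoring the same audio signal on a 5-stage scale.
ratings = [
    {"ease_of_understanding": 4, "naturalness": 3, "clarity_of_content": 4,
     "appropriateness_of_pauses": 2, "speaking_skill": 3, "impression": 4},
    {"ease_of_understanding": 5, "naturalness": 4, "clarity_of_content": 4,
     "appropriateness_of_pauses": 3, "speaking_skill": 3, "impression": 5},
]
print(subjective_evaluation_vector(ratings))  # -> [4.5 3.5 4.  2.5 3.  4.5]
```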
- the evaluation unit 11 extracts a feature amount from the input audio signal, and estimates a subjective evaluation value using an evaluation model based on the extracted feature amount.
- the evaluation model is a model in which the relationship between the feature amount of the audio signal for learning and the subjective evaluation value corresponding to the audio signal for learning is learned.
- the evaluation model learns the relationship between the feature amounts of a plurality of learning voice signals, to which a subjective evaluation value is given for each item, and those subjective evaluation values, for example, by a regression method using machine learning. The evaluation model thereby estimates the subjective evaluation value from the feature amount extracted from the input audio signal.
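- as a concrete illustration of such a regression-based evaluation model, the following is a minimal sketch (not part of the patent text); the use of PyTorch, the feature dimension, the number of evaluation items, and the network size are assumptions made for the example, and any regression method could be substituted.

```python
import torch
import torch.nn as nn

class EvaluationModel(nn.Module):
    """Regression model that maps a feature amount extracted from an audio signal
    to the subjective evaluation vector (one value per evaluation item).
    Dimensions and layer sizes are illustrative assumptions."""
    def __init__(self, feature_dim: int = 80, num_items: int = 6):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feature_dim, 128), nn.ReLU(),
            nn.Linear(128, 64), nn.ReLU(),
            nn.Linear(64, num_items),  # one regression output per subjective item
        )

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        return self.net(features)

def train_evaluation_model(model, features, scores, epochs=100, lr=1e-3):
    """features: (N, feature_dim) tensor of learning audio-signal feature amounts.
    scores:   (N, num_items) tensor of subjective evaluation values per item."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        optimizer.zero_grad()
        loss = loss_fn(model(features), scores)
        loss.backward()
        optimizer.step()
    return model

# Illustrative usage with random stand-in data (real use would extract feature
# amounts such as spectral features from the learning audio signals).
model = train_evaluation_model(EvaluationModel(), torch.randn(32, 80), torch.rand(32, 6) * 4 + 1)
estimated_evaluation = model(torch.randn(1, 80))  # subjective evaluation vector for one input
```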
- the conversion unit 12 converts the input audio signal, based on the subjective evaluation value estimated by the evaluation unit 11, so that the signal attains a predetermined subjective evaluation value. For example, the conversion unit 12 sets the upper limit of the subjective evaluation value in advance as the fixed predetermined value, and converts the input audio signal so that it reaches that upper limit.
- the conversion unit 12 extracts the feature amount from the input audio signal. Then, based on the extracted feature amount, the conversion unit 12 uses the conversion model to convert the input audio signal into an audio signal whose subjective evaluation value is the predetermined value.
- the conversion model is a model that has learned the conversion from the feature amount of the input audio signal to the feature amount of an audio signal whose subjective evaluation value is the predetermined value.
- the conversion unit 12 inputs the feature amount of the audio signal and its subjective evaluation value into the conversion model, and obtains as output the feature amount of an audio signal having the predetermined subjective evaluation value. Then, the conversion unit 12 converts the obtained feature amount back into an audio signal to obtain an audio signal having the predetermined subjective evaluation value.
- the conversion unit 12 outputs the acquired audio signal to the outside as an output of the conversion device 10.
- the learning of this conversion model will now be explained.
- a plurality of audio signals that speak the same content and subjective evaluation values corresponding to each audio signal are used as learning data.
- These learning data have the same audio content, but differ in subjective evaluation values (naturalness, comprehensibility, etc.).
- the learning data are, for example, feature amounts of audio signals to which subjective evaluation values of 1 to 5 are given.
- the conversion model learns, for example, to convert the feature amount of an audio signal based on the difference between a subjective evaluation value of 1 for the ease-of-understanding item (first subjective evaluation value) and a subjective evaluation value of 5 for that item (second subjective evaluation value).
- the feature amount of a voice signal with a poor subjective evaluation value (a voice signal to which the first subjective evaluation value is given) is used as the input of the conversion model, and the feature amount of a voice signal with a good subjective evaluation value (a voice signal to which the second subjective evaluation value, a value different from the first, is given) is used as the output. The input-output relationship is learned, for example, by machine learning, and the result is used as the conversion model.
- the subjective evaluation values of the audio signals on the output side and the input side are used as auxiliary inputs; specifically, the difference vector between the two is used as the auxiliary input.
- a conversion model in which the input-output relationship is associated with the subjective evaluation value (difference) is thereby obtained by learning.
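- the following is a minimal, self-contained sketch of one possible conversion model of this kind (not part of the patent text); the feed-forward PyTorch network, the dimensions, the optimizer settings, and the random stand-in data are assumptions for illustration, while the use of the difference vector between the output-side and input-side subjective evaluation values as the auxiliary input follows the description above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConversionModel(nn.Module):
    """Maps the feature amount of an input audio signal, together with the difference
    vector between the output-side (target) and input-side subjective evaluation
    values, to the feature amount of the converted audio signal.
    Dimensions are illustrative assumptions."""
    def __init__(self, feature_dim: int = 80, num_items: int = 6):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feature_dim + num_items, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, feature_dim),
        )

    def forward(self, features, eval_diff):
        # eval_diff = output-side subjective evaluation - input-side subjective evaluation
        return self.net(torch.cat([features, eval_diff], dim=-1))

def train_step(model, optimizer, feat_in, eval_in, feat_out, eval_out):
    """One training step on a pair of utterances with the same spoken content:
    feat_in / eval_in  : feature amount and subjective evaluation of the lower-rated signal
    feat_out / eval_out: feature amount and subjective evaluation of the higher-rated signal
    """
    optimizer.zero_grad()
    predicted = model(feat_in, eval_out - eval_in)  # difference vector as auxiliary input
    loss = F.mse_loss(predicted, feat_out)
    loss.backward()
    optimizer.step()
    return loss.item()

# Illustrative usage with random stand-in data.
model = ConversionModel()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
feat_in, feat_out = torch.randn(4, 80), torch.randn(4, 80)
eval_in = torch.full((4, 6), 1.0)   # e.g., items rated 1 (first subjective evaluation value)
eval_out = torch.full((4, 6), 5.0)  # e.g., items rated 5 (second subjective evaluation value)
train_step(model, optimizer, feat_in, eval_in, feat_out, eval_out)

# At conversion time, the evaluation estimated for the input signal and the
# predetermined (e.g., upper-limit) target evaluation give the difference vector:
converted_features = model(feat_in, eval_out - eval_in)
```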
- FIG. 2 is a flowchart showing a processing procedure of the conversion processing according to the first embodiment.
- the evaluation unit 11 performs processing for estimating which value the subjective evaluation value takes from the input audio signal (step S2), and outputs the subjective evaluation value for the input audio signal (step S3).
- the conversion unit 12 converts the input audio signal, based on the subjective evaluation value estimated by the evaluation unit 11, so that it attains the predetermined subjective evaluation value (step S4), and outputs the converted audio signal (step S5).
- as described above, in the first embodiment, which value the subjective evaluation value takes is estimated from the input audio signal, and the input audio signal is converted, based on the estimated subjective evaluation value, so that it attains the predetermined subjective evaluation value.
- the subjective evaluation value quantifies how easily the content of the voice is conveyed as perceived by a person, and is, for example, a stepwise evaluation of ease of understanding, naturalness of the voice, comprehensibility of the content, appropriateness of pauses, skill of speaking, or degree of impression.
- in the first embodiment, the input audio signal is converted, for example, so that the subjective evaluation value estimated for it as described above becomes the upper limit value. Therefore, according to the first embodiment, by utilizing not only the objective or physical characteristics of the audio signal but also the listener's subjective evaluation value, the input can be converted into a natural audio signal that is subjectively easy for the listener to hear.
- in the first embodiment, an evaluation model that estimates the subjective evaluation value of the input audio signal by learning the correspondence between audio signals and subjective evaluation values is used, together with a conversion model that converts the input audio signal into an audio signal having the predetermined subjective evaluation value by learning from a plurality of audio signals and the subjective evaluation value of each audio signal. Therefore, in the first embodiment, by utilizing the correspondence between audio signals and subjective evaluation values for both the evaluation and the conversion of the audio signal, the input audio signal can be appropriately converted, according to its characteristics, into an audio signal that is subjectively easy for the listener to hear.
- FIG. 3 is a diagram showing an example of the configuration of the conversion device according to the second embodiment.
- as shown in FIG. 3, the conversion device 210 according to the second embodiment has a conversion unit 212 instead of the conversion unit 12 shown in FIG. 1. Further, the conversion device 210 accepts the input of the voice signal to be converted and also receives the input of the evaluation information of the target voice.
- the evaluation information of the target voice is a subjective evaluation value that is the target of the converted voice signal.
- a target value is set for each item.
- whereas the predetermined subjective evaluation value was a fixed value in the first embodiment, in the second embodiment it is a target subjective evaluation value input from the outside.
- the conversion unit 212 converts the input audio signal so that the subjective evaluation value estimated by the evaluation unit 11 becomes the target subjective evaluation value.
- the conversion unit 212 converts the input audio signal so that it has a target subjective evaluation value input from the outside (for example, by a listener or a speaker). This target subjective evaluation value may be input as evaluation information of the target voice indicating how much the speaker himself or herself wants to improve his or her voice.
- the conversion unit 212 extracts the feature amount from the input audio signal. Then, based on the extracted feature amount, the conversion unit 212 uses the conversion model to convert the input audio signal into an audio signal having the target subjective evaluation value.
- the conversion model is a model that has learned the conversion from the feature amount of the input audio signal to the feature amount of an audio signal having the target subjective evaluation value.
- the conversion unit 212 inputs the feature amount of the audio signal and its subjective evaluation value into the conversion model, and obtains as output the feature amount of a converted audio signal having the target subjective evaluation value.
- the conversion unit 212 then converts the obtained feature amount back into an audio signal to obtain an audio signal having the target subjective evaluation value.
- the conversion unit 212 outputs the acquired audio signal to the outside as an output of the conversion device 210.
- the learning of the conversion model may be performed in the same manner as in the first embodiment.
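- as an illustration of how such externally supplied evaluation information might be handled, the sketch below (not part of the patent text; item names and values are assumptions) builds the target subjective evaluation vector from per-item targets given by a listener or speaker and forms the difference vector against the evaluation estimated by the evaluation unit, which could then serve as the auxiliary input of the conversion model.

```python
import numpy as np

# Assumed subjective evaluation items (illustrative).
ITEMS = ["ease_of_understanding", "naturalness", "clarity_of_content",
         "appropriateness_of_pauses", "speaking_skill", "impression"]

def target_evaluation_vector(targets, estimated):
    """Build the target subjective evaluation vector for the second embodiment.

    targets:   dict mapping item name -> desired stage (e.g., 1-5) supplied externally;
               items that are not specified keep their estimated value.
    estimated: per-item subjective evaluation values estimated by the evaluation unit.
    Returns (target_vector, difference_vector).
    """
    estimated = np.asarray(estimated, dtype=float)
    target = np.array([targets.get(item, value) for item, value in zip(ITEMS, estimated)],
                      dtype=float)
    return target, target - estimated

# Example: the speaker asks only for more natural-sounding and better-paced speech.
estimated = [3.0, 2.5, 3.5, 2.0, 3.0, 3.0]          # output of the evaluation unit
wanted = {"naturalness": 4, "appropriateness_of_pauses": 4}
target, diff = target_evaluation_vector(wanted, estimated)
print(target)  # [3.  4.  3.5 4.  3.  3. ]
print(diff)    # [0.  1.5 0.  2.  0.  0. ]
```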
- FIG. 4 is a flowchart showing a processing procedure of the conversion processing according to the second embodiment.
- Steps S11 to S13 shown in FIG. 4 are the same processes as steps S1 to S3 shown in FIG. 2.
- the conversion device 210 receives the input of the evaluation information of the target voice (step S14).
- based on the subjective evaluation value estimated by the evaluation unit 11, the conversion device 210 converts the input audio signal so that it attains the target subjective evaluation value indicated by the evaluation information of the target voice (step S15), and outputs the converted audio signal (step S16).
- the input audio signal is converted so that the subjective evaluation value of the audio signal estimated by the evaluation unit 11 becomes the target subjective evaluation value.
- in the first embodiment, an example was described in which the subjective evaluation value after conversion is fixed. If the subjective evaluation value after conversion is fixed in that way, it may not be possible to handle flexible and complex conversion suited to various situations and listeners.
- in the second embodiment, by explicitly inputting the target subjective evaluation value, it is possible to flexibly handle even a complex case where the desired voice is set at a different stage for each item, and to convert the input into an audio signal that suits the listener.
- each component of each illustrated device is functional and conceptual, and does not necessarily have to be physically configured as illustrated. That is, the specific form of distribution and integration of each device is not limited to the illustrated one, and all or part of each device may be functionally or physically distributed or integrated in arbitrary units according to various loads, usage conditions, and the like.
- the conversion devices 10 and 210 may be an integrated device.
- each processing function performed by each device may be realized by a CPU and a program analyzed and executed by the CPU, or may be realized as hardware by wired logic.
- all or part of the processes described as being performed automatically can be performed manually, and all or part of the processes described as being performed manually can be performed automatically by a known method.
- the processes described in the present embodiments need not be executed only in chronological order according to the order of description, and may also be executed in parallel or individually depending on the processing capacity of the device that executes them or as needed.
- the processing procedure, control procedure, specific name, and information including various data and parameters shown in the above document and drawings can be arbitrarily changed unless otherwise specified.
- FIG. 5 is a diagram showing an example of a computer in which the conversion devices 10 and 210 are realized by executing the program.
- the computer 1000 has, for example, a memory 1010 and a CPU 1020.
- the computer 1000 also has a hard disk drive interface 1030, a disk drive interface 1040, a serial port interface 1050, a video adapter 1060, and a network interface 1070. Each of these parts is connected by a bus 1080.
- Memory 1010 includes ROM 1011 and RAM 1012.
- the ROM 1011 stores, for example, a boot program such as a BIOS (Basic Input Output System).
- the hard disk drive interface 1030 is connected to the hard disk drive 1031.
- the disk drive interface 1040 is connected to the disk drive 1041.
- a removable storage medium such as a magnetic disk or an optical disk is inserted into the disk drive 1041.
- the serial port interface 1050 is connected to, for example, a mouse 1110 and a keyboard 1120.
- the video adapter 1060 is connected to, for example, the display 1130.
- the hard disk drive 1031 stores, for example, the OS 1091, the application program 1092, the program module 1093, and the program data 1094. That is, the program that defines each process of the conversion devices 10 and 210 is implemented as a program module 1093 in which a code that can be executed by the computer 1000 is described.
- the program module 1093 is stored in, for example, the hard disk drive 1031.
- the program module 1093 for executing the same processing as the functional configuration in the conversion devices 10 and 210 is stored in the hard disk drive 1031.
- the hard disk drive 1031 may be replaced by an SSD (Solid State Drive).
- the setting data used in the processing of the above-described embodiment is stored as program data 1094 in, for example, the memory 1010 or the hard disk drive 1031. Then, the CPU 1020 reads the program module 1093 and the program data 1094 stored in the memory 1010 and the hard disk drive 1031 into the RAM 1012 and executes them as needed.
- the program module 1093 and the program data 1094 are not limited to those stored in the hard disk drive 1031 but may be stored in, for example, a removable storage medium and read by the CPU 1020 via the disk drive 1041 or the like. Alternatively, the program module 1093 and the program data 1094 may be stored in another computer connected via a network (LAN (Local Area Network), WAN (Wide Area Network), etc.). Then, the program module 1093 and the program data 1094 may be read from another computer by the CPU 1020 via the network interface 1070. Further, the processing of the neural network used in the conversion devices 10, 210 and the learning devices 20, 220, 320, 420 may be executed by using the GPU.
Landscapes
- Engineering & Computer Science (AREA)
- Quality & Reliability (AREA)
- Computational Linguistics (AREA)
- Signal Processing (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- User Interface Of Digital Computer (AREA)
- Telephonic Communication Services (AREA)
Abstract
Description
[Embodiment 1]
[Conversion device]
First, the conversion device according to the first embodiment will be described. The conversion device according to the first embodiment converts a voice signal by utilizing subjective evaluation tendencies for voice. The conversion device according to the first embodiment converts the input voice, based on a subjective evaluation value that quantifies how easily the content of the voice is conveyed as perceived by a person, into a voice that is, for example, subjectively easier for the listener to hear.
FIG. 1 is a diagram showing an example of the configuration of the conversion device according to the first embodiment. The conversion device 10 according to the first embodiment is realized, for example, by reading a predetermined program into a computer or the like including a ROM (Read Only Memory), a RAM (Random Access Memory), and a CPU (Central Processing Unit), and having the CPU execute the predetermined program.
[Conversion processing procedure]
Next, the conversion processing in the conversion device 10 will be described. FIG. 2 is a flowchart showing the processing procedure of the conversion processing according to the first embodiment.
[Effect of Embodiment 1]
As described above, in the first embodiment, which value the subjective evaluation value takes is estimated from the input audio signal, and the input audio signal is converted, based on the estimated subjective evaluation value, so that it attains a predetermined subjective evaluation value. The subjective evaluation value quantifies how easily the content of the voice is conveyed as perceived by a person, and is, for example, a stepwise evaluation of ease of understanding, naturalness of the voice, comprehensibility of the content, appropriateness of pauses, skill of speaking, or degree of impression.
[Embodiment 2]
Next, the second embodiment will be described. FIG. 3 is a diagram showing an example of the configuration of the conversion device according to the second embodiment.
[Conversion processing procedure]
Next, the conversion processing in the conversion device 210 will be described. FIG. 4 is a flowchart showing the processing procedure of the conversion processing according to the second embodiment.
[Effect of Embodiment 2]
As described above, in the second embodiment, the input audio signal is converted so that the subjective evaluation value of the audio signal estimated by the evaluation unit 11 becomes the target subjective evaluation value. In the first embodiment, an example was described in which the subjective evaluation value after conversion is fixed. If the subjective evaluation value after conversion is fixed in that way, it may not be possible to handle flexible and complex conversion suited to various situations and listeners.
[System configuration, etc.]
Each component of each illustrated device is functional and conceptual, and does not necessarily have to be physically configured as illustrated. That is, the specific form of distribution and integration of each device is not limited to the illustrated one, and all or part of each device may be functionally or physically distributed or integrated in arbitrary units according to various loads, usage conditions, and the like. For example, the conversion devices 10 and 210 may be an integrated device. Further, all or any part of each processing function performed by each device may be realized by a CPU and a program analyzed and executed by the CPU, or may be realized as hardware by wired logic.
[Program]
FIG. 5 is a diagram showing an example of a computer in which the conversion devices 10 and 210 are realized by executing a program. The computer 1000 has, for example, a memory 1010 and a CPU 1020. The computer 1000 also has a hard disk drive interface 1030, a disk drive interface 1040, a serial port interface 1050, a video adapter 1060, and a network interface 1070. These parts are connected by a bus 1080.
10, 210 Conversion device
11 Evaluation unit
12, 212 Conversion unit
Claims (7)
- 1. A conversion device comprising: an evaluation unit that estimates, from an input audio signal, which value is taken by a subjective evaluation value that quantifies how easily the content of the voice is conveyed as perceived by a person; and a conversion unit that converts the input audio signal, based on the subjective evaluation value estimated by the evaluation unit, so that the subjective evaluation value becomes a predetermined value.
- 2. The conversion device according to claim 1, wherein the evaluation unit estimates subjective evaluation information from a feature amount of the input audio signal by using an evaluation model that has learned the relationship between the feature amount of a learning audio signal and the subjective evaluation value corresponding to the learning audio signal.
- 3. The conversion device according to claim 1 or 2, wherein the conversion unit converts the input audio signal into an audio signal having a predetermined subjective evaluation value by using a conversion model that has learned, based on a learning audio signal to which a first subjective evaluation value is given and a learning audio signal to which a second subjective evaluation value different from the first subjective evaluation value is given, the conversion of the feature amount of an audio signal according to the difference between the first subjective evaluation value and the second subjective evaluation value.
- 4. The conversion device according to any one of claims 1 to 3, wherein the input audio signal is converted so that the subjective evaluation value estimated by the evaluation unit becomes a target subjective evaluation value.
- 5. The conversion device according to any one of claims 1 to 4, wherein the subjective evaluation value numerically indicates ease of understanding, naturalness of the voice, comprehensibility of the content, appropriateness of pauses, skill of speaking, or degree of impression.
- 6. A conversion method executed by a conversion device, the conversion method comprising: a step of estimating, from an input audio signal, which value is taken by a subjective evaluation value that quantifies how easily the content of the voice is conveyed as perceived by a person; and a step of converting the input audio signal, based on the subjective evaluation value estimated in the estimating step, so that the subjective evaluation value becomes a predetermined value.
- 7. A conversion program for causing a computer to execute: a step of estimating, from an input audio signal, which value is taken by a subjective evaluation value that quantifies how easily the content of the voice is conveyed as perceived by a person; and a step of converting the input audio signal, based on the subjective evaluation value estimated in the estimating step, so that the subjective evaluation value becomes a predetermined value.
Priority Applications (3)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/JP2020/042528 WO2022102105A1 (en) | 2020-11-13 | 2020-11-13 | Conversion device, conversion method, and conversion program |
US18/036,598 US20240013798A1 (en) | 2020-11-13 | 2020-11-13 | Conversion device, conversion method, and conversion program |
JP2022561229A JPWO2022102105A1 (en) | 2020-11-13 | 2020-11-13 |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/JP2020/042528 WO2022102105A1 (en) | 2020-11-13 | 2020-11-13 | Conversion device, conversion method, and conversion program |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2022102105A1 true WO2022102105A1 (en) | 2022-05-19 |
Family
ID=81602153
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/JP2020/042528 WO2022102105A1 (en) | 2020-11-13 | 2020-11-13 | Conversion device, conversion method, and conversion program |
Country Status (3)
Country | Link |
---|---|
US (1) | US20240013798A1 (en) |
JP (1) | JPWO2022102105A1 (en) |
WO (1) | WO2022102105A1 (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2025032632A1 (en) * | 2023-08-04 | 2025-02-13 | 日本電信電話株式会社 | Voice subjective evaluation value estimation device and voice subjective evaluation value estimation method |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JPH02223983A (en) * | 1989-02-27 | 1990-09-06 | Toshiba Corp | Presentation support system |
JPH05197390A (en) * | 1992-01-20 | 1993-08-06 | Seiko Epson Corp | Speech recognition device |
JP2008256942A (en) * | 2007-04-04 | 2008-10-23 | Toshiba Corp | Data comparison apparatus of speech synthesis database and data comparison method of speech synthesis database |
US20110295604A1 (en) * | 2001-11-19 | 2011-12-01 | At&T Intellectual Property Ii, L.P. | System and method for automatic verification of the understandability of speech |
JP2015197621A (en) * | 2014-04-02 | 2015-11-09 | 日本電信電話株式会社 | Speaking manner evaluation device, speaking manner evaluation method, and program |
WO2016039465A1 (en) * | 2014-09-12 | 2016-03-17 | ヤマハ株式会社 | Acoustic analysis device |
US20180012613A1 (en) * | 2016-07-11 | 2018-01-11 | The Chinese University Of Hong Kong | Phonetic posteriorgrams for many-to-one voice conversion |
Also Published As
Publication number | Publication date |
---|---|
US20240013798A1 (en) | 2024-01-11 |
JPWO2022102105A1 (en) | 2022-05-19 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US10468024B2 (en) | Information processing method and non-temporary storage medium for system to control at least one device through dialog with user | |
Altonji et al. | Cross section and panel data estimators for nonseparable models with endogenous regressors | |
WO2021005891A1 (en) | System, method, and program | |
Alamdari et al. | Personalization of hearing aid compression by human-in-the-loop deep reinforcement learning | |
WO2022102105A1 (en) | Conversion device, conversion method, and conversion program | |
KR102175490B1 (en) | Method and apparatus for measuring depression | |
US20190131948A1 (en) | Audio loudness control method and system based on signal analysis and deep learning | |
JP7054607B2 (en) | Generator, generation method and generation program | |
US10269349B2 (en) | Voice interactive device and voice interaction method | |
CN113192603A (en) | Mental state assessment method and system based on big data | |
Dingemanse et al. | The relation of hearing-specific patient-reported outcome measures with speech perception measures and acceptable noise levels in cochlear implant users | |
WO2019058479A1 (en) | Knowledge acquisition device, knowledge acquisition method, and recording medium | |
CN108170452A (en) | The growing method of robot | |
KR100925828B1 (en) | Quantitative Derivation Method and Sound Apparatus for Sound Quality of Vehicle Sounds | |
CN117672243A (en) | Processing method and device for conversation of seat personnel and electronic equipment | |
JP7347794B2 (en) | Interactive information acquisition device, interactive information acquisition method, and program | |
JP7459931B2 (en) | Stress management device, stress management method, and program | |
CN112652325B (en) | Remote voice adjustment method based on artificial intelligence and related equipment | |
JP7186207B2 (en) | Information processing device, information processing program and information processing system | |
JP2019109789A (en) | Provision device, provision method, and provision program | |
JP2022114906A (en) | psychological state management device | |
KR102712474B1 (en) | The Method and System That Automatically Update Nursing Chatbot Dataset | |
JP7561101B2 (en) | Information processing device, information processing method, and information processing program | |
JP7164793B1 (en) | Speech processing system, speech processing device and speech processing method | |
US20230141724A1 (en) | Emotion recognition system and emotion recognition method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 20961633 Country of ref document: EP Kind code of ref document: A1 |
|
ENP | Entry into the national phase |
Ref document number: 2022561229 Country of ref document: JP Kind code of ref document: A |
|
WWE | Wipo information: entry into national phase |
Ref document number: 18036598 Country of ref document: US |
|
NENP | Non-entry into the national phase |
Ref country code: DE |
|
122 | Ep: pct application non-entry in european phase |
Ref document number: 20961633 Country of ref document: EP Kind code of ref document: A1 |